Overview
- OpenAI made its Realtime API generally available and released gpt‑realtime, a speech‑to‑speech model that can switch languages mid‑sentence and is positioned as its most advanced production voice model.
- The API now supports the Model Context Protocol (MCP) for tool access, image inputs for on‑the‑fly visual understanding, and the Session Initiation Protocol (SIP) for direct phone connectivity to contact centers.
- OpenAI reports stronger instruction following, expanded function calling, more natural and expressive speech, and recognition of non‑verbal cues, and cites benchmark scores of 82.8% on Big Bench Audio and 30.5% on MultiChallenge.
- Two API‑only voices, Cedar and Marin, are available, and pricing for gpt‑realtime is reduced by 20% to $32 per million audio input tokens and $64 per million audio output tokens.
- Customer demos highlighted enterprise use cases from T‑Mobile and Zillow, while the offering enters a crowded field that includes ElevenLabs, SoundHound, Hume, Mistral, and Google.
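
To make the pricing figures above concrete, here is a minimal sketch of the arithmetic. It uses only the per‑million‑token rates and the 20% reduction quoted in this summary; the function names and the sample token counts are hypothetical, and the "implied previous price" is simply what the stated 20% cut works out to, not a figure taken from OpenAI's price list.

```python
from decimal import Decimal

# Rates quoted in the summary, in USD per one million audio tokens.
INPUT_PRICE_PER_M = Decimal("32")
OUTPUT_PRICE_PER_M = Decimal("64")
ONE_MILLION = Decimal(1_000_000)


def audio_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate audio token cost in USD at the quoted rates (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (Decimal(input_tokens) * INPUT_PRICE_PER_M
             + Decimal(output_tokens) * OUTPUT_PRICE_PER_M)
    return total / ONE_MILLION


def implied_previous_price(new_price: Decimal,
                           cut: Decimal = Decimal("0.20")) -> Decimal:
    """Back out the price before a fractional cut: new = old * (1 - cut)."""
    return new_price / (Decimal(1) - cut)


if __name__ == "__main__":
    # Hypothetical session: 50,000 audio tokens in, 20,000 audio tokens out.
    print(audio_cost_usd(50_000, 20_000))  # 2.88
```

Under these assumptions, a 20% cut landing on $32 and $64 implies earlier rates of $40 and $80 per million tokens. Text tokens, cached input, and image inputs are not covered by the figures in this summary and are left out of the sketch.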