
"OpenAI has released gpt-realtime, its most advanced speech-to-speech model, alongside the general availability of the Realtime API. The updates aim to reduce latency, improve speech quality, and give developers stronger tools, such as MCP server support, image input, and Session Initiation Protocol (SIP) phone calling support, for building production-ready AI voice agents. Together, the Realtime API and gpt-realtime are designed to handle end-to-end speech processing within a single system, rather than chaining together separate speech-to-text and text-to-speech models."
"This architecture cuts response times while preserving nuance in delivery, a critical improvement for real-time agents where even small delays can break conversational flow. gpt-realtime was trained to produce higher-quality speech with more natural pacing and intonation, and to respond reliably to style instructions such as 'speak empathetically' or 'use a professional tone.' Two new synthetic voices, Cedar and Marin, are available, and existing voices have been updated for greater realism."
"On comprehension benchmarks, gpt-realtime shows measurable improvements. It can track non-verbal cues, switch languages within a single sentence, and more accurately process alphanumeric sequences (such as phone numbers and VINs) across languages, including Spanish, Chinese, Japanese, and French. Internal testing highlights this jump, with gpt-realtime reaching 82.8% accuracy on Big Bench Audio compared to 65.6% for the previous model. Instruction-following is also sharper, with MultiChallenge audio benchmark scores rising from 20.6% to 30.5%."
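To put the quoted benchmark figures in perspective, the short Python sketch below computes the absolute gain (in percentage points) and the relative gain for each benchmark. The four scores come from the article; the `scores` dictionary and the `gains` helper are illustrative names, not part of any OpenAI tooling.

```python
# Benchmark figures as quoted in the article (percent accuracy):
# (previous model, gpt-realtime)
scores = {
    "Big Bench Audio": (65.6, 82.8),
    "MultiChallenge audio": (20.6, 30.5),
}


def gains(previous: float, current: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = current - previous
    relative = absolute / previous * 100
    return round(absolute, 1), round(relative, 1)


for name, (previous, current) in scores.items():
    points, percent = gains(previous, current)
    print(f"{name}: +{points} points ({percent}% relative improvement)")
```

On these numbers, Big Bench Audio rises by 17.2 points (about 26% relative), while the MultiChallenge audio score rises by 9.9 points (about 48% relative), so the smaller absolute gain is the larger proportional one.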
gpt-realtime and the Realtime API provide an end-to-end speech processing system that reduces latency and preserves conversational nuance. The model generates higher-quality, more natural-sounding speech with improved pacing and intonation, and it follows style instructions such as empathetic or professional tones more reliably. Two new synthetic voices, Cedar and Marin, complement updates to existing voices. Comprehension improved across languages and non-verbal cues, with better alphanumeric handling and higher benchmark scores: 82.8% on Big Bench Audio (up from 65.6%) and 30.5% on the MultiChallenge audio benchmark (up from 20.6%). Function calling accuracy and asynchronous function support were enhanced to enable reliable invocation and argument passing. Developers gain production-ready tools including MCP server support, image input, and SIP phone calling.
Read at InfoQ