OpenAI launches new real-time audio models via APIs for developers working on voice agents, live translation, and transcription.
OpenAI has announced three new real-time audio models, available to developers through its API, for building voice-based apps and agents: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. They support more natural voice interactions, live translation, and low-latency speech-to-text transcription, respectively.
GPT-Realtime-2 is the most important model in this launch. It is built for live voice interactions where the model can reason through requests, call tools, handle corrections, and continue the conversation naturally. GPT-Realtime-2 includes the following new capabilities for voice agents:
- Preambles: The model can say short phrases such as “let me check that” before completing a task.
- Parallel tool calls: It can call multiple tools at once while keeping the user informed.
- Better recovery: It can respond more gracefully when something goes wrong instead of failing silently.
- Longer context: OpenAI has increased the context window from 32K to 128K.
- Improved domain understanding: The model is better at retaining specialized terms, proper nouns, and healthcare-related vocabulary.
- Tone control: It can adjust its speaking style depending on the situation.
- Adjustable reasoning effort: Developers can choose between minimal, low, medium, high, and xhigh reasoning levels.
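As a rough illustration of how a developer might select one of the capabilities above, the sketch below builds a session-configuration payload. The event shape is an assumption modeled loosely on OpenAI's existing Realtime API; the `gpt-realtime-2` model identifier and the `reasoning_effort` field are hypothetical names inferred from this article.

```python
import json

# The five reasoning levels the article lists for GPT-Realtime-2.
REASONING_LEVELS = ("minimal", "low", "medium", "high", "xhigh")

def build_session_update(reasoning_effort: str = "medium") -> dict:
    """Build a session.update payload for a voice-agent session (sketch only;
    the real event schema for GPT-Realtime-2 may differ)."""
    if reasoning_effort not in REASONING_LEVELS:
        raise ValueError(f"unknown reasoning effort: {reasoning_effort}")
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",             # assumed model identifier
            "modalities": ["audio", "text"],
            "reasoning_effort": reasoning_effort,  # hypothetical parameter
        },
    }

payload = build_session_update("high")
print(json.dumps(payload, indent=2))
```

Validating the level client-side keeps a typo like `"turbo"` from silently falling back to a default on the server.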
The improvements show up in benchmarks. GPT-Realtime-2 with high reasoning scored 96.6% on Big Bench Audio, up from 81.4% for GPT-Realtime-1.5, and with xhigh reasoning it scored 48.5% on Audio MultiChallenge instruction following, up from 34.7%.
The new GPT-Realtime-Translate model is designed for live multilingual voice experiences. It can translate speech from more than 70 input languages into 13 output languages. OpenAI claims that this model can preserve meaning while keeping pace with the speaker, even when users switch contexts, use regional pronunciations, or speak with domain-specific vocabulary.
The new GPT-Realtime-Whisper is a streaming transcription model built for low-latency speech-to-text. It transcribes audio while someone is speaking, which can be useful for live captions, meeting notes, classroom transcripts, and more.
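A streaming transcriber like this typically emits partial-text events while the speaker is still talking, then a completion event per utterance. The consumer-side sketch below shows the general pattern; the event names (`transcript.delta`, `transcript.done`) are assumptions, not GPT-Realtime-Whisper's documented schema.

```python
# Fold streaming transcription events into finished caption lines.
# Event names here are illustrative; the real GPT-Realtime-Whisper
# event schema may differ.
def accumulate_transcript(events):
    """Collect delta events into caption lines, one per completed utterance."""
    lines, current = [], []
    for event in events:
        if event["type"] == "transcript.delta":
            current.append(event["delta"])      # partial text, shown live
        elif event["type"] == "transcript.done":
            lines.append("".join(current))      # finalize the caption line
            current = []
    return lines

events = [
    {"type": "transcript.delta", "delta": "Hello, "},
    {"type": "transcript.delta", "delta": "everyone."},
    {"type": "transcript.done"},
]
print(accumulate_transcript(events))  # → ['Hello, everyone.']
```

For live captions, an app would render the in-progress `current` buffer immediately and replace it with the finalized line when the done event arrives.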
All three models are now available through the Realtime API. GPT-Realtime-2 costs $32 per 1M audio input tokens, $0.40 per 1M cached input tokens, and $64 per 1M audio output tokens. GPT-Realtime-Translate costs $0.034 per minute, while GPT-Realtime-Whisper costs $0.017 per minute. Developers can try out the new real-time voice models in the Playground. For general consumers, OpenAI is still working on upgrading the voice experience in ChatGPT.
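To put the pricing in perspective, here is a back-of-the-envelope cost calculator using the per-token and per-minute prices quoted above; the token counts in the example are illustrative, not measured usage.

```python
# Prices as listed in this article.
PRICE_AUDIO_IN = 32.00 / 1_000_000    # GPT-Realtime-2, per audio input token
PRICE_CACHED_IN = 0.40 / 1_000_000    # per cached input token
PRICE_AUDIO_OUT = 64.00 / 1_000_000   # per audio output token
PRICE_TRANSLATE_MIN = 0.034           # GPT-Realtime-Translate, per minute
PRICE_WHISPER_MIN = 0.017             # GPT-Realtime-Whisper, per minute

def realtime2_cost(input_tokens, cached_tokens, output_tokens):
    """Estimated GPT-Realtime-2 session cost in dollars."""
    return (input_tokens * PRICE_AUDIO_IN
            + cached_tokens * PRICE_CACHED_IN
            + output_tokens * PRICE_AUDIO_OUT)

# One hour of live translation:
print(f"Translate, 60 min: ${60 * PRICE_TRANSLATE_MIN:.2f}")   # $2.04
# One hour of transcription:
print(f"Whisper, 60 min:   ${60 * PRICE_WHISPER_MIN:.2f}")     # $1.02
# A voice-agent session using 100K audio input and 50K output tokens:
print(f"GPT-Realtime-2:    ${realtime2_cost(100_000, 0, 50_000):.2f}")  # $6.40
```

The asymmetry is worth noting: output audio tokens cost twice as much as input, so long model responses dominate the bill for chatty agents.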

