OpenAI Launches Public Beta of Realtime API for Low-Latency Speech Interactions
Oct 14, 2024
OpenAI launched the public beta of the Realtime API, offering developers the ability to create low-latency, multimodal voice interactions within their applications. Additionally, audio input/output is now available in the Chat Completions API, expanding options for voice-driven applications. Early feedback highlights limited voice options and response cutoffs, similar to ChatGPT’s Advanced Voice Mode.
The Realtime API enables real-time, natural speech-to-speech interactions using six preset voices, combining speech recognition and synthesis into a single API call. This simplifies the development of fluid conversational applications by streamlining what previously required multiple models.
OpenAI has also extended the capabilities of its Chat Completions API by adding support for audio input and output. This feature is geared towards use cases that do not require the low-latency performance of the Realtime API, allowing developers to send either text or audio inputs and receive responses in text, audio, or both.
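Since the audio-capable Chat Completions API had not yet shipped at the time of the announcement, the following Python sketch is only an indication of what such a request might look like; the model name (gpt-4o-audio-preview) and the modalities/audio parameter shapes are assumptions based on OpenAI's preview material and could differ at release.

import base64
from openai import OpenAI  # pip install openai

client = OpenAI()

# Base64-encode a short voice recording for the request body.
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",        # assumed preview model name
    modalities=["text", "audio"],        # request both a transcript and spoken audio
    audio={"voice": "alloy", "format": "wav"},
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Answer the question in this recording."},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)

# The reply carries a text transcript plus base64-encoded audio.
print(completion.choices[0].message.audio.transcript)
with open("answer.wav", "wb") as out:
    out.write(base64.b64decode(completion.choices[0].message.audio.data))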
In the past, creating voice assistant experiences required using multiple models for different tasks, such as automatic speech recognition, text inference, and text-to-speech. This often resulted in delays and lost nuance. The Realtime API addresses these issues by streamlining the entire process into a single API call, offering faster and more natural conversational capabilities.
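For comparison, the traditional chained approach looks roughly like the sketch below, wiring together OpenAI's existing speech-to-text, chat, and text-to-speech endpoints (file names and prompts are illustrative):

from openai import OpenAI  # pip install openai

client = OpenAI()

# 1. Automatic speech recognition: transcribe the user's audio with Whisper.
with open("user_question.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# 2. Text inference: generate a reply with a chat model.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Text-to-speech: synthesize the reply back into audio.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
with open("assistant_reply.mp3", "wb") as out:
    out.write(speech.content)

# Each hop adds latency, and paralinguistic cues such as tone and emphasis
# are lost when the audio is flattened into text between steps.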
The Realtime API is powered by a persistent WebSocket connection, allowing for continuous message exchange with GPT-4o. It also supports function calling, enabling voice assistants to perform tasks such as placing orders or retrieving relevant user data to provide more personalized responses.
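A minimal Python sketch of that WebSocket flow is shown below. The endpoint URL, beta header, and event names follow OpenAI's launch documentation, but the model snapshot name is an assumption and the event handling is heavily simplified.

import asyncio
import json
import os
import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main():
    # On newer versions of the websockets library this keyword is additional_headers.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Ask the model to generate a spoken (and textual) response.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["text", "audio"],
                "instructions": "Greet the user and ask how you can help.",
            },
        }))
        # The server streams events back (text deltas, audio chunks, and so on)
        # over the same persistent connection.
        async for message in ws:
            event = json.loads(message)
            print(event["type"])
            if event["type"] == "response.done":
                break

asyncio.run(main())

In a full voice assistant, the client would also stream microphone audio into the session and register tools so the model can trigger function calls such as placing an order or looking up user data.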
Additionally, the community observed that although the API is accessible through the Playground, the available voice options are currently limited to alloy, echo, and shimmer. During testing, users noticed that the responses were subject to the same limitations as ChatGPT’s Advanced Voice Mode. Despite attempts to use detailed system messages, the responses were still cut off, hinting at the involvement of a separate model managing the flow of conversations.
The Realtime API is available in public beta for all paid developers. Audio in the Chat Completions API will be released in the coming weeks. Pricing for the Realtime API includes both text and audio tokens, with audio input priced at approximately 0ドル.06 per minute and audio output at 0ドル.24 per minute.
There have been concerns about how this pricing could affect long-duration interactions. Some developers pointed out that because each new response requires the model to reprocess the prior exchanges, costs can accumulate quickly. This has raised questions about cost-effectiveness for extended conversations, since large language models do not retain short-term memory and must re-ingest prior content with every turn.
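For a rough sense of scale, the back-of-the-envelope calculation below uses only the per-minute rates from the announcement; the conversation shape, and the assumption that prior turns are re-billed as input, are illustrative of the concern rather than confirmed billing behavior.

# Rough cost estimate for one Realtime API session (rates from the announcement,
# conversation shape hypothetical).
AUDIO_INPUT_PER_MIN = 0.06   # USD, approximate
AUDIO_OUTPUT_PER_MIN = 0.24  # USD, approximate

user_minutes = 5.0       # total minutes of user speech
assistant_minutes = 5.0  # total minutes of assistant speech

base_cost = user_minutes * AUDIO_INPUT_PER_MIN + assistant_minutes * AUDIO_OUTPUT_PER_MIN
print(f"Audio cost, ignoring context reprocessing: ${base_cost:.2f}")

# If each new response re-sends earlier turns as input, as some developers
# suspected, input costs grow roughly quadratically with the number of turns.
turns = 10
minutes_per_turn = user_minutes / turns
reprocessed_minutes = sum(i * minutes_per_turn for i in range(turns))
print(f"Extra input from reprocessing: {reprocessed_minutes:.1f} min "
      f"(about ${reprocessed_minutes * AUDIO_INPUT_PER_MIN:.2f})")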
Developers can explore the Realtime API by checking the official documentation and the reference client.