OpenAI is adding a voice mode that talks and listens at once

ChatGPT’s text capabilities have raced ahead of its voice. Type a question and you get an answer on par with GPT-5.5. Ask the same question out loud and you notice the gap: older audio stack, clunkier pacing, and a model that freezes the moment you say “mm-hm.” OpenAI appears to be closing that gap. A new model spotted in the ChatGPT app carries the codename GPT-Bidi-1, and it is built to listen and speak simultaneously. If it delivers on the name, this is the upgrade that turns AI on the phone from a novelty into a genuinely useful channel.

What exactly turned up?

A new voice model called gpt-bidi-1 was spotted inside the ChatGPT app, on both web and mobile, along with announcement text describing “the next generation of Voice” and a “major leap in intelligence.” The discovery was shared on June 16 by X users @M1Astra and @chetaslua and reported in detail by TestingCatalog.

OpenAI has not confirmed anything publicly, and the codename may change before launch. Still, companies rarely leave model labels in production apps by accident, and the label appeared on web and mobile at the same time, which suggests a consumer rollout may be near.

What does “bidirectional” actually mean?

Bidirectional means the model listens and speaks at the same time, rather than taking strict turns. Current voice AI works like a walkie-talkie: one side talks, then the other. While you speak, it waits; while it responds, you wait.

A real phone conversation does not work that way. You say “mm-hm” while the other person is mid-sentence. You interrupt to add something. You mutter “yeah, got it” to signal you are following along. Current voice AI treats any of those as a full turn handover, which is why interrupting ChatGPT’s voice mode often produces a jarring stop-and-restart.

The technical requirement is processing two audio streams in parallel: your voice and the model’s own output, without waiting for a formal end-of-turn marker. Speed matters enormously here. Any latency above a few hundred milliseconds breaks the feeling of natural speech.

This approach has been demonstrated before. The French AI lab Kyutai shipped Moshi, the first true full-duplex speech model, running at roughly 200 milliseconds latency, about the speed of a natural conversation. In their research paper, the team described it simply:

“Moshi can always speak and listen, and do both simultaneously when needed.”

Kyutai, Moshi research (2024)

GPT-Bidi-1 appears to bring the same architecture to the hundreds of millions of people who already use ChatGPT.
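
The difference between walkie-talkie turn-taking and full-duplex handling can be illustrated with a minimal Python sketch. It is not how GPT-Bidi-1 or Moshi works internally: the word list, the `Overlap` type, and both policy functions are hypothetical, and a real model would decide from the audio itself rather than from a transcript.

```python
from dataclasses import dataclass

# Hypothetical list of short acknowledgements. A real system would rely on
# acoustic and semantic cues rather than a fixed word list.
BACKCHANNELS = {"mm-hm", "uh-huh", "yeah", "yeah, got it", "right", "ok", "okay"}


@dataclass
class Overlap:
    """Caller speech that arrives while the assistant is still talking."""
    text: str


def is_backchannel(overlap: Overlap) -> bool:
    """Return True if the overlap is only an acknowledgement."""
    return overlap.text.strip().lower().rstrip(".!") in BACKCHANNELS


def turn_based_policy(overlap: Overlap) -> str:
    """Walkie-talkie behaviour: any caller speech ends the assistant's turn."""
    return "stop"


def full_duplex_policy(overlap: Overlap) -> str:
    """Keep talking through acknowledgements, yield on real interruptions."""
    return "continue" if is_backchannel(overlap) else "yield"
```

Under the turn-based policy, "mm-hm" and "wait, I meant Tuesday" get the same treatment. Under the full-duplex policy, only the second one makes the assistant give up the floor.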

What can the current voice do, and what still falls short?

Advanced Voice Mode is already a speech-to-speech model. You can interrupt it, it responds in near-phone-call time, and it picks up tone and emotion. Many users find it surprisingly natural for short exchanges.

Under the hood, though, it still operates in turns: it waits for you to finish before generating a response. Beyond full-duplex audio, the leaked interface text indicates that GPT-Bidi-1 adds three intelligence levels, High, Medium, and Instant, matching the structure already available on the text side. That lets you trade speed for depth by task: quick lookups on Instant, detailed explanations on High.
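
As a rough illustration of trading speed for depth, the sketch below maps task types to the three reported levels. Only the level names come from the leaked interface text. The task categories, the mapping, and the `pick_level` helper are hypothetical, and no public API for selecting a level has been confirmed.

```python
# Level names are taken from the leaked interface text. The task categories
# and the mapping itself are hypothetical, chosen only for illustration.
LEVEL_BY_TASK = {
    "quick_lookup": "Instant",
    "general_question": "Medium",
    "detailed_explanation": "High",
}


def pick_level(task_type: str, default: str = "Medium") -> str:
    """Return the level for a task type, falling back to a middle setting."""
    return LEVEL_BY_TASK.get(task_type, default)
```

A team planning a rollout might keep a table like this per call type, then adjust it once real latency and quality numbers are available.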

Voice features are available in the free tier of ChatGPT and in ChatGPT Plus, priced at $20 per month. For a broader picture of where ChatGPT stands globally, TheAIDaily's ChatGPT statistics page tracks usage across platforms and regions.

Why did text race ahead while voice fell behind?

Text models can take a second to think. Voice models cannot. Every added millisecond of latency in a spoken response is audible to the listener, so real-time audio generation demands processing shortcuts that text generation can ignore. That constraint has kept voice at an earlier generation than text, even inside the same product.

OpenAI is not the first to tackle this. Google built bidirectional streaming into Gemini Live earlier this year. Apple is rebuilding Siri with AI assistance as well. The race to natural-sounding AI voice is moving fast, and GPT-Bidi-1 represents OpenAI’s move to close the gap it created.

What does this mean for AI in customer service?

This is the upgrade that moves voice AI from a rough demo to a workable channel. A voice assistant that stalls on “yeah, and?” sounds like a phone tree from 2010. One that holds its thread while you talk over it sounds like a colleague.

The market data makes the stakes clear. The global AI customer service market is projected to reach $15.12 billion in 2026, growing at a 25.8% compound annual rate. The voice AI segment is expanding even faster, at 34.8% CAGR, with projections to hit $47.5 billion by 2034. Eighty percent of businesses globally plan to integrate voice AI into customer service before the end of 2026, according to Fortune Business Insights.

TheAIDaily's AI Customer Service Statistics page tracks cost-per-interaction, resolution rates, and adoption benchmarks across industries.
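
For readers who want to sanity-check growth figures like the ones above, compound annual growth is simple to compute. The sketch below is generic arithmetic, not a forecast: the function names are illustrative, and applying a quoted rate to a particular start year is an assumption the reader has to supply, since the projections do not all state their base year.

```python
def project(value: float, cagr: float, years: int) -> float:
    """Compound `value` forward by `years` at annual rate `cagr` (0.258 = 25.8%)."""
    return value * (1 + cagr) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    """Annual growth rate that takes `start` to `end` over `years` years."""
    return (end / start) ** (1 / years) - 1
```

For example, `project(15.12, 0.258, 1)` compounds the 2026 figure by one more year, on the assumption that the same rate still holds.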

The economics already make a strong case. An AI-handled customer interaction currently costs around $0.62, against $7.40 for a human agent. The constraint so far has been quality: customers who feel they are being routed through a system that does not actually understand them simply hang up and call back. Full-duplex voice changes that calculus. A voice that can handle interruptions and course corrections mid-sentence would remove the biggest quality complaint in AI phone channels.

The effect depends on the use case. For high-volume, repeatable queries, such as appointment rescheduling, order status, or policy lookups, the gain is immediate. For situations requiring nuanced judgment, full-duplex voice improves the surface but does not replace the substance.
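
The cost argument can be turned into a back-of-the-envelope calculation. The sketch below uses the two per-interaction figures quoted above. The automated share and the call volume are inputs you would supply yourself, and the function names are illustrative.

```python
AI_COST = 0.62     # cost per AI-handled interaction, as quoted above
HUMAN_COST = 7.40  # cost per human-handled interaction, as quoted above


def blended_cost(automated_share: float) -> float:
    """Average cost per interaction when a share of calls is AI-handled."""
    if not 0.0 <= automated_share <= 1.0:
        raise ValueError("automated_share must be between 0 and 1")
    return automated_share * AI_COST + (1 - automated_share) * HUMAN_COST


def monthly_saving(calls: int, automated_share: float) -> float:
    """Saving versus an all-human baseline for a given monthly call volume."""
    return calls * (HUMAN_COST - blended_cost(automated_share))
```

If half of the calls are repeatable enough to automate, the blended cost per interaction is $4.01. The calculation ignores setup costs and calls that bounce back to a human, so treat it as an upper bound on the saving.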

What are the EU AI Act requirements for AI phone calls?

Under the EU AI Act, any AI system interacting with people in real time must identify itself as AI at the start of the interaction. The transparency obligation takes effect on August 2, 2026. For a voice agent on a phone line, that means a clear disclosure at the beginning of every call.

That requirement becomes more important, not less, as voice AI improves. A model indistinguishable from a human creates a higher obligation to disclose, precisely because the deception risk is greater. The practical implementation is a single sentence: “You are speaking with a digital assistant.” Build that into your call flow before August 2, and the disclosure step is in place.

This applies to all businesses operating in the EU, not only those based in member states. If your phone channel reaches EU residents, the obligation applies regardless of where your company is incorporated.
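
In a call flow, the disclosure is easiest to guarantee when it is built into the opening rather than left to the prompt. The sketch below is one minimal way to do that, using the sentence suggested above. The function names are hypothetical, and whether this exact wording is sufficient is an assumption carried over from the text, not something the code checks.

```python
DISCLOSURE = "You are speaking with a digital assistant."


def opening_lines(greeting: str) -> list[str]:
    """Return the assistant's opening utterances with the disclosure first."""
    return [DISCLOSURE, greeting]


def starts_with_disclosure(transcript: list[str]) -> bool:
    """Check that the first assistant utterance in a call is the disclosure."""
    return bool(transcript) and transcript[0].strip() == DISCLOSURE
```

A check like `starts_with_disclosure` can run over call logs so that a later change to the greeting cannot silently drop the disclosure.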

How does it perform in non-English languages?

The current voice mode handles English well and European languages with varying quality. Performance typically drops on proper nouns, regional accents, and fast interruptions in any language. More reasoning capacity and a full-duplex architecture are expected to improve both language quality and interruption handling, but until the model ships, those remain expectations rather than verified results.

The practical advice is consistent across markets: run your own pilot on your actual customer queries before deploying to a live line. A polished English demo says nothing about how the model handles an impatient caller switching topics mid-sentence in German or French.
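
A pilot like this only needs simple bookkeeping. As a sketch, assuming you log each test call as a language code plus a pass or fail judgment (both the record format and the function name are hypothetical), the pass rate per language can be tallied like this:

```python
from collections import defaultdict


def pass_rate_by_language(results):
    """Tally pilot results given as (language, passed) pairs.

    Returns a dict mapping each language to its share of passed calls.
    """
    totals = defaultdict(lambda: [0, 0])  # language -> [passed, total]
    for language, passed in results:
        totals[language][1] += 1
        if passed:
            totals[language][0] += 1
    return {lang: p / t for lang, (p, t) in totals.items()}
```

Feeding it the same customer queries in English, German, and French makes a gap between languages visible before a live line exposes it.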

What can you do with this right now?

GPT-Bidi-1 is not yet available. But preparation now saves scrambling later.

  • Identify which phone or service channel costs your team the most time, and note the three questions that come up most often.
  • Test the current Advanced Voice Mode on exactly those questions, so you have a baseline before the new model arrives.
  • Run the numbers on your highest-volume call type: what does a single handled call actually cost, and what share of that volume is repeatable enough to automate?
  • Lock in your EU AI Act disclosure language now, so you are not writing it under deadline pressure in late July.

According to reports, “Bidi (Latest)” will appear as a new option alongside the existing voice mode rather than replacing it, so you retain control over which version handles which tasks. The direction is clear: the same trajectory that moved ChatGPT from a typed assistant to a visual analyst is now reaching voice.

Written by Michael Groeneweg, AI consultant at Digital Impact and founder of UnicornAI.nl

Michael is an AI consultant at Digital Impact in Rotterdam and the founder of UnicornAI.nl, where he builds AI solutions and SaaS integrations for businesses. An entrepreneur for ten years, he has spent the last few refusing to touch anything that doesn't have AI woven into it, at work and at home, to the mild dismay of the people around him. His travels have turned into a running experiment in what AI can and can't do from a cafe terrace in Lisbon or a train station in Tokyo. He obsessively tests new tools, builds solutions for clients, and believes nobody should buy the hype, but nobody can keep pretending AI doesn't change everything either. Loves good coffee, long flights, and people who build with AI instead of just talking about it.

Written by a human, with AI assisting research and editing. More on our method in the AI disclosure.