The fix for the echo problem is acoustic echo cancellation: feed the speaker output back as a reference signal, subtract it from the mic input, and you are left with just the human. That is table stakes for full-duplex and I will not relitigate it here.
The fix for the cough-and-backchannel problem is where the latency went. You cannot trust a single VAD frame. Energy-based VAD does not know the difference between "I disagree, stop" and someone clearing their throat in a coffee shop. Background noise pushing energy above threshold is exactly the failure mode the field keeps naming (Future AGI). So you add a guard. You require the interrupting speech to persist for a minimum duration before you commit to yielding.
That guard is the 120ms. And it buys something real. A minimum-duration guard can cut the false-barge-in rate by 60-80%, but it adds roughly 200ms to the barge-in path (Future AGI). I tuned mine tighter than that and landed at +120ms before my false-positive rate dropped under the 5% I was aiming for. The published target for barge-in is brutal in both directions: 95%+ accuracy, under 5% false positives, under 5% missed real interruptions (Future AGI). You do not get there for free, and the currency you pay in is the same milliseconds your benchmark is bragging about.
Figure: the barge-in path as a flow. Agent speaking, VAD fires on the mic, a minimum-duration guard that adds 120ms, then duck and yield at -24dB. Below the flow, the tradeoff: with no guard the agent yields instantly but fires on coughs and echo, while the guard drops false barge-ins 60-80% at the cost of an interruption landing 120ms slower.
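A minimal sketch of that guard, to make the mechanism concrete. It is not my production code. The 20ms frame length, the class and function names, and the strict "any non-speech frame resets the run" rule are all assumptions for illustration; the 120ms guard and the -24dB duck are the figures from this post.

```python
from dataclasses import dataclass

FRAME_MS = 20      # assumed VAD frame length (hypothetical)
GUARD_MS = 120     # minimum sustained speech before yielding (from the post)
DUCK_DB = -24.0    # TTS attenuation once the guard passes (from the post)


@dataclass
class BargeInGuard:
    """Counts unbroken speech and fires only after guard_ms of it."""
    guard_ms: int = GUARD_MS
    frame_ms: int = FRAME_MS
    speech_ms: int = 0  # length of the current unbroken speech run

    def update(self, vad_is_speech: bool) -> bool:
        """Feed one VAD frame; return True once speech has persisted long enough."""
        if vad_is_speech:
            self.speech_ms += self.frame_ms
        else:
            self.speech_ms = 0  # assumption: any non-speech frame resets the run
        return self.speech_ms >= self.guard_ms


def duck_gain(db: float = DUCK_DB) -> float:
    """Convert a dB attenuation into a linear amplitude multiplier."""
    return 10 ** (db / 20)


def first_yield_ms(frames, guard: BargeInGuard):
    """Milliseconds from the first frame until the guard fires, or None if it never does."""
    for i, is_speech in enumerate(frames):
        if guard.update(is_speech):
            return (i + 1) * guard.frame_ms
    return None


cough = [True, True, True, False, False, False]   # 60ms burst, then quiet
interruption = [True] * 10                        # 200ms of sustained speech

print(first_yield_ms(cough, BargeInGuard()))         # guard never fires
print(first_yield_ms(interruption, BargeInGuard()))  # fires once 120ms has accumulated
print(round(duck_gain(), 4))                         # linear gain for -24dB
```

The cough never reaches the guard, and the real interruption pays the full guard duration before the agent ducks. A hard reset on a single non-speech frame is the simplest possible rule; a real pipeline would probably tolerate a short gap.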
The two timers nobody puts on the same chart
Here is what I think the benchmarks get structurally wrong. There is not one latency in a voice agent. There are two, and they pull in opposite directions.
| Timer | What it measures | Direction barge-in pushes it |
| --- | --- | --- |
| Turn-taking latency | User stops -> agent starts | This is what every chart reports |
| Barge-in latency | User cuts in -> agent stops | This is the one nobody reports |
Turn-taking latency is the relay-race number. Barge-in latency is the interrupt-handling number, and the field is starting to put real targets on it: interruption response under 200ms, measured from user-speech onset to TTS suppression (Future AGI). The trap is that these two timers fight. Make the agent quicker to yield and you generate more false stops. Add a guard to kill the false stops and you slow the yield. You are not optimizing a number. You are choosing a point on a tradeoff curve, and the benchmark that reports only the first timer is hiding the second axis entirely.
The research framing I found most honest measures the minimum latency required to reach 90% barge-in accuracy, rather than reporting latency and accuracy as if they were independent (Future AGI). That is the joint metric. That is what a barge-in benchmark should look like, and almost nobody publishes it.
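That joint metric is simple to compute once you have a sweep of operating points. A rough sketch, assuming you have already measured barge-in latency and accuracy at several guard settings; the function name is hypothetical and the sweep values are made-up placeholders, not measurements.

```python
from typing import Optional, Sequence, Tuple


def min_latency_at_accuracy(
    sweep: Sequence[Tuple[float, float]],
    target: float = 0.90,
) -> Optional[float]:
    """Smallest barge-in latency (ms) whose accuracy meets the target.

    sweep holds (barge_in_latency_ms, accuracy) pairs, one per operating point.
    Returns None when no operating point reaches the target.
    """
    qualifying = [latency for latency, accuracy in sweep if accuracy >= target]
    return min(qualifying) if qualifying else None


# Placeholder sweep for illustration only; substitute your own measurements.
example_sweep = [(40.0, 0.71), (80.0, 0.84), (120.0, 0.91), (200.0, 0.95)]

print(min_latency_at_accuracy(example_sweep))        # latency at the 90% target
print(min_latency_at_accuracy(example_sweep, 0.99))  # None: target not reachable
```

Reporting the single number this returns, instead of a latency and an accuracy picked from different operating points, is what keeps the two axes honest.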
Where the +120ms actually fits in the budget
To be clear about scale: in a cascade, the latency that gets all the attention is the STT-to-LLM-to-TTS chain, which even at its fastest is a few hundred milliseconds of stacked work. The barge-in path is a separate budget. It runs in parallel, on the listening side, the whole time the agent is talking. The two budgets do not add: time spent on the listening path is not time added to the response chain.
So the +120ms does not lengthen your response. It lengthens the interruption. When a user cuts in, that is the delay before the agent goes quiet. And that delay has a much lower tolerance than response latency does. People forgive an agent that takes 600ms to answer. They do not forgive an agent that keeps talking for 600ms after they have clearly told it to stop, because at that point it is not slow, it is rude. The barge-in timer is the one your users feel as a personality flaw.
What 2026 turn-taking models change
The honest version of this story is that the brute-force guard is the old way, and the field has moved. The fix for "VAD is too dumb to tell a cough from a correction" is to stop using a bare energy threshold and use a model that understands turns.
This is the shift everyone is making right now. The 2026 production stack is migrating from energy-threshold VAD toward dedicated turn-taking models that classify backchannel versus barge-in versus continued silence as a learned signal (Future AGI). The named players:
- Deepgram Flux does model-native end-of-turn detection using acoustic, semantic, and conversational context instead of silence thresholds, landing around 250ms end-of-turn and removing the need for a separate VAD-plus-endpointing stack (Deepgram).
- Krisp Turn Prediction v3 pushes end-of-turn latency below 200ms, and in May 2026 benchmarks its accuracy curve sat below LiveKit's built-in and Deepgram Flux's across the operating range (Krisp).
- LiveKit Agents ships adaptive interruption handling at 86% precision and 100% recall, with the barge-in and backchannel-suppression logic living in the orchestrator, not the ASR model (Inworld).
That last point reframed the whole problem for me. Barge-in quality lives in your orchestrator, not your speech-to-text model. The model tells you what it heard; the orchestrator decides what to do about it, and that decision is the entire game (AssemblyAI). I had been tuning the wrong layer for a week.
A semantic turn detector earns back most of my 120ms because it does not need a long duration guard. It can tell that "actually—" is an interruption and "yeah, mm-hm" is not, from the prosody and the words, not from how long the sound lasted. The guard was a crutch for a dumb VAD. A model that understands the turn lets you commit to the decision sooner with the same accuracy, which is the only way to move down the tradeoff curve instead of along it. Combining audio and text this way is what closes the gap to roughly 300ms without cutting users off mid-thought (Future AGI).
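Here is roughly what that yield-or-hold decision looks like once it lives in the orchestrator. This is a sketch under assumptions, not any vendor's API: the `TurnLabel` classes, the confidence score, the 0.8 threshold, and the fallback to a 120ms duration guard when the detector is unsure are all hypothetical choices of mine.

```python
from enum import Enum


class TurnLabel(Enum):
    BACKCHANNEL = "backchannel"   # "yeah, mm-hm"
    BARGE_IN = "barge_in"         # a real interruption
    SILENCE = "silence"           # nothing to act on


class Action(Enum):
    HOLD = "hold"     # agent keeps speaking
    YIELD = "yield"   # agent ducks and stops


def decide(
    label: TurnLabel,
    confidence: float,
    speech_ms: int,
    min_confidence: float = 0.8,    # hypothetical threshold
    fallback_guard_ms: int = 120,   # duration guard used only when unsure
) -> Action:
    """Orchestrator-level decision from a (hypothetical) turn detector's output."""
    if label is TurnLabel.SILENCE:
        return Action.HOLD
    if confidence >= min_confidence:
        # Confident classification: act now, no duration guard needed.
        return Action.YIELD if label is TurnLabel.BARGE_IN else Action.HOLD
    # Low confidence: fall back to the brute-force duration guard.
    return Action.YIELD if speech_ms >= fallback_guard_ms else Action.HOLD


print(decide(TurnLabel.BARGE_IN, 0.95, speech_ms=40))      # yields after only 40ms
print(decide(TurnLabel.BACKCHANNEL, 0.90, speech_ms=300))  # holds despite 300ms of sound
print(decide(TurnLabel.BARGE_IN, 0.50, speech_ms=60))      # unsure and short: hold
```

The point of the structure is that the guard only charges its tax on the uncertain cases. When the detector is confident, the decision commits without waiting out the duration.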
What I would tell myself before shipping it
Three things rearranged in my head, and they are the things I wish a benchmark had told me.
Measure the second timer. If your dashboard only has turn-taking latency, you are flying with one instrument. Add barge-in latency, measured from user-speech onset to TTS suppression, and watch them as a pair. The moment you start optimizing one in isolation, you are quietly wrecking the other.
The guard is a tax, not a feature. A minimum-duration guard is the cheapest way to stop false barge-ins and the most expensive way to feel responsive. It is fine as a first pass. It is a bad place to live. If you are still paying a 120-200ms guard tax six months in, you have not solved barge-in, you have postponed it.
Barge-in is an orchestrator problem. I spent days assuming a better STT model would fix my interruptions. It would not have. The yield-or-hold decision lives above the model, and that is where the engineering actually is. Pick your transport and orchestrator for how they handle interruption events, because that is the layer your users will judge.
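For the first of those three, measuring the second timer, the bookkeeping is small. A sketch assuming your pipeline can log four timestamps per exchange; the event and field names are hypothetical, and the sample values are placeholders rather than measurements.

```python
from dataclasses import dataclass
from statistics import median
from typing import Iterable


@dataclass
class TurnEvent:
    user_speech_end_ms: float     # user stops talking
    agent_audio_start_ms: float   # first agent audio reaches the speaker


@dataclass
class BargeInEvent:
    user_speech_onset_ms: float   # user starts talking over the agent
    tts_suppressed_ms: float      # agent audio actually goes quiet


def turn_taking_latency(e: TurnEvent) -> float:
    """User stops -> agent starts."""
    return e.agent_audio_start_ms - e.user_speech_end_ms


def barge_in_latency(e: BargeInEvent) -> float:
    """User cuts in -> agent stops, onset to TTS suppression."""
    return e.tts_suppressed_ms - e.user_speech_onset_ms


def paired_medians(turns: Iterable[TurnEvent], barges: Iterable[BargeInEvent]):
    """Report both timers together so neither gets tuned in isolation."""
    return (
        median(turn_taking_latency(t) for t in turns),
        median(barge_in_latency(b) for b in barges),
    )


# Placeholder timestamps for illustration only.
turns = [TurnEvent(1000, 1450), TurnEvent(9000, 9520), TurnEvent(15000, 15480)]
barges = [BargeInEvent(5000, 5180), BargeInEvent(12000, 12100), BargeInEvent(20000, 20260)]

print(paired_medians(turns, barges))
```

Returning the two medians as a pair is deliberate: a dashboard panel that shows only one of them is the single-instrument problem again.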
The number nobody puts on the chart is the number your users feel first. An agent that answers fast but will not stop talking is not a fast agent. It is a fast bulldozer. I would rather lose 120ms and have it know when to shut up.
I pulled the latency-budget framing and the cascade anatomy behind this from my book The 300ms Voice-AI UX Problem, which is where I worked out why turn-taking is the part of the budget that does not behave like the rest of it. This post is what happened when I stopped reading about turn gaps and started measuring the one in the other direction.