The Voice Stack · Lesson 2 of 14

Where Latency Hides In A Call

Callers forgive a lot, but not long silences. Learn where each hundred milliseconds is spent and how to keep every turn within a natural response delay.

Latency is the number one killer

Voice agents rarely fail because the model is wrong. They fail because the pause after the caller stops speaking is too long, and the human on the other end thinks the line dropped or the agent is confused. On a phone call, a total response delay of roughly one second is about the limit of what still feels natural. Cross it and the conversation feels broken no matter how good the words are.

The latency budget of one turn

Response delay accumulates across the pipeline, and each part is separately tunable.

Endpointing: the time STT waits after the caller goes quiet before deciding they are actually done. Set it too long and every reply feels sluggish. Set it too short and the agent interrupts.
Model time to first token: how long the model takes to start producing its answer. A large model with a long prompt is slower to begin speaking.
TTS first-audio latency: the gap before the voice produces its first sound, separate from how long the whole sentence takes to say.
Network and telephony: the round trip through the phone network, usually small but real.
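
The four stages above add up to the pause the caller hears, so it can help to write the budget down as plain arithmetic. The following is a minimal sketch, not a platform API: the `TurnBudget` class, its field names, and the millisecond values are all illustrative assumptions, and the 1000 ms target simply encodes the "roughly one second" figure from this lesson.

```python
from dataclasses import dataclass

# Target taken from the lesson's "roughly one second" rule of thumb.
TARGET_MS = 1000


@dataclass
class TurnBudget:
    """Hypothetical per-turn latency breakdown, one field per stage."""
    endpointing_ms: float
    model_ttft_ms: float
    tts_first_audio_ms: float
    network_ms: float

    def total_ms(self) -> float:
        # The caller experiences the sum of all stages as one pause.
        return (self.endpointing_ms + self.model_ttft_ms
                + self.tts_first_audio_ms + self.network_ms)

    def largest_stage(self) -> str:
        # Name of the stage contributing the most delay.
        stages = vars(self)
        return max(stages, key=stages.get)


# Made-up numbers for illustration only, not measurements.
turn = TurnBudget(endpointing_ms=500, model_ttft_ms=350,
                  tts_first_audio_ms=200, network_ms=100)

over_by = turn.total_ms() - TARGET_MS
print(f"total={turn.total_ms()} ms, over budget by {over_by} ms")
print(f"largest stage: {turn.largest_stage()}")
```

With these example values the turn runs 150 ms over the target, and endpointing is the first place to look.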

How to cut it

Stream everything. Send model tokens to TTS as they arrive instead of waiting for the full response, so the agent starts speaking almost immediately.
Trim the prompt. A shorter system prompt means faster time to first token on every single turn. Long, padded instructions tax the whole call.
Pick faster models and voices. A smaller model or a lower-latency voice provider can save hundreds of milliseconds with little quality loss for short spoken replies.
Tune endpointing per use case. A receptionist taking a name can wait less than one collecting a long complaint.
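
To make "stream everything" concrete, here is one possible sketch of forwarding model output to TTS before the full response exists. It assumes the model yields text tokens one at a time and that the TTS engine is happy to receive short clauses; the `chunk_for_tts` helper and the hard-coded token list are hypothetical stand-ins, and some TTS providers accept raw token streams directly, in which case this buffering step is unnecessary.

```python
from typing import Iterable, Iterator

# Punctuation treated as a safe place to hand text to TTS (an assumption;
# tune this for your voice provider).
BOUNDARIES = (".", "?", "!", ",")


def chunk_for_tts(tokens: Iterable[str]) -> Iterator[str]:
    """Yield speakable chunks as soon as a clause boundary arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(BOUNDARIES):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the model stops.
    if buffer.strip():
        yield buffer.strip()


# Stand-in for a streaming model response.
tokens = ["Sure", ",", " I", " can", " book", " that", ".",
          " What", " time", " works", "?"]

chunks = list(chunk_for_tts(tokens))
for chunk in chunks:
    print("send to TTS:", chunk)
```

The first chunk is ready after only two tokens, so the voice can start speaking while the model is still generating the rest of the reply.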

Measure on real calls

A demo on office wifi lies. Measure end-to-end latency on an actual phone call over a normal connection, because that is what your callers get. Most platforms expose per-turn timing in call logs; use it to find which layer is eating the budget rather than guessing.
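
Once per-turn timing is available, a few lines of analysis can show which layer to shave. The sketch below assumes the call log has already been exported as a list of dictionaries with one entry per turn; the field names and values are hypothetical, since each platform exposes timing in its own format.

```python
from statistics import median

# Hypothetical per-turn timings exported from a call log (milliseconds).
turns = [
    {"endpointing_ms": 400, "model_ttft_ms": 300,
     "tts_first_audio_ms": 150, "network_ms": 80},
    {"endpointing_ms": 400, "model_ttft_ms": 650,
     "tts_first_audio_ms": 160, "network_ms": 90},
    {"endpointing_ms": 400, "model_ttft_ms": 700,
     "tts_first_audio_ms": 140, "network_ms": 85},
]


def median_by_stage(turns):
    """Median delay per stage across all turns of a call."""
    stages = turns[0].keys()
    return {stage: median(t[stage] for t in turns) for stage in stages}


def slow_turns(turns, target_ms=1000):
    """Indexes of turns whose total delay exceeds the target."""
    return [i for i, t in enumerate(turns) if sum(t.values()) > target_ms]


medians = median_by_stage(turns)
worst_stage = max(medians, key=medians.get)

print("median per stage:", medians)
print("stage eating the budget:", worst_stage)
print("turns over one second:", slow_turns(turns))
```

In this made-up call, the second and third turns exceed one second and model time to first token is the dominant stage, which points toward a shorter prompt or a faster model rather than endpointing changes.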

What good looks like

Replies land fast enough that a caller never wonders if the line dropped, and you can point at a call log and say which layer to shave when a particular flow feels slow.

Common mistakes

Optimizing the prompt for quality while ignoring that its length is slowing every turn.
Not streaming, so the agent waits for the model's full response before it starts talking.
Judging latency on a clean demo instead of a real phone connection.
