The Voice Stack · Lesson 1 of 14

How Voice Agents Actually Work

A phone agent is four systems chained together, not a chat window with a speaker. Learn the pipeline so you can debug calls instead of guessing at prompts.

A call is a pipeline, not a chat box

Text chat gives the model all the time it wants. A phone call does not. Every voice agent on Vapi, Retell, or Bland is really four systems chained together, and a call feels good or broken depending on how they hand off to each other. Understand the chain first and most "the prompt is bad" problems turn out to be something else entirely.

The four layers of every voice turn

Speech-to-text (STT): the caller's audio is transcribed into words in real time. STT engines such as Deepgram or Whisper take in the live audio and also have to guess when the caller has finished a sentence. Trouble with accents, background noise, and low-quality phone audio shows up here.
The language model: the transcript plus your instructions go to a model, which decides what to say or which action to take. This is the layer operators fixate on, but it is only one of four.
Text-to-speech (TTS): the model's text is turned back into audio by a voice provider such as ElevenLabs or Cartesia. Voice quality and the delay before the first word both come from here.
Telephony: the phone network itself, usually through Twilio or a platform's built-in numbers, carries the audio to and from the caller.
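
To make the four layers concrete, here is a minimal Python sketch of one turn passing through them in order. Every function is a hypothetical stand-in written for this lesson, not the API of Deepgram, ElevenLabs, Twilio, or any platform. The "audio" is just encoded text, so the example runs without any provider.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    """State of one conversational turn as it moves through the layers."""
    caller_audio: bytes
    transcript: str = ""
    reply_text: str = ""
    reply_audio: bytes = b""
    trace: List[str] = field(default_factory=list)


def speech_to_text(turn: Turn) -> Turn:
    # Stand-in: a real STT engine transcribes a live audio stream.
    turn.transcript = turn.caller_audio.decode("utf-8")
    turn.trace.append("stt")
    return turn


def language_model(turn: Turn) -> Turn:
    # Stand-in: a real model receives the transcript plus your instructions.
    turn.reply_text = f"You said: {turn.transcript}"
    turn.trace.append("llm")
    return turn


def text_to_speech(turn: Turn) -> Turn:
    # Stand-in: a real voice provider synthesizes audio from the text.
    turn.reply_audio = turn.reply_text.encode("utf-8")
    turn.trace.append("tts")
    return turn


def telephony_deliver(turn: Turn) -> Turn:
    # Stand-in: the phone network carries the reply audio to the caller.
    # (In a real call it also carried the caller's audio in.)
    turn.trace.append("telephony")
    return turn


LAYERS = [speech_to_text, language_model, text_to_speech, telephony_deliver]


def run_turn(caller_audio: bytes) -> Turn:
    turn = Turn(caller_audio=caller_audio)
    for layer in LAYERS:
        turn = layer(turn)
    return turn


turn = run_turn(b"I need to reschedule")
print(turn.trace)       # ['stt', 'llm', 'tts', 'telephony']
print(turn.reply_text)  # You said: I need to reschedule
```

The point of the `trace` list is that each layer leaves a mark, so a failure can be pinned to the layer that produced it rather than blamed on the prompt by default.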

Why the chain matters

The caller speaks, STT transcribes, the model thinks, TTS speaks, and telephony delivers the audio. Each layer adds delay, and the delays stack. If a call feels slow, the cause might be a slow model, a slow TTS voice, or STT waiting too long to decide the caller has finished. You cannot fix what you cannot locate, and you can only locate it if you know the layers exist.
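
The stacking can be shown with simple arithmetic. The sketch below sums per-layer delays for one turn and names the largest contributor. The millisecond figures are invented for illustration only and are not measurements from any provider. The key names are likewise hypothetical.

```python
from typing import Dict, Tuple


def turn_latency(delays_ms: Dict[str, int]) -> Tuple[int, str]:
    """Return the total delay for one turn and the layer contributing most."""
    total = sum(delays_ms.values())
    suspect = max(delays_ms, key=lambda name: delays_ms[name])
    return total, suspect


# Illustrative numbers only (assumed, not measured).
example = {
    "stt_endpointing": 700,   # STT waiting to decide the caller finished
    "llm_first_token": 400,   # model thinking
    "tts_first_audio": 250,   # voice provider producing the first audio
    "telephony": 150,         # network carrying audio to the caller
}

total, suspect = turn_latency(example)
print(f"{total} ms total, largest contributor: {suspect}")
# 1500 ms total, largest contributor: stt_endpointing
```

In this made-up case the model is not the slowest layer, which is exactly the situation where rewriting the prompt would change nothing.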

What platforms actually give you

Vapi, Retell, and Bland bundle all four layers behind one abstraction so you are not wiring Deepgram to a model to ElevenLabs to Twilio by hand. That is the real value they sell. It also means your job shifts from plumbing to design: picking the STT, model, and voice, and tuning how they interact.
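
As a rough picture of what "design" means here, the sketch below models an agent as a handful of choices plus one interaction setting. This is a hypothetical structure for illustration, not the configuration schema of Vapi, Retell, or Bland. The field names, the model name `example-model`, and the warning thresholds are all assumptions chosen to show the trade-off, not recommended values.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical agent settings; real platforms use their own schemas."""
    stt_provider: str
    model: str
    voice_provider: str
    endpointing_ms: int  # silence STT waits before deciding the caller finished

    def warnings(self) -> List[str]:
        # Thresholds are arbitrary placeholders that illustrate the trade-off.
        notes: List[str] = []
        if self.endpointing_ms < 300:
            notes.append("short endpointing: may cut the caller off early")
        if self.endpointing_ms > 1000:
            notes.append("long endpointing: every reply starts late")
        return notes


config = AgentConfig(
    stt_provider="deepgram",
    model="example-model",
    voice_provider="elevenlabs",
    endpointing_ms=200,
)
for note in config.warnings():
    print(note)
```

The setting that matters most in this sketch is not a provider name but `endpointing_ms`, because it governs how two layers interact: set it too low and STT interrupts the caller, too high and the whole chain starts late.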

What good looks like

You can trace a single turn out loud: caller speaks, STT transcribes, model responds, TTS speaks, telephony delivers. When a call misbehaves you can name which layer is the suspect before you touch anything.
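
One way to practice naming the suspect is to write the mapping down. The table below is a starting sketch built from the layer descriptions in this lesson. The symptom wording is hypothetical, and the telephony entries are assumptions rather than claims from the lesson. An unknown symptom returns all four layers, because nothing has been ruled out yet.

```python
from typing import Dict, List

ALL_LAYERS: List[str] = ["stt", "llm", "tts", "telephony"]

# Symptom -> layers to check first. A sketch, not an exhaustive diagnostic.
SUSPECTS: Dict[str, List[str]] = {
    "caller cut off mid-sentence": ["stt"],
    "words transcribed wrong": ["stt", "telephony"],   # telephony: assumption
    "long pause before reply": ["stt", "llm", "tts"],
    "voice sounds wrong": ["tts"],
    "wrong answer or wrong action": ["llm"],
    "audio drops out": ["telephony"],                  # assumption
}


def name_suspects(symptom: str) -> List[str]:
    """Return the layers to inspect first for a symptom."""
    return SUSPECTS.get(symptom, list(ALL_LAYERS))


print(name_suspects("caller cut off mid-sentence"))  # ['stt']
print(name_suspects("long pause before reply"))      # ['stt', 'llm', 'tts']
```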

Common mistakes

Treating a voice agent like a chatbot and editing the prompt when the real problem is STT cutting the caller off early.
Ignoring that every layer is a separate provider with its own cost, latency, and failure mode.
Testing on clean laptop audio and being surprised when real phone calls transcribe worse.