Voice AI Concepts You Must Know

When your voice agent interrupts the user at the wrong moment, is that a Speech-to-Text (STT) problem? Is it a Voice Activity Detection (VAD) problem? Or could it be that your Text-to-Speech (TTS) model wasn’t stopped mid-generation?

They’re three different things, and the fix is different for each. They even live in different parts of your pipeline.

That’s why building truly conversational voice AI agents is much harder than just picking an STT or TTS model.

At Cartesia, we are building AI for humans to express themselves, and be understood.

Understanding starts with a common vocabulary.

This post gives you the vocabulary you need to explore and reason about Conversational AI Agents, and to make sense of the benchmark results that seem to get released every few days!

I’ve grouped the terms into sections so they’re easier to contextualize. You know what they say - context is everything (especially in AI!).

Core Voice Pipeline

STT (Speech-to-Text). This is the conversion of speech input into text output. In practice, STT and ASR (Automatic Speech Recognition) are often used interchangeably.

Any time you dictate a text message into your phone, you’re doing STT / ASR.

For Conversational Voice AI (i.e. a natural two-way conversation between human and AI), STT is the only channel of communication from the human into the system, but it’s only one part of the pipeline.

You can play with Cartesia’s STT models (called Ink) here for free. Push the models! Give them things like phone numbers, dates, and spelled-out email addresses!

VAD (Voice Activity Detection). VAD models are specialist classifier models, separate from the STT transcription model itself. VAD models distinguish:
1. sound from silence in an input signal, and
2. speech from background sound

VAD models filter the audio signal coming into the system and are designed to isolate speech. They then send that isolated speech into the ASR model for transcription.
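To make the first of those two jobs concrete, here is a minimal sketch of an energy-threshold detector in Python. It is a toy stand-in, not how trained VAD classifiers work: it can only separate sound from silence, and it cannot tell speech from loud background noise. The function names, the 16 kHz sample rate, the 10 ms frame size and the threshold are all assumptions chosen for illustration.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def energy_vad(samples, frame_size=160, threshold=0.02):
    """Return one boolean per full frame: True where energy exceeds threshold.

    frame_size=160 is 10 ms at an assumed 16 kHz sample rate.
    The threshold is an arbitrary illustrative value.
    """
    flags = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        flags.append(frame_rms(samples[start:start + frame_size]) > threshold)
    return flags


# Synthetic input: 20 ms of silence followed by 20 ms of a 440 Hz tone.
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
vad_flags = energy_vad(silence + tone)
print(vad_flags)
```

A trained VAD model replaces the threshold test with a classifier, which is what lets it handle the second job: telling speech apart from background sound.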
Cartesia’s latest Ink-2 STT model handles turn detection internally, which simplifies voice agent pipelines because you don’t need to evaluate, integrate, and maintain a separate VAD model just to detect when the user is speaking.

Turn Detection: This is often called endpointing, which unfortunately can be confused with API endpoints. So here we’ll just refer to it as Turn Detection.

Turn Detection is the system’s ability to accurately detect whether the caller has started, finished, or resumed speaking their turn. This is much trickier than you’d expect, because it’s often hard to tell whether a person is pausing, thinking, momentarily distracted or taking a long breath!

Turn Detection is critical for Voice Agent UX because a model that is too eager in assuming the human has finished their turn will produce annoying interruptions. On the other hand, if turn detection is sloppy and late, it adds precious milliseconds to latency, and that produces a very unnatural sounding conversation full of awkward gaps.

You can imagine how unreliable VAD, combined with over-eager turn-end detection, can make the user feel they’re speaking to an AI that interrupts and won’t let the user get a word in!

So, weak turn detection really destroys the flow of natural conversations and can also cause transcription errors that accumulate in, and damage, downstream actions in the Voice AI pipeline.

Historically, VAD models and turn detection systems have decided whether a user’s turn has ended using “proxy” signals such as pause duration. More recent cutting-edge models, like Cartesia’s Ink-2, have turn detection built into the model.
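The pause-duration proxy can be sketched in a few lines of Python. This is an illustration of the historical approach only, not of how Ink-2 or any other model detects turns internally. The function name, the 10 ms frame length and the 500 ms pause threshold are assumptions made for the example.

```python
def detect_turn_end(speech_flags, frame_ms=10, min_pause_ms=500):
    """Return the index of the frame where the turn is declared over, or None.

    speech_flags holds one boolean per audio frame (True = speech), for
    example the output of a VAD model. The turn ends once speech has been
    heard and is followed by at least min_pause_ms of continuous non-speech.
    """
    frames_needed = min_pause_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None


# A user speaks, pauses for 100 ms to think, speaks again, then stops.
flags = [False] * 5 + [True] * 20 + [False] * 10 + [True] * 20 + [False] * 60

patient = detect_turn_end(flags, min_pause_ms=500)  # waits out the short pause
eager = detect_turn_end(flags, min_pause_ms=100)    # fires during the pause
print(patient, eager)
```

The two calls show the trade-off described above: the short threshold ends the turn in the middle of the user's thinking pause (an interruption), while the long threshold adds half a second of waiting after the user has really finished.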
TTS (Text-to-Speech). This is the reverse of STT: the conversion of text into synthetic speech. TTS quality has many more subjective aspects than STT, which we will go into shortly.

TTS has been around for decades. Early systems sounded robotic, but these days good TTS models sound very natural! You can play with our cutting-edge TTS models (called Sonic) here for free!

Conversational Voice AI Features

Barge-in. This is commonly referred to as interruption handling, and it is extremely important in conversational voice AI.

Enterprise-grade Conversational AI is so much more than just connecting an STT model with a TTS model for a chatbot. Conversational Voice Agents must handle messy, unpredictable human conversations across widely varying environments, regions and audio quality.

Barge-in allows a user to interrupt the AI and have the AI stop speaking. As you’d imagine, this requires that the TTS side of the pipeline be told when the user has started or resumed speaking. That event must be detected and handled upstream in the pipeline by the STT or VAD model.

Interruption handling requires more than just turn-detection. It needs careful design of the models, and the voice agent harness that orchestrates the various models.

In the case of interruption handling, the STT model’s turn detection would emit an event that the user has started (or resumed!) speaking and the orchestration layer must immediately “mute” the TTS stream. This is non-trivial but with well engineered models and orchestration, it is very achievable.

At Cartesia, we use an event-driven architecture for orchestrating Voice Agents, so that each event can be observed by the system and responded to appropriately.
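As a rough illustration of the event-driven idea, the Python sketch below routes two events to handlers and stops playback when the user starts speaking. Everything in it is hypothetical: the class names, the event names and the fake player are invented for this example and are not Cartesia’s SDK or architecture.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTTSPlayer:
    """Stand-in for a TTS playback stream (hypothetical, not a real SDK class)."""
    speaking: bool = False
    log: list = field(default_factory=list)

    def speak(self, text):
        self.speaking = True
        self.log.append(f"speak:{text}")

    def stop(self):
        # Only record a stop if audio was actually playing.
        if self.speaking:
            self.speaking = False
            self.log.append("stopped")


class Orchestrator:
    """Routes pipeline events to handlers. Event names are illustrative."""

    def __init__(self, tts):
        self.tts = tts
        self.handlers = {
            "agent_reply_ready": self.on_agent_reply_ready,
            "user_speech_started": self.on_user_speech_started,
        }

    def dispatch(self, event, payload=None):
        handler = self.handlers.get(event)
        if handler is not None:
            handler(payload)

    def on_agent_reply_ready(self, text):
        self.tts.speak(text)

    def on_user_speech_started(self, _payload):
        # Barge-in: cut the agent's audio as soon as the user starts talking.
        self.tts.stop()


player = FakeTTSPlayer()
agent = Orchestrator(player)
agent.dispatch("agent_reply_ready", "Your order ships on Monday and")
agent.dispatch("user_speech_started")  # the user interrupts mid-sentence
print(player.log)
```

In a real agent the stop would also need to cancel any audio already buffered on the client and tell the LLM how much of the reply was actually heard. Both are left out of this sketch.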
Voice Isolation & Noise Reduction: Some models are dedicated to reducing ambient noise, or to isolating the active conversation from background conversations. This is important when a Voice Agent is intended for use in noisy environments like offices, cafes, airports, call centers, hospitals and other crowded places.

This is a difficult problem to solve, and models differ in their ability to handle it. Cartesia’s Ink-2 ASR model was built with specific attention to this aspect, because enterprise use cases for conversational voice agents often involve ambient noise and are far from studio environments.

Voice Characteristics

Voice Characteristics are important but can be highly subjective. If you and your best friend don’t agree on whether an actor has a nice voice, it’s because of the subjectivity involved in assessing its characteristics.

Here are some examples of good and not-so-good voice characteristics, based on tone, emotional feel and so on.

Desirable	Undesirable
Good prosody	Flat prosody
Natural emotional expression	Over-the-top emotional expression

As you’ll notice, undesirable voice characteristics are easier to spot than desirable ones. By paying attention to these attributes, we can tell when a voice has too much or too little of each. That helps us decide whether a voice is suitable for its intended purpose.

Prosody. This is best understood as a voice’s pitch (intonation), loudness and duration/timing/pacing: the “how” of speech. It is a mix of attributes and technical characteristics that are used to analyse a voice.

Loudness and timing are self-explanatory, but intonation is worth clarifying.

Intonation refers to how the voice rises and falls while speaking. Intonation is often how we know if something is a question or a statement, a continuation or a closure, emphasized or neutral. For example, a rising pitch at the end of a sentence generally signals a question.

Prosody enriches the literal words with meaning and impact. You can imagine how “Johnny bit the dog?” and “Johnny bit the dog!” would produce very different reactions at a school inquiry.

Emotional Expression. This is related to prosody because it is the effect of prosody. Prosody conveys emotion and meaning, and humans are very tuned to detecting and classifying emotion in voice. In Voice AI systems, the emotiveness of a voice must be congruent with and fit the intended use case.

Timbre. This is the tonal quality of the sound. In the context of voice it is often the adjectives that try to describe the sound. Radio presenters can be warm/rich or dark/velvety. Some singers can be nasal and others raspy. Timbre can give identity, emotional resonance, gravitas, trust and other qualities to a voice.

Speed & Pacing. It is easy to think that this is just about how many words per minute Eminem can rap, but speed, as a voice characteristic, has a second crucial aspect - pacing. Pacing refers to the overall speaking cadence, and the placement and duration of pauses. Good pacing makes the audio easy to follow, with pauses that align naturally with punctuation.

Vocal Fry. This creaky, low-frequency vibration occurs when air pulses slowly through the vocal folds. It is useful for realism, since humans naturally “fry” at the end of sentences.

But it presents a technical tightrope for AI. ASR models can sometimes interpret vocal fry as background noise or silence. On the other hand, when training TTS models, excessive fry acts as “noise” that produces glitchy models.

Ultimately, the goal is a “natural” balance: enough to avoid sounding robotic, but not so much that it garbles the voice or ruins intelligibility.

Localization, Dialects, and Pronunciation. Effective TTS needs to produce voices that sound “local” to be trusted. This requires more than just a general accent. It relies on normalization: the AI’s ability to correctly interpret abbreviations, dates, and currencies based on the region. For example, knowing whether “12/01” is 12 January or 1 December, or whether “Sr.” stands for “Senior” or “Señor.”

But normalization is the semantic layer. There’s also the phonetic layer: accent, pronunciation, and clarity. A model can normalize correctly and still sound wrong to a listener in Mumbai or Dublin. Mispronounced proper nouns, brand names, or technical terms break brand persona and can affect listener confidence.

A high-quality Voice Agent must get both right, what it says and how it sounds, for the specific audience it’s speaking to.
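The semantic layer can be illustrated with the “12/01” ambiguity mentioned earlier. The Python sketch below expands a short numeric date into words under an explicit day-first or month-first convention. The function name and the `day_first` flag are hypothetical, and a production normalizer would cover far more formats than this.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def expand_short_date(text, day_first):
    """Expand 'NN/NN' into words.

    day_first=True reads the text as DD/MM, day_first=False as MM/DD.
    The caller is assumed to know which convention the audience uses.
    """
    first, second = (int(part) for part in text.split("/"))
    day, month = (first, second) if day_first else (second, first)
    if not (1 <= month <= 12 and 1 <= day <= 31):
        raise ValueError(f"not a valid day/month pair: {text!r}")
    name = MONTHS[month - 1]
    return f"{day} {name}" if day_first else f"{name} {day}"


print(expand_short_date("12/01", day_first=True))   # DD/MM reading
print(expand_short_date("12/01", day_first=False))  # MM/DD reading
```

The same string produces two different spoken dates, which is why the region has to be known before the text ever reaches the speech model.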
Backchanneling. These are the small verbal signals humans use to show they’re following a conversation — “mm-hmm”, “right”, “got it”, “yeah.”

But they’re not conversational turns. They’re just acknowledgements and tiny cues.

In conversational AI, backchanneling works in two directions: the TTS model needs to produce these cues at the right moments to avoid sounding robotic and detached, and the ASR model needs to correctly interpret them as backchanneling and not as an interruption.

An agent that mishears “mm-hmm” as an interruption, or never gives any signal that it’s listening, feels broken — even if the system is working correctly.
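On the listening side, the decision can be sketched as a filter that sits in front of the barge-in logic. The Python below is a deliberately naive illustration: the phrase list and function names are assumptions, and a real system would likely also use timing, prosody and conversational context rather than a fixed word list.

```python
# Illustrative phrase list only. Real agents need a richer, per-language signal.
BACKCHANNELS = {"mm-hmm", "mhm", "uh-huh", "right", "got it", "yeah", "okay"}


def is_backchannel(transcript):
    """True if the whole utterance is just a short acknowledgement."""
    cleaned = transcript.lower().strip(" .,!?")
    return cleaned in BACKCHANNELS


def should_interrupt_agent(transcript):
    """Treat any non-empty utterance that is not a backchannel as a barge-in."""
    return bool(transcript.strip()) and not is_backchannel(transcript)


print(should_interrupt_agent("Mm-hmm."))
print(should_interrupt_agent("Wait, that's the wrong address"))
```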

Performance Metrics

Transcript Following: This measures the faithfulness of the TTS model’s speech to the input text. Even advanced models can “hallucinate”—adding words that aren’t there—or omit entire phrases through deletions. Accuracy also tracks whether the model slurs its generated speech or produces “weird” pronunciations that deviate from the intended script.

If a speech model lacks high transcript faithfulness, it can’t be used for applications like legal or medical reading, where semantic precision is critical.

Alphanumeric Handling. This tracks how reliably a model speaks “non-dictionary” data such as phone numbers, tracking IDs, dates and email addresses.

It builds on normalization. The model first needs to interpret the format correctly (is “12/01” in MM/DD or DD/MM format?), then render it accurately as speech. If the TTS model reads a phone number as one large integer rather than digit by digit, the experience is jarring, particularly in professional or high-stakes contexts like healthcare, finance, or logistics.
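The digit-by-digit rendering can be sketched as a small pre-processing step that runs before text reaches the TTS model. This Python example is only an illustration under assumptions: the function name is hypothetical, it handles ASCII digits only, and it inserts a comma wherever a separator appeared so that a speech model has a natural place to pause.

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def speak_digits(number_text):
    """Spell a number out digit by digit, keeping separator-based grouping."""
    groups = []
    current = []
    for ch in number_text:
        if ch in DIGIT_WORDS:
            current.append(DIGIT_WORDS[ch])
        elif current:
            # Any non-digit (dash, space, bracket) closes the current group.
            groups.append(" ".join(current))
            current = []
    if current:
        groups.append(" ".join(current))
    return ", ".join(groups)


print(speak_digits("415-555-0123"))
```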
RTF (Real-Time Factor): For STT, RTF measures how fast a model transcribes relative to the length of the input audio. It is the processing time divided by the duration of the input audio.

Example: If an AI takes 30 seconds to transcribe a 1-minute (60-second) recording, the RTF is 0.5, i.e. faster than real time. If RTF = 1, the processing time is equal to the length of the audio, i.e. exactly real time. If RTF > 1, the STT is laggy because it is slower than real time.

For TTS, the real-time factor is slightly different: it is the time taken to synthesize the speech divided by the duration of the speech generated.

STT/ASR	TTS
Input is audio. Output is text.	Input is text. Output is audio.
RTF = transcription time / input audio duration	RTF = speech synthesis time / duration of speech generated
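Both formulas in the table reduce to the same division, which the Python sketch below makes explicit. The function name is an assumption. The STT numbers come from the example above, and the TTS numbers are made up purely for illustration.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration.

    For STT, audio_seconds is the length of the input recording.
    For TTS, audio_seconds is the length of the speech that was generated.
    """
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# STT: 30 seconds to transcribe a 60-second recording.
stt_rtf = real_time_factor(30.0, 60.0)

# TTS: an illustrative 0.2 seconds to synthesize 2 seconds of speech.
tts_rtf = real_time_factor(0.2, 2.0)

print(stt_rtf, tts_rtf)
```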

TTFS (Time To Final Segment). For a Voice AI Agent to work in a live conversation, measuring the ASR model’s RTF is not granular enough.

What really matters is TTFS (Time to Final Segment). This is also known as TTCT (Time to Complete Transcript). This measures the gap between the moment a user finishes their turn and the moment the STT model outputs a finalized transcript — the complete, committed text ready for the LLM to act on.

This is distinct from another “TTFS” (Time to First Segment), which only measures when the first chunk of text arrives. TTCT measures when the whole transcript is done. For a conversational agent, the final segment is the meaningful hand-off point because nothing downstream can act until TTCT completes.
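As a sketch of how Time to Final Segment might be computed from logs, the Python below takes the timestamp at which the user’s turn ended and a list of transcript segments, and returns the delay until the first finalized segment. The tuple layout, the function name and the timestamps are assumptions for illustration. Real STT services expose this information in their own formats.

```python
def time_to_final_segment_ms(turn_end_ms, segments):
    """Return the delay from end of turn to the finalized transcript, in ms.

    segments is a list of (timestamp_ms, text, is_final) tuples in arrival
    order. Returns None if no finalized segment arrived after the turn ended.
    """
    for timestamp_ms, _text, is_final in segments:
        if is_final and timestamp_ms >= turn_end_ms:
            return timestamp_ms - turn_end_ms
    return None


# Made-up log: partial segments stream in while the user is still talking.
segments = [
    (1100, "I'd like", False),
    (2400, "I'd like to book", False),
    (3450, "I'd like to book a flight.", True),
]
ttfs_ms = time_to_final_segment_ms(turn_end_ms=3200, segments=segments)
print(ttfs_ms)
```

The partial segments arrive early but do not count here: only the finalized transcript is something the LLM can safely act on.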
TTFB (Time to First Byte). This measures the “reactivity” of a TTS model. It is the time between text input arriving at the TTS model and the model beginning to stream back the very first bytes of speech audio.

To feel like a natural conversation, a TTFB of under 300ms used to be the gold standard; anything longer creates that awkward “walkie-talkie” lag.

But at Cartesia, we’ve pushed it even further: Cartesia’s Sonic 3 models clock in at a remarkable 40ms to 90ms - faster than a human blink (approx. 100ms).

This makes real-time real.
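If you want to measure TTFB yourself, the idea is simply to time the gap between sending text and receiving the first audio chunk. The Python sketch below does that against a fake streaming generator. `fake_tts_stream` and `measure_ttfb` are hypothetical names, not a real client API, and a measurement taken on the client side also includes network time, so it will read higher than a figure measured at the model.

```python
import time


def measure_ttfb(stream_factory, clock=time.perf_counter):
    """Return (ttfb_seconds, first_chunk) for a stream of audio chunks.

    stream_factory is a zero-argument callable that starts the request and
    returns an iterable of chunks. clock is injectable to make testing easy.
    """
    start = clock()
    stream = stream_factory()
    first_chunk = next(iter(stream))
    return clock() - start, first_chunk


def fake_tts_stream():
    """Stand-in for a streaming TTS call: yields two small audio chunks."""
    time.sleep(0.05)  # simulated delay before the first bytes arrive
    yield b"\x00\x01"
    time.sleep(0.05)
    yield b"\x02\x03"


ttfb, chunk = measure_ttfb(fake_tts_stream)
print(f"TTFB: {ttfb * 1000:.0f} ms")
```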
MOS (Mean Opinion Score). MOS is the average score, on a 1-5 scale, given by human listeners. It exists because human perception of voice “quality” (what sounds natural, trustworthy, or appropriate) has a strongly subjective component. A voice that scores well for a meditation app may feel wrong for a medical triage agent. While MOS is traditionally collected from human listening panels, modern Voice AI workflows often use AI models to predict these scores as well.

MOS provides a structured framework to score that subjectivity and evaluate a voice model against the specific context and use-case expectations of your target audience. This makes MOS particularly relevant for conversational voice AI, where quality is highly context and use-case dependent.
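The arithmetic behind MOS is just an average of listener ratings, as the Python sketch below shows. The ratings are invented for illustration and do not describe any real voice. The function name is an assumption.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a list of listener ratings on the 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings)


# The same voice, rated by two made-up listener panels for different uses.
ratings_by_use_case = {
    "meditation app": [5, 4, 5, 4],
    "medical triage": [3, 2, 3, 4],
}
mos_by_use_case = {
    use_case: mean_opinion_score(ratings)
    for use_case, ratings in ratings_by_use_case.items()
}
print(mos_by_use_case)
```

Scoring per use case, rather than once overall, is what lets MOS capture the context dependence described above.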
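STT Word Error Rate (STT WER). WER measures the proportion of words in an STT transcript that were transcribed incorrectly: substitutions, deletions and insertions, relative to the number of words in the reference transcript. At times this literal precision is important.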

For conversational voice agents though, the STT WER may be a vanity metric.

A transcription no longer needs to be 100% accurate to be 100% effective because modern LLMs are incredibly good at “reading between the lines” - LLMs treat “gonna” and “going to” the same way.

The more useful metric is semantic WER. Semantic WER for speech-to-text measures how well the transcript conveys the user’s intent, rather than grading the AI on every “um,” “ah,” or misplaced comma.

Too much emphasis on literal WER can also add latency to the system. Semantic WER as a metric is more useful as it focuses on whether the AI actually understands and addresses the user’s intention.
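To see why literal WER can mislead, here is a minimal Python sketch of the standard word-level edit-distance calculation. It lowercases and splits on whitespace, which is a simplifying assumption. Real evaluations usually apply more careful text normalization first. Semantic WER has no single agreed formula, so it is not implemented here.

```python
def word_error_rate(reference, hypothesis):
    """Literal WER: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "I am going to call you tomorrow",
    "I am gonna call you tomorrow",
)
print(f"{wer:.3f}")
```

Here “going to” became “gonna”, which counts as one substitution and one deletion: two errors against seven reference words, even though the meaning is unchanged. Note that literal WER can exceed 1.0 when the hypothesis contains many inserted words.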
TTS Word Error Rate (TTS WER). For TTS models, WER measures how closely the generated speech matches its source text. This includes whether the model correctly articulated the given text as the corresponding spoken words, handled complex pronunciations well, and whether it made up (hallucinated) or missed words.

As you can see, building truly conversational AI has many more layers than connecting fast voice models to an LLM. Human conversation is messy, noisy and unpredictable - so engineering effective AI Voice Agents is a balancing act between engineering and human attributes, without any material loss of quality, accuracy and speed.

At Cartesia we believe that unlocking the power of AI Voice Agents starts with careful design and fine-grained control of the end-to-end conversational experience. This philosophy is baked into the way Cartesia’s State Space Models (SSMs) are designed, trained, built, evaluated and deployed. Each layer of the system needs to be pieced together intentionally, with plenty of thought given to tradeoffs, commercial outcomes, use case goals, and engineering excellence.

We are here to help you find your voice - one that is instinctive, reliable, and human. Just reach out at business@cartesia.ai.