What is time-to-first-audio (TTFA) and why does it matter for voice agents?

TTFA is the delay between when a user stops speaking and when the agent begins responding with audio. It is the primary metric for voice agent usability. Under one second feels acceptable; under 500ms feels natural; above two seconds, users lose confidence and the experience breaks down. Streaming at every pipeline stage — ASR, LLM, and TTS — is how you minimise it.

Can I build a voice AI agent that works offline or locally?

Yes, with trade-offs. Whisper runs locally for ASR. Small open models (7B–14B) run locally for the LLM layer. Local TTS models exist but lag cloud options in naturalness. The main constraint is hardware: a decent local voice stack needs at least 16 GB of RAM and a capable GPU or Apple Silicon chip. On consumer hardware, expect higher latency than you would get from a cloud pipeline.

What is voice activity detection (VAD) and do I really need it?

VAD detects when a user is speaking versus when they are pausing. You need it. Without VAD, your agent either cuts users off too early or waits awkwardly too long before responding. Silero VAD is a small, accurate open-source option that runs in real time. It is one of the most undervalued components in a voice pipeline.

How do I handle users who speak more than one language in the same sentence?

Code-switching — mixing languages mid-sentence — is common in Indian speech. The practical approach is: use a multilingual ASR model, add language detection at the transcription stage, build your LLM system prompt to handle mixed-language input gracefully, and where possible route to language-specific TTS engines. It is not fully solved by any off-the-shelf system today.

What is barge-in and how do I implement it?

Barge-in means the user interrupts the agent while it is speaking. To support it, you monitor the audio input stream continuously — even during TTS playback — using VAD. When speech is detected while the agent is talking, you abort the current TTS output, discard it, and restart the pipeline from ASR. It requires careful state management but is essential for any agent used in real conversation.

When should I not use a voice AI agent?

A voice agent is the wrong choice when the task involves reviewing, scanning, copying, or sharing information; when the environment is noisy and you cannot control the audio front end; and when the conversation is long and complex. Voice is excellent for short, transactional, high-frequency tasks, and for almost anything where the user's hands are busy.

Voice AI Agents in 2026: What It Takes

A voice AI agent is a pipeline, not a product. Speech-to-text captures the words. An LLM decides what to say. Text-to-speech turns that into audio. And a layer of turn-taking logic decides when the human is done talking and when the agent should speak. Get the pipeline right and the agent feels natural. Get it wrong — mostly by letting latency creep in — and it feels broken, no matter how clever the LLM.

I have built voice agents across three languages, including MIRA, a multilingual router that handles Gujarati, Hindi, and English mid-sentence. Here is what I have learned.

The Voice Stack, Component by Component

A production voice agent has five layers. Most tutorials show you two.

ASR — Automatic Speech Recognition

This converts audio to text. Whisper (from OpenAI, open-weights) is the most widely deployed option with a mature local ecosystem. NVIDIA Parakeet is a newer alternative for fast on-device dictation. In 2026, streaming ASR is table stakes — you need partial transcripts arriving in real time, not a transcript that appears three seconds after the user finishes speaking.

The LLM Brain

This is where reasoning, memory, and personality live. In a voice agent the LLM prompt needs to be structured differently than in a chat interface. Responses must be short — ideally under 40 words for the first utterance — because the listener cannot skim. Bullet points are useless. You are writing for ears, not eyes.
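One way to enforce "ears, not eyes" is to state it in the system prompt and then clean up whatever the model returns anyway. The following is a minimal sketch, not a definitive implementation: the prompt wording and the names `VOICE_SYSTEM_PROMPT` and `to_speakable` are my own illustrations, and the 40-word cap is simply the guideline from the paragraph above.

```python
import re

# Hypothetical system prompt; the wording is an assumption, tune it for your agent.
VOICE_SYSTEM_PROMPT = (
    "You are a voice assistant. Your reply will be spoken aloud. "
    "Answer in one or two short sentences, under 40 words. "
    "Do not use bullet points, numbered lists, markdown, or emoji."
)

MAX_FIRST_UTTERANCE_WORDS = 40  # the guideline from the text, not a hard rule

_LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def to_speakable(text: str, max_words: int = MAX_FIRST_UTTERANCE_WORDS) -> str:
    """Flatten list/markdown formatting and cap length at a sentence boundary."""
    # Drop leading bullet or number markers from each line, then join into one line.
    lines = [_LIST_MARKER.sub("", line).strip() for line in text.splitlines()]
    flat = " ".join(line for line in lines if line)
    # Remove markdown emphasis, code ticks, and heading characters.
    flat = re.sub(r"[*_`#]+", "", flat)
    flat = re.sub(r"\s+", " ", flat).strip()

    # Keep whole sentences while they fit inside the word budget.
    kept, count = [], 0
    for sentence in re.split(r"(?<=[.!?])\s+", flat):
        n = len(sentence.split())
        if kept and count + n > max_words:
            break
        kept.append(sentence)
        count += n

    words = " ".join(kept).split()
    # A single over-long sentence still gets a hard cut.
    return " ".join(words[:max_words])
```

The post-processing step is a safety net. The prompt does most of the work, but models still slip into list formatting now and then, and a TTS engine reading out stray list markers is worse than a slightly clipped answer.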

TTS — Text-to-Speech

This converts the LLM output into audio the user hears. Streaming TTS, where audio playback starts before the full sentence is generated, is critical. Cartesia's Sonic is one option with sub-150ms latency via WebSocket streaming. ElevenLabs and others have competitive offerings. If you are building offline or privacy-first, local TTS models have improved significantly but still lag cloud options on naturalness.

VAD — Voice Activity Detection

VAD decides when the user has stopped speaking. This is underappreciated. Too aggressive and you cut people off mid-thought. Too conservative and the agent waits awkwardly while the user expects a reply. Silero VAD is a widely used, small, accurate open-source model for this purpose.
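The aggressive-versus-conservative trade-off usually comes down to one number: how much silence you wait for before declaring the turn over. Here is a rough sketch of that endpointing logic. It takes per-frame speech probabilities as plain floats, which in practice would come from a VAD model such as Silero VAD; I pass them in directly so the sketch does not assume any particular model API. The class name `Endpointer` and all default values are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Endpointer:
    """Turns per-frame speech probabilities into an end-of-turn decision."""

    frame_ms: int = 32             # assumed duration of one audio frame
    speech_threshold: float = 0.5  # assumed probability cut-off for "speech"
    min_silence_ms: int = 500      # the key knob: lower means more aggressive
    min_speech_ms: int = 100       # ignore blips shorter than this

    def __post_init__(self) -> None:
        self._speech_ms = 0
        self._silence_ms = 0

    def update(self, speech_prob: float) -> bool:
        """Feed one frame; return True on the frame where the turn ends."""
        if speech_prob >= self.speech_threshold:
            self._speech_ms += self.frame_ms
            self._silence_ms = 0
            return False
        if self._speech_ms < self.min_speech_ms:
            # Not enough speech yet to count as a turn; discard the blip.
            self._speech_ms = 0
            return False
        self._silence_ms += self.frame_ms
        if self._silence_ms >= self.min_silence_ms:
            # Enough trailing silence: end the turn and reset for the next one.
            self._speech_ms = 0
            self._silence_ms = 0
            return True
        return False


def turn_end_frames(probs: list[float], min_silence_ms: int) -> list[int]:
    """Return the frame indices at which a turn end would be declared."""
    endpointer = Endpointer(min_silence_ms=min_silence_ms)
    return [i for i, p in enumerate(probs) if endpointer.update(p)]
```

Feed it an utterance with a short mid-thought pause (say 10 speech frames, 8 silent frames, 10 more speech frames, then silence) and a 200 ms setting ends the turn inside the pause, while a 500 ms setting waits for the real end. That is the cut-off versus awkward-wait trade-off in one parameter.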

Barge-In Handling

This is the user interrupting the agent mid-sentence. Most simple pipelines ignore it entirely. Real conversations require it. Architecturally, this means monitoring the audio stream during TTS playback and being ready to abort and re-process at any time.

Why Latency Is the Whole Game

Time-to-first-audio (TTFA) — the gap between the user finishing and the agent beginning to speak — is the single number that determines whether a voice agent feels usable. The target is under one second. Under 500ms feels good. Above two seconds, users lose confidence in the system.
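If TTFA is the number that matters, measure it on every turn. A small sketch of how that could look: start a timer when the VAD declares end of turn, stop it when the first audio chunk goes to the speaker, and bucket the result against the thresholds above. The class and label names are my own, and the label for the one-to-two-second band is an assumption, since the text does not name it.

```python
import time
from typing import Callable, Optional


def classify_ttfa(ttfa_ms: float) -> str:
    """Map a TTFA measurement onto the bands described above."""
    if ttfa_ms < 500:
        return "good"
    if ttfa_ms < 1000:
        return "on target"
    if ttfa_ms <= 2000:
        return "slow"  # 1-2 s is not named in the text; this label is an assumption
    return "losing users"


class TtfaTimer:
    """Measures the gap between end of user speech and first agent audio."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._user_stopped_at: Optional[float] = None

    def user_stopped(self) -> None:
        # Call when the VAD declares the user's turn over.
        self._user_stopped_at = self._clock()

    def first_audio(self) -> float:
        # Call when the first TTS audio chunk is handed to the speaker.
        if self._user_stopped_at is None:
            raise RuntimeError("user_stopped() was never called")
        return (self._clock() - self._user_stopped_at) * 1000.0
```

The clock is injectable so the timer can be tested without sleeping. In production the default `time.monotonic` is the right choice because it never jumps backwards.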

A cascading pipeline (STT → LLM → TTS run sequentially) is simpler to build and debug. Most production voice agents use it. But each stage adds latency and those latencies stack. Streaming at every stage is how you claw that time back.

Concretely:

Stream the STT output and feed interim transcripts to the LLM, so it can begin processing before the user finishes speaking.
Begin TTS before the full LLM response is complete — pipe tokens through as they arrive.
Run VAD on a separate thread so it is never blocking the processing chain.
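The second step, piping LLM tokens into TTS as they arrive, can be sketched as follows. This is a toy under stated assumptions: `fake_llm_tokens` stands in for a streaming LLM client, and the chunker splits on sentence-ending punctuation, which is one simple policy among several. A real pipeline would send each chunk to a streaming TTS connection instead of appending it to a list.

```python
import asyncio
from typing import AsyncIterator


async def fake_llm_tokens() -> AsyncIterator[str]:
    """Stand-in for a streaming LLM response (hypothetical token stream)."""
    for token in ["Your ", "next ", "appointment ", "is ", "at ", "3pm. ",
                  "Shall ", "I ", "remind ", "you?"]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token


async def speakable_chunks(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group tokens into sentence-sized chunks so TTS can start on the first one."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


async def run_pipeline() -> list[str]:
    spoken: list[str] = []
    async for chunk in speakable_chunks(fake_llm_tokens()):
        # A real pipeline would hand this chunk to a streaming TTS here,
        # while the LLM is still generating the next sentence.
        spoken.append(chunk)
    return spoken
```

Chunking at sentence boundaries is a compromise. Smaller chunks start audio sooner, but most TTS engines need some lookahead to get prosody right, so sending single tokens tends to sound choppy.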

The LLM choice matters here too. A slower, more intelligent model is often the wrong trade-off for voice. A fast 7B model that starts responding in 80ms is usually more valuable than a 70B model that takes 600ms, because in a short spoken exchange the user is far less likely to notice the reasoning gap than the pause.
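A back-of-the-envelope budget makes the point concrete. The 80ms and 600ms figures are the ones from the paragraph above, read here as time to first token, which is my interpretation. Everything else is an assumed placeholder: 300ms of end-of-turn silence, 150ms to first TTS audio, and 100ms of network overhead. Swap in your own measurements.

```python
def ttfa_budget_ms(
    endpoint_ms: int,
    llm_first_token_ms: int,
    tts_first_audio_ms: int,
    network_ms: int = 0,
) -> int:
    """Sum the stages that sit between end of speech and first audio."""
    return endpoint_ms + llm_first_token_ms + tts_first_audio_ms + network_ms


# Assumed placeholder values, not measurements.
ASSUMED_ENDPOINT_MS = 300   # silence the VAD waits for before ending the turn
ASSUMED_TTS_MS = 150        # time to first audio from a streaming TTS
ASSUMED_NETWORK_MS = 100    # round trips between services

fast_7b = ttfa_budget_ms(ASSUMED_ENDPOINT_MS, 80, ASSUMED_TTS_MS, ASSUMED_NETWORK_MS)
slow_70b = ttfa_budget_ms(ASSUMED_ENDPOINT_MS, 600, ASSUMED_TTS_MS, ASSUMED_NETWORK_MS)

print(f"7B model:  {fast_7b} ms")   # 630 ms, inside the one-second target
print(f"70B model: {slow_70b} ms")  # 1150 ms, outside it
```

Under these assumptions the model alone decides which side of the one-second line you land on. The other stages are fixed costs you pay either way.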

The Multilingual and Code-Switching Problem

Building MIRA taught me that multilingual voice AI is its own discipline, not just a checkbox on the model card.

Code-switching is when a speaker moves between languages mid-sentence. "Aaj mara call mate can we reschedule?" is a single utterance mixing Gujarati, Hindi, and English. Most ASR models handle this poorly. They are trained on clean, monolingual audio. Real Indian speech — in Gujarati, in Hindi, in Tamil — does not sound like a dictation exercise.

The approaches that actually help:

At ASR: You need a model trained on code-switched or multilingual data, or you need to run language-detection in parallel and route accordingly. Neither is perfect.

At LLM: The model needs to understand the intent even if the transcription contains mixed-script oddities. A system prompt that instructs the model to handle mixed-language input gracefully is mandatory.

At TTS: Most TTS models handle one language well and murder the others. If your users switch languages, you may need to switch TTS engines dynamically, or accept lower naturalness in secondary languages.
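To illustrate the routing idea at the TTS stage, here is a deliberately simple sketch that splits text into runs by Unicode script and maps each run to an engine. The engine names are hypothetical, the mapping from script to language is an assumption (Devanagari is treated as Hindi, Latin as English), and this is not how MIRA does it. It also has an obvious blind spot: a romanized transcript is all Latin script, so it routes as a single English segment.

```python
# Hypothetical engine names; substitute whatever TTS voices you actually deploy.
TTS_ENGINE_FOR_SCRIPT = {
    "gujarati": "tts-gujarati",
    "devanagari": "tts-hindi",   # assumes Devanagari text is Hindi
    "latin": "tts-english",      # assumes Latin text is English
}


def script_of(ch: str) -> str:
    """Classify one character by Unicode block."""
    cp = ord(ch)
    if 0x0A80 <= cp <= 0x0AFF:
        return "gujarati"
    if 0x0900 <= cp <= 0x097F:
        return "devanagari"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "neutral"  # digits, spaces, punctuation


def split_by_script(text: str) -> list[tuple[str, str]]:
    """Split text into (script, segment) runs; neutral characters join the current run."""
    runs: list[tuple[str, str]] = []
    current = None
    buffer = ""
    for ch in text:
        script = script_of(ch)
        if script != "neutral" and current is not None and script != current:
            runs.append((current, buffer.strip()))
            buffer = ""
        if script != "neutral":
            current = script
        buffer += ch
    if buffer.strip():
        runs.append((current or "latin", buffer.strip()))
    return runs


def route_to_tts(text: str) -> list[tuple[str, str]]:
    """Pair each segment with the engine that should speak it."""
    return [(TTS_ENGINE_FOR_SCRIPT[script], segment)
            for script, segment in split_by_script(text)]
```

Run it on a made-up mixed-script string such as "આજે call reschedule કરો" and you get three segments, alternating between the Gujarati and English engines. Run it on a romanized utterance and you get one English segment, which is exactly the failure mode that makes this problem hard.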

The honest answer: code-switching is an unsolved problem. You can mitigate it with routing, redundancy, and robust system prompts. You cannot fully solve it today with off-the-shelf components.

Where Voice Agents Actually Win

Voice agents work well in narrow, high-frequency tasks. Specific domains, specific users, specific contexts.

They work well when:

The user's hands are occupied (driving, cooking, working on equipment).
The interaction is short and transactional — "What is my next appointment?" not "Help me plan my quarter."
Accessibility is the point. For our work with SmartON (17,000+ users), voice is not a convenience feature — it is the primary interface for blind users. The bar for getting it right is different.
The domain vocabulary is bounded. A voice agent for GST queries in Gujarati can be tuned to the specific terminology. A general-purpose voice assistant faces an open vocabulary problem.

They fail when:

The task requires scanning or reviewing content. Audio cannot be skimmed.
The user needs to reference, copy, or share the output.
The environment is noisy without a noise-cancelling front end.
The conversation is long. Listeners can hold only a little in working memory when everything arrives as audio.

Design Principles for Voice-First UX

Most voice UX fails because designers apply chat or form conventions to audio. They are different mediums.

Confirm, do not repeat. A voice agent should confirm what it heard and what it is doing — briefly. "Got it, booking for 3pm" is good. Re-reading the full request back is annoying.

Speak in the user's rhythm. Match the pace and register of your target user. A voice agent for truck drivers in rural Gujarat should not sound like a corporate IVR.

Plan for failure early. Misrecognition is frequent. Design the failure recovery flow — "I didn't catch that, can you say it again?" — as carefully as the happy path.
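A recovery flow can be as small as a three-way decision: proceed, re-prompt, or fall back. This sketch assumes your ASR exposes some confidence score, which not every engine does. The threshold, the retry limit, the second re-prompt wording, and the idea of a fallback channel are all illustrative assumptions.

```python
from dataclasses import dataclass

REPROMPTS = [
    "I didn't catch that, can you say it again?",
    "Sorry, one more time, a little slower?",  # illustrative wording
]


@dataclass
class RecoveryPolicy:
    min_confidence: float = 0.6  # assumed threshold; tune on real traffic
    max_reprompts: int = 2       # after this many tries, stop asking

    def next_action(self, transcript: str, confidence: float, attempts: int) -> str:
        """Decide what to do with one recognition result."""
        if transcript.strip() and confidence >= self.min_confidence:
            return "proceed"
        if attempts < self.max_reprompts:
            return "reprompt"
        # Asking a third time rarely helps; hand off to a human or another channel.
        return "fallback"

    def reprompt_text(self, attempts: int) -> str:
        """Vary the wording so repeated failures do not sound like a loop."""
        return REPROMPTS[min(attempts, len(REPROMPTS) - 1)]
```

Capping the number of re-prompts matters more than the exact threshold. An agent that asks the same question a third time has already lost the user.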

Accessible by default. The lessons from building for blind users apply broadly: clear feedback, minimal latency, predictable turn-taking, and explicit state announcements. These make the experience better for everyone.

Keep the first response short. The first thing the agent says after a user speaks sets the tone. Long, verbose openings train users to stop listening. Short, direct, relevant responses build trust.

If you are building on MIRA-style routing or want to understand how we approached multilingual voice design, the MIRA deep dive has the architecture details.

---

Voice AI is one of the few places in software where the engineering and the human experience are inseparable. A 200ms latency improvement is not a performance metric. It is the difference between a conversation that feels alive and one that feels like a phone tree.

---

Related Posts

MIRA Deep Dive: Building a Multilingual AI Router

Voice-First UX: Designing AI for Blind Users