    Voice AI Agents in 2026: What It Takes to Build One That Works

    by Deep Parmar

    CTO, Sunbots & Xwits

    A voice AI agent is a pipeline, not a product. Speech-to-text captures the words. An LLM decides what to say. Text-to-speech turns that into audio. And a layer of turn-taking logic decides when the human is done talking and when the agent should speak. Get the pipeline right and the agent feels natural. Get it wrong — mostly by letting latency creep in — and it feels broken, no matter how clever the LLM.

    I have built voice agents across three languages, including MIRA, a multilingual router that handles speakers switching between Gujarati, Hindi, and English mid-sentence. Here is what I have learned.

    The Voice Stack, Component by Component

    A production voice agent has five layers. Most tutorials show you two.

    ASR — Automatic Speech Recognition

    This converts audio to text. Whisper (from OpenAI, open-weights) is the most widely deployed option with a mature local ecosystem. NVIDIA Parakeet is a newer alternative for fast on-device dictation. In 2026, streaming ASR is table stakes — you need partial transcripts arriving in real time, not a transcript that appears three seconds after the user finishes speaking.

    The LLM Brain

    This is where reasoning, memory, and personality live. In a voice agent the LLM prompt needs to be structured differently than in a chat interface. Responses must be short — ideally under 40 words for the first utterance — because the listener cannot skim. Bullet points are useless. You are writing for ears, not eyes.
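
    One way to enforce "writing for ears" is to pair a voice-specific system prompt with a small post-filter on the model output. The sketch below is illustrative only: the prompt wording, the `prepare_for_speech` helper, and the 40-word default are assumptions drawn from the guideline above, not a tested production prompt.

```python
import re

# Hypothetical system prompt; the wording is an illustration, not a tuned prompt.
VOICE_SYSTEM_PROMPT = (
    "You are a voice assistant. Reply in one or two short spoken sentences. "
    "Never use lists, headings, or markdown. Keep the first reply under 40 words."
)

def prepare_for_speech(text: str, max_words: int = 40) -> str:
    """Strip list and markdown markers, then keep whole sentences within max_words."""
    # Drop bullet or numbered-list markers at the start of each line.
    text = re.sub(r"(?m)^\s*(?:[-*•]|\d+\.)\s+", "", text)
    # Drop leftover emphasis, heading, and code markers.
    text = re.sub(r"[*_#`]", "", text)
    text = " ".join(text.split())
    kept, count = [], 0
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        n = len(sentence.split())
        if kept and count + n > max_words:
            break
        kept.append(sentence)
        count += n
    # If even the first sentence is too long, hard-truncate it.
    return " ".join(" ".join(kept).split()[:max_words])

print(prepare_for_speech("- **Your** appointment is at 3pm.\n- Anything else?"))
# prints: Your appointment is at 3pm. Anything else?
```

    The filter is a safety net, not a substitute for the prompt: truncating a reply the model wrote for the eye rarely produces something that sounds natural.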

    TTS — Text-to-Speech

    This converts the LLM output into audio the user hears. Streaming TTS, where audio playback starts before the full sentence is generated, is critical. Cartesia's Sonic is one option with sub-150ms latency via WebSocket streaming. ElevenLabs and others have competitive offerings. If you are building offline or privacy-first, local TTS models have improved significantly but still lag cloud options on naturalness.

    VAD — Voice Activity Detection

    VAD decides when the user has stopped speaking. This is underappreciated. Too aggressive and you cut people off mid-thought. Too conservative and the agent waits awkwardly while the user expects a reply. Silero VAD is a widely used, small, accurate open-source model for this purpose.
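
    The aggressive-versus-conservative trade-off usually comes down to how much trailing silence you require before declaring the turn over. Here is a minimal sketch of that endpointing logic. It assumes an upstream VAD model (Silero VAD, for example) already yields one speech probability per fixed-size audio frame; the `Endpointer` class and all thresholds are hypothetical values to tune, not recommendations.

```python
from dataclasses import dataclass, field

@dataclass
class Endpointer:
    """Signals end-of-turn after enough trailing silence following real speech."""
    frame_ms: int = 32             # duration of one audio frame (assumed)
    speech_threshold: float = 0.5  # probability at or above this counts as speech
    min_speech_ms: int = 200       # ignore blips shorter than this
    silence_ms: int = 600          # trailing silence that ends the turn
    _speech: int = field(default=0, init=False, repr=False)
    _silence: int = field(default=0, init=False, repr=False)

    def update(self, speech_prob: float) -> bool:
        """Feed one frame; return True on the frame where the turn ends."""
        if speech_prob >= self.speech_threshold:
            self._speech += self.frame_ms
            self._silence = 0      # a pause mid-thought is forgiven
            return False
        if self._speech < self.min_speech_ms:
            self._speech = 0       # too short to count as an utterance
            return False
        self._silence += self.frame_ms
        if self._silence >= self.silence_ms:
            self._speech = 0
            self._silence = 0
            return True
        return False

ep = Endpointer(frame_ms=100, min_speech_ms=200, silence_ms=300)
probs = [0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.1]
print([ep.update(p) for p in probs])
# prints: [False, False, False, False, False, False, True]
```

    Lowering `silence_ms` makes the agent snappier but more likely to cut people off; raising it does the opposite. That single number is worth tuning per language and per user group.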

    Barge-In Handling

    This is the user interrupting the agent mid-sentence. Most simple pipelines ignore it entirely. Real conversations require it. Architecturally, this means monitoring the audio stream during TTS playback and being ready to abort and re-process at any time.
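
    In an async pipeline, "ready to abort at any time" can be sketched as racing the playback task against a "user started speaking" event and cancelling whichever loses. The code below is a simplified illustration using only `asyncio`: `speak` is a stand-in for real TTS playback, and `user_spoke` is assumed to be set by your VAD. Both names are hypothetical.

```python
import asyncio

async def speak(chunks, played):
    """Stand-in for TTS playback: 'plays' one chunk every 10 ms."""
    for chunk in chunks:
        await asyncio.sleep(0.01)
        played.append(chunk)

async def run_turn(chunks, user_spoke: asyncio.Event):
    """Play the reply, but abort as soon as the user starts speaking."""
    played = []
    playback = asyncio.create_task(speak(chunks, played))
    barge_in = asyncio.create_task(user_spoke.wait())
    done, _ = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = barge_in in done and playback not in done
    # Cancel whichever task is still running, then let the cancellation settle.
    for task in (playback, barge_in):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, barge_in, return_exceptions=True)
    return played, interrupted

async def demo():
    user_spoke = asyncio.Event()
    # Simulate the VAD detecting speech 25 ms into playback.
    asyncio.get_running_loop().call_later(0.025, user_spoke.set)
    reply = ["Your", "next", "appointment", "is", "at", "three"]
    return await run_turn(reply, user_spoke)

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

    The `played` list matters in a real system: it tells the LLM how much of its reply the user actually heard before interrupting, so the next turn can be re-processed with accurate context.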

    Why Latency Is the Whole Game

    Time-to-first-audio (TTFA) — the gap between the user finishing and the agent beginning to speak — is the single number that determines whether a voice agent feels usable. The target is under one second. Under 500ms feels good. Above two seconds, users lose confidence in the system.

    A cascading pipeline (STT → LLM → TTS, run sequentially, where STT is the ASR layer described above) is simpler to build and debug. Most production voice agents use it. But each stage adds latency, and those latencies stack. Streaming at every stage is how you claw that time back.

    Concretely:

    • Start streaming STT so the LLM can begin processing before the user finishes speaking (using interim transcripts).
    • Begin TTS before the full LLM response is complete — pipe tokens through as they arrive.
    • Run VAD on a separate thread so it never blocks the processing chain.
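
    The second point, piping tokens through as they arrive, usually needs a small buffer in between, because most TTS engines sound better when fed clause-sized text rather than single tokens. A minimal sketch of that buffer follows. The `speakable_chunks` name and the `min_chars` cutoff are assumptions, and the punctuation rule is deliberately naive (it would split "3.5" if the tokenizer emits "3." as its own token).

```python
from typing import Iterable, Iterator

BREAKS = (".", "!", "?", ",", ";", ":")

def speakable_chunks(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group streamed LLM tokens into clause-sized chunks for a streaming TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        text = buffer.strip()
        # Flush at punctuation once there is enough text to sound natural.
        if len(text) >= min_chars and text.endswith(BREAKS):
            yield text
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends

stream = ["Got", " it", ",", " booking", " for", " 3pm", ".", " Anything", " else", "?"]
print(list(speakable_chunks(stream)))
# prints: ['Got it, booking for 3pm.', 'Anything else?']
```

    Because this is a generator, the first chunk can be handed to TTS while the LLM is still producing the rest of the reply, which is where the latency saving comes from.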

    The LLM choice matters here too. A slower, more intelligent model is often the wrong trade-off for voice. A fast 7B model that starts responding in 80ms is more valuable than a 70B model that takes 600ms to do the same, because the user will not notice the reasoning gap but they will notice the pause.
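
    To see how the stages stack, it helps to write the budget down. The sketch below sums per-stage latencies and rates the result against the thresholds from this section. The 80ms and 600ms LLM figures and the 150ms TTS figure come from the text above; the endpointing, ASR, and network numbers are illustrative assumptions, not measurements.

```python
from typing import Mapping

def ttfa_ms(stages: Mapping[str, int]) -> int:
    """Time-to-first-audio for a sequential pipeline: the stage latencies add up."""
    return sum(stages.values())

def rating(ms: int) -> str:
    """Map a TTFA value onto the thresholds described in this section."""
    if ms < 500:
        return "feels good"
    if ms < 1000:
        return "on target"
    if ms <= 2000:
        return "over target"
    return "users lose confidence"

budget = {
    "endpointing_silence": 300,  # assumed: silence required before end-of-turn
    "asr_finalize": 100,         # assumed: time to finalize the transcript
    "llm_first_token": 80,       # fast 7B model, figure from the text
    "tts_first_audio": 150,      # streaming TTS, figure from the text
    "network": 100,              # assumed: round trips to hosted services
}

fast = ttfa_ms(budget)
slow = ttfa_ms({**budget, "llm_first_token": 600})  # swap in the 70B figure
print(fast, rating(fast))  # prints: 730 on target
print(slow, rating(slow))  # prints: 1250 over target
```

    Under these assumed numbers, the model swap alone moves the agent from inside the one-second target to outside it, with every other stage unchanged.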

    The Multilingual and Code-Switching Problem

    Building MIRA taught me that multilingual voice AI is its own discipline, not just a checkbox on the model card.

    Code-switching is when a speaker moves between languages mid-sentence. "Aaj mara call mate can we reschedule?" is a single utterance mixing Gujarati, Hindi, and English. Most ASR models handle this poorly. They are trained on clean, monolingual audio. Real Indian speech — in Gujarati, in Hindi, in Tamil — does not sound like a dictation exercise.

    The approaches that actually help:

    At ASR: You need a model trained on code-switched or multilingual data, or you need to run language-detection in parallel and route accordingly. Neither is perfect.

    At LLM: The model needs to understand the intent even if the transcription contains mixed-script oddities. A system prompt that instructs the model to handle mixed-language input gracefully is mandatory.

    At TTS: Most TTS models handle one language well and murder the others. If your users switch languages, you may need to switch TTS engines dynamically, or accept lower naturalness in secondary languages.
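
    Switching TTS engines dynamically requires splitting the reply into per-language segments first. One rough way to do that, sketched below, is to route by Unicode script: Gujarati (U+0A80 to U+0AFF), Devanagari (U+0900 to U+097F), and ASCII letters. This is a hypothetical illustration, not how MIRA routes. It treats all Devanagari as Hindi, and it cannot see romanized Gujarati or Hindi at all, which is exactly the kind of input that makes code-switching hard.

```python
from typing import List, Optional, Tuple

def script_of(word: str) -> Optional[str]:
    """Guess a language tag from the first recognizable letter in the word."""
    for ch in word:
        code = ord(ch)
        if 0x0A80 <= code <= 0x0AFF:
            return "gu"  # Gujarati block
        if 0x0900 <= code <= 0x097F:
            return "hi"  # Devanagari block, assumed to be Hindi here
        if ch.isascii() and ch.isalpha():
            return "en"
    return None          # digits or punctuation only

def route_segments(text: str, default: str = "en") -> List[Tuple[str, str]]:
    """Split text into consecutive (language, segment) runs for per-language TTS."""
    runs: List[Tuple[str, List[str]]] = []
    for word in text.split():
        # Words without letters inherit the language of the run they sit in.
        lang = script_of(word) or (runs[-1][0] if runs else default)
        if runs and runs[-1][0] == lang:
            runs[-1][1].append(word)
        else:
            runs.append((lang, [word]))
    return [(lang, " ".join(words)) for lang, words in runs]

print(route_segments("આજે મારો call છે, can we reschedule?"))
# prints: [('gu', 'આજે મારો'), ('en', 'call'), ('gu', 'છે,'), ('en', 'can we reschedule?')]
```

    Each segment would then go to the engine that handles its language best. Whether the audible seams between engines are better or worse than one engine mispronouncing the secondary language is something you have to test with real users.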

    The honest answer: code-switching is an unsolved problem. You can mitigate it with routing, redundancy, and robust system prompts. You cannot fully solve it today with off-the-shelf components.

    Where Voice Agents Actually Win

    Voice agents work well in narrow, high-frequency tasks. Specific domains, specific users, specific contexts.

    They work well when:

    • The user's hands are occupied (driving, cooking, working on equipment).
    • The interaction is short and transactional — "What is my next appointment?" not "Help me plan my quarter."
    • Accessibility is the point. For our work with SmartON (17,000+ users), voice is not a convenience feature — it is the primary interface for blind users. The bar for getting it right is different.
    • The domain vocabulary is bounded. A voice agent for GST queries in Gujarati can be tuned to the specific terminology. A general-purpose voice assistant faces an open vocabulary problem.

    They fail when:

    • The task requires scanning or reviewing content. Audio cannot be skimmed.
    • The user needs to reference, copy, or share the output.
    • The environment is noisy without a noise-cancelling front end.
    • The conversation is long. Listeners can hold only a little in working memory during an audio-only conversation.

    Design Principles for Voice-First UX

    Most voice UX fails because designers apply chat or form conventions to audio. They are different mediums.

    Confirm, do not repeat. A voice agent should confirm what it heard and what it is doing — briefly. "Got it, booking for 3pm" is good. Re-reading the full request back is annoying.

    Speak in the user's rhythm. Match the pace and register of your target user. A voice agent for truck drivers in rural Gujarat should not sound like a corporate IVR.

    Plan for failure early. Misrecognition is frequent. Design the failure recovery flow — "I didn't catch that, can you say it again?" — as carefully as the happy path.
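
    A recovery flow can be as simple as a bounded retry loop with escalating reprompts and an explicit exit. The sketch below assumes a `recognize` callable that returns a transcript and a confidence score, and a `say` callable that speaks a string. Both are hypothetical stand-ins for your ASR and TTS layers, and the 0.6 threshold, the second reprompt, and the fallback line are placeholders.

```python
from typing import Callable, Optional, Tuple

REPROMPTS = [
    "I didn't catch that, can you say it again?",
    "Sorry, still not clear. Could you say it in a few words?",  # assumed wording
]
FALLBACK = "Let me connect you to someone who can help."  # assumed exit path

def listen_with_recovery(
    recognize: Callable[[], Tuple[str, float]],
    say: Callable[[str], None],
    min_confidence: float = 0.6,
) -> Optional[str]:
    """Return a confident transcript, or None after the reprompts are exhausted."""
    for attempt in range(len(REPROMPTS) + 1):
        text, confidence = recognize()
        if text and confidence >= min_confidence:
            return text
        if attempt < len(REPROMPTS):
            say(REPROMPTS[attempt])  # vary the wording on each retry
    say(FALLBACK)                    # never loop forever on a confused user
    return None

# Usage with canned recognition results in place of a real ASR.
results = iter([("", 0.0), ("book 3pm", 0.9)])
spoken = []
print(listen_with_recovery(lambda: next(results), spoken.append))  # prints: book 3pm
print(spoken)  # prints: ["I didn't catch that, can you say it again?"]
```

    Changing the reprompt on each attempt matters more in voice than in text: hearing the same apology twice in a row is the audio equivalent of a frozen screen.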

    Accessible by default. The lessons from building for blind users apply broadly: clear feedback, minimal latency, predictable turn-taking, and explicit state announcements. These make the experience better for everyone.

    Keep the first response short. The first thing the agent says after a user speaks sets the tone. Long, verbose openings train users to stop listening. Short, direct, relevant responses build trust.

    If you are building on MIRA-style routing or want to understand how we approached multilingual voice design, the MIRA deep dive has the architecture details.

    ---

    Voice AI is one of the few places in software where the engineering and the human experience are inseparable. A 200ms latency improvement is not a performance metric. It is the difference between a conversation that feels alive and one that feels like a phone tree.
