    MIRA Deep Dive: Building a Multilingual AI Router

    by Deep Parmar

    CTO at Sunbots Innovations LLP | Director at Xwits Developers Pvt Ltd

    What a Voice Router Actually Does

    MIRA — SmartON's Multimodal Inclusive AI for Recognition and Assistance — is the interface layer between a user's spoken request and the four AI capabilities behind it: currency detection, scene understanding, OCR, and document search. Every spoken request goes through MIRA. If the routing is wrong, the user gets a response that doesn't help them, and in a voice-first interface for visually impaired users, a wrong response is worse than silence.

    The routing problem is more nuanced than it appears from the outside. It's not just "which capability handles this request?" It's also: what is the current conversation context, what language is the user speaking, what did they last ask, and is this a continuation of that task or a new request?

    The Intent Classification Model

    The core of MIRA's routing is a fine-tuned multilingual intent classifier. The model architecture is DistilBERT-multilingual — small enough to run on the Jetson Nano within the latency budget, and pre-trained on multilingual corpora including Hindi and Gujarati.

    Training data: 2,400 labeled examples across four intent classes (currency_detection, scene_understanding, ocr_read, document_search) and a fifth class for out-of-scope requests. Examples were written in Gujarati, Hindi, English, and code-switched combinations. Each example was validated by a native speaker of the language or languages it was written in, to ensure natural phrasing.

    Fine-tuning took approximately 6 hours on a single A100 GPU. The resulting model is 67MB and achieves 96.2% accuracy on a held-out test set of 300 examples — balanced across languages and intent classes.

    The 3.8% error rate matters. For a visually impaired user who can't see what's happening, a misrouted request results in either silence (if the capability produces no output for the given input) or a confusing response (if the wrong capability runs). We added explicit confidence thresholding: if the classifier confidence is below 0.7, MIRA asks a clarifying question rather than routing to the best-guess capability.
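    The thresholding rule can be sketched in a few lines of Python. This is a minimal illustration, not MIRA's actual code: the 0.7 threshold and the intent names come from the text above, while `RoutingDecision`, `decide`, and the idea of offering the two highest-scoring intents in the clarifying question are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

CONFIDENCE_THRESHOLD = 0.7  # value stated in the article


@dataclass
class RoutingDecision:
    action: str            # "route" or "clarify"
    candidates: List[str]  # one intent to route to, or two to ask about
    confidence: float      # probability of the top-ranked intent


def decide(probabilities: Dict[str, float],
           threshold: float = CONFIDENCE_THRESHOLD) -> RoutingDecision:
    """Route to the top intent, or ask for clarification when unsure.

    `probabilities` maps each intent label to the classifier's softmax score.
    """
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    confidence = probabilities[ranked[0]]
    if confidence < threshold:
        # Assumption: the clarifying question offers the two best guesses.
        return RoutingDecision("clarify", ranked[:2], confidence)
    return RoutingDecision("route", ranked[:1], confidence)


uncertain = decide({
    "currency_detection": 0.55,
    "scene_understanding": 0.40,
    "ocr_read": 0.03,
    "document_search": 0.01,
    "out_of_scope": 0.01,
})
confident = decide({
    "currency_detection": 0.92,
    "scene_understanding": 0.05,
    "ocr_read": 0.01,
    "document_search": 0.01,
    "out_of_scope": 0.01,
})
```

    In this sketch, `uncertain` asks the user to choose between currency detection and scene understanding, while `confident` routes straight to currency detection.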

    Conversation Context and State

    MIRA maintains a conversation state window — the last 5 turns of the conversation, with timestamps, detected intents, and the user's language for each turn. This context informs three decisions:

    Continuation vs. new request: If the user says "and what about this part?" after a document search query, the word "this" is ambiguous. With conversation context, MIRA knows the user was querying a document and interprets the utterance as a continuation of that task, not a scene description request. Roughly 15% of all MIRA interactions need this kind of contextual disambiguation.

    Language preference: The conversation context window tracks the language of each turn. If every turn in the window has been in Gujarati and the current utterance is ambiguous between Gujarati and Hindi (some words are similar), the context biases toward Gujarati. This reduces language detection errors by approximately 40% on ambiguous inputs compared to stateless detection.

    Task state: Some tasks are inherently multi-turn. A document analysis session might involve several follow-up questions about the same document. MIRA tracks the active task state and maintains the loaded document context across turns, so users don't need to re-specify which document they're querying on every follow-up question.
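    A rough sketch of the state window and the language bias, under stated assumptions: the 5-turn window and the per-turn fields come from the text above, but the class and method names, the `margin` used to decide that two language scores are "ambiguous", and the majority-vote tie-break are all hypothetical choices for illustration.

```python
from collections import Counter, deque
from dataclasses import dataclass
from typing import Dict, Optional

WINDOW_TURNS = 5  # window size stated in the article


@dataclass
class Turn:
    timestamp: float
    intent: str
    language: str


class ConversationState:
    def __init__(self, max_turns: int = WINDOW_TURNS) -> None:
        # deque(maxlen=...) silently drops the oldest turn when full.
        self.turns = deque(maxlen=max_turns)
        # Document kept loaded across follow-up questions, if any.
        self.active_document: Optional[str] = None

    def add_turn(self, turn: Turn) -> None:
        self.turns.append(turn)

    def last_intent(self) -> Optional[str]:
        return self.turns[-1].intent if self.turns else None

    def resolve_language(self, scores: Dict[str, float],
                         margin: float = 0.1) -> str:
        """Pick a language, biased by recent turns when scores are close.

        `margin` is an assumed cutoff: languages scoring within it of the
        top score are treated as ambiguous alternatives.
        """
        ranked = sorted(scores, key=scores.get, reverse=True)
        best = ranked[0]
        close = [lang for lang in ranked if scores[best] - scores[lang] < margin]
        if len(close) > 1 and self.turns:
            history = Counter(turn.language for turn in self.turns)
            # Ties in history fall back to the higher detector score,
            # because `close` is already ordered by score.
            return max(close, key=lambda lang: history[lang])
        return best


state = ConversationState()
for i in range(8):
    state.add_turn(Turn(timestamp=float(i), intent="document_search", language="gu"))

ambiguous = state.resolve_language({"hi": 0.48, "gu": 0.45, "en": 0.07})
clear = state.resolve_language({"hi": 0.90, "gu": 0.08, "en": 0.02})
```

    Here the detector slightly prefers Hindi for the ambiguous utterance, but the Gujarati history tips the decision to `"gu"`; the clearly Hindi utterance is left alone.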

    Handling Misclassification Gracefully

    No classifier is perfect. MIRA's error recovery uses three strategies:

    Confidence-gated routing: Requests below the 0.7 confidence threshold trigger a clarification prompt: "Did you want me to read the currency note or describe the scene in front of you?" This adds one conversational turn but prevents confident wrong answers.

    Rapid correction: If the user immediately responds with "no" or "not that" after a misrouted response, MIRA treats this as a correction signal. Rather than asking what went wrong, it assumes the user's next utterance is a more specific version of the original request, re-classifies that utterance, and re-routes.

    Explicit override: Users can always explicitly name the capability they want: "MIRA, currency" or "MIRA, scene" routes directly to that capability, bypassing the classifier. This is the escape hatch when disambiguation fails.
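    The override and correction checks could look something like the following. The two override phrases and the two correction phrases are the ones quoted above; the function names, the keyword table, and the normalization rules are assumptions for illustration, and a real system would match against speech-to-text output in all supported languages.

```python
from typing import Optional

# Only the two overrides quoted in the article are listed here.
OVERRIDE_KEYWORDS = {
    "currency": "currency_detection",
    "scene": "scene_understanding",
}
CORRECTION_PHRASES = {"no", "not that"}


def normalize(utterance: str) -> str:
    """Lowercase and trim surrounding spaces and punctuation."""
    return utterance.lower().strip(" .!?,")


def explicit_override(utterance: str) -> Optional[str]:
    """Return the named capability for 'MIRA, <keyword>', else None."""
    text = normalize(utterance)
    if not text.startswith("mira"):
        return None
    keyword = text[len("mira"):].strip(" ,")
    return OVERRIDE_KEYWORDS.get(keyword)


def is_correction(utterance: str) -> bool:
    """True when the utterance is a bare rejection of the last response."""
    return normalize(utterance) in CORRECTION_PHRASES
```

    A router would typically check `explicit_override` first, then `is_correction`, and only then fall through to the classifier, so that the escape hatch never depends on the model being right.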

    Response Formatting by Capability

    MIRA doesn't just route — it also formats responses according to each capability's output contract. Each capability returns structured data; MIRA converts this to natural language audio according to per-capability templates tuned with user feedback:

    • Currency detection: "[denomination]. [orientation]." — two pieces of information, maximum. Example: "Two hundred rupees. Face side up."
    • Scene understanding: Action-oriented description, 3 elements maximum. "Glass door, slightly left. Push bar at waist height. One low step."
    • OCR: Full text, read aloud. For long text, MIRA offers "Read all" or "Give me the main points" before reading.
    • Document search: Retrieved passage, followed by source indicator. "From chapter four, page 31: [passage]. Would you like more from this section?"
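
    As a sketch, the four templates above map naturally onto small formatting functions. The function names and signatures are hypothetical, as are the 60-word cutoff for "long" OCR text and the exact wording of the OCR prompt; the article does not specify any of these.

```python
from typing import List

MAX_SCENE_ELEMENTS = 3  # limit stated in the article
LONG_TEXT_WORDS = 60    # assumed cutoff; the article gives no number


def format_currency(denomination: str, orientation: str) -> str:
    """Two pieces of information, maximum."""
    return f"{denomination}. {orientation}."


def format_scene(elements: List[str]) -> str:
    """Keep at most three elements, each spoken as its own short sentence."""
    return " ".join(f"{element}." for element in elements[:MAX_SCENE_ELEMENTS])


def format_ocr(text: str) -> str:
    """Read short text in full; offer a choice before reading long text."""
    if len(text.split()) > LONG_TEXT_WORDS:
        # Assumed prompt wording built from the two options in the article.
        return 'This is a long text. Say "Read all" or "Give me the main points".'
    return text


def format_document(source: str, passage: str) -> str:
    """Retrieved passage with a source indicator and a follow-up offer."""
    body = passage.rstrip(".")
    return f"From {source}: {body}. Would you like more from this section?"


currency_speech = format_currency("Two hundred rupees", "Face side up")
scene_speech = format_scene([
    "Glass door, slightly left",
    "Push bar at waist height",
    "One low step",
    "Potted plant on the right",  # dropped: exceeds the three-element limit
])
```

    Keeping the templates as plain functions over structured fields makes it straightforward to adjust wording per capability as user feedback comes in, without touching the routing logic.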

    Read the full SmartON story in Say It Once. MIRA Does the Rest → or learn about building multilingual AI in our Multilingual AI post →
