
MIRA Deep Dive: Building a Multilingual AI Router
by Deep Parmar
CTO at Sunbots Innovations LLP | Director at Xwits Developers Pvt Ltd

What a Voice Router Actually Does
MIRA — SmartON's Multimodal Inclusive AI for Recognition and Assistance — is the interface layer between a user's spoken request and the four AI capabilities behind it: currency detection, scene understanding, OCR, and document search. Every spoken request goes through MIRA. If the routing is wrong, the user gets a response that doesn't help them, and in a voice-first interface for visually impaired users, a wrong response is worse than silence.
The routing problem is more nuanced than it appears from the outside. It's not just "which capability handles this request?" It's also: what is the current conversation context, what language is the user speaking, what did they last ask, and is this a continuation of that task or a new request?
The Intent Classification Model
The core of MIRA's routing is a fine-tuned multilingual intent classifier. The model architecture is DistilBERT-multilingual — small enough to run on the Jetson Nano within the latency budget, and pre-trained on multilingual corpora including Hindi and Gujarati.
Training data: 2,400 labeled examples across four intent classes (currency_detection, scene_understanding, ocr_read, document_search) and a fifth class for out-of-scope requests. Examples were written in Gujarati, Hindi, English, and code-switched combinations. Each example was validated by a native speaker of the language (or languages) it uses to ensure natural phrasing.
Fine-tuning took approximately 6 hours on a single A100 GPU. The resulting model is 67MB and achieves 96.2% accuracy on a held-out test set of 300 examples — balanced across languages and intent classes.
The 3.8% error rate matters. For a visually impaired user who can't see what's happening, a misrouted request results in either silence (if the capability produces no output for the given input) or a confusing response (if the wrong capability runs). We added explicit confidence thresholding: if the classifier confidence is below 0.7, MIRA asks a clarifying question rather than routing to the best-guess capability.
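The thresholding described above can be sketched roughly as follows. This is a minimal illustration, not MIRA's actual code: the `RoutingDecision` type, the `decide` function, and the idea of offering the top two intents as clarification candidates are assumptions. Only the 0.7 threshold and the intent names come from the article.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

CONFIDENCE_THRESHOLD = 0.7  # value stated in the article


@dataclass
class RoutingDecision:
    action: str              # "route" or "clarify"
    intent: Optional[str]    # chosen intent when routing, else None
    candidates: List[str]    # intents to mention in a clarifying question


def decide(scores: Dict[str, float],
           threshold: float = CONFIDENCE_THRESHOLD) -> RoutingDecision:
    """Route to the top intent, or ask for clarification if confidence is low.

    `scores` maps intent name -> classifier probability (hypothetical shape).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = ranked[0]
    if scores[best] < threshold:
        # Assumption: the clarifying question offers the two most likely intents.
        return RoutingDecision("clarify", None, ranked[:2])
    return RoutingDecision("route", best, [best])
```

With scores of 0.55 for currency_detection and 0.40 for scene_understanding, `decide` returns a `clarify` decision naming those two candidates instead of guessing.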
Conversation Context and State
MIRA maintains a conversation state window — the last 5 turns of the conversation, with timestamps, detected intents, and the user's language for each turn. This context informs three decisions:
Continuation vs. new request: If the user says "and what about this part?" after a document search query, the word "this" is ambiguous. With conversation context, MIRA knows the user was querying a document and interprets the utterance as a continuation of that task, not a scene description request. Roughly 15% of all MIRA interactions need this kind of contextual disambiguation.
Language preference: The window records the language of each turn. If the user has been speaking Gujarati for every turn in the window and the current utterance is ambiguous between Gujarati and Hindi (some words are similar), the context biases detection toward Gujarati. This reduces language detection errors by approximately 40% on ambiguous inputs compared to stateless detection.
Task state: Some tasks are inherently multi-turn. A document analysis session might involve several follow-up questions about the same document. MIRA tracks the active task state and maintains the loaded document context across turns, so users don't need to re-specify which document they're querying on every follow-up question.
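One way to picture the state window and the language bias is the sketch below. It is an assumption-laden illustration: the `Turn` and `ConversationWindow` names, the language codes, and the additive `prior_weight` scheme are hypothetical, since the article does not say how the bias is computed. Only the 5-turn window and the per-turn fields (timestamp, intent, language) come from the text.

```python
import time
from collections import Counter, deque
from dataclasses import dataclass
from typing import Dict, Optional

WINDOW_SIZE = 5  # the article's "last 5 turns"


@dataclass
class Turn:
    timestamp: float
    intent: str
    language: str


class ConversationWindow:
    def __init__(self, size: int = WINDOW_SIZE) -> None:
        # deque with maxlen drops the oldest turn automatically
        self.turns = deque(maxlen=size)

    def add(self, intent: str, language: str,
            timestamp: Optional[float] = None) -> None:
        ts = time.time() if timestamp is None else timestamp
        self.turns.append(Turn(ts, intent, language))

    def last_intent(self) -> Optional[str]:
        """Most recent intent, usable as a hint for continuation detection."""
        return self.turns[-1].intent if self.turns else None

    def biased_language(self, detector_scores: Dict[str, float],
                        prior_weight: float = 0.2) -> str:
        """Pick a language after nudging scores toward recently used languages.

        Assumption: each score gets prior_weight * (share of window turns
        spoken in that language) added to it.
        """
        counts = Counter(t.language for t in self.turns)
        total = len(self.turns)

        def combined(lang: str) -> float:
            prior = counts[lang] / total if total else 0.0
            return detector_scores[lang] + prior_weight * prior

        return max(detector_scores, key=combined)
```

If the window holds only Gujarati turns and a stateless detector scores an utterance at 0.48 Gujarati versus 0.52 Hindi, the prior tips the result to Gujarati; with an empty window the detector's own ranking stands.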
Handling Misclassification Gracefully
No classifier is perfect. MIRA's error recovery uses three strategies:
Confidence-gated routing: Requests below the 0.7 confidence threshold trigger a clarification prompt: "Did you want me to read the currency note or describe the scene in front of you?" This adds one conversational turn but prevents confident wrong answers.
Rapid correction: If the user immediately responds with "no" or "not that" after a misrouted response, MIRA treats this as a correction signal. The system doesn't ask what went wrong; it assumes the next utterance is a more specific version of the original request, re-classifies it, and routes on the new result.
Explicit override: Users can always explicitly name the capability they want: "MIRA, currency" or "MIRA, scene" routes directly to that capability, bypassing the classifier. This is the escape hatch when disambiguation fails.
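The override and correction checks above could look something like this. Treat it as a sketch under stated assumptions: the keyword table covers only the two overrides the article quotes, the string matching is deliberately naive, and `classify` is a hypothetical callable standing in for the intent classifier.

```python
from typing import Callable, Optional

# Only "currency" and "scene" appear in the article; other keywords
# would need to be added by whoever knows the real command set.
OVERRIDE_KEYWORDS = {
    "currency": "currency_detection",
    "scene": "scene_understanding",
}
CORRECTION_PHRASES = {"no", "not that"}
WAKE_WORD = "mira"


def normalize(utterance: str) -> str:
    return utterance.lower().strip(" .,!?")


def explicit_override(utterance: str) -> Optional[str]:
    """Return the named capability for 'MIRA, <keyword>', else None."""
    text = normalize(utterance)
    if not text.startswith(WAKE_WORD):
        return None
    keyword = text[len(WAKE_WORD):].strip(" ,")
    return OVERRIDE_KEYWORDS.get(keyword)


def is_correction(utterance: str) -> bool:
    """True when the whole utterance is a short rejection like 'no'."""
    return normalize(utterance) in CORRECTION_PHRASES


def route(utterance: str, classify: Callable[[str], str]) -> str:
    """An explicit override bypasses the classifier; otherwise defer to it."""
    return explicit_override(utterance) or classify(utterance)
```

Checking the override before the classifier keeps the escape hatch independent of model confidence, which is the point of having one.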
Response Formatting by Capability
MIRA doesn't just route — it also formats responses according to each capability's output contract. Each capability returns structured data; MIRA converts this to natural language audio according to per-capability templates tuned with user feedback:
- Currency detection: "[denomination]. [orientation]." — two pieces of information, maximum. Example: "Two hundred rupees. Face side up."
- Scene understanding: Action-oriented description, 3 elements maximum. "Glass door, slightly left. Push bar at waist height. One low step."
- OCR: Full text, read aloud. For long text, MIRA offers "Read all" or "Give me the main points" before reading.
- Document search: Retrieved passage, followed by source indicator. "From chapter four, page 31: [passage]. Would you like more from this section?"
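The templates above can be expressed as small formatting functions. This is a sketch only: the function names and argument shapes are assumptions, since the article does not show the structured data each capability returns. OCR is left out because the article does not say what counts as "long" text.

```python
from typing import List

MAX_SCENE_ELEMENTS = 3  # "3 elements maximum" per the article


def format_currency(denomination: str, orientation: str) -> str:
    """Two pieces of information, each as its own short sentence."""
    return f"{denomination}. {orientation}."


def format_scene(elements: List[str]) -> str:
    """Keep at most three elements, one short sentence each."""
    kept = elements[:MAX_SCENE_ELEMENTS]
    return " ".join(f"{element}." for element in kept)


def format_document(chapter: str, page: int, passage: str) -> str:
    """Source indicator, then the passage, then a follow-up offer."""
    body = passage.rstrip(".")  # avoid a doubled full stop
    return (f"From chapter {chapter}, page {page}: {body}. "
            "Would you like more from this section?")
```

For instance, `format_currency("Two hundred rupees", "Face side up")` reproduces the article's example sentence, and `format_scene` silently drops anything past the third element.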
