What is MIRA?

MIRA stands for Multimodal Inclusive AI for Recognition and Assistance. It is SmartON's AI router: it takes user intent (voice, text, or image), dispatches it to the right downstream model (vision, OCR, translation, search, or LLM), and maintains context across turns.

How does an AI router work?

An AI router classifies incoming user requests by intent and routes them to the most appropriate model or tool — a vision model for image queries, an OCR model for text-in-image, a translation model for language tasks, an LLM for open-ended reasoning. Good routers also maintain conversation context across turns.

How does MIRA handle multiple languages?

MIRA is built around code-switching: users can mix Gujarati, Hindi, and English mid-sentence, and MIRA classifies the intent based on the combined utterance. It uses a multilingual embedding model and falls back gracefully when the request is ambiguous.

Why is conversation context important for AI assistants?

Context lets the assistant interpret 'read that one' or 'translate it to Hindi' without the user repeating themselves. Without context tracking, every turn becomes an isolated request — exhausting for users and especially poor for voice-first assistive use cases.

How do you handle misclassification in an AI router?

Handle misclassification by setting a confidence threshold below which the router asks a clarifying question instead of acting, by maintaining context so a wrong route can be corrected on the next turn, and by logging routing decisions for offline analysis and continuous improvement of the classifier.

MIRA Deep Dive: Multilingual AI Router

What a Voice Router Actually Does

MIRA — SmartON's Multimodal Inclusive AI for Recognition and Assistance — is the interface layer between a user's spoken request and the four AI capabilities behind it: currency detection, scene understanding, OCR, and document search. Every spoken request goes through MIRA. If the routing is wrong, the user gets a response that doesn't help them, and in a voice-first interface for visually impaired users, a wrong response is worse than silence.

The routing problem is more nuanced than it appears from the outside. It's not just "which capability handles this request?" It's also: what is the current conversation context, what language is the user speaking, what did they last ask, and is this a continuation of that task or a new request?
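The simplest part of that problem, mapping a classified intent to a capability, can be pictured as a dispatch table. The sketch below is illustrative only: the handler functions and the fallback message are hypothetical stand-ins, and the intent labels are the ones used for the classifier described in the next section.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the four capabilities; real handlers would
# call the underlying models instead of echoing the request.
def detect_currency(request: str) -> str:
    return "currency_detection handled: " + request

def describe_scene(request: str) -> str:
    return "scene_understanding handled: " + request

def read_text(request: str) -> str:
    return "ocr_read handled: " + request

def search_document(request: str) -> str:
    return "document_search handled: " + request

# Intent label -> capability handler.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "currency_detection": detect_currency,
    "scene_understanding": describe_scene,
    "ocr_read": read_text,
    "document_search": search_document,
}

def dispatch(intent: str, request: str) -> str:
    """Send the request to the handler for its intent, if one exists."""
    handler = HANDLERS.get(intent)
    if handler is None:
        # Out-of-scope or unknown intents never reach a capability.
        return "Sorry, I can't help with that."
    return handler(request)
```

Everything that makes routing hard (context, language, continuation) happens before `dispatch` is called, in deciding which intent label to pass in.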

The Intent Classification Model

The core of MIRA's routing is a fine-tuned multilingual intent classifier. The model architecture is DistilBERT-multilingual — small enough to run on the Jetson Nano within the latency budget, and pre-trained on multilingual corpora including Hindi and Gujarati.

Training data: 2,400 labeled examples across five classes: four intent classes (currency_detection, scene_understanding, ocr_read, document_search) plus a fifth class for out-of-scope requests. Examples were written in Gujarati, Hindi, English, and code-switched combinations. Each example was validated by a native speaker of the language or languages it uses to ensure natural phrasing.

Fine-tuning took approximately 6 hours on a single A100 GPU. The resulting model is 67 MB and achieves 96.2% accuracy on a held-out test set of 300 examples balanced across languages and intent classes.

The 3.8% error rate matters. For a visually impaired user who can't see what's happening, a misrouted request results in either silence (if the capability produces no output for the given input) or a confusing response (if the wrong capability runs). We added explicit confidence thresholding: if the classifier confidence is below 0.7, MIRA asks a clarifying question rather than routing to the best-guess capability.
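A minimal sketch of that gate, assuming the classifier exposes one raw logit per class and that "confidence" means the highest softmax probability (the text does not say how confidence is computed). The `out_of_scope` label name and the `gate` function are hypothetical; the 0.7 threshold comes from the paragraph above.

```python
import math
from typing import List, Optional, Tuple

# Four intent classes from the training data, plus an assumed name
# for the fifth, out-of-scope class.
INTENTS = [
    "currency_detection",
    "scene_understanding",
    "ocr_read",
    "document_search",
    "out_of_scope",
]
CONFIDENCE_THRESHOLD = 0.7  # threshold stated in the text

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate(logits: List[float]) -> Tuple[str, Optional[str], float]:
    """Return ("route", intent, confidence) or ("clarify", None, confidence)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    if confidence < CONFIDENCE_THRESHOLD:
        # Too uncertain: ask the user instead of acting on a guess.
        return ("clarify", None, confidence)
    return ("route", INTENTS[best], confidence)
```

With a clearly dominant logit the gate routes; with two near-equal logits the top probability falls well below 0.7 and the gate asks for clarification.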

Conversation Context and State

MIRA maintains a conversation state window — the last 5 turns of the conversation, with timestamps, detected intents, and the user's language for each turn. This context informs three decisions:

Continuation vs. new request: If the user says "and what about this part?" after a document search query, the word "this" is ambiguous. With conversation context, MIRA knows the user was querying a document and interprets this as a continuation of that task, not a scene description request. Contextual disambiguation accounts for roughly 15% of all MIRA interactions.

Language preference: The conversation context window tracks the language of each turn. If a user has been speaking Gujarati for the past several turns and the current utterance is ambiguous between Gujarati and Hindi (some words are similar), the context biases toward Gujarati. This reduces language detection errors by approximately 40% on ambiguous inputs compared to stateless detection.

Task state: Some tasks are inherently multi-turn. A document analysis session might involve several follow-up questions about the same document. MIRA tracks the active task state and maintains the loaded document context across turns, so users don't need to re-specify which document they're querying on every follow-up question.
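The three decisions above can share one small state object. The sketch below is an assumption about shape, not MIRA's actual implementation: the class and field names, the `active_document` handle, and the linear blend with `prior_weight=0.3` are all hypothetical. Only the five-turn window and the per-turn timestamp, intent, and language come from the text.

```python
import time
from collections import Counter, deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional

@dataclass
class Turn:
    timestamp: float
    intent: str
    language: str

@dataclass
class ConversationState:
    # deque(maxlen=5) drops the oldest turn automatically on append.
    turns: Deque[Turn] = field(default_factory=lambda: deque(maxlen=5))
    # Hypothetical handle for the document loaded in a multi-turn session.
    active_document: Optional[str] = None

    def add_turn(self, intent: str, language: str,
                 timestamp: Optional[float] = None) -> None:
        ts = time.time() if timestamp is None else timestamp
        self.turns.append(Turn(ts, intent, language))

    def last_intent(self) -> Optional[str]:
        """Intent of the previous turn, used to resolve follow-ups."""
        return self.turns[-1].intent if self.turns else None

    def language_prior(self) -> Dict[str, float]:
        """Share of turns in the window spoken in each language."""
        counts = Counter(t.language for t in self.turns)
        total = sum(counts.values())
        return {lang: n / total for lang, n in counts.items()} if total else {}

    def pick_language(self, detector_scores: Dict[str, float],
                      prior_weight: float = 0.3) -> str:
        """Blend stateless detector scores with the window's language prior."""
        prior = self.language_prior()
        blended = {
            lang: (1 - prior_weight) * score + prior_weight * prior.get(lang, 0.0)
            for lang, score in detector_scores.items()
        }
        return max(blended, key=blended.get)
```

For example, after several Gujarati turns, detector scores of 0.48 for Gujarati and 0.52 for Hindi resolve to Gujarati, while an empty state falls back to the detector's own choice. An ambiguous follow-up such as "and what about this part?" can be resolved by falling back to `last_intent()`.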

Handling Misclassification Gracefully

No classifier is perfect. MIRA's error recovery uses three strategies:

Confidence-gated routing: Requests below the 0.7 confidence threshold trigger a clarification prompt: "Did you want me to read the currency note or describe the scene in front of you?" This adds one conversational turn but keeps MIRA from acting on a low-confidence guess.

Rapid correction: If the user immediately responds with "no" or "not that" after a misrouted response, MIRA treats this as a correction signal. The system doesn't ask what went wrong; it assumes the next utterance is a more specific version of the original request, re-classifies it, and re-routes.

Explicit override: Users can always explicitly name the capability they want: "MIRA, currency" or "MIRA, scene" routes directly to that capability, bypassing the classifier. This is the escape hatch when disambiguation fails.
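The rapid correction and explicit override strategies above can be sketched as two checks that run before the classifier. This is a simplified illustration: the function names and the `await_restatement` marker are hypothetical, the matching is plain string comparison rather than anything multilingual, and only the two override keywords quoted in the text are included.

```python
from typing import Callable, Optional

# Override keywords quoted in the text; other capabilities would be
# added the same way.
OVERRIDES = {
    "currency": "currency_detection",
    "scene": "scene_understanding",
}
CORRECTIONS = {"no", "not that"}

def explicit_override(utterance: str) -> Optional[str]:
    """Return the named capability for "MIRA, <keyword>", else None."""
    text = utterance.strip().lower()
    if not text.startswith("mira"):
        return None
    keyword = text[len("mira"):].strip(" ,.")
    return OVERRIDES.get(keyword)

def is_correction(utterance: str) -> bool:
    """True when the utterance is a bare correction such as "no"."""
    return utterance.strip().lower().rstrip(".!") in CORRECTIONS

def route(utterance: str, classify: Callable[[str], str]) -> str:
    override = explicit_override(utterance)
    if override is not None:
        return override  # bypass the classifier entirely
    if is_correction(utterance):
        # Hypothetical marker: drop the last route and classify
        # whatever the user says next as a restated request.
        return "await_restatement"
    return classify(utterance)
```

Checking the override first means a user can always escape a bad route, even in the middle of a correction exchange.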

Response Formatting by Capability

MIRA doesn't just route — it also formats responses according to each capability's output contract. Each capability returns structured data; MIRA converts this to natural language audio according to per-capability templates tuned with user feedback:

Currency detection: "[denomination]. [orientation]." — two pieces of information, maximum. Example: "Two hundred rupees. Face side up."
Scene understanding: Action-oriented description, 3 elements maximum. "Glass door, slightly left. Push bar at waist height. One low step."
OCR: Full text, read aloud. For long text, MIRA offers "Read all" or "Give me the main points" before reading.
Document search: Retrieved passage, followed by source indicator. "From chapter four, page 31: [passage]. Would you like more from this section?"
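As a rough illustration of these templates, the sketch below formats three of the four capabilities. The dictionary field names (`denomination`, `orientation`, `elements`, `source`, `passage`) are assumptions about the structured data each capability returns; the text does not specify the actual output contracts. OCR is left out because the text does not say what counts as "long" text.

```python
from typing import Dict, List

def format_currency(result: Dict[str, str]) -> str:
    """Two pieces of information, maximum."""
    return f"{result['denomination']}. {result['orientation']}."

def format_scene(result: Dict[str, List[str]]) -> str:
    """Keep at most three elements of the scene description."""
    elements = result["elements"][:3]
    return " ".join(f"{element}." for element in elements)

def format_document(result: Dict[str, str]) -> str:
    """Retrieved passage with its source, then a follow-up offer."""
    return (
        f"From {result['source']}: {result['passage']}. "
        "Would you like more from this section?"
    )
```

Keeping the limits (two facts, three elements) inside the formatter rather than in each capability means the spoken output stays short even if a model returns more than was asked for.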

Read the full SmartON story in "Say It Once. MIRA Does the Rest", or learn about building multilingual AI in our Multilingual AI post.

Related Posts

Building SmartON: Assistive AI for the Visually Impaired

Building Multilingual AI for Indian Languages
