
    Building Multilingual AI for Indian Languages

    by Deep Parmar

    CTO at Sunbots Innovations LLP | Director at Xwits Developers Pvt Ltd

    The Assumption That Breaks Everything

    Most AI teams building multilingual products start with the same assumption: "We'll add multilingual support after we get the English version right." This assumption is wrong in two ways. First, the English version often doesn't generalize to Indian languages — not because of vocabulary, but because of syntax, script, and the way people actually speak. Second, the hardest part of multilingual AI isn't translation — it's code-switching.

    Code-switching is the phenomenon where speakers move between languages — or mix them — in a single conversation. Indian users do this constantly. "Mira, mujhe explain karo this graph in English please" is a realistic input that contains Hindi, English, and a mixed structure. A system that assumes clean language boundaries will fail on this input.

    The Language Stack for MIRA

    MIRA — SmartON's voice interface — needed to handle Gujarati, Hindi, and English, with code-switching between all three. Here's how we built each layer:

    Speech recognition: We evaluated Whisper (OpenAI's open-source model), Google's Speech-to-Text API, and Azure Cognitive Services. Whisper had the best accuracy for code-switched speech but was too slow for on-device inference at the time we made the decision. We use a combination: Whisper for complex, mixed-language utterances and a lighter on-device model for simple, single-language commands. The routing between them is based on utterance complexity detected in the first 200ms.
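    The routing step can be pictured with a minimal sketch. It assumes a hypothetical `score_complexity` callable that returns a value between 0 and 1 for the leading audio window; the function names and the 0.5 threshold are illustrative and are not taken from MIRA's code.

```python
from typing import Callable, Sequence

WINDOW_MS = 200  # the article routes on the first 200 ms of audio


def leading_window(samples: Sequence[float], sample_rate: int,
                   window_ms: int = WINDOW_MS) -> Sequence[float]:
    """Return the samples that fall inside the first `window_ms` milliseconds."""
    n = sample_rate * window_ms // 1000
    return samples[:n]


def route(samples: Sequence[float], sample_rate: int,
          score_complexity: Callable[[Sequence[float]], float],
          threshold: float = 0.5) -> str:
    """Pick a recognizer name from a complexity score of the leading window.

    `score_complexity` is a hypothetical stand-in for whatever model
    estimates utterance complexity; the threshold is an assumed value.
    """
    window = leading_window(samples, sample_rate)
    return "whisper" if score_complexity(window) >= threshold else "on_device"
```

    With one second of 16 kHz audio, `leading_window` keeps the first 3,200 samples, and only that slice is passed to the scorer, so the routing decision does not wait for the full utterance.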

    Language detection: After transcription, we detect the language of the transcribed text. This sounds straightforward, but it is not. A sentence like "please find electrolysis in my notes" is technically English, but the user was probably speaking in the middle of a Gujarati study session. We use FastText for primary language detection and a custom contextual model that considers conversation history to resolve ambiguities.
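    One way to picture the contextual step is as a blend of the detector's probabilities with a prior built from recent turns. The sketch below is an assumption about how such a blend could work, not the custom model itself: `resolve_language`, the `context_weight` of 0.3, and the input format are all hypothetical, and the detector probabilities are passed in rather than computed.

```python
from collections import Counter


def resolve_language(detector_probs: dict[str, float],
                     history: list[str],
                     context_weight: float = 0.3) -> str:
    """Blend per-language detector probabilities with a conversation prior.

    detector_probs: output of a primary detector (for example FastText),
                    mapping language code to probability.
    history:        language codes of recent turns, oldest first.
    context_weight: assumed share of the decision given to the history.
    """
    counts = Counter(history)
    total = sum(counts.values())
    blended = {}
    for lang, prob in detector_probs.items():
        prior = counts[lang] / total if total else 0.0
        blended[lang] = (1 - context_weight) * prob + context_weight * prior
    return max(blended, key=blended.get)


# A borderline detection inside a mostly Gujarati session resolves to Gujarati.
probs = {"en": 0.55, "gu": 0.40, "hi": 0.05}
print(resolve_language(probs, ["gu", "gu", "en", "gu", "gu"]))
```

    A confident detection is barely affected by the prior, while a borderline one is pulled toward the session's language. With no history, the detector's top choice wins unchanged.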

    Response generation: For simple factual responses (currency denomination, distance estimation, navigation instructions), we pre-generate responses in all three languages and select based on detected language. For longer, generative responses (document search results, graph explanations), we use language-conditioned generation.
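    For the pre-generated path, selection can be as simple as a keyed lookup. This is a minimal sketch under stated assumptions: the `PREGENERATED` table, the `currency.note` key, and the sample strings are illustrative, and any real strings would go through the native-speaker validation described later in the article.

```python
# Illustrative strings only; real content would need native-speaker validation.
PREGENERATED = {
    "currency.note": {
        "en": "This is a {amount} rupee note.",
        "hi": "यह {amount} रुपये का नोट है।",
        "gu": "આ {amount} રૂપિયાની નોટ છે.",
    },
}


def select_response(key: str, lang: str, fallback: str = "en", **slots) -> str:
    """Return the pre-generated response for `key` in `lang`.

    Falls back to `fallback` (an assumed default) when no variant exists
    for the detected language, then fills any template slots.
    """
    variants = PREGENERATED[key]
    template = variants.get(lang, variants[fallback])
    return template.format(**slots)


print(select_response("currency.note", "hi", amount=500))
```

    Keeping the variants side by side under one key makes it easy to spot a missing translation before it reaches a user.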

    The Data Problem

    Indian language AI has a fundamental data problem: Gujarati and many other Indian languages are severely underrepresented in training data for large models. While Hindi is reasonably well-represented, Gujarati has roughly 55 million native speakers but a fraction of the web presence of English or even Hindi.

    The practical consequence: off-the-shelf multilingual models perform noticeably worse on Gujarati than on Hindi or English. We address this with domain-specific fine-tuning using data we collect from SmartON users who opt into data contribution, and with aggressive data augmentation — translating English training examples into Gujarati and validating them with native speakers.

    This is a genuine constraint. If you're building for a language with limited training data, budget extra time for data collection and expect off-the-shelf model quality to fall short.

    Code-Switching: The Hard Problem

    Code-switching requires the system to detect mid-utterance language transitions and handle them gracefully. Our approach has three components:

    • Segment-level detection: After transcription, we segment the utterance at clause boundaries and run language detection on each segment independently. This catches cases like "yeh note kaun sa hai — tell me in English."
    • Conversation context: We maintain a language context window — if the last 5 turns were primarily in Gujarati, we weight Gujarati detection higher for ambiguous segments. Users rarely switch languages abruptly without a reason.
    • Explicit override: Users can say "now speak in Hindi" or the Gujarati equivalent, and MIRA switches and stays switched until told otherwise. Explicit overrides always take precedence over context-based detection.
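    The three components can be sketched together as a small tracker. Everything here is a simplified assumption: `toy_detect` is a word-list stand-in for a real per-segment detector, clause boundaries are approximated by punctuation, only English override phrases are listed, and the class and method names are hypothetical.

```python
import re
from collections import Counter, deque
from typing import Optional

CLAUSE_BOUNDARY = re.compile(r"\s*[,;.!?]\s*")  # crude clause splitter

# Toy word lists standing in for a real per-segment language detector.
HINDI_HINTS = {"yeh", "kaun", "sa", "hai", "mujhe", "karo"}
ENGLISH_HINTS = {"tell", "me", "in", "english", "please", "this", "graph"}

# Only English phrasings are listed; other languages would need their own.
OVERRIDES = {
    "now speak in hindi": "hi",
    "now speak in english": "en",
    "now speak in gujarati": "gu",
}


def toy_detect(segment: str) -> Optional[str]:
    """Return 'hi', 'en', or None when the segment is ambiguous."""
    words = re.findall(r"[a-z]+", segment.lower())
    hi = sum(w in HINDI_HINTS for w in words)
    en = sum(w in ENGLISH_HINTS for w in words)
    if hi == en:
        return None
    return "hi" if hi > en else "en"


class LanguageTracker:
    def __init__(self, window: int = 5):
        self.recent = deque(maxlen=window)   # dominant language of recent turns
        self.override: Optional[str] = None  # sticky until changed by the user

    def _context_language(self) -> Optional[str]:
        if not self.recent:
            return None
        return Counter(self.recent).most_common(1)[0][0]

    def process(self, utterance: str) -> list[tuple[str, Optional[str]]]:
        """Label each clause; ambiguous clauses fall back to the context."""
        normalized = utterance.lower().strip(" .!?")
        if normalized in OVERRIDES:
            self.override = OVERRIDES[normalized]
        segments = [s.strip() for s in CLAUSE_BOUNDARY.split(utterance) if s.strip()]
        context = self._context_language()
        labeled, detected = [], []
        for seg in segments:
            lang = toy_detect(seg)
            if lang is not None:
                detected.append(lang)
            labeled.append((seg, lang if lang is not None else context))
        if detected:  # only detector-derived labels feed the context window
            self.recent.append(Counter(detected).most_common(1)[0][0])
        return labeled

    def reply_language(self, default: str = "en") -> str:
        """An explicit override always wins over context-based detection."""
        return self.override or self._context_language() or default
```

    Feeding the tracker "yeh note kaun sa hai, tell me in English" yields one Hindi segment and one English segment. After "Now speak in English.", `reply_language()` stays on English even if later turns are detected as Hindi, mirroring the precedence rule in the list above.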

    What Worked, What Didn't

    What worked well: treating code-switching as a first-class feature rather than an edge case, investing in native speaker validation of all generated content, and building explicit override mechanisms so users always have control when detection fails.

    What didn't work: trying to handle all three languages equally in the first version. Instead, we launched with Hindi and English, validated the architecture, and then added Gujarati once we were confident that the language stack worked correctly. Adding all three simultaneously would have been slower and would have produced worse quality in each of them.

    MIRA is part of SmartON. Read the full product story in "Say It Once. MIRA Does the Rest." Building multilingual AI? Let's talk through your language stack.
