    Multimodal AI in Production: Combining Vision and Language for Real Problems

    by Deep Parmar

    CTO, Sunbots & Xwits

    Multimodal AI (models that handle images, text, and sometimes audio in combination) is one of those capabilities where the demos are immediately compelling and the production reality is more nuanced. When Anthropic released Claude 3's vision capabilities and Google released Gemini 1.5 Pro with video understanding, the demos showed AI describing complex scenes, reading handwritten text, and explaining charts. What the demos did not show was latency, cost at scale, or how to architect a system that uses these capabilities reliably for real users.

    What Multimodal Actually Means (Not the Marketing Version)

    In practice, "multimodal" covers several distinct capabilities that are often conflated:

    • Vision-language models (VLMs): Models that take images as input and produce text. GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 are the main options. They handle general visual question answering, image description, chart reading, and document understanding.
    • Specialised vision models: Models built or fine-tuned for specific visual tasks, such as YOLO for object detection, TrOCR for OCR, and models trained for medical imaging, satellite imagery, or specific document types. They outperform general VLMs on their target tasks at much lower cost.
    • Audio-language models: Whisper for speech transcription, plus models that process audio features directly for classification or generation.

    Most production multimodal systems are not a single model — they are orchestrated pipelines that route different modalities to the most appropriate model for that specific task.

    SmartON's Multimodal Pipeline

    SmartON handles vision and language across every user interaction. A user holds up their phone or glasses camera and asks a question in natural language. MIRA determines the intent (currency detection? scene description? text reading?) and routes to the appropriate vision model. The response from the vision model is then processed by a language model to produce a natural, helpful spoken answer in the user's language.

    The key architectural decision: we do not use a single general VLM for everything. Object detection runs a quantised YOLO model on-device for speed, with 200ms latency. OCR for Indian scripts runs on TrOCR. Scene understanding and complex visual questions go to a general VLM via API. Currency detection runs a specialised classifier. This tiered architecture keeps latency low for common tasks and reserves API calls for queries that genuinely need the general model's broader capability.
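
    The tiered routing described above can be sketched roughly as follows. This is a minimal illustration, not SmartON's actual code: the `Intent` values, the handler functions, the backend names, and the `route` helper are all hypothetical, and the handlers return placeholder strings where real code would invoke the models. The only behaviour taken from the text is that specialised tasks go to specialised models and everything else falls through to a general VLM via API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Tuple


class Intent(Enum):
    # Hypothetical intent labels, mirroring the tasks named in the article.
    OBJECT_DETECTION = "object_detection"
    TEXT_READING = "text_reading"
    CURRENCY_DETECTION = "currency_detection"
    SCENE_DESCRIPTION = "scene_description"


@dataclass
class VisionResult:
    backend: str    # name of the tier that handled the request
    uses_api: bool  # True only when the general VLM API was called
    output: str


# Hypothetical stand-ins: real code would invoke the actual models here.
def detect_objects(image: bytes, question: str) -> str:
    return "objects: <quantised YOLO output>"


def read_text(image: bytes, question: str) -> str:
    return "text: <TrOCR output>"


def classify_currency(image: bytes, question: str) -> str:
    return "currency: <classifier output>"


def ask_general_vlm(image: bytes, question: str) -> str:
    return "answer: <general VLM API response>"


Handler = Callable[[bytes, str], str]

# intent -> (backend name, uses the remote API?, handler)
ROUTES: Dict[Intent, Tuple[str, bool, Handler]] = {
    Intent.OBJECT_DETECTION: ("yolo_quantised", False, detect_objects),
    Intent.TEXT_READING: ("trocr", False, read_text),
    Intent.CURRENCY_DETECTION: ("currency_classifier", False, classify_currency),
}

# Anything without a specialised model falls through to the general VLM.
GENERAL_VLM: Tuple[str, bool, Handler] = ("general_vlm_api", True, ask_general_vlm)


def route(intent: Intent, image: bytes, question: str = "") -> VisionResult:
    """Send the request to a specialised model if one exists, else the general VLM."""
    backend, uses_api, handler = ROUTES.get(intent, GENERAL_VLM)
    return VisionResult(backend, uses_api, handler(image, question))
```

    Keeping the routing table as plain data means a new specialised model becomes one extra entry, and any intent without an entry keeps working through the general VLM.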

    Latency and Cost: What to Expect in Production

    General VLM API calls with a high-resolution image take 1.5 to 4 seconds depending on image size and the model. This is too slow for real-time user interactions. On-device specialised models run in 100-500ms. The production pattern that works: classify the task first, route to on-device models for tasks they can handle well, fall back to API models only for tasks requiring broader understanding. For SmartON users, this means 90% of interactions complete in under 800ms, with the remaining 10% (complex scene questions) taking up to 3 seconds with user feedback that the AI is processing.
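
    One way to give users feedback while a slow API call is in flight is to start the call, wait a short while, and announce that processing is under way only if the answer has not arrived yet. The sketch below assumes an asyncio-based client. The `answer_with_feedback` helper, the `notify` callback, and the 0.8 second threshold (borrowed from the 800ms figure above) are illustrative assumptions, not SmartON's implementation.

```python
import asyncio
from typing import Awaitable, Callable

# Assumption: start user feedback once the fast-path budget has been exceeded.
FEEDBACK_AFTER_S = 0.8


async def answer_with_feedback(
    call: Callable[[], Awaitable[str]],
    notify: Callable[[str], None],
    feedback_after_s: float = FEEDBACK_AFTER_S,
) -> str:
    """Run `call`; if it is still pending after the threshold, notify the user once."""
    task = asyncio.ensure_future(call())
    # asyncio.wait returns after the timeout without cancelling the task.
    done, _pending = await asyncio.wait({task}, timeout=feedback_after_s)
    if not done:
        notify("Processing, one moment...")
    # Await the task either way: it returns the result or re-raises its exception.
    return await task
```

    Fast on-device answers return before the threshold and trigger no message, so only the slower API-backed queries produce the "processing" cue.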
