What is SmartON?

SmartON is an assistive AI product built for visually impaired users in India. It combines computer vision, speech, and a multilingual AI router to help users navigate their environment, identify objects, read text, and complete daily tasks, all through a voice-first interface.

How does AI help visually impaired users?

AI helps visually impaired users by describing scenes, reading printed and digital text, identifying objects and people, supporting navigation, and translating across languages — all through interfaces that do not require sight, like voice and haptic feedback.

What technology powers SmartON?

SmartON combines on-device object detection (YOLO via TFLite), OCR, speech recognition and synthesis, and the MIRA multilingual router to handle Gujarati, Hindi, and English. It is designed to work on affordable Android phones common in India.

Why build SmartON for Indian users specifically?

Most existing assistive AI products are designed around English-speaking, urban Western users. SmartON was built around what visually impaired users in Indian cities actually need — multilingual support, robustness on low-end Android hardware, and offline operation when connectivity is unreliable.

How is SmartON different from existing accessibility apps?

SmartON is voice-first instead of screen-first, multilingual by default with strong code-switching support, and designed for low-end Android hardware common in India. It also integrates the MIRA router so users can express intent naturally instead of memorising commands.

Building SmartON: Assistive AI for the Visually Impaired

Starting with the Right Question

Most assistive technology is built by engineers who imagine what visually impaired users need. SmartON started differently: I spent three weeks talking to visually impaired users in Ahmedabad before writing a single line of code.

The question I asked wasn't "what AI features do you want?" It was: "What do you struggle with today that you believe technology could solve?" The answers clustered around three problems that existing tools handled poorly: identifying Indian currency (existing apps were inaccurate on worn notes and failed entirely in dim light), understanding physical space (navigation apps describe routes, not the immediate physical environment), and accessing printed documents (OCR apps existed but required sighted assistance to aim the camera correctly).

These three problems became SmartON's three core capabilities. Everything else was deferred until we'd solved these well.

The Technical Architecture

SmartON is an Android application connected to a USB camera, designed to work with a Jetson Nano edge computing unit. The hardware choice was deliberate: by running inference on the Jetson rather than on the phone or in the cloud, we get lower latency than a cloud round trip and more model capacity than a phone alone offers.

The four AI components, and what they do:

Currency detection: A YOLO-based object detection model trained on a custom dataset of 8,000+ images of Indian rupee notes across denominations, lighting conditions, and states of wear. Accuracy: 98.7% in production. Inference time: ~45ms on Jetson Nano.
Scene understanding: A vision-language model that converts a camera frame into an action-oriented description. The model is prompted to prioritize navigation-relevant information (obstacles, entrances, distances) over a comprehensive inventory of the scene.
OCR: A two-stage pipeline that first detects whether text is present and approximately where, then runs a specialized OCR model on the detected text regions. This is significantly faster than running full-image OCR on every frame.
Document search: A RAG pipeline where users can load documents into a local index and query them by voice. Documents are embedded offline; queries retrieve the relevant passages.
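
To make the document search flow concrete, here is a minimal sketch of "embed offline, retrieve at query time." It is not SmartON's actual pipeline: the `LocalIndex` class, the `embed` function, and the sample passages are all hypothetical, and the word-count "embedding" is a toy stand-in for a real embedding model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class LocalIndex:
    """Hypothetical in-memory index: passages are embedded once, at load time."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, Counter]] = []

    def add_document(self, passages: list[str]) -> None:
        # Embedding happens here, offline, so queries only embed the question.
        for passage in passages:
            self.entries.append((passage, embed(passage)))

    def query(self, question: str, top_k: int = 1) -> list[str]:
        q = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [passage for passage, _ in ranked[:top_k]]


index = LocalIndex()
index.add_document([
    "Electrolysis uses an electric current to drive a chemical reaction.",
    "Photosynthesis converts light energy into chemical energy.",
    "Acids turn blue litmus paper red.",
])
best = index.query("Find the section about electrolysis in my chemistry notes")
print(best[0])
```

The retrieved passage is what would be handed to the voice layer to read back. A real system would swap `embed` for a proper embedding model and persist the index to disk.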

The Voice Interface Design

The voice interface is where most of SmartON's complexity lives. MIRA — the voice layer — needs to understand a spoken request, determine which of the four capabilities should handle it, execute the capability, and return a response that's useful without being overwhelming.

The routing problem is harder than it sounds. "What is in front of me?" routes to scene understanding. "What does this paper say?" routes to OCR. "Find the section about electrolysis in my chemistry notes" routes to document search. These distinctions are clear when written down; they're ambiguous when spoken in mixed languages with varying phrasing.

We use a fine-tuned intent classification model for routing — trained on 2,000 example utterances in Gujarati, Hindi, and English, covering the full range of how users actually phrase requests. The classifier runs locally in under 30ms, adding minimal latency to the response pipeline.
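
The shape of the routing step can be sketched as a classify-then-dispatch function. This is an illustration only: the keyword matcher below is a deliberately crude, English-only stand-in for the fine-tuned classifier, and the intent names, keyword lists, and `"unknown"` fallback are assumptions made for the example.

```python
from typing import Callable

# Hypothetical keyword hints per capability; a stand-in for the trained classifier.
INTENT_KEYWORDS: dict[str, tuple[str, ...]] = {
    "currency": ("rupee", "money", "currency", "how much"),
    "scene": ("in front of me", "around me", "where am i"),
    "ocr": ("say", "read", "written"),
    "document_search": ("find", "search", "section"),
}


def classify_intent(utterance: str) -> str:
    """Return the intent whose hints match most often, or 'unknown' if none match."""
    text = utterance.lower()
    scores = {
        intent: sum(hint in text for hint in hints)
        for intent, hints in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


# Placeholder handlers; each would call the real capability in a full system.
HANDLERS: dict[str, Callable[[str], str]] = {
    "currency": lambda u: "running currency detection",
    "scene": lambda u: "running scene understanding",
    "ocr": lambda u: "running OCR",
    "document_search": lambda u: "running document search",
    "unknown": lambda u: "asking the user to rephrase",
}


def route(utterance: str) -> str:
    """Classify the utterance, then dispatch to the matching handler."""
    return HANDLERS[classify_intent(utterance)](utterance)


print(route("What is in front of me?"))
print(route("What does this paper say?"))
```

Keeping classification separate from dispatch means the keyword matcher can be replaced by a model without touching the handlers.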

Response design follows one rule: tell the user the next action, not everything the system knows. "Two hundred rupees, portrait orientation" is better than "The system has detected a two-hundred-rupee note with denomination markers visible at 0.97 confidence, oriented in portrait mode with the face side facing the camera." The first response takes under 2 seconds to say. The second takes 8 seconds and overwhelms the user with information they don't need.
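
As a rough illustration of that rule, the sketch below turns a detection result into the short spoken form and simply drops the fields the user does not need. The dictionary keys and the `brief_response` name are hypothetical, not SmartON's real data model.

```python
# Spoken forms for a few rupee denominations (illustrative subset).
DENOMINATION_WORDS = {
    10: "Ten",
    20: "Twenty",
    50: "Fifty",
    100: "One hundred",
    200: "Two hundred",
    500: "Five hundred",
}


def brief_response(detection: dict) -> str:
    """Build the short utterance; confidence and other internals are left out."""
    amount = detection["denomination"]
    words = DENOMINATION_WORDS.get(amount, str(amount))
    return f"{words} rupees, {detection['orientation']} orientation"


detection = {"denomination": 200, "orientation": "portrait", "confidence": 0.97}
print(brief_response(detection))  # Two hundred rupees, portrait orientation
```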

What We Got Wrong the First Time

Two significant mistakes in v1:

Feedback latency: We initially returned the full inference result before speaking. For currency detection, this meant 45ms of silence, then the answer. User testing showed that a streaming response — "detected... two hundred rupees" — felt dramatically more responsive even though the total time was similar. We rebuilt the audio pipeline to stream partial results.
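
One way to picture the streaming change is a generator that yields each speakable fragment as soon as it is ready, with the audio layer consuming fragments one at a time. This is a sketch under assumptions: the function names are invented, the `speak` callback stands in for a text-to-speech call, and it assumes a note has already been located before the acknowledgement is emitted.

```python
from typing import Callable, Iterator


def currency_response_stream(classify_note: Callable[[], str]) -> Iterator[str]:
    """Yield speakable fragments as soon as each one is ready."""
    # Assumption: a note has already been located in the frame, so the
    # acknowledgement can be spoken before the denomination is classified.
    yield "detected..."
    # The classifier only runs once the consumer asks for the next fragment.
    yield classify_note()


def speak_all(fragments: Iterator[str], speak: Callable[[str], None]) -> None:
    """Pass each fragment to the speech callback immediately, not after collecting all."""
    for fragment in fragments:
        speak(fragment)


spoken: list[str] = []
speak_all(currency_response_stream(lambda: "two hundred rupees"), spoken.append)
print(spoken)
```

Because generators are lazy, the first fragment reaches the speech callback before the classification call starts, which is the property that makes the response feel faster.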

Scene description verbosity: The first scene understanding model described everything it could see. Users found this exhausting. A scene with seven objects generated a 12-second audio description before they could do anything with it. We added a prompt constraint: "Describe only the 2–3 most actionable elements for navigation" and rebuilt the evaluation against this criterion. Response length dropped by 70%; user satisfaction increased significantly.
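
A small sketch of how such a constraint might be attached to a prompt and checked during evaluation. Only the quoted constraint comes from our system; the base instruction, the function names, and the sentence-count check are assumptions made for illustration, and counting sentences is a crude proxy for counting scene elements.

```python
import re

# The constraint quoted in the text above.
SCENE_CONSTRAINT = "Describe only the 2–3 most actionable elements for navigation"


def build_scene_prompt(
    base_instruction: str = "You are describing a camera frame to a visually impaired user.",
) -> str:
    """Append the constraint to a (hypothetical) base instruction."""
    return f"{base_instruction} {SCENE_CONSTRAINT}."


def within_element_budget(description: str, max_elements: int = 3) -> bool:
    """Crude evaluation check: treat each sentence as one described element."""
    sentences = [s for s in re.split(r"[.!?]+", description) if s.strip()]
    return len(sentences) <= max_elements


print(build_scene_prompt())
print(within_element_budget("Door ahead, two steps. Chair on your left."))
```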

Where SmartON Is Headed

The current system works for the three core use cases we set out to solve. The next capabilities on the roadmap are: color identification (useful for clothing and product selection), public transit navigation integration, and a more sophisticated document analysis mode that can answer specific questions about loaded documents rather than just retrieving passages.

Visit getsmartonai.com to learn more about the current product and upcoming features.

Building assistive AI or accessibility technology? Reach out — this is work I care about deeply and I'm happy to share what we've learned.

