
Building SmartON: Assistive AI for the Visually Impaired
by Deep Parmar
CTO at Sunbots Innovations LLP | Director at Xwits Developers Pvt Ltd

Starting with the Right Question
Most assistive technology is built by engineers who imagine what visually impaired users need. SmartON started differently: I spent three weeks talking to visually impaired users in Ahmedabad before writing a single line of code.
The question I asked wasn't "what AI features do you want?" It was: "What do you struggle with today that you believe technology could solve?" The answers clustered around three problems that existing tools handled poorly: identifying Indian currency (existing apps were inaccurate on worn notes and failed entirely in dim light), understanding physical space (navigation apps describe routes, not the immediate physical environment), and accessing printed documents (OCR apps existed but required sighted assistance to aim the camera correctly).
These three problems became SmartON's three core capabilities. Everything else was deferred until we'd solved these well.
The Technical Architecture
SmartON is an Android application connected to a USB camera, designed to work with a Jetson Nano edge computing unit. The hardware choice was deliberate: running inference on the Jetson, rather than on the phone or in the cloud, gave us a balance of latency and model capability that neither a purely mobile setup nor a cloud setup could achieve.
The four AI components, and what they do:
- Currency detection: A YOLO-based object detection model trained on a custom dataset of 8,000+ images of Indian rupee notes across denominations, lighting conditions, and states of wear. Accuracy: 98.7% in production. Inference time: ~45ms on Jetson Nano.
- Scene understanding: A vision-language model that converts a camera frame into an action-oriented description. The model is prompted to prioritize navigation-relevant information (obstacles, entrances, distances) over a comprehensive inventory of the scene.
- OCR: A two-stage pipeline: first detect whether text is present and approximately where, then run a specialized OCR model on the detected text regions. This two-stage approach is significantly faster than running full-image OCR on every frame.
- Document search: A RAG pipeline where users can load documents into a local index and query them by voice. Documents are embedded offline; queries retrieve the relevant passages.
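To make the document search idea concrete, here is a minimal sketch of the retrieval half of such a pipeline: passages are embedded once when a document is loaded, and a query returns the closest passages. The word-count "embedding", the `LocalIndex` class, and the sample passages are all illustrative assumptions; a real system would presumably use a learned embedding model, and this is not SmartON's actual code.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class LocalIndex:
    """Hypothetical in-memory index: embed at load time, search at query time."""

    def __init__(self):
        self._passages = []  # list of (passage text, vector) pairs

    def add_document(self, passages):
        # Embedding happens here, offline, not when the user asks a question.
        for passage in passages:
            self._passages.append((passage, embed(passage)))

    def query(self, question: str, top_k: int = 1):
        q = embed(question)
        scored = [(cosine(q, vec), text) for text, vec in self._passages]
        scored = [(score, text) for score, text in scored if score > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]


index = LocalIndex()
index.add_document([
    "Electrolysis splits water into hydrogen and oxygen using electric current.",
    "Photosynthesis converts light into chemical energy in plants.",
    "Newton's laws describe the motion of objects.",
])
print(index.query("section about electrolysis"))
```

Keeping the embedding step in `add_document` is what lets the query path stay cheap: at question time only the short query is embedded and compared.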
The Voice Interface Design
The voice interface is where most of SmartON's complexity lives. MIRA, the voice layer, needs to understand a spoken request, determine which of the four capabilities should handle it, execute that capability, and return a response that is useful without being overwhelming.
The routing problem is harder than it sounds. "What is in front of me?" routes to scene understanding. "What does this paper say?" routes to OCR. "Find the section about electrolysis in my chemistry notes" routes to document search. These distinctions are clear when written down; they're ambiguous when spoken in mixed languages with varying phrasing.
We use a fine-tuned intent classification model for routing — trained on 2,000 example utterances in Gujarati, Hindi, and English, covering the full range of how users actually phrase requests. The classifier runs locally in under 30ms, adding minimal latency to the response pipeline.
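The routing step can be sketched as a classifier followed by a dispatch table. The sketch below swaps the fine-tuned model for a simple English keyword scorer so that it stays self-contained; the cue words, intent names, and the `unknown` fallback are assumptions for illustration, and it does not attempt the Gujarati and Hindi coverage the real classifier has.

```python
import re
from typing import Callable, Dict, Set

# Hypothetical cue words per capability (a stand-in for the trained classifier).
INTENT_CUES: Dict[str, Set[str]] = {
    "currency": {"rupee", "rupees", "money", "currency", "denomination"},
    "ocr": {"read", "say", "says", "written", "paper"},
    "document_search": {"find", "search", "section", "notes"},
    "scene": {"front", "around", "ahead", "obstacle"},
}


def classify(utterance: str) -> str:
    """Return the intent whose cue words overlap most with the utterance."""
    tokens = set(re.findall(r"[a-z]+", utterance.lower()))
    scores = {intent: len(tokens & cues) for intent, cues in INTENT_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def route(utterance: str, handlers: Dict[str, Callable[[str], str]]) -> str:
    """Dispatch the utterance to the handler registered for its intent."""
    handler = handlers.get(classify(utterance), handlers["unknown"])
    return handler(utterance)


handlers = {
    "scene": lambda u: "scene understanding",
    "ocr": lambda u: "OCR",
    "document_search": lambda u: "document search",
    "currency": lambda u: "currency detection",
    "unknown": lambda u: "Please say that again.",
}

for request in (
    "What is in front of me?",
    "What does this paper say?",
    "Find the section about electrolysis in my chemistry notes",
):
    print(request, "->", route(request, handlers))
```

Separating `classify` from `route` means the keyword scorer can be replaced by a trained model without touching the handlers.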
Response design follows one rule: tell the user the next action, not everything the system knows. "Two hundred rupees, portrait orientation" is better than "The system has detected a two-hundred-rupee note with denomination markers visible at 0.97 confidence, oriented in portrait mode with the face side facing the camera." The first response takes under 2 seconds to say. The second takes 8 seconds and overwhelms the user with information they don't need.
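As a rough illustration of that rule, the function below turns a detection result into the short spoken form and keeps the confidence score out of the response. The function name, the denomination table, the 0.8 threshold, and the fallback wording are assumptions made for this sketch, not details of SmartON.

```python
# Hypothetical mapping from detected denomination to spoken words.
DENOMINATION_WORDS = {
    10: "Ten",
    20: "Twenty",
    50: "Fifty",
    100: "One hundred",
    200: "Two hundred",
    500: "Five hundred",
}


def spoken_response(denomination: int, orientation: str, confidence: float,
                    min_confidence: float = 0.8) -> str:
    """Build the shortest useful utterance; confidence gates it but is never spoken."""
    if confidence < min_confidence or denomination not in DENOMINATION_WORDS:
        # Assumed fallback: give the user a next action rather than raw details.
        return "Not sure. Hold the note steady and try again."
    return f"{DENOMINATION_WORDS[denomination]} rupees, {orientation} orientation"


print(spoken_response(200, "portrait", 0.97))
```

The confidence value still matters, but only as a gate on what is said, never as something read aloud.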
What We Got Wrong the First Time
Two significant mistakes in v1:
Feedback latency: We initially waited for the full inference result before speaking. For currency detection, this meant 45ms of silence, then the answer. User testing showed that a streaming response ("detected... two hundred rupees") felt dramatically more responsive even though the total time was similar. We rebuilt the audio pipeline to stream partial results.
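One way to picture the streaming change is a generator that hands the speech layer an acknowledgement first and the answer once inference returns. This is a minimal sketch under assumed names (`stream_currency_response`, `speak_all`, and the `detect` and `speak` callbacks are hypothetical), and exactly when the first chunk is emitted in the real pipeline is not stated in this article.

```python
from typing import Callable, Iterable, Iterator


def stream_currency_response(detect: Callable[[], int]) -> Iterator[str]:
    """Yield speech chunks as they become available instead of one final string."""
    yield "detected"            # spoken immediately, before inference completes
    denomination = detect()     # blocking inference call (hypothetical)
    yield f"{denomination} rupees"


def speak_all(chunks: Iterable[str], speak: Callable[[str], None]) -> None:
    """Send each chunk to the speech output as soon as it is produced."""
    for chunk in chunks:
        speak(chunk)


events = []


def fake_detect() -> int:
    events.append("inference")
    return 200


speak_all(stream_currency_response(fake_detect),
          lambda chunk: events.append("say:" + chunk))
print(events)
```

Because the generator is lazy, the first chunk reaches the speaker before the inference call runs, which is the ordering that makes the response feel immediate.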
Scene description verbosity: The first scene understanding model described everything it could see. Users found this exhausting. A scene with seven objects generated a 12-second audio description before they could do anything with it. We added a prompt constraint: "Describe only the 2–3 most actionable elements for navigation" and rebuilt the evaluation against this criterion. Response length dropped by 70%; user satisfaction increased significantly.
Where SmartON Is Headed
The current system works for the three core use cases we set out to solve. The next capabilities on the roadmap are: color identification (useful for clothing and product selection), public transit navigation integration, and a more sophisticated document analysis mode that can answer specific questions about loaded documents rather than just retrieving passages.
Visit getsmartonai.com to learn more about the current product and upcoming features.
Building assistive AI or accessibility technology? Reach out — this is work I care about deeply and I'm happy to share what we've learned.
