    Building an AI Product in 90 Days: Lessons from SmartON

    by Deep Parmar

    CTO at Sunbots Innovations LLP | Director at Xwits Developers Pvt Ltd

    The Starting Point

    In early 2023, I sat in a room with two questions: how do visually impaired users navigate daily financial transactions, and what would it take to build something that genuinely helps? The second question had a constraint attached — we had 90 days before a demo commitment I'd already made.

    SmartON started as an Android app that identified Indian currency notes when the user pointed a camera at them. It shipped as a multilingual, multimodal assistive AI that handles currency detection, scene understanding, OCR, and document search, all routed through a voice interface that works in Gujarati, Hindi, and English. Here's how those 90 days actually went.

    Week 1–2: Define the Minimum Viable Capability

    The first thing we did was resist scope. Every conversation about SmartON generated new feature ideas — text-to-speech with personality, a social layer for sharing descriptions, integration with navigation apps. All of them were interesting. None of them were the core problem.

    We forced ourselves to answer one question: what is the single most important thing this product needs to do for a visually impaired user to consider it genuinely useful on day one? The answer was currency detection — the ability to identify Indian rupee denominations reliably, fast, and without an internet connection.

    Everything else was Phase 2. This decision saved us from an 18-month build and let us focus engineering resources on making one thing excellent instead of five things mediocre.

    Week 3–6: Data Collection and Model Training

    We quickly hit the first major obstacle: there was no public dataset of Indian currency notes under realistic conditions — varied lighting, worn notes, partial views, different camera angles. We had to build our own.

    We spent three weeks photographing notes in every condition we could manufacture: bright sunlight, dim indoor lighting, crumpled notes, notes partially covered by fingers. We collected ~8,000 images, labeled them carefully, and applied aggressive augmentation to expand the effective dataset.
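
    As a rough illustration of what augmentation for these conditions can look like, here is a minimal sketch in plain Python. It is not the pipeline we used: the function names, the brightness range, and the occlusion size are all hypothetical, and a tiny grayscale grid stands in for a real photo. It covers two of the conditions mentioned above, varied lighting and a note partially covered by a finger.

```python
import random

def adjust_brightness(image, factor):
    """Scale every pixel by `factor`, clamping to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]

def occlude(image, top, left, height, width, fill=0):
    """Cover a rectangle with a flat value, a crude stand-in for a finger."""
    out = [row[:] for row in image]  # copy so the source image is untouched
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = fill
    return out

def augment(image, rng, copies=4):
    """Return `copies` randomized variants of one grayscale image (list of rows)."""
    rows, cols = len(image), len(image[0])
    h, w = max(1, rows // 3), max(1, cols // 3)  # hypothetical occlusion size
    variants = []
    for _ in range(copies):
        # Hypothetical range: 0.4 approximates dim light, 1.6 harsh sunlight.
        variant = adjust_brightness(image, rng.uniform(0.4, 1.6))
        top = rng.randrange(rows - h + 1)
        left = rng.randrange(cols - w + 1)
        variants.append(occlude(variant, top, left, h, w))
    return variants

note = [[120] * 6 for _ in range(6)]  # stand-in for a real photo
augmented = augment(note, random.Random(0))
```

    A real pipeline would work on full-colour images and add rotation, blur, and perspective changes, typically through an image library rather than nested lists.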

    The model training itself took two iterations. Our first YOLO-based model hit 94% accuracy in testing, which sounds good until you realize that a 6% error rate means roughly 1 in 17 notes is misidentified. For a financial transaction tool, that is not acceptable. We went back to the data, identified the failure modes (worn serial numbers, low-light conditions), collected targeted examples, and retrained. Second iteration: 98.7% accuracy. We shipped that.
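
    The "1 in N" framing is simple arithmetic on the error rate. The sketch below spells it out for both iterations; the helper name `one_in_n` is just illustrative, and it treats the reported accuracy as a per-note error rate, which is a simplification.

```python
def one_in_n(accuracy):
    """Convert an accuracy between 0 and 1 into 'one error per N predictions'."""
    error_rate = 1.0 - accuracy
    if error_rate <= 0:
        raise ValueError("accuracy must be below 1.0")
    return 1.0 / error_rate

for label, acc in [("first model", 0.94), ("retrained model", 0.987)]:
    print(f"{label}: about 1 in {round(one_in_n(acc))} notes misidentified")
```

    At 94% that is about 1 note in 17; at 98.7% it is about 1 in 77.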

    Week 7–10: Building the Android App and Voice Layer

    The model was the hardest part. The Android integration was hard in a different way — predictable engineering challenges rather than research uncertainty.

    We chose TensorFlow Lite for on-device inference because the alternative (sending camera frames to a server for inference) would add 300–800 ms of latency, which is unacceptable for a real-time tool. Getting on-device inference to run at 30 fps required model quantization, careful memory management, and a custom camera preview pipeline.
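
    To put those latency numbers next to the frame rate, here is a small back-of-the-envelope calculation. The function names are illustrative; the 30 fps target and the 300–800 ms range come from the paragraph above.

```python
def frame_interval_ms(fps):
    """Time between consecutive camera frames, in milliseconds."""
    return 1000.0 / fps

def frames_missed(latency_ms, fps):
    """Number of frame intervals that pass while one result is in flight."""
    return latency_ms / frame_interval_ms(fps)

FPS = 30
print(f"per-frame budget at {FPS} fps: {frame_interval_ms(FPS):.1f} ms")
for round_trip_ms in (300, 800):  # server latency range quoted in the text
    frames = round(frames_missed(round_trip_ms, FPS))
    print(f"{round_trip_ms} ms round trip spans about {frames} frames")
```

    At 30 fps each frame has a budget of roughly 33 ms, so a server round trip would lag the camera by somewhere between 9 and 24 frames.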

    The voice layer came next. We needed text-to-speech that felt natural in three languages and speech recognition that worked in noisy environments. We used Android's native speech APIs for recognition and a custom TTS pipeline for output, with language detection handled by a small classification model running locally.

    Week 11–13: User Testing and the Pivots

    We put the app in front of five visually impaired users in week 11. What we learned in two days of testing reshaped 20% of the product.

    The biggest surprise: users wanted the app to tell them the note's orientation, not just its denomination. "Two hundred rupees" is useful. "Two hundred rupees, face side up, portrait orientation" is what you need when you're returning the note to your wallet correctly.

    The second surprise: the voice feedback was too slow. We were generating the response after full inference. Users wanted to hear confirmation before the full inference completed — a partial result that could be corrected if wrong was better than silence followed by a correct answer 400ms later. We rebuilt the feedback loop to stream partial results.
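
    One way to picture that feedback loop is the sketch below. It is an assumption-laden illustration, not our production code: `fake_inference` is a hypothetical stand-in for a model that refines its answer over time, and `speak` is any callable that voices a string. The idea is to speak the first guess immediately and speak again only if a later stage disagrees.

```python
def announce(stages, speak):
    """Speak the first guess at once; speak again only when a later stage differs."""
    spoken = None
    for label in stages:
        if label != spoken:
            prefix = "" if spoken is None else "Correction: "
            speak(prefix + label)
            spoken = label
    return spoken  # the final label after all stages have reported

def fake_inference(labels):
    """Hypothetical model that yields a quick guess first, then refined results."""
    for label in labels:
        yield label

# Case 1: the quick guess was right, so the user hears it only once.
heard_ok = []
announce(fake_inference(["two hundred rupees", "two hundred rupees"]), heard_ok.append)

# Case 2: the quick guess was wrong, so a correction follows.
heard_fix = []
final = announce(fake_inference(["five hundred rupees", "two hundred rupees"]), heard_fix.append)
```

    In the first case the user hears one announcement with no added delay; in the second they hear the early guess followed by "Correction: two hundred rupees".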

    We also scrapped the first version of a feature we'd spent two weeks building: a scene description mode that described everything in the camera view. Users found it overwhelming. They wanted specific, actionable information ("glass door slightly left, push bar at waist height"), not a comprehensive inventory of the scene. We scoped it down aggressively.

    What I'd Do Differently

    Three things I wish we'd done earlier:

    User testing in week 3, not week 11. Roughly 20% of the product was built on wrong decisions that we discovered only at the end. Earlier user testing would have caught the orientation requirement and the feedback latency issue before we built around the wrong assumptions.

    Set accuracy thresholds before training, not after. We trained to "as good as we can get" rather than "98% or we iterate." Having a target upfront would have given the team clearer stopping criteria.

    Simpler architecture first. We over-engineered the scene description module from the start. Starting with a simpler approach and adding complexity only when the simpler approach provably failed would have saved a week.
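
    The second point, fixing the threshold before training, can be made concrete with a small sketch. The 98% bar is the one quoted above; the round cap, the function names, and the `train_round` callback are hypothetical, and the accuracies fed in are simply the two results reported earlier in this article.

```python
TARGET_ACCURACY = 0.98  # the "98% or we iterate" bar, decided up front
MAX_ROUNDS = 5          # hypothetical cap so the loop cannot run forever

def train_until_target(train_round, target=TARGET_ACCURACY, max_rounds=MAX_ROUNDS):
    """Run training rounds until held-out accuracy meets `target` or rounds run out.

    `train_round(n)` is assumed to train round n and return its test accuracy.
    """
    history = []
    for round_number in range(1, max_rounds + 1):
        accuracy = train_round(round_number)
        history.append(accuracy)
        if accuracy >= target:
            return True, history   # clear stopping criterion reached
    return False, history          # target missed: escalate rather than ship

reported = {1: 0.94, 2: 0.987}     # the two iterations described above
met, history = train_until_target(lambda n: reported[n])
```

    With those two results the loop stops after the second round, because 98.7% clears the 98% target while 94% does not.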

    SmartON is now live at getsmartonai.com. If you're building an assistive AI product or want to talk through your AI build timeline, reach out.
