    Client-Side RAG: Running AI in Your Browser

    by Deep Parmar

    CTO at Sunbots Innovations LLP | Director at Xwits Developers Pvt Ltd

    The Case for Serverless AI

    Every production RAG system I'd built before 2024 had the same architecture: a server running the embedding model, a vector database running somewhere, and an LLM API call to generate the final response. This works — until you're building a product where a server is the wrong answer.

    A server is the wrong answer when any of the following holds: the data is personal and shouldn't leave the device; the user may not have reliable internet access; the inference volume is too low to justify server infrastructure cost; or you're building a developer tool or npm package that other developers should be able to embed without setting up backend services.

    All four of these conditions apply to Dhiya NPM, the client-side RAG framework I built. Here's how client-side RAG actually works and when it's the right choice.

    The Browser AI Stack in 2025

    Three technologies made client-side AI practical:

    Transformers.js — A JavaScript port of the Hugging Face Transformers library that runs inference using ONNX Runtime in the browser. It supports a growing catalog of models for embedding, classification, translation, and generation. Critically, it uses the WebAssembly and WebGPU backends for near-native inference speed without any Python or server dependency.

    WebGPU — The browser's new GPU compute API, available in Chrome 113+ and Firefox Nightly. WebGPU unlocks GPU-accelerated tensor operations in the browser — the same operations that make server-side inference fast. A browser with WebGPU can run embedding inference 5–10× faster than CPU-only WASM. For small-to-medium models (under 500M parameters), this brings inference times into the range of 50–200ms, which is usable for interactive applications.
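
    To make the WebGPU-or-WASM decision concrete, here is a minimal sketch of backend selection. The helper name `chooseDevice` and the `NavigatorLike` type are hypothetical and are not taken from Dhiya NPM. The sketch only checks whether the browser exposes `navigator.gpu`; a production check would also request an adapter and fall back if none is returned.

```typescript
// Hypothetical helper (not part of any library): pick an inference backend
// from a navigator-like object so the logic can be exercised outside a browser.
type Device = "webgpu" | "wasm";

interface NavigatorLike {
  gpu?: unknown; // defined when the browser exposes the WebGPU API
}

function chooseDevice(nav: NavigatorLike | undefined): Device {
  // Presence of `gpu` only shows that the API is exposed. A stricter check
  // would also await nav.gpu.requestAdapter() and use "wasm" if it is null.
  return nav !== undefined && nav.gpu !== undefined ? "webgpu" : "wasm";
}

// In a browser this would be called as: chooseDevice(navigator as NavigatorLike)
```

    The returned string could then be handed to whatever loads the model. As far as I know, Transformers.js v3 accepts a `device` option of this kind when creating a pipeline, but treat that wiring as an assumption to verify against the version you use.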

    IndexedDB — The browser's built-in key-value store, capable of storing binary data and queryable with indexes. It serves as the vector store in Dhiya NPM — embeddings are stored as Float32Arrays, and similarity search is computed in JavaScript. Not as fast as a purpose-built vector database, but sufficient for collections under ~50,000 documents.
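
    A quick way to reason about collection size is to estimate the raw footprint of the stored vectors. The sketch below assumes a 384-dimensional embedding model, which is my assumption for illustration; the article does not state the model's dimensionality. It counts only the Float32Array payload, not chunk text, keys, or IndexedDB overhead.

```typescript
// Each Float32Array element occupies 4 bytes.
const BYTES_PER_FLOAT32 = 4;

// Raw size of the vector payload for a collection.
function embeddingStoreBytes(vectorCount: number, dimensions: number): number {
  return vectorCount * dimensions * BYTES_PER_FLOAT32;
}

// Decimal megabytes (1 MB = 1,000,000 bytes).
function toMegabytes(bytes: number): number {
  return bytes / 1000000;
}

// Assumption: 384 dimensions per embedding, 10,000 stored chunks.
const exampleStoreBytes = embeddingStoreBytes(10000, 384);
console.log(toMegabytes(exampleStoreBytes).toFixed(2) + " MB"); // prints "15.36 MB"
```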

    How Client-Side RAG Works

    The RAG pipeline has four steps, all running in the browser:

    Step 1 — Ingestion: User-provided documents (text, PDF, or raw strings) are chunked into overlapping segments. Dhiya NPM uses 512-token chunks with a 50-token overlap by default, which balances context richness against embedding cost. Each chunk is a JavaScript string in memory.
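
    The overlapping-window idea can be sketched as follows. This is an illustration, not Dhiya NPM's code: the function names are hypothetical, and `chunkText` approximates tokens with whitespace-separated words, whereas a real pipeline would count tokens with the embedding model's tokenizer.

```typescript
// Split a token sequence into windows of `size` tokens, where each window
// starts `size - overlap` tokens after the previous one.
function chunkTokens(tokens: string[], size = 512, overlap = 50): string[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("require size > 0 and 0 <= overlap < size");
  }
  const step = size - overlap;
  const chunks: string[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    // Stop once a window has reached the end, to avoid a redundant tail chunk.
    if (start + size >= tokens.length) break;
  }
  return chunks;
}

// Assumption: whitespace-separated words stand in for model tokens.
function chunkText(text: string, size = 512, overlap = 50): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  return chunkTokens(words, size, overlap).map((c) => c.join(" "));
}
```

    With the defaults, a 1,000-token document yields three chunks starting at tokens 0, 462, and 924, so consecutive chunks share 50 tokens.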

    Step 2 — Embedding: Each chunk is embedded using a sentence transformer model loaded via Transformers.js. The model runs in a Web Worker (off the main thread, so the UI stays responsive). On a device with WebGPU support, embedding a 512-token chunk takes approximately 15–40ms. Embeddings are stored as Float32Arrays in IndexedDB.
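
    One optional step before storing an embedding is L2 normalization, which makes cosine similarity at query time equal to a plain dot product. Whether Dhiya NPM does this is not stated here, so the sketch below is an assumption about one reasonable design, with a hypothetical function name.

```typescript
// Return a unit-length copy of the vector (the zero vector is returned as zeros).
function l2Normalize(vec: Float32Array): Float32Array {
  let sumSquares = 0;
  for (let i = 0; i < vec.length; i++) {
    sumSquares += vec[i] * vec[i];
  }
  const norm = Math.sqrt(sumSquares);
  const out = new Float32Array(vec.length);
  if (norm === 0) return out; // avoid dividing by zero
  for (let i = 0; i < vec.length; i++) {
    out[i] = vec[i] / norm;
  }
  return out;
}
```

    Some embedding pipelines can normalize output for you, in which case this step is unnecessary.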

    Step 3 — Retrieval: When a user asks a question, the question is embedded (same model, same pipeline). Cosine similarity between the question embedding and all stored chunk embeddings is computed. The top-k most similar chunks are retrieved. With collections under 10,000 chunks, this brute-force similarity search takes under 5ms in JavaScript.
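
    The brute-force retrieval described above can be sketched in a few lines. The `StoredChunk` and `ScoredChunk` shapes and the function names are hypothetical; they only illustrate scoring every stored embedding against the query and keeping the best k.

```typescript
interface StoredChunk {
  id: string;
  embedding: Float32Array;
}

interface ScoredChunk {
  id: string;
  score: number;
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is zero.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new RangeError("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every chunk against the query, sort descending, keep the first k.
function topK(query: Float32Array, chunks: StoredChunk[], k: number): ScoredChunk[] {
  const scored = chunks.map((c) => ({
    id: c.id,
    score: cosineSimilarity(query, c.embedding),
  }));
  scored.sort((x, y) => y.score - x.score);
  return scored.slice(0, Math.max(0, k));
}
```

    Sorting the full list is O(n log n), which is fine at the collection sizes discussed here; a bounded heap would be the usual swap-in if k is small and n grows large.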

    Step 4 — Generation: The retrieved chunks plus the user's question are assembled into a prompt. This prompt is passed to a local LLM (if WebGPU is available and the user has Chrome AI or a local model loaded) or returned as context for the developer to send to their preferred LLM API. Dhiya NPM supports Chrome's built-in AI API when available, falling back gracefully to developer-provided generation.
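
    Prompt assembly is simple string construction. The template below is a hypothetical example, not the prompt Dhiya NPM uses; the instruction wording and the numbered-context layout are assumptions chosen for illustration.

```typescript
// Combine retrieved chunks and the user's question into a single prompt string.
function buildPrompt(question: string, contextChunks: string[]): string {
  // Number each chunk so the model (or a reader) can refer back to it.
  const context = contextChunks
    .map((text, i) => `[${i + 1}] ${text}`)
    .join("\n\n");
  return [
    "Answer the question using only the context below.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

    The resulting string can go to a local model when one is available, or be returned to the caller to send to an LLM API of their choice.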

    Performance Benchmarks

    On a MacBook Pro M2 (Chrome 124 with WebGPU):

    • Model load time (first use): ~2.3 seconds (cached from second use onward)
    • Embedding per 512-token chunk: ~18ms
    • Ingesting a 100-page PDF document: ~12 seconds end-to-end
    • Retrieval from 1,000-chunk collection: ~4ms
    • Total time from question to retrieved context: ~25ms

    On a budget Android device (Snapdragon 662, Chrome 124 with WASM fallback):

    • Embedding per chunk: ~180ms (no WebGPU, CPU WASM)
    • Retrieval: ~15ms
    • Ingesting a 100-page document: ~90 seconds

    The WASM fallback is viable for small document collections but not for real-time ingestion of large documents on lower-end hardware. Dhiya NPM handles this by deferring ingestion to an idle callback and showing progress feedback.
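
    One way to structure deferred ingestion with progress feedback is to split the chunk list into small batches up front and process one batch per idle period. The sketch below covers only the planning step; `planBatches` and `IngestBatch` are hypothetical names, and how Dhiya NPM schedules its work internally is not described here.

```typescript
interface IngestBatch {
  start: number;    // index of the first chunk in the batch (inclusive)
  end: number;      // index one past the last chunk in the batch (exclusive)
  progress: number; // fraction of all chunks done once this batch completes
}

// Divide `totalChunks` chunks into consecutive batches of at most `batchSize`.
function planBatches(totalChunks: number, batchSize: number): IngestBatch[] {
  if (batchSize <= 0) throw new RangeError("batchSize must be positive");
  const batches: IngestBatch[] = [];
  for (let start = 0; start < totalChunks; start += batchSize) {
    const end = Math.min(start + batchSize, totalChunks);
    batches.push({ start, end, progress: end / totalChunks });
  }
  return batches;
}
```

    In a browser, the caller would take one batch at a time, embed chunks `start` through `end - 1`, update a progress indicator from `progress`, and schedule the next batch with something like `requestIdleCallback` where it is supported.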

    When Client-Side RAG Is the Wrong Choice

    Client-side RAG is not universally better. Avoid it when your embeddings exceed roughly 50MB (IndexedDB quotas vary by browser and device, and brute-force search slows as the collection grows), you need cross-device synchronization (browser storage is local to one device), you need enterprise-grade retrieval quality with re-ranking and hybrid search, or your user base is primarily on older devices without WebGPU support.

    For these cases, a server-side RAG system with a proper vector database is the right answer. Client-side RAG is best for developer tools, privacy-sensitive personal AI assistants, and products where zero infrastructure cost is a design constraint.

    Dhiya NPM is open source and available on npm.
