What is client-side RAG?

Client-side RAG runs the entire retrieval-augmented generation pipeline — embeddings, vector search, and LLM inference — inside the user's browser, with no server roundtrip. It uses libraries like Transformers.js for embeddings, IndexedDB for the vector store, and small in-browser LLMs accelerated by WebGPU.

Can a RAG chatbot run entirely in the browser?

Yes. With Transformers.js for embeddings, IndexedDB as a vector store, and WebGPU-accelerated inference, a complete RAG chatbot can run in the browser with no backend, no API keys, and no recurring cost. The trade-off is initial model download size and reliance on the user's device for compute.

What are the benefits of browser-based AI?

Browser-based AI offers strong privacy (user data never leaves the device), zero per-request cost after the model loads, offline operation, and easy distribution as a static site. It is especially attractive for privacy-sensitive applications and for products that cannot justify per-user inference cost.

What are the limitations of client-side AI?

Client-side AI is constrained by the model size that fits in device memory, slower inference than server GPUs, a large initial download for model weights, and variable performance across devices. For very large models or very latency-sensitive tasks, server-side inference is still the better choice.

Is client-side RAG production-ready in 2025?

Client-side RAG is production-ready for use cases that fit smaller models — knowledge-base chatbots, document Q&A, privacy-first assistants. Dhiya NPM is one example of a framework that makes this practical. For agent workflows or large-context reasoning, server-side LLMs are still required.

Client-Side RAG: AI Without a Server

The Case for Serverless AI

Every production RAG system I'd built before 2024 had the same architecture: a server running the embedding model, a vector database running somewhere, and an LLM API call to generate the final response. This works — until you're building a product where a server is the wrong answer.

A server is the wrong answer for any product where the data is personal and shouldn't leave the device, the user may not have reliable internet access, the inference volume is too low to justify server infrastructure cost, or the product is a developer tool or NPM package that other developers should be able to embed without setting up backend services.

All four of these conditions apply to Dhiya NPM, the client-side RAG framework I built. Here's how client-side RAG actually works and when it's the right choice.

The Browser AI Stack in 2025

Three technologies made client-side AI practical:

Transformers.js: A JavaScript port of the Hugging Face Transformers library that runs inference in the browser on ONNX Runtime. It supports a growing catalog of models for embedding, classification, translation, and generation. Critically, it runs on the WebAssembly and WebGPU backends, which gives near-native inference speed without any Python or server dependency.

WebGPU: The browser's new GPU compute API, enabled by default in Chrome 113+ on desktop and available in Firefox Nightly. WebGPU unlocks GPU-accelerated tensor operations in the browser, the same operations that make server-side inference fast. A browser with WebGPU can run embedding inference 5–10× faster than CPU-only WASM. For small-to-medium models (under 500M parameters), this brings inference times into the range of 50–200ms, which is usable for interactive applications.

IndexedDB: The browser's built-in key-value store, which can hold binary data and be queried through indexes. It serves as the vector store in Dhiya NPM: embeddings are stored as Float32Arrays, and similarity search is computed in JavaScript. It is not as fast as a purpose-built vector database, but it is sufficient for collections under ~50,000 documents.
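
As a rough illustration of how an app might choose between the GPU and WASM paths described above, the sketch below probes for a WebGPU adapter and falls back to WASM when none is found. The names `pickBackend`, `Backend`, `GpuLike`, and `NavigatorLike` are hypothetical and are not part of Transformers.js or Dhiya NPM; the small structural types only exist so the example compiles without WebGPU typings.

```typescript
type Backend = "webgpu" | "wasm";

// Minimal structural types (assumptions) standing in for the browser's WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Returns "webgpu" only when an adapter can actually be obtained.
// `navigator.gpu` may exist on machines where requestAdapter() resolves to null.
async function pickBackend(nav: NavigatorLike | undefined): Promise<Backend> {
  if (!nav || !nav.gpu) return "wasm";
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    // Treat any probing failure as "no GPU" and use the CPU path.
    return "wasm";
  }
}
```

In a page you would pass the global `navigator` (cast to `NavigatorLike` if your project has no WebGPU typings) and hand the result to whatever loads the model.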

How Client-Side RAG Works

The RAG pipeline has four steps, all running in the browser:

Step 1 — Ingestion: User-provided documents (text, PDF, or raw strings) are chunked into overlapping segments. Dhiya NPM uses 512-token chunks with a 50-token overlap by default, which balances context richness against embedding cost. Each chunk is a JavaScript string in memory.
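
To make the overlap in Step 1 concrete, here is a minimal sliding-window chunker. It is an illustrative sketch, not Dhiya NPM's source: `chunkTokens` and `chunkText` are hypothetical names, and whitespace splitting stands in for the embedding model's real subword tokenizer, so the 512 and 50 here count words rather than true tokens.

```typescript
// Splits a token sequence into windows of `chunkSize` tokens,
// where each window repeats the last `overlap` tokens of the previous one.
function chunkTokens(tokens: string[], chunkSize = 512, overlap = 50): string[][] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("need chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const stride = chunkSize - overlap;
  const chunks: string[][] = [];
  for (let start = 0; start < tokens.length; start += stride) {
    chunks.push(tokens.slice(start, start + chunkSize));
    // Stop once a window has reached the end, to avoid a redundant tail chunk.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}

// Assumption: whitespace-separated words approximate tokens.
function chunkText(text: string, chunkSize = 512, overlap = 50): string[] {
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  return chunkTokens(tokens, chunkSize, overlap).map((c) => c.join(" "));
}
```

With the defaults, each new chunk starts 462 tokens after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.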

Step 2 — Embedding: Each chunk is embedded using a sentence transformer model loaded via Transformers.js. The model runs in a Web Worker (off the main thread, so the UI stays responsive). On a device with WebGPU support, embedding a 512-token chunk takes approximately 15–40ms. Embeddings are stored as Float32Arrays in IndexedDB.
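
The embedding step can be sketched independently of any particular model by injecting the embedder as a function. Everything named here is an assumption for illustration: `Embedder`, `ChunkRecord`, and `embedChunks` are hypothetical, and in a real app the `embed` argument would wrap the sentence transformer loaded through Transformers.js, typically inside the Web Worker.

```typescript
// Hypothetical record shape; typed arrays such as Float32Array can be
// written to IndexedDB directly because they are structured-cloneable.
interface ChunkRecord {
  id: number;
  text: string;
  embedding: Float32Array;
}

// Hypothetical embedder signature: text in, one vector out.
type Embedder = (text: string) => Promise<Float32Array>;

// Embeds chunks one at a time and checks that every vector has the same
// dimensionality, since mixed dimensions would break similarity search later.
async function embedChunks(chunks: string[], embed: Embedder): Promise<ChunkRecord[]> {
  const records: ChunkRecord[] = [];
  let dimensions = -1;
  for (let id = 0; id < chunks.length; id++) {
    const text = chunks[id] ?? "";
    const embedding = await embed(text);
    if (dimensions === -1) dimensions = embedding.length;
    if (embedding.length !== dimensions) {
      throw new Error(`chunk ${id}: expected ${dimensions} dimensions, got ${embedding.length}`);
    }
    records.push({ id, text, embedding });
  }
  return records;
}
```

Keeping the embedder behind a function also makes the pipeline easy to test with a fake model, and lets the same code run against the WebGPU or WASM backend.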

Step 3 — Retrieval: When a user asks a question, the question is embedded (same model, same pipeline). Cosine similarity between the question embedding and all stored chunk embeddings is computed. The top-k most similar chunks are retrieved. With collections under 10,000 chunks, this brute-force similarity search takes under 5ms in JavaScript.
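
A brute-force search of this kind fits in a few lines. The sketch below is one plausible shape for it, not Dhiya NPM's actual code; `StoredChunk`, `ScoredChunk`, `cosineSimilarity`, and `topK` are hypothetical names.

```typescript
interface StoredChunk {
  text: string;
  embedding: Float32Array;
}

interface ScoredChunk {
  text: string;
  score: number;
}

// cos(a, b) = (a · b) / (|a| |b|); returns 0 if either vector has zero length.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Scores every stored chunk against the query and keeps the k best.
function topK(query: Float32Array, chunks: StoredChunk[], k: number): ScoredChunk[] {
  return chunks
    .map((c) => ({ text: c.text, score: cosineSimilarity(query, c.embedding) }))
    .sort((p, q) => q.score - p.score)
    .slice(0, k);
}
```

The cost grows linearly with the number of chunks times the embedding dimension, which is why a linear scan stays cheap for small collections and becomes the bottleneck for large ones.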

Step 4 — Generation: The retrieved chunks plus the user's question are assembled into a prompt. This prompt is passed to a local LLM (if WebGPU is available and the user has Chrome AI or a local model loaded) or returned as context for the developer to send to their preferred LLM API. Dhiya NPM supports Chrome's built-in AI API when available, falling back gracefully to developer-provided generation.
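
The "generate locally or hand back the context" behaviour in Step 4 might look roughly like this. It is a sketch under stated assumptions: `Generate`, `RagResult`, `buildPrompt`, and `answerOrContext` are hypothetical names, the prompt wording is made up for the example, and `generate` stands for whatever local model wrapper the app has available.

```typescript
// Hypothetical generator signature: wraps a local model or a remote API.
type Generate = (prompt: string) => Promise<string>;

type RagResult =
  | { kind: "answer"; text: string }
  | { kind: "context"; prompt: string };

// Numbers each retrieved chunk so the model (or a reader) can refer back to it.
function buildPrompt(question: string, chunks: string[]): string {
  const context = chunks.map((chunk, i) => `[${i + 1}] ${chunk}`).join("\n\n");
  return [
    "Answer the question using only the context below.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Generates an answer when a local generator exists; otherwise returns the
// assembled prompt so the caller can send it to an LLM of their choice.
async function answerOrContext(
  question: string,
  chunks: string[],
  generate?: Generate,
): Promise<RagResult> {
  const prompt = buildPrompt(question, chunks);
  if (!generate) return { kind: "context", prompt };
  return { kind: "answer", text: await generate(prompt) };
}
```

Returning a tagged result rather than a bare string makes the fallback explicit, so calling code cannot mistake an unanswered prompt for a model response.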

Performance Benchmarks

On a MacBook Pro M2 (Chrome 124 with WebGPU):

Model load time (first use): ~2.3 seconds (cached from second use onward)
Embedding per 512-token chunk: ~18ms
Ingesting a 100-page document (PDF): ~12 seconds end-to-end
Retrieval from 1,000-chunk collection: ~4ms
Total time from question to retrieved context: ~25ms

On a budget Android device (Snapdragon 662, Chrome 124 with WASM fallback):

Embedding per chunk: ~180ms (no WebGPU, CPU WASM)
Retrieval: ~15ms
Ingesting a 100-page document: ~90 seconds

The WASM fallback is viable for small document collections but not for real-time ingestion of large documents on lower-end hardware. Dhiya NPM handles this by deferring ingestion to an idle callback and showing progress feedback.
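
One way to structure that kind of deferred ingestion is to process chunks in small batches and yield to the browser between batches. The sketch below illustrates the general pattern and is not Dhiya NPM's implementation; `Schedule`, `idleSchedule`, and `ingestWhenIdle` are hypothetical names, and the `setTimeout` fallback is an assumption for environments without `requestIdleCallback`.

```typescript
type Schedule = (task: () => void) => void;

// Uses requestIdleCallback where the environment provides it, setTimeout otherwise.
function idleSchedule(): Schedule {
  const g = globalThis as unknown as {
    requestIdleCallback?: (cb: () => void) => unknown;
  };
  if (typeof g.requestIdleCallback === "function") {
    return (task) => {
      g.requestIdleCallback?.(task);
    };
  }
  return (task) => {
    setTimeout(task, 0);
  };
}

// Processes items in batches, waiting for a scheduled slot before each batch
// and reporting progress after it.
async function ingestWhenIdle<T>(
  items: T[],
  batchSize: number,
  processItem: (item: T) => Promise<void>,
  onProgress: (done: number, total: number) => void,
  schedule: Schedule = idleSchedule(),
): Promise<void> {
  if (batchSize <= 0) throw new RangeError("batchSize must be positive");
  for (let start = 0; start < items.length; start += batchSize) {
    // Yield first so rendering and input handling are not starved.
    await new Promise<void>((resolve) => schedule(resolve));
    const batch = items.slice(start, start + batchSize);
    for (const item of batch) {
      await processItem(item);
    }
    onProgress(Math.min(start + batchSize, items.length), items.length);
  }
}
```

Smaller batches keep the page more responsive at the cost of a longer total ingestion time, so the batch size is worth tuning per device class.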

When Client-Side RAG Is the Wrong Choice

Client-side RAG is not universally better. Avoid it when: your document collection exceeds ~50MB of embeddings (IndexedDB storage isn't unlimited), you need cross-device synchronization (browser storage is local), you need enterprise-grade retrieval quality with re-ranking and hybrid search, or your user base is primarily on older devices without WebGPU support.
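
To see how quickly an embedding budget like that fills up, the arithmetic is simply chunks × dimensions × 4 bytes per float32. The helpers below are a back-of-the-envelope sketch with hypothetical names; the 384-dimension figure in the usage note is an assumption (a common size for small sentence-embedding models), not a number taken from Dhiya NPM.

```typescript
const BYTES_PER_FLOAT32 = 4;

// Raw size of the vectors alone, ignoring chunk text and IndexedDB overhead.
function embeddingBytes(chunkCount: number, dimensions: number): number {
  return chunkCount * dimensions * BYTES_PER_FLOAT32;
}

// How many whole vectors fit inside a byte budget.
function maxChunksWithin(budgetBytes: number, dimensions: number): number {
  return Math.floor(budgetBytes / (dimensions * BYTES_PER_FLOAT32));
}
```

Assuming 384-dimensional vectors, each embedding takes 1,536 bytes, so `maxChunksWithin(50 * 1000 * 1000, 384)` works out to about 32,500 chunks before the stored text itself is counted.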

For these cases, a server-side RAG system with a proper vector database is the right answer. Client-side RAG is best for developer tools, privacy-sensitive personal AI assistants, and products where zero infrastructure cost is a design constraint.

Dhiya NPM is open source and available on npm. Read the full introduction or jump straight to the build tutorial.
