What is Transformers.js?

Transformers.js is a JavaScript library from Hugging Face that runs transformer models directly in the browser or in Node.js. It supports embeddings, classification, text generation, and image and audio models, all without a Python backend or server-side inference.

Can I run an LLM in the browser with Transformers.js?

Yes. Small to mid-sized LLMs run in the browser with Transformers.js, especially when accelerated by WebGPU. Quantized models in roughly the 1B-7B parameter range can be practical on modern laptops with enough memory; larger models become impractical because of memory and download size.
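One way to act on this is to request WebGPU only where the browser exposes it and fall back to WebAssembly otherwise. The sketch below assumes the `device` option of `pipeline` accepts `'webgpu'` and `'wasm'`, as in version 3 of the library. `pickDevice` is a hypothetical helper, and the presence of `navigator.gpu` does not by itself guarantee a usable GPU adapter.

```javascript
// Hypothetical helper: choose an inference backend from a navigator-like object.
// Checking for `gpu` only shows that the WebGPU API is exposed; a stricter check
// would also await navigator.gpu.requestAdapter() and test the result for null.
function pickDevice(nav) {
  return nav && 'gpu' in nav && nav.gpu ? 'webgpu' : 'wasm';
}

// Intended use in the browser (not executed here):
//   const generator = await pipeline('text-generation', modelId, {
//     device: pickDevice(navigator),
//   });
console.log(pickDevice({ gpu: {} })); // webgpu
console.log(pickDevice({}));          // wasm
```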

What models does Transformers.js support?

Transformers.js supports a wide range of Hugging Face models in ONNX format — embedding models, text classification, sentence transformers, small LLMs, image classifiers, speech recognition, and more. The library is actively expanded to track popular new models.

Is Transformers.js fast enough for production?

For embeddings, classification, and small generative tasks, Transformers.js is production-ready — especially with WebGPU acceleration. For larger LLMs or low-latency interactive generation, server-side inference is still typically faster. The trade-off is privacy and cost vs latency.

How do I use Transformers.js in a React app?

Install @huggingface/transformers from npm (earlier versions were published as @xenova/transformers), lazy-load the model on first use because the weights are large, run inference in a Web Worker to keep the UI responsive, and let the library cache the model in the browser's Cache Storage so users do not re-download it on every visit.
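The lazy-loading step can be sketched without any framework. `createLazyLoader` below is a hypothetical helper, not part of Transformers.js: it wraps any async loader so that the first call starts the load and every later call reuses the same promise.

```javascript
// Hypothetical helper: wrap an async loader so it runs at most once.
// Every caller shares the same promise, so concurrent calls do not
// trigger a second download.
function createLazyLoader(load) {
  let promise = null;
  return function get() {
    if (promise === null) {
      promise = new Promise((resolve) => resolve(load())).catch((err) => {
        promise = null; // allow a retry after a failed load
        throw err;
      });
    }
    return promise;
  };
}

// Intended use (not executed here):
//   const getExtractor = createLazyLoader(() =>
//     pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2'));
//   const extractor = await getExtractor(); // first call loads, later calls reuse
```

In a React component, a function like `getExtractor` would typically be called from an event handler or effect rather than at module load, so the download starts only when the feature is used.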

Transformers.js: LLMs in the Browser

What Transformers.js Does

Transformers.js is a JavaScript library from Hugging Face that lets you run transformer models directly in the browser or Node.js, without any Python or server dependency. It uses ONNX Runtime Web as the inference engine, which means models trained in PyTorch can run in a browser tab after export to ONNX format.

The library mirrors the Transformers Python API — if you've used pipeline('sentiment-analysis') in Python, the browser version works almost identically. This makes it accessible to ML practitioners who know Python and to web developers who've never written a neural network.

Installing and Basic Usage

npm install @huggingface/transformers

The library is about 2MB of JavaScript. Models are downloaded from the Hugging Face Hub on first use and cached in the browser's Cache API — subsequent loads use the cached model instantly.

The simplest usage is the pipeline API:

import { pipeline } from '@huggingface/transformers';

// Feature extraction (embeddings)
const extractor = await pipeline(
  'feature-extraction',
  'Xenova/all-MiniLM-L6-v2'
);
const embedding = await extractor('Hello, world!', {
  pooling: 'mean',
  normalize: true
});
console.log(embedding.data); // Float32Array of 384 dimensions

// Text classification
const classifier = await pipeline(
  'text-classification',
  'Xenova/distilbert-base-uncased-finetuned-sst-2-english'
);
const result = await classifier('I love this product!');
// [{ label: 'POSITIVE', score: 0.9998 }]
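Because the embeddings above are requested with `normalize: true`, comparing two of them reduces to a dot product, which for unit-length vectors equals cosine similarity. The following is a minimal sketch of that comparison; `dot` and `rankBySimilarity` are illustrative names, and the toy vectors stand in for real model output.

```javascript
// Dot product of two equal-length vectors. For unit-length (normalized)
// embeddings this equals their cosine similarity.
function dot(a, b) {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Rank candidate embeddings by similarity to a query embedding (highest first).
function rankBySimilarity(query, candidates) {
  return candidates
    .map((vector, index) => ({ index, score: dot(query, vector) }))
    .sort((x, y) => y.score - x.score);
}

// Toy 2-dimensional unit vectors standing in for real 384-dimensional embeddings.
const query = Float32Array.from([1, 0]);
const docs = [
  Float32Array.from([0, 1]),
  Float32Array.from([0.6, 0.8]),
  Float32Array.from([1, 0]),
];
console.log(rankBySimilarity(query, docs).map((r) => r.index)); // [ 2, 1, 0 ]
```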

Using It in a Web Worker

In the browser, always run Transformers.js in a Web Worker. Model loading and inference are compute-heavy, and if they run on the main thread your UI freezes. The library works inside Web Workers without extra configuration:

// worker.js
import { pipeline } from '@huggingface/transformers';

let extractor = null;

self.onmessage = async (event) => {
  const { type, payload } = event.data;

  if (type === 'init') {
    extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
    self.postMessage({ type: 'ready' });
  }

  if (type === 'embed') {
    const result = await extractor(payload.text, { pooling: 'mean', normalize: true });
    self.postMessage({ type: 'embedding', data: Array.from(result.data) });
  }
};

// main.js
const worker = new Worker(new URL('./worker.js', import.meta.url), { type: 'module' });
worker.postMessage({ type: 'init' });
worker.onmessage = (e) => {
  if (e.data.type === 'ready') {
    worker.postMessage({ type: 'embed', payload: { text: 'Hello!' } });
  }
  if (e.data.type === 'embedding') {
    console.log('Embedding:', e.data.data.slice(0, 5));
  }
};
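Raw `postMessage` calls become awkward once several parts of an app need embeddings. As one possible design, the message protocol above can be wrapped in a promise-based client. `createEmbedClient` is a hypothetical helper: it sends one request at a time so that each `embedding` reply can be matched to the request in flight, and it omits error handling and timeouts for brevity.

```javascript
// Hypothetical helper: wrap a Worker-like object (postMessage + onmessage)
// that speaks the 'init' / 'ready' / 'embed' / 'embedding' protocol shown above.
function createEmbedClient(worker) {
  let pending = null;            // resolver for the request currently in flight
  let queue = Promise.resolve(); // serializes calls to embed()

  const ready = new Promise((resolve) => {
    worker.onmessage = (e) => {
      if (e.data.type === 'ready') resolve();
      if (e.data.type === 'embedding' && pending) {
        const resolveEmbedding = pending;
        pending = null;
        resolveEmbedding(e.data.data);
      }
    };
  });
  worker.postMessage({ type: 'init' });

  function embed(text) {
    // Wait for the model, then send the request and wait for its reply.
    const run = () => ready.then(() => new Promise((resolve) => {
      pending = resolve;
      worker.postMessage({ type: 'embed', payload: { text } });
    }));
    const result = queue.then(run);
    queue = result.catch(() => {}); // keep the chain usable after a failure
    return result;
  }

  return { embed };
}

// Intended use in the browser (not executed here):
//   const worker = new Worker(new URL('./worker.js', import.meta.url), { type: 'module' });
//   const client = createEmbedClient(worker);
//   const vector = await client.embed('Hello!');
```

Serializing requests keeps the worker code unchanged. Tagging each message with a request id would allow requests to overlap, at the cost of changing the protocol on both sides.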

Models That Work Well in the Browser

Not every Hugging Face model works in the browser: it must be available on the Hub in ONNX format. The Xenova namespace, maintained by the library's author at Hugging Face, hosts pre-converted versions of popular models:

Embeddings: Xenova/all-MiniLM-L6-v2 (23MB, 384 dims), Xenova/all-mpnet-base-v2 (418MB, 768 dims)
Classification: Xenova/distilbert-base-uncased-finetuned-sst-2-english (67MB)
Translation: Xenova/opus-mt-en-hi (298MB, English to Hindi)
Speech recognition: Xenova/whisper-tiny (75MB), Xenova/whisper-base (145MB)
Small LLMs: Xenova/LaMini-Flan-T5-783M (783M params, generative)
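To put these sizes in perspective, a rough first-load time can be estimated from file size and connection speed. The sketch below uses sizes from the list above; the 20 Mbps link speed is an assumption chosen for illustration, and protocol overhead and compression are ignored.

```javascript
// Rough download time in seconds for a model of `sizeMB` megabytes over a
// connection of `mbps` megabits per second (1 byte = 8 bits; overhead ignored).
function estimateDownloadSeconds(sizeMB, mbps) {
  if (mbps <= 0) throw new RangeError('mbps must be positive');
  return (sizeMB * 8) / mbps;
}

// Sizes taken from the list above; the 20 Mbps link speed is an assumption.
const models = [
  ['Xenova/all-MiniLM-L6-v2', 23],
  ['Xenova/whisper-base', 145],
  ['Xenova/all-mpnet-base-v2', 418],
];
for (const [id, sizeMB] of models) {
  console.log(`${id}: ~${estimateDownloadSeconds(sizeMB, 20).toFixed(1)} s`);
}
// Xenova/all-MiniLM-L6-v2: ~9.2 s
// Xenova/whisper-base: ~58.0 s
// Xenova/all-mpnet-base-v2: ~167.2 s
```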

Dhiya NPM uses all-MiniLM-L6-v2 as the default embedding model — it's the best balance of size, speed, and quality for RAG applications in the browser.

Managing Model Downloads

The first use of any model downloads it from the Hub. MiniLM is 23MB, which is fine for most applications. Larger models such as Whisper base and MPNet range from roughly 145MB to 420MB. Design your loading experience accordingly:

const extractor = await pipeline(
  'feature-extraction',
  'Xenova/all-MiniLM-L6-v2',
  {
    // Called repeatedly for each model file; byte counts arrive with status 'progress'
    progress_callback: (progress) => {
      if (progress.status === 'progress') {
        console.log(`Downloading ${progress.file}: ${(progress.loaded / progress.total * 100).toFixed(1)}%`);
      }
    }
  }
);
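A model usually consists of several files (configuration, tokenizer, weights), and the callback fires separately for each. To drive a single progress bar, the per-file events can be combined. The sketch below assumes each event carries `status`, `file`, `loaded`, and `total` fields and that byte counts arrive with status `'progress'`; `createProgressTracker` is a hypothetical helper.

```javascript
// Hypothetical helper: combine per-file progress events into one percentage.
// Assumes each event has `status`, `file`, `loaded`, and `total` fields.
function createProgressTracker() {
  const files = new Map(); // file name -> { loaded, total }

  return function update(event) {
    if (event.status === 'progress' && event.total > 0) {
      files.set(event.file, { loaded: event.loaded, total: event.total });
    }
    let loaded = 0;
    let total = 0;
    for (const entry of files.values()) {
      loaded += entry.loaded;
      total += entry.total;
    }
    return total === 0 ? 0 : (loaded / total) * 100;
  };
}

const update = createProgressTracker();
update({ status: 'progress', file: 'tokenizer.json', loaded: 50, total: 100 });
const percent = update({ status: 'progress', file: 'model.onnx', loaded: 150, total: 900 });
console.log(`${percent.toFixed(1)}%`); // 20.0%
```

The combined value can move backward when a new, larger file starts downloading, so a UI may want to show only the running maximum.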

Models are cached in the browser's Cache Storage after the first download — they're available offline on subsequent visits and load in under 200ms from cache.

Dhiya NPM abstracts Transformers.js into a clean RAG API so you don't have to manage pipelines, workers, or caching directly. See the build tutorial →

