What is a typical Gen-AI production stack in 2025?

A typical Gen-AI production stack includes a foundation model API (or self-hosted model), a vector database for retrieval, an embedding model, a prompt and context management layer, input and output guardrails, an evaluation harness, observability and cost tracking, and a user-facing application layer. Each component can be swapped, but all are usually needed in some form.

What components do you need for a production LLM application?

Production LLM apps need: a model layer (API or self-hosted), a retrieval layer (vector store + embeddings) when RAG is used, a prompt and context layer, guardrails for safety and format compliance, an evaluation pipeline, logging and cost monitoring, and a user interface or API. Skipping observability or evaluation is the most common production mistake.

Which vector database is best for production RAG?

Popular production choices include Pinecone, Weaviate, Qdrant, pgvector on Postgres, and managed offerings on AWS and GCP. For client-side RAG, IndexedDB-backed stores work in the browser. The right choice depends on scale, latency, hosting preference, and whether you need hybrid keyword + vector search.

Should I self-host LLMs or use APIs in production?

Use APIs while you are validating the product and your usage is moderate. Move to self-hosted open models when API cost becomes a meaningful share of revenue, when data residency requirements force it, or when you need latency or behaviour guarantees the API cannot meet. The usage volume at which self-hosting becomes cheaper is usually higher than teams expect.

How do I monitor an LLM application in production?

Monitor request latency, token usage per call, end-to-end cost per user action, output quality through automated evaluation or LLM-as-judge scoring, guardrail trigger rates, and user feedback signals. Capture full request and response logs (with PII handling) so you can debug specific failures, not just aggregate dashboards.
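As a rough illustration of logging full requests and responses with PII handling, here is a minimal sketch. The `redact` and `log_record` helpers and the two regular expressions are assumptions made for this example, not part of any particular logging product, and regex redaction on its own catches only the most obvious emails and phone numbers.

```python
import json
import re
import time

# Deliberately simple patterns (illustrative assumptions): they catch common
# email and phone formats only and will miss many real-world variants.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace obvious emails and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def log_record(request_id: str, prompt: str, response: str,
               latency_ms: float, total_tokens: int) -> str:
    """Build one JSON log line holding the redacted request and response."""
    return json.dumps({
        "ts": time.time(),
        "request_id": request_id,
        "prompt": redact(prompt),
        "response": redact(response),
        "latency_ms": latency_ms,
        "total_tokens": total_tokens,
    })


print(log_record("req-1", "My email is jane.doe@example.com",
                 "Thanks, noted.", 412.0, 57))
```

Writing one JSON object per call keeps individual failures searchable by `request_id`, which is what makes debugging a specific bad response possible.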

My Gen-AI Production Stack

What "Production Stack" Actually Means

Every few months a new framework or model is released and the AI community collectively decides it's the thing everyone should immediately switch to. I've been building Gen-AI systems for clients across healthcare, legal, retail, and assistive technology since 2022. Here's the stack I actually reach for in 2025, chosen for reliability, maintainability, and the ability for an engineer who isn't me to understand and operate it six months after launch.

This isn't a comprehensive survey of every option. It's what I know works in production, with real trade-offs acknowledged.

The LLM Layer

My default is to start with the best available API model (GPT-4o or Claude 3.5 Sonnet, depending on the task) and migrate to a self-hosted smaller model only when one of these conditions is true: latency requirements mandate it, data privacy mandates it, or volume makes the API cost prohibitive.

For most production deployments I've run, Llama 3.1 8B and Mistral 7B cover roughly 80% of the use cases where a self-hosted model is needed. These models are small enough to run on a single A100 GPU (in 16-bit precision the weights take roughly 14 to 16 GB), capable enough for most multi-step reasoning tasks, and open enough to fine-tune for domain-specific behavior.

The mistake I see most often: teams that start with self-hosted models before validating product-market fit. Running your own model cluster is real engineering work. Validate with an API first, then migrate when you have the data to justify it.
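To make "the data to justify it" concrete, here is a back-of-the-envelope break-even sketch. Every number in it is a hypothetical placeholder, not a real price, and it simplifies by treating self-hosting as a fixed monthly cost (GPU rental plus engineering time) with no per-token cost.

```python
def breakeven_tokens_per_month(
    api_price_per_million_tokens: float,
    gpu_cost_per_month: float,
    ops_cost_per_month: float,
) -> float:
    """Monthly token volume at which self-hosting costs the same as the API.

    Simplification: self-hosting is modelled as a fixed monthly cost,
    so below this volume the API is cheaper and above it self-hosting wins.
    """
    fixed_cost = gpu_cost_per_month + ops_cost_per_month
    return fixed_cost / api_price_per_million_tokens * 1_000_000


# Hypothetical inputs for illustration only; substitute your own figures.
tokens = breakeven_tokens_per_month(
    api_price_per_million_tokens=5.0,
    gpu_cost_per_month=2_000.0,
    ops_cost_per_month=8_000.0,
)
print(f"Break-even at about {tokens / 1e9:.1f}B tokens per month")
```

The engineering-time term is the one teams tend to leave out, and it is often larger than the GPU bill.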

Embeddings and Vector Storage

For embedding generation, my go-to is the sentence-transformers/all-MiniLM-L6-v2 family for general-purpose RAG and BAAI/bge-m3 for multilingual applications. These models balance quality, speed, and model size better than larger alternatives for most retrieval tasks.

For vector storage, the choice depends on scale and infrastructure requirements:

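Under 1M vectors, self-hosted: Chroma or FAISS. Both are simple to deploy and permissively licensed open source (Chroma is Apache 2.0, FAISS is MIT), and they handle the scale of most production applications. Note that FAISS is an indexing library rather than a database, so persistence and metadata filtering are up to you.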
1M–100M vectors: Qdrant or Weaviate. Better query performance at scale, more robust filtering, managed cloud options.
Browser-based RAG: Dhiya NPM uses IndexedDB with a custom approximate nearest-neighbor implementation. No server required.
Enterprise managed: Pinecone if you need a fully managed service with enterprise SLAs and don't want to operate infrastructure.
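One way to sanity-check which tier you are in is to estimate the raw size of the vectors. The sketch below assumes float32 storage and counts only the vectors themselves; index structures, metadata, and replicas add more on top. The dimensions are those of the two embedding models mentioned above (384 for all-MiniLM-L6-v2, 1024 for the dense output of bge-m3). At 1M vectors and 384 dimensions the raw data is about 1.5 GB, which fits in memory on a modest machine, while 100M vectors at the same dimension is over 150 GB.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw vectors alone (float32 by default)."""
    return num_vectors * dim * bytes_per_value


# Scenario sizes are illustrative; plug in your own corpus size and model.
scenarios = [
    ("1M vectors, 384 dims", 1_000_000, 384),
    ("100M vectors, 384 dims", 100_000_000, 384),
    ("1M vectors, 1024 dims", 1_000_000, 1024),
]
for label, count, dim in scenarios:
    gigabytes = raw_vector_bytes(count, dim) / 1e9
    print(f"{label}: {gigabytes:.1f} GB before index overhead")
```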

Orchestration

For complex multi-step AI workflows — document ingestion pipelines, agentic systems, RAG with re-ranking — I use LangChain for prototyping and LlamaIndex for production document processing. Both have their frustrations (LangChain abstractions can become a debugging maze; LlamaIndex's API has changed significantly across versions), but they're faster than writing orchestration from scratch.

For simpler workflows, I often don't use a framework at all. If your pipeline is: embed documents → store in vector DB → retrieve on query → generate with LLM → return response, you can write that in 200 lines of clean Python without any framework dependency. The framework overhead isn't free, and simpler code is easier to debug and maintain.
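A minimal sketch of that pipeline shape, using only the standard library. Everything here is a stand-in: `toy_embed` is a bag-of-words counter rather than a real embedding model, `InMemoryStore` replaces a vector database, and `echo_llm` simply returns the prompt where a real model call would go. The point is the wiring, which stays the same when each piece is swapped for a real one.

```python
import math
from collections import Counter
from typing import Callable


def toy_embed(text: str) -> Counter:
    """Stand-in embedder: sparse word counts instead of a dense vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    """Stand-in vector store: brute-force search over a Python list."""

    def __init__(self, embed: Callable[[str], Counter]):
        self.embed = embed
        self.items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self.items.append((text, self.embed(text)))

    def search(self, query: str, k: int = 2) -> list[str]:
        q = self.embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def answer(query: str, store: InMemoryStore, llm: Callable[[str], str], k: int = 2) -> str:
    """Retrieve context for the query, build a prompt, and call the model."""
    context = "\n".join(store.search(query, k))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call: returns the prompt unchanged."""
    return prompt


store = InMemoryStore(toy_embed)
for doc in [
    "refunds are accepted within thirty days of purchase",
    "shipping takes five business days",
    "support is available on weekdays",
]:
    store.add(doc)

print(answer("how many days does shipping take", store, echo_llm, k=1))
```

Passing the embedder and the model in as plain callables is what keeps this easy to test and easy to swap, which is most of what a framework would otherwise be providing at this scale.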

The pattern I follow: use a framework until it becomes the bottleneck, then replace the problematic part with direct implementation.

API Layer and Serving

FastAPI is my default for AI API serving. It's fast enough for most production loads, easy to document with automatic OpenAPI spec generation, and Python-native (which matches most ML teams' expertise). For high-throughput serving where Python's GIL becomes a bottleneck, I've used NVIDIA Triton Inference Server for computer vision models.

For the Sunbots Management System's AI analytics features, we use FastAPI with background task queues (Celery + Redis) for long-running inference tasks. The API accepts the request, queues the inference job, and returns a job ID; the client polls for results. This pattern handles variable inference times without blocking HTTP connections.
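The submit-then-poll pattern can be sketched without any infrastructure. This is not the Sunbots implementation: it swaps Celery and Redis for an in-process `ThreadPoolExecutor` and a dictionary, and `JobQueue` and `slow_inference` are hypothetical names for illustration. In-process state like this is lost on restart and is not shared between workers, which is exactly the gap a broker-backed queue fills.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor


class JobQueue:
    """In-process stand-in for a task queue: submit work, poll by job ID."""

    def __init__(self, max_workers: int = 2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: dict[str, Future] = {}

    def submit(self, fn, *args) -> str:
        """Queue the job and return immediately with an ID (the POST handler)."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(fn, *args)
        return job_id

    def status(self, job_id: str) -> dict:
        """Report the job state without blocking (the GET handler)."""
        future = self._jobs.get(job_id)
        if future is None:
            return {"state": "unknown"}
        if not future.done():
            return {"state": "pending"}
        error = future.exception()
        if error is not None:
            return {"state": "failed", "error": str(error)}
        return {"state": "done", "result": future.result()}


def slow_inference(prompt: str) -> str:
    """Stand-in for a long-running model call."""
    time.sleep(0.1)
    return prompt.upper()


queue = JobQueue()
job_id = queue.submit(slow_inference, "summarise this")

# The client polls until the job leaves the pending state.
while queue.status(job_id)["state"] == "pending":
    time.sleep(0.02)
print(queue.status(job_id))
```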

Monitoring

I run three layers of monitoring on every production Gen-AI system:

Infrastructure monitoring: Standard metrics — latency, error rate, throughput, GPU utilization. Prometheus + Grafana if self-hosted; AWS CloudWatch or GCP Monitoring if cloud-native.
LLM-specific monitoring: Token usage (for cost tracking), response latency percentiles (p50, p95, p99), and hallucination rate detection for high-stakes applications. I use LangSmith for tracing during development and a custom logging layer in production.
Business metric monitoring: Did the AI output produce the intended user action? This is the only metric that matters for product success, and it requires instrumenting your application, not just your model.
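For the LLM-specific layer, the latency percentiles and token cost can be computed from plain call records. This is a minimal sketch, not the custom logging layer mentioned above: `CallRecord` and both helper functions are hypothetical, and the per-million-token prices are parameters you would fill in from your provider's current pricing.

```python
import statistics
from dataclasses import dataclass


@dataclass
class CallRecord:
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int


def latency_percentiles(records: list[CallRecord]) -> dict[str, float]:
    """p50, p95, and p99 latency; needs at least two records."""
    # quantiles(n=100) returns 99 cut points; index i-1 is the i-th percentile.
    cuts = statistics.quantiles(
        [r.latency_ms for r in records], n=100, method="inclusive"
    )
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def cost_usd(
    records: list[CallRecord],
    prompt_price_per_million: float,
    completion_price_per_million: float,
) -> float:
    """Total spend for the given calls at the supplied (assumed) prices."""
    prompt = sum(r.prompt_tokens for r in records)
    completion = sum(r.completion_tokens for r in records)
    return (
        prompt * prompt_price_per_million
        + completion * completion_price_per_million
    ) / 1_000_000


# Synthetic records for illustration: latencies of 1 ms through 101 ms.
records = [CallRecord(float(ms), 1_000, 500) for ms in range(1, 102)]
print(latency_percentiles(records))
print(cost_usd(records, prompt_price_per_million=2.0, completion_price_per_million=8.0))
```

Tracking p95 and p99 alongside p50 matters because LLM latency is long-tailed: a healthy median can hide a slow tail that users notice.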

Want to run a Gen-AI stack entirely in the browser without a server? Dhiya NPM handles embedding and retrieval client-side. Or reach out if you're designing a production Gen-AI architecture.

