
The Gen-AI Stack I Use in Every Production Project
by Deep Parmar
CTO at Sunbots Innovations LLP | Director at Xwits Developers Pvt Ltd

What "Production Stack" Actually Means
Every few months a new framework or model is released and the AI community collectively decides it's the thing everyone should immediately switch to. I've been building Gen-AI systems for clients across healthcare, legal, retail, and assistive technology since 2022. Here's the stack I actually reach for in 2025, chosen for reliability, maintainability, and the ability for an engineer who isn't me to understand and operate it six months after launch.
This isn't a comprehensive survey of every option. It's what I know works in production, with real trade-offs acknowledged.
The LLM Layer
My default is to start with the best available API model (GPT-4o or Claude 3.5 Sonnet, depending on the task) and migrate to a self-hosted smaller model only when one of these conditions is true: latency requirements mandate it, data privacy mandates it, or volume makes the API cost prohibitive.
For most production deployments I've run, Llama 3.1 8B and Mistral 7B cover roughly 80% of the use cases where a self-hosted model is needed. These models are small enough to run on a single A100 GPU, large enough to handle complex reasoning tasks, and open enough to fine-tune for domain-specific behavior.
The mistake I see most often: teams that start with self-hosted models before validating product-market fit. Running your own model cluster is real engineering work. Validate with an API first, then migrate when you have the data to justify it.
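The three migration conditions above can be written down as an explicit check, which makes the decision easy to revisit as numbers change. This is a minimal sketch, not a prescribed tool: the `Requirements` fields, the function name, and the simple comparisons are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical project constraints; every field name here is illustrative."""
    max_latency_ms: float          # latency budget per request
    api_latency_ms: float          # measured latency of the hosted API model
    data_must_stay_private: bool   # privacy or compliance rules out a third-party API
    monthly_api_cost: float        # projected API spend at current volume
    monthly_selfhost_cost: float   # projected GPU plus operations cost


def choose_llm_backend(req: Requirements) -> tuple[str, list[str]]:
    """Default to the API; move to self-hosting only if a condition is met."""
    reasons = []
    if req.api_latency_ms > req.max_latency_ms:
        reasons.append("latency")
    if req.data_must_stay_private:
        reasons.append("data privacy")
    if req.monthly_api_cost > req.monthly_selfhost_cost:
        reasons.append("cost at volume")
    return ("self_hosted" if reasons else "api", reasons)


# Example: an early-stage product with no hard constraints stays on the API.
early_stage = Requirements(
    max_latency_ms=2000, api_latency_ms=900,
    data_must_stay_private=False,
    monthly_api_cost=400, monthly_selfhost_cost=3000,
)
backend, why = choose_llm_backend(early_stage)
```

Returning the reasons alongside the choice keeps the justification visible, which matters when the migration has to be defended with data.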
Embeddings and Vector Storage
For embedding generation, my go-to is the sentence-transformers/all-MiniLM-L6-v2 family for general-purpose RAG and BAAI/bge-m3 for multilingual applications. These models balance quality, speed, and model size better than larger alternatives for most retrieval tasks.
For vector storage, the choice depends on scale and infrastructure requirements:
- Under 1M vectors, self-hosted: Chroma or FAISS. Simple to deploy, no licensing, handles the scale of most production applications.
- 1M–100M vectors: Qdrant or Weaviate. Better query performance at scale, more robust filtering, managed cloud options.
- Browser-based RAG: Dhiya NPM uses IndexedDB with a custom approximate nearest-neighbor implementation. No server required.
- Enterprise managed: Pinecone if you need a fully managed service with enterprise SLAs and don't want to operate infrastructure.
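The tiers above amount to a small lookup. The sketch below encodes them as a helper; the function name, the flag names, and the order in which the flags are checked are assumptions made for illustration, and the thresholds are the rough ones from the list rather than hard limits.

```python
def suggest_vector_store(num_vectors: int, *, in_browser: bool = False,
                         fully_managed: bool = False) -> str:
    """Map rough requirements to the options listed above (illustrative only)."""
    if in_browser:
        # No server available, so storage has to live in the browser.
        return "IndexedDB-based store (e.g. Dhiya NPM)"
    if fully_managed:
        # Enterprise SLAs and no infrastructure to operate.
        return "Pinecone"
    if num_vectors < 1_000_000:
        return "Chroma or FAISS"
    if num_vectors <= 100_000_000:
        return "Qdrant or Weaviate"
    # The list above does not cover this range.
    return "outside the ranges covered here; benchmark before choosing"


choice = suggest_vector_store(250_000)
```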
Orchestration
For complex multi-step AI workflows — document ingestion pipelines, agentic systems, RAG with re-ranking — I use LangChain for prototyping and LlamaIndex for production document processing. Both have their frustrations (LangChain abstractions can become a debugging maze; LlamaIndex's API has changed significantly across versions), but they're faster than writing orchestration from scratch.
For simpler workflows, I often don't use a framework at all. If your pipeline is: embed documents → store in vector DB → retrieve on query → generate with LLM → return response, you can write that in 200 lines of clean Python without any framework dependency. The framework overhead isn't free, and simpler code is easier to debug and maintain.
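To make the framework-free pipeline concrete, here is a minimal sketch of the five steps using only the standard library. The pieces are deliberate stand-ins and not a recommended production setup: `embed` is a toy bag-of-words function in place of a real embedding model, `InMemoryStore` does exact brute-force search in place of a vector DB, and `llm` is any callable you pass in. All names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; swap in a real embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Stand-in for a vector DB: a list plus exact nearest-neighbor search."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, docs: list[str]) -> None:
        # Steps 1 and 2: embed documents and store them.
        self.items.extend((doc, embed(doc)) for doc in docs)

    def search(self, query: str, k: int = 2) -> list[str]:
        # Step 3: retrieve the k most similar documents for the query.
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]


def answer(query: str, store: InMemoryStore, llm: Callable[[str], str], k: int = 2) -> str:
    # Steps 4 and 5: build a grounded prompt, generate, and return the response.
    context = "\n".join(store.search(query, k))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)


store = InMemoryStore()
store.add([
    "Qdrant supports payload filtering.",
    "FastAPI generates OpenAPI specs.",
    "Celery runs background jobs.",
])
# Placeholder LLM that echoes its prompt; replace with a real model call.
reply = answer("Which tool generates OpenAPI specs?", store, lambda p: p, k=1)
```

Each stand-in hides behind a small interface (`embed`, `search`, `llm`), so replacing one with a real component does not touch the rest of the pipeline.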
The pattern I follow: use a framework until it becomes the bottleneck, then replace the problematic part with direct implementation.
API Layer and Serving
FastAPI is my default for AI API serving. It's fast enough for most production loads, easy to document with automatic OpenAPI spec generation, and Python-native (which matches most ML teams' expertise). For high-throughput serving where Python's GIL becomes a bottleneck, I've used NVIDIA Triton Inference Server for computer vision models.
For the Sunbots Management System's AI analytics features, we use FastAPI with background task queues (Celery + Redis) for long-running inference tasks. The API accepts the request, queues the inference job, and returns a job ID; the client polls for results. This pattern handles variable inference times without blocking HTTP connections.
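The accept, queue, and poll flow can be sketched without any infrastructure. In the example below a thread pool stands in for the Celery workers and a plain dict stands in for the Redis result backend; this is an assumption made to keep the sketch runnable, not the production implementation. The function names and `slow_inference` are hypothetical; in a FastAPI service, `submit_job` and `get_job` would sit behind a POST route and a GET route respectively.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable

# Stand-ins: the pool plays the Celery workers, the dict plays the result backend.
_executor = ThreadPoolExecutor(max_workers=2)
_jobs: dict[str, Future] = {}


def submit_job(fn: Callable[..., Any], *args: Any) -> str:
    """Queue the work and return a job ID immediately, without waiting."""
    job_id = uuid.uuid4().hex
    _jobs[job_id] = _executor.submit(fn, *args)
    return job_id


def get_job(job_id: str) -> dict[str, Any]:
    """Report the job state without blocking; this is what the client polls."""
    future = _jobs.get(job_id)
    if future is None:
        return {"status": "not_found"}
    if not future.done():
        return {"status": "pending"}
    error = future.exception()
    if error is not None:
        return {"status": "failed", "error": str(error)}
    return {"status": "done", "result": future.result()}


def poll_until_done(job_id: str, interval: float = 0.05, timeout: float = 5.0) -> dict[str, Any]:
    """Client-side polling loop with a timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_job(job_id)
        if state["status"] != "pending":
            return state
        time.sleep(interval)
    return {"status": "timeout"}


def slow_inference(text: str) -> str:
    """Hypothetical long-running model call."""
    time.sleep(0.2)
    return text.upper()


job_id = submit_job(slow_inference, "summarize this report")
final_state = poll_until_done(job_id)
```

Because `submit_job` returns as soon as the work is queued, the HTTP connection is held only for the enqueue and for each short poll, regardless of how long inference takes.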
Monitoring
Three layers of monitoring for every production Gen-AI system:
- Infrastructure monitoring: Standard metrics — latency, error rate, throughput, GPU utilization. Prometheus + Grafana if self-hosted; AWS CloudWatch or GCP Monitoring if cloud-native.
- LLM-specific monitoring: Token usage (for cost tracking), response latency percentiles (p50, p95, p99), and hallucination rate detection for high-stakes applications. I use LangSmith for tracing during development and a custom logging layer in production.
- Business metric monitoring: Did the AI output produce the intended user action? This is the only metric that matters for product success, and it requires instrumenting your application, not just your model.
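For the LLM-specific layer, a custom logging layer can start very small. The sketch below tracks latency percentiles and token-based cost with the standard library; the class name, its fields, and the per-1K-token prices passed to `cost` are illustrative assumptions, not real provider pricing or the production implementation.

```python
import statistics
from dataclasses import dataclass, field


@dataclass
class LLMMetrics:
    """Hypothetical in-process collector for per-request LLM metrics."""
    latencies_ms: list[float] = field(default_factory=list)
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def record(self, latency_ms: float, prompt_tokens: int, completion_tokens: int) -> None:
        """Call once per completed LLM request."""
        self.latencies_ms.append(latency_ms)
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens

    def percentiles(self) -> dict[str, float]:
        """Return p50, p95, and p99 latency; needs at least two recorded requests."""
        # With n=100, quantiles() returns 99 cut points; index i-1 is the i-th percentile.
        cuts = statistics.quantiles(self.latencies_ms, n=100, method="inclusive")
        return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

    def cost(self, prompt_price_per_1k: float, completion_price_per_1k: float) -> float:
        """Estimated spend; the prices are caller-supplied assumptions."""
        return (self.prompt_tokens / 1000 * prompt_price_per_1k
                + self.completion_tokens / 1000 * completion_price_per_1k)


metrics = LLMMetrics()
for latency in (120.0, 180.0, 95.0, 2400.0, 150.0):
    metrics.record(latency, prompt_tokens=800, completion_tokens=200)
summary = metrics.percentiles()
```

Tracking p95 and p99 separately from the median matters here because a single slow generation, like the 2400 ms request above, barely moves p50 but dominates the tail that users notice.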
