
You Can't Fix What You Can't See: Observability for Production LLM Apps
by Deep Parmar
CTO, Sunbots & Xwits

Traditional software has two useful properties: the same input produces the same output, and failures produce error messages. LLM applications break both. Non-determinism means the same input can produce different outputs on different calls, and failures often surface as bad outputs rather than exceptions: the system runs fine but returns wrong or low-quality results. This makes LLM observability different from standard application monitoring, and more important.
Why LLM Apps Are Harder to Observe
Standard monitoring tells you when something is broken. LLM observability needs to tell you when something is degraded — when output quality has dropped, when the model is producing confident-sounding incorrect answers, when retrieval is returning irrelevant context. These problems do not trigger error rates or latency alerts. They trigger user churn, support tickets, and eroded trust — lagging indicators that are expensive to connect back to a root cause without proper tracing.
Multi-step LLM applications compound the problem. A RAG pipeline with retrieval, re-ranking, and generation has multiple points of failure. If the final output is wrong, you need to know whether the retrieval returned irrelevant documents, the re-ranker ranked them poorly, or the generation model hallucinated despite good context. Without tracing at each step, debugging this requires reproducing the exact conditions of the failure — often impossible in production.
What to Log (And What to Skip)
Every LLM call should capture: the full prompt (system + user), the full response, input and output token counts, latency in milliseconds, the model name and version, the feature or endpoint that triggered the call, and estimated cost. Tag every log with a session ID and user ID (hashed for privacy) to enable session-level tracing.
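As a rough illustration of the fields listed above, here is a minimal Python sketch of a per-call log record. The names (`LLMCallRecord`, `hash_user_id`, `estimate_cost`), the HMAC key, and the per-1k-token prices are all assumptions made for the example, not part of any particular platform or provider price list.

```python
import hashlib
import hmac
import json
import time
from dataclasses import asdict, dataclass

# Hypothetical secret; in practice this would come from a secrets manager.
HASH_KEY = b"replace-with-a-real-secret"


def hash_user_id(user_id: str, key: bytes = HASH_KEY) -> str:
    """Keyed one-way hash so logs can be grouped per user without the raw ID."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimated cost from token counts; prices are supplied by the caller."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


@dataclass
class LLMCallRecord:
    session_id: str
    user_hash: str
    feature: str            # feature or endpoint that triggered the call
    model: str              # model name and version
    system_prompt: str
    user_prompt: str
    response: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    estimated_cost: float
    timestamp: float


def to_log_line(record: LLMCallRecord) -> str:
    """Serialise the record as one JSON line for a structured logging backend."""
    return json.dumps(asdict(record))


# Example with made-up values.
example = LLMCallRecord(
    session_id="session-123",
    user_hash=hash_user_id("user-42"),
    feature="support_chat",
    model="example-model-v1",
    system_prompt="You are a support assistant.",
    user_prompt="How do I reset my password?",
    response="Go to Settings and choose Reset password.",
    input_tokens=1000,
    output_tokens=500,
    latency_ms=840.0,
    estimated_cost=estimate_cost(1000, 500, 0.5, 1.5),
    timestamp=time.time(),
)
```

One JSON line per call keeps the records easy to filter by session ID, feature, or model later.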
What to skip: do not log raw user data that contains PII unless you have explicit consent and secure storage. Do not log every intermediate thinking step for reasoning models unless you are specifically debugging reasoning quality. Storage costs for full trace logs can be significant — implement a sampling strategy that logs 100% of errors and a configurable sample of successful calls (10-20% is typical).
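The sampling rule above can be sketched in a few lines of Python. The function name `should_log` and the injectable `rng` parameter are assumptions for illustration; the default rate of 0.1 simply reflects the low end of the 10-20% range mentioned above.

```python
import random
from typing import Callable


def should_log(is_error: bool,
               success_sample_rate: float = 0.1,
               rng: Callable[[], float] = random.random) -> bool:
    """Keep every failed call, and a random fraction of successful ones."""
    if is_error:
        return True
    # random.random() returns a float in [0.0, 1.0), so a rate of 1.0 keeps
    # everything and a rate of 0.0 keeps nothing.
    return rng() < success_sample_rate
```

Passing the random source as a parameter keeps the decision easy to test. A variant worth considering is sampling per session rather than per call, so that a sampled session keeps all of its calls together.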
Distributed Tracing for Multi-Step Pipelines
For agentic or RAG applications with multiple steps, implement distributed tracing using a trace context that propagates through the entire pipeline. Each step logs its inputs, outputs, and duration as a span attached to the parent trace. When a user reports a bad answer, you pull the trace for that session and see exactly what happened at each step.
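To make the span idea concrete, here is a small standard-library sketch of a trace context that propagates through nested pipeline steps. It is not the API of OpenTelemetry or any observability platform; `span`, `Span`, `SPANS`, and the placeholder pipeline in `answer` are hypothetical names, and the in-memory list stands in for a real log sink.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    inputs: Any = None
    outputs: Any = None
    duration_ms: float = 0.0


# The currently active span; nested steps read it to find their parent.
_current_span = ContextVar("current_span", default=None)
SPANS = []  # stand-in for a real log sink


@contextmanager
def span(name: str, inputs: Any = None):
    parent = _current_span.get()
    s = Span(
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        name=name,
        inputs=inputs,
    )
    token = _current_span.set(s)
    start = time.perf_counter()
    try:
        yield s
    finally:
        # Record the duration and restore the parent even if the step raised.
        s.duration_ms = (time.perf_counter() - start) * 1000
        _current_span.reset(token)
        SPANS.append(s)


def answer(question: str) -> str:
    """Placeholder RAG pipeline; returns the trace ID for later lookup."""
    with span("rag_request", inputs=question) as root:
        with span("retrieve", inputs=question) as r:
            r.outputs = ["doc-1", "doc-2"]           # placeholder retrieval
        with span("rerank", inputs=r.outputs) as rr:
            rr.outputs = list(reversed(r.outputs))   # placeholder re-ranking
        with span("generate", inputs=rr.outputs) as g:
            g.outputs = "placeholder answer"         # placeholder generation
        root.outputs = g.outputs
    return root.trace_id
```

Filtering `SPANS` by the returned trace ID gives the per-step view described above: what each step received, what it produced, and how long it took.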
LangSmith, Langfuse, and Arize are commonly used LLM observability platforms that provide this tracing out of the box. If you prefer to build your own, OpenTelemetry traces with a structured logging backend are a reasonable approach. The tooling matters less than the practice: instrument first, optimise the tooling later.
LLM-as-Judge for Automated Quality Monitoring
The most powerful addition to LLM observability is automated quality evaluation using a separate model as a judge. For each production call (or a sample of them), run a second LLM call that evaluates the response quality on your defined criteria — accuracy, relevance, helpfulness, citation correctness. This judge score becomes a metric you can track over time, alert on when it drops, and use to detect regressions after model or prompt changes.
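A judge call of this kind might look roughly like the sketch below. The prompt wording, the 1-to-5 scale, the JSON reply format, and the `call_judge` callable (a stand-in for whichever model client you use) are all assumptions for illustration, not a prescribed setup.

```python
import json
from typing import Callable, Optional

CRITERIA = ("accuracy", "relevance", "helpfulness", "citation_correctness")

JUDGE_PROMPT = (
    "You are evaluating an assistant's answer.\n"
    "Question:\n{question}\n\n"
    "Retrieved context:\n{context}\n\n"
    "Answer:\n{answer}\n\n"
    "Reply with only a JSON object that has the integer keys accuracy, "
    "relevance, helpfulness and citation_correctness, each scored 1 to 5."
)


def judge_response(question: str, context: str, answer: str,
                   call_judge: Callable[[str], str]) -> Optional[dict]:
    """Score one response; returns None if the judge reply cannot be parsed."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    raw = call_judge(prompt)
    try:
        data = json.loads(raw)
        scores = {c: int(data[c]) for c in CRITERIA}
    except (ValueError, KeyError, TypeError):
        # Malformed judge output is itself worth counting as a metric.
        return None
    if any(not 1 <= v <= 5 for v in scores.values()):
        return None
    scores["overall"] = sum(scores[c] for c in CRITERIA) / len(CRITERIA)
    return scores


def fake_judge(prompt: str) -> str:
    """Stub used here in place of a real model call."""
    return ('{"accuracy": 4, "relevance": 5, '
            '"helpfulness": 3, "citation_correctness": 4}')


result = judge_response("What is a span?", "A span is one step in a trace.",
                        "A span records one step of a trace.", fake_judge)
```

Keeping the model call behind a plain callable means the parsing and validation can be tested without a network call, and the judge model can be swapped without touching the scoring logic.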
We run LLM-as-judge evaluation on 10% of production calls across all Xwits AI products. When judge scores drop below a set threshold, an alert fires and the flagged outputs go to human review. This has caught prompt regressions after updates, detected retrieval quality drops after index changes, and identified specific user query patterns where our models consistently underperform.
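One simple way to turn judge scores into an alert is a rolling average compared against a threshold, sketched below. This is an illustrative assumption rather than a description of the setup above: the class name `JudgeScoreMonitor`, the window size, the minimum sample count, and the threshold value are all hypothetical.

```python
from collections import deque


class JudgeScoreMonitor:
    """Tracks a rolling mean of judge scores and flags drops below a threshold."""

    def __init__(self, threshold: float, window: int = 50, min_samples: int = 10):
        self.threshold = threshold
        self.min_samples = min_samples
        self.scores = deque(maxlen=window)  # oldest scores fall off automatically

    def rolling_mean(self) -> float:
        return sum(self.scores) / len(self.scores)

    def record(self, score: float) -> bool:
        """Add a score; return True when the rolling mean is below threshold."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return False  # too few samples to alert on
        return self.rolling_mean() < self.threshold


# Example with made-up values: healthy scores, then a sudden drop.
monitor = JudgeScoreMonitor(threshold=3.5, window=5, min_samples=3)
alerts = [monitor.record(s) for s in (4.0, 4.0, 4.0, 1.0)]
```

Averaging over a window rather than alerting on single scores reduces noise from individual judge calls, at the cost of reacting a little later to a real regression.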
