What is LLM observability?

LLM observability is the practice of instrumenting AI applications to track inputs, outputs, quality, cost, and latency across LLM calls. It enables debugging, quality monitoring, cost optimisation, and detection of model degradation in production.

What should I log for LLM applications?

At minimum: full prompt, full response, input/output token counts, latency, model name, feature tag, estimated cost, session ID, and user ID (hashed). For multi-step pipelines, log each step separately with a shared trace ID.

How do I trace LLM calls in production?

Implement distributed tracing with a trace context that propagates through your pipeline. Each step logs a span with inputs, outputs, and duration. Tools like LangSmith, Langfuse, and Arize provide out-of-the-box LLM tracing. OpenTelemetry plus structured logging is a DIY alternative.

What metrics matter most for LLM applications?

Latency percentiles (p50, p95, p99), input/output token counts and cost, error rates, LLM-as-judge quality scores, and user feedback signals (thumbs down, regenerate requests). Track these per feature, not just globally.
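
As a rough illustration of per-feature tracking, the sketch below groups call latencies by feature tag and reports p50, p95, and p99 using Python's standard library. The function name and the `(feature, latency_ms)` input shape are assumptions made for this example, not part of any particular tool.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def latency_percentiles(
    calls: Iterable[Tuple[str, float]],
) -> Dict[str, Dict[str, float]]:
    """Group (feature, latency_ms) pairs by feature and compute percentiles."""
    by_feature = defaultdict(list)
    for feature, latency_ms in calls:
        by_feature[feature].append(latency_ms)

    report = {}
    for feature, latencies in by_feature.items():
        if len(latencies) < 2:
            continue  # statistics.quantiles needs at least two data points
        # n=100 yields 99 cut points; index k-1 is the k-th percentile.
        cuts = statistics.quantiles(latencies, n=100)
        report[feature] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return report


# Illustrative data: 100 calls for one feature, a single call for another.
calls = [("search", float(ms)) for ms in range(1, 101)] + [("solo", 5.0)]
report = latency_percentiles(calls)
```

Features with too few calls are left out of the report rather than given unreliable percentiles. In practice you would probably compute these in your metrics backend instead of in application code.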

What tools exist for LLM monitoring?

LangSmith (by LangChain), Langfuse (open source), Arize Phoenix, and Helicone are the main dedicated LLM observability tools. For general-purpose teams, OpenTelemetry with a backend like Grafana or Datadog can be extended for LLM use cases.

LLM Observability: Logging and Tracing for Production AI

Traditional software has two useful properties: the same input produces the same output, and errors produce error messages. LLM applications break both. Non-determinism means the same input can produce different outputs on different calls. Failures often look like bad outputs rather than exceptions: the system runs fine, it just returns wrong or low-quality results. This makes LLM observability different from standard application monitoring, and more important.

Why LLM Apps Are Harder to Observe

Standard monitoring tells you when something is broken. LLM observability needs to tell you when something is degraded: when output quality has dropped, when the model is producing confident-sounding incorrect answers, when retrieval is returning irrelevant context. These problems do not raise error rates or trip latency alerts. They surface as user churn, support tickets, and eroded trust, all lagging indicators that are expensive to connect back to a root cause without proper tracing.

Multi-step LLM applications compound the problem. A RAG pipeline with retrieval, re-ranking, and generation has multiple points of failure. If the final output is wrong, you need to know whether the retrieval returned irrelevant documents, the re-ranker ranked them poorly, or the generation model hallucinated despite good context. Without tracing at each step, debugging this requires reproducing the exact conditions of the failure — often impossible in production.

What to Log (And What to Skip)

Every LLM call should capture: the full prompt (system + user), the full response, input and output token counts, latency in milliseconds, the model name and version, the feature or endpoint that triggered the call, and estimated cost. Tag every log with a session ID and user ID (hashed for privacy) to enable session-level tracing.
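
A minimal sketch of such a log record, using only the Python standard library. The field names, the `example-model-v1` model name, the price table, and the salt value are all hypothetical placeholders chosen for illustration. Substitute your provider's real model names and rates.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass

# Hypothetical USD prices per million tokens; replace with your provider's rates.
PRICE_PER_MILLION = {"example-model-v1": {"input": 1.00, "output": 3.00}}


def hash_user_id(user_id: str, salt: str) -> str:
    """Salted SHA-256 digest, so the raw user ID never reaches the logs."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one call in USD from its token counts."""
    price = PRICE_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


@dataclass
class LLMCallRecord:
    session_id: str
    user_id_hash: str
    feature: str
    model: str
    system_prompt: str
    user_prompt: str
    response: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    estimated_cost_usd: float
    timestamp: float

    def to_json(self) -> str:
        """Serialise as one JSON line for a structured logging backend."""
        return json.dumps(asdict(self))


record = LLMCallRecord(
    session_id="sess-123",
    user_id_hash=hash_user_id("user-42", salt="replace-with-a-secret"),
    feature="support-chat",
    model="example-model-v1",
    system_prompt="You are a helpful assistant.",
    user_prompt="How do I reset my password?",
    response="Open Settings and choose Reset password.",
    input_tokens=1000,
    output_tokens=500,
    latency_ms=812.0,
    estimated_cost_usd=estimate_cost("example-model-v1", 1000, 500),
    timestamp=time.time(),
)
print(record.to_json())
```

Emitting one JSON object per call keeps the records easy to filter by session, feature, or model in whatever log store you already use.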

What to skip: do not log raw user data that contains PII unless you have explicit consent and secure storage. Do not log every intermediate thinking step for reasoning models unless you are specifically debugging reasoning quality. Storage costs for full trace logs can be significant — implement a sampling strategy that logs 100% of errors and a configurable sample of successful calls (10-20% is typical).
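
One way to implement that sampling rule is sketched below. Hashing the trace ID, rather than drawing a random number per call, is a design choice made for this example: it keeps the decision consistent for every step of the same trace. The function name and default rate are assumptions.

```python
import hashlib


def should_log_full_trace(
    trace_id: str, is_error: bool, sample_rate: float = 0.1
) -> bool:
    """Keep every failed call and a fixed fraction of successful ones."""
    if is_error:
        return True
    # Map the trace ID to a stable bucket in [0, 1) so that all steps of a
    # trace get the same decision, regardless of which host handles them.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate
```

Calls that are not sampled can still emit lightweight metrics (token counts, latency, cost) so that aggregate dashboards stay complete even when the full prompt and response are dropped.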

Distributed Tracing for Multi-Step Pipelines

For agentic or RAG applications with multiple steps, implement distributed tracing using a trace context that propagates through the entire pipeline. Each step logs its inputs, outputs, and duration as a span attached to the parent trace. When a user reports a bad answer, you pull the trace for that session and see exactly what happened at each step.
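
The idea can be sketched with a small in-memory tracer. This is a simplified stand-in for what a tracing platform or OpenTelemetry provides, not their actual API. The `Trace` and `Span` classes, the step names, and the stubbed pipeline outputs are all invented for illustration, and the sketch is not safe for concurrent use.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    inputs: Any = None
    outputs: Any = None
    duration_ms: float = 0.0


class Trace:
    """Collects the spans that share one trace ID for a single request."""

    def __init__(self, session_id: str):
        self.trace_id = uuid.uuid4().hex
        self.session_id = session_id
        self.spans: List[Span] = []
        self._open: List[str] = []  # IDs of spans that are currently open

    @contextmanager
    def span(self, name: str, inputs: Any = None):
        parent_id = self._open[-1] if self._open else None
        current = Span(self.trace_id, uuid.uuid4().hex[:16], parent_id, name, inputs)
        self._open.append(current.span_id)
        start = time.perf_counter()
        try:
            yield current
        finally:
            # Record duration even if the step raised an exception.
            current.duration_ms = (time.perf_counter() - start) * 1000
            self._open.pop()
            self.spans.append(current)


def answer(question: str, trace: Trace) -> str:
    """Stubbed RAG pipeline: each step records its own span."""
    with trace.span("pipeline", inputs=question) as root:
        with trace.span("retrieval", inputs=question) as retrieval:
            retrieval.outputs = ["doc-1", "doc-2"]  # placeholder documents
        with trace.span("rerank", inputs=retrieval.outputs) as rerank:
            rerank.outputs = list(reversed(rerank.inputs))
        with trace.span("generation", inputs=rerank.outputs) as generation:
            generation.outputs = "stub answer"
        root.outputs = generation.outputs
    return root.outputs


trace = Trace(session_id="sess-123")
result = answer("How do I reset my password?", trace)
```

After the call, `trace.spans` holds one span per step, each linked to the enclosing pipeline span through `parent_id`, which is the information you need to tell a retrieval failure from a generation failure.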

LangSmith, Langfuse, and Arize are commonly used LLM observability platforms that provide this tracing out of the box. If you prefer to build your own, OpenTelemetry traces with a structured logging backend are a reasonable approach. The tooling matters less than the practice: instrument first, optimise the tooling later.

LLM-as-Judge for Automated Quality Monitoring

The most powerful addition to LLM observability is automated quality evaluation using a separate model as a judge. For each production call (or a sample of them), run a second LLM call that evaluates the response quality on your defined criteria — accuracy, relevance, helpfulness, citation correctness. This judge score becomes a metric you can track over time, alert on when it drops, and use to detect regressions after model or prompt changes.

We run LLM-as-judge evaluation on 10% of production calls across all Xwits AI products. When judge scores drop below a set threshold, an alert fires and a human reviews the flagged outputs. This has caught prompt regressions after updates, detected retrieval quality drops after index changes, and identified specific user query patterns where our models consistently underperform.
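
A score-and-alert loop of this kind might look like the sketch below. It is an illustration under stated assumptions, not a description of any production setup: the `JudgeMonitor` class, the 0 to 1 score scale, the threshold, and the window size are all hypothetical, and the judge is a stub standing in for a real second LLM call.

```python
from collections import deque
from typing import Callable, Deque


class JudgeMonitor:
    """Tracks a rolling mean of judge scores and flags drops below a threshold."""

    def __init__(
        self,
        judge: Callable[[str, str], float],
        threshold: float = 0.7,
        window: int = 50,
    ):
        self.judge = judge  # takes (prompt, response), returns a score in [0, 1]
        self.threshold = threshold
        self.scores: Deque[float] = deque(maxlen=window)

    def record(self, prompt: str, response: str) -> bool:
        """Score one call; return True if the rolling mean is below threshold."""
        self.scores.append(self.judge(prompt, response))
        rolling_mean = sum(self.scores) / len(self.scores)
        return rolling_mean < self.threshold


def stub_judge(prompt: str, response: str) -> float:
    """Placeholder for a real judge model call; scores on a trivial rule."""
    return 0.9 if "source:" in response else 0.2


monitor = JudgeMonitor(stub_judge, threshold=0.7, window=3)
alerts = [
    monitor.record("q1", "Answer A (source: doc-1)"),
    monitor.record("q2", "Answer B (source: doc-2)"),
    monitor.record("q3", "Answer C with no citation"),
]
```

Alerting on a rolling mean rather than on individual scores avoids paging someone over a single noisy judgement. The window size controls how quickly a real regression shows up.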

