What is the difference between an AI agent and a chatbot?

A chatbot responds to a single input and stops. An agent takes a goal, breaks it into steps, calls tools, observes results, and continues working until it reaches a termination condition — often without human input on each step.

Which LLM should I use to power my agent?

It depends on the task. In mid-2026 there is no single best model. For reasoning-heavy tasks Claude Opus 4.8 performs well; for high-volume, cost-sensitive tasks a smaller model may be more practical. The architecture around the model matters more than the model choice.

How do I handle hallucinations in an agent?

Narrow the task scope so the model has less to hallucinate about. Require structured output with strict schema validation. Use tool calls to ground factual claims — the model should retrieve, not recall. Run output guardrails before returning results to users.

Do I need a vector database for agent memory?

Not always. For within-session state, a simple dict or list suffices. For agents that need to recall facts across many sessions, a vector store helps with retrieval. Start without one and add it when the specific need arises.

What is a reasonable step limit for an agent?

For most tasks: 3x to 4x the expected number of steps in the happy path. If a task normally takes 4 tool calls, set the limit at 12 to 16. Log every run that hits the limit: it signals either a bad task definition or a new failure mode.

How do I know when my agent is production-ready?

When it passes a written eval harness at a defined threshold, every failure mode has a containment plan, and you can debug any run from logs alone without needing to reproduce it. If those three things are not true, it is still a prototype.

How to Build a Production AI Agent

A production AI agent is an LLM wrapped in a controlled loop — tools, a clear task, a termination condition, guardrails, memory, and evaluation. The model is about 20% of the work. The other 80% is the scaffolding you build around it, and that scaffolding is what separates a demo that impresses someone in a meeting from a system you can actually run at midnight without watching it.

I have built agents in production — for GST compliance, for marketing automation, for assistive AI that 17,000+ blind users depend on daily. The failure modes are consistent. None of them come from the model itself. They come from missing architecture.

The minimum architecture of an agent

An agent needs six things to be production-ready:

1. A tool set: defined, tested functions the model can call
2. A clear task: a precise instruction, not an aspiration
3. A perceive-reason-act loop: observe state, decide action, execute, repeat
4. A termination condition: when to stop, not just what to do
5. Guardrails: constraints on inputs and outputs
6. Memory: what the agent carries across steps

Skip any one of these and you do not have a production agent. You have a demo that works in the slides.

Step by step

Define the task and success check first

Before writing a line of code, write two things: the task definition in one sentence, and the success condition you will check programmatically. "Summarise customer emails" is not a task definition. "Given a batch of raw email threads, produce a JSON object with fields sentiment, primary_issue, and suggested_reply for each thread, returning null for fields that cannot be determined" is a task definition.

If you cannot write the success check, you are not ready to build the agent.
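A success check for the email example could look like the minimal sketch below. It assumes the agent returns a JSON array with one object per thread and that every field is either a string or null; the function name `check_output` and those two assumptions are illustrative, not part of the task definition above.

```python
import json

REQUIRED_FIELDS = ("sentiment", "primary_issue", "suggested_reply")


def check_output(raw_output: str, thread_count: int) -> list[str]:
    """Return a list of problems; an empty list means the run passed."""
    try:
        items = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(items, list):
        return ["top level must be a JSON array"]

    problems = []
    if len(items) != thread_count:
        problems.append(f"expected {thread_count} items, got {len(items)}")
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in item:
                # A field that cannot be determined must be null, not absent.
                problems.append(f"item {i}: missing field {field!r}")
            elif item[field] is not None and not isinstance(item[field], str):
                problems.append(f"item {i}: {field!r} must be a string or null")
    return problems
```

Returning a list of problems rather than a bare boolean means the same function can gate a run and explain a failure in the logs.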

Give it tools

Tools are functions the model can call. Keep them narrow. A tool that does one thing is easier to test, easier to log, and easier to fix when the model calls it wrong.

def get_invoice_status(invoice_id: str) -> dict:
    """Fetch status for a single invoice. Returns status, amount, due_date."""
    ...

Every tool should have: a typed signature, a docstring the model reads, deterministic behaviour for the same input, and a timeout. Never give an agent a tool it does not need for the current task.
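One way to attach a timeout to any tool is a small decorator. This is a sketch using only the standard library; `with_timeout`, `ToolTimeout`, and `slow_lookup` are hypothetical names introduced for illustration. Note the limitation: a thread cannot be killed, so this stops the agent from waiting but does not stop the underlying call. Where the client library has its own timeout parameter, that is usually the better place to set it.

```python
import concurrent.futures
import functools
import time

# Shared pool; a `with` block would wait for the slow call on exit.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


class ToolTimeout(Exception):
    """Raised when a tool does not return within its time budget."""


def with_timeout(seconds: float):
    """Decorator: stop waiting for the wrapped tool after `seconds`."""
    def decorator(func):
        @functools.wraps(func)  # keeps the name and docstring the model reads
        def wrapper(*args, **kwargs):
            future = _POOL.submit(func, *args, **kwargs)
            try:
                return future.result(timeout=seconds)
            except concurrent.futures.TimeoutError:
                future.cancel()  # only has an effect if the call has not started
                raise ToolTimeout(f"{func.__name__} exceeded {seconds}s") from None
        return wrapper
    return decorator


@with_timeout(0.05)
def slow_lookup(invoice_id: str) -> dict:
    """Hypothetical tool that takes too long."""
    time.sleep(0.2)
    return {"invoice_id": invoice_id, "status": "paid"}
```

Calling `slow_lookup("INV-1")` raises `ToolTimeout`, which the loop can then treat like any other tool failure.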

Build the perceive-reason-act loop

The loop is the agent. On each iteration: the agent reads the current state (tool outputs, conversation history, memory), reasons about the next action, and executes one tool call or produces a final answer.

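def run_agent(model, tools, task, max_steps):
    """Run one perceive-reason-act loop; `model.decide` stands in for your LLM call."""
    state = []
    for _ in range(max_steps):
        # Reason: the model picks exactly one action from the current state.
        action = model.decide(state, tools, task)
        if action.is_final:
            return action.output
        # Act, then record the observation for the next iteration.
        result = tools[action.name](**action.args)
        state.append(result)
    raise RuntimeError("step limit reached without a final answer")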

One action per loop iteration. Do not let the model chain tool calls invisibly — surface each step so you can log and debug it.
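To show what surfacing each step can mean in practice, here is a self-contained sketch that writes one structured log line per tool call and returns the full trace with the answer. The `Action` shape and the `ScriptedModel` stand-in are assumptions made so the example runs without an LLM; a real model client would replace them.

```python
import json
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")


@dataclass
class Action:
    """Hypothetical shape of one model decision."""
    name: str = ""
    args: dict = field(default_factory=dict)
    is_final: bool = False
    output: str = ""


class ScriptedModel:
    """Stand-in for an LLM: replays a fixed list of actions."""
    def __init__(self, actions):
        self._actions = iter(actions)

    def decide(self, state, tools, task):
        return next(self._actions)


def run_with_trace(model, tools, task, max_steps):
    state, trace = [], []
    for step in range(1, max_steps + 1):
        action = model.decide(state, tools, task)
        if action.is_final:
            trace.append({"step": step, "final": action.output})
            return action.output, trace
        result = tools[action.name](**action.args)
        record = {"step": step, "tool": action.name,
                  "args": action.args, "result": result}
        log.info(json.dumps(record, default=str))  # one log line per step
        trace.append(record)
        state.append(result)
    raise RuntimeError("step limit reached")
```

Because every record carries the step number, tool name, arguments, and result, a failed run can be read back from the log without re-executing it.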

Set termination and retries

Every agent needs a hard step limit. As a rule of thumb, set it at 3x to 4x the happy-path step count: if your task normally takes 5 steps, a limit of 15 to 20 is reasonable. Log every run that hits it. If a task regularly reaches the step limit, the task definition is wrong.

Retry logic is separate from the loop. Network errors on tool calls should retry. Logical dead-ends — the model calling the same tool with the same arguments twice — should not retry. They should terminate and flag.
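The two cases can be kept apart with two small helpers, sketched below. `call_with_retry` retries only connection and timeout errors with exponential backoff; `RepeatGuard` raises as soon as the model repeats an identical call. Both names, the attempt count, and the delay are assumptions for illustration, and which exception types count as transient depends on the client libraries your tools use.

```python
import json
import time


class DeadEnd(Exception):
    """The model repeated an identical tool call; stop and flag the run."""


def call_with_retry(func, args, attempts=3, base_delay=0.01):
    """Retry transient network failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func(**args)
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the loop
            time.sleep(base_delay * 2 ** attempt)


class RepeatGuard:
    """Remembers (tool, arguments) pairs already seen in this run."""
    def __init__(self):
        self._seen = set()

    def check(self, name, args):
        # Serialise with sorted keys so argument order does not matter.
        key = (name, json.dumps(args, sort_keys=True, default=str))
        if key in self._seen:
            raise DeadEnd(f"repeated call: {name}({args})")
        self._seen.add(key)
```

In the loop, `guard.check(action.name, action.args)` would run before the tool call, so a repeated call never reaches the retry path at all.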

Add input and output guardrails

Input guardrails run before the agent starts. Check that the task is within scope. Reject inputs that are ambiguous, oversized, or outside the defined domain. This is your cheapest safety layer.

Output guardrails run on the final response. Validate schema. Check that required fields are present. Run a lightweight check for obvious hallucination patterns (references to entities not in the input). For high-stakes outputs, route to a second, cheaper model for sanity-checking before returning.
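A rough sketch of both layers follows. The size budget, the `INV-<digits>` identifier format, and the function names are all assumptions chosen for illustration; a real check would use whatever entity types matter in your domain. The output check implements the "entities not in the input" idea literally: any identifier in the answer that never appeared in the input is flagged.

```python
import re

MAX_INPUT_CHARS = 20_000  # assumed budget; tune per task
ID_PATTERN = re.compile(r"\bINV-\d+\b")  # hypothetical entity format


def check_input(text: str) -> list[str]:
    """Input guardrail: return reasons to reject, empty if acceptable."""
    problems = []
    if not text.strip():
        problems.append("empty input")
    if len(text) > MAX_INPUT_CHARS:
        problems.append("input too large")
    return problems


def ungrounded_ids(input_text: str, output_text: str) -> set[str]:
    """Output guardrail: IDs the output mentions that the input never did."""
    seen_in_input = set(ID_PATTERN.findall(input_text))
    seen_in_output = set(ID_PATTERN.findall(output_text))
    return seen_in_output - seen_in_input
```

For example, `ungrounded_ids("Where is INV-100?", "INV-100 is paid; INV-999 is overdue.")` returns `{"INV-999"}`, which is enough to block the response or route it for review.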

Add memory

Most tasks do not need cross-session memory. They need within-session state — a scratch pad the agent writes to and reads from as it works. Keep this separate from the model's context window.

For agents that do need long-term memory, be surgical. Store facts, not full conversation histories. Retrieve by semantic similarity, but filter by relevance to the current task before injecting into context.
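For the within-session case, a scratch pad can be as small as the sketch below. The `Scratchpad` class and its method names are hypothetical. The point it illustrates is the separation: the agent can store anything, but only the keys you ask for are rendered into the next prompt.

```python
class Scratchpad:
    """Within-session state the agent writes to and reads from."""

    def __init__(self):
        self._notes: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self._notes[key] = value

    def read(self, key: str, default: str = "") -> str:
        return self._notes.get(key, default)

    def render(self, keys: list[str]) -> str:
        """Format only the requested notes for injection into the prompt."""
        lines = [f"{k}: {self._notes[k]}" for k in keys if k in self._notes]
        return "\n".join(lines)
```

A large raw tool response can live in the pad under its own key while `render(["invoice_id", "status"])` passes the model just the two facts it needs.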

Evaluate before you deploy

Write an evaluation harness with at least 20 test cases covering: the happy path, edge cases, adversarial inputs, and the specific failures you found in development. Score each run on your success criteria. Set a minimum pass rate. If you cannot reach it, do not ship.

An agent without an eval harness is a guess that you are shipping as a product.
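A harness does not need a framework to start. The sketch below assumes the agent can be called as a plain function from input string to output string and that each test case carries its own programmatic check; `Case`, `run_eval`, and the report keys are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    name: str
    task_input: str
    check: Callable[[str], bool]  # programmatic success check for this case


def run_eval(agent: Callable[[str], str], cases: list[Case],
             min_pass_rate: float) -> dict:
    """Run every case, then compare the pass rate to the threshold."""
    if not cases:
        raise ValueError("eval harness needs at least one case")
    failures = []
    for case in cases:
        try:
            passed = case.check(agent(case.task_input))
        except Exception:  # a crash counts as a failed case
            passed = False
        if not passed:
            failures.append(case.name)
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return {"pass_rate": pass_rate, "failures": failures,
            "ship": pass_rate >= min_pass_rate}
```

Keeping the names of failed cases in the report makes it easy to turn each production incident into a new case and watch whether it stays fixed.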

Where agents fail in production and how to contain it

The failure modes I see most often:

Infinite loops. The model keeps calling tools without converging. Fix: hard step limit, log everything, alert on limit hit.

Tool call errors that the model ignores. The tool returns an error. The model decides to continue anyway with stale state. Fix: treat tool errors as hard failures unless explicitly marked as recoverable.

Context overflow. After many steps, the context window fills up and the model starts losing early instructions. Fix: summarise state at regular intervals. Do not let the full tool output history sit in the context.

Prompt injection via tool outputs. A tool fetches external content that contains instructions. The model follows them. Fix: sanitise tool outputs before returning them to the model. Treat external data as untrusted.

Overly broad task scope. The agent is given a task that requires judgment calls you have not defined. Fix: narrow the scope. If you cannot narrow it, add a human-in-the-loop step at the ambiguous decision point.
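The context-overflow fix above can be sketched in a few lines. This version keeps the most recent entries verbatim and collapses everything older into a single summary entry. The `summarise` callable is an assumption: in practice it would probably be a call to a cheaper model, and here it is passed in so the function stays testable without one.

```python
from typing import Callable


def compact_history(history: list[str], keep_last: int,
                    summarise: Callable[[list[str]], str]) -> list[str]:
    """Replace all but the last `keep_last` entries with one summary entry."""
    split = len(history) - keep_last
    if split <= 0:
        return list(history)  # nothing old enough to summarise
    older, recent = history[:split], history[split:]
    return [f"[summary] {summarise(older)}"] + recent
```

Calling this every few steps, rather than only when the window is nearly full, keeps the task instructions from being crowded out in the first place.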

Build from scratch vs. use a framework

Frameworks like LangGraph and LlamaIndex give you the loop scaffolding pre-built. Use them if your team is small and the task is standard. The trade-off is that you inherit their abstractions, and when something goes wrong in production, you are debugging their code as well as yours.

I tend to build the core loop from scratch for anything business-critical, and use frameworks for the parts that do not touch the task logic — evaluation runners, observability integrations, vector store connectors. The loop itself is not that much code. Control over it is worth the lines.

See how I think about the broader distinction in "AI agents vs. agentic AI", and the engineering patterns I use for prompting production systems in "harness engineering for LLMs".

Pre-launch checklist

Before you call an agent production-ready:

[ ] Task definition is one sentence, unambiguous
[ ] Success check is programmatic, not subjective
[ ] Every tool has a typed signature, docstring, and timeout
[ ] Step limit is set and logged
[ ] Input guardrails reject out-of-scope requests
[ ] Output guardrails validate schema and flag anomalies
[ ] Eval harness exists with at least 20 test cases
[ ] Pass rate meets your minimum threshold
[ ] Logging covers every tool call, argument, and output
[ ] Alerts are set for step-limit hits and error spikes
[ ] Rollback plan exists

If any item is unchecked, you are not done.
