
    How to Build a Production AI Agent (Not a Demo)

    by Deep Parmar

    CTO, Sunbots & Xwits

    A production AI agent is an LLM wrapped in a controlled loop — tools, a clear task, a termination condition, guardrails, memory, and evaluation. The model is about 20% of the work. The other 80% is the scaffolding you build around it, and that scaffolding is what separates a demo that impresses someone in a meeting from a system you can actually run at midnight without watching it.

    I have built agents in production — for GST compliance, for marketing automation, for assistive AI that 17,000+ blind users depend on daily. The failure modes are consistent. None of them come from the model itself. They come from missing architecture.

    The minimum architecture of an agent

    An agent needs six things to be production-ready:

    1. A tool set: defined, tested functions the model can call
    2. A clear task: a precise instruction, not an aspiration
    3. A perceive-reason-act loop: observe state, decide action, execute, repeat
    4. A termination condition: when to stop, not just what to do
    5. Guardrails: constraints on inputs and outputs
    6. Memory: what the agent carries across steps

    Skip any one of these and you do not have a production agent. You have a demo that works in the slides.

    Step by step

    Define the task and success check first

    Before writing a line of code, write two things: the task definition in one sentence, and the success condition you will check programmatically. "Summarise customer emails" is not a task definition. "Given a batch of raw email threads, produce a JSON object with fields sentiment, primary_issue, and suggested_reply for each thread, returning null for fields that cannot be determined" is a task definition.

    If you cannot write the success check, you are not ready to build the agent.
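
    A minimal sketch of what a programmatic success check could look like for the email task above. It assumes the agent returns a JSON array with one object per thread; the function name `check_success` and the rule that each field is a string or null are illustrative assumptions, not part of any library.

```python
import json

REQUIRED_FIELDS = ("sentiment", "primary_issue", "suggested_reply")


def check_success(raw: str, expected_threads: int) -> list[str]:
    """Return a list of problems; an empty list means the run passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, list):
        return ["top-level value must be a list with one object per thread"]

    problems = []
    if len(data) != expected_threads:
        problems.append(f"expected {expected_threads} objects, got {len(data)}")
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {i} is not an object")
            continue
        for name in REQUIRED_FIELDS:
            if name not in item:
                problems.append(f"item {i} is missing {name!r}")
            elif item[name] is not None and not isinstance(item[name], str):
                # Assumption: undetermined fields are null, everything else is text.
                problems.append(f"item {i} field {name!r} must be a string or null")
    return problems
```

    Returning a list of problems rather than a bare boolean keeps the check usable both as a pass/fail gate and as a debugging aid when a run fails.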

    Give it tools

    Tools are functions the model can call. Keep them narrow. A tool that does one thing is easier to test, easier to log, and easier to fix when the model calls it wrong.

    def get_invoice_status(invoice_id: str) -> dict:
        """Fetch status for a single invoice. Returns status, amount, due_date."""
        ...

    Every tool should have: a typed signature, a docstring the model reads, deterministic behaviour for the same input, and a timeout. Never give an agent a tool it does not need for the current task.
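
    One way to attach a timeout to a tool, sketched with the standard library's `concurrent.futures`. The decorator `with_timeout`, the `ToolTimeout` exception, and the returned invoice data are all hypothetical placeholders. Note that a worker thread cannot be killed from outside, so this bounds how long the agent waits, not how long the underlying call runs.

```python
import concurrent.futures
import functools

# Shared pool so a timed-out call does not block the caller on cleanup.
_EXECUTOR = concurrent.futures.ThreadPoolExecutor(max_workers=4)


class ToolTimeout(Exception):
    """Raised when a tool does not return within its time budget."""


def with_timeout(seconds: float):
    def decorator(func):
        @functools.wraps(func)  # keeps the docstring the model reads
        def wrapper(*args, **kwargs):
            future = _EXECUTOR.submit(func, *args, **kwargs)
            try:
                return future.result(timeout=seconds)
            except concurrent.futures.TimeoutError:
                future.cancel()  # only effective if the call has not started yet
                raise ToolTimeout(f"{func.__name__} exceeded {seconds}s") from None
        return wrapper
    return decorator


@with_timeout(2.0)
def get_invoice_status(invoice_id: str) -> dict:
    """Fetch status for a single invoice. Returns status, amount, due_date."""
    # Placeholder data; a real tool would query the invoicing system here.
    return {"status": "paid", "amount": 1200.0, "due_date": "2024-01-31"}
```

    For tools that make network calls, passing a timeout to the HTTP or database client directly is usually the better first line of defence, with a wrapper like this as a backstop.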

    Build the perceive-reason-act loop

    The loop is the agent. On each iteration: the agent reads the current state (tool outputs, conversation history, memory), reasons about the next action, and executes one tool call or produces a final answer.

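    def run_agent(model, tools, task, state, max_steps):
        # Each iteration performs exactly one action: a tool call or the final answer.
        for _ in range(max_steps):
            action = model.decide(state, tools, task)
            if action.is_final:
                return action.output
            result = tools[action.name](**action.args)  # execute the chosen tool
            state.append(result)  # feed the observation into the next iteration
        raise RuntimeError("step limit reached without a final answer")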

    One action per loop iteration. Do not let the model chain tool calls invisibly — surface each step so you can log and debug it.
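
    To show what surfacing each step can look like, here is a runnable sketch of the loop that logs every tool call and keeps a trace. `Action`, `ScriptedModel`, and `run_agent_traced` are hypothetical names; `ScriptedModel` stands in for a real LLM client and simply replays a fixed list of actions so the loop can be exercised without a model.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Callable

logger = logging.getLogger("agent")


@dataclass
class Action:
    is_final: bool
    name: str = ""
    args: dict = field(default_factory=dict)
    output: Any = None


class ScriptedModel:
    """Stand-in for a real LLM client: replays a fixed list of actions."""

    def __init__(self, actions: list[Action]):
        self._actions = iter(actions)

    def decide(self, state, tools, task) -> Action:
        return next(self._actions)


def run_agent_traced(model, tools: dict[str, Callable], task: str, max_steps: int = 12):
    """Run the loop, returning the final output and a per-step trace."""
    trace: list[dict] = []
    for step in range(1, max_steps + 1):
        action = model.decide(trace, tools, task)
        if action.is_final:
            logger.info("step=%d final answer", step)
            return action.output, trace
        logger.info("step=%d tool=%s args=%s", step, action.name, action.args)
        result = tools[action.name](**action.args)
        # One trace entry per tool call: this is what you read when debugging.
        trace.append({"step": step, "tool": action.name, "args": action.args, "result": result})
    raise RuntimeError(f"step limit of {max_steps} reached")
```

    Swapping `ScriptedModel` for a real client only requires an object with a compatible `decide` method, which also makes the loop easy to test in isolation.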

    Set termination and retries

    Every agent needs a hard step limit. If your task normally takes 5 steps, set the limit at 12. Log when agents hit it. If a task is regularly reaching the step limit, the task definition is wrong.

    Retry logic is separate from the loop. Network errors on tool calls should retry. Logical dead-ends — the model calling the same tool with the same arguments twice — should not retry. They should terminate and flag.
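
    The split between retryable failures and logical dead-ends can be sketched as two small pieces. `TransientToolError`, `DeadEnd`, `call_with_retry`, and `RepeatDetector` are hypothetical names, and the exponential backoff is one common choice rather than a requirement.

```python
import time


class TransientToolError(Exception):
    """Hypothetical marker for failures worth retrying, such as network errors."""


class DeadEnd(Exception):
    """Raised when the model repeats an identical tool call."""


def call_with_retry(tool, args: dict, attempts: int = 3, base_delay: float = 0.5):
    """Retry transient failures with exponential backoff; re-raise anything else."""
    for attempt in range(1, attempts + 1):
        try:
            return tool(**args)
        except TransientToolError:
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


class RepeatDetector:
    """Flags a tool call whose name and arguments were already seen in this run."""

    def __init__(self):
        self._seen = set()

    def check(self, name: str, args: dict) -> None:
        # repr() keeps the key hashable even when argument values are lists or dicts.
        key = (name, tuple(sorted((k, repr(v)) for k, v in args.items())))
        if key in self._seen:
            raise DeadEnd(f"repeated call: {name}({args})")
        self._seen.add(key)
```

    Inside the loop, `check` would run before each tool call, so a `DeadEnd` terminates the run and can be flagged, while `call_with_retry` wraps only the call itself.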

    Add input and output guardrails

    Input guardrails run before the agent starts. Check that the task is within scope. Reject inputs that are ambiguous, oversized, or outside the defined domain. This is your cheapest safety layer.

    Output guardrails run on the final response. Validate schema. Check that required fields are present. Run a lightweight check for obvious hallucination patterns (references to entities not in the input). For high-stakes outputs, route to a second, cheaper model for sanity-checking before returning.
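
    A rough sketch of both guardrail layers. The size budget, the `INV-` identifier pattern, and the function names are assumptions made for illustration; the entity check here only covers identifiers that match the pattern, which is far narrower than general hallucination detection.

```python
import re

MAX_INPUT_CHARS = 20_000  # assumed budget; tune for your model and task
INVOICE_ID = re.compile(r"\bINV-\d+\b")  # hypothetical identifier format


def check_input(text: str) -> list[str]:
    """Cheap pre-flight checks that run before the agent starts."""
    problems = []
    if not text.strip():
        problems.append("input is empty")
    if len(text) > MAX_INPUT_CHARS:
        problems.append(f"input exceeds {MAX_INPUT_CHARS} characters")
    return problems


def check_final_output(output: dict, source_text: str, required: tuple) -> list[str]:
    """Validate required fields and flag identifiers that never appeared in the input."""
    problems = [f"missing field {name!r}" for name in required if name not in output]
    known_ids = set(INVOICE_ID.findall(source_text))
    for value in output.values():
        if not isinstance(value, str):
            continue
        for invoice_id in INVOICE_ID.findall(value):
            if invoice_id not in known_ids:
                problems.append(f"output mentions {invoice_id}, which is not in the input")
    return problems
```

    Both functions return a list of problems, so the caller decides whether a non-empty list means reject, retry, or escalate to a reviewer.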

    Add memory

    Most tasks do not need cross-session memory. They need within-session state — a scratch pad the agent writes to and reads from as it works. Keep this separate from the model's context window.

    For agents that do need long-term memory, be surgical. Store facts, not full conversation histories. Retrieve by semantic similarity, but filter by relevance to the current task before injecting into context.
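
    As an illustration only, the two kinds of memory might be shaped like this. `Scratchpad`, `Fact`, and `facts_for_task` are hypothetical names. The semantic retrieval step is assumed to happen upstream in whatever vector store you use, so `candidates` represents its results and the code shows only the task filter applied afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class Scratchpad:
    """Within-session notes kept outside the prompt; only a bounded view is rendered."""

    notes: dict = field(default_factory=dict)

    def write(self, key: str, value: str) -> None:
        self.notes[key] = value

    def render(self, max_chars: int = 1000) -> str:
        """Return the most recently added notes that fit within max_chars."""
        lines, used = [], 0
        for key, value in reversed(list(self.notes.items())):
            line = f"{key}: {value}"
            if used + len(line) > max_chars:
                break
            lines.append(line)
            used += len(line)
        return "\n".join(reversed(lines))


@dataclass
class Fact:
    text: str
    task: str  # which task this fact is relevant to


def facts_for_task(candidates: list, task: str, limit: int = 3) -> list:
    """Filter retrieved candidates by task before anything is injected into context."""
    return [fact.text for fact in candidates if fact.task == task][:limit]
```

    Keeping the scratch pad as its own object means the agent can hold more state than the prompt shows, and you control exactly how much of it is rendered on each step.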

    Evaluate before you deploy

    Write an evaluation harness with at least 20 test cases covering: the happy path, edge cases, adversarial inputs, and the specific failures you found in development. Score each run on your success criteria. Set a minimum pass rate. If you cannot reach it, do not ship.

    An agent without an eval harness is a guess that you are shipping as a product.
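
    A minimal harness along these lines, assuming the agent can be called as a plain function from input text to output. `Case` and `run_evals` are hypothetical names, and treating a crash as a failed case is a design choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    name: str
    task_input: str
    passes: Callable[[object], bool]  # programmatic success check for this case


def run_evals(agent: Callable[[str], object], cases: list, min_pass_rate: float):
    """Run every case and return (ship, pass_rate, failures)."""
    if not cases:
        raise ValueError("at least one test case is required")
    failures = []
    for case in cases:
        try:
            ok = case.passes(agent(case.task_input))
        except Exception as exc:  # a crash counts as a failed case
            failures.append(f"{case.name}: raised {exc!r}")
            continue
        if not ok:
            failures.append(f"{case.name}: check failed")
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate >= min_pass_rate, pass_rate, failures
```

    Running this in CI with the threshold enforced turns "do not ship" from a guideline into a failing build.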

    Where agents fail in production and how to contain it

    The failure modes I see most often:

    Infinite loops. The model keeps calling tools without converging. Fix: hard step limit, log everything, alert on limit hit.

    Tool call errors that the model ignores. The tool returns an error. The model decides to continue anyway with stale state. Fix: treat tool errors as hard failures unless explicitly marked as recoverable.

    Context overflow. After many steps, the context window fills up and the model starts losing early instructions. Fix: summarise state at regular intervals. Do not let the full tool output history sit in the context.

    Prompt injection via tool outputs. A tool fetches external content that contains instructions. The model follows them. Fix: sanitise tool outputs before returning them to the model. Treat external data as untrusted.

    Overly broad task scope. The agent is given a task that requires judgment calls you have not defined. Fix: narrow the scope. If you cannot narrow it, add a human-in-the-loop step at the ambiguous decision point.
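
    Two of the fixes above lend themselves to a short sketch: treating tool errors as hard failures unless marked recoverable, and compacting older history so it does not sit in the context. `ToolError`, `execute_tool`, and `compact_history` are hypothetical names, and `summarise` is whatever summariser you supply (for example, a call to a cheaper model).

```python
class ToolError(Exception):
    """Tool failure; hard by default unless explicitly marked recoverable."""

    def __init__(self, message: str, recoverable: bool = False):
        super().__init__(message)
        self.recoverable = recoverable


def execute_tool(tool, args: dict) -> dict:
    """Run one tool call. Recoverable errors become observations; others abort the run."""
    try:
        return {"ok": True, "result": tool(**args)}
    except ToolError as err:
        if not err.recoverable:
            raise
        return {"ok": False, "error": str(err)}


def compact_history(history: list, keep_last: int, summarise) -> list:
    """Replace all but the most recent entries with a single summary entry."""
    if len(history) <= keep_last:
        return list(history)
    cut = len(history) - keep_last
    older, recent = history[:cut], history[cut:]
    return [summarise(older)] + recent
```

    Making errors hard by default means a tool author has to opt in to recoverability, which prevents the model from quietly continuing on stale state.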

    Build from scratch vs. use a framework

    Frameworks like LangGraph and LlamaIndex give you the loop scaffolding pre-built. Use them if your team is small and the task is standard. The trade-off is that you inherit their abstractions, and when something goes wrong in production, you are debugging their code as well as yours.

    I tend to build the core loop from scratch for anything business-critical, and use frameworks for the parts that do not touch the task logic — evaluation runners, observability integrations, vector store connectors. The loop itself is not that much code. Control over it is worth the lines.

    See how I think about the broader distinction in "AI agents vs. agentic AI", and the engineering patterns I use for prompting production systems in "harness engineering for LLMs".

    Pre-launch checklist

    Before you call an agent production-ready:

    • [ ] Task definition is one sentence, unambiguous
    • [ ] Success check is programmatic, not subjective
    • [ ] Every tool has a typed signature, docstring, and timeout
    • [ ] Step limit is set and logged
    • [ ] Input guardrails reject out-of-scope requests
    • [ ] Output guardrails validate schema and flag anomalies
    • [ ] Eval harness exists with at least 20 test cases
    • [ ] Pass rate meets your minimum threshold
    • [ ] Logging covers every tool call, argument, and output
    • [ ] Alerts are set for step-limit hits and error spikes
    • [ ] Rollback plan exists

    If any item is unchecked, you are not done.
