What is context engineering in AI?

Context engineering is the discipline of designing what information goes into an LLM's context window — the knowledge, conversation history, and data the model needs to answer accurately. It sits above prompt engineering and covers retrieval strategy, context selection and compression, conversation memory management, and context security. For production AI, context engineering failures are more common than prompt engineering failures.

How is context engineering different from prompt engineering?

Prompt engineering is about crafting instructions — phrasing, format, examples, output constraints. Context engineering is about information architecture — what the model needs in its context window, how to select it, compress it, and stay within token budget limits. A well-engineered context with a simple prompt often outperforms a complex prompt with poorly managed context.

How does RAG relate to context engineering?

RAG (Retrieval-Augmented Generation) is the most widespread form of context engineering. Before calling the LLM, relevant documents are retrieved and injected into the context window. The hard problems in RAG — chunking strategy, how many chunks to include, document positioning, handling conflicting sources — are context engineering decisions, not retrieval algorithm choices.

What is context poisoning or prompt injection in AI?

Context poisoning (prompt injection) occurs when malicious content in the LLM's context — from retrieved documents, user data, or API responses — contains instructions designed to override the system prompt or cause unintended behaviour. Mitigations include sanitising external content before injection, using structural separators between content types, and validating outputs before acting on model responses.

How do I manage an LLM context window in production?

Manage context windows by layering intentionally: a compact system prompt, selectively retrieved knowledge, compressed conversation history, and the current input. Prioritise content by relevance, compress older turns, use structured formats over prose for token efficiency, and actively monitor token usage — context fills faster than expected in multi-turn systems.

Context Engineering: The Layer Above Prompt Engineering

There's a concept reshaping how serious AI engineers think about language model systems — and most tutorials haven't caught up to it yet. It's called context engineering, and it sits one level above prompt engineering in the stack.

Prompt engineering asks: how do I phrase my instructions so the model does what I want? Context engineering asks a prior question: what information does the model need in its context window to give me a reliable, accurate, useful answer at all?

Getting context engineering right is what separates AI systems that work in demos from ones that work in production. Here's how to think about it.

Prompt Engineering vs. Context Engineering

These two disciplines are related but distinct. Prompt engineering is about the craft of writing instructions — phrasing, format, examples, output constraints. Context engineering is about information architecture — what knowledge, history, and data to include in the context window, how to select it, how to structure it, and how to keep it within limits.

A useful analogy: prompt engineering is like writing a great job posting. Context engineering is like making sure the candidate has the right information and tools to do the job when they show up. Both matter. In many production AI failures, the prompt is fine — the context is the problem.

The Context Window Is a Finite Resource

Every language model has a context window — the maximum amount of text (measured in tokens) it can process in a single inference call. In 2025, frontier models offer context windows ranging from 128K tokens (GPT-4o) to 1M+ tokens (Gemini 1.5 Pro). These numbers sound large. In practice, they fill up faster than you expect.

Consider what competes for space in a typical production system:

System prompt with instructions and persona
Retrieved documents from a knowledge base (RAG)
Conversation history (which grows with each turn)
Tool call inputs and outputs
The current user message
Space for the model's response

Context engineering is the discipline of managing this space intentionally — deciding what earns a place in the context window, what gets compressed, what gets excluded, and how the included information is structured for the model to use effectively.

Anatomy of a Well-Designed Context

A production context window typically has four layers, each with a different role:

1. System layer. The system prompt defining the model's role, behaviour, output format, and guardrails. This should be compact and stable — it doesn't change between requests and eats into the budget for dynamic content.

2. Retrieved knowledge layer. Documents, records, or data retrieved from external sources based on the current query. This is the domain-specific knowledge the model needs to answer accurately. Quality of retrieval matters enormously here — irrelevant chunks waste budget and can confuse the model.

3. Conversation history layer. Prior turns in the current conversation. As conversations grow, this layer grows with them. Without management, it eventually consumes the entire budget. With management, it's a source of continuity and context the model can reference to give coherent multi-turn responses.

4. Current input layer. The immediate user query or task. This should always be present and ideally positioned where the model's attention is strongest (typically near the end, in most current architectures).

RAG as Context Engineering

Retrieval-Augmented Generation (RAG) is the most widespread form of context engineering in production today. The pattern: before calling the LLM, retrieve relevant documents from a knowledge base and inject them into the context window. The model answers based on retrieved evidence rather than training-time knowledge.

RAG is context engineering because the hard problems aren't retrieval algorithm problems — they're context design problems:

How do you chunk documents so retrieved pieces contain complete, useful information?
How many retrieved chunks fit without overwhelming the model or exceeding the budget?
How do you handle conflicting information across retrieved sources?
How do you position retrieved documents relative to the user's question?
What metadata (source, date, confidence) should accompany retrieved content?

Getting these decisions right — not just the vector similarity algorithm — is what produces reliable RAG systems. Most RAG failures are context design failures, not retrieval algorithm failures.

Conversation Memory Strategies

In multi-turn conversational systems, managing conversation history is one of the most consequential context engineering challenges. Four strategies, from simple to sophisticated:

Sliding window. Keep the last N turns and discard older ones. Simple to implement, loses long-range context. Fine for most customer support bots where recent turns matter most.

Summarisation. Periodically compress older conversation history into a summary, replacing the raw turns. The summary is more token-efficient but loses detail. Works well when the general thread of a conversation matters more than specific phrasing.

Entity and fact extraction. Extract key facts, decisions, and entities from the conversation and maintain a structured memory store. Inject the relevant subset into each new context. This is more complex but preserves the most useful information with high token efficiency.

Semantic retrieval over history. Treat conversation history like a mini knowledge base — store turns as embeddings and retrieve the most relevant prior exchanges for each new query. Best for long, complex conversations where relevance varies significantly.

Context Compression Techniques

When you're close to the context limit, compression techniques help preserve information density:

Selective inclusion. Not every retrieved document deserves full inclusion. Extract and include only the specific paragraphs or sentences relevant to the current query rather than entire documents.

LLM-based compression. Use a fast, cheap model to summarize retrieved documents before passing them to the main model. Adds latency and cost but can dramatically increase the effective information density within budget.

Structured over prose. When possible, represent retrieved information as structured data (tables, key-value pairs) rather than prose. Models can extract information from structured formats more reliably, and structure is typically more token-efficient than prose for the same information content.

Context Security: Injection and Poisoning

Context engineering introduces a security surface that prompt engineering alone doesn't expose. When you inject external content into the context window — retrieved documents, user-provided data, API responses — you create an attack vector called prompt injection.

A malicious document in your knowledge base could contain instructions designed to override your system prompt. A user could embed instructions in data they know will be retrieved. These attacks are real and have been demonstrated against production RAG systems.

Mitigation strategies: sanitise external content before injection, use structural separators that clearly delineate retrieved content from instructions, apply output validation before acting on model responses, and maintain privileged instruction layers that aren't overridable by injected content.

Context Engineering for Agents

Agent systems — where the model takes a sequence of actions over multiple steps — put the most demanding requirements on context engineering. Each tool call adds to the context: the tool input, the tool output, and the model's reasoning about what to do next. Over a long agent run, the context accumulates rapidly.

For agentic contexts, consider:

Selective tool output inclusion. Not every tool output needs to be retained in full. Compress or summarise tool outputs once the model has acted on them.
Working memory vs. long-term memory. Distinguish between what the agent needs right now (in context) and what it might need later (retrievable on demand).
Explicit planning context. Give the agent a structured scratchpad for its plan, separate from the raw action history. This improves coherence over long runs.

Context Engineering Is the Harder Skill

Prompt engineering is learnable in days. Context engineering is learnable in months. The reason: it requires understanding how models actually use information across a context window, how attention and position affect information retrieval, how different content types interact, and how to test context designs systematically at scale.

The best AI engineers I've seen treat the context window as precisely as a compiler treats a register file — every byte is accounted for, every allocation is intentional, and waste is a design flaw. That discipline, applied to AI context design, produces systems that are more reliable, cheaper to run, and easier to debug when they fail.

If you're building production AI systems and spending all your time on prompt wording, you're optimising the wrong layer. The context is where the leverage is.