Context engineering
Context engineering is the discipline of designing, structuring, and managing everything that goes into a large language model's context window to produce accurate, reliable, and useful outputs. It is a broader concept than prompt engineering, which focuses specifically on the wording of instructions and examples. Context engineering encompasses the entire information architecture of an LLM invocation: the system prompt, retrieved documents from a retrieval-augmented generation pipeline, tool definitions and function signatures available to the model, conversation history from prior turns, few-shot examples, structured data passed as context, and the formatting decisions that determine how all of these are arranged in the token stream.
A concrete illustration: a customer support AI agent receives a user message — "What's the status of my return?" The prompt engineer's concern is how to phrase the system instruction so the model responds helpfully and in the right tone. The context engineer's concern is: which order history records should be retrieved and injected? How should they be formatted? How many prior conversation turns should be included before truncation? Should the model receive the full return policy or a summarized version? What tool signatures should be visible, and which should be hidden to reduce distraction? Each of these decisions shapes the model's reasoning quality as much as the instruction text does.
What context engineering encompasses
The components that context engineers manage fall into several categories:
- System prompt: The foundational instruction set defining the model's role, constraints, output format, and behavioral guidelines. Well-structured system prompts use explicit sections (role definition, rules, output format, examples) rather than a single paragraph of instructions.
- Retrieved context: Documents, database records, or knowledge base articles fetched by a retrieval system and inserted into the context. The selection algorithm, chunk size, and ordering of retrieved documents all affect how effectively the model uses the information.
- Conversation history: Prior turns of a multi-turn conversation. Including too little history causes the model to lose track of user intent; including too much consumes context window tokens and can cause the model to over-weight earlier turns.
- Tool definitions: In function-calling architectures, the tool schemas provided to the model define what actions are available. Providing too many tool definitions at once increases the risk of tool misselection; providing too few leaves the model unable to complete certain tasks.
- Few-shot examples: Representative input-output pairs that demonstrate desired behavior. Placement, format, and selection of examples significantly affect model performance on similar inputs.
- Structured data: Tables, JSON payloads, or key-value pairs representing user account data, product catalogs, or policy documents. How this data is formatted affects how reliably the model extracts values from it.
Why context engineering emerged as a distinct concept
Prompt engineering, as originally defined, addressed a narrower problem: finding the right phrasing of a task instruction. As LLM applications moved from standalone completions to complex agentic pipelines with retrieval, tool use, memory, and multi-step reasoning, the phrasing of a single instruction became a small fraction of what determined output quality. Engineers building production retrieval-augmented generation systems discovered that retrieval strategy — what documents to fetch and how to rank them — mattered more than instruction wording. Teams building tool-using agents found that the structure of tool definitions shaped model behavior as much as the task description did.
Context engineering provides a unifying frame for all of these considerations. It positions the LLM not as a black box that responds to prompts but as a system with a bounded information budget (the context window) that must be carefully allocated across competing signal types. This framing naturally leads to practices like context compression (summarizing older conversation turns to free up tokens), selective tool exposure (showing only the tools relevant to the current task state), and dynamic retrieval (re-querying the retrieval system mid-conversation as the user's need clarifies).
Context engineering and token economics
Every element in the context window consumes tokens, which map directly to inference cost and latency. A system that naively includes the full conversation history, all available tool definitions, and maximally verbose retrieved documents will quickly saturate the context window of most models on long conversations. Context engineering therefore involves explicit token budgeting: allocating a ceiling on history tokens, retrieved context tokens, tool definition tokens, and system prompt tokens, and managing the tradeoffs when the sum approaches the context window limit.
Context compression techniques — summarizing earlier conversation turns with a secondary model call, selecting the most relevant retrieved documents rather than the top-k by recency, or converting verbose JSON payloads into compact natural language summaries — are standard tools in a context engineer's toolkit. The tradeoff in each case is fidelity vs cost: more compact context is cheaper but may omit information the model needs.
Practical implications for enterprise AI
For enterprise teams deploying AI agents in production, context engineering is where most of the reliability work happens. A well-written system prompt with poorly designed retrieval, excess conversation history, and unstructured tool definitions will underperform a simpler instruction with well-engineered context. The most common failure mode in production AI agents is not that the model lacks capability — it is that the context provided does not contain the information the model needs, or contains too much noise for the model to extract the relevant signal reliably.
Systematic context engineering involves establishing explicit contracts for each context component (what format, what maximum size, what selection algorithm), logging context payloads alongside outputs, and A/B testing context architecture changes the same way product teams test UI changes. AI observability tooling increasingly logs full context payloads — not just prompts — so that engineers can diagnose why a specific invocation produced a particular output. Teams that treat context design as an engineering discipline — with version control, structured evaluation, and intentional iteration — consistently outperform teams that treat the context window as an afterthought to the instruction text.

