Context window
A context window is the amount of information an AI model can consider in a single request. It's measured in tokens — the basic units of text a language model reads and produces — and it sets a hard upper limit on how much input prompt, conversation history, and retrieved content the model can take into account when generating a response.
The context window is one of the most consequential technical specifications of any large language model. It determines how long a conversation can run before the model loses earlier turns, how much reference material can be supplied in a single call, and how complex a task the model can reason about end to end.
How a context window works
When you send a request to a language model, everything you include (the system prompt, prior conversation turns, retrieved knowledge passages, tool call results, and the new user message) is concatenated into a single input. The model reads that input as one sequence of tokens and builds an internal representation of it, then generates the response one token at a time, with each new token conditioned on the input and on everything generated so far. The context window is the maximum total length of input plus output the model can handle in that pass.
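The input-plus-output limit described above can be sketched as a simple budget check. This is a minimal illustration, not any vendor's API: the function name `fits_in_window`, the part names, and the token counts are all hypothetical, and real counts would come from the model's own tokenizer.

```python
def fits_in_window(part_tokens: dict[str, int], max_output_tokens: int, window: int) -> bool:
    """Return True if every input part plus the reserved output fits in the window."""
    input_tokens = sum(part_tokens.values())
    return input_tokens + max_output_tokens <= window


# Hypothetical token counts for the parts of one request.
request = {
    "system_prompt": 1_200,
    "history": 6_500,
    "retrieved_passages": 3_000,
    "tool_results": 800,
    "user_message": 150,
}

print(fits_in_window(request, max_output_tokens=1_000, window=128_000))
```

Reserving room for the output up front matters because the response counts against the same limit as the input.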
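Tokens are not characters or words; they are the sub-word pieces a model's tokenizer splits text into, so the exact count varies by model and by language. As a rule of thumb, one English token is roughly four characters, or about 0.75 words. A 100,000-token context window therefore holds roughly 75,000 English words, about the length of a 300-page book.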
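The rule of thumb above can be turned into a quick estimator. This is a rough sketch for planning only: the constants are the approximations quoted here, the function names are hypothetical, and an exact count requires the tokenizer of the specific model.

```python
CHARS_PER_TOKEN = 4      # rule of thumb for English text
WORDS_PER_TOKEN = 0.75   # rule of thumb for English text


def estimate_tokens_from_chars(text: str) -> int:
    """Approximate token count from character length."""
    return round(len(text) / CHARS_PER_TOKEN)


def estimate_words_for_tokens(tokens: int) -> int:
    """Approximate number of English words that fit in a token budget."""
    return round(tokens * WORDS_PER_TOKEN)


print(estimate_words_for_tokens(100_000))  # 75000
```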
Context window sizes across major models
Context windows have grown dramatically. A few reference points as of 2026:
- GPT-4o: 128,000 tokens.
- Claude 3.5 Sonnet: 200,000 tokens.
- Gemini 1.5 Pro: Up to 2,000,000 tokens for select customers.
- Llama 3.1 (Meta): 128,000 tokens.
Bigger isn't always better. Larger context windows cost more to use, can lose track of details buried deep in very long inputs, and rarely outperform a well-designed retrieval pipeline for finding the right information. Most production systems use retrieval to surface the few thousand tokens of context that actually matter rather than stuffing everything into a giant window.
Context window vs. tokens vs. memory
These three concepts get blurred together in casual conversation, but they're distinct. Tokens are the units of measurement. The context window is the maximum number of tokens the model can hold at once. Memory — sometimes called long-term memory — is a layer outside the model that persists information across requests, typically by storing facts, summaries, or embeddings in an external store and retrieving them when relevant. Memory works around the context-window limit by deciding what to load into the window at each turn.
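To make the distinction concrete, here is a toy memory layer that lives outside the model and decides what to load into the window for each turn. It is a sketch under loud assumptions: the `MemoryStore` class is hypothetical, the stored facts are made-up sample data, and recall uses naive word overlap where a real system would more likely use embeddings or a search index.

```python
class MemoryStore:
    """Toy long-term memory: keeps facts outside the model, recalls by word overlap."""

    def __init__(self) -> None:
        self.facts: list[str] = []

    def remember(self, fact: str) -> None:
        self.facts.append(fact)

    def recall(self, query: str, limit: int = 2) -> list[str]:
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(fact.lower().split())), fact)
            for fact in self.facts
        ]
        # Keep only facts sharing at least one word with the query, best matches first.
        relevant = sorted(
            (item for item in scored if item[0] > 0),
            key=lambda item: item[0],
            reverse=True,
        )
        return [fact for _, fact in relevant[:limit]]


memory = MemoryStore()
memory.remember("Customer prefers email contact")
memory.remember("Order 1001 shipped on Monday")
memory.remember("Customer is on the premium plan")

# Only the recalled facts, not the whole store, would be placed in the context window.
print(memory.recall("which plan is the customer on"))
```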
Why context windows matter in production
In any real AI system, the context window is a budget you spend on each request. Three practical implications:
- Cost and latency: Every token in the window costs money and adds processing time. Long prompts and long histories silently inflate both.
- Conversation length: Without summarization or a memory layer, long conversations eventually hit the limit and the earliest turns get dropped.
- Retrieval design: The smaller the context window, the more selective retrieval has to be. The larger the window, the more practical it becomes to feed long documents in directly.
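The cost point can be illustrated with some simple arithmetic. The sketch below assumes the full history is resent on every turn and that every turn adds the same number of tokens; the per-million-token prices are placeholders, not any provider's actual rates, and both function names are hypothetical.

```python
def cumulative_input_tokens(tokens_per_turn: int, turns: int) -> int:
    """Total input tokens sent over a conversation when each request resends all prior turns."""
    # Turn k resends k turns' worth of tokens, so the total is 1 + 2 + ... + turns.
    return tokens_per_turn * turns * (turns + 1) // 2


def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_million: float, output_price_per_million: float) -> float:
    """Cost of a request given token counts and per-million-token prices."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


total_input = cumulative_input_tokens(tokens_per_turn=500, turns=20)
print(total_input)  # 105000

# Placeholder prices, for illustration only.
print(request_cost(total_input, 0, input_price_per_million=2.0, output_price_per_million=8.0))
```

Twenty turns of 500 tokens each is only 10,000 tokens of conversation, but resending the growing history means more than ten times that amount is billed as input.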
Managing the context window in real systems
Production AI agents use several techniques to stay inside the window without sacrificing quality. Summarization compresses older conversation turns into a shorter form that preserves the essential context. Retrieval-augmented generation — RAG — pulls only the most relevant passages from a knowledge base rather than dumping everything in. Prompt compression trims system prompts and few-shot examples down to the essential tokens. And conversation hygiene — clearing irrelevant turns, collapsing repeated content — keeps the window focused on what matters.
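As one illustration of the summarization technique, the sketch below collapses older turns into a single summary line once the history exceeds a budget. Everything here is an assumption for illustration: `placeholder_summary` simply keeps the first few words and stands in for a real summarizer (usually another model call), and `estimate_tokens` uses the four-characters-per-token rule of thumb rather than a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per English token."""
    return max(1, len(text) // 4)


def placeholder_summary(turns: list[str], max_words: int = 30) -> str:
    """Stand-in for a real summarizer: keeps only the first words of the older turns."""
    words = " ".join(turns).split()
    return "Summary of earlier turns: " + " ".join(words[:max_words])


def compact_history(turns: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """Replace older turns with one summary when the history exceeds the token budget."""
    total = sum(estimate_tokens(turn) for turn in turns)
    if total <= budget or len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    return [placeholder_summary(older)] + recent


history = ["word " * 100 for _ in range(6)]  # six turns of about 125 tokens each
compacted = compact_history(history, budget=400)
print(len(history), "turns ->", len(compacted), "turns")
```

Keeping the most recent turns verbatim is a design choice: the latest exchange usually carries the detail the next response depends on, while older turns tolerate compression better. A production version would also re-check the compacted history against the budget.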
Context windows and conversational AI
For conversational AI in customer support, the context window has to hold the system prompt, the customer profile, the relevant retrieved policy or product passages, the conversation history, and the new message — all within budget. Good prompt engineering and disciplined retrieval are what make the window manageable. Push too much in and the model slows down, costs spike, and quality often drops as the model loses focus on what's relevant. Research from Google has shown that performance can degrade on long-context retrieval tasks even when the window technically supports them — reinforcing the case for selective retrieval over brute-force loading.
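The budgeting described above can be sketched as two steps: subtract the fixed parts of the request from the window, then fill what remains with retrieved passages in relevance order. This is a minimal sketch, not a reference design: the function names are hypothetical, the passages are assumed to arrive already ranked by some retriever, and token counts use the four-characters-per-token approximation.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per English token."""
    return max(1, len(text) // 4)


def remaining_budget(window: int, output_reserve: int, fixed_parts: list[str]) -> int:
    """Tokens left for retrieved passages after the fixed parts and reserved output."""
    used = sum(estimate_tokens(part) for part in fixed_parts)
    return max(0, window - output_reserve - used)


def select_passages(ranked_passages: list[str], budget: int) -> list[str]:
    """Take passages in relevance order, skipping any that no longer fit."""
    chosen: list[str] = []
    used = 0
    for passage in ranked_passages:
        cost = estimate_tokens(passage)
        if used + cost <= budget:
            chosen.append(passage)
            used += cost
    return chosen


# Hypothetical stand-ins for the system prompt, customer profile, history, and new message.
fixed = ["s" * 400, "p" * 200, "h" * 1_200, "m" * 200]
budget = remaining_budget(window=1_000, output_reserve=300, fixed_parts=fixed)
passages = ["a" * 400, "b" * 800, "c" * 200]  # already ranked, most relevant first
print(budget, [len(p) for p in select_passages(passages, budget)])
```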
Frequently asked questions
What is a context window in AI? A context window is the maximum amount of input plus output, measured in tokens, that an AI model can process in a single request.
How is the context window measured? It's measured in tokens. A token is the basic unit of text a language model processes — roughly four characters or 0.75 words in English.
How big are modern context windows? Widely used models range from 128,000 tokens (GPT-4o) to 200,000 (Claude 3.5 Sonnet) and up to 2,000,000 (Gemini 1.5 Pro). Open-weight models like Llama 3.1 also reach 128,000.
What happens when you exceed the context window? Depending on the API or application, either the request is rejected with an error, or earlier content (typically the oldest conversation turns) is truncated so the rest fits. With truncation, that content is lost from the model's view of the conversation.
Is a bigger context window always better? No. Larger windows cost more, can slow responses, and don't always improve accuracy — well-designed retrieval often outperforms a giant window stuffed with everything.