Introducing Duet Autopilot.
Learn more
Glossary

Token limit

A token limit (also called a context limit or maximum context length) is the hard ceiling on the total number of AI tokens that a large language model can process in a single inference call — encompassing the input prompt, retrieved documents, conversation history, and the model’s own output combined. When the total token count of a request reaches the model’s limit, the model cannot process any additional context; content that exceeds the limit must either be truncated, summarized, or split across multiple calls. Token limits are a fundamental architectural constraint that directly shapes how AI applications are designed and how they behave in production.

A quick reference for 2025–2026 production models: GPT-4o supports 128,000 tokens (roughly 96,000 words) in its context window; Claude 3.5 Sonnet and Claude 3.5 Haiku support 200,000 tokens (~150,000 words); Gemini 1.5 Pro supports up to 1,000,000 tokens in its extended-context configuration. These numbers represent the total input + output budget per call — a 200,000-token model can ingest a 190,000-token document and still generate a 10,000-token response.

How token limits work

Token limits are enforced at the transformer attention layer: the self-attention mechanism that allows a model to relate every token to every other token scales quadratically with sequence length, making arbitrarily long contexts computationally prohibitive. Providers set the token limit to balance model quality, inference cost, and latency. Exceeding the token limit in an API call results in an error; the client application is responsible for managing context to stay within bounds.

In practical terms, the context window consumed by a single request comprises several components: the system prompt (instructions, persona, constraints), retrieved documents (from RAG pipelines), conversation history (prior turns in a multi-turn session), the current user message, and the reserved output budget. Engineers typically allocate the output budget first (e.g., 2,000 tokens for the response) and work backwards to determine how much space remains for context. A 128,000-token model with a 2,000-token system prompt and a 2,000-token output reservation leaves 124,000 tokens for conversation history and retrieved documents.

Why token limits matter

  • Cost: Most LLM APIs charge per token consumed. A longer context means higher cost per call. On GPT-4o at $5 per million input tokens, a 128,000-token prompt costs $0.64 per call — acceptable for a complex analysis task but unsustainable for high-volume customer support interactions where the same model might be called 100,000 times per day.
  • Latency: Processing a larger context takes longer. Time-to-first-token scales with input length at roughly 2–5ms per 1,000 tokens on current hardware. A 100,000-token prompt adds 200–500ms of processing time relative to a 1,000-token prompt — noticeable in real-time conversational applications.
  • Accuracy (the “lost in the middle” problem): Research shows that LLMs are most accurate when retrieving information located at the beginning or end of a long context; information buried in the middle of a large context window is retrieved less reliably. Very large contexts do not always improve accuracy — they can introduce noise if irrelevant material fills the window.

Token limit vs. context window

Token limit and context window are often used interchangeably, but there is a subtle distinction. The context window refers to the span of tokens a model can “see” and attend to at inference time — its active working memory. The token limit is the numerical bound on that window. When people say a model “has a 128k context window,” they mean both that the context window is 128,000 tokens wide and that 128,000 is the token limit for a single call. The terms are functionally equivalent in most engineering conversations.

A related concept is the effective context limit — the point at which adding more context no longer improves (and may degrade) the model’s accuracy. For most current models, the effective limit is 60–80% of the technical token limit; filling a 200,000-token window to capacity does not reliably produce better results than filling it to 150,000 tokens, and can increase cost and latency without benefit.

Token limits in AI customer support

In AI customer support applications, token limits shape architecture decisions at every level. A support agent powered by an LLM must fit the system instructions, customer account context, knowledge base retrievals, and conversation history within the model’s token budget on every turn. As conversations grow longer, earlier turns must be summarized or pruned to stay within the limit. Most production systems implement a sliding window strategy: keep the most recent N turns verbatim and replace earlier turns with a compressed summary, preserving key facts (order number, stated intent, commitments made) while freeing token budget for new context.

Model selection for support workloads is partly a token-limit optimization problem. A model with a 200,000-token limit can ingest a customer’s full 12-month order history and policy documentation in a single call, potentially improving accuracy. But the cost of a 200,000-token call at premium model pricing may be 10–20x higher than a well-engineered 10,000-token call that retrieves only the relevant records. Prompt engineering — designing prompts that extract maximum utility from a compact context — is therefore one of the most valuable skills in AI support system design. Teams also increasingly use LLM routers to match query complexity with the appropriate model tier, avoiding expensive large-context models for simple queries that a smaller model can handle within a tight token budget.

Deliver the concierge experiences your customers deserve

Get a demo