
Token limit

A token limit is the maximum number of tokens that a large language model can process in a single input-output sequence. Tokens are the units — roughly corresponding to word fragments — that AI models use to represent text. Every model has a maximum token capacity, and when a conversation, document, or query exceeds it, the model cannot process the excess content, potentially losing critical context.

Token limits define the boundary of what a model can "see" at any one time. They apply to the combined total of all input (system prompt, conversation history, retrieved documents, user message) and output (the model's response). Understanding and managing token limits is essential for teams building AI customer service systems, because exceeding them silently degrades quality — responses lose context or become incorrect as earlier parts of a conversation fall out of the processing window.
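Exact token counts depend on the model's own tokenizer (OpenAI models, for example, ship theirs as the tiktoken library), but a common rule of thumb for English text is roughly four characters per token. A minimal sketch of that heuristic, useful for quick budgeting before involving a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the common ~4-characters-per-token
    heuristic for English text. Real counts require the model's own
    tokenizer (e.g., tiktoken for OpenAI models)."""
    return max(1, len(text) // 4)

# A 4,000-character policy document occupies on the order of 1,000 tokens.
policy_document = "x" * 4000
print(estimate_tokens(policy_document))  # → 1000
```

The heuristic is only a planning aid — actual counts vary by language and content — but it is usually close enough to flag a prompt that is about to blow past the context window.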

How token limits work

Every large language model has a defined context window — the total number of tokens it can hold in active memory for a single generation call. When the combined token count of the input reaches this limit, content must be truncated or summarized. In most implementations, the earliest parts of the conversation are dropped first, which can cause the model to lose sight of key information the customer shared at the start of the interaction.

Factors that consume tokens include:

  • System prompt: The instructions, persona definition, and policy content provided to the AI before the conversation begins.
  • Conversation history: All prior turns in the current session, accumulating with each exchange.
  • Retrieved documents: Content pulled from a knowledge base via retrieval augmented generation (RAG) to ground the AI's responses.
  • User message and AI response: The current turn's input and output.
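The "earliest turns dropped first" behavior described above can be sketched as a simple budget check. This is an illustrative implementation, not any particular vendor's API; the 4-characters-per-token estimate stands in for a real tokenizer:

```python
def count_tokens(text: str) -> int:
    # Crude ~4-chars-per-token heuristic; a production system would
    # use the model's actual tokenizer.
    return max(1, len(text) // 4)

def trim_history(system_prompt: str, history: list[str],
                 user_message: str, budget: int) -> list[str]:
    """Drop the oldest conversation turns until the system prompt,
    remaining history, and current user message fit the token budget.
    Mirrors the common truncation behavior where early context is
    lost first."""
    fixed = count_tokens(system_prompt) + count_tokens(user_message)
    kept = list(history)
    while kept and fixed + sum(count_tokens(t) for t in kept) > budget:
        kept.pop(0)  # the earliest turn falls out of the window first
    return kept
```

Note what this sketch makes visible: the customer's opening message — often the one describing the actual issue — is exactly the turn most likely to be discarded.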

Why token limits matter in customer service AI

In a short, simple interaction — a two-turn FAQ exchange — token limits rarely pose a problem. But in complex, multi-turn service conversations involving lengthy policy documents, detailed order histories, or extended troubleshooting sequences, token consumption accumulates quickly. An AI system that silently loses the beginning of a conversation may forget what issue the customer originally described, contradict itself, or fail to apply context captured early in the session.

Multi-turn conversations are particularly exposed to token limit constraints. A customer working through a complex return dispute over eight or ten exchanges may fill a significant portion of even a large context window. System prompts that are verbose and poorly optimized compound this by consuming tokens before the conversation even starts.

Managing token limits effectively

Teams can manage token pressure through several strategies. Compressing system prompts to include only essential instructions reduces baseline token consumption. Implementing conversation summarization — periodically summarizing earlier turns into a compact summary that replaces the raw transcript — preserves semantic content while freeing token capacity. Selective retrieval from knowledge bases, pulling only the most relevant passages rather than entire documents, keeps RAG-sourced content from overwhelming the context.
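The summarization strategy above can be sketched as follows. The `summarize` function here is a hypothetical stand-in for an LLM summarization call — in a real system it would prompt the model to compress the older turns:

```python
def summarize(turns: list[str]) -> str:
    # Hypothetical stand-in for an LLM summarization call; a real
    # system would ask the model to compress these turns into a
    # short paragraph.
    return "Summary of earlier conversation: " + " | ".join(t[:40] for t in turns)

def compact_history(history: list[str], keep_recent: int = 4) -> list[str]:
    """Replace all but the most recent turns with a single summary
    turn, preserving semantic content while freeing token capacity."""
    if len(history) <= keep_recent:
        return list(history)
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent
```

The design trade-off is between fidelity and capacity: recent turns stay verbatim because they carry the live thread of the conversation, while older turns are collapsed into a summary the model can still reference.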

According to Anthropic's guidance on context management, thoughtful context window management is one of the most impactful levers for maintaining response quality in long or complex interactions. Monitoring token usage per conversation type in production helps identify which workflows are most at risk and where optimization effort should focus.
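Per-workflow monitoring need not be elaborate. A minimal sketch, assuming token counts per conversation are already being logged (the class and threshold here are illustrative, not a specific product's API):

```python
from collections import defaultdict

class TokenUsageMonitor:
    """Track token consumption per conversation type (e.g., 'returns',
    'billing') to spot workflows at risk of hitting the limit."""

    def __init__(self, token_limit: int):
        self.token_limit = token_limit
        self.usage: dict[str, list[int]] = defaultdict(list)

    def record(self, conversation_type: str, tokens_used: int) -> None:
        self.usage[conversation_type].append(tokens_used)

    def at_risk(self, threshold: float = 0.8) -> list[str]:
        # Flag conversation types whose average usage exceeds a
        # fraction of the context window.
        return [ctype for ctype, counts in self.usage.items()
                if sum(counts) / len(counts) > threshold * self.token_limit]
```

A report like this makes the optimization priorities concrete: a workflow averaging 80% of the window is one long troubleshooting session away from silent truncation.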

Token limits and customer experience

When token limits are ignored, the consequences show up as inconsistency, forgetfulness, and unexplained errors in AI responses — all of which damage the customer experience in ways that are difficult to diagnose. Customers do not know what a token limit is; they only know that the AI seemed to forget what they said five messages ago. Building systems that manage token consumption proactively, rather than reactively, ensures that AI maintains context quality throughout the full arc of a customer interaction.
