
LLM token usage

LLM token usage is the number of tokens, the basic units of text that a large language model processes, consumed by a given request. It counts both the input tokens sent to the model and the output tokens the model generates in response.

For customer service teams operating AI agents at scale, token usage is the primary driver of inference cost and a key variable in system design. A single support conversation might consume anywhere from a few hundred tokens for a simple FAQ exchange to several thousand tokens for a complex troubleshooting session that includes tool calls, retrieved documents, and multi-turn reasoning. Multiplied across millions of monthly interactions, token consumption decisions made at the architecture level translate directly into operating expense.

How LLM token usage works

Language models do not process raw text character by character. Instead, a tokenizer first breaks the text into tokens: short common words are often a single token, while longer or rarer words may span two or more tokens. As a rough benchmark, one token corresponds to approximately four characters of English text, or about 0.75 words; the exact ratio varies by tokenizer and by language. OpenAI's documentation on tokens provides a tokenizer tool for inspecting how specific strings are split.
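The rough benchmark above can be turned into a quick estimator for budgeting purposes. The sketch below is illustrative only: the function names are hypothetical, and the constants are the rule-of-thumb ratios from this section, not the output of a real tokenizer. For billing-accurate counts, use the tokenizer that matches the model in question.

```python
import math

# Rule-of-thumb ratios for English text (approximations, not exact values).
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate the token count from the character count (about 4 chars per token)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(text: str) -> int:
    """Estimate the token count from the word count (about 0.75 words per token)."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)
```

The two estimates will usually differ slightly for the same string; either is only suitable for capacity planning, not for reconciling an invoice.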

Token usage for a single API call is the sum of:

  • System prompt tokens: The instructions, persona, policy constraints, and tool definitions sent at the start of each request. These are repeated for every call and can easily exceed 1,000 tokens in a production AI agent.
  • Context window tokens: The history of prior turns in the conversation, including any retrieved documents or tool outputs injected into the context window.
  • Input query tokens: The customer's message itself, including any structured data passed alongside it.
  • Output tokens: The tokens the model generates in its response. Most providers price output tokens higher than input tokens, reflecting the greater compute cost of autoregressive generation.
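The four components listed above can be summed and priced as follows. This is a minimal sketch: the `TokenUsage` class and `request_cost` function are hypothetical names, and the per-million-token prices are parameters the caller supplies, since actual rates differ by provider and model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Token counts for one API call, split by the components described above."""
    system_prompt: int
    context: int
    query: int
    output: int

    @property
    def input_tokens(self) -> int:
        # Everything sent to the model counts as input.
        return self.system_prompt + self.context + self.query

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output


def request_cost(usage: TokenUsage,
                 input_price_per_million: float,
                 output_price_per_million: float) -> float:
    """Cost of one call, with input and output tokens priced separately."""
    return (usage.input_tokens * input_price_per_million
            + usage.output * output_price_per_million) / 1_000_000


# Illustrative numbers only; the prices here are placeholders, not real rates.
usage = TokenUsage(system_prompt=1200, context=2500, query=80, output=300)
cost = request_cost(usage, input_price_per_million=1.0, output_price_per_million=4.0)
```

Multiplying the per-request cost by monthly conversation volume and average calls per conversation gives a first-order estimate of operating expense.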

The token limit of a model bounds the total context that can be processed in one call. When conversation history plus system prompt plus retrieved content approaches that limit, teams have three options: truncate earlier turns, summarize prior context, or switch to a model with a larger context window.
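Of the options above, truncating earlier turns is the simplest to illustrate. The sketch below keeps the most recent turns that fit within whatever budget remains after the system prompt, the current query, and a reserve for the model's output. The function name and parameters are hypothetical, and the per-turn token counts are assumed to have been computed elsewhere.

```python
def fit_history(turn_tokens: list[int],
                system_tokens: int,
                query_tokens: int,
                token_limit: int,
                reserved_output: int) -> list[int]:
    """Return the indices of the turns to keep, in chronological order.

    turn_tokens[i] is the token count of the i-th turn, oldest first.
    The newest turns are kept; older turns are dropped once the budget is spent.
    """
    budget = token_limit - system_tokens - query_tokens - reserved_output
    kept: list[int] = []
    used = 0
    # Walk from the newest turn backwards so recent context survives.
    for idx in range(len(turn_tokens) - 1, -1, -1):
        if used + turn_tokens[idx] > budget:
            break
        kept.append(idx)
        used += turn_tokens[idx]
    kept.reverse()  # restore chronological order
    return kept
```

A production system might summarize the dropped turns rather than discard them, which trades a small number of summary tokens for retained context.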

Why LLM token usage matters for customer experience

Token usage affects both cost and quality in ways that pull in opposite directions. Richer context, including more conversation history, more retrieved documents, and more detailed system instructions, generally improves response accuracy and reduces the rate of AI hallucinations. But it also increases per-request token counts and therefore cost. Teams optimizing AI agents for customer service must find the point where additional context produces diminishing returns relative to its token cost.

Prompt engineering is one of the most direct levers for reducing token usage without degrading quality. Concise, precisely scoped system prompts outperform verbose ones that repeat instructions across sections. Retrieval strategies that inject only the most relevant document chunks, rather than full knowledge base articles, reduce context-window consumption while preserving the grounding benefit. Teams should also monitor output token counts separately from input counts, since runaway generation, where the model produces unnecessarily long responses, increases cost and degrades the customer experience.
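One way to inject only the most relevant chunks is a greedy selection under a token budget. The sketch below assumes each retrieved chunk already carries a relevance score and a token count; the `Chunk` type, the `select_chunks` function, and the `min_score` cutoff are hypothetical and not tied to any particular retrieval library.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    text: str
    score: float   # relevance score from the retriever (higher is better)
    tokens: int    # token count of the chunk text


def select_chunks(chunks: list[Chunk],
                  max_tokens: int,
                  min_score: float = 0.0) -> list[Chunk]:
    """Pick the highest-scoring chunks that fit within max_tokens."""
    selected: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.score < min_score:
            break  # all remaining chunks score even lower
        if used + chunk.tokens <= max_tokens:
            selected.append(chunk)
            used += chunk.tokens
        # A chunk that does not fit is skipped; a smaller one may still fit.
    return selected
```

Greedy selection is not guaranteed to be optimal, but it is simple to reason about and makes the context budget explicit.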

A meaningful limitation of token-based billing is that it creates an incentive to minimize context, and that incentive can conflict with quality goals. An AI agent configured to keep conversations short to save tokens may truncate reasoning, skip verification steps, or fail to surface relevant policy information. These outcomes raise escalation rate and reduce resolution rate. Cost optimization at the token level should always be validated against outcome metrics rather than treated as a standalone objective.

LLM token usage and operational design

Monitoring token usage per conversation, per intent category, and per agent version is an important part of AI observability. Sudden spikes in average token count may indicate that a retrieval system is returning noisier results, that a system prompt has grown without review, or that a new conversation flow is generating unexpectedly long outputs. Usage telemetry should be a standard dashboard metric alongside CSAT, resolution rate, and escalation rate. Teams running multiple model versions or A/B testing prompt changes can use token-per-resolution as a compound efficiency metric that balances cost against quality outcomes.
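The token-per-resolution metric and a basic spike check might be computed as in the sketch below. The function names, the input shapes, and the 25% threshold are assumptions chosen for illustration; a real observability pipeline would likely segment by intent category and agent version as described above.

```python
from statistics import mean
from typing import Optional


def tokens_per_resolution(conversations: list[tuple[int, bool]]) -> Optional[float]:
    """Total tokens spent divided by the number of resolved conversations.

    Each item is (total_tokens, resolved). Tokens spent on unresolved
    conversations still count toward the numerator, so escalations make
    the metric worse. Returns None when nothing was resolved.
    """
    total = sum(tokens for tokens, _ in conversations)
    resolved = sum(1 for _, was_resolved in conversations if was_resolved)
    if resolved == 0:
        return None
    return total / resolved


def is_spike(baseline_averages: list[float],
             current_average: float,
             threshold: float = 1.25) -> bool:
    """Flag the current average if it exceeds the baseline mean by the threshold factor."""
    return current_average > threshold * mean(baseline_averages)
```

Comparing this metric across prompt variants in an A/B test shows whether a cheaper prompt is also resolving fewer conversations.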

