LLM router

An LLM router (also called an AI router or model router) is a system that sits in front of a pool of large language models and dynamically directs each incoming query to the most appropriate model based on a set of routing criteria — typically a combination of query complexity, required response quality, latency budget, and cost per call. Rather than sending every request to a single, fixed model, an LLM router evaluates each request and selects the model best suited to handle it efficiently. As AI infrastructure matures and enterprises run workloads across multiple models simultaneously, the LLM router has become a critical architectural component for managing cost, latency, and quality at scale.

A concrete example illustrates the value: a production AI customer support platform might handle 100,000 queries per day. Approximately 60% of those queries are simple lookups (order status, return policy, store hours) that a small, fast, inexpensive model can answer correctly. The remaining 40% involve complex reasoning, multi-step actions, or nuanced tone matching that requires a frontier model. Without a router, the entire volume is sent to the frontier model, so each simple query costs perhaps 20x more than necessary. With an LLM router, 60,000 queries route to the smaller model and 40,000 to the frontier model. At that 20x price ratio, and assuming similar token usage per query, total model spend drops by about 57% (60,000 × 1 + 40,000 × 20 = 860,000 cost units instead of 2,000,000), with little or no quality loss provided the router correctly identifies which queries are simple.
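The arithmetic behind this example can be checked with a few lines of Python. This is a minimal sketch: the 60/40 split and the 20x price ratio come from the example above, while the cost unit (one small-model call = 1) and the assumption that every call costs the same within a model are simplifications made here for illustration.

```python
# Illustrative figures only. The 60/40 split and 20x ratio come from the
# example in the text; the cost unit is arbitrary (1 = one small-model call).
SIMPLE_QUERIES = 60_000
COMPLEX_QUERIES = 40_000
SMALL_COST = 1
FRONTIER_COST = 20


def daily_cost(route_simple_to_small: bool) -> int:
    """Total daily cost, in units of one small-model call."""
    simple_price = SMALL_COST if route_simple_to_small else FRONTIER_COST
    return SIMPLE_QUERIES * simple_price + COMPLEX_QUERIES * FRONTIER_COST


baseline = daily_cost(route_simple_to_small=False)  # everything to frontier
routed = daily_cost(route_simple_to_small=True)     # simple queries to small
savings = 1 - routed / baseline

print(f"baseline={baseline:,} routed={routed:,} savings={savings:.0%}")
```

Under these assumptions the saving can never exceed the share of traffic that is routed away from the frontier model (60% here), which is a useful sanity check on any projected figure.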

How LLM routing decisions are made

LLM routers use several signal types to make routing decisions. Complexity scoring is the most common: a lightweight classifier evaluates the incoming query and assigns it a complexity score, which maps to a model tier. Simple classifiers use lexical features (query length, presence of technical terms); more sophisticated ones use a small embedding model to score semantic complexity. Confidence-based routing is a second approach: a small model attempts to answer the query first, and the router escalates to a larger model only when the small model’s confidence score falls below a threshold. This cascading strategy is sometimes called “LLM cascade” or “model cascade.”
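The two strategies described above can be sketched as follows. This is a toy illustration, not a production design: the keyword list, the score weights, the thresholds, and the stub models are all hypothetical, and a real cascade would need a genuine confidence signal from the small model rather than the hard-coded values used here.

```python
from typing import Callable, Tuple

# Hypothetical keyword list, for illustration only.
TECHNICAL_TERMS = {"api", "integration", "error", "invoice", "refund", "chargeback"}


def complexity_score(query: str) -> float:
    """Toy lexical score in [0, 1] from query length and technical-term hits."""
    words = [w.strip(".,?!").lower() for w in query.split()]
    length_part = min(len(words) / 50, 1.0)   # longer queries score higher
    hits = sum(1 for w in words if w in TECHNICAL_TERMS)
    term_part = min(hits / 3, 1.0)            # saturates at three hits
    return 0.5 * length_part + 0.5 * term_part


def tier_for(score: float, threshold: float = 0.3) -> str:
    """Map a complexity score to a model tier (threshold is an assumption)."""
    return "frontier" if score >= threshold else "small"


# A model is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(query: str, small: Model, large: Model,
            min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try the small model first; escalate only when its confidence is low."""
    answer, confidence = small(query)
    if confidence >= min_confidence:
        return answer, "small"
    answer, _ = large(query)
    return answer, "large"


# Stub models standing in for real API calls.
def small_stub(query: str) -> Tuple[str, float]:
    confidence = 0.95 if "order" in query.lower() else 0.4
    return "Your order ships tomorrow.", confidence


def large_stub(query: str) -> Tuple[str, float]:
    return "Detailed answer from the larger model.", 0.99


print(tier_for(complexity_score("Where is my order?")))
print(cascade("Why was I charged twice?", small_stub, large_stub))
```

Note the cost trade-off between the two: complexity scoring decides before any model is called, whereas a cascade pays for the small model's attempt on every query, including the ones that end up escalated.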

A third approach is rules-based routing: specific intent categories, content types, or system prompts are mapped to specific models by policy. For example, all calls involving personally identifiable information might be routed to an on-premises model to satisfy data residency requirements. All calls requiring tool use or function calling might route to models with strong tool-use benchmarks. Rules-based routing is transparent and auditable, making it preferred in regulated industries.
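A rules-based router can be as simple as an ordered list of predicates where the first match wins. The sketch below assumes hypothetical model names (`on_prem_model`, `tool_use_model`, `default_model`) and uses two deliberately naive regular expressions as a stand-in for real PII detection. Returning the name of the rule that fired alongside the chosen model is one way to keep decisions auditable.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Naive patterns for illustration; real PII detection needs far more care.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class Request:
    text: str
    needs_tools: bool = False


def contains_pii(req: Request) -> bool:
    return bool(EMAIL.search(req.text) or SSN_LIKE.search(req.text))


# (rule name, predicate, target model). Order matters: first match wins,
# so the data-residency rule takes priority over the tool-use rule.
RULES: List[Tuple[str, Callable[[Request], bool], str]] = [
    ("pii", contains_pii, "on_prem_model"),
    ("tools", lambda r: r.needs_tools, "tool_use_model"),
]


def route(req: Request) -> Tuple[str, str]:
    """Return (model, rule_name) so every decision can be logged and audited."""
    for rule_name, predicate, model in RULES:
        if predicate(req):
            return model, rule_name
    return "default_model", "default"


print(route(Request("My email is jane@example.com")))
print(route(Request("Please start a return", needs_tools=True)))
```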

Routing criteria: cost, latency, and quality

  • Cost: Frontier models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) have been priced at roughly $5–15 per million tokens, while smaller models (GPT-4o-mini, Claude Haiku) cost roughly $0.15–0.40 per million tokens, about 10–40x less. Exact prices vary by provider, differ for input and output tokens, and change often. Routing simple queries to the smaller models is often the single highest-leverage cost optimization in an LLM application. Tracking AI token consumption per model tier is essential for measuring routing efficiency.
  • Latency: Smaller models generate responses faster because they have fewer parameters to evaluate. For real-time conversational applications where response time must stay under 2 seconds, routing to a faster model for simple queries can meaningfully improve the user experience even if cost savings are secondary. Token limits also play a role: routing a long-context query to a model with a larger context window avoids truncation errors.
  • Quality: Routing must not sacrifice quality for queries that require it. The router’s accuracy in predicting which queries need the frontier model is the critical quality metric. Mis-routing a complex query to a weak model produces a worse answer than the naive “always use the best model” baseline. Measuring quality degradation from routing — via A/B testing or human evaluation — is necessary to validate the router’s accuracy before deploying it to production.
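One way to quantify the router accuracy described in the quality criterion is to compare routing decisions against labels from human evaluation. The sketch below is a minimal illustration: the `Labeled` record and the metric names are hypothetical, and it assumes a two-tier setup where each query either needs the frontier model or does not.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Labeled:
    needs_frontier: bool    # ground truth, e.g. from human evaluation
    routed_frontier: bool   # what the router actually chose


def routing_report(samples: List[Labeled]) -> Dict[str, float]:
    """Summarize router accuracy and the two kinds of mis-routing."""
    if not samples:
        raise ValueError("need at least one labeled sample")
    n = len(samples)
    hard = sum(s.needs_frontier for s in samples)
    easy = n - hard
    # Under-routing hurts quality: a hard query went to the weak model.
    under = sum(s.needs_frontier and not s.routed_frontier for s in samples)
    # Over-routing hurts cost: an easy query went to the frontier model.
    over = sum(s.routed_frontier and not s.needs_frontier for s in samples)
    return {
        "accuracy": (n - under - over) / n,
        "under_routed_rate": under / hard if hard else 0.0,
        "over_routed_rate": over / easy if easy else 0.0,
    }


samples = (
    [Labeled(True, True)] * 3 + [Labeled(True, False)]
    + [Labeled(False, False)] * 5 + [Labeled(False, True)]
)
print(routing_report(samples))
```

Keeping the two error rates separate matters because they have different consequences: under-routing degrades answers, while over-routing only erodes the cost savings.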

LLM router vs. model ensembling

An LLM router selects a single model per query and uses that model’s output exclusively. Model ensembling runs multiple models on the same query and combines their outputs through voting, averaging logits, or a learned aggregation step. Ensembling can achieve higher accuracy than any single model, but at multiplicative cost: running N models costs roughly N times as much. LLM routing makes the opposite trade, giving up the accuracy gains of ensembling in exchange for cost efficiency. The two approaches are complementary: in a tiered system, the router might send high-stakes queries to an ensemble and routine queries to a single fast model.
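The difference in call count can be made concrete with a small sketch. It assumes the models return short, comparable labels (for example an intent category), since simple majority voting does not work directly on free-form text. The model names and the `pick` function are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Tuple

Model = Callable[[str], str]


def route_single(query: str, models: Dict[str, Model],
                 pick: Callable[[str], str]) -> Tuple[str, int]:
    """Routing: choose one model and use its output. Cost: 1 call."""
    name = pick(query)
    return models[name](query), 1


def ensemble_vote(query: str, models: Dict[str, Model]) -> Tuple[str, int]:
    """Ensembling: ask every model and take the majority. Cost: N calls."""
    answers = [model(query) for model in models.values()]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner, len(answers)


# Stub classifiers standing in for real model calls.
models: Dict[str, Model] = {
    "a": lambda q: "billing",
    "b": lambda q: "billing",
    "c": lambda q: "shipping",
}

print(route_single("Where is my refund?", models, pick=lambda q: "c"))
print(ensemble_vote("Where is my refund?", models))
```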

A related concept is the context window management problem — routing is one strategy for handling queries that exceed a given model’s token limit. If a query requires processing a document that exceeds the context window of the default model, the router can escalate it to a model with a larger window rather than truncating the input and degrading accuracy.
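Context-window escalation can be sketched as "pick the cheapest model whose window fits the request". Everything specific here is an assumption made for illustration: the model names, window sizes, and relative costs are invented, and the four-characters-per-token estimate is only a rough rule of thumb for English text. A real router would count tokens with the target model's own tokenizer.

```python
from typing import List, Tuple

# (name, context window in tokens, relative cost). Hypothetical values.
MODELS: List[Tuple[str, int, float]] = [
    ("small", 8_000, 1.0),
    ("medium", 32_000, 5.0),
    ("large", 128_000, 20.0),
]


def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (assumption)."""
    return len(text) // 4 + 1


def pick_model(prompt: str, reserved_output_tokens: int = 1_000) -> str:
    """Return the cheapest model whose window fits prompt plus output budget."""
    needed = estimate_tokens(prompt) + reserved_output_tokens
    for name, window, _cost in sorted(MODELS, key=lambda m: m[2]):
        if needed <= window:
            return name
    raise ValueError(f"no model can fit an estimated {needed} tokens")


print(pick_model("What are your store hours?"))
print(pick_model("x" * 40_000))
```

Raising an error when nothing fits forces the caller to choose an explicit fallback, such as chunking or summarizing the document, instead of silently truncating the input.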

LLM routers in AI customer support

In AI customer support deployments, LLM routers are typically configured with three model tiers. The first tier — handling routine queries like order status, FAQs, and password resets — uses a small, fast model optimized for low latency and cost. The second tier — handling moderately complex queries requiring account data synthesis, policy interpretation, or multi-step troubleshooting — uses a mid-tier model. The third tier — reserved for escalation-risk queries, emotionally sensitive interactions, and novel issue types — uses the highest-quality available model and may also trigger a human-in-the-loop review step.
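The three-tier setup described above lends itself to a declarative policy, sketched below. The intent labels and model names are hypothetical placeholders, and the sketch assumes an upstream classifier has already assigned an intent to each query. Unknown intents fall through to the top tier, matching the idea that novel issue types get the highest-quality model.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Tier:
    model: str
    human_review: bool = False


# Hypothetical model names; swap these to adopt a new model in a tier.
POLICY: Dict[str, Tier] = {
    "routine": Tier("small-fast-model"),
    "moderate": Tier("mid-tier-model"),
    "sensitive": Tier("frontier-model", human_review=True),
}

# Hypothetical intent labels produced by an upstream classifier.
INTENT_TO_TIER: Dict[str, str] = {
    "order_status": "routine",
    "faq": "routine",
    "password_reset": "routine",
    "policy_interpretation": "moderate",
    "troubleshooting": "moderate",
    "escalation_risk": "sensitive",
    "emotionally_sensitive": "sensitive",
}


def resolve(intent: str) -> Tier:
    """Look up the tier for an intent; unknown intents go to the top tier."""
    tier_name = INTENT_TO_TIER.get(intent, "sensitive")
    return POLICY[tier_name]


print(resolve("faq"))
print(resolve("never_seen_before"))
```

Keeping the model names in one table means a tier can be repointed at a different model with a one-line change, although any such swap should still be validated for quality before it reaches production.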

The router itself is typically a small, fast model that adds roughly 20–50ms of overhead per call, which is small relative to the 500–2,000ms inference time of the downstream models. As LLM pricing continues to evolve and new models enter the market, the routing configuration becomes a continuously tunable cost-quality dial. Teams that invest in router infrastructure and quality measurement early gain a durable operational advantage: they can adopt new, cheaper models as they become available by updating the routing policy and re-validating quality, without re-architecting the full system. Prompt engineering practices apply equally to each tier: well-crafted prompts for each model in the pool maximize the quality available at each cost point.
