Mixture of experts

Mixture of experts (MoE) is a neural network architecture in which only a subset of the model's parameters, organized into expert networks, is activated for any given input. A gating network dynamically selects which experts to use for each input token.

Dense neural networks activate all of their parameters for every token they process. That design is simple but computationally expensive at scale: doubling the number of parameters roughly doubles the compute required for every forward pass. MoE architectures break that coupling by partitioning the model into many specialized sub-networks and routing each token to only a small number of them at runtime. The result is a model that can hold significantly more total parameters than a dense model of equivalent per-token compute cost, which matters for organizations evaluating large language models (LLMs) under both capability and cost constraints.

How mixture of experts works

In a transformer-based MoE model, the feed-forward layers of the standard transformer are replaced with a set of expert feed-forward networks. A small router network, trained alongside the experts, assigns each token an affinity score for every expert and selects the top-K experts, typically two, to process that token. The outputs of the selected experts are weighted by their affinity scores and summed to produce the layer output.
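The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any particular model's implementation: the names route_token and moe_layer are hypothetical, the experts are toy functions, and renormalizing the top-K weights so they sum to one is an assumption (it is one common convention, not the only one).

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of router logits.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the top-k experts.

    Assumption: weights are renormalized over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs for one token vector."""
    out = [0.0] * len(token)
    for idx, weight in route_token(router_logits, k):
        expert_out = experts[idx](token)  # only the selected experts run
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Toy experts (hypothetical): each scales the token by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
selected = route_token([0.1, 2.0, -1.0, 2.0], k=2)
output = moe_layer([1.0, 1.0], [0.1, 2.0, -1.0, 2.0], experts, k=2)
```

In this toy run, experts 1 and 3 receive equal router scores, so each contributes half of the output and the other two experts are never evaluated. Skipping those unselected experts is where the compute saving comes from.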

Key design decisions include:

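Number of experts: Expert counts vary widely. Mixtral 8x7B uses eight experts per layer with two active per token, while more fine-grained designs use dozens to hundreds of experts per layer with only a handful, such as eight, active per token. The architecture of GPT-4 has not been officially disclosed, though it is widely reported to be an MoE.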
Load balancing: Without explicit regularization, gating networks tend to collapse and route most tokens to a small set of popular experts, wasting the capacity of the rest. Auxiliary load-balancing losses during training counteract this tendency.
Expert capacity: In many implementations, each expert processes a fixed maximum number of tokens per batch. Tokens routed to an over-subscribed expert are dropped or passed through the residual connection, which introduces a source of approximation error that dense models do not have.
Communication overhead: In distributed inference, different experts may live on different hardware accelerators. Routing tokens across accelerators introduces inter-device communication latency that partly offsets the compute savings.
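Two of the considerations above, load balancing and expert capacity, can be made concrete with a short sketch. It assumes top-1 routing and one common formulation of the auxiliary loss (number of experts times the sum, over experts, of the fraction of tokens routed to each expert multiplied by its mean router probability). The function names and the capacity_factor parameter are illustrative assumptions, not a specific library's API.

```python
import math

def load_balance_loss(router_probs, assignments, num_experts):
    """num_experts * sum_i(f_i * p_i); equals 1.0 for perfectly uniform routing.

    router_probs: per-token list of probabilities over experts.
    assignments: per-token index of the chosen (top-1) expert.
    """
    n_tokens = len(assignments)
    frac_tokens = [0.0] * num_experts  # f_i: share of tokens sent to expert i
    mean_prob = [0.0] * num_experts    # p_i: mean router probability for expert i
    for probs, chosen in zip(router_probs, assignments):
        frac_tokens[chosen] += 1.0 / n_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / n_tokens
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))

def apply_capacity(assignments, num_experts, capacity_factor=1.0):
    """Return (kept, dropped) token indices under a per-expert capacity limit."""
    capacity = math.ceil(capacity_factor * len(assignments) / num_experts)
    load = [0] * num_experts
    kept, dropped = [], []
    for token_idx, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token_idx)
        else:
            dropped.append(token_idx)  # would skip the expert (residual path)
    return kept, dropped

# Balanced routing versus a router that has collapsed onto expert 0.
balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], 2)
collapsed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], 2)

# Three of four tokens want expert 0, but its capacity is two.
kept, dropped = apply_capacity([0, 0, 0, 1], num_experts=2, capacity_factor=1.0)
```

The collapsed router scores a higher loss than the balanced one, which is the signal training uses to spread tokens out. In the capacity example, the third token aimed at expert 0 is the one that gets dropped.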

The foundational paper introducing learned mixture of experts for language modeling is Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Shazeer et al., 2017), which established the training and routing design that modern LLM MoE variants extend.

Why mixture of experts matters for customer experience

For teams deploying AI in customer service, MoE matters primarily through its effect on the trade-off between cost and capability. An MoE model can approach the capability of a substantially larger dense model while its per-token compute cost stays close to that of a dense model with the same number of active parameters. That trade-off is significant when setting latency and cost targets for synchronous support channels, where a customer is waiting for a reply in real time.

The trade-offs are real, however. MoE models require more total memory than a dense model with the same number of active parameters, because any expert can be selected for any token, so all expert weights must stay loaded even when most are idle for a given token. This increases infrastructure cost at the hosting layer. Expert dropping, the approximation that occurs when a token's assigned expert is over-subscribed, can introduce subtle inconsistencies in outputs that are harder to diagnose than the more uniform failure modes of dense models. AI observability instrumentation is particularly useful here for catching output quality regressions tied to expert drops caused by load spikes during high-traffic periods.
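The gap between memory footprint and per-token compute is easiest to see with some arithmetic. The sketch below uses entirely hypothetical numbers (2 billion shared parameters, 150 million parameters per expert, 8 experts, 2 active, 32 layers) and assumes 16-bit weights. It counts weights only and ignores activations and the attention cache.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k, num_layers):
    """Return (total, active) parameter counts for a hypothetical MoE model."""
    # Total: every expert in every layer must be stored.
    total = shared_params + num_layers * num_experts * params_per_expert
    # Active: only top_k experts per layer run for a given token.
    active = shared_params + num_layers * top_k * params_per_expert
    return total, active

def weight_memory_gb(param_count, bytes_per_param=2):
    """Approximate weight memory in GB (10**9 bytes); 2 bytes assumes 16-bit weights."""
    return param_count * bytes_per_param / 1e9

# All figures below are illustrative assumptions, not a real model's configuration.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=150_000_000,
    num_experts=8,
    top_k=2,
    num_layers=32,
)
moe_memory = weight_memory_gb(total)     # what must be hosted
dense_memory = weight_memory_gb(active)  # dense model with the same active size
```

With these assumed numbers, the model stores 40.4 billion parameters (about 80.8 GB of weights) but uses only 11.6 billion per token. A dense model with the same active parameter count would need about 23.2 GB.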

Mixture of experts and model selection

When evaluating which foundation model to deploy for a customer service use case, MoE architecture is one factor among several. An MoE model may score comparably to a larger dense model on general benchmarks while costing less to run, but its higher memory footprint may make self-hosted deployment more expensive. Teams should benchmark against their specific inference latency requirements and support workload distributions rather than relying on general capability rankings. The relationship between MoE and underlying foundation model families is also relevant: most frontier MoE models are themselves foundation models fine-tuned for instruction following, so the same adaptation techniques used for dense models apply.
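A workload-specific benchmark can start as a small timing harness run over representative support prompts. The sketch below is one possible shape, not a complete evaluation: fake_generate is a hypothetical stand-in for the model client under test, the prompts are invented examples, and the nearest-rank percentile is a simplifying choice.

```python
import math
import time

def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def measure_latencies(generate, prompts):
    """Time one call to generate(prompt) per prompt; returns seconds per call."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        latencies.append(time.perf_counter() - start)
    return latencies

def fake_generate(prompt):
    # Hypothetical stand-in for a real model call; swap in the client being evaluated.
    return prompt.upper()

# Invented sample prompts; a real run would draw from actual support traffic.
prompts = ["Where is my order?", "How do I reset my password?", "Cancel my plan."]
latencies = measure_latencies(fake_generate, prompts)
p95 = percentile(latencies, 95)
```

Reporting a high percentile such as p95 rather than the mean is a deliberate choice here: for synchronous channels, the slowest responses are the ones customers notice.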
