Mean opinion score (MOS)
Mean opinion score (MOS) is a standardized numerical measure of perceived audio quality, expressed on a scale of 1 to 5, where 1 ("bad") represents effectively unintelligible speech and 5 ("excellent") represents audio indistinguishable from a face-to-face conversation.
Voice quality is not merely a technical nicety in customer service. When a caller struggles to hear or understand a support agent, their frustration compounds whatever issue brought them to the phone. For AI voice agents, poor audio quality sets off a chain of failures: degraded audio produces transcription errors in the automatic speech recognition (ASR) layer, which in turn produce incorrect intent classifications and wrong responses. MOS gives operations teams a single number to track whether their telephony infrastructure is holding up under production load.
How MOS is calculated
MOS was originally defined by the ITU-T in recommendation P.800 as a subjective test: trained human listeners sit in a calibrated quiet room, hear speech samples over the network under evaluation, and rate each sample on the 1-to-5 scale. The arithmetic mean of all listener ratings is the MOS for that condition. ITU-T P.800 specifies the room acoustics, the sample sentences, and the listener pool size to ensure results are reproducible across labs.
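As a small illustration of the arithmetic, the sketch below averages listener ratings for one test condition. The ratings are made-up values, and the function name `mean_opinion_score` is hypothetical rather than part of any standard tool.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Return the arithmetic mean of listener ratings on the 1-to-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


# Illustrative ratings from eight listeners for a single condition.
ratings = [4, 4, 3, 5, 4, 3, 4, 5]
print(f"MOS = {mean_opinion_score(ratings):.2f}")
```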
In practice, contact centers use objective, automated MOS estimation rather than running human listener panels. The most common algorithmic approach is the E-model, defined in ITU-T G.107, which computes an R-factor from measured network impairments and then maps that R-factor to a predicted MOS value. The key inputs are:
- Latency: One-way delay above 150 milliseconds begins to degrade conversational flow and lowers predicted MOS.
- Jitter: Variation in packet arrival times above roughly 30 milliseconds causes audible choppiness.
- Packet loss: Loss rates above 3 percent produce noticeable gaps in speech; loss above 10 percent typically renders conversation unworkable.
- Codec: The G.711 codec, common in PSTN and Voice over Internet Protocol (VoIP) deployments, has a theoretical maximum MOS of 4.4 under ideal conditions; more heavily compressed codecs cap lower.
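The R-factor-to-MOS step can be sketched in a few lines. The mapping in `r_to_mos` follows the conversion published in ITU-T G.107. The `simplified_r` helper, by contrast, is only a rough approximation and rests on assumptions: a default R of 93.2, a commonly used simplified delay-impairment term, random (non-bursty) packet loss, and codec values often quoted for G.711 with packet loss concealment (`ie=0`, `bpl=25.1`). Jitter is not modeled directly; in practice it tends to show up as extra jitter-buffer delay and discarded late packets. Both function names are hypothetical.

```python
def r_to_mos(r):
    """Map an E-model R-factor to a predicted MOS (G.107 conversion)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    mos = 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6
    # The cubic dips slightly below 1 for very small R, so clamp it.
    return max(1.0, mos)


def simplified_r(delay_ms, loss_pct, ie=0.0, bpl=25.1):
    """Rough R-factor from one-way delay (ms) and random packet loss (%).

    ASSUMPTION: simplified impairment terms, not the full G.107 model.
    """
    # Delay impairment: linear, with an extra penalty past about 177 ms.
    i_d = 0.024 * delay_ms
    if delay_ms > 177.3:
        i_d += 0.11 * (delay_ms - 177.3)
    # Effective equipment impairment under random packet loss.
    ie_eff = ie + (95 - ie) * loss_pct / (loss_pct + bpl)
    return 93.2 - i_d - ie_eff


for delay_ms, loss_pct in [(0, 0), (150, 3), (400, 10)]:
    r = simplified_r(delay_ms, loss_pct)
    print(f"delay={delay_ms} ms loss={loss_pct}% R={r:.1f} MOS={r_to_mos(r):.2f}")
```

With no impairments, R stays at 93.2, which maps to roughly 4.41 and is consistent with the G.711 ceiling quoted above.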
Why MOS matters for customer experience
A MOS below 3.5 is generally the threshold at which a meaningful share of callers report dissatisfaction. Production VoIP calls typically score between 3.5 and 4.2, and scores above 4.0 are considered good and support natural conversation. These thresholds matter for AI voice deployments because ASR accuracy degrades non-linearly as audio quality drops. A system that performs well at MOS 4.0 can produce sharply higher word error rates at MOS 3.0, which then cascade into failed intent recognition and broken dialogue state tracking (DST).
MOS also surfaces problems that other metrics miss. A customer satisfaction score (CSAT) decline can have dozens of causes; a simultaneous drop in MOS on a specific trunk or region narrows the root cause quickly. Teams running real-time MOS monitoring can trigger alerts before degraded call quality affects enough volume to move CSAT.
One limitation: MOS is a mean, and means obscure outlier experiences. A session with MOS 3.8 on average might include brief sub-2.0 intervals that cause missed words at a critical moment. Pairing MOS with packet-loss and jitter event logs gives a more complete picture than the aggregate score alone.
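To make the point concrete, the sketch below summarizes a session whose mean looks acceptable while two intervals fall below 2.0. The per-interval scores are invented for illustration, and `summarize_session` is a hypothetical helper, not a function from any monitoring product.

```python
from statistics import mean


def summarize_session(interval_mos, floor=2.0):
    """Report the mean alongside the worst interval and a count of dropouts."""
    return {
        "mean": mean(interval_mos),
        "min": min(interval_mos),
        "intervals_below_floor": sum(1 for m in interval_mos if m < floor),
    }


# Hypothetical per-interval scores: mostly good, with two short dropouts.
session = [4.2, 4.2, 4.2, 4.1, 4.2, 1.8, 4.2, 4.1, 1.9, 4.2, 4.2, 4.3]
summary = summarize_session(session)
print(summary)
```

Reporting the minimum and the count of low intervals next to the mean is one simple way to keep short dropouts visible.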
MOS thresholds and operational monitoring
Setting actionable MOS thresholds requires calibrating against your own call mix, because codec choices, network topology, and the geographic spread of callers all shift the baseline. A reasonable starting point is to alert on any trunk or region where rolling 5-minute MOS drops below 3.5, and to page on-call staff when it drops below 3.0. Correlate MOS alerts with ASR confidence scores and escalation rate to confirm that audio degradation is actually causing downstream quality problems rather than reflecting a brief network blip.
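A minimal sketch of that starting point follows, assuming MOS samples arrive per trunk or region with timestamps in seconds and in non-decreasing order. The class name `RollingMos` and the returned labels are hypothetical; a real deployment would hook into its own metrics and paging systems.

```python
from collections import deque

WARN_THRESHOLD = 3.5   # alert level suggested in the text
PAGE_THRESHOLD = 3.0   # paging level suggested in the text
WINDOW_SECONDS = 300   # rolling 5-minute window


class RollingMos:
    """Rolling mean of MOS samples for one trunk or region."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp_seconds, mos) pairs

    def add(self, timestamp, mos):
        """Record a sample, expire old ones, and return the alert level."""
        self.samples.append((timestamp, mos))
        cutoff = timestamp - self.window_seconds
        while self.samples[0][0] <= cutoff:
            self.samples.popleft()
        rolling = sum(m for _, m in self.samples) / len(self.samples)
        if rolling < PAGE_THRESHOLD:
            return "page"
        if rolling < WARN_THRESHOLD:
            return "alert"
        return "ok"


monitor = RollingMos()
for ts, mos in [(0, 4.2), (60, 3.0), (120, 3.0), (400, 2.5)]:
    print(ts, mos, monitor.add(ts, mos))
```

Keeping one `RollingMos` instance per trunk or region keeps the thresholds easy to tune independently once a baseline for each call mix is known.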

