Designing layered guardrails for reliable AI agents
July 30, 2025


Written by Sangwoo Bae
When AI agents are placed at the front lines of customer experience, they must reflect your brand values, speak accurately, and operate safely. That’s both a technical challenge and a business imperative. CX and Product leaders know that an AI agent isn’t simply a tool, but a new kind of company representative that needs structure, oversight, and boundaries.
We’ve built our platform with rigorous guardrails at every stage of the customer interaction—before, during, and after the conversation. These guardrails ensure agents stay on-brand, handle sensitive topics appropriately, minimize hallucinations, comply with business rules, and escalate intelligently. The result is a system that earns customer trust through every exchange.
Before the conversation: Preventing issues with regression testing
Trust starts before a conversation ever begins. Prior to deploying an agent to production, customers can use our testing suite to run it through a battery of structured evaluations designed to catch failures early and ensure predictable, grounded behavior.
Regression testing includes both single-turn and multi-turn interactions, simulating everything from routine FAQs to complex branching dialogues. Teams can replay historical customer conversations against the latest agent logic to verify improvements and guard against unintended changes, which is especially important when working with non-deterministic LLMs.
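To make the replay idea concrete, here is a minimal sketch of what a multi-turn regression check could look like. The `agent.respond` interface, the `Turn` records, and the token-overlap similarity threshold are illustrative assumptions rather than Decagon's actual API; a production suite would typically compare responses with an embedding-based or LLM-judged metric.

```python
# Hypothetical sketch of a multi-turn regression replay test.
# `agent.respond()`, the Turn records, and the similarity threshold are
# illustrative stand-ins, not Decagon's actual API.
from dataclasses import dataclass

@dataclass
class Turn:
    customer_message: str
    expected_reply: str  # reply approved in the historical conversation

def similarity(a: str, b: str) -> float:
    """Toy token-overlap metric; a real suite would use an embedding or LLM judge."""
    a_tokens, b_tokens = set(a.lower().split()), set(b.lower().split())
    return len(a_tokens & b_tokens) / max(len(a_tokens | b_tokens), 1)

def replay_conversation(agent, turns: list[Turn], threshold: float = 0.8) -> list[str]:
    """Replay a historical conversation and report turns that drift too far."""
    regressions = []
    history = []
    for i, turn in enumerate(turns):
        reply = agent.respond(history, turn.customer_message)
        history.extend([("customer", turn.customer_message), ("agent", reply)])
        # Non-deterministic LLMs won't match verbatim, so compare semantically.
        if similarity(reply, turn.expected_reply) < threshold:
            regressions.append(f"turn {i}: expected ~'{turn.expected_reply}', got '{reply}'")
    return regressions
```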

Decagon supports robust unit and integration testing. Unit tests confirm that the agent responds with the right tone and content, while integration tests validate that the agent interacts correctly with external systems, such as retrieving the right data or triggering the right downstream tools.
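As a rough illustration, unit and integration tests in this style might look like the pytest sketch below. The `SupportAgent` class, its `respond` method, and the stubbed order system are hypothetical stand-ins for whatever your own agent and downstream tools expose.

```python
# Illustrative pytest-style tests; `SupportAgent`, `respond()`, and the stubbed
# order system are hypothetical stand-ins for your own stack.
import pytest

class FakeOrderSystem:
    """Stub of an external system so the integration test runs offline."""
    def get_status(self, order_id: str) -> str:
        return "shipped"

@pytest.fixture
def agent():
    from my_agent import SupportAgent  # hypothetical module under test
    return SupportAgent(order_system=FakeOrderSystem())

def test_refund_reply_has_right_tone_and_content(agent):
    # Unit test: check the tone and content of a single response.
    reply = agent.respond(history=[], message="I want a refund")
    assert "refund" in reply.lower()
    assert not any(word in reply.lower() for word in ("unfortunately", "impossible"))

def test_order_status_uses_downstream_tool(agent):
    # Integration test: the agent should call the order system and surface its data.
    reply = agent.respond(history=[], message="Where is order #1234?")
    assert "shipped" in reply.lower()
```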
This test-driven approach aligns the agent with your business logic and customer expectations in real-world workflows, giving you confidence before a single customer message is received.
During the conversation: Real-time guardrails without latency tradeoffs
Once a conversation begins, the stakes rise. The agent is representing your brand in real time and needs to respond with accuracy, consistency, and care. That’s why our platform enforces a set of layered, real-time guardrails—powered by separate fine-tuned models—that monitor for risk, minimize hallucinations, and uphold brand guidelines.
These safeguards are designed to minimize impact on latency. Some, like bad actor detection, run in parallel with response generation, gating final responses and certain actions when necessary to ensure safety. Others, like brand guidelines and hallucination checks, evaluate the generated message just before delivery. This hybrid approach preserves speed and responsiveness without compromising on trust.
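One way to picture this hybrid pattern is the asyncio sketch below: a risk check runs concurrently with response generation, and draft-dependent checks run just before delivery. The function names and behaviors are assumptions for illustration, not our production pipeline.

```python
# Sketch of the timing pattern described above, using asyncio. The check
# functions and their behavior are illustrative assumptions, not Decagon's code.
import asyncio

async def generate_response(message: str) -> str:
    await asyncio.sleep(0.5)   # stand-in for LLM generation
    return f"Here's what I found about: {message}"

async def detect_bad_actor(message: str) -> bool:
    await asyncio.sleep(0.1)   # stand-in for a fine-tuned classifier
    return "ignore your instructions" in message.lower()

async def passes_pre_delivery_checks(draft: str) -> bool:
    await asyncio.sleep(0.05)  # stand-in for brand-voice and hallucination checks
    return bool(draft)

async def handle_message(message: str) -> str:
    # Run risk detection in parallel with generation so it adds no extra latency.
    draft, flagged = await asyncio.gather(
        generate_response(message),
        detect_bad_actor(message),
    )
    if flagged:
        return "I can help with questions about your account and orders."
    # Checks that need the finished draft run just before delivery.
    if not await passes_pre_delivery_checks(draft):
        return "Let me connect you with a teammate who can help."
    return draft

if __name__ == "__main__":
    print(asyncio.run(handle_message("Where is my order?")))
```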
Bad actor detection
The first line of defense is our bad actor model, which evaluates user intent at the start of every conversation. It’s specifically trained to detect signs that the customer may be attempting to manipulate or misuse the agent, whether through adversarial prompts, topic baiting, or attempts to trigger unauthorized actions.
When flagged, the system can either deflect with a neutral response that redirects the conversation to supported topics, or escalate directly to a human. This prevents unsafe or brand-damaging exchanges before they unfold.
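A simple way to express that branching is a policy table mapping detected categories to an action. The category names and defaults below are illustrative only, not our actual taxonomy.

```python
# Hypothetical mapping from detected-misuse categories to a response policy;
# the category names and actions are illustrative, not Decagon's taxonomy.
DEFLECT = "deflect"    # neutral reply that redirects to supported topics
ESCALATE = "escalate"  # hand off directly to a human

BAD_ACTOR_POLICY = {
    "prompt_injection": DEFLECT,      # e.g. "ignore your previous instructions"
    "topic_baiting": DEFLECT,         # steering the agent onto off-limits topics
    "unauthorized_action": ESCALATE,  # trying to trigger actions the user can't take
}

def resolve_flag(category: str) -> str:
    """Default to escalation for anything the policy doesn't explicitly cover."""
    return BAD_ACTOR_POLICY.get(category, ESCALATE)
```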
Human agent escalations
Some conversations need to be handled by a human. Our system allows customers to define escalation rules and policy boundaries that determine when the AI agent should defer to a human agent, such as for legal inquiries, emotionally sensitive topics, or high-stakes account actions.
These rules are evaluated during the reasoning process. When triggered, the agent can gracefully hand off to a human with a summary of the conversation, ensuring alignment with your internal policies and reducing the risk of AI overstepping in sensitive situations.
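For illustration, escalation rules can be thought of as declarative policies the agent consults mid-reasoning. The rule shape, topic labels, and handoff payload in this sketch are assumptions, not the platform's configuration format.

```python
# Illustrative escalation rules evaluated during the agent's reasoning step.
# Rule shapes, topic labels, and the handoff payload are assumptions for this sketch.
from dataclasses import dataclass

@dataclass
class EscalationRule:
    name: str
    topics: set[str]   # topics that should always go to a human
    reason: str

RULES = [
    EscalationRule("legal", {"legal", "subpoena", "lawsuit"}, "Legal inquiries require a human."),
    EscalationRule("sensitive", {"bereavement", "medical"}, "Emotionally sensitive topic."),
    EscalationRule("high_stakes", {"account_closure", "wire_transfer"}, "High-stakes account action."),
]

@dataclass
class Handoff:
    reason: str
    summary: str       # conversation summary passed to the human agent

def check_escalation(detected_topics: set[str], transcript: list[str]) -> Handoff | None:
    """Return a handoff (with a lightweight summary) if any rule matches."""
    for rule in RULES:
        if rule.topics & detected_topics:
            return Handoff(reason=rule.reason, summary=" | ".join(transcript[-5:]))
    return None
```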
Brand voice guidelines
When the agent responds, it must do so in a way that reflects your brand’s personality. Our platform enables detailed customization of tone and style, whether that means sounding warm and casual, avoiding technical jargon, or using emojis and exclamations.
These brand voice constraints guide the phrasing and emotional tone of each message, helping your agent feel like a true extension of your team, not just a generic bot. The result is a consistently branded experience across all interactions.
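Conceptually, brand voice settings boil down to a small configuration that gets rendered into instructions for the model. The fields and defaults below are purely illustrative, not the platform's actual schema.

```python
# Hypothetical brand-voice configuration rendered into model instructions;
# field names and defaults are illustrative only.
from dataclasses import dataclass

@dataclass
class BrandVoice:
    tone: str = "warm and casual"
    avoid_jargon: bool = True
    allow_emoji: bool = True
    banned_phrases: tuple[str, ...] = ("per our policy", "as previously stated")

    def to_instructions(self) -> str:
        lines = [f"Write in a {self.tone} tone."]
        if self.avoid_jargon:
            lines.append("Avoid technical jargon; explain things plainly.")
        lines.append("Emoji are welcome where natural." if self.allow_emoji
                     else "Do not use emoji.")
        if self.banned_phrases:
            lines.append("Never use: " + ", ".join(self.banned_phrases))
        return "\n".join(lines)

print(BrandVoice().to_instructions())
```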
Response supervision
Before any generated message is sent to a customer, it’s reviewed by our supervisor model. This model evaluates the response in context: does it stay on topic, reflect the intent of the conversation, and adhere to known facts?
If the supervisor detects any signs of hallucination (i.e., the response is not properly grounded in the provided model context), it can revise the response or trigger an escalation path. This ensures that what’s said is not only plausible, but also accurate, brand-safe, and appropriate.
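The sketch below shows the general shape of such a pre-delivery check: inspect the draft against the retrieved context, then send, revise, or escalate. The toy grounding heuristic stands in for the fine-tuned supervisor model, and the exact flow is an assumption for illustration.

```python
# Sketch of a pre-delivery grounding check; `is_grounded` is a toy heuristic
# standing in for a fine-tuned supervisor model, and the send/escalate flow is
# an assumption about how such a check could be wired.
from dataclasses import dataclass

@dataclass
class SupervisorVerdict:
    grounded: bool
    action: str          # "send" or "escalate"
    final_message: str

def is_grounded(draft: str, context_snippets: list[str]) -> bool:
    """Toy check: every number-bearing token in the draft must appear in the context."""
    numbers = [tok for tok in draft.split() if any(ch.isdigit() for ch in tok)]
    return all(any(num in snippet for snippet in context_snippets) for num in numbers)

def supervise(draft: str, context_snippets: list[str]) -> SupervisorVerdict:
    if is_grounded(draft, context_snippets):
        return SupervisorVerdict(True, "send", draft)
    # Ungrounded claims: fall back to a safe reply and flag for escalation.
    safe = "I want to make sure I get this right, so let me check with a teammate."
    return SupervisorVerdict(False, "escalate", safe)

print(supervise("Your refund of $42 was issued.", ["Refund issued: $42 on May 2"]))
```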
After the conversation: Closing the feedback loop with Watchtower
Trust doesn’t stop at the end of a chat. In fact, post-conversation analysis is one of the most valuable tools CX and technical teams have for improving agent quality over time. That’s where Watchtower comes in.
Watchtower is our always-on conversation monitoring and QA system. Unlike traditional QA processes that rely on manual sampling and delayed feedback, Watchtower automatically reviews every conversation against your custom criteria. Whether you're monitoring for compliance risks, customer frustration, or upsell opportunities, Watchtower flags interactions that warrant attention and categorizes them for fast triage.
Rather than simple keyword matching, Watchtower understands tone, behavior, and the conversational arc, allowing it to distinguish between harmless language and actual issues. For example, it knows the difference between a casual mention of “cancel” and a true churn risk. This nuance leads to more accurate, contextual analysis and less operational noise.
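Here is a toy version of a criterion-based review to illustrate the difference from keyword matching. The `Criterion` structure and the simple churn-risk judge are assumptions; a real system would rely on a model-based judge that reasons over tone and context rather than keyword heuristics.

```python
# Illustrative monitoring criteria applied to every conversation; the criterion
# definitions and the toy judge are assumptions, not Watchtower's internals.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    description: str               # natural-language rubric for a model-based judge
    judge: Callable[[str], bool]   # stand-in for the model-based evaluation

def churn_risk_judge(transcript: str) -> bool:
    # A real judge would reason over tone and conversational arc; this toy
    # version only flags "cancel" when paired with frustration cues.
    text = transcript.lower()
    return "cancel" in text and any(w in text for w in ("fed up", "terrible", "last straw"))

CRITERIA = [
    Criterion("churn_risk", "Customer expresses genuine intent to cancel.", churn_risk_judge),
]

def review(transcript: str) -> list[str]:
    """Return the names of criteria this conversation should be flagged under."""
    return [c.name for c in CRITERIA if c.judge(transcript)]

print(review("I'm fed up, I want to cancel my plan."))            # ['churn_risk']
print(review("If I cancel later, do I lose my saved settings?"))  # []
```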
By closing the loop between real-world conversations and product improvements, Watchtower helps your team continuously refine agent behavior at scale without the manual overhead of traditional QA.
End-to-end guardrails for safe AI agents
AI agents can be powerful extensions of your brand, but only if they’re reliable, respectful of boundaries, and aligned with your business goals. At every phase of the interaction, we’ve built systems to ensure that your AI stays grounded, safe, and trustworthy. From test-driven development to real-time supervision to post-conversation insight, we give leading brands the confidence to scale AI without scaling risk.
If you're building the future of customer experience, let's make sure it's one you can trust. Want to see how our platform puts these guardrails into action? Get a demo today!