# What are AI Guardrails?

_AI guardrails are safety mechanisms that constrain AI system behavior, preventing harmful outputs, enforcing policies, and ensuring AI operates within acceptable boundaries._

Guardrails are a key component of [AI safety](/ai-safety) and [LLM security](/llm-security) strategies.

Why it matters: Without guardrails, AI systems can generate harmful content, leak sensitive data, take unauthorized actions, or behave in ways that create liability for organizations. Guardrails are necessary but not sufficient for AI safety. For a deeper look at guardrail limitations, see [Why Guardrails Aren't Enough](/post/guardrails-are-not-enough-real-ai-safety-requires-hard-policy-boundaries).

## Types of AI Guardrails

### Input Guardrails
Filter and validate inputs before they reach the model.

- **Content filtering**: Block toxic, harmful, or inappropriate inputs
- **Injection detection**: Identify prompt injection and jailbreak attempts
- **PII detection**: Flag or redact sensitive information in inputs
- **Topic blocking**: Prevent queries about prohibited subjects
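
As a rough illustration of the input checks listed above, the sketch below redacts two kinds of PII and then blocks prompts that mention a prohibited topic. The regex patterns, the `BLOCKED_TOPICS` set, and the `check_input` function are hypothetical names chosen for this example, and real PII or topic detection needs much broader coverage than simple patterns and keywords.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical list of prohibited subjects (simple keyword match).
BLOCKED_TOPICS = {"weapons", "malware"}


@dataclass
class InputDecision:
    allowed: bool
    text: str    # redacted text, safe to forward when allowed is True
    reason: str


def check_input(prompt: str) -> InputDecision:
    """Redact PII, then block prompts that mention a prohibited topic."""
    redacted = EMAIL_RE.sub("[EMAIL]", prompt)
    redacted = SSN_RE.sub("[SSN]", redacted)
    words = set(re.findall(r"[a-z]+", redacted.lower()))
    hits = sorted(words & BLOCKED_TOPICS)
    if hits:
        return InputDecision(False, redacted, f"blocked topic: {hits[0]}")
    return InputDecision(True, redacted, "ok")
```

Redaction runs before the topic check so that any downstream logging of the decision only ever sees the redacted text.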

### Output Guardrails
Filter and validate model outputs before delivery.

- **Toxicity filtering**: Block harmful, offensive, or inappropriate content
- **Factual grounding**: Require outputs to cite sources or match retrieved context
- **Format validation**: Ensure outputs match expected schemas
- **Sensitive data detection**: Prevent PII/PHI leakage in outputs
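
Format validation and a crude form of factual grounding can be sketched with the standard library alone. The expected fields (`answer`, `sources`) and the `validate_output` function are assumptions made for this example, not a prescribed schema.

```python
import json

# Hypothetical schema: required field name -> expected Python type.
EXPECTED_FIELDS = {"answer": str, "sources": list}


def validate_output(raw: str) -> tuple[bool, str]:
    """Return (ok, reason) for a model response expected to be a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(data, dict):
        return False, "not a JSON object"
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in data:
            return False, f"missing field: {field}"
        if not isinstance(data[field], expected_type):
            return False, f"wrong type for field: {field}"
    # Crude grounding check: require at least one cited source.
    if not data["sources"]:
        return False, "no sources cited"
    return True, "ok"
```

A response that fails validation can be retried, repaired, or replaced with a fallback message rather than delivered as-is.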

### Behavioral Guardrails
Constrain what actions AI can take.

- **Tool allowlisting**: Restrict which functions/APIs the AI can call
- **Permission boundaries**: Limit access scope and capabilities
- **Rate limiting**: Prevent runaway resource consumption
- **Action confirmation**: Require approval for consequential actions
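
The behavioral controls above can be combined in a single gate that sits between the model and its tools. This is a minimal sketch: the tool names, the per-minute limit, and the `ToolGate` class are hypothetical, and a production system would also scope permissions per user or session.

```python
import time
from typing import Callable

# Hypothetical tool names and limits.
ALLOWED_TOOLS = {"search_docs", "send_email"}
NEEDS_CONFIRMATION = {"send_email"}
MAX_CALLS_PER_MINUTE = 3


class ToolGate:
    """Decides whether a tool call proposed by the model may run."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._calls: list[float] = []  # timestamps of recently allowed calls

    def authorize(self, tool: str, approved_by_human: bool = False) -> str:
        if tool not in ALLOWED_TOOLS:
            return "deny: tool not allowlisted"
        if tool in NEEDS_CONFIRMATION and not approved_by_human:
            return "hold: human approval required"
        now = self._clock()
        # Keep only calls from the last 60 seconds (sliding window).
        self._calls = [t for t in self._calls if now - t < 60]
        if len(self._calls) >= MAX_CALLS_PER_MINUTE:
            return "deny: rate limit exceeded"
        self._calls.append(now)
        return "allow"
```

The gate runs in application code, so its decisions do not depend on what the model was told or how it was prompted.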

### Policy Guardrails
Enforce organizational rules and compliance requirements.

- **Brand voice**: Ensure outputs match corporate guidelines
- **Regulatory compliance**: Enforce sector-specific requirements
- **Use case boundaries**: Prevent AI from operating outside intended scope
- **Escalation triggers**: Route to humans when appropriate

## Implementation Approaches

### Prompt-Based Guardrails
System prompts that instruct the model on acceptable behavior.

- **Pros**: Easy to implement, flexible, no additional infrastructure
- **Cons**: Can be bypassed through prompt injection, not deterministic, depends on model cooperation

### Classifier-Based Guardrails
Separate models that evaluate inputs/outputs for safety.

- **Pros**: More robust than prompts alone, can be fine-tuned for specific risks
- **Cons**: Still probabilistic, adds latency and cost, can be adversarially attacked
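
The probabilistic trade-off is easiest to see in how a score is turned into a decision. In the sketch below, `toy_scorer` is a keyword-based stand-in for a trained safety model, and the `classify` function and its default threshold are assumptions for illustration only.

```python
from typing import Callable

# A scorer maps text to a risk score in [0, 1]. In practice this would be a
# trained model; the keyword-based stand-in below is purely illustrative.
Scorer = Callable[[str], float]


def toy_scorer(text: str) -> float:
    risky_terms = ("bomb", "exploit", "bypass")
    hits = sum(term in text.lower() for term in risky_terms)
    return min(1.0, hits / 2)


def classify(text: str, scorer: Scorer = toy_scorer, threshold: float = 0.5) -> str:
    """Block when the risk score meets the threshold.

    Lower thresholds block more: fewer false negatives, more false positives.
    """
    score = scorer(text)
    return "block" if score >= threshold else "allow"
```

With the default threshold, a harmless question such as "How do I bypass the toll road?" is blocked, while raising the threshold to 0.75 lets it through at the cost of missing some genuinely risky inputs.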

### Rule-Based Guardrails
Deterministic rules enforced in code.

- **Pros**: Can't be bypassed through prompting, predictable behavior, no false negatives for covered cases
- **Cons**: Limited coverage, require manual rule creation, can be brittle

### Hybrid Approaches
Combine multiple guardrail types for defense in depth.

A layered setup tends to work best: use rule-based guardrails for critical safety boundaries, classifier-based guardrails for broad coverage, and prompt-based guardrails for nuanced guidance. Test all guardrails with [adversarial testing](/ai-adversarial-testing) and [red-teaming](/ai-red-teaming) before deployment.
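
One way to picture the layering is a pipeline that runs deterministic rules first and a classifier second, stopping at the first layer that objects. The layer functions, the example rule, and the fixed scores below are hypothetical; prompt-based guidance is not shown because it lives in the system prompt rather than in code.

```python
from typing import Callable, List, Optional

# Each layer returns a reason string to block, or None to let the text pass.
Layer = Callable[[str], Optional[str]]


def rule_layer(text: str) -> Optional[str]:
    # Deterministic rule (hypothetical): never accept destructive SQL.
    return "rule: destructive SQL" if "drop table" in text.lower() else None


def classifier_layer(text: str) -> Optional[str]:
    # Stand-in for a learned classifier; a real one would return a model score.
    score = 0.9 if "ignore previous instructions" in text.lower() else 0.1
    return "classifier: likely injection" if score >= 0.5 else None


def run_layers(text: str, layers: List[Layer]) -> str:
    """Run layers in order and stop at the first one that blocks."""
    for layer in layers:
        reason = layer(text)
        if reason is not None:
            return f"blocked ({reason})"
    return "passed"


PIPELINE = [rule_layer, classifier_layer]
```

Putting the cheap deterministic rules first means critical boundaries are enforced even if the classifier is slow, unavailable, or fooled.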

## Guardrail Metrics

Track these metrics to evaluate guardrail effectiveness:

- **Block rate**: Percentage of inputs/outputs blocked
- **False positive rate**: Percentage of legitimate content incorrectly blocked
- **False negative rate**: Percentage of harmful content that bypassed guardrails
- **Bypass rate**: Percentage of adversarial evasion attempts that succeed
- **Coverage**: Percentage of risk categories addressed
- **Latency impact**: Added response time from guardrail processing
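
Given a labeled evaluation set, the first three rates reduce to simple ratios over confusion-matrix counts. The sketch below assumes the conventional definitions (false positive rate over all legitimate items, false negative rate over all harmful items); the `Counts` and `guardrail_metrics` names are introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    true_positive: int   # harmful and blocked
    false_positive: int  # legitimate but blocked
    true_negative: int   # legitimate and allowed
    false_negative: int  # harmful but allowed


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 for an empty denominator instead of raising.
    return numerator / denominator if denominator else 0.0


def guardrail_metrics(c: Counts) -> dict[str, float]:
    total = c.true_positive + c.false_positive + c.true_negative + c.false_negative
    legitimate = c.false_positive + c.true_negative
    harmful = c.true_positive + c.false_negative
    return {
        "block_rate": _ratio(c.true_positive + c.false_positive, total),
        "false_positive_rate": _ratio(c.false_positive, legitimate),
        "false_negative_rate": _ratio(c.false_negative, harmful),
    }
```

Bypass rate can be computed the same way as the false negative rate, restricted to a test set of deliberate adversarial attempts.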

## Why Guardrails Aren't Enough

Guardrails are necessary but not sufficient for AI safety:

### Probabilistic Nature
Most guardrails are probabilistic classifiers that can be fooled. A sufficiently creative attacker can find inputs that bypass detection while still achieving their goal.

### [Prompt Injection](/ai-prompt-injection) Vulnerability
Prompt-based guardrails can be overridden by [prompt injection](/ai-prompt-injection) attacks. If the guardrail depends on the model following instructions, it can be defeated by instructions that override those constraints.

### Coverage Gaps
No guardrail system covers all possible risks. Novel attacks, edge cases, and creative misuse can slip through.

### Stacking Problem
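Adding more guardrail models doesn't linearly improve safety. Each additional classifier has its own failure modes, and classifiers often share blind spots, so an attack that evades one frequently evades the others. Each added layer also contributes its own false positives, latency, and cost.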
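
A small calculation shows why stacking pays off less than it appears to. The sketch assumes a request is blocked if any classifier flags it, and compares two idealized cases: classifiers that fail independently and classifiers whose failures fully overlap. The `stacked_rates` function and the rates passed to it are illustrative, not measurements.

```python
from math import prod


def stacked_rates(miss_rates: list[float], fp_rates: list[float]) -> dict[str, float]:
    """Combine classifiers where a request is blocked if ANY classifier flags it."""
    return {
        # Best case: classifiers fail independently, so miss rates multiply.
        "miss_if_independent": prod(miss_rates),
        # Worst case: failures fully overlap, so extra layers add nothing
        # beyond the single best classifier.
        "miss_if_fully_correlated": min(miss_rates),
        # False positives accumulate: any one classifier can block a
        # legitimate request (independence assumed).
        "false_positive_if_independent": 1 - prod(1 - f for f in fp_rates),
    }
```

For three classifiers that each miss 10% of attacks and wrongly block 2% of legitimate traffic, the combined miss rate lies somewhere between 0.1% and 10% depending on how correlated their blind spots are, while the false positive rate rises to roughly 5.9%.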

## Beyond Guardrails: Hard Policy Boundaries

Real AI safety requires deterministic policies enforced in code, not just probabilistic guardrails:

- **Medication dosage limits** that can't be exceeded regardless of model output
- **Data access controls** that enforce permissions at the infrastructure level
- **Action boundaries** that prevent certain operations entirely
- **Human approval requirements** that can't be bypassed

These hard boundaries protect your system when guardrails fail—which they will.
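
As a minimal sketch of the first item in the list, a dosage limit can be enforced in ordinary code that runs after the model and cannot be influenced by it. The drug names and limits below are placeholders and not clinical guidance; `DoseProposal`, `PolicyViolation`, and `enforce_dose_limit` are hypothetical names for this example.

```python
from dataclasses import dataclass

# Hypothetical limits for illustration only; not clinical guidance.
MAX_DAILY_DOSE_MG = {"drug_a": 4000, "drug_b": 1200}


class PolicyViolation(Exception):
    """Raised when a proposed action breaks a hard policy."""


@dataclass(frozen=True)
class DoseProposal:
    drug: str
    daily_dose_mg: float


def enforce_dose_limit(proposal: DoseProposal) -> DoseProposal:
    """Reject any model-proposed dose outside the hard-coded limits."""
    limit = MAX_DAILY_DOSE_MG.get(proposal.drug)
    if limit is None:
        # Fail closed: an unknown drug has no safe default.
        raise PolicyViolation(f"no limit configured for {proposal.drug}")
    if not 0 < proposal.daily_dose_mg <= limit:
        raise PolicyViolation(
            f"{proposal.daily_dose_mg} mg/day is outside (0, {limit}] for {proposal.drug}"
        )
    return proposal
```

The check fails closed: a drug with no configured limit is rejected rather than passed through, so a gap in configuration cannot silently become a gap in enforcement.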

This is the core insight of [AI supervision](/ai-supervision): guardrails are one layer in a broader system of active oversight. Supervision combines guardrails with hard policies, monitoring, and enforcement into a coherent framework that maintains control even when individual components fail.

## How Swept AI Implements Guardrails

Swept AI combines guardrails with hard policy enforcement:

- **[Supervise](/product/supervise)**: Real-time guardrails for input/output filtering, plus deterministic policies that can't be bypassed. The model can propose, but policies govern final actions.

- **Distribution-aware detection**: Understand what's normal for your system, then detect deviations—not just known attack patterns.

- **Layered defense**: Prompt-based guidance, classifier-based detection, and rule-based enforcement working together.

Guardrails shape and filter AI behavior. Hard policies protect your system when behavior goes wrong.