# How to Test ML Models?

_ML model testing goes beyond traditional software testing—evaluating accuracy, fairness, robustness, and safety to ensure models work reliably in production._

Why it matters: Unit tests don't catch model failures. A model can pass all code-level tests while producing biased predictions, hallucinating facts, or failing on edge cases. ML testing validates what the model does, not just that the code runs.

## How ML Testing Differs

### Traditional Software Testing
- **Deterministic**: Same input → same output
- **Specification-based**: Test against defined requirements
- **Binary outcomes**: Pass or fail
- **Code-focused**: Test functions and integrations

### ML Model Testing
- **Probabilistic**: The same input may produce different outputs (for example, with sampling-based generation or after retraining)
- **Distribution-based**: Evaluate across input populations
- **Metric-based**: Measure performance levels, not binary outcomes
- **Behavior-focused**: Test what the model does, not just that code runs

## Testing Dimensions

### 1. Accuracy Testing
Does the model make correct predictions? See [model evaluation functions](/model-evaluation-functions) for detailed metric implementations.

- **Overall metrics**: Accuracy, precision, recall, F1, AUC-ROC
- **Regression metrics**: MSE, MAE, R-squared
- **LLM metrics**: Groundedness, faithfulness, answer relevance
- **Threshold selection**: Operating points that balance precision/recall for your use case
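As a minimal sketch of how the classification metrics and threshold selection above fit together, the following computes accuracy, precision, recall, and F1 from binary labels, then re-evaluates them at two score thresholds. The function names and the toy data are illustrative assumptions, not part of any particular library.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def metrics_at_threshold(y_true, scores, threshold):
    """Turn scores into labels at the given operating point, then score them."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    return classification_metrics(y_true, y_pred)


# Toy data (assumed for illustration only).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]

for threshold in (0.5, 0.3):
    print(threshold, metrics_at_threshold(y_true, scores, threshold))
```

On this toy data, lowering the threshold from 0.5 to 0.3 raises recall and precision together, which is unusual; on real data the two typically trade off, so the operating point should reflect the relative cost of false positives and false negatives in your use case.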

### 2. Slice Analysis
Does the model work well for all subpopulations?

Test performance across:
- Protected attributes (gender, race, age)
- Business segments (customer tiers, regions)
- Input characteristics (text length, image quality)
- Edge cases (unusual inputs, boundary conditions)

Models that perform well on average may fail for specific groups.
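A small sketch of slice analysis, assuming each evaluation record is a dictionary with `label`, `prediction`, and a slice attribute (here a hypothetical `region` field). It shows how a model with high overall accuracy can still underperform on one slice.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Return {slice_value: accuracy} for records holding 'label' and 'prediction'."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        value = record[slice_key]
        totals[value] += 1
        correct[value] += int(record["label"] == record["prediction"])
    return {value: correct[value] / totals[value] for value in totals}


def underperforming_slices(slice_accuracy, min_accuracy):
    """List slices whose accuracy falls below the chosen floor."""
    return sorted(v for v, acc in slice_accuracy.items() if acc < min_accuracy)


# Toy data (assumed): 8 correct "north" records, 1 correct and 1 wrong "south" record.
records = (
    [{"region": "north", "label": 1, "prediction": 1}] * 8
    + [{"region": "south", "label": 1, "prediction": 1}] * 1
    + [{"region": "south", "label": 1, "prediction": 0}] * 1
)

per_region = accuracy_by_slice(records, "region")
print(per_region)
print(underperforming_slices(per_region, min_accuracy=0.8))
```

Overall accuracy here is 90%, yet the smaller slice sits at 50%. Small slices also give noisy estimates, so it is worth reporting the sample count next to each slice metric.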

### 3. Fairness Testing
Does the model treat protected groups equitably?

- **Demographic parity**: Equal positive rates across groups
- **Equalized odds**: Equal true positive and false positive rates across groups
- **Calibration**: Prediction scores mean the same thing across groups
- **Disparate impact analysis**: Compare selection rates between groups; required in many regulatory contexts

See [AI Bias and Fairness](/ai-bias-fairness) for detailed fairness frameworks.
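To make two of the fairness measures above concrete, here is a hedged sketch that computes per-group selection rates, the demographic parity difference, and the disparate impact ratio. The helper names and data are assumptions for illustration. The 0.8 cutoff mentioned in the comment is the common "four-fifths" rule of thumb, not a universal legal standard.

```python
from collections import defaultdict


def selection_rates(y_pred, groups):
    """Fraction of positive predictions (1) within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_difference(rates):
    """Gap between the highest and lowest selection rate (0 means parity)."""
    return max(rates.values()) - min(rates.values())


def disparate_impact_ratio(rates):
    """Lowest selection rate divided by the highest (1 means parity)."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group receives positive predictions, so rates are equal
    return min(rates.values()) / highest


# Toy data (assumed): group "a" is selected 3 times out of 4, group "b" once out of 4.
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a"] * 4 + ["b"] * 4

rates = selection_rates(y_pred, groups)
print(rates)
print(demographic_parity_difference(rates))
# A ratio below 0.8 is often flagged under the four-fifths rule of thumb.
print(disparate_impact_ratio(rates))
```

Equalized odds and calibration need the true labels as well, so they would be computed from per-group confusion matrices rather than from predictions alone.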

### 4. Robustness Testing
Does the model handle unexpected inputs gracefully?

- **Edge cases**: Boundary conditions, unusual inputs
- **Noise tolerance**: Performance degradation under noisy data
- **Adversarial inputs**: Deliberately crafted failure cases
- **Distribution shift**: Inputs different from training data (see [model drift](/ai-model-drift))
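One way to measure noise tolerance is to perturb inputs with increasing amounts of random noise and track how accuracy degrades. The sketch below uses a hypothetical one-feature `toy_model` as a stand-in; in practice you would pass your own prediction function and a noise model that matches your data.

```python
import random


def toy_model(x):
    """Hypothetical stand-in model: predicts 1 when the feature exceeds 0.5."""
    return 1 if x > 0.5 else 0


def accuracy(model, xs, ys):
    """Share of inputs for which the model's prediction matches the label."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)


def accuracy_under_noise(model, xs, ys, sigma, seed=0):
    """Accuracy after adding zero-mean Gaussian noise with std dev sigma to each input."""
    rng = random.Random(seed)  # seeded so the test is reproducible
    noisy = [x + rng.gauss(0, sigma) for x in xs]
    return accuracy(model, noisy, ys)


# Toy evaluation set (assumed): labels come from the clean model, so clean accuracy is 1.0.
xs = [i / 100 for i in range(100)]
ys = [toy_model(x) for x in xs]

for sigma in (0.0, 0.05, 0.2):
    print(sigma, accuracy_under_noise(toy_model, xs, ys, sigma))
```

A robustness test can then assert that accuracy at a given noise level stays above a floor you choose, rather than only checking clean-data accuracy.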

### 5. Safety Testing
Does the model avoid harmful outputs?

- **[Hallucination](/ai-hallucinations) testing**: Generate outputs and verify their factual accuracy
- **Toxicity testing**: Probe for harmful content generation
- **[Prompt injection](/ai-prompt-injection)**: Test resistance to manipulation
- **Information leakage**: Check for PII/PHI exposure
- **Policy compliance**: Verify adherence to business rules

## Testing Methodologies

### Holdout Validation
Reserve data never seen during training for final evaluation.
- **Train/validation/test split**: Standard approach
- **Time-based splits**: When temporal patterns matter
- **Stratified sampling**: Ensure subpopulations are represented
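The stratified and time-based splits above can be sketched with the standard library alone. The function names are assumptions for illustration; libraries such as scikit-learn provide tested equivalents that are usually preferable in real projects.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each label sampled proportionally into the test set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # At least one test example per label; a label with a single example
        # therefore ends up only in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def time_based_split(timestamps, cutoff):
    """Train on everything before the cutoff, test on everything at or after it."""
    train = [i for i, t in enumerate(timestamps) if t < cutoff]
    test = [i for i, t in enumerate(timestamps) if t >= cutoff]
    return train, test


# Toy data (assumed): an imbalanced label set and simple integer timestamps.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))

print(time_based_split([1, 2, 3, 4, 5], cutoff=4))
```

The time-based split never shuffles, so the test set contains only data from after the training period, which mirrors how the model will be used in production.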

### Cross-Validation
Multiple train/test splits for more robust estimates.
- K-fold: Rotate through data partitions
- Leave-one-out: Each sample serves as the test set once; maximizes training data per split but is expensive on large datasets
- Useful when data is limited
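A minimal sketch of how k-fold partitions rotate, using a hypothetical `k_fold_indices` generator. It does not shuffle; in practice you would usually shuffle (or stratify) the indices first unless the data is ordered in time.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs so every sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)

# Leave-one-out is the special case k == n_samples.
loo_folds = list(k_fold_indices(4, 4))
print(len(loo_folds))
```

Averaging the metric over all folds, and reporting its spread, gives a more stable estimate than a single split when data is limited.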

### A/B Testing
Compare models in production with real users.
- Random assignment to treatment/control
- Measure business outcomes, not just predictions
- Statistical significance testing
- Understand user experience impact
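For a binary business outcome such as conversion, one common significance check is a two-proportion z-test. The sketch below assumes independent users, random assignment, and samples large enough for the normal approximation; the counts are made up for illustration.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in success rates, B minus A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: control model A vs. candidate model B.
z, p_value = two_proportion_z_test(success_a=100, n_a=1000, success_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Decide the sample size and the significance level before the experiment starts; repeatedly checking the p-value and stopping at the first significant result inflates the false positive rate.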

### Shadow Deployment
Run new models in parallel without affecting production.
- Compare predictions to current system
- Identify differences before full deployment
- Build confidence without user risk
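In code, a shadow comparison can be as simple as calling both models on the same inputs, returning only the production result to users, and logging where the candidate disagrees. The two lambda models below are hypothetical stand-ins.

```python
def shadow_compare(inputs, production_model, candidate_model):
    """Return (disagreement_rate, disagreements) for two models on the same inputs."""
    disagreements = []
    for x in inputs:
        served = production_model(x)   # this is what users would receive
        shadow = candidate_model(x)    # computed and logged, never served
        if served != shadow:
            disagreements.append({"input": x, "production": served, "candidate": shadow})
    rate = len(disagreements) / len(inputs) if inputs else 0.0
    return rate, disagreements


# Hypothetical models that differ only in their decision boundary.
production = lambda x: x > 5
candidate = lambda x: x > 6

rate, diffs = shadow_compare(list(range(10)), production, candidate)
print(rate)
print(diffs)
```

A disagreement is not necessarily a regression; reviewing a sample of the logged differences shows whether the candidate is better, worse, or simply different on those inputs.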

### [Red-Teaming](/ai-red-teaming)
Adversarial testing by dedicated testers.
- Deliberate attempts to break the model
- Probe for safety and security vulnerabilities
- Find failures automated tests miss

### [Adversarial Testing](/ai-adversarial-testing)
Systematic evaluation against crafted failure cases.
- Perturbation attacks (small input changes → wrong outputs)
- Prompt injection attempts
- Boundary condition exploration
- Semantic-preserving transformations
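Semantic-preserving transformations lend themselves to invariance tests: apply a change that should not alter the meaning and check that the prediction stays the same. The sketch below uses a hypothetical keyword-based `toy_sentiment` model that is deliberately case-sensitive, so the test has a failure to find.

```python
def toy_sentiment(text):
    """Hypothetical keyword model; case-sensitive on purpose to expose a failure."""
    return "positive" if "great" in text else "negative"


# Transformations that should not change the meaning of the text.
TRANSFORMS = {
    "uppercase": str.upper,
    "extra_spaces": lambda t: "  " + t.replace(" ", "  ") + "  ",
    "trailing_punct": lambda t: t + "!",
}


def invariance_failures(model, texts, transforms):
    """Return (text, transform_name) pairs where the prediction changed."""
    failures = []
    for text in texts:
        baseline = model(text)
        for name, transform in transforms.items():
            if model(transform(text)) != baseline:
                failures.append((text, name))
    return failures


texts = ["this product is great", "this product is bad"]
print(invariance_failures(toy_sentiment, texts, TRANSFORMS))
```

Each reported pair is a concrete, reproducible failure case that can be added to a regression suite once the model is fixed.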

## LLM-Specific Testing

Large language models require specialized testing approaches:

### Evaluation Sets
- **Domain-specific prompts**: Test on realistic queries from your use case
- **Edge case libraries**: Curated difficult/tricky inputs
- **Golden answers**: Human-validated expected outputs

### Quality Metrics
- **Faithfulness**: Does output match source context?
- **Groundedness**: Is output supported by retrieved documents?
- **Answer relevance**: Does response address the question?
- **Coherence**: Is output internally consistent?
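Groundedness is usually scored with an entailment model or an LLM judge. As a deliberately crude illustration of the idea only, the sketch below flags answer sentences whose words barely overlap with the retrieved context. The function names and the 0.6 cutoff are assumptions, and lexical overlap will miss paraphrases and contradictions.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support_score(sentence, context):
    """Fraction of the sentence's tokens that also appear in the context."""
    sentence_tokens = tokens(sentence)
    if not sentence_tokens:
        return 1.0
    return len(sentence_tokens & tokens(context)) / len(sentence_tokens)


def unsupported_sentences(answer, context, min_support=0.6):
    """Return answer sentences whose lexical support falls below the cutoff."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if support_score(s, context) < min_support]


# Toy example (assumed): the second sentence is not backed by the context.
context = "The warranty covers parts and labor for two years."
answer = "The warranty covers parts and labor. It also includes free shipping worldwide."
print(unsupported_sentences(answer, context))
```

The structure carries over to stronger scorers: split the answer into claims, score each claim against the retrieved documents, and report the claims that fall below a threshold.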

### Safety Probes
- **Jailbreak attempts**: Can safety measures be bypassed?
- **Refusal testing**: Does the model appropriately decline harmful requests?
- **Bias probes**: Does the model show demographic disparities?
- **Sensitive topic handling**: Does the model respond appropriately to sensitive topics?

## Testing Best Practices

### Automate Testing
- Build continuous integration pipelines for models
- Run tests on every model change
- Fail deployments that don't meet thresholds
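A deployment gate can be sketched as a function that compares evaluation metrics against agreed thresholds and reports every violation. The metric names, limits, and the `("min" | "max", limit)` format below are assumptions chosen for illustration.

```python
def check_thresholds(metrics, thresholds):
    """Return a list of violation messages; an empty list means the gate passes."""
    violations = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A missing metric fails the gate rather than passing silently.
            violations.append(f"{name}: metric missing")
        elif kind == "min" and value < limit:
            violations.append(f"{name}: {value} is below minimum {limit}")
        elif kind == "max" and value > limit:
            violations.append(f"{name}: {value} is above maximum {limit}")
    return violations


# Hypothetical thresholds covering accuracy and a fairness constraint.
thresholds = {
    "accuracy": ("min", 0.90),
    "demographic_parity_difference": ("max", 0.10),
}
metrics = {"accuracy": 0.93, "demographic_parity_difference": 0.15}

violations = check_thresholds(metrics, thresholds)
for message in violations:
    print("FAIL", message)
```

In a CI job, the script would exit with a non-zero status whenever `violations` is non-empty, which is what blocks the deployment step.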

### Version Test Data
- Track what data was used for testing
- Enable reproducibility
- Maintain test set integrity over time

### Test in Realistic Conditions
- Use production-like data distributions
- Include realistic noise and variability
- Test at production scale

### Define Clear Thresholds
- Set minimum acceptable performance levels
- Define fairness constraints
- Establish safety requirements

### Test Continuously
- Production monitoring is ongoing testing
- Detect drift and degradation over time
- Don't rely solely on pre-deployment validation
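One widely used drift signal is the population stability index (PSI), which compares the binned distribution of a feature or score in production against a reference sample. This is a minimal sketch with equal-width bins taken from the reference data; the bin count and the smoothing constant are assumptions.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a production sample (actual)."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    low, high = min(expected), max(expected)
    width = (high - low) / n_bins or 1.0  # avoid zero width for constant data

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range fall into the first or last bin.
            index = min(max(int((v - low) / width), 0), n_bins - 1)
            counts[index] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Toy data (assumed): production values shifted toward the upper half of the range.
reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]

print(population_stability_index(reference, reference))
print(population_stability_index(reference, shifted))
```

Identical samples give a PSI of zero and larger values mean a larger shift. Commonly cited rules of thumb treat values above roughly 0.1 as worth investigating and above 0.25 as significant, but alert thresholds should be tuned to your own data.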

Testing informs the policies that [AI supervision](/ai-supervision) enforces. Pre-deployment testing defines what behavior is acceptable; supervision ensures models stay within those boundaries in production. See also: [Understanding ML Model Testing Beyond Unit Tests](/post/beyond-unit-tests-level-up-your-ai-testing-strategy-variant-and-invariant-testing-explained).

## How Swept AI Enables ML Testing

Swept AI provides comprehensive testing capabilities:

- **[Evaluate](/product/evaluate)**: Pre-deployment testing across accuracy, fairness, robustness, and safety dimensions. Automated evaluation pipelines with customizable metrics and thresholds.

- **Red-team testing**: Adversarial probes for security vulnerabilities, prompt injection resistance, and safety boundary testing.

- **Distribution mapping**: Understand model behavior across input distributions, not just average performance.

- **[Supervise](/product/supervise)**: Continuous production testing through monitoring. Detect when real-world performance diverges from test results.

Testing is the difference between models that work in demos and models that work in production.