# Which Functions are Used for Model Evaluation?

_Model evaluation functions measure how well ML models perform their intended tasks. Understanding these metrics is essential for building and maintaining reliable AI systems._

Evaluation functions are the foundation of model quality assessment: without them, you're guessing about performance. They are used throughout [ML model testing](/ml-model-testing) and inform ongoing [AI model performance](/ai-model-performance) measurement. For agent-specific evaluation, see [AI agent evaluation](/ai-agent-evaluation).

Why evaluation matters: Models that look good in development can fail in production. Evaluation functions quantify performance, reveal weaknesses, and guide improvement. They're how you know whether a model is ready for deployment and how you detect when a deployed model is [degrading](/model-degradation).

## Classification Metrics

Classification models assign inputs to categories. Evaluation compares predicted categories to actual categories.

### The Confusion Matrix

Most threshold-based classification metrics derive from the four possible outcomes of a binary prediction:

- **True Positive (TP)**: Correctly predicted positive
- **True Negative (TN)**: Correctly predicted negative
- **False Positive (FP)**: Incorrectly predicted positive (Type I error)
- **False Negative (FN)**: Incorrectly predicted negative (Type II error)

Different metrics weight these outcomes differently based on what matters for your use case.
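
As a minimal sketch of how the four outcomes are counted, the snippet below tallies them from two label lists. The function name `confusion_counts` and the sample labels are illustrative assumptions, not part of any particular library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for binary labels (illustrative helper)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative (Type I)
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive (Type II)
            else:
                tn += 1  # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Made-up labels for demonstration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```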

### Core Metrics

**Accuracy**
- What it measures: Overall percentage correct
- Formula: (TP + TN) / (TP + TN + FP + FN)
- When to use: Balanced datasets where all errors cost the same
- Limitation: Misleading with imbalanced data

**Precision**
- What it measures: Of positive predictions, how many were right?
- Formula: TP / (TP + FP)
- When to use: When false positives are costly
- Example: Spam detection (a false positive sends a legitimate email to the spam folder)

**Recall (Sensitivity, True Positive Rate)**
- What it measures: Of actual positives, how many were caught?
- Formula: TP / (TP + FN)
- When to use: When false negatives are costly
- Example: Medical diagnosis (false negatives miss disease)

**F1 Score**
- What it measures: Harmonic mean of precision and recall
- Formula: 2 × (Precision × Recall) / (Precision + Recall)
- When to use: When you need to balance precision and recall
- Limitation: Assumes equal importance of precision and recall

**Specificity (True Negative Rate)**
- What it measures: Of actual negatives, how many were correctly identified?
- Formula: TN / (TN + FP)
- When to use: When correctly identifying negatives matters
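
The formulas above can be sketched directly from the four confusion-matrix counts. This is an illustrative helper, not a library API; returning `0.0` when a denominator is zero is one common convention and an assumption here, and the counts are made up.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the core metrics from confusion-matrix counts."""

    def safe_div(numerator, denominator):
        # Assumption: report 0.0 when a metric is undefined (zero denominator)
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": safe_div(tn, tn + fp),
    }


# Made-up counts: 100 examples, 50 actual positives, 50 actual negatives
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.85 while recall is 0.80 and specificity is 0.90, which shows how a single accuracy figure can hide the difference between error types.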

### Threshold-Independent Metrics

**AUC-ROC (Area Under ROC Curve)**
- What it measures: Model's ability to distinguish classes across all thresholds
- Range: 0 to 1, where 0.5 is random guessing and 1.0 is perfect; values below 0.5 mean the model ranks classes worse than chance
- When to use: Comparing models or when operating threshold isn't fixed
- Limitation: Can be optimistic with imbalanced data

**AUC-PR (Area Under Precision-Recall Curve)**
- What it measures: Precision-recall tradeoff across thresholds
- When to use: Imbalanced datasets where positives are rare
- Advantage: More informative than ROC for rare events
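
AUC-ROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way by comparing every positive/negative pair. It is quadratic in the number of examples, so treat it as an illustration rather than a production implementation; the function name and sample scores are assumptions.

```python
def roc_auc(y_true, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one example of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up scores for demonstration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
inverted_auc = roc_auc(y_true, [-s for s in scores])
print(auc, inverted_auc)  # 0.75 0.25
```

No threshold appears anywhere in the calculation, which is why the metric is threshold-independent: only the ordering of scores matters.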

### Multi-Class Extensions

For more than two classes:
- **Macro-averaged**: Average metric across classes (treats all classes equally)
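- **Micro-averaged**: Aggregate TP/TN/FP/FN across classes before computing the metric (every sample counts equally, so frequent classes dominate)
- **Weighted**: Average the per-class metric, weighting each class by its frequency in the ground truth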
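
To make the difference concrete, here is a minimal sketch that computes precision under all three averaging schemes for a small three-class example. The helper name `precision_averages`, the class labels, and the convention of scoring a never-predicted class as `0.0` are assumptions for illustration.

```python
from collections import Counter


def precision_averages(y_true, y_pred):
    """Macro, micro, and weighted precision for multi-class labels."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp = Counter(), Counter()
    for actual, predicted in zip(y_true, y_pred):
        if actual == predicted:
            tp[predicted] += 1
        else:
            fp[predicted] += 1  # counted against the predicted class
    support = Counter(y_true)   # class frequency in the ground truth

    def safe_div(numerator, denominator):
        # Assumption: a class that is never predicted gets precision 0.0
        return numerator / denominator if denominator else 0.0

    per_class = {c: safe_div(tp[c], tp[c] + fp[c]) for c in classes}
    total_tp, total_fp = sum(tp.values()), sum(fp.values())
    return {
        "macro": sum(per_class.values()) / len(classes),
        "micro": safe_div(total_tp, total_tp + total_fp),
        "weighted": sum(per_class[c] * support[c] for c in classes) / len(y_true),
    }


# Made-up imbalanced labels: 6 cats, 3 dogs, 1 bird
y_true = ["cat"] * 6 + ["dog"] * 3 + ["bird"]
y_pred = ["cat"] * 5 + ["dog"] * 3 + ["cat"] * 2
averages = precision_averages(y_true, y_pred)
for name, value in averages.items():
    print(f"{name}: {value:.3f}")
```

Macro precision comes out lowest here (about 0.46) because the missed `bird` class counts as much as the frequent `cat` class, while micro precision (0.70) is dominated by the frequent classes.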

## Regression Metrics

Regression models predict continuous values. Evaluation measures the difference between predictions and actual values.

### Error Metrics

**Mean Absolute Error (MAE)**
- What it measures: Average absolute difference between prediction and actual
- Properties: Linear penalty, more robust to outliers than MSE or RMSE
- Interpretation: On average, predictions are off by X units

**Mean Squared Error (MSE)**
- What it measures: Average squared difference
- Properties: Quadratic penalty, penalizes large errors more
- Limitation: Sensitive to outliers

**Root Mean Squared Error (RMSE)**
- What it measures: Square root of MSE
- Advantage: Same units as target variable
- When to use: When large errors are particularly costly

**Mean Absolute Percentage Error (MAPE)**
- What it measures: Average percentage error
- Advantage: Scale-independent, interpretable
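- Limitation: Undefined when actual values are zero, inflated when actuals are near zero, and asymmetric (for non-negative targets and predictions, an under-prediction can never exceed 100% error, while an over-prediction can)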
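
The four error metrics can be sketched in a few lines of plain Python. The function name and sample values are illustrative assumptions; raising an error when an actual value is zero is one way to handle the MAPE limitation, not the only one.

```python
import math


def regression_errors(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and MAPE (in percent)."""
    if any(actual == 0 for actual in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    n = len(y_true)
    errors = [pred - actual for actual, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mape = 100 * sum(abs(e / actual) for e, actual in zip(errors, y_true)) / n
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "mape": mape}


# Made-up values for demonstration only
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 330, 360]
errors = regression_errors(y_true, y_pred)
for name, value in errors.items():
    print(f"{name}: {value:.2f}")
```

Here MAE is 22.5 while RMSE is about 26.0: the single 40-unit miss pulls RMSE up more than MAE, which is the quadratic penalty at work.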

### Goodness of Fit

**R-squared (Coefficient of Determination)**
- What it measures: Variance explained by the model
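- Range: At most 1; usually between 0 and 1, but negative when the model fits worse than always predicting the mean
- Interpretation: 0.8 means the model explains 80% of the variance in the target
- Limitation: In linear regression, training R-squared never decreases when features are added, so irrelevant features can inflate it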

**Adjusted R-squared**
- What it measures: R-squared penalized for number of features
- Advantage: Accounts for model complexity
- When to use: Comparing models with different numbers of features
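
A short sketch of both quantities, using the standard definitions R² = 1 − SS_res / SS_tot and adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1), where n is the number of samples and p the number of features. The function names and data are illustrative assumptions.

```python
def r_squared(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R-squared for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Made-up values for demonstration only
y_true = [100, 200, 300, 400]
good_pred = [110, 190, 330, 360]
bad_pred = [400, 300, 200, 100]  # worse than predicting the mean

r2_good = r_squared(y_true, good_pred)
r2_adj = adjusted_r_squared(r2_good, n_samples=4, n_features=1)
r2_bad = r_squared(y_true, bad_pred)
print(round(r2_good, 3), round(r2_adj, 3), round(r2_bad, 3))  # 0.946 0.919 -3.0
```

The second prediction set shows how R-squared goes negative: its squared error is larger than the error from simply predicting the mean for every sample.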

## LLM and Generative Model Metrics

Classification and regression metrics rarely fit open-ended text generation, where many different outputs can be acceptable. LLM evaluation requires specialized approaches.

### Output Quality Metrics

**Groundedness**
- What it measures: Is output supported by provided context?
- Method: Verify claims against source material
- Importance: Critical for RAG applications

**Faithfulness**
- What it measures: Does output accurately reflect source material?
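- Difference from groundedness: Groundedness asks whether claims have support in the context; faithfulness asks whether the output represents that material without distortion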
- Method: Check for distortions or misrepresentations

**Relevance**
- What it measures: Does output address the query?
- Challenge: Subjective, context-dependent
- Method: Human evaluation or learned relevance models

**Coherence**
- What it measures: Is output internally consistent and logical?
- Method: Check for contradictions, logical flow

### [Hallucination](/ai-hallucinations) Metrics

**Factual Accuracy**
- What it measures: Are stated facts correct?
- Method: Verify against knowledge bases
- Challenge: Requires up-to-date fact checking

**Citation Accuracy**
- What it measures: Do citations exist and say what's claimed?
- Method: Retrieve and compare cited sources
- Importance: Critical for research applications

### Safety Metrics

**Toxicity Rate**
- What it measures: Frequency of harmful outputs
- Method: Toxicity classifiers, human review

**Policy Violation Rate**
- What it measures: How often outputs violate defined policies
- Method: Policy classifiers, rule-based checks
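
As a rough sketch of the rule-based approach, a violation rate is the share of outputs that trip at least one rule. The two rules below (an email-address pattern and a banned phrase) and the sample outputs are purely hypothetical; real policies would come from your own requirements and would usually combine rules with classifiers and human review.

```python
import re

# Hypothetical policy rules, for illustration only
POLICY_RULES = {
    "contains_email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "promises_returns": re.compile(r"guaranteed returns", re.IGNORECASE),
}


def violated_rules(output):
    """Return the names of all rules the output matches."""
    return [name for name, pattern in POLICY_RULES.items() if pattern.search(output)]


def policy_violation_rate(outputs):
    """Fraction of outputs that violate at least one rule."""
    if not outputs:
        return 0.0
    flagged = sum(1 for text in outputs if violated_rules(text))
    return flagged / len(outputs)


# Made-up model outputs
outputs = [
    "Contact me at jane.doe@example.com for details.",
    "This fund offers guaranteed returns every year.",
    "Past performance does not predict future results.",
    "Please reach out through the support portal.",
]
rate = policy_violation_rate(outputs)
print(rate)  # 0.5
```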

## Evaluation Challenges

### Delayed Ground Truth

Many real-world outcomes take time to materialize:
- Loan defaults take months or years
- Customer lifetime value accumulates over years
- Medical outcomes may never be definitively labeled

**Strategies**:
- Use proxy metrics that correlate with eventual outcomes
- Monitor [drift](/ai-model-drift) as an early warning
- Track prediction distribution changes
- Incorporate partial or provisional labels
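
One way to track prediction distribution changes without labels is the Population Stability Index (PSI), which compares the binned distribution of recent scores against a baseline. PSI is a choice made here for illustration; the text above does not prescribe a specific statistic. The sketch assumes scores lie in [0, 1], uses equal-width bins, and floors empty bins at a small `eps` to avoid taking the log of zero.

```python
import math


def population_stability_index(baseline, recent, bins=5, eps=1e-6):
    """PSI between two sets of scores in [0, 1], using equal-width bins."""

    def proportions(values):
        counts = [0] * bins
        for v in values:
            index = min(int(v * bins), bins - 1)  # put a score of 1.0 in the last bin
            counts[index] += 1
        # Floor each proportion at eps so empty bins do not break the log
        return [max(c / len(values), eps) for c in counts]

    expected = proportions(baseline)
    actual = proportions(recent)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# Made-up prediction scores for demonstration only
baseline_scores = [0.05, 0.12, 0.18, 0.22, 0.31, 0.35, 0.48, 0.52, 0.67, 0.81]
recent_scores = [0.45, 0.52, 0.58, 0.61, 0.66, 0.72, 0.78, 0.83, 0.88, 0.95]

psi_same = population_stability_index(baseline_scores, baseline_scores)
psi_shifted = population_stability_index(baseline_scores, recent_scores)
print(psi_same, psi_shifted)
```

Identical distributions give a PSI of zero, and the value grows as the distributions diverge. What counts as an alarming value depends on the application and should be calibrated against your own history.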

### Class Imbalance

Rare events are hard to evaluate:
- Accuracy can be meaningless: if 1% of transactions are fraudulent, a model that never predicts fraud is 99% accurate
- Aggregate metrics can hide poor performance on the rare class

**Strategies**:
- Use precision-recall metrics instead of accuracy
- Track per-class performance separately
- Apply appropriate sampling for evaluation
- Weight metrics by business importance
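
The accuracy problem is easy to reproduce. The sketch below assumes a made-up dataset of 1,000 transactions with a 1% fraud rate and a "model" that never predicts fraud.

```python
# Made-up data: 10 fraudulent transactions (label 1) out of 1,000
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a model that never predicts fraud

correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
accuracy = correct / len(y_true)

true_positives = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.99
print(f"recall:   {recall:.2f}")    # recall:   0.00
```

Accuracy looks excellent while recall on the fraud class is zero, which is why per-class and precision-recall metrics are the safer choice for rare events.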

### Distribution Shifts

Models are trained on one data distribution but often face a different one in production:
- Cross-validation can underestimate real-world error because every fold is drawn from the training distribution
- Historical evaluation may not predict future performance

**Strategies**:
- Use time-based splits for temporal data
- Evaluate on out-of-distribution samples
- Monitor production performance continuously
- Compare evaluation and production metrics
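
A time-based split can be sketched as follows: every training record precedes the cutoff and every test record falls on or after it, so the model is always evaluated on data from its "future". The record layout (a dict with `date` and `label` keys) and the cutoff are hypothetical.

```python
from datetime import date


def time_based_split(records, cutoff):
    """Split records so training data strictly precedes the cutoff date."""
    ordered = sorted(records, key=lambda r: r["date"])
    train = [r for r in ordered if r["date"] < cutoff]
    test = [r for r in ordered if r["date"] >= cutoff]
    return train, test


# Made-up records for demonstration only
records = [
    {"date": date(2024, 3, 15), "label": 0},
    {"date": date(2024, 1, 10), "label": 1},
    {"date": date(2024, 6, 1), "label": 0},
    {"date": date(2024, 2, 20), "label": 0},
    {"date": date(2024, 5, 5), "label": 1},
]
train, test = time_based_split(records, cutoff=date(2024, 4, 1))
print(len(train), len(test))  # 3 2
```

A random split of the same records would mix later examples into training, letting the model see patterns it would not have had at prediction time.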

Evaluation functions inform your policies. [AI supervision](/ai-supervision) enforces them—using these metrics to determine when models can operate autonomously and when they need intervention.

## How Swept AI Approaches Evaluation

Swept AI provides comprehensive model evaluation:

- **[Evaluate](/product/evaluate)**: Pre-deployment assessment across accuracy, [fairness](/ai-bias-fairness), robustness, and [safety](/ai-safety). Test with the metrics that matter for your use case.

- **Multi-metric tracking**: Monitor multiple evaluation functions simultaneously. Understand tradeoffs between precision and recall, accuracy and fairness.

- **LLM-native evaluation**: Purpose-built metrics for language models including [hallucination](/ai-hallucinations) detection, groundedness scoring, and [safety](/ai-safety) evaluation.

The right evaluation functions tell you whether your model is ready—and whether it stays ready. For robustness evaluation, see [adversarial testing](/ai-adversarial-testing). See also: [Noise Is the Real Test](/post/noise-is-the-real-test-ai-quality-assurance-needs-a-new-foundation).