Model evaluation functions measure how well machine learning models accomplish their intended tasks. They're the foundation of model quality assessment—without them, you're guessing about performance. These functions are used throughout ML model testing and inform ongoing AI model performance measurement. For agent-specific evaluation, see AI agent evaluation.
Why evaluation matters: Models that look good in development can fail in production. Evaluation functions quantify performance, reveal weaknesses, and guide improvement. They're how you know whether a model is ready for deployment and how you detect when a deployed model is degrading.
Classification Metrics
Classification models assign inputs to categories. Evaluation compares predicted categories to actual categories.
The Confusion Matrix
All classification metrics derive from four outcomes:
- True Positive (TP): Correctly predicted positive
- True Negative (TN): Correctly predicted negative
- False Positive (FP): Incorrectly predicted positive (Type I error)
- False Negative (FN): Incorrectly predicted negative (Type II error)
Different metrics weight these outcomes differently based on what matters for your use case.
Core Metrics
Accuracy
- What it measures: Overall percentage correct
- Formula: (TP + TN) / (TP + TN + FP + FN)
- When to use: Balanced datasets where all errors cost the same
- Limitation: Misleading with imbalanced data
Precision
- What it measures: Of positive predictions, how many were right?
- Formula: TP / (TP + FP)
- When to use: When false positives are costly
- Example: Spam filtering (a false positive sends a legitimate email to the spam folder)
Recall (Sensitivity, True Positive Rate)
- What it measures: Of actual positives, how many were caught?
- Formula: TP / (TP + FN)
- When to use: When false negatives are costly
- Example: Medical diagnosis (false negatives miss disease)
F1 Score
- What it measures: Harmonic mean of precision and recall
- Formula: 2 × (Precision × Recall) / (Precision + Recall)
- When to use: When you need to balance precision and recall
- Limitation: Assumes equal importance of precision and recall
Specificity (True Negative Rate)
- What it measures: Of actual negatives, how many were correctly identified?
- Formula: TN / (TN + FP)
- When to use: When correctly identifying negatives matters
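To make the formulas above concrete, here is a minimal sketch in plain Python that computes the core metrics from the four confusion-matrix counts (the guard clauses avoid division by zero when a denominator is empty):

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Core classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "specificity": specificity}

# Example: a classifier with 90 TP, 850 TN, 50 FP, 10 FN
print(classification_metrics(tp=90, tn=850, fp=50, fn=10))
```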
Threshold-Independent Metrics
AUC-ROC (Area Under ROC Curve)
- What it measures: Model's ability to distinguish classes across all thresholds
- Range: 0.5 (random ranking) to 1.0 (perfect); scores below 0.5 indicate worse-than-random ranking
- When to use: Comparing models or when operating threshold isn't fixed
- Limitation: Can be optimistic with imbalanced data
AUC-PR (Area Under Precision-Recall Curve)
- What it measures: Precision-recall tradeoff across thresholds
- When to use: Imbalanced datasets where positives are rare
- Advantage: More informative than ROC for rare events
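As a short sketch (assuming scikit-learn is installed), both areas are computed from raw model scores rather than hard predictions, which is what makes them threshold-independent:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1]                          # ground-truth labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.05, 0.3, 0.15, 0.7]  # predicted scores

print("AUC-ROC:", roc_auc_score(y_true, y_score))
# average_precision_score summarizes the precision-recall curve (AUC-PR)
print("AUC-PR: ", average_precision_score(y_true, y_score))
```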
Multi-Class Extensions
For more than two classes:
- Macro-averaged: Average metric across classes (treats all classes equally)
- Micro-averaged: Aggregate the confusion-matrix counts across all classes, then compute the metric once (effectively weights by class frequency)
- Weighted: Average the per-class metric, weighted by each class's frequency (support) in the ground truth
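A quick illustration (scikit-learn assumed) of how the same multi-class predictions score differently under each averaging scheme:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 1, 2]

for avg in ("macro", "micro", "weighted"):
    print(f"{avg}-averaged F1:", f1_score(y_true, y_pred, average=avg))
```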
Regression Metrics
Regression models predict continuous values. Evaluation measures the difference between predictions and actual values.
Error Metrics
Mean Absolute Error (MAE)
- What it measures: Average absolute difference between prediction and actual
- Properties: Linear penalty, robust to outliers
- Interpretation: On average, predictions are off by X units
Mean Squared Error (MSE)
- What it measures: Average squared difference
- Properties: Quadratic penalty, penalizes large errors more
- Limitation: Sensitive to outliers
Root Mean Squared Error (RMSE)
- What it measures: Square root of MSE
- Advantage: Same units as target variable
- When to use: When large errors are particularly costly
Mean Absolute Percentage Error (MAPE)
- What it measures: Average percentage error
- Advantage: Scale-independent, interpretable
- Limitation: Undefined when actual values are zero; asymmetric (over-predictions can yield unbounded percentage errors while under-predictions are capped at 100%)
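A minimal NumPy sketch of the four error metrics above; the zero-mask for MAPE reflects the limitation that it is undefined when actual values are zero:

```python
import numpy as np

def regression_errors(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))                        # linear penalty
    mse = np.mean(err ** 2)                           # quadratic penalty
    rmse = np.sqrt(mse)                               # same units as the target
    nonzero = y_true != 0                             # MAPE is undefined at zero actuals
    mape = np.mean(np.abs(err[nonzero] / y_true[nonzero])) * 100
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE%": mape}

print(regression_errors([100, 150, 200, 250], [110, 140, 210, 240]))
```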
Goodness of Fit
R-squared (Coefficient of Determination)
- What it measures: Variance explained by the model
- Range: Typically 0 to 1; can be negative when the model performs worse than simply predicting the mean
- Interpretation: 0.8 means model explains 80% of variance
- Limitation: Can be gamed by adding features
Adjusted R-squared
- What it measures: R-squared penalized for number of features
- Advantage: Accounts for model complexity
- When to use: Comparing models with different numbers of features
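As a sketch, R-squared and its adjusted variant can be computed directly from residuals; here n is the number of observations and p the number of features:

```python
import numpy as np

def r_squared(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y_true, y_pred, p):
    n, r2 = len(y_true), r_squared(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)        # penalty grows with p

print(adjusted_r_squared([3.0, 4.5, 6.1, 7.9], [3.2, 4.4, 6.0, 8.1], p=1))
```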
LLM and Generative Model Metrics
Traditional metrics don't apply well to open-ended text generation, where there is no single correct output to compare against. LLM evaluation requires specialized approaches.
Output Quality Metrics
Groundedness
- What it measures: Is output supported by provided context?
- Method: Verify claims against source material
- Importance: Critical for RAG applications
Faithfulness
- What it measures: Does output accurately reflect source material?
- Difference from groundedness: Focuses on whether the source is represented accurately, not just on whether each claim has some support
- Method: Check for distortions or misrepresentations
Relevance
- What it measures: Does output address the query?
- Challenge: Subjective, context-dependent
- Method: Human evaluation or learned relevance models
Coherence
- What it measures: Is output internally consistent and logical?
- Method: Check for contradictions, logical flow
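One common implementation pattern for these quality metrics is LLM-as-judge scoring. The sketch below is illustrative only: `call_judge_model` is a hypothetical stand-in for whatever LLM client you use, and the 1-5 rubric is an assumption rather than a standard.

```python
GROUNDEDNESS_PROMPT = """Rate from 1 (unsupported) to 5 (fully supported) how well
the ANSWER is supported by the CONTEXT. Reply with a single integer.

CONTEXT:
{context}

ANSWER:
{answer}
"""

def groundedness_score(context: str, answer: str, call_judge_model) -> int:
    # call_judge_model is a hypothetical LLM call: prompt in, text reply out
    reply = call_judge_model(GROUNDEDNESS_PROMPT.format(context=context, answer=answer))
    return int(reply.strip())  # assumes the judge follows the single-integer format
```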
Hallucination Metrics
Factual Accuracy
- What it measures: Are stated facts correct?
- Method: Verify against knowledge bases
- Challenge: Requires up-to-date fact checking
Citation Accuracy
- What it measures: Do citations exist and say what's claimed?
- Method: Retrieve and compare cited sources
- Importance: Critical for research applications
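A toy sketch of a citation-accuracy check under simplifying assumptions: `corpus` is a hypothetical mapping from source IDs to their text, and exact substring matching stands in for the semantic comparison a real check would need.

```python
def citation_accuracy(citations: list[tuple[str, str]], corpus: dict[str, str]) -> float:
    """Fraction of (claim, source_id) pairs where the source exists and contains the claim."""
    if not citations:
        return 1.0
    supported = 0
    for claim, source_id in citations:
        source_text = corpus.get(source_id, "")        # missing source counts as unsupported
        if source_text and claim.lower() in source_text.lower():
            supported += 1
    return supported / len(citations)
```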
Safety Metrics
Toxicity Rate
- What it measures: Frequency of harmful outputs
- Method: Toxicity classifiers, human review
Policy Violation Rate
- What it measures: How often outputs violate defined policies
- Method: Policy classifiers, rule-based checks
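Both safety metrics reduce to a flagged-output rate over a batch of outputs. A minimal sketch, assuming you already have a scoring function: `score_toxicity` is a placeholder for your classifier, and the 0.5 threshold is an arbitrary choice.

```python
def toxicity_rate(outputs: list[str], score_toxicity, threshold: float = 0.5) -> float:
    """Fraction of outputs whose toxicity score meets or exceeds the threshold."""
    if not outputs:
        return 0.0
    flagged = sum(1 for text in outputs if score_toxicity(text) >= threshold)
    return flagged / len(outputs)
```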
Evaluation Challenges
Delayed Ground Truth
Many real-world outcomes take time to materialize:
- Loan defaults take months or years
- Customer lifetime value accumulates over years
- Medical outcomes may never be definitively labeled
Strategies:
- Use proxy metrics that correlate with eventual outcomes
- Monitor drift as an early warning
- Track prediction distribution changes
- Incorporate partial or provisional labels
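One common early-warning signal for the strategies above is the Population Stability Index (PSI) over prediction scores, which flags distribution changes before labels arrive. A sketch with NumPy; the decile bucketing and the commonly cited 0.1/0.25 alert thresholds are conventions, not hard rules.

```python
import numpy as np

def psi(expected_scores, observed_scores, n_bins: int = 10) -> float:
    """Population Stability Index between a baseline and a recent score distribution."""
    edges = np.quantile(expected_scores, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                 # cover the full score range
    e_counts, _ = np.histogram(expected_scores, bins=edges)
    o_counts, _ = np.histogram(observed_scores, bins=edges)
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    o_pct = np.clip(o_counts / o_counts.sum(), 1e-6, None)
    return float(np.sum((o_pct - e_pct) * np.log(o_pct / e_pct)))
```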
Class Imbalance
Rare events are hard to evaluate:
- Accuracy is meaningless (99% accurate by never predicting fraud)
- Standard metrics can be misleading
Strategies:
- Use precision-recall metrics instead of accuracy
- Track per-class performance separately
- Apply appropriate sampling for evaluation
- Weight metrics by business importance
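A short illustration (scikit-learn assumed) of per-class tracking on a 5%-positive dataset, where the rare class's precision and recall stay visible instead of being averaged away:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0] * 95 + [1] * 5                      # 5% positive class
y_pred = [0] * 93 + [1] * 2 + [0] * 3 + [1] * 2  # a few mistakes on both classes

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1], zero_division=0
)
print("per-class precision:", precision)
print("per-class recall:   ", recall)
```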
Distribution Shifts
Models trained on one distribution, evaluated on another:
- Cross-validation can underestimate real-world error
- Historical evaluation may not predict future performance
Strategies:
- Use time-based splits for temporal data
- Evaluate on out-of-distribution samples
- Monitor production performance continuously
- Compare evaluation and production metrics
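A sketch of time-based evaluation (scikit-learn assumed, with synthetic data): each fold trains on the past and evaluates on the future, which matches how the model will actually be used better than a random split does.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                              # time-ordered features
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=500)

for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])        # fit on the past
    mae = mean_absolute_error(y[test_idx], model.predict(X[test_idx]))
    print("fold MAE:", round(mae, 4))                      # evaluate on the future
```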
Evaluation functions inform your policies. AI supervision enforces them—using these metrics to determine when models can operate autonomously and when they need intervention.
How Swept AI Approaches Evaluation
Swept AI provides comprehensive model evaluation:
- Evaluate: Pre-deployment assessment across accuracy, fairness, robustness, and safety. Test with the metrics that matter for your use case.
- Multi-metric tracking: Monitor multiple evaluation functions simultaneously. Understand tradeoffs between precision and recall, accuracy and fairness.
- LLM-native evaluation: Purpose-built metrics for language models including hallucination detection, groundedness scoring, and safety evaluation.
The right evaluation functions tell you whether your model is ready—and whether it stays ready. For robustness evaluation, see adversarial testing. See also: Noise Is the Real Test.
Evaluation Function FAQs
What are model evaluation functions?
Mathematical functions that measure model performance by comparing predictions to ground truth. Different model types require different evaluation functions.
What is the difference between accuracy and precision?
Accuracy measures overall correct predictions. Precision measures what percentage of positive predictions were actually correct. They answer different questions about model performance.
Why can accuracy be misleading?
Accuracy can be misleading with imbalanced data. A model that always predicts 'not fraud' achieves 99% accuracy when fraud is 1% of cases, but catches zero fraud.
How are LLMs and generative models evaluated?
Traditional metrics don't apply well. LLM evaluation uses groundedness, faithfulness, relevance, coherence, and task-specific quality measures like hallucination rates.
How do you evaluate models when ground truth is delayed?
Use proxy metrics, drift detection, prediction distribution analysis, and user feedback. Monitor what you can while waiting for ground truth to confirm performance.
Should you track more than one metric?
Yes. Single metrics hide important information. Track multiple metrics to understand performance across dimensions, especially metrics that capture different failure modes.