Model evaluation functions measure how well machine learning models accomplish their intended tasks. They're the foundation of model quality assessment—without them, you're guessing about performance. These functions are used throughout ML model testing and inform ongoing AI model performance measurement. For agent-specific evaluation, see AI agent evaluation.
Why evaluation matters: Models that look good in development can fail in production. Evaluation functions quantify performance, reveal weaknesses, and guide improvement. They're how you know whether a model is ready for deployment and how you detect when a deployed model is degrading.
Classification Metrics
Classification models assign inputs to categories. Evaluation compares predicted categories to actual categories.
The Confusion Matrix
All classification metrics derive from four outcomes:
- True Positive (TP): Correctly predicted positive
- True Negative (TN): Correctly predicted negative
- False Positive (FP): Incorrectly predicted positive (Type I error)
- False Negative (FN): Incorrectly predicted negative (Type II error)
Different metrics weight these outcomes differently based on what matters for your use case.
Core Metrics
Accuracy
- What it measures: Overall percentage correct
- Formula: (TP + TN) / (TP + TN + FP + FN)
- When to use: Balanced datasets where all errors cost the same
- Limitation: Misleading with imbalanced data
Precision
- What it measures: Of positive predictions, how many were right?
- Formula: TP / (TP + FP)
- When to use: When false positives are costly
- Example: Spam filtering (a false positive sends a legitimate email to the spam folder)
Recall (Sensitivity, True Positive Rate)
- What it measures: Of actual positives, how many were caught?
- Formula: TP / (TP + FN)
- When to use: When false negatives are costly
- Example: Medical diagnosis (false negatives miss disease)
F1 Score
- What it measures: Harmonic mean of precision and recall
- Formula: 2 × (Precision × Recall) / (Precision + Recall)
- When to use: When you need to balance precision and recall
- Limitation: Assumes equal importance of precision and recall
Specificity (True Negative Rate)
- What it measures: Of actual negatives, how many were correctly identified?
- Formula: TN / (TN + FP)
- When to use: When correctly identifying negatives matters
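To make the formulas above concrete, here is a minimal sketch in plain Python that computes the core metrics from the four confusion-matrix counts (the guard clauses avoid division by zero when a denominator is empty):

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Core classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "specificity": specificity}

# Example: a classifier with 90 TP, 850 TN, 50 FP, 10 FN
print(classification_metrics(tp=90, tn=850, fp=50, fn=10))
```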
Threshold-Independent Metrics
AUC-ROC (Area Under ROC Curve)
- What it measures: Model's ability to distinguish classes across all thresholds
- Range: 0.5 (random ranking) to 1.0 (perfect); scores below 0.5 indicate worse-than-random ranking
- When to use: Comparing models or when operating threshold isn't fixed
- Limitation: Can be optimistic with imbalanced data
AUC-PR (Area Under Precision-Recall Curve)
- What it measures: Precision-recall tradeoff across thresholds
- When to use: Imbalanced datasets where positives are rare
- Advantage: More informative than ROC for rare events
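As a short sketch (assuming scikit-learn is installed), both areas are computed from raw model scores rather than hard predictions, which is what makes them threshold-independent:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1]                          # ground-truth labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.05, 0.3, 0.15, 0.7]  # predicted scores

print("AUC-ROC:", roc_auc_score(y_true, y_score))
# average_precision_score summarizes the precision-recall curve (AUC-PR)
print("AUC-PR: ", average_precision_score(y_true, y_score))
```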
Multi-Class Extensions
For more than two classes:
- Macro-averaged: Average metric across classes (treats all classes equally)
- Micro-averaged: Aggregate the confusion-matrix counts across all classes, then compute the metric once (effectively weights by class frequency)
- Weighted: Average the per-class metric, weighted by each class's frequency (support) in the ground truth
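A quick illustration (scikit-learn assumed) of how the same multi-class predictions score differently under each averaging scheme:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 1, 2]

for avg in ("macro", "micro", "weighted"):
    print(f"{avg}-averaged F1:", f1_score(y_true, y_pred, average=avg))
```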
Regression Metrics
Regression models predict continuous values. Evaluation measures the difference between predictions and actual values.
Error Metrics
Mean Absolute Error (MAE)
- What it measures: Average absolute difference between prediction and actual
- Properties: Linear penalty, robust to outliers
- Interpretation: On average, predictions are off by X units
Mean Squared Error (MSE)
- What it measures: Average squared difference
- Properties: Quadratic penalty, penalizes large errors more
- Limitation: Sensitive to outliers
Root Mean Squared Error (RMSE)
- What it measures: Square root of MSE
- Advantage: Same units as target variable
- When to use: When large errors are particularly costly
Mean Absolute Percentage Error (MAPE)
- What it measures: Average percentage error
- Advantage: Scale-independent, interpretable
- Limitation: Undefined when actual values are zero; asymmetric (over-predictions can yield unbounded percentage errors while under-predictions are capped at 100%)
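A minimal NumPy sketch of the four error metrics above; the zero-mask for MAPE reflects the limitation that it is undefined when actual values are zero:

```python
import numpy as np

def regression_errors(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))                        # linear penalty
    mse = np.mean(err ** 2)                           # quadratic penalty
    rmse = np.sqrt(mse)                               # same units as the target
    nonzero = y_true != 0                             # MAPE is undefined at zero actuals
    mape = np.mean(np.abs(err[nonzero] / y_true[nonzero])) * 100
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE%": mape}

print(regression_errors([100, 150, 200, 250], [110, 140, 210, 240]))
```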
Goodness of Fit
R-squared (Coefficient of Determination)
- What it measures: Variance explained by the model
- Range: Typically 0 to 1; can be negative when the model performs worse than simply predicting the mean
- Interpretation: 0.8 means model explains 80% of variance
- Limitation: Can be gamed by adding features
Adjusted R-squared
- What it measures: R-squared penalized for number of features
- Advantage: Accounts for model complexity
- When to use: Comparing models with different numbers of features
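As a sketch, R-squared and its adjusted variant can be computed directly from residuals; here n is the number of observations and p the number of features:

```python
import numpy as np

def r_squared(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y_true, y_pred, p):
    n, r2 = len(y_true), r_squared(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)        # penalty grows with p

print(adjusted_r_squared([3.0, 4.5, 6.1, 7.9], [3.2, 4.4, 6.0, 8.1], p=1))
```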
LLM and Generative Model Metrics
Traditional metrics don't apply well to open-ended text generation, where there is no single correct output to compare against. LLM evaluation requires specialized approaches.
Output Quality Metrics
Groundedness
- What it measures: Is output supported by provided context?
- Method: Verify claims against source material
- Importance: Critical for RAG applications
Faithfulness
- What it measures: Does output accurately reflect source material?
- Difference from groundedness: Focuses on whether the source is represented accurately, not just on whether each claim has some support
- Method: Check for distortions or misrepresentations
Relevance
- What it measures: Does output address the query?
- Challenge: Subjective, context-dependent
- Method: Human evaluation or learned relevance models
Coherence
- What it measures: Is output internally consistent and logical?
- Method: Check for contradictions, logical flow
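One common implementation pattern for these quality metrics is LLM-as-judge scoring. The sketch below is illustrative only: `call_judge_model` is a hypothetical stand-in for whatever LLM client you use, and the 1-5 rubric is an assumption rather than a standard.

```python
GROUNDEDNESS_PROMPT = """Rate from 1 (unsupported) to 5 (fully supported) how well
the ANSWER is supported by the CONTEXT. Reply with a single integer.

CONTEXT:
{context}

ANSWER:
{answer}
"""

def groundedness_score(context: str, answer: str, call_judge_model) -> int:
    # call_judge_model is a hypothetical LLM call: prompt in, text reply out
    reply = call_judge_model(GROUNDEDNESS_PROMPT.format(context=context, answer=answer))
    return int(reply.strip())  # assumes the judge follows the single-integer format
```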
Hallucination Metrics
Factual Accuracy
- What it measures: Are stated facts correct?
- Method: Verify against knowledge bases
- Challenge: Requires up-to-date fact checking
Citation Accuracy
- What it measures: Do citations exist and say what's claimed?
- Method: Retrieve and compare cited sources
- Importance: Critical for research applications
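A toy sketch of a citation-accuracy check under simplifying assumptions: `corpus` is a hypothetical mapping from source IDs to their text, and exact substring matching stands in for the semantic comparison a real check would need.

```python
def citation_accuracy(citations: list[tuple[str, str]], corpus: dict[str, str]) -> float:
    """Fraction of (claim, source_id) pairs where the source exists and contains the claim."""
    if not citations:
        return 1.0
    supported = 0
    for claim, source_id in citations:
        source_text = corpus.get(source_id, "")        # missing source counts as unsupported
        if source_text and claim.lower() in source_text.lower():
            supported += 1
    return supported / len(citations)
```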
Safety Metrics
Toxicity Rate
- What it measures: Frequency of harmful outputs
- Method: Toxicity classifiers, human review
Policy Violation Rate
- What it measures: How often outputs violate defined policies
- Method: Policy classifiers, rule-based checks
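Both safety metrics reduce to a flagged-output rate over a batch of outputs. A minimal sketch, assuming you already have a scoring function: `score_toxicity` is a placeholder for your classifier, and the 0.5 threshold is an arbitrary choice.

```python
def toxicity_rate(outputs: list[str], score_toxicity, threshold: float = 0.5) -> float:
    """Fraction of outputs whose toxicity score meets or exceeds the threshold."""
    if not outputs:
        return 0.0
    flagged = sum(1 for text in outputs if score_toxicity(text) >= threshold)
    return flagged / len(outputs)
```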
Evaluation Challenges
Delayed Ground Truth
Many real-world outcomes take time to materialize:
- Loan defaults take months or years
- Customer lifetime value accumulates over years
- Medical outcomes may never be definitively labeled
Strategies:
- Use proxy metrics that correlate with eventual outcomes
- Monitor drift as an early warning
- Track prediction distribution changes
- Incorporate partial or provisional labels
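One common early-warning signal for the strategies above is the Population Stability Index (PSI) over prediction scores, which flags distribution changes before labels arrive. A sketch with NumPy; the decile bucketing and the commonly cited 0.1/0.25 alert thresholds are conventions, not hard rules.

```python
import numpy as np

def psi(expected_scores, observed_scores, n_bins: int = 10) -> float:
    """Population Stability Index between a baseline and a recent score distribution."""
    edges = np.quantile(expected_scores, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                 # cover the full score range
    e_counts, _ = np.histogram(expected_scores, bins=edges)
    o_counts, _ = np.histogram(observed_scores, bins=edges)
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    o_pct = np.clip(o_counts / o_counts.sum(), 1e-6, None)
    return float(np.sum((o_pct - e_pct) * np.log(o_pct / e_pct)))
```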
Class Imbalance
Rare events are hard to evaluate:
- Accuracy is meaningless (99% accurate by never predicting fraud)
- Standard metrics can be misleading
Strategies:
- Use precision-recall metrics instead of accuracy
- Track per-class performance separately
- Apply appropriate sampling for evaluation
- Weight metrics by business importance
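A short illustration (scikit-learn assumed) of per-class tracking on a 5%-positive dataset, where the rare class's precision and recall stay visible instead of being averaged away:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0] * 95 + [1] * 5                      # 5% positive class
y_pred = [0] * 93 + [1] * 2 + [0] * 3 + [1] * 2  # a few mistakes on both classes

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1], zero_division=0
)
print("per-class precision:", precision)
print("per-class recall:   ", recall)
```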
Distribution Shifts
Models trained on one distribution, evaluated on another:
- Cross-validation can underestimate real-world error
- Historical evaluation may not predict future performance
Strategies:
- Use time-based splits for temporal data
- Evaluate on out-of-distribution samples
- Monitor production performance continuously
- Compare evaluation and production metrics
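A sketch of time-based evaluation (scikit-learn assumed, with synthetic data): each fold trains on the past and evaluates on the future, which matches how the model will actually be used better than a random split does.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                              # time-ordered features
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=500)

for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])        # fit on the past
    mae = mean_absolute_error(y[test_idx], model.predict(X[test_idx]))
    print("fold MAE:", round(mae, 4))                      # evaluate on the future
```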
Evaluation functions inform your policies. AI supervision enforces them—using these metrics to determine when models can operate autonomously and when they need intervention.
How Swept AI Approaches Evaluation
Swept AI provides comprehensive model evaluation:
- Evaluate: Pre-deployment assessment across accuracy, fairness, robustness, and safety. Test with the metrics that matter for your use case.
- Multi-metric tracking: Monitor multiple evaluation functions simultaneously. Understand tradeoffs between precision and recall, accuracy and fairness.
- LLM-native evaluation: Purpose-built metrics for language models including hallucination detection, groundedness scoring, and safety evaluation.
The right evaluation functions tell you whether your model is ready—and whether it stays ready. For robustness evaluation, see adversarial testing. See also: Noise Is the Real Test.
Evaluation Function FAQs
What are model evaluation functions?
Mathematical functions that measure model performance by comparing predictions to ground truth. Different model types require different evaluation functions.
What is the difference between accuracy and precision?
Accuracy measures overall correct predictions. Precision measures what percentage of positive predictions were actually correct. They answer different questions about model performance.
Why can accuracy be misleading?
Accuracy can be misleading with imbalanced data. A model that always predicts 'not fraud' achieves 99% accuracy when fraud is 1% of cases, but catches zero fraud.
How are LLMs and generative models evaluated?
Traditional metrics don't apply well. LLM evaluation uses groundedness, faithfulness, relevance, coherence, and task-specific quality measures like hallucination rates.
How do you evaluate models when ground truth is delayed?
Use proxy metrics, drift detection, prediction distribution analysis, and user feedback. Monitor what you can while waiting for ground truth to confirm performance.
Should you track more than one metric?
Yes. Single metrics hide important information. Track multiple metrics to understand performance across dimensions, especially metrics that capture different failure modes.