Scoring Methodology

How we achieve statistically rigorous AI system evaluation with mathematically derived sample sizes and transparent confidence intervals.

• 385: Test minimum (mathematically derived)
• 95%: Confidence level (industry standard)
• ±5%: Margin of error (precision bound)
• p = 0.5: Maximum variance (conservative estimate)

What is a test?

A test is a single query posed to an AI system, paired with a predetermined correct answer derived from the system's knowledge base.

1. Input

A question or request that a real user might ask the AI system

2. Ground Truth

The correct answer, verified against the system's source data

3. Binary Outcome

Pass or fail: did the system surface the correct information?

Each test is independent and objective. By running hundreds of tests across representative query types, we measure how reliably an AI system retrieves and presents accurate information.

n = (Z² × p × (1 − p)) / E²

This is the minimum sample size to achieve ±5% margin of error at 95% confidence.
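For readers who prefer code to algebra, here is a minimal Python sketch of the same calculation, using Z = 1.96 (95% confidence) and E = 0.05 (±5% margin of error):

```python
import math

def required_sample_size(p: float = 0.5, z: float = 1.96, e: float = 0.05) -> int:
    """Minimum n for estimating a proportion p with margin of error e
    at the confidence level implied by z (1.96 -> 95%)."""
    return math.ceil((z**2 * p * (1 - p)) / e**2)

print(required_sample_size())  # 385, the worst-case (p = 0.5) requirement
```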

01

The Core Statistical Foundation

Our methodology rests on well-established statistical principles used across every field that requires rigorous measurement of probabilistic phenomena.

1.1

Sample Size Is Mathematically Derived

The 385-test requirement comes directly from solving the sample size equation for proportion estimation.

This formula is foundational to inferential statistics and is used by polling organizations, clinical trials, quality control processes, and regulatory bodies worldwide.

1.2

Conservative Variance Assumption

We use p=0.5 (maximum variance) rather than assuming any particular performance level.

This means we're using the largest sample size the formula would ever require. If a system actually performs at 72%, the true required sample size would be 310. The methodology over-tests by design.
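The claim that p = 0.5 is the worst case can be checked directly: the variance term p(1 − p) peaks at 0.5, so any other assumed accuracy yields a smaller required n. A short illustrative sketch:

```python
import math

z, e = 1.96, 0.05  # 95% confidence, ±5% margin of error
# p(1-p), the variance term in the sample-size formula, is maximized at p = 0.5,
# so assuming p = 0.5 always yields the largest n the formula can demand.
for p in (0.5, 0.72, 0.9):
    n = math.ceil((z**2 * p * (1 - p)) / e**2)
    print(f"p = {p:.2f}  p(1-p) = {p * (1 - p):.4f}  required n = {n}")
# p = 0.50  p(1-p) = 0.2500  required n = 385
# p = 0.72  p(1-p) = 0.2016  required n = 310
# p = 0.90  p(1-p) = 0.0900  required n = 139
```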

1.3

Bernoulli Trial Framework

Each test is a binary outcome: the system either surfaces the correct information or it doesn't.

The sum of independent Bernoulli trials follows a binomial distribution, which for n≥30 approximates the normal distribution (Central Limit Theorem). At n=385, we're well into the regime where these approximations are highly accurate.
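As an illustration of how accurate the approximation is at this sample size, the sketch below (assuming SciPy is available) compares the exact binomial probability that the observed proportion lands within ±5% of a worst-case true proportion of 0.5 against the normal approximation; both come out near 0.95:

```python
from scipy.stats import binom, norm

n, p, e = 385, 0.5, 0.05  # sample size, worst-case true accuracy, ±5% band

# Exact binomial: P(|p_hat - p| <= e), where p_hat = passes / n
lo, hi = int((p - e) * n), int((p + e) * n)
exact = binom.cdf(hi, n, p) - binom.cdf(lo - 1, n, p)

# Normal approximation via the Central Limit Theorem
se = (p * (1 - p) / n) ** 0.5
approx = norm.cdf(e / se) - norm.cdf(-e / se)

print(f"exact binomial: {exact:.4f}, normal approximation: {approx:.4f}")  # both ≈ 0.95
```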

1.4

Law of Large Numbers

As sample size increases, the observed proportion converges to the true proportion.

At 385 samples, random variation is constrained to the ±5% band with 95% probability: the observed pass rate lands within five percentage points of the true rate in 95% of test runs, so results stay within the stated bounds of random sampling.
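A quick Monte Carlo sketch illustrates the convergence; the true accuracy of 0.72 used here is purely hypothetical:

```python
import random

random.seed(0)
true_p, n, runs = 0.72, 385, 10_000  # hypothetical true accuracy, tests per run

within_band = 0
for _ in range(runs):
    observed = sum(random.random() < true_p for _ in range(n)) / n
    within_band += abs(observed - true_p) <= 0.05

# ≈ 0.97: above 95%, because the ±5% bound was sized for the worst case p = 0.5
print(f"fraction of runs within ±5% of the truth: {within_band / runs:.3f}")
```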

02

Addressing Specific Objections

Common concerns about our methodology and our evidence-based responses.

The sample size of 385 is determined by the desired precision of our statistical estimate, not the architectural complexity of the system under evaluation. The mathematical relationship between sample size, confidence level, and margin of error is invariant across application domains, whether the subject is a deterministic FAQ retrieval system or a stochastic multi-agent RAG architecture. Should 385 independent trials prove insufficient to establish stable performance characteristics, this indicates excessive variance inherent to the system itself: a substantive empirical finding rather than a methodological limitation.

Our category benchmark uses proportional stratified sampling based on published research and industry data on query distribution for the category (e.g., customer service chatbots). The distribution is documented and public. If your actual usage differs significantly from category norms, that's an argument for custom evaluation (Tier 3), not against the validity of category benchmarking. The benchmark answers: “How does this system perform against a standardized, representative workload for this category?”
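As a sketch of what proportional stratified allocation looks like in practice, the weights below are hypothetical placeholders rather than the published benchmark distribution:

```python
TOTAL_TESTS = 385

# Hypothetical category weights for illustration only; the real benchmark uses
# the documented, published distribution for the category under test.
category_weights = {
    "account & billing": 0.30,
    "product information": 0.25,
    "troubleshooting": 0.25,
    "policies & compliance": 0.20,
}

allocation = {cat: round(TOTAL_TESTS * w) for cat, w in category_weights.items()}
print(allocation, "| total:", sum(allocation.values()))  # totals 385 here; rounding
# may require a small manual adjustment for other weightings
```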

Systems sold as production-ready should perform consistently. That said, we address concerns about run-to-run and day-to-day variability by:

  • Testing over a distributed time window, not a single session
  • Reporting confidence intervals, not point estimates
  • Offering retest protocols if a vendor believes results are anomalous

If a system's performance is highly variable day-to-day, that is the finding.

We use ground-truth evaluation: each test has a predetermined correct answer derived from the knowledge base content. The system either surfaces that information or it doesn't. For cases requiring judgment (e.g., partial credit, semantic equivalence), we document rubrics in advance and can provide inter-rater reliability metrics.

Correct: no single number can pin down a stochastic system exactly, and we don't claim one. We report “72% (95% CI: 67-77%)”. This is more honest than reporting false precision. Any vendor claiming exact accuracy figures without confidence intervals is providing less statistical rigor, not more.
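For concreteness, here is a sketch of how such a figure is assembled from raw counts. The counts are hypothetical, and the published interval uses the conservative ±5% design bound that follows from the p = 0.5 assumption, while the data-driven (Wald) interval at an observed 72% is slightly narrower:

```python
import math

n, passes = 385, 277        # hypothetical raw counts; 277/385 ≈ 72%
p_hat = passes / n
z = 1.96

wald_moe = z * math.sqrt(p_hat * (1 - p_hat) / n)  # data-driven margin, ≈ ±4.5%
design_moe = 0.05                                  # conservative bound used in reporting

print(f"accuracy: {p_hat:.1%}")
print(f"Wald 95% CI:   {p_hat - wald_moe:.1%} to {p_hat + wald_moe:.1%}")
print(f"design 95% CI: {p_hat - design_moe:.1%} to {p_hat + design_moe:.1%}")
```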

Category benchmarks are designed to assess aggregate performance against a representative workload distribution, not to capture domain-specific optimization. Systems exhibiting differentiated capabilities in particular areas are appropriate candidates for Tier 2 evaluation, which provides the granular analysis necessary to surface such distinctions. The benchmark serves a specific epistemic function: enabling reliable comparative assessment across vendors under standardized conditions.

03

What Makes This Valid for Probabilistic Systems

The stochastic nature of AI systems does not preclude rigorous measurement. Rather, it necessitates a statistical framework for evaluation.

The alternative approaches are worse:

• Vendor-reported accuracy: selection bias, inconsistent methodology, unverifiable
• Small demo/POC: n = 10-20 gives ±20-30% MoE, statistically meaningless
• Anecdotal evaluation: confirmation bias, recency bias, no systematic coverage
• Single-run testing: doesn't account for stochastic variation

Our methodology applies the same statistical rigor used in clinical trials, manufacturing quality control, and election polling. These are domains where the stakes for accurate measurement are high and the methods are well-established.

04

Transparency and Reproducibility

Everything we do is documented and verifiable.

• Sample Size & Formula: the exact sample size and the formula used to derive it
• Confidence Parameters: the confidence level and margin of error for all reported figures
• Test Distribution: the test category distribution and its rationale
• Grading Standards: grading rubrics and ground-truth sources
• Testing Timeline: the time window over which testing occurred
• System Configuration: the version/configuration of the system tested

Any qualified third party can verify our statistical claims. Vendors can request raw data (with test content redacted if needed to protect benchmark integrity).

05

Methodology Improvements

Additional measures we implement to strengthen validity and preempt sophisticated objections.

5.1

Test-Retest Reliability

We run a random 10% subset of tests twice per system and report the correlation between the two runs. This demonstrates that results are stable and not artifacts of momentary system behavior.
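A minimal sketch of how the retested subset could be summarized; the pass/fail arrays below are hypothetical stand-ins for the matched outcomes of the two runs:

```python
from statistics import correlation  # Python 3.10+

# Hypothetical pass/fail outcomes for the same retested 10% subset, run twice.
run_1 = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
run_2 = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1]

agreement = sum(a == b for a, b in zip(run_1, run_2)) / len(run_1)
print(f"test-retest agreement: {agreement:.0%}")              # 95% (19 of 20 match)
print(f"correlation (phi): {correlation(run_1, run_2):.2f}")  # ≈ 0.87 here
```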

5.2

Inter-Rater Reliability: AI Judge Calibration

Calibration Process

1

Expert human graders evaluate 200 randomly sampled test cases with documented rubrics. Inter-rater reliability among human graders is measured (target: κ ≥ 0.85).

2

Each AI judge independently evaluates the same 200 cases.

3

Cohen's kappa is computed for each AI judge against the human consensus.

4

Only judges achieving κ ≥ 0.80 (substantial agreement) qualify for production use.

κ = (P_observed − P_chance) / (1 − P_chance)
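A worked sketch of the formula for a binary pass/fail grading task, using a hypothetical 2×2 agreement table between one AI judge and the human consensus:

```python
# Hypothetical agreement counts over 200 calibration cases.
both_pass, human_pass_judge_fail = 130, 10
human_fail_judge_pass, both_fail = 6, 54
total = both_pass + human_pass_judge_fail + human_fail_judge_pass + both_fail  # 200

p_observed = (both_pass + both_fail) / total  # raw agreement rate

# Chance agreement from each grader's marginal pass/fail frequencies.
human_pass = (both_pass + human_pass_judge_fail) / total
judge_pass = (both_pass + human_fail_judge_pass) / total
p_chance = human_pass * judge_pass + (1 - human_pass) * (1 - judge_pass)

kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"kappa = {kappa:.2f}")  # ≈ 0.81 with these counts, above the 0.80 threshold
```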

Multi-Judge Consensus Architecture

We employ three AI judges from different model families to prevent correlated errors from shared training data or architectural blind spots.

• Unanimous (3/3): accepted automatically
• Majority (2/3): accepted with an audit flag
• No majority: escalated to human review
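A sketch of this routing rule. With strictly binary votes a three-judge panel always produces a majority, so the no-majority branch is modeled here by allowing a hypothetical "uncertain" verdict; the function name and labels are illustrative:

```python
from collections import Counter

def route_verdict(votes: list[str]) -> str:
    """Route a graded test given three judge verdicts, e.g. ["pass", "pass", "fail"].
    "uncertain" is a hypothetical third verdict that makes a no-majority outcome
    possible with only three judges."""
    label, count = Counter(votes).most_common(1)[0]
    if count == 3 and label in ("pass", "fail"):
        return f"accept ({label}, unanimous)"
    if count == 2 and label in ("pass", "fail"):
        return f"accept ({label}, majority, audit flag)"
    return "escalate to human review"

print(route_verdict(["pass", "pass", "pass"]))       # accept (pass, unanimous)
print(route_verdict(["pass", "fail", "pass"]))       # accept (pass, majority, audit flag)
print(route_verdict(["pass", "fail", "uncertain"]))  # escalate to human review
```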

Bias Measurement and Disclosure

Judge           κ vs. Human   False Positive   False Negative   Adversarial Detection
Claude Sonnet   0.87          3.1%             8.4%             91%
GPT-4o          0.84          5.2%             6.9%             88%
Gemini Pro      0.81          4.8%             9.2%             85%
5.3

Stratified Confidence Intervals

Report per-category results with appropriate (wider) confidence intervals, alongside the aggregate. This acknowledges that category-level estimates have lower precision while still providing directional signal.
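A sketch of why category-level intervals are wider: the same Wald construction applied to hypothetical per-category counts (chosen to sum to an aggregate of 277/385, about 72%) gives margins of roughly ±8-10% per stratum versus about ±4.5% overall:

```python
import math

Z = 1.96

def wald_ci(passes: int, n: int) -> tuple[float, float]:
    """Return (accuracy, margin of error) for a simple Wald interval."""
    p = passes / n
    return p, Z * math.sqrt(p * (1 - p) / n)

# Hypothetical per-category results; smaller n per stratum means wider intervals.
categories = {
    "account & billing": (88, 116),
    "product information": (70, 96),
    "troubleshooting": (65, 96),
    "policies & compliance": (54, 77),
}

for name, (passes, n) in categories.items():
    p, moe = wald_ci(passes, n)
    print(f"{name:<24} {p:.1%} ± {moe:.1%}  (n = {n})")

total_passes = sum(s for s, _ in categories.values())
total_n = sum(n for _, n in categories.values())
p, moe = wald_ci(total_passes, total_n)
print(f"{'aggregate':<24} {p:.1%} ± {moe:.1%}  (n = {total_n})")
```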

5.4

Pre-Registration

Publish the test methodology and category distribution before testing begins, similar to clinical trial pre-registration. Prevents accusations of post-hoc methodology changes.

5.5

Vendor Challenge Period

Before publishing, allow vendors to review their raw results and flag specific test cases they believe are flawed. Adjudicate challenges with an independent reviewer.

5.6

Bootstrap Confidence Intervals

In addition to parametric confidence intervals, we compute bootstrap CIs to validate that our conclusions are not sensitive to distributional assumptions.

06

Bootstrap Confidence Intervals

A non-parametric approach to validate our conclusions.

Method

1

From the observed 385 test results, randomly resample 385 results with replacement

2

Calculate accuracy on the resampled dataset

3

Repeat 10,000 times

4

The 2.5th and 97.5th percentiles form the 95% bootstrap CI
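A compact sketch of this procedure using NumPy. The outcome vector below is hypothetical (277 passes out of 385, about 72%); on clean, independent data like this the bootstrap interval closely matches the parametric one, and it is clustered real-world failures that widen it:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical per-test outcomes: 277 passes, 108 fails (385 total, ≈ 72%).
results = np.array([1] * 277 + [0] * 108)
n = len(results)

# Resample n outcomes with replacement, 10,000 times, recording accuracy each time.
boot_accuracies = np.array([
    rng.choice(results, size=n, replace=True).mean()
    for _ in range(10_000)
])

lo, hi = np.percentile(boot_accuracies, [2.5, 97.5])
print(f"bootstrap 95% CI: {lo:.1%} to {hi:.1%}")
```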

Why Both Methods?

The parametric CI assumes test results behave as independent Bernoulli trials with uniform characteristics. Bootstrap CIs make no such assumption. They derive uncertainty directly from the observed data.

Agreement between methods confirms robustness; divergence reveals hidden structure worth investigating.

Interpretation Guide

• Bootstrap CI ≈ parametric CI: data matches assumptions; tests behave as independent trials; high confidence in the estimate.
• Bootstrap CI wider than parametric: hidden structure in the data, likely clustered failures by question category, or system instability.
• Bootstrap CI narrower than parametric: the conservative assumption (p = 0.5) overestimated variance; actual precision is better than promised.

Example Reporting

System X achieved 72% accuracy.

• Parametric 95% CI: 67.0% – 77.0% (±5.0%)
• Bootstrap 95% CI: 66.2% – 77.8% (±5.8%)

The slightly wider bootstrap interval suggests minor clustering effects across question categories.

What This Tells Buyers

The point estimate (72%) tells buyers how well the system performs on average. The confidence interval width tells buyers how much to trust that number. When bootstrap and parametric CIs agree, the estimate is stable and reliable.

The 72% score with a 95% confidence interval of ±5% means we're 95% confident the system's true accuracy on this representative workload is between 67% and 77%.

This conclusion rests on well-established statistical principles: Bernoulli trials, binomial distribution, Central Limit Theorem, and standard confidence interval construction.

The methodology is conservative, transparent, and standardized.

Objections to this methodology are, implicitly, objections to inferential statistics itself.

Questions about our methodology?

We welcome scrutiny and are happy to discuss any aspect of our approach in detail.
