Scoring Methodology
How we achieve statistically rigorous AI system evaluation with mathematically derived sample sizes and transparent confidence intervals.
What is a test?
A test is a single query posed to an AI system, paired with a predetermined correct answer derived from the system's knowledge base.
- A question or request that a real user might ask the AI system
- The correct answer, verified against the system's source data
- A pass/fail outcome: did the system surface the correct information?
Each test is independent and objective. By running hundreds of tests across representative query types, we measure how reliably an AI system retrieves and presents accurate information.
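As a rough sketch, a single test record might be represented like this (the field names and example content are ours, for illustration only, not a published schema):

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    """One evaluation unit: a realistic query, its ground-truth answer, and the graded outcome."""
    query: str                   # a question a real user might ask the system
    expected_answer: str         # verified against the system's source data
    passed: bool | None = None   # set after grading: did the system surface the correct information?

case = TestCase(
    query="What is the warranty period for product X?",
    expected_answer="24 months from the date of purchase",
)
```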
n = (Z² × p × (1 − p)) / E²  (Z: z-score for the confidence level, p: expected proportion, E: margin of error)

This is the minimum sample size needed to achieve a ±5% margin of error at 95% confidence.
The Core Statistical Foundation
Our methodology rests on well-established statistical principles used across every field that requires rigorous measurement of probabilistic phenomena.
Sample Size Is Mathematically Derived
The 385-test requirement comes directly from solving the sample size equation for proportion estimation.
This formula is foundational to inferential statistics and is used by polling organizations, clinical trials, quality control processes, and regulatory bodies worldwide.
Conservative Variance Assumption
We use p=0.5 (maximum variance) rather than assuming any particular performance level.
This means we're using the largest sample size the formula would ever require. If a system actually performs at 72%, the true required sample size would be 310. The methodology over-tests by design.
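A minimal sketch of this calculation (Z = 1.96 corresponds to 95% confidence; the function name is ours):

```python
import math

def required_sample_size(p: float, z: float = 1.96, e: float = 0.05) -> int:
    """Minimum n to estimate a proportion within ±e at the confidence level implied by z."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

print(required_sample_size(0.5))   # 385 (worst-case variance: the figure we use)
print(required_sample_size(0.72))  # 310 (what the formula yields if true accuracy is 72%)
```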
Bernoulli Trial Framework
Each test is a binary outcome: the system either surfaces the correct information or it doesn't.
The sum of independent Bernoulli trials follows a binomial distribution, which for n ≥ 30 is well approximated by a normal distribution (Central Limit Theorem). At n = 385, we are well within the regime where this approximation is highly accurate.
Law of Large Numbers
As sample size increases, the observed proportion converges to the true proportion.
At 385 samples, the observed proportion falls within ±5% of the true proportion with 95% probability; whatever fluctuation remains is ordinary sampling error, already reflected in the reported interval.
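A quick Monte Carlo check of this claim (a sketch, not part of the published protocol; the 72% true accuracy is assumed purely for illustration):

```python
import random

def fraction_within_band(true_p: float = 0.72, n: int = 385,
                         runs: int = 10_000, band: float = 0.05, seed: int = 0) -> float:
    """Share of simulated n-test evaluations whose observed accuracy lands within ±band of true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        passes = sum(rng.random() < true_p for _ in range(n))
        if abs(passes / n - true_p) <= band:
            hits += 1
    return hits / runs

print(fraction_within_band())  # about 0.97 here; the worst case (true_p = 0.5) gives about 0.95
```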
Addressing Specific Objections
Common concerns about our methodology and our evidence-based responses.
“A sample of 385 tests can't account for the complexity of a sophisticated AI system.”
The sample size of 385 is determined by the desired precision of our statistical estimate, not the architectural complexity of the system under evaluation. The mathematical relationship between sample size, confidence level, and margin of error is invariant across application domains, whether the subject is a deterministic FAQ retrieval system or a stochastic multi-agent RAG architecture. Should 385 independent trials prove insufficient to establish stable performance characteristics, this indicates excessive variance inherent to the system itself: a substantive empirical finding rather than a methodological limitation.
“Your test distribution doesn't match how our customers actually use the system.”
Our category benchmark uses proportional stratified sampling based on published research and industry data on query distribution for the category (e.g., customer service chatbots). The distribution is documented and public. If your actual usage differs significantly from category norms, that's an argument for custom evaluation (Tier 3), not against the validity of category benchmarking. The benchmark answers: “How does this system perform against a standardized, representative workload for this category?”
“AI systems are non-deterministic; a single evaluation can't be representative.”
Systems sold as production-ready should perform consistently. However, we address this concern by:
- Testing over a distributed time window, not a single session
- Reporting confidence intervals, not point estimates
- Offering retest protocols if a vendor believes results are anomalous
If a system's performance is highly variable day-to-day, that is the finding.
“Grading is subjective.”
We use ground-truth evaluation: each test has a predetermined correct answer derived from the knowledge base content. The system either surfaces that information or it doesn't. For cases requiring judgment (e.g., partial credit, semantic equivalence), we document rubrics in advance and can provide inter-rater reliability metrics.
“You can't claim a system is exactly 72% accurate.”
Correct, and we don't. We report “72% (95% CI: 67-77%)”. This is more honest than reporting false precision. Any vendor claiming exact accuracy figures without confidence intervals is providing less statistical rigor, not more.
“A standardized benchmark can't capture our system's unique strengths.”
Category benchmarks are designed to assess aggregate performance against a representative workload distribution, not to capture domain-specific optimization. Systems exhibiting differentiated capabilities in particular areas are appropriate candidates for Tier 2 evaluation, which provides the granular analysis necessary to surface such distinctions. The benchmark serves a specific epistemic function: enabling reliable comparative assessment across vendors under standardized conditions.
What Makes This Valid for Probabilistic Systems
The stochastic nature of AI systems does not preclude rigorous measurement. Rather, it necessitates a statistical framework for evaluation.
The alternative approaches are worse:
| Approach | Problem |
|---|---|
| Vendor-reported accuracy | Selection bias, inconsistent methodology, unverifiable |
| Small demo/POC | n = 10–20 gives a ±20–30% margin of error; statistically meaningless |
| Anecdotal evaluation | Confirmation bias, recency bias, no systematic coverage |
| Single-run testing | Doesn't account for stochastic variation |
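To make the "Small demo/POC" row concrete, the worst-case 95% margin of error shrinks with sample size as follows (a sketch using the same formula as the sample size derivation above):

```python
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Worst-case 95% margin of error for an accuracy estimate based on n binary tests."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (10, 20, 100, 385):
    print(n, f"±{margin_of_error(n):.1%}")
# 10  ±31.0%
# 20  ±21.9%
# 100 ±9.8%
# 385 ±5.0%
```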
Our methodology applies the same statistical rigor used in clinical trials, manufacturing quality control, and election polling. These are domains where the stakes for accurate measurement are high and the methods are well-established.
Transparency and Reproducibility
Everything we do is documented and verifiable.
Sample Size & Formula
Exact sample size and the formula used to derive it
Confidence Parameters
Confidence level and margin of error for all reported figures
Test Distribution
Test category distribution and rationale
Grading Standards
Grading rubrics and ground-truth sources
Testing Timeline
Time window over which testing occurred
System Configuration
Version/configuration of the system tested
Any qualified third party can verify our statistical claims. Vendors can request raw data (with test content redacted if needed to protect benchmark integrity).
Methodology Improvements
Additional measures we implement to strengthen validity and preempt sophisticated objections.
Test-Retest Reliability
Run a random 10% subset of tests twice per system and report correlation. Demonstrates that results are stable and not artifacts of momentary system behavior.
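For binary pass/fail outcomes, run-to-run stability can be summarized with percent agreement and the phi coefficient (Pearson correlation on 0/1 data); a minimal sketch of that computation, with our own function name:

```python
def test_retest(run1: list[int], run2: list[int]) -> tuple[float, float]:
    """Percent agreement and phi coefficient between two pass/fail runs (1 = pass, 0 = fail)."""
    n = len(run1)
    agreement = sum(a == b for a, b in zip(run1, run2)) / n
    mean1, mean2 = sum(run1) / n, sum(run2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(run1, run2))
    var1 = sum((a - mean1) ** 2 for a in run1)
    var2 = sum((b - mean2) ** 2 for b in run2)
    phi = cov / (var1 * var2) ** 0.5 if var1 and var2 else float("nan")
    return agreement, phi
```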
Inter-Rater Reliability: AI Judge Calibration
Calibration Process
1. Expert human graders evaluate 200 randomly sampled test cases with documented rubrics. Inter-rater reliability among human graders is measured (target: κ ≥ 0.85).
2. Each AI judge independently evaluates the same 200 cases.
3. Cohen's kappa is computed for each AI judge against the human consensus.
4. Only judges achieving κ ≥ 0.80 (substantial agreement) qualify for production use.
κ = (P_observed − P_chance) / (1 − P_chance), where P_observed is the raw agreement rate and P_chance is the agreement expected by chance given each grader's marginal pass/fail rates.
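A minimal implementation of this formula for binary grades (a sketch; the chance-agreement term is derived from the two graders' marginal rates):

```python
def cohens_kappa(judge: list[int], human: list[int]) -> float:
    """Cohen's kappa between an AI judge and the human consensus (1 = pass, 0 = fail)."""
    n = len(judge)
    p_observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge, p_human = sum(judge) / n, sum(human) / n
    # Agreement expected by chance from the two graders' marginal pass/fail rates
    p_chance = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    return (p_observed - p_chance) / (1 - p_chance)
```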
Multi-Judge Consensus Architecture
We employ three AI judges from different model families to prevent correlated errors from shared training data or architectural blind spots.
| Judge agreement | Handling |
|---|---|
| Unanimous (3/3) | Accepted automatically |
| Majority (2/3) | Accepted with audit flag |
| No majority | Escalated to human review |
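A sketch of the decision rule (verdict labels are illustrative; with graded verdicts such as "partial", a three-way split is possible, which is what triggers escalation):

```python
from collections import Counter

def consensus(verdicts: list[str]) -> tuple[str, str]:
    """Combine three independent judge verdicts into a final grading decision."""
    label, votes = Counter(verdicts).most_common(1)[0]
    if votes == len(verdicts):
        return label, "accepted automatically"       # unanimous (3/3)
    if votes * 2 > len(verdicts):
        return label, "accepted with audit flag"     # majority (2/3)
    return "undecided", "escalated to human review"  # no majority

print(consensus(["pass", "pass", "fail"]))     # ('pass', 'accepted with audit flag')
print(consensus(["pass", "partial", "fail"]))  # ('undecided', 'escalated to human review')
```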
Bias Measurement and Disclosure
| Judge | κ vs. human consensus | False positive rate | False negative rate | Adversarial detection rate |
|---|---|---|---|---|
| Claude Sonnet | 0.87 | 3.1% | 8.4% | 91% |
| GPT-4o | 0.84 | 5.2% | 6.9% | 88% |
| Gemini Pro | 0.81 | 4.8% | 9.2% | 85% |
Stratified Confidence Intervals
Report per-category results with appropriate (wider) confidence intervals, alongside the aggregate. This acknowledges that category-level estimates have lower precision while still providing directional signal.
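An illustration of why category-level intervals are wider (the pass counts here are hypothetical, and a simple normal-approximation interval is used for brevity):

```python
import math

def wald_ci(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a pass rate."""
    p = passes / n
    moe = z * math.sqrt(p * (1 - p) / n)
    return p - moe, p + moe

print(wald_ci(277, 385))  # aggregate: about 72% ± 4.5%
print(wald_ci(40, 55))    # a single category: about 73% ± 11.8%
```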
Pre-Registration
Publish the test methodology and category distribution before testing begins, similar to clinical trial pre-registration. Prevents accusations of post-hoc methodology changes.
Vendor Challenge Period
Before publishing, allow vendors to review their raw results and flag specific test cases they believe are flawed. Adjudicate challenges with an independent reviewer.
Bootstrap Confidence Intervals
In addition to parametric confidence intervals, we compute bootstrap CIs to validate that our conclusions are not sensitive to distributional assumptions.
Bootstrap Confidence Intervals
A non-parametric approach to validate our conclusions.
Method
1. From the observed 385 test results, randomly resample 385 results with replacement
2. Calculate accuracy on the resampled dataset
3. Repeat 10,000 times
4. The 2.5th and 97.5th percentiles form the 95% bootstrap CI
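A compact implementation of this resampling procedure (a sketch; the pass/fail counts in the example are illustrative):

```python
import random

def bootstrap_ci(results: list[int], resamples: int = 10_000, seed: int = 0) -> tuple[float, float]:
    """95% bootstrap confidence interval for accuracy over binary test results."""
    rng = random.Random(seed)
    n = len(results)
    accuracies = sorted(sum(rng.choices(results, k=n)) / n for _ in range(resamples))
    return accuracies[int(0.025 * resamples)], accuracies[int(0.975 * resamples)]

observed = [1] * 277 + [0] * 108  # 277 passes out of 385 tests (about 72% accuracy)
print(bootstrap_ci(observed))     # roughly 0.72 ± 0.045
```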
Why Both Methods?
The parametric CI assumes test results behave as independent Bernoulli trials with uniform characteristics. Bootstrap CIs make no such assumption. They derive uncertainty directly from the observed data.
Agreement between methods confirms robustness; divergence reveals hidden structure worth investigating.
Interpretation Guide
| Scenario | Interpretation |
|---|---|
| Bootstrap CI ≈ Parametric CI | Data matches assumptions. Tests behave as independent trials. High confidence in estimate. |
| Bootstrap CI wider than Parametric | Hidden structure in data, likely clustered failures by question category, or system instability. |
| Bootstrap CI narrower than Parametric | Conservative assumptions (p=0.5) overestimated variance. Actual precision is better than promised. |
Example Reporting
System X achieved 72% accuracy.
Parametric 95% CI
67.0% – 77.0% (±5.0%)
Bootstrap 95% CI
66.2% – 77.8% (±5.8%)
The slightly wider bootstrap interval suggests minor clustering effects across question categories.
What This Tells Buyers
The point estimate (72%) tells buyers how well the system performs on average. The confidence interval width tells buyers how much to trust that number. When bootstrap and parametric CIs agree, the estimate is stable and reliable.
The 72% score with a 95% confidence interval of ±5% means we're 95% confident the system's true accuracy on this representative workload is between 67% and 77%.
This conclusion rests on well-established statistical principles: Bernoulli trials, binomial distribution, Central Limit Theorem, and standard confidence interval construction.
The methodology is conservative, transparent, and standardized.
Objections to this methodology are, implicitly, objections to inferential statistics itself.
Questions about our methodology?
We welcome scrutiny and are happy to discuss any aspect of our approach in detail.
Get in Touch