# Scoring Methodology

Swept AI evaluates AI systems with statistically rigorous methods: mathematically derived sample sizes and transparent confidence intervals. Headline parameters: a 385-test minimum, a 95% confidence level, a ±5% margin of error, and a conservative maximum-variance assumption of p=0.5.

## What Constitutes a Test

A test is a single query posed to an AI system, paired with a predetermined correct answer drawn from the system's own knowledge base and documentation. Tests are specific to your use case because they come from your source content. Each test has three parts:

- **Input**: a question or request a real user might ask the AI system.
- **Ground truth**: the correct answer, verified against the system's source data.
- **Binary outcome**: pass or fail, depending on whether the system surfaced the correct information.

Tests are independent and objective. Running hundreds of them across representative query types measures how reliably a system retrieves and presents accurate information.

## The Sample-Size Formula

The minimum sample size comes from solving the proportion-estimation equation:

`n = (Z² × p × (1-p)) / E²`

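Here n is the minimum number of tests, Z is the z-score for the chosen confidence level (1.96 for 95%), E is the target margin of error expressed as a proportion (0.05 for ±5%), and p is the assumed pass rate. Setting p=0.5 maximizes p × (1-p), and therefore the variance. With these values, n = (1.96² × 0.5 × 0.5) / 0.05² ≈ 384.16, which rounds up to 385.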
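
The formula is easy to check numerically. The following is a minimal sketch, not Swept's production code; the function name `min_sample_size` and its parameter names are illustrative assumptions. It uses only the Python standard library.

```python
import math
from statistics import NormalDist


def min_sample_size(confidence: float = 0.95, margin: float = 0.05, p: float = 0.5) -> int:
    """Smallest whole n satisfying n >= Z^2 * p * (1 - p) / E^2.

    `min_sample_size` is a hypothetical helper name used for illustration.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Round up: a fractional test cannot be run.
    return math.ceil(z**2 * p * (1 - p) / margin**2)


print(min_sample_size())                  # 385 (95%, ±5%, p=0.5)
print(min_sample_size(p=0.72))            # 310 (less variance, fewer tests)
print(min_sample_size(confidence=0.99))   # 664 (higher confidence)
print(min_sample_size(margin=0.03))       # 1068 (tighter margin)
```

The last two calls show how quickly the requirement grows when either the confidence level or the margin of error is tightened.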

## The Core Statistical Foundation

The methodology rests on established statistical principles used wherever probabilistic phenomena (outcomes that vary even under identical conditions) must be measured rigorously.

- **Sample size is mathematically derived**: the 385-test figure follows directly from the sample-size equation. The same formula is used in polling, clinical trials, quality control, and regulatory work.
- **Conservative variance assumption**: using p=0.5 (maximum variance) yields the largest sample the formula would ever require. A system that truly performs at 72% would need only 310 tests, so the methodology over-tests by design.
- **Bernoulli trial framework**: each test is a binary outcome. The number of passes across n independent Bernoulli trials follows a binomial distribution, which a normal distribution approximates well at large n (Central Limit Theorem). The common n≥30 rule of thumb applies when p is not close to 0 or 1.
- **Central Limit Theorem consequence**: at n=385, the margin of error is at most ±5% at 95% confidence, whatever the true pass rate. This follows from the normal approximation to the binomial distribution combined with the worst-case p=0.5.
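
The 95% figure above rests on the normal approximation, so it is worth checking against the exact binomial distribution. The sketch below is an illustration only; `exact_coverage` is a hypothetical helper. It sums binomial probabilities directly to find how often the observed pass rate lands within ±5% of the true rate at n=385.

```python
from math import comb


def exact_coverage(n: int, p: float, margin: float = 0.05) -> float:
    """P(|X/n - p| <= margin) for X ~ Binomial(n, p), summed term by term."""
    total = 0.0
    for k in range(n + 1):
        # Keep only outcomes whose observed pass rate is within the margin.
        if abs(k / n - p) <= margin:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


worst_case = exact_coverage(385, 0.5)   # true pass rate at maximum variance
typical = exact_coverage(385, 0.72)     # true pass rate of 72%

# The worst case comes out close to 0.95 (the binomial is discrete, so it
# does not land exactly on the nominal level); p = 0.72 has less variance
# and therefore higher coverage for the same ±5% band.
print(f"p=0.50: {worst_case:.3f}")
print(f"p=0.72: {typical:.3f}")
```

Because p=0.5 is the hardest case, any other true pass rate gives coverage at least as good for the same sample size.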

## Addressing Specific Objections

- **"385 tests isn't enough."** Sample size is set by the precision of the statistical estimate, not by system complexity. The math is invariant across application domains. If 385 trials cannot establish stable performance, that variance is itself a finding. Higher precision is available by raising the confidence level or tightening the margin of error, either of which increases the required number of tests.
- **"Your questions aren't representative."** The category benchmark uses proportional stratified sampling based on published research and industry data on query distribution. The distribution is documented and public. Usage that differs significantly from category norms can warrant a further-customized evaluation.
- **"You tested on a bad day."** Testing runs over a distributed time window, reports confidence intervals rather than point estimates, and offers retest protocols. Highly variable day-to-day performance is itself the finding.
- **"Your grading is subjective."** Ground-truth evaluation gives each test a predetermined correct answer from the knowledge base. Cases needing judgment use rubrics documented in advance, with inter-rater reliability metrics available.
- **"You can't say we got 72%."** Results are reported as a point estimate with a confidence interval, for example "72% (95% CI: 67-77%)". A bare figure with no interval hides the uncertainty in the measurement.
- **"You under-weighted our specialty."** Category benchmarks assess aggregate performance against a representative workload, enabling reliable comparison across vendors. Differentiated capabilities are candidates for further evaluation.
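
Proportional stratified sampling, mentioned in the representativeness answer above, means each query category receives tests in proportion to its share of real traffic. The sketch below shows one way to do that split. The category names and percentages are invented for illustration and are not Swept's published distribution; `allocate` is a hypothetical helper that uses largest-remainder rounding so the counts always sum to the total.

```python
def allocate(total: int, shares: dict[str, int]) -> dict[str, int]:
    """Split `total` tests across categories in proportion to `shares`."""
    weight_sum = sum(shares.values())
    exact = {name: total * w / weight_sum for name, w in shares.items()}
    # Start from the floor of each exact (fractional) allocation.
    counts = {name: int(x) for name, x in exact.items()}
    # Hand the leftover tests to the largest fractional remainders.
    leftover = total - sum(counts.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical query mix, in percent of traffic (an assumption, not real data).
shares = {"account": 45, "billing": 25, "troubleshooting": 20, "policy": 10}
plan = allocate(385, shares)
for name, count in plan.items():
    print(f"{name:<16}{count}")
```

A vendor whose real traffic differs sharply from the category mix would substitute its own shares, which is the customized evaluation described above.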

## What Makes This Valid for Probabilistic Systems

AI systems are stochastic: answers can change from session to session, even for identical prompts. Measuring them requires a statistical framework. Weaker approaches fall short:

- **Vendor-reported accuracy**: selection bias, inconsistent methodology, unverifiable.
- **Small demo or POC**: n=10-20 gives a margin of error of roughly ±22-31% at 95% confidence, too wide to be statistically meaningful.
- **Anecdotal evaluation**: confirmation bias, recency bias, no systematic coverage.
- **Single-run testing**: ignores stochastic variation.
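
The gap between a small demo and a full evaluation can be made concrete by inverting the sample-size formula to get the margin of error for a given n. This is a minimal sketch under the same worst-case p=0.5 assumption; `margin_of_error` is an illustrative name.

```python
import math
from statistics import NormalDist


def margin_of_error(n: int, confidence: float = 0.95, p: float = 0.5) -> float:
    """Half-width of the normal-approximation interval: Z * sqrt(p(1-p)/n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)


for n in (10, 20, 385):
    print(f"n={n:>3}: ±{margin_of_error(n):.1%}")
# n= 10: ±31.0%
# n= 20: ±21.9%
# n=385: ±5.0%
```

With 10 tests, an observed 70% pass rate is compatible with a true rate anywhere from about 39% to 100%, which is why a short demo cannot separate a strong system from a weak one.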

This is the same statistical rigor used in clinical trials, manufacturing quality control, and election polling.

## Transparency and Reproducibility

Everything is documented and verifiable. For each evaluation Swept discloses:

- Exact sample size and the formula used to derive it.
- Confidence level and margin of error for all reported figures.
- Test category distribution and rationale.
- Grading rubrics and ground-truth sources.
- The time window over which testing occurred.
- The version or configuration of the system tested.

Any qualified third party can verify the statistical claims. Vendors may request raw data, with test content redacted if needed to protect benchmark integrity.

## Methodology Improvements

- **Test-retest reliability (5.1)**: a random 10% subset of tests is run twice per system, and correlation is reported to confirm results are stable.
- **Inter-rater reliability and AI judge calibration (5.2)**: expert humans grade 200 sampled cases against documented rubrics (target human agreement κ ≥ 0.85). Each AI judge grades the same 200 cases; Cohen's kappa is computed against human consensus, and only judges reaching κ ≥ 0.80 qualify for production. Three judges from different model families are used to avoid correlated errors. Unanimous (3/3) verdicts are accepted automatically, majority (2/3) are accepted with an audit flag, and no-majority cases escalate to human review.
- **Stratified confidence intervals (5.3)**: per-category results are reported with appropriately wider intervals alongside the aggregate.
- **Pre-registration (5.4)**: methodology and category distribution are published before testing begins, similar to clinical trial pre-registration.
- **Vendor challenge period (5.5)**: vendors review raw results and flag cases they believe are flawed, adjudicated by an independent reviewer.
- **Bootstrap confidence intervals (5.6)**: a non-parametric cross-check on the parametric intervals.
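
The judge calibration and voting rules in 5.2 can be sketched in a few lines. This is an illustration under stated assumptions, not Swept's implementation: the function names are hypothetical, labels are simple pass/fail booleans, and the `"abstain"` verdict is an assumed way for a no-majority case to arise with three judges.

```python
from collections import Counter
from typing import Optional


def cohen_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two raters labelling the same cases pass/fail."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    # Agreement expected by chance from each rater's own pass rate.
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1:  # both raters gave one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def aggregate(verdicts: list[str]) -> tuple[Optional[str], str]:
    """Combine three judge verdicts ("pass", "fail", or assumed "abstain")."""
    label, votes = Counter(verdicts).most_common(1)[0]
    if label in ("pass", "fail"):
        if votes == 3:
            return label, "accepted"
        if votes == 2:
            return label, "accepted_with_audit_flag"
    return None, "human_review"


# Toy data: ten cases instead of the 200 used for real calibration.
human = [True, True, True, False, False, True, True, False, True, True]
judge = [True, True, True, False, True, True, True, False, True, True]
kappa = cohen_kappa(human, judge)
print(f"kappa={kappa:.2f}, qualifies={kappa >= 0.80}")

print(aggregate(["pass", "pass", "pass"]))
print(aggregate(["pass", "fail", "pass"]))
print(aggregate(["pass", "fail", "abstain"]))
```

In the toy data the judge agrees with the human on 9 of 10 cases, yet kappa falls below 0.80 because much of that agreement is expected by chance. That is the reason to calibrate on kappa rather than raw agreement.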

## Bootstrap Confidence Intervals

This non-parametric approach validates conclusions independently of distributional assumptions. The method:

1. Resample 385 results with replacement from the observed test results.
2. Calculate accuracy on the resampled dataset.
3. Repeat 10,000 times.
4. Take the 2.5th and 97.5th percentiles as the 95% bootstrap CI.
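
The four steps above can be sketched with the standard library alone. The data here is synthetic (277 passes out of 385, about 72%), and `bootstrap_ci` with its `seed` parameter is an illustrative assumption added for reproducibility, not part of the published procedure.

```python
import random


def bootstrap_ci(results, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for accuracy over 0/1 test results."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(results)
    # Steps 1-3: resample with replacement, score each resample, repeat.
    accuracies = sorted(
        sum(rng.choices(results, k=n)) / n for _ in range(n_resamples)
    )
    # Step 4: read off the lower and upper percentiles.
    alpha = 1 - confidence
    lo = accuracies[round(alpha / 2 * n_resamples)]
    hi = accuracies[round((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic results: 1 = pass, 0 = fail.
results = [1] * 277 + [0] * 108
low, high = bootstrap_ci(results)
print(f"accuracy={sum(results) / len(results):.1%}, 95% bootstrap CI: {low:.1%} to {high:.1%}")
```

Resampling individual tests treats them as exchangeable. If failures cluster by category, resampling within categories is one way to check whether the interval widens.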

Parametric intervals assume independent Bernoulli trials that share a single pass probability; bootstrap intervals derive uncertainty directly from the data. Agreement between the two confirms robustness, and divergence reveals hidden structure worth investigating (for example, clustered failures by category, or conservative assumptions overestimating variance).

The point estimate tells buyers how well a system performs on average, and the interval width tells them how much to trust that number. When bootstrap and parametric intervals agree, the estimate is stable and reliable.

## Summary

A 72% score with a ±5% margin of error at 95% confidence means there is 95% confidence the system's true accuracy on this representative workload sits between 67% and 77%. The conclusion rests on Bernoulli trials, the binomial distribution, the Central Limit Theorem, and standard confidence-interval construction. The methodology is conservative, transparent, and standardized.

## Questions About Our Methodology?

We welcome scrutiny and are happy to discuss any part of the approach in detail. [Contact us](/contact).