An examiner reviewing a personal lines underwriting model in Colorado does not want to know whether the model is fair. The examiner wants the disparate impact ratio for every protected class the SB21-169 implementing regulations cover, the test data composition that produced the ratio, the proxy variables identified and validated against external data, the remediation history for any ratio that fell outside the acceptable range, and the date and methodology of the most recent retest. The examiner wants the math, the lineage, and the timestamps.
The same exam in New York will scope to a different protected-class list. The same model in a bulletin-adopting state will trigger questions about the carrier's overall AI governance program, of which the bias testing is one element. Carriers that built their bias testing program around academic fairness frameworks or vendor-provided dashboards routinely fail to produce what an examiner actually requests, because the program was designed to satisfy a different audience.
This post is about the artifact set. Not what bias is, not why it matters, not the moral case for fairness. The methodology that produces the file an examiner will accept, the four tests that compose the minimum, the protected-class scoping by state, and the documentation discipline that distinguishes a defensible program from a paper one.
The Four-Test Minimum
A defensible bias testing program runs four distinct tests on every model in the carrier's inventory that contributes to a consumer-facing decision. Each test answers a different question. None substitutes for the others. The NIST AI Risk Management Framework treats this multi-metric approach as a baseline for systems that make consequential decisions, and examiners increasingly expect carriers to articulate why each metric was selected, not merely whether they ran "a fairness test."
Statistical parity. The simplest test, and the one examiners reach for first. Compute the rate at which the model produces favorable outcomes (acceptance, lower premium tier, claim approval) for each protected-class group. The disparate impact ratio compares the rate for the protected group to the rate for the reference group. The four-fifths rule (a ratio below 0.8 indicates potential disparate impact) remains the industry shorthand. It is also the threshold most state insurance departments apply when triaging exam findings. A ratio above 0.8 is not a clean bill of health, but a ratio below 0.8 will be flagged.
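As a sketch of the computation, assuming a scored-decisions table with a binary favorable-outcome column and a protected-class group label (the column names and toy data below are illustrative, not a prescribed schema):

```python
import pandas as pd

def disparate_impact_ratio(df, outcome_col, group_col, protected, reference):
    """Ratio of favorable-outcome rates for the protected group vs. the reference group."""
    rates = df.groupby(group_col)[outcome_col].mean()
    return rates[protected] / rates[reference]

# Toy scored decisions: 1 = favorable outcome (e.g., preferred premium tier).
decisions = pd.DataFrame({
    "favorable": [1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
    "group":     ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
})

ratio = disparate_impact_ratio(decisions, "favorable", "group", protected="A", reference="B")
print(f"disparate impact ratio: {ratio:.2f}")  # below 0.8 gets flagged under the four-fifths rule
```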
Equal opportunity. Statistical parity ignores ground truth. Equal opportunity asks whether the model's true positive rate is consistent across protected-class groups. For an underwriting model predicting loss propensity, the test asks: among the customers who actually do file claims, does the model identify them at the same rate across racial groups, age bands, or other protected categories? This is the test that reveals models that are accurate on one population and inaccurate on another, even when overall accuracy looks acceptable.
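A minimal sketch of the per-group true positive rate check, again on an illustrative table with hypothetical column names:

```python
import pandas as pd

def true_positive_rate_by_group(df, label_col, pred_col, group_col):
    """Among actual positives (customers who did file claims),
    the fraction the model identifies, per protected-class group."""
    positives = df[df[label_col] == 1]
    return positives.groupby(group_col)[pred_col].mean()

scored = pd.DataFrame({
    "filed_claim": [1, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    "model_flag":  [1, 0, 1, 0, 1, 1, 1, 1, 1, 1],
    "group":       ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
})

tpr = true_positive_rate_by_group(scored, "filed_claim", "model_flag", "group")
print(tpr)                    # per-group true positive rates
print(tpr.max() - tpr.min())  # the gap an examiner will ask about
```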
Calibration. A model is calibrated when its predicted probabilities match observed frequencies. A model that says a customer has a 20% chance of filing a claim should produce filings on 20% of similar customers, regardless of protected-class membership. Calibration matters because regulators are increasingly skeptical of models that produce systematically lower or higher predicted probabilities for particular groups, even when the headline accuracy looks fine. A miscalibrated model has predictable drift in pricing or eligibility decisions over time, which is an examination finding waiting to happen.
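One way to sketch the per-group calibration check is to bin predicted probabilities and compare each bin's mean prediction to the observed claim rate within each group; the synthetic data and column names below are placeholders:

```python
import numpy as np
import pandas as pd

def calibration_by_group(df, prob_col, label_col, group_col, n_bins=5):
    """Mean predicted probability vs. observed event rate,
    per probability bin and per protected-class group."""
    binned = df.assign(bucket=pd.cut(df[prob_col], bins=np.linspace(0, 1, n_bins + 1)))
    return (binned.groupby([group_col, "bucket"], observed=True)
                  .agg(mean_predicted=(prob_col, "mean"),
                       observed_rate=(label_col, "mean"),
                       n=(label_col, "size")))

rng = np.random.default_rng(0)
probs = rng.uniform(0.05, 0.6, size=500)
data = pd.DataFrame({
    "pred_claim_prob": probs,
    "filed_claim": rng.binomial(1, probs),
    "group": rng.choice(["A", "B"], size=500),
})
print(calibration_by_group(data, "pred_claim_prob", "filed_claim", "group"))
```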
Counterfactual flip. The most expensive test to run and the one that produces the most useful evidence. For a sample of decisions, modify the protected-class indicator (or the most-likely proxy variables) and re-score the model. The test measures how often the model's decision changes when only the protected-class signal changes. A high flip rate indicates the model is using protected-class information directly or through tightly correlated proxies. Counterfactual testing is what surfaces the proxy problem before an examiner does.
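A minimal sketch of the flip-rate measurement, using a deliberately simplified stand-in scorer rather than a real production model; the territory_factor proxy and the neutral value it is set to are hypothetical:

```python
import pandas as pd

class StandInScorer:
    """Stand-in for a production model: approves when combined risk is low,
    but leans on a territory factor that acts as a ZIP-derived proxy."""
    def predict(self, df):
        return ((df["risk_score"] + df["territory_factor"]) < 1.0).astype(int)

def counterfactual_flip_rate(model, df, column, neutral_value):
    """Re-score with only one column changed; return how often the decision flips."""
    original = model.predict(df)
    flipped = model.predict(df.assign(**{column: neutral_value}))
    return (original != flipped).mean()

sample = pd.DataFrame({
    "risk_score":       [0.3, 0.6, 0.8, 0.4, 0.7],
    "territory_factor": [0.5, 0.5, 0.1, 0.7, 0.2],
})
# Neutralize the suspected proxy and measure how many decisions change.
print(counterfactual_flip_rate(StandInScorer(), sample, "territory_factor", 0.3))
```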
Which Protected Classes Apply
Protected-class scope is the area where in-house bias testing programs most often fail. A program that tests for race and gender uniformly across a national book misses the state-specific obligations and underweights the categories examiners will actually scope to. Three regimes set the floor.
The Colorado SB21-169 framework covers race, color, national or ethnic origin, religion, sex, sexual orientation, disability, gender identity, and gender expression for the lines covered by the regulations (currently auto and life, with property lines under active rulemaking). Carriers writing in Colorado need ratio results for every applicable category, with proxy validation showing why surrogates for these characteristics are or are not present in the model.
New York's Department of Financial Services Circular Letter 7 (2024) on the use of artificial intelligence systems and external consumer data adds expectations around documented testing for race, color, creed, national origin, status as a victim of domestic violence, past lawful travel, and sexual orientation, with periodic re-testing required. The cadence is not specified in the letter, but exam practice suggests annual at minimum, with material model changes triggering retest.
The NAIC bulletin-adopting states (a growing list that now exceeds half the country) generally inherit a default expectation that the carrier's testing program covers the protected classes in that state's unfair trade practices act, which varies. Carriers operating across multiple bulletin states need a matrix that maps each model in the inventory to each state's protected-class list, with test results produced for the union of categories. This is a documentation problem more than a statistical one. It is also the failure mode that surfaces most commonly in pilot evaluations.
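One way to keep that matrix executable rather than buried in a spreadsheet is a small mapping from state to protected-class list and from model to deployment states; the model IDs below are hypothetical and the class lists are illustrative shorthand maintained by compliance counsel, not legal advice:

```python
# Hypothetical mapping; the authoritative lists come from each state's statute,
# regulation, or bulletin and belong to compliance counsel, not engineering.
STATE_PROTECTED_CLASSES = {
    "CO": {"race", "color", "national_or_ethnic_origin", "religion", "sex",
           "sexual_orientation", "disability", "gender_identity", "gender_expression"},
    "NY": {"race", "color", "creed", "national_origin",
           "domestic_violence_victim_status", "past_lawful_travel", "sexual_orientation"},
}

MODEL_DEPLOYMENT_STATES = {
    "auto_underwriting_v4": ["CO", "NY"],   # hypothetical model IDs
    "homeowners_tiering_v2": ["NY"],
}

def required_test_scope(model_id):
    """Union of protected classes across every state where the model is deployed."""
    scope = set()
    for state in MODEL_DEPLOYMENT_STATES[model_id]:
        scope |= STATE_PROTECTED_CLASSES[state]
    return sorted(scope)

print(required_test_scope("auto_underwriting_v4"))
```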
Proxy Validation Is the Hard Part
Modern underwriting and claims models do not include race or gender directly as features. The fairness testing surface is therefore not the input list but the proxy network. ZIP code, income tier, occupation, vehicle type, education, credit-based variables, and behavioral signals can each correlate with protected-class membership at varying strengths. A model that excludes protected-class inputs but heavily weights variables that correlate at 0.7 with race produces effectively the same disparate impact as a model with race included.
Proxy validation requires the carrier to compute, for each input variable, the correlation with each protected-class indicator using a reference dataset. Census ACS data, the carrier's own customer records (where the carrier has collected protected-class data with consent for compliance reporting), or licensed third-party demographic data are the typical reference sources. The validation produces a proxy strength score per variable per protected class. Variables above a threshold (correlations above 0.3 are commonly flagged, above 0.5 are typically scrutinized) require either removal, transformation, or documented justification.
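A rough sketch of the per-variable scoring, assuming a reference dataset where both the model inputs and the protected-class indicators are available; the thresholds, column names, and synthetic data are illustrative:

```python
import numpy as np
import pandas as pd

PROXY_FLAG_THRESHOLD = 0.3       # flagged for review
PROXY_SCRUTINY_THRESHOLD = 0.5   # removal, transformation, or documented justification

def proxy_strength_report(features, protected):
    """Absolute correlation of every model input against every protected-class
    indicator, computed on a reference dataset where both are available."""
    rows = []
    for feature in features.columns:
        for pc in protected.columns:
            corr = abs(features[feature].corr(protected[pc]))
            status = ("scrutinize" if corr >= PROXY_SCRUTINY_THRESHOLD
                      else "flag" if corr >= PROXY_FLAG_THRESHOLD
                      else "clear")
            rows.append({"feature": feature, "protected_class": pc,
                         "abs_correlation": round(corr, 3), "status": status})
    return pd.DataFrame(rows)

rng = np.random.default_rng(1)
protected = pd.DataFrame({"race_indicator": rng.binomial(1, 0.4, 500)})
features = pd.DataFrame({
    "zip_income_tier": protected["race_indicator"] * 2 + rng.normal(0, 1, 500),
    "vehicle_age": rng.normal(8, 3, 500),
})
print(proxy_strength_report(features, protected))
```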
The documentation matters more than the threshold. An examiner reading a carrier's proxy validation file wants to see that every input variable in the production model was tested, the correlation was measured, and the carrier made a defended decision about retention. A model that retains a high-correlation proxy with a documented business justification (the variable carries unique predictive signal beyond the protected-class correlation, supported by ablation testing) is defensible. A model that retains a high-correlation proxy because no one looked is not.
The Artifact Set Examiners Will Request
When an exam request arrives for a model in production, the carrier typically has two to four weeks to produce the file. The artifacts an experienced examiner will name explicitly include the following.
The current production model card, with version history. The test methodology document specifying the four tests (or the carrier's defended alternative set), the test data composition, the protected-class scope, and the cadence. The most recent test results per protected class per metric, with ratios computed and any threshold violations flagged. The proxy validation file showing correlation analysis for every input variable. The remediation log documenting every flagged finding, the response (model retraining, variable removal, threshold adjustment, accepted residual risk), the responsible owner, and the closure date. The control comparison documentation, where the model's outcomes were compared against a defined baseline (often the prior model version or a deterministic rule set) to demonstrate the model is not introducing bias relative to the comparator.
The most overlooked artifact is the remediation log. A carrier that produces beautiful test results with no remediation history reads to an experienced examiner as either a model that has never been honestly tested or a program that has never acted on its own findings. The log of "we found this, we did this, here is when, here is why" is what distinguishes a working program from a documentary one.
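As a sketch of what one remediation record might capture (every field name and value below is hypothetical):

```python
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class RemediationRecord:
    """One closure-tracked finding: what was found, what was done, by whom, and when."""
    model_id: str
    model_version: str
    test: str              # e.g., "statistical_parity"
    protected_class: str
    finding: str           # the metric value and the threshold it violated
    action: str            # retrain, variable removal, threshold adjustment, accepted risk
    owner: str
    opened: date
    closed: date

record = RemediationRecord(
    model_id="auto_underwriting_v4",   # hypothetical values throughout
    model_version="4.2.1",
    test="statistical_parity",
    protected_class="race",
    finding="disparate impact ratio 0.74 against the 0.80 threshold",
    action="removed zip_income_tier, retrained, retest ratio 0.87",
    owner="model_risk_team",
    opened=date(2025, 3, 4),
    closed=date(2025, 4, 18),
)
print(asdict(record))
```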
For carriers building this systematically, Swept AI's evaluation infrastructure produces the artifact set as a byproduct of operating supervision. Tests run on a defined cadence per model per state. Proxy validation refreshes against external reference data. Remediation actions are logged against the test that triggered them. Examiner requests pull a complete file from a single index, rather than reconstructing it under deadline.
Pairing With Explainability
Bias testing answers the population-level question: does the model produce systematically different outcomes for protected groups? Explainability answers the decision-level question: why did this specific applicant receive this specific outcome? Examiners increasingly request both in the same exam, because a model that passes population-level fairness tests can still produce indefensible individual outcomes that the carrier cannot explain.
The two artifacts are conjoined deliverables. A carrier that has invested in explainability infrastructure for underwriting decisions but not in bias testing will be asked to produce the testing under deadline. A carrier with rigorous bias testing but no decision-level explanation will fail the consumer-complaint thread of the exam, because individual adverse decisions need a documented rationale that survives an applicant's challenge.
What "Defensible" Looks Like
A defensible bias testing program is recognizable by three properties. The methodology is documented in a written specification that predates the test results, which means the carrier is not selecting metrics post-hoc to make the model look acceptable. The protected-class scope matches the union of obligations across the carrier's operating states, mapped per model in the inventory. The remediation history is real, which means findings actually closed the way the documentation says they did, and the model versions in production today are traceable to the testing that signed them off.
Examiners do not want philosophical discussions of fairness. They want to see that the carrier defined a methodology, applied it consistently, found things, fixed them, and documented the work. The four-test minimum is the methodological spine. The state-specific protected-class scoping is the legal surface. The artifact set, queryable and versioned, is what survives the exam. Build the program for the artifact set, and the rest follows.
