The NAIC AI Systems Evaluation Tool pilot closes in September. The Fall Meeting in November is where the working group will present pilot findings and recommendations, including the draft third-party data and models framework that has been advancing since spring. Carriers that have been waiting to see what the regulators ultimately decide before committing to governance investment have approximately ninety days, between July 1 and the September pilot close, to ship work that will shape what those regulators decide.
The window is short. It is also the most leveraged ninety days an insurance CIO will spend on AI governance this year. Pilot participants that can demonstrate working artifacts during the pilot's evaluation phase get cited in the working group's findings. Carriers that cannot demonstrate them get cited differently. The November adoption decision will reference both.
This post is the playbook. Five concrete artifacts to ship by September 30. The org chart that owns each. And a frank look at the budget reality, because Q3 spending on governance trades directly against Q4 deployment velocity, and that is the right trade. Carriers that ship into Q4 will be shipping into a regulatory tool whose findings have already been written.
Why the Q3 Window Matters
The pilot is not symbolic. Foley's readout of pilot mechanics describes a structured request that asks carriers to demonstrate, with documentation, the AI governance practices the bulletin requires. The Pilot Project Summary circulated by the working group identifies the categories the evaluation focuses on: governance program design, model inventory, third-party oversight, fairness testing, and ongoing monitoring.
Insurance industry trade groups have pushed back, as InsuranceNewsNet documented, arguing the tool risks formalizing examination practice before carriers have built the underlying infrastructure. Regulators counter that the tool is voluntary in pilot phase and that the November decision will determine its production form. Substantive participation in the pilot is the only available channel for a carrier to influence that form before findings are written.
The five Q3 deliverables below are the artifacts the pilot evaluates, ordered by dependency. Ship them in sequence and the carrier ends Q3 with a defensible posture for whatever the working group adopts; ship them in Q4 and the posture is reactive against findings the carrier had no input into.
Deliverable 1: Model Inventory
The inventory is the foundational artifact because every other deliverable references it, and the one carriers most often have only in fragments. A complete model inventory built to examiner specification lists every production AI system, the business purpose, the in-scope state(s), the responsible owner, the deployment date, the vendor (if third party), the supervisory cadence, and the location of the supporting documentation file.
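In sketch form, one entry is a flat record keyed by a stable system identifier. The Python below is illustrative, not an examiner-mandated schema; every field name is an assumption mapped from the list above.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative inventory entry; field names mirror the list above
# and are assumptions, not a mandated schema.
@dataclass
class ModelInventoryEntry:
    system_id: str               # stable key every later deliverable references
    business_purpose: str        # e.g. "claims fraud scoring"
    in_scope_states: list[str]   # e.g. ["CO", "CT"]
    responsible_owner: str       # a named individual, not a team alias
    deployment_date: date
    vendor: str | None           # None for in-house models
    supervisory_cadence: str     # e.g. "quarterly"
    documentation_location: str  # path or URL of the supporting file
```

The stable system_id is the design decision that matters: the audit trail, bias results, and drift dashboards all key off it.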
Owners: a single accountable executive (typically the Chief Risk Officer or Chief Data Officer), with model owners identified for each line of business. The inventory is maintained by a small operations team, but the categorization decisions (is this model in scope, what business purpose is recorded, what cadence is assigned) require business-line ownership.
Time and effort: six to eight weeks for a mid-sized carrier with one to two thousand candidate systems to evaluate. The discovery phase carries most of the work, surfacing models nobody has classified as AI: customer service routing models, agent-side recommendation tools, claims fraud scoring imported from acquisitions five years ago. Plan for discovery to surface 20 to 40% more in-scope systems than the initial estimate.
Ship first because every other deliverable references inventory entries.
Deliverable 2: Audit Trail Specification
The audit trail is the evidentiary spine of the governance program. The bulletin requires carriers to produce, on examiner request, the inputs, decisions, and supporting reasoning for AI-influenced consumer outcomes; most carriers can produce some of this in some systems but not all of it consistently across the inventory. A documented audit trail specification for compliance AI defines the data captured per decision (input features, model version, output, confidence indicators, timestamp, downstream action), the retention period (typically seven years to align with regulatory record-retention rules), the storage location, and the query mechanism.
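In sketch form, the per-decision capture is a small append-only record. The function below is a minimal illustration assuming JSON records written to a write-once store; the store interface and field names are assumptions, not the bulletin's schema.

```python
import json
from datetime import datetime, timezone

# Illustrative per-decision audit record covering the capture list in
# the specification. The `store` interface is a hypothetical stand-in.
def write_audit_record(store, system_id: str, model_version: str,
                       input_features: dict, output: str,
                       confidence: float, downstream_action: str) -> None:
    record = {
        "system_id": system_id,          # ties the record to its inventory entry
        "model_version": model_version,  # the exact version that produced the output
        "input_features": input_features,
        "output": output,
        "confidence": confidence,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "downstream_action": downstream_action,  # e.g. "claim_routed_to_review"
    }
    store.append(json.dumps(record))     # retained seven years per the spec
```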
Owners: the CIO and the Chief Compliance Officer co-own the specification. The model owners are accountable for implementing the specification within their systems. Engineering teams are accountable for the storage and query infrastructure.
Time and effort: the specification document is two to four weeks of policy work. Implementation across the inventory is a multi-quarter program; the Q3 deliverable is the specification plus a tracking dashboard showing each inventory entry's implementation status. Carriers that attempt to ship full implementation by September will burn the budget without producing the artifact regulators are asking for.
Ship second because the bias testing program needs the audit trail to assemble counterfactual analyses and individual decision explanations.
Deliverable 3: Vendor Contract Amendments
The bulletin places the diligence obligation on the carrier; the pending vendor registry will not relieve it. The Q3 deliverable is a documented program, not a completed retrofit of every vendor contract by September. The program identifies every AI vendor in the inventory, maps the renewal calendar, and folds the eight clauses every contract needs into the standard amendment package. By September 30, the carrier should be able to produce a list showing each vendor, the renewal date, the clauses already in place, the clauses targeted for the next renewal, and the responsible owner.
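The tracking list itself is simple structured data. A sketch, assuming clauses are tracked by short identifiers; every name here is illustrative.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative retrofit-tracking record for one vendor. Clause
# identifiers are placeholders for the standard amendment package.
@dataclass
class VendorRetrofitStatus:
    vendor: str
    renewal_date: date
    clauses_in_place: set[str]   # e.g. {"audit_rights", "fairness_test_sharing"}
    clauses_targeted: set[str]   # queued for the next renewal
    responsible_owner: str

    def gap(self) -> set[str]:
        # Clauses still outstanding: the agenda for the next negotiation.
        return self.clauses_targeted - self.clauses_in_place
```

Sorted by renewal_date, the list doubles as the negotiation calendar.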
Owners: General Counsel's office leads the contract work. Procurement owns the vendor calendar. The CIO ensures the AI inventory and the vendor calendar reconcile. Vendor management leads the renewal negotiations.
Time and effort: the program documentation is two weeks. Active retrofits run as a tail through Q4 and into 2027 based on renewal cadence. Carriers that try to renegotiate every contract simultaneously will produce no completed retrofits and exhaust legal budget.
Ship third because vendor responses to the contract clauses inform what the carrier needs to monitor in-house. A vendor that refuses fairness-test sharing feeds additional scope into the bias testing program.
Deliverable 4: Bias Testing Results
A defensible bias testing methodology built for market conduct exams runs the four-test minimum (statistical parity, equal opportunity, calibration, counterfactual flip) on every in-scope model, across the protected-class scope for each operating state, with proxy validation, remediation history, and control comparisons. The Q3 ship is the first complete pass: methodology document, test results for every consumer-facing model in the inventory, proxy validation file, and a remediation log capturing every flagged finding and the response.
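Two of the four minimum tests, in sketch form, on 0/1 decision and label lists split by group. These helpers are illustrative, not a certified methodology; a real pass runs them per protected class per operating state.

```python
# Illustrative helpers for two of the four minimum tests. Inputs are
# 0/1 decision and label lists split by group; not a certified method.
def selection_rate(decisions: list[int]) -> float:
    return sum(decisions) / len(decisions) if decisions else 0.0

def statistical_parity_ratio(group_a: list[int], group_b: list[int]) -> float:
    # Four-fifths framing: ratio of the lower selection rate to the higher.
    low, high = sorted([selection_rate(group_a), selection_rate(group_b)])
    return low / high if high else 1.0

def equal_opportunity_gap(dec_a: list[int], lab_a: list[int],
                          dec_b: list[int], lab_b: list[int]) -> float:
    # Gap in true-positive rates: how often qualified applicants in each
    # group receive the favorable decision.
    def tpr(dec: list[int], lab: list[int]) -> float:
        positives = sum(lab)
        hits = sum(d for d, l in zip(dec, lab) if l)
        return hits / positives if positives else 0.0
    return abs(tpr(dec_a, lab_a) - tpr(dec_b, lab_b))
```

The counterfactual flip test needs the per-decision records from Deliverable 2, which is why the audit trail specification shipped earlier in the sequence.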
Owners: the Chief Risk Officer or a dedicated AI fairness officer owns the methodology and the program. Model owners are accountable for executing tests on their systems. A central data science team typically operates the testing infrastructure. Compliance reviews and signs off on results before they enter the regulatory file.
Time and effort: ten to fourteen weeks for the first complete pass, depending on inventory size. Reference data acquisition for proxy validation and remediation cycles on the first round of flagged findings are the long poles. Plan for findings on 15 to 30% of consumer-facing models.
Ship fourth because this is the most evidence-rich artifact a carrier can produce, and the one pilot evaluators will reference most explicitly. A complete pass before the pilot closes enters the carrier's evidence into the working group's record.
Deliverable 5: Drift Dashboards
Periodic bias testing produces snapshots; drift dashboards make supervision continuous. The Q3 ship is operational drift detection on every consumer-facing model in the inventory, with thresholds defined for material drift (typically 5% accuracy change, four-fifths rule violation on any monitored protected class, or distribution shift in a key input), a notification path to the model owner, and a remediation pathway connected to the bias testing log.
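In sketch form, the threshold check is a handful of comparisons. The code below mirrors the thresholds named above; the distribution-shift cutoff is left as a parameter because the text fixes only the accuracy and four-fifths thresholds, and all names are illustrative.

```python
# Illustrative check of the material-drift thresholds above. The
# shift cutoff is an assumed parameter; only the accuracy and
# four-fifths thresholds are fixed in the text.
def check_drift(baseline_accuracy: float, current_accuracy: float,
                parity_ratio: float, input_shift: float,
                shift_threshold: float) -> list[str]:
    findings = []
    if abs(current_accuracy - baseline_accuracy) > 0.05:  # 5% accuracy change
        findings.append("material accuracy drift")
    if parity_ratio < 0.8:                                # four-fifths rule violation
        findings.append("four-fifths rule violation")
    if input_shift > shift_threshold:                     # key-input distribution shift
        findings.append("input distribution shift")
    return findings  # a non-empty list triggers the model-owner notification
```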
Owners: the CIO owns the infrastructure, model owners respond to alerts, and the Chief Risk Officer owns the threshold definitions and remediation governance.
Time and effort: six to ten weeks if the audit trail specification is in place, since the dashboards consume audit trail data. Without it, dashboards fall back to vendor-supplied or model-side instrumentation that examiners discount.
Ship fifth because the dashboards close the loop between testing and operations. The pilot evaluation looks explicitly at ongoing monitoring as a category.
The Org Chart
Five deliverables, five distinct ownership maps, and one obvious risk: the work fragments across functions and stalls in coordination overhead. The carriers shipping all five by September 30 have a single executive accountable for the program, typically reporting to the CEO or the Chief Operating Officer, with dotted-line authority across the CIO, CRO, CCO, GC, and CDO functions. The role functions as a coordinator with budget authority and the executive air cover to force scheduling decisions across functions whose normal incentives do not align. It does not add a new layer of bureaucracy; it forces decisions that would otherwise stall.
Carriers that try to run the program through a steering committee without a single accountable executive will discover that no single function has the bandwidth to absorb the work, and the committee will meet every two weeks while the September deadline approaches. The pattern from carriers that have already shipped this kind of program: name the executive, give them the budget, and let the function heads negotiate scope through them.
The Budget Reality
Governance spending in Q3 trades against Q4 deployment velocity, and the trade is the right one. Carriers that compress governance work into Q4 to preserve Q3 deployment ship into a regulatory environment where findings have already been written; carriers that shift Q4 deployment milestones to fund Q3 governance ship into an environment where their evidence shaped the findings.
A reasonable benchmark for a mid-sized carrier with one to two thousand candidate systems is $4 to $8 million in incremental Q3 spend across people, infrastructure, and outside counsel. The infrastructure component (audit trail, drift dashboards, evaluation tooling) is the largest line and the most reusable. Carriers that invest in evaluation infrastructure capable of producing all five deliverables from a unified system reduce both Q3 spend and ongoing operating cost; treating each deliverable as a separate point solution doubles the cost and produces a more fragile artifact set.
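What "unified system" means in practice can be as modest as one governance record per inventory entry that every deliverable reads and writes. A sketch, with hypothetical field names:

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative spine of a unified system: one record per inventory
# entry, keyed by the same system_id. Every field name is hypothetical.
@dataclass
class GovernanceRecord:
    system_id: str                      # the inventory key
    audit_trail_status: str             # "specified" | "implementing" | "done"
    vendor_clause_gaps: set[str] = field(default_factory=set)
    last_bias_pass: date | None = None  # date of the last complete four-test run
    open_drift_findings: list[str] = field(default_factory=list)
```

A point-solution approach scatters these fields across five silos; the unified record is what lets one query answer an examiner's question across all five artifacts.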
Whether or not the carrier participates in the pilot, the artifacts are required by the bulletin and by state market conduct exam practice. The choice is whether to produce them in time to influence the November decision, or to produce them afterward, complying with whatever framework emerges.
What Q4 Looks Like Either Way
Whether the November Fall Meeting adopts the third-party framework as drafted, modifies it, or defers it, Q4 will involve operationalizing the Q3 artifacts: remediation cycles on bias findings, vendor contract retrofits as renewals come up, audit trail implementation across systems where the Q3 specification covered the design but not the build, and drift dashboard tuning as thresholds collide with operational reality.
Carriers that complete Q3 are positioned to operationalize. Carriers that do not are positioned to start. The executive-branch push for federal-state alignment makes state-level evidence even more important: the federal posture has not stabilized, while state market conduct exams continue on the bulletin's authority regardless.
The AI Governance hub collects supporting material across these deliverables: vendor diligence, inventory specification, audit trail design, bias methodology, and ongoing supervision.
The evidence available in November is the evidence shipped between July and September. Carriers that treat the September pilot close as a hard deadline end Q3 defensible against any framework the working group adopts; the rest spend Q4 reading findings written without their input.
