The NAIC Pilot Closes in September. What Your Mutual Needs on File.

The NAIC's new examination tool for AI is being tested in twelve states right now, and the pilot closes in September. After that, the Big Data and Artificial Intelligence Working Group plans to bring an updated version to the Fall 2026 National Meeting for adoption. For a mutual insurer, the calendar is the part to read closely.

The pilot states are not a fringe group. They include California, Colorado, Connecticut, Florida, Iowa, Louisiana, Maryland, Pennsylvania, Rhode Island, Vermont, Virginia, and Wisconsin, and the reviews fold into the market conduct and financial examinations carriers already undergo. A mutual licensed in any of them is inside the test population now, and a mutual that writes only in the other thirty-eight states has roughly one National Meeting cycle before the standardized version reaches its regulators too.

From guidance to a standard set of questions

The AI Systems Evaluation Tool does something the Model Bulletin alone could not. The bulletin, adopted in about two dozen states and the District of Columbia, told insurers to maintain a written program governing how they develop, buy, and use AI, but it only described the expectation. The evaluation tool gives examiners a repeatable way to check it: the same questions asked the same way across every participating state.

At the Spring 2026 meeting, regulators went a step further and floated a way to triage AI systems by risk, a four-tier scale running from unacceptable down through high, medium, and low. The intent is to point the most scrutiny at the systems that can do the most harm, which in insurance means the models touching underwriting, pricing, claims, and anything near protected data. That classification carries real obligations: it decides how much documentation, testing, and oversight each system has to carry, so a carrier that has not sorted its models into tiers cannot show an examiner it is spending its governance effort where the risk actually sits.

For a mutual running a two-person compliance function, the standardization removes any hope of staying below examiners' notice. The questions a national carrier gets are the questions the mutual gets. The difference between them is which one already has the answers written down.

The tool is a waypoint, not the destination. The same working group has signaled interest in a formal model law on AI, and a separate model law on third-party data and models is anticipated later in 2026, possibly carrying licensing requirements for the vendors that sell models into the industry. The evaluation tool is how examiners read the current expectation, and the direction of travel is toward requirements with statutory weight. That is why building the full capability now, rather than just the minimum a single exam demands, is the position that ages well.

What an examiner will expect on file

The tool probes for evidence that a governance program exists and operates, not a binder that asserts one does. A mutual should be able to produce the following without a scramble:

  • A model inventory that includes the vendor models. Examiners want a single list of every AI system in use, and the licensed models count. "We bought it" is not an exemption from documenting it.
  • A risk rating for each system. Map every model to the four-tier scale so the high-risk ones, the underwriting and claims models, carry the documentation their tier demands.
  • Testing records, bias and fairness included. The record should show the model was evaluated for accuracy and disparate impact before deployment and on a schedule after, against thresholds the carrier set in advance.
  • Change approvals and version history. Every material change to a model should show who approved it and what testing preceded the release.
  • A trail from data to decision. For a contested outcome, the carrier should be able to reconstruct which inputs produced which recommendation, and what a human did with it.
  • Proof of oversight where it counts. On the high-risk paths, the human-in-the-loop and escalation steps need evidence they fire in practice, not just a policy that says they should.
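One way to picture the list above is as a single inventory record per model, with a check that flags what the file cannot yet show. The sketch below is illustrative only: the field names, the `file_gaps` helper, and the sample models and vendor are hypothetical, and the checks are a simplification of the bullets above rather than the evaluation tool's actual questions.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List, Optional


class RiskTier(Enum):
    # Tier names follow the four-tier scale described in this article.
    UNACCEPTABLE = "unacceptable"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass
class ModelRecord:
    # All field names here are assumptions made for illustration.
    name: str
    vendor: Optional[str]                 # None means built in-house; vendor models are listed too
    use_case: str
    tier: Optional[RiskTier] = None
    last_bias_test: Optional[date] = None
    change_approvals: List[str] = field(default_factory=list)  # initial release counts as a change
    human_review_evidence: bool = False


def file_gaps(record: ModelRecord) -> List[str]:
    """Return the items this record cannot yet show an examiner."""
    gaps = []
    if record.tier is None:
        gaps.append("no risk tier assigned")
    if record.last_bias_test is None:
        gaps.append("no bias or fairness test on record")
    if not record.change_approvals:
        gaps.append("no change approvals recorded")
    if record.tier is RiskTier.HIGH and not record.human_review_evidence:
        gaps.append("no evidence of human oversight on a high-risk path")
    return gaps


inventory = [
    ModelRecord(
        name="underwriting-score",        # hypothetical model name
        vendor="ExampleVendor",           # hypothetical vendor
        use_case="underwriting",
        tier=RiskTier.HIGH,
        last_bias_test=date(2026, 3, 1),
        change_approvals=["2026-03-05 release approved"],
        human_review_evidence=False,
    ),
    ModelRecord(name="mail-router", vendor=None, use_case="correspondence triage"),
]

for record in inventory:
    print(record.name, file_gaps(record))
```

Keeping the vendor model and the in-house model in the same structure is the point of the design: the gap check runs identically on both, so a purchased model cannot fall out of the file.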

A mutual that protects policyholder data inside a controlled environment answers a second set of questions in the same motion, the ones about where member information travels when a model processes it.

Build it as a byproduct, not a fire drill

The carriers that struggle with this are the ones assembling the file after the exam letter arrives. The work is real, and reconstructing months of model behavior from memory and scattered logs is the slowest, least convincing way to do it. The alternative is to let the system produce the record while it runs.

An examiner can usually tell the difference between a program that was lived and one that was assembled. Contemporaneous logs, dated approvals, and test results that line up with the deployment history read as a control that actually operated. A binder produced in the two weeks after the letter reads as exactly what it is.

That is what a governance layer is for. Starting before go-live, it logs every interaction, decision, and configuration. It holds a live inventory of every AI application in production, and it turns the whole picture into a board-ready Trust Report that maps each system to the program governing it. When the examiner asks for the validation history of a specific underwriting model, the answer is a query rather than a research project, and the artifacts the examiner asks for become an export.
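To make "a query rather than a research project" concrete, here is a minimal sketch that assumes governance events are written to a dated log table as they happen. The table name, columns, event types, and sample rows are all hypothetical, and an in-memory SQLite database stands in for whatever store a real governance layer would use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE governance_events (
        event_date TEXT NOT NULL,   -- ISO date, so text ordering is chronological
        model_name TEXT NOT NULL,
        event_type TEXT NOT NULL,   -- e.g. 'validation', 'approval'
        detail     TEXT NOT NULL
    )
    """
)

# Illustrative rows only; a real log would be written by the system as it runs.
events = [
    ("2026-01-10", "underwriting-score", "validation", "pre-deployment accuracy and disparate impact test"),
    ("2026-01-15", "underwriting-score", "approval", "release approved"),
    ("2026-04-10", "underwriting-score", "validation", "scheduled retest"),
    ("2026-02-01", "claims-triage", "validation", "pre-deployment test"),
]
conn.executemany("INSERT INTO governance_events VALUES (?, ?, ?, ?)", events)


def validation_history(conn, model_name):
    """Return dated validation events for one model, oldest first."""
    rows = conn.execute(
        "SELECT event_date, detail FROM governance_events "
        "WHERE model_name = ? AND event_type = 'validation' "
        "ORDER BY event_date",
        (model_name,),
    )
    return rows.fetchall()


for event_date, detail in validation_history(conn, "underwriting-score"):
    print(event_date, detail)
```

The value comes from the rows being written at the time of each event. The query is trivial; it only works because the dated record already exists when the examiner asks.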

A mutual is well built to answer these questions, with short decision chains and a board close to the work, but only when the answers are already on file. September is a date on the mutual's own calendar, not a problem reserved for the national carriers. The carriers that treat it that way will meet the examiner with the file already built, and turn a compliance review into a demonstration of the standard they hold.
