Your logs are not an audit trail. They are discovery exhibits.
That distinction came into sharp regulatory focus in 2025 when a federal judge in the Lokken v. UnitedHealthcare litigation ruled that the plaintiffs were entitled to discovery into the insurer's use of AI in claim denials. The ruling, summarized in Hunton's analysis of the discovery decision, opened a new procedural reality: when an AI system contributes to a coverage decision and that decision is later disputed, the plaintiffs can compel production of the records that document how the system reached its conclusion. The implications of that ruling for bad-faith claim defense are detailed in our companion piece on the Lokken ruling and the new bad-faith discovery landscape.
The carriers that get subpoenaed first will discover what they have. Most will discover what they do not.
Application logs that capture a request and a response are not an audit trail. They lack the version pinning required to reconstruct what model produced the output, the integrity guarantees that prove the record has not been altered, the linkage to human review that demonstrates the carrier's claims-handling process, and the queryability required to respond to a discovery request without manually reading hundreds of thousands of log lines.
The same gap shows up in the NAIC AI Systems Evaluation Tool pilot. When the pilot questionnaire asks how the carrier reconstructs a specific automated decision, the answer "we have application logs" is not sufficient. The expected answer is a queryable audit trail with documented integrity controls, retention aligned with state statutes of limitations, and a defined process for producing decision histories on demand.
This piece specifies what that audit trail contains, what retention it requires, and the structural difference between a log and an audit trail that determines whether a carrier wins or loses a discovery dispute.
The Minimum Field Set
Every consequential AI decision (any AI output that contributes to a coverage determination, claim disposition, pricing decision, customer-facing recommendation, or fraud referral) must produce a record with the following fields. The set below reflects the NIST AI Risk Management Framework's MEASURE function, the NAIC pilot's documentation expectations, and what plaintiffs' counsel are now requesting in discovery under Lokken-style theories.
Timestamp. A timezone-aware timestamp recorded by the system at decision time, not at log ingestion. Precision to the millisecond. Carriers that record timestamps at the application server but not at the model inference layer cannot prove the order of operations when multiple models contribute to a single decision.
Model ID and version hash. The model identifier from the carrier's model inventory, with a version hash that ties to the specific artifacts (training code, weights, hyperparameters, evaluation results) used to produce this prediction. Without a version hash, the carrier cannot reproduce the decision if the model has been retrained between the decision and the discovery request.
Input hash. A cryptographic hash of the model inputs at decision time, with the underlying input data preserved separately under access controls. The hash provides integrity verification: if the input data is later challenged, the hash proves what the model actually saw at decision time. The separate preservation provides the explainability material the carrier needs to walk an examiner or a court through the decision.
Output. The model's output value, with confidence or probability estimates where applicable. For classification models, the predicted class and the probability of each class. For regression models, the point estimate and any uncertainty bands the model produces. For LLM-based systems, the full generated output along with any retrieved context.
Confidence. A normalized confidence indicator that downstream systems and human reviewers can interpret consistently. Different model architectures produce different native uncertainty estimates, and the audit trail must translate them into a comparable scale.
Reviewer. The identity of the human reviewer who validated, modified, or accepted the model output, with the timestamp of that review action. For decisions that proceed without human review, an explicit indicator that the decision was fully automated, with the rule that authorized that automation.
Override flag. A boolean indicating whether the human reviewer overrode the model's recommendation, with a structured reason code where overrides occurred. Override patterns are one of the strongest signals of model drift and one of the first things examiners ask about. Without a structured override flag, the carrier cannot answer the question.
Downstream action. The action the carrier took as a consequence of the decision: claim approved, claim denied, claim referred to special investigations, quote issued at a specific premium, customer routed to a specific service tier. The audit trail closes the loop between the AI output and the customer-facing outcome. Without that closure, neither the carrier nor the regulator can evaluate whether the decision was consistent with policy and law.
The eight-field set is the minimum. Carriers operating in jurisdictions with specific AI disclosure or explainability requirements will need additional fields to satisfy local rules. The set above is what every carrier needs regardless of jurisdiction.
Retention Pinned to Bad-Faith SOL
The most common mistake in audit trail design is choosing a retention period before consulting state law on the statute of limitations for bad-faith claims. Carriers default to one or two years because that is what their existing log retention happens to be. Then a Lokken-style discovery request arrives for a claim decision made four years ago, and the carrier cannot produce the audit trail because the retention window expired.
State bad-faith statutes of limitations vary widely. Some states allow bad-faith claims for as long as four years after the disputed claim event, and certain claim types extend further under tolling provisions for minors, undiscovered injuries, or continuing breaches. The audit trail retention period must be set to the longest applicable SOL across all states the carrier writes business in, plus a margin for tolling and litigation hold scenarios. For most national carriers, that comes to seven years.
Retention is not the same as accessibility. Audit trail records that are technically retained but stored on cold archives that take weeks to retrieve do not satisfy a discovery request with a thirty-day production deadline. The retention specification must include accessibility tiers: hot storage for records less than ninety days old, warm storage for records up to two years old, cold storage for records up to the SOL ceiling, with documented retrieval times for each tier and the operational capacity to meet discovery deadlines.
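The tiering described above reduces to an age-to-tier lookup. The boundaries and the seven-year ceiling below are illustrative assumptions; a real policy must be derived from the carrier's own SOL analysis and litigation-hold procedures.

```python
from datetime import timedelta

# Illustrative tier boundaries matching the text: hot under 90 days,
# warm under two years, cold up to an assumed 7-year SOL ceiling.
TIERS = [
    (timedelta(days=90), "hot"),        # queryable in seconds
    (timedelta(days=730), "warm"),      # retrievable in minutes
    (timedelta(days=7 * 365), "cold"),  # restorable in hours, not weeks
]

def storage_tier(record_age: timedelta) -> str:
    for ceiling, tier in TIERS:
        if record_age < ceiling:
            return tier
    # Past the ceiling is a purge candidate, but never an automatic purge:
    # litigation holds must be checked first.
    raise ValueError("record past retention ceiling; check litigation holds")

print(storage_tier(timedelta(days=45)))    # hot
print(storage_tier(timedelta(days=400)))   # warm
print(storage_tier(timedelta(days=1500)))  # cold
```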
A carrier that cannot produce a complete audit trail for a four-year-old claim decision because the records expired is in a worse defensive position than a carrier that produces an unflattering audit trail. The first invites adverse-inference instructions. The second invites a fact-based defense.
Log vs. Audit Trail: The Three Differences
The terms "log" and "audit trail" are used interchangeably in casual conversation. They are different artifacts in regulatory and litigation contexts, and three structural differences separate them.
Replayability. An audit trail allows the carrier to reconstruct the exact decision the model produced, given the inputs and version recorded. That requires preserving the inputs, the model artifacts, and enough environmental metadata to reproduce the inference. A log captures what happened. An audit trail captures what happened in enough detail to prove what would happen again under the same conditions. The Lokken ruling pushes plaintiffs' counsel toward replayability requests: "produce the model and the inputs, and we will run the inference ourselves." Carriers without replayable records concede that ground before the deposition.
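A replay check can be sketched as three steps: verify the preserved inputs against the recorded hash, re-run the pinned model version, and compare outputs. The `triage_model` stub and record layout below are hypothetical stand-ins for the carrier's model registry and preserved-input store.

```python
import hashlib
import json

def canonical_hash(inputs: dict) -> str:
    blob = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def replay_matches(record: dict, model_fn, preserved_inputs: dict) -> bool:
    # 1. Prove the preserved inputs are what the model actually saw.
    if canonical_hash(preserved_inputs) != record["input_hash"]:
        raise ValueError("preserved inputs do not match recorded input hash")
    # 2. Re-run the pinned model version on those inputs.
    replayed_output = model_fn(preserved_inputs)
    # 3. Compare against the contemporaneous record.
    return replayed_output == record["output"]

# Stub standing in for the pinned model artifact from the registry.
def triage_model(inputs: dict) -> str:
    return "deny" if inputs["score"] < 0.3 else "approve"

inputs = {"claim_id": "C-1042", "score": 0.72}
record = {"input_hash": canonical_hash(inputs), "output": "approve"}
print(replay_matches(record, triage_model, inputs))  # True
```

A failed hash check in step 1 is itself evidence: it means the preserved inputs were altered after the decision, which is exactly the challenge opposing counsel will raise.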
Integrity. An audit trail includes integrity controls that prove the record has not been altered between the decision and the moment it is produced in litigation or examination. Cryptographic hashing of records, write-once storage, segregation of duties between the systems that produce records and the systems that retain them, and independent verification of the integrity controls themselves. A log without integrity controls invites challenges to its evidentiary weight. A log with integrity controls is an audit trail.
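One common integrity mechanism is hash chaining, where each record's hash covers both its own content and the previous record's hash, so altering any entry breaks every later link. This is a minimal sketch of the idea, not a substitute for write-once storage or segregation of duties.

```python
import hashlib
import json

GENESIS = "0" * 64

def append_record(chain: list, payload: dict) -> dict:
    """Append-only ledger entry whose hash covers the previous entry's hash."""
    prev = chain[-1]["entry_hash"] if chain else GENESIS
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    entry_hash = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
    entry = {"payload": payload, "prev_hash": prev, "entry_hash": entry_hash}
    chain.append(entry)
    return entry

def verify_chain(chain: list) -> bool:
    """Recompute every link; any alteration anywhere breaks verification."""
    prev = GENESIS
    for entry in chain:
        body = json.dumps(entry["payload"], sort_keys=True, separators=(",", ":"))
        expected = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        if entry["prev_hash"] != prev or entry["entry_hash"] != expected:
            return False
        prev = entry["entry_hash"]
    return True

chain = []
append_record(chain, {"claim": "C-1", "action": "claim_approved"})
append_record(chain, {"claim": "C-2", "action": "claim_denied"})
print(verify_chain(chain))                          # True
chain[0]["payload"]["action"] = "claim_denied"      # tamper with history
print(verify_chain(chain))                          # False
```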
Query interface. An audit trail can be queried by claim ID, by model ID, by reviewer, by date range, by override status, by downstream action, and by combinations of these dimensions. A discovery request asks for "all AI-influenced denials of claims involving disputed coverage in 2024 where the model's recommendation was not overridden by the human reviewer." A log requires manual filtering across millions of lines. An audit trail returns the answer in minutes. The query interface is what makes the audit trail operationally useful, and the lack of one is what turns a discovery response into a multi-week project that consumes legal and engineering capacity.
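The discovery request quoted above translates directly into an indexed query. The table layout and column names below are illustrative, using an in-memory SQLite store as a stand-in for the carrier's audit database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE decisions (
        claim_id TEXT, model_id TEXT, decided_at TEXT,
        downstream_action TEXT, override INTEGER, disputed_coverage INTEGER
    )
""")
conn.executemany(
    "INSERT INTO decisions VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("C-1", "claims-triage", "2024-03-02", "claim_denied", 0, 1),
        ("C-2", "claims-triage", "2024-05-11", "claim_denied", 1, 1),  # overridden
        ("C-3", "claims-triage", "2023-12-20", "claim_denied", 0, 1),  # wrong year
        ("C-4", "claims-triage", "2024-07-08", "claim_approved", 0, 0),
    ],
)

# "All AI-influenced denials of claims involving disputed coverage in 2024
# where the model's recommendation was not overridden," as one query.
rows = conn.execute("""
    SELECT claim_id FROM decisions
    WHERE downstream_action = 'claim_denied'
      AND disputed_coverage = 1
      AND override = 0
      AND decided_at BETWEEN '2024-01-01' AND '2024-12-31'
""").fetchall()
print(rows)  # [('C-1',)]
```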
A system that produces records satisfying replayability, integrity, and queryability is an audit trail regardless of what the carrier calls it. A system that produces records satisfying none of the three is a log regardless of how detailed it is.
What the Audit Trail Is For
The audit trail is the single artifact that supports four different governance and litigation functions, and carriers that have built it correctly find that the same records serve all four.
For regulatory examinations, the audit trail is the substrate that allows the carrier to answer specific questions about decisions, model behavior, and operator overrides without manual reconstruction. For market conduct exams, it provides the universe from which the examiner samples cases. For internal model monitoring, it provides the ground truth against which model behavior is evaluated over time. For litigation defense, it provides the contemporaneous record that documents the carrier's decision-making process in a form that withstands evidentiary challenge.
Certification infrastructure generates the audit trail as a byproduct of model deployment rather than as a separate engineering project. Every model that ships into production is wired into the audit trail at deployment time, with the field set above populated automatically and integrity controls enforced at the storage layer. The retention tiers, query interface, and access controls are configured against the carrier's compliance and legal requirements rather than reverse-engineered after a discovery request arrives.
The carriers that win the next round of bad-faith litigation under Lokken-style discovery theories will not be the ones with the most defensible AI decisions. Some of those decisions will, on review, look bad. The carriers that win will be the ones who produced complete, replayable, integrity-protected audit trails on the first request, defended their decisions on the merits with contemporaneous evidence, and avoided the adverse-inference instructions that flow from incomplete records. The audit trail does not guarantee the right outcome. It guarantees the right argument.
A discovery exhibit produced by your adversary's expert is one document. A complete audit trail produced by your own system is another. The carrier chooses which one ends up in front of the jury by deciding, before any claim is disputed, what records the system will keep.
