A mutual insurer rolled out a generative AI tool that wrote summaries of customer-service calls. Agents liked it. It shaved time off every call, and the productivity gain was real. Then the summaries began flowing into the systems that make decisions, and no one had confirmed whether they were accurate, consistent, or free of personal information. The case appears in Grant Thornton's 2026 AI Impact survey, and it rewards a close look, because the mistake it captures is one of the most common in insurance AI right now.
The signal the rollout measured, and the ones it skipped
The deployment tracked productivity, and productivity improved. That is one signal, and it was the only one anyone checked. Three others went untested:
- Accuracy. Does the summary match what the caller actually said?
- Consistency. Do similar calls produce similar summaries, or does the model drift between thorough and careless?
- Privacy. Did a Social Security number or a health detail land in a summary that now gets stored and shared?
The rollout confirmed productivity and assumed the other three.
That assumption is expensive. In the same survey, 44 percent of insurers said governance or compliance gaps contributed to an AI project failing or underperforming, and only 7 percent believed their workforce was fully ready to adopt AI. The tools are arriving faster than the discipline to verify them, and the verification is the part that protects the policyholder.
A summary does not stay a summary
The reason this matters is propagation. A call summary feels like a convenience at the moment it is written, but it does not stay one. It becomes a note attached to a claim file. It becomes context an adjuster reads before deciding a payment. It becomes part of a complaint record, or evidence in a dispute, or an input to a model further down the line. An error introduced at the summary step does not stay at the summary step. It travels into every decision that reads the summary as fact.
For a mutual, the stakes attach to a specific person. The policyholder whose call was mis-summarized is an owner of the company, and the inaccurate note now lives in their record. If the summary carried personal information it should not have, the cooperative has spread data its members trusted it to hold. The institutions most exposed to this are the ones automating service to stretch a small team, which describes most mutuals.
Consider how that plays out. A caller reports water damage and mentions, in passing, that the leak started weeks earlier. The model compresses a ten-minute call into four sentences and drops the timeline. The adjuster who opens the claim reads a clean summary, sets a reserve, and moves on. The timing detail that might have changed the coverage analysis never surfaces, because the record everyone trusted had already smoothed it away. No one was careless. The summary was simply wrong in a way nobody had been assigned to catch.
Proving it before trusting it
The answer is not to abandon a genuine productivity gain. It is to prove the tool on the work before the work depends on it.
Start with the data the model will actually see. Evaluating reliability means building a baseline on your own calls, not a vendor's demo set, and setting thresholds for accuracy, hallucination, and consistency that a summary has to clear before any downstream system treats it as trustworthy. The point is to know how the model behaves on your members, your call types, and your edge cases, such as a caller who mumbles an account number or describes a loss in a dialect the model has never parsed. A useful baseline records more than a single score: how often the model reproduces the call faithfully, how often it invents a detail, and how those rates shift across claim types, so the carrier learns which calls it can hand the model and which still need a person.
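One way to picture such a baseline is the minimal Python sketch below. It assumes a reviewer has already compared each summary with its call and recorded two judgments; the record fields, the threshold values, and the routing labels are all hypothetical placeholders that a carrier would set for itself, not figures from the survey.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewedSummary:
    """One human-reviewed summary (hypothetical record shape)."""
    claim_type: str
    faithful: bool          # reviewer judged the summary true to the call
    invented_detail: bool   # reviewer found a detail the caller never said


# Illustrative thresholds only; real values are a governance decision.
MIN_FAITHFUL_RATE = 0.95
MAX_INVENTED_RATE = 0.01
MIN_SAMPLE = 3  # far too small for production, kept tiny for the example


def baseline_by_claim_type(reviews):
    """Return per-claim-type rates and whether the model cleared the bar."""
    groups = defaultdict(list)
    for review in reviews:
        groups[review.claim_type].append(review)

    report = {}
    for claim_type, items in groups.items():
        n = len(items)
        faithful_rate = sum(r.faithful for r in items) / n
        invented_rate = sum(r.invented_detail for r in items) / n
        cleared = (
            n >= MIN_SAMPLE
            and faithful_rate >= MIN_FAITHFUL_RATE
            and invented_rate <= MAX_INVENTED_RATE
        )
        report[claim_type] = {
            "n": n,
            "faithful_rate": faithful_rate,
            "invented_rate": invented_rate,
            # Claim types that miss any threshold stay with a person.
            "route": "model" if cleared else "human",
        }
    return report
```

Reporting per claim type rather than as one overall score is the point of the design: a strong average can hide a claim type where the model regularly drops or invents details.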
Verification does not end at launch. A model that summarizes cleanly in March can degrade as call patterns shift, products change, and seasonal claim spikes alter what people call about. Drift in a summarizer is easy to miss, because the output reads as fluently as it did on day one while growing less faithful to the call underneath. Only a continuous comparison against the source recording exposes the gap. Monitoring catches that drift before it reaches a member, and a human stays in the loop on the high-stakes paths, where a wrong summary feeding a claim or coverage decision does real harm.
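The continuous comparison described above could be tracked with something as simple as a rolling window. The sketch below is one assumption-laden way to do it: it presumes each sampled summary has already been checked against its source recording and marked faithful or not, and the class name, window size, and tolerance are hypothetical choices rather than a recommended configuration.

```python
from collections import deque


class DriftMonitor:
    """Flags when recent faithfulness falls below the launch baseline."""

    def __init__(self, baseline_rate, tolerance=0.05, window=50):
        self.baseline_rate = baseline_rate  # rate measured before launch
        self.tolerance = tolerance          # allowed drop before alerting
        self.recent = deque(maxlen=window)  # most recent checked summaries

    def record(self, faithful):
        """Add one checked summary; return True if drift should be flagged."""
        self.recent.append(1 if faithful else 0)
        # Stay quiet until the window is full, to avoid noisy early alerts.
        if len(self.recent) < self.recent.maxlen:
            return False
        current_rate = sum(self.recent) / len(self.recent)
        return current_rate < self.baseline_rate - self.tolerance
```

A carrier might run one monitor per claim type, so that a seasonal spike in one kind of call does not get averaged away by steady performance on the others.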
The privacy problem has its own answer. Private AI access keeps the call data inside an environment the mutual controls, redacts and masks personal information before it reaches a model, and ensures nothing the members said becomes training fuel for a third party. The redaction has to happen before the data reaches the model, not after the summary comes back, because a detail stripped from the output may already have been transmitted, logged, and used to generate the text. Containment is a property of where the processing happens, not a filter bolted onto the end. The summary the agents loved can keep saving time, without exporting a member's medical history to do it.
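The ordering matters enough to show in code. The following is a minimal sketch, assuming the call is available as a text transcript; the two regular expressions cover only US-style Social Security and phone number formats and stand in for a real PII detection step, and `summarize` is a hypothetical callable representing whatever model the mutual uses.

```python
import re

# Illustrative patterns only; production redaction needs far broader coverage
# (names, addresses, health details, account numbers, spoken digits).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_PATTERN = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")


def redact(transcript):
    """Mask personal identifiers in the transcript text."""
    text = SSN_PATTERN.sub("[SSN]", transcript)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text


def summarize_contained(transcript, summarize):
    """Redact first, then call the model.

    `summarize` is a hypothetical stand-in for the model call. It only ever
    receives the masked text, so the raw identifiers are never transmitted.
    """
    safe_text = redact(transcript)
    return summarize(safe_text)
```

Because the model is passed in as an argument, the wrapper is the only path to it in this sketch, which keeps the redaction step from being skipped by accident.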
What a proven rollout looks like
The tool in the survey did its job. The failure was trusting it before it had earned that trust. A mutual that runs the same deployment the other way, measuring accuracy, consistency, and PII safety on real calls first, monitoring for drift after, and keeping the data contained, gets the productivity without the exposure. We have seen carriers automate the majority of routine inquiries while holding customer-facing hallucinations at zero, because they treated "useful" as the beginning of the test rather than the end of it. Agents can keep liking the tool. This time, someone checked it first.