Underwriters spend 60 to 70 percent of their time extracting information from documents. Generative AI eliminates most of that work: reading submissions, pulling data points, summarizing loss histories, populating fields. The generative AI insurance market has reached $1.5 billion and is growing at a 38.9 percent compound annual growth rate. That growth reflects real value.
The problem is that the industry treats every gen AI use case as if it carries the same risk profile. A model that pulls a building's square footage from a submission PDF and a model whose claim summary shapes whether an adjuster approves or denies coverage both run on generative AI. They demand fundamentally different supervision. The distinction between them comes down to one question: does the output tolerate variability?
Outputs That Tolerate Variability
Document extraction tolerates variability in format and phrasing. Whether the model describes a building as "45,000 sq ft commercial office, built 2003" or "commercial office space, 45K SF, 2003 construction," both outputs serve the same purpose. The downstream human can interpret reasonable variations without consequence.
FNOL intake (standardizing incoming first-notice-of-loss notifications), customer communication drafting (renewal notices, status updates), and internal knowledge retrieval (helping underwriters find relevant policy language) all share this characteristic. A human reviews the output before it reaches a consequential decision. The AI accelerates information processing. The human retains decision authority.
These applications carry manageable risk. Accuracy monitoring, periodic quality audits, and batch-level sampling are sufficient. Deploy them with lightweight governance and capture the efficiency gains.
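For concreteness, here is a minimal sketch of what batch-level sampling might look like, assuming each extraction output is a record that points back to its source document. The field names and the 5 percent sampling rate are illustrative, not a standard.

```python
import random

def sample_for_audit(batch: list[dict], rate: float = 0.05,
                     seed: int | None = None) -> list[dict]:
    """Select a random subset of extraction outputs for human quality review."""
    rng = random.Random(seed)
    n = max(1, int(len(batch) * rate))  # always audit at least one item
    return rng.sample(batch, min(n, len(batch)))

# Example: queue 5 percent of a day's extraction outputs for audit.
daily_batch = [
    {"doc_id": i, "extracted": "45,000 sq ft commercial office, built 2003"}
    for i in range(400)
]
audit_queue = sample_for_audit(daily_batch, seed=42)
```

Human reviewers score the sampled items against the source documents; a drop in the sample's accuracy score is the trigger for a deeper audit.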
Outputs That Do Not
The risk profile changes when generative AI outputs directly shape consequential decisions. Here, variability is not a formatting difference. It is a liability.
Consider a claims summary. An adjuster receives 200 pages of medical records, repair estimates, and witness statements. Generative AI summarizes them into a two-page narrative. The adjuster reads the summary, not the source documents, and makes a coverage decision. The difference between "claimant reports water damage consistent with gradual pipe deterioration" and "claimant reports sudden pipe failure resulting in water damage" determines the outcome. Under most property policies, one falls under maintenance exclusions. The other triggers covered peril provisions. The model's choices about emphasis, omission, and characterization now drive the decision.
The same problem surfaces when underwriters adopt AI-drafted notes without independent review of source material, when chatbots make representations about coverage that create binding obligations under state law, and when AI drafts rate filings where subtle differences in how actuarial methodology is characterized can trigger regulatory enforcement.
In each case, variability in the AI's output changes outcomes for real people: policyholders denied coverage, applicants unfairly declined, regulators misled.
Two Predictable Failures
Most carriers govern generative AI as a single category. That produces two failures.
The first is over-governing low-risk applications. When document extraction requires the same approval process and monitoring intensity as coverage determination tools, deployment slows to the pace of the highest-risk use case. The efficiency gains from automated document processing erode under governance overhead designed for a different purpose.
The second, and more dangerous, is under-governing high-risk applications. A claims summary tool deployed with the same oversight as a PDF extraction tool lacks the supervision its consequences demand. No completeness checks against source materials. No measurement of whether downstream decisions diverge from what direct document review would produce. No detection of systematic bias in emphasis or omission.
The fix: match supervision intensity to output consequence. Applications that tolerate variability get lightweight monitoring. Applications where variability changes outcomes get continuous supervision, completeness checks against source materials, and systematic review of how downstream decisions correlate with AI-generated inputs.
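One way to make that line operational is to encode it as configuration, so every deployment declares its tier and inherits the matching controls. A minimal sketch follows; the tier and control names are assumptions for illustration, not an industry standard.

```python
from dataclasses import dataclass, field

@dataclass
class SupervisionTier:
    name: str
    controls: list[str] = field(default_factory=list)

# Lightweight governance for outputs that tolerate variability.
VARIABILITY_TOLERANT = SupervisionTier(
    name="variability-tolerant",
    controls=["accuracy monitoring", "periodic quality audits", "batch-level sampling"],
)

# Continuous supervision for outputs that shape consequential decisions.
CONSEQUENCE_SHAPING = SupervisionTier(
    name="consequence-shaping",
    controls=[
        "continuous output supervision",
        "completeness checks against source materials",
        "review of downstream decision divergence",
        "bias detection in emphasis and omission",
    ],
)

# Every deployed use case maps to exactly one tier.
USE_CASE_TIERS = {
    "document_extraction": VARIABILITY_TOLERANT,
    "fnol_intake": VARIABILITY_TOLERANT,
    "claims_summary": CONSEQUENCE_SHAPING,
    "coverage_chatbot": CONSEQUENCE_SHAPING,
}
```

The value of the config is not the code itself but the forcing function: no use case ships without someone answering the variability question and writing the answer down.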
Measuring Gen AI Outputs
Generative AI presents a measurement challenge that traditional model monitoring does not address. Deterministic models produce the same output for the same input. Generative models do not. Standard accuracy metrics designed for classification or regression miss the quality dimensions that matter.
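The variability itself is measurable, and measuring it is a useful first diagnostic before choosing use-case metrics. A minimal sketch, assuming `generate` is whatever callable wraps the model; the character-overlap ratio from Python's difflib is a deliberately crude proxy, and a production framework would use embedding-based or entailment-based similarity instead.

```python
import random
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def output_stability(generate, prompt: str, n: int = 5) -> float:
    """Mean pairwise similarity across n generations for the same prompt."""
    outputs = [generate(prompt) for _ in range(n)]
    pairs = combinations(outputs, 2)
    return mean(SequenceMatcher(None, a, b).ratio() for a, b in pairs)

# Example with a stand-in "model" that varies its phrasing.
def fake_model(prompt: str) -> str:
    return random.choice([
        "45,000 sq ft commercial office, built 2003",
        "commercial office space, 45K SF, 2003 construction",
    ])

print(output_stability(fake_model, "Describe the building."))
```

A low stability score on a consequence-shaping use case is an early warning that two adjusters reading two runs of the same summary could reach different decisions.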
For claims summaries, the relevant metrics are completeness (did the summary capture all material facts?), accuracy (do characterizations match the source?), and neutrality (did the summary introduce bias through emphasis or omission?). For customer communications, they are regulatory compliance, factual accuracy relative to policy language, and consistency across similar scenarios. A chatbot that explains the same coverage provision differently to different policyholders creates confusion and potential liability.
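A completeness check can start as simply as a material-facts checklist run against the summary. The sketch below uses substring matching as a crude stand-in; in practice, the judgment of whether a summary covers a fact would come from an entailment model or a second-model review. All names here are illustrative.

```python
def completeness_check(summary: str, material_facts: list[str]) -> dict:
    """Flag material facts from the source documents that the summary omits."""
    summary_lower = summary.lower()
    missing = [f for f in material_facts if f.lower() not in summary_lower]
    return {
        "coverage": 1 - len(missing) / len(material_facts) if material_facts else 1.0,
        "missing_facts": missing,
    }

summary_text = "Claimant reports water damage consistent with sudden pipe failure."
facts = ["water damage", "gradual pipe deterioration"]
print(completeness_check(summary_text, facts))
# {'coverage': 0.5, 'missing_facts': ['gradual pipe deterioration']}
```

Note what the example catches: the summary is not wrong about water damage, but it has dropped the characterization that determines whether the maintenance exclusion applies.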
These measurement capabilities require purpose-built evaluation frameworks, not off-the-shelf monitoring designed for traditional ML.
Drawing the Line
The $1.5 billion already invested in generative AI for insurance reflects real value: faster document processing, reduced manual labor, improved response times. Those gains are worth capturing.
The carriers that capture them sustainably will be those that draw a clear line between outputs that tolerate variability and those that do not. Lightweight governance on one side. Rigorous, continuous monitoring on the other.
Underwriters spending 60 to 70 percent of their time on document extraction is a solvable problem. An adjuster making a coverage decision based on an unsupervised AI summary is a liability exposure. The technology is the same. The governance infrastructure each one demands is not.
