Researchers at Princeton set the temperature to zero. They removed sampling randomness entirely. Then they ran the same tasks through 14 AI models, five times each.
The agents still could not give consistent answers.
Outcome consistency ranged from 30% to 75%. That means on identical tasks, with identical inputs, with every source of intentional randomness eliminated, the best models in the world produced different results anywhere from 25% to 70% of the time. The inconsistency comes from sources most teams never think about: floating-point non-determinism, batch-size variation, GPU kernel scheduling. The infrastructure itself introduces variance that no prompt can control.
This finding comes from "Towards a Science of AI Agent Reliability," a 66-page study by Rabanser, Kapoor, Kirgis, Liu, Utpala, and Narayanan at Princeton. It is the most rigorous examination of AI agent reliability published to date. The researchers tested models from OpenAI, Google, and Anthropic across two established benchmarks, decomposing reliability into four dimensions and twelve concrete metrics grounded in safety-critical engineering from aviation, nuclear power, and automotive systems.
The paper lands as the real-world consequences of unreliable agents continue to pile up. The researchers catalog specific incidents. Replit's AI coding assistant deleted an entire production database despite explicit instructions not to. OpenAI's Operator made an unauthorized $31.43 Instacart purchase, violating its own confirmation safeguards. New York City's municipal chatbot gave consistently illegal business advice, and when ten journalists asked it the same question, it gave different incorrect answers to each of them.
These are not hypothetical risks from a conference presentation. They happened. They happened to well-funded organizations deploying frontier models with safety guardrails in place. And as the Princeton study demonstrates, the gap between what these agents can do and how reliably they do it is not closing. It is, in some dimensions, getting wider.
At Swept AI, this is the problem we built the company to solve. We have been saying for over a year that demos do not predict production, that accuracy scores mask reliability failures, and that enterprises need infrastructure, not just benchmarks, to deploy AI agents safely. This paper provides the academic rigor behind what we hear in every enterprise conversation. The teams deploying AI agents already know they are not 100% reliable. That is not the question anymore. The question they bring to us — in on-site engagements, in evaluation workshops, in deployment readiness reviews — is about the spread. How wide is the variance? What does the distribution of reliability look like across runs, conditions, and edge cases? They need that data so they can make educated go/no-go decisions about whether a specific agent is reliable enough for a specific workflow. That is the work we do every day. Not proving the problem exists. Quantifying it precisely enough to act on. If your team is navigating the same question, we should talk.
What the Research Actually Found
The study introduces a reliability framework built on four dimensions: Consistency, Robustness, Predictability, and Safety. Each dimension contains specific, measurable metrics. Twelve in total.
Consistency measures whether an agent produces the same outcomes and follows similar behavioral patterns across repeated runs. The researchers found outcome consistency (C_out) ranging from 30% to 75% across all 14 models tested. This is not a comparison between models. This is the same model, given the same task, producing different outcomes on different runs. Resource consistency was similarly variable, with agents consuming wildly different amounts of time and compute steps for identical tasks.
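To make the idea concrete, here is a minimal sketch of an outcome-consistency check over repeated runs. The modal-agreement definition below is our own simplification for illustration, not necessarily the paper's exact formula for C_out:

```python
from collections import Counter

def outcome_consistency(outcomes):
    """Fraction of runs that agree with the modal outcome.

    A simplified stand-in for the paper's C_out metric; the
    study's precise definition may differ.
    """
    if not outcomes:
        raise ValueError("need at least one run")
    _, modal_count = Counter(outcomes).most_common(1)[0]
    return modal_count / len(outcomes)

# Five runs of the same task at temperature zero:
runs = ["refund_issued", "refund_issued", "refund_denied",
        "refund_issued", "refund_denied"]
print(outcome_consistency(runs))  # 0.6
```

Even this toy version surfaces the core point: a single run recording "refund_issued" would look like success, while five runs reveal the agent only agrees with itself 60% of the time.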
Robustness tests whether agents maintain performance when conditions change. This is where the findings become particularly revealing. The researchers tested three types of perturbation: fault injection (simulating server crashes and tool failures), environment changes (reordering JSON fields, altering date formats), and prompt paraphrasing (rephrasing the same instruction in different words).
Models handled genuine technical failures gracefully. When a tool crashed, agents recovered and tried alternative approaches. But prompt robustness told a different story. Rephrasing "cancel my subscription" as "please end my plan" caused measurable performance degradation. Models that could navigate server outages fell apart when a human worded the same request differently. Prompt robustness emerged as the key differentiator between models, and most models performed poorly on it.
Predictability examines whether agents can accurately assess their own likelihood of success. Here the results diverge in troubling ways. Calibration, the alignment between an agent's confidence and its actual correctness, has improved in recent models. Claude models showed notably stronger calibration scores. But discrimination, the ability to distinguish between tasks the agent will get right and tasks it will get wrong, has actually worsened on the GAIA benchmark. Models are becoming better at expressing appropriate confidence levels on average while becoming worse at flagging the specific cases where they will fail. That is a dangerous combination for production deployments.
Safety tracks compliance violations and harm severity. The researchers made a critical methodological choice here: safety metrics are reported separately from the aggregate reliability score. Averaging safety into an overall number would mask tail risks. A model could score well on consistency and robustness while still committing infrequent but catastrophic safety violations. Recent frontier models show markedly lower violation rates overall, but the violations that remain are severe: unauthorized data exposure, incorrect financial transactions, irreversible actions taken without confirmation. On TauBench, financial accuracy violations, specifically incorrect charges and refunds, emerged as the single most common failure mode.
One finding undermines a core assumption many teams carry into deployment: reliability does not scale uniformly with capability. Smaller models sometimes achieve equal or higher consistency scores than their larger counterparts. The researchers attribute this to behavioral repertoire. Larger models have more strategies available, which means more run-to-run variability in which strategy they select. A model that knows ten ways to solve a problem is less predictable than a model that knows three.
This is something we have tested across dozens of enterprise agent deployments. When our evaluation platform assesses an agent, we do not just measure whether it gets the right answer. We measure whether it gets the right answer the same way, repeatedly, under varied conditions. We map the spread across every dimension — consistency, robustness, predictability, safety — because that is what enterprise teams actually need to see. Not a single reliability score. The full distribution, broken down by dimension, so they can determine whether the variance is tight enough to trust the agent in a specific production context or too wide to deploy. The Princeton framework validates the methodology we have been running in production for our clients. See how our evaluation works.
Every One of These Failures Was Predictable
The paper includes a table that should be required reading for anyone deploying AI agents in production. Table 3 maps each real-world failure to the specific reliability metrics that would have caught it before deployment.
Start with the Replit incident. An AI coding assistant deleted an entire production database despite explicit instructions not to. Which metrics would have flagged the risk? Safety harm severity (S_harm), which tests whether agents take irreversible actions without appropriate safeguards. And prompt robustness (R_prompt), which tests whether agents maintain constraint adherence when instructions are paraphrased or presented in different contexts. If Replit had run their agent through perturbation testing, issuing the same "do not delete the database" instruction in twenty different phrasings, they would have discovered that constraint adherence breaks down under variation.
The NYC chatbot failure maps to two different metrics. Outcome consistency (C_out) would have caught the core problem: ten journalists asking the same question received different answers. A simple multi-run test on identical inputs would have revealed the inconsistency before a single citizen received bad advice. Calibration (P_cal) would have caught the secondary problem: the chatbot expressed high confidence while delivering incorrect legal guidance. It did not hedge. It did not flag uncertainty. It stated illegal advice as established fact.
OpenAI's Operator making an unauthorized purchase maps to compliance (S_comp), which tests whether agents respect authorization boundaries, and trajectory consistency (C_traj), which analyzes behavioral patterns for anomalies. An agent that suddenly deviates from its established behavioral pattern to execute a purchase it was not authorized to make is exhibiting a trajectory anomaly that monitoring would detect.
The pattern across all three incidents is the same. These were not unpredictable black swan events. They were not edge cases that no reasonable testing framework could anticipate. They were measurable reliability failures. The metrics existed. The testing methodology existed. Nobody applied them.
That is the gap the Princeton study is trying to close. Not a knowledge gap. An implementation gap. We know how to measure reliability. The frameworks exist in aviation, nuclear power, and medical devices. We have simply not adopted them for AI systems. And until we do, every deployment is an uncontrolled experiment running on production data with real users.
This is precisely why we built Swept AI's evaluation framework around multi-dimensional reliability testing rather than single-pass accuracy benchmarks. Every metric the Princeton team identifies (consistency, robustness, predictability, safety) maps directly to testing capabilities we run for enterprise clients before their agents reach production. The Replit incident, the NYC chatbot, the Operator purchase: our evaluation pipeline is designed to surface exactly these failure modes. Not after they happen. Before deployment.
The Paper Diagnoses the Problem. Here Is What It Does Not Cover.
The researchers propose a "reliability index" and call for reliability evaluation to "complement, not replace, careful deployment practices including human oversight, sandboxed testing, monitoring, and ongoing performance assessment."
We strongly agree. That sentence is doing enormous work, though. Those "deployment practices" need to be actual infrastructure, not aspirational bullet points on a governance slide deck.
Consider the analogy the paper itself draws: safety-critical engineering in aviation. Aviation does not just measure reliability. It has operational infrastructure built around reliability at every stage. Pre-flight checklists and type certification ensure aircraft meet reliability standards before they carry passengers. That is evaluation. Black box recorders, real-time telemetry, and air traffic control provide continuous supervision during operation. That is monitoring. Incident reporting systems, maintenance logs, and airworthiness directives create a continuous evidence trail that proves ongoing compliance. That is certification.
Enterprise AI needs all three layers. The Princeton study provides the measurement framework. That is necessary. It is not sufficient.
The enterprise teams we sit down with are past the awareness stage. They understand that AI agents are probabilistic. They are not looking for a paper to convince them of the risk. They are looking for a partner who can show them the spread of reliability for each agent, for each use case, under realistic conditions — and help them interpret it. When an operations lead can see that an agent's consistency is 72% on one workflow but 91% on another, they can make a concrete call: deploy here, hold there, add guardrails in this dimension. That is the conversation we have every week. The question is never "is it reliable?" It is "how reliable, how consistent, and what does the spread look like?" If you are at that stage and need a partner to quantify the answer, start here.
In our work with enterprise teams, we see the same pattern repeatedly. Organizations invest in evaluation, run a benchmark suite before deployment, and then treat the results as permanent. But reliability is not a static property. It shifts with every model update, every API change, every evolution in user behavior. A reliability score from January tells you nothing about performance in February. The paper acknowledges this implicitly by testing environment robustness, simulating the kind of changes that happen naturally in production. But it measures at a single point in time. Production demands continuous measurement.
The gap between academic reliability measurement and operational reliability management is where most enterprise AI deployments fail. Not because they lack awareness of the problem, but because they lack the infrastructure to address it at every stage of the deployment lifecycle: before launch, during operation, and when proving compliance to stakeholders.
Accuracy Testing Is Not Reliability Testing
The paper's methodology provides a blueprint that most enterprise evaluation suites should study carefully. The key insight: single-run accuracy testing tells you almost nothing about reliability.
The Princeton team ran each task K=5 times to measure consistency. They created J=5 paraphrases of each prompt to measure robustness. They injected faults at a rate of p_fault=0.2 to test recovery. They perturbed environment conditions to test adaptability. They elicited confidence scores to test calibration. Each of these testing dimensions reveals failure modes that standard accuracy benchmarks miss entirely.
Consider what multi-run testing catches. If you run a task once and get the right answer, you record 100% accuracy. If you run it five times and get the right answer three times, you still have the same underlying capability, but now you know your consistency is 60%. That distinction matters enormously in production. A customer service agent that resolves 90% of tickets correctly but gives different answers to the same question 40% of the time will erode trust faster than a less accurate agent that behaves predictably.
This is the insight that resonates most with enterprise teams we advise: seeing the spread is what converts a reliability question into a deployment decision. Sixty percent consistency on a customer service agent is not an abstract research finding. It is a concrete reason to hold deployment until you understand why the variance is that wide and whether it can be narrowed. We have helped teams make exactly this call — sometimes greenlighting agents that had lower accuracy but tighter consistency, and sometimes holding back high-accuracy agents whose spread was too wide for the use case. The spread is what makes the decision defensible.
Prompt perturbation testing catches a different class of failure. Your evaluation suite tests "Cancel my subscription." Your agent handles it correctly. In production, a customer writes "I want to stop being charged monthly." Same intent. Different phrasing. If you never tested paraphrases, you have no idea whether your agent handles the variation. The Princeton study shows that this is the single biggest differentiator between models, and the dimension where most models perform worst.
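A perturbation harness can be sketched in a few lines. The `agent` callable, the stub below, and the per-paraphrase scoring are all illustrative assumptions; a real harness would generate paraphrases automatically and run each one multiple times:

```python
def prompt_robustness(agent, paraphrases, expected, n_runs=3):
    """Score an agent across paraphrases of the same intent.

    `agent` is any callable mapping prompt -> outcome.
    Returns per-paraphrase success rates so the spread is visible,
    rather than a single averaged score.
    """
    rates = {}
    for p in paraphrases:
        successes = sum(agent(p) == expected for _ in range(n_runs))
        rates[p] = successes / n_runs
    return rates

# Stub agent that only recognizes one phrasing of the intent:
def stub_agent(prompt):
    return "cancelled" if "cancel" in prompt.lower() else "unknown"

paraphrases = ["Cancel my subscription",
               "Please end my plan",
               "I want to stop being charged monthly"]
print(prompt_robustness(stub_agent, paraphrases, "cancelled"))
```

The stub passes the exact test phrase and fails both paraphrases, which is precisely the failure mode a single-prompt evaluation suite never sees.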
Fault injection testing reveals whether agents degrade gracefully. In production, tools fail. APIs return errors. Databases time out. An agent that freezes, retries infinitely, or hallucinates a response when a tool call fails is dangerous. The paper's approach of injecting faults at a 20% rate during evaluation directly simulates production conditions. Most enterprise test suites run against pristine environments where nothing fails. That does not reflect reality.
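One simple way to simulate that 20% fault rate is to wrap each tool in a failure-injecting decorator. This wrapper is our own sketch of the technique, not the study's harness:

```python
import random

def with_fault_injection(tool, p_fault=0.2, rng=None):
    """Wrap a tool callable so it raises with probability p_fault.

    Mirrors the paper's p_fault=0.2 setting; an evaluation run
    then checks whether the agent recovers gracefully.
    """
    rng = rng or random.Random()
    def faulty(*args, **kwargs):
        if rng.random() < p_fault:
            raise TimeoutError("injected fault: tool unavailable")
        return tool(*args, **kwargs)
    return faulty

# Usage: wrap a lookup tool and count injected failures over 1000 calls.
lookup = with_fault_injection(lambda q: f"result:{q}",
                              p_fault=0.2, rng=random.Random(0))
failures = 0
for i in range(1000):
    try:
        lookup(i)
    except TimeoutError:
        failures += 1
print(failures)  # roughly 200 of 1000 calls fail
```

Injecting faults at the wrapper level means the agent under test sees realistic tool errors without any changes to the tools themselves.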
Confidence elicitation testing addresses the predictability gap. The paper's finding that discrimination has worsened is particularly alarming for enterprise deployments. An agent that says "I'm 90% confident" on every response, whether correct or incorrect, provides no useful signal for downstream decision-making. Testing whether confidence scores actually correlate with correctness should be a standard pre-deployment check. It rarely is.
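The calibration-versus-discrimination distinction is easy to demonstrate numerically. The two functions below are simplified illustrations (a mean-level calibration gap and a pairwise AUROC), not the paper's exact metrics:

```python
def calibration_gap(confidences, correct):
    """Absolute gap between average stated confidence and accuracy."""
    avg_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return abs(avg_conf - accuracy)

def discrimination(confidences, correct):
    """Probability a random correct answer received higher
    confidence than a random incorrect one (pairwise AUROC)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# An agent that says 0.9 on everything looks nearly calibrated
# when accuracy is 0.8, yet carries zero discrimination signal:
conf = [0.9, 0.9, 0.9, 0.9, 0.9]
ok = [True, True, True, True, False]
print(round(calibration_gap(conf, ok), 3))  # 0.1
print(discrimination(conf, ok))             # 0.5 (chance level)
```

A discrimination score of 0.5 means the confidence scores are no better than a coin flip at flagging the failures, exactly the pattern the paper reports worsening.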
The practical takeaway: if your evaluation pipeline runs each test once, under ideal conditions, with the original prompt phrasing, and without checking confidence calibration, you are measuring capability. You are not measuring reliability. The Princeton study demonstrates that these are fundamentally different properties, and conflating them is how organizations end up surprised by production failures.
This is the core of what we do at Swept AI. Our evaluation platform implements every testing dimension the Princeton team describes: multi-run consistency testing, automated prompt perturbation, fault injection, environment variation, and confidence calibration analysis. We built these capabilities because we watched enterprise teams deploy agents with passing accuracy scores that failed in production for exactly the reasons this paper now documents. The methodology the Princeton researchers propose as a new standard is the methodology we already run for our clients.
Reliability Degrades. You Need to Watch It.
The Princeton study measures reliability at a single point in time. In production, reliability changes continuously. Every dimension the paper measures is subject to drift.
The paper's environment robustness metric (R_env) is instructive. The researchers tested agents against reordered JSON fields, changed date formats, and altered tool interfaces. This directly models what happens when upstream APIs evolve, when third-party services update their response formats, when a vendor changes their model version. The paper measures this once. In production, these shifts happen weekly. Sometimes daily.
Model providers update without warning. A patch that improves average accuracy can simultaneously degrade consistency on specific task types. The Princeton finding that reliability does not scale with capability suggests that model updates designed to improve performance could actively harm reliability. You would never know without continuous monitoring.
Reliability drift is harder to detect than accuracy drift. If an agent's accuracy drops from 90% to 70%, the failure rate doubles and the signal is loud. If an agent's consistency drops from 75% to 50%, the agent still gets many tasks right on any given run. Individual failures look like one-off errors. The pattern only becomes visible when you monitor distributions over time, comparing behavioral fingerprints across periods. Without that infrastructure, you are flying blind in exactly the conditions the paper identifies as most dangerous.
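A minimal sketch of that kind of monitoring, assuming weekly consistency scores are already being collected (the function, threshold, and data are illustrative; production systems would compare full behavioral distributions, not point scores):

```python
def consistency_drift(weekly_scores, threshold=0.10):
    """Flag weeks where outcome consistency drops by more than
    `threshold` relative to the prior week.

    weekly_scores: list of (week_label, consistency) pairs.
    Returns (week, previous_score, current_score) for each alert.
    """
    alerts = []
    for prev, (week, score) in zip(weekly_scores, weekly_scores[1:]):
        if prev[1] - score > threshold:
            alerts.append((week, prev[1], score))
    return alerts

history = [("W1", 0.75), ("W2", 0.74), ("W3", 0.50), ("W4", 0.52)]
print(consistency_drift(history))  # [('W3', 0.74, 0.5)]
```

Note what the alert catches: accuracy in week three might still look acceptable on any single run, but the week-over-week consistency collapse is exactly the quiet degradation described above.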
The four dimensions from the paper (consistency, robustness, predictability, and safety) should not function solely as evaluation metrics. They should become ongoing monitoring dimensions. Track outcome consistency weekly. Run perturbation tests on a continuous schedule against production prompts. Monitor calibration drift as models update. Flag safety violations in real time with severity classification.
This is the operational layer the paper calls for but does not specify. The researchers note that reliability evaluation should complement "monitoring and ongoing performance assessment." Building that monitoring infrastructure, the equivalent of aviation's black box recorders and real-time telemetry, is what separates organizations that measure reliability once from organizations that maintain it continuously.
This is exactly what our supervision platform does. We track the same four dimensions the Princeton team measures, but continuously, in production, across every agent interaction. When consistency drops, when an agent starts responding differently to paraphrased prompts, when calibration drifts after a model update, the system flags it in real time. We have seen model provider updates degrade consistency by 15 or more percentage points overnight with no announcement. Without continuous monitoring mapped to these reliability dimensions, teams discover the degradation only when customers start complaining.
Proving Reliability to the People Who Decide Your Budget
The paper's third recommendation states: "Reliability metrics should inform deployment governance, analogous to safety-critical industries." This is not just a technical recommendation. It is a communication framework.
Today, when a CISO asks "how reliable is our AI agent?" the answer is typically a single accuracy number. Eighty-nine percent. That number tells the CISO almost nothing about the risks the organization actually faces. It does not distinguish between an agent that gets 89% right consistently and one that gets 95% right half the time and 60% right the other half. It does not capture whether the agent respects authorization boundaries. It does not indicate whether confidence scores are meaningful.
The four-dimension framework from the Princeton study gives risk committees, compliance officers, and executives a much more meaningful answer. Consistency: 72% outcome consistency, meaning the agent gives the same answer to the same question roughly three-quarters of the time. Robustness: 85% fault tolerance but only 58% prompt robustness, meaning the agent handles crashes well but struggles with paraphrased instructions. Predictability: 0.82 calibration but only 0.61 discrimination, meaning confidence scores are generally reasonable but the agent cannot reliably flag its own likely failures. Safety: zero high-severity violations in testing, with a compliance score of 94%.
That profile tells a decision-maker something actionable. It identifies where the risk concentrates and what mitigation is needed.
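A profile like that can be captured in a small scorecard structure that a risk committee reviews against its own floors. The field names and threshold values below are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass, asdict

@dataclass
class ReliabilityProfile:
    """Four-dimension scorecard; values mirror the example above."""
    outcome_consistency: float
    fault_tolerance: float
    prompt_robustness: float
    calibration: float
    discrimination: float
    compliance: float
    high_severity_violations: int

    def weakest_dimensions(self, floors):
        """Return every metric below the organization's threshold."""
        return {k: v for k, v in asdict(self).items()
                if k in floors and v < floors[k]}

profile = ReliabilityProfile(
    outcome_consistency=0.72, fault_tolerance=0.85,
    prompt_robustness=0.58, calibration=0.82,
    discrimination=0.61, compliance=0.94,
    high_severity_violations=0)

# Hypothetical organizational floors for this workflow:
floors = {"outcome_consistency": 0.80, "prompt_robustness": 0.75,
          "discrimination": 0.70}
print(profile.weakest_dimensions(floors))
```

The output names the three dimensions where risk concentrates, which is the shape of answer a CISO can act on: add guardrails here, hold deployment there.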
This is exactly the conversation we facilitate with enterprise teams on-site. They look at the spread across dimensions, identify the specific areas where variance exceeds their tolerance, and make a concrete call: deploy, deploy with guardrails on the weak dimension, or hold until the spread tightens. The reliability profile turns "we think it works" into "we know the distribution, and we accept the risk at this threshold." That is the difference between hope and governance. If your team needs to make that shift, our certification framework generates the continuous, versioned evidence to back it up.
It maps directly to the kind of evidence that regulators in healthcare, financial services, and insurance are beginning to demand.
Static scores, however, are not sufficient for ongoing governance. A reliability scorecard from three months ago does not reflect current performance. The evidence must be continuous, versioned, and auditable. Each model update, each policy change, each shift in usage patterns should trigger reassessment and generate fresh evidence. That is the difference between a compliance checkbox and genuine deployment governance.
This is the problem our certification framework addresses. We generate continuous, versioned reliability evidence across all four dimensions. When a risk committee asks for proof that an agent meets reliability standards, the answer is not a stale benchmark from last quarter. It is a living document that reflects current performance, tracks changes over time, and flags when any dimension falls below the thresholds the organization has defined. We built this because the enterprises we work with need to prove reliability to regulators, auditors, and boards, not just measure it once for an internal slide deck.
Five Things to Change This Week
The Princeton study is a landmark contribution. It gives the industry a shared vocabulary, a rigorous methodology, and hard evidence for something many practitioners have observed but could not prove: capability and reliability are different properties, and one is not keeping pace with the other.
At Swept AI, we have operationalized every one of the recommendations below. They are not theoretical. They are the foundation of how we help enterprises deploy AI agents that work reliably, not just impressively.
Here is what to do about it.
First, stop treating accuracy as your primary metric. Add consistency, robustness, and calibration to your evaluation pipeline. If you only measure whether your agent gets the right answer, you know nothing about whether it gets the right answer reliably, under varied conditions, with appropriate confidence.
Second, audit your confidence scores against actual correctness. The paper's finding that discrimination is worsening means agents are becoming less able to flag their own failures. If your agent expresses high confidence on incorrect responses, downstream systems and human operators cannot make informed decisions about when to trust it and when to escalate.
Third, add perturbation testing to your pre-deployment pipeline. Create paraphrased versions of your test prompts. Inject faults. Alter environment conditions. The paper used five paraphrases per prompt, a 20% fault injection rate, and multiple environment perturbations. Match or exceed that rigor. Single-run accuracy testing on pristine environments is security theater.
Fourth, monitor distributions, not just averages. An average accuracy of 88% could mean consistent performance at 88% or volatile performance swinging between 60% and 100%. Track the variance. Track consistency across runs. Track behavioral patterns over time. The signal lives in the distribution, not the mean.
Fifth, build a reliability scorecard mapped to the paper's four dimensions. Give your risk committee, your compliance team, and your executive sponsors a structured view of reliability that goes beyond a single number. Update it continuously.
The Princeton researchers set the temperature to zero and agents still could not produce consistent results. That finding should fundamentally change how organizations evaluate, deploy, and govern AI systems. The capability is real and improving. The reliability gap is also real, and it will not close on its own. It requires the kind of infrastructure we have spent years building: evaluation systems that map the full spread of reliability before deployment, supervision systems that monitor drift continuously in production, and certification frameworks that generate the ongoing evidence governance demands.
We have been in the rooms where enterprise teams wrestle with exactly this problem. We have helped them quantify the spread, interpret the variance, and make deployment decisions they can defend to regulators, boards, and customers. If your organization is deploying AI agents and needs that level of rigor, let's talk.
