May 15, 2026 at 7:25am ET

25% document corruption. Even from the best AI models.

Frontier AI models corrupted roughly a quarter of a document's content, on average, over a 20-step delegated workflow. Across all models tested, average degradation was 50%.

These were not weak or experimental systems. Microsoft Research tested Gemini, Claude, and GPT models across 52 professional domains including accounting, legal documents, and business reporting.

Here is the alarming part. Weaker models delete content, and you can see the deletion. Frontier models rewrite content, so the document still looks complete and polished. A human reviewer checking for formatting and readability may never notice that the underlying facts changed.

And the failure pattern is not gradual. Single catastrophic interactions, in which the model corrupts 10% or more of the document in one step, account for 80% of total degradation. A system can appear reliable for several interactions and then fail abruptly.
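To make that 80% figure concrete, here is a minimal sketch of how such a share could be computed from a per-step integrity log. The function name `catastrophic_share`, the 10-point threshold parameter, and the sample trace are all illustrative assumptions, not data or code from the study.

```python
def catastrophic_share(integrity, threshold=10):
    """Share of total degradation caused by single-step drops >= threshold.

    integrity: document integrity in percent after each step,
               starting with 100 for the untouched original.
    threshold: size of a single-step drop, in percentage points,
               that counts as catastrophic (assumed value).
    """
    # Drop at each step; recoveries are ignored rather than counted as negative loss.
    drops = [max(prev - cur, 0) for prev, cur in zip(integrity, integrity[1:])]
    total = sum(drops)
    if total == 0:
        return 0.0
    return sum(d for d in drops if d >= threshold) / total


# Hypothetical trace: five quiet steps and one 12-point collapse at step 4.
trace = [100, 99, 99, 98, 86, 85, 85]
share = catastrophic_share(trace)
print(f"{share:.0%} of the degradation came from catastrophic steps")
```

In this made-up trace the document loses 15 points in total, and 12 of them come from a single step. A workflow like this would look healthy at step 3 and already be badly damaged at step 4.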

Two operational takeaways for business leaders deploying AI agents on knowledge work:

First, you cannot defer human review to the end. By the time you review the final output, earlier corrupted steps have already contaminated everything that followed.
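One way to act on this is to check the document after every step and stop at the first suspicious change, before it can contaminate later steps. The sketch below is a deliberately simple illustration, assuming that no step in the workflow is supposed to change any number in the document. The names `numeric_facts` and `run_with_checkpoints` and the three stand-in steps are hypothetical. A real deployment would need richer checks, since rewritten facts are not always numeric.

```python
import re
from collections import Counter

# Matches integers, thousands-separated numbers, and decimals, e.g. 3, 1,200, 4.5
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def numeric_facts(text):
    """Count every number that appears in the text."""
    return Counter(NUMBER.findall(text))


def run_with_checkpoints(document, steps):
    """Apply steps in order; halt at the first one that alters the numbers.

    Returns (last_good_document, alert). alert is None if every step passed.
    Assumption: steps are expected to leave all numbers untouched.
    """
    for index, step in enumerate(steps, start=1):
        updated = step(document)
        before, after = numeric_facts(document), numeric_facts(updated)
        if before != after:
            alert = {
                "step": index,
                "lost": dict(before - after),
                "gained": dict(after - before),
            }
            return document, alert  # hand off to a human reviewer here
        document = updated
    return document, None


# Stand-ins for agent steps (hypothetical): a harmless reword, then a silent fact change.
reword = lambda d: d.replace("Revenue", "Total revenue")
corrupt = lambda d: d.replace("1,200", "1,250")
tidy = lambda d: d.strip()

result, alert = run_with_checkpoints("Revenue was 1,200 in Q3.", [reword, corrupt, tidy])
print(result)
print(alert)
```

The polished-looking rewrite at step 2 is caught because the check looks at the facts rather than the formatting, and the workflow returns the last version that passed.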

Second, the gap between “AI completed the task in the demo” and “AI reliably operates in a production workflow” is not small. This study just quantified it.

Figure: DELEGATE-52 benchmark table from Microsoft Research showing round-trip relay results for 19 LLMs across workflow lengths from 2 to 20 interactions. All models accumulate errors; even the strongest (Gemini 3.1 Pro) retains only ~81% document integrity after 20 steps, while weaker models drop below 20%.

Source: VentureBeat / Microsoft Research

