25% document corruption. Even from the best AI models.
Even the best frontier AI models corrupted roughly a quarter of a document during a 20-step delegated workflow. Across all models tested, average degradation was 50%.
These were not weak or experimental systems. Microsoft Research tested Gemini, Claude, and GPT models across 52 professional domains, including accounting, legal documents, and business reporting.
Here is the alarming part. Weaker models delete content, and you can see the deletion. Frontier models rewrite content. The document still looks complete and polished, so a human reviewer checking for formatting and readability may never notice that the underlying facts changed.
And the failure pattern is not gradual. 80% of total degradation comes from single interactions in which the model corrupts 10% or more of the document in one step. A system can appear reliable for several interactions and then fail catastrophically.
Two operational takeaways for business leaders deploying AI agents on knowledge work:
First, you cannot defer human review to the end. By the time you review the final output, earlier corrupted steps have already contaminated everything that followed.
Second, the gap between “AI completed the task in the demo” and “AI reliably operates in a production workflow” is not small. This study just quantified it.
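One way to act on the first takeaway is to check the document after every step rather than only at the end. The sketch below is a rough illustration, not the study's method: it assumes you keep a snapshot of the document after each step, and it uses word-level change between consecutive snapshots as a crude proxy for corruption. The function names and the 10% threshold are hypothetical choices, with the threshold borrowed from the single-step figure above.

```python
import difflib
from typing import List, Optional


def changed_fraction(before: str, after: str) -> float:
    """Approximate share of words that differ between two versions (0.0 to 1.0)."""
    matcher = difflib.SequenceMatcher(None, before.split(), after.split(), autojunk=False)
    return 1.0 - matcher.ratio()


def first_step_needing_review(versions: List[str], threshold: float = 0.10) -> Optional[int]:
    """Return the index of the first step whose output changed at least
    `threshold` of the document relative to the previous step, else None.

    versions[0] is the original document; versions[i] is the document after step i.
    """
    for step in range(1, len(versions)):
        if changed_fraction(versions[step - 1], versions[step]) >= threshold:
            return step
    return None


# Hypothetical snapshots: step 1 changes nothing, step 2 rewrites two words.
snapshots = [
    "a b c d e f g h i j",
    "a b c d e f g h i j",
    "a b c d X f g h Y j",
]
flagged = first_step_needing_review(snapshots)
```

A check like this only measures how much text changed, not whether the facts are still right. A large legitimate edit will trip it, and a small rewrite that alters a key figure will not, so it can route steps to a human reviewer but cannot replace one.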