The AI Agents Hype Machine: What the Evidence Actually Says

May 23, 2026

President, Zaruko

A 20-year AI builder's case against the autonomous-agent fairy tale

You have seen the posts on LinkedIn. The threads on X. The breathless YouTube videos with their case studies and their playbooks. The Forbes contributor pieces. The CNBC segments where a founder makes sweeping claims about AI agents replacing humans and the interviewer nods along. "Founder runs 13 companies. Zero employees. AI agents handle the real estate deals over email." Or some variant of it. Every week brings another claim across every channel about AI fully replacing humans in complex, multi-step business work. The posts go viral. The videos rack up views. The comments fill with applause. The next week brings another one.

Before I take this apart, let me be precise about what I am and am not arguing. I am not arguing that AI fails to deliver value. It does deliver value, and the evidence section later in this post lays out the controlled studies, surveys, and production reports that demonstrate it. I am not arguing that AI agents cannot help enterprises automate work. They can, and many companies are seeing real productivity gains from doing exactly that. I am arguing one specific thing: it is a myth that you can build a business operated by AI agents alone, with no human employees, at least given where the technology is today.

A bit about where I am coming from. I have spent the last 20 years building, growing, and operating AI and ML software companies. I co-founded an enterprise ML company that had a 10-year run and an eight-figure exit. I run an AI advisory practice today. I use every new AI tool the day it ships, because I am always looking for ways to add capability to my own work. I am a believer in this technology. I have built my career on it.

I am also a skeptic, because I am on the front lines with the tech every day. I see what it can do. I also see what it cannot do. And the gap between those two things is where most of the AI agents discourse now plays out.

The thesis of this post in one line: AI works when its outputs can be verified cheaply, either by humans or by software, and the agent hype collapses when verification is expensive, impossible, or comes too late. Everything that follows is the demonstration.

So let me say plainly: the claim that AI agents are autonomously running multi-company businesses, closing real estate deals, negotiating contracts, and operating without human oversight is not true. It is not happening at scale. It is barely happening at all. The evidence on this is increasingly strong, and most of it comes from sources nobody can dismiss as cranky skeptics. It comes from MIT. It comes from the AI labs themselves. It comes from controlled studies of working developers. It comes from the people building agents in production right now.

This post walks through that evidence and gives you a framework for reading the next viral claim you see.

The math that kills the claim

Start with multiplication. If an agent has a 5% error rate at each step in a workflow, what is its success rate after 20 steps? Assuming errors at each step are independent, the answer is 0.95 to the 20th power, which is about 36%. After 30 steps, it drops to 21%. After 50 steps, it is 8%. The independence assumption is a simplification. In practice errors can be positively correlated (an agent that misunderstood the goal early gets many subsequent steps wrong) or negatively correlated (a good early step puts the agent on the right path). Either way, the basic compounding effect is real and unavoidable.
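The arithmetic above can be reproduced in a few lines. This is a minimal sketch that assumes, as the paragraph does, that errors at each step are independent; the function name is illustrative, not taken from any library.

```python
def chain_success_rate(per_step_success: float, steps: int) -> float:
    """Probability that every step in a workflow succeeds.

    Assumes each step succeeds independently with the same probability.
    """
    return per_step_success ** steps


if __name__ == "__main__":
    # A 5% error rate per step means a 95% success rate per step.
    for steps in (20, 30, 50):
        rate = chain_success_rate(0.95, steps)
        print(f"{steps} steps: {rate:.0%} chance of an error-free run")
```

Changing the per-step success rate shows how sensitive long chains are: the curve falls off quickly for any value meaningfully below 1.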

This is not a controversial claim. This is arithmetic. And it is exactly the argument Yann LeCun, the Turing Award winner who led Meta's AI research for over a decade, has been making for years. LeCun announced his departure from Meta in November 2025. His new venture, AMI Labs, raised over $1 billion in seed funding in March 2026 to pursue a different architecture, consistent with his long-stated view that the current generation of autoregressive LLMs is unlikely to reach human-level intelligence through scaling alone. His core argument is that autoregressive models predict one token at a time based on previous tokens, and that this mathematical structure means errors compound across long sequences.1

A real estate transaction is not a 20-step workflow. It involves contract review, counter-offer drafting, contingency negotiation, escrow coordination, title work, inspection responses, and ongoing communication with counterparties who have their own lawyers and their own interests. A modest deal involves hundreds of decisions over weeks. The math does not support full automation here. The math is not even close.

This is the first thing to understand. Better engineering can reduce the impact of error compounding through verification, retries, decomposition, and guardrails. But it does not eliminate the underlying mathematics. As long as the system is autoregressive and non-deterministic, every additional step is another opportunity for error to enter the chain.
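To make "reduce but not eliminate" concrete, here is a rough sketch of what verification plus retries does to the same math. It rests on generous assumptions that are mine, not measurements: the verifier catches every failed attempt, retries are independent, and the 5% per-step error rate is carried over from the example above. Function names are illustrative.

```python
def step_success_with_retries(per_try_success: float, attempts: int) -> float:
    """Chance that at least one of `attempts` tries at a step succeeds.

    Assumes a perfect verifier (every failed try is detected and retried)
    and independent tries. Both assumptions are optimistic.
    """
    return 1 - (1 - per_try_success) ** attempts


def chain_success_rate(per_try_success: float, steps: int, attempts: int = 1) -> float:
    """Probability that an entire workflow of `steps` steps completes correctly."""
    return step_success_with_retries(per_try_success, attempts) ** steps


if __name__ == "__main__":
    for steps in (20, 500):
        for attempts in (1, 2):
            rate = chain_success_rate(0.95, steps, attempts)
            print(f"{steps} steps, {attempts} attempt(s) per step: {rate:.1%}")
```

Under these assumptions, one verified retry per step lifts a 20-step chain from roughly 36% to roughly 95%. Stretch the same setup to 500 steps and the success rate falls back below 30%. The engineering moves the curve; it does not flatten it, and a real verifier is never perfect.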

What the labs themselves say

Here is something the hype machine will not tell you: the AI labs building these systems are remarkably honest about what they cannot do. You just have to read past the marketing and into their engineering blogs.

Anthropic, the company that builds Claude, published a detailed engineering post about long-running agent harnesses. They state directly that even a frontier coding model running their Agent SDK in a loop "will fall short of building a production-quality web app if it's only given a high-level prompt." The agent tries to do too much at once, runs out of context mid-task, and leaves work half-finished. Anthropic uses this as the setup for an engineering solution they propose. The point worth taking is not that they have failed. It is that their engineering works around the underlying limitation rather than eliminating it.2

In a March 2026 paper, researchers tested frontier models on a 32-step simulated corporate network attack. At a 10 million token compute budget, the most recent model tested (Claude Opus 4.6, February 2026) completed an average of 9.8 steps. Its best single run, at a 100 million token budget, reached 22 of 32 steps. The paper also documents rapid progress: 18 months earlier, GPT-4o averaged 1.7 steps at the same 10 million token budget. The headline trend is that performance scales log-linearly with compute and improves generation over generation, with no observed plateau. The headline limitation is that full reliability across 32 adversarial steps remains out of reach as of February 2026. Two points worth holding at the same time.3

When the people building these systems describe what their systems actually do, the picture is consistent. Strong on bounded tasks. Strong on draft generation. Strong on code completion. Brittle on long-horizon autonomy. Useful with oversight. Unreliable without it.

The benchmark numbers nobody quotes

I usually do not put a lot of weight on benchmarks because they are easily gamed and rarely match real-world conditions. But they are useful for one specific purpose: as a warning signal about reliability, not a production guarantee.

The argument here is simple. If the best frontier models in the world, on curated benchmarks designed by the people who want them to succeed, still trail the human baseline on multi-step realistic tasks, and score below 50% on the newest and most realistic of them, then claims of fully autonomous business operation are very hard to take seriously. Real deployments can sometimes be easier than benchmarks if the workflow is narrow and well-instrumented. Real deployments are often harder than benchmarks if the workflow is messy or adversarial. Either way, the benchmark number is a warning signal worth heeding.

The numbers:

The GAIA benchmark, developed by Meta, HuggingFace, and AutoGPT researchers, tests AI agents on 466 real-world assistant tasks. When it launched in late 2023, the best model scored 15%. Humans scored 92%. After two years of intense improvement, the best agents now top out around 70 to 75%. Still well below the human baseline.4

Gaia2, a successor benchmark released in 2026 that tests agents in asynchronous environments closer to real deployments, found that GPT-5 in its highest-effort configuration reaches 42% pass-at-one. That means that, given a single attempt per task, it succeeds on about 42 out of 100 tasks. The rest fail. The benchmark designers note the model "fails on time-sensitive tasks." Open-source models score around 21%.5

WebArena, a benchmark that tests agents on autonomous web navigation tasks like e-commerce and content management, reports that the first agents scored 14%. Two years later, the best agents reach 60 to 70%. Human performance is around 78%.6

These are warning signals worth heeding. A team deploying an autonomous agent in production should not expect to substantially exceed these numbers on tasks of comparable difficulty.

What happens when this hits real businesses

The lab-versus-production gap is where the claim collapses entirely. We have hard data on this now.

In August 2025, MIT's NANDA initiative published "The GenAI Divide: State of AI in Business 2025." Coverage in Fortune and elsewhere put the dataset at 150 business leader interviews, 350 employee survey responses, and analysis of 300 public AI deployments. The MIT report itself describes a somewhat smaller dataset, with 52 structured interviews and 153 employees surveyed. Either way, the headline finding is the same: 95% of corporate generative AI pilots delivered no measurable P&L impact. Despite an estimated $30 to $40 billion in enterprise AI spending, only 5% of projects showed measurable returns.78

Read that again. 95% of pilots delivered no measurable impact on company financials. This is from MIT, not from a hype-skeptical newsletter. The lead author, Aditya Challapally, attributed the failures not to model quality but to how organizations attempt to deploy them. Tools that integrate into existing workflows succeed about 67% of the time. Internal builds succeed only one-third as often.

In May 2025, PwC surveyed 308 senior US business executives. 79% reported their organizations had adopted AI agents in some form. But of those who had adopted, only 66% said the agents were delivering measurable value through increased productivity. PwC's own caveat in the report: "broad adoption doesn't always mean deep impact... reports of full adoption often reflect excitement about what agentic capabilities could enable, not evidence of widespread transformation."9

KPMG's Q1 2025 AI Pulse Survey of 130 C-suite leaders at organizations with over $1 billion in revenue found that full deployment of agentic AI remained at 11%, despite 65% having moved from experimentation to fully-fledged pilots.10 McKinsey's November 2025 State of AI report found that no more than 10% of respondents in any given business function reported their organizations were scaling AI agents in production.11

Gartner now predicts that over 40% of agentic AI projects will be scrapped by 2027, citing a combination of rising costs, unclear business value, and inadequate risk controls.12

These are not opinions. These are numbers from inside the companies trying to make this work.

The single most important fact in this entire piece

Here is the data point that explains a lot of what you see on your feeds.

METR, an independent AI evaluation nonprofit, has been running randomized controlled trials on AI tools and developer productivity. Their first study, covering February to June 2025, found that experienced open-source developers using early-2025 AI tools took 19% longer to complete tasks, while estimating that AI had made them 20% faster. A 39 percentage point gap between perception and measured reality.

In February 2026, METR published a revised version of the study. They had attempted to repeat the experiment with a larger cohort starting August 2025, using updated AI tools. The newer data suggests a small speedup. For the subset of the original developers who participated again, AI assistance reduced task time by an estimated 18%, with a confidence interval on the change in task time running from -38% (faster) to +9% (slower). For newly recruited developers, the estimated effect was small and statistically indistinguishable from zero, with a confidence interval from -15% to +9%. Both confidence intervals cross zero.1314

METR's own conclusion: AI likely provides productivity benefits in early 2026, but they cannot measure the magnitude reliably.

The reason for the revision is more interesting than the new number. METR found that the study itself broke down because of how AI adoption had changed in the intervening months. Developers refused to participate in the no-AI condition. Between 30% and 50% of recruited developers told the researchers they were choosing not to submit certain tasks because they did not want to do them without AI. The pool was systematically missing the developers and the tasks where AI uplift was expected to be largest. The new estimate is therefore a lower bound on the true effect, not the central estimate.13

Two things survive this revision intact, and both matter for the post.

The first: developers, experts, and economists massively overestimate AI productivity gains relative to what controlled measurement can detect. The early-2025 developers predicted a 24% speedup; they were measurably slower. Economists forecast a 39% speedup; the measured result was a 19% slowdown. The 2026 follow-up still shows wide confidence intervals around modest speedups, while industry surveys routinely report 2x or 3x productivity claims. The perception-reality gap has not closed.

The second: AI productivity is harder to measure than the headlines suggest. METR's study broke because adoption became universal among the people best positioned to estimate the effect. The same selection effects operate in every "I am so much more productive with AI" testimonial. Self-reports cannot be controlled. Counterfactuals cannot be measured. The people most enthusiastic about AI are systematically the ones whose claims are hardest to verify.

This is the missing piece. This is why your feeds are full of people claiming AI does things that controlled measurement struggles to confirm. Why the YouTube videos go viral. Why the Forbes columns get written. It is not that they are lying. Most of them truly believe what they are posting. They feel the productivity gain. The interface is smooth. The output looks professional. The vibes are good.

The productivity gain is probably real. The magnitude is almost certainly smaller than the testimonials suggest. The gap between feeling fast and being measurably fast is one of the most well-documented results in the literature, and it has not gone away with newer models.

When you read the post claiming someone runs 13 companies with zero employees and AI handles the real estate deals, you are reading a perception report from inside that gap. The author is not the person running the companies. The author heard about it in a room. The person they heard about probably feels enormously productive. The actual deployment, if it exists at all, almost certainly has more humans and less autonomy than the story suggests.

What it actually looks like when people try

Step away from the headlines and into the rooms where people are actually building this stuff. A 15-year FAANG engineer recently posted a question to a Reddit thread dedicated to AI agents. He asked a simple thing: is anyone really running a company with 30 or more AI agents, or is this just hype? He listed six honest sub-questions about deployment, communication, state, and improvement.

The thread is one of the better field surveys you can read in 2026, because the responses come from people building agents in production right now, not from people selling a course about it. The answers cluster into three groups.

The first group is the honest practitioners. One operator described running hundreds of agent instances across three classes: an internal context-builder that ingests company communications, a customer-facing agent that runs read-only on websites and dashboards, and a monthly summary agent. Token costs around $1,500 to $2,000 per month. Detailed memory architecture using vector databases. Logs, audit trails, monitoring. Read carefully, though. The agents do not issue refunds, send quotes, or make account changes. None of them. They are read-only. The flagship use case is "what is the status of X" lookups and PDF generation. Every output gets reviewed by a human before it matters. This is the recoverable use case. The same operator mentioned that every agent has a file in its repo where it logs what went wrong. The file name was a four-letter word followed by ".md." That detail, more than any benchmark, tells you what production agent work feels like.

Another operator running 6 to 8 agents in production wrote that the agent that breaks silently is worse than the one that crashes, and the agent that "completed" but did the wrong thing is the one that costs you a week of debugging. The cost, this operator wrote, is not building the agents. It is building the observability layer around them.

A third practitioner, who claimed to use more than 30 agents, was direct: none of them have been able to run consistently for days on end without intervention.

A fourth, with the most detailed workflow description in the thread, walked through a real GTM automation. Every external action passed through explicit human approval. The exact words: nothing gets posted to external platforms without explicit consent.
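The pattern this practitioner describes, where nothing leaves the building without explicit consent, can be sketched as a small approval gate. This is an illustration under my own assumptions, not the thread author's implementation: the class names, the `external` flag, and the `approve` callback (a stand-in for a human reviewer) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProposedAction:
    kind: str        # hypothetical label, e.g. "summary" or "post"
    payload: str     # what the agent wants to send or write
    external: bool   # True if the action reaches anyone outside the company


@dataclass
class ApprovalGate:
    # Stand-in for a human reviewer: returns True only on explicit consent.
    approve: Callable[[ProposedAction], bool]
    executed: List[ProposedAction] = field(default_factory=list)
    held: List[ProposedAction] = field(default_factory=list)

    def submit(self, action: ProposedAction) -> bool:
        """Run internal actions directly; run external ones only if approved."""
        if action.external and not self.approve(action):
            self.held.append(action)   # parked for review, never sent
            return False
        self.executed.append(action)
        return True


if __name__ == "__main__":
    gate = ApprovalGate(approve=lambda action: False)  # reviewer has not consented
    gate.submit(ProposedAction("summary", "weekly pipeline notes", external=False))
    gate.submit(ProposedAction("post", "launch announcement", external=True))
    print(len(gate.executed), "executed;", len(gate.held), "held for review")
```

The design choice is that the default for anything external is "held." The agent can draft as much as it likes, but the irreversible step waits for a person.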

The second group is the skeptical builders. A developer who has been coding since 1984 wrote that anyone claiming to run a company with 30 agents is lying, because these agents will break constantly. Another, working with a frontier coding agent daily, wrote that he can barely manage one complex application without exorbitant token costs, and that nobody is solving serious production problems with these tools alone. A third closed his comment with: no real business owner is trusting sensitive operations to AI at this point.

The third group is the most useful. These are the operators who arrived independently at the same distinction. One wrote that 30 agents as workflow automation is real, but 30 agents as "a company of autonomous AI employees" is hype. Another: 30 free-roaming agents trying to run "the company" is BS, but 30 agents where each one owns exactly one task inside a defined process works fine. A third: most "30 agents" setups are really just workflows split into small services with routing rules, not autonomous agents coordinating in any meaningful way.

These three sentences, written separately by people who do not know each other, are not formal evidence. They are anecdotal practitioner reports from the thread. But they match what production operators repeatedly say in conversations and at conferences. The argument I am making in this post is not the contrarian view. It is the working operator's view. The hype version exists in a different layer of the internet entirely. The reason the operator view sounds different is that operators have actually built these things. (A working AI agent built end-to-end is mostly not AI. The model is 15-20% of the system. The rest is traditional software engineering.)

How the marketing apparatus actually works

Spend a few minutes watching the videos that produce these claims and the structure becomes obvious.

A typical example: a polished founder with previous exits opens a video on YouTube by stating that the next wave of billion-dollar companies will not have 100 employees, will not have 50 employees, but will have one. One person, using AI agents, doing all the work. The video then walks through six steps to build a one-person AI business from $0 to $10 million plus, possibly billions.

The video runs about 14 minutes. The case study at the climax is the proof point. The founder reveals it at minute 13. He has just launched a company doing $83,000 a month in recurring revenue. The team behind it: the founder and two part-time contractors.

That is the case study. Three humans. The headline promise of the video was one human and AI. The proof was three humans and AI.

The video also contains an instruction to message the host on Instagram with a specific keyword to receive a free playbook. The mechanic is standard. The video is the top of a funnel. The viewer is the product. The keyword is the qualifier. The playbook is the next step in a sequence that ends with a paid coaching program.

I am not picking on this one creator. This is a category. There are dozens of videos with this exact structure produced every week. The case study always reveals more humans than the headline promised. The promise is always presented as imminent. The dream is always one keyword away.

Now overlay the METR finding from the previous section. The creators making these videos really do feel productive. They are using AI tools constantly. They are shipping content, scripts, edits, thumbnails, and lead magnets at high volume. They report this experience as evidence that AI is replacing employees. What they are actually reporting is the perception side of a perception-reality gap.

The case studies they cite are real businesses. The businesses have humans. The humans get edited out of the headline. The viewer reads the headline and concludes that someone, somewhere, has cracked the code. The thread tells you that someone, somewhere, has not.

What an honest version sounds like

Now contrast this with how a credible operator writes about the same kind of result.

In 2026, Jason Lemkin, founder of SaaStr and a long-time SaaS investor and operator, published a post with a headline that could have been ripped from the worst hype-machine post: his team of 1.25 humans and 20 AI agents closed 140% of what their full human sales team closed the year before. Real number. Not hedged. He puts it on the page in the first paragraph.24

Then he spends the rest of the post deconstructing it.

He identifies three drivers of the result and says directly that he cannot isolate them. The first is concentration. When he went down to 1.25 humans, those humans were his best closers. Every qualified lead went to a top closer instead of being spread across a bench of mixed-quality reps. He estimates this alone might account for 50% or more of the gain. This has nothing to do with AI. It is a sales operations strategy that predates AI by decades.

The second driver is coverage. Before AI, his team responded to under 40% of inbound leads. With AI agents, they respond to 100%. Instantly. At 2am on a Saturday. He calls this directly: "that's not intelligence, that's coverage. And coverage turns out to matter a lot more than most sales leaders admit."

The third driver is market tailwind. SaaStr's entire business is now AI-focused content and community. The market for what he sells exploded at the exact same moment they restructured the sales team. His own words: "I can't tell you the ratio."

His characterization of the AI agents themselves is the line worth memorizing. He writes that the agents "worked fine. They didn't embarrass us. They didn't send weird emails. They didn't hallucinate pricing... at a level that was... fine. Not magical. Fine."

Fine. Not magical. Fine.

His takeaway is the opposite of the hype version: do not deploy AI agents because you think AI is magic at selling. Deploy them because you have a coverage problem, because you want to free up your closers from qualification work, or because you have a database of past prospects nobody is working. He closes with: "don't fool yourself into thinking the AI is what's closing deals. In our case, the AI agents created more at-bats for humans who are great at closing. That's the real story."

This is what an honest version of the headline sounds like. Same setup as the hype posts. Same numbers, even better. Different writer, different discipline, different conclusion. Lemkin asks the counterfactual that the hype posts always skip: what if we had kept the full team AND had the tailwind AND concentrated leads in our top closers? Would we have closed 180%? 200%? His answer: "we'll never know. There's no A/B test for this."

That sentence is the difference between an operator and a marketer. An operator knows what they cannot prove. A marketer omits what they cannot prove.

If you want a single piece of writing to use as a model for how to talk about AI results in your own business, this is it. Read it before you write your next AI post. The discipline is the lesson.

What happens legally when you trust the agent

The autonomous-agent fantasy hits a wall the moment the agent makes a mistake that costs money. The legal system has now ruled on this multiple times.

In February 2024, the British Columbia Civil Resolution Tribunal ruled in Moffatt v. Air Canada that the airline was liable for misleading information provided by its chatbot. Air Canada's defense was, and I am not making this up, that the chatbot was "a separate legal entity that is responsible for its own actions." The tribunal called this submission "remarkable" and rejected it. The company was held responsible for everything its automated agent said to a customer, regardless of whether the information appeared on a static page or came from the bot.1516

A few months earlier, in December 2023, a customer at Chevy of Watsonville got the dealership's ChatGPT-powered chatbot to agree to sell a Chevy Tahoe for $1, with the bot stating "that's a legally binding offer, no takesies backsies." The legal community took this seriously enough to write multiple articles about whether such interactions could form binding contracts.17

New York City deployed a chatbot called MyCity to advise small business owners. It told shop owners they could go cashless, contradicting a 2020 NYC law requiring stores to accept cash. It told a landlord they could refuse tenants using rental assistance, which is illegal discrimination in New York. A local housing policy expert called the tool "dangerously inaccurate."18

Stanford research found that even specialized legal AI tools produce incorrect information 17 to 34% of the time. The same researchers found that general-purpose LLMs hallucinate at rates between 69 and 88% on specific legal queries.1920

Now think about what this means for the "AI agents handle real estate deals over email" claim. A real estate transaction involves binding offers, contractual representations, fiduciary duties, and counterparties with their own lawyers. Every email the agent sends can become part of a future dispute or litigation record. Every misstatement can be a negligent misrepresentation. The Air Canada precedent means the business owner cannot point at the AI and say "not me." They are responsible for everything the agent says.

A founder running 13 companies with zero employees and AI handling the deals is also a founder running 13 companies with 13 active liability surfaces and no humans reviewing what the liability surfaces are emitting. This is not a business model. This is a litigation pipeline.

What AI is actually good at

I want to be precise about this, because the goal of this post is not to dismiss the technology. The goal is to draw the actual line.

AI is useful. Documented productivity gains exist. They are real.

A controlled Microsoft study found developers given GitHub Copilot completed a defined coding task 55.8% faster than developers without it. The task was contained: build an HTTP server in JavaScript. The setting was controlled. The benefit was real.2122

A Cisco study evaluating GitHub Copilot on 15 software development tasks found 50% time savings on documentation, 30 to 40% on repetitive coding tasks, unit test generation, debugging, and pair programming. The same study found Copilot struggled with complex tasks, large functions, multi-file work, and proprietary contexts.23

The line that matters is not how many steps a task involves. A single image generation is one step and AI fails on it constantly. A single legal question to a general-purpose LLM is one step and Stanford measured 69 to 88% hallucination rates. Step count amplifies the problem when each step's reliability is good but not perfect. That is the point of the compounding math earlier in this post. But step count alone is not the root variable. A single high-stakes step with no verifier can fail just as badly as a 20-step chain with no verifier. The root variable is verification. (The architectural reasons for why LLMs fail in the ways they do are covered separately.)

The variable is whether the AI's output can be verified cheaply, and what happens if verification fails or comes too late. The framework is simple and worth stating directly. Four questions decide whether a deployment is sound. Can the output be verified? By whom or what, a human, a piece of software, a downstream system? If verification fails or does not happen in time, what breaks? Can the mistake be undone? Cheap verification with cheap recovery means AI can run without continuous human review. Expensive verification or no recovery means humans stay involved, providing the judgment that verification alone cannot. (Longer treatment of this question.)

Verifiability is the upstream concept. If something is verifiable, you can build a safe deployment around it. If verification is expensive, impossible, or comes too late, no amount of clever engineering rescues the system. Verification can come from a human, from software, or from both. Compilers verify code. Test suites verify behavior. Downstream rules verify classifications. Native speakers verify translations. Users verify generated images by picking the one they want. The successful AI deployments are the ones where verification is a solved problem, regardless of whether the verifier is human or software.

Recoverability is the secondary concept. It is the safety net for cases where verification is incomplete, late, or fails. Sometimes recovery is trivial. Send the prompt again, regenerate the image, ask the AI to try once more. Sometimes it requires a human escalation or a refund. Sometimes recovery is impossible. The email was sent, the contract was signed, the diagnosis was acted on. The successful deployments combine cheap verification with cheap or unnecessary recovery. The failed deployments have neither.
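The framework so far can be written down as a small decision rule. Treat this as a sketch of the reasoning, not a scoring tool: the field names and the two-way outcome are my simplification of the four questions, and real deployments will not reduce to four booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Oversight(Enum):
    AUTONOMOUS = "can run without continuous human review"
    HUMAN_IN_LOOP = "humans stay involved"


@dataclass
class Deployment:
    verifiable: bool        # Can the output be verified at all?
    cheap_verifier: bool    # Is there a cheap verifier (human, software, downstream system)?
    verified_in_time: bool  # Does verification happen before the consequence lands?
    cheap_recovery: bool    # Can a mistake be undone cheaply?


def required_oversight(d: Deployment) -> Oversight:
    """Cheap verification plus cheap recovery is the only autonomous case."""
    cheap_verification = d.verifiable and d.cheap_verifier and d.verified_in_time
    if cheap_verification and d.cheap_recovery:
        return Oversight.AUTONOMOUS
    return Oversight.HUMAN_IN_LOOP


if __name__ == "__main__":
    # Code suggestion: compiler and tests verify it, and a bad one is discarded.
    code_suggestion = Deployment(True, True, True, True)
    # Outbound contract email: once sent, it cannot be unsent.
    contract_email = Deployment(True, False, False, False)
    print("code suggestion:", required_oversight(code_suggestion).value)
    print("contract email:", required_oversight(contract_email).value)
```

The asymmetry is deliberate. A single "no" on any of the four questions is enough to put a human back in the loop.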

There is a third concept embedded in the framework that deserves a name. Verification answers the question of whether the AI's output is correct. Judgment is the harder question of whether the action is wise. AI is increasingly capable of the first. Humans remain the irreducible source of the second. A workflow can verify that an email is grammatically correct, factually accurate, and consistent with company policy, and still leave the question of whether to send the email at all, to this counterparty, in this context, at this moment, to the human. The successful deployments automate the grunt work and keep humans on the judgment calls. (Longer treatment of why this division of labor works.)

Apply this framework to AI deployments and the pattern is clear.

Where AI delivers value today: code suggestions verified by the developer and by the compiler and the test suite. Document drafts verified by a human editor before sending. Image generation verified by the user picking one of several candidates. Summarization verified by a reader who can sanity-check against the source. Tier 1 customer support verified by escalation paths when the AI cannot help. Lead qualification at the top of a funnel where downstream sales conversations verify the qualification. Classification and extraction from semi-structured data verified by downstream rules and spot-checks. Scheduling assistance verified by calendar confirmation. Translation verified by native-speaker review for anything that matters. Mathematical solutions verified against the problem statement.

Notice the pattern. In every case where AI works, verification is built in. Sometimes the verifier is a human. Sometimes the verifier is a piece of software. Sometimes both. Errors are caught cheaply before they propagate.

Where AI does not deliver value today: multi-step workflows with adversarial counterparties where no verifier exists between the agent and the consequence. Long-horizon autonomous decision-making with real liability. Legal, medical, or financial advice acted on without expert review. Customer-facing transactions where every word is a binding representation. Autonomous outbound communication to humans who will hold you to whatever the agent said.

Notice this pattern too. In every case where AI fails in production, verification is missing or arrives too late. The counterparty is not your verifier. They are an adversary who will sue you over the misstatement. The patient is not your verifier. They are the person harmed by the wrong diagnosis. The legal client is not your verifier. They acted on the wrong advice and now there is a malpractice case.

The line is verifiability, with recoverability as the safety net. Not steps, not bounded scope, not "augmentation versus autonomy." Where outputs can be verified cheaply, AI works. Where verification is expensive or impossible, AI fails. (Gartner identified about 130 legitimate agentic AI vendors out of thousands claiming the label. The real ones cluster in five domains with one thing in common: clear boundaries, measurable outcomes, fast feedback loops.)
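The two axes can be written down as a small decision function. This is a minimal sketch, not a scoring model: the `Cost` levels, the `Mode` names, and the ratings given to each example deployment are illustrative assumptions, not measurements from any study cited here.

```python
from dataclasses import dataclass
from enum import Enum


class Cost(Enum):
    CHEAP = 1
    EXPENSIVE = 2
    IMPOSSIBLE = 3


class Mode(Enum):
    LIMITED_REVIEW = "AI runs, humans spot-check"
    HUMAN_IN_LOOP = "a human reviews before the output takes effect"
    HUMAN_OWNS_DECISION = "AI may assist, but a human makes the call"


@dataclass(frozen=True)
class Deployment:
    name: str
    verification: Cost  # cost of checking an output before it takes effect
    recovery: Cost      # cost of undoing a mistake after the fact


def recommend(d: Deployment) -> Mode:
    """Map verifiability and recoverability onto a deployment posture."""
    # Cheap verification plus cheap recovery: errors are caught before they spread.
    if d.verification is Cost.CHEAP and d.recovery is Cost.CHEAP:
        return Mode.LIMITED_REVIEW
    # Neither a cheap check nor a way back: the pattern of the failed deployments.
    if d.verification is not Cost.CHEAP and d.recovery is Cost.IMPOSSIBLE:
        return Mode.HUMAN_OWNS_DECISION
    # Everything in between keeps a reviewer in the loop.
    return Mode.HUMAN_IN_LOOP


if __name__ == "__main__":
    # The ratings below are assumptions chosen to mirror the examples in the text.
    examples = [
        Deployment("Tier 1 support with escalation", Cost.CHEAP, Cost.CHEAP),
        Deployment("Outbound email drafted by AI", Cost.CHEAP, Cost.IMPOSSIBLE),
        Deployment("Autonomous contract negotiation", Cost.EXPENSIVE, Cost.IMPOSSIBLE),
    ]
    for d in examples:
        print(f"{d.name}: {recommend(d).value}")
```

The middle case is the instructive one: a sent email cannot be recalled, but a human can read the draft cheaply before it goes out, so the deployment works as long as that review actually happens.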

This framework also resolves the apparent contradiction between the hype and the real production deployments. Salesforce can deploy Agentforce on Tier 1 support, where outputs are verified by escalation and recovery is cheap, because that is the territory where the framework predicts success. Salesforce does not deploy Agentforce to autonomously close enterprise contracts, because verification is expensive, the counterparty is adversarial, and recovery may be impossible. The same company, the same product, two different decisions, made on the same line.

Where AI demonstrably works, and the caveat that comes with it

I have spent most of this post showing where the hype overshoots. The post would be unbalanced and unfair if it stopped there. There is a serious body of research documenting where AI delivers measurable productivity gains. The results are real. The numbers are real. They come from controlled studies, surveys, benchmarks, and production reports, including work by major research universities and the Federal Reserve. Anyone serious about this technology has to take this evidence seriously. So let me lay it out.

The studies below are the strongest individual experiments that quantified AI productivity in specific work domains.

#	Specificity	Domain	Study	Finding
1	Domain study	Customer support	Brynjolfsson, Li, Raymond (NBER, 2023)	14% productivity gain across 5,179 agents using AI conversational assistance. 34% gain for novice and low-skilled workers. AI suggestions reviewed by human agents before going to customers.
2	Domain study	Software engineering	Peng, Kalliamvakou, Cihon, Demirer (2023)	Developers using GitHub Copilot completed a defined HTTP-server task 55.8% faster than the control group. Effect strongest for less experienced developers.
3	Domain study	Software engineering (field)	Cui, Demirer, Jaffe, Musolff, Peng, Salz (2024)	Three field experiments across software developers found measurable productivity gains, with effect size depending on task type and developer experience.
4	Domain study	Professional writing	Noy and Zhang (Science, 2023)	Mid-level professional writing tasks completed 40% faster with 18% quality improvement using ChatGPT. Time savings concentrated on drafting and editing.
5	Domain study	Knowledge work / consulting	Dell'Acqua et al. (BCG / Harvard Business School, 2023)	Consultants using GPT-4 completed 12.2% more tasks, 25.1% faster, with 40% higher quality on tasks within AI's "jagged frontier." Performance degraded on tasks outside the frontier.
6	Domain study	Legal analysis	Choi and Schwarcz (2023)	Empirical study found AI assistance produced measurable productivity gains in legal analysis tasks, with quality dependent on task type and human review.
7	Domain study	Translation	Merali (2024)	Productivity scaling laws documented for LLM-assisted translation, with measurable speedups across language pairs.
8	Domain study	Knowledge worker tasks	Wiles, Krayer, Abbadi, Awasthi, Kennedy, Mishkin, Sack, Candelon (2024)	Field experiment frames AI as an "exoskeleton" that extends human capability on new skills. Productivity gains documented across multiple knowledge work domains.
9	Aggregate adoption	US workforce (all sectors)	Bick, Blandin, Deming (Federal Reserve Bank of St. Louis, 2025)	27% of US workers use generative AI at work weekly. Users report time savings of 5.4% of work hours. Across all workers including non-users, 1.4% of total work hours saved. Each hour of AI use boosts that hour's productivity by approximately 30%. Aggregate productivity gain estimate: 1.2%.
10	Real-world usage	Claude.ai conversations	Anthropic Economic Index (Jan 2026)	52% of Claude.ai conversations are augmentation (human in the loop), 45% are automation. Task success rates: 70% on basic tasks, 66% college-level, 61% software development, 49% on automation-only API. Reliability adjustment cuts implied productivity gains roughly in half, from 1.8 to 1.0-1.2 percentage points per year.

Figure 1: The strongest evidence for AI productivity gains in specific work domains. Every study measured AI plus human review, not AI alone.

These results are the foundation of the case for AI as a productivity technology. Many of the studies are academically rigorous, and together they form a consistent pattern. Take them seriously.

Now the caveat that the headlines strip out.

Every one of these studies measured AI plus human review, not AI alone. The Brynjolfsson customer support study had agents reviewing AI suggestions before they went to customers. The BCG consultants reviewed and edited the AI's output before submission. The GitHub Copilot developers accepted or rejected each suggestion. The professional writers in Noy and Zhang used ChatGPT as a drafting tool and edited the result. The Federal Reserve's 1.4% economy-wide time savings figure is what AI saves humans during AI-assisted work, not what AI produces autonomously. The Anthropic Economic Index report is even more explicit: when you adjust the implied productivity gains for the reliability of AI output, the headline number cuts roughly in half. That cut comes from the time humans spend verifying what the AI produced.

The studies do not say AI replaces humans. The studies say AI plus a human reviewer outperforms a human working alone. That is a real and important finding. It is not the finding the LinkedIn posts and the YouTube videos and the magazine covers are reporting.

Customer service is the cleanest example of how this actually works in production. The use case is close to perfect for current LLMs. The customer asks a question. The AI looks up the answer in the company's support documentation, where the answer almost always exists. The AI provides a more natural and intuitive interface than a search box. The customer gets a quick answer. When the AI cannot help, a human jumps in. The failed sessions get collected, reviewed, and used to improve the AI for the next round. Outputs are verifiable by the customer's reaction and by escalation to a human. Errors are recoverable. The blast radius is small. The human is the safety net. The system gets better over time because failures feed back into the training loop. This is the framework working exactly as it should.

Now imagine the same system without the human. Customer asks a question. AI cannot help. Customer is stuck. Customer is angry. Customer churns. The failure is not recoverable because there is no escalation path. This is what happens when the autonomy claim gets implemented literally. It is also what Klarna learned. After publicly claiming its AI assistant did the work of 700 customer service agents, the company quietly resumed hiring human customer service roles about a year and a half later.

Automation is not new. The non-determinism is

Worth pausing on a point that gets lost in the agent discourse.

Automation has been the goal of computing since the invention of computers. Every business workflow that gets done by software instead of by hand is an act of automation. We have been doing this deterministically for decades. Move these files here. Run this report on Tuesday. Sync this database to that one. Send a confirmation email when an order is placed. IFTTT, Zapier, RPA, batch ETL pipelines, scheduled jobs in your email client. Workflow automation is mature technology with a 30-year track record.

The new thing in 2024 to 2026 is not automation. The new thing is replacing deterministic logic with non-deterministic LLM calls inside the workflow.

A traditional Zapier workflow either runs or it breaks. The branching logic is explicit. If the email contains "invoice," route to accounting. The match is exact, the behavior is predictable, and when it fails it fails loudly and the operator can debug it. The deterministic version is verifiable by construction. The logic either matched or it did not. The system either ran or it did not. Every step is auditable. Every failure is visible.

Replace that branching logic with an LLM call. Now the workflow says: "Read the email and decide whether to route it to accounting." The LLM gets it right most of the time. It is more flexible than the deterministic version because it can handle invoices that do not contain the word "invoice." It is also non-deterministic. It will occasionally route the wrong email. It will occasionally hallucinate an interpretation. It will occasionally appear to succeed at a task it should have failed, producing output that looks correct but is not. The workflow runs. No error appears. The wrong thing happens silently. The LLM-mediated version is not verifiable by construction. It requires explicit verification mechanisms layered on top: human review, test suites, downstream sanity checks, classifier confirmations.
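To make the contrast concrete, here is a minimal sketch of the same routing step done both ways. The queue names, the `Classifier` signature, and the 0.8 confidence threshold are hypothetical, and the classifier is passed in as a plain function so that no real LLM API is assumed. A model's self-reported confidence is itself unreliable, so the threshold is only one check among several.

```python
from typing import Callable, Optional, Tuple

# Hypothetical set of queues this workflow is allowed to route to.
ALLOWED_ROUTES = {"accounting", "support", "sales"}
HUMAN_REVIEW = "human_review"

# A classifier takes the email text and returns (label, self-reported confidence).
# In a real deployment this would wrap an LLM call; here it is just a function.
Classifier = Callable[[str], Tuple[str, float]]


def route_by_keyword(email_body: str) -> Optional[str]:
    """Deterministic rule: the match is exact and every outcome is auditable."""
    return "accounting" if "invoice" in email_body.lower() else None


def route_with_verification(
    email_body: str, classify: Classifier, min_confidence: float = 0.8
) -> str:
    """Wrap a non-deterministic classifier in explicit, deterministic checks."""
    label, confidence = classify(email_body)

    # Check 1: reject labels outside the known set (for example, an invented queue).
    if label not in ALLOWED_ROUTES:
        return HUMAN_REVIEW
    # Check 2: low self-reported confidence goes to a person.
    if confidence < min_confidence:
        return HUMAN_REVIEW
    # Check 3: if the deterministic rule fired and disagrees, escalate.
    rule_label = route_by_keyword(email_body)
    if rule_label is not None and rule_label != label:
        return HUMAN_REVIEW
    return label


if __name__ == "__main__":
    # Stub classifier standing in for an LLM; it always answers "accounting".
    stub = lambda body: ("accounting", 0.93)
    # The keyword rule misses this email; the checked classifier can still route it.
    print(route_by_keyword("Amount due: $400 by Friday"))               # prints: None
    print(route_with_verification("Amount due: $400 by Friday", stub))  # prints: accounting
```

None of the three checks proves the routing is right. They only turn some silent failures into visible escalations, which is why spot-checks and downstream sanity checks still matter.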

This is the actual difference between traditional automation and AI agents. The deterministic parts of any "AI agent" deployment have always worked because they have verification built in. The LLM-mediated parts are where the unreliability comes from. They require verification to be designed in deliberately, and that work is what most agent deployments skip. Most of the marketing collapses the two into one thing. Most operators experience them as two very different things.

When you read about a workflow with 30 agents handling everything, the question to ask is which steps are deterministic and which steps are LLM-mediated. The deterministic steps are not the new thing. They are decades-old infrastructure with verification built in. The LLM-mediated steps are where the risk is, and where the verifiability and recoverability framework decides whether the deployment is sound or reckless.

The improvement caveat

One more honest point before moving on.

These productivity numbers reflect AI capability in 2025 and early 2026. The technology is improving. Anthropic's own data shows the augmentation/automation ratio shifting within months as new models ship. Task success rates have risen between successive Anthropic Economic Index reports. The METR task horizon, the duration of work AI can complete with 50% reliability, has been doubling roughly every seven months. Five years from now, the picture may look different.

The argument of this post is about 2026, not eternity. The claim is not that AI will never run autonomous multi-step workflows. The claim is that in 2026, deployed AI does not run autonomous multi-step adversarial workflows reliably, and the LinkedIn and YouTube and Forbes and CNBC claims to the contrary are not supported by the evidence we currently have. Where the technology is in 2030 is a different question. Make decisions for your business based on what is true now, not on what might be true later.

What about the layoffs and the Goldman Sachs forecast

Here is the strongest counter-argument to the case made so far. CEOs of major companies are publicly stating that AI is replacing human workers at scale. Goldman Sachs is forecasting AI-driven labor displacement. Block laid off 40% of its workforce. Meta is reportedly cutting 20,000 roles. Salesforce went from 9,000 support engineers to about 5,000.

If AI cannot do the things I just said it cannot do, why is this happening?

The honest answer requires holding two things at once. The displacement is real. The "massive scale" framing is overshooting what the data shows. Both can be true.

Start with what is real. In September 2025, Salesforce CEO Marc Benioff stated on a podcast that he had reduced support headcount from 9,000 to about 5,000, with AI agents now handling 50% of customer support interactions that humans used to handle entirely.[25] That is replacement. The 4,000 people who used to do that work no longer do that work. The official Salesforce statement softens it slightly, noting that some staff were redeployed to professional services, sales, and customer success rather than terminated.[25] But the headcount reduction is real and the work transfer to AI is real.

This is exactly the territory where the framework predicts AI replacement. Tier 1 customer support is high volume, repeatable, and recoverable. Outputs are verifiable by the customer asking again, by escalation paths, or by spot-check review. The blast radius of any single error is small. Errors are caught cheaply. This is the use case AI handles well, and it is the use case where companies are reducing headcount. (The catch is that AI agents at scale require permission models humans never needed. One human mistake might refund one customer wrongly. An AI agent with the same permissions can refund ten thousand.)
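One way to picture such a permission model is a hard cap on what the agent may do unattended. The sketch below is illustrative only: `RefundBudget`, `NeedsHumanApproval`, and the specific limits are hypothetical names and numbers, not a description of how any vendor implements this.

```python
from dataclasses import dataclass


class NeedsHumanApproval(Exception):
    """Raised when an action exceeds what the agent may do unattended."""


@dataclass
class RefundBudget:
    max_per_refund: float     # largest single refund the agent may issue alone
    max_refunds_per_day: int  # cap on volume, which bounds the blast radius
    issued_today: int = 0

    def authorize(self, amount: float) -> None:
        """Approve a refund or escalate; never fail silently."""
        if amount > self.max_per_refund:
            raise NeedsHumanApproval(f"refund of {amount:.2f} exceeds per-refund limit")
        if self.issued_today >= self.max_refunds_per_day:
            raise NeedsHumanApproval("daily refund count reached")
        self.issued_today += 1


if __name__ == "__main__":
    budget = RefundBudget(max_per_refund=50.0, max_refunds_per_day=3)
    for attempt in range(5):
        try:
            budget.authorize(20.0)
            print(f"refund {attempt + 1}: issued")
        except NeedsHumanApproval as reason:
            print(f"refund {attempt + 1}: escalated ({reason})")
```

With a cap like this, a faulty agent can issue at most three small refunds a day before a human sees the problem, instead of ten thousand.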

In March 2026, Circle CEO Jeremy Allaire told the Economic Club of New York that "AI agents will replace a huge percentage of work that's currently performed by humans on a massive scale."[26] Read his full transcript and the picture sharpens. In the same conversation, Allaire described his own son using AI to become a 100x version of himself, building things he never imagined he could build. He told every employee at Circle that embracing agentic capabilities gives a person "new superpowers" and "your ability to have impact grows dramatically."

That is two different claims in the same interview. The first is replacement at the population level: companies need fewer humans for the same output. The second is augmentation at the individual level: the humans who remain are dramatically more capable. Both are true. The 4,000 Salesforce support engineers who left are gone. The 5,000 who remain are doing more with AI handling the recoverable half of the workload. Replacement at the aggregate, augmentation at the individual. These are not contradictions. They are the same story told from two angles.

Now layer in Goldman Sachs. In their February 2026 note, economist Pierfrancesco Mei forecast US unemployment drifting from 4.3% to 4.5% by year-end, in part because of AI-driven displacement. The upside risk: an additional 0.3 percentage points if AI adoption accelerates.[27] In a separate Goldman analysis, economist Joseph Briggs's base case (assuming a roughly 10-year transition) has AI displacing 6 to 7% of workers, with a 0.6 percentage point increase in unemployment. A faster transition would produce larger numbers.[28]

Sit with those numbers. 0.2 to 0.5 percentage points of unemployment in 2026. 6 to 7% over a decade. These are real and significant labor market shifts. They are not the apocalypse the rhetoric suggests. They are roughly the size of the personal computer transition in the 1980s and 1990s, or the internet transition in the 1990s and 2000s. Both of those transitions reshaped white-collar work. Neither produced a one-person billion-dollar economy. (The longer historical view, from agriculture in 1800 to the SaaSpocalypse panic in February 2026, points the same direction. 224 years of evidence on automation creating more jobs than it destroys.)

Four things to keep in mind when you read the next "AI is replacing humans" claim.

First, read the full source, not the headline. Allaire said "massive scale" in one breath and "100x superpowers for humans" in the next. The headline took the first half. The CEO said both halves. The careful reading is that AI is replacing some humans on bounded recoverable work, while the humans who remain are doing more.

Second, watch for the conflict of interest. Benioff is CEO of Salesforce, which sells Agentforce. Allaire is CEO of Circle, which is positioning itself in agentic commerce infrastructure. The CEOs making the most aggressive AI replacement claims are selling the AI products that produce that replacement. They have the strongest possible incentive to inflate the displacement story, because that story sells the product to other CEOs who want to do the same thing. This is not a conspiracy. It is normal CEO behavior at the launch of any major technology cycle. But it should change how you weight the claims relative to careful analyst forecasts. Klarna is the cautionary example: in 2024 the company announced its AI chatbot was doing the work of 700 employees and bragged about cutting headcount nearly in half. By May 2025 it had quietly reversed course and started hiring customer service roles back. The CEO told Bloomberg that customers needed to know there would always be a human if they wanted one. The boast got millions of views. The reversal got a fraction of the coverage.

Third, separate AI-caused layoffs from AI-rationalized layoffs. Tech companies over-hired massively during 2020 to 2022 when capital was free. The post-zero-interest-rate correction has driven layoffs since 2023, well before agents existed in any deployable form. Some of the current "AI replaced our workers" announcements are correcting that over-hiring while attaching a more flattering narrative for the market. Block's CFO confirmed AI was a factor in their 40% cuts. Factor, not sole cause. The honest read is that AI-caused and AI-rationalized layoffs are both happening, and the share of each is unclear in any individual company.

Fourth, do the math on the size of the cuts. AI does let fewer employees do more work in the bounded, recoverable jobs already discussed. Those job cuts are real. But the announced cuts at the largest companies are far too big to be explained by Tier 1 support automation alone. A company cutting 40% of its workforce is not doing that because AI is now doing 40% of the work. Something else is going on, and three structural forces fill the gap.

The first force is capital reallocation. AI infrastructure is expensive. Microsoft, Meta, Google, Amazon, and Oracle have each committed tens of billions in 2025 and 2026 to data centers, GPUs, and AI vendor contracts. That money has to come from somewhere. Cutting payroll frees up operating budget to fund capex and to protect margins for the market. A company laying off 10,000 people while spending $30 billion on AI infrastructure is not necessarily proving that AI is doing those 10,000 people's jobs. It is reallocating spending from labor toward capital expenditure and shareholder returns.

The second force is the post-ZIRP correction, already discussed above. The third is less discussed: fiefdoms in large companies are often measured in reporting headcount, which creates structural pressure toward over-hiring during good years. When the cycle turns, those bloated fiefdoms become the obvious place to cut. The AI narrative provides convenient cover for cuts that the company would have made anyway as the cycle corrected. Meta's 2023 "year of efficiency" is the cleanest example. Zuckerberg cut about 21,000 roles (roughly a quarter of the workforce) and explicitly framed it as removing layers of management bloat that had accumulated during the growth years. This was before AI was a deployable cover story. The cuts revealed how much excess headcount Meta had built up for reasons unrelated to operational need.[29] Salesforce cut about 10% of its workforce around the same time for the same reason, with CEO Marc Benioff publicly stating they had over-hired during the pandemic and taking personal responsibility for the over-hiring.[30] Both happened before the current AI replacement narrative existed. Both proved that the underlying headcount levels were inflated.

All three of these forces contribute. None of them require AI to be doing the work of the people being let go. Some of the cuts are AI-driven. Most of the difference between "AI is doing some work" and "we cut 40% of headcount" is something else.

The picture that holds together: AI is replacing humans on the recoverable bounded work where the framework predicts it would. That replacement is real, measurable, and consistent with Goldman's careful analyst forecast. It is not consistent with the "13 companies, zero employees" claim, because that claim requires AI to handle multi-step adversarial work without recoverable checkpoints, which is the work Salesforce specifically does not deploy Agentforce to do. (Even where AI is being deployed, only 5% of companies are creating substantial value at scale. The displacement is real and the value capture is concentrated.)

The displacement story and the autonomy hype are pointing at different elephants. The displacement is happening at the boundary between Tier 1 and Tier 2 work. The autonomy hype is pretending the boundary is at Tier 5.

A framework for reading the next viral post

You will see another one of these claims this week. Probably tomorrow. Here is how to evaluate it.

First, ask who specifically. Not "a founder I know" or "someone in a community I am part of." A name. A company. A verifiable identity. If the claim cannot survive this question, it is a story passed through a room, reshaped at every retelling.

Second, ask what the error rate is per step. If the claim involves a 20-step workflow, the agent needs roughly 99.5% accuracy at each step to have a 90% chance of completing the workflow without intervention, because per-step success rates multiply. No general-purpose autonomous agent has demonstrated this reliably across messy, adversarial business workflows. If the answer to this question is unknown or hand-waved, treat the claim as unproven.

Third, ask what happens when it fails. Real businesses have failure modes. Customer complaints. Refund processes. Insurance coverage. Liability allocation. A claim of full autonomy without a clear answer to "what happens when the agent makes a six-figure mistake" is a claim that has not encountered reality yet.

Fourth, ask where the human checkpoint is. In every working agent deployment I have seen, including the impressive ones, humans review high-stakes outputs. The post will often quietly say "humans only at approval touchpoints," which is the giveaway. Approval touchpoints are humans. The number of them and the work they do is the actual story. It is also the part that gets stripped from the headline version. (88% of AI proofs of concept never reach production. The pilot is designed to prove the technology works. Production is where the hidden human work surfaces.)

Fifth, ask what the legal structure is. Who is liable when the agent fails? What is the indemnification? What does the insurance say? Real operators can answer these questions in 30 seconds. Marketing claims cannot answer them at all.

Apply these five questions to the next viral post you see. Most of them collapse under the first question. The rest collapse under the second. (It also helps to know that the term "agentic AI" itself has fifteen different definitions across fifteen different companies, each one calibrated to what that company sells.)
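The per-step arithmetic behind the second question is easy to check. The sketch below assumes every step succeeds or fails independently and that there is no retry or human checkpoint along the way, which is a simplification. The function names are illustrative.

```python
def workflow_success(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent steps and no recovery."""
    return per_step_accuracy ** steps


def required_step_accuracy(target_success: float, steps: int) -> float:
    """Per-step accuracy needed to reach a target end-to-end success rate."""
    return target_success ** (1.0 / steps)


if __name__ == "__main__":
    for accuracy in (0.95, 0.99, 0.995):
        print(f"{accuracy:.3f} per step over 20 steps -> {workflow_success(accuracy, 20):.3f}")
    # 0.950 per step over 20 steps -> 0.358
    # 0.990 per step over 20 steps -> 0.818
    # 0.995 per step over 20 steps -> 0.905

    print(f"needed for 90% over 20 steps: {required_step_accuracy(0.90, 20):.4f}")
    # needed for 90% over 20 steps: 0.9947
```

An agent that is right 95% of the time at each step sounds reliable, yet under these assumptions it finishes a 20-step workflow cleanly only about a third of the time. Checkpoints that catch and correct errors between steps are what break this compounding.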

The reason the flood exists

I want to close with the structural explanation, because it matters.

Four forces produce this hype across every channel.

The algorithms reward the wrong things. LinkedIn's feed rewards specific, confident, slightly outrageous claims. "AI helped me automate email triage" gets 12 likes. "13 companies, zero employees, AI runs everything" gets 12,000. YouTube rewards videos with strong hooks and bold thumbnails. X rewards quote-tweets and dunks. The platforms select for the more exciting version of any AI story regardless of whether it is true. This is not a conspiracy. It is a feedback loop, and it operates the same way on every platform.

The creator economy needs the dream. A large cohort of creators, course-sellers, and community-builders make their living on the dream of AI-powered solopreneurship and AI-driven business transformation. Their product requires the dream to feel achievable and imminent. They are not necessarily lying. They are operating in an information environment where the most exciting interpretation of every anecdote becomes the headline, and the boring qualifications fall away. The same pattern operates whether they sell courses on Skool, paid newsletters on Substack, coaching programs through DM funnels, or paid memberships in private communities.

The mainstream media follows the founders. Forbes contributors, CNBC anchors, magazine cover stories, and conference keynotes amplify what executives say at scale. When a CEO claims AI is replacing humans on a massive scale, that claim becomes a Forbes headline within hours. The careful operator who would qualify the claim is not on the cover. The reporter is not paid to fact-check sweeping claims about future productivity. The CEO is paid to make sweeping claims about future productivity. Both incentives point the same direction.

Most readers cannot evaluate the claim. A real estate broker reading the 13-companies post does not know what an LLM can reliably do across a multi-step adversarial negotiation. A board member watching a CNBC segment does not know what production agent reliability looks like. A YouTube viewer does not know that the case study at minute 13 has three humans behind it. So the claim seems plausible. It gets reshared. The reshare adds social proof. The next reader sees high engagement and assumes credibility. The cycle continues across every channel simultaneously.

This is the mechanism. Now combine it with the METR finding. The people in the rooms feel substantially more productive than what controlled measurement can confirm. They report their feelings as facts. The feelings travel. The facts do not.

The gap between what AI can do and what the hype machine says it can do is not a gap of dishonesty. It is a gap of measurement. The technology is impressive. The deployments are bounded. The reports of the deployments are unbounded. The marketing is unbounded raised to the power of the algorithm, multiplied across every platform that competes for attention.

If you take one thing from this post: AI is a real and useful technology, deployed by real and competent people, doing real work. None of this requires it to be running 13 companies. The hype version is not just inaccurate. It is preventing your organization from finding the actual value, because the actual value is in the cases where errors get caught cheaply, where humans stay in the loop for judgment, and where the deployment is calibrated to the stakes.

The companies that win in 2026 will not be the ones running 13 phantom AI businesses. They will be the ones who deployed AI honestly on bounded tasks, measured the gain, kept the humans, and ignored the noise. (Five percent of companies are creating substantial value from AI at scale. Here is what they actually do differently.)

Pay attention to the boring cases. That is where the value is.

Sources

1. Yann LeCun's departure from Meta, AMI Labs launch and over $1 billion in funding, and his explicit argument that scaling LLMs cannot reach human-level AI because they predict text rather than understand the world. fortune.com.
2. Anthropic engineering blog on long-running agent harnesses and Claude Agent SDK failure modes. anthropic.com.
3. Multi-step cyber attack benchmark results, March 2026 preprint. arxiv.org.
4. GAIA benchmark history and current performance. awesomeagents.ai.
5. Gaia2 results showing 42% pass-at-one for top frontier models. arxiv.org.
6. WebArena benchmark results. awesomeagents.ai.
7. MIT GenAI Divide report, 95% of pilots delivered no measurable P&L impact. computing.co.uk.
8. Fortune coverage of MIT GenAI Divide report. fortune.com.
9. PwC AI Agent Survey, May 2025, 308 senior US executives. pwc.com.
10. KPMG Q1 2025 AI Pulse Survey, 130 C-suite leaders at organizations with over $1 billion in revenue. kpmg.com.
11. McKinsey, The State of AI in 2025: agents, innovation, and transformation. mckinsey.com.
12. Gartner forecast on agentic AI project cancellations. kore.ai.
13. METR, "We are Changing our Developer Productivity Experiment Design," February 24, 2026. The revised analysis covers both the original early-2025 study and the late-2025 follow-up. metr.org.
14. METR original study paper, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity." arxiv.org.
15. Moffatt v. Air Canada legal analysis. mccarthy.ca.
16. Air Canada chatbot case liability analysis. pinsentmasons.com.
17. Chevy of Watsonville $1 Tahoe binding offer analysis. qacomet.com.
18. NYC MyCity chatbot misinformation cases. cxtoday.com.
19. Stanford RegLab and HAI, Hallucinating Law: legal mistakes with general purpose LLMs are pervasive (69 to 88% on specific legal queries). hai.stanford.edu.
20. Stanford RegLab and HAI, AI on Trial: specialized legal AI tools (Lexis+ AI, Westlaw, Practical Law) hallucinate 17 to 34% of the time. hai.stanford.edu.
21. Microsoft Research GitHub Copilot controlled experiment. microsoft.com.
22. GitHub-Harvard productivity analysis. arxiv.org.
23. Cisco evaluation of GitHub Copilot across 15 development tasks. arxiv.org.
24. Jason Lemkin, SaaStr, on what actually drove their 140% sales result with 1.25 humans and 20 AI agents. saastr.com.
25. Salesforce CEO Marc Benioff on cutting 4,000 support jobs and replacing them with AI agents. fox4news.com.
26. Circle CEO Jeremy Allaire on AI agents replacing work at scale, interview at the Economic Club of New York. yahoo.com.
27. Goldman Sachs economist Pierfrancesco Mei on AI-driven labor displacement raising unemployment to 4.5% in 2026. yahoo.com.
28. Goldman Sachs economist Joseph Briggs on AI displacing 6-7% of workers over a 10-year transition. goldmansachs.com.
29. Meta SEC 8-K filing of Zuckerberg's "Update on Meta's Year of Efficiency" memo, March 14, 2023, announcing roughly 10,000 additional cuts on top of the prior 11,000, with explicit framing about removing management layers. sec.gov.
30. CNBC coverage of Salesforce's 10% layoff announcement, January 4, 2023, with Benioff's direct quote on over-hiring during the pandemic. cnbc.com.

Frequently Asked Questions

Are AI agents really running entire businesses with no human employees?

No. The evidence is consistent across the AI labs, controlled studies, benchmarks, and practitioner reports: AI agents do not reliably run multi-step adversarial business workflows in 2026. MIT found 95% of corporate generative AI pilots delivered no measurable P&L impact. Frontier benchmarks like GAIA and WebArena still sit well below human performance. The viral 13-companies-zero-employees claims are perception reports from inside a measured perception-reality gap, not verified production deployments.

Why do so many people on LinkedIn and YouTube claim AI agents replaced their team?

METR, an independent evaluation nonprofit, ran randomized controlled trials and found that developers using AI tools in early 2025 took 19% longer to complete tasks while estimating they were 20% faster. A 39 percentage point gap between perception and measured reality. People genuinely feel more productive with AI. Controlled measurement consistently shows the magnitude is smaller than the testimonials suggest. The viral claims are sincere reports from inside that gap.

What is the right framework for deciding where AI agents work?

Verifiability with recoverability as the safety net. Four questions decide whether a deployment is sound: Can the output be verified? By whom or what, a human, a piece of software, a downstream system? If verification fails or does not happen in time, what breaks? Can the mistake be undone? Cheap verification with cheap recovery means AI can run with limited human review. Expensive verification or no recovery means humans stay in the loop. The successful deployments cluster on the verifiable, recoverable side. The failed deployments do not.

Evaluating an AI agents pitch?

I help mid-market operators figure out which AI agent deployments will actually deliver and which are repackaged hype. Let's talk.
