TL;DR
Developers using AI coding tools believe they’re 24% faster. A randomized controlled trial measured the opposite: 19% slower on real tasks. Across 10,000+ developers on 1,255 teams, AI adoption correlated with 21% more tasks completed but 91% longer PR review times, and no significant productivity gain at the company level. The perception gap is real, the data is stacking up, and teams that ignore it risk optimizing for feeling productive instead of actually shipping.
The 43-Point Gap Between Feeling and Measuring
I spend eight to ten hours a day inside AI coding tools. Cursor, Claude Code, Copilot. I rotate between them depending on the task, and I’ve been doing this long enough that the workflow feels like second nature. When someone asks whether these tools make me faster, my gut answer is yes. Of course they do. I watch code appear in real time, I skip the boilerplate, I get unstuck on unfamiliar APIs in minutes instead of hours.
Then I read the METR study. Sixteen experienced open-source developers, people who maintained repos with 22,000+ stars and over a million lines of code, were given 246 real issues to solve. Half the time they could use any AI tool they wanted (most picked Cursor Pro with Claude 3.5/3.7 Sonnet). Half the time they worked without AI. The result: tasks took 19% longer with AI.
The developers themselves predicted a 24% speedup. Even after finishing, they still believed AI had helped, rating their perceived speedup at 20%. That’s a 43-point gap between what experienced developers expected and what the clock says, and still a 39-point gap after they had done the work.
Four separate studies in the past year have converged on the same conclusion: AI coding tools do something to developer perception that doesn’t match what they do to developer output.
Inside the METR Study: What Actually Happened
The METR study (Model Evaluation and Threat Research) is the most tightly controlled measurement of AI coding tool productivity so far. It outweighs the McKinsey surveys and vendor-sponsored benchmarks because it’s a randomized controlled trial with screen recordings, not a self-report.
The setup was designed to be hostile to bias. Sixteen developers were recruited from repositories they’d contributed to for years. These weren’t students solving LeetCode. They were maintainers solving bugs and building features in codebases they knew intimately. Each developer received a randomized list of real issues. For each issue, a coin flip determined whether they could use AI tools or not. Screen recordings captured everything.
Tasks averaged about two hours each. With AI, they took 19% longer — roughly 2 hours 23 minutes versus 2 hours flat. The confidence interval didn’t include zero.
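As a quick sanity check on that arithmetic, here is a minimal Python sketch. The two-hour baseline and the 19% figure come from the paragraph above; the function names are my own and purely illustrative.

```python
def with_slowdown(baseline_minutes: float, slowdown_pct: float) -> float:
    """Return the task duration after applying a percentage slowdown."""
    return baseline_minutes * (1 + slowdown_pct / 100)


def as_hours_minutes(minutes: float) -> str:
    """Format a duration in minutes as 'Xh YYm', rounded to the nearest minute."""
    total = round(minutes)
    return f"{total // 60}h {total % 60:02d}m"


baseline_minutes = 120  # "about two hours" per task without AI
ai_minutes = with_slowdown(baseline_minutes, 19)

print(as_hours_minutes(baseline_minutes), as_hours_minutes(ai_minutes))
# prints: 2h 00m 2h 23m
```

On a two-hour task the slowdown is about 23 minutes, which is easy to miss in the moment and hard to ignore across a week of tasks.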
The screen recordings showed why. With AI, developers spent less time writing code and searching documentation. Those parts were genuinely faster. But they spent more time in a new category of activity: prompting, waiting for responses, reading AI output, deciding whether to accept it, editing what they accepted, and re-prompting when the first attempt missed. Less than 44% of AI suggestions were accepted. The rest were discarded, edited, or triggered a new round of prompting.
There’s a compounding factor the numbers alone don’t capture. These developers knew their codebases well. They had mental models of how the architecture worked, where the gotchas lived, which patterns the project used. The AI didn’t share that context. Every suggestion required a translation step: does this approach match how we do things here? The developers with the deepest repository knowledge saw the largest slowdowns.
The researchers were careful about scope. They explicitly stated the study doesn’t prove AI tools slow down all developers in all contexts. It tested experienced developers on familiar codebases. Junior developers on unfamiliar code might show different results. But the population the study did test, the senior engineers companies depend on most, got slower.
The Faros Report: 10,000 Developers, Same Story
If METR was a surgical experiment, the Faros AI Productivity Paradox report is the epidemiological study. It analyzed telemetry from 10,000+ developers across 1,255 teams using data from task management, IDEs, version control, CI/CD pipelines, and incident management.
The headline numbers look like a success story at first glance:
| Metric | Change with AI adoption |
|---|---|
| Tasks completed per developer | +21% |
| Pull requests merged | +98% |
| PRs handled per day | +47% |
| Tasks touched per day | +9% |
More code. More PRs. More throughput. Exactly what the vendors promised.
Then you read the rest of the table:
| Metric | Change with AI adoption |
|---|---|
| PR review time | +91% |
| Average PR size | +154% |
| Bugs per developer | +9% |
| Company-level productivity | No significant change |
PR review time nearly doubled. PRs grew to roughly 2.5x their previous size. Bugs ticked up. And when Faros zoomed out from individual developers to company-level outcomes, the gains vanished. The researchers used Spearman rank correlation across companies and found “no significant correlation between AI adoption and improvements at the company level.”
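For readers unfamiliar with the method, Spearman rank correlation asks whether two measures rise and fall together in rank order, without assuming a straight-line relationship. The sketch below is a plain-Python illustration, not Faros’s analysis: the six companies and all of their numbers are invented, and `average_ranks`, `pearson`, and `spearman` are hypothetical helper names.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented data for six imaginary companies (NOT from the Faros report).
adoption_pct = [10, 25, 40, 55, 70, 85]
cycle_time_change_pct = [-3, 4, -1, 2, -4, 3]

rho = spearman(adoption_pct, cycle_time_change_pct)
print(f"Spearman rho = {rho:.3f}")
```

A coefficient near zero, as in this toy data, means a company’s adoption rank tells you almost nothing about its outcome rank. The sketch stops at the coefficient; a real analysis would also test whether it is statistically significant.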
The bottleneck moved. Developers wrote code faster, so they wrote more of it. But every line of AI-generated code still needed a human to review it, and that human was now drowning in pull requests that were each 2.5 times the size they used to be. The system’s throughput didn’t change; the pressure shifted downstream.
Why Your Brain Lies to You About AI Speed
I’ve noticed this in my own workflow. I’ll ask Claude Code to scaffold a new module, watch 200 lines appear in seconds, and think: that would have taken me 30 minutes. But then I spend 20 minutes reading what it wrote, rewriting the parts that don’t match the project’s conventions, fixing the edge case it missed, and adjusting the tests. If I’d written it from scratch, I would have started with the architecture I wanted and built toward it. Instead, I started with someone else’s architecture and had to retrofit my intentions onto it.
The METR researchers pointed to a specific cognitive mechanism, one that echoes what BCG found in a study of 1,488 workers about AI-induced cognitive overload. When AI generates code instantly, it triggers a sense of progress: code appeared, something happened, the file is longer than it was. That feels like work getting done. But the remaining work (reviewing, editing, integrating, debugging) is cognitively invisible until you’re in it. By then, the sense of progress is already locked in. Even after experiencing the slowdown, METR’s developers rated their perceived speedup at 20%.
The Gradle developer productivity report identified the same dynamic from the organizational side. Teams using AI tools reported feeling faster in surveys but showed flat or declining delivery velocity in their pipeline metrics. The feeling of speed didn’t correlate with the speed of shipping.
A UC Berkeley study covered by Scientific American found that developers at a US tech company who adopted AI voluntarily took on more tasks, worked at a faster pace, and worked longer hours. Some began prompting AI during lunch and breaks. The tool that was supposed to save time instead expanded the work surface. There was always one more thing the AI could help with, one more feature to start, one more PR to open.
The PR Review Bottleneck
The Faros data makes a structural problem visible. In a typical engineering organization, the time developers spend writing code and the time they spend reviewing it stay roughly in balance. AI tools accelerated writing but left reviewing untouched, and that balance collapsed.
When each developer produces 98% more PRs and each PR is 154% larger, the volume of code awaiting review doesn’t double. If the two increases compound, it grows roughly fivefold. One senior engineer I talked to in a Cyprus-based fintech described it this way: “Before AI, I’d review three or four PRs a day and they’d each be 50 to 150 lines. Now I see eight PRs a day and some are 400 lines. I can’t review 400 lines of AI-generated code as carefully as 100 lines of code I watched someone write.”
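To put a rough number on that, here is a back-of-the-envelope sketch. It assumes the two Faros figures apply to the same team and compound with each other, which the report does not state (it gives them as separate correlations), so treat the result as illustrative rather than measured. The `multiplier` helper is my own naming.

```python
def multiplier(change_pct: float) -> float:
    """Convert a percent change into a multiplier, e.g. +98% -> 1.98."""
    return 1 + change_pct / 100


pr_count_multiplier = multiplier(98)    # PRs merged, per the Faros figures
pr_size_multiplier = multiplier(154)    # average PR size, per the Faros figures

# Lines arriving for review = number of PRs x lines per PR.
review_volume_multiplier = pr_count_multiplier * pr_size_multiplier

print(f"{review_volume_multiplier:.2f}x the lines to review")
# prints: 5.03x the lines to review
```

Under those assumptions, reviewers face about five times as many lines as before, with no matching increase in review capacity.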
This bottleneck has a specific shape. The people who need to approve PRs are usually the most experienced developers on the team, exactly the population METR showed gets slower with AI tools. So the developers who benefit least from AI-generated code are also the ones being buried under the most AI-generated code to review.
Some teams have started experimenting with AI-assisted code review (tools like CodeRabbit, Graphite, and GitHub’s own Copilot for PR review). Early results are mixed. AI reviewers catch formatting issues and obvious bugs but miss architectural problems, subtle logic errors, convention violations, and security-sensitive patterns like leaked secrets, which is exactly what senior reviewers are there to catch.
Who Benefits, Who Gets Hurt
Not all the data points in the same direction. The picture is split along experience lines.
Junior developers come out ahead. A McKinsey study found that AI tools cut time on code documentation by 45-50% and code generation by 35-45%, but the savings drop to less than 10% for complex, unfamiliar tasks. Junior engineers spend more of their time on routine work (boilerplate, standard CRUD, well-documented API integrations), so they capture more of the benefit. An MIT study found that newer hires showed higher adoption rates and larger productivity gains.
Senior developers, on the other hand, get measurably slower. The METR results are the clearest signal here. Developers who knew the codebase, understood the solution architecture, and had strong debugging intuitions didn’t need the AI’s suggestions. The tool added friction to a workflow that was already efficient. The 19% slowdown was largest for developers with the deepest repository familiarity.
The organizational sweet spot turns out to be narrow. Companies with moderate AI adoption (tools available but not mandated, guidelines for when to use them, protected review time) seem to do better than either extreme. All-in mandates create the bottlenecks Faros documented. No adoption means missing the genuine gains on routine tasks.
This creates a difficult dynamic for the job market. Companies replacing junior roles with AI are removing exactly the population that benefits most from the tools. Meanwhile, the senior engineers who remain get more code to review and a heavier workload. The 100,000+ tech layoffs in 2026 are concentrated in routine roles — but the data suggests those are the roles where AI tools actually delivered on the productivity promise.
The Skills Tax
Anthropic published a study on AI assistance and coding skills in early 2026 that adds another dimension. Fifty-two engineers learning a new Python library (Trio, for async programming) were randomly split: half could use Claude, half couldn’t.
The AI group finished exercises at roughly the same speed. But on a comprehension test afterward, they scored 50% versus the manual group’s 67%, a 17-point gap. The gap was largest on debugging questions. Within the AI group, interaction style predicted outcomes: developers who used AI for conceptual inquiry (“explain how Trio cancellation scopes work”) scored 65%+ on the overall quiz. Those who delegated code generation (“write a Trio server that handles cancellation”) scored below 40%.
The implication for teams is concrete. If you’re adopting AI tools across an engineering org, the junior developers who benefit most from code generation are also the ones most at risk of stunted skill growth. They ship more code today but understand less of it. The debugging and architectural skills they would have built by struggling through problems don’t develop when an AI does the struggling for them.
Anthropic ships a teaching-oriented output style in Claude Code (the “Explanatory” style) designed to explain rather than just generate. Whether teams actually use it instead of the fast generation defaults is an open question.
What to Actually Do With This Data
The paradox signals that AI coding tools are being deployed without understanding their constraints. Here’s what the data suggests for different roles.
For your own workflow:
- Track your own numbers. Measure time-to-merge, not time-to-first-draft. Most developers overcount the writing phase and undercount review, debugging, and rework.
- Use AI for exploration, not production. I get the most value when I’m learning a new library or brainstorming an approach. The value drops when I’m writing code in a codebase I know well.
- Accept less, think more. The sub-44% acceptance rate from METR isn’t a failure of the tools. It’s a signal that experienced developers are right to reject most suggestions. If you’re accepting 80%+ of AI completions, you’re probably not reviewing carefully enough.
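Tracking your own numbers doesn’t need tooling. A minimal sketch, assuming you jot down three timestamps per task: the log below is entirely made up, and the field names (`started`, `first_draft`, `merged`) are hypothetical, not taken from any tracker’s export format.

```python
from datetime import datetime
from statistics import median

# Hypothetical personal log: one entry per task, ISO 8601 local timestamps.
task_log = [
    {"started": "2026-03-02T09:00", "first_draft": "2026-03-02T09:30", "merged": "2026-03-02T13:00"},
    {"started": "2026-03-03T10:00", "first_draft": "2026-03-03T11:00", "merged": "2026-03-04T10:00"},
    {"started": "2026-03-05T14:00", "first_draft": "2026-03-05T14:15", "merged": "2026-03-05T16:00"},
]


def hours_between(start: str, end: str) -> float:
    """Elapsed wall-clock hours between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 3600


draft_hours = [hours_between(t["started"], t["first_draft"]) for t in task_log]
merge_hours = [hours_between(t["started"], t["merged"]) for t in task_log]

median_draft = median(draft_hours)
median_merge = median(merge_hours)

print(f"median time to first draft: {median_draft:.2f} h")
print(f"median time to merge:       {median_merge:.2f} h")
```

In this invented log the first draft lands in half an hour, but the median task takes four hours to merge. The draft number is the one that feels like speed; the merge number is the one that counts. Wall-clock hours include nights and waiting on reviewers, so compare like with like across AI and non-AI tasks.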
At the team level:
- Budget for review load. If your team adopts AI tools, review time will grow. Staff for it. Protected review time, review rotations, and PR size limits become load-bearing infrastructure.
- Don’t measure output in lines of code or PRs merged. The Faros data shows those numbers go up while actual delivery stays flat. Measure cycle time from issue creation to production deployment.
- Watch for the skills gap. Junior developers need deliberate practice time without AI (pair programming, code katas, architectural discussions) or they’ll hit a ceiling when they need to debug something the AI can’t fix.
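A PR size limit only works if something enforces it. Here is a sketch of the kind of check a team could run before review. Everything in it is an assumption for illustration: the 400-line limit, the `PullRequest` shape, and the sample PRs are hypothetical, and the code doesn’t call any real Git hosting API.

```python
from dataclasses import dataclass

MAX_LINES_CHANGED = 400  # hypothetical team limit; tune to your own review capacity


@dataclass
class PullRequest:
    number: int
    lines_added: int
    lines_deleted: int

    @property
    def lines_changed(self) -> int:
        """Total diff size a reviewer has to read."""
        return self.lines_added + self.lines_deleted


def oversized(prs, limit=MAX_LINES_CHANGED):
    """Return the PRs whose diff exceeds the limit, largest first."""
    flagged = [pr for pr in prs if pr.lines_changed > limit]
    return sorted(flagged, key=lambda pr: pr.lines_changed, reverse=True)


# Made-up open PRs for demonstration.
open_prs = [
    PullRequest(number=101, lines_added=80, lines_deleted=20),
    PullRequest(number=102, lines_added=520, lines_deleted=60),
    PullRequest(number=103, lines_added=300, lines_deleted=100),
    PullRequest(number=104, lines_added=390, lines_deleted=45),
]

for pr in oversized(open_prs):
    print(f"PR #{pr.number}: {pr.lines_changed} lines changed, consider splitting")
```

The limit is strict (a PR at exactly 400 lines passes), and the point is less the exact threshold than making oversized PRs visible before they reach a senior reviewer’s queue.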
For the 2026 job market:
- The market is bifurcating. ML engineer openings are up 59% from a 2020 baseline. General software engineering is down 49%. The developers who thrive aren’t the ones who use AI the fastest. They’re the ones who understand the code well enough to review it critically.
- Understanding these productivity numbers gives you an edge in interviews. When a hiring manager says “we’ve adopted AI across the org and productivity is up 40%,” you can ask: “Are you measuring task completion or cycle time?” That’s the kind of question that separates someone who reads vendor marketing from someone who reads the research.
FAQ
Do AI coding tools actually improve developer productivity?
It depends on who’s using them and how you measure. For routine tasks by junior developers, the McKinsey study measured 35-50% time savings on code generation and documentation. For experienced developers on complex, familiar codebases, the METR study measured a 19% slowdown. At the organizational level, Faros found no significant productivity improvement across 1,255 teams despite 21% more individual tasks completed. The tools change what developers do with their time more than how much they get done.
Why do AI coding tools make experienced developers slower?
Experienced developers already have efficient workflows for codebases they know well. AI tools add new steps (prompting, waiting, reviewing suggestions, deciding what to accept, editing accepted code) that replace activities the developer was already fast at. Less than 44% of AI suggestions were accepted in the METR study, meaning most suggestions cost review time without producing code that was kept. The cognitive load of evaluating someone else’s approach is often higher than just writing your own.
Does AI make coding faster?
Code generation is faster. Code shipping is not — or at least not measurably so at the organizational level. The bottleneck moved from writing to reviewing. PRs grew 154% larger and took 91% longer to review. The total time from “developer starts working” to “code is in production” hasn’t improved in most organizations studied.
Why are developers using AI working longer hours?
A UC Berkeley study found that developers who adopted AI voluntarily expanded their work scope, taking on more tasks, working at a faster pace, and prompting AI during lunch and breaks. The tool that saves 20 minutes on one task makes it feel possible to start three more tasks. The net result is more output but also more hours logged.
How does AI assistance impact coding skills?
An Anthropic RCT found that developers using AI scored 17 points lower on comprehension tests when learning a new library (50% vs. 67%). The gap was largest for debugging skills. Developers who asked the AI to explain concepts retained knowledge; those who delegated code generation didn’t. AI tools help experienced developers apply existing skills but can prevent junior developers from building new ones.
Sources
- METR: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity — the randomized controlled trial that found the 19% slowdown
- Faros AI Productivity Paradox Report — telemetry from 10,000+ developers showing 91% longer PR reviews and zero company-level gains
- McKinsey: Unleashing Developer Productivity with Generative AI — 4,500-developer survey finding 46% time savings on routine tasks, <10% on complex ones
- Anthropic: How AI Assistance Impacts the Formation of Coding Skills — RCT showing 17-point comprehension gap when learning with AI
- Scientific American: Why Developers Using AI Are Working Longer Hours — UC Berkeley research on work intensification
- Gradle: The Developer Productivity Paradox — pipeline metrics showing flat delivery velocity despite perceived speedup
Bottom Line
The productivity paradox isn’t a reason to stop using AI coding tools. I won’t — they’re too useful for exploration, learning new libraries, and handling the parts of coding I find tedious. But the data is clear: teams that adopt these tools without measuring the downstream effects (review load, bug rates, delivery speed) are buying a perception of productivity while actual delivery stays flat. Measure cycle time, not vibes. Budget for the review bottleneck before it buries your senior engineers. And if you’re a developer, track your own numbers — because your brain is lying to you about how fast you are.