2025 United States meta

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

METR, July 2025 — randomized trial. Developers were 19% slower with AI tools. They believed they were 20% faster. The instruments are lying.

METR (Model Evaluation and Threat Research) ran the first rigorous randomized controlled trial of AI tooling on real software-engineering work. Sixteen experienced open-source developers, averaging five years on their own repositories, completed 246 tasks, each randomly assigned to one of two conditions: AI-allowed (Cursor Pro with Claude 3.5/3.7 Sonnet) or AI-forbidden. Result: developers using AI took 19% longer to complete tasks (95% CI: +2% to +39%), reversing the predicted speedup. Critically, the developers themselves reported a 20% speedup from AI, and outside experts were wrong in the same direction: economists predicted a 39% speedup and ML researchers a 38% speedup. This is the first measurement in this domain that separates perceived productivity from measured productivity. The result does not show that AI slows down all developers; it shows a slowdown for experienced developers working on familiar codebases, exactly where global-invariant maintenance is most concentrated.
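The design described above can be sketched in a few lines. This is a minimal illustration, not METR's analysis code: the names `assign_conditions` and `estimated_time_change`, the fair coin flip per task, and the ratio-of-geometric-means estimator are all assumptions chosen for brevity.

```python
import math
import random
import statistics

CONDITIONS = ("ai_allowed", "ai_forbidden")


def assign_conditions(task_ids, rng):
    """Assign each task to a condition by an independent fair coin flip (assumed scheme)."""
    return {task_id: rng.choice(CONDITIONS) for task_id in task_ids}


def estimated_time_change(minutes_by_task, assignment):
    """Return (geometric mean time with AI / geometric mean time without AI) - 1.

    A value of 0.19 would mean AI-allowed tasks took 19% longer.
    Raises statistics.StatisticsError if either condition has no tasks.
    """
    log_minutes = {condition: [] for condition in CONDITIONS}
    for task_id, minutes in minutes_by_task.items():
        log_minutes[assignment[task_id]].append(math.log(minutes))
    log_ratio = (statistics.mean(log_minutes["ai_allowed"])
                 - statistics.mean(log_minutes["ai_forbidden"]))
    return math.exp(log_ratio) - 1.0


# 246 tasks, as in the study; the task names and the seed are hypothetical.
rng = random.Random(0)
assignment = assign_conditions([f"task-{i}" for i in range(246)], rng)
```

Because the condition is set per task by a coin flip rather than chosen by the developer, differences in task difficulty should average out across the two conditions. That is what allows a gap in completion time to be read as an effect of the tooling rather than of which tasks developers chose to hand to the AI.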

In plain terms

Up to this point in the book the empirical evidence has been observational. GitClear measures what the code did. DORA measures what the teams reported. Stanford measured outcomes on synthetic security tasks. Each one is real evidence. None is a randomized controlled trial.

METR ran the trial. Sixteen developers, all experienced: five years on average on the open-source projects they work on. They brought 246 real tasks from their own backlogs. Each task was randomly assigned to one of two conditions: AI-allowed (Cursor Pro plus Claude 3.5 or 3.7 Sonnet) or AI-forbidden. The developers logged time honestly. The data was clean.

The result inverted every prediction. In the AI-allowed condition, developers took 19% longer on average to complete their tasks. Not a small effect. The 95% confidence interval was +2% to +39%: it excludes zero, so the slowdown is statistically significant, and the point estimate is meaningful in size. And the slowdown occurred on real work in real repositories, not toy benchmarks.

This result is the keystone of the empirical case because of the self-report contradiction. After the experiment, the same developers were asked to estimate the effect. They reported that AI made them 20% faster on average. The actual measurement was 19% slower. The gap between perceived and actual productivity in this domain is roughly 39 percentage points.

Outside experts were no better. Economists predicted a 39% speedup. ML researchers predicted a 38% speedup. Both groups were wrong by similarly large margins. Their predictions rested on a model of AI productivity that did not match what happened on real work.

Why? The METR team's hypothesis is the one this book has been arguing throughout. AI accelerates the local work: typing out code that the model has seen before. It slows down the global work: verifying that the local code does not break invariants the developer is keeping in their head. Inexperienced developers on unfamiliar codebases may benefit from AI because the local work dominates. Experienced developers on familiar codebases lose, because the global work dominates and AI does not help.

This is exactly the population doing maintenance: experienced developers, working on systems they know, preserving invariants under change. This is who Lehman wrote his laws about. This is who Hamilton was building flight software with. This is who Hoare designed his triples for. And it is exactly this population that AI tools are now demonstrably slowing down, while inverting their perception of the result. Nineteen percent slower. Felt twenty percent faster. The instruments are lying.
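The 39-point gap quoted above is signed arithmetic on two percent changes in completion time. The sketch below makes that explicit, under one labeled assumption: the self-reported "20% faster" is read as 20% less time per task. The function names are illustrative, not taken from the study.

```python
def perception_gap_points(measured_time_change_pct, perceived_time_change_pct):
    """Gap, in percentage points, between measured and perceived change in time per task."""
    return measured_time_change_pct - perceived_time_change_pct


def throughput_change_pct(time_change_pct):
    """Convert a percent change in time per task into a percent change in tasks per unit time."""
    time_ratio = 1.0 + time_change_pct / 100.0
    return (1.0 / time_ratio - 1.0) * 100.0


MEASURED = 19.0    # tasks took 19% longer with AI (figure from the text)
PERCEIVED = -20.0  # assumption: "20% faster" read as 20% less time per task

gap = perception_gap_points(MEASURED, PERCEIVED)       # 19 - (-20) = 39 points
measured_throughput = throughput_change_pct(MEASURED)  # 1/1.19 - 1, about -16%
```

One consequence of the conversion: taking 19% longer per task corresponds to finishing roughly 16% fewer tasks in the same time, so "19% longer" and "19% slower" are close but not identical statements.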