
The Reviewer Paradox: Why Code Review Is Where AI Quality Breaks Down

Span Research Team

Across 248,099 PRs, three review-side signals correlate with higher defect rates: longer cycle times, more reviewers, and cross-team involvement. None of them are causing the defects. All three are markers of the same underlying variable — coordination cost — and AI is quietly increasing how often that variable gets triggered. 

When teams adopt AI tooling, the instinct is to measure what's changed upstream: adoption rates, lines of AI-generated code, velocity. These are among the least informative data points to track.

The most informative change for teams lies in what happens after code is written. AI accelerates generation, but review, coordination, and PR lifecycle have not accelerated in lockstep.

Span's Analysis

Across 248,099 pull requests from a selected sample of engineering organizations, we looked at three signals from the review and integration side of the pipeline: cycle time, reviewer count, and cross-team review involvement. Each one tells a piece of the same story: that AI is exposing inefficient code review practices, and the highest-performing teams operate in systems that have allowed code review workflows to scale with AI adoption.

Signal 1: Cycle time

The relationship between how long a PR stays open and whether it ships with a defect is clean and directional.

Cycle time | Defect rate
1-4 days   | ~6.7%
7+ days    | Climbs materially

Short-cycle PRs are the safest merges in the dataset. This feels counterintuitive if you believe review quality scales with review time, but it makes sense the moment you look at what long-lived PRs accumulate.

A PR that sits open for a week collects stale context. The branch drifts from main. The author forgets what they wrote. Reviewers lose the thread and re-read with reduced attention. The surrounding system changes underneath the PR, creating integration surface area that didn't exist when the PR was opened. Bugs introduced elsewhere in the product during that week get tied back to the long-lived PR during incident review, whether or not it caused them.

The shorter the cycle, the less these factors compound. Fast-merging PRs catch less blame because there's less time for anything to go wrong.

This isn't an argument that rushing PRs through produces quality. It's the opposite: the PRs that merge quickly are usually the ones small and focused enough to merge quickly. Cycle time is a proxy for how contained a change is, and contained changes are safer.
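The bucketing behind the table above can be sketched in a few lines of pandas. The frame, the `cycle_days` and `has_defect` field names, and the numbers are all illustrative assumptions, not Span's actual pipeline:

```python
import pandas as pd

# Hypothetical PR records: cycle time in days, plus whether the merged
# PR was later linked to a production incident.
prs = pd.DataFrame({
    "cycle_days": [1, 2, 3, 4, 8, 10, 14, 2, 9, 1],
    "has_defect": [0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Bucket cycle time the way the table does (a middle bucket is added
# here so the bins cover every value).
prs["bucket"] = pd.cut(
    prs["cycle_days"],
    bins=[0, 4, 6, float("inf")],
    labels=["1-4 days", "5-6 days", "7+ days"],
)

# Defect rate per bucket: share of PRs in the bucket linked to a defect.
defect_rate = prs.groupby("bucket", observed=True)["has_defect"].mean()
print(defect_rate)
```

The same groupby works for any proxy of containment, which is the point: cycle time is just the easiest one to extract from PR metadata.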

Signal 2: The Reviewer Paradox

More reviewers correlates with more defects, not fewer.

Number of Reviewers | Defect rate
1 | ~6%
2 | ~8%
3 | ~10%
4 | ~12%

The first read is that reviewers are somehow degrading quality by piling on, diffusing accountability, generating noise. The actual explanation is simpler: nobody arbitrarily assigns that many reviewers.

Reviewer count is a proxy. When a PR pulls in three or four reviewers, it's because the change crosses team boundaries, touches shared infrastructure, or involves domain expertise the author doesn't have. These are exactly the PRs most likely to ship defects not because of the reviewers, but because of the underlying complexity and coordination cost that made the extra reviewers necessary in the first place.

The paradox dissolves once you stop reading reviewer count as a quality intervention and start reading it as a signal. More reviewers means harder work. Harder work means more defects. 

Signal 3: Cross-Team Review

The cross-team pattern reinforces the same point from a different angle. PRs requiring review from another team correlate with higher defect rates, and the cohort analysis picked this up cleanly: AI Champions ran 22.7% cross-team review while AI Risk Zone ran 51.4%, at matched levels of AI adoption (see: Autonomous Teams Ship Cleaner AI Code).

Cross-team work is harder than single-team work because context is thinner and ownership is fragmented. Communication overhead is also higher. Each of these is a defect vector, and none of them has anything to do with whether the code was AI-assisted.

What AI changes is the rate at which these situations arise. When code generation is cheap, it's easier to touch adjacent systems, propose broader changes, and pull in more stakeholders. The structural cost of cross-team work doesn't drop with AI. The volume of cross-team work goes up.

How These 3 Signals Come Together

Cycle time, reviewer count, and cross-team review are not independent findings. They're different views of the same underlying variable: how much coordination a PR requires to ship.

Long cycle times, extra reviewers, and cross-team involvement are all downstream markers of coordination cost. The defects they correlate with aren't caused by the reviewers or the days elapsed — they're caused by the thing that made those reviewers and those days necessary. 

AI doesn't cause coordination costs. It increases the rate at which coordination-heavy work gets proposed, by making the generation cheap enough that larger, more cross-cutting changes become practical. The review system absorbs the new volume, but not without degradation.

Code Generation Was Never the Real Bottleneck

One thing that AI has exposed is that the real bottleneck was never writing code. Review, merge coordination, and integration lifecycle are now the scarce resources, and they haven't been scaled to match. 

The true bottleneck is the engineering system in which a team works. The teams significantly improving their cycle times without sacrificing quality are the ones that had built robust operating systems before AI adoption. Teams that hadn't built as strong a foundation have ended up with review queues that absorb more volume at the same attention budget, which degrades review quality and, in turn, leads to more defects.

3 Things You Can Do To Address the Reviewer Paradox

Watch cycle time more carefully than throughput

Throughput is what AI accelerates. Cycle time is what determines whether the throughput is safe. If your team is merging more PRs per week but cycle times are lengthening, you're generating code faster than your review system can metabolize it. That gap is where defects live.
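One way to watch for that gap is to track weekly throughput and cycle time together and flag weeks where both rise. A minimal sketch with pandas, on an invented merge log whose `merged_at` and `cycle_days` fields are hypothetical, not a real schema:

```python
import pandas as pd

# Hypothetical merge log: one row per merged PR.
merges = pd.DataFrame({
    "merged_at": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03",
        "2024-01-08", "2024-01-09", "2024-01-10", "2024-01-11",
    ]),
    "cycle_days": [2, 3, 2, 5, 7, 6, 8],
})

# Weekly throughput (merged PR count) and median cycle time.
weekly = (
    merges.assign(week=merges["merged_at"].dt.to_period("W"))
    .groupby("week")
    .agg(throughput=("cycle_days", "size"),
         median_cycle=("cycle_days", "median"))
)

# The warning sign: more PRs per week *and* longer cycle times.
weekly["warning"] = (
    (weekly["throughput"].diff() > 0) & (weekly["median_cycle"].diff() > 0)
)
print(weekly)
```

Throughput rising on its own is fine; throughput and cycle time rising together is the signature of a review system falling behind generation.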

Read reviewer count as a complexity signal, not a quality lever

Adding reviewers to a risky PR feels like a control. The data suggests it's more useful as a flag: any PR that needs three or more reviewers is a PR that should be examined for whether it can be split, scoped down, or delayed until the cross-team dependencies are resolved. Treat high reviewer count as a prompt to restructure the work, not as a substitute for restructuring it.
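As a minimal sketch of that flag, where the threshold and the PR shape are assumptions rather than Span's schema:

```python
# Threshold taken from the defect-rate table above; tune per org.
REVIEWER_THRESHOLD = 3

def flag_for_restructuring(pr: dict) -> bool:
    """Treat high reviewer count as a prompt to split or re-scope the
    change, not as extra quality assurance."""
    return len(pr["reviewers"]) >= REVIEWER_THRESHOLD

# Hypothetical review queue.
queue = [
    {"id": 101, "reviewers": ["a"]},
    {"id": 102, "reviewers": ["a", "b", "c", "d"]},
]
flagged = [pr["id"] for pr in queue if flag_for_restructuring(pr)]
print(flagged)  # [102]
```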

Make cross-team work visible and expensive

Cross-team review is the single strongest review-side signal we found, and it's usually invisible in dashboards. Most orgs track PR count and cycle time but not what fraction of PRs cross team boundaries. Instrumenting this — and giving teams a reason to minimize it — is one of the highest-leverage workflow changes available in an AI-heavy environment.
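A minimal instrumentation sketch, assuming you can map reviewer logins to teams; every name, team, and field here is invented for illustration:

```python
import pandas as pd

# Hypothetical reviewer-to-team mapping.
team_of = {"a": "payments", "b": "payments", "c": "infra", "d": "search"}

# Hypothetical PR records: the author's team and the reviewers pulled in.
prs = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "author_team": ["payments", "payments", "infra", "search"],
    "reviewers": [["a"], ["c"], ["c", "d"], ["a", "b"]],
})

# A PR is cross-team if any reviewer sits outside the author's team.
prs["cross_team"] = [
    any(team_of[r] != team for r in reviewers)
    for team, reviewers in zip(prs["author_team"], prs["reviewers"])
]

cross_team_fraction = prs["cross_team"].mean()
print(f"{cross_team_fraction:.0%} of PRs cross team boundaries")
```

Tracked weekly alongside PR count and cycle time, this fraction makes the most predictive review-side signal visible instead of implicit.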

The Bottom Line

AI doesn't degrade code quality by writing worse code. It degrades code quality by increasing the volume and coordination-heaviness of the work flowing into review systems that weren't previously designed to handle it. The bottleneck has moved from code generation to code integration, and most engineering orgs are still measuring the old bottleneck.

The teams that capture AI's leverage without paying the quality tax are the ones that noticed the shift early and scaled review, coordination, and PR lifecycle discipline in proportion. The teams that didn't are the ones explaining their defect rate in retros.

Watch what happens after the code is written. That's where AI quality is decided now.

Analysis based on 248,099 pull requests across a selected sample of engineering organizations, with defect attribution drawn from merged PRs linked to production incidents. Signals reported are cycle time, reviewer count, and cross-team review involvement, examined individually and in combination. Directional findings validated across organizations in the sample.
