The Defect Fingerprint: How AI-Generated Code Breaks Differently Than Human Code
Span Research Team
An analysis of 65,697 production defects across a selected sample of engineering organizations reveals that AI and humans fail in fundamentally different ways — and neither is doing what you’d expect.
The debate around AI-generated code quality has been remarkably binary: does AI produce more bugs, or doesn’t it? But this framing misses a far more interesting question — even if the overall bug rate is similar, are the types of bugs the same?
We analyzed 65,697 production defects identified through automated root-cause analysis across a selected sample of engineering organizations. Each defect was traced back to the pull request that introduced it, and each PR was classified by its AI code ratio — the percentage of merged lines attributed to AI code-generation tools by editor and IDE telemetry.
The answer is clear: AI and humans fail in fundamentally different patterns, and the distinction has practical implications for how engineering teams should structure their review processes.
Span's Analysis
We let the data tell us what defect categories exist rather than defining a taxonomy in advance. Every bug summary was clustered with an unsupervised pipeline (no human-defined taxonomy), each cluster was named by an LLM acting as a labeling assistant, and each category's AI vs. human composition was tested against the dataset-wide baseline.
We re-ran the analysis on a second, independent clustering pipeline as a stability check, and we validated each finding under a regression that controls for pull-request size and organization. Categories whose direction did not hold up under both checks are not reported here. The patterns below are the ones that survived.
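To make the category-discovery step concrete, here is a toy sketch of taxonomy-free clustering. This is not Span's actual pipeline (which uses an unsupervised embedding-based clusterer plus an LLM labeler); it is a greedy Jaccard-similarity clusterer over bug summaries, just enough to show the shape of the approach. All summaries and the threshold are illustrative.

```python
# Toy sketch: group free-text bug summaries into categories without a
# predefined taxonomy, by greedy token-overlap (Jaccard) clustering.

def tokens(summary: str) -> set[str]:
    return set(summary.lower().split())

def cluster_summaries(summaries: list[str], threshold: float = 0.3) -> list[list[str]]:
    """Assign each summary to the first cluster whose seed summary it
    overlaps with by at least `threshold` Jaccard similarity."""
    clusters: list[list[str]] = []
    for s in summaries:
        ts = tokens(s)
        for cluster in clusters:
            seed = tokens(cluster[0])
            jaccard = len(ts & seed) / len(ts | seed)
            if jaccard >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])  # no similar cluster: start a new one
    return clusters

bugs = [
    "function returns null instead of user object",
    "function returns empty list instead of user object",
    "cache never invalidated after source update",
    "stale cache entry served after source update",
]
groups = cluster_summaries(bugs)  # two clusters: return-value bugs, cache bugs
```

In a real pipeline the tokens would be replaced by dense embeddings and the greedy pass by a proper clustering algorithm, but the key property is the same: the categories fall out of the data rather than out of a checklist.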
The Baseline: A Critical Detail Most Analyses Miss
Before comparing AI and human defect profiles, we need to establish a baseline. Among the bugs whose root-cause PR has a measured AI ratio, 54.4% originated from pull requests with more than 5% AI-generated code. This is our null hypothesis: if AI ratio carried no information about defect type, every category would sit near 54%.
Any category significantly above 54% means AI-assisted PRs are over-represented for that defect type. Any category significantly below 54% means mostly human-written PRs are over-represented. Without this baseline, raw percentages are misleading — a category where “60% of bugs come from AI PRs” sounds alarming until you realize that AI-assisted PRs already produce 54% of all defects in this sample.
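The over/under-representation arithmetic used throughout this post is simple enough to state in a few lines. The 54.4% baseline and the category shares below come from the article; everything else is illustrative.

```python
# Baseline comparison: a category's AI share minus the dataset-wide
# share of defects that come from AI-assisted PRs, in percentage points.

BASELINE = 0.544  # 54.4% of all defects originate from AI-assisted PRs

def above_baseline_pp(category_ai_share: float) -> float:
    """Over-representation of AI-assisted PRs, in percentage points."""
    return round((category_ai_share - BASELINE) * 100, 1)

# "Incorrect Return Values": 63.4% of its bugs come from AI-assisted PRs.
delta_returns = above_baseline_pp(0.634)  # +9.0pp above baseline

# "IAM Role Misconfiguration": only 29.9% come from AI-assisted PRs.
delta_iam = above_baseline_pp(0.299)      # -24.5pp below baseline
```

A category sitting near 0pp on this measure is "neutral": the tool used to write the code tells you nothing about whether that defect type will appear.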
Where AI-Assisted Code Is Over-Represented: The Boundary Problem
A consistent set of defect categories shows meaningful over-representation of AI-assisted PRs. These categories cluster around a single theme: AI struggles at boundaries, the seams where one piece of code talks to another.
AI-over-represented defect types
| Category | Bugs | AI Share | Above Baseline |
|---|---|---|---|
| Incorrect Return Values | 1,696 | 63.4% | +9.0pp |
| Component Wiring & Lifecycle Errors | 1,278 | 62.5% | +8.1pp |
| Null / Missing Data Dereference | 1,257 | 60.9% | +6.5pp |
| Stale / Uncleared UI State | 1,000 | 59.8% | +5.4pp |
| Stale Data / Cache Invalidation | 1,093 | 59.3% | +4.9pp |
| Client / Backfill Configuration Errors | 1,078 | 58.8% | +4.4pp |
| Token & Auth Mismanagement | 803 | 58.3% | +3.9pp |
| Status & Entity State Mishandling | 1,027 | 57.7% | +3.3pp |
Three patterns emerge:
Return-value and null-handling failures
The single most AI-over-represented category is “Incorrect Return Values” — functions returning wrong, null, empty, or improperly typed values instead of the expected data. Close behind is “Null / Missing Data Dereference” — code that crashes by accessing properties of a null value without validation. Together these are roughly 3,000 bugs where AI-assisted code looked correct in isolation but violated implicit contracts about what gets returned and what might be missing.
This makes mechanical sense. AI generators see the local function context but rarely the full call graph. They produce a function whose return type is plausible without understanding what the caller actually needs, or they access a field without checking whether the upstream API might return null in edge cases the local file doesn’t show.
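Both failure modes can be shown in a few lines. This is a hedged illustration, not code from the dataset; all names are hypothetical.

```python
# The two top AI-over-represented categories in one example: a function
# that looks correct in isolation but violates an implicit caller
# contract, and the null dereference that contract violation triggers.

def find_user(users: dict[str, dict], user_id: str):
    # Locally plausible: quietly returns None when the user is missing.
    return users.get(user_id)

def display_name(users: dict[str, dict], user_id: str) -> str:
    user = find_user(users, user_id)
    # A caller that implicitly assumed a dict would crash here:
    #   return user["name"]   # TypeError when user is None
    # Defensive version that honors the real contract:
    return user["name"] if user is not None else "<unknown>"

users = {"u1": {"name": "Ada"}}
```

The bug is invisible inside `find_user`: nothing in that function is wrong. It only exists at the seam between `find_user` and its callers, which is exactly where the defect data says AI-assisted code is weakest.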
State-management and lifecycle failures
Three categories — Stale/Uncleared UI State, Stale Data/Cache Invalidation, and Status/Entity State Mishandling — share a common root cause: AI-assisted code correctly implements the initial state transition but fails to handle lifecycle events like cleanup, invalidation, or entity switching. A React component correctly sets loading state but never clears it. A cache correctly stores a value but never invalidates it when the source changes. A status field correctly transitions from “pending” to “active” but doesn’t handle the reverse path.
The closely related “Component Wiring & Lifecycle Errors” category captures the broader version of the same problem: correctly written components that are wired into the surrounding system with the wrong scope, the wrong initialization order, or the wrong refetch triggers. The local code is fine. Its place in the lifecycle is not.
AI tends to generate code that works for the first execution but breaks on the second.
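A minimal sketch of that first-execution-works, second-execution-breaks pattern, using the cache-invalidation case. Everything here is illustrative.

```python
# Stale-cache pattern: the first read is correct, but a later read
# after a write is stale, because invalidation was never wired in.

class PriceService:
    def __init__(self):
        self._db = {"sku-1": 100}
        self._cache: dict[str, int] = {}

    def get_price(self, sku: str) -> int:
        if sku not in self._cache:
            self._cache[sku] = self._db[sku]  # populate on first read
        return self._cache[sku]

    def set_price(self, sku: str, price: int) -> None:
        self._db[sku] = price
        # BUG: missing `self._cache.pop(sku, None)`. The cached value
        # survives the write, so the next read returns stale data.

svc = PriceService()
first = svc.get_price("sku-1")    # 100: correct on the first execution
svc.set_price("sku-1", 120)
second = svc.get_price("sku-1")   # still 100: stale on the second
```

The write path compiles, the read path compiles, and a test that only reads once passes. Only the sequence read, write, read exposes the missing invalidation, which is why review checklists need to ask about the second execution explicitly.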
Authentication and credential lifecycle
Token & Auth Mismanagement is the cleanest illustration of “looks right, isn’t right.” AI completions pattern-match on generic token-handling boilerplate without picking up the specific credential scoping rules, refresh timing, and naming conventions unique to each integration. The code compiles and passes tests but uses the wrong secret name, caches a token that expires, or assumes a refresh path that the surrounding system doesn’t actually wire in.
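The expiring-token case can be sketched the same way. This is a hypothetical client, not any real auth SDK; `_fetch_token` stands in for a real credential exchange.

```python
import time

class AuthClient:
    def __init__(self, ttl_seconds: float):
        self._ttl = ttl_seconds
        self._token: str | None = None
        self._issued_at = 0.0
        self._fetch_count = 0

    def _fetch_token(self) -> str:
        # Stand-in for a real credential exchange.
        self._fetch_count += 1
        self._issued_at = time.monotonic()
        return f"tok-{self._fetch_count}"

    def get_token_buggy(self) -> str:
        # BUG: caches forever. An expired token is still handed out,
        # and nothing fails until a downstream call rejects it.
        if self._token is None:
            self._token = self._fetch_token()
        return self._token

    def get_token_fixed(self) -> str:
        # Refresh when the cached token has outlived its TTL.
        expired = time.monotonic() - self._issued_at >= self._ttl
        if self._token is None or expired:
            self._token = self._fetch_token()
        return self._token
```

The buggy version is the generic boilerplate an AI completion pattern-matches on; the fixed version requires knowing the integration's actual refresh timing, which rarely appears in the local file.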
Where Human-Written Code Is Over-Represented: The Systemic Problem
Several categories show the opposite signal — defects where mostly human-written PRs are over-represented. The unifying theme is the inverse: humans fail at system-level maintenance.
Human-over-represented defect types
| Category | Bugs | AI Share | Below Baseline |
|---|---|---|---|
| IAM Role Misconfiguration | 488 | 29.9% | −24.5pp |
| Removal / Upgrade Regression | 1,148 | 39.9% | −14.5pp |
| Purchase Order Line-Item Logic | 445 | 41.1% | −13.3pp |
| Account & Bank Data Mismanagement | 1,048 | 42.3% | −12.1pp |
| Flawed Validation & Check Logic | 814 | 44.5% | −9.9pp |
| Test Infrastructure & Flaky Tests | 1,158 | 46.0% | −8.4pp |
Infrastructure and access control
The single largest gap, about 25 percentage points below baseline, is IAM Role Misconfiguration: bugs where role ARNs, trust policies, or permission boundaries are incorrectly set. This is work that AI tools currently rarely touch (Terraform, IAM JSON, and deployment configs are all areas where IDE-based AI assistance is weak), and it requires understanding the permission model of external systems: exactly the kind of deep, organization-specific knowledge that lives in human operators' heads and that they occasionally get wrong.
Removal and upgrade regression
When humans remove old features, rename dependencies, or upgrade libraries, they leave behind stale references, forgotten consumers, and guards that still check for the old behavior. This category captures the debris of human maintenance work. AI doesn’t generate these bugs because AI rarely removes code — it primarily adds.
Domain-specific financial logic
Purchase Order Line-Item Logic and Account & Bank Data Mismanagement are both heavily human-dominated. These involve deep domain knowledge about financial calculations, account lifecycle states, and cross-system identifier mappings. Humans make mistakes here not because the logic is hard in a programming sense, but because it requires knowledge that lives in business-rules documents and institutional memory rather than in code patterns.
Test infrastructure
Test plumbing — fixtures, mocks, CI configuration, flaky test management — is also heavily human-dominated. Like infrastructure, it’s an area where AI tooling is currently a poor fit, and where the relevant context lives outside the code.
The Neutral Zone: Where the Tool Doesn’t Matter
A large set of defect categories show no meaningful difference between AI-assisted and mostly human-written PRs. These include:
Date / Time Handling Errors — Both AI and humans struggle equally with timezone conversions, business-day calculations, and temporal edge cases.
API Contract Mismatches — Integration failures at API boundaries are an equal-opportunity defect.
Feature Flag Mismanagement — Both humans and AI misconfigure, misname, and miswire feature flags at similar rates.
Payment Processing Logic — Complex financial transaction logic produces bugs regardless of who wrote the code.
The neutral zone is instructive. These categories involve inherent complexity — domains where the bug rate is driven by the problem’s difficulty, not by the author’s limitations. Date/time handling is hard for everyone. Payment processing has irreducible edge cases. API contracts break when either side changes.
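An example of the irreducible difficulty in the date/time category: in Python, adding a `timedelta` to an aware datetime is wall-clock arithmetic, so it silently absorbs a DST transition. The specific date below (the US spring-forward on 2024-03-10) is chosen for illustration.

```python
# DST edge case: "one day later" on the wall clock is only 23 real
# hours when the clocks spring forward overnight.

from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

eastern = ZoneInfo("America/New_York")
before = datetime(2024, 3, 9, 12, 0, tzinfo=eastern)  # day before DST

# Wall-clock arithmetic: the local clock still reads 12:00...
wall = before + timedelta(days=1)

# ...but the true elapsed time between the two instants is 23 hours,
# visible once both datetimes are compared on the UTC timeline.
utc = ZoneInfo("UTC")
elapsed = wall.astimezone(utc) - before.astimezone(utc)
```

Whether "a day later" should mean 24 elapsed hours or the same wall-clock time is a product decision, not a coding one, which is why this category shows no AI-vs-human gap: the hard part is the same no matter who types the code.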
The Counterintuitive Insight
The narrative around AI code quality focuses on quantity: does AI produce more or fewer bugs? Our earlier regression analysis of 248,099 PRs shows the answer is neither: AI code ratio essentially does not move the probability that a PR introduces a bug, after controlling for PR size and complexity.
But “the same number of bugs” does not mean “the same bugs.” AI-assisted and mostly human-written code have distinct defect fingerprints:
AI-assisted PRs are over-represented in: return values, null handling, state cleanup, cache invalidation, component wiring, credential lifecycles. These are bugs that require understanding what’s around the code — the call graph, the cache, the consumer, the lifecycle — not just the local function.
Mostly human-written PRs are over-represented in: IAM policies, removal regressions, account lifecycle management, infrastructure configuration, deep domain logic, test plumbing. These are bugs that require understanding the history of the code — what was there before, what depends on it, what changes when you change it — and the specific rules of an organization that don’t live in any one file.
This distinction has a practical implication: the same code-review process cannot catch both types of bugs equally well.
A reviewer looking at AI-assisted code should focus on: What does the caller expect this to return? What happens when the upstream value is null? Will this state get cleaned up on the next render? Is the cache invalidation path actually wired in?
A reviewer looking at mostly human-written code during a refactor or infrastructure change should focus on: What else references this? What breaks when this is removed? Are the IAM trust policies still correct? Do the business rules match the documented constraints?
3 Things You Can Do
If your team is adopting AI code-generation tools, the defect fingerprint suggests three concrete actions:
Add caller-contract checks to your AI code-review checklist
The strongest AI-over-represented finding is incorrect return values. Static analysis tools can catch some of these, but many require understanding the caller’s expectations — something a human reviewer is better positioned to verify. When reviewing AI-assisted code, explicitly ask: what does every caller of this function actually consume, and does the new code honor that?
Don’t assume AI-assisted code handles cleanup
State-management failures (stale UI state, cache invalidation, lifecycle cleanup, component wiring) are consistently over-represented in AI-assisted PRs. When reviewing AI-assisted code that introduces new state or wires up a new component, explicitly verify the corresponding cleanup, invalidation, and reset paths.
Maintain human expertise in infrastructure, configuration, and domain-specific logic
IAM misconfiguration is the most human-over-represented defect category, and it’s one where AI tools are least helpful in our sample (they lack access to your specific permission model). The same applies to account/bank/PO domain code and to test infrastructure. This is not a category that AI adoption will make worse — but it remains an area where human review attention pays.
The broader insight: AI doesn’t need to produce fewer bugs to be valuable. It needs to produce different bugs — bugs that your existing review and testing processes can catch. If your test suite already has good null-checking coverage and your reviewers know to verify return contracts, AI-assisted code may slot in cleanly. If your test suite is weak on boundary conditions and lifecycle handling, AI adoption will amplify that gap.
The defect fingerprint is not a verdict on AI quality. It’s a map of where to look.
A Note on Scope
These patterns come from a selected sample of engineering organizations, weighted toward teams that have already adopted AI code-generation tools. Both the baseline AI share (54.4%) and the specific defect categories that surface as significant are shaped by which organizations are in the cohort, what stacks they run, and which AI tools they use. A larger or differently composed cohort would almost certainly surface additional categories on both sides — more domains where AI-assisted code shows distinctive failure modes, and more domains where human-written code does.
The categories above should be read as the patterns that are large and consistent enough to emerge from the sample we have, not as an exhaustive map of every place AI-assisted and mostly human-written code diverge. We expect future iterations of this analysis on broader cohorts to enrich the picture rather than overturn it.
Analysis based on 65,697 production defects identified through automated root-cause analysis across a selected sample of engineering organizations. Defect categories were derived by unsupervised clustering and labeled with LLM assistance. Findings were verified under an alternative clustering pipeline and a regression that controls for pull-request size and organization; categories whose direction did not hold up under both checks are not reported here.