Why teams that shipped 40% more code introduced 28% more production incidents
A mid-stage fintech company adopted AI coding tools across its 45-person engineering team in early 2025. The metrics looked extraordinary. Pull requests per engineer rose 42%. Average time to first commit on new features dropped from 4.3 days to 2.1. Lines of code per sprint nearly doubled. The VP of Engineering presented these numbers at a board meeting in June, alongside a proposal to reduce headcount by 30%.
By October, production incidents had increased 28%. Mean time to resolution had grown from 47 minutes to 93. Customer-facing outages, which had averaged 1.2 per quarter, hit 4 in Q3 alone. A mid-market SaaS company at that scale loses $50,000 to $200,000 per significant incident: engineering time, customer escalations, remediation overhead. The fourth incident in a quarter costs more than the first. The team is already fatigued. The customer base is already nervous. The remaining engineers are already carrying three people's worth of context.
The company that saved $1.8 million in headcount spent roughly $400,000 more on incidents. The $1.4 million difference was not a saving. The company traded visible costs for hidden ones and declared victory before the bill arrived.
The Churn Problem
GitClear analysed 153 million lines of code and found that AI-assisted repositories showed a 39% increase in churn: code written and then rewritten within two weeks. The code was produced faster. It was also thrown away faster.
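As a rough illustration of the churn measure described above, the sketch below computes the share of added lines that were rewritten or deleted within 14 days. The `LineRecord` structure, the inclusive 14-day boundary, and the sample data are all assumptions made for illustration. GitClear's actual methodology is not described here and may differ.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

CHURN_WINDOW = timedelta(days=14)  # "within two weeks"; inclusive boundary is an assumption


@dataclass
class LineRecord:
    """One added line of code (hypothetical record format)."""
    added_on: date
    changed_on: Optional[date] = None  # None if the line was never rewritten or deleted


def churn_rate(lines: list[LineRecord]) -> float:
    """Fraction of added lines that were changed within the churn window."""
    if not lines:
        return 0.0
    churned = sum(
        1
        for line in lines
        if line.changed_on is not None
        and line.changed_on - line.added_on <= CHURN_WINDOW
    )
    return churned / len(lines)


# Illustrative data only, not taken from any real repository.
sample = [
    LineRecord(date(2025, 3, 1), date(2025, 3, 5)),   # changed after 4 days: churn
    LineRecord(date(2025, 3, 1), date(2025, 4, 20)),  # changed after 50 days: not churn
    LineRecord(date(2025, 3, 2)),                     # never changed
    LineRecord(date(2025, 3, 3), date(2025, 3, 17)),  # changed after exactly 14 days: churn
]
print(f"churn rate: {churn_rate(sample):.0%}")  # churn rate: 50%
```

Tracked separately for AI-assisted and hand-written changes, a ratio like this shows how much of the extra output survives rather than how much was produced.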
GitHub's research on Copilot found a 55% improvement in task completion speed. McKinsey's 2023 analysis estimated 20-40% gains in coding tasks. These numbers are real. They measure gross output. They do not measure what survives: the code still running two weeks later, still correct under production load.
Microsoft's research found that developers spend only 30-35% of their time writing code. The remaining 65-70% goes to understanding requirements, navigating existing systems, debugging, reviewing others' work, and coordinating with colleagues. AI tools accelerated the 30%. They also expanded the 70%. Every line of AI-generated code that enters a system must still be understood, reviewed, tested, and maintained by humans. The code arrives faster. The human capacity to absorb it does not.
The Two Failures of Trust
Teams that adopt AI coding tools split into two failure modes, and both are expensive.
Over-trust looks like the fintech company. Engineers accept AI suggestions without enough scrutiny. The code compiles. It passes the tests the AI also wrote. It appears to work. The subtle failures surface days or weeks later in production: race conditions, edge cases, security holes, architectural mismatches. A 2023 Stanford study by Perry et al. found that developers using AI assistants wrote significantly less secure code than those working without assistance. Those same developers rated their code as more secure. Confidence rose as quality fell.
Under-trust is slower to cause incidents but just as costly. Engineers treat the AI as unreliable and review every suggestion as if it came from an untrusted junior. The review overhead eats the productivity gain. In some teams, the net effect is negative: engineers spend more time correcting AI suggestions than they would have spent writing the code themselves. Uplevel's research on developer experience found that many engineers reported AI tools increased their cognitive load rather than reducing it.
The productive middle ground requires knowing when to trust and when to check. That judgment comes from experience, exactly the kind of experience companies are using AI to replace. Junior engineers lack the context to spot when AI suggestions are subtly wrong. Senior engineers have the context but are now reviewing three times the volume.
Both failures are compounded by a problem specific to AI-generated code: the reviewer did not watch it being written. When an engineer reviews a colleague's pull request, they bring context: the author's habits, the team's recent design discussions, the constraints that shaped the approach. They can infer intent from structure. AI-generated code arrives without any of this. It is syntactically competent and architecturally naive. It solves the immediate problem with no awareness of the system's history or its failure modes. Reviewers must work harder to verify it, at the exact moment when there is more of it to verify.
The Understanding Gap
The deepest problem with AI-generated code is not quality. Quality improves with better models, better prompts, and better guardrails. The deepest problem is that AI generates code without generating understanding.
When an engineer writes code by hand, the act of writing is also the act of understanding. Choosing data structures, handling errors, thinking through edge cases: all of it forces the engineer to build a mental model of the problem. That mental model persists. When the system breaks at 3am, the engineer who wrote the code can reason about what failed, because they reasoned about how it should work.
When AI generates the code, the mental model does not form. The engineer has a solution without having solved the problem. When it fails, nobody understands why it was built the way it was. The code is present. The context is absent.
This is the knowledge debt that AI coding tools create. It is not technical debt; the code may be perfectly clean. Knowledge debt is the growing gap between what the system does and what the team understands about why it does it. Every line of AI-generated code that an engineer accepts without deeply understanding adds to this debt. It compounds silently until an incident forces repayment at the worst possible time.
The optimistic counter-argument is that this is an adoption problem, that teams will learn to use the tools properly and quality will stabilise. There is some truth in this. But knowledge debt is not an adoption failure. It is a built-in consequence of handing implementation to a system that does not share its reasoning. A team can become expert at prompting, at constraining scope, at reviewing output. The mental models still do not form for the code the engineer did not write. Maturity slows how fast knowledge debt builds. It does not stop the mechanism.
What the Successful Adopters Do Differently
Not every company that adopted AI coding tools saw its incident rate climb. The companies that extracted genuine value share three structural patterns.
They constrained scope. They identified the tasks where AI reliably produces correct output and limited its use to those: boilerplate, test scaffolding, documentation, straightforward transformations. Code that touches payment processing, authentication, data migration, or infrastructure remains human-written. The distinction is not about capability. It is about consequence. AI can write an authentication module that compiles. The cost of that module being subtly wrong is nothing like the cost of a misformatted log message.
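One way such a scope rule could be enforced is a check in the merge pipeline that rejects AI-assisted changes touching high-consequence areas. The sketch below is a minimal example under stated assumptions: the directory names in `HUMAN_ONLY_AREAS` are hypothetical, and it assumes each change already carries an `ai_assisted` flag (for example from a commit trailer or pull request label), which the companies described here may or may not use.

```python
from pathlib import PurePosixPath

# Hypothetical directory names; a real repository would use its own layout.
HUMAN_ONLY_AREAS = ("payments", "auth", "migrations", "infra")


def is_human_only(path: str) -> bool:
    """True if any directory component of the path is a high-consequence area."""
    directories = PurePosixPath(path).parts[:-1]  # drop the file name itself
    return any(part in HUMAN_ONLY_AREAS for part in directories)


def scope_violations(changed_files: list[str], ai_assisted: bool) -> list[str]:
    """Return the files that an AI-assisted change should not have touched."""
    if not ai_assisted:
        return []
    return [f for f in changed_files if is_human_only(f)]


changed = ["services/auth/token.py", "docs/setup.md", "tests/test_report.py"]
print(scope_violations(changed, ai_assisted=True))   # ['services/auth/token.py']
print(scope_violations(changed, ai_assisted=False))  # []
```

Keying the rule to file location rather than to the perceived difficulty of the task reflects the point above: the boundary is drawn by consequence, not capability.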
They invested in review. One company with 60 engineers created a four-person senior review team whose only job was evaluating AI-generated code before it reached the main codebase. The cost, roughly $800,000 annually in loaded compensation, looked steep against the $120,000 AI tooling budget. But the team caught an average of three significant issues per month that would otherwise have hit production. At $75,000 per prevented incident, that is about $225,000 a month in avoided costs, so the investment recovered its annual cost in under four months. More importantly, the review process rebuilt the shared understanding that AI-generated code otherwise erodes.
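The arithmetic behind that payback claim can be checked directly from the figures quoted above. The sketch assumes that every issue the review team caught would otherwise have become a production incident at the stated cost, which is the article's premise rather than a measured result.

```python
# Figures quoted in the paragraph above.
REVIEW_TEAM_COST_PER_YEAR = 800_000
ISSUES_CAUGHT_PER_MONTH = 3
COST_PER_PREVENTED_INCIDENT = 75_000

# Assumption: each caught issue maps to exactly one avoided incident.
avoided_per_month = ISSUES_CAUGHT_PER_MONTH * COST_PER_PREVENTED_INCIDENT
avoided_per_year = avoided_per_month * 12
payback_months = REVIEW_TEAM_COST_PER_YEAR / avoided_per_month
net_benefit_per_year = avoided_per_year - REVIEW_TEAM_COST_PER_YEAR

print(f"avoided per year: ${avoided_per_year:,}")        # $2,700,000
print(f"payback period:   {payback_months:.1f} months")  # 3.6 months
print(f"net benefit:      ${net_benefit_per_year:,}")    # $1,900,000
```

The result is sensitive to the per-incident cost. At the low end of the $50,000 to $200,000 range cited earlier, the payback period stretches to a little over five months, still inside the first year.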
They measured what mattered. Instead of tracking pull requests per engineer or lines of code per sprint, they tracked defect escape rate, time to resolution, and code churn. When AI-assisted output showed higher churn or more defects, they tightened scope rather than celebrating the velocity numbers. This takes discipline. It is easier to show a chart of doubled output than to explain that doubled output with rising incident rates is not a productivity gain. It is a quality problem subsidised by speed.
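Two of those measures are simple to compute once defects and incidents are recorded consistently. The sketch below uses one common definition of defect escape rate (defects found in production divided by all defects found) and a plain mean for time to resolution. The `Defect` record and the sample numbers are hypothetical, and a team may reasonably define either metric differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Defect:
    """One recorded defect (hypothetical record format)."""
    found_in_production: bool  # False if caught earlier, e.g. in review, CI, or QA


def defect_escape_rate(defects: list[Defect]) -> float:
    """Share of all recorded defects that were first found in production."""
    if not defects:
        return 0.0
    escaped = sum(d.found_in_production for d in defects)
    return escaped / len(defects)


def mean_time_to_resolution(resolution_minutes: list[float]) -> float:
    """Average minutes from incident start to resolution."""
    return mean(resolution_minutes)


# Illustrative data only.
defects = [Defect(True), Defect(True)] + [Defect(False)] * 6
print(f"defect escape rate: {defect_escape_rate(defects):.0%}")                  # 25%
print(f"mean time to resolution: {mean_time_to_resolution([30, 60, 90]):.0f} min")  # 60 min
```

Unlike pull requests per engineer, both numbers get worse when fast output is also fragile output, which is why they make a better trigger for tightening scope.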
The Trust Equation
Trust in a pair programmer, human or artificial, is built through demonstrated reliability in context. An engineer trusts a colleague after months of working together, observing their judgment, understanding their strengths and blind spots. Trust is not a binary state. It is a calibrated assessment: reliable for this kind of work, unreliable for that kind, needs checking in these specific areas.
AI coding tools have not earned this calibrated trust because they do not behave the same way across different tasks. The same tool that produces flawless React components may generate subtly broken database queries. The same model that handles string manipulation reliably may introduce concurrency bugs in shared state. The failure modes shift with context. Engineers cannot build a reliable mental model of when to trust and when to check.
Until AI tools fail in predictable ways, trust will stay binary. Engineers will either accept everything or suspect everything. Neither mode captures the productivity gains the tools theoretically offer. The companies that navigate this will be those that map where their tools work and where they do not, then encode that knowledge into process rather than individual judgment.
The pair programmer is here. The code it writes is clean, fast, and confident. Six months from now, at 3am, the system will fail in a way nobody can explain. The question is whether anyone understood how it worked in the first place.