The API That No One Dared Delete

In the summer of 2022, a senior engineer at a fintech company noticed something unusual in their infrastructure costs. One API endpoint, which returned user account balances, was consuming $15,000 per month in compute resources despite receiving only 47 requests per day. The endpoint had been marked for deprecation eighteen months earlier. The replacement endpoint, more efficient and better designed, had been live for twenty months. Yet the old endpoint persisted, consuming resources at an annual cost of $180,000.

The engineer proposed shutting it down. The proposal triggered a four-month investigation involving twelve engineers, three product managers, and two customer success representatives. They searched the codebase for references to the endpoint. They analyzed API logs looking for patterns in the remaining requests. They sent emails to customers asking if anyone was using it. They created a deprecation notice and scheduled a shutdown date. Then, three days before the planned shutdown, they discovered that a Fortune 500 customer's internal reporting dashboard depended on the endpoint. The shutdown was canceled. The endpoint remained live. The $180,000 annual cost continued.

This pattern repeats across the software industry with remarkable frequency. Companies maintain infrastructure they believe is unused but cannot prove is safe to remove. The cost of certainty exceeds the cost of waste, so waste persists. Zombie infrastructure accumulates over years, consuming resources, creating security vulnerabilities, and complicating system architecture. Everyone agrees it should be cleaned up. No one can justify the risk of cleaning it up.

The Dependency Discovery Problem

Modern software systems are built from layers of dependencies. Applications call APIs. APIs call other APIs. Those APIs call databases, message queues, and third-party services. Each dependency relationship represents a potential failure point when components are removed. The challenge is that dependency relationships are often implicit, undocumented, and distributed across systems the company does not control.

When the fintech company tried to understand what depended on their old API endpoint, they started with the obvious approach: search the codebase. They found three references, all in deprecated code that had been commented out but not removed. This suggested the endpoint was unused. They deployed monitoring to log every request to the endpoint for thirty days. The logs showed 47 requests per day, always at 3 AM, always from the same IP address, always requesting the same account IDs.
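The log analysis described here amounts to grouping requests by caller and time of day. A minimal sketch in Python, assuming a hypothetical log record of timestamp, source IP, and account ID; the record shape, the address, and the sample values are illustrative and not the company's actual format:

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, NamedTuple


class Request(NamedTuple):
    """Hypothetical access-log record; field names are assumptions."""
    timestamp: datetime
    source_ip: str
    account_id: str


def summarize_callers(requests: Iterable[Request]) -> Dict[str, dict]:
    """Group requests by source IP, tracking volume, hour of day, and accounts."""
    summary: Dict[str, dict] = {}
    for req in requests:
        entry = summary.setdefault(
            req.source_ip,
            {"count": 0, "hours": Counter(), "accounts": set()},
        )
        entry["count"] += 1
        entry["hours"][req.timestamp.hour] += 1
        entry["accounts"].add(req.account_id)
    return summary


# Made-up sample: one internal address calling at 3 AM on consecutive days.
sample = [
    Request(datetime(2022, 7, 1, 3, 0), "10.0.4.17", "acct-001"),
    Request(datetime(2022, 7, 2, 3, 0), "10.0.4.17", "acct-002"),
]
report = summarize_callers(sample)
for ip, stats in report.items():
    print(ip, stats["count"], dict(stats["hours"]), sorted(stats["accounts"]))
```

A single source, a single hour, and a fixed set of accounts is the signature of a scheduled job rather than of interactive use, which is what pointed the investigation toward a cron job.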

The IP address belonged to their infrastructure. This suggested an internal service was making the requests. They searched their service inventory and found nothing running at that IP. They checked their Kubernetes cluster and discovered the IP belonged to a cron job running in a namespace called "legacy-reporting." The cron job's source code was not in their main repository. After two weeks of investigation, they found it in a separate repository that had not been updated in three years. The repository's README file contained no documentation about what the cron job did or why it existed.

The cron job was generating reports for a customer. Which customer? The code contained no customer identifier. They analyzed the account IDs being requested and matched them to a customer who had signed a contract five years earlier. They reached out to the customer contact listed in their CRM. That person had left the company two years ago. They tried the billing contact. That person had also left. They tried the technical contact. The email bounced. Eventually, they found someone at the customer who confirmed that yes, they did receive these reports, and yes, someone in their finance department relied on them for month-end reconciliation.

The investigation consumed approximately 160 engineering hours across two months. At a fully loaded engineering cost of $120 per hour, the investigation cost $19,200. That money was spent to determine whether it was safe to remove an endpoint that cost $180,000 annually, so the economics seemingly justified the investigation. But the investigation revealed that they could not safely remove the endpoint. The $19,200 was spent to learn that they would need to continue spending $180,000 per year.

The Unknowable Graph

Dependency graphs in software systems have a property that makes them particularly difficult to analyze: they extend beyond the boundaries of what any single organization can observe. A company knows what calls their APIs from within their own infrastructure. They cannot know what calls their APIs from customer systems, partner integrations, or third-party tools.

A payments company experienced this when they attempted to deprecate a webhook endpoint. Webhooks are HTTP callbacks that allow one system to notify another system when events occur. The payments company sent webhooks when transactions completed, allowing customers to update their own systems in real-time. The specific webhook format they wanted to deprecate had been replaced by a more robust version two years earlier. Their logs showed only three customers still receiving webhooks in the old format.

They contacted all three customers. Two confirmed they had migrated to the new format and the old webhooks were being discarded. The third customer initially confirmed the same. The payments company scheduled a deprecation date and sent multiple notices. When they shut down the old webhook format, nothing appeared to break. Two weeks later, their customer success team received an urgent ticket from the third customer: a system that still consumed the old webhooks had stopped receiving data.

The investigation revealed that the customer had indeed migrated their primary integration to the new webhook format. However, they had a separate analytics system that consumed the old webhooks and fed data into a business intelligence tool. The person who confirmed they had migrated did not know about this secondary system because it was maintained by a different team. The analytics system was not essential to operations, so when it stopped receiving data, no one noticed immediately. But the business intelligence tool fed financial forecasting models that the CFO reviewed monthly. Two weeks after the webhook shutdown, the CFO discovered the missing data and escalated aggressively.

The payments company re-enabled the old webhook format. They sent another round of deprecation notices, this time more strongly worded. They required customers to explicitly confirm that all systems had migrated. Six months later, they tried again. This time the shutdown succeeded without incident. The total time elapsed from initial deprecation announcement to final shutdown was thirty-two months. The engineering time spent managing the deprecation totaled approximately 240 hours across multiple engineers. The opportunity cost included features not shipped because engineers were managing deprecation.

The Cost Asymmetry

The fundamental problem is an asymmetry in costs and risks. The cost of maintaining unused infrastructure is continuous but moderate. The cost of discovering dependencies is high but one-time. The risk of incorrectly removing infrastructure that is actually used is potentially catastrophic. This asymmetry biases heavily toward keeping things running.

Consider the economics. The fintech company's unused API endpoint cost $180,000 annually. Investigating whether it was safe to remove cost $19,200. But that investigation provided no certainty. It revealed one dependency and suggested there might not be others. The question remained: what if there are dependencies we did not discover? What if a customer has an integration we do not know about? What if there is an internal system we overlooked?

The cost of being wrong is high. If they remove the endpoint and break a critical customer integration, the immediate cost includes incident response time, customer support overhead, emergency engineering work to restore the endpoint, and relationship damage with the customer. The indirect costs include the company's reputation for reliability and the precedent that deprecating features is risky. These costs are difficult to quantify but potentially large.

Against this uncertain but potentially catastrophic downside, the certain savings from removing the endpoint are $180,000 per year. The calculation is: spend $19,200 to investigate, then make a decision that has a small probability of causing a very expensive incident. Removal pays off only if the incident probability multiplied by the cost of an incident is lower than the annual savings. For the threshold to sit at approximately 10%, an incident would have to cost on the order of $1.8 million, ten times the annual savings. Above that probability, the expected cost of removal exceeds the cost of keeping the endpoint running for another year. Most organizations cannot prove the incident probability is below 10%.
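The break-even reasoning can be made explicit. The sketch below assumes removal is worthwhile when the expected incident cost (probability times cost) is lower than one year of savings. The savings and investigation figures come from the fintech example; the $1.8 million incident cost is a hypothetical value chosen only because it reproduces a threshold of roughly 10%.

```python
def break_even_probability(annual_savings: float, incident_cost: float) -> float:
    """Incident probability at which expected incident cost equals the savings."""
    if incident_cost <= 0:
        raise ValueError("incident_cost must be positive")
    return annual_savings / incident_cost


def expected_first_year_benefit(
    annual_savings: float,
    incident_cost: float,
    incident_probability: float,
    investigation_cost: float = 0.0,
) -> float:
    """Savings minus expected incident cost minus the one-time investigation."""
    expected_incident = incident_probability * incident_cost
    return annual_savings - expected_incident - investigation_cost


ANNUAL_SAVINGS = 180_000      # from the fintech example
INVESTIGATION_COST = 19_200   # from the fintech example
INCIDENT_COST = 1_800_000     # ASSUMPTION: hypothetical, not a figure from the case

threshold = break_even_probability(ANNUAL_SAVINGS, INCIDENT_COST)
benefit_at_5_percent = expected_first_year_benefit(
    ANNUAL_SAVINGS, INCIDENT_COST, 0.05, INVESTIGATION_COST
)
print(f"break-even probability: {threshold:.0%}")
print(f"expected first-year benefit at 5% risk: ${benefit_at_5_percent:,.0f}")
```

The difficulty in practice is that neither the incident probability nor the incident cost can be measured in advance, so the inputs to this calculation are themselves estimates.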

This calculation repeats for every piece of infrastructure considered for removal. Each calculation independently favors keeping things running. The cumulative effect is that infrastructure accumulates. The longer it runs, the harder it becomes to remove because more time means more opportunity for unknown dependencies to form. A service running for six months might have dependencies that are documented and understood. A service running for six years has dependencies that have been forgotten, transformed, or transferred to systems the company no longer controls.

The Documentation Decay

Even when dependencies are initially documented, documentation decays. Updates to documentation typically lag behind changes to the systems it describes, so documentation becomes less accurate over time even when each individual change seems too small to record.

A SaaS company maintained an internal wiki documenting all their API endpoints, including which services called each endpoint. The documentation was comprehensive when created in 2018. By 2023, an audit revealed that 67% of documented dependencies no longer existed, and 34% of actual dependencies were not documented. The documentation had negative value: following it would provide incorrect information that was worse than having no information.
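The two audit figures measure different failures: documented dependencies that no longer exist, and real dependencies that were never written down. A sketch of that comparison, assuming dependencies are represented as (caller, callee) pairs and that an observed set is available from some runtime source; the service names are invented for illustration:

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (caller, callee); this representation is an assumption


def audit_dependency_docs(documented: Set[Edge], observed: Set[Edge]) -> Dict[str, object]:
    """Compare documented dependencies with dependencies seen at runtime."""
    stale = documented - observed         # documented but not seen
    undocumented = observed - documented  # seen but never documented
    return {
        "stale": stale,
        "undocumented": undocumented,
        "stale_share": len(stale) / len(documented) if documented else 0.0,
        "undocumented_share": len(undocumented) / len(observed) if observed else 0.0,
    }


# Invented example data.
documented = {
    ("billing", "accounts-api"),
    ("reports", "accounts-api"),
    ("legacy-export", "accounts-api"),
}
observed = {
    ("billing", "accounts-api"),
    ("mobile-gateway", "accounts-api"),
}
result = audit_dependency_docs(documented, observed)
print(f"stale: {result['stale_share']:.0%}, undocumented: {result['undocumented_share']:.0%}")
```

An edge that is documented but not observed is only a candidate for being stale: it may also be a dependency that is exercised too rarely to appear in the observation window.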

The decay occurred through normal software development practices. Engineers built new services that called existing APIs. Sometimes they documented the dependency; often they did not because they were focused on shipping features and the documentation seemed secondary. Engineers deprecated old services that called APIs. Sometimes they updated the documentation; often they did not because the service was no longer running and the documentation seemed irrelevant. Engineers modified services to call different APIs. Sometimes they updated both the old and new documentation; often they only updated the new documentation.

The cumulative effect over five years was documentation that was superficially plausible but systematically wrong. The documentation listed dependencies that would never be discovered through testing because they no longer existed. The documentation omitted dependencies that would only be discovered through failure because they were never documented. Using the documentation to make removal decisions would systematically create false confidence in removing systems that were actually in use while preserving systems that were genuinely unused.

The company considered updating the documentation. They estimated it would require one engineer working full-time for six months to audit all systems, identify all dependencies, and update all documentation. At the end of those six months, the documentation would be accurate. One month later, it would begin decaying again. Unless documentation updates were enforced rigorously through code review and automated tooling, the documentation would return to its inaccurate state within two years.

They chose not to update the documentation. Instead, they deleted it entirely and relied on runtime dependency discovery through service mesh telemetry. This provided real-time accuracy at the cost of only seeing dependencies that were actually exercised. Silent dependencies that rarely activated remained invisible, but at least the visible information was correct.

The Testing Gap

Standard testing practices provide limited help in discovering dependencies. Unit tests verify that individual components work in isolation. Integration tests verify that known components work together. Neither reliably reveals dependencies on systems that exist outside the test environment or that only activate under specific conditions.

A healthcare technology company learned this when they removed an internal API that appeared unused. Their test suite was comprehensive, with 94% code coverage. All tests passed after the API removal. They deployed to staging and ran their integration test suite. All tests passed. They deployed to production during a maintenance window and monitored for errors. No errors appeared for the first six hours. They declared the deprecation successful.

Twelve days later, they received a support ticket. A hospital's monthly compliance report had failed to generate. The compliance report ran only once per month, on the first business day of the month. It depended on the removed API to retrieve historical audit logs. The API call was failing silently; the report generation code caught the error and logged it but did not alert anyone. The monthly report showed incomplete data. The hospital's compliance officer escalated to their legal department. The legal department escalated to the vendor's account management.

The healthcare company restored the API within four hours of the escalation. But the damage was done. The hospital's compliance report was late, requiring manual intervention to satisfy regulatory requirements. The healthcare company issued credits, conducted a post-mortem, and implemented additional safeguards for future deprecations. The total cost of the failed deprecation exceeded $80,000 when accounting for engineering time, customer credits, and legal review.

The lesson was that testing can only verify dependencies that tests know about. The monthly compliance report was not in the integration test suite because it was a customer-specific integration maintained by a different team. The dependency was not documented because it was implemented three years earlier by an engineer who had since left the company. The failure was not caught during the monitoring window because it only manifested monthly. The gap between what tests verify and what production requires created a blind spot.
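One partial safeguard is to compare the monitoring window with how long the component has historically gone without being called. The sketch below assumes historical call timestamps are available from access logs and treats a window as conclusive only if it spans a multiple of the longest past quiet period. The safety factor of 2 and the sample dates are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable


def longest_quiet_gap(call_times: Iterable[datetime]) -> timedelta:
    """Longest interval between consecutive historical calls."""
    ordered = sorted(call_times)
    if len(ordered) < 2:
        raise ValueError("need at least two calls to measure a gap")
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


def window_is_conclusive(
    window: timedelta,
    call_times: Iterable[datetime],
    safety_factor: float = 2.0,  # ASSUMPTION: arbitrary margin, tune per system
) -> bool:
    """True if the window is long enough that past silence would not explain quiet."""
    return window >= longest_quiet_gap(call_times) * safety_factor


# Illustrative history of a caller that runs roughly once per month.
history = [
    datetime(2023, 1, 2, 6, 0),
    datetime(2023, 2, 1, 6, 0),
    datetime(2023, 3, 1, 6, 0),
]
print(longest_quiet_gap(history))                          # 30 days, 0:00:00
print(window_is_conclusive(timedelta(hours=6), history))   # False
print(window_is_conclusive(timedelta(days=60), history))   # True
```

This check only helps when the historical logs still exist and cover the rare caller. A dependency that fires less often than the log retention period remains invisible.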

The Eternal Staging Problem

Deprecation testing in staging environments provides false confidence because staging environments are not production. They have different data, different traffic patterns, different integrations, and different dependencies. A component can be unused in staging while being critical in production.

An e-commerce company wanted to remove a legacy inventory management API. They deployed the removal to staging and monitored for a week. No systems called the deprecated endpoint. They gradually disabled it in production using feature flags, starting with 1% of traffic. No errors appeared. They increased to 10%, then 25%, then 50%, then 100%. The deprecation appeared successful.

Three months later, they discovered an edge case. A specific category of products (custom-engraved items) required inventory checks that bypassed the new API and fell back to the old API when the new API returned unavailable. This fallback logic had been implemented years earlier to handle an edge case in custom products. It activated rarely (approximately 0.3% of traffic) and only for specific product types. The gradual rollout had not triggered it because the test traffic was randomly distributed and the probability of testing specifically custom-engraved products during the rollout period was low.

For the three months after the deprecation, every customer who tried to order a custom-engraved product had the order fail at checkout. The failure rate was 100% for that product category. The company did not discover the issue immediately because custom-engraved products represented only 2% of total revenue and the failure occurred at a step that checkout flow analytics did not instrument, so it appeared as abandoned carts rather than technical failures. They discovered it only when a customer service representative noticed an unusual number of complaints about a specific product category.

The fix required re-enabling the deprecated API and then conducting a six-week investigation to understand why the new API did not handle custom-engraved products correctly. The root cause was that custom products had a different inventory model that the new API had not been designed to support. The eventual solution required rebuilding portions of the inventory system. The cost of the failed deprecation, including lost revenue during the three months when the feature was broken, engineering time to investigate and fix, and the cost of rebuilding the inventory system, exceeded $400,000.
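The fallback path at the center of this incident is a common shape, and it hides a dependency unless it is instrumented. A sketch of such a fallback with a counter that records which product categories still reach the old API; the exception type, the function names, and the stub clients are all hypothetical:

```python
from collections import Counter
from typing import Callable, Dict

Product = Dict[str, str]


class ServiceUnavailable(Exception):
    """Raised by the (hypothetical) new inventory client when it cannot answer."""


# Counts fallbacks per product category so the hidden dependency becomes visible.
fallback_counts: Counter = Counter()


def check_inventory(
    product: Product,
    new_api: Callable[[Product], int],
    old_api: Callable[[Product], int],
) -> int:
    """Ask the new API first; fall back to the old API if it is unavailable."""
    try:
        return new_api(product)
    except ServiceUnavailable:
        fallback_counts[product["category"]] += 1
        return old_api(product)


# Stub clients for illustration only.
def new_api_stub(product: Product) -> int:
    if product["category"] == "custom-engraved":
        raise ServiceUnavailable(product["sku"])
    return 12


def old_api_stub(product: Product) -> int:
    return 3


check_inventory({"sku": "A1", "category": "standard"}, new_api_stub, old_api_stub)
check_inventory({"sku": "B2", "category": "custom-engraved"}, new_api_stub, old_api_stub)
print(dict(fallback_counts))
```

A non-empty counter for any category is evidence that the old API is still in use, regardless of how small that category's share of traffic is.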

The Incentive Misalignment

Organizations struggle to remove zombie infrastructure partly because the incentives are misaligned. Engineers are rewarded for shipping features and preventing incidents. They are rarely rewarded for successfully removing unused infrastructure. They are heavily punished for incorrectly removing infrastructure that was actually needed.

An engineer who proposes removing an unused service takes on significant risk. If the removal succeeds, the benefit (reduced infrastructure costs, simplified architecture) is diffuse and shared across the organization. The engineer receives modest recognition for good engineering hygiene. If the removal fails and causes an incident, the engineer is directly associated with the incident. Their judgment is questioned. The incident appears in their performance review. The asymmetry is clear: low upside, high downside.

This incentive structure biases engineers toward preserving existing infrastructure regardless of whether it is needed. An engineer who leaves zombie infrastructure running takes on no personal risk. The infrastructure cost is absorbed by the organization's budget. The complexity is someone else's problem. The engineer can focus on features that have clearer upside in performance reviews. The rational individual decision is to avoid deprecation work.

A mid-sized technology company recognized this pattern and attempted to correct it. They created a "clean-up credit" system where engineers received performance review credit for successfully deprecating infrastructure. The program launched with enthusiasm. Twelve engineers proposed deprecation projects in the first quarter. Six months later, only three deprecation projects had completed successfully. The other nine had stalled during the investigation phase when engineers discovered that confirming safety was more difficult than anticipated.

The stalled projects created a new problem. Engineers had committed to deprecation work and received partial credit for starting the investigation. But completing the work required more time than they had estimated, and the marginal value of finishing was lower than starting new features. The deprecation projects remained in a permanent state of "in progress," consuming small amounts of engineering time each quarter but never reaching completion. The program was quietly discontinued after eighteen months.

The Security Accumulation

Zombie infrastructure creates security vulnerabilities that compound over time. Unused services often stop receiving security updates because no one is actively maintaining them. They continue running with outdated dependencies, unpatched vulnerabilities, and deprecated authentication mechanisms. When a vulnerability is discovered, the unused service becomes an attack vector.

A financial services company discovered this during a security audit. They found forty-three internal services running in their infrastructure. Twelve had not been deployed in over a year. Eight had not been deployed in over two years. One had not been deployed in four years. The security team investigated the oldest service and found it was running software versions that had eighteen known critical vulnerabilities, including three that allowed remote code execution.

The service was processing no traffic. It appeared to be genuinely unused. But removing it required following their deprecation process, which required identifying all dependencies, obtaining approval from service owners, conducting a risk assessment, and scheduling a maintenance window. The process was designed to prevent accidental removal of critical services. It also prevented removal of genuinely unused services because the overhead was identical regardless of whether the service was critical or dormant.

The security team proposed an expedited process for removing services that had processed no traffic for over ninety days. The proposal was rejected because the absence of traffic in logs did not prove the absence of dependencies. A service might be called rarely but critically. The security team's compromise was to continue running the vulnerable service but isolate it on a separate network segment with strict firewall rules. This reduced the security risk while avoiding the deprecation process. The service ran in isolation, vulnerable and unused, for another eighteen months before finally being removed during a datacenter migration.

The Opportunity Cost

The most significant cost of zombie infrastructure is not the infrastructure cost itself but the opportunity cost of engineering time spent managing it. Every hour spent investigating dependencies, maintaining unused services, or working around architectural complexity created by zombie systems is an hour not spent building features.

A SaaS company calculated that they spent approximately 15% of engineering time on what they called "maintenance debt": work required to keep existing systems running that provided no new customer value. This included patching security vulnerabilities in unused services, updating dependencies to prevent compatibility issues, refactoring code to work around zombie APIs, and investigating incidents caused by interactions between current and deprecated systems.

For a company with thirty engineers, 15% of engineering time represented roughly four and a half full-time engineers. At a fully loaded cost of $150,000 per engineer annually, this represented $675,000 in annual cost. More importantly, it represented the features those four and a half engineers could have built if they were not managing technical debt. The company estimated that redirecting that capacity to features would have generated approximately $2 million in additional annual revenue.

The challenge was that eliminating maintenance debt itself consumed the capacity it was meant to free. Removing zombie infrastructure required investigation time. Refactoring around deprecated systems required engineering time. Each cleanup project consumed engineering capacity in the short term. The return on investment was long-term reduction in maintenance burden, but measuring that reduction was difficult because counterfactuals are invisible. How much time would have been spent if the cleanup had not occurred? The answer requires estimation rather than measurement.

The Compounding Effect

Zombie infrastructure creates more zombie infrastructure through a compounding effect. When engineers discover they cannot safely remove old systems, they build around them rather than replacing them. The workarounds become new systems that depend on the old systems. When those new systems eventually become candidates for deprecation, they inherit the undeletable properties of their dependencies.

A logistics company experienced this with their address validation service. The original address validation API was built in 2015 and used a third-party service that was deprecated in 2017. Rather than remove the API (which might have broken unknown dependencies), they built a new API that called both the old third-party service and a new one, compared results, and returned the new results while logging discrepancies. This wrapper API was meant to be temporary until the old service could be safely removed.

Six years later, the wrapper API was still running. New services had been built that called the wrapper API. The wrapper API had become part of the company's standard infrastructure. Removing it would require identifying all services that called it, migrating them to call the new validation service directly, and confirming that no integrations depended on the wrapper's specific behavior (which combined two validators and had edge cases that the new validator alone did not match).

Meanwhile, the original third-party validation service had shut down entirely. The wrapper API was catching errors from the defunct service and falling back to the new service. This meant the wrapper was now functionally equivalent to just calling the new service, but with additional latency, complexity, and failure modes. Yet removing it was no easier than it had been six years earlier because the number of dependencies had grown. The temporary workaround had become permanent infrastructure.
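The wrapper described in this section can be sketched in a few lines, which makes clear how little it does once the old service is gone. The sketch assumes validators are plain functions from an address string to a normalized string; the function names and stubs are hypothetical and not the company's code:

```python
import logging
from typing import Callable

logger = logging.getLogger("address_validation")

Validator = Callable[[str], str]


def validate_address(address: str, old_validator: Validator, new_validator: Validator) -> str:
    """Return the new validator's result, comparing against the old one when possible."""
    new_result = new_validator(address)
    try:
        old_result = old_validator(address)
    except Exception as exc:
        # With the old service defunct, every call takes this branch.
        logger.warning("old validator failed for %r: %s", address, exc)
        return new_result
    if old_result != new_result:
        logger.info("discrepancy for %r: old=%r new=%r", address, old_result, new_result)
    return new_result


# Stubs for illustration only.
def new_validator_stub(address: str) -> str:
    return address.strip().upper()


def defunct_validator_stub(address: str) -> str:
    raise ConnectionError("service no longer exists")


print(validate_address(" 1 main st ", defunct_validator_stub, new_validator_stub))
```

Every request now pays for a failed call and a logged warning before returning exactly what the new validator would have returned on its own.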

The Rewrite Temptation

When accumulated zombie infrastructure becomes sufficiently problematic, organizations sometimes conclude that incremental cleanup is impossible and comprehensive rewrite is necessary. This reasoning is seductive but usually wrong. Rewrites fail for the same reasons that incremental cleanup fails: unknown dependencies, incomplete understanding of requirements, and the difficulty of running two systems in parallel.

A media company decided to rewrite their content management system, which had accumulated twelve years of zombie features, deprecated APIs, and architectural workarounds. They estimated the rewrite would take eighteen months with a team of eight engineers. They planned to build the new system while maintaining the old one, then migrate customers gradually. The plan seemed sound.

Twenty-seven months later, the rewrite was 80% complete and nine months behind schedule. The remaining 20% included all the edge cases, special integrations, and custom features that the clean-room design had not accounted for. The engineering team discovered these missing requirements only when customers began testing the new system and reporting that specific workflows did not work. Each missing feature required investigation to understand what the old system did, why it did it that way, and how to replicate it in the new architecture.

The most painful discoveries were invisible features: behavior that users relied on but that was never explicitly designed or documented. One customer had built integration scripts that depended on the old API returning fields in a specific order. The new API returned the same fields in a different order, breaking the customer's scripts. Another customer relied on a race condition in the old system that caused updates to propagate in a specific sequence. The new system fixed the race condition, which broke the customer's workflow. A third customer relied on an edge case where the old system accepted technically invalid data and coerced it to valid data in a specific way. The new system rejected invalid data, which was correct but incompatible.

The rewrite eventually shipped after thirty-six months and required an additional six months of bug fixes and compatibility work. The total cost exceeded $4 million. The primary lesson was that zombie infrastructure cannot be eliminated through rewrite any more easily than through incremental cleanup because the fundamental problem, incomplete knowledge of dependencies and requirements, affects both approaches equally.

The Economic Equilibrium

Organizations settle into an equilibrium in which new infrastructure is created about as fast as old infrastructure is removed, while whatever has become impossible to remove stays in place indefinitely. This equilibrium point is usually suboptimal from a cost perspective but rational from a risk perspective. Companies tolerate significant waste to avoid small probabilities of large failures.

The equilibrium is stable because forces that might disrupt it are weak. Cost pressure encourages cleanup, but the cost of zombie infrastructure is continuous and moderate rather than acute and severe. A company spending $500,000 annually on unused infrastructure might consider this unfortunate but tolerable. The $500,000 is distributed across many budget lines (compute, storage, monitoring, engineering time) and does not appear as a single line item that executives can easily eliminate.

Security pressure encourages cleanup, but security risk from zombie infrastructure is potential rather than realized. A company with unused services running vulnerable software knows they have security risk but may operate for years without that risk manifesting as an actual breach. The rational response is to apply minimal mitigations (network isolation, firewall rules, monitoring) rather than undertake the expensive work of proper deprecation.

Complexity pressure encourages cleanup, but complexity creates slow-moving problems rather than acute crises. Engineers complain that the system is difficult to understand, that changes take longer than they should, and that cognitive load is high. These complaints are valid but difficult to quantify and difficult to attribute specifically to zombie infrastructure rather than to system complexity generally. Executive response is often to hire more engineers rather than to simplify the system.

The forces that might encourage more aggressive cleanup (competitors moving faster, customers demanding better features, market pressure on margins) are real but indirect. They affect the organization's overall performance but do not directly create pressure to remove specific pieces of zombie infrastructure. The connection between infrastructure cleanup and competitive advantage exists but is mediated through many variables, making it difficult to justify specific cleanup projects through a clear ROI calculation.

The Rare Success Stories

Organizations do occasionally succeed in removing zombie infrastructure, but success requires specific conditions that are difficult to replicate. The most common success pattern is the forcing function: an external pressure that makes cleanup necessary rather than merely desirable.

A retail company successfully removed eighteen months of accumulated zombie infrastructure during a datacenter migration. The migration required explicitly deciding what to move to the new datacenter. Services that were not migrated would stop running. This created a forcing function: every service required an affirmative decision to preserve it. The default was deletion rather than preservation.

The team used the migration as an opportunity to audit all infrastructure. They classified services into three categories: definitely needed, definitely unused, and uncertain. Services in the "definitely needed" category were migrated immediately. Services in the "definitely unused" category were shut down with a rollback plan but no migration plan. Services in the "uncertain" category received additional investigation. If investigation could not confirm the service was needed within two weeks, it was classified as "definitely unused" and not migrated.
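The triage rule is simple enough to state as code, and doing so makes the default explicit: an uncertain service that outlasts its investigation window is treated as unused. A sketch under the assumption that each service carries a category and an optional investigation start date; the type names and action labels are hypothetical:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Category(Enum):
    NEEDED = "definitely needed"
    UNUSED = "definitely unused"
    UNCERTAIN = "uncertain"


INVESTIGATION_LIMIT = timedelta(weeks=2)  # the two-week limit from the text


@dataclass
class Service:
    name: str
    category: Category
    investigation_started: Optional[date] = None


def plan_action(service: Service, today: date) -> str:
    """Decide what to do with a service; the default outcome is deletion."""
    if service.category is Category.NEEDED:
        return "migrate"
    if service.category is Category.UNUSED:
        return "shut down with rollback plan"
    started = service.investigation_started
    if started is not None and today - started >= INVESTIGATION_LIMIT:
        # Investigation ran out of time: reclassify as unused.
        return "shut down with rollback plan"
    return "investigate"


today = date(2023, 5, 1)  # illustrative date
print(plan_action(Service("checkout", Category.NEEDED), today))
print(plan_action(Service("old-export", Category.UNCERTAIN, date(2023, 4, 10)), today))
print(plan_action(Service("batch-sync", Category.UNCERTAIN, date(2023, 4, 25)), today))
```

The important design choice is in the last branch: running out of time resolves toward removal, not toward preservation.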

This aggressive approach removed approximately 35% of their infrastructure. Three services that were not migrated turned out to be needed and had to be restored, creating brief incidents. The incidents were contained within hours because the rollback plan was to restart the services in the old datacenter temporarily. The cost of the three incidents was approximately $40,000 in engineering time and lost productivity. The savings from not migrating and not maintaining 35% of infrastructure exceeded $300,000 annually.

The key success factor was the forcing function. The migration created a moment where the default was deletion and preservation required justification. This inverted the normal dynamics where the default was preservation and deletion required justification. The inversion made cleanup feasible.

The Detection Strategy

The most effective approach to zombie infrastructure is not to create it in the first place. This requires proactive detection before infrastructure transitions from "temporarily unused" to "permanently undeletable." The transition typically occurs when institutional knowledge about the infrastructure's purpose and dependencies is lost.

One company implemented an "infrastructure sunset clock" that automatically marked any service that processed zero traffic for ninety consecutive days as a deprecation candidate. The service remained running but triggered a workflow: an automated ticket was created, the service owner was notified, and a calendar reminder was set for thirty days in the future. If the service owner did not respond within thirty days, the service was automatically moved to a separate environment with restricted resources and network access.
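The workflow amounts to a small state machine driven by two timers. The following Python sketch is a simplified illustration, not the company's implementation. The names `State`, `ServiceRecord`, and `evaluate` are hypothetical, and it assumes a service is flagged on the exact day its ninety-day clock expires. What happens after an owner responds is not described in the account, so the sketch simply leaves the service as a candidate.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum

ZERO_TRAFFIC_THRESHOLD = timedelta(days=90)   # from the text
OWNER_RESPONSE_WINDOW = timedelta(days=30)    # from the text


class State(Enum):
    ACTIVE = "active"
    CANDIDATE = "deprecation candidate"
    RESTRICTED = "moved to restricted environment"


@dataclass
class ServiceRecord:
    name: str
    last_request: date            # date of the most recent request served
    owner_responded: bool = False


def evaluate(record: ServiceRecord, today: date) -> State:
    """Classify a service according to the sunset clock."""
    idle = today - record.last_request
    if idle < ZERO_TRAFFIC_THRESHOLD:
        return State.ACTIVE
    # Assumption: the service is flagged on the day the 90-day clock expires.
    flagged_on = record.last_request + ZERO_TRAFFIC_THRESHOLD
    if not record.owner_responded and today - flagged_on >= OWNER_RESPONSE_WINDOW:
        # No owner response within the window: restrict, but never delete.
        return State.RESTRICTED
    return State.CANDIDATE
```

A scheduled job could call `evaluate` daily for every service and open a ticket whenever a record first becomes a `CANDIDATE`. Note that no branch of the function deletes anything; the most severe outcome is restriction.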

This approach did not automatically delete services (the risk of false positives remained too high), but it created visibility. Services moved to the restricted environment were reviewed quarterly, and service owners had to explicitly confirm that they were still needed. If a service owner could not be found (because they had left the company or changed roles), the service became a candidate for aggressive deprecation testing.

Over two years, this approach identified forty-seven services that were genuinely unused and could be safely removed. It also identified twelve services that were used only for specific quarterly processes and could be run on-demand rather than continuously. The annual savings exceeded $400,000. Importantly, the approach prevented new zombie infrastructure from accumulating by forcing regular review before institutional knowledge was lost.

The Documentation Problem

Traditional documentation fails to prevent zombie infrastructure because systems change faster than their documentation is updated. An alternative is self-documenting systems, in which dependencies are declared explicitly in code and verified automatically. This approach reduces reliance on human memory and manual documentation maintenance.

A technology company required all services to declare their dependencies in a machine-readable format. When a service was deployed, the deployment system automatically verified that all declared dependencies existed and were accessible. When a service was proposed for deprecation, the system could query all other services to determine if any declared a dependency on it. This did not catch undeclared dependencies (services that called an API without declaring it) but it provided a baseline level of confidence.
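A registry of declared dependencies supports both checks with very little code. The sketch below is a minimal illustration under stated assumptions: the declaration format is reduced to an in-memory mapping, the service names are invented, and the "accessible" part of the deploy-time check (a network probe) is omitted.

```python
from typing import Dict, Set

# Maps each service to the set of services it declares a dependency on.
DeclaredDeps = Dict[str, Set[str]]


def missing_dependencies(service: str, declared: DeclaredDeps, deployed: Set[str]) -> Set[str]:
    """Deploy-time check: declared dependencies that are not deployed."""
    return declared.get(service, set()) - deployed


def declared_dependents(target: str, declared: DeclaredDeps) -> Set[str]:
    """Deprecation-time check: services that declare a dependency on target."""
    return {svc for svc, deps in declared.items() if target in deps}


# Hypothetical registry contents, for illustration only.
declared: DeclaredDeps = {
    "reports": {"balances-v1", "auth"},
    "mobile-gateway": {"balances-v2", "auth"},
    "auth": set(),
}
deployed = {"reports", "mobile-gateway", "auth", "balances-v1", "balances-v2"}

print(declared_dependents("balances-v1", declared))
```

An empty result from `declared_dependents` means only that no service has declared the dependency. As noted above, it says nothing about callers that never declared one.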

The system also tracked runtime dependencies by analyzing network traffic. When a service made a request to another service, the service mesh recorded the dependency. Over time, this built a map of actual dependencies that complemented the declared dependencies. When a service was proposed for deprecation, engineers could compare declared dependencies (what services said they needed) with observed dependencies (what services actually called) to identify discrepancies.
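Comparing the two views reduces to set arithmetic over caller-callee pairs. In the sketch below, the edge representation, the function names, and the sample services are all assumptions made for illustration; a real service mesh would export observed traffic in its own format.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (caller, callee)


def compare_dependencies(declared: Set[Edge], observed: Set[Edge]) -> Dict[str, Set[Edge]]:
    """Split dependency edges by whether they were declared, observed, or both."""
    return {
        "confirmed": declared & observed,    # declared and seen in traffic
        "undeclared": observed - declared,   # seen in traffic, never declared
        "unobserved": declared - observed,   # declared, never seen in traffic
    }


def callers_of(target: str, edges: Set[Edge]) -> Set[str]:
    """All services with an edge pointing at target."""
    return {caller for caller, callee in edges if callee == target}


# Hypothetical data, for illustration only.
declared = {("reports", "balances-v1"), ("reports", "auth")}
observed = {("reports", "auth"), ("nightly-export", "balances-v1")}

result = compare_dependencies(declared, observed)
print(callers_of("balances-v1", result["undeclared"]))
```

The "undeclared" edges are the ones that would break silently if deprecation relied on declarations alone. The "unobserved" edges are ambiguous: they may be stale declarations, or they may be code paths that simply have not run during the observation window.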

This approach was not perfect. Silent dependencies (rarely activated code paths) remained invisible until they activated. Indirect dependencies (where service A calls service B, which calls service C) required transitive analysis. Customer integrations outside the service mesh were not captured. But the approach provided significantly better visibility than documentation-based approaches and scaled automatically as the system grew.
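The transitive analysis mentioned above is a graph traversal: starting from the service proposed for removal, walk the dependency edges backward until no new callers appear. The following sketch shows one way to do it with a breadth-first search. The function name and sample edges are hypothetical, and the sketch sees only the edges it is given, so silent dependencies and external customer integrations remain out of reach.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (caller, callee)


def transitive_dependents(target: str, edges: Iterable[Edge]) -> Set[str]:
    """Every service that reaches target directly or through other services."""
    # Build a reverse index: callee -> set of direct callers.
    callers: Dict[str, Set[str]] = {}
    for caller, callee in edges:
        callers.setdefault(callee, set()).add(caller)

    found: Set[str] = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, ()):
            # Skip services already found, so cycles terminate.
            if caller != target and caller not in found:
                found.add(caller)
                queue.append(caller)
    return found


# Service A calls service B, which calls service C; E and F are unrelated.
edges = [("A", "B"), ("B", "C"), ("E", "F")]
print(transitive_dependents("C", edges))
```

Removing service C in this example affects both B and A, even though A never calls C directly.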

The True Cost

When organizations calculate the cost of zombie infrastructure, they typically focus on direct infrastructure costs: compute, storage, and bandwidth. The true cost is substantially higher when indirect costs are included. These include the engineering time spent maintaining zombie systems, the opportunity cost of features not built, the complexity cost of working around zombie systems, and the organizational cost of teams coming to believe that cleanup is impossible.

A mid-sized company with $50 million in annual revenue conducted a comprehensive audit of their zombie infrastructure costs. Direct infrastructure costs totaled $380,000 annually. Engineering time spent maintaining zombie systems was estimated at 18% of total engineering capacity, representing approximately $1.2 million annually in fully loaded costs. Complexity costs (additional time required to work around zombie systems when building new features) were estimated at 6% of engineering capacity, representing another $400,000 annually.

The total quantifiable cost was approximately $2 million annually, or 4% of revenue. The unquantifiable costs included slower feature velocity (how many additional features could have been built with that engineering capacity?), higher employee frustration (how much does working with zombie systems affect retention?), and accumulated technical debt (how much does zombie infrastructure constrain future architectural options?).
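The audit's headline figure can be reproduced from its components. The short calculation below uses only the numbers reported above; the variable names are arbitrary. It also cross-checks that the two percentages imply the same underlying engineering budget.

```python
annual_revenue = 50_000_000

# Figures reported by the audit.
direct_infrastructure = 380_000
maintenance_time = 1_200_000     # stated as 18% of engineering capacity
complexity_overhead = 400_000    # stated as 6% of engineering capacity

total = direct_infrastructure + maintenance_time + complexity_overhead
share_of_revenue = total / annual_revenue

# Consistency check: both percentages should imply the same engineering budget.
budget_from_maintenance = maintenance_time / 0.18
budget_from_complexity = complexity_overhead / 0.06

print(f"Total quantifiable cost: ${total:,} ({share_of_revenue:.1%} of revenue)")
print(f"Implied engineering budget: ${budget_from_maintenance:,.0f} "
      f"vs ${budget_from_complexity:,.0f}")
```

The components sum to $1.98 million, which the audit rounds to $2 million, and both percentages imply an engineering budget of about $6.7 million. Direct infrastructure spending, the number most organizations track, is less than a fifth of the total.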

The company concluded that reducing zombie infrastructure by 50% would save approximately $1 million annually in direct and indirect costs. Achieving that reduction was estimated to require a dedicated team of three engineers working for twelve months, at a cost of approximately $450,000. The projected payback period was roughly six months. The ROI was compelling, and the project was approved. Eighteen months later, the project had reduced zombie infrastructure by 23% at a cost of $680,000. The remainder proved too difficult to remove safely.
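The gap between plan and outcome is easier to see as payback periods. The sketch below computes both. The projected figure uses the company's own estimates. The actual figure rests on an assumption the account does not confirm: that savings scale linearly with the percentage of zombie infrastructure removed, so a 23% reduction yields 23/50 of the projected $1 million.

```python
def payback_months(project_cost: float, annual_savings: float) -> float:
    """Months of savings needed to recover the project cost."""
    return 12 * project_cost / annual_savings


# Projection: 50% reduction, $1M saved per year, $450K project cost.
projected = payback_months(450_000, 1_000_000)

# Assumption: savings are proportional to the reduction achieved.
savings_per_point = 1_000_000 / 50
actual_savings = 23 * savings_per_point
actual = payback_months(680_000, actual_savings)

print(f"Projected payback: {projected:.1f} months")
print(f"Estimated actual savings: ${actual_savings:,.0f} per year")
print(f"Estimated actual payback: {actual:.1f} months")
```

Under the linear assumption, the payback period stretches from under six months to roughly eighteen. The project still pays for itself, but the easy half of the cleanup was evidently cheaper than the plan's average, and the hard half was more expensive.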

Conclusion

The persistence of zombie infrastructure is not a failure of engineering but a rational response to incentives and constraints. When the cost of certainty exceeds the cost of waste, waste persists. When the risk of removal exceeds the benefit of savings, infrastructure remains. When discovering dependencies is harder than maintaining systems, systems continue running.

The fintech company's $180,000-a-year API endpoint remains in production today. After multiple deprecation attempts, they concluded that the cost and risk of removal exceeded the cost of operation. They optimized the endpoint to reduce infrastructure costs to approximately $60,000 annually. They documented it carefully as a "known legacy system" with extensive notes about its history and the customer dependency. They accept that it will likely run indefinitely.

This outcome is not ideal but it is realistic. The perfect solution (remove all unused infrastructure) is often impossible because the information required to prove something is unused does not exist. The practical solution is to minimize the cost of zombie infrastructure while acknowledging that elimination is not always feasible. Organizations that accept this reality can focus their cleanup efforts on infrastructure where removal is clearly safe rather than spending energy on infrastructure where safety cannot be established.

The lesson is not that zombie infrastructure is acceptable but that perfect cleanup is impossible. The goal should not be zero zombie infrastructure but rather controlled accumulation. Infrastructure should be designed for deprecation from the beginning, with explicit dependency declarations, automated dependency tracking, and forcing functions that require regular justification for continued operation. These practices do not eliminate zombie infrastructure but they prevent unlimited accumulation.

The most important insight is that zombie infrastructure is an economic problem rather than a technical problem. The technical solution (delete unused infrastructure) is straightforward. The economic problem (prove that deletion is safe) is hard. Organizations that treat infrastructure cleanup as a technical problem will be frustrated by their inability to make progress. Organizations that treat it as an economic problem with probabilistic risk assessment can make rational decisions about which cleanup efforts are worthwhile and which are not.
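One simple form of probabilistic risk assessment is an expected-value comparison. The sketch below is an illustration of the idea, not a method attributed to any company described here. The function names are hypothetical, and apart from the $180,000 annual cost taken from the opening example, every input (time horizon, removal cost, incident cost, probabilities) is an invented placeholder.

```python
def expected_net_benefit(annual_savings: float, horizon_years: float,
                         removal_cost: float, p_hidden_dependency: float,
                         incident_cost: float) -> float:
    """Expected value of removing a service, in dollars."""
    savings = annual_savings * horizon_years
    expected_loss = p_hidden_dependency * incident_cost
    return savings - removal_cost - expected_loss


def break_even_probability(annual_savings: float, horizon_years: float,
                           removal_cost: float, incident_cost: float) -> float:
    """Probability of a hidden dependency above which removal loses money."""
    raw = (annual_savings * horizon_years - removal_cost) / incident_cost
    return max(0.0, min(1.0, raw))  # clamp to a valid probability


# $180,000 per year is from the opening example; the rest is hypothetical.
threshold = break_even_probability(
    annual_savings=180_000, horizon_years=3,
    removal_cost=100_000, incident_cost=2_000_000,
)
print(f"Removal is worthwhile if P(hidden dependency) < {threshold:.2f}")
```

Framed this way, the question shifts from "can we prove this is unused?" to "is our uncertainty below the break-even threshold?" The first question often has no answer. The second can be estimated, argued about, and revised.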

The uncomfortable truth is that some infrastructure will run forever not because it is needed but because proving it is not needed is too expensive. The API that no one dared delete joins thousands of similar services across the industry, consuming resources and complicating architectures, preserved by the economic reality that certainty costs more than waste. The rational response is not to eliminate all zombie infrastructure but to prevent new accumulation while accepting that some proportion of existing infrastructure will remain undeletable. In a world of imperfect information, this is as close to optimal as organizations can achieve.