The On-Call System Engineers Actually Wanted, CodeGood

On-call is where engineering culture goes to die. The industry treats it as an unavoidable tax: unpaid, mandatory, resented. Engineers absorb nights and weekends into their salaries. Companies pretend the cost is zero because no line item appears in the budget. Both sides lose.

The costs are real. They are simply hidden—distributed across burnout, attrition, slow incident response, and the steady erosion of goodwill that makes engineers update their LinkedIn profiles on Sunday evenings after yet another 3am page.

Blockdaemon took a different approach. On-call was voluntary, paid, and structured to create learning rather than resentment. The result: zero staffing problems, declining incidents, and engineers who actively wanted to participate. The system cost more on paper and less in reality.

The Industry Default

Most companies implement on-call the same way. A rotation is established. Engineers are assigned slots. Nights and weekends are covered by whoever drew the short straw that week. Compensation, if it exists at all, is nominal: perhaps a small stipend that works out to a few dollars per hour of availability.

The implicit bargain is this: on-call is part of the job. It was mentioned somewhere in the offer letter, bundled into the salary, and now it is owed. Engineers who object are reminded that everyone does their share. The phrase "team player" appears.

This system has predictable consequences.

Sleep disruption. Even when not paged, on-call engineers sleep worse. The anticipation of a potential alert activates the stress response. Studies of on-call workers across industries (healthcare, utilities, technology) consistently show degraded sleep quality during on-call periods, even when no calls occur. The availability itself is costly.

Life disruption. A week of on-call constrains every evening and weekend plan. No hiking without mobile signal. No drinks with friends. No travel beyond laptop range. The engineer is technically "off work" but functionally tethered. This constraint has no line item, but it has a cost.

Productivity loss. The engineer paged at 3am is diminished the next day. Sometimes for several days. The incident itself may take thirty minutes; the recovery takes far longer. Companies that track next-day productivity after night pages find it drops 40-60%.

Attrition. Engineers leave. The ones who stay are often the ones with fewer options. Over time, the team skews toward engineers who tolerate on-call because they cannot escape it, not because they are engaged, skilled, or growing.

Incident quality. Exhausted, resentful engineers make worse decisions under pressure. A single bad call during an incident—a mistaken rollback, a missed diagnostic step—can cost more than years of fair on-call compensation.

Companies believe they are saving money by not paying for on-call. They are not. They are hiding the cost in places the finance team does not measure.

A Different Design

Blockdaemon designed on-call around a different principle: align incentives, and the outcomes follow.

Global team, working-hours coverage. A distributed team across time zones meant on-call happened during normal working hours for whoever held the pager. No 3am pages for anyone. The worst cost of on-call, sleep disruption, was eliminated through team structure, not heroics.

Voluntary participation. On-call was opt-in. No one was assigned against their will. The fear was that no one would volunteer. In practice, the opposite occurred. Fair compensation and reasonable structure made on-call attractive rather than punitive. Staffing was never a problem.

Paid incidents at time-and-a-half. When an incident occurred, the responding engineer received compensation at 1.5x their normal rate. This acknowledged that incident work is high-intensity, high-stress, and high-value.

Two-hour minimum blocks. A thirty-minute incident paid as two hours. This recognised that incidents disrupt far more than their ticket duration. The twenty minutes before (paged, context loading, adrenaline) and the forty-five minutes after (winding down, returning to focus) are real costs. The minimum block compensated for the actual disruption, not the nominal duration.
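
The time-and-a-half rate and the two-hour minimum combine into a simple pay rule. The sketch below is illustrative only, not Blockdaemon's payroll logic: the function name `incident_pay`, its parameters, and the $100 hourly rate are hypothetical, and it assumes an incident longer than two hours is paid for its actual duration rather than rounded up to whole blocks.

```python
def incident_pay(duration_hours: float, hourly_rate: float,
                 multiplier: float = 1.5, minimum_hours: float = 2.0) -> float:
    """Return pay for one incident at time-and-a-half with a two-hour minimum."""
    # Bill at least the minimum block, even for a short incident.
    billable_hours = max(duration_hours, minimum_hours)
    return billable_hours * hourly_rate * multiplier


# Hypothetical engineer at $100/hour.
print(incident_pay(0.5, hourly_rate=100.0))  # thirty-minute incident, paid as two hours: 300.0
print(incident_pay(3.0, hourly_rate=100.0))  # three-hour incident, paid in full: 450.0
```

Under these assumptions the thirty-minute incident pays the same as a full two-hour one, which is the point of the minimum block.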

Day in lieu for weekend availability. If an engineer was on-call for a weekend day, they received a day off during the following week, regardless of whether an incident occurred. This acknowledged the availability cost. Being "on" for a Saturday, even if nothing happens, means that Saturday was not truly a day off. The day in lieu made it whole.

Developers on-call for their own systems. The engineers who responded to incidents were typically the engineers who had built the system experiencing the outage. This meant immediate context—no handoff delay, no "let me find someone who knows this code." It also created a feedback loop: if your architectural decisions caused the page, you felt the consequence directly. This improved code quality over time.

SRE pairing for learning. On-call engineers were accompanied by site reliability engineers who provided support, context, and teaching. Junior engineers saw on-call as a growth opportunity—a chance to learn production systems deeply, with expert guidance. On-call became education, not punishment.

What Happened

The results were consistent and reinforcing.

Zero staffing problems. Volunteers emerged reliably. The compensation was fair, the structure was humane, and the learning opportunity was genuine. Engineers wanted to participate.

Declining incidents. SRE investment and the feedback loop to developers meant incident frequency dropped over time. Engineers who were paged for their own bugs wrote better code the next time. The system improved itself.

No attrition from on-call burnout. On-call never appeared in exit interviews. The system did not extract unpaid labour, so it generated no resentment.

Faster incident resolution. The responding engineer understood the system because they had built it. Context was immediate. Resolution was faster. Fewer incidents escalated.

The Calculation

Two companies illustrate the economics. Both have 20 engineers. Both pay an average salary of $180,000, with a fully-loaded cost of $250,000 per engineer annually (salary plus benefits at 22% plus overhead at 17%), yielding approximately $1,000 per day or $125 per hour. Both experience an average of two incidents per week.
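
The daily and hourly figures follow from the salary and the two percentages. The sketch below reproduces them under assumptions the text does not state outright: that benefits and overhead are both percentages of base salary, and that a year has 250 working days of 8 hours each. All variable names are illustrative.

```python
SALARY = 180_000
BENEFITS_PCT = 22
OVERHEAD_PCT = 17
WORKING_DAYS_PER_YEAR = 250  # assumption: not stated in the text
HOURS_PER_DAY = 8            # assumption: not stated in the text

# Integer arithmetic keeps the dollar figures exact.
loaded_exact = SALARY * (100 + BENEFITS_PCT + OVERHEAD_PCT) // 100
loaded_rounded = round(loaded_exact, -3)  # the article works with $250,000

daily_cost = loaded_rounded // WORKING_DAYS_PER_YEAR
hourly_cost = daily_cost // HOURS_PER_DAY

print(loaded_exact, loaded_rounded, daily_cost, hourly_cost)
# 250200 250000 1000 125
```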

Company A uses the industry default. On-call is unpaid and mandatory. Engineers rotate through weekly shifts. The team is based in a single time zone, so half of all incidents occur outside working hours. Night and weekend pages are common.

Company B uses the Blockdaemon model. On-call is voluntary and paid. The team is globally distributed, so coverage happens during working hours. Incidents are compensated at 1.5x with two-hour minimums. Weekend on-call days earn a day in lieu.

The visible costs diverge immediately.

Company A's on-call compensation: $0. No line item. Finance sees no expense.

Company B's on-call compensation: $39,000 annually. Two incidents per week, each paid at the two-hour minimum, at 1.5x the $125/hour rate. The calculation: 104 incidents × 2 hours × $125 × 1.5 = $39,000 for incident response. Days in lieu shift workdays rather than adding cost. The engineer takes Monday off instead of working after Saturday on-call, with no net change to capacity.

Finance looks at these numbers and sees Company A saving $39,000 per year. This is wrong.

The hidden costs of Company A

Night page productivity loss. Half of Company A's incidents occur outside working hours: 52 night or weekend pages annually. Each night page costs the next workday. Internal tracking at comparable companies shows next-day productivity drops 40-60% after a 3am page; using the 50% midpoint at $1,000 per day yields $500 per incident. Annual cost: $26,000.

Company B has zero night pages. Its global structure ensures working-hours coverage. This cost does not exist.

Attrition from burnout. On-call burden consistently ranks among the top five reasons engineers leave, according to annual surveys from Lattice and Blind. Company A, with its unpaid mandatory night pages, loses two engineers per year to on-call-related burnout. Replacement cost breaks down as follows: recruiting at 17% of first-year salary ($30,600); lost productivity during the 45-day average vacancy ($49,500: the team operates at 95% capacity, losing $1,100 per day in output); and reduced productivity during the six-month ramp period ($58,500: a new hire at 55% productivity for 130 working days, losing $450 per day versus a fully-ramped engineer). Total per departure: $138,600, rounded to $140,000. Annual cost for two departures: $280,000.

Company B's voluntary, paid, working-hours-only system generates no on-call-related attrition. Exit interviews confirm on-call is not a factor. This cost does not exist.
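
Company A's replacement-cost breakdown can be checked with a few lines of arithmetic. This is a sketch with illustrative variable names; the ramp-period subtotal is taken as stated in the text rather than re-derived, and the rounding step mirrors the article's own rounding to $140,000.

```python
SALARY = 180_000
RECRUITING_PCT = 17
VACANCY_DAYS = 45
VACANCY_LOSS_PER_DAY = 1_100
RAMP_COST = 58_500           # ramp-period subtotal, as stated in the text
DEPARTURES_PER_YEAR = 2

recruiting = SALARY * RECRUITING_PCT // 100
vacancy = VACANCY_DAYS * VACANCY_LOSS_PER_DAY
per_departure = recruiting + vacancy + RAMP_COST

# The article rounds the per-departure figure to the nearest $10,000.
per_departure_rounded = round(per_departure, -4)
annual_rounded = per_departure_rounded * DEPARTURES_PER_YEAR

print(recruiting, vacancy, per_departure)     # 30600 49500 138600
print(per_departure_rounded, annual_rounded)  # 140000 280000
```

The rounding adds $2,800 to the annual figure compared with the unrounded $277,200, which is small relative to the total.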

Incident duration premium. Exhausted engineers responding at 3am resolve incidents more slowly than rested engineers responding at 2pm. Incident retrospectives at Company A showed night incidents averaged 40% longer than daytime incidents—a pattern consistent with research on cognitive performance under sleep disruption. For Company A, 52 night incidents averaging 1.4 hours instead of 1 hour means 21 additional hours of incident time annually. At $125/hour fully loaded, that is $2,625 in direct cost—plus downstream impact from extended outages.

Company B's incidents resolve at baseline speed. All responders are rested, alert, and working during their normal productive hours.

Hiring disadvantage. Candidates ask about on-call. "Unpaid and mandatory" extends hiring timelines and reduces offer acceptance rates. Company A's average time-to-fill for engineering roles is 52 days. Company B, which mentions paid voluntary on-call as a differentiator, averages 38 days. The 14-day difference, multiplied by the daily cost of an unfilled role ($400 in lost output, based on reduced team capacity during vacancy), costs Company A $5,600 per hire. With four hires per year, that is $22,400 annually.
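
The three smaller hidden costs (night page productivity loss, incident duration premium, and hiring delay) each reduce to a short product of the stated inputs. The sketch below recomputes them; the variable names are illustrative, and the one-hour baseline incident and the rounding of extra hours to 21 follow the article's own figures.

```python
INCIDENTS_PER_YEAR = 104          # two per week
NIGHT_SHARE_PCT = 50              # half occur outside working hours
DAILY_COST = 1_000
HOURLY_COST = 125
NEXT_DAY_LOSS_PCT = 50
NIGHT_DURATION_PREMIUM_PCT = 40   # night incidents run 40% longer
BASELINE_INCIDENT_HOURS = 1
TIME_TO_FILL_A = 52
TIME_TO_FILL_B = 38
VACANCY_COST_PER_DAY = 400
HIRES_PER_YEAR = 4

night_pages = INCIDENTS_PER_YEAR * NIGHT_SHARE_PCT // 100
productivity_loss = night_pages * DAILY_COST * NEXT_DAY_LOSS_PCT // 100

# 52 incidents x 0.4 extra hours = 20.8, which the article rounds to 21.
extra_hours = round(night_pages * BASELINE_INCIDENT_HOURS
                    * NIGHT_DURATION_PREMIUM_PCT / 100)
duration_premium = extra_hours * HOURLY_COST

hiring_delay = ((TIME_TO_FILL_A - TIME_TO_FILL_B)
                * VACANCY_COST_PER_DAY * HIRES_PER_YEAR)

print(productivity_loss, duration_premium, hiring_delay)
# 26000 2625 22400
```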

The comparison

Cost Category	Company A	Company B
On-call compensation	$0	$39,000
Night page productivity loss	$26,000	$0
Attrition (2 engineers)	$280,000	$0
Incident duration premium	$2,625	$0
Hiring delay cost	$22,400	$0
Total annual cost	$331,025	$39,000

Company A spends $331,025 on on-call. Company B spends $39,000. The difference is $292,025 per year.

Company B's on-call investment returns 8.5:1. Every dollar spent on fair compensation avoids eight dollars in hidden costs. Company A believes it is saving money by not paying for on-call. It is paying more than eight times what Company B pays. The costs are simply hidden where finance does not measure them.
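
The totals and the ratio can be reproduced directly from the comparison table. The sketch below takes the table rows as given; the dictionary layout and names are illustrative.

```python
# (Company A, Company B) annual cost per category, from the comparison table.
COSTS = {
    "On-call compensation": (0, 39_000),
    "Night page productivity loss": (26_000, 0),
    "Attrition (2 engineers)": (280_000, 0),
    "Incident duration premium": (2_625, 0),
    "Hiring delay cost": (22_400, 0),
}

total_a = sum(a for a, _ in COSTS.values())
total_b = sum(b for _, b in COSTS.values())
difference = total_a - total_b
ratio = total_a / total_b  # Company A's cost per dollar Company B spends

print(total_a, total_b, difference)  # 331025 39000 292025
print(f"{ratio:.2f}")                # 8.49
```

The ratio of about 8.5 compares Company A's total cost with Company B's spend. Swapping in a company's own figures for each row shows how sensitive the result is to the attrition estimate, which dominates the total.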

The Flywheel

The Blockdaemon system created compounding returns rather than compounding costs.

Fair compensation meant engineers volunteered for on-call rather than resenting it. Developers on-call for their own systems meant faster resolution and direct feedback. SRE pairing meant knowledge transfer and skill development. Better code quality meant fewer incidents. Fewer incidents meant less on-call burden. Less burden with fair pay meant the system remained sustainable.

Each element reinforced the others. The flywheel accelerated.

The industry default creates the opposite dynamic. Unpaid on-call breeds resentment. Resentment breeds disengagement. Disengagement slows resolution. Slow resolution means incidents last longer and happen more often. More incidents mean more burnout. Burnout means attrition. Attrition means the remaining engineers carry more on-call load. The death spiral accelerates.

Implementation

The system described here is not universal. It requires certain structural elements.

Global distribution is essential for follow-the-sun coverage. Without time zone coverage, night and weekend pages are unavoidable. A single-location team cannot structure its way out of 3am alerts. Modified versions (generous compensation for night pages, meaningful comp time) can reduce the burden but not eliminate it.

SRE investment pays forward. Pairing on-call engineers with site reliability engineers requires having SREs in the first place. At Blockdaemon, adding dedicated SRE capacity reduced incident frequency by approximately 30% within the first year, while mean time to resolution dropped from 52 minutes to 31 minutes. The upfront investment returned value within months.

Clear ownership boundaries matter. Putting developers on-call for their own systems requires knowing who owns what. Ambiguous ownership creates friction when pages arrive. The organisational clarity required for effective on-call often reveals broader structural problems worth solving.

Leadership must view compensation as investment. Finance teams conditioned to minimise visible line items will resist. The case must be made explicitly: visible on-call costs are lower than invisible attrition and productivity costs. The calculation above provides the template.

Companies without global distribution can still apply the core principles: fair compensation for incident work, payment for availability rather than just response, expert pairing during incidents, feedback loops between on-call experience and code quality, and rigorous measurement of the hidden costs that unpaid on-call generates. The specific implementation varies; the economics do not.

The Strategic View

Most companies think about on-call tactically. Coverage must exist. Engineers must be assigned. The goal is minimum viable availability at minimum visible cost.

The strategic view is different. On-call is a system that either builds capability or destroys it. Designed well, on-call creates learning, improves code quality, accelerates incident resolution, and builds the institutional knowledge that makes engineering organisations effective. Designed poorly, on-call burns out the best engineers, degrades incident response, and creates a culture where production is someone else's problem.

The Blockdaemon system was not designed to be generous. It was designed to create compounding positive outcomes: for individual engineers, for team productivity, for learning, and for business results. The generosity was a mechanism, not a goal. Fair compensation and humane structure were inputs that produced outputs the company valued.

The question for engineering leaders is not "how do we staff on-call at minimum cost?" It is "how do we design on-call to create the outcomes we want?" The answer usually involves paying for it. The cost is lower than it appears, and the returns are higher than most companies measure.
