The Dashboard No One Believes, CodeGood

In Q1 2023, a mid-sized technology company introduced a new engineering metrics dashboard. The dashboard tracked deployment frequency, lead time for changes, time to restore service, and change failure rate. These metrics, drawn from the DORA research, were meant to provide objective measurement of engineering effectiveness and drive continuous improvement. The VP of Engineering set targets for each metric and tied quarterly bonuses to achieving those targets.

Six months later, the dashboard showed remarkable progress. Deployment frequency had increased from 47 deployments per month to 483 deployments per month, a roughly tenfold increase. Lead time for changes had dropped from 6.2 days to 1.8 days. The metrics indicated that the engineering organization had dramatically improved its performance. The VP of Engineering presented these results to the board as evidence that the metrics program was driving substantial productivity gains.

Three months after that presentation, a new CTO joined the company through an acquisition. One of his first actions was to audit the engineering metrics. He discovered that deployment frequency had increased not because teams were shipping features faster but because they had learned to game the metric. Configuration file updates were being counted as deployments. Database schema migrations that did not change application behavior were being counted as deployments. Documentation updates were being deployed and counted. The actual rate of meaningful feature releases had barely changed. The metric was showing dramatic improvement while the underlying reality was stagnant.

This is Goodhart's Law in action: when a measure becomes a target, it ceases to be a good measure. The phenomenon appears across industries and contexts. Organizations implement metrics to drive better outcomes. People respond by optimizing for the metrics rather than the outcomes. The metrics improve while the underlying performance does not. Eventually, everyone knows the metrics are meaningless, but the metrics persist because admitting they are meaningless would require admitting that the improvement they showed was illusory.

The Measurement Imperative

Modern management philosophy holds that what gets measured gets managed. Organizations that want to improve performance must measure that performance objectively. Intuition and judgment are unreliable; metrics provide ground truth. This reasoning is sound when metrics accurately capture the phenomena they are meant to measure. It fails when metrics become proxies that can be optimized independently of the underlying phenomena.

Engineering productivity is particularly susceptible to this failure because the outcomes that matter (delivering customer value, maintaining system reliability, enabling future development) are difficult to measure directly. Organizations resort to proxy metrics that correlate with these outcomes: lines of code written, tickets closed, deployment frequency, code coverage percentage. These proxies work only if they maintain their correlation with actual outcomes. When people optimize the proxies, the correlation breaks down.

A classic example is lines of code as a productivity metric. Early software engineering research found that more productive teams produced more lines of code per engineer per month. This correlation led some organizations to measure and reward engineers based on code volume. Engineers responded by writing verbose code, avoiding code deletion during refactoring, and splitting logical units across multiple lines. Code volume increased while code quality and actual productivity declined. The metric had destroyed its own validity.

The technology company's deployment frequency metric followed the same pattern. Initially, deployment frequency correlated with team effectiveness: teams that could deploy more often had better automation, better testing, and better operational practices. When deployment frequency became a target, teams optimized it directly. They started counting things as deployments that were not deployments in any meaningful sense. The correlation between the metric and team effectiveness broke down. The metric became pure theater.

The Gaming Patterns

When metrics are tied to incentives, creative compliance is inevitable. Teams find ways to hit targets without achieving the intended outcomes. The specific gaming strategies vary by metric, but common patterns emerge across organizations.

One pattern is redefinition: changing what counts toward the metric without changing actual behavior. The deployment frequency gaming was redefinition. Teams redefined deployment to include config changes and documentation updates. This increased the deployment count without increasing the rate at which features shipped. The redefinition was technically defensible (config changes do get deployed) but violated the spirit of the metric (measuring how often teams ship meaningful changes).
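
The effect of redefinition can be made concrete with a small sketch. The record shape, the kind labels, and the rule that only "feature" and "bugfix" deployments count as meaningful are assumptions made for illustration; nothing here describes how the company's pipeline actually classified deployments, and the numbers are invented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    service: str
    kind: str  # hypothetical labels: "feature", "bugfix", "config", "docs"


# Assumption: only these kinds change what users experience.
MEANINGFUL_KINDS = frozenset({"feature", "bugfix"})


def deployment_counts(deployments):
    """Return (raw_count, meaningful_count, breakdown_by_kind)."""
    by_kind = Counter(d.kind for d in deployments)
    meaningful = sum(n for kind, n in by_kind.items() if kind in MEANINGFUL_KINDS)
    return len(deployments), meaningful, by_kind


# Illustrative month; the figures are invented, not the company's data.
month = (
    [Deployment("api", "feature")] * 4
    + [Deployment("api", "bugfix")] * 2
    + [Deployment("api", "config")] * 30
    + [Deployment("docs", "docs")] * 14
)

raw, meaningful, by_kind = deployment_counts(month)
print(raw, meaningful)  # 50 6
```

Reporting the breakdown next to the headline count is one way to notice when the total is growing mostly in categories that do not ship anything to users.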

Another pattern is threshold manipulation: managing work to just barely meet the target. A company measured bug resolution time and set a target of 80% of bugs resolved within 5 days. Teams responded by triaging bugs into two categories: important bugs that would be fixed within 5 days, and unimportant bugs that would be closed as "will not fix." The 80% target was consistently met, and bug resolution time appeared excellent. The number of unfixed defects kept growing, however, because anything that could not be fixed in 5 days was simply closed rather than left open in the backlog, where it would hurt the metric.
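
A rough sketch of the arithmetic behind this pattern follows. The Bug record, the resolution labels, and the two compliance functions are hypothetical; the point is only that the same set of bugs yields very different numbers depending on whether a "will not fix" closure counts as a resolution.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bug:
    days_open: int
    resolution: Optional[str]  # "fixed", "wont_fix", or None if still open


def compliance_as_reported(bugs, limit_days=5):
    """Share of closed bugs closed within the limit; any closure counts."""
    closed = [b for b in bugs if b.resolution is not None]
    on_time = [b for b in closed if b.days_open <= limit_days]
    return len(on_time) / len(closed)


def compliance_fixed_only(bugs, limit_days=5):
    """Share of all reported bugs that were actually fixed within the limit."""
    fixed_on_time = [
        b for b in bugs if b.resolution == "fixed" and b.days_open <= limit_days
    ]
    return len(fixed_on_time) / len(bugs)


# Invented sample: 5 quick fixes, 4 quick "will not fix" closures, 1 slow fix.
bugs = [Bug(2, "fixed")] * 5 + [Bug(1, "wont_fix")] * 4 + [Bug(9, "fixed")]

print(compliance_as_reported(bugs))  # 0.9
print(compliance_fixed_only(bugs))   # 0.5
```

Tracking the share of closures by resolution type alongside the headline figure would make a rising "will not fix" rate visible.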

A third pattern is focus shifting: neglecting unmeasured aspects of performance to optimize measured aspects. A company measured test coverage and set a target of 90% code coverage. Teams achieved this by writing tests that executed code but did not verify correctness. Test coverage reached 90%, but defect rates did not decline because the tests were not catching bugs. The focus on coverage came at the expense of test quality, which was not measured.
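
The difference between executing code and verifying it can be shown in a few lines. This is a minimal sketch with an invented function and invented checks, not code from any company described here; the bug is planted deliberately.

```python
def apply_discount(price, percent):
    """Buggy on purpose: adds the discount instead of subtracting it."""
    return price + price * percent / 100


def check_executes_only():
    # Runs every line of apply_discount, so a coverage tool counts it as
    # covered, but nothing is asserted and the bug goes unnoticed.
    apply_discount(100, 10)


def check_verifies_result():
    # Same coverage, but the assertion fails on the buggy implementation.
    assert apply_discount(100, 10) == 90


def run(check):
    """Return True if the check passes, False if an assertion fails."""
    try:
        check()
    except AssertionError:
        return False
    return True


print(run(check_executes_only))    # True
print(run(check_verifies_result))  # False
```

Both checks produce identical line coverage for apply_discount; only the second one can catch the defect.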

A fourth pattern is sandbagging: deliberately lowering performance to create easier targets. Teams that knew their performance would be measured in the future would artificially suppress their metrics during the baseline period. Then, during the measurement period, they would return to normal performance and show dramatic improvement. A team measured on deployment frequency during Q2 would delay deployments in Q1, creating a low baseline that made Q2 improvement look substantial.

The Institutional Knowledge Problem

Organizations develop institutional knowledge about which metrics are real and which are theater. New employees believe the metrics. Experienced employees know the metrics are gamed but continue reporting them because the alternative would be admitting the gaming publicly. This creates a culture where everyone privately understands that metrics are unreliable while publicly continuing to track and report them.

A financial services company had tracked "time to market" for new features for six years. The official metric showed that time to market had improved from 4.2 months to 2.8 months over that period. Leadership cited this as evidence of improved engineering productivity. When an external consultant interviewed engineers, they uniformly explained that the metric was meaningless. Time to market was calculated from when a feature was added to the roadmap to when it shipped. Teams had learned to delay adding features to the roadmap until development was already underway. This made the measured time shorter without making actual development faster.
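
The mechanics are easy to sketch. The field names and dates below are hypothetical, and the work_started field in particular is an assumption: the metric as described only used the roadmap date and the ship date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Feature:
    work_started: date   # hypothetical: when development really began
    roadmap_added: date  # when the feature appeared on the roadmap
    shipped: date


def measured_days(feature):
    """Time to market as the official metric computes it."""
    return (feature.shipped - feature.roadmap_added).days


def actual_days(feature):
    """Elapsed time from the real start of development to shipping."""
    return (feature.shipped - feature.work_started).days


# Same feature, same work, same ship date; only the roadmap entry moves.
planned_first = Feature(date(2023, 1, 9), date(2023, 1, 2), date(2023, 5, 1))
added_late = Feature(date(2023, 1, 9), date(2023, 2, 20), date(2023, 5, 1))

print(measured_days(planned_first), actual_days(planned_first))  # 119 112
print(measured_days(added_late), actual_days(added_late))        # 70 112
```

Delaying the roadmap entry cuts the measured figure from 119 days to 70 while the development time stays at 112 days.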

The consultant asked why no one had raised this with leadership. The answers were revealing. Some engineers had raised it early on and were told the metric was important for board reporting. Others felt that pointing out the gaming would reflect badly on their own team, since reporting that the metric is gamed means admitting that your team is gaming it. Most had concluded that the metric was primarily for external communication and that changing it was not worth the political capital required. The metric persisted not because anyone believed it but because changing it was harder than maintaining the fiction.

The Organizational Dynamics

Metrics gaming creates perverse organizational dynamics. Teams spend time optimizing metrics rather than improving actual performance. Leaders make decisions based on metrics they privately suspect are unreliable. The gap between official metrics and actual reality becomes institutionalized, creating a culture where data is not trusted.

A technology company implemented a customer satisfaction metric tied to support response times. Support tickets were supposed to receive an initial response within 4 hours. The company measured what percentage of tickets met this target and tied support team bonuses to achieving 90% compliance. The support team achieved 94% compliance within two months. Customer satisfaction declined.

Investigation revealed that the support team had optimized for initial response time by sending template responses within 4 hours regardless of whether they had investigated the issue. A ticket about a complex technical problem would receive a response within 4 hours saying "we are investigating this and will update you soon." The 4-hour target was met. The actual resolution might take days or weeks. Customers were frustrated because they received quick but unhelpful responses. The metric showed improvement while the customer experience degraded.

The metric created a focus on speed over quality. Support engineers who spent time actually solving problems before responding would miss the 4-hour target. Support engineers who sent quick template responses would hit the target. The incentive structure rewarded the wrong behavior. When this was raised with management, the response was that initial response time was important for customer satisfaction. The metric remained. The gaming continued. Customer satisfaction continued declining while the official metric showed excellent performance.

The Visibility Paradox

Dashboard metrics create a visibility paradox: what is measured becomes visible and therefore appears important, while what is not measured becomes invisible and therefore appears unimportant. This can create a situation where organizations optimize measured phenomena while ignoring more important unmeasured phenomena.

A SaaS company implemented a detailed dashboard tracking engineering velocity. Velocity was measured as story points completed per sprint. Teams with higher velocity were recognized and rewarded. Teams with lower velocity were put on improvement plans. The dashboard was updated in real-time and displayed prominently in the office. Velocity became the primary lens through which engineering performance was evaluated.

What was not on the dashboard: technical debt accumulation, code quality metrics, system reliability, or time spent helping other teams. These activities were important but not measured. Teams learned that spending time on measured activities (completing story points) was rewarded while spending time on unmeasured activities (refactoring code, improving monitoring, helping colleagues) was not. The unmeasured activities declined. Velocity increased while technical debt accumulated, system reliability degraded, and team collaboration decreased.

Two years later, the company faced a crisis. Their codebase had become so difficult to work with that velocity had begun declining despite teams working harder. Reliability incidents were occurring weekly. Cross-team projects were failing because teams had stopped collaborating. The root cause was that the velocity metric had created tunnel vision. Teams optimized what was visible (velocity) while neglecting what was invisible (sustainability, reliability, collaboration). The dashboard had inadvertently destroyed the things it was meant to improve.

The Narrative Function

Metrics often serve a narrative function rather than a measurement function. Organizations use metrics to tell stories to boards, investors, and employees about progress and improvement. The accuracy of these metrics matters less than their narrative utility. This creates pressure to maintain metrics that tell the right story even when everyone knows the metrics are unreliable.

A startup raised Series B funding with a deck that highlighted their "130% quarter-over-quarter growth in deployment frequency" as evidence of engineering productivity improvements. The actual deployment frequency increase was driven by a change in how deployments were counted, not by actual productivity changes. The investors were not technical enough to question the metric. The metric served its narrative purpose: it demonstrated that the engineering team was improving. Whether the improvement was real was secondary to whether the story was compelling.

After the funding round, the metric could not be changed without implicitly admitting that the previous metric was misleading. This would raise questions about what else in the fundraising narrative might have been misleading. The metric was locked in. Even though the engineering team knew it was meaningless, it continued to be reported quarterly because changing it would create more problems than it would solve.

This pattern appears frequently. Metrics that are presented externally become difficult to change internally because changes would raise questions about historical values. Organizations become trapped by their own narratives. The metrics persist not because they are useful but because changing them would require explaining why they were reported previously if they were not useful.

The Unmeasurable Important

The most important aspects of engineering work are often the least measurable. Mentoring junior engineers, maintaining system documentation, participating in architecture discussions, fixing obscure bugs that affect few users, and improving development tooling are all valuable but difficult to quantify. When organizations focus on measurable metrics, these unmeasurable activities are neglected.

A senior engineer at a technology company spent approximately 30% of her time helping junior engineers. She reviewed their pull requests thoroughly, explained architectural decisions, and helped them debug complex problems. This mentorship was valuable but not counted in any metric. When the company implemented a productivity dashboard measuring commits, pull requests merged, and story points completed, her measured productivity appeared low compared to engineers who focused entirely on their own work.

Her manager understood that her mentorship was valuable and protected her from pressure to improve her metrics. But the manager left the company and the new manager was not familiar with her contributions. The new manager looked at the dashboard, saw low productivity numbers, and put her on a performance improvement plan. She reduced time spent on mentorship and focused on activities that would improve her metrics. Her measured productivity increased. The team's overall productivity declined because junior engineers no longer received effective mentorship. The dashboard showed individual improvement while obscuring collective degradation.

The Legibility Trap

Organizations favor metrics that are legible (easy to measure and compare) over metrics that are meaningful (accurately capturing important phenomena). Legibility wins because illegible metrics cannot be dashboarded, compared across teams, or reported to stakeholders. This creates systematic bias toward measuring what is easy rather than what is important.

Code review thoroughness is more important than code review speed, but speed is more legible. Organizations measure time to approval rather than quality of feedback. Teams optimize for fast approvals rather than thorough reviews. The legible metric (speed) is optimized while the important metric (thoroughness) is neglected.

Customer value delivered is more important than features shipped, but features are more legible. Organizations measure features per quarter rather than value per feature. Teams optimize for shipping many small features rather than fewer high-value features. The legible metric (feature count) is optimized while the important metric (customer value) is neglected.

System reliability is more important than uptime percentage, but uptime is more legible. Organizations measure availability rather than user experience. Teams optimize for technical uptime while tolerating degraded performance that frustrates users. The legible metric (uptime) is optimized while the important metric (reliability) is neglected.

The legibility trap is particularly insidious because the shift from meaningful to legible metrics happens gradually. An organization starts measuring uptime as a proxy for reliability. Initially, the proxy works well. Over time, teams learn to optimize uptime directly (making sure servers stay running) while allowing reliability to degrade through performance problems that do not affect uptime. The metric continues to be reported because it is legible and familiar. Its declining correlation with actual reliability goes unnoticed until a crisis reveals the gap.
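
The gap between uptime and experienced reliability can be illustrated with a small sketch. The sample format, the 500 ms threshold, and the request counts are all assumptions chosen for the example, not measurements from any real system.

```python
def availability(samples):
    """Fraction of requests that returned without an error."""
    return sum(1 for ok, _ in samples if ok) / len(samples)


def good_experience(samples, max_latency_ms=500):
    """Fraction of requests that succeeded and were acceptably fast."""
    return sum(1 for ok, ms in samples if ok and ms <= max_latency_ms) / len(samples)


# Each sample is (succeeded, latency_ms). Invented data: the service almost
# never fails outright, but 29 of 100 requests take four seconds.
samples = [(True, 120)] * 70 + [(True, 4000)] * 29 + [(False, 0)]

print(availability(samples))     # 0.99
print(good_experience(samples))  # 0.7
```

An availability figure of 99% sits alongside a much lower share of requests that a user would consider acceptable, which is the kind of divergence an uptime-only dashboard hides.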

The Comparison Problem

Metrics enable comparisons across teams, which creates competitive dynamics that can be either productive or destructive. When teams are compared on metrics they can game, the result is competition to game metrics rather than competition to improve actual performance.

A company published internal rankings showing each team's deployment frequency. The stated goal was to encourage teams to learn from high-performing teams. The actual result was that teams competed to have high deployment frequency numbers. Teams split changes into smaller deployments to increase their deployment count. Teams deployed configuration changes that had no user impact. Teams deployed and immediately rolled back changes just to count the deployment. The rankings showed high-performing teams, but the underlying behavior was pure gaming.

The company could have addressed this by improving the metric definition. Instead, they added more metrics: deployment size, deployment success rate, customer impact per deployment. Teams responded by optimizing all the metrics simultaneously. Small, successful, low-impact deployments became the norm. This looked good on dashboards but did not correspond to shipping more valuable features. The teams with the best metrics were often not the teams shipping the most important work.

The Reset Strategy

Organizations occasionally attempt to reset metrics that have become gamed. This is difficult because the gaming has usually become institutionalized and removing the metric feels like removing accountability. The reset must carefully balance maintaining useful measurement against avoiding the gaming that made the previous metrics meaningless.

The technology company that discovered its deployment frequency gaming attempted a reset. The new CTO acknowledged that the existing metrics were being gamed and announced that the company would develop better metrics. He created a working group of engineers and managers to define new metrics that would resist gaming. The working group spent three months researching, debating, and designing new metrics focused on customer value rather than engineering activity.

The new metrics were more complex: they tracked feature usage by customers, customer satisfaction with new features, and actual revenue impact of features. These metrics were harder to game because they required external validation (customer behavior and satisfaction) rather than internal counting (deployments or commits). The complexity made them harder to understand and less visible than the previous simple metrics. Adoption was slow because people found the new metrics less intuitive than deployment frequency.

Six months after the reset, the organization had two sets of metrics. The old metrics (deployment frequency, lead time) were still being tracked because they were simple and familiar. The new metrics (customer value, satisfaction impact) were being tracked by some teams but not others. Neither set was fully trusted. The reset had created confusion rather than clarity. The CTO's eventual solution was to stop publishing organization-wide metrics and instead have teams define their own measures of success that they would report qualitatively. This removed gaming but also removed comparability.

The Qualitative Alternative

Some organizations respond to metrics gaming by abandoning quantitative metrics in favor of qualitative assessment. This approach recognizes that meaningful evaluation requires nuance and context that metrics cannot capture. The challenge is maintaining accountability and comparability without metrics.

A technology company eliminated its engineering metrics dashboard and replaced it with quarterly narrative reviews. Each team would write a narrative describing what they accomplished, what challenges they faced, and what they learned. Leaders would evaluate teams based on these narratives rather than on metrics. The narratives could not be easily gamed because they required explaining decisions and outcomes in context.

The shift to narrative reviews reduced gaming but introduced different problems. Evaluation became more subjective. Teams that were good at writing compelling narratives appeared more successful than teams that were actually delivering more value but writing less compelling narratives. Cross-team comparison became difficult because narratives were not standardized. Some leaders wanted to reintroduce metrics to enable "objective" evaluation, creating pressure to return to the previous system.

The company found a compromise: teams would report narratives that included supporting data of their choosing. The data was not standardized across teams, which prevented direct comparison and gaming. But teams could use data to support their narratives. A team might report deployment frequency if that was relevant to their goals, or they might report customer satisfaction if that was more relevant. The shift from standardized metrics to team-chosen data reduced gaming while maintaining some quantitative grounding.

The Cultural Diagnosis

Metrics gaming is ultimately a cultural problem rather than a measurement problem. Organizations with high-trust cultures that value actual outcomes over appearance can use metrics productively. Organizations with low-trust cultures that value performance theater over actual performance will game any metric regardless of how carefully it is designed.

The technology company's deployment frequency gaming was a symptom of a culture that valued metrics performance over actual performance. When deployment frequency became a target tied to bonuses, the message was clear: hitting the target was more important than shipping features. Teams responded rationally to the incentive structure. The gaming revealed that the organization prioritized measurable results over meaningful results.

Changing this culture required more than changing metrics. It required changing how success was defined, how work was evaluated, and how incentives were structured. The new CTO's decision to move away from organization-wide metrics was not primarily about metrics but about culture. He was signaling that the organization would value outcomes over metrics, judgment over measurement, and substance over appearance. The cultural change was more important than the specific measurement approach.

Conclusion

The dashboard that no one believes is a fixture of modern organizations. Engineering metrics dashboards, sales performance dashboards, and customer satisfaction dashboards all share a common feature: the people closest to the work know the metrics are gamed while the people furthest from the work trust the metrics because they have no alternative. The dashboards persist because they serve functions other than measurement: they create the appearance of management, they provide narratives for stakeholders, and they give leaders something concrete to discuss.

The technology company's deployment frequency metric showed dramatic improvement that was almost entirely illusory. The new CTO's audit revealed this, but his solution (replacing quantitative metrics with qualitative assessment) created different problems. There is no perfect solution because the fundamental problem is not the metrics but the assumption that complex phenomena like engineering productivity can be reduced to simple numbers.

Organizations that recognize this can use metrics modestly: as one input among many, as a prompt for investigation rather than a conclusion, and as a tool for learning rather than a mechanism for control. Metrics work when they are descriptive (helping us understand what is happening) rather than prescriptive (telling us what to optimize). The moment metrics become targets, they stop being useful measures.

The lesson is not that metrics are useless but that metrics are dangerous. They are dangerous because they create the illusion of objectivity while being easily manipulated. They are dangerous because they focus attention on what is measured while obscuring what is not measured. They are dangerous because they can be gamed, and once gaming begins, the metrics become worse than useless: they actively mislead.

The dashboard no one believes is worse than having no dashboard because it creates false confidence while consuming resources to maintain the fiction. Organizations would be better served by abandoning obviously gamed metrics than by continuing to report them. The courage to say "this metric is not telling us what we thought it would tell us" is more valuable than the comfort of having numbers to report. Measurement is useful only when the measures are trusted, and trust requires acknowledging when metrics fail rather than continuing to report metrics that everyone privately knows are meaningless.
