Metrics are incentives

Dan Murphy, May 26, 2024

[Image: Legolas taking aim. Choosing a target.]

The first step is to measure whatever can be easily measured. The second step is to disregard that which can’t easily be measured or given a quantitative value. The third step is to presume that what can’t be measured easily really isn’t important. The fourth step is to say that what can’t be easily measured really doesn’t exist. This is suicide.

- Daniel Yankelovich


Howdy and welcome back to this series on metrics:

  1. Metrics are conceptual compression - Metrics almost unavoidably reduce the amount of information available. This has both positive and negative implications.
  2. Metrics are incentives - People love to set targets, people love to hit targets. What will they do to hit those targets? Is it what you want them to do?
  3. Metrics are persuasion - Metrics created and interpreted by medical doctors, scientists, and economists are often misleading. Are you sure yours are fine?
  4. Metrics are useful - After three posts of me telling you how dangerous and hard to get right metrics are, I tell you metrics are sometimes very useful. With great power comes great responsibility, as they say.

Maybe you’ve heard some version of the quote above, which describes the McNamara fallacy: falling in love with the numbers and ignoring or devaluing whatever hasn’t been quantified.

Despite its common association with the Vietnam War (during which Robert McNamara was Secretary of Defense), it actually came out of Daniel Yankelovich’s time working in market research at Ford:

“When McNamara was president at Ford we conducted a number of market research projects … Our research afforded a blend of quantitative and qualitative measures. On the quantitative side we described the demographic characteristics of the people who were then buying small [cars]… On the qualitative side we found that a small car had nine different meanings for the average automobile buyer … Research showed a taste for cars that stressed functional quality; cars that were sportier; cars that expressed their owner’s personality and individuality … When we first presented our research findings on the small car the section on the research measuring the nine meanings of what a small car meant to people was discarded on the grounds that it could not be quantified.”

You could restate this fallacy as ignoring the fact that qualitative data is data. “Qualitative” doesn’t have to mean “leading questions” or “biased research”¹.

If the McNamara fallacy isn’t your cup of tea, perhaps you’ve read some version of Goodhart’s Law, often stated as “Every measure which becomes a target becomes a bad measure”.

Why does it become a bad measure? Because people start to game it.

People game metrics in part because Goodhart’s Law is a form of the Hawthorne effect, which notes that people behave differently when they know they’re being watched.

When you set a target for someone this signals to them that they’re being watched, and they change their behavior accordingly.

This isn’t necessarily a bad thing. Indeed, most of the time the person setting a target wants exactly that: for someone to change their behavior in order to reach a goal.

The happy path goes like this:

  • Engineering Team A keeps shipping high-consequence bugs.
  • Management starts classifying the severity and measuring the frequency of bugs (a minimal sketch of this kind of tracking follows this list).
  • Management communicates their desire to minimize the frequency of severe bugs going out, and regularly revisits those figures with the team.
  • Engineering Team A does something different: more testing, more thorough code review, adding an error tracking service. Fill in the blank.
  • Fewer bugs go out the door.
  • Profit.
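
To make that measurement step concrete, here’s a minimal Python sketch of the kind of tracking Management might set up. The Bug record, its fields, and the severity labels are all hypothetical, invented for illustration:

    from collections import Counter
    from dataclasses import dataclass
    from datetime import date

    # Hypothetical bug record; the fields and severity labels are
    # illustrative, not from any particular tool.
    @dataclass
    class Bug:
        shipped_on: date
        severity: str  # "low", "medium", "high", or "critical"

    def severe_bug_frequency(bugs: list[Bug]) -> Counter:
        """Count high-consequence bugs per month so the trend can be revisited."""
        return Counter(
            bug.shipped_on.strftime("%Y-%m")
            for bug in bugs
            if bug.severity in ("high", "critical")
        )

    bugs = [
        Bug(date(2024, 3, 4), "critical"),
        Bug(date(2024, 3, 18), "low"),
        Bug(date(2024, 4, 2), "high"),
    ]
    print(severe_bug_frequency(bugs))  # Counter({'2024-03': 1, '2024-04': 1})

Note that the sketch only measures. Whether a number like this becomes a target, with rewards attached, is a separate decision, and we’ll come back to why that matters.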

Gaming the system

The key question raised by the Hawthorne effect is: how does that person change their behavior?

Sometimes people, intentionally or not, change their behavior in a way that achieves the goal at the cost of the greater good.

It’s not hard to imagine how this could happen. Kent Beck and Gergely Orosz provide a few real-life examples in their recent write-up on measuring software developer efficacy:

  • “I once worked with a recruiter whom other hiring managers raved about. This recruiter had a 100% close rate. Closing means that when a candidate gets an offer, we get them to sign. Back then, my part of the organization was so stretched that recruiters did most of the closing conversations and took care of details. Most recruiters closed 70-80% at most. I was told this recruiter is a ‘rockstar’ among their peers. I discovered in a painful way how they did it. About 6 months after this recruiter left, performance reviews and bonuses time came around. Several engineers in our group complained about their bonus amounts, saying they’d been “guaranteed” 10x the amount they actually got. After some digging, all signs pointed to the rockstar recruiter; they’d made verbal promises to engineers in private settings, which were outrageously untrue.”

  • “I saw firsthand how a ‘star’ salesperson operated. They always hit their quota, and collected a hefty bonus. How did they do it? Well, they knew that sales could fall through any time, so they always had several prospects ‘in the pocket’ who they could convert anytime, and so put them off until the end of each quarter. They only converted them if they needed to hit their quota. In order to maximize personal gains, they did not maximize the company’s gain.”

This is what Goodhart’s Law cautions against. Setting a target lets someone know they’re being watched, and that person will likely change their behavior to satisfy the metric. The side effects of their doing so may or may not align with the original intentions of the target.

Outcomes, not inputs

There is a separate but related problem, called out by Beck and Orosz in their article, with relying solely on quantitative measures as a means of incentivizing someone or evaluating their performance.

That is, attempts to quantify productivity in complex domains almost always measure inputs or outputs rather than outcomes.

Why? Because while it’s easy to measure and set targets for inputs like hours worked, or outputs like how many story points or bugs an engineer cranks out², it’s far harder to get a sense of what kind of outcome those things produced for end users.

Does an engineer have a knack for making customers happy? Making their day easier or more productive? Do they do this with product ideas, with performance optimizations, with UX improvements?

It’s even harder to do this in a team setting where a person’s contribution to outcomes may not always be direct.

Do they provide coaching, establish patterns, or build infrastructure that enables their teammates to be more effective? Do they bring some positive yet intangible dimension to the team’s culture of excellence?

These kinds of outcome-oriented questions are qualitative ones, and they will often have qualitative answers.

Context

Whether you give a team a target of zero product defects, or of customer ticket response time less than 30 minutes, or of $1,000,000 closed-won, you are incentivizing them to optimize for that target.

But like all metrics, targets are often a form of conceptual compression.

That’s to say they are simplifications of the underlying reality, which is often far messier.

A product with zero defects is likely not evolving ambitiously enough. Quick response times to customer inquiries may indicate poor response quality. Deals closed under duress may contain false promises made to prospects.

And so the best approach for evaluating the effectiveness of goals and targets is the same as it is for evaluating the effectiveness of any metric: “decompress” it.

Dig into specific cases and talk to the people involved or impacted, in order to understand the implications behind whatever the metrics say.

This is heeding the warning of the McNamara fallacy: the qualitative stuff matters.

None of this means that quantitative targets and goals are pointless or necessarily ineffective. Only that they should be accompanied by a rich qualitative understanding of the context in which those targets were hit, or not.

Signal from the quantitative

The beauty of evaluating things qualitatively is that it in no way precludes incorporating signals from quantitative measures.

It’s simply that you don’t handcuff yourself to a cut-and-dried interpretation of the numbers as your sole means of evaluating a situation.

That’s to say, a customer success manager absolutely may find it beneficial to track metrics around customer inquiry response time, and a sales leader definitely should track how much each rep closes.

Patterns, outliers or drastic changes in those numbers will surely raise interesting questions and help inform the broader qualitative context.

I’ll say much more about this in part 4 of this series, which will be focused on all of the ways quantitative metrics are useful or even indispensable.

For now let me say that there are some helpful tactics for incorporating quantitative measures into your decision making in a way that doesn’t create perverse incentives or superficial solutions aimed only at addressing the metric rather than the underlying problem.

First, it’s important to distinguish between metrics on the one hand and targets, goals or quotas on the other. The former merely measures something. The latter take that measurement and use it as the indicator of success or failure, often in a way associated with financial and reputational rewards.
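
As a toy illustration of that distinction (the names here are mine, not from any real tool), a metric only records a value, while a target bolts a pass/fail threshold onto it:

    from dataclasses import dataclass

    @dataclass
    class Metric:
        name: str
        value: float  # a measurement, nothing more

    @dataclass
    class Target:
        metric: Metric
        threshold: float

        def hit(self) -> bool:
            # The moment this boolean drives bonuses or punishments,
            # Goodhart's Law comes into play.
            return self.metric.value <= self.threshold

    severe_bugs = Metric("severe bugs shipped this month", 3)
    quota = Target(severe_bugs, threshold=1)
    print(quota.hit())  # False

A team can track Metric-like numbers indefinitely without ever creating a Target-like incentive around them.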

In the example above, with Engineering Team A and their costly bugs, the process I outlined didn’t actually involve setting any targets or goals.

Management started tracking a metric and shared both the metric and their hoped-for outcome with the team.

That’s it. There was no number to hit, and no reward or punishment associated with the metric.

This is a technique from Will Larson’s handbook on engineering management: rather than mandating or incentivizing that something be done, you simply provide a nudge with context.

This is no silver bullet; sometimes a gentle nudge is not enough. But it does surface problems in a way that enables and encourages people to address them holistically.

Interestingly it requires no mandate. You don’t have to be the boss to share information about a problem and offer suggestions for how it might be addressed!

It also creates a culture focused on learning and knowledge sharing. To provide an effective nudge-plus-context you ideally will have some interesting context to share.

Another helpful tactic (also stolen shamelessly from Will Larson. Go read his blog!) is the use of countervailing metrics to make it harder to satisfy the target at the expense of the underlying ideal.

Countervailing metrics are simply metrics that are in tension with one another. As a result it can be more difficult to improve one while also improving the other.

To return to the example of the customer success rep, if you only measure how long it takes them to respond to customers, it’s easy to overlook the experience the rep’s response provides to customers. Said another way, it’s easy to focus on only a facet of the desired outcome rather than the outcome as a whole.

If instead you are measuring a combination of response time and customer rating you’ll be framing your evaluation of the situation in a way that brings the tension between those two things front and center.

That in turn will provide more interesting signal than either metric would on its own.
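
As a toy sketch (the tickets, field names, and values are all invented for illustration), reporting the two countervailing metrics side by side makes it hard to quietly optimize one at the other’s expense:

    from statistics import mean

    # Hypothetical support tickets: each records how fast the rep
    # responded and how the customer rated the interaction.
    tickets = [
        {"response_minutes": 4, "rating": 2.0},   # fast but unhelpful
        {"response_minutes": 45, "rating": 4.8},  # slow but thorough
        {"response_minutes": 12, "rating": 4.5},
    ]

    avg_response = mean(t["response_minutes"] for t in tickets)
    avg_rating = mean(t["rating"] for t in tickets)

    # Speeding up responses at the cost of quality shows up as a
    # falling rating, and vice versa; the tension stays front and center.
    print(f"avg response: {avg_response:.1f} min, avg rating: {avg_rating:.2f}")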

Character

The Hawthorne effect says people change their behavior when they know they’re being watched.

John Wooden said that “The true test of a person’s character is what they do when no one is watching”.

I’m not so naive as to think you can simply try to hire high-character people and then wash your hands of any kind of inspection of their performance.

For one, it’s too hard to get an accurate assessment of someone’s character in a short interview process. You really only come to know that over years of working with them.

For another, character does not necessarily come with competence in a given domain or success in a given project or time period.

Nevertheless I do think having a sense of a person’s character offers you insights that no metric can.

What drives them? Do they give up easily? How do they act when things aren’t going their way? Do they learn and grow and change over time?

To me, these questions are crucially important in selecting the people you want to have a big role in your organization, in your community, and in your life writ large.

And tellingly you can really only measure them qualitatively.

Perhaps that’s something to keep in mind alongside whatever metrics you may use.


Footnotes

  1. For more on this, read The Mom Test and Just Enough Research.
  2. There is a legitimate case that measuring outputs can be useful in giving you a sense of how well a person or team executes on a task given to them. It’s simply important to not assume that execution will yield the desired end result.