Metrics are powerful

Dan Murphy, December 17, 2024

Man of steel.

Invert, always invert!

- Charlie Munger, quoting Carl Jacobi


Welcome to part 4 of this series on metrics:

  1. Metrics are conceptual compression - Metrics almost unavoidably reduce the amount of information available. This has both positive and negative implications.
  2. Metrics are incentives - People love to set targets, people love to hit targets. What will they do to hit those targets? Is it what you want them to do?
  3. Metrics are persuasion - Metrics created and interpreted by medical doctors, scientists, and economists are often misleading. Are you sure yours are fine?
  4. Metrics are powerful - After three posts of me telling you how dangerous and hard to get right metrics are, I tell you metrics are sometimes very useful. With great power comes great responsibility, as they say.

Swing carefully

A few years ago the Bloomberg columnist (and one of my favorite writers) Matt Levine wrote The Only Crypto Story You Need, a sweeping, hilarious, insightful deep dive into what the hell crypto is, what’s been going on in that world, and what it might mean for modern finance.

Matt is - with good reason! - a bit of a crypto skeptic, but he’s also - with good reason! - not entirely sour on the subject. He puts it well:

First, I don’t write about crypto as a deeply embedded crypto expert. I’m not a true believer. I write about crypto as a person who enjoys human ingenuity and human folly, and finds a lot of both in crypto. Conversely, I didn’t sit down and write 40,000 words to tell you that crypto is dumb and worthless and will now vanish without a trace. That would be an odd use of time.

Sadly, I’m not quite as smart, funny, or financially savvy as Matt. But I do feel about the same when it comes to writing about metrics and our attempts to gain insight from them.

Like crypto, the use of statistics in the wild is full of human ingenuity and human folly, and that makes it fascinating to me.

And like Matt, I didn’t write these blog posts in an attempt to convince you that numbers are bad, or that math is a lie, or that it’s futile to use them in an attempt to better understand the world.

In a world where applied statistics (aka machine learning, aka AI) allows you to literally talk to computers as if they were one of the smarter people you know, you can’t credibly argue that the broad concept of “metrics” is somehow useless or inherently bad.

What I hope I’ve shown in the first three posts of this series is that you very much can argue that metrics are often misused or misinterpreted, resulting in bad outcomes.

Metrics are a tool that requires quite a bit of skill to wield usefully.

Hammers are great, but if you swing haphazardly you’ll end up putting a hole in your wall instead of hanging that painting you just bought off Etsy.

I have a lot of confidence in statisticians and machine learning engineers to swing carefully. I have less confidence in the general population to do so, but a lot of problems can be avoided by watching out for the pitfalls mentioned in this series.

So in this final post we’ll look at some examples of metrics done well.

Steel men

One way to help you choose effective metrics is to help you not choose ineffective metrics.

Charlie Munger was fond of using the Carl Jacobi quote above - invert, always invert! - to remind his audience of the power of flipping things around. Rather than trying to directly solve a difficult problem Munger would encourage you to think of how you might cause the problem. Then try to avoid those causes.

Similarly, entrepreneur and investor Peter Thiel is well known for his “steel man” arguments. To strengthen his own arguments, Thiel will take the opposing viewpoint and, rather than constructing a straw man (a dumbed-down and easily defeated version of that viewpoint), he’ll create a “steel man” version of the argument.

Thiel will try to make the strongest case possible for his opponent; he will do everything he can to tear down his own argument. If he does this and his argument still holds water, if it still seems superior to the opposing viewpoint, then he knows his argument is solid.

These are two powerful exercises to undertake, so as you consider metrics to use in your organization, try to tear them down or think of what you don’t want them to do. Read through the first three posts and consider some of the failure modes discussed.

If you are creating metrics to be used as incentives for your people ask yourself:

  • How would someone game this metric?
  • Does it measure outcomes we care about, or only inputs to that outcome? Will people be encouraged to produce as many inputs as they possibly can?
  • Does measuring this fall into the trap of “resulting”, of judging success based on a result that is not fully under our control?
  • If we start to measure this, how might people change their behavior in response?
  • Can we constrain this metric with a countervailing metric? (See the sketch after this list.)
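
As a quick illustration of that last question, here is a minimal, hypothetical sketch of pairing a throughput metric with a countervailing quality metric; the team names and numbers are invented for the example.

    # Hypothetical sketch: constrain a throughput metric (tickets closed) with a
    # countervailing quality metric (reopen rate) so neither can be gamed alone.
    teams = {
        "alpha": {"tickets_closed": 120, "tickets_reopened": 30},
        "beta": {"tickets_closed": 80, "tickets_reopened": 4},
    }

    MAX_REOPEN_RATE = 0.10  # throughput only "counts" while quality stays healthy

    for name, stats in teams.items():
        reopen_rate = stats["tickets_reopened"] / stats["tickets_closed"]
        status = "ok" if reopen_rate <= MAX_REOPEN_RATE else "throughput gated by quality"
        print(f"{name}: closed={stats['tickets_closed']}, reopened={reopen_rate:.0%}, {status}")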

If you are creating metrics meant to persuade some audience (maybe the audience is yourself) ask yourself:

  • Could I slightly adjust the analysis or data in some way that would meaningfully change the conclusions? If so can I really be confident in the results?
  • Am I bending over backwards to show how I might be wrong? Or am I doing everything I can to conceal that possibility?
  • Is the way I’ve chosen to analyze and present the data the most honest way, or the way that best suits my purposes?

In general try to get an understanding of what your metric is compressing (see the sketch after this list):

  • What does the distribution underlying this data look like?
  • What do some specific, typical examples look like?
  • What do some of the outliers look like?
  • Where does this data come from? Can I go talk to someone or look at something to see the original source?
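
To make those questions concrete, here is a small sketch that looks behind a single summary number at the distribution, a few typical cases, and the outliers. The data is invented for illustration.

    import statistics

    # Hypothetical raw data behind a metric, e.g. support ticket resolution times in hours.
    resolution_hours = [2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 12, 14, 15, 72, 168]

    mean = statistics.mean(resolution_hours)
    median = statistics.median(resolution_hours)
    p90 = statistics.quantiles(resolution_hours, n=10)[-1]  # rough 90th percentile
    print(f"mean={mean:.1f}h  median={median:.1f}h  p90={p90:.1f}h")

    # The mean alone hides the long tail; pull typical cases and outliers to read directly.
    ordered = sorted(resolution_hours)
    typical = ordered[len(ordered) // 2 - 1 : len(ordered) // 2 + 2]
    outliers = [x for x in ordered if x > 3 * median]
    print("typical cases:", typical)
    print("outliers worth tracing back to the source:", outliers)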

The Chart of Shipped Potential

An example of a metric that works well, and one that can be sanity-checked by this kind of steel man thinking, is Jack Danger’s Chart of Shipped Potential.

Jack is a software engineering leader who was trying to come up with a metric that would convey the effectiveness of his engineering department to the CEO, head of product, and other leaders at the company.

This is a pretty tricky thing to come up with a metric for, for all of the reasons we’ve discussed throughout this series.

You need to measure an outcome that really matters, not just an input like lines of code written or hours worked.

At the same time, you need to measure something that’s actually within the engineering team’s control. So measuring something like revenue is not going to work.

You also need to make sure the metric isn’t easily gamed; if it is, people will be incentivized to game it, particularly if their performance and pay are tied to it.

And ideally the metric is constrained in some way so that it reflects the strictures of operating in a business that needs to turn a profit (or at least not burn through all of its cash in a quarter).

What Jack came up with was the Chart of Shipped Potential, a concept derived from similar metrics used at Amazon Web Services and Square.

The idea is that you start by plotting on a chart every meaningful new capability that the engineering team releases.

What constitutes a “meaningful” release? There are several options, but you want something that unambiguously signals that your organization believes it to be important, such as a press release to customers touting the new functionality or a prominent slide in an investor pitch deck.

These types of qualitative barometers for what “meaningful” looks like work well, although you could try to use quantitative metrics like downloads or API calls per month.

Once you know what your meaningful releases are, you plot them on a chart over time and you constrain them by dividing by the number of engineers in the organization.

Now you have a way to see the rate at which the engineering org delivers things that the rest of the company believes to be valuable, taking into account how much engineering headcount was required to make those deliveries happen.
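
Here is a minimal sketch of that arithmetic as I understand it; the data format and quarterly granularity are my own assumptions, not Jack’s actual implementation.

    from collections import Counter

    # Hypothetical inputs: the quarter of each "meaningful" release (e.g. anything
    # announced to customers in a press release) and engineering headcount per quarter.
    release_quarters = ["2024-Q1", "2024-Q1", "2024-Q2", "2024-Q3", "2024-Q3", "2024-Q3"]
    headcount_by_quarter = {"2024-Q1": 20, "2024-Q2": 24, "2024-Q3": 25}

    releases_per_quarter = Counter(release_quarters)

    # Shipped potential: meaningful releases per engineer, plotted quarter by quarter.
    for quarter, headcount in sorted(headcount_by_quarter.items()):
        value = releases_per_quarter.get(quarter, 0) / headcount
        print(f"{quarter}: {value:.3f} meaningful releases per engineer")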

No metric is perfect, and you can quibble with parts of this one.

But on the whole I think the Chart of Shipped Potential does an admirable job of measuring something that matters but which is still within the organization’s control, while also being hard to game and considering organizational constraints.

If you can design metrics that satisfy these criteria you’re doing better than most.

You’ve got the power!

Elmo with too much power. Such power!

We’ve spent a lot of time talking about metrics but haven’t, up to this point, unpacked why they are so essential and powerful.

Most of this series has focused on the shortcomings and dangers of metrics, and so it’s worth reminding ourselves that metrics are in fact hard to do without.

Each of the critiques made in this series can be flipped around to see some way in which metrics provide value, particularly when the size of an organization and its data grows¹:

  • Metrics are conceptual compression. They allow us to quickly get a rough understanding of a quantity or complexity of information that we can’t examine in detail in any reasonable timeframe. This is particularly helpful when communicating to audiences that are not expert in the domain or data set, as we’ll see in an example below.
  • Metrics are incentives. They allow us to set measurable, concrete targets to motivate performance. This gets increasingly difficult to do in a qualitative way as the size of an organization grows, particularly if you want to avoid introducing bias from having many different people set the incentives in different ways for different teams.
  • Metrics are persuasion. They can add a sense of specificity and objectivity to arguments, making them more convincing and verifiable when done well.

We often take these benefits for granted without considering that they are double-edged swords.

The Chart of Shipped Potential is a good example of a metric that allows you to harness these benefits while steering clear of many of their associated downsides.

If you’re the leader of an engineering organization, you need a way to quickly convey to non-technical stakeholders how the productivity of your team is trending over time, perhaps in leadership or board meetings.

Your audience likely won’t have the time or technical literacy to indulge in a detailed examination of all of the processes and tooling you’ve decided to invest in and why they’ll accelerate development velocity in the long term.

In order to communicate effectively in those settings you need to be able to quickly demonstrate that acceleration visually and with concrete numbers.

This is the benefit of conceptual compression at work.

The Chart of Shipped Potential also works well as a metric for incentivization and persuasion.

It acts as an interface, a common language, between the engineering organization and other departments where the value of investment in long-term acceleration is shown front and center.

Because of this you can have quite objective conversations with non-technical stakeholders about how prioritization and investment decisions impact the rate of releases shown in the chart, and you can reward initiatives that positively affect that rate.

Justified skepticism

A while back I wrote about skepticism and cynicism, my general argument being that while skepticism is useful as a default intellectual stance it’s easy to take it too far.

It’s easy for your skepticism to degrade into cynicism.

As I hope the Chart of Shipped Potential shows, cynicism is not warranted when we talk about metrics. They can be employed effectively, and in many cases the sheer number of cases to consider necessitates that you do so: you won’t be able to look at every case qualitatively, you need the conceptual compression of metrics to help you understand things at a high level.

But I also hope that this series in general has encouraged a healthy sense of skepticism when you hear someone rattle off statistics in support of their argument.

The skeptic demands evidence to move them from their default stance of doubt, and you should demand not only metrics but a convincing demonstration that those metrics are not so deeply flawed as to be misleading or counterproductive.

In particular I want to convince you that metrics are actually not most problematic when the person wielding them is poorly educated or unintelligent. When that’s the case the argument will be too crude to be convincing to the intelligent observer.

They are most dangerous when the person wielding them is well educated and intelligent but careless in how they’ve constructed the metric, and there is no shortage of real world examples demonstrating this fact.

Because I studied economics in undergrad, one of my favorite examples is that of Reinhart and Rogoff.

Reinhart and Rogoff were two well-respected Harvard economists who published an influential paper showing a negative correlation between a nation’s public debt and the growth of its economy.

The paper received a lot of attention from the press and from economic and political leaders around the world, influencing policy in countries like Greece where it was seen as evidence of the need to reduce public debt in order to facilitate economic growth.

The only problem was that Reinhart and Rogoff had an error in their Excel formula. They weren’t counting all of the relevant rows, and if they had, the data would’ve pointed them to the opposite conclusion of the one they drew in their paper.

Sadly this type of basic error is not so uncommon in science.

The anthropologist Richard McElreath puts it well when he says some science is conducted like amateur software development, where there is a lack of programmatic testing to ensure the calculations are sound and where the data and code are not easily auditable by reviewers.
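
As a hypothetical sketch of what even lightweight “programmatic testing” can look like in an analysis script: the data and names below are invented, but the idea is simply to assert that your calculation consumed every row it was supposed to.

    EXPECTED_COUNTRIES = {"A", "B", "C", "D", "E"}  # the sample the analysis claims to cover

    def average_growth(rows):
        """Average growth across (country, growth) rows, refusing to run on partial data."""
        missing = EXPECTED_COUNTRIES - {country for country, _ in rows}
        assert not missing, f"rows are missing countries: {missing}"
        return sum(growth for _, growth in rows) / len(rows)

    # Hypothetical data; imagine the last row being silently dropped by a bad spreadsheet range.
    rows = [("A", 2.0), ("B", -0.1), ("C", 1.5), ("D", 3.0), ("E", 0.5)]
    print(f"average growth: {average_growth(rows):.2f}%")

    # average_growth(rows[:-1]) would raise an AssertionError instead of quietly
    # returning a misleading number; a spreadsheet range error gives you no such warning.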

A more somber example is that of Sally Clark². Sally Clark was convicted of murder after both of her infant sons died suddenly at home. The prosecutor’s argument was a statistical one: that two such deaths would be a “1 in 73 million” occurrence.

Later examination of that claim led to Clark’s exoneration.

The odds of one child dying from Sudden Infant Death Syndrome in those circumstances were estimated at about 1 in 8,543. So the prosecution naively multiplied the two probabilities together (1/8,543 times 1/8,543, the probability of one child dying from SIDS AND the probability of a second child dying from SIDS) to arrive at roughly 1 in 73 million.

That assumes the deaths were independent, that there were no common factors underlying both, such as health hazards in the home or the children’s genes.
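
To see the arithmetic, here is a small sketch. The 1-in-8,543 figure is the one used at trial; the conditional probability below is purely illustrative, chosen only to show how dependence changes the answer.

    p_one_sids = 1 / 8543  # per-child SIDS probability used by the prosecution

    # The prosecution's calculation assumed the two deaths were independent events:
    p_both_naive = p_one_sids * p_one_sids
    print(f"naive (independent) estimate: 1 in {1 / p_both_naive:,.0f}")  # ~1 in 73 million

    # If the deaths share common causes (genetics, environment), the second death is far
    # more likely given the first. This conditional figure is illustrative, not a real estimate:
    p_second_given_first = 1 / 100
    p_both_dependent = p_one_sids * p_second_given_first
    print(f"dependent (illustrative) estimate: 1 in {1 / p_both_dependent:,.0f}")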

Independent and dependent probability are not super advanced concepts in statistics. But like many parts of stats, it’s easy to forget how and when to apply them in real world situations³.

The point is that there is an imbalance between how easy it is to make a quantitative argument, which will be especially persuasive precisely because it is quantitative, and how hard it is to make sure your quantitative argument is actually sound statistically.

It’s not rocket science, but it does take some care.

I hope that by sharing cautionary tales like these this series has helped uncover some of the common hazards and nuances that come along with defining metrics, while also offering some guidance for how to do so effectively.

Happy measuring!


Footnotes

  1. Related to this, there are exciting new possibilities for using an “LLM-as-a-judge” to evaluate qualitative data at scale. See the Forest Friends Zine for a great deep dive on this topic.

  2. It’s worth noting that we should demand a greater level of rigor for statistical arguments being made in criminal trials and public policy treatises than we do for those being made to your boss about how well the engineering team is doing.

  3. There are countless other examples. Look up the failure of Long Term Capital Management, where supposed mathematical geniuses badly underestimated the level of risk being taken by that firm by looking at too short a period of historical data when developing their risk models.