Why you should summarize your data with the geometric mean

As an information designer, I’m charged with summarizing data. But even the simplest of questions, like “How big is a typical case?”, presents a choice about what kind of summary to use. An “average” is supposed to describe something like a typical case, or the “central tendency” of the data. But there are many kinds of averages, as you might know. Here I’ll give a quick overview of two familiar averages, the median and the arithmetic mean, and compare them to a third, the geometric mean, which I think should get a lot more use than it does.

To help illustrate each of the statistics, I’ll use a small example dataset throughout the article.

The median is the middle element of your data, after ordering them from small to large (if there’s an even number of elements, take the arithmetic mean of the two closest to the middle). In our example this is 2.
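The selection rule can be sketched in a few lines of Python. The dataset here is hypothetical: it is a stand-in I chose only so that it has nine elements, four of them 9s, and a median of 2, as described in this article. The function name `middle_value` is likewise just an illustration.

```python
def middle_value(values):
    """Return the median: the middle element of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: there is a single middle element.
        return ordered[mid]
    # Even count: arithmetic mean of the two elements closest to the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical stand-in for the example dataset (an assumption, not the original).
data = [1, 1, 2, 2, 2, 9, 9, 9, 9]

print(middle_value(data))          # 2
print(middle_value([1, 2, 3, 4]))  # 2.5
```

Python’s standard library offers the same calculation as `statistics.median`.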

Medians are useful for dividing the data into two halves, each with the same number of elements, e.g. “big” and “small” bins. The median is really a special case of a quantile: it’s the 50th percentile (50/100), or second quartile (2/4), which means it can easily be paired with other quantiles, as in box-and-whisker plots.

An advantage of medians is that they ignore extreme values. This can be helpful, but in my experience, people want to ignore extremes far too readily: they categorize anything inconvenient to their pet theory as an “outlier” and delete it. But extremes help to tell you what’s possible, and could suggest a very different distribution in your data from what you expect. Be cautious about fooling yourself.

In fact, medians don’t just ignore outliers’ values: they ignore the values of everything except the middle element. Otherwise, only rank order matters. The perils of this can be seen in the example: 44% of the elements are 9s, but that value does not affect the median. The 9s could be anything, even 1 million each, and the median would remain 2.
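A quick check of that claim, as a sketch. The dataset is again a hypothetical stand-in (nine elements, four 9s, median 2), not necessarily the original example data.

```python
from statistics import median

# Hypothetical stand-in for the example dataset (an assumption).
data = [1, 1, 2, 2, 2, 9, 9, 9, 9]

# Replace every 9 with 1 million; rank order is unchanged.
inflated = [1_000_000 if x == 9 else x for x in data]

share_of_nines = data.count(9) / len(data)  # 4 of 9 elements

print(round(share_of_nines * 100))  # 44
print(median(data))                 # 2
print(median(inflated))             # 2
```

Only the value sitting in the middle position is ever read; everything else contributes its rank and nothing more.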

It’s hard to go truly wrong by using medians as your summary statistic: they work on many kinds of data, and are robust with respect to outliers. But because they ignore so much of the data, they don’t work well with small sets of data. And they can’t be used well as part of ratios or many other manipulations, as we’ll see shortly.

The arithmetic mean is what you would usually just call the “mean” or, if you’re sloppy with words, the “average” (as Microsoft Excel does): add everything up and divide by the number of elements. These steps effectively spread the total value you have across all the cases you have, which makes all cases the same. Basically, you are answering the question “What would each case have to be, if all cases were identical?”
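Here is a minimal sketch of that “spread the total evenly” reading. The dataset is a hypothetical stand-in for the example data (an assumption on my part), and the variable names are only illustrative.

```python
from statistics import fmean

# Hypothetical stand-in for the example dataset (an assumption).
data = [1, 1, 2, 2, 2, 9, 9, 9, 9]

total = sum(data)               # add everything up
mean_value = total / len(data)  # divide by the number of elements

# If every case were identical and equal to the mean,
# the overall total would be unchanged.
levelled = [mean_value] * len(data)

print(total)                 # 44
print(round(mean_value, 2))  # 4.89
print(abs(sum(levelled) - total) < 1e-9)  # True
```

The standard library’s `statistics.fmean` gives the same result as the manual calculation.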

You can also look at the mean calculation geometrically: the mean effectively reconfigures separate areas into one big area, and then cuts it into equal parts.

Compared to the median, the mean has a real advantage: it takes account of all values, and is much less likely to jump around if you add in a data point or two (unless they’re extreme).
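To make the “jumping around” concrete, here is one illustration using a hypothetical stand-in dataset (nine elements, four 9s, median 2; an assumption, not the original data). Adding a single, unremarkable 9 moves the median much more than the mean. With other data the median might not move at all, so treat this as a sketch of the risk rather than a general rule.

```python
from statistics import fmean, median

# Hypothetical stand-in for the example dataset (an assumption).
data = [1, 1, 2, 2, 2, 9, 9, 9, 9]
extended = data + [9]  # one more ordinary data point

# With ten elements the median averages the two middle values, 2 and 9.
median_shift = median(extended) - median(data)
mean_shift = fmean(extended) - fmean(data)

print(median(data), median(extended))  # 2 5.5
print(round(fmean(data), 2), round(fmean(extended), 2))  # 4.89 5.3
print(abs(mean_shift) < abs(median_shift))  # True
```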

An important proviso is that the quantities you’re averaging have to be “addable” in some sensible way, given the real-world meaning of your data. Usually, this is no big deal: you can almost always find some valid interpretation of the arithmetic mean. The question is whether it’s the interpretation you want, because alternative statistics can produce other reasonable results at the same time, and you have to decide.

A common sort of math problem tries to lull students into taking an arithmetic mean when they shouldn’t: “You drive at 40 mph to Grandma’s house, and then 60 mph back; what was your average speed?” The naive student writes 50, ignoring the “mph” as much as possible, and treating the values as distances. The trick is that time is hidden in the units, but actually changes between the two legs of the journey.
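Working the problem through total distance and total time shows where the naive answer goes wrong. In this sketch the one-way distance of 120 miles is my own assumption (the problem doesn’t state one), and the function name is illustrative; the result does not depend on the distance chosen.

```python
def average_speed(one_way_miles, speeds_mph):
    """Average speed over equal-length legs, one leg per speed."""
    total_distance = one_way_miles * len(speeds_mph)
    # Each leg takes distance / speed hours; the slower leg takes longer.
    total_time = sum(one_way_miles / s for s in speeds_mph)
    return total_distance / total_time

# Assumed distance to Grandma's house: 120 miles (hypothetical).
# Out: 120 / 40 = 3 hours. Back: 120 / 60 = 2 hours. 240 miles in 5 hours.
print(average_speed(120, [40, 60]))  # 48.0

# The naive arithmetic mean of the two speeds overstates it.
print((40 + 60) / 2)  # 50.0
```

More time is spent driving at 40 mph than at 60 mph, so the slower speed counts for more in the overall average.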
