
May 25 2010

Measures of center, outliers, and averages

Let me take you back to the days when you were an under-21 college student, figuring out who you were and what you wanted to be when you finally grew up. For some of you this may be a lifetime ago, and for others, it may have seemed as if those days happened yesterday (literally, yesterday).

Most college students must take one, if not two, courses in mathematics during their college careers, regardless of their degree program. Most of the time, elementary statistics is the course selected, probably because it’s the easiest math elective for most people. In short, lots of people have an elementary knowledge of statistics. So why are average-oriented metrics put on such a pedestal?

In elementary statistics, you most likely learned about the four measures of center and about outliers. If you don’t remember, that’s OK. It has probably been a long time, or you weren’t a math person and wanted to forget everything you had learned as quickly as possible.

The four measures of center are mean, median, mode, and midrange.

Mean – The mean is what you know as the average. It is calculated by adding up all of the values in a set and dividing that sum by the total number of values in the set. The mean is very sensitive to outliers (more on outliers in a little bit).

Example: The mean of 1, 3, 5, 5, 5, 7, and 29 is about 7.8571.

Median – The median is not the same thing as the mean, even though in popular parlance the two terms are often used interchangeably. The median is the number in the middle of a data set that is sorted from lowest to highest or from highest to lowest. If the set has an even number of values, the median is the mean of the two middle values. The median doesn’t represent a true average, but it is not as greatly affected by outliers as the mean is.

Example: The median of 1, 3, 5, 5, 5, 7, and 29 is 5 (the number in the middle).

Mode – The mode is the number that repeats most often in a data set. It’s seldom used in statistics as a reliable measure of center.

Example: The mode of 1, 3, 5, 5, 5, 7, and 29 is 5 (it repeats 3 times – the other values only appear one time each).

Midrange – The midrange is calculated by adding the highest and lowest values of a data set together, and dividing the sum by 2. The midrange is hardly ever used as a measure of center.

Example: The midrange of 1, 3, 5, 5, 5, 7, and 29 is 15 (29 + 1 = 30; 30 / 2 = 15).
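If you want to check the four examples above yourself, here is a minimal sketch in Python. It leans on the standard library's statistics module for the mean, median, and mode; the midrange function is a small helper I wrote for illustration, not something the library provides.

```python
from statistics import mean, median, mode

data = [1, 3, 5, 5, 5, 7, 29]

def midrange(values):
    """Illustrative helper: halfway point between the lowest and highest values."""
    return (min(values) + max(values)) / 2

print(round(mean(data), 4))  # 7.8571 (sum of 55 divided by 7 values)
print(median(data))          # 5 (the middle value once sorted)
print(mode(data))            # 5 (appears three times)
print(midrange(data))        # 15.0 ((1 + 29) / 2)
```

Same seven numbers, and the answers range from 5 to 15 depending on which measure of center you pick.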

With four different measures of center, I’ve been able to come up with four different correct calculations for an average. Each measure of center has its benefits, and each presents a different sensitivity to outliers. Depending on the data set, a measure of center may lose strength and implied value because of how it is calculated and how it is used.

Outliers – Outliers are numbers in a data set that are either way bigger or way smaller than the other numbers in a data set.

Example: In the 1, 3, 5, 5, 5, 7, and 29 data set, the number 29 is an outlier because of how much greater it is than all of the other numbers in the set. 29 is the only number that doesn’t “fit” in this set.
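How much bigger is "way bigger"? There is no single answer. One common rule of thumb, which I'm assuming here purely for illustration, flags any value that sits more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. The find_outliers function and its k parameter are hypothetical names for this sketch, and other cutoffs are just as defensible.

```python
from statistics import quantiles

def find_outliers(values, k=1.5):
    """Return the values outside the k * IQR fences (k=1.5 is a convention, not a law)."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, using the module's default method
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [1, 3, 5, 5, 5, 7, 29]
print(find_outliers(data))  # [29]
```

For this data set the quartiles are 3 and 7, so the fences land at -3 and 13, and 29 is the only value that falls outside them.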

What is the meaning of all of this?
The meaning of all of this is to take your averages (average order value, average conversion rate, average time on site, and others) with a tiny grain of salt. Use average-oriented metrics cautiously and with skeptical optimism, as the presence of a mere few outliers in your data can distort the figures and not provide a true representation of what is really happening.

Take this extreme example of the revenue of five separate orders placed on a web site:

\$4.94
\$4.39
\$7.01
\$6.33
\$553.93

Your “realistic” average order value here should be \$5.67 (the four “normal” values added up and divided by four). But if we’re looking at a report from a web analytics tool, it would report the average order value as \$115.32. Clearly, there is a massive difference between \$5.67 and \$115.32.
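To see the gap for yourself, here is a quick sketch in Python using the five order values above. The variable names are my own, and the median is included only as one possible outlier-resistant comparison, not as what any particular analytics tool reports.

```python
from statistics import mean, median

orders = [4.94, 4.39, 7.01, 6.33, 553.93]

reported_aov = mean(orders)      # mean of all five orders
typical_aov = mean(orders[:4])   # mean of the four "normal" orders only
median_order = median(orders)    # middle value, barely moved by the big order

print(f"{reported_aov:.2f}")  # 115.32
print(f"{typical_aov:.2f}")   # 5.67
print(f"{median_order:.2f}")  # 6.33
```

One large order is enough to pull the mean to roughly twenty times the typical order, while the median stays close to what most customers actually spent.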

To obtain real insights that will help your web site and your organization, you’ll have to dive much deeper than the averages to extract meaningful information from your data. Know your measures of center and your outliers, so that you can decide if your averages are realistic representations of what’s happening on your web site.

Until next time, I will leave you with one of my favorite all-time quotes, which fits right into this topic. Think about it the next time you’re obsessing over averages:

“A statistician drowned while crossing a river that was on average six inches deep.”

Posted in: Analytics