May 25 2010

Let me take you back to the days when you were an under-21 college student, figuring out who you were and what you wanted to be when you finally grew up. For some of you this may be a lifetime ago, and for others, it may have seemed as if those days happened yesterday (literally, yesterday).

Most college students must take one, if not two, courses in mathematics during their college careers, regardless of their degree program. Most of the time, elementary statistics is the course selected, probably because it’s the easiest math elective for most people. In short, lots of people have an elementary knowledge of statistics. So, why are average-oriented metrics put on such a pedestal?

In elementary statistics, you most likely learned about the four measures of center and about outliers. If you don’t remember, that’s OK. It has probably been a long time, or you weren’t a math person and wanted to forget everything you had learned as quickly as possible.

The four measures of center are mean, median, mode, and midrange.

**Mean –** The mean is what you know as the average. It is calculated by adding up all of the values in a set and dividing the sum by the total number of values in that set. The mean is very sensitive to outliers (more on outliers in a little bit).

Example: The mean of 1, 3, 5, 5, 5, 7, and 29 is about 7.8571.

**Median –** The median is not the same thing as the mean, even though in popular parlance the two terms are often used interchangeably. The median is the number in the middle of a data set that has been sorted from lowest to highest or from highest to lowest (with an even number of values, it is the mean of the two middle numbers). The median isn’t an average in the add-and-divide sense, but it is not affected by outliers nearly as much as the mean is.

Example: The median of 1, 3, 5, 5, 5, 7, and 29 is 5 (the number in the middle).

**Mode –** The mode is the number that repeats most often in a data set. It’s seldom used in statistics as a reliable measure of center.

Example: The mode of 1, 3, 5, 5, 5, 7, and 29 is 5 (it repeats 3 times – the other values only appear one time each).

**Midrange –** The midrange is calculated by adding the highest and lowest values of a data set together, and dividing the sum by 2. The midrange is hardly ever used as a measure of center.

Example: The midrange of 1, 3, 5, 5, 5, 7, and 29 is 15 (29 + 1 = 30; 30 / 2 = 15).

With four different measures of center, I’ve been able to come up with four different correct calculations for an average. Each measure of center has its benefits and a different sensitivity to outliers. Depending on the data set, a measure of center may lose strength and implied value because of how it is calculated and how it is used.
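If you would rather check these numbers than take my word for it, here is a minimal sketch using Python's standard `statistics` module. The `midrange` helper is my own (the module has no built-in for it); everything else is the example data set from above.

```python
from statistics import mean, median, mode

values = [1, 3, 5, 5, 5, 7, 29]


def midrange(data):
    """Halfway point between the smallest and largest values (helper written for this sketch)."""
    return (min(data) + max(data)) / 2


centers = {
    "mean": mean(values),          # sum of the values divided by their count
    "median": median(values),      # middle value of the sorted data
    "mode": mode(values),          # most frequently repeated value
    "midrange": midrange(values),  # (lowest + highest) / 2
}

for name, value in centers.items():
    print(f"{name}: {value:.4f}")
```

The single value of 29 drags the mean and the midrange upward, while the median and the mode stay at 5.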

**Outliers –** Outliers are numbers that are either way bigger or way smaller than the other numbers in a data set.

Example: In the 1, 3, 5, 5, 5, 7, and 29 data set, the number 29 is an outlier because of how much greater it is than all of the other numbers in the set. 29 is the only number that doesn’t “fit” in this set.
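“Doesn’t fit” is a judgment call, so it can help to have a rule of thumb. The sketch below uses one common convention, the 1.5 × IQR rule (flag anything more than 1.5 interquartile ranges outside the quartiles). This is my choice for illustration, not the only definition of an outlier, and the `iqr_outliers` function name and its `k` parameter are made up for this example.

```python
from statistics import quantiles

values = [1, 3, 5, 5, 5, 7, 29]


def iqr_outliers(data, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(data, n=4)  # quartiles, using the module's default method
    spread = q3 - q1                  # interquartile range
    low, high = q1 - k * spread, q3 + k * spread
    return [x for x in data if x < low or x > high]


print(iqr_outliers(values))  # [29]
```

Note that 1, the smallest value, is not flagged: it is low, but not far enough from the rest of the set.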

**What is the meaning of all of this?**

The meaning of all of this is to take your averages (average order value, average conversion rate, average time on site, and others) with a tiny grain of salt. Use average-oriented metrics cautiously and with skeptical optimism, as the presence of a mere few outliers in your data can distort the figures and not provide a true representation of what is really happening.

Take this extreme example of the revenue of five separate orders placed on a web site:

$4.94

$4.39

$7.01

$6.33

$553.93

Your “realistic” average order value here should be $5.67 (the four “normal” values added up and divided by four). But if we’re looking at a report from a web analytics tool, it would report the average order value as $115.32. Clearly, there is a massive difference between $5.67 and $115.32.
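The arithmetic is easy to reproduce. The sketch below uses the five order values above and also prints the median, which is one way (my suggestion, not something a particular analytics tool reports) to get a figure that the one huge order cannot distort.

```python
from statistics import mean, median

orders = [4.94, 4.39, 7.01, 6.33, 553.93]

reported_aov = mean(orders)        # what a plain average over all five orders gives
realistic_aov = mean(orders[:-1])  # the four "normal" orders only
median_order = median(orders)      # middle order once sorted

print(f"Average of all five orders: ${reported_aov:.2f}")   # $115.32
print(f"Average of the four normal: ${realistic_aov:.2f}")  # $5.67
print(f"Median order value:         ${median_order:.2f}")   # $6.33
```

The median lands within a dollar of the “realistic” average without anyone having to decide by hand which orders count as normal.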

To obtain real insights that will help your web site and your organization, you’ll have to dive much deeper than the averages to extract meaningful information and data. Know your measures of center and your outliers, so that you can decide if your averages are realistic representations of what’s happening on your web site.

Until next time, I will leave you with one of my favorite all-time quotes, which fits right into this topic. Think about it the next time you’re obsessing over averages:

*“A statistician drowned while crossing a river that was on average six inches deep.”*

February 4 2010

Let’s do something here on our Analytics & Site Intelligence blog that quite honestly we don’t do enough of: talk about pure statistics! Can you feel the excitement running through your veins? Oh wait, that’s only me.

As the Web Analytics industry becomes more and more mature, the requirement to understand basic statistical concepts becomes greater and greater. Awesome new features, like Google Analytics’ **Intelligence** report section and predictive modeling features from Google Insights for Search, beg the user to dive deep on their web data, segment it, grab insights, make a conclusion, and take meaningful actions.

Sure, you can do all that without knowing a lick about statistics, but chances are very high that you’ll start to get confused, lost, and overwhelmed along the way. Think of statistics like contractors think about a foundation for building a home – we all know what happens without a strong foundation!

Enter “Standard Deviation”, which is quite possibly (next to mean) the most important element in the field of statistics. Standard deviation measures how far the values in a set of data typically fall from the mean (average). Formally, it is the square root of the variance (another stat term!).
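To make that concrete, here is a small sketch that computes the standard deviation step by step, as the square root of the variance, and compares it with Python's built-in `statistics.pstdev`. The four values are made up for illustration and chosen so the numbers come out round.

```python
from math import sqrt
from statistics import pstdev

values = [3.0, 4.0, 3.0, 4.0]  # hypothetical data, not from any real report

mu = sum(values) / len(values)                               # mean
variance = sum((x - mu) ** 2 for x in values) / len(values)  # average squared distance from the mean
std_dev = sqrt(variance)                                     # standard deviation

print(f"mean = {mu}, variance = {variance}, standard deviation = {std_dev}")
print(f"statistics.pstdev agrees: {pstdev(values)}")
```

This treats the values as the whole population. For a sample drawn from a larger population, the usual convention divides by one less than the number of values, which is what `statistics.stdev` does.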

Let’s say that the average football fan watches 3.5 hours of football a week, with a standard deviation of .5 hours (a half-hour). This means that – assuming a normal distribution (a third stats term!!) – most football fans (about 68% of them) will watch anywhere from 3 to 4 hours of football a week. Since the average is 3.5, and the standard deviation is .5, watching 4 hours of football a week is said to be “one standard deviation **above** the mean”. Conversely, watching 3 hours of football is said to be “one standard deviation **below** the mean”.

However, almost all football fans (about 95% of them, assuming a normal distribution) will watch anywhere between 2.5 and 4.5 hours of football a week, which is said to be “within two standard deviations of the mean”. That’s because 2.5 hours and 4.5 hours are each two “.5’s” below or above our mean of 3.5.

In statistics, a data point (like watching 9 hours of football) is generally considered unusual if it is more than **two standard deviations** above or below the mean. Watching 9 hours of football a week is way…WAY beyond *2s* (two standard deviations): it is 11 standard deviations above the mean of 3.5, so it would be considered highly unusual for the average football fan.
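The football numbers can be checked with a few lines of Python (3.8 or later, for `statistics.NormalDist`). This is only a sketch under the same assumption made above, that viewing hours follow a normal distribution. The helper names `z_score` and `share_within` are mine.

```python
from statistics import NormalDist

MEAN_HOURS = 3.5
SD_HOURS = 0.5
viewing = NormalDist(mu=MEAN_HOURS, sigma=SD_HOURS)  # assumed normal distribution


def z_score(hours):
    """How many standard deviations a value sits above (+) or below (-) the mean."""
    return (hours - MEAN_HOURS) / SD_HOURS


def share_within(k):
    """Fraction of fans expected within k standard deviations of the mean."""
    low = MEAN_HOURS - k * SD_HOURS
    high = MEAN_HOURS + k * SD_HOURS
    return viewing.cdf(high) - viewing.cdf(low)


for k in (1, 2):
    print(f"Within {k} standard deviation(s): {share_within(k):.1%}")

for hours in (3, 4, 9):
    z = z_score(hours)
    label = "unusual" if abs(z) > 2 else "typical"
    print(f"{hours} hours -> z = {z:+.1f} ({label})")
```

The first loop prints roughly 68% and 95%, and the second shows 3 and 4 hours at one standard deviation from the mean, with 9 hours flagged as unusual.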

**What does it mean for you (the Web Analyst)?**

Knowing what Standard Deviation is and how it’s used in Web Analytics will help you get an idea of just how important events that happen on your website could be. For example, in the new Intelligence Section in Google Analytics, you may see some alerts for an increase in Revenue from different regions:

On the left-hand side of the image, you’ll notice that revenue from North Carolina for this particular time period came in 111% above the expected revenue. This is definitely significant (check out the significance bar on the right), as it’s about 3 or even 4 standard deviations above the mean! Perhaps your new PPC campaigns that were targeted to North Carolina were successful, and you can now duplicate that success everywhere else! Or maybe your email marketing strategy worked, and North Carolina residents responded so well that you can re-market to them in 1-2 months.

In that same image, the Revenue from the United Kingdom increased by 46%, which is about one or possibly two standard deviations above the mean. It’s not as significant an increase as North Carolina’s, but it is still worthy of your attention. Apply the same negative keywords or the same match types for your other international campaigns as well!

So now that you know what standard deviation is all about, use reports like Google Analytics’ Intelligence section to get a truer, deeper sense of just how significant the trends on your website are, so that you can focus on the ones that really matter. You’ll be a better analyst for it!