Sampling matters to Google because the complex queries of dimensions, metrics, and segments that Google Analytics runs across potentially millions of sessions are very computation-intensive. To throttle the work the service has to do without limiting users' ability to analyze overall trends, Google uses sampling.
Sampling is similar to polling: an algorithm looks at the data and then throws out a portion of it. The interface indicates this with a callout in the upper right-hand corner.
To examine how much data is lost and how that can affect reporting, we compared a standard report (which is never sampled) with a custom segment that matches all pages. Both reports look at the entire audience of the site, but the custom segment gets sampled, so you can see where the data loss occurs.
In this case the sampled segment is based on 69.6% of all sessions, yet the high-level metrics are remarkably similar.
Report Type | Pageviews | Unique Pageviews | Avg. Time on Page | Entrances | Bounce Rate | % Exit | Page Value |
Sampled | 1,219,940 | 975,136 | 0:01:51 | 669,418 | 82.68% | 54.63% | $2.49 |
Unsampled | 1,245,641 | 1,007,906 | 0:01:48 | 678,962 | 82.72% | 54.51% | $2.38 |
Difference | 2.11% | 3.36% | -2.70% | 1.43% | 0.05% | -0.22% | -4.42% |
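The Difference row can be reproduced with a few lines of Python. This is a minimal sketch that assumes each difference is the unsampled value measured relative to the sampled one, which is the convention that matches the figures in the table; the function name `pct_diff` is made up for illustration, and Avg. Time on Page is converted to seconds (0:01:51 = 111, 0:01:48 = 108).

```python
def pct_diff(sampled, unsampled):
    """Percent difference of the unsampled value relative to the sampled value."""
    return (unsampled - sampled) / sampled * 100

# (sampled, unsampled) pairs copied from the table above
rows = {
    "Pageviews": (1_219_940, 1_245_641),
    "Unique Pageviews": (975_136, 1_007_906),
    "Avg. Time on Page (s)": (111, 108),
    "Entrances": (669_418, 678_962),
}

diffs = {name: round(pct_diff(s, u), 2) for name, (s, u) in rows.items()}

for name, diff in diffs.items():
    print(f"{name}: {diff:+.2f}%")
```

The same function applies to the rate columns (Bounce Rate, % Exit, Page Value) if the percentages and dollar amounts are entered as plain numbers.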
The real problems start when the analyst is interested in the row-level data below the summary.
Reports | Rows of Data |
Pages in Report (Sampled) | 105286 |
Pages in Report (Unsampled) | 138352 |
Loss of rows | -23.90% |
The table above shows a 24% loss in row-level data (in this case, page URIs). In other words, almost 1 in 4 pages is missing from the sampled report. Missing a quarter of all pages viewed makes “needle in the haystack” analysis incredibly difficult, because the page you are looking for may not be in the data at all. This problem also affects key performance metrics like goals and the details associated with them, such as Goal Completion Location.
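For reference, the row-loss figure is simple arithmetic on the two row counts from the table. The variable names below are illustrative only.

```python
sampled_rows = 105_286    # Pages in Report (Sampled)
unsampled_rows = 138_352  # Pages in Report (Unsampled)

# Pages that appear in the unsampled report but not in the sampled one
missing_rows = unsampled_rows - sampled_rows

# Share of the full page list that the sampled report drops, in percent
loss_pct = missing_rows / unsampled_rows * 100

print(f"{missing_rows:,} pages missing ({loss_pct:.2f}% of all rows)")
```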
Sampling is activated when the number of sessions in the property over the given date range exceeds 500k. In Google Analytics Premium, sampling is activated at 25 million sessions, counted at the view level. This is important because it allows strategic views to be created that break a website up into large pieces, each of which can stay under the limit and avoid sampling. Sampling occurs in the same way in both the Google Analytics website and the Core Reporting API.
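As a rough illustration of these thresholds, the check below flags whether a session count is above the limit. It is only a sketch: the function and constant names are hypothetical, the numbers are the ones quoted above, and Google's actual sampling behavior may depend on factors not modeled here.

```python
STANDARD_LIMIT = 500_000     # sessions per property, standard Google Analytics
PREMIUM_LIMIT = 25_000_000   # sessions per view, Google Analytics Premium

def sampling_expected(sessions, premium=False):
    """Return True if the session count exceeds the limit quoted in this article."""
    limit = PREMIUM_LIMIT if premium else STANDARD_LIMIT
    return sessions > limit

# Hypothetical Premium site with 40 million sessions in the date range
print(sampling_expected(40_000_000, premium=True))   # one view for the whole site
print(sampling_expected(20_000_000, premium=True))   # one of two views, 20 million each
```

With these made-up numbers, the single view is over the Premium limit, while each of the two smaller views is under it.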
There are two ways to avoid sampling in Google Analytics. The first, which costs nothing, is to pull multiple reports using metrics that are aggregate counts (sessions, bounces, total seconds) and then use a spreadsheet or dashboarding software to calculate performance metrics and rates.
The second way to avoid sampling is to get Google Analytics Premium, which raises the ceiling for sampling and allows users to request unsampled reports.
So remember to check your data sets for sampling, and keep in mind that there are simple workarounds for this difficult problem.
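The recombination step of the free method might look like the sketch below. The two reports and all of their numbers are invented for illustration, as are the field names and the `combine` function; the point is that raw counts are summed first and rates are calculated only from the totals.

```python
# Two hypothetical unsampled reports, each small enough to stay under the limit
reports = [
    {"sessions": 400_000, "bounces": 320_000, "total_seconds": 36_000_000},
    {"sessions": 100_000, "bounces": 50_000, "total_seconds": 15_000_000},
]

def combine(reports):
    """Sum the raw counts, then derive rates from the summed totals."""
    keys = ("sessions", "bounces", "total_seconds")
    totals = {key: sum(report[key] for report in reports) for key in keys}
    totals["bounce_rate"] = totals["bounces"] / totals["sessions"]
    totals["avg_seconds_per_session"] = totals["total_seconds"] / totals["sessions"]
    return totals

combined = combine(reports)

# Averaging the per-report rates instead would ignore how many sessions each report has
naive_bounce_rate = sum(r["bounces"] / r["sessions"] for r in reports) / len(reports)

print(f"Combined bounce rate: {combined['bounce_rate']:.2%}")
print(f"Naive average of rates: {naive_bounce_rate:.2%}")
```

This is why the reports should be pulled with counts rather than precomputed rates: in the made-up data, the combined bounce rate is 74%, while averaging the two per-report rates gives 65%.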