Unraveling Data Driven Attribution Part 1: Survival Analysis Basics

Harrison Mateika - January 17, 2024

Google has recently shifted away from rules-based attribution toward data-driven attribution in both GA4 and Google Ads. The shift is intended to measure more accurately how much credit each channel deserves. This post is the first in a multi-part series that aims to explain, simply and clearly, Google’s approach to data-driven attribution modeling.

Where Google Has Come From: Rules-based Attribution

The problem with rules-based attribution was presentation bias. This led to a tendency to overstate or understate the impact of certain channels or ads relative to others. Furthermore, these models were unable to account for parameters such as device type, screen size, browser type, and time decay, all of which could clearly have an impact on conversions. Despite these downsides, the rules-based attribution models had one particular benefit: they were simple to understand.

Where Google is Heading: Data-driven Attribution Modeling

Google’s data-driven attribution modeling, on the other hand, trades transparency for accuracy. The data-driven models do an excellent job of portioning out credit among different channels, but since the algorithms involved are part of Google’s proprietary technology, we do not fully understand what happens in the backend to assign that credit. However, we do have some hints about what Google is doing in the background, based on its own documentation and on recent research on multi-channel attribution modeling.

What Google Says About Their Data-Driven Attribution Methods

In Google’s data-driven attribution methodology resource paper, they say that to estimate attribution they take into account both converting and non-converting users, as well as the presence of certain channels and their timing, when estimating the probability of a conversion. This probability is apparently estimated using an adaptation of a type of analysis called “survival analysis”. Through this method, they say that they estimate how much each ad adds to the probability of conversion based on its appearance. Furthermore, they admit to adding certain parameters such as time between channels and conversions, device type, and “other query signals”.

Therefore, to best understand Google’s data-driven attribution methods, it is probably best to examine what survival analysis is and what it could contribute to multi-channel attribution.

What is Survival Analysis?

Survival analysis is typically used in clinical or health studies. Put simply, it measures and analyzes the probability of an individual’s survival from the beginning of a study through specific time periods. To be less dreary, though, we will use a different context: factory machines. For example, if 100 machines were working on a factory floor, there would be frequent check-ins to see how the machines were doing. If 2 machines failed within the first 20 days, the survival probability of the machines at 20 days would be 0.9796. If 2 more machines failed by the next check-in on day 40, the survival probability of the machines for 40 days would be 0.9592. Continuing the same calculation through day 140, if the company bought another 100 of the same machines, they could expect about 50.25% of them to survive up to 140 days. An outline of how this process could be laid out can be found in the table below:

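| Time (t) | Machines Failed (f) | Surviving Machines (n) | Probability of Failure (f/n) | Probability of Survival (1-(f/n)) | Cumulative Probability of Survival at End of Period (L) |
| --- | --- | --- | --- | --- | --- |
| 20 | 2 | 98 | 0.0204 | 0.9796 | 0.9796 |
| 40 | 2 | 96 | 0.0208 | 0.9792 | (0.9796 * 0.9792) = 0.9592 |
| 60 | 3 | 93 | 0.0323 | 0.9677 | 0.9282 |
| 80 | 4 | 89 | 0.0449 | 0.9551 | 0.8865 |
| 100 | 8 | 81 | 0.0988 | 0.9012 | 0.7989 |
| 120 | 10 | 71 | 0.1408 | 0.8592 | 0.6864 |
| 140 | 15 | 56 | 0.2679 | 0.7321 | 0.5025 |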

 

When the survival probability is graphed against the check-in times, it falls as time goes on. Together, these points form a graph that is referred to as the Kaplan-Meier survival curve. This curve is used rather than the classic curve so that analysts can more easily define certain points for “events” to occur. In the curve below, you can see that there is a steep drop in the survival probability by day 140.
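The arithmetic behind the table above, and therefore the points on the curve, can be sketched in a few lines of Python. This is only an illustration: the function name `survival_table` and the `textbook` flag are made up for this example. By default it follows the table’s own convention, where the failure probability is f/n and n is the number of machines still running after the check-in. The textbook Kaplan-Meier estimator instead divides by the number of machines at risk just before the check-in, so that variant is included for comparison. The last digit can differ slightly from the table because the table rounds at each step.

```python
# Failures observed at each 20-day check-in, taken from the table above.
check_ins = [20, 40, 60, 80, 100, 120, 140]
failures = [2, 2, 3, 4, 8, 10, 15]
initial_machines = 100


def survival_table(times, failed, start_count, textbook=False):
    """Return rows of (t, f, n, p_fail, p_survive, cumulative_survival).

    textbook=False follows the article's table: n is the number of machines
    still running after the check-in, and the failure probability is f / n.
    textbook=True divides by the machines at risk before the check-in, which
    is the usual Kaplan-Meier convention.
    """
    rows = []
    at_risk = start_count
    cumulative = 1.0
    for t, f in zip(times, failed):
        surviving = at_risk - f
        denominator = at_risk if textbook else surviving
        p_fail = f / denominator
        p_survive = 1.0 - p_fail
        # The cumulative value is the running product of per-period survival.
        cumulative *= p_survive
        rows.append((t, f, surviving, p_fail, p_survive, cumulative))
        at_risk = surviving
    return rows


article_rows = survival_table(check_ins, failures, initial_machines)
textbook_rows = survival_table(check_ins, failures, initial_machines, textbook=True)

for t, f, n, p_fail, p_survive, cumulative in article_rows:
    print(f"{t:>4} {f:>3} {n:>4} {p_fail:.4f} {p_survive:.4f} {cumulative:.4f}")
```

Plotting the last column against the first as a step function gives the survival curve. With no machines leaving the study early, the textbook variant simply reduces to the share of machines still running, which is 56 out of 100 at day 140.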

The Kaplan-Meier curve can also be a great way to represent and understand the differences in survival times between two groups. For example, let’s say there is another group of 100 machines that were bought from a used machinery store. A Kaplan-Meier curve comparing the two groups might look like the one below.

These curves and these two groups can, of course, be compared with statistical analysis to determine whether their differences are statistically significant. However, comparing the two groups with statistical techniques is limited to measuring the differences in the probability of the machines surviving within a given period. This type of analysis does not account for the addition of independent events that can have an instantaneous impact on survival time within certain periods, such as a manufacturer intermittently kicking machinery within the two groups.
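As a rough illustration of what such a comparison could look like, the sketch below implements a log-rank test, one common way to compare two survival curves. The article does not say which test is meant, so the choice of test is an assumption, and the failure counts for the used machines are entirely hypothetical. The function name `log_rank_test` is also made up for this example. It uses the standard formulation, in which the machines at risk are counted just before each check-in.

```python
import math

# New machines: failures per check-in from the table in this article.
new_failures = [2, 2, 3, 4, 8, 10, 15]
# Used machines: HYPOTHETICAL failures per check-in, invented for illustration.
used_failures = [5, 6, 8, 10, 12, 14, 16]


def log_rank_test(failed_a, failed_b, start_a, start_b):
    """Return (chi_square, p_value) for a two-group log-rank test.

    Assumes both groups are checked at the same times and that no machine
    leaves the study for any reason other than failure (no censoring).
    """
    at_risk_a, at_risk_b = start_a, start_b
    observed_minus_expected = 0.0
    variance = 0.0
    for d_a, d_b in zip(failed_a, failed_b):
        n = at_risk_a + at_risk_b
        d = d_a + d_b
        if n > 1 and d > 0:
            # Failures expected in group A if both groups shared one failure rate.
            expected_a = d * at_risk_a / n
            observed_minus_expected += d_a - expected_a
            variance += at_risk_a * at_risk_b * d * (n - d) / (n * n * (n - 1))
        at_risk_a -= d_a
        at_risk_b -= d_b
    chi_square = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


chi_square, p_value = log_rank_test(new_failures, used_failures, 100, 100)
print(f"chi-square = {chi_square:.2f}, p-value = {p_value:.5f}")
```

A p-value below a chosen threshold such as 0.05 would suggest that the two groups really do fail at different rates. As the paragraph above notes, the test says nothing about why.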

Applicability to Attribution Modeling

At first glance, this seems like a strange method for measuring multi-channel attribution. However, it begins to make sense when negative events like deaths or failures are replaced with something more positive, like conversions. For example, if we replace 100 machines with 100 users who landed on a website, the Kaplan-Meier curve would illustrate the share of users who did NOT convert after a period of time. A large drop between two periods would indicate that many of these users converted during that period.
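The same idea can be sketched for users instead of machines. The conversion counts below are hypothetical, and the function names `non_conversion_curve` and `steepest_drop` are made up for this example. The sketch assumes every user is observed for the whole window, in which case the textbook Kaplan-Meier estimate reduces to the share of users who have not yet converted.

```python
# HYPOTHETICAL conversions recorded at each check-in for 100 users who landed
# on a website; days are measured since each user's first visit.
days = [1, 2, 3, 7, 14, 30]
conversions = [3, 5, 12, 4, 2, 1]
users = 100


def non_conversion_curve(periods, converted, start_users):
    """Share of users who have NOT converted by the end of each period."""
    remaining = start_users
    curve = []
    for period, c in zip(periods, converted):
        remaining -= c
        curve.append((period, remaining / start_users))
    return curve


def steepest_drop(curve):
    """Return (period, drop) for the largest fall between consecutive points."""
    previous = 1.0  # Everyone starts out as a non-converter.
    best_period, best_drop = None, 0.0
    for period, share in curve:
        drop = previous - share
        if drop > best_drop:
            best_period, best_drop = period, drop
        previous = share
    return best_period, best_drop


curve = non_conversion_curve(days, conversions, users)
period, drop = steepest_drop(curve)
print(f"Largest drop ends on day {period}: {drop:.0%} of users converted")
```

In this made-up data set, the steepest drop falls on the check-in where the most users converted, which is the kind of moment the curve is meant to surface.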

Using the Kaplan-Meier curve to check for statistically significant differences can also help in understanding how behavior over time differs between two user groups. For example, it could be used to measure the differences in survival probabilities within time periods between Organic and Paid traffic.

While this can result in some interesting analysis, it does not yet complete the puzzle. Survival analysis via the Kaplan-Meier curve by itself would only be good at measuring the probability of conversions within a specific period of time between two groups. This is, as mentioned, because the analysis does not allow additional independent events (or variables) to be introduced to measure their impact on survival time. This means that independent events like previous channels, a change in device type, or a change in browser cannot be introduced. However, survival analysis does have a method that allows the impact of these independent events and variables to be measured. This method uses what are referred to as hazard functions and will be introduced in the next part.

If you want help understanding your attribution models or applying these concepts, please do not hesitate to reach out to us at info@morevisibility.com.
