Many marketers agree that measuring incrementality for any digital marketing campaign is important. However, the question of how trustworthy the results actually are is a difficult one.
There are many sources of potential problems with incrementality measurements. Different biases and measurement artifacts can easily creep in and make results unusable. To make objective budgeting decisions, a marketer needs to be able to verify incremental results to make sure that the measurement is scientifically solid and that the uplift they see is not the result of random chance, a selection bias, or intentional cheating by a vendor partner.
So how can a marketer properly validate incrementality results? We address that issue by providing a set of tests that we recommend every client apply to incremental measurements, whether they run them in-house or receive them from vendors.
In this post, we will cover two analysis areas in detail: statistical significance and group balance.
Judging the statistical significance of results means, simply put, answering the question: “How sure are we that the results didn’t come about by random chance?”
There are different ways of judging the statistical validity of results. We suggest using two relatively simple methods: the frequentist Chi-squared test and Bayesian A/B testing.
The Chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.
For those with a good grounding in statistics, the Chi-squared test will be very easy to apply to any report they encounter. One can use the size of the two groups and their corresponding conversions (or converters) to test whether there is a significant difference between the two at the required significance level (we recommend p < 0.05).
For those with limited statistics knowledge, a simple online Chi-squared calculator does the trick for checking the uplift results at hand (we recommend Evan Miller’s Chi-squared Calculator).
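For readers who prefer code over an online calculator, the same check can be sketched in a few lines of Python. The group sizes and converter counts below are purely illustrative; the p-value for a 2x2 table with one degree of freedom follows from the chi-squared distribution.

```python
import math

def chi_squared_2x2(conv_a, total_a, conv_b, total_b):
    """Pearson chi-squared test on a 2x2 contingency table
    (converters vs. non-converters in two groups).
    Returns the chi-squared statistic and its p-value (1 degree of freedom)."""
    observed = [
        [conv_a, total_a - conv_a],
        [conv_b, total_b - conv_b],
    ]
    row_sums = [sum(row) for row in observed]
    col_sums = [observed[0][0] + observed[1][0], observed[0][1] + observed[1][1]]
    grand_total = total_a + total_b
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom: P(X^2 > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Illustrative numbers: 50,000 users per group, 1,200 vs. 1,000 converters
chi2, p = chi_squared_2x2(1200, 50_000, 1000, 50_000)
print(f"chi2 = {chi2:.2f}, p = {p:.5f}")
if p < 0.05:
    print("Difference is significant at p < 0.05")
```

With these made-up numbers the uplift would count as significant; in practice, plug in the group sizes and converter counts from your own report.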
Answering our initial yes-or-no question about random chance requires a certain amount of data. So let’s ask a more flexible question instead: what is the chance that the test group performs better than the control group? Since this is not a binary question, we can often answer it with less data by deciding what probability is acceptable for our use case. For this purpose, Bayesian A/B testing is the way to go.
We won’t go into the details of the methodology here, but we highly recommend trying it using one of the readily available online tools, like that of Dynamic Yield.
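As a rough sketch of what such tools do under the hood: model each group’s conversion rate with a Beta posterior and estimate the probability that the test group beats the control group by Monte Carlo sampling. All numbers below are illustrative, and the uniform Beta(1, 1) prior is an assumption of this sketch.

```python
import random

def prob_b_beats_a(conv_a, total_a, conv_b, total_b, samples=100_000, seed=42):
    """Estimate P(conversion rate of B > conversion rate of A) by sampling
    from Beta(1 + conversions, 1 + non-conversions) posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(samples):
        rate_a = rng.betavariate(1 + conv_a, 1 + total_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + total_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / samples

# Illustrative numbers: control 1,000/50,000 vs. treatment 1,080/50,000 converters
p = prob_b_beats_a(1000, 50_000, 1080, 50_000)
print(f"P(treatment beats control) = {p:.1%}")
```

If, say, a 95% chance of the treatment outperforming control is good enough for your budgeting decision, you can stop the test as soon as the estimate clears that bar.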
The foundation of all incrementality measurement methodologies is group randomization: assigning users to either the treatment group or the control group. The randomization can be done by the client’s BI team if we use the basic Intent-To-Treat methodology, but it usually happens on the vendor’s advertising platform in more advanced methodologies like PSA or Ghost Ads and Ghost Bids.
In both cases, it can happen that the resulting groups of users – treatment and control – are actually not balanced: one group might contain better-performing users as a result of random chance or some selection bias in the measurement. For example, if your user base has a ‘whale’ distribution of payments (caused by a small number of users who spend a lot of money), a handful of users can heavily skew a group’s statistics, distorting the results in KPIs like mean revenue per user.
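A tiny, entirely made-up illustration of the whale effect: two groups of 100 users with identical behavior except for a single big spender, whose presence alone multiplies the mean revenue per user several times over.

```python
import statistics

# Made-up revenue per user: group A contains one 'whale', group B does not
group_a = [1.0] * 99 + [500.0]   # 99 users spend $1, one spends $500
group_b = [1.0] * 100            # 100 users spend $1 each

print(statistics.mean(group_a))  # 5.99 -- one user shifts the mean ~6x
print(statistics.mean(group_b))  # 1.0
```

Whether the whale lands in treatment or control is pure chance, which is exactly why balance checks like the ones below are needed.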
Another source of invalid results is ‘unfair’ test setups that introduce biases during the measurement. For example, in the PSA/Placebo methodology, the delivery of the two different sets of creatives can be optimized differently, affecting which user sub-groups are exposed to each set and thereby introducing a potential selection bias. The worst case scenario is a vendor who intentionally games the measurement by cheating in the randomization process or the ad delivery.
Luckily, all those problems can be detected and dealt with using some relatively simple tests:
To detect if the random assignment of users to either treatment or control group has produced two statistically equal groups (fair randomization), a common procedure is to do an ‘A/A test’. Before the actual uplift test, the users are split using the randomization method in question but shown either the same creative or no creative at all. In the case of a fair randomization process, the performance of both groups of users should be equal.
While this test is simple to execute, it has a drawback: after the A/A test, we would need to stick to the users chosen for the two groups and not introduce any new users to the test, since the validity of their randomization would not be covered by the previous A/A test. This is not a big problem for one-off incrementality tests, but it is not suitable for continuous incrementality measurement, where new users fall into the target segment every day. If we were to exclude those users from the test, we would lose a lot of scale in the unique users we can target, and the reduced sample size would lead to longer measurements.
Hence, the Remerge incrementality product doesn’t support the A/A test and instead uses a better method: ‘Retro-Analysis’.
A method equivalent to the A/A Test is what we call the ‘Retro-Analysis’: we start the uplift test and run it for a defined period of time required to reach statistical significance. When we analyze the results, we can simply look at the performance KPIs of the two groups during a period of time immediately preceding the test. In an ideal scenario with perfect randomization and big enough sample sizes, we can expect the two groups to have nearly identical performance (e.g. conversion rate, revenues per user, etc.), prior to the test. If that is true, we know that any change in performance we detect during the incrementality test is actually driven by the campaign we run.
The Retro-Analysis is equivalent to the A/A test: both methods compare the performance of two randomized groups before an intervention. The difference is the timing: the A/A test runs before the uplift test, the Retro-Analysis after it. Retro-Analysis has the advantage that we can run dynamic or real-time segmentation and continuously add new users to the test as they fall into the targeted segments (e.g. users who install the app during the test or become inactive for 7 days, etc.). This way we don’t lose scale and can still prove the validity of the randomization process.
To run Retro-Analysis, we need to have data on which users were assigned to which group and when during the test, as well as to have access to data prior to test start. The client can request this user-level data from the vendor after the test and use it to validate the results. This analysis can be easily done by the BI or Data Science team on the client side.
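As a sketch of what that BI-side analysis might look like: given user-level records of group assignment and pre-test activity (the record layout and numbers below are assumptions for illustration, not Remerge’s actual export format), compute per-group KPIs over the pre-test window and compare them.

```python
import statistics

# Each record: (group, pre-test event count, pre-test revenue) per user,
# covering e.g. the 30 days before the uplift test started
users = [
    ("treatment", 3, 4.99),
    ("treatment", 0, 0.0),
    ("control", 2, 4.99),
    ("control", 1, 0.0),
    # ... one row per user in the real export
]

def group_kpis(rows, group):
    """Pre-test KPIs for one group: user count, events per user,
    and revenue per converter (users with revenue > 0)."""
    events = [e for g, e, _ in rows if g == group]
    revenue = [r for g, _, r in rows if g == group]
    converters = [r for r in revenue if r > 0]
    return {
        "users": len(events),
        "events_per_user": statistics.mean(events),
        "revenue_per_converter": statistics.mean(converters) if converters else 0.0,
    }

for group in ("treatment", "control"):
    print(group, group_kpis(users, group))
```

If the two groups’ pre-test KPIs are statistically indistinguishable, the randomization passes the check; a significant pre-test gap means the uplift numbers cannot be trusted.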
Here is an example of a Retro-Analysis looking at event frequencies per user and revenue per converter for the two groups, test and control, in the 30 days prior to the start of the uplift test.
Even though the event counts don’t match precisely in all cases, the Chi-squared test shows no significant difference in the frequencies (p > 0.05).
Furthermore, the distribution of revenue per converter is almost identical for the two groups in this example, which in combination with the equivalent event frequencies above, indicates correctly randomized groups.
A more advanced technique to verify that the control and treatment groups are actually balanced is comparing the distribution of the groups based on some secondary attributes, i.e. age, gender, location, or propensity score. In an ideal situation, the two distributions should look identical. In reality, it would require very large populations or sample sizes to get a perfect match, though it can be a viable exercise to detect any selection biases or imperfect randomization in the measurement.
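One simple, stdlib-only way to quantify how well two such distributions match is the two-sample Kolmogorov-Smirnov statistic: the maximum gap between the groups’ empirical CDFs (0 means identical samples, values near 1 mean completely separated ones). The propensity scores below are made-up illustrative values.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the empirical CDFs of the two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)

    def ecdf(sorted_sample, x):
        # fraction of the sample with values <= x
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    all_values = sorted(set(a + b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in all_values)

# Illustrative: propensity scores of users in the two groups
treatment_scores = [0.12, 0.30, 0.45, 0.52, 0.71]
control_scores = [0.10, 0.33, 0.41, 0.55, 0.69]
print(f"KS distance: {ks_statistic(treatment_scores, control_scores):.2f}")
```

A small KS distance on a large sample is consistent with balanced groups; a large one is a red flag worth investigating before trusting the uplift numbers.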
Image: A Comparison of Unmatched and Matched Characteristics Distributions via Kellogg School of Management
Incrementality test results are a great indication of marketing ROI and can point marketers toward the right business decisions. Analyzing the results can be done in-house or with the help of your retargeting partner, using methods such as the Chi-squared test or Bayesian A/B testing. Equally important for drawing proper conclusions is verifying the objectivity of the results. For this purpose, fair randomization of the groups in ongoing uplift tests can be confirmed using Retro-Analysis and/or distribution matching.