There are a few types of statistical errors:
- Type I - you reject a true Null Hypothesis
- Type II - you fail to reject a false Null Hypothesis
Let’s say you are free of both types of errors. You see a positive lift in your A/B test and correctly reject a false Null Hypothesis. The problem is that the set of possible errors is not limited to the well-known Type I and Type II. You can still draw a wrong conclusion even when you make the right decision based on the overall A/B test results. Why can this happen?
The main metric we check in A/B tests is the ATE (average treatment effect), which measures the difference in means between the treatment and control groups, e.g. the difference in average revenue per user or in conversion rate.
To say whether this effect is distinguishable from zero (or significantly lower or greater), we need statistical hypothesis testing.
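As a minimal sketch, here is how the ATE and a Welch's t-statistic could be computed for per-user revenue. The numbers are simulated (the means, spreads and sample sizes are made up for illustration, not taken from a real experiment):

```python
import math
import random

random.seed(42)

# Hypothetical per-user revenue samples for the control and treatment groups
control = [random.gauss(5.0, 2.0) for _ in range(1000)]
treatment = [random.gauss(5.5, 2.0) for _ in range(1000)]

def ate(t, c):
    """Average treatment effect: the difference in group means."""
    return sum(t) / len(t) - sum(c) / len(c)

def welch_t(a, b):
    """Welch's t-statistic for a two-sample test with unequal variances."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

effect = ate(treatment, control)   # a single summary number
t_stat = welch_t(treatment, control)
```

In practice you would use a library routine (e.g. `scipy.stats.ttest_ind` with `equal_var=False`) rather than hand-rolling the test; the point here is that the whole experiment collapses into one number, `effect`.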
Averaging and aggregation let us end up with a single number, a proxy metric, which is convenient for telling a story, spotting simple patterns, trends and causal relationships, and making a decision. But the ATE filters out valuable information together with the noise. It is a summary metric: it represents only the average difference between groups, and that difference is not constant across users. This can be misleading and even dangerous, because it ignores the distribution and doesn’t tell the whole story hiding behind the averages.
This well-known problem is also called aggregation bias: the conclusion that what is true for the group must be true for a sub-group or an individual. It is often not true. The issue emerges when the effect isn’t evenly distributed across users. In a big and diverse user base, heterogeneity of the treatment effect is to be expected. Standard A/B test methods don’t give you a tool to estimate the effect for each user separately, which makes it hard to check how a new feature affects different groups of users.
Let me give you a simple example of how this shows up in practice. Imagine you run an A/B test and see a positive lift of 10-15% on ARPU. It doesn’t mean that every user has started paying more by that exact amount. Some users may have started to pay much more, while others pay much less. A new feature can have either a positive or a negative effect, and that alone is expected. But it could also be the case that the effect is positive on one source of traffic and negative on another - so the distribution of the effect is skewed toward a specific group of users.
For example, a new feature can be great for your top-paying users but bad for regular payers or non-payers. ARPU will go up by 10-15%, but you risk losing a huge portion of your traffic. That will result in a drop in absolute revenue while average revenue per user stays the same or even goes up. The same can happen with countries, OS versions and anything else. You can hurt the experience of new users while improving it for old ones, or vice versa. Sometimes this is well hidden, and even if you are used to checking a lot of additional things, you can still end up with aggregation bias on some important axis.
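A toy calculation makes this concrete. The segment sizes and revenue numbers below are invented for illustration: big spenders pay more after the change, the long tail pays less, yet overall ARPU still shows a healthy positive lift:

```python
# Hypothetical per-segment outcomes: average revenue per user in each arm.
segments = {
    "whales":     {"n": 100, "control": 50.0, "treatment": 65.0},  # pay more
    "non_payers": {"n": 900, "control": 1.0,  "treatment": 0.5},   # pay less
}

def overall_arpu(arm):
    """ARPU across all users: total revenue divided by total users."""
    users = sum(s["n"] for s in segments.values())
    revenue = sum(s["n"] * s[arm] for s in segments.values())
    return revenue / users

# Overall lift looks great: (100*65 + 900*0.5)/1000 = 6.95
# vs (100*50 + 900*1)/1000 = 5.9, i.e. about +17.8%.
lift = overall_arpu("treatment") / overall_arpu("control") - 1

# But per-segment lift tells the real story: +30% vs -50%.
per_segment = {name: s["treatment"] / s["control"] - 1
               for name, s in segments.items()}
```

The headline +17.8% ARPU lift entirely hides the fact that 90% of the user base is paying half as much as before.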
We tend to underestimate this problem. I used to do that too. But after experiencing it “in the wild” a couple of times, I started doing Heterogeneity Analysis for every A/B test. I check more metrics: metrics by segment, by traffic source, the effect on old and new users, the effect on small and big markets, and so on. Of course, the sample sizes get smaller, which is another problem, but at least you can check for red flags.
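The core of such a check is simple: slice the experiment by segment and compute the effect within each slice. A minimal sketch, assuming per-user records of the form `(segment, arm, value)` (the traffic-source data below is made up):

```python
from collections import defaultdict

def segment_effects(rows):
    """Compute the per-segment ATE from (segment, arm, value) records,
    where arm is either 'control' or 'treatment'."""
    groups = defaultdict(lambda: {"control": [], "treatment": []})
    for segment, arm, value in rows:
        groups[segment][arm].append(value)
    return {
        segment: sum(g["treatment"]) / len(g["treatment"])
                 - sum(g["control"]) / len(g["control"])
        for segment, g in groups.items()
    }

# Toy data: the effect is positive on organic traffic, negative on ads.
rows = [
    ("organic", "control", 4.0), ("organic", "treatment", 5.0),
    ("organic", "control", 6.0), ("organic", "treatment", 7.0),
    ("ads", "control", 5.0), ("ads", "treatment", 3.0),
    ("ads", "control", 7.0), ("ads", "treatment", 5.0),
]
effects = segment_effects(rows)  # {"organic": 1.0, "ads": -2.0}
```

A real analysis would also run a significance test per segment (and correct for multiple comparisons, since the more slices you check, the more false positives you get by chance), but even this raw per-segment table is enough to surface red flags like the sign flip above.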
How to do Heterogeneity Analysis properly is another big topic. If you’re interested, let me know and I will write a tutorial.