Avoid A/B Testing Statistics Mistakes with Max Li

Transcript

Qike (Max) Li 0:00
When we need to run a test to show that the treatment is no worse than control, it’s incorrect to run a regular A/B test. The non-inferiority test is designed for those use cases.

Qike (Max) Li 0:24
Hi everyone, thanks for coming to my presentation. My name is Max, and today we’re going to talk about the non-inferiority test in A/B testing. The title for today is “Absence of evidence is not evidence of absence.” The non-inferiority test is particularly useful when you need to release a neutral feature. A neutral feature is a feature that is expected to have no impact on the business metrics. Before diving in, let me introduce myself a little bit. I’m a statistician by training.

Qike (Max) Li 1:00
After getting my PhD in statistics, I moved to the Bay Area and have worked at different tech companies. Currently, at Wish, I’ve been working in the space of machine learning and experimentation. This non-inferiority A/B test is one of my recent projects. Here’s the agenda for today. First, I’ll cover when the non-inferiority test is useful, then what the non-inferiority test is, followed by how the non-inferiority test is better than a regular A/B test. Lastly, I’ll give a conclusion. So, when is the non-inferiority test useful? I’m not claiming that you should stop running conventional A/B tests altogether and start doing non-inferiority tests.

Qike (Max) Li 1:52
But in some cases, the conventional A/B test can fail, and it makes sense to apply a non-inferiority test. Here are some use cases. For example, a pharmaceutical company has a new generic drug that costs only 10% of the brand-name drug. If the company could prove that the efficacy of this generic drug is not inferior, that would be good news for patients. Another example: an e-commerce company is evaluating a new policy, for example, to take down misleading products.

Qike (Max) Li 2:34
It’s clear that this new policy can boost customer trust. If they could show that this new policy has no negative impact on the business metrics, then it would be a no-brainer to introduce this new policy. The third example: say we have a machine learning model in production which is complicated and computationally expensive to run. On the other hand, we have a simpler model that is easier to maintain and much less expensive to run. If we could show that the simpler model is no worse than the complicated production model, then we would be confident replacing the production model with the simpler model. So does conventional A/B testing work in those cases? Let’s take a look. Let’s devise our decision rule as follows: we run the experiment and conclude there is no difference between treatment and control when the p-value is larger than 5%. It looks like the decision rule makes sense.

Qike (Max) Li 3:58
Because when the p-value is smaller than 5%, we can conclude that there is a difference between treatment and control. But is this in general correct? The short answer is no, because absence of evidence is not evidence of absence. When the p-value is larger than 5%, it simply means that we don’t have enough evidence to make any conclusions. Furthermore, if the decision rule is “p-value larger than 5%,” then how long should we run the experiment? How do we even calculate the sample size? Let’s look at this plot. This is from a simulation study, and because it’s a simulation study, we know that the treatment is actually worse than control. The x-axis is the number of days we run the experiment.

Qike (Max) Li 5:02
This red line indicates a 5% error rate, and the y-axis is the error rate. As you can see here, the shorter we run the experiment, the higher the error rate. If we stop the experiment after running it for one day, the probability of making a mistake is higher than 80%. The mistake here means we would say the treatment is no worse than control, when the treatment is actually worse. So what is wrong? Let’s look at it from a statistical perspective. This slide is a little heavy on theory; don’t worry if you cannot follow it, it shouldn’t affect your understanding of the whole presentation. For typical A/B testing, the null hypothesis is that there is no difference between treatment and control.

Qike (Max) Li 6:19
When we get a p-value smaller than 5%, we can reject the null hypothesis and conclude that there is a difference between treatment and control. But we never accept a null hypothesis, because when the p-value is larger than 5%, it basically means there’s not enough evidence to make any conclusions. Again, absence of evidence is not evidence of absence. Furthermore, there’s a distinction between statistical significance and practical significance.

Qike (Max) Li 6:57
So if there is a teeny-tiny difference between your treatment and control, say the difference is 0.0000001%, and you run your experiment long enough, you will eventually reach statistical significance, meaning the p-value will be smaller than 5%, and you will conclude that there is a negative impact. But do you even care when the negative impact is that small? Probably not. But conventional A/B testing doesn’t have a framework to take this practical significance into account.
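
A minimal sketch of this point, using the standard two-sample z-test sample-size formula (the standard deviation and the effect sizes below are illustrative assumptions, not numbers from the talk): the sample needed to reach statistical significance grows as one over the squared difference, so any nonzero difference, however practically irrelevant, eventually becomes “significant.”

```python
# Sketch: sample size needed to detect a true difference delta at 80% power
# with a two-sample z-test (sigma and the delta values are assumptions).
from scipy.stats import norm

alpha, power, sigma = 0.05, 0.80, 1.0
z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)

for delta in [0.05, 0.01, 0.001, 0.0001]:
    n_per_arm = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    print(f"difference {delta:>7}: ~{n_per_arm:,.0f} users per arm to reach p < 0.05")
```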

Qike (Max) Li 7:45
Now, hopefully I’ve convinced you that for some use cases, when we want to show that the treatment is no worse than control, conventional A/B testing simply doesn’t work, and the non-inferiority test is designed for those use cases. Conceptually, the non-inferiority test is actually pretty simple. We change the null hypothesis to be that the treatment is worse than control. Now, when we have a p-value smaller than 5%, we have enough evidence to reject the null hypothesis and conclude that the treatment is no worse than control by a margin delta. This delta indicates a practically significant difference, say 2%. In essence, the non-inferiority test is a one-sided test with a margin.
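
A minimal sketch of that one-sided test for two means, under the setup just described: the null hypothesis is that the treatment mean is below the control mean by more than the margin. The function name, the 2% margin, and the simulated data are illustrative assumptions, not the Wish AB implementation.

```python
# Sketch of a non-inferiority z-test:
#   H0: mean(treatment) - mean(control) <= -margin
#   H1: mean(treatment) - mean(control) >  -margin
import numpy as np
from scipy.stats import norm

def non_inferiority_pvalue(treatment, control, margin):
    """One-sided p-value; small values mean 'treatment is no worse than control by margin'."""
    t, c = np.asarray(treatment, float), np.asarray(control, float)
    diff = t.mean() - c.mean()
    se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
    return norm.sf((diff + margin) / se)   # shift the observed difference by the margin

rng = np.random.default_rng(0)
control = rng.normal(1.00, 1.0, 100_000)     # simulated control metric
treatment = rng.normal(1.00, 1.0, 100_000)   # treatment is truly no worse
print(non_inferiority_pvalue(treatment, control, margin=0.02))  # typically < 0.05
```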

Qike (Max) Li 8:52
So you may wonder, does this small change in the null hypothesis make a big difference? Let’s look at how the non-inferiority test is better than the regular A/B test for those use cases. Let’s compare the regular A/B test versus the non-inferiority test. The decision rule for a regular A/B test is a p-value larger than 5%, and the decision rule for the non-inferiority test is a p-value smaller than 5%.

Qike (Max) Li 9:29
Please note, these two p-values are calculated in different ways. Let’s compare the non-inferiority test and the regular A/B test in this simulation study. In this simulation study, we know that the treatment is indeed inferior. In this plot, the x-axis indicates the number of days we run the experiment, and the y-axis is the error rate. The orange dots are for the regular A/B test, and the green dots are for the non-inferiority test. The first orange dot indicates that when the treatment is indeed inferior, there is a higher than 80% chance the regular A/B test would incorrectly claim that the treatment is not inferior. The longer we run the experiment, the lower the error rate for the regular A/B test. On the other hand, for the non-inferiority test, the false positive rate is well controlled below 5%, regardless of how long we run the experiment. Again, the regular A/B test’s problem comes from making conclusions when the p-value is larger than 5%: absence of evidence is not evidence of absence.
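
A simulation in this spirit can be sketched as below. It assumes normally distributed metrics, a treatment that is truly 5% worse than control, a 2% non-inferiority margin, and a fixed number of users per day; all of those numbers are illustrative, not the ones used in the talk.

```python
# Sketch: error rate of "p > 0.05 means no difference" (regular rule) versus a
# non-inferiority test, when the treatment is truly worse than control.
import numpy as np
from scipy.stats import norm, ttest_ind

rng = np.random.default_rng(1)
users_per_day, margin, n_sims = 500, 0.02, 500

def non_inferiority_pvalue(t, c, margin):
    diff = t.mean() - c.mean()
    se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
    return norm.sf((diff + margin) / se)

for days in [1, 7, 14, 28]:
    n = users_per_day * days
    regular_errors = ni_errors = 0
    for _ in range(n_sims):
        control = rng.normal(1.00, 1.0, n)
        treatment = rng.normal(0.95, 1.0, n)   # truly inferior treatment
        if ttest_ind(treatment, control).pvalue > 0.05:
            regular_errors += 1                # wrongly called "no difference"
        if non_inferiority_pvalue(treatment, control, margin) < 0.05:
            ni_errors += 1                     # wrongly called "non-inferior"
    print(f"{days:>2} days: regular rule errs {regular_errors / n_sims:.0%}, "
          f"non-inferiority test errs {ni_errors / n_sims:.0%}")
```

Run longer, the regular rule’s error rate falls, while the non-inferiority test keeps its error below 5% throughout, matching the pattern described for the plot.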

Qike (Max) Li 11:00
Let’s compare the two types of tests in another simulation study. In this simulation study, we know that the treatment is indeed not inferior. Again, in this plot, the x-axis indicates the number of days we run the experiment, but the y-axis in this case shows the true positive rate, i.e., the power. For the regular A/B test, the longer we run the experiment, the lower the power. This is because when we run a regular A/B test for a really long time, it tends to reach statistical significance due to a practically non-significant difference.

Qike (Max) Li 11:50
On the other hand, for the non-inferiority test, the longer we run the test, the higher the power, which aligns with our expectation of the power of a statistical test. In conclusion, when we need to run a test to show that the treatment is no worse than control, it’s incorrect to run a regular A/B test; the non-inferiority test is designed for those use cases. Also, we are in the process of open-sourcing our Python library for A/B testing. It’s called Wish AB, and it supports the non-inferiority test.

Qike (Max) Li 12:37
So please stay tuned. Last but not least, I want to thank my colleagues Paolo and Pi for their support along the way, and I also want to thank Tab for providing the use case that motivated this study. I also have some appendix slides that show the differences in decisions between the regular A/B test and the non-inferiority test under different scenarios, and what would happen if you ran a regular A/B test with more sophisticated decision rules. In essence, even with more sophisticated rules, the problems are still there. You can reach me by email or on LinkedIn. If you are interested in the slides, here’s the URL. Thanks for coming, and please let me know if you have any questions.

