A/B Testing Statistics Concepts Experimenters must know with Ronny Kohavi



AI-Generated Transcript

Ronny Kohavi 0:00
So we actually share the data of what happens when we run an experiment for six months. And we showed that the users that stay with us for six months are so different than most users on many, many attributes. So remember, you’re introducing bias here, there are users that you think are new, but they’re really older users. And so if you’re accumulating stuff over a long period, like I’m computing revenue per user, that metric is much less reliable. Or when you run an experiment for a long time,

Rommil Santiago 0:35
Remember to stay tuned for an exclusive Experiment Nation community offer from Ronny for his upcoming course on Maven.

Richard Joe 0:46
Okay, folks, it’s Richard here from the Experiment Nation podcast. And I’ve got a very special guest here. It’s Ronny Kohavi. We’ve been going back and forth multiple times, and he’s managed to give us a little time slot to discuss all things experimentation. He’s got a background launching experiments at Microsoft and Bing. He was the Vice President and Technical Fellow at Airbnb, he was heavily involved in Amazon’s personalization and experimentation, and more recently, he’s a consultant and instructor, just spreading the word as the experimentation evangelist. So welcome to the show, Ronny.

Ronny Kohavi 1:27
Thanks for inviting me. Happy to be here.

Richard Joe 1:29
Very glad to have you here. For those in CRO and experimentation who don’t know you as well, can you just go through some of your background? Like, what’s your formal study? Where did you study? And more importantly, how did you get involved in the experimentation game?

Ronny Kohavi 1:48
Yeah, so I did my PhD at Stanford in machine learning: accuracy estimation, building decision trees, decision graphs, and the idea of assessing accuracy using something called the wrapper model. One of my advisors, Jerry Friedman, is very famous for having written the CART book, Classification and Regression Trees. So he got me into some statistics and, you know, looking at the statistics versus the computer science side. My main introduction to experiments actually started at a small startup that I worked at called Blue Martini Software, where we did some emails and wanted to understand the value of them. There was a sense that the emails that our customers were sending out were generating too much money, in the sense that the system was taking credit for purchases after an email, not necessarily because of the email. So that’s when I first started to run some experiments. And it was kind of interesting that our customers didn’t want to hear about it. They were just happy to attribute a lot of money to the email campaigns. The real epiphany, I would say, happened when I started working at Amazon. As the director of data mining and personalization, I had multiple teams. One of them, of course, was doing the classical personalization. But there was also the homepage optimization and Weblab. Weblab was the experimentation platform at Amazon. And I think, at some point, I looked at the success rate, and it was just humbling. Like I thought, you know, we’re the best at doing all these recommendations. And everybody’s familiar with, you know, the value of them for Amazon: people who bought X bought Y. We implemented this cool idea, people who searched for X bought Y, that improved search dramatically. But when I looked at the results that came from our controlled experiments, or A/B tests, in Weblab, it was like, wow, we were failing more than 50% of the time. Very humbling.
And, you know, nobody at the time was really discussing any of those things. So when I moved to Microsoft and I had an opportunity to start a new team, they were asking me, what do you want to do? I said, why don’t we build an experimentation platform? Nobody’s doing it here. And I have this amazing statistic: over half the ideas that we tried at Amazon failed. And, you know, typical of a lot of people at Microsoft, their response was, we have better program managers here. I mean, people really thought that they were doing an amazing job at planning and releasing, you know, these products every three years. And there was sort of a denial that the success rate could be this low. Well, you know, to cut to the chase, when we built the experimentation platform, across most of Microsoft the failure rate, meaning you start an experiment, you’re trying to move some key metrics, the overall evaluation criterion, and it doesn’t move or even moves negatively, was about two thirds. So 60 to 70% of the time, ideas fail to move the needle. At Bing, which is a much more optimized domain, it was about 85%. So think about this: 85% of the ideas that you try in a controlled experiment, an A/B test, aren’t moving the needle that you want them to move. And at Airbnb, when I was in charge of search and relevance, we were building machine learning models to optimize search, we were launching all these great models, and 92% of the time, the experiments failed to move the metrics that we were trying to improve. So to me, this is one of the most amazing things about trying to experiment: this humbling reality that most of our ideas just aren’t as good as we think. So I’ve taken you through the last few jobs, from Amazon to Microsoft. At Microsoft, I was there for 14 years.
And then at the end of my tenure at Microsoft, the experimentation platform was a company-wide platform used by all the major product groups: Bing, of course, but also MSN, Office, Windows, Xbox, lots of groups using it. We were starting something like over 100 treatments every day. So very, very nice scale. Lots of automated scorecards, insights. Very, very cool.

Richard Joe 6:43
That’s huge. I mean, you’ve gone through like a 20-year span in about five minutes or so of the interview there. Just checking, let’s just go back to Amazon, because it’s one of your early experiences, and look, I see in your bio that was 2003. It wasn’t even called experimentation or CRO back 20 years ago?

Ronny Kohavi 7:11
No, it was not. It was not called CRO, certainly not. I still think CRO is kind of a misnomer. If you think about what people are trying to do, I always emphasize that you have to come up with an OEC, an overall evaluation criterion. And conversion isn’t it. It’s too narrow of an assumption that you just want to optimize for conversion. So I’m not a fan of the name CRO. I think conversion is a piece of the larger equation. But you really want to optimize something longer term: look at, you know, customer success, happiness. So you think about Airbnb, for example. You can say I want to just optimize conversions of a visitor to making a booking. But if you really think about the long term, you want them to book a listing that they’re actually going to rate high at the end. It’s not enough to convert; you want them to convert to something that they’ll be happy with. And that makes the problem more interesting and more challenging. So back to your question: at Amazon, we called it Weblab. It’s sort of a laboratory for the web, to try and experiment. And the team already existed when I joined, so Amazon was running experiments already. We started to scale it and give it more of this scientific bent of, you know, how do we do this correctly? How are we able to increase the trust? I think, you know, one of the things that people don’t realize, and I learned it early on, is that it’s very easy to produce results from an A/B test, but you don’t know whether something is wrong with it. You know, I’m famous for sharing this idea that you should check for a sample ratio mismatch. Well, you know, a sample ratio mismatch means the experiment that you’re designing for 50/50 ends up being off by even a small percentage. When you have a large number of users, that’s very unlikely to happen by chance. And so there’s a bug someplace.
And so I think when we built a system at Microsoft, and we were trying to build a system for, you know, high scale, we built a lot of these trust tests into the platform. Something that didn’t exist in the early days.
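
A minimal sketch of the sample ratio mismatch check described above, assuming a two-variant experiment designed for a 50/50 split. The function names, the user counts, and the 0.001 alert threshold are illustrative assumptions, not details given in the interview. It uses a chi-square goodness-of-fit test with one degree of freedom, computed with the standard library only.

```python
import math

def srm_p_value(control_users, treatment_users, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-way split."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

def has_srm(control_users, treatment_users, threshold=0.001):
    """Flag the experiment when the observed split is very unlikely under the design.

    The threshold is an assumed, deliberately strict cutoff.
    """
    return srm_p_value(control_users, treatment_users) < threshold

# Illustrative counts: a split that looks close to 50/50 but is off by under 1%.
print(has_srm(821_588, 815_482))  # large sample, small imbalance
print(has_srm(50_000, 50_100))    # smaller sample, similar-looking imbalance
```

The point of the example is that the same-looking percentage gap can be harmless on a small sample and a strong signal of a bug on a large one.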

Richard Joe 9:35
When you say trust tests, you mean, adding in monitoring metrics,

Ronny Kohavi 9:41
Trust means that when the system tells you here’s your scorecard, you should be able to believe it. Whereas if the system finds a sample ratio mismatch, it should say don’t trust these results, because there is something wrong in how we ran this experiment. And we found when we built these things, you’ll see in some of the early papers, the whole idea of an A/A test: don’t run an A/B test, run A against A, and see if the system really tells you that you have stat-sig results only 5% of the time. Right? It won’t if you’re making mistakes like peeking, or if you’re misestimating the variance, which is commonly done. In fact, we incorrectly computed the variance of many ratio metrics when I worked at Amazon, and we didn’t know it. Later on, we understood through A/A tests that things are more complicated when you have a ratio: you have to use the delta method or bootstrapping in order to assess the variance. And that, to me, is the key difference between an initial system that works okay most of the time and something that you can trust at high scale, something you can let people who are not necessarily statistics experts actually use, giving them a scorecard at the end that they should trust if it passed all these trust tests.
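
A rough sketch of the delta method mentioned above for a ratio metric such as clicks per page view, where the experiment randomizes users but the metric divides one per-user total by another. The function name and the sample data are hypothetical. The approximation used is Var(mean(X) / mean(Y)) ≈ (Var(X) - 2r·Cov(X, Y) + r²·Var(Y)) / (n · mean(Y)²), with r = mean(X) / mean(Y).

```python
from statistics import mean

def delta_method_ratio_variance(numerators, denominators):
    """Approximate variance of sum(numerators) / sum(denominators) across users."""
    n = len(numerators)
    mx, my = mean(numerators), mean(denominators)
    var_x = sum((x - mx) ** 2 for x in numerators) / (n - 1)
    var_y = sum((y - my) ** 2 for y in denominators) / (n - 1)
    cov_xy = sum((x - mx) * (y - my)
                 for x, y in zip(numerators, denominators)) / (n - 1)
    ratio = mx / my
    return (var_x - 2 * ratio * cov_xy + ratio ** 2 * var_y) / (n * my ** 2)

# Hypothetical per-user totals: clicks and page views for four users.
clicks = [0, 1, 2, 1]
page_views = [1, 2, 4, 3]
print(delta_method_ratio_variance(clicks, page_views))
```

Treating every page view as an independent observation would ignore the correlation between page views from the same user, which is the kind of variance misestimate an A/A test can expose.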

Richard Joe 11:06
So the scorecard was an automated sort of number or ranking that it would spit out at the end of the test, and the person at the end would still have to do further checks and balances using either a data scientist or

Ronny Kohavi 11:23
Not. So the idea is that you don’t need a data scientist if you are using the standard experimentation platform as it’s designed. If you’re running a simple, you know, A/B/C test, we do the tests for you. We check for SRM, we do the A/A tests, we look at your metrics and, you know, run hundreds of A/A tests to see that the p-values are uniform; that’s a stronger test. If all these tests have passed, then when you get the result, we believe you should trust it. In fact, these tests are better than a data scientist, in some sense, because the data scientist may not realize that there are these flaws in the experiment. We put as many tests as we could into the system in an automated fashion. I think that made a huge difference.
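
The A/A idea above can be sketched as a small simulation, under assumed numbers: 400 simulated A/A tests, 500 users per arm, and a 10% conversion rate, none of which come from the interview. Both arms are drawn from the same distribution, so a sound analysis should report p < 0.05 in roughly 5% of the runs. The two-proportion z-test here stands in for whatever test a real platform uses.

```python
import random
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test p-value for a difference in conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_aa_tests(num_tests=400, users_per_arm=500, rate=0.10, seed=7):
    """Run A against A repeatedly; both arms share the same true conversion rate."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(num_tests):
        a = sum(rng.random() < rate for _ in range(users_per_arm))
        b = sum(rng.random() < rate for _ in range(users_per_arm))
        p_values.append(two_proportion_p_value(a, users_per_arm, b, users_per_arm))
    return p_values

p_values = simulate_aa_tests()
false_positive_rate = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"share of A/A tests with p < 0.05: {false_positive_rate:.3f}")
```

A share far above 5% would point to a problem such as peeking or an underestimated variance. Checking that the whole p-value distribution looks uniform, for example with a histogram, is the stronger version of the same test.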

Rommil Santiago 12:12
This is Rommil Santiago from Experiment Nation. Every week we share interviews with and conference sessions by our favorite conversion rate optimizers from around the world. So if you liked this video, smash that like button and consider subscribing. It helps us a bunch. Now back to the episode.

Richard Joe 12:25
At Amazon, you know, this is like 20 years ago. So was it like you had the experimentation team and they were doing all the testing? Or had you democratized it at that stage, to a point where someone in product or marketing could go and have an idea and run a test, and they didn’t have to have that technical experience?

Ronny Kohavi 12:51
The goal was to make it self-service. At Amazon, we never got there when I was there. Remember, this was 2003. We were just starting with a small team. We were using Weblab as a way to test our personalization, as a way to build some products. But it wasn’t a platform in the sense of what we built later on at Microsoft. At Microsoft, we really built a self-service platform that could be used by multiple teams, with a high level of trust. At Amazon, you still needed to have people look at the results and validate things. And I think we were naive in many ways, not realizing that some of our computations were incorrect.

Richard Joe 13:35
And just to ask about some of these errors that you looked at in hindsight. I’m not trying to put you on the spot, but I’m sure we’re all guilty of this: you know, you’re confident about a test being a winning test. You’ve done your checks and balances, you productionize it, then like six months or 12 months later, you’re like, oh, that was wrong. Our assumptions were wrong. And we’ve already put it out into the open. Do we just leave it out there as maybe a false positive? Or is it just sort of ignored? Or do you reverse the

Ronny Kohavi 14:18
Realize that when you build a new test, let’s say you build a sample ratio mismatch test, you can go back and look at all the prior experiments and see which ones have violated sample ratio mismatch. Yeah, and you can tell the owners of the experiments, hey, the following three experiments are invalid. And you can try to correct for the sample ratio mismatch. Sometimes you can remove a bug or fix it and reanalyze. But if the experiment turns out to be wrong, absolutely rerun it, and we’ve done a lot of that. Even for a feature that seemed useful, two years later, in the new context, sometimes removing the feature is actually a very good thing. I’ll give you a famous example. I remember Jeff, Jeff Bezos, at the time was a big fan of these bottom-of-the-page deals on Amazon. I mean, this was like, oh, you know, cheap razors. The funny statement was, the prices are so low that they fall to the bottom of the page. He loved that. He was telling the board, he was interviewed by some, you know, newspapers and periodicals about how he loved this feature. Well, at some point, you know, with me running data mining and personalization, we had some timing issues. We thought this was slowing the page too much. We removed it as a test. And wow, it was much better to remove that feature.

Richard Joe 15:59
And how did you present this to the CEO of Amazon at the time? Did you have to kind of massage his ego a little bit and just sort of, you know, I

Ronny Kohavi 16:09
think, you know, it depends on how borderline the result is. We always laughed that to the result of an experiment, you add something we call the J factor, which is how much Jeff liked it. If it was sort of flat, maybe slightly trending, but Jeff really liked it, we might launch. But in this case, it was very negative. And kudos to Jeff: when you showed him real data, he went with it. And so we removed that feature. You can’t see that feature anymore.

Richard Joe 16:42
Just to go back to your statement about CRO having a very sort of, you know, I mean, like narrow,

Ronny Kohavi 16:50
narrow, narrow niche. Yes, for our audiences,

Richard Joe 16:53
let’s just say I’m running a SaaS company. And my primary metric, my OEC, was number of accounts opened, and let’s say it got to 95 or 99% stat sig after six weeks of running. Cool, we launch it, but we find that maybe the lifetime of the customer decreases from 12 months to six months. Would you say that in that case, that is an example of our OEC not really being aligned with the overall business objective? It’s aligned with the CRO objective, where we found that a lot of people opened up accounts, but in reality, yeah, they did, but they’re canceling after six months.

Ronny Kohavi 17:45
I mean, we’ll take a simpler example, which is you’re selling products, you make some change, you’re increasing your conversion rate, but the average order size decreases, right? So you managed to improve conversion, but not the metric that you really care about. You went a step further to lifetime value, but even in the short term, it’s possible for conversion to increase while the average order value decreases. And the question is, is revenue higher? Right. So I think that’s a better metric. Now the problem with revenue is that it’s what’s called a high-variance metric: you may not be able to detect statistically significant changes in revenue, which is why conversion actually has some good properties, right? It’s a Boolean, did the user convert or not, so it tends to have low variance. So what I like to do is break revenue down into two components. One is the conversion rate, or what we call the revenue indicator, which is 0/1: was the revenue positive or zero? And the other one is conditional revenue, which is, for the people that purchased, what was the revenue, right, the average order value for the people that actually purchased. It turns out that if you build these two metrics, their product is revenue, which is kind of cool. So when you run a test, you might say I’m going to try and improve conversion rate, but I want to make sure that the average order value, this conditional revenue metric, is a guardrail. It cannot move down materially.
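
The decomposition described above can be sketched in a few lines. The per-user revenue lists below are made-up numbers for illustration, and the function name is hypothetical. Revenue per user equals the conversion rate (the 0/1 revenue indicator averaged over users) multiplied by the conditional revenue (average revenue among purchasers).

```python
def revenue_components(revenue_by_user):
    """Return (conversion_rate, conditional_revenue, revenue_per_user)."""
    users = len(revenue_by_user)
    purchasers = [r for r in revenue_by_user if r > 0]
    conversion_rate = len(purchasers) / users
    conditional_revenue = sum(purchasers) / len(purchasers) if purchasers else 0.0
    revenue_per_user = sum(revenue_by_user) / users
    return conversion_rate, conditional_revenue, revenue_per_user

# Hypothetical data: the treatment converts more users, but with smaller orders.
control = [0, 0, 0, 40.0, 60.0, 0, 0, 0, 0, 0]
treatment = [0, 20.0, 0, 30.0, 25.0, 0, 0, 0, 0, 0]

for name, data in (("control", control), ("treatment", treatment)):
    rate, conditional, per_user = revenue_components(data)
    print(name, rate, conditional, per_user, rate * conditional)
```

In this made-up data, conversion goes up while revenue per user goes down, which is exactly the case a conditional-revenue guardrail is meant to catch.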

Richard Joe 19:22
And just going back to what I said about monitoring metrics, which I believe are guardrails as well in your context. Let’s just say our OEC was correct for this product page. We drove a lot of revenue and conversions during this test, but we found that we inadvertently cannibalized our other products. Would you always consider, you know, in a case like this with a complementary product, would you always add these guardrail metrics or monitoring metrics to account for that? Like, Oh, absolutely,

Ronny Kohavi 20:07
I think I’m gonna say in this case, your OEC is wrong. Right, your OEC is too local. You said you’re checking the OEC for a product; your OEC should be revenue for the company rather than for a single product. There’s an example that I show in class, one that somebody presented at a conference where, you know, there’s this page with the hero image, and there’s three slots below it. And they took the middle slot, moved it to the left, and increased clicks on that widget materially. You know, the slide says they increased clicks by 109%. So that’s a terrible OEC. Because what she did is she moved something from the middle of the page to the left, and our scan pattern usually goes from left to right. And you expect that. So the question should be, what happened to the widget that was on the left? You’re likely cannibalizing it, right? And so the OEC should take into account the increase in clicks for your widget, minus the decrease in clicks for this other widget, ideally times their downstream value. Right, maybe your widget actually has a lower value than this thing that was on the left, and this is why it was designed to be on the left. So yeah, I think that coming up with the right overall evaluation criterion is one of the toughest things when you onboard a new team. And then as you learn more, you tend to iterate and revise it as you get more data. So at Bing, we used to have sort of an OEC that was for the year. And then near the beginning of the year, we had a team spun up that would summarize the learnings and suggest changes to the OEC for the next year based on the learnings. And I think that’s a good cadence. You want to keep the OEC fixed for a reasonable amount of time; you don’t want to just, you know, drive in random directions every two months. So you want to have six months to a year of stability, telling the teams what they need to optimize.
And then coming up with that OEC is a tough challenge.
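
As a toy illustration of the widget example above: the click changes and per-click downstream values below are invented for the sketch, and the function name is hypothetical. The idea is to score the change by the value-weighted click change across all affected widgets, not by the moved widget alone.

```python
def net_widget_value(click_deltas, value_per_click):
    """Sum of (change in clicks x downstream value per click) over all widgets."""
    return sum(click_deltas[widget] * value_per_click[widget]
               for widget in click_deltas)

# Hypothetical numbers: the moved widget gains clicks, the displaced one loses them.
click_deltas = {"moved_widget": 1090, "displaced_widget": -800}
value_per_click = {"moved_widget": 0.10, "displaced_widget": 0.25}

local_view = click_deltas["moved_widget"] * value_per_click["moved_widget"]
page_view = net_widget_value(click_deltas, value_per_click)
print(local_view, page_view)
```

With these assumed values the local metric looks like a clear win while the page-level score is negative, because the displaced widget was worth more per click.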

Richard Joe 22:29
It’s something we’re gonna have to do work on personally, without going into too many details. But to be honest with you, it sounds like the OEC. It’s an easy thing to make an error on. And it’s something that needs a lot of thought and input from both management, and the experimentation teams and cannot be rushed.

Ronny Kohavi 22:54
Absolutely. I mean, the OEC should reflect the company’s strategy, its mission, what are we trying to do? But that’s the hard thing about it. It’s easy to say, you know, we want to make money in the long term, right? Long-term profit is a typical, you know, simple way for companies to decide what they want to do. How do you translate that into something that you can measure in a short time? You’re running an experiment for two weeks; you need some metrics that are going to be measurable in two weeks and that are predictive of this long-term goal. And so one of the things is to do some experiments that allow you to derive these metrics. I’ll give you one great example from Bing. We, of course, want people to, you know, have more sessions on Bing and not churn. We suspected, this is our, you know, natural model, that as we show more ads, we’re degrading the customer experience, right? Now, this makes sense. Yeah. So, too many ads. But we actually built a model by running an experiment for a long time where for some users, we displayed more ads, and for some users, we displayed fewer ads. And we were able to get clear metrics on the increase in churn rates and, you know, several other metrics that were impacted by the fact that we had more ads on the page. And so we were able to translate our goal of, you know, reduce churn and do other things into a more formal equation. And that led to coming up with a goal for the team, which is: you need to generate higher revenue with a fixed amount of real estate. Right? So we told them, you’re allowed to take this many pixels, on average. So for one query, you can have a lot of ads; for another query, you can have zero ads. On average, this is the amount of vertical real estate that you’re allowed to take. And that was a very important part of coming up with an OEC that’s useful. Just in

Richard Joe 24:57
developing an OEC for big tech, you’ve got a lot of smart people working there, resources. I’m just trying to bring this back to, say, the more medium or small-size website that’s trying to develop their own OEC. I mean, would they follow the exact same principles, knowing that they can’t have all the data points that you would have the luxury of? You know,

Ronny Kohavi 25:20
much of the OEC comes from, you know, the main belief in a causal model of what will drive my long term goals. Right. So, again, I said long term, people say, customer lifetime value. The question is, how do you build that model? Now, whether you’re a small business or a large business, the difference is probably going to be in how you can validate your causal model, can you build a sophisticated one? What are the metrics that you should measure? But I think, you know, in a small business, the owners of the business may have a degree of belief that these metrics are the ones to improve because they will drive long term customer value, right, and business lifetime value.

Richard Joe 26:06
Just another thing to add to that. I know we’ve been very data driven, with numbers and statistics and so forth. But was there ever an added layer of qualitative data where it’s like, hey, we could plaster the screen full of ads, I think you kind of alluded to that, and we can increase the OEC by this amount. But would we?

Ronny Kohavi 26:37
Well, remember, we wouldn’t. It’s unlikely that you would increase the OEC if the OEC is designed right. Right? If you looked at revenue as your OEC, yes, you could improve revenue a lot by plastering the page with ads. But if you understand that too many ads are going to degrade the user experience, then your OEC will be revenue constrained to a certain amount of real estate. That’s a much better OEC.

Richard Joe 27:03
And just to let you know, I don’t get paid any money for this, no affiliates, but he documents it in the book here. I think this is probably one of the better books that I’ve read on CRO. I think it’s the first serious CRO book I read after doing the CXL course. So it really helped me out, you know, when I got into the experimentation game, just to really have a, it’s, by

Ronny Kohavi 27:30
the way, it’s an experimentation book. It’s not a CRO book. It has a large section on the OEC, and it focuses on the fact that this is not just about conversions. I do think that’s a very, very important distinction. I mean, you could say the same about the name A/B tests. We use it; it’s a colloquial name that everybody uses. But I actually tell people, don’t run an A/B test, run an A/B/C test. Throw two treatments out there, because one of them might surprise you in how it’s winning.

Richard Joe 28:04
Good point. We’re not used car salespeople, we’re professionals. Just in regards to that, I did actually have one off the cuff. You mentioned running, you know, A/B/n or A/B/C tests. What are your thoughts on running, say, four variants at the same time, versus just doing a serial A/B test where you run it for a month for one treatment, get the data out of there, make any iterations if need be, then run another test?

Ronny Kohavi 28:33
Yeah, so my thought is, look, there’s two factors. When you run an A/B test, there’s the development time, which is usually high: it takes time for engineers to build a feature and QA it to a level where you can release it to, you know, the production system. And then there’s the runtime of the experiment. I believe that when you look at the power calculations and everything, it’s usually better to run two treatments and a control, so an A/B/C test, or an A/B/C/D test with three treatments. It will lengthen the experiment runtime, because you’re now fragmenting that traffic into not just two, but three or four. But the development time to do another variant is usually much smaller. I don’t know if I should, you know, do three or five? Let’s do both three and five as two different treatments, say in the number of recommendations. Or you have some option that is easy to throw out there as two variants. You may be surprised, and I’ve seen this over and over again. You don’t want to come up with the one best design of the idea to test. You want to ask the designer, hey, why don’t you come up with two designs that from the development perspective aren’t going to be dramatically different? In fact, most of the code is going to be shared in the back end. Yeah, but at least have some variety. And you’ll be surprised at how many times the designer says, this is my number one choice, and just because you asked, this is number two, and number two wins by, you know, a large margin. Right? It happens often. And, you know, that’s one of my messages: it’s very hard to predict correctly whether an idea is going to work. The same thing is true about the design or some parameters that you’re throwing out off the cuff. You know, I want to make three recommendations versus four. Why’d you pick three? Should I put the search box on the left or in the middle?
There’s an amazing example from Yahoo, where they moved the search box, and it was a huge winner. So small changes could make a big difference. And therefore, I usually prefer that people run A/B/C or A/B/C/D. You know, once you get beyond three or four, then there’s a lot of fragmentation. Like, I don’t want you to run eight, because then you will really have to extend the experiment runtime a lot, and you’re better off running, you know, three variants or four variants, one of them being the control, learning from it, and then iterating. So to me the sweet spot is A/B/C or A/B/C/D.
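
To make the runtime trade-off above concrete, here is a rough sketch using a widely used rule of thumb for sample size, n ≈ 16σ²/δ² users per variant (roughly 80% power at a 5% significance level). The baseline conversion rate, target lift, and daily traffic are assumptions chosen for illustration, and the sketch ignores any correction for comparing several treatments against one control.

```python
import math

def users_per_variant(baseline_rate, relative_lift):
    """Rule-of-thumb sample size per variant for a conversion-rate metric."""
    variance = baseline_rate * (1 - baseline_rate)  # Bernoulli variance
    delta = baseline_rate * relative_lift           # absolute effect to detect
    return math.ceil(16 * variance / delta ** 2)

def runtime_days(num_variants, daily_users, baseline_rate, relative_lift):
    """Days needed when daily traffic is split evenly across all variants."""
    needed = num_variants * users_per_variant(baseline_rate, relative_lift)
    return math.ceil(needed / daily_users)

# Assumed scenario: 5% baseline conversion, 10% relative lift, 10,000 users per day.
for variants in (2, 3, 4, 8):
    print(variants, "variants:", runtime_days(variants, 10_000, 0.05, 0.10), "days")
```

Because each variant needs the same number of users, runtime grows roughly in proportion to the number of variants, which is why three or four variants can fit in a couple of weeks in this scenario while eight cannot.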

Richard Joe 31:13
So the lesson is, speed is of the essence, right? It’s like,

Ronny Kohavi 31:20
Yes, you learn a lot by iterating. Right? So you want the experiments to run for, you know, one week or two weeks. That’s the ideal time for a development team. If you look at the models of agile development, the typical cycle is two weeks. Right? And so you want to align with that: every two weeks we get data, we’re able to adjust. Yeah, right. If you have to run the experiment for a couple of months, that’s a long time. You do these as learning experiments to understand metrics. There will be some experiments that will have to run for a long time, say if you’re looking for small effects or you want to check for something critical. But I think in most cases, you know, the ideal time for an experiment, in my opinion, is one to two weeks.

Richard Joe 32:04
Well, that’s not been the case and why?

Ronny Kohavi 32:08
It depends a lot on the traffic. And if you’re running on a small site, you may have to run for four weeks. But again, once you get over a month, you also start to get into issues of, you know, seasonality and cookies. The data starts to get less reliable if you’re running something for multiple months.

Richard Joe 32:28
Yeah, just to get back to you on those two points. Because I’ve talked to people who say, you know what, I don’t know if there’s any hard data on this, but I get the logic: running it for, like, one month max is a heuristic. And then maybe thereafter, if you run it for like 60 days, there’s gonna be some data pollution because of cookie deletion and, like you said, seasonality. Is there any hard data on running it for too long? Yeah, I

Ronny Kohavi 33:04
mean, there is data. And, you know, I can point you to a paper that we wrote back in 2016, called Pitfalls of Long-Term Online Controlled Experiments. So we actually share the data of what happens when we run an experiment for six months. And we showed that the users that stay with us for six months are so different than most users on many, many attributes. So remember, you’re introducing bias here: there are users that you think are new, but they’re really older users. And so if you’re accumulating stuff over a long period, like I’m computing revenue per user, that metric is much less reliable when you run an experiment for a long time. So there’s a paper out there. You can go to exp-platform.com, where we put all the papers from the Microsoft days, and it’s called Pitfalls of Long-Term Online Controlled Experiments. Hard data, lots of information.

Richard Joe 34:03
I’ll jump on there as soon as I can. I’ve been running a few tests, and I won’t go into details, running that test for a little bit longer than I should have.

Ronny Kohavi 34:13
link in the chat. But I think the viewers won’t see the chat, right?

Richard Joe 34:17
Um, they won’t see it. But viewers can just google exp-platform.com pitfalls of long term, and they’ll be able to get it. Cool, thanks. I’ll just say that’s my drive. Um, the other thing is, the reason why I’ve been doing it, and maybe why some other people have been running longer, is just to reach stat sig, because on small sites, like I’m running in Australia, I had to run it for like, you know, six months, you have to do your power

Ronny Kohavi 34:50
calculation and determine for how long you’re gonna have to run it. I’m saying that the sweet spot is, you know, multiple weeks, two, three, four, or you get into trouble afterwards. So you may have to settle for different metrics that are more sensitive. You may have to apply mechanisms like, you know, CUPED, which is a variance reduction technique that we came up with that typically reduces your runtime by a factor of two. So if your experimentation platform supports CUPED, and certain conditions hold, like you have repeat users, then you may be able to cut the runtime by a factor of two. Another thing
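
A minimal sketch of the CUPED adjustment mentioned above, as it is commonly described: each user’s in-experiment metric Y is adjusted with the same metric X measured before the experiment, Y_adj = Y - θ(X - mean(X)), where θ = Cov(X, Y) / Var(X). The synthetic data, the strength of the correlation, and the function name are assumptions for illustration. A real platform would typically estimate θ on data pooled across variants.

```python
import random
from statistics import mean, variance

def cuped_adjust(metric, pre_metric):
    """Return the CUPED-adjusted metric using a pre-experiment covariate."""
    mx, my = mean(pre_metric), mean(metric)
    cov = sum((x - mx) * (y - my)
              for x, y in zip(pre_metric, metric)) / (len(metric) - 1)
    theta = cov / variance(pre_metric)
    return [y - theta * (x - mx) for x, y in zip(pre_metric, metric)]

# Synthetic repeat users: in-experiment behavior correlates with past behavior.
rng = random.Random(11)
pre = [rng.gauss(10, 3) for _ in range(2000)]
post = [0.8 * x + rng.gauss(0, 2) for x in pre]

adjusted = cuped_adjust(post, pre)
print("variance before:", round(variance(post), 2))
print("variance after: ", round(variance(adjusted), 2))
```

The adjustment leaves the mean unchanged and lowers the variance by an amount that depends on how strongly the pre-experiment data predicts the metric, which is why it needs repeat users to help.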

Richard Joe 35:29
is, just for A/B/C/D, sorry, variations. When I was new to experimentation, I did work with this agency where they would launch like three variants plus control, and they would see what happens after like, you know, let’s just say two weeks. And if they saw that, say, two variants were running quite well against the control, they would actually switch off one or two of the other variants just to allow more traffic to go into the winning variants. What are your thoughts on that and on interrupting experiments?

Ronny Kohavi 36:09
So it’s risky. It can be done, but you have to do it carefully. This is done in multi-armed bandit scenarios, where you start with a lot of treatments and you prune them down. In the context of classical controlled experiments, if you want to kill an arm, say, remove a treatment, you have to be careful about what you do with those users. For example, you have to allocate them in equal amounts to the control and treatment. I’ve seen people do that incorrectly. You also have to realize that those users are now contaminated: they have seen one variant, and now you’re flipping a different variant on them. So from a raw statistical perspective, that’s a no-no. But could it be done? Yes, we do it; it depends how visible it is to the user. If the feature is a back-end feature and they’re unlikely to notice, then the contamination level is going to be low, and I’ll say, sure, do it. So everything has to be done in the context of these concerns: Are you doing it right? Are you splitting the users in the right proportions? If you have an ABCD test and you kill D, make sure you split D’s users into A, B, and C in the same ratios. And when you do the analysis, I would segment them out to check that they don’t look very different. That’s where we talk about sanity and trust. One of the things I would do is say, okay, now we’ve increased the power, we have more users, but let me validate that these users don’t look very different when I look at that segment separately. That’s an easy test to do. And you might find that they look different enough that you’re actually going to remove them from the analysis, even though you then have slightly less power.
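The reallocation described here can be sketched as follows. This is only an illustration under stated assumptions: the hashing scheme, the salt, the function names, and the `reassigned` flag are all hypothetical, not how any particular platform does it. The two ideas it shows are (1) users from the killed arm are split across the remaining arms in the same ratios those arms already had, and (2) they are tagged so they can be analyzed as a separate segment.

```python
import hashlib


def stable_uniform(user_id: str, salt: str) -> float:
    """Map a user to a stable pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest[:13], 16) / 16 ** 13  # 52 bits, exact as a float


def reassign(user_id: str, remaining_weights: dict, salt: str = "realloc-v1") -> str:
    """Pick a surviving arm for a user from the killed arm.

    `remaining_weights` holds the ORIGINAL traffic weights of the arms that
    stay (e.g. {"A": 25, "B": 25, "C": 25}); they are renormalized so the
    relative ratios between surviving arms are preserved.
    """
    total = sum(remaining_weights.values())
    u = stable_uniform(user_id, salt)
    cumulative = 0.0
    for arm, weight in remaining_weights.items():
        cumulative += weight / total
        if u < cumulative:
            return arm
    return arm  # guard against floating-point rounding on the last boundary


# Hypothetical users who were in the killed arm D.
killed_arm_users = [f"user-{i}" for i in range(9)]
assignments = [
    {"user_id": uid,
     "arm": reassign(uid, {"A": 25, "B": 25, "C": 25}),
     "reassigned": True}  # flag so this segment can be checked or excluded
    for uid in killed_arm_users
]
for row in assignments:
    print(row)
```

Keeping the `reassigned` flag makes the sanity check cheap: compute the key metrics with and without the flagged users and see whether that segment looks materially different before deciding to keep it.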

Richard Joe 38:00
That’s a good point. I know you’ve mentioned this on previous podcasts, but I think it’s good for our listeners to know, to beat the drum: Twyman’s law. It’s something I’ve been working on myself. Anything that looks too good is too good to be true. And I think that applies to both experimentation and life.

Ronny Kohavi 38:17
Well, I’m a fan of this. And again, if you look at Twyman’s law, there’s a deck on Twyman’s law on the exp-platform.com site, where I give a lot of examples. It has saved us from making a lot of incorrect conclusions. We’ve had experiments where it was, you know, "this is going to make us $100 million." Yeah, sorry, guys. There are experiments like that; I open the book, and chapter one in my class, with an example that made us $100 million. But it is rare, and it was validated and repeated; we tried to do everything we could to validate that the result was real. Most of the time, when you get a result that looks too good to be true, there’s something wrong. So before you celebrate, make sure to spend more time: look at it from all the angles, do segmentation, run tests, look at whether some metrics could have been logged incorrectly. I’ll give you a funny example. There was one experiment where somebody logged revenue twice in a certain scenario, so it looked like, my God, we have 30% more revenue in this variant. Yeah, those things don’t happen. You don’t get a change that makes a 30% increase in revenue unless you’re a really, really tiny website that has never been reviewed. Once you get to a reasonable size, if you’re able to get a few percentage points of improvement in your conversion or revenue or things like that, that’s a huge win. When you get the 30%, call Twyman’s law.
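One kind of check that would catch the double-logging example is a duplicate-event scan run before any revenue metric is trusted. The sketch below is a hypothetical illustration: the event shape `(user_id, order_id, revenue)` and the assumption that an `order_id` identifies exactly one purchase are mine, not details from the conversation.

```python
from collections import Counter


def find_double_logged(events):
    """Return {(user_id, order_id): count} for purchases logged more than once.

    `events` is a list of (user_id, order_id, revenue) tuples; we assume each
    order_id should appear exactly once per user.
    """
    counts = Counter((user, order) for user, order, _ in events)
    return {key: n for key, n in counts.items() if n > 1}


def revenue_inflation(events):
    """Relative amount by which duplicates inflate total revenue."""
    raw_total = sum(revenue for _, _, revenue in events)
    deduped_total = sum({(user, order): revenue
                         for user, order, revenue in events}.values())
    return raw_total / deduped_total - 1.0


# Hypothetical treatment-arm log where one order was written twice.
treatment_events = [
    ("u1", "o1", 100.0),
    ("u1", "o1", 100.0),  # duplicate log line
    ("u2", "o2", 50.0),
]
print(find_double_logged(treatment_events))
print(f"revenue inflated by {revenue_inflation(treatment_events):.0%}")
```

Running the same check on every arm, and alerting when the duplicate rate differs between arms, turns a one-off investigation into a permanent guardrail.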

Richard Joe 39:55
Yeah, no way.

Ronny Kohavi 39:59
You’re saying this here, but I’ve seen people sell, you know, "come use my agency and I will increase your conversion by 10x."

Richard Joe 40:07
Maybe if you’re a high growth startup, maybe

Ronny Kohavi 40:12
It could be, in very early stages. But see, at those stages, if you’re so bad and your conversion rate is so low, just use best practices. You don’t need to experiment; just look at why your conversion rate is low. Look, most sites have conversion rates in a wide range, I’ll say about 1% to 5%, on something reasonable. So if you’re below 1%, think best practices. But if you’re in the range of, you know, I’m at 2%, I’m at 3%, nobody’s going to 10x that.

Richard Joe 40:45
I think a learning lesson is: don’t take things at face value. Kind of put your BS detector on. And yeah, look at the numbers, maybe get someone else on your team to analyze them. Did you have an internal process for doing this, say, for whoever was involved in a test?

Ronny Kohavi 41:15
Ideally, the system does that. It’s the most objective, and it’s easy to add more tests. So ideally, the more tests you can think of to validate the results, the more you add. I’ll give you an example: we had a problem with the reliability of some cookies in certain browsers. If you used an older version of Internet Explorer, version 7, there was something that, if you did it wrong, would make the browser lose your cookies. So we built a test for that. Once we spent the time to isolate the issue, that test existed, and if anybody did something that would trigger this bug, we would alert on it. So you build it once, and you don’t have to worry about it anymore. By the way, you mentioned your BS detector. Somebody just asked me to name my five favorite books in relation to data mining, and one of them that I put down is Calling Bullshit: The Art of Skepticism in a Data-Driven World. Really, really fun book. Lots of good lessons. I wish they’d used the term Twyman’s law, but lots of good lessons there. So if you look at my LinkedIn, I posted this just a couple of days ago.

Richard Joe 42:33
Yeah. So listeners, just look at Ronny’s LinkedIn. And I think it was five books you put up there.

Ronny Kohavi 42:40
Yeah, five books. Shepherd.com is the small startup that does these, you know, "what are your best five books" lists. Really fun site.

Richard Joe 42:51
Another thing, and this will help listeners of our show who have a smaller site, like the one I work on: what are your general thoughts on A/B tests on low-traffic websites? I think you commented on John London’s post or something like that. I can’t remember the details.

Ronny Kohavi 43:13
People can look at LinkedIn for my thoughts on that. You can do certain things when you have a low amount of traffic. You can use lower-variance metrics. You can incorporate CUPED, which is a variance reduction mechanism; as soon as you implement it you typically gain, so you can run the experiment for a shorter time. And you can find surrogate metrics. But at some point, there’s no magic. In order for the theorems to work out, you need a certain number of users. So to me, if you have fewer than, and I’m just throwing out a number, 50,000 users, then for most websites, just do best practices; you don’t have enough data to be able to test most ideas. Once you get to 100,000 to 200,000 users, then you’re good to go, then you can run useful experiments. This is where all the statistics really work out nicely.
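For readers unfamiliar with CUPED (Controlled-experiment Using Pre-Experiment Data), here is a minimal sketch of the adjustment on simulated data. The simulation parameters are assumptions chosen so that the pre-experiment metric explains about half of the variance, which is what a "factor of two" saving corresponds to; real gains depend on how well pre-period behavior predicts the metric, and this is not production code.

```python
import random
from statistics import fmean, variance


def cuped_adjust(y, x):
    """Return CUPED-adjusted values of metric `y` using pre-period covariate `x`.

    y_adj = y - theta * (x - mean(x)),  theta = cov(x, y) / var(x).
    The adjustment leaves the mean of y unchanged but removes the part of its
    variance that is explained by x. In a real experiment theta is usually
    estimated on control and treatment users pooled together.
    """
    x_bar, y_bar = fmean(x), fmean(y)
    cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [b - theta * (a - x_bar) for a, b in zip(x, y)]


# Simulated repeat users: in-experiment metric = pre-period metric + noise.
rng = random.Random(0)
pre = [rng.gauss(10, 2) for _ in range(20_000)]      # pre-experiment metric
during = [p + rng.gauss(0, 2) for p in pre]          # in-experiment metric

adjusted = cuped_adjust(during, pre)
print("variance before:", round(variance(during), 2))
print("variance after: ", round(variance(adjusted), 2))
print("ratio:          ", round(variance(adjusted) / variance(during), 2))
```

The technique only helps for users who have pre-experiment data, which is why repeat users are a precondition; for brand-new users there is nothing to adjust with.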

Richard Joe 44:15
Like I was saying, would you say that for a website that’s just starting out, or maybe one that’s only got 50,000 visitors, you would also implement more qualitative-type testing, like remote user testing?

Ronny Kohavi 44:33
Of those methods, nothing is going to be as sensitive as controlled experiments. We know scientifically that this is the most sensitive mechanism. So if a controlled experiment can’t detect a 10% change, no qualitative method is going to give you that. What you get from qualitative methods are interesting insights, where you might say, this is a good idea to change, or this is a best practice, I will do it. But I’m not going to be able to reliably run the A/B test that will show me that I had a 5% or 6% improvement, or even a 10% one. Now, it depends on the key metrics that you care about; you may just not be able to detect changes in them. Once you get above 50,000, 100,000, 200,000 users, again depending on the metric and what you’re looking for, things start to work in your favor, and then you have the ability to detect small differences.

Richard Joe 45:27
Okay, I’m getting what you’re saying. I mean, numbers don’t lie at the end of the day. You’ve got the gold standard, which is the online A/B test.

Ronny Kohavi 45:38
I mean, there’s a formula, right? Just plug it in. And you may say, you know, I’m just trying to go for home runs. But if you’re trying to go for a home run of improving something by 10%, those are rare. It’s really rare. You should be able to look at some best practices out there and say, let me just copy from Amazon, because they ran all the A/B tests, and I’m just going to use whatever they’re doing, because they’ve tested it and it’s likely to work similarly for me. Where you start to diverge is when you have lots of users; then you can test it on your own and see if you want to diverge from the best practice.
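The formula is not spelled out in the conversation; a common rule of thumb for 80% power at a two-tailed 5% significance level is the one below, where $n$ is users per variant, $\sigma^2$ the metric variance, and $\delta$ the absolute difference to detect. The worked numbers assume a conversion metric with a 5% baseline, which is an illustrative choice.

```latex
\begin{align*}
n &\approx \frac{16\,\sigma^2}{\delta^2}
  \qquad\Longleftrightarrow\qquad
  \delta \approx \frac{4\,\sigma}{\sqrt{n}} \\[6pt]
\text{Baseline } p = 0.05:\quad
  \sigma^2 &= p(1-p) = 0.0475, \qquad \sigma \approx 0.218 \\[6pt]
n = 25{,}000 \text{ per variant (50{,}000 users)}:\quad
  \delta &\approx \frac{4 \times 0.218}{\sqrt{25{,}000}} \approx 0.0055
  \;\;(\approx 11\% \text{ relative}) \\[6pt]
n = 100{,}000 \text{ per variant (200{,}000 users)}:\quad
  \delta &\approx \frac{4 \times 0.218}{\sqrt{100{,}000}} \approx 0.0028
  \;\;(\approx 5.5\% \text{ relative})
\end{align*}
```

Under these assumptions, a site with 50,000 users can only reliably detect lifts of roughly 11% or more, while 200,000 users brings the detectable lift down to about 5.5%, which is consistent with the thresholds discussed above.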

Richard Joe 46:19
Okay, so for those listeners who are launching an e-commerce website on Shopify, I’m guessing the learning lesson is that Amazon has already worked through the errors over years of experimentation. So just follow what they do on the website, and

Ronny Kohavi 46:34
it’s a good basis to start. That’s a good basis to start.

Richard Joe 46:38
Yeah. Um, just in regard to that, again for low-traffic websites: are you a 95% stats guy?

Ronny Kohavi 46:49
Oh, I am. And I will also say this: I’m a two-tailed 95% confidence guy. You recently interviewed Georgi; he’s a fan of doing one-tailed tests. If you look at some of the recent papers, like Intuition Busters from last year, people think that if you’ve got a 95% confidence level, there’s only a 5% chance you launched in error. That’s BS; it’s not. You have to look at the false positive risk, for which you have to apply some prior probability. Again, if you’re interested, look at the Intuition Busters paper from last year. But that’s why I’m a fan of really doing the 95% two-tailed, so that the positive tail is only two and a half percent. You really want to make sure that you can learn from it and believe that this is a success.

Richard Joe 47:47
I’m a 90% to 95% single-tail guy, out of convenience.

Ronny Kohavi 47:54
So you’re going to have something like 40% errors. If you work at a company, I mean, really look at that FPR table. If you’re using p-value thresholds of, you know, 0.1, especially if you’re doing it one-tailed, the probability that a stat-sig result is false ends up being something like 40%. That’s huge, right? Almost flipping a coin.
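The false positive risk (FPR) argument is an application of Bayes' rule: among stat-sig positive results, what fraction come from ideas with no real effect? Here is a minimal sketch of that calculation. The prior success rates are illustrative assumptions (the right value depends on the organization), and the function is my own simplified rendering rather than a copy of any published table.

```python
def false_positive_risk(alpha_positive_tail, power, prior_success_rate):
    """P(no real effect | stat-sig positive result), via Bayes' rule.

    alpha_positive_tail: chance a null idea shows a stat-sig POSITIVE result
        (0.025 for a two-tailed 0.05 test, 0.10 for a one-tailed 0.10 test).
    power: chance a truly positive idea is detected.
    prior_success_rate: assumed fraction of ideas that truly improve the metric.
    """
    false_positives = alpha_positive_tail * (1 - prior_success_rate)
    true_positives = power * prior_success_rate
    return false_positives / (false_positives + true_positives)


# Compare thresholds for a few ASSUMED prior success rates, at 80% power.
for prior in (0.08, 0.15, 0.33):
    two_tailed_95 = false_positive_risk(0.025, 0.80, prior)
    one_tailed_90 = false_positive_risk(0.10, 0.80, prior)
    print(f"prior success {prior:.0%}: "
          f"two-tailed 95% -> FPR {two_tailed_95:.0%}, "
          f"one-tailed 90% -> FPR {one_tailed_90:.0%}")
```

The lower the share of ideas that truly work, the more a loose threshold hurts: moving from a 2.5% positive tail to a 10% one multiplies the number of false winners by four while the number of true winners stays about the same.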

Richard Joe 48:20
In regard to that, I know we’ve got to wrap up now, but okay, say I’m a 90% guy. I’m willing to take on a little bit more risk, which I’m fine with. I know that the production optimization of this test is not so huge, and I’m happy to accept a few false positives, just out of practicality.

Ronny Kohavi 48:48
Are you willing to accept 40% of them being false? That, to me, is the key point: make sure you understand the statistics and this notion of false positive risk. Now, it’s possible that if you’re doing something like optimizing headlines, who cares; you can make errors 50% of the time and you’ll still be good. But if you’re trying to learn from the experiments, and when you have a success you want to share it with the org and say, look, this is a feature that we launched and it’s really useful, then I believe you should use tighter confidence intervals or lower p-value thresholds. Because the learning, to me, is what changes the direction for the organization. So again, it depends what you’re using it for. If you’re running small experiments just to optimize some parameters, yeah, you can absolutely increase your risk and use higher alpha values. But if you’re trying to teach the org and use this loop of I have an idea, I’m trying it, I’m going to pivot and learn from it, and then change my roadmap based on the result, then I think you need to have higher confidence in the results. Therefore, as a default, I recommend at least 95% two-tailed, which effectively means 97.5% one-tailed.

Richard Joe 50:11
Okay, that might be a stretch for some people. It’s tough. Wow. Thanks a lot for being on the podcast. We’ve got to wrap it up soon. But for listeners of this podcast, I’d like to let you know that Ronny is hosting his live A/B testing course on the 12th of June. Is that correct? Can you give us some details about that?

Ronny Kohavi 50:37
Yep. So it’s a live class: five sessions, two hours each. It’s a forcing function: if you want to attend live, you’re going to force yourself to spend those five two-hour sessions. If you miss one, that’s okay; they’re recorded. But the goal is to ask questions. It’s a very interactive session, with lots of polls. It’s fun. And so yeah, the next class starts on June 12, on Maven.

Richard Joe 51:05
Can you tell listeners, just in case they’re listening on Spotify, what the code is?

Ronny Kohavi 51:10
Yes. Oh, I made a complicated code. So the code is EXP...

Richard Joe 51:14
Yes.

Ronny Kohavi 51:15
...NAT for Experiment Nation, then PODC for podcast, and then AB. So EXPNATPODCAB. Okay, well, that will give you $500 off, if you were staying for the whole 50 minutes with us.

Richard Joe 51:36
Awesome. And yeah, that’s a huge discount. Thank you very much, Ronny, for your generosity. Just one thing, in case people have been sleeping under rocks or don’t know who you are: how can they reach out to you? How can they learn more about you or contact you if need be?

Ronny Kohavi 52:00
LinkedIn is the best way to reach me.

Richard Joe 52:04
Awesome. Cheers. Okay, cool. Bye bye.


Rommil Santiago