Statistical concepts all CROs should know, featuring Ishan Goel

AI-Generated Summary

In this insightful video, Ishan Goel, transitioning from a software engineer to a researcher, delves into crucial CRO statistical concepts. He sheds light on statistical significance in experimentation, explores A/B testing intricacies, emphasizes the efficiency of sequential testing, and discusses challenges in stopping tests prematurely. The talk concludes by highlighting the intersection of experimentation and statistics in the realm of Conversion Rate Optimization.



AI-Generated Transcript

Ishan Goel 0:00
So suppose you are starting a test on Friday, and suddenly on the weekend it is performing very well. So all those tests that were started on a Friday, you will let them run. Those that were started on a Monday or Tuesday, which had an initial downfall, you don’t let them run. So there is a bias to that as well. That is why I generally condemn that practice. And the idea which, I think, has become accepted in the industry is: let tests run for two weeks. But I also accept and say that if there are some crucial stakes involved, like, for instance, if you understand why the performance is bad, or you have some crucial guardrail metric being hurt very badly, then you should go ahead and stop them. But it should not be a general practice.

Richard Joe 0:49
G’Day folks, it’s Richard here from the Experiment Nation Podcast. Today I’ve got a guest on the platform, Ishan Goel, who’s an Associate Director of Data Science at VWO. Thanks for being on the podcast, we’re really excited to have you here.

I’ve chatted to a few statistically intelligent, brainy people, way above my pay grade. And every time I interview technically inclined people like yourself, I’ve always got to do a bit of research so I don’t come across as too thick on the air. But anyway, look, it’s good to have you on the air, specifically since, you know, you’re heavily invested in VWO and you’re involved in the statistical engine and so forth. So you’ve got both good theoretical experience and also very good practical experience in statistics, and more in applied statistics for experimentation. So look, before we talk about all things stats, why don’t we just talk a little bit about your background? You’ve got an interesting background. Did you study computer science?

Ishan Goel 2:14
I studied computer science for my bachelor’s. Yeah.

Richard Joe 2:17
Great, great. And what was your attraction to computer science? Like, were you always involved with computers as a child? Did you like programming? Was there anything particular that you...

Ishan Goel 2:30
Right, that’s what we are told to write in our statements of purpose. But what really happens in India is that a lot of people are guided towards the engineering field. So I really did not make a conscious choice before going there. But it somehow turned out that I landed myself in an engineering college, and then I started enjoying computer science and stuff. So it was more reverse engineered that way.

Richard Joe 2:56
Sort of like, I mean, it’s very culturally inclined towards it, right? Like, I mean, I think it’s, not Mumbai, what’s the IT hub called? Yeah, it’s like you’re kind of pumping out computer science grads everywhere, and there’s a lot of development work there and stuff. So cool. You went to study comp sci for a few years, and then you did a master’s at North Carolina State University. What was your motivation to do a master’s and to study overseas at a Western university?

Ishan Goel 3:49
So essentially, when I was trying to enter the data science field in 2016, just after my bachelor’s, I was finding it difficult to really justify science and research-based roles without having done a master’s degree, where you get more exposure to research. Although I was researching on the side, I was finding it difficult to really get into the industry at that time. So my first motivation was to get a master’s, and a US master’s at that time was really valuable for the experience, and they were sort of spearheading the data science field. So that’s the reason I went to the US. And I found it to be a very interesting experience.

Richard Joe 4:31
And I’m sure it kind of opened up your eyes to other experiences, because it looks like you started off as a software engineering graduate, and then you sort of made your way into more, kind of, data science, and then experimentation specifically. Can you elaborate on that, please?

Ishan Goel 4:53
So it’s been an interesting journey. Like, the first year after my B.Tech, I was working as a software engineer. And then, interestingly, I landed myself in statistics research and causal inference. At that time I was working on a problem in causal inference and, somehow, three years down the line, I landed on a very related problem when I joined VWO. So it was just by chance that I landed on the experimentation problem and statistical significance. And ever since I got into the research on that, whatever I’d learned sort of compounded there by chance, and I started enjoying the field a lot. So that’s how I landed in experimentation.

Richard Joe 5:33
I mean, I guess it’s quite common for a lot of people who get into experimentation; I think, you know, they sort of suddenly stumble into it. I mean, obviously, 20 years ago the idea of experimentation as a job was not something that existed at uni. I’m pretty sure that you can’t major in it. But yeah, a lot of the people that I bump into in this world, they just sort of come across it. They might come across it from a background like yourself, or they bump into it as analysts and then they get into the whole A/B testing thing, or they could even come from more of a psychology background. So yeah, everyone’s got different sorts of pathways. And it’s good that you’ve explained that, because there might be someone who wants to be a data scientist and doesn’t know what experimentation is. So maybe if they listen to this podcast, you can inspire them to go down that path. Yeah, so look, let’s talk about some statistical concepts. I think the first one we’re going to talk about, and this is something I’m not really that familiar with myself, is the multiple comparison problem in statistics. Can you explain what that is, in very simple terms, to our audiences, and what’s the significance of understanding this in experimentation?

Ishan Goel 7:21
I’ll start with a very interesting and intuitive story about the multiple comparison problem. It’s sort of a general phenomenon of statistics and a general intuition that works in probabilistic things. So this is an extreme example, and this is from Nassim Nicholas Taleb’s book, The Black Swan. He says: imagine that there are an infinite number of monkeys sitting at an infinite number of typewriters, and everyone is just randomly keying in something. So if the monkeys and the typewriters are infinite, there is definitely a chance that someone will come up and write the Odyssey or the Iliad. One monkey will be there, just randomly keying in, and he’ll present the Odyssey or the Iliad just by chance. So the idea there is that if the pool is big enough, then even the smallest-probability event is likely to happen. And that’s just the fundamental intuition that goes into statistics. So if you are picking out from a random pool, and the pool is very big, you should expect to see more and more uncommon patterns in the data emerge. So that is the fundamental intuition behind this multiple comparison problem.

Richard Joe 8:42
Well, it says here in the definition, in the prep notes, by the way, the definition you’ve given me here is: whenever you have more than one hypothesis being tested in an experiment, the chance of getting a false positive increases in the test. Whenever you’re testing more than one hypothesis, you need to be conservative with the desired statistical significance in the test results.

Ishan Goel 9:10
So when you apply this fundamental intuition to experimentation in general, what happens is that whenever, in an experiment, you are testing more than one hypothesis, that’s one implication of the multiple comparison problem. You can do this in multiple ways: maybe you are testing more than one variation, maybe you are trying to test more than one goal, maybe you are telling yourself, out of these five goals, if any one of them comes out to be a winner, I would deploy it. So essentially, the intuition is that you are increasing the pool among which you are trying to find a pattern. And if the pool is larger, the chance that you just get an erroneous pattern, a false positive, sort of increases.

Richard Joe 9:52
I guess that makes sense, right? I’m just thinking practically. Say I’ve got a control and then I’ve got three variations, so it’s an A/B/n test, effectively running four variations including control. I guess common sense tells me that if I’m running all four in parallel, then there’s a higher chance of getting a false positive with four variations than with, say, only two variations, right?

Ishan Goel 10:19
Yeah, so it’s exactly like that. So if you have, like, N variations, and you are hypothesizing that if any one of these wins, I’ll deploy it, then essentially the chance that any one of them wins by chance, which is the false positive rate, sort of goes on multiplying. So if you have two variations, it goes from 5% to 10%. If you have 20 variations, then it is almost certain that you will get one variation as a winner just by chance. So what I tell people is: if you run an A/A test with 20 variations, 20 A variations, you can be sure that at least one of them will come out to be the winner. So that’s the idea.
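
To make the arithmetic concrete, here is a minimal Python sketch of how the chance of at least one false winner grows with the number of comparisons. It assumes every comparison is independent and tested at the same alpha, which is a simplification (variations that share a control are correlated); the function name `familywise_error_rate` is illustrative and not from the conversation. Under those assumptions, two comparisons give 9.75%, close to the rounded 10% mentioned above, and 20 comparisons give roughly 64%, so a false winner is likely rather than guaranteed.

```python
def familywise_error_rate(alpha: float, num_comparisons: int) -> float:
    """Chance of at least one false positive across independent comparisons.

    Assumption: each comparison is independent and uses the same alpha.
    """
    return 1.0 - (1.0 - alpha) ** num_comparisons


# Print how the chance of a false winner grows as comparisons are added.
for k in (1, 2, 5, 10, 20):
    rate = familywise_error_rate(0.05, k)
    print(f"{k:>2} comparisons: {rate:.3f}")
```

The additive rule of thumb (5% per comparison) is only an upper bound; it tracks the exact value well for a handful of comparisons and overshoots as the count grows.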

Richard Joe 10:57
Quick question. So, from a frequentist perspective, I know you’re a Bayesian fan, but let’s just say for argument’s sake, from a frequentist perspective, does that mean that if you’re running more variations in your A/B/n test, you’d decrease the alpha so that you decrease the likelihood of getting a false positive? Is that correct? So instead of simply 95% stat sig, you’d, in theory, go towards, say, 97% or 99%, something like that? That makes sense, in theory?

Ishan Goel 11:40
That’s exactly what we do. So essentially, the quick heuristic is that you divide the alpha by the number of hypotheses you are testing. So if you’re testing 10 hypotheses, then that 5% becomes 0.5%. So you are trying to go to 99.5%. And that’s the heuristic you need to follow.
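
The divide-alpha-by-the-number-of-hypotheses heuristic is commonly known as the Bonferroni correction. The sketch below is one way to express it; the function name `bonferroni_alpha` is hypothetical, and the follow-up check of the overall error rate assumes the hypotheses are independent.

```python
def bonferroni_alpha(alpha: float, num_hypotheses: int) -> float:
    """Per-hypothesis significance threshold under a Bonferroni correction."""
    if num_hypotheses < 1:
        raise ValueError("num_hypotheses must be at least 1")
    return alpha / num_hypotheses


# The example from the conversation: 10 hypotheses at an overall 5%.
per_test_alpha = bonferroni_alpha(0.05, 10)

# Assuming independence, the chance of any false positive stays below 5%.
overall_rate = 1.0 - (1.0 - per_test_alpha) ** 10

print(f"per-test alpha: {per_test_alpha:.4f}")
print(f"required confidence per test: {1.0 - per_test_alpha:.3f}")
print(f"overall false positive rate: {overall_rate:.4f}")
```

The correction is deliberately conservative: it keeps the overall rate at or below the target whether or not the hypotheses are independent, at the price of needing more data per hypothesis.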

Richard Joe 11:58
Okay. Okay. That’s interesting. Any other points you want to make about that for our audiences?

Ishan Goel 12:06
What I want to say is: just generally start to get comfortable with this intuition. This does not just apply across variations; this applies across the number of goals as well. This applies across sequential testing also, like if you’re seeing the results multiple times. So essentially, with this fundamental intuition of statistics, you should be aware: whenever I’m seeing a result, how big a pool have I picked that result out from? And that should guide how tight you should be with statistical significance. So yeah, that’s the idea of multiple comparisons.

Richard Joe 12:38
Cool. Thanks for that, that’s very enlightening. I think, in relation to that, we’re gonna talk about peeking and why that matters in test results. Do you want to just talk about, you know, what peeking is and why that matters for CROs?

Ishan Goel 13:01
Yeah, so it’s an interesting special case of the multiple comparison problem. It’s basically that when you peek into the test results multiple times throughout the test, you increase the chance of getting a false positive. And I’ll tell you the history behind that, basically how peeking in statistical significance emerged. Essentially, when we were running tests and controlled experiments in the 20th century, what a scientist or a drug trial would generally do is first collect all the samples. So they’ll estimate the sample size first, collect the whole sample, and then calculate the statistical significance only once. So then it’s like a one-shot hypothesis that you have.

Richard Joe 13:45
Basically, we have a fixed sample, and based on that fixed sample, then we do the calculations, right?

Ishan Goel 13:53
You collect the entire sample, and then you do the calculations. Yep. But when we applied that method to A/B testing, it was very easy to just calculate the statistical significance live, on the fly. So that is what has now started becoming known as sequential testing. And it has been differentiated quite a lot from fixed-horizon testing in general understanding. But the interesting part there is that sequential testing allows you to peek into the results, but it needs to apply a correction to the probabilities to adjust for that. So that is the statistically interesting part of it. Yeah, please.

Richard Joe 14:30
And just to clarify for audiences who don’t know: sequential testing, from my understanding, is basically where, as you’re running the test, the statistical algorithm is adjusting so that you can peek during the test. As opposed to the example we gave before with the fixed sample size, where obviously you’re going to get into issues if you’re peeking during the test before you get to your fixed sample size, right? Is that correct? Do you want to explain it a bit more?

Ishan Goel 15:05
So in a fixed-horizon test, essentially, you are required to wait at least till that predetermined sample size is met. But sequential testing allows you to look at the probability to beat the baseline on the fly and close the test as early as you see a winner. So it lets you save on sample size, and it’s obviously much more desirable. And it’s about the stopping. So even in a fixed-horizon test, if you’re looking at the probabilities, just peeking is not the problem. But stopping the test the moment it crosses 95% is not allowed in fixed-horizon testing. In sequential testing, they have made the correction to the probabilities so that you can peek comfortably as many times as you want, and it is taken care of by the system.

Richard Joe 15:51
Only kind of downside, well, it’s probably not a downside at our, you know, level of technology, but doesn’t that require a lot of computational power if you’re doing it on the fly?

Ishan Goel 16:05
No, it is pretty comfortably a formula, basically. In the past three or four years, I’ve seen a lot of solutions for peeking corrections come up. But essentially, they make a compromise in terms of how early you can get a winner. So with peeking corrections, they delay the winner a bit, and the compromise is on that front. But otherwise, it’s not computationally very intensive. It’s just a better algorithm. The trade-offs depend on what choice you are making. Like, there is the always-valid p-value solution, which has a different sort of trade-off. Then there is the alpha-spending solution that is being deployed; it requires some more information from you about the maximum sample size you need. So there are different trade-offs. But overall, it’s a simple computation: once you have worked the equation through and it’s robust, it doesn’t take a lot to calculate.

Richard Joe 17:06
And how does this relate to what we talked about 10 minutes ago, the multiple comparison problem, you know, about increasing the chances of getting a false positive with more variations in a test?

Ishan Goel 17:22
So there, you were making more hypotheses in terms of the different variations. Here, you are making more hypotheses across the test: I’ll check at 100 visitors, 200 visitors, 300 visitors. You are essentially increasing the pool size for finding a winner. So that’s the relation. But it is slightly less stringent than the multiple comparison problem, because in peeking you don’t need to divide alpha by n. So if you are making 10 peeks, you don’t need to directly go to a significance level of 99.5%; that would be too stringent. You can go easier as well and still control the false positive rate. So that’s the slight difference. The solutions to the two problems are different, but the root of the problem is the same.
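
The effect of stopping at the first significant peek can be illustrated with a small simulation. This is a rough sketch under stated assumptions, not any vendor's method: each simulated test has no true effect, the metric is modeled as a stream of standard-normal observations with known variance, a plain z-test at the usual 1.96 threshold is applied after every batch, and the names (`run_aa_test`, `false_positive_rates`) and batch sizes are invented for illustration. Because the looks share data, they are correlated, which is why the inflation is smaller than it would be for the same number of independent comparisons.

```python
import math
import random

Z_CRITICAL = 1.96  # two-sided threshold for alpha = 0.05


def run_aa_test(num_peeks, batch_size, rng):
    """Simulate one test with no true effect, checked after every batch.

    Returns (significant_at_any_peek, significant_at_final_look).
    """
    running_sum = 0.0
    observations = 0
    significant_at_any_peek = False
    z = 0.0
    for _ in range(num_peeks):
        # The sum of `batch_size` standard-normal draws is N(0, batch_size).
        running_sum += rng.gauss(0.0, math.sqrt(batch_size))
        observations += batch_size
        z = running_sum / math.sqrt(observations)
        if abs(z) > Z_CRITICAL:
            significant_at_any_peek = True
    return significant_at_any_peek, abs(z) > Z_CRITICAL


def false_positive_rates(num_tests=5000, num_peeks=10, batch_size=100, seed=0):
    """Compare stop-at-first-significant-peek with a single final look."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(num_tests):
        any_peek, final_look = run_aa_test(num_peeks, batch_size, rng)
        peeking_hits += any_peek
        fixed_hits += final_look
    return peeking_hits / num_tests, fixed_hits / num_tests


peeking_fpr, fixed_fpr = false_positive_rates()
print(f"stop at first significant peek: {peeking_fpr:.3f}")
print(f"single look at the end:         {fixed_fpr:.3f}")
```

The single final look should land near the nominal 5%, while the peeking rule should come out well above it; the exact figures depend on the seed and the number of peeks.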

Richard Joe 18:09
Okay, any other pointers or maybe practical tips you’d like to give our audiences who aren’t very statistically inclined?

Ishan Goel 18:21
I would give this advice: do check whether you want to monitor continuously. So choose one of two ways. Either you run the test and just wait till the predetermined sample size, and even if you have to peek, peek very little and don’t take a decision till you reach that predetermined sample size. Or you apply a peeking correction to your statistics. That way you will not have a lot of false winners in your tests. So that’s the summary.

Richard Joe 18:48
Yeah. And look, obviously you work for a competing company, but, I mean, the platform we use is Optimizely, and they purport to have something built into the algorithm to prevent issues of peeking. Anyway, we won’t get into that topic, since you work for the competition. We want to avoid a potential lawsuit or defamation charge. But yeah, that’s what they purport. So that’s related to that issue of peeking.

Ishan Goel 19:32
Yeah, a lot of vendors have started providing a solution to that. There are different solutions in the market, and they are very interesting in their trade-offs. So I would suggest, I think, that you go case by case and see what solution for peeking is being provided. There is a piece by Spotify, which we can probably share with our listeners, that summarizes all the solutions for peeking and the trade-offs, essentially.

Richard Joe 20:02
Yep. Cool. And I think the other thing we wanted to discuss is false discovery rate. I remember you talking about false discovery rates in the experimentation talk that you gave one or two months ago. Do you want to explain what exactly a false discovery rate is? Because it is different to a false positive rate, though I think it’s somewhat related. Can you explain what that is and why it matters to us as experimenters?

Ishan Goel 20:44
So it’s a very interesting concept, actually, false discovery rates. And yeah, I’m glad you asked that question. So let me start by precisely defining false positive rates, and then explaining how false discovery rates are different. False positive rate is the more commonly known term, and the more commonly used measure of accuracy of an A/B test. What the false positive rate says is: if there is no impact on the target metric, then what is the chance that you get a winner, just by chance, from the A/B test?

Richard Joe 21:21
We’re assuming the null hypothesis, right?

Ishan Goel 21:24
Assuming the null hypothesis, that gives you the false positive rate. So if you are running 100 A/B tests, then you can expect, with a 5% false positive rate, that five of those will come out to be winners. False discovery rate is exactly the opposite, and once you understand it, you’ll start to get why it matters more. The false discovery rate is: if you have gotten a winner, then what was the probability that there was actually no impact on your metric? So the test was actually a bogus test; it was not a winner. If you have gotten a winner, what is the chance that it was a bogus winner?

Richard Joe 22:10
That’s probably a sign that it’s actually going a step further than this whole false positive rate metric. So just to clarify for myself, and maybe for our audiences, it’s asking: if you’ve declared a winner or found a winner, what’s the chance that it would have been false in reality, if you were to actually deploy it? Is that correct?

Ishan Goel 22:31
Yeah, precisely. So an easy way to understand the difference is a cause-and-effect analogy. Essentially, being a winner or not is the cause of the data that you are going to observe, and whether you see a winner in the data or not is the effect. False positive rates condition on the cause and then tell you the probability of the effect. The cause is that it is a bogus test, there is no improvement, and the effect is that you’re seeing a winner. So that’s the false positive rate. And in a very Bayesian way, false discovery rates sort of go the opposite way. They’re saying: if you’ve observed a winner, then what is the chance that it was actually bogus? So it’s sort of the Bayesian cousin of false positive rates.

Richard Joe 23:18
Kind of flipping the logic around, right? Yeah, the direction around.

Ishan Goel 23:24
And that is, interestingly, what matters more, because you will never know the underlying cause. That’s what you’re trying to study; you will only know the effect. And as an experimenter, what should matter to you is that, out of the number of winners I’m getting, no more than X percent should be false.

Richard Joe 23:44
I’m, like, you know, completely new to this stuff. So let’s just sort of put it into practice. Would you have some sort of, I don’t know, percentage, like a stopgap, where you say: okay, I’m running a set of tests, and, you know, I want at least 90% of the winners that I declare to actually be true winners in the real world? Like, how do we sort of put some boundaries around this? Do you know what I mean?

Ishan Goel 24:28
So I think you’re asking a very interesting question, because that brings out the gap between false discovery rate and false positive rate. So the first thing to understand is that false positive rates can be directly controlled. The alpha that you’re choosing in your test, that’s the false positive rate, and you can directly choose it; you can set it to 1%, 2%, 3%. But false discovery rates are harder. Why? Because along with the false positive rate, false discovery rates have this added component: how many good ideas are you actually running? So if you run 100 tests, but 99 of them were bad ideas that will not lead to a winner, and only one of them was a good idea, then your false discovery rate will be very high, because you’re getting almost five winners from those 99, and this one good idea is already a winner. So you can follow, right? You got six winners, but five of those were actually bogus. So the false discovery rate was around 80%.
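
The 100-test example can be checked with a short calculation. The sketch below uses the standard relationship between false discovery rate, the share of good ideas, alpha, and power; the function name `false_discovery_rate` is illustrative, and setting power to 1.0 is an assumption that mirrors the example's premise that the one good idea is certain to win (real tests are usually powered lower, which would push the rate higher).

```python
def false_discovery_rate(prior_good: float, alpha: float, power: float) -> float:
    """Share of declared winners that are expected to be bogus.

    prior_good: fraction of tested ideas that truly work
    alpha:      false positive rate of each test
    power:      chance that a truly good idea is declared a winner
    """
    false_winners = (1.0 - prior_good) * alpha
    true_winners = prior_good * power
    return false_winners / (false_winners + true_winners)


# 1 good idea out of 100, alpha of 5%, and (by assumption) power of 1.0.
fdr = false_discovery_rate(prior_good=0.01, alpha=0.05, power=1.0)
print(f"expected false discovery rate: {fdr:.3f}")
```

With these inputs the expected bogus winners number 4.95 against one true winner, which gives a rate of about 83%, in line with the rough figure quoted in the conversation.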

Richard Joe 25:32
Wow. Yeah. Yeah, I’m getting it. Yeah, yeah.

Ishan Goel 25:36
So essentially, this mix of good ideas and bad ideas matters. And Ronny Kohavi also published a very interesting table in his recent paper on intuition busters. It is very striking: it shows the chance of getting a winner and the false discovery rate that they are getting. So with an 8% success rate at Airbnb, the false discovery rate was around 35% or something of that sort. So what you need... Yeah, do you want to ask a question?

Richard Joe 26:05
No, I’m just sort of thinking, like, I hope I have no bosses listening. Who knows, I mean, hand on heart, how many of the tests that we declared were true and correct winners? Of the ones that we actually deployed, what was the false discovery rate? I’m just having a bit of a moment there. Hopefully they’re not listening to this podcast, but you know what I mean, just being honest with you. Like, let’s just say in my career I’ve pushed out, I don’t know, 20 winners. And then let’s just say 80% of them were bogus. That means that 16 of them were duds, and four of them moved the needle.

Ishan Goel 27:06
What I can assure you is that that problem is something that scientists have been going through for the past 200 years: something got published, and 20 years later it was realized that it was a false discovery. So you are in safe hands, that is all.

Richard Joe 27:21
Hopefully, it’s all hidden in the noise and stuff. But I mean, this is a new concept to me. Why is this not being talked about much in the CRO world? Because I’ve only heard you, and maybe Ronny, talk about it. I didn’t really understand it when I first heard about it. But this sounds like a true north star kind of metric that we should be taking more seriously. Why is this not being talked about more often?

Ishan Goel 27:50
So I’ll tell you the reason, and my hypothesis on why it is not being talked about more. The missing piece is the prior. If you get this prior estimate of how many of your ideas are winning ideas, then it is very easy to control false discovery rates. Why? If you have an accurate estimate of the prior, then you can back-calculate the required false positive rate. So suppose an oracle comes and tells you that 20% of your ideas are good and 80% of them are bad. Then you can simply use Bayes’ rule to calculate that, okay, I want a false discovery rate of 5%, so the false positive rate should be this stringent. And you can deploy that in your tests; it’s not very difficult. The difficult part is to get an estimate of the prior, because then all sorts of debates come up. You have no instrument to know accurately how many of your tests are winning; you can only guess it. And secondly, it leads to the Bayesian debate that that number can never be proved wrong. There can be no evidence against it; it’s an unfalsifiable number. So a lot of things go on there. Basically, the difficulty is in estimating this prior.
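
The back-calculation described here can be sketched by rearranging Bayes' rule for alpha. Everything beyond the 20% prior and the 5% target is an assumption made for illustration: the function name `required_alpha` is hypothetical, and the 80% power is a conventional default that the conversation does not specify.

```python
def required_alpha(prior_good: float, target_fdr: float, power: float = 0.8) -> float:
    """False positive rate needed to hit a target false discovery rate.

    Rearranged from:
        FDR = (1 - prior) * alpha / ((1 - prior) * alpha + prior * power)
    """
    if not (0.0 < prior_good < 1.0 and 0.0 < target_fdr < 1.0 and 0.0 < power <= 1.0):
        raise ValueError("prior_good and target_fdr must be in (0, 1), power in (0, 1]")
    alpha = target_fdr * prior_good * power / ((1.0 - target_fdr) * (1.0 - prior_good))
    return min(alpha, 1.0)  # a rate cannot exceed 1


# The oracle example: 20% good ideas, desired false discovery rate of 5%.
alpha = required_alpha(prior_good=0.2, target_fdr=0.05, power=0.8)
print(f"required false positive rate: {alpha:.4f}")
```

Under these assumptions the required alpha comes out at roughly 1%, noticeably stricter than the usual 5%, and it tightens further as the guessed share of good ideas shrinks.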

Richard Joe 28:55
In reality, though, how would you know that? Like, you know, this whole false discovery thing assumes that you know what percentage of the tests that you’ve pushed out into the real world, that you’ve productionized, are actually winning tests. Like, how the hell would you know that stuff? It’s really hard to measure.

Ishan Goel 29:16
Very hard to measure. And that’s the problem. And that’s the problem, in my hypothesis, even between the Bayesian and frequentist worlds. But essentially, what can be done is that some good heuristics and some good guesses can be made to cut down your false discovery rates very significantly. For example, if an organization is starting new, they should expect to generally have more good ideas, because a lot of improvements can be made. And as an experimentation effort becomes more mature, they should start to become more conservative with false positive rates, because now they know that good ideas are harder and harder to come by. So it can only be guessed at. And there can be industry standards, or maybe some page-level ones. Like, Ronny tells us that the cart abandonment problem is very hard; to reduce cart abandonment, it’s very difficult to generate a good idea. So you need to go by some heuristics, and with that you can control false discovery rates to quite an extent. But it’s very hard to control that exactly.

Richard Joe 30:16
Do any platforms out there, like, you know, VWO, AB Tasty or whatever, do any of them include any of this?

Ishan Goel 30:31
Essentially, we at VWO are working on it. We essentially want to give a plugin where, if you can guess your prior, you will be told a back-calculated false positive rate value. You just need to guess your prior and fill that in, along with the desired false discovery rate. The other place where I’ve seen this concept of false discovery rate is that Optimizely is controlling FDR, but only within a test, for multiple variations. So essentially, the FDR problem also happens when you are running multiple variations in a test. There is the Benjamini-Hochberg procedure, which needs to know in advance how many hypotheses you are making, and then it sort of creates a complicated algorithm where the problem is somewhat tackled. But in your experimentation journey, as you are running more and more experiments, you cannot apply that Benjamini-Hochberg procedure, because there the number of hypotheses is changing as you go. So that problem is still quite a bit unsolved. It’s something that can be solved with heuristics, by making good guesses on your success rate and then being conservative towards it.
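
For readers who want to see the shape of the Benjamini-Hochberg procedure, here is a minimal textbook sketch. It is not a description of how any vendor implements it; the function name and the five p-values are made up for illustration. It also shows why the procedure needs the number of hypotheses up front: every threshold depends on the total count `m`.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Step-up rule: sort the p-values, find the largest rank k (1-based)
    with p_(k) <= (k / m) * fdr, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_passing_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            largest_passing_rank = rank
    rejected = [False] * m
    for index in order[:largest_passing_rank]:
        rejected[index] = True
    return rejected


# Hypothetical p-values for five variations compared against a control.
example_p_values = [0.03, 0.001, 0.5, 0.02, 0.012]
decisions = benjamini_hochberg(example_p_values, fdr=0.05)
print(decisions)
```

On this made-up input, four of the five variations are declared winners, whereas dividing alpha by five (a 1% threshold each) would keep only the 0.001 result. That gap reflects the different goal: limiting the share of false winners rather than the chance of any false winner at all.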

Richard Joe 31:43
The only issue there is that you’re relying on the input of people who may not even know the prior. They’re sort of using some heuristics, which is not perfect, but I guess it’s whatever they have. Yeah. Um, let’s move on before my brain explodes from this sort of thinking, yeah, how many false discoveries there are.

Why you should never stop tests early, even if they’re failing. The reason I wanted to talk about this one is because, you know, people talk about peeking when a test is doing well. No, sorry, let me clarify. A lot of blogs I’ve read in the past have said, well, you know, if you reach stat sig, you shouldn’t just stop the test; keep it going for a little while, because you might be peeking. This is a bit different. This is, in some ways, the opposite, because this is like: well, you’ve got a losing test, and you should just keep on running it for a while. Do you want to explain a bit more about this?

Ishan Goel 33:10
I’d love to, I would love to. So essentially, what happens is, and I’m seeing this as a practice as well, that a lot of people, when the test is not giving good results in the first few days, sort of shut it down. There are reasons to do that, but I want to draw out the nuances of why, and in what cases, you should not do that. So the first thing is that if you are killing a lot of the losing tests early, then you are not in for a lot of surprises; you are essentially just confirming what you already know. For experimentation to really explore the journey and explore the space of better ideas, you need to be ready to fail, and be ready to let some amount of your investment go to waste. Because if you are killing all the losing tests early, then you are only operating in that safe zone, where the tests give a very quick and very good result. And sometimes a psychological factor comes in as well. If I have a very strong hypothesis that this will work, then I will tolerate a slight decrease as well. But if I don’t believe in the idea, then I’m more likely to kill it early. So what I really want to say is that you cannot go into testing with that selection bias, because then you are just operating in the safe zone and losing out on a lot of buds that might later bloom into some interesting learning. So that’s one problem with that. The second problem is that you are also creating a bias over time. So suppose you are starting a test on Friday, and suddenly on the weekend, it is performing very well. So all those tests that were started on a Friday, you will let them run. Those that were started on a Monday or a Tuesday, which had an initial downfall, you don’t let run. So there is a bias to that as well. That is why I generally condemn that practice. And the idea which, I think, has become accepted in the industry is to let tests run for two weeks.
But I also accept that if there are some crucial stakes involved, for instance if you understand why the performance is bad, or you have some crucial guardrail metric being hurt very badly, then you should go ahead and stop them. It should not be a general practice, though.
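
The selection bias described above can be illustrated with a small simulation. This is only a sketch under invented assumptions: the conversion rates (10% control, 12% variant), the traffic of 200 visitors per arm per day, and the "kill if behind after 3 days" rule are all hypothetical, and the comparison is a raw count comparison rather than any significance test.

```python
import random

def run_test(rng, days=14, visitors_per_day=200, p_control=0.10, p_variant=0.12):
    """Simulate one A/B test where the variant is truly better (by assumption).

    Returns a list of cumulative (control_conversions, variant_conversions),
    one entry per day.
    """
    control_total, variant_total = 0, 0
    history = []
    for _ in range(days):
        control_total += sum(rng.random() < p_control for _ in range(visitors_per_day))
        variant_total += sum(rng.random() < p_variant for _ in range(visitors_per_day))
        history.append((control_total, variant_total))
    return history

def early_kill_summary(n_tests=300, check_day=3, seed=0):
    """Count tests that would be killed at an early look, and how many of
    those would have ended with the variant ahead after the full run."""
    rng = random.Random(seed)
    killed = 0
    recovered = 0
    for _ in range(n_tests):
        history = run_test(rng)
        c_early, v_early = history[check_day - 1]
        if v_early < c_early:            # variant looks like a loser early on
            killed += 1
            c_final, v_final = history[-1]
            if v_final > c_final:        # but it is ahead after two weeks
                recovered += 1
    return killed, recovered

if __name__ == "__main__":
    killed, recovered = early_kill_summary()
    print(f"killed early: {killed}, of which ahead after 14 days: {recovered}")
```

Under these assumptions, every killed test is a genuinely better variant that was discarded, which is the "buds that might later bloom" point in miniature.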

Richard Joe 35:34
In what kind of situations are you seeing this, specifically? I mean, could it be that you see people stop a test early on parts of the website that are very close to the bottom line, like, say, checkout on an e-commerce website, or somewhere further down the funnel, like a lead gen form? Are you seeing it in specific areas of the funnel or the website, or are you just seeing it across the board?

Ishan Goel 36:21
So, that’s a very good question, and I honestly don’t have the data for it right now, because we just saw it as a general pattern at VWO. We did not break it down by goals and did not go into an analysis of that. But we know in a general frame that a lot of customers, a lot of CROs, probably have that tendency in general. I understand the point that you’re making, though: if it is down the funnel, I think that is where you are hurting a crucial guardrail metric, and you can be more conservative.

Richard Joe 36:54
And there might also be, like, internal political pressure to, you know, excuse my language, not eff up the revenue, you know what I mean? Like, if you’re pumping a million or two dollars through an ecommerce website, and you’re experimenting at the checkout stage or the final stage of checkout, and your experiment’s losing consecutively by, say, 20% for the first three or four days, then I guess there can be quite a bit of managerial pressure to stop the test, because it may be costing them money, right?

Ishan Goel 37:35
Yeah, I definitely understand that, and that’s why I added the caveat. But I feel that, as a general practice, one should be aware that it comes at a cost: you’re essentially operating in a safe zone. And it sort of goes into the cultural area of experimentation. I understand the organizational pressure, and that’s why this problem probably needs to be solved at an organizational level. I can give you an example, a very interesting IBM Watson story. There was this one young manager, and he managed to lose $10 million in an entrepreneurial venture. And then he went up to, I think, Thomas Watson, the longtime CEO of IBM, and said politely, “Maybe you are expecting my resignation.” And the boss said in return, “You cannot be serious. We just invested $10 million in educating you.” So, respect that perspective.

Richard Joe 38:41
Yeah, I hope you learned a lot from it. That’s a huge investment. And that leads us to our last question, I think, which is about primacy and novelty effects and how they affect testing. Do you want to talk a bit more about what these are? What are primacy and novelty effects? How do they affect us in testing, and why should we care about them?

Ishan Goel 39:15
So they’re interesting patterns. Firstly, you need to know that statistical significance does not take them into account, but you can always observe them from the data. So the primacy effect is when, suppose, you try out a new variation, but your audience has a learning curve, and they were more used to the previous variation. [Richard: Yep.] And hence they sort of take time to start using that new design, and you see a gradual rise in user engagement. And novelty is entirely the opposite: sometimes people are clicking on something just because it is new, and otherwise it doesn’t hold validity. For example, there was this very famous experiment at Microsoft where they changed the Mail icon on the top left of the bar, and suddenly it showed increased interaction, because people were just confused: why are we not going to mail? So they kept clicking. [Richard: Yeah, I remember that one.] So primacy and novelty effects are something that you should know can exist in your tests if there is a learning curve. There are algorithmic ways to detect them as well, but if you have a simple graph of the daily conversion rates, you can also intuitively see whether they are present in your test. And if they are there, the advice is to run the test longer, let the primacy or novelty effect stabilize, and, if required, you can even chop off that initial stabilization period for a more realistic estimate.
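
The daily-graph check and the advice to chop off the initial stabilization period can be sketched roughly as follows. This is a minimal illustration, not VWO’s method: the visitor and conversion counts are invented to mimic a novelty effect, and `skip_days` is a hypothetical parameter that an analyst would choose by eye from the daily rates.

```python
def daily_conversion_rates(visitors, conversions):
    """Conversion rate for each day, suitable for plotting as a simple graph."""
    return [c / v for v, c in zip(visitors, conversions)]

def pooled_rate(visitors, conversions, skip_days=0):
    """Overall conversion rate, optionally ignoring the first `skip_days` days."""
    total_visitors = sum(visitors[skip_days:])
    total_conversions = sum(conversions[skip_days:])
    return total_conversions / total_visitors

# Invented data for one variation: engagement starts high because the design
# is new, then settles to a stable level after about four days.
visitors = [1000] * 10
conversions = [150, 140, 130, 110, 100, 100, 100, 100, 100, 100]

if __name__ == "__main__":
    for day, rate in enumerate(daily_conversion_rates(visitors, conversions), start=1):
        print(f"day {day:2d}: {rate:.3f}")
    print("all days:        ", round(pooled_rate(visitors, conversions), 4))
    print("after first four:", round(pooled_rate(visitors, conversions, skip_days=4), 4))
```

With these made-up numbers, the estimate over all ten days is pulled upward by the early novelty period, while the estimate that drops the first four days reflects the level the variation settled at.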

Richard Joe 40:51
Are these also related to peeking? Because, you know, if you have the novelty effect, and the variation is doing really well for the first, I don’t know, one or two weeks, and you’re like, “Oh, okay, I’m gonna stop the test now, this is winning.” This is related to peeking, right?

Ishan Goel 41:09
This is sort of related; it gets aggravated with peeking. It causes the winner’s curse as well, where you see the early winner and you deploy it, and hence it was more likely to be overstated. But the deeper intuition behind it is that, essentially, statistics assume that your underlying conversion rates are constant, whereas in practice there are sometimes a lot of changing patterns in the underlying conversion rates; they’re not stationary. That’s why all these problems get caused. Even the weekend effect and all those things happen because underlying conversion rates are going up and down. So that’s the thing.
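
A small worked example may help show why changing underlying rates matter. It is a sketch with invented numbers: the assumption is that the variant’s true lift (variant minus control conversion rate) is zero on weekdays and 3 percentage points on weekends, with equal traffic every day. The names `LIFT_BY_WEEKDAY` and `mean_lift` are hypothetical.

```python
# Hypothetical true lift per weekday, Monday=0 ... Sunday=6.
# These numbers are invented purely for illustration.
LIFT_BY_WEEKDAY = [0.00, 0.00, 0.00, 0.00, 0.00, 0.03, 0.03]

MONDAY, FRIDAY = 0, 4

def mean_lift(start_weekday, n_days, lift_by_weekday=LIFT_BY_WEEKDAY):
    """Average true lift over a window of `n_days` starting on `start_weekday`,
    assuming the same traffic on every day."""
    days = [(start_weekday + i) % 7 for i in range(n_days)]
    return sum(lift_by_weekday[d] for d in days) / n_days

if __name__ == "__main__":
    print("first 3 days, Friday start:", round(mean_lift(FRIDAY, 3), 4))
    print("first 3 days, Monday start:", round(mean_lift(MONDAY, 3), 4))
    print("14 days, Friday start:     ", round(mean_lift(FRIDAY, 14), 4))
    print("14 days, Monday start:     ", round(mean_lift(MONDAY, 14), 4))
```

Under these assumptions, the same variation looks strong after three days if the test starts on a Friday and flat if it starts on a Monday, while a window of two full weeks gives the same average regardless of the start day.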

Richard Joe 41:49
Huh. That’s very insightful, thanks. And I think the lesson from some of these questions that we’ve covered is really, obviously, for us CROs to have a decent working knowledge of statistics, but also to have a robust framework for when you’re launching tests. So, you know, do your pre-test calculations and have a disciplined way of testing. But also, at a larger scale, and you alluded to it when we talked about it here, have a culture of experimentation, where management’s not gonna flip out because a test isn’t doing well for the first five days or whatever. These things do take a long time to build. And if the culture can be built that X amount of tests will lose, then they will see it as an investment. But then it’s really up to the experimenter or their team to champion the cause of experimentation. Ishan, any further thoughts for the audience about what’s been mentioned, or anything like that?

Ishan Goel 43:11
Yeah. So as closing thoughts, like you were saying, all these things are connected. I think what is really going on in the industry right now is something that has not happened in the past: the business world and the science world are getting fused through experimentation. The business mindset is really coming face to face with the scientific mindset. And that is the cultural change that we require. We are trying to find a way in between, because a scientist has always worked very differently, and a businessman has always worked very differently. It will take a while for businesses to adopt that scientific mindset, but I think that’s where we are headed with experimentation.

Richard Joe 43:56
That’s very exciting times. And, yeah, it’s good to see that you and I can at least talk about a very scientific framework of things, but also bear in mind that there are business objectives here. But look, it’s been very good having you on the podcast, Ishan. How can people contact you?

Ishan Goel 44:24
They can contact me by email [address unintelligible]. [Richard: Yep.] And I am available on LinkedIn and very keen to talk to people, so feel free to drop me a message and we can definitely chat about statistics anytime.

Richard Joe 44:39
Yeah, awesome. And you’ve also got a cool-sounding blog. Was it Bagels for Thought, or what was it?

Ishan Goel 44:49
It is Bagels for Thought. Just a second. It’s bagels-for-thought, with a hyphen in between the three words. So essentially, I have been writing that blog. Now I’m going to write on statistics with VWO. We are very soon going to launch a formal blog on statistics, where I will be talking about the entire story of experimentation.

Richard Joe 45:16
When’s that coming out?

Ishan Goel 45:19
Very probably in the next two weeks. I’ll be launching it on LinkedIn, and yeah, I’ll be sharing all my learnings and thoughts on experimentation.

Richard Joe 45:28
Awesome. Awesome. Well, hopefully, by the time this podcast comes out, we’ll have a few of your blog articles out there. And yeah, it’s good to get your name out. So thanks a lot for being on the podcast, and we’ll hopefully have you on again in the future. Thank you.

Ishan Goel 45:46
Thanks a lot for inviting me. It was great chatting with you. Thank you, Richard.

Rommil Santiago 45:55
This is Rommil Santiago from Experiment Nation. Every week we share interviews with and conference sessions by our favorite conversion rate optimizers from around the world. So if you liked this video, smash that like button and consider subscribing. It helps us a bunch.


Rommil Santiago