CRO Insights from a Meta-analysis of over 1000 A/B Tests with Georgi Georgiev

Transcript

Georgi Georgiev 0:00
When you have higher power, you also have higher stakes. So usually, even in those situations, you still want to go with A/B, if the C variant is not very, very different from the B. And that's something that people seem to not appreciate enough, which is that when you have a big sample size, that also means big risk. Per unit of time that you're running the test, you have a bigger risk of losses if these variants are worse than the control.

Richard Joe 0:41
Hey folks, it's Rich here from the Experiment Nation Podcast. Today I've got a special guest, Georgi Georgiev. He's the managing director of Web Focus LLC and has been involved as a marketer and developer for over 15 years. He's got a diverse range of experience, from owning, developing, and managing dozens of online projects, to SEO, Google AdWords, and Google Analytics, and he works as a statistician and consultant. He also runs analytics-toolkit.com, full of all kinds of statistical calculators and tools for CROs. And a few years ago he launched his book, Statistical Methods in Online A/B Testing. So welcome to the show, Georgi.

Georgi Georgiev 1:44
Thank you, thank you for the intro. I'm pleased to be here, Richard.

Richard Joe 1:51
Awesome. So look, you've got a very diverse background, and quite a unique background for someone involved in CRO. It sounds like you've done pretty much the full stack of most online marketing efforts, from SEO to web development to AdWords, and obviously you've delved deeper into statistics and over into CRO. Can you tell our audience how you got involved, and where you are now in CRO?

Georgi Georgiev 2:28
Sure. Yeah, I mean, I think this experience is what makes some of my stuff so interesting for people, since I draw on many different skills. My higher education is in mathematics, and also web development and network technologies. So from there I have this development background; I've coded a lot of websites and continue to do so myself. And I also had a lot of experience in marketing. SEO was the first thing that caught my fancy. I was really amazed by the possibilities there, by the information that Google AdWords at the time provided through, for example, their keyword research tool; there was nothing like it back in the day. And from there, I slowly started understanding that doing the work is good, but you also need to analyze the results of your efforts, and to be able to present them and say, okay, here's what we did, and here's how we achieved this increase in, let's say, new visitors, or new leads, or whatever. And so Google Analytics was a natural next step. I was, I think, one of the first to certify for the Google Analytics Individual Qualification, back in 2009. And from there, yeah, I really doubled down on analytics. And as time went by, and I was consulting clients and doing reporting for them, I started hitting this wall of: okay, you can only do so much with observational data. You can only do so much by taking a look at, okay, we made this change on, say, February 5th; how are we doing one month later, or two weeks later, compared to the period prior, or the previous year? And I just wasn't able to separate the signal from the noise in the data, from the many different things that are happening both on the product or website that I was working on and outside it, in terms of seasonality, economic changes, different technological advances. I was hitting the wall of: okay, you can't really draw causal links, you can't estimate the effect sizes of what you're doing, just by doing that. So I started searching for, okay, how is it that people do it? What is the best way to do that? And obviously I ended up in the scientific world and the world of scientific research, which is exactly where these kinds of questions have been answered for many decades, if not centuries. And that's where I, you know, became familiar with statistical methods. And, yeah, I really appreciated the clarity that experiments give you, in terms of: here's what you did, and here is the effect of it, without any of these infinite confounders that you would otherwise need to take into account.

Richard Joe 5:55
You were basically just scratching your own itch, right? Like, you were like, okay, I can do these marketing efforts, and I can output this report, but what do we do with the data? How do we separate signal from noise, so to speak, and how do we interpret the data, and so forth? Yeah, it sounds like you went into a really deep dive, because, I mean, you said before the show that you didn't have a degree in statistics or anything like that. It sounds like you had a good, solid mathematical background, but, I'm guessing, you were just so passionate about what you were trying to solve that you just ended up becoming an expert in the field.

Georgi Georgiev 6:43
Yes, yes, indeed. And I would say that what helped me the most were the efforts that I put into developing tools myself. So I was actually coding the things that I was reading about, instead of just, you know, blindly using someone else's work, and also running simulations of the different statistical methods. So for example, you can take a t-test or z-test, and you can, in like 10 lines of code, code a simulation which will tell you the performance of this test under varying conditions. You know, you can vary the parameters and see how it behaves. And I think this is the best way to understand any statistical method that you are looking to apply in your work. If you have a little bit of a coding background, there is a plethora of free tools out there that you can use, including R and Python. And yeah, I think if you're serious about stats, that's the best, most efficient way. A few days of doing that is probably better than a couple of months of a statistics course at any college or university, I would say.
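
As a rough illustration of the kind of simulation Georgi describes, here is a minimal sketch in Python. All numbers are hypothetical (a 2% baseline conversion rate, 20,000 users per arm); the point is just that a dozen or so lines are enough to measure a z-test's behavior under varying conditions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def rejection_rate(n, lift, p_base=0.02, alpha=0.05, sims=5_000):
    """Simulate two-proportion z-tests; return how often they reject.

    With lift=0 this estimates the false positive rate (it should sit
    near alpha); with lift>0 it estimates power at that effect size.
    """
    hits = 0
    for _ in range(sims):
        a = rng.binomial(n, p_base)               # control conversions
        b = rng.binomial(n, p_base * (1 + lift))  # variant conversions
        p_a, p_b = a / n, b / n
        pooled = (a + b) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(p_b - p_a) / se > stats.norm.ppf(1 - alpha / 2):
            hits += 1
    return hits / sims

print("A/A (false positive rate):", rejection_rate(n=20_000, lift=0.0))
print("Power at a +10% lift:     ", rejection_rate(n=20_000, lift=0.10))
```

Varying `n`, `lift`, or `p_base` and re-running is exactly the "vary the parameters and see how it behaves" exercise he recommends.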

Richard Joe 7:54
So, in other words, getting your hands dirty, diving into the deep end, and knowing the whole setup end to end, from a scope perspective. And, yeah, you didn't have any of the free or paid tools that we're sort of blessed with right now. You were basically doing a lot of things very manually.

Georgi Georgiev 8:30
Yeah, I mean, back in the day, obviously, there was R, but there were very few other tools, and there was very little written on the topic, especially with regards to A/B testing. Online experimentation was fairly new. I mean, Ronny Kohavi's, say, 2008 paper, I think, was one of the best intros back then to the topic. It was still fairly new; it was not widely known, I would say. And yeah, it was very difficult to find information. You would go into, say, papers on medical studies, or psychological studies, or econometrics, and you would try to extract from there what methods work for what situations, what would apply to A/B testing, how you make it actually work with the data you have. There was so little written about it back in the day. Now it's much easier.

Richard Joe 9:28
Would you recommend our audience to really look at and analyze the scientific method that, you know, the medical industry and scientists use: the randomized controlled trial, which is essentially what an A/B test is? Would you recommend deep diving, if they wanted to really get into it, to really understand the papers and so forth that come from that field? Because that's essentially where we're getting this from, isn't it? Like, in a sense, the whole A/B testing framework is really the scientific method applied on a website, right?

Georgi Georgiev 10:11
It is, it is. Nowadays there is a lot more that one can find in terms of articles and books, so it's not as necessary to go to the source materials, let's say. But yes, if you want a really good understanding, it definitely helps. It should be noted, though, that in online A/B testing we actually have it much easier than most of the sciences, to be honest. We can experiment on whole populations. We don't suffer from many of the external validity, or generalizability, issues faced by many of the sciences. Let's say in a medical trial, you start recruiting volunteers, patients for treatment. Well, how are we doing the selection? Is it representative of what the trial outcomes are going to be generalized to afterwards? That's a very difficult task to manage. And in A/B testing, I think we don't need to care about that at all, right? So our task is actually very much simpler, and some of what you will read in these more scientific papers may not apply, or may not be a concern. Another thing is the sample sizes. Traditionally, in most sciences, you work with very small sample sizes, so performance at those levels is crucial, and understanding how that relates to statistical power is crucial. With online testing, obviously, you still have the issue of sample size, regardless of how many users your product is serving. But you can much more easily lean, let's say, on methods which rely on some kind of asymptotic distribution assumptions, asymptotic performance, unlike many of the sciences, where you're limited by just the sheer small amount of data that you have available.

Rommil Santiago 12:17
This is Rommil Santiago from Experiment Nation. Every week we share interviews with, and conference sessions by, our favorite conversion rate optimizers from around the world. So if you liked this video, smash that like button and consider subscribing. It helps us a bunch. Now back to the episode.

Richard Joe 12:31
I just want to deep dive into an article you released, I think, about a month ago. It was basically a meta-analysis of 1,001 A/B tests, and I think it's the follow-up to the 115-test meta-analysis that you did maybe a few years back. A very, very good meta-analysis that's on your Analytics Toolkit platform. Do you want to give us some background and motivation for why you wanted to run this meta-analysis, and go from there?

Georgi Georgiev 13:11
Yeah, sure. So my first meta-analysis was based on data from GoodUI: they collect a lot of experiments and their outcomes and collate them for you. Back then I discovered a lot of possible issues; obviously, none of it can be, you know, stated firmly. But let's say one of the eye-popping numbers for me was how many tests were underpowered. They lacked the necessary statistical power relative to, you know, what would be a reasonable range of effect sizes, for me as an external observer. And some hinted at other issues, like peeking into the data. And, yeah, I was curious to see, okay, how has the profession developed? Are we now better at applying statistical methods? And also, I was curious whether my efforts in developing Analytics Toolkit were paying off, in terms of, okay, are people actually running better, more robust, more informative tests? Also, another issue was that the first data set may not have been unbiased, meaning that it might have contained more optimistic results than were warranted. And so I wanted to just take, okay, a random sample from our platform. We're not going to skew the data in any way; we'll make sure to remove any outliers, etc. And, yeah, there were some interesting insights from that in terms of test sample sizes and test durations. We found out that people are actually now much more aware about external validity, so they're running tests for longer. We no longer see these tests run for a day or two with a conclusion then being reached, which, even if it can be statistically valid, just doesn't make sense in terms of, okay, how representative are these one or two days of users on your website for the whole year, or two years, or however long you expect this solution to be implemented on the website?

Richard Joe 15:34
When was the actual initial analysis, though? What year was it done?

Georgi Georgiev 15:38
2018?

Richard Joe 15:41
It's been three... it's been four years since then,

Georgi Georgiev 15:44
I guess, yes. And also, these tests were gathered over a period of several years, is my understanding. So they may have spanned 2012 to 2018. And now the current selection of tests is almost entirely from 2022, so it's a much, much newer data set.

Richard Joe 16:03
Okay, how about we deep dive into some of the things that you mentioned that stood out? I'm just reading a few things in my show notes; there are quite a few things that you've written down. One thing that I noticed is the arithmetic mean of the A/B test durations, which is, what, 35.4 days, so basically five weeks. Was that notably different from the previous meta-analysis?

Georgi Georgiev 16:38
Yeah, for the previous one we didn't have this information, unfortunately. That was also another goal of this analysis, because we now have a much richer dataset. Previously we didn't know the durations; we knew the sample sizes, but not necessarily how long it took for the sample sizes to be reached. And yeah, just looking at this mean and the distribution, it seems like people are really aware that they need to run tests for longer, just to ensure external validity. It could also be a byproduct of, okay, we need to have sufficient statistical power to detect smaller effect sizes, because these are so prevalent. As one of the other findings of the analysis actually shows, the distribution of lift estimates is such that a lot of the effects you can expect from your tests are very small, either positive or negative, but very small, which means you need to be prepared to wait longer to be able to detect them. But it's also important to do so, because otherwise you're missing out on those small incremental steps which would build a good product over time.

Richard Joe 17:56
I mean, test duration is something I've thought about myself, and I've read different blogs online. I'll just put it out there: I've heard a sort of general heuristic of maybe four to six weeks, you know, long enough to get enough statistical data, but not too long, so that you don't have X amount of people clearing the cookies on their browser, and so forth. What are your thoughts on that? Because obviously, the longer you run a test, the higher the chance that the population involved in that test is going to clear their cookies. What percent of them do in reality, I don't know. And for the percent that do clear their cookies and go back into your test, either the control or the variants: how much noise do they really add to that test? Do you have any thoughts on that?

Georgi Georgiev 19:07
Yeah, this issue is developing as we speak. And yeah, that's definitely a reason for not running tests for too long, unless you're running them on users you can permanently identify, let's say users who are registered on your platform. On the acquisition side, obviously, we don't have that luxury. So there, I would say, if you know the rate at which cookies are deleted, then you can definitely take that into account. I'm not sure what the current statistics are, or whether there is any wide discrepancy between browsers. But I think one of the major issues right now is how frequently a user revisits your website. If it's a user who is revisiting your website every several days, then I think the cookie deletion issue shouldn't be present. But if it is a website with less frequent interactions, then it could be more pronounced. I think it's also about the number of returning visitors, and whether we can measure them reliably. I mean, it's a thorny issue, for sure. But on the other hand, if you give in to it too much, that means you might be missing a lot of true improvements simply because you refuse to be exposed to that issue. And if you're concerned about it, I would say, run a few simulations. Like, it's not that hard to simulate a set of data: okay, we have 100,000 users here, these are properly randomized and properly tracked, and now let's add another 50,000 to those which, you know, will have some dependency, and those will be re-randomized to the treatment and control group. And see how that affects your estimates, whether it's meaningful or not. I think it's going to be a difficult matter still, but not...

Richard Joe 21:22
...nearly that. It's not that CROs don't have opinions on it, but there isn't a clear sort of agreement, from what I've heard on podcasts and from reading stuff.

Georgi Georgiev 21:36
Yeah, yeah. I mean, it's definitely not something I can give a recipe for, like, okay, do this and you'll be safe, or do that and you're going to screw it up. I think it may be something people are overly concerned about. One such issue, let's say, is running tests concurrently. Many people are very wary of doing that; they think that interaction effects can creep in and ruin their analysis. But actually, the solutions that are proposed for this problem are much worse than the problem itself, and usually you can run concurrent tests without worrying too much about interactions. And I think it might be the same case for cookie deletion here. But I don't have the solid data to, you know, prove that. It's not an area I've spent too much time investigating, to be honest.
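
The simulation Georgi sketches above is quick to put together. Here is a minimal, hypothetical version: cleanly randomized users plus a contaminated cohort of returners who cleared cookies and were re-randomized, so their labeled arm no longer matches the experience that drove their behavior. All rates and cohort sizes below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: 100,000 cleanly randomized users per arm, plus
# 25,000 cookie-clearing returners per arm who got re-randomized.
n_clean, n_dirty = 100_000, 25_000
p_a, p_b = 0.020, 0.021  # true conversion rates: control vs +5% lift

estimates = []
for _ in range(2_000):
    conv_a = rng.binomial(n_clean, p_a)
    conv_b = rng.binomial(n_clean, p_b)
    # Re-randomized returners saw a 50/50 mix of both experiences, so
    # their conversion rate is the average, whichever arm they count in.
    p_mix = (p_a + p_b) / 2
    conv_a += rng.binomial(n_dirty, p_mix)
    conv_b += rng.binomial(n_dirty, p_mix)
    n = n_clean + n_dirty
    estimates.append((conv_b / n) / (conv_a / n) - 1)

print(f"True lift: 5.00% | mean estimated lift: {np.mean(estimates):.2%}")
# Under these assumptions the contamination attenuates the measured
# lift toward zero (here to roughly 4%) rather than creating winners.
```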

Richard Joe 22:40
Well, you know, that's something that I'll probably deep dive into in the future. What's another statistic or highlight that you found in the tests, sorry, in the meta-analysis?

Georgi Georgiev 23:00
Yeah, I think that would be the lift estimates. For me, these were the biggest surprise, since I was expecting, first of all, for them to be sort of normally distributed. So you have about half of them on one side, half on the other side, and most effects would be closer to zero, but then you would also have a lot of other effect sizes, let's say plus 10%, minus 15%, plus 25%, etc. Instead, what we see is that all the results are bunched around zero. So you have many slightly negative outcomes and many slightly positive outcomes. And this is something which should be taken into account, because on the one hand it really highlights the need for A/B testing. Because if you don't test, these small negative effects can all be missed. If, for example, you just implement and then compare before and after, usually the variance in that data is so big that the small hits you're taking to your performance are not going to be noticed. And so you just walk happily into the future without understanding that you're incurring these many small performance hits. And on the other side, for the positive ones, this should bring people's attention to the fact that they need to power their tests properly, which means the sample size they plan should allow them to reliably detect small lifts of 2, 3, 5 percent. Because otherwise, they will be suffering from the curse of too many false negatives. And I have this example that I've been giving: a company, a real case, not my claim, but I know that it's a real case. They ran about 70 A/B tests, and they gave up on testing at that point. And why was that? The reason is that they only saw three statistically significant winners from these 70 tests. And why is that a bad decision? Well, they might have been really bad at choosing what to test; the changes they were testing could have been, you know, really bad. In that case, A/B testing saved them from implementing 67 changes that would have either made no difference, or, most likely, would have slightly or more significantly hurt their performance. So maybe testing was actually doing its job. On the other hand, I honestly don't believe that much in this hypothesis. The hypothesis I have is that they were not powering their tests properly. They were running tests for too short a time, with too few users, and their true effect sizes were probably in the range of 2, 3, 4, or 5%, but they had very little probability of detecting those lifts because of the short durations. And yeah, that's what underestimating power can do for you. You just end up saying, okay, this thing is useless, we can't keep at it, because, you know, it's impossible to get results.
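
A back-of-the-envelope check shows how consistent that story is with underpowering. The inputs below are entirely hypothetical (the interview doesn't give them): suppose 40% of those 70 tests had a real but small lift, the tests only had about 15% power at that size, and "winners" also include a sliver of false positives:

```python
# Hypothetical: 70 tests, 40% with a real ~3% lift, ~15% power at that
# lift, the remaining 60% true nulls at alpha = 0.05 two-sided (so a
# ~2.5% chance of a false positive in the "winning" direction).
n_tests = 70
share_real, power_small = 0.40, 0.15
true_wins = n_tests * share_real * power_small   # ~4.2
false_wins = n_tests * (1 - share_real) * 0.025  # ~1.1
print(f"Expected winners: {true_wins + false_wins:.1f}")
# ~5 expected winners; observing just 3 fits underpowered tests at
# least as well as "67 genuinely useless changes".
```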

Richard Joe 26:36
Yeah, yeah. Do you think, I mean, that's alluding to your comment here, that the majority of the tests are powered at 80%; however, a sizeable minority, maybe a third, are at 90%. You see this as encouraging, since 80% power offers a fairly low chance of detection of a true effect at the target minimum effect of interest. That's basically what you're alluding to, isn't it?

Georgi Georgiev 27:07
Yeah, that's another way to look at the problem: okay, 80% power is kind of like the textbook example,

Richard Joe 27:17
Any sort of calculator you bring up, like the Excel calculators you used to see, is always set to 80%, which seems to be the default. Could you explain to the audience why 80% is just seen as the standard? Why is it set at that level? Was there a historical reason for that?

Georgi Georgiev 27:40
Yeah, I think part of it is custom. So once the founding fathers of statistics said, okay, let's examine the power curve at 80% power, then everybody started doing that. That's one thing. And another thing is, yeah, 80% power is an informative number. It's a fairly high probability; it's not, like, a coin flip at 50%, or 10%. But if you think about it, a 20% chance of missing a true effect of this size is actually a pretty big number. So more and more people are calling for, okay, you should actually be looking at 90% power, sometimes even higher. Because, you know, even 90% is one in ten. It's still fairly high, but it makes sense, since false negatives are the error of secondary concern. So, for power, I think the most helpful thing for understanding how it works is to just examine the power function of a single test in its entirety. You will see where the test has 0% power, where it has 10% power, where it has 50, 80, 90%. And this will give you a really good understanding of what you're doing when you're just examining the power curve at 80%. And, yeah, the power level here refers to the power level that the experimenter chose in the interface. But if I can give a tip to your listeners, it would be: examine the power curve as a whole for most of your tests. This will give you a better idea of the capacity of your test to detect effects of different sizes. So let's say I'm looking at a power curve right now, and it says 80% power to detect a true effect size of roughly 9% lift. Okay, that's informative. But I can also look at a 4% lift and see that the test has 30% power there. So I can be thinking, okay, how likely is it that an effect of this size would be beneficial, and will I detect it if it truly exists? In most cases, at a 4% lift, you really want to have more than 30% power. And this also highlights the difference between power and power level. The power level is something we set: let's say it's 80%, we examine the power function at that level, and we see what effect size we can detect. That's fine. But you can always get more information by looking at the whole curve, and not just one point on it, is what I'm trying to say.
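
Computing a power curve like the one Georgi walks through takes only a few lines. This is a sketch with hypothetical inputs, a 2% baseline conversion rate and 100,000 users per arm, chosen so the curve roughly matches his example (about 80% power at a 9% lift, but only around a quarter at 4%):

```python
import numpy as np
from scipy.stats import norm

def power_curve(p_base, n_per_arm, lifts, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test across a
    range of relative lifts, using the normal approximation."""
    z_a = norm.ppf(1 - alpha / 2)
    powers = {}
    for lift in lifts:
        p_var = p_base * (1 + lift)
        se = np.sqrt((p_base * (1 - p_base) + p_var * (1 - p_var)) / n_per_arm)
        z = abs(p_var - p_base) / se
        # Power = chance the observed z clears the threshold on either side.
        powers[lift] = norm.cdf(z - z_a) + norm.cdf(-z - z_a)
    return powers

# Hypothetical test: 2% baseline conversion rate, 100,000 users per arm.
for lift, pw in power_curve(0.02, 100_000, [0.02, 0.04, 0.06, 0.09, 0.12]).items():
    print(f"{lift:>4.0%} lift -> {pw:6.1%} power")
```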

Richard Joe 30:37
Actually, it's quite interesting that the vast majority of tests on your platform, like 88%, included just one test variant and a control; only 10% had two variants, and just 2% included three or more variants. Would you say that that's, in some ways, alluding to the maturity of A/B testing?

Georgi Georgiev 31:03
Well, this number can actually be interpreted in different ways. It's obviously all speculation on my part, since I don't know exactly how our users are thinking when they're designing these tests. I think either approach is fine: an A/B test is fine, an A/B/n test is fine. You can have 10 variants against the control if you want. But I think the prevalence of simple A/B tests shows that most practitioners understand that you shouldn't add variants unless you have a drastically different variant, one which has sort of an independent chance of being better than both the control and your first variant. Adding variants like, okay, we have this one variant which we're very happy with, which we're confident is going to beat the current experience; now let's also tweak the color of this button, or move this text by one section, or swap two sections to be one above the other... small changes like these are very unlikely to result in meaningful differences. And it's usually more efficient to just test the best thing that you can come up with now, and then run a follow-up test to try to refine it, than to run a big A/B/n test with variants which are just slightly different from each other, simply because it comes with the cost of a longer test duration, as you mentioned. The point is, there has to be an independent chance for each of these variants to outperform the control; they shouldn't be very close variants of each other.

Richard Joe 32:55
Yeah, I guess, looking at it from a cost-benefit analysis, would you say that maybe the exception could be, say, a test that's affecting pretty much the whole website? Let's just say it's the footer or the nav bar, and you're doing two variants against the control. Maybe the variants are visually different enough to warrant having two variants. Would you say there may be an exception there? Because even though you are splitting up the traffic further between two variants, the test is basically on a global footer or navigation, so you're getting quite a high level of traffic compared to just a single page. I'm simplifying it here, but you know what I mean: there might be exceptions to the rule.

Georgi Georgiev 33:56
Well, when you have higher power, you also have higher stakes. So usually, even in those situations, you still want to go with A/B, if the C variant is not very, very different from the B. And that's something that, you know, people seem to not appreciate enough, which is that when you have a big sample size, that also means big risk. Per unit of time that you're running the test, you have a bigger risk of losses if these variants are worse than the control, like truly worse. Or, if they're truly better, you're running a higher risk of missing out on revenue. Because if you...

Richard Joe 34:42
You're talking about business impact, when you say loss? Like, what's happening to the bottom line of the business, yeah.

Georgi Georgiev 34:51
Exactly. I mean, obviously, if the test is on a lead form or something else, then you measure the loss in those terms, but that usually is just a proxy for some business metric. So if you're measuring the number of leads, like a lead conversion rate, then it proxies into revenue, obviously, because typically the number of leads translates into revenue. I would actually recommend, where you can, to just run the test on average revenue per user, simply because that is the easiest to interpret and tie into actual business outcomes. But yeah, that could be a side topic to consider.

Richard Joe 35:38
I think another thing you noted, which was of interest, was that about a third of A/B tests had a significant outcome. What was it like compared to the previous meta-analysis? Do you have numbers for that?

Georgi Georgiev 35:51
Yeah, the previous one was 27%. So on that front it didn't seem biased, you know. And the 33% is fairly in line with other estimates that we can see in the industry, so there wasn't a big surprise there. I think it's a good number to strive for, for your experimentation program, if you have, let's say, an in-house team doing that, or an agency doing that, simply because I know a decent amount of the users of the platform are fairly high-profile CROs with many years of experience and many, many big projects. And if they can't beat the 33% number, then even getting there should make you feel good. But a caveat here, obviously, is that it depends on what you're testing. If you implement experimentation as part of your development flow, so you're testing every single change, or most changes, shipped to the website or app, then obviously you would expect a smaller percentage of positive outcomes if you're doing superiority tests, because some changes would just have zero effect. So it depends on the mix of tests that you're doing. But for tests where you're aiming for an improvement, where you're not just testing, okay, is this implementation going to break my website, this number looks like an upper limit, I would say.

Richard Joe 37:33
I think the one-third division interested me, because I remember hearing Ronny Kohavi saying on a podcast, and correct me if I'm wrong, that you can kind of expect, you know, a third winners, a third losers, and a third that don't do anything, as kind of a heuristic. And like you said, these are the high-level CROs, right? They've been in the field for years. So, you know, you're basically expecting a coin flip, maybe even a little bit lower than that, to get your winners.

Georgi Georgiev 38:13
Yeah, I mean, I think that should be a comfort, because I know people would probably have success rates lower than that, typically, and so they shouldn't feel too bad about it. Obviously, they should strive to improve, but they shouldn't get discouraged by a lower success rate. On the other hand, if your success rate is suspiciously high, let's say 50, 60, 70, 80%, then maybe, you know, something in your process needs to be revised, so that it is more stringent and doesn't allow willing or unwilling, you know, cheating of the system, if you will: producing nominal winners which are not actual improvements. Obviously, among these I would expect there to be some false positives as well, but that will be controlled very well by the stats.

Richard Joe 39:09
I've heard of Twyman's Law: basically, the more unusual or interesting the data, the more likely it is to have been the result of an error of one kind or another. Basically, if something's too good to be true, it probably is, right? So just have that in the back of your mind, because I hear, on, like, blogs or whatever, about XYZ company doubling their conversions or revenue overnight, or some crazy number. And you're like, okay, what's really happening?

Georgi Georgiev 39:46
Yeah, I mean, there are a lot of wild claims still floating around in the industry. I would say they are getting less and less prominent, I think. But I would be happy to see a bit more robustness in some of these claims, and a bit more data, to be honest, because sometimes you just get a single number. For example, it's going to be so much easier to spot these issues if, instead of presenting just the lift, you present the confidence interval around it, right? State what your uncertainty is. I mean, say you're at a 50% lift, good. What's the confidence interval? Is it from 49% to 51%? Or is it from 2% to 150%? Because it's a big difference, you know. The one is very narrow, and it says, okay, statistically, if you haven't done anything untoward, that's a super solid result, very trustworthy. And in the other example, it's like, okay, you could have achieved, you know, just a 1% lift. It's just not as impressive anymore. So, yeah, just reporting...

Richard Joe 41:05
...is not gonna create a newsworthy headline. Yeah, it's all clicks and impressions these days?

Georgi Georgiev 41:15
Well, I wouldn't say so, because if it was, then I wouldn't have the audience that I do. I think a lot of people are actually really interested in doing the job right, and in knowing that they did everything in their power to, you know, rule out biases or other issues that might be affecting their tests; that they are actually delivering value to their customers, or, if they're on the customer side, evaluating the work of a CRO agency or someone else. Yeah, I think there is a lot of interest in the robustness of testing and the trustworthiness, as Ronny says it, of the results.
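
To make the interval-reporting point concrete, here is a minimal sketch of a confidence interval for relative lift, using the standard log-transformed interval for a ratio of proportions (the Katz interval); the function name and the input numbers are just for illustration:

```python
import math
from scipy.stats import norm

def lift_ci(conv_a, n_a, conv_b, n_b, level=0.95):
    """Approximate CI for relative lift (p_b / p_a - 1) via the normal
    approximation on log(p_b / p_a) (the Katz log interval)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    log_ratio = math.log(p_b / p_a)
    se = math.sqrt((1 - p_a) / conv_a + (1 - p_b) / conv_b)
    z = norm.ppf(1 - (1 - level) / 2)
    lo = math.exp(log_ratio - z * se) - 1
    hi = math.exp(log_ratio + z * se) - 1
    return p_b / p_a - 1, lo, hi

# Same +50% observed lift, wildly different evidence behind it:
print(lift_ci(10_000, 500_000, 15_000, 500_000))  # ~(+0.50, +0.46, +0.54)
print(lift_ci(20, 1_000, 30, 1_000))              # ~(+0.50, -0.14, +1.62)
```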

Richard Joe 41:58
You did mention sequential testing. Excuse my ignorance, I'm not very aware of sequential testing. Could you explain what it is for the audience?

Georgi Georgiev 42:11
Sure. So sequential testing, basically, is what you need to do when you're tempted to look at your results every day or every week, and to make a decision when you see a significantly positive or significantly negative outcome. If you do that without sequential testing, so you're just using some regular free calculator out there, where you input, okay, here's my sample size, here's my outcome, give me the statistical significance. Well, surprise, surprise, this calculator actually makes the assumption that you have only evaluated your data once. And if this assumption is not true, then you can just dispense with anything it gives you; it has zero value. So how do you satisfy your curiosity and desire to act quickly? And how do you satisfy the business people or stakeholders who demand that underperforming tests are stopped as early as possible, and that overperforming tests are also stopped as early as possible, the one for the reason of cutting losses, and the other for the reason of maximizing the positive benefits from a test? Well, you develop a sequential testing procedure. And this is exactly what happened. This was actually first developed during World War Two, in the USA, and these procedures were then developed over time: expanded, extended, perfected, if you will. And nowadays we have solid literature and vast experience. It's actually the norm in many of the sciences where the risk-reward calculations matter, and I would say it's making its way into almost all sciences. Physics as well, I think. I mean, it costs a lot to run, let's say, the Large Hadron Collider for a month, so you want to get results as quickly as possible. That's actually good efficiency. You want to stop early if the results are strongly positive or negative, and that's what sequential testing does. It allows you to do that while maintaining statistical rigor. And if it's a well-thought-out procedure, it will also retain a lot of the external validity of the results. So you wouldn't be stopping, let's say, in the first day or the first hour of your analysis; you should be conducting the analysis over a whole business cycle. So typically you'd wait a week, so the first analysis will be at one week, and then continue on a weekly basis; sometimes daily analysis is also justified. And this is the whole idea: we want to run tests efficiently, without invalidating the statistical assumptions, without sacrificing the statistical rigor.
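
The "evaluated your data once" assumption is easy to see break in a simulation. This sketch (hypothetical numbers: an A/A test with a 2% conversion rate, ten weekly looks of 2,000 users per arm) applies the naive fixed-sample threshold at every peek and counts how often a no-difference test gets declared significant:

```python
import numpy as np

rng = np.random.default_rng(7)

n_sims, n_looks, weekly_n, p = 10_000, 10, 2_000, 0.02
false_positives = 0

for _ in range(n_sims):
    conv_a = conv_b = n = 0
    for _ in range(n_looks):
        conv_a += rng.binomial(weekly_n, p)  # A/A: both arms share p
        conv_b += rng.binomial(weekly_n, p)
        n += weekly_n
        pooled = (conv_a + conv_b) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        # Naive peek: the fixed-sample 5% threshold at every look.
        if se > 0 and abs(conv_b / n - conv_a / n) / se > 1.96:
            false_positives += 1
            break

print(f"False positive rate with 10 naive peeks: {false_positives / n_sims:.1%}")
# Expect somewhere near 19% rather than the nominal 5%; this is the
# inflation that sequential procedures correct by adjusting thresholds.
```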

Richard Joe 45:22
So instead of, like, you know, doing your pre-test calculations and then having this fixed sample size that you run the test to, which maybe could be seen as somewhat rigid, you're analyzing data as you run the test, and you've got some sort of stopping rule involved there. It sounds like that's a more involved, more statistically rigorous kind of setup. Would you say that has, in some ways, some relation to multi-armed bandit tests? I've done one; it was basically for this email marketing platform, and it was like, wow, this variant's doing well, we'll just put more traffic towards this variant. I don't know if there's any relation...

Georgi Georgiev 46:12
There is, in terms of the stochastic processes that are being studied, but we won't go into those details. The idea of bandits is a bit different from sequential tests, though. Bandits want to improve the performance during the time that they're being run, while a test wants to get you to a conclusion as quickly as possible. It's a bit different in terms of how you balance the exploration and exploitation phases. With a bandit, it's like, okay, we can explore indefinitely; with a test, it's like, okay, we need an outcome by this date. And also, I think, estimation: the effect size and confidence intervals, things like that, are much more the focus of sequential tests, and much less the focus of a bandit, which just aims to give you the performance. Estimation from bandit data is possible, obviously, but much more difficult, and from what I understand, less efficient, meaning it takes more data to get the same accuracy of estimation from a bandit, or from an adaptive sequential test, which is also one possibility. So even in a sequential test you can adapt the allocation between variants: you can allocate more users towards a variant which seems more promising at the time. But then the analysis is much more difficult statistically, and it's also less efficient, so a sequential test without adaptation of this kind will take less time to run than an adaptive sequential test. So, yeah, I think at best they're equal in terms of what you get at the end, when you draw the line at the bottom, and perhaps the simpler designs are a bit better in some situations.

Richard Joe 48:21
But you did mention, going back to the meta-analysis, that you saw something like a 26% average efficiency improvement versus equivalent fixed sample size tests. Could you explain that a bit further, if you don't mind?

Georgi Georgiev 48:39
Sure, sure. So the whole idea of analyzing data as it comes is that you want to stop earlier than you would stop with the same test parameters in a fixed-sample design. You have the same significance threshold, or confidence threshold if you will; you have the same target minimum effect of interest; and you want to maintain a good power level against that, 80%, 90%, whatever you chose. So you maintain all of these parameters, but stop tests earlier, to either save yourself from the losses of underperforming variants, or, you know, to get to exploiting the benefits of an actual improvement more quickly. And so here the major thing you want to measure is: how long, on average, does it take to run a test? With sequential tests, the benefit is that on average they stop sooner. And in the case of AGILE sequential testing in particular, in a real-world scenario, it stops 25, 26, 28% earlier, depending on some assumptions being made, but in this range for sure, than if you had designed the same exact tests but run them with a fixed sample size on the same data, essentially. And that is a quarter of a saving in terms of performance. This translates to the bottom line, actually, because it means less lost revenue and higher earned revenue from winners. And yeah, I think that's significant, and it's nowhere near the limit of where, theoretically, sequential testing can get you. Depending on the actual distribution of the true effect sizes, it can be as much as 80% faster. So if the tests you're running actually have much bigger positive or negative outcomes... not outcomes, true effects, then, you know, you can be 80% faster, on average, which...

Richard Joe 51:00
...can be huge in terms of cost savings to the business, and lessening risk.

Georgi Georgiev 51:08
Sorry, just a final remark on that. What we've seen in practice is that, since these lift estimates are really concentrated around zero, relative to the minimum effects of interest they're really in a region where they're not reaching this maximum efficiency. You know, the theoretical range is from 20% to 80% efficiency, and the real-world observed efficiency is 26 to 28%, let's say. And that's what you should expect if you're running tests similar to the tests that were run on our platform during August.

Richard Joe 51:43
That 26%, though, is still quite a material difference.

Georgi Georgiev 51:49
Indeed, indeed, very, very material.
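
As a rough sketch of where that average saving comes from, here is a hypothetical group-sequential simulation. It is not the AGILE method itself, just a classic Pocock-style design (five equally spaced looks with a constant critical value of about z = 2.413, which holds the overall two-sided alpha near 5%); the baseline rate, lift, and sample size are all made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical: 2% baseline, a true +10% relative lift, a fixed-sample
# design of 50,000 users per arm, versus 5 equally spaced looks with
# Pocock's constant boundary (z > 2.413 at each look).
p_a, p_b, n_total, n_looks = 0.02, 0.022, 50_000, 5
chunk = n_total // n_looks
stop_n = []

for _ in range(5_000):
    ca = cb = n = 0
    stopped_at = n_total  # if no boundary crossing, run to the end
    for _ in range(n_looks):
        ca += rng.binomial(chunk, p_a)
        cb += rng.binomial(chunk, p_b)
        n += chunk
        pooled = (ca + cb) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(cb / n - ca / n) / se > 2.413:
            stopped_at = n
            break
    stop_n.append(stopped_at)

avg = np.mean(stop_n)
print(f"Average users per arm at stop: {avg:,.0f} "
      f"({1 - avg / n_total:.0%} fewer than the fixed design)")
```

When a true effect of this size exists, a fair share of runs cross the boundary before the final look, which is exactly the average-duration saving being discussed; under a true null, most runs go the full distance.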

Richard Joe 51:52
And just lastly, you made another point: that half of A/B tests have a 90% probability to detect a true effect of 6% or less, while the average MDE is 11.3%, suggesting a trend towards better-powered tests becoming the norm among professionals. Any comments on that?

Georgi Georgiev 52:16
Yeah, sure. One of the major findings of the previous meta-analysis was that a lot of tests were actually not very well powered. In this analysis, about two thirds of tests had the ability to detect an effect below a 10% lift, but in the previous one it was just one quarter; only about 20% of tests had the ability to detect a lift of 10%. And 10% is huge if you look at the distribution of the actual effect sizes, and at what typically makes business sense to test. Because it's not only about what you expect to see, but also about what would make business sense. I mean, yes, you would be super happy with a 10% improvement, but aren't you also going to be happy with 2 or 3%? For most businesses, the answer is yes. But with underpowered tests you would miss those true effects of 2 and 3% way too often, like 90% of the time, 80% of the time, which is unacceptable. And so the finding here is that experimenters are much more realistic in the choice of the target minimum detectable effect, the minimum effect of interest as I call it, and consequently they're running tests for longer, with larger sample sizes, so that the false negative rate is much better in this new meta-analysis.

Richard Joe 53:54
Just a comment from my end. I'm obviously not a stats expert like yourself, but, you know, I've read blogs and so forth saying, you know, set your MDE at 5%. What are your comments on that? Is that in line with the meta-analysis results that you've got?

Georgi Georgiev 54:17
Yeah, 5% is a good, I would say... not a guideline, it's a good starting point. I actually advocate for using the minimum effect of interest which makes sense for the particular test. So for each test you would see, okay, at which point on the effect size range does the test become beneficial for the business? For smaller businesses, typically, the sample sizes that they can achieve in a meaningful amount of time are smaller, and so their power is bound to be worse. And so for them, slightly higher effects can still make sense, because otherwise they might end up running tests for half a year or a whole year. You're slowing down your process so much, and you're waiting such a long time, and you're also running into all these issues that we touched on earlier, with cookie deletion and the like. And, yeah, for them, maybe the balance would be, let's say, at 10%.
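
The sample-size arithmetic behind this trade-off is easy to sketch. With hypothetical numbers (a 2% baseline conversion rate, 95% confidence, 80% power), the required traffic grows roughly with the inverse square of the minimum detectable effect:

```python
from scipy.stats import norm

def n_per_arm(p_base, rel_mde, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided two-proportion z-test
    to detect a relative lift of rel_mde over baseline rate p_base."""
    p_var = p_base * (1 + rel_mde)
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return (z_a + z_b) ** 2 * variance / (p_var - p_base) ** 2

# Hypothetical 2% baseline: halving the MDE roughly quadruples traffic.
for mde in (0.01, 0.02, 0.05, 0.10):
    print(f"{mde:>4.0%} relative MDE -> {n_per_arm(0.02, mde):>12,.0f} users per arm")
```

At this hypothetical 2% baseline, a 5% relative MDE already implies on the order of 300,000 users per arm, which is exactly the trade-off Georgi describes for smaller sites.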

Richard Joe 55:35
So having a strict 5% isn't always viable for your business, because, you know, the lower you go, the higher the sample size has to be, right, and the longer you're going to have to run your tests for.

Georgi Georgiev 55:48
Exactly. And then, on the other side, if you're already a big business, and the experiments that you're doing involve a lot of people, then even a 1% lift could be super significant for the business. It could mean, let's say, tens of millions of dollars per year from a 1% lift, for larger businesses. And there, you obviously want to power for these effects, because you don't want to miss, let's say, a 20 or 30 million dollar lift. For them, 5% is like, that's billions for us if we detect it; and if you miss just a 4% lift, it can be game-changing for your business. So yeah, 5% is just, okay, maybe for an average business somewhere out there 5% would be great. But is there such a business? Also, I like to think about individual tests instead of businesses, because even if you take a big corporation like Microsoft, they could have tests involving billions of users, or they could have tests on some niche feature or niche product which only attracts, you know, a couple of thousand users per week. So it's always about what you're testing, and what the particular balance is for that particular test. And that's actually something that I've worked a lot on. I've published articles on it, and I've developed tools that actually allow you to compute where the optimum lies for each particular test. It obviously takes more information into account; it requires some business information, for example, how much does it cost you to run this test in terms of fixed costs. That's your sunk cost, which you're putting in no matter what the outcome of the test is. And other things like that, like the monthly impact on revenue: how much revenue is impacted by the test. That's a big factor, because the balance of risk and reward is much different if what you're testing can impact hundreds of millions in revenue, compared to if it's just impacting, say, $10,000.

Richard Joe 58:15
So, in other words, the MDE is not one-size-fits-all; it really depends. And just to round it off, a shameless plug: you mention this in your book, and I'm actually reading Statistical Methods in Online A/B Testing. I'm about halfway, maybe three quarters of the way through, and it's quite a good read. It gives you the whole framework of things. So yeah, thanks a lot for coming on the show, Georgi. Is there anything you'd like to leave our listeners with?

Georgi Georgiev 59:02
Thank you, thank you for having me, Richard. It was a pleasure, and I hope it was a useful conversation for yourself and for all the listeners out there. And, yeah, I think the industry is moving in a great direction. I'm seeing it happen in front of my eyes, and I'm very happy to contribute to it as much as I can. If any of your listeners want to, you know, drop me an email about any of the topics that we touched on, I'd be happy to respond. And yeah, let's keep in touch. I really like the work that you do here at Experiment Nation; I think it's also having this positive impact. So, very happy to be part of it.

Richard Joe 59:52
Awesome. How can people contact you again? I mean, you're on LinkedIn; people can just Google you. Yeah, LinkedIn...

Georgi Georgiev 59:57
...is the best. I am constantly there, so I should be fairly quick to approve a connection request. And I think I also have my contact details listed there. So, yeah, that shouldn't be an issue.

Richard Joe 1:00:16
Awesome. And analytics-toolkit.com has all the bells and whistles for A/B testing calculations and so forth, so have a look. And yeah, thanks for coming on and sharing. See you around.


If you liked this post, sign up for Experiment Nation's newsletter to receive more great interviews like this, memes, editorials, and conference sessions in your inbox: https://bit.ly/3HOKCTK

