A/B Testing Statistics issues CROs should ignore with Timothy Chan

Transcript

Timothy Chan 0:00
I came to the conclusion that interaction effects tend to be overestimated. Number one, they're actually quite rare. And when they're found, they're in most cases underwhelming, and in such a way that they generally still produce directionally accurate results. So it wouldn't change your decision over whether to ship a feature, it would just change your estimate of what that true effect is. So I think in some ways, folks who are really worried about interaction effects are making too big a deal of it.

Richard Joe 0:31
Hey folks, it's Richard here from Experiment Nation, and I've got Tim Chan on the line. Tim is a former data scientist at Facebook, and we're going to be talking about his experiences at Statsig and cover a few topics, such as when it's okay to overlap or isolate tests, the idea that interaction effects between tests are overblown, and much more, so stay tuned. Today I've got a special guest. Welcome to the show, Tim.

Timothy Chan 1:07
Hi, Richard, thank you so much for having me.

Richard Joe 1:10
Cool. So how about we just start off with your background? You've got an interesting background: I can see you were previously working in various scientific roles that weren't particularly to do with experimentation specifically. Maybe just talk about what sparked your interest in getting involved as a data scientist and in experimentation at Facebook.

Timothy Chan 1:35
Yeah, I think I've had a pretty interesting career path, like a lot of other data scientists. For me, I love science, I love scientific thinking, and that's what led me to pursue a career in the biotech space doing medical research. But another part of me was also really interested in startups and really interested in business. What I had always thought of as my dream job was combining scientific principles, scientific decision making, and analytics with the business world. That kind of role didn't really exist when I graduated with my PhD, but I slowly became aware of this career called the data scientist. My interpretation of what a data scientist is, is someone who applies scientific principles and scientific methods to deriving solid insights that can drive sound business decisions. Once I found that out, I really wanted to become a data scientist, and had to learn a little bit of Python, SQL, things like that. Facebook was my first bona fide role as a data scientist.

Richard Joe 2:47
Oh yeah, I mean, it's a huge jump. Was it Galvanize that you were at? So Galvanize was basically your segue into data science, and then you segued into the tech sector via Facebook. Facebook's a huge company; which part of Facebook were you working within?

Timothy Chan 3:05
Yeah, so when I first started, I joined in 2016, when Facebook was quite a bit smaller, but it was already a public company, so it was still by all means very much a big tech company. The first team I joined was Facebook gaming. Classically, this was known as the team that had been responsible for things like FarmVille, so it had a bit of a tarnished legacy, internally at Facebook and also externally. But our job on that team was to make gaming relevant again on Facebook, in a way that played very nicely with users and other Facebook content. That was a team that was very much growth focused. So we were in charge of making things more relevant for a younger group of Facebook users, but also making Facebook a cool place for gaming content to be shared and be active.

Richard Joe 4:04
In terms of the Facebook gaming division that you were part of, and the growth associated with that, just to bring it back to experimentation: how was your role tangibly involved in experimentation and growth in that area? Feel free to share specifics, whatever you feel comfortable sharing.

Timothy Chan 4:27
Yeah, absolutely. This was my first role as a data scientist, and the way I would describe the role is that it was more like product analytics, but kicked up a notch in terms of scientific savviness. What I came to learn was that you can do a lot by deriving insights from existing data, diving in and applying some statistical methods. But what I quickly found was that one of the most powerful tools in the arsenal of Facebook product teams was experimentation: the ability to, instead of coming up with analyses that have all sorts of caveats and assumptions, actually put ideas to the test in a very quick fashion, where you just deploy it as a test, get results back in two weeks, and immediately know whether that idea was good or bad. And the team, and this was true of all of Facebook culture, was exceptionally data driven. So being the data-driven data scientist wasn't necessarily a key advantage at Facebook, because everybody was data driven. But helping the team navigate insights and being able to dive into the harder questions, especially with experimental effects, was my role.

Richard Joe 5:40
Do you mind explaining, maybe tangibly, the experimentation culture? It wasn't just a team of experimenters, right? It sounds like product managers, maybe marketers internally, could basically take a hypothesis, ideate an experiment, and launch it on their own. I'm guessing there was a democratization of experimentation within Facebook as a whole?

Timothy Chan 6:14
Yeah, democratizing is a great term for it, because at Facebook every engineer, product manager, and data scientist is really empowered to set up, run, and own experiments. I'd even seen interns, within three weeks of starting at Facebook, know how to set up an experiment by themselves and be able to operate it. That was really cool to see. And part of that was just because Facebook's tooling made experimentation easy. I know a lot of folks who do experimentation today think there's a lot of heavy lifting you have to do, in terms of making sure your assignment is correct, that things are balanced, that your metrics are flowing in, that you're ready for the computations, that you have things like your hypotheses set up. But I've seen people make one-line code changes and put even that behind an A/B test, and they may make a decision in three or four days just to make sure it indeed fixed the bug. What I thought was really cool was that Facebook was definitely biased towards measuring outcomes through experimentation, instead of just assuming those outcomes would happen.

Richard Joe 7:23
I mean, typically you'd have a roadmap and prioritize experiments in a typical environment. We're talking about Facebook here, which is not your typical business, it's very atypical. Was there any way of prioritizing tests within, say, Facebook gaming? And given that people could have an idea and launch it pretty quickly, was there any internal QA by engineers, those sorts of things, just to make sure there was a certain degree of quality control?

Timothy Chan 8:04
Yeah. You know, we're based in Seattle, and I was part of Facebook Seattle. What's kind of neat here is that in Seattle, Microsoft is the big legacy tech giant in the area, and Facebook is sort of the new up-and-comer, and we had a lot of folks who went from Microsoft to Facebook. There was a big culture shift, especially with regards to how Facebook approaches experimentation and QA. For example, the old way of doing software development was that you would go through layers and levels of QA; you'd have large QA teams. Facebook did not. Facebook believed very much in engineers owning their own QA. So you build the feature, but you would also own the fact that you had done your due diligence and checks. Now, obviously you don't have the ability to test your feature on 80 different Android devices, but Facebook relied on experimentation as the final judgement. So in some ways, the end user was the QA. At its simplest level, you would look at things like: I deployed this new line of code, did crash rates change? And if they did, what set of devices might that have occurred on? I think that was a big part of it. To go back to your first point about how Facebook approached roadmaps: Facebook roadmaps, or at least goals, were very much outcome driven, and not necessarily about going through a set of tasks and projects to complete. And I'm told this is divergent from how some other companies approach things. So in Facebook gaming we might have a goal, for example, to improve the retention rate of certain users with certain classic games or certain products, or we might strive for a product to achieve this many daily active users by the end of the half. Our goals were very measured, data-driven, metrics-driven outcomes, and we would come up with a list of ideas on how we could achieve those metric goals. Those ideas were very flexible in how we approached them. We might enter the half with a clear idea that the following five items, if we finished them, would achieve the goal. But the second we test the first item and find that no, it doesn't work, but we learn why it doesn't work, that can throw your entire roadmap out the door, because you suddenly think, oh, I have a new way that I think we can achieve the goals. So I would say the roadmap of items we wanted to accomplish was very much in flux, and we were willing to pivot based on new data.

Richard Joe 10:42
So it was quite flexible. It wasn't like, no, we've got these tests lined up next month, like a rigid roadmap.

Timothy Chan 10:50
Yeah, I would say we weren't rigid in terms of scheduling. You wouldn't have a set of tests queued up and say, we're going to execute the following five tests in the following sequence. Instead, things were very much driven by what knowledge we have today and which tests we think are the most impactful, and those would be prioritized first. So everything was very much focused on the goal you were trying to achieve and which tests were most likely to drive that outcome.

Richard Joe 11:17
How did your role specifically work in terms of being a data scientist with testing and interpreting results and so forth? Let's say someone had an idea, they made a bit of code, they ran the test, and they have the outcome after two weeks. What do you do as a data scientist?

Timothy Chan 11:37
Yeah, I would say there are two parts, or two ways, that a data scientist can influence experimentation, at least as I saw it work at Facebook. The first is really helping to prioritize the list of ideas and experiments that one could run. That could be looking at things like, if we wanted to improve the conversion rate on a certain page, being able to identify which pages have low conversion rates and believing that there's an opportunity to drive those up. Or if you wanted to drive users to a certain surface or product, where are the biggest opportunities for that? So I think a data scientist was quite instrumental in being able to quantify an opportunity size and which projects or experiments teams should be focusing on. The second one is helping to interpret experimental results. At Facebook, as I mentioned, the company was very data savvy, so engineers and PMs were fully capable of interpreting most experiments by themselves, particularly the ones where there's a clear outcome and a key metric. But where a data scientist was probably most helpful was in those cases where results were contradictory, or where metrics moved in odd ways, and being able to discern the possible root cause of some of these effects, dive deeper into the data, and understand a little bit about the principles of experimentation and how they apply.

Richard Joe 13:10
You wrote an article about how maybe overlapping tests aren't as bad as people say, and I'm guessing from your experience at Facebook, it sounds like they had a culture of having overlapping tests and not worrying too much about the noise effects of that. Could you explain that a bit more, please?

Timothy Chan 13:28
Yeah. So, you know, I'm classically trained as a scientist, and I know one of the fundamentals of science when you're doing comparisons, particularly in experimentation, is that you want to control for every factor and make sure you're comparing apples to apples. But what I quickly found was that in the world of product-based experimentation, that just fundamentally doesn't work. Part of the reason is that when you're experimenting at that level of scale, let's say Facebook is running 10,000 experiments, I don't know what the actual number is, but that's not an unreasonable estimate, and you assume every test is an A/B test split 50/50, the number of possible combinations is two to the power of the number of tests. There are not enough users in the world to make sure every combination is represented properly. So the only way to achieve that scale of experimentation is to overlap. The other advantage that companies doing online experimentation have is really large sample sizes. And what I mean by really large is, you know, in the world of classical statistics, 30 was considered a normal-sized experiment, and anything over that, like 100, was considered well powered. In the world of online experimentation, we're talking about tens of thousands of users. That's actually plenty: if you were to split that 50/50, and split it that many times, you would end up with fairly representative buckets just by pure randomization. That's all fine in theory, but what I did see at Facebook was that we were able to overlap experiments quite regularly. And even in cases where you would swear up and down that these two experiments, just by their nature, have to have interaction effects, I found that every time we looked into them, they were very underwhelming. So I came to the conclusion that interaction effects tend to be overestimated. Number one, they're actually quite rare. And when they're found, they're in most cases underwhelming, and in such a way that they generally still produce directionally accurate results. So it wouldn't change your decision over whether to ship a feature, it would just change your estimate of what that true effect is. I think in some ways, folks who are really worried about interaction effects are making too big a deal of it, and it's actually slowing their rate of experimentation: you're now severely underpowering all your experiments because you're trying to keep users in separate buckets. I have found that this is probably something folks who aren't used to experimenting at scale worry about first, and it's a question we get a lot at Statsig.
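
As an aside from the transcript: here is a minimal simulation sketch (illustrative numbers only, not Facebook's or Statsig's data) of why independently randomized, overlapping A/B tests still give directionally accurate reads on each experiment:

```python
import numpy as np

rng = np.random.default_rng(42)
n_users = 50_000

# Each user is independently randomized into two concurrent experiments.
in_a = rng.integers(0, 2, n_users)  # experiment A: 0 = control, 1 = test
in_b = rng.integers(0, 2, n_users)  # experiment B: 0 = control, 1 = test

# Hypothetical true effects: A lifts the metric by 2.0, B by 1.0,
# plus a small interaction when a user gets both treatments.
baseline = rng.normal(10.0, 5.0, n_users)
metric = baseline + 2.0 * in_a + 1.0 * in_b + 0.3 * in_a * in_b

# Naive read of experiment A, ignoring B entirely.
effect_a = metric[in_a == 1].mean() - metric[in_a == 0].mean()
print(f"estimated effect of A: {effect_a:.2f}  (true main effect 2.0)")

# Because B is split roughly 50/50 inside both arms of A, B's effect averages
# out and only half the interaction leaks in; the estimate is slightly
# inflated but directionally correct, which is what matters for a ship decision.
print("share of A's test arm also in B's test arm:", in_b[in_a == 1].mean().round(3))
```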

Richard Joe 16:22
I mean, it's a question that I've had, because I've just been going by other blogs and having this very purist perspective that you want to put tests into separate buckets to avoid the noise effects and interaction effects between tests. It's sort of a purist thing. But I remember Ronny Kohavi, who I think was at Microsoft at the time, saying that in MVT, multivariate testing, the interaction effects between, you know, simple components in a test aren't as high as you think, and I think that's along the same lines as what you're saying. Would you say your assertion that these interaction effects aren't as bad as you think even applies to, say, a small-scale website, let's say one that gets 100,000 visitors per month?

Timothy Chan 17:22
One thing I've learned at Statsig, and I didn't necessarily have this perspective before, is that you don't actually need that many users to do experimentation properly. What we quickly found is that 100,000 monthly active users is larger than your average Statsig customer, by far. And I remember we would get some of these customers who initially signed up and tried Statsig that may have had, like, 10,000 users, for example. And the scientist in me said, oh my god, they're about to try experimentation, they're going to walk away pretty underwhelmed and very frustrated.

Richard Joe 18:01
Sorry, those 10,000 users, is that overall, or just on one page we're talking about, or the whole website?

Timothy Chan 18:09
Overall, yeah. And I remember I was used to Facebook-scale experiments. I had worked on small products at Facebook, but even small products at Facebook would have, you know, 50,000 users, 100,000 users. So I would see these small companies, and not only that, these folks didn't have a ton of experience with experimentation, they were just getting started. So in some ways, in the back of my mind, I felt like the cards were stacked against them. But what I quickly learned, and this was repeated over and over, was that most of these companies, startups who are just starting to grow and just starting experimentation, ended up succeeding. They ended up with these wildly impressive results, like plus 50% wins, plus 80% wins, and it happened very routinely. After a while, I came to realize the reason: at Facebook we would struggle to find 2% wins, 1% wins, but these companies are working on unoptimized products. They're just starting to rigorously test ideas. They may have made design choices when they were first building their product or website that were just, let's pick something and go with it, and now they're going back and actually trying to test those ideas with metrics and get good measurements. Because their products aren't optimized, there's a ton of low-hanging fruit, a ton of opportunities to operate on, and they're not looking for 1% or 2% wins. In fact, I would argue any startup that has aspirations of becoming a big deal, if they're looking for 2% wins, they're in trouble; they're never going to hit any sort of growth goals. They need to be looking for big wins. And they do find them. So I was pretty shocked and surprised that these companies were finding these big wins. It was interesting to see results like a measured plus 50% increase in conversion rate, where the error margins would be something like plus or minus 25%. But that was enough to be statistically significant, and they were happy that they were getting pretty good measurements and being able to make very clean decisions. So I think it's a myth that you need large sample sizes to do experimentation. What is clearly the bigger effect is whether those opportunities exist or not. If you have a highly optimized product and a very low sample size, then yes, the cards are stacked against you. But for a lot of people starting out on their experimentation journey, that low-hanging fruit does exist, they just haven't tried it yet.
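
A rough, hypothetical back-of-envelope check (made-up numbers, not from the interview) of why a large lift can be statistically significant even on a small sample with wide error margins:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical small-site experiment: 2,000 users per arm,
# conversion goes from 4% in control to 6% in test (a +50% relative lift).
n_c, n_t = 2000, 2000
p_c, p_t = 0.04, 0.06

diff = p_t - p_c
se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)

z = diff / se
p_value = 2 * (1 - norm.cdf(abs(z)))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"lift = {diff:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}], p = {p_value:.4f}")
# The error margin is wide relative to the 4% baseline, yet the confidence
# interval excludes zero, so the result is still statistically significant
# and good enough to make a clean ship decision.
```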

Richard Joe 20:47
You know, because I've kind of kept to this heuristic of, okay, we need at least 10,000 uniques to this landing page or whatever to get a test that's not underpowered. Would you say there has to be some sort of minimum sample size for a particular test, or for a page you want to test on?

Timothy Chan 21:14
I am not a fan of hard and fast rules for what the minimum sample size is that you need for experiments. I think there are a lot of factors that go into it, such as: what is the metric you've chosen to measure, how precise is that metric, and what is the minimum detectable effect you think you can achieve? Those things tend to get overlooked if you just go with sample size. Part of it is also knowing your product. There are times where a metric is noisy, and there are ways to find surrogate metrics, metrics that are better behaved in experimentation, that may give you a better chance of picking up the effect and getting the necessary experimental power. But I'm not a fan of hard and fast rules. I've seen experiments as small as 1,000 users come up with some pretty nice results. It's just a matter of what it is you're trying to measure and what your chances are. I know in the world of CRO, metrics tend to be things like click-through rates on buttons, and those can be a little bit more challenging to move at times. But sometimes, if you're looking at driving things like retention, number of purchases, or how long somebody spends in a session, those sorts of metrics can be a little bit more powerful and easier to move, sometimes not. So I do encourage folks to look at a wider set of metrics, but I wouldn't write off any sort of sample size or traffic. I would instead try to focus on what the top-line effect is that you're looking for.
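
For readers who want to turn this point into numbers, here is a small illustrative sketch (hypothetical baseline and lifts, not from the interview) of how the required sample size per arm scales with the minimum detectable effect:

```python
from math import ceil
from scipy.stats import norm

def users_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # critical value for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

for lift in (0.02, 0.10, 0.50):
    print(f"+{lift:.0%} lift on a 5% baseline -> "
          f"{users_per_arm(0.05, lift):,} users per arm")
# A +50% lift needs orders of magnitude fewer users than a +2% lift, which is
# why unoptimized products can get clean reads on relatively small traffic.
```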

Richard Joe 22:54
Just backtracking a bit: you know, looking back at Facebook, what led you to Statsig? And what learnings from Facebook did you bring over to Statsig?

Timothy Chan 23:08
A lot of folks who worked in big tech companies that have a really strong experimentation program probably have the same observation that I did, and many of the folks here at Statsig certainly had that observation as well. It's that experimentation can work at a very large scale and be exceptionally powerful at a company. What we had seen at Facebook was that the set of internal tools Facebook had really did three things for the company. Number one, it made experimentation accessible just by making it easy to use and a good default: there wasn't a good reason not to ship an idea as an experiment. The second thing that was sort of unique was that everybody could interpret experiments. Part of it is finding a way to walk that line: null hypotheses, p-values, confidence intervals, these are all somewhat challenging concepts, but at heart, guiding people to make the correct decision is doable, and we certainly saw that at Facebook. And I think the third result is that when you make experimentation easy and you make metrics really accessible, the results just become automatic, sort of like checking in daily on a dashboard. This, to me, is what made Facebook a data-driven company. I think without experimentation, Facebook would not have been data driven, and I think that's true at a lot of other companies. I'm not the only one who's had this observation that experimentation really is how you get everybody bought in on following data, following metrics, and making data-driven, not opinion-driven, decisions. That was great in setting the culture: it became a culture of ideas and results, not so much of opinions and who could put together well-thought-out, polished arguments.

Richard Joe 25:04
Obviously you joined a startup, and it's been over two years now. Did you definitely want to join a startup, or did you just see something on LinkedIn, or...?

Timothy Chan 25:16
I've always had an interest in doing startups. I did an MBA many, many years ago, and I actually majored in entrepreneurship, so I was very familiar with startups. And I had come to two conclusions. One is that not all startups are equal, that there is such a thing as a startup that already has success in its DNA. But the second conclusion I came to was that I would never find such a startup that I was comfortable enough joining. I was proven wrong on the latter, because a group of folks who I'd worked with previously at Facebook approached me and asked if I wanted to join a startup. These were folks I had a tremendous amount of respect for and really enjoyed working with, and I knew that success tends to follow certain people. This group was one I was willing to bet my career on. So that's how I got convinced to join Statsig. But there was also the fundamental idea that we can make experimentation and product development tools really easy to use, really accessible, and put them in the hands of small, medium, and large companies. To me, that's just fundamentally something I saw work at Facebook, and I think it's pretty exciting to unleash on the rest of the world.

Richard Joe 26:40
And I'm guessing, I mean, to build an experimentation platform from scratch is quite a thing. Were you highly involved in developing the Statsig stats engine? Can you talk about the development of that?

Timothy Chan 26:56
Gotcha. Yeah, we have a stats engine. We have an internal name for it, Metrics Maestro, but externally we call the results page Pulse Results. We get to show people their experimental results, and we tell people to check their Pulse; that's where the term comes from. But yeah, I was the first data scientist at this company. It was myself and seven engineers, and the CEO is also very much an engineer at heart. So with seven engineers and one data scientist, I was in charge of the data architecture and building the experimentation engine, but also helping design how experimentation fits into Statsig and how results should be displayed to make statistics accessible, in a way that's informative and not misleading. That is actually, I think, one of our biggest challenges as a company. We're putting some pretty powerful statistical tools in the hands of our users, and it's very important to us that they guide people to best practices and to making the right decisions. That's hard to do, but it's something we think about very much day to day, and something I view as one of the focuses of a data scientist at Statsig.

Richard Joe 28:21
So maybe, if you could, and I hope it doesn't get too technical for our audience, or for me to be honest, could you explain your stats engine to a certain degree to our audience, in a way that isn't too jargon-laden?

Timothy Chan 28:36
I can tell you our design philosophy: we try very hard to have no black magic in our stats engine. The statistical methods that we've employed are very much industry best practices, and we know that because we meet with experimentation thought leaders at other big tech companies. We stick with experimentation best practices, we stick with very standard frequentist readouts and methods, and we highly encourage our customers to reproduce our results so that they know how our stats engine works. That can only be done when we stick to standard best practices. We do have a few things that enhance statistical significance, such as winsorization, and we also have CUPED, which is a pre-experiment bias correction. But these sorts of things have become standardized best practices for all top experimentation platforms, so we have incorporated those methods into our tooling.
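
A minimal numpy sketch of the two general techniques mentioned here, winsorization (clipping extreme values) and CUPED (using a pre-experiment covariate to reduce variance), as they are commonly described in the literature; this is illustrative only, not Statsig's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated per-user metric: pre-experiment value (x) and in-experiment value (y).
x = rng.gamma(shape=2.0, scale=10.0, size=n)            # pre-experiment metric
y = 0.8 * x + rng.gamma(shape=2.0, scale=10.0, size=n)  # correlated in-experiment metric

# Winsorization: cap extreme values at a high percentile to tame outliers.
cap = np.percentile(y, 99.9)
y_winsorized = np.minimum(y, cap)

# CUPED: remove the part of y explained by the pre-experiment covariate x.
theta = np.cov(x, y_winsorized)[0, 1] / np.var(x, ddof=1)
y_cuped = y_winsorized - theta * (x - x.mean())

print("variance before:", round(y.var(), 1))
print("variance after winsorization + CUPED:", round(y_cuped.var(), 1))
# Lower variance means tighter confidence intervals for the same sample size.
```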

Richard Joe 29:37
You know, sometimes I'm a bit wary of certain experimentation platforms calling something a winner or a loser too early. Does your platform do that? Does it say, hey, this is ultimately the winner, and what sort of guardrails have you put in place?

Timothy Chan 29:54
I think, first off, most of our customers are doing what I would call product-based experiments. This is a little bit different from the world of marketing and CROs. Product-based experimentation is what Facebook does; it's what you do when you're actually building a product, and you care about things like time spent, retention, and user engagement metrics, and you care about long-term effects. For example, if you're Amazon, converting somebody to a purchase today is maybe an objective, but getting somebody to stay a customer for an entire year should be the goal. That is actually what matters to Amazon and what they care about. So for them, we view that as more of a product-based metric, tracking these long-term things. In these kinds of products, metrics and how the product works can sometimes be complicated and can very rarely be boiled down to a single metric. For example, Amazon probably tracks revenue, but they by no means goal every experiment on revenue. They're looking for things like retention, engagement, whether people have healthy browsing activity, things like that. And sometimes, if you drive revenue in one area, you have to understand that it might come at a trade-off: is it possibly hurting another product? If you are pushing a brand new product, like Amazon video for example, are you possibly taking users away from another product? Sometimes these trade-offs are very important, and that's why Statsig very much encourages folks to look at a suite of metrics. We also encourage folks to dive into their data. So if you have questions on an experiment, we make it very easy to dive in and ask follow-up questions: hey, we're worried about iOS users versus Android users, can we get a split by OS? Or we're worried about whether this is cannibalizing this other product, is there any way we can check on some of its engagement metrics? If you view it through that lens: because Statsig doesn't know your product and doesn't know what you're worried about, we don't make decisions or recommendations for you; we instead make it easy to ask questions and get answers. We think this decision making can sometimes be nuanced, especially when there's a trade-off. Is it better to drive number of sessions, or is it better to drive time spent, things like that? If that's the case, it needs a human. So we try not to rely on automated decision making for a lot of these complex experimental effects. We do try to make it easy to get answers out of the system, and to really understand whether the results can be trusted statistically. But beyond that, we try to present the data to you as if you were looking at a dashboard.

Richard Joe 32:49
Okay, the reason why I ask is because, you know, you talked about making experimentation easy and so forth, and I guess if someone's new to it, they just want to turn it on and let a computer tell them whether it's a winner or not. Just to go back to one of the topics you wrote about in your articles on the website: you talked about 95% statistical significance. Just to clarify, do you have a hard and fast rule about 95%?

Timothy Chan 33:19
I mean, it's absolutely a trade-off with the sensitivity of your experiment. I'm a big fan of 95% confidence intervals. I know folks will criticize it as arbitrary, and it absolutely is, but it's a great, solid default, and I think it's defensible because a lot of people use it. My main observation is that people who criticize 95% confidence intervals usually do it when they're looking at experimental results, in order to convert a non-statistically-significant result into a statistically significant one. Obviously, that's a sort of cheating: you're changing the rules of the game after you've seen the result. So while I'm not against folks who want to set a custom confidence level, whether 90%, 80%, or 99%, feel free, you should do that before you set up the experiment. I think you should also have strong justifications for picking those confidence levels, because, to be clear, those are arbitrary as well. But there are certainly circumstances that warrant it. For example, if you want to really squash your false positive rate because you're exceptionally risk averse, maybe you're working on integrity efforts, or on healthcare and drug treatments, then yeah, 99% is great. On the flip side, if you're more worried about missing out on experimental effects and you're okay tolerating false positives, by all means lower your confidence level. I just don't like it when folks try to change the numbers and the rules of the game and try to bias it one way or the other, especially after looking at the results. So I think 95% is a great, solid default that folks should use and go with until they've learned a little bit more about their experimentation and their risk-reward profiles.
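
To make that trade-off concrete, here is a small illustrative sketch (numbers chosen for illustration, not from the interview) of how the critical value, and therefore the width of the confidence interval around a measured lift, grows as the confidence level is raised:

```python
from scipy.stats import norm

se = 0.8  # hypothetical standard error of a measured lift, in percentage points

for conf in (0.80, 0.90, 0.95, 0.99):
    alpha = 1 - conf                 # tolerated two-sided false positive rate
    z = norm.ppf(1 - alpha / 2)      # critical value for that confidence level
    print(f"{conf:.0%} confidence: z = {z:.2f}, CI half-width = {z * se:.2f}")
# Higher confidence means fewer false positives but wider intervals,
# so more borderline results come back inconclusive.
```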

Richard Joe 35:25
Say, I don't know, you're at 85% or whatever, and you've been running this test for a month, and you're like, okay, I can see that the test results have stabilized. We've got another test on the roadmap that's due to come out on Monday. Do we just hold on to this 95% rule and slow down our test velocity, or should we just declare it a winner? I'm sort of playing devil's advocate here.

Timothy Chan 35:58
It really depends on your tolerance for false positives. I think there's an argument to be made that the control group is the default for a reason: shipping new experiences can sometimes introduce new technical overhead, or changes to the user experience. There are a lot of reasons why, for a given change, you want to give the control group that default. But if you're not worried about that, and you see an experiment where the test group is beating the control, but in a not statistically significant manner, I'm okay with picking a winner, as long as you know it's not a statistically sound result and you're taking a chance, a pretty strong chance, that this is a false positive. If there's really no cost to you, that's an okay way to make decisions as well. But I think in a lot of cases there's a reason why we follow this 95%, why we have implemented confidence intervals and p-values: it's to control our false positive rates, because false positives are costly. If you acknowledge that, you should stick with those rules of the game. And after four weeks, if you haven't seen the effect you needed to, it's also possible that the effect is so small that it's not worth pursuing. That's the other case: if it were really such a great effect, you would probably have picked it up. It's probably neutral, or probably something not noteworthy, so you're not missing out on a lot, in my opinion, in those borderline cases, and it's best to move on to the next idea.

Richard Joe 37:37
You've also got to consider the sort of cost-benefit analysis of productionizing this, yeah.

Timothy Chan 37:45
Yeah, I actually think there are a lot of costs to making a change on a website or on a product, and a lot of things we're not accounting for. One example is that you have existing users who may be used to a certain experience, and now they have to adapt to a new experience. A second one is that all of the other test results you've been relying on were done on that control group experience, so if you do ship a change, does that mean you have to revisit all of your other test results? Possibly. And then the third one that I can think of, and I guarantee you there are more, is that your engineers have optimized the website for a given layout, a given configuration, and a certain level of performance, and making those changes fast and frequently can introduce some thrash and can, in the long term, cause some degradation in performance and other things like that. I'm sure there are a lot of other hidden costs that are really hard to enumerate, but that's why I think one should be careful about controlling those false positives. Now, there are some folks who have a pretty simple and tight website where these changes are very easy to make, and those companies should absolutely play fast with the confidence interval, you know, and go with things like 80% if there's low risk and they just need to try ideas and get things out the door. That's fine. But one should not underestimate the cost of making changes.

Richard Joe 39:25
Alright, thanks for that; it's an ongoing debate in the space. Anyway, thanks a lot for your input on these topics and for telling our audience about your journey at Statsig. How can people contact you, Tim, if need be?

Timothy Chan 39:45
Oh, absolutely. I am active on LinkedIn. My follower count is pretty small; I would love to grow that if possible. You can find me at linkedin.com/in/, then TR Chan, C-H-A-N.

Richard Joe 40:02
Cool, cool. And it's statsig.com, right?

Timothy Chan 40:07
Yeah. You can also find me by following Statsig on LinkedIn, and our website is statsig.com, where we have our own blog as well, and I publish a few of my articles there.

Richard Joe 40:20
Cool. Looking forward to more articles, especially ones that go against the grain. Awesome, thanks for making the show, and, yeah, hopefully we'll see you again in the future.

Timothy Chan 40:34
Richard, thank you so much. It’s been a pleasure.


If you liked this post, sign up for Experiment Nation's newsletter to receive more great interviews like this, memes, editorials, and conference sessions in your inbox: https://bit.ly/3HOKCTK

