Georgi Georgiev: As your testing velocity increases, so does the potential impact of Experiment Interaction
A Conversion Conversation with Web Focus LLC’s Georgi Georgiev
Georgi is definitely one-of-a-kind. He’s able to boil down pretty complex notions in a way that is approachable for the general Experimenter, as well as seriously stress-test industry assumptions. He’s contributed mountains of insight to the Analytics and Experimentation communities and is often cited anywhere A/B testing is mentioned. I had the pleasure of picking his brain about a topic that impacts every Experimentation program that gets off the ground and starts building steam: Experiment Interaction.
Rommil: Hi Georgi! Actually I need to ask, sorry. How should I pronounce your name?
Georgi: Ah, yes, that comes up sometimes. While I’m fine with George (pronounced the usual way) which is the English equivalent of my name, if you want to go for Georgi, then it is pronounced ‘g eh OH r g ee’. A good pronunciation of my full name can be heard here.
Fun fact related to that: if you try and google my full name “Georgi Georgiev” you will find a bunch of Bulgarians with that name, including footballers (soccer players), actors, musicians, writers, and even a judoka and a billiard player. Both my first name and my last name are among the most common in my country.
Ha! I just did that – so interesting. So, speaking about your name: it is all over the A/B testing space, particularly around the topic of statistics. Could you share with my readers a bit about what you do today?
I continue mostly in the same line as the past several years, keeping my focus on data analytics and A/B testing statistics. On the one hand, I consult clients on data analysis, data gathering, statistics, and methods for balancing business risk and reward. On the other, I work on improving the web properties I operate, such as Analytics-Toolkit.com, which offers Google Analytics automation tools as well as many advanced tools for statistical analysis of A/B tests.
All the while I continue to contribute to the larger community by writing blog posts on A/B testing, expanding the glossary of statistical terms in A/B testing, and participating in various CRO/A/B testing projects, the results of many of which will be distributed for free. Some of these I can’t speak about at the moment, but one I’d like to mention is the first edition of the Conversion Elite Awards in London. The event aims to give recognition to the best campaigns, individuals, and tools in CRO. I’m participating there as a judge.
Hey, congrats on the judging gig! That’s pretty cool.
Regarding your contributions to the community, I have to say that they have been very educational for me lately – especially GIGAcalculator.com and Analytics-Toolkit.com. Can you share with us what inspired you to create them? Where does this passion come from? That’s a lot of work.
Thank you for this question, Rommil. The answer for Analytics-Toolkit.com is rather trivial: out of pure necessity! Initially I was doing a lot of Google Analytics work and there simply weren’t any tools automating things the way I wanted them – easy to use and suitable for serious consultants and mid-large agencies dealing with dozens or even hundreds of analytics properties at a time. So I started developing tools based on the Google Analytics API.
Over time my focus shifted to A/B testing statistics where, again, the tooling just wasn’t there back in the day. Most statistical calculators were confusing, poorly documented, lacked crucial functionality, or outright produced wrong results. Even today, even for some of our simpler tools, there are few alternatives which do as good a job. As an example, all our tools have built-in sample ratio mismatch checks, use multiplicity adjustments, and employ a proper statistical model for inferring about or estimating the relative difference (lift).
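For readers less familiar with a sample ratio mismatch (SRM) check, here is a minimal sketch of how such a check is commonly done: a chi-square goodness-of-fit test of the observed traffic allocation against the designed split. The counts and the alert threshold are illustrative assumptions, not taken from Analytics-Toolkit.com.

```python
from scipy.stats import chisquare

# Observed users per arm vs. the designed 50/50 split (illustrative numbers).
observed = [50_210, 49_790]
total = sum(observed)
expected = [total * 0.5, total * 0.5]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")

# A very small p-value means the traffic split deviates from the design,
# i.e. a sample ratio mismatch that should be investigated before the
# test results are trusted. The 0.001 threshold is a common, strict choice.
if p_value < 0.001:
    print("Possible sample ratio mismatch - investigate the assignment logic.")
```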
GIGAcalculator.com is a project that grew out of my frustration with different kinds of calculators that I used mainly for tasks related to statistics and data analysis. For many tasks I couldn’t find tools which work well, are easy to use, and are documented well enough that one can a) be sure they are entering the expected input and therefore getting the correct output, and b) be able to use them without being an expert on the topic. It has also been a great opportunity to widen my perspective on the application of mathematical modelling. You learn a lot when you explore problems as diverse as controlling your body weight, estimating the net present value of an asset, and predicting the adult height of a child given only their age, height, and gender.
You’ve mentioned that some tools out in the wild were producing incorrect results. Along those lines, you’ve written some articles about the mistakes that Experimentation platform vendors have made. Do they ever respond to you?
Yes. When I initially jumped into A/B testing, being lazy as we all are by design, I preferred to make use of the tools others had already developed. However, I quickly started realizing the significant deficiencies they had, particularly regarding the statistical design and analysis of tests. Mind you, this was 2014 and most tools looked nothing like they do today. My way of addressing this issue was to start working on my own platform. In the meantime, many of the existing platforms changed their approach, some rather drastically.
While this led to certain improvements, there were now new fundamental issues with what most tools started doing, which was usually some mix of purportedly Bayesian and frequentist methods, or just straight Bayesian. Forgive me for not being able to grant them the “Bayesian” label, but I’m sure many Bayesians will agree with me on that.
So, naturally, I wanted to bring up the issues, and to do so publicly in order to foster a debate. I didn’t think the issues I brought up could be solved by any individual vendor. So I published several articles and a white paper back in 2016 and sought discussion, with no success. I did it again in 2018, to much the same effect. The discussion hasn’t happened yet, at least to my knowledge.
Well, I’ll see what I can do to spark this conversation. I’ll get back to you if I hear anything!
That said, it’s interesting that they don’t want to discuss the issues you raise. Why do you think that is?
I wouldn’t want to presume why vendors would be shy to respond. Perhaps it is not even on their radar, or perhaps they believe other aspects of their tools are more crucial and therefore require their full attention.
I believe the push for better stats needs to come from clients, from the CRO and A/B testing community using these tools. I think it was client pushback against earlier inadequacies which forced vendors to rework their tools. For example, there was a post titled “How Optimizely (Almost) Got Me Fired” which made the rounds in the industry, as well as “20 lines of code that will beat A/B testing every time”, which kind of forced VWO to compare how their fixed-sample statistics (which they used at the time) measured up against a simple bandit algorithm.
However, many end-users remain ignorant about statistical issues or believe them to be a matter of “philosophical debate”, without realizing their practical consequences.
I hear that. As practitioners of the craft of Experimentation, we need to have a better understanding of our tools. Myself included. Great point.
Let’s change subjects. There were so many topics I thought we could talk about today – you’re so knowledgeable on a lot of them – so it was hard to choose. But I finally decided on one: Experiment Interaction.
I’m happy to cover more than one, if you think it will benefit your audience!
I think we’ll focus on one topic today, but I’ll definitely be taking you up on that offer in the future! I’ve got to keep the content stream flowing, y’know lol
“Briefly, interaction happens when something you do in a variant in one experiment influences the outcome of another experiment.”
So, tell us about interaction. What is it and why should we care?
Interactions between experiment variants are a real issue with many misconceptions one can easily fall into, so I’m happy to discuss it in some detail.
Briefly, interaction happens when something you do in a variant in one experiment influences the outcome of another experiment. A trivial example is if in one test (T1) you are changing the color of a CTA button while in another (T2) you are changing the overall color scheme of the space around it. It is obvious that in such a case the performance of the CTA button variants (T1) is bound to be influenced by the different color scheme tested in test T2. The opposite will most likely also be the case. We have an interaction between test T1 and test T2.
More generally, I classify this as a generalizability issue: the test group is not representative enough of the users who will be using the variant if it is deployed. Statistically, the results are completely valid; it is the external validity which suffers if there are particular interaction effects.
“Statistically, the results are completely valid; it is the external validity which suffers if there are particular interaction effects.”
A natural reaction for many is to avoid the whole thing and isolate the test experiences. Could you explain the issues around doing this? And how did you come to this conclusion?
It is the most natural thing one turns to once they realise that interactions between tests can invalidate the predictive validity of their results. In its crudest form, isolation means simply making sure that no user is exposed to more than one test at a time. Other, more refined systems use layering and so on, but the principle is the same.
The immediately visible issue is that this severely limits throughput, the number of experiments you can run per unit of time. This means either that you will slow down the speed of innovation significantly, or that a great many changes will end up being released untested.
“The less obvious issue is that by isolating tests one can easily end up releasing untested experiences to their end-users.”
The less obvious issue is that by isolating tests one can easily end up releasing untested experiences to their end-users. This is something I only detected once I wrote down specific scenarios in detail, so a simple example will help. Say you run tests T1 and T2 in isolation. Then all users see either the control of T1 with variants of T2, or the control of T2 with variants of T1. Nobody sees a variant from T1 together with a variant from T2. What happens, then, if variant B from T1 wins in T1, while variant B from T2 wins in T2?
You end up releasing these variants together, even though no user ever experienced them at the same time. There is no data backing up this particular combination of choices. The issue gets worse the more tests and the more variants there are.
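To make the combinatorics concrete, here is a tiny sketch (with hypothetical arm labels) enumerating which T1/T2 combinations users actually experience under isolation, versus the combination that ends up being released if variant B wins in both tests:

```python
from itertools import product

t1_arms = ["A", "B"]  # "A" is the control in both tests
t2_arms = ["A", "B"]

# Under isolation, users in a T1 variant see only the T2 control and vice versa,
# so only these combinations ever generate data.
observed_combos = {("A", "A"), ("B", "A"), ("A", "B")}

for combo in product(t1_arms, t2_arms):
    status = "observed" if combo in observed_combos else "never observed"
    print(f"T1={combo[0]}, T2={combo[1]}: {status}")

# If B wins in both tests, the released experience is ("B", "B") -
# precisely the combination no user was exposed to during testing.
```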
So then, not isolating is the way to go, right?
Yes. Whatever you choose to do, do not isolate tests. It’s pretty much the worst solution I can think of.
How serious is this problem? Can we somehow avoid this upfront?
If you run only a few tests with 1–2 variants at a time, even if they concern the same page(s), I don’t think the risk is that big, especially in terms of resulting in suboptimal choices after the tests. Simulations I’ve run show that very specific conditions need to be fulfilled in order for interaction effects to lead to a poor conclusion from one or more tests. However, interactions, to the extent that they exist, will taint the estimates you obtain.
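Georgi’s own simulations aren’t reproduced here, but the following is a minimal sketch of the kind of simulation one could run under assumed conditions: two fully overlapping tests whose true conversion rates interact (the rates are made up), with T1 analyzed marginally, pooling over T2’s arms, the way most platforms would analyze it.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

# Hypothetical true conversion rates per (T1 arm, T2 arm) combination.
# T1's variant B helps under T2's control but hurts under T2's variant B -
# a deliberately strong interaction, purely for illustration.
true_rates = {("A", "A"): 0.100, ("B", "A"): 0.115,
              ("A", "B"): 0.110, ("B", "B"): 0.100}

def simulate_t1_wins(n_per_cell=5_000, n_sims=1_000, alpha=0.05):
    """Analyze T1 marginally (pooling over T2's arms) and count how often B 'wins'."""
    wins = 0
    for _ in range(n_sims):
        conv = {cell: rng.binomial(n_per_cell, p) for cell, p in true_rates.items()}
        n_arm = 2 * n_per_cell
        x_a = conv[("A", "A")] + conv[("A", "B")]
        x_b = conv[("B", "A")] + conv[("B", "B")]
        p_a, p_b = x_a / n_arm, x_b / n_arm
        pooled = (x_a + x_b) / (2 * n_arm)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n_arm)
        wins += (p_b - p_a) / se > norm.ppf(1 - alpha)  # one-sided z-test
    return wins

print(f"T1's variant declared a winner in {simulate_t1_wins()} of 1000 simulations,")
print("even though its true effect depends entirely on which T2 arm a user sees.")
```

Even when the marginal conclusion does not flip, the estimated lift for T1 is a blend of two different effects, which is the tainting of estimates mentioned above.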
For a small number of concurrent tests where all of them go through the approval of one or several people, I think diligence about testing variants with obvious potential for interaction is good enough to reduce the danger significantly. Where two or more such tests are proposed at the same time, the one with lower priority should take a back seat and wait for the other one to complete first. It slows down testing, but reduces the risk, the same as with any other trade-off in A/B testing.
Note that by running tests one after another you fail to test non-winning variants of the first test in combination with variants from future tests. This can lead to missed positive interactions. I would not dare speculate how common these are, but if there are possible negative interactions, surely there must be some positive ones as well.
It’s when we scale Experimentation that we start running into issues, right?
If you run hundreds or thousands of concurrent tests, then running them at the same time with automated monitoring for interaction effects is the only viable solution. Such systems are deployed by the likes of Google, Microsoft, and others, as reported in their papers on the topic.
Yup! Chad Sanderson and Jonas Alves mentioned dealing with this at Microsoft and Booking.com, respectively.
So, in the case that we overlooked or failed to account for a potential interaction, how can we check or be alerted that this is happening? I.e., how can we track this in a practical way assuming numerous experiments running simultaneously?
Post-test monitoring for interaction effects is possible. It is basically the same as doing a proper segment-level analysis of the test results, where you treat each variant from another test as a separate user segment. However, this can get very cumbersome even with a small number of tests, since the number of possible interactions grows combinatorially. For a mid-to-large number of concurrent tests, only a fully automated solution makes sense.
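As a concrete illustration of that segment-level approach, here is a minimal sketch under assumed column names (t1, t2, converted), not Georgi’s own tooling: it fits a logistic regression with an interaction term between the two tests’ assignments, where a significant interaction coefficient flags that T1’s effect depends on which T2 arm a user saw.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 40_000

# Simulated per-user data: arm assignment in each test, plus a conversion flag
# generated with a built-in interaction, purely for illustration.
df = pd.DataFrame({
    "t1": rng.choice(["A", "B"], size=n),
    "t2": rng.choice(["A", "B"], size=n),
})
p = (0.10
     + 0.015 * (df["t1"] == "B")
     + 0.010 * (df["t2"] == "B")
     - 0.025 * ((df["t1"] == "B") & (df["t2"] == "B")))
df["converted"] = rng.binomial(1, p)

# The C(t1):C(t2) term is the interaction; a small p-value on its coefficient
# suggests the two tests do not combine additively.
model = smf.logit("converted ~ C(t1) * C(t2)", data=df).fit(disp=False)
print(model.summary().tables[1])
```

A chi-square test on the cross-segmented conversion counts would serve a similar purpose; the hard part at scale is automating this across every pair (or higher-order combination) of concurrent tests.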
Or at the very least, it’s something to keep in mind as companies test more.
This is a very interesting topic, where could our readers learn more about this pretty overlooked aspect of Experimentation?
My most comprehensive writing on the subject is in Chapter 12.5 of my book, “Statistical Methods in Online A/B Testing”, but it will not necessarily be enough, as it is just a subchapter within a larger chapter on the external validity (generalizability) of test results. Since such issues become serious concerns mainly once you’ve got hundreds or thousands of tests running, I’d look into what’s been published by the big companies in experimentation. Microsoft, Google, Booking, and others have been sharing some of their know-how on the topic over the years, e.g. as proceedings from the KDD conference, white papers, blog posts, etc.
Finally, I love to close with a Lightning Round, so here we go!
In one sentence, how do you feel about Google espousing Bayesian for its testing engine?
They failed to address one of the main issues I had with other vendors back in 2016: give enough control and enough information to the user to correctly interpret the output of their tool. If I can’t hope to understand what their engine outputs, how can the average user? In a way, the problem is that Optimize claims to be Bayesian, but it isn’t.
What is the most misunderstood concept in A/B Testing statistics?
I’d say what people get wrong most of the time is the concept and role of statistical power, including the concept of the so-called ‘minimum detectable effect’ which is integral to it.
Guilty as charged! That’s the amazing thing about Experimentation: you meet great people and learn every day.
OK Georgi, truthfully, if you couldn’t work in Experimentation, what would you be doing today?
There were many paths before me when I took up experimentation seriously, but I honestly can’t realistically imagine taking any other direction. If I were barred from doing this, I’d be working in information retrieval (search engines) or in marketing and consumer psychology.
I don’t know why that doesn’t surprise me. It definitely feels like that’d be in your wheelhouse.
Finally, what does Experimentation mean to you in 3 words?
Optimal risk management.
Awesome. Georgi, thank you.