Learn from the “compulsive gambler” behind VWO’s Bayesian test engine

A Conversion Conversation with Chris Stucchio

A/B testing, while deceptively straightforward and beneficial, presents many opportunities for misinterpretation and unreliable results. I recently had a wonderfully in-depth conversation with Chris Stucchio. Chris, a self-described “compulsive gambler”, is the mind behind VWO’s Bayesian testing engine (you’d be right to think he is biased). We covered a lot of ground, including his perspective on the mistakes that Experimenters often make around KPI design and statistical independence, and the value of just running tests regardless of your traffic levels.


Rommil: Hey Chris! I’m really stoked to chat with you today. For those who don’t know you, do you mind sharing with our audience a bit about yourself and what you are doing these days?

Chris: Broadly speaking, I’m a quantitative developer. The modern term is “data scientist”, but I like the old school term better.

Right now I’m working on an automated trading strategy in thinly traded stocks and ETFs. In layman’s terms, this means I’m programming my computer to make a lot of small trades in the stock market. On the more technical side, I’m working on automated feature extraction in mixed time-series/event data. Stock market data (among many other things) consists of both time-series (daily open/close) as well as discrete events (earnings, dividends, news).

There are very few public tools that help you understand the hidden relationships between these things, unlike relational systems (see e.g. featuretools) or convolutional ones (deep learning).

“Another big mistake is testing only things you hope are good.”

So, a quick Google search for A/B testing statistics brings up numerous posts that quote you. How does it feel to play such a meaningful role in the Experimentation field?

I’m glad some of my work is out there helping people. A/B testing can play a very important (often transformative) role in organizations that are willing to accept it. It’s also a very easy way that data analysis can involve everyone, rather than just data scientists and analysts.

It’s also very meritocratic for a company that goes all-in on it. In a company where everyone can A/B test their ideas, suddenly the ideas that work best start winning (as opposed to the ideas coming from whoever is the best internal salesman).

I’ve always wondered, do you ever chat with Evan Miller?

Evan and I have had many conversations. One thing that stands out from them is that we both believe the world would be better if there was more of a focus on good data collection, good experiment design, and classical statistical methods.

Deep learning is the cool new toy (and it’s doing a great job within some domains!), but there’s a lot of value that can be gained by using old school battle-tested methods.

Bayesian and A/B testing. Bayesian, it seems, is the new black. Why do you think that is?

I was one of the early proponents of Bayesian methods. For me, there are several advantages of Bayesian methods.

  1. Bayesian methods make your assumptions very explicit. For example, consider the prior: many people criticize Bayesian methods because the prior is an arbitrary choice. However, frequentist methods also have arbitrary choices like these embedded. It’s just harder to tell because they are buried implicitly in the middle of the math rather than stated at the beginning.
  2. Bayesian methods better correspond to what non-statisticians expect to see. At Wingify, when we built SmartStats, one of the key observations we made was that our customers were frequently interpreting a frequentist p-value as a Bayesian posterior probability. If people are going to wrongly interpret a p-value as a posterior probability, I believe it’s better to simply give them the posterior probability.
  3. Bayesian methods make it a lot easier to incorporate prior knowledge or more complex models. On my blog, for example, you can find many modifications of A/B tests that incorporate annoying real-world complications — delayed reactions (send an email today, conversion happens 3 days from now), missing data (the user converts but their internet is flaky so the pixel doesn’t fire), that kind of thing.

At VWO, we built SmartStats which was one of the industry’s first Bayesian A/B testing engines. The major reason we used Bayesian methods is because it was a better fit for how marketers were actually using our A/B testing tool.

Marketers routinely interpret a frequentist p-value as “probability that variation is better than control”. Explaining what a frequentist p-value actually means is quite difficult. Rather than educating marketers about difficult statistical questions (“what is a p-value”), we decided to instead give them what they thought they were always getting: a Bayesian posterior probability.
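To make “probability that variation is better than control” concrete, here is a minimal sketch of one common way to compute such a posterior probability for conversion rates. It assumes a uniform Beta(1, 1) prior on each conversion rate and estimates the probability by sampling. The function name, the prior and the example counts are illustrative assumptions, and this is not a description of how SmartStats itself works.

```python
import numpy as np

def prob_variation_beats_control(conv_c, n_c, conv_v, n_v, draws=200_000, seed=0):
    """Monte Carlo estimate of P(variation CR > control CR).

    Assumes independent Beta(1, 1) priors on both conversion rates
    (an illustrative choice, not taken from the interview).
    """
    rng = np.random.default_rng(seed)
    # With a Beta(1, 1) prior, the posterior of a conversion rate is
    # Beta(1 + conversions, 1 + non-conversions).
    control = rng.beta(1 + conv_c, 1 + n_c - conv_c, size=draws)
    variation = rng.beta(1 + conv_v, 1 + n_v - conv_v, size=draws)
    return float((variation > control).mean())

# Hypothetical data: 75/1000 conversions on control, 100/1000 on the variation.
p = prob_variation_beats_control(conv_c=75, n_c=1000, conv_v=100, n_v=1000)
print(f"P(variation beats control) ~ {p:.3f}")
```

The number printed is read directly as a probability that the variation has the higher conversion rate, given the data and the assumed prior, which is the interpretation marketers were already applying to p-values.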

Similarly, marketers frequently fail to obey the rules of the standard Student’s t-test, such as “no peeking” (only check the test stats once at the end). Instead of imposing rules we knew they would break, we decided to change the rules so that they can keep using the tool the way they want to without breaking things. Bayesian tests typically impose much weaker rules around things like peeking and when you can safely stop the test.

(I realize that you can build frequentist sequential tests to solve this problem also, and I believe that is also a good approach. But part of the reason I’m a Bayesian is because I’m not quite smart enough to do a good job of frequentist statistics.)
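The “no peeking” rule is easier to appreciate with a simulation. The sketch below runs many A/A tests (both arms have the same true conversion rate, so every “significant” result is a false positive) and compares checking a fixed-sample z-test once at the end with stopping the first time any interim check looks significant. The batch size, number of peeks and 10% conversion rate are assumptions chosen only for illustration.

```python
import numpy as np

def peeking_false_positive_rates(n_tests=5000, batch=100, n_peeks=20, cr=0.10, seed=0):
    """Return (rate when checking once at the end, rate when stopping at any significant peek)."""
    rng = np.random.default_rng(seed)
    # Cumulative conversions per arm after each batch of visitors.
    # Both arms share one true conversion rate, i.e. this is an A/A test.
    a = rng.binomial(batch, cr, size=(n_tests, n_peeks)).cumsum(axis=1)
    b = rng.binomial(batch, cr, size=(n_tests, n_peeks)).cumsum(axis=1)
    n = batch * np.arange(1, n_peeks + 1)  # visitors per arm at each peek
    pooled = (a + b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    # Guard against a zero standard error (no conversions at all so far).
    z = np.divide(a / n - b / n, se, out=np.zeros_like(se), where=se > 0)
    significant = np.abs(z) > 1.96  # two-sided test at the 5% level
    return float(significant[:, -1].mean()), float(significant.any(axis=1).mean())

once, peeking = peeking_false_positive_rates()
print(f"check once at the end:          {once:.3f}")
print(f"stop at first significant peek: {peeking:.3f}")
```

Checking once should land near the nominal 5% false positive rate, while stopping at the first significant peek should come out several times higher. That gap is the problem that sequential or Bayesian stopping rules are meant to address.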

There’s a bit of a debate on the applicability of Bayesian to small sample sizes, say, under 50 per branch. Some say we shouldn’t use it, some say it is actually quite good for small sample sizes — what’s your stance? Furthermore, how small a sample size can you go?

I will answer the above questions together. However, these questions are conflating two totally different issues and it’s important to disambiguate them.

Issue A: Statistical power, or how likely it is that you are making the correct decision at the end of any given test. If you have a low sample size, the statistical power is typically low. This means that you might be making the wrong (money-losing) decision a significant fraction of the time.

Issue B: Bayesian vs Frequentist. This question addresses the mathematical framework you set up the test with, and how to interpret the numbers coming out at the end.

I believe that Issue A is more important than B by a lot.

Let me discuss Issue A now.

It is my view that you can go as low as you want in terms of sample size provided you understand what you’re getting. With tiny sample sizes, you will almost never get any kind of “certainty”. The statistical power of your test — whether Bayesian or Frequentist — will be low.

However, you can still make good business decisions more frequently than you would without testing. I wrote a blog post when I was at VWO called Testing for Truth vs Maximizing Revenue which explains the distinction in some detail, but I’ll also give some simple examples here.

Right now I run a stock market trading strategy. A few days ago I bought some shares in ACC and sold short some shares in SCHH. (This is not trading advice! By the time you read this it’s too late to do what I did.) I have no real confidence that this particular trade will make me money. But I have confidence that over the next year at least 55% of trades like this one will make money. I’ll make money in aggregate even if this particular trade loses me money.

The key idea here is to make sure that your decision procedures are designed around your use case. For example, you should use a high powered statistical test on a landing page getting 15,000 visitors/day. Something like this is too important to do anything other than require low p-values or high Bayesian probabilities of success.

However, consider an alternate scenario. Instead of a single landing page, imagine you have 20,000 landing pages with each page getting only 1,000 visitors over its lifetime. (Such things are common with SEO strategies.) On any given landing page one might run the following A/B test. We’ll show variation A to 100 visitors, variation B to 100, and show the remaining 800 visitors whichever one had a higher CR (regardless of p-value or Bayesian posterior probability).

For example, imagine that on each page, one variation has a 7.5% CR whereas the other has a 10% CR. With this procedure, the variation with the 10% CR will win 69% of the A/B tests. The end result is a conversion rate of 69% x 10% + 31% x 7.5% = 9.225% across all the landing pages, whereas if you just picked randomly your conversion rate would be 8.75% across all the landing pages. The net result is roughly 80,000 extra conversions (an extra ~0.5% over 800 visitors/page x 20,000 pages).

How did you calculate that?

I got these numbers by doing a Monte Carlo simulation with Python. Intuitively speaking, here’s what I did. I ran 1,000,000 simulated A/B tests of 10% vs 7.5%, each with 100 visitors per variation. In this simulation, the variation with 10% CR beat the variation with a 7.5% CR 69% of the time.

If you know Python or R, a Monte Carlo simulation is a quick and easy way to answer questions like this. Here’s the code:

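from scipy.stats import binom

# Simulate 1,000,000 A/B tests with 100 visitors per variation and
# report how often the 10% CR variation gets strictly more conversions.
n_sims = 1000000
print((binom(100, 0.10).rvs(n_sims) > binom(100, 0.075).rvs(n_sims)).mean())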

Oh OK. Gotcha.

However, any given landing page only has a 69% chance of displaying the best variation.
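The whole 100/100/800 procedure can also be simulated end to end. The following is a rough sketch under the scenario’s stated numbers (20,000 pages, 10% vs 7.5% CR). The function name is hypothetical, and breaking ties with a coin flip is an assumption of this sketch, since the interview does not say what happens when both variations get the same number of conversions.

```python
import numpy as np

def simulate_pages(n_pages=20_000, explore=100, exploit=800,
                   cr_good=0.10, cr_bad=0.075, seed=0):
    """Simulate 'test on 100 + 100 visitors, then show the winner to the remaining 800'."""
    rng = np.random.default_rng(seed)
    # Exploration phase: conversions observed for each variation on every page.
    good = rng.binomial(explore, cr_good, size=n_pages)
    bad = rng.binomial(explore, cr_bad, size=n_pages)
    # Pick the variation with more conversions; ties are broken by a coin flip
    # (an assumption of this sketch).
    coin = rng.random(n_pages) < 0.5
    pick_good = (good > bad) | ((good == bad) & coin)
    # Exploitation phase: the remaining visitors all see the picked variation.
    exploit_cr = np.where(pick_good, cr_good, cr_bad)
    conversions = rng.binomial(exploit, exploit_cr)
    # Baseline: picking a variation at random on each page.
    baseline = exploit * n_pages * (cr_good + cr_bad) / 2
    return float(pick_good.mean()), float(conversions.sum() - baseline)

win_rate, extra = simulate_pages()
print(f"share of pages showing the better variation: {win_rate:.3f}")
print(f"extra conversions vs. picking at random:     {extra:,.0f}")
```

Because ties are split here rather than counted as losses, the share of pages showing the better variation should come out a little above the 69% figure, and the extra conversions should land in the same ballpark as the estimate above.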

As far as Issue B goes, I would defer to the data scientist building the meta-strategy on this. I will generally approach strategies like this with a Bayesian framework because that’s what I find easiest to understand and improve.

However, there are many frequentist strategies that can work here as well.

So Bayesian is clearly very useful for many use-cases. What are some of the biggest mistakes you’ve seen in using Bayesian for A/B testing?

The biggest mistakes I see are the same as what I see with frequentist methods — overinterpreting results, failing to observe when assumptions are violated, that kind of thing. Bayesian methods are not a panacea and they don’t solve everything. As a concrete example, I still see many people running A/B tests on sessions rather than visitors. (Part of this is caused by an increase in browsers blocking 3rd party cookies and that’s hard to fix.)

Regarding sessions vs visitors, would this not depend on what you’re looking to improve? I.e. something at the user level or at the visit level?

No, because the issue comes from the problem of statistical independence. In almost every A/B test method out there (Bayesian or Frequentist), a fundamental assumption made is that samples are unrelated to each other.

However, if a sample consists of a session, this assumption is violated. Two sessions from the same user are related to each other. The net result is that the effective number of samples in the test is a lot lower than the number you think it is.

Here’s a very extreme (and unrealistic) example that illustrates the issue:

Variation A: User #1 shows up, has a session and converts. User #1 shows up again 49 more times and converts each time.

Variation B: User #2 shows up, has a session and does not convert. He repeats this 49 times.

If you treat this A/B test as having 50 samples per variation, it is statistically significant in favour of A. But that’s pretty silly: it’s literally two people doing the same thing over and over.

In practice, randomization reduces the odds of seeing such an extreme situation but does not eliminate more moderate effects.
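A small simulation can show the more moderate version of this effect. The sketch below first computes the z statistic for the extreme example (50 converting sessions vs 50 non-converting ones), then runs A/A tests in which every user has their own conversion propensity and makes several sessions. The Beta(0.5, 4.5) propensity distribution, 10 sessions per user, and the user-level metric “converted at least once” are all assumptions made for illustration.

```python
import numpy as np

def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic; every sample is treated as independent."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (conv_a / n_a - conv_b / n_b) / se

# The extreme example: 50 "samples" per arm that are really one user each.
extreme_z = z_stat(50, 50, 0, 50)
print(f"z for the two-user example: {extreme_z:.1f}")

def aa_false_positive_rates(n_tests=2000, users=200, sessions=10, seed=0):
    """False positive rates of session-level vs user-level analysis in A/A tests."""
    rng = np.random.default_rng(seed)
    # Each user has a personal conversion propensity (mean 10%), so sessions
    # from the same user are correlated with each other.
    propensity = rng.beta(0.5, 4.5, size=(n_tests, 2, users))
    conv = rng.binomial(sessions, propensity)  # converting sessions per user
    # Session-level analysis: every session counts as one sample.
    s = conv.sum(axis=2)
    z_sessions = z_stat(s[:, 0], users * sessions, s[:, 1], users * sessions)
    # User-level analysis: one sample per user (did they convert at least once?).
    u = (conv > 0).sum(axis=2)
    z_users = z_stat(u[:, 0], users, u[:, 1], users)
    return (float((np.abs(z_sessions) > 1.96).mean()),
            float((np.abs(z_users) > 1.96).mean()))

fp_sessions, fp_users = aa_false_positive_rates()
print(f"false positives, sessions as samples: {fp_sessions:.3f}")
print(f"false positives, users as samples:    {fp_users:.3f}")
```

Both arms are identical here, so any significant result is spurious. Under these assumptions the session-level analysis should flag far more than 5% of tests, while the user-level analysis should stay near the nominal rate.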

This can also be a problem on certain kinds of websites where users interact with each other — e.g., social networks or two-sided markets. I discuss this in more detail in a blog post.

“My general suggestion about designing KPIs is that if the KPI is a ratio, it should always be benefit/cost. Something like visitors/advertising dollar.”

So then, if a KPI is, say, add-to-cart per session? Since the sessions are not truly independent — either we should be looking at add-to-cart per session per unique, or we shouldn’t use this KPI. Thoughts?

My thought on the KPI you’ve described is that you should never use that KPI. The problem with the KPI is that it does not respect Pareto ordering, which is a concept I’ll explain after an example.

Example: You have a site which isn’t that sticky, it’s purely transactional. Users show up and convert 10% of the time. If they don’t convert they never come back. So for 10,000 visitors, you’ll have 10,000 sessions and 1,000 conversions.

Variation B makes the site more sticky. You have 10,000 users coming to the site, and just like before, 1,000 of them convert on their first session. But of the remaining 9,000, 50% return to the site a second time! On their second session, 5% of the returnees (= 9,000 x 50% x 5% = 225) convert.

What does this do to the KPI?

Variation A: 10,000 users, 10,000 sessions, 1,000 add-to-cart. Add-to-cart/sessions = 10%.

Variation B: 10,000 users, 14,500 sessions, 1,225 conversions. Add-to-cart/sessions = 1,225 / 14,500 = 8.4%.

So variation B made the site stickier (users remember it and come back) which is good. Add-to-cart per user went up, which is also a good thing. But the KPI went down, only because the sales increase was smaller than the session increase.

The problem here is you chose a compound metric that doesn’t respect Pareto ordering. You have two metrics, sessions and add-to-cart. Pareto ordering says that if you increase at least one of these things without reducing the other, an improvement has been made. Therefore your KPI should go up.

But this KPI doesn’t respect that: a 45% increase in sessions (10,000 to 14,500) vs a 22.5% increase in conversions is actually considered bad since 1.225/1.45 < 1.
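The arithmetic in this example can be checked in a few lines. The sketch below recomputes both variations from the numbers given in the example and prints add-to-cart per session alongside add-to-cart per user. The variable names are illustrative.

```python
users = 10_000
first_session_cr = 0.10   # share of users who add to cart on their first session
return_rate = 0.50        # variation B only: share of non-converters who come back
second_session_cr = 0.05  # variation B only: conversion rate of returning users

# Variation A: one session per user, nobody returns.
a_sessions = users
a_carts = round(users * first_session_cr)

# Variation B: same first sessions, plus returning users.
b_first_carts = round(users * first_session_cr)
returnees = round((users - b_first_carts) * return_rate)
b_sessions = users + returnees
b_carts = b_first_carts + round(returnees * second_session_cr)

for name, sessions, carts in [("A", a_sessions, a_carts), ("B", b_sessions, b_carts)]:
    print(f"Variation {name}: {sessions:,} sessions, {carts:,} add-to-cart, "
          f"per session = {carts / sessions:.1%}, per user = {carts / users:.2%}")
```

Per user, variation B is better, while per session it looks worse. That disagreement is what it means for the ratio not to respect Pareto ordering.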

My general suggestion about designing KPIs is that if the KPI is a ratio, it should always be benefit/cost. Something like visitors/advertising dollar.

This is a real concern and not just theoretical, incidentally. I once worked at a local news site (Patch.com) that treated page views/visitor as a KPI. Once in a while, a story went viral on Reddit. An influx of Redditors from Michigan and NY came to a Florida local news site to read about “Hot teacher sleeps with a student!” or “Area man opens alligator ranch for extra money, what could go wrong?”. They watched one ad and bounced, reducing page views/visitor, simply because stories like “Road construction on local Florida road” are irrelevant to a Redditor from out of state.

This is free traffic that generated revenue. But because we picked a bad KPI we weren’t able to clearly see that from the numbers.

I’d say most of the KPIs out there don’t respect that. Take AARRR metrics, for instance. Those are very susceptible to this. That’s very interesting. I can totally see how easy it is for Experimenters to incorrectly set up Experiments and misuse Experimentation platforms!

The recent crop of Bayesian A/B testers (of which VWO SmartStats was the first big one) solve two major problems — marketers misinterpreting what p-values are, and the problem of peeking. That’s all they do, and other methods can offer some of the same benefits (for example, Optimizely’s frequentist sequential testing method solves the peeking problem though not the misinterpretation problem).

You can find a ton of online calculators for null hypothesis testing — why do you think there aren’t that many for Bayesian — especially for continuous metrics?

Bayesian methods are usually significantly more computationally intensive than frequentist ones.

A frequentist Z-test can be easily implemented in a few lines of JavaScript and the calculation will run faster than loading the JavaScript. Frequentist tests were originally designed in the early 1900s when “computer” was a job description rather than a machine.
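To give a sense of how little code that is, here is a sketch of a pooled two-proportion z-test. It is written in Python rather than JavaScript to match the other code in this post, and it uses only the standard library. The function name and the example counts are hypothetical.

```python
from math import erfc, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value for a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Hypothetical data: 75/1000 conversions for A, 100/1000 for B.
z, p_value = two_proportion_z_test(75, 1000, 100, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

The whole calculation is a handful of arithmetic operations, which is why it runs effectively instantly in a browser.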

In contrast, Bayesian methods nearly always require massive computational effort. The difference between frequentist and Bayesian can be easily measured by listening to the fan in my laptop. Frequentist calculations are almost always instantaneous, Bayesian computations will usually take minutes if not hours.

None of this is something you want to run in a browser.

“Honestly, I think the biggest mistake is simply not A/B testing, or discounting A/B tests when someone higher up doesn’t like the result.”

I’d argue that most of the CRO industry doesn’t have a strong grasp on Statistics. What is your take and what are some of the biggest mistakes you see CRO folks make?

Honestly, I think the biggest mistake is simply not A/B testing, or discounting A/B tests when someone higher up doesn’t like the result. For example, I know of a case where an executive was obsessed with her vision of “user experience” and favoured eliminating some important risk controls on this basis. A/B test after A/B test showed that users didn’t care about her vision, but it took a very long time for the org to recognize this. Lukas Vermeer (of booking.com) recently made the observation that to do A/B testing well, you need to be willing to eat humble pie and discard your bad ideas. Lots of folks aren’t willing to do that.

“…to do A/B testing well, you need to be willing to eat humble pie and discard your bad ideas.”

This problem is pervasive in the industry. At VWO, after we released SmartStats, several conversion rate optimization agencies complained to us. These companies had the business model of “we’ll improve your site, you only pay us if VWO says the test is a positive”. After we released SmartStats the number of false positives went down dramatically, hurting their revenue. Paras Chopra (VWO’s founder) made the ethical choice here, refusing to give them the false positives they wanted even at the cost of losing significant revenue. It’s something I deeply respect about him.

Another big mistake is testing only things you hope are good. But you should also test potential bad things — if going backward in some direction doesn’t hurt, going forward probably won’t help much.

For example, at a large media company I worked at, I tested what happened if a webpage became 1/10 of a second slower. Slowing things down just a little bit hurt us a lot! Meanwhile, the entire engineering team was focused on shipping new features without much regard to page speed. The conclusion was that engineers should prioritize some boring page speed optimizations (SQL query optimization, smaller page size, merging JavaScript into a single file) over building exciting new features.

Another test I ran there was “what if our customers submitted less user-generated content?” Specifically, I hid 20% of UGC from customers in the “B” group. The company’s strategy at that time was predicated on the belief that increasing UGC was a good growth strategy. However, hiding UGC did not significantly affect engagement, suggesting that this belief was likely wrong.

“Failure after public failure for 6 months isn’t pleasant. In my experience, this is when people decide to stop A/B testing.”

A/B testing certainly hurts the ego. Last year, a product manager and I spent 6 months trying to increase customer retention and customer value. We took every user touchpoint and every best practice we could think of. Each bi-weekly leadership meeting consisted of us saying “A/B test result for our last brilliant idea is FAIL, here’s our next brilliant idea”. Failure after public failure for 6 months isn’t pleasant. In my experience, this is when people decide to stop A/B testing.

On the other hand, we did learn something really important: our customers almost never change their behaviour. Therefore, we need to accurately predict behaviour and focus on lead bullets over silver bullets, to borrow a phrase from Ben Horowitz. We almost tripled our growth after 6 more months of hard work on the right problem.

I absolutely love that. Lead bullets over silver bullets.


Changing topics. I’ve asked this of Georgi, and I’d love your take — what do you think about running multiple experiments at the same time? Is interaction something we should be more concerned about?

My general feeling is that we should not worry about such things. The reason is that when running an A/B test, we are seeking a robust result. We want the lift to persist even if we change a different part of the product later on. If we break the test by tweaking something during the test, the result is probably not robust.

Oh, that would definitely ruffle some feathers in a few places I know! I get what you’re saying, but at the same time, I definitely feel there are times when interactions should be accounted for.

Chris, it’s been amazing chatting with you — to close this off, I thought we could do a Lightning Round!

Bayesian vs Frequentist?

Bayesian.

LOL — I should have guessed. VWO vs Optimizely vs Target?

VWO. I’m 100% not biased at all by the fact that I built SmartStats at VWO, and it’s a Bayesian method.

Giants or the 49ers?

Anyone but the Patriots. That’s about as much football as I know.

What’s your favourite piece of trivia?

Here’s a piece of trivia I use to explain to westerners just how big and confusing India is. In the 1980s and 90s, Sri Lanka had a civil war. India joined the war on the side of the government. The Indian state of Tamil Nadu joined the war on the side of the rebels. It really drives home the “you’re not in Kansas anymore” that they’re feeling.

Finally, describe Chris in 5 or fewer words.

Compulsive gambler.

And with that, Chris, thank you for joining the Conversation!


