Making product decisions with Bayesian analysis

By John Ostrowski

In this test, we observed a 4.7% lift and a 90% probability of our variant beating the control.

This is how I communicated the result to the product manager during our test review meeting. Her reaction? “Ok, that’s an easy call, let’s roll out and shift focus to the next test.”

If I have to choose, I’m a Bayesian thinker, and I suggest you consider the same. I believe Bayesian analysis provides much clearer language to explain experimental results. In this short post, I will explain how I used Reforge’s recommended Bayesian process and calculator to analyze a “big bet” experiment for a low-traffic website.

I will walk through an experiment I ran, using anonymized data and imagery because I can’t disclose specifics. That’s one constraint of the experimentation industry that should be discussed. Anyhow, I found creative ways to paint a picture of each step of the process.

Two questions, “How to know when to stop a Bayesian test?” and “How do I decide on sample size for Bayesian analysis?”, were the triggers for me to organize the content into a sequence of 9 questions answered throughout the process.

Who’s this post for? Will I get value from this read?

Data savvy product people will find value in this step-by-step.

I received positive feedback from product managers, growth managers, data analysts, and data scientists. Don’t like the numbers discussion? I understand. Please share it with someone on your team who would get value from it and who can hopefully help you clarify test analysis 🚀

The 9 questions answered:

For exercises like this, I like to look back and codify all the questions I was able to answer throughout the process so it becomes a playbook. Here are the ones I took note of:

  • Can we test the website with 4K weekly unique users?
  • How to run a test with a small sample size?
  • How to know when to stop a Bayesian test?
  • How do I decide on the sample size for Bayesian analysis?
  • How to use a Bayesian calculator for A/B tests?
  • How to choose Bayesian priors?
  • What is the difference for weak priors?
  • What is the probability of our variant being better than the control?
  • How is Bayesian analysis used in decision-making?

Although this may seem a paradox, all exact science is based on the idea of approximation. If a person tells you they know a thing exactly, then you can be safe in inferring that you are speaking to an inexact person.

— Bertrand Russell (1873–1970), British mathematician and philosopher. Adapted.

Can we test the website with 4K weekly unique users?

The short answer is yes. The challenge with testing on a “small sample size” is that you’re always facing underpowered analysis. In simple terms, the statistics don’t help you make many decisions because the majority of tests come out “flat”.

Often the reality is even tougher than the headline traffic number suggests. Even on a website with 10K monthly unique users, only about 65% of them may actually interact with the page, template, or feature being tested. I like to call this “net testing traffic”, and getting it right is purely segmentation work in your analytics tool. Avoid sampled data.

Net testing traffic is most of the time smaller than overall traffic.

Bayesian to the rescue. The Bayesian Jazz is a lot nicer and more fluid than the frequentist orchestra. If I have to choose, I’m a Bayesian thinker. More on that in the “Philosophic discussion and afterthoughts” section below.

How to run a test with a small sample size?

There’s no simple answer, but here is my checklist for that:

  • Test big differences (big bets in the solution spectrum)
  • Don’t test on 5% of the site traffic (avoid “small net traffic”).
  • For the sake of all things holy, don’t test five shades of blue buttons.
  • Don’t test for a vanity metric.
  • Don’t pick a low traffic page/variation.
  • Have a hypothesis that is entirely based on an insight from your digital analytics data (for the first few).
  • Test one step back in the funnel, e.g. not enough homepage traffic? Use enough budget and test at the ads level.

Edited from Avinash’s TMAI #262 and Reforge.

For low traffic websites prioritize big bets first for better chances of detecting impact.

In our example, we were testing a big bet, both restructuring and adding new interactive elements to an experience. To paint the picture of ‘size of difference’, imagine we were testing a homepage difference like the one below:

[Image: two HubSpot homepage screenshots, 2017 (left) and 2021 (right)]

Those are HubSpot homepages over time, 2017 on the left and 2021 on the right. How did I get those? Here’s one trick I use more often than I thought I would: the Wayback Machine.

How to know when to stop a Bayesian test? How do I decide on the sample size for Bayesian analysis?

I couldn’t find any reference in my knowledge base to answer that. I searched Reforge’s process and found no reference there either.

I ended up calculating sample size just as you normally would for a frequentist design. I use the CXL calculator because I like how it gives you the minimum detectable effect (MDE) as a function of time.
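To make “MDE as a function of time” concrete, here is a minimal Python sketch. It uses the textbook normal approximation for a two-sided, two-proportion test, which is an assumption on my side and not necessarily the formula the CXL calculator implements. The 10% baseline rate, the 50/50 split, and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def relative_mde(baseline_rate, users_per_arm, alpha=0.05, power=0.80):
    """Approximate relative MDE for a two-sided, two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference, assuming both arms sit at the baseline.
    std_err = sqrt(2 * baseline_rate * (1 - baseline_rate) / users_per_arm)
    return (z_alpha + z_power) * std_err / baseline_rate


WEEKLY_USERS = 4_000   # from the post: about 4K weekly unique users
BASELINE_RATE = 0.10   # hypothetical baseline conversion rate

# Traffic is assumed to be split evenly between control and variant.
mde_by_week = {
    week: relative_mde(BASELINE_RATE, week * WEEKLY_USERS // 2)
    for week in range(1, 7)
}
for week, mde in mde_by_week.items():
    print(f"week {week}: relative MDE ~ {mde:.1%}")
```

The smallest lift you can reliably detect shrinks with the square root of the sample, so quadrupling the runtime only halves the MDE. That is why big bets matter so much on low traffic.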

Calculating MDE with CXL calculator

But I wasn’t happy with it. It didn’t make sense to me that you would follow the same exact pre-experiment process for Bayesian. It wasn’t Jazz.

Then I found this white paper working with an R package (SSDbain) that attempts to calculate a sample size for Bayesian testing, however:

In the Bayesian framework, updating (Rouder, 2014; Schonbrodt et al., 2017; Schonbrodt & Wagenmakers, 2018) can be seen as an alternative for sample-size determination that does not require specification of the effect size under the alternative hypothesis

So I decided to not calculate the sample size for Bayesian testing.
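A minimal sketch of what “updating” looks like in practice, assuming a beta-binomial model: each new batch of data simply adds conversions to the first beta parameter and non-conversions to the second. The weekly numbers and the flat Beta(1, 1) starting prior are hypothetical.

```python
def update(prior, conversions, visitors):
    """Beta-binomial update: add successes to alpha and failures to beta."""
    alpha, beta = prior
    return alpha + conversions, beta + (visitors - conversions)


# Hypothetical weekly (conversions, visitors) batches for one variation.
weekly_batches = [(230, 400), (241, 410), (225, 390)]

posterior = (1, 1)  # flat Beta(1, 1) prior, an assumption for this sketch
history = []
for conversions, visitors in weekly_batches:
    posterior = update(posterior, conversions, visitors)
    history.append(posterior)
    alpha, beta = posterior
    print(f"Beta({alpha}, {beta}) -> mean {alpha / (alpha + beta):.3f}")
```

Updating week by week gives the same posterior as updating once with all the data, which is why the analysis can be refreshed at any point without a fixed sample size.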

More coming to this section on “Using ROPE and HDI to reach a decision” — Work in Progress.

Bayesian standard stopping rules and decision criteria — HDI, ROPE, and time spent testing.

If I don’t have a pre-defined sample size, how do I know when to stop the experiment? This is where we must talk about HDI and ROPE, both metrics derived from the thousands of Monte Carlo simulations run in the background.

This part gets statistically tricky and intricate very fast, so I’ll go one concept at a time making parallels with our experiment.

Monte Carlo simulation, what is it?

The short answer: draw 10,000 random samples from the beta distributions we created, then check in how many of those samples the solution variation is better than the control, and by how much.

The way I like to understand it is that in risk modeling we’re mostly working with confidence intervals instead of point estimates. Monte Carlo is an engine that allows the manipulation of confidence intervals.
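The simulation described above can be sketched in a few lines of Python. This is an illustration and not the calculator’s actual implementation. The posterior parameters are hypothetical, and the function name is mine.

```python
import random


def simulate_relative_lift(control, variant, draws=10_000, seed=42):
    """Draw paired samples from two beta distributions; return relative lifts."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(draws):
        c = rng.betavariate(*control)
        v = rng.betavariate(*variant)
        lifts.append((v - c) / c)
    return lifts


# Hypothetical posteriors as (alpha, beta) = (conversions, non-conversions).
control_posterior = (1200, 800)
variant_posterior = (1250, 750)

lifts = simulate_relative_lift(control_posterior, variant_posterior)
prob_beat_control = sum(lift > 0 for lift in lifts) / len(lifts)
mean_lift = sum(lifts) / len(lifts)
print(f"P(variant > control) ~ {prob_beat_control:.1%}, mean lift ~ {mean_lift:.1%}")
```

Every number reported later (probability of beating the control, ROPE, HDI) is just a different summary of this one list of simulated lifts.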

ROPE, what is it?

ROPE stands for region of practical equivalence. It is similar to the minimum detectable effect of the frequentist method: the minimum difference we need to see in order to say one variation is better than the other, or to suggest the result is not “by chance.”

A normal ROPE is 10% of the standard deviation of all of the Monte Carlo results for the relative difference between control and variant. Therefore, the ROPE converges as more samples are collected because the standard deviation shrinks.
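As a rough sketch of that rule, the snippet below computes the ROPE half-width as 10% of the standard deviation of the simulated relative lifts, once for an early look at the data and once for a later one. All beta parameters are hypothetical, and the helper name is mine.

```python
import random
from statistics import stdev


def rope_half_width(control, variant, draws=10_000, seed=7, fraction=0.1):
    """ROPE half-width: a fraction of the std dev of simulated relative lifts."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(draws):
        c = rng.betavariate(*control)
        v = rng.betavariate(*variant)
        lifts.append((v - c) / c)
    return fraction * stdev(lifts)


# Hypothetical posteriors: about 500 users per arm early, about 2,000 later.
early = rope_half_width((300, 200), (312, 188))
later = rope_half_width((1200, 800), (1250, 750))
print(f"early ROPE: +/-{early:.2%}, later ROPE: +/-{later:.2%}")
```

With four times the data the half-width roughly halves, which is the convergence described above.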

HDI, what is it?

HDI stands for highest density interval. “High density” is usually taken to mean the 95% of the Monte Carlo distribution for the relative difference between control and variant where the simulated values are most concentrated. This one is visually easy to grasp.
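One common way to estimate an HDI from simulation output is to sort the samples and take the narrowest window that still contains 95% of them. The sketch below does that. The posterior parameters are hypothetical, and a given calculator may compute its interval differently.

```python
import math
import random


def hdi(samples, mass=0.95):
    """Narrowest interval that contains `mass` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    window = math.ceil(mass * n)  # how many samples the interval must cover
    start = min(
        range(n - window + 1),
        key=lambda i: ordered[i + window - 1] - ordered[i],
    )
    return ordered[start], ordered[start + window - 1]


rng = random.Random(11)
lifts = []
for _ in range(10_000):
    control = rng.betavariate(1200, 800)   # hypothetical control posterior
    variant = rng.betavariate(1250, 750)   # hypothetical variant posterior
    lifts.append((variant - control) / control)

low, high = hdi(lifts)
print(f"95% HDI for the relative lift: {low:.1%} to {high:.1%}")
```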

What do we do with those? When to stop a test? HDI vs ROPE vs TIME.

Below are the results from my test’s HDI and ROPE.

[Image: HDI and ROPE results for the test]

“ROPE within HDI” is not a stopping rule as we can see below. Therefore the standard criteria didn’t help my test much.

When to stop a test?

This is where the time dimension becomes a decision criterion. In my case, I had already run the experiment for 5 weeks and I wasn’t willing to extend it further. I had to make a call.

So even without having full confidence from using standard stopping rules, I was able to stop the test, analyze it, and support the product decision.
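To show how HDI, ROPE, and time can be combined, here is a hedged sketch of a decision helper. It follows the commonly used convention of comparing the HDI against the ROPE (HDI fully outside the ROPE means a difference, HDI fully inside means practical equivalence) and adds a time budget as a fallback. The labels, the function name, and the example numbers are hypothetical and are not the calculator’s output.

```python
def decide(hdi, rope, weeks_run, max_weeks):
    """Label a test using HDI vs ROPE, with a time budget as the fallback."""
    hdi_low, hdi_high = hdi
    rope_low, rope_high = rope
    if hdi_low > rope_high:
        return "variant is better"
    if hdi_high < rope_low:
        return "control is better"
    if rope_low <= hdi_low and hdi_high <= rope_high:
        return "practically equivalent"
    # HDI and ROPE overlap only partially: the standard rule is undecided.
    if weeks_run >= max_weeks:
        return "time budget reached: decide on the probabilities"
    return "keep testing"


# Hypothetical values on the relative-lift scale.
label = decide(hdi=(-0.02, 0.11), rope=(-0.003, 0.003), weeks_run=5, max_weeks=5)
print(label)
```

The fourth branch is the situation described above: the interval rule stays undecided, so the agreed runtime becomes the stopping criterion and the call is made on the probabilities.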

How to use a Bayesian calculator for A/B tests?

Calculators available:

➤ I use Reforge’s calculator for my Bayesian analysis.

➤ Javid Jamae suggests this calculator he uses.

What should you look for in a Bayesian calculator for a simplistic test analysis?

Expected, must have:

  • It allows you to choose priors for control and variant
  • It displays both the Monte Carlo histogram and the cumulative distribution function (CDF) of ‘probability of being better than control’.
  • It shows you the observed difference and probability of beating the control

Extras, nice to have:

  • It displays all beta distributions
  • It shows you the Monte Carlo simulations data
  • It gives you an automated decision label based on HDI and ROPE parameters

Reforge calculator, yellow fields are inputs.

Why the emphasis on “Simplistic Analysis”?

Because, as correctly stated by Javid Jamae, these calculators will help you analyze experiments that aim to improve conversion rates.

What do I mean by that? The input metrics to these calculators will be percentages, which are mostly used for conversion rates like conversion to trial, conversion to subscription etc.

Continuous metrics like the average page views per unique user or the average Largest Contentful Paint (LCP) are not covered by this methodology.

Analyzing continuous metrics requires either getting creative on transforming continuous to discrete metrics or diving into more intricate methodologies.
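One way to “get creative” is to turn a continuous metric into a yes/no outcome per page view, so it fits a conversion-style calculator. The sketch below does this for LCP, using the 2.5 second “good” threshold from Google’s Core Web Vitals guidance. The timing data and the helper name are hypothetical.

```python
LCP_GOOD_SECONDS = 2.5  # "good" LCP threshold from Core Web Vitals guidance


def to_binary_counts(values, threshold):
    """Count values at or under the threshold (hits) and over it (misses)."""
    hits = sum(value <= threshold for value in values)
    return hits, len(values) - hits


# Hypothetical LCP measurements in seconds, one per page view.
control_lcp = [1.9, 2.4, 3.1, 2.2, 4.0, 2.6, 1.8, 2.5]
variant_lcp = [1.7, 2.1, 2.3, 2.9, 2.0, 2.4, 3.3, 1.6]

control_counts = to_binary_counts(control_lcp, LCP_GOOD_SECONDS)
variant_counts = to_binary_counts(variant_lcp, LCP_GOOD_SECONDS)
print(f"control (good, slow): {control_counts}")
print(f"variant (good, slow): {variant_counts}")
```

The resulting (hits, misses) pairs can be entered like conversions and non-conversions. The trade-off is that the threshold discards information about how fast or slow each page view was.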

How to choose Bayesian priors? What is the difference for weak priors?

As you see above, my initial analysis basically did not consider priors. With both control and solution set as (10, 10), the priors assume a 50% conversion rate for both, based on only 20 samples. In simple terms, such weak priors have no real impact on the posteriors.

An interesting finding is that my go-to CXL calculator gives me a similar output (97% probability of beating the control) when compared to the analysis above that uses no priors.

Therefore I’m concluding the CXL calculator uses ‘weak priors’ (1,1) as well.
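A quick check of why weak priors barely matter: with a beta prior, the posterior mean is (alpha + conversions) / (alpha + beta + visitors), so a prior worth 2 or 20 pseudo-observations is swamped by a couple of thousand real ones. The observed counts below are hypothetical.

```python
def posterior_mean(prior, conversions, visitors):
    """Posterior mean of a conversion rate under a Beta(alpha, beta) prior."""
    alpha, beta = prior
    return (alpha + conversions) / (alpha + beta + visitors)


conversions, visitors = 1_200, 2_000  # hypothetical observed data for one arm

flat = posterior_mean((1, 1), conversions, visitors)    # Beta(1, 1) prior
weak = posterior_mean((10, 10), conversions, visitors)  # Beta(10, 10) prior
print(f"Beta(1, 1): {flat:.4f}, Beta(10, 10): {weak:.4f}")
```

Both priors land within a tenth of a percentage point of the raw 60% rate, which is consistent with two calculators using different weak priors and still giving similar outputs.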

What’s the problem with it? There’s no real problem, but I was basically using half of what Bayesian analysis has to offer. 

So I went back to the drawing board and calculated priors based on the past 5 weeks of data.

What’s the difference? Note that the control priors changed from 50% (10, 10) to 59.4% (1411, 965). More detailed priors should lead to a better analysis.

What about priors for the variant? How to estimate them? Well, I start by assuming I don’t know how the variant will perform, but at worst it will convert at the same rate as the control, 59%. Because I have lower confidence in this assumption, I make a “larger” (wider) beta distribution of 59% (59, 41), based on 100 samples only.

Note the “Solution” (variant) distribution is also centered around 59% but is ‘larger’, with higher variance.
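The prior construction can be sketched as follows, assuming the beta parameters are read as (conversions, non-conversions). Beta(1411, 965) then corresponds to 1,411 conversions out of 2,376 visitors, a mean of about 59.4%, and the variant prior keeps the same rate but is scaled down to 100 pseudo-observations. The helper names are mine.

```python
from math import sqrt


def prior_from_history(conversions, visitors):
    """Beta prior as (conversions, non-conversions) from historical data."""
    return conversions, visitors - conversions


def rescale(prior, effective_n):
    """Keep the prior's rate but shrink it to `effective_n` pseudo-observations."""
    alpha, beta = prior
    new_alpha = round(alpha / (alpha + beta) * effective_n)
    return new_alpha, effective_n - new_alpha


def beta_mean_sd(prior):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    alpha, beta = prior
    total = alpha + beta
    mean = alpha / total
    return mean, sqrt(mean * (1 - mean) / (total + 1))


control_prior = prior_from_history(1411, 2376)
variant_prior = rescale(control_prior, 100)

for name, prior in [("control", control_prior), ("variant", variant_prior)]:
    mean, sd = beta_mean_sd(prior)
    print(f"{name}: Beta{prior}, mean {mean:.1%}, sd {sd:.1%}")
```

Both priors sit at roughly the same rate, but the variant prior is several times wider, which is how lower confidence in the variant assumption is encoded.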

What is the probability of our variant being better than the control?

Using the numbers from my second analysis, which considers priors, we observed a 4.7% lift in our Overall Evaluation Criterion (OEC).

The ‘probability of beating the control’ comes from the Monte Carlo simulation of 10,000 samples. In our case, there is a 90.8% probability of the variant beating the control.

How is Bayesian analysis used in decision-making?

Because of your Monte Carlo simulations, you can communicate the result in much clearer terms for the team. This is how I did it:

During the test, we observed a 4.7% lift, with a 90.8% probability of beating the control.

In other words, our variant has a:

  • 26.9% probability of being at least 4.3% better
  • 49.7% probability of being at least 3% better
  • 90.6% probability of being at least 0.2% better
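Statements like the ones above are read off the simulated lifts: for each threshold, count the share of simulations where the relative lift is at least that large. A minimal sketch follows, with hypothetical posteriors, so the printed probabilities will not match the figures from my test.

```python
import random


def prob_lift_at_least(lifts, threshold):
    """Share of simulated relative lifts at or above `threshold`."""
    return sum(lift >= threshold for lift in lifts) / len(lifts)


rng = random.Random(2021)
lifts = []
for _ in range(10_000):
    control = rng.betavariate(1200, 800)   # hypothetical control posterior
    variant = rng.betavariate(1250, 750)   # hypothetical variant posterior
    lifts.append((variant - control) / control)

thresholds = [0.002, 0.03, 0.043]
summary = {t: prob_lift_at_least(lifts, t) for t in thresholds}
for threshold, probability in summary.items():
    print(f"{probability:.1%} probability of being at least {threshold:.1%} better")
```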

We can be confident that the variant is at least not harmful. I recommend normalizing the variant via the testing tool while developers map what it takes to normalize the variant across the product and design tasks for the backlog.


Testing with low-traffic products and websites is challenging. I had to get creative and explore different statistical methods to bring more confidence to the Product Manager’s decision-making. Bayesian analysis proved to be a great solution for our scenario, testing with about four thousand unique users.

As a team, we agreed that the language around communicating the outcome was simpler to digest, which led to more curiosity and follow-up questions.

I conclude that Bayesian analysis is friendlier, facilitates decision-making based on the multiple scenarios from the Monte Carlo simulation, and contributes marginally to increasing the interest in testing, a vital step for the experimentation flywheel.

From It takes a Flywheel to Fly: Kickstarting and Keeping the A/B testing Momentum

Community shout outs

🎉 Kudos to Javid Jamae who helped to clarify some of the steps and added a new Bayesian calculator to be used with this system.

🎉 Kudos to Michael Taylor and his Growth Engineers group who were supportive and curious enough to ask questions.

Rommil Santiago