A review of Peeking, Post-Hoc Power, and Confidence

John Meakin

A Conversion Conversation with Vrbo’s John Meakin

Statistics is like snowboarding down a double-diamond — for some, it’s fun and exciting, and for others, it is rife with opportunities to stumble and screw things up. Today, I spoke with John from Vrbo/Expedia about some of the fundamental concepts of A/B testing statistics, his day-to-day as a statistician, and his understandable love of snowboarding.


Rommil: Hi John, how are you? How are things in Austin? I ask because I’m from Toronto and it’s starting to get really cold up here.

John: Hi Rommil, I’m great. Loving it here in Austin. Really enjoy working for Vrbo/Expedia Group.

So John, you’re a statistician. Could you share, with our readers, what Vrbo is about and what a statistician does there day-to-day?

In 1995, Vrbo introduced a new way for people to travel together, pairing homeowners with families and friends looking for places to stay. We were grounded in one purpose: To give people the space they need to drop the distractions of everyday life and simply be together. Since then, we’ve grown into a global community of homeowners and travellers, with unique properties around the world. Vrbo makes it easy and fun to book cabins, condos, beach houses and every kind of space in between. Vrbo is part of Expedia Group.

https://www.vrbo.com/

My day-to-day work is extraordinarily engaging. A lot of time is spent consulting with product and engineering on experimental design and strategy (e.g. how to instrument tests correctly to get an unbiased result, how long to test certain features while still maintaining learning velocity). Despite having quite robust tools and processes for experimentation at Vrbo, getting experimentation right can be challenging for people without a strong background in it, so this is one of my core contributions. When I’m not doing that I’m working on researching methodologies to advance the platform’s capabilities and tools (and more holistically, I set the standards for experimental methodologies across the organization). I work really closely with our experimentation platform engineers who implement all that stuff — they are truly amazing. I also consider myself extremely lucky to have such an interesting and engaging role that allows me to interact with so many talented people across the organization.

What drew you to A/B testing statistics? Was it the statistics or was it the testing?

Definitely the “testing” — or what I’d call the experimental side of things. In a large majority of cases, the statistics behind AB testing is relatively basic (though there are definitely exceptions). I’ve been in some form of experimentation my whole career. Before Vrbo I spent 6 years doing experimental research in education/social science where the statistical models are often more sophisticated — so I enjoy the stats side of things as well of course.

I’ve definitely picked your brain on the statistics of A/B testing. Thanks for that. Other than my stuff, what kinds of questions do you usually get?

The product managers here are very enthusiastic and want to test everything; so I get a lot of questions about how to do that in the best way.

That’s so awesome. Sometimes it takes time (and intense convincing) for some product managers to embrace Experimentation. Go on.

Since we build our tools to provide good guidance on test calling, I don’t get loads of questions about readout statistics (i.e. once a test has completed people generally understand the results and don’t question the stats). I often get questions about how long tests should run and those are the hardest because they require the product managers’ input and understanding.

It goes without saying that statistics isn’t everyone’s strength, mine included. There are a LOT of misunderstandings out there about how to interpret test results.

What do you feel are common misunderstandings about statistics?

To be totally honest, there is a really great paper that describes this in detail. Beyond what is covered there, and quite relevant to A/B testing, I think two of the biggest misunderstandings are around peeking at tests and around the sizing of tests. People think that “peeking” is simply “the act of looking at test results often”; this is not the case. Clearly (like in a sports game), watching it won’t change the outcome.

I disagree. My Raptors know when I’m watching. Go on. Peeking.

People also don’t understand that you can observe results that are significant even if the observed differences are less than the Minimum Detectable Effect (MDE) used to size the test (more on that below). For readers unaware, the MDE is the smallest difference, positive or negative, that the test is sized to detect with statistical certainty.
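As a rough illustration of that last point (an editorial sketch, not Vrbo's tooling), the calculation below assumes a two-sided z-test that was sized exactly for the MDE, with the same standard error under the null and the alternative. The function name is hypothetical. Under those assumptions, the smallest observed difference that reaches significance is a fixed fraction of the MDE.

```python
from statistics import NormalDist


def significance_threshold_as_fraction_of_mde(alpha=0.05, power=0.80):
    """Smallest stat-sig observed difference, as a fraction of the MDE.

    Assumes a two-sided z-test sized exactly for the MDE and a normal
    approximation with one shared standard error (SE).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # significance cutoff, in SE units
    z_power = z.inv_cdf(power)          # extra distance needed for power
    # The MDE sits (z_alpha + z_power) SEs from zero; significance only
    # requires the observed difference to be z_alpha SEs from zero.
    return z_alpha / (z_alpha + z_power)


fraction = significance_threshold_as_fraction_of_mde()
print(f"{fraction:.2f}")  # prints 0.70
```

With 95% confidence and 80% power, an observed difference of roughly 70% of the MDE is already enough to be significant, so significant results below the MDE are expected rather than suspicious.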

“…confidence intervals are estimators (i.e. they are dependent on the formulas and assumptions used in modelling a problem) and they do not allow you to make probabilistic statements about population parameters…”

Continuing with the statistics theme, can you explain what is a “Confidence Interval”? And how does one use them when interpreting the results of an A/B test?

I (or Wikipedia) could give you the cookie-cutter definition, but if I’m being honest, it wouldn’t mean a lot to most folks. I think the thing that people miss, even after hearing/reading the definition, is that confidence intervals are estimators (i.e. they are dependent on the formulas and assumptions used in modelling a problem) and they do not allow you to make probabilistic statements about population parameters. So a narrow confidence interval generally means your estimator is precise and therefore you can have high confidence in the findings; a wider interval means your estimator is less precise and thus you have less confidence.

We use confidence intervals in AB testing to understand how precise our estimators are. For example, a neutral test with a very narrow confidence interval (for example if we’ve collected a lot of data) gives us way more information than a “neutral” test (where we haven’t collected a lot of data) with a wide confidence interval. Often with positive (or negative) tests, we are less concerned with the confidence interval as far as decision making goes (e.g. tests that are positive and statistically significant are usually rolled out regardless of the confidence intervals).
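To make the contrast between the two kinds of neutral test concrete, here is a minimal sketch using a standard Wald interval for the difference between two conversion rates. The function name and the visitor counts are made up for illustration, and the Wald interval is one common choice among several, not necessarily what any particular platform uses.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for (rate_b - rate_a)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Standard error of the difference between two independent proportions.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Hypothetical data: both tests are "neutral" (the interval contains zero).
small_test = diff_confidence_interval(50, 1_000, 51, 1_000)
large_test = diff_confidence_interval(50_000, 1_000_000, 50_100, 1_000_000)

for name, (low, high) in [("small", small_test), ("large", large_test)]:
    print(f"{name}: [{low:+.4f}, {high:+.4f}] width={high - low:.4f}")
```

Both intervals include zero, but the large test rules out all but very small effects, while the small test is compatible with differences of almost two percentage points in either direction.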

There are cases where a test is not able to reach its required sample size. Can you explain the dangers of, after the Experiment has ended, changing the MDE to match the observed effect in order to recalculate your required sample size so that it matches what you collected?

I believe you’re asking about “post-hoc” power; I’d be better at truly answering this question with 45 minutes and a whiteboard. I guess the primary thing to understand is that when you set an MDE it’s a hypothesized value that you wish to detect with statistical certainty. As soon as you begin collecting data the values you observe have nothing to do with the MDE.

For example, before the experiment, the MDE tells you the amount of data you’d need to collect such that, if the true difference is actually zero, then only 5% of the time you’d collect data where the observed difference is “stat-sig” while at the same time if the true difference equals your MDE then 80% of the time you’d collect data where the observed difference is “stat-sig.” Notice I made no claims at all about what the observed differences would actually be; just whether or not they’d be “stat-sig.”
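The relationship described above can be sketched with the textbook normal-approximation formula for a two-sided, two-proportion z-test. This is an illustrative assumption about the test being used, not a description of Vrbo's sizing method; the function name, the 5% baseline rate and the 0.5 percentage point MDE are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Visitors needed in each arm to detect an absolute lift of mde_abs."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # controls the 5% false positive rate
    z_power = z.inv_cdf(power)          # controls the 80% detection rate
    treated_rate = baseline_rate + mde_abs
    # Sum of the per-visitor variances of the two arms.
    variance = (baseline_rate * (1 - baseline_rate)
                + treated_rate * (1 - treated_rate))
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


n = sample_size_per_group(baseline_rate=0.05, mde_abs=0.005)
n_half_mde = sample_size_per_group(baseline_rate=0.05, mde_abs=0.0025)
print(n, n_half_mde)
```

Note that the MDE only enters as an input chosen before the test. Halving it roughly quadruples the required sample, and nothing in the formula refers to the difference that is eventually observed.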




On the topic of MDE, how should one decide what it should be?

MDE stands for “Minimum Detectable Effect.” Deciding what the MDE should be can be very tricky and it often depends a lot on the objectives of the experiment. It is one of the most challenging parts of experimentation because it involves input from multiple stakeholders to be truly meaningful. But, at the end of the day, MDE should basically be informed by potential business impact as well as the level of cost and effort to build and maintain the feature being tested.

“…there isn’t really anything special about 95% confidence and 80% power especially not when viewed holistically for online experimentation.”

Why are we so hung up on 95% confidence and 80% power? What’s so special about those numbers?

That’s a great question. There isn’t really anything special about 95% confidence and 80% power, especially not when viewed holistically for online experimentation. Ultimately, these numbers dictate the rate at which you make false inferences, either by claiming differences when they don’t exist or by missing true differences when they do exist. However, confidence thresholds cannot be truly evaluated independently from other factors (e.g. the risk associated with false inferences as well as the expected rate at which experiments result in true differences). For example, a large company running hundreds of small experiments concurrently on millions of visitors with a relatively low probability of getting wins on any given one should probably not be using the same confidence thresholds as a startup running a few experiments a month on low traffic. That said, without hiring a trained statistician who can communicate with business leaders and come up with more robust thresholds when appropriate, 95% confidence and 80% power are fairly standard and a good starting point.
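One way to see why the expected win rate matters is a simple application of Bayes' rule. The sketch below is a deliberate simplification: it assumes every experiment either has no true effect or has a true effect exactly equal to the MDE (so the stated power applies), and the function name and win rates are hypothetical.

```python
def share_of_significant_results_that_are_real(win_rate, alpha=0.05, power=0.80):
    """Probability that a stat-sig result reflects a true difference.

    win_rate is the assumed share of experiments with a real effect.
    """
    true_positives = power * win_rate          # real effects that are detected
    false_positives = alpha * (1 - win_rate)   # null effects flagged anyway
    return true_positives / (true_positives + false_positives)


low_win_rate = share_of_significant_results_that_are_real(win_rate=0.10)
high_win_rate = share_of_significant_results_that_are_real(win_rate=0.50)
print(f"{low_win_rate:.2f} {high_win_rate:.2f}")  # prints 0.64 0.94
```

Under these assumptions, the same 95%/80% thresholds leave about a third of significant results as false alarms when only one experiment in ten has a real effect, which is the kind of consideration that might justify different thresholds for different organizations.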

Finally, it’s time for the Lightning round!

Cowboys or the Texans?

Haha — Neither, I’m an Eagles fan!!! Fortunately, here in Austin, this isn’t actually much of a debate either; everyone here is all about the Longhorns. I’m even becoming a fan (Hookem! — sorry Buffs :/)

If you couldn’t live in Texas, where would you live?

I’ve always wanted to live somewhere where I could snowboard 50+ days a year.

If you feel like an adventure, Montreal is great for that. Just a suggestion.

If you couldn’t be a statistician, what would you be?

Hmm — A data scientist — is that too close to what I do now to even count as an answer? I guess if I had to pick something completely unrelated to what I do now, I would be a professional/extreme backcountry snowboarder.

Oh, nice. I haven’t snowboarded in years — I definitely miss it.

“I don’t believe in claims that Bayesian methods can somehow shorten the duration you need to run tests for without making tradeoffs on certainty and the assumptions you build in.”

If you had to pick, Bayesian or Frequentist for A/B testing? (No cheating)

These are fundamentally different philosophies for thinking about probability; the differences are deeper than most care to concern themselves with.

I am a Frequentist; however, there are clearly very useful applications of Bayesian methods. From an AB testing perspective, whether you use a Bayesian or Frequentist approach is pretty insignificant as long as neither approach is misused/misunderstood by its practitioners. I don’t believe in claims that Bayesian methods can somehow shorten the duration you need to run tests for without making tradeoffs on certainty and the assumptions you build in.

Fundamentally, doing experimentation right has relatively little to do with whether you approach it from a Frequentist or Bayesian perspective, and everything to do with how each is implemented and how experimental design is carried out in practice.

Describe yourself in 3 words.

Hardworking, Dedicated, Pragmatic

Finally, for those unfamiliar with Austin — where would you suggest tourists go first?

I’ve only been here for 3 years — I still feel like a tourist. I know my colleagues who visit from out of town like Rainey Street.

Very cool. I’ll definitely have to add that to my list of places to visit. And with that, John, thank you for joining the conversation!


