Exploring the suitability of t-tests for A/B tests involving RPV with integrate.ai’s Summer Hu

A conversation with integrate.ai’s Summer Hu about Experimentation

I recently had the pleasure of catching up with Summer, one of a small handful of folks who I truly feel are statistics experts. We spoke about how she leverages experimentation in her personalization work over at integrate.ai and dove really deep into whether Student's t-test is suitable for evaluating the results of tests involving RPV (revenue per visitor).

Fair warning: This interview gets fairly technical, but if you love statistics, you'll dig it.


Rommil: Summer! No one’s name ever suited their disposition like yours. How have you been?

Summer: Thanks Rommil! I've been doing well, except I miss people and the in-person connections during this special time. Hope you've been well too.

Yeah, I’m hanging in there. Definitely miss people for sure. I look forward to when this mess is over.

Can you share a bit about where you are now and what you do there?

I am a Data Scientist at Integrate.ai, a Toronto startup focused on empowering businesses with machine learning-based solutions.

https://integrate.ai/

As some of my readers know, we used to work together at a Toronto startup. I wanted to do a bit of a walk down memory lane and ask you what are some of the biggest lessons you learned during our time there?

Absolutely! From an experimentation perspective, two of the lessons I learned are:

  1. When testing new experiences, run experiments long enough to account for novelty effects, which can be significant and deceiving.
  2. Event logs guarantee a single view of success over time, whereas relying on data in the production database may lead to inconsistent metrics, since that data can change over time.

I feel that those lessons were learned the hard way, but we’ll move on.

Summer, what excites you most about working at integrate.ai?

Delivering impact to customers is always an exciting thing, but what excites me the most is how we are thinking of doing it. In particular, we are introducing more privacy-oriented mechanisms into our machine learning pipeline to ensure personal information is well protected. It is not a trivial investment but is the right thing to do.

You’re giving me flashbacks to when we were working through GDPR, haha!

At integrate.ai, you folks are working on some seriously complex stuff. How are you leveraging experimentation?

For use cases that are tied to business performance, we run A/B tests against the business KPI and measure the lift. This lets us make apples-to-apples comparisons between two options so that we can draw theoretically sound causal inferences.
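For readers who want to see what this looks like in code, here is a minimal sketch of measuring lift on a KPI between a control and a treatment group and checking it with Welch's two-sample t-test in SciPy. The groups, distributions, and effect size are made up for illustration; this is not a description of integrate.ai's actual pipeline.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Hypothetical per-user KPI values (e.g., revenue per visitor) for each group.
control = rng.exponential(scale=10.0, size=5000)
treatment = rng.exponential(scale=10.5, size=5000)

# Relative lift of the treatment mean over the control mean.
lift = treatment.mean() / control.mean() - 1

# Welch's two-sample t-test (does not assume equal variances between groups).
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

print(f"control mean:   {control.mean():.2f}")
print(f"treatment mean: {treatment.mean():.2f}")
print(f"lift:           {lift:+.1%}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```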

Makes sense. I’d expect no less.

I know you guys are deep into personalization and that there are so many ways to fail at it. What kinds of experiments can companies run to ensure they deliver value to customers?

Given the reasons mentioned above, if the technical capability allows, A/B testing is the way to go. That said, there are two types of A/B testing: the traditional type and the multi-armed bandit type. The former assigns traffic to each group with a predetermined ratio, while the latter lets the algorithm determine and change the ratio dynamically so that the group showing early signs of winning gets more traffic. If the pattern of the metrics is relatively consistent and the team is looking to achieve an optimization goal on a tight schedule, then a multi-armed bandit would be a good option, as it enables companies to materialize or monetize early learnings. There is no free lunch, though: the trade-off is that it can take longer to reach statistical significance, especially compared with a 50–50 split A/B test.
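To illustrate the difference Summer describes, here is a rough sketch of my own (not a description of any particular product) contrasting a fixed 50–50 split with a simple Thompson-sampling bandit that shifts traffic toward the better-looking arm; the conversion rates are invented.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
true_rates = [0.10, 0.12]   # hypothetical conversion rates for arms A and B

# Traditional A/B test: a fixed 50-50 split would simply assign
#   arm = rng.integers(0, 2)
# to every visitor, regardless of how the arms perform.

# Thompson-sampling bandit: keep a Beta posterior over each arm's conversion
# rate and route each visitor to the arm whose sampled rate is highest, so
# the arm that looks better so far gradually earns more traffic.
successes = np.ones(2)   # Beta(1, 1) priors
failures = np.ones(2)
traffic = np.zeros(2)

for _ in range(10_000):
    arm = int(np.argmax(rng.beta(successes, failures)))
    converted = rng.random() < true_rates[arm]
    successes[arm] += converted
    failures[arm] += 1 - converted
    traffic[arm] += 1

print("share of traffic per arm:", traffic / traffic.sum())
```

With numbers like these, the bandit should end up routing most of the traffic to the stronger arm, which is exactly what lets a team monetize early learnings but also what starves the losing arm of the samples needed to reach significance quickly.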

I was going to mention duration, but you beat me to it.

Let’s change gears to something more technical. You recently looked into the suitability of using the t-test for RPV. Many bloggers say that this is not a good approach. Can you share with us why they think that and what are your thoughts?

From the blogs I've seen so far, many expressed concern that the distribution of revenue data often violates the normality assumption of the t-test. As a result, the test's statistical power can be greatly compromised (i.e., it is less likely to show statistical significance when the desired lift actually exists). This reasoning is not incorrect, but it could be more specific. In fact, not all cases of non-normality are concerning. If we think about it, non-normality isn't unique to RPV; it also applies to conversion data, which consists only of 0s and 1s. Yet from the simulations I ran, conversion rate tests can achieve a desirable level of statistical power (80%) with a reasonable sample size, while RPV tests don't always achieve sufficient power even when the sample size is maxed out. The deciding factor here is the long tail, or, more rigorously, the severity of the skewness and/or outliers. In my simulations, given the same sample size, the scenarios with relatively less (but still a decent amount of) skewness can sometimes achieve roughly twice the power of the more skewed ones. Although these results come from the specific scenarios I simulated, they show how much skewness can swing the statistical power.
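To give a sense of what such a simulation might look like, here is a rough sketch of my own (not Summer's actual code); the distributions, conversion rates, and 10% lift are all assumptions chosen only to show how skewness erodes t-test power at a fixed sample size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

def power(sample_a, sample_b, n, alpha=0.05, sims=1000):
    """Fraction of simulated experiments in which a one-sided Welch t-test
    detects group B > group A at significance level alpha."""
    hits = 0
    for _ in range(sims):
        a, b = sample_a(n), sample_b(n)
        hits += stats.ttest_ind(b, a, equal_var=False,
                                alternative="greater").pvalue < alpha
    return hits / sims

n = 5000

# Conversion rate: Bernoulli 10% vs 11% (a 10% relative lift).
conversion_power = power(lambda n: rng.binomial(1, 0.10, n),
                         lambda n: rng.binomial(1, 0.11, n), n)

# Revenue per visitor: mostly zeros plus a heavy-tailed lognormal spend,
# with the same 10% relative lift in the mean.
def rpv(n, scale):
    buyers = rng.binomial(1, 0.10, n)
    return buyers * rng.lognormal(mean=3.0, sigma=1.5, size=n) * scale

rpv_power = power(lambda n: rpv(n, 1.00), lambda n: rpv(n, 1.10), n)

print(f"power of the conversion test: {conversion_power:.2f}")
print(f"power of the RPV test:        {rpv_power:.2f}")
```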

So do you have any advice around what to use when analyzing RPV?

The first thing we could do is reduce the 0s by only counting users who actually received the experience being tested.
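In practice that might look something like the following pandas sketch; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical experiment data: one row per user bucketed into the test.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "group":   ["control", "treatment", "control", "treatment", "control", "treatment"],
    "exposed": [True, True, False, True, True, False],  # actually saw the experience
    "revenue": [0.0, 25.0, 0.0, 0.0, 12.5, 0.0],
})

# Restricting to exposed users removes the structural zeros contributed by
# users who were bucketed but never received the experience.
exposed = users[users["exposed"]]
rpv_by_group = exposed.groupby("group")["revenue"].mean()
print(rpv_by_group)
```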

Secondly, I'd suggest first assessing the severity of the existing skewness. A not-so-rigorous-but-useful way of doing that is simulating experiments based on the current distribution of RPV to see whether a desirable level of power can be achieved with a reasonable sample size.

If the skewness is severe enough, an alternative test, the (one-sided) Mann-Whitney U test, would be worth considering. In the specific scenarios I simulated, it can achieve more than twice the power of the t-test and stays consistent in statistical power across different levels of skewness. However, it comes with a significant trade-off: it doesn't address the same hypothesis as the t-test. While the (one-sided) t-test is used to justify "RPV in one group > RPV in the other", the Mann-Whitney U test can only be used to justify "one group is stochastically greater than the other". Unfortunately, the latter statement doesn't guarantee the former and is hard to interpret intuitively. That said, as far as I understand, there's a nice inference from stochastic dominance that we can leverage: if one group is stochastically greater than the other, then RPV in that group is greater than or equal to RPV in the other. Compared to the inference from the t-test, this is definitely a weaker one, but given the constraints from skewness and limited sample size, it may be a sensible compromise in certain contexts.

On a separate note, some might be thinking of applying a square root transformation or a log transformation (after adding a positive constant to the data) in an attempt to reduce the skewness. These are valid techniques but are not suitable for A/B testing; in particular, they alter the hypothesis, so the inference is no longer necessarily about the averages. There might be other alternatives out there, though. I would love to hear from anyone who has solved this problem.
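For anyone who wants to try the comparison themselves, here is a rough sketch (mine, under assumed distributions and an assumed 10% lift) that estimates the power of a one-sided Mann-Whitney U test alongside a one-sided Welch t-test on the same skewed RPV data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

def rpv(n, scale):
    """Hypothetical RPV data: many zeros plus a heavy-tailed spend."""
    buyers = rng.binomial(1, 0.10, n)
    return buyers * rng.lognormal(mean=3.0, sigma=1.5, size=n) * scale

def estimate_power(pvalue_fn, n=5000, lift=1.10, alpha=0.05, sims=1000):
    """Share of simulated experiments in which the given test rejects at alpha."""
    hits = 0
    for _ in range(sims):
        a, b = rpv(n, 1.00), rpv(n, lift)
        hits += pvalue_fn(a, b) < alpha
    return hits / sims

t_power = estimate_power(
    lambda a, b: stats.ttest_ind(b, a, equal_var=False,
                                 alternative="greater").pvalue)
u_power = estimate_power(
    lambda a, b: stats.mannwhitneyu(b, a, alternative="greater").pvalue)

print(f"one-sided Welch t-test power:    {t_power:.2f}")
print(f"one-sided Mann-Whitney U power:  {u_power:.2f}")
```

Keep in mind the caveat from the interview: the Mann-Whitney U test answers a different question (stochastic dominance), so higher power here is not a free upgrade over the t-test's inference about averages.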

Experiment Nation has many statisticians. I’m sure some will have an opinion!

But you know what doesn’t use t-tests? Bayesian. But let’s save this chat for a future interview.

I've grilled you enough for today lol. It's now time for the lightning round!

Data science or Data Engineering?

The intersection. Without data engineering, data science can hardly follow best practices. Without data science, the full power of data engineering can't be unleashed.

Choose: Python or R?

Python.

Bayesian or Frequentist?

Frequentist.

That's OK. We can still be friends, lol. Finally, favourite lunch spot?

Momofuku noodle bar.

Great choice. I miss that place. Thanks so much for chatting with me today!

You are very welcome! Stay safe, and I look forward to chatting more about experiments.
