A Conversion Conversation with Microsoft’s Aleksander Fabijan
Getting Experimentation off the ground is hard enough; scaling an Experimentation Program is another challenge entirely. Often, you will run into bandwidth, automation, platform, and culture-related issues. I recently had the chance to talk to Microsoft’s Aleksander Fabijan about the “Experimentation Growth Model” he developed, his thoughts on Scaling Experimentation, and why it’s so hard to find strong talent in this field.
Rommil: Hi Aleksander, thanks for taking the time to chat! How are you?
Aleksander: Thanks for connecting! I’m doing well given the times we are in right now. How about you?
I’m well, thank you for asking!
Could you share with our readers a bit about your background and what you do today?
Sure. I’m a Data Scientist at Microsoft’s Analysis & Experimentation team. Together with a number of others on the team, I enable product teams across Microsoft to run trustworthy AB tests (a.k.a. experiments) on our platform. This involves both onboarding product teams to use existing features as well as designing and building new ones to make our lives easier. My background is in Computer Science.
What is your day-to-day like?
I usually get woken up by a soon-to-be threenager that runs her first experiment of the day early in the morning. When I get to work, the first thing I normally do is check how the AB experiments that I’m involved in are doing and investigate if anything strange is happening. The day usually involves lots of interactions with product teams and coaching them on design and analysis of experiments, research and prototyping on new features for the experimentation platform, and most importantly, great conversations with an abundance of very smart people. I come back home to see how the morning experiment concluded and clean it up!
Microsoft is regarded as one of the leaders in Experimentation, running thousands of tests per year. With so much volume, how do you deal with Experiment interaction?
There are a few aspects to this. First, there are several products to run ABs on. Next, within a single product, some feature crews might work on the design, others on ranking, etc. There is also a time perspective here: not everything can be worked on all the time. These factors naturally reduce the chance of an interaction, but they are not sufficient to prevent it. On top of careful planning, the experimentation platform also needs a way to specify which changes must be tested exclusively from others when interaction is suspected. This helps tremendously, but it reduces your statistical power because each exclusive experiment gets a smaller share of the traffic. Finally, if you are in doubt, there are ways to statistically check for interactions between experiments and alert on them. In short, multiple measures help prevent interactions, and guardrails are in place to detect them if they do happen.
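One simple way to picture a statistical interaction check is a two-by-two comparison: users who saw neither change, only experiment A, only experiment B, or both. The sketch below is an illustration under stated assumptions, not the method used on Microsoft's platform. It assumes both experiments randomize users independently, that per-cell means, variances, and user counts are already available, and that samples are large enough for a normal approximation. The names `Cell` and `interaction_test` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Cell:
    """Summary statistics of one metric for one combination of variants."""
    mean: float
    variance: float  # sample variance of the per-user metric
    users: int


def interaction_test(neither: Cell, only_a: Cell, only_b: Cell, both: Cell):
    """Return (interaction estimate, two-sided p-value).

    The estimate is the effect of A among users who also got B, minus the
    effect of A among users who did not get B. It is zero when the two
    changes are purely additive.
    """
    estimate = (both.mean - only_b.mean) - (only_a.mean - neither.mean)
    # The four cells contain different users, so their variances add up.
    std_error = math.sqrt(
        sum(c.variance / c.users for c in (neither, only_a, only_b, both))
    )
    z = estimate / std_error
    # Two-sided p-value under a standard normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return estimate, p_value
```

A small p-value suggests the combined effect differs from the sum of the individual effects. With many experiment pairs being checked, the alerting threshold would need to account for multiple comparisons.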
I’ve always wondered, who has the ability to launch or stop an Experiment?
It depends on the product and the process that has been established by the leaders of the product team. For some products, many practitioners have this ability (they could be engineers, program managers, live-site support, etc.) whereas for others the set could be limited to a smaller group that has completed specialized training and received the necessary approval.
For those who are still starting out in Experimentation — at what point should companies start building their Experimentation platform in-house?
During my Ph.D., Pavel Dmitriev and I published a short paper in which we collected insights from many companies worldwide to understand what their experimentation platform capabilities are and whether they have been using an internal or an external platform. It seems that many companies start with an external platform and then transition towards developing their own. The lack of advanced features, debuggability, and integration with the in-house stack are commonly listed as reasons why most choose to go in-house at some point. In my opinion, some third-party platforms are a great way to get started with experimentation. Correctly writing the code that is needed for assignment, building the pipelines for your data flows, hacking together analysis features, and so on is hard. I can imagine that many companies want to get started quickly and experience the benefits of experimentation as soon as possible. Where it gets tricky is when you want to scale and need debuggability, which is somewhere between Walk and Run on the Experimentation Growth Model described in this paper published during my academic times. To scale, you might need features that are not available on commercial platforms. For example, some commercial platforms don’t check whether the split that you observe in your experiment has a sample ratio mismatch, and ruling one out is a pre-condition for starting any experiment analysis. Furthermore, I’ve heard of cases where variant assignment was buggy and engineers had to spend countless weeks debugging the problem instead of working on the product itself. That said, there is no simple equation for where the tipping point is. Remember also that the two are not necessarily exclusive, e.g. I can see how one might use a third-party platform to solve the problem of assignment and an in-house solution for analyzing the results (or vice versa).
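To make the sample ratio mismatch check concrete, here is a minimal sketch of one common approach: a chi-squared goodness-of-fit test comparing the observed user counts against the configured split. It is an illustration rather than any platform's actual implementation. The function names and the strict `alpha=0.001` threshold are assumptions chosen for the example.

```python
import math


def srm_p_value(control_users: int, treatment_users: int,
                expected_control_share: float = 0.5) -> float:
    """P-value of a chi-squared test that the observed split matches the design."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # With two groups there is one degree of freedom, and the chi-squared
    # survival function then has the closed form erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def has_srm(control_users: int, treatment_users: int,
            expected_control_share: float = 0.5, alpha: float = 0.001) -> bool:
    """Flag the experiment when the observed split is very unlikely by chance."""
    return srm_p_value(control_users, treatment_users,
                       expected_control_share) < alpha
```

For a 50/50 design, `has_srm(50000, 48000)` flags a mismatch, while `has_srm(5000, 5100)` does not. When the check fires, the usual next step is to debug assignment and logging before looking at any metric results.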
You often hear people say, we’ve done a lot of research, or, all of our users say they want this feature. What would be your argument for Experimentation in those cases?
I’m a strong believer in mixed methods when it comes to product development as well as research! Qualitative methods are great for informing product development about what customers say they like (well, mostly what they don’t like). To quantify *how much* they actually like it, however, A/B testing is the right tool. The key is to always run an AB to quantify the impact of the feature on what you expected as well as uncover effects that you did not expect. A change to a Machine Learning algorithm might improve the accuracy of your results, but how does this impact future visits to your site? How about the impact on page load performance, and the impact of that on engagement? In short, to capture these second and third-order effects, always run an AB with a rich set of metrics to learn the whole story of a feature that, allegedly, everyone likes/wants. Also, think about the large investment that you made in developing the feature and how little it sometimes takes to double its value by using testing to make a few beneficial tweaks to it.
Arguably, one of the hardest things to do is to scale an Experimentation program — why is that?
This question has many answers. Two that come to my mind right now are Trust and Culture. First, companies need access to trustworthy platforms that automate the complex workflows that are easy to get wrong manually. For example, statisticians know that you need to use a special method to correctly estimate the variance of a metric that is not computed at the same level as you randomized on. Someone with a computer science background like me might not know this and could incorrectly define a new metric. The platform needs to control for such errors and ensure that the correct computations are done. You are, after all, using experimentation to get a trustworthy decision. Second, there needs to be a hybrid model of evangelization in place: the practitioners working on new features/products as well as the executives defining the product strategy need to be informed about the value of AB testing and the consequences of not investing or under-investing in it. Decisions on whether to ship something or not need to be based on experiment results. Without commitment on both sides, experimentation might not grow.
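One widely used technique for the variance problem described above is the delta method. Suppose users are randomized, but the metric is a ratio such as clicks per page view. Treating page views as independent observations understates the variance, because views from the same user are correlated. The sketch below approximates the variance of the ratio from per-user totals instead. It is an illustration with hypothetical names, assuming one numerator and one denominator total per randomized user, and it is not presented as the specific method any platform uses.

```python
import statistics


def ratio_metric_variance(numerators, denominators):
    """Delta-method variance of sum(numerators) / sum(denominators).

    Each position holds the totals for one randomization unit, for example
    clicks and page views for one user.
    """
    n = len(numerators)
    if n != len(denominators) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 users")
    mean_x = statistics.fmean(numerators)
    mean_y = statistics.fmean(denominators)
    var_x = statistics.variance(numerators)
    var_y = statistics.variance(denominators)
    # Sample covariance between the per-user numerator and denominator.
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(numerators, denominators)) / (n - 1)
    ratio = mean_x / mean_y
    # First-order Taylor expansion of the ratio of two sample means.
    return (var_x - 2 * ratio * cov_xy + ratio ** 2 * var_y) / (n * mean_y ** 2)
```

When every user has the same denominator, the formula reduces to the familiar variance of a mean, which is a quick sanity check on the implementation.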
Can you tell me about the “Experimentation Growth Model” you’ve developed?
When I was first interning with the Microsoft experimentation team in 2016, we observed a pattern that many products followed as they scaled their experimentation program. As part of my academic work, we validated this pattern across other companies and outlined it in this model, covering the technical, organizational and business investments that need to be made to scale experimentation. There has been a lot of interest in this model in the industry as well as in academia, and I’m always impressed to see how others internalize it. For example, one company showed me how they print the “Experimentation Growth Model” on the largest piece of paper that they can get, then use one colour to annotate the current state and another colour for the upcoming quarter. They then exhibit the poster in the executive hallways to be transparent about their investments and progress. If there is one paper that I authored that I’d recommend everyone in the area to read, it would be the one in which we introduced this model.
“Automation is the only way to maintain trustworthiness as you scale.”
Why is automation so important for the “Run” stage?
Automation is the only way to maintain trustworthiness as you scale. Think about it: It’s easy to run a handful of experiments by hand and double-check them for correctness at every step. It is much harder to fully engage with every single AB when there are many of them. Just as a pilot in cruise can focus on monitoring flight conditions, experimenters should be able to focus on their experiments, knowing that the automation will handle the difficult and tedious tasks for them.
“…think of OEC as a small set of metrics that everyone in the organization is striving to improve through a series of experiments.”
Could you describe what an “Overall Evaluation Criteria” is and why this is more effective than, say, individual KPIs?
Ronny Kohavi has published many papers and books on this topic and I’ll let the readers find the exact definitions there. In short, you can think of OEC as a small set of metrics that everyone in the organization is striving to improve through a series of experiments. Typically, an OEC will contain success metrics as well as guardrails that should not be degraded. KPIs, on the other hand, measure volume (e.g. sales, revenue etc.). Often, changes to volume metrics are only visible over longer periods (e.g. quarterly) after a number of other indicators have already changed. They are lagging metrics, whereas OECs are more sensitive so experimenters actually have a chance to impact them.
You’ve stated, “experimentation experts that are needed to set-up and analyze experiments, as well as to build tools and improve the processes around experimentation are scarce.” How scarce are we talking about here and why do you think that is?
True. I’m not sure why this is the case. Education is certainly a factor — while experimentation skills can be obtained in many graduate programs, they are seldom emphasized in e.g. computer science curriculums. Also, many companies, especially those that don’t focus on web properties, don’t invest in growing AB testing and as a result, don’t train experts in this field. Furthermore, when I visit conferences, machine learning and AI seem to be in the spotlight. I hope that AB testing gains more momentum and that the community expands as more companies and practitioners learn about AB testing.
Finally, it’s time for the Lightning Round!
Xbox or PlayStation?
If you couldn’t work in Experimentation — what would you be doing?
Product management… which would probably bring me back to experimentation….
What are the key skills that someone interested in working in Experimentation should develop?
- Basic statistics such as hypothesis testing and distributions
- Skepticism / triangulation
- Data engineering and coding skills
(ABs don’t help if no action is taken on them)
Seahawks or Mariners?
Neither. In the absence of the Supersonics — the Sounders!
How would you describe yourself in less than 10 words?
Fortunate to be able to do what I do.
Thank you, Aleksander, for joining the conversation!