A Conversion Conversation with Convoy’s Chad Sanderson
While experimentation can be used to collect all sorts of interesting data, one of its most important use cases is understanding the potential risk of a new feature.
Having worked at places like Convoy, Microsoft, and a bunch of other big names you've probably heard of, Chad joins me this week to chat about how experimentation can help companies protect themselves from undue risk, and how he has supported experimentation at scale.
Rommil: Hi Chad, thanks for taking the time to chat with me! How about we start with a little about yourself and what you do over at Convoy?
Chad: Currently, I am the Head of Product for Experimentation and Data Platform at Convoy. I serve as the product lead for our experimentation service, machine learning framework, and data warehouse. Experimentation is an important part of CI/CD at Convoy and helping push both innovation and safety at a start-up in hypergrowth is an awesome place to be at this point in my career. In the past, I’ve worked in Experimentation and Personalization for Microsoft, SEPHORA, Subway, and Oracle.
Having worked in diverse industries, what, in your opinion, are the essential elements of a successful experimentation program?
I see the most valuable form of experimentation as a product, not a program. Marketers and product teams should first focus on building features their customers will love and using the experimentation tool, which could be internal or external, as a vehicle to ensure business metrics are not damaged when shipping new code. A great experimentation product scales with the organization, allowing anyone to run A/B tests at a massive scale.
Has anyone ever said to you, “We’re going to launch it anyway, why do we have to test it?” If so, how do you respond to that?
“Why do we have to test it?” is a common question. I usually reply, “If you knew your feature would lose the company $100M, would you still launch it?” The answer is always no. “What about $1M? $500K? $100K?” What is the threshold at which you would not deploy a feature that lost the company money? Most often this number is significantly lower than $100M, and a loss that small would be difficult to discern from the cyclical noise of an analytics dashboard alone. I present experimentation as a tool to safeguard the company against monetary loss, and by extension to protect people's jobs.
In your opinion, what is the biggest misconception around Experimentation?
I think there is a fundamental disconnect between what experimentation can do and how it's used. Experiments alone have very little predictive capability when it comes to the lasting impact of a feature. A feature that generates $50K during a month-long experiment may generate $1M over the course of a year, or half that, less than half, double, or zero. Treating confidence intervals as declarations about the future is the surest way to have P&L teams scratching their heads at your multi-million-dollar optimization claims.
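Chad's point about confidence intervals can be made concrete with a small sketch. The numbers below are purely illustrative (they do not come from the interview): a month-long experiment yields an interval for the per-user revenue lift during the experiment, and naively multiplying that by twelve both inflates the absolute uncertainty and silently assumes the effect persists all year, which the test never measured.

```python
import math

# Hypothetical month-long experiment (illustrative numbers only):
# per-user revenue lift of treatment over control.
n = 50_000          # users per arm
mean_lift = 1.00    # observed lift in $ per user over the month
std_dev = 40.0      # per-user revenue standard deviation (assumed equal in both arms)

# 95% confidence interval for the mean lift (two-sample, equal n).
se = std_dev * math.sqrt(2 / n)
lo, hi = mean_lift - 1.96 * se, mean_lift + 1.96 * se

# The interval describes the lift *during the experiment*. Annualizing it
# multiplies the point estimate and the uncertainty alike -- and assumes
# the effect persists for twelve months, which the experiment never tested.
print(f"Monthly lift per user: ${lo:.2f} to ${hi:.2f}")
print(f"Naive annualized claim: ${12 * n * lo:,.0f} to ${12 * n * hi:,.0f}")
```

Even before questioning persistence, the annualized range spans nearly a 3x spread here, which is why presenting the point estimate alone as a forward-looking revenue claim is misleading.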
Microsoft, from what I read, runs an incredible number of experiments per month. During your time there, how were you able to run so many experiments without running into each other?
At Microsoft we ran over 25,000 experiments a year. The number of interacting experiments was minimal, and there were alerts in place in case an interaction ever did happen. A good way to think about interactions is that you aren't testing isolated features but versions of your client experience as they will potentially appear in production. If Feature A and Feature B could be deployed at the same time, there is no reason to test them sequentially.
I assume there was a dedicated team to support the platform? Also, how much traffic did you have, if you don’t mind my asking?
We had an internal experimentation team that managed assignment, metric design, and the computation of results. The team that operated the platform was large: over 60 engineers, 40 data scientists, and 10 PMs. The platform facilitated traffic from all of Microsoft's larger orgs: Bing, Windows, Office, Xbox, Azure, Skype, and MSN. Many experiments had tens of millions of users.
Whoa. That’s so cool. So I guess your view is there should be a platform that teams have access to and they run experiments on their own?
I don’t think platform or experimentation service teams should run experiments at all. I see experimentation as a part of engineering infrastructure. In the same way you wouldn’t ask the data warehouse team to define a metric you wouldn’t ask an A/B Testing service to run experiments.
Because experimentation is so complex, whether the tool is first- or third-party, you need a person or team responsible for maintaining it, facilitating education on how to use it, monitoring costs, defining platform capabilities, managing admins and best practices, and so on. When a team builds its own service (and there are very good reasons to do this), an entirely new set of issues arises: What statistical algorithm do we use? How do we support specific use cases like experimenting on ML models? How do we make our UI more intuitive? How do we make sure metrics are computed fast enough to detect site-breaking problems?
Great platform teams can go really deep on these issues and add an enormous amount of business value, both in human time saved and in revenue protected.
Finally, I love asking this question: are you Frequentist or Bayesian?
Both Frequentist and Bayesian statistics have valid use cases, but to me, the discussion is generally irrelevant. Frequentist methods focus on error control, while Bayesian methods make an assessment of the present based on past data (which can sometimes be no data at all). When we examined whether using one method or the other led to practically different decision making, the answer was “not really.” In my opinion, there are far more important and problematic issues to focus on.
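Chad's claim that the two frameworks rarely change the decision can be illustrated with a quick sketch. The data below is hypothetical (not from the interview): the same conversion counts are fed to a frequentist two-proportion z-test and to a Bayesian Beta-Binomial model with uniform priors, and both reach the same ship/no-ship call.

```python
import math
import random

# Hypothetical A/B test data: conversions out of visitors per arm.
conv_a, n_a = 1000, 20000   # control: 5.0% conversion
conv_b, n_b = 1100, 20000   # treatment: 5.5% conversion

# --- Frequentist: two-proportion z-test (error control) ---
p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
# Two-sided p-value from the standard normal CDF.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
frequentist_ship = p_value < 0.05

# --- Bayesian: Beta-Binomial posterior (assessment given the data) ---
# Uniform Beta(1, 1) priors; Monte Carlo estimate of P(B > A).
random.seed(42)
draws = 100_000
wins = sum(
    random.betavariate(1 + conv_b, 1 + n_b - conv_b)
    > random.betavariate(1 + conv_a, 1 + n_a - conv_a)
    for _ in range(draws)
)
prob_b_better = wins / draws
bayesian_ship = prob_b_better > 0.95

print(f"p-value: {p_value:.4f}, P(B > A): {prob_b_better:.3f}")
print(f"Same decision? {frequentist_ship == bayesian_ship}")
```

With effects large enough to act on, both lenses tend to agree; the frameworks diverge mainly in edge cases and in how their outputs should be communicated, which echoes Chad's view that the debate is rarely the bottleneck.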
Thanks so much for taking the time to chat!