A Conversion Conversation with Angela Kang
In experimentation circles, there’s a lot of talk about tools, iteration, and testing velocity. Often overlooked is the importance of a firm understanding of statistics — as a misinterpretation of results can quite literally lead to a million-dollar failure. As organizations start to experiment more, they need to go beyond basic analysis and experimental procedures, especially if they are looking to outlearn their competition. Today, I chat with Angela about her perspective on the importance of statistics in experimentation.
Rommil: Hey Angela, thanks for taking the time to chat. Let’s start with a bit about you and how you got interested in Experimentation?
Angela: Hey Rommil, thank you for featuring me! I’m a recent MSc Statistics grad from the University of Toronto. Prior to graduate school, I gained experience applying statistical principles in various capacities, mainly at financial institutions. Most recently, I worked at Mackenzie Investments in their operations department as an analytics generalist.
Having had an interest in the sciences from a young age, I’ve long been aware of the crucial role experimentation plays. However, it wasn’t until I was responsible for collecting data myself that I came to truly appreciate it. For context, statistics courses usually take the passive route and provide datasets for you. I completely understand the reasoning behind this (class sizes are growing as statistics becomes more popular), but it comes at a cost: students aren’t taught the value of good data-collection methods, including thoughtful experimental design, and are instead incentivized to focus most of their effort on analysis and decision-making.

With that in mind, it’s no surprise that my interest in experimentation developed outside the classroom, during a project to improve the call center experience at Investors Group. I was responsible for measuring and evaluating the solutions we were testing to reduce the number of repeat calls. It was unlike any experiment I had worked on previously: I had a very active role in the data collection process, and I had the ability to modify the goals and incentives of the call representatives. Moreover, the project called for interviews with call center representatives, callers, and other members of the organization as part of the experimentation process. Being able to actually talk to individuals led to a more personal relationship with the data and had the secondary effect of deepening my understanding of the business. It was also my first exposure to the interesting challenge of interpreting data on individuals learning a new behavior.
How important would you say knowledge of statistics is for designing experiments?
Statistics should act as the foundation for the design of experiments; it is needed to establish valid and replicable results. That said, I believe statistics alone is not sufficient to provide a complete picture. Substantial domain expertise is also essential in designing effective experiments: it should guide the experimental design process by identifying potential treatments, evaluating the feasibility of designs, and synthesizing the results obtained.
From what you’ve seen, what are some things that folks running experiments say that make you cringe?
I think there has been a shift toward better statistical literacy as data-related jobs have become more prevalent, but one concept I still occasionally find misunderstood is the p-value. It is not a measure of effect size, nor does it provide evidence for the alternative hypothesis. In plain English, it is the probability of observing the outcome we did, or a more extreme one, assuming the null hypothesis is true. It is simply one measure of evidence, and should therefore be used in conjunction with other measures such as power, confidence intervals, and likelihood ratios. Another point of contention is that we often treat the threshold of 0.05 as some divine number, when in reality it is just an arbitrarily set standard. It can and should be modified to fit the needs of the experiment. Lastly, when testing multiple hypotheses, p-values should be adjusted to reduce the chance of identifying spurious correlations.
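To make the multiple-testing point concrete, here is a minimal Python sketch of two standard p-value adjustment procedures, Bonferroni and Holm. The p-values used are made up purely for illustration:

```python
# Sketch: two standard adjustments for multiple hypothesis testing.
# The p-values below are made up for illustration.

def bonferroni(p_values, alpha=0.05):
    """Reject H0_i only if p_i <= alpha / m (controls family-wise error)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: uniformly more powerful than Bonferroni."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

p = [0.004, 0.015, 0.030, 0.600]
print(bonferroni(p))  # [True, False, False, False]
print(holm(p))        # [True, True, False, False]
```

Holm’s procedure controls the same family-wise error rate as Bonferroni while rejecting at least as many hypotheses, which is why it is often preferred in practice.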
What are some of the biggest traps that one should avoid when analyzing experimental results?
I can’t speak on behalf of everyone, but something I personally have had to grow comfortable with is obtaining results that are not statistically significant. While it is frustrating to put a lot of effort and resources into something only to arrive at inconclusive outcomes, it doesn’t mean that we have to throw away all our learnings. Sometimes it’s just as important to know that the status quo can be achieved with a different solution, especially if this solution is cheaper or aligns better with the brand.
What advice do you have for those looking to experiment but don’t have a lot of data — for instance, they don’t have access to a large audience? What would you suggest they do?
It’s tempting to go after high-impact projects from the get-go, but I believe experiments should be treated as building blocks: evidence should be stacked carefully, piece by piece. This is especially true for organizations that don’t have a well-established culture of experimentation. Rather than testing for a very specific effect you don’t have much data for, it may be more useful to pursue a simpler experiment first. This can be achieved by reducing the number of treatments (e.g. opting for an A/B test rather than an A/B/n test) or expanding the target population (e.g. targeting all mediums rather than just mobile) to get more traffic and acquire more data. Starting with simple experiments also has the side effect of acclimatizing the organization to the process and effectiveness of experimentation. Another way of reducing the amount of data needed is to lower your standards by accepting results with a higher error rate. It depends on what we are testing for, but in a business setting we often want a practical result: rather than spending a lot of money running a test to achieve a narrow confidence interval, it may be sufficient to accept a wider one. Another option is to make use of early-stopping techniques, which is a nice transition into the next question…
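The trade-off between error rate and data requirements can be made concrete with the standard normal-approximation sample-size formula for a two-proportion test. The baseline conversion rate and lift below are made-up numbers, chosen only to show how relaxing the significance level shrinks the required traffic:

```python
# Sketch: required sample size per arm for a two-sided two-proportion
# z-test, using the standard normal-approximation formula.
# Baseline (5%) and target (6%) conversion rates are illustrative.
import math
from statistics import NormalDist

def sample_size_per_arm(p1, p2, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

strict = sample_size_per_arm(0.05, 0.06, alpha=0.05)   # the usual standard
relaxed = sample_size_per_arm(0.05, 0.06, alpha=0.10)  # accept more error
print(strict, relaxed)  # relaxing alpha cuts the required traffic noticeably
```

For a small site, the difference between these two numbers can be weeks of traffic, which is exactly the practical trade-off described above.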
Testing and learning velocity are always of interest to businesses: the faster you learn, the faster you can beat the competition, as they say. Are there any techniques you’d suggest for getting test results faster?
Sequential analysis is a statistical method in which the sample size is not fixed in advance; instead, a stopping rule determines when the experiment is over. It’s popular in clinical trials, where it is ethically and financially beneficial to reach a conclusion early. It’s well understood that naively stopping an experiment as soon as the p-value crosses a desired threshold, a practice known as “peeking” (a form of p-hacking), inflates the chance of false positives. A simple correction such as Bonferroni is not adequate because the interim p-values are not independent (e.g. the results obtained after the second day of the experiment depend on the results obtained after the first day). Instead, sequential testing methods rely on dynamic programming to determine the false-positive probabilities at each stage of the experiment. After setting a maximum length for the experiment and the number of interim analyses, the false-positive probability is allocated across the tests so that the total adds up to the desired threshold. Popular allocation methods include the Pocock and O’Brien-Fleming approaches. If at any interim analysis the test statistic exceeds its boundary, the entire experiment is terminated.
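The inflation from peeking, and the effect of a Pocock-style constant boundary, can be illustrated with a small simulation under the null hypothesis (no true effect). All parameters here are illustrative: 5 looks, 100 observations per look, and 2.41 as the approximate Pocock constant for 5 analyses at an overall 0.05 level:

```python
# Simulation sketch: peeking at a fixed 1.96 threshold inflates the
# false-positive rate; a stricter constant (Pocock-style) boundary of
# ~2.41 for 5 looks keeps it near the nominal 0.05. Parameters are
# illustrative, and seeds are fixed for reproducibility.
import math
import random

def run_trial(looks=5, n_per_look=100, threshold=1.96, seed=0):
    """Simulate one experiment under the null, checking at each look."""
    rng = random.Random(seed)
    total, cum = 0, 0.0
    for _ in range(looks):
        cum += sum(rng.gauss(0, 1) for _ in range(n_per_look))
        total += n_per_look
        z = cum / math.sqrt(total)  # z-statistic on the accumulated data
        if abs(z) > threshold:
            return True  # "significant" -> stop early (a false positive)
    return False

def false_positive_rate(threshold, trials=1000):
    hits = sum(run_trial(threshold=threshold, seed=s) for s in range(trials))
    return hits / trials

naive = false_positive_rate(1.96)   # well above the nominal 0.05
pocock = false_positive_rate(2.41)  # close to the nominal 0.05
print(round(naive, 3), round(pocock, 3))
```

Because each trial reuses the same simulated data for both thresholds, the comparison isolates the effect of the boundary itself: the stricter constant pays a small price at every look in exchange for a controlled overall error rate.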
Wow — I learn something new every day! I definitely have to look up those techniques.
And now it’s time for my favourite segment — the Lightning round! Which do you prefer: Normal or Poisson distribution?
Poisson distribution! I appreciate the elegance of the normal distribution, but the Poisson distribution is commonly illustrated with one of my favorite real-world datasets — the number of soldiers in the Prussian army accidentally killed by horse kicks.
Simpsons the cartoon or Simpson’s Paradox?
I didn’t grow up watching the Simpsons, so Simpson’s Paradox. But I have heard that Disney+ has all 31 seasons on their platform, so my answer may change very soon…
LOL, well I feel ancient! Finally, because I’ve heard you’re studying Japanese, how do you say “statistics” in Japanese?
統計 (Toukei) — which literally translates to “overall measure”
Very cool — thanks for chatting with me!