Don’t A/B Test?! featuring Solomon Kahn

AI-Generated Summary

CEO and founder of Delivery Layer, Solomon Kahn shares why he thinks that most companies should NOT be A/B testing and what we should do instead. He also shares details about his recent exchange with Ron Kohavi on LinkedIn on this same topic. What side are YOU on?

Audio

AI-Generated Transcript

(00:00) So before we begin, though: are you pro or against A/B testing? Yeah, I mean, I am generally against it. I'll tell you, I made the case by outlining all of the very serious issues, both theoretical and pragmatic on the implementation side, that I have seen go wrong with A/B testing for a long time.

(00:31) Hi, this is Rommil Santiago, founder of Experiment Nation. On today's episode I debate the merits of A/B testing with Solomon Kahn, the CEO and founder of Delivery Layer. If you're a CRO or an experimenter, you're not going to want to miss this episode. Enjoy.

(01:00) Awesome. Hi everyone, my name is Rommil Santiago and I'm the founder of Experiment Nation, and today we have a great guest. His name is Solomon Kahn, and he is the founder and CEO of Delivery Layer. Solomon, I'd love for you to introduce yourself to our audience. Sure. Firstly, thank you so much for having me; I really appreciate it, and I'm excited to have this chat. I've been a

(01:16) data leader for a long time. I've held jobs at startups, big companies, and consulting firms, and I am currently the founder of a startup called Delivery Layer, which focuses on customer-facing data applications. I used to work at Nielsen. If you think about the stereotype of Nielsen, amazing data, and then customers needing

(01:41) to use web applications and APIs to access that data: those are the types of applications that Delivery Layer's product focuses on. And I'm sure you're going to get into why we're on this podcast talking about A/B testing in just a second.

(02:02) Yeah, for sure. But I think it's clear you have a strong and deep background in data. You've seen it all; you've seen the small, you've seen the big, worked at Nielsen, have your own startup. That's the wide variety of data situations one can face, so I think it's safe to say you're pretty schooled in it.

(02:17) Would that be fair to say? Yeah, I've been around the block a couple of times. At the same time, many people have other experiences than I do, so I don't want to overstate it either, but I've been working in the field for a long time, trying to navigate business decision-making with data.

(02:36) And that's exactly the topic today. Recently you posted a YouTube video, as well as a LinkedIn post, about a topic which would be of great interest to our audience: you made a case against A/B testing. This is Experiment Nation, so of course all my

(02:57) audience is going to go, "what?" You listed a lot of reasons against A/B testing, and I'd love to get into that. So before we begin, though: are you pro or against A/B testing? Yeah, I mean, I am generally against it. I'll tell you, I made the case by outlining all

(03:23) of the very serious issues I felt, both theoretically as well as pragmatically on the implementation side, that I have seen go wrong with A/B testing for a long time. And when you look at the entire state of A/B testing as it is right now in the market, I think it is a very reasonable thing to say:

(03:54) maybe we shouldn't be doing this. Now, I'll take another step back to say that this is Experiment Nation, right? So what I'm absolutely not against is experimentation. In fact, I think we need to be doing more experimentation. Can you get into the nuance between the two?

(04:13) Yeah, because it's important; it's very important. What I've seen a lot of is false confidence, overconfidence, and a belief in very high precision when people are doing A/B tests, and I don't think a lot of that confidence is warranted. Instead, what I talked about in the video and

(04:41) in comments online was more of a "track your metrics and use common sense" approach: as you're doing experiments, watch how things are changing and let that guide your business decisions, as opposed to relying on the results of A/B testing systems, which in many cases will not be as

(05:06) precise as you think they are. So essentially, if we just define experimenting as trying a thing out and seeing what happens, I think we can agree that's something we should all be doing. Absolutely necessary, 100%. When you have an idea, you test it out, and you see whether it works or not in the data,

(05:33) and you should be able to see that. So, 100%. I guess the gap is when you get down to statistics and confidence levels; is that where the issue is, or where one of the big issues lies? Yes, correct. Some of the issues fall under that category. Okay.

(05:58) So we've made some distinction between experimentation and A/B testing, where one is just the classical "let's try a thing" and the other is more stats-driven. What are the criteria, in your mind, for when it's okay to do the stats-driven version? Or is it never okay? Well, "okay" is a big word, right?

(06:24) So here's what I think, and I put this in the video also: there are certain things that are a better fit for this type of testing, like a content headline. You might be a very skilled writer, and you might come up with six different catchy headlines that you think could perform

(06:47) really well for a particular article, but it's almost impossible to be extremely confident in advance which of those six is really going to be the best-performing one. So it would be totally normal for you to put those out and do some sort of A/B test or split test to see which one performs best, and

(07:11) then use the best one. Or tuning parameters in video games: those can be very tricky in achieving whatever goal you're looking for, so using an A/B testing system for something like that, I think, is where it's better suited.
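To make the headline example concrete, here is a minimal sketch of how a split test like that might be read, in Python with SciPy; the headline labels, impression counts, and click counts are all hypothetical:

```python
# Hypothetical split test: one article, six candidate headlines,
# traffic split evenly, clicks and impressions tallied per variant.
from scipy.stats import chi2_contingency

impressions = {"A": 5000, "B": 5000, "C": 5000, "D": 5000, "E": 5000, "F": 5000}
clicks      = {"A": 150,  "B": 210,  "C": 140,  "D": 155,  "E": 290,  "F": 160}

# Build a clicks / non-clicks contingency table and test whether
# click-through rate differs across headlines at all.
table = [[clicks[h], impressions[h] - clicks[h]] for h in impressions]
chi2, p_value, dof, _ = chi2_contingency(table)

for h in impressions:
    print(f"Headline {h}: CTR = {clicks[h] / impressions[h]:.2%}")
print(f"chi-squared = {chi2:.1f}, p = {p_value:.4g}")
# A tiny p-value says the headlines genuinely differ; with a gap as
# large as E's, you would simply ship E.
```

This is the constrained, high-volume kind of decision where the statistics earn their keep: picking the best of six options nobody could confidently rank in advance.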

(07:35) And I think there are other decisions that you make as a business that are far less suited for A/B testing, such as product feature changes. If you think there's an important direction you need to take as a business, or for your product, and you have a deep understanding of the market and your customers and where the world is going,

(07:56) and you put out a product change in support of that, and it comes back with a very slight but statistically significant negative impact on one of your metrics, what do you do? Do you not ship it? This is where many of the decisions made by just looking at an A/B test, with this

(08:21) expectation of precision, that it's going to be able to tell you exactly what the future is going to be, don't hold up. One of the core arguments I made in the video is that a lot of the process around A/B testing was taken from academia, and from very strictly controlled clinical experiments, into the

(08:42) business world, where it doesn't fit as well. In your experiments right now, you might have an effect size that changes substantially, or would have changed substantially, next week, because the world changes. In an academic experiment in the hard sciences, an experiment

(09:01) that you ran a hundred years ago, or a hundred years from now, is going to have the same result. But in business, things change week to week sometimes, so the confidence we've got in our A/B tests depends on when we ran them and the way that we ran them.

(09:20) And think about what some of these tests can even measure. For example, it's very easy to trade off long-term and short-term results: you can add high-pressure sales tactics, or add a bunch of fees at the end of a long checkout process on an e-commerce site, and you'll be able to bump up your average customer value. But what does that do to your brand

(09:40) equity, and how do you measure that in an A/B test? You can't. So there are these very large categories of things that you can't measure well that have very important implications for your business. And when you limit yourself to only launching features that do well in an A/B

(09:59) test, well, what does that even mean? Also, if you have five metrics that you're tracking in an A/B test, and three of them have a very slight improvement, one is neutral, and one is negative, does that mean you do it or you don't do it? This is where I think a lot of companies try to be very

(10:18) scientific about decision making in a way that doesn't fit well with business decision making, and that's where I've seen the failures of A/B testing over the years.

(10:39) I've definitely run into situations in the past, exactly as you say, where you run a feature test and you're looking at a constellation of metrics, guardrails and what have you; some go up, some go down, and you're left with a slightly negative, as you mentioned, statistically significant result, and now you're left with a "huh, what do we do?" The way I view it, and I don't completely disagree with

(10:58) you, there are a lot of things you called out where I relate 100%, my perspective is: okay, we kind of know what it's going to do, and now we have some sort of measure of the risk. And then, using our knowledge of the market, we go, okay, we can live with this, let's go

(11:16) forward. I've never been one of those people to say "if it's a negative result, don't push it out," because sometimes strategically it makes sense; sometimes you have to unlock features for the future. But my perspective is, I'd like to know whether we're going to blow up everything. Is there a slam

(11:32) dunk in there, like we're going to see minus 25%, and then maybe we shouldn't ship it? But that's my perspective; I'd love to hear your take. I think that's totally reasonable, and I think that non-dogmatic way of looking at A/B testing is realistically what people are doing more often,

(11:52) for a couple of reasons. Number one, it is very smart to look and make sure you don't make any disastrous changes when you roll things out. Did you accidentally remove the add-to-cart button with this new code change? So having a sort of

(12:10) canary-in-the-coal-mine approach to A/B testing, where you intend to ship it, and you do some very basic testing to make sure that nothing disastrous happens in your data, not just in your code tests, and assuming that everything is okay, then you ship it and you move on: I think that's a totally reasonable way to

(12:33) approach development and to approach building features, etc. All of that totally makes sense, I think. What I think doesn't make sense is when people start to get extremely pedantic about it. I've seen this many times with product leaders, business leaders, etc.: there's an expectation that things need

(13:00) to go up, and in a complex environment somebody is going to say, "well, if this makes revenue go down, why would we do it?" And then, organizationally, it's very difficult to turn around and say: well, how confident are we, even, that it did make revenue go down? When it said revenue went down by 1%, are

(13:21) we really sure that revenue went down by 1%, or is that statistical noise? Because many of these experiments, A/B tests as we do them in the wild, have statistical noise. Many A/B testing systems aren't always implemented fully correctly; there are race conditions in JavaScript in certain situations, and there are a

(13:44) lot of things like that. Many A/B testing systems are blocked by ad blockers in JavaScript, so the question is: of the 40% of people that use ad blockers online, might their behavior be more than 1% different from the people that aren't blocked in your A/B test, and to what degree could that change things for you?

(14:04) This is where, again, I don't think that being really, really confident about these generally small effect-size changes, in any decision internally, is the right way to be thinking about the business decisions that you're making.
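As a back-of-the-envelope illustration of the ad-blocker point (all numbers here are hypothetical), the lift a testing tool reports only describes the users it can see, while the lift across all users is a traffic-weighted average:

```python
# Hypothetical illustration of ad-blocker coverage bias.
# Assume the A/B tool only observes users without ad blockers.
measured_share   = 0.60   # fraction of traffic the tool can see
unmeasured_share = 0.40   # ad-blocked traffic, invisible to the tool

measured_lift   = +0.01   # lift the dashboard reports (+1%)
unmeasured_lift = -0.02   # assumed behavior of ad-blocked users

# The lift across ALL users is the traffic-weighted average.
true_lift = measured_share * measured_lift + unmeasured_share * unmeasured_lift
print(f"Reported lift: {measured_lift:+.2%}")     # +1.00%
print(f"Lift across everyone: {true_lift:+.2%}")  # -0.20%
```

The dashboard reports +1%, but if the invisible 40% respond even modestly differently, the population-wide effect can shrink to nothing or flip sign, which is exactly why treating a 1% readout as precise is risky.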

(14:28) On the opposite side, testing headlines: "hey, this headline on the ad got us 5x the click-through rate of this other headline." Yeah, go with the better one every time. I think that's one of the

(14:48) harder things, at least with business leaders. They don't want to screw up, they want to cover their butts a lot, but they often don't have that nuanced understanding of things like the caveats in the results; as you mentioned, the ad blockers, all these things that some platforms can't handle,

(15:08) or report on well, or help someone interpret. And without good tools to help guide decision making, people get focused and hooked onto a thing, make decisions off of that, and it just perpetuates. It's like, in the absence of any data, they take any data and make wild decisions. But the

(15:29) one thing we did learn in school is: 95% confidence, that's a good thing. Which is, I like to say, a graceful segue into something you mentioned in your video around 95% confidence, and it's something that you and Ron Kohavi had a little bit of a conversation around. For those who

(15:50) haven’t seen the video and I’ll link it down below um what is the what are the issues or some issues with the 95% confidence sure I’ll I I’ll go through my my what I what I said in the video and Ron’s response and then we can talk a little bit about that so in the video what I said was that most of the ab tests that I’ve seen

(16:09) people have wanted to get to 95% you know A P value of 05 to say that a result is statistically significant and and which is sort of a 95% confidence level and in my experience in business you almost never making 95% confident decisions it’s like what kind of bar are we holding ourselves to what really what business decision are

(16:36) you 95 any major business decision you’re not going to be 95% confident in fact you might want to say it’s the opposite side you might want so so let’s say we okay we should make it 70 or 80% confident but but maybe we should do the opposite maybe we should say if I’m a product manager and I deeply understand

(16:54) my customers in my market and I I have a we have a hypothesis about where the business is going and therefore we’ve built a feature that we want to deploy towards what the company needs to do in in the future shouldn’t you need to be 70% sure that I’m wrong before you ship it as opposed to 70% sure that I’m right before you accept

(17:20) it? Not to interrupt, but it's like, if you're disagreeing with your boss, essentially. It's that situation where they want you to do XYZ and you have to prove them wrong, and that's a very awkward situation. This is Rommil Santiago from Experiment Nation. Every week we share interviews with, and

(17:38) conference sessions by, our favorite conversion rate optimizers from around the world. So if you like this video, smash that like button and consider subscribing; it helps us a bunch. Now back to the episode. Well, let's run with that analogy, because I think it's a good one. Imagine your boss says

(17:56) something like, "I think this is how the future is going to be." If you were to go back to them and say, "well, there isn't a less than 5% chance that you're wrong," what would they even respond to that? They would say, "who cares?" You would

(18:20) be laughed out of the room. On the opposite side, if you go to them and say, "hey, there's at least a 70% chance that you're wrong on this," or "at least an 80% chance you're wrong on this," they'd probably appreciate it, right? And I think that's where the point that I made in

(18:40) the video is that when everybody is expecting 95% confidence, it's just not a fit for business decision making, as opposed to academia, where you're looking for 95% confidence because other academics are going to cite your article as essentially truth, and it's going to be used

(19:03) as a foundational piece to build more science on top of, and there's an expectation that it doesn't change. It's a totally different world from business. And to your point from earlier: the world of business changes frequently, week to week; seasonality impacts everything;

(19:20) macro factors impact it. And while academia tries to isolate all those factors as much as possible, that's less reliable in a business context. Exactly. It's just so obvious, right? If you have a merchandising A/B test in the winter and you've got winter coats or bathing suits, you're going to

(19:39) sell more winter coats, and then six months later that's not going to be true anymore. So to what degree is the world changing in a way that affects your business more than the effect sizes of your typical A/B test? That's where, again, I think

(20:01) some of these experiments don't fit so well. Ron's point was fascinating; I hadn't heard it before. He said my point was incorrect, and that when people in business think they're getting 95% confidence, they're not:

(20:31) they're actually getting closer to 75% confidence, because we already know, through a large amount of work done at a number of large companies, that the false positive rate is far higher than would have been expected looking at a p-value of 0.05, and therefore we need to take those additional failure rates into account to

(20:57) say that when we think it's 0.05, when we think it's 95%, it's actually closer to 75%. And then he said in that comment that he stopped watching my video after that, which was the first point, but later on we continued the conversation in a Google Doc. You can check it all out on LinkedIn; I'll link those in the comments.
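One way to see how a nominal 95% can shrink toward 75% is the standard false-positive-risk calculation. The prior and power values below are illustrative assumptions, not figures from Kohavi's work:

```python
# False positive risk: of all experiments that reach p < 0.05,
# what fraction are actually false positives?
alpha      = 0.05   # significance threshold (the "95% confidence" bar)
power      = 0.80   # chance a real effect is detected (assumed)
prior_true = 0.15   # assumed share of tested ideas with a real effect

# Among 1,000 hypothetical experiments:
true_wins  = 1000 * prior_true * power          # real effects that pass
false_wins = 1000 * (1 - prior_true) * alpha    # nulls that pass by chance

fpr = false_wins / (true_wins + false_wins)
print(f"Significant results that are false positives: {fpr:.0%}")  # 26%
print(f"So p < 0.05 here means roughly {1 - fpr:.0%} confidence")  # 74%
```

Under those assumptions, about a quarter of "significant" wins are noise. The gap between the advertised 95% and the roughly 75% Kohavi describes comes from how rarely tested ideas have a real effect, not from a bug in the math.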

(21:17) I was a little frustrated initially that he had stopped watching the video, but I reached out to him and we had a totally nice back-and-forth conversation, where he had his points and I had mine, so that was completely resolved. But I think that

(21:34) makes the case against A/B testing far stronger, because one of the things that I had in the video is that the business world in general is not interpreting and using these A/B testing systems correctly. And I've been in it; I've seen A/B tests. I have never seen a business person say, "okay, we've gotten to

(21:59) 95% confidence, or a p-value of 0.05; that means we're actually only 75% confident." They expect that it actually means 95% confidence. This is a perfect example, again, of the over-precision and false confidence in these systems, which are not as statistically exact as a lay business person, someone who's not

(22:25) a statistics person, expects. It does sound like a communication problem, though; when these particular numbers get used, a lot of context seems to be missing. Sure, exactly. But it's also unreasonable to expect that business people are ever going to get that context, and this is where it's all about the

(22:51) difference between what you get sold and what you get, right? If what you get sold is extremely high confidence in this number, 95% confidence, and what you get, overwhelmingly, is 75% confidence, I think that's a big problem in the industry. A/B testing has been super

(23:17) popular for 15 years, and after 15 years even I didn't know that the 95% confidence is actually only 75% confidence, with the academic literature showing that we can't trust these results as much as we think. Well, obviously I still think there's value to it, but I do think the communication

(23:45) of these facts is something the industry could be way better at, and I don't think the academic terms help our cause at all. "Confidence" sounds like confidence; 95 sounds like a great score on a test. It shoots itself in the foot, and maybe we need to rethink how we phrase things. And, as

(24:07) you point out, a lot of business leaders may not be able to absorb or grasp all these things, because I know when I present this stuff they kind of glaze over some of it. I wanted to jump onto a point that you mentioned, where Ron was talking about his experience at Microsoft, and

(24:27) he’s pulling all this data and you in one of the comments you left in the document that you felt that that was not representative of of the world um I want to dig into that because that’s something that I’ve heard a lot in in LinkedIn at least I was wondering if you could elaborate on on what that means

(24:45) sure I I mean I think that my my understanding of his team at Microsoft or Bing where he was doing and just I read his LinkedIn po profile and he had a team of you know 50 people building an experiment platform and and I imagine many of those people are phds and they’re all Microsoft it’s it’s one of those things

(25:16) where I’m sure that they had over the years many of the same challenges that a typical business has with AB testing I’m sure that that not all of them but but some of them they were able to overcome because they just invested so heavily in it and and I’m just taking his word for the fact that they had an

(25:42) excellent process and system that led to good business results where they were able to tune things effectively I think um I think the Bing and search engine use case is probably a little bit more close to like the video game tuning than it is to like very broad decision making because it’s sort of like you’re tuning

(26:08) different algorithms for search results, or the size of the font for what people see. It's a pretty constrained use case of search, whereas most businesses are testing a lot of things that aren't as constrained. I'm going to totally take his word for it, but I think the investment that Microsoft or

(26:28) Google or some of these other companies are making around these sorts of narrow product use cases doesn't necessarily translate so well to many other businesses. I think that's fair. That's one of the commentaries I've often heard: okay, you have all these learnings from Microsoft, but I have

(26:51) one fraction of the traffic you have, and one-tenth of the resources, not even. So we're not entirely confident; I don't know how applicable some of the stuff you're saying is to me. But yeah, I get it: you try to take it at face value, you try to apply it as you can, but

(27:12) at the same time you have these doubts in your head. You know, I've been peppering you with a lot of questions; I was wondering if there's anything you wanted to share with our audience, anything going on with yourself that you'd like our audience to know. Yeah, I guess one important

(27:30) thing to recognize about this is that this is just my opinion on the state of the industry. I have no financial stake in this topic; Delivery Layer is not in the A/B testing world. Delivery Layer mostly powers B2B applications. I just want to call out that I think a lot of your

(27:56) challenges are fair; they're things that we've all gone through, and I think a conversation is warranted. Obviously we all agree experimentation is good, but A/B testing in its current form isn't perfect. We know this; we know it's not perfect, and we know that we should bring more

(28:15) common sense to it, more context to the analysis. But what does it take to get us to the next level? I actually don't know, but I know we're not there yet. So I appreciate a lot of these points that you brought up. Thank you, and I will give you full credit: when I put this video out

(28:31) and the conversation was happening, you were like, "hey, come on, let's talk about it." I think that's a good thing, right? It shows that the important part here is having everybody get all the thoughts and ideas and perspectives on the table.

(28:52) I went back and watched the video, and I thought, yeah, he's lived the life. Now, mind you, I didn't agree with the ultimate conclusion, to push off A/B testing; I still think there's value to it. But at the same time, I don't have a solution. Right now I don't know how to get the

(29:11) industry to that spot. But I liked that it brought up the conversation; it says that things are screwed up, at the very least, and having that conversation, with Ron poking at things as well, can't do anything but help. Thank you. Yeah, and I agree. I've

(29:27) been in data for a while, and I think we continue to advance through having things go head-to-head and seeing what works and what doesn't work. In terms of A/B testing systems, and you kind of brought up the point of what the future looks

(29:52) like, I do think they are well situated to be used in that metrics-driven, common-sense decision-making approach, where if you're making a change and you're looking at a couple of different metrics, the overly scientific reliance on whether something changed by 2% or 3% or 0% might

(30:15) not be where I would go, but having them there to see "did we just drop revenue 30%?", that's a good use. And I think the more data people have as they make decisions, the faster it allows them to experiment. And the end point

(30:39) of the video was that we should be experimenting faster. If you start the quarter and you say, compared to where I think we could be as a business, our conversion rate for this funnel is way lower, or our churn rate is too high, our churn rate should be substantially lower based on everything that we know, and you come up

(31:03) with 18 different ideas to go and impact that churn rate, the question is: do you want to implement one, wait for a while, try to see what the impact of that one is, implement the other, wait for a while? Or do you want to just zoom, try to go as fast as you can making all those changes, and

(31:28) look at the end of the quarter and say, all right, what did we see the impact of these things was? Although, frankly, churn is a great example of something that's tough to A/B test, because the impact of the decisions that you made might not show up in churn for another eight months. Yeah, and

(31:46) that’s usually where a hold out is but there’s a lot of complications in hold outs as well a lot of complications because you might have to keep two versions of the same code base running for months for just one holdout test so suddenly you get eight or nine hold out tests to try and this is where it’s like

(32:02) okay if you want to impact churn then maybe you should just do all of the changes and see what happens as opposed to try to do holdout tests for very complicated systems changes yeah there’s always this question right like when you hold out a group and you’re making fixes to your site that’s basic things you

(32:23) know like bug fixes do you hold out the bug fix is to and who gets in and out of the the your control group because over time it’s the quality the audience membership is going to decrease because people just bounce off your side and never come back so your hold out’s getting smaller so you have to let

(32:39) people in what are the rules to get there’s a lot of um setup you can’t just you know go straight out with a hold out but um but yeah I I mean you’d use that if you could in an ideal world but it is challenging sometimes yeah definitely but I do think that there are some really complicated risky stuff that we
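For what it's worth, the usual way teams keep holdout membership stable through all of that is deterministic bucketing: hash each user ID so the assignment never changes across sessions or deploys. Here is a minimal sketch; the 10% holdout size and the salt value are arbitrary choices for illustration:

```python
import hashlib

HOLDOUT_PCT = 10  # keep 10% of users on the old experience (illustrative)

def bucket(user_id: str, salt: str = "churn-holdout") -> str:
    """Deterministically assign a user to 'holdout' or 'treatment'.

    Hashing (salt + user_id) gives the same answer on every visit,
    so membership survives new sessions, deploys, and devices,
    without storing any per-user state.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return "holdout" if int(digest, 16) % 100 < HOLDOUT_PCT else "treatment"

print(bucket("user-8271"))  # same result every time for this user
```

Hashing settles who is in the group, but it doesn't solve the problems raised here: the holdout still shrinks as users churn away entirely, and you still have to decide whether bug fixes ship to the holdout too.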

(32:58) Definitely. But I do think there's some really complicated, risky stuff that we need to test, in a small group or what have you. I've always promoted: don't just trust the data. Even if it's a winner, it doesn't matter; you need to look at the bigger picture, you need to have the long-term strategy, you need to have a lot of eyes on it. It's just one piece of the whole puzzle for making a

(33:15) decision. Unfortunately, I think a lot of people just see that A or B wins by XYZ and push it forward without thinking holistically and long-term. Yeah, totally. Cool. Well, you know, to me I see that as a good sign, because when I put out the video, you

(33:41) know, when you say something that's against what a lot of people think, you're always wondering: am I going to sound crazy? I don't think I sound crazy, but you never know, when you go against the grain so hard, whether that's going to come out well. PMs will go, "yeah, he's got it," but I think CROs will be

(34:01) more like, "come on." I've been both, so it's fine. So for those who want to continue this conversation with you, where can they find you? Yeah, the best place is on LinkedIn. And on LinkedIn we can find that thread? Exactly, yeah; we should post the thread in the comments. But on LinkedIn: Solomon Kahn,

(34:28) K-A-H-N. Follow along; I post a lot about data stuff, and I'm sure this conversation about A/B testing will keep going. Yeah, for sure. And for the listeners, I'd love to hear your thoughts on everything that was said today. Leave them in the comments, send me an email, send Solomon an email; let's

(34:49) keep this conversation going. I think it's a great topic. Thank you for listening, and have a great day. All right, thank you. This is Rommil Santiago from Experiment Nation. Every week we share interviews with, and conference sessions by, our favorite conversion rate optimizers from around the world. So if you like this

(35:05) video, smash that like button and consider subscribing. It helps us a bunch.

If you liked this post, sign up for Experiment Nation’s newsletter to receive more great interviews like this, memes, editorials, and conference sessions in your inbox: https://bit.ly/3HOKCTK

