Experiment Nation - The Global Home of CROs and Experimenters


Learn about Jonas Alves - the mind behind Booking.com’s Experimentation engine

A Conversion Conversation with Ex-Booking.com Product Owner Jonas Alves

I’m happy to be able to share a recent conversation I had with Jonas, the man behind the Booking.com Experimentation platform. He was knee-deep in testing well before many of us even realized it was important in the digital space. We chatted about a wide range of topics, including his career path, his thoughts on how to succeed at experimentation, as well as his love of Chi Kung and Kung Fu.


Rommil: Hi Jonas, happy new year! How are you?

Jonas: Hi Rommil, Happy new year! I’m doing great. Looking forward to 2020.

Thanks for chatting with me today. Let’s start off with you sharing a little about yourself, your background, and what you’re up to these days.

Thank you for the invitation, Rommil. I love to chat about experimentation.

I started with experimentation back in 2008, when I joined Booking.com as a full-stack developer on the website team. The team was quite small: Jasper Cramwinckel was the Team Leader, there was one developer focused on SEO, Martin Biskoping, another focused on the Newsletter, David Morel, and one designer running A/B tests on the website, Luciano Kohen.

Although they had been running A/B tests since 2002, the tools that Jasper had built were still a little bit limited. One of the first tasks Jasper gave me was to revamp the platform.

Wow, that’s pretty ahead of the curve! I don’t know many people who were involved in experimenting in the early 2000s. What did you do to revamp that platform?

I added visualizations of the confidence intervals, plus the ability to track not only conversion rate and revenue, but any other goal, with a single line of code.
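To make the “single line of code” idea concrete, here is a rough sketch of what such a goal-tracking call could look like. The function name, arguments, and event shape are purely illustrative assumptions, not Booking.com’s actual API.

```python
from typing import Optional

# Hypothetical goal-tracking helper, for illustration only: the point is that
# recording a new metric should cost the developer one line at the call site.
def track_goal(goal: str, value: float = 1.0, visitor_id: Optional[str] = None) -> None:
    event = {"goal": goal, "value": value, "visitor": visitor_id}
    # A real platform would push this onto an event pipeline keyed by the
    # visitor's current experiment assignments; here we just print the shape.
    print(event)

# The single line a developer adds next to their feature:
track_goal("wishlist_add", visitor_id="visitor-42")
```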

Most of the experiments were being set up in the HTML templates. Our templating system didn’t support short-circuiting in expressions, which required a bit of spaghetti code to implement tests with complex conditions. This was fixed by Flávio Glock, who joined the team not long after me.

We started running a lot more experiments. It was during this time that I helped Luciano run many experiments, like the urgency messages, which were a huge success for Booking.com. Pretty much every e-commerce website has those urgency messages now.

To keep track of all the changes, we made the experiments searchable in the platform and asked people to write good descriptions and comments every time an experiment was started, stopped, or changed in some way.

Amazing! That must have been fun! Did a lot of people use it?

By 2010 we had more than 50 people using the platform every day, and we decided to create a team around it.

We started democratizing experimentation across the company. It was no longer only the website team that could run experiments; pretty much any development team could do it. We ran A/B tests at the F5 load balancer to upgrade the version of Perl, to move our web servers from Apache to nginx, and so on.

I’ll pretend I understood all that. Sorry, go on LOL

In 2012 we had hundreds of people running experiments and I became the Product Owner of Experimentation. I was responsible for experimentation training, for the features added to the team’s backlog, for the reports being built and the metrics being added, and for helping people make sense of the data and making the platform easier to read, use, and understand. Booking was growing like crazy; every couple of years we would double the development headcount. Keeping the experimentation culture intact was not easy. At some point, I was giving new hires experimentation training every week, and refresher training to seasoned experimenters every couple of months.

I got in touch with Ronny Kohavi to help us with the training and the statistics in the reports. He referred me to Roger Longbotham, who had worked with him at both Amazon and Microsoft. I hired Roger as a consultant and he helped me with the trainings and with making the statistics sound. It was with Roger that we learned about the Delta method, Fieller’s theorem, and so on.

Source: https://en.wikipedia.org/wiki/Fieller%27s_theorem
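For readers who haven’t run into it, the Delta method is what lets you put an error bar on a ratio metric where both numerator and denominator vary per visitor (think revenue per session). Here is a rough sketch of the standard first-order approximation; the metric and the simulated data are illustrative, not anything from Booking.com’s reports.

```python
import numpy as np

# Illustrative first-order Delta-method variance for a ratio metric x̄/ȳ,
# where x[i] and y[i] are per-visitor totals (e.g. revenue and sessions).
def delta_method_ratio_var(x: np.ndarray, y: np.ndarray) -> float:
    n = len(x)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    cov = np.cov(x, y, ddof=1)[0, 1]
    # Var(x̄/ȳ) ≈ (Var(x)/μy² + μx²·Var(y)/μy⁴ - 2·μx·Cov(x,y)/μy³) / n
    return (vx / my**2 + mx**2 * vy / my**4 - 2 * mx * cov / my**3) / n

rng = np.random.default_rng(0)
sessions = rng.poisson(3, size=10_000) + 1               # sessions per visitor
revenue = rng.gamma(2.0, 10.0, size=10_000) * sessions   # revenue per visitor
se = delta_method_ratio_var(revenue, sessions) ** 0.5
print(f"revenue per session ≈ {revenue.sum() / sessions.sum():.2f} ± {1.96 * se:.2f}")
```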

Wow, that’s no joke. I love how you went all out to ensure that you were thoroughly knowledgeable about this domain. Did you run into any issues?

With the increase in traffic and in the number of experiments, the reports were becoming slower and slower. We borrowed Yves Orton from the Infrastructure team to build a new cluster to aggregate experiment data from the real-time event pipeline. It was a huge project and took a long time to finish, but it worked amazingly well.

We were running so many experiments that it was kind of crazy. One day a developer came to me with a heatmap of all the experiments running on the Hotel page. It was red all over.

To avoid interactions between experiments, it is not uncommon for big companies to create layers of experiment types, where visitors can only be exposed to one experiment from each layer.

At Booking.com we were running the full factorial of all experiments: all users could be in all experiments simultaneously. This was a business decision and we were aware of the risks.
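In practice, “full factorial” just means that every experiment assigns visitors independently, typically by hashing the visitor id with a per-experiment salt, so the same visitor can be in the treatment of one test and the control of another. A minimal sketch of that assignment logic, with names that are illustrative rather than Booking.com’s actual implementation:

```python
import hashlib

# Independent, deterministic assignment per experiment: each experiment hashes
# the visitor id with its own name as salt, so buckets are stable per visitor
# but effectively independent across experiments (the full factorial).
def assign(visitor_id: str, experiment: str, variants: int = 2) -> int:
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return int(digest, 16) % variants  # 0 = control, 1.. = treatments

visitor = "visitor-42"
for exp in ["urgency_message", "hero_image", "nginx_rollout"]:
    print(exp, assign(visitor, exp))
# The same visitor lands in some combination of variants across all experiments.
```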

But it was at this point that I decided that we had to do something better about interactions between experiments.

Just thinking of all the interactions to detect is giving me headaches LOL. So what did you do?

We started aggregating all experiments in pairs to raise alarms about interactions, and we changed the platform to warn users about experiments that had started, or were about to start, in the same area.
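Concretely, for every pair of experiments you can look at the four combinations of (variant of A) × (variant of B) and raise an alarm when the effect of A looks very different depending on which arm of B a visitor was in. Here is a rough sketch of one way to do that pairwise check, using a simple difference-in-differences z-statistic; the metric, numbers, and threshold are illustrative, not the actual Booking.com alarm logic.

```python
from math import sqrt

# Pairwise interaction check: compare the effect of experiment A within each
# arm of experiment B. cells maps (arm_a, arm_b) -> (conversions, visitors).
def interaction_z(cells):
    def rate_and_var(arm_a, arm_b):
        conv, n = cells[(arm_a, arm_b)]
        p = conv / n
        return p, p * (1 - p) / n

    effects, variances = [], []
    for arm_b in (0, 1):  # effect of A inside B=0, then inside B=1
        p1, v1 = rate_and_var(1, arm_b)
        p0, v0 = rate_and_var(0, arm_b)
        effects.append(p1 - p0)
        variances.append(v1 + v0)
    # z-statistic for the difference between the two effects
    return (effects[1] - effects[0]) / sqrt(variances[0] + variances[1])

cells = {(0, 0): (480, 10_000), (1, 0): (520, 10_000),
         (0, 1): (495, 10_000), (1, 1): (400, 10_000)}
z = interaction_z(cells)
if abs(z) > 3:  # conservative threshold, since many pairs get tested
    print(f"possible interaction between the two experiments, z = {z:.2f}")
```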

That’s a great approach. It’s actually something I’m currently investigating with our Analytics team at my company. How did you know whether you were doing it right?

I had been in touch with Optimizely in 2010 when they were looking to open an office in Amsterdam, so I scheduled a call with Pete Koomen to discuss interactions between experiments.

Pete told me that what we were doing was the most sophisticated he had heard about. Their biggest client was nowhere near our scale, and interactions were not a problem on their radar yet. I was both flattered and sad.

When I left Booking.com, in 2015, and passed the reins to my good friend Lukas Vermeer, we were running more than 1000 experiments simultaneously and starting 100+ every single day.

You’re my idol, Jonas. Where did you go after Booking.com?

I started helping other companies get better at experimentation as well. I joined Catawiki.com for one year, as an external consultant, to help build their experimentation platform. I also gave experimentation training to GetYourGuide, Viagogo, Adidas, Picnic.nl, MinDoktor.se, Match.com (OkCupid, Tinder), Remitly, SpotAHome, ParkMobile and many other companies.

At Tripaneer, together with my co-founders, Márcio Martins and Mário Silva, we built an experimentation platform internally, similar to what was being used at Booking.com, but without as many bells and whistles.

Now I’m creating a new company. Our product is yet another experimentation platform.

Wow, that sounds pretty interesting! I definitely would love to be updated when you launch!

I have to ask, what inspired you to build your own platform? How’s yours different from all the other platforms out in the market?

I’ve been doing this for more than a decade now. Clients have asked me many times whether I could build a platform like the one Booking.com has. I was working full time and consulting on experimentation was not my main focus, so the answer was always no. But now the stars have aligned and I’m up for the challenge.

There are hundreds of experimentation platforms out there, but companies doing experimentation at scale are still building their own platforms internally.

I always hear the same concerns. Those companies want it customized to their business, need better guardrail metrics, want to own the data, need to be able to ask for new features, and overall want the platform to run on their own premises.

There’s a lot of buzz around platforms and experimentation tech, but there’s not much out there in terms of philosophy and frameworks. For those looking to enter this field, what advice would you give them to get started?

I think there is a bit of philosophy and some frameworks out there, but they’re not always linked to experimentation. It has to be rooted in the culture of the company, and it touches everything: deployment methods, unit testing, quality assurance, systems monitoring, code reviews, team structure. To do experimentation at scale you might need to change everything. It’s very difficult without backing from the top.

“To do experimentation at scale you might need to change everything. It’s very difficult without backing from the top.”

So true.

What people usually miss is that the most important things are not the methods for coming up with ideas for new experiments, or for prioritizing those ideas in the backlog. Sure, that’s important too, but most companies are already doing that in one way or another. There are always some improvements to make, but usually not a lot to change there. That’s maybe where you would focus if you only have CROs, Product Owners, or Marketeers responsible for A/B testing.

To do it right you need to make A/B testing the default. Simply don’t make any change that might impact the customer without an A/B test. It might be a change in the code, services, infrastructure, email, UX, processes, or just a bug fix, you name it. You have to A/B test it!

“To do [Experimentation] right you need to make A/B testing the default.”

Decouple deployments from customer-facing changes. By doing so, anyone can deploy code knowing that no changes will be visible: nothing impacts customers until someone turns on the feature flag for each test. This brings huge velocity and agility to the organization. Deployments start happening all the time. A designer finishes a task and goes ahead and deploys. A software engineer fixes a bug and goes ahead and deploys. A copywriter changes the copy on the site; that usually doesn’t need a deployment, but if it does, they go ahead and deploy.

You get tens or even hundreds of releases per day, and this is awesome. If something breaks you know right away which release was the culprit and have just a handful of commits to look at. To make this happen, creating an A/B test can’t be more work than just making the change. It has to be super easy! You also need to monitor your systems in real time and be able to roll out and roll back very quickly.

Democratize experimentation across the whole organization. Anyone, in any role, can go ahead and start a new test.
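As a concrete illustration of the “decouple deployments from customer-facing changes” point above, here is a minimal feature-flag sketch. The flag store, names, and traffic split are assumptions for illustration, not any particular platform’s API.

```python
import hashlib

# The new code ships dark: nothing changes for customers until the flag is
# switched on, at which point it also acts as the A/B test assignment.
FLAGS = {"new_checkout": {"enabled": False, "traffic_share": 0.5}}

def in_treatment(flag_name: str, visitor_id: str) -> bool:
    flag = FLAGS.get(flag_name, {"enabled": False, "traffic_share": 0.0})
    if not flag["enabled"]:
        return False  # deployed, but invisible to every customer
    bucket = int(hashlib.sha256(f"{flag_name}:{visitor_id}".encode()).hexdigest(), 16) % 100
    return bucket < flag["traffic_share"] * 100

def render_checkout(visitor_id: str) -> str:
    if in_treatment("new_checkout", visitor_id):
        return "new checkout flow"      # the change under test
    return "current checkout flow"      # what everyone sees until the flag flips

print(render_checkout("visitor-42"))  # "current checkout flow" while the flag is off
```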

Having been in Experimentation for so many years, what are the most interesting trends you’re seeing in the market?

There’s something I always wanted to do when I was responsible for experimentation at Booking.com. I wanted to build a system where anyone could create a new report and add it to the platform, right away, without having to request any dev time. Netflix achieved that with their notebooks, and even open-sourced Polynote. I think this is really the way to go: democratize access to the data and make insights shareable with anyone in the organization instantaneously.

I’m looking forward to implementing something like this in my new platform.

I’ve only been to Amsterdam once, many years ago. What’s the Experimentation scene like out there?

I moved back to Portugal this past summer, after 11 years in Amsterdam. Experimentation is growing like crazy in Amsterdam; big companies and startups are completely into it. In Portugal, on the other hand, barely anyone is doing it.

I get more requests for help with experimentation from companies in the Netherlands than anywhere else in the world. But it’s also where the majority of my network is based.

On a more random note, I’ve read that you’re a Chi Kung (Qigong) practitioner? What is that? Do you see any parallels between that and Experimentation?

Hahaha. Yes, I’ve been practicing Chi Kung and Kung Fu for just over two years. Chi Kung is an energy art. Think of it as standing meditation. It helps me with consciousness, creativity and focus. It gives me the boost of energy that I need in the morning and keeps me healthy and fit.

It’s actually a funny story. I had Jeroen Maes as a personal coach at Tripaneer. He had been practicing Chi Kung for 10+ years. One day during one of our meetings we started talking about Chi Kung. I got excited about it and mentioned that I would like to try it. For him, practicing Chi Kung is the best thing you can do to improve your life. He mentioned that he would introduce me to his Chi Kung master, Darryl Collett. Our meeting was almost over and we decided to finish it on a walk outside. Just as we left the building, Darryl passed by and stopped in front of us. Jeroen introduced us, and I started private lessons with Darryl that same week. We still joke about that coincidence, but I guess that’s just how energy works.

Check out Darryl on LinkedIn

Finally, let’s wrap this up with a Lightning Round! Bayesian vs Frequentist?

Frankly, I don’t think it matters much. Both work fine if the culture is there. Most businesses have strong weekly seasonality: either you have full weeks of data, or your results are biased toward some days of the week. If decisions can only be made on complete business cycles, you are already addressing the peeking issue. But if you have a good prior and Bayesian gives you a bit more power, then go with it.

MVT: hot or not?


If you run all experiments in a full-factorial fashion, you get MVT for free. Just aggregate the data from the A/B test on the title, together with the data from the A/B test on the hero image and the A/B test on the call to action, and you have your MVT.
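Put differently, because assignments are independent, you can cross the logged assignments of two (or more) tests after the fact and read the result as a multivariate test. A toy sketch of that aggregation, with invented field names and data:

```python
from collections import defaultdict

# Toy example: build MVT cells by crossing the independently assigned variants
# of two already-logged A/B tests (title and hero image). Data is made up.
visitors = [
    {"title": "A", "hero": "A", "converted": 1},
    {"title": "A", "hero": "B", "converted": 0},
    {"title": "B", "hero": "A", "converted": 1},
    {"title": "B", "hero": "B", "converted": 1},
    {"title": "A", "hero": "A", "converted": 0},
]

cells = defaultdict(lambda: [0, 0])  # (title, hero) -> [conversions, visitors]
for v in visitors:
    cell = cells[(v["title"], v["hero"])]
    cell[0] += v["converted"]
    cell[1] += 1

for (title, hero), (conv, n) in sorted(cells.items()):
    print(f"title={title} hero={hero}: {conv}/{n} = {conv / n:.0%}")
```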

Describe an experimentation culture in 3 words.

I think I already described it in 100 or so words. In three words it would be: “Decouple those deployments” or maybe “Test ALL changes!”. :)