This Is Why You Binge Watch: A Decade Of A/B Testing With Netflix

Data guiding design.

In an SXSW-exclusive presentation, Todd Yellin, Netflix's VP of Product, shared lessons from a decade's worth of the company's A/B testing. The number one takeaway: letting data determine design changes at Netflix ensures that product upgrades are driven by actual user behavior.

Netflix takes new users and divides them into two groups: one that sees what everyone else is seeing, and another that sees a different environment that Netflix is testing. The key challenge is determining which data actually matters. Key findings have included:

    •    Age, gender, and location were thought to be critical in determining which content should be promoted to which users. As it turns out, these attributes are not actually key drivers of content taste. In fact, outliers (an 83-year-old grandmother who loves the Avengers, for example) are prevalent.

    •    Asking people what they like isn't as powerful as watching what they watch. Yellin and his team got a better sense of their user base by looking at what users clicked on, rather than asking them directly.

    •    Bubbling up recommended content, rather than labeling it as such, improves the stickiness of the service by allowing for an uninterrupted experience driven by algorithms that are refined over time by each individual's viewing behavior.

    •    Testing key art, as well as content, can lead to increased uptake. Yellin shared a Breaking Bad example in which a vicious close-up of Walter White drove fewer views than a plain shot of a camper, an outcome that was counterintuitive and reinforced the axiom that gut checks just don't deliver accuracy like testing at scale.
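To make the mechanics concrete, here is a minimal sketch of how a two-group test like the key art comparison might be set up and read. It is a generic illustration, not a description of Netflix's actual system, which the talk did not detail. The function names (assign_group, two_proportion_z_test), the hash-based 50/50 split, the choice of a two-proportion z-test, and every number in the usage lines are assumptions made up for this example.

```python
import hashlib
import math


def assign_group(user_id: str, experiment: str) -> str:
    """Deterministically place a user in 'control' or 'test' (roughly 50/50).

    Hashing the experiment name together with the user ID means the same
    user always lands in the same group for a given experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "test" if int(digest, 16) % 2 else "control"


def two_proportion_z_test(plays_a: int, users_a: int, plays_b: int, users_b: int):
    """Return (z, two-sided p-value) for the difference in play rate, B minus A."""
    rate_a = plays_a / users_a
    rate_b = plays_b / users_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (plays_a + plays_b) / (users_a + users_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers, invented for illustration only:
# variant A = close-up artwork, variant B = camper artwork.
z, p = two_proportion_z_test(plays_a=1000, users_a=10000,
                             plays_b=1150, users_b=10000)
print(f"group for user-42: {assign_group('user-42', 'key-art-test')}")
print(f"z = {z:.2f}, p = {p:.4f}")
```

With a large enough sample, a small p-value suggests the difference in play rates between the two images is unlikely to be chance, which is the kind of evidence a gut check cannot supply.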

By leveraging A/B testing to customize the rows on Netflix for each user, Netflix created a personalized video store that increases the number of hours consumers watch, all without direct input from viewers other than what they clicked to watch.