Improving the power of a picture via A/B testing
Gopal Krishnan, Director of Engineering
Dale Elliott, Senior Software Engineer
Kenny Xie, Senior Data Scientist
TV is a lean back experience
90 seconds
Pop Quiz
A round plane figure whose boundary (the
circumference) consists of points equidistant from a fixed point (the center).
Can we do better?
Sensitivity test
The Short Game
Single title A/B test result
14% better 6% better
Testable Hypothesis
Displaying better artwork will result in greater engagement and retention by helping members discover stories they will enjoy even faster.
Data Driven
Netflix API service
Beacon (telemetry collection service)
Hive (computes artwork performance metrics for every title/country/locale pair)
Netflix Image Library
Device (PS3, website, etc.)
Feedback loop
Serve artwork based on A/B logic
Feed with artwork based on perf metric
Collect plays & client impressions
Anatomy of artwork
Stable Image id for ground truth data
source-file-id-1 source-file-id-2 source-file-id-3
Lineage-id-1
Diversity matters
Pop Quiz
1 2 3
4 5 6
Building the A/B tests
vs.
Pairs of Explore and Exploit Tests
Explore Test
Current production explore
New explore
Exploit Test
Current production exploit
New exploit
Winner
Winner
● No member overlap
● Explore and exploit allocation happens simultaneously
Multi-title explore allocation test
          Cell 1          Cell 2         Cell 3         Cell 4         Cell 5         Cell 6
Title 1   Control Image   Test Image 1   Test Image 2   Test Image 3   Test Image 4   Test Image 5
Title 2   Control Image   Test Image 1   Test Image 2   Test Image 3   Test Image 4   Test Image 5
...       ...             ...            ...            ...            ...            ...
Title n   Control Image   Test Image 1   Test Image 2   Test Image 3   Test Image 4   Test Image 5
Test Evolution: Single Title to Multiple Titles
Single title, multi-cell test
Engineering implementation / complexity
• Our A/B infrastructure is optimized for comparing test cells to each other
• Need to compare data across cells for one title among many
• Avoid creating hundreds of tests (one per title)
Solution:
• Treat all the members who see a title’s images as a virtual test
• Impression tracking -- not just test cell allocation -- defines test population per title
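A minimal sketch of the virtual-test idea, assuming impressions arrive as (member, title) pairs and each allocated member has a cell number. The class, method, and field names are hypothetical and chosen only for illustration; the slides do not show the actual implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: names and data shapes are illustrative only.
final class VirtualTests {

    /** One logged impression: a member saw artwork for a title. */
    static final class Impression {
        final long memberId;
        final long titleId;

        Impression(long memberId, long titleId) {
            this.memberId = memberId;
            this.titleId = titleId;
        }
    }

    /**
     * Builds one virtual test per title: titleId -> cell -> members.
     * A member belongs to a title's population only if they are allocated
     * to the test AND have at least one impression for that title.
     */
    static Map<Long, Map<Integer, Set<Long>>> populations(
            Map<Long, Integer> cellByMember, List<Impression> impressions) {
        Map<Long, Map<Integer, Set<Long>>> byTitle = new HashMap<>();
        for (Impression imp : impressions) {
            Integer cell = cellByMember.get(imp.memberId);
            if (cell == null) {
                continue; // member is not allocated to the test
            }
            byTitle.computeIfAbsent(imp.titleId, t -> new HashMap<>())
                   .computeIfAbsent(cell, c -> new HashSet<>())
                   .add(imp.memberId);
        }
        return byTitle;
    }
}
```

Each title's cells can then be compared with each other without creating a separate test per title.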
Engineering implementation / complexity
Allocated Members
Title A impressions
Title B impressions
Problems with multi-title, multi-cell test
• Cohorts of testers who all saw the same set of images
• Same number of images for every title
Single-cell explore allocation test
Title 1
“Cells” 1 2 3 4 5 6
Image Control Image 1 Image 2 Image 3 Image 4 Image 5
Title 2
“Cells” 1 2 3 4
Image Control Image 1 Image 2 Image 3
Test Evolution: Images per title
Multi-cell explore evolves to Single-cell explore
Devolves?
Virtual Tests inside one test cell
Engineering implementation / complexity
Goals
• No cohorts
• Image stickiness
• No persistent storage
We used a deterministic, pseudo-random calculation
• new Random(memberID * titleId).nextInt(numImages)
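The calculation above can be wrapped in a small helper to show why it meets all three goals. This is a sketch that assumes the IDs are `long` values; the class and method names are hypothetical, and only the seed formula comes from the slide.

```java
import java.util.Random;

// Hypothetical wrapper around the seed formula shown on the slide.
final class ImageAssignment {

    /**
     * Returns an index in [0, numImages) that is always the same for a given
     * (memberId, titleId) pair, so the assignment is sticky without being stored.
     */
    static int imageIndex(long memberId, long titleId, int numImages) {
        // The product may overflow long; that is harmless when used as a seed.
        long seed = memberId * titleId;
        return new Random(seed).nextInt(numImages);
    }
}
```

Because the seed changes with the title, two members who get the same image for one title will generally not get the same images for every other title, which avoids cohorts. One property worth noting: the product is symmetric, so member 2 with title 3 shares a seed with member 3 with title 2.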
Netflix API Service
Engineering implementation / complexity
No persistence needed
Cells     Cell 1       Cell 2
Title 1   Ctrl Image   Random of [Ctrl, Test 1, ... Test X1]
Title 2   Ctrl Image   Random of [Ctrl, Test 1, ... Test X2]
...       ...          ...
Title n   Ctrl Image   Random of [Ctrl, Test 1, ... Test Xn]
Image Data Feed
(Title ID, Image Lists)
Netflix Image Lib.
Random assignment to all test members.
Single-cell explore test
● No more cohorts
● Flexible
● Clear winners for many titles
● Overall win based on key metrics
Can we do better?
Result
Problems
• Overexposure of under-performing images
• Underexposure of niche titles
• Unfair burden on testers
Title-level allocation test
Solution: Title-Level Allocation
• Limit allocated members per title
• Less exposure of under-performing images
• Still get enough data to determine winner
• Allocate from a gigantic pool
• More exposure for niche titles
• Spreads testing burden
Test Evolution: Testers per title
Title A
Title B
Title C
Title A
Title B
● Some titles have few testers in the small pool
● Most titles have full testing allocation from larger pool
Engineering implementation / complexity
• Goals from previous test
  • No cohorts
  • Image stickiness
  • No persistent storage
• New goals
  • Less exposure for under-performing images
  • More exposure for niche titles
  • Faster decision and rollout of winning images
• This time, we needed to persist the allocations
Netflix API Service
Architecture
Image Data Feed
Yellow Square
(Y2)
Netflix Image Library
Member allocated?
• Yes: Select assigned image
• No: Title fully allocated?
  • Yes: Select control image
  • No: Allocate with random assignment, log and store allocation, select assigned image
Title Metadata Service (VMS)
Kafka
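The decision flow in this architecture can be sketched as follows. This is an illustration under stated assumptions, not the production code: in-memory maps stand in for the Y2 store, image index 0 is assumed to be the control image, and the per-title cap and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of title-level allocation; maps stand in for the real store.
final class TitleLevelAllocator {
    static final int CONTROL = 0; // assumption: index 0 is the control image

    private final int maxMembersPerTitle; // assumed per-title allocation cap
    private final Map<String, Integer> assignedImage = new HashMap<>(); // "member:title" -> image
    private final Map<Long, Integer> allocatedCount = new HashMap<>();  // title -> members allocated
    private final Random random;

    TitleLevelAllocator(int maxMembersPerTitle, long seed) {
        this.maxMembersPerTitle = maxMembersPerTitle;
        this.random = new Random(seed);
    }

    int selectImage(long memberId, long titleId, int numImages) {
        String key = memberId + ":" + titleId;
        Integer existing = assignedImage.get(key);
        if (existing != null) {
            return existing; // member already allocated: select assigned image
        }
        int count = allocatedCount.getOrDefault(titleId, 0);
        if (count >= maxMembersPerTitle) {
            return CONTROL; // title fully allocated: select control image
        }
        int image = random.nextInt(numImages); // allocate with random assignment
        assignedImage.put(key, image);         // store the allocation
        allocatedCount.put(titleId, count + 1);
        return image;
    }
}
```

Unlike the seed-based approach of the previous test, the assignment here depends on how many members were allocated before, which is why it has to be persisted.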
Oops
● Underestimated traffic
● Many titles allocated per member at once
● Write to Y2 for every allocation
Result: Service disruption; we had to turn off the test
Netflix API Service
Scaling
Image Data Feed
Yellow Square
(Y2)
Netflix Image Library
Allocate with Random Assignment
Log and store allocation
Kafka
Stream Processor
1 write per member every 30 sec.
Storing allocations as they occurred overloaded Yellow Square.
Now, we log them to a stream and consolidate many writes into one.
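The consolidation step can be illustrated with a small buffer that groups allocation events by member and emits one batched write per member on each flush. This is a sketch only: the class is hypothetical, no real Kafka or Yellow Square API is used, and the caller is assumed to invoke `flush()` on a timer (every 30 seconds in the slide).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical write coalescer; it stands in for the stream processor on the slide.
final class AllocationCoalescer {
    // memberId -> allocations seen since the last flush, as "titleId=imageIndex"
    private final Map<Long, List<String>> pending = new LinkedHashMap<>();

    /** Called for each allocation event read from the log stream. */
    void onAllocation(long memberId, long titleId, int imageIndex) {
        pending.computeIfAbsent(memberId, m -> new ArrayList<>())
               .add(titleId + "=" + imageIndex);
    }

    /**
     * Called once per flush interval. Returns one entry per member, so the
     * store receives a single write per member however many titles were allocated.
     */
    Map<Long, List<String>> flush() {
        Map<Long, List<String>> batch = new LinkedHashMap<>(pending);
        pending.clear();
        return batch;
    }
}
```

With this shape, a member who is allocated to many titles at once produces one store write per interval instead of one write per title.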
Who to Test on?
Test on the same population you are planning to roll out the changes to
Two Member Cohorts
• New Members are assigned to the experimental condition at the time of sign-up
• Existing Members are assigned to the experimental condition at any time after their free trial has ended
Decision Focuses More on New Members
• A “pure” sample which is not tainted by a previous Netflix experience
• A more sensitive sample (“on the fence”)
Tiers of Metrics
• Primary: Customer retention
• Secondary: Streaming hours
• Tertiary: All other customer engagement metrics
  • Play rate
  • Number of Netflix visits
  • ...
How to Pick the Winner in Explore?
• Take fraction = (number of users who played the title) / (number of users who saw the title)
• Correlated with retention
• Measurable from day one
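As a worked example of the definition, the sketch below computes take fraction from two sets of member IDs. It assumes that only plays by members who also have an impression count toward the numerator, and that an empty impression set yields 0; both conventions are assumptions, and the names are hypothetical.

```java
import java.util.Set;

// Hypothetical helper illustrating the take fraction definition.
final class TakeFraction {

    /** Fraction of members who saw the title's artwork and also played the title. */
    static double of(Set<Long> membersWhoSaw, Set<Long> membersWhoPlayed) {
        if (membersWhoSaw.isEmpty()) {
            return 0.0; // assumed convention when there are no impressions
        }
        long played = membersWhoPlayed.stream()
                                      .filter(membersWhoSaw::contains)
                                      .count();
        return (double) played / membersWhoSaw.size();
    }
}
```

For example, if members 1, 2, 3 and 4 saw the artwork and members 2 and 4 played the title, the take fraction is 2 / 4 = 0.5.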
What is a Play?
Does Impression Location Matter?
Does it Matter How Many Impressions it Takes to Play?
Netflix just recommended an awesome show to me and I am going to watch it!!!
Does it Matter How Many Impressions it Takes to Play?
I have seen the show on Netflix a few times. Maybe, I should try it...
Take Fraction is NOT as trivial as its definition implies.
How to Make the Final Decision?
Final decision is based on the exploit test
• Retention movement
• Streaming hours movement
• Engagement with titles explored in the test and with titles not explored in the test
• ….
Our Image Selection Test is a Win!
• Improved customer retention
• Improved customer engagement
Some Learnings
Emotions are excellent at conveying complex nuances
Great stories travel - but regional nuances can be powerful
Nice Guys Often Finish Last
Contact:
Gopal Krishnan
Dale Elliott
Kenny Xie
More details available at Netflix techblog.
Talk to us outside at the booth.