ashish-dua
AB Testing Basics
2nd November, 2016
Google’s infamous AB test: testing 41 variants of mildly different shades of blue
Agenda
The agenda of the course is to avoid setting up AB tests like that
Longitudinal or pre-post testing is difficult since little variance is explained by product features. Other factors impacting conversion are:
• Price
• Weekend/Weekday
• Seasonality
• Source of Traffic
• Availability
• Mix of users (distribution bias)
Clarity of product thinking & avoiding snowballing of incorrect insights
Why was conversion for Android version 5.5.6 better than 5.5.5 for the first 3 days? (Hint: Early adopter bias: users with stable wifi who are loyal to the MMT app convert higher than all users)
Why is AB Testing needed?
Introduction to AB testing
What can be tested:
• A new feature
• UI changes
• Backend Changes
What can’t be tested:
• What a company should be making next
• Anything that can’t be inconsistent between users
Choosing Alia Bhatt as brand ambassador
A recommended hotel on the top of the listing
Impact of a fix for latency
Increase sign-in rate by increasing the size of the login button
Impact of showing packing list as a notification a day before the flight date
Quiz: What can or cannot be AB tested
AB testing is for low-hanging fruit, not quantum leaps; for those, user testing, interviews and FGDs, as well as analysis of existing data, are better.
Choosing Alia Bhatt as brand ambassador: No
A recommended hotel on the top of the listing: Yes
Impact of a fix for latency: Yes
Increase sign-in rate by increasing the size of the login button: Yes
Impact of showing packing list as a notification a day before the flight date: Tough, but theoretically yes
Key Stages of AB Testing
1. Hypothesis Definition
2. Metric Identification
3. Determining Size & Duration
4. Tooling & Distribution
5. Invariance Testing
6. Analyzing Results
Almost all AB experiment hypotheses should look something like below:
Eg. 1
H0 (Null/Control): A big login button will not impact user login percentage
H1 (Test): A big login button will significantly increase user login percentage
Eg. 2
H0 (Control): Putting higher user rating hotels at the top of the listing doesn’t change conversion
H1 (Test): Putting higher user rating hotels at the top of the listing changes conversion significantly
It is good to articulate the hypothesis you’re testing in simple English at the start of the experiment. The hypothesis should be worded in terms of the user, not the feature. It’s okay to skip this too, as long as you get the idea.
Hypotheses Definition
Counts, eg.
• #Shoppers
• #Users buying
• #Orders
Rates, eg.
• Click through Rate
• Search to Shopper Rate
• Bounce Rate
Probability (a user completes a task), eg.
• User Conversion in the funnel
Metric identification (1/2)
Difficult to measure metrics:
1. Those that take too long to collect
2. Those that are subjective, or for which data is not available with us
A derived or proxy metric may be required in such cases
Consider the following metrics for conversion:
1. #Orders/#Visits to listing page
2. #Visitors to TY Page/#Visitors to Listing Page
3. #Visits to TY Page/#Visits to listing page
4. #Orders/#PageViews of listing page
Metric identification (2/2): Quiz
Which of these metrics (1, 2, 3, 4) are impacted by the following:
User refreshes the listing page
User breaks the booking into 2
User’s TY page gets refreshed
User does a browser back and the page is served from cache
User drops off on details and comes back via drop-off notification
Omniture is not firing properly on listing page
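To make the difference between the four conversion metrics concrete, here is a minimal sketch that computes them from a toy event log. The `Event` record and its fields (`user`, `visit`, `page`, `order_id`) are hypothetical and only stand in for whatever the analytics system actually records; the example works through just one scenario, a refresh of the listing page.

```python
from collections import namedtuple

# Hypothetical event schema: one row per page view.
Event = namedtuple("Event", "user visit page order_id")

def conversion_metrics(events):
    """Return the four candidate conversion metrics, keyed 1 to 4."""
    listing = [e for e in events if e.page == "listing"]
    ty = [e for e in events if e.page == "ty"]

    orders = {e.order_id for e in ty if e.order_id is not None}
    listing_visits = {(e.user, e.visit) for e in listing}
    ty_visits = {(e.user, e.visit) for e in ty}
    listing_visitors = {e.user for e in listing}
    ty_visitors = {e.user for e in ty}

    return {
        1: len(orders) / len(listing_visits),         # orders / listing visits
        2: len(ty_visitors) / len(listing_visitors),  # TY visitors / listing visitors
        3: len(ty_visits) / len(listing_visits),      # TY visits / listing visits
        4: len(orders) / len(listing),                # orders / listing page views
    }

# One user, one visit, one listing page view, one order.
baseline = [
    Event("u1", "v1", "listing", None),
    Event("u1", "v1", "ty", "o1"),
]
# Same session, but the user refreshes the listing page once.
with_refresh = [Event("u1", "v1", "listing", None)] + baseline

print(conversion_metrics(baseline))
print(conversion_metrics(with_refresh))
```

In this toy log the refresh adds a page view but not a visit or a visitor, so only the page-view based metric moves. The other quiz scenarios can be explored the same way by editing the event list.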
1. If showing a summary of hotel USPs on the details page is improving conversion?
2. If a user who purchased with MMT will come back again?
3. If we are sending too many or too few notifications to users?
How can you measure?
1. If showing a summary of hotel USPs on the details page is improving conversion?
A. A simple A/B set-up with and without the feature will help in evaluation
2. If a user who purchased with MMT will come back again?
A. A secondary metric, captured by asking buyers this question or through an NPS survey and comparing results, should give some idea
3. If we are sending too many or too few notifications to users?
A. An indirect metric measured as retained users on the app across the two variants
Size & Duration
A detour into statistics:

Reality            | Test Output        | Probability
Control is better  | Control is better  | 1-α (confidence level)
Control is better  | Test is better     | α (significance)
Test is better     | Test is better     | 1-β (power)
Test is better     | Control is better  | β
α, the type-I error rate, is the probability of rejecting the null when it is true, i.e. when control is actually better (Downside Error)
β, the type-II error rate, is the probability of accepting the null when test is actually better (Opportunity Cost Error)
Target values are significance α = 5% and power 1-β = 80%
Size & Duration
Size:
• To figure out the sample size required to get 80% power for the test, use a sample size calculator
• This many users need to be targeted for the smallest of the test variants being examined
Duration:
• Is an outcome of what % of traffic you can direct to the test + some minimum duration considerations
• You might want to limit the percentage exposure of the experiment due to:
  • Revenue impacts
  • Leaving room for other people to experiment
• Even if the sample size for the required power can be reached in a shorter duration, it is good to reduce the exposure of the experiment so that the run includes:
  • At least 1 weekend and weekdays
  • Low & high discounting periods (if possible)
  • Low & high availability periods (if possible)
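As a rough illustration of the size and duration reasoning above, the sketch below uses the standard normal-approximation formula for comparing two proportions with a two-sided test. It is not necessarily the formula behind any particular online calculator, and the baseline rate, minimum effect, traffic and exposure figures are made-up inputs. The 7-day floor is one simple way to encode the "at least one weekend and weekdays" consideration.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p_base, min_effect, alpha=0.05, power=0.80):
    """Users needed in EACH variant to detect an absolute lift of
    `min_effect` over `p_base` (two-sided test, normal approximation)."""
    p_test = p_base + min_effect
    p_bar = (p_base + p_test) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 5%
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 80%
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_base * (1 - p_base) + p_test * (1 - p_test))
    ) ** 2
    return math.ceil(numerator / min_effect ** 2)

def duration_days(n_per_variant, n_variants, daily_users, exposure, min_days=7):
    """Days needed when only `exposure` (0 to 1) of daily traffic enters the
    experiment, never shorter than `min_days`."""
    days = math.ceil(n_per_variant * n_variants / (daily_users * exposure))
    return max(days, min_days)

# Hypothetical inputs: 10% baseline conversion, 2 points minimum lift.
n = sample_size_per_variant(p_base=0.10, min_effect=0.02)
print(n, duration_days(n, n_variants=2, daily_users=10_000, exposure=0.2))
```

Halving the minimum effect roughly quadruples the required sample, which is why the practical significance threshold deserves care before the test starts.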
No Peeking
• It is important not to change the decision on insufficient data: stopping a test on early results inflates the false-positive rate
• Best explained in the blog. The primary idea is that taking duration cues from early data introduces human error in the measurement
• In case the sample size is turning out to be very high, a few ways to reduce it are:
  • Use a sequential sampling approach (reduces size by as much as 50% in some scenarios)
  • Use a Bayesian sampling approach (mathematically intensive)
  • Try matching the lowest unit of measurement with the lowest unit of distribution (eg instead of measuring latency/user, measure latency per hit and distribute the experiment on hit)
  • Try moving the experiment allocation closer to the step where there is an actual change (eg assign a payment experiment to payment page users)
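The cost of peeking can be seen in a small simulation. The sketch below runs A/A experiments (both groups share the same true conversion rate, so every "significant" result is a false positive) and compares deciding once at the planned sample size with declaring a winner at the first significant peek. The number of simulations, users, peek interval and conversion rate are arbitrary illustration values.

```python
import math
import random
from statistics import NormalDist

def z_significant(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided pooled z-test for a difference between two proportions."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    z = (conv_b / n_b - conv_a / n_a) / se
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)

def simulate(n_sims=300, n_users=1000, peek_every=100, rate=0.10, seed=7):
    """Return (false-positive rate with one final look,
               false-positive rate when any peek may end the test)."""
    rng = random.Random(seed)
    fixed_hits = peek_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        ever_significant = False
        for i in range(1, n_users + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if i % peek_every == 0 and z_significant(conv_a, i, conv_b, i):
                ever_significant = True
        peek_hits += ever_significant
        fixed_hits += z_significant(conv_a, n_users, conv_b, n_users)
    return fixed_hits / n_sims, peek_hits / n_sims

fixed_rate, peeking_rate = simulate()
print(f"single look: {fixed_rate:.1%}, peeking: {peeking_rate:.1%}")
```

Because the final look is also one of the peeks, the peeking rate can never be lower than the single-look rate; in practice it is typically several times higher than the nominal 5%.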
Distribution Metric
1. Page Views
2. Cookies
3. Login-ID
4. Device ID
5. IP Address
Tooling & Distribution (1/2)
Which of these (1, 2, 3, 4, 5) will not be hampered by the following:
User shortlists 2-3 hotels and comes back after a day
User starts search on mobile and books on desktop
User changes browsers on the machine
User logs out and continues with another ID
Typical requirements for an AB system are:
• Each experiment should support multiple variants (A/B/C..) and each variant can be defined using a combination of experiment variables
• Each user is randomly assigned a variant (as per the distribution percentage). The system ensures users are served a consistent experience based on their device ID or cookie (other distribution parameters like page view or visit might be used, but cookie/device-id is the most stable)
• Auto-logs the variant that the users are being exposed to in an analytics system
There are multiple AB testing systems available from several vendors, or one can be created internally using a tag manager like Google Tag Manager
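One common way to meet the "random but consistent" requirement is to hash the distribution unit (a cookie or device ID) together with the experiment name, rather than storing a random draw. The sketch below is only an illustration of that idea; the function name, experiment name and percentage-bucket scheme are assumptions, not the design of any specific AB system.

```python
import hashlib

def assign_variant(experiment, unit_id, weights):
    """Map a cookie/device ID to a variant, deterministically.

    `weights` maps variant name -> integer percentage and must sum to 100.
    The same (experiment, unit_id) pair always lands in the same variant.
    """
    key = f"{experiment}:{unit_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100  # 0..99
    cumulative = 0
    for variant, pct in weights.items():
        cumulative += pct
        if bucket < cumulative:
            return variant
    raise ValueError("weights must sum to 100")

# Hypothetical experiment with a 50/50 split.
weights = {"A": 50, "B": 50}
variant = assign_variant("big_login_button", "device-123", weights)
print(variant)  # this value is what would be logged to the analytics system
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user in variant B of one test is not systematically in variant B of every other test.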
Tooling & Distribution (2/2)
Always best to start small in terms of distribution when you turn on the experiment and ramp-up after initial results look stable (& no major drop in conversion)
A/A Testing:
• Ideally, it is good to run one or more A/A tests, measuring the same metric you’re planning to measure in A/B tests, before and after your test period
• Even if the above is not feasible, do try to run A/A tests regularly to test the underlying system
Things to test during A/A Tests:
• Key metrics you measure (like conversion, counts, page-views, etc) and their statistical difference between the two cohorts at different ratios of test & control
A/A & Invariance Testing
Invariance Testing
• Identify invariance metrics: metrics that should not change between control & experiment
• One of the basic invariant metrics is the count of users assigned to each group. It is very important to test these
• Each of the invariants should be within statistical bounds between experiment and control
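For the user-count invariant, "within statistical bounds" can be checked with a confidence interval around the expected share of users in control. The sketch below uses a normal approximation to the binomial; the counts in the example are invented.

```python
import math
from statistics import NormalDist

def count_invariant_ok(n_control, n_test, expected_control_share=0.5, alpha=0.05):
    """Check whether the observed control share is consistent with the
    configured split. Returns (ok, (low, high), observed_share)."""
    total = n_control + n_test
    se = math.sqrt(expected_control_share * (1 - expected_control_share) / total)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    low = expected_control_share - margin
    high = expected_control_share + margin
    observed = n_control / total
    return low <= observed <= high, (low, high), observed

# Hypothetical counts for a 50/50 experiment.
print(count_invariant_ok(5050, 4950))  # small imbalance, within bounds
print(count_invariant_ok(5000, 5400))  # imbalance too large to be chance
```

A failed check usually points at the assignment or logging pipeline rather than at the feature, so it is worth resolving before reading any conversion numbers.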
What would be the invariant metrics for the following feature:Showing details of the exact meal served on the full service domestic flight on the flights review page.
1. Remember the practical significance threshold used in the sample size calculator. That is the least change that we care about, so a statistically significant change smaller than the practical significance threshold is useless.
2. Choose the distribution & test:
   a. Counts: poisson distribution or poisson-mean
   b. Rates: poisson distribution or poisson-mean
   c. Click-through-probability: binomial distribution & t-test (or chi-square test).
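For the click-through-probability case, the two checks above (statistical and practical significance) can be sketched as follows. This uses a pooled two-proportion z-test, which for a 2x2 comparison gives the same p-value as the chi-square test without continuity correction; it is swapped in here because it needs only the standard library. The counts and the 1-point practical threshold `d_min` are hypothetical.

```python
import math
from statistics import NormalDist

def analyze(conv_c, n_c, conv_t, n_t, d_min, alpha=0.05):
    """Compare test vs control conversion and report both kinds of significance."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    significant = p_value < alpha
    return {
        "diff": diff,
        "p_value": p_value,
        "statistically_significant": significant,
        # Only worth acting on if the change is also at least d_min in size.
        "practically_significant": significant and abs(diff) >= d_min,
    }

# Large lift on a moderate sample: passes both checks.
print(analyze(1000, 10_000, 1150, 10_000, d_min=0.01))
# Tiny lift on a huge sample: statistically but not practically significant.
print(analyze(10_000, 100_000, 10_300, 100_000, d_min=0.01))
```

The second call shows why the practical threshold matters: with enough traffic almost any difference becomes statistically significant.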
Analyzing Results (1/3)
Analyzing Results (2/3): Taking Decision
[Figure: outcomes positioned relative to − Practical Significance, No Difference (0) and + Practical Significance; for each outcome the decision is Launch, Don’t Launch or Keep Testing. Answers as labelled on the figure: Yes; No, Keep Testing; No, Don’t Launch; No, Keep Testing.]
Analyzing Results (3/3): Taking Decision
Outcome A: A feature was impacting three different metrics. All are positive.
• Analyze more cuts: is the data intuitive across these cuts? Is a listing page change impacting listing -> details conversion or is it something else that is giving upside?
• Has the experiment run a complete cycle of external changes (weekday/weekend etc)?
• Time to increase rollout or make it 100% depending upon confidence.
Outcome B: A feature was impacting three different metrics. One or all of them are not statistically significant.
Need to understand what exactly might be failing. The list would be specific to the feature launched, but an indicative list might be:
• Daily conversion analysis. Were the conversions higher for any significant period?
• Cuts across cities/routes. Are there pockets that have done better?
• Cuts across customer dimensions: single-pax, round-trip/one-way, number of children, AP range, logged-in users, users with wallet money, etc
• Device/Browsers/Operating System & versions
• Step-wise analysis of the various stages of the conversion funnel
The idea is to analyze deeply to understand what is happening. Based on that, take a call to modify and continue experimentation or roll back the experiment.
A/B/C Setup
A particular type of experiment set-up that is beneficial where there might be server & client side effects that introduce bias. A few examples:
• Measure impact of persuasion shown (say last room left)
  User might be positively impacted to convert higher, v/s
  Higher latency to fetch persuasion might reduce conversion
• Showing a message “Cheaper than Rajdhani” on flights > 75 mins duration and fare < 3000
  User might be positively impacted to convert, v/s
  Conversion for cheaper flights (< 3000) is generally higher
• Showing a USP of the hotel generated from user reviews, eg. guests love this because: “great neighborhood to stay”
  User might be positively impacted to convert, v/s
  Feature might only be visible on hotels with > X reviews (and hence bookings). There is an innate hotel bias.
In these scenarios, it is best to set up 3 variants:
A = Feature off, or control
B = Feature on but not shown to users
C = Feature on and shown to users
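One way to read the three variants is as a decomposition of the net change into a side effect and a user-facing effect. The sketch below only does the arithmetic on hypothetical conversion rates; in a real analysis each difference would also need its own significance test.

```python
def decompose_abc(rate_a, rate_b, rate_c):
    """Split the A -> C change in conversion into two parts.

    B - A: effect of the feature merely being switched on (eg extra latency)
    C - B: effect of users actually seeing the feature
    C - A: net effect, the sum of the two
    """
    return {
        "side_effect": rate_b - rate_a,
        "user_facing_effect": rate_c - rate_b,
        "net_effect": rate_c - rate_a,
    }

# Hypothetical conversion rates for A, B and C.
print(decompose_abc(rate_a=0.100, rate_b=0.097, rate_c=0.104))
```

In this made-up example a plain A/B comparison would report only the small net gain and hide the fact that the persuasion itself helps more than that, with latency eating part of the benefit.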
AB testing in an organization typically goes through the following stages:
Would encourage you all to help your organization move to the next stage in the AB testing journey
Best to be in a state where the company culture supports quick prototyping and testing with real users
Solving for multi device (stitching sessions) and other tracking limitations in the set-up
Higher standards of experiment analysis and responsible reporting
Things to Improve
Stages of AB testing in an organization:
• Sanity Checks
• Testing for conflict resolution
• Testing for impact measurement
• Testing for hypothesis
• Rapid prototyping & testing
Definitely read the Evan Miller blog. It basically summarizes everything you need to know.
If keen on getting into more detail on techniques and best practices, take the course on Udacity. Just doing the first chapter would be good enough.
Further Reading