Basics of AB testing in online products

AB Testing Basics
2nd November, 2016

Google’s infamous AB test: testing 41 variants of mildly different shades of blue

Agenda

The agenda of the course is to avoid setting up AB tests like that

Longitudinal or pre-post testing is difficult since little variance is explained by product features. Other factors impacting conversion are:

• Price
• Weekend/Weekday
• Seasonality
• Source of Traffic
• Availability
• Mix of users (distribution bias)

Clarity of product thinking & avoiding snowballing of incorrect insights

Why was conversion for Android version 5.5.6 better than 5.5.5 for the first 3 days? (Hint: early adopter bias. Users with stable wifi who are loyal to the MMT app convert higher than the average user.)

Why is AB Testing needed?

Introduction to AB testing

What can be tested:
• A new feature
• UI changes
• Backend changes

What can’t be tested:
• What a company should be making next
• Anything that can’t be inconsistent between users

Choosing Alia Bhatt as brand ambassador

A recommended hotel on the top of the listing

Impact of a fix for latency

Increase sign-in rate by increasing the size of the login button

Impact of showing packing list as a notification a day before the flight date

Quiz: What can or cannot be AB tested

AB testing is for low-hanging fruit, not quantum leaps: for those, user testing, interviews and FGDs (focus group discussions), as well as analysis of existing data, are better.

Choosing Alia Bhatt as brand ambassador: No

A recommended hotel on the top of the listing: Yes

Impact of a fix for latency: Yes

Increase sign-in rate by increasing the size of the login button: Yes

Impact of showing packing list as a notification a day before the flight date: Tough, but theoretically yes

Quiz: What can or cannot be AB tested


Key Stages of AB Testing

Hypothesis Definition

Metric Identification

Determining Size & Duration

Tooling & Distribution

Invariance Testing

Analyzing Results

Almost all AB experiment hypotheses should look something like the ones below.

Eg. 1
H0 (Null/Control): A big login button will not impact user login percentage
H1 (Test): A big login button will significantly increase user login percentage

Eg. 2
H0 (Control): Putting higher user rating hotels at the top of the listing doesn’t change conversion
H1 (Test): Putting higher user rating hotels at the top of the listing changes conversion significantly

Good to articulate the hypothesis you’re testing in simple English at the start of the experiment. The hypothesis should be worded in terms of what the user does, not in terms of the feature. It’s okay if you skip this too as long as you get the idea.

Hypotheses Definition

Counts, eg.
• #Shoppers
• #Users buying
• #Orders

Rates, eg.
• Click-through rate
• Search to shopper rate
• Bounce rate

Probability (a user completes a task), eg.
• User conversion in the funnel

Metric identification (1/2)

Difficult to measure metrics:
1. That take too long to collect
2. That are subjective, or for which data is not available with us

A derived or proxy metric may be required in such cases

Consider the following metrics for conversion:
1. #Orders/#Visits to listing page
2. #Visitors to TY (thank-you) page/#Visitors to listing page
3. #Visits to TY page/#Visits to listing page
4. #Orders/#PageViews of listing page

Metric identification (2/2): Quiz

Which of these metrics (1, 2, 3, 4) are impacted by each of the following?

User refreshes the listing page

User breaks the booking into 2

User’s TY page gets refreshed

User does a browser back and the page is served from cache

User drops off on details and comes back via drop-off notification

Omniture is not firing properly on listing page
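A toy event log can help when working through the quiz above. The sketch below is only an illustration: the event schema (`visitor`, `visit`, `kind`) and the sample data are hypothetical, not the actual tracking set-up. It computes the four conversion metrics and shows which of them move when one user refreshes the listing page.

```python
from typing import NamedTuple


class Event(NamedTuple):
    visitor: str  # hypothetical cookie / device identifier
    visit: int    # hypothetical per-visitor session number
    kind: str     # "listing" page view, "ty" page view, or "order"


def conversion_metrics(events: list[Event]) -> dict[str, float]:
    """Compute the four candidate conversion metrics from raw events."""
    listing = [e for e in events if e.kind == "listing"]
    ty = [e for e in events if e.kind == "ty"]
    orders = sum(1 for e in events if e.kind == "order")

    listing_views = len(listing)                                  # page views
    listing_visits = len({(e.visitor, e.visit) for e in listing})  # sessions
    listing_visitors = len({e.visitor for e in listing})           # people
    ty_visits = len({(e.visitor, e.visit) for e in ty})
    ty_visitors = len({e.visitor for e in ty})

    return {
        "1_orders_per_listing_visit": orders / listing_visits,
        "2_ty_visitors_per_listing_visitor": ty_visitors / listing_visitors,
        "3_ty_visits_per_listing_visit": ty_visits / listing_visits,
        "4_orders_per_listing_pageview": orders / listing_views,
    }


base = [
    Event("u1", 1, "listing"), Event("u1", 1, "order"), Event("u1", 1, "ty"),
    Event("u2", 1, "listing"),
]
# Same traffic, but u2 refreshes the listing page once.
refreshed = base + [Event("u2", 1, "listing")]

before = conversion_metrics(base)
after = conversion_metrics(refreshed)
changed = [name for name in before if before[name] != after[name]]
print(changed)
```

The other quiz rows can be explored the same way by appending the corresponding events (a second order, a repeated TY page view, and so on) to `base`.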

1. If showing a summary of hotel USPs on the details page is improving conversion?

2. If a user who purchased with MMT will come back again?

3. If we are sending too many or too few notifications to users?

How can you measure?

1. If showing a summary of hotel USPs on the details page is improving conversion?
A. A simple A/B set-up with and without the feature will help in evaluation

2. If a user who purchased with MMT will come back again?
A. A secondary metric, captured by asking buyers this question or through an NPS survey and compared across variants, should give some idea

3. If we are sending too many or too few notifications to users?
A. An indirect metric, measured as retained users on the app across the two variants

How can you measure?

Size & Duration

A detour into statistics:

Reality             Test Output         Probability
Control is better   Control is better   1-α (confidence level)
Control is better   Test is better      α (significance)
Test is better      Test is better      1-β (power)
Test is better      Control is better   β

α, or type-I error, is the probability of rejecting the null when it is true, i.e. declaring the test better when control is better (Downside Error)

β, or type-II error, is the probability of accepting the null when it is false, i.e. declaring control better when the test is better (Opportunity Cost Error)

Typical target values are α = 5% and 1-β = 80%

Size & Duration

Size:
• To figure out the sample size required to get 80% power for the test, use a sample size calculator (link below)
• This many users need to be targeted in the smallest of the test variants being examined

Duration:
• Is an outcome of what % of traffic you can direct to the test, plus some minimum duration considerations
• You might want to limit the % exposure of the experiment due to:
  • Revenue impacts
  • Leaving room for other people to experiment
• Even if the sample size for the required power can be reached in a shorter duration, it is good to reduce the exposure of the experiment so that the test period includes:
  • At least 1 full cycle of weekdays and a weekend
  • Low & high discounting periods (if possible)
  • Low & high availability periods (if possible)

http://www.evanmiller.org/ab-testing/sample-size.html
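As a rough cross-check of the size and duration reasoning above, the sketch below uses a standard normal-approximation formula for comparing two proportions. It is not necessarily the formula the linked calculator uses, so the numbers may differ slightly. The function names, the 7-day floor and the traffic figures are assumptions made for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline: float, min_effect: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in EACH variant to detect an absolute lift of min_effect.

    baseline   : control conversion rate, e.g. 0.10
    min_effect : practical significance threshold (absolute), e.g. 0.02
    """
    p1 = baseline
    p2 = baseline + min_effect
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / min_effect ** 2)


def duration_days(n_per_variant: int, variants: int, daily_users: int,
                  exposure: float, min_days: int = 7) -> int:
    """Days needed at a given traffic exposure.

    min_days is an assumed floor so that the test covers at least one
    full weekday/weekend cycle even when the sample fills up sooner.
    """
    days_for_sample = ceil(n_per_variant * variants / (daily_users * exposure))
    return max(days_for_sample, min_days)


n = sample_size_per_variant(baseline=0.10, min_effect=0.02)
print(n, duration_days(n, variants=2, daily_users=10_000, exposure=0.10))
```

Halving `min_effect` roughly quadruples the required sample, which is why the practical significance threshold should be chosen before the experiment starts.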

No Peeking
• It is important not to compromise the test's error rates by taking a decision on insufficient data (repeatedly checking for significance and stopping early inflates false positives)
• Best explained in the blog (first link below). The primary idea is that taking duration cues from early data introduces human error in the measurement
• In case the sample size is turning out to be very high, a few ways to reduce it are:
  • Use the sequential sampling approach (reduces size by as much as 50% in some scenarios)
  • Use the Bayesian sampling approach (mathematically intensive)
  • Try matching the lowest unit of measurement with the lowest unit of distribution (eg instead of measuring latency/user, measure latency per hit and distribute the experiment on hit)
  • Try moving the experiment allocation closer to the step where there is an actual change (eg assign a payment experiment to payment page users)

http://www.evanmiller.org/how-not-to-run-an-ab-test.html

http://www.evanmiller.org/sequential-ab-testing.html

http://www.evanmiller.org/bayesian-ab-testing.html
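The cost of peeking can be seen in a small A/A simulation. The sketch below is an illustration under assumed parameters (10% conversion in both arms, 1,000 users per arm, a look every 100 users); it is not taken from the linked articles. Both arms are identical, so every "significant" result is a false positive.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided test at alpha = 5%


def is_significant(conv_a: int, conv_b: int, n: int) -> bool:
    """Pooled two-proportion z-test with n users in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    return abs(conv_a - conv_b) / n / se > Z_CRIT


def false_positive_rates(runs: int = 300, n: int = 1000, peek_every: int = 100,
                         rate: float = 0.10, seed: int = 7) -> tuple[float, float]:
    """Return (fixed-horizon, peeking) false positive rates for an A/A test."""
    rng = random.Random(seed)
    fixed_hits = peek_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        ever_significant = False
        for i in range(1, n + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # A "peeker" would stop and declare a winner at the first
            # look that crosses the significance threshold.
            if i % peek_every == 0 and is_significant(conv_a, conv_b, i):
                ever_significant = True
        peek_hits += ever_significant
        # The disciplined analyst looks only once, at the planned size.
        fixed_hits += is_significant(conv_a, conv_b, n)
    return fixed_hits / runs, peek_hits / runs


fixed_rate, peeking_rate = false_positive_rates()
print(f"fixed horizon: {fixed_rate:.3f}  with peeking: {peeking_rate:.3f}")
```

The fixed-horizon rate should sit near the nominal 5%, while the peeking rate can only be equal or higher, because the final look is one of the peeks.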

Distribution Metric
1. Page Views
2. Cookies
3. Login-ID
4. Device ID
5. IP Address

Tooling & Distribution (1/2)

Which of these (1, 2, 3, 4, 5) will not be hampered by each of the following?

User shortlists 2-3 hotels and comes back after a day

User starts search on mobile and books on desktop

User changes browsers on the machine

User logs out and continues with another ID

Typical requirements for an AB system are:
• Each experiment should support multiple variants (A/B/C..) and each variant can be defined using a combination of experiment variables
• Each user is randomly assigned a variant (as per the distribution percentage). The system ensures users are served a consistent experience based on their device ID or cookie (other distribution parameters like page view or visit might be used, but cookie/device ID is the most stable)
• Auto-logs the variant that the users are being exposed to in an analytics system

There are multiple AB testing systems available from several vendors, or one can be easily created internally using a tag manager like Google Tag Manager
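One common way to meet the "random but consistent" requirement is to hash the distribution ID together with the experiment name. The sketch below is a minimal illustration of that idea, not a description of any particular vendor's system; the function names, the experiment name and the in-memory `sink` list standing in for an analytics system are all hypothetical.

```python
import hashlib


def assign_variant(experiment: str, unit_id: str, split: dict[str, int]) -> str:
    """Map a device ID or cookie to a variant.

    split maps variant name -> percentage of traffic and must sum to 100.
    The same (experiment, unit_id) pair always lands in the same variant.
    """
    if sum(split.values()) != 100:
        raise ValueError("split percentages must sum to 100")
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    upper = 0
    for variant, pct in split.items():
        upper += pct
        if bucket < upper:
            return variant
    raise AssertionError("unreachable: buckets cover 0..99")


def log_exposure(experiment: str, unit_id: str, variant: str, sink: list) -> None:
    """Record which variant a unit saw (sink stands in for an analytics system)."""
    sink.append({"experiment": experiment, "unit": unit_id, "variant": variant})


exposures: list = []
split = {"A": 90, "B": 10}  # start small, ramp up later
variant = assign_variant("big-login-button", "device-123", split)
log_exposure("big-login-button", "device-123", variant, exposures)
print(variant)
```

Including the experiment name in the hash keeps assignments independent across experiments, so the same devices do not always end up in the test group.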

Tooling & Distribution (2/2)

Always best to start small in terms of distribution when you turn on the experiment and ramp-up after initial results look stable (& no major drop in conversion)

A/A Testing:
• Ideally, it is good to run one or more A/A tests, before and after your test period, on the same metric you’re planning to measure in the A/B test
• Even if the above is not feasible, do try to run A/A tests regularly to test the underlying system
• Things to test during A/A tests: key metrics you measure (like conversion, counts, page-views, etc) and their statistical difference between the two cohorts at different ratios of test & control

A/A & Invariance Testing

Invariance Testing
• Identify invariance metrics: metrics that should not change between control & experiment
• One of the basic invariants is the count of users assigned to each group. Very important to test these
• Each of the invariants should be within statistical bounds between experiment and control
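The user-count invariant can be checked with a normal approximation to the binomial. The sketch below assumes a planned 50/50 split by default; the function name and the sample counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def assignment_count_ok(n_control: int, n_test: int,
                        expected_test_share: float = 0.5,
                        alpha: float = 0.05) -> bool:
    """True if the observed test share is within the (1 - alpha) bounds
    expected from random assignment at the planned split."""
    total = n_control + n_test
    se = sqrt(expected_test_share * (1 - expected_test_share) / total)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    observed = n_test / total
    return abs(observed - expected_test_share) <= margin


print(assignment_count_ok(5000, 5050))  # small imbalance
print(assignment_count_ok(5000, 5400))  # large imbalance
```

A failed check usually points at the assignment or logging pipeline rather than at the feature, so it is worth investigating before reading any conversion numbers.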

A/A & Invariance Testing

What would be the invariant metrics for the following feature:Showing details of the exact meal served on the full service domestic flight on the flights review page.

1. Remember the practical significance threshold used in the sample size calculator. That is the least change that we care about, so a statistically significant change smaller than the practical significance threshold is useless.

2. Choose the distribution & test:
• Counts: Poisson distribution or Poisson-means test
• Rates: Poisson distribution or Poisson-means test
• Click-through probability: binomial distribution & t-test (or chi-square test)

Analyzing Results (1/3)

http://www.evanmiller.org/ab-testing/poisson-means.html


http://www.evanmiller.org/ab-testing/t-test.html
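For a click-through or conversion probability, statistical and practical significance can be checked together. The sketch below uses a pooled two-proportion z-test, which on a 2x2 table gives the same result as a chi-square test without continuity correction; it is a swapped-in illustration, not the linked calculators' implementation. The function name, the sample counts and the rule "the whole confidence interval must clear the threshold" are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def compare_conversion(conv_c: int, n_c: int, conv_t: int, n_t: int,
                       d_min: float, alpha: float = 0.05) -> dict:
    """Compare test vs control conversion.

    d_min is the practical significance threshold (absolute difference).
    """
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c

    # Statistical significance: pooled standard error under H0 (no difference).
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval for the difference: unpooled standard error.
    se_unpooled = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_unpooled
    low, high = diff - margin, diff + margin

    return {
        "diff": diff,
        "p_value": p_value,
        "ci": (low, high),
        "statistically_significant": p_value < alpha,
        # Assumed rule: the entire interval lies at or above d_min.
        "practically_significant": low >= d_min,
    }


result = compare_conversion(conv_c=1000, n_c=10_000,
                            conv_t=1150, n_t=10_000, d_min=0.005)
print(result)
```

A result that is statistically but not practically significant falls into the "keep testing or don't launch" territory rather than "launch".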

Analyzing Results (2/3): Taking Decision

[Figure: measured change shown against three reference lines (- Practical Significance, No Difference (0), + Practical Significance). For each case, decide: Launch, Don’t Launch or Keep Testing.]

Analyzing Results (2/3): Taking Decision

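[Figure: the decision chart with answers filled in, against the same reference lines (- Practical Significance, No Difference (0), + Practical Significance). Launch: Yes in one case; the remaining cases read No, Keep Testing; No, Don’t Launch; No, Keep Testing.]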

Analyzing Results (3/3): Taking Decision

Outcome A: A feature was impacting three different metrics. All are positive.
• Analyze more cuts: is the data intuitive across these cuts? Is a listing page change impacting listing -> details conversion, or is it something else that is giving the upside?
• Has the experiment run a complete cycle of external changes (weekday/weekend etc)?
• Time to increase rollout or make it 100%, depending upon confidence.

Outcome B: A feature was impacting three different metrics. One or all of them are not statistically significant.
Need to understand what exactly might be failing. The list would be specific to the feature launched, but an indicative list might be:
• Daily conversion analysis: were the conversions higher for any significant period?
• Cut across cities/routes: are there pockets that have done better?
• Cut across customer dimensions: single-pax, round-trip/one-way, number of children, AP range, logged-in users, users with wallet money, etc
• Device/browser/operating system & versions
• Step-wise analysis of the various stages of the conversion funnel

The idea is to analyze deeply to understand what is happening. Based on that, take a call to modify and continue experimentation or roll back the experiment.

A/B/C Setup

A particular type of experiment set-up that is beneficial where there might be server & client side effects that introduce bias. A few examples:

• Measure impact of persuasion shown (say, last room left)
  User might be positively impacted to convert higher, v/s
  Higher latency to fetch persuasion might reduce conversion
• Showing a message “Cheaper than Rajdhani” on flights > 75 mins duration and fare < 3000
  User might be positively impacted to convert, v/s
  Conversion for cheaper flights (< 3000) is generally higher
• Showing a USP of the hotel generated from user reviews, eg. guests love this because: “great neighborhood to stay”
  User might be positively impacted to convert, v/s
  Feature might only be visible on hotels with > X reviews (and hence bookings). There is an innate hotel bias.

In these scenarios, it is best to set up 3 variants:

A = Feature off (control)
B = Feature on but not shown to users
C = Feature on and shown to users
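With three variants, the overall change splits into two parts by simple subtraction. The sketch below is only an arithmetic illustration with made-up conversion rates; the function and key names are hypothetical, and in practice each difference would need its own significance test.

```python
def decompose_abc(conv_a: float, conv_b: float, conv_c: float) -> dict[str, float]:
    """Split the A -> C change into a side effect and a display effect.

    conv_a: conversion with the feature off (control)
    conv_b: conversion with the feature on but hidden from users
    conv_c: conversion with the feature on and shown to users
    """
    return {
        "side_effect": conv_b - conv_a,     # e.g. latency or eligibility bias
        "display_effect": conv_c - conv_b,  # what seeing the feature adds
        "net_effect": conv_c - conv_a,      # what a plain A/B test would report
    }


effects = decompose_abc(conv_a=0.100, conv_b=0.095, conv_c=0.110)
for name, value in effects.items():
    print(f"{name}: {value:+.3f}")
```

In this made-up case a plain A/B test would report a 1 point lift, hiding the fact that the message itself is worth more and is partly offset by the cost of fetching it.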

A/B/C Setup

AB testing in an organization typically goes through the following stages:

Would encourage you all to help your organization move to the next stage in the AB testing journey

Best to be in a state where the company culture supports quick prototyping and testing with real users

Solving for multi device (stitching sessions) and other tracking limitations in the set-up

Higher standards of experiment analysis and responsible reporting

Things to Improve

• Sanity Checks
• Testing for conflict resolution
• Testing for impact measurement
• Testing for hypothesis
• Rapid prototyping & testing

Definitely read the Evan Miller blog. It basically summarizes everything you need to know.

If keen on getting into more detail on techniques and best practices, take the course on Udacity. Just doing the first chapter would be good enough.

Further Reading

http://www.evanmiller.org/ab-testing/

https://www.udacity.com/course/ab-testing--ud257
