# Crash Course in A/B testing


Introduction to A/B testing

Crash Course in A/B Testing: A Statistical Perspective

Wayne Tai Lee

Roadmap
- What is A/B testing?
- Good experiments and the role of statistics
- Similar to proof by contradiction
- Tests
- Big data meets classic asymptotics
- Complaints with classical hypothesis testing
- Alternatives?

What is A/B testing?
- An industry term for a controlled, randomized experiment between treatment and control groups.
- An age-old problem, especially with humans.

What most people know:

Gather samples → assign treatments (A vs. B) → apply treatments → measure the outcome → compare.

The only difference is in the treatment!

Reality:

There is variability from the samples/inputs, variability from the treatment/function, and variability from the measurement. How do we account for all that?

If there are sources of variability in addition to the treatment effect, how can we identify and isolate the effect of the treatment?

Confounding:

Three types of variability:
- Controlled variability: systematic and desired (i.e., our treatment).
- Bias: systematic but not desired; anything that can confound our study.
- Noise: random error, not desired; it won't confound the study, but it makes it hard to make a decision.

How do we categorize each source of variability: samples/inputs, treatment/function, and measurement?

Reality: address variability from measurement with good instrumentation!

Reality: for variability from the samples, randomize assignment! Randomization converts bias into noise.
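Randomization converts systematic imbalance into random noise. A minimal simulation sketch, where the skewed "age" covariate and all numbers are made up for illustration:

```python
import random
from statistics import mean

random.seed(1)

# Hypothetical skewed covariate (e.g. user age) for 1000 people.
population = [random.expovariate(1 / 35) for _ in range(1000)]

# Randomly split into A and B many times; any covariate imbalance
# between the groups is now random (noise), not systematic (bias).
diffs = []
for _ in range(500):
    shuffled = random.sample(population, len(population))
    group_a, group_b = shuffled[:500], shuffled[500:]
    diffs.append(mean(group_a) - mean(group_b))

# Across repeated randomizations the imbalance centers on zero.
print(round(mean(diffs), 2))
```

No single randomization is perfectly balanced, but the imbalance has no preferred direction, so it behaves like noise that the test can quantify.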

Reality: your population itself can be skewed or biased, but that only restricts the generalizability of the results.

Reality: think about what you want to measure and how! Minimize the noise level/variability in the metric. (Example: paired wine tasting vs. just two independent groups.)

A good experiment in general:
- Good design and implementation should be used to avoid bias.
- For unavoidable biases, use randomization to turn them into noise.
- Good planning to minimize noise in the data.

How do we deal with noise? The bread and butter of statisticians!
- Quantify the magnitude of the treatment effect.
- Quantify the magnitude of the noise.
- Just compare... most of the time.

Formalizing the comparison: similar to proof by contradiction.
- Assume the difference is due to chance (noise).
- See how the data contradicts the assumption.
- If the surprise surpasses a threshold, we reject the assumption.
- ...nothing is 100%.

Difference due to chance? Red = treatment; black = control. (The stated group means, 72 and 124.5, imply persons 1, 2, 3, and 5 were treated.)

| ID       | PV  | Group     |
|----------|-----|-----------|
| Person 1 | 39  | treatment |
| Person 2 | 209 | treatment |
| Person 3 | 31  | treatment |
| Person 4 | 98  | control   |
| Person 5 | 9   | treatment |
| Person 6 | 151 | control   |

Let's measure the difference in means! Diff = 72 - 124.5 = -52.5... so what?
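The observed difference can be checked directly; the group labels below are inferred from the stated means of 72 and 124.5:

```python
# PV values from the table; grouping inferred from the stated means.
treatment = [39, 209, 31, 9]   # persons 1, 2, 3, 5 (red)
control = [98, 151]            # persons 4, 6 (black)

diff = sum(treatment) / len(treatment) - sum(control) / len(control)
print(diff)  # -52.5
```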


If there were no difference from the treatment, shuffling the treatment labels would emulate the randomization of the samples.

One shuffled relabeling: Diff = 122.25 - 24 = 98.25.

Another shuffled relabeling: Diff = 107.5 - 53.5 = 54.

Difference due to chance?

Our original: -52.5. 50,000 repeats later...

Difference due to chance?

Our original: -52.5. 46.5% of the permutations yielded a difference at least as large in magnitude as our original sample's. Are you surprised by the initial results?

Tests: Congratulations! You just learned the permutation test! The 46.5% is the p-value under the permutation test.
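A minimal sketch of that permutation test, with group labels inferred from the stated means. With only six people we can enumerate all C(6, 4) = 15 relabelings exactly rather than sampling 50,000 random shuffles:

```python
from itertools import combinations
from statistics import mean

# PV values from the slide; persons 1, 2, 3, 5 treated
# (inferred from the stated group means of 72 and 124.5).
values = [39, 209, 31, 98, 9, 151]
treated = {0, 1, 2, 4}

def diff_in_means(treat_idx):
    t = [values[i] for i in treat_idx]
    c = [values[i] for i in range(len(values)) if i not in treat_idx]
    return mean(t) - mean(c)

observed = diff_in_means(treated)  # -52.5

# If the treatment did nothing, every relabeling of who was "treated"
# is equally likely, so count how often a relabeling is as extreme.
extreme = sum(
    1 for combo in combinations(range(len(values)), 4)
    if abs(diff_in_means(set(combo))) >= abs(observed)
)
p_value = extreme / 15
```

Exhaustive enumeration gives 7/15 ≈ 46.7%, consistent with the slide's 46.5% estimate from 50,000 random shuffles.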


Problems:
- Permuting the labels can be computationally costly; this was not possible before computers!
- Statistical theory says there are many other tests out there.


Standard t-test:
1) Calculate delta: delta = mean_treatment - mean_control.
2) Assume delta follows a Normal distribution under the assumption of no difference, then calculate the p-value: the sum of the two tail areas beyond +/-delta around 0.
3) If the p-value < 0.05, we reject the assumption that there is no difference between treatment and control.
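A sketch of those three steps using the large-sample Normal approximation discussed on the next slides, with made-up example data (stdlib only; a production pipeline would typically use something like `scipy.stats.ttest_ind`):

```python
from statistics import NormalDist, mean, variance

# Made-up example metrics for the two groups.
treatment = [5.1, 4.8, 5.5, 5.0, 4.9, 5.3, 5.2, 5.0]
control = [4.7, 4.9, 4.6, 5.0, 4.8, 4.5, 4.9, 4.7]

# 1) delta = mean_treatment - mean_control
delta = mean(treatment) - mean(control)

# 2) Under "no difference", delta / se is approximately standard
#    Normal for large samples; the p-value is the two tail areas.
se = (variance(treatment) / len(treatment)
      + variance(control) / len(control)) ** 0.5
p_value = 2 * (1 - NormalDist().cdf(abs(delta) / se))

# 3) Reject "no difference" if the p-value is below 0.05.
reject = p_value < 0.05
```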

Wait, our metrics may not be Normal!

Big data meets classic Stats


We care about the mean of the metric, not the actual metric distribution.


Central Limit Theorem: the mean of the metric will be approximately Normal if the sample size is LARGE!
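A quick simulation of the Central Limit Theorem with a deliberately non-Normal metric (an exponential with mean 1, e.g. time between clicks; the sample sizes are arbitrary):

```python
import random
from statistics import mean, stdev

random.seed(0)

# The metric itself is heavily skewed, not Normal.
def sample_mean(n):
    return mean(random.expovariate(1.0) for _ in range(n))

# But the *sample mean* of n = 500 draws is approximately
# Normal(1, 1 / sqrt(500)) by the Central Limit Theorem.
sample_means = [sample_mean(500) for _ in range(2000)]

print(round(mean(sample_means), 2))   # near the true mean of 1
print(round(stdev(sample_means), 3))  # near 1 / sqrt(500), about 0.045
```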

Assumptions of the t-test:
- Normality of %delta: guaranteed with large sample sizes.
- Independent samples.
- Not too many 0s.

That's IT!!! Easy to automate. Simple and general.

What are tests? Statistical tests are just procedures that depend on data to make a decision. Engineerify: statistical tests are functions that take in data and treatments, and return a boolean.

Guarantees: by comparing the p-value to a 5% threshold, we control P(Test says a difference exists | In reality there is NO difference) = 5%.
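In that engineer's view, a test is literally a function from data to a boolean. A sketch of the large-sample test in that form (the `ab_test` helper is hypothetical, stdlib only):

```python
from statistics import NormalDist, mean, variance

def ab_test(treatment, control, alpha=0.05):
    """Statistical test as a function: data in, boolean out.

    True means: at threshold `alpha`, reject the assumption that
    the difference between the groups is due to chance alone.
    """
    delta = mean(treatment) - mean(control)
    se = (variance(treatment) / len(treatment)
          + variance(control) / len(control)) ** 0.5
    if se == 0:
        return False  # no variation: nothing to weigh the delta against
    p_value = 2 * (1 - NormalDist().cdf(abs(delta) / se))
    return p_value < alpha
```

Setting `alpha = 0.05` is exactly the 5% false-positive guarantee above: when there is truly no difference, the function returns `True` only about 5% of the time.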