
Page 1: Opticon 2017 Running Experiment Engines with Stats Engine

Experimenting with Stats Engine
Pete Koomen
Co-founder, CTO, Optimizely
@koomen
[email protected]

opticon2017

Page 2: Opticon 2017 Running Experiment Engines with Stats Engine

Agenda

1. Why we built Stats Engine
2. How to make decisions with Stats Engine
3. How to scale your decision process

Page 3: Opticon 2017 Running Experiment Engines with Stats Engine


Why we built Stats Engine

Page 4: Opticon 2017 Running Experiment Engines with Stats Engine
Page 5: Opticon 2017 Running Experiment Engines with Stats Engine
Page 6: Opticon 2017 Running Experiment Engines with Stats Engine

The study followed 1,291 participants for 10 years.

No exercise: 438 participants, 128 deaths (29%)
Light exercise: 576 participants, 7 deaths (1%)
Moderate exercise: 262 participants, 8 deaths (3%)
Heavy exercise: 40 participants, 2 deaths (5%)
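The heavy-exercise group is tiny, so its 5% death rate is very noisy. As a rough illustration (our computation, not from the talk), Wilson score intervals for each group's death rate show how little 40 participants can tell you:

```python
from math import sqrt

def wilson_interval(deaths, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = deaths / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Death rates with uncertainty, per exercise group from the slide:
for label, deaths, n in [("none", 128, 438), ("light", 7, 576),
                         ("moderate", 8, 262), ("heavy", 2, 40)]:
    lo, hi = wilson_interval(deaths, n)
    print(f"{label:>8}: {deaths}/{n} -> {lo:.1%} to {hi:.1%}")
```

The heavy-exercise interval spans roughly 1% to 16%, overlapping both the light and moderate groups: two deaths out of 40 is not evidence of anything.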

Page 7: Opticon 2017 Running Experiment Engines with Stats Engine

“Thank goodness a third person didn't die, or public health authorities would be banning jogging.”

– Alex Hutchinson, Runner’s World

Page 8: Opticon 2017 Running Experiment Engines with Stats Engine
Page 9: Opticon 2017 Running Experiment Engines with Stats Engine
Page 10: Opticon 2017 Running Experiment Engines with Stats Engine

“A/A” results

Page 11: Opticon 2017 Running Experiment Engines with Stats Engine

The t-test (a.k.a. “NHST”, a.k.a. “Student’s t-test”)

The t-test in a nutshell:
1. Run your experiment until you have reached the required sample size, and then stop.
2. Ask “What are the chances I’d have gotten these results in an A/A test?” (p-value)
3. If p-value < 5%, your results are significant.

Page 12: Opticon 2017 Running Experiment Engines with Stats Engine

The t-test was designed for the world of 1908:
- Data is expensive.
- Data is slow.
- Practitioners are trained.

The world of 2017:
- Data is cheap.
- Data is real-time.
- Practitioners are everyone.

Page 13: Opticon 2017 Running Experiment Engines with Stats Engine

T-test pitfalls:
1. Peeking
2. Multiple comparisons

Page 14: Opticon 2017 Running Experiment Engines with Stats Engine

1. Peeking

Page 15: Opticon 2017 Running Experiment Engines with Stats Engine

[Figure: an experiment timeline from start to the minimum sample size, with four peeks at the p-value. Three peeks show p-value > 5% (inconclusive); one shows p-value < 5% (significant).]

Page 16: Opticon 2017 Running Experiment Engines with Stats Engine

Why is this a problem?

There is a ~5% chance of seeing a false positive each time you peek.

Page 17: Opticon 2017 Running Experiment Engines with Stats Engine

[Figure: the same timeline, with the experimenter stopping at the first peek where p-value < 5%.]

4 peeks → ~18% chance of seeing a false positive
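The ~18% figure treats the four peeks as independent (1 − 0.95⁴ ≈ 18.5%); in practice successive peeks share data and are correlated, so the inflation is somewhat smaller, but still well above 5%. A minimal simulation (our sketch, not Optimizely code) makes the effect concrete:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 2000   # simulated A/A tests (no real difference exists)
n_per_peek = 250       # visitors per arm added between peeks
n_peeks = 4
alpha = 0.05

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(size=n_per_peek * n_peeks)
    b = rng.normal(size=n_per_peek * n_peeks)
    for peek in range(1, n_peeks + 1):
        n = peek * n_per_peek
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < alpha:                 # experimenter stops at the first "significant" peek
            false_positives += 1
            break

print(f"false positive rate with {n_peeks} peeks: "
      f"{false_positives / n_experiments:.1%}")   # well above the nominal 5%
```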

Page 18: Opticon 2017 Running Experiment Engines with Stats Engine

The t-test in a nutshell:
1. Run your experiment until you have reached the required sample size, and then stop.
2. Ask “What are the chances I’d have gotten these results in an A/A test?” (p-value)
3. If p-value < 5%, your results are significant.

Peeking breaks step 1: the p-value is only trustworthy if you stop exactly when the required sample size is reached.

Page 19: Opticon 2017 Running Experiment Engines with Stats Engine

[Figure: a results dashboard checked at 1:45, 2:45, 3:45, 4:45, and 5:45. Every check is a peek.]

Page 20: Opticon 2017 Running Experiment Engines with Stats Engine

Solution: Stats Engine uses sequential testing to compute an “always-valid” p-value.
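The mechanics are described in the KDD 2017 paper cited later in the deck (“Peeking at A/B Tests”). The core idea is a mixture sequential probability ratio test: a likelihood ratio against the null, averaged over a prior on effect sizes, whose running inverse gives a p-value that stays valid no matter when you stop. A simplified sketch for normally distributed per-visitor differences (our illustration; the `tau` parameter and the plug-in variance estimate are our assumptions, not Stats Engine's actual choices):

```python
import numpy as np

def always_valid_pvalues(diffs, tau=0.1):
    """Always-valid p-values for H0: mean(diffs) == 0, via a mixture
    sequential probability ratio test with a N(0, tau^2) prior on the
    effect size. `diffs` are per-visitor (variation - control) values."""
    diffs = np.asarray(diffs, dtype=float)
    sigma2 = np.var(diffs, ddof=1)        # plug-in variance estimate (a simplification)
    n = np.arange(1, len(diffs) + 1)
    xbar = np.cumsum(diffs) / n
    # Mixture likelihood ratio against H0 for normal observations:
    lam = np.sqrt(sigma2 / (sigma2 + n * tau**2)) * np.exp(
        n**2 * tau**2 * xbar**2 / (2 * sigma2 * (sigma2 + n * tau**2))
    )
    # 1/lam is a valid p-value at every n; take the running minimum so the
    # test never "un-rejects". You may stop, or peek, whenever you like.
    return np.minimum.accumulate(np.minimum(1.0, 1.0 / lam))
```

Because this p-value is valid at every sample size, stopping the moment it crosses 5% does not inflate the false positive rate the way peeking at a t-test does.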

Page 21: Opticon 2017 Running Experiment Engines with Stats Engine

2. Multiple Comparisons

Page 22: Opticon 2017 Running Experiment Engines with Stats Engine

© Randall Patrick Munroe, xkcd.com

Page 23: Opticon 2017 Running Experiment Engines with Stats Engine

© Randall Patrick Munroe, xkcd.com

Page 24: Opticon 2017 Running Experiment Engines with Stats Engine

[Figure: a results grid of 4 variations (A–D) compared against a control on 5 metrics: 20 simultaneous comparisons in one experiment.]

Page 25: Opticon 2017 Running Experiment Engines with Stats Engine

False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”

False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”

Solution: Stats Engine controls False Discovery Rate by becoming more conservative when more metrics and variations are added to a test.
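Why the distinction matters: even with a 5% false positive rate, most “winners” can be flukes if real effects are rare. A worked example with hypothetical rates (our assumptions, not numbers from the talk):

```python
# Hypothetical rates (assumptions for illustration):
real = 0.10    # fraction of variations with a real effect
power = 0.80   # chance a real effect reaches significance
fpr = 0.05     # false positive rate of each test

true_wins = real * power          # real winners
false_wins = (1 - real) * fpr     # flukes that look like winners
fdr = false_wins / (true_wins + false_wins)
print(f"false discovery rate: {fdr:.0%}")   # ~36%, despite a 5% FPR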

Page 26: Opticon 2017 Running Experiment Engines with Stats Engine


How to make decisions with Stats Engine

- When should I stop an experiment?
- Understanding resets
- How do additional variations and metrics affect my experiment?
- How do I trade off between risk and velocity?

Page 27: Opticon 2017 Running Experiment Engines with Stats Engine

When should I stop an experiment?

Page 28: Opticon 2017 Running Experiment Engines with Stats Engine

[Screenshot: a results page showing “visitors remaining” for a variation.]

👍 Use “visitors remaining” to decide whether continuing your experiment is worth it.
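Stats Engine computes its own sequential estimate of visitors remaining. As intuition for the order of magnitude, the classic fixed-horizon sample-size formula (standard textbook math, not Stats Engine's method; the function name is ours) shows why small lifts on small base rates need many visitors:

```python
from math import ceil, sqrt
from scipy.stats import norm

def visitors_needed(base_rate, rel_lift, alpha=0.05, power=0.80):
    """Fixed-horizon sample size per variation to detect a relative lift
    on a conversion rate with a two-sided test."""
    p1, p2 = base_rate, base_rate * (1 + rel_lift)
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    pbar = (p1 + p2) / 2
    n = ((z_a * sqrt(2 * pbar * (1 - pbar))
          + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return ceil(n)

# A 5% baseline conversion rate and a 10% relative lift:
print(visitors_needed(0.05, 0.10))               # ~31,000 visitors per variation
print(visitors_needed(0.05, 0.10, alpha=0.10))   # lower threshold -> fewer visitors
```

If “visitors remaining” dwarfs your traffic, the honest options are to accept an inconclusive result or test a bigger change.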

Page 29: Opticon 2017 Running Experiment Engines with Stats Engine

Understanding resets

Page 30: Opticon 2017 Running Experiment Engines with Stats Engine


Page 31: Opticon 2017 Running Experiment Engines with Stats Engine

“Peeking at A/B Tests: Why it matters, and what to do about it” KDD 2017

👍 Statistical Significance rises whenever there is strong evidence of a difference between variation and control.

Page 32: Opticon 2017 Running Experiment Engines with Stats Engine

“Peeking at A/B Tests: Why it matters, and what to do about it” KDD 2017


Page 33: Opticon 2017 Running Experiment Engines with Stats Engine

[Screenshot: a variation whose statistical significance has dropped back toward zero after an underlying change.]

👍 Statistical Significance will “reset” when there is strong evidence of an underlying change.

Page 34: Opticon 2017 Running Experiment Engines with Stats Engine

[Screenshot: a variation whose confidence interval spans -19.3% to -2.58%, with the point estimate near the interval’s edge.]

👍 If your point estimate is near the edge of its confidence interval, consider running the experiment longer.
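As intuition for what those endpoints mean, here is a fixed-horizon normal-approximation interval for the difference in conversion rates (our sketch, with the function name `lift_interval` our own; Stats Engine reports sequential intervals, which are wider). A wide interval relative to the effect is the signal that more data would help:

```python
from math import sqrt
from scipy.stats import norm

def lift_interval(conv_c, n_c, conv_v, n_v, level=0.90):
    """Normal-approximation interval for the absolute difference in
    conversion rates (variation - control)."""
    pc, pv = conv_c / n_c, conv_v / n_v
    se = sqrt(pc * (1 - pc) / n_c + pv * (1 - pv) / n_v)
    z = norm.ppf(1 - (1 - level) / 2)
    diff = pv - pc
    return diff - z * se, diff, diff + z * se

lo, mid, hi = lift_interval(500, 5000, 440, 5000)
print(f"{lo:+.2%} .. point {mid:+.2%} .. {hi:+.2%}")
```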

Page 35: Opticon 2017 Running Experiment Engines with Stats Engine

How do additional variations and metrics affect my experiment?

Page 36: Opticon 2017 Running Experiment Engines with Stats Engine

False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”

False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”

Solution: Stats Engine controls False Discovery Rate by becoming more conservative when more metrics and variations are added to a test.

Page 37: Opticon 2017 Running Experiment Engines with Stats Engine

Stats Engine treats each metric as a “signal”.

High-signal metrics are directly affected by the experiment.
Low-signal metrics are affected indirectly, or not at all.

Page 38: Opticon 2017 Running Experiment Engines with Stats Engine

False Positive Rate = P( 10% Lift | No Real Improvement )
False Discovery Rate = P( No Real Improvement | 10% Lift )

Solution: Stats Engine controls False Discovery Rate by becoming more conservative when more low-signal metrics and variations are added to a test.
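“More conservative” is how false-discovery-rate procedures behave in general. Stats Engine uses a sequential variant (per the KDD 2017 paper); as an illustration of the underlying idea, the classic fixed-horizon Benjamini–Hochberg procedure tightens the per-comparison bar as the number of comparisons grows:

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Fixed-horizon Benjamini-Hochberg: returns a boolean mask of
    discoveries while keeping the false discovery rate <= q."""
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)
    # Largest k such that p_(k) <= (k/m) * q; reject the k smallest p-values.
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    k = int(np.nonzero(below)[0].max()) + 1 if below.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True
    return mask

# Six metric/variation comparisons: more comparisons, stricter per-test bar.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.60]))
# [ True  True False False False False]
```

Note that 0.039 and 0.041 would pass a naive 5% threshold but are rejected once the correction accounts for six comparisons.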

Page 39: Opticon 2017 Running Experiment Engines with Stats Engine

[Figure: a results grid of 4 variations (A–D) by 8 metrics, with the metrics grouped into Primary, Secondary, and Monitoring.]

Page 40: Opticon 2017 Running Experiment Engines with Stats Engine

👍 For maximum velocity, use “high signal” primary and secondary metrics.

👍 Track “low signal” metrics as monitoring metrics.

Page 41: Opticon 2017 Running Experiment Engines with Stats Engine

How do I trade off between risk and velocity?

Page 42: Opticon 2017 Running Experiment Engines with Stats Engine

[Screenshot: the setting for the maximum False Discovery Rate, i.e. the Statistical Significance threshold.]

👍 Use your Statistical Significance threshold to control risk vs. velocity: a lower threshold (e.g. 90% instead of 95%) accepts more false discoveries in exchange for faster decisions.

Page 43: Opticon 2017 Running Experiment Engines with Stats Engine


How to scale your decision process

- Risk vs. velocity for experimentation programs
- Getting organizational buy-in

Page 44: Opticon 2017 Running Experiment Engines with Stats Engine

Risk vs. Velocity for Experimentation Programs

👍 Define “risk classes” for your team’s experiments.

👍 Keep low-risk experiments “low touch”.

👍 Save data science analysis resources for high-risk experiments.

👍 Run high-risk experiments for 1+ conversion cycles to control for seasonality.

👍 Rerun high-risk experiments.

Page 45: Opticon 2017 Running Experiment Engines with Stats Engine

Getting organizational buy-in

👍 Decide how and when you’ll share experiment results with your organization.

👍 Write down your “decision process” and socialize it with the team.

Page 46: Opticon 2017 Running Experiment Engines with Stats Engine


Q&A

Pete Koomen
@koomen
[email protected]