
Page 1: Opticon 2017 Running Experiment Engines with Stats Engine

Experimenting with Stats Engine
Pete Koomen
Co-founder, CTO, Optimizely
@koomen
[email protected]

opticon2017

Page 2: Opticon 2017 Running Experiment Engines with Stats Engine

Agenda

1. Why we built Stats Engine
2. How to make decisions with Stats Engine
3. How to scale your decision process

Page 3: Opticon 2017 Running Experiment Engines with Stats Engine


Why we built Stats Engine

Page 4: Opticon 2017 Running Experiment Engines with Stats Engine
Page 5: Opticon 2017 Running Experiment Engines with Stats Engine
Page 6: Opticon 2017 Running Experiment Engines with Stats Engine

The study followed 1,291 participants for 10 years.

No exercise: 438 participants, 128 deaths (29%)
Light exercise: 576 participants, 7 deaths (1%)
Moderate exercise: 262 participants, 8 deaths (3%)
Heavy exercise: 40 participants, 2 deaths (5%)
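The heavy-exercise group is tiny, so its 5% death rate is very noisy. As a rough illustration (our computation, not from the talk), Wilson score intervals for each group's death rate show how little 40 participants can tell you:

```python
from math import sqrt

def wilson_interval(deaths, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = deaths / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Death rates with uncertainty, per exercise group from the slide:
for label, deaths, n in [("none", 128, 438), ("light", 7, 576),
                         ("moderate", 8, 262), ("heavy", 2, 40)]:
    lo, hi = wilson_interval(deaths, n)
    print(f"{label:>8}: {deaths}/{n} -> {lo:.1%} to {hi:.1%}")
```

The heavy-exercise interval spans roughly 1% to 16%, overlapping both the light and moderate groups: two deaths out of 40 is not evidence of anything.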

Page 7: Opticon 2017 Running Experiment Engines with Stats Engine

“Thank goodness a third person didn't die, or public health authorities would be banning jogging.”

– Alex Hutchinson, Runner’s World

Page 8: Opticon 2017 Running Experiment Engines with Stats Engine
Page 9: Opticon 2017 Running Experiment Engines with Stats Engine
Page 10: Opticon 2017 Running Experiment Engines with Stats Engine

“A/A” results

Page 11: Opticon 2017 Running Experiment Engines with Stats Engine

The t-test (a.k.a. “NHST”, a.k.a. “Student’s t-test”)

The t-test in a nutshell:
1. Run your experiment until you have reached the required sample size, and then stop.
2. Ask “What are the chances I’d have gotten these results in an A/A test?” (p-value)
3. If p-value < 5%, your results are significant.

Page 12: Opticon 2017 Running Experiment Engines with Stats Engine

The t-test was designed for the world of 1908:
- Data is expensive.
- Data is slow.
- Practitioners are trained.

The world of 2017:
- Data is cheap.
- Data is real-time.
- Practitioners are everyone.

Page 13: Opticon 2017 Running Experiment Engines with Stats Engine

T-test pitfalls:
1. Peeking
2. Multiple comparisons

Page 14: Opticon 2017 Running Experiment Engines with Stats Engine

1. Peeking

Page 15: Opticon 2017 Running Experiment Engines with Stats Engine

[Figure: an experiment timeline from start to the minimum sample size, with four peeks at the p-value. Three peeks show p-value > 5% (inconclusive); one shows p-value < 5% (significant).]

Page 16: Opticon 2017 Running Experiment Engines with Stats Engine

Why is this a problem?

There is a ~5% chance of seeing a false positive each time you peek.

Page 17: Opticon 2017 Running Experiment Engines with Stats Engine

[Figure: the same timeline, with the experimenter stopping at the first peek where p-value < 5%.]

4 peeks → ~18% chance of seeing a false positive
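The ~18% figure treats the four peeks as independent (1 − 0.95⁴ ≈ 18.5%); in practice successive peeks share data and are correlated, so the inflation is somewhat smaller, but still well above 5%. A minimal simulation (our sketch, not Optimizely code) makes the effect concrete:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 2000   # simulated A/A tests (no real difference exists)
n_per_peek = 250       # visitors per arm added between peeks
n_peeks = 4
alpha = 0.05

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(size=n_per_peek * n_peeks)
    b = rng.normal(size=n_per_peek * n_peeks)
    for peek in range(1, n_peeks + 1):
        n = peek * n_per_peek
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < alpha:                 # experimenter stops at the first "significant" peek
            false_positives += 1
            break

print(f"false positive rate with {n_peeks} peeks: "
      f"{false_positives / n_experiments:.1%}")   # well above the nominal 5%
```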

Page 18: Opticon 2017 Running Experiment Engines with Stats Engine

The t-test in a nutshell:
1. Run your experiment until you have reached the required sample size, and then stop.
2. Ask “What are the chances I’d have gotten these results in an A/A test?” (p-value)
3. If p-value < 5%, your results are significant.

Peeking breaks step 1: the p-value is only trustworthy if you stop exactly when the required sample size is reached.

Page 19: Opticon 2017 Running Experiment Engines with Stats Engine

[Figure: a results dashboard checked at 1:45, 2:45, 3:45, 4:45, and 5:45. Every check is a peek.]

Page 20: Opticon 2017 Running Experiment Engines with Stats Engine

Solution: Stats Engine uses sequential testing to compute an “always-valid” p-value.
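The mechanics are described in the KDD 2017 paper cited later in the deck (“Peeking at A/B Tests”). The core idea is a mixture sequential probability ratio test: a likelihood ratio against the null, averaged over a prior on effect sizes, whose running inverse gives a p-value that stays valid no matter when you stop. A simplified sketch for normally distributed per-visitor differences (our illustration; the `tau` parameter and the plug-in variance estimate are our assumptions, not Stats Engine's actual choices):

```python
import numpy as np

def always_valid_pvalues(diffs, tau=0.1):
    """Always-valid p-values for H0: mean(diffs) == 0, via a mixture
    sequential probability ratio test with a N(0, tau^2) prior on the
    effect size. `diffs` are per-visitor (variation - control) values."""
    diffs = np.asarray(diffs, dtype=float)
    sigma2 = np.var(diffs, ddof=1)        # plug-in variance estimate (a simplification)
    n = np.arange(1, len(diffs) + 1)
    xbar = np.cumsum(diffs) / n
    # Mixture likelihood ratio against H0 for normal observations:
    lam = np.sqrt(sigma2 / (sigma2 + n * tau**2)) * np.exp(
        n**2 * tau**2 * xbar**2 / (2 * sigma2 * (sigma2 + n * tau**2))
    )
    # 1/lam is a valid p-value at every n; take the running minimum so the
    # test never "un-rejects". You may stop, or peek, whenever you like.
    return np.minimum.accumulate(np.minimum(1.0, 1.0 / lam))
```

Because this p-value is valid at every sample size, stopping the moment it crosses 5% does not inflate the false positive rate the way peeking at a t-test does.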

Page 21: Opticon 2017 Running Experiment Engines with Stats Engine

2. Multiple Comparisons

Page 22: Opticon 2017 Running Experiment Engines with Stats Engine

© Randall Patrick Munroe, xkcd.com

Page 23: Opticon 2017 Running Experiment Engines with Stats Engine

© Randall Patrick Munroe, xkcd.com

Page 24: Opticon 2017 Running Experiment Engines with Stats Engine

[Figure: a results grid of 4 variations (A–D) compared against a control on 5 metrics: 20 simultaneous comparisons in one experiment.]

Page 25: Opticon 2017 Running Experiment Engines with Stats Engine

False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”

False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”

Solution: Stats Engine controls False Discovery Rate by becoming more conservative when more metrics and variations are added to a test.
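Why the distinction matters: even with a 5% false positive rate, most “winners” can be flukes if real effects are rare. A worked example with hypothetical rates (our assumptions, not numbers from the talk):

```python
# Hypothetical rates (assumptions for illustration):
real = 0.10    # fraction of variations with a real effect
power = 0.80   # chance a real effect reaches significance
fpr = 0.05     # false positive rate of each test

true_wins = real * power          # real winners
false_wins = (1 - real) * fpr     # flukes that look like winners
fdr = false_wins / (true_wins + false_wins)
print(f"false discovery rate: {fdr:.0%}")   # ~36%, despite a 5% FPR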

Page 26: Opticon 2017 Running Experiment Engines with Stats Engine


How to make decisions with Stats Engine

- When should I stop an experiment?
- Understanding resets
- How do additional variations and metrics affect my experiment?
- How do I trade off between risk and velocity?

Page 27: Opticon 2017 Running Experiment Engines with Stats Engine

When should I stop an experiment?

Page 28: Opticon 2017 Running Experiment Engines with Stats Engine

[Screenshot: a results page showing “visitors remaining” for a variation.]

👍 Use “visitors remaining” to decide whether continuing your experiment is worth it.
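Stats Engine computes its own sequential estimate of visitors remaining. As intuition for the order of magnitude, the classic fixed-horizon sample-size formula (standard textbook math, not Stats Engine's method; the function name is ours) shows why small lifts on small base rates need many visitors:

```python
from math import ceil, sqrt
from scipy.stats import norm

def visitors_needed(base_rate, rel_lift, alpha=0.05, power=0.80):
    """Fixed-horizon sample size per variation to detect a relative lift
    on a conversion rate with a two-sided test."""
    p1, p2 = base_rate, base_rate * (1 + rel_lift)
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    pbar = (p1 + p2) / 2
    n = ((z_a * sqrt(2 * pbar * (1 - pbar))
          + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return ceil(n)

# A 5% baseline conversion rate and a 10% relative lift:
print(visitors_needed(0.05, 0.10))               # ~31,000 visitors per variation
print(visitors_needed(0.05, 0.10, alpha=0.10))   # lower threshold -> fewer visitors
```

If “visitors remaining” dwarfs your traffic, the honest options are to accept an inconclusive result or test a bigger change.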

Page 29: Opticon 2017 Running Experiment Engines with Stats Engine

Understanding resets

Page 30: Opticon 2017 Running Experiment Engines with Stats Engine


Page 31: Opticon 2017 Running Experiment Engines with Stats Engine

“Peeking at A/B Tests: Why it matters, and what to do about it” KDD 2017

👍 Statistical Significance rises whenever there is strong evidence of a difference between variation and control.

Page 32: Opticon 2017 Running Experiment Engines with Stats Engine

“Peeking at A/B Tests: Why it matters, and what to do about it” KDD 2017


Page 33: Opticon 2017 Running Experiment Engines with Stats Engine

[Screenshot: a variation whose statistical significance has dropped back toward zero after an underlying change.]

👍 Statistical Significance will “reset” when there is strong evidence of an underlying change.

Page 34: Opticon 2017 Running Experiment Engines with Stats Engine

[Screenshot: a variation whose confidence interval spans -19.3% to -2.58%, with the point estimate near the interval’s edge.]

👍 If your point estimate is near the edge of its confidence interval, consider running the experiment longer.
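As intuition for what those endpoints mean, here is a fixed-horizon normal-approximation interval for the difference in conversion rates (our sketch, with the function name `lift_interval` our own; Stats Engine reports sequential intervals, which are wider). A wide interval relative to the effect is the signal that more data would help:

```python
from math import sqrt
from scipy.stats import norm

def lift_interval(conv_c, n_c, conv_v, n_v, level=0.90):
    """Normal-approximation interval for the absolute difference in
    conversion rates (variation - control)."""
    pc, pv = conv_c / n_c, conv_v / n_v
    se = sqrt(pc * (1 - pc) / n_c + pv * (1 - pv) / n_v)
    z = norm.ppf(1 - (1 - level) / 2)
    diff = pv - pc
    return diff - z * se, diff, diff + z * se

lo, mid, hi = lift_interval(500, 5000, 440, 5000)
print(f"{lo:+.2%} .. point {mid:+.2%} .. {hi:+.2%}")
```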

Page 35: Opticon 2017 Running Experiment Engines with Stats Engine

How do additional variations and metrics affect my experiment?

Page 36: Opticon 2017 Running Experiment Engines with Stats Engine

False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”

False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”

Solution: Stats Engine controls False Discovery Rate by becoming more conservative when more metrics and variations are added to a test.

Page 37: Opticon 2017 Running Experiment Engines with Stats Engine

Stats Engine treats each metric as a “signal”.

High-signal metrics are directly affected by the experiment.
Low-signal metrics are affected indirectly, or not at all.

Page 38: Opticon 2017 Running Experiment Engines with Stats Engine

False Positive Rate = P( 10% Lift | No Real Improvement )
False Discovery Rate = P( No Real Improvement | 10% Lift )

Solution: Stats Engine controls False Discovery Rate by becoming more conservative when more low-signal metrics and variations are added to a test.
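“More conservative” is how false-discovery-rate procedures behave in general. Stats Engine uses a sequential variant (per the KDD 2017 paper); as an illustration of the underlying idea, the classic fixed-horizon Benjamini–Hochberg procedure tightens the per-comparison bar as the number of comparisons grows:

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Fixed-horizon Benjamini-Hochberg: returns a boolean mask of
    discoveries while keeping the false discovery rate <= q."""
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)
    # Largest k such that p_(k) <= (k/m) * q; reject the k smallest p-values.
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    k = int(np.nonzero(below)[0].max()) + 1 if below.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True
    return mask

# Six metric/variation comparisons: more comparisons, stricter per-test bar.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.60]))
# [ True  True False False False False]
```

Note that 0.039 and 0.041 would pass a naive 5% threshold but are rejected once the correction accounts for six comparisons.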

Page 39: Opticon 2017 Running Experiment Engines with Stats Engine

[Figure: a results grid of 4 variations (A–D) by 8 metrics, with the metrics grouped into Primary, Secondary, and Monitoring.]

Page 40: Opticon 2017 Running Experiment Engines with Stats Engine

👍 For maximum velocity, use “high signal” primary and secondary metrics.

👍 Track “low signal” metrics as monitoring metrics.

Page 41: Opticon 2017 Running Experiment Engines with Stats Engine

How do I trade off between risk and velocity?

Page 42: Opticon 2017 Running Experiment Engines with Stats Engine

[Screenshot: the setting for the maximum False Discovery Rate, i.e. the Statistical Significance threshold.]

👍 Use your Statistical Significance threshold to control risk vs. velocity: a lower threshold (e.g. 90% instead of 95%) accepts more false discoveries in exchange for faster decisions.

Page 43: Opticon 2017 Running Experiment Engines with Stats Engine


How to scale your decision process

- Risk vs. velocity for experimentation programs
- Getting organizational buy-in

Page 44: Opticon 2017 Running Experiment Engines with Stats Engine

Risk vs. Velocity for Experimentation Programs

👍 Define “risk classes” for your team’s experiments.

👍 Keep low-risk experiments “low touch”.

👍 Save data science analysis resources for high-risk experiments.

👍 Run high-risk experiments for 1+ conversion cycles to control for seasonality.

👍 Rerun high-risk experiments.

Page 45: Opticon 2017 Running Experiment Engines with Stats Engine

Getting organizational buy-in

👍 Decide how and when you’ll share experiment results with your organization.

👍 Write down your “decision process” and socialize it with the team.

Page 46: Opticon 2017 Running Experiment Engines with Stats Engine


Q&A

Pete Koomen
@koomen
[email protected]