
From Practice to Theory in Learning from Massive Data by Charles Elkan at BigMine16


Page 1


From practice to theory in learning from massive data

Charles Elkan

Amazon Fellow

August 14, 2016

Page 2

Important

Information here is already public.

Opinions are mine, not Amazon’s.

Page 3


Page 4

Outline

Only 30 minutes!

1. Detecting anomalies in streaming data
2. Making Spark usable for real-time predictions
3. Amazon’s most important algorithm for recommendations
4. Uplift: We want causation, not merely correlation

Page 5

Outline

1. Detecting anomalies in streaming data
2. Making Spark usable for real-time predictions
3. Amazon’s most important algorithm for recommendations
4. Uplift: We want causation, not merely correlation

Page 6

From practice to theory

Page 7

From theory to practice

Page 8

Now for everyone!

Page 9

Outline

1. Detecting anomalies in streaming data
2. Making Spark usable for real-time predictions
3. Amazon’s most important algorithm for recommendations
4. Uplift: We want causation, not merely correlation

Page 10

From practice to practice

Page 11
Page 12

Outline

1. Detecting anomalies in streaming data
2. Making Spark usable for real-time predictions
3. Amazon’s most important algorithm for recommendations
4. Uplift: We want causation, not merely correlation

Page 13


Academic versus applied

In theory, researchers favor simplicity. In practice, they don’t.

In industry, simplicity genuinely wins.

Example: Desiderata for recommender systems:
1. Respect the privacy of users; don’t be creepy.
2. Make recommendations understandable.
3. Make them responsive to the user’s most recent interests.
4. Generate them with millisecond latency.

Page 14


Amazon’s most important recommender system

1. Respect the privacy of users; don’t be creepy.
2. Make recommendations understandable.
3. Make them responsive to the user’s most recent interests.
4. Generate them with millisecond latency.

Page 15

Outline

1. Detecting anomalies in streaming data
2. Making Spark usable for real-time predictions
3. Amazon’s most important algorithm for recommendations
4. Uplift: We want causation, not merely correlation

Page 16

What data scientists do every day

Let x be a user and let R = 0 or 1 be a response. For example, R=1 means the user buys shoes in the next month.

Routinely, we train models to predict the probability p(R=1|x).

We send messages and coupons to users with high p(R=1|x).
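As a minimal sketch of this routine, here is how such a model might be fit with scikit-learn; the data is synthetic and all names are illustrative, not any production pipeline:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: one feature row per user, R = 1 if the user
# buys shoes in the next month.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
R = (X[:, 0] + rng.normal(size=1000) > 1.0).astype(int)

# Train a model of p(R=1|x).
model = LogisticRegression().fit(X, R)
p_hat = model.predict_proba(X)[:, 1]

# Send messages and coupons to the users with the highest p(R=1|x).
top_users = np.argsort(-p_hat)[:100]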


Page 17

Is p(R=1|x) actually useful?

In principle, no. "Our goal is not to predict the future; it is to change the future."
• Merely predicting user behavior is of limited interest.

We want to select treatments that influence users.
• T = t means we choose treatment t.
• For each available t, compute p(R=1|x, T=t).
• Choose the t that gives the highest probability.
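A sketch of that selection step, assuming a dict of fitted per-treatment classifiers (all names hypothetical):

import numpy as np

def best_treatment(models, x):
    # models: dict mapping treatment name t -> classifier fitted to predict
    #         p(R=1|x, T=t); x: 1-D numpy feature vector for one user.
    scores = {t: m.predict_proba(x.reshape(1, -1))[0, 1]
              for t, m in models.items()}
    # Choose the treatment with the highest predicted response probability.
    return max(scores, key=scores.get)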


Page 18

The risk of ignoring uplift


Users are ranked by p(R=1|x), shown by the brown line. The blue dashed line shows p(R=1|x, T=t).

The treatment t has a negative effect for users in the top 5%: p(R=1|x,T=t) < p(R=1|x).

Page 19

Politicians know this …

If you are a Republican, don’t target confirmed Democratic voters! Instead:
• Send persuasive messages to undecided voters.
• Send “get out the vote” messages to confirmed supporters.
• Send “please donate” messages to these people also.

Page 20

A common scenario for uplift

Many treatments are almost free to apply, such as sending email.

The uplift question is then which treatment is most effective.

For each user x, we want to know which t has the highest value of p(R=1|x, T=t).

Keep in mind: The same treatment may be the best for all x.


Page 21

A public dataset

Published by Kevin Hillstrom, former VP of database marketing at Nordstrom.

Studied in several published papers on uplift, notably by Nicholas Radcliffe, professor at the University of Edinburgh.

• 64,000 past customers of an e-commerce site selling clothing.
• Randomized to no email, men’s email, or women’s email.
• Three outcomes: visit (binary), purchase (binary), and spend (numerical).
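A sketch of loading and eyeballing the data with pandas; the filename and the column names segment and visit follow the public Hillstrom file, but verify them against your copy:

import pandas as pd

# Hillstrom's MineThatData e-mail challenge data; the path is an assumption.
df = pd.read_csv("hillstrom.csv")

# Visit rate in each randomized arm: no email, men's email, women's email.
print(df.groupby("segment")["visit"].mean())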


Page 22

Looking at the data


Treatments have a larger effect on “visit” than on “purchase given visit” or on “spend given purchase.”

We'll analyze uplift (i.e., the causal influence of treatments) for visits.

Table from “Hillstrom’s MineThatData Email Analytics Challenge” by Radcliffe.

Page 23

The linear probability model

Assume the linear function p(R=1|x) = b0 + ∑i bi * xi.
• Find coefficients bi to minimize square loss.

Square loss is proper, so predicted probabilities are calibrated.

Avoid overfitting and predictions <0 or >1 by not having too many predictors.

Commonly used in econometrics, not in ML. In practice, often quite similar to logistic regression.
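A minimal sketch of fitting such a model by least squares on synthetic data (numpy only; the real predictors would be the indicators on the next slide):

import numpy as np

# Synthetic 0/1 predictors and a binary response, for illustration only.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(5000, 3)).astype(float)
R = (rng.random(5000) < 0.075 + 0.065 * X[:, 0]).astype(int)

# Least-squares fit of p(R=1|x) = b0 + sum_i bi * xi.
A = np.column_stack([np.ones(len(X)), X])   # intercept column plus predictors
b, *_ = np.linalg.lstsq(A, R, rcond=None)

# Because square loss is proper, these predictions are calibrated probabilities.
p_hat = A @ b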


Page 24

Including treatment indicators M and W

probability of visit = 7.5% + …
  + 6.5% IF (men’s past AND men’s email)
  + 6.6% IF (women’s past AND men’s email)
  + 6.1% IF (women’s past AND women’s email)
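The indicator predictors behind these coefficients might be built as below; the column names mens, womens (past purchases) and M, W (treatment dummies) are assumptions about the data layout:

import pandas as pd

def add_treatment_interactions(df: pd.DataFrame) -> pd.DataFrame:
    # Each new column is 1 exactly when both the purchase-history flag
    # and the treatment dummy are 1, matching the IF (... AND ...) terms.
    out = df.copy()
    out["mens_past_x_M"] = df["mens"] * df["M"]
    out["womens_past_x_M"] = df["womens"] * df["M"]
    out["womens_past_x_W"] = df["womens"] * df["W"]
    return out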

Page 25


The men’s email is effective for customers who have previously purchased men’s or women’s clothing.

The women’s email is not effective for customers who have previously purchased only men’s clothing.

Page 26


Optimal treatment policy:
• If only men’s previous purchases: send men’s email.
• If only women’s purchases: send either email.
• If both: send men’s email.

Hypothesis: Women tend to buy clothing for their families, but men tend to buy clothing only for themselves.
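Written as a policy function, this rule is a direct transcription of the list above (names illustrative):

def choose_email(mens_past: bool, womens_past: bool) -> str:
    # Policy from the fitted model: the men's email wins except for
    # customers with only women's purchases, where either email works.
    if womens_past and not mens_past:
        return "either email"
    return "men's email"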

Page 27

Validation

How can we confirm that we have found an optimal policy?

Approach:
1. Train models of response for each treatment.
2. For each user x in a test set, plot both predicted probabilities.
3. Use three separate test sets: users who previously purchased only women’s clothing, only men’s, or both.
4. The latter two sets should show p(R=1|x, T=M) > p(R=1|x, T=W) for most x.
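A sketch of steps 1 and 2 with scikit-learn random forests, which the next slide reports results for; the treatment and visit column names are assumptions:

from sklearn.ensemble import RandomForestClassifier

def fit_per_treatment(train_df, features):
    # One response model per treatment arm, trained on the randomized data.
    models = {}
    for t in ("mens_email", "womens_email"):
        arm = train_df[train_df["treatment"] == t]
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        models[t] = clf.fit(arm[features], arm["visit"])
    return models

def predicted_response_pair(models, test_df, features):
    # For each test user x, return p(R=1|x, T=M) and p(R=1|x, T=W) to plot.
    p_m = models["mens_email"].predict_proba(test_df[features])[:, 1]
    p_w = models["womens_email"].predict_proba(test_df[features])[:, 1]
    return p_m, p_w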

Page 28

Results using random forests:

Lower two panels: As expected, p(R=1|x, T=M) > p(R=1|x, T=W).

Top panel: The two treatments M and W are equally effective.

Page 29

What comes next?

Conclusion: Indeed, one treatment (the men’s email) can be optimal for all customers.

The step beyond uplift modeling is reinforcement learning: Learning a sequence of actions that is best for each user.

• The goal is to maximize total lifetime reward from each customer.

• Learn simultaneously how customers evolve and how they respond to actions that we take.


Page 30

Questions?

1. Detecting anomalies in streaming data
2. Making Spark usable for real-time predictions
3. Amazon’s most important algorithm for recommendations
4. Uplift: We want causation, not merely correlation