
From Practice to Theory in Learning from Massive Data by Charles Elkan at BigMine16


Page 1


From practice to theory in learning from massive data

Charles Elkan

Amazon Fellow

August 14, 2016

Page 2

Important

Information here is already public.

Opinions are mine, not Amazon’s.

Page 3


Page 4

Outline

Only 30 minutes!

1. Detecting anomalies in streaming data
2. Making Spark usable for real-time predictions
3. Amazon’s most important algorithm for recommendations
4. Uplift: We want causation, not merely correlation

Page 5

Outline

1. Detecting anomalies in streaming data
2. Making Spark usable for real-time predictions
3. Amazon’s most important algorithm for recommendations
4. Uplift: We want causation, not merely correlation

Page 6

From practice to theory

Page 7

From theory to practice

Page 8

Now for everyone!

Page 9

Outline

1. Detecting anomalies in streaming data
2. Making Spark usable for real-time predictions
3. Amazon’s most important algorithm for recommendations
4. Uplift: We want causation, not merely correlation

Page 10

From practice to practice

Page 11
Page 12

Outline

1. Detecting anomalies in streaming data
2. Making Spark usable for real-time predictions
3. Amazon’s most important algorithm for recommendations
4. Uplift: We want causation, not merely correlation

Page 13


Academic versus applied

In theory, researchers favor simplicity. In practice, they don’t.

In industry, simplicity genuinely wins.

Example: Desiderata for recommender systems:
1. Respect the privacy of users; don’t be creepy.
2. Make recommendations understandable.
3. Make them responsive to the user’s most recent interests.
4. Generate them with millisecond latency.

Page 14


Amazon’s most important recommender system

1. Respect the privacy of users; don’t be creepy.
2. Make recommendations understandable.
3. Make them responsive to the user’s most recent interests.
4. Generate them with millisecond latency.

Page 15

Outline

1. Detecting anomalies in streaming data
2. Making Spark usable for real-time predictions
3. Amazon’s most important algorithm for recommendations
4. Uplift: We want causation, not merely correlation

Page 16

What data scientists do every day

Let x be a user and let R = 0 or 1 be a response. For example, R=1 means the user buys shoes in the next month.

Routinely, we train models to predict the probability p(R=1|x).

We send messages and coupons to users with high p(R=1|x).
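As a minimal sketch of this routine, here is how such a model might be fit with scikit-learn; the data is synthetic and all names are illustrative, not any production pipeline:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: one feature row per user, R = 1 if the user
# buys shoes in the next month.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
R = (X[:, 0] + rng.normal(size=1000) > 1.0).astype(int)

# Train a model of p(R=1|x).
model = LogisticRegression().fit(X, R)
p_hat = model.predict_proba(X)[:, 1]

# Send messages and coupons to the users with the highest p(R=1|x).
top_users = np.argsort(-p_hat)[:100]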


Page 17

Is p(R=1|x) actually useful?

In principle, no. "Our goal is not to predict the future; it is to change the future."
• Merely predicting user behavior is of limited interest.

We want to select treatments that influence users.
• T = t means we choose treatment t.
• For each available t, compute p(R=1|x, T=t).
• Choose the t that gives the highest probability.
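A sketch of that selection step, assuming a dict of fitted per-treatment classifiers (all names hypothetical):

import numpy as np

def best_treatment(models, x):
    # models: dict mapping treatment name t -> classifier fitted to predict
    #         p(R=1|x, T=t); x: 1-D numpy feature vector for one user.
    scores = {t: m.predict_proba(x.reshape(1, -1))[0, 1]
              for t, m in models.items()}
    # Choose the treatment with the highest predicted response probability.
    return max(scores, key=scores.get)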


Page 18

The risk of ignoring uplift


Users are ranked by p(R=1|x), shown by the brown line. The blue dashed line shows p(R=1|x, T=t).

The treatment t has a negative effect for users in the top 5%: p(R=1|x,T=t) < p(R=1|x).

Page 19

Politicians know this …

If you are a Republican, don’t target confirmed Democratic voters! Instead:
• Send persuasive messages to undecided voters.
• Send “get out the vote” messages to confirmed supporters.
• Send “please donate” messages to these people also.

Page 20

A common scenario for uplift

Many treatments are almost free to apply, such as sending email.

The uplift question is then which treatment is most effective.

For each user x, we want to know which t has the highest value of p(R=1|x, T=t).

Keep in mind: The same treatment may be the best for all x.


Page 21

A public dataset

Published by Kevin Hillstrom, former VP of database marketing at Nordstrom.

Studied in several published papers on uplift, notably by Nicholas Radcliffe, professor at the University of Edinburgh.

• 64,000 past customers of an e-commerce site selling clothing.
• Randomized to no email, men’s email, or women’s email.
• Three outcomes: visit (binary), purchase (binary), and spend (numerical).
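A sketch of loading and eyeballing the data with pandas; the filename and the column names segment and visit follow the public Hillstrom file, but verify them against your copy:

import pandas as pd

# Hillstrom's MineThatData e-mail challenge data; the path is an assumption.
df = pd.read_csv("hillstrom.csv")

# Visit rate in each randomized arm: no email, men's email, women's email.
print(df.groupby("segment")["visit"].mean())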


Page 22

Looking at the data


Treatments have a larger effect on “visit” than on “purchase given visit” or on “spend given purchase.”

We'll analyze uplift (i.e., the causal influence of treatments) for visits.

Table from “Hillstrom’s MineThatData Email Analytics Challenge” by Radcliffe.

Page 23

The linear probability model

Assume the linear function p(R=1|x) = b0 + ∑i bi * xi.
• Find coefficients bi to minimize square loss.

Square loss is proper, so predicted probabilities are calibrated.

Avoid overfitting and predictions <0 or >1 by not having too many predictors.

Commonly used in econometrics, not in ML. In practice, often quite similar to logistic regression.
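A minimal sketch of fitting such a model by least squares on synthetic data (numpy only; the real predictors would be the indicators on the next slide):

import numpy as np

# Synthetic 0/1 predictors and a binary response, for illustration only.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(5000, 3)).astype(float)
R = (rng.random(5000) < 0.075 + 0.065 * X[:, 0]).astype(int)

# Least-squares fit of p(R=1|x) = b0 + sum_i bi * xi.
A = np.column_stack([np.ones(len(X)), X])   # intercept column plus predictors
b, *_ = np.linalg.lstsq(A, R, rcond=None)

# Because square loss is proper, these predictions are calibrated probabilities.
p_hat = A @ b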


Page 24

Including treatment indicators M and W

probability of visit = 7.5% + …
  + 6.5% IF (men’s past AND men’s email)
  + 6.6% IF (women’s past AND men’s email)
  + 6.1% IF (women’s past AND women’s email)
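The indicator predictors behind these coefficients might be built as below; the column names mens, womens (past purchases) and M, W (treatment dummies) are assumptions about the data layout:

import pandas as pd

def add_treatment_interactions(df: pd.DataFrame) -> pd.DataFrame:
    # Each new column is 1 exactly when both the purchase-history flag
    # and the treatment dummy are 1, matching the IF (... AND ...) terms.
    out = df.copy()
    out["mens_past_x_M"] = df["mens"] * df["M"]
    out["womens_past_x_M"] = df["womens"] * df["M"]
    out["womens_past_x_W"] = df["womens"] * df["W"]
    return out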

Page 25


The men’s email is effective for customers who have previously purchased men’s or women’s clothing.

The women’s email is not effective for customers who have previously purchased only men’s clothing.

Page 26


Optimal treatment policy:
• If only men’s previous purchases: send men’s email.
• If only women’s purchases: send either email.
• If both: send men’s email.

Hypothesis: Women tend to buy clothing for their families, but men tend to buy clothing only for themselves.
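Written as a policy function, this rule is a direct transcription of the list above (names illustrative):

def choose_email(mens_past: bool, womens_past: bool) -> str:
    # Policy from the fitted model: the men's email wins except for
    # customers with only women's purchases, where either email works.
    if womens_past and not mens_past:
        return "either email"
    return "men's email"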

Page 27

Validation

How can we confirm that we have found an optimal policy?

Approach:
1. Train models of response for each treatment.
2. For each user x in a test set, plot both predicted probabilities.
3. Use three separate test sets: users who previously purchased only women’s clothing, only men’s, or both.
4. The latter two sets should show p(R=1|x, T=M) > p(R=1|x, T=W) for most x.
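A sketch of steps 1 and 2 with scikit-learn random forests, which the next slide reports results for; the treatment and visit column names are assumptions:

from sklearn.ensemble import RandomForestClassifier

def fit_per_treatment(train_df, features):
    # One response model per treatment arm, trained on the randomized data.
    models = {}
    for t in ("mens_email", "womens_email"):
        arm = train_df[train_df["treatment"] == t]
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        models[t] = clf.fit(arm[features], arm["visit"])
    return models

def predicted_response_pair(models, test_df, features):
    # For each test user x, return p(R=1|x, T=M) and p(R=1|x, T=W) to plot.
    p_m = models["mens_email"].predict_proba(test_df[features])[:, 1]
    p_w = models["womens_email"].predict_proba(test_df[features])[:, 1]
    return p_m, p_w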

Page 28

Results using random forests:

Lower two panels: As expected, p(R=1|x, T=M) > p(R=1|x, T=W).

Top panel: The two treatments M and W are equally effective.

Page 29

What comes next?

Conclusion: Indeed, one treatment (the men’s email) can be optimal for all customers.

The step beyond uplift modeling is reinforcement learning: Learning a sequence of actions that is best for each user.

• The goal is to maximize total lifetime reward from each customer.

• Learn simultaneously how customers evolve and how they respond to actions that we take.


Page 30

Questions?

1. Detecting anomalies in streaming data
2. Making Spark usable for real-time predictions
3. Amazon’s most important algorithm for recommendations
4. Uplift: We want causation, not merely correlation