
Probabilistic Machine Learning in Computational Advertising

Thore Graepel, Thomas Borchert, Ralf Herbrich and Joaquin Quiñonero Candela

Online Services and Advertising, Microsoft Research Cambridge, UK

NIPS 2009 – December 2009

Outline

• Online Advertising and Paid Search
• AdPredictor™: Predicting User Clicks on Ads

[Appendix]
• Model shrinking
• Parallel training

ONLINE ADVERTISING AND PAID SEARCH

Advertising Industry Business: Size

[Chart: annual advertising expenditure (in billion USD), 2001–2006, broken down by medium: Outdoor, Cinema, Radio, TV, Print, Online. Reference lines mark the GDP of Denmark (2006) and Microsoft's revenue (2008). Data: World Advertising Research Center Report 2007]

Advertising Industry Business: Growth

[Chart: year-on-year growth of advertising expenditure (−20% to +50%), 2001–2006, by medium: Outdoor, Cinema, Radio, TV, Print, Online. Data: World Advertising Research Center Report 2007]

Ranking and pricing: display to users by expected bid (bid × pClick), charge advertisers per click (see the sketch below the table).

Bid      pClick   Expected bid   Charge per click
$1.00    10%      $0.10          $0.80
$2.00     4%      $0.08          $1.25
$0.10    50%      $0.05          $0.05
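A minimal sketch reproducing the slide's numbers: rank by expected bid, then charge each clicked ad the lowest per-click price that would keep its rank (generalized second price). The $0.05 reserve price for the last slot and all names are my assumptions; the slide only shows the figures.

RESERVE = 0.05  # assumed per-click reserve price for the last slot

ads = [
    {"bid": 1.00, "p_click": 0.10},
    {"bid": 2.00, "p_click": 0.04},
    {"bid": 0.10, "p_click": 0.50},
]

def expected_bid(ad):
    # Expected revenue per impression if this ad is shown.
    return ad["bid"] * ad["p_click"]

ranked = sorted(ads, key=expected_bid, reverse=True)
for ad, nxt in zip(ranked, ranked[1:] + [None]):
    # Generalized second price: pay the next ad's expected bid, converted
    # back into a per-click price; the last slot pays the reserve.
    price = expected_bid(nxt) / ad["p_click"] if nxt else RESERVE
    print(f"bid ${ad['bid']:.2f} x pClick {ad['p_click']:.0%}"
          f" = ${expected_bid(ad):.2f}; charge ${price:.2f} per click")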

The Scale of Things

• Realistic training set for proof of concept: 7,000,000,000 impressions

• 2 weeks of CPU time during training:
  2 wks × 7 days × 86,400 sec/day = 1,209,600 seconds
• Learning algorithm speed requirement (checked in the sketch below):
  – 5,787 impression updates / sec
  – 172.8 μs per impression update
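A quick arithmetic check of those requirements, using only the numbers from the slide:

impressions = 7_000_000_000       # proof-of-concept training set size
budget_s = 2 * 7 * 86_400         # two weeks in seconds = 1,209,600
rate = impressions / budget_s
print(f"{rate:.0f} updates/s, {1e6 / rate:.1f} microseconds per update")
# -> 5787 updates/s, 172.8 microseconds per update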

ADPREDICTOR
Bayesian Linear Probit Regression

Impression Level Predictions

One Weight per Feature Value

[Diagram: each feature value gets its own weight. Example features and values: Client IP (102.34.12.201, 15.70.165.9, 221.98.2.187, 92.154.3.86), MatchType (Exact Match, Broad Match), Position (ML-1, SB-1, SB-2). The weights of the impression's active feature values are added up to produce pClick.]

Click Potential

Linear: click potential = sum of feature click contributions

[Diagram: click-potential axis with 0 separating "no click" from "click"; the contributions of the active feature values (PageNumber/DisplayPosition/ReturnedAds = 0/ML-1/2, ListingId = 798831, ClientIP = 98.0.101.23) add up to the impression click potential.]
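A minimal sketch of this linear scoring step, using the slide's example feature values; the weight numbers and the dictionary layout are made up for illustration.

# One weight per (feature, value) pair; the click potential of an impression
# is the sum of the weights of its active feature values.
weights = {
    ("ClientIP", "98.0.101.23"): 0.3,       # illustrative values
    ("ListingId", "798831"): -0.1,
    ("PageNumber/DisplayPosition/ReturnedAds", "0/ML-1/2"): 0.5,
}

impression = {
    "ClientIP": "98.0.101.23",
    "ListingId": "798831",
    "PageNumber/DisplayPosition/ReturnedAds": "0/ML-1/2",
}

# Unseen feature values contribute 0 (they stay at the prior mean).
click_potential = sum(weights.get(item, 0.0) for item in impression.items())
print(click_potential)  # 0.3 - 0.1 + 0.5 = 0.7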

Gaussian Noise

Probit: area under Gaussian tail as a function of click potential

[Diagram: Gaussian noise centered on the impression click potential; P(click) = P(potential > 0) is the area of the Gaussian to the right of 0.]

Probit

Probit: area under Gaussian tail as a function of click potential

[Diagram: sweeping the impression click potential along the axis traces out the probit curve P(click | Impression), rising from 0% to 100%; example impression: PageNumber/DisplayPosition/ReturnedAds = 0/ML-1/2, ListingId = 798831, ClientIP = 98.0.101.23]
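The probit link in code: the Gaussian tail above zero equals the standard normal CDF evaluated at potential/β. A sketch using only the standard library; β = 1 is my assumption.

import math

def p_click(click_potential, beta=1.0):
    # Area of the Gaussian noise tail above zero: the standard normal CDF
    # evaluated at potential / beta (beta is the noise standard deviation).
    return 0.5 * (1.0 + math.erf(click_potential / (beta * math.sqrt(2.0))))

print(p_click(0.7))  # potential 0.7 with beta = 1 -> P(click) ~ 0.758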

Modelling Uncertainty

[Diagram: uncertainty about the impression click potential, shown as a distribution on the click-potential axis, induces uncertainty about the probability of click (between 0% and 100%).]

Uncertainty: Bayesian Probabilities

[Diagram: the same feature layout as before (Client IP: 102.34.12.201, 15.70.165.9, 221.98.2.187, 92.154.3.86; MatchType: Exact Match, Broad Match; Position: ML-1, SB-1, SB-2), but each weight now carries a Gaussian belief; summing them yields a distribution p(pClick) rather than a point estimate.]
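With a Gaussian belief N(μᵢ, σᵢ²) per active weight, the click potential is itself Gaussian, and averaging the probit over it has a closed form: P(click) = Φ(Σμᵢ / sqrt(β² + Σσᵢ²)). A sketch of that predictive computation; the belief parameters are made up.

import math

def predictive_p_click(beliefs, beta=1.0):
    # beliefs: (mean, variance) Gaussian per active feature value. The click
    # potential is Gaussian with summed mean and variance, and averaging the
    # probit over it gives Phi(mu / sqrt(beta^2 + total variance)).
    mu = sum(m for m, _ in beliefs)
    var = sum(v for _, v in beliefs)
    return 0.5 * (1.0 + math.erf(mu / math.sqrt(2.0 * (beta**2 + var))))

# Same mean potential 0.7 as before, but now with uncertain weights:
print(predictive_p_click([(0.3, 0.2), (-0.1, 0.4), (0.5, 0.3)]))
# ~ 0.694: uncertainty pulls the prediction toward 50%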

Principled Exploration

Training Algorithm in Action

[Diagram: factor graph with weights w1 and w2 feeding a sum node, latent click potential z, and click variable c; messages flow forward for prediction and backward for training/update.]
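The backward (training) pass has a closed form. A sketch in the spirit of the adPredictor update equations (as published in Graepel et al., ICML 2010): each active weight's Gaussian is updated via the moments of the truncated Gaussian. β = 1, binary features, and the list layout are my choices.

import math

def pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def train_update(beliefs, y, beta=1.0):
    # One online Bayesian probit update of the active weights' Gaussians;
    # y = +1 for click, -1 for no click.
    total_mean = sum(m for m, _ in beliefs)
    total_var = beta**2 + sum(v for _, v in beliefs)
    sigma = math.sqrt(total_var)
    t = y * total_mean / sigma
    v = pdf(t) / cdf(t)   # mean shift of the truncated Gaussian
    w = v * (v + t)       # variance reduction of the truncated Gaussian
    for b in beliefs:
        mean, var = b
        b[0] = mean + y * (var / sigma) * v          # move mean toward label
        b[1] = var * (1.0 - (var / total_var) * w)   # shrink variance

beliefs = [[0.0, 1.0], [0.0, 1.0]]  # two active weights at the prior
train_update(beliefs, y=+1)         # observe a click
print(beliefs)                      # means increase, variances shrink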

Posterior Updates for the Click Event

Client IP: Mean & Variance

Calibrated Predictions

Joint Updates vs. Independent Aggregation

Naive Bayes

adPredictor Wrap Up

Thank you! thoreg@microsoft.com

APPENDIX

Dealing with Millions of Variables

• Observation 1: Large variable bags follow a power law w.r.t. frequency of items
• Observation 2: Weight posteriors of rare items are close to their prior
• Idea (sketched in code below):

1. Initially, the belief of each new item is compactly represented by one (and the same) prior
2. After observing an item for the first time, its posterior is allocated
3. At regular intervals, all weight posteriors with a small deviation from the prior are removed
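A sketch of the pruning step (3). The deviation measure here is the KL divergence between posterior and prior, and the threshold is my assumption; the talk does not specify either.

import math

PRIOR = (0.0, 1.0)   # shared prior N(0, 1) for every unseen item
EPSILON = 1e-3       # assumed pruning threshold (not given in the talk)

def kl_to_prior(mu, var):
    # KL( N(mu, var) || prior ) for 1-D Gaussians; small KL = "close to prior".
    mu0, var0 = PRIOR
    return 0.5 * (var / var0 + (mu - mu0)**2 / var0 - 1.0 + math.log(var0 / var))

def shrink(posteriors):
    # Keep only posteriors that deviate noticeably from the prior; pruned
    # items fall back to the shared prior representation (step 1).
    return {k: (mu, var) for k, (mu, var) in posteriors.items()
            if kl_to_prior(mu, var) >= EPSILON}

posteriors = {"ClientIP=98.0.101.23": (0.46, 0.79),    # frequent, informative
              "ClientIP=10.0.0.1":   (0.001, 0.999)}   # rare, still near prior
print(shrink(posteriors))  # only the informative posterior survives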

Naïve Approach – Shared Memory

• Does not scale:
  – Constant contention for locks
  – Some features are very frequent
  – Synchronization issues

[Diagram: Training Node 1 (Impression A: MSNH11, 10.0.0.1, USA, etc.) and Training Node 2 (Impression B: MSNH11, Canada, 10.0.1.25, etc.) both send updates directly to one shared ModelFile; concurrent updates to the same weight (e.g. MSNH11) conflict.]

Proposal: Approximate Learning

[Diagram: Training Node 1 (Impression A: MSNH11, 10.0.0.1, USA, etc.) and Training Node 2 (Impression B: MSNH11, Canada, 10.0.1.25, etc.) each update their own local model copy; the per-node deltas are then merged into the Final Model File.]
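A sketch of one way to merge deltas: Gaussian updates are additive in natural (precision) parameters, so each node's change relative to the shared prior can be folded into the final belief. This merge rule is a common choice for combining Gaussian messages, not something the slide spells out.

def to_natural(mean, var):
    # Gaussian natural parameters: precision-weighted mean and precision.
    return mean / var, 1.0 / var

def merge(prior, node_posteriors):
    # Fold each node's delta (local posterior minus shared prior) into one
    # final belief; deltas are additive in natural parameters.
    pm0, tau0 = to_natural(*prior)
    pm, tau = pm0, tau0
    for post in node_posteriors:
        pm_i, tau_i = to_natural(*post)
        pm += pm_i - pm0
        tau += tau_i - tau0
    return pm / tau, 1.0 / tau   # back to (mean, variance)

# Two nodes start from the shared prior N(0, 1) and train the same weight:
print(merge((0.0, 1.0), [(0.3, 0.8), (0.5, 0.7)]))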
