1 Challenges in Computational Advertising Deepayan Chakrabarti ([email protected])



Challenges in Computational Advertising

Deepayan Chakrabarti ([email protected])

Online Advertising Overview

[Diagram: Advertisers supply Ads to an Ad Network; the Ad Network picks ads to show alongside the Content Provider's content to the User. Examples: Yahoo!, Google, MSN, RightMedia, …]


Advertising Setting

Display Content Match

Sponsored Search

Advertising Setting

Pick ads


Advertising Setting

Graphical display ads: mostly for brand awareness; revenue is based on the number of impressions (not clicks).


Advertising Setting

Content match ad


Advertising Setting

Pick ads

Text ads

Match ads to the content


Advertising Setting

The user intent is unclear; revenue depends on the number of clicks; the query (a webpage) is long and noisy.


Advertising Setting

Search Query

Sponsored Search Ads


This presentation

1) Content Match [KDD 2007]: How can we estimate the click-through rate (CTR) of an ad on a page?

~10^6 ads × ~10^9 pages: CTR for ad j on page i

This presentation

1) Estimating CTR for Content Match [KDD ‘07]

2) Traffic Shaping for Display Advertising [EC ‘12]

[Diagram: the user clicks an article summary (chosen among alternates); display ads are shown on the article page]


This presentation

1) Estimating CTR for Content Match [KDD ‘07]

2) Traffic Shaping for Display Advertising [EC ‘12]

Recommend articles (not ads): need high CTR on article summaries + prefer articles on which under-delivering ads can be shown.

This presentation

1) Estimating CTR for Content Match [KDD ‘07]

2) Traffic Shaping for Display Advertising [EC ‘12]

3) Theoretical underpinnings [COLT '10 best student paper]

Represent relationships as a graph. Recommendation = Link Prediction. Many useful heuristics exist. Why do these heuristics work?

Goal: Suggest friends

Estimating CTR for Content Match Contextual Advertising

Show an ad on a webpage ("impression"). Revenue is generated if a user clicks. Problem: estimate the click-through rate (CTR) of an ad on a page.

~10^6 ads × ~10^9 pages: CTR for ad j on page i

Estimating CTR for Content Match Why not use the MLE?

1. Few (page, ad) pairs have N>0

2. Very few have c>0 as well

3. MLE does not differentiate between 0/10 and 0/100
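Point 3 is easy to see concretely (a minimal sketch; the counts are hypothetical):

```python
# MLE click-through rate: clicks / impressions (hypothetical counts).
def mle_ctr(clicks, impressions):
    return clicks / impressions if impressions > 0 else float("nan")

print(mle_ctr(0, 10))   # 0.0
print(mle_ctr(0, 100))  # 0.0 -- same estimate, though 0/100 is much stronger
                        # evidence of a genuinely low CTR than 0/10
```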

We have additional information: hierarchies


Estimating CTR for Content Match: Use an existing, well-understood hierarchy

Categorize ads and webpages to leaves of the hierarchy.

CTR estimates of siblings are correlated. The hierarchy allows us to aggregate data: coarser resolutions provide reliable estimates for rare events, which then influence estimation at finer resolutions.


Estimating CTR for Content Match: Region Hierarchy

[Diagram: page hierarchy and ad hierarchy, levels 0 through i, with page classes and ad classes at the leaves]

Region = (page node, ad node). The region hierarchy is a cross-product of the page hierarchy and the ad hierarchy.

Estimating CTR for Content Match: Our Approach

Data Transformation → Model → Model Fitting


Data Transformation

Problem: the MLE's variance depends on the unknown CTR, and zero-click regions are indistinguishable.

Solution: Freeman-Tukey transform
Differentiates regions with 0 clicks
Variance stabilization
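As a quick illustration, here is one standard form of the Freeman-Tukey transform for a count of c clicks in N impressions (the exact variant used in the paper may differ):

```python
import math

def freeman_tukey(clicks, impressions):
    # One standard Freeman-Tukey form for counts, sqrt(c) + sqrt(c + 1),
    # rescaled by sqrt(N); the paper's exact variant may differ.
    return (math.sqrt(clicks) + math.sqrt(clicks + 1)) / math.sqrt(impressions)

# Unlike the MLE, zero-click regions with different N now differ:
print(freeman_tukey(0, 10))   # ~0.316
print(freeman_tukey(0, 100))  # 0.1
```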


Model

Goal: smoothing across siblings in the hierarchy [Huang+Cressie/2000]

[Diagram: latent states S1…S4 at level i+1 under Sparent at level i; observations y1, y2, y4]

1. Each region has a latent state Sr
2. yr is independent of the hierarchy given Sr
3. Sr is drawn from its parent Spa(r)
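A toy sampler consistent with these three assumptions (the Gaussian forms and the single beta*u covariate term are my simplifications, not the paper's exact parameterization):

```python
import random

def sample_tree(depth, branching, W, V, beta=0.0, u=0.0, s_root=0.0, rng=None):
    """Sample latent states S_r down a hierarchy plus noisy observations y_r.
    1. Each region has a latent state S_r.
    2. y_r depends only on S_r (Gaussian noise, variance V).
    3. S_r is drawn around its parent's state (variance W).
    The Gaussian forms and the single beta*u term are simplifications."""
    rng = rng or random.Random(0)
    def recurse(s_parent, level):
        s = rng.gauss(s_parent + beta * u, W ** 0.5)   # S_r | S_pa(r)
        y = rng.gauss(s, V ** 0.5)                     # y_r | S_r
        children = [recurse(s, level + 1) for _ in range(branching)] if level < depth else []
        return {"S": s, "y": y, "children": children}
    return recurse(s_root, 1)

tree = sample_tree(depth=3, branching=2, W=0.01, V=0.1)
print([round(c["S"], 3) for c in tree["children"]])  # siblings stay close: shared parent
```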

Model


[Graphical model: Spa(r) → Sr, emitting ypa(r) and yr; covariates upa(r), ur with coefficients βpa(r), βr; state variances wpa(r), wr; observation variances Vpa(r), Vr]

However, learning wr, Vr, and βr for each region is clearly infeasible. Assumptions:

All regions at the same level ℓ share the same W(ℓ) and β(ℓ)

Vr = V/Nr for some constant V, since the variance of the transformed observation shrinks with the number of impressions Nr


Model

Implications: W determines the degree of smoothing.

Large W: Sr varies greatly from Spa(r); each region learns its own Sr (no smoothing).

W → 0: all Sr are identical; a regression model on features ur is learnt (maximum smoothing).


Implications: W determines the degree of smoothing. Var(Sr) increases from root to leaf: better estimates at coarser resolutions.

Model


Implications: W determines the degree of smoothing; Var(Sr) increases from root to leaf; correlations among siblings at level ℓ depend only on the level of their least common ancestor.

Model


[Diagram: Corr(nearby siblings) > Corr(distant cousins)]

Estimating CTR for Content Match: Our Approach

Data Transformation (Freeman-Tukey) → Model (tree-structured Markov chain) → Model Fitting

Model Fitting

Fitting uses a Kalman filtering algorithm. Filtering: recursively aggregate data from leaves to root. Smoothing: propagate information from root to leaves.

Complexity: linear in the number of regions, for both time and space.

Model Fitting

Fitting uses a Kalman filtering algorithm. Filtering: recursively aggregate data from leaves to root. Smoothing: propagate information from root to leaves.

The Kalman filter requires knowledge of β, V, and W, so EM is wrapped around the Kalman filter.
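The two passes can be sketched as a simple shrinkage scheme on a tree (an illustrative simplification, not the paper's exact Kalman/EM recursions; k = V/W plays the role of a smoothing pseudo-count):

```python
def smooth(node, W, V, parent_est=None):
    """Two-pass smoothing sketch on a tree of {'y': mean obs, 'n': count, 'children': [...]}.
    The upward pass aggregates data from leaves to root (cf. filtering); the
    downward pass blends each subtree's mean with its parent's estimate (cf.
    smoothing). A simplified shrinkage scheme, not the paper's exact recursions."""
    def aggregate(nd):  # upward: subtree totals -> coarse, reliable means
        tot, cnt = nd["y"] * nd["n"], nd["n"]
        for c in nd["children"]:
            t, m = aggregate(c)
            tot, cnt = tot + t, cnt + m
        nd["sub_mean"], nd["sub_n"] = tot / cnt, cnt  # assumes cnt > 0
        return tot, cnt
    aggregate(node)
    k = V / W  # pseudo-count: large V/W means heavy smoothing toward the parent
    def push(nd, p_est):  # downward: shrink toward the parent's estimate
        nd["est"] = nd["sub_mean"] if p_est is None else \
            (nd["sub_n"] * nd["sub_mean"] + k * p_est) / (nd["sub_n"] + k)
        for c in nd["children"]:
            push(c, nd["est"])
    push(node, parent_est)
    return node

# Toy hierarchy: a rare leaf (0 clicks in 10 impressions) is pulled toward its parent.
root = {"y": 0.1, "n": 100, "children": [
    {"y": 0.0, "n": 10, "children": []},
    {"y": 0.2, "n": 90, "children": []},
]}
smooth(root, W=0.1, V=1.0)
print(root["children"][0]["est"])  # between 0.0 and the root estimate
```

Both passes visit each node once, matching the linear-time claim.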


Experiments

503M impressions. 7-level hierarchy, of which the top 3 levels were used. Zero clicks in 76% of regions in level 2 and 95% of regions in level 3. Full dataset DFULL, and a 2/3 sample DSAMPLE.

Experiments

Estimate CTRs for all regions R in level 3 with zero clicks in DSAMPLE. Some of these regions (R>0) get clicks in DFULL. A good model should predict higher CTRs for R>0 than for the other regions in R.

Experiments

We compared 4 models:
TS: our tree-structured model
LM (level-mean): each level smoothed independently
NS (no smoothing): CTR proportional to 1/Nr
Random: assuming |R>0| is given, randomly predict the membership of R>0 out of R

Experiments

[Plot: TS identifies R>0 more accurately than LM, NS, and Random]

MLE = 0 everywhere, since 0 clicks were observed. What about the estimated CTR?

[Plots: estimated CTR vs. impressions. No Smoothing (NS): close to the MLE for large N. Our Model (TS): inherits variability from coarser resolutions, and is close to the MLE for large N.]

Estimating CTR for Content Match: We presented a method to estimate rates of extremely rare events at multiple resolutions under severe sparsity constraints.

Key points: tree-structured generative model; extremely fast parameter fitting.

Traffic Shaping

1) Estimating CTR for Content Match [KDD ‘07]

2) Traffic Shaping for Display Advertising [EC ‘12]

3) Theoretical underpinnings [COLT ‘10 best student paper]

Traffic Shaping

Which article summary should be picked? Ans: the one with the highest expected CTR.

Which ad should be displayed? Ans: the ad that minimizes underdelivery.

Article pool

Underdelivery

Advertisers are guaranteed some impressions (say, 1M) over some time (say, 2 months): only to users matching their specs, only when they visit certain types of pages, only on certain positions on the page.

An underdelivering ad is one that is likely to miss its guarantee


Underdelivery

How can underdelivery be computed? It needs user traffic forecasts, and it depends on the other ads in the system.

An ad-serving system will try to minimize under-delivery on this graph.

[Bipartite graph: forecasted impressions ℓ = (user, article, position) with supply sℓ; ad inventory j with demand dj]

Traffic Shaping

Which article summary should be picked? Ans: the one with the highest expected CTR.

Which ad should be displayed? Ans: the ad that minimizes underdelivery.

Goal: Combine the two

Traffic Shaping

Goal: bias the article summary selection to reduce under-delivery with only an insignificant drop in CTR, AND do this in real-time.

Outline

Formulation as an optimization problem → Real-time solution → Empirical results

Formulation

j: ads
ℓ: (user, article, position), a "fully qualified impression"
i: (user, article)
k: user

Goal: infer the traffic shaping fractions wki

[Graph: supply sk at user node k; traffic shaping fraction wki from k to article i; CTR cki; ad delivery fraction φℓj from impression ℓ to ad j with demand dj]

Formulation

Full traffic shaping graph: all forecasted user traffic × all available articles, arriving at the homepage or directly on the article page.

Goal: infer wki, but we are forced to infer φℓj as well.

[Diagram: full traffic shaping graph, with traffic shaping fractions wki, CTRs cki, and ad delivery fractions φℓj]

Formulation


[Objective: minimize total underdelivery, i.e., Σj max(0, dj − traffic flowing to j), where the total user traffic flowing to ad j (accounting for CTR loss) aggregates sk · wki · cki · φℓj, subject to the demand constraints]

Formulation


Constraints:
(Bounds on traffic shaping fractions)
(Shape only available traffic)
(Satisfy demand constraints)
(Ad delivery fractions)

Key Transformation

This allows a reformulation solely in terms of new variables zℓj, where zℓj = the fraction of supply that is shown ad j, assuming the user always clicks the article.

Formulation

The resulting convex program can be solved optimally.
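On a toy one-segment instance, the trade-off the program optimizes can be seen by brute force (all numbers and the penalty weight lam are illustrative, not from the paper):

```python
# One user segment with supply s; article A has a higher CTR but cannot carry
# ad j, article B can. Ad j has demand d. Choose the shaping fraction w
# (traffic sent to B) minimizing underdelivery plus a penalty for lost clicks.
# All numbers and the weight lam are illustrative, not from the paper.
s, d = 1000.0, 60.0
ctr_A, ctr_B = 0.10, 0.08

def objective(w, lam=0.5):
    served = s * w * ctr_B             # traffic reaching ad j, after CTR loss
    underdelivery = max(0.0, d - served)
    clicks_lost = s * w * (ctr_A - ctr_B)
    return underdelivery + lam * clicks_lost

# The objective is convex piecewise-linear in w; a grid search finds the optimum.
best_w = min((i / 100 for i in range(101)), key=objective)
print(best_w)  # 0.75: shape just enough traffic to meet the demand of ad j
```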


Formulation

But we have another problem: at runtime, we must shape every incoming user without looking at the entire graph.

Solution: periodically solve the convex problem offline; store a cache derived from this solution; reconstruct the optimal solution for each user at runtime, using only the cache.

Outline

Formulation as an optimization problem → Real-time solution → Empirical results

Real-time solution


Cache these [the dual variables]; reconstruct [the primal solution] using these.

All constraints can be expressed as constraints on σℓ.

Real-time solution


3 KKT conditions:
1. [zℓj as a function of the cached duals αj and σℓ]
2. σℓ = 0 unless Σj zℓj = maxℓ Σj zℓj
3. Σℓ σℓ = constant, for all i connected to k

[Plot: Σj zℓj as a function of σℓ, clipped between bounds Li and Ui; the shape depends on the cached duals αj]

Real-time solution

Algo:
Initialize σℓ = 0
Compute Σj zℓj from (1)
If constraints are unsatisfied, increase σℓ while satisfying (2) and (3)
Repeat
Extract wki from zℓj

Results

Data: historical traffic logs from April 2011; 25K user nodes; total supply weight > 50B impressions; 100K ads.

We compare our model to a scheme that picks articles to maximize expected CTR, and picks ads to display via a separate greedy method.

Lift in impressions

[Plot: lift in impressions delivered to underperforming ads vs. fraction of traffic that is not shaped: nearly a threefold improvement via traffic shaping]

Average CTR

[Plot: average CTR (as a percentage of the maximum CTR) vs. fraction of traffic that is not shaped: CTR drop < 10%]

Comparison with other methods


Summary

3x underdelivery reduction with a <10% CTR drop; 2.6x reduction with a 4% CTR drop. The runtime application needs only a small cache.

Traffic Shaping

1) Estimating CTR for Content Match [KDD ‘07]

2) Traffic Shaping for Display Advertising [EC ‘12]

3) Theoretical underpinnings [COLT ‘10 best student paper]


Link Prediction

Which pair of nodes {i,j} should be connected?

[Diagram: bipartite graph linking users (Alice, Bob, Charlie) to movies]

Goal: Recommend a movie

Link Prediction

Which pair of nodes {i,j} should be connected?

Goal: Suggest friends


Previous Empirical Studies*

[Bar chart: link prediction accuracy* increases from Random to Shortest Path, Common Neighbors, Adamic/Adar, and an Ensemble of short paths]

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

How do we justify these observations, especially if the graph is sparse?
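For concreteness, the two most common heuristics can be computed on a toy graph in a few lines (hypothetical data; Adamic/Adar down-weights shared neighbors by their log-degree):

```python
import math

# Toy undirected friendship graph as adjacency sets (hypothetical data).
adj = {
    "alice": {"bob", "carol", "dave"},
    "bob":   {"alice", "carol"},
    "carol": {"alice", "bob", "dave"},
    "dave":  {"alice", "carol", "erin"},
    "erin":  {"dave"},
}

def common_neighbors(i, j):
    return len(adj[i] & adj[j])

def adamic_adar(i, j):
    # Shared neighbors count less if they are popular (high degree).
    return sum(1.0 / math.log(len(adj[k])) for k in adj[i] & adj[j])

print(common_neighbors("bob", "dave"))       # 2 (alice and carol)
print(round(adamic_adar("bob", "dave"), 3))  # 2 / ln(3): both have degree 3
```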


Link Prediction – Generative Model

Model (unit-volume universe):
1. Nodes are uniformly distributed points in a latent space
2. This space has a distance metric
3. Points close to each other are likely to be connected in the graph

Logistic distance function (Raftery+/2002)


[Plot: link probability vs. distance: a logistic curve that equals ½ at the radius r; α determines the steepness]

Link prediction ≈ find the nearest neighbor who is not currently linked to the node. This is equivalent to inferring distances in the latent space.
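A sketch of sampling a graph from this model (the exact parameterization in Raftery et al. 2002 differs; here P(link) = 1 / (1 + e^(α(d − r))) is assumed):

```python
import math, random

def sample_latent_graph(n, r, alpha, dim=2, seed=0):
    """Sample a graph from the latent-space model: points uniform in the unit
    cube, with link probability following a logistic curve in distance,
    P(i~j) = 1 / (1 + exp(alpha * (d_ij - r))).
    (Raftery et al. 2002 parameterize the logistic slightly differently.)"""
    rng = random.Random(seed)
    pts = [tuple(rng.random() for _ in range(dim)) for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(pts[i], pts[j])
            if rng.random() < 1.0 / (1.0 + math.exp(alpha * (d - r))):
                edges.add((i, j))
    return pts, edges

pts, edges = sample_latent_graph(n=50, r=0.2, alpha=20)
print(len(edges))  # close pairs link with probability near 1, far pairs near 0
```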


Common Neighbors

Pr2(i,j) = Pr(common neighbor | dij) = ∫∫ Pr(i~k | dik) Pr(k~j | djk) P(dik, djk | dij) ddik ddjk

A product of two logistic probabilities, integrated over a volume determined by dij.

Common Neighbors

OPT = the node closest to i; MAX = the node with the most common neighbors with i

Theorem: w.h.p., dOPT ≤ dMAX ≤ dOPT + 2[ε/V(1)]^(1/D)

Link prediction by common neighbors is asymptotically optimal.

Common Neighbors: Distinct Radii

Node k has radius rk: i→k if dik ≤ rk (directed graph). rk captures the popularity of node k.

"Weighted" common neighbors: predict the (i,j) pairs with the highest Σ w(r) η(r), where w(r) is the weight for nodes of radius r and η(r) is the number of common neighbors of radius r.

[Diagram: i and j with common neighbors k (radius rk) and m]

Type 2 common neighbors: r is close to the max radius.

[Table: the weight w(r) ranges from a constant (constant-degree graphs) to 1/r (Adamic/Adar); real-world graphs generally fall in this range]

Presence of a common neighbor is very informative; so is absence.

ℓ-hop Paths

Common neighbors = 2-hop paths. For longer paths the bounds are weaker: for ℓ' ≥ ℓ we need ηℓ' >> ηℓ to obtain similar bounds. This justifies the exponentially decaying weight given to longer paths by the Katz measure.

[Distance bound on dij in terms of r, the path count η, N, and δ; garbled in extraction]
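The Katz measure mentioned above can be sketched with explicit matrix powers (toy adjacency matrix; β and the path-length cutoff are illustrative):

```python
# Katz score: sum over path lengths l of beta^l * (number of l-hop paths),
# computed with explicit matrix powers on a toy adjacency matrix.
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def katz(A, beta=0.1, max_len=4):
    n = len(A)
    S = [[0.0] * n for _ in range(n)]
    P = A  # P holds A^l at step l
    for l in range(1, max_len + 1):
        for i in range(n):
            for j in range(n):
                S[i][j] += (beta ** l) * P[i][j]
        P = matmul(P, A)
    return S

# Path graph 0-1-2-3: nodes 0 and 2 share a 2-hop path; 0 and 3 only a 3-hop one.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
S = katz(A)
print(S[0][2], S[0][3])  # longer paths contribute exponentially less
```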


Summary

Three key ingredients:
1. Closer points are likelier to be linked (small-world models: Watts & Strogatz 1998, Kleinberg 2001)
2. The triangle inequality holds (necessary to extend to ℓ-hop paths)
3. Points are spread uniformly at random (otherwise properties would depend on location as well as distance)

Summary


The number of paths matters, not the length. For large dense graphs, common neighbors are enough. Differentiating between different degrees is important. In sparse graphs, paths of length 3 or more help in prediction.

Conclusions

Discussed three problems:
1. Estimating CTR for Content Match: combat sparsity by hierarchical smoothing
2. Traffic Shaping for Display Advertising: joint optimization of CTR and underdelivery reduction; optimal traffic shaping at runtime using cached duals
3. Theoretical underpinnings: latent space model; link prediction ≈ finding nearest neighbors in this space

Other Work


Web Search

Finding Quicklinks

Titles for Quicklinks

Incorporating tweets into search results

Website clustering

Webpage segmentation

Template detection

Finding hidden query aspects

Computational Advertising

Combining IR with click feedback

Multi-armed bandits using hierarchies

Online learning under finite ad lifetimes

Graph Mining

Epidemic thresholds

Non-parametric prediction in dynamic graphs

Graph sampling

Graph generation models

Community detection


[Backup plots for the Freeman-Tukey transform: N·Var(MLE) vs. MLE CTR, and N·Var(yr) vs. mean yr: the transform stabilizes the variance]