A Utility-Theoretic Approach to Privacy and Personalization
Andreas Krause Carnegie Mellon University
work performed during an internship at Microsoft Research
Joint work with Eric Horvitz, Microsoft Research
23rd Conference on Artificial Intelligence | July 16, 2008
2
Value of private information to enhancing search
Personalized web search is a prediction problem: “Which page is user X most likely interested in for query Q?”
The more information we have about a user, the better the service we can provide
Users are reluctant to share private information (or don’t want search engines to log data)
We apply utility-theoretic methods to optimize the tradeoff:
Getting the biggest “bang” for the “personal data buck”
3
Utility theoretic approach
Sharing personal information (topic interests, search history, IP address etc.)
Net benefit to user = Utility of knowing − Sensitivity of sharing
4
Utility theoretic approach
Sharing more information might decrease net benefit
Net benefit to user = Utility of knowing − Sensitivity of sharing
5
Maximizing the net benefit
How can we find the optimal tradeoff that maximizes net benefit?
[Conceptual plot: net benefit as a function of information shared, from “share no information” to “share much information”; the curve peaks somewhere in between.]
6
Trading off utility and privacy
Set V of 29 possible attributes (each ≤ 2 bits):
Demographic data (location)
Query details (working hours / week day?)
Topic interests (ever visited business / science / … website?)
Search history (same query / click before? searches/day?)
User behavior (ever changed Zip, City, Country?)
For each A ⊆ V compute utility U(A) and cost C(A)
Find A maximizing U(A) while minimizing C(A)
7
Estimating utility U(A) of sharing data
Ideally: how does knowing A help increase the relevance of displayed results?
Very hard to estimate from data
Proxy [Mei and Church ’06, Dou et al. ’07]: Click entropy!
Learn a probabilistic model for P(C | Q, A) = P(click | query, attributes)
U(A) = H(C | Q) − H(C | Q, A)
(entropy before revealing attributes minus entropy after revealing attributes)
E.g.: A = {X1, X3}, U(A) = 1.3
[Graphical model: query Q and attributes X1 (Age), X2 (Gender), X3 (Country) predict the search goal C.]
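The click-entropy proxy can be sketched in a few lines. This is an illustrative reconstruction, not the authors' pipeline: the helper names and the toy log below are invented, and real estimates would come from large search logs.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution given by counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def click_entropy_utility(log, attrs):
    """U(A) = H(C | Q) - H(C | Q, A), estimated from (query, attributes, click) records."""
    def cond_entropy(key_fn):
        # Group clicks by the conditioning key, then average group entropies.
        groups = defaultdict(Counter)
        for query, a, click in log:
            groups[key_fn(query, a)][click] += 1
        n = sum(sum(c.values()) for c in groups.values())
        return sum(sum(c.values()) / n * entropy(c.values()) for c in groups.values())

    h_q = cond_entropy(lambda q, a: q)                                    # H(C | Q)
    h_qa = cond_entropy(lambda q, a: (q,) + tuple(a[x] for x in attrs))   # H(C | Q, A)
    return h_q - h_qa

# Toy log: for query "sports", US users all click page 1; others spread out.
log = [("sports", {"country": "US"}, 1)] * 4 + \
      [("sports", {"country": "UK"}, c) for c in (2, 3, 2, 4)]
print(click_entropy_utility(log, ["country"]))  # → 1.0
```

Revealing the country attribute splits the click distribution into a deterministic group and a diffuse one, so the conditional entropy drops by a full bit in this toy example.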
8
Click entropy example
U(A) = expected click entropy reduction knowing A
Query: sports
[Histograms of clicked pages 1–6 for query “sports”: all users, entropy H = 2.6; users with Country = USA, H = 1.7. Entropy reduction: 0.9.]
9
Study of Value of Personal Data
Estimate click entropy from volunteer search log data:
~15,000 users
Only frequent queries (≥ 30 users)
Total ~250,000 queries during 2006
Example: Consider topics of prior visits, V = {topic_arts, topic_kids}
Query: “cars”, prior entropy: 4.55
U({topic_arts}) = 0.40
U({topic_kids}) = 0.41
How does U(A) increase as we pick more attributes A?
10
[Bar chart: cumulative entropy reduction (0 to 2 bits) as attributes are added greedily: none, ATLV, THOM, ACTY, TGAM, TSPT, AQRY, ACLK, AWDY, AWHR, TCIN, TADT, DREG, TKID, AFRQ, TSCI, THEA, TNWS, TCMP, ACRY, TREF.]
Diminishing returns for click entropy
The more attributes we add, the less we gain in utility
Theorem: Click entropy U(A) is submodular!*
A*: search activity; T*: topic interests
[Plot: more utility (entropy reduction) vs. more private attributes (greedily chosen).]
*See store for details
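Submodularity is what makes greedy selection with lazy evaluation work: a stale marginal gain can only shrink under diminishing returns, so only the top of a priority queue needs re-evaluating. A minimal sketch, using a toy coverage function as a stand-in for click entropy (the attribute abbreviations are borrowed from the chart above; the covered sets are invented):

```python
import heapq

def lazy_greedy(V, f, k):
    """Pick k elements greedily to maximize a monotone submodular set function f,
    re-evaluating marginal gains lazily via a max-heap (negated for heapq)."""
    A = []
    heap = [(-(f({v}) - f(set())), v) for v in V]
    heapq.heapify(heap)
    while heap and len(A) < k:
        _, v = heapq.heappop(heap)
        fresh = f(set(A) | {v}) - f(set(A))  # recompute gain w.r.t. current A
        if not heap or fresh >= -heap[0][0]:
            A.append(v)                       # still beats every stale gain
        else:
            heapq.heappush(heap, (-fresh, v))  # reinsert with updated gain
    return A

# Toy coverage utility: each attribute "covers" some information items.
covers = {"ATLV": {1, 2, 3}, "TSPT": {3, 4}, "ACRY": {5}, "TKID": {1, 2}}
f = lambda A: len(set().union(*(covers[v] for v in A)))
print(lazy_greedy(list(covers), f, 2))  # → ['ATLV', 'TSPT']
```

Note that ATLV is picked first (largest gain), after which TKID's stale gain of 2 collapses to 0 and TSPT wins the second slot without every candidate being re-scored.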
11
Trading off utility and privacy
Set V of 29 possible attributes (each ≤ 2 bits):
Demographic data (location)
Query details (working hours / week day?)
Topic interests (ever visited business / science / … website?)
Search history (same query / click before? searches/day?)
User behavior (ever changed Zip, City, Country?)
For each A ⊆ V compute utility U(A) and cost C(A)
Find A maximizing U(A) while minimizing C(A)
12
Getting a handle on cost
Identifiability: “Will they know it’s me?”
Sensitivity: “I don’t feel comfortable sharing this!”
13
Identifiability cost
Intuition: The more attributes we already know, the more identifying it is to add another
Goal: Avoid identifiability
For example: k-anonymity [Sweeney ’02], and others
[Illustration: quasi-identifying attributes Age, Gender, Occupation.]
14
Identifiability cost
Predict person Y from attributes A
Example: P(Y | gender = female, country = US)
Define a “loss” function [cf. Lebanon et al.]
[Histograms over users 1–6: a flat distribution is good (predicting the user is hard); a peaked one is bad (predicting the user is easy). Cost = worst-case probability of detection.]
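One simple instantiation of this worst-case cost, sketched under assumptions rather than reproducing the paper's exact loss: treat each user as one record, group users by their shared attribute values, and score the adversary's best guess within each group.

```python
from collections import Counter

def identifiability_cost(population, attrs):
    """Probability that an adversary's best guess identifies a user, given
    only the shared attributes. Within a group of g indistinguishable users,
    the best guess succeeds with probability 1/g; average over all users."""
    groups = Counter(tuple(user[a] for a in attrs) for user in population)
    n = len(population)
    # sum over groups of (g/n) * (1/g) = (number of groups) / n
    return sum((g / n) * (1 / g) for g in groups.values())

population = [
    {"gender": "f", "country": "US"},
    {"gender": "f", "country": "US"},
    {"gender": "m", "country": "US"},
    {"gender": "m", "country": "DE"},
]
print(identifiability_cost(population, ["country"]))            # → 0.5
print(identifiability_cost(population, ["gender", "country"]))  # → 0.75
```

Adding the second attribute splinters the population into smaller groups, so the cost jumps from 0.5 to 0.75, the accelerating behavior the supermodularity theorem below formalizes.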
15
Identifiability cost
The more attributes we add, the larger the increase in cost: Accelerating cost
Theorem: Identifiability cost C(A) is supermodular!*
[Bar chart: identifiability cost (0 to 0.7) as more private attributes are greedily added: none, TCMP, AWDY, AWHR, AQRY, ACLK, ACRY, TREG, TWLD, TART, TREF, ACTY, TBUS, THEA, TREC, AZIP, TNWS, TSPT, TSHP, TSOC, AFRQ, TSCI, ATLV, TKID, DREG, TADT, THOM, TGMS, TCIN.]
*See store for details
16
Trading off utility and privacy
Set V of 29 possible attributes (each ≤ 2 bits):
Demographic data (location)
Query details (working hours / week day?)
Topic interests (ever visited business / science / … website?)
Search history (same query / click before? searches/day?)
User behavior (ever changed Zip, City, Country?)
For each A ⊆ V compute utility U(A) and cost C(A)
Find A maximizing U(A) while minimizing C(A)
17
Trading off utility and cost
Want: A* = argmax F(A)
Optimizing the value of private information is a submodular problem! We can use algorithms for optimizing submodular functions:
Goldengorin et al. (branch and bound), Feige et al. (approximation algorithm), …
Can efficiently get provably near-optimal tradeoff!
F(A) = U(A) − λ · C(A)
(final objective = utility − trade-off parameter × cost)
[Bar chart: reduction in click entropy under greedy forward selection for utility (none, occ, adult, age, whour, gdr, home, wday, ctry, kids, refs, bus, comp, world, arts, reg).]
[Bar chart: privacy cost (p log(1−p)) as attributes are added (wday, whour, reg, bus, adult, world, ctry, gdr, arts, comp, refs, kids, home, age, occ, all).]
[Bar chart: utility − cost under (lazy) greedy forward selection (none, occ, home, whour, age, wday, gender, reg, bus, world, adult, arts, country, comp, ref, kids).]
U(A) is submodular and C(A) is supermodular, so F(A) = U(A) − λ C(A) is submodular (but non-monotonic).
Maximizing it is NP-hard (and the search space is large: 2^29 subsets)
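To make the objective concrete, here is a brute-force sketch of A* = argmax F(A) on a toy instance. This is only viable for tiny V; the real 2^29-subset problem needs the submodular machinery cited above. The coverage utility and quadratic cost below are hypothetical stand-ins for click entropy and identifiability.

```python
from itertools import combinations

def best_tradeoff(V, U, C, lam):
    """Exhaustively find A* = argmax F(A) = U(A) - lam * C(A) over all subsets of V."""
    best, best_f = frozenset(), U(frozenset()) - lam * C(frozenset())
    for r in range(1, len(V) + 1):
        for A in map(frozenset, combinations(V, r)):
            f = U(A) - lam * C(A)
            if f > best_f:
                best, best_f = A, f
    return best, best_f

# Toy submodular utility (coverage) and supermodular cost (grows quadratically).
covers = {"AWDY": {1, 2}, "AWHR": {2, 3}, "ACRY": {4}}
U = lambda A: len(set().union(*(covers[v] for v in A)) if A else set())
C = lambda A: 0.3 * len(A) + 0.2 * len(A) * (len(A) - 1)

A_star, f_star = best_tradeoff(list(covers), U, C, lam=1.0)
print(sorted(A_star), round(f_star, 2))  # → ['AWDY', 'AWHR'] 2.0
```

With this λ the optimum shares two attributes: a third attribute would add one bit of utility but, because the cost accelerates, more than one unit of cost.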
18
Finding the “sweet spot”
Which λ should we choose?
The tradeoff curve is based purely on log data.
What do users prefer?
[Tradeoff curve: more utility U(A) vs. less cost C(A) for A* = argmax U(A) − λ C(A), traced by varying λ: λ = 0 “ignore cost”, λ = 1, λ = 10 toward “ignore utility”.]
Sweet spot! Maximal utility at maximal privacy
19
Survey for eliciting cost
Microsoft internal online survey
Distributed internationally
N = 1451 responses from 35 countries (80% US)
Incentive: 1 Zune™ digital music player
20
Identifiability vs sensitivity
21
Sensitivity vs utility
22
Seeking a common currency
Sensitivity acts as common currency to estimate utility-privacy tradeoff
[Plot: reported sensitivity (1–5) vs. location granularity: Region, Country, State, City, Zip, Address.]
[Plot: speedup required before users would share, vs. sensitivity: 1.25×, 1.5×, 2×, 4×, never.]
23
Calibrating the tradeoff
Can use survey data to calibrate the utility-privacy tradeoff!
[Bar chart: entropy reduction required for sharing location at each granularity (region, country, state, city, zip): survey data (median) vs. identifiability cost from search logs.]
[Plot: utility (entropy reduction) vs. cost (max. probability of identification), with tradeoff curves for λ = 1, 10, 100.]
User preferences map into the sweet spot!
Best fit: λ = 5.12 in F(A) = U(A) − λ C(A)
24
Understanding Sensitivities: “I don’t feel comfortable sharing this!”
25
Attribute sensitivities
[Bar chart: median sensitivity (1–5) by attribute: FRQ, NEWS, SCI, ART, GMS, BUS, HEA, SOC, GDR, QRY, CLK, TCY, OCC, MTL, CHD, WHR.]
We incorporate sensitivity in our cost function by calibration
Significant differencesbetween topics!
26
Comparison with heuristics
Optimized solution: Repeated visit / query, workday / working hour, top-level domain, avg. queries per day, topic: sports, topic: games
[Bar chart: utility U(A), cost C(A), and net benefit F(A) (bits of information) for the optimized tradeoff vs. heuristics: search statistics (ATLV, AWDY, AWHR, AFRQ), all topic interests, IP address bytes 1 & 2, full IP address. Net benefits range from ≈0.9 for the optimized solution down to −3.3 for the full IP address; reported values include 0.899, 0.573, −1.73, −1.81, −3.3.]
Optimized solution outperforms naïve selection heuristics!
27
Summary
Use of private information by online services as an optimization problem (with user permission / awareness)
Utility (click entropy) is submodular
Privacy (identifiability) is supermodular
Can use theoretical and algorithmic tools to efficiently find provably near-optimal tradeoff
Can calibrate tradeoff using user preferences
Promising results on search logs and survey data!