A Utility-Theoretic Approach to Privacy and Personalization
Andreas Krause Carnegie Mellon University
work performed during an internship at Microsoft Research
Joint work with Eric Horvitz, Microsoft Research
23rd Conference on Artificial Intelligence | July 16, 2008
2
Value of private information to enhancing search
Personalized web search is a prediction problem: “Which page is user X most likely interested in for query Q?”
The more information we have about a user, the better the service we can provide
Users are reluctant to share private information (or don’t want search engines to log data)
We apply utility-theoretic methods to optimize the tradeoff:
Getting the biggest “bang” for the “personal data buck”
3
Utility theoretic approach
Sharing personal information (topic interests, search history, IP address etc.)
Net benefit to user = Utility of knowing − Sensitivity of sharing
4
Utility theoretic approach
Sharing more information might decrease net benefit
Net benefit to user = Utility of knowing − Sensitivity of sharing
5
Maximizing the net benefit
How can we find the optimal tradeoff that maximizes net benefit?
[Conceptual plot: net benefit as a function of information shared, from “share no information” to “share much information”; the curve peaks somewhere in between.]
6
Trading off utility and privacy
Set V of 29 possible attributes (each ≤ 2 bits):
Demographic data (location)
Query details (working hours / week day?)
Topic interests (ever visited business / science / … website?)
Search history (same query / click before? searches/day?)
User behavior (ever changed Zip, City, Country?)
For each A ⊆ V compute utility U(A) and cost C(A)
Find A maximizing U(A) while minimizing C(A)
7
Estimating utility U(A) of sharing data
Ideally: how does knowing A help increase the relevance of displayed results?
Very hard to estimate from data
Proxy [Mei and Church ’06, Dou et al. ’07]: Click entropy!
Learn a probabilistic model for P(C | Q, A) = P(click | query, attributes)
U(A) = H(C | Q) − H(C | Q, A)
(entropy before revealing attributes minus entropy after revealing attributes)
E.g.: A = {X1, X3}, U(A) = 1.3
[Graphical model: query Q and attributes X1 (Age), X2 (Gender), X3 (Country) predict the search goal C.]
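The click-entropy proxy can be sketched in a few lines. This is an illustrative reconstruction, not the authors' pipeline: the helper names and the toy log below are invented, and real estimates would come from large search logs.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution given by counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def click_entropy_utility(log, attrs):
    """U(A) = H(C | Q) - H(C | Q, A), estimated from (query, attributes, click) records."""
    def cond_entropy(key_fn):
        # Group clicks by the conditioning key, then average group entropies.
        groups = defaultdict(Counter)
        for query, a, click in log:
            groups[key_fn(query, a)][click] += 1
        n = sum(sum(c.values()) for c in groups.values())
        return sum(sum(c.values()) / n * entropy(c.values()) for c in groups.values())

    h_q = cond_entropy(lambda q, a: q)                                    # H(C | Q)
    h_qa = cond_entropy(lambda q, a: (q,) + tuple(a[x] for x in attrs))   # H(C | Q, A)
    return h_q - h_qa

# Toy log: for query "sports", US users all click page 1; others spread out.
log = [("sports", {"country": "US"}, 1)] * 4 + \
      [("sports", {"country": "UK"}, c) for c in (2, 3, 2, 4)]
print(click_entropy_utility(log, ["country"]))  # → 1.0
```

Revealing the country attribute splits the click distribution into a deterministic group and a diffuse one, so the conditional entropy drops by a full bit in this toy example.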
8
Click entropy example
U(A) = expected click entropy reduction knowing A
Query: sports
[Histograms of clicked pages 1–6 for query “sports”: all users, entropy H = 2.6; users with Country = USA, H = 1.7. Entropy reduction: 0.9.]
9
Study of Value of Personal Data
Estimate click entropy from volunteer search log data:
~15,000 users
Only frequent queries (≥ 30 users)
Total ~250,000 queries during 2006
Example: Consider topics of prior visits, V = {topic_arts, topic_kids}
Query: “cars”, prior entropy: 4.55
U({topic_arts}) = 0.40
U({topic_kids}) = 0.41
How does U(A) increase as we pick more attributes A?
10
[Bar chart: cumulative entropy reduction (0 to 2 bits) as attributes are added greedily: none, ATLV, THOM, ACTY, TGAM, TSPT, AQRY, ACLK, AWDY, AWHR, TCIN, TADT, DREG, TKID, AFRQ, TSCI, THEA, TNWS, TCMP, ACRY, TREF.]
Diminishing returns for click entropy
The more attributes we add, the less we gain in utility
Theorem: Click entropy U(A) is submodular!*
A*: search activity; T*: topic interests
[Plot: more utility (entropy reduction) vs. more private attributes (greedily chosen).]
*See store for details
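Submodularity is what makes greedy selection with lazy evaluation work: a stale marginal gain can only shrink under diminishing returns, so only the top of a priority queue needs re-evaluating. A minimal sketch, using a toy coverage function as a stand-in for click entropy (the attribute abbreviations are borrowed from the chart above; the covered sets are invented):

```python
import heapq

def lazy_greedy(V, f, k):
    """Pick k elements greedily to maximize a monotone submodular set function f,
    re-evaluating marginal gains lazily via a max-heap (negated for heapq)."""
    A = []
    heap = [(-(f({v}) - f(set())), v) for v in V]
    heapq.heapify(heap)
    while heap and len(A) < k:
        _, v = heapq.heappop(heap)
        fresh = f(set(A) | {v}) - f(set(A))  # recompute gain w.r.t. current A
        if not heap or fresh >= -heap[0][0]:
            A.append(v)                       # still beats every stale gain
        else:
            heapq.heappush(heap, (-fresh, v))  # reinsert with updated gain
    return A

# Toy coverage utility: each attribute "covers" some information items.
covers = {"ATLV": {1, 2, 3}, "TSPT": {3, 4}, "ACRY": {5}, "TKID": {1, 2}}
f = lambda A: len(set().union(*(covers[v] for v in A)))
print(lazy_greedy(list(covers), f, 2))  # → ['ATLV', 'TSPT']
```

Note that ATLV is picked first (largest gain), after which TKID's stale gain of 2 collapses to 0 and TSPT wins the second slot without every candidate being re-scored.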
11
Trading off utility and privacy
Set V of 29 possible attributes (each ≤ 2 bits):
Demographic data (location)
Query details (working hours / week day?)
Topic interests (ever visited business / science / … website?)
Search history (same query / click before? searches/day?)
User behavior (ever changed Zip, City, Country?)
For each A ⊆ V compute utility U(A) and cost C(A)
Find A maximizing U(A) while minimizing C(A)
12
Getting a handle on cost
Identifiability: “Will they know it’s me?”
Sensitivity: “I don’t feel comfortable sharing this!”
13
Identifiability cost
Intuition: The more attributes we already know, the more identifying it is to add another
Goal: Avoid identifiability
For example: k-anonymity [Sweeney ’02], and others
[Illustration: quasi-identifying attributes Age, Gender, Occupation.]
14
Identifiability cost
Predict person Y from attributes A
Example: P(Y | gender = female, country = US)
Define a “loss” function [cf. Lebanon et al.]
[Histograms over users 1–6: a flat distribution is good (predicting the user is hard); a peaked one is bad (predicting the user is easy). Cost = worst-case probability of detection.]
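One simple instantiation of this worst-case cost, sketched under assumptions rather than reproducing the paper's exact loss: treat each user as one record, group users by their shared attribute values, and score the adversary's best guess within each group.

```python
from collections import Counter

def identifiability_cost(population, attrs):
    """Probability that an adversary's best guess identifies a user, given
    only the shared attributes. Within a group of g indistinguishable users,
    the best guess succeeds with probability 1/g; average over all users."""
    groups = Counter(tuple(user[a] for a in attrs) for user in population)
    n = len(population)
    # sum over groups of (g/n) * (1/g) = (number of groups) / n
    return sum((g / n) * (1 / g) for g in groups.values())

population = [
    {"gender": "f", "country": "US"},
    {"gender": "f", "country": "US"},
    {"gender": "m", "country": "US"},
    {"gender": "m", "country": "DE"},
]
print(identifiability_cost(population, ["country"]))            # → 0.5
print(identifiability_cost(population, ["gender", "country"]))  # → 0.75
```

Adding the second attribute splinters the population into smaller groups, so the cost jumps from 0.5 to 0.75, the accelerating behavior the supermodularity theorem below formalizes.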
15
Identifiability cost
The more attributes we add, the larger the increase in cost: Accelerating cost
Theorem: Identifiability cost C(A) is supermodular!*
[Bar chart: identifiability cost (0 to 0.7) as more private attributes are greedily added: none, TCMP, AWDY, AWHR, AQRY, ACLK, ACRY, TREG, TWLD, TART, TREF, ACTY, TBUS, THEA, TREC, AZIP, TNWS, TSPT, TSHP, TSOC, AFRQ, TSCI, ATLV, TKID, DREG, TADT, THOM, TGMS, TCIN.]
*See store for details
16
Trading off utility and privacy
Set V of 29 possible attributes (each ≤ 2 bits):
Demographic data (location)
Query details (working hours / week day?)
Topic interests (ever visited business / science / … website?)
Search history (same query / click before? searches/day?)
User behavior (ever changed Zip, City, Country?)
For each A ⊆ V compute utility U(A) and cost C(A)
Find A maximizing U(A) while minimizing C(A)
17
Trading off utility and cost
Want: A* = argmax F(A)
Optimizing the value of private information is a submodular problem! We can use algorithms for optimizing submodular functions:
Goldengorin et al. (branch and bound), Feige et al. (approximation algorithm), …
Can efficiently get provably near-optimal tradeoff!
F(A) = U(A) − λ · C(A)
(final objective = utility − trade-off parameter × cost)
[Bar chart: reduction in click entropy under greedy forward selection for utility (none, occ, adult, age, whour, gdr, home, wday, ctry, kids, refs, bus, comp, world, arts, reg).]
[Bar chart: privacy cost (p log(1−p)) as attributes are added (wday, whour, reg, bus, adult, world, ctry, gdr, arts, comp, refs, kids, home, age, occ, all).]
[Bar chart: utility − cost under (lazy) greedy forward selection (none, occ, home, whour, age, wday, gender, reg, bus, world, adult, arts, country, comp, ref, kids).]
U(A) is submodular and C(A) is supermodular, so F(A) = U(A) − λ C(A) is submodular (but non-monotonic).
Maximizing it is NP-hard (and the search space is large: 2^29 subsets)
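To make the objective concrete, here is a brute-force sketch of A* = argmax F(A) on a toy instance. This is only viable for tiny V; the real 2^29-subset problem needs the submodular machinery cited above. The coverage utility and quadratic cost below are hypothetical stand-ins for click entropy and identifiability.

```python
from itertools import combinations

def best_tradeoff(V, U, C, lam):
    """Exhaustively find A* = argmax F(A) = U(A) - lam * C(A) over all subsets of V."""
    best, best_f = frozenset(), U(frozenset()) - lam * C(frozenset())
    for r in range(1, len(V) + 1):
        for A in map(frozenset, combinations(V, r)):
            f = U(A) - lam * C(A)
            if f > best_f:
                best, best_f = A, f
    return best, best_f

# Toy submodular utility (coverage) and supermodular cost (grows quadratically).
covers = {"AWDY": {1, 2}, "AWHR": {2, 3}, "ACRY": {4}}
U = lambda A: len(set().union(*(covers[v] for v in A)) if A else set())
C = lambda A: 0.3 * len(A) + 0.2 * len(A) * (len(A) - 1)

A_star, f_star = best_tradeoff(list(covers), U, C, lam=1.0)
print(sorted(A_star), round(f_star, 2))  # → ['AWDY', 'AWHR'] 2.0
```

With this λ the optimum shares two attributes: a third attribute would add one bit of utility but, because the cost accelerates, more than one unit of cost.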
18
Finding the “sweet spot”
Which λ should we choose?
The tradeoff curve is based purely on log data.
What do users prefer?
[Tradeoff curve: more utility U(A) vs. less cost C(A) for A* = argmax U(A) − λ C(A), traced by varying λ: λ = 0 “ignore cost”, λ = 1, λ = 10 toward “ignore utility”.]
Sweet spot! Maximal utility at maximal privacy
19
Survey for eliciting cost
Microsoft internal online survey
Distributed internationally
N = 1451 responses from 35 countries (80% US)
Incentive: 1 Zune™ digital music player
20
Identifiability vs sensitivity
21
Sensitivity vs utility
22
Seeking a common currency
Sensitivity acts as common currency to estimate utility-privacy tradeoff
[Plot: reported sensitivity (1–5) vs. location granularity: Region, Country, State, City, Zip, Address.]
[Plot: speedup required before users would share, vs. sensitivity: 1.25×, 1.5×, 2×, 4×, never.]
23
Calibrating the tradeoff
Can use survey data to calibrate the utility-privacy tradeoff!
[Bar chart: entropy reduction required for sharing location at each granularity (region, country, state, city, zip): survey data (median) vs. identifiability cost from search logs.]
[Plot: utility (entropy reduction) vs. cost (max. probability of identification), with tradeoff curves for λ = 1, 10, 100.]
User preferences map into the sweet spot!
Best fit: λ = 5.12 in F(A) = U(A) − λ C(A)
24
Understanding Sensitivities: “I don’t feel comfortable sharing this!”
25
Attribute sensitivities
[Bar chart: median sensitivity (1–5) by attribute: FRQ, NEWS, SCI, ART, GMS, BUS, HEA, SOC, GDR, QRY, CLK, TCY, OCC, MTL, CHD, WHR.]
We incorporate sensitivity in our cost function by calibration
Significant differencesbetween topics!
26
Comparison with heuristics
Optimized solution: Repeated visit / query, workday / working hour, top-level domain, avg. queries per day, topic: sports, topic: games
[Bar chart: utility U(A), cost C(A), and net benefit F(A) (bits of information) for the optimized tradeoff vs. heuristics: search statistics (ATLV, AWDY, AWHR, AFRQ), all topic interests, IP address bytes 1 & 2, full IP address. Net benefits range from ≈0.9 for the optimized solution down to −3.3 for the full IP address; reported values include 0.899, 0.573, −1.73, −1.81, −3.3.]
Optimized solution outperforms naïve selection heuristics!
27
Summary
Use of private information by online services as an optimization problem (with user permission / awareness)
Utility (click entropy) is submodular
Privacy (identifiability) is supermodular
Can use theoretical and algorithmic tools to efficiently find provably near-optimal tradeoff
Can calibrate tradeoff using user preferences
Promising results on search logs and survey data!