DESCRIPTION
Many applications benefit from user location data, but location data raises privacy concerns. Anonymization can protect privacy, but identities can sometimes be inferred from supposedly anonymous data. This paper studies a new attack on the anonymity of location data. We show that if the approximate locations of an individual's home and workplace can both be deduced from a location trace, then the median size of the individual's anonymity set in the U.S. working population is 1, 21 and 34,980, for locations known at the granularity of a census block, census tract and county respectively. The location data of people who live and work in different regions can be re-identified even more easily. Our results show that the threat of re-identification for location data is much greater when the individual's home and work locations can both be deduced from the data. To preserve anonymity, we offer guidance for obfuscating location traces before they are disclosed.
On the Anonymity of Home/Work Location Pairs
Philippe Golle
Kurt Partridge
Presented by: Bo Begole
Location traces are useful for personalized location-based services
[Figure: a calendar (Monday, Tuesday, …; 12:00 to 1:00, …), a Location History table, Inferred Restaurant Preferences ($$$, Chinese, Italian, …) and Personalized Recommendations [Magitti, CHI 2008]]
Location History
Time             Location                    Cuisine
11:57 - 12:45    37°26’39”, -122°9’38”
1:22 - 1:31      37°23’11”, -122°9’02”
…                …                           …
Location traces can be sensitive
They can reveal business connections, political affiliation, medical condition, risky behaviors, etc.
Can damage reputations, unfairly raise insurance premiums, increase divorce rates, etc.
Some ways to mitigate the risk
– “Only my trusted network provider knows my location”
» Location data is often shared with third-party LBS providers
– “Privacy policies and legislation protect my location traces”
» Data collector may fall victim to attacks, or be dishonest themselves
– “I’ll report my location only infrequently when I need service”
» Many services require frequent location updates (e.g. friend finder)
– “I’ll report my location pseudonymously”
» Watch out for inferences!
Sometimes possible to infer identity by combining data sources
Inference: what can be learned from combining existing knowledge with newly acquired information
E.g.: learn the medical record of William Weld [Swe00]
– Knowing the birth date and ZIP code of the governor of Massachusetts
– Can retrieve his health records from a supposedly anonymous database of state employee health-insurance claims
87% of US population have unique date of birth, gender and postal code.
[Figure: two record sets with shared fields]
Patient Records: Gender, Postal code, Date of Birth, Cancer Type
Voter Registration: Name, Street address, …, Gender, Postal code, Date of Birth
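The linkage described on this slide can be sketched in a few lines. This is a minimal illustration, not the procedure used in [Swe00]: the records, names and field names below are all hypothetical, and a record counts as re-identified only when exactly one named person shares its gender, postal code and date of birth.

```python
from collections import defaultdict

# Hypothetical toy records; none of these values come from the talk.
patient_records = [
    {"gender": "M", "postal_code": "00001", "dob": "1950-01-01", "diagnosis": "A"},
    {"gender": "F", "postal_code": "00001", "dob": "1962-03-04", "diagnosis": "B"},
    {"gender": "F", "postal_code": "00002", "dob": "1962-03-04", "diagnosis": "C"},
]
voter_registration = [
    {"name": "Voter One", "gender": "M", "postal_code": "00001", "dob": "1950-01-01"},
    {"name": "Voter Two", "gender": "F", "postal_code": "00001", "dob": "1962-03-04"},
    {"name": "Voter Three", "gender": "F", "postal_code": "00001", "dob": "1962-03-04"},
]

# Fields present in both data sources (assumed names).
QUASI_IDENTIFIERS = ("gender", "postal_code", "dob")


def key(record):
    """Project a record onto the shared quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(anonymous, identified):
    """Return (name, diagnosis) for anonymous records matching exactly one person."""
    names_by_key = defaultdict(list)
    for person in identified:
        names_by_key[key(person)].append(person["name"])
    matches = []
    for record in anonymous:
        candidates = names_by_key.get(key(record), [])
        if len(candidates) == 1:  # anonymity set of size 1
            matches.append((candidates[0], record["diagnosis"]))
    return matches


reidentified = link(patient_records, voter_registration)
print(reidentified)
```

In this toy data only the first patient record is linked to a name; the second matches two voters, so its anonymity set has size 2, and the third matches nobody.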
Inferences from location data
With 2 weeks’ worth of GPS data collected from a subject’s car, Krumm [Pervasive 2007] showed we can infer:
– Home address (median error < 60 m)
>5% are identifiable by combining with
– Reverse geo-coder
– Web-based white pages directory
Preventing inferences [Krumm 07]
– Data suppression
– Noise
– Rounding
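The three countermeasures named above can be illustrated on a list of (latitude, longitude) points. This is a rough sketch under stated assumptions, not Krumm's implementation: the function names, the use of plain degrees as the distance unit, and the parameter values are all chosen here for illustration.

```python
import random


def suppress(trace, center, radius_deg):
    """Data suppression: drop points within radius_deg of a sensitive location."""
    lat0, lon0 = center
    return [
        (lat, lon)
        for lat, lon in trace
        if (lat - lat0) ** 2 + (lon - lon0) ** 2 > radius_deg ** 2
    ]


def add_noise(trace, sigma_deg, rng):
    """Noise: perturb each coordinate with zero-mean Gaussian noise."""
    return [
        (lat + rng.gauss(0, sigma_deg), lon + rng.gauss(0, sigma_deg))
        for lat, lon in trace
    ]


def round_to_grid(trace, cell_deg):
    """Rounding: snap each coordinate to the nearest multiple of cell_deg."""
    return [
        (round(lat / cell_deg) * cell_deg, round(lon / cell_deg) * cell_deg)
        for lat, lon in trace
    ]


# Illustrative points only (decimal degrees).
trace = [(37.4442, -122.1606), (37.3864, -122.1506)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

kept = suppress(trace, center=(37.4442, -122.1606), radius_deg=0.01)
noisy = add_noise(trace, sigma_deg=0.001, rng=rng)
rounded = round_to_grid(trace, cell_deg=0.01)
print(kept, noisy, rounded, sep="\n")
```

Each function trades usefulness for privacy through a single parameter (radius, noise level, grid cell size); how large those must be to defeat a given inference is an empirical question the sketch does not answer.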
The K-anonymity Guarantee
Problem
– If a location trace is unique…
– It may be combined with other public data
– To generate undesirable inferences
Solution: k-anonymity [Sweeney 2000]
– Principle: “Data is safe to release if at least k people share the same attributes”
– Anonymity set: set of people with same attributes
– K-anonymity: size of the anonymity set
K-anonymity Example
– Cancer DB: (Post code: 34012, DOB 7/31/1945, Male)
– Goal: ensure 5-anonymity
– (Cambridge MA): 54,805 people
– (Cambridge MA, male): 26,854 people
– (Cambridge MA, 55 years old): 2,096 people
– (Cambridge MA, DOB: 7/31/1945): 6 people
– (Cambridge MA, DOB: 7/31/1945, Male): 3 people
– (Post code: 34012, DOB: 7/31/1945, Male): 1 person
Combine with voter registration: William Weld
The same can be done for 69 percent of the 54,805 people on the voting list of Cambridge, MA.
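The narrowing in this example (each added attribute shrinks the anonymity set) can be reproduced on toy data. The sketch below assumes records are plain dictionaries; the records, field names and helper names are hypothetical and unrelated to the Cambridge figures above.

```python
from collections import Counter

# Hypothetical toy population, not the Cambridge, MA data.
records = [
    {"city": "X", "gender": "M", "dob": "1945-07-31", "postal_code": "A"},
    {"city": "X", "gender": "M", "dob": "1945-07-31", "postal_code": "B"},
    {"city": "X", "gender": "F", "dob": "1945-07-31", "postal_code": "A"},
    {"city": "X", "gender": "M", "dob": "1950-01-01", "postal_code": "A"},
    {"city": "X", "gender": "F", "dob": "1950-01-01", "postal_code": "B"},
]


def anonymity_set(records, **attributes):
    """People who share every one of the given attribute values."""
    return [
        r for r in records
        if all(r.get(name) == value for name, value in attributes.items())
    ]


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values is shared by >= k people."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())


# Adding attributes narrows the anonymity set step by step.
sizes = [
    len(anonymity_set(records, city="X")),
    len(anonymity_set(records, city="X", gender="M")),
    len(anonymity_set(records, city="X", gender="M", dob="1945-07-31")),
    len(anonymity_set(records, postal_code="A", gender="M", dob="1945-07-31")),
]
print(sizes)
print(is_k_anonymous(records, ("gender",), k=2))
print(is_k_anonymous(records, ("postal_code", "dob", "gender"), k=2))
```

Releasing only gender keeps this toy table 2-anonymous, while releasing postal code, date of birth and gender together does not.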
Inferences from location data
What level of accuracy of location information could result in small sizes of k anonymity for some portion of the population?
Assume a location trace reveals approximate …
– region someone lives (county or postal code)
– and also, region someone works (county or postal code)
County and postal code are coarse-grained
– A trace that reveals less is probably not very useful
– Useful traces in fact reveal considerably more!
– Therefore, our results are an upper bound on the size of the anonymity set (things could be worse!)
Data source
Longitudinal Employer-Household Dynamics (LEHD)
– Census Bureau program to compile information about where people work and where they live
» County
» Census Tract ~= US Postal code
– Also records age, earnings and distribution across industries.
– 103,289,243 workers from 42 states
LEHD is a synthetic dataset
– “The key statistical property to preserve in the synthetic data is the joint distribution of workers across home and work areas”
– 3 implicates (synthetic datasets) are available for researchers to check robustness of results
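Given a joint distribution of workers across home and work areas, each worker's anonymity set is the number of workers sharing the same region(s). The sketch below is a minimal illustration that assumes a toy list of (home, work) pairs; it does not read the actual LEHD files, and the region labels and function name are hypothetical.

```python
from collections import Counter
from statistics import median

# Hypothetical toy population: one (home_region, work_region) pair per worker.
workers = (
    [("H1", "W1")] * 4
    + [("H1", "W2")] * 1
    + [("H2", "W1")] * 2
    + [("H2", "W2")] * 3
)


def anonymity_set_sizes(workers, key):
    """For each worker, count how many workers share the same key."""
    counts = Counter(key(w) for w in workers)
    return [counts[key(w)] for w in workers]


home_only = anonymity_set_sizes(workers, key=lambda w: w[0])
work_only = anonymity_set_sizes(workers, key=lambda w: w[1])
both = anonymity_set_sizes(workers, key=lambda w: w)

# Median anonymity set size across workers, and share of unique workers.
print(median(home_only), median(work_only), median(both))
print(sum(size == 1 for size in both) / len(both))
```

Knowing both regions can only shrink a worker's anonymity set relative to knowing one of them, since every worker who shares the pair also shares each region separately.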
Size of live/work anonymity set
[Figure: size of anonymity set (log-scale axes) for Home only, Work only, and Both home & work]
Annotations on the plot:
– 7% have k-anonymity of 1
– 5% have k-anonymity of 92
– 50% have k-anonymity of 35,000
– 50% have k-anonymity of 21
Influence of living & working in the same region versus different regions
[Figure: size of anonymity set (log-scale axes) for Same location, Different locations, and All]
Conclusion: Even coarse-grained location traces can identify individuals
Approximate home and work locations narrow a person to a very small anonymity set
Being unique is not quite the same as being identifiable
– But identity of a specific individual may be found by combining other sources (white pages, patient records, police records, private knowledge, employer records, Facebook pages, …)
We can estimate k-anonymity of location- and other context-aware tech. prior to widespread adoption using public datasets
– Population Census
– Longitudinal Employer-Household Dynamics (LEHD)
– American Time Use Survey (ATUS)
– Japan Statistics Bureau Survey on Time Use and Leisure Activities
Geographic Region Size          Median Anonymity Set Size
County                          34,980
Census Tract (~postal code)     21
Census Block                    1
Techniques to Estimate Privacy Effects of Context-Aware Technologies
K-anonymity (size of the anonymity set) is a metric to characterize level of privacy
Can estimate k-anonymity of context-awareness technologies with public datasets
– Population Census
– Longitudinal Employer-Household Dynamics (LEHD)
– American Time Use Survey (ATUS)
– Japan Statistics Bureau Survey on Time Use and Leisure Activities
K-anonymity is not a perfect measure
– May still leak information
» L-diversity [MGKV 2006]
» ε-differential privacy [Dwork et al]
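One way a k-anonymous release can still leak information: if everyone in an anonymity set shares the same sensitive value, an attacker learns that value without identifying anyone. The sketch below uses the simplest "distinct values" reading of l-diversity, which is only one of several variants; the records, field names and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy release: home/work regions plus one sensitive attribute.
released = [
    {"home": "H1", "work": "W1", "condition": "flu"},
    {"home": "H1", "work": "W1", "condition": "flu"},
    {"home": "H1", "work": "W1", "condition": "flu"},
    {"home": "H2", "work": "W1", "condition": "flu"},
    {"home": "H2", "work": "W1", "condition": "asthma"},
    {"home": "H2", "work": "W1", "condition": "none"},
]


def group_by(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[q] for q in quasi_identifiers)].append(record)
    return groups


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest anonymity set."""
    groups = group_by(records, quasi_identifiers)
    return min(len(group) for group in groups.values())


def distinct_l_diversity(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values within any anonymity set."""
    groups = group_by(records, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in groups.values())


k = k_anonymity(released, ("home", "work"))
l = distinct_l_diversity(released, ("home", "work"), "condition")
print(k, l)
```

Here the release is 3-anonymous, yet anyone known to live in H1 and work in W1 is revealed to have the same condition, which is the kind of leak the slide alludes to.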