On the Anonymity of Home/Work Location Pairs Philippe Golle Kurt Partridge Presented by: Bo Begole


DESCRIPTION

Many applications benefit from user location data, but location data raises privacy concerns. Anonymization can protect privacy, but identities can sometimes be inferred from supposedly anonymous data. This paper studies a new attack on the anonymity of location data. We show that if the approximate locations of an individual's home and workplace can both be deduced from a location trace, then the median size of the individual's anonymity set in the U.S. working population is 1, 21 and 34,980, for locations known at the granularity of a census block, census tract and county respectively. The location data of people who live and work in different regions can be re-identified even more easily. Our results show that the threat of re-identification for location data is much greater when the individual's home and work locations can both be deduced from the data. To preserve anonymity, we offer guidance for obfuscating location traces before they are disclosed.


Page 1: On the Anonymity of Home/Work Location Pairs

On the Anonymity of Home/Work Location Pairs

Philippe Golle

Kurt Partridge

Presented by: Bo Begole

Page 2: On the Anonymity of Home/Work Location Pairs

Location traces are useful for personalized location-based services

[Figure: a user's calendar (Monday, Tuesday; 12:00 to 1:00, ...) and location history are used to infer restaurant preferences (price level $$$; cuisines such as Chinese and Italian), which drive personalized recommendations. Source: Magitti, CHI 2008]

Location History

Time            Location                  Cuisine
11:57 - 12:45   37°26’39” -122°9’38”
1:22 - 1:31     37°23’11” -122°9’02”
…               …                         …

Page 3: On the Anonymity of Home/Work Location Pairs

Location traces can be sensitive

They can reveal business connections, political affiliation, medical condition, risky behaviors, etc.

Can damage reputations, unfairly raise insurance premiums, increase divorce rates, etc.

Some ways to mitigate the risk

– “Only my trusted network provider knows my location”
» Location data is often shared with third-party LBS providers

– “Privacy policies and legislation protect my location traces”
» Data collector may fall victim to attacks, or be dishonest themselves

– “I’ll report my location only infrequently when I need service”
» Many services require frequent location updates (e.g. friend finder)

– “I’ll report my location pseudonymously”
» Watch out for inferences!

Page 4: On the Anonymity of Home/Work Location Pairs

Sometimes possible to infer identity by combining data sources

Inference: what can be learned from combining existing knowledge with newly acquired information

E.g.: learn the medical record of William Weld [Swe00]

– Knowing the birth date and ZIP code of the governor of Massachusetts
– Can retrieve his health records from a supposedly anonymous database of state employee health-insurance claims

87% of the US population have a unique combination of date of birth, gender and postal code.

[Figure: two overlapping data sets. Patient Records: Cancer Type, Gender, Postal code, Date of Birth. Voter Registration: Name, Street address, …, Gender, Postal code, Date of Birth. The shared attributes are Gender, Postal code and Date of Birth.]

Page 5: On the Anonymity of Home/Work Location Pairs

Inferences from location data

With two weeks’ worth of GPS data collected from a subject’s car, Krumm [Pervasive 2007] showed we can infer:

– Home address (median error < 60 m)

>5% are identifiable by combining with

– Reverse geo-coder
– Web-based white pages directory

Page 6: On the Anonymity of Home/Work Location Pairs

Preventing inferences [Krumm 07]

Data suppression, noise, rounding
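The three obfuscation options named on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not Krumm's implementation: the function names, the suppression radius, the noise scale and the grid size are all assumptions chosen for the example, and distances are measured directly in degrees for simplicity.

```python
import math
import random

# Hypothetical helpers; names and parameter values are illustrative assumptions.

def suppress(trace, center, radius_deg):
    """Data suppression: drop every point within radius_deg of a sensitive spot."""
    return [(lat, lon) for lat, lon in trace
            if math.hypot(lat - center[0], lon - center[1]) > radius_deg]

def add_noise(trace, sigma_deg, rng):
    """Noise: perturb each coordinate with zero-mean Gaussian noise."""
    return [(lat + rng.gauss(0, sigma_deg), lon + rng.gauss(0, sigma_deg))
            for lat, lon in trace]

def round_to_grid(trace, cell_deg):
    """Rounding: snap each point to the nearest node of a regular grid."""
    return [(round(lat / cell_deg) * cell_deg, round(lon / cell_deg) * cell_deg)
            for lat, lon in trace]

# Made-up trace in decimal degrees; the first point stands in for "home".
trace = [(37.4442, -122.1606), (37.3864, -122.1506), (37.4450, -122.1610)]
home = trace[0]

suppressed = suppress(trace, home, radius_deg=0.01)    # points near home removed
noisy = add_noise(trace, sigma_deg=0.001, rng=random.Random(0))
rounded = round_to_grid(trace, cell_deg=0.01)          # coarser coordinates
```

Each function trades utility for privacy in a different way: suppression removes the most revealing points, noise blurs all of them, and rounding reduces the resolution at which any point is known.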

Page 7: On the Anonymity of Home/Work Location Pairs

The K-anonymity Guarantee

Problem

– If a location trace is unique…
– It may be combined with other public data
– To generate undesirable inferences

Solution: k-anonymity [Sweeney 2000]

– Principle: “Data is safe to release if at least k people share the same attributes”
– Anonymity set: set of people with same attributes
– K-anonymity: size of the anonymity set
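As a minimal sketch of these definitions, the Python below groups records by a chosen set of attributes and reports the size of the smallest group. The records, attribute names and function names are made up for illustration and do not come from the talk.

```python
from collections import Counter

def anonymity_set_sizes(records, attributes):
    """Count how many records share each combination of attribute values."""
    return Counter(tuple(r[a] for a in attributes) for r in records)

def k_anonymity(records, attributes):
    """k for a release: the size of the smallest anonymity set."""
    return min(anonymity_set_sizes(records, attributes).values())

# Made-up records for illustration only.
records = [
    {"postal_code": "A", "gender": "M", "birth_year": 1945},
    {"postal_code": "A", "gender": "M", "birth_year": 1945},
    {"postal_code": "A", "gender": "F", "birth_year": 1945},
    {"postal_code": "A", "gender": "F", "birth_year": 1950},
]

k_postal = k_anonymity(records, ["postal_code"])
k_postal_gender = k_anonymity(records, ["postal_code", "gender"])
k_all = k_anonymity(records, ["postal_code", "gender", "birth_year"])
```

On this toy data, k shrinks from 4 to 2 to 1 as more attributes are released, which is the effect the principle is meant to guard against.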

Page 8: On the Anonymity of Home/Work Location Pairs

K-anonymity Example

– Cancer DB: (Post code: 34012, DOB 7/31/1945, Male)
– Goal: ensure 5-anonymity

– (Cambridge MA): 54,805 people
– (Cambridge MA, male): 26,854 people
– (Cambridge MA, 55 years old): 2,096 people
– (Cambridge MA, DOB: 7/31/1945): 6 people
– (Cambridge MA, DOB: 7/31/1945, Male): 3 people
– (Post code: 34012, DOB: 7/31/1945, Male): 1 person

Combine with voter registration: William Weld

The same can be done for 69 percent of the 54,805 people on the voting list of Cambridge, MA.

Page 9: On the Anonymity of Home/Work Location Pairs

Inferences from location data

What level of accuracy of location information could result in small sizes of k-anonymity for some portion of the population?

Assume a location trace reveals approximate …

– region someone lives (county or postal code)
– and also, region someone works (county or postal code)

County and postal code are coarse-grained

– A trace that reveals less is probably not very useful
– Useful traces in fact reveal considerably more!
– Therefore, our results are an upper bound on the size of the anonymity set (things could be worse!)

Page 10: On the Anonymity of Home/Work Location Pairs

Data source

Longitudinal Employer-Household Dynamics (LEHD)

– Census Bureau program to compile information about where people work and where they live
» County
» Census Tract ~= US Postal code
– Also records age, earnings and distribution across industries.
– 103,289,243 workers from 42 states

LEHD is a synthetic dataset

– “The key statistical property to preserve in the synthetic data is the joint distribution of workers across home and work areas”
– 3 implicates (synthetic datasets) are available for researchers to check robustness of results
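A dataset of this shape, worker counts per (home region, work region) pair, is enough to compute a median anonymity set size. The sketch below shows one way to do so, under the assumption that every worker in a cell with n workers has an anonymity set of size n; the counts, region labels and function names are invented for illustration and are not LEHD figures or the authors' code.

```python
from collections import Counter

def median_anonymity_set_size(counts):
    """Median, taken over workers, of the size of each worker's anonymity set.

    counts maps a region key to the number of workers in that cell.
    A cell with n workers contributes n workers whose set size is n.
    """
    total = sum(counts.values())
    seen = 0
    for size in sorted(counts.values()):
        seen += size
        if 2 * seen >= total:
            return size
    raise ValueError("counts is empty")

def marginal(pair_counts, index):
    """Collapse (home, work) counts onto home (index 0) or work (index 1)."""
    out = Counter()
    for pair, n in pair_counts.items():
        out[pair[index]] += n
    return out

# Made-up (home, work) counts for illustration only.
pair_counts = {("H1", "W1"): 5, ("H1", "W2"): 1,
               ("H2", "W1"): 1, ("H2", "W2"): 4}

median_pair = median_anonymity_set_size(pair_counts)
median_home = median_anonymity_set_size(marginal(pair_counts, 0))
median_work = median_anonymity_set_size(marginal(pair_counts, 1))
```

On this toy input the median is 6 when only the home region is known and 4 when both regions are known. Running the same computation on each of the three implicates would be one way to check that a result is robust.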

Page 11: On the Anonymity of Home/Work Location Pairs

Size of live/work anonymity set

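[Figure: distribution of anonymity set sizes, with separate curves for Home only, Work only, and Both home & work. One axis is labeled “Size of anonymity set”; tick marks run in powers of ten from 1 to 100,000 on one axis and from 1 to 10,000,000 on the other.]

Annotations on the figure:

– 7% have k-anonymity of 1
– 5% have k-anonymity of 92
– 50% have k-anonymity of 35,000
– 50% have k-anonymity of 21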

Page 12: On the Anonymity of Home/Work Location Pairs

Influence of living & working in the same region versus different regions

[Figure: size of anonymity set, with separate curves for Same location, Different locations, and All. Tick marks run from 1 to 5,000 on one axis and from 1 to 1,000,000 on the other.]

Page 13: On the Anonymity of Home/Work Location Pairs

Conclusion: Even coarse-grained location traces can identify individuals

Approximate home and work locations narrow a person to a very small anonymity set

Geographic Region Size         Median Anonymity Set Size
County                         34,980
Census Tract (~postal code)    21
Census Block                   1

Being unique is not quite the same as being identifiable

– But identity of a specific individual may be found by combining other sources (white pages, patient records, police records, private knowledge, employer records, Facebook pages, …)

We can estimate k-anonymity of location- and other context-aware tech. prior to widespread adoption using public datasets

– Population Census
– Longitudinal Employer-Household Dynamics (LEHD)
– American Time Use Survey (ATUS)
– Japan Statistics Bureau Survey on Time Use and Leisure Activities

Page 14: On the Anonymity of Home/Work Location Pairs

Techniques to Estimate Privacy Effects of Context-Aware Technologies

K-anonymity (size of the anonymity set) is a metric to characterize level of privacy

Can estimate k-anonymity of context-awareness technologies with public datasets

– Population Census
– Longitudinal Employer-Household Dynamics (LEHD)
– American Time Use Survey (ATUS)
– Japan Statistics Bureau Survey on Time Use and Leisure Activities

K-anonymity is not a perfect measure

– May still leak information
» L-diversity [MGKV 2006]
» ε-differential privacy [Dwork et al]
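To illustrate how a k-anonymous release may still leak information, the sketch below checks distinct l-diversity, the simplest variant of the idea: the number of different sensitive values inside each anonymity set. The records, field names and function names are made up for the example and are not taken from the talk.

```python
from collections import Counter, defaultdict

def k_anonymity(records, attributes):
    """Size of the smallest group of records sharing the given attributes."""
    sizes = Counter(tuple(r[a] for a in attributes) for r in records)
    return min(sizes.values())

def distinct_l_diversity(records, attributes, sensitive):
    """Smallest number of distinct sensitive values within any group."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[a] for a in attributes)].add(r[sensitive])
    return min(len(values) for values in groups.values())

# Made-up records for illustration only.
records = [
    {"home": "T1", "work": "T2", "condition": "flu"},
    {"home": "T1", "work": "T2", "condition": "flu"},
    {"home": "T1", "work": "T2", "condition": "flu"},
    {"home": "T3", "work": "T4", "condition": "flu"},
    {"home": "T3", "work": "T4", "condition": "asthma"},
    {"home": "T3", "work": "T4", "condition": "cancer"},
]

k = k_anonymity(records, ["home", "work"])
l = distinct_l_diversity(records, ["home", "work"], "condition")
```

Here every home/work pair is shared by three people, so the release is 3-anonymous, yet anyone known to live in T1 and work in T2 is revealed to have the same condition: the l value is 1.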