On the Anonymity of Home/Work Location Pairs

Philippe Golle

Kurt Partridge

Presented by: Bo Begole

Location traces are useful for personalized location-based services

Location History (columns: Time, Location, Cuisine)
  11:57 - 12:45   37°26’39” -122°9’38”
  1:22 - 1:31     37°23’11” -122°9’02”
  …

[Figure: the location history is turned into Inferred Restaurant Preferences (labels: $$$, Chinese, Italian, …) and laid out on a calendar (Monday, Tuesday, …; 12:00 to 1:00, 1:00 to …) to produce Personalized Recommendations. Magitti, CHI 2008]

Location traces can be sensitive

They can reveal business connections, political affiliation, medical condition, risky behaviors, etc.

Can damage reputations, unfairly raise insurance premiums, increase divorce rates, etc.

Some ways to mitigate the risk

– “Only my trusted network provider knows my location”
  » Location data is often shared with third party LBS providers
– “Privacy policies and legislation protect my location traces”
  » Data collector may fall victim to attacks, or be dishonest themselves
– “I’ll report my location only infrequently when I need service”
  » Many services require frequent location updates (e.g. friend finder)
– “I’ll report my location pseudonymously”
  » Watch out for inferences!

Sometimes possible to infer identity by combining data sources

Inference: what can be learned from combining existing knowledge with newly acquired information

E.g.: learn the medical record of William Weld [Swe00]
– Knowing the birth date and ZIP code of the governor of Massachusetts
– Can retrieve his health records from a supposedly anonymous database of state employee health-insurance claims

87% of the US population have a unique combination of date of birth, gender and postal code.

[Figure: two data sources with overlapping fields. Patient Records: Gender, Postal code, Date of Birth, Cancer Type. Voter Registration: Name, Street address, …, Gender, Postal code, Date of Birth.]
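The linkage described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in [Swe00]: every record, field name and helper (`QUASI_IDENTIFIERS`, `link`) is invented for the example. An "anonymous" record is re-identified when exactly one public record shares its gender, postal code and date of birth.

```python
# Toy linkage attack: join "anonymous" records to a public list on shared fields.
# All records and names below are invented for illustration.

QUASI_IDENTIFIERS = ("gender", "postal_code", "date_of_birth")

patient_records = [
    {"gender": "M", "postal_code": "00001", "date_of_birth": "1950-01-01", "diagnosis": "A"},
    {"gender": "F", "postal_code": "00001", "date_of_birth": "1950-01-01", "diagnosis": "B"},
    {"gender": "F", "postal_code": "00002", "date_of_birth": "1960-02-02", "diagnosis": "C"},
]

voter_registration = [
    {"name": "Voter One", "gender": "M", "postal_code": "00001", "date_of_birth": "1950-01-01"},
    {"name": "Voter Two", "gender": "F", "postal_code": "00002", "date_of_birth": "1960-02-02"},
    {"name": "Voter Three", "gender": "F", "postal_code": "00002", "date_of_birth": "1960-02-02"},
]


def key(record):
    """Project a record onto the fields shared by both data sources."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(anonymous, public):
    """Map the index of each anonymous record to a name when exactly one public record matches."""
    names_by_key = {}
    for person in public:
        names_by_key.setdefault(key(person), []).append(person["name"])
    matches = {}
    for index, record in enumerate(anonymous):
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # a unique match means the record is re-identified
            matches[index] = names[0]
    return matches


reidentified = link(patient_records, voter_registration)
print(reidentified)
```

In this toy data only the first patient record is re-identified: the second has no matching voter, and the third matches two voters, so it stays ambiguous.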

Inferences from location data

With 2 weeks’ worth of GPS data collected from a subject’s car, Krumm [Pervasive 2007] showed we can infer:
– Home address (median error < 60 m)

>5% are identifiable by combining with
– Reverse geo-coder
– Web-based white pages directory

Preventing inferences [Krumm 07]

[Figure panels: Data suppression, Noise, Rounding]
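A rough sketch of the three countermeasures named on this slide, applied to a list of (latitude, longitude) points. The function names, the parameter values and the simplification of treating degrees as planar distances are all assumptions made for illustration; they are not taken from Krumm's paper.

```python
import math
import random


def suppress(points, center, radius_deg):
    """Data suppression: drop every point within radius_deg of a sensitive location.

    Assumption: degrees are treated as planar units, which is only a rough
    approximation of real distances.
    """
    return [p for p in points if math.dist(p, center) > radius_deg]


def add_noise(points, sigma_deg, rng):
    """Noise: perturb each coordinate with zero-mean Gaussian noise."""
    return [(lat + rng.gauss(0, sigma_deg), lon + rng.gauss(0, sigma_deg))
            for lat, lon in points]


def round_to_grid(points, cell_deg):
    """Rounding: snap each coordinate to the nearest multiple of cell_deg."""
    return [(round(lat / cell_deg) * cell_deg, round(lon / cell_deg) * cell_deg)
            for lat, lon in points]


# Invented trace: two points near a hypothetical home, one elsewhere.
home = (37.4442, -122.1606)
trace = [(37.4442, -122.1606), (37.4450, -122.1600), (37.3864, -122.1506)]

suppressed = suppress(trace, home, radius_deg=0.005)
noisy = add_noise(trace, sigma_deg=0.001, rng=random.Random(0))
rounded = round_to_grid(trace, cell_deg=0.01)
print(suppressed, noisy, rounded, sep="\n")
```

Each transformation trades accuracy for privacy in a different way: suppression removes points, noise blurs them, and rounding makes nearby points indistinguishable.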

The K-anonymity Guarantee

Problem
– If a location trace is unique…
– It may be combined with other public data
– To generate undesirable inferences

Solution: k-anonymity [Sweeney 2000]
– Principle: “Data is safe to release if at least k people share the same attributes”
– Anonymity set: set of people with same attributes
– K-anonymity: size of the anonymity set
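The definitions on this slide translate directly into code. The sketch below assumes records are plain dictionaries; the data and the helper names (`anonymity_set_sizes`, `is_k_anonymous`) are hypothetical.

```python
from collections import Counter


def anonymity_set_sizes(records, attributes):
    """For each record, count how many records share its values on `attributes`."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]


def is_k_anonymous(records, attributes, k):
    """True when every anonymity set contains at least k people."""
    return all(size >= k for size in anonymity_set_sizes(records, attributes))


# Invented people, described only by coarse home and work regions.
people = [
    {"home": "A", "work": "X"},
    {"home": "A", "work": "X"},
    {"home": "A", "work": "Y"},
    {"home": "B", "work": "X"},
]

print(anonymity_set_sizes(people, ("home",)))         # sizes when only home is revealed
print(anonymity_set_sizes(people, ("home", "work")))  # sizes when home and work are revealed
```

Revealing more attributes can only split anonymity sets, never merge them, so the sizes for the home/work pair are never larger than those for home alone.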

K-anonymity Example

– Cancer DB: (Post code: 34012, DOB 7/31/1945, Male)
– Goal: ensure 5-anonymity
– (Cambridge MA) 54,805 people
– (Cambridge MA, male) 26,854 people
– (Cambridge MA, 55 years old) 2,096 people
– (Cambridge MA, DOB: 7/31/1945) 6 people
– (Cambridge MA, DOB: 7/31/1945, Male) 3 people
– (Post code: 34012, DOB: 7/31/1945, Male) 1 person

Combine with voter registration: William Weld

The same can be done for 69 percent of the 54,805 people on the voting list of Cambridge, MA.
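One way to read the example: walk the descriptions from general to specific and stop releasing detail once the anonymity set drops below k. The counts below are copied from the slide; the selection rule and the helper name `most_specific_release` are assumptions made for this sketch, not a procedure stated in the talk.

```python
# Counts copied from the example, ordered from most general to most specific.
LEVELS = [
    ("Cambridge MA", 54805),
    ("Cambridge MA, male", 26854),
    ("Cambridge MA, 55 years old", 2096),
    ("Cambridge MA, DOB: 7/31/1945", 6),
    ("Cambridge MA, DOB: 7/31/1945, Male", 3),
    ("Post code: 34012, DOB: 7/31/1945, Male", 1),
]


def most_specific_release(levels, k):
    """Return the last level in the list whose anonymity set has at least k people, else None."""
    chosen = None
    for description, size in levels:
        if size >= k:
            chosen = (description, size)
    return chosen


print(most_specific_release(LEVELS, 5))
```

With the goal of 5-anonymity, the date of birth can be released together with the city (6 people), but adding gender shrinks the set to 3 people and violates the goal.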

Inferences from location data

What level of location accuracy results in small anonymity sets (small k) for some portion of the population?

Assume a location trace reveals approximate …
– region where someone lives (county or postal code)
– and also, region where someone works (county or postal code)

County and postal code are coarse-grained
– A trace that reveals less is probably not very useful
– Useful traces in fact reveal considerably more!
– Therefore, our results are an upper bound on the size of the anonymity set (things could be worse!)

Data source

Longitudinal Employer-Household Dynamics (LEHD)
– Census Bureau program to compile information about where people work and where they live
  » County
  » Census Tract ~= US Postal code
– Also records age, earnings and distribution across industries.
– 103,289,243 workers from 42 states

LEHD is a synthetic dataset
– “The key statistical property to preserve in the synthetic data is the joint distribution of workers across home and work areas”
– 3 implicates (synthetic datasets) are available for researchers to check robustness of results
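Given counts of workers per (home region, work region) pair, anonymity set sizes can be computed per worker. The sketch below assumes a simple dictionary of flows; this layout, the toy numbers and the helper names are hypothetical and do not reflect the actual LEHD file format or the authors' code.

```python
from collections import defaultdict


def anonymity_by_worker(flows):
    """flows maps (home_region, work_region) -> number of workers.

    Returns three lists of (anonymity_set_size, workers) pairs, for an observer
    who learns the home region only, the work region only, or both.
    """
    home_totals = defaultdict(int)
    work_totals = defaultdict(int)
    for (home, work), n in flows.items():
        home_totals[home] += n
        work_totals[work] += n
    home_only = [(home_totals[h], n) for (h, _), n in flows.items()]
    work_only = [(work_totals[w], n) for (_, w), n in flows.items()]
    both = [(n, n) for n in flows.values()]
    return home_only, work_only, both


def fraction_at_most(pairs, k):
    """Fraction of workers whose anonymity set has size <= k."""
    total = sum(n for _, n in pairs)
    return sum(n for size, n in pairs if size <= k) / total


def median_size(pairs):
    """Smallest size s such that at least half of the workers have a set of size <= s."""
    total = sum(n for _, n in pairs)
    running = 0
    for size, n in sorted(pairs):
        running += n
        if 2 * running >= total:
            return size
    return None  # no workers at all


# Invented flows between two regions, A and B.
flows = {("A", "A"): 6, ("A", "B"): 3, ("B", "A"): 1, ("B", "B"): 10}
home_only, work_only, both = anonymity_by_worker(flows)
print(fraction_at_most(both, 1), median_size(home_only), median_size(both))
```

Statistics of this kind (the share of workers with an anonymity set of size 1, and the median set size) are the quantities reported in the results that follow.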

Size of live/work anonymity set

[Figure: plot with logarithmic axes (tick labels 1 to 100,000 and 1 to 10,000,000), axis label “Size of anonymity set”, and three series: Home only, Work only, Both home & work. Callouts: “7% have k-anonymity of 1” and “50% have k-anonymity of 35,000”.]

Influence of living & working in the same region versus different regions

[Figure: plot with logarithmic axes (tick labels 1 to 5,000 and 1 to 1,000,000), axis label “Size of anonymity set”, and three series: Same location, Different locations, All.]

Conclusion: Even coarse-grained location traces can identify individuals

Approximate home and work locations narrow a person to a very small anonymity set

Being unique is not quite the same as being identifiable
– But the identity of a specific individual may be found by combining other sources (white pages, patient records, police records, private knowledge, employer records, Facebook pages, …)

We can estimate the k-anonymity of location- and other context-aware technologies prior to widespread adoption using public datasets
– Population Census
– Longitudinal Employer-Household Dynamics (LEHD)
– American Time Use Survey (ATUS)
– Japan Statistics Bureau Survey on Time Use and Leisure Activities

Geographic Region Size         Median Anonymity Set Size
County                         34,980
Census Tract (~postal code)    21
Census Block                   1

Techniques to Estimate Privacy Effects of Context-Aware Technologies

K-anonymity (size of the anonymity set) is a metric to characterize level of privacy

Can estimate k-anonymity of context-awareness technologies with public datasets
– Population Census
– Longitudinal Employer-Household Dynamics (LEHD)
– American Time Use Survey (ATUS)
– Japan Statistics Bureau Survey on Time Use and Leisure Activities

K-anonymity is not a perfect measure
– May still leak information
  » L-diversity [MGKV 2006]
  » ε-differential privacy [Dwork et al]
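A small illustration of why a k-anonymous release may still leak information. The sketch counts distinct sensitive values per anonymity set, which is the simplest reading of l-diversity; the records, field names and the helper `group_diversity` are invented for the example, and the definitions in [MGKV 2006] are more refined than this.

```python
from collections import defaultdict


def group_diversity(records, quasi_identifiers, sensitive):
    """Map each anonymity set to (set size, number of distinct sensitive values)."""
    sizes = defaultdict(int)
    values = defaultdict(set)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        sizes[key] += 1
        values[key].add(record[sensitive])
    return {key: (sizes[key], len(values[key])) for key in sizes}


# Invented release: both regions are 3-anonymous on "region".
released = [
    {"region": "R1", "condition": "flu"},
    {"region": "R1", "condition": "flu"},
    {"region": "R1", "condition": "flu"},
    {"region": "R2", "condition": "flu"},
    {"region": "R2", "condition": "asthma"},
    {"region": "R2", "condition": "diabetes"},
]

diversity = group_diversity(released, ("region",), "condition")
print(diversity)
```

Both groups have three members, yet everyone in R1 shares the same condition, so knowing that a person is in R1 reveals the condition without singling out a record.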
