SHARING HEALTH RESEARCH DATA De-identification METHODS & EXPERIENCES Dr. Khaled El Emam Electronic Health Information Laboratory

Sharing Health Research Data

DESCRIPTION

Slides from a presentation at Johns Hopkins on de-identification and data sharing

Page 1: Sharing Health Research Data

SHARING HEALTH RESEARCH DATA

De-identification: METHODS & EXPERIENCES

Dr. Khaled El Emam, Electronic Health Information Laboratory

Page 2: Sharing Health Research Data

Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca

Motivations for De-identification

• Obtaining patient consent/authorization – not practical for large databases and introduces bias

• Compliance with regulations / legislation

• Contractual obligations

• Maintain public / consumer / client trust

• Costs of breach notification

Page 3: Sharing Health Research Data

Page 4: Sharing Health Research Data

Page 5: Sharing Health Research Data

Page 6: Sharing Health Research Data

Page 7: Sharing Health Research Data

A Balance

Page 8: Sharing Health Research Data

Definition of De-identified Data

Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information.

Page 9: Sharing Health Research Data

Re-identification Attacks

• Just to clear this issue up at the beginning

• There are some claims that health data is easy to re-identify

• Often examples are used to support that argument

• The evidence does not support these claims

  – When data are de-identified properly the probability of a successful re-identification attack is very small

• Let’s consider a few highly publicized examples

Page 10: Sharing Health Research Data

AOL

• AOL releases search queries replacing usernames with pseudonyms

• New York Times reporters re-identify one user, No. 4417749

• Her search terms: “tea for good health”, “numb fingers”, “hand tremors”, “dry mouth”, “60 single men”, “dog that urinates on everything”, “landscapers in Lilburn, Ga”, “homes sold in shadow lake subdivision gwinnett county georgia”

• Thelma Arnold, a widow living in Lilburn, Ga.; she has three dogs

Page 11: Sharing Health Research Data

AOL?

• It is well known that a large percentage of individuals run ‘vanity’ searches that include their names – Thelma Arnold did

• It is also known that location information can be determined from an individual’s search queries

• Search queries, even if the username is replaced with a pseudonym, cannot be considered de-identified

Page 12: Sharing Health Research Data

Weld

• Governor Weld of Massachusetts was unwell during a public appearance – the story was covered in the media

• Semi-publicly available insurance claims data matched with voter registration lists

• It was possible to determine which claims records belonged to the Governor

Page 13: Sharing Health Research Data

Weld?

• This re-identification attack was done before HIPAA came into effect – the insurance claims data would not pass any of the HIPAA de-identification standards

• A recent analysis indicated that Weld was likely re-identified because he was a famous person and there was already a lot of information about him in the media (his admission date, his diagnosis, his discharge date) – the voter registration list was arguably not necessary

• The success rate for such an attack would be lower for general members of the public because the voter registration list is incomplete

Page 14: Sharing Health Research Data

Netflix

• Netflix publicly released movie ratings data in the context of a competition to develop a recommendation algorithm

• Researchers re-identified a couple of records by matching with a publicly available and identifiable movie ratings database (IMDB)

• Resulted in the cancellation of a second competition and in litigation against Netflix for exposing personal information

Page 15: Sharing Health Research Data

Netflix?

• The re-identifications were not actually verified by Netflix

• Authors of attack admit that the Netflix data was not de-identified (replaced usernames with pseudonyms)

• The false positive rate of the matching was not evaluated (how many people in the IMDB database were actually in the Netflix database?)

Page 16: Sharing Health Research Data

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028071

Page 17: Sharing Health Research Data

Attribute vs Identity Disclosure

• Attribute disclosure: discover something new about an individual in the database without knowing which record belongs to that individual

• Identity disclosure: determine which record in the database belongs to a particular individual (for example, determine that record number 7 belongs to Bob Smith – that is identity disclosure)

• HIPAA only cares about identity disclosure

Page 18: Sharing Health Research Data

Attribute vs Identity Disclosure

              HPV Vaccinated    NOT HPV Vaccinated
Religion A          5                  40
Religion B         40                   5

Statistically significant relationship (chi-square, p<0.05)

High risk of attribute disclosure

Page 19: Sharing Health Research Data


Page 20: Sharing Health Research Data

Attribute vs Identity Disclosure

After suppression

              HPV Vaccinated    NOT HPV Vaccinated
Religion A          5                   6
Religion B          6                   5

Not statistically significant relationship (chi-square)

Low risk of attribute disclosure
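The significance claims for the two HPV tables can be checked with a short calculation. The sketch below is an illustration and not part of the original slides: it computes the Pearson chi-square statistic for a 2x2 table without continuity correction (an assumption, since the slides do not say which variant was used) and the p-value for one degree of freedom. The function name `chi_square_2x2` is hypothetical.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

before = [[5, 40], [40, 5]]  # rows: Religion A, Religion B (original counts)
after = [[5, 6], [6, 5]]     # same table after suppression

stat_before, p_before = chi_square_2x2(before)
stat_after, p_after = chi_square_2x2(after)
print(f"before suppression: chi2={stat_before:.2f}, p={p_before:.3g}")
print(f"after suppression:  chi2={stat_after:.2f}, p={p_after:.3g}")
```

Under these assumptions the original counts give a p-value far below 0.05, while the suppressed counts do not, which matches the labels on the slides.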

Page 21: Sharing Health Research Data

Stigmatizing Analytics

Page 22: Sharing Health Research Data

Definition of De-identified Data

Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual

Page 23: Sharing Health Research Data

Direct Identifiers

• Fields that would uniquely identify individuals in a database

• Name, address, telephone number, fax number, MRN, health card number, health plan beneficiary number, license plate number, email address, photograph, biometrics, SSN, SIN, implanted device number

Page 24: Sharing Health Research Data

Dealing with Direct Identifiers

• Defensible approaches:

  – Remove those fields

  – Convert them to one-time or persistent pseudonyms

  – Randomize the values

• These approaches will ensure, if done properly, that the probability of recovering the original value is very small
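As a rough illustration of the approaches listed above, the sketch below removes some direct identifiers and replaces a medical record number with a pseudonym. It is not the method used in the slides: the field names, the choice of a keyed hash (HMAC-SHA256) for persistent pseudonyms, and the functions `persistent_pseudonym`, `one_time_pseudonym` and `deidentify_record` are all assumptions made for the example.

```python
import hashlib
import hmac
import secrets

# Fields dropped outright (an illustrative choice, not a complete list).
DIRECT_IDENTIFIERS = {"name", "telephone"}

def persistent_pseudonym(value, key):
    """Keyed hash: the same value and key always yield the same pseudonym."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def one_time_pseudonym():
    """Random token that has no link to the original value."""
    return secrets.token_hex(8)

def deidentify_record(record, key):
    """Remove direct identifiers and replace the MRN with a persistent pseudonym."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["mrn"] = persistent_pseudonym(record["mrn"], key)
    return cleaned

key = secrets.token_bytes(32)  # secret key, assumed to stay with the data custodian
record = {"name": "Bob Smith", "telephone": "555-0100",
          "mrn": "123456", "lab_test": "Albumin, Serum"}
out = deidentify_record(record, key)
print(out)
```

A persistent pseudonym lets records for the same patient be linked across extracts, while a one-time pseudonym does not. In this sketch, recovering the original value from a persistent pseudonym would require the secret key.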

Page 25: Sharing Health Research Data

Quasi-Identifiers

• Sex, date of birth or age, geographic locations (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, aboriginal identity, total years of schooling, marital status, criminal history, total income, visible minority status, activity difficulties/reductions, profession, event dates (such as admission, discharge, procedure, death, specimen collection, visit/encounter), codes (such as diagnosis codes, procedure codes, and adverse event codes), country of birth, birth weight, and birth plurality

Page 26: Sharing Health Research Data

Page 27: Sharing Health Research Data

Page 28: Sharing Health Research Data

Page 29: Sharing Health Research Data

Page 30: Sharing Health Research Data

Page 31: Sharing Health Research Data

Page 32: Sharing Health Research Data

Page 33: Sharing Health Research Data

Re-identification Risk Measurement

• Risk measurement will depend on:

  – Granularity of quasi-identifiers

  – Region of the country we are talking about

  – Risk metric used (e.g., uniqueness or groups of 5)

  – Threshold for what is acceptable risk
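To make the dependence on granularity and on the choice of metric concrete, here is a minimal sketch on a toy data set. It assumes that risk is assessed through equivalence classes, meaning groups of records that share the same quasi-identifier values. The records and the function names are invented for illustration.

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def fraction_in_small_classes(records, quasi_identifiers, k):
    """Fraction of records whose equivalence class has fewer than k members."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    at_risk = sum(size for size in sizes.values() if size < k)
    return at_risk / len(records)

# Toy records, not real data.
records = [
    {"sex": "M", "age": 55, "zip3": "112"},
    {"sex": "M", "age": 55, "zip3": "112"},
    {"sex": "F", "age": 53, "zip3": "114"},
    {"sex": "M", "age": 24, "zip3": "134"},
]

# Uniqueness (classes of size 1) with fine and with coarse quasi-identifiers.
unique_fine = fraction_in_small_classes(records, ["sex", "age", "zip3"], k=2)
unique_coarse = fraction_in_small_classes(records, ["sex"], k=2)
# "Groups of 5" metric on the coarse quasi-identifiers.
below_five = fraction_in_small_classes(records, ["sex"], k=5)
print(unique_fine, unique_coarse, below_five)
```

The same four records look quite different depending on which quasi-identifiers are included and whether the metric is uniqueness or a minimum group size of 5.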

Page 34: Sharing Health Research Data

De-identification Standards

• The HIPAA Privacy Rule specifies two de-identification standards (45 CFR 164.514):

  – Safe Harbor

  – Statistical method (also known as the expert statistician method)

Page 35: Sharing Health Research Data

HIPAA Safe Harbor

Safe Harbor Direct Identifiers and Quasi-identifiers

1. Names

2. ZIP Codes (except first three)

3. All elements of dates (except year)

4. Telephone numbers

5. Fax numbers

6. Electronic mail addresses

7. Social security numbers

8. Medical record numbers

9. Health plan beneficiary numbers

10. Account numbers

11. Certificate/license numbers

12. Vehicle identifiers and serial numbers, including license plate numbers

13. Device identifiers and serial numbers

14. Web Universal Resource Locators (URLs)

15. Internet Protocol (IP) address numbers

16. Biometric identifiers, including finger and voice prints

17. Full face photographic images and any comparable images

18. Any other unique identifying number, characteristic, or code
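Items 2 and 3 in the list are generalizations rather than removals. The sketch below shows only that coarsening step on a single record, with hypothetical field names (`zip`, `admission`). It is a simplified illustration and not a compliance tool: the regulation attaches further conditions (for example for sparsely populated ZIP areas and for ages over 89) that are ignored here.

```python
from datetime import date

def generalize_zip(zip_code):
    """Keep only the first three digits of a ZIP code."""
    return zip_code[:3]

def generalize_date(d):
    """Keep only the year of a date."""
    return d.year

def apply_generalization(record):
    """Return a copy of the record with the ZIP and date fields coarsened."""
    out = dict(record)
    out["zip"] = generalize_zip(record["zip"])
    out["admission"] = generalize_date(record["admission"])
    return out

# Toy record, not real data.
record = {"zip": "11201", "admission": date(2007, 3, 14), "lab_test": "Albumin, Serum"}
generalized = apply_generalization(record)
print(generalized)
```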

Page 36: Sharing Health Research Data


Page 37: Sharing Health Research Data

Two Problems with Safe Harbor

• May be removing too much information on the ZIP Code and date fields – these fields are useful for many analytical purposes

• Does not provide adequate protection – it is easy to have a Safe Harbor compliant data set with a high risk of re-identification

Page 38: Sharing Health Research Data

High Risk Safe Harbor Data - I

• If the adversary knows that Bob, a 55-year-old male, is in the database, the first record below must be his, because it is the only one that matches his age and gender

Gender   Age   ZIP   Lab Test
M        55    112   Albumin, Serum
F        53    114   Alkaline Phosphatase
M        24    134   Creatine Kinase
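The matching an adversary would perform on this table can be sketched in a few lines. The function name `matching_records` and the dictionary layout are assumptions for illustration; the three records are the ones shown on the slide.

```python
def matching_records(records, known):
    """Return the records consistent with what the adversary already knows."""
    return [r for r in records if all(r[k] == v for k, v in known.items())]

records = [
    {"gender": "M", "age": 55, "zip3": "112", "lab_test": "Albumin, Serum"},
    {"gender": "F", "age": 53, "zip3": "114", "lab_test": "Alkaline Phosphatase"},
    {"gender": "M", "age": 24, "zip3": "134", "lab_test": "Creatine Kinase"},
]

# Background knowledge about Bob: a 55-year-old male known to be in the database.
matches = matching_records(records, {"gender": "M", "age": 55})
print(len(matches), matches[0]["lab_test"])
```

A single match means the record can be attributed to Bob even though the table satisfies the Safe Harbor list.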

Page 39: Sharing Health Research Data

High Risk Safe Harbor Data - II

• 2.24m visits, 1.6m patients, NY discharge data for 2007

• Compliant with Safe Harbor

Fields                      % of patients unique
age, gender, ZIP3           2.54%
age, gender, ZIP3, LOS      21.49%

Page 40: Sharing Health Research Data

Statistical Method Conditions

• A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:

  I. Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and

  II. Documents the methods and results of the analysis that justify such determination

Page 41: Sharing Health Research Data

Re-identification Risk Spectrum

Page 42: Sharing Health Research Data

Overall Risk

Page 43: Sharing Health Research Data

Overall Risk

Page 44: Sharing Health Research Data

Overall Risk

Page 45: Sharing Health Research Data

Overall Risk

Page 46: Sharing Health Research Data

Overall Risk

Page 47: Sharing Health Research Data

Overall Risk

Page 48: Sharing Health Research Data

Managing Re-identification Risk

Page 49: Sharing Health Research Data

Different Types of Data Releases

• The same data set can be disclosed with different thresholds:

  – Public data set

  – Release with conditions for known data recipients, including the requirement to sign a data sharing agreement, a prohibition on re-identification, and a requirement to pass these conditions to all sub-contractors

  – The more conditions, the higher the quality of the data set

Page 50: Sharing Health Research Data

Example – CA Hospital Discharges

• Context: data release to a researcher who will sign a data use agreement, good practices for managing sensitive health information

• There were ~2.1m patients who had ~3m visits

• Risk threshold = 0.2; use average risk across all patients

• Variables:

  – Year of birth

  – Gender

  – Year of admission

  – Days since last visit

  – Length of stay
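One way to read "average risk across all patients" is sketched below. It assumes each record's risk is 1 divided by the size of its equivalence class on the listed variables, and that there is one row per patient. Both are assumptions made for this example, not details given in the slides. The toy records, the length-of-stay banding and the function names are likewise invented; only the variable list and the 0.2 threshold come from the slide.

```python
from collections import Counter

QUASI_IDENTIFIERS = ["year_of_birth", "gender", "year_of_admission",
                     "days_since_last_visit", "length_of_stay"]
THRESHOLD = 0.2  # risk threshold quoted on the slide

def average_risk(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Mean over records of 1 / (size of the record's equivalence class)."""
    sizes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    # Each class of size s contributes s * (1 / s) = 1 to the sum of risks,
    # so the mean equals the number of classes divided by the number of records.
    return len(sizes) / len(records)

def band_length_of_stay(record):
    """Generalize length of stay into two bands (an illustrative hierarchy)."""
    out = dict(record)
    out["length_of_stay"] = "1-3" if record["length_of_stay"] <= 3 else "4+"
    return out

# Toy data: five identical records plus one that differs only in length of stay.
base = {"year_of_birth": 1950, "gender": "F",
        "year_of_admission": 2007, "days_since_last_visit": 30}
records = [dict(base, length_of_stay=2) for _ in range(5)]
records.append(dict(base, length_of_stay=3))

risk_before = average_risk(records)
risk_after = average_risk([band_length_of_stay(r) for r in records])
print(f"before: {risk_before:.3f}, after banding: {risk_after:.3f}, threshold: {THRESHOLD}")
```

In this toy case the raw data exceed the threshold and the banded data fall below it, which is the kind of trade-off a generalization step is meant to explore.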

Page 51: Sharing Health Research Data

Risk Level

Page 52: Sharing Health Research Data

Hierarchy

Page 53: Sharing Health Research Data

De-identified Data

Page 54: Sharing Health Research Data

Key Practical Considerations

• Data warehouses: de-identification of data extracts instead of whole data warehouses results in higher quality de-identified data

• Beware of correlated data: data in multiple medical domains are correlated, so one has to be cognizant of inference attacks on data

• Automation: automation can detect outliers and perform selective suppression, which results in higher quality de-identified data

• Transparency: important to ensure that methods have received peer and regulator scrutiny
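The automation point can be illustrated with one simple form of selective suppression: dropping only the records whose quasi-identifier combination is rare, and leaving the rest of the data untouched. This is a sketch under that assumption, not the tool or algorithm used by the author. The function name, the choice of k and the toy records are hypothetical.

```python
from collections import Counter

def suppress_small_classes(records, quasi_identifiers, k):
    """Drop records whose quasi-identifier combination occurs fewer than k times."""
    def class_key(record):
        return tuple(record[q] for q in quasi_identifiers)

    sizes = Counter(class_key(r) for r in records)
    kept = [r for r in records if sizes[class_key(r)] >= k]
    return kept, len(records) - len(kept)

# Toy data: three similar records and one outlier.
records = [
    {"gender": "F", "year_of_birth": 1950, "lab_test": "Albumin, Serum"},
    {"gender": "F", "year_of_birth": 1950, "lab_test": "Creatine Kinase"},
    {"gender": "F", "year_of_birth": 1950, "lab_test": "Alkaline Phosphatase"},
    {"gender": "M", "year_of_birth": 1910, "lab_test": "Albumin, Serum"},
]

kept, suppressed = suppress_small_classes(records, ["gender", "year_of_birth"], k=3)
print(f"kept {len(kept)} records, suppressed {suppressed}")
```

Only the outlier is removed here, so the remaining records keep their original detail instead of the whole data set being generalized.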

Page 55: Sharing Health Research Data

Contact

[email protected]

@kelemam

www.ehealthinformation.ca