Technische Universität München
Fabian Prasser, Raffael Bild, Klaus A. Kuhn
Chair for Medical Informatics, Institute for Medical Statistics and Epidemiology
Technical University of Munich (TUM)
A Generic Method for Assessing the Quality of De-Identified Health Data
Health – Exploring Complexity (HEC 2016) / Medical Informatics Europe (MIE 2016), 19.08.2016
Background
Motivation: legal requirements
● Secondary use of health care data for research
● Data sharing in cooperative research
Goal: privacy protection
● Ensure that recipients cannot learn the identity of data subjects
● Re-identification can have severe legal consequences
Basis: make sure that the recipient is as trustworthy as possible
● Sign data use agreements, approval by data access committees
● Implement multiple layers of access to create controlled environments
Residual risks: data de-identification (also called: data anonymization)
● Step 1: Remove identifying data (e.g. names, insurance numbers)
● Step 2: Modify data to reduce the uniqueness of potentially identifying attribute values (e.g. date-of-birth, sex, zip code)
Example
Reduction of the uniqueness of potentially identifying values
[Figure: generalization, suppression, micro-aggregation]
Challenge
Trade-off: privacy risks vs. quality of data
Models are needed for measuring both aspects
● Privacy: k-anonymity, k-map, strict average risk, population uniqueness
● Quality: loss of information (e.g. granularity), changes in statistical properties (e.g. tendency, dispersion, shape of distributions), data utility (e.g. classification)
[Figure: data quality plotted against privacy risk. Labeled points: original data (highest risk), no data (no risk), and potential solutions.]
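To make the privacy side of the trade-off concrete, here is a minimal sketch of a k-anonymity check. It is an illustration only, not code from ARX: the function name `is_k_anonymous`, the record layout (a list of dictionaries), and the sample values are all assumptions made for this example.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k records (hypothetical helper, not an ARX API)."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())

# Made-up sample data: two records share (30-39, F), one record is unique.
records = [
    {"age": "30-39", "sex": "F"},
    {"age": "30-39", "sex": "F"},
    {"age": "40-49", "sex": "M"},
]

two_anonymous = is_k_anonymous(records, ["age", "sex"], 2)
one_anonymous = is_k_anonymous(records, ["age", "sex"], 1)
```

The unique record (40-49, M) is what prevents the sample from being 2-anonymous; further generalization or suppression of that record would be needed.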
Data transformation: attribute generalization
Recommended for health data: generalization hierarchies
[Figure: examples of input data, global recoding, and local recoding]
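A generalization hierarchy can be sketched as a function that maps a value to a coarser representation per level. The three levels below (exact age, 10-year interval, fully suppressed) and the name `generalize_age` are assumptions chosen for illustration; the slides do not specify the hierarchy used in the figure.

```python
# Hypothetical three-level hierarchy for an "age" attribute (illustration only).
MAX_LEVEL = 2

def generalize_age(age: int, level: int) -> str:
    """Return the age generalized to the given hierarchy level."""
    if level == 0:
        return str(age)              # level 0: original value
    if level == 1:
        low = (age // 10) * 10       # level 1: 10-year interval
        return f"{low}-{low + 9}"
    if level == 2:
        return "*"                   # level 2: value fully suppressed
    raise ValueError(f"unknown level: {level}")

# The same value at each level of the hierarchy.
chain = [generalize_age(34, level) for level in range(MAX_LEVEL + 1)]
```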
Data transformation: global recoding
Identical input values are mapped to identical generalized values
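Global recoding can be sketched as applying one hierarchy level to an entire column. The hierarchy and the names `generalize_age` and `global_recode` are hypothetical and only serve this example.

```python
def generalize_age(age: int, level: int) -> str:
    """Hypothetical hierarchy: 0 = exact age, 1 = 10-year interval, 2 = '*'."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    if level == 2:
        return "*"
    raise ValueError(f"unknown level: {level}")

def global_recode(ages, level):
    """Generalize every value of the column to the same level."""
    return [generalize_age(age, level) for age in ages]

ages = [34, 34, 45, 66, 70]
recoded = global_recode(ages, 1)
```

Because the level is fixed for the whole column, the two records with age 34 necessarily end up with the same generalized value.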
Data transformation: local recoding
Identical values may be generalized to different levels
More flexible: can preserve more information content
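Local recoding, by contrast, can be sketched as choosing a level per record. Again, the hierarchy, the names `generalize_age` and `local_recode`, and the chosen levels are assumptions for illustration.

```python
def generalize_age(age: int, level: int) -> str:
    """Hypothetical hierarchy: 0 = exact age, 1 = 10-year interval, 2 = '*'."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    if level == 2:
        return "*"
    raise ValueError(f"unknown level: {level}")

def local_recode(ages, levels):
    """Generalize each value to its own level."""
    if len(ages) != len(levels):
        raise ValueError("ages and levels must have the same length")
    return [generalize_age(age, level) for age, level in zip(ages, levels)]

ages = [34, 34, 45, 66, 70]
levels = [0, 1, 1, 2, 2]   # per-record levels, chosen arbitrarily here
recoded = local_recode(ages, levels)
```

Here the two records with age 34 end up with different output values, which is exactly what a model restricted to global recoding cannot handle.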
Non-Uniform Entropy
Well-known model for measuring information loss
● Developed by the statistical disclosure control community
● A. De Waal and L. Willenborg, Information loss through global recoding and local suppression, Netherlands Official Statistics 14 (1999), 17–20.
Often used for de-identifying health data
● Recommended in several guidelines, used in papers
Based on the concept of mutual information
● Quantifies the amount of information which can be obtained about one variable by observing the other
Application to data anonymization
● Measure loss of information by comparing input data with transformed output data
Can only be used with global recoding (details: see paper)
● We have developed a generic variant which supports local recoding (generalization, record suppression, cell suppression)
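The comparison of input and output columns can be sketched as follows. The slides do not spell out the formula, so this assumes a commonly used per-cell formulation: each cell contributes -log2 of the ratio between the frequency of its input value in the input column and the frequency of its output value in the output column. The function name `non_uniform_entropy` and the sample data are hypothetical; see the paper and the cited reference for the precise definition.

```python
import math
from collections import Counter

def non_uniform_entropy(inputs, outputs):
    """Information loss in bits for one attribute.

    ASSUMPTION: per-cell loss is -log2(f_in(input) / f_out(output)),
    with frequencies counted over the whole column. This is only
    meaningful when outputs result from a global recoding of inputs.
    """
    if len(inputs) != len(outputs):
        raise ValueError("columns must have the same length")
    f_in = Counter(inputs)
    f_out = Counter(outputs)
    return -sum(math.log2(f_in[i] / f_out[o]) for i, o in zip(inputs, outputs))

# Made-up column, globally recoded to 10-year intervals.
inputs = ["34", "34", "35", "45"]
outputs = ["30-39", "30-39", "30-39", "40-49"]
loss = non_uniform_entropy(inputs, outputs)
```

Under this assumption an unchanged column yields a loss of zero, and the loss grows as more distinct input values are merged into the same output value.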
Generic Non-Uniform Entropy
Basic idea: model local recoding as iterative global recoding
● This can be done for every local recoding scheme
[Figure: Age input and Age output, linked by global recoding to level 0, level 1, and level 2]
Each step is a global recoding, so Non-Uniform Entropy can be used to calculate Δ0,1 and Δ1,2
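The decomposition into steps can be sketched as below. This is one plausible reading of the slides, not the authors' implementation: it assumes that the step from level i to level i+1 is evaluated only on the records whose output level exceeds i, so that each step is a true global recoding of that subset. The hierarchy, the per-cell entropy formula, and all names (`generalize_age`, `non_uniform_entropy`, `generic_non_uniform_entropy`) are assumptions; the paper defines the exact construction.

```python
import math
from collections import Counter

def generalize_age(age, level):
    """Hypothetical hierarchy: 0 = exact age, 1 = 10-year interval, 2 = '*'."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"

def non_uniform_entropy(inputs, outputs):
    """ASSUMED per-cell formulation: -log2(f_in(input) / f_out(output))."""
    f_in, f_out = Counter(inputs), Counter(outputs)
    return -sum(math.log2(f_in[i] / f_out[o]) for i, o in zip(inputs, outputs))

def generic_non_uniform_entropy(ages, levels, max_level=2):
    """Sum the losses of successive global recoding steps.

    ASSUMPTION: step i -> i+1 only covers records generalized beyond level i.
    """
    total = 0.0
    for step in range(max_level):
        subset = [age for age, level in zip(ages, levels) if level > step]
        before = [generalize_age(age, step) for age in subset]
        after = [generalize_age(age, step + 1) for age in subset]
        total += non_uniform_entropy(before, after)   # this is Δ(step, step+1)
    return total

ages = [34, 34, 35, 45, 47]
levels = [0, 1, 2, 2, 2]   # local recoding: a different level per record
loss = generic_non_uniform_entropy(ages, levels)
```

Under these assumptions the sketch reduces to plain Non-Uniform Entropy when all records share the same level, because the per-step log ratios telescope.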
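Result: Δ' = Δ0,1 + Δ1,2 (the total loss is the sum of the Non-Uniform Entropy losses of the individual global recoding steps)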
Experiments
Two datasets
● Extract of the 1994 US census database: 30,162 records
● Health interview series: US survey with 1,193,504 participants
Transformation scheme
● Initially: global recoding with generalization
  ● Schemes: original, low, medium, high
● Followed by: local recoding with record suppression
  ● Iterative removal of records (10%, 20%, …, 100%)
Measured information loss with two models
● Non-Uniform Entropy
● Our generic variant
Expected outcome
● Initially: loss of information via generalization
● Followed by: linear increase of information loss with the number of removed records
Results
Both models measured the same initial loss of information
Only our model captured the linear increase
→ Non-Uniform Entropy measured information gain followed by decrease
ARX – An anonymization tool for biomedical data
The method described here has been implemented in ARX
● Oriented towards guidelines for health data de-identification
● Supports a wide variety of approaches to data de-identification
● Requires development of generic methods
Highly scalable
● Millions of records with up to 50 potentially identifying attributes
Mentioned in several data protection guidelines
● European Medicines Agency (EMA): External Guidance on the Implementation of the European Medicines Agency Policy on the Publication of Clinical Data for Medicinal Products for Human Use (2016)
● EU Agency for Network and Information Security (ENISA): Privacy and Data Protection by Design (2014)
ARX is open source software
● Website: http://arx.deidentifier.org
● Email: [email protected]
Thank you for your attention!