ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Page 1: ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Technische Universität München

Fabian Prasser, Raffael Bild, Klaus A. Kuhn

Chair for Medical Informatics
Institute for Medical Statistics and Epidemiology

Technical University of Munich (TUM)

A Generic Method for Assessing the Quality of De-Identified Health Data

Page 2: ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Fabian Prasser, Raffael Bild, Klaus A. Kuhn: A Generic Method for Assessing the Quality of De-Identified Health Data

Health – Exploring Complexity HEC 2016 / Medical Informatics Europe MIE 2016, 19.08.2016

Background

Motivation: legal requirements
● Secondary use of health care data for research
● Data sharing in cooperative research

Goal: privacy protection
● Ensure that recipients cannot learn the identity of data subjects
● Re-identification can have severe legal consequences

Basis: make sure that the recipient is as trustworthy as possible
● Sign data use agreements, approval by data access committees
● Implement multiple layers of access to create controlled environments

Residual risks: data de-identification (also called: data anonymization)
● Step 1: Remove identifying data (e.g. names, insurance numbers)
● Step 2: Modify data to reduce the uniqueness of potentially identifying attribute values (e.g. date-of-birth, sex, zip code)

Page 3: ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Example

Reduction of the uniqueness of potentially identifying values

[Figure: example data transformed by generalization, suppression, and micro-aggregation]
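The three transformations named on this slide can be sketched as follows. This is a minimal illustration, not taken from the slides: the function names, the 10-year interval width, and the use of the mean for micro-aggregation are assumptions chosen for the example.

```python
from statistics import mean


def generalize_age(age, width=10):
    """Generalization: replace an exact age with an interval (width is an assumption)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppress(value):
    """Suppression: replace a value with a placeholder."""
    return "*"


def micro_aggregate(values):
    """Micro-aggregation: replace every value in a group with the group mean."""
    m = mean(values)
    return [m for _ in values]


ages = [34, 36, 38]
generalized = [generalize_age(a) for a in ages]
suppressed = [suppress(a) for a in ages]
aggregated = micro_aggregate(ages)
```

In all three cases the three records end up with identical age values, which is what reduces their uniqueness.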

Page 4: ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Challenge

Trade-off: privacy risks vs. quality of data

Models are needed for measuring both aspects
● Privacy: k-anonymity, k-map, strict average risk, population uniqueness
● Quality: loss of information (e.g. granularity), changes in statistical properties (e.g. tendency, dispersion, shape of distributions), data utility (e.g. classification)

[Figure: data quality plotted against privacy risk. Labels: "Original data: highest risk", "No data: no risk", "Potential solutions"]
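As an illustration of one of the privacy models listed above, the following sketch checks k-anonymity: every combination of potentially identifying attribute values must occur in at least k records. The record layout, attribute names, and helper names are hypothetical and only serve the example.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    keys = [tuple(r[a] for a in quasi_identifiers) for r in records]
    return min(Counter(keys).values())


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group_size(records, quasi_identifiers) >= k


# Hypothetical, already generalized records.
records = [
    {"age": "30-39", "sex": "F", "zip": "816**"},
    {"age": "30-39", "sex": "F", "zip": "816**"},
    {"age": "40-49", "sex": "M", "zip": "817**"},
    {"age": "40-49", "sex": "M", "zip": "817**"},
    {"age": "40-49", "sex": "M", "zip": "817**"},
]
qi = ["age", "sex", "zip"]
```

With these records the smallest group has two members, so the table is 2-anonymous but not 3-anonymous. The sketch assumes a non-empty list of records.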

Page 5: ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Data transformation: attribute generalization

Recommended for health data: generalization hierarchies

[Figure: examples of input data, global recoding, and local recoding]
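A generalization hierarchy can be represented as a mapping from each original value to its representation at every level. The sketch below assumes a three-level hierarchy for age (exact value, 10-year interval, fully suppressed); the levels and the function name are illustrative choices, not taken from the slides.

```python
def build_age_hierarchy(ages):
    """Map each age to [level 0: exact value, level 1: decade interval, level 2: '*']."""
    hierarchy = {}
    for age in set(ages):
        low = (age // 10) * 10
        hierarchy[age] = [str(age), f"{low}-{low + 9}", "*"]
    return hierarchy


hierarchy = build_age_hierarchy([34, 36, 45])
```

Level 0 keeps the input unchanged and the highest level removes all information about the attribute.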

Page 6: ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Data transformation: global recoding

Identical input values are mapped to identical generalized values

[Figure: examples of input data, global recoding, and local recoding]
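A minimal sketch of global recoding, assuming a hypothetical hierarchy table for age: a single level is chosen for the whole column, so identical input values always receive identical output values.

```python
# Hypothetical hierarchy: value -> [level 0, level 1, level 2].
AGE_HIERARCHY = {
    34: ["34", "30-39", "*"],
    36: ["36", "30-39", "*"],
    45: ["45", "40-49", "*"],
}


def recode_globally(values, hierarchy, level):
    """Generalize every value in the column to the same hierarchy level."""
    return [hierarchy[v][level] for v in values]


ages = [34, 36, 45, 34]
level_1 = recode_globally(ages, AGE_HIERARCHY, 1)
```

Both occurrences of 34 are mapped to the same interval, as the definition requires.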

Page 7: ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Data transformation: local recoding

Identical values may be generalized to different levels

More flexible: can preserve more information content

[Figure: examples of input data, global recoding, and local recoding]
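In contrast, local recoding picks a level per record. The sketch below uses the same kind of hypothetical hierarchy table and takes one level per value, so two records with the same input can end up with different outputs.

```python
# Hypothetical hierarchy: value -> [level 0, level 1, level 2].
AGE_HIERARCHY = {
    34: ["34", "30-39", "*"],
    45: ["45", "40-49", "*"],
}


def recode_locally(values, hierarchy, levels):
    """Generalize each value to its own hierarchy level."""
    if len(values) != len(levels):
        raise ValueError("values and levels must have the same length")
    return [hierarchy[v][lvl] for v, lvl in zip(values, levels)]


ages = [34, 34, 45, 45]
levels = [0, 1, 1, 2]
output = recode_locally(ages, AGE_HIERARCHY, levels)
```

Here the first record keeps its exact age while the second, with the same input, is generalized to an interval.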

Page 8: ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Non-Uniform Entropy

Well-known model for measuring information loss
● Developed by the statistical disclosure control community
● A. De Waal and L. Willenborg, Information loss through global recoding and local suppression, Netherlands Official Statistics 14 (1999), 17–20.

Often used for de-identifying health data
● Recommended in several guidelines, used in papers

Based on the concept of mutual information
● Quantifies the amount of information which can be obtained about one variable by observing the other

Application to data anonymization
● Measure loss of information by comparing input data with transformed output data

Can only be used with global recoding (details: see paper)
● We have developed a generic variant which supports local recoding (generalization, record suppression, cell suppression)
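The slides do not state the formula. A commonly cited per-column formulation, assumed here, sums over all cells the negative logarithm of the ratio between the frequency of the input value and the frequency of the output value. The function name and the use of base-2 logarithms are choices made for this sketch.

```python
from collections import Counter
from math import log2


def non_uniform_entropy(inputs, outputs):
    """Assumed formulation: -sum over cells of log2(freq(input) / freq(output))."""
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have the same length")
    in_counts = Counter(inputs)
    out_counts = Counter(outputs)
    return -sum(log2(in_counts[x] / out_counts[y]) for x, y in zip(inputs, outputs))


ages_in = ["34", "36", "45", "47"]
ages_out = ["30-39", "30-39", "40-49", "40-49"]
loss = non_uniform_entropy(ages_in, ages_out)
```

Under global recoding each output group contains every record of the input values it covers, so each ratio is at most 1 and the loss is never negative. In the example every cell contributes one bit.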

Page 9: ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Generic Non-Uniform Entropy

Basic idea: model local recoding as iterative global recoding

This can be done for every local recoding scheme

[Figure: the Age input column is transformed into the Age output column through global recoding to level 0, level 1, and level 2]

Each step is a global recoding, so we can use Non-Uniform Entropy for calculating Δ_{0,1} and Δ_{1,2}!

Page 10: ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Generic Non-Uniform Entropy

Basic idea: model local recoding as iterative global recoding

[Figure: the Age input column is transformed into the Age output column through global recoding to level 0, level 1, and level 2; each step is measured with Non-Uniform Entropy]

Result: Δ' = Δ_{0,1} + Δ_{1,2}
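The slides give only the idea, not the construction, so the following is one possible reading rather than the authors' definition (the paper has the exact method). It assumes that step k to k+1 is applied to exactly those records whose final level lies above k, that this subset is treated as a table of its own, and that the per-cell formulation -log2(freq(input) / freq(output)) is used for each step. All names are hypothetical.

```python
from collections import Counter
from math import log2


def non_uniform_entropy(inputs, outputs):
    """Assumed formulation: -sum over cells of log2(freq(input) / freq(output))."""
    in_counts = Counter(inputs)
    out_counts = Counter(outputs)
    return -sum(log2(in_counts[x] / out_counts[y]) for x, y in zip(inputs, outputs))


def generic_non_uniform_entropy(values, levels, hierarchy):
    """Sum the losses of successive global recoding steps (assumed decomposition)."""
    total = 0.0
    for k in range(max(levels, default=0)):
        # Records whose final level lies above k take part in step k -> k+1.
        subset = [v for v, lvl in zip(values, levels) if lvl > k]
        before = [hierarchy[v][k] for v in subset]
        after = [hierarchy[v][k + 1] for v in subset]
        total += non_uniform_entropy(before, after)
    return total


# Hypothetical hierarchy: value -> [level 0, level 1, level 2].
AGE_HIERARCHY = {
    34: ["34", "30-39", "*"],
    36: ["36", "30-39", "*"],
    45: ["45", "40-49", "*"],
    47: ["47", "40-49", "*"],
}
ages = [34, 36, 45, 47]

delta_global = generic_non_uniform_entropy(ages, [2, 2, 2, 2], AGE_HIERARCHY)
delta_direct = non_uniform_entropy([str(a) for a in ages], ["*"] * 4)
delta_local = generic_non_uniform_entropy(ages, [0, 1, 2, 2], AGE_HIERARCHY)
```

Under this reading the step losses telescope for pure global recoding, so the sum equals the plain Non-Uniform Entropy between input and output, while a locally recoded column still receives a non-negative value.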

Page 12: ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Experiments

Two datasets
● Extract of the 1994 US census database: 30,162 records
● Health interview series: US survey with 1,193,504 participants

Transformation scheme
● Initially: global recoding with generalization
    ● Schemes: original, low, medium, high
● Followed by: local recoding with record suppression
    ● Iterative removal of records (10%, 20%, …, 100%)

Measured information loss with two models
● Non-Uniform Entropy
● Our generic variant

Expected outcome
● Initially: loss of information via generalization
● Followed by: linear increase of information loss (number of removed records)

Page 13: ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Results

Both models measured the same initial loss of information

Only our model captured the linear increase
→ Non-Uniform Entropy measured information gain followed by decrease
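A toy illustration of how plain Non-Uniform Entropy can report a gain under record suppression. It assumes the per-cell formulation -log2(freq(input) / freq(output)) and uses made-up data, not the datasets from the experiments. When only a few records are replaced by "*", the output group "*" is smaller than the input groups, so the ratio exceeds 1 and the cell contributes a negative loss.

```python
from collections import Counter
from math import log2


def non_uniform_entropy(inputs, outputs):
    """Assumed formulation: -sum over cells of log2(freq(input) / freq(output))."""
    in_counts = Counter(inputs)
    out_counts = Counter(outputs)
    return -sum(log2(in_counts[x] / out_counts[y]) for x, y in zip(inputs, outputs))


def suppress_records(values, count):
    """Record suppression: replace the first `count` records with '*'."""
    return ["*"] * count + list(values[count:])


# Made-up column with one frequent and one rare value.
column = ["a"] * 8 + ["b"] * 2
loss_after_one = non_uniform_entropy(column, suppress_records(column, 1))
loss_after_all = non_uniform_entropy(column, suppress_records(column, len(column)))
```

In this example the measure is negative after suppressing a single record and positive once all records are suppressed, so it does not grow with the number of removed records.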

Page 14: ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

ARX – An anonymization tool for biomedical data

The method described here has been implemented in ARX
● Oriented towards guidelines for health data de-identification
● Supports a wide variety of approaches to data de-identification
● Requires development of generic methods

Highly scalable
● Millions of records with up to 50 potentially identifying attributes

Mentioned in several data protection guidelines
● European Medicines Agency (EMA): External Guidance on the Implementation of the European Medicines Agency Policy on the Publication of Clinical Data for Medicinal Products for Human Use (2016)
● EU Agency for Network and Information Security (ENISA): Privacy and Data Protection by Design (2014)

ARX is open source software
● Website: http://arx.deidentifier.org
● Email: [email protected]

Page 15: ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Thank you for your attention!
