ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Technische Universität München

Fabian Prasser, Raffael Bild, Klaus A. Kuhn

Chair for Medical Informatics

Institute for Medical Statistics and Epidemiology

Technical University of Munich (TUM)

A Generic Method for Assessing the Quality of De-Identified Health Data


Fabian Prasser, Raffael Bild, Klaus A. Kuhn: A Generic Method for Assessing the Quality of De-Identified Health Data

Health – Exploring Complexity HEC 2016 / Medical Informatics Europe MIE 2016, 19.08.2016

Background

Motivation: legal requirements
● Secondary use of health care data for research
● Data sharing in cooperative research

Goal: privacy protection
● Ensure that recipients cannot learn the identity of data subjects
● Re-identification can have severe legal consequences

Basis: make sure that the recipient is as trustworthy as possible
● Sign data use agreements, approval by data access committees
● Implement multiple layers of access to create controlled environments

Residual risks: data de-identification (also called: data anonymization)
● Step 1: Remove identifying data (e.g. names, insurance numbers)
● Step 2: Modify data to reduce the uniqueness of potentially identifying attribute values (e.g. date-of-birth, sex, zip code)


Example

Reduction of the uniqueness of potentially identifying values

[Figure: illustration of generalization, suppression, and micro-aggregation]
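The three techniques named on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the 10-year interval width, and the sample ages are assumptions and are not taken from the talk or from ARX.

```python
from statistics import mean

def generalize_age(age, width=10):
    """Generalization: replace an exact age with an interval (assumed width: 10 years)."""
    low = (age // width) * width
    return f"[{low}, {low + width})"

def suppress(value):
    """Suppression: replace a value with a placeholder."""
    return "*"

def micro_aggregate(values, group_size=2):
    """Micro-aggregation: sort the values, split them into groups, and replace
    each value with the mean of its group (results are returned in sorted order)."""
    ordered = sorted(values)
    result = []
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        result.extend([mean(group)] * len(group))
    return result

ages = [34, 36, 45, 47]  # hypothetical sample data
generalized = [generalize_age(a) for a in ages]
suppressed = [suppress(a) for a in ages]
aggregated = micro_aggregate(ages)
```

All three reduce how unique a value is; they differ in what is kept (an interval, nothing, or a group mean).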




Challenge

Trade-off: privacy risks vs. quality of data

Models are needed for measuring both aspects
● Privacy: k-anonymity, k-map, strict average risk, population uniqueness
● Quality: loss of information (e.g. granularity), changes in statistical properties (e.g. tendency, dispersion, shape of distributions), data utility (e.g. classification)
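As an illustration of the privacy side, k-anonymity can be checked by counting how many records share each combination of potentially identifying values. The sketch below is a minimal example with hypothetical records and attribute names; it is not ARX code.

```python
from collections import Counter

def smallest_class_size(records, quasi_identifiers):
    """Size of the smallest group of records that share the same
    combination of quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    return min(Counter(keys).values())

def is_k_anonymous(records, quasi_identifiers, k):
    """A dataset is k-anonymous if every such group has at least k records."""
    return smallest_class_size(records, quasi_identifiers) >= k

# Hypothetical, already generalized records
records = [
    {"age": "[30, 40)", "sex": "F", "diagnosis": "A"},
    {"age": "[30, 40)", "sex": "F", "diagnosis": "B"},
    {"age": "[40, 50)", "sex": "M", "diagnosis": "A"},
    {"age": "[40, 50)", "sex": "M", "diagnosis": "C"},
]
two_anonymous = is_k_anonymous(records, ["age", "sex"], 2)
three_anonymous = is_k_anonymous(records, ["age", "sex"], 3)
```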

[Figure: privacy risk vs. data quality. Labels: "Original data: highest risk", "No data: no risk", "Potential solutions"]




Data transformation: attribute generalization

Recommended for health data: generalization hierarchies

Examples

[Figure panels: input data, global recoding, local recoding]
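A generalization hierarchy assigns each original value a sequence of increasingly coarse representations. The hierarchy below for an "age" attribute is a hypothetical example; the levels and interval boundaries are assumptions chosen for illustration.

```python
def age_hierarchy(age):
    """Hypothetical hierarchy: level 0 is the original value,
    each higher level is coarser, the top level is full suppression."""
    decade = (age // 10) * 10
    half = "<50" if age < 50 else ">=50"
    return [str(age), f"[{decade}, {decade + 10})", half, "*"]

def generalize(age, level):
    """Look up the representation of a value at a given hierarchy level."""
    return age_hierarchy(age)[level]

levels_for_34 = [generalize(34, level) for level in range(4)]
```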




Data transformation: global recoding

Identical input values are mapped to identical generalized values

Examples
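Global recoding can be sketched as applying one hierarchy level to a whole column. The hierarchy and the sample ages are assumptions made for this example.

```python
def generalize_age(age, level):
    """Hypothetical hierarchy: 0 = exact age, 1 = decade, 2 = suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"[{low}, {low + 10})"
    return "*"

def global_recoding(column, level):
    """Apply the same generalization level to every cell of the column."""
    return [generalize_age(value, level) for value in column]

ages = [34, 34, 45, 66]
recoded = global_recoding(ages, 1)
```

Both occurrences of 34 end up with the same generalized value, which is the defining property of global recoding.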





Data transformation: local recoding

Identical values may be generalized to different levels

Examples

More flexible: can preserve more information content
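In contrast, local recoding can be sketched as choosing a generalization level per cell. Again, the hierarchy, the sample ages, and the chosen levels are assumptions for illustration only.

```python
def generalize_age(age, level):
    """Hypothetical hierarchy: 0 = exact age, 1 = decade, 2 = suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"[{low}, {low + 10})"
    return "*"

def local_recoding(column, levels):
    """Apply an individually chosen generalization level to each cell."""
    return [generalize_age(value, level) for value, level in zip(column, levels)]

ages = [34, 34, 45, 66]
levels = [0, 1, 1, 2]  # one level per cell
recoded = local_recoding(ages, levels)
```

Here the two identical input values 34 are published at different levels, so one of them keeps its full detail.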





Non-Uniform Entropy

Well-known model for measuring information loss
● Developed by the statistical disclosure control community
● A. De Waal and L. Willenborg, Information loss through global recoding and local suppression, Netherlands Official Statistics 14 (1999), 17–20.

Often used for de-identifying health data
● Recommended in several guidelines, used in papers

Based on the concept of mutual information
● Quantifies the amount of information which can be obtained about one variable by observing the other

Application to data anonymization
● Measure loss of information by comparing input data with transformed output data

Can only be used with global recoding (details: see paper)
● We have developed a generic variant which supports local recoding (generalization, record suppression, cell suppression)
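The slides do not spell out the formula. A commonly cited formulation of Non-Uniform Entropy sums, over all cells of a column, the negative logarithm of the probability of the original value given its generalized value, with probabilities estimated from value frequencies. The sketch below assumes this formulation; the function name and sample data are hypothetical.

```python
import math
from collections import Counter

def non_uniform_entropy(input_column, output_column):
    """Sum over all cells of -log2(freq(original value) / freq(generalized value)).
    Assumes the output was produced by global recoding of the input."""
    in_freq = Counter(input_column)
    out_freq = Counter(output_column)
    return sum(-math.log2(in_freq[x] / out_freq[y])
               for x, y in zip(input_column, output_column))

ages_in = [34, 36, 45, 45]
ages_out = ["[30, 40)", "[30, 40)", "[40, 50)", "[40, 50)"]
loss = non_uniform_entropy(ages_in, ages_out)

# Applying the same function to a locally recoded column breaks the assumption:
local_loss = non_uniform_entropy([34, 34], ["34", "[30, 40)"])
```

In this sketch the globally recoded column yields a loss of 2 bits (one bit each for 34 and 36, none for the two 45s). For the locally recoded column the frequency ratio exceeds 1 and the result is negative, which illustrates why the measure is tied to global recoding.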




Generic Non-Uniform Entropy

Basic idea: model local recoding as iterative global recoding

[Figure: Age (input) → Age (output); label: "Global recoding to level 0"]

Global recoding, so we can use Non-Uniform Entropy for calculating Δ0,1 and Δ1,2!

This can be done for every local recoding scheme






Result: Δ' = Δ0,1 + Δ1,2

[Figure: Age (input) → Age (output); label: Non-Uniform Entropy]
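The slides give only the idea and the sum Δ' = Δ0,1 + Δ1,2; the exact definition is in the paper. The sketch below shows one possible reading and should not be taken as the authors' definition: each step from level l to level l+1 is scored with frequencies from the whole input column recoded globally to both levels, and a cell contributes to a step only if its final level lies above l. The hierarchy, names, and data are assumptions.

```python
import math
from collections import Counter

def generalize_age(age, level):
    """Hypothetical hierarchy: 0 = exact age, 1 = decade, 2 = suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"[{low}, {low + 10})"
    return "*"

def step_loss(input_column, target_levels, level):
    """Assumed per-step loss for the step level -> level + 1, counted only
    for cells whose final level is above `level`."""
    lower = Counter(generalize_age(v, level) for v in input_column)
    upper = Counter(generalize_age(v, level + 1) for v in input_column)
    loss = 0.0
    for value, target in zip(input_column, target_levels):
        if target > level:
            p = lower[generalize_age(value, level)] / upper[generalize_age(value, level + 1)]
            loss += -math.log2(p)
    return loss

def generic_loss(input_column, target_levels, max_level=2):
    """Sum of the per-step losses over all hierarchy levels."""
    return sum(step_loss(input_column, target_levels, l) for l in range(max_level))

ages = [34, 36, 45, 45]
levels = [0, 1, 1, 2]  # final level of each cell after local recoding
delta_01 = step_loss(ages, levels, 0)
delta_12 = step_loss(ages, levels, 1)
total = generic_loss(ages, levels)
```

Under these assumptions every step contributes a non-negative amount, so the total cannot decrease when more cells are generalized or suppressed.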










Experiments

Two datasets
● Extract of the 1994 US census database: 30,162 records
● Health interview series: US survey with 1,193,504 participants

Transformation scheme
● Initially: global recoding with generalization
  ● Schemes: original, low, medium, high
● Followed by: local recoding with record suppression
  ● Iterative removal of records (10%, 20%, …, 100%)

Measured information loss with two models
● Non-Uniform Entropy
● Our generic variant

Expected outcome
● Initially: loss of information via generalization
● Followed by: linear increase of information loss (number of removed records)
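The record suppression step of this setup can be sketched as follows. The selection order, the fixed seed, and the toy column are assumptions; the slides do not state how records were chosen for removal.

```python
import random

def suppress_percent(column, percent, seed=0):
    """Replace the given percentage of records with "*".
    A fixed seed keeps the removal order stable, so the set of removed
    records at 20% contains the set removed at 10%, and so on."""
    rng = random.Random(seed)
    order = list(range(len(column)))
    rng.shuffle(order)
    removed = set(order[:len(column) * percent // 100])
    return ["*" if i in removed else v for i, v in enumerate(column)]

# Hypothetical column that has already been generalized globally
column = [f"[{d}, {d + 10})" for d in (30, 30, 40, 40, 50, 50, 60, 60, 70, 70)]
steps = [suppress_percent(column, pct) for pct in range(10, 101, 10)]
suppressed_counts = [step.count("*") for step in steps]
```

Each resulting column would then be scored with both information loss models and plotted against the percentage of removed records.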




Results

Both models measured the same initial loss of information

Only our model captured the linear increase
→ Non-Uniform Entropy measured information gain followed by decrease




The method described here has been implemented in ARX
● Oriented towards guidelines for health data de-identification
● Supports a wide variety of approaches to data de-identification
● Requires development of generic methods

Highly scalable
● Millions of records with up to 50 potentially identifying attributes

Mentioned in several data protection guidelines
● European Medicines Agency (EMA): External Guidance on the Implementation of the European Medicines Agency Policy on the Publication of Clinical Data for Medicinal Products for Human Use (2016)
● EU Agency for Network and Information Security (ENISA): Privacy and Data Protection by Design (2014)

ARX is open source software
● Website: http://arx.deidentifier.org
● Email: [email protected]

ARX – An anonymization tool for biomedical data

http://arx.deidentifier.org/


Thank you for your attention!



Software
