What is synthpop?
A tool for producing synthetic versions of microdata that contain confidential information, so that they can be safely released to users for exploratory analysis and for preparing code
Administrative Data Research Centre - Scotland | Beata Nowok | 1 December 2014
Real (input)
Sex     Age  Education             Marital status  Income  Life satisfaction
FEMALE  57   VOCATIONAL/GRAMMAR    MARRIED         800     PLEASED
MALE    41   SECONDARY             UNMARRIED       1500    MIXED
FEMALE  18   VOCATIONAL/GRAMMAR    UNMARRIED       NA      PLEASED
FEMALE  78   PRIMARY/NO EDUCATION  WIDOWED         900     MIXED
FEMALE  54   VOCATIONAL/GRAMMAR    MARRIED         1500    MOSTLY SATISFIED
MALE    20   SECONDARY             UNMARRIED       -8      PLEASED
FEMALE  39   SECONDARY             MARRIED         2000    MOSTLY SATISFIED
MALE    39   SECONDARY             MARRIED         1197    MIXED
FEMALE  38   VOCATIONAL/GRAMMAR    MARRIED         NA      MOSTLY DISSATISFIED
FEMALE  73   VOCATIONAL/GRAMMAR    WIDOWED         1700    PLEASED
FEMALE  54   SECONDARY             WIDOWED         2000    MOSTLY SATISFIED
MALE    30   VOCATIONAL/GRAMMAR    UNMARRIED       900     MOSTLY SATISFIED
MALE    68   SECONDARY             MARRIED         -8      DELIGHTED
MALE    61   PRIMARY/NO EDUCATION  MARRIED         -8      MIXED
Synthetic (output)
Sex     Age  Education             Marital status  Income  Life satisfaction
MALE    81   PRIMARY/NO EDUCATION  MARRIED         2100    PLEASED
MALE    54   VOCATIONAL/GRAMMAR    MARRIED         1700    PLEASED
FEMALE  32   VOCATIONAL/GRAMMAR    DIVORCED        870     MIXED
FEMALE  98   PRIMARY/NO EDUCATION  MARRIED         800     MOSTLY DISSATISFIED
FEMALE  50   PRIMARY/NO EDUCATION  MARRIED         NA      MOSTLY SATISFIED
FEMALE  37   VOCATIONAL/GRAMMAR    MARRIED         158     PLEASED
MALE    28   VOCATIONAL/GRAMMAR    NA              1500    MOSTLY SATISFIED
FEMALE  62   PRIMARY/NO EDUCATION  MARRIED         830     MOSTLY SATISFIED
MALE    78   PRIMARY/NO EDUCATION  MARRIED         NA      PLEASED
FEMALE  29   SECONDARY             MARRIED         580     MOSTLY SATISFIED
MALE    59   PRIMARY/NO EDUCATION  MARRIED         1300    MOSTLY SATISFIED
MALE    41   SECONDARY             UNMARRIED       1500    MIXED
MALE    18   SECONDARY             UNMARRIED       -8      PLEASED
FEMALE  73   PRIMARY/NO EDUCATION  WIDOWED         1350    MOSTLY SATISFIED
Data that look (structurally) like original data but contain artificial units only
Data that behave (statistically) like original data
http://cran.r-project.org/package=synthpop
synthpop package: generating synthetic versions of sensitive microdata for statistical disclosure control
Generating synthetic data
Sequentially replacing original data values with synthetic values generated from conditional probability distributions
Yj ~ (Y0, Y1, ..., Yj−1)
For each variable in turn: fit the conditional model, then draw synthetic values from it
real data -> syn() -> synthetic data
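The fit-and-draw sequence can be illustrated with a minimal base R sketch. It does not call synthpop; the toy data, the variable names and the choice of a linear model for the second variable are assumptions made only for illustration. The first variable has no predictors, so it is drawn from its observed distribution; the second is fitted on the real data and drawn conditional on the already synthesised first variable.

```r
# Minimal sketch of sequential synthesis (base R only, not the synthpop code).
# Toy "real" data: names and models are illustrative assumptions.
set.seed(1)
n <- 200
real <- data.frame(age = round(runif(n, 18, 90)))
real$income <- 500 + 15 * real$age + rnorm(n, sd = 200)

# Y0: no predictors, so sample from the observed marginal distribution.
syn_data <- data.frame(age = sample(real$age, n, replace = TRUE))

# Y1 | Y0: fit the model on the real data ...
fit <- lm(income ~ age, data = real)

# ... then draw synthetic values using the synthetic predictor values.
mu <- predict(fit, newdata = syn_data)
syn_data$income <- mu + rnorm(n, sd = summary(fit)$sigma)
```

With more variables the same two steps repeat, each new variable being conditioned on all variables synthesised before it.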
Overview of synthpop functions
Reading real data and writing synthetic data: read.real(), write.syn()
Synthesis: syn()
Disclosure control: sdc()
Descriptive: compare.synds(), summary.synds()
Models: glm.synds(), summary.fit.synds(), compare.fit.synds()
syn() & common data problems
Missing-data codes: contNA
  categorical variables: additional factor level(s)
  continuous variables: specified by contNA and modelled separately
Semi-continuous variables: semicont
Restricted values (interrelationships between variables): rules & rvalues
Linear constraints: denom
Non-negativity / non-normality: method set to "lognorm", "sqrtnorm" or "cubertnorm"
Deterministic relations: method set to "~I(…)"
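To make "modelled separately" concrete for continuous variables, here is a rough base R sketch. It is not the synthpop implementation: it assumes that -8 in the Income column of the example table is a missing-data code, and it uses simple resampling in place of a fitted model. The missing status is synthesised first as a category, and values are then synthesised only for the records with a valid status.

```r
# Sketch only: handling a missing-data code in a continuous variable.
# Income values copied from the "Real (input)" table; -8 is ASSUMED to be a code.
set.seed(2)
real_income <- c(800, 1500, NA, 900, 1500, -8, 2000, 1197, NA,
                 1700, 2000, 900, -8, -8)
miss_code <- -8

# Step 1: synthesise the status (NA, code or valid) as a categorical variable.
status <- ifelse(is.na(real_income), "NA",
                 ifelse(real_income == miss_code, "code", "valid"))
syn_status <- sample(status, length(status), replace = TRUE)

# Step 2: synthesise values for the valid part only (here by resampling).
valid <- real_income[status == "valid"]
syn_income <- rep(NA_real_, length(syn_status))
syn_income[syn_status == "code"] <- miss_code
n_valid <- sum(syn_status == "valid")
syn_income[syn_status == "valid"] <- sample(valid, n_valid, replace = TRUE)
```

Keeping the code out of the value model prevents -8 from being treated as a real income when the distribution is estimated.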
sdc() & statistical disclosure control
Data labelling: label
Removing replicated uniques: rm.replicated.uniques
Bottom- and top-coding: recode.vars, bottom.top.coding, recode.exclude
syn(): smoothing, minbucket
sdc(syn.obj, real,
    label = "false data",
    rm.replicated.uniques = TRUE,
    recode.vars = c("age", "income"),
    bottom.top.coding = list(c(NA, 85), c(NA, 1500)))
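What these three options do can be sketched in base R on a few toy records. This is an illustration under stated assumptions, not the sdc() code: the data frames, the helper functions key() and uniq(), and the order of the steps are all hypothetical. A replicated unique is taken here to mean a synthetic record that is unique in the synthetic data and identical to a record that is unique in the real data.

```r
# Sketch only (base R): the effect of the three disclosure control steps.
# Toy data; the helpers key() and uniq() are hypothetical, not synthpop functions.
real <- data.frame(sex = c("FEMALE", "MALE", "FEMALE"),
                   age = c(57, 41, 18),
                   income = c(800, 1500, NA))
syn_data <- data.frame(sex = c("MALE", "FEMALE", "MALE"),
                       age = c(81, 98, 41),
                       income = c(2100, 800, 1500))

# Build one string per record so that whole rows can be compared.
key <- function(d) do.call(paste, c(d, sep = "|"))
# Keep only the keys that occur exactly once.
uniq <- function(k) k[!(duplicated(k) | duplicated(k, fromLast = TRUE))]

# 1. Remove replicated uniques (checked before recoding: an assumption).
syn_key <- key(syn_data)
replicated <- syn_key %in% intersect(uniq(syn_key), uniq(key(real)))
released <- syn_data[!replicated, ]

# 2. Top-code: no lower limit (NA), upper limits of 85 and 1500.
released$age <- pmin(released$age, 85)
released$income <- pmin(released$income, 1500)

# 3. Label every record as not real.
released <- data.frame(label = "false data", released)
```

In this toy example the third synthetic record matches a unique real record and is dropped, age 98 becomes 85, and income 2100 becomes 1500.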
Synthetic (output) after sdc()
Label       Sex     Age  Education             Marital status  Income  Life satisfaction
false data  MALE    81   PRIMARY/NO EDUCATION  MARRIED         1500    PLEASED
false data  MALE    54   VOCATIONAL/GRAMMAR    MARRIED         1500    PLEASED
false data  FEMALE  32   VOCATIONAL/GRAMMAR    DIVORCED        870     MIXED
false data  FEMALE  85   PRIMARY/NO EDUCATION  MARRIED         800     MOSTLY DISSATISFIED
false data  FEMALE  50   PRIMARY/NO EDUCATION  MARRIED         NA      MOSTLY SATISFIED
false data  FEMALE  37   VOCATIONAL/GRAMMAR    MARRIED         158     PLEASED
false data  MALE    28   VOCATIONAL/GRAMMAR    NA              1500    MOSTLY SATISFIED
false data  FEMALE  62   PRIMARY/NO EDUCATION  MARRIED         830     MOSTLY SATISFIED
false data  MALE    78   PRIMARY/NO EDUCATION  MARRIED         NA      PLEASED
false data  FEMALE  29   SECONDARY             MARRIED         580     MOSTLY SATISFIED
false data  MALE    59   PRIMARY/NO EDUCATION  MARRIED         1300    MOSTLY SATISFIED
false data  MALE    18   SECONDARY             UNMARRIED       -8      PLEASED
false data  FEMALE  73   PRIMARY/NO EDUCATION  WIDOWED         1350    MOSTLY SATISFIED
synthpop: future developments
Disclosure control: providing sufficient disclosure protection
  Disclosure control measures
  Watermarking
  Partially synthetic data
Data synthesis: handling various data types, data structures and real data problems
  Stratified synthesis
  Value bounds
  Multiple event data
  Household and other hierarchical data
  Complex survey design
  Small geographic areas
Package usability: making synthpop flexible and accessible to a wider range of users
  A graphical user interface (GUI)
  Dealing with computational limitations
  Support for LSs projects
  Training workshops
Quality of synthetic data: measuring and improving analytical validity
  Tests of synthesising approaches (parametric vs CART models)
  CART extensions
  Case studies for ADRC-S projects
  Guidelines for best practice
http://cran.r-project.org/package=synthpop