“Provenance and Social Science Data” 15 March 2017 Documenting Data Transformations George Alter, University of Michigan


Page 1: Documenting Data Transformations

“Provenance and Social Science Data”, 15 March 2017

Documenting Data Transformations

George Alter, University of Michigan

Page 2: Documenting Data Transformations

Why Metadata?

• Data are useless without Metadata – “data about data”

• Metadata should:
– Include all information about data creation
– Describe transformations to variables
– Be easy to create

• Our goal: Automated capture of metadata

Page 3: Documenting Data Transformations

A few words about ICPSR

• World’s largest archive of social science data

• Consortium established 1962

• 760+ member institutions around the world

• Founding member and home office for the DDI Alliance

Page 4: Documenting Data Transformations

Powered by DDI Metadata

ICPSR is building search tools based upon Data Documentation Initiative (DDI) XML

Codebooks (pdf and online) are rendered from the DDI.
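
As a rough illustration of what "rendering a codebook from the DDI" can mean, the sketch below reads a small XML fragment and prints a plain-text codebook entry. The fragment is hypothetical: its element names (`var`, `labl`, `qstn`, `qstnLit`, `universe`, `catgry`, `catValu`) follow DDI Codebook conventions, but the variable, question text, and categories are invented for this example, and the fragment is not a schema-valid DDI instance. This is not ICPSR's rendering code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment in the style of DDI Codebook.
# The variable and its contents are invented for illustration only.
DDI_FRAGMENT = """
<codeBook>
  <dataDscr>
    <var name="V001">
      <labl>Interest in the campaign</labl>
      <qstn><qstnLit>How interested were you in the campaign?</qstnLit></qstn>
      <universe>All respondents</universe>
      <catgry><catValu>1</catValu><labl>Very interested</labl></catgry>
      <catgry><catValu>2</catValu><labl>Not interested</labl></catgry>
    </var>
  </dataDscr>
</codeBook>
"""


def render_codebook(xml_text: str) -> str:
    """Render each <var> element as a short plain-text codebook entry."""
    root = ET.fromstring(xml_text)
    lines = []
    for var in root.iter("var"):
        lines.append(f"{var.get('name')}: {var.findtext('labl')}")
        # What question was asked?
        lines.append(f"  Question: {var.findtext('qstn/qstnLit')}")
        # Who answered this question?
        lines.append(f"  Universe: {var.findtext('universe')}")
        # How was the question coded?
        for cat in var.findall("catgry"):
            lines.append(f"  {cat.findtext('catValu')} = {cat.findtext('labl')}")
    return "\n".join(lines)


print(render_codebook(DDI_FRAGMENT))
```

Because the question text, universe, and codes all live in one machine-readable record, the same source can feed a pdf, an online codebook, or a search index.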

Page 5: Documenting Data Transformations

Searchable database of 4.5M variables

Click here for online codebook

Page 6: Documenting Data Transformations

Online codebook shows variable in context of dataset

Link to online crosstab tool

What question was asked?

How was the question coded?

Link to online graph tool

Page 7: Documenting Data Transformations

Searchable database of 4.5M variables

Click here for variable comparison

Page 8: Documenting Data Transformations

Variable comparison display

Click here for online codebook

Page 9: Documenting Data Transformations

Search for datasets with 3 desired variables

Check boxes for variable comparison

Page 10: Documenting Data Transformations

Crosswalk for American National Election Study (ANES) and General Social Survey (GSS)

Columns link to 70 datasets

134 tags in 8 lists

Variable comparison display

Variables linked to online codebooks

Page 11: Documenting Data Transformations

Metadata for the American National Election Study

What question was asked?

Who answered this question?

How was the question coded?

Who answered this question?

Page 12: Documenting Data Transformations

Metadata for the American National Election Study

Who answered this question?

Who answered this question?

How do we know who answered the question?

It’s in the pdf.

Page 13: Documenting Data Transformations

When data arrive at the archive…

• No question text
• No interview flow (question order, skip pattern)
• No variable provenance
• Data transformations are not documented.

Page 14: Documenting Data Transformations

How is research data created?

• Most surveys are conducted with computer assisted interview software (CAI)
– CATI – Computer-assisted Telephone Interview
– CAPI – Computer-assisted Personal Interview
– CAWI – Computer Aided Web Interview

• There is no paper questionnaire
• The CAI program is the questionnaire
– i.e. the program is the metadata

Page 15: Documenting Data Transformations

[Diagram: Computer Assisted Interviewing (CAI) produces the original data. A CAI-to-DDI converter (Colectica, MQDS, others) produces the original metadata as DDI XML.]

We already have tools to convert CAI to machine-readable metadata.

Page 16: Documenting Data Transformations

[Diagram: Computer Assisted Interviewing (CAI) produces the original data, and a CAI-to-DDI converter (Colectica, MQDS, others) produces the original metadata as DDI XML. Command scripts run in statistical packages (SPSS, SAS, Stata, R) turn the original data into revised data.]

What happens when a project modifies the data.

The modified data no longer match the metadata.
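
A minimal sketch of how that mismatch could be detected, assuming the original metadata and the revised data can each be reduced to a set of variable names. The function name `metadata_drift` and the variable names X, Y, and Z are illustrative assumptions, not part of any tool named in this talk.

```python
from __future__ import annotations


def metadata_drift(documented: set[str], data_columns: set[str]) -> dict[str, list[str]]:
    """Compare documented variable names with the columns actually in the data."""
    return {
        # Variables created by a script but never described in the metadata.
        "undocumented": sorted(data_columns - documented),
        # Variables described in the metadata but dropped from the data.
        "missing_from_data": sorted(documented - data_columns),
    }


# Illustrative example: the original metadata describe only X,
# while a command script has added Y and Z to the revised data.
original_metadata = {"X"}
revised_data = {"X", "Y", "Z"}
print(metadata_drift(original_metadata, revised_data))
```

A name-level check like this only shows that something changed. It says nothing about how Y and Z were derived, which is the provenance that the command scripts contain.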

Page 17: Documenting Data Transformations

[Diagram: Computer Assisted Interviewing (CAI) produces the original data, and a CAI-to-DDI converter (Colectica, MQDS, others) produces the original metadata as DDI XML. Command scripts run in statistical packages (SPSS, SAS, Stata, R) turn the original data into revised data. A statistical-package-to-DDI step then extracts metadata from the SPSS/SAS/Stata/R data file into a new DDI XML file (extracted metadata).]

Metadata are re-created after the data are transformed.

Transformations are documented by hand.

Page 18: Documenting Data Transformations

Statistics packages have limited metadata

• Variable names
• Variable labels
• Value labels
• No provenance

Page 19: Documenting Data Transformations

[Diagram: CAI produces the original data, and a CAI-to-DDI converter (Colectica, MQDS, others) produces the original metadata as DDI XML. Command scripts run in statistical packages (SPSS, SAS, Stata, R) turn the original data into revised data. New components: a script parser reads the command scripts and expresses them in SDTL (Standard Data Transformation Language), and an XML updater uses the SDTL and the original metadata to produce revised metadata as DDI XML.]

Automating the capture of transformation metadata.

Missing links that we will build.
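
To make the script-parser idea concrete, here is a toy sketch that recognizes one kind of command (a conditional assignment) in SPSS, Stata, and SAS syntax and maps all three to the same neutral record. The record type `ConditionalAssign`, its field names, and the regular expressions are assumptions made for this example. They are not the SDTL specification and not the project's parser.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionalAssign:
    """Hypothetical neutral record for 'set target to value when condition holds'."""
    target: str
    value: str
    condition: str


# One pattern per package, covering only this single command form.
PATTERNS = {
    "spss": re.compile(
        r"^IF\s*\((?P<cond>.+)\)\s*(?P<target>\w+)\s*=\s*(?P<value>[^.]+)\.$", re.I
    ),
    "stata": re.compile(
        r"^generate\s+(?P<target>\w+)\s*=\s*(?P<value>\S+)\s+if\s+(?P<cond>.+)$"
    ),
    "sas": re.compile(
        r"^if\s+(?P<cond>.+?)\s+then\s+(?P<target>\w+)\s*=\s*(?P<value>[^;]+);$", re.I
    ),
}


def parse(language: str, command: str) -> ConditionalAssign:
    """Translate one package-specific command into the neutral record."""
    match = PATTERNS[language].match(command.strip())
    if match is None:
        raise ValueError(f"unsupported {language} command: {command!r}")
    condition = re.sub(r"\s+", "", match["cond"])  # normalize spacing, e.g. 'X > 3' to 'X>3'
    return ConditionalAssign(match["target"], match["value"].strip(), condition)


print(parse("spss", "IF (X > 3) Y=9."))
print(parse("stata", "generate Y=9 if X>3"))
print(parse("sas", "if X>3 then Y=9;"))
```

All three calls yield the same record. A shared syntax does not by itself settle how each package evaluates the condition, so a usable common language also has to record package-specific rules such as the treatment of missing values.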

Page 20: Documenting Data Transformations

What statistics packages should be covered?

ICPSR Downloads by Format

Format         | All downloads | Studies with all formats
Delimited text | 43%           | 29%
SPSS           | 22%           | 24%
SAS            | 10%           | 12%
Stata          | 19%           | 23%
R              | 5%            | 12%
Excel          | 0%            | 1%
Other          | 0%            | 0%
Total          | 100%          | 100%
Number         | 378,007       | 154,663

Page 21: Documenting Data Transformations

Why do we need an SDTL?

SPSS
    MISSING VALUES X(-1).
    IF (X > 3) Y=9.
    IF (X < 3) Z=8.

Stata
    replace X=. if X==-1
    generate Y=9 if X>3
    generate Z=8 if X<3

SAS
    if X=-1 then X=.;
    if X>3 then Y=9;
    if X<3 then Z=8;

Input Data (all three packages): X = 2, 3, 4, -1

Output Data: (empty on this slide)

Page 22: Documenting Data Transformations

Why do we need an SDTL?

SPSS
    MISSING VALUES X(-1).
    IF (X > 3) Y=9.
    IF (X < 3) Z=8.

    Input X | Output X | Y | Z
    2       | 2        |   | 8
    3       | 3        |   |
    4       | 4        | 9 |
    -1      | -1       |   |

Stata
    replace X=. if X==-1
    generate Y=9 if X>3
    generate Z=8 if X<3

    Input X | Output X | Y | Z
    2       | 2        |   | 8
    3       | 3        |   |
    4       | 4        | 9 |
    -1      |          | 9 |

SAS
    if X=-1 then X=.;
    if X>3 then Y=9;
    if X<3 then Z=8;

    Input X | Output X | Y | Z
    2       | 2        | . | 8
    3       | 3        | . | .
    4       | 4        | 9 | .
    -1      | .        | . | 8

Page 23: Documenting Data Transformations

What happens when a missing value is in a logical comparison?

• SPSS
– Logical expressions including a missing value are considered “Missing.” Usually, “Missing” is equivalent to “False.”

• Stata
– Missing values are treated as numbers equal to infinity. So, any number is less than a missing value.

• SAS
– Missing values are treated as numbers equal to minus infinity. So, any number is greater than a missing value.
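
The three rules can be imitated in a few lines of Python. This is a sketch under simplifying assumptions, not a model of the packages themselves: a missing value is represented as `None`, the function names are invented for the example, and only the two comparisons from the three-line programs on the earlier slides (X > 3 sets Y to 9, X < 3 sets Z to 8) are modeled.

```python
import math

MISSING = None  # stand-in for a missing value in this sketch


def comparison_key(x, package):
    """Return the number a package effectively compares, or None if the result is missing."""
    if x is not MISSING:
        return x
    if package == "stata":
        return math.inf    # missing behaves like +infinity
    if package == "sas":
        return -math.inf   # missing behaves like -infinity
    return None            # SPSS: the comparison itself is missing, treated as false


def transform(x_in, package):
    """Recode -1 to missing, then set Y=9 if X>3 and Z=8 if X<3."""
    # SPSS keeps the stored -1 and flags it as user-missing;
    # the sketch models that as None as well.
    x = MISSING if x_in == -1 else x_in
    key = comparison_key(x, package)
    y = 9 if key is not None and key > 3 else MISSING
    z = 8 if key is not None and key < 3 else MISSING
    return (x, y, z)


for package in ("spss", "stata", "sas"):
    print(package, [transform(x, package) for x in (2, 3, 4, -1)])
```

For the input -1, the sketch leaves both Y and Z missing under the SPSS rule, sets Y to 9 under the Stata rule, and sets Z to 8 under the SAS rule. The rows for 2, 3, and 4 are identical across packages.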

Page 24: Documenting Data Transformations

Missing Values in Comparisons

SPSS
    MISSING VALUES X(-1).
    IF (X > 3) Y=9.
    IF (X < 3) Z=8.

    Input X | Output X | Y | Z
    2       | 2        |   | 8
    3       | 3        |   |
    4       | 4        | 9 |
    -1      | NULL     |   |

Stata
    replace X=. if X==-1
    generate Y=9 if X>3
    generate Z=8 if X<3

    Input X | Output X | Y | Z
    2       | 2        |   | 8
    3       | 3        |   |
    4       | 4        | 9 |
    -1      | ∞        | 9 |

SAS
    if X=-1 then X=.;
    if X>3 then Y=9;
    if X<3 then Z=8;

    Input X | Output X | Y | Z
    2       | 2        | . | 8
    3       | 3        | . | .
    4       | 4        | 9 | .
    -1      | -∞       | . | 8

Page 25: Documenting Data Transformations

Benefits of automated metadata capture

• Metadata will be better
– All the information in the CAI can be included.
– Variable transformations can be described

• Automation will lower costs
– Metadata will not be discarded and re-created

• All metadata will be standardized and machine readable
– Codebooks with rich information can be rendered at will

• If we make it easy and beneficial, researchers will use it.

Page 26: Documenting Data Transformations

Continuous Capture of Metadata for Statistical Data (NSF ACI-1640575)

Project Partners
• Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan
• Colectica
• Metadata Technology North America
• Norwegian Centre for Research Data
• General Social Survey, NORC, University of Chicago
• American National Election Study, University of Michigan

Page 27: Documenting Data Transformations

Questions?

George Alter

[email protected]