Organising social science data – computer science perspectives Simon Jones Computing Science and Mathematics University of Stirling, Stirling, Scotland, UK Seminar: Data management in the social sciences and the contribution of the DAMES Node Stirling 31 January 2012 DAMES: Data Management through e-Social Science http://www.dames.org.uk


Organising social science data – computer science perspectives

Simon Jones

Computing Science and Mathematics
University of Stirling, Stirling, Scotland, UK

Seminar: Data management in the social sciences and the contribution of the DAMES Node

Stirling 31 January 2012

DAMES: Data Management through e-Social Science
http://www.dames.org.uk

DAMES: Background

• DAMES: Case studies, provision and support for data management in the social sciences
• This talk: focusing on "support for data management"
  • Infrastructure/tools
• Driven by social science needs for support for advanced data management operations
  • “In practice, social researchers often spend more time on data management than any other part of the research process” (Lambert)
  • A ‘methodology’ of data management is relevant to ‘harmonisation’, ‘comparability’, ‘reproducibility’ in quantitative social science

DAMES: Themes

Enabling the (social science) researcher:
• To deposit, search and process heterogeneous data resources
• To access online services/‘tools’ that enable researchers to carry out repeatable and challenging data management techniques such as:
  • fusion
  • matching
  • imputation
  • …

Facilitating access is an important goal

Underlying computer science research themes:
• Metadata
• Data curation
• Data management/processing
• Portals

Data management/processing scenarios

Curation scenarios include:
• Uploading occupational data to distribute across academic community
• Recording data properties prior to undertaking data fusion involving a survey and an aggregate dataset

Fusion scenarios include:
• Linking a micro-social survey with aggregate occupational information (deterministic link)
• Enhancing a survey dataset with ‘nearest match’ explanatory variables (probabilistic link)

Other processes: recoding, operationalising, linking, cleaning…
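The two fusion scenarios differ in how records are paired. The following is a minimal sketch, not the DAMES implementation: the toy data, the variable names (`occ_code`, `occ_score`, `income`) and the functions `deterministic_link` and `nearest_match` are all invented for illustration. A deterministic link joins on a shared key, while a nearest match copies a value from the donor record that lies closest on the variables both datasets share. Unscaled Euclidean distance is assumed here purely for simplicity.

```python
from math import dist

# Toy data, invented for illustration only.
survey = [
    {"id": 1, "occ_code": "2315", "age": 34, "hours": 38},
    {"id": 2, "occ_code": "5231", "age": 51, "hours": 45},
]
# Aggregate occupational information, keyed by occupation code.
occupation_scores = {"2315": 62.1, "5231": 41.7}
# Donor records carrying an extra explanatory variable ("income").
donors = [
    {"age": 30, "hours": 40, "income": 28000},
    {"age": 55, "hours": 44, "income": 31000},
]


def deterministic_link(records, lookup, key, new_var):
    """Attach lookup[record[key]] to each record (None if the key is absent)."""
    return [{**r, new_var: lookup.get(r[key])} for r in records]


def nearest_match(records, donor_records, common_vars, new_var):
    """Copy new_var from the donor closest in Euclidean distance on common_vars."""
    linked = []
    for r in records:
        point = [r[v] for v in common_vars]
        best = min(
            donor_records,
            key=lambda d: dist(point, [d[v] for v in common_vars]),
        )
        linked.append({**r, new_var: best[new_var]})
    return linked


linked = deterministic_link(survey, occupation_scores, "occ_code", "occ_score")
fused = nearest_match(linked, donors, ["age", "hours"], "income")
for row in fused:
    print(row)
```

In practice the common variables would normally be rescaled before distances are compared, and a probabilistic method would weigh several candidate donors rather than always taking the single closest one.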

Generic data flows

[Diagram: data sets flow between a data set store and processing]

• Data sets are deposited
• Data sets are selected
• Processing is configured
• Result is saved

Data set selection, and the configuration of processing jobs, must be informed by knowledge about the data sets: metadata

Key role for metadata

• Metadata records are absolutely core to the functioning of the portal infrastructure
  • For adequate, searchable records for the heterogeneous resources (data tables, command files, notes and documentation)
  • To connect the resources and the data mgmt tools
  • To document the data sets resulting from application of the data mgmt tools: inputs, process, rationale,…
• DAMES requirements:
  • (Micro-)data based, very general
  • DDI (= Data Documentation Initiative)

DDI 2 – An XML language

<ddi2:codeBook xmlns:ddi2="http://www.icpsr.umich.edu/DDI">
  <ddi2:docDscr>
    <ddi2:citation>
      <ddi2:titlStmt>
        <ddi2:titl>An interesting study</ddi2:titl>
        <ddi2:IDNo agency="DAMES-M">12</ddi2:IDNo>
      </ddi2:titlStmt>
      <ddi2:prodStmt>
        <ddi2:producer>DAMES Portal</ddi2:producer>
        <ddi2:copyright>Univ of Stirling</ddi2:copyright>
        <ddi2:prodDate>July 29, 2010</ddi2:prodDate>
        <ddi2:grantNo source="Financial_1" agency="Economic and Social Research Council">
          RES-149-25-1066
        </ddi2:grantNo>
      </ddi2:prodStmt>
    </ddi2:citation>
  </ddi2:docDscr>
  ...
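As an illustration of how a tool might read such a record, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It makes one assumption that should be stated plainly: the slide's fragment is cut short, so the copy below is closed off early (and shortened) simply so that it parses. The `summary` dictionary and its keys are invented for the example.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the slide's fragment, closed off here so that it parses.
# The real record continues where the slide shows "...".
DDI_FRAGMENT = """
<ddi2:codeBook xmlns:ddi2="http://www.icpsr.umich.edu/DDI">
  <ddi2:docDscr>
    <ddi2:citation>
      <ddi2:titlStmt>
        <ddi2:titl>An interesting study</ddi2:titl>
        <ddi2:IDNo agency="DAMES-M">12</ddi2:IDNo>
      </ddi2:titlStmt>
      <ddi2:prodStmt>
        <ddi2:producer>DAMES Portal</ddi2:producer>
        <ddi2:prodDate>July 29, 2010</ddi2:prodDate>
      </ddi2:prodStmt>
    </ddi2:citation>
  </ddi2:docDscr>
</ddi2:codeBook>
"""

# Prefix-to-URI map used by the path expressions below.
NS = {"ddi2": "http://www.icpsr.umich.edu/DDI"}

root = ET.fromstring(DDI_FRAGMENT)
citation = root.find("ddi2:docDscr/ddi2:citation", NS)

title = citation.findtext("ddi2:titlStmt/ddi2:titl", namespaces=NS)
id_node = citation.find("ddi2:titlStmt/ddi2:IDNo", NS)
producer = citation.findtext("ddi2:prodStmt/ddi2:producer", namespaces=NS)

# Collect the fields a search index or a tool front end might need.
summary = {
    "title": title,
    "id": id_node.text,
    "agency": id_node.get("agency"),
    "producer": producer,
}
print(summary)
```

Because every element is namespace-qualified, each path step needs the `ddi2:` prefix together with the prefix map; an unprefixed path such as `docDscr/citation` would find nothing.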

The metadata "cycle"

[Diagram: a cycle connecting Processing and Metadata, with stages labelled Deposit/curate, Search, Select, Configure/process]

Data is mirrored by metadata

DAMES portal architecture overview

[Diagram labels: User; Portal; Services; Search; Enact Fusion; File Access; DAMES Resources; Compute Resources; Metadata; Local Datasets; External Dataset Repositories]

(Note: Security omitted)

Tools

• Since metadata must have a key role in data management…
• So tools for managing and exploiting the metadata have a key role in the use and operation of the DAMES portal
  • At deposit/curation
  • For searching
  • For informing the configuration of processing steps
• The following slides illustrate use of our tools

Curation Tool

The source data:

Also automatically uploaded to searchable eXist database

Metadata searching

Browsing the search results

Fusion Tool prototype

Scenario: A social science researcher wishes to fuse Scottish Household Survey (SHS) data with privately collected study data:
• Uses the data curation tool to upload the data
• Uses the data fusion/imputation tool to select the data, identify corresponding variables, and generate a derived dataset (held in the portal)
• The metadata about this derived dataset is stored and (may be) made public through the portal
• Another researcher can now search the portal (metadata) for SHS data and find the derived dataset

DAMES metadata handling must facilitate this process
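The scenario can be sketched with a much-simplified metadata record. This is an illustration under stated assumptions, not the portal's data model: `MetadataRecord`, its fields, the `search` function and all dataset identifiers are hypothetical stand-ins for DDI records and the portal's search service.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Hypothetical, much-simplified stand-in for a DDI record."""
    dataset_id: str
    title: str
    keywords: list
    public: bool = False
    derived_from: list = field(default_factory=list)  # ids of input datasets
    process: str = ""                                 # how the dataset was produced


def search(catalogue, term, user_owns=()):
    """Return records whose title or keywords contain term (case-insensitive).

    Only public records, or records the user owns, are visible.
    """
    term = term.lower()
    hits = []
    for rec in catalogue:
        visible = rec.public or rec.dataset_id in user_owns
        text = " ".join([rec.title, *rec.keywords]).lower()
        if visible and term in text:
            hits.append(rec)
    return hits


catalogue = [
    MetadataRecord("shs-extract", "Scottish Household Survey extract",
                   ["SHS", "household"], public=True),
    MetadataRecord("private-study", "Privately collected study", ["study"]),
]

# Fusion produces a derived dataset. Its record names its inputs and process,
# and repeats the "SHS" keyword so that a search for SHS data will find it.
derived = MetadataRecord(
    "fused-001", "Study data fused with survey variables",
    keywords=["SHS", "study", "fusion"],
    public=True,
    derived_from=["shs-extract", "private-study"],
    process="data fusion (method chosen in the fusion tool)",
)
catalogue.append(derived)

found = [rec.dataset_id for rec in search(catalogue, "shs")]
print(found)
```

The private study stays hidden from other users, while the derived dataset, once made public, turns up next to the original survey extract and carries a pointer back to both of its inputs.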

The Fusion Tool prototype

• Select datasets (recipient and donor)
• Select "common variables"
• Select variables to be imputed
• Select data fusion method
• Submit to fusion "enactor"

Metadata accessed
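One way to picture why the wizard needs metadata at each step is to treat the user's choices as a job description that is checked against the variable lists of the two datasets. The sketch below is an assumption-laden illustration: `FusionJob`, `validate`, the dataset and variable names, and the method names "hot deck" and "regression" are all invented and do not describe the prototype's actual options.

```python
from dataclasses import dataclass

# Variable lists as they might be read from each dataset's metadata record.
# Dataset and variable names are invented.
VARIABLES = {
    "recipient-study": {"age", "sex", "region", "attitude_score"},
    "donor-survey": {"age", "sex", "region", "household_income"},
}


@dataclass
class FusionJob:
    recipient: str
    donor: str
    common_vars: list
    impute_vars: list
    method: str


def validate(job, variables, methods=("hot deck", "regression")):
    """Return a list of problems; an empty list means the job can be submitted."""
    problems = []
    rec_vars, don_vars = variables[job.recipient], variables[job.donor]
    # Common variables must be present in both datasets.
    for v in job.common_vars:
        if v not in rec_vars or v not in don_vars:
            problems.append(f"common variable {v!r} is not in both datasets")
    # Imputed variables come from the donor and must be new to the recipient.
    for v in job.impute_vars:
        if v not in don_vars:
            problems.append(f"variable {v!r} is not in the donor")
        elif v in rec_vars:
            problems.append(f"variable {v!r} already exists in the recipient")
    if job.method not in methods:
        problems.append(f"unknown fusion method {job.method!r}")
    return problems


good = FusionJob("recipient-study", "donor-survey",
                 ["age", "sex", "region"], ["household_income"], "hot deck")
bad = FusionJob("recipient-study", "donor-survey",
                ["age", "attitude_score"], ["household_income"], "guess")

print(validate(good, VARIABLES))
print(validate(bad, VARIABLES))
```

Checking the selections before submission means a mis-specified job is rejected in the wizard rather than failing later on the compute resources.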

• Select datasets (recipient and donor)
• Select "common variables"
• Select variables to be imputed
• Select data fusion method
• Submit to fusion "enactor"

Skipped

Metadata for result dataset

Job submission: Information flow

[Diagram components: Wizard; Enactor; Compute resources (Condor); subjob1; subjob2; User's local file store; Resultant data; DDI record; Further infrastructure]

[Diagram arrow labels: notify(job id); fetch job; submit; JFDL/JSDL; description.xml]
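The diagram's labels suggest a flow along the following lines: the wizard saves a job description, notifies the enactor with a job id, the enactor fetches the job and submits its subjobs to the compute resources, and the resultant data and a DDI record end up in the user's file store. The sketch below simulates that reading in memory. It is one interpretation of the diagram, not the DAMES code: all function names and dictionary keys are hypothetical, plain dictionaries stand in for the JFDL/JSDL description and the DDI record, and nothing here talks to Condor.

```python
# In-memory stand-ins; every name here is hypothetical.
file_store = {}      # plays the role of the user's local file store
compute_log = []     # records what was handed to the compute resources


def wizard_submit(job_id, subjobs):
    """Wizard: save a job description, then notify the enactor with the job id."""
    file_store[f"{job_id}/description"] = {"job_id": job_id, "subjobs": subjobs}
    return enactor_notify(job_id)


def enactor_notify(job_id):
    """Enactor: fetch the job description, run its subjobs, store the outputs."""
    job = file_store[f"{job_id}/description"]                      # "fetch job"
    outputs = [run_on_compute(name) for name in job["subjobs"]]    # "submit"
    file_store[f"{job_id}/result"] = outputs
    # A record describing the result, so that it can be found and reused later.
    file_store[f"{job_id}/ddi_record"] = {
        "describes": f"{job_id}/result",
        "produced_by": job["subjobs"],
    }
    return job_id


def run_on_compute(subjob):
    """Stand-in for a compute resource; a real system would queue the work."""
    compute_log.append(subjob)
    return f"output of {subjob}"


wizard_submit("job-42", ["subjob1", "subjob2"])
print(sorted(file_store))
print(compute_log)
```

Passing only a job id in the notification keeps the wizard and the enactor loosely coupled: the enactor reads everything else it needs from the stored description.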

Thank you!