
Experiences of managing Birth Cohort Data at CLS

Jon Johnson (Senior Database Manager)


CLS is an ESRC Resource Centre based at the Institute of Education


Contents

1 Introduction

2 (Pre) History

3 Centralised Computing

4 Semi-centralised computing

5 Personal Computing

6 Consequences

7 Survey Data ‘production line’

8 Requirements

9 Potential Database strategies

10 Staffing and skills


Introduction

CLS has been an ESRC Resource Centre since 2005. We are responsible for three of the four British Birth Cohort studies:

• NCDS (1958)

• BCS70 (1970)

• MCS (2000)

NSHD (1946) is funded by the MRC at UCL.

www.cls.ioe.ac.uk


(Pre) History

NCDS has its origins in the Perinatal Mortality Survey. Sponsored by the National Birthday Trust Fund, it was designed to examine the social and obstetric factors associated with stillbirth and death in early infancy among the children born in Great Britain in one week of 1958. It was a ‘follow-up’ to the 1946 study, with a similar scope.

BCS70 began as the British Births Survey (BBS), sponsored by the National Birthday Trust Fund in association with the Royal College of Obstetricians and Gynaecologists as a follow-up to the 1958 study.

MCS was specifically designed as a longitudinal survey, following up on the three previous birth surveys.


Centralised Computing


“If one had coded and tried to use all the information received from the 68 questions it is calculated that the results could have been expressed in a vast number of permutations probably in the region of 10^480” — Perinatal Mortality (1963)

The tabulations were not finalised until four years after data collection.

Things got faster ...

“The first batch of coded forms were sent for punching in October 1970 ... 113,994 punch cards there being a minimum of 6 cards per case. The punching was completed in November 1971”

Researchers were reliant on data-processing (DP) and computer professionals to generate tabulations.


Semi-Centralised Computing

In the mid-1970s, first SPSS and then other statistical packages became available. This gave researchers the opportunity to analyse the data themselves on the central computer, using data prepared and marshalled by the DP staff and computer scientists.

Most users still relied on computer professionals to retrieve and tabulate data.


Personal Computing (c1984)

With a powerful 386 computer on the desk and a copy of SPSS, researchers could take the raw data and manipulate it for their own purposes.

By the mid-1990s this process had accelerated to the point where all the data from a survey could easily be handled on a single machine, and the need for database professionals could be circumvented.


Consequences

Each study became a set of per-survey snapshots, making it cumbersome and inefficient to manage as a longitudinal resource:

• Data fragmentation, as derivations became disconnected from the original data

• Longitudinal linkage discrepancies, e.g. partnership and fertility histories

• Coding frame discrepancies

• Data security moved from IT to individuals

• Metadata was viewed as separate from the data

With the introduction of dependent interviewing, these problems would only increase.
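The coding frame discrepancies above can be sketched in a few lines (the coding frames, codes, and mapping here are hypothetical, not the actual CLS frames): when the ‘same’ question is coded with different frames in different sweeps, raw codes can only be compared longitudinally after harmonising onto a common frame.

```python
# Hypothetical coding frames for the "same" marital-status question,
# coded differently in two sweeps (not the actual CLS coding frames).
sweep1_frame = {1: "single", 2: "married", 3: "widowed/divorced"}
sweep2_frame = {1: "single", 2: "married", 3: "widowed",
                4: "divorced", 5: "separated"}

# Map sweep 2's finer codes onto sweep 1's coarser frame before comparing.
harmonise = {1: 1, 2: 2, 3: 3, 4: 3, 5: 3}

def codes_agree(code_s1: int, code_s2: int) -> bool:
    """True if the two sweeps agree once sweep 2 is mapped to sweep 1's frame."""
    return code_s1 == harmonise[code_s2]

print(codes_agree(3, 4))  # "widowed/divorced" vs "divorced" -> True
print(codes_agree(2, 5))  # "married" vs "separated" -> False
```

Without such an explicit mapping, comparing the raw codes directly would misclassify every ‘divorced’ or ‘separated’ respondent in the later sweep.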


Survey Data ‘production line’


Study design → Instrument design → Instrument realisation → Data collection → Data processing → Data documentation → Science


Requirements

• Migrate and restructure the data back into a database to restore integrity and resolve discrepancies

• Re-derive variables

• Integrate metadata into the data

• Create longitudinal checking algorithms

• Ability to manipulate data in situ

• Log of changes and version control
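The longitudinal checking algorithms above can be illustrated with a minimal sketch (the sweep names and history are hypothetical, and the real CLS checks are far more extensive): one simple consistency rule is that a cohort member's reported number of children should never decrease between sweeps.

```python
def check_child_counts(sweeps):
    """Flag sweeps where the reported number of children decreases.

    sweeps: list of (sweep_name, n_children) in chronological order.
    Returns a list of human-readable discrepancy messages.
    """
    problems = []
    prev_name, prev_n = None, None
    for name, n in sweeps:
        if prev_n is not None and n < prev_n:
            problems.append(
                f"{name}: reports {n} children, but {prev_name} reported {prev_n}"
            )
        prev_name, prev_n = name, n
    return problems

# Hypothetical history: the count dropping at the final sweep is flagged.
history = [("sweep 4", 1), ("sweep 5", 2), ("sweep 6", 1)]
print(check_child_counts(history))
```

The same shape of check (compare each sweep against the previous one, and log the discrepancy rather than silently ‘fixing’ it) extends to partnership histories and other event data.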


Potential database strategies


Staffing and Skills

At CLS we chose to use SIR as our main database, and SQL for holding metadata (DDI 2.0 model):

• Existing SIR experience

• Easy to cross-train from SPSS

• Migration of data from SPSS is straightforward

• Security is very configurable

• Version control and change log are easy to implement

• Derivations and manipulations are done in one place

• 3 FTE (mix of skills: data management, DBA)


Any questions?

Institute of Education
University of London
20 Bedford Way
London WC1H 0AL

Tel +44 (0)20 7612 6000
Fax +44 (0)20 7612 6126
Email [email protected]
www.ioe.ac.uk