Transcript
Page 1: Chris Dibben University of Edinburgh Linking historical administrative data

Chris Dibben, University of Edinburgh

Linking historical administrative data

Page 2:

Context

• History of very important contributions:
– Dutch Famine Birth Cohort Study – epigenetics, thrifty phenotype
– Överkalix study – epigenetics, sex differences
– UK Longitudinal Study – health inequalities

Page 3:

Two new developmental projects

• Scottish Mental Surveys 1932 and 1947

• Scottish civil registration data

• New cohorts for people now in old age

Page 4:

The ‘Scottish Mental Survey’

Page 5:

1947 Scottish Mental Survey

1939 register

Birth 1936

ED code, address, household members: marital status, occupation

The Scottish Longitudinal Study

Scottish morbidity records

1939 books recorded the date of death (up to 1980)

linkage to the death database (1974 onwards)

Education

Employment

Page 6:

[Figure: life-course timeline for the 1936 birth cohort. Year axis: 1936 (birth), 1947, 1970, 1991, 2001, 2011. Age axis: 0, 11, 34, 55, 65, 75. Labels: Early life environment; Mental ability; School achievement (time estimated); Occupation (estimated); Hospitalisation; Mortality; Detailed household/individual information.]

Page 7:

Background – Scottish vital events

• Civil registration of births, deaths and marriages in Scotland began on 1 January 1855

• All historical vital events records have been converted into digital image format with a supporting index

• Modern vital events data (from 1974 onwards) are available electronically

Page 8:

Digitising Scotland

• Approximately 50 million occupation strings, 8 million causes of death

• Classify occupations to Historical International Standard Classification of Occupations (HISCO)

• Cause of death to a modified ICD10

• Each with a location

Page 9:

Historical Geocoding

[Diagram: GEOCODING TOOL + GEOMETRY FEATURES, with geometry features shown for the time slices 1710, 1810, 1910 and 2010]

Year | Historical address
2010 | Ladywell House, Ladywell Road, Edinburgh, EH12 7T
1910 | Ladywell House, Ladywell Street, Edinburgh
1810 | Ladywell House, Ladywell Street, Edinburgh
1710 | Ladywell House, Lady[vv]ell Street, Edinburgh

Callouts on the address examples: Postcode change; Without postcode; Interpretation error

• Change of road networks (new roads replace old) over time
• Change of road names over time
• Interpretation errors from the address digitisation
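The difficulties listed on this slide (renamed roads, addresses without postcodes, transcription errors such as "Lady[vv]ell") can be illustrated with a minimal sketch. It assumes a hypothetical gazetteer of street names with validity periods and uses standard-library fuzzy matching. The gazetteer entries, year ranges, feature identifiers and function names are invented for illustration and are not the project's geocoding tool.

```python
from difflib import SequenceMatcher

# Hypothetical gazetteer: (street name, valid from, valid to, feature id).
# All names, year ranges and ids are illustrative assumptions.
GAZETTEER = [
    ("ladywell street", 1700, 1950, "feature-A"),
    ("ladywell road", 1951, 2100, "feature-A"),
    ("ladyfield place", 1800, 2100, "feature-B"),
]


def normalise(text):
    """Lower-case and keep only letters, digits and single spaces."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return " ".join(cleaned.split())


def geocode(street, year, min_score=0.8):
    """Return (name, feature id, score) for the best street valid in `year`, or None."""
    query = normalise(street)
    best = None
    for name, start, end, feature in GAZETTEER:
        if not (start <= year <= end):
            continue  # only consider streets assumed to exist in that period
        score = SequenceMatcher(None, query, name).ratio()
        if best is None or score > best[2]:
            best = (name, feature, score)
    return best if best and best[2] >= min_score else None


for street, year in [("Lady[vv]ell Street", 1710), ("Ladywell Road", 2010)]:
    print(year, street, "->", geocode(street, year))
```

Restricting candidates by period before scoring means the same house can resolve to one geometry feature under different street names, while the similarity threshold tolerates small digitisation errors and rejects unrelated streets.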

Page 10:
Page 11:
Page 12:

Challenges

• Significant methodological issues:
– How can we consistently code occupational data so that researchers can explore changing patterns and trends?
– How can we automate this process so that the majority of records do not need to be manually coded?


Page 13:

Digitising Scotland

• Records of births, marriages and deaths recorded in Scotland from 1855 to present day.


Page 14:
Page 15:
Page 16:
Page 17:
Page 18:

Page 19:

Experimental Dataset

• Use a dataset with similar content for experiments

• 60,000 records from the Cambridge Family History Study (records from 1800-1990)

• Occupation descriptions and associated HISCO codes

• HISCO coding done by historians

• Dataset contains 330 different HISCO codes


Page 20:

HISCO Hierarchy Example
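As a rough illustration of the hierarchy, a five-digit HISCO code can be read as a set of nested prefixes. The sketch below assumes the usual reading of the scheme (first digit: major group; first two digits: minor group; first three digits: unit group; all five digits: occupational category). The function name is hypothetical.

```python
def hisco_levels(code):
    """Split a five-digit HISCO code into its nested prefix levels."""
    code = str(code).zfill(5)  # keep leading zeros, e.g. 6105 -> "06105"
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"not a five-digit HISCO code: {code!r}")
    return {
        "major_group": code[:1],
        "minor_group": code[:2],
        "unit_group": code[:3],
        "occupation": code,
    }


print(hisco_levels("62460"))
```

Comparing two codes level by level gives a coarse sense of how far apart two classifications are, which is useful when an automatic code is wrong but still lands in the right part of the hierarchy.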

Page 21:

Classification Example

String from record | Gold Standard Classification | Automatic Classification Output
Farm horseman | 62460 Horse Worker | 62460 Horse Worker
Shoe maker | 80110 Shoemaker, General | 80110 Shoemaker, General
Fireman (railway) | 98330 Railway Steam-Engine Fireman | 98330 Railway Steam-Engine Fireman
Fireman | 58100 Fire-Fighter | 58100 Fire-Fighter
Stationer | 41000 Working Proprietors (Wholesale and Retail Trade) | 91000 Paper and Paperboard product makers
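The comparison in this example can be expressed as a simple accuracy calculation against the gold standard. The sketch below scores only the five example rows shown on the slide. The variable and function names are illustrative.

```python
# (string from record, gold-standard HISCO code, automatic HISCO code),
# copied from the five example rows on the slide.
EXAMPLES = [
    ("Farm horseman", "62460", "62460"),
    ("Shoe maker", "80110", "80110"),
    ("Fireman (railway)", "98330", "98330"),
    ("Fireman", "58100", "58100"),
    ("Stationer", "41000", "91000"),
]


def accuracy(rows):
    """Fraction of rows where the automatic code equals the gold standard."""
    correct = sum(1 for _, gold, predicted in rows if gold == predicted)
    return correct / len(rows)


def errors(rows):
    """Rows where the automatic code disagrees with the gold standard."""
    return [row for row in rows if row[1] != row[2]]


print(f"accuracy = {accuracy(EXAMPLES):.0%}")  # 4 of the 5 rows agree
print(errors(EXAMPLES))
```

Listing the disagreements alongside the score keeps cases such as "Stationer" visible, where the string is ambiguous between a trade and a manufacturing occupation.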


Page 22:

Page 23:

Approach

• Text analysis
• Supervised machine learning
– Apache Mahout framework.
• Combination of these techniques.


Page 24:

Supervised Machine Learning

[Diagram: Training Data → Machine Learning → Prediction Model; Unseen Data → Prediction Model → Predicted Classification]

Page 25:

Supervised Machine Learning

[Diagram: Training Data → Machine Learning → Prediction Model; Unseen Data → Prediction Model → Predicted Classification]

Training data: Farm horseman 62460; Shoe maker 80110; Fireman 58100; Stationer 41000

Page 26:

Supervised Machine Learning

[Diagram: Training Data → Machine Learning → Prediction Model; Unseen Data → Prediction Model → Predicted Classification]

Training data: Farm horseman 62460; Shoe maker 80110; Fireman 58100; Stationer 41000

Unseen data: Farm horseman; Boot maker; Fireman; Painter

Page 27:

Supervised Machine Learning

[Diagram: Training Data → Machine Learning → Prediction Model; Unseen Data → Prediction Model → Predicted Classification]

Training data: Farm horseman 62460; Shoe maker 80110; Fireman 58100; Stationer 41000

Unseen data: Farm horseman; Boot maker; Fireman; Painter

Predicted classification: ?
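The train-then-predict flow on these slides can be sketched in a few lines. The slides name Apache Mahout as the framework. The code below is not Mahout but a small standard-library multinomial naive Bayes with add-one smoothing, trained on the four training strings from the slide and applied to the four unseen strings. The class and function names are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

# Training pairs taken from the slide: occupation string -> HISCO code.
TRAINING = [
    ("Farm horseman", "62460"),
    ("Shoe maker", "80110"),
    ("Fireman", "58100"),
    ("Stationer", "41000"),
]


def tokens(text):
    """Lower-case word tokens; a deliberately simple text-analysis step."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, pairs):
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in pairs:
            self.class_counts[label] += 1
            for word in tokens(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        """Return the label with the highest log-posterior score."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokens(text):
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(TRAINING)
for unseen in ["Farm horseman", "Boot maker", "Fireman", "Painter"]:
    print(unseen, "->", model.predict(unseen))
```

With so little training data, "Boot maker" is classified through the shared token "maker", while "Painter" shares no token with any training string and receives an essentially arbitrary label. In practice such cases need more training data or a confidence threshold that routes them to manual coding.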

Page 28:

[Diagram: cause-of-death text example. Labels: Asthma; Miners asthma. Terms shown: spasmodic, collier's, miner's, miners, asthma, dropsy, bronchial. Values shown: 100%, 100%.]

Page 29:

[Bar chart: Classification Accuracy. Y-axis: Accuracy % (10 to 100). X-axis: Techniques (String Similarity, SGD, Naïve Bayes, Majority Vote, Confidence Weighted 1, Confidence Weighted 20).]
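The chart compares individual techniques with two ways of combining them. The slides do not define the combination rules, so the sketch below shows one plausible reading: a plain majority vote, and a vote in which each classifier's label is weighted by its confidence. The function names and the example outputs are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """predictions: list of (label, confidence); return the most common label."""
    counts = Counter(label for label, _ in predictions)
    return counts.most_common(1)[0][0]


def confidence_weighted_vote(predictions):
    """Sum each label's confidences and return the label with the largest total."""
    totals = defaultdict(float)
    for label, confidence in predictions:
        totals[label] += confidence
    return max(totals, key=totals.get)


# Hypothetical outputs of three classifiers for one occupation string.
outputs = [
    ("41000", 0.95),  # e.g. string similarity
    ("91000", 0.40),  # e.g. SGD
    ("91000", 0.45),  # e.g. naive Bayes
]
print(majority_vote(outputs))             # 91000
print(confidence_weighted_vote(outputs))  # 41000
```

The two rules can disagree, as here: two weakly confident classifiers outvote one strongly confident classifier under majority voting, but not under confidence weighting.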

Page 30:

Creation of a fully-linked vital events database for the whole of Scotland back to 1855

[Diagram: timeline 1855, 1974, Present. Labels: Vital Events (24 million births, deaths and marriages), Digital Images + Index; Vital Events Database; Vital Events Database; Fully-linked Vital Events Database.]

Page 31:

Large-scale family reconstruction studies and pedigrees

Page 32:

Gottfredsson, Magnús, et al. "Lessons from the past: familial aggregation analysis of fatal pandemic influenza (Spanish flu) in Iceland in 1918." Proceedings of the National Academy of Sciences 105.4 (2008): 1303-1308.

Page 33:
Page 34:

Acknowledgments

• The Digitising Scotland project is funded by ESRC;
• The support from National Records of Scotland is also gratefully acknowledged.

