Chris Dibben, University of Edinburgh: Linking historical administrative data

Transcript

  • Slide 1
  • Chris Dibben, University of Edinburgh: Linking historical administrative data
  • Slide 2
  • Context. A history of very important contributions: the Dutch Famine Birth Cohort Study (epigenetics, thrifty phenotype); the Överkalix study (epigenetics, sex differences); the UK Longitudinal Study (health inequalities).
  • Slide 3
  • Two new developmental projects: the Scottish Mental Surveys of 1932 and 1947, and Scottish civil registration data. New cohorts for people now in old age.
  • Slide 4
  • The Scottish Mental Survey
  • Slide 5
  • Linkage diagram for the 1947 Scottish Mental Survey (birth 1936). Linked sources: the 1939 register (ED code, address, household members: marital status, occupation); the Scottish Longitudinal Study; Scottish morbidity records; education; employment. Deaths: the 1939 books recorded the date of death (up to 1980), with linkage to the death database (1974 onwards).
  • Slide 6
  • Cohort timeline (figure; axes: year and age). Birth and early life environment: 1936 (age 0). Mental ability: 1947 (age 11). School achievement (time estimated). Occupation (estimated). Hospitalisation and mortality: 1970 (age 34). Detailed household/individual information: 1991 (age 55), 2001 (age 65), 2011 (age 75).
  • Slide 7
  • Background: Scottish vital events. Civil registration of births, deaths and marriages in Scotland began on 1 January 1855. All historical vital events records have been converted into digital image format with a supporting index. Modern vital events data (from 1974 onwards) are available electronically.
  • Slide 8
  • Digitising Scotland. Approximately 50 million occupation strings and 8 million causes of death. Occupations are classified to the Historical International Standard Classification of Occupations (HISCO) and causes of death to a modified ICD-10. Each record comes with a location.
  • Slide 9
  • Historical geocoding. A geocoding tool combines historical addresses with the geometry features for each period (1710, 1810, 1910, 2010).
    Year and historical address:
    2010: Ladywell House, Ladywell Road, Edinburgh, EH12 7T
    1910: Ladywell House, Ladywell Street, Edinburgh
    1810: Ladywell House, Ladywell Street, Edinburgh
    1710: Ladywell House, Lady[vv]ell Street, Edinburgh
    Differences between entries: postcode change; address without postcode; interpretation error.
    Challenges: change of road networks over time (new roads replace old); change of road names over time; interpretation errors from the address digitisation.
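The address problems listed on this slide can be illustrated with a minimal sketch. It is not the project's geocoding tool: the per-year gazetteer, the feature identifier `F1`, the `vv` to `w` repair rule and the similarity cutoff are all assumptions made for illustration.

```python
import difflib
import re

# Hypothetical per-year gazetteer: street name -> geometry feature id.
# The same feature keeps its id when the street is renamed.
GAZETTEER = {
    2010: {"ladywell road": "F1"},
    1910: {"ladywell street": "F1"},
    1810: {"ladywell street": "F1"},
    1710: {"ladywell street": "F1"},
}


def normalise(street):
    """Lower-case a street name and repair simple transcription noise."""
    s = street.lower()
    s = re.sub(r"[\[\]]", "", s)  # drop brackets marking uncertain readings
    s = s.replace("vv", "w")      # assumed rule: 'vv' is a misread 'w'
    return re.sub(r"\s+", " ", s).strip()


def geocode(year, street, cutoff=0.8):
    """Return the feature id of the closest street name for that year, or None."""
    names = GAZETTEER.get(year, {})
    match = difflib.get_close_matches(normalise(street), list(names), n=1, cutoff=cutoff)
    return names[match[0]] if match else None


print(geocode(1710, "Lady[vv]ell Street"))
print(geocode(2010, "Ladywell Road"))
```

Keeping one gazetteer per period means a renamed street resolves to the same feature in each year, while the fuzzy match absorbs small digitisation errors.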
  • Slide 10
  • Slide 11
  • Slide 12
  • Challenges. Significant methodological issues: How can we consistently code occupational data so that researchers can explore changing patterns and trends? How can we automate this process so that the majority of records do not need to be manually coded? Contact: digitisingscotland@lscs.ac.uk
  • Slide 13
  • Digitising Scotland. Records of births, marriages and deaths recorded in Scotland from 1855 to the present day.
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Experimental dataset. A dataset with similar content is used for experiments: 60,000 records from the Cambridge Family History Study (records from 1800 to 1990), with occupation descriptions and associated HISCO codes. The HISCO coding was done by historians. The dataset contains 330 different HISCO codes.
  • Slide 20
  • HISCO hierarchy example
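As a rough illustration of how a hierarchical code can be used when comparing classifications, the sketch below splits a five-digit HISCO code into nested levels. The level boundaries (first digit for the major group, with 0/1 and 7/8/9 each treated as one combined group, then two and three leading digits for minor and unit groups) are an assumption based on the published HISCO scheme, not something stated on the slide, and the function names are hypothetical.

```python
# Assumed mapping from leading digit to HISCO major group.
MAJOR_GROUP = {"0": "0/1", "1": "0/1", "7": "7/8/9", "8": "7/8/9", "9": "7/8/9"}

LEVELS = ("occupation", "unit_group", "minor_group", "major_group")


def hisco_levels(code):
    """Split a five-digit HISCO code string into its nested levels."""
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"expected a five-digit code, got {code!r}")
    return {
        "major_group": MAJOR_GROUP.get(code[0], code[0]),
        "minor_group": code[:2],
        "unit_group": code[:3],
        "occupation": code,
    }


def agreement_level(gold, predicted):
    """Return the finest level at which two codes agree, or None."""
    g, p = hisco_levels(gold), hisco_levels(predicted)
    for level in LEVELS:
        if g[level] == p[level]:
            return level
    return None


print(hisco_levels("62460"))
print(agreement_level("41000", "91000"))
```

Comparing codes level by level lets a near miss (same unit group, different occupation) be scored differently from a disagreement at the major-group level.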
  • Slide 21
  • Classification example
    String from record | Gold standard classification | Automatic classification output
    Farm horseman | 62460 Horse Worker |
    Shoe maker | 80110 Shoemaker, General |
    Fireman (railway) | 98330 Railway Steam-Engine Fireman |
    Fireman | 58100 Fire-Fighter |
    Stationer | 41000 Working Proprietors (Wholesale and Retail Trade) | 91000 Paper and Paperboard Product Makers
  • Slide 23
  • Approach. Text analysis; supervised machine learning (Apache Mahout framework); a combination of these techniques.
  • Slide 24
  • Supervised machine learning (diagram). Training data -> machine learning -> prediction model. Unseen data -> prediction model -> predicted classification.
  • Slide 25
  • Supervised machine learning (diagram, continued). Training data examples: Farm horseman 62460; Shoe maker 80110; Fireman 58100; Stationer 41000.
  • Slide 26
  • Supervised machine learning (diagram, continued). Unseen data examples: Farm horseman; Boot maker; Fireman; Painter.
  • Slide 27
  • Supervised machine learning (diagram, continued). The prediction model is applied to the unseen strings (Farm horseman, Boot maker, Fireman, Painter); their predicted classification is shown as "?".
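The train-then-predict flow in these slides can be sketched in a few lines. This is not the project's Apache Mahout pipeline: it is a small token-based naive Bayes classifier written from scratch, trained only on the four examples shown on the slides. The class name, the whitespace tokeniser and the rule of returning `None` when no token has been seen before are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case and split an occupation string on whitespace."""
    return text.lower().split()


class NaiveBayesOccupationCoder:
    """Multinomial naive Bayes over word tokens with add-one smoothing."""

    def fit(self, strings, codes):
        self.class_counts = Counter(codes)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for text, code in zip(strings, codes):
            tokens = tokenize(text)
            self.token_counts[code].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        # No known token: leave the record for manual coding.
        if not any(t in self.vocab for t in tokens):
            return None
        total = sum(self.class_counts.values())
        best_code, best_score = None, -math.inf
        for code in sorted(self.class_counts):
            counts = self.token_counts[code]
            # +1 in the denominator reserves mass for unseen tokens.
            denom = sum(counts.values()) + len(self.vocab) + 1
            score = math.log(self.class_counts[code] / total)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_code, best_score = code, score
        return best_code


# Training data and unseen strings taken from the slides.
training = [("Farm horseman", "62460"), ("Shoe maker", "80110"),
            ("Fireman", "58100"), ("Stationer", "41000")]
model = NaiveBayesOccupationCoder().fit(*zip(*training))
for unseen in ["Farm horseman", "Boot maker", "Fireman", "Painter"]:
    print(unseen, "->", model.predict(unseen))
```

With so little training data, "Boot maker" is coded only because it shares the token "maker" with "Shoe maker", and "Painter" shares nothing and is left uncoded, which is one way to read the "?" in the diagram.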
  • Slide 28
  • Figure: "100% Asthma". Terms shown: miners asthma, spasmodic, collier's, miner's, dropsy, bronchial.
  • Slide 29
  • Slide 30
  • Creation of a fully linked vital events database for the whole of Scotland back to 1855. Diagram: vital events (24 million births, deaths and marriages). 1855 to 1974: digital images + index -> vital events database. 1974 to present: vital events database. Both feed into the fully linked vital events database.
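To make "fully linked" concrete, the sketch below links birth records to death records by approximate name match and a consistent year of birth. It is an illustration under stated assumptions, not the project's linkage method: the records are made up, the field names are hypothetical, and the similarity threshold and one-year tolerance are arbitrary choices.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity between two full names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_births_to_deaths(births, deaths, min_similarity=0.85, year_tolerance=1):
    """Pair each birth with its best-matching death record, if any."""
    links = []
    for birth in births:
        birth_name = f"{birth['forename']} {birth['surname']}"
        best = None
        for death in deaths:
            # Birth year implied by the death record: year of death minus age.
            implied_birth_year = death["year"] - death["age"]
            if abs(implied_birth_year - birth["year"]) > year_tolerance:
                continue
            death_name = f"{death['forename']} {death['surname']}"
            score = name_similarity(birth_name, death_name)
            if score >= min_similarity and (best is None or score > best[1]):
                best = (death["id"], score)
        if best is not None:
            links.append((birth["id"], best[0]))
    return links


# Made-up records for illustration only.
births = [
    {"id": "B1", "forename": "John", "surname": "Macdonald", "year": 1860},
    {"id": "B2", "forename": "Jean", "surname": "Stewart", "year": 1861},
]
deaths = [
    {"id": "D1", "forename": "John", "surname": "McDonald", "year": 1925, "age": 65},
    {"id": "D2", "forename": "James", "surname": "Stewart", "year": 1940, "age": 65},
]
print(link_births_to_deaths(births, deaths))
```

The year check acts as a cheap filter before the more expensive name comparison; at the scale of 24 million events a real system would also need blocking and a way to resolve competing candidate links.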
  • Slide 31
  • Large-scale family reconstruction studies and pedigrees
  • Slide 32
  • Gottfredsson, Magnús, et al. "Lessons from the past: familial aggregation analysis of fatal pandemic influenza (Spanish flu) in Iceland in 1918." Proceedings of the National Academy of Sciences 105.4 (2008): 1303-1308.
  • Slide 33
  • Slide 34
  • Acknowledgments. The Digitising Scotland project is funded by the ESRC. Support from National Records of Scotland is also gratefully acknowledged.