Misinterpretation of Data and the Importance of Metadata Bernie Gloyn Ontario DLI Training April 11, 2005

Misinterpretation of Data and the Importance of Metadata

Bernie Gloyn

Ontario DLI Training – April 11, 2005

Outline

• Crime rates example from Wendy

• Metadata

• Some considerations by data type
  – Census
  – Sample Survey
  – Administrative

• Comparisons
  – Crude vs standardized

Crime Rates Example

• Ebert & Roeper review of Michael Wilson's movie “Michael Moore Hates America”: Ebert doubted the claim that the Cdn crime rate is twice the US rate

• Moorelies.com | News: Whoa; Stuart Didn't See That One Coming

• Ebert conceded to the writer that the stats supported the claim (figures in the table below)

• Comparison of STC and US Bureau of Justice Statistics website stats

Crimes per 100,000 population, 2003

                     Canada      USA
All crimes            8,530    4,267
Violent crimes          958      523
Property crimes       4,275    3,744

Crime Rates Example

• Debunked by Craig from Canada

• Simplistic comparison
  – Similar category titles for violent and property crimes, but different definitions
  – Concluded violent crime is 2-3 times higher in the US; property crime rates are close
  – Bureau of Justice Statistics Crime & Justice Data Online
  – Canadian Statistics - Crimes by type of offence

Crimes per 100,000 population, 2002

                                          Canada      USA
Violent crime
  Homicide                                   1.9      5.6
  Robbery                                     85      146
  (US rape and aggravated assault are difficult to compare with Cdn sexual assault and assault)
Property crime
  B & E (Cdn) / Burglary (US)                879      746
  Theft (Cdn) / Larceny & Theft (US)       2,191    2,446
  Motor vehicle theft                        516      432
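The like-for-like comparison can be checked with a few lines of arithmetic. The sketch below uses only the 2002 rates from the table above; the dictionary keys and the function name are illustrative, not part of either agency's data.

```python
# Rates per 100,000 population for 2002, copied from the slide's table.
# The dictionary keys are illustrative labels, not official category codes.
rates = {
    "homicide":            {"canada": 1.9,  "usa": 5.6},
    "robbery":             {"canada": 85,   "usa": 146},
    "break_and_enter":     {"canada": 879,  "usa": 746},
    "theft":               {"canada": 2191, "usa": 2446},
    "motor_vehicle_theft": {"canada": 516,  "usa": 432},
}


def us_to_canada_ratio(offence):
    """Return the US rate divided by the Canadian rate for one offence."""
    return rates[offence]["usa"] / rates[offence]["canada"]


for offence in rates:
    print(f"{offence:<20} US/Canada = {us_to_canada_ratio(offence):.2f}")
```

For the two violent offences that can be matched, the US rate is roughly 1.7 to 3 times the Canadian rate, while the property crime ratios all sit within about 15% of 1.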

Metadata

• STC Policy on Informing Users of Data Quality
  – In place since 1978
    • Tightened up in 2000 in response to the 1999 Auditor General (AG) report
      » Looked at 4 surveys: LFS, CPI, MSM and UCRS
  – Recognised that “All statistics are to some extent estimates”
  – Statistics are to be used with awareness of their strengths and weaknesses: “fitness for use”
  – Key tool is the Integrated Meta Database, where you can see definitions, data sources and methods
    • Repository of information on STC surveys and programs

Metadata

• Can’t overemphasize the importance of good metadata: finding it and reading it
  – Definitions, Data Sources and Methods (recently revamped)
    • Questionnaire and reporting guides
    • Survey Description
    • Data sources and methodology
    • Data Accuracy
    • Documentation
    • Contact us
  – Statistics Canada: Canadian Community Health Survey

Metadata

  – Online Catalogue (OLC)
    • Canadian Community Health Survey: public use microdata file: Product main page
  – DLI website
    • DLI - Canadian Community Health Survey Cycle 1.1
  – DLI listserv
    • Ask and we will find out from the Division

Metadata

• With Public Use Microdata Files, the code book is very important
  – Gives the questions asked and the codes used for responses
  – “Missing values”, “refusals”, “don’t know” and “not applicable” are often assigned numeric codes
  – The numeric codes used are not consistent
  – To most software these numeric codes look like valid responses

Metadata

In the 1990 Health Promotion Survey there was a series of questions about alcohol consumption.

First they asked if the respondent EVER drank alcohol; if YES, they asked if the respondent drank within the last 12 months; and if YES again, they asked for the number of drinks on each of the past 7 days. The code book showed the number of drinks per day as:

81 F4MON 2 0096-0097 HOW MANY DRINKS DID YOU HAVE ON: MONDAY ?
     00     NONE                   4651/ 7334907
     01:40  NUMBER OF DRINKS       1403/ 2585080
     41     MORE THAN 40 DRINKS       1/     106
     98     QUESTION NOT ASKED     7648/10567910
     99     NOT STATED               89/  155377
     NOTE: F4MON-SUN NOT ASKED IF F4=1 OR F1=2 OR F2=2

82 F4TUE 2 0098-0099 HOW MANY DRINKS DID YOU HAVE ON: TUESDAY ?
     00     NONE                   4608/ 7306101
     01:40  NUMBER OF DRINKS       1447/ 2613991
     98     QUESTION NOT ASKED     7648/10567910
     99     NOT STATED               89/  155377
     NOTE: F4MON-SUN NOT ASKED IF F4=1 OR F1=2 OR F2=2
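A minimal sketch of why these codes matter, assuming the responses have been read into a plain list of integers. The codes 98 and 99 come from the code book entry above; the sample responses and function names are made up for illustration.

```python
# Codes from the F4MON code book entry: 98 = question not asked, 99 = not stated.
MISSING_CODES = {98, 99}


def naive_mean(values):
    """Average every value, treating the missing codes as real drink counts."""
    return sum(values) / len(values)


def mean_excluding_missing(values, missing_codes=MISSING_CODES):
    """Average only the valid responses; return None if there are none."""
    valid = [v for v in values if v not in missing_codes]
    if not valid:
        return None
    return sum(valid) / len(valid)


# Hypothetical responses for illustration only (not survey data).
responses = [0, 2, 98, 1, 99, 0, 98, 3]

print(naive_mean(responses))              # inflated by the 98 and 99 codes
print(mean_excluding_missing(responses))  # average over the valid answers only
```

Because more than half of the records in the code book carry code 98, an average that ignores the code book would be dominated by respondents who were never asked the question.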

Some Considerations by Data Type

• Census
  – Short form: 9 questions asked of 100% of households
  – Long form: 20% sample

• Sample Survey
  – Most data sets: LFS, GSS, NPHS, etc.

• Administrative
  – GST, Revenue Canada, Vital Stats, school enrollments, provincial health insurance, …

Census

• High quality, but still subject to non-sampling errors
  – coverage, measurement, non-response, processing errors

• Key documents are the Census Handbook, Census Dictionary and Census Technical reports

• Communiqué for revisions
  – Population and dwelling count amendments
  – Don’t change the Census base

Census

• Conceptual/definition changes over time can be very important

• Census family
  – Refers to a married couple (with or without children of either or both spouses), …. … A couple living common-law may be of opposite or same sex. “Children” in a census family include grandchildren living with their grandparent(s) but with no parents present.
  – census family, 2001 census

• Economic family
  – Refers to a group of two or more persons who live in the same dwelling and are related to each other by blood, marriage, common-law or adoption.
  – economic family, 2001 census

Sample Surveys

• Estimates
  – Estimate of the population characteristics based on a sample from a survey frame
  – Bigger sample gives better estimates

• Issue of sample size
  – 30,000 sample
  – Want a sub-population: retirees ~ 3000, males ~ 1400, immigrants ~ 200, BC ~ 40
  – Estimates become unstable as you break down the sample
  – Often forget that an estimate has a confidence interval
    • 73% with a CI of ±10 percentage points is not significantly different from 80%
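The effect of breaking down the sample can be sketched with the textbook margin of error for a proportion. This assumes simple random sampling and a 95% confidence level, which real survey designs only approximate; the function names are illustrative, and the sample sizes and the 73% figure are the ones on the slide.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p from a sample of size n.

    Assumes simple random sampling; complex survey designs usually need
    design-based variance estimates instead.
    """
    return z * math.sqrt(p * (1 - p) / n)


def within_interval(estimate, half_width, other):
    """True if `other` falls inside estimate +/- half_width."""
    return abs(estimate - other) <= half_width


# Successive breakdowns from the slide: full sample, retirees, males, immigrants, BC.
sample_sizes = [30000, 3000, 1400, 200, 40]
p = 0.73  # the estimate used in the slide's example

for n in sample_sizes:
    print(f"n = {n:>6}: {p:.0%} +/- {margin_of_error(p, n):.1%}")

# The slide's example: 80% lies inside 73% +/- 10 percentage points.
print(within_interval(0.73, 0.10, 0.80))
```

Under these assumptions the margin of error grows from about half a percentage point at n = 30,000 to well over 10 percentage points at n = 40, which is why a difference of 73% versus 80% cannot be read as real in the smallest cell.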

Sample Surveys

• Statistical measures of quality
  – Coefficient of Variation (CV)
    • Gives the standard deviation as a % of the mean (for a survey estimate, the standard error as a % of the estimate)
    • Measure of the fitness for use
  – The smaller the CV, the more reliable the estimate
  – CVs < or = 15% are generally considered reliable for most uses
  – CVs > 15% but < 33% are reliable for some purposes, with “caution”
  – CVs > 33% are unreliable and not published
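The CV bands above can be expressed as a small helper. This is a sketch using the 15% and 33% cut-offs from the slide; the function names are illustrative, and treating a CV of exactly 33% as unreliable is an assumption, since the slide leaves that boundary unspecified.

```python
def cv_percent(estimate, standard_error):
    """Coefficient of variation: the standard error as a percentage of the estimate."""
    if estimate == 0:
        raise ValueError("CV is undefined for an estimate of zero")
    return 100.0 * standard_error / abs(estimate)


def reliability(cv):
    """Map a CV (in percent) to the slide's reliability bands.

    Assumption: a CV of exactly 33% falls in the unreliable band.
    """
    if cv <= 15:
        return "reliable for most uses"
    if cv < 33:
        return "use with caution"
    return "unreliable, not published"


# Hypothetical estimate of 200 with a standard error of 20 gives a CV of 10%.
cv = cv_percent(200.0, 20.0)
print(cv, reliability(cv))
```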

Data Quality Symbols

These are the recommended data quality symbols that should be used when data quality assessment information is available.

Symbol             Meaning
E (superscript)    use with caution
F                  too unreliable to be published
(no symbol)        acceptable or better

When a figure is "too unreliable to be published," the data point is suppressed and the symbol F appears in the data cell.

When the figure is not accompanied by a data quality symbol, it means that the quality of the data was assessed to be "acceptable or better" according to the policies and standards of Statistics Canada. To denote specific levels of "acceptable or better" quality, letter grades such as A to D should be used.

Sample Surveys

• Sample values are weighted up to represent the population
  – 20% sample for census
    • Simple weight is 5; more complex weights are adjusted for characteristics, response rates, etc.
  – Example from Mike
    • Another health survey
    • Analyst confused the survey weight with the body weight and height asked in the survey
      – Used body weight as the survey weight
      – Survey weight was around 400
      – … number of obese Cdns!!
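A minimal sketch of how a survey weight is used, and of the mix-up described above. The records, field names and values are hypothetical; only the 20% sampling fraction and the weight of roughly 400 come from the slide.

```python
# A 20% sample means each respondent stands in for 1 / 0.20 = 5 people.
SAMPLING_FRACTION = 0.20
simple_weight = 1 / SAMPLING_FRACTION


def weighted_count(records, weight_field, flag_field):
    """Estimate a population count by summing weights over flagged records."""
    return sum(r[weight_field] for r in records if r[flag_field])


# Hypothetical respondents: each has a survey weight (people represented)
# and a body weight in kilograms collected by the questionnaire.
records = [
    {"survey_weight": 400.0, "body_weight_kg": 110.0, "obese": True},
    {"survey_weight": 380.0, "body_weight_kg": 70.0,  "obese": False},
    {"survey_weight": 420.0, "body_weight_kg": 95.0,  "obese": True},
]

correct = weighted_count(records, "survey_weight", "obese")    # people represented
mistaken = weighted_count(records, "body_weight_kg", "obese")  # kilograms, not people

print(simple_weight, correct, mistaken)
```

Both calls run without error, which is the point: nothing in the software flags that the second total is a sum of kilograms rather than a population estimate. Only the code book tells the analyst which variable is the survey weight.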

Sample Surveys

• Changes in the frame used for the sample
  – Annual Survey of Manufactures (ASM) moved to the Business Register (reference year 2000)
    • 25,000 incorporated firms were previously missing from survey coverage
      – Accounts for 5 percentage points (1/3) of the 15% increase from 1999 to 2000
  – ASM also changed its survey coverage
    • Included 35,000 incorporated firms below $30,000 in annual sales
      – Accounts for 2 percentage points of the 15% increase from 1999 to 2000
  – Almost half (7 of 15 percentage points) of the 15% annual increase came from coverage improvements
  – Annual Survey of Manufactures (ASM)

Administrative Data

• Original purpose for which the data were collected
  – Provincial health counts differ from Census counts

• Definitions used aren’t the same
  – Success rate is higher for students at some universities (mostly in QC)
  – Students there can deregister up to 4 weeks into a course; elsewhere the limit is 3 to 4 days

• Coverage of the universe (total population)
  – Not everyone files an income tax return

• Administrative changes can affect data series

Administrative Data

• Provide small area estimates
  – Normally postal code geography
  – Postal code can be problematic

• Highest income neighbourhood example

Crude vs Standardised

Comparisons between countries

Age      Country A                      Country B
         Population   Deaths   Rate     Population   Deaths   Rate
Total    1 000        50       0.05     1 000        40       0.04

Crude vs Standardised

Age      Country A                      Country B
         Population   Deaths   Rate     Population   Deaths   Rate
0-19     100          1        0.01     400          8        0.02
20-39    200          6        0.03     300          12       0.04
40-59    300          15       0.05     200          12       0.06
60+      400          28       0.07     100          8        0.08
Total    1 000        50       0.05     1 000        40       0.04

Crude vs Standardised

Age      Share of      Country A                    Country B
         population    Mortality    Standard        Mortality    Standard
                       rate         rate            rate         rate
0-19     0.10          0.01         0.001           0.02         0.002
20-39    0.20          0.03         0.006           0.04         0.008
40-59    0.30          0.05         0.015           0.06         0.018
60+      0.40          0.07         0.028           0.08         0.032
Total    1.00          0.05         0.050           0.04         0.060
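The standardised rates in this table can be reproduced by direct standardisation: each country's age-specific rates are weighted by one common age distribution. The sketch below uses the populations and deaths from the earlier table and takes Country A's age distribution as the standard, which matches the "Share of population" column; the variable and function names are illustrative.

```python
age_groups = ["0-19", "20-39", "40-59", "60+"]

# Populations and deaths by age group, copied from the slides.
population = {"A": [100, 200, 300, 400], "B": [400, 300, 200, 100]}
deaths = {"A": [1, 6, 15, 28], "B": [8, 12, 12, 8]}


def crude_rate(country):
    """Total deaths divided by total population."""
    return sum(deaths[country]) / sum(population[country])


def age_specific_rates(country):
    """Death rate within each age group."""
    return [d / p for d, p in zip(deaths[country], population[country])]


def standardized_rate(country, standard="A"):
    """Weight a country's age-specific rates by the standard population's age shares."""
    total = sum(population[standard])
    shares = [p / total for p in population[standard]]
    return sum(s * r for s, r in zip(shares, age_specific_rates(country)))


for country in ("A", "B"):
    print(country,
          f"crude = {crude_rate(country):.3f}",
          f"standardised = {standardized_rate(country):.3f}")
```

Country B has the lower crude rate (0.04 against 0.05) only because its population is younger. On the common age structure its rate is 0.06, higher than Country A's 0.05, as its higher rate in every age group would suggest.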

Crude vs Standardised

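• Another mortality comparison, but over time
  – 1951: 2.83 deaths per 1000 from heart disease
  – 1993: 1.93 deaths per 1000 from heart disease
  – Improvement from advances: 0.9?
  – Change due to progress: -2.19
  – Change due to aging: +1.29
  – Net change: -2.19 + 1.29 = -0.90, the observed drop from 2.83 to 1.93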

Thank you!