www.statcan.gc.ca

Automated geocoding using postal codes

An overview of the PCCF-SLI and PCCF+

Michael Tjepkema

Saeeda Khan

Health Analysis Division, Statistics Canada

April 15, 2015

Overview

1. Introduction to the PCCF and PCCF+

• Uses of small-area data

2. Components of a postal code

• SLI geocoding versus population-weighting

3. Typical case strategy

• Pitfalls of automated geocoding

4. Why PCCF+?

4/15/2015 Statistics Canada • Statistique Canada 2

Key points of presentation

• Consider using PCCF+ rather than PCCF-SLI if any of the following apply

• You want to use variables present on the PCCF+ which are not present in regular PCCF

• Your file is less than perfect with respect to postal codes

• You want help to evaluate the quality of the postal code on your data file

• The “vintage” of the postal codes on your file spans more than one census

• You want to do better coding in rural areas

• Your file includes postal codes used by residents of “incompletely enumerated Indian Reserves”

1. Introduction to the PCCF and PCCF+

Introductory remarks

• Postal codes are part of most administrative data sets

• PCCF, PCCF+, and related tools are now the standard

• Allows for the conversion of address and postal code attributes to standard geographical codes

• Used in data collection, processing, and analysis

• The resulting small-area geography has a variety of uses

• Familiarity with the methods, strengths, and limitations will help researchers exploit the potential

What is the PCCF?

• A flat file produced by STC that links postal codes to geographic areas

• Allows for:

• Association of postal codes to standard geographic areas

• Selection of statistical units by geographic areas

• Provides linkages (including an SLI) to block-face, dissemination block, and dissemination area

• However, some postal codes are linked only to post office locations, many serve multiple DAs, and some are non-residential (government offices, etc.)

Statistics Canada. Postal Codes Conversion File (PCCF), Reference Guide. Catalogue no. 92-153-G, no 02. Ottawa, ON: Statistics Canada, 2011.

What is PCCF+?

• A SAS control program, reference files derived from the PCCF, and a postal code population-weight file

• Assigns geographic identifiers based on postal codes

• Postal codes for rural areas are assigned to DA & DB using population-weighted random allocation

• Able to assign geographic coding from the first 5, 4, or 3 characters of the postal code, as well as from all 6

• Full diagnostic output permits resolution of results for potentially troublesome postal codes

• Provides residential and institutional coding separately

Wilkins R, Peters PA. PCCF+ Version 5K User’s Guide: Automated geocoding based on the Statistics Canada Postal Code Conversion File. Catalogue no. 82F0086-XDB. Ottawa, ON: Statistics Canada, 2011.

Uses of the PCCF and PCCF+

• A 2011 literature review identified 622 publications that used the PCCF or PCCF+

• Health Sciences 463 (74%)

• Social Sciences & Economics 93 (15%)

• Education, data, & statistics 34 (6%)

• Natural & applied sciences 12 (2%)

• Other 20 (3%)

• Articles appeared in 233 different journals, with CMAJ (23) and CJPH (19) the top two journals

Peller P. An analysis of the Postal Code Conversion File’s use in research. DLI research paper series, 2011. Calgary, AB: University of Calgary.

Uses of small area data

• Add policy relevance by aggregating to admin areas

• Health Regions, School Districts, etc…

• Deal with changes over time (boundary shifts)

• Assign neighbourhood SES and other confounders

• Determine point-distance, road distance, travel time

• Allow for studies of migration over time (longitudinal)

• Help in the imputation of missing data

• Obtain additional identifiers for record linkage

2. Components of a Postal code

Components of a postal code

• The postal code is a six-character code defined and maintained by Canada Post Corporation for the purpose of sorting and delivering mail

• Postal codes are not geographic attributes

• Only spatial in that mail is delivered by geographic area

• Six character code ‘ANA NAN’

• First 3 – Forward Sortation Area (FSA)

• Last 3 – Local Delivery Unit (LDU)

Statistics Canada. Postal Codes Conversion File (PCCF), Reference Guide. Catalogue no. 92-153-G, no 02. Ottawa, ON: Statistics Canada, 2011.
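The ‘ANA NAN’ structure described above can be checked and split mechanically. The following is a minimal Python sketch, not part of the PCCF or PCCF+; the function name split_postal_code is hypothetical, and the check covers only the letter-digit pattern, not whether the postal code actually exists.

```python
import re

# ANA NAN: letter-digit-letter, an optional space, then digit-letter-digit.
_POSTAL_CODE = re.compile(r"^([A-Z]\d[A-Z]) ?(\d[A-Z]\d)$")


def split_postal_code(raw):
    """Return (FSA, LDU) for a well-formed postal code, otherwise None."""
    match = _POSTAL_CODE.match(raw.strip().upper())
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_postal_code("k1a 0t6"))  # ('K1A', '0T6')
print(split_postal_code("K1A0T6"))   # ('K1A', '0T6')
print(split_postal_code("12345"))    # None
```

A pattern check of this kind only screens out malformed entries; a well-formed code may still be one that never existed.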

What is a postal code?

ANA NAN

• First 3 characters (ANA): Forward Sortation Area

• Last 3 characters (NAN): Local Delivery Unit

• Second character of the FSA: if 0 then rural, if 1-9 then urban

Province / Territory / Region        First character
Newfoundland and Labrador            A
Nova Scotia                          B
Prince Edward Island                 C
New Brunswick                        E
Eastern Québec                       G
Metropolitan Montréal                H
Western Québec                       J
Eastern Ontario                      K
Central Ontario                      L
Metropolitan Toronto                 M
Southwestern Ontario                 N
Northern Ontario                     P
Manitoba                             R
Saskatchewan                         S
Alberta                              T
British Columbia                     V
Northwest Territories and Nunavut    X
Yukon                                Y
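The first-character table and the rural/urban rule can be expressed as a simple lookup. This Python sketch is illustrative only: the function name describe_fsa is hypothetical, and it assumes the "0 versus 1-9" rule applies to the second character (the digit) of the FSA.

```python
# First character of the postal code -> province, territory or region,
# copied from the table on this slide.
REGION_BY_FIRST_CHARACTER = {
    "A": "Newfoundland and Labrador",
    "B": "Nova Scotia",
    "C": "Prince Edward Island",
    "E": "New Brunswick",
    "G": "Eastern Québec",
    "H": "Metropolitan Montréal",
    "J": "Western Québec",
    "K": "Eastern Ontario",
    "L": "Central Ontario",
    "M": "Metropolitan Toronto",
    "N": "Southwestern Ontario",
    "P": "Northern Ontario",
    "R": "Manitoba",
    "S": "Saskatchewan",
    "T": "Alberta",
    "V": "British Columbia",
    "X": "Northwest Territories and Nunavut",
    "Y": "Yukon",
}


def describe_fsa(fsa):
    """Return (region, 'rural' or 'urban') for a 3-character FSA."""
    fsa = fsa.strip().upper()
    if len(fsa) < 2 or not fsa[1].isdigit():
        raise ValueError(f"not a valid FSA: {fsa!r}")
    # Region is None when the first letter is not in the table.
    region = REGION_BY_FIRST_CHARACTER.get(fsa[0])
    # Assumption: second character 0 means rural, 1-9 means urban.
    delivery = "rural" if fsa[1] == "0" else "urban"
    return region, delivery


print(describe_fsa("K1A"))  # ('Eastern Ontario', 'urban')
print(describe_fsa("s0g"))  # ('Saskatchewan', 'rural')
```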

Components of a postal code

• Local Delivery Unit (LDU)

• Letter carrier delivery to ordinary urban address

• Community mailbox

• Apartment building

• Business building

• Large firm or organisation (CBC: M5W 1E6)

• Federal department or agency (Statistics Canada: K1A 0T6)

• Mail delivery route (suburban, rural, or mobile)

• General delivery and post office boxes (large or small)

Statistics Canada. Postal Codes Conversion File (PCCF), Reference Guide. Catalogue no. 92-153-G, no 02. Ottawa, ON: Statistics Canada, 2011.

Importance of Identifying Non-residential PCs

• In the following cases, we may not know much about the true place of residence, which could be any place in the CMA (or even further out)

• Government Offices, e.g., Statistics Canada

• Coroners Offices

• Children’s Aid Societies

• Hospitals in a Birth File

• UPS Store, Mail Boxes Etc.

Components of a postal code

Haydu G. The Postal Code – Geographic classification code conversion file, a tool for social science research. Paper presented at the 1979 annual meeting of the Canadian Association of Geographers, Victoria, BC, Canada.

Single-link (PCCF-SLI) vs. PCCF+

• PCCF-SLI forces each postal code to be assigned to a single DA & DB, regardless of how large the actual service area may be

• For most research purposes, the distribution of the population across the entire service area is needed

• PCCF+ uses a population-weighted method of geocoding where multiple-matches are possible

• As such, the distribution of respondents more accurately reflects the underlying population

• “Numerator-denominator consistency”
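The contrast between the two approaches can be sketched in a few lines of Python. This is not the PCCF+ SAS program; the postal code, DA labels, populations and function names below are all invented for illustration. It shows only the general idea: the single link sends every record to one DA, whereas population weighting spreads records across the DAs the postal code serves.

```python
import random
from collections import Counter

# Hypothetical link table: one made-up rural postal code serving three DAs.
# Each entry is (DA label, census population, is_single_link); all values
# are invented for illustration.
LINKS = {
    "X0X 0X0": [("DA-1", 600, True), ("DA-2", 300, False), ("DA-3", 100, False)],
}


def assign_single_link(postal_code):
    """Return the one DA flagged as the single link for this postal code."""
    for area, _population, is_single_link in LINKS[postal_code]:
        if is_single_link:
            return area
    raise LookupError(f"no single link for {postal_code}")


def assign_weighted(postal_code, rng):
    """Draw one DA at random, with probability proportional to population."""
    areas = [area for area, _population, _flag in LINKS[postal_code]]
    weights = [population for _area, population, _flag in LINKS[postal_code]]
    return rng.choices(areas, weights=weights, k=1)[0]


rng = random.Random(2015)  # fixed seed so the run can be repeated
records = ["X0X 0X0"] * 1000

sli_counts = Counter(assign_single_link(pc) for pc in records)
weighted_counts = Counter(assign_weighted(pc, rng) for pc in records)

print(sli_counts)       # every record lands in DA-1
print(weighted_counts)  # records spread over DA-1, DA-2 and DA-3
```

With the weighted draw, the share of records in each DA approaches that DA's share of the population, which is what keeps numerators consistent with census denominators.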

Population assignment using PCCF-SLI

[Map: Saskatchewan, Manitoba, Alberta]

Population assignment using PCCF+

[Map: Saskatchewan, Manitoba, Alberta]

Population assignment via PCCF-SLI & PCCF+

Percent of total 2006 census population in areas with no respondent assignment

Geographic unit    PCCF-SLI                        PCCF+
                   # of units   % of population    # of units   % of population
DA                 8,476        2.9                187          0
CT                 73           0.1                7            0
CMA                ..           ..                 ..           ..
CSD                1,438        0.6                109          0
CD                 ..           ..                 ..           ..

Population assignment using PCCF-SLI

[Map: Gatineau, Ottawa]

Population assignment using PCCF+

[Map: Gatineau, Ottawa]

Population misassignment using PCCF-SLI & PCCF+

Comparison of population coding errors using PCCF-SLI versus PCCF+ (5J)*

Geographic unit    PCCF-SLI                  PCCF+
                   % of total population     % of total population
DA                 37.4                      7.6
CT                 6.6                       1.4
CMA                4.3                       0.1
CSD                11.4                      2.7
CD                 1.1                       0.3

* Population coding errors are defined as the sum, over all areas at this geographic level, of the absolute value of the population coded less the population known from the census sample, expressed as a percentage of the total population in all areas at this level.
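The definition in the footnote can be written out as a short calculation. The Python sketch below uses made-up area labels and populations, and the function name population_coding_error is hypothetical; it also assumes the denominator is the census population summed over all areas at the given level.

```python
def population_coding_error(coded, census):
    """Sum of |coded - census| over all areas, as a percentage of the
    total census population at this geographic level."""
    areas = set(coded) | set(census)
    total = sum(census.values())
    absolute_difference = sum(
        abs(coded.get(area, 0) - census.get(area, 0)) for area in areas
    )
    return 100.0 * absolute_difference / total


# Invented example: three areas served by one postal code.
census = {"A": 500, "B": 300, "C": 200}
coded_single_link = {"A": 1000}                   # everyone sent to area A
coded_weighted = {"A": 520, "B": 290, "C": 190}   # spread by population

print(population_coding_error(coded_single_link, census))  # 100.0
print(population_coding_error(coded_weighted, census))     # 4.0
```

Note that each misplaced person is counted twice under this definition (once in the area that gained them and once in the area that lost them), so the measure can reach values well above the share of people actually misplaced.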

3. Typical case strategy

Typical case scenario

• Researcher has access to a data file containing records of individuals (students, clients, respondents)

• Data file contains postal code of place of residence

• Data file is missing some aspect required for analysis (socio-economic, environmental, geographic codes)

• Desire is to exploit some or all uses of small area geography as described above

• Postal codes may be appropriate for this purpose

Additional case scenarios

• Insufficient documentation

• Vintage of coding standard not included (don’t assume)

• Method of assigning multiple links not specified (SLI)

• Diagnostic codes not included

• Problem codes not identified (business, PO Box, etc…)

• Available geographic coding not suitable

• Not available at the level needed

• Not of correct vintage

• Too imprecise or inaccurate for intended use

Potential case strategies

• Several geocoding scenarios are possible

1. Only postal codes available

• Use PCCF-SLI or PCCF+ to assign geographic codes, etc…

2. Full street address available

• Use address geocoding software (GIS)

• Use PCCF-SLI or PCCF+ on postal code portion of address

3. Telephone numbers available

• Reverse lookup to get postal code or address

• Use 911 system maps to get location from address

4. Why PCCF+?

Why PCCF+ and not regular PCCF (with SLI=1)?

1. Supplemental coding

2. Postal codes less than perfect

3. Documentation and diagnostics

4. Vintage of postal codes

5. Population weighted approach

6. Postal codes used by residents of “incompletely enumerated Indian Reserves”

Why PCCF+? – 1: supplemental coding

• ID, PCODE

• PR, CD, CSD, CCSD

• CMA, CT, MIZ, ER, FED

• DA, BLK

• BLKURB*, DPL*

• LAT, LONG

• HR, SUB, AHR, ASUB

• QAIPPE, IMMTER

• CSIZE, NSREL, AIRLIFT, AR

• EA81uid, EA86uid, EA91uid, EA96uid, DA01uid, DA06uid, DA11uid

* Poorly coded and not recommended for analytic use

Also available from PCCF single-link

Why PCCF+? – 2: postal codes less than perfect

• Most files will include some postal codes that never existed (reporting or data capture errors)

• Sensitive files may omit the last digit of the postal code

• Some files may contain only the first 3 characters of the postal code

• PCCF+ can be used to geocode the above information
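One way to picture coding from partial postal codes is a lookup that falls back from the full 6 characters to 5, 4 and then 3. The Python sketch below is an assumption about the general approach, not the PCCF+ implementation; the reference table, area labels and function name geocode_with_fallback are all hypothetical.

```python
# Hypothetical reference tables keyed by prefix length. The postal codes and
# area labels are invented; real PCCF+ reference files are not reproduced here.
REFERENCE = {
    6: {"X0X0X0": "DA 0001"},
    5: {"X0X0X": "DA 0002"},
    4: {"X0X0": "CSD 01"},
    3: {"X0X": "CD 1"},
}


def geocode_with_fallback(postal_code):
    """Return (area, characters matched), trying the longest prefix first."""
    cleaned = postal_code.replace(" ", "").upper()
    for length in (6, 5, 4, 3):
        if len(cleaned) < length:
            continue
        area = REFERENCE[length].get(cleaned[:length])
        if area is not None:
            return area, length
    return None, 0


print(geocode_with_fallback("X0X 0X0"))  # ('DA 0001', 6)
print(geocode_with_fallback("X0X 0X9"))  # ('DA 0002', 5)
print(geocode_with_fallback("X0X"))      # ('CD 1', 3)
print(geocode_with_fallback("Z9Z 9Z9"))  # (None, 0)
```

Returning the number of characters matched alongside the area gives the analyst a simple indicator of how precise each assignment is.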

Why PCCF+? – 3: documentation & diagnostics

• Output is documented with user manual and version

• Method has been validated in many publications

• SAS code can be tweaked so results are exactly reproducible

• Define a specific kernel for probabilistic assignment

• Diagnostic codes for problem codes are provided

• Two outputs: Full file & Problem File

• DMT, DMTDIFF

• LINK (PROB)

• SOURCE

• NCSD, NCD

• RESFLG, INSTFLG

• RPF, SERV, PREC

• BLG NAME + ADR

• CSDNAME + TYPE

• CPCCODE
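The point about exactly reproducible results can be illustrated with a fixed random seed, assuming that "kernel" on this slide refers to the seed of the random number generator used for probabilistic assignment. The Python sketch below is not the PCCF+ SAS code; the candidate table and the function name assign_all are hypothetical.

```python
import random

# Hypothetical candidates: postal code -> (areas, population weights).
CANDIDATES = {
    "X0X 0X0": (["DA-1", "DA-2", "DA-3"], [600, 300, 100]),
    "X0X 0X1": (["DA-4", "DA-5"], [250, 750]),
}


def assign_all(postal_codes, seed):
    """Assign each record to an area using a generator with a fixed seed."""
    rng = random.Random(seed)
    assigned = []
    for postal_code in postal_codes:
        areas, weights = CANDIDATES[postal_code]
        assigned.append(rng.choices(areas, weights=weights, k=1)[0])
    return assigned


records = ["X0X 0X0", "X0X 0X1"] * 50
first_run = assign_all(records, seed=12345)
second_run = assign_all(records, seed=12345)
print(first_run == second_run)  # True: same seed, same input order
```

Reproducibility of this kind also depends on the input records being processed in the same order each time.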

Why PCCF+? – 4: “Vintage” of postal codes

• Postal codes on your file span more than one census

• PCCF+ assigns DA or EA from each census from 1981 through 2011

• Useful for time-varying variables

Why PCCF+? – 5: population weighting

• Almost all rural and several urban categories of postal code provide service to multiple DAs, CSDs, etc…

• Using SLI=1 in the PCCF assigns every occurrence of a postal code to a single set of geocodes

• Using single-link approach introduces systematic bias

• PCCF+ probabilistically assigns each postal code record using census derived population weights

Why PCCF+? – 6: Indian Reserves

• Your file includes postal codes used by residents of “incompletely enumerated Indian Reserves”

• These postal codes will not properly be coded by PCCF-SLI

• PCCF+ includes census population weights adjusted to account for estimates of the population living on the incompletely enumerated reserves

Limitations with PCCF-SLI & PCCF+

• In rural areas and at urban fringe, probabilistic assignment leads to random misclassification of DA and neighbourhood income quintiles

• Reduced ability to detect effects in rural areas

• Lower RRs and RDs for epidemiologic studies

• This is effect modification, not confounding, so stratifying the analysis by urban and rural areas is recommended

• Take care in interpreting lower effect estimates in rural versus urban areas
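Stratifying by urban and rural simply means computing the effect estimate separately within each stratum rather than pooling. The Python sketch below uses invented counts purely to show the mechanics; the function name rate_ratio and the choice of a rate ratio as the estimate are illustrative assumptions.

```python
def rate_ratio(cases_exposed, pop_exposed, cases_unexposed, pop_unexposed):
    """Rate in the exposed group divided by rate in the unexposed group."""
    return (cases_exposed / pop_exposed) / (cases_unexposed / pop_unexposed)


# Invented counts: (cases exposed, population exposed,
#                   cases unexposed, population unexposed).
strata = {
    "urban": (30, 1000, 15, 1000),
    "rural": (24, 1000, 20, 1000),
}

# Report one estimate per stratum instead of a single pooled value.
for name, counts in strata.items():
    print(name, round(rate_ratio(*counts), 2))
```

In this made-up example the rural estimate is closer to 1 than the urban one, which is the pattern the slide warns may reflect misclassification rather than a genuinely weaker effect.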

Limitations with PCCF and PCCF+

• Postal codes may change over time

1. Many technical changes to address ranges

• Usually no change at block-face or block level

• Very little change at higher levels

2. Some reuse of retired postal codes within same FSA

3. Two FSAs in British Columbia moved in the mid-90s

• Moral – Code as received and interpret the output

Concluding remarks

• Small-area geography & spatial coordinates are part of most data sets and useful in most studies

• Familiarity with the methods, their limitations, and the interpretation of data helps researchers exploit the data's potential more meaningfully

• It is not enough to use the data mechanically; users need to think about what they are doing and why

More information…

• Contact:

• HAD-DAS@statcan.gc.ca

• Acknowledgments

• Russell Wilkins (retired) & Paul A Peters (University of New Brunswick)

Extra slides

FSAs do not respect CSD boundaries

Limitation of SLI (e.g., 2001 Census Geography)

• Over a third of the total population of rural and small town Canada can never get the correct DA code when using the PCCF SLI since nearly 11,000 DAs are never linked to postal codes when only the SLI is selected.

• Also at the CSD level, over a quarter of all CSDs never get coded using SLI. In rural and small town Canada, nearly 30% of CSDs never get coded using the SLI.
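The reason some areas can never be coded under the single link is that only areas carrying the SLI flag are reachable. The Python sketch below shows how the unreachable areas and their population share could be tallied from a link table; the table, populations and function name never_coded_share are invented for illustration and do not reproduce the PCCF.

```python
def never_coded_share(links, population):
    """Return (areas never reachable via the single link,
    their share of the total population in percent)."""
    reachable = {area for _postal_code, area, is_single_link in links
                 if is_single_link}
    unreachable = set(population) - reachable
    total = sum(population.values())
    share = 100.0 * sum(population[area] for area in unreachable) / total
    return sorted(unreachable), share


# Invented link table: (postal code, area, is_single_link).
links = [
    ("X0X 0X0", "DA-1", True),
    ("X0X 0X0", "DA-2", False),
    ("X0X 0X0", "DA-3", False),
    ("X1X 1X1", "DA-4", True),
]
population = {"DA-1": 600, "DA-2": 300, "DA-3": 100, "DA-4": 1000}

print(never_coded_share(links, population))  # (['DA-2', 'DA-3'], 20.0)
```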
