Page 1

Editing a Mixture of Canadian 2006 Census and Tax Data

Mike Bankier
Statistics Canada

2006 Work Session on Statistical Data Editing

bankier@statcan.ca

Page 2

Introduction

• Census respondents can give permission to link to their tax form rather than answer the 13-part census income question on the 20% sample long form.

• Early returns indicate a permission rate of 83%.

• Done to reduce the level of response burden; in addition, the partial/total non-response (NR) rate was rising for income.

• Also, census responses are often approximate while tax responses are generally very accurate.

Page 3

Overview of Talk

• Brief review of census/tax record linkage.

• Census data collection and processing prior to edit and imputation (E&I).

• Strategy to perform E&I on a mixture of census and income tax data.

Page 4

Census/Tax Record Linkage

• STC’s Generalized Record Linkage System (GRLS), based on Fellegi/Sunter, will be used.

• Name, birthdate, address, telephone number, sex, marital status, disability status and labour activity status (but not SIN) used to link.

• Nicknames, reordering names, accounting for typographic errors, search across Canada and more weight for common names will be used to achieve the expected 85% match rate.

Page 5

Census/Tax Record Linkage

• Only very good matches retained since incorrect matches can generate undesirable outliers.

• No manual review done of all links because of large volume of data.

• Parameters fine-tuned by running the linkage several times and assessing the quality of links for a sample of persons.
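
As a rough illustration of Fellegi/Sunter-style scoring (this is not the GRLS implementation), the sketch below sums log-likelihood-ratio agreement weights over the linkage fields and retains only pairs whose total weight clears a high cutoff. The field subset, the m- and u-probabilities and the threshold are hypothetical values chosen for illustration only.

```python
import math

# Hypothetical m-probabilities: P(field agrees | the two records are a true match)
M_PROBS = {"name": 0.95, "birthdate": 0.97, "address": 0.80, "sex": 0.99}
# Hypothetical u-probabilities: P(field agrees | the two records are not a match)
U_PROBS = {"name": 0.01, "birthdate": 0.003, "address": 0.02, "sex": 0.50}


def match_weight(agreements):
    """Sum the per-field log2 likelihood ratios for one candidate record pair."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = M_PROBS[field], U_PROBS[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def is_retained(agreements, threshold=15.0):
    """Keep only very good matches: total weight at or above a high cutoff."""
    return match_weight(agreements) >= threshold


all_agree = {"name": True, "birthdate": True, "address": True, "sex": True}
name_only = {"name": True, "birthdate": False, "address": False, "sex": False}
print(round(match_weight(all_agree), 2), is_retained(all_agree))
print(round(match_weight(name_only), 2), is_retained(name_only))
```

Raising the threshold trades match rate for fewer false links, which is the kind of parameter that would be tuned by rerunning the linkage and checking a sample of links.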

Page 6

Data Collection/Processing Prior to E&I

• In 2001, enumerators listed dwellings and dropped off a questionnaire. Questionnaires completed and mailed back by respondent.

• In 2006, dwellings listed in advance and questionnaires were mailed to them for approximately 2/3 of dwellings. Other 1/3 treated the same way as in 2001.

• 20% of questionnaires completed over the Internet.

Page 7

Data Collection/Processing Prior to E&I

• Completed questionnaires scanned and data captured using intelligent character recognition.

• Any responses not captured were keyed from the imaged questionnaire.

• In 2001, corrections made before keying (for example, cents recorded as dollars) but not feasible for 2006.

• In 2004 test, error rate of 11% for income variables.

Page 8

Data Collection/Processing Prior to E&I

• Non-respondents or partial respondents with non-response to many questions were phoned or visited.

• Coverage edits applied at processing centre and persons were added or subtracted occasionally.

Page 9

Data Collection/Processing Prior to E&I

• Edits flagged persons with income responses outside limits.

• Reviewed manually by comparing to correlated characteristics, looking at questionnaire image and manually modifying if necessary.
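
A limit edit of the kind described above can be sketched as follows. The field names and dollar limits are hypothetical; the slides do not give the actual limits used.

```python
# Hypothetical plausibility limits (in dollars) for two income fields.
LIMITS = {
    "wages": (0, 500_000),
    "self_employment": (-200_000, 500_000),
}


def flag_out_of_limits(person):
    """Return the reported income fields whose amounts fall outside their limits."""
    flagged = []
    for field, (low, high) in LIMITS.items():
        amount = person.get(field)
        # Missing amounts are left for imputation rather than flagged here.
        if amount is not None and not (low <= amount <= high):
            flagged.append(field)
    return flagged


# A wage of 45,000.00 captured without its decimal point becomes 4,500,000.
person = {"wages": 4_500_000, "self_employment": -1_000}
print(flag_out_of_limits(person))
```

Flagged persons would then go to manual review against the questionnaire image, as the slide describes.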

Page 10

Data Collection/Processing Prior to E&I

• Majority of income errors the result of
  – Decimals not recognized or not provided
  – Confusion between income sources
  – Monthly amounts reported
  – Occasionally erroneous amounts entered as prank

• Tax forms excluded from manual process because linkage done later and tax data mostly error free.

Page 11

Adjustments Done – Coverage Studies

• Dwelling Classification Survey revisited sample of households to determine if they had been classified correctly as not part of housing stock, unoccupied or occupied. Census data base adjusted for estimated undercoverage and overcoverage.

• Reverse Record Check measures undercoverage and overcoverage from all sources, is used to adjust the provincial population totals but does not adjust the Census data base.

Page 12

E&I of the Income Questions

• With completion of the tax/census linkage, income data from Census and tax sources will be available on the Census data base.

• Canadian Edit and Imputation System (CANCEIS) will be used for all Census variables including income to perform
  – Deterministic imputation
  – Minimum change donor imputation
  – Derive new variables

Page 13

E&I of the Income Questions

• Assumed income data given by most respondents is correct so every attempt will be made to change as few responses as possible.

• Some fields imputed deterministically.

• Donor imputation used to resolve NR.

• Also balance edits to make sure income components sum to within 10% of total income.

• Total income is adjusted in later step to ensure perfect agreement with components.
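
The balance edit and the later reconciliation step can be illustrated with a short sketch. It assumes the 10% tolerance is measured relative to the reported total; the slides do not say how the tolerance is computed, and the function names are hypothetical.

```python
def passes_balance_edit(components, total, tolerance=0.10):
    """True if the components sum to within `tolerance` (a fraction) of the total."""
    return abs(sum(components) - total) <= tolerance * abs(total)


def reconcile_total(components):
    """Later step: set total income to the exact sum of its components."""
    return sum(components)


components = [40_000, 3_000, 500]                  # illustrative amounts only
print(passes_balance_edit(components, 45_000))     # sum is 1,500 below the total
print(passes_balance_edit(components, 60_000))     # sum is 16,500 below the total
print(reconcile_total(components))
```

Allowing a tolerance at the edit stage avoids changing responses that are only slightly inconsistent, in keeping with the aim of changing as few responses as possible.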

Page 14

E&I of the Income Questions

• Series of CANCEIS modules used.

• First three modules
  – Merge tax and census data together.
  – Calculate average employment income by occupation and geography (SAS) for later use as matching variable.
  – Define strata to be used in later modules.
  – Determine status for each income field (income with amount reported, income indicated, loss indicated, no income, non-response).
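
The status determination for a single income field might look like the sketch below. The five status values come from the slide; the inputs (a captured amount plus three check boxes) and the order of precedence among them are assumptions made for illustration.

```python
from enum import Enum


class IncomeStatus(Enum):
    AMOUNT_REPORTED = "income with amount reported"
    INCOME_INDICATED = "income indicated"
    LOSS_INDICATED = "loss indicated"
    NO_INCOME = "no income"
    NON_RESPONSE = "non-response"


def income_status(amount=None, income_box=False, loss_box=False, none_box=False):
    """Classify one income field from hypothetical capture outputs.

    Assumed precedence: a captured amount wins, then the loss box,
    then the income box, then the "no income" box.
    """
    if amount is not None:
        return IncomeStatus.AMOUNT_REPORTED
    if loss_box:
        return IncomeStatus.LOSS_INDICATED
    if income_box:
        return IncomeStatus.INCOME_INDICATED
    if none_box:
        return IncomeStatus.NO_INCOME
    return IncomeStatus.NON_RESPONSE


print(income_status(amount=1_200).value)
print(income_status(loss_box=True).value)
print(income_status().value)
```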

Page 15

E&I of the Income Questions

• Modules 4 to 6 impute missing income responses while ensuring total within 10% of sum of components.

• Module 4 imputes partial respondents who provided total income.

• Module 5 imputes partial respondents who did not provide total income but provided all the components of employment income.

• Module 6 imputes all other partial and total non-respondents to the income question.
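
The routing of records to Modules 4 to 6 can be sketched as a simple decision rule. The field names, including the list of employment income components, are hypothetical; the record is assumed to have at least one missing income field.

```python
# Hypothetical names for the employment income components.
EMPLOYMENT_FIELDS = ("wages", "farm_self_employment", "nonfarm_self_employment")


def imputation_module(record):
    """Pick the imputation module for a record with missing income fields.

    `record` maps field name to an amount, or to None when missing.
    """
    if record.get("total_income") is not None:
        return 4  # partial respondent who provided total income
    if all(record.get(f) is not None for f in EMPLOYMENT_FIELDS):
        return 5  # no total, but every employment component provided
    return 6      # all other partial and total non-respondents


has_total = {"total_income": 52_000, "wages": None}
has_employment = {"total_income": None, "wages": 48_000,
                  "farm_self_employment": 0, "nonfarm_self_employment": 0}
nothing = {"total_income": None, "wages": None}
print(imputation_module(has_total),
      imputation_module(has_employment),
      imputation_module(nothing))
```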

Page 16

E&I of the Income Questions

• Modules 7 and 8 select a sample of respondents with no pension benefits and impute positive amounts through donor imputation.

• Modules 9 and 10 do something similar but for employment insurance benefits.

• Module 11 derives other government benefits such as old age security pension.

Page 17

E&I of the Income Questions

• Module 12 uses donor imputation to resolve non-response to the income tax field.

• Module 13 derives total income after tax.

• Other modules aggregate income to the family and household levels plus derive 2 low income flags.
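
A minimal sketch of the derivation and aggregation steps follows. It assumes after-tax income is total income minus the income tax field and that each person record carries a household identifier; the key names are hypothetical and the low income flags are not modelled.

```python
from collections import defaultdict


def after_tax_income(total_income, income_tax):
    """Derived variable: total income less income tax (assumed definition)."""
    return total_income - income_tax


def household_totals(persons):
    """Aggregate person-level after-tax income to the household level."""
    totals = defaultdict(int)
    for person in persons:
        totals[person["household_id"]] += after_tax_income(
            person["total_income"], person["income_tax"]
        )
    return dict(totals)


# Illustrative person records only.
persons = [
    {"household_id": "H1", "total_income": 50_000, "income_tax": 9_000},
    {"household_id": "H1", "total_income": 30_000, "income_tax": 4_000},
    {"household_id": "H2", "total_income": 20_000, "income_tax": 1_500},
]
print(household_totals(persons))
```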

Page 18

E&I of the Income Questions

• Donor selection edits extensively used to restrict which of the records that pass the edits can be used as donors.

• Reduces the number of outliers generated through imputation.

Page 19

E&I of the Income Questions

• In search for donors, distance measure applies larger weights to income fields considered more important or reliable such as total income.

• Numeric amount can be missing but boxes checked can indicate that amount should be negative. Distance measure can be configured to almost guarantee that negative quantity will then be imputed.
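
The two ideas above, larger weights for more reliable fields and a configuration that strongly favours donors with a negative amount when a loss is indicated, can be sketched as follows. This is not the CANCEIS distance measure; the weights, the scaling constant, the penalty and the field names are all hypothetical.

```python
# Hypothetical weights: total income counts three times as much as wages.
WEIGHTS = {"total_income": 3.0, "wages": 1.0}


def distance(recipient, donor, weights, scale=100_000):
    """Weighted sum of scaled absolute differences over the matching fields.

    Fields missing on the recipient are skipped.
    """
    total = 0.0
    for field, weight in weights.items():
        value = recipient.get(field)
        if value is None:
            continue
        total += weight * min(abs(value - donor[field]) / scale, 1.0)
    return total


def sign_penalty(recipient, donor, field, penalty=1_000.0):
    """Large penalty when a loss is indicated but the donor amount is not negative."""
    if recipient.get(field + "_loss_indicated") and donor[field] >= 0:
        return penalty
    return 0.0


def nearest_donor(recipient, donors, weights, imputed_field):
    """Pick the donor with the smallest distance plus sign penalty."""
    return min(
        donors,
        key=lambda d: distance(recipient, d, weights)
        + sign_penalty(recipient, d, imputed_field),
    )


recipient = {"total_income": 30_000, "wages": 35_000,
             "self_employment": None, "self_employment_loss_indicated": True}
donor_a = {"total_income": 30_500, "wages": 30_000, "self_employment": 500}
donor_b = {"total_income": 28_000, "wages": 34_000, "self_employment": -6_000}

chosen = nearest_donor(recipient, [donor_a, donor_b], WEIGHTS, "self_employment")
print(chosen["self_employment"])
```

On distance alone donor_a is slightly closer, but the penalty makes donor_b the preferred donor, so a negative amount is imputed. Because the penalty is finite, a non-negative donor can still be used when no donor with a loss exists, which matches the "almost guarantee" wording.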

Page 20

Other Changes in E&I Since 2001

• Number of strata will be reduced dramatically since some variables used for stratification in 2001 are now used in the distance measure to identify donors; this reduces boundary effects.

• Also, in the past, exact matches within a stratum were required, while with CANCEIS near matches will be allowed (e.g. age difference of 3 years). In the past, default imputation was sometimes used, while with CANCEIS a donor will always be found.
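
The boundary effect mentioned above can be seen in a small sketch that contrasts stratifying on age group with using age in the distance measure. The ten-year group width and the linear form of the age distance are assumptions for illustration.

```python
def same_stratum(recipient_age, donor_age, width=10):
    """Stratified matching: donors must fall in the same age group."""
    return recipient_age // width == donor_age // width


def age_distance(recipient_age, donor_age, max_diff=3):
    """Near matching: 0.0 for equal ages, rising linearly and capped at 1.0
    once the ages differ by more than `max_diff` years."""
    return min(abs(recipient_age - donor_age) / (max_diff + 1), 1.0)


# Ages 39 and 40 are one year apart but sit on either side of a group boundary.
print(same_stratum(39, 40), age_distance(39, 40))
# Ages 30 and 39 share a group although they are nine years apart.
print(same_stratum(30, 39), age_distance(30, 39))
```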

Page 21

Differences in Processing of Census/Tax Data

• During donor imputation, data from tax records will generally be treated the same as data from census forms.

• For tax data, will derive
  – Quebec provincial tax
  – Child Benefits
  – GST Credits

• For census form data, Child Benefits, GST Credits will be derived.

Page 22

Differences in Processing of Census/Tax Data

• When adjusting for under-reporting of pensions and employment insurance, tax responses are not adjusted because of policy not to modify them.

• When imputing income tax field from census forms, donors restricted to tax forms because of poor quality of responses on census forms.

Page 23

Evaluation of Income E&I

• On experimental basis, income responses were blanked out and then CANCEIS imputed the blanks.

• CANCEIS was quite effective at replicating responses and preserving distributions when matching variables were correlated with the variable being imputed.
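
The blank-and-impute evaluation can be sketched as below. The imputer here is a simple nearest-age donor rule standing in for CANCEIS, and the records are made-up values, so the resulting error figure illustrates the method only and says nothing about actual performance.

```python
def blank_and_impute_error(records, field, blank_ids, impute):
    """Blank `field` on the chosen records, impute it, and return the
    mean absolute difference between imputed and true values."""
    donors = [r for i, r in enumerate(records) if i not in blank_ids]
    errors = []
    for i in blank_ids:
        truth = records[i][field]
        blanked = dict(records[i], **{field: None})
        errors.append(abs(impute(blanked, donors, field) - truth))
    return sum(errors) / len(errors)


def nearest_by_age(recipient, donors, field):
    """Stand-in imputer: copy the value from the donor closest in age."""
    donor = min(donors, key=lambda d: abs(d["age"] - recipient["age"]))
    return donor[field]


# Illustrative records only.
records = [
    {"age": 30, "wages": 40_000},
    {"age": 31, "wages": 42_000},
    {"age": 60, "wages": 20_000},
    {"age": 61, "wages": 21_000},
]
mae = blank_and_impute_error(records, "wages", {1, 3}, nearest_by_age)
print(mae)
```

In this toy data age is correlated with wages, so the imputed values land close to the blanked ones; with an uncorrelated matching variable the error would be larger, which is the pattern the slide reports.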

Page 24

Future Evaluations of Income

• Some people provide permission to link and also answer the income question on the census form.

• It will be interesting to compare the tax and census responses for these people.

• In 2004 test, census income data often rounded to nearest thousand or five thousand.

• Mode effects (paper versus internet) may also be studied.
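
One simple way to compare rounding behaviour between the two sources is to compute the share of non-zero amounts that are exact multiples of a thousand. The sketch below uses made-up amounts for the same four hypothetical people; it is an illustration of the comparison, not a result.

```python
def share_rounded(amounts, base=1_000):
    """Share of non-zero amounts that are exact multiples of `base`."""
    nonzero = [a for a in amounts if a != 0]
    if not nonzero:
        return 0.0
    return sum(1 for a in nonzero if a % base == 0) / len(nonzero)


# Illustrative amounts only.
census_amounts = [30_000, 45_000, 52_000, 18_500]
tax_amounts = [30_412, 44_876, 51_930, 18_500]
print(share_rounded(census_amounts), share_rounded(tax_amounts))
print(share_rounded(census_amounts, base=5_000))
```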

Page 25

Future Changes to E&I

• It is hoped that we can eliminate certain deterministic modules by obtaining Child Benefits, for example, from other sources.

• Using CANCEIS, it may be possible to reduce the number of modules used in later censuses and improve consistency with labour and education variables.

Page 26

Conclusions

• Many changes to processing including use of tax data, new questionnaire layout for scanning, use of new E&I system.

• These changes will require careful monitoring during production and may require fine-tuning.

• Given high quality of tax data, its availability should prove useful.