Data Quality
Sharp project 5
June 2010

Statistical Problems with Data Quality in EHR

• Missing data
• Uncertain diagnosis
• Uneven/unequal precision / measurement error
• Bias
• …

Missing Data: (Rage in Statistical Theory)

• Common problem with observational/retrospective data
• Statistical approaches
  – Imputation
  – Multiple imputation (MI) (Statisticians have acronyms too)
  – Regression with residual error
  – Draw from posterior distribution

Missing Data: Empirical approach

• Regression on Y with missing X-variables
• “X is missing” is also information.
• Analyze data set using
  – Imputation (mean?)
  – “Missing” indicator
  – Empirical approach: let data tell you what to do
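
The combination of mean imputation and a “missing” indicator can be sketched as follows. This is a minimal illustration on simulated data, not the project's method. The data-generating values, the small `ols` helper and all variable names are assumptions made for the example. Missing X values are filled with the observed mean and a 0/1 “missing” column is added, so the fit itself shows whether missingness carries information about Y.

```python
import random

random.seed(0)

def ols(rows, y):
    """Least-squares coefficients via the normal equations (small problems only)."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Simulated data (assumed): Y depends on X and also on whether X was recorded.
n = 4000
x_true = [random.gauss(0, 1) for _ in range(n)]
missing = [1 if random.random() < 0.3 else 0 for _ in range(n)]
y = [1.0 + 2.0 * x + 1.5 * m + random.gauss(0, 1)
     for x, m in zip(x_true, missing)]

# Mean imputation: replace each missing X with the mean of the observed X.
observed = [x for x, m in zip(x_true, missing) if m == 0]
x_mean = sum(observed) / len(observed)
x_filled = [x_mean if m else x for x, m in zip(x_true, missing)]

# Design matrix: intercept, imputed X, "missing" indicator.
design = [[1.0, x, float(m)] for x, m in zip(x_filled, missing)]
intercept, beta_x, beta_missing = ols(design, y)

print(f"slope for X: {beta_x:.2f}, coefficient for 'X is missing': {beta_missing:.2f}")
```

An indicator coefficient clearly different from zero suggests that missingness is informative. The indicator approach is known to give biased estimates in some settings, for example when X is correlated with other covariates, so it is better treated as an exploratory check than as a final analysis.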

Uncertain Diagnosis

• Universal problem with health data
• No gold standard
• Disease/health is a spectrum, not a dichotomy
• Probabilistic perspective
  – Probability (Peripheral Arterial Disease)
  – From {0,1} to [0,1] as phenotype
  – More realistic phenotype?

Uncertain Diagnosis

• Result is a probability
• Probability is a posterior distribution of a 0/1 variable
  – Use p itself (certainty equivalent)
    • Analogous to single imputation
  – Use multiple imputation
    • “1” with probability p, “0” with probability 1-p
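
The two options on this slide can be compared in a short sketch, shown here for estimating disease prevalence. The probabilities are simulated and every name and value is an assumption made for illustration. The certainty-equivalent estimate averages p directly. The multiple-imputation estimate draws M complete 0/1 data sets and pools them with Rubin's rules, so the total variance includes the extra uncertainty from not knowing each diagnosis.

```python
import random
import statistics

random.seed(0)

# Hypothetical posterior probabilities of disease for 500 patients.
p = [random.betavariate(0.5, 0.5) for _ in range(500)]
n = len(p)

# Option 1: certainty equivalent, use p itself.
prev_ce = statistics.fmean(p)

# Option 2: multiple imputation, draw "1" with probability p, else "0".
M = 20
estimates, within_vars = [], []
for _ in range(M):
    diagnosis = [1 if random.random() < pi else 0 for pi in p]
    q = sum(diagnosis) / n
    estimates.append(q)
    within_vars.append(q * (1 - q) / n)  # binomial variance of one estimate

# Rubin's rules: pooled estimate, within, between and total variance.
q_bar = statistics.fmean(estimates)
w_bar = statistics.fmean(within_vars)
b = statistics.variance(estimates)
total_var = w_bar + (1 + 1 / M) * b

print(f"certainty equivalent: {prev_ce:.3f}")
print(f"multiple imputation:  {q_bar:.3f} (SE {total_var ** 0.5:.3f})")
```

The two point estimates should agree closely. The between-imputation term `b` is the part of the variance that an analysis of a single imputed data set would miss.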

Uncertain Diagnosis: PAD example (eMERGE)

• Mayo Vascular Lab Database: n=18000
• Gold standard: Ankle/Brachial Index (ABI)
• Use of diagnostic / procedural codes
  – ICD-9 / HICDA / CPT
• Logistic regression of gold standard (PAD by ABI) on diagnostic codes
  – Gives posterior probability of PAD
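
The regression step might look like the sketch below. It uses simulated data in place of the vascular lab database, two hypothetical binary predictors (`dx` for "has a diagnostic code", `proc` for "has a procedure code") and assumed true coefficients. It fits the model by plain gradient ascent on the log-likelihood, chosen only to keep the example dependency-free. None of this is taken from the actual eMERGE analysis.

```python
import math
import random
from collections import Counter

random.seed(1)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed "true" model used only to simulate gold-standard PAD status.
TRUE_BETA = (-2.0, 2.5, 1.5)  # intercept, dx code, procedure code

# Count patients in each (dx, proc, pad) cell; predictors are binary.
cells = Counter()
for _ in range(5000):
    dx = int(random.random() < 0.3)
    proc = int(random.random() < 0.2)
    logit = TRUE_BETA[0] + TRUE_BETA[1] * dx + TRUE_BETA[2] * proc
    pad = int(random.random() < sigmoid(logit))
    cells[(dx, proc, pad)] += 1

n = sum(cells.values())

# Logistic regression of PAD on the codes, fitted by gradient ascent
# on the average log-likelihood.
beta = [0.0, 0.0, 0.0]
for _ in range(5000):
    grad = [0.0, 0.0, 0.0]
    for (dx, proc, pad), count in cells.items():
        x = (1.0, dx, proc)
        resid = pad - sigmoid(sum(b * xi for b, xi in zip(beta, x)))
        for j in range(3):
            grad[j] += count * resid * x[j] / n
    beta = [b + g for b, g in zip(beta, grad)]  # step size 1.0

def pr_pad(dx, proc):
    """Fitted probability of PAD given the two code indicators."""
    return sigmoid(beta[0] + beta[1] * dx + beta[2] * proc)

for dx in (0, 1):
    for proc in (0, 1):
        print(f"dx={dx} proc={proc}  Pr(PAD)={pr_pad(dx, proc):.2f}")
```

The fitted `pr_pad` values play the role of the [0,1] phenotype: each patient's code pattern maps to a probability instead of a yes/no label.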

Uncertain Diagnosis

• Model for Pr{PAD}: 90% predictive value
• Export model for Pr{PAD} to patients without gold standard ascertainment?
• (Coding practices?)

Uncertain Diagnosis

• Use Pr{PAD} in analysis of
  – Incidence of PAD
  – Incidence trends
  – Surveillance
  – Analysis of etiology, risk factors

Unequal Precision of Continuous Phenotype

• eMERGE example: Red Blood Count
• Use retrospective laboratory data
• N=3000, K=20,000
  – 1 measurement to 100 measurements/subject
• Account for differential precision
• Components of variance
• Weighted regression?
• Posterior distribution: same model fits
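
One way to read "account for differential precision" is a two-level variance-components model, where each subject's true level gets a normal posterior whose width depends on how many measurements that subject has. The sketch below assumes the population mean and both variance components are already known. All numbers and names are hypothetical, and in practice the components would have to be estimated from the data.

```python
import random
import statistics

random.seed(2)

# Hypothetical variance components (assumed known here for simplicity).
MU = 4.7            # population mean of the phenotype
VAR_BETWEEN = 0.16  # variance of true levels between subjects
VAR_WITHIN = 0.09   # variance of repeated measurements within a subject

def posterior(values, mu=MU, var_b=VAR_BETWEEN, var_w=VAR_WITHIN):
    """Normal-normal posterior (mean, variance) for one subject's true level."""
    prior_precision = 1.0 / var_b
    data_precision = len(values) / var_w
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * mu
                            + data_precision * statistics.fmean(values))
    return post_mean, post_var

# One subject with a single lab value, one with 100 values near the same level.
one_value = [5.3]
many_values = [random.gauss(5.3, VAR_WITHIN ** 0.5) for _ in range(100)]

mean_1, var_1 = posterior(one_value)
mean_100, var_100 = posterior(many_values)

print(f"1 measurement:    mean {mean_1:.2f}, variance {var_1:.4f}")
print(f"100 measurements: mean {mean_100:.2f}, variance {var_100:.4f}")
```

The subject with a single measurement is pulled toward the population mean and keeps a wide posterior, while the subject with 100 measurements is left almost unchanged. The inverse of each posterior variance is also a natural weight for a weighted regression.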

Sample from Posterior Distribution

• Missing data, uncertain diagnosis, unequal precision can all be represented by sampling from posterior distribution
• They are all the “same problem”
• Statistical / computational tools for this have been developed
  – Markov Chain Monte Carlo (MCMC)
  – Multiple Imputation

Summary: Data Quality

• ‘Data’ is not ‘a number’ but ‘a posterior distribution’
  – Mean and variance
  – Posterior probability
• Data quality
  – Don’t try to change it
  – Measure it
  – Allow for it: propagation of error

What is “Data”?

• Data is whatever input goes into the next procedure.
• (= output from previous procedure)
• ‘Propagation of error’
• Output of NLP is also “Data”

How to Assess Data Quality?

• What if there is no gold standard?
• Use any external standard
  – E.g. outcome data
• Stronger predictive relationship = better signal/noise ratio?
• “Errors-in-variables” principle
  – Larger error in X -> smaller beta for Y|X
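
The errors-in-variables principle is easy to check by simulation. In the sketch below (simulated data, all values assumed for illustration), Y depends on a true X with slope 2, but the regression only sees X plus measurement error. Under the classical model with independent error, the expected slope is the true slope times var(X) / (var(X) + var(error)).

```python
import random

random.seed(3)

def slope(xs, ys):
    """Simple least-squares slope of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

n = 5000
x = [random.gauss(0, 1) for _ in range(n)]           # true X, variance 1
y = [2.0 * xi + random.gauss(0, 1) for xi in x]      # true slope 2

slopes = {}
for sd_error in (0.0, 0.5, 1.0, 2.0):
    # Observed X = true X + independent measurement error.
    x_observed = [xi + random.gauss(0, sd_error) for xi in x]
    slopes[sd_error] = slope(x_observed, y)
    expected = 2.0 / (1.0 + sd_error ** 2)
    print(f"error SD {sd_error}: fitted slope {slopes[sd_error]:.2f}, "
          f"expected about {expected:.2f}")
```

Read in reverse, this is the slide's suggestion: among alternative measurements of the same quantity, the one with the stronger relationship to an external outcome is plausibly the one with less measurement error.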

Summary: Help!

• What are the important tasks in Data Quality?
  – Measurement?
  – Allowance for?
• Important tasks for this Project?
  – Integrate with other projects