Data Quality
Sharp project 5
June 2010

Statistical Problems with Data Quality in EHR
• Missing Data
• Uncertain Diagnosis
• Uneven/unequal precision / measurement error
• Bias
• …

Missing Data: (Rage in Statistical Theory)
• Common problem with observational/retrospective data
• Statistical approaches
  – Imputation
  – Multiple imputation (MI) (Statisticians have acronyms too)
  – Regression with residual error
  – Draw from posterior distribution

Missing Data: Empirical approach
• Regression on Y with missing X-variables
• “X is missing” is also information.
• Analyze data set using
  – Imputation (mean?)
  – “Missing” indicator
  – Empirical approach: let data tell you what to do
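
One way to read the "imputation plus missing indicator" idea is the sketch below. It is only an illustration on simulated data, not the analysis behind the slides: the helper names (`solve`, `ols`, `fit_with_missing_indicator`), the simulated coefficients, and the 30% missingness rate are all assumptions. Missing X values are filled with the observed mean, and a 0/1 indicator lets the regression estimate how much "X is missing" shifts Y.

```python
import random

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def fit_with_missing_indicator(x, y):
    """Regress y on [1, x with mean imputation, missing indicator]."""
    observed = [v for v in x if v is not None]
    x_mean = sum(observed) / len(observed)
    rows = [[1.0, x_mean if v is None else v, 1.0 if v is None else 0.0] for v in x]
    return ols(rows, y)

# Simulated (hypothetical) data: Y = 1 + 2*X + noise, and patients whose X
# was never recorded have Y shifted upward by 1.5.
rng = random.Random(0)
x, y = [], []
for _ in range(2000):
    xi = rng.gauss(0.0, 1.0)
    is_missing = rng.random() < 0.3
    y.append(1.0 + 2.0 * xi + (1.5 if is_missing else 0.0) + rng.gauss(0.0, 1.0))
    x.append(None if is_missing else xi)

intercept, slope, missing_effect = fit_with_missing_indicator(x, y)
print(round(slope, 2), round(missing_effect, 2))
```

With the indicator in the model, the slope is determined by the complete cases, while the indicator coefficient absorbs the average difference in Y for patients whose X was not recorded.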

Uncertain diagnosis
• Universal problem with health data
• No Gold standard
• Disease/health is a spectrum, not a dichotomy
• Probabilistic perspective
  – Probability (Peripheral Arterial Disease)
  – From {0,1} to [0,1] as phenotype
  – More realistic phenotype?

Uncertain Diagnosis
• Result is a probability
• Probability is a posterior distribution of a 0/1 variable
  – Use p itself (certainty equivalent)
    • Analogous to single imputation
  – Use multiple imputation
    • “1” with probability p, “0” with probability 1-p
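
The two options above can be compared in a small sketch. It assumes a hypothetical list `p` of per-patient disease probabilities and estimates prevalence two ways: plugging in p directly, and drawing several 0/1 datasets and pooling them with Rubin's rules (total variance = mean within-imputation variance + (1 + 1/m) × between-imputation variance). The function names and the simulated probabilities are illustrative assumptions.

```python
import random
from statistics import mean, variance

def certainty_equivalent_prevalence(p):
    """Single-imputation analogue: use each probability as the phenotype."""
    return mean(p)

def multiple_imputation_prevalence(p, m=20, seed=0):
    """Draw m imputed 0/1 datasets and pool the estimates with Rubin's rules."""
    rng = random.Random(seed)
    n = len(p)
    estimates, within = [], []
    for _ in range(m):
        # "1" with probability p_i, "0" with probability 1 - p_i
        diagnosis = [1 if rng.random() < pi else 0 for pi in p]
        q = sum(diagnosis) / n
        estimates.append(q)
        within.append(q * (1.0 - q) / n)  # binomial variance of one estimate
    pooled = mean(estimates)
    total_var = mean(within) + (1.0 + 1.0 / m) * variance(estimates)
    return pooled, total_var

# Hypothetical posterior probabilities for 500 patients.
_rng = random.Random(1)
p = [_rng.random() for _ in range(500)]

plug_in = certainty_equivalent_prevalence(p)
pooled, total_var = multiple_imputation_prevalence(p)
print(round(plug_in, 3), round(pooled, 3), total_var)
```

Both point estimates should agree closely; the multiple-imputation version additionally carries the diagnostic uncertainty into the variance.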

Uncertain Diagnosis: PAD example (eMERGE)
• Mayo Vascular Lab Database: n=18000
• Gold Standard: Ankle/Brachial Index (ABI)
• Use of diagnostic / procedural codes
  – ICD-9 / HICDA / CPT
• Logistic regression of gold standard (PAD by ABI) on diagnostic codes
  → posterior probability of PAD
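
As a rough illustration of regressing a gold-standard diagnosis on code indicators, the sketch below fits a logistic model by plain gradient ascent on simulated data. Nothing here comes from the Mayo database: the two binary features (standing in for "a diagnostic code is present" and "a procedural code is present"), the true coefficients, and the function names are all hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, row):
    """Posterior probability of disease given code indicators."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], row))
    return sigmoid(z)

def fit_logistic(features, labels, iterations=500, rate=0.5):
    """Maximize the logistic log-likelihood by gradient ascent."""
    k = len(features[0])
    weights = [0.0] * (k + 1)  # weights[0] is the intercept
    n = len(labels)
    for _ in range(iterations):
        grad = [0.0] * (k + 1)
        for row, label in zip(features, labels):
            err = label - predict(weights, row)
            grad[0] += err
            for j, value in enumerate(row):
                grad[j + 1] += err * value
        weights = [w + rate * g / n for w, g in zip(weights, grad)]
    return weights

# Simulated (hypothetical) patients: two binary code indicators and a
# gold-standard label generated from an assumed logistic model.
rng = random.Random(0)
features, labels = [], []
for _ in range(1000):
    codes = [int(rng.random() < 0.3), int(rng.random() < 0.2)]
    true_p = sigmoid(-2.0 + 2.5 * codes[0] + 1.5 * codes[1])
    features.append(codes)
    labels.append(int(rng.random() < true_p))

weights = fit_logistic(features, labels)
p_none = predict(weights, [0, 0])
p_both = predict(weights, [1, 1])
print(round(p_none, 2), round(p_both, 2))
```

The fitted `predict` function is what would be applied to patients who have codes but no gold-standard measurement, subject to the caveat that coding practices may differ between populations.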

Uncertain Diagnosis
• Model for Pr(PAD): 90% predictive value
• Export model for Pr{PAD} to patients without gold standard ascertainment?
• (Coding practices?)

Uncertain Diagnosis
• Use Pr{PAD} in analysis of
  – Incidence of PAD
  – Incidence trends
  – Surveillance
  – Analysis of etiology, risk factors

Unequal Precision of continuous phenotype
• eMERGE example: Red Blood Count
• Use retrospective Laboratory Data
• N=3000, K=20,000
  – 1 to 100 measurements/subject
• Account for differential precision
• Components of variance
• Weighted regression?
• Posterior distribution: same model fits
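
A minimal sketch of how differential precision might enter through a posterior distribution, assuming a simple normal random-effects model with known variance components. The population mean, the between-subject and within-subject variances, and the example measurements below are hypothetical numbers chosen for illustration, not eMERGE values.

```python
def subject_posterior(subject_mean, n_measurements, mu, between_var, within_var):
    """Posterior mean and variance of a subject's true level.

    Assumes true level ~ Normal(mu, between_var) and each measurement
    ~ Normal(true level, within_var), with both variances known.
    """
    sampling_var = within_var / n_measurements   # variance of the subject mean
    weight = between_var / (between_var + sampling_var)
    post_mean = mu + weight * (subject_mean - mu)
    post_var = weight * sampling_var
    return post_mean, post_var

# Hypothetical values: population mean 4.7, between-subject variance 0.16,
# within-subject (measurement) variance 0.09.
MU, BETWEEN, WITHIN = 4.7, 0.16, 0.09

# Two subjects with the same observed mean but very different precision.
mean_a, var_a = subject_posterior(5.5, 1, MU, BETWEEN, WITHIN)
mean_b, var_b = subject_posterior(5.5, 100, MU, BETWEEN, WITHIN)
print(mean_a, var_a)
print(mean_b, var_b)
```

The subject with a single measurement is pulled strongly toward the population mean and keeps a wide posterior; the subject with 100 measurements is barely shrunk. The reciprocal of the posterior variance is also a natural weight if a weighted regression is preferred.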

Sample from Posterior Distribution
• Missing Data, uncertain diagnosis, unequal precision can all be represented by sampling from posterior distribution
• They are all the “same problem”
• Statistical / computational tools for this have been developed
  – Markov Chain Monte Carlo (MCMC)
  – Multiple Imputation

Summary: Data Quality
• ‘Data’ is not ‘a number’ but ‘a posterior distribution’
  – Mean and variance
  – Posterior probability
• Data quality
  – Don’t try to change it
  – Measure it
  – Allow for it: propagation of error

What is “Data”?
• Data is whatever input goes into the next procedure.
• (= output from previous procedure)
• ‘Propagation of error’
• Output of NLP is also “Data”

How to Assess Data Quality?
• What if there is no Gold Standard?
• Use any external standard
  – E.g. outcome data
• Stronger predictive relationship = better signal/noise ratio?
• “Errors-in-variables” principle
  – Larger error in X –> Smaller beta for Y|X
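
The errors-in-variables principle can be seen in a short simulation. Under the classical measurement-error model, the fitted slope shrinks by the factor var(X) / (var(X) + var(error)). The sketch below uses assumed values (true slope 2, unit variance for X and for the measurement error), so the attenuated slope should be close to 1.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)
n = 5000
true_x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 * xi + rng.gauss(0.0, 0.5) for xi in true_x]

# The same predictor recorded with additive measurement error (sd = 1).
noisy_x = [xi + rng.gauss(0.0, 1.0) for xi in true_x]

slope_clean = slope(true_x, y)   # close to the true slope of 2
slope_noisy = slope(noisy_x, y)  # attenuated toward 2 * 1 / (1 + 1) = 1
print(round(slope_clean, 2), round(slope_noisy, 2))
```

Read in reverse, this is the suggestion on the slide: between two versions of the same variable, the one with the stronger relationship to an external outcome is likely the one measured with less error.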

Summary: Help!
• What are the important tasks in Data Quality?
  – Measurement?
  – Allowance for?
• Important tasks for this Project?
  – Integrate with other projects