Efficient Use of EHR Data for Translational Research
Tianxi Cai
Department of Biostatistics, Harvard T.H. Chan School of Public Health
Department of Biomedical Informatics, Harvard Medical School
Harvard Catalyst, 2018
Acknowledgment
● Jessica Gronsbell (Stanford)
● Hong Chuan (Harvard)
● Sheng Yu (Tsinghua University)
● David Cheng (Harvard)
● Abhishek Chakrabortty (UPenn)
● Isaac Kohane (Harvard)
● Vivian Gainer (Partners Healthcare)
● Victor Castro (Partners Healthcare)
● Shawn Murphy (Partners Healthcare)
● Ashwin Ananthakrishnan (MGH)
● Katherine Liao (BWH)
● NIH BD2K grant
(Harvard Catalyst, 2018) EHR Research 2 / 18
Outline
● Opportunities & Challenges in using EHR for research
– Phenome-wide Association Study (PheWAS)
– Genetic Risk Prediction
– Comparative Effectiveness Research/Causal Inference
● Efficient Phenotyping via Semi-supervised Learning (SSL)
● SSL Approach to Genetic Risk Modeling
● Remarks
Background
● EHR adoption rate ↗
● Rich resource for research
– detailed longitudinal patient-level data
– a wide range of disease conditions
– enables large scale genomic & comparative effectiveness studies
Source: CDC NCHS Data Brief (2014)
EHR Data
● structured data: ICD9 billing codes, lab results, etc.
● unstructured text data: extracted via natural language processing (NLP)
– clinical term ↝ concept unique identifiers (CUI)
[Liao et al, 2015]
Integrative Analysis of Electronic Medical Records (EMR) Data
● EHR linked with bio-repository
[Diagram: EMR linked to Bio-repository]
● PheWAS
● Genomic Risk Prediction of Disease
● Comparative Effectiveness Research
● Pharmacogenomics
A Major Challenge
● Precise info on phenotype/treatment response not readily available
– ICD9 billing codes sometimes provide inaccurate approximations ↝ power loss
– PPV 0.70, NPV 0.95 ↝ power 45% vs 80% w/ gold standard labels
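The power loss quoted above can be roughly reproduced with a back-of-the-envelope calculation. The sketch below is not from the slides; it assumes a two-sided z-test at level 0.05, surrogate labels whose errors are unrelated to the marker given true disease status, and an unchanged test variance. Under those assumptions the case-control difference in any marker mean shrinks by the factor PPV + NPV − 1, and the resulting power lands in the mid-40% range. The function name `attenuated_power` is hypothetical.

```python
from statistics import NormalDist

def attenuated_power(ppv, npv, target_power=0.80, alpha=0.05):
    """Approximate power of a two-sided z-test when gold-standard labels
    are replaced by a surrogate with the given PPV and NPV."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    # Noncentrality that gives target_power with gold-standard labels
    # (the negligible rejection probability in the opposite tail is ignored).
    delta = z_alpha + norm.inv_cdf(target_power)
    # Surrogate "cases" mix true cases and controls in proportions PPV : 1 - PPV;
    # surrogate "controls" mix true controls and cases in proportions NPV : 1 - NPV.
    # The observed group difference is therefore (PPV + NPV - 1) times the true one.
    shrink = ppv + npv - 1
    return norm.cdf(shrink * delta - z_alpha)

power = attenuated_power(ppv=0.70, npv=0.95)
print(f"approximate power with surrogate labels: {power:.2f}")
```

The calculation ignores any change in group sizes or variances caused by mislabeling, so it should be read as an order-of-magnitude check rather than the exact design behind the quoted figure.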
EHR Phenotyping
● Challenge: Who has what disease phenotype/outcome?
– Solution: build algorithms to predict phenotype
● Algorithm Development Major Steps:
1 identify features (X) relevant to the phenotype
2 obtain gold standard labels (Y) via chart review
3 regression modeling: $Y \sim g(X;\theta)$ ↝ $\hat{Y}(X) = g(X;\hat{\theta})$
4 prediction performance evaluation
5 apply the algorithm to the EMR to predict phenotype
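The five steps above can be sketched end to end on simulated data. This is a minimal illustration, not the pipeline used in the talk: the two features, the sample sizes, the plain logistic model, and the helper names `fit_logistic`, `predict`, and `auc` are all assumptions made for the example.

```python
import math
import random

def sigmoid(t):
    # Numerically stable logistic function g(t) = 1 / (1 + exp(-t)).
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))

def predict(theta, x):
    # theta[0] is the intercept, theta[1:] the feature coefficients.
    return sigmoid(theta[0] + sum(t * v for t, v in zip(theta[1:], x)))

def fit_logistic(X, y, lr=0.5, n_iter=500):
    """Estimate theta in P(Y = 1 | X) = g(theta_0 + theta'X) by gradient ascent."""
    n, p = len(X), len(X[0])
    theta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for x, yi in zip(X, y):
            resid = yi - predict(theta, x)
            grad[0] += resid
            for j in range(p):
                grad[j + 1] += resid * x[j]
        theta = [t + lr * g / n for t, g in zip(theta, grad)]
    return theta

def auc(scores, labels):
    """Chance that a random case outscores a random control (ties count 1/2)."""
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))

random.seed(0)
N = 2000
# Step 1: two hypothetical features per patient (e.g. scaled ICD9 and NLP counts).
truth = [int(random.random() < 0.3) for _ in range(N)]
X = [[random.gauss(1.5 * y, 1.0), random.gauss(1.0 * y, 1.0)] for y in truth]

# Step 2: "chart review" reveals Y for the first 300 patients only.
train_idx, valid_idx = range(0, 200), range(200, 300)

# Step 3: regression modeling on the training labels.
theta_hat = fit_logistic([X[i] for i in train_idx], [truth[i] for i in train_idx])

# Step 4: evaluate prediction performance on held-out labels.
valid_auc = auc([predict(theta_hat, X[i]) for i in valid_idx],
                [truth[i] for i in valid_idx])

# Step 5: apply the fitted algorithm to every patient in the data mart.
all_scores = [predict(theta_hat, x) for x in X]
print(f"validation AUC = {valid_auc:.2f}; patients scored = {len(all_scores)}")
```

In practice the regression step would typically use a penalized fit over many more features; the unpenalized two-feature model here only keeps the example short.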
Rheumatoid Arthritis (RA) Algorithm Development
Partners Healthcare EMR
● Data Mart (N = 29,432)
– at least 1 ICD9 code for RA or tested for anti-CCP
● Features (p ∼ 100): curated by domain experts
– codified variables (e.g. ICD9 billing codes, lab test results, medication prescriptions)
– NLP variables (e.g. NLP mentions of symptoms, diseases, medications)
● Training set: n = 500 (chart reviewed)
● Algorithm developed via regularized estimation
– adaptive LASSO
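For readers unfamiliar with the adaptive LASSO, its generic form is shown below. The slides do not spell out the objective, so the notation is an assumption: $\ell$ is the log-likelihood contribution of one labeled patient, $\tilde{\theta}$ is an initial estimate, and $\lambda \ge 0$, $\gamma > 0$ are tuning parameters.

```latex
% Generic adaptive LASSO objective (illustrative notation, not from the slides).
% Coefficients with a small initial estimate \tilde{\theta}_j are penalized more heavily.
\hat{\theta} = \arg\min_{\theta}
  \left\{ -\frac{1}{n} \sum_{i=1}^{n} \ell\bigl(Y_i,\, g(X_i;\theta)\bigr)
          + \lambda \sum_{j=1}^{p} \frac{|\theta_j|}{|\tilde{\theta}_j|^{\gamma}} \right\}
```

With roughly 100 candidate features and 500 labels, the weighted penalty sets many coefficients exactly to zero, which is what makes it a feature-selection tool as well as an estimator.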
RA Algorithm Development
● Partners EMR
– AUC: 0.95; PPV: 0.94; virtual cohort size n = 4453 (∼ 15%) [Liao et al, 2010]
● Portability to other EMR
– AUC: 0.92 at Northwestern; 0.95 at Vanderbilt [Carroll et al, 2012]
Bottlenecks: Labor/Resource Intensive
Algorithm development: costly in time and resources
1 identifying features: manual creation w/ clinical + NLP experts
– Solution: unsupervised feature selection leveraging online knowledge sources [Yu et al, 2015, 2016]
2 gold standard label chart review: clinical expert
– Solution: semi-supervised learning to improve estimation efficiency and hence reduce # of labels needed
Identifying Features: Automation
[Figure: feature-extraction pipeline with stages Concept Mapping, Term Detection, Drug Grouping, Junk Filtering, RankCor Control, Frequency Control]
● Automated Feature Extraction for Phenotyping [Yu et al, 2015, 2016]
– online knowledge sources ↝ candidate features
– surrogate phenotypes ↝ data driven feature selection
● Results for RA classification with Partners EMR
– candidate features: rheumatoid arthritis, morning stiffness, methotrexate, TNF, CRP
– classification accuracy: AUC = 0.95
Bottlenecks: Labor/Resource Intensive
Algorithm development: costly in time and resources
1 identifying features
2 gold standard label chart review: clinical expert
– Solution: semi-supervised learning to improve estimation efficiency and hence reduce # of labels needed
Semi-Supervised Setting: Nature of the Data
● Unlabeled data ↝ feature distribution ($P_X$)
● Question: Can we use unlabeled data to obtain a more efficient procedure via SSL?
● Missing data problem?
– % missing in the outcome → 100%
Semi-supervised Learning (SSL)
● SSL procedure:
– Step I: learn a relationship between Y and X using labeled data (L)
– Step II: impute the missing Y for the unlabeled data (U)
– Step III: regress the imputed outcome against X
● SSL estimator can be substantially more efficient than the supervised estimator under certain scenarios
● A robust combination procedure guarantees that the SSL procedure will always be at least as efficient as the supervised one.
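The three SSL steps above can be illustrated on simulated data. This is a simplified sketch under stated assumptions, not the estimator from the talk: Step I uses a logistic model with a quadratic basis as the imputation model, the working model in Step III is a plain logistic regression on x, and the robust combination step is omitted. All names (`fit_logistic`, `draw_patient`, `basis`) are hypothetical.

```python
import math
import random

def sigmoid(t):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))

def fit_logistic(Z, y, lr=0.5, n_iter=300):
    """Gradient ascent on the logistic log-likelihood; y may be fractional."""
    n, p = len(Z), len(Z[0])
    theta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        for z, yi in zip(Z, y):
            resid = yi - sigmoid(sum(t * v for t, v in zip(theta, z)))
            for j in range(p):
                grad[j] += resid * z[j]
        theta = [t + lr * g / n for t, g in zip(theta, grad)]
    return theta

def draw_patient():
    # Simulated truth: the log-odds of Y are quadratic in a single feature x.
    x = random.gauss(0.0, 1.0)
    y = int(random.random() < sigmoid(-1.0 + 1.5 * x + 0.5 * x * x))
    return x, y

def basis(x):
    # Richer basis for the imputation model than the working model uses.
    return [1.0, x, x * x]

random.seed(1)
labeled = [draw_patient() for _ in range(100)]           # L: chart-reviewed patients
unlabeled_x = [draw_patient()[0] for _ in range(1000)]   # U: outcome never observed

# Step I: learn the relationship between Y and X on the labeled data.
gamma = fit_logistic([basis(x) for x, _ in labeled], [y for _, y in labeled])

# Step II: impute the missing Y on the unlabeled data by its predicted probability.
y_imputed = [sigmoid(sum(g * b for g, b in zip(gamma, basis(x)))) for x in unlabeled_x]

# Step III: regress the imputed outcome against X under the working model (1, x).
theta_ssl = fit_logistic([[1.0, x] for x in unlabeled_x], y_imputed)

# Supervised benchmark: the same working model fitted to the labeled data only.
theta_sl = fit_logistic([[1.0, x] for x, _ in labeled], [y for _, y in labeled])
print("SSL:", theta_ssl)
print("supervised:", theta_sl)
```

Comparing the variability of `theta_ssl` and `theta_sl` over repeated simulations is one way to see when the unlabeled data help; a single run such as this one only shows the mechanics.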
Example: EHR Algorithm for Classifying Rheumatoid Arthritis
● n = 500 labeled observations
● N = 29,000 unlabeled observations
● Features:
– ICD9 codes of rheumatoid arthritis and competing diagnoses
– NLP mentions of clinical conditions/signs/symptoms
– medication prescriptions, lab results
● Results: Efficiency of SSL relative to supervised
– SSL 20%–380% more efficient for regression coefficients
– SSL 50%–670% more efficient for accuracy parameters such as sensitivity, specificity and AUC.
Genetic Risk Prediction
● Goal: predict the risk of Y = 1 using genetic markers G under
$P(Y = 1 \mid G) = g(\beta_0 + \beta^{\mathsf T} G)$
● Challenges:
– Y only available on a small set
– Algorithm scores $S = (S_1, \ldots, S_k)^{\mathsf T}$ for predicting Y are available on all patients, but they may not be entirely accurate or fully validated
● Question: how to efficiently
– estimate β
– evaluate the prediction performance of $S_k$ for Y
● Approach:
– Assumption: S relates to G only through Y, i.e. $S \perp G \mid Y$
– maximizing a composite non-parametric likelihood ↝ $P(S_k \ge s \mid Y)$ and β.
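The conditional independence assumption above is what lets unlabeled patients inform β. The identity below is a sketch of that reasoning only; the exact composite likelihood maximized in the talk is not given in the slides, and $f_y$ (the unspecified distribution of S given Y = y) is notation introduced here for illustration.

```latex
% Assumes S \perp G \mid Y as stated above; f_y is illustrative notation.
\begin{align*}
P(S = s \mid G)
  &= \sum_{y \in \{0,1\}} P(S = s \mid Y = y, G)\, P(Y = y \mid G) \\
  &= \sum_{y \in \{0,1\}} f_y(s)\, P(Y = y \mid G) \\
  &= f_1(s)\, g(\beta_0 + \beta^{\mathsf T} G)
   + f_0(s)\, \bigl\{ 1 - g(\beta_0 + \beta^{\mathsf T} G) \bigr\}.
\end{align*}
```

The scores of an unlabeled patient thus follow a two-component mixture whose mixing weight depends on G through β, so both β and the score distributions given Y can in principle be estimated together.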
Genetic Risk Prediction of CAD in RA Patients
Goal: compare genetic risk model estimation for CAD among RA patients via supervised methods with a full sample of 950 patients versus SSL with a subsample of 200 labels on CAD, leveraging three phenotype algorithms $S = (S_{\mathrm{ICD}}, S_{\mathrm{NLP}}, S_{\mathrm{Curated}})^{\mathsf T}$.

             Full Data                 n = 200 Labels
Variable     β (SE)         p-value    β (SE)         p-value
age           0.09 (0.01)   0.00        0.11 (0.01)   0.00
sex           1.21 (0.27)   0.00        1.51 (0.31)   0.00
rs2479409     0.45 (0.26)   0.08        0.56 (0.30)   0.06
rs7206971    −0.82 (0.36)   0.02       −0.75 (0.37)   0.04
rs2902940    −2.02 (0.71)   0.00       −1.96 (0.55)   0.00

                 AUC                                    Sensitivity
Labels  Method   ICD          NLP          ICD+NLP      ICD          NLP          ICD+NLP
200     SL       .93 (.071)   .98 (.020)   .99 (.012)   .84 (.142)   .80 (.161)   .92 (.093)
200     SSL      .97 (.017)   .98 (.005)   .99 (.002)   .86 (.062)   .86 (.068)   .97 (.017)
Full    SL       .94 (.015)   .98 (.004)   .99 (.002)   .86 (.041)   .85 (.039)   .97 (.013)
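One way to read the accuracy table above is through relative efficiency. The slides do not define the term, so the sketch below assumes the usual ratio of variances, computed here as the squared ratio of the reported standard errors at 200 labels (SL over SSL). The dictionary keys are labels chosen for this example.

```python
# Standard errors copied from the accuracy table (rows with 200 labels).
se_sl = {
    "AUC, ICD": 0.071, "AUC, NLP": 0.020, "AUC, ICD+NLP": 0.012,
    "Sens, ICD": 0.142, "Sens, NLP": 0.161, "Sens, ICD+NLP": 0.093,
}
se_ssl = {
    "AUC, ICD": 0.017, "AUC, NLP": 0.005, "AUC, ICD+NLP": 0.002,
    "Sens, ICD": 0.062, "Sens, NLP": 0.068, "Sens, ICD+NLP": 0.017,
}

# Assumed definition: relative efficiency = Var(SL) / Var(SSL) = (SE_SL / SE_SSL)^2.
rel_eff = {name: (se_sl[name] / se_ssl[name]) ** 2 for name in se_sl}

for name, value in rel_eff.items():
    print(f"{name:14s} relative efficiency of SSL vs SL: {value:5.1f}")
```

Because the table reports standard errors to only two or three digits, these ratios are coarse; they indicate the direction and rough size of the gain rather than precise values.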
Remarks
EHR Data provides:
● Opportunities for novel "big-data-analytics" development
– optimal sampling design for chart review or marker measurement
– Unsupervised learning:
○ Automated Feature Selection
○ Automated Phenotype Prediction/Annotation
● Opportunities to improve clinical practice and discovery research
– precision medicine: who should be treated by what
– more accurate diagnosis/prognosis
– capture the disease early
– longitudinal information enables dynamic prediction