Efficient Use of EHR Data for Translational Research · 2018-03-02

Efficient Use of EHR Data for Translational Research

Tianxi Cai

Department of Biostatistics, Harvard T.H. Chan School of Public Health

Department of Biomedical Informatics, Harvard Medical School

Harvard Catalyst, 2018

Acknowledgment

● Jessica Gronsbell (Stanford)
● Hong Chuan (Harvard)
● Sheng Yu (Tsinghua University)
● David Cheng (Harvard)
● Abhishek Chakrabortty (UPenn)
● Isaac Kohane (Harvard)
● Vivian Gainer (Partners Healthcare)
● Victor Castro (Partners Healthcare)
● Shawn Murphy (Partners Healthcare)
● Ashwin Ananthakrishnan (MGH)
● Katherine Liao (BWH)
● NIH BD2K grant


Outline

● Opportunities & Challenges in using EHR for research

– Phenome-wide Association Study (PheWAS)
– Genetic Risk Prediction
– Comparative Effectiveness Research/Causal Inference

● Efficient Phenotyping via Semi-supervised Learning (SSL)

● SSL Approach to Genetic Risk Modeling

● Remarks


Background

● EHR adoption rate ↗

● Rich resource for research

– detailed longitudinal patient-level data
– a wide range of disease conditions
– enables large-scale genomic & comparative effectiveness studies

Source: CDC NCHS Data Brief (2014)


EHR Data

● structured data: ICD9 billing codes, lab results, etc.

● unstructured text data: extracted via natural language processing (NLP)

– clinical term ↝ concept unique identifiers (CUI)

[Liao et al, 2015]


Integrative Analysis of Electronic Medical Records (EMR) Data

● EHR linked with bio-repository

[Figure: EMR and bio-repository]

● PheWAS

● Genomic Risk Prediction of Disease

● Comparative Effectiveness Research

● Pharmacogenomics

A Major Challenge

● Precise info on phenotype/treatment response not readily available
– ICD9 billing codes sometimes provide inaccurate approximations ↝ power loss
– PPV 0.70, NPV 0.95 ↝ power 45% vs 80% w/ gold standard labels




EHR Phenotyping

● Challenge: Who has what disease phenotype/outcome?
– Solution: build algorithms to predict phenotype

● Algorithm Development Major Steps:
1 identify features (X) relevant to the phenotype
2 obtain gold standard labels (Y) via chart review
3 regression modeling $Y \sim g(X; \theta)$ ↝ $\hat{Y}(X) = g(X; \hat{\theta})$
4 prediction performance evaluation
5 apply the algorithm to the EMR to predict phenotype
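Step 4 (prediction performance evaluation) can be illustrated with a minimal sketch. The labels, scores, and threshold below are made-up toy values, and the helper names `confusion_metrics` and `auc` are hypothetical; the slides do not specify how the accuracy measures were computed.

```python
def confusion_metrics(y_true, scores, threshold):
    """Sensitivity, specificity, PPV and NPV when predicting 1 if score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1

    def ratio(numerator, denominator):
        # Undefined metrics (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def auc(y_true, scores):
    """AUC as P(case score > control score), counting ties as 1/2."""
    cases = [s for y, s in zip(y_true, scores) if y == 1]
    controls = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if c > d else 0.5 if c == d else 0.0
        for c in cases
        for d in controls
    )
    return wins / (len(cases) * len(controls))


# Toy chart-review labels and algorithm scores (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

metrics = confusion_metrics(y_true, scores, threshold=0.5)
roc_auc = auc(y_true, scores)
print(metrics, roc_auc)
```

In practice these measures would be estimated on held-out or cross-validated labels rather than on the data used to fit the algorithm.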



Rheumatoid Arthritis (RA) Algorithm Development

Partners Healthcare EMR

● Data Mart (N=29,432)
– at least 1 ICD9 code for RA or tested for anti-CCP

● Features (p ∼ 100): curated by domain experts
– codified variables (e.g. ICD9 billing codes, lab test results, medication prescriptions)
– NLP variables (e.g. NLP mentions of symptoms, diseases, medications)

● Training set: n = 500 (chart reviewed)

● Algorithm developed via regularized estimation
– adaptive LASSO
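For reference, the adaptive LASSO in its standard form penalizes each coefficient with a weight from an initial estimate. The slide does not give the objective that was used, so the version below is the textbook formulation, written under the assumption of a likelihood-based loss $\ell$ (e.g. logistic), an initial consistent estimate $\tilde{\theta}$, and tuning parameters $\lambda \ge 0$, $\gamma > 0$:

```latex
\hat{\theta}
  = \arg\min_{\theta}
    \left\{
      -\frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, X_i; \theta)
      + \lambda \sum_{j=1}^{p} \frac{|\theta_j|}{|\tilde{\theta}_j|^{\gamma}}
    \right\}
```

Features with a small initial estimate $|\tilde{\theta}_j|$ receive a large penalty and tend to be dropped, which suits a setting with p ∼ 100 candidate features and n = 500 labels.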


RA Algorithm Development

● Partners EMR
– AUC: 0.95; PPV: 0.94; virtual cohort size n = 4453 (∼ 15%) [Liao et al, 2010]

● Portability to other EMR
– AUC: 0.92 at Northwestern; 0.95 at Vanderbilt [Carroll et al, 2012]


Bottlenecks: Labor/Resource Intensive

Algorithm development: costly in time and resources

1 identifying features: manual creation w/ clinical + NLP experts
– Solution: unsupervised feature selection leveraging online knowledge sources [Yu et al, 2015, 2016]

2 gold standard label chart review: clinical expert
– Solution: semi-supervised learning to improve estimation efficiency and hence reduce # of labels needed


Identifying Features: Automation

[Figure: feature extraction pipeline with stages Concept Mapping, Term Detection, Drug Grouping, Junk Filtering, RankCor Control, Frequency Control]

● Automated Feature Extraction for Phenotyping [Yu et al, 2015, 2016]

– online knowledge sources ↝ candidate features
– surrogate phenotypes ↝ data-driven feature selection

● Results for RA classification with Partners EMR
– candidate features: rheumatoid arthritis, morning stiffness, methotrexate, TNF, CRP
– classification accuracy: AUC = 0.95




Bottlenecks: Labor/Resource Intensive

Algorithm development: costly in time and resources

1 identifying features

2 gold standard label chart review: clinical expert
– Solution: semi-supervised learning to improve estimation efficiency and hence reduce # of labels needed


Semi-Supervised Setting: Nature of the Data

● Unlabeled data ↝ feature distribution ($P_X$)

● Question: Can we use unlabeled data to get a more efficient SSL procedure?

● Missing data problem?

– % missing in the outcome: → 100%


Semi-supervised Learning (SSL)

● SSL procedure:

– Step I: learn a relationship between Y and X using labeled data (L)

– Step II: impute the missing Y for the unlabeled data (U)

– Step III: regress the imputed outcome against X

● The SSL estimator can be substantially more efficient than the supervised estimator under certain scenarios

● A robust combination procedure guarantees that the SSL procedure is always at least as efficient as the supervised one.
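The three steps can be sketched on simulated data. This is only an illustration under stated assumptions, not the method from the talk: it uses a single feature, kernel smoothing for the imputation in Steps I and II, a linear working model in Step III, and it omits any bias correction and the robust combination step. The data-generating model, sample sizes, bandwidth, and the helper names `kernel_impute`, `least_squares`, and `draw` are all hypothetical.

```python
import math
import random


def kernel_impute(x0, xs, ys, bandwidth):
    """Steps I-II: Nadaraya-Watson estimate of E[Y | X = x0] from labeled data."""
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total


def least_squares(xs, ys):
    """Intercept and slope of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(0)


def draw(n):
    """Simulate a feature X and a binary outcome Y (assumed toy model)."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-2.0 * x)) else 0 for x in xs]
    return xs, ys


x_labeled, y_labeled = draw(200)   # small chart-reviewed set (L)
x_unlabeled, _ = draw(2000)        # large set (U); its outcomes are discarded

# Step I + II: learn Y | X on the labeled data and impute Y on the unlabeled data.
imputed = [kernel_impute(x, x_labeled, y_labeled, bandwidth=0.5) for x in x_unlabeled]

# Step III: regress the imputed outcome on X over the unlabeled data.
ssl_fit = least_squares(x_unlabeled, imputed)

# Supervised comparison: the same working model fit to the labeled data only.
supervised_fit = least_squares(x_labeled, y_labeled)
print(ssl_fit, supervised_fit)
```

The sketch shows the mechanics only. Whether the SSL fit is actually more efficient depends on the imputation model and on how the working model relates to the truth, which is what the robust combination procedure is meant to address.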


Example: EHR Algorithm for Classifying Rheumatoid Arthritis

● n = 500 labeled observations

● N = 29,000 unlabeled observations

● Features:

– ICD9 codes of rheumatoid arthritis and competing diagnoses
– NLP mentions of clinical conditions/signs/symptoms
– medication prescriptions, lab results

● Results: Efficiency of SSL relative to supervised
– SSL 20% to 380% more efficient for regression coefficients
– SSL 50% to 670% more efficient for accuracy parameters such as sensitivity, specificity and AUC


Genetic Risk Prediction

● Goal: predict the risk of Y = 1 using genetic markers G under

$P(Y = 1 \mid G) = g(\beta_0 + \beta^T G)$

● Challenges:
– Y only available on a small set
– Algorithm scores $S = (S_1, \ldots, S_k)^T$ for predicting Y are available on all patients, but they may not be entirely accurate or fully validated

● Question: how to efficiently
– estimate $\beta$
– evaluate the prediction performance of $S_k$ for Y

● Approach:
– Assumption: S relates to G only through Y, i.e. $S \perp G \mid Y$
– maximize a composite non-parametric likelihood ↝ estimates of $P(S_k \ge s \mid Y)$ and $\beta$
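The conditional independence assumption is what lets patients without labels contribute to estimating $\beta$. The identity below follows directly from $S \perp G \mid Y$ and the risk model. It is a worked consequence of the stated assumption, not necessarily the exact form of the composite likelihood used in the talk:

```latex
P(S_k \ge s \mid G)
  = \sum_{y \in \{0, 1\}} P(S_k \ge s \mid Y = y)\, P(Y = y \mid G)
  = P(S_k \ge s \mid Y = 1)\, g(\beta_0 + \beta^T G)
    + P(S_k \ge s \mid Y = 0)\, \{ 1 - g(\beta_0 + \beta^T G) \}
```

The left-hand side involves only $(S_k, G)$, which are observed on all patients, while the right-hand side depends on $\beta$ and on the accuracy of $S_k$. The labeled subset is what anchors $P(S_k \ge s \mid Y)$.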




Genetic Risk Prediction of CAD in RA Patients

Goal: compare genetic risk model estimates for CAD among RA patients obtained via supervised methods with the full sample of 950 patients versus SSL with a subsample of 200 labels on CAD, leveraging three phenotype algorithms $S = (S_{ICD}, S_{NLP}, S_{Curated})^T$.

             Full Data                n = 200 Labels
Variable     β (SE)         p-value   β (SE)         p-value
age           0.09 (0.01)   0.00       0.11 (0.01)   0.00
sex           1.21 (0.27)   0.00       1.51 (0.31)   0.00
rs2479409     0.45 (0.26)   0.08       0.56 (0.30)   0.06
rs7206971    −0.82 (0.36)   0.02      −0.75 (0.37)   0.04
rs2902940    −2.02 (0.71)   0.00      −1.96 (0.55)   0.00

                  AUC                                  Sensitivity
Labels  Method    ICD         NLP         ICD+NLP      ICD         NLP         ICD+NLP
200     SL        .93 (.071)  .98 (.020)  .99 (.012)   .84 (.142)  .80 (.161)  .92 (.093)
200     SSL       .97 (.017)  .98 (.005)  .99 (.002)   .86 (.062)  .86 (.068)  .97 (.017)
Full    SL        .94 (.015)  .98 (.004)  .99 (.002)   .86 (.041)  .85 (.039)  .97 (.013)
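One way to read the accuracy table is through relative efficiency, taken here as the squared ratio of standard errors (a variance ratio). The sketch below assumes the parenthesized values in the table are standard errors, which the slide does not state, and compares SL and SSL at n = 200 labels. The name `relative_efficiency` is a hypothetical helper.

```python
# (SE under supervised learning, SE under SSL), both with n = 200 labels,
# copied from the accuracy table; assumed to be standard errors.
standard_errors = {
    ("AUC", "ICD"): (0.071, 0.017),
    ("AUC", "NLP"): (0.020, 0.005),
    ("AUC", "ICD+NLP"): (0.012, 0.002),
    ("Sensitivity", "ICD"): (0.142, 0.062),
    ("Sensitivity", "NLP"): (0.161, 0.068),
    ("Sensitivity", "ICD+NLP"): (0.093, 0.017),
}


def relative_efficiency(se_supervised, se_ssl):
    """Variance of the supervised estimator divided by variance of the SSL estimator."""
    return (se_supervised / se_ssl) ** 2


efficiency = {
    key: relative_efficiency(se_sl, se_ssl)
    for key, (se_sl, se_ssl) in standard_errors.items()
}

for (metric, algorithm), value in efficiency.items():
    print(f"{metric:<12}{algorithm:<8} relative efficiency = {value:.1f}")
```

A value above 1 means the SSL estimate is less variable than the supervised estimate from the same 200 labels. The table also shows that SSL standard errors with 200 labels are close to those of the supervised fit on the full data.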


Remarks

EHR Data provides:

● Opportunities for novel "big-data analytics" development
– optimal sampling design for chart review or marker measurement
– unsupervised learning:
○ Automated Feature Selection
○ Automated Phenotype Prediction/Annotation

● Opportunities to improve clinical practice and discovery research
– precision medicine: who should be treated with what
– more accurate diagnosis/prognosis
– capture the disease early
– longitudinal information enables dynamic prediction

