Efficient Use of EHR Data for Translational Research Tianxi Cai Department of Biostatistics, Harvard T.H. Chan School of Public Health Department of Biomedical Informatics, Harvard Medical School Harvard Catalyst, 2018

Efficient Use of EHR Data for Translational Research · 2018-03-02 · Efficient Use of EHR Data for Translational Research Tianxi Cai Department of Biostatistics, Harvard T.H. Chan


Efficient Use of EHR Data for Translational Research

Tianxi Cai

Department of Biostatistics, Harvard T.H. Chan School of Public Health

Department of Biomedical Informatics, Harvard Medical School

Harvard Catalyst, 2018


Acknowledgment

● Jessica Gronsbell (Stanford)
● Hong Chuan (Harvard)
● Sheng Yu (Tsinghua University)
● David Cheng (Harvard)
● Abhishek Chakrabortty (UPenn)
● Isaac Kohane (Harvard)
● Vivian Gainer (Partners Healthcare)
● Victor Castro (Partners Healthcare)
● Shawn Murphy (Partners Healthcare)
● Ashwin Ananthakrishnan (MGH)
● Katherine Liao (BWH)
● NIH BD2K grant

(Harvard Catalyst, 2018) EHR Research 2 / 18


Outline

● Opportunities & Challenges in using EHR for research
– Phenome-wide Association Study (PheWAS)
– Genetic Risk Prediction
– Comparative Effectiveness Research/Causal Inference

● Efficient Phenotyping via Semi-supervised Learning (SSL)

● SSL Approach to Genetic Risk Modeling

● Remarks


Background

● EHR adoption rate ↗

● Rich resource for research

– detailed longitudinal patient-level data
– a wide range of disease conditions
– enables large-scale genomic & comparative effectiveness studies

Source: CDC NCHS Data Brief (2014)


EHR Data

● structured data: ICD9 billing codes, lab results, etc.
● unstructured text data: extracted via natural language processing (NLP)
– clinical term ↝ concept unique identifiers (CUI)

[Liao et al, 2015]


Integrative Analysis of Electronic Medical Records (EMR) Data

● EHR linked with bio-repository

[Diagram: EMR linked to Bio-repository]

● PheWAS

● Genomic Risk Prediction of Disease

● Comparative Effectiveness Research

● Pharmacogenomics

A Major Challenge

● Precise info on phenotype/treatment response not readily available
– ICD9 billing codes sometimes provide inaccurate approximations ↝ power loss
– PPV 0.70, NPV 0.95 ↝ power 45% vs 80% w/ gold standard labels
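
The slide does not say how the 45% figure was derived. The sketch below is one back-of-the-envelope calculation that lands close to it, under assumptions that are not stated on the slide: a two-sided z-test at level 0.05 comparing a marker between cases and controls, a study sized for 80% power with gold standard labels, and algorithm-defined groups that are mixtures of true cases and true controls. Under those assumptions the observed group difference shrinks by the factor PPV + NPV − 1, while the standard error is treated as unchanged. The function name `power_with_noisy_labels` is hypothetical.

```python
from statistics import NormalDist

def power_with_noisy_labels(ppv, npv, target_power=0.80, alpha=0.05):
    """Approximate power of a two-sided z-test when case/control status
    comes from an imperfect algorithm instead of gold standard labels.

    Assumption (not from the slide): the standard error of the group
    difference is unchanged; only the mean difference is attenuated.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    # Standardized effect that yields `target_power` with perfect labels
    # (the negligible opposite tail is ignored when solving for it).
    ncp_gold = z_alpha + norm.inv_cdf(target_power)
    # Algorithm "cases" are PPV true cases + (1 - PPV) true controls;
    # algorithm "controls" are NPV true controls + (1 - NPV) true cases.
    # The difference between the two groups shrinks by PPV + NPV - 1.
    attenuation = ppv + npv - 1
    ncp_noisy = attenuation * ncp_gold
    # Two-sided rejection probability under the attenuated effect.
    return norm.cdf(ncp_noisy - z_alpha) + norm.cdf(-ncp_noisy - z_alpha)

power_gold = power_with_noisy_labels(1.0, 1.0)     # perfect labels
power_icd = power_with_noisy_labels(0.70, 0.95)    # values quoted on the slide
print(f"gold standard: {power_gold:.2f}, noisy labels: {power_icd:.2f}")
```

With PPV 0.70 and NPV 0.95 the attenuation factor is 0.65, and the approximate power falls from 0.80 to roughly 0.44 to 0.45.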


EHR Phenotyping

● Challenge: Who has what disease phenotype/outcome?
– Solution: build algorithms to predict phenotype

● Algorithm Development Major Steps:
1 identify features (Z) relevant to the phenotype
2 gold standard labels (Y) obtained via chart review
3 regression modeling $Y \sim g(X;\theta)$ ↝ $\hat{Y}(X) = g(X;\hat{\theta})$
4 prediction performance evaluation
5 apply the algorithm to the EMR to predict phenotype
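
The five steps can be sketched end to end on synthetic data. This is a minimal illustration, not the algorithm used in the talk: the three features, the simulated labels, the plain logistic model for g, and the 350/150 train/validation split are all assumptions made for the example, and the helper names (`fit_logistic`, `predict`, `auc`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # Clip the linear predictor to avoid overflow in exp().
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def fit_logistic(X, y, n_iter=25, ridge=1e-6):
    """Newton-Raphson fit of a logistic model; returns theta, intercept first."""
    Xd = np.column_stack([np.ones(len(X)), X])
    theta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xd @ theta)
        grad = Xd.T @ (y - p) - ridge * theta
        hess = (Xd * (p * (1.0 - p))[:, None]).T @ Xd + ridge * np.eye(Xd.shape[1])
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def predict(theta, X):
    return sigmoid(np.column_stack([np.ones(len(X)), X]) @ theta)

def auc(y, score):
    """P(random case scores above random control), ties counted as 1/2."""
    diff = score[y == 1][:, None] - score[y == 0][None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

rng = np.random.default_rng(0)

# Step 1: features for every patient in the data mart (synthetic stand-ins).
N = 5000
X_all = rng.normal(size=(N, 3))
true_theta = np.array([-1.0, 1.5, 1.0, 0.5])          # used only to simulate Y
y_all = rng.binomial(1, predict(true_theta, X_all))   # unobserved in practice

# Step 2: chart review yields gold standard labels for a small random subset.
labeled = rng.choice(N, size=500, replace=False)
train, valid = labeled[:350], labeled[350:]

# Step 3: regression modeling on the labeled training patients.
theta_hat = fit_logistic(X_all[train], y_all[train])

# Step 4: evaluate prediction performance on held-out labeled patients.
auc_valid = auc(y_all[valid], predict(theta_hat, X_all[valid]))

# Step 5: apply the fitted algorithm to every patient in the EMR.
phenotype_prob = predict(theta_hat, X_all)
print(f"validation AUC: {auc_valid:.2f}")
```

In a real study only the 500 reviewed charts would carry labels; `y_all` exists here solely so that the simulation has something to review.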


Rheumatoid Arthritis (RA) Algorithm Development

Partners Healthcare EMR

● Data Mart (N = 29,432)
– at least 1 ICD9 code for RA or tested for anti-CCP

● Features (p ∼ 100): curated by domain experts
– codified variables (e.g. ICD9 billing codes, lab test results, medication prescriptions)
– NLP variables (e.g. NLP mentions of symptoms, diseases, medications)

● Training set: n = 500 (chart reviewed)

● Algorithm developed via regularized estimation
– adaptive LASSO
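
The slide names the adaptive LASSO but does not give the objective that was optimized. For reference, the standard adaptive LASSO form for a logistic working model is shown below; the initial estimator $\tilde{\theta}$ (for example an unpenalized or ridge fit), the exponent $\gamma > 0$, and the tuning parameter $\lambda_n$ are generic choices and should be read as assumptions rather than the settings used for the RA algorithm.

```latex
\hat{\theta} = \arg\min_{\theta}
\Big\{
  -\frac{1}{n}\sum_{i=1}^{n}
  \big[\, Y_i \log g(X_i;\theta) + (1 - Y_i)\log\{1 - g(X_i;\theta)\} \,\big]
  + \lambda_n \sum_{j=1}^{p} \frac{|\theta_j|}{|\tilde{\theta}_j|^{\gamma}}
\Big\}
```

Features with a small initial coefficient receive a large penalty weight and tend to be dropped, which suits a setting with p ∼ 100 candidate features and only n = 500 labels.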


RA Algorithm Development

● Partners EMR
– AUC: 0.95; PPV: 0.94; virtual cohort size n = 4453 (∼ 15%) [Liao et al, 2010]

● Portability to other EMR
– AUC: 0.92 at Northwestern; 0.95 at Vanderbilt [Carroll et al, 2012]


Bottlenecks: Labor/Resource Intensive

Algorithm development: costly in time and resources

1 identifying features: manual creation w/ clinical + NLP experts
– Solution: unsupervised feature selection leveraging online knowledge sources [Yu et al, 2015, 2016]

2 gold standard label chart review: clinical expert
– Solution: semi-supervised learning to improve estimation efficiency and hence reduce # of labels needed


Identifying Features: Automation

[Figure: automated feature extraction pipeline, with stages labeled Concept Mapping, Term Detection, Drug Grouping, Junk Filtering, RankCor Control, Frequency Control]

● Automated Feature Extraction for Phenotyping [Yu et al, 2015, 2016]
– online knowledge sources ↝ candidate features
– surrogate phenotypes ↝ data-driven feature selection

● Results for RA classification with Partners EMR
– candidate features: rheumatoid arthritis, morning stiffness, methotrexate, TNF, CRP
– classification accuracy: AUC = 0.95


Bottlenecks: Labor/Resource Intensive

Algorithm development: costly in time and resources

1 identifying features

2 gold standard label chart review: clinical expert
– Solution: semi-supervised learning to improve estimation efficiency and hence reduce # of labels needed


Semi-Supervised Setting: Nature of the Data

● Unlabeled data ↝ feature distribution ($P_X$)

● Question: Can we use unlabeled data to get a more efficient SSL procedure?

● Missing data problem?

– % missing in the outcome → 100%, since the labeled set is tiny relative to the unlabeled set


Semi-supervised Learning (SSL)

● SSL procedure:

– Step I: learn a relationship between Y and X using labeled data (L)

– Step II: impute the missing Y for the unlabeled data (U)

– Step III: regress the imputed outcome against X

● The SSL estimator can be substantially more efficient than the supervised estimator under certain scenarios

● A robust combination procedure guarantees that the SSL procedure is always at least as efficient as the supervised one.
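
The three steps can be sketched as follows. This is a mechanical illustration only, under assumptions that are not from the slides: synthetic data, a logistic working model, and a hypothetical quadratic basis (`quad_basis`) as the richer imputation model in Step I. It omits the robust combination step, so it should not be read as the estimator described in the talk.

```python
import numpy as np

def sigmoid(z):
    # Clip the linear predictor to avoid overflow in exp().
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def fit_logistic(X, y, n_iter=25, ridge=1e-6):
    """Newton-Raphson logistic fit; y may be binary or a probability in [0, 1]."""
    Xd = np.column_stack([np.ones(len(X)), X])
    theta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xd @ theta)
        grad = Xd.T @ (y - p) - ridge * theta
        hess = (Xd * (p * (1.0 - p))[:, None]).T @ Xd + ridge * np.eye(Xd.shape[1])
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def predict(theta, X):
    return sigmoid(np.column_stack([np.ones(len(X)), X]) @ theta)

def quad_basis(X):
    """Hypothetical richer basis: linear, squared and cross terms of 2 features."""
    return np.column_stack([X, X**2, X[:, [0]] * X[:, [1]]])

rng = np.random.default_rng(1)
n, N = 300, 5000                                   # labeled (L) and unlabeled (U) sizes
true_theta = np.array([-0.5, 1.0, -0.8])           # used only to simulate labels
X_lab = rng.normal(size=(n, 2))
X_unl = rng.normal(size=(N, 2))
y_lab = rng.binomial(1, predict(true_theta, X_lab))

# Supervised benchmark: working model fitted on labeled data only.
theta_sup = fit_logistic(X_lab, y_lab)

# Step I: learn a flexible relationship between Y and X from the labeled data.
gamma_hat = fit_logistic(quad_basis(X_lab), y_lab)

# Step II: impute the missing Y for the unlabeled data.
y_imputed = predict(gamma_hat, quad_basis(X_unl))

# Step III: regress the imputed outcome against X with the working model.
theta_ssl = fit_logistic(X_unl, y_imputed)

print("supervised:", np.round(theta_sup, 2))
print("SSL       :", np.round(theta_ssl, 2))
```

If Step I used exactly the same model as Step III, Step III would simply return the Step I coefficients, so any gain has to come from an imputation model that is richer than the working model. The synthetic labels here are generated from the working model itself, so the example shows the mechanics of the procedure rather than an efficiency gain.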


Example: EHR Algorithm for Classifying Rheumatoid Arthritis

● n = 500 labeled observations
● N = 29,000 unlabeled observations

● Features:

– ICD9 codes for rheumatoid arthritis and competing diagnoses
– NLP mentions of clinical conditions/signs/symptoms
– medication prescriptions, lab results

● Results: Efficiency of SSL relative to supervised
– SSL 20% to 380% more efficient for regression coefficients
– SSL 50% to 670% more efficient for accuracy parameters such as sensitivity, specificity and AUC.


Genetic Risk Prediction

● Goal: predict the risk of Y = 1 using genetic marker G under

$P(Y = 1 \mid G) = g(\beta_0 + \beta^{\mathsf T} G)$

● Challenges:
– Y only available on a small set
– Algorithm scores $S = (S_1, \ldots, S_k)^{\mathsf T}$ for predicting Y are available on all patients, but they may not be entirely accurate or fully validated

● Question: how to efficiently
– estimate β
– evaluate the prediction performance of $S_k$ for Y

● Approach:
– Assumption: S relates to G only through Y, i.e. $S \perp G \mid Y$
– maximizing a composite non-parametric likelihood ↝ $P(S_k \ge s \mid Y)$ and β.
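
The slides do not spell out the composite likelihood. The identity below is only what the conditional independence assumption implies for a patient with no label; writing $f$ for the conditional density of S is a notational assumption made here.

```latex
\begin{aligned}
f(S \mid G)
  &= \sum_{y \in \{0,1\}} f(S \mid Y = y, G)\, P(Y = y \mid G) \\
  &= f(S \mid Y = 1)\, g(\beta_0 + \beta^{\mathsf T} G)
   + f(S \mid Y = 0)\, \{1 - g(\beta_0 + \beta^{\mathsf T} G)\}
\end{aligned}
```

The second line uses $S \perp G \mid Y$ to drop G from $f(S \mid Y = y, G)$. Unlabeled patients therefore still carry information about β through the mixture weights, which is why scores available on all patients can sharpen the estimate of β even when Y is observed on only a small set.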


Genetic Risk Prediction of CAD in RA Patients

Goal: compare genetic risk model estimates for CAD among RA patients obtained via supervised methods with the full sample of 950 patients versus SSL with a subsample of 200 labels on CAD, leveraging three phenotype algorithms $S = (S_{ICD}, S_{NLP}, S_{Curated})^{\mathsf T}$.

Variable     Full Data               n = 200 Labels
             β (SE)         p-value  β (SE)         p-value
age           0.09 (0.01)   0.00      0.11 (0.01)   0.00
sex           1.21 (0.27)   0.00      1.51 (0.31)   0.00
rs2479409     0.45 (0.26)   0.08      0.56 (0.30)   0.06
rs7206971    −0.82 (0.36)   0.02     −0.75 (0.37)   0.04
rs2902940    −2.02 (0.71)   0.00     −1.96 (0.55)   0.00

Labels  Method  AUC                                   Sensitivity
                ICD         NLP         ICD+NLP       ICD         NLP         ICD+NLP
200     SL      .93 (.071)  .98 (.020)  .99 (.012)    .84 (.142)  .80 (.161)  .92 (.093)
200     SSL     .97 (.017)  .98 (.005)  .99 (.002)    .86 (.062)  .86 (.068)  .97 (.017)
Full    SL      .94 (.015)  .98 (.004)  .99 (.002)    .86 (.041)  .85 (.039)  .97 (.013)
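
One way to read the accuracy table is through relative efficiency, taken here as the ratio of estimated variances. The sketch below assumes the parenthesized values in the table are standard errors and uses only the AUC entries for the 200-label rows; the function name `relative_efficiency` is hypothetical.

```python
# Standard errors of the AUC estimates with 200 labels, copied from the table above.
se_sl = {"ICD": 0.071, "NLP": 0.020, "ICD+NLP": 0.012}    # supervised (SL)
se_ssl = {"ICD": 0.017, "NLP": 0.005, "ICD+NLP": 0.002}   # semi-supervised (SSL)

def relative_efficiency(se_reference, se_new):
    """Variance ratio Var(reference) / Var(new); values above 1 favor `se_new`."""
    return (se_reference / se_new) ** 2

re_auc = {name: relative_efficiency(se_sl[name], se_ssl[name]) for name in se_sl}

for name, value in re_auc.items():
    print(f"AUC ({name}): variance ratio SL/SSL is about {value:.1f}")
```

The ratios come out at roughly 17, 16 and 36 for ICD, NLP and ICD+NLP respectively. Because the reported standard errors are rounded to two or three digits, these ratios are rough.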


Remarks

EHR Data provides:

● Opportunities for novel "big-data-analytics" development
– optimal sampling design for chart review or marker measurement
– Unsupervised learning:
  ○ Automated Feature Selection
  ○ Automated Phenotype Prediction/Annotation

● Opportunities to improve clinical practice and discovery research
– precision medicine: who should be treated with what
– more accurate diagnosis/prognosis
– capture the disease early
– longitudinal information enables dynamic prediction
