49
Machine Learning for Healthcare 6.S897, HST.S53 Prof. David Sontag MIT EECS, CSAIL, IMES Lecture 2: Risk stratification (Thanks to Narges Razavian for some of the slides)

Machine Learning for Healthcare - GitHub Pages• San Antonio Model • Easy to ask/measure in the office, or for patients to do online ... 71947 Joint pain-ankle 28648 3004 Dysthymic

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

MachineLearningforHealthcare6.S897,HST.S53

Prof.DavidSontagMITEECS,CSAIL,IMES

Lecture2:Riskstratification

(Thanks to Narges Razavian for some of the slides)

Outlinefortoday’sclass

1. Casestudyforriskstratification:EarlydetectionofType2diabetes

2. Framingassupervisedlearningproblem3. Derivinglabels4. Evaluatingriskstratificationalgorithms5. Non-stationarity

Outlinefortoday’sclass

1. Casestudyforriskstratification:EarlydetectionofType2diabetes

2. Framingassupervisedlearningproblem3. Derivinglabels4. Evaluatingriskstratificationalgorithms5. Non-stationarity

Type2Diabetes:AMajorpublichealthchallenge

1994 2000

<4.5%4.5%–5.9%6.0%–7.4%7.5%–8.9%>9.0%

2013

$245billion:Totalcostsofdiagnoseddiabetes intheUnitedStatesin2012$831billion:Totalfiscalyearfederal budgetforhealthcare intheUnitedStatesin2014

Type2DiabetesCanBePrevented*

Requirement for successful large scale prevention program1. Detect/reach truly at risk population

2. Improve the interventions

3. Lower the cost of intervention

*DiabetesPreventionProgramResearchGroup."Reductionintheincidenceoftype2diabeteswithlifestyleinterventionormetformin."TheNewEnglandjournalofmedicine346.6(2002):393.

Type2DiabetesCanBePrevented*

Requirement for successful large scale prevention program1. Detect/reach truly at risk population

2. Improve the interventions

3. Lower the cost of intervention

*DiabetesPreventionProgramResearchGroup."Reductionintheincidenceoftype2diabeteswithlifestyleinterventionormetformin."TheNewEnglandjournalofmedicine346.6(2002):393.

TraditionalRiskPredictionModels• SuccessfulExamples

• ARIC• KORA• FRAMINGHAM• AUSDRISC• FINDRISC• SanAntonioModel

• Easytoask/measureintheoffice,orforpatientstodoonline

• Simplemodel:cancalculatescoresbyhand

ChallengesofTraditionalRiskPredictionModels• A screening step needs to be done for every

member in the population• Either in the physician’s office or as surveys• Costly and time-consuming• Infeasible for regular screening for millions of individuals

• Models not easy to adapt to multiple surrogates, when a variable is missing• Discovery of surrogates not straightforward

Population-LevelRiskStratification

• Keyidea:Usereadilyavailableadministrative,utilization,andclinicaldata

• Machinelearningwillfindsurrogatesforriskfactorsthatwouldotherwisebemissing

• Performriskstratificationatthepopulationlevel– millionsofpatients

[Razavian, Blecker, Schmidt,Smith-McLallen,Nigam,Sontag.BigData.‘16]

AData-DrivenapproachonLongitudinalData

• Lookingatindividualswhogotdiabetestoday, (comparedtothosewhodidn’t)– Canweinferwhichvariables intheir recordcouldhavepredicted their

healthoutcome?

TodayAFewYearsAgo

Reminder:Administrative&ClinicalData

Patient:

EligibilityRecord:-MemberID-Age/gender-IDofsubscriber-Companycode

MedicalClaims:-ICD9diagnosiscodes-CPTcode(procedure)-Specialty-Locationofservice-DateofService

LabTests:-LOINCcode(urineorbloodtestname)-Results(actualvalues)-LabID-Rangehigh/low-Date

Medications:-NDCcode(drugname)-Daysofsupply-Quantity-ServiceProviderID-Dateoffill

time

Disease count4011Benignhypertension 4470172724Hyperlipidemia NEC/NOS 3820304019HypertensionNOS 37247725000DMIIwocmp nt st uncntr 3395222720Purehypercholesterolem 2326712722Mixedhyperlipidemia 180015V7231Routinegyn examination 1787092449HypothyroidismNOS 16982978079MalaiseandfatigueNEC 149797V0481Vaccin forinfluenza 1478587242Lumbago 137345V7612ScreenmammogramNEC 129445V700Routinemedicalexam 127848

Disease count71947Jointpain-ankle 286483004Dysthymicdisorder 285302689VitaminDdeficiencyNOS 28455V7281Preopcardiovsclrexam 278977243Sciatica 2760478791Diarrhea 27424V221Supervis oth normalpreg 2732036501Opnanglbrderln lorisk 2603337921Vitreousdegeneration 255924241Aorticvalvedisorder 2542561610VaginitisNOS 2473670219Othersborheickeratosis 244533804Impactedcerumen 24046

Disease count53081Esophagealreflux 12106442731Atrialfibrillation 1137987295Paininlimb 11244941401Crnry athrscl natve vssl 1044782859AnemiaNOS 10335178650ChestpainNOS 919995990Urin tractinfectionNOS 87982V5869Long-termusemedsNEC 85544496Chr airwayobstructNEC 785854779Allergic rhinitisNOS 7796341400Cor ath unsp vsl ntv/gft 75519

Outof135K patients who hadlaboratorydata

Topdiagnosiscodes

Labtest2160-0Creatinine 12847373094-0Ureanitrogen 12823442823-3Potassium 12808122345-7Glucose 12998971742-6Alanineaminotransferase 11878091920-8Aspartateaminotransferase 11879652885-2Protein 12773381751-7Albumin 12741662093-3Cholesterol 12682692571-8Triglyceride 125775113457-7Cholesterol.inLDL 124120817861-6Calcium 11653702951-2Sodium 1167675

Labtest

2085-9Cholesterol.in HDL 1155666718-7Hemoglobin 11527264544-3Hematocrit 11478939830-1Cholesterol.total/Cholesterol.in HDL 103773033914-3Glomerularfiltrationrate/1.73sqM.predicted 561309

785-6Erythrocytemeancorpuscularhemoglobin 10708326690-2Leukocytes 1062980789-8Erythrocytes 1062445

787-2Erythrocytemeancorpuscularvolume 1063665

Labtest770-8Neutrophils/100leukocytes 952089731-0Lymphocytes 943918704-7Basophils 863448711-2Eosinophils 9357105905-5Monocytes/100leukocytes 943764706-2Basophils/100leukocytes 863435751-8Neutrophils 943232742-7Monocytes 942978713-8Eosinophils/100leukocytes 9339293016-3Thyrotropin 8918074548-4HemoglobinA1c/Hemoglobin.total 527062

Countofpeoplewhohavethetestresult(ever)

Toplabtestresults

Outlinefortoday’sclass

1. Casestudyforriskstratification:EarlydetectionofType2diabetes

2. Framingassupervisedlearningproblem3. Derivinglabels4. Evaluatingriskstratificationalgorithms5. Non-stationarity

Alignbyabsolutetime

Framingforsupervisedmachinelearning

Gap isimportant topreventlabelleakage

Alternativeframings• Alignbyrelativetime,e.g.

– 2hoursintopatientstayinER– EverytimepatientseesPCP– Whenindividualturns40yrs old

• Alignbydataavailability

NOTE:• Ifmultipledatapointsperpatient,makesureeachpatientinonly train,validate,ortest

Methods• L1RegularizedLogisticRegression

– Simultaneouslyoptimizespredictiveperformanceand– Performsfeatureselection,choosingthesubsetofthefeaturesthataremostpredictive

• Thispreventsoverfittingtothetrainingdata

Demographics(age,sex,etc.)

Healthinsurancecoverage

Proceduresperformed(457features)

Specialtyofdoctorsseen(cardiology,rheumatology,…)

FeaturesusedinmodelsServiceplace(urgentcare,inpatient,outpatient,…)

Laboratoryindicators(7000features)

Forthe1000mostfrequent labtests:• Wasthetesteveradministered?• Wastheresulteverlow?• Wastheresulteverhigh?• Wastheresultevernormal?• Isthevalueincreasing?• Isthevaluedecreasing?• Isthevaluefluctuating?

Medicationstaken(999features)(laxatives,metformin,anti-arthritics,…)

Demographics(age,sex,etc.)

Healthinsurancecoverage

Proceduresperformed(457features)

Specialtyofdoctorsseen(cardiology,rheumatology,…)

FeaturesusedinmodelsServiceplace(urgentcare,inpatient,outpatient,…)

Laboratoryindicators(7000features)

Medicationstaken(999features)(laxatives,metformin,anti-arthritics,…)

16,000ICD-9diagnosiscodes(allhistory)

Allhistory 24monthhistory

6monthhistory

Totalfeaturesperpatient:42,000

Outlinefortoday’sclass

1. Casestudyforriskstratification:EarlydetectionofType2diabetes

2. Framingassupervisedlearningproblem3. Derivinglabels4. Evaluatingriskstratificationalgorithms5. Non-stationarity

Wheredothelabelscomefrom?

1. Manuallylabeldatabychartreview2. Electronicphenotypingfrommedicalrecords3. Usemachinelearningtogetthelabels

themselves

Electronicphenotyping

Electronicphenotyping

Visualization(lookingatindividualpatients)isimportanttosanitycheck

labelingmethod

Demographic informationPatientevents list

Events,astheyoccurforthefirsttime inpatienthistory

GettingthelabelsusingtheAnchor&LearnFramework

• Useacombinationofdomainexpertise(simplerules)andvastamountsofdata(machinelearning)

• Methoddoesnotrequireanymanuallabeling• Anchorsarehighlytransferablebetweeninstitutions

[Halpernetal.,AMIA2014]

Whatareanchors?• Ratherthanprovidegold-standardlabels,constructasimplerulethatcancatchsomepositivecases.

• Examples:

Clin.statevar PossibleAnchorDiabetic gsn:016313(insulin)inMedications

Cardiac ICD9:428.X (heartfailure)inDiagnoses

Nursinghome “fromnursinghome”intext

Socialwork “socialworkconsulted”intext

Whatareanchors?• Ratherthanprovidegold-standardlabels,constructasimplerulethatcancatchsomepositivecases. Lowsensitivityhereisok!

• Examples:

Clin.statevar PossibleAnchorDiabetic gsn:016313(insulin)inMedications

Cardiac ICD9:428.X (heartfailure)inDiagnoses

Nursinghome “fromnursinghome”intext

Socialwork “socialworkconsulted”intext

LearningwithAnchorsLOINC& UMLS&CUID& RXnorm& ICD9& Unstructured&Data&

Patientdatabase

101

100

1

• Identifyanchors• Learntopredicttheanchors(anchoraspseudo-labels)• Accountforthedifferencebetweenanchorsandlabels

Transform

Predictanchor Predictlabel

Theoreticalbasisforanchors• Unobservedvariable:Y,Observation:A• A isananchorforY ifconditioningonA=1givesuniformsamplesfromthesetofpositivecases.

Theoreticalbasisforanchors• Unobservedvariable:Y,Observation:A• A isananchorforY ifconditioningonA=1givesuniformsamplesfromthesetofpositivecases.

• Alternativeformulation– twonecessaryconditions:

P (Y = 1|A = 1) = 1Positivecondition

A ? X|YConditionalindependence

AND

X representsallother observations.

Theoreticalbasisforanchors• Unobservedvariable:Y,Observation:A• A isananchorforY ifconditioningonA=1givesuniformsamplesfromthesetofpositivecases.

• Alternativeformulation– twonecessaryconditions:

P (Y = 1|A = 1) = 1Positivecondition

A ? X|YConditionalindependence

AND

X representsallother observations.

e.g.Ifpatient istakinginsulin,thepatient issurelydiabetic.

e.g.Ifweknowthepatient hadheartfailure,knowingwhetherthediagnosiscodeappearsdoesnotinformusabouttherestoftherecord.

Theoreticalbasisforanchors• Unobservedvariable:Y,Observation:A• A isananchorforY ifconditioningonA=1givesuniformsamplesfromthesetofpositivecases.

• Theorem[Elkan &Noto 2008]:Intheabovesetting,afunction topredictA

canbetransformedtopredict Y

• Canalsousemorerecentadvancesonlearningwithnoisylabels(e.g.,Natarajan etal.,NIPS‘13)

LearningwithanchorsInput:anchorA

unlabeledpatientsOutput:predictionrule1. Learnacalibratedclassifier (e.g.

logisticregression) topredict:

2. Usingavalidateset,letP bethepatientswithA=1.Compute:

3. Forapreviouslyunseenpatientt,predict:

Pr(A = 1 | X̃ )

C =1

|P|X

k2PPr(A = 1 | X̃ (k))

[Elkan&Noto 2008]

1

CPr(A = 1|X (t)) if A(t) = 0

1 if A(t) = 1

CalibrationC istheaveragemodel

prediction forpatientswithanchors.

LearningLearntopredictA fromtheothervariables.

TransformationIfnoanchorpresent,

accordingtoascaledversionoftheanchor-prediction

model.

Outlinefortoday’sclass

1. Casestudyforriskstratification:EarlydetectionofType2diabetes

2. Framingassupervisedlearningproblem3. Derivinglabels4. Evaluatingriskstratificationalgorithms5. Non-stationarity

WhataretheDiscoveredRiskFactors?

• 769variableshavenon-zeroweight

TopHistoryofDisease Odds RatioImpaired Fasting Glucose (Code 790.21) 4.17

(3.87 4.49)

Abnormal Glucose NEC (790.29) 4.07 (3.76 4.41)

Hypertension (401) 3.28 (3.17 3.39)

Obstructive Sleep Apnea (327.23) 2.98 (2.78 3.20)

Obesity (278) 2.88 (2.75 3.02)

Abnormal Blood Chemistry (790.6) 2.49 (2.36 2.62)

Hyperlipidemia (272.4) 2.45 (2.37 2.53)

Shortness Of Breath (786.05) 2.09 (1.99 2.19)

Esophageal Reflux (530.81) 1.85(1.78 1.93)

Diabetes1-yeargap

WhataretheDiscoveredRiskFactors?

TopHistoryofDisease Odds RatioImpaired Fasting Glucose (Code 790.21) 4.17

(3.87 4.49)

Abnormal Glucose NEC (790.29) 4.07 (3.76 4.41)

Hypertension (401) 3.28 (3.17 3.39)

Obstructive Sleep Apnea (327.23) 2.98 (2.78 3.20)

Obesity (278) 2.88 (2.75 3.02)

Abnormal Blood Chemistry (790.6) 2.49 (2.36 2.62)

Hyperlipidemia (272.4) 2.45 (2.37 2.53)

Shortness Of Breath (786.05) 2.09 (1.99 2.19)

Esophageal Reflux (530.81) 1.85(1.78 1.93)

Additional DiseaseRiskFactors Include:Pituitarydwarfism (253.3),Hepatomegaly(789.1), ChronicHepatitisC(070.54),Hepatitis (573.3),CalcanealSpur(726.73),Thyrotoxicosiswithoutmentionofgoiter(242.90),Sinoatrial Nodedysfunction(427.81),Acute frontalsinusitis(461.1),Hypertrophicandatrophicconditionsofskin(701.9),Irregularmenstruation(626.4), …

• 769variableshavenon-zeroweight

Diabetes1-yeargap

Top Lab Factors Odds RatioHemoglobin A1c /Hemoglobin.Total (High - past 2 years) 5.75

(5.42 6.10)

Glucose (High- Past 6 months) 4.05 (3.89 4.21)

Cholesterol.In VLDL (Increasing - Past 2 years) 3.88(3.53 4.27)

Potassium (Low - Entire History) 2.58(2.24 2.98)

Cholesterol.Total/Cholesterol.In HDL (High - Entire History) 2.29(2.19 2.40)

Erythrocyte mean corpuscular hemoglobin concentration -(Low - Entire History)

2.25(1.92 2.64)

Eosinophils (High - Entire History) 2.11(1.82 2.44)

Glomerular filtration rate/1.73 sq M.Predicted (Low -Entire History) 2.07(1.92 2.24)

Alanine aminotransferase (High Entire History) 2.04(1.89 2.19)

WhataretheDiscoveredRiskFactors?

• 769variableshavenon-zeroweight

Diabetes1-yeargap

Top Lab Factors Odds RatioHemoglobin A1c /Hemoglobin.Total (High - past 2 years) 5.75

(5.42 6.10)

Glucose (High- Past 6 months) 4.05 (3.89 4.21)

Cholesterol.In VLDL (Increasing - Past 2 years) 3.88(3.53 4.27)

Potassium (Low - Entire History) 2.58(2.24 2.98)

Cholesterol.Total/Cholesterol.In HDL (High - Entire History) 2.29(2.19 2.40)

Erythrocyte mean corpuscular hemoglobin concentration -(Low - Entire History)

2.25(1.92 2.64)

Eosinophils (High - Entire History) 2.11(1.82 2.44)

Glomerular filtration rate/1.73 sq M.Predicted (Low -Entire History) 2.07(1.92 2.24)

Alanine aminotransferase (High Entire History) 2.04(1.89 2.19)

WhataretheDiscoveredRiskFactors?

Additional LabTestRiskFactors Include:Albumin/Globulin (Increasing -Entirehistory),Ureanitrogen/Creatinine -(high-EntireHistory),Specificgravity(Increasing,Past2years),Bilirubin (high-Past2years),…

• 769variableshavenon-zeroweight

Diabetes1-yeargap

Receiver-operatorcharacteristiccurve

FullmodelTraditionalriskfactors

Falsepositiverate

Truepositiverate

Wanttobehere Obtainedbyvaryingpredictionthreshold

Diabetes1-yeargap

Receiver-operatorcharacteristiccurve

FullmodelTraditionalriskfactors

Falsepositiverate

Truepositiverate

AreaundertheROCcurve(AUC)

AUC=Probabilitythatalgorithmranksapositivepatientoveranegativepatient

Invarianttoamountofclassimbalance

Diabetes1-yeargap

Receiver-operatorcharacteristiccurve

FullmodelAUC=0.78TraditionalriskfactorsAUC=0.74

Falsepositiverate

Truepositiverate

Riskstratificationusuallyfocusesonjustthisregion

(becauseofthecostofinterventions)

RandomAUC=0.5

Diabetes1-yeargap

Positivepredictivevalue(PPV)

0.060.07

0.06

0.15

0.17

0.1

Top100Predictions Top1000Predictions Top10000Predictions

Traditionalriskfactors Fullmodel

Diabetes1-yeargap

Calibration(note:differentdataset)

Predicted Probability

Actual Probability

0

1

0.5

1

fraction of patients the model predicts to have this

probability of infection

Model

PredictinginfectionintheER

Outlinefortoday’sclass

1. Casestudyforriskstratification:EarlydetectionofType2diabetes

2. Framingassupervisedlearningproblem3. Derivinglabels4. Evaluatingriskstratificationalgorithms5. Non-stationarity

Majorchallenge:non-stationarity• ICD10rolledoutin2015:predictivemodelslearnedusingICD9featuresarenolongeruseful!

• Logisticalissues=>somefeaturesmaynotbeavailable!

• Prevalenceandsignificanceoffeaturesmaychangeovertime

• Automaticallyderivedlabelsmaychangemeaning

Top100labmeasurements overtime

Time(inmonths,from1/2005upto1/2014)

Labs

DiabetesOnsetafter2009

Months,after2009/01/01

Num

bero

fpeo

plewho

haddiabetesonset

Jan/Feb2010

Aslightdownwardtrendofnewdiagnoses(Amongpatientswhoare enrolled between2009and2013,howmanynewlydiagnoseddiabetics dowehaveeachmonth?)

Geiss LS,WangJ,ChengYJ,etal.PrevalenceandIncidenceTrendsforDiagnosedDiabetes AmongAdultsAged20to79Years,UnitedStates,1980-2012.JAMA.

2014;312(12):1218-1226.

DiabetesOnsetafter2009

Externalvalidity• Motivatesmulti-institutionevaluations• Goodpracticeistoletthetestdatabefromafutureyear