Page 1: Healthcare Data Analytics with Extreme Tree Models

Introduction to Healthcare Data Analytics with Extreme Tree Models

Yubin Park, PhD
Chief Technology Officer

Page 2: Healthcare Data Analytics with Extreme Tree Models

Who am I

• Co-founder and Chief Technology Officer of Accordion Health, Inc.
• PhD from the University of Texas at Austin
  • Advisor: Professor Joydeep Ghosh
  • Studied Machine Learning and Data Mining, with a special focus on healthcare data
• Involved in various industry data mining projects
  • USAA: Life-time modeling of customers
  • SK Telecom: Smartphone purchase prediction, usage pattern analysis
  • LinkedIn Corp.: Related search keywords recommendation
  • Whole Foods Market: Price elasticity modeling
  • …

Page 3: Healthcare Data Analytics with Extreme Tree Models

Accordion Health

• Healthcare Data Analytics Company
• Founded in 2014 by
  • Sriram Vishwanath, PhD
  • Yubin Park, PhD
  • Joyce Ho, PhD
• A team of data scientists and medical professionals
• Help healthcare organizations lower costs and improve quality

Photo: From Health Datapalooza 2014

Page 4: Healthcare Data Analytics with Extreme Tree Models

Types of Problems We Solve

• Which patient is likely to be readmitted?
• Which patient is likely to develop type 2 diabetes?
• Which patient is likely to adhere to their medication?
• How much will this patient cost this year?
• How many inpatient admissions will this patient have this year?
• Which physician is likely to follow our care guideline?
• What star rating will our organization receive this year?
• …

Page 5: Healthcare Data Analytics with Extreme Tree Models

Healthcare Data is Messy

• Data structure
  • Unstructured data such as EHR
  • Structured data such as claims
• Location
  • Doctors' offices, insurance companies, governments, etc.
• Data definition
  • Different definitions for different communities
• Data format
  • Various industry formats
• Data complexity
  • Patients going in and out of systems
  • Incomplete data
• Regulations & requirements
• Source: Health Catalyst

Page 6: Healthcare Data Analytics with Extreme Tree Models

My Usual Workflow

Diagram boxes, in slide order:

• Summary Statistics
• Visual Inspection
• Data Cleansing & Feature Engineering (1)
• Baseline Models
• Extreme Tree Models
• Data Cleansing & Feature Engineering (2)
• Custom Extreme Tree Models
• Data Cleansing & Feature Engineering (3)
• Fully Customized Models

I start my data project by checking summary statistics, distributions, data errors, and applying simple models.

Extreme Tree Models* serve as a checkpoint before further developing customized models.

*Extreme Tree Models refer to a class of models that use a tree as a base classifier.

Page 7: Healthcare Data Analytics with Extreme Tree Models

Why Tree-based Models

“Of all the well-known methods, decision trees come closest to meeting the requirements for serving as an off-the-shelf procedure for data mining.”

- J. H. Friedman, R. Tibshirani, and T. Hastie, The Elements of Statistical Learning

Page 8: Healthcare Data Analytics with Extreme Tree Models

How to Grow a Tree

1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets
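The five steps above can be sketched as a short recursive procedure. This is a minimal illustration only, assuming a regression target, an exhaustive search over cut-points, and variance as the split score; the function names (`best_split`, `grow_tree`, `predict`) and the stopping rules (`max_depth`, `min_samples`) are illustrative and do not come from the slides.

```python
# Minimal recursive-partitioning sketch (regression tree, variance-based splits).
# All names and stopping rules here are illustrative assumptions.

def variance(ys):
    """Mean squared deviation of the labels from their mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def best_split(rows, ys):
    """Steps 2-3: try every (feature, cut-point) pair and keep the one with
    the lowest weighted variance across the two child nodes."""
    best = None
    for feature in range(len(rows[0])):
        for cut in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, ys) if row[feature] <= cut]
            right = [y for row, y in zip(rows, ys) if row[feature] > cut]
            if not left or not right:
                continue
            score = (len(left) * variance(left) + len(right) * variance(right)) / len(ys)
            if best is None or score < best[0]:
                best = (score, feature, cut)
    return best

def grow_tree(rows, ys, depth=0, max_depth=3, min_samples=2):
    """Steps 1, 4, 5: split the data and recurse until a stopping rule applies."""
    can_split = depth < max_depth and len(ys) >= min_samples
    split = best_split(rows, ys) if can_split else None
    if split is None:
        return {"leaf": sum(ys) / len(ys)}
    _, feature, cut = split
    left = [i for i, row in enumerate(rows) if row[feature] <= cut]
    right = [i for i, row in enumerate(rows) if row[feature] > cut]
    return {
        "feature": feature,
        "cut": cut,
        "left": grow_tree([rows[i] for i in left], [ys[i] for i in left],
                          depth + 1, max_depth, min_samples),
        "right": grow_tree([rows[i] for i in right], [ys[i] for i in right],
                           depth + 1, max_depth, min_samples),
    }

def predict(tree, row):
    """Follow the splits down to a leaf and return its mean label."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["cut"] else tree["right"]
    return tree["leaf"]

rows = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
tree = grow_tree(rows, ys)
print(predict(tree, [2.5]), predict(tree, [10.5]))
```

The variants on the following pages change only how steps 2 and 3 choose the feature and cut-point, or which samples step 1 starts from.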

Page 9: Healthcare Data Analytics with Extreme Tree Models

Various Kinds of Trees – C4.5, CART

1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets

Information Gain → C4.5
Gini Impurity, Variance Reduction → CART

- Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
- Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.
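To make the two split criteria concrete, the sketch below scores one candidate split with both entropy-based information gain and Gini impurity decrease. The labels are made-up toy values, and the helper names are illustrative. C4.5 additionally normalizes the gain into a gain ratio, which this sketch omits.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

# Toy labels (hypothetical): 4 readmitted and 4 not, split 3/1 versus 1/3.
parent = ["readmit"] * 4 + ["no"] * 4
left = ["readmit"] * 3 + ["no"] * 1
right = ["readmit"] * 1 + ["no"] * 3

info_gain = impurity_decrease(parent, left, right, entropy)
gini_decrease = impurity_decrease(parent, left, right, gini)
print(info_gain, gini_decrease)
```

Whichever criterion is used, the tree-growing loop is the same: the candidate split with the largest decrease wins.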

Page 10: Healthcare Data Analytics with Extreme Tree Models

Tree → Forest

• Randomization Methods
  • Random data sampling
  • Random feature sampling
  • Random cut-point sampling

Page 11: Healthcare Data Analytics with Extreme Tree Models

Various Kinds of Forests – Bagged Trees

1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets

Sample with replacement, and many trees → Bagged Trees

- Breiman, L. (1996b). Bagging predictors. Machine Learning, 24(2), 123–140.
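A minimal sketch of bagging, assuming scikit-learn and NumPy are available: each tree is fit on a bootstrap sample drawn with `sklearn.utils.resample`, and the predictions are averaged. The data is synthetic, and the number of trees (25) is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

# Synthetic, illustrative data: a noisy sine curve.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# Fit each tree on its own bootstrap sample (sampling rows with replacement).
trees = []
for seed in range(25):
    X_boot, y_boot = resample(X, y, replace=True, random_state=seed)
    trees.append(DecisionTreeRegressor(random_state=seed).fit(X_boot, y_boot))

def bagged_predict(X_new):
    """Average the predictions of all bootstrapped trees."""
    return np.mean([t.predict(X_new) for t in trees], axis=0)

X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
pred = bagged_predict(X_test)
```

scikit-learn also ships `BaggingRegressor`, which wraps the same idea; the explicit loop is shown here only to mirror the slide.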

Page 12: Healthcare Data Analytics with Extreme Tree Models

Various Kinds of Forests – Random Subspace

1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets

Select a random subset of features
Then find the best feature/cut-point

- Ho, T. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832–844.

Page 13: Healthcare Data Analytics with Extreme Tree Models

Various Kinds of Forests – Random Forests

1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets

Sample with replacement
Select a random subset of features
Then find the best feature/cut-point

- Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
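In scikit-learn, the two ingredients above map onto two arguments of `RandomForestRegressor`. The sketch below assumes synthetic data and arbitrary hyperparameter values. Note that scikit-learn draws the feature subset at every split, whereas the random subspace method draws it once per tree.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only the first two of six features carry signal.
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(
    n_estimators=50,
    bootstrap=True,    # sample rows with replacement for each tree
    max_features=2,    # consider a random subset of 2 features at each split
    random_state=0,
).fit(X, y)

importances = forest.feature_importances_
print(importances.round(3))
```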

Page 14: Healthcare Data Analytics with Extreme Tree Models

Various Kinds of Trees – ExtraTrees

1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets

Select a random subset of (feature, cut-point) pairs
Then find the best (feature, cut-point) pair

- Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42.
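The split rule above can be sketched as follows: draw a few features at random, draw one uniformly random cut-point for each, and keep the best-scoring pair. This is an illustration under assumptions (regression target, variance score, hypothetical function and parameter names), not the reference implementation.

```python
import random

def variance(ys):
    """Mean squared deviation of the labels from their mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def extra_trees_split(rows, ys, n_candidates=3, rng=random):
    """Draw random (feature, cut-point) pairs and return the best one as
    (score, feature, cut), or None if no candidate separates the data."""
    n_features = len(rows[0])
    features = rng.sample(range(n_features), min(n_candidates, n_features))
    best = None
    for feature in features:
        values = [row[feature] for row in rows]
        lo, hi = min(values), max(values)
        if lo == hi:
            continue  # constant feature, cannot split on it
        cut = rng.uniform(lo, hi)  # cut-point is drawn at random, not optimized
        left = [y for v, y in zip(values, ys) if v <= cut]
        right = [y for v, y in zip(values, ys) if v > cut]
        if not left or not right:
            continue
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, feature, cut)
    return best

# Toy data (hypothetical): feature 0 separates the labels, feature 1 does not.
rows = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]
ys = [0.0, 0.0, 1.0, 1.0]
split = extra_trees_split(rows, ys, n_candidates=2, rng=random.Random(0))
print(split)
```

Because no cut-point search is needed, each split is cheap, and the extra randomness further decorrelates the trees.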

Page 15: Healthcare Data Analytics with Extreme Tree Models

Again, Bias vs Variance

• Bias: Error from model
• Variance: Error from data
• Recursive partition → fewer samples as tree grows
• Split features/cut-points are susceptible to training samples
• Randomization decreases variance
• Image Source: Scott Fortmann-Roe
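For reference, the standard decomposition behind these bullets is written out below. It assumes squared-error loss and data generated as $y = f(x) + \varepsilon$ with noise variance $\sigma^2$; the symbols ($\hat{f}$, $\rho$, $s^2$, $M$) are notation introduced here, not taken from the slides. The second line is the textbook variance of an average of $M$ identically distributed trees, each with variance $s^2$ and pairwise correlation $\rho$.

```latex
\begin{aligned}
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  &= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
   + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
   + \sigma^2 \\
\operatorname{Var}\Big(\frac{1}{M}\sum_{m=1}^{M} T_m(x)\Big)
  &= \rho\, s^2 + \frac{1 - \rho}{M}\, s^2
\end{aligned}
```

Adding trees shrinks the second term, and randomization lowers $\rho$, which is one way to read the claim that randomization decreases variance.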

Page 16: Healthcare Data Analytics with Extreme Tree Models

Evolution of Bias vs. Variance

- Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42.

Page 17: Healthcare Data Analytics with Extreme Tree Models

Bias Variance Trade-off

Image Source: Scott Fortmann-Roe

• Randomization methods reduce variance
• However, for some problems, reducing the bias of a model may be more critical for improving its accuracy
  • A very complex dataset with many variables and samples

Page 18: Healthcare Data Analytics with Extreme Tree Models

Are Tree Models High-Variance Models?

• It depends…
  • Number of data samples
  • Number of features
  • Data complexity
• Randomization Methods
  • Decrease variance
  • But increase bias

There is another way of decreasing the expected error, which:
- Decreases bias
- May increase variance

Page 19: Healthcare Data Analytics with Extreme Tree Models

Boosting: Learn from Errors

$Y \approx f_0(X)$, where $E_1 = |Y - f_0(X)|^2$
$E_1 \approx f_1(X)$, where $E_2 = |E_1 - f_1(X)|^2$
$E_2 \approx f_2(X)$, where $E_3 = |E_2 - f_2(X)|^2$
and so on...
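The chain of fits above can be sketched with one-split trees (stumps). This illustration makes two assumptions that are not on the slide: each round fits the signed residual rather than the squared error, which is the usual least-squares form of boosting, and each stump's contribution is scaled by a `learning_rate`. All names are illustrative.

```python
def fit_stump(xs, targets):
    """Least-squares stump on one feature: the best single cut-point with
    the mean target on each side. Assumes at least two distinct x values."""
    best = None
    for cut in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= cut]
        right = [t for x, t in zip(xs, targets) if x > cut]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lmean) ** 2 for t in left) + sum((t - rmean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, cut, lmean, rmean)
    _, cut, lmean, rmean = best
    return lambda x: lmean if x <= cut else rmean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit each new stump to what the previous stumps left unexplained."""
    models, residuals = [], list(ys)
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(learning_rate * m(x) for m in models)

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [x * x for x in xs]
one_round = boost(xs, ys, n_rounds=1)
many_rounds = boost(xs, ys, n_rounds=50)
print(mse(one_round, xs, ys), mse(many_rounds, xs, ys))
```

Each round can only lower the training error, which is the sense in which boosting attacks bias; whether test error also falls depends on the data.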

Page 20: Healthcare Data Analytics with Extreme Tree Models

Additive Model Framework

• Additive Model Framework generalizes boosting, stacking, and other variants
• Source: J. H. Friedman, R. Tibshirani, and T. Hastie, The Elements of Statistical Learning (ESL)

Page 21: Healthcare Data Analytics with Extreme Tree Models

Gradient Boosting Machine

• Additive Models can be numerically optimized via Gradient Descent
• Source: Wikipedia and ESL

- Friedman, Jerome H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189–1232.
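As a reference for the bullet above, one common way to write the gradient boosting iteration is shown below. The notation ($L$ for the loss, $F_m$ for the model after $m$ rounds, $h_m$ for the tree added in round $m$, $\nu$ for a step size) is introduced here and is not taken from the slides. For simplicity a fixed step size stands in for the per-round line search.

```latex
\begin{aligned}
F_0(x) &= \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma) \\
r_{im} &= -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}} \\
h_m &= \text{tree fit to } \{(x_i, r_{im})\}_{i=1}^{n} \\
F_m(x) &= F_{m-1}(x) + \nu \, h_m(x)
\end{aligned}
```

With squared-error loss $L = \tfrac{1}{2}(y - F)^2$, the pseudo-residual reduces to $r_{im} = y_i - F_{m-1}(x_i)$, so each round fits the current residuals.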

Page 22: Healthcare Data Analytics with Extreme Tree Models

Extreme Gradient Boosting (XGBoost)

Various Data Mining Competitions in Kaggle

One thing they have in common:
- They all used XGBoost

Page 23: Healthcare Data Analytics with Extreme Tree Models

What’s so Special about XGBoost

• XGBoost implements the basic idea of GBM with some tweaks, such as:
  • Regularization of base trees
  • Approximate split finding
  • Weighted quantile sketch
  • Sparsity-aware split finding
  • Cache-aware block structure for out-of-core computation
• “XGBoost scales beyond billions of examples using far fewer resources than existing systems.” – T. Chen and C. Guestrin
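To make "regularization of base trees" concrete, the regularized objective from the XGBoost paper by Chen and Guestrin can be written as below. Here $T$ is the number of leaves of a tree $f$, $w$ its vector of leaf weights, $I_j$ the set of samples falling in leaf $j$, and $g_i$, $h_i$ the first and second derivatives of the loss $l$ with respect to the current prediction. The slide itself does not show these formulas.

```latex
\begin{aligned}
\mathcal{L} &= \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2 \\
w_j^{*} &= -\frac{G_j}{H_j + \lambda}, \qquad
G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i
\end{aligned}
```

The penalty $\gamma$ discourages extra leaves and $\lambda$ shrinks the leaf weights, so each added tree is a more conservative correction than in plain GBM.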

Page 24: Healthcare Data Analytics with Extreme Tree Models

Going Further Extreme

• XGBoost of XGBoost
• Bagging of XGBoost
• Bagging of XGBoost of XGBoost of …
• Stacking, Bagging, Sampling, etc.
• Source: Kaggle

Page 25: Healthcare Data Analytics with Extreme Tree Models

Real-world Example: Predict MedAdh Scores

• The Centers for Medicare and Medicaid Services (CMS) measures the performance of Medicare Advantage (MA) Plans via the Star Rating System
• Medication Adherence (MedAdh) is one of the most important quality measures in the Star Rating System
• MA Plans want to know how much their MedAdh scores will change in the next two years

Page 26: Healthcare Data Analytics with Extreme Tree Models

Predict MedAdh Scores

• Where can I find data?
  • Download from the CMS Part C and D Performance Data webpage
• Constructing datasets
  • MedAdh Data from 2012, 2013 → Training Features, Xtrain
  • MedAdh Data from 2015 → Training Label, Ytrain
  • MedAdh Data from 2013, 2014 → Test Features, Xtest
  • MedAdh Data from 2016 → Test Label, Ytest
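The year-shifted construction above can be sketched in a few lines. Everything in this example is hypothetical: the plan IDs, the scores, the single score per plan per year, and the choice to use the raw target-year score as the label (the label could equally be defined as a change in score). A sentinel marks years in which a plan was not measured.

```python
MISSING = -999.0  # sentinel for "plan not measured that year" (an assumption)

# Hypothetical scores keyed by year, then by plan ID. All values are made up.
scores = {
    2012: {"H0001": 71.2, "H0002": 61.8},
    2013: {"H0001": 72.7, "H0002": 59.4, "H0003": 75.8},
    2014: {"H0001": 69.9, "H0002": 57.7, "H0003": 72.5},
    2015: {"H0001": 75.2, "H0002": 57.3, "H0003": 68.8},
    2016: {"H0001": 75.9, "H0002": 59.3, "H0003": 70.1},
}

def build_dataset(scores, feature_years, label_year):
    """One row per plan that has a label; unmeasured feature years get MISSING."""
    X, y = [], []
    for plan in sorted(scores[label_year]):
        X.append([scores[year].get(plan, MISSING) for year in feature_years])
        y.append(scores[label_year][plan])
    return X, y

# Features lag the label by two to three years in both the training and test sets.
X_train, y_train = build_dataset(scores, feature_years=(2012, 2013), label_year=2015)
X_test, y_test = build_dataset(scores, feature_years=(2013, 2014), label_year=2016)
print(X_train, y_train)
```

Keeping the same lag between features and label in both sets means the test error reflects a genuine two-year-ahead forecast.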

Page 27: Healthcare Data Analytics with Extreme Tree Models

Lots of Missing Data

• Not all MA plans are measured for a given year → Mean Imputation

X1,X2,X3,X4,X5,X6,X7,X8,X9,Y

...

71.2,72.7,69.9,75.2,75.9,71.0,1.8

-999,-999,-999,75.8,72.5,68.8,-4.8

61.8,59.4,57.7,57.3,59.3,58.3,16.7

...

-999,-999,-999,82.8,80.0,69.8,-11.8

73.8,73.2,71.8,74.5,76.1,72.9,4.5
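A minimal sketch of mean imputation, assuming -999 is the missing-value sentinel as in the rows above. The input values are copied from the first six columns of three of those rows; the function name is illustrative.

```python
MISSING = -999.0

def mean_impute(rows, missing=MISSING):
    """Replace each missing entry with the mean of the observed values in its
    column. Assumes every column has at least one observed value."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] != missing]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v == missing else v for j, v in enumerate(row)]
            for row in rows]

rows = [
    [71.2, 72.7, 69.9, 75.2, 75.9, 71.0],
    [-999, -999, -999, 75.8, 72.5, 68.8],
    [61.8, 59.4, 57.7, 57.3, 59.3, 58.3],
]
imputed = mean_impute(rows)
print(imputed[1])
```

In practice the column means should be computed on the training set only and reused for the test set. scikit-learn's `SimpleImputer(missing_values=-999, strategy="mean")` offers the same behavior with a fit/transform interface.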

Page 28: Healthcare Data Analytics with Extreme Tree Models

Try Various Models

• From simple models like Linear Regression and Decision Tree to extreme tree models such as ExtraTrees and Gradient Boosting

from sklearn import linear_model

from sklearn import tree

from sklearn.utils import resample

from sklearn.metrics import mean_squared_error

from sklearn.ensemble import ExtraTreesRegressor

from sklearn.ensemble import GradientBoostingRegressor

Page 29: Healthcare Data Analytics with Extreme Tree Models

Try Various Models – code snippet

lm = linear_model.LinearRegression()
dt = tree.DecisionTreeRegressor()
etr = ExtraTreesRegressor(n_estimators=100, max_depth=10)
gbr = GradientBoostingRegressor(n_estimators=500,
                                learning_rate=0.25,
                                max_depth=8)
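The slides do not show how the models are fit and scored, so the following is only a sketch of one plausible evaluation loop. It assumes scikit-learn and NumPy, uses synthetic stand-in data in place of the CMS files, and fixes `random_state` for repeatability; the resulting RMSE values are not comparable to the ones reported on the next page.

```python
import numpy as np
from sklearn import linear_model, tree
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data; the real features and labels come from the CMS files.
rng = np.random.RandomState(0)
X = rng.uniform(50, 90, size=(400, 6))
y = 0.5 * (X[:, 5] - X[:, 4]) + rng.normal(scale=2.0, size=400)
X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

# Same model families and hyperparameters as in the snippet above.
models = {
    "lm": linear_model.LinearRegression(),
    "dt": tree.DecisionTreeRegressor(random_state=0),
    "etr": ExtraTreesRegressor(n_estimators=100, max_depth=10, random_state=0),
    "gbr": GradientBoostingRegressor(n_estimators=500, learning_rate=0.25,
                                     max_depth=8, random_state=0),
}

# Fit each model on the training split and report RMSE on the held-out split.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    rmse[name] = float(np.sqrt(mse))
    print(name, rmse[name])
```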

Page 30: Healthcare Data Analytics with Extreme Tree Models

Try Various Models – results

$ python test.py
RMSE Results
lm: 2.7125536923
dt: 3.10460672029
etr: 2.18597303421
gbr: 2.02698129388

Page 31: Healthcare Data Analytics with Extreme Tree Models

Try Various Models – results

Extreme Tree Models exhibit significant improvements in accuracy compared to simple models: RMSE drops from 2.71 (lm) and 3.10 (dt) to 2.19 (etr) and 2.03 (gbr).

One can build more sophisticated models based on the error characteristics of these models.

Page 32: Healthcare Data Analytics with Extreme Tree Models

Contact

• yubin [at] accordionhealth [dot] com