Introduction to Healthcare Data Analytics with Extreme Tree Models

Yubin Park, PhD
Chief Technology Officer
Who am I

• Co-founder and Chief Technology Officer of Accordion Health, Inc.
• PhD from the University of Texas at Austin
  • Advisor: Professor Joydeep Ghosh
  • Studied Machine Learning and Data Mining, with a special focus on healthcare data
• Involved in various industry data mining projects
  • USAA: Life-time modeling of customers
  • SK Telecom: Smartphone purchase prediction, usage pattern analysis
  • LinkedIn Corp.: Related search keywords recommendation
  • Whole Foods Market: Price elasticity modeling
  • …
Accordion Health

• Healthcare Data Analytics Company
• Founded in 2014 by
  • Sriram Vishwanath, PhD
  • Yubin Park, PhD
  • Joyce Ho, PhD
• A team of data scientists and medical professionals
• Helps healthcare organizations lower costs and improve quality of care
From Health Datapalooza 2014
Types of Problems We Solve

• Which patient is likely to be readmitted?
• Which patient is likely to develop type 2 diabetes?
• Which patient is likely to adhere to his medication?
• How much will this patient cost this year?
• How many inpatient admissions will this patient have this year?
• Which physician is likely to follow our care guideline?
• What star rating will our organization receive this year?
• …
Healthcare Data is Messy

• Data structure
  • Unstructured data such as EHR
  • Structured data such as claims
• Location
  • Doctors' offices, insurance companies, governments, etc.
• Data definition
  • Different definitions for different communities
• Data format
  • Various industry formats
• Data complexity
  • Patients going in and out of systems
• Incomplete data
• Regulations & requirements
• Source: Health Catalyst
My Usual Workflow

Summary Statistics & Visual Inspection
→ Data Cleansing & Feature Engineering (1) → Baseline Models → Extreme Tree Models
→ Data Cleansing & Feature Engineering (2) → Custom Extreme Tree Models
→ Data Cleansing & Feature Engineering (3) → Fully Customized Models
I start my data project by checking summary statistics, distributions, and data errors, and by applying simple models.

Extreme Tree Models* serve as a checkpoint before further developing customized models.

*Extreme Tree Models refer to a class of models that use a tree as a base classifier.
Why Tree-based Models

"Of all the well-known methods, decision trees come closest to meeting the requirements for serving as an off-the-shelf procedure for data mining."

- T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning
How to Grow a Tree

1. Start with a dataset
2. Pick a splitting feature
3. Pick a splitting cut-point
4. Split the dataset into two sets based on the splitting feature and cut-point
5. Repeat from Step 2 with the partitioned datasets
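The five steps above can be sketched as a short recursive procedure. A minimal regression-tree sketch (the helper names and the variance-reduction criterion are illustrative, not the deck's actual code):

```python
import numpy as np

def best_split(X, y):
    """Steps 2-3: pick the (feature, cut-point) pair that most reduces variance."""
    best, best_score = None, np.inf
    for j in range(X.shape[1]):                    # candidate splitting features
        for t in np.unique(X[:, j])[:-1]:          # candidate cut-points
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = left.var() * len(left) + right.var() * len(right)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def grow_tree(X, y, depth=0, max_depth=3, min_samples=5):
    """Steps 1, 4, 5: recursively split the dataset into two sets."""
    split = best_split(X, y) if depth < max_depth and len(y) >= min_samples else None
    if split is None:
        return {"leaf": y.mean()}                  # predict the node mean
    j, t = split
    mask = X[:, j] <= t
    return {"feature": j, "cut": t,
            "left": grow_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
            "right": grow_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples)}
```

Every tree and forest variant on the following slides keeps this loop and changes only how the candidate features and cut-points are chosen.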
Various Kinds of Trees - C4.5, CART

Same five steps as above; only the split criterion in Steps 2-3 differs:

Information Gain → C4.5
Gini Impurity, Variance Reduction → CART

- Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
- Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.
Tree → Forest

• Randomization Methods
  • Random data sampling
  • Random feature sampling
  • Random cut-point sampling
Various Kinds of Forests - Bagged Trees

Same five steps, but Step 1 changes:
Sample with replacement, and grow many trees → Bagged Trees

- Breiman, L. (1996b). Bagging predictors. Machine Learning, 24(2), 123-140.
Various Kinds of Forests - Random Subspace

Same five steps, but Steps 2-3 change:
Select a random subset of features, then find the best feature/cut-point → Random Subspace

- Ho, T. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832-844.
Various Kinds of Forests - Random Forests

Same five steps, but Steps 1-3 change:
Sample with replacement; select a random subset of features, then find the best feature/cut-point → Random Forests

- Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
Various Kinds of Trees - Extra Trees

Same five steps, but Steps 2-3 change:
Select a random subset of (feature, cut-point) pairs, then find the best (feature, cut-point) pair → Extra Trees

- Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3-42.
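The variants above map directly onto scikit-learn estimators; a sketch comparing them on synthetic data (the dataset and all parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

models = {
    # Bagged Trees: bootstrap samples of the data, full split search
    "bagging": BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0),
    # Random Forests: bootstrap samples + a random feature subset at each split
    "random_forest": RandomForestRegressor(n_estimators=50, max_features="sqrt", random_state=0),
    # Extra Trees: random cut-points as well (no bootstrap by default)
    "extra_trees": ExtraTreesRegressor(n_estimators=50, max_features="sqrt", random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))
```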
Again, Bias vs. Variance

• Bias: error from the model
• Variance: error from the data
• Recursive partitioning → fewer samples as the tree grows
  • Split features/cut-points are susceptible to the training samples
• Randomization decreases variance
• Image Source: Scott Fortmann-Roe
Evolution of Bias vs. Variance

- Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3-42.
Bias-Variance Trade-off

• Randomization methods reduce variance
• However, for some problems, reducing the bias of a model may be more critical for improving its accuracy
  • e.g., a very complex dataset with many variables and samples
• Image Source: Scott Fortmann-Roe
Are Tree Models High-Variance Models?

• It depends…
  • Number of data samples
  • Number of features
  • Data complexity
• Randomization methods
  • Decrease variance
  • But increase bias

There is another way of decreasing the expected error, which
- Decreases bias
- May increase variance
Boosting: Learn from Errors

Y ≈ f0(X), where E1 = |Y - f0(X)|^2
E1 ≈ f1(X), where E2 = |E1 - f1(X)|^2
E2 ≈ f2(X), where E3 = |E2 - f2(X)|^2
and so on...
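A minimal sketch of this residual-fitting loop, using shallow sklearn trees as the base learners (the round count and depth are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_rounds=20, max_depth=2):
    """Fit each new tree to the error left behind by the previous trees."""
    models, residual = [], y.astype(float)
    for _ in range(n_rounds):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        residual = residual - tree.predict(X)   # E_{k+1} = E_k - f_k(X)
        models.append(tree)
    return models

def predict(models, X):
    # The final model is the sum of all the small corrections
    return np.sum([m.predict(X) for m in models], axis=0)
```

Each round shrinks the remaining squared error, which is exactly the bias-reducing behavior contrasted with randomization above.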
Additive Model Framework

• The Additive Model Framework generalizes boosting, stacking, and other variants
• Source: T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning (ESL)
Gradient Boosting Machine

• Additive Models can be numerically optimized via Gradient Descent
• Source: Wikipedia and ESL

- Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.
Extreme Gradient Boosting (XGBoost)

Various data mining competitions on Kaggle.
One thing they have in common:
- They all used XGBoost
What’ssoSpecialaboutXGBoost
• XGBoost implementsthebasicideaofGBMwithsometweaks,such
as:
• Regularizationofbasetrees• Approximatesplitfinding
• Weightedquantile sketch
• Sparsity-awaresplitfinding• Cache-awareblockstructureforout-of-corecomputation
• “XGBoost scalesbeyondbillionsofexamplesusingfarfewerresources
thanexistingsystems.”– T.ChenandC.Guestrin
23
Going Further Extreme

• XGBoost of XGBoost
• Bagging of XGBoost
• Bagging of XGBoost of XGBoost of…
• Stacking, Bagging, Sampling, etc.
• Source: Kaggle
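One way to realize "bagging of boosting" with scikit-learn: wrap a gradient-boosted model inside a bagging ensemble, so each bootstrap sample gets its own boosted model (a sketch; the data and all parameter values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor

X, y = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=0)

# An ensemble of ensembles: bootstrap-bagging gradient-boosted trees
bagged_gbm = BaggingRegressor(
    GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0),
    n_estimators=10,          # 10 boosted models, each on a bootstrap sample
    random_state=0,
)
bagged_gbm.fit(X, y)
print(round(bagged_gbm.score(X, y), 3))
```

The same wrapper pattern nests arbitrarily deep, which is how the "XGBoost of XGBoost of…" stacks above are built in practice.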
Real-world Example: Predict MedAdh Scores

• The Centers for Medicare and Medicaid Services (CMS) measures the performance of Medicare Advantage (MA) Plans via the Star Rating System
• Medication Adherence (MedAdh) is one of the most important quality measures in the Star Rating System
• MA Plans want to know how much their MedAdh scores will change in the next two years
Predict MedAdh Scores

• Where can I find data?
  • Download from the CMS Part C and D Performance Data webpage
• Constructing datasets
  • MedAdh data from 2012, 2013 → Training Features, Xtrain
  • MedAdh data from 2015 → Training Label, Ytrain
  • MedAdh data from 2013, 2014 → Test Features, Xtest
  • MedAdh data from 2016 → Test Label, Ytest
Lots of Missing Data

• Not all MA plans are measured for a given year → Mean Imputation

X1,X2,X3,X4,X5,X6,X7,X8,X9,Y
...
71.2,72.7,69.9,75.2,75.9,71.0,1.8
-999,-999,-999,75.8,72.5,68.8,-4.8
61.8,59.4,57.7,57.3,59.3,58.3,16.7
...
-999,-999,-999,82.8,80.0,69.8,-11.8
73.8,73.2,71.8,74.5,76.1,72.9,4.5

(-999 marks a missing value.)
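Mean imputation over a -999 sentinel like the one in the rows above can be done with scikit-learn's SimpleImputer (the small matrix here is a toy stand-in for the MedAdh data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [71.2, 72.7, 69.9],
    [-999, -999, -999],
    [61.8, 59.4, 57.7],
])

# Replace each -999 with the column mean of the observed values
imputer = SimpleImputer(missing_values=-999, strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed[1])   # the all-missing row becomes the column means
```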
Try Various Models

• From simple models like Linear Regression and Decision Tree to extreme-tree models such as Extra Trees and Gradient Boosting
from sklearn import linear_model
from sklearn import tree
from sklearn.utils import resample
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
Try Various Models - code snippet

• From simple models like Linear Regression and Decision Tree to extreme-tree models such as Extra Trees and Gradient Boosting

lm = linear_model.LinearRegression()
dt = tree.DecisionTreeRegressor()
etr = ExtraTreesRegressor(n_estimators=100, max_depth=10)
gbr = GradientBoostingRegressor(n_estimators=500,
                                learning_rate=0.25,
                                max_depth=8)
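The deck's actual test.py is not shown; a hedged sketch of the fit-and-score loop, on a synthetic stand-in for the MedAdh matrices (the Xtrain/Xtest names follow the dataset slide; everything else is an assumption):

```python
import numpy as np
from sklearn import linear_model, tree
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the MedAdh feature/label matrices
X, y = make_regression(n_samples=600, n_features=6, noise=3.0, random_state=0)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

models = {
    "lm": linear_model.LinearRegression(),
    "dt": tree.DecisionTreeRegressor(random_state=0),
    "etr": ExtraTreesRegressor(n_estimators=100, max_depth=10, random_state=0),
    "gbr": GradientBoostingRegressor(n_estimators=500, learning_rate=0.25,
                                     max_depth=8, random_state=0),
}
print("RMSE Results")
for name, model in models.items():
    model.fit(Xtrain, ytrain)
    rmse = mean_squared_error(ytest, model.predict(Xtest)) ** 0.5
    print(f"{name}: {rmse}")
```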
Try Various Models - results

$ python test.py
…
RMSE Results
lm: 2.7125536923
dt: 3.10460672029
etr: 2.18597303421
gbr: 2.02698129388
Try Various Models - results

Extreme Tree Models exhibit significant improvements in accuracy compared to simple models.

One can build more sophisticated models based on the error characteristics of these models.
Contact

• yubin [at] accordionhealth [dot] com