RISKPREDICTOR: Identify health risk before it is too late
Dario Marrocchelli
Insight Health Data Fellow
The problem
Wellness companies do not know which programs will be most effective for a specific population
My solution
RiskPredictor: a tool that identifies people at risk¹ and their underlying conditions
¹ Risk is defined as an individual's predicted healthcare cost
The data: unique, rich and messy
Unique de-identified data set from Zakipoint that combines three types of information:
Claims: ICD-9 codes, medical costs, gender, age
Biometric: BMI, BP, cholesterol, A1C, etc.
Behavioral: HRA and wellness program participation
There are about 250,000 rows and 2,000 people in these datasets
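With roughly 250,000 rows for about 2,000 people, each person appears on many rows, so a natural first step is to collapse the rows into one record per person. The sketch below is illustrative only: the column names (`person_id`, `icd9`, `cost`) and the sample rows are assumptions, not the Zakipoint schema.

```python
from collections import defaultdict

# Hypothetical claim rows; column names and values are made up for illustration.
claims = [
    {"person_id": "p1", "icd9": "250.00", "cost": 120.0},
    {"person_id": "p1", "icd9": "401.9", "cost": 80.0},
    {"person_id": "p2", "icd9": "278.00", "cost": 45.5},
    {"person_id": "p1", "icd9": "250.00", "cost": 60.0},
]

def summarize_by_person(rows):
    """Collapse claim rows into one record per person: total cost and distinct codes."""
    people = defaultdict(lambda: {"total_cost": 0.0, "codes": set()})
    for row in rows:
        record = people[row["person_id"]]
        record["total_cost"] += row["cost"]
        record["codes"].add(row["icd9"])
    return dict(people)

summary = summarize_by_person(claims)
```

Here `summary` maps each person to a total cost (a possible regression target) and a set of ICD-9 codes (raw material for diagnostic features).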
Cannot discuss feature engineering
Model performs very well
1 http://us.milliman.com/mara/
Model performance is in line with proprietary programs that cost $100,000+.
There is room for improvement (more data, more features, etc.).
Model: RiskPredictor (Random Forest); ACG¹ (Commercial); RiskPredictor (Linear, Ridge); MARA¹ (Commercial)
R²: 20.5%; 29.7%; 34.4%; 57.9%
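For reference, R² measures the share of variance in actual cost that the predictions explain. The slides do not say how the scores were computed (for example, on which held-out split), so the sketch below only shows the standard definition, R² = 1 - SS_res / SS_tot, on made-up numbers.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up annual costs and predictions, for illustration only.
actual_costs = [100.0, 200.0, 300.0, 400.0]
predicted_costs = [150.0, 150.0, 350.0, 350.0]
score = r_squared(actual_costs, predicted_costs)
```

A model that always predicts the mean cost scores 0 under this definition, which is the baseline that the percentages on the slide are measured against.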
Short Bio
Computational Materials Scientist working on renewable energy
30+ papers, 900+ citations
PhD in Chemistry, 2006-2010
Postdoctoral Associate, 2010-2011
Postdoctoral Fellow, 2011-2013
Research Scientist & Instructor, 2013-Present
Fun Fact: I designed and taught at MIT a course on the Science of Cooking (rated 9/10)
Extra slides
Diabetes: 59 with diabetes, 972 without
Hypertension: 210 with hypertension, 821 without
Data
Claims, Biometric, Behavioral
Communication with Zakipoint
Ramesh Kumar, CEO
Heather Richie, VP Product Management
Several emails
Google Hangout
Data
Unique data set from zph that combines:
1) Claim information (ICD-9 codes, medical costs, gender, age)
2) Biometric information (BMI, BP, cholesterol, A1C, etc.)
3) Behavioral (HRA and wellness program participation)
The dataset contains 2,000 lives and is in CSV format (masked)
Weight categories: Obese, Overweight, Normal, Underweight (5). Other counts shown: (245), (430), (411)
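The four weight groups match the usual BMI bands. The slides do not state which cut-offs were used, so the sketch below assumes the standard WHO adult thresholds (18.5, 25 and 30) and runs on made-up BMI values.

```python
from collections import Counter

def bmi_category(bmi):
    """Map a BMI value to a weight category using WHO adult cut-offs (an assumption here)."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25.0:
        return "Normal"
    if bmi < 30.0:
        return "Overweight"
    return "Obese"

# Made-up BMI values, for illustration only.
sample_bmis = [17.9, 22.4, 27.0, 31.5, 24.9, 30.0]
counts = Counter(bmi_category(b) for b in sample_bmis)
```

Counting categories this way yields the kind of per-group totals shown in the chart.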
Algorithm anatomy
Raw data (ICD-9 codes, BMI)
1000s of diagnostic features (e.g. diabetes, obesity, etc.)
Regression (linear & random forest)
Predicted cost (risk scores)
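The four stages above can be sketched end to end. This is a minimal illustration, not the RiskPredictor implementation: the actual feature engineering is not disclosed, so the three-entry ICD-9 prefix table, the helper names, and the sample people are all hypothetical. Only the ridge (linear) variant is shown, fitted with plain-Python normal equations; the random forest variant is omitted.

```python
# Hypothetical mapping from ICD-9 code prefixes to diagnostic flags.
# 250 = diabetes mellitus, 401 = essential hypertension, 278.0 = overweight and obesity.
DIAGNOSIS_PREFIXES = {"diabetes": "250", "hypertension": "401", "obesity": "278.0"}
FEATURES = sorted(DIAGNOSIS_PREFIXES)

def diagnostic_features(codes):
    """Return one 0/1 indicator per diagnosis, in FEATURES order."""
    return [
        1.0 if any(code.startswith(DIAGNOSIS_PREFIXES[name]) for code in codes) else 0.0
        for name in FEATURES
    ]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ridge(rows, costs, alpha=1.0):
    """Ridge regression with an unpenalized intercept, via the normal equations."""
    x = [[1.0] + row for row in rows]  # leading 1.0 is the intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    for i in range(1, k):  # penalize the feature weights, not the intercept
        xtx[i][i] += alpha
    xty = [sum(r[i] * c for r, c in zip(x, costs)) for i in range(k)]
    return solve(xtx, xty)

def predict_cost(weights, row):
    """Predicted cost (the risk score): intercept plus weighted features."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))

# Made-up people: (set of ICD-9 codes, annual cost).
people = [
    (set(), 1000.0),
    ({"250.00"}, 5000.0),
    ({"401.9"}, 3000.0),
    ({"278.00"}, 2000.0),
]
rows = [diagnostic_features(codes) for codes, _ in people]
costs = [cost for _, cost in people]
weights = fit_ridge(rows, costs, alpha=1.0)
risk_score = predict_cost(weights, diagnostic_features({"250.02", "401.1"}))
```

With thousands of diagnostic features, a library implementation would replace the hand-written solver; the point here is only the shape of the pipeline: codes in, indicator features, regression, predicted cost out.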