Good Enough Analytics by Kai Xin
The Good Enough Stuff: Analytical Tools
Analytical Tools are like spoons
[Figure: chart labeled "Usefulness" with a "Point of stupidity" marked]
What is stupid today, might not be stupid tomorrow
Good Enough Analytics: Big data analytics using cost efficient tools
The Good Enough Stuff: Ensembles of good enough models
Point of stupidity: The perfect model
A “perfect” model is too complex, too costly to build, too hard to maintain and too inflexible to change.
“There are known knowns; there are things we know that we know.
There are known unknowns; there are things that we now know we don't know.
But there are also unknown unknowns; there are things we do not know we don't know.”
By Donald Rumsfeld, United States Secretary of Defense and Potential Data Scientist
Why the perfect model is stupid
“In statistics and machine learning, ensemble
methods use multiple models to obtain better predictive performance than could be obtained
from any of the constituent models”
Good Enough Analytics: Ensembles
Source: scholarpedia.org (refer to References)
The Serious Stuff…beyond theorycraft
Simple Ensembles – GLM Bootstrap aggregating (bagging)
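# Fit 1000 linear models, each on a resampled subset of the training rows
predictions <- foreach(1:1000, .combine=cbind) %dopar% {
  # Draw 90% of the row positions with replacement
  training_positions <- sample(nrow(train), size=floor(nrow(train)*0.9), replace=TRUE)
  # Logical mask of the sampled rows (duplicate draws collapse to a single row)
  train_pos <- 1:nrow(train) %in% training_positions
  glmMod <- rxLinMod(eqn, train[train_pos,])
  rxPredict(glmMod, test, type="response")
}
# Bagged prediction: average the 1000 predictions for each test row
result <- rowMeans(predictions)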
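The slide's code relies on Revolution R (rxLinMod, rxPredict) and a parallel backend. As a rough, self-contained illustration of the same bagging idea, the sketch below uses only base R. The toy data, the column names x and y, and the choice of 100 models are assumptions made for the example, and it draws a full-size bootstrap sample rather than the slide's 90% mask.

```r
set.seed(42)

# Toy data (hypothetical): y depends linearly on x, plus noise.
n <- 200
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- 3 + 2 * toy$x + rnorm(n, sd = 1)
train <- toy[1:150, ]
test  <- toy[151:200, ]

n_models <- 100  # assumed ensemble size for the example

# Each column holds the test-set predictions of one bootstrapped model.
predictions <- sapply(seq_len(n_models), function(i) {
  # Bootstrap sample: row positions drawn with replacement.
  rows <- sample(nrow(train), size = nrow(train), replace = TRUE)
  fit <- lm(y ~ x, data = train[rows, ])
  predict(fit, newdata = test)
})

# Bagged prediction: average across models for each test row.
bagged <- rowMeans(predictions)
```

Each model sees a slightly different sample, so averaging their predictions smooths out the quirks of any single fit.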
Simple Ensembles – Gradient Boosting Machines
gbmMod <- gbm(eqn, data=train, n.trees=10000, shrinkage=0.002, distribution="gaussian",
              interaction.depth=7, bag.fraction=0.9, n.minobsinnode=50)
Similar to bagging, boosting also creates an ensemble of classifiers by resampling the data, which are then combined
by majority voting. However, in boosting, resampling is strategically geared to provide the most informative training
data for each consecutive classifier.
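To make the "each consecutive model focuses on what the previous ones got wrong" idea concrete, here is a minimal base R sketch of gradient boosting for squared error, the loss that distribution="gaussian" refers to. It is not how the gbm package is implemented: it uses single-split stumps on one predictor, no subsampling, and hypothetical helper names (fit_stump, boost, predict_boost) and toy data.

```r
# Fit a one-split regression stump to the current residuals r.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  splits <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between neighbours
  best <- NULL
  best_sse <- Inf
  for (s in splits) {
    left <- r[x <= s]
    right <- r[x > s]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(split = s, left = mean(left), right = mean(right))
    }
  }
  best
}

predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

# Each new stump is fitted to the residuals left by the ensemble so far.
boost <- function(x, y, n_trees = 200, shrinkage = 0.1) {
  init <- mean(y)
  pred <- rep(init, length(y))
  stumps <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    stumps[[m]] <- fit_stump(x, y - pred)
    pred <- pred + shrinkage * predict_stump(stumps[[m]], x)
  }
  list(init = init, stumps = stumps, shrinkage = shrinkage)
}

predict_boost <- function(model, x) {
  pred <- rep(model$init, length(x))
  for (s in model$stumps) pred <- pred + model$shrinkage * predict_stump(s, x)
  pred
}

# Toy data (hypothetical): a noisy sine curve.
set.seed(7)
x <- seq(0, 2 * pi, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.1)

model <- boost(x, y)
mse_base  <- mean((y - mean(y))^2)                 # predicting the mean only
mse_boost <- mean((y - predict_boost(model, x))^2) # training error after boosting
```

A small shrinkage with many trees, as in the slide's shrinkage=0.002 and n.trees=10000, trades training time for a smoother fit.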
Simple Ensembles - Random Forest
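# Grow 3 forests of 333 trees each in parallel, then merge them with combine()
# colNumber is not defined on the slide; presumably the number of predictor columns
rf <- foreach(ntree=rep(333,3), .combine=combine, .packages='randomForest') %dopar%
  randomForest(train[,3:length(train)], train$Act, ntree=ntree, do.trace=1000,
               mtry=round(colNumber/3), replace=FALSE, nodesize=5, na.action=na.omit)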
Ensemble of Ensembles
1. Mean(RF, GBM, BagGLM)
2. Median(RF, GBM, BagGLM)
3. 0.4*RF + 0.4*GBM + 0.2*BagGLM
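The three combinations above can be sketched in a few lines of base R. The prediction vectors below are made-up numbers standing in for the outputs of the random forest, GBM and bagged GLM on the same test rows.

```r
# Hypothetical predictions from three models for the same five test rows.
rf_pred      <- c(0.10, 0.40, 0.80, 0.30, 0.55)
gbm_pred     <- c(0.20, 0.35, 0.70, 0.25, 0.60)
bag_glm_pred <- c(0.15, 0.50, 0.60, 0.40, 0.50)

preds <- cbind(rf_pred, gbm_pred, bag_glm_pred)

# 1. Simple mean of the three models per row
ens_mean <- rowMeans(preds)

# 2. Median of the three models per row (less sensitive to one outlying model)
ens_median <- apply(preds, 1, median)

# 3. Weighted blend: 0.4*RF + 0.4*GBM + 0.2*BagGLM
weights <- c(0.4, 0.4, 0.2)
ens_weighted <- as.vector(preds %*% weights)
```

The weights should sum to 1 so the blend stays on the same scale as the individual predictions.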
Ensembles – Why it matters
Improve accuracy: Ensembles tend to yield better results than their constituent models when there is significant diversity among the models
Developing multiple simple models is faster than attempting to develop the perfect model
More resistant to overfitting; less reliant on any single model
Concurrent development: Different models can be run and developed on different instances/machines by different data scientists
Ensembles – point of stupidity
Netflix prize 1 million dollar winner: Ensemble of 107 models for 10% improvement
Too complicated, costly and inflexible to change
Actual deployment: Ensemble of 2 models for 8.43% improvement
Moral of story: Good Enough Ensemble is good enough
Good Enough Analytics: Big data analytics using cost efficient tools
and good enough ensemble of models
The Good Enough Stuff: Data Optimization
Data cleaning vs Data optimization
Data cleaning: important, but I assume you know it
Data optimization: done AFTER data cleaning
Kaggle Medical Drug Competition
15 sets of data
Each data set: 1,000 to 2,000 attributes, 500 to 20,000 rows
Qn: Identify rogue drugs
Point of stupidity: Trying to run analysis on all attributes
Drug  Rogue %  Company  Color  Component 1  Component 2…2000
A     0.0400   XYZ      Red    200          30
B     0.0002   XYZ      Green  920          50
C     0.8000   XYZ      Blue   30           1000
D     ?        XYZ      Red    340          800
Not all attributes are born equal
• No variance
• Irrelevant
• Too many attributes
Drug  Rogue %  Company
A     0.0400   XYZ
B     0.0002   XYZ
C     0.8000   XYZ
D     ?        XYZ
Company <- this attribute does not help in differentiating between the drugs
Remove no variance / near zero variance attributes
R code:
library(caret)
healthdata[nearZeroVar(healthdata, freqCut = 95/5, uniqueCut = 10)] <- list(NULL)
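For readers without caret installed, the following base R sketch approximates the rule that nearZeroVar applies: a column is flagged when it has a single value, or when its most common value dominates the second most common (frequency ratio above freq_cut) and it has few distinct values relative to the row count (percent unique below unique_cut). The function name near_zero_var_cols and the toy data are assumptions for illustration, and details may differ from caret's own implementation.

```r
# Returns the names of columns that look constant or near-constant.
near_zero_var_cols <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  flagged <- vapply(df, function(col) {
    counts <- sort(table(col), decreasing = TRUE)
    if (length(counts) <= 1) return(TRUE)             # zero variance
    freq_ratio <- counts[[1]] / counts[[2]]           # most vs second most common
    pct_unique <- 100 * length(counts) / length(col)  # distinct values per 100 rows
    freq_ratio > freq_cut && pct_unique < unique_cut
  }, logical(1))
  names(df)[flagged]
}

# Toy data (hypothetical), loosely modelled on the drug table.
drugs <- data.frame(
  company    = rep("XYZ", 100),                       # same value everywhere
  color      = rep(c("Red", "Green", "Blue", "Red"), 25),
  mostly_off = c(rep(0, 98), 1, 1),                   # almost constant
  component1 = seq(10, 1000, by = 10)
)

drop <- near_zero_var_cols(drugs)
reduced <- drugs[setdiff(names(drugs), drop)]
```

Here company and mostly_off are dropped, while color and component1 survive because their values are spread out enough to help tell the drugs apart.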
Drug  Rogue %  Color
A     0.0400   Red
B     0.0002   Green
C     0.8000   Blue
D     ?        Red
Color <- this attribute has no relevance to % rogue drug
Remove unimportant attributes
R code for Random Forest: importanceScore <- importance(myMod)
R code for GBM: importanceScore <- summary(myMod, n.trees = ntree)
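Once importance scores are available, dropping the weak attributes is a simple filter. The sketch below assumes the scores have been reduced to a named numeric vector; the attribute names, score values and the cutoff of 1 are all hypothetical, and a sensible cutoff depends on inspecting the actual scores.

```r
# Hypothetical importance scores, named by attribute.
importance_score <- c(component1 = 41.2, component2 = 35.7,
                      color = 0.3, component3 = 22.8)

threshold <- 1  # assumed cutoff for "good enough"

# Keep only the attributes whose score reaches the threshold.
keep <- names(importance_score)[importance_score >= threshold]
dropped <- setdiff(names(importance_score), keep)
```

The training data can then be subset with something like train[, keep] before refitting the model.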
Drug  Rogue %  Component 1  Component 2…2000
A     0.0400   200          30
B     0.0002   920          50
C     0.8000   30           1000
D     ?        340          800
Component 1…2000 <- too many attributes take very long to analyze
Attribute reduction using Principal Component Analysis
R code: pc <- prcomp(train[, 2:length(train)], tol=0.12)
Andrew Ng: Always try analysis without PCA first.
[Figure: points (X) plotted against Attribute 1 and Attribute 2 are projected onto a single Principal Component line]
The 1D red line and points are now representative of the 2D graph
Source: Andrew Ng, Machine Learning Course (refer to References)
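The 2D-to-1D picture can be reproduced with prcomp on toy data. In the sketch below, attribute2 is almost a multiple of attribute1, so the second component carries very little spread and the tol argument (the same value as on the earlier slide) omits it. The data and variable names are assumptions for the example.

```r
set.seed(1)

# Toy data (hypothetical): two attributes that lie close to a straight line.
attribute1 <- rnorm(200)
attribute2 <- 2 * attribute1 + rnorm(200, sd = 0.05)
points2d <- data.frame(attribute1, attribute2)

# Components whose standard deviation is at most tol times that of the
# first component are omitted.
pc <- prcomp(points2d, tol = 0.12)

# Coordinates of every point along the retained component (the "red line").
reduced <- pc$x

# New data is projected with the same rotation, as in the PCA with KNN code.
new_point <- data.frame(attribute1 = 0.5, attribute2 = 1.0)
projected <- predict(pc, newdata = new_point)
```

The 200 rows keep their identity, but each is now described by one number instead of two.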
Data Optimization – Why it matters
Performance Improvement (importance, nearZeroVar)
Cut down attributes which are useless or not “good enough”. More accurate and complex models can be built on the attributes that matter.
Cost Savings (PCA)
Less data needs to be processed, so models and results turn around faster.
Good Enough Analytics: Big data analytics using cost efficient tools
and good enough ensemble of models based on optimized data
The Good Enough Stuff: Scaling on cloud
Why use Cloud
How often do you really need a multimillion-dollar machine on standby 24/7 to churn data?
Do you really need real-time analytics, or is an hourly/daily/weekly/monthly report good enough?
Cloud – Why it matters
Excellent bang for the buck: <$5/hr to rent a million dollars' worth of power. No need to purchase/maintain hardware. Scale on demand
Great for Ensemble Modeling: You can start multiple instances, each instance running one simple model, and ensemble them
But beware of data security and privacy laws: Not suitable for all kinds of data/applications. For example, Amazon Web Service is HIPAA compliant but Rackspace is not.
Name   Age  Income  Postal
Peter  23   $2,000  400573
Sally  11   $0      520028
Paul   70   $500    521201
Mark   30   $8,000  247392
Prepare data for the cloud
Name   Age  Age Group  Income  Income Range   Postal  Postal Area
Peter  23   Youth      $2,000  $1,000-$3,000  400***  Eunos
Sally  11   Child      $0      $0             520***  Simei
Paul   70   Senior     $500    $1-$1,000      521***  Tampines
Mark   30   Adult      $8,000  >$5,000        247***  Tanglin
Name: remove identity
Age -> Age Group: use general category
Income -> Income Range: use range category
Postal: masking; Postal Area: rollup
Reference: Dr. Yap Ghim Eng (A*Star)
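The preparation steps on this slide can be sketched in base R as follows. The age cut points and the "$3,000-$5,000" band are assumptions chosen so that the four example rows land in the categories shown on the slide; the postal-area lookup only contains the four prefixes that appear there.

```r
people <- data.frame(
  name   = c("Peter", "Sally", "Paul", "Mark"),
  age    = c(23, 11, 70, 30),
  income = c(2000, 0, 500, 8000),
  postal = c("400573", "520028", "521201", "247392"),
  stringsAsFactors = FALSE
)

# Rollup lookup taken from the slide's examples: postal prefix -> area.
postal_area <- c("400" = "Eunos", "520" = "Simei",
                 "521" = "Tampines", "247" = "Tanglin")
prefix <- substr(people$postal, 1, 3)

cloud_ready <- data.frame(
  # General category instead of exact age (cut points are assumed).
  age_group = as.character(cut(people$age,
    breaks = c(-Inf, 12, 24, 64, Inf),
    labels = c("Child", "Youth", "Adult", "Senior"))),
  # Range category instead of exact income (the $3,000-$5,000 band is assumed).
  income_range = as.character(cut(people$income,
    breaks = c(-Inf, 0, 1000, 3000, 5000, Inf),
    labels = c("$0", "$1-$1,000", "$1,000-$3,000", "$3,000-$5,000", ">$5,000"))),
  # Masking: keep the first three digits only.
  postal_masked = paste0(prefix, "***"),
  # Rollup: replace the postal code with its area.
  postal_area = unname(postal_area[prefix]),
  stringsAsFactors = FALSE
)
# The name column is deliberately left out (remove identity).
```

Only cloud_ready would be uploaded; the original people table stays on premises.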
Good Enough Analytics: Big data analytics using cost efficient tools
and good enough ensemble of models based on optimized data, scaled on cloud services
The Good Enough Stuff…that we have no time for
Amazon Web Service
Basic code to set up Amazon instance for analytics
sudo yum install gcc gcc-c++ gcc-gfortran readline-devel python-devel make atlas blas
sudo yum install -y lapack-devel blas-devel

wget http://cran.at.r-project.org/src/base/R-2/R-2.15.2.tar.gz
tar -xf R-2.15.2.tar.gz
cd R-2.15.2
./configure --with-x=no
sudo make
PATH=$PATH:~/R-2.15.2/bin/
cd ..

wget http://sourceforge.net/projects/numpy/files/NumPy/1.6.2/numpy-1.6.2.tar.gz/download
tar -xzf numpy-1.6.2.tar.gz
cd numpy-1.6.2
sudo python setup.py install
cd ..

wget http://sourceforge.net/projects/scipy/files/scipy/0.11.0/scipy-0.11.0.tar.gz/download
tar -xzf scipy-0.11.0.tar.gz
cd scipy-0.11.0
sudo python setup.py install
cd ..

wget http://pypi.python.org/packages/source/n/nose/nose-1.1.2.tar.gz#md5=144f237b615e23f21f6a50b2183aa817
tar -xzf nose-1.1.2.tar.gz
cd nose-1.1.2
sudo python setup.py install

After sudo-ing and running R, type:
install.packages('gbm')
install.packages('randomForest')
To leave R or Python jobs running while you are not logged on: "nohup R CMD BATCH myfile.r &"
Amazon EC2 Spot Instance
Cluster Compute Eight Extra Large: 60.5 GiB memory, 88 EC2 Compute Units, 3370 GB of local instance storage, 64-bit platform, 10 Gigabit Ethernet. $0.27 per hour
High-Memory Quadruple Extra Large Instance: 68.4 GiB of memory, 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of local instance storage, 64-bit platform. $0.14 per hour
Weakness of Spot Instance
Bidding system. If your bid < spot instance price, the instance will be terminated.
Solutions:
1) Put master on normal cloud instance and slave on spot instance
2) Heartbeat + Queue with Checkpoint
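The checkpoint part of the second solution can be illustrated with a small base R sketch: results are saved after every task, so a restarted worker skips what is already done. The function name run_with_checkpoint, the stop_after argument (which only simulates a terminated instance) and the toy task are all hypothetical. In practice the checkpoint file would need to live on storage that survives the spot instance.

```r
run_with_checkpoint <- function(tasks, work, checkpoint_file, stop_after = Inf) {
  # Resume from the checkpoint if one exists; otherwise start empty.
  results <- if (file.exists(checkpoint_file)) readRDS(checkpoint_file) else list()
  processed <- 0
  for (task in tasks) {
    key <- as.character(task)
    if (!is.null(results[[key]])) next    # finished in an earlier run
    if (processed >= stop_after) break    # simulate the instance being terminated
    results[[key]] <- work(task)
    saveRDS(results, checkpoint_file)     # persist progress after every task
    processed <- processed + 1
  }
  results
}

checkpoint <- tempfile(fileext = ".rds")
square <- function(x) x^2                 # stand-in for fitting one model

# First run is "terminated" after two tasks; the second run picks up at task 3.
first_run  <- run_with_checkpoint(1:5, square, checkpoint, stop_after = 2)
second_run <- run_with_checkpoint(1:5, square, checkpoint)
```

Because ensembles are made of many independent models, one task per model is a natural unit of work to checkpoint.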
The Good Enough Stuff…that we have no time for
PCA with KNN
library(FNN)
train <- read.csv("train.csv", header=TRUE)
test <- read.csv("test.csv", header=TRUE)
# Reduce the attributes (every column after the label) with PCA
pc <- prcomp(train[, 2:length(train)], tol=0.12)
mydata <- data.frame(label = train[, "label"], pc$x)
labels <- mydata[,1]
mydata2 <- mydata[,-1]
# Project the test set onto the same principal components
test.p <- predict(pc, newdata = test)
# 1-nearest-neighbour classification in the reduced space
results <- (0:9)[knn(mydata2, data.frame(test.p), labels, k = 1, algorithm="cover_tree")]
write(results, file="knn_PCA.csv", ncolumns=1)
Principal Component Analysis - With K-Nearest Neighbor
The Good Enough Stuff…that we have no time for
Data Chunking
Data Chunking – Revolution R
Loosely based on NoSQL
Use a format called XDF
The XDF format is a binary file format that stores data in blocks and processes data in chunks (groups of blocks) for efficient reading of arbitrary columns and contiguous rows
For more details, visit RevR website
Data Chunking – Why it matters
# Chunk 6.5GB worth of data onto HDD in XDF
rxImport(inData = trainFile, outFile = "trainingData.xdf")
# RevR provides functions like rxGlm to run a huge Poisson regression directly on the XDF file
myPos <- rxGlm(amount2 ~ Mailed+Donated+RR, data = "trainingData.xdf", family = poisson())
*This cannot be done using normal R on my laptop, as R tries to load the entire dataset into memory
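The idea behind chunking can be shown without Revolution R. The base R sketch below is only an analogy to XDF processing, not the XDF format or the rxGlm algorithm: it reads a CSV file a fixed number of rows at a time and keeps running totals, so memory use depends on the chunk size rather than the file size. The function name chunked_column_mean, the file contents and the chunk size are assumptions for the example, and the parser assumes a simple CSV with no quoted commas.

```r
# Mean of one numeric column, computed one chunk of rows at a time.
chunked_column_mean <- function(path, column, chunk_rows = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # First line is the header; strip the quotes written by write.csv.
  header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])
  col_index <- match(column, header)
  total <- 0
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk_rows)   # only this chunk is in memory
    if (length(lines) == 0) break
    fields <- strsplit(lines, ",")
    values <- as.numeric(vapply(fields, `[`, character(1), col_index))
    total <- total + sum(values)
    n <- n + length(values)
  }
  total / n
}

# Toy file (hypothetical) standing in for a dataset too large for RAM.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(amount = 1:2500, mailed = rep(c(0, 1), 1250)),
          path, row.names = FALSE)

result <- chunked_column_mean(path, "amount", chunk_rows = 1000)
```

Sums and counts combine trivially across chunks; model fitting on chunks needs algorithms designed for it, which is what the rx functions supply.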
Data Chunking – Speeding it up using SSD instead of normal HDD
RAM: Fast but expensive
SSD: ~4x faster than normal HDD when chunking
The Good Enough Stuff…that we have no time for
Multicore
Multicore Processing – Revolution R
library(foreach)
library(doSNOW)
cluster <- makeCluster(3, type = "SOCK")
registerDoSNOW(cluster)
setMKLthreads(1)
predictions <- foreach(1:1000, .combine=cbind) %dopar% {
  training_positions <- sample(nrow(train), size=floor(nrow(train)*0.9), replace=TRUE)
  train_pos <- 1:nrow(train) %in% training_positions
  glmMod <- rxLinMod(eqn, train[train_pos,])
  rxPredict(glmMod, test, type="response")
}
result <- rowMeans(predictions)
Multicore Processing – Why it matters
License cost (usually charged per CPU)
1 CPU with 4 cores = 1 single-user license
4 distributed CPUs with 1 core each = 4 licenses or a group license
Performance improvement
~2x performance for 3 cores vs 1 core
Visualization
Good Enough References
Random Forest
• Obtaining knowledge from a random forest
• Suggestions for speeding up Random Forests
• Random Forest with classes that are very unbalanced
GBM
• Define boosting
• Generalized Boosted Models: A guide to the gbm package
• What are some useful guidelines for GBM parameters?
• R gbm logistic regression
• How to win the KDD Cup Challenge with R and gbm
Ensembles
• Ensemble learning introduction
• Exploiting Diversity in Ensembles: Improving the Performance on Unbalanced Datasets
• Resources for learning how to implement ensemble methods
• Ensemble methods
• Intro to ensemble learning in R
• Predictive analytics & decision tree
PCA and NearZero
• Principal Component Analysis in R
• PCA on high dimensional data
• PCA on training and test data
• Nearzero R caret library
Misc
• Andrew Ng’s Machine Learning Course
• A Few Useful Things to Know about Machine Learning
• Creating HIPAA-Compliant Medical Data Applications With AWS
• Amazon EC2 Spot Instances
• Improve Predictive Performance in R with Bagging
• Kaggle: Visualizing dark world
• Kaggle: Visualizing handwriting
Good Enough Analytics: Big data analytics using cost efficient tools
and good enough ensemble of models based on optimized data, scaled on cloud services
Qns? Email me @ [email protected] Profile
Kaggle Profile
Asia?
Photo Credits
• Slide 2: http://3.bp.blogspot.com/-nkP_UHgebKo/T70GJ3ezCrI/AAAAAAAAAZc/mWD6RsDlz6Y/s1600/IMG_0349.JPG
• Slide 3: http://www.salesmanagementmastery.com/wp-content/uploads/2010/09/money-flying.jpg
• Slide 5: http://www.pachd.com/free-images/household-images/spoon-01.jpg
• Slide 6: http://www.bhmpics.com/view-rice_in_a_wooden_spoon-1440x900.html
• Slide 7: http://2.bp.blogspot.com/-Oj7ji_8CB3Q/TkvdFXAYUcI/AAAAAAAADgQ/XcevbehpPHU/s1600/Big+spoon+3.jpg
• Slide 8: http://familyhelpers.files.wordpress.com/2012/03/spoon.jpg
• Slide 11 (Lemon): http://miamiaromatherapy.com/shopping/images/70//Lemon-2.jpg
• Slide 12 (Bank): http://www.psdgraphics.com/wp-content/uploads/2011/03/bank-icon.jpg
• Slide 11/12 (Logos): http://commons.wikimedia.org/wiki/Main_Page
• Slide 19-21: www.scholarpedia.org
• Slide 23/25: www.wikipedia.org
• Slide 32: http://www.chipandco.com/wp-content/uploads/2012/08/medicine.jpg
• Slide 63: www.kaggle.com