Using R to win Kaggle Data Mining Competitions
Chris Raimondi
November 1, 2012
Overview of talk
• What I hope you get out of this talk
• Life before R
• Simple model example
• R programming language
• Background/Stats/Info
• How to get started
• Kaggle
Overview of talk
• Individual Kaggle competitions
• HIV Progression
• Chess
• Mapping Dark Matter
• Dunnhumby’s Shoppers Challenge
• Online Product Sales
What I want you to leave with
• Belief that you don’t need to be a statistician to use R – NOR do you need to fully understand machine learning in order to use it
• Motivation to use Kaggle competitions to learn R
• Knowledge on how to start
My life before R
• Lots of Excel
• Had tried programming in the past – got frustrated
• Read NY Times article in January 2009 about R & Google
• Installed R, but gave up after a couple minutes
• Months later…
My life before R
• Using Excel to run PageRank calculations that took hours and was very messy
• Was experimenting with Pajek – a Windows-based network/link analysis program
• Was looking for a similar program that did PageRank calculations
• Revisited R as a possibility
My life before R
• Came across the “R Graph Gallery”
• Saw this graph…
Addicted to R in one line of code
pairs(iris[1:4], main = "Edgar Anderson's Iris Data", pch = 21,
      bg = c("red", "green3", "blue")[unclass(iris$Species)])
“pairs” is the function; “iris” is the data frame
What do we want to do with R?
• Machine learning – a.k.a., or more specifically:
• Making models
We want to TRAIN a set of data with KNOWN answers/outcomes
In order to PREDICT the answer/outcome to similar data where the answer is not known
How to train a model
R allows for the training of models using more than 100 different machine learning methods
To train a model you need to provide:
1. Name of the function – which machine learning method
2. Name of the dataset
3. What your response variable is and what features you are going to use
Example machine learning methods available in R
• Bagging
• Boosted Trees
• Elastic Net
• Gaussian Processes
• Generalized Additive Model
• Generalized Linear Model
• K Nearest Neighbor
• Linear Regression
• Nearest Shrunken Centroids
• Neural Networks
• Partial Least Squares
• Principal Component Regression
• Projection Pursuit Regression
• Quadratic Discriminant Analysis
• Random Forests
• Recursive Partitioning
• Rule-Based Models
• Self-Organizing Maps
• Sparse Linear Discriminant Analysis
• Support Vector Machines
Code used to train decision tree
library(party)
irisct <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                data = iris)
Or use “.” to mean everything else - as in…
irisct <- ctree(Species ~ ., data = iris)
That’s it – you’ve trained your model. To make predictions with it, use the “predict” function – like so:
my.prediction <- predict(irisct, iris2)
To see a graphic representation of it – use “plot”.
plot(irisct)
plot(irisct, tp_args = list(fill = c("red", "green3", "blue")))
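Putting the pieces above together, here is a minimal end-to-end sketch. One addition not in the slides: it holds out 50 rows so that predict() has genuinely unseen data to score (the slides' `iris2` is otherwise undefined).

```r
library(party)  # provides ctree()

set.seed(42)                                # make the random split repeatable
idx   <- sample(nrow(iris), 100)            # 100 rows to train on
train <- iris[idx, ]
iris2 <- iris[-idx, ]                       # 50 held-out rows to predict

irisct <- ctree(Species ~ ., data = train)  # train the decision tree

my.prediction <- predict(irisct, iris2)     # predict the held-out rows
table(my.prediction, iris2$Species)         # confusion matrix
```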
R background
• Statistical programming language
• Since 1996
• Powerful – used by companies like Google, Allstate, and Pfizer
• Over 4,000 packages available on CRAN
• Free
• Available for Linux, Mac, and Windows
Learn R – Starting Tonight
• Buy “R in a Nutshell”
• Download and install R
• Download and install RStudio
• Watch the 2.5 minute video on the front page of rstudio.com
• Use read.csv to read a Kaggle data set into R
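The read.csv step looks like this. Since no Kaggle file is bundled with R, this sketch round-trips the built-in iris data to stay self-contained – "train.csv" is just a stand-in name for whatever file the competition provides.

```r
# Write out a CSV so the example is self-contained; with real Kaggle data
# you would skip this step and point read.csv at the downloaded file.
write.csv(iris, "train.csv", row.names = FALSE)

train <- read.csv("train.csv")   # the one line you actually need
dim(train)                       # rows and columns
str(train)                       # column names and types
```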
Learn R – Continue Tomorrow
• Train a model using Kaggle data
• Make a prediction using that model
• Submit the prediction to Kaggle
Learn R – This Weekend
• Install the caret package
• Start reading the four caret vignettes
• Use the “train” function in caret to train a model, select a parameter, and make a prediction with this model
Buy This Book: R in a Nutshell
• Excellent reference
• 2nd edition released just two weeks ago
• In stock at Amazon for $37.05
• Extensive chapter on machine learning
RStudio
R Tip
Read the vignettes – some of them are golden.
There is a correlation between the quality of an R package and its associated vignette.
What is Kaggle?
• Platform/website for predictive modeling competitions
• Think middleman – they provide the tools for anyone to host a data mining competition
• Makes it easy for competitors as well – they know where to go to find the data/competitions
• Community/forum to find teammates
Kaggle Stats
• Competitions started over 2 years ago
• 55+ different competitions
• Over 60,000 competitors
• 165,000+ entries
• Over $500,000 in prizes awarded

Why Use Kaggle?
• Rich, diverse set of competitions
• Real-world data
• Competition = motivation
• Fame
• Fortune
Who has Hosted on Kaggle?
Methods used by competitors
source: kaggle.com
Predict HIV Progression
Prizes:
1st – $500.00

Objective:
Predict (yes/no) whether there will be an improvement in a patient's HIV viral load.

Training Data: 1,000 patients
Testing Data: 692 patients
Answer (Response) and various features:

Response  PR Seq      RT Seq      VL-t0  CD4-t0
1         CCTCAGATCA  TACCTTAAAT  4.7    473
1         CACTCTAAAT  CTTAAATTTY  5.0    7
0         AAGAAATCTG  CCTCAGATCA  3.2    349
0         AAGAAATCTG  CTCTTTGGCA  5.1    51
0         AAGAAATCTG  GAGAGATCTG  3.7    77
0         CACTCTAAAT  CTTAAATTTY  5.7    206
0         AAGAAATCTG  TCTAAATTTC  3.9    144
0         CACTTTAAAT  TCTAAACTTT  4.4    496
0         AAGAAATCTG  CTCTTTGGCA  3.4    252
1         TGGAAGAAAT  CTCTTTGGCA  5.5    7
1         TTCGTCACAA  CTCTTTGGCA  4.3    109
0         AAGAGATCTG  CTCTTTGGCA  5.0    70
0         ACTAAATTTT  CTCTTTGGCA  5.0    570
0         CCTCAAATCA  CTCTTTGGCA  4.0    217
1         CCTCAGATCA  TCTAAATTTC  2.8    730
0         ATTAAATTTT  CTCTTTGGCA  4.5    56
0         ATTAAATTTT  TACTTTAAAT  5.1    21
1         CCTCAGATCA  CTCTTTGGCA  5.5    249
0         CCTCAAATCA  CTTAAATTTT  4.0    269
1         AAGGAATCTG  CCTCAGATCA  4.6    165
0         AAGAAATCTG  TCTAAATTTC  3.9    144
0         CACTTTAAAT  TCTAAACTTT  4.4    496
0         AAGAAATCTG  CTCTTTGGCA  3.4    252
1         TGGAAGAAAT  CTCTTTGGCA  5.5    91
Training Set

[Figure: for the test patients the Response column is N/A – the data splits into a training portion (responses known) and a test portion, which Kaggle further divides into a Public Leaderboard and a Private Leaderboard.]
Predict HIV Progression
Predict HIV Progression

Features Provided:
1. PR: 297 letters long – or N/A
2. RT: 193–494 letters long
3. CD4: numeric
4. VLt0: numeric

Features Used:
1. PR1–PR97: factor
2. RT1–RT435: factor
3. CD4: numeric
4. VLt0: numeric
Predict HIV Progression

Concepts / Packages:
• Caret
• train
• rfe
• randomForest
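The slides do not show the rfe call itself; here is a minimal sketch on iris following the signature documented in the caret vignettes. The random-forest helper functions `rfFuncs` come with caret; the fold count and sizes are illustrative choices, not the talk's settings.

```r
library(caret)

set.seed(1)
# Recursive feature elimination: rate subsets of 1-4 iris measurements
# with a random forest under 5-fold cross-validation.
ctrl      <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
rfProfile <- rfe(iris[, 1:4], iris$Species, sizes = 1:4, rfeControl = ctrl)

rfProfile$optVariables   # the feature subset the search settled on
```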
Random Forest

[Table: first 20 rows of the iris measurements – Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.]
Tree 1:
Take a random ~ 63.2% sample of rows from the data set
For each node – take mtry random features – in this case 2 would be the default
Tree 2:
Take a different random ~ 63.2% sample of rows from the data set
And so on…..
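The sampling scheme described above matches the randomForest package's defaults; a sketch on iris (the ntree value and the printed diagnostics are illustrative choices, not from the slides):

```r
library(randomForest)

set.seed(7)
# Each tree sees a bootstrap sample of rows (~63.2% unique) and, at every
# split, mtry randomly chosen features - floor(sqrt(4)) = 2 is the default
# for a 4-feature classification problem.
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

rf$confusion     # out-of-bag confusion matrix
importance(rf)   # which measurements mattered most
```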
Caret – train

TrainData <- iris[,1:4]
TrainClasses <- iris[,5]

knnFit1 <- train(TrainData, TrainClasses,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 3,
                 trControl = trainControl(method = "cv", number = 10))
> knnFit1
150 samples
  4 predictors
  3 classes: 'setosa', 'versicolor', 'virginica'

Pre-processing: centered, scaled
Resampling: Cross-Validation (10 fold)

Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...

Resampling results across tuning parameters:

  k   Accuracy  Kappa  Accuracy SD  Kappa SD
   5  0.94      0.91   0.0663       0.0994
   7  0.967     0.95   0.0648       0.0972
   9  0.953     0.93   0.0632       0.0949
  11  0.953     0.93   0.0632       0.0949
  13  0.967     0.95   0.0648       0.0972
  15  0.967     0.95   0.0648       0.0972
  17  0.973     0.96   0.0644       0.0966
  19  0.96      0.94   0.0644       0.0966
  21  0.96      0.94   0.0644       0.0966
  23  0.947     0.92   0.0613       0.0919

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 17.
Benefits of winning
• Cold hard cash
• Several newspaper articles
• Quoted in Science magazine
• Prestige
• Easier to find people willing to team up
• Asked to speak at STScI
• Perverse pleasure in telling people the team that came in second worked at…

IBM Thomas J. Watson Research Center
Chess Ratings Comp

Prizes:
1st – $10,000.00

Objective:
Given 100 months of data, predict game outcomes for months 101–105.

Training Data Provided:
1. Month
2. White Player #
3. Black Player #
4. White Outcome – Win/Draw/Lose (1/0.5/0)
How do I convert the data into a flat 2D representation?

Think:
1. What are you trying to predict?
2. What features will you use?
[Figure: the flattened training matrix – one row per game. Columns: Outcome (1 / 0.5 / 0), White Features 1–4, Black Features 1–4, White/Black Features 1–4, and Game Features 1–2. Example features: Percentage of Games Won, Number of Games Won as White, Number of Games Played, White Games Played / Black Games Played, Type of Game Played.]
Packages/Concepts Used:
1. igraph
2. 1st real function
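The slides do not show the igraph code. One plausible use in a ratings problem – sketched here with a made-up five-game dataset, not the competition's – is ranking players by PageRank over a loser-to-winner graph, echoing the PageRank work mentioned earlier in the talk:

```r
library(igraph)

# Hypothetical mini-dataset: one row per decisive game, loser and winner.
games <- data.frame(loser  = c("A", "B", "C", "C", "D"),
                    winner = c("B", "C", "B", "A", "B"))

# Directing each edge from loser to winner lets PageRank mass flow
# toward players who beat strong opponents.
g        <- graph_from_data_frame(games, directed = TRUE)
strength <- page_rank(g)$vector

sort(strength, decreasing = TRUE)   # a per-player strength feature
```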
Mapping Dark Matter
Prizes:
1st – ~$3,000.00

Objective:
“Participants are provided with 100,000 galaxy and star pairs. A participant should provide an estimate for the ellipticity for each galaxy.”

The prize will be an expenses-paid trip to the Jet Propulsion Laboratory (JPL) in Pasadena, California to attend the GREAT10 challenge workshop “Image Analysis for Cosmology”.
dunnhumby's Shopper Challenge
Prizes:
1st – $6,000.00
2nd – $3,000.00
3rd – $1,000.00

Objective:
• Predict the next date that the customer will make a purchase
AND
• Predict the amount of the purchase to within £10.00

Data Provided
For 100,000 customers (April 1, 2010 – June 19, 2011):
1. customer_id
2. visit_date
3. visit_spend

For 10,000 customers (April 1, 2010 – March 31, 2011):
4. customer_id
5. visit_date
6. visit_spend
Really two different challenges:

1) Predict next purchase date
   Max of ~42.73% obtained

2) Predict purchase amount to within £10.00
   Max of ~38.99% obtained

If independent: 42.73% × 38.99% = 16.66%
In reality, the max obtained was 18.83%
dunnhumby's Shopper Challenge
Packages Used & Concepts Explored:
1st competition with real dates
• zoo
• arima
• forecast
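A sketch of the forecast workflow on a built-in series – AirPassengers stands in for the spend data, which is not bundled with R:

```r
library(forecast)

# Fit an ARIMA model automatically, then project 12 periods ahead.
fit <- auto.arima(AirPassengers)
fc  <- forecast(fit, h = 12)

fc$mean    # the 12 point forecasts
plot(fc)   # forecast with prediction intervals
```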
SVD
• svd
• irlba
SVD – Singular Value Decomposition
[Figure: the full SVD – an Original Matrix (807 × 1209) factored as U × D × Vᵀ. D is an N × N diagonal matrix of singular values ordered from 1st most important to Nth most important; the columns of U hold row features and the rows of Vᵀ hold column features.]
library(ReadImages)   # assumed package for read.jpeg() and imagematrix()

x  <- read.jpeg("test.image.2.jpg")
im <- imagematrix(x, type = "grey")

im.svd <- svd(im)

u <- im.svd$u
d <- diag(im.svd$d)
v <- im.svd$v
[Figure: rank-1 approximation – Original Matrix (807 × 1209) ≈ U × D × Vᵀ keeping only the 1st most important component.]
new.u <- as.matrix(u[, 1:1])
new.d <- as.matrix(d[1:1, 1:1])
new.v <- as.matrix(v[, 1:1])

new.mat <- new.u %*% new.d %*% t(new.v)

new.im <- imagematrix(new.mat, type = "grey")
plot(new.im, useRaster = TRUE)
[Figures: the same reconstruction repeated keeping the top 1, 2, 3, 4, 5, 6, …, and all 807 components – image quality improves as more components are kept, until rank 807 reproduces the original exactly.]
[Figure: the same decomposition applied to the dunnhumby data – a customer × day matrix factored as U × D × Vᵀ, with D a 365 × 365 diagonal of singular values ordered by importance, the columns of U holding customer features and the rows of Vᵀ holding day features.]
[Figures: the Original Matrix is 100,000 customers × 365 days. U[,1] (100,000 × 1) is the 1st most important customer feature; plots of the Vᵀ rows (each 365 × 1) show the 1st through 8th most important day features – rows 1–7 with the first 28 days shown, row 8 with all 365 days shown.]
Online Product Sales
Prizes:
1st – $15,000.00
2nd – $5,000.00
3rd – $2,500.00

Objective:
“[P]redict monthly online sales of a product. Imagine the products are online self-help programs following an initial advertising campaign.”
Online Product Sales
Packages/Concepts Explored:
1. Data analysis – looking at data closely
2. gbm
3. Teams
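A minimal gbm sketch. The competition's sales data is not bundled with R, so Sepal.Length stands in as the numeric target, and the tuning values are illustrative choices, not the team's:

```r
library(gbm)

set.seed(3)
# Gradient boosted trees on a stand-in regression task: predict
# Sepal.Length from the other three iris measurements.
fit <- gbm(Sepal.Length ~ ., data = iris[, 1:4],
           distribution = "gaussian",
           n.trees = 500, interaction.depth = 3, shrinkage = 0.05)

pred <- predict(fit, iris[, 1:4], n.trees = 500)
cor(pred, iris$Sepal.Length)   # rough in-sample fit check
```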
Online Product Sales
Looking at data closely
... 6532 6532 6661 6661 7696 7701 7701 8229 8412 8895 9596 9596 9772 9772 ...
        Cat_1=0  Cat_1=1
6274    1        1
6532    1        1
6661    1        1
7696    0        1
7701    1        1
8229    1        0
8412    1        0
8895    1        0
9596    1        1
9772    1        1
Online Product Sales
On the public leaderboard:
Online Product Sales
On the private leaderboard:
Thank You!
Questions?
Extra Slides
R Code for Dunnhumby Time Series
[Figure: svd of the iris measurements – a 150 × 4 matrix factored as U (150 × 4) × D (4 × 4) × Vᵀ (4 × 4).]
> my.svd <- svd(iris[,1:4])
> objects(my.svd)
[1] "d" "u" "v"
> my.svd$d
[1] 95.959914 17.761034  3.460931  1.884826
> dim(my.svd$u)
[1] 150   4
> dim(my.svd$v)
[1] 4 4