Getting Started with Information Processing: Machine Learning (and Related Areas)
Toshihiko Yamasaki
Associate Professor, Department of Information and Communication Engineering,
Graduate School of Information Science and Technology, The University of Tokyo
Today’s agenda
How can we start?
Which algorithm should you choose?
How can you find real data?
Getting the most out of SVM
Tools: R
https://www.r-project.org/
http://www.kdnuggets.com/2015/06/top-20-r-machine-learning-packages.html
No available PC? Use cloud computers
Such as Amazon EC2, Microsoft Azure, etc…
Virtualized PC on the Internet
"Cloud Computing from Scratch" (「ゼロから始めるクラウドコンピューティング」)
http://aws.amazon.com/jp/ec2/
https://azure.microsoft.com/ja-jp/
In short
BST-DT > RF > BAG-DT > SVMs > ANN > KNN > BST-STMP > DT > LOGREG > NB
Boosted Trees:
Tree ensembles like RF, but trained with a boosting technique instead of bagging
Note!!
Feature dimensions were 10-100.
RF usually requires about 10 × dim training vectors (dim = feature dimension)
[Caruana, ICML06]
Random Forests and Boosted Trees
RF: www.habe-lab.org/habe/RFtutorial/SSII2013_RFtutorial_Slides.pdf
RF: http://www.slideshare.net/HitoshiHabe/ss-58784309
Boosted Trees: https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
How do you usually use SVM?
Through Python/Matlab/R/…
In many cases, you are actually using libSVM underneath
By downloading the precompiled binary
Why not download the source code instead?
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Which kernel should you use?
Gaussian kernel is the best in many cases
But it takes a lot of time
Linear kernel performs as well as Gaussian
When the data size is large
When the feature dimension is large
You may also consider using liblinear
What else?
You can use your own kernel
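As a rough illustration of the kernel choices above, the sketch below computes linear and Gaussian (RBF) kernel values in plain Python and builds a Gram matrix from a user-defined kernel function. The function names and the gamma value are assumptions made for this example and are not part of libSVM. A matrix like this is what libSVM's precomputed-kernel mode (-t 4) expects as input.

```python
import math

def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return sum(a * b for a, b in zip(x, y))

def gaussian_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2); gamma=0.5 is an arbitrary example value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def gram_matrix(samples, kernel):
    """Evaluate the kernel on every pair of samples (n x n matrix)."""
    return [[kernel(a, b) for b in samples] for a in samples]

# Hypothetical toy data: three 2-dimensional feature vectors.
samples = [[1.0, 2.0], [3.0, 4.0], [1.0, 2.5]]
K = gram_matrix(samples, gaussian_kernel)
```

The Gaussian kernel needs one exponential per pair of samples, which is one reason it becomes slow on large data sets, whereas the linear kernel is a single dot product.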
Optimize the parameters
For the binary or MATLAB version, use grid.py or grid.m
It searches for the best C and g (gamma) for the Gaussian kernel
You should check the source code
Use n-fold cross-validation
Sometimes, split the data into training, validation, and test sets
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
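The search described above can be sketched in plain Python as a grid over exponentially spaced (C, gamma) pairs, each scored by k-fold cross-validation. This is not grid.py itself. The exponent ranges are assumed for illustration, and cv_score is a hypothetical callback standing in for "train an SVM on the training fold and return its accuracy on the validation fold".

```python
import itertools
import math

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; shuffle the data beforehand in practice."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def parameter_grid(log2c_values, log2g_values):
    """All (C, gamma) pairs with C = 2^c and gamma = 2^g."""
    return [(2.0 ** c, 2.0 ** g)
            for c, g in itertools.product(log2c_values, log2g_values)]

def grid_search(n_samples, k, grid, cv_score):
    """Return (mean_score, C, gamma) for the pair with the best mean fold score."""
    best = None
    for C, gamma in grid:
        scores = [cv_score(C, gamma, train_idx, val_idx)
                  for train_idx, val_idx in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if best is None or mean_score > best[0]:
            best = (mean_score, C, gamma)
    return best

def toy_cv_score(C, gamma, train_idx, val_idx):
    """Hypothetical stand-in for real training; peaks at C = 2, gamma = 0.5."""
    return -(abs(math.log2(C) - 1) + abs(math.log2(gamma) + 1))

# Assumed exponent ranges: log2(C) from -5 to 15, log2(gamma) from 3 to -15.
grid = parameter_grid(range(-5, 16, 2), range(3, -16, -2))
best = grid_search(n_samples=10, k=5, grid=grid, cv_score=toy_cv_score)
```

When a separate test set is kept, it should be left out of this loop entirely and used only once, after C and gamma have been fixed.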
Unbalanced data
http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f410
What if you have unbalanced data?
For example, +1: 1,000 items, -1: 10,000 items
An SVM can reach about 91% accuracy just by always answering “-1”
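The arithmetic behind the 91% figure, together with one possible countermeasure, can be sketched as follows. Weighting each class inversely to its frequency is a common heuristic assumed here, not something prescribed by the slides; such weights could then be passed to svm-train through its per-class weight option (-wi). The function names are hypothetical.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a classifier that always predicts the largest class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

def inverse_frequency_weights(class_counts):
    """Weight each class by (largest class size / its own size)."""
    largest = max(class_counts.values())
    return {label: largest / count for label, count in class_counts.items()}

# Class sizes taken from the example above.
counts = {+1: 1000, -1: 10000}
baseline = majority_baseline_accuracy(counts)   # 10000 / 11000
weights = inverse_frequency_weights(counts)     # the rare class gets the larger weight
```

With unbalanced data, accuracy alone is a misleading score, so per-class recall or precision is worth checking as well.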
Non-numerical data
libSVM can handle only numerical data
× Sun: 0, Mon: 1, Tue: 2 (the numeric order carries no meaning)
Convert to a categorical (one-hot) encoding
Sun: (1, 0, 0, 0…)
Mon: (0, 1, 0, 0…)
Tue: (0, 0, 1, 0…)
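A minimal sketch of the one-hot conversion above, which also writes the result in libSVM's sparse text format ("label index:value ...", with indices starting at 1). The helper names are assumptions for this example.

```python
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

def to_libsvm_line(label, features):
    """Format as '<label> <index>:<value> ...' using 1-based indices, skipping zeros."""
    pairs = [f"{i}:{v}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)

mon = one_hot("Mon", DAYS)
line = to_libsvm_line(1, one_hot("Tue", DAYS))
```

Because zero entries can be omitted in the sparse format, a one-hot feature costs only one index:value pair per sample no matter how many categories there are.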
Missing data?
There is no golden rule
Eliminate such vectors
Use average or median value
Use the most frequently appearing value
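The three options above can be sketched with the standard library alone. Missing entries are represented by None here, which is an assumption of this example; real data sets mark them in many different ways.

```python
import statistics

MISSING = None  # assumed marker for a missing value

def drop_incomplete(rows):
    """Option 1: eliminate every vector that has a missing entry."""
    return [row for row in rows if all(v is not MISSING for v in row)]

def impute_column(rows, col, strategy="mean"):
    """Options 2 and 3: fill missing entries of one column with the
    mean, median, or most frequent (mode) observed value."""
    fill_fn = {"mean": statistics.mean,
               "median": statistics.median,
               "mode": statistics.mode}[strategy]
    observed = [row[col] for row in rows if row[col] is not MISSING]
    fill = fill_fn(observed)
    return [row[:col] + [fill if row[col] is MISSING else row[col]] + row[col + 1:]
            for row in rows]

# Hypothetical toy data: column 0 is continuous, column 1 is discrete.
rows = [[1.0, 10], [2.0, None], [None, 10], [6.0, 30]]
complete_only = drop_incomplete(rows)
mean_filled = impute_column(rows, 0, "mean")
mode_filled = impute_column(rows, 1, "mode")
```

Whichever option is used, the fill values should be computed from the training data only and then reused for the validation and test data.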
Use OpenMP
http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f432
As I introduced in my first lecture, it is very easy
You can obtain probabilities
You can get probability estimates instead of +1/-1 labels or continuous values
Use the “-b 1” option
It is useful for further processing
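One common way to turn an SVM decision value f into a probability is to fit a sigmoid to it (Platt scaling), which is how libSVM's probability output is generally described for the two-class case. The sketch below shows only the mapping. A and B are made-up values; in practice they are fitted on held-out decision values. The 0.9 confidence threshold is likewise an assumption, used to show one kind of further processing.

```python
import math

def platt_probability(decision_value, A, B):
    """P(y = +1 | f) = 1 / (1 + exp(A * f + B))."""
    return 1.0 / (1.0 + math.exp(A * decision_value + B))

def label_or_reject(probability, threshold=0.9):
    """Return +1 or -1 when confident enough, otherwise None (reject)."""
    if probability >= threshold:
        return +1
    if probability <= 1.0 - threshold:
        return -1
    return None

A, B = -2.0, 0.0  # hypothetical sigmoid parameters
results = [(f, platt_probability(f, A, B)) for f in (-1.5, 0.0, 1.5)]
decisions = [label_or_reject(p) for _, p in results]
```

A hard +1/-1 label cannot tell a borderline sample from a clear-cut one, while a probability allows rejecting uncertain samples or combining the SVM with other models.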
Look at the model file
When using a linear kernel or liblinear
Weight vector w will be saved
You can also inspect the support vectors
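For a linear kernel, the weight vector can be recovered from the support vectors as w = Σ coef_i · x_i, where coef_i = alpha_i · y_i. The sketch below assumes each support-vector line of a two-class model has the form "coef index:value ..." with 1-based indices; the lines used here are made up, so check them against an actual model file before relying on this.

```python
def parse_sv_line(line):
    """Parse '<coef> <index>:<value> ...' into (coef, {index: value})."""
    tokens = line.split()
    coef = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return coef, features

def linear_weight_vector(sv_lines, n_features):
    """w = sum_i coef_i * x_i, accumulated over all support vectors."""
    w = [0.0] * n_features
    for line in sv_lines:
        coef, features = parse_sv_line(line)
        for index, value in features.items():
            w[index - 1] += coef * value  # model indices are 1-based
    return w

# Hypothetical support-vector lines, not taken from a real model file.
sv_lines = ["0.5 1:1 2:2", "-0.5 1:3 3:1"]
w = linear_weight_vector(sv_lines, n_features=3)
```

The magnitude of each entry of w gives a rough idea of how much the corresponding feature contributes to the decision, provided the features were scaled to comparable ranges.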
http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f433