COMP 4332 Tutorial 4, Feb 23
Yin Zhu ([email protected])
Classification tools


Page 1: Classification tools

COMP 4332 Tutorial 4, Feb 23
Yin Zhu ([email protected])

Page 2: Fact Sheets: Classifier (KDD Cup 2009)

[Bar chart: CLASSIFIER (overall usage = 93%). X-axis: percent of participants, 0 to 60. Bars for: decision tree, linear classifier, non-linear kernel, other classifiers, neural network, naïve Bayes, nearest neighbors, Bayesian network, Bayesian neural network.]

• About 30% logistic loss, >15% exponential loss, >15% squared loss, ~10% hinge loss.
• Less than 50% used regularization (20% 2-norm, 10% 1-norm).
• Only 13% used unlabeled data.

Page 3: Top 10 data mining algorithms

• http://www.cs.uvm.edu/~icdm/algorithms/index.shtml
• C4.5
• K-means
• SVM
• Apriori
• EM
• PageRank
• AdaBoost
• kNN
• Naïve Bayes
• CART (Classification and Regression Trees)

Six of the ten are classification algorithms!

Page 4: My view of classification

1. Tree-based models
   • Trees and tree ensembles
2. Linear family and its kernel extension
   • Least-squares regression
   • Logistic regression
   • SVM
3. Others: naïve Bayes, kNN

Page 5: Tree-based models

• The two most famous trees: C4.5 and CART.
• C4.5 has two solid implementations:
  • Ross Quinlan's original program (and the later C5.0); source code is available.
  • Weka's J48 classifier.
• CART: the rpart package in R.
• Tree ensembles: AdaBoost, bagging, random forest.

Page 6: Recommended tools for trees

• For dense datasets (e.g. fewer than ~1000 attributes), use Weka; a sample command line is shown below.
• For sparse datasets, try FEST.
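As an illustration (my addition, not from the slides), Weka's J48 implementation of C4.5 can be run from the command line; a minimal invocation, assuming weka.jar is available and train.arff/test.arff are your files, looks like:

  java -cp weka.jar weka.classifiers.trees.J48 -t train.arff -T test.arff

Here -t names the training file and -T the test file; Weka prints the learned tree and its error statistics.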

Page 7: Linear family and its kernel extension

• Learning samples: $(x_i, y_i)$, $i = 1, \dots, n$, with $y_i \in \{-1, +1\}$.
• Minimize the classification loss: $\min_w \sum_{i=1}^{n} \ell(y_i, w^\top \phi(x_i))$.
• $\phi$: maps $x$ to its kernel space; for linear classification, $\phi(x) = x$.
• $\ell$: the loss function.
• Classification label: $\hat{y} = \operatorname{sign}(w^\top \phi(x))$.

Page 8: Loss functions

• Square loss: $\ell(y, f(x)) = (y - f(x))^2$
• Logistic loss: $\ell(y, f(x)) = \log(1 + e^{-y f(x)})$
• Hinge loss: $\ell(y, f(x)) = \max(0, 1 - y f(x))$ (all three are evaluated numerically in the sketch below)
• More: http://ttic.uchicago.edu/~dmcallester/ttic101-06/lectures/genreg/genreg.pdf
• First three lecture notes of Stanford CS229: http://cs229.stanford.edu/materials.html
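To make the losses concrete, here is a small Python sketch (my addition, not from the slides) that evaluates each loss for a label y in {-1, +1} and a raw classifier score f = w·φ(x):

  import numpy as np

  def square_loss(y, f):
      # (y - f)^2, the least-squares loss
      return (y - f) ** 2

  def logistic_loss(y, f):
      # log(1 + exp(-y f)), the logistic-regression loss
      return np.log1p(np.exp(-y * f))

  def hinge_loss(y, f):
      # max(0, 1 - y f), the SVM loss
      return np.maximum(0.0, 1.0 - y * f)

  # A confidently correct prediction vs. a misclassified one:
  for y, f in [(+1, 2.0), (+1, -0.5)]:
      print(y, f, square_loss(y, f), logistic_loss(y, f), hinge_loss(y, f))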

Page 9: Recommended tools for the linear family

• LibSVM and SVMlight for kernel SVM.
• LibLinear for linear classifiers with different loss functions.
• Online linear classification (very fast!): stochastic gradient descent (Léon Bottou's sgd and John Langford's Vowpal Wabbit); the core idea is sketched below.
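For intuition, here is a minimal sketch of stochastic gradient descent on the logistic loss (my addition in plain NumPy; not the actual sgd or Vowpal Wabbit code):

  import numpy as np

  def sgd_logistic(X, y, lr=0.1, epochs=5):
      """Learn w by SGD on logistic loss; X is (n, d), y entries are -1/+1."""
      n, d = X.shape
      w = np.zeros(d)
      for _ in range(epochs):
          for i in np.random.permutation(n):  # one sample at a time
              margin = y[i] * X[i].dot(w)
              # gradient of log(1 + exp(-y w.x)) with respect to w
              w -= lr * (-y[i] * X[i] / (1.0 + np.exp(margin)))
      return w

  # Toy usage: two separable Gaussian clusters.
  rng = np.random.RandomState(0)
  X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
  y = np.array([1] * 50 + [-1] * 50)
  w = sgd_logistic(X, y)
  print("train accuracy:", np.mean(np.sign(X.dot(w)) == y))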

Page 10: Naïve Bayes and kNN

• For dense datasets, use Weka.
• For sparse datasets, implement them yourself; both are very easy to implement (see the kNN sketch below).
• Weka can load sparse datasets, but its speed and memory usage on them are uncertain.
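To show how little code kNN takes, here is a minimal dense-input sketch (my addition; a sparse version would compute distances via sparse dot products instead):

  import numpy as np
  from collections import Counter

  def knn_predict(X_train, y_train, x, k=3):
      """Predict the label of x by majority vote among its k nearest neighbors."""
      dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
      nearest = np.argsort(dists)[:k]              # indices of the k closest points
      return Counter(y_train[nearest]).most_common(1)[0][0]

  # Toy usage
  X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
  y_train = np.array([-1, -1, 1, 1])
  print(knn_predict(X_train, y_train, np.array([4.9, 5.1])))  # -> 1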

Page 11: Empirical comparison of classifiers

• Caruana et al.: An empirical evaluation of supervised learning in high dimensions. ICML '08. (Slides available.)
• Caruana et al.: An empirical comparison of supervised learning algorithms. ICML '06. (Slides and video available.)

Page 12: Learning a classification tool

• Data format.
• Parameters in the tool:
  • You have to learn the mathematics behind the classifier, at least intuitively, if not rigorously.
• Read its manual, and sometimes its source code!

Page 13: DEMO: LibLinear

• Step 1: Download the source code and binaries; compile the source code if necessary.
• Step 2: Read its README/tutorial; the tutorial usually has a running example. Follow it.
• Step 3: Study its documentation to learn the data format and the parameters.
• Step 4: Use it on your own dataset (a typical session is sketched below).
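A typical LibLinear session (my sketch; the file names are placeholders, and the flags follow LibLinear's README) looks like:

  ./train -s 0 -c 1 train.txt model.txt        # L2-regularized logistic regression, cost C = 1
  ./predict test.txt model.txt predictions.txt

Training writes the model to model.txt; prediction writes one label per test line to predictions.txt and prints the accuracy.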

Page 14: Dataset format

• ARFF: Weka.
• LibSVM sparse: SVMlight, LibSVM, LibLinear, sgd, etc.
• Dense vs. sparse.

Page 15: ARFF: Attribute-Relation File Format

• Documentation: http://www.cs.waikato.ac.nz/ml/weka/arff.html
• ARFF also supports a sparse format (see the sample below).
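A minimal ARFF file (my illustration; the relation and attribute names are made up) looks like this; the sparse form lists only non-default values as index value pairs inside braces:

  @relation demo
  @attribute width numeric
  @attribute height numeric
  @attribute class {yes, no}
  @data
  5.1, 3.5, yes
  {0 4.9, 2 no}   % sparse row: attribute 0 = 4.9, class = no, height defaults to 0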

Page 16: LibSVM sparse format

Line format: a label followed by index:value pairs (example below).
Label: +1/-1 for binary classification; 1/2/3/4/etc. for multi-class.
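For example (my illustration), two lines of a binary dataset might look like:

  +1 1:0.5 7:1.2 100:0.3
  -1 2:1.0 7:0.4 58:2.5

Indices are 1-based and must appear in increasing order; any feature not listed is taken to be zero.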

Page 17: Predicting a score (not a label)

• Many classifiers support probability output:
  • Nearly all classifiers in Weka support probability output.
  • LibSVM/LibLinear support probability output (see the example below).
  • SVMlight outputs a real value from -Inf to +Inf.
• More details: Caruana et al., An empirical comparison of supervised learning algorithms. ICML '06.
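With LibSVM, for instance, probability estimates are enabled by passing -b 1 at both training and prediction time (my sketch; the file names are placeholders):

  ./svm-train -b 1 train.txt model.txt
  ./svm-predict -b 1 test.txt model.txt output.txt   # output.txt gets per-class probabilities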

Page 18: DEMO: experiment.py, from raw data to a successful submission

• Read the raw data and do preprocessing.
• Transform the data into the input format of a classification tool (LibLinear in our example).
• Perform training and testing using the tool.
• Wrap up the results and submit online. (A sketch of such a script follows.)
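The experiment.py shown in the demo is not reproduced in these slides; a minimal sketch of such a pipeline (my assumption of its shape; the file names, the CSV layout of label followed by feature values, and the liblinear binary paths are all placeholders) might look like:

  import csv
  import subprocess

  def csv_to_libsvm(csv_path, out_path):
      """Convert raw CSV rows (label, f1, f2, ...) to LibSVM sparse format."""
      with open(csv_path) as src, open(out_path, "w") as dst:
          for row in csv.reader(src):
              label, feats = row[0], row[1:]
              pairs = [f"{i + 1}:{v}" for i, v in enumerate(feats) if float(v) != 0.0]
              dst.write(label + " " + " ".join(pairs) + "\n")

  # 1. Preprocess: convert both splits to LibLinear's input format.
  csv_to_libsvm("train.csv", "train.libsvm")
  csv_to_libsvm("test.csv", "test.libsvm")

  # 2. Train and test with liblinear's command-line tools.
  subprocess.run(["./train", "-s", "0", "train.libsvm", "model.txt"], check=True)
  subprocess.run(["./predict", "test.libsvm", "model.txt", "predictions.txt"], check=True)

  # 3. predictions.txt now holds one predicted label per test line,
  #    ready to wrap up into the required submission format.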