COMP 4332 Tutorial 4, Feb 23
Yin Zhu ([email protected])
Classification tools


Page 1: Classification tools

COMP 4332 Tutorial 4, Feb 23
Yin Zhu ([email protected])

Page 2: Fact Sheets: Classifier (KDD Cup 2009)

[Bar chart: CLASSIFIER (overall usage = 93%). X-axis: percent of participants, 0 to 60. Bars for: decision tree, linear classifier, non-linear kernel, other classifiers, neural network, naïve Bayes, nearest neighbors, Bayesian network, Bayesian neural network.]

• About 30% logistic loss, >15% exponential loss, >15% squared loss, ~10% hinge loss.
• Less than 50% used regularization (20% 2-norm, 10% 1-norm).
• Only 13% used unlabeled data.

Page 3: Top 10 data mining algorithms

• http://www.cs.uvm.edu/~icdm/algorithms/index.shtml
• C4.5
• K-means
• SVM
• Apriori
• EM
• PageRank
• AdaBoost
• kNN
• Naïve Bayes
• CART (Classification and Regression Trees)

Six of the ten are classification algorithms!

Page 4: My view of classification

1. Tree-based models
   • Trees and tree ensembles
2. Linear family and its kernel extension
   • Least-squares regression
   • Logistic regression
   • SVM
3. Others: naïve Bayes, kNN

Page 5: Tree-based models

• The two most famous trees: C4.5 and CART.
• C4.5 has two solid implementations:
  • Ross Quinlan's original program (and the later C5.0); source code is available.
  • Weka's J48 classifier.
• CART: the rpart package in R.
• Tree ensembles: AdaBoost, bagging, random forest.

Page 6: Recommended tools for trees

• For dense datasets (e.g. fewer than ~1000 attributes), use Weka; a sample command line is shown below.
• For sparse datasets, try FEST.
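As an illustration (my addition, not from the slides), Weka's J48 implementation of C4.5 can be run from the command line; a minimal invocation, assuming weka.jar is available and train.arff/test.arff are your files, looks like:

  java -cp weka.jar weka.classifiers.trees.J48 -t train.arff -T test.arff

Here -t names the training file and -T the test file; Weka prints the learned tree and its error statistics.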

Page 7: Linear family and its kernel extension

• Learning samples: $(x_i, y_i)$, $i = 1, \dots, n$, with $y_i \in \{-1, +1\}$.
• Minimize the classification loss: $\min_w \sum_{i=1}^{n} \ell(y_i, w^\top \phi(x_i))$.
• $\phi$: maps $x$ to its kernel space; for linear classification, $\phi(x) = x$.
• $\ell$: the loss function.
• Classification label: $\hat{y} = \operatorname{sign}(w^\top \phi(x))$.

Page 8: Loss functions

• Square loss: $\ell(y, f(x)) = (y - f(x))^2$
• Logistic loss: $\ell(y, f(x)) = \log(1 + e^{-y f(x)})$
• Hinge loss: $\ell(y, f(x)) = \max(0, 1 - y f(x))$ (all three are evaluated numerically in the sketch below)
• More: http://ttic.uchicago.edu/~dmcallester/ttic101-06/lectures/genreg/genreg.pdf
• First three lecture notes of Stanford CS229: http://cs229.stanford.edu/materials.html
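To make the losses concrete, here is a small Python sketch (my addition, not from the slides) that evaluates each loss for a label y in {-1, +1} and a raw classifier score f = w·φ(x):

  import numpy as np

  def square_loss(y, f):
      # (y - f)^2, the least-squares loss
      return (y - f) ** 2

  def logistic_loss(y, f):
      # log(1 + exp(-y f)), the logistic-regression loss
      return np.log1p(np.exp(-y * f))

  def hinge_loss(y, f):
      # max(0, 1 - y f), the SVM loss
      return np.maximum(0.0, 1.0 - y * f)

  # A confidently correct prediction vs. a misclassified one:
  for y, f in [(+1, 2.0), (+1, -0.5)]:
      print(y, f, square_loss(y, f), logistic_loss(y, f), hinge_loss(y, f))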

Page 9: Recommended tools for the linear family

• LibSVM and SVMlight for kernel SVM.
• LibLinear for linear classifiers with different loss functions.
• Online linear classification (very fast!): stochastic gradient descent (Léon Bottou's sgd and John Langford's Vowpal Wabbit); the core idea is sketched below.
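For intuition, here is a minimal sketch of stochastic gradient descent on the logistic loss (my addition in plain NumPy; not the actual sgd or Vowpal Wabbit code):

  import numpy as np

  def sgd_logistic(X, y, lr=0.1, epochs=5):
      """Learn w by SGD on logistic loss; X is (n, d), y entries are -1/+1."""
      n, d = X.shape
      w = np.zeros(d)
      for _ in range(epochs):
          for i in np.random.permutation(n):  # one sample at a time
              margin = y[i] * X[i].dot(w)
              # gradient of log(1 + exp(-y w.x)) with respect to w
              w -= lr * (-y[i] * X[i] / (1.0 + np.exp(margin)))
      return w

  # Toy usage: two separable Gaussian clusters.
  rng = np.random.RandomState(0)
  X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
  y = np.array([1] * 50 + [-1] * 50)
  w = sgd_logistic(X, y)
  print("train accuracy:", np.mean(np.sign(X.dot(w)) == y))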

Page 10: Naïve Bayes and kNN

• For dense datasets, use Weka.
• For sparse datasets, implement them yourself; both are very easy to implement (see the kNN sketch below).
• Weka can load sparse datasets, but its speed and memory usage on them are uncertain.
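To show how little code kNN takes, here is a minimal dense-input sketch (my addition; a sparse version would compute distances via sparse dot products instead):

  import numpy as np
  from collections import Counter

  def knn_predict(X_train, y_train, x, k=3):
      """Predict the label of x by majority vote among its k nearest neighbors."""
      dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
      nearest = np.argsort(dists)[:k]              # indices of the k closest points
      return Counter(y_train[nearest]).most_common(1)[0][0]

  # Toy usage
  X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
  y_train = np.array([-1, -1, 1, 1])
  print(knn_predict(X_train, y_train, np.array([4.9, 5.1])))  # -> 1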

Page 11: Empirical comparison of classifiers

• Caruana et al.: An empirical evaluation of supervised learning in high dimensions. ICML '08. (Slides available.)
• Caruana et al.: An empirical comparison of supervised learning algorithms. ICML '06. (Slides and video available.)

Page 12: Learning a classification tool

• Data format.
• Parameters in the tool:
  • You have to learn the mathematics behind the classifier, at least intuitively, if not rigorously.
• Read its manual, and sometimes its source code!

Page 13: DEMO: LibLinear

• Step 1: Download the source code and binaries; compile the source code if necessary.
• Step 2: Read its README/tutorial; the tutorial usually has a running example. Follow it.
• Step 3: Study its documentation to learn the data format and the parameters.
• Step 4: Use it on your own dataset (a typical session is sketched below).
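A typical LibLinear session (my sketch; the file names are placeholders, and the flags follow LibLinear's README) looks like:

  ./train -s 0 -c 1 train.txt model.txt        # L2-regularized logistic regression, cost C = 1
  ./predict test.txt model.txt predictions.txt

Training writes the model to model.txt; prediction writes one label per test line to predictions.txt and prints the accuracy.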

Page 14: Dataset format

• ARFF: Weka.
• LibSVM sparse: SVMlight, LibSVM, LibLinear, sgd, etc.
• Dense vs. sparse.

Page 15: ARFF: Attribute-Relation File Format

• Documentation: http://www.cs.waikato.ac.nz/ml/weka/arff.html
• ARFF also supports a sparse format (see the sample below).
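A minimal ARFF file (my illustration; the relation and attribute names are made up) looks like this; the sparse form lists only non-default values as index value pairs inside braces:

  @relation demo
  @attribute width numeric
  @attribute height numeric
  @attribute class {yes, no}
  @data
  5.1, 3.5, yes
  {0 4.9, 2 no}   % sparse row: attribute 0 = 4.9, class = no, height defaults to 0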

Page 16: LibSVM sparse format

Line format: a label followed by index:value pairs (example below).
Label: +1/-1 for binary classification; 1/2/3/4/etc. for multi-class.
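For example (my illustration), two lines of a binary dataset might look like:

  +1 1:0.5 7:1.2 100:0.3
  -1 2:1.0 7:0.4 58:2.5

Indices are 1-based and must appear in increasing order; any feature not listed is taken to be zero.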

Page 17: Predicting a score (not a label)

• Many classifiers support probability output:
  • Nearly all classifiers in Weka support probability output.
  • LibSVM/LibLinear support probability output (see the example below).
  • SVMlight outputs a real value from -Inf to +Inf.
• More details: Caruana et al., An empirical comparison of supervised learning algorithms. ICML '06.
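With LibSVM, for instance, probability estimates are enabled by passing -b 1 at both training and prediction time (my sketch; the file names are placeholders):

  ./svm-train -b 1 train.txt model.txt
  ./svm-predict -b 1 test.txt model.txt output.txt   # output.txt gets per-class probabilities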

Page 18: DEMO: experiment.py, from raw data to a successful submission

• Read the raw data and do preprocessing.
• Transform the data into the input format of a classification tool (LibLinear in our example).
• Perform training and testing using the tool.
• Wrap up the results and submit online. (A sketch of such a script follows.)
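The experiment.py shown in the demo is not reproduced in these slides; a minimal sketch of such a pipeline (my assumption of its shape; the file names, the CSV layout of label followed by feature values, and the liblinear binary paths are all placeholders) might look like:

  import csv
  import subprocess

  def csv_to_libsvm(csv_path, out_path):
      """Convert raw CSV rows (label, f1, f2, ...) to LibSVM sparse format."""
      with open(csv_path) as src, open(out_path, "w") as dst:
          for row in csv.reader(src):
              label, feats = row[0], row[1:]
              pairs = [f"{i + 1}:{v}" for i, v in enumerate(feats) if float(v) != 0.0]
              dst.write(label + " " + " ".join(pairs) + "\n")

  # 1. Preprocess: convert both splits to LibLinear's input format.
  csv_to_libsvm("train.csv", "train.libsvm")
  csv_to_libsvm("test.csv", "test.libsvm")

  # 2. Train and test with liblinear's command-line tools.
  subprocess.run(["./train", "-s", "0", "train.libsvm", "model.txt"], check=True)
  subprocess.run(["./predict", "test.libsvm", "model.txt", "predictions.txt"], check=True)

  # 3. predictions.txt now holds one predicted label per test line,
  #    ready to wrap up into the required submission format.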