Classification tools
COMP 4332 Tutorial 4, Feb 23
Yin Zhu, [email protected]
2
Fact Sheets: Classifier
[Bar chart: CLASSIFIER (overall usage=93%). X-axis: percent of participants (0-60). Bars: decision tree, linear classifier, non-linear kernel, other classifiers, neural network, naïve Bayes, nearest neighbors, Bayesian network, Bayesian neural network.]
- About 30% logistic loss, >15% exp loss, >15% sq loss, ~10% hinge loss.
- Less than 50% used regularization (20% 2-norm, 10% 1-norm).
- Only 13% used unlabeled data.
KDDCUP 2009
3
Top 10 data mining algorithms
• http://www.cs.uvm.edu/~icdm/algorithms/index.shtml
• C4.5
• K-means
• SVM
• Apriori
• EM
• PageRank
• AdaBoost
• kNN
• Naïve Bayes
• CART (Classification and Regression Trees)
Six are classification!!!
4
My view of classification
• 1. Tree-based models
  • Trees and tree ensembles
• 2. Linear family and its kernel extension
  • Least Squares Regression
  • Logistic Regression
  • SVM
• 3. Others: Naïve Bayes, KNN
5
Tree-based models
• Two most famous trees: C4.5 and CART
• C4.5 has two solid implementations:
  • Ross Quinlan's original C++ program, and the later C5.0. Source code available.
  • Weka's J48 classifier.
• CART: rpart package in R.
• Tree ensembles:
  • AdaBoost
  • Bagging
  • RandomForest
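The ensembles above can be sketched in a few lines. Below is a minimal, pure-Python AdaBoost over depth-1 trees (decision stumps) on made-up 1-D toy data; it is an illustration of the boosting idea, not the Weka or R implementations the slides recommend.

```python
import math

def stump_predict(threshold, sign, x):
    """A depth-1 tree: predict `sign` if x >= threshold, else -sign."""
    return sign if x >= threshold else -sign

def adaboost(points, labels, rounds=10):
    """Boost decision stumps on 1-D data; labels are in {-1, +1}."""
    n = len(points)
    w = [1.0 / n] * n                      # example weights
    ensemble = []                          # list of (alpha, threshold, sign)
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error.
        best = None
        for t in set(points):
            for s in (1, -1):
                err = sum(wi for wi, x, y in zip(w, points, labels)
                          if stump_predict(t, s, x) != y)
                if best is None or err < best[0]:
                    best = (err, t, s)
        err, t, s = best
        err = max(err, 1e-10)              # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, s))
        # Re-weight: up-weight mistakes, down-weight correct examples.
        w = [wi * math.exp(-alpha * labels[i] * stump_predict(t, s, points[i]))
             for i, wi in enumerate(w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(t, s, x) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

points = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
labels = [-1, -1, -1, 1, 1, 1]
model = adaboost(points, labels)
print([predict(model, x) for x in points])
```

Bagging and RandomForest differ only in how the ensemble members are built (bootstrap resampling and random feature subsets instead of re-weighting).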
6
Recommended tools for trees
• For dense datasets, e.g. # of attributes < 1000, use Weka.
• For sparse datasets, try FEST.
7
Linear family and its kernel extension
• Learning samples: (x_i, y_i), i = 1, ..., n, with y_i in {-1, +1}
• Minimize classification loss: min_w sum_i L(y_i, w·φ(x_i))
• φ: maps x to its kernel space; for linear classification, φ(x) = x
• L: loss function
• Classification label: y = sign(w·φ(x))
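A tiny numeric illustration of the linear case (pure Python; the weight vector w and sample x below are made-up values): with φ(x) = x, the label is just the sign of the dot product.

```python
# Hypothetical learned weight vector and one sample (illustration only).
w = [0.5, -1.0, 0.25]
x = [2.0, 1.0, 4.0]

# score = w . phi(x), with phi = identity in the linear case
score = sum(wi * xi for wi, xi in zip(w, x))
label = 1 if score >= 0 else -1
print(score, label)   # 0.5*2 - 1.0*1 + 0.25*4 = 1.0, so label +1
```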
8
Loss functions
• Square loss: L(y, f(x)) = (y - f(x))²
• Logistic loss: L(y, f(x)) = log(1 + exp(-y f(x)))
• Hinge loss: L(y, f(x)) = max(0, 1 - y f(x))
• More: http://ttic.uchicago.edu/~dmcallester/ttic101-06/lectures/genreg/genreg.pdf
• First three lecture notes of Stanford-CS 229, http://cs229.stanford.edu/materials.html
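The three losses above can be compared directly as functions of the margin m = y·f(x) (for y in {-1, +1}, square loss (y - f(x))² equals (1 - m)²). A small sketch using the standard formulas:

```python
import math

def square_loss(m):
    return (1.0 - m) ** 2          # (y - f(x))^2 rewritten via the margin

def logistic_loss(m):
    return math.log(1.0 + math.exp(-m))

def hinge_loss(m):
    return max(0.0, 1.0 - m)

# Negative margin = misclassified, zero = on the boundary, large = confident.
for m in (-1.0, 0.0, 2.0):
    print(m, square_loss(m), logistic_loss(m), hinge_loss(m))
```

Note how the hinge loss is exactly zero once the margin exceeds 1, while the logistic loss only approaches zero asymptotically.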
9
Recommended tools for linear family
• LibSVM and SVMlight for kernel SVM.
• LibLinear for linear classifiers with different loss functions.
• Online linear classification (very fast!): Stochastic Gradient Descent (Léon Bottou's sgd and John Langford's Vowpal Wabbit).
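The core of those SGD tools fits in a few lines. Below is a minimal pure-Python sketch of SGD on the logistic loss (not Bottou's sgd or Vowpal Wabbit themselves; the toy data is made up): one gradient step per example is what makes online learning fast.

```python
import math
import random

def sgd_logistic(samples, dim, epochs=100, lr=0.1):
    """samples: list of (x, y) with y in {-1, +1}. Note: shuffles in place."""
    w = [0.0] * dim
    for _ in range(epochs):
        random.shuffle(samples)
        for x, y in samples:
            m = y * sum(wi * xi for wi, xi in zip(w, x))
            # d/dw of log(1 + e^{-m}) = -y * x / (1 + e^{m})
            g = -y / (1.0 + math.exp(m))
            for i, xi in enumerate(x):
                w[i] -= lr * g * xi
    return w

# Toy data, linearly separable by the sign of the first feature.
data = [([1.0, 1.0], 1), ([2.0, 0.5], 1),
        ([-1.0, 1.0], -1), ([-2.0, 0.5], -1)]
w = sgd_logistic(data, dim=2)
print(w)
```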
10
Naïve Bayes and KNN
• For dense datasets, use Weka.
• For sparse datasets, implement them yourself; both are very easy to implement.
• Weka is able to load sparse datasets, but we are not sure about its speed and memory usage.
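To back up the "very easy to implement" claim, here is a complete KNN in a dozen lines of pure Python (dense vectors; a sparse version would store index:value dicts instead of lists; the toy training set is made up):

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """train: list of (vector, label); returns the majority label
    among the k training points nearest to x (Euclidean distance)."""
    nearest = sorted(train, key=lambda vy: math.dist(vy[0], x))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

train = [([0, 0], -1), ([0, 1], -1),
         ([5, 5], 1), ([6, 5], 1), ([5, 6], 1)]
print(knn_predict(train, [5.5, 5.2]))   # nearest neighbours are the +1 cluster
```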
11
Empirical comparison of classifiers
• Caruana et al.: An empirical evaluation of supervised learning in high dimensions. ICML'08.
• Caruana et al.: An empirical comparison of supervised learning algorithms. ICML'06.
12
Learning a classification tool
• Data Format
• Parameters in the tool
• You have to learn the mathematics behind the classifier, at least intuitively if not rigorously.
• Read its manual and sometimes its source code!
13
DEMO: LibLinear
• Step 1: Download the source code and binaries; compile the source code if necessary.
• Step 2: Read its README/tutorial; usually the tutorial has a running example. Follow the example.
• Step 3: Study its documentation to learn its data format and its parameters.
• Step 4: Use it on your own data set.
14
Dataset format
• ARFF: Weka
• Libsvm sparse: SvmLight, LibSvm, LibLinear, Sgd, etc.
• Dense vs Sparse
15
ARFF: Attribute-Relation File Format
• Documentation:
• http://www.cs.waikato.ac.nz/ml/weka/arff.html
• ARFF also supports a sparse format
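A minimal ARFF file for the format above can be built as a plain string (the relation and attribute names here are made up for illustration):

```python
# Header: relation name, attribute declarations, then the @data section.
header = """@relation demo
@attribute width numeric
@attribute height numeric
@attribute class {pos,neg}
@data
"""

rows = [(2.0, 3.5, "pos"), (1.0, 0.5, "neg")]
arff = header + "\n".join(f"{w},{h},{c}" for w, h, c in rows) + "\n"
print(arff)
```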
16
LibSvm Sparse Format
Line format: label followed by index:value pairs
Label: +1/-1 for binary classification; 1/2/3/4/etc. for multi-class.
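Reading and writing this line format takes only a few lines of Python (a sketch; by convention indices are 1-based and listed in increasing order, and zero-valued features are simply omitted):

```python
def to_libsvm(label, features):
    """features: dict {index: value}; emit 'label i1:v1 i2:v2 ...'."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label} {pairs}"

def from_libsvm(line):
    """Parse one line back into (label, {index: value})."""
    label, *pairs = line.split()
    feats = {int(i): float(v) for i, v in (p.split(":") for p in pairs)}
    return int(label), feats

line = to_libsvm(1, {3: 0.5, 1: 2.0})
print(line)                 # "1 1:2.0 3:0.5"
print(from_libsvm(line))
```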
17
Predicting a score (not a label)
• Many classifiers support probability output:
  • Nearly all classifiers in Weka support probability output.
  • LibSVM/LibLinear support probability output.
• SvmLight outputs a real value from -Inf to +Inf.
• More details in: Caruana et al.: An empirical comparison of supervised learning algorithms. ICML'06.
18
DEMO: experiment.py
From raw data to a successful submission
• Read the raw data and do preprocessing.
• Transform the data to the input format of a classification tool (liblinear in our example).
• Perform training and testing using the tool.
• Wrap up the results and submit online.
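The four steps above can be sketched as a small pipeline skeleton (this is not the tutorial's actual experiment.py; the field names `clicked`, `age`, `visits` and the `./train` / `./predict` binary paths in the comment are assumptions about a local LibLinear build):

```python
def preprocess(raw_rows):
    """Step 1: raw records -> (label, {index: value}) pairs.
    Field names here are hypothetical examples."""
    return [(1 if r["clicked"] else -1, {1: r["age"], 2: r["visits"]})
            for r in raw_rows]

def write_libsvm(samples, path):
    """Step 2: transform to LibLinear's sparse input format."""
    with open(path, "w") as f:
        for label, feats in samples:
            pairs = " ".join(f"{i}:{v}" for i, v in sorted(feats.items()))
            f.write(f"{label} {pairs}\n")

raw = [{"clicked": True, "age": 30, "visits": 7},
       {"clicked": False, "age": 22, "visits": 1}]
write_libsvm(preprocess(raw), "train.txt")

# Step 3 would shell out to the LibLinear binaries, e.g.:
#   ./train train.txt model && ./predict test.txt model predictions.txt
# Step 4: format predictions.txt into the submission file.
with open("train.txt") as f:
    print(f.read())
```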