Introduction to Weka
ML Seminar for Rookies
2012-02-03
Byoung-Hee Kim
Biointelligence Lab, Seoul National University
(C) 2007-2012, SNU Biointelligence Lab, http://bi.snu.ac.kr/
BI?
(Predictive) Analytics
Data Mining
Machine Learning
AI
Hype Cycle of Emerging Technologies 2010, Gartner
Analytics as a Mainstream Technology
Components of Data Mining
Weka as a Must-Have Tool
“I use Weka constantly in my speech work. It is the first thing I reach for when encountering a new problem. What a terrific tool.”
“A must for anyone even marginally interested in machine learning and classification techniques.”
“One of the most useful AI software packages available. Its only serious flaw is being infected with the GNU virus.”
(Reviews on SourceForge.net)
Agenda
General Information on Weka
Weka: Data Mining Software in Java
• Weka is a collection of machine learning algorithms for data mining and machine learning tasks.
• What you can do with Weka: data pre-processing, feature selection, classification, regression, clustering, association rules, and visualization.
• Weka is open source software issued under the GNU General Public License.
• How to get it: http://www.cs.waikato.ac.nz/ml/weka/, or just search for ‘Weka’ on Google.
Components of Weka
• Explorer lets you do various data mining tasks in an interactive, step-by-step way. It is usually the first choice.
• KnowledgeFlow lets you design configurations for streamed data processing.
• Experimenter lets you run classification and regression experiments in batch mode: different parameter settings, various datasets, comparison of models, large-scale statistical experiments.
• Simple CLI lies behind the other interfaces. By entering textual commands, you can access all features of the Weka system.
Auxiliary Tools in the menu
Practice: Classifying Iris Flower
Iris virginica, Iris versicolor, Iris setosa
Features for Classification
Terminology
Neural Networks
MLP (Multilayer Perceptron)
In Weka: classifiers - functions - MultilayerPerceptron
Decision Trees
J48 (Java implementation of C4.5)
In Weka: classifiers - trees - J48
Support Vector Machines
SMO (sequential minimal optimization) for training SVM
In Weka: classifiers - functions - SMO
Practice Scenario
Basic
• Comparing the performances of algorithms: MultilayerPerceptron vs. J48 vs. SVM
• Checking the trained model (structure & parameters)
• Tuning parameters to get better models
• Understanding ‘Test options’ & ‘Classifier output’ in Weka
Advanced
• Building committee machines using ‘meta’ algorithms for classification
• Preprocessing / data manipulation: applying ‘Filter’
• Batch experiment with ‘Experimenter’
• Design & run a batch process with ‘KnowledgeFlow’
Dataset for Practice with Weka
Just open “iris.arff” in the data folder of Weka
Data format for Weka (.ARFF)
Header:
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA

Data (CSV format):
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
…
7.0, 3.2, 4.7, 1.4, Iris-versicolor
6.4, 3.2, 4.5, 1.5, Iris-versicolor
6.9, 3.1, 4.9, 1.5, Iris-versicolor
…
Note: You can easily generate an ‘arff’ file by adding a header to an ordinary CSV text file.
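As a rough illustration of that note, the sketch below prepends an ARFF header to CSV rows. The helper name `csv_to_arff` and its arguments are hypothetical (they are not part of Weka), and the sketch assumes simple values that need no quoting.

```python
import csv
import io


def csv_to_arff(csv_text, relation, attributes):
    """Prepend an ARFF header to CSV rows (hypothetical helper, not a Weka API).

    attributes: list of (name, type) pairs, where type is either an ARFF
    type keyword such as "REAL" or a list of nominal values.
    """
    lines = ["@RELATION " + relation]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            # Nominal attribute: list the allowed values in braces.
            kind = "{" + ",".join(kind) + "}"
        lines.append("@ATTRIBUTE {} {}".format(name, kind))
    lines.append("@DATA")
    for row in csv.reader(io.StringIO(csv_text)):
        if row:  # skip blank lines
            lines.append(",".join(cell.strip() for cell in row))
    return "\n".join(lines) + "\n"


sample_csv = "5.1, 3.5, 1.4, 0.2, Iris-setosa\n7.0, 3.2, 4.7, 1.4, Iris-versicolor\n"
iris_attributes = [
    ("sepallength", "REAL"),
    ("sepalwidth", "REAL"),
    ("petallength", "REAL"),
    ("petalwidth", "REAL"),
    ("class", ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]),
]
arff_text = csv_to_arff(sample_csv, "iris", iris_attributes)
print(arff_text)
```

Values containing spaces or commas would need quoting, which this sketch does not handle.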
Neural Networks in Weka
• Load a file that contains the training data by clicking the ‘Open file’ button
• ‘ARFF’ or ‘CSV’ formats are readable
• Click the ‘Classify’ tab
• Click the ‘Choose’ button
• Select ‘weka - classifiers - functions - MultilayerPerceptron’
• Click ‘MultilayerPerceptron’
• Set parameters for MLP
• Set parameters for Test
• Click ‘Start’ for learning
Some Notes on the Parameter Setting
Parameter Setting = Car Tuning
• It needs much experience or many rounds of trial and error
• You may get worse results if you are unlucky
Multilayer Perceptron (MLP)
• Main parameters for learning: hiddenLayers, learningRate, momentum, trainingTime (epochs), seed
J48
• Main parameters: unpruned, numFolds, minNumObj
• Many parameters control the size of the resulting tree, e.g. confidenceFactor, pruning
SMO (SVM)
• Main parameters: c (complexity parameter), kernel, kernel parameters
Test Options and Classifier Output
There are various metrics for evaluation
Setting the data set used for evaluation
How to Evaluate the Performance? (1/2)
Usually, build a ‘Confusion Matrix’ on the test data set
Evaluation Metrics
• Accuracy (percent correct)
• Precision
• Recall
• Many other metrics: F-measure, Kappa score, etc.
For fair evaluation, the ‘cross-validation’ scheme is used
How to Evaluate the Performance? (2/2)
Confusion Matrix

                        Real: Positive      Real: Negative
Prediction: Positive    TP                  FP                     All with Positive Test
Prediction: Negative    FN                  TN                     All with Negative Test
                        All with Disease    All without Disease    Everyone

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

As recall ↑, precision ↓; conversely, as recall ↓, precision ↑
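The accuracy, precision, and recall definitions on this slide can be sketched in a few lines. The function name `classification_metrics` and the counts in the example are made up for illustration; returning 0.0 for an empty denominator is a choice made for this sketch, not something the slides specify.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Assumption of this sketch: report 0.0 when the denominator is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),  # of all positive predictions, how many are right
        "recall": ratio(tp, tp + fn),     # of all real positives, how many are found
    }


# Toy counts, chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print("{}: {:.3f}".format(name, value))
```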
Evaluation Method - Cross Validation
K-fold Cross Validation
• The data set is randomly divided into k subsets.
• One of the k subsets is used as the ‘test set’ and the other k-1 subsets are put together to form a ‘training set’.
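6-fold cross validation (figure: six subsets D1 to D6, each labeled 128):
Training set: D1 D2 D3 D4 D5 / Test set: D6
Training set: D1 D2 D3 D4 D6 / Test set: D5
…
Training set: D2 D3 D4 D5 D6 / Test set: D1

$$\text{Error} = \frac{1}{k} \sum_{i=1}^{k} \text{Error}_i$$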
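A minimal sketch of the k-fold scheme and the averaged error follows. The names `k_fold_splits`, `cross_validated_error`, and `majority_error` are hypothetical, the labels are toy data, and the split is purely random. Weka's own cross-validation may differ in detail, for example by stratifying the folds.

```python
import random
from collections import Counter


def k_fold_splits(n_instances, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validated_error(n_instances, k, error_fn, seed=0):
    """Average the per-fold errors: Error = (1/k) * sum of Error_i."""
    errors = [error_fn(train, test) for train, test in k_fold_splits(n_instances, k, seed)]
    return sum(errors) / k


# Toy data: 768 instances (6 folds of 128), two classes.
labels = [0] * 512 + [1] * 256


def majority_error(train, test):
    """Toy 'classifier': always predict the majority class of the training set."""
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return sum(labels[i] != majority for i in test) / len(test)


cv_error = cross_validated_error(len(labels), k=6, error_fn=majority_error)
print("6-fold CV error:", round(cv_error, 4))
```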
Data Manipulation with Filter in Weka
Attribute: selection, discretization
Instance: re-sampling, selecting specified folds
Using Experimenter in Weka
Tool for ‘Batch’ experiments
• Click ‘New’
• Set experiment type / iteration control
• Set datasets / algorithms
• Select the ‘Run’ tab and click ‘Start’
• If it has finished successfully, click the ‘Analyse’ tab and see the summary
Usages of Experimenter
Model selection for classification/regression
• Various approaches: repeated training/test set split, repeated cross-validation (cf. double cross-validation), averaging
Comparison between models / algorithms
• Paired t-test
• On various metrics: accuracies / RMSE / etc.
Batch and/or distributed processing
• Load/save experiment settings
• http://weka.wikispaces.com/Remote+Experiment
• Multi-core support: utilize all the cores on a multi-core machine
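To make the paired t-test comparison concrete, here is a sketch of the plain textbook statistic computed on per-run accuracies of two models. The function name `paired_t_statistic` and the accuracy values are hypothetical toy numbers, not results from Weka, and the Experimenter's own test may apply corrections that this sketch omits.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired scores of two models."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical accuracies of two models on the same five train/test splits.
acc_model_a = [0.96, 0.94, 0.95, 0.97, 0.93]
acc_model_b = [0.94, 0.93, 0.92, 0.95, 0.92]
t_value, dof = paired_t_statistic(acc_model_a, acc_model_b)
print("t = {:.3f} with {} degrees of freedom".format(t_value, dof))
```

The resulting t value is then compared against a t distribution with the given degrees of freedom to judge significance.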
KnowledgeFlow for Analysis Process Design
(‘Process Flow Diagram’ of SAS® Enterprise Miner)
KnowledgeFlow: Example Usage
Decision tree (J48)
Simple CLI
Example command and result
java weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -t "C:\Program Files\Weka-3-6\data\iris.arff"
You can easily build a command-line script for various experiments
Refer to Ch. 1 of WekaManual-3-*-*.pdf for further information
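One way to script such experiments is to generate one command line per parameter combination. The sketch below only builds the command strings, reusing the -L (learning rate), -M (momentum), -N (training time), -H (hidden layers), and -t (training file) options from the example command. The helper name `mlp_commands` and the parameter grid are hypothetical, and actually running the commands outside the Simple CLI assumes that Weka is on the Java classpath.

```python
import itertools


def mlp_commands(arff_path, learning_rates, momenta, epochs=500, hidden="a"):
    """Build one MultilayerPerceptron command line per (learning rate, momentum) pair."""
    commands = []
    for rate, momentum in itertools.product(learning_rates, momenta):
        commands.append(
            "java weka.classifiers.functions.MultilayerPerceptron "
            '-L {} -M {} -N {} -H {} -t "{}"'.format(rate, momentum, epochs, hidden, arff_path)
        )
    return commands


# Hypothetical grid: 2 learning rates x 2 momentum values = 4 commands.
commands = mlp_commands("iris.arff", [0.1, 0.3], [0.2, 0.5])
for command in commands:
    print(command)
```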
Other ML Open Source S/W’s
• KNIME (Konstanz Information Miner): http://www.knime.org/
• RapidMiner (integrates Weka's algorithms): http://rapid-i.com/index.php?lang=en
• TANAGRA: http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html
General Information on Weka
Current version (2012-02-03)
• Stable version: 3.6.6
• Developer version: 3.7.5
References
• Weka Wiki: http://weka.wikispaces.com/ (the Primer is a good starting point)
• Weka online documentation: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
• Textbook: Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques (Third Edition), Morgan Kaufmann, Jan. 2011.
• Articles: Data mining with WEKA, Part 1, Part 2, Part 3, in the IBM Technical Library
• Articles (in Korean): “Building a Prediction Program with Weka”, serialized in the monthly magazine Maso (July, August, and September 2009 issues); blog, MS Live