Introduction to Weka ML Seminar for Rookies 2012-02-03 Byoung-Hee Kim Biointelligence Lab, Seoul National University


Introduction to Weka
ML Seminar for Rookies

2012-02-03
Byoung-Hee Kim

Biointelligence Lab, Seoul National University


(C) 2007-2012, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 2

BI?

(Predictive) Analytics

Data Mining

Machine Learning

AI


Hype Cycle of Emerging Technologies 2010, Gartner

Analytics as a Mainstream Technology


Components of Data Mining


Weka as a Must-Have Tool


“I use Weka constantly in my speech work. It is the first thing I reach for when encountering a new problem. What a terrific tool.”

“A must for anyone even marginally interested in machine learning and classification techniques.”

“One of the most useful AI software packages available. Its only serious flaw is being infected with the GNU virus.”

Source: reviews on SourceForge.net


Agenda


General Information on Weka

Weka: Data Mining Software in Java
• Weka is a collection of machine learning algorithms for data mining and machine learning tasks
• With Weka you can do data pre-processing, feature selection, classification, regression, clustering, association rule mining, and visualization
• Weka is open-source software issued under the GNU General Public License
• How to get it: http://www.cs.waikato.ac.nz/ml/weka/, or just search for ‘Weka’ on Google


Components of Weka


• Explorer lets you carry out various data mining tasks in an interactive, step-by-step way. It is usually the first choice.
• KnowledgeFlow lets you design configurations for streamed data processing.
• Experimenter lets you run classification and regression experiments in batch mode:
  - Different parameter settings
  - Various datasets
  - Comparison of models
  - Large-scale statistical experiments
• Simple CLI lies behind the other interfaces. By entering textual commands, you can access all features of the Weka system.
• Auxiliary tools in the menu


Practice: Classifying Iris Flower


Iris virginica, Iris versicolor, Iris setosa

Features for Classification


Terminology


Neural Networks

• MLP (Multilayer Perceptron)
• In Weka: classifiers - functions - MultilayerPerceptron


Decision Trees

• J48 (Java implementation of C4.5)
• In Weka: classifiers - trees - J48


Support Vector Machines

• SMO (sequential minimal optimization) for training SVMs
• In Weka: classifiers - functions - SMO


Practice Scenario

Basic
• Comparing the performance of algorithms: MultilayerPerceptron vs. J48 vs. SVM
• Checking the trained model (structure & parameters)
• Tuning parameters to get better models
• Understanding ‘Test options’ & ‘Classifier output’ in Weka

Advanced
• Building committee machines using ‘meta’ algorithms for classification
• Preprocessing / data manipulation: applying ‘Filter’
• Batch experiments with ‘Experimenter’
• Designing & running a batch process with ‘KnowledgeFlow’


Dataset for Practice with Weka

Just open “iris.arff” in the data folder of Weka


Data format for Weka (.ARFF)

Header:

@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

Data (CSV format):

@DATA
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
…
7.0, 3.2, 4.7, 1.4, Iris-versicolor
6.4, 3.2, 4.5, 1.5, Iris-versicolor
6.9, 3.1, 4.9, 1.5, Iris-versicolor
…

Note: You can easily generate an ‘arff’ file by adding a header to an ordinary CSV text file.
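The note about adding a header to a CSV file can be sketched in a few lines of Python. This is only an illustration: the helper name `csv_to_arff` and the three sample rows are assumptions made for this example, and it uses plain text handling rather than any Weka tool.

```python
import csv
import io


def csv_to_arff(csv_text, relation, attributes):
    """Prepend an ARFF header to CSV rows and return the ARFF text.

    attributes is a list of (name, type) pairs; type is either an ARFF
    type keyword such as "REAL" or a list of nominal labels.
    (Hypothetical helper written for this example.)
    """
    lines = [f"@RELATION {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            # Nominal attribute: list the allowed labels in braces.
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@ATTRIBUTE {name} {kind}")
    lines += ["", "@DATA"]
    # Re-emit each CSV row, trimming stray spaces around the values.
    for row in csv.reader(io.StringIO(csv_text)):
        if row:
            lines.append(",".join(value.strip() for value in row))
    return "\n".join(lines) + "\n"


IRIS_ATTRIBUTES = [
    ("sepallength", "REAL"),
    ("sepalwidth", "REAL"),
    ("petallength", "REAL"),
    ("petalwidth", "REAL"),
    ("class", ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]),
]

# Three rows copied from the slide; a real file would hold the full data set.
sample_csv = """5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
7.0, 3.2, 4.7, 1.4, Iris-versicolor
"""

arff_text = csv_to_arff(sample_csv, "iris", IRIS_ATTRIBUTES)
print(arff_text)
```

Writing `arff_text` to a file with an `.arff` extension should give something Explorer can open, provided the attribute list matches the CSV columns in order.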


Neural Networks in Weka


• Load a file that contains the training data by clicking the ‘Open file’ button
• ‘ARFF’ or ‘CSV’ formats are readable
• Click the ‘Classify’ tab
• Click the ‘Choose’ button
• Select ‘weka - classifiers - functions - MultilayerPerceptron’
• Click ‘MultilayerPerceptron’
• Set parameters for MLP
• Set parameters for Test
• Click ‘Start’ for learning


Some Notes on the Parameter Setting

Parameter setting = car tuning
• It needs much experience or many trials
• You may get worse results if you are unlucky

Multilayer Perceptron (MLP)
• Main parameters for learning: hiddenLayers, learningRate, momentum, trainingTime (epochs), seed

J48
• Main parameters: unpruned, numFolds, minNumObj
• Many parameters control the size of the resulting tree, e.g. confidenceFactor, pruning

SMO (SVM)
• Main parameters: c (complexity parameter), kernel, kernel parameters


Test Options and Classifier Output


There are various metrics for evaluation

Setting the data set used for evaluation


How to Evaluate the Performance? (1/2)

• Usually, build a ‘Confusion Matrix’ on the test data set
• Evaluation metrics
  - Accuracy (percent correct)
  - Precision
  - Recall
  - Many other metrics: F-measure, Kappa score, etc.
• For fair evaluation, the ‘cross-validation’ scheme is used


How to Evaluate the Performance? (2/2)

Confusion Matrix

                        Real: Positive      Real: Negative         Total
Prediction: Positive    TP                  FP                     All with positive test
Prediction: Negative    FN                  TN                     All with negative test
Total                   All with disease    All without disease    Everyone

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

As recall ↑, precision ↓; conversely, as recall ↓, precision ↑
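As a small worked example of the three metrics, the sketch below counts TP, FP, FN, and TN from two label lists and applies the definitions from this slide. The label lists are made-up toy data, and the function names are assumptions for this example, not part of Weka.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN, TN for one class treated as 'positive'."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    # Defined as 0.0 here when nothing was predicted positive.
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    # Defined as 0.0 here when there are no real positives.
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical toy labels, not results from any real classifier.
actual = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos", "neg"]
predicted = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos", "pos"]

tp, fp, fn, tn = confusion_counts(actual, predicted, positive="pos")
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("accuracy:", accuracy(tp, fp, fn, tn))
print("precision:", precision(tp, fp))
print("recall:", recall(tp, fn))
```

For a three-class problem such as iris, the same counts can be taken once per class by treating that class as positive and the other two as negative.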


Evaluation Method - Cross Validation

K-fold Cross Validation
• The data set is randomly divided into k subsets.
• One of the k subsets is used as the ‘test set’ and the other k-1 subsets are put together to form a ‘training set’.

Example: 6-fold cross validation (each block D1 to D6 is labeled 128)

Training set        Test set
D1 D2 D3 D4 D5      D6
D1 D2 D3 D4 D6      D5
…
D2 D3 D4 D5 D6      D1

$$\text{Error} = \frac{1}{k} \sum_{i=1}^{k} \text{Error}_i$$
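The splitting and averaging described on this slide can be sketched as follows. This is a minimal illustration, not Weka's implementation: the function names are assumptions, the fold assignment is a simple shuffle without stratification, and the "classifier" is a majority-class baseline on toy labels.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Shuffle item indices and deal them into k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def cross_validation_error(n_items, k, evaluate_fold, seed=0):
    """Average the per-fold test error: Error = (1/k) * sum of Error_i."""
    folds = k_fold_indices(n_items, k, seed)
    errors = []
    for i, test_idx in enumerate(folds):
        # The other k-1 folds are put together to form the training set.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        errors.append(evaluate_fold(train_idx, test_idx))
    return sum(errors) / k


# Hypothetical toy labels: 768 items, i.e. 6 folds of 128 as in the figure.
labels = ["a"] * 512 + ["b"] * 256


def majority_class_error(train_idx, test_idx):
    """Baseline: predict the most frequent training label for every test item."""
    train_labels = [labels[j] for j in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    wrong = sum(1 for j in test_idx if labels[j] != majority)
    return wrong / len(test_idx)


cv_error = cross_validation_error(len(labels), 6, majority_class_error)
print("6-fold CV error of the majority-class baseline:", cv_error)
```

Replacing `majority_class_error` with a function that trains and tests a real model on the given index lists gives the usual cross-validated error estimate.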


Data Manipulation with Filter in Weka

• Attribute filters: selection, discretize
• Instance filters: re-sampling, selecting specified folds


Using Experimenter in Weka

Tool for ‘Batch’ experiments

• Click ‘New’
• Set experiment type / iteration control
• Set datasets / algorithms
• Select the ‘Run’ tab and click ‘Start’
• If it finishes successfully, click the ‘Analyse’ tab and see the summary


Usages of Experimenter

Model selection for classification/regression
• Various approaches
  - Repeated training/test set split
  - Repeated cross-validation (cf. double cross-validation)
  - Averaging

Comparison between models / algorithms
• Paired t-test
• On various metrics: accuracy / RMSE / etc.

Batch and/or distributed processing
• Load/save experiment settings
• Remote experiments: http://weka.wikispaces.com/Remote+Experiment
• Multi-core support: utilize all the cores on a multi-core machine
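To make the paired t-test concrete, here is a sketch of the plain textbook statistic computed on per-run scores of two algorithms evaluated on the same splits. The score lists are invented for illustration, the function name is an assumption, and this is not necessarily the exact tester that Experimenter applies.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired samples.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the per-run differences.
    Assumes at least two runs and differences that are not all identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical accuracies of two algorithms on the same five splits.
scores_a = [0.95, 0.93, 0.96, 0.94, 0.97]
scores_b = [0.93, 0.92, 0.93, 0.94, 0.95]

t_value, dof = paired_t_statistic(scores_a, scores_b)
print("t =", t_value, "with", dof, "degrees of freedom")
```

The resulting t value is compared against a t-distribution with n - 1 degrees of freedom to decide whether the difference between the two algorithms is significant at the chosen level.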


KnowledgeFlow for Analysis Process Design

(‘Process Flow Diagram’ of SAS® Enterprise Miner)


KnowledgeFlow: Example Usage

Decision tree (J48)


Simple CLI

Example command and result

java weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -t "C:\Program Files\Weka-3-6\data\iris.arff"

• You can easily build a command-line script for various experiments
• Refer to Ch. 1 of WekaManual-3-*-*.pdf for further information
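One way to script such experiments is to generate one command per parameter combination. The sketch below only builds and prints the commands; it reuses the `-L`, `-M`, `-N`, `-H`, and `-t` options from the example command above, while the function name, the parameter grid, and the relative path `iris.arff` are assumptions made for this example.

```python
import itertools
import shlex

WEKA_CLASS = "weka.classifiers.functions.MultilayerPerceptron"


def build_commands(arff_path, learning_rates, momenta, epochs=500, hidden="a"):
    """Build one argument list per (learning rate, momentum) combination."""
    commands = []
    for rate, momentum in itertools.product(learning_rates, momenta):
        commands.append([
            "java", WEKA_CLASS,
            "-L", str(rate),       # learning rate
            "-M", str(momentum),   # momentum
            "-N", str(epochs),     # training time (epochs)
            "-H", hidden,          # hidden layer specification
            "-t", arff_path,       # training file
        ])
    return commands


# Hypothetical grid; adjust the values and the data path to your setup.
commands = build_commands("iris.arff", [0.1, 0.3], [0.2, 0.5])
for args in commands:
    print(shlex.join(args))
```

Each argument list could be passed to `subprocess.run` on a machine where `java` can find the Weka classes, or the printed lines could be saved as a shell script.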


Other Open Source ML Software

• KNIME (Konstanz Information Miner): http://www.knime.org/
• RapidMiner (with Weka as its core): http://rapid-i.com/index.php?lang=en
• TANAGRA: http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html


General Information on Weka

Current versions (as of 2012-02-03)
• Stable version: 3.6.6
• Developer version: 3.7.5


References

• Weka Wiki: http://weka.wikispaces.com/ (the Primer is a good starting point)
• Weka online documentation: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
• Textbook: Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques (Third Edition), Morgan Kaufmann, Jan. 2011.
• Articles
  - Data mining with WEKA, Part 1, Part 2, Part 3, in the IBM Technical Library
  - Building a prediction program with Weka, serialized in the monthly magazine 월간 마소 (July, August, and September 2009 issues); blog, MS Live
