WEKA and Machine Learning Algorithms

Page 1:

WEKA and Machine Learning Algorithms

Page 2:

Algorithm Types

• Classification (supervised)
  – Given: a set of classified examples ("instances")
  – Produce: a way of classifying new examples
  – Instances are described by a fixed set of features ("attributes")
  – Classes may be discrete ("classification") or continuous ("regression")
  – We may be interested in the results (classifying new instances), in the model (how the decision is made), or in both

• Clustering (unsupervised)
  – There are no classes

• Association rules
  – Look for rules that relate features to other features

Page 3:

Classification

Page 4:

Clustering

Page 5:

Clustering

• It is expected that similarity among members of a cluster should be high and similarity among objects of different clusters should be low.

• The objectives of clustering:
  – knowing which data objects belong to which cluster
  – understanding the common characteristics of the members of a specific cluster

Page 6:

Clustering vs Classification

• There is some similarity between clustering and classification.

• Both classification and clustering are about assigning appropriate class or cluster labels to data records. However, clustering differs from classification in two respects:
  – First, in clustering there are no pre-defined classes: the number of classes or clusters and the class or cluster label of each data record are not known before the operation.
  – Second, clustering is about grouping data rather than developing a classification model, so there is no distinction between data records and examples; the entire data population is used as input to the clustering process.

Page 7:

Association Mining

Page 8:

Overfitting

• Memorization vs generalization

• To fix it, use:
  – Training data: to form rules
  – Validation data: to decide on the best rule
  – Test data: to determine system performance

• Cross-validation
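The splitting discipline above can be sketched in a few lines of Python; the 60/20/20 fractions and the helper name are illustrative assumptions, not WEKA defaults:

```python
import random

def three_way_split(data, train_frac=0.6, valid_frac=0.2, seed=1):
    """Shuffle the data and cut it into train/validation/test partitions.
    (Hypothetical helper for illustration; fractions are assumptions.)"""
    rng = random.Random(seed)          # fixed seed for reproducibility
    items = list(data)
    rng.shuffle(items)
    n_train = int(len(items) * train_frac)
    n_valid = int(len(items) * valid_frac)
    return (items[:n_train],                      # training data: form rules
            items[n_train:n_train + n_valid],     # validation data: pick best rule
            items[n_train + n_valid:])            # test data: final performance

train, valid, test = three_way_split(range(100))
print(len(train), len(valid), len(test))  # 60 20 20
```

Cross-validation repeats this idea by rotating which fold plays the held-out role, so every instance is used for testing exactly once.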

Page 9:

Baseline Experiments

• To evaluate the efficiency of the classifiers used in experiments, we use baselines:
  – Majority-based random classification (Kappa = 0)
  – Class-distribution-based random classification (Kappa = 0)

• The Kappa statistic is used as a measure of the improvement of a classifier's accuracy over a predictor that employs chance as its guide.

• P0 is the accuracy of the classifier and Pc is the expected accuracy achieved by a randomly guessing classifier on the same data set. The Kappa statistic ranges between −1 and 1, where −1 is total disagreement (i.e., total misclassification) and 1 is perfect agreement (i.e., a 100% accurate classification).

• A Kappa score over 0.4 indicates reasonable agreement beyond chance.

Kappa = (P0 − Pc) / (1 − Pc)
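In code, the Kappa formula is a one-liner; the accuracies below are made-up numbers for illustration:

```python
def kappa(p0, pc):
    """Cohen's kappa: improvement of observed accuracy P0 over chance accuracy Pc."""
    return (p0 - pc) / (1 - pc)

# A classifier with 80% accuracy where chance alone would score 50%:
print(kappa(0.8, 0.5))   # about 0.6, i.e. reasonable agreement beyond chance
# A random baseline matches chance exactly, giving Kappa = 0:
print(kappa(0.5, 0.5))   # 0.0
```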

Page 10:

Data Mining Process

Page 11:

WEKA: the software

• Machine learning/data mining software written in Java (distributed under the GNU General Public License)

• Used for research, education, and applications

• Complements “Data Mining” by Witten & Frank

• Main features:
  – Comprehensive set of data pre-processing tools, learning algorithms, and evaluation methods
  – Graphical user interfaces (incl. data visualization)
  – Environment for comparing learning algorithms

Page 12:

Weka’s Role in the Big Picture

Input (raw data)  ->  Data mining by Weka  ->  Output (result)

Weka covers: pre-processing, classification, regression, clustering, association rules, visualization

Page 13:

WEKA: Terminology

Some synonyms/explanations for the terms used by WEKA:

  – Attribute: feature
  – Relation: collection of examples (a dataset)
  – Instance: a single example
  – Class: category

Page 14:

@relation heart-disease-simplified

@attribute age numeric

@attribute sex { female, male}

@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}

@attribute cholesterol numeric

@attribute exercise_induced_angina { no, yes}

@attribute class { present, not_present}

@data

63,male,typ_angina,233,no,not_present

67,male,asympt,286,yes,present

67,male,asympt,229,yes,present

38,female,non_anginal,?,no,not_present

...

WEKA only deals with “flat” files
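To make the flat-file format concrete, here is a minimal Python reader for ARFF text; it is a sketch that handles only the constructs shown above (attribute declarations, a @data section, '?' for missing values) and is not WEKA's own parser:

```python
def parse_arff(text):
    """Parse a tiny subset of ARFF: attribute names plus comma-separated rows.
    '?' values become None (missing); '%' lines are comments."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):
            continue
        if line.lower().startswith('@attribute'):
            attributes.append(line.split(None, 2)[1])   # second token is the name
        elif line.lower().startswith('@data'):
            in_data = True
        elif in_data:
            rows.append([None if v.strip() == '?' else v.strip()
                         for v in line.split(',')])
    return attributes, rows

attrs, rows = parse_arff("""@relation heart-disease-simplified
@attribute age numeric
@attribute class { present, not_present}
@data
63,not_present
38,?
""")
print(attrs)   # ['age', 'class']
print(rows)    # [['63', 'not_present'], ['38', None]]
```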

Page 15:

Page 16:

Explorer: pre-processing the data

• Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary

• Data can also be read from a URL or from an SQL database (using JDBC)

• Pre-processing tools in WEKA are called “filters”

• WEKA contains filters for:
  – discretization, normalization, resampling, attribute selection, transforming and combining attributes, …
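As a picture of what one such filter computes, here is a short min-max normalization sketch in Python; it is a simplified stand-in for a normalization filter, not WEKA's implementation:

```python
def normalize(values):
    """Rescale a numeric attribute linearly into [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant attribute: nothing to spread out
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(normalize([63, 67, 67, 38]))   # ages from the ARFF sample above, rescaled
```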

Page 17:


Explorer: building “classifiers”

• Classifiers in WEKA are models for predicting nominal or numeric quantities

• Implemented learning schemes include:
  – decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes nets, …

• "Meta"-classifiers include:
  – bagging, boosting, stacking, error-correcting output codes, locally weighted learning, …

Page 18:

Classifiers - Workflow

Labeled data  ->  Learning algorithm  ->  Classifier

Unlabeled data  ->  Classifier  ->  Predictions

Page 19:

Evaluation

• Accuracy
  – Percentage of predictions that are correct
  – Problematic for some disproportional (imbalanced) data sets

• Precision
  – Percentage of positive predictions that are correct

• Recall (sensitivity)
  – Percentage of positively labeled samples predicted as positive

• Specificity
  – Percentage of negatively labeled samples predicted as negative

Page 20:

Confusion matrix

A confusion matrix contains information about the actual and the predicted classifications. All measures can be derived from it:

              predicted
               –    +
  true  –      a    b
        +      c    d

  accuracy: (a+d)/(a+b+c+d)
  recall: d/(c+d) => R
  precision: d/(b+d) => P
  F-measure: 2PR/(P+R)
  false positive (FP) rate: b/(a+b)
  true negative (TN) rate: a/(a+b)
  false negative (FN) rate: c/(c+d)
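The measures above can be packaged as a small Python helper; the counts in the example call are made-up numbers:

```python
def metrics(a, b, c, d):
    """Derive the listed measures from confusion-matrix cells:
    a = true negatives, b = false positives, c = false negatives, d = true positives."""
    precision = d / (b + d)
    recall = d / (c + d)
    return {
        'accuracy':  (a + d) / (a + b + c + d),
        'precision': precision,
        'recall':    recall,
        'f_measure': 2 * precision * recall / (precision + recall),
        'fp_rate':   b / (a + b),
        'tn_rate':   a / (a + b),
        'fn_rate':   c / (c + d),
    }

m = metrics(a=50, b=10, c=5, d=35)
print(m['accuracy'], m['precision'], m['recall'])
```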

Page 21:


Explorer: clustering data

• WEKA contains “clusterers” for finding groups of similar instances in a dataset

• Implemented schemes include:
  – k-Means, EM, Cobweb, X-means, FarthestFirst

• Clusters can be visualized and compared to “true” clusters (if given)

• Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution
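The k-Means scheme listed above can be illustrated with a bare-bones 1-D sketch in Python; this mirrors the textbook algorithm behind WEKA's SimpleKMeans rather than its actual code, and the sample points and initial centers are invented:

```python
def kmeans(points, centers, iters=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]   # leave empty clusters put
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], centers=[0.0, 5.0])
print(centers)   # two centers, near 1.0 and 9.5
```

Note how the result satisfies the earlier criterion: similarity within each cluster is high, similarity across clusters is low.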

Page 22:


Explorer: finding associations

• WEKA contains an implementation of the Apriori algorithm for learning association rules– Works only with discrete data

• Can identify statistical dependencies between groups of attributes:
  – milk, butter -> bread, eggs (with confidence 0.9 and support 2000)

• Apriori can compute all rules that have a given minimum support and exceed a given confidence
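The support and confidence numbers in the example rule are simple counts; the following Python sketch computes them over a toy list of baskets (the baskets are invented for illustration):

```python
def support_confidence(transactions, lhs, rhs):
    """For the rule lhs -> rhs: support = number of transactions containing both
    sides; confidence = that count divided by transactions containing lhs."""
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    return n_both, n_both / n_lhs

baskets = [{'milk', 'butter', 'bread', 'eggs'},
           {'milk', 'butter', 'bread'},
           {'milk', 'butter'},
           {'bread', 'eggs'}]
print(support_confidence(baskets, ['milk', 'butter'], ['bread', 'eggs']))
```

Apriori's pruning trick is that any rule below the minimum support can be discarded without counting its supersets.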

Page 23:


Explorer: attribute selection

• Panel that can be used to investigate which (subsets of) attributes are the most predictive ones

• Attribute selection methods consist of two parts:
  – A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
  – An evaluation method: correlation-based, wrapper, information gain, chi-squared, …

• Very flexible: WEKA allows (almost) arbitrary combinations of these two
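One of the evaluation methods named above, information gain, is easy to spell out in Python; this is a simplified illustration of the measure itself, not WEKA's InfoGainAttributeEval:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(feature, labels):
    """Entropy of the classes minus the entropy remaining after
    splitting the instances by the feature's values."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# A feature that separates the classes perfectly gains the full class entropy:
print(information_gain(['a', 'a', 'b', 'b'], ['yes', 'yes', 'no', 'no']))  # 1.0
```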

Page 24:


Explorer: data visualization

• Visualization is very useful in practice: e.g., it helps to determine the difficulty of the learning problem

• WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
  – To do: rotating 3-d visualizations (XGobi-style)

• Color-coded class values

• “Jitter” option to deal with nominal attributes (and to detect “hidden” data points)

• “Zoom-in” function

Page 25:


Performing experiments

• Experimenter makes it easy to compare the performance of different learning schemes

• For classification and regression problems

• Results can be written into file or database

• Evaluation options: cross-validation, learning curve, hold-out

• Can also iterate over different parameter settings

• Significance-testing built in!

Page 26:


The Knowledge Flow GUI

• New graphical user interface for WEKA

• Java-Beans-based interface for setting up and running machine learning experiments

• Data sources, classifiers, etc. are beans and can be connected graphically

• Data “flows” through components: e.g.,

“data source” -> “filter” -> “classifier” -> “evaluator”

• Layouts can be saved and loaded again later

Page 27:


Beyond the GUI

• How to reproduce experiments with the command line/API:
  – The GUI, API, and command line all rely on the same set of Java classes
  – It is generally easy to determine which classes and parameters were used in the GUI
  – Tree displays in Weka reflect its Java class hierarchy

> java -cp ~galley/weka/weka.jar weka.classifiers.trees.J48 -C 0.25 -M 2 -t <train_arff> -T <test_arff>

Page 28:


Important command-line parameters

> java -cp ~galley/weka/weka.jar weka.classifiers.<classifier_name> [classifier_options] [options]

where [options] are:

• Create/load/save a classification model:
  -t <file> : training set
  -l <file> : load model file
  -d <file> : save model file

• Testing:
  -x <N> : N-fold cross-validation
  -T <file> : test set
  -p <S> : print predictions + attribute selection S

Page 29:

Problem with Running Weka

Problem: out of memory on large data sets

Solution: give the Java VM more heap space, e.g.

> java -Xmx1000m -jar weka.jar