Deriving Knowledge from Data at Scale

Barga Data Science lecture 8

Page 2: Barga Data Science lecture 8

Lecture 8 Agenda

Opening Discussion (45 min)
• Course Project Check In
• Thought Exercise

Data Transformation (60 min)
• Attribute Selection
• SVM Considerations

Ensembling: review and deeper dive (60 min)

Page 3: Barga Data Science lecture 8

Being proficient at data science requires intuition and judgement.

Page 4: Barga Data Science lecture 8


Intuition and judgement come with experience

Page 5: Barga Data Science lecture 8


There is no compression algorithm for experience…

Page 6: Barga Data Science lecture 8


But you can hack this…

Page 7: Barga Data Science lecture 8

Three Steps (every 3–4 months)

1. Become proficient in using one tool;
2. Select one algorithm for a deep dive;
3. Focus on one data type;

Hands-on practice…


Page 9: Barga Data Science lecture 8


What tools to use?


Page 13: Barga Data Science lecture 8


• Weka – explorer…

• KNIME – experimentation…

Get proficient in at least two (2) tools…


Page 16: Barga Data Science lecture 8


http://www.slideshare.net/DataRobot/final-10-r-xc-36610234


Page 21: Barga Data Science lecture 8


This is how you learn…


Page 23: Barga Data Science lecture 8


Performing Experiments


Page 38: Barga Data Science lecture 8


Copy on Catalyst…

Page 39: Barga Data Science lecture 8


Course Project


Page 41: Barga Data Science lecture 8

What is the data telling you?


Page 51: Barga Data Science lecture 8

Attribute Selection (feature selection)

Page 52: Barga Data Science lecture 8


Problem: Where to focus attention?

Page 53: Barga Data Science lecture 8

What is Evaluated?

Evaluation Method     Attributes    Subsets of Attributes
Independent           Filters       Filters
Learning Algorithm    –             Wrappers


Page 56: Barga Data Science lecture 8


Tab for selecting attributes in a data set…

Page 57: Barga Data Science lecture 8


Interface for classes that evaluate attributes…

Interface for ranking or searching for a subset of attributes…

Page 58: Barga Data Science lecture 8

Select CorrelationAttributeEval for Pearson correlation…

False: doesn’t return R scores;
True: returns R scores.

Page 59: Barga Data Science lecture 8

Ranker: ranks attributes by their individual evaluations; used in conjunction with GainRatio, Entropy, Pearson, etc…

Number of attributes to return: -1 returns all ranked attributes;
Attributes to ignore (skip) in the evaluation, format: [1, 3-5, 10];
Cutoff at which attributes can be discarded: -1 means no cutoff.
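For reference, the same ranking can be scripted against the Weka Java API. A minimal sketch, assuming a placeholder ARFF file whose last attribute is the class:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CorrelationAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
  public static void main(String[] args) throws Exception {
    // load data; assume the class is the last attribute
    Instances data = DataSource.read("train.arff");
    data.setClassIndex(data.numAttributes() - 1);

    AttributeSelection sel = new AttributeSelection();
    sel.setEvaluator(new CorrelationAttributeEval()); // Pearson correlation per attribute
    Ranker ranker = new Ranker();
    ranker.setNumToSelect(-1);   // -1 = return all ranked attributes
    sel.setSearch(ranker);
    sel.SelectAttributes(data);  // note the capital S (Weka naming quirk)
    System.out.println(sel.toResultsString());
  }
}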

Page 60: Barga Data Science lecture 8

What is Evaluated?

Evaluation Method     Attributes    Subsets of Attributes
Independent           Filters       Filters
Learning Algorithm    –             Wrappers


Page 66: Barga Data Science lecture 8

CfsSubsetEval

True: adds features that are correlated with the class and NOT intercorrelated with features already in the selection.
False: eliminates redundant features.

Precompute the correlation matrix in advance (useful for fast backtracking), or compute it lazily. When given a large number of attributes, compute lazily…
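The same evaluator can be driven from the Weka Java API; a minimal sketch, assuming a placeholder ARFF file with the class attribute last (BestFirst, the default search, is described a few slides on):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CfsDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("train.arff");   // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    AttributeSelection sel = new AttributeSelection();
    sel.setEvaluator(new CfsSubsetEval());            // correlation-based subset merit
    sel.setSearch(new BestFirst());                   // greedy hill-climbing + backtracking
    sel.SelectAttributes(data);
    System.out.println("Selected attribute indices (0-based): "
        + java.util.Arrays.toString(sel.selectedAttributes()));
  }
}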

Page 67: Barga Data Science lecture 8

What is Evaluated?

Evaluation Method     Attributes    Subsets of Attributes
Independent           Filters       Filters
Learning Algorithm    –             Wrappers

Page 68: Barga Data Science lecture 8

The wrapper loop:

1. Select a subset of attributes
2. Induce the learning algorithm on this subset
3. Evaluate the resulting model (e.g., accuracy)
4. Stop? If no, go back to step 1; if yes, output the subset


Page 72: Barga Data Science lecture 8


Interface for classes that evaluate attributes…

Interface for ranking or searching for a subset of attributes…

WrapperSubsetEval

Page 73: Barga Data Science lecture 8

Select and configure the ML algorithm…

Evaluation measure: Accuracy (default for discrete classes), RMSE (default for numeric), AUC, AUPRC, F-measure (discrete classes)

Number of folds to use to estimate subset accuracy

Page 74: Barga Data Science lecture 8

Search Method

BestFirst: the default search method; it searches the space of descriptor subsets by greedy hill-climbing augmented with a backtracking facility. BestFirst may start with the empty set of descriptors and search forward (default behavior), start with the full set of attributes and search backward, or start at any point and search in both directions (considering all single-descriptor additions and deletions at a given point).

Other options include:
• GreedyStepwise;
• EvolutionarySearch;
• ExhaustiveSearch;
• LinearForwardSearch;
• GeneticSearch (could take hours)
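Putting WrapperSubsetEval and BestFirst together in code; a minimal sketch, assuming a placeholder ARFF file with the class attribute last and J48 as an arbitrary example classifier:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WrapperDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("train.arff");  // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    WrapperSubsetEval wrapper = new WrapperSubsetEval();
    wrapper.setClassifier(new J48());   // the learner being wrapped
    wrapper.setFolds(5);                // folds used to estimate subset accuracy

    AttributeSelection sel = new AttributeSelection();
    sel.setEvaluator(wrapper);
    sel.setSearch(new BestFirst());     // default: forward search from the empty set
    sel.SelectAttributes(data);
    System.out.println(sel.toResultsString());
  }
}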


Page 83: Barga Data Science lecture 8

• Feature ranking
• Forward feature selection
• Backward feature elimination


Page 86: Barga Data Science lecture 8


10 Minute Break…

Page 87: Barga Data Science lecture 8


Practical SVM

Page 88: Barga Data Science lecture 8

Goal: to find the discriminator that maximizes the margin


Page 94: Barga Data Science lecture 8

SMO and its complexity parameter ("-C")

• Load your dataset in the Explorer
• Choose weka.classifiers.meta.CVParameterSelection as classifier
• Select weka.classifiers.functions.SMO as base classifier within CVParameterSelection and modify its setup if necessary, e.g., RBF kernel
• Open the ArrayEditor for CVParameters and enter the following string (and click on Add):

C 2 8 4

This will test the complexity parameters 2, 4, 6 and 8 (= 4 steps)

• Close the dialogs and start the classifier
• You will get output reporting the best parameter value found
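The same sweep can be run from the Weka Java API; a minimal sketch with a placeholder ARFF file, class attribute last:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.classifiers.meta.CVParameterSelection;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TuneSMO {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("train.arff");  // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    SMO smo = new SMO();
    smo.setKernel(new RBFKernel());      // optional: RBF kernel, as on the slide

    CVParameterSelection ps = new CVParameterSelection();
    ps.setClassifier(smo);
    ps.addCVParameter("C 2 8 4");        // try C = 2, 4, 6, 8 (4 steps)
    ps.buildClassifier(data);            // picks the best C by cross-validation
    System.out.println(ps);              // reports the best setting found

    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(ps, data, 10, new Random(1));
    System.out.println(eval.toSummaryString());
  }
}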

Page 95: Barga Data Science lecture 8

LibSVM

• Load your dataset in the Explorer
• Choose weka.classifiers.meta.CVParameterSelection as classifier
• Select weka.classifiers.functions.LibSVM as base classifier within CVParameterSelection and modify its setup if necessary, e.g., RBF kernel
• Open the ArrayEditor for CVParameters and enter the following string (and click on Add):

G 0.01 0.1 10

This will test the gamma parameter from 0.01 to 0.1 in 10 steps

Page 96: Barga Data Science lecture 8

GridSearch

weka.classifiers.meta.GridSearch is a meta-classifier for exploring 2 parameters, hence the "grid" in the name. Instead of just using a classifier, one can specify a base classifier and a filter, both of which can be optimized (one parameter each).

For each of the two axes, X and Y, one can specify the following parameters:
• min, the minimum value to start from
• max, the maximum value
• step, the step size used to get from min to max

GridSearch can also optimize based on the following measures:
• Correlation coefficient (= CC)
• Root mean squared error (= RMSE)
• Root relative squared error (= RRSE)
• Mean absolute error (= MAE)
• Relative absolute error (= RAE)
• Combined: (1-abs(CC)) + RRSE + RAE
• Accuracy (= ACC)
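A sketch of driving GridSearch from the Java API. The axis setters and property paths below follow the conventions in Weka's GridSearch documentation ("classifier.c" should resolve to SMO's complexity constant, "classifier.kernel.gamma" to the RBF kernel's gamma), but verify them against your Weka version:

import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.classifiers.meta.GridSearch;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GridDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("train.arff");  // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    SMO smo = new SMO();
    smo.setKernel(new RBFKernel());

    GridSearch grid = new GridSearch();
    grid.setClassifier(smo);
    // X axis: SMO's complexity constant C, linear steps (expression "I")
    grid.setXProperty("classifier.c");
    grid.setXExpression("I");
    grid.setXMin(1);    grid.setXMax(16);  grid.setXStep(1);
    // Y axis: the RBF kernel's gamma
    grid.setYProperty("classifier.kernel.gamma");
    grid.setYExpression("I");
    grid.setYMin(0.01); grid.setYMax(0.1); grid.setYStep(0.01);

    grid.buildClassifier(data);  // evaluates each (C, gamma) pair internally
    System.out.println(grid);
  }
}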

Page 97: Barga Data Science lecture 8

Missing Values (revisited)

Page 98: Barga Data Science lecture 8

[Diagram: a data matrix of instances × attributes with missing entries marked "?"]

Page 99: Barga Data Science lecture 8

Missing values – in the UCI machine learning repository, 31 of 68 data sets are reported to have missing values. "Missing" can mean many things…

MAR, "Missing at Random":
– usually the best case
– usually not true

Non-randomly missing

Presumed normal, so not measured

Causally missing
– the attribute value is missing because of other attribute values (or because of the outcome value!)


Page 101: Barga Data Science lecture 8

[Slide: with 30% missing values, accuracy under different handling strategies: 88%, 93%, 95%]


Page 103: Barga Data Science lecture 8

[Scatter plot of X vs. Y: for a data point with missing Y, filling in a value along the observed trend makes sense!]

Page 104: Barga Data Science lecture 8


Imputation with k-Nearest Neighbor

Page 105: Barga Data Science lecture 8


K-means Clustering Imputation

Page 106: Barga Data Science lecture 8


Imputation via Regression/Classification

Page 107: Barga Data Science lecture 8

Fills in the missing values for an instance with the expected values
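Weka's simplest built-in version of this idea is the ReplaceMissingValues filter, which substitutes each attribute's mean (numeric) or mode (nominal) computed from the training data; a minimal sketch with a placeholder file name:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class ImputeDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("train.arff");  // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    // replace every missing entry with the attribute's mean or mode
    ReplaceMissingValues rmv = new ReplaceMissingValues();
    rmv.setInputFormat(data);
    Instances imputed = Filter.useFilter(data, rmv);
    System.out.println("Instances after imputation: " + imputed.numInstances());
  }
}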

Page 108: Barga Data Science lecture 8


This can make a difference!

Page 109: Barga Data Science lecture 8

Ensemble Learning: Review & Out-of-Class Exercises

Page 110: Barga Data Science lecture 8

Step 1: Create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D
Step 2: Build multiple classifiers C1, C2, …, Ct-1, Ct
Step 3: Combine the classifiers into C*

Page 111: Barga Data Science lecture 8

Why does it work?

Suppose there are 25 independent base classifiers, each with error rate $\varepsilon = 0.35$. The majority-vote ensemble errs only when at least 13 of the 25 are wrong:

$\sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06$
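A quick numeric check of that sum (plain Java, no Weka required):

public class EnsembleError {
  public static void main(String[] args) {
    double eps = 0.35;  // error rate of each base classifier
    int n = 25;         // number of base classifiers
    double p = 0.0;
    for (int i = 13; i <= n; i++) {  // majority of 25 wrong means i >= 13
      p += binom(n, i) * Math.pow(eps, i) * Math.pow(1 - eps, n - i);
    }
    System.out.printf("P(ensemble error) = %.3f%n", p);  // prints ~0.060
  }

  // binomial coefficient "n choose k", computed multiplicatively
  static double binom(int n, int k) {
    double c = 1.0;
    for (int i = 1; i <= k; i++) c = c * (n - k + i) / i;
    return c;
  }
}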

Page 112: Barga Data Science lecture 8

Ensemble vs. Base Classifier Error

As long as the base classifier is better than random (error < 0.5), the ensemble will be superior to the base classifier.

Page 113: Barga Data Science lecture 8

Meta-learners (each wraps a base learner):

• Bagging
• Boosting
• DECORATE

Page 114: Barga Data Science lecture 8

[Diagram: the training set (compounds/descriptor matrix, C1…Cn × D1…Dm) is perturbed into Matrix 1, Matrix 2, Matrix 3; a learning algorithm builds models M1, M2, …, Me from the perturbed sets; the ensemble combines them into a consensus model]

Page 115: Barga Data Science lecture 8

Mixture of Experts

$y = \sum_{j=1}^{L} w_j d_j$

(the output is a weighted combination of the L experts' outputs $d_j$, with gating weights $w_j$)

Page 116: Barga Data Science lecture 8


Stacking

Page 117: Barga Data Science lecture 8


Cascading

Page 118: Barga Data Science lecture 8

Bagging

Bagging = Bootstrap Aggregation

Leo Breiman (1928-2005)
Leo Breiman (1996). Bagging predictors. Machine Learning. 24(2):123-140.

Page 119: Barga Data Science lecture 8

Bootstrap: a sample Si is drawn from the training set S (compounds C1…Cn, descriptors D1…Dm), e.g., Si = {C3, C2, C2, C4, C4}:

• All compounds have the same probability of being selected
• Each compound can be selected several times or even not selected at all (i.e., compounds are sampled randomly with replacement)

Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall.

Page 120: Barga Data Science lecture 8

Bagging

Data ID (original data):  1  2  3  4  5  6  7  8  9  10
Bagging (Round 1):        7  8  10 8  2  5  10 10 5  9
Bagging (Round 2):        1  4  9  1  2  3  2  7  3  2
Bagging (Round 3):        1  8  5  10 5  5  9  6  3  7

Page 121: Barga Data Science lecture 8

The 0.632 bootstrap

• In a single draw, a particular training instance has a probability of 1 − 1/n of not being picked
• Thus its probability of ending up in the test data (never selected in n draws) is:

$(1 - \tfrac{1}{n})^{n} \approx e^{-1} \approx 0.368$

This means the training data will contain approximately 63.2% of the (distinct) instances

Page 122: Barga Data Science lecture 8

Bagging

[Diagram: bootstrap samples S1, S2, …, Se with perturbed sets of compounds (e.g., {C4, C2, C8, C2, …}) are drawn from the training set (C1…Cn); a learning algorithm builds models M1, M2, …, Me; the ensemble combines them into a consensus model by voting (classification) or averaging (regression)]

Page 123: Barga Data Science lecture 8

Classification - Files

• train-ache.sdf / test-ache.sdf
• train-ache-t3ABl2u3.arff / test-ache-t3ABl2u3.arff
• ache-t3ABl2u3.hdr

Page 124: Barga Data Science lecture 8


Exercise 1

Development of one individual rules-based model (JRip method in WEKA)

Page 125: Barga Data Science lecture 8


Exercise 1

Load train-ache-t3ABl2u3.arff

Page 126: Barga Data Science lecture 8


Load test-ache-t3ABl2u3.arff

Page 127: Barga Data Science lecture 8

Set up one JRip model
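Exercise 1 can equally be scripted; a minimal sketch, assuming the class attribute is last in both ARFF files:

import weka.classifiers.Evaluation;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Exercise1 {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("train-ache-t3ABl2u3.arff");
    Instances test  = DataSource.read("test-ache-t3ABl2u3.arff");
    train.setClassIndex(train.numAttributes() - 1);  // assumes class is last
    test.setClassIndex(test.numAttributes() - 1);

    JRip jrip = new JRip();    // RIPPER rule learner, default settings
    jrip.buildClassifier(train);
    System.out.println(jrip);  // prints the induced rules

    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(jrip, test);
    System.out.println(eval.toSummaryString());
    System.out.println("ROC AUC: " + eval.areaUnderROC(0));
  }
}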

Page 128: Barga Data Science lecture 8


187. (C*C),(C*C*C),(C*C-C),(C*N),(C*N*C),(C-C),(C-C-C),xC*

81. (C-N),(C-N-C),(C-N-C),(C-N-C),xC

12. (C*C),(C*C),(C*C*C),(C*C*C),(C*C*N),xC

Page 129: Barga Data Science lecture 8

What happens if we randomize the data and rebuild a JRip model?

Page 130: Barga Data Science lecture 8

Changing the data ordering changes the induced rules

Page 131: Barga Data Science lecture 8

Exercise 3a: Bagging

Reinitialize the dataset. In the classifier tab, choose the meta classifier Bagging.

Page 132: Barga Data Science lecture 8


Set the base classifier as JRip

Build an ensemble of 1 model
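A scripted version of Exercise 3a; a minimal sketch that loops over ensemble sizes 1, 3 and 8 (matching the output files on the next slide), assuming the class attribute is last:

import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaggingJRip {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("train-ache-t3ABl2u3.arff");
    Instances test  = DataSource.read("test-ache-t3ABl2u3.arff");
    train.setClassIndex(train.numAttributes() - 1);
    test.setClassIndex(test.numAttributes() - 1);

    for (int size : new int[] {1, 3, 8}) {
      Bagging bag = new Bagging();
      bag.setClassifier(new JRip());
      bag.setNumIterations(size);   // number of bagged models
      bag.buildClassifier(train);

      Evaluation eval = new Evaluation(train);
      eval.evaluateModel(bag, test);
      System.out.printf("bag size %d: ROC AUC = %.3f%n", size, eval.areaUnderROC(0));
    }
  }
}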

Page 133: Barga Data Science lecture 8


JRipBag1.out

JRipBag3.out

JRipBag8.out

Page 134: Barga Data Science lecture 8

[Plot: ROC AUC of the consensus model as a function of the number of bagging iterations (AChE classification); y-axis 0.74–0.88, x-axis 0–10 iterations]

Page 135: Barga Data Science lecture 8

Boosting works by training a set of classifiers sequentially and combining them for prediction, where each later classifier focuses on the mistakes of the earlier classifiers.

Yoav Freund, Robert Schapire, Jerome Friedman

AdaBoost (classification): Yoav Freund, Robert E. Schapire: Experiments with a new boosting algorithm. In: Thirteenth International Conference on Machine Learning, San Francisco, 148-156, 1996.

Regression boosting: J.H. Friedman (1999). Stochastic Gradient Boosting. Computational Statistics and Data Analysis. 38:367-378.

Page 136: Barga Data Science lecture 8

[Diagram: boosting. The training set (C1…Cn) starts with uniform weights w; each round trains a model M1, M2, …, Mb on the weighted sample, computes its error e, and reweights the compounds so the next model focuses on earlier mistakes; predictions are combined by weighted averaging & thresholding]

Page 137: Barga Data Science lecture 8


Load train-ache-t3ABl2u3.arff

In classification tab, load test-ache-t3ABl2u3.arff

Page 138: Barga Data Science lecture 8

In the classifier tab, choose the meta classifier AdaBoostM1

Set up an ensemble of one JRip model
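The boosting runs can be scripted the same way; a minimal sketch (1, 3 and 8 rounds, matching the output files below), assuming the class attribute is last:

import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostJRip {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("train-ache-t3ABl2u3.arff");
    Instances test  = DataSource.read("test-ache-t3ABl2u3.arff");
    train.setClassIndex(train.numAttributes() - 1);
    test.setClassIndex(test.numAttributes() - 1);

    for (int rounds : new int[] {1, 3, 8}) {
      AdaBoostM1 boost = new AdaBoostM1();
      boost.setClassifier(new JRip());
      boost.setNumIterations(rounds);  // number of boosting rounds
      boost.buildClassifier(train);

      Evaluation eval = new Evaluation(train);
      eval.evaluateModel(boost, test);
      System.out.printf("rounds %d: ROC AUC = %.3f%n", rounds, eval.areaUnderROC(0));
    }
  }
}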

Page 139: Barga Data Science lecture 8

JRipBoost1.out
JRipBoost3.out
JRipBoost8.out

Page 140: Barga Data Science lecture 8

[Plot: ROC AUC as a function of log(number of boosting iterations) (AChE classification); y-axis 0.76–0.83]

Page 141: Barga Data Science lecture 8

[Plots: Bagging vs. Boosting as the number of models grows (log scale); base learner DecisionStump (1–1000 models) and base learner JRip (1–100 models)]

Page 142: Barga Data Science lecture 8

Conjecture: Bagging vs Boosting

Bagging leverages unstable base learners that are weak because of overfitting (JRip, MLR).

Boosting leverages stable base learners that are weak because of underfitting (DecisionStump, SLR).

Page 143: Barga Data Science lecture 8

Random Subspace Method

Tin Kam Ho

Tin Kam Ho (1998). The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence. 20(8):832-844.

Page 144: Barga Data Science lecture 8

• All descriptors have the same probability of being selected
• Each descriptor can be selected only once
• Only a certain portion of the descriptors is selected in each run

[Diagram: training set with the initial pool of descriptors D1, D2, D3, D4, …, Dm (compounds C1…Cn) → training set with randomly selected descriptors, e.g., D3, D2, Dm, D4]

Page 145: Barga Data Science lecture 8

Random Subspace Method

[Diagram: data sets S1, S2, …, Se with randomly selected descriptors are drawn from the training set (D1…Dm); a learning algorithm builds models M1, M2, …, Me; the ensemble combines them into a consensus model by voting (classification) or averaging (regression)]

Page 146: Barga Data Science lecture 8


Load train-logs-t1ABl2u4.arff

In classification tab, load test-logs-t1ABl2u4.arff

Page 147: Barga Data Science lecture 8

Choose the meta method RandomSubSpace.

Page 148: Barga Data Science lecture 8

Base classifier: Multi-Linear Regression without descriptor selection

Build an ensemble of 1 model… then build an ensemble of 10 models.
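A scripted version of this exercise; a minimal sketch, assuming the class attribute is last (SELECTION_NONE turns off MLR's built-in descriptor selection):

import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.meta.RandomSubSpace;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class SubspaceMLR {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("train-logs-t1ABl2u4.arff");
    Instances test  = DataSource.read("test-logs-t1ABl2u4.arff");
    train.setClassIndex(train.numAttributes() - 1);
    test.setClassIndex(test.numAttributes() - 1);

    LinearRegression mlr = new LinearRegression();
    // "without descriptor selection": disable MLR's internal attribute selection
    mlr.setAttributeSelectionMethod(new SelectedTag(
        LinearRegression.SELECTION_NONE, LinearRegression.TAGS_SELECTION));

    for (int size : new int[] {1, 10}) {   // ensemble of 1 model, then 10 models
      RandomSubSpace rss = new RandomSubSpace();
      rss.setClassifier(mlr);
      rss.setNumIterations(size);
      rss.buildClassifier(train);

      Evaluation eval = new Evaluation(train);
      eval.evaluateModel(rss, test);
      System.out.printf("%d model(s): r = %.4f, RMSE = %.4f%n",
          size, eval.correlationCoefficient(), eval.rootMeanSquaredError());
    }
  }
}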

Page 149: Barga Data Science lecture 8

Results: 1 model vs. 10 models


Page 151: Barga Data Science lecture 8

Random Forest

Random Forest = Bagging + Random Subspace (applied to random trees)

Leo Breiman (1928-2005)
Leo Breiman (2001). Random Forests. Machine Learning. 45(1):5-32.
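A minimal Random Forest sketch via the Weka API; the -I option sets the number of trees and -K 0 keeps the default number of random descriptors per split (option names may vary slightly across Weka versions), reusing the AChE training file as an example:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class ForestDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("train-ache-t3ABl2u3.arff");
    data.setClassIndex(data.numAttributes() - 1);

    RandomForest rf = new RandomForest();
    rf.setOptions(Utils.splitOptions("-I 100 -K 0"));  // 100 random trees

    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(rf, data, 10, new Random(1));  // 10-fold CV
    System.out.println(eval.toSummaryString());
  }
}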

Page 152: Barga Data Science lecture 8

David H. Wolpert

Wolpert, D. (1992). Stacked Generalization. Neural Networks, 5(2), 241-259.
Breiman, L. (1996). Stacked Regressions. Machine Learning, 24.

Page 153: Barga Data Science lecture 8

[Diagram: stacking. The same data set S (compounds C1…Cn × descriptors D1…Dm) is given to different learning algorithms L1, L2, …, Le, producing models M1, M2, …, Me; a machine-learning meta-method (e.g., MLR) combines their predictions into the ensemble consensus model]

Page 154: Barga Data Science lecture 8

Choose the meta method Stacking

Page 155: Barga Data Science lecture 8

• Delete the classifier ZeroR
• Add the PLS classifier (default parameters)
• Add the regression tree M5P (default parameters)
• Add Multi-Linear Regression without descriptor selection

Page 156: Barga Data Science lecture 8

Select Multi-Linear Regression as the meta-method
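A scripted sketch of the stacked model. PLS is omitted here because PLSClassifier ships separately from core Weka in recent versions; MLR and M5P are stacked with MLR as the level-1 combiner, assuming the class attribute is last:

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.M5P;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class StackDemo {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("train-logs-t1ABl2u4.arff");
    Instances test  = DataSource.read("test-logs-t1ABl2u4.arff");
    train.setClassIndex(train.numAttributes() - 1);
    test.setClassIndex(test.numAttributes() - 1);

    Stacking stack = new Stacking();
    stack.setClassifiers(new Classifier[] {  // level-0 (base) models
        new LinearRegression(),              // MLR
        new M5P()                            // regression tree
    });
    stack.setMetaClassifier(new LinearRegression());  // level-1 combiner

    stack.buildClassifier(train);
    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(stack, test);
    System.out.printf("r = %.4f, RMSE = %.4f%n",
        eval.correlationCoefficient(), eval.rootMeanSquaredError());
  }
}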


Page 158: Barga Data Science lecture 8


Exercise 5

Rebuild the stacked model using:

• kNN (default parameters)

• Multi-Linear Regression without descriptor selection

• PLS classifier (default parameters)

• Regression Tree M5P


Page 160: Barga Data Science lecture 8

Exercise 5 - Stacking

Regression models for LogS:

Learning algorithm                R (correlation coefficient)    RMSE
MLR                               0.8910                         1.0068
PLS                               0.9171                         0.8518
M5P (regression trees)            0.9176                         0.8461
1-NN (one nearest neighbour)      0.8455                         1.1889
Stacking of MLR, PLS, M5P         0.9366                         0.7460
Stacking of MLR, PLS, M5P, 1-NN   0.9392                         0.7301

Page 161: Barga Data Science lecture 8

That’s all for tonight…