Deriving Knowledge from Data at Scale

Barga Data Science lecture 8

Page 2: Barga Data Science lecture 8

Lecture 8 Agenda

Opening Discussion (45 min)
• Course Project Check In
• Thought Exercise

Data Transformation (60 min)
• Attribute Selection
• SVM Considerations

Ensembling: review and deeper dive (60 min)

Page 3: Barga Data Science lecture 8

Being proficient at data science requires intuition and judgement.

Page 4: Barga Data Science lecture 8


Intuition and judgement come with experience

Page 5: Barga Data Science lecture 8


There is no compression algorithm for experience…

Page 6: Barga Data Science lecture 8


But you can hack this…

Page 7: Barga Data Science lecture 8

Three Steps (every 3–4 months)

1. Become proficient in using one tool;
2. Select one algorithm for a deep dive;
3. Focus on one data type;

Hands-on practice…


Page 9: Barga Data Science lecture 8


What tools to use?


Page 13: Barga Data Science lecture 8


• Weka – explorer…

• KNIME – experimentation…

Get proficient in at least two (2) tools…


Page 16: Barga Data Science lecture 8


http://www.slideshare.net/DataRobot/final-10-r-xc-36610234


Page 21: Barga Data Science lecture 8


This is how you learn…


Page 23: Barga Data Science lecture 8


Performing Experiments


Page 38: Barga Data Science lecture 8


Copy on Catalyst…

Page 39: Barga Data Science lecture 8


Course Project


Page 41: Barga Data Science lecture 8

What is the data telling you?


Page 51: Barga Data Science lecture 8

Attribute Selection (feature selection)

Page 52: Barga Data Science lecture 8


Problem: Where to focus attention?

Page 53: Barga Data Science lecture 8

What is Evaluated?

Evaluation Method     Attributes    Subsets of Attributes
Independent           Filters       Filters
Learning Algorithm    –             Wrappers


Page 56: Barga Data Science lecture 8


Tab for selecting attributes in a data set…

Page 57: Barga Data Science lecture 8


Interface for classes that evaluate attributes…

Interface for ranking or searching for a subset of attributes…

Page 58: Barga Data Science lecture 8

Select CorrelationAttributeEval for Pearson correlation…

False: doesn’t return R scores;
True: returns R scores.

Page 59: Barga Data Science lecture 8

Ranker: ranks attributes by their individual evaluations; used in conjunction with GainRatio, Entropy, Pearson, etc…

Number of attributes to return: -1 returns all ranked attributes;
Attributes to ignore (skip) in the evaluation, format: [1, 3-5, 10];
Cutoff at which attributes can be discarded: -1 means no cutoff.
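For reference, the same ranking can be scripted against the Weka Java API. A minimal sketch, assuming a placeholder ARFF file whose last attribute is the class:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CorrelationAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
  public static void main(String[] args) throws Exception {
    // load data; assume the class is the last attribute
    Instances data = DataSource.read("train.arff");
    data.setClassIndex(data.numAttributes() - 1);

    AttributeSelection sel = new AttributeSelection();
    sel.setEvaluator(new CorrelationAttributeEval()); // Pearson correlation per attribute
    Ranker ranker = new Ranker();
    ranker.setNumToSelect(-1);   // -1 = return all ranked attributes
    sel.setSearch(ranker);
    sel.SelectAttributes(data);  // note the capital S (Weka naming quirk)
    System.out.println(sel.toResultsString());
  }
}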

Page 60: Barga Data Science lecture 8

What is Evaluated?

Evaluation Method     Attributes    Subsets of Attributes
Independent           Filters       Filters
Learning Algorithm    –             Wrappers


Page 66: Barga Data Science lecture 8

CfsSubsetEval

True: adds features that are correlated with the class and NOT intercorrelated with features already in the selection.
False: eliminates redundant features.

Precompute the correlation matrix in advance (useful for fast backtracking), or compute it lazily. When given a large number of attributes, compute lazily…
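The same evaluator can be driven from the Weka Java API; a minimal sketch, assuming a placeholder ARFF file with the class attribute last (BestFirst, the default search, is described a few slides on):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CfsDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("train.arff");   // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    AttributeSelection sel = new AttributeSelection();
    sel.setEvaluator(new CfsSubsetEval());            // correlation-based subset merit
    sel.setSearch(new BestFirst());                   // greedy hill-climbing + backtracking
    sel.SelectAttributes(data);
    System.out.println("Selected attribute indices (0-based): "
        + java.util.Arrays.toString(sel.selectedAttributes()));
  }
}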

Page 67: Barga Data Science lecture 8

What is Evaluated?

Evaluation Method     Attributes    Subsets of Attributes
Independent           Filters       Filters
Learning Algorithm    –             Wrappers

Page 68: Barga Data Science lecture 8

The wrapper loop:

1. Select a subset of attributes
2. Induce the learning algorithm on this subset
3. Evaluate the resulting model (e.g., accuracy)
4. Stop? If no, go back to step 1; if yes, output the subset


Page 72: Barga Data Science lecture 8


Interface for classes that evaluate attributes…

Interface for ranking or searching for a subset of attributes…

WrapperSubsetEval

Page 73: Barga Data Science lecture 8

Select and configure the ML algorithm…

Evaluation measure: Accuracy (default for discrete classes), RMSE (default for numeric), AUC, AUPRC, F-measure (discrete classes)

Number of folds to use to estimate subset accuracy

Page 74: Barga Data Science lecture 8

Search Method

BestFirst: the default search method; it searches the space of descriptor subsets by greedy hill-climbing augmented with a backtracking facility. BestFirst may start with the empty set of descriptors and search forward (default behavior), start with the full set of attributes and search backward, or start at any point and search in both directions (considering all single-descriptor additions and deletions at a given point).

Other options include:
• GreedyStepwise;
• EvolutionarySearch;
• ExhaustiveSearch;
• LinearForwardSearch;
• GeneticSearch (could take hours)
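Putting WrapperSubsetEval and BestFirst together in code; a minimal sketch, assuming a placeholder ARFF file with the class attribute last and J48 as an arbitrary example classifier:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WrapperDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("train.arff");  // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    WrapperSubsetEval wrapper = new WrapperSubsetEval();
    wrapper.setClassifier(new J48());   // the learner being wrapped
    wrapper.setFolds(5);                // folds used to estimate subset accuracy

    AttributeSelection sel = new AttributeSelection();
    sel.setEvaluator(wrapper);
    sel.setSearch(new BestFirst());     // default: forward search from the empty set
    sel.SelectAttributes(data);
    System.out.println(sel.toResultsString());
  }
}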


Page 83: Barga Data Science lecture 8

• Feature ranking
• Forward feature selection
• Backward feature elimination


Page 86: Barga Data Science lecture 8


10 Minute Break…

Page 87: Barga Data Science lecture 8


Practical SVM

Page 88: Barga Data Science lecture 8

Goal: to find the discriminator that maximizes the margin


Page 94: Barga Data Science lecture 8

SMO and its complexity parameter ("-C")

• Load your dataset in the Explorer
• Choose weka.classifiers.meta.CVParameterSelection as classifier
• Select weka.classifiers.functions.SMO as base classifier within CVParameterSelection and modify its setup if necessary, e.g., RBF kernel
• Open the ArrayEditor for CVParameters and enter the following string (and click on Add):

C 2 8 4

This will test the complexity parameters 2, 4, 6 and 8 (= 4 steps)

• Close the dialogs and start the classifier
• You will get output reporting the best parameter value found
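The same sweep can be run from the Weka Java API; a minimal sketch with a placeholder ARFF file, class attribute last:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.classifiers.meta.CVParameterSelection;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TuneSMO {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("train.arff");  // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    SMO smo = new SMO();
    smo.setKernel(new RBFKernel());      // optional: RBF kernel, as on the slide

    CVParameterSelection ps = new CVParameterSelection();
    ps.setClassifier(smo);
    ps.addCVParameter("C 2 8 4");        // try C = 2, 4, 6, 8 (4 steps)
    ps.buildClassifier(data);            // picks the best C by cross-validation
    System.out.println(ps);              // reports the best setting found

    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(ps, data, 10, new Random(1));
    System.out.println(eval.toSummaryString());
  }
}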

Page 95: Barga Data Science lecture 8

LibSVM

• Load your dataset in the Explorer
• Choose weka.classifiers.meta.CVParameterSelection as classifier
• Select weka.classifiers.functions.LibSVM as base classifier within CVParameterSelection and modify its setup if necessary, e.g., RBF kernel
• Open the ArrayEditor for CVParameters and enter the following string (and click on Add):

G 0.01 0.1 10

This will test the gamma parameter from 0.01 to 0.1 in 10 steps

Page 96: Barga Data Science lecture 8

GridSearch

weka.classifiers.meta.GridSearch is a meta-classifier for exploring 2 parameters, hence the "grid" in the name. Instead of just using a classifier, one can specify a base classifier and a filter, both of which can be optimized (one parameter each).

For each of the two axes, X and Y, one can specify the following parameters:
• min, the minimum value to start from
• max, the maximum value
• step, the step size used to get from min to max

GridSearch can also optimize based on the following measures:
• Correlation coefficient (= CC)
• Root mean squared error (= RMSE)
• Root relative squared error (= RRSE)
• Mean absolute error (= MAE)
• Relative absolute error (= RAE)
• Combined: (1-abs(CC)) + RRSE + RAE
• Accuracy (= ACC)
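A sketch of driving GridSearch from the Java API. The axis setters and property paths below follow the conventions in Weka's GridSearch documentation ("classifier.c" should resolve to SMO's complexity constant, "classifier.kernel.gamma" to the RBF kernel's gamma), but verify them against your Weka version:

import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.classifiers.meta.GridSearch;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GridDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("train.arff");  // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    SMO smo = new SMO();
    smo.setKernel(new RBFKernel());

    GridSearch grid = new GridSearch();
    grid.setClassifier(smo);
    // X axis: SMO's complexity constant C, linear steps (expression "I")
    grid.setXProperty("classifier.c");
    grid.setXExpression("I");
    grid.setXMin(1);    grid.setXMax(16);  grid.setXStep(1);
    // Y axis: the RBF kernel's gamma
    grid.setYProperty("classifier.kernel.gamma");
    grid.setYExpression("I");
    grid.setYMin(0.01); grid.setYMax(0.1); grid.setYStep(0.01);

    grid.buildClassifier(data);  // evaluates each (C, gamma) pair internally
    System.out.println(grid);
  }
}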

Page 97: Barga Data Science lecture 8

Missing Values (revisited)

Page 98: Barga Data Science lecture 8

[Diagram: a data matrix of instances × attributes with missing entries marked "?"]

Page 99: Barga Data Science lecture 8

Missing values – in the UCI machine learning repository, 31 of 68 data sets are reported to have missing values. "Missing" can mean many things…

MAR, "Missing at Random":
– usually the best case
– usually not true

Non-randomly missing

Presumed normal, so not measured

Causally missing
– the attribute value is missing because of other attribute values (or because of the outcome value!)


Page 101: Barga Data Science lecture 8

[Slide: with 30% missing values, accuracy under different handling strategies: 88%, 93%, 95%]


Page 103: Barga Data Science lecture 8

[Scatter plot of X vs. Y: for a data point with missing Y, filling in a value along the observed trend makes sense!]

Page 104: Barga Data Science lecture 8


Imputation with k-Nearest Neighbor

Page 105: Barga Data Science lecture 8


K-means Clustering Imputation

Page 106: Barga Data Science lecture 8


Imputation via Regression/Classification

Page 107: Barga Data Science lecture 8

Fills in the missing values for an instance with the expected values
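Weka's simplest built-in version of this idea is the ReplaceMissingValues filter, which substitutes each attribute's mean (numeric) or mode (nominal) computed from the training data; a minimal sketch with a placeholder file name:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class ImputeDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("train.arff");  // placeholder file name
    data.setClassIndex(data.numAttributes() - 1);

    // replace every missing entry with the attribute's mean or mode
    ReplaceMissingValues rmv = new ReplaceMissingValues();
    rmv.setInputFormat(data);
    Instances imputed = Filter.useFilter(data, rmv);
    System.out.println("Instances after imputation: " + imputed.numInstances());
  }
}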

Page 108: Barga Data Science lecture 8


This can make a difference!

Page 109: Barga Data Science lecture 8

Ensemble Learning: Review & Out-of-Class Exercises

Page 110: Barga Data Science lecture 8

Step 1: Create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D
Step 2: Build multiple classifiers C1, C2, …, Ct-1, Ct
Step 3: Combine the classifiers into C*

Page 111: Barga Data Science lecture 8

Why does it work?

Suppose there are 25 independent base classifiers, each with error rate $\varepsilon = 0.35$. The majority-vote ensemble errs only when at least 13 of the 25 are wrong:

$\sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06$
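A quick numeric check of that sum (plain Java, no Weka required):

public class EnsembleError {
  public static void main(String[] args) {
    double eps = 0.35;  // error rate of each base classifier
    int n = 25;         // number of base classifiers
    double p = 0.0;
    for (int i = 13; i <= n; i++) {  // majority of 25 wrong means i >= 13
      p += binom(n, i) * Math.pow(eps, i) * Math.pow(1 - eps, n - i);
    }
    System.out.printf("P(ensemble error) = %.3f%n", p);  // prints ~0.060
  }

  // binomial coefficient "n choose k", computed multiplicatively
  static double binom(int n, int k) {
    double c = 1.0;
    for (int i = 1; i <= k; i++) c = c * (n - k + i) / i;
    return c;
  }
}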

Page 112: Barga Data Science lecture 8

Ensemble vs. Base Classifier Error

As long as the base classifier is better than random (error < 0.5), the ensemble will be superior to the base classifier.

Page 113: Barga Data Science lecture 8

Meta-learners (each wraps a base learner):

• Bagging
• Boosting
• DECORATE

Page 114: Barga Data Science lecture 8

[Diagram: the training set (compounds/descriptor matrix, C1…Cn × D1…Dm) is perturbed into Matrix 1, Matrix 2, Matrix 3; a learning algorithm builds models M1, M2, …, Me from the perturbed sets; the ensemble combines them into a consensus model]

Page 115: Barga Data Science lecture 8

Mixture of Experts

$y = \sum_{j=1}^{L} w_j d_j$

(the output is a weighted combination of the L experts' outputs $d_j$, with gating weights $w_j$)

Page 116: Barga Data Science lecture 8


Stacking

Page 117: Barga Data Science lecture 8


Cascading

Page 118: Barga Data Science lecture 8

Bagging

Bagging = Bootstrap Aggregation

Leo Breiman (1928-2005)
Leo Breiman (1996). Bagging predictors. Machine Learning. 24(2):123-140.

Page 119: Barga Data Science lecture 8

Bootstrap: a sample Si is drawn from the training set S (compounds C1…Cn, descriptors D1…Dm), e.g., Si = {C3, C2, C2, C4, C4}:

• All compounds have the same probability of being selected
• Each compound can be selected several times or even not selected at all (i.e., compounds are sampled randomly with replacement)

Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall.

Page 120: Barga Data Science lecture 8

Bagging

Data ID (original data):  1  2  3  4  5  6  7  8  9  10
Bagging (Round 1):        7  8  10 8  2  5  10 10 5  9
Bagging (Round 2):        1  4  9  1  2  3  2  7  3  2
Bagging (Round 3):        1  8  5  10 5  5  9  6  3  7

Page 121: Barga Data Science lecture 8

The 0.632 bootstrap

• In a single draw, a particular training instance has a probability of 1 − 1/n of not being picked
• Thus its probability of ending up in the test data (never selected in n draws) is:

$(1 - \tfrac{1}{n})^{n} \approx e^{-1} \approx 0.368$

This means the training data will contain approximately 63.2% of the (distinct) instances

Page 122: Barga Data Science lecture 8

Bagging

[Diagram: bootstrap samples S1, S2, …, Se with perturbed sets of compounds (e.g., {C4, C2, C8, C2, …}) are drawn from the training set (C1…Cn); a learning algorithm builds models M1, M2, …, Me; the ensemble combines them into a consensus model by voting (classification) or averaging (regression)]

Page 123: Barga Data Science lecture 8

Classification - Files

• train-ache.sdf / test-ache.sdf
• train-ache-t3ABl2u3.arff / test-ache-t3ABl2u3.arff
• ache-t3ABl2u3.hdr

Page 124: Barga Data Science lecture 8


Exercise 1

Development of one individual rules-based model (JRip method in WEKA)

Page 125: Barga Data Science lecture 8


Exercise 1

Load train-ache-t3ABl2u3.arff

Page 126: Barga Data Science lecture 8


Load test-ache-t3ABl2u3.arff

Page 127: Barga Data Science lecture 8

Set up one JRip model
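Exercise 1 can equally be scripted; a minimal sketch, assuming the class attribute is last in both ARFF files:

import weka.classifiers.Evaluation;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Exercise1 {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("train-ache-t3ABl2u3.arff");
    Instances test  = DataSource.read("test-ache-t3ABl2u3.arff");
    train.setClassIndex(train.numAttributes() - 1);  // assumes class is last
    test.setClassIndex(test.numAttributes() - 1);

    JRip jrip = new JRip();    // RIPPER rule learner, default settings
    jrip.buildClassifier(train);
    System.out.println(jrip);  // prints the induced rules

    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(jrip, test);
    System.out.println(eval.toSummaryString());
    System.out.println("ROC AUC: " + eval.areaUnderROC(0));
  }
}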

Page 128: Barga Data Science lecture 8


187. (C*C),(C*C*C),(C*C-C),(C*N),(C*N*C),(C-C),(C-C-C),xC*

81. (C-N),(C-N-C),(C-N-C),(C-N-C),xC

12. (C*C),(C*C),(C*C*C),(C*C*C),(C*C*N),xC

Page 129: Barga Data Science lecture 8

What happens if we randomize the data and rebuild a JRip model?

Page 130: Barga Data Science lecture 8

Changing the data ordering changes the induced rules

Page 131: Barga Data Science lecture 8

Exercise 3a: Bagging

Reinitialize the dataset. In the classifier tab, choose the meta classifier Bagging.

Page 132: Barga Data Science lecture 8


Set the base classifier as JRip

Build an ensemble of 1 model
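A scripted version of Exercise 3a; a minimal sketch that loops over ensemble sizes 1, 3 and 8 (matching the output files on the next slide), assuming the class attribute is last:

import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaggingJRip {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("train-ache-t3ABl2u3.arff");
    Instances test  = DataSource.read("test-ache-t3ABl2u3.arff");
    train.setClassIndex(train.numAttributes() - 1);
    test.setClassIndex(test.numAttributes() - 1);

    for (int size : new int[] {1, 3, 8}) {
      Bagging bag = new Bagging();
      bag.setClassifier(new JRip());
      bag.setNumIterations(size);   // number of bagged models
      bag.buildClassifier(train);

      Evaluation eval = new Evaluation(train);
      eval.evaluateModel(bag, test);
      System.out.printf("bag size %d: ROC AUC = %.3f%n", size, eval.areaUnderROC(0));
    }
  }
}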

Page 133: Barga Data Science lecture 8


JRipBag1.out

JRipBag3.out

JRipBag8.out

Page 134: Barga Data Science lecture 8

[Plot: ROC AUC of the consensus model as a function of the number of bagging iterations (AChE classification); y-axis 0.74–0.88, x-axis 0–10 iterations]

Page 135: Barga Data Science lecture 8

Boosting works by training a set of classifiers sequentially and combining them for prediction, where each later classifier focuses on the mistakes of the earlier classifiers.

Yoav Freund, Robert Schapire, Jerome Friedman

AdaBoost (classification): Yoav Freund, Robert E. Schapire: Experiments with a new boosting algorithm. In: Thirteenth International Conference on Machine Learning, San Francisco, 148-156, 1996.

Regression boosting: J.H. Friedman (1999). Stochastic Gradient Boosting. Computational Statistics and Data Analysis. 38:367-378.

Page 136: Barga Data Science lecture 8

[Diagram: boosting. The training set (C1…Cn) starts with uniform weights w; each round trains a model M1, M2, …, Mb on the weighted sample, computes its error e, and reweights the compounds so the next model focuses on earlier mistakes; predictions are combined by weighted averaging & thresholding]

Page 137: Barga Data Science lecture 8


Load train-ache-t3ABl2u3.arff

In classification tab, load test-ache-t3ABl2u3.arff

Page 138: Barga Data Science lecture 8

In the classifier tab, choose the meta classifier AdaBoostM1

Set up an ensemble of one JRip model
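The boosting runs can be scripted the same way; a minimal sketch (1, 3 and 8 rounds, matching the output files below), assuming the class attribute is last:

import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostJRip {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("train-ache-t3ABl2u3.arff");
    Instances test  = DataSource.read("test-ache-t3ABl2u3.arff");
    train.setClassIndex(train.numAttributes() - 1);
    test.setClassIndex(test.numAttributes() - 1);

    for (int rounds : new int[] {1, 3, 8}) {
      AdaBoostM1 boost = new AdaBoostM1();
      boost.setClassifier(new JRip());
      boost.setNumIterations(rounds);  // number of boosting rounds
      boost.buildClassifier(train);

      Evaluation eval = new Evaluation(train);
      eval.evaluateModel(boost, test);
      System.out.printf("rounds %d: ROC AUC = %.3f%n", rounds, eval.areaUnderROC(0));
    }
  }
}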

Page 139: Barga Data Science lecture 8

JRipBoost1.out
JRipBoost3.out
JRipBoost8.out

Page 140: Barga Data Science lecture 8

[Plot: ROC AUC as a function of log(number of boosting iterations) (AChE classification); y-axis 0.76–0.83]

Page 141: Barga Data Science lecture 8

[Plots: Bagging vs. Boosting as the number of models grows (log scale); base learner DecisionStump (1–1000 models) and base learner JRip (1–100 models)]

Page 142: Barga Data Science lecture 8

Conjecture: Bagging vs Boosting

Bagging leverages unstable base learners that are weak because of overfitting (JRip, MLR).

Boosting leverages stable base learners that are weak because of underfitting (DecisionStump, SLR).

Page 143: Barga Data Science lecture 8

Random Subspace Method

Tin Kam Ho

Tin Kam Ho (1998). The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence. 20(8):832-844.

Page 144: Barga Data Science lecture 8

• All descriptors have the same probability of being selected
• Each descriptor can be selected only once
• Only a certain portion of the descriptors is selected in each run

[Diagram: training set with the initial pool of descriptors D1, D2, D3, D4, …, Dm (compounds C1…Cn) → training set with randomly selected descriptors, e.g., D3, D2, Dm, D4]

Page 145: Barga Data Science lecture 8

Random Subspace Method

[Diagram: data sets S1, S2, …, Se with randomly selected descriptors are drawn from the training set (D1…Dm); a learning algorithm builds models M1, M2, …, Me; the ensemble combines them into a consensus model by voting (classification) or averaging (regression)]

Page 146: Barga Data Science lecture 8


Load train-logs-t1ABl2u4.arff

In classification tab, load test-logs-t1ABl2u4.arff

Page 147: Barga Data Science lecture 8

Choose the meta method RandomSubSpace.

Page 148: Barga Data Science lecture 8

Base classifier: Multi-Linear Regression without descriptor selection

Build an ensemble of 1 model… then build an ensemble of 10 models.
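A scripted version of this exercise; a minimal sketch, assuming the class attribute is last (SELECTION_NONE turns off MLR's built-in descriptor selection):

import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.meta.RandomSubSpace;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class SubspaceMLR {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("train-logs-t1ABl2u4.arff");
    Instances test  = DataSource.read("test-logs-t1ABl2u4.arff");
    train.setClassIndex(train.numAttributes() - 1);
    test.setClassIndex(test.numAttributes() - 1);

    LinearRegression mlr = new LinearRegression();
    // "without descriptor selection": disable MLR's internal attribute selection
    mlr.setAttributeSelectionMethod(new SelectedTag(
        LinearRegression.SELECTION_NONE, LinearRegression.TAGS_SELECTION));

    for (int size : new int[] {1, 10}) {   // ensemble of 1 model, then 10 models
      RandomSubSpace rss = new RandomSubSpace();
      rss.setClassifier(mlr);
      rss.setNumIterations(size);
      rss.buildClassifier(train);

      Evaluation eval = new Evaluation(train);
      eval.evaluateModel(rss, test);
      System.out.printf("%d model(s): r = %.4f, RMSE = %.4f%n",
          size, eval.correlationCoefficient(), eval.rootMeanSquaredError());
    }
  }
}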

Page 149: Barga Data Science lecture 8

Results: 1 model vs. 10 models


Page 151: Barga Data Science lecture 8

Random Forest

Random Forest = Bagging + Random Subspace (applied to random trees)

Leo Breiman (1928-2005)
Leo Breiman (2001). Random Forests. Machine Learning. 45(1):5-32.
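A minimal Random Forest sketch via the Weka API; the -I option sets the number of trees and -K 0 keeps the default number of random descriptors per split (option names may vary slightly across Weka versions), reusing the AChE training file as an example:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class ForestDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("train-ache-t3ABl2u3.arff");
    data.setClassIndex(data.numAttributes() - 1);

    RandomForest rf = new RandomForest();
    rf.setOptions(Utils.splitOptions("-I 100 -K 0"));  // 100 random trees

    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(rf, data, 10, new Random(1));  // 10-fold CV
    System.out.println(eval.toSummaryString());
  }
}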

Page 152: Barga Data Science lecture 8

David H. Wolpert

Wolpert, D. (1992). Stacked Generalization. Neural Networks, 5(2), 241-259.
Breiman, L. (1996). Stacked Regressions. Machine Learning, 24.

Page 153: Barga Data Science lecture 8

[Diagram: stacking. The same data set S (compounds C1…Cn × descriptors D1…Dm) is given to different learning algorithms L1, L2, …, Le, producing models M1, M2, …, Me; a machine-learning meta-method (e.g., MLR) combines their predictions into the ensemble consensus model]

Page 154: Barga Data Science lecture 8

Choose the meta method Stacking

Page 155: Barga Data Science lecture 8

• Delete the classifier ZeroR
• Add the PLS classifier (default parameters)
• Add the regression tree M5P (default parameters)
• Add Multi-Linear Regression without descriptor selection

Page 156: Barga Data Science lecture 8

Select Multi-Linear Regression as the meta-method
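A scripted sketch of the stacked model. PLS is omitted here because PLSClassifier ships separately from core Weka in recent versions; MLR and M5P are stacked with MLR as the level-1 combiner, assuming the class attribute is last:

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.M5P;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class StackDemo {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("train-logs-t1ABl2u4.arff");
    Instances test  = DataSource.read("test-logs-t1ABl2u4.arff");
    train.setClassIndex(train.numAttributes() - 1);
    test.setClassIndex(test.numAttributes() - 1);

    Stacking stack = new Stacking();
    stack.setClassifiers(new Classifier[] {  // level-0 (base) models
        new LinearRegression(),              // MLR
        new M5P()                            // regression tree
    });
    stack.setMetaClassifier(new LinearRegression());  // level-1 combiner

    stack.buildClassifier(train);
    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(stack, test);
    System.out.printf("r = %.4f, RMSE = %.4f%n",
        eval.correlationCoefficient(), eval.rootMeanSquaredError());
  }
}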


Page 158: Barga Data Science lecture 8


Exercise 5

Rebuild the stacked model using:

• kNN (default parameters)

• Multi-Linear Regression without descriptor selection

• PLS classifier (default parameters)

• Regression Tree M5P


Page 160: Barga Data Science lecture 8

Exercise 5 - Stacking

Regression models for LogS:

Learning algorithm                R (correlation coefficient)    RMSE
MLR                               0.8910                         1.0068
PLS                               0.9171                         0.8518
M5P (regression trees)            0.9176                         0.8461
1-NN (one nearest neighbour)      0.8455                         1.1889
Stacking of MLR, PLS, M5P         0.9366                         0.7460
Stacking of MLR, PLS, M5P, 1-NN   0.9392                         0.7301

Page 161: Barga Data Science lecture 8

That’s all for tonight…