
Page 1

Introduction to Machine Learning & Data Mining

Jennifer Neville Purdue University

May 24, 2016

Slides: https://www.cs.purdue.edu/homes/neville/soi-dswksp-neville.pdf
Data: https://www.cs.purdue.edu/homes/neville/iris.dat

Page 2

Data mining

The process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data

(Fayyad, Piatetsky-Shapiro & Smyth 1996)

(Figure: data mining at the intersection of Databases, Artificial Intelligence, Visualization, and Statistics.)

Page 3

Example

During WWII, statistician Abraham Wald was asked to help the military decide where to add armor to their planes. Returning planes showed damage concentrated in certain areas; Wald argued for armoring the places where the returning planes were not hit, since planes hit in those spots never made it back (survivorship bias).

Page 4

The data revolution

The last 35 years of research in ML/DM have resulted in widespread adoption of predictive analytics to automate and improve decision making. As "big data" efforts increase the collection of data, so will the need for new data science methodology. Data today have more volume, velocity, variety, etc. Machine learning research develops statistical tools, models, and algorithms that address these complexities. Data mining research focuses on how to scale to massive data and how to incorporate feedback to improve accuracy while minimizing effort.

Page 5

Page 6

The data mining process

(Figure: Data → [Selection] → Target data → [Preprocessing] → Processed data → [Mining] → Patterns → [Interpretation/evaluation] → Knowledge. Machine learning drives the mining step.)

Page 7

Overview

• Task specification

• Data representation

• Knowledge representation

• Learning technique

• Search + scoring

• Prediction and/or interpretation

Page 8

Overview

• Task specification

• Data representation

• Knowledge representation

• Learning technique

• Search + scoring

• Prediction and/or interpretation

Page 9

Task specification

• Objective of the person who is analyzing the data

• Description of the characteristics of the analysis and desired result

• Examples:

• From a set of labeled examples, devise an understandable model that will accurately predict whether a stockbroker will commit fraud in the near future.

• From a set of unlabeled examples, cluster stockbrokers into a set of homogeneous groups based on their demographic information

Page 10

Exploratory data analysis

• Goal

• Interact with data without clear objective

• Techniques

• Visualization, ad hoc modeling

Page 11

Descriptive modeling

• Goal

• Summarize the data or the underlying generative process

• Techniques

• Density estimation, cluster analysis and segmentation

(Figure: relational schema of the securities domain, linking Broker (Bk) instances, with attributes On Watchlist, Is Problem, Problem In Past, Has Business, to Branch (Bn) instances, with attributes Region, Area, Layoffs; Firms with attribute Size and On Watchlist; and Disclosures with attributes Year and Type.)

Also known as: unsupervised learning

Page 12

Predictive modeling

• Goal

• Learn model to predict unknown class label values given observed attribute values

• Techniques

• Classification, regression

(Figure: classification tree for broker fraud, with splits such as Broker Age > 27, Current CoWorker Count > 8, Current Firm Avg(Size) > 12, Current Branch Mode(Location) = NY, Broker Years In Industry > 1, Disclosure Count(Yr < 1995) > 0, Past Firm Avg(Size) > 90, Current Regulator Mode(Status) = Reg, Past CoWorker Count > 35, Disclosure Count(Type=CC) > 0, Disclosure Count > 5, Past Firm Max(Size) > 1000, Current Branch Mode(Location) = AZ, Past CoWorker Count(Gender=M) > 15, and instance counts at the leaves.)

Also known as: supervised learning

Page 13

Pattern discovery

• Goal

• Detect patterns and rules that describe sets of examples

• Techniques

• Association rules, graph mining, anomaly detection

(Figure: scatter of + and − labeled examples, with a pattern circled around one local cluster of + points.)

Model: global summary of a data set. Pattern: local to a subset of the data.

Page 14

Overview

• Task specification

• Data representation

• Knowledge representation

• Learning technique

• Search + scoring

• Prediction and/or interpretation

Page 15

Data representation

• Choice of data structure for representing individual and collections of measurements

• Individual measurements: single observations (e.g., person’s date of birth, product price)

• Collections of measurements: sets of observations that describe an instance (e.g., person, product)

• Choice of representation determines applicability of algorithms and can impact modeling effectiveness

• Additional issues: data sampling, data cleaning, feature construction

Page 16

Tabular data

Fraud  Age  Degree  StartYr  Series7
+      22   Y       2005     N
-      25   N       2003     Y
-      31   Y       1995     Y
-      27   Y       1999     Y
+      24   N       2006     N
-      29   N       2003     N

N instances × p attributes

Page 17

Temporal data

(Figure: time series of counts by year, 2004–2007, for Region 1 and Region 2; y-axis 0–100.)

Page 18

Relational/structured data

659,000 brokers · 171,000 branches · 5,100 firms · 400,000 disclosures

Page 19

Overview

• Task specification

• Data representation

• Knowledge representation

• Learning technique

• Search + scoring

• Prediction and/or interpretation

Page 20

Knowledge representation

• Underlying structure of the model or patterns that we seek from the data

• Specifies the models/patterns that could be returned as the results of the data mining algorithm

• Defines space of possible models/patterns for algorithm to search over

• Examples:

• If-then rule: If short closed car then toxic chemicals

• Conditional probability distribution: P( fraud | age, degree, series7, startYr )

• Decision tree

• Regression model

• Linear equation

Generic form: y = β1·x1 + β2·x2 + ... + β0

An example for the Alzheimer's data would be: CDR = 0.12·MM1 + 0.34·SBScore + ...

Rules

• Examples:

IF (cheese) THEN chocolate [Conf=0.10, Supp=0.04]
IF (cheese & Sterno) THEN chocolate [Conf=0.95, Supp=0.01]

• These can be written as probabilities:

P(chocolate|cheese) = 0.10
P(chocolate|cheese,Sterno) = 0.95

• As rules get more specific (by including additional attributes), support decreases, but confidence may increase, as sketched below.
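To make the support and confidence arithmetic concrete, here is a minimal R sketch (not from the slides); the 0/1 transaction data frame tx and its column names are hypothetical:

# hypothetical 0/1 transaction data: one row per basket
tx <- data.frame(cheese    = c(1,1,0,1,0,1,1,0),
                 sterno    = c(0,1,0,0,0,1,0,0),
                 chocolate = c(0,1,0,0,1,1,0,0))

# support: fraction of all baskets containing antecedent AND consequent
supp <- mean(tx$cheese == 1 & tx$sterno == 1 & tx$chocolate == 1)

# confidence: estimate of P(chocolate | cheese, sterno)
conf <- sum(tx$cheese == 1 & tx$sterno == 1 & tx$chocolate == 1) /
        sum(tx$cheese == 1 & tx$sterno == 1)

Adding Sterno to the antecedent shrinks the set of matching baskets (lower support), but can raise the fraction of those baskets that also contain chocolate (higher confidence).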

Page 21

Overview

• Task specification

• Data representation

• Knowledge representation

• Learning technique

• Search + scoring

• Prediction and/or interpretation

Page 22

Learning technique

• Method to construct model or patterns from data

• Model space

• Choice of knowledge representation defines a set of possible models or patterns

• Scoring function

• Associates a numerical value (score) with each member of the set of models/patterns

• Search technique

• Defines a method for generating members of the set of models/patterns, determining their score, and identifying the ones with the “best” score

Page 23

Scoring function

• A numeric score assigned to each possible model in a search space, given a reference/input dataset

• Used to judge the quality of a particular model for the domain

• Score functions are statistics: estimates of a population parameter based on a sample of data

• Examples:

• Misclassification

• Squared error

• Likelihood

Page 24

Example learning problem

(Figure: kernel density of attribute X, density.default(x = d2[, 1]), N = 500, bandwidth = 0.6062, with − examples concentrated at low X and + examples at high X.)

Task: Devise a rule to classify items based on the attribute X

Knowledge representation: If-then rules

Example rule: If x > 25 then + else −

What is the model space? All possible thresholds

What score function? Prediction error rate

Page 25

Score function over model space

(Figure: % prediction errors vs. threshold, plotted for the full data and for a small sample, a large sample, and a biased sample.)

Search procedure? Try all thresholds, select the one with the lowest score

Note: the learning result depends on the data, as the sketch below illustrates
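A minimal R sketch of this learning procedure (not from the slides); the attribute x and labels y are hypothetical stand-ins for the data in the plot:

# hypothetical data: attribute x, class labels y
x <- c(rnorm(50, mean = 20, sd = 3), rnorm(50, mean = 30, sd = 3))
y <- rep(c("-", "+"), each = 50)

# score function: error rate of the rule "If x > t then + else -"
errRate <- function(t) mean(ifelse(x > t, "+", "-") != y)

# search: try all candidate thresholds, keep the one with the lowest score
candidates <- sort(unique(x))
bestThreshold <- candidates[which.min(sapply(candidates, errRate))]

Rerunning this on a different sample shifts bestThreshold, which is exactly the sample dependence shown in the plot.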

Page 26

Overview

• Task specification

• Data representation

• Knowledge representation

• Learning technique

• Search + Evaluation

• Prediction and/or interpretation

Page 27

Inference and interpretation

• Prediction technique

• Method to apply learned model to new data for prediction/analysis

• Only applicable for predictive and some descriptive models

• Prediction is often used during learning (i.e., search) to determine value of scoring function

• Interpretation of results

• Objective: significance measures, hypothesis tests

• Subjective: importance, interestingness, novelty

Page 28

Example application

Page 29

Real-world example:

659,000 brokers · 171,000 branches · 5,100 firms · 400,000 disclosures

Page 30

(Figure: the broker-fraud classification tree from Page 12, with splits on broker, branch, firm, and disclosure attributes and instance counts at the leaves.)

Page 31

Performance of NASD models

(Figure: comparison of brokers flagged by NASD Rules vs. Relational Models, with regions for those flagged by both and by neither.)

"One broker I was highly confident in ranking as 5…

Not only did I have the pleasure of meeting him at a shady warehouse location, I also negotiated his bar from the industry... This person actually used investors' funds

to pay for personal expenses including his trip to attend a NASD compliance conference!

…If the model predicted this person, it would be right on target."

Page 32

Data mining process

Data Mining: Classification Schemes

• General functionality

– Descriptive data mining
– Predictive data mining

• Different views, different classifications

– Kinds of data to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted

(Figure: the Knowledge Discovery in Databases process: Data → [Selection] → Target Data → [Preprocessing] → Preprocessed Data → [Data Mining] → Patterns → [Interpretation/Evaluation] → Knowledge. Adapted from U. Fayyad et al. (1996), "From Data Mining to Knowledge Discovery: An Overview," Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press.)

Applied to the NASD example:

• Use public data from NASD BrokerCheck
• Extract data about small firms in a few geographic locations
• Create class label, temporal features
• Learn decision trees, output predictions and tree structure
• Evaluate objectively on historical data, subjectively with fraud analysts

Page 33

Steps in the data mining process

Page 34

Data mining process

1. Application setup:

• Acquire relevant domain knowledge

• Assess user goals

2. Data selection

• Choose data sources

• Identify relevant attributes

• Sample data

3. Data preprocessing

• Remove noise or outliers

• Handle missing values

• Account for time or other changes

4. Data transformation

• Find useful features

• Reduce dimensionality

Page 35

Data mining process

5. Data mining:

• Choose task (e.g., classification, regression, clustering)

• Choose algorithms for learning and inference

• Set parameters

• Apply algorithms to search for patterns of interest

6. Interpretation/evaluation

• Assess accuracy of model/results

• Interpret model for end-users

• Consolidate knowledge

7. Repeat...

Page 36

Data selection

Page 37

# download data from:
# https://www.cs.purdue.edu/homes/neville/iris.dat

# read in data
d <- read.table(file='iris.dat', sep=',', header=TRUE)
summary(d)

# histogram
hist(d[,2], main='Histogram of Sepal Width', xlab='Sepal Width')

# scatterplot
plot(d[,c(1,3)], xlab='Sepal Length', ylab='Petal Length')
plot(d)

Page 38

# feature construction
SepalArea <- d[,1] * d[,2]
PetalArea <- d[,3] * d[,4]

# add new features to data frame
d2 <- cbind(d, SepalArea, PetalArea)

# boxplot
boxplot(d2$SepalLength ~ d2$Class, d2, ylab='Sepal Length')
boxplot(d2$SepalArea ~ d2$Class, d2, ylab='Sepal Area')

# correlation
cor(d2$SepalLength, d2$SepalArea)

Page 39

Data transformation

Page 40

Feature construction

• Create new attributes that can capture the important information in the data more efficiently than the original attributes

• General methodologies:

• Attribute extraction (domain specific)

• Attribute transformations, i.e., mapping data to new space (e.g., PCA)

• Attribute combinations

Page 41

Feature selection

• Select the most “relevant” subset of attributes

• May improve performance of algorithms by reducing overfitting

• Improves domain understanding

• Less resources to collect/use a smaller number of features

• Wrapper approach

• Features are selected as part of the mining algorithms

• Filter approach

• Features are selected before mining

Page 42

Wrapper approach

• Consider all subsets of features (2^p subsets if there are p features); features are selected according to a particular model score function (e.g., classification accuracy)

• Search over all subsets and find smallest set of features such that “best” model score does not significantly change

• For large p, exhaustive search is intractable so heuristic search is often used

• Examples:

• Forward, greedy selection: start with the empty set and add features one at a time until there is no more improvement (see the sketch after this list)

• Backward, greedy removal — start with full set and remove features one at a time until no further improvements

• Interleave forward and backward selection
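A sketch of the forward greedy variant above, assuming a hypothetical helper score(features) that returns, say, cross-validated accuracy for a model built on the given feature subset:

# greedy forward selection; score() is a hypothetical scoring helper
forwardSelect <- function(allFeatures, score) {
  selected <- c()
  bestScore <- -Inf
  repeat {
    remaining <- setdiff(allFeatures, selected)
    if (length(remaining) == 0) break
    trial <- sapply(remaining, function(f) score(c(selected, f)))
    if (max(trial) <= bestScore) break   # stop when no more improvement
    bestScore <- max(trial)
    selected <- c(selected, remaining[which.max(trial)])
  }
  selected
}

Backward removal is the mirror image: start from allFeatures and repeatedly drop the feature whose removal hurts the score least.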

Page 43

Filter approach

• Select “useful” features before mining, using a score that measures a feature’s utility separate from model learning

• Find features on the basis of statistical properties such as association with the class label

• Example:

• Calculate correlation of features with the target class label, order all features by their score, choose the features with the top k scores (see the sketch below)

• Other scores: Chi-square, information gain, etc.
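A small R sketch of this filter (not from the slides), assuming the d2 data frame and column names built earlier in the deck and a binary recoding of the class label:

# score each feature by |correlation| with a 0/1 class label, keep the top k
yNum <- as.numeric(d2$Class == "Iris-virginica")   # binary recoding (one-vs-rest)
feats <- c("SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
scores <- sapply(d2[, feats], function(f) abs(cor(f, yNum)))
topk <- names(sort(scores, decreasing = TRUE))[1:2]   # keep k = 2 features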

Page 44

Dimensionality reduction

• Identify and describe the “dimensions” that underlie the data

• These dimensions may be more fundamental than those directly measured, but hidden from the user

• Reduce dimensionality of modeling problem

• The benefit is simplification: it reduces the number of variables you have to deal with in modeling

• Can identify set of variables with similar behavior

Page 45

Principal component analysis (PCA)

• High-level approach, given data matrix D with p dimensions:

• Preprocess D so that the mean of each attribute is 0; call this matrix X

• Compute the p×p covariance matrix: Σ = XᵀX

• Compute eigenvectors/eigenvalues of the covariance matrix: AΣA⁻¹ = Λ, where A is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues

• Eigenvectors are the principal component vectors (p×p matrix), where each is a p×1 column vector of projection weights; a1 is the 1st principal component (eigenvector)
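These steps map onto a few lines of R (a sketch, not from the slides), using the iris matrix d loaded earlier:

# PCA via the covariance eigen-decomposition described above
X <- scale(as.matrix(d[, 1:4]), center = TRUE, scale = FALSE)  # mean-0 columns
S <- t(X) %*% X                  # p x p (unnormalized) covariance matrix
eig <- eigen(S)                  # eigenvalues and eigenvectors
A <- eig$vectors                 # columns are the principal component vectors
eig$values / sum(eig$values)     # proportion of variance per component

Up to signs and a constant normalization of S, the columns of A match the loadings that princomp reports on the later slides.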

Page 46

Applying PCA

• New data vectors are formed by projecting the data onto the first few principal components (i.e., top k eigenvectors)

x = [x1, x2, ..., xp] (original instance)
A = [a1, a2, ..., ap] (principal components)

x′1 = a1·x = Σ_{j=1..p} a1j·xj
...
x′m = am·x = Σ_{j=1..p} amj·xj, for m < p

x′ = [x′1, x′2, ..., x′m] (transformed instance)

If m = p then the data is transformed; if m < p then the transformation is lossy and dimensionality is reduced

Page 47

Applying PCA (cont.)

• Goal: Find a new (smaller) set of dimensions that captures most of the variability of the data

• Use scree plot to choose number of dimensions

• Choose m < p so projected data captures much of the variance of original data

(Figure: scree plot of variance vs. component index.)

Page 48

Example: Eigenfaces

First 40 PCA dimensions

PCA applied to images of human faces. Reduce dimensionality to set of basis images.

All other images are linear combo of these “eigenpictures”.

Used for facial recognition.

Page 49

PCA example on Iris data

> pcdat <- princomp(d[,1:4])
> summary(pcdat)
Importance of components:
                          Comp.1     Comp.2     Comp.3      Comp.4
Standard deviation     2.0485788 0.49053911 0.27928554 0.153379074
Proportion of Variance 0.9246162 0.05301557 0.01718514 0.005183085
Cumulative Proportion  0.9246162 0.97763178 0.99481691 1.000000000
> plot(pcdat)
> loadings(pcdat)

Loadings:
   Comp.1 Comp.2 Comp.3 Comp.4
V1  0.362 -0.657 -0.581  0.317
V2        -0.730  0.596 -0.324
V3  0.857  0.176        -0.480
V4  0.359         0.549  0.751

               Comp.1 Comp.2 Comp.3 Comp.4
SS loadings      1.00   1.00   1.00   1.00
Proportion Var   0.25   0.25   0.25   0.25
Cumulative Var   0.25   0.50   0.75   1.00

(Figure: plot(pcdat), the variances of the principal components.)

First component explains 92% of data variance

Choose m=1

Page 50

PCA example on Iris data

# transform data using 1st component
> pcdat$loadings[,1]
         V1          V2          V3          V4
 0.36158968 -0.08226889  0.85657211  0.35884393

pcaF <- as.matrix(d[,1:4]) %*% pcdat$loadings[,1]
d3 <- cbind(d2, pcaF)

(Projection equations as on Page 46: x′m = am·x = Σ_{j=1..p} amj·xj for m < p.)

m=1, transform data to one dimension

(Figure: boxplots comparing Iris-setosa and Iris-virginica on the original data (1st variable) and on the transformed data.)

Page 51

Machine learning

Page 52

Descriptive vs. predictive modeling

• Descriptive models summarize the data

• Provide insights into the domain

• Focus on modeling joint distribution P(X)

• May be used for classification, but prediction is not the primary goal

• Predictive models predict the value of one variable of interest given known values of other variables

• Focus on modeling the conditional distribution P( Y | X ) or on modeling the decision boundary for Y

Page 53

Learning predictive models

• Choose a data representation

• Select a knowledge representation (a “model”)

• Defines a space of possible models M={M1, M2, ..., Mk}

• Use search to identify “best” model(s)

• Search the space of models (i.e., with alternative structures and/or parameters)

• Evaluate possible models with scoring function to determine the model which best fits the data

Page 54

Knowledge representation

• Underlying structure of the model or patterns that we seek from the data

• Defines space of possible models for algorithm to search over

• Model: high-level global description of dataset

• “All models are wrong, some models are useful” G. Box and N. Draper (1987)

• Choice of model family determines space of parameters and structure

• Estimate model parameters and possibly model structure from training data

Page 55

(Figure: the broker-fraud classification tree from Page 12, with attribute splits at the internal nodes and instance counts at the leaves.)

Classification tree

Model space: all possible decision trees

Page 56

Scoring functions

• Given a model M and dataset D, we would like to “score” model M with respect to D

• Goal is to rank the models in terms of their utility (for capturing D) and choose the “best” model

• Score function can be used to search over parameters and/or model structure

• Score functions can be different for:

• Models vs. patterns

• Predictive vs. descriptive functions

• Models with varying complexity (i.e., number of parameters)

Page 57

Predictive scoring functions

• Assess the quality of predictions for a set of instances

• Measures difference between the prediction M makes for an instance i and the true class label value of i

S(M) = Σ_{i=1..Ntest} d( f(x(i); M), y(i) )

where f(x(i); M) is the predicted class label for item i, y(i) is the true class label for item i, d measures the distance between the predicted and true labels, and the sum is over test examples.
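As a sketch (not from the slides): with 0/1 misclassification as the distance d, the score is simply a count of errors; preds and truth are hypothetical vectors of predicted and true values:

# S(M) with 0/1 distance: count of misclassified test examples
misclassScore <- function(preds, truth) sum(preds != truth)

# squared-error analogue for numeric predictions
squaredErrorScore <- function(preds, truth) sum((preds - truth)^2)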

Page 58

What space are we searching?

(Figure: model score plotted over the model space; learned model ≈ (θ0 = 0.8, θ1 = 0.4).)

Alex Holehouse, Notes from Andrew Ng's Machine Learning Class, http://www.holehouse.org/mlclass/01_02_Introduction_regression_analysis_and_gr.html

Page 59

Searching over models/patterns

• Consider a space of possible models M={M1, M2, ..., Mk} with parameters θ

• Search could be over model structures or parameters, e.g.:

• Parameters: In a linear regression model, find the regression coefficients (β) that minimize squared loss on the training data

• Model structure: In a decision tree, find the tree structure that maximizes accuracy on the training data

Page 60

Decision trees

Page 61

Tree models

• Easy to understand knowledge representation

• Can handle mixed variables

• Recursive, divide and conquer learning method

• Efficient inference

Page 62

Tree learning

• Top-down recursive divide and conquer algorithm

• Start with all examples at root

• Select best attribute/feature

• Partition examples by selected attribute

• Recurse and repeat

• Other issues:

• How to construct features

• When to stop growing

• Pruning irrelevant parts of the tree

Page 63

Score each attribute split for these instances: Age, Degree, StartYr, Series7

Fraud  Age  Degree  StartYr  Series7
+      22   Y       2005     N
-      25   N       2003     Y
-      31   Y       1995     Y
-      27   Y       1999     Y
+      24   N       2006     N
-      29   N       2003     N

choose split on Series7

Series7 = Y:
Fraud  Age  Degree  StartYr  Series7
-      25   N       2003     Y
-      31   Y       1995     Y
-      27   Y       1999     Y

Series7 = N:
Fraud  Age  Degree  StartYr  Series7
+      22   Y       2005     N
+      24   N       2006     N
-      29   N       2003     N

Score each attribute split for these instances: Age, Degree, StartYr

choose split on Age>28

Age <= 28:
Fraud  Age  Degree  StartYr  Series7
+      22   Y       2005     N
+      24   N       2006     N

Age > 28:
Fraud  Age  Degree  StartYr  Series7
-      29   N       2003     N

Page 64

Tree models

• Most well-known systems

• CART: Breiman, Friedman, Olshen and Stone

• ID3, C4.5: Quinlan

• How do they differ?

• Split scoring function

• Stopping criterion

• Pruning mechanism

• Predictions in leaf nodes

Page 65

Scoring functions: Local split value

Page 66

Choosing an attribute/feature

• Idea: a good feature splits the examples into subsets that distinguish among the class labels as much as possible... ideally into pure sets of "all positive" or "all negative"

Page 67

Information gain

• How much does a feature split decrease the entropy?

Entropy(D) = −9/14 log2(9/14) − 5/14 log2(5/14) = 0.9400

(Embedded CS590D outline slide, Classification and Prediction: what is classification? what is prediction?; issues regarding classification and prediction; Bayesian classification; instance-based methods; classification by decision tree induction; classification by neural networks; classification by support vector machines (SVM); prediction.)

Training Dataset

age income student credit_rating buys_computer

<=30 high no fair no

<=30 high no excellent no

31…40 high no fair yes

>40 medium no fair yes

>40 low yes fair yes

>40 low yes excellent no

31…40 low yes excellent yes

<=30 medium no fair no

<=30 low yes fair yes

>40 medium yes fair yes

<=30 medium yes excellent yes

31…40 medium no excellent yes

31…40 high yes fair yes

>40 medium no excellent no

Page 68

Information gain

(Figure: D partitioned by Income. High: {no, no, yes, yes}; Med: {yes, no, yes, yes, yes, no}; Low: {yes, no, yes, yes}.)

Entropy(Income=high) = −2/4 log2(2/4) − 2/4 log2(2/4) = 1
Entropy(Income=med) = −4/6 log2(4/6) − 2/6 log2(2/6) = 0.9183
Entropy(Income=low) = −3/4 log2(3/4) − 1/4 log2(1/4) = 0.8113

Gain(D, Income) = 0.9400 − (4/14 [1] + 6/14 [0.9183] + 4/14 [0.8113]) = 0.029
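A short R sketch of these calculations (not from the slides), using base-2 logs to match the 0.9400 figure:

# entropy of a label vector, in bits
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# information gain from splitting labels on a discrete feature
infoGain <- function(labels, feature) {
  w <- table(feature) / length(feature)
  entropy(labels) -
    sum(sapply(names(w), function(v) w[v] * entropy(labels[feature == v])))
}

Applied to label and income vectors built from the 14-row table above, infoGain should reproduce Gain(D, Income) = 0.029.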

Page 69

Tree learning

• Top-down recursive divide and conquer algorithm

• Start with all examples at root

• Select best attribute/feature

• Partition examples by selected attribute

• Recurse and repeat

• Other issues:

• How to construct features

• When to stop growing

• Pruning irrelevant parts of the tree

Page 70

When to stop growing

• Full growth methods

• All samples at a node belong to the same class

• There are no attributes left for further splits

• There are no samples left

• What impact does this have on the quality of the learned trees?

• Trees overfit the data and accuracy decreases

• Pruning is used to avoid overfitting

Page 71

Pruning

• Postpruning

• Use a separate set of examples to evaluate the utility of pruning nodes from the tree (after the tree is fully grown); see the rpart sketch below

• Prepruning

• Apply a statistical test to decide whether to expand a node

• Use an explicit measure of complexity to penalize large trees (e.g., Minimum Description Length)
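With rpart (used on the R slides later in this deck), postpruning is typically driven by the complexity-parameter table; a sketch, assuming a classification tree dTree already fit with rpart:

# inspect cross-validated error for each candidate subtree size
printcp(dTree)

# prune back to the cp value with the lowest cross-validated error (xerror)
bestCp <- dTree$cptable[which.min(dTree$cptable[, "xerror"]), "CP"]
dTreePruned <- prune(dTree, cp = bestCp)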

Page 72

Algorithm comparison

• CART

• Evaluation criterion: Gini index
• Search algorithm: Simple to complex, hill-climbing search
• Stopping criterion: When leaves are pure
• Pruning mechanism: Cross-validation to select gini threshold

• C4.5

• Evaluation criterion: Information gain
• Search algorithm: Simple to complex, hill-climbing search
• Stopping criterion: When leaves are pure
• Pruning mechanism: Reduced-error pruning

Page 73

Learning CART decision trees in R

Page 74

# install packages
library(rpart)
install.packages("rpart.plot")
library(rpart.plot)

# learn CART tree with default parameters
dTree <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, method="class", data=d3)
summary(dTree)

# display learned tree
prp(dTree)
prp(dTree, type=4, extra=1)

# explore settings
dTree2 <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, method="class", data=d3, parms=list(split="information"))

dTree3 <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, method="class", data=d3, control=rpart.control(minsplit=5))

Page 75

Evaluation

Page 76

Empirical evaluation

• Given observed accuracy of a model on limited data, how well does this estimate generalize for additional examples?

• Given that one model outperforms another on some sample of data, how likely is it that this model is more accurate in general?

• When data are limited, what is the best way to use the data to both learn and evaluate a model?

Page 77

Evaluating classifiers

• Goal: Estimate true future error rate

• When data are limited, what is the best way to use the data to both learn and evaluate a model?

• Approach 1

• Reclassify training data to estimate error rate

Page 78

Approach 1

(Figure: learn model F(X) from the data set, then reclassify the same data set to estimate error. Score: 83%.)

Typically produces a biased estimate of future error rate -- why?

Page 79

Learning curves

• Goal: See how performance improves with additional training data

• From dataset S, where |S| = n

• For i=[10, 20, ... ,100]

• Randomly sample i% of S to construct sample S’

• Learn model on S’

• Evaluate model

• Plot training set size vs. accuracy

Page 80

(Figure: learning curve, accuracy vs. size of dataset.)

Page 81

How does performance change when measured on disjoint test set?

Page 82

(Figure: learning curve, accuracy vs. size of dataset, measured on a disjoint test set.)

Page 83

Approach 2

(Figure: partition the data set into a training set and a test set; learn model F(X) on the training set and score it on the test set. Score: 77%. The estimate will vary due to the size and makeup of the test set.)

Page 84

Evaluating classifiers (cont)

• Approach 2:

• Partition D0 into two disjoint subsets, learn model on one subset, measure error on the other subset

• Problem: this is a point estimate of the error on one subset

(Figure: error vs. training set size for Algorithm A and Algorithm B.)

Page 85

Overlapping test sets are dependent

• Repeated sampling of test sets leads to overlap (i.e., dependence) among test sets... this will result in underestimation of variance

• Standard errors will be biased if performance is estimated from overlapping test sets (Dietterich '98)

• Recommendation: Use cross-validation to eliminate dependence between test sets

Page 86

• Use k-fold cross-validation to get k estimates of error for MA and MB

• Set of errors estimated over the test set folds provides empirical estimate of sampling distribution

• Mean is estimate of expected error

Evaluating classification algorithms A and B

(Figure: 6-fold cross-validation. The dataset is split into six folds; for each fold i, the models are trained on Train_i and evaluated on Test_i, yielding AccA.1–AccA.6 and AccB.1–AccB.6 for algorithms A and B.)
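A sketch of this procedure for a single algorithm (not from the slides), using caret's createFolds to build k = 6 disjoint test folds over the d3 data from earlier:

library(caret)
library(rpart)

# 6-fold cross-validation: each fold serves once as the test set
folds <- createFolds(d3$Class, k = 6)
accs <- sapply(folds, function(testIdx) {
  fit <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth,
               method = "class", data = d3[-testIdx, ])
  preds <- predict(fit, d3[testIdx, ], type = "class")
  mean(preds == d3$Class[testIdx])
})
mean(accs)   # estimate of expected accuracy
sd(accs)     # spread across the fold estimates

Running the same loop with a second learner gives the paired per-fold estimates (AccA.i, AccB.i) pictured above.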

Page 87

# package with helper methods
install.packages("caret")
library(caret)

# partition data into 80% training and 20% test set
trainIndex <- createDataPartition(d3$SepalLength, times=1, p=0.8, list=FALSE)
dTrain <- d3[ trainIndex,]
dTest <- d3[-trainIndex,]

# learn model from training, evaluate on test
dtTrain <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, method="class", data=dTrain, control=rpart.control(minsplit=5))
testPreds <- predict(dtTrain, dTest, type = "class")

# evaluate
confusionMatrix(testPreds, dTest$Class)

Page 88

# calculate learning curve
allResultsTr <- matrix(numeric(0), 0, 2)
allResultsTe <- matrix(numeric(0), 0, 2)

trainSetSizes <- c(0.025, 0.05, 0.1, 0.2, 0.4, 0.8)
for (i in trainSetSizes) {
  trainIndex <- createDataPartition(d3$SepalLength, times=1, p=i, list=FALSE)
  dTrain <- d3[ trainIndex,]
  dTest <- d3[-trainIndex,]
  dTree <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth
                 + SepalArea + PetalArea + pcaF, method="class",
                 data=dTrain, control=rpart.control(minsplit=5))

  # evaluate on test set
  testPreds <- predict(dTree, dTest, type = "class")
  evalResults <- confusionMatrix(testPreds, dTest$Class)
  tmpAcc <- evalResults$overall[1]
  allResultsTe <- rbind(allResultsTe, as.vector(c(i, tmpAcc)))

  # evaluate on training set
  trainPreds <- predict(dTree, dTrain, type = "class")
  evalResults2 <- confusionMatrix(trainPreds, dTrain$Class)
  tmpAcc2 <- evalResults2$overall[1]
  allResultsTr <- rbind(allResultsTr, as.vector(c(i, tmpAcc2)))
}

Page 89

Ensemble methods

Page 90

Ensemble methods

• Motivation

• Too difficult to construct a single model that optimizes performance (why?)

• Approach

• Construct many models on different versions of the training set and combine them during prediction

• Goal: reduce bias and/or variance

Page 91

Conventional classification

X: attributes, Y: class label

Page 92


Ensemble classification

Page 93

(Figure, © 2004 Elder Research, Inc.: Relative Performance Examples: 5 Algorithms on 6 Datasets (with Stephen Lee, U. Idaho, 1997). Error relative to peer techniques (lower is better) for Neural Network, Logistic Regression, Linear Vector Quantization, Projection Pursuit Regression, and Decision Tree on the Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, and Investment datasets. Source: Top Ten Data Mining Mistakes, John Elder, Elder Research.)

Page 94

(Figure, © 2004 Elder Research, Inc.: Essentially every bundling method improves performance. Error relative to peer techniques (lower is better) for the ensemble methods Advisor Perceptron, AP weighted average, Vote, and Average on the same six datasets. Source: Top Ten Data Mining Mistakes, John Elder, Elder Research.)

Page 95

Ensemble design

Page 96

Ensemble design

Treatment of input data:
• sampling
• variable selection

Page 97

Ensemble design

Treatment of input data:
• sampling
• variable selection

Choice of base classifier:
• decision tree
• perceptron
• ...

Page 98

Ensemble design

Treatment of input data:
• sampling
• variable selection

Choice of base classifier:
• decision tree
• perceptron
• ...

Prediction aggregation:
• averaging
• weighted vote
• ...

Page 99

Bagging

• Bootstrap aggregating

• Main assumption

• Combining many unstable predictors in an ensemble produces a stable predictor (i.e., reduces variance)

• Unstable predictor: small changes in training data produces large changes in the model (e.g., trees)

• Model space: non-parametric, can model any function if an appropriate base model is used

Page 100

Bagging

Treatment of input data: sample with replacement

Choice of base classifier: unstable predictor, e.g., decision tree

Prediction aggregation: averaging

Page 101

Bagging

• Given a training data set D = {(x1,y1), ..., (xN,yN)}

• For m = 1:M

• Obtain a bootstrap sample Dm by drawing N instances with replacement from D (sampling creates an altered training set)

• Learn model Mm from Dm

• To classify test instance t, apply each model Mm to t and use the majority prediction or average prediction

• Models have uncorrelated errors due to differences in training sets (each bootstrap sample contains ~63% of the unique instances of D); a sketch follows
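A bagging sketch in R (not from the slides), using rpart trees as the unstable base predictor on the d3 data from earlier, plus a quick simulation of the bootstrap-coverage figure:

library(rpart)

# bagging: M trees, each learned on a bootstrap sample of the training data
M <- 25
N <- nrow(d3)
models <- lapply(1:M, function(m) {
  Dm <- d3[sample(N, N, replace = TRUE), ]   # bootstrap sample Dm
  rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth,
        method = "class", data = Dm)
})

# majority vote across the M trees
votes <- sapply(models, function(mod) as.character(predict(mod, d3, type = "class")))
bagPred <- apply(votes, 1, function(v) names(which.max(table(v))))

# fraction of unique instances per bootstrap sample (approaches 1 - 1/e = 0.632)
mean(replicate(1000, length(unique(sample(N, N, replace = TRUE))) / N))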

Page 102

Random forests

• Given a training data set D={(x1,y1),..., (xN,yN)}

• For m=1:M

• Obtain a bootstrap sample Dm by drawing N instances with replacement from D

• Learn decision tree Mm from Dm using k randomly selected features at each split

• To classify test instance t, apply each model Mm to t and use the majority prediction or average prediction

Page 103

install.packages("randomForest")
library(randomForest)

# random forest ensemble (classification is inferred from the factor response)
dForest <- randomForest(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, data=d3)

# view results
print(dForest)

# importance of each predictor
importance(dForest)

Page 104: Introduction to Machine Learning & Data Mining · Introduction to Machine Learning & Data Mining Jennifer Neville Purdue University May 24, 2016 Slides:

# calculate learning curve for random forest
allResultsTe <- matrix(numeric(0), 0, 2)

trainSetSizes <- c(0.025, 0.05, 0.1, 0.2, 0.4, 0.8)
for (i in trainSetSizes) {
  trainIndex <- createDataPartition(d3$SepalLength, times=1, p=i, list=FALSE)
  dTrain <- d3[ trainIndex,]
  dTest <- d3[-trainIndex,]

  # learn the forest on the sampled training set
  dForest <- randomForest(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth
                          + SepalArea + PetalArea + pcaF, data=dTrain)

  # evaluate on test set
  testPreds <- predict(dForest, dTest, type = "response")
  evalResults <- confusionMatrix(testPreds, dTest$Class)
  tmpAcc <- evalResults$overall[1]
  allResultsTe <- rbind(allResultsTe, as.vector(c(i, tmpAcc)))
}