
Page 1

Introduction to Machine Learning & Data Mining

Jennifer Neville Purdue University

May 24, 2016

Slides: https://www.cs.purdue.edu/homes/neville/soi-dswksp-neville.pdf
Data: https://www.cs.purdue.edu/homes/neville/iris.dat

Page 2

Data mining

The process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data

(Fayyad, Piatetsky-Shapiro & Smyth 1996)

(Figure: data mining at the intersection of Databases, Artificial Intelligence, Visualization, and Statistics.)

Page 3

Example

During WWII, statistician Abraham Wald was asked to help the military decide where to add armor to their planes. Returning planes showed damage concentrated in certain areas; Wald argued for armoring the places where the returning planes were not hit, since planes hit in those spots never made it back (survivorship bias).

Page 4

The data revolution

The last 35 years of research in ML/DM have resulted in widespread adoption of predictive analytics to automate and improve decision making. As "big data" efforts increase the collection of data, so will the need for new data science methodology. Data today have more volume, velocity, variety, etc. Machine learning research develops statistical tools, models, and algorithms that address these complexities. Data mining research focuses on how to scale to massive data and how to incorporate feedback to improve accuracy while minimizing effort.

Page 5

Page 6

The data mining process

(Figure: Data → [Selection] → Target data → [Preprocessing] → Processed data → [Mining] → Patterns → [Interpretation/evaluation] → Knowledge. Machine learning drives the mining step.)

Page 7

Overview

• Task specification

• Data representation

• Knowledge representation

• Learning technique

• Search + scoring

• Prediction and/or interpretation

Page 8

Overview

• Task specification

• Data representation

• Knowledge representation

• Learning technique

• Search + scoring

• Prediction and/or interpretation

Page 9

Task specification

• Objective of the person who is analyzing the data

• Description of the characteristics of the analysis and desired result

• Examples:

• From a set of labeled examples, devise an understandable model that will accurately predict whether a stockbroker will commit fraud in the near future.

• From a set of unlabeled examples, cluster stockbrokers into a set of homogeneous groups based on their demographic information

Page 10

Exploratory data analysis

• Goal

• Interact with data without clear objective

• Techniques

• Visualization, ad hoc modeling

Page 11

Descriptive modeling

• Goal

• Summarize the data or the underlying generative process

• Techniques

• Density estimation, cluster analysis and segmentation

(Figure: relational schema of the securities domain, linking Broker (Bk) instances, with attributes On Watchlist, Is Problem, Problem In Past, Has Business, to Branch (Bn) instances, with attributes Region, Area, Layoffs; Firms with attribute Size and On Watchlist; and Disclosures with attributes Year and Type.)

Also known as: unsupervised learning

Page 12

Predictive modeling

• Goal

• Learn model to predict unknown class label values given observed attribute values

• Techniques

• Classification, regression

(Figure: classification tree for broker fraud, with splits such as Broker Age > 27, Current CoWorker Count > 8, Current Firm Avg(Size) > 12, Current Branch Mode(Location) = NY, Broker Years In Industry > 1, Disclosure Count(Yr < 1995) > 0, Past Firm Avg(Size) > 90, Current Regulator Mode(Status) = Reg, Past CoWorker Count > 35, Disclosure Count(Type=CC) > 0, Disclosure Count > 5, Past Firm Max(Size) > 1000, Current Branch Mode(Location) = AZ, Past CoWorker Count(Gender=M) > 15, and instance counts at the leaves.)

Also known as: supervised learning

Page 13

Pattern discovery

• Goal

• Detect patterns and rules that describe sets of examples

• Techniques

• Association rules, graph mining, anomaly detection

(Figure: scatter of + and − labeled examples, with a pattern circled around one local cluster of + points.)

Model: global summary of a data set. Pattern: local to a subset of the data.

Page 14

Overview

• Task specification

• Data representation

• Knowledge representation

• Learning technique

• Search + scoring

• Prediction and/or interpretation

Page 15

Data representation

• Choice of data structure for representing individual and collections of measurements

• Individual measurements: single observations (e.g., person’s date of birth, product price)

• Collections of measurements: sets of observations that describe an instance (e.g., person, product)

• Choice of representation determines applicability of algorithms and can impact modeling effectiveness

• Additional issues: data sampling, data cleaning, feature construction

Page 16

Tabular data

Fraud  Age  Degree  StartYr  Series7
+      22   Y       2005     N
-      25   N       2003     Y
-      31   Y       1995     Y
-      27   Y       1999     Y
+      24   N       2006     N
-      29   N       2003     N

N instances × p attributes

Page 17

Temporal data

(Figure: time series of counts by year, 2004–2007, for Region 1 and Region 2; y-axis 0–100.)

Page 18

Relational/structured data

659,000 brokers · 171,000 branches · 5,100 firms · 400,000 disclosures

Page 19

Overview

• Task specification

• Data representation

• Knowledge representation

• Learning technique

• Search + scoring

• Prediction and/or interpretation

Page 20

Knowledge representation

• Underlying structure of the model or patterns that we seek from the data

• Specifies the models/patterns that could be returned as the results of the data mining algorithm

• Defines space of possible models/patterns for algorithm to search over

• Examples:

• If-then rule: If short closed car then toxic chemicals

• Conditional probability distribution: P( fraud | age, degree, series7, startYr )

• Decision tree

• Regression model

• Linear equation

Generic form: y = β1·x1 + β2·x2 + ... + β0

An example for the Alzheimer's data would be: CDR = 0.12·MM1 + 0.34·SBScore + ...

Rules

• Examples:

IF (cheese) THEN chocolate [Conf=0.10, Supp=0.04]
IF (cheese & Sterno) THEN chocolate [Conf=0.95, Supp=0.01]

• These can be written as probabilities:

P(chocolate|cheese) = 0.10
P(chocolate|cheese,Sterno) = 0.95

• As rules get more specific (by including additional attributes), support decreases, but confidence may increase, as sketched below.
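To make the support and confidence arithmetic concrete, here is a minimal R sketch (not from the slides); the 0/1 transaction data frame tx and its column names are hypothetical:

# hypothetical 0/1 transaction data: one row per basket
tx <- data.frame(cheese    = c(1,1,0,1,0,1,1,0),
                 sterno    = c(0,1,0,0,0,1,0,0),
                 chocolate = c(0,1,0,0,1,1,0,0))

# support: fraction of all baskets containing antecedent AND consequent
supp <- mean(tx$cheese == 1 & tx$sterno == 1 & tx$chocolate == 1)

# confidence: estimate of P(chocolate | cheese, sterno)
conf <- sum(tx$cheese == 1 & tx$sterno == 1 & tx$chocolate == 1) /
        sum(tx$cheese == 1 & tx$sterno == 1)

Adding Sterno to the antecedent shrinks the set of matching baskets (lower support), but can raise the fraction of those baskets that also contain chocolate (higher confidence).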

Page 21

Overview

• Task specification

• Data representation

• Knowledge representation

• Learning technique

• Search + scoring

• Prediction and/or interpretation

Page 22

Learning technique

• Method to construct model or patterns from data

• Model space

• Choice of knowledge representation defines a set of possible models or patterns

• Scoring function

• Associates a numerical value (score) with each member of the set of models/patterns

• Search technique

• Defines a method for generating members of the set of models/patterns, determining their score, and identifying the ones with the “best” score

Page 23

Scoring function

• A numeric score assigned to each possible model in a search space, given a reference/input dataset

• Used to judge the quality of a particular model for the domain

• Score functions are statistics: estimates of a population parameter based on a sample of data

• Examples:

• Misclassification

• Squared error

• Likelihood

Page 24

Example learning problem

(Figure: kernel density of attribute X, density.default(x = d2[, 1]), N = 500, bandwidth = 0.6062, with − examples concentrated at low X and + examples at high X.)

Task: Devise a rule to classify items based on the attribute X

Knowledge representation: If-then rules

Example rule: If x > 25 then + else −

What is the model space? All possible thresholds

What score function? Prediction error rate

Page 25

Score function over model space

(Figure: % prediction errors vs. threshold, plotted for the full data and for a small sample, a large sample, and a biased sample.)

Search procedure? Try all thresholds, select the one with the lowest score

Note: the learning result depends on the data, as the sketch below illustrates
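A minimal R sketch of this learning procedure (not from the slides); the attribute x and labels y are hypothetical stand-ins for the data in the plot:

# hypothetical data: attribute x, class labels y
x <- c(rnorm(50, mean = 20, sd = 3), rnorm(50, mean = 30, sd = 3))
y <- rep(c("-", "+"), each = 50)

# score function: error rate of the rule "If x > t then + else -"
errRate <- function(t) mean(ifelse(x > t, "+", "-") != y)

# search: try all candidate thresholds, keep the one with the lowest score
candidates <- sort(unique(x))
bestThreshold <- candidates[which.min(sapply(candidates, errRate))]

Rerunning this on a different sample shifts bestThreshold, which is exactly the sample dependence shown in the plot.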

Page 26

Overview

• Task specification

• Data representation

• Knowledge representation

• Learning technique

• Search + Evaluation

• Prediction and/or interpretation

Page 27

Inference and interpretation

• Prediction technique

• Method to apply learned model to new data for prediction/analysis

• Only applicable for predictive and some descriptive models

• Prediction is often used during learning (i.e., search) to determine value of scoring function

• Interpretation of results

• Objective: significance measures, hypothesis tests

• Subjective: importance, interestingness, novelty

Page 28

Example application

Page 29

Real-world example:

659,000 brokers · 171,000 branches · 5,100 firms · 400,000 disclosures

Page 30

(Figure: the broker-fraud classification tree from Page 12, with splits on broker, branch, firm, and disclosure attributes and instance counts at the leaves.)

Page 31

Performance of NASD models

(Figure: comparison of brokers flagged by NASD Rules vs. Relational Models, with regions for those flagged by both and by neither.)

"One broker I was highly confident in ranking as 5…

Not only did I have the pleasure of meeting him at a shady warehouse location, I also negotiated his bar from the industry... This person actually used investors' funds

to pay for personal expenses including his trip to attend a NASD compliance conference!

…If the model predicted this person, it would be right on target."

Page 32

Data mining process

Data Mining: Classification Schemes

• General functionality

– Descriptive data mining
– Predictive data mining

• Different views, different classifications

– Kinds of data to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted

(Figure: the Knowledge Discovery in Databases process: Data → [Selection] → Target Data → [Preprocessing] → Preprocessed Data → [Data Mining] → Patterns → [Interpretation/Evaluation] → Knowledge. Adapted from U. Fayyad et al. (1996), "From Data Mining to Knowledge Discovery: An Overview," Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press.)

Applied to the NASD example:

• Use public data from NASD BrokerCheck
• Extract data about small firms in a few geographic locations
• Create class label, temporal features
• Learn decision trees, output predictions and tree structure
• Evaluate objectively on historical data, subjectively with fraud analysts

Page 33

Steps in the data mining process

Page 34

Data mining process

1. Application setup:

• Acquire relevant domain knowledge

• Assess user goals

2. Data selection

• Choose data sources

• Identify relevant attributes

• Sample data

3. Data preprocessing

• Remove noise or outliers

• Handle missing values

• Account for time or other changes

4. Data transformation

• Find useful features

• Reduce dimensionality

Page 35

Data mining process

5. Data mining:

• Choose task (e.g., classification, regression, clustering)

• Choose algorithms for learning and inference

• Set parameters

• Apply algorithms to search for patterns of interest

6. Interpretation/evaluation

• Assess accuracy of model/results

• Interpret model for end-users

• Consolidate knowledge

7. Repeat...

Page 36

Data selection

Page 37

# download data from:
# https://www.cs.purdue.edu/homes/neville/iris.dat

# read in data
d <- read.table(file='iris.dat', sep=',', header=TRUE)
summary(d)

# histogram
hist(d[,2], main='Histogram of Sepal Width', xlab='Sepal Width')

# scatterplot
plot(d[,c(1,3)], xlab='Sepal Length', ylab='Petal Length')
plot(d)

Page 38

# feature construction
SepalArea <- d[,1] * d[,2]
PetalArea <- d[,3] * d[,4]

# add new features to data frame
d2 <- cbind(d, SepalArea, PetalArea)

# boxplot
boxplot(d2$SepalLength ~ d2$Class, d2, ylab='Sepal Length')
boxplot(d2$SepalArea ~ d2$Class, d2, ylab='Sepal Area')

# correlation
cor(d2$SepalLength, d2$SepalArea)

Page 39

Data transformation

Page 40

Feature construction

• Create new attributes that can capture the important information in the data more efficiently than the original attributes

• General methodologies:

• Attribute extraction (domain specific)

• Attribute transformations, i.e., mapping data to new space (e.g., PCA)

• Attribute combinations

Page 41

Feature selection

• Select the most “relevant” subset of attributes

• May improve performance of algorithms by reducing overfitting

• Improves domain understanding

• Less resources to collect/use a smaller number of features

• Wrapper approach

• Features are selected as part of the mining algorithms

• Filter approach

• Features are selected before mining

Page 42

Wrapper approach

• Consider all subsets of features (2^p subsets if there are p features); features are selected according to a particular model score function (e.g., classification accuracy)

• Search over all subsets and find smallest set of features such that “best” model score does not significantly change

• For large p, exhaustive search is intractable so heuristic search is often used

• Examples:

• Forward, greedy selection: start with the empty set and add features one at a time until there is no more improvement (see the sketch after this list)

• Backward, greedy removal — start with full set and remove features one at a time until no further improvements

• Interleave forward and backward selection
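A sketch of the forward greedy variant above, assuming a hypothetical helper score(features) that returns, say, cross-validated accuracy for a model built on the given feature subset:

# greedy forward selection; score() is a hypothetical scoring helper
forwardSelect <- function(allFeatures, score) {
  selected <- c()
  bestScore <- -Inf
  repeat {
    remaining <- setdiff(allFeatures, selected)
    if (length(remaining) == 0) break
    trial <- sapply(remaining, function(f) score(c(selected, f)))
    if (max(trial) <= bestScore) break   # stop when no more improvement
    bestScore <- max(trial)
    selected <- c(selected, remaining[which.max(trial)])
  }
  selected
}

Backward removal is the mirror image: start from allFeatures and repeatedly drop the feature whose removal hurts the score least.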

Page 43

Filter approach

• Select “useful” features before mining, using a score that measures a feature’s utility separate from model learning

• Find features on the basis of statistical properties such as association with the class label

• Example:

• Calculate correlation of features with the target class label, order all features by their score, choose the features with the top k scores (see the sketch below)

• Other scores: Chi-square, information gain, etc.
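A small R sketch of this filter (not from the slides), assuming the d2 data frame and column names built earlier in the deck and a binary recoding of the class label:

# score each feature by |correlation| with a 0/1 class label, keep the top k
yNum <- as.numeric(d2$Class == "Iris-virginica")   # binary recoding (one-vs-rest)
feats <- c("SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
scores <- sapply(d2[, feats], function(f) abs(cor(f, yNum)))
topk <- names(sort(scores, decreasing = TRUE))[1:2]   # keep k = 2 features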

Page 44

Dimensionality reduction

• Identify and describe the “dimensions” that underlie the data

• These dimensions may be more fundamental than those directly measured, but hidden from the user

• Reduce dimensionality of modeling problem

• The benefit is simplification: it reduces the number of variables you have to deal with in modeling

• Can identify set of variables with similar behavior

Page 45

Principal component analysis (PCA)

• High-level approach, given data matrix D with p dimensions:

• Preprocess D so that the mean of each attribute is 0; call this matrix X

• Compute the p×p covariance matrix: Σ = XᵀX

• Compute eigenvectors/eigenvalues of the covariance matrix: AΣA⁻¹ = Λ, where A is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues

• Eigenvectors are the principal component vectors (p×p matrix), where each is a p×1 column vector of projection weights; a1 is the 1st principal component (eigenvector)
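These steps map onto a few lines of R (a sketch, not from the slides), using the iris matrix d loaded earlier:

# PCA via the covariance eigen-decomposition described above
X <- scale(as.matrix(d[, 1:4]), center = TRUE, scale = FALSE)  # mean-0 columns
S <- t(X) %*% X                  # p x p (unnormalized) covariance matrix
eig <- eigen(S)                  # eigenvalues and eigenvectors
A <- eig$vectors                 # columns are the principal component vectors
eig$values / sum(eig$values)     # proportion of variance per component

Up to signs and a constant normalization of S, the columns of A match the loadings that princomp reports on the later slides.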

Page 46

Applying PCA

• New data vectors are formed by projecting the data onto the first few principal components (i.e., top k eigenvectors)

x = [x1, x2, ..., xp] (original instance)
A = [a1, a2, ..., ap] (principal components)

x′1 = a1·x = Σ_{j=1..p} a1j·xj
...
x′m = am·x = Σ_{j=1..p} amj·xj, for m < p

x′ = [x′1, x′2, ..., x′m] (transformed instance)

If m = p then the data is transformed; if m < p then the transformation is lossy and dimensionality is reduced

Page 47

Applying PCA (cont.)

• Goal: Find a new (smaller) set of dimensions that captures most of the variability of the data

• Use scree plot to choose number of dimensions

• Choose m < p so projected data captures much of the variance of original data

(Figure: scree plot of variance vs. component index.)

Page 48

Example: Eigenfaces

First 40 PCA dimensions

PCA applied to images of human faces. Reduce dimensionality to set of basis images.

All other images are linear combo of these “eigenpictures”.

Used for facial recognition.

Page 49

PCA example on Iris data

> pcdat <- princomp(d[,1:4])
> summary(pcdat)
Importance of components:
                          Comp.1     Comp.2     Comp.3      Comp.4
Standard deviation     2.0485788 0.49053911 0.27928554 0.153379074
Proportion of Variance 0.9246162 0.05301557 0.01718514 0.005183085
Cumulative Proportion  0.9246162 0.97763178 0.99481691 1.000000000
> plot(pcdat)
> loadings(pcdat)

Loadings:
   Comp.1 Comp.2 Comp.3 Comp.4
V1  0.362 -0.657 -0.581  0.317
V2        -0.730  0.596 -0.324
V3  0.857  0.176        -0.480
V4  0.359         0.549  0.751

               Comp.1 Comp.2 Comp.3 Comp.4
SS loadings      1.00   1.00   1.00   1.00
Proportion Var   0.25   0.25   0.25   0.25
Cumulative Var   0.25   0.50   0.75   1.00

(Figure: plot(pcdat), the variances of the principal components.)

First component explains 92% of data variance

Choose m=1

Page 50

PCA example on Iris data

# transform data using 1st component
> pcdat$loadings[,1]
         V1          V2          V3          V4
 0.36158968 -0.08226889  0.85657211  0.35884393

pcaF <- as.matrix(d[,1:4]) %*% pcdat$loadings[,1]
d3 <- cbind(d2, pcaF)

(Projection equations as on Page 46: x′m = am·x = Σ_{j=1..p} amj·xj for m < p.)

m=1, transform data to one dimension

(Figure: boxplots comparing Iris-setosa and Iris-virginica on the original data (1st variable) and on the transformed data.)

Page 51

Machine learning

Page 52

Descriptive vs. predictive modeling

• Descriptive models summarize the data

• Provide insights into the domain

• Focus on modeling joint distribution P(X)

• May be used for classification, but prediction is not the primary goal

• Predictive models predict the value of one variable of interest given known values of other variables

• Focus on modeling the conditional distribution P( Y | X ) or on modeling the decision boundary for Y

Page 53

Learning predictive models

• Choose a data representation

• Select a knowledge representation (a “model”)

• Defines a space of possible models M={M1, M2, ..., Mk}

• Use search to identify “best” model(s)

• Search the space of models (i.e., with alternative structures and/or parameters)

• Evaluate possible models with scoring function to determine the model which best fits the data

Page 54

Knowledge representation

• Underlying structure of the model or patterns that we seek from the data

• Defines space of possible models for algorithm to search over

• Model: high-level global description of dataset

• “All models are wrong, some models are useful” G. Box and N. Draper (1987)

• Choice of model family determines space of parameters and structure

• Estimate model parameters and possibly model structure from training data

Page 55

(Figure: the broker-fraud classification tree from Page 12, with attribute splits at the internal nodes and instance counts at the leaves.)

Classification tree

Model space: all possible decision trees

Page 56

Scoring functions

• Given a model M and dataset D, we would like to “score” model M with respect to D

• Goal is to rank the models in terms of their utility (for capturing D) and choose the “best” model

• Score function can be used to search over parameters and/or model structure

• Score functions can be different for:

• Models vs. patterns

• Predictive vs. descriptive functions

• Models with varying complexity (i.e., number of parameters)

Page 57

Predictive scoring functions

• Assess the quality of predictions for a set of instances

• Measures difference between the prediction M makes for an instance i and the true class label value of i

S(M) = Σ_{i=1..Ntest} d( f(x(i); M), y(i) )

where f(x(i); M) is the predicted class label for item i, y(i) is the true class label for item i, d measures the distance between the predicted and true labels, and the sum is over test examples.
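As a sketch (not from the slides): with 0/1 misclassification as the distance d, the score is simply a count of errors; preds and truth are hypothetical vectors of predicted and true values:

# S(M) with 0/1 distance: count of misclassified test examples
misclassScore <- function(preds, truth) sum(preds != truth)

# squared-error analogue for numeric predictions
squaredErrorScore <- function(preds, truth) sum((preds - truth)^2)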

Page 58

What space are we searching?

(Figure: model score plotted over the model space; learned model ≈ (θ0 = 0.8, θ1 = 0.4).)

Alex Holehouse, Notes from Andrew Ng's Machine Learning Class, http://www.holehouse.org/mlclass/01_02_Introduction_regression_analysis_and_gr.html

Page 59

Searching over models/patterns

• Consider a space of possible models M={M1, M2, ..., Mk} with parameters θ

• Search could be over model structures or parameters, e.g.:

• Parameters: In a linear regression model, find the regression coefficients (β) that minimize squared loss on the training data

• Model structure: In a decision tree, find the tree structure that maximizes accuracy on the training data

Page 60

Decision trees

Page 61

Tree models

• Easy to understand knowledge representation

• Can handle mixed variables

• Recursive, divide and conquer learning method

• Efficient inference

Page 62

Tree learning

• Top-down recursive divide and conquer algorithm

• Start with all examples at root

• Select best attribute/feature

• Partition examples by selected attribute

• Recurse and repeat

• Other issues:

• How to construct features

• When to stop growing

• Pruning irrelevant parts of the tree

Page 63

Score each attribute split for these instances: Age, Degree, StartYr, Series7

Fraud  Age  Degree  StartYr  Series7
+      22   Y       2005     N
-      25   N       2003     Y
-      31   Y       1995     Y
-      27   Y       1999     Y
+      24   N       2006     N
-      29   N       2003     N

choose split on Series7

Series7 = Y:
Fraud  Age  Degree  StartYr  Series7
-      25   N       2003     Y
-      31   Y       1995     Y
-      27   Y       1999     Y

Series7 = N:
Fraud  Age  Degree  StartYr  Series7
+      22   Y       2005     N
+      24   N       2006     N
-      29   N       2003     N

Score each attribute split for these instances: Age, Degree, StartYr

choose split on Age>28

Age <= 28:
Fraud  Age  Degree  StartYr  Series7
+      22   Y       2005     N
+      24   N       2006     N

Age > 28:
Fraud  Age  Degree  StartYr  Series7
-      29   N       2003     N

Page 64

Tree models

• Most well-known systems

• CART: Breiman, Friedman, Olshen and Stone

• ID3, C4.5: Quinlan

• How do they differ?

• Split scoring function

• Stopping criterion

• Pruning mechanism

• Predictions in leaf nodes

Page 65

Scoring functions: Local split value

Page 66

Choosing an attribute/feature

• Idea: a good feature splits the examples into subsets that distinguish among the class labels as much as possible... ideally into pure sets of "all positive" or "all negative"

Page 67

Information gain

• How much does a feature split decrease the entropy?

Entropy(D) = −9/14 log2(9/14) − 5/14 log2(5/14) = 0.9400

(Embedded CS590D outline slide, Classification and Prediction: what is classification? what is prediction?; issues regarding classification and prediction; Bayesian classification; instance-based methods; classification by decision tree induction; classification by neural networks; classification by support vector machines (SVM); prediction.)

Training Dataset

age income student credit_rating buys_computer

<=30 high no fair no

<=30 high no excellent no

31…40 high no fair yes

>40 medium no fair yes

>40 low yes fair yes

>40 low yes excellent no

31…40 low yes excellent yes

<=30 medium no fair no

<=30 low yes fair yes

>40 medium yes fair yes

<=30 medium yes excellent yes

31…40 medium no excellent yes

31…40 high yes fair yes

>40 medium no excellent no

Page 68

Information gain

(Figure: D partitioned by Income. High: {no, no, yes, yes}; Med: {yes, no, yes, yes, yes, no}; Low: {yes, no, yes, yes}.)

Entropy(Income=high) = −2/4 log2(2/4) − 2/4 log2(2/4) = 1
Entropy(Income=med) = −4/6 log2(4/6) − 2/6 log2(2/6) = 0.9183
Entropy(Income=low) = −3/4 log2(3/4) − 1/4 log2(1/4) = 0.8113

Gain(D, Income) = 0.9400 − (4/14 [1] + 6/14 [0.9183] + 4/14 [0.8113]) = 0.029
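A short R sketch of these calculations (not from the slides), using base-2 logs to match the 0.9400 figure:

# entropy of a label vector, in bits
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# information gain from splitting labels on a discrete feature
infoGain <- function(labels, feature) {
  w <- table(feature) / length(feature)
  entropy(labels) -
    sum(sapply(names(w), function(v) w[v] * entropy(labels[feature == v])))
}

Applied to label and income vectors built from the 14-row table above, infoGain should reproduce Gain(D, Income) = 0.029.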

Page 69

Tree learning

• Top-down recursive divide and conquer algorithm

• Start with all examples at root

• Select best attribute/feature

• Partition examples by selected attribute

• Recurse and repeat

• Other issues:

• How to construct features

• When to stop growing

• Pruning irrelevant parts of the tree

Page 70

When to stop growing

• Full growth methods

• All samples at a node belong to the same class

• There are no attributes left for further splits

• There are no samples left

• What impact does this have on the quality of the learned trees?

• Trees overfit the data and accuracy decreases

• Pruning is used to avoid overfitting

Page 71

Pruning

• Postpruning

• Use a separate set of examples to evaluate the utility of pruning nodes from the tree (after the tree is fully grown); see the rpart sketch below

• Prepruning

• Apply a statistical test to decide whether to expand a node

• Use an explicit measure of complexity to penalize large trees (e.g., Minimum Description Length)
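With rpart (used on the R slides later in this deck), postpruning is typically driven by the complexity-parameter table; a sketch, assuming a classification tree dTree already fit with rpart:

# inspect cross-validated error for each candidate subtree size
printcp(dTree)

# prune back to the cp value with the lowest cross-validated error (xerror)
bestCp <- dTree$cptable[which.min(dTree$cptable[, "xerror"]), "CP"]
dTreePruned <- prune(dTree, cp = bestCp)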

Page 72

Algorithm comparison

• CART

• Evaluation criterion: Gini index
• Search algorithm: Simple to complex, hill-climbing search
• Stopping criterion: When leaves are pure
• Pruning mechanism: Cross-validation to select gini threshold

• C4.5

• Evaluation criterion: Information gain
• Search algorithm: Simple to complex, hill-climbing search
• Stopping criterion: When leaves are pure
• Pruning mechanism: Reduced-error pruning

Page 73

Learning CART decision trees in R

Page 74

# install packages
library(rpart)
install.packages("rpart.plot")
library(rpart.plot)

# learn CART tree with default parameters
dTree <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, method="class", data=d3)
summary(dTree)

# display learned tree
prp(dTree)
prp(dTree, type=4, extra=1)

# explore settings
dTree2 <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, method="class", data=d3, parms=list(split="information"))

dTree3 <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, method="class", data=d3, control=rpart.control(minsplit=5))

Page 75

Evaluation

Page 76

Empirical evaluation

• Given observed accuracy of a model on limited data, how well does this estimate generalize for additional examples?

• Given that one model outperforms another on some sample of data, how likely is it that this model is more accurate in general?

• When data are limited, what is the best way to use the data to both learn and evaluate a model?

Page 77

Evaluating classifiers

• Goal: Estimate true future error rate

• When data are limited, what is the best way to use the data to both learn and evaluate a model?

• Approach 1

• Reclassify training data to estimate error rate

Page 78

Approach 1

(Figure: learn model F(X) from the data set, then reclassify the same data set to estimate error. Score: 83%.)

Typically produces a biased estimate of future error rate -- why?

Page 79

Learning curves

• Goal: See how performance improves with additional training data

• From dataset S, where |S| = n

• For i=[10, 20, ... ,100]

• Randomly sample i% of S to construct sample S’

• Learn model on S’

• Evaluate model

• Plot training set size vs. accuracy

Page 80

(Figure: learning curve, accuracy vs. size of dataset.)

Page 81

How does performance change when measured on disjoint test set?

Page 82

(Figure: learning curve, accuracy vs. size of dataset, measured on a disjoint test set.)

Page 83

Approach 2

(Figure: partition the data set into a training set and a test set; learn model F(X) on the training set and score it on the test set. Score: 77%. The estimate will vary due to the size and makeup of the test set.)

Page 84

Evaluating classifiers (cont)

• Approach 2:

• Partition D0 into two disjoint subsets, learn model on one subset, measure error on the other subset

• Problem: this is a point estimate of the error on one subset

(Figure: error vs. training set size for Algorithm A and Algorithm B.)

Page 85

Overlapping test sets are dependent

• Repeated sampling of test sets leads to overlap (i.e., dependence) among test sets... this will result in underestimation of variance

• Standard errors will be biased if performance is estimated from overlapping test sets (Dietterich '98)

• Recommendation: Use cross-validation to eliminate dependence between test sets

Page 86

• Use k-fold cross-validation to get k estimates of error for MA and MB

• Set of errors estimated over the test set folds provides empirical estimate of sampling distribution

• Mean is estimate of expected error

Evaluating classification algorithms A and B

(Figure: 6-fold cross-validation. The dataset is split into six folds; for each fold i, the models are trained on Train_i and evaluated on Test_i, yielding AccA.1–AccA.6 and AccB.1–AccB.6 for algorithms A and B.)
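A sketch of this procedure for a single algorithm (not from the slides), using caret's createFolds to build k = 6 disjoint test folds over the d3 data from earlier:

library(caret)
library(rpart)

# 6-fold cross-validation: each fold serves once as the test set
folds <- createFolds(d3$Class, k = 6)
accs <- sapply(folds, function(testIdx) {
  fit <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth,
               method = "class", data = d3[-testIdx, ])
  preds <- predict(fit, d3[testIdx, ], type = "class")
  mean(preds == d3$Class[testIdx])
})
mean(accs)   # estimate of expected accuracy
sd(accs)     # spread across the fold estimates

Running the same loop with a second learner gives the paired per-fold estimates (AccA.i, AccB.i) pictured above.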

Page 87

# package with helper methods
install.packages("caret")
library(caret)

# partition data into 80% training and 20% test set
trainIndex <- createDataPartition(d3$SepalLength, times=1, p=0.8, list=FALSE)
dTrain <- d3[ trainIndex,]
dTest <- d3[-trainIndex,]

# learn model from training, evaluate on test
dtTrain <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, method="class", data=dTrain, control=rpart.control(minsplit=5))
testPreds <- predict(dtTrain, dTest, type = "class")

# evaluate
confusionMatrix(testPreds, dTest$Class)

Page 88

# calculate learning curve
allResultsTr <- matrix(numeric(0), 0, 2)
allResultsTe <- matrix(numeric(0), 0, 2)

trainSetSizes <- c(0.025, 0.05, 0.1, 0.2, 0.4, 0.8)
for (i in trainSetSizes) {
  trainIndex <- createDataPartition(d3$SepalLength, times=1, p=i, list=FALSE)
  dTrain <- d3[ trainIndex,]
  dTest <- d3[-trainIndex,]
  dTree <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth
                 + SepalArea + PetalArea + pcaF, method="class",
                 data=dTrain, control=rpart.control(minsplit=5))

  # evaluate on test set
  testPreds <- predict(dTree, dTest, type = "class")
  evalResults <- confusionMatrix(testPreds, dTest$Class)
  tmpAcc <- evalResults$overall[1]
  allResultsTe <- rbind(allResultsTe, as.vector(c(i, tmpAcc)))

  # evaluate on training set
  trainPreds <- predict(dTree, dTrain, type = "class")
  evalResults2 <- confusionMatrix(trainPreds, dTrain$Class)
  tmpAcc2 <- evalResults2$overall[1]
  allResultsTr <- rbind(allResultsTr, as.vector(c(i, tmpAcc2)))
}

Page 89

Ensemble methods

Page 90

Ensemble methods

• Motivation

• Too difficult to construct a single model that optimizes performance (why?)

• Approach

• Construct many models on different versions of the training set and combine them during prediction

• Goal: reduce bias and/or variance

Page 91

Conventional classification

X: attributes, Y: class label

Page 92


Ensemble classification

Page 93

(Figure, © 2004 Elder Research, Inc.: Relative Performance Examples: 5 Algorithms on 6 Datasets (with Stephen Lee, U. Idaho, 1997). Error relative to peer techniques (lower is better) for Neural Network, Logistic Regression, Linear Vector Quantization, Projection Pursuit Regression, and Decision Tree on the Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, and Investment datasets. Source: Top Ten Data Mining Mistakes, John Elder, Elder Research.)

Page 94

(Figure, © 2004 Elder Research, Inc.: Essentially every bundling method improves performance. Error relative to peer techniques (lower is better) for the ensemble methods Advisor Perceptron, AP weighted average, Vote, and Average on the same six datasets. Source: Top Ten Data Mining Mistakes, John Elder, Elder Research.)

Page 95

Ensemble design

Page 96

Ensemble design

Treatment of input data:
• sampling
• variable selection

Page 97

Ensemble design

Treatment of input data:
• sampling
• variable selection

Choice of base classifier:
• decision tree
• perceptron
• ...

Page 98

Ensemble design

Treatment of input data:
• sampling
• variable selection

Choice of base classifier:
• decision tree
• perceptron
• ...

Prediction aggregation:
• averaging
• weighted vote
• ...

Page 99

Bagging

• Bootstrap aggregating

• Main assumption

• Combining many unstable predictors in an ensemble produces a stable predictor (i.e., reduces variance)

• Unstable predictor: small changes in training data produces large changes in the model (e.g., trees)

• Model space: non-parametric, can model any function if an appropriate base model is used

Page 100

Bagging

Treatment of input data: sample with replacement

Choice of base classifier: unstable predictor, e.g., decision tree

Prediction aggregation: averaging

Page 101

Bagging

• Given a training data set D = {(x1,y1), ..., (xN,yN)}

• For m = 1:M

• Obtain a bootstrap sample Dm by drawing N instances with replacement from D (sampling creates an altered training set)

• Learn model Mm from Dm

• To classify test instance t, apply each model Mm to t and use the majority prediction or average prediction

• Models have uncorrelated errors due to differences in training sets (each bootstrap sample contains ~63% of the unique instances of D); a sketch follows
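A bagging sketch in R (not from the slides), using rpart trees as the unstable base predictor on the d3 data from earlier, plus a quick simulation of the bootstrap-coverage figure:

library(rpart)

# bagging: M trees, each learned on a bootstrap sample of the training data
M <- 25
N <- nrow(d3)
models <- lapply(1:M, function(m) {
  Dm <- d3[sample(N, N, replace = TRUE), ]   # bootstrap sample Dm
  rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth,
        method = "class", data = Dm)
})

# majority vote across the M trees
votes <- sapply(models, function(mod) as.character(predict(mod, d3, type = "class")))
bagPred <- apply(votes, 1, function(v) names(which.max(table(v))))

# fraction of unique instances per bootstrap sample (approaches 1 - 1/e = 0.632)
mean(replicate(1000, length(unique(sample(N, N, replace = TRUE))) / N))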

Page 102

Random forests

• Given a training data set D={(x1,y1),..., (xN,yN)}

• For m=1:M

• Obtain a bootstrap sample Dm by drawing N instances with replacement from D

• Learn decision tree Mm from Dm using k randomly selected features at each split

• To classify test instance t, apply each model Mm to t and use the majority prediction or average prediction

Page 103

install.packages("randomForest")
library(randomForest)

# random forest ensemble (classification is inferred from the factor response)
dForest <- randomForest(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, data=d3)

# view results
print(dForest)

# importance of each predictor
importance(dForest)

Page 104: Introduction to Machine Learning & Data Mining · Introduction to Machine Learning & Data Mining Jennifer Neville Purdue University May 24, 2016 Slides:

# calculate learning curve for random forest
allResultsTe <- matrix(numeric(0), 0, 2)

trainSetSizes <- c(0.025, 0.05, 0.1, 0.2, 0.4, 0.8)
for (i in trainSetSizes) {
  trainIndex <- createDataPartition(d3$SepalLength, times=1, p=i, list=FALSE)
  dTrain <- d3[ trainIndex,]
  dTest <- d3[-trainIndex,]

  # learn the forest on the sampled training set
  dForest <- randomForest(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth
                          + SepalArea + PetalArea + pcaF, data=dTrain)

  # evaluate on test set
  testPreds <- predict(dForest, dTest, type = "response")
  evalResults <- confusionMatrix(testPreds, dTest$Class)
  tmpAcc <- evalResults$overall[1]
  allResultsTe <- rbind(allResultsTe, as.vector(c(i, tmpAcc)))
}