
CSci 8980: Data Mining (Fall 2002)


Page 1: CSci 8980: Data Mining (Fall 2002)

CSci 8980: Data Mining (Fall 2002)

Vipin Kumar

Army High Performance Computing Research Center
Department of Computer Science

University of Minnesota

http://www.cs.umn.edu/~kumar

Page 2: CSci 8980: Data Mining (Fall 2002)

Model Evaluation

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance among competing models?

Page 3: CSci 8980: Data Mining (Fall 2002)

Metrics for Performance Evaluation

Focus on the predictive capability of a model
– Rather than how long it takes to classify or build models, scalability, etc.

Confusion Matrix:

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a           b
CLASS     Class=No     c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)

Page 4: CSci 8980: Data Mining (Fall 2002)

Metrics for Performance Evaluation…

Most widely-used metric:

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a (TP)      b (FN)
CLASS     Class=No     c (FP)      d (TN)

    Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Page 5: CSci 8980: Data Mining (Fall 2002)

Cost Matrix

                       PREDICTED CLASS
C(i|j)                 Class=Yes     Class=No
ACTUAL    Class=Yes    C(Yes|Yes)    C(No|Yes)
CLASS     Class=No     C(Yes|No)     C(No|No)

C(i|j): Cost of misclassifying class j example as class i

Accuracy is a useful measure if C(Yes|No) = C(No|Yes) and C(Yes|Yes) = C(No|No), and if P(Yes) = P(No) (the class distributions are equal)

Page 6: CSci 8980: Data Mining (Fall 2002)

Cost vs Accuracy

Cost Matrix:

                  PREDICTED CLASS
C(i|j)            +      -
ACTUAL    +       -1     100
CLASS     -       1      0

Model M1:

                  PREDICTED CLASS
                  +      -
ACTUAL    +       150    40
CLASS     -       60     250

Accuracy = 80%, Cost = 3910

Model M2:

                  PREDICTED CLASS
                  +      -
ACTUAL    +       250    45
CLASS     -       5      200

Accuracy = 90%, Cost = 4255
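To make the cost computation concrete, here is a minimal sketch (plain Python; the (a, b, c, d) cell ordering is my own convention, not from the slides) that reproduces the accuracy and cost figures for M1 and M2:

    # Confusion matrix as (a, b, c, d) = (TP, FN, FP, TN); cost matrix aligned the same way.
    def accuracy(a, b, c, d):
        return (a + d) / (a + b + c + d)

    def total_cost(conf, cost):
        # Sum of count * cost over the four cells.
        return sum(n * w for n, w in zip(conf, cost))

    cost_matrix = (-1, 100, 1, 0)   # C(+|+), C(-|+), C(+|-), C(-|-)
    m1 = (150, 40, 60, 250)
    m2 = (250, 45, 5, 200)

    print(accuracy(*m1), total_cost(m1, cost_matrix))   # 0.8  3910
    print(accuracy(*m2), total_cost(m2, cost_matrix))   # 0.9  4255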

Page 7: CSci 8980: Data Mining (Fall 2002)

Cost-Sensitive Measures

    Precision (p) = a / (a + c)

    Recall (r) = a / (a + b)

    F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

Precision is biased towards C(Yes|Yes) & C(Yes|No)
Recall is biased towards C(Yes|Yes) & C(No|Yes)
F-measure is biased towards all except C(No|No)

    Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
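As a minimal sketch of these measures (plain Python; applied to model M1 from the previous slide for illustration):

    def precision(a, b, c, d):
        return a / (a + c)    # TP / (TP + FP)

    def recall(a, b, c, d):
        return a / (a + b)    # TP / (TP + FN)

    def f_measure(a, b, c, d):
        return 2 * a / (2 * a + b + c)   # harmonic mean of precision and recall

    # Model M1: a=150, b=40, c=60, d=250
    print(precision(150, 40, 60, 250))   # 150/210 ≈ 0.714
    print(recall(150, 40, 60, 250))      # 150/190 ≈ 0.789
    print(f_measure(150, 40, 60, 250))   # 300/400 = 0.75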

Page 8: CSci 8980: Data Mining (Fall 2002)

Methods for Performance Evaluation

How to obtain a reliable estimate of performance?

Performance of a model may depend on other factors besides the learning algorithm:
– Class distribution
– Cost of misclassification
– Size of training and test sets

Page 9: CSci 8980: Data Mining (Fall 2002)

Learning Curve

Learning curve shows how accuracy changes with varying sample size

Requires a sampling schedule for creating the learning curve:
– Arithmetic sampling (Langley et al.)
– Geometric sampling (Provost et al.)

Effect of small sample size:
– Bias in the estimate
– Variance of the estimate

Page 10: CSci 8980: Data Mining (Fall 2002)

Methods of Estimation

Holdout
– Reserve 2/3 for training and 1/3 for testing

Random subsampling
– Repeated holdout

Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one (see the sketch after this list)
– Leave-one-out: k = n

Stratified sampling
– Oversampling vs undersampling

Bootstrap
– Sampling with replacement
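To make k-fold cross validation concrete, here is a minimal index-splitting sketch (plain Python; assigning folds by stride is my own choice):

    def k_fold_indices(n, k):
        """Split indices 0..n-1 into k disjoint test folds (by stride)."""
        for i in range(k):
            test = list(range(i, n, k))
            train = [j for j in range(n) if j % k != i]
            yield train, test

    # 10 examples, 5 folds: each example lands in exactly one test fold.
    for train, test in k_fold_indices(10, 5):
        print(test, train)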

Page 11: CSci 8980: Data Mining (Fall 2002)

ROC (Receiver Operating Characteristic)

Developed in the 1950s for signal detection theory to analyze noisy signals
– Characterizes the trade-off between positive hits and false alarms

ROC curve plots TP rate (on the y-axis) against FP rate (on the x-axis)

Performance of each classifier is represented as a point on the ROC curve
– Changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point

Page 12: CSci 8980: Data Mining (Fall 2002)

ROC Curve

– 1-dimensional data set containing 2 classes (positive and negative)
– Any point located at x > t is classified as positive

At threshold t: TP=0.5, FN=0.5, FP=0.12, TN=0.88

Page 13: CSci 8980: Data Mining (Fall 2002)

ROC Curve

(TP,FP):
– (0,0): declare everything to be negative class
– (1,1): declare everything to be positive class
– (1,0): ideal

Diagonal line:
– Random guessing
– Below diagonal line: prediction is opposite of the true class

Page 14: CSci 8980: Data Mining (Fall 2002)

Using ROC for Model Comparison

No model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR

Area Under the ROC curve
– Ideal: Area = 1
– Random guess: Area = 0.5

Page 15: CSci 8980: Data Mining (Fall 2002)

How to Construct an ROC curve

Instance P(+|A) True Class

1 0.95 +

2 0.93 +

3 0.87 -

4 0.85 -

5 0.85 -

6 0.85 +

7 0.76 -

8 0.53 +

9 0.43 -

10 0.25 +

• Use a classifier that produces a posterior probability P(+|A) for each test instance A

• Sort the instances according to P(+|A) in decreasing order

• Apply a threshold at each unique value of P(+|A)

• Count the number of TP, FP, TN, FN at each threshold

• TP rate, TPR = TP/(TP+FN)

• FP rate, FPR = FP/(FP+TN)

Page 16: CSci 8980: Data Mining (Fall 2002)

How to construct an ROC curve

Class          +     -     +     -     -     -     +     -     +     +
Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve: plot the (FPR, TPR) pairs above.
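A minimal sketch of this construction (plain Python; variable names are mine). Tied scores, like the three 0.85s, collapse to a single threshold and hence a single ROC point:

    scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
    labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']

    P = sum(1 for y in labels if y == '+')   # 5 positives
    N = len(labels) - P                      # 5 negatives

    # Apply a threshold at each unique score (plus one above the maximum).
    thresholds = sorted(set(scores)) + [1.00]
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '+')
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '-')
        print(t, tp / P, fp / N)   # threshold, TPR, FPR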

Page 17: CSci 8980: Data Mining (Fall 2002)

Test of Significance

Given two models:
– Model M1: accuracy = 85%, tested on 30 instances
– Model M2: accuracy = 75%, tested on 5000 instances

Can we say M1 is better than M2?
– How much confidence can we place on the accuracy of M1 and M2?
– Can the difference in performance be explained as a result of random fluctuations in the test set?

Page 18: CSci 8980: Data Mining (Fall 2002)

Confidence Interval for Accuracy

Prediction can be regarded as a Bernoulli trial
– A Bernoulli trial has 2 possible outcomes
– Possible outcomes for prediction: correct or wrong
– A collection of Bernoulli trials has a Binomial distribution: x ~ Bin(N, p), where x is the number of correct predictions
  e.g.: toss a fair coin 50 times; how many heads would turn up?
  Expected number of heads = N·p = 50 × 0.5 = 25

Given x (# of correct predictions), or equivalently acc = x/N, and N (# of test instances), can we predict p (the true accuracy of the model)?

Page 19: CSci 8980: Data Mining (Fall 2002)

Confidence Interval for Accuracy

For large test sets (N > 30),
– acc has a normal distribution with mean p and variance p(1-p)/N:

    P( -Z_{α/2} ≤ (acc - p) / sqrt(p(1-p)/N) ≤ Z_{1-α/2} ) = 1 - α

(Z_{α/2} and Z_{1-α/2} bound an area of 1-α under the standard normal curve.)

Confidence Interval for p:

    p = ( 2·N·acc + Z_{α/2}² ± Z_{α/2}·sqrt( Z_{α/2}² + 4·N·acc - 4·N·acc² ) ) / ( 2·(N + Z_{α/2}²) )

Page 20: CSci 8980: Data Mining (Fall 2002)

Confidence Interval for Accuracy

Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
– N = 100, acc = 0.8
– Let 1-α = 0.95 (95% confidence)
– From the probability table, Z_{α/2} = 1.96

1-α     Z
0.99    2.58
0.98    2.33
0.95    1.96
0.90    1.65

N          50      100     500     1000    5000
p(lower)   0.670   0.711   0.763   0.774   0.789
p(upper)   0.888   0.866   0.833   0.824   0.811
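A minimal sketch (plain Python, standard library only) that reproduces these rows from the interval formula on the previous slide; the function name is mine:

    import math

    def accuracy_interval(acc, n, z):
        """Confidence interval for the true accuracy p (normal approximation)."""
        center = 2 * n * acc + z * z
        spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
        denom = 2 * (n + z * z)
        return (center - spread) / denom, (center + spread) / denom

    for n in (50, 100, 500, 1000, 5000):
        lo, hi = accuracy_interval(0.8, n, 1.96)
        print(n, round(lo, 3), round(hi, 3))   # e.g. 100 -> 0.711, 0.866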

Page 21: CSci 8980: Data Mining (Fall 2002)

Comparing Performance of 2 Models

Given two models, say M1 and M2, which is better?
– M1 is tested on D1 (size = n1), found error rate = e1
– M2 is tested on D2 (size = n2), found error rate = e2
– Assume D1 and D2 are independent
– If n1 and n2 are sufficiently large, then

    e1 ~ N(μ1, σ1²),  e2 ~ N(μ2, σ2²)

– Approximate each variance by:

    σ̂i² = ei·(1 - ei) / ni

Page 22: CSci 8980: Data Mining (Fall 2002)

Comparing Performance of 2 Models

To test if the performance difference is statistically significant: d = e1 – e2
– d ~ N(dt, σt²), where dt is the true difference
– Since D1 and D2 are independent, their variances add up:

    σt² = σ1² + σ2² ≈ σ̂1² + σ̂2² = e1·(1 - e1)/n1 + e2·(1 - e2)/n2

– At the (1-α) confidence level,

    dt = d ± Z_{α/2}·σ̂t

Page 23: CSci 8980: Data Mining (Fall 2002)

An Illustrative Example

Given: M1: n1 = 30, e1 = 0.15
       M2: n2 = 5000, e2 = 0.25

d = |e2 – e1| = 0.1 (2-sided test)

    σ̂d² = 0.15·(1 - 0.15)/30 + 0.25·(1 - 0.25)/5000 = 0.0043

At 95% confidence level, Z_{α/2} = 1.96:

    dt = 0.100 ± 1.96·sqrt(0.0043) = 0.100 ± 0.128

=> Interval contains 0 => the difference may not be statistically significant
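The same test as a minimal Python sketch (function name is mine):

    import math

    def difference_interval(e1, n1, e2, n2, z=1.96):
        """Confidence interval for the true difference in error rates."""
        d = abs(e2 - e1)
        var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
        margin = z * math.sqrt(var)
        return d - margin, d + margin

    lo, hi = difference_interval(0.15, 30, 0.25, 5000)
    print(lo, hi)   # ≈ -0.028 .. 0.228: contains 0, so not clearly significant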

Page 24: CSci 8980: Data Mining (Fall 2002)

Comparing Performance of 2 Algorithms

Each learning algorithm may produce k models:
– L1 may produce M11, M12, …, M1k
– L2 may produce M21, M22, …, M2k

If the models are generated on the same test sets D1, D2, …, Dk (e.g., via cross-validation):
– For each test set: compute dj = e1j – e2j
– dj has mean dt and variance σt²
– Estimate:

    σ̂t² = Σ_{j=1..k} (dj - d̄)² / (k·(k-1))

    dt = d̄ ± t_{1-α, k-1}·σ̂t
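A minimal paired-comparison sketch (plain Python; the per-fold error rates are made up for illustration, and the critical value t ≈ 2.776 for k-1 = 4 degrees of freedom comes from a t-table):

    import math

    def paired_interval(errs1, errs2, t_crit):
        """Interval for the true difference from per-fold error rates."""
        k = len(errs1)
        diffs = [a - b for a, b in zip(errs1, errs2)]
        d_bar = sum(diffs) / k
        var = sum((d - d_bar) ** 2 for d in diffs) / (k * (k - 1))
        margin = t_crit * math.sqrt(var)
        return d_bar - margin, d_bar + margin

    # Hypothetical 5-fold error rates for two algorithms.
    print(paired_interval([0.20, 0.22, 0.18, 0.25, 0.21],
                          [0.18, 0.21, 0.20, 0.22, 0.19], 2.776))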

Page 25: CSci 8980: Data Mining (Fall 2002)

What is Cluster Analysis?

Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
– Based on information found in the data that describes the objects and their relationships.
– Also known as unsupervised classification.

Many applications
– Understanding: group related documents for browsing, or find genes and proteins that have similar functionality.
– Summarization: reduce the size of large data sets.

Page 26: CSci 8980: Data Mining (Fall 2002)

What is not Cluster Analysis?

Supervised classification
– Has class label information.

Simple segmentation
– Dividing students into different registration groups alphabetically, by last name.

Results of a query
– Groupings are a result of an external specification.

Graph partitioning
– Some mutual relevance and synergy, but the areas are not identical.

Page 27: CSci 8980: Data Mining (Fall 2002)

Notion of a Cluster is Ambiguous

[Figure: the same set of initial points shown grouped as two clusters, four clusters, and six clusters.]

Page 28: CSci 8980: Data Mining (Fall 2002)

Types of Clusterings

A clustering is a set of clusters.

One important distinction is between hierarchical and partitional sets of clusters.

Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree.

Page 29: CSci 8980: Data Mining (Fall 2002)

Partitional Clustering

[Figure: original points, and a partitional clustering of them.]

Page 30: CSci 8980: Data Mining (Fall 2002)

Hierarchical Clustering

[Figures: a traditional hierarchical clustering of points p1-p4 with its traditional dendrogram, and a non-traditional hierarchical clustering with its non-traditional dendrogram.]

Page 31: CSci 8980: Data Mining (Fall 2002)

Other Distinctions Between Sets of Clusters

Exclusive versus non-exclusive
– In non-exclusive clusterings, points may belong to multiple clusters.
– Can represent multiple classes or 'border' points.

Fuzzy versus non-fuzzy
– In fuzzy clusterings, a point belongs to every cluster with some weight between 0 and 1.
– Weights must sum to 1.
– Probabilistic clustering has similar characteristics.

Partial versus complete
– In some cases, we only want to cluster some of the data.

Page 32: CSci 8980: Data Mining (Fall 2002)

Types of Clusters: Well-Separated

Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

Page 33: CSci 8980: Data Mining (Fall 2002)

Types of Clusters: Center-Based

Center-based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster.
– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster.

Page 34: CSci 8980: Data Mining (Fall 2002)

Types of Clusters: Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

Page 35: CSci 8980: Data Mining (Fall 2002)

Types of Clusters: Density-Based

Density-based
– A cluster is a dense region of points, separated from other regions of high density by low-density regions.
– Used when the clusters are irregular or intertwined, and when noise and outliers are present.
– The three curves don't form clusters since they fade into the noise, as does the bridge between the two small circular clusters.

Page 36: CSci 8980: Data Mining (Fall 2002)

Similarity and Dissimilarity

Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1].

Dissimilarity
– Numerical measure of how different two data objects are.
– Is lower when objects are more alike.
– Minimum dissimilarity is often 0.
– Upper limit varies.

Proximity refers to either a similarity or a dissimilarity.

Page 37: CSci 8980: Data Mining (Fall 2002)

Summary of Similarity/Dissimilarity for Simple Attributes

p and q are the attribute values for two data objects.

Page 38: CSci 8980: Data Mining (Fall 2002)

Euclidean Distance

Euclidean Distance

    dist = sqrt( Σ_{k=1..n} (p_k - q_k)² )

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Standardization is necessary if scales differ.

Page 39: CSci 8980: Data Mining (Fall 2002)

Euclidean Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance Matrix:

       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
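A minimal sketch that reproduces this distance matrix (plain Python, no external libraries):

    import math

    points = {'p1': (0, 2), 'p2': (2, 0), 'p3': (3, 1), 'p4': (5, 1)}

    def euclidean(p, q):
        # sqrt of the sum of squared coordinate differences
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    for name, p in points.items():
        row = [round(euclidean(p, q), 3) for q in points.values()]
        print(name, row)   # first row: [0.0, 2.828, 3.162, 5.099]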

Page 40: CSci 8980: Data Mining (Fall 2002)

Minkowski Distance

Minkowski Distance is a generalization of Euclidean Distance

    dist = ( Σ_{k=1..n} |p_k - q_k|^r )^(1/r)

where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Page 41: CSci 8980: Data Mining (Fall 2002)

Minkowski Distance: Examples

r = 1: City block (Manhattan, taxicab, L1 norm) distance.
– A common example is the Hamming distance, which is just the number of bits that differ between two binary vectors.

r = 2: Euclidean distance.

r → ∞: "supremum" (L_max norm, L_∞ norm) distance.
– This is the maximum difference between any component of the vectors.

Do not confuse r with n; all these distances are defined for all numbers of dimensions.

Page 42: CSci 8980: Data Mining (Fall 2002)

Minkowski Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance Matrices:

L1     p1   p2   p3   p4
p1     0    4    4    6
p2     4    0    2    4
p3     4    2    0    2
p4     6    4    2    0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1   p2   p3   p4
p1     0    2    3    5
p2     2    0    1    3
p3     3    1    0    2
p4     5    3    2    0
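A minimal generalization of the earlier Euclidean sketch to Minkowski distance (plain Python; treating r = infinity as the maximum component difference, per the supremum definition above):

    def minkowski(p, q, r):
        diffs = [abs(a - b) for a, b in zip(p, q)]
        if r == float('inf'):
            return max(diffs)                  # supremum (L-infinity) norm
        return sum(d ** r for d in diffs) ** (1 / r)

    p1, p4 = (0, 2), (5, 1)
    print(minkowski(p1, p4, 1))                 # 6     (city block)
    print(round(minkowski(p1, p4, 2), 3))       # 5.099 (Euclidean)
    print(minkowski(p1, p4, float('inf')))      # 5     (supremum)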

Page 43: CSci 8980: Data Mining (Fall 2002)

Common Properties of a Distance

Distances, such as the Euclidean distance, have some well-known properties:

1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)

where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

A distance that satisfies these properties is a metric.

Page 44: CSci 8980: Data Mining (Fall 2002)

Common Properties of a Similarity

Similarities also have some well-known properties:

1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data objects) p and q.

Page 45: CSci 8980: Data Mining (Fall 2002)

Similarity Between Binary Vectors

A common situation is that objects, p and q, have only binary attributes.

Compute similarities using the following quantities:
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1

Simple Matching and Jaccard Coefficients:
SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 11 matches / number of not-both-zero attribute values
  = M11 / (M01 + M10 + M11)

Page 46: CSci 8980: Data Mining (Fall 2002)

SMC versus Jaccard: Example

p = 1 0 0 0 0 0 0 0 0 0

q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)

M10 = 1 (the number of attributes where p was 1 and q was 0)

M00 = 7 (the number of attributes where p was 0 and q was 0)

M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7

J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
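A minimal sketch reproducing these numbers (plain Python):

    p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)

    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jaccard = m11 / (m01 + m10 + m11)
    print(smc, jaccard)   # 0.7 0.0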

Page 47: CSci 8980: Data Mining (Fall 2002)

Cosine Similarity

If d1 and d2 are two document vectors, then

    cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),

where • indicates the vector dot product and ||d|| is the length of vector d.

Example:

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 × 2.449) = 0.3150
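The same computation as a minimal Python sketch:

    import math

    d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
    d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

    dot = sum(a * b for a, b in zip(d1, d2))    # 5
    len1 = math.sqrt(sum(a * a for a in d1))    # 6.481
    len2 = math.sqrt(sum(b * b for b in d2))    # 2.449
    print(dot / (len1 * len2))                  # ≈ 0.315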

Page 48: CSci 8980: Data Mining (Fall 2002)

Extended Jaccard Coefficient (Tanimoto)

Variation of Jaccard for continuous or count attributes– Reduces to Jaccard for binary attributes

Page 49: CSci 8980: Data Mining (Fall 2002)

Correlation

Correlation measures the linear relationship between objects. To compute correlation, we standardize the data objects, p and q, and then take the dot product:

    p'_k = (p_k - mean(p)) / std(p)
    q'_k = (q_k - mean(q)) / std(q)

    correlation(p, q) = p' • q'
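A minimal sketch of this recipe (plain Python). The slide's normalization constant did not survive extraction; dividing the dot product by n-1, to match the sample standard deviation and the usual Pearson correlation, is my assumption:

    import math

    def standardize(v):
        n = len(v)
        mean = sum(v) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in v) / (n - 1))
        return [(x - mean) / std for x in v]

    def correlation(p, q):
        n = len(p)
        return sum(a * b for a, b in zip(standardize(p), standardize(q))) / (n - 1)

    print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0 (perfectly linear)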

Page 50: CSci 8980: Data Mining (Fall 2002)

Visually Evaluating Correlation

Scatter plots showing the similarity from –1 to 1.

Page 51: CSci 8980: Data Mining (Fall 2002)

Mahalanobis Distance

    mahalanobis(p, q) = (p - q) Σ⁻¹ (p - q)ᵀ

where Σ is the covariance matrix of the input data.

For the red points, the Euclidean distance is 14.7, the Mahalanobis distance is 6.
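A minimal numpy sketch (the covariance matrix and points here are made up for illustration; in practice Σ is estimated from the whole data set, and the figure's 14.7 vs 6 values are not reproduced):

    import numpy as np

    def mahalanobis(p, q, cov):
        d = np.asarray(p) - np.asarray(q)
        return float(d @ np.linalg.inv(cov) @ d)   # (p-q) Σ^{-1} (p-q)^T

    # Hypothetical covariance for two positively correlated attributes.
    cov = np.array([[0.3, 0.2],
                    [0.2, 0.3]])
    print(mahalanobis((0.5, 0.5), (1.5, 1.5), cov))   # 4.0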

Page 52: CSci 8980: Data Mining (Fall 2002)

A General Approach for Combining Similarities

Sometimes attributes are of many different types, but an overall similarity is needed.

Page 53: CSci 8980: Data Mining (Fall 2002)

Using Weights to Combine Similarities

May not want to treat all attributes the same.
– Use weights w_k which are between 0 and 1 and sum to 1.