
CSci 8980: Data Mining (Fall 2002)


Page 1: CSci 8980: Data Mining (Fall 2002)

CSci 8980: Data Mining (Fall 2002)

Vipin Kumar

Army High Performance Computing Research Center
Department of Computer Science

University of Minnesota

http://www.cs.umn.edu/~kumar

Page 2: CSci 8980: Data Mining (Fall 2002)

Model Evaluation

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance among competing models?

Page 3: CSci 8980: Data Mining (Fall 2002)

Metrics for Performance Evaluation

Focus on the predictive capability of a model
– Rather than how long it takes to classify or build models, scalability, etc.

Confusion Matrix:

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a           b
CLASS     Class=No     c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)

Page 4: CSci 8980: Data Mining (Fall 2002)

Metrics for Performance Evaluation…

Most widely-used metric:

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a (TP)      b (FN)
CLASS     Class=No     c (FP)      d (TN)

    Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Page 5: CSci 8980: Data Mining (Fall 2002)

Cost Matrix

                       PREDICTED CLASS
C(i|j)                 Class=Yes     Class=No
ACTUAL    Class=Yes    C(Yes|Yes)    C(No|Yes)
CLASS     Class=No     C(Yes|No)     C(No|No)

C(i|j): Cost of misclassifying class j example as class i

Accuracy is a useful measure if C(Yes|No) = C(No|Yes) and C(Yes|Yes) = C(No|No), and if P(Yes) = P(No) (the class distributions are equal)

Page 6: CSci 8980: Data Mining (Fall 2002)

Cost vs Accuracy

Cost Matrix:

                  PREDICTED CLASS
C(i|j)            +      -
ACTUAL    +       -1     100
CLASS     -       1      0

Model M1:

                  PREDICTED CLASS
                  +      -
ACTUAL    +       150    40
CLASS     -       60     250

Accuracy = 80%, Cost = 3910

Model M2:

                  PREDICTED CLASS
                  +      -
ACTUAL    +       250    45
CLASS     -       5      200

Accuracy = 90%, Cost = 4255
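To make the cost computation concrete, here is a minimal sketch (plain Python; the (a, b, c, d) cell ordering is my own convention, not from the slides) that reproduces the accuracy and cost figures for M1 and M2:

    # Confusion matrix as (a, b, c, d) = (TP, FN, FP, TN); cost matrix aligned the same way.
    def accuracy(a, b, c, d):
        return (a + d) / (a + b + c + d)

    def total_cost(conf, cost):
        # Sum of count * cost over the four cells.
        return sum(n * w for n, w in zip(conf, cost))

    cost_matrix = (-1, 100, 1, 0)   # C(+|+), C(-|+), C(+|-), C(-|-)
    m1 = (150, 40, 60, 250)
    m2 = (250, 45, 5, 200)

    print(accuracy(*m1), total_cost(m1, cost_matrix))   # 0.8  3910
    print(accuracy(*m2), total_cost(m2, cost_matrix))   # 0.9  4255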

Page 7: CSci 8980: Data Mining (Fall 2002)

Cost-Sensitive Measures

    Precision (p) = a / (a + c)

    Recall (r) = a / (a + b)

    F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

Precision is biased towards C(Yes|Yes) & C(Yes|No)
Recall is biased towards C(Yes|Yes) & C(No|Yes)
F-measure is biased towards all except C(No|No)

    Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
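As a minimal sketch of these measures (plain Python; applied to model M1 from the previous slide for illustration):

    def precision(a, b, c, d):
        return a / (a + c)    # TP / (TP + FP)

    def recall(a, b, c, d):
        return a / (a + b)    # TP / (TP + FN)

    def f_measure(a, b, c, d):
        return 2 * a / (2 * a + b + c)   # harmonic mean of precision and recall

    # Model M1: a=150, b=40, c=60, d=250
    print(precision(150, 40, 60, 250))   # 150/210 ≈ 0.714
    print(recall(150, 40, 60, 250))      # 150/190 ≈ 0.789
    print(f_measure(150, 40, 60, 250))   # 300/400 = 0.75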

Page 8: CSci 8980: Data Mining (Fall 2002)

Methods for Performance Evaluation

How to obtain a reliable estimate of performance?

Performance of a model may depend on other factors besides the learning algorithm:
– Class distribution
– Cost of misclassification
– Size of training and test sets

Page 9: CSci 8980: Data Mining (Fall 2002)

Learning Curve

Learning curve shows how accuracy changes with varying sample size

Requires a sampling schedule for creating the learning curve:
– Arithmetic sampling (Langley et al.)
– Geometric sampling (Provost et al.)

Effect of small sample size:
– Bias in the estimate
– Variance of the estimate

Page 10: CSci 8980: Data Mining (Fall 2002)

Methods of Estimation

Holdout
– Reserve 2/3 for training and 1/3 for testing

Random subsampling
– Repeated holdout

Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one (see the sketch after this list)
– Leave-one-out: k = n

Stratified sampling
– Oversampling vs undersampling

Bootstrap
– Sampling with replacement
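To make k-fold cross validation concrete, here is a minimal index-splitting sketch (plain Python; assigning folds by stride is my own choice):

    def k_fold_indices(n, k):
        """Split indices 0..n-1 into k disjoint test folds (by stride)."""
        for i in range(k):
            test = list(range(i, n, k))
            train = [j for j in range(n) if j % k != i]
            yield train, test

    # 10 examples, 5 folds: each example lands in exactly one test fold.
    for train, test in k_fold_indices(10, 5):
        print(test, train)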

Page 11: CSci 8980: Data Mining (Fall 2002)

ROC (Receiver Operating Characteristic)

Developed in the 1950s for signal detection theory to analyze noisy signals
– Characterizes the trade-off between positive hits and false alarms

ROC curve plots TP rate (on the y-axis) against FP rate (on the x-axis)

Performance of each classifier is represented as a point on the ROC curve
– Changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point

Page 12: CSci 8980: Data Mining (Fall 2002)

ROC Curve

– 1-dimensional data set containing 2 classes (positive and negative)
– Any point located at x > t is classified as positive

At threshold t: TP=0.5, FN=0.5, FP=0.12, TN=0.88

Page 13: CSci 8980: Data Mining (Fall 2002)

ROC Curve

(TP,FP):
– (0,0): declare everything to be negative class
– (1,1): declare everything to be positive class
– (1,0): ideal

Diagonal line:
– Random guessing
– Below diagonal line: prediction is opposite of the true class

Page 14: CSci 8980: Data Mining (Fall 2002)

Using ROC for Model Comparison

No model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR

Area Under the ROC curve
– Ideal: Area = 1
– Random guess: Area = 0.5

Page 15: CSci 8980: Data Mining (Fall 2002)

How to Construct an ROC curve

Instance P(+|A) True Class

1 0.95 +

2 0.93 +

3 0.87 -

4 0.85 -

5 0.85 -

6 0.85 +

7 0.76 -

8 0.53 +

9 0.43 -

10 0.25 +

• Use a classifier that produces a posterior probability P(+|A) for each test instance A

• Sort the instances according to P(+|A) in decreasing order

• Apply a threshold at each unique value of P(+|A)

• Count the number of TP, FP, TN, FN at each threshold

• TP rate, TPR = TP/(TP+FN)

• FP rate, FPR = FP/(FP+TN)

Page 16: CSci 8980: Data Mining (Fall 2002)

How to construct an ROC curve

Class          +     -     +     -     -     -     +     -     +     +
Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve: plot the (FPR, TPR) pairs above.
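A minimal sketch of this construction (plain Python; variable names are mine). Tied scores, like the three 0.85s, collapse to a single threshold and hence a single ROC point:

    scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
    labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']

    P = sum(1 for y in labels if y == '+')   # 5 positives
    N = len(labels) - P                      # 5 negatives

    # Apply a threshold at each unique score (plus one above the maximum).
    thresholds = sorted(set(scores)) + [1.00]
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '+')
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '-')
        print(t, tp / P, fp / N)   # threshold, TPR, FPR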

Page 17: CSci 8980: Data Mining (Fall 2002)

Test of Significance

Given two models:
– Model M1: accuracy = 85%, tested on 30 instances
– Model M2: accuracy = 75%, tested on 5000 instances

Can we say M1 is better than M2?
– How much confidence can we place on the accuracy of M1 and M2?
– Can the difference in performance be explained as a result of random fluctuations in the test set?

Page 18: CSci 8980: Data Mining (Fall 2002)

Confidence Interval for Accuracy

Prediction can be regarded as a Bernoulli trial
– A Bernoulli trial has 2 possible outcomes
– Possible outcomes for prediction: correct or wrong
– A collection of Bernoulli trials has a Binomial distribution: x ~ Bin(N, p), where x is the number of correct predictions
  e.g.: toss a fair coin 50 times; how many heads would turn up?
  Expected number of heads = N·p = 50 × 0.5 = 25

Given x (# of correct predictions), or equivalently acc = x/N, and N (# of test instances), can we predict p (the true accuracy of the model)?

Page 19: CSci 8980: Data Mining (Fall 2002)

Confidence Interval for Accuracy

For large test sets (N > 30),
– acc has a normal distribution with mean p and variance p(1-p)/N:

    P( -Z_{α/2} ≤ (acc - p) / sqrt(p(1-p)/N) ≤ Z_{1-α/2} ) = 1 - α

(Z_{α/2} and Z_{1-α/2} bound an area of 1-α under the standard normal curve.)

Confidence Interval for p:

    p = ( 2·N·acc + Z_{α/2}² ± Z_{α/2}·sqrt( Z_{α/2}² + 4·N·acc - 4·N·acc² ) ) / ( 2·(N + Z_{α/2}²) )

Page 20: CSci 8980: Data Mining (Fall 2002)

Confidence Interval for Accuracy

Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
– N = 100, acc = 0.8
– Let 1-α = 0.95 (95% confidence)
– From the probability table, Z_{α/2} = 1.96

1-α     Z
0.99    2.58
0.98    2.33
0.95    1.96
0.90    1.65

N          50      100     500     1000    5000
p(lower)   0.670   0.711   0.763   0.774   0.789
p(upper)   0.888   0.866   0.833   0.824   0.811
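A minimal sketch (plain Python, standard library only) that reproduces these rows from the interval formula on the previous slide; the function name is mine:

    import math

    def accuracy_interval(acc, n, z):
        """Confidence interval for the true accuracy p (normal approximation)."""
        center = 2 * n * acc + z * z
        spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
        denom = 2 * (n + z * z)
        return (center - spread) / denom, (center + spread) / denom

    for n in (50, 100, 500, 1000, 5000):
        lo, hi = accuracy_interval(0.8, n, 1.96)
        print(n, round(lo, 3), round(hi, 3))   # e.g. 100 -> 0.711, 0.866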

Page 21: CSci 8980: Data Mining (Fall 2002)

Comparing Performance of 2 Models

Given two models, say M1 and M2, which is better?
– M1 is tested on D1 (size = n1), found error rate = e1
– M2 is tested on D2 (size = n2), found error rate = e2
– Assume D1 and D2 are independent
– If n1 and n2 are sufficiently large, then

    e1 ~ N(μ1, σ1²),  e2 ~ N(μ2, σ2²)

– Approximate each variance by:

    σ̂i² = ei·(1 - ei) / ni

Page 22: CSci 8980: Data Mining (Fall 2002)

Comparing Performance of 2 Models

To test if the performance difference is statistically significant: d = e1 – e2
– d ~ N(dt, σt²), where dt is the true difference
– Since D1 and D2 are independent, their variances add up:

    σt² = σ1² + σ2² ≈ σ̂1² + σ̂2² = e1·(1 - e1)/n1 + e2·(1 - e2)/n2

– At the (1-α) confidence level,

    dt = d ± Z_{α/2}·σ̂t

Page 23: CSci 8980: Data Mining (Fall 2002)

An Illustrative Example

Given: M1: n1 = 30, e1 = 0.15
       M2: n2 = 5000, e2 = 0.25

d = |e2 – e1| = 0.1 (2-sided test)

    σ̂d² = 0.15·(1 - 0.15)/30 + 0.25·(1 - 0.25)/5000 = 0.0043

At 95% confidence level, Z_{α/2} = 1.96:

    dt = 0.100 ± 1.96·sqrt(0.0043) = 0.100 ± 0.128

=> Interval contains 0 => the difference may not be statistically significant
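The same test as a minimal Python sketch (function name is mine):

    import math

    def difference_interval(e1, n1, e2, n2, z=1.96):
        """Confidence interval for the true difference in error rates."""
        d = abs(e2 - e1)
        var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
        margin = z * math.sqrt(var)
        return d - margin, d + margin

    lo, hi = difference_interval(0.15, 30, 0.25, 5000)
    print(lo, hi)   # ≈ -0.028 .. 0.228: contains 0, so not clearly significant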

Page 24: CSci 8980: Data Mining (Fall 2002)

Comparing Performance of 2 Algorithms

Each learning algorithm may produce k models:
– L1 may produce M11, M12, …, M1k
– L2 may produce M21, M22, …, M2k

If the models are generated on the same test sets D1, D2, …, Dk (e.g., via cross-validation):
– For each test set: compute dj = e1j – e2j
– dj has mean dt and variance σt²
– Estimate:

    σ̂t² = Σ_{j=1..k} (dj - d̄)² / (k·(k-1))

    dt = d̄ ± t_{1-α, k-1}·σ̂t
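A minimal paired-comparison sketch (plain Python; the per-fold error rates are made up for illustration, and the critical value t ≈ 2.776 for k-1 = 4 degrees of freedom comes from a t-table):

    import math

    def paired_interval(errs1, errs2, t_crit):
        """Interval for the true difference from per-fold error rates."""
        k = len(errs1)
        diffs = [a - b for a, b in zip(errs1, errs2)]
        d_bar = sum(diffs) / k
        var = sum((d - d_bar) ** 2 for d in diffs) / (k * (k - 1))
        margin = t_crit * math.sqrt(var)
        return d_bar - margin, d_bar + margin

    # Hypothetical 5-fold error rates for two algorithms.
    print(paired_interval([0.20, 0.22, 0.18, 0.25, 0.21],
                          [0.18, 0.21, 0.20, 0.22, 0.19], 2.776))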

Page 25: CSci 8980: Data Mining (Fall 2002)

What is Cluster Analysis?

Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
– Based on information found in the data that describes the objects and their relationships.
– Also known as unsupervised classification.

Many applications
– Understanding: group related documents for browsing, or find genes and proteins that have similar functionality.
– Summarization: reduce the size of large data sets.

Page 26: CSci 8980: Data Mining (Fall 2002)

What is not Cluster Analysis?

Supervised classification
– Has class label information.

Simple segmentation
– Dividing students into different registration groups alphabetically, by last name.

Results of a query
– Groupings are a result of an external specification.

Graph partitioning
– Some mutual relevance and synergy, but the areas are not identical.

Page 27: CSci 8980: Data Mining (Fall 2002)

Notion of a Cluster is Ambiguous

[Figure: the same set of initial points shown grouped as two clusters, four clusters, and six clusters.]

Page 28: CSci 8980: Data Mining (Fall 2002)

Types of Clusterings

A clustering is a set of clusters.

One important distinction is between hierarchical and partitional sets of clusters.

Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree.

Page 29: CSci 8980: Data Mining (Fall 2002)

Partitional Clustering

[Figure: original points, and a partitional clustering of them.]

Page 30: CSci 8980: Data Mining (Fall 2002)

Hierarchical Clustering

[Figures: a traditional hierarchical clustering of points p1-p4 with its traditional dendrogram, and a non-traditional hierarchical clustering with its non-traditional dendrogram.]

Page 31: CSci 8980: Data Mining (Fall 2002)

Other Distinctions Between Sets of Clusters

Exclusive versus non-exclusive
– In non-exclusive clusterings, points may belong to multiple clusters.
– Can represent multiple classes or 'border' points.

Fuzzy versus non-fuzzy
– In fuzzy clusterings, a point belongs to every cluster with some weight between 0 and 1.
– Weights must sum to 1.
– Probabilistic clustering has similar characteristics.

Partial versus complete
– In some cases, we only want to cluster some of the data.

Page 32: CSci 8980: Data Mining (Fall 2002)

Types of Clusters: Well-Separated

Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

Page 33: CSci 8980: Data Mining (Fall 2002)

Types of Clusters: Center-Based

Center-based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster.
– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster.

Page 34: CSci 8980: Data Mining (Fall 2002)

Types of Clusters: Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

Page 35: CSci 8980: Data Mining (Fall 2002)

Types of Clusters: Density-Based

Density-based
– A cluster is a dense region of points, separated from other regions of high density by low-density regions.
– Used when the clusters are irregular or intertwined, and when noise and outliers are present.
– The three curves don't form clusters since they fade into the noise, as does the bridge between the two small circular clusters.

Page 36: CSci 8980: Data Mining (Fall 2002)

Similarity and Dissimilarity

Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1].

Dissimilarity
– Numerical measure of how different two data objects are.
– Is lower when objects are more alike.
– Minimum dissimilarity is often 0.
– Upper limit varies.

Proximity refers to either a similarity or a dissimilarity.

Page 37: CSci 8980: Data Mining (Fall 2002)

Summary of Similarity/Dissimilarity for Simple Attributes

p and q are the attribute values for two data objects.

Page 38: CSci 8980: Data Mining (Fall 2002)

Euclidean Distance

Euclidean Distance

    dist = sqrt( Σ_{k=1..n} (p_k - q_k)² )

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Standardization is necessary if scales differ.

Page 39: CSci 8980: Data Mining (Fall 2002)

Euclidean Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance Matrix:

       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
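A minimal sketch that reproduces this distance matrix (plain Python, no external libraries):

    import math

    points = {'p1': (0, 2), 'p2': (2, 0), 'p3': (3, 1), 'p4': (5, 1)}

    def euclidean(p, q):
        # sqrt of the sum of squared coordinate differences
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    for name, p in points.items():
        row = [round(euclidean(p, q), 3) for q in points.values()]
        print(name, row)   # first row: [0.0, 2.828, 3.162, 5.099]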

Page 40: CSci 8980: Data Mining (Fall 2002)

Minkowski Distance

Minkowski Distance is a generalization of Euclidean Distance

    dist = ( Σ_{k=1..n} |p_k - q_k|^r )^(1/r)

where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Page 41: CSci 8980: Data Mining (Fall 2002)

Minkowski Distance: Examples

r = 1: City block (Manhattan, taxicab, L1 norm) distance.
– A common example is the Hamming distance, which is just the number of bits that differ between two binary vectors.

r = 2: Euclidean distance.

r → ∞: "supremum" (L_max norm, L_∞ norm) distance.
– This is the maximum difference between any component of the vectors.

Do not confuse r with n; all these distances are defined for all numbers of dimensions.

Page 42: CSci 8980: Data Mining (Fall 2002)

Minkowski Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance Matrices:

L1     p1   p2   p3   p4
p1     0    4    4    6
p2     4    0    2    4
p3     4    2    0    2
p4     6    4    2    0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1   p2   p3   p4
p1     0    2    3    5
p2     2    0    1    3
p3     3    1    0    2
p4     5    3    2    0
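A minimal generalization of the earlier Euclidean sketch to Minkowski distance (plain Python; treating r = infinity as the maximum component difference, per the supremum definition above):

    def minkowski(p, q, r):
        diffs = [abs(a - b) for a, b in zip(p, q)]
        if r == float('inf'):
            return max(diffs)                  # supremum (L-infinity) norm
        return sum(d ** r for d in diffs) ** (1 / r)

    p1, p4 = (0, 2), (5, 1)
    print(minkowski(p1, p4, 1))                 # 6     (city block)
    print(round(minkowski(p1, p4, 2), 3))       # 5.099 (Euclidean)
    print(minkowski(p1, p4, float('inf')))      # 5     (supremum)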

Page 43: CSci 8980: Data Mining (Fall 2002)

Common Properties of a Distance

Distances, such as the Euclidean distance, have some well-known properties:

1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)

where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

A distance that satisfies these properties is a metric.

Page 44: CSci 8980: Data Mining (Fall 2002)

Common Properties of a Similarity

Similarities also have some well-known properties:

1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data objects) p and q.

Page 45: CSci 8980: Data Mining (Fall 2002)

Similarity Between Binary Vectors

A common situation is that objects, p and q, have only binary attributes.

Compute similarities using the following quantities:
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1

Simple Matching and Jaccard Coefficients:
SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 11 matches / number of not-both-zero attribute values
  = M11 / (M01 + M10 + M11)

Page 46: CSci 8980: Data Mining (Fall 2002)

SMC versus Jaccard: Example

p = 1 0 0 0 0 0 0 0 0 0

q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)

M10 = 1 (the number of attributes where p was 1 and q was 0)

M00 = 7 (the number of attributes where p was 0 and q was 0)

M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7

J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
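A minimal sketch reproducing these numbers (plain Python):

    p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)

    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jaccard = m11 / (m01 + m10 + m11)
    print(smc, jaccard)   # 0.7 0.0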

Page 47: CSci 8980: Data Mining (Fall 2002)

Cosine Similarity

If d1 and d2 are two document vectors, then

    cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),

where • indicates the vector dot product and ||d|| is the length of vector d.

Example:

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 × 2.449) = 0.3150
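The same computation as a minimal Python sketch:

    import math

    d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
    d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

    dot = sum(a * b for a, b in zip(d1, d2))    # 5
    len1 = math.sqrt(sum(a * a for a in d1))    # 6.481
    len2 = math.sqrt(sum(b * b for b in d2))    # 2.449
    print(dot / (len1 * len2))                  # ≈ 0.315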

Page 48: CSci 8980: Data Mining (Fall 2002)

Extended Jaccard Coefficient (Tanimoto)

Variation of Jaccard for continuous or count attributes– Reduces to Jaccard for binary attributes

Page 49: CSci 8980: Data Mining (Fall 2002)

Correlation

Correlation measures the linear relationship between objects. To compute correlation, we standardize the data objects, p and q, and then take the dot product:

    p'_k = (p_k - mean(p)) / std(p)
    q'_k = (q_k - mean(q)) / std(q)

    correlation(p, q) = p' • q'
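A minimal sketch of this recipe (plain Python). The slide's normalization constant did not survive extraction; dividing the dot product by n-1, to match the sample standard deviation and the usual Pearson correlation, is my assumption:

    import math

    def standardize(v):
        n = len(v)
        mean = sum(v) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in v) / (n - 1))
        return [(x - mean) / std for x in v]

    def correlation(p, q):
        n = len(p)
        return sum(a * b for a, b in zip(standardize(p), standardize(q))) / (n - 1)

    print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0 (perfectly linear)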

Page 50: CSci 8980: Data Mining (Fall 2002)

Visually Evaluating Correlation

Scatter plots showing the similarity from –1 to 1.

Page 51: CSci 8980: Data Mining (Fall 2002)

Mahalanobis Distance

    mahalanobis(p, q) = (p - q) Σ⁻¹ (p - q)ᵀ

where Σ is the covariance matrix of the input data.

For the red points, the Euclidean distance is 14.7, the Mahalanobis distance is 6.
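A minimal numpy sketch (the covariance matrix and points here are made up for illustration; in practice Σ is estimated from the whole data set, and the figure's 14.7 vs 6 values are not reproduced):

    import numpy as np

    def mahalanobis(p, q, cov):
        d = np.asarray(p) - np.asarray(q)
        return float(d @ np.linalg.inv(cov) @ d)   # (p-q) Σ^{-1} (p-q)^T

    # Hypothetical covariance for two positively correlated attributes.
    cov = np.array([[0.3, 0.2],
                    [0.2, 0.3]])
    print(mahalanobis((0.5, 0.5), (1.5, 1.5), cov))   # 4.0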

Page 52: CSci 8980: Data Mining (Fall 2002)

A General Approach for Combining Similarities

Sometimes attributes are of many different types, but an overall similarity is needed.

Page 53: CSci 8980: Data Mining (Fall 2002)

Using Weights to Combine Similarities

May not want to treat all attributes the same.
– Use weights w_k which are between 0 and 1 and sum to 1.