
Data Mining: Data

Lecture Notes for Chapter 2

Introduction to Data Mining
by Tan, Steinbach, Kumar

(Modified by P. Radivojac for I211)

What went wrong in 1936?

• Literary Digest had conducted surveys since 1920 and correctly predicted the elected president every time

• In 1936 they predicted 55% of the vote for Alf Landon and 41% for Franklin Roosevelt

• actual elections showed that Roosevelt won 61% vs. 37%

Methodology for data collection

• Literary Digest sent 10 million ballots to voters in the USA

• about 2.3 million were returned

• names obtained from phone registries and automobile licensing departments

• So, what was the problem?

What went wrong in 1936 (1)

Source: Peverill Squire, Why the 1936 Literary Digest Poll Failed.



What is Data?

Collection of data objects and their attributes

An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as feature, variable, variate

A collection of attributes describes a data point
– data point is also known as object, record, instance, or example

Example table (columns are attributes, the last column is the class; rows are data points):

Tid   Home Owner   Marital Status   Taxable Income   Cheat
1     Yes          Single           125K             No
2     No           Married          100K             No
3     No           Single           70K              No
4     Yes          Married          120K             No
5     No           Divorced         95K              Yes
6     No           Married          60K              No
7     Yes          Divorced         220K             No
8     No           Single           85K              Yes
9     No           Married          75K              No
10    No           Single           90K              Yes
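As a rough illustration of the terminology, the table on this slide can be held as a list of records, one per data point, with one key per attribute. The variable names (`columns`, `rows`, `data_points`, `class_labels`) are chosen here for illustration and are not part of the lecture notes.

```python
# Column names are the attributes; "Cheat" is the class column.
columns = ("Tid", "Home Owner", "Marital Status", "Taxable Income", "Cheat")

# Each row is one data point (object, record, instance, example).
rows = [
    (1, "Yes", "Single", "125K", "No"),
    (2, "No", "Married", "100K", "No"),
    (3, "No", "Single", "70K", "No"),
    (4, "Yes", "Married", "120K", "No"),
    (5, "No", "Divorced", "95K", "Yes"),
    (6, "No", "Married", "60K", "No"),
    (7, "Yes", "Divorced", "220K", "No"),
    (8, "No", "Single", "85K", "Yes"),
    (9, "No", "Married", "75K", "No"),
    (10, "No", "Single", "90K", "Yes"),
]

# Pair every value with its attribute name.
data_points = [dict(zip(columns, row)) for row in rows]
class_labels = [point["Cheat"] for point in data_points]

print(data_points[0]["Marital Status"])  # Single
print(class_labels.count("Yes"))         # 3
```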

Similarity and Dissimilarity

Similarity
– Numerical measure of how alike two data points are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]

Dissimilarity
– Numerical measure of how different two data points are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies

Proximity refers to a similarity or dissimilarity

Similarity/Dissimilarity for Simple Attributes

p and q are the attribute values for two data objects.

Euclidean distance in 2D

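Pythagoras' theorem: $a^2 + b^2 = c^2$, with $c = dist(p, q)$

p = (3, 5), q = (6, 1)

$$dist(p, q) = \sqrt{a^2 + b^2} = \sqrt{(6 - 3)^2 + (1 - 5)^2} = \sqrt{3^2 + (-4)^2} = \sqrt{25} = 5$$

[Figure: points p = (3, 5) and q = (6, 1) in the x1-x2 plane, joined by a right triangle with legs a and b and hypotenuse c = dist(p, q)]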
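The 2D calculation on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic; the names `a`, `b`, and `c` follow the triangle legs and hypotenuse from the slide.

```python
import math

p = (3, 5)
q = (6, 1)

a = q[0] - p[0]  # horizontal leg: 6 - 3 = 3
b = q[1] - p[1]  # vertical leg: 1 - 5 = -4

# Pythagoras' theorem: c^2 = a^2 + b^2
c = math.sqrt(a**2 + b**2)
print(c)  # 5.0
```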

Euclidean Distance in n dimensions

Euclidean Distance

$$dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$

Where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Standardization is necessary if scales differ.
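A minimal sketch of the n-dimensional formula, together with one possible standardization. The slides do not say which standardization to use; z-scoring each attribute (subtract the mean, divide by the standard deviation) is an assumption made here, and the three (age, income) points are made-up values for illustration only.

```python
import math
import statistics

def euclidean(p, q):
    """Euclidean distance between two points with the same number of attributes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of dimensions")
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

def standardize(points):
    """Z-score every attribute (column). Assumes each attribute varies."""
    columns = list(zip(*points))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(point, means, stdevs))
        for point in points
    ]

# Hypothetical data: (age in years, income in dollars).
raw = [(25, 40000), (30, 42000), (55, 41000)]
scaled = standardize(raw)

# On the raw scale, income differences dominate the distance.
print(euclidean(raw[0], raw[1]), euclidean(raw[0], raw[2]))
# After standardization, both attributes contribute comparably.
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```

In this made-up example the first point is closer to the third point on the raw scale, but closer to the second point after standardization, which is the kind of effect the slide warns about.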

Euclidean Distance

[Figure: scatter plot of points p1 to p4 in the x-y plane]

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance Matrix

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
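The distance matrix on this slide can be reproduced from the four points. The sketch below rounds to three decimals to match the slide; the helper name `euclidean` is chosen here for illustration.

```python
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(p, q):
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

names = list(points)
# matrix[i][j] is the distance between the i-th and j-th point.
matrix = [
    [round(euclidean(points[a], points[b]), 3) for b in names]
    for a in names
]

for name, row in zip(names, matrix):
    print(name, row)
```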

More about Euclidean distance

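With q = (0, 0) at the origin, the distance from p to q is the length of vector p:

$$dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2} = \sqrt{\sum_{k=1}^{n} p_k^2} = \|p\|$$

[Figure: vector from q = (0, 0) to p = (3, 5) in the x1-x2 plane]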
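A short check of this special case for the point on the slide. The function names `euclidean` and `length` are illustrative choices, not names from the lecture notes.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

def length(p):
    """Length (Euclidean norm) of vector p."""
    return math.sqrt(sum(pk ** 2 for pk in p))

p = (3, 5)
origin = (0, 0)

# Both values equal sqrt(3^2 + 5^2) = sqrt(34).
print(euclidean(p, origin), length(p))
```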

Minkowski Distance

Minkowski Distance is a generalization of Euclidean Distance

$$dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$$

Where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
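The formula translates almost directly into code. The sketch below assumes r > 0 so that the 1/r exponent is defined; that check is an addition made here, not something stated on the slide.

```python
import math

def minkowski(p, q, r):
    """Minkowski distance with parameter r between points p and q."""
    if r <= 0:
        raise ValueError("r must be positive")  # assumption of this sketch
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

p = (3, 5)
q = (6, 1)

print(minkowski(p, q, 1))  # |3 - 6| + |5 - 1| = 7
print(minkowski(p, q, 2))  # Euclidean distance, 5
```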

Minkowski Distance: Examples

r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors

r = 2. Euclidean distance

r → ∞. “supremum” (Lmax norm, L∞ norm) distance.
– This is the maximum difference between any component of the vectors

Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
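Two of the claims above can be checked numerically: on binary vectors the city block distance equals the Hamming distance, and the Minkowski distance approaches the supremum distance as r grows. The binary vectors `u` and `v` are made-up examples, and all function names are illustrative.

```python
def minkowski(p, q, r):
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

def manhattan(p, q):
    return sum(abs(pk - qk) for pk, qk in zip(p, q))

def hamming(u, v):
    """Number of positions in which two binary vectors differ."""
    return sum(uk != vk for uk, vk in zip(u, v))

def supremum(p, q):
    """Largest absolute difference over all components."""
    return max(abs(pk - qk) for pk, qk in zip(p, q))

# Hypothetical binary vectors; they differ in positions 0, 3 and 4.
u = (1, 0, 1, 1, 0)
v = (0, 0, 1, 0, 1)
print(manhattan(u, v), hamming(u, v))  # 3 3

# As r increases, the Minkowski distance tends to the supremum distance.
p, q = (3, 5), (6, 1)
for r in (1, 2, 10, 100):
    print(r, minkowski(p, q, r))
print(supremum(p, q))  # 4
```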

From Wikipedia

Minkowski Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance Matrix

L1      p1      p2      p3      p4
p1      0       4       4       6
p2      4       0       2       4
p3      4       2       0       2
p4      6       4       2       0

L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

L∞      p1      p2      p3      p4
p1      0       2       3       5
p2      2       0       1       3
p3      3       1       0       2
p4      5       3       2       0
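The L1 and L∞ matrices on this slide can be recomputed from the four points. The helper `distance_matrix` is a name introduced here for illustration; it takes any distance function as an argument.

```python
points = [(0, 2), (2, 0), (3, 1), (5, 1)]  # p1, p2, p3, p4

def city_block(p, q):
    return sum(abs(pk - qk) for pk, qk in zip(p, q))

def supremum(p, q):
    return max(abs(pk - qk) for pk, qk in zip(p, q))

def distance_matrix(points, dist):
    """Pairwise distances between all points under the given distance."""
    return [[dist(p, q) for q in points] for p in points]

l1 = distance_matrix(points, city_block)
linf = distance_matrix(points, supremum)

for row in l1:
    print(row)
for row in linf:
    print(row)
```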

Common Properties of a Distance

Distances, such as the Euclidean distance, have some well known properties.

1. d(p, q) ≥ 0 for all p and q and d(p, q) = 0 only if p = q. (Positive definiteness)

2. d(p, q) = d(q, p) for all p and q. (Symmetry)

3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)

where d(p, q) is the distance (dissimilarity) between points p and q.

A distance that satisfies these properties is a metric
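The three properties can be spot-checked on a finite set of points. The sketch below tests them on the four example points from the earlier slides; passing such a check does not prove that a function is a metric, but a single failure shows that it is not. The squared Euclidean distance is used as a counterexample here; it is not discussed in the slides and is included only for illustration.

```python
import math
from itertools import product

points = [(0, 2), (2, 0), (3, 1), (5, 1)]

def euclidean(p, q):
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

def squared_euclidean(p, q):
    return sum((pk - qk) ** 2 for pk, qk in zip(p, q))

def satisfies_metric_properties(points, d, tol=1e-12):
    """Check the three properties on every pair and triple of the given points."""
    for p, q in product(points, repeat=2):
        if d(p, q) < 0:                       # 1. non-negative ...
            return False
        if (d(p, q) == 0) != (p == q):        #    ... and zero only if p = q
            return False
        if not math.isclose(d(p, q), d(q, p), abs_tol=tol):  # 2. symmetry
            return False
    for p, q, r in product(points, repeat=3):
        if d(p, r) > d(p, q) + d(q, r) + tol:  # 3. triangle inequality
            return False
    return True

print(satisfies_metric_properties(points, euclidean))          # True
print(satisfies_metric_properties(points, squared_euclidean))  # False
```

The squared version fails on the triple p1, p3, p4: the squared distance from p1 to p4 is 26, which exceeds 10 + 4.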
