Data Mining: Data
Lecture Notes for Chapter 2
Introduction to Data Mining
by Tan, Steinbach, Kumar
(Modified by P. Radivojac for I211)
What went wrong in 1936?
• Literary Digest had conducted surveys since 1920 and correctly predicted the elected president every time
• In 1936 they predicted 55% of the vote for Alf Landon and 41% for Franklin Roosevelt
• The actual election showed that Roosevelt won, 61% vs. 37%
Methodology for data collection
• Literary Digest sent 10 million ballots to voters in the USA
• About 2.3 million were returned
• Names were obtained from phone registries and automobile licensing departments
• So, what was the problem?
What went wrong in 1936 (1)
Source: Peverill Squire, Why the 1936 Literary Digest Poll Failed.
What went wrong in 1936 (2)
What went wrong in 1936 (3)
What is Data?
Collection of data objects and their attributes

An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as feature, variable, variate

A collection of attributes describes a data point
– Data point is also known as object, record, instance, or example

Example (rows are data points, columns are attributes, Cheat is the class):

Tid   Home Owner   Marital Status   Taxable Income   Cheat
1     Yes          Single           125K             No
2     No           Married          100K             No
3     No           Single           70K              No
4     Yes          Married          120K             No
5     No           Divorced         95K              Yes
6     No           Married          60K              No
7     Yes          Divorced         220K             No
8     No           Single           85K              Yes
9     No           Married          75K              No
10    No           Single           90K              Yes
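
To make the vocabulary concrete, the example table can be held in code as one record per data point. This is only a sketch: the field names are hypothetical, and incomes are stored in thousands (125 stands for 125K).

```python
# Hypothetical field names; taxable income is in thousands (125 means 125K).
FIELDS = ("tid", "home_owner", "marital_status", "taxable_income_k", "cheat")

ROWS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# One dict per data point (object, record, instance, example).
records = [dict(zip(FIELDS, row)) for row in ROWS]

# Separate the class column from the remaining columns.
CLASS_ATTRIBUTE = "cheat"
features = [{k: v for k, v in r.items() if k != CLASS_ATTRIBUTE} for r in records]
labels = [r[CLASS_ATTRIBUTE] for r in records]

print(len(records), "data points")
print(labels.count("Yes"), "data points with Cheat = Yes")
```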
Similarity and Dissimilarity
Similarity
– Numerical measure of how alike two data points are
– Is higher when objects are more alike
– Often falls in the range [0,1]
Dissimilarity
– Numerical measure of how different two data points are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity
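
The two notions can be converted into each other. The slides do not prescribe a conversion, so the sketch below assumes one common choice, s = 1 / (1 + d), which maps a dissimilarity of 0 to a similarity of 1 and large dissimilarities toward 0. The function name is hypothetical.

```python
def similarity_from_dissimilarity(d: float) -> float:
    """Map a dissimilarity d >= 0 to a similarity in (0, 1].

    Assumed transformation (one of several in use): s = 1 / (1 + d).
    """
    if d < 0:
        raise ValueError("dissimilarity must be non-negative")
    return 1.0 / (1.0 + d)


# Identical objects (d = 0) get the maximum similarity of 1;
# similarity shrinks as the objects become more different.
for d in (0.0, 1.0, 4.0):
    print(d, similarity_from_dissimilarity(d))
```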
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
Euclidean distance in 2D
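Points p = (3, 5) and q = (6, 1), plotted on axes x1 (horizontal) and x2 (vertical)
Legs of the right triangle: a = 6 − 3 = 3 and b = 1 − 5 = −4
Pythagoras' theorem: a^2 + b^2 = c^2, where c = dist(p, q)

\[ \mathrm{dist}(p, q) = \sqrt{a^2 + b^2} = \sqrt{(6 - 3)^2 + (1 - 5)^2} = \sqrt{3^2 + (-4)^2} = \sqrt{25} = 5 \]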
Euclidean Distance in n dimensions
Euclidean Distance
\[ \mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2} \]
Where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Standardization is necessary if scales differ.
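
A minimal sketch of the n-dimensional formula, together with one way to standardize. The slides do not say which standardization to use, so z-scores (zero mean, unit standard deviation per attribute) are an assumption here. The (age, income) data are made up to show that an attribute on a large scale dominates the raw distance.

```python
import math
from statistics import mean, pstdev


def euclidean(p, q):
    """Euclidean distance between two points with the same number of attributes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of dimensions")
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


def standardize(points):
    """Rescale each attribute (column) to zero mean and unit standard deviation."""
    columns = list(zip(*points))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # constant columns are left unscaled
    return [tuple((v - m) / s for v, m, s in zip(pt, means, stds)) for pt in points]


# The 2D example from the slides.
print(euclidean((3, 5), (6, 1)))  # 5.0

# Made-up (age in years, income in dollars) data: income dominates the raw distance.
people = [(25, 40000), (30, 42000), (60, 41000)]
scaled = standardize(people)
print(euclidean(people[0], people[1]), euclidean(people[0], people[2]))
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```

On the raw data the first person looks closer to the third (similar income, very different age). After standardization the first person is closer to the second.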
Euclidean Distance
[Figure: scatter plot of points p1 to p4, horizontal axis from 0 to 6, vertical axis from 0 to 3]

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance Matrix

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
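
The distance matrix can be reproduced from the four points. This sketch uses `math.dist` from the Python standard library (Python 3.8 or later). The dictionary layout and the function name are illustrative choices.

```python
import math

# The four example points from the slide.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def distance_matrix(pts):
    """Return a nested dict with the Euclidean distance for every pair of points."""
    names = list(pts)
    return {a: {b: math.dist(pts[a], pts[b]) for b in names} for a in names}


matrix = distance_matrix(points)
print("    " + "  ".join(f"{name:>5}" for name in matrix))
for name, row in matrix.items():
    print(f"{name:<4}" + "  ".join(f"{d:5.3f}" for d in row.values()))
```

The matrix is symmetric with zeros on the diagonal, so only the entries above (or below) the diagonal need to be stored.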
More about Euclidean distance
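When q is the origin, the distance from p to q is the length of vector p:

\[ \mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2} = \sqrt{\sum_{k=1}^{n} p_k^2} = \|p\| \]

Example: p = (3, 5) and q = (0, 0), plotted on axes x1 (horizontal) and x2 (vertical)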
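
As a worked instance of the vector-length case, using the slide's point p = (3, 5) and the origin q = (0, 0):

```latex
\[
\mathrm{dist}(p, q) = \sqrt{(3 - 0)^2 + (5 - 0)^2} = \sqrt{9 + 25} = \sqrt{34} \approx 5.831 = \|p\|
\]
```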
Minkowski Distance
Minkowski Distance is a generalization of Euclidean Distance
\[ \mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r} \]

Where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance: Examples
r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors
r = 2. Euclidean distance
r → ∞. "supremum" (Lmax norm, L∞ norm) distance.
– This is the maximum difference between any component of the vectors
Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
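
The three cases can be sketched with a single function. The names `minkowski` and `hamming` are illustrative, and passing `math.inf` for r to select the supremum distance is a convention chosen for this sketch.

```python
import math


def minkowski(p, q, r):
    """Minkowski distance with parameter r; r = math.inf gives the supremum distance."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of dimensions")
    if r <= 0:
        raise ValueError("r must be positive")
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if math.isinf(r):
        # Limit r -> infinity: the largest difference in any single component.
        return float(max(diffs, default=0))
    return sum(d ** r for d in diffs) ** (1.0 / r)


def hamming(a, b):
    """Number of positions in which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


p, q = (0, 2), (5, 1)
print(minkowski(p, q, 1))         # city block
print(minkowski(p, q, 2))         # Euclidean
print(minkowski(p, q, math.inf))  # supremum

# For binary vectors, the r = 1 distance equals the Hamming distance.
a, b = (1, 0, 1, 1), (1, 1, 0, 1)
print(hamming(a, b), minkowski(a, b, 1))
```

Note that r only changes how the per-component differences are combined. The same function works for vectors of any dimension n.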
From Wikipedia
Minkowski Distance
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance Matrix

L1      p1      p2      p3      p4
p1      0       4       4       6
p2      4       0       2       4
p3      4       2       0       2
p4      6       4       2       0

L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

L∞      p1      p2      p3      p4
p1      0       2       3       5
p2      2       0       1       3
p3      3       1       0       2
p4      5       3       2       0
Common Properties of a Distance
Distances, such as the Euclidean distance, have some well known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points p and q.
A distance that satisfies these properties is a metric.
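
The three properties can be spot-checked on a finite sample of points. Such a check can refute the metric properties but cannot prove them in general. The helper name and the tolerance are assumptions of this sketch, and squared Euclidean distance is included only as an illustrative dissimilarity that fails the triangle inequality.

```python
import math
from itertools import product


def satisfies_metric_properties(points, d, tol=1e-12):
    """Check positive definiteness, symmetry and the triangle inequality on a sample."""
    for p, q in product(points, repeat=2):
        if d(p, q) < 0:
            return False  # distances must be non-negative
        if (d(p, q) == 0) != (p == q):
            return False  # zero distance if and only if p = q
        if abs(d(p, q) - d(q, p)) > tol:
            return False  # symmetry
    for p, q, r in product(points, repeat=3):
        if d(p, r) > d(p, q) + d(q, r) + tol:
            return False  # triangle inequality
    return True


def squared_euclidean(p, q):
    """Squared Euclidean distance: a dissimilarity, used here as a counterexample."""
    return sum((pk - qk) ** 2 for pk, qk in zip(p, q))


points = [(0, 2), (2, 0), (3, 1), (5, 1)]
print(satisfies_metric_properties(points, math.dist))          # True
print(satisfies_metric_properties(points, squared_euclidean))  # False
```

For the squared distance, going from (0, 2) to (5, 1) costs 26, while the detour through (3, 1) costs only 10 + 4 = 14, which violates the triangle inequality.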