Upload
lethu
View
216
Download
2
Embed Size (px)
Citation preview
Topics to be covered • Types of Attributes;
• Statistical Description of Data;
• Data Visualization;
• Measuring similarity and dissimilarity.
Chp2 Slides by: Shree Jaswal 2
Which Chapter from which Text Book ?
• Chapter 2: Getting to know your data from Han,
Kamber, "Data Mining Concepts and Techniques",
Morgan Kaufmann 3nd Edition
• Chapter 2: Data from P. N. Tan, M. Steinbach, Vipin
Kumar, “Introduction to Data Mining”, Pearson
Education
• Chapter 3: Exploring data from P. N. Tan, M.
Steinbach, Vipin Kumar, “Introduction to Data
Mining”, Pearson Education
Chp2 Slides by: Shree Jaswal 3
Course Outcome achieved • TEITC604.2: Able to prepare the
data needed for data mining
algorithms in terms of attributes
and class inputs, training,
validating, and testing files.
Chp2 Slides by: Shree Jaswal 4
What is Data?
• Collection of data objects
and their attributes
• An attribute is a property or characteristic of an object
o Examples: eye color of a
person, temperature, etc.
o Attribute is also known as
variable, field,
characteristic, or feature
• A collection of attributes
describe an object
o Object is also known as
record, point, case, sample,
entity, or instance
T id R e fu n d M a r ita l
S ta tu s
T a x a b le
In c o m e C h e a t
1 Y e s S in g le 1 2 5 K N o
2 N o M a rr ie d 1 0 0 K N o
3 N o S in g le 7 0 K N o
4 Y e s M a rr ie d 1 2 0 K N o
5 N o D iv o rc e d 9 5 K Y e s
6 N o M a rr ie d 6 0 K N o
7 Y e s D iv o rc e d 2 2 0 K N o
8 N o S in g le 8 5 K Y e s
9 N o M a rr ie d 7 5 K N o
1 0 N o S in g le 9 0 K Y e s 10
Attributes
Objects
Chp2 Slides by: Shree Jaswal 5
Attribute Values • Attribute values are numbers or symbols assigned to
an attribute
• Distinction between attributes and attribute values o Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
o Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
o ID has no limit but age has a maximum and minimum value
Chp2 Slides by: Shree Jaswal 6
Types of Attributes • There are different types of attributes
o Nominal
• Examples: ID numbers, eye color, zip codes
o Binary
• Symmetric and Asymmetric types
o Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
o Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
o Ratio
• Examples: temperature in Kelvin, length, time, counts
Chp2 Slides by: Shree Jaswal 7
Properties of Attribute Values
• The type of an attribute depends on which of the
following properties it possesses:
o Distinctness: =
o Order: < >
o Addition: + -
o Multiplication: * /
o Nominal attribute: distinctness
o Ordinal attribute: distinctness & order
o Interval attribute: distinctness, order & addition
o Ratio attribute: all 4 properties
Chp2 Slides by: Shree Jaswal 8
Attribute
Type
Description
Examples
Operations
Nominal
The values of a nominal attribute are
just different names, i.e., nominal
attributes provide only enough
information to distinguish one object
from another. (=, )
zip codes,
employee ID
numbers, eye color,
sex: {male, female}
mode, entropy,
contingency
correlation, 2
test
Ordinal
The values of an ordinal
attribute provide enough
information to order objects.
(<, >)
hardness of minerals,
{good, better, best},
grades, street numbers
median,
percentiles, rank
correlation, run
tests, sign tests
Interval
For interval attributes, the
differences between values are
meaningful, i.e., a unit of
measurement exists.(+, - )
calendar dates,
temperature in
Celsius or
Fahrenheit
mean, standard
deviation,
Pearson's
correlation, t
and F tests
Ratio
For ratio variables, both
differences and ratios are
meaningful. (*, /)
temperature in
Kelvin, monetary
quantities, counts,
age, mass, length,
electrical current
geometric
mean,
harmonic
mean, percent
variation
Chp2 Slides by: Shree Jaswal 9
Attribute
Level
Transformation
Comments
Nominal
Any permutation of values
If all employee ID numbers
were reassigned, would it
make any difference?
Ordinal
An order preserving change of
values, i.e.,
new_value = f(old_value)
where f is a monotonic function.
An attribute encompassing
the notion of good, better
best can be represented
equally well by the values
{1, 2, 3} or by { 0.5, 1,
10}.
Interval
new_value =a * old_value + b
where a and b are constants
Thus, the Fahrenheit and
Celsius temperature scales
differ in terms of where
their zero value is and the
size of a unit (degree).
Ratio
new_value = a * old_value
Length can be measured in
meters or feet.
Chp2 Slides by: Shree Jaswal 10
Discrete and Continuous Attributes
• Discrete Attribute o Has only a finite or countably infinite set of values
o Examples: zip codes, counts, or the set of words in a collection of documents
o Often represented as integer variables.
o Note: binary attributes are a special case of discrete attributes
• Continuous Attribute o Has real numbers as attribute values
o Examples: temperature, height, or weight.
o Practically, real values can only be measured and represented using a finite number of digits.
o Continuous attributes are typically represented as floating-point variables.
Chp2 Slides by: Shree Jaswal 11
Types of data sets • Record
o Data Matrix
o Document Data
o Transaction Data
• Graph o World Wide Web
o Molecular Structures
• Ordered o Spatial Data
o Temporal Data
o Sequential Data
o Genetic Sequence Data
Chp2 Slides by: Shree Jaswal 12
Important Characteristics of Structured Data
o Dimensionality
• Curse of Dimensionality
o Sparsity
• Only presence counts
o Resolution
• Patterns depend on the scale
Chp2 Slides by: Shree Jaswal 13
Record Data • Data that consists of a collection of records, each
of which consists of a fixed set of attributes
T id R e fu n d M a r ita l
S ta tu s
T a x a b le
In c o m e C h e a t
1 Y e s S in g le 1 2 5 K N o
2 N o M a rr ie d 1 0 0 K N o
3 N o S in g le 7 0 K N o
4 Y e s M a rr ie d 1 2 0 K N o
5 N o D iv o rc e d 9 5 K Y e s
6 N o M a rr ie d 6 0 K N o
7 Y e s D iv o rc e d 2 2 0 K N o
8 N o S in g le 8 5 K Y e s
9 N o M a rr ie d 7 5 K N o
1 0 N o S in g le 9 0 K Y e s 10
Chp2 Slides by: Shree Jaswal 14
Transaction Data • A special type of record data, where
o each record (transaction) involves a set of items.
o For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip
constitute a transaction, while the individual products that
were purchased are the items.
T I D I tem s
1 B r e a d , C o k e , M ilk
2 B e e r , B r e a d
3 B e e r , C o k e , D ia p e r , M ilk
4 B e e r , B r e a d , D ia p e r , M ilk
5 C o k e , D ia p e r , M ilk
Chp2 Slides by: Shree Jaswal 15
Data Matrix • If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of
as points in a multi-dimensional space, where each
dimension represents a distinct attribute
• Such data set can be represented by an m by n
matrix, where there are m rows, one for each object,
and n columns, one for each attribute
1.12.216.226.2512.65
1.22.715.225.2710.23
Thickness LoadDistanceProjection
of y load
Projection
of x Load
1.12.216.226.2512.65
1.22.715.225.2710.23
Thickness LoadDistanceProjection
of y load
Projection
of x Load
Chp2 Slides by: Shree Jaswal 16
Document Data • Each document becomes a `term' vector,
o each term is a component (attribute) of the vector,
o the value of each component is the number of times the
corresponding term occurs in the document.
Document 1
se
aso
n
time
ou
t
lost
wi
n
ga
me
sco
re
ba
ll
play
co
ach
tea
m
Document 2
Document 3
3 0 5 0 2 6 0 2 0 2
0
0
7 0 2 1 0 0 3 0 0
1 0 0 1 2 2 0 3 0
Chp2 Slides by: Shree Jaswal 17
Graph Data • Examples: Generic graph and HTML Links
5
2
1
2
5
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
Chp2 Slides by: Shree Jaswal 18
Ordered Data • Sequences of transactions
An element of the sequence
Items/Events
Chp2 Slides by: Shree Jaswal 20
Ordered Data • Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Chp2 Slides by: Shree Jaswal 21
Ordered Data
• Spatio-Temporal Data
Average Monthly Temperature of land and ocean
Chp2 Slides by: Shree Jaswal 22
Data Quality • What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
o Noise and outliers
o missing values
o duplicate data
Chp2 Slides by: Shree Jaswal 23
Noise • Noise refers to modification of original values
o Examples: distortion of a person’s voice when talking on a poor phone
and “snow” on television screen
Two Sine Waves Two Sine Waves + Noise
Chp2 Slides by: Shree Jaswal 24
Outliers • Outliers are data objects with characteristics that
are considerably different than most of the other
data objects in the data set
Chp2 Slides by: Shree Jaswal 25
Missing Values • Reasons for missing values
o Information is not collected (e.g., people decline to give their age and weight)
o Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
• Handling missing values o Eliminate Data Objects
o Estimate Missing Values
o Ignore the Missing Value During Analysis
o Replace with all possible values (weighted by their probabilities)
Chp2 Slides by: Shree Jaswal 26
Duplicate Data • Data set may include data objects that are
duplicates, or almost duplicates of one another
o Major issue when merging data from heterogeneous
sources
• Examples:
o Same person with multiple email addresses
• Data cleaning
o Process of dealing with duplicate data issues
Chp2 Slides by: Shree Jaswal 27
28
Mining Data Descriptive Characteristics • Motivation
o To better understand the data: central tendency,
variation and spread
• Data dispersion characteristics
o median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
o Data dispersion: analyzed with multiple granularities of
precision
o Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
o Folding measures into numerical dimensions
o Boxplot or quantile analysis on the transformed cube
Chp2 Slides by: Shree Jaswal
Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
o Weighted arithmetic mean:
o Trimmed mean: chopping extreme values
• Median: A holistic measure
o Middle value if odd number of values, or average of the middle
two values otherwise
o Estimated by interpolation (for grouped data):
• Mode
o Value that occurs most frequently in the data
o Unimodal, bimodal, trimodal
o Empirical formula:
n
i
ix
nx
1
1
n
i
i
n
i
ii
w
xw
x
1
1
widthfreq
lfreqNLmedian
median
))(2/
(1
)(3 medianmeanmodemean
N
x
Chp2 Slides by: Shree Jaswal 29
Chp2 Slides by: Shree Jaswal 30
Symmetric vs. Skewed Data
• Median, mean and mode of
symmetric, positively and
negatively skewed data
positively skewed negatively skewed
symmetric
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
o Quartiles: Q1 (25th percentile), Q3 (75th percentile)
o Inter-quartile range: IQR = Q3 – Q1
o Five number summary: min, Q1, M, Q3, max
o Boxplot: ends of the box are the quartiles, median is marked,
whiskers, and plot outlier individually
o Outlier: usually, a value higher/lower than 1.5 x IQR
• Variance and standard deviation (sample: s, population: σ)
o Variance: (algebraic, scalable computation)
o Standard deviation s (or σ) is the square root of variance s2 (or σ2)
n
i
n
i
ii
n
i
ix
nx
nxx
ns
1 1
22
1
22])(
1[
1
1)(
1
1
n
i
i
n
i
ix
Nx
N 1
22
1
22 1)(
1
Chp2 Slides by: Shree Jaswal 31
Boxplot Analysis
• Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
• Boxplot
o Data is represented with a box
o The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
o The median is marked by a line within the
box
o Whiskers: two lines outside the box extend to
Minimum and Maximum
Chp2 Slides by: Shree Jaswal 32
Visualization Visualization is the conversion of data into a visual or
tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
• Visualization of data is one of the most powerful and appealing techniques for data exploration.
o Humans have a well developed ability to analyze large amounts of information that is presented visually
o Can detect general patterns and trends
o Can detect outliers and unusual patterns
Chp2 Slides by: Shree Jaswal 33
34
Data Visualization and Its Methods
• Why data visualization?
o Gain insight into an information space by mapping data onto
graphical primitives
o Provide qualitative overview of large data sets
o Search for patterns, trends, structure, irregularities,
relationships among data
o Help find interesting regions and suitable parameters for
further quantitative analysis
o Provide a visual proof of computer representations derived
• Typical visualization methods:
o Geometric techniques
o Icon-based techniques
o Hierarchical techniques
Chp2 Slides by: Shree Jaswal
Iris Sample Data Set • Many of the exploratory data techniques are
illustrated with the Iris Plant data set. o Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
o From the statistician Douglas Fisher
o Three flower types (classes):
• Setosa
• Virginica
• Versicolour
o Four (non-class) attributes
• Sepal width and length
• Petal width and length Virginica. Robert H. Mohlenbrock.
USDA NRCS. 1995. Northeast
wetland flora: Field office guide to
plant species. Northeast National
Technical Center, Chester, PA.
Courtesy of USDA NRCS Wetland
Science Institute. Chp2 Slides by: Shree Jaswal 35
Chp2 Slides by: Shree Jaswal 36
5.5 4.2 1.4 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
5 3.2 1.2 0.2 Iris-setosa
5.5 3.5 1.3 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
4.4 3 1.3 0.2 Iris-setosa
5.1 3.4 1.5 0.2 Iris-setosa
5 3.5 1.3 0.3 Iris-setosa
4.5 2.3 1.3 0.3 Iris-setosa
4.4 3.2 1.3 0.2 Iris-setosa
5 3.5 1.6 0.6 Iris-setosa
5.1 3.8 1.9 0.4 Iris-setosa
4.8 3 1.4 0.3 Iris-setosa
5.1 3.8 1.6 0.2 Iris-setosa
4.6 3.2 1.4 0.2 Iris-setosa
5.3 3.7 1.5 0.2 Iris-setosa
5 3.3 1.4 0.2 Iris-setosa
7 3.2 4.7 1.4 Iris-versicolor
6.4 3.2 4.5 1.5 Iris-versicolor
6.9 3.1 4.9 1.5 Iris-versicolor
5.5 2.3 4 1.3 Iris-versicolor
6.5 2.8 4.6 1.5 Iris-versicolor
5.7 2.8 4.5 1.3 Iris-versicolor
6.3 3.3 4.7 1.6 Iris-versicolor
4.9 2.4 3.3 1 Iris-versicolor
6.6 2.9 4.6 1.3 Iris-versicolor
5.2 2.7 3.9 1.4 Iris-versicolor
5 2 3.5 1 Iris-versicolor
5.9 3 4.2 1.5 Iris-versicolor
6 2.2 4 1 Iris-versicolor
6.1 2.9 4.7 1.4 Iris-versicolor
5.6 2.9 3.6 1.3 Iris-versicolor
Attribute Information: 1. sepal length in cm 2. sepal width in cm 3. petal length in cm 4. petal width in cm 5. class: -- Iris Setosa -- Iris Versicolour -- Iris Virginica
Example: Sea Surface Temperature • The following shows the Sea Surface Temperature
(SST) for July 1982
o Tens of thousands of data points are summarized in a single
figure
Chp2 Slides by: Shree Jaswal 37
Representation • Is the mapping of information to a visual format
• Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
• Example: o Objects are often represented as points
o Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape
o If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, is easily perceived.
Chp2 Slides by: Shree Jaswal 38
Arrangement • Is the placement of visual elements within a display
• Can make a large difference in how easy it is to
understand the data
• Example:
Chp2 Slides by: Shree Jaswal 39
Selection • Is the elimination or the de-emphasis of certain
objects and attributes
• Selection may involve the choosing a subset of attributes o Dimensionality reduction is often used to reduce the number
of dimensions to two or three (in case of few dimensions)
o Alternatively, pairs of attributes can be considered
• Selection may also involve choosing a subset of objects o A region of the screen can only show so many points
o Can sample, but want to preserve points in sparse areas
Chp2 Slides by: Shree Jaswal 40
Visualization Techniques: Histograms • Histogram
o Usually shows the distribution of values of a single variable
o Divide the values into bins and show a bar plot of the
number of objects in each bin.
o The height of each bar indicates the number of objects
o Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
Chp2 Slides by: Shree Jaswal 41
Two-Dimensional Histograms • Show the joint distribution of the values of two
attributes
• Example: petal width and petal length
Chp2 Slides by: Shree Jaswal 42
Visualization Techniques: Box Plots • Box Plots
o Invented by J. Tukey
o Another way of displaying the distribution of data
o Following figure shows the basic part of a box plot
outlier
10th percentile
25th percentile
75th percentile
50th percentile
10th percentile
Chp2 Slides by: Shree Jaswal 43
46
Histogram Analysis
• Graph displays of basic statistical class descriptions
o Frequency histograms
• A univariate graphical method
• Consists of a set of rectangles that reflect the counts or
frequencies of the classes present in the given data
Chp2 Slides by: Shree Jaswal
47
Histograms Often Tells More than Boxplots
The two histograms shown in
the left may have the same
boxplot representation
The same values for: min,
Q1, median, Q3, max
But they have rather different
data distributions
Chp2 Slides by: Shree Jaswal
48
Quantile Plot • Displays all of the data (allowing the user to assess
both the overall behavior and unusual occurrences)
• Plots quantile information
o For a data xi data sorted in increasing order, fi indicates that approximately 100 fi% of the data are below or equal to the value xi
Chp2 Slides by: Shree Jaswal
49
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution
against the corresponding quantiles of another
• Allows the user to view whether there is a shift in
going from one distribution to another
Chp2 Slides by: Shree Jaswal
50
Scatter plot • Provides a first look at bivariate data to see clusters
of points, outliers, etc
• Each pair of values is treated as a pair of
coordinates and plotted as points in the plane
Chp2 Slides by: Shree Jaswal
Visualization Techniques: Scatter Plots
• Scatter plots o Attributes values determine the position
o Two-dimensional scatter plots most common, but can have three-dimensional scatter plots
o Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
o It is useful to have arrays of scatter plots can compactly summarize the relationships of several pairs of attributes
• See example on the next slide
Chp2 Slides by: Shree Jaswal 51
53
Positively and Negatively Correlated Data
• The left half fragment is positively
correlated
• The right half is negative correlated
Chp2 Slides by: Shree Jaswal
Visualization Techniques: Contour Plots
• Contour plots o Useful when a continuous attribute is measured on a spatial
grid
o They partition the plane into regions of similar values
o The contour lines that form the boundaries of these regions connect points with equal values
o The most common example is contour maps of elevation
o Can also display temperature, rainfall, air pressure, etc.
• An example for Sea Surface Temperature (SST) is provided on the next slide
Chp2 Slides by: Shree Jaswal 55
Visualization Techniques: Matrix Plots
• Matrix plots
o Can plot the data matrix
o This can be useful when objects are sorted according to
class
o Typically, the attributes are normalized to prevent one
attribute from dominating the plot
o Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects
o Examples of matrix plots are presented on the next slide
Chp2 Slides by: Shree Jaswal 57
Visualization Techniques: Parallel Coordinates
• Parallel Coordinates o Used to plot the attribute values of high-dimensional data
o Instead of using perpendicular axes, use a set of parallel axes
o The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line
o Thus, each object is represented as a line
o Often, the lines representing a distinct class of objects group together, at least for some attributes
o Ordering of attributes is important in seeing such groupings
Chp2 Slides by: Shree Jaswal 59
Other Visualization Techniques • Star Plots
o Similar approach to parallel coordinates, but axes radiate from a central point
o The line connecting the values of an object is a polygon
• Chernoff Faces o Approach created by Herman Chernoff
o This approach associates each attribute with a characteristic of a face
o The values of each attribute determine the appearance of the corresponding facial characteristic
o Each object becomes a separate face
o Relies on human’s ability to distinguish faces
Chp2 Slides by: Shree Jaswal 61
Chp2 Slides by: Shree Jaswal 65
Hierarchical Techniques
• Visualization of the data using a hierarchical
partitioning into subspaces.
• Methods
o Dimensional Stacking
o Worlds-within-Worlds
o Treemap
o Cone Trees
o InfoCube
66
Tree-Map • Screen-filling method which uses a hierarchical partitioning
of the screen into regions depending on the attribute values
• The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes)
MSR Netscan Image
Chp2 Slides by: Shree Jaswal
69
Similarity and Dissimilarity
• Similarity
o Numerical measure of how alike two data objects
are
o Value is higher when objects are more alike
o Often falls in the range [0,1]
• Dissimilarity (i.e., distance)
o Numerical measure of how different are two data
objects
o Lower when objects are more alike
o Minimum dissimilarity is often 0
o Upper limit varies
• Proximity refers to a similarity or dissimilarity Chp2 Slides by: Shree Jaswal
70
Data Matrix and Dissimilarity Matrix
• Data matrix
o n data points with p
dimensions
o Two modes
• Dissimilarity matrix
o n data points, but
registers only the
distance
o A triangular matrix
o Single mode
npx...
nfx...
n1x
...............
ipx...
ifx...
i1x
...............
1px...
1fx...
11x
0...)2,()1,(
:::
)2,3()
...ndnd
0dd(3,1
0d(2,1)
0
Chp2 Slides by: Shree Jaswal
Chp2 Slides by: Shree Jaswal 71
Nominal Variables
• A generalization of the binary variable in that it can
take more than 2 states, e.g., red, yellow, blue, green
• Method 1: Simple matching
o m: # of matches, p: total # of variables
• Method 2: Use a large number of binary variables
o creating a new binary variable for each of the M
nominal states
p
mpjid
),(
72
Binary Variables
• A contingency table for binary
data
• Distance measure for symmetric
binary variables:
• Distance measure for asymmetric
binary variables:
• Jaccard coefficient (similarity
measure for asymmetric binary
variables): cba
a jisimJaccard
),(
dcba
cb jid
),(
cba
cb jid
),(
pdbcasum
dcdc
baba
sum
0
1
01
Object i
Object j
Chp2 Slides by: Shree Jaswal
73
Dissimilarity between Binary Variables
• Example
o gender is a symmetric attribute
o the remaining attributes are asymmetric binary
o let the values Y and P be set to 1, and the value N be set to
0
N a m e G e n d e r F e v e r C o u g h T e s t-1 T e s t-2 T e s t-3 T e s t-4
J a c k M Y N P N N N
M a ry F Y N P N P N
J im M Y P N N N N
?),(
?),(
?),(
maryjimd
jimjackd
maryjackd
Chp2 Slides by: Shree Jaswal
• Jim and marry are unlikely to have a similar disease whereas Jack and Mary are most likely to have a similar disease
Chp2 Slides by: Shree Jaswal 74
75.0211
21),(
67.0111
11),(
33.0102
10),(
maryjimd
jimjackd
maryjackd
Euclidean Distance
• Euclidean Distance
Where n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) of data objects p and q.
• Standardization is necessary, if scales differ.
n
kkk qpdist
1
2)(
Chp2 Slides by: Shree Jaswal 75
Euclidean Distance
0
1
2
3
0 1 2 3 4 5 6
p1
p2
p3 p4
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
Distance Matrix
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
Chp2 Slides by: Shree Jaswal 76
Manhattan Distance • It is named so because it is the distance in blocks
between any two points in a city
• A popular distance measure
• It is defined as:
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp)
are two p-dimensional data objects
Chp2 Slides by: Shree Jaswal 77
||...||||),(2211 pp j
xi
xj
xi
xj
xi
xjid
78
Minkowski Distance • Minkowski distance: A generalization of Euclidean and
Manhattan distances
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-
dimensional data objects, and q is the order
• Properties
o d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
o d(i, j) = d(j, i) (Symmetry)
o d(i, j) d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
pp
jx
ix
jx
ix
jx
ixjid )||...|||(|),(
2211
Chp2 Slides by: Shree Jaswal
79
Special Cases of Minkowski Distance
• q = 1: Manhattan (city block, L1 norm) distance
o E.g., the Hamming distance: the number of bits that are different between two binary vectors
• q= 2: (L2 norm) Euclidean distance
• q . “supremum” (Lmax norm, L norm) distance.
o This is the maximum difference between any component of the vectors
• Do not confuse q with n, i.e., all these distances are defined for all numbers of dimensions.
• Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
)||...|||(|),(22
22
2
11 pp jx
ix
jx
ix
jx
ixjid
||...||||),(2211 pp j
xi
xj
xi
xj
xi
xjid
Chp2 Slides by: Shree Jaswal
Chp2 Slides by: Shree Jaswal 80
Example: Minkowski Distance
Distance Matrix
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
L1 p1 p2 p3 p4
p1 0 4 4 6
p2 4 0 2 4
p3 4 2 0 2
p4 6 4 2 0
L2 p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
L p1 p2 p3 p4
p1 0 2 3 5
p2 2 0 1 3
p3 3 1 0 2
p4 5 3 2 0
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well known properties.
1. d(p, q) 0 for all p and q and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.
• A distance that satisfies these properties is a metric
Chp2 Slides by: Shree Jaswal 81
Common Properties of a Similarity
• Similarities, also have some well known properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects), p and q.
Chp2 Slides by: Shree Jaswal 82
83
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
o replace xif by their rank
o map the range of each variable onto [0, 1] by
replacing i-th object in the f-th variable by
o compute the dissimilarity using methods for
interval-scaled variables
1
1
f
if
if M
rz
},...,1{fif
Mr
Chp2 Slides by: Shree Jaswal
86
Variables of Mixed Types
• A database may contain all the six types of variables
o symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
• One may use a weighted formula to combine their effects
o f is binary or nominal: dij
(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
o f is interval-based: use the normalized distance
o f is ordinal or ratio-scaled • Compute ranks rif and
• Treat zif as interval-scaled
)(
1
)()(
1),(
f
ij
p
f
f
ij
f
ij
p
fd
jid
1
1
f
if
M
rz
if
Chp2 Slides by: Shree Jaswal
Similarity/Dissimilarity for Simple Attributes p and q are the attribute values for two data objects.
Chp2 Slides by: Shree Jaswal 87
88
Vector Objects: Cosine Similarity
• Vector objects: keywords in documents, gene features
in micro-arrays, …
• Applications: information retrieval, biologic taxonomy,
...
• Cosine similarity measure: If d1 and d2 are two vectors,
then
sim(d1, d2) = (d1 d2) /||d1|| ||d2|| ,
where indicates vector dot product, ||d||: the
length of vector d
• A cosine value of 0 indicates that two vectors are at 90
degrees to each other(orthogonal) and have no
match. The closer the cosine value to 1, the smaller the
angle and greater the match between vectors Chp2 Slides by: Shree Jaswal
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1d2 = 3*1+2*0+0*0+5*0+0*0+0*0+0*0+2*1+0*0+0*2 = 5
||d1||= (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)0.5=(6) 0.5 = 2.245
sim( d1, d2 ) = .3150
Chp2 Slides by: Shree Jaswal 89