88
Data Exploration Slides by: Shree Jaswal

Data Exploration - Shree Jaswal | Assistant Professor, St ...ssjaswal.com/wp-content/uploads/2017/02/DMBI_chp2.pdfWhich Chapter from which Text Book ? • Chapter 2: Getting to know

  • Upload
    lethu

  • View
    216

  • Download
    2

Embed Size (px)

Citation preview

Data Exploration

Slides by: Shree Jaswal

Topics to be covered • Types of Attributes;

• Statistical Description of Data;

• Data Visualization;

• Measuring similarity and dissimilarity.

Chp2 Slides by: Shree Jaswal 2

Which Chapter from which Text Book ?

• Chapter 2: Getting to know your data from Han,

Kamber, "Data Mining Concepts and Techniques",

Morgan Kaufmann 3nd Edition

• Chapter 2: Data from P. N. Tan, M. Steinbach, Vipin

Kumar, “Introduction to Data Mining”, Pearson

Education

• Chapter 3: Exploring data from P. N. Tan, M.

Steinbach, Vipin Kumar, “Introduction to Data

Mining”, Pearson Education

Chp2 Slides by: Shree Jaswal 3

Course Outcome achieved • TEITC604.2: Able to prepare the

data needed for data mining

algorithms in terms of attributes

and class inputs, training,

validating, and testing files.

Chp2 Slides by: Shree Jaswal 4

What is Data?

• Collection of data objects

and their attributes

• An attribute is a property or characteristic of an object

o Examples: eye color of a

person, temperature, etc.

o Attribute is also known as

variable, field,

characteristic, or feature

• A collection of attributes

describe an object

o Object is also known as

record, point, case, sample,

entity, or instance

T id R e fu n d M a r ita l

S ta tu s

T a x a b le

In c o m e C h e a t

1 Y e s S in g le 1 2 5 K N o

2 N o M a rr ie d 1 0 0 K N o

3 N o S in g le 7 0 K N o

4 Y e s M a rr ie d 1 2 0 K N o

5 N o D iv o rc e d 9 5 K Y e s

6 N o M a rr ie d 6 0 K N o

7 Y e s D iv o rc e d 2 2 0 K N o

8 N o S in g le 8 5 K Y e s

9 N o M a rr ie d 7 5 K N o

1 0 N o S in g le 9 0 K Y e s 10

Attributes

Objects

Chp2 Slides by: Shree Jaswal 5

Attribute Values • Attribute values are numbers or symbols assigned to

an attribute

• Distinction between attributes and attribute values o Same attribute can be mapped to different attribute values

• Example: height can be measured in feet or meters

o Different attributes can be mapped to the same set of values

• Example: Attribute values for ID and age are integers

• But properties of attribute values can be different

o ID has no limit but age has a maximum and minimum value

Chp2 Slides by: Shree Jaswal 6

Types of Attributes • There are different types of attributes

o Nominal

• Examples: ID numbers, eye color, zip codes

o Binary

• Symmetric and Asymmetric types

o Ordinal

• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}

o Interval

• Examples: calendar dates, temperatures in Celsius or Fahrenheit.

o Ratio

• Examples: temperature in Kelvin, length, time, counts

Chp2 Slides by: Shree Jaswal 7

Properties of Attribute Values

• The type of an attribute depends on which of the

following properties it possesses:

o Distinctness: =

o Order: < >

o Addition: + -

o Multiplication: * /

o Nominal attribute: distinctness

o Ordinal attribute: distinctness & order

o Interval attribute: distinctness, order & addition

o Ratio attribute: all 4 properties

Chp2 Slides by: Shree Jaswal 8

Attribute

Type

Description

Examples

Operations

Nominal

The values of a nominal attribute are

just different names, i.e., nominal

attributes provide only enough

information to distinguish one object

from another. (=, )

zip codes,

employee ID

numbers, eye color,

sex: {male, female}

mode, entropy,

contingency

correlation, 2

test

Ordinal

The values of an ordinal

attribute provide enough

information to order objects.

(<, >)

hardness of minerals,

{good, better, best},

grades, street numbers

median,

percentiles, rank

correlation, run

tests, sign tests

Interval

For interval attributes, the

differences between values are

meaningful, i.e., a unit of

measurement exists.(+, - )

calendar dates,

temperature in

Celsius or

Fahrenheit

mean, standard

deviation,

Pearson's

correlation, t

and F tests

Ratio

For ratio variables, both

differences and ratios are

meaningful. (*, /)

temperature in

Kelvin, monetary

quantities, counts,

age, mass, length,

electrical current

geometric

mean,

harmonic

mean, percent

variation

Chp2 Slides by: Shree Jaswal 9

Attribute

Level

Transformation

Comments

Nominal

Any permutation of values

If all employee ID numbers

were reassigned, would it

make any difference?

Ordinal

An order preserving change of

values, i.e.,

new_value = f(old_value)

where f is a monotonic function.

An attribute encompassing

the notion of good, better

best can be represented

equally well by the values

{1, 2, 3} or by { 0.5, 1,

10}.

Interval

new_value =a * old_value + b

where a and b are constants

Thus, the Fahrenheit and

Celsius temperature scales

differ in terms of where

their zero value is and the

size of a unit (degree).

Ratio

new_value = a * old_value

Length can be measured in

meters or feet.

Chp2 Slides by: Shree Jaswal 10

Discrete and Continuous Attributes

• Discrete Attribute o Has only a finite or countably infinite set of values

o Examples: zip codes, counts, or the set of words in a collection of documents

o Often represented as integer variables.

o Note: binary attributes are a special case of discrete attributes

• Continuous Attribute o Has real numbers as attribute values

o Examples: temperature, height, or weight.

o Practically, real values can only be measured and represented using a finite number of digits.

o Continuous attributes are typically represented as floating-point variables.

Chp2 Slides by: Shree Jaswal 11

Types of data sets • Record

o Data Matrix

o Document Data

o Transaction Data

• Graph o World Wide Web

o Molecular Structures

• Ordered o Spatial Data

o Temporal Data

o Sequential Data

o Genetic Sequence Data

Chp2 Slides by: Shree Jaswal 12

Important Characteristics of Structured Data

o Dimensionality

• Curse of Dimensionality

o Sparsity

• Only presence counts

o Resolution

• Patterns depend on the scale

Chp2 Slides by: Shree Jaswal 13

Record Data • Data that consists of a collection of records, each

of which consists of a fixed set of attributes

T id R e fu n d M a r ita l

S ta tu s

T a x a b le

In c o m e C h e a t

1 Y e s S in g le 1 2 5 K N o

2 N o M a rr ie d 1 0 0 K N o

3 N o S in g le 7 0 K N o

4 Y e s M a rr ie d 1 2 0 K N o

5 N o D iv o rc e d 9 5 K Y e s

6 N o M a rr ie d 6 0 K N o

7 Y e s D iv o rc e d 2 2 0 K N o

8 N o S in g le 8 5 K Y e s

9 N o M a rr ie d 7 5 K N o

1 0 N o S in g le 9 0 K Y e s 10

Chp2 Slides by: Shree Jaswal 14

Transaction Data • A special type of record data, where

o each record (transaction) involves a set of items.

o For example, consider a grocery store. The set of products

purchased by a customer during one shopping trip

constitute a transaction, while the individual products that

were purchased are the items.

T I D I tem s

1 B r e a d , C o k e , M ilk

2 B e e r , B r e a d

3 B e e r , C o k e , D ia p e r , M ilk

4 B e e r , B r e a d , D ia p e r , M ilk

5 C o k e , D ia p e r , M ilk

Chp2 Slides by: Shree Jaswal 15

Data Matrix • If data objects have the same fixed set of numeric

attributes, then the data objects can be thought of

as points in a multi-dimensional space, where each

dimension represents a distinct attribute

• Such data set can be represented by an m by n

matrix, where there are m rows, one for each object,

and n columns, one for each attribute

1.12.216.226.2512.65

1.22.715.225.2710.23

Thickness LoadDistanceProjection

of y load

Projection

of x Load

1.12.216.226.2512.65

1.22.715.225.2710.23

Thickness LoadDistanceProjection

of y load

Projection

of x Load

Chp2 Slides by: Shree Jaswal 16

Document Data • Each document becomes a `term' vector,

o each term is a component (attribute) of the vector,

o the value of each component is the number of times the

corresponding term occurs in the document.

Document 1

se

aso

n

time

ou

t

lost

wi

n

ga

me

sco

re

ba

ll

play

co

ach

tea

m

Document 2

Document 3

3 0 5 0 2 6 0 2 0 2

0

0

7 0 2 1 0 0 3 0 0

1 0 0 1 2 2 0 3 0

Chp2 Slides by: Shree Jaswal 17

Graph Data • Examples: Generic graph and HTML Links

5

2

1

2

5

<a href="papers/papers.html#bbbb">

Data Mining </a>

<li>

<a href="papers/papers.html#aaaa">

Graph Partitioning </a>

<li>

<a href="papers/papers.html#aaaa">

Parallel Solution of Sparse Linear System of Equations </a>

<li>

<a href="papers/papers.html#ffff">

N-Body Computation and Dense Linear System Solvers

Chp2 Slides by: Shree Jaswal 18

Chemical Data • Benzene Molecule: C6H6

Chp2 Slides by: Shree Jaswal 19

Ordered Data • Sequences of transactions

An element of the sequence

Items/Events

Chp2 Slides by: Shree Jaswal 20

Ordered Data • Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC

CGCAGGGCCCGCCCCGCGCCGTC

GAGAAGGGCCCGCCTGGCGGGCG

GGGGGAGGCGGGGCCGCCCGAGC

CCAACCGAGTCCGACCAGGTGCC

CCCTCTGCTCGGCCTAGACCTGA

GCTCATTAGGCGGCAGCGGACAG

GCCAAGTAGAACACGCGAAGCGC

TGGGCTGCCTGCTGCGACCAGGG

Chp2 Slides by: Shree Jaswal 21

Ordered Data

• Spatio-Temporal Data

Average Monthly Temperature of land and ocean

Chp2 Slides by: Shree Jaswal 22

Data Quality • What kinds of data quality problems?

• How can we detect problems with the data?

• What can we do about these problems?

• Examples of data quality problems:

o Noise and outliers

o missing values

o duplicate data

Chp2 Slides by: Shree Jaswal 23

Noise • Noise refers to modification of original values

o Examples: distortion of a person’s voice when talking on a poor phone

and “snow” on television screen

Two Sine Waves Two Sine Waves + Noise

Chp2 Slides by: Shree Jaswal 24

Outliers • Outliers are data objects with characteristics that

are considerably different than most of the other

data objects in the data set

Chp2 Slides by: Shree Jaswal 25

Missing Values • Reasons for missing values

o Information is not collected (e.g., people decline to give their age and weight)

o Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)

• Handling missing values o Eliminate Data Objects

o Estimate Missing Values

o Ignore the Missing Value During Analysis

o Replace with all possible values (weighted by their probabilities)

Chp2 Slides by: Shree Jaswal 26

Duplicate Data • Data set may include data objects that are

duplicates, or almost duplicates of one another

o Major issue when merging data from heterogeneous

sources

• Examples:

o Same person with multiple email addresses

• Data cleaning

o Process of dealing with duplicate data issues

Chp2 Slides by: Shree Jaswal 27

28

Mining Data Descriptive Characteristics • Motivation

o To better understand the data: central tendency,

variation and spread

• Data dispersion characteristics

o median, max, min, quantiles, outliers, variance, etc.

• Numerical dimensions correspond to sorted intervals

o Data dispersion: analyzed with multiple granularities of

precision

o Boxplot or quantile analysis on sorted intervals

• Dispersion analysis on computed measures

o Folding measures into numerical dimensions

o Boxplot or quantile analysis on the transformed cube

Chp2 Slides by: Shree Jaswal

Measuring the Central Tendency

• Mean (algebraic measure) (sample vs. population):

o Weighted arithmetic mean:

o Trimmed mean: chopping extreme values

• Median: A holistic measure

o Middle value if odd number of values, or average of the middle

two values otherwise

o Estimated by interpolation (for grouped data):

• Mode

o Value that occurs most frequently in the data

o Unimodal, bimodal, trimodal

o Empirical formula:

n

i

ix

nx

1

1

n

i

i

n

i

ii

w

xw

x

1

1

widthfreq

lfreqNLmedian

median

))(2/

(1

)(3 medianmeanmodemean

N

x

Chp2 Slides by: Shree Jaswal 29

Chp2 Slides by: Shree Jaswal 30

Symmetric vs. Skewed Data

• Median, mean and mode of

symmetric, positively and

negatively skewed data

positively skewed negatively skewed

symmetric

Measuring the Dispersion of Data

• Quartiles, outliers and boxplots

o Quartiles: Q1 (25th percentile), Q3 (75th percentile)

o Inter-quartile range: IQR = Q3 – Q1

o Five number summary: min, Q1, M, Q3, max

o Boxplot: ends of the box are the quartiles, median is marked,

whiskers, and plot outlier individually

o Outlier: usually, a value higher/lower than 1.5 x IQR

• Variance and standard deviation (sample: s, population: σ)

o Variance: (algebraic, scalable computation)

o Standard deviation s (or σ) is the square root of variance s2 (or σ2)

n

i

n

i

ii

n

i

ix

nx

nxx

ns

1 1

22

1

22])(

1[

1

1)(

1

1

n

i

i

n

i

ix

Nx

N 1

22

1

22 1)(

1

Chp2 Slides by: Shree Jaswal 31

Boxplot Analysis

• Five-number summary of a distribution:

Minimum, Q1, M, Q3, Maximum

• Boxplot

o Data is represented with a box

o The ends of the box are at the first and third

quartiles, i.e., the height of the box is IQR

o The median is marked by a line within the

box

o Whiskers: two lines outside the box extend to

Minimum and Maximum

Chp2 Slides by: Shree Jaswal 32

Visualization Visualization is the conversion of data into a visual or

tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

• Visualization of data is one of the most powerful and appealing techniques for data exploration.

o Humans have a well developed ability to analyze large amounts of information that is presented visually

o Can detect general patterns and trends

o Can detect outliers and unusual patterns

Chp2 Slides by: Shree Jaswal 33

34

Data Visualization and Its Methods

• Why data visualization?

o Gain insight into an information space by mapping data onto

graphical primitives

o Provide qualitative overview of large data sets

o Search for patterns, trends, structure, irregularities,

relationships among data

o Help find interesting regions and suitable parameters for

further quantitative analysis

o Provide a visual proof of computer representations derived

• Typical visualization methods:

o Geometric techniques

o Icon-based techniques

o Hierarchical techniques

Chp2 Slides by: Shree Jaswal

Iris Sample Data Set • Many of the exploratory data techniques are

illustrated with the Iris Plant data set. o Can be obtained from the UCI Machine Learning Repository

http://www.ics.uci.edu/~mlearn/MLRepository.html

o From the statistician Douglas Fisher

o Three flower types (classes):

• Setosa

• Virginica

• Versicolour

o Four (non-class) attributes

• Sepal width and length

• Petal width and length Virginica. Robert H. Mohlenbrock.

USDA NRCS. 1995. Northeast

wetland flora: Field office guide to

plant species. Northeast National

Technical Center, Chester, PA.

Courtesy of USDA NRCS Wetland

Science Institute. Chp2 Slides by: Shree Jaswal 35

Chp2 Slides by: Shree Jaswal 36

5.5 4.2 1.4 0.2 Iris-setosa

4.9 3.1 1.5 0.1 Iris-setosa

5 3.2 1.2 0.2 Iris-setosa

5.5 3.5 1.3 0.2 Iris-setosa

4.9 3.1 1.5 0.1 Iris-setosa

4.4 3 1.3 0.2 Iris-setosa

5.1 3.4 1.5 0.2 Iris-setosa

5 3.5 1.3 0.3 Iris-setosa

4.5 2.3 1.3 0.3 Iris-setosa

4.4 3.2 1.3 0.2 Iris-setosa

5 3.5 1.6 0.6 Iris-setosa

5.1 3.8 1.9 0.4 Iris-setosa

4.8 3 1.4 0.3 Iris-setosa

5.1 3.8 1.6 0.2 Iris-setosa

4.6 3.2 1.4 0.2 Iris-setosa

5.3 3.7 1.5 0.2 Iris-setosa

5 3.3 1.4 0.2 Iris-setosa

7 3.2 4.7 1.4 Iris-versicolor

6.4 3.2 4.5 1.5 Iris-versicolor

6.9 3.1 4.9 1.5 Iris-versicolor

5.5 2.3 4 1.3 Iris-versicolor

6.5 2.8 4.6 1.5 Iris-versicolor

5.7 2.8 4.5 1.3 Iris-versicolor

6.3 3.3 4.7 1.6 Iris-versicolor

4.9 2.4 3.3 1 Iris-versicolor

6.6 2.9 4.6 1.3 Iris-versicolor

5.2 2.7 3.9 1.4 Iris-versicolor

5 2 3.5 1 Iris-versicolor

5.9 3 4.2 1.5 Iris-versicolor

6 2.2 4 1 Iris-versicolor

6.1 2.9 4.7 1.4 Iris-versicolor

5.6 2.9 3.6 1.3 Iris-versicolor

Attribute Information: 1. sepal length in cm 2. sepal width in cm 3. petal length in cm 4. petal width in cm 5. class: -- Iris Setosa -- Iris Versicolour -- Iris Virginica

Example: Sea Surface Temperature • The following shows the Sea Surface Temperature

(SST) for July 1982

o Tens of thousands of data points are summarized in a single

figure

Chp2 Slides by: Shree Jaswal 37

Representation • Is the mapping of information to a visual format

• Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.

• Example: o Objects are often represented as points

o Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape

o If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, is easily perceived.

Chp2 Slides by: Shree Jaswal 38

Arrangement • Is the placement of visual elements within a display

• Can make a large difference in how easy it is to

understand the data

• Example:

Chp2 Slides by: Shree Jaswal 39

Selection • Is the elimination or the de-emphasis of certain

objects and attributes

• Selection may involve the choosing a subset of attributes o Dimensionality reduction is often used to reduce the number

of dimensions to two or three (in case of few dimensions)

o Alternatively, pairs of attributes can be considered

• Selection may also involve choosing a subset of objects o A region of the screen can only show so many points

o Can sample, but want to preserve points in sparse areas

Chp2 Slides by: Shree Jaswal 40

Visualization Techniques: Histograms • Histogram

o Usually shows the distribution of values of a single variable

o Divide the values into bins and show a bar plot of the

number of objects in each bin.

o The height of each bar indicates the number of objects

o Shape of histogram depends on the number of bins

• Example: Petal Width (10 and 20 bins, respectively)

Chp2 Slides by: Shree Jaswal 41

Two-Dimensional Histograms • Show the joint distribution of the values of two

attributes

• Example: petal width and petal length

Chp2 Slides by: Shree Jaswal 42

Visualization Techniques: Box Plots • Box Plots

o Invented by J. Tukey

o Another way of displaying the distribution of data

o Following figure shows the basic part of a box plot

outlier

10th percentile

25th percentile

75th percentile

50th percentile

10th percentile

Chp2 Slides by: Shree Jaswal 43

Example of Box Plots • Box plots can be used to compare attributes

Chp2 Slides by: Shree Jaswal 44

Chp2 Slides by: Shree Jaswal 45

Visualization of Data Dispersion: 3-D Boxplots

46

Histogram Analysis

• Graph displays of basic statistical class descriptions

o Frequency histograms

• A univariate graphical method

• Consists of a set of rectangles that reflect the counts or

frequencies of the classes present in the given data

Chp2 Slides by: Shree Jaswal

47

Histograms Often Tells More than Boxplots

The two histograms shown in

the left may have the same

boxplot representation

The same values for: min,

Q1, median, Q3, max

But they have rather different

data distributions

Chp2 Slides by: Shree Jaswal

48

Quantile Plot • Displays all of the data (allowing the user to assess

both the overall behavior and unusual occurrences)

• Plots quantile information

o For a data xi data sorted in increasing order, fi indicates that approximately 100 fi% of the data are below or equal to the value xi

Chp2 Slides by: Shree Jaswal

49

Quantile-Quantile (Q-Q) Plot

• Graphs the quantiles of one univariate distribution

against the corresponding quantiles of another

• Allows the user to view whether there is a shift in

going from one distribution to another

Chp2 Slides by: Shree Jaswal

50

Scatter plot • Provides a first look at bivariate data to see clusters

of points, outliers, etc

• Each pair of values is treated as a pair of

coordinates and plotted as points in the plane

Chp2 Slides by: Shree Jaswal

Visualization Techniques: Scatter Plots

• Scatter plots o Attributes values determine the position

o Two-dimensional scatter plots most common, but can have three-dimensional scatter plots

o Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects

o It is useful to have arrays of scatter plots can compactly summarize the relationships of several pairs of attributes

• See example on the next slide

Chp2 Slides by: Shree Jaswal 51

Scatter Plot Array of Iris Attributes

Chp2 Slides by: Shree Jaswal 52

53

Positively and Negatively Correlated Data

• The left half fragment is positively

correlated

• The right half is negative correlated

Chp2 Slides by: Shree Jaswal

54

Not Correlated Data

Chp2 Slides by: Shree Jaswal

Visualization Techniques: Contour Plots

• Contour plots o Useful when a continuous attribute is measured on a spatial

grid

o They partition the plane into regions of similar values

o The contour lines that form the boundaries of these regions connect points with equal values

o The most common example is contour maps of elevation

o Can also display temperature, rainfall, air pressure, etc.

• An example for Sea Surface Temperature (SST) is provided on the next slide

Chp2 Slides by: Shree Jaswal 55

Contour Plot Example: SST Dec, 1998

Celsius

Chp2 Slides by: Shree Jaswal 56

Visualization Techniques: Matrix Plots

• Matrix plots

o Can plot the data matrix

o This can be useful when objects are sorted according to

class

o Typically, the attributes are normalized to prevent one

attribute from dominating the plot

o Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects

o Examples of matrix plots are presented on the next slide

Chp2 Slides by: Shree Jaswal 57

Visualization of the Iris Data Matrix

standard deviation Chp2 Slides by: Shree Jaswal 58

Visualization Techniques: Parallel Coordinates

• Parallel Coordinates o Used to plot the attribute values of high-dimensional data

o Instead of using perpendicular axes, use a set of parallel axes

o The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line

o Thus, each object is represented as a line

o Often, the lines representing a distinct class of objects group together, at least for some attributes

o Ordering of attributes is important in seeing such groupings

Chp2 Slides by: Shree Jaswal 59

Parallel Coordinates Plots for Iris Data

Chp2 Slides by: Shree Jaswal 60

Other Visualization Techniques • Star Plots

o Similar approach to parallel coordinates, but axes radiate from a central point

o The line connecting the values of an object is a polygon

• Chernoff Faces o Approach created by Herman Chernoff

o This approach associates each attribute with a characteristic of a face

o The values of each attribute determine the appearance of the corresponding facial characteristic

o Each object becomes a separate face

o Relies on human’s ability to distinguish faces

Chp2 Slides by: Shree Jaswal 61

Star Plots for Iris Data

Setosa

Versicolour

Virginica

Chp2 Slides by: Shree Jaswal 62

Chernoff Faces for Iris Data

Setosa

Versicolour

Virginica

Chp2 Slides by: Shree Jaswal 63

Chp2 Slides by: Shree Jaswal 64

census data

showing age,

income, sex,

education, etc.

Stick Figures

Chp2 Slides by: Shree Jaswal 65

Hierarchical Techniques

• Visualization of the data using a hierarchical

partitioning into subspaces.

• Methods

o Dimensional Stacking

o Worlds-within-Worlds

o Treemap

o Cone Trees

o InfoCube

66

Tree-Map • Screen-filling method which uses a hierarchical partitioning

of the screen into regions depending on the attribute values

• The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes)

MSR Netscan Image

Chp2 Slides by: Shree Jaswal

Chp2 Slides by: Shree Jaswal 67

Chp2 Slides by: Shree Jaswal 68

69

Similarity and Dissimilarity

• Similarity

o Numerical measure of how alike two data objects

are

o Value is higher when objects are more alike

o Often falls in the range [0,1]

• Dissimilarity (i.e., distance)

o Numerical measure of how different are two data

objects

o Lower when objects are more alike

o Minimum dissimilarity is often 0

o Upper limit varies

• Proximity refers to a similarity or dissimilarity Chp2 Slides by: Shree Jaswal

70

Data Matrix and Dissimilarity Matrix

• Data matrix

o n data points with p

dimensions

o Two modes

• Dissimilarity matrix

o n data points, but

registers only the

distance

o A triangular matrix

o Single mode

npx...

nfx...

n1x

...............

ipx...

ifx...

i1x

...............

1px...

1fx...

11x

0...)2,()1,(

:::

)2,3()

...ndnd

0dd(3,1

0d(2,1)

0

Chp2 Slides by: Shree Jaswal

Chp2 Slides by: Shree Jaswal 71

Nominal Variables

• A generalization of the binary variable in that it can

take more than 2 states, e.g., red, yellow, blue, green

• Method 1: Simple matching

o m: # of matches, p: total # of variables

• Method 2: Use a large number of binary variables

o creating a new binary variable for each of the M

nominal states

p

mpjid

),(

72

Binary Variables

• A contingency table for binary

data

• Distance measure for symmetric

binary variables:

• Distance measure for asymmetric

binary variables:

• Jaccard coefficient (similarity

measure for asymmetric binary

variables): cba

a jisimJaccard

),(

dcba

cb jid

),(

cba

cb jid

),(

pdbcasum

dcdc

baba

sum

0

1

01

Object i

Object j

Chp2 Slides by: Shree Jaswal

73

Dissimilarity between Binary Variables

• Example

o gender is a symmetric attribute

o the remaining attributes are asymmetric binary

o let the values Y and P be set to 1, and the value N be set to

0

N a m e G e n d e r F e v e r C o u g h T e s t-1 T e s t-2 T e s t-3 T e s t-4

J a c k M Y N P N N N

M a ry F Y N P N P N

J im M Y P N N N N

?),(

?),(

?),(

maryjimd

jimjackd

maryjackd

Chp2 Slides by: Shree Jaswal

• Jim and marry are unlikely to have a similar disease whereas Jack and Mary are most likely to have a similar disease

Chp2 Slides by: Shree Jaswal 74

75.0211

21),(

67.0111

11),(

33.0102

10),(

maryjimd

jimjackd

maryjackd

Euclidean Distance

• Euclidean Distance

Where n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) of data objects p and q.

• Standardization is necessary, if scales differ.

n

kkk qpdist

1

2)(

Chp2 Slides by: Shree Jaswal 75

Euclidean Distance

0

1

2

3

0 1 2 3 4 5 6

p1

p2

p3 p4

point x y

p1 0 2

p2 2 0

p3 3 1

p4 5 1

Distance Matrix

p1 p2 p3 p4

p1 0 2.828 3.162 5.099

p2 2.828 0 1.414 3.162

p3 3.162 1.414 0 2

p4 5.099 3.162 2 0

Chp2 Slides by: Shree Jaswal 76

Manhattan Distance • It is named so because it is the distance in blocks

between any two points in a city

• A popular distance measure

• It is defined as:

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp)

are two p-dimensional data objects

Chp2 Slides by: Shree Jaswal 77

||...||||),(2211 pp j

xi

xj

xi

xj

xi

xjid

78

Minkowski Distance • Minkowski distance: A generalization of Euclidean and

Manhattan distances

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-

dimensional data objects, and q is the order

• Properties

o d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)

o d(i, j) = d(j, i) (Symmetry)

o d(i, j) d(i, k) + d(k, j) (Triangle Inequality)

• A distance that satisfies these properties is a metric

qq

pp

qq

jx

ix

jx

ix

jx

ixjid )||...|||(|),(

2211

Chp2 Slides by: Shree Jaswal

79

Special Cases of Minkowski Distance

• q = 1: Manhattan (city block, L1 norm) distance

o E.g., the Hamming distance: the number of bits that are different between two binary vectors

• q= 2: (L2 norm) Euclidean distance

• q . “supremum” (Lmax norm, L norm) distance.

o This is the maximum difference between any component of the vectors

• Do not confuse q with n, i.e., all these distances are defined for all numbers of dimensions.

• Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures

)||...|||(|),(22

22

2

11 pp jx

ix

jx

ix

jx

ixjid

||...||||),(2211 pp j

xi

xj

xi

xj

xi

xjid

Chp2 Slides by: Shree Jaswal

Chp2 Slides by: Shree Jaswal 80

Example: Minkowski Distance

Distance Matrix

point x y

p1 0 2

p2 2 0

p3 3 1

p4 5 1

L1 p1 p2 p3 p4

p1 0 4 4 6

p2 4 0 2 4

p3 4 2 0 2

p4 6 4 2 0

L2 p1 p2 p3 p4

p1 0 2.828 3.162 5.099

p2 2.828 0 1.414 3.162

p3 3.162 1.414 0 2

p4 5.099 3.162 2 0

L p1 p2 p3 p4

p1 0 2 3 5

p2 2 0 1 3

p3 3 1 0 2

p4 5 3 2 0

Common Properties of a Distance

• Distances, such as the Euclidean distance, have some well known properties.

1. d(p, q) 0 for all p and q and d(p, q) = 0 only if p = q. (Positive definiteness)

2. d(p, q) = d(q, p) for all p and q. (Symmetry)

3. d(p, r) d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)

where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.

• A distance that satisfies these properties is a metric

Chp2 Slides by: Shree Jaswal 81

Common Properties of a Similarity

• Similarities, also have some well known properties.

1. s(p, q) = 1 (or maximum similarity) only if p = q.

2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data objects), p and q.

Chp2 Slides by: Shree Jaswal 82

83

Ordinal Variables

• An ordinal variable can be discrete or continuous

• Order is important, e.g., rank

• Can be treated like interval-scaled

o replace xif by their rank

o map the range of each variable onto [0, 1] by

replacing i-th object in the f-th variable by

o compute the dissimilarity using methods for

interval-scaled variables

1

1

f

if

if M

rz

},...,1{fif

Mr

Chp2 Slides by: Shree Jaswal

Numeric Variables

Chp2 Slides by: Shree Jaswal 84

hXhfhXhf

ififf

ij

xxd

minmax

|)(

|

86

Variables of Mixed Types

• A database may contain all the six types of variables

o symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio

• One may use a weighted formula to combine their effects

o f is binary or nominal: dij

(f) = 0 if xif = xjf , or dij(f) = 1 otherwise

o f is interval-based: use the normalized distance

o f is ordinal or ratio-scaled • Compute ranks rif and

• Treat zif as interval-scaled

)(

1

)()(

1),(

f

ij

p

f

f

ij

f

ij

p

fd

jid

1

1

f

if

M

rz

if

Chp2 Slides by: Shree Jaswal

Similarity/Dissimilarity for Simple Attributes p and q are the attribute values for two data objects.

Chp2 Slides by: Shree Jaswal 87

88

Vector Objects: Cosine Similarity

• Vector objects: keywords in documents, gene features

in micro-arrays, …

• Applications: information retrieval, biologic taxonomy,

...

• Cosine similarity measure: If d1 and d2 are two vectors,

then

sim(d1, d2) = (d1 d2) /||d1|| ||d2|| ,

where indicates vector dot product, ||d||: the

length of vector d

• A cosine value of 0 indicates that two vectors are at 90

degrees to each other(orthogonal) and have no

match. The closer the cosine value to 1, the smaller the

angle and greater the match between vectors Chp2 Slides by: Shree Jaswal

• Example:

d1 = 3 2 0 5 0 0 0 2 0 0

d2 = 1 0 0 0 0 0 0 1 0 2

d1d2 = 3*1+2*0+0*0+5*0+0*0+0*0+0*0+2*1+0*0+0*2 = 5

||d1||= (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481

||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)0.5=(6) 0.5 = 2.245

sim( d1, d2 ) = .3150

Chp2 Slides by: Shree Jaswal 89