Data Exploration - Shree Jaswal | Assistant Professor, St ...ssjaswal.com/wp-content/uploads/2017/02/DMBI_chp2.pdfWhich Chapter from which Text Book ? • Chapter 2: Getting to know

Data Exploration

Slides by: Shree Jaswal

Topics to be covered • Types of Attributes;

• Statistical Description of Data;

• Data Visualization;

• Measuring similarity and dissimilarity.

Chp2 Slides by: Shree Jaswal 2

Which Chapter from which Text Book ?

• Chapter 2: Getting to know your data from Han,

Kamber, "Data Mining Concepts and Techniques",

Morgan Kaufmann 3nd Edition

• Chapter 2: Data from P. N. Tan, M. Steinbach, Vipin

Kumar, “Introduction to Data Mining”, Pearson

Education

• Chapter 3: Exploring data from P. N. Tan, M.

Steinbach, Vipin Kumar, “Introduction to Data

Mining”, Pearson Education


Course Outcome achieved • TEITC604.2: Able to prepare the

data needed for data mining

algorithms in terms of attributes

and class inputs, training,

validating, and testing files.


What is Data?

• Collection of data objects

and their attributes

• An attribute is a property or characteristic of an object

o Examples: eye color of a

person, temperature, etc.

o Attribute is also known as

variable, field,

characteristic, or feature

• A collection of attributes

describe an object

o Object is also known as

record, point, case, sample,

entity, or instance

T id R e fu n d M a r ita l

S ta tu s

T a x a b le

In c o m e C h e a t

1 Y e s S in g le 1 2 5 K N o

2 N o M a rr ie d 1 0 0 K N o

3 N o S in g le 7 0 K N o

4 Y e s M a rr ie d 1 2 0 K N o

5 N o D iv o rc e d 9 5 K Y e s

6 N o M a rr ie d 6 0 K N o

7 Y e s D iv o rc e d 2 2 0 K N o

8 N o S in g le 8 5 K Y e s


1 0 N o S in g le 9 0 K Y e s 10

Attributes

Objects


Attribute Values • Attribute values are numbers or symbols assigned to

an attribute

• Distinction between attributes and attribute values o Same attribute can be mapped to different attribute values

• Example: height can be measured in feet or meters

o Different attributes can be mapped to the same set of values

• Example: Attribute values for ID and age are integers

• But properties of attribute values can be different

o ID has no limit but age has a maximum and minimum value


Types of Attributes • There are different types of attributes

o Nominal

• Examples: ID numbers, eye color, zip codes

o Binary

• Symmetric and Asymmetric types

o Ordinal

• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}

o Interval

• Examples: calendar dates, temperatures in Celsius or Fahrenheit.

o Ratio

• Examples: temperature in Kelvin, length, time, counts


Properties of Attribute Values

• The type of an attribute depends on which of the

following properties it possesses:

o Distinctness: =

o Order: < >

o Addition: + -

o Multiplication: * /

o Nominal attribute: distinctness

o Ordinal attribute: distinctness & order

o Interval attribute: distinctness, order & addition

o Ratio attribute: all 4 properties


Attribute

Type

Description

Examples

Operations

Nominal

The values of a nominal attribute are

just different names, i.e., nominal

attributes provide only enough

information to distinguish one object

from another. (=, )

zip codes,

employee ID

numbers, eye color,

sex: {male, female}

mode, entropy,

contingency

correlation, 2

test

Ordinal

The values of an ordinal

attribute provide enough

information to order objects.

(<, >)

hardness of minerals,

{good, better, best},

grades, street numbers

median,

percentiles, rank

correlation, run

tests, sign tests

Interval

For interval attributes, the

differences between values are

meaningful, i.e., a unit of

measurement exists.(+, - )

calendar dates,

temperature in

Celsius or

Fahrenheit

mean, standard

deviation,

Pearson's

correlation, t

and F tests

Ratio

For ratio variables, both

differences and ratios are

meaningful. (*, /)

temperature in

Kelvin, monetary

quantities, counts,

age, mass, length,

electrical current

geometric

mean,

harmonic

mean, percent

variation


Attribute

Level

Transformation

Comments

Nominal

Any permutation of values

If all employee ID numbers

were reassigned, would it

make any difference?

Ordinal

An order preserving change of

values, i.e.,

new_value = f(old_value)

where f is a monotonic function.

An attribute encompassing

the notion of good, better

best can be represented

equally well by the values

{1, 2, 3} or by { 0.5, 1,

10}.

Interval

new_value =a * old_value + b

where a and b are constants

Thus, the Fahrenheit and

Celsius temperature scales

differ in terms of where

their zero value is and the

size of a unit (degree).

Ratio

new_value = a * old_value

Length can be measured in

meters or feet.


Discrete and Continuous Attributes

• Discrete Attribute o Has only a finite or countably infinite set of values

o Examples: zip codes, counts, or the set of words in a collection of documents

o Often represented as integer variables.

o Note: binary attributes are a special case of discrete attributes

• Continuous Attribute o Has real numbers as attribute values

o Examples: temperature, height, or weight.

o Practically, real values can only be measured and represented using a finite number of digits.

o Continuous attributes are typically represented as floating-point variables.


Types of data sets • Record

o Data Matrix

o Document Data

o Transaction Data

• Graph o World Wide Web

o Molecular Structures

• Ordered o Spatial Data

o Temporal Data

o Sequential Data

o Genetic Sequence Data


Important Characteristics of Structured Data

o Dimensionality

• Curse of Dimensionality

o Sparsity

• Only presence counts

o Resolution

• Patterns depend on the scale


Record Data • Data that consists of a collection of records, each

of which consists of a fixed set of attributes

T id R e fu n d M a r ita l

S ta tu s

T a x a b le

In c o m e C h e a t

1 Y e s S in g le 1 2 5 K N o

2 N o M a rr ie d 1 0 0 K N o

3 N o S in g le 7 0 K N o

4 Y e s M a rr ie d 1 2 0 K N o

5 N o D iv o rc e d 9 5 K Y e s


7 Y e s D iv o rc e d 2 2 0 K N o

8 N o S in g le 8 5 K Y e s


1 0 N o S in g le 9 0 K Y e s 10


Transaction Data • A special type of record data, where

o each record (transaction) involves a set of items.

o For example, consider a grocery store. The set of products

purchased by a customer during one shopping trip

constitute a transaction, while the individual products that

were purchased are the items.

T I D I tem s

1 B r e a d , C o k e , M ilk

2 B e e r , B r e a d

3 B e e r , C o k e , D ia p e r , M ilk

4 B e e r , B r e a d , D ia p e r , M ilk

5 C o k e , D ia p e r , M ilk


Data Matrix • If data objects have the same fixed set of numeric

attributes, then the data objects can be thought of

as points in a multi-dimensional space, where each

dimension represents a distinct attribute

• Such data set can be represented by an m by n

matrix, where there are m rows, one for each object,

and n columns, one for each attribute

1.12.216.226.2512.65

1.22.715.225.2710.23

Thickness LoadDistanceProjection

of y load

Projection

of x Load

1.12.216.226.2512.65

1.22.715.225.2710.23

Thickness LoadDistanceProjection

of y load

Projection

of x Load


Document Data • Each document becomes a `term' vector,

o each term is a component (attribute) of the vector,

o the value of each component is the number of times the

corresponding term occurs in the document.

Document 1

se

aso

n

time

ou

t

lost

wi

n

ga

me

sco

re

ba

ll

play

co

ach

tea

m

Document 2

Document 3

3 0 5 0 2 6 0 2 0 2

0

0

7 0 2 1 0 0 3 0 0

1 0 0 1 2 2 0 3 0


Graph Data • Examples: Generic graph and HTML Links

5

2

1

2

5

<a href="papers/papers.html#bbbb">

Data Mining </a>

<li>

<a href="papers/papers.html#aaaa">

Graph Partitioning </a>

<li>

<a href="papers/papers.html#aaaa">

Parallel Solution of Sparse Linear System of Equations </a>

<li>

<a href="papers/papers.html#ffff">

N-Body Computation and Dense Linear System Solvers


Chemical Data • Benzene Molecule: C6H6


Ordered Data • Sequences of transactions

An element of the sequence

Items/Events


Ordered Data • Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC

CGCAGGGCCCGCCCCGCGCCGTC

GAGAAGGGCCCGCCTGGCGGGCG

GGGGGAGGCGGGGCCGCCCGAGC

CCAACCGAGTCCGACCAGGTGCC

CCCTCTGCTCGGCCTAGACCTGA

GCTCATTAGGCGGCAGCGGACAG

GCCAAGTAGAACACGCGAAGCGC

TGGGCTGCCTGCTGCGACCAGGG


Ordered Data

• Spatio-Temporal Data

Average Monthly Temperature of land and ocean


Data Quality • What kinds of data quality problems?

• How can we detect problems with the data?

• What can we do about these problems?

• Examples of data quality problems:

o Noise and outliers

o missing values

o duplicate data


Noise • Noise refers to modification of original values

o Examples: distortion of a person’s voice when talking on a poor phone

and “snow” on television screen

Two Sine Waves Two Sine Waves + Noise


Outliers • Outliers are data objects with characteristics that

are considerably different than most of the other

data objects in the data set


Missing Values • Reasons for missing values

o Information is not collected (e.g., people decline to give their age and weight)

o Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)

• Handling missing values o Eliminate Data Objects

o Estimate Missing Values

o Ignore the Missing Value During Analysis

o Replace with all possible values (weighted by their probabilities)


Duplicate Data • Data set may include data objects that are

duplicates, or almost duplicates of one another

o Major issue when merging data from heterogeneous

sources

• Examples:

o Same person with multiple email addresses

• Data cleaning

o Process of dealing with duplicate data issues


28

Mining Data Descriptive Characteristics • Motivation

o To better understand the data: central tendency,

variation and spread

• Data dispersion characteristics

o median, max, min, quantiles, outliers, variance, etc.

• Numerical dimensions correspond to sorted intervals

o Data dispersion: analyzed with multiple granularities of

precision

o Boxplot or quantile analysis on sorted intervals

• Dispersion analysis on computed measures

o Folding measures into numerical dimensions

o Boxplot or quantile analysis on the transformed cube

Chp2 Slides by: Shree Jaswal

Measuring the Central Tendency

• Mean (algebraic measure) (sample vs. population):

o Weighted arithmetic mean:

o Trimmed mean: chopping extreme values

• Median: A holistic measure

o Middle value if odd number of values, or average of the middle

two values otherwise

o Estimated by interpolation (for grouped data):

• Mode

o Value that occurs most frequently in the data

o Unimodal, bimodal, trimodal

o Empirical formula:

n

i

ix

nx

1

1

n

i

i

n

i

ii

w

xw

x

1

1

widthfreq

lfreqNLmedian

median

))(2/

(1

)(3 medianmeanmodemean

N

x



Symmetric vs. Skewed Data

• Median, mean and mode of

symmetric, positively and

negatively skewed data

positively skewed negatively skewed

symmetric

Measuring the Dispersion of Data

• Quartiles, outliers and boxplots

o Quartiles: Q1 (25th percentile), Q3 (75th percentile)

o Inter-quartile range: IQR = Q3 – Q1

o Five number summary: min, Q1, M, Q3, max

o Boxplot: ends of the box are the quartiles, median is marked,

whiskers, and plot outlier individually

o Outlier: usually, a value higher/lower than 1.5 x IQR

• Variance and standard deviation (sample: s, population: σ)

o Variance: (algebraic, scalable computation)

o Standard deviation s (or σ) is the square root of variance s2 (or σ2)

n

i

n

i

ii

n

i

ix

nx

nxx

ns

1 1

22

1

22])(

1[

1

1)(

1

1

n

i

i

n

i

ix

Nx

N 1

22

1

22 1)(

1


Boxplot Analysis

• Five-number summary of a distribution:

Minimum, Q1, M, Q3, Maximum

• Boxplot

o Data is represented with a box

o The ends of the box are at the first and third

quartiles, i.e., the height of the box is IQR

o The median is marked by a line within the

box

o Whiskers: two lines outside the box extend to

Minimum and Maximum


Visualization Visualization is the conversion of data into a visual or

tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

• Visualization of data is one of the most powerful and appealing techniques for data exploration.

o Humans have a well developed ability to analyze large amounts of information that is presented visually

o Can detect general patterns and trends

o Can detect outliers and unusual patterns


34

Data Visualization and Its Methods

• Why data visualization?

o Gain insight into an information space by mapping data onto

graphical primitives

o Provide qualitative overview of large data sets

o Search for patterns, trends, structure, irregularities,

relationships among data

o Help find interesting regions and suitable parameters for

further quantitative analysis

o Provide a visual proof of computer representations derived

• Typical visualization methods:

o Geometric techniques

o Icon-based techniques

o Hierarchical techniques


Iris Sample Data Set • Many of the exploratory data techniques are

illustrated with the Iris Plant data set. o Can be obtained from the UCI Machine Learning Repository

http://www.ics.uci.edu/~mlearn/MLRepository.html

o From the statistician Douglas Fisher

o Three flower types (classes):

• Setosa

• Virginica

• Versicolour

o Four (non-class) attributes

• Sepal width and length

• Petal width and length Virginica. Robert H. Mohlenbrock.

USDA NRCS. 1995. Northeast

wetland flora: Field office guide to

plant species. Northeast National

Technical Center, Chester, PA.

Courtesy of USDA NRCS Wetland

Science Institute. Chp2 Slides by: Shree Jaswal 35

http://www.ics.uci.edu/~mlearn/MLRepository.html


5.5 4.2 1.4 0.2 Iris-setosa

4.9 3.1 1.5 0.1 Iris-setosa

5 3.2 1.2 0.2 Iris-setosa

5.5 3.5 1.3 0.2 Iris-setosa

4.9 3.1 1.5 0.1 Iris-setosa

4.4 3 1.3 0.2 Iris-setosa

5.1 3.4 1.5 0.2 Iris-setosa


4.5 2.3 1.3 0.3 Iris-setosa

4.4 3.2 1.3 0.2 Iris-setosa


5.1 3.8 1.9 0.4 Iris-setosa

4.8 3 1.4 0.3 Iris-setosa

5.1 3.8 1.6 0.2 Iris-setosa

4.6 3.2 1.4 0.2 Iris-setosa

5.3 3.7 1.5 0.2 Iris-setosa


7 3.2 4.7 1.4 Iris-versicolor

6.4 3.2 4.5 1.5 Iris-versicolor


5.5 2.3 4 1.3 Iris-versicolor




4.9 2.4 3.3 1 Iris-versicolor



5 2 3.5 1 Iris-versicolor

5.9 3 4.2 1.5 Iris-versicolor

6 2.2 4 1 Iris-versicolor



Attribute Information: 1. sepal length in cm 2. sepal width in cm 3. petal length in cm 4. petal width in cm 5. class: -- Iris Setosa -- Iris Versicolour -- Iris Virginica

Example: Sea Surface Temperature • The following shows the Sea Surface Temperature

(SST) for July 1982

o Tens of thousands of data points are summarized in a single

figure


Representation • Is the mapping of information to a visual format

• Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.

• Example: o Objects are often represented as points

o Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape

o If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, is easily perceived.


Arrangement • Is the placement of visual elements within a display

• Can make a large difference in how easy it is to

understand the data

• Example:


Selection • Is the elimination or the de-emphasis of certain

objects and attributes

• Selection may involve the choosing a subset of attributes o Dimensionality reduction is often used to reduce the number

of dimensions to two or three (in case of few dimensions)

o Alternatively, pairs of attributes can be considered

• Selection may also involve choosing a subset of objects o A region of the screen can only show so many points

o Can sample, but want to preserve points in sparse areas


Visualization Techniques: Histograms • Histogram

o Usually shows the distribution of values of a single variable

o Divide the values into bins and show a bar plot of the

number of objects in each bin.

o The height of each bar indicates the number of objects

o Shape of histogram depends on the number of bins

• Example: Petal Width (10 and 20 bins, respectively)


Two-Dimensional Histograms • Show the joint distribution of the values of two

attributes

• Example: petal width and petal length


Visualization Techniques: Box Plots • Box Plots

o Invented by J. Tukey

o Another way of displaying the distribution of data

o Following figure shows the basic part of a box plot

outlier

10th percentile

25th percentile

75th percentile

50th percentile

10th percentile


Example of Box Plots • Box plots can be used to compare attributes



Visualization of Data Dispersion: 3-D Boxplots

46

Histogram Analysis

• Graph displays of basic statistical class descriptions

o Frequency histograms

• A univariate graphical method

• Consists of a set of rectangles that reflect the counts or

frequencies of the classes present in the given data


47

Histograms Often Tells More than Boxplots

The two histograms shown in

the left may have the same

boxplot representation

The same values for: min,

Q1, median, Q3, max

But they have rather different

data distributions


48

Quantile Plot • Displays all of the data (allowing the user to assess

both the overall behavior and unusual occurrences)

• Plots quantile information

o For a data xi data sorted in increasing order, fi indicates that approximately 100 fi% of the data are below or equal to the value xi


49

Quantile-Quantile (Q-Q) Plot

• Graphs the quantiles of one univariate distribution

against the corresponding quantiles of another

• Allows the user to view whether there is a shift in

going from one distribution to another


50

Scatter plot • Provides a first look at bivariate data to see clusters

of points, outliers, etc

• Each pair of values is treated as a pair of

coordinates and plotted as points in the plane


Visualization Techniques: Scatter Plots

• Scatter plots o Attributes values determine the position

o Two-dimensional scatter plots most common, but can have three-dimensional scatter plots

o Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects

o It is useful to have arrays of scatter plots can compactly summarize the relationships of several pairs of attributes

• See example on the next slide


Scatter Plot Array of Iris Attributes


53

Positively and Negatively Correlated Data

• The left half fragment is positively

correlated

• The right half is negative correlated


54

Not Correlated Data


Visualization Techniques: Contour Plots

• Contour plots o Useful when a continuous attribute is measured on a spatial

grid

o They partition the plane into regions of similar values

o The contour lines that form the boundaries of these regions connect points with equal values

o The most common example is contour maps of elevation

o Can also display temperature, rainfall, air pressure, etc.

• An example for Sea Surface Temperature (SST) is provided on the next slide


Contour Plot Example: SST Dec, 1998

Celsius


Visualization Techniques: Matrix Plots

• Matrix plots

o Can plot the data matrix

o This can be useful when objects are sorted according to

class

o Typically, the attributes are normalized to prevent one

attribute from dominating the plot

o Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects

o Examples of matrix plots are presented on the next slide


Visualization of the Iris Data Matrix

standard deviation Chp2 Slides by: Shree Jaswal 58

Visualization Techniques: Parallel Coordinates

• Parallel Coordinates o Used to plot the attribute values of high-dimensional data

o Instead of using perpendicular axes, use a set of parallel axes

o The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line

o Thus, each object is represented as a line

o Often, the lines representing a distinct class of objects group together, at least for some attributes

o Ordering of attributes is important in seeing such groupings


Parallel Coordinates Plots for Iris Data


Other Visualization Techniques • Star Plots

o Similar approach to parallel coordinates, but axes radiate from a central point

o The line connecting the values of an object is a polygon

• Chernoff Faces o Approach created by Herman Chernoff

o This approach associates each attribute with a characteristic of a face

o The values of each attribute determine the appearance of the corresponding facial characteristic

o Each object becomes a separate face

o Relies on human’s ability to distinguish faces


Star Plots for Iris Data

Setosa

Versicolour

Virginica


Chernoff Faces for Iris Data

Setosa

Versicolour

Virginica



census data

showing age,

income, sex,

education, etc.

Stick Figures


Hierarchical Techniques

• Visualization of the data using a hierarchical

partitioning into subspaces.

• Methods

o Dimensional Stacking

o Worlds-within-Worlds

o Treemap

o Cone Trees

o InfoCube

66

Tree-Map • Screen-filling method which uses a hierarchical partitioning

of the screen into regions depending on the attribute values

• The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes)

MSR Netscan Image




69

Similarity and Dissimilarity

• Similarity

o Numerical measure of how alike two data objects

are

o Value is higher when objects are more alike

o Often falls in the range [0,1]

• Dissimilarity (i.e., distance)

o Numerical measure of how different are two data

objects

o Lower when objects are more alike

o Minimum dissimilarity is often 0

o Upper limit varies

• Proximity refers to a similarity or dissimilarity Chp2 Slides by: Shree Jaswal

70

Data Matrix and Dissimilarity Matrix

• Data matrix

o n data points with p

dimensions

o Two modes

• Dissimilarity matrix

o n data points, but

registers only the

distance

o A triangular matrix

o Single mode

npx...

nfx...

n1x

...............

ipx...

ifx...

i1x

...............

1px...

1fx...

11x

0...)2,()1,(

:::

)2,3()

...ndnd

0dd(3,1

0d(2,1)

0



Nominal Variables

• A generalization of the binary variable in that it can

take more than 2 states, e.g., red, yellow, blue, green

• Method 1: Simple matching

o m: # of matches, p: total # of variables

• Method 2: Use a large number of binary variables

o creating a new binary variable for each of the M

nominal states

p

mpjid

),(

72

Binary Variables

• A contingency table for binary

data

• Distance measure for symmetric

binary variables:

• Distance measure for asymmetric

binary variables:

• Jaccard coefficient (similarity

measure for asymmetric binary

variables): cba

a jisimJaccard

),(

dcba

cb jid

),(

cba

cb jid

),(

pdbcasum

dcdc

baba

sum

0

1

01

Object i

Object j


73

Dissimilarity between Binary Variables

• Example

o gender is a symmetric attribute

o the remaining attributes are asymmetric binary

o let the values Y and P be set to 1, and the value N be set to

0

N a m e G e n d e r F e v e r C o u g h T e s t-1 T e s t-2 T e s t-3 T e s t-4

J a c k M Y N P N N N

M a ry F Y N P N P N

J im M Y P N N N N

?),(

?),(

?),(

maryjimd

jimjackd

maryjackd


• Jim and marry are unlikely to have a similar disease whereas Jack and Mary are most likely to have a similar disease


75.0211

21),(

67.0111

11),(

33.0102

10),(

maryjimd

jimjackd

maryjackd

Euclidean Distance

• Euclidean Distance

Where n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) of data objects p and q.

• Standardization is necessary, if scales differ.

n

kkk qpdist

1

2)(


Euclidean Distance

0

1

2

3

0 1 2 3 4 5 6

p1

p2

p3 p4

point x y

p1 0 2

p2 2 0

p3 3 1

p4 5 1

Distance Matrix

p1 p2 p3 p4

p1 0 2.828 3.162 5.099

p2 2.828 0 1.414 3.162

p3 3.162 1.414 0 2

p4 5.099 3.162 2 0


Manhattan Distance • It is named so because it is the distance in blocks

between any two points in a city

• A popular distance measure

• It is defined as:

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp)

are two p-dimensional data objects


||...||||),(2211 pp j

xi

xj

xi

xj

xi

xjid

78

Minkowski Distance • Minkowski distance: A generalization of Euclidean and

Manhattan distances

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-

dimensional data objects, and q is the order

• Properties

o d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)

o d(i, j) = d(j, i) (Symmetry)

o d(i, j) d(i, k) + d(k, j) (Triangle Inequality)

• A distance that satisfies these properties is a metric

qq

pp

qq

jx

ix

jx

ix

jx

ixjid )||...|||(|),(

2211


79

Special Cases of Minkowski Distance

• q = 1: Manhattan (city block, L1 norm) distance

o E.g., the Hamming distance: the number of bits that are different between two binary vectors

• q= 2: (L2 norm) Euclidean distance

• q . “supremum” (Lmax norm, L norm) distance.

o This is the maximum difference between any component of the vectors

• Do not confuse q with n, i.e., all these distances are defined for all numbers of dimensions.

• Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures

)||...|||(|),(22

22

2

11 pp jx

ix

jx

ix

jx

ixjid

||...||||),(2211 pp j

xi

xj

xi

xj

xi

xjid



Example: Minkowski Distance

Distance Matrix

point x y

p1 0 2

p2 2 0

p3 3 1

p4 5 1

L1 p1 p2 p3 p4

p1 0 4 4 6

p2 4 0 2 4

p3 4 2 0 2

p4 6 4 2 0

L2 p1 p2 p3 p4

p1 0 2.828 3.162 5.099

p2 2.828 0 1.414 3.162

p3 3.162 1.414 0 2

p4 5.099 3.162 2 0

L p1 p2 p3 p4

p1 0 2 3 5

p2 2 0 1 3

p3 3 1 0 2

p4 5 3 2 0

Common Properties of a Distance

• Distances, such as the Euclidean distance, have some well known properties.

1. d(p, q) 0 for all p and q and d(p, q) = 0 only if p = q. (Positive definiteness)

2. d(p, q) = d(q, p) for all p and q. (Symmetry)

3. d(p, r) d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)

where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.

• A distance that satisfies these properties is a metric


Common Properties of a Similarity

• Similarities, also have some well known properties.

1. s(p, q) = 1 (or maximum similarity) only if p = q.

2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data objects), p and q.


83

Ordinal Variables

• An ordinal variable can be discrete or continuous

• Order is important, e.g., rank

• Can be treated like interval-scaled

o replace xif by their rank

o map the range of each variable onto [0, 1] by

replacing i-th object in the f-th variable by

o compute the dissimilarity using methods for

interval-scaled variables

1

1

f

if

if M

rz

},...,1{fif

Mr


Numeric Variables


hXhfhXhf

ififf

ij

xxd

minmax

|)(

|

86

Variables of Mixed Types

• A database may contain all the six types of variables

o symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio

• One may use a weighted formula to combine their effects

o f is binary or nominal: dij

(f) = 0 if xif = xjf , or dij(f) = 1 otherwise

o f is interval-based: use the normalized distance

o f is ordinal or ratio-scaled • Compute ranks rif and

• Treat zif as interval-scaled

)(

1

)()(

1),(

f

ij

p

f

f

ij

f

ij

p

fd

jid

1

1

f

if

M

rz

if


Similarity/Dissimilarity for Simple Attributes p and q are the attribute values for two data objects.


88

Vector Objects: Cosine Similarity

• Vector objects: keywords in documents, gene features

in micro-arrays, …

• Applications: information retrieval, biologic taxonomy,

...

• Cosine similarity measure: If d1 and d2 are two vectors,

then

sim(d1, d2) = (d1 d2) /||d1|| ||d2|| ,

where indicates vector dot product, ||d||: the

length of vector d

• A cosine value of 0 indicates that two vectors are at 90

degrees to each other(orthogonal) and have no

match. The closer the cosine value to 1, the smaller the

angle and greater the match between vectors Chp2 Slides by: Shree Jaswal

• Example:

d1 = 3 2 0 5 0 0 0 2 0 0

d2 = 1 0 0 0 0 0 0 1 0 2

d1d2 = 3*1+2*0+0*0+5*0+0*0+0*0+0*0+2*1+0*0+0*2 = 5

||d1||= (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481

||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)0.5=(6) 0.5 = 2.245

sim( d1, d2 ) = .3150


Documents

Data Exploration - Shree Jaswal | Assistant Professor, St ...ssjaswal.com/wp-content/uploads/2017/02/DMBI_chp2.pdfWhich Chapter from which Text Book ? • Chapter 2: Getting to know