Upload
johnathan-evans
View
271
Download
8
Tags:
Embed Size (px)
Citation preview
1
Data Mining: Concepts and
Techniques
— Chapter 2 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign
Simon Fraser University
©2013 Han, Kamber, and Pei. All rights reserved.
2
Chapter 2: Getting to Know Your Data
Data Objects and Attribute Types
Basic Statistical Descriptions of Data
Data Visualization
Measuring Data Similarity and Dissimilarity
Summary
3
Types of Data Sets
Record Relational records Data matrix, e.g., numerical matrix,
crosstabs Document data: text documents: term-
frequency vector Transaction data
Graph and network World Wide Web Social or information networks Molecular Structures
Ordered Video data: sequence of images Temporal data: time-series Sequential Data: transaction
sequences Genetic sequence data
Spatial, image and multimedia: Spatial data: maps Image data: Video data:
4
Data Objects
Data sets are made up of data objects. A data object represents an entity. Examples:
sales database: customers, store items, sales medical database: patients, treatments university database: students, professors, courses
Also called samples , examples, instances, data points, objects, tuples.
Data objects are described by attributes. Database rows -> data objects; columns ->attributes.
5
Attributes
Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object. E.g., customer _ID, name, address
Types: Nominal Binary Ordinal Numeric
Interval-scaled Ratio-scaled
6
Attribute Types
Nominal: categories, states, or “names of things”, order not meaningful Hair_color = {auburn, black, blond, brown, grey, red, white}
Binary Nominal attribute with only 2 states (0 and 1) Symmetric binary: both outcomes equally important
e.g., gender Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative) Convention: assign 1 to most important outcome (e.g., HIV
positive) Ordinal
Values have a meaningful order (ranking) but magnitude between successive values is not known.
Size = {small, medium, large}, grades, army rankings
7
Numeric Attribute Types
Quantity (integer or real-valued) Interval
Measured on a scale of equal-sized units Values have order
E.g., temperature in C˚or F˚, calendar dates No true zero-point
Ratio Inherent zero-point We can speak of values as being an order of magnitude
larger than the unit of measurement (10 K˚ is twice as high as 5 K˚). e.g., temperature in Kelvin, length, counts,
monetary quantities
8
Discrete vs. Continuous Attributes
Discrete Attribute Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in a collection of documents
Continuous Attribute Has real numbers as attribute values
E.g., temperature, height, or weight Continuous attributes are typically represented as floating-
point variables
9
Chapter 2: Getting to Know Your Data
Data Objects and Attribute Types
Basic Statistical Descriptions of Data
Data Visualization
Measuring Data Similarity and Dissimilarity
Summary
10
Basic Statistical Descriptions of Data
Measures of central tendency Mean, median, mode
Dispersion of data range, quartiles and interquartile range, five-number
summary and boxplots, variance and standard deviation
11
Measuring the Central Tendency
Mean:
Note: n is sample size and N is population size.
Weighted arithmetic mean:
Trimmed mean: chopping extreme values
Median:
Middle value if odd number of values, or average of the
middle two values otherwise
Estimated by interpolation (for grouped data):
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula:
n
iixn
x1
1
n
ii
n
iii
w
xwx
1
1
)(3 medianmeanmodemean
Median interval
April 19, 2023 Data Mining: Concepts and Techniques 12
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively and negatively skewed data
positively skewed negatively skewed
symmetric
13
Measuring the Dispersion of Data
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 – Q1
Five number summary: min, Q1, median, Q3, max
Boxplot: A simple graphical device to display the overall shape of a
distribution, including the outliers. Ends of the box are the quartiles; median is
marked within the box; add whiskers to mark min and max, and plot outliers
individually
Outlier: Values less than Q1-1.5*IQR and greater than Q3+1.5*IQR are
outliers
14
Boxplot Analysis
Draw a box plot for the following dataset.
10.2, 14.1, 14.4. 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4
Here,
Q2(median) = 14.6
Q1 = 14.4
Q3 = 14.9
IQR = Q3 – Q1 = 14.9-14.4 = 0.5
Outliers will be any points below Q1 – 1.5×IQR = 14.4 – 0.75 = 13.65 or above Q3 + 1.5×IQR = 14.9 + 0.75 = 15.65.
So, the outliers are at 10.2, 15.9, and 16.4.
The ends of the box are at 14.4 and 14.9.
The median 14.6 is marked within the box.
The whiskers extend to 14.1 and 15.1.
The outliers 10.2, 15.9, and 16.4 are plotted individually.
15
Measuring the Dispersion of Data
Variance and standard deviation
Variance:
Standard deviation σ is the square root of variance σ2
16
Example of Standard Deviation
Find out the Mean, the Variance, and the Standard Deviation of the following
dataset.
9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4
Here, the mean = 7
Using the formula for variance,
σ2 = 8.9
Therefore, the standard deviation is σ = 2.98
17
Graphic Displays of Basic Statistical Descriptions
Boxplot: graphic display of five-number summary
Histogram: x-axis are values, y-axis represents frequencies
Quantile plot: each value xi is paired with fi indicating that
approximately fi * 100% of data are below the value xi
Quantile-quantile (q-q) plot: graphs the quantiles of one
univariant distribution against the corresponding quantiles of
another
Scatter plot: each pair of values is a pair of coordinates and
plotted as points in the plane
18
Histogram Example
19
Quantile Plot
Used to check whether your data is normal
To make a quantile plot:
If the data distribution is close to normal, the plotted points
will lie close to a slopped straight line
20
Quantile Plot Example
Create a quantile plot for the following dataset.
3, 5, 1, 4, 10
First sort the data: 1, 3, 4, 5, 10
Calculate the sample quantiles: 0.1, 0.3, 0.5, 0.7, 0.9
Plot the graph
21
Scatter Plot
22
Positively and Negatively Correlated Data
23
Uncorrelated Data
24
Chapter 2: Getting to Know Your Data
Data Objects and Attribute Types
Basic Statistical Descriptions of Data
Data Visualization
Measuring Data Similarity and Dissimilarity
Summary
25
Data Visualization
Why data visualization? Gain insight into the data Search for patterns, trends, structure, irregularities, relationships among
data Categorization of visualization methods:
Pixel-oriented visualization techniques Geometric projection visualization techniques Icon-based visualization techniques Hierarchical visualization techniques Visualizing complex data and relations
26
Pixel-Oriented Visualization Techniques
For a data set of m dimensions, create m windows on the screen, one for each dimension
The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows
The colors of the pixels reflect the corresponding values
(a) Income (b) Credit Limit
(c) transaction volume
(d) age
27
Geometric Projection Visualization Techniques
Visualization of geometric transformations and projections of the data
Helps users find interesting projections of multidimensional data Methods
Scatterplot and scatterplot matrices
28
Icon-Based Visualization Techniques
Visualization of the data values as features of icons Typical visualization methods
Chernoff Faces Stick Figures
29
Chernoff Faces
A way to display multidimensional data of up to 18 dimensions as a cartoon human face
Components of the faces such as eyes, ears, mouth, and nose represent values of the dimensions by their shape, size, placement and orientation
Data Mining: Concepts and Techniques 30
A census data figure showing age, income, gender, education, etc.
used
by
perm
issi
on o
f G
. G
rinst
ein,
Uni
vers
ity o
f M
assa
chus
ette
s at
Low
ell
Stick Figure
A 5-piece stick figure (1 body and 4 limbs)
Two attributes mapped to the two axes, remaining attributes mapped to angle or length of limbs
31
Hierarchical Visualization Techniques
Visualization of the data using a hierarchical partitioning into subspaces instead of visualizing all dimensions at the same time
Methods Dimensional Stacking Worlds-within-Worlds Tree-Map Cone Trees InfoCube
32
Used by permission of M. Ward, Worcester Polytechnic Institute
Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
Dimensional Stacking
Visualizing Complex Data and Relations
Visualizing non-numerical data: text and social networks Tag cloud: visualizing user-generated tags
The importance of tag is represented by font size/color
Newsmap: Google News Stories in 2005
34
Chapter 2: Getting to Know Your Data
Data Objects and Attribute Types
Basic Statistical Descriptions of Data
Data Visualization
Measuring Data Similarity and Dissimilarity
Summary
35
Similarity and Dissimilarity
Similarity Numerical measure of how alike two data objects are Value is higher when objects are more alike Often falls in the range [0,1]
Dissimilarity Numerical measure of how different two data objects are Lower when objects are more alike
Two data structures commonly used to measure the above Data matrix Dissimilarity matrix
36
Data Matrix and Dissimilarity Matrix
Data matrix Stores n data points with p
dimensions Two modes – stores both
objects and attributes
Dissimilarity matrix Stores n data points, but
registers only the dissimilarity between objects i and j
A triangular matrix Single mode as it only
stores dissimilarity values
npx...nfx...n1x
...............ipx...ifx...i1x
...............1px...1fx...11x
0...)2,()1,(
:::
)2,3()
...ndnd
0dd(3,1
0d(2,1)
0
37
Proximity Measure for Nominal Attributes
Can take 2 or more states, e.g., map_color may have attributes red, yellow, blue and green
Dissimilarity can be computed based on the ratio of mismatches m: # of matches, p: total # of attributes
pmpjid ),(
38
Dissimilarity between Nominal Attributes
Here, we have one nominal attribute, test-1, so p=1.
The dissimilarity matrix is as shown below:
39
Proximity Measure for Binary Attributes
A contingency table for binary data
Dissimilarity for symmetric binary
attributes:
Dissimilarity for asymmetric binary
attributes :
Jaccard coefficient (similarity
measure for asymmetric binary
attributes):
Object i
Object j
40
Dissimilarity between Binary Variables
Example
Gender is a symmetric attribute, others are asymmetric binary Let the values Y and P be 1, and the value N 0 Suppose the distance between patients is computed based only on the
asymmetric attributes
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N NMary F Y N P N P NJim M Y P N N N N
75.0211
21),(
67.0111
11),(
33.0102
10),(
maryjimd
jimjackd
maryjackd
41
Distance on Numeric Data: Minkowski Distance
Minkowski distance: A popular distance measure
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called L-h norm)
Properties d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness) d(i, j) = d(j, i) (Symmetry) d(i, j) d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
42
Special Cases of Minkowski Distance
h = 1: Manhattan distance
h = 2: Euclidean distance
h . supremum distance This is the maximum difference between any component (attribute)
of the vectors
)||...|||(|),( 22
22
2
11 pp jx
ix
jx
ix
jx
ixjid
||...||||),(2211 pp jxixjxixjxixjid
43
Example: Minkowski DistanceDissimilarity Matrices
point attribute 1 attribute 2x1 1 2x2 3 5x3 2 0x4 4 5
L x1 x2 x3 x4x1 0x2 5 0x3 3 6 0x4 6 1 7 0
L2 x1 x2 x3 x4x1 0x2 3.61 0x3 2.24 5.1 0x4 4.24 1 5.39 0
L x1 x2 x3 x4
x1 0x2 3 0x3 2 5 0x4 3 1 5 0
Manhattan (L1)
Euclidean (L2)
Supremum
0 2 4
2
4
x1
x2
x3
x4
44
Proximity Measures for Ordinal Variables
Let M represent the number of possible values that an ordinal attribute can have replace each xif by its corresponding rank map the range of each variable onto [0, 1] by replacing i-th
object in the f-th variable by
compute the dissimilarity using any of the distance measures for numeric attributes, e.g., Euclidean distance
11
f
ifif M
rz
},...,1{fif
Mr
45
Proximity Measures for Ordinal Variables
Consider the data in the adjacent table: Here, the attribute Test has three states: fair, good and
excellent, so Mf=3 For step 1, the four attribute values are assigned the
ranks 3,1,2 and 3 respectively. Step 2 normalizes the ranking by mapping rank 1 to
0.0, rank 2 to 0.5 and rank 3 to 1.0 For step 3, using Euclidean distance, a dissimilarity
matrix is obtained as shown Therefore, students 1 and 2 are most dissimilar, as are
students 2 and 4
Student Test
1 Excellent
2 Fair
3 Good
4 Excellent
46
Attributes of Mixed Type
A database may contain all attribute types : Nominal, symmetric binary, asymmetric binary, numeric, ordinal
One may use a weighted formula to combine their effects
For nominal and ordinal attributes, use the technique mentioned earlier to compute dissimilarity matrix
For numeric attributes use the following formula to calculate dissimilarity
)(1
)()(1),(
fij
pf
fij
fij
pf
djid
47
Attributes of Mixed Type - Example
Consider the data in the table:
The dissimilarity matrices for the nominal and ordinal data are shown to the right computed using the methods discussed before
To compute dissimilarity matrix for the numeric attribute, maxhxh=64, minhxh=22. Using the formula from previous slide, the dissimilarity matrix is obtained as shown below:
48
Attributes of Mixed Type - Example
The three dissimilarity matrices can now be used to compute the overall dissimilarity between two objects using the equation
The resulting dissimilarity matrix is
)(1
)()(1),(
fij
pf
fij
fij
pf
djid
49
Cosine Similarity Cosine similarity is a measure of similarity that can be used to compare
documents. A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document called term-frequency vector.
Cosine measure: If d1 and d2 are two term-frequency vectors, then cos(x, y) = (x y) /||x|| ||y|| ,
where indicates dot product, ||x|| is the length of vector x defined as . Similarly, ||y|| is the length of vector y.
A cosine value of 0 means that the two vectors are at 90 degrees to each other and there is no match.
The closer the cosine value to 1, the greater the match between the vectors
50
Example: Cosine Similarity
Ex: Find the similarity between documents 1 and 2 from previous slide.d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
Using the formula, cos(d1, d2) = (d1 d2) /||d1|| ||d2|| ,
d1d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481
||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 = 4.12
cos(d1, d2 ) = 0.94
The cosine similarity shows that the two documents are quite similar.
51
Chapter 2: Getting to Know Your Data
Data Objects and Attribute Types
Basic Statistical Descriptions of Data
Data Visualization
Measuring Data Similarity and Dissimilarity
Summary
Summary
Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
Many types of data sets, e.g., numerical, text, graph, image, etc. Gain insight into the data by:
Basic statistical data description: central tendency, dispersion, graphical displays
Data visualization: Measure data similarity
Above steps are the beginning of data preprocessing
Exercise 1 of 2
Find the dissimilarity value between Alyssa and Chris and between Alyssa and Diane.
Student Gender
HairColor
Test1 Test2
Alyssa F Black Excellent
80
Chris M Black Good 85
Jessica F Brown Fair 55
Diane F Blonde Excellent
80
Exercise 2 of 2
Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8), compute the: Euclidean distance Manhattan distance Supremum distance Cosine similarity