Data Mining - Emory University, CS570 Spring 2008 (lxiong/cs570s08/share/slides/02.pdf; slides dated January 24, 2008)

Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques

— Chapter 2 —

Original Slides: Jiawei Han and Micheline Kamber

Modification: Li Xiong


Chapter 2: Data Preprocessing

Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Why Data Preprocessing?

Data in the real world is dirty:

incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

e.g., occupation = “ ”

noisy: containing errors or outliers

e.g., Salary = “-10”

inconsistent: containing discrepancies in codes or names

e.g., Age = “42” but Birthday = “03/07/1997”; rating was “1, 2, 3”, now “A, B, C”; discrepancies between duplicate records


Why Is Data Dirty?

Incomplete data may come from:

“Not applicable” values at collection time

Different considerations between the time the data was collected and the time it is analyzed

Human, hardware, or software problems

Noisy data (incorrect values) may come from:

Faulty data collection instruments

Human or computer error at data entry

Errors in data transmission

Inconsistent data may come from:

Different data sources

Functional dependency violations (e.g., modifying some linked data)

Duplicate records also need data cleaning


Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility

Broad categories: intrinsic, contextual, representational, and accessibility


Major Tasks in Data Preprocessing

Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration: integration of multiple databases, data cubes, or files

Data transformation: normalization and aggregation

Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results

Data discretization: part of data reduction, of particular importance for numerical data


Forms of Data Preprocessing


Chapter 2: Data Preprocessing

Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Descriptive Data Summarization

Motivation

To better understand the data

Descriptive statistics: describe basic features of data

Graphical description

Tabular description

Summary statistics

Descriptive data summarization

Measuring central tendency: where the data values center

Measuring statistical variability or dispersion: how the data values spread

Graphic display of descriptive data summarization


Measuring the Central Tendency

Mean (sample vs. population):

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$

Weighted arithmetic mean:

$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

Trimmed mean: chopping extreme values before averaging

Median

Middle value if there is an odd number of values, or the average of the middle two values otherwise

Estimated by interpolation (for grouped data):

$\mathrm{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\mathrm{median}}}\right) c$

where $L_1$ is the lower boundary of the median interval, $(\sum f)_l$ the sum of the frequencies of the intervals below it, $f_{\mathrm{median}}$ the frequency of the median interval, and $c$ its width

Mode

Value that occurs most frequently in the data

Unimodal, bimodal, trimodal

Empirical formula: $\mathrm{mean} - \mathrm{mode} = 3 \times (\mathrm{mean} - \mathrm{median})$


Symmetric vs. Skewed Data

Median, mean and mode of symmetric, positively and negatively skewed data

(Figure: mean, median and mode marked on a symmetric, a positively skewed, and a negatively skewed distribution.)


Computational Issues

Different types of measures:

Distributed measure: can be computed by partitioning the data into smaller subsets, e.g., sum, count

Algebraic measure: can be computed by applying an algebraic function to one or more distributed measures, e.g., ?

Holistic measure: must be computed on the entire dataset as a whole, e.g., ?

Selection algorithm: finding the kth smallest number in a list

E.g., min, max, median

Selection by sorting: O(n log n)

Linear algorithms based on quicksort-style partitioning (quickselect): expected O(n)
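The expected-linear-time selection idea can be sketched as a quickselect with random pivots (expected O(n), worst case O(n^2)); k is 0-based here:

```python
# Quickselect: k-th smallest element (k is 0-based) in expected O(n) time.
import random

def quickselect(xs, k):
    xs = list(xs)
    while True:
        pivot = random.choice(xs)
        lows = [x for x in xs if x < pivot]
        pivots = [x for x in xs if x == pivot]
        if k < len(lows):
            xs = lows                       # answer is left of the pivot
        elif k < len(lows) + len(pivots):
            return pivot                    # pivot is the k-th smallest
        else:
            k -= len(lows) + len(pivots)    # answer is right of the pivot
            xs = [x for x in xs if x > pivot]

def median(xs):
    n = len(xs)
    if n % 2:
        return quickselect(xs, n // 2)
    return (quickselect(xs, n // 2 - 1) + quickselect(xs, n // 2)) / 2

print(quickselect([7, 1, 5, 3], 0))  # min: 1
print(median([4, 1, 3, 2]))          # 2.5
```

This is why the median, although holistic (it needs the whole data set), can still be found without a full sort.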


The Long Tail

Long tail: the low-frequency part of a population (e.g., a wealth distribution), proposed as a current and future business and economic model

Previous empirical studies: Amazon, Netflix

Products that are in low demand or have low sales volume can collectively make up a market share that rivals or exceeds the relatively few current bestsellers and blockbusters

The primary value of the internet: providing access to products in the long tail

Business and social implications

mass-market retailers: Amazon, Netflix, eBay

content producers: YouTube

The Long Tail. Chris Anderson, Wired, Oct. 2004

The Long Tail: Why the Future of Business is Selling Less of More. Chris Anderson, 2006


Measuring the Dispersion of Data

Dispersion or variance: the degree to which numerical data tend to spread

Range and Quartiles

Range: difference between the largest and smallest values

Percentile: the value of a variable below which a certain percent of the data fall (algebraic or holistic?)

Quartiles: Q1 (25th percentile), Median (50th percentile), Q3 (75th percentile)

Inter-quartile range: IQR = Q3 – Q1

Five number summary: min, Q1, M, Q3, max (Boxplot)

Outlier: usually, a value at least 1.5 x IQR higher/lower than Q3/Q1

Variance and standard deviation (sample: s, population: σ)

Variance: sample vs. population (algebraic or holistic?)

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$

Standard deviation s (or σ) is the square root of variance s² (or σ²)
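A sketch of the five-number summary, the 1.5 × IQR outlier rule, and the two variances. The data is the price list from the later binning example with a 100 appended so that an outlier appears; note that Python's statistics.quantiles uses the 'exclusive' interpolation method by default, and textbooks differ slightly in how Q1/Q3 are interpolated:

```python
# Five-number summary, the 1.5 x IQR outlier rule, and variance.
from statistics import quantiles, variance, pvariance, stdev

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 100]

q1, med, q3 = quantiles(data, n=4)           # Q1, median, Q3
iqr = q3 - q1
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)    # outlier cutoffs
outliers = [x for x in data if x < fences[0] or x > fences[1]]

five_num = (min(data), q1, med, q3, max(data))  # what a boxplot draws

s2 = variance(data)        # sample variance (divisor n - 1), algebraic
sigma2 = pvariance(data)   # population variance (divisor N)
s = stdev(data)            # sample standard deviation

print(five_num)   # (4, 12.0, 24.0, 28.5, 100)
print(outliers)   # [100]
```

The sample variance is always slightly larger than the population variance on the same data, since it divides by n − 1 instead of N.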


Graphic Displays of Basic Statistical Descriptions

Histogram

Boxplot

Quantile plot

Quantile-quantile (q-q) plot

Scatter plot

Loess (local regression) curve


Histogram Analysis

Graphical display of tabulated frequencies

univariate graphical method (one attribute)

data partitioned into disjoint buckets (typically equal-width)

a set of rectangles reflecting the count or frequency of values in each bucket

bar chart for categorical values


Boxplot Analysis

Visualizes the five-number summary:

The ends of the box are the first and third quartiles (Q1 and Q3), i.e., the height of the box is the IQR

The median (M) is marked by a line within the box

Whiskers: two lines outside the box extend to the minimum and maximum


Example Boxplot: Profit Analysis


Quantile Plot

Displays all of the data for the given attribute

Plots quantile information

Each data point (x_i, f_i) indicates that approximately 100·f_i% of the data are less than or equal to the value x_i


Quantile-Quantile (Q-Q) Plot

Graphs the quantiles of one univariate distribution against the corresponding quantiles of another

Useful for diagnosing differences between two probability distributions


Scatter plot

Displays values for two numerical attributes (bivariate data)

Each pair of values is plotted as a point in the plane

Can suggest various kinds of correlation between the variables, with a certain confidence level: positive (rising), negative (falling), or null (uncorrelated)


Example Scatter Plot – Correlation between Wine Consumption and Heart Mortality

(Figure: scatter plot of wine consumption vs. heart-disease mortality by country, with France and the US labeled.)


Positively and Negatively Correlated Data


Not Correlated Data


Loess Curve

Locally weighted scatter plot smoothing, providing better perception of the pattern of dependence

Fits simple models to localized subsets of the data


Chapter 2: Data Preprocessing

Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Data Cleaning

Importance

“Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)

“Data cleaning is the number one problem in data warehousing” (DCI survey)

Data cleaning tasks

Fill in missing values

Identify outliers and smooth out noisy data

Correct inconsistent data

Resolve redundancy caused by data integration


Missing Data

Data is not always available

E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to

equipment malfunction

inconsistent with other recorded data and thus deleted

data not entered due to misunderstanding

certain data may not be considered important at the time of entry

history or changes of the data not registered

Missing data may need to be inferred.


How to Handle Missing Values?

Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably

Fill in the missing value manually

Fill in the missing value automatically

a global constant: e.g., “unknown” (effectively a new class?!)

the attribute mean

the attribute mean for all samples belonging to the same class (smarter)

the most probable value: inference-based methods such as a Bayesian formula or a decision tree (Chap. 6)
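The three automatic fill-in strategies can be sketched on a hypothetical two-column table (pure Python, with None marking a missing income):

```python
# Filling missing values: global constant, attribute mean, per-class mean.
rows = [
    {"cls": "yes", "income": 40.0},
    {"cls": "yes", "income": None},   # missing
    {"cls": "yes", "income": 60.0},
    {"cls": "no",  "income": 20.0},
    {"cls": "no",  "income": None},   # missing
]

def fill(rows, value_for):
    """Replace each missing income with value_for(row)."""
    return [dict(r, income=r["income"] if r["income"] is not None
                 else value_for(r)) for r in rows]

known = [r["income"] for r in rows if r["income"] is not None]
overall_mean = sum(known) / len(known)                  # (40+60+20)/3 = 40.0

def class_mean(c):
    vals = [r["income"] for r in rows
            if r["cls"] == c and r["income"] is not None]
    return sum(vals) / len(vals)

by_constant = fill(rows, lambda r: "unknown")           # global constant
by_mean = fill(rows, lambda r: overall_mean)            # attribute mean
by_class_mean = fill(rows, lambda r: class_mean(r["cls"]))  # smarter

print(by_class_mean[1]["income"], by_class_mean[4]["income"])  # 50.0 20.0
```

The per-class fill is "smarter" because the missing "yes" row gets 50.0 (the mean of its own class) rather than the global 40.0, which is dragged down by the "no" rows.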


Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to:

faulty data collection instruments

data entry problems

data transmission problems

technology limitations

inconsistency in naming conventions

Other data problems that require data cleaning:

duplicate records

incomplete data

inconsistent data


How to Handle Noisy Data?

Binning and smoothing

sort the data and partition it into bins (equal-frequency or equal-width)

then smooth by bin means, bin medians, or bin boundaries

Regression

smooth by fitting the data to a regression function

Clustering

detect and remove outliers that fall outside clusters

Combined computer and human inspection

detect suspicious values automatically and have a human check them (e.g., possible outliers)


Simple Discretization Methods: Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size: uniform grid

if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N

The most straightforward, but outliers may dominate presentation

Skewed data is not handled well

Equal-depth (frequency) partitioning

Divides the range into N intervals, each containing approximately the same number of samples

Good data scaling

Managing categorical attributes can be tricky


Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34

* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
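The price example above can be reproduced with a short sketch; bin means are rounded to whole dollars as on the slide, and boundary smoothing snaps each value to the nearer of its bin's min and max:

```python
# Equal-frequency binning with the three smoothing rules from the example.
def equal_freq_bins(sorted_values, n_bins):
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value in a bin by the (rounded) bin mean.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value by the nearer of the bin's two boundaries.
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if x - lo <= hi - x else hi for x in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_freq_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```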


Regression

(Figure: data points with the fitted line y = x + 1; the observed value Y1 at X1 is smoothed to the fitted value Y1′.)


Cluster Analysis


Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration

Data transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Data Integration

Data integration: combines data from multiple sources into a unified view

Architectures

Data warehouse (tightly coupled)

Federated database systems (loosely coupled)

Database heterogeneity

Semantic integration

Data Warehouse Approach

(Figure: clients run query & analysis against a central warehouse; an ETL layer, guided by metadata, loads data from multiple sources.)

Advantages and Disadvantages of Data Warehouse

Advantages

High query performance

Can operate when the sources are unavailable

Extra information at the warehouse: modification, summarization (aggregates), historical information

Local processing at the sources is unaffected

Disadvantages

Data freshness

Difficult to construct when only a query interface to the local sources is available

Federated Database Systems

(Figure: clients query a mediator, which forwards sub-queries through wrappers to the underlying sources.)

Advantages and Disadvantages of Federated Database Systems

Advantages

No need to copy and store data at the mediator

More up-to-date data

Only a query interface is needed at the sources

Disadvantages

Query performance

Source availability

Database Heterogeneity

System heterogeneity: use of different operating systems or hardware platforms

Schematic or structural heterogeneity: the native model or structure used to store data differs between sources

Syntactic heterogeneity: differences in the representation format of data

Semantic heterogeneity: differences in the interpretation of the “meaning” of data

Semantic Integration

Problem: reconciling semantic heterogeneity

Levels

Schema matching (schema mapping), e.g., A.cust-id ≡ B.cust-#

Data matching (data deduplication, record linkage, entity/object matching), e.g., Bill Clinton = William Clinton

Challenges

Semantics can be inferred from only a few information sources (data creators, documentation) -> must rely on the schema and the data

Schema and data are unreliable and incomplete

Global pair-wise matching is computationally expensive

In practice, 60-80% of the resources in a data sharing project are spent on reconciling semantic heterogeneity

Schema Matching

Techniques

Rule based

Learning based

Types of matches

1-1 matches vs. complex matches (e.g., list-price = price * (1 + tax_rate))

Information used

Schema information: element names, data types, structures, number of sub-elements, integrity constraints

Data information: value distributions, frequency of words

External evidence: past matches, corpora of schemas

Ontologies, e.g., the Gene Ontology

Multi-matcher architecture

Data Matching Or … ?

Many names for (roughly) the same task: record linkage, data matching, object identification, entity resolution, entity disambiguation, duplicate detection, record matching, instance identification, deduplication, reference reconciliation, database hardening, …


Data Matching

Techniques

Rule based

Probabilistic record linkage (Fellegi and Sunter, 1969)

similarity between pairs of attributes

combined scores representing the probability of a match

threshold-based decision

Machine learning approaches

New challenges

complex information spaces

multiple classes
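A toy sketch in the Fellegi-Sunter spirit: per-attribute similarities (token-set Jaccard here) combined with weights into a score, then a threshold decision. The records, weights, and threshold are all hypothetical; real systems estimate the weights and thresholds from data:

```python
# Threshold-based data matching from combined attribute similarities.
def jaccard(a, b):
    """Token-set similarity between two strings, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def match_score(r1, r2, weights):
    """Weighted sum of per-attribute similarities (weights sum to 1)."""
    return sum(w * jaccard(r1[f], r2[f]) for f, w in weights.items())

r1 = {"name": "William Clinton", "city": "Little Rock"}
r2 = {"name": "Bill Clinton", "city": "Little Rock"}
weights = {"name": 0.7, "city": 0.3}   # hypothetical weights
THRESHOLD = 0.5                        # hypothetical decision threshold

score = match_score(r1, r2, weights)   # 0.7 * (1/3) + 0.3 * 1.0
print(score, score >= THRESHOLD)
```

Exact string equality would reject "Bill Clinton" vs. "William Clinton"; here the shared tokens push the combined score over the threshold, though a real matcher would also use name-synonym dictionaries for exactly this case.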



Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration

Data transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Data Transformation

Smoothing: remove noise from data (data cleaning)

Aggregation: summarization

E.g. Daily sales -> monthly sales

Discretization and generalization

E.g. age -> youth, middle-aged, senior

(Statistical) Normalization: scaled to fall within a small, specified range

E.g. income vs. age

Attribute construction: construct new attributes from given ones

E.g. birthday -> age


Data Aggregation

Data cubes store multidimensional aggregated information

Multiple levels of aggregation for analysis at multiple granularities

More on data warehouses and cube computation (chap. 3, 4)


Normalization

Min-max normalization: [minA, maxA] to [new_minA, new_maxA]

Ex. Let income [$12,000, $98,000] normalized to [0.0, 1.0]. Then $73,000 is mapped to

Z-score normalization (μ: mean, σ: standard deviation):

Ex. Let μ = 54,000, σ = 16,000. Then

Normalization by decimal scaling

716.00)00.1(000,12000,98000,12600,73

=+−−−

AAA

AA

A minnewminnewmaxnewminmax

minvv _)__(' +−−

−=

A

Avvσμ−

='

j

vv10

'= Where j is the smallest integer such that Max(|ν’|) < 1

225.1000,16

000,54600,73=


Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Data Reduction

Why data reduction?

A database/data warehouse may store terabytes of data

Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Data reduction strategies

Dimensionality reduction

Feature selection: attribute subset selection

Feature extraction: mapping the data to a smaller number of features

Instance reduction


Feature Selection

Select a set of attributes (features) such that the resulting probability distribution of the data is as close as possible to the original distribution given all features

Benefits

Remove irrelevant or redundant attributes

Reduce the number of attributes in the discovered patterns

Heuristic methods (how many subsets are there to choose from?):

Step-wise forward selection

Step-wise backward elimination

Combining forward selection and backward elimination

Decision-tree induction (Chap. 6, Classification)
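Step-wise forward selection can be sketched as a greedy loop around a pluggable scoring function. The additive toy score below is hypothetical; in practice the score would be, e.g., cross-validated accuracy of a classifier trained on the subset:

```python
# Greedy step-wise forward selection.
def forward_select(features, score):
    selected, remaining = [], list(features)
    best = score(selected)
    while remaining:
        # Try adding each remaining feature; keep the best one.
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        new_score = score(selected + [candidate])
        if new_score <= best:
            break                      # no remaining feature helps
        selected.append(candidate)
        remaining.remove(candidate)
        best = new_score
    return selected

# Toy additive score: A2 is irrelevant and is never selected.
gains = {"A1": 0.3, "A2": 0.0, "A4": 0.5, "A6": 0.2}
score = lambda subset: sum(gains[f] for f in subset)

print(forward_select(["A1", "A2", "A4", "A6"], score))  # ['A4', 'A1', 'A6']
```

The greedy loop evaluates O(d^2) subsets instead of all 2^d, which is the point of the "# of choices?" question above.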


Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

(Figure: a decision tree that splits on A4 at the root, then on A1 and A6, with leaves labeled Class 1 and Class 2.)

> Reduced attribute set: {A1, A4, A6}


Feature Extraction

Create new features (attributes) by combining or mapping existing ones

Methods

Principal Component Analysis

Data compression methods: Discrete Wavelet Transform

Regression analysis


Principal Component Analysis (PCA)

Principal component analysis: find the dimensions that capture the most variance

A linear mapping of the data to a new coordinate system such that the greatest variance lies on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on

Steps

Normalize the input data, so that each attribute falls within the same range

Compute k orthonormal (unit) vectors, i.e., principal components; each input data vector is a linear combination of the k principal component vectors

The principal components are sorted in order of decreasing “significance”

Weak components, i.e., those with low variance, can be eliminated

(Figure: data plotted on axes X1 and X2 with the principal directions Y1 and Y2 overlaid; an illustration of principal component analysis.)
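The steps above can be sketched via SVD of the centered data matrix (this assumes NumPy is available; the sample points are hypothetical, chosen to lie near the line X2 = X1 so that the first component captures almost all of the variance):

```python
# PCA: orthonormal directions of decreasing variance, via SVD.
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                  # center each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                      # k orthonormal unit vectors
    var = (S ** 2) / (len(X) - 1)            # variance along each direction
    return components, Xc @ components.T, var[:k]

X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]])
comps, projected, var = pca(X, 2)
print(var[0] / var.sum())   # close to 1: the first component dominates
```

Dropping the second (weak) component here would halve the dimensionality while losing only a small fraction of the variance, which is exactly the elimination step described above.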
