Data Mining: Concepts and Techniques
— Chapter 2 —
Original Slides: Jiawei Han and Micheline Kamber
Modification: Li Xiong
Chapter 2: Data Preprocessing
Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation = “ ”
noisy: containing errors or outliers
e.g., Salary = “-10”
inconsistent: containing discrepancies in codes or names
e.g., Age = “42”, Birthday = “03/07/1997”
e.g., was rating “1, 2, 3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from:
“Not applicable” data values when collected
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems
Noisy data (incorrect values) may come from:
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from:
Different data sources
Functional dependency violation (e.g., modifying some linked data)
Duplicate records also need data cleaning
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories: intrinsic, contextual, representational, and accessibility
Major Tasks in Data Preprocessing
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration: integration of multiple databases, data cubes, or files
Data transformation: normalization and aggregation
Data reduction: obtains a reduced representation that is much smaller in volume but produces the same or similar analytical results
Data discretization: part of data reduction, of particular importance for numerical data
Forms of Data Preprocessing
Chapter 2: Data Preprocessing
Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Descriptive Data Summarization
Motivation
To better understand the data
Descriptive statistics: describe basic features of data
Graphical description
Tabular description
Summary statistics
Descriptive data summarization
Measuring central tendency – where the data values center
Measuring statistical variability or dispersion of data – how the data values spread
Graphic display of descriptive data summarization
Measuring the Central Tendency
Mean (sample vs. population):
Weighted arithmetic mean:
Trimmed mean: chopping extreme values
Median
Middle value if odd number of values, or average of the middle two values otherwise
Estimated by interpolation (for grouped data):
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula:
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$    Population mean: $\mu = \frac{\sum x}{N}$
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Median of grouped data, by interpolation: $\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$, where $L_1$ is the lower boundary of the median interval, $(\sum f)_l$ the sum of frequencies of the intervals below it, $f_{\text{median}}$ the frequency of the median interval, and $c$ its width
Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
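To make these measures concrete, here is a minimal Python sketch (standard library only; the sample data is our own illustration):

```python
from statistics import mean, median, mode

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(mean(data), median(data), mode(data))  # 20.33..., 22.5, 21

# Weighted arithmetic mean
values, weights = [1, 2, 3], [0.2, 0.3, 0.5]
wmean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Trimmed mean: drop the k smallest and k largest values before averaging
def trimmed_mean(xs, k):
    return mean(sorted(xs)[k:len(xs) - k])
```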
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively and negatively skewed data
Computational Issues
Different types of measures:
Distributed measure – can be computed by partitioning the data into smaller subsets, e.g., sum, count
Algebraic measure – can be computed by applying an algebraic function to one or more distributed measures, e.g., ?
Holistic measure – must be computed on the entire dataset as a whole, e.g., ?
Selection algorithm: finding the kth smallest number in a list
E.g., min, max, median
Selection by sorting: O(n log n)
Linear algorithms based on quicksort-style partitioning (quickselect): expected O(n)
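As an illustration of linear-expected-time selection, a quickselect sketch (not from the slides; pivot choice is our own):

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based) in expected O(n) time."""
    pivot = random.choice(values)
    below = [v for v in values if v < pivot]
    equal = [v for v in values if v == pivot]
    if k < len(below):
        return quickselect(below, k)
    if k < len(below) + len(equal):
        return pivot  # the k-th value equals the pivot
    return quickselect([v for v in values if v > pivot],
                       k - len(below) - len(equal))

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(quickselect(data, len(data) // 2))  # upper median: 24
```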
The Long Tail
Long tail: low-frequency population (e.g., wealth distribution)
"The Long Tail": the current and future business and economic models
Previous empirical studies: Amazon, Netflix
Products that are in low demand or have low sales volume can collectively make up a market share that rivals or exceeds the relatively few current bestsellers and blockbusters
The primary value of the internet: providing access to products in the long tail
Business and social implications
mass market retailers: Amazon, Netflix, eBay
content producers: YouTube
The Long Tail. Chris Anderson, Wired, Oct. 2004
The Long Tail: Why the Future of Business is Selling Less of More. Chris Anderson, 2006
Measuring the Dispersion of Data
Dispersion or variance: the degree to which numerical data tend to spread
Range and Quartiles
Range: difference between the largest and smallest values
Percentile: the value of a variable below which a certain percent of data fall
(algebraic or holistic?)
Quartiles: Q1 (25th percentile), Median (50th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 – Q1
Five number summary: min, Q1, M, Q3, max (Boxplot)
Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
Variance: sample vs. population (algebraic or holistic?)
Standard deviation s (or σ) is the square root of variance s2 (or σ2)
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
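A short numpy sketch of these dispersion measures (illustrative; the data values are our own):

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Five-number summary: min, Q1, median, Q3, max
q1, med, q3 = np.percentile(data, [25, 50, 75])
print(data.min(), q1, med, q3, data.max())

# Outliers: beyond 1.5 x IQR above Q3 or below Q1
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

# Sample variance (n-1 denominator) vs. population variance (n denominator)
print(data.var(ddof=1), data.var(ddof=0))
```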
Graphic Displays of Basic Statistical Descriptions
Histogram
Boxplot
Quantile plot
Quantile-quantile (q-q) plot
Scatter plot
Loess (local regression) curve
Histogram Analysis
Graphical display of tabulated frequencies
Univariate graphical method (one attribute)
Data partitioned into disjoint buckets (typically equal-width)
A set of rectangles that reflect the counts or frequencies of values in each bucket
Bar chart for categorical values
Boxplot Analysis
Visualizes five-number summary:
The ends of the box are the first and third quartiles (Q1 and Q3), i.e., the height of the box is the IQR
The median (M) is marked by a line within the box
Whiskers: two lines outside the box extend to Minimum and Maximum
Example Boxplot: Profit Analysis
Quantile Plot
Displays all of the data for the given attribute
Plots quantile information
Each data point (xi, fi) indicates that approximately 100 fi% of the data are below or equal to the value xi
Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Diagnoses differences between the probability distributions of two samples
Scatter plot
Displays values for two numerical attributes (bivariate data)
Each pair of values is plotted as a point in the plane
Can suggest various kinds of correlations between variables, with a certain confidence level: positive (rising), negative (falling), or null (uncorrelated)
Example Scatter Plot – Correlation between Wine Consumption and Heart Mortality
[Scatter plot: one point per country, wine consumption vs. heart-disease mortality, with France and the US labeled at opposite extremes]
Positively and Negatively Correlated Data
Not Correlated Data
Loess Curve
Locally weighted scatter plot smoothing, providing better perception of the pattern of dependence
Fits simple models to localized subsets of the data
Chapter 2: Data Preprocessing
Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Data Cleaning
Importance
“Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
“Data cleaning is the number one problem in data warehousing”—DCI survey
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
history or changes of the data not registered
Missing data may need to be inferred.
How to Handle Missing Values?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually
Fill in the missing value automatically with:
a global constant, e.g., “unknown” (in effect, a new class?!)
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based methods such as a Bayesian formula or a decision tree (Chap 6)
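A minimal pandas sketch of the automatic strategies (the column names and values are our own illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "income": [45000.0, None, 62000.0, None, 38000.0],
    "class":  ["A", "A", "B", "B", "A"],
})

# Global constant
df["by_const"] = df["income"].fillna(-1)

# Attribute mean over all samples
df["by_mean"] = df["income"].fillna(df["income"].mean())

# Attribute mean within the same class (smarter)
df["by_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))
```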
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to:
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistency in naming conventions
Other data problems that require data cleaning:
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning and smoothing: sort the data and partition it into bins (equal-frequency or equal-width), then smooth by bin means, bin medians, bin boundaries, etc.
Regression: smooth by fitting the data to a function with regression analysis
Clustering: detect and remove outliers that fall outside clusters
Combined computer and human inspection: detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)
Simple Discretization Methods: Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
The most straightforward, but outliers may dominate presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  Bin 1: 4, 8, 9, 15
  Bin 2: 21, 21, 24, 25
  Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  Bin 1: 9, 9, 9, 9
  Bin 2: 23, 23, 23, 23
  Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  Bin 1: 4, 4, 4, 15
  Bin 2: 21, 21, 25, 25
  Bin 3: 26, 26, 26, 34
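The worked example can be reproduced with a short sketch (numpy only; rounding of bin means is our own choice):

```python
import numpy as np

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

# Equal-frequency partitioning into 3 bins
bins = np.array_split(prices, 3)

# Smoothing by bin means
by_means = [[round(np.mean(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the nearer boundary
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
             for b in bins]
```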
Regression
[Figure: scatter of points with fitted regression line y = x + 1; a noisy observation (X1, Y1) is replaced by its smoothed value Y1' on the line]
Cluster Analysis
Chapter 2: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration
Data transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Data Integration
Data integration: combines data from multiple sources into a unified view
Architectures:
Data warehouse (tightly coupled)
Federated database systems (loosely coupled)
Database heterogeneity
Semantic integration
Data Warehouse Approach
[Diagram: clients issue query & analysis requests to a central warehouse; data from multiple sources is loaded into the warehouse via ETL, with metadata describing the integrated contents]
Advantages and Disadvantages of Data Warehouse
Advantages:
High query performance
Can operate even when sources are unavailable
Extra information at the warehouse
Modification, summarization (aggregates), historical information
Local processing at the sources is unaffected
Disadvantages:
Data freshness
Difficult to construct when only the query interfaces of the local sources are accessible
Federated Database Systems
[Diagram: clients query a mediator, which translates requests through per-source wrappers to the underlying sources]
Advantages and Disadvantages of Federated Database Systems
Advantages:
No need to copy and store data at the mediator
More up-to-date data
Only a query interface needed at the sources
Disadvantages:
Query performance
Source availability
Database Heterogeneity
System heterogeneity: use of different operating systems and hardware platforms
Schematic or structural heterogeneity: the native model or structure used to store data differs across sources
Syntactic heterogeneity: differences in the representation format of data
Semantic heterogeneity: differences in interpretation of the 'meaning' of data
Semantic Integration
Problem: reconciling semantic heterogeneity
Levels:
Schema matching (schema mapping), e.g., A.cust-id ≡ B.cust-#
Data matching (data deduplication, record linkage, entity/object matching), e.g., Bill Clinton = William Clinton
Challenges:
Semantics can be inferred from only a few information sources (data creators, documentation), so systems must rely on the schema and data themselves
Schema and data are unreliable and incomplete
Global pair-wise matching is computationally expensive
In practice, 60-80% of the resources in a data sharing project are spent on reconciling semantic heterogeneity
Schema Matching
Techniques:
Rule based
Learning based
Types of matches:
1-1 matches vs. complex matches (e.g., list-price = price * (1 + tax_rate))
Information used:
Schema information: element names, data types, structures, number of sub-elements, integrity constraints
Data information: value distributions, frequency of words
External evidence: past matches, corpora of schemas
Ontologies, e.g., the Gene Ontology
Multi-matcher architecture
Data Matching Or … ?
record linkage, data matching, object identification, entity resolution, entity disambiguation, duplicate detection, record matching, instance identification, deduplication, reference reconciliation, database hardening, …
Data Matching
Techniques:
Rule based
Probabilistic record linkage (Fellegi and Sunter, 1969); see the sketch below
Similarity between pairs of attributes
Combined scores representing the probability of a match
Threshold-based decision
Machine learning approaches
New challenges:
Complex information spaces
Multiple classes
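A toy sketch of threshold-based matching over combined attribute similarities (field names, weights, and thresholds are our own illustration, not the actual Fellegi-Sunter model):

```python
from difflib import SequenceMatcher

WEIGHTS = {"name": 0.6, "city": 0.4}

def sim(a, b):
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(r1, r2, weights=WEIGHTS):
    # Combine per-attribute similarities into one matching score
    return sum(w * sim(r1[f], r2[f]) for f, w in weights.items())

r1 = {"name": "William Clinton", "city": "Little Rock"}
r2 = {"name": "Bill Clinton",    "city": "Little Rock"}
score = match_score(r1, r2)
print("match" if score > 0.8 else "review" if score > 0.5 else "non-match")
```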
Chapter 2: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration
Data transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Data Transformation
Smoothing: remove noise from data (data cleaning)
Aggregation: summarization
E.g. Daily sales -> monthly sales
Discretization and generalization
E.g. age -> youth, middle-aged, senior
(Statistical) Normalization: scaled to fall within a small, specified range
E.g. income vs. age
Attribute construction: construct new attributes from given ones
E.g. birthday -> age
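For example, attribute construction and discretization in pandas (a sketch; the cut-points and column names are our own assumptions):

```python
import pandas as pd

df = pd.DataFrame({"birth_year": [1990, 1975, 1950]})

# Attribute construction: birthday -> age
df["age"] = 2008 - df["birth_year"]

# Discretization/generalization: age -> youth, middle-aged, senior
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                         labels=["youth", "middle-aged", "senior"])
```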
Data Aggregation
Data cubes store multidimensional aggregated information
Multiple levels of aggregation for analysis at multiple granularities
More on data warehouses and cube computation (chap 3, 4)
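One cell of such a roll-up, sketched with pandas (column names assumed):

```python
import pandas as pd

sales = pd.DataFrame({
    "date":   pd.to_datetime(["2008-01-03", "2008-01-17", "2008-02-05"]),
    "amount": [120.0, 80.0, 200.0],
})

# Aggregation: daily sales -> monthly sales
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
```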
Normalization
Min-max normalization: maps [minA, maxA] to [new_minA, new_maxA]:
$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
Ex. Let income range [$12,000, $98,000] be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73600 - 12000}{98000 - 12000}(1.0 - 0) + 0 = 0.716$
Z-score normalization (μ: mean, σ: standard deviation):
$v' = \frac{v - \mu_A}{\sigma_A}$
Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73600 - 54000}{16000} = 1.225$
Normalization by decimal scaling:
$v' = \frac{v}{10^j}$, where j is the smallest integer such that $\max(|v'|) < 1$
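The three normalizations as plain Python functions (a sketch reproducing the worked numbers):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scale(v, j):
    # j: smallest integer such that max(|v'|) < 1 over the whole attribute
    return v / 10 ** j

print(min_max(73600, 12000, 98000))   # 0.716...
print(z_score(73600, 54000, 16000))   # 1.225
```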
Chapter 2: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Data Reduction
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies:
Dimensionality reduction
Feature selection – attribute subset selection
Feature extraction – mapping the data to a smaller number of features
Instance reduction
Feature Selection
Select a set of attributes (features) such that the resulting probability distribution of the data is as close as possible to the original distribution given all features
Benefits:
Remove irrelevant or redundant attributes
Reduce the number of attributes in the discovered patterns
Heuristic methods (# of choices? 2^d subsets for d attributes):
Step-wise forward selection (sketched below)
Step-wise backward elimination
Combining forward selection and backward elimination
Decision-tree induction (Chap 6, Classification)
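A generic sketch of step-wise forward selection (the `score` function is an assumed placeholder for any subset-quality measure, e.g., cross-validated accuracy):

```python
def forward_select(features, score, max_features):
    """Greedily add the feature that most improves score(subset)."""
    selected = []
    while len(selected) < max_features:
        remaining = [f for f in features if f not in selected]
        if not remaining:
            break
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break  # no remaining feature improves the score
        selected.append(best)
    return selected
```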
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Decision tree: the root tests A4; its subtrees test A1 and A6; each branch ends in a Class 1 or Class 2 leaf]
Reduced attribute set (attributes used in the tree): {A1, A4, A6}
Feature Extraction
Create new features (attributes) by combining/mapping existing ones
Methods:
Principal Component Analysis (PCA)
Data compression methods, e.g., the Discrete Wavelet Transform
Regression analysis
Principal Component Analysis (PCA)
Principal component analysis: find the dimensions that capture the most variance
A linear mapping of the data to a new coordinate system such that the greatest variance lies on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on
Steps:
Normalize the input data so that each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components; each input data vector is a linear combination of the k principal component vectors
The principal components are sorted in order of decreasing "significance"
Weak components, i.e., those with low variance, can be eliminated
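A compact sketch of these steps via the SVD of the centered data matrix (numpy only; illustrative, with random data standing in for a real data set):

```python
import numpy as np

def pca(X, k):
    """Project the n x d data X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                # center each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                    # orthonormal directions of max variance
    variance = S[:k] ** 2 / (len(X) - 1)   # variance captured by each component
    return Xc @ components.T, components, variance

X = np.random.rand(100, 5)
Z, comps, var = pca(X, k=2)                # 100 x 2 reduced representation
```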
Illustration of Principal Component Analysis
[Figure: two-dimensional data in original axes X1 and X2, with principal component directions Y1 and Y2 overlaid; Y1 lies along the direction of greatest variance]