Data Mining: Concepts and Techniques
— Chapter 2 —
Original Slides: Jiawei Han and Micheline Kamber
Modification: Li Xiong
Chapter 2: Data Preprocessing
Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation = “ ”
noisy: containing errors or outliers
e.g., Salary = “-10”
inconsistent: containing discrepancies in codes or names
e.g., Age = “42”, Birthday = “03/07/1997”
e.g., was rating “1, 2, 3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from:
“Not applicable” data values when collected
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems
Noisy data (incorrect values) may come from:
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from:
Different data sources
Functional dependency violation (e.g., modifying some linked data)
Duplicate records also need data cleaning
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories: intrinsic, contextual, representational, and accessibility
Major Tasks in Data Preprocessing
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration: integration of multiple databases, data cubes, or files
Data transformation: normalization and aggregation
Data reduction: obtains a reduced representation that is much smaller in volume but produces the same or similar analytical results
Data discretization: part of data reduction, of particular importance for numerical data
Forms of Data Preprocessing
Chapter 2: Data Preprocessing
Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Descriptive Data Summarization
Motivation
To better understand the data
Descriptive statistics: describe basic features of data
Graphical description
Tabular description
Summary statistics
Descriptive data summarization
Measuring central tendency – where the data values center
Measuring statistical variability or dispersion of data – how the data values spread
Graphic display of descriptive data summarization
Measuring the Central Tendency
Mean (sample vs. population):
Weighted arithmetic mean:
Trimmed mean: chopping extreme values
Median
Middle value if odd number of values, or average of the middle two values otherwise
Estimated by interpolation (for grouped data):
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula:
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$    Population mean: $\mu = \frac{\sum x}{N}$
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Median of grouped data, by interpolation: $\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$, where $L_1$ is the lower boundary of the median interval, $(\sum f)_l$ the sum of frequencies of the intervals below it, $f_{\text{median}}$ the frequency of the median interval, and $c$ its width
Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
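To make these measures concrete, here is a minimal Python sketch (standard library only; the sample data is our own illustration):

```python
from statistics import mean, median, mode

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(mean(data), median(data), mode(data))  # 20.33..., 22.5, 21

# Weighted arithmetic mean
values, weights = [1, 2, 3], [0.2, 0.3, 0.5]
wmean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Trimmed mean: drop the k smallest and k largest values before averaging
def trimmed_mean(xs, k):
    return mean(sorted(xs)[k:len(xs) - k])
```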
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively and negatively skewed data
Computational Issues
Different types of measures:
Distributed measure – can be computed by partitioning the data into smaller subsets, e.g., sum, count
Algebraic measure – can be computed by applying an algebraic function to one or more distributed measures, e.g., ?
Holistic measure – must be computed on the entire dataset as a whole, e.g., ?
Selection algorithm: finding the kth smallest number in a list
E.g., min, max, median
Selection by sorting: O(n log n)
Linear algorithms based on quicksort-style partitioning (quickselect): expected O(n)
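As an illustration of linear-expected-time selection, a quickselect sketch (not from the slides; pivot choice is our own):

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based) in expected O(n) time."""
    pivot = random.choice(values)
    below = [v for v in values if v < pivot]
    equal = [v for v in values if v == pivot]
    if k < len(below):
        return quickselect(below, k)
    if k < len(below) + len(equal):
        return pivot  # the k-th value equals the pivot
    return quickselect([v for v in values if v > pivot],
                       k - len(below) - len(equal))

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(quickselect(data, len(data) // 2))  # upper median: 24
```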
The Long Tail
Long tail: low-frequency population (e.g., wealth distribution)
"The Long Tail": the current and future business and economic models
Previous empirical studies: Amazon, Netflix
Products that are in low demand or have low sales volume can collectively make up a market share that rivals or exceeds the relatively few current bestsellers and blockbusters
The primary value of the internet: providing access to products in the long tail
Business and social implications
mass market retailers: Amazon, Netflix, eBay
content producers: YouTube
The Long Tail. Chris Anderson, Wired, Oct. 2004
The Long Tail: Why the Future of Business is Selling Less of More. Chris Anderson, 2006
Measuring the Dispersion of Data
Dispersion or variance: the degree to which numerical data tend to spread
Range and Quartiles
Range: difference between the largest and smallest values
Percentile: the value of a variable below which a certain percent of data fall
(algebraic or holistic?)
Quartiles: Q1 (25th percentile), Median (50th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 – Q1
Five number summary: min, Q1, M, Q3, max (Boxplot)
Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
Variance: sample vs. population (algebraic or holistic?)
Standard deviation s (or σ) is the square root of variance s2 (or σ2)
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
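A short numpy sketch of these dispersion measures (illustrative; the data values are our own):

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Five-number summary: min, Q1, median, Q3, max
q1, med, q3 = np.percentile(data, [25, 50, 75])
print(data.min(), q1, med, q3, data.max())

# Outliers: beyond 1.5 x IQR above Q3 or below Q1
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

# Sample variance (n-1 denominator) vs. population variance (n denominator)
print(data.var(ddof=1), data.var(ddof=0))
```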
Graphic Displays of Basic Statistical Descriptions
Histogram
Boxplot
Quantile plot
Quantile-quantile (q-q) plot
Scatter plot
Loess (local regression) curve
Histogram Analysis
Graphical display of tabulated frequencies
Univariate graphical method (one attribute)
Data partitioned into disjoint buckets (typically equal-width)
A set of rectangles that reflect the counts or frequencies of values in each bucket
Bar chart for categorical values
Boxplot Analysis
Visualizes five-number summary:
The ends of the box are the first and third quartiles (Q1 and Q3), i.e., the height of the box is the IQR
The median (M) is marked by a line within the box
Whiskers: two lines outside the box extend to Minimum and Maximum
Example Boxplot: Profit Analysis
Quantile Plot
Displays all of the data for the given attribute
Plots quantile information
Each data point (xi, fi) indicates that approximately 100 fi% of the data are below or equal to the value xi
Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Diagnoses differences between the probability distributions of two samples
Scatter plot
Displays values for two numerical attributes (bivariate data)
Each pair of values is plotted as a point in the plane
Can suggest various kinds of correlations between variables, with a certain confidence level: positive (rising), negative (falling), or null (uncorrelated)
Example Scatter Plot – Correlation between Wine Consumption and Heart Mortality
[Scatter plot: one point per country, wine consumption vs. heart-disease mortality, with France and the US labeled at opposite extremes]
Positively and Negatively Correlated Data
Not Correlated Data
Loess Curve
Locally weighted scatter plot smoothing, providing better perception of the pattern of dependence
Fits simple models to localized subsets of the data
Chapter 2: Data Preprocessing
Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Data Cleaning
Importance
“Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
“Data cleaning is the number one problem in data warehousing”—DCI survey
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
history or changes of the data not registered
Missing data may need to be inferred.
How to Handle Missing Values?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually
Fill in the missing value automatically with:
a global constant, e.g., “unknown” (in effect, a new class?!)
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based methods such as a Bayesian formula or a decision tree (Chap 6)
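A minimal pandas sketch of the automatic strategies (the column names and values are our own illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "income": [45000.0, None, 62000.0, None, 38000.0],
    "class":  ["A", "A", "B", "B", "A"],
})

# Global constant
df["by_const"] = df["income"].fillna(-1)

# Attribute mean over all samples
df["by_mean"] = df["income"].fillna(df["income"].mean())

# Attribute mean within the same class (smarter)
df["by_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))
```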
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to:
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistency in naming conventions
Other data problems that require data cleaning:
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning and smoothing: sort the data and partition it into bins (equal-frequency or equal-width), then smooth by bin means, bin medians, bin boundaries, etc.
Regression: smooth by fitting the data to a function with regression analysis
Clustering: detect and remove outliers that fall outside clusters
Combined computer and human inspection: detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)
Simple Discretization Methods: Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
The most straightforward, but outliers may dominate presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  Bin 1: 4, 8, 9, 15
  Bin 2: 21, 21, 24, 25
  Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  Bin 1: 9, 9, 9, 9
  Bin 2: 23, 23, 23, 23
  Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  Bin 1: 4, 4, 4, 15
  Bin 2: 21, 21, 25, 25
  Bin 3: 26, 26, 26, 34
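The worked example can be reproduced with a short sketch (numpy only; rounding of bin means is our own choice):

```python
import numpy as np

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

# Equal-frequency partitioning into 3 bins
bins = np.array_split(prices, 3)

# Smoothing by bin means
by_means = [[round(np.mean(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the nearer boundary
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
             for b in bins]
```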
Regression
[Figure: scatter of points with fitted regression line y = x + 1; a noisy observation (X1, Y1) is replaced by its smoothed value Y1' on the line]
Cluster Analysis
Chapter 2: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration
Data transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Data Integration
Data integration: combines data from multiple sources into a unified view
Architectures:
Data warehouse (tightly coupled)
Federated database systems (loosely coupled)
Database heterogeneity
Semantic integration
Data Warehouse Approach
[Diagram: clients issue query & analysis requests to a central warehouse; data from multiple sources is loaded into the warehouse via ETL, with metadata describing the integrated contents]
Advantages and Disadvantages of Data Warehouse
Advantages:
High query performance
Can operate even when sources are unavailable
Extra information at the warehouse
Modification, summarization (aggregates), historical information
Local processing at the sources is unaffected
Disadvantages:
Data freshness
Difficult to construct when only the query interfaces of the local sources are accessible
Federated Database Systems
[Diagram: clients query a mediator, which translates requests through per-source wrappers to the underlying sources]
Advantages and Disadvantages of Federated Database Systems
Advantages:
No need to copy and store data at the mediator
More up-to-date data
Only a query interface needed at the sources
Disadvantages:
Query performance
Source availability
Database Heterogeneity
System heterogeneity: use of different operating systems and hardware platforms
Schematic or structural heterogeneity: the native model or structure used to store data differs across sources
Syntactic heterogeneity: differences in the representation format of data
Semantic heterogeneity: differences in interpretation of the 'meaning' of data
Semantic Integration
Problem: reconciling semantic heterogeneity
Levels:
Schema matching (schema mapping), e.g., A.cust-id ≡ B.cust-#
Data matching (data deduplication, record linkage, entity/object matching), e.g., Bill Clinton = William Clinton
Challenges:
Semantics can be inferred from only a few information sources (data creators, documentation), so systems must rely on the schema and data themselves
Schema and data are unreliable and incomplete
Global pair-wise matching is computationally expensive
In practice, 60-80% of the resources in a data sharing project are spent on reconciling semantic heterogeneity
Schema Matching
Techniques:
Rule based
Learning based
Types of matches:
1-1 matches vs. complex matches (e.g., list-price = price * (1 + tax_rate))
Information used:
Schema information: element names, data types, structures, number of sub-elements, integrity constraints
Data information: value distributions, frequency of words
External evidence: past matches, corpora of schemas
Ontologies, e.g., the Gene Ontology
Multi-matcher architecture
Data Matching Or … ?
record linkage, data matching, object identification, entity resolution, entity disambiguation, duplicate detection, record matching, instance identification, deduplication, reference reconciliation, database hardening, …
Data Matching
Techniques:
Rule based
Probabilistic record linkage (Fellegi and Sunter, 1969); see the sketch below
Similarity between pairs of attributes
Combined scores representing the probability of a match
Threshold-based decision
Machine learning approaches
New challenges:
Complex information spaces
Multiple classes
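A toy sketch of threshold-based matching over combined attribute similarities (field names, weights, and thresholds are our own illustration, not the actual Fellegi-Sunter model):

```python
from difflib import SequenceMatcher

WEIGHTS = {"name": 0.6, "city": 0.4}

def sim(a, b):
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(r1, r2, weights=WEIGHTS):
    # Combine per-attribute similarities into one matching score
    return sum(w * sim(r1[f], r2[f]) for f, w in weights.items())

r1 = {"name": "William Clinton", "city": "Little Rock"}
r2 = {"name": "Bill Clinton",    "city": "Little Rock"}
score = match_score(r1, r2)
print("match" if score > 0.8 else "review" if score > 0.5 else "non-match")
```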
Chapter 2: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration
Data transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Data Transformation
Smoothing: remove noise from data (data cleaning)
Aggregation: summarization
E.g. Daily sales -> monthly sales
Discretization and generalization
E.g. age -> youth, middle-aged, senior
(Statistical) Normalization: scaled to fall within a small, specified range
E.g. income vs. age
Attribute construction: construct new attributes from given ones
E.g. birthday -> age
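For example, attribute construction and discretization in pandas (a sketch; the cut-points and column names are our own assumptions):

```python
import pandas as pd

df = pd.DataFrame({"birth_year": [1990, 1975, 1950]})

# Attribute construction: birthday -> age
df["age"] = 2008 - df["birth_year"]

# Discretization/generalization: age -> youth, middle-aged, senior
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                         labels=["youth", "middle-aged", "senior"])
```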
Data Aggregation
Data cubes store multidimensional aggregated information
Multiple levels of aggregation for analysis at multiple granularities
More on data warehouses and cube computation (chap 3, 4)
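One cell of such a roll-up, sketched with pandas (column names assumed):

```python
import pandas as pd

sales = pd.DataFrame({
    "date":   pd.to_datetime(["2008-01-03", "2008-01-17", "2008-02-05"]),
    "amount": [120.0, 80.0, 200.0],
})

# Aggregation: daily sales -> monthly sales
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
```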
Normalization
Min-max normalization: maps [minA, maxA] to [new_minA, new_maxA]:
$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
Ex. Let income range [$12,000, $98,000] be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73600 - 12000}{98000 - 12000}(1.0 - 0) + 0 = 0.716$
Z-score normalization (μ: mean, σ: standard deviation):
$v' = \frac{v - \mu_A}{\sigma_A}$
Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73600 - 54000}{16000} = 1.225$
Normalization by decimal scaling:
$v' = \frac{v}{10^j}$, where j is the smallest integer such that $\max(|v'|) < 1$
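The three normalizations as plain Python functions (a sketch reproducing the worked numbers):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scale(v, j):
    # j: smallest integer such that max(|v'|) < 1 over the whole attribute
    return v / 10 ** j

print(min_max(73600, 12000, 98000))   # 0.716...
print(z_score(73600, 54000, 16000))   # 1.225
```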
Chapter 2: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Data Reduction
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies:
Dimensionality reduction
Feature selection – attribute subset selection
Feature extraction – mapping the data to a smaller number of features
Instance reduction
Feature Selection
Select a set of attributes (features) such that the resulting probability distribution of the data is as close as possible to the original distribution given all features
Benefits:
Remove irrelevant or redundant attributes
Reduce the number of attributes in the discovered patterns
Heuristic methods (# of choices? 2^d subsets for d attributes):
Step-wise forward selection (sketched below)
Step-wise backward elimination
Combining forward selection and backward elimination
Decision-tree induction (Chap 6, Classification)
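A generic sketch of step-wise forward selection (the `score` function is an assumed placeholder for any subset-quality measure, e.g., cross-validated accuracy):

```python
def forward_select(features, score, max_features):
    """Greedily add the feature that most improves score(subset)."""
    selected = []
    while len(selected) < max_features:
        remaining = [f for f in features if f not in selected]
        if not remaining:
            break
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break  # no remaining feature improves the score
        selected.append(best)
    return selected
```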
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Decision tree: the root tests A4; its subtrees test A1 and A6; each branch ends in a Class 1 or Class 2 leaf]
Reduced attribute set (attributes used in the tree): {A1, A4, A6}
Feature Extraction
Create new features (attributes) by combining/mapping existing ones
Methods:
Principal Component Analysis (PCA)
Data compression methods, e.g., the Discrete Wavelet Transform
Regression analysis
Principal Component Analysis (PCA)
Principal component analysis: find the dimensions that capture the most variance
A linear mapping of the data to a new coordinate system such that the greatest variance lies on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on
Steps:
Normalize the input data so that each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components; each input data vector is a linear combination of the k principal component vectors
The principal components are sorted in order of decreasing "significance"
Weak components, i.e., those with low variance, can be eliminated
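A compact sketch of these steps via the SVD of the centered data matrix (numpy only; illustrative, with random data standing in for a real data set):

```python
import numpy as np

def pca(X, k):
    """Project the n x d data X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                # center each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                    # orthonormal directions of max variance
    variance = S[:k] ** 2 / (len(X) - 1)   # variance captured by each component
    return Xc @ components.T, components, variance

X = np.random.rand(100, 5)
Z, comps, var = pca(X, k=2)                # 100 x 2 reduced representation
```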
Illustration of Principal Component Analysis
[Figure: two-dimensional data in original axes X1 and X2, with principal component directions Y1 and Y2 overlaid; Y1 lies along the direction of greatest variance]