2
Data Types and Forms
[Table: attribute-value data with attribute columns A1, A2, …, An and a class column Cn]
- Attribute-value data
- Data types
  - numeric, categorical (see the hierarchy for their relationship)
  - static, dynamic (temporal)
- Other kinds of data
  - distributed data
  - text, Web, meta data
  - images, audio/video
3
Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization
- Summary
4
Why Data Preprocessing?
- Data in the real world is "dirty"
  - incomplete: missing attribute values, lack of certain attributes of interest, or containing only aggregate data
    - e.g., occupation=""
  - noisy: containing errors or outliers
    - e.g., Salary="-10"
  - inconsistent: containing discrepancies in codes or names
    - e.g., Age="42" but Birthday="03/07/1997"
    - e.g., rating was "1, 2, 3", now "A, B, C"
    - e.g., discrepancy between duplicate records
5
Why Is Data Preprocessing Important?
- No quality data, no quality mining results! (garbage in, garbage out)
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics
- Data preparation, cleaning, and transformation comprise the majority of the work in a data mining application (often as much as 90%)
6
Multi-Dimensional Measure of Data Quality
- A well-accepted multi-dimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
7
Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a representation reduced in volume that produces the same or similar analytical results
- Data discretization (for numerical data)
8
Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization
- Summary
9
Data Cleaning
- Importance
  - "Data cleaning is the number one problem in data warehousing"
- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration
10
Missing Data
- Data is not always available
  - e.g., many tuples have no recorded values for several attributes, such as customer income in sales data
- Missing data may be due to
  - equipment malfunction
  - deletion because of inconsistency with other recorded data
  - data not entered due to misunderstanding
  - certain data not considered important at the time of entry
  - no recorded history or changes of the data
  - expansion of the data schema
11
How to Handle Missing Data?
- Ignore the tuple (loses information)
- Fill in missing values manually: tedious, often infeasible
- Fill in automatically with
  - a global constant, e.g., "unknown" (effectively a new class!)
  - the attribute mean
  - the most probable value: inference-based, e.g., a Bayesian formula, a decision tree, or the EM algorithm
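The automatic fill-in strategies above can be sketched in a few lines of Python. The data and function names here are hypothetical; `None` marks a missing value.

```python
# Minimal sketch of automatic missing-value filling.
from collections import Counter

def fill_missing(values, strategy="mean", constant="unknown"):
    """Return a copy of `values` with None entries filled in."""
    present = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant                               # global constant
    elif strategy == "mean":
        fill = sum(present) / len(present)            # attribute mean
    elif strategy == "mode":
        fill = Counter(present).most_common(1)[0][0]  # most probable value
    else:
        raise ValueError(strategy)
    return [fill if v is None else v for v in values]

incomes = [30, None, 50, 40, None]
print(fill_missing(incomes, "mean"))   # [30, 40.0, 50, 40, 40.0]
```

Inference-based filling (Bayesian, decision tree, EM) would replace the single `fill` value with a per-tuple prediction from the other attributes.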
12
Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - etc.
- Other data problems that require data cleaning
  - duplicate records, incomplete data, inconsistent data
13
How to Handle Noisy Data?
- Binning method:
  - first sort the data and partition it into (equi-sized) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
- Clustering
  - detect and remove outliers
- Combined computer and human inspection
  - detect suspicious values and have a human check them (e.g., to deal with possible outliers)
14
Binning Methods for Data Smoothing
- Sorted data for price (in dollars):
  - e.g., 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into (equi-sized) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries (each value replaced by the nearer bin boundary):
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
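Both smoothings can be sketched directly on the price data above. Note that in boundary smoothing each value goes to the nearer bin boundary, so the 24 in Bin 2 maps to 25. The function names are illustrative.

```python
# Sketch of bin smoothing with equi-sized bins of 4 values.
def smooth_by_means(sorted_data, bin_size):
    out = []
    for i in range(0, len(sorted_data), bin_size):
        b = sorted_data[i:i + bin_size]
        mean = round(sum(b) / len(b))   # rounded, as on the slide (22.75 -> 23)
        out.extend([mean] * len(b))
    return out

def smooth_by_boundaries(sorted_data, bin_size):
    out = []
    for i in range(0, len(sorted_data), bin_size):
        b = sorted_data[i:i + bin_size]
        lo, hi = b[0], b[-1]            # bin boundaries
        out.extend([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_means(prices, 4))       # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]
print(smooth_by_boundaries(prices, 4))  # [4, 4, 4, 15, 21, 21, 25, 25, 26, 26, 26, 34]
```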
15
Outlier Removal
- Data points inconsistent with the majority of the data
- Different kinds of outliers
  - Valid: a CEO's salary
  - Noisy: someone's age = 200; widely deviating points
- Removal methods
  - Clustering
  - Curve fitting
  - Hypothesis testing with a given model
16
Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization
- Summary
17
Data Integration
- Data integration:
  - combines data from multiple sources
- Schema integration
  - integrate metadata from different sources
  - Entity identification problem: identify the same real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#
- Detecting and resolving data value conflicts
  - for the same real-world entity, attribute values from different sources differ, e.g., different scales, metric vs. British units
- Removing duplicates and redundant data
18
Data Transformation
- Smoothing: remove noise from the data
- Normalization: scale values to fall within a small, specified range
- Attribute/feature construction
  - New attributes constructed from the given ones
- Aggregation: summarization
  - Integrate data from different sources (tables)
- Generalization: concept hierarchy climbing
19
Data Transformation: Normalization
- min-max normalization:

  v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

- z-score normalization (normalize to N(0,1)):

  v' = (v - mean_A) / stand_dev_A
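The two normalizations can be sketched as plain Python helpers. The sample numbers (an income of 73600 with range [12000, 98000], mean 54000, standard deviation 16000) are hypothetical.

```python
# Sketch of min-max and z-score normalization for a single value.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Standardize v to zero mean and unit standard deviation."""
    return (v - mean_a) / std_a

print(min_max(73600, 12000, 98000))   # ~0.716
print(z_score(73600, 54000, 16000))   # 1.225
```

In practice min_A, max_A, mean_A, and stand_dev_A are computed once per attribute over the whole data set and then applied to every value.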
20
Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization
- Summary
21
Data Reduction Strategies
- Data is too big to work with
  - Too many instances
  - Too many features (attributes): the curse of dimensionality
- Data reduction
  - Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results (easily said but difficult to do)
- Data reduction strategies
  - Dimensionality reduction: remove unimportant attributes
  - Aggregation and clustering
    - Remove redundant or closely associated ones
  - Sampling
22
Dimensionality Reduction
- Feature selection (i.e., attribute subset selection):
  - Direct methods
    - Select a minimum set of attributes (features) sufficient for the data mining task
  - Indirect methods
    - Principal component analysis (PCA)
    - Singular value decomposition (SVD)
    - Independent component analysis (ICA)
    - Various spectral and/or manifold embeddings (active topics)
- Heuristic methods (due to the exponential number of attribute subsets):
  - step-wise forward selection
  - step-wise backward elimination
  - combining forward selection and backward elimination
  - etc.
- Combinatorial search: exponential computation cost
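Step-wise forward selection can be sketched as a greedy loop. Everything here is illustrative: the toy `score` function stands in for a real subset-quality measure such as cross-validated accuracy.

```python
# Sketch of step-wise forward feature selection with a pluggable scorer.
def forward_select(features, score, k):
    """Greedily add the feature that most improves score(subset), up to k."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break                       # no improvement: stop early
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy scorer: reward features from a known "useful" set, penalize subset size.
useful = {"age", "income"}
score = lambda subset: len(useful & set(subset)) - 0.1 * len(subset)
print(forward_select(["age", "zip", "income", "id"], score, 3))
# ['age', 'income']
```

Backward elimination is the mirror image: start from all features and greedily drop the one whose removal hurts the score least.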
23
Histograms
- A popular data reduction technique
- Divide data into buckets and store the average (or sum) for each bucket
[Figure: example histogram with price buckets from 10,000 to 90,000 on the x-axis and counts from 0 to 40 on the y-axis]
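A minimal sketch of histogram-based reduction: keep only each bucket's boundaries and average instead of the raw values. Data and names are hypothetical.

```python
# Equi-width histogram summary: (bucket_lo, bucket_hi, bucket_average).
def histogram_summary(data, lo, hi, n_buckets):
    width = (hi - lo) / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for v in data:
        i = min(int((v - lo) / width), n_buckets - 1)  # clamp top edge
        buckets[i].append(v)
    return [(lo + i * width, lo + (i + 1) * width,
             sum(b) / len(b) if b else None)
            for i, b in enumerate(buckets)]

prices = [12, 18, 25, 31, 34, 47, 58, 79, 85, 90]
for b_lo, b_hi, avg in histogram_summary(prices, 0, 100, 5):
    print(f"[{b_lo:.0f}, {b_hi:.0f}): avg={avg}")
```

Ten values collapse to five (boundary, average) pairs; queries are then answered approximately from the summary.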
24
Clustering
- Partition the data set into clusters; one can then store only a cluster representation
- Can be very effective if the data is clustered, but not if it is "smeared"
- There are many choices of clustering definitions and clustering algorithms; we will discuss them later.
25
Sampling
- Choose a representative subset of the data
  - Simple random sampling may perform poorly in the presence of skew
- Develop adaptive sampling methods
  - Stratified sampling:
    - Approximate the percentage of each class (or subpopulation of interest) in the overall database
    - Used in conjunction with skewed data
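Stratified sampling can be sketched as: group records by class, then sample each class in proportion to its share of the data. The record layout is hypothetical.

```python
# Sketch of stratified sampling over class-labelled records.
import random

def stratified_sample(records, label, fraction, seed=0):
    rng = random.Random(seed)
    strata = {}
    for r in records:                    # group records by class label
        strata.setdefault(r[label], []).append(r)
    sample = []
    for rows in strata.values():
        k = max(1, round(fraction * len(rows)))  # keep every class represented
        sample.extend(rng.sample(rows, k))
    return sample

data = [{"cls": "A"}] * 90 + [{"cls": "B"}] * 10   # skewed 90/10 split
s = stratified_sample(data, "cls", 0.1)
print(len(s))   # 10 records: 9 of class A, 1 of class B
```

A plain 10% simple random sample of this data could easily contain no class-B record at all; the stratified version preserves the 90/10 proportions by construction.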
27
Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization
- Summary
28
Discretization
- Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
- Discretization:
  - divide the range of a continuous attribute into intervals, because some data mining algorithms accept only categorical attributes
- Some techniques:
  - Binning methods: equal-width, equal-frequency
  - Entropy-based
  - etc.
29
Discretization and Concept Hierarchy
- Discretization
  - reduce the number of values of a given continuous attribute by dividing its range into intervals; interval labels can then replace the actual data values
- Concept hierarchies
  - reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)
30
Binning
- Attribute values (for one attribute, e.g., age):
  - 0, 4, 12, 16, 16, 18, 24, 26, 28
- Equi-width binning, for a bin width of, e.g., 10:
  - Bin 1: 0, 4 (the [-∞, 10) bin)
  - Bin 2: 12, 16, 16, 18 (the [10, 20) bin)
  - Bin 3: 24, 26, 28 (the [20, +∞) bin)
- Equi-frequency binning, for a bin density of, e.g., 3:
  - Bin 1: 0, 4, 12 (the [-∞, 14) bin)
  - Bin 2: 16, 16, 18 (the [14, 21) bin)
  - Bin 3: 24, 26, 28 (the [21, +∞) bin)
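The two binnings above can be sketched directly on the age values. Here equi-width bins are keyed by interval (a key `(10, 20)` means the [10, 20) bin), and equi-frequency simply slices the sorted values into groups of equal size.

```python
# Sketch of equi-width and equi-frequency binning for discretization.
def equi_width_bins(values, width):
    bins = {}
    for v in sorted(values):
        lo = (v // width) * width            # left edge of v's interval
        bins.setdefault((lo, lo + width), []).append(v)
    return bins

def equi_freq_bins(values, per_bin):
    s = sorted(values)
    return [s[i:i + per_bin] for i in range(0, len(s), per_bin)]

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equi_width_bins(ages, 10))
# {(0, 10): [0, 4], (10, 20): [12, 16, 16, 18], (20, 30): [24, 26, 28]}
print(equi_freq_bins(ages, 3))
# [[0, 4, 12], [16, 16, 18], [24, 26, 28]]
```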
31
Entropy-based (1)
- Given attribute-value/class pairs:
  - (0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N)
- Entropy-based binning via binarization:
  - Intuitively, find the best split so that the bins are as pure as possible
  - Formally characterized by maximal information gain
- Let S denote the above 9 pairs, p = 4/9 the fraction of P pairs, and n = 5/9 the fraction of N pairs.
- Entropy(S) = - p log p - n log n
  - Small entropy: the set is relatively pure; the smallest value is 0
  - Large entropy: the set is mixed; the largest value is 1 (for log base 2)
32
Entropy-based (2)
- Let v be a possible split. Then S is divided into two sets:
  - S1: values <= v, and S2: values > v
- Information of the split:
  - I(S1, S2) = (|S1|/|S|) Entropy(S1) + (|S2|/|S|) Entropy(S2)
- Information gain of the split:
  - Gain(v, S) = Entropy(S) - I(S1, S2)
- Goal: the split with maximal information gain
- Possible splits: midpoints between any two consecutive values
- For v = 14: I(S1, S2) = 0 + (6/9) Entropy(S2) = (6/9) * 0.65 = 0.433
  - Gain(14, S) = Entropy(S) - 0.433
  - maximum Gain means minimum I
- The best split is found after examining all possible splits.
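The whole search can be sketched in a few lines: compute Entropy(S), try every midpoint between consecutive values, and keep the split with maximal gain. On the pairs above it selects v = 14.

```python
# Sketch of the entropy-based split search (binarization).
from math import log2

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def best_split(pairs):
    """pairs: (value, label) tuples. Return (split_point, information_gain)."""
    base = entropy([c for _, c in pairs])          # Entropy(S)
    best = (None, 0.0)
    values = sorted({v for v, _ in pairs})
    for a, b in zip(values, values[1:]):
        v = (a + b) / 2                            # midpoint candidate
        s1 = [c for x, c in pairs if x <= v]
        s2 = [c for x, c in pairs if x > v]
        info = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(pairs)
        if base - info > best[1]:
            best = (v, base - info)
    return best

pairs = [(0, 'P'), (4, 'P'), (12, 'P'), (16, 'N'), (16, 'N'),
         (18, 'P'), (24, 'N'), (26, 'N'), (28, 'N')]
print(best_split(pairs))   # splits at v = 14.0 with gain ~0.558
```

Applied recursively to S1 and S2, this yields multi-interval entropy-based discretization, typically with a stopping criterion such as a minimum gain threshold.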