7/30/2019 PreProcessing_DMM
Machine Learning
Jalali.mshdiau.ac.ir
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation = "" (a missing value)
noisy: containing errors or outliers
e.g., Salary=-10
inconsistent: containing discrepancies in codes or names
e.g., Age=42 Birthday=03/07/1997
e.g., Was rating 1,2,3, now rating A, B, C
e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
"Not applicable" data values at collection time
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same or similar analytical
results
Data discretization
Part of data reduction but with particular importance, especially for numerical data
Data Preprocessing
[Figure: the preprocessing pipeline — Data Cleaning, Data Integration, Data Transformation (e.g., -2, 32, 100, 59, 48 → -0.02, 0.32, 1.00, 0.59, 0.48), Data Reduction]
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
history or changes of the data not being registered
Missing data may need to be inferred.
How to Handle Missing Data?
Fill in the missing value manually: tedious + infeasible?
Fill in it automatically with
a global constant: e.g., "unknown", a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based such as Bayesian formula or decision tree
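A minimal sketch of the two automatic strategies above, filling with the attribute mean versus the class-conditional mean. The data, class labels, and function names below are hypothetical, not from the slides.

```python
# Sketch (assumption): automatic imputation of missing values (None).

def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Smarter: replace None with the mean of samples in the same class."""
    class_means = {}
    for label in set(labels):
        observed = [v for v, l in zip(values, labels)
                    if l == label and v is not None]
        class_means[label] = sum(observed) / len(observed)
    return [class_means[l] if v is None else v
            for v, l in zip(values, labels)]

incomes = [30, None, 50, None, 70]               # hypothetical income column
classes = ["low", "low", "high", "high", "high"]  # hypothetical class labels
print(fill_with_mean(incomes))                    # global mean is 50
print(fill_with_class_mean(incomes, classes))     # per-class means: 30 and 60
```

The class-conditional variant is "smarter" in the slide's sense: a missing low-class income is filled with 30 rather than the global 50.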
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Binning
first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with possible outliers)
Regression
[Figure: points (x, y) fitted by the line y = x + 1; the observed value Y1 at X1 is replaced by the fitted value on the line]
Fit data to a function. Linear regression finds the best line to fit two variables. Multiple regression can handle multiple variables. The values given by the function are used instead of the original values.
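The regression smoothing described above can be sketched as a least-squares line fit whose fitted values replace the originals. The function names and sample data are my own, for illustration:

```python
# Sketch (assumption): smooth one variable by fitting y = a*x + b
# with ordinary least squares, then use the fitted values.

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

def smooth_by_regression(xs, ys):
    """Replace each y with the value the fitted line predicts."""
    a, b = fit_line(xs, ys)
    return [a * x + b for x in xs]

xs = [1, 2, 3, 4]
ys = [2.1, 2.9, 4.2, 4.8]   # roughly y = x + 1, with noise
print(smooth_by_regression(xs, ys))
```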
Cluster Analysis
Similar values are organized into groups (clusters). Values falling outside of clusters may be considered outliers and may be candidates for elimination.
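One possible sketch of this idea: cluster 1-D values with a tiny k-means and treat members of very small clusters as outlier candidates. The choice of k-means, the "small cluster" criterion, the data, and all names here are my own assumptions, not the slides' method.

```python
# Sketch (assumption): cluster-based outlier detection in one dimension.

def assign(values, centres):
    """Index of the nearest centre for each value."""
    return [min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            for v in values]

def kmeans_1d(values, centres, iters=20):
    """Tiny 1-D k-means (an empty cluster collapses to 0 in this sketch)."""
    for _ in range(iters):
        labels = assign(values, centres)
        centres = [
            sum(v for v, l in zip(values, labels) if l == i)
            / max(1, sum(1 for l in labels if l == i))
            for i in range(len(centres))
        ]
    return centres

def outlier_candidates(values, centres, min_size=2):
    """Values whose cluster has fewer than min_size members."""
    labels = assign(values, centres)
    sizes = {i: labels.count(i) for i in set(labels)}
    return [v for v, l in zip(values, labels) if sizes[l] < min_size]

data = [10, 11, 12, 50, 51, 52, 300]   # 300 sits far from both dense groups
centres = kmeans_1d(data, centres=[10.0, 50.0, 200.0])
print(outlier_candidates(data, centres))   # 300 ends up alone in its cluster
```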
Binning
Equal-depth (equal-frequency) partitioning: divides the range into N intervals, each containing approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky
Binning
Original data for price (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into equal-depth bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means (each value in a bin is replaced by the mean value of the bin):
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries (the min and max values in each bin are identified as boundaries; each value in a bin is replaced with the closest boundary value):
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
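The equal-depth binning and both smoothing rules above can be sketched as follows; the helper names are mine, not the slides'.

```python
# Sketch (assumption): equal-depth binning with mean and boundary smoothing.

def equal_depth_bins(sorted_values, n_bins):
    """Partition sorted values into n_bins bins of (almost) equal size."""
    size = len(sorted_values) // n_bins
    bins = [sorted_values[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(sorted_values[(n_bins - 1) * size:])  # last bin takes the rest
    return bins

def smooth_by_means(bins):
    """Replace every value by its bin's (rounded) mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's min/max boundary."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
bins = equal_depth_bins(prices, 3)
print(bins)                       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))      # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins)) # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```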
Example
ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

Temperature sorted and partitioned into equal-depth bins:
Bin 1: 58 (ID 7), 65 (ID 6), 68 (ID 5)
Bin 2: 69 (ID 9), 70 (ID 4), 71 (ID 10)
Bin 3: 72 (ID 8), 73 (ID 12), 75 (ID 11)
Bin 4: 75 (ID 14), 80 (ID 2), 81 (ID 13)
Bin 5: 83 (ID 3), 85 (ID 1)
Example
Smoothing each Temperature bin by its mean:
Bin 1: 58, 65, 68 → 64, 64, 64
Bin 2: 69, 70, 71 → 70, 70, 70
Bin 3: 72, 73, 75 → 73, 73, 73
Bin 4: 75, 80, 81 → 79, 79, 79
Bin 5: 83, 85 → 84, 84
Smoothing Noisy Data - Example
The final table with the new values for the Temperature attribute:
ID  Outlook   Temperature  Humidity  Windy
1   sunny     84           85        FALSE
2   sunny     79           90        TRUE
3   overcast  84           78        FALSE
4   rain      70           96        FALSE
5   rain      64           80        FALSE
6   rain      64           70        TRUE
7   overcast  64           65        TRUE
8   sunny     73           95        FALSE
9   sunny     70           70        FALSE
10  rain      70           80        FALSE
11  sunny     73           70        TRUE
12  overcast  73           90        TRUE
13  overcast  79           75        FALSE
14  rain      79           80        TRUE
Data Integration
Data integration: Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different sources are
different
Possible reasons: different representations, different scales, e.g., metric vs.
British units
Use an ontology (e.g., WordNet) to find the same entities in different databases
Handling Redundancy in Data Integration
Redundant data occur often when integrating multiple databases
Object identification: the same attribute or object may have different names in different databases (semantic heterogeneity)
Derivable data: one attribute may be a derived attribute in another table, e.g., annual revenue
Redundant attributes may be able to be detected by correlation analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining speed
and quality
Correlation Analysis (Numerical Data)
Correlation coefficient (also called the Pearson product-moment correlation coefficient, PMCC):

r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum a_i b_i is the sum of the AB cross-product.
If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated
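The Pearson coefficient on this slide can be computed directly; the sample data below is made up for illustration.

```python
# Sketch: Pearson product-moment correlation coefficient.
from math import sqrt

def pearson(a, b):
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Numerator: sum of cross-deviations.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # The (n-1) factors in numerator and denominator cancel,
    # so un-normalized deviation sums suffice.
    dev_a = sqrt(sum((x - mean_a) ** 2 for x in a))
    dev_b = sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (dev_a * dev_b)

a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]     # b = 2a: perfectly positively correlated
print(pearson(a, b))     # ~1.0
```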
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization
Min-max normalization to [new\_min_A, new\_max_A]:

v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to \frac{73600 - 12000}{98000 - 12000}(1.0 - 0) + 0 = 0.716

Z-score normalization (\mu: mean, \sigma: standard deviation):

v' = \frac{v - \mu_A}{\sigma_A}

Ex. Let \mu = 54,000 and \sigma = 16,000. Then \frac{73600 - 54000}{16000} = 1.225

Normalization by decimal scaling:

v' = \frac{v}{10^j}, where j is the smallest integer such that Max(|v'|) < 1
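A sketch of the three normalizations; the example calls reproduce the slide's income figures, while the function names are my own.

```python
# Sketch (assumption): the three normalization methods above.

def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Scale v from [old_min, old_max] into [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Standardize v using the attribute's mean and standard deviation."""
    return (v - mean) / std

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(round(min_max(73600, 12000, 98000), 3))   # the slide's 0.716
print(round(z_score(73600, 54000, 16000), 3))   # the slide's 1.225
print(decimal_scaling([-986, 917]))             # j = 3
```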
Normalization: Example - z-score normalization
Example: normalizing the Humidity attribute (mean = 80.3, stdev = 9.84):

Humidity  z-score
85         0.48
90         0.99
78        -0.23
96         1.60
80        -0.03
70        -1.05
65        -1.55
95         1.49
70        -1.05
80        -0.03
70        -1.05
90         0.99
75        -0.54
80        -0.03
Normalization: Example II - Min-max normalization
Before:
ID  Gender  Age  Salary
1   F       27   19,000
2   M       51   64,000
3   M       52   100,000
4   F       33   55,000
5   M       45   45,000

After min-max normalization to [0, 1] (Gender coded F = 1, M = 0):
ID  Gender  Age   Salary
1   1       0.00  0.00
2   0       0.96  0.56
3   0       1.00  1.00
4   1       0.24  0.44
5   0       0.72  0.32
Data Reduction Strategies
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume but yet
produce the same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation
Dimensionality reduction: e.g., remove unimportant attributes
Data compression
Numerosity reduction: e.g., fit data into models (regression, clustering)
Discretization and concept hierarchy generation
Discretization
Three types of attributes:
Nominal: values from an unordered set, e.g., color, profession
Ordinal: values from an ordered set, e.g., military or academic rank
Continuous: numeric values, e.g., integer or real numbers
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis
Discretization and Concept Hierarchy
Discretization
Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
Interval labels can then be used to replace actual data values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low level concepts (such as numeric
values for age) by higher level concepts (such as young, middle-aged, or senior)
Discretization - Example
Example: discretizing the Humidity attribute using 3 bins (Low = 60-69, Normal = 70-79, High = 80+):

Humidity  Label
85        High
90        High
78        Normal
96        High
80        High
70        Normal
65        Low
95        High
70        Normal
80        High
70        Normal
90        High
75        Normal
80        High
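The 3-bin discretization above amounts to a lookup against the interval cut-offs; the cut-offs are the slide's, the function name is mine.

```python
# Sketch: replace continuous Humidity values with interval labels.

def discretize_humidity(value):
    """Map a humidity reading to the slide's three interval labels."""
    if value < 70:
        return "Low"      # 60-69
    elif value < 80:
        return "Normal"   # 70-79
    return "High"         # 80+

humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
print([discretize_humidity(v) for v in humidity])
```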
Concept Hierarchy Generation for Categorical Data
Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by the analysis of
the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy
Exceptions, e.g., weekday, month, quarter, year
country (15 distinct values) < province_or_state (365 distinct values) < city (3,567 distinct values) < street (674,339 distinct values)
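The heuristic above can be sketched by sorting attributes by their distinct-value counts, most distinct first (lowest hierarchy level first). The counts come from the slide; the dictionary and variable names are mine. Note the slide's caveat that exceptions such as weekday/month/quarter/year defeat this heuristic.

```python
# Sketch (assumption): derive hierarchy levels from distinct-value counts.
# Counts are the slide's example figures.
distinct_counts = {
    "street": 674339,
    "city": 3567,
    "province_or_state": 365,
    "country": 15,
}

# Most distinct values -> lowest level of the hierarchy.
levels = sorted(distinct_counts, key=distinct_counts.get, reverse=True)
print(" < ".join(levels))   # street < city < province_or_state < country
```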
Assignments
Spatio-temporal data mining
Pre-processing the IRIS dataset in Weka
Questions?