PreProcessing_DMM


  • 7/30/2019 PreProcessing_DMM

    1/35

    Machine Learning


    [email protected]

    Jalali.mshdiau.ac.ir



    Data Preprocessing

    Why preprocess the data?

    Data cleaning

Data integration and transformation

Data reduction

    Discretization and concept hierarchy generation

    Summary



    Why Data Preprocessing?

    Data in the real world is dirty

incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

e.g., occupation = "" (the value was never recorded)

    noisy: containing errors or outliers

    e.g., Salary=-10

    inconsistent: containing discrepancies in codes or names

e.g., Age = 42 but Birthday = 03/07/1997

    e.g., Was rating 1,2,3, now rating A, B, C

    e.g., discrepancy between duplicate records



    Why Is Data Dirty?

Incomplete data may come from

"Not applicable" data values when collected

Different considerations between the time when the data was collected and when it is analyzed

    Human/hardware/software problems

    Noisy data (incorrect values) may come from

    Faulty data collection instruments

    Human or computer error at data entry

    Errors in data transmission

    Inconsistent data may come from

    Different data sources

    Functional dependency violation (e.g., modify some linked data)

    Duplicate records also need data cleaning



    Multi-Dimensional Measure of Data Quality

    A well-accepted multidimensional view:

    Accuracy

    Completeness

    Consistency

    Timeliness

    Believability

    Value added

    Interpretability

    Accessibility



    Major Tasks in Data Preprocessing

Data cleaning

Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration

Integration of multiple databases, data cubes, or files

Data transformation

Normalization and aggregation

Data reduction

Obtains a reduced representation in volume that produces the same or similar analytical results

    Data discretization

    Part of data reduction but with particular importance, especially for numerical data



    Data Preprocessing

Example (normalization): -2, 32, 100, 59, 48 → -0.02, 0.32, 1.00, 0.59, 0.48

    Data Cleaning

    Data Integration

    Data Transformation

    Data Reduction



    Data Cleaning

    Data cleaning tasks

    Fill in missing values

    Identify outliers and smooth out noisy data

    Correct inconsistent data

    Resolve redundancy caused by data integration



    Missing Data

    Data is not always available

    E.g., many tuples have no recorded value for several attributes, such as customer

    income in sales data

    Missing data may be due to

    equipment malfunction

    inconsistent with other recorded data and thus deleted

    data not entered due to misunderstanding

    certain data may not be considered important at the time of entry

history or changes of the data were not registered

    Missing data may need to be inferred.



    How to Handle Missing Data?

    Fill in the missing value manually: tedious + infeasible?

    Fill in it automatically with

a global constant, e.g., "unknown" (which may effectively form a new class!)

    the attribute mean

    the attribute mean for all samples belonging to the same class: smarter

    the most probable value: inference-based such as Bayesian formula or decision tree
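The automatic fill-in strategies above can be sketched in a few lines. The records, attribute names, and values below are invented for illustration; the point is the contrast between the overall attribute mean and the smarter per-class mean:

```python
# Minimal sketch of two automatic fill-in strategies for a missing
# numeric value (None): overall attribute mean vs. per-class mean.
# The records and attribute names are invented for this example.
records = [
    {"income": 40_000, "cls": "low"},
    {"income": 50_000, "cls": "low"},
    {"income": None,   "cls": "low"},     # missing value to fill in
    {"income": 90_000, "cls": "high"},
    {"income": 110_000, "cls": "high"},
]

def attribute_mean(rows, attr):
    # mean over all tuples where the attribute is present
    vals = [r[attr] for r in rows if r[attr] is not None]
    return sum(vals) / len(vals)

def class_mean(rows, attr, cls):
    # mean over tuples of the same class only ("smarter" strategy)
    vals = [r[attr] for r in rows if r[attr] is not None and r["cls"] == cls]
    return sum(vals) / len(vals)

print(attribute_mean(records, "income"))        # 72500.0
print(class_mean(records, "income", "low"))     # 45000.0
```

The per-class mean (45,000) is a far more plausible fill-in for a "low"-class tuple than the global mean (72,500), which is pulled up by the "high"-class tuples.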



    Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to

    faulty data collection instruments

    data entry problems

    data transmission problems

    technology limitation

    inconsistency in naming convention

Other data problems which require data cleaning

    duplicate records

    incomplete data

    inconsistent data



    How to Handle Noisy Data?

    Regression

    smooth by fitting the data into regression functions

    Clustering

    detect and remove outliers

    Binning

first sort the data and partition it into (equal-frequency) bins, then smooth by bin means, bin medians, bin boundaries, etc.

    Combined computer and human inspection

    detect suspicious values and check by human (e.g., deal with possible outliers)



Regression

[Figure: data points (x, y) fitted by the line y = x + 1; an observed value y1 at x1 is replaced by the fitted value on the line.]

Fit data to a function. Linear regression finds the best line to fit two variables. Multiple regression can handle multiple variables. The values given by the function are used instead of the original values.



    Cluster Analysis

Similar values are organized into groups (clusters). Values falling outside of clusters may be considered outliers and may be candidates for elimination.



    Binning

Equal-depth partitioning: divides the range into N intervals, each containing approximately the same number of samples

    Good data scaling

    Managing categorical attributes can be tricky



Binning

Original data for price (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into equal-depth bins:

Bin1: 4, 8, 15
Bin2: 21, 21, 24
Bin3: 25, 28, 34

Smoothing by bin means (each value in a bin is replaced by the mean value of the bin):

Bin1: 9, 9, 9
Bin2: 22, 22, 22
Bin3: 29, 29, 29

Smoothing by bin boundaries (the min and max values in each bin are identified as boundaries; each value in a bin is replaced with the closest boundary value):

Bin1: 4, 4, 15
Bin2: 21, 21, 24
Bin3: 25, 25, 34
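The price example above can be reproduced directly. A minimal sketch of equal-depth binning with smoothing by bin means and by bin boundaries (function names are ours, not from the lecture):

```python
# Equal-depth binning of the sorted price data, then smoothing by
# rounded bin means and by bin boundaries, as on the slide.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted

def equal_depth_bins(values, n_bins):
    # split the sorted values into n_bins bins of equal size
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # replace every value in a bin by the bin's (rounded) mean
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # replace each value by the closer of the bin's min and max
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

bins = equal_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```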


Example

ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

Sorted by Temperature and partitioned into 5 equal-depth bins:

ID  Temperature  Bin
7   58           Bin1
6   65           Bin1
5   68           Bin1
9   69           Bin2
4   70           Bin2
10  71           Bin2
8   72           Bin3
12  73           Bin3
11  75           Bin3
14  75           Bin4
2   80           Bin4
13  81           Bin4
3   83           Bin5
1   85           Bin5


Example

Each value is replaced by the (rounded) mean of its bin:

ID  Temperature  Bin   Smoothed
7   58           Bin1  64
6   65           Bin1  64
5   68           Bin1  64
9   69           Bin2  70
4   70           Bin2  70
10  71           Bin2  70
8   72           Bin3  73
12  73           Bin3  73
11  75           Bin3  73
14  75           Bin4  79
2   80           Bin4  79
13  81           Bin4  79
3   83           Bin5  84
1   85           Bin5  84
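The same procedure, applied to the 14 sorted Temperature values with 5 bins (the last bin gets only two values), reproduces the smoothed column. A sketch, not the original lecture code:

```python
# Equal-depth binning of the sorted Temperature values into 5 bins
# (bin sizes 3, 3, 3, 3, 2), then smoothing by rounded bin means.
temps = [58, 65, 68, 69, 70, 71, 72, 73, 75, 75, 80, 81, 83, 85]

def partition(values, sizes):
    # cut the sorted list into consecutive bins of the given sizes
    bins, start = [], 0
    for s in sizes:
        bins.append(values[start:start + s])
        start += s
    return bins

bins = partition(temps, [3, 3, 3, 3, 2])
smoothed = [round(sum(b) / len(b)) for b in bins]
print(smoothed)  # [64, 70, 73, 79, 84]
```

These are exactly the five replacement values (64, 70, 73, 79, 84) appearing in the smoothed table.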


Smoothing Noisy Data - Example

The final table with the new values for the Temperature attribute:

ID  Outlook   Temperature  Humidity  Windy
1   sunny     84           85        FALSE
2   sunny     79           90        TRUE
3   overcast  84           78        FALSE
4   rain      70           96        FALSE
5   rain      64           80        FALSE
6   rain      64           70        TRUE
7   overcast  64           65        TRUE
8   sunny     73           95        FALSE
9   sunny     70           70        FALSE
10  rain      70           80        FALSE
11  sunny     73           70        TRUE
12  overcast  73           90        TRUE
13  overcast  79           75        FALSE
14  rain      79           80        TRUE


Data Integration

Data integration: combines data from multiple sources into a coherent store

Schema integration: e.g., A.cust-id ≡ B.cust-#

Integrate metadata from different sources

Detecting and resolving data value conflicts

For the same real-world entity, attribute values from different sources may differ

Possible reasons: different representations, different scales, e.g., metric vs. British units

Use an ontology (e.g., WordNet) to identify the same entities across different databases



Handling Redundancy in Data Integration

Redundant data occur often when integrating multiple databases

Object identification: the same attribute or object may have different names in different databases (semantic heterogeneity)

Derivable data: one attribute may be a derived attribute in another table, e.g., annual revenue

Redundant attributes can often be detected by correlation analysis

Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality



Correlation Analysis (Numerical Data)

Correlation coefficient (also called Pearson product-moment correlation coefficient, PMCC):

r_{A,B} = \frac{\sum (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum a_i b_i is the sum of the AB cross-products.

If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.

r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated
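A direct implementation of the coefficient above, using the (n-1) sample statistics from the formula; the two attribute vectors are invented sample data:

```python
from math import sqrt

def pearson(a, b):
    """Pearson product-moment correlation using (n-1) sample statistics."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # sample standard deviations of A and B
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((x - mean_b) ** 2 for x in b) / (n - 1))
    # sum of the cross-products of the deviations
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * sd_a * sd_b)

# perfectly positively correlated attributes -> r close to +1
print(pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
# perfectly negatively correlated attributes -> r close to -1
print(pearson([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))
```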



    Data Transformation

    Smoothing: remove noise from data

    Aggregation: summarization, data cube construction

    Generalization: concept hierarchy climbing

    Normalization: scaled to fall within a small, specified range

    min-max normalization

    z-score normalization

    normalization by decimal scaling

    Attribute/feature construction

    New attributes constructed from the given ones



Data Transformation: Normalization

Min-max normalization: to [new_min_A, new_max_A]

v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A

Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716

Z-score normalization (\mu: mean, \sigma: standard deviation):

v' = \frac{v - \mu_A}{\sigma_A}

Ex. Let \mu = 54,000, \sigma = 16,000. Then \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225

Normalization by decimal scaling:

v' = \frac{v}{10^j}, where j is the smallest integer such that Max(|v'|) < 1
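The three normalizations can be checked numerically. A sketch reproducing the slide's income figures; for decimal scaling, j is chosen as ceil(log10(max|v|)), which matches the earlier example where -2, 32, 100, 59, 48 map to -0.02, 0.32, 1.00, 0.59, 0.48:

```python
import math

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    # min-max normalization to [new_lo, new_hi]
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, std):
    # z-score normalization
    return (v - mean) / std

def decimal_scale(values):
    # divide by the smallest power of ten that brings every |v| to <= 1
    j = math.ceil(math.log10(max(abs(v) for v in values)))
    return [v / 10 ** j for v in values]

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
print(decimal_scale([-2, 32, 100, 59, 48]))
```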



Normalization: Example - z-score normalization

Example: normalizing the Humidity attribute (Mean = 80.3, Stdev = 9.84):

Humidity   z-score
85          0.48
90          0.99
78         -0.23
96          1.60
80         -0.03
70         -1.05
65         -1.55
95          1.49
70         -1.05
80         -0.03
70         -1.05
90          0.99
75         -0.54
80         -0.03



Normalization: Example II - Min-max normalization

Original:

ID  Gender  Age  Salary
1   F       27   19,000
2   M       51   64,000
3   M       52   100,000
4   F       33   55,000
5   M       45   45,000

Normalized to [0, 1]:

ID  Gender  Age   Salary
1   1       0.00  0.00
2   0       0.96  0.56
3   0       1.00  1.00
4   1       0.24  0.44
5   0       0.72  0.32
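The normalized table above can be reproduced with the same min-max formula. A sketch; the two-decimal rounding and the F=1/M=0 gender encoding match the slide:

```python
# Min-max normalization of the Age and Salary columns to [0, 1],
# reproducing the slide's table (Gender encoded as F=1, M=0).
rows = [("F", 27, 19_000), ("M", 51, 64_000), ("M", 52, 100_000),
        ("F", 33, 55_000), ("M", 45, 45_000)]

def min_max_column(values):
    # normalize a whole column to [0, 1] and round to two decimals
    lo, hi = min(values), max(values)
    return [round((v - lo) / (hi - lo), 2) for v in values]

genders = [1 if r[0] == "F" else 0 for r in rows]
ages = min_max_column([r[1] for r in rows])
salaries = min_max_column([r[2] for r in rows])

print(genders)   # [1, 0, 0, 1, 0]
print(ages)      # [0.0, 0.96, 1.0, 0.24, 0.72]
print(salaries)  # [0.0, 0.56, 1.0, 0.44, 0.32]
```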



    Data Reduction Strategies

Why data reduction?

A database/data warehouse may store terabytes of data

Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction

Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results

Data reduction strategies

Data cube aggregation

Dimensionality reduction, e.g., remove unimportant attributes

Data compression

Numerosity reduction, e.g., fit data into models (regression, clustering)

Discretization and concept hierarchy generation



    Discretization

    Three types of attributes:

Nominal: values from an unordered set, e.g., color, profession

Ordinal: values from an ordered set, e.g., military or academic rank

Continuous: real numbers, e.g., integer or real values

    Discretization:

    Divide the range of a continuous attribute into intervals

    Some classification algorithms only accept categorical attributes.

    Reduce data size by discretization

    Prepare for further analysis



    Discretization and Concept Hierarchy

Discretization

Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals

    Interval labels can then be used to replace actual data values

    Supervised vs. unsupervised

    Split (top-down) vs. merge (bottom-up)

    Discretization can be performed recursively on an attribute

    Concept hierarchy formation

Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)



Discretization - Example

Example: discretizing the Humidity attribute using 3 bins:

Low = 60-69, Normal = 70-79, High = 80+

Humidity  Discretized
85        High
90        High
78        Normal
96        High
80        High
70        Normal
65        Low
95        High
70        Normal
80        High
70        Normal
90        High
75        Normal
80        High
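The 3-bin mapping amounts to a simple threshold lookup. A sketch using the cut-points from the slide (Low = 60-69, Normal = 70-79, High = 80+):

```python
# Discretize the Humidity values into three interval labels,
# replacing each actual value by its interval label.
humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]

def discretize(v):
    # interval cut-points taken from the slide
    if v < 70:
        return "Low"
    if v < 80:
        return "Normal"
    return "High"

labels = [discretize(v) for v in humidity]
print(labels)
# ['High', 'High', 'Normal', 'High', 'High', 'Normal', 'Low',
#  'High', 'Normal', 'High', 'Normal', 'High', 'Normal', 'High']
```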



    Concept Hierarchy Generation for Categorical Data

Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts

    street < city < state < country

    Specification of a hierarchy for a set of values by explicit data grouping

    {Urbana, Champaign, Chicago} < Illinois

    Specification of only a partial set of attributes

    E.g., only street < city, not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values

    E.g., for a set of attributes: {street, city, state, country}



    Automatic Concept Hierarchy Generation

Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set

The attribute with the most distinct values is placed at the lowest level of the hierarchy

Exceptions, e.g., weekday, month, quarter, year

country (15 distinct values)
province_or_state (365 distinct values)
city (3,567 distinct values)
street (674,339 distinct values)
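The heuristic above (fewer distinct values ⇒ higher level) amounts to sorting the attributes by their distinct-value counts. A sketch using the counts from the slide:

```python
# Order attributes from top of the hierarchy (fewest distinct values)
# to bottom (most), using the distinct-value counts on the slide.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3_567,
    "street": 674_339,
}

# sort keys by their counts, ascending: top level first
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(hierarchy)))
# street < city < province_or_state < country
```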



Assignments

Spatio-Temporal data mining (…)

Pre-Processing of the IRIS Dataset in Weka (… 10 Dataset)


Questions?