PreProcessing_DMM


  • 7/30/2019 PreProcessing_DMM

    1/35

    Machine Learning


    [email protected]

    Jalali.mshdiau.ac.ir



    Data Preprocessing

    Why preprocess the data?

    Data cleaning

Data integration and transformation

Data reduction

    Discretization and concept hierarchy generation

    Summary



    Why Data Preprocessing?

    Data in the real world is dirty

incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

e.g., occupation = "" (the value was never recorded)

    noisy: containing errors or outliers

    e.g., Salary=-10

    inconsistent: containing discrepancies in codes or names

e.g., Age = 42 but Birthday = 03/07/1997

    e.g., Was rating 1,2,3, now rating A, B, C

    e.g., discrepancy between duplicate records



    Why Is Data Dirty?

Incomplete data may come from

"Not applicable" data values when collected

Different considerations between the time when the data was collected and when it is analyzed

    Human/hardware/software problems

    Noisy data (incorrect values) may come from

    Faulty data collection instruments

    Human or computer error at data entry

    Errors in data transmission

    Inconsistent data may come from

    Different data sources

    Functional dependency violation (e.g., modify some linked data)

    Duplicate records also need data cleaning



    Multi-Dimensional Measure of Data Quality

    A well-accepted multidimensional view:

    Accuracy

    Completeness

    Consistency

    Timeliness

    Believability

    Value added

    Interpretability

    Accessibility



    Major Tasks in Data Preprocessing

Data cleaning

Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration

Integration of multiple databases, data cubes, or files

Data transformation

Normalization and aggregation

Data reduction

Obtains a reduced representation in volume that produces the same or similar analytical results

    Data discretization

    Part of data reduction but with particular importance, especially for numerical data



    Data Preprocessing

Example (normalization): -2, 32, 100, 59, 48 → -0.02, 0.32, 1.00, 0.59, 0.48

    Data Cleaning

    Data Integration

    Data Transformation

    Data Reduction



    Data Cleaning

    Data cleaning tasks

    Fill in missing values

    Identify outliers and smooth out noisy data

    Correct inconsistent data

    Resolve redundancy caused by data integration



    Missing Data

    Data is not always available

    E.g., many tuples have no recorded value for several attributes, such as customer

    income in sales data

    Missing data may be due to

    equipment malfunction

    inconsistent with other recorded data and thus deleted

    data not entered due to misunderstanding

    certain data may not be considered important at the time of entry

history or changes of the data were not registered

    Missing data may need to be inferred.



    How to Handle Missing Data?

    Fill in the missing value manually: tedious + infeasible?

    Fill in it automatically with

a global constant, e.g., "unknown" (which may effectively form a new class!)

    the attribute mean

    the attribute mean for all samples belonging to the same class: smarter

    the most probable value: inference-based such as Bayesian formula or decision tree
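The automatic fill-in strategies above can be sketched in a few lines. The records, attribute names, and values below are invented for illustration; the point is the contrast between the overall attribute mean and the smarter per-class mean:

```python
# Minimal sketch of two automatic fill-in strategies for a missing
# numeric value (None): overall attribute mean vs. per-class mean.
# The records and attribute names are invented for this example.
records = [
    {"income": 40_000, "cls": "low"},
    {"income": 50_000, "cls": "low"},
    {"income": None,   "cls": "low"},     # missing value to fill in
    {"income": 90_000, "cls": "high"},
    {"income": 110_000, "cls": "high"},
]

def attribute_mean(rows, attr):
    # mean over all tuples where the attribute is present
    vals = [r[attr] for r in rows if r[attr] is not None]
    return sum(vals) / len(vals)

def class_mean(rows, attr, cls):
    # mean over tuples of the same class only ("smarter" strategy)
    vals = [r[attr] for r in rows if r[attr] is not None and r["cls"] == cls]
    return sum(vals) / len(vals)

print(attribute_mean(records, "income"))        # 72500.0
print(class_mean(records, "income", "low"))     # 45000.0
```

The per-class mean (45,000) is a far more plausible fill-in for a "low"-class tuple than the global mean (72,500), which is pulled up by the "high"-class tuples.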



    Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to

    faulty data collection instruments

    data entry problems

    data transmission problems

    technology limitation

    inconsistency in naming convention

Other data problems which require data cleaning

    duplicate records

    incomplete data

    inconsistent data



    How to Handle Noisy Data?

    Regression

    smooth by fitting the data into regression functions

    Clustering

    detect and remove outliers

    Binning

first sort the data and partition it into (equal-frequency) bins, then smooth by bin means, bin medians, bin boundaries, etc.

    Combined computer and human inspection

    detect suspicious values and check by human (e.g., deal with possible outliers)



Regression

[Figure: data points (x, y) fitted by the line y = x + 1; an observed value y1 at x1 is replaced by the fitted value on the line.]

Fit data to a function. Linear regression finds the best line to fit two variables. Multiple regression can handle multiple variables. The values given by the function are used instead of the original values.



    Cluster Analysis

Similar values are organized into groups (clusters). Values falling outside of clusters may be considered outliers and may be candidates for elimination.



    Binning

Equal-depth partitioning: divides the range into N intervals, each containing approximately the same number of samples

    Good data scaling

    Managing categorical attributes can be tricky



Binning

Original data for price (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into equal-depth bins:

Bin1: 4, 8, 15
Bin2: 21, 21, 24
Bin3: 25, 28, 34

Smoothing by bin means (each value in a bin is replaced by the mean value of the bin):

Bin1: 9, 9, 9
Bin2: 22, 22, 22
Bin3: 29, 29, 29

Smoothing by bin boundaries (the min and max values in each bin are identified as boundaries; each value in a bin is replaced with the closest boundary value):

Bin1: 4, 4, 15
Bin2: 21, 21, 24
Bin3: 25, 25, 34
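The price example above can be reproduced directly. A minimal sketch of equal-depth binning with smoothing by bin means and by bin boundaries (function names are ours, not from the lecture):

```python
# Equal-depth binning of the sorted price data, then smoothing by
# rounded bin means and by bin boundaries, as on the slide.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted

def equal_depth_bins(values, n_bins):
    # split the sorted values into n_bins bins of equal size
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # replace every value in a bin by the bin's (rounded) mean
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # replace each value by the closer of the bin's min and max
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

bins = equal_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```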


Example

ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

Sorted by Temperature and partitioned into 5 equal-depth bins:

ID  Temperature  Bin
7   58           Bin1
6   65           Bin1
5   68           Bin1
9   69           Bin2
4   70           Bin2
10  71           Bin2
8   72           Bin3
12  73           Bin3
11  75           Bin3
14  75           Bin4
2   80           Bin4
13  81           Bin4
3   83           Bin5
1   85           Bin5


Example

Each value is replaced by the (rounded) mean of its bin:

ID  Temperature  Bin   Smoothed
7   58           Bin1  64
6   65           Bin1  64
5   68           Bin1  64
9   69           Bin2  70
4   70           Bin2  70
10  71           Bin2  70
8   72           Bin3  73
12  73           Bin3  73
11  75           Bin3  73
14  75           Bin4  79
2   80           Bin4  79
13  81           Bin4  79
3   83           Bin5  84
1   85           Bin5  84
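The same procedure, applied to the 14 sorted Temperature values with 5 bins (the last bin gets only two values), reproduces the smoothed column. A sketch, not the original lecture code:

```python
# Equal-depth binning of the sorted Temperature values into 5 bins
# (bin sizes 3, 3, 3, 3, 2), then smoothing by rounded bin means.
temps = [58, 65, 68, 69, 70, 71, 72, 73, 75, 75, 80, 81, 83, 85]

def partition(values, sizes):
    # cut the sorted list into consecutive bins of the given sizes
    bins, start = [], 0
    for s in sizes:
        bins.append(values[start:start + s])
        start += s
    return bins

bins = partition(temps, [3, 3, 3, 3, 2])
smoothed = [round(sum(b) / len(b)) for b in bins]
print(smoothed)  # [64, 70, 73, 79, 84]
```

These are exactly the five replacement values (64, 70, 73, 79, 84) appearing in the smoothed table.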


Smoothing Noisy Data - Example

The final table with the new values for the Temperature attribute:

ID  Outlook   Temperature  Humidity  Windy
1   sunny     84           85        FALSE
2   sunny     79           90        TRUE
3   overcast  84           78        FALSE
4   rain      70           96        FALSE
5   rain      64           80        FALSE
6   rain      64           70        TRUE
7   overcast  64           65        TRUE
8   sunny     73           95        FALSE
9   sunny     70           70        FALSE
10  rain      70           80        FALSE
11  sunny     73           70        TRUE
12  overcast  73           90        TRUE
13  overcast  79           75        FALSE
14  rain      79           80        TRUE


Data Integration

Data integration: combines data from multiple sources into a coherent store

Schema integration: e.g., A.cust-id ≡ B.cust-#

Integrate metadata from different sources

Detecting and resolving data value conflicts

For the same real-world entity, attribute values from different sources may differ

Possible reasons: different representations, different scales, e.g., metric vs. British units

Use an ontology (e.g., WordNet) to identify the same entities across different databases



Handling Redundancy in Data Integration

Redundant data occur often when integrating multiple databases

Object identification: the same attribute or object may have different names in different databases (semantic heterogeneity)

Derivable data: one attribute may be a derived attribute in another table, e.g., annual revenue

Redundant attributes can often be detected by correlation analysis

Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality



Correlation Analysis (Numerical Data)

Correlation coefficient (also called Pearson product-moment correlation coefficient, PMCC):

r_{A,B} = \frac{\sum (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum a_i b_i is the sum of the AB cross-products.

If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.

r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated
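A direct implementation of the coefficient above, using the (n-1) sample statistics from the formula; the two attribute vectors are invented sample data:

```python
from math import sqrt

def pearson(a, b):
    """Pearson product-moment correlation using (n-1) sample statistics."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # sample standard deviations of A and B
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((x - mean_b) ** 2 for x in b) / (n - 1))
    # sum of the cross-products of the deviations
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * sd_a * sd_b)

# perfectly positively correlated attributes -> r close to +1
print(pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
# perfectly negatively correlated attributes -> r close to -1
print(pearson([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))
```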



    Data Transformation

    Smoothing: remove noise from data

    Aggregation: summarization, data cube construction

    Generalization: concept hierarchy climbing

    Normalization: scaled to fall within a small, specified range

    min-max normalization

    z-score normalization

    normalization by decimal scaling

    Attribute/feature construction

    New attributes constructed from the given ones



Data Transformation: Normalization

Min-max normalization: to [new_min_A, new_max_A]

v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A

Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716

Z-score normalization (\mu: mean, \sigma: standard deviation):

v' = \frac{v - \mu_A}{\sigma_A}

Ex. Let \mu = 54,000, \sigma = 16,000. Then \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225

Normalization by decimal scaling:

v' = \frac{v}{10^j}, where j is the smallest integer such that Max(|v'|) < 1
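The three normalizations can be checked numerically. A sketch reproducing the slide's income figures; for decimal scaling, j is chosen as ceil(log10(max|v|)), which matches the earlier example where -2, 32, 100, 59, 48 map to -0.02, 0.32, 1.00, 0.59, 0.48:

```python
import math

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    # min-max normalization to [new_lo, new_hi]
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, std):
    # z-score normalization
    return (v - mean) / std

def decimal_scale(values):
    # divide by the smallest power of ten that brings every |v| to <= 1
    j = math.ceil(math.log10(max(abs(v) for v in values)))
    return [v / 10 ** j for v in values]

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
print(decimal_scale([-2, 32, 100, 59, 48]))
```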



Normalization: Example - z-score normalization

Example: normalizing the Humidity attribute (Mean = 80.3, Stdev = 9.84):

Humidity   z-score
85          0.48
90          0.99
78         -0.23
96          1.60
80         -0.03
70         -1.05
65         -1.55
95          1.49
70         -1.05
80         -0.03
70         -1.05
90          0.99
75         -0.54
80         -0.03



Normalization: Example II - Min-max normalization

Original:

ID  Gender  Age  Salary
1   F       27   19,000
2   M       51   64,000
3   M       52   100,000
4   F       33   55,000
5   M       45   45,000

Normalized to [0, 1]:

ID  Gender  Age   Salary
1   1       0.00  0.00
2   0       0.96  0.56
3   0       1.00  1.00
4   1       0.24  0.44
5   0       0.72  0.32
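The normalized table above can be reproduced with the same min-max formula. A sketch; the two-decimal rounding and the F=1/M=0 gender encoding match the slide:

```python
# Min-max normalization of the Age and Salary columns to [0, 1],
# reproducing the slide's table (Gender encoded as F=1, M=0).
rows = [("F", 27, 19_000), ("M", 51, 64_000), ("M", 52, 100_000),
        ("F", 33, 55_000), ("M", 45, 45_000)]

def min_max_column(values):
    # normalize a whole column to [0, 1] and round to two decimals
    lo, hi = min(values), max(values)
    return [round((v - lo) / (hi - lo), 2) for v in values]

genders = [1 if r[0] == "F" else 0 for r in rows]
ages = min_max_column([r[1] for r in rows])
salaries = min_max_column([r[2] for r in rows])

print(genders)   # [1, 0, 0, 1, 0]
print(ages)      # [0.0, 0.96, 1.0, 0.24, 0.72]
print(salaries)  # [0.0, 0.56, 1.0, 0.44, 0.32]
```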



    Data Reduction Strategies

Why data reduction?

A database/data warehouse may store terabytes of data

Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction

Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results

Data reduction strategies

Data cube aggregation

Dimensionality reduction, e.g., remove unimportant attributes

Data compression

Numerosity reduction, e.g., fit data into models (regression, clustering)

Discretization and concept hierarchy generation



    Discretization

    Three types of attributes:

Nominal: values from an unordered set, e.g., color, profession

Ordinal: values from an ordered set, e.g., military or academic rank

Continuous: real numbers, e.g., integer or real values

    Discretization:

    Divide the range of a continuous attribute into intervals

    Some classification algorithms only accept categorical attributes.

    Reduce data size by discretization

    Prepare for further analysis



    Discretization and Concept Hierarchy

Discretization

Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals

    Interval labels can then be used to replace actual data values

    Supervised vs. unsupervised

    Split (top-down) vs. merge (bottom-up)

    Discretization can be performed recursively on an attribute

    Concept hierarchy formation

Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)



Discretization - Example

Example: discretizing the Humidity attribute using 3 bins:

Low = 60-69, Normal = 70-79, High = 80+

Humidity  Discretized
85        High
90        High
78        Normal
96        High
80        High
70        Normal
65        Low
95        High
70        Normal
80        High
70        Normal
90        High
75        Normal
80        High
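The 3-bin mapping amounts to a simple threshold lookup. A sketch using the cut-points from the slide (Low = 60-69, Normal = 70-79, High = 80+):

```python
# Discretize the Humidity values into three interval labels,
# replacing each actual value by its interval label.
humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]

def discretize(v):
    # interval cut-points taken from the slide
    if v < 70:
        return "Low"
    if v < 80:
        return "Normal"
    return "High"

labels = [discretize(v) for v in humidity]
print(labels)
# ['High', 'High', 'Normal', 'High', 'High', 'Normal', 'Low',
#  'High', 'Normal', 'High', 'Normal', 'High', 'Normal', 'High']
```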



    Concept Hierarchy Generation for Categorical Data

Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts

    street < city < state < country

    Specification of a hierarchy for a set of values by explicit data grouping

    {Urbana, Champaign, Chicago} < Illinois

    Specification of only a partial set of attributes

    E.g., only street < city, not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values

    E.g., for a set of attributes: {street, city, state, country}



    Automatic Concept Hierarchy Generation

Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set

The attribute with the most distinct values is placed at the lowest level of the hierarchy

Exceptions, e.g., weekday, month, quarter, year

country (15 distinct values)
province_or_state (365 distinct values)
city (3,567 distinct values)
street (674,339 distinct values)
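The heuristic above (fewer distinct values ⇒ higher level) amounts to sorting the attributes by their distinct-value counts. A sketch using the counts from the slide:

```python
# Order attributes from top of the hierarchy (fewest distinct values)
# to bottom (most), using the distinct-value counts on the slide.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3_567,
    "street": 674_339,
}

# sort keys by their counts, ascending: top level first
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(hierarchy)))
# street < city < province_or_state < country
```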



Assignments

Spatio-Temporal data mining (…)

Pre-Processing of the IRIS Dataset in Weka (… 10 Dataset)


Questions?