Introduction to Data Mining
Ch. 2 Data Preprocessing

Heon Gyu Lee ([email protected])
http://dblab.chungbuk.ac.kr/~hglee
DB/Bioinfo. Lab., http://dblab.chungbuk.ac.kr
Chungbuk National University

Data Preprocessing



Page 2: Data Preprocessing

Why Data Preprocessing?

Data in the real world is dirty:
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation=“ ”
– noisy: containing errors or outliers
  e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or names
  e.g., Age=“42”, Birthday=“03/07/1997”
  e.g., was rating “1, 2, 3”, now rating “A, B, C”
  e.g., discrepancy between duplicate records

Page 3: Data Preprocessing

What is Data?

A collection of data objects and their attributes.

An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– An attribute is also known as a variable, field, characteristic, or feature

A collection of attributes describes an object
– An object is also known as a record, point, case, sample, entity, or instance

Tid   Refund   Marital Status   Taxable Income   Cheat
 1    Yes      Single           125K             No
 2    No       Married          100K             No
 3    No       Single           70K              No
 4    Yes      Married          120K             No
 5    No       Divorced         95K              Yes
 6    No       Married          60K              No
 7    Yes      Divorced         220K             No
 8    No       Single           85K              Yes
 9    No       Married          75K              No
10    No       Single           90K              Yes

(Each row of the table is an object; the columns are its attributes.)

Page 4: Data Preprocessing

Types of Attributes

There are different types of attributes:
– Nominal
  Examples: ID numbers, eye color, zip codes
– Ordinal
  Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
– Interval
  Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio
  Examples: temperature in Kelvin, length, time, counts

Page 5: Data Preprocessing

Discrete and Continuous Attributes

Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes

Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Practically, real values can only be measured and represented using a finite number of digits
– Continuous attributes are typically represented as floating-point variables

Page 6: Data Preprocessing

Data Quality

What kinds of data quality problems are there? How can we detect problems with the data? What can we do about these problems?

Examples of data quality problems:
– noise and outliers
– missing values
– duplicate data

Page 7: Data Preprocessing

Noise

Noise refers to modification of original values
– Examples: distortion of a person’s voice on a poor phone connection, or “snow” on a television screen

[Figure: two sine waves, and the same two sine waves with noise added]

Page 8: Data Preprocessing

Outliers

Outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set.

Page 9: Data Preprocessing

Missing Values

Reasons for missing values
– Information is not collected
  (e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
  (e.g., annual income is not applicable to children)

Handling missing values
– Eliminate data objects
– Estimate missing values
– Ignore the missing value during analysis
– Replace with all possible values (weighted by their probabilities)

Page 10: Data Preprocessing

Duplicate Data

A data set may include data objects that are duplicates, or almost duplicates, of one another
– A major issue when merging data from heterogeneous sources

Examples:
– The same person with multiple email addresses

Data cleaning
– The process of dealing with duplicate-data issues

Page 11: Data Preprocessing

Major Tasks in Data Preprocessing

Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration
– Integration of multiple databases, data cubes, or files

Data transformation
– Normalization and aggregation

Data reduction
– Obtains a representation reduced in volume that produces the same or similar analytical results

Data discretization
– Part of data reduction, but of particular importance, especially for numerical data

Page 12: Data Preprocessing

Forms of Data Preprocessing

[Figure: the forms of data preprocessing: cleaning, integration, transformation, and reduction]

Page 13: Data Preprocessing

Data Cleaning

Importance
– “Data cleaning is one of the three biggest problems in data warehousing” —Ralph Kimball
– “Data cleaning is the number one problem in data warehousing” —DCI survey

Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration

Page 14: Data Preprocessing

Data Cleaning: How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably.

Fill in the missing value manually.

Fill it in automatically with
– a global constant: e.g., “unknown” (effectively a new class?)
– the attribute mean
– the attribute mean for all samples belonging to the same class (smarter)
– the most probable value: inference-based, such as a Bayesian formula or regression
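As an illustration of the automatic strategies above, a minimal sketch of attribute-mean imputation in pure Python (the function name and data are hypothetical):

```python
# Sketch: fill missing entries (None) with the mean of the observed values,
# i.e. the "attribute mean" strategy from the slide.
def fill_with_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

salaries = [50, None, 70, 80, None]
print(fill_with_mean(salaries))  # the two gaps both become the mean of 50, 70, 80
```

The class-conditional variant would simply group the values by class label before computing each mean.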

Page 15: Data Preprocessing

Data Cleaning: How to Handle Noisy Data?

Binning
– first sort the data and partition it into (equal-frequency) bins
– then smooth by bin means, bin medians, bin boundaries, etc.

Regression
– smooth by fitting the data to regression functions

Clustering
– detect and remove outliers

Combined computer and human inspection
– detect suspicious values and have a human check them (e.g., deal with possible outliers)

Page 16: Data Preprocessing

Data Cleaning: Binning Methods

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
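The equal-frequency partitioning and smoothing by bin means above can be sketched in a few lines of Python (function names are illustrative):

```python
# Reproduce the slide's example: equal-frequency bins, then smoothing
# by bin means (means rounded to the nearest integer, as on the slide).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equal_frequency_bins(sorted_data, n_bins):
    """Split sorted data into n_bins bins of equal size."""
    size = len(sorted_data) // n_bins
    return [sorted_data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))  # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
```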

Page 17: Data Preprocessing

Data Cleaning: Regression

[Figure: data points in the (x, y) plane with a fitted line y = x + 1; a noisy point (X1, Y1) is smoothed to the point (X1, Y1’) on the line]

Page 18: Data Preprocessing

Data Cleaning: Cluster Analysis

Page 19: Data Preprocessing

Data Integration

Data integration:
– Combines data from multiple sources into a coherent store

Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources

Entity identification problem:
– Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton

Detecting and resolving data value conflicts
– For the same real-world entity, attribute values from different sources are different
– Possible reasons: different representations, different scales

Page 20: Data Preprocessing

Data Integration: Handling Redundancy in Data Integration

Redundant data occur often when multiple databases are integrated
– Object identification: the same attribute or object may have different names in different databases
– Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue

Redundant attributes may be detected by correlation analysis

Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality

Page 21: Data Preprocessing

Data Integration: Correlation Analysis (Numerical Data)

Correlation coefficient (also called Pearson’s product-moment coefficient):

    r_{A,B} = Σ(a_i − Ā)(b_i − B̄) / ((n−1) σ_A σ_B) = (Σ(a_i b_i) − n Ā B̄) / ((n−1) σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(a_i b_i) is the sum of the AB cross-products.

If r_{A,B} > 0, A and B are positively correlated (A’s values increase as B’s do); the higher the value, the stronger the correlation.

r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated.
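A pure-Python sketch of this coefficient (the (n−1) factors in the numerator and denominator cancel, so they are omitted; the data shown is a toy example):

```python
# Pearson's product-moment correlation coefficient, computed directly
# from the definition: covariance divided by the product of the
# standard deviations.
from math import sqrt

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = (sqrt(sum((x - mean_a) ** 2 for x in a))
           * sqrt(sum((y - mean_b) ** 2 for y in b)))
    return num / den

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ 1.0: perfectly positively correlated
```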

Page 22: Data Preprocessing

Data Integration: Correlation Analysis (Categorical Data)

Χ² (chi-square) test:

    χ² = Σ (Observed − Expected)² / Expected

The larger the χ² value, the more likely the variables are related.

The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.

Correlation does not imply causality
– The number of hospitals and the number of car thefts in a city are correlated
– Both are causally linked to a third variable: population

Page 23: Data Preprocessing

Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

Χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

    χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

This shows that like_science_fiction and play_chess are correlated in the group.
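The calculation above can be reproduced in a short Python sketch, deriving the expected counts from the row and column sums:

```python
# Chi-square statistic for the slide's contingency table.
# Expected count for cell (i, j) = row_sum[i] * col_sum[j] / total.
observed = [[250, 200],    # like science fiction: [plays chess, does not]
            [50, 1000]]    # does not like science fiction

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
total = sum(row_sums)

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_sums[i] * col_sums[j] / total
        chi2 += (observed[i][j] - expected) ** 2 / expected

print(round(chi2, 2))  # 507.94, matching the slide's 507.93 up to rounding
```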

Page 24: Data Preprocessing

Data Transformation

Smoothing: remove noise from the data

Aggregation: summarization, data cube construction

Generalization: concept hierarchy climbing

Normalization: scale values to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling

Attribute/feature construction
– New attributes constructed from the given ones

Page 25: Data Preprocessing

Data Transformation: Normalization

Min-max normalization: to [new_min_A, new_max_A]

    v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

– Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716

Z-score normalization (μ: mean, σ: standard deviation):

    v' = (v − μ_A) / σ_A

– Ex. Let μ = 54,000 and σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

Normalization by decimal scaling:

    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
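The three normalizations can be sketched as small Python functions and checked against the income example above:

```python
# Min-max, z-score, and decimal-scaling normalization, as defined above.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(v, j):
    # j is the smallest integer such that max(|v'|) < 1 over the attribute
    return v / 10 ** j

print(round(min_max(73600, 12000, 98000), 3))  # 0.716
print(round(z_score(73600, 54000, 16000), 3))  # 1.225
print(decimal_scaling(986, 3))                 # 0.986 (values up to 986 need j = 3)
```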

Page 26: Data Preprocessing

Data Reduction Strategies

Why data reduction?
– A database/data warehouse may store terabytes of data
– Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction
– Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results

Data reduction strategies
– Aggregation
– Sampling
– Dimensionality reduction
– Feature subset selection
– Feature creation
– Discretization and binarization
– Attribute transformation

Page 27: Data Preprocessing

Data Reduction: Aggregation

Combining two or more attributes (or objects) into a single attribute (or object)

Purpose
– Data reduction
  Reduce the number of attributes or objects
– Change of scale
  Cities aggregated into regions, states, countries, etc.
– More “stable” data
  Aggregated data tends to have less variability

Page 28: Data Preprocessing

Data Reduction: Aggregation

[Figure: variation of precipitation in Australia: standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation]

Page 29: Data Preprocessing

Data Reduction: Sampling

Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data and the final data analysis.

Statisticians sample because obtaining the entire set of data of interest is too expensive or time-consuming.

Sampling is used in data mining because processing the entire set of data of interest is too expensive or time-consuming.

Page 30: Data Preprocessing

Data Reduction: Types of Sampling

Simple random sampling
– There is an equal probability of selecting any particular item

Sampling without replacement
– As each item is selected, it is removed from the population

Sampling with replacement
– Objects are not removed from the population as they are selected for the sample
– In sampling with replacement, the same object can be picked more than once
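Both variants are available in Python's standard library, which makes the distinction easy to demonstrate:

```python
# Simple random sampling with the standard library:
# random.sample draws WITHOUT replacement (no item can appear twice),
# random.choices draws WITH replacement (repeats are possible).
import random

population = list(range(100))

without_repl = random.sample(population, 10)   # 10 distinct items
with_repl = random.choices(population, k=10)   # up to 10 items, repeats allowed

print(sorted(without_repl))
print(sorted(with_repl))
```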

Page 31: Data Preprocessing

Data Reduction: Dimensionality Reduction

Purpose:
– Avoid the curse of dimensionality
– Reduce the amount of time and memory required by data mining algorithms
– Allow data to be more easily visualized
– May help eliminate irrelevant features or reduce noise

Techniques
– Principal Component Analysis (PCA)
– Singular Value Decomposition (SVD)
– Others: supervised and non-linear techniques

Page 32: Data Preprocessing

Dimensionality Reduction: PCA

The goal is to find a projection that captures the largest amount of variation in the data.

[Figure: 2-D points in the (x1, x2) plane with the principal direction e shown]

Page 33: Data Preprocessing

Dimensionality Reduction: PCA

Find the eigenvectors of the covariance matrix; the eigenvectors define the new space.

[Figure: 2-D points in the (x1, x2) plane with the eigenvector e shown]
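The two steps above (covariance matrix, then its eigenvectors) can be sketched with NumPy on hypothetical 2-D data in which the second attribute is roughly twice the first:

```python
# Minimal PCA sketch: compute the covariance matrix of the data, take
# the eigenvector with the largest eigenvalue as the principal direction,
# and project the data onto it.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 2 * x1 + rng.normal(scale=0.1, size=500)   # nearly on the line x2 = 2*x1
data = np.column_stack([x1, x2])

cov = np.cov(data, rowvar=False)         # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

pc = eigvecs[:, -1]                      # direction of largest variance
projected = data @ pc                    # 1-D representation of the data

print(pc)  # up to sign, roughly (1, 2)/sqrt(5), i.e. the direction of the line
```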

Page 34: Data Preprocessing

Data Reduction: Feature Subset Selection

Another way to reduce the dimensionality of data.

Redundant features
– duplicate much or all of the information contained in one or more other attributes
– Example: the purchase price of a product and the amount of sales tax paid

Irrelevant features
– contain no information that is useful for the data mining task at hand
– Example: students’ IDs are often irrelevant to the task of predicting students’ GPAs

Page 35: Data Preprocessing

Data Reduction: Feature Subset Selection

Techniques:
– Brute-force approach:
  Try all possible feature subsets as input to the data mining algorithm
– Filter approaches:
  Features are selected before the data mining algorithm is run
– Wrapper approaches:
  Use the data mining algorithm as a black box to find the best subset of attributes

Page 36: Data Preprocessing

Data Reduction: Feature Creation

Create new attributes that can capture the important information in a data set much more efficiently than the original attributes.

Three general methodologies:
– Feature extraction
  domain-specific
– Mapping data to a new space
– Feature construction
  combining features

Page 37: Data Preprocessing

Data Reduction: Mapping Data to a New Space

Fourier transform; wavelet transform

[Figure: two sine waves plus noise in the time domain, and the corresponding frequency-domain representation]

Page 38: Data Preprocessing

Data Reduction: Discretization Using Class Labels

Entropy-based approach

[Figure: discretization into 3 categories for both x and y vs. 5 categories for both x and y]

Page 39: Data Preprocessing

Data Reduction: Discretization Without Using Class Labels

[Figure: the original data discretized by equal interval width, equal frequency, and K-means]
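The first two unsupervised schemes can be sketched in pure Python (K-means binning is omitted for brevity; function names are illustrative):

```python
# Equal-interval-width vs. equal-frequency discretization.
# Each function assigns every value a bin label in 0..k-1.
def equal_width_labels(data, k):
    """Split the value range [min, max] into k intervals of equal width."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    return [min(int((x - lo) / width), k - 1) for x in data]

def equal_frequency_labels(data, k):
    """Assign labels so each bin holds (roughly) the same number of values."""
    order = sorted(range(len(data)), key=lambda i: data[i])
    size = len(data) // k
    labels = [0] * len(data)
    for rank, i in enumerate(order):
        labels[i] = min(rank // size, k - 1)
    return labels

data = [1, 2, 3, 4, 5, 6, 100]
print(equal_width_labels(data, 3))      # [0, 0, 0, 0, 0, 0, 2]: the outlier dominates the widths
print(equal_frequency_labels(data, 3))  # [0, 0, 1, 1, 2, 2, 2]: balanced bin sizes
```

The contrast shows why the two methods give very different bins on skewed data, as in the figure.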

Page 40: Data Preprocessing

Data Reduction: Attribute Transformation

A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values
– Simple functions: x^k, log(x), e^x, |x|
– Standardization and normalization
