Data Mining - Emory University, CS570 Spring 2008 (lxiong/cs570s08/share/slides/02.pdf; slides dated January 24, 2008)

Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques

— Chapter 2 —

Original Slides: Jiawei Han and Micheline Kamber

Modification: Li Xiong


Chapter 2: Data Preprocessing

Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Why Data Preprocessing?

Data in the real world is dirty:

incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

e.g., occupation = “ ”

noisy: containing errors or outliers

e.g., Salary = “-10”

inconsistent: containing discrepancies in codes or names

e.g., Age = “42” but Birthday = “03/07/1997”; rating was “1, 2, 3”, now “A, B, C”; discrepancies between duplicate records


Why Is Data Dirty?

Incomplete data may come from:

“Not applicable” values at collection time

Different considerations between the time the data was collected and the time it is analyzed

Human, hardware, or software problems

Noisy data (incorrect values) may come from:

Faulty data collection instruments

Human or computer error at data entry

Errors in data transmission

Inconsistent data may come from:

Different data sources

Functional dependency violations (e.g., modifying some linked data)

Duplicate records also need data cleaning


Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility

Broad categories: intrinsic, contextual, representational, and accessibility


Major Tasks in Data Preprocessing

Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration: integration of multiple databases, data cubes, or files

Data transformation: normalization and aggregation

Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results

Data discretization: part of data reduction, of particular importance for numerical data


Forms of Data Preprocessing


Chapter 2: Data Preprocessing

Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Descriptive Data Summarization

Motivation

To better understand the data

Descriptive statistics: describe basic features of data

Graphical description

Tabular description

Summary statistics

Descriptive data summarization

Measuring central tendency: where the data values center

Measuring statistical variability or dispersion: how the data values spread

Graphic display of descriptive data summarization


Measuring the Central Tendency

Mean (sample vs. population):

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$

Weighted arithmetic mean:

$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

Trimmed mean: chopping extreme values before averaging

Median

Middle value if there is an odd number of values, or the average of the middle two values otherwise

Estimated by interpolation (for grouped data):

$\mathrm{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\mathrm{median}}}\right) c$

where $L_1$ is the lower boundary of the median interval, $(\sum f)_l$ the sum of the frequencies of the intervals below it, $f_{\mathrm{median}}$ the frequency of the median interval, and $c$ its width

Mode

Value that occurs most frequently in the data

Unimodal, bimodal, trimodal

Empirical formula: $\mathrm{mean} - \mathrm{mode} = 3 \times (\mathrm{mean} - \mathrm{median})$


Symmetric vs. Skewed Data

Median, mean and mode of symmetric, positively and negatively skewed data

(Figure: mean, median and mode marked on a symmetric, a positively skewed, and a negatively skewed distribution.)


Computational Issues

Different types of measures:

Distributed measure: can be computed by partitioning the data into smaller subsets, e.g., sum, count

Algebraic measure: can be computed by applying an algebraic function to one or more distributed measures, e.g., ?

Holistic measure: must be computed on the entire dataset as a whole, e.g., ?

Selection algorithm: finding the kth smallest number in a list

E.g., min, max, median

Selection by sorting: O(n log n)

Linear algorithms based on quicksort-style partitioning (quickselect): expected O(n)
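The expected-linear-time selection idea can be sketched as a quickselect with random pivots (expected O(n), worst case O(n^2)); k is 0-based here:

```python
# Quickselect: k-th smallest element (k is 0-based) in expected O(n) time.
import random

def quickselect(xs, k):
    xs = list(xs)
    while True:
        pivot = random.choice(xs)
        lows = [x for x in xs if x < pivot]
        pivots = [x for x in xs if x == pivot]
        if k < len(lows):
            xs = lows                       # answer is left of the pivot
        elif k < len(lows) + len(pivots):
            return pivot                    # pivot is the k-th smallest
        else:
            k -= len(lows) + len(pivots)    # answer is right of the pivot
            xs = [x for x in xs if x > pivot]

def median(xs):
    n = len(xs)
    if n % 2:
        return quickselect(xs, n // 2)
    return (quickselect(xs, n // 2 - 1) + quickselect(xs, n // 2)) / 2

print(quickselect([7, 1, 5, 3], 0))  # min: 1
print(median([4, 1, 3, 2]))          # 2.5
```

This is why the median, although holistic (it needs the whole data set), can still be found without a full sort.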


The Long Tail

Long tail: the low-frequency part of a population (e.g., a wealth distribution), proposed as a current and future business and economic model

Previous empirical studies: Amazon, Netflix

Products that are in low demand or have low sales volume can collectively make up a market share that rivals or exceeds the relatively few current bestsellers and blockbusters

The primary value of the internet: providing access to products in the long tail

Business and social implications

mass-market retailers: Amazon, Netflix, eBay

content producers: YouTube

The Long Tail. Chris Anderson, Wired, Oct. 2004

The Long Tail: Why the Future of Business is Selling Less of More. Chris Anderson, 2006


Measuring the Dispersion of Data

Dispersion or variance: the degree to which numerical data tend to spread

Range and Quartiles

Range: difference between the largest and smallest values

Percentile: the value of a variable below which a certain percent of the data fall (algebraic or holistic?)

Quartiles: Q1 (25th percentile), Median (50th percentile), Q3 (75th percentile)

Inter-quartile range: IQR = Q3 – Q1

Five number summary: min, Q1, M, Q3, max (Boxplot)

Outlier: usually, a value at least 1.5 x IQR higher/lower than Q3/Q1

Variance and standard deviation (sample: s, population: σ)

Variance: sample vs. population (algebraic or holistic?)

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$

Standard deviation s (or σ) is the square root of variance s² (or σ²)
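A sketch of the five-number summary, the 1.5 × IQR outlier rule, and the two variances. The data is the price list from the later binning example with a 100 appended so that an outlier appears; note that Python's statistics.quantiles uses the 'exclusive' interpolation method by default, and textbooks differ slightly in how Q1/Q3 are interpolated:

```python
# Five-number summary, the 1.5 x IQR outlier rule, and variance.
from statistics import quantiles, variance, pvariance, stdev

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 100]

q1, med, q3 = quantiles(data, n=4)           # Q1, median, Q3
iqr = q3 - q1
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)    # outlier cutoffs
outliers = [x for x in data if x < fences[0] or x > fences[1]]

five_num = (min(data), q1, med, q3, max(data))  # what a boxplot draws

s2 = variance(data)        # sample variance (divisor n - 1), algebraic
sigma2 = pvariance(data)   # population variance (divisor N)
s = stdev(data)            # sample standard deviation

print(five_num)   # (4, 12.0, 24.0, 28.5, 100)
print(outliers)   # [100]
```

The sample variance is always slightly larger than the population variance on the same data, since it divides by n − 1 instead of N.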


Graphic Displays of Basic Statistical Descriptions

Histogram

Boxplot

Quantile plot

Quantile-quantile (q-q) plot

Scatter plot

Loess (local regression) curve


Histogram Analysis

Graphical display of tabulated frequencies

univariate graphical method (one attribute)

data partitioned into disjoint buckets (typically equal-width)

a set of rectangles reflecting the count or frequency of values in each bucket

bar chart for categorical values


Boxplot Analysis

Visualizes the five-number summary:

The ends of the box are the first and third quartiles (Q1 and Q3), i.e., the height of the box is the IQR

The median (M) is marked by a line within the box

Whiskers: two lines outside the box extend to the minimum and maximum


Example Boxplot: Profit Analysis


Quantile Plot

Displays all of the data for the given attribute

Plots quantile information

Each data point (x_i, f_i) indicates that approximately 100·f_i% of the data are less than or equal to the value x_i


Quantile-Quantile (Q-Q) Plot

Graphs the quantiles of one univariate distribution against the corresponding quantiles of another

Useful for diagnosing differences between two probability distributions


Scatter plot

Displays values for two numerical attributes (bivariate data)

Each pair of values is plotted as a point in the plane

Can suggest various kinds of correlation between the variables, with a certain confidence level: positive (rising), negative (falling), or null (uncorrelated)


Example Scatter Plot – Correlation between Wine Consumption and Heart Mortality

(Figure: scatter plot of wine consumption vs. heart-disease mortality by country, with France and the US labeled.)


Positively and Negatively Correlated Data


Not Correlated Data


Loess Curve

Locally weighted scatter plot smoothing, providing better perception of the pattern of dependence

Fits simple models to localized subsets of the data


Chapter 2: Data Preprocessing

Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Data Cleaning

Importance

“Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)

“Data cleaning is the number one problem in data warehousing” (DCI survey)

Data cleaning tasks

Fill in missing values

Identify outliers and smooth out noisy data

Correct inconsistent data

Resolve redundancy caused by data integration


Missing Data

Data is not always available

E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to

equipment malfunction

inconsistent with other recorded data and thus deleted

data not entered due to misunderstanding

certain data may not be considered important at the time of entry

history or changes of the data not registered

Missing data may need to be inferred.


How to Handle Missing Values?

Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably

Fill in the missing value manually

Fill in the missing value automatically

a global constant: e.g., “unknown” (effectively a new class?!)

the attribute mean

the attribute mean for all samples belonging to the same class (smarter)

the most probable value: inference-based methods such as a Bayesian formula or a decision tree (Chap. 6)
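The three automatic fill-in strategies can be sketched on a hypothetical two-column table (pure Python, with None marking a missing income):

```python
# Filling missing values: global constant, attribute mean, per-class mean.
rows = [
    {"cls": "yes", "income": 40.0},
    {"cls": "yes", "income": None},   # missing
    {"cls": "yes", "income": 60.0},
    {"cls": "no",  "income": 20.0},
    {"cls": "no",  "income": None},   # missing
]

def fill(rows, value_for):
    """Replace each missing income with value_for(row)."""
    return [dict(r, income=r["income"] if r["income"] is not None
                 else value_for(r)) for r in rows]

known = [r["income"] for r in rows if r["income"] is not None]
overall_mean = sum(known) / len(known)                  # (40+60+20)/3 = 40.0

def class_mean(c):
    vals = [r["income"] for r in rows
            if r["cls"] == c and r["income"] is not None]
    return sum(vals) / len(vals)

by_constant = fill(rows, lambda r: "unknown")           # global constant
by_mean = fill(rows, lambda r: overall_mean)            # attribute mean
by_class_mean = fill(rows, lambda r: class_mean(r["cls"]))  # smarter

print(by_class_mean[1]["income"], by_class_mean[4]["income"])  # 50.0 20.0
```

The per-class fill is "smarter" because the missing "yes" row gets 50.0 (the mean of its own class) rather than the global 40.0, which is dragged down by the "no" rows.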


Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to:

faulty data collection instruments

data entry problems

data transmission problems

technology limitations

inconsistency in naming conventions

Other data problems that require data cleaning:

duplicate records

incomplete data

inconsistent data


How to Handle Noisy Data?

Binning and smoothing

sort the data and partition it into bins (equal-frequency or equal-width)

then smooth by bin means, bin medians, or bin boundaries

Regression

smooth by fitting the data to a regression function

Clustering

detect and remove outliers that fall outside clusters

Combined computer and human inspection

detect suspicious values automatically and have a human check them (e.g., possible outliers)


Simple Discretization Methods: Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size: uniform grid

if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N

The most straightforward, but outliers may dominate presentation

Skewed data is not handled well

Equal-depth (frequency) partitioning

Divides the range into N intervals, each containing approximately the same number of samples

Good data scaling

Managing categorical attributes can be tricky


Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34

* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
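The price example above can be reproduced with a short sketch; bin means are rounded to whole dollars as on the slide, and boundary smoothing snaps each value to the nearer of its bin's min and max:

```python
# Equal-frequency binning with the three smoothing rules from the example.
def equal_freq_bins(sorted_values, n_bins):
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value in a bin by the (rounded) bin mean.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value by the nearer of the bin's two boundaries.
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if x - lo <= hi - x else hi for x in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_freq_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```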


Regression

(Figure: data points with the fitted line y = x + 1; the observed value Y1 at X1 is smoothed to the fitted value Y1′.)


Cluster Analysis


Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration

Data transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Data Integration

Data integration: combines data from multiple sources into a unified view

Architectures

Data warehouse (tightly coupled)

Federated database systems (loosely coupled)

Database heterogeneity

Semantic integration

Data Warehouse Approach

(Figure: clients run query & analysis against a central warehouse; an ETL layer, guided by metadata, loads data from multiple sources.)

Advantages and Disadvantages of Data Warehouse

Advantages

High query performance

Can operate when the sources are unavailable

Extra information at the warehouse: modification, summarization (aggregates), historical information

Local processing at the sources is unaffected

Disadvantages

Data freshness

Difficult to construct when only a query interface to the local sources is available

Federated Database Systems

(Figure: clients query a mediator, which forwards sub-queries through wrappers to the underlying sources.)

Advantages and Disadvantages of Federated Database Systems

Advantages

No need to copy and store data at the mediator

More up-to-date data

Only a query interface is needed at the sources

Disadvantages

Query performance

Source availability

Database Heterogeneity

System heterogeneity: use of different operating systems or hardware platforms

Schematic or structural heterogeneity: the native model or structure used to store data differs between sources

Syntactic heterogeneity: differences in the representation format of data

Semantic heterogeneity: differences in the interpretation of the “meaning” of data

Semantic Integration

Problem: reconciling semantic heterogeneity

Levels

Schema matching (schema mapping), e.g., A.cust-id ≡ B.cust-#

Data matching (data deduplication, record linkage, entity/object matching), e.g., Bill Clinton = William Clinton

Challenges

Semantics can be inferred from only a few information sources (data creators, documentation) -> must rely on the schema and the data

Schema and data are unreliable and incomplete

Global pair-wise matching is computationally expensive

In practice, 60-80% of the resources in a data sharing project are spent on reconciling semantic heterogeneity

Schema Matching

Techniques

Rule based

Learning based

Types of matches

1-1 matches vs. complex matches (e.g., list-price = price * (1 + tax_rate))

Information used

Schema information: element names, data types, structures, number of sub-elements, integrity constraints

Data information: value distributions, frequency of words

External evidence: past matches, corpora of schemas

Ontologies, e.g., the Gene Ontology

Multi-matcher architecture

Data Matching Or … ?

Many names for (roughly) the same task: record linkage, data matching, object identification, entity resolution, entity disambiguation, duplicate detection, record matching, instance identification, deduplication, reference reconciliation, database hardening, …


Data Matching

Techniques

Rule based

Probabilistic record linkage (Fellegi and Sunter, 1969)

similarity between pairs of attributes

combined scores representing the probability of a match

threshold-based decision

Machine learning approaches

New challenges

complex information spaces

multiple classes
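A toy sketch in the Fellegi-Sunter spirit: per-attribute similarities (token-set Jaccard here) combined with weights into a score, then a threshold decision. The records, weights, and threshold are all hypothetical; real systems estimate the weights and thresholds from data:

```python
# Threshold-based data matching from combined attribute similarities.
def jaccard(a, b):
    """Token-set similarity between two strings, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def match_score(r1, r2, weights):
    """Weighted sum of per-attribute similarities (weights sum to 1)."""
    return sum(w * jaccard(r1[f], r2[f]) for f, w in weights.items())

r1 = {"name": "William Clinton", "city": "Little Rock"}
r2 = {"name": "Bill Clinton", "city": "Little Rock"}
weights = {"name": 0.7, "city": 0.3}   # hypothetical weights
THRESHOLD = 0.5                        # hypothetical decision threshold

score = match_score(r1, r2, weights)   # 0.7 * (1/3) + 0.3 * 1.0
print(score, score >= THRESHOLD)
```

Exact string equality would reject "Bill Clinton" vs. "William Clinton"; here the shared tokens push the combined score over the threshold, though a real matcher would also use name-synonym dictionaries for exactly this case.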



Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration

Data transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Data Transformation

Smoothing: remove noise from data (data cleaning)

Aggregation: summarization

E.g. Daily sales -> monthly sales

Discretization and generalization

E.g. age -> youth, middle-aged, senior

(Statistical) Normalization: scaled to fall within a small, specified range

E.g. income vs. age

Attribute construction: construct new attributes from given ones

E.g. birthday -> age


Data Aggregation

Data cubes store multidimensional aggregated information

Multiple levels of aggregation for analysis at multiple granularities

More on data warehouses and cube computation (chap. 3, 4)


Normalization

Min-max normalization: [minA, maxA] to [new_minA, new_maxA]

Ex. Let income [$12,000, $98,000] normalized to [0.0, 1.0]. Then $73,000 is mapped to

Z-score normalization (μ: mean, σ: standard deviation):

Ex. Let μ = 54,000, σ = 16,000. Then

Normalization by decimal scaling

716.00)00.1(000,12000,98000,12600,73

=+−−−

AAA

AA

A minnewminnewmaxnewminmax

minvv _)__(' +−−

−=

A

Avvσμ−

='

j

vv10

'= Where j is the smallest integer such that Max(|ν’|) < 1

225.1000,16

000,54600,73=


Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Data Reduction

Why data reduction?

A database/data warehouse may store terabytes of data

Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Data reduction strategies

Dimensionality reduction

Feature selection: attribute subset selection

Feature extraction: mapping the data to a smaller number of features

Instance reduction


Feature Selection

Select a set of attributes (features) such that the resulting probability distribution of the data is as close as possible to the original distribution given all features

Benefits

Remove irrelevant or redundant attributes

Reduce the number of attributes in the discovered patterns

Heuristic methods (how many subsets are there to choose from?):

Step-wise forward selection

Step-wise backward elimination

Combining forward selection and backward elimination

Decision-tree induction (Chap. 6, Classification)
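Step-wise forward selection can be sketched as a greedy loop around a pluggable scoring function. The additive toy score below is hypothetical; in practice the score would be, e.g., cross-validated accuracy of a classifier trained on the subset:

```python
# Greedy step-wise forward selection.
def forward_select(features, score):
    selected, remaining = [], list(features)
    best = score(selected)
    while remaining:
        # Try adding each remaining feature; keep the best one.
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        new_score = score(selected + [candidate])
        if new_score <= best:
            break                      # no remaining feature helps
        selected.append(candidate)
        remaining.remove(candidate)
        best = new_score
    return selected

# Toy additive score: A2 is irrelevant and is never selected.
gains = {"A1": 0.3, "A2": 0.0, "A4": 0.5, "A6": 0.2}
score = lambda subset: sum(gains[f] for f in subset)

print(forward_select(["A1", "A2", "A4", "A6"], score))  # ['A4', 'A1', 'A6']
```

The greedy loop evaluates O(d^2) subsets instead of all 2^d, which is the point of the "# of choices?" question above.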


Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

(Figure: a decision tree that splits on A4 at the root, then on A1 and A6, with leaves labeled Class 1 and Class 2.)

> Reduced attribute set: {A1, A4, A6}


Feature Extraction

Create new features (attributes) by combining or mapping existing ones

Methods

Principal Component Analysis

Data compression methods: Discrete Wavelet Transform

Regression analysis


Principal Component Analysis (PCA)

Principal component analysis: find the dimensions that capture the most variance

A linear mapping of the data to a new coordinate system such that the greatest variance lies on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on

Steps

Normalize the input data, so that each attribute falls within the same range

Compute k orthonormal (unit) vectors, i.e., principal components; each input data vector is a linear combination of the k principal component vectors

The principal components are sorted in order of decreasing “significance”

Weak components, i.e., those with low variance, can be eliminated

(Figure: data plotted on axes X1 and X2 with the principal directions Y1 and Y2 overlaid; an illustration of principal component analysis.)
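The steps above can be sketched via SVD of the centered data matrix (this assumes NumPy is available; the sample points are hypothetical, chosen to lie near the line X2 = X1 so that the first component captures almost all of the variance):

```python
# PCA: orthonormal directions of decreasing variance, via SVD.
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                  # center each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                      # k orthonormal unit vectors
    var = (S ** 2) / (len(X) - 1)            # variance along each direction
    return components, Xc @ components.T, var[:k]

X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]])
comps, projected, var = pca(X, 2)
print(var[0] / var.sum())   # close to 1: the first component dominates
```

Dropping the second (weak) component here would halve the dimensionality while losing only a small fraction of the variance, which is exactly the elimination step described above.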
