DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques


Page 1: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

DATA MINING & DATA

WAREHOUSING

UNIT-I

Data Mining: Concepts and Techniques

Page 2: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

What Is Data Mining?

Data mining is the process of sorting through large amounts of data and picking out relevant information.

In other words:

Data mining (knowledge discovery from data)
- Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

Other names
- Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Page 3: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Some Definitions

Data: any facts, numbers, or text that can be processed by a computer.
- Operational or transactional data, such as sales, cost, inventory, payroll, and accounting
- Nonoperational data, such as industry sales, forecast data, and macroeconomic data
- Metadata: data about the data itself, such as logical database design or data dictionary definitions

Information: The patterns, associations, or relationships among all this data can provide information.

Page 4: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Definitions Continued..

Knowledge: Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in terms of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

Data Warehouses: Data warehousing is defined as a process of centralized data management and retrieval.

Data Mining: Concepts and Techniques

Page 5: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Warehouse example

Data Mining: Concepts and Techniques

Page 6: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Rich, Information Poor

Data Mining: Concepts and Techniques

Page 7: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Mining process

Data Mining: Concepts and Techniques

Page 8: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Knowledge discovery from data

KDD process includes

data cleaning (to remove noise and inconsistent data)

data integration (where multiple data sources may be combined)

data selection (where data relevant to the analysis task are retrieved from the database)

data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations)

Data Mining: Concepts and Techniques

Page 9: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

KDD continued…

data mining (an essential process where intelligent methods are applied in order to extract data patterns)

pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)

knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)

Data mining is the core of the knowledge discovery process.

Page 10: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Knowledge Discovery (KDD) Process

Data mining: the core of the knowledge discovery process

[Figure: KDD pipeline. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Page 11: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines]

Page 12: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Mining Functions

All data mining functions can be thought of as attempting to find a model to fit the data.
- Each function needs criteria for preferring one model over another.
- Each function needs a technique to compare the data.

Two types of model:
- Predictive models: predict unknown values based on known data
- Descriptive models: identify patterns in data


Page 14: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Functionalities/Techniques:

- Concept/class description: characterization and discrimination
- Mining frequent patterns, associations, and correlations
- Classification and prediction
- Cluster analysis
- Outlier analysis
- Evolution analysis

Page 15: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Concept/Class Description: Characterization and Discrimination

Data characterization: a summary of the general characteristics or features of a target class of data. For instance, a data mining system should be able to produce a description summarizing the characteristics of customers.

Example: The characteristics of customers who spend more than $1000 a year at a store called AllElectronics. The result can be a general profile covering attributes such as age, employment status, or credit rating.

Page 16: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Characterization and Discrimination (continued)

Data discrimination: a comparison of the general features of the target class data objects with the general features of objects from one or a set of contrasting classes. The user can specify the target and contrasting classes.

Example: The user may like to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by about 30% in the same period.

Page 17: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Mining Frequent Patterns, Associations and correlations

Frequent patterns: as the name suggests, patterns that occur frequently in data.

Association analysis: from a marketing perspective, determining which items are frequently purchased together within the same transaction.

Example: a rule mined from the AllElectronics transactional database:

buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]

- X represents a customer.
- Confidence = 50% means that if a customer buys a computer, there is a 50% chance that he/she will buy software as well.
- Support = 1% means that 1% of all the transactions under analysis showed that computer and software were purchased together.
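The support and confidence figures in the rule above can be computed directly from a transaction list. The following is a minimal sketch; the transactions and the helper names `support` and `confidence` are hypothetical and chosen only for illustration, not taken from the AllElectronics database.

```python
# Hypothetical toy transactions; each set holds the items bought together.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer", "paper"},
    {"computer", "software", "printer"},
    {"paper"},
    {"computer", "paper"},
    {"software"},
    {"printer"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

On this toy data, 2 of 8 transactions contain both items and 4 contain a computer, so the rule has 25% support and 50% confidence.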

Page 18: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Mining Frequent Patterns, Associations and correlations

Another example:

age(X, “20…29”) ∧ income(X, “20K…29K”) ⇒ buys(X, “CD player”) [support = 2%, confidence = 60%]

Customers between 20 and 29 years of age with an income of $20,000 to $29,000 have a 60% chance of purchasing a CD player, and 2% of all the transactions under analysis showed customers in this age group and income range buying a CD player.

Page 19: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Classification and Prediction

Classification is the process of finding a model that describes and distinguishes data classes or concepts for the purpose of being able to use the model to predict the class of objects whose class label is unknown.

A classification model can be represented in various forms, such as
- IF-THEN rules
- A decision tree
- A neural network

Page 20: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Classification Model

Data Mining: Concepts and Techniques

Page 21: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Cluster Analysis

Clustering analyses data objects without consulting a known class label.

Example: Cluster analysis can be performed on AllElectronics customer data in order to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing. The figure on next slide shows a 2-D plot of customers with respect to customer locations in a city.

Data Mining: Concepts and Techniques

Page 22: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Cluster Analysis

Data Mining: Concepts and Techniques

Page 23: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Outlier Analysis

Outlier analysis: a database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.

Example: detecting fraudulent usage of credit cards. Outlier analysis may uncover fraudulent usage by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account. Outlier values may also be detected with respect to the location and type of purchase or the purchase frequency.

Page 24: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Evolution Analysis

Evolution analysis: data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.

Example: time-series data. Suppose stock market data (a time series) for the last several years is available from the New York Stock Exchange, and one would like to invest in shares of high-tech industrial companies.

A data mining study of stock exchange data may identify stock evolution regularities for overall stocks and for the stocks of particular companies. Such regularities may help predict future trends in stock market prices, contributing to one’s decision making regarding stock investments.


Page 26: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Chapter 2: Data Preprocessing


Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary

Page 27: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Why Data Preprocessing?

Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation=“ ”
- noisy: containing errors or outliers
  e.g., Salary=“-10”
- inconsistent: containing discrepancies in codes or names
  e.g., Age=“42” Birthday=“03/07/1997”
  e.g., Was rating “1,2,3”, now rating “A, B, C”
  e.g., discrepancy between duplicate records

Page 28: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Why Is Data Dirty?

Incomplete data may come from
- “Not applicable” data value when collected
- Different considerations between the time when the data was collected and when it is analyzed
- Human/hardware/software problems

Noisy data (incorrect values) may come from
- Faulty data collection instruments
- Human or computer error at data entry
- Errors in data transmission

Inconsistent data may come from
- Different data sources
- Functional dependency violation (e.g., modify some linked data)

Duplicate records also need data cleaning.

Page 29: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Why Is Data Preprocessing Important?

No quality data, no quality mining results!
- Quality decisions must be based on quality data, e.g., duplicate or missing data may cause incorrect or even misleading statistics.
- A data warehouse needs consistent integration of quality data.

Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse.

Page 30: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility

Broad categories: intrinsic, contextual, representational, and accessibility

Page 31: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Major Tasks in Data Preprocessing

Data cleaning
- Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration
- Integration of multiple databases, data cubes, or files

Data transformation
- Normalization and aggregation

Data reduction
- Obtains a reduced representation in volume but produces the same or similar analytical results

Data discretization
- Part of data reduction but with particular importance, especially for numerical data

Page 32: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Forms of Data Preprocessing

Data Mining: Concepts and Techniques


Page 34: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Mining Data Descriptive Characteristics

Motivation
- To better understand the data: central tendency, variation, and spread

Data dispersion characteristics
- median, max, min, quantiles, outliers, variance, etc.

Numerical dimensions correspond to sorted intervals
- Data dispersion: analyzed with multiple granularities of precision
- Boxplot or quantile analysis on sorted intervals

Dispersion analysis on computed measures
- Folding measures into numerical dimensions
- Boxplot or quantile analysis on the transformed cube

Page 35: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Measuring the Central Tendency

Mean (algebraic measure), sample vs. population:

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]

Weighted arithmetic mean:

\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]

Trimmed mean: chopping extreme values

Median: a holistic measure
- Middle value if there is an odd number of values, or the average of the middle two values otherwise
- The median of the entire data set (e.g., the median salary) can be approximated by interpolation when the data are grouped into intervals

Mode
- Value that occurs most frequently in the data
- Unimodal, bimodal, trimodal
- Empirical formula:

\[ \text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median}) \]
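The measures of central tendency above can be tried out with Python's standard `statistics` module. This is a minimal sketch: the `salaries` values are hypothetical, and `weighted_mean` and `trimmed_mean` are illustrative helpers written for this example rather than library functions.

```python
import statistics

# Hypothetical salaries in thousands of dollars, with one extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each sorted end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return statistics.mean(ordered[k:len(ordered) - k])

mean = statistics.mean(salaries)        # arithmetic mean
median = statistics.median(salaries)    # average of the two middle values
modes = statistics.multimode(salaries)  # every most-frequent value
trimmed = trimmed_mean(salaries, 0.1)   # drops the lowest and highest value

print(mean, median, modes, trimmed)
```

Here the data set is bimodal (52 and 70), and trimming the single extreme salary pulls the mean from 58 down toward the median of 54.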

Page 36: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Holistic measure: a measure that must be computed on the entire data set as a whole.

Page 37: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Symmetric vs. Skewed Data

Median, mean, and mode of symmetric, positively skewed, and negatively skewed data:
- Symmetric: the mean, median, and mode are all at the same center value.
- Positively skewed: the mode occurs at a value that is smaller than the median.
- Negatively skewed: the mode occurs at a value greater than the median.

Page 38: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Measuring the Dispersion of Data

Quartiles, outliers, and boxplots
- Quartiles: Q1 (25th percentile), Q3 (75th percentile)
- Inter-quartile range: IQR = Q3 − Q1
- Five-number summary: min, Q1, M, Q3, max
- Boxplot: the ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually
- Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1

Variance and standard deviation (sample: s, population: σ)
- Variance (algebraic, scalable computation):

\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 \]

- Standard deviation s (or σ) is the square root of the variance s² (or σ²)
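As a rough illustration of these dispersion measures, the sketch below computes the quartiles, IQR, outlier fences, and both variances with the standard `statistics` module. The data values are hypothetical. Quartile conventions differ between tools, so the `method="inclusive"` choice is an assumption, and other software may report slightly different Q1 and Q3.

```python
import statistics

# Hypothetical sample (e.g., salaries in thousands of dollars).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# "inclusive" interpolates between observed values; other conventions exist.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# The usual 1.5 x IQR rule: flag values that fall beyond the fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

five_number_summary = (min(data), q1, median, q3, max(data))

sample_variance = statistics.variance(data)       # divides by n - 1
population_variance = statistics.pvariance(data)  # divides by N
sample_std = statistics.stdev(data)               # sqrt of sample variance

print(five_number_summary, iqr, outliers)
```

With these values the fences fall at 26 and 88, so only the salary of 110 is flagged as an outlier.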

Page 39: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Properties of Normal Distribution Curve

The normal (distribution) curve (μ: mean, σ: standard deviation)
- From μ−σ to μ+σ: contains about 68% of the measurements
- From μ−2σ to μ+2σ: contains about 95% of the measurements
- From μ−3σ to μ+3σ: contains about 99.7% of the measurements
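The 68/95/99.7 figures can be checked numerically. For a normal variable, the probability of falling within k standard deviations of the mean equals erf(k/√2); the sketch below evaluates this with the standard `math` module. The helper name `within_k_sigma` is made up for this example.

```python
import math

def within_k_sigma(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

coverage = {k: within_k_sigma(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} sigma: {p:.1%}")
```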

Page 40: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Boxplot Analysis

Five-number summary of a distribution: [Minimum, Q1, M, Q3, Maximum]

Boxplot
- Data is represented with a box.
- The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR.
- The median is marked by a line within the box.
- Whiskers: two lines outside the box extend to the Minimum and Maximum.

Page 41: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Question-1

Data Mining: Concepts and Techniques

Page 42: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Answer-

Data Mining: Concepts and Techniques

Page 43: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Question-2

Data Mining: Concepts and Techniques

Page 44: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Histogram Analysis

Graph displays of basic statistical class descriptions

Frequency histograms
- A univariate graphical method.
- Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data.

Page 45: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Quantile Plot

- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences).
- Plots quantile information: for data values xi sorted in increasing order, fi indicates that approximately 100 fi% of the data are below or equal to the value xi.
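A quantile plot needs one (fi, xi) pair per observation. The slide does not say how fi is computed, so the sketch below assumes the common convention fi = (i − 0.5)/N for the i-th smallest of N values. The function name `quantile_plot_points` is hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

points = quantile_plot_points([40, 10, 30, 20])
for f, x in points:
    print(f"{f:.3f} -> {x}")
```

Plotting x against f for these points gives the quantile plot. Under this convention, the smallest of four values sits at f = 0.125 and the largest at f = 0.875.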

Page 46: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Quantile-Quantile (Q-Q) Plot

Data Mining: Concepts and Techniques

Graphs the quantiles of one univariate distribution against the corresponding quantiles of another.

Allows the user to view whether there is a shift in going from one distribution to another.

Page 47: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Scatter plot

- Provides a first look at bivariate (two-variable) data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as a point in the plane.

Page 48: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Loess Curve (local regression)

Data Mining: Concepts and Techniques

Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence.

Page 49: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques

Positively and Negatively Correlated Data

Page 50: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Not Correlated Data

Data Mining: Concepts and Techniques

Page 51: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Question-3

Data Mining: Concepts and Techniques

Page 52: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Answer-

Data Mining: Concepts and Techniques

Page 53: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Graphic Displays of Basic Statistical Descriptions

- Histogram: (shown before)
- Boxplot: (covered before)
- Quantile plot: each value xi is paired with fi, indicating that approximately 100 fi% of the data are ≤ xi
- Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Scatter plot: each pair of values is a pair of coordinates and is plotted as a point in the plane
- Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence


Page 56: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Cleaning

Importance
- “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
- “Data cleaning is the number one problem in data warehousing” (DCI survey)

Data cleaning tasks
- Fill in missing values.
- Identify outliers and smooth out noisy data.
- Correct inconsistent data.
- Resolve redundancy caused by data integration.

Page 57: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Missing Data

Data is not always available
- E.g., many tuples have no recorded value for several attributes, such as customer income in sales data.

Missing data may be due to
- equipment malfunction
- data that was inconsistent with other recorded data and thus deleted
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
- failure to register history or changes of the data

Missing data may need to be inferred.

Page 58: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

How to Handle Missing Data?

- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
- Fill in the missing value manually: tedious and often infeasible.
- Fill it in automatically with
  - a global constant, e.g., “unknown” (which may end up acting as a new class)
  - the attribute mean
  - the attribute mean for all samples belonging to the same class (smarter)
  - the most probable value: inference-based, such as a Bayesian formula or decision tree
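Two of the automatic strategies listed above, the attribute mean and the class-conditional attribute mean, can be sketched as follows. The rows, class labels, and function names are hypothetical, and the second helper assumes every class has at least one observed value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tuples: (class label, income); None marks a missing value.
rows = [
    ("low_risk", 50), ("low_risk", None), ("low_risk", 70),
    ("high_risk", 20), ("high_risk", 30), ("high_risk", None),
]

def fill_with_attribute_mean(rows):
    """Replace each missing value with the mean of all observed values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Replace each missing value with the mean of its own class.

    Assumes every class has at least one observed value.
    """
    observed = defaultdict(list)
    for c, v in rows:
        if v is not None:
            observed[c].append(v)
    class_mean = {c: mean(vs) for c, vs in observed.items()}
    return [(c, class_mean[c] if v is None else v) for c, v in rows]

by_attribute = fill_with_attribute_mean(rows)
by_class = fill_with_class_mean(rows)
print(by_attribute[1], by_class[1], by_class[5])
```

The overall mean fills both gaps with 42.5, while the class-conditional version fills them with 60 and 25, which keeps the two classes apart.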

Page 59: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Noisy Data

Noise: random error or variance in a measured variable.

Incorrect attribute values may be due to
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions

Other data problems that require data cleaning
- duplicate records
- incomplete data
- inconsistent data

Page 60: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

How to Handle Noisy Data?

Binning
- First sort the data and partition it into (equal-frequency) bins.
- Then smooth by bin means, by bin medians, by bin boundaries, etc.

Regression
- Smooth by fitting the data to regression functions.

Clustering
- Detect and remove outliers.

Combined computer and human inspection
- Detect suspicious values and have a human check them (e.g., deal with possible outliers).
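Smoothing by binning can be sketched in a few lines. The example below sorts a hypothetical list of prices into three equal-frequency bins, then smooths by bin means and by bin boundaries. The function names are made up for this illustration, the input length is assumed to be divisible by the number of bins, and ties in the boundary rule go to the lower boundary.

```python
def equal_frequency_bins(values, n_bins):
    """Sort and split into n_bins bins of equal size (length must divide evenly)."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if x - lo <= hi - x else hi for x in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical prices in dollars
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

For these prices the bins are [4, 8, 15], [21, 21, 24], and [25, 28, 34]; smoothing by means gives 9, 22, and 29 for the three bins.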

Page 61: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Simple Discretization Methods: Binning

Equal-width (distance) partitioning
- Divides the range into N intervals of equal size (uniform grid).
- If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N.
- The most straightforward, but outliers may dominate the presentation.
- Skewed data is not handled well.

Equal-depth (frequency) partitioning
- Divides the range into N intervals, each containing approximately the same number of samples.
- Good data scaling.
- Managing categorical attributes can be tricky.
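The remark that outliers may dominate equal-width partitioning is easy to see on a small example. The sketch below uses hypothetical values and helper names, applies W = (B − A)/N, and counts how many values land in each interval. It assumes the values are not all identical, so that W is nonzero.

```python
def equal_width_edges(values, n_bins):
    """Interval edges A, A + W, ..., B with W = (B - A) / N."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins
    return [a + i * width for i in range(n_bins + 1)]

def equal_width_counts(values, n_bins):
    """Number of values that fall into each equal-width interval."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins  # assumes b > a
    counts = [0] * n_bins
    for x in values:
        # The maximum value is placed in the last interval.
        idx = min(int((x - a) / width), n_bins - 1)
        counts[idx] += 1
    return counts

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # hypothetical data, one outlier
edges = equal_width_edges(values, 3)
counts = equal_width_counts(values, 3)
print(edges, counts)
```

The single value 100 stretches the range so far that eight of the nine values share the first interval and the middle interval is empty. Equal-depth partitioning of the same data would instead put three values in each bin.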

Page 62: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques

Binning Methods for Data Smoothing

Page 63: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Question-4

Data Mining: Concepts and Techniques

Page 64: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Answer-

Data Mining: Concepts and Techniques

Page 65: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Question-

Data Mining: Concepts and Techniques

Page 66: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Answer-

Data Mining: Concepts and Techniques

Page 67: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Regression

Data can be smoothed by fitting the data to a function, such as with regression.

Linear regression- involves finding the “best” line to fit two attributes (or variables), so that one attribute can be used to predict the other.

Multiple linear regression- is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
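For the two-attribute case, the “best” line is commonly taken to be the least-squares line. The sketch below fits it in closed form and uses the fitted values as the smoothed data. The points and the function name `fit_line` are hypothetical; least squares is an assumption here, since the slide does not name the fitting criterion.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy observations lying close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

slope, intercept = fit_line(xs, ys)
smoothed = [slope * x + intercept for x in xs]  # values on the fitted line
print(f"y = {slope:.2f} x + {intercept:.2f}")
```

Replacing each observed y with the corresponding value in `smoothed` is the smoothing step; the same fitted line can also be used to predict y for a new x.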

Data Mining: Concepts and Techniques

Page 68: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Regression

[Figure: scatter of data on axes x and y with the fitted line y = x + 1; the point (X1, Y1) is smoothed to Y1’ on the line]

Page 69: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Cluster Analysis

Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.” Values that fall outside the set of clusters may be considered outliers.

Page 70: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Cleaning as a Process

Data discrepancy detection
- Use metadata (e.g., domain, range, dependency, distribution).
- Check field overloading.
- Check the uniqueness rule, consecutive rule, and null rule.
- Use commercial tools:
  - Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections.
  - Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers).

Data migration and integration
- Data migration tools: allow transformations to be specified.
- ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface.


Page 72: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Integration

Data integration
- Combines data from multiple sources into a coherent store.

Schema integration, e.g., A.cust-id ≡ B.cust-#
- Integrate metadata from different sources.

Entity identification problem
- Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton.

Detecting and resolving data value conflicts
- For the same real-world entity, attribute values from different sources are different.
- Possible reasons: different representations, different scales, e.g., metric vs. British units.

Page 73: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Handling Redundancy in Data Integration

Redundant data occur often when multiple databases are integrated.
- Object identification: the same attribute or object may have different names in different databases.
- Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue.

Redundant attributes may be detectable by correlation analysis.

Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.

Page 74: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Correlation Analysis (Numerical Data)

Correlation coefficient (also called Pearson’s product moment coefficient):

\[ r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B} \]

where n is the number of tuples, \(\bar{A}\) and \(\bar{B}\) are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-product.

- If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
- rA,B = 0: no linear correlation (this alone does not establish independence)
- rA,B < 0: negatively correlated
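The cross-product form of the coefficient translates almost line by line into code. In this sketch the function name `pearson_r` and the sample attribute values are hypothetical, and σA and σB are taken to be sample standard deviations (dividing by n − 1), which is what makes the (n − 1) factor in the denominator consistent.

```python
import math

def pearson_r(a, b):
    """Pearson correlation via (sum(AB) - n*mean_A*mean_B) / ((n-1)*sd_A*sd_B)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (n - 1 in the denominator).
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum(x * y for x, y in zip(a, b))
    return (cross - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)

# Hypothetical attribute values.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly increasing together
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # one falls as the other rises
print(r_up, r_down)
```

A value near +1 or −1 for two attributes suggests that one of them may be redundant after integration.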

Page 75: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Correlation Analysis (Categorical Data)

χ² (chi-square) test:

\[ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} \]

- The larger the χ² value, the more likely the variables are related.
- The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
- Correlation does not imply causality.
  - The number of hospitals and the number of car thefts in a city are correlated.
  - Both are causally linked to a third variable: population.


Page 77: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Chi-Square Calculation: An Example

                            Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)     200 (360)        450
Not like science fiction    50 (210)     1000 (840)       1050
Sum (col.)                  300          1200             1500

χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

\[ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93 \]

It shows that like_science_fiction and play_chess are correlated in the group.
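The expected counts and the χ² statistic for this contingency table can be reproduced with a short script. The observed counts are the ones from the table; the function name `chi_square` is hypothetical. Each expected count is computed as row total × column total / grand total.

```python
def chi_square(observed):
    """Return (chi-square statistic, expected counts) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, obs in enumerate(row):
            e = row_totals[i] * col_totals[j] / total
            expected_row.append(e)
            chi2 += (obs - e) ** 2 / e
        expected.append(expected_row)
    return chi2, expected

# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
chi2, expected = chi_square(observed)
print(expected)
print(round(chi2, 2))
```

The script gives expected counts of 90, 360, 210, and 840 and a statistic of about 507.94, which the slide truncates to 507.93.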


Page 79: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Transformation

- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones

Page 80: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Transformation: Normalization

Min-max normalization to [new_min_A, new_max_A]:

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$

Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$

Normalization by decimal scaling:

$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that max(|v'|) < 1.
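The three normalization methods can be sketched in a few lines. The function names are illustrative, and the decimal-scaling helper assumes j is a non-negative integer; the values -986 and 917 in the usage lines are made-up inputs, not from the slide.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer giving max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(round(min_max(73600, 12000, 98000), 3))   # income example from the slide
print(round(z_score(73600, 54000, 16000), 3))   # z-score example from the slide
print(decimal_scaling([-986, 917]))             # hypothetical values
```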

Page 81: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Chapter 2: Data Preprocessing


Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary

Page 82: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Reduction Strategies

Why data reduction?

A database/data warehouse may store terabytes of data.

Complex data analysis/mining may take a very long time to run on the complete data set.

Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.

Data reduction strategies:

Data cube aggregation

Dimensionality reduction, e.g., remove unimportant attributes

Data compression

Clustering

Numerosity reduction, e.g., fit data into models

Discretization and concept hierarchy generation

Page 83: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Cube Aggregation

The lowest level of a data cube (base cuboid): the aggregated data for an individual entity of interest, e.g., a customer in a phone-calling data warehouse.

Multiple levels of aggregation in data cubes: further reduce the size of the data to deal with.

Reference appropriate levels: use the smallest representation that is enough to solve the task.

Queries regarding aggregated information should be answered using the data cube, when possible.

Page 84: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Base cuboid

The cube created at the lowest level of abstraction is referred to as the base cuboid. The base cuboid should correspond to an individual entity of interest, such as sales or customer.


Page 85: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Apex Cuboid

A cube at the highest level of abstraction is the apex cuboid. For the sales data of Figure, the apex cuboid would give one total—the total sales for all three years, for all item types, and for all branches.
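The relationship between the base cuboid and the apex cuboid can be sketched as a roll-up over dimensions. The sales figures, dimension names, and the `aggregate` helper below are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical base-cuboid cells: (year, item_type, branch) -> sales.
base_cuboid = {
    (2008, "TV", "A"): 100,
    (2008, "phone", "A"): 50,
    (2009, "TV", "B"): 70,
    (2010, "phone", "B"): 30,
}


def aggregate(cells, keep):
    """Roll up the cells, keeping only the dimension positions listed in `keep`."""
    result = defaultdict(int)
    for key, value in cells.items():
        result[tuple(key[i] for i in keep)] += value
    return dict(result)


by_year = aggregate(base_cuboid, keep=[0])  # a one-dimensional cuboid: sales per year
apex = aggregate(base_cuboid, keep=[])      # apex cuboid: a single grand total
print(by_year)
print(apex)
```

Each roll-up drops dimensions and therefore stores fewer cells, which is why answering a query from the smallest sufficient cuboid reduces the data to be scanned.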


Page 86: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Attribute Subset Selection

“How can we find a ‘good’ subset of the original attributes?”

For n attributes, there are 2^n possible subsets.

Feature selection (i.e., attribute subset selection): select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features.

Mining on a reduced set of attributes reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand.

Heuristic methods (due to the exponential # of choices):

Step-wise forward selection

Step-wise backward elimination

Combining forward selection and backward elimination

Decision-tree induction
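Step-wise forward selection can be sketched as a greedy loop. The `score` function is an assumption of this sketch: the slides do not specify an evaluation measure, so the caller supplies one (higher is better), and the toy score below is invented for illustration.

```python
def forward_selection(attributes, score, min_gain=1e-9):
    """Greedy step-wise forward selection driven by a caller-supplied score."""
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        # Try adding each remaining attribute and keep the best one.
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        new_score = score(selected + [candidate])
        if new_score - best < min_gain:
            break  # no remaining attribute improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best = new_score
    return selected


# Toy score: A1, A4 and A6 are useful; every attribute costs a small penalty.
useful = {"A1": 0.5, "A4": 0.3, "A6": 0.2}


def toy_score(subset):
    return sum(useful.get(a, 0.0) for a in subset) - 0.01 * len(subset)


print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score))
```

Backward elimination follows the same pattern in reverse: start from the full set and repeatedly drop the attribute whose removal hurts the score least.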

Page 87: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Basic heuristic methods of attribute subset selection include step-wise forward selection, step-wise backward elimination, a combination of the two, and decision-tree induction.


Page 89: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Compression

String compression: there are extensive theories and well-tuned algorithms; typically lossless, but only limited manipulation is possible without expansion.

Audio/video compression: typically lossy compression, with progressive refinement; sometimes small fragments of the signal can be reconstructed without reconstructing the whole.

Time sequences are not audio: they are typically short and vary slowly with time.

Page 90: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Compression

[Figure: lossless compression maps the original data to compressed data and back to the original data; lossy compression recovers only an approximation of the original data]

Page 91: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Dimensionality Reduction: Wavelet Transformation

Discrete wavelet transform (DWT): linear signal processing, multi-resolution analysis

Compressed approximation: store only a small fraction of the strongest wavelet coefficients

Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space

Method:

The length, L, must be an integer power of 2 (pad with 0's when necessary)

Each transform has 2 functions: smoothing and difference

The functions are applied to pairs of data, resulting in two sets of data of length L/2

The two functions are applied recursively until the desired length is reached

[Figure: Haar-2 and Daubechies-4 wavelets]
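The method can be sketched with the Haar wavelet, using the pairwise average as the smoothing function and half the pairwise difference as the difference function. This is one common unnormalized convention, chosen here as an assumption; the sketch always recurses down to length 1 rather than stopping at a chosen length, and the input vector is illustrative.

```python
def haar_dwt(data):
    """Full Haar decomposition using pairwise averages and half-differences."""
    values = [float(v) for v in data]
    # Pad with zeros up to the next power of two.
    length = 1
    while length < len(values):
        length *= 2
    values += [0.0] * (length - len(values))

    details = []
    while len(values) > 1:
        pairs = list(zip(values[0::2], values[1::2]))
        smooth = [(a + b) / 2 for a, b in pairs]  # smoothing function
        diff = [(a - b) / 2 for a, b in pairs]    # difference function
        details = diff + details                  # coarser details go in front
        values = smooth                           # recurse on the smoothed half
    return values + details  # [overall average, coarse details, ..., fine details]


print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
```

A compressed approximation keeps only the coefficients with the largest magnitudes and treats the rest as 0.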

Page 92: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

DWT for Image Compression

[Figure: an image decomposed by three successive stages of low-pass and high-pass filtering]

Page 93: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Dimensionality Reduction: Principal Component Analysis (PCA)

Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data

Steps:

Normalize the input data so that each attribute falls within the same range

Compute k orthonormal (unit) vectors, i.e., the principal components

Each input data vector is a linear combination of the k principal component vectors

The principal components are sorted in order of decreasing “significance” or strength

Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)

Works for numeric data only

Used when the number of dimensions is large
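As a rough illustration, the strongest principal component can be found by power iteration on the covariance matrix. This is a minimal pure-Python sketch, not a full PCA: it only centers the data (no range normalization), finds a single component, and assumes the starting vector is not orthogonal to that component. All names are illustrative.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit vector of largest variance, attribute means)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    vec = [1.0 / (d + 1) for d in range(dims)]  # arbitrary starting direction
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec, means


def project(rows, component, means):
    """Coordinate of each centered row along the component (a 1-D representation)."""
    return [sum((r[d] - means[d]) * component[d] for d in range(len(component)))
            for r in rows]


points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
pc, mu = first_principal_component(points)
print(pc, project(points, pc, mu))
```

Because these points lie on the line x2 = x1, one coordinate along the first component is enough to reconstruct them; the second component carries no variance and can be dropped.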

Page 94: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

[Figure: Principal Component Analysis, showing the original axes X1 and X2 and the principal component axes Y1 and Y2]

Page 95: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Numerosity Reduction

Reduce data volume by choosing alternative, smaller forms of data representation. The techniques may be parametric or nonparametric.

Parametric methods: a model is used to estimate the data, so that typically only the model parameters need to be stored instead of the actual data (outliers may also be stored).

Example: log-linear models, which estimate discrete multidimensional probability distributions

Non-parametric methods (for storing reduced representations of the data): do not assume models.

Major families: histograms, clustering, sampling

Page 96: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Reduction Method (1): Regression and Log-Linear Models

Linear regression: data are modeled to fit a straight line; often uses the least-squares method to fit the line

Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector

Log-linear model: approximates discrete multidimensional probability distributions

Page 97: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Regression Analysis and Log-Linear Models

Linear regression: Y = w X + b

A random variable, Y (called a response variable), is modeled as a linear function of another random variable, X (called a predictor variable).

Two regression coefficients, w and b, specify the line and are estimated from the data at hand by applying the least-squares criterion to the known values Y1, Y2, … and X1, X2, ….

Multiple regression: Y = b0 + b1 X1 + b2 X2

An extension of (simple) linear regression that allows a response variable, Y, to be modeled as a linear function of two or more predictor variables.
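For simple linear regression, the least-squares estimates of w and b have a closed form. The sketch below uses the standard formulas; the function name `fit_line` and the sample points are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w, b) for the line Y = w X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx             # slope
    b = mean_y - w * mean_x   # intercept
    return w, b


# Illustrative points lying exactly on Y = 2 X + 1.
w, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(w, b)
```

For data reduction, only w and b (plus any outliers) need to be stored in place of the original points.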

Page 98: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Log-linear models: approximate discrete multidimensional probability distributions.

Given a set of tuples in n dimensions (e.g., described by n attributes), we can consider each tuple as a point in an n-dimensional space.

Log-linear models can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations.

The multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g.

$$p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$$

Regression and log-linear models can both be used on sparse data, although their application may be limited. While both methods can handle skewed data, regression does exceptionally well.

Page 99: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Reduction Method (2): Histograms

A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or buckets.

If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.

Often, buckets instead represent continuous ranges for the given attribute.


Page 102: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Reduction Method (2): Histograms

Divide data into buckets and store the average (sum) for each bucket.

Partitioning rules:

Equal-width: each bucket covers an equal range of values.

Equal-frequency (or equal-depth): each bucket holds roughly the same number of values.

V-optimal: the histogram with the least variance, where histogram variance is a weighted sum of the original values that each bucket represents.

MaxDiff: place a bucket boundary between each pair of adjacent values for the pairs having the β−1 largest differences, where β is the number of buckets.

[Figure: histogram with x-axis ticks from 10,000 to 90,000 and y-axis ticks from 0 to 40]
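The equal-width and equal-frequency rules can be sketched as follows. The function names and sample values are illustrative; the sketch assumes the values are not all equal and that there are at least as many values as buckets.

```python
def equal_width_buckets(values, k):
    """Split [min, max] into k ranges of equal width; return (low, high, count) per bucket."""
    low, high = min(values), max(values)
    width = (high - low) / k
    counts = [0] * k
    for v in values:
        index = min(int((v - low) / width), k - 1)  # the maximum goes in the last bucket
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i]) for i in range(k)]


def equal_frequency_buckets(values, k):
    """Split the sorted values into k buckets holding (nearly) equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for i in range(k):
        chunk = ordered[i * n // k:(i + 1) * n // k]
        buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets


data = [1, 2, 3, 4, 5, 6, 7, 8, 20]
print(equal_width_buckets(data, 3))
print(equal_frequency_buckets(data, 3))
```

With a skewed value such as 20, the equal-width rule leaves most values in the first bucket, while the equal-frequency rule keeps three values in each.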

Page 103: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Reduction Method (3): Clustering

Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter).

Can be very effective if the data is clustered, but not if the data is “smeared”.

Clustering can be hierarchical, with the clusters stored in multi-dimensional index tree structures.

There are many choices of clustering definitions and clustering algorithms.

Page 104: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Data Reduction Method (4): Sampling

Sampling: obtain a small sample s to represent the whole data set N.

Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data.

Choose a representative subset of the data:

Simple random sampling may perform very poorly in the presence of skew.

Develop adaptive sampling methods, such as stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data.

Note: sampling may not reduce database I/Os (pages are read one at a time).


Page 106: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Sampling: with or without Replacement

[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]

Page 107: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Sampling: Cluster or Stratified Sampling

[Figure: raw data and the resulting cluster/stratified sample]
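The sampling schemes from the preceding slides can be sketched with Python's standard `random` module. The function names, the `stratum_of` callback, and the "at least one tuple per stratum" rule are assumptions of this sketch, as is the made-up data set.

```python
import random
from collections import defaultdict


def srswor(data, s, rng):
    """Simple random sample without replacement: each tuple is drawn at most once."""
    return rng.sample(data, s)


def srswr(data, s, rng):
    """Simple random sample with replacement: a tuple may be drawn repeatedly."""
    return [rng.choice(data) for _ in range(s)]


def stratified_sample(data, stratum_of, fraction, rng):
    """Draw the same fraction from every stratum (at least one tuple each)."""
    strata = defaultdict(list)
    for item in data:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        size = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, size))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
people = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
print(len(srswor(people, 10, rng)), len(srswr(people, 10, rng)))
print(stratified_sample(people, lambda t: t[0], 0.1, rng))
```

With skewed data such as this, a simple random sample of 10 may contain no "senior" tuples at all, whereas the stratified sample keeps the small group represented.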


Page 109: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Discretization

Three types of attributes:

Nominal: values from an unordered set, e.g., color, profession

Ordinal: values from an ordered set, e.g., military or academic rank

Continuous: numeric values, e.g., integers or real numbers

Discretization:

Divide the range of a continuous attribute into intervals

Some classification algorithms only accept categorical attributes

Reduce data size by discretization

Prepare for further analysis

Page 110: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Discretization and Concept Hierarchy

Discretization:

Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals

Interval labels can then be used to replace actual data values

Supervised vs. unsupervised

Split (top-down) vs. merge (bottom-up)

Discretization can be performed recursively on an attribute

Concept hierarchy formation:

Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior)

Page 111: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Discretization and Concept Hierarchy Generation for Numeric Data

Typical methods (all can be applied recursively):

Binning (covered above): a top-down splitting technique based on a specified number of bins; unsupervised

Histogram analysis (covered above): like binning, an unsupervised discretization technique because it does not use class information; histograms partition the values for an attribute, A, into disjoint ranges called buckets

Clustering analysis (covered above): either top-down split or bottom-up merge; unsupervised

Page 112: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Entropy-based discretization: supervised, top-down split

To discretize a numerical attribute, A, the method selects the value of A that has the minimum entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization. Such discretization forms a concept hierarchy for A.

Interval merging by χ² analysis: supervised (it uses class information), bottom-up merge

Segmentation by natural partitioning: top-down split, unsupervised

Page 113: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Entropy-Based Discretization

Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected information (class entropy) after partitioning is

$$I(S, T) = \frac{|S_1|}{|S|}\,Entropy(S_1) + \frac{|S_2|}{|S|}\,Entropy(S_2)$$

Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

$$Entropy(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

where p_i is the probability of class i in S1.

The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.

The process is recursively applied to the partitions obtained until some stopping criterion is met.

Such a boundary may reduce data size and improve classification accuracy.
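One binary split of entropy-based discretization can be sketched as below. Taking candidate boundaries at midpoints between adjacent distinct values is an assumption of this sketch, as are the function names and sample data; the recursive step and the stopping criterion are left out.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy of a list of labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (boundary T, I(S, T)) minimizing the weighted entropy of the two parts."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no boundary between equal values
        left = [label for _, label in pairs[:k]]
        right = [label for _, label in pairs[k:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        boundary = (pairs[k - 1][0] + pairs[k][0]) / 2
        if best is None or info < best[1]:
            best = (boundary, info)
    return best


ages = [1, 2, 3, 10, 11, 12]
classes = ["a", "a", "a", "b", "b", "b"]
print(best_split(ages, classes))
```

Applying `best_split` again to the samples on each side of the chosen boundary yields the hierarchical discretization described above.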

Page 114: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Interval Merge by χ² Analysis

Merging-based (bottom-up) vs. splitting-based methods

Merge: find the best neighboring intervals and merge them to form larger intervals recursively

ChiMerge [Kerber AAAI 1992; see also Liu et al. DMKD 2002]

Initially, each distinct value of a numerical attribute A is considered to be one interval

χ² tests are performed for every pair of adjacent intervals

Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.)

Page 115: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Segmentation by Natural Partitioning

A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.

If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals

If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals

If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

Page 116: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Example of 3-4-5 Rule

Step 1: profit values: Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700

Step 2: msd = 1,000, Low = -$1,000, High = $2,000

Step 3: (-$1,000 - $2,000) is partitioned into (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)

Step 4: the top-level range (-$400 - $5,000) is partitioned as follows:

(-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)

(0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)

($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)

($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)

Page 117: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Concept Hierarchy Generation for Categorical Data

Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts, e.g., street < city < state < country

Specification of a hierarchy for a set of values by explicit data grouping, e.g., {Urbana, Champaign, Chicago} < Illinois

Specification of only a partial set of attributes, e.g., only street < city, not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values, e.g., for a set of attributes {street, city, state, country}

Page 118: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Automatic Concept Hierarchy Generation

Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set.

The attribute with the most distinct values is placed at the lowest level of the hierarchy.

Exceptions exist, e.g., weekday, month, quarter, year.

Example hierarchy, from highest to lowest level:

country: 15 distinct values

province_or_state: 365 distinct values

city: 3567 distinct values

street: 674,339 distinct values
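The distinct-value heuristic amounts to sorting the attributes by their number of distinct values. A minimal sketch, with an illustrative function name and the counts from the slide:

```python
def generate_hierarchy(distinct_counts):
    """Order attributes from the top level (fewest distinct values) downward."""
    return sorted(distinct_counts, key=distinct_counts.get)


counts = {"street": 674339, "country": 15, "city": 3567, "province_or_state": 365}
levels = generate_hierarchy(counts)
print(" < ".join(reversed(levels)))
```

The heuristic is not foolproof: applied to a time dimension it would place year (many distinct values) below weekday (7 values), so the generated ordering should be reviewed.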


Page 120: DATA MINING & DATA WAREHOUSING UNIT-I Data Mining: Concepts and Techniques

Summary

Data preparation, or preprocessing, is a big issue for both data warehousing and data mining.

Descriptive data summarization is needed for quality data preprocessing.

Data preparation includes:

Data cleaning and data integration

Data reduction and feature selection

Discretization

Many methods have been developed, but data preprocessing is still an active area of research.