
CSSU: Data Warehouses and Data Mining
Data Warehouses, Data Mining, Business Intelligence Applications

D. Christozov
March 15, 2006


Outline

Why needed?

1. Database → Data Warehouse → Data Mining == Knowledge Discovery in Databases

2. Data Warehouses: organization, structuring, and presentation of data oriented to analysis; the Data Cube

3. DM: preprocessing

4. DM: characterization and comparison

5. DM: classification and forecasting

Conclusion


Why needed? Necessity is the Mother of Invention

Data explosion problem: automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases and other information repositories.

We are drowning in data, but starving for knowledge!

Solution: data warehousing and data mining
– Data warehousing and on-line analytical processing (OLAP)
– Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases


Why needed?

Evolution of Database Technology
• 1960s: Data collection, database creation, IMS and network DBMS
• 1970s: Relational data model, relational DBMS implementation
• 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s-2000s: Data warehousing and data mining, multimedia databases, and Web databases


What Is Data Mining?

• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
– Also known as knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• What is not data mining?
– Query processing
– Expert systems or statistical programs


DB → DW → DM == Knowledge Discovery in Databases

– Data mining: the core of the knowledge discovery process.

[Figure: the KDD process flow: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]


Steps of a KDD Process

• Learning the application domain: relevant prior knowledge and the goals of the application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of the effort!)
• Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
• Choosing the functions of data mining: summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
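The steps above map naturally onto a small script. Below is a minimal, illustrative sketch in Python; scikit-learn, the bundled Iris data, and the specific feature/model choices are assumptions for the example and are not part of these slides.

```python
# A minimal KDD-style pipeline: select -> clean -> reduce -> mine -> evaluate.
# Illustrative only; the dataset and model choices are not from the slides.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Target data set: select task-relevant data
X, y = load_iris(return_X_y=True)

# 2. Cleaning/preprocessing: drop rows with missing values (none in Iris)
mask = ~np.isnan(X).any(axis=1)
X, y = X[mask], y[mask]

# 3. Reduction/transformation: keep two informative features (petal length/width)
X = X[:, 2:4]

# 4. Choose a mining function and algorithm: classification with a decision tree
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# 5. Pattern evaluation: accuracy on an independent test set
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```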


Data Mining and Business Intelligence

[Figure: the business intelligence pyramid; the potential to support business decisions increases toward the top. Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA.]
• Making Decisions
• Data Presentation: visualization techniques
• Data Mining: information discovery
• Data Exploration: OLAP, MDA, statistical analysis, querying and reporting
• Data Warehouses / Data Marts
• Data Sources: paper, files, information providers, database systems, OLTP


Data Warehouses: Data Cube and OLAP

[Figure: a concept hierarchy for the location dimension]
all → region (Europe, North_America) → country (Germany, Spain, Canada, Mexico) → city (Frankfurt, Vancouver, Toronto, ...) → office (L. Chan, M. Wind, ...)


Data Warehouses: Data Cube and OLAP

[Figure: a 3-D data cube with axes Product, Region, and Month]

Dimensions: Product, Location, Time
Hierarchical summarization paths:
• Product: Industry → Category → Product
• Location: Region → Country → City → Office
• Time: Year → Quarter → Month / Week → Day
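To make these summarization paths concrete, here is a small, hypothetical sketch using pandas: a toy sales table rolled up along the Location hierarchy (city → country) and the Time hierarchy (month → quarter). The table values and column names are invented for illustration.

```python
# Roll-up along concept hierarchies with pandas (toy data, for illustration only).
import pandas as pd

sales = pd.DataFrame({
    "product": ["TV", "TV", "PC", "PC"],
    "city":    ["Toronto", "Frankfurt", "Toronto", "Vancouver"],
    "country": ["Canada", "Germany", "Canada", "Canada"],
    "month":   ["2006-01", "2006-02", "2006-04", "2006-05"],
    "amount":  [100, 150, 200, 250],
})
# Climb the time hierarchy: month -> quarter
sales["quarter"] = pd.to_datetime(sales["month"]).dt.to_period("Q")

# Base cells are (product, city, month); roll up city -> country and month -> quarter
by_country_quarter = sales.groupby(["product", "country", "quarter"])["amount"].sum()
print(by_country_quarter)

# Apex cuboid: total over all dimensions
print("total sales:", sales["amount"].sum())
```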


Data Warehouses: Data Cube and OLAP

[Figure: a sales data cube with dimensions Date (1Qtr-4Qtr), Product (TV, VCR, PC), and Country (U.S.A., Canada, Mexico); "sum" cells along each face aggregate the cube, e.g. the total annual sales of TVs in the U.S.A.]


Data Mining Functionalities (1)

• Concept description: characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Association (correlation and causality)
– Multi-dimensional vs. single-dimensional association
– age(X, “20..29”) ∧ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]
– contains(T, “computer”) → contains(x, “software”) [1%, 75%]


Data Mining Functionalities (2)

• Classification and prediction
– Finding models (functions) that describe and distinguish classes or concepts, for future prediction
– E.g., classify countries based on climate, or classify cars based on gas mileage
– Presentation: decision tree, classification rules, neural network
– Prediction: predict some unknown or missing numerical values
• Cluster analysis
– Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
– Clustering principle: maximize the intra-class similarity and minimize the inter-class similarity


Data Mining Functionalities (3)

• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the data
– It can be treated as noise or an exception, but it is quite useful in fraud detection and rare-event analysis (a simple sketch follows this list)
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
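As one simple, statistics-based illustration of outlier analysis (a convenient example, not a method prescribed by these slides), values can be flagged when they lie more than three standard deviations from the mean:

```python
# Flag values more than 3 standard deviations from the mean as potential outliers.
# Deliberately simple; distance- or density-based methods are often preferred in practice.
import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=12.0, scale=1.0, size=200), 95.0)  # one planted outlier

z = (values - values.mean()) / values.std()      # standardize
outliers = values[np.abs(z) > 3.0]               # the general behavior is |z| <= 3
print("flagged as outliers:", outliers)
```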


Are All the “Discovered” Patterns Interesting?

• A data mining system/query may generate thousands of patterns; not all of them are interesting
– Suggested approach: human-centered, query-based, focused mining
• Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates a hypothesis that the user seeks to confirm
• Objective vs. subjective interestingness measures:
– Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
– Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.


Can We Find All and Only the Interesting Patterns?

• Find all the interesting patterns: completeness
– Can a data mining system find all the interesting patterns?
– Association vs. classification vs. clustering
• Search for only the interesting patterns: optimization
– Can a data mining system find only the interesting patterns?
– Approaches:
• First generate all the patterns and then filter out the uninteresting ones
• Generate only the interesting patterns: mining query optimization


Data Preprocessing: Why Data Preprocessing?

• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– A data warehouse needs consistent integration of quality data


Major Tasks in Data Preprocessing

• Data cleaning– Fill in missing values, smooth noisy data, identify or remove outliers,

and resolve inconsistencies

• Data integration– Integration of multiple databases, data cubes, or files

• Data transformation– Normalization and aggregation

• Data reduction– Obtains reduced representation in volume but produces the same or

similar analytical results

• Data discretization– Part of data reduction but with particular importance, especially for

numerical data


Data Cleaning

• Data cleaning tasks

– Fill in missing values

– Identify outliers and smooth out noisy data

– Correct inconsistent data


How to Handle Missing Data?

• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean of all samples belonging to the same class to fill in the missing value: smarter (both mean-based strategies are sketched below)
• Forecast the missing value: use the most probable value vs. use the value with the least impact on the further analysis
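A small sketch of the constant- and mean-based strategies, using pandas on an invented table; the column names and values are hypothetical, chosen only to illustrate the three fills.

```python
# Filling missing values: global constant, attribute mean, and class-conditional mean.
# Toy data for illustration only.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "A", "B", "B", "B"],
    "income": [30.0, np.nan, 34.0, 70.0, 72.0, np.nan],
})

# Global constant (e.g. a sentinel meaning "unknown")
const_filled = df["income"].fillna(-1)

# Attribute mean over all samples
mean_filled = df["income"].fillna(df["income"].mean())

# Attribute mean within the same class ("smarter")
class_mean_filled = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(pd.DataFrame({"const": const_filled, "mean": mean_filled, "class_mean": class_mean_filled}))
```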


How to Handle Noisy Data?

• Binning method (sketched below):
– first sort the data and partition it into (equi-depth) bins
– then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and have a human check them
• Regression
– smooth by fitting the data to regression functions
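A minimal sketch of equi-depth binning with smoothing by bin means; the sample values are illustrative, not taken from these slides.

```python
# Equi-depth binning: sort, partition into bins of equal size, smooth by bin means.
# Illustrative values only.
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
n_bins = 3

data.sort()
bins = np.array_split(data, n_bins)          # equi-depth partition
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])

for i, b in enumerate(bins):
    print(f"bin {i}: {b.tolist()} -> mean {b.mean():.2f}")
print("smoothed:", smoothed.tolist())
```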


Data Integration

• Data integration:
– combines data from multiple sources into a coherent store
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify the same real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
– for the same real-world entity, attribute values from different sources differ
– possible reasons: different representations, different scales, e.g., metric vs. British units


Data Transformation

• Smoothing: remove noise from the data
• Aggregation (data reduction): summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scale values to fall within a small, specified range (the three schemes below are sketched in code after this list)
– dimensions
– scales: nominal, ordinal, and interval scales
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
– new attributes constructed from the given ones
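A short sketch of the three normalization schemes named above, applied to an invented vector (min-max to [0, 1], z-score, and decimal scaling):

```python
# Min-max, z-score, and decimal-scaling normalization (toy values).
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the range [0, 1]
minmax = (v - v.min()) / (v.max() - v.min())

# z-score normalization: zero mean, unit standard deviation
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer such that max(|v|) / 10^j < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / (10 ** j)

print("min-max:        ", minmax)
print("z-score:        ", zscore)
print("decimal scaling:", decimal, "(j =", j, ")")
```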


Data Reduction Strategies

• Warehouse may store terabytes of data: Complex data analysis/mining may take a very long time to run on the complete data set

• Data reduction – Obtains a reduced representation of the data set that is much

smaller in volume but yet produces the same (or almost the same) analytical results

• Data reduction strategies– Data cube aggregation

– Dimensionality reduction

– Numerosity reduction

– Discretization and concept hierarchy generation
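Dimensionality reduction is easiest to see with a tiny principal component analysis. The slides only name the strategy; PCA via NumPy's SVD is one common technique, shown here as a hedged sketch on synthetic data.

```python
# Dimensionality reduction via PCA (SVD on mean-centred data).
# Two correlated toy features are projected onto their single strongest direction.
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=100)])  # 100 x 2, strongly correlated

Xc = X - X.mean(axis=0)                      # centre the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1                                        # keep the top principal component
X_reduced = Xc @ Vt[:k].T                    # 100 x 1 representation instead of 100 x 2
explained = S[:k] ** 2 / np.sum(S ** 2)
print("variance explained by 1 component: %.3f" % explained[0])
```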


Mining Association Rules in Large Databases

• Association rule mining

• Mining single-dimensional Boolean association rules from transactional databases

• Mining multilevel association rules from transactional databases

• Mining multidimensional association rules from transactional databases and data warehouse

• From association mining to correlation analysis

• Constraint-based association mining


What Is Association Mining?

• Association rule mining:
– finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
• Applications:
– basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
• Examples:
– Rule form: “Body → Head [support, confidence]”
– buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
– major(x, “CS”) ∧ takes(x, “DB”) → grade(x, “A”) [1%, 75%]


Rule Measures: Support and Confidence

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

Find all rules X ∧ Y → Z with minimum confidence and support
– support, s: probability that a transaction contains {X, Y, Z}
– confidence, c: conditional probability that a transaction containing {X, Y} also contains Z

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

With minimum support 50% and minimum confidence 50%, we have:
– A → C (50%, 66.6%)
– C → A (50%, 100%)
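The two measures can be checked directly against the transaction table above; a small sketch in plain Python:

```python
# Support and confidence for A -> C and C -> A on the four transactions above.
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(body, head):
    """Conditional probability that a transaction with `body` also contains `head`."""
    return support(body | head) / support(body)

print("support(A -> C)    =", support({"A", "C"}))        # 2/4 = 0.5
print("confidence(A -> C) =", confidence({"A"}, {"C"}))   # 2/3 ~ 0.667
print("confidence(C -> A) =", confidence({"C"}, {"A"}))   # 2/2 = 1.0
```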


Association Rule Mining

• Boolean vs. quantitative associations
– buys(x, “SQLServer”) ∧ buys(x, “DMBook”) → buys(x, “DBMiner”) [0.2%, 60%]
– age(x, “30..39”) ∧ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%]
• Single-dimensional vs. multi-dimensional associations
• Single-level vs. multiple-level analysis
– What brands of beers are associated with what brands of diapers?
• Various extensions
– Correlation, causality analysis
• Association does not necessarily imply correlation or causality
– Constraints enforced
• E.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?


Concept Description: Characterization and Comparison

• Characterization: provides a concise and succinct summarization of a given collection of data
• Comparison: provides descriptions comparing two or more collections of data
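As a small, invented illustration of the difference (picking up the earlier "dry vs. wet regions" example): characterization summarizes one collection of data, while comparison contrasts two collections side by side. The table and column names in this pandas sketch are hypothetical.

```python
# Characterization vs. comparison on a toy table (invented data).
import pandas as pd

df = pd.DataFrame({
    "region":   ["dry", "dry", "dry", "wet", "wet", "wet"],
    "rainfall": [120, 150, 100, 900, 1100, 950],
    "temp":     [28, 31, 30, 22, 21, 23],
})

# Characterization: concise summary of one collection (the "dry" regions)
print(df[df["region"] == "dry"][["rainfall", "temp"]].describe())

# Comparison (discrimination): contrast the two collections
print(df.groupby("region")[["rainfall", "temp"]].mean())
```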


Classification and Prediction

• Classification:
– predicts categorical class labels
– constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
• Prediction:
– models continuous-valued functions, i.e., predicts unknown or missing values


Classification: A Two-Step Process

• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class-label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
– Estimate the accuracy of the model:
• The known label of each test sample is compared with the classified result from the model
• The accuracy rate is the percentage of test-set samples that are correctly classified by the model
• The test set is independent of the training set, otherwise over-fitting will occur


Classification Process (1): Model Construction

Training data:

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

A classification algorithm produces a classifier (model), here:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
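The rule on this slide can be written down and checked against the training data directly. The sketch below simply encodes the slide's rule in plain Python rather than learning it with a real algorithm; the accuracy printout is straightforward arithmetic over the table above.

```python
# The classifier (model) from the slide, applied back to the training set.
train = [
    # (name, rank, years, tenured)
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor",      2, "yes"),
    ("Jim",  "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def classify(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

correct = sum(classify(rank, years) == tenured for _, rank, years, tenured in train)
print("training accuracy: %d/%d" % (correct, len(train)))   # 6/6 for this rule
```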


Classification Process (2): Use the Model in Prediction

Testing data:

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen data: (Jeff, Professor, 4) → Tenured?
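Reusing the same rule on the testing data and on the unseen tuple gives the second step of the process; the accuracy figure below is simple arithmetic over the test table, not a number stated on the slide.

```python
# Model usage: estimate accuracy on the independent test set, then classify unseen data.
test = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

def classify(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

correct = sum(classify(rank, years) == tenured for _, rank, years, tenured in test)
print("test accuracy: %d/%d" % (correct, len(test)))        # 3/4: Merlisa is misclassified

print("Jeff, Professor, 4 ->", classify("Professor", 4))    # "yes"
```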


Supervised vs. Unsupervised Learning

• Supervised learning (classification)
– Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data are classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data (a small clustering sketch follows below)
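A short unsupervised-learning sketch: k-means grouping unlabeled 2-D points into two clusters. Scikit-learn and the synthetic data are assumptions made for this example; the slides do not prescribe a particular clustering algorithm.

```python
# Unsupervised learning: k-means finds two clusters in unlabeled 2-D data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two blobs of points, with no class labels attached
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
               rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centres:\n", km.cluster_centers_)
print("first 10 cluster labels:", km.labels_[:10])
```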


Q and A

Thank you!!!