~ Arvind Pandi Dorai
Lecturer, Computer Dept
KJSIEIT
Chapter 1
Introduction
NEED FOR A DATA WAREHOUSE
In the 1960s, computer systems were used to maintain business data.
As enterprises grew larger, hundreds of computer applications were needed to support business processes.
In the 1990s, as businesses grew more complex, corporations spread globally, and competition intensified, business executives became desperate for information to stay competitive & improve the bottom line.
Companies need information to formulate business strategies, establish goals, set objectives & monitor results.
Data Warehouse
Definition: A data warehouse is a relational DB that
maintains huge volumes of historical data, so as to
support strategic analysis & decision making.
To take a strategic decision, we need strong
analysis, & for strong analysis we need historical
data. Since ERP systems do not maintain historical
data, the DW came into the picture.
Data Warehouse Features
Subject oriented - Subject specific data marts.
Integrated - Data integrated into single uniform format.
Time Variant - DW maintains data over a wide range of time.
Non volatile - Data is never deleted, rarely updated.
Data Warehouse Objects
Dimension Tables:
Dimension table key (primary key)
Wide
Textual attributes
Denormalised
Support drill-down & roll-up
Multiple hierarchies
Fact Tables:
Foreign keys (to the dimension tables)
Deep
Numeric facts
Transaction-level data
Aggregate data
Star Schema
A large and central fact table and one table for
each dimension.
Every fact points to one tuple in each of the
dimensions and has additional attributes.
Does not capture hierarchies directly.
De-normalized system.
Easy to understand, easy to define hierarchies,
reduces no. of joins.
Star Schema layout
Star Schema Example
SnowFlake Schema
Variant of star schema model.
A single, large, central fact table and one or
more tables for each dimension.
Dimension tables are normalized, i.e. dimension
table data is split into additional tables.
The process of making a snowflake schema is called
snowflaking.
Drawbacks: time-consuming joins, slow report
generation.
Snowflake Schema Layout
Fact Constellation
Multiple fact tables share dimension tables.
This schema is viewed as a collection of stars, hence
it is called a galaxy schema, fact constellation, or
family of stars.
Sophisticated applications require such a schema.
Fact Constellation (example)
(figure: two fact tables sharing the Store and Product dimensions)
Sales Fact Table: Store Key, Product Key, Period Key, Units, Price
Shipping Fact Table: Shipper Key, Store Key, Product Key, Period Key, Units, Price
Store Dimension: Store Key, Store Name, City, State, Region
Product Dimension: Product Key, Product Desc
Chapter 2
Metadata
Meta Data: Data about data
Types of Metadata:
Operational Metadata
Extraction & Transformation Metadata
End-User Metadata
Information Package
The IP gives special significance to the dimension hierarchies in
the business dimensions & the key facts in the fact table.
Chapter 3
DW Architecture
DW Architecture Data Acquisition
Data Extraction
Data Transformation
Data Staging
Data Storage
Data Loading
Data Aggregation
Information Delivery
Report
OLAP
Data Mining
Data Acquisition
Data Extraction:
Immediate Data Extraction
Deferred Data Extraction
Data Transformation:
Splitting of fields
Merging of fields
Decoding of fields
De-duplication
Date-Time format conversion
Computed or derived fields
Data Staging
Data Storage
Data Loading:
Initial Loading
Incremental Loading
Data Aggregation:
Based on fact tables
Based on aggregate tables
Information Delivery
Reports - Aggregate data
OLAP - Multidimensional analysis
Data Mining - Extracting knowledge from the database
Chapter 4
Principles of Dimensional Modeling
Dimensional Modeling:
A logical design technique to structure (arrange) the
business dimensions & the fact tables.
DM is a technique to prepare a star schema.
Provides best data access.
Fact table interacts with each & every business
dimension.
Drill-down & Roll-up.
Fully Additive Facts: When the values of an attribute can be summed up by
simple addition to provide meaningful data, they are
known as fully additive facts.
Semi Additive Facts: When the values of an attribute are summed up, the result
does not provide meaningful data, but when some other
mathematical operation is performed on them the result is
meaningful; such facts are known as semi additive facts.
Factless Fact table: A fact table in which numeric facts are absent.
Chapter 5
Information Access & Delivery
OLAP is a technique that allows the user to view aggregate data across measurements along with a
set of related dimensions.
OLAP supports multidimensional analysis because
data is stored in multidimensional array.
OLAP Operations
Slice: Filter the cube on one value of a dimension,
giving a sub-cube with one fewer dimension.
Dice: Select a sub-cube by restricting two or more
dimensions.
Drill-down: Detailing or expanding an attribute's
values.
Roll-up: Aggregating or compressing an attribute's
values.
Rotate: Rotating the cube to view different
dimensions.
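A minimal sketch of slice and dice in plain Python over a flat list of fact records; the product/region/year rows below are hypothetical and only illustrate the two operations:

```python
# Hypothetical fact records: (product, region, year, units)
facts = [
    ("iPod",   "East", 2009, 120),
    ("iPod",   "West", 2009,  80),
    ("Laptop", "East", 2009, 200),
    ("iPod",   "East", 2010, 150),
    ("Laptop", "West", 2010,  90),
]
NAMES = ("product", "region", "year", "units")

def slice_cube(rows, **fixed):
    """Slice: fix one dimension's value, e.g. product = 'iPod'."""
    return [r for r in rows
            if all(r[NAMES.index(k)] == v for k, v in fixed.items())]

def dice(rows, **allowed):
    """Dice: keep a sub-cube by restricting two or more dimensions
    to sets of allowed values."""
    return [r for r in rows
            if all(r[NAMES.index(k)] in vs for k, vs in allowed.items())]

ipod_slice = slice_cube(facts, product="iPod")
sub_cube = dice(facts, product={"iPod", "Laptop"}, year={2009})
print(len(ipod_slice), len(sub_cube))
```

Drill-down and roll-up would then just regroup the same records at a finer or coarser attribute level.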
OLAP Operations
Slice and Dice (figure: a Product × Time cube sliced at Product = iPod)
OLAP Operations
Drill Down (figure: the Product dimension is expanded, Category e.g. Music Player → Sub-Category e.g. MP3 → Product e.g. iPod)
OLAP Operations
Roll Up (figure: the reverse, Product e.g. iPod → Sub-Category e.g. MP3 → Category e.g. Music Player)
OLAP Operations
Pivot (figure: the cube's axes are rotated, e.g. from Product × Time to Product × Region)
OLAP Server
An OLAP server is a high-capacity, multi-user data
manipulation engine specifically designed to support
and operate on multi-dimensional data structures.
The available OLAP servers are:
MOLAP server
ROLAP server
HOLAP server
Chapter 6 Implementation & Maintenance
IMPLEMENTATION: Monitoring: Sending data from sources
Integrating: Loading, cleansing, ...
Processing: Query processing, indexing, ...
Managing: Metadata, Design, ...
Maintenance
Maintenance is an issue for materialized
views
Recomputation
Incremental updating
View and Materialized Views
View
Derived relation defined in terms of base
(stored) relations.
Materialized views
A view can be materialized by storing the tuples
of the view in the database.
Index structures can be built on the materialized
view.
Overview
Extracting knowledge
Perform analysis
Use DM Algorithms
Knowledge Discovery in Databases (KDD)
Steps In KDD Process
Data Cleaning
Data Integration
Data Selection
Data Transformation
Data mining
Pattern Evaluation
Knowledge Presentation
Architecture of DM
DM Algorithms
Association: Relationship between item sets.
Used in Market basket analysis.
Eg: Apriori & FP Tree
Classification: Classify each item to predefined groups.
Eg: Naïve Bayes & ID3
Clustering: Each item divided into dynamically generated
groups.
Eg: K-means & K-medoids
Example: Market Basket Data
Items frequently purchased together:
Computer → Printer
Uses:
Placement
Advertising
Sales
Coupons
Objective: increase sales and reduce costs
Called Market Basket Analysis, Shopping Cart Analysis
Transaction Data: Supermarket Data
Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, jam, salt, ice-cream}
tn: {biscuit, jam, milk}
Concepts: An item: an item/article in a basket
I: the set of all items sold in the store
A Transaction: items purchased in a basket; it may have TID (transaction ID)
A Transactional dataset: A set of transactions
Association Rule Definitions
Association Rule (AR): implication X → Y where
X, Y ⊆ I and X ∩ Y = ∅
Support of AR (s) X → Y: Percentage of
transactions that contain X ∪ Y
Confidence of AR (α) X → Y: Ratio of the number of
transactions that contain X ∪ Y to the number
that contain X
Association Rule Problem
Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all
association rules X → Y with a minimum support and
confidence.
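The support and confidence definitions map directly to code; the transaction list below is hypothetical:

```python
# Hypothetical market-basket transactions
transactions = [
    {"bread", "cheese", "milk"},
    {"apple", "jam", "salt", "ice-cream"},
    {"bread", "milk"},
    {"biscuit", "jam", "milk"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(X, Y, db):
    """support(X ∪ Y) / support(X) for the rule X -> Y."""
    return support(set(X) | set(Y), db) / support(X, db)

print(support({"bread", "milk"}, transactions))      # 2 of 4 transactions
print(confidence({"bread"}, {"milk"}, transactions))
```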
Link Analysis
Association Rule Mining Task
Given a set of transactions T, the goal of association rule
mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Example
Transaction data
Assume:
minsup = 30%
minconf = 80%
An example frequent itemset:
{Cocoa, Clothes, Milk} [sup = 3/7]
Association rules from the itemset:
Clothes → Milk, Cocoa [sup = 3/7, conf = 3/3]
Clothes, Cocoa → Milk [sup = 3/7, conf = 3/3]
t1: Butter, Cocoa, Milk
t2: Butter, Cheese
t3: Cheese, Boots
t4: Butter, Cocoa, Cheese
t5: Butter, Cocoa, Clothes, Cheese, Milk
t6: Cocoa, Clothes, Milk
t7: Cocoa, Milk, Clothes
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
Generate all itemsets whose support ≥ minsup
2. Rule Generation
Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset
Frequent itemset generation is still computationally
expensive
Step 1: Generate Candidate & Frequent
Itemsets
Let k=1 Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
Generate length (k+1) candidate itemsets from length k frequent itemsets
Prune candidate itemsets containing subsets of length k that are infrequent
Count the support of each candidate by scanning the DB
Eliminate candidates that are infrequent, leaving only those that are frequent
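The level-wise loop above can be sketched as follows, run against the seven example transactions from the earlier slides with minsup = 30%:

```python
from itertools import combinations

def apriori(db, minsup):
    """Level-wise frequent-itemset mining: count candidates, keep the
    frequent ones, join frequent k-itemsets into (k+1)-candidates and
    prune any candidate with an infrequent k-subset."""
    n = len(db)
    freq = {}
    current = [frozenset([i]) for i in sorted({i for t in db for i in t})]
    k = 1
    while current:
        counts = {c: sum(c <= t for t in db) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= minsup}
        freq.update(level)
        candidates = set()
        for a, b in combinations(level, 2):
            u = a | b
            if len(u) == k + 1 and all(frozenset(s) in level
                                       for s in combinations(u, k)):
                candidates.add(u)
        current = list(candidates)
        k += 1
    return freq

db = [frozenset(t) for t in (
    {"Butter", "Cocoa", "Milk"},
    {"Butter", "Cheese"},
    {"Cheese", "Boots"},
    {"Butter", "Cocoa", "Cheese"},
    {"Butter", "Cocoa", "Clothes", "Cheese", "Milk"},
    {"Cocoa", "Clothes", "Milk"},
    {"Cocoa", "Milk", "Clothes"},
)]
result = apriori(db, minsup=0.3)
print(result[frozenset({"Cocoa", "Clothes", "Milk"})])  # 3/7, as on the slide
```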
Apriori Algorithm Example
Step 2: Generating Rules From Frequent
Itemsets
Frequent itemsets ≠ association rules: one more step is needed to generate association rules.
For each frequent itemset X, for each proper nonempty subset A of X:
Let B = X − A. A → B is an association rule if
confidence(A → B) ≥ minconf, where support(A → B) = support(A ∪ B) = support(X) and confidence(A → B) = support(A ∪ B) / support(A)
Generating Rules: An example
Suppose {2,3,4} is frequent, with sup=50%
Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4},
with sup=50%, 50%, 75%, 75%, 75%, 75% respectively
These generate these association rules:
2,3 → 4, confidence=100%
2,4 → 3, confidence=100%
3,4 → 2, confidence=67%
2 → 3,4, confidence=67%
3 → 2,4, confidence=67%
4 → 2,3, confidence=67%
All rules have support = 50%
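A short sketch reproducing this enumeration; the `sup` dictionary simply hardcodes the subset supports given above:

```python
from itertools import combinations

def rules_from_itemset(X, sup, minconf):
    """Enumerate A -> B for every proper nonempty subset A of X and keep
    the rules whose confidence = sup[X] / sup[A] meets minconf."""
    X = frozenset(X)
    out = []
    for r in range(1, len(X)):
        for A in map(frozenset, combinations(X, r)):
            conf = sup[X] / sup[A]
            if conf >= minconf:
                out.append((A, X - A, conf))
    return out

# Supports from the {2,3,4} example above
sup = {
    frozenset({2, 3, 4}): 0.50,
    frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
}
rules = rules_from_itemset({2, 3, 4}, sup, minconf=0.8)
for A, B, conf in rules:
    print(sorted(A), "->", sorted(B), round(conf, 2))
```

At minconf = 80%, only the two 100%-confidence rules survive, matching the list above.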
Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊆
L such that f → L − f satisfies the minimum confidence requirement
If {A,B,C,D} is a frequent itemset, candidate rules:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD,
BD → AC, CD → AB
If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
Generating Rules
To recap, in order to obtain A → B, we need to have support(A ∪ B) and support(A)
All the required information for confidence computation has already been recorded in itemset generation. No need to see the data T any more.
This step is not as time-consuming as frequent itemsets generation.
Rule Generation
How to efficiently generate rules from frequent itemsets?
In general, confidence does not have an anti-monotone property
c(ABC → D) can be larger or smaller than
c(AB → D)
But the confidence of rules generated from the same itemset has an anti-monotone property
e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
Apriori Advantages/Disadvantages
Advantages:
Uses large itemset property.
Easily parallelized
Easy to implement.
Disadvantages:
Assumes transaction database is memory resident.
Requires up to m database scans.
Mining Frequent Patterns
Without Candidate Generation
Compress a large database into a compact, Frequent-
Pattern tree (FP-tree) structure
highly condensed, but complete for frequent pattern
mining
avoid costly database scans
Develop an efficient, FP-tree-based frequent pattern
mining method
A divide-and-conquer methodology: decompose mining
tasks into smaller ones
Avoid candidate generation: sub-database test only!
Construct FP-tree From A Transaction DB
min_support = 0.5 (count ≥ 3)
TID | Items bought | Frequent items (L-order)
100 | f, a, c, d, g, i, m, p | f, c, a, m, p
200 | a, b, c, f, l, m, o | f, c, a, b, m
300 | b, f, h, j, o | f, b
400 | b, c, k, s, p | c, b, p
500 | a, f, c, e, l, p, m, n | f, c, a, m, p
Header Table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
(figure: the FP-tree, root {} with paths f:4 → c:3 → a:3 → m:2 → p:2, a:3 → b:1 → m:1, f:4 → b:1, and c:1 → b:1 → p:1)
Steps:
1. Scan DB once, find
frequent 1-itemset
(single item pattern)
2. Order frequent
items in frequency
descending order
3. Scan DB again,
construct FP-tree
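The construction steps can be sketched as follows, using the example transactions; the header order f, c, a, b, m, p is taken from the slide (ties at equal frequency may be broken arbitrarily):

```python
# Transactions from the example; with min support count 3 the frequent
# items are exactly f, c, a, b, m, p.
db = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
      set("bcksp"), set("afcelpmn")]

HEADER = "fcabmp"   # frequency-descending header order from the slide

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(db, header):
    """Insert each transaction's frequent items, in header order, into a
    shared-prefix tree; `links` plays the role of the header-table node links."""
    root = Node(None, None)
    links = {i: [] for i in header}
    for t in db:
        node = root
        for item in (i for i in header if i in t):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                links[item].append(child)
            child.count += 1
            node = child
    return root, links

root, links = build_fp_tree(db, HEADER)
print(root.children["f"].count)                # f:4, as in the figure
print(root.children["f"].children["c"].count)  # c:3
```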
Benefits of the FP-tree Structure
Completeness:
never breaks a long pattern of any transaction
preserves complete information for frequent pattern
mining
Compactness
reduce irrelevant information: infrequent items are gone
frequency descending ordering: more frequent items are
more likely to be shared
never larger than the original database (not counting
node-links and counts)
Mining Frequent Patterns Using FP-tree
General idea (divide-and-conquer)
Recursively grow frequent pattern path using the FP-
tree
Method
For each item, construct its conditional pattern-base,
and then its conditional FP-tree
Repeat the process on each newly created conditional
FP-tree
Until the resulting FP-tree is empty, or it contains only
one path (single path will generate all the combinations
of its sub-paths, each of which is a frequent pattern)
Major Steps to Mine FP-tree
1) Construct conditional pattern base for each
node in the FP-tree
2) Construct conditional FP-tree from each
conditional pattern-base
3) Recursively mine conditional FP-trees and
grow frequent patterns obtained so far
If the conditional FP-tree contains a single path,
simply enumerate all the patterns
Step 1: FP-tree to Conditional Pattern Base
Starting at the frequent-item header table in the FP-tree, traverse the FP-tree by following the link of each
frequent item. Accumulate all the transformed prefix paths of that item to
form its conditional pattern base. Conditional pattern bases:
item cond. pattern base
c f:3
a fc:3
b fca:1, f:1, c:1
m fca:2, fcab:1
p fcam:2, cb:1
Step 2: Construct Conditional FP-tree
For each pattern base:
Accumulate the count for each item in the base
Construct the FP-tree for the frequent items of the pattern base
m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns concerning m:
m,
fm, cm, am,
fcm, fam, cam,
fcam
Mining Frequent Patterns by Creating
Conditional Pattern-Bases
Item | Conditional pattern-base | Conditional FP-tree
f | Empty | Empty
c | {(f:3)} | {(f:3)}|c
a | {(fc:3)} | {(f:3, c:3)}|a
b | {(fca:1), (f:1), (c:1)} | Empty
m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}|m
p | {(fcam:2), (cb:1)} | {(c:3)}|p
Step 3: Recursively mine the conditional
FP-tree
m-conditional FP-tree: {} → f:3 → c:3 → a:3
Cond. pattern base of am: (fc:3); am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of cm: (f:3); cm-conditional FP-tree: {} → f:3
Cond. pattern base of cam: (f:3); cam-conditional FP-tree: {} → f:3
Single FP-tree Path Generation
Suppose an FP-tree T has a single path P
The complete set of frequent pattern of T can be generated
by enumeration of all the combinations of the sub-paths of P
Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3, so all frequent patterns concerning m are obtained by enumerating its sub-path combinations:
m,
fm, cm, am,
fcm, fam, cam,
fcam
Classification
Given old data about customers and payments, predict
a new applicant's loan eligibility.
(figure: previous customers, described by Age, Salary, Profession, Location, and Customer type, train a classifier; the resulting decision tree tests conditions such as Salary > 5K and Prof. = Exec to label new applicants' data good/bad)
Overview of Naïve Bayes
The goal of Naïve Bayes is to work out whether a new
example is in a class given that it has a certain combination of attribute values. We work out the likelihood of the example being in each class given the evidence (its attribute values), and take the highest likelihood as the classification.
Bayes Rule (E = the evidence, an event that has occurred):
P[H|E] = P[E|H] · P[H] / P[E]
P[H] is called the prior probability (of the hypothesis). P[H|E] is called the posterior probability (of the hypothesis given the evidence).
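A minimal Naïve Bayes sketch for categorical attributes. The class priors and per-attribute conditionals are plain maximum-likelihood counts (no smoothing, for clarity), and since P[E] is the same denominator for every class it is simply dropped. The loan records below are hypothetical:

```python
from collections import Counter, defaultdict

def train_nb(rows):
    """rows: list of (attribute_tuple, label). Returns class counts and
    per-(attribute index, value) counts for each class."""
    priors = Counter(label for _, label in rows)
    cond = defaultdict(Counter)
    for attrs, label in rows:
        for i, v in enumerate(attrs):
            cond[label][(i, v)] += 1
    return priors, cond, len(rows)

def classify(attrs, priors, cond, n):
    """Pick the class maximizing P[H] * prod_i P[E_i | H];
    the common denominator P[E] cancels across classes."""
    best, best_score = None, -1.0
    for label, cnt in priors.items():
        score = cnt / n
        for i, v in enumerate(attrs):
            score *= cond[label][(i, v)] / cnt
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical loan records: (salary_band, profession) -> good/bad
data = [(("high", "exec"), "good"), (("high", "clerk"), "good"),
        (("low", "exec"), "good"), (("low", "clerk"), "bad"),
        (("low", "clerk"), "bad")]
priors, cond, n = train_nb(data)
print(classify(("high", "exec"), priors, cond, n))   # good
```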
ID3 (Decision Tree Algorithm)
ID3 was the first proper decision tree algorithm to use this
mechanism:
Building a decision tree with ID3 algorithm
1. Select the attribute with the highest information gain
2. Create the subsets for each value of the attribute
3. For each subset:
1. if not all the elements of the subset belong to the same
class, repeat steps 1-3 for the subset
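Step 1 relies on information gain: the entropy of the full set minus the weighted entropy of the subsets produced by the split. A sketch of that computation, on hypothetical toy rows of the form (outlook, windy, play):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index):
    """entropy(parent) minus the size-weighted entropy of the subsets
    obtained by splitting on the attribute at `attr_index`."""
    by_value = {}
    for r in rows:
        by_value.setdefault(r[attr_index], []).append(r[-1])
    weighted = sum(len(ls) / len(rows) * entropy(ls)
                   for ls in by_value.values())
    return entropy([r[-1] for r in rows]) - weighted

# Hypothetical rows: (outlook, windy, play)
rows = [("sunny", "no", "no"), ("sunny", "yes", "no"),
        ("rain", "no", "yes"), ("rain", "yes", "no"),
        ("overcast", "no", "yes"), ("overcast", "yes", "yes")]
print(round(info_gain(rows, 0), 3))   # outlook: the more informative split
print(round(info_gain(rows, 1), 3))   # windy
```

ID3 would pick outlook here, since its gain is higher.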
ID3 (Decision Tree Algorithm)
Function DecisionTreeLearner(Examples, Target_Class, Attributes)
create a Root node for the tree
if all Examples are positive, return the single-node tree Root, with label = Yes
if all Examples are negative, return the single-node tree Root, with label = No
if the Attributes list is empty,
return the single-node tree Root, with label = most common value of Target_Class in Examples
else
A = the attribute from Attributes with the highest information gain with respect to Examples
Make A the decision attribute for Root
for each possible value v of A:
add a new tree branch below Root, corresponding to the test A = v
let Examples_v be the subset of Examples that have value v for attribute A
if Examples_v is empty then
add a leaf node below this new branch with label = most common value of Target_Class in Examples
else
add the subtree DecisionTreeLearner(Examples_v, Target_Class, Attributes - { A })
end if
end for
return Root
Decision Trees (Summary)
Advantages of ID3
automatically creates knowledge from data
can discover new knowledge (watch out for counter-intuitive rules)
avoids knowledge acquisition bottleneck
identifies most discriminating attribute first
trees can be converted to rules
Disadvantages of ID3
several identical examples have same effect as a single
example
trees can become large and difficult to understand
cannot deal with contradictory examples
examines attributes individually: does not consider
effects of inter-attribute relationships
CLUSTERING
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Partitional Clustering
Nonhierarchical
Creates clusters in one step as opposed to several
steps.
Since only one set of clusters is output, the user
normally has to input the desired number of
clusters, k.
Usually deals with static sets.
K-Means
Initial set of clusters randomly chosen.
Iteratively, items are moved among sets of clusters
until the desired set is reached.
High degree of similarity among elements in a cluster is obtained.
Given a cluster Ki = {ti1, ti2, …, tim}, the cluster mean is
mi = (1/m)(ti1 + … + tim)
K-Means Example
Given: {2,4,10,12,3,20,30,11,25}, k=2
Randomly assign means: m1=3,m2=4
K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5,m2=16
K1={2,3,4},K2={10,12,20,30,11,25}, m1=3,m2=18
K1={2,3,4,10},K2={12,20,30,11,25}, m1=4.75,m2=19.6
K1={2,3,4,10,11,12},K2={20,30,25}, m1=7,m2=25
Stop as the clusters with these means are the same.
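The worked example can be reproduced with a small 1-D k-means sketch (it assumes no cluster ever becomes empty, which holds for this data):

```python
def kmeans_1d(points, means, max_iter=100):
    """Assign each point to the nearest mean, recompute cluster means,
    and stop once the means no longer change."""
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for p in points:
            idx = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[idx].append(p)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:      # converged
            break
        means = new_means
    return clusters, means

pts = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(pts, [3.0, 4.0])
print(clusters, means)   # K1 = {2,3,4,10,11,12} (mean 7), K2 = {20,25,30} (mean 25)
```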
Hierarchical Clustering
Clusters are created in levels, producing a set of clusters at each level.
Agglomerative: Initially each item in its own cluster
Iteratively clusters are merged together
Bottom Up
Divisive: Initially all items in one cluster
Large clusters are successively divided
Top Down
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
(figure: a dendrogram over objects a, b, c, d, e. Agglomerative AGNES merges bottom-up from step 0 to step 4: {a,b}, {d,e}, {c,d,e}, then {a,b,c,d,e}; divisive DIANA performs the same splits top-down, step 4 to step 0.)
The K-Medoids Clustering Method
Find representative objects, called medoids, in clusters
PAM (Partitioning Around Medoids)
starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the non-medoids if
it improves the total distance of the resulting clustering
Handles outliers well.
Ordering of input does not impact results.
Does not scale well.
Each cluster represented by one item, called the medoid.
Initial set of k medoids randomly chosen.
PAM works effectively for small data sets, but does not scale
well for large data sets
PAM (Partitioning Around Medoids)
PAM - Use real object to represent the cluster
Select k representative objects arbitrarily
For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
For each pair of i and h,
If TCih < 0, i is replaced by h
Then assign each non-selected object to the most
similar representative object
repeat steps 2-3 until there is no change
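A simplified first-improvement sketch of the PAM idea on 1-D data. Real PAM computes the swap cost TCih incrementally; here the total clustering cost is simply recomputed for each trial swap, which is equivalent for small data:

```python
def total_cost(points, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam_1d(points, k):
    """Start from the first k points as medoids; replace a medoid with a
    non-medoid whenever the swap lowers the total cost, until no swap helps."""
    medoids = list(points[:k])
    improved = True
    while improved:
        improved = False
        base = total_cost(points, medoids)
        for i in range(k):
            for h in points:
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                if total_cost(points, trial) < base:
                    medoids, improved = trial, True
                    break
            if improved:
                break
    return sorted(medoids)

pts = [2, 3, 4, 10, 11, 12, 20, 25, 30]
medoids = pam_1d(pts, 2)
print(medoids, total_cost(pts, medoids))
```

Each accepted swap strictly lowers the cost, so the loop terminates at a local optimum.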
Web Mining
Web mining is divided into three areas:
Web Content Mining: identify information within given web pages; distinguish personal home pages from other web pages.
Web Structure Mining: uses the interconnections between web pages to give weight to the pages; defines data structures of the links.
Web Usage Mining: understand access patterns and trends in order to improve site structure.
Crawlers
A robot (spider) traverses the hypertext structure of the Web.
Collects information from visited pages
Used to construct indexes for search engines
Traditional Crawler visits entire Web and replaces index
Periodic Crawler visits portions of the Web and updates subset of index
Incremental Crawler selectively searches the Web and incrementally modifies index
Focused Crawler visits pages related to a particular subject
Web Usage Mining
Performs mining on Web Usage data or Web Logs
A web log is a listing of page reference data, also
called a click stream
It can be seen from either the server perspective (better web site design)
or the client perspective (prefetching of web pages, etc.)
Web Usage Mining Applications
Personalization
Improve the structure of a site's Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and
advertising)
Web Usage Mining Activities
Preprocessing: cleanse the Web log
Remove extraneous information
Sessionize
Session: the sequence of pages referenced by one user in one sitting.
Pattern Discovery: count patterns that occur in sessions
A pattern is a sequence of page references in a session.
Similar to association rules
Transaction: session
Itemset: pattern (or subset)
Order is important
Pattern Analysis
Web Structure Mining
Mine structure (links, graph) of the Web
Techniques
PageRank
CLEVER
Create a model of the Web organization.
May be combined with content mining to more
effectively retrieve important pages.
Web as a Graph
Web pages as nodes of a graph.
Links as directed edges.
(figure: "my page" links to www.uta.edu and www.google.com; each page is a node and each link a directed edge)
Link Structure of the Web
Forward links (out-edges).
Backward links (in-edges).
Approximation of importance/quality: a page may
be of high quality if it is referred to by many other
pages, and by pages of high quality.
PageRank
Used by Google
Prioritize pages returned from search by looking at Web structure.
The importance of a page is calculated based on the number of pages which point to it (backlinks).
Weighting is used to give more importance to backlinks coming from important pages.
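A power-iteration sketch of the PageRank idea. The three-page web below is hypothetical; `links` maps each page to its forward links, rank flows along backlinks weighted by the pointing page's rank, and a (1 − d) teleport share keeps the scores well defined. Dangling pages (no out-links) spread their rank evenly:

```python
def pagerank(links, d=0.85, iters=50):
    """Iterate rank(q) = (1-d)/n + d * sum over backlinks p of
    rank(p) / outdegree(p) until (approximate) convergence."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            share = d * rank[p] / (len(outs) or n)
            for q in (outs or pages):     # dangling page: spread evenly
                new[q] += share
        rank = new
    return rank

# Hypothetical mini-web: "google" is pointed to by both other pages
web = {"mypage": ["uta", "google"], "uta": ["google"], "google": []}
ranks = pagerank(web)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The page with the most (and best-ranked) backlinks ends up first, as the text above describes.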
HITS Algorithm
Used to generate good quality authoritative pages
and hub pages
Authoritative Page: a page pointed to by many
other pages.
Hub Page: a page which points to authoritative
pages.
HITS Algorithm
Step 1: Generate Root set
Step 2: Generate Base set
Step 3: Build Graph
Step 4: Retain external links & eliminate internal links
Step 5: Calculate Authoritative & Hub score
Step 6: Generate result