~ Arvind Pandi Dorai
Lecturer, Computer Dept
KJSIEIT
Chapter 1
Introduction
NEED FOR A DATA WAREHOUSE
In the 1960s, computer systems were used to maintain business data.
As enterprises grew larger, hundreds of computer applications were needed to support business processes.
In the 1990s, as businesses grew more complex, corporations spread globally, and competition intensified, business executives became desperate for information to stay competitive & improve the bottom line.
Companies need information to formulate business strategies, establish goals, set objectives & monitor results.
Data Warehouse
Definition: A data warehouse is a relational DB that
maintains huge volumes of historical data, so as to
support strategic analysis & decision making.
To take a strategic decision, we need strong
analysis, & for strong analysis we need historical
data. Since ERP systems do not maintain historical
data, the DW came into the picture.
Data Warehouse Features
Subject oriented - Subject specific data marts.
Integrated - Data integrated into single uniform format.
Time Variant - DW maintains data over a wide range of time.
Non volatile - Data is never deleted, rarely updated.
Data Warehouse Objects
Dimension Tables:
Dimension table key (primary key)
Wide
Textual attributes
Denormalised
Support drill-down & roll-up
Multiple hierarchies
Fact Tables:
Foreign keys (to the dimension tables)
Deep
Numeric facts
Transaction-level data
Aggregate data
Star Schema
A large and central fact table and one table for
each dimension.
Every fact points to one tuple in each of the
dimensions and has additional attributes.
Does not capture hierarchies directly.
De-normalized system.
Easy to understand, easy to define hierarchies,
reduces no. of joins.
Star Schema layout
Star Schema Example
SnowFlake Schema
Variant of star schema model.
A single, large, central fact table and one or
more tables for each dimension.
Dimension tables are normalized, i.e. dimension
table data is split into additional tables.
The process of making a snowflake schema is called
snowflaking.
Drawbacks: time-consuming joins, slow report
generation.
Snowflake Schema Layout
Fact Constellation
Multiple fact tables share dimension tables.
This schema is viewed as a collection of stars, hence
it is called a galaxy schema, fact constellation, or
family of stars.
Sophisticated applications require such a schema.
Fact Constellation (example)
(figure: two fact tables sharing the Store and Product dimensions)
Sales Fact Table: Store Key, Product Key, Period Key, Units, Price
Shipping Fact Table: Shipper Key, Store Key, Product Key, Period Key, Units, Price
Store Dimension: Store Key, Store Name, City, State, Region
Product Dimension: Product Key, Product Desc
Chapter 2
Metadata
Meta Data: Data about data
Types of Metadata:
Operational Metadata
Extraction & Transformation Metadata
End-User Metadata
Information Package
The IP gives special significance to the dimension hierarchies in
the business dimensions & the key facts in the fact table.
Chapter 3
DW Architecture
DW Architecture Data Acquisition
Data Extraction
Data Transformation
Data Staging
Data Storage
Data Loading
Data Aggregation
Information Delivery
Report
OLAP
Data Mining
Data Acquisition
Data Extraction:
Immediate Data Extraction
Deferred Data Extraction
Data Transformation:
Splitting of fields
Merging of fields
Decoding of fields
De-duplication
Date-Time format conversion
Computed or derived fields
Data Staging
Data Storage
Data Loading:
Initial Loading
Incremental Loading
Data Aggregation:
Based on fact tables
Based on aggregate tables
Information Delivery
Reports - Aggregate data
OLAP - Multidimensional analysis
Data Mining - Extracting knowledge from the database
Chapter 4
Principles of Dimensional Modeling
Dimensional Modeling:
A logical design technique to structure (arrange) the
business dimensions & the fact tables.
DM is a technique to prepare a star schema.
Provides best data access.
Fact table interacts with each & every business
dimension.
Drill-down & Roll-up.
Fully Additive Facts: When the values of an attribute can be summed up by
simple addition to provide meaningful data, they are
known as fully additive facts.
Semi Additive Facts: When the values of an attribute are summed up, the result
does not provide meaningful data, but when some other
mathematical operation is performed on them the result is
meaningful; such facts are known as semi additive facts.
Factless Fact table: A fact table in which numeric facts are absent.
Chapter 5
Information Access & Delivery
OLAP is a technique that allows the user to view aggregate data across measurements along with a
set of related dimensions.
OLAP supports multidimensional analysis because
data is stored in multidimensional array.
OLAP Operations
Slice: Filter the cube on one value of a dimension,
giving a sub-cube with one fewer dimension.
Dice: Select a sub-cube by restricting two or more
dimensions.
Drill-down: Detailing or expanding an attribute's
values.
Roll-up: Aggregating or compressing an attribute's
values.
Rotate: Rotating the cube to view different
dimensions.
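A minimal sketch of slice and dice in plain Python over a flat list of fact records; the product/region/year rows below are hypothetical and only illustrate the two operations:

```python
# Hypothetical fact records: (product, region, year, units)
facts = [
    ("iPod",   "East", 2009, 120),
    ("iPod",   "West", 2009,  80),
    ("Laptop", "East", 2009, 200),
    ("iPod",   "East", 2010, 150),
    ("Laptop", "West", 2010,  90),
]
NAMES = ("product", "region", "year", "units")

def slice_cube(rows, **fixed):
    """Slice: fix one dimension's value, e.g. product = 'iPod'."""
    return [r for r in rows
            if all(r[NAMES.index(k)] == v for k, v in fixed.items())]

def dice(rows, **allowed):
    """Dice: keep a sub-cube by restricting two or more dimensions
    to sets of allowed values."""
    return [r for r in rows
            if all(r[NAMES.index(k)] in vs for k, vs in allowed.items())]

ipod_slice = slice_cube(facts, product="iPod")
sub_cube = dice(facts, product={"iPod", "Laptop"}, year={2009})
print(len(ipod_slice), len(sub_cube))
```

Drill-down and roll-up would then just regroup the same records at a finer or coarser attribute level.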
OLAP Operations
Slice and Dice (figure: a Product × Time cube sliced at Product = iPod)
OLAP Operations
Drill Down (figure: the Product dimension is expanded, Category e.g. Music Player → Sub-Category e.g. MP3 → Product e.g. iPod)
OLAP Operations
Roll Up (figure: the reverse, Product e.g. iPod → Sub-Category e.g. MP3 → Category e.g. Music Player)
OLAP Operations
Pivot (figure: the cube's axes are rotated, e.g. from Product × Time to Product × Region)
OLAP Server
An OLAP server is a high-capacity, multi-user data
manipulation engine specifically designed to support
and operate on multi-dimensional data structures.
The available OLAP servers are:
MOLAP server
ROLAP server
HOLAP server
Chapter 6 Implementation & Maintenance
IMPLEMENTATION: Monitoring: Sending data from sources
Integrating: Loading, cleansing, ...
Processing: Query processing, indexing, ...
Managing: Metadata, Design, ...
Maintenance
Maintenance is an issue for materialized
views
Recomputation
Incremental updating
View and Materialized Views
View
Derived relation defined in terms of base
(stored) relations.
Materialized views
A view can be materialized by storing the tuples
of the view in the database.
Index structures can be built on the materialized
view.
Overview
Extracting knowledge
Perform analysis
Use DM Algorithms
Knowledge Discovery in Databases (KDD)
Steps In KDD Process
Data Cleaning
Data Integration
Data Selection
Data Transformation
Data mining
Pattern Evaluation
Knowledge Presentation
Architecture of DM
DM Algorithms
Association: Relationship between item sets.
Used in Market basket analysis.
Eg: Apriori & FP Tree
Classification: Classify each item to predefined groups.
Eg: Naïve Bayes & ID3
Clustering: Each item divided into dynamically generated
groups.
Eg: K-means & K-medoids
Example: Market Basket Data
Items frequently purchased together:
Computer → Printer
Uses:
Placement
Advertising
Sales
Coupons
Objective: increase sales and reduce costs
Called Market Basket Analysis, Shopping Cart Analysis
Transaction Data: Supermarket Data
Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, jam, salt, ice-cream}
tn: {biscuit, jam, milk}
Concepts: An item: an item/article in a basket
I: the set of all items sold in the store
A Transaction: items purchased in a basket; it may have TID (transaction ID)
A Transactional dataset: A set of transactions
Association Rule Definitions
Association Rule (AR): implication X → Y where
X, Y ⊆ I and X ∩ Y = ∅
Support of AR (s) X → Y: Percentage of
transactions that contain X ∪ Y
Confidence of AR (α) X → Y: Ratio of the number of
transactions that contain X ∪ Y to the number
that contain X
Association Rule Problem
Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all
association rules X → Y with a minimum support and
confidence.
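The support and confidence definitions map directly to code; the transaction list below is hypothetical:

```python
# Hypothetical market-basket transactions
transactions = [
    {"bread", "cheese", "milk"},
    {"apple", "jam", "salt", "ice-cream"},
    {"bread", "milk"},
    {"biscuit", "jam", "milk"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(X, Y, db):
    """support(X ∪ Y) / support(X) for the rule X -> Y."""
    return support(set(X) | set(Y), db) / support(X, db)

print(support({"bread", "milk"}, transactions))      # 2 of 4 transactions
print(confidence({"bread"}, {"milk"}, transactions))
```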
Link Analysis
Association Rule Mining Task
Given a set of transactions T, the goal of association rule
mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Example
Transaction data
Assume:
minsup = 30%
minconf = 80%
An example frequent itemset:
{Cocoa, Clothes, Milk} [sup = 3/7]
Association rules from the itemset:
Clothes → Milk, Cocoa [sup = 3/7, conf = 3/3]
Clothes, Cocoa → Milk [sup = 3/7, conf = 3/3]
t1: Butter, Cocoa, Milk
t2: Butter, Cheese
t3: Cheese, Boots
t4: Butter, Cocoa, Cheese
t5: Butter, Cocoa, Clothes, Cheese, Milk
t6: Cocoa, Clothes, Milk
t7: Cocoa, Milk, Clothes
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
Generate all itemsets whose support ≥ minsup
2. Rule Generation
Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset
Frequent itemset generation is still computationally
expensive
Step 1: Generate Candidate & Frequent
Itemsets
Let k=1 Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
Generate length (k+1) candidate itemsets from length k frequent itemsets
Prune candidate itemsets containing subsets of length k that are infrequent
Count the support of each candidate by scanning the DB
Eliminate candidates that are infrequent, leaving only those that are frequent
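The level-wise loop above can be sketched as follows, run against the seven example transactions from the earlier slides with minsup = 30%:

```python
from itertools import combinations

def apriori(db, minsup):
    """Level-wise frequent-itemset mining: count candidates, keep the
    frequent ones, join frequent k-itemsets into (k+1)-candidates and
    prune any candidate with an infrequent k-subset."""
    n = len(db)
    freq = {}
    current = [frozenset([i]) for i in sorted({i for t in db for i in t})]
    k = 1
    while current:
        counts = {c: sum(c <= t for t in db) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= minsup}
        freq.update(level)
        candidates = set()
        for a, b in combinations(level, 2):
            u = a | b
            if len(u) == k + 1 and all(frozenset(s) in level
                                       for s in combinations(u, k)):
                candidates.add(u)
        current = list(candidates)
        k += 1
    return freq

db = [frozenset(t) for t in (
    {"Butter", "Cocoa", "Milk"},
    {"Butter", "Cheese"},
    {"Cheese", "Boots"},
    {"Butter", "Cocoa", "Cheese"},
    {"Butter", "Cocoa", "Clothes", "Cheese", "Milk"},
    {"Cocoa", "Clothes", "Milk"},
    {"Cocoa", "Milk", "Clothes"},
)]
result = apriori(db, minsup=0.3)
print(result[frozenset({"Cocoa", "Clothes", "Milk"})])  # 3/7, as on the slide
```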
Apriori Algorithm Example
Step 2: Generating Rules From Frequent
Itemsets
Frequent itemsets ≠ association rules: one more step is needed to generate association rules.
For each frequent itemset X, for each proper nonempty subset A of X:
Let B = X − A. A → B is an association rule if
confidence(A → B) ≥ minconf, where support(A → B) = support(A ∪ B) = support(X) and confidence(A → B) = support(A ∪ B) / support(A)
Generating Rules: An example
Suppose {2,3,4} is frequent, with sup=50%
Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4},
with sup=50%, 50%, 75%, 75%, 75%, 75% respectively
These generate these association rules:
2,3 → 4, confidence=100%
2,4 → 3, confidence=100%
3,4 → 2, confidence=67%
2 → 3,4, confidence=67%
3 → 2,4, confidence=67%
4 → 2,3, confidence=67%
All rules have support = 50%
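A short sketch reproducing this enumeration; the `sup` dictionary simply hardcodes the subset supports given above:

```python
from itertools import combinations

def rules_from_itemset(X, sup, minconf):
    """Enumerate A -> B for every proper nonempty subset A of X and keep
    the rules whose confidence = sup[X] / sup[A] meets minconf."""
    X = frozenset(X)
    out = []
    for r in range(1, len(X)):
        for A in map(frozenset, combinations(X, r)):
            conf = sup[X] / sup[A]
            if conf >= minconf:
                out.append((A, X - A, conf))
    return out

# Supports from the {2,3,4} example above
sup = {
    frozenset({2, 3, 4}): 0.50,
    frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
}
rules = rules_from_itemset({2, 3, 4}, sup, minconf=0.8)
for A, B, conf in rules:
    print(sorted(A), "->", sorted(B), round(conf, 2))
```

At minconf = 80%, only the two 100%-confidence rules survive, matching the list above.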
Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊆
L such that f → L − f satisfies the minimum confidence requirement
If {A,B,C,D} is a frequent itemset, candidate rules:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD,
BD → AC, CD → AB
If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
Generating Rules
To recap, in order to obtain A → B, we need to have support(A ∪ B) and support(A)
All the required information for confidence computation has already been recorded in itemset generation. No need to see the data T any more.
This step is not as time-consuming as frequent itemsets generation.
Rule Generation
How to efficiently generate rules from frequent itemsets?
In general, confidence does not have an anti-monotone property
c(ABC → D) can be larger or smaller than
c(AB → D)
But the confidence of rules generated from the same itemset has an anti-monotone property
e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
Apriori Advantages/Disadvantages
Advantages:
Uses large itemset property.
Easily parallelized
Easy to implement.
Disadvantages:
Assumes transaction database is memory resident.
Requires up to m database scans.
Mining Frequent Patterns
Without Candidate Generation
Compress a large database into a compact, Frequent-
Pattern tree (FP-tree) structure
highly condensed, but complete for frequent pattern
mining
avoid costly database scans
Develop an efficient, FP-tree-based frequent pattern
mining method
A divide-and-conquer methodology: decompose mining
tasks into smaller ones
Avoid candidate generation: sub-database test only!
Construct FP-tree From A Transaction DB
min_support = 0.5 (count ≥ 3)
TID | Items bought | Frequent items (L-order)
100 | f, a, c, d, g, i, m, p | f, c, a, m, p
200 | a, b, c, f, l, m, o | f, c, a, b, m
300 | b, f, h, j, o | f, b
400 | b, c, k, s, p | c, b, p
500 | a, f, c, e, l, p, m, n | f, c, a, m, p
Header Table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
(figure: the FP-tree, root {} with paths f:4 → c:3 → a:3 → m:2 → p:2, a:3 → b:1 → m:1, f:4 → b:1, and c:1 → b:1 → p:1)
Steps:
1. Scan DB once, find
frequent 1-itemset
(single item pattern)
2. Order frequent
items in frequency
descending order
3. Scan DB again,
construct FP-tree
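The construction steps can be sketched as follows, using the example transactions; the header order f, c, a, b, m, p is taken from the slide (ties at equal frequency may be broken arbitrarily):

```python
# Transactions from the example; with min support count 3 the frequent
# items are exactly f, c, a, b, m, p.
db = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
      set("bcksp"), set("afcelpmn")]

HEADER = "fcabmp"   # frequency-descending header order from the slide

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(db, header):
    """Insert each transaction's frequent items, in header order, into a
    shared-prefix tree; `links` plays the role of the header-table node links."""
    root = Node(None, None)
    links = {i: [] for i in header}
    for t in db:
        node = root
        for item in (i for i in header if i in t):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                links[item].append(child)
            child.count += 1
            node = child
    return root, links

root, links = build_fp_tree(db, HEADER)
print(root.children["f"].count)                # f:4, as in the figure
print(root.children["f"].children["c"].count)  # c:3
```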
Benefits of the FP-tree Structure
Completeness:
never breaks a long pattern of any transaction
preserves complete information for frequent pattern
mining
Compactness
reduce irrelevant information: infrequent items are gone
frequency descending ordering: more frequent items are
more likely to be shared
never larger than the original database (not counting
node-links and counts)
Mining Frequent Patterns Using FP-tree
General idea (divide-and-conquer)
Recursively grow frequent pattern path using the FP-
tree
Method
For each item, construct its conditional pattern-base,
and then its conditional FP-tree
Repeat the process on each newly created conditional
FP-tree
Until the resulting FP-tree is empty, or it contains only
one path (single path will generate all the combinations
of its sub-paths, each of which is a frequent pattern)
Major Steps to Mine FP-tree
1) Construct conditional pattern base for each
node in the FP-tree
2) Construct conditional FP-tree from each
conditional pattern-base
3) Recursively mine conditional FP-trees and
grow frequent patterns obtained so far
If the conditional FP-tree contains a single path,
simply enumerate all the patterns
Step 1: FP-tree to Conditional Pattern Base
Starting at the frequent-item header table in the FP-tree, traverse the FP-tree by following the link of each
frequent item. Accumulate all the transformed prefix paths of that item to
form its conditional pattern base. Conditional pattern bases:
item cond. pattern base
c f:3
a fc:3
b fca:1, f:1, c:1
m fca:2, fcab:1
p fcam:2, cb:1
Step 2: Construct Conditional FP-tree
For each pattern base:
Accumulate the count for each item in the base
Construct the FP-tree for the frequent items of the pattern base
m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns concerning m:
m,
fm, cm, am,
fcm, fam, cam,
fcam
Mining Frequent Patterns by Creating
Conditional Pattern-Bases
Item | Conditional pattern-base | Conditional FP-tree
f | Empty | Empty
c | {(f:3)} | {(f:3)}|c
a | {(fc:3)} | {(f:3, c:3)}|a
b | {(fca:1), (f:1), (c:1)} | Empty
m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}|m
p | {(fcam:2), (cb:1)} | {(c:3)}|p
Step 3: Recursively mine the conditional
FP-tree
m-conditional FP-tree: {} → f:3 → c:3 → a:3
Cond. pattern base of am: (fc:3); am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of cm: (f:3); cm-conditional FP-tree: {} → f:3
Cond. pattern base of cam: (f:3); cam-conditional FP-tree: {} → f:3
Single FP-tree Path Generation
Suppose an FP-tree T has a single path P
The complete set of frequent pattern of T can be generated
by enumeration of all the combinations of the sub-paths of P
Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3, so all frequent patterns concerning m are obtained by enumerating its sub-path combinations:
m,
fm, cm, am,
fcm, fam, cam,
fcam
Classification
Given old data about customers and payments, predict
a new applicant's loan eligibility.
(figure: previous customers, described by Age, Salary, Profession, Location, and Customer type, train a classifier; the resulting decision tree tests conditions such as Salary > 5K and Prof. = Exec to label new applicants' data good/bad)
Overview of Naïve Bayes
The goal of Naïve Bayes is to work out whether a new
example is in a class given that it has a certain combination of attribute values. We work out the likelihood of the example being in each class given the evidence (its attribute values), and take the highest likelihood as the classification.
Bayes Rule (E = the evidence, an event that has occurred):
P[H|E] = P[E|H] · P[H] / P[E]
P[H] is called the prior probability (of the hypothesis). P[H|E] is called the posterior probability (of the hypothesis given the evidence).
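A minimal Naïve Bayes sketch for categorical attributes. The class priors and per-attribute conditionals are plain maximum-likelihood counts (no smoothing, for clarity), and since P[E] is the same denominator for every class it is simply dropped. The loan records below are hypothetical:

```python
from collections import Counter, defaultdict

def train_nb(rows):
    """rows: list of (attribute_tuple, label). Returns class counts and
    per-(attribute index, value) counts for each class."""
    priors = Counter(label for _, label in rows)
    cond = defaultdict(Counter)
    for attrs, label in rows:
        for i, v in enumerate(attrs):
            cond[label][(i, v)] += 1
    return priors, cond, len(rows)

def classify(attrs, priors, cond, n):
    """Pick the class maximizing P[H] * prod_i P[E_i | H];
    the common denominator P[E] cancels across classes."""
    best, best_score = None, -1.0
    for label, cnt in priors.items():
        score = cnt / n
        for i, v in enumerate(attrs):
            score *= cond[label][(i, v)] / cnt
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical loan records: (salary_band, profession) -> good/bad
data = [(("high", "exec"), "good"), (("high", "clerk"), "good"),
        (("low", "exec"), "good"), (("low", "clerk"), "bad"),
        (("low", "clerk"), "bad")]
priors, cond, n = train_nb(data)
print(classify(("high", "exec"), priors, cond, n))   # good
```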
ID3 (Decision Tree Algorithm)
ID3 was the first proper decision tree algorithm to use this
mechanism:
Building a decision tree with ID3 algorithm
1. Select the attribute with the highest information gain
2. Create the subsets for each value of the attribute
3. For each subset:
1. if not all the elements of the subset belong to the same
class, repeat steps 1-3 for the subset
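Step 1 relies on information gain: the entropy of the full set minus the weighted entropy of the subsets produced by the split. A sketch of that computation, on hypothetical toy rows of the form (outlook, windy, play):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index):
    """entropy(parent) minus the size-weighted entropy of the subsets
    obtained by splitting on the attribute at `attr_index`."""
    by_value = {}
    for r in rows:
        by_value.setdefault(r[attr_index], []).append(r[-1])
    weighted = sum(len(ls) / len(rows) * entropy(ls)
                   for ls in by_value.values())
    return entropy([r[-1] for r in rows]) - weighted

# Hypothetical rows: (outlook, windy, play)
rows = [("sunny", "no", "no"), ("sunny", "yes", "no"),
        ("rain", "no", "yes"), ("rain", "yes", "no"),
        ("overcast", "no", "yes"), ("overcast", "yes", "yes")]
print(round(info_gain(rows, 0), 3))   # outlook: the more informative split
print(round(info_gain(rows, 1), 3))   # windy
```

ID3 would pick outlook here, since its gain is higher.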
ID3 (Decision Tree Algorithm)
Function DecisionTreeLearner(Examples, Target_Class, Attributes)
create a Root node for the tree
if all Examples are positive, return the single-node tree Root, with label = Yes
if all Examples are negative, return the single-node tree Root, with label = No
if the Attributes list is empty,
return the single-node tree Root, with label = most common value of Target_Class in Examples
else
A = the attribute from Attributes with the highest information gain with respect to Examples
Make A the decision attribute for Root
for each possible value v of A:
add a new tree branch below Root, corresponding to the test A = v
let Examples_v be the subset of Examples that have value v for attribute A
if Examples_v is empty then
add a leaf node below this new branch with label = most common value of Target_Class in Examples
else
add the subtree DecisionTreeLearner(Examples_v, Target_Class, Attributes - { A })
end if
end for
return Root
Decision Trees (Summary)
Advantages of ID3
automatically creates knowledge from data
can discover new knowledge (watch out for counter-intuitive rules)
avoids knowledge acquisition bottleneck
identifies most discriminating attribute first
trees can be converted to rules
Disadvantages of ID3
several identical examples have same effect as a single
example
trees can become large and difficult to understand
cannot deal with contradictory examples
examines attributes individually: does not consider
effects of inter-attribute relationships
CLUSTERING
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Partitional Clustering
Nonhierarchical
Creates clusters in one step as opposed to several
steps.
Since only one set of clusters is output, the user
normally has to input the desired number of
clusters, k.
Usually deals with static sets.
K-Means
Initial set of clusters randomly chosen.
Iteratively, items are moved among sets of clusters
until the desired set is reached.
High degree of similarity among elements in a cluster is obtained.
Given a cluster Ki = {ti1, ti2, …, tim}, the cluster mean is
mi = (1/m)(ti1 + … + tim)
K-Means Example
Given: {2,4,10,12,3,20,30,11,25}, k=2
Randomly assign means: m1=3,m2=4
K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5,m2=16
K1={2,3,4},K2={10,12,20,30,11,25}, m1=3,m2=18
K1={2,3,4,10},K2={12,20,30,11,25}, m1=4.75,m2=19.6
K1={2,3,4,10,11,12},K2={20,30,25}, m1=7,m2=25
Stop as the clusters with these means are the same.
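The worked example can be reproduced with a small 1-D k-means sketch (it assumes no cluster ever becomes empty, which holds for this data):

```python
def kmeans_1d(points, means, max_iter=100):
    """Assign each point to the nearest mean, recompute cluster means,
    and stop once the means no longer change."""
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for p in points:
            idx = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[idx].append(p)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:      # converged
            break
        means = new_means
    return clusters, means

pts = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(pts, [3.0, 4.0])
print(clusters, means)   # K1 = {2,3,4,10,11,12} (mean 7), K2 = {20,25,30} (mean 25)
```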
Hierarchical Clustering
Clusters are created in levels, producing a set of clusters at each level.
Agglomerative: Initially each item in its own cluster
Iteratively clusters are merged together
Bottom Up
Divisive: Initially all items in one cluster
Large clusters are successively divided
Top Down
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
(figure: a dendrogram over objects a, b, c, d, e. Agglomerative AGNES merges bottom-up from step 0 to step 4: {a,b}, {d,e}, {c,d,e}, then {a,b,c,d,e}; divisive DIANA performs the same splits top-down, step 4 to step 0.)
The K-Medoids Clustering Method
Find representative objects, called medoids, in clusters
PAM (Partitioning Around Medoids)
starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the non-medoids if
it improves the total distance of the resulting clustering
Handles outliers well.
Ordering of input does not impact results.
Does not scale well.
Each cluster represented by one item, called the medoid.
Initial set of k medoids randomly chosen.
PAM works effectively for small data sets, but does not scale
well for large data sets
PAM (Partitioning Around Medoids)
PAM - Use real object to represent the cluster
Select k representative objects arbitrarily
For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
For each pair of i and h,
If TCih < 0, i is replaced by h
Then assign each non-selected object to the most
similar representative object
repeat steps 2-3 until there is no change
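A simplified first-improvement sketch of the PAM idea on 1-D data. Real PAM computes the swap cost TCih incrementally; here the total clustering cost is simply recomputed for each trial swap, which is equivalent for small data:

```python
def total_cost(points, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam_1d(points, k):
    """Start from the first k points as medoids; replace a medoid with a
    non-medoid whenever the swap lowers the total cost, until no swap helps."""
    medoids = list(points[:k])
    improved = True
    while improved:
        improved = False
        base = total_cost(points, medoids)
        for i in range(k):
            for h in points:
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                if total_cost(points, trial) < base:
                    medoids, improved = trial, True
                    break
            if improved:
                break
    return sorted(medoids)

pts = [2, 3, 4, 10, 11, 12, 20, 25, 30]
medoids = pam_1d(pts, 2)
print(medoids, total_cost(pts, medoids))
```

Each accepted swap strictly lowers the cost, so the loop terminates at a local optimum.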
Web Mining
Web mining is divided into three areas:
Web Content Mining: identify information within given web pages; distinguish personal home pages from other web pages.
Web Structure Mining: uses the interconnections between web pages to give weight to the pages; defines data structures of the links.
Web Usage Mining: understand access patterns and trends in order to improve site structure.
Crawlers
A robot (spider) traverses the hypertext structure of the Web.
Collects information from visited pages
Used to construct indexes for search engines
Traditional Crawler visits entire Web and replaces index
Periodic Crawler visits portions of the Web and updates subset of index
Incremental Crawler selectively searches the Web and incrementally modifies index
Focused Crawler visits pages related to a particular subject
Web Usage Mining
Performs mining on Web Usage data or Web Logs
A web log is a listing of page reference data, also
called a click stream
It can be seen from either the server perspective (better web site design)
or the client perspective (prefetching of web pages, etc.)
Web Usage Mining Applications
Personalization
Improve the structure of a site's Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and
advertising)
Web Usage Mining Activities
Preprocessing: cleanse the Web log
Remove extraneous information
Sessionize
Session: the sequence of pages referenced by one user in one sitting.
Pattern Discovery: count patterns that occur in sessions
A pattern is a sequence of page references in a session.
Similar to association rules
Transaction: session
Itemset: pattern (or subset)
Order is important
Pattern Analysis
Web Structure Mining
Mine structure (links, graph) of the Web
Techniques
PageRank
CLEVER
Create a model of the Web organization.
May be combined with content mining to more
effectively retrieve important pages.
Web as a Graph
Web pages as nodes of a graph.
Links as directed edges.
(figure: "my page" links to www.uta.edu and www.google.com; each page is a node and each link a directed edge)
Link Structure of the Web
Forward links (out-edges).
Backward links (in-edges).
Approximation of importance/quality: a page may
be of high quality if it is referred to by many other
pages, and by pages of high quality.
PageRank
Used by Google
Prioritize pages returned from search by looking at Web structure.
The importance of a page is calculated based on the number of pages which point to it (backlinks).
Weighting is used to give more importance to backlinks coming from important pages.
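A power-iteration sketch of the PageRank idea. The three-page web below is hypothetical; `links` maps each page to its forward links, rank flows along backlinks weighted by the pointing page's rank, and a (1 − d) teleport share keeps the scores well defined. Dangling pages (no out-links) spread their rank evenly:

```python
def pagerank(links, d=0.85, iters=50):
    """Iterate rank(q) = (1-d)/n + d * sum over backlinks p of
    rank(p) / outdegree(p) until (approximate) convergence."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            share = d * rank[p] / (len(outs) or n)
            for q in (outs or pages):     # dangling page: spread evenly
                new[q] += share
        rank = new
    return rank

# Hypothetical mini-web: "google" is pointed to by both other pages
web = {"mypage": ["uta", "google"], "uta": ["google"], "google": []}
ranks = pagerank(web)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The page with the most (and best-ranked) backlinks ends up first, as the text above describes.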
HITS Algorithm
Used to generate good quality authoritative pages
and hub pages
Authoritative Page: a page pointed to by many
other pages.
Hub Page: a page which points to authoritative
pages.
HITS Algorithm
Step 1: Generate Root set
Step 2: Generate Base set
Step 3: Build Graph
Step 4: Retain external links & eliminate internal links
Step 5: Calculate Authoritative & Hub score
Step 6: Generate result