  • ~ Arvind Pandi Dorai

    Lecturer, Computer Dept


  • Chapter 1


    Need of Data Warehouse

    In the 1960s, computer systems were used to maintain business data.


    As enterprises grew larger, hundreds of computer applications were needed to support business processes.

    In the 1990s, as businesses grew more complex, corporations spread globally and competition intensified, business executives became desperate for information to stay competitive and improve the bottom line.

    Companies need information to formulate business strategies, establish goals, set objectives and monitor results.

  • Data Warehouse

    Definition: A data warehouse is a relational database that maintains huge volumes of historical data to support strategic analysis and decision making.

    Strategic decisions need strong analysis, and strong analysis needs historical data. Since ERP systems do not keep historical data, the data warehouse came into the picture.

  • Data Warehouse Features

    Subject oriented - Subject specific data marts.

    Integrated - Data integrated into single uniform format.

    Time Variant - DW maintains data over a wide range of time.

    Non volatile - Data is never deleted, Rarely updated.

  • Data Warehouse Objects

    Dimension Tables:

    Dimension table key

    Textual attributes

    Drill-down & roll-up

    Multiple hierarchies

    Fact Tables:

    Foreign key

    Numeric facts

    Transaction-level data

    Aggregate data

  • Star Schema

    A large, central fact table and one table for each dimension.

    Every fact points to one tuple in each of the dimensions and has additional attributes.

    Does not capture hierarchies directly.

    De-normalized system.

    Easy to understand, easy to define hierarchies, reduces the number of joins.
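
    A minimal sketch of a star schema, using Python's built-in sqlite3 module. The table names, columns and rows (dim_product, dim_store, fact_sales) are hypothetical and only illustrate the layout: one central fact table whose foreign keys point at one row in each dimension table.

```python
import sqlite3

# Hypothetical retail star schema: one fact table, two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_desc TEXT, category TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, store_name TEXT, city TEXT);
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    units_sold   INTEGER,
    sales_amount REAL
);
INSERT INTO dim_product VALUES (1, 'iPod', 'Music Player'), (2, 'Printer', 'Peripherals');
INSERT INTO dim_store   VALUES (1, 'Store A', 'Mumbai'), (2, 'Store B', 'Pune');
INSERT INTO fact_sales  VALUES (1, 1, 3, 600.0), (1, 2, 1, 200.0), (2, 1, 2, 300.0);
""")

def sales_by_category(connection):
    """Aggregate the numeric fact with a single join to one dimension."""
    return connection.execute("""
        SELECT p.category, SUM(f.sales_amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()

print(sales_by_category(conn))
```

    Because the dimension is de-normalized, the category is reached with one join; a snowflake layout would need a further join to a separate category table.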

  • Star Schema layout

  • Star Schema Example

  • SnowFlake Schema

    Variant of the star schema model.

    A single, large, central fact table and one or more tables for each dimension.

    Dimension tables are normalized, i.e. dimension table data is split into additional tables.

    The process of making a snowflake schema is called snowflaking.

    Drawbacks: time-consuming joins, slow report generation.

  • Snowflake Schema Layout

  • Fact Constellation

    Multiple fact tables share dimension tables.

    This schema is viewed as a collection of stars, hence it is called a galaxy schema, fact constellation or family of stars.

    Sophisticated applications require such a schema.

  • Fact Constellation

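    [Figure: fact constellation layout. Two fact tables share the Store and Product dimension tables.]

    Fact Table 1: Store Key, Product Key, Period Key

    Store Dimension: Store Key, Store Name

    Product Dimension: Product Key, Product Desc

    Fact Table 2: Shipper Key, Store Key, Product Key, Period Key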

  • Chapter 2


    Meta Data: Data about data

    Types of Metadata:

    Operational Metadata

    Extraction & Transformation Metadata

    End-User Metadata

  • Information Package

    An information package (IP) gives special significance to the dimension hierarchies in the business dimensions and to the key facts in the fact table.

  • Chapter 3

    DW Architecture

  • DW Architecture

    Data Acquisition: Data Extraction, Data Transformation, Data Staging

    Data Storage: Data Loading, Data Aggregation

    Information Delivery: Reports, OLAP, Data Mining

  • Data Acquisition

    Data Extraction:

    Immediate Data Extraction

    Deferred Data Extraction

    Data Transformation:

    Splitting up of cells

    Merging up of cells

    Decoding of fields


    Date-Time format conversion

    Computed or derived fields

    Data Staging

  • Data Storage

    Data Loading:

    Initial Loading

    Incremental Loading

    Data Aggregation:

    Based on fact tables

    Based on aggregate tables

  • Information Delivery

    Reports: Aggregate data

    OLAP: Multidimensional analysis

    Data Mining: Extracting knowledge from the database

  • Chapter 4

    Principles of Dimensional Modeling

    Dimensional Modeling:

    Logical design technique to structure (arrange) the business dimensions and the fact tables.

    DM is a technique to prepare a star schema.

    Provides best data access.

    Fact table interacts with each and every business dimension.

    Drill-down & Roll-up.

  • Fully Additive Facts: When the values of an attribute can be summed up by simple addition to provide meaningful data, they are known as fully additive facts.

    Semi-Additive Facts: When summing up the values of an attribute does not by itself provide meaningful data, but some other mathematical operation on them does, they are known as semi-additive facts.

    Factless Fact Table: A fact table in which numeric facts are absent.

  • Chapter 5

    Information Access & Delivery

    OLAP is a technique that allows the user to view aggregate data across measurements along with a set of related dimensions.

    OLAP supports multidimensional analysis because data is stored in a multidimensional array.

  • OLAP Operations

    Slice: Filtering the OLAP cube to view one attribute.

    Dice: Viewing two attributes.

    Drill-down: Detailing or expanding an attribute's values.

    Roll-up: Aggregating or compressing an attribute's values.

    Rotate: Rotating the cube to view different perspectives.
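
    A minimal sketch of slice, dice and roll-up on a tiny in-memory cube. The cube contents and the helper names (slice_cube, dice_cube, roll_up) are hypothetical, and the code follows the common reading in which a slice fixes one dimension to a single value and a dice restricts several dimensions at once.

```python
from collections import defaultdict

# Hypothetical cube cells: (category, product, region) -> units sold.
cube = {
    ("Music Player", "iPod", "East"): 5,
    ("Music Player", "iPod", "West"): 3,
    ("Music Player", "Walkman", "East"): 2,
    ("Printer", "LaserJet", "West"): 4,
}

def slice_cube(cells, axis, value):
    """Slice: keep only the cells whose coordinate on one axis equals value."""
    return {k: v for k, v in cells.items() if k[axis] == value}

def dice_cube(cells, allowed):
    """Dice: keep the cells whose coordinates fall in the allowed sets on every listed axis."""
    return {k: v for k, v in cells.items()
            if all(k[axis] in values for axis, values in allowed.items())}

def roll_up(cells, axis):
    """Roll-up: aggregate the measure up to a single axis (e.g. product -> category)."""
    totals = defaultdict(int)
    for key, units in cells.items():
        totals[key[axis]] += units
    return dict(totals)

print(slice_cube(cube, 1, "iPod"))                       # Product = iPod
print(dice_cube(cube, {0: {"Music Player"}, 2: {"East"}}))
print(roll_up(cube, 0))                                  # totals per category
```

    Drill-down is the reverse of roll_up: going back from the category totals to the per-product cells.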


  • OLAP Operations

    Slice and Dice

    [Figure: slicing the cube on Product = iPod]


  • OLAP Operations

    Drill Down



    Category (e.g. Music Player) → Sub Category (e.g. MP3) → Product (e.g. iPod)

  • OLAP Operations

    Roll Up



    Product (e.g. iPod) → Sub Category (e.g. MP3) → Category (e.g. Music Player)

  • OLAP Operations






  • OLAP Server

    An OLAP Server is a high capacity, multi-user data

    manipulation engine specifically designed to support

    and operate on multi-dimensional data structure.

    OLAP server available are

    MOLAP server

    ROLAP server

    HOLAP server

  • Chapter 6 Implementation & Maintenance

    IMPLEMENTATION:

    Monitoring: Sending data from sources

    Integrating: Loading, cleansing, ...

    Processing: Query processing, indexing, ...

    Managing: Metadata, Design, ...

  • Maintenance

    Maintenance is an issue for materialized views.

    Incremental updating

  • View and Materialized Views


    View: a derived relation defined in terms of base (stored) relations.

    Materialized views

    A view can be materialized by storing the tuples of the view in the database.

    Index structures can be built on the materialized view.

  • Overview

    Extracting knowledge

    Perform analysis

    Use DM Algorithms

  • Knowledge Discovery in Database

  • Steps In KDD Process

    Data Cleaning

    Data Integration

    Data Selection

    Data Transformation

    Data mining

    Pattern Evaluation

    Knowledge Presentation

  • Architecture of DM

  • DM Algorithms

    Association: Relationship between item sets.

    Used in Market basket analysis.

    Eg: Apriori & FP Tree

    Classification: Classify each item to predefined groups.

    Eg: Naive Bayesian & ID3

    Clustering: Each item is divided into dynamically generated groups.

    Eg: K-means & K-medoids

  • Example: Market Basket Data

    Items frequently purchased together:

    Computer → Printer






    Objective: increase sales and reduce costs

    Called Market Basket Analysis, Shopping Cart Analysis

  • Transaction Data: Supermarket Data

    Market basket transactions:

    t1: {bread, cheese, milk}

    t2: {apple, jam, salt, ice-cream}

    tn: {biscuit, jam, milk}

    Concepts:

    An item: an item/article in a basket

    I: the set of all items sold in the store

    A transaction: items purchased in a basket; it may have a TID (transaction ID)

    A transactional dataset: a set of transactions

  • Association Rule Definitions

    Association Rule (AR): an implication $X \Rightarrow Y$ where $X, Y \subseteq I$ and $X \cap Y = \emptyset$.

    Support of AR ($s$) $X \Rightarrow Y$: percentage of transactions that contain $X \cup Y$.

    Confidence of AR ($\alpha$) $X \Rightarrow Y$: ratio of the number of transactions that contain $X \cup Y$ to the number that contain $X$.

  • Association Rule Problem

    Given a set of items $I = \{I_1, I_2, \dots, I_m\}$ and a database of transactions $D = \{t_1, t_2, \dots, t_n\}$, where $t_i = \{I_{i1}, I_{i2}, \dots, I_{ik}\}$ and $I_{ij} \in I$, the Association Rule Problem is to identify all association rules $X \Rightarrow Y$ with a minimum support and confidence.


    Link Analysis

  • Association Rule Mining Task

    Given a set of transactions T, the goal of association rule

    mining is to find all rules having

    support $\ge$ minsup threshold

    confidence $\ge$ minconf threshold

    Brute-force approach:

    List all possible association rules

    Compute the support and confidence for each rule

    Prune rules that fail the minsup and minconf thresholds

  • Example

    Transaction data:

    t1: Butter, Cocoa, Milk

    t2: Butter, Cheese

    t3: Cheese, Boots

    t4: Butter, Cocoa, Cheese

    t5: Butter, Cocoa, Clothes, Cheese, Milk

    t6: Cocoa, Clothes, Milk

    t7: Cocoa, Milk, Clothes

    minsup = 30%

    minconf = 80%

    An example frequent itemset:

    {Cocoa, Clothes, Milk} [sup = 3/7]

    Association rules from the itemset:

    Clothes $\Rightarrow$ Milk, Cocoa [sup = 3/7, conf = 3/3]

    Clothes, Cocoa $\Rightarrow$ Milk [sup = 3/7, conf = 3/3]
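
    The support and confidence figures in this example can be checked with a short sketch. The helper names support and confidence are illustrative; exact fractions are used so that the results read like the slide (3/7, 3/3).

```python
from fractions import Fraction

# The seven transactions t1..t7 from the example.
transactions = [
    {"Butter", "Cocoa", "Milk"},
    {"Butter", "Cheese"},
    {"Cheese", "Boots"},
    {"Butter", "Cocoa", "Cheese"},
    {"Butter", "Cocoa", "Clothes", "Cheese", "Milk"},
    {"Cocoa", "Clothes", "Milk"},
    {"Cocoa", "Milk", "Clothes"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= t for t in db), len(db))

def confidence(lhs, rhs, db):
    """support(lhs and rhs together) divided by support(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support({"Cocoa", "Clothes", "Milk"}, transactions))        # 3/7
print(confidence({"Clothes"}, {"Milk", "Cocoa"}, transactions))   # 1
print(confidence({"Clothes", "Cocoa"}, {"Milk"}, transactions))   # 1
```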

  • Mining Association Rules

    Two-step approach:

    1. Frequent Itemset Generation

    Generate all itemsets whose support $\ge$ minsup

    2. Rule Generation

    Generate high confidence rules from each frequent

    itemset, where each rule is a binary partitioning of a

    frequent itemset

    Frequent itemset generation is still computationally expensive.


  • Step 1: Generate Candidate & Frequent Item Sets

    Let k=1

    Generate frequent itemsets of length 1

    Repeat until no new frequent itemsets are identified

    Generate length (k+1) candidate itemsets from length k frequent itemsets

    Prune candidate itemsets containing subsets of length k that are infrequent

    Count the support of each candidate by scanning the DB

    Eliminate candidates that are infrequent, leaving only those that are frequent
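
    The loop above can be sketched as follows. This is a small illustrative version, not an optimized implementation; the function name apriori and the use of Python sets are assumptions, and the transactions are the seven from the earlier example.

```python
from itertools import combinations

def apriori(db, minsup):
    """Return {itemset: support count} for all itemsets with support >= minsup (a fraction)."""
    min_count = minsup * len(db)
    counts = {}
    for t in db:                                   # first scan: count single items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 1
    while frequent:
        # Generate length (k+1) candidates from pairs of frequent k-itemsets.
        candidates = {a | b for a, b in combinations(list(frequent), 2)
                      if len(a | b) == k + 1}
        # Prune candidates that contain an infrequent subset of length k.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Count support with one scan of the DB, then drop infrequent candidates.
        counts = {c: sum(c <= t for t in db) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result

db = [
    {"Butter", "Cocoa", "Milk"}, {"Butter", "Cheese"}, {"Cheese", "Boots"},
    {"Butter", "Cocoa", "Cheese"}, {"Butter", "Cocoa", "Clothes", "Cheese", "Milk"},
    {"Cocoa", "Clothes", "Milk"}, {"Cocoa", "Milk", "Clothes"},
]
frequent_itemsets = apriori(db, minsup=0.30)
print(frequent_itemsets[frozenset({"Cocoa", "Clothes", "Milk"})])
```

    With minsup = 30% an itemset needs at least 3 of the 7 transactions, so {Cocoa, Clothes, Milk} is the only frequent 3-itemset.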

  • Apriori Algorithm Example

  • Step 2: Generating Rules From Frequent Itemsets

    Frequent itemsets $\ne$ association rules. One more step is needed to generate association rules.

    For each frequent itemset $X$, for each proper nonempty subset $A$ of $X$:

    Let $B = X - A$.

    $A \Rightarrow B$ is an association rule if confidence($A \Rightarrow B$) $\ge$ minconf, where

    support($A \Rightarrow B$) = support($A \cup B$) = support($X$)

    confidence($A \Rightarrow B$) = support($A \cup B$) / support($A$)

  • Generating Rules: An example

    Suppose {2,3,4} is frequent, with sup=50%

    Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4},

    with sup=50%, 50%, 75%, 75%, 75%, 75% respectively

    These generate these association rules:

    2,3 $\Rightarrow$ 4, confidence=100%

    2,4 $\Rightarrow$ 3, confidence=100%

    3,4 $\Rightarrow$ 2, confidence=67%

    2 $\Rightarrow$ 3,4, confidence=67%

    3 $\Rightarrow$ 2,4, confidence=67%

    4 $\Rightarrow$ 2,3, confidence=67%

    All rules have support = 50%
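
    A short sketch that reproduces these confidences from the recorded supports alone, without touching the transaction data. The names generate_rules and supports are illustrative.

```python
from itertools import combinations

def generate_rules(itemset, supports, minconf):
    """Return (lhs, rhs, confidence) for each binary partition of a frequent itemset."""
    items = frozenset(itemset)
    rules = []
    for size in range(1, len(items)):               # proper nonempty subsets only
        for lhs in combinations(sorted(items), size):
            lhs = frozenset(lhs)
            conf = supports[items] / supports[lhs]  # support(X) / support(A)
            if conf >= minconf:
                rules.append((lhs, items - lhs, conf))
    return rules

# Supports recorded during frequent itemset generation (from the example).
supports = {
    frozenset({2, 3, 4}): 0.50,
    frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
}

for lhs, rhs, conf in generate_rules({2, 3, 4}, supports, minconf=0.0):
    print(sorted(lhs), "=>", sorted(rhs), f"confidence={conf:.0%}")
```

    Raising minconf to 80% keeps only the two rules with 100% confidence.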

  • Rule Generation

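    Given a frequent itemset $L$, find all non-empty subsets $f \subset L$ such that $f \Rightarrow L - f$ satisfies the minimum confidence requirement.

    If {A,B,C,D} is a frequent itemset, candidate rules:

    $ABC \Rightarrow D$, $ABD \Rightarrow C$, $ACD \Rightarrow B$, $BCD \Rightarrow A$,

    $A \Rightarrow BCD$, $B \Rightarrow ACD$, $C \Rightarrow ABD$, $D \Rightarrow ABC$,

    $AB \Rightarrow CD$, $AC \Rightarrow BD$, $AD \Rightarrow BC$, $BC \Rightarrow AD$,

    $BD \Rightarrow AC$, $CD \Rightarrow AB$

    If $|L| = k$, then there are $2^k - 2$ candidate association rules (ignoring $L \Rightarrow \emptyset$ and $\emptyset \Rightarrow L$).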

  • Generating Rules

    To recap, in order to obtain $A \Rightarrow B$, we need to have support($A \cup B$) and support($A$).

    All the required information for confidence computation has already been recorded in itemset generation. No need to see the data T any more.

    This step is not as time-consuming as frequent itemsets generation.

  • Rule Generation

    How to efficiently generate rules from frequent itemsets?

    In general, confidence does not have an anti-monotone property

    $c(ABC \Rightarrow D)$ can be larger or smaller than $c(AB \Rightarrow D)$

    But confidence of rules generated from the same itemset has an anti-monotone property

    e.g., L = {A,B,C,D}: $c(ABC \Rightarrow D) \ge c(AB \Rightarrow CD) \ge c(A \Rightarrow BCD)$

  • Apriori Advantages/Disadvantages


    Advantages:

    Uses large itemset property.

    Easily parallelized.

    Easy to implement.

    Disadvantages:

    Assumes transaction database is memory resident.

    Requires up to m database scans.

  • Mining Frequent Patterns Without Candidate Generation

    Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure:

    highly condensed, but complete for frequent pattern mining

    avoids costly database scans

    Develop an efficient, FP-tree-based frequent pattern mining method:

    a divide-and-conquer methodology: decompose mining tasks into smaller ones

    avoids candidate generation: sub-database test only

  • Construct FP-tree From A Transaction DB


    min_support = 0.5

    TID   Items bought                 (L-order) freq items
    100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
    200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
    300   {b, f, h, j, o}              {f, b}
    400   {b, c, k, s, p}              {c, b, p}
    500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

    Steps:

    1. Scan DB once, find frequent 1-itemsets (single item patterns)

    2. Order frequent items in frequency descending order

    3. Scan DB again, construct FP-tree

    Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3

    Resulting FP-tree (each node is item:count; indentation shows parent and child):

    {}
      f:4
        c:3
          a:3
            m:2
              p:2
            b:1
              m:1
        b:1
      c:1
        b:1
          p:1
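
    A minimal sketch of the two-scan construction, run on the five transactions of this example. The Node class and build_fp_tree function are illustrative names. Items with equal frequency can be ordered either way, so the sketch takes the item order as an optional argument and passes the slide's order f, c, a, b, m, p explicitly.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(db, min_count, order=None):
    # Scan 1: count items and keep the frequent ones.
    counts = Counter(item for t in db for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    if order is None:
        # Frequency-descending order; ties broken alphabetically (an arbitrary choice).
        order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: pos for pos, item in enumerate(order)}
    root = Node(None)
    header = {item: [] for item in order}        # item -> its nodes (the node-links)
    # Scan 2: insert each transaction's frequent items in the chosen order.
    for t in db:
        path = sorted((i for i in set(t) if i in rank), key=rank.get)
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

# Each string is one transaction; each character is one item.
db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
root, header = build_fp_tree(db, min_count=3, order="fcabmp")
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
```

    min_support = 0.5 over five transactions means an item needs a count of at least 3, which is why min_count=3 is used.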

  • Benefits of the FP-tree Structure


    Completeness:

    never breaks a long pattern of any transaction

    preserves complete information for frequent pattern mining

    Compactness:

    reduces irrelevant information: infrequent items are gone

    frequency descending ordering: more frequent items are more likely to be shared

    never larger than the original database (not counting node-links and counts)

  • Mining Frequent Patterns Using FP-tree

    General idea (divide-and-conquer)

    Recursively grow frequent pattern paths using the FP-tree

    For each item, construct its conditional pattern-base, and then its conditional FP-tree

    Repeat the process on each newly created conditional FP-tree

    Until the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)

  • Major Steps to Mine FP-tree

    1) Construct conditional pattern base for each

    node in the FP-tree

    2) Construct conditional FP-tree from each

    conditional pattern-base

    3) Recursively mine conditional FP-trees and

    grow frequent patterns obtained so far

    If the conditional FP-tree contains a single path,

    simply enumerate all the patterns

  • Step 1: FP-tree to Conditional Pattern Base

    Starting at the frequent header table in the FP-tree:

    Traverse the FP-tree by following the link of each frequent item

    Accumulate all of the transformed prefix paths of that item to form a conditional pattern base

    Conditional pattern bases:

    item   cond. pattern base
    c      f:3
    a      fc:3
    b      fca:1, f:1, c:1
    m      fca:2, fcab:1
    p      fcam:2, cb:1



  • Step 2: Construct Conditional FP-tree

    For each pattern base:

    Accumulate the count for each item in the base

    Construct the FP-tree for the frequent items of the pattern base

    m-conditional pattern base: fca:2, fcab:1

    m-conditional FP-tree: {} → f:3 → c:3 → a:3

    All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam




  • Mining Frequent Patterns by Creating Conditional Pattern-Bases

    Item   Conditional pattern-base        Conditional FP-tree
    p      {(fcam:2), (cb:1)}              {(c:3)}|p
    m      {(fca:2), (fcab:1)}             {(f:3, c:3, a:3)}|m
    b      {(fca:1), (f:1), (c:1)}         Empty
    a      {(fc:3)}                        {(f:3, c:3)}|a
    c      {(f:3)}                         {(f:3)}|c
    f      Empty                           Empty
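
    The table can be cross-checked with a short sketch. As a simplifying assumption it skips the tree and works directly on the ordered frequent-item lists of the five transactions: the prefix that precedes an item in each ordered transaction is exactly the prefix path the FP-tree would return, so the accumulated counts are the same. Function names are illustrative.

```python
from collections import Counter

# Frequent items of each transaction in L-order (f, c, a, b, m, p).
ordered_db = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

def conditional_pattern_base(item, ordered):
    """Count the prefixes that precede item in the ordered transactions."""
    base = Counter()
    for t in ordered:
        if item in t:
            prefix = t[: t.index(item)]
            if prefix:
                base[prefix] += 1
    return dict(base)

def conditional_frequent_items(base, min_count):
    """Accumulate item counts over a pattern base and keep the frequent ones."""
    counts = Counter()
    for prefix, n in base.items():
        for i in prefix:
            counts[i] += n
    return {i: c for i, c in counts.items() if c >= min_count}

for item in "pmbacf":
    base = conditional_pattern_base(item, ordered_db)
    print(item, base, conditional_frequent_items(base, min_count=3))
```

    The items kept by conditional_frequent_items are the nodes of the corresponding conditional FP-tree, for example f, c and a (each with count 3) for m, and nothing for b.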

  • Step 3: Recursively mine the conditional FP-tree

    m-conditional FP-tree: {} → f:3 → c:3 → a:3

    Cond. pattern base of "am": (fc:3); am-conditional FP-tree: {} → f:3 → c:3

    Cond. pattern base of "cm": (f:3); cm-conditional FP-tree: {} → f:3

    Cond. pattern base of "cam": (f:3); cam-conditional FP-tree: {} → f:3

  • Single FP-tree Path Generation

    Suppose an FP-tree T has a single path P.

    The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P.

    m-conditional FP-tree: {} → f:3 → c:3 → a:3

    All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
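
    Enumerating the combinations of a single path is a few lines of code. The function name patterns_from_single_path is illustrative; the path f, c, a and the suffix m come from the m-conditional FP-tree.

```python
from itertools import combinations

def patterns_from_single_path(path, suffix):
    """Every combination of the path's items (including none), extended with the suffix."""
    patterns = []
    for size in range(len(path) + 1):
        for combo in combinations(path, size):
            patterns.append("".join(combo) + suffix)
    return patterns

print(patterns_from_single_path("fca", "m"))
# ['m', 'fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']
```

    A path with n items yields 2^n patterns, here 2^3 = 8.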


  • Classification

    Given old data about customers and payments, predict a new applicant's loan eligibility.

    [Figure: records of previous customers (Age, Salary, Profession, Location, Customer) train a classifier, here a decision tree with tests such as Salary > 5 K and Prof. = Exec, which is then applied to the new applicant's data.]



  • Overview of Naive Bayes

    The goal of Naive Bayes is to work out whether a new example is in a class given that it has a certain combination of attribute values. We work out the likelihood of the example being in each class given the evidence (its attribute values), and take the highest likelihood as the classification.

    Bayes Rule:

    $$P[H \mid E] = \frac{P[E \mid H]\, P[H]}{P[E]}$$

    where E is the event (evidence) that has occurred and H is the hypothesis.

    P[H] is called the prior probability (of the hypothesis). P[H|E] is called the posterior probability (of the hypothesis given the evidence).
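
    A minimal sketch of a Naive Bayes classifier for categorical attributes. The loan data (salary band, profession) is invented for illustration, the function names are assumptions, and no smoothing is applied, so an attribute value never seen with a class gives that class a score of zero. The shared denominator P[E] is dropped because it does not change which class scores highest.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    priors = Counter(labels)
    cond = defaultdict(Counter)          # (attribute index, label) -> value counts
    for row, label in zip(rows, labels):
        for idx, value in enumerate(row):
            cond[(idx, label)][value] += 1
    return priors, cond

def classify(row, priors, cond):
    """Score each class as P[H] times the product of P[E_i | H]; return the best."""
    total = sum(priors.values())
    scores = {}
    for label, n in priors.items():
        score = n / total                                # prior P[H]
        for idx, value in enumerate(row):
            score *= cond[(idx, label)][value] / n       # P[E_i | H]
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical previous customers: (salary band, profession) -> loan repaid?
rows = [("high", "exec"), ("high", "clerk"), ("low", "exec"),
        ("low", "clerk"), ("high", "exec"), ("low", "clerk")]
labels = ["yes", "yes", "yes", "no", "yes", "no"]

priors, cond = train(rows, labels)
label, scores = classify(("high", "clerk"), priors, cond)
print(label, scores)
```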





  • ID3 (Decision Tree Algorithm)

    ID3 was the first proper decision tree algorithm to use this


    Building a decision tree with ID3 algorithm

    1. Select the attribute with the most gain

    2. Create the subsets for each value of the attribute

    3. For each subset: if not all the elements of the subset belong to the same class, repeat steps 1-3 for the subset

  • ID3 (Decision Tree Algorithm)

    Function DecisionTreeLearner(Examples, Target_Class, Attributes)
      create a Root node for the tree
      if all Examples are positive, return the single-node tree Root, with label = Yes
      if all Examples are negative, return the single-node tree Root, with label = No
      if Attributes list is empty,
        return the single-node tree Root, with label = most common value of Target_Class in Examples
      else
        A = the attribute from Attributes with the highest information gain with respect to Examples
        Make A the decision attribute for Root
        for each possible value v of A:
          add a new tree branch below Root, corresponding to the test A = v
          let Examples_v be the subset of Examples that have value v for attribute A
          if Examples_v is empty then
            add a leaf node below this new branch with label = most common value of Target_Class in Examples
          else
            add the subtree DecisionTreeLearner(Examples_v, Target_Class, Attributes - { A })
          end if
        end
      return Root
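
    The pseudocode can be sketched in Python as below. Entropy-based information gain is used as the selection measure, the tree is returned as nested dictionaries, and the loan data is invented for illustration. One simplification: branches are created only for attribute values that actually occur in the examples, so the empty-subset case of the pseudocode never arises.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    if len(set(labels)) == 1:                       # all examples in one class
        return labels[0]
    if not attributes:                              # nothing left to split on
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in sorted({r[best] for r in rows}):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(sub_rows, sub_labels, remaining)
    return tree

# Hypothetical previous customers.
rows = [
    {"salary": "high", "prof": "exec"},
    {"salary": "high", "prof": "clerk"},
    {"salary": "low", "prof": "exec"},
    {"salary": "low", "prof": "clerk"},
    {"salary": "high", "prof": "clerk"},
]
labels = ["yes", "yes", "yes", "no", "yes"]

tree = id3(rows, labels, ["salary", "prof"])
print(tree)
```

    On this data salary has the higher gain, so it becomes the root test, and prof is only consulted for low-salary applicants.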

  • Decision Trees (Summary)

    Advantages of ID3

    automatically creates knowledge from data

    can discover new knowledge (watch out for counter-intuitive rules)

    avoids knowledge acquisition bottleneck

    identifies most discriminating attribute first

    trees can be converted to rules

    Disadvantages of ID3

    several identical examples have the same effect as a single example


    trees can become large and difficult to understand

    cannot deal with contradictory examples

    examines attributes individually: does not consider

    effects of inter-attribute relationships

  • Clustering


    Cluster: a collection of data objects

    Similar to one another within the same cluster

    Dissimilar to the objects in other clusters

    Cluster analysis

    Grouping a set of data objects into clusters

    Clustering is unsupervised classification: no predefined classes

    Typical applications

    As a stand-alone tool to get insight into data distribution

    As a preprocessing step for other algorithms

  • Partitional Clustering


    Creates clusters in one step as opposed to several steps.


    Since only one set of clusters is output, the user

    normally has to input the desired number of

    clusters, k.

    Usually deals with static sets.

  • K-Means

    Initial set of clusters randomly chosen.

    Iteratively, items are moved among sets of clusters

    until the desired set is reached.

    High degree of similarity among elements in a cluster is obtained.

    Given a cluster $K_i = \{t_{i1}, t_{i2}, \dots, t_{im}\}$, the cluster mean is

    $m_i = \frac{1}{m}(t_{i1} + \dots + t_{im})$

  • K-Means Example

    Given: {2,4,10,12,3,20,30,11,25}, k=2

    Randomly assign means: m1=3,m2=4

    K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5,m2=16

    K1={2,3,4},K2={10,12,20,30,11,25}, m1=3,m2=18

    K1={2,3,4,10},K2={12,20,30,11,25}, m1=4.75,m2=19.6

    K1={2,3,4,10,11,12},K2={20,30,25}, m1=7,m2=25

    Stop as the clusters with these means are the same.
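
    The iterations of this example can be reproduced with a short sketch for one-dimensional data. The function name kmeans_1d and the history list are illustrative; distance is the absolute difference, and the loop stops when the means no longer change.

```python
def kmeans_1d(points, means, max_iter=100):
    """Assign each point to its nearest mean, recompute the means, repeat until stable."""
    history = []
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous mean.
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        history.append((clusters, new_means))
        if new_means == means:
            break
        means = new_means
    return clusters, means, history

points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means, history = kmeans_1d(points, means=[3, 4])
for step_clusters, step_means in history:
    print(step_clusters, step_means)
```

    The first four recorded steps match the four lines of the example; the fifth step produces the same clusters and means, which is the stopping condition.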

  • Hierarchical Clustering

    Clusters are created in levels, actually creating sets of clusters at each level.

    Agglomerative: Initially each item in its own cluster

    Iteratively clusters are merged together

    Bottom Up

    Divisive: Initially all items in one cluster

    Large clusters are successively divided

    Top Down

  • Hierarchical Clustering

    Use distance matrix as clustering criteria. This method

    does not require the number of clusters k as an input,

    but needs a termination condition

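    [Figure: hierarchical clustering of objects a, b, c, d, e. Agglomerative (Step 0 to Step 4): a and b merge into ab; d and e merge into de; c joins de to form cde; ab and cde merge into abcde. Divisive (Step 0 to Step 4) follows the same steps in the reverse order.]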





  • The K-Medoids Clustering Method

    Find representative objects, called medoids, in clusters

    PAM (Partitioning Around Medoids):

    starts from an initial set of medoids and iteratively

    replaces one of the medoids by one of the non-medoids if

    it improves the total distance of the resulting clustering

    Handles outliers well.

    Ordering of input does not impact results.

    Does not scale well.

    Each cluster represented by one item, called the medoid.

    Initial set of k medoids randomly chosen.

    PAM works effectively for small data sets, but does not scale

    well for large data sets

  • PAM (Partitioning Around Medoids)

    PAM uses real objects to represent the clusters:

    1. Select k representative objects arbitrarily

    2. For each pair of non-selected object h and selected object i, calculate the total swapping cost $TC_{ih}$

    3. For each pair of i and h: if $TC_{ih} < 0$, i is replaced by h; then assign each non-selected object to the most similar representative object

    4. Repeat steps 2-3 until there is no change
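
    A minimal sketch of the swap loop for one-dimensional data. It is a simplified variant that accepts the first improving swap it finds rather than comparing all pairs first; the names total_cost and pam, the data and the initial medoids are illustrative assumptions.

```python
def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, medoids):
    medoids = list(medoids)
    improved = True
    while improved:                                   # repeat until there is no change
        improved = False
        for i in range(len(medoids)):
            for h in points:
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                # TC_ih: change in total distance if medoid i is swapped for h.
                tc = total_cost(points, candidate) - total_cost(points, medoids)
                if tc < 0:
                    medoids = candidate
                    improved = True
    # Assign every object to its most similar (nearest) medoid.
    clusters = {m: [p for p in points if min(medoids, key=lambda x: abs(p - x)) == m]
                for m in medoids}
    return medoids, clusters

points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
medoids, clusters = pam(points, medoids=[2, 4])
print(medoids, clusters, total_cost(points, medoids))
```

    Every accepted swap strictly lowers the total cost, so the loop always terminates; each call to total_cost rescans all points, which is one reason PAM does not scale well to large data sets.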

  • PAM

  • Web Mining

    [Figure: taxonomy of Web mining]

    Web Content Mining: Identify information within given web pages; distinguish personal home pages from other web pages.

    Web Structure Mining: Uses interconnections between web pages to give weight to the pages; defines data structures of the links.

    Web Usage Mining: Understand access patterns and the trends to improve structure.

  • Crawlers

    Robot (spider) traverses the hypertext structure in the Web.

    Collect information from visited pages

    Used to construct indexes for search engines

    Traditional Crawler visits entire Web and replaces index

    Periodic Crawler visits portions of the Web and updates subset of index

    Incremental Crawler selectively searches the Web and incrementally modifies index

    Focused Crawler visits pages related to a particular subject

  • Web Usage Mining

    Performs mining on Web Usage data or Web Logs

    A web log is a listing of page reference data, also called a clickstream.

    Can be seen from either the server perspective (better web site design) or the client perspective (prefetching of web pages, etc.).

  • Web Usage Mining Applications


    Improve structure of a site's Web pages

    Aid in caching and prediction of future page references

    Improve design of individual pages

    Improve effectiveness of e-commerce (sales and advertising)


  • Web Usage Mining Activities

    Preprocessing Web log:

    Cleanse: remove extraneous information

    Session: sequence of pages referenced by one user at a sitting.

    Pattern Discovery:

    Count patterns that occur in sessions

    Pattern is a sequence of page references in a session.

    Similar to association rules

    Transaction: session

    Itemset: pattern (or subset)

    Order is important

    Pattern Analysis

  • Web Structure Mining

    Mine structure (links, graph) of the Web




    Create a model of the Web organization.

    May be combined with content mining to more

    effectively retrieve important pages.

  • Web as a Graph

    Web pages as nodes of a graph.

    Links as directed edges.


  • Link Structure of the Web

    Forward links (out-edges).

    Backward links (in-edges).

    Approximation of importance/quality: a page may

    be of high quality if it is referred to by many other

    pages, and by pages of high quality.

  • PageRank

    Used by Google

    Prioritize pages returned from search by looking at Web structure.

    Importance of a page is calculated based on the number of pages which point to it (backlinks).

    Weighting is used to give more importance to backlinks coming from important pages.
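
    The backlink weighting idea can be sketched as a simple power iteration. This is the commonly described textbook formulation, not a statement of how any search engine implements it; the damping factor 0.85, the iteration count and the tiny link graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no out-links: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / len(pages)
        rank = new_rank
    return rank

# Hypothetical web graph: A, B and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```

    C has the most backlinks and ranks highest; A has a single backlink, but it comes from the important page C, so A ranks well above B and D.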

  • HITS Algorithm

    Used to generate good quality authoritative pages

    and hub pages

    Authoritative Page: A page pointed to by many other pages.

    Hub Page: A page which points to authoritative pages.


  • HITS Algorithm

    Step 1: Generate Root set

    Step 2: Generate Base set

    Step 3: Build Graph

    Step 4: Retain external links & eliminate internal links

    Step 5: Calculate Authoritative & Hub score

    Step 6: Generate result
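
    Step 5 can be sketched as the usual mutual-reinforcement iteration: a page's authority score is the sum of the hub scores of the pages pointing to it, and its hub score is the sum of the authority scores of the pages it points to. The graph, the function name hits and the iteration count are assumptions; building the root and base sets (steps 1 to 4) is not shown.

```python
import math

def hits(links, iterations=20):
    """links maps each page to the list of pages it points to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to this page.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of authority scores of the pages this page links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Hypothetical base set: three pages linking to two candidate authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "h3": ["a1"]}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

    Scores are normalized after every update so that they stay bounded; only their relative order matters for the final result.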