2. NOTES.doc


  • 8/11/2019 2. NOTES.doc

    1/70

    CS2032 DATA WAREHOUSING AND DATA MINING

    Department of Information Technology

    UNIT I

    DATA WAREHOUSING

    Data Warehouse Introduction

    A data warehouse is a collection of data marts representing historical data from different operations in the company. This data is stored in a structure optimized for querying and data analysis. Table design, dimensions and organization should be consistent throughout a data warehouse so that reports or queries across the data warehouse are consistent. A data warehouse can also be viewed as a database for historical data from different functions within a company.

    The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in the following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process."

    He defined the terms in the sentence as follows:

    Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.

    Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.

    Time-variant: All data in the data warehouse is identified with a particular time period.

    Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed.

    This enables management to gain a consistent picture of the business. It is a single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a form they can understand and use in a business context. It can be

    • Used for decision support
    • Used to manage and control business
    • Used by managers and end users to understand the business and make judgments

    Data warehousing is an architectural construct of information systems that provides users with current and historical decision support information that is hard to access or present in traditional operational data stores.

    Other important terminology

    Enterprise Data warehouse: It collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization.

    Data Mart: Departmental subsets that focus on selected subjects. A data mart is a segment of a data warehouse that can provide data for reporting and analysis on a section, unit, department or operation in


    the company, e.g. sales, payroll, production. Data marts are sometimes complete individual data warehouses which are usually smaller than the corporate data warehouse.

    Decision Support System (DSS): Information technology to help the knowledge worker (executive, manager, and analyst) make faster and better decisions.

    Drill-down: Traversing the summarization levels from highly summarized data to the underlying current or old detail.

    Metadata: Data about data. Containing location and description of warehouse system components: names, definition, structure, ...
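    The drill-down term defined above can be illustrated with a small sketch. The sample rows, the two summarization levels (year and month) and the function names are assumptions made for illustration only; they are not taken from the notes.

```python
from collections import defaultdict

# Hypothetical detail rows: (year, month, amount)
SALES = [
    (2018, 1, 100.0),
    (2018, 1, 50.0),
    (2018, 2, 70.0),
    (2019, 1, 30.0),
]

def summarize(rows, level):
    """Aggregate amounts at the given level: 'year' or 'month'."""
    totals = defaultdict(float)
    for year, month, amount in rows:
        key = (year,) if level == "year" else (year, month)
        totals[key] += amount
    return dict(totals)

def drill_down(rows, year):
    """Move from one yearly total to the monthly totals underneath it."""
    monthly = summarize(rows, "month")
    return {key: total for key, total in monthly.items() if key[0] == year}

by_year = summarize(SALES, "year")        # highly summarized level
detail_2018 = drill_down(SALES, 2018)     # one level closer to the detail
```

    The monthly totals returned by drill_down add up to the yearly total they were expanded from, which is the property that makes moving between summarization levels consistent.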

    Benefits of data warehousing

    • Data warehouses are designed to perform well with aggregate queries running on large amounts of data.
    • The structure of data warehouses is easier for end users to navigate, understand and query against, unlike relational databases primarily designed to handle lots of transactions.
    • Data warehouses enable queries that cut across different segments of a company's operation. E.g. production data could be compared against inventory data even if they were originally stored in different databases with different structures.
    • Queries that would be complex in very normalized databases could be easier to build and maintain in data warehouses, decreasing the workload on transaction systems.
    • Data warehousing is an efficient way to manage and report on data that is from a variety of sources, non uniform and scattered throughout a company.
    • Data warehousing is an efficient way to manage demand for lots of information from lots of users.
    • Data warehousing provides the capability to analyze large amounts of historical data for nuggets of wisdom that can provide an organization with competitive advantage.

    Operational and informational Data

    Operational Data:
    • Focusing on transactional function such as bank card withdrawals and deposits
    • Detailed
    • Updateable
    • Reflects current data

    Informational Data:
    • Focusing on providing answers to problems posed by decision makers
    • Summarized
    • Non updateable
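    The contrast between operational and informational data listed above can be sketched as follows. The account number, the amounts and the function names are hypothetical; the read-only summary is only one possible way to model "non updateable" data.

```python
from types import MappingProxyType

# Operational data: detailed, updateable, reflects the current state.
accounts = {"A-1": 100.0}   # hypothetical current balance
history = []                # detailed transaction log

def post(account, amount):
    """Apply one deposit (positive) or withdrawal (negative) in place."""
    accounts[account] += amount
    history.append((account, amount))

post("A-1", -40.0)   # card withdrawal
post("A-1", 15.0)    # deposit

def summarize(log):
    """Informational data: summarized and exposed as a read-only view."""
    totals = {"deposits": 0.0, "withdrawals": 0.0}
    for _, amount in log:
        if amount >= 0:
            totals["deposits"] += amount
        else:
            totals["withdrawals"] += -amount
    return MappingProxyType(totals)

summary = summarize(history)
```

    Assigning to summary raises a TypeError, while accounts keeps changing with every posted transaction.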

    Data Warehouse Characteristics

    A data warehouse can be viewed as an information system with the following attributes:

    • It is a database designed for analytical tasks
    • Its content is periodically updated
    • It contains current and historical data to provide a historical perspective of information


    Operational data store (ODS)

    • ODS is an architecture concept to support day-to-day operational decision support and contains current value data propagated from operational applications
    • ODS is subject-oriented, similar to a classic definition of a data warehouse
    • ODS is integrated

    [Table: ODS vs. DATA WAREHOUSE]


    7. Information delivery system

    A data warehouse is an environment, not a product. It is based on a relational database management system that functions as the central repository for informational data.

    The central information repository is surrounded by a number of key components designed to make the environment functional, manageable and accessible.

    The data in the warehouse comes from operational applications. As data enters the warehouse, it is transformed into an integrated structure and format. The transformation process involves conversion, summarization, filtering and condensation. The data warehouse must be capable of holding and managing large volumes of data as well as different data structures over time.

    1. Data warehouse database

    This is the central part of the data warehousing environment. It is item number 2 in the architecture diagram above, and it is implemented based on RDBMS technology.

    2. Sourcing, Acquisition, Clean up, and Transformation Tools

    These tools are item number 1 in the architecture diagram above. They perform conversions, summarization, key changes, structural changes and condensation. The data transformation is required so that the information can be used by decision support tools. The transformation produces programs, control statements, JCL


    code, COBOL code, UNIX scripts, SQL DDL code, etc., to move the data into the data warehouse from multiple operational systems.

    The functionalities of these tools are listed below:

    • Removing unwanted data from operational databases
    • Converting to common data names and attributes
    • Calculating summaries and derived data
    • Establishing defaults for missing data
    • Accommodating source data definition changes
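    Several of the functions listed above (removing unwanted data, converting to common names, establishing defaults, calculating summaries) can be sketched in a few lines. The field names, the name mapping and the default value below are assumptions chosen for illustration, not part of any real transformation tool.

```python
from collections import defaultdict

# Hypothetical mapping from source-specific column names to common names.
COMMON_NAMES = {"cust_no": "customer_id", "custid": "customer_id", "amt": "amount"}
DEFAULTS = {"region": "UNKNOWN"}   # assumed default for missing data
UNWANTED = {"debug_flag"}          # assumed operational-only field

def transform(record):
    """Clean one operational record into the warehouse format."""
    out = {}
    for name, value in record.items():
        if name in UNWANTED:
            continue                                  # remove unwanted data
        out[COMMON_NAMES.get(name, name)] = value     # common data names
    for name, default in DEFAULTS.items():
        if out.get(name) is None:
            out[name] = default                       # default for missing data
    return out

def summarize_by_region(records):
    """Calculate a summary (derived data) from the cleaned records."""
    totals = defaultdict(float)
    for record in records:
        totals[record["region"]] += record["amount"]
    return dict(totals)

# Two hypothetical sources with different structures.
source_a = [{"cust_no": 1, "amt": 10.0, "region": "EAST", "debug_flag": True}]
source_b = [{"custid": 2, "amount": 5.0, "region": None}]

clean = [transform(r) for r in source_a + source_b]
summary = summarize_by_region(clean)
```

    Keeping the name mapping and defaults in plain dictionaries means that a change in a source definition is handled by editing data rather than code, which is one simple way to accommodate source data definition changes.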

    Issues to be considered while data sourcing, cleanup, extract and transformation:

    Database heterogeneity: It refers to the differing nature of the DBMSs involved: they may use different data models, different access languages, and different data navigation methods, operations, concurrency, integrity and recovery processes, etc.

    Data heterogeneity: It refers to the different ways the data is defined and used in different modules.

    Some vendors involved in the development of such tools:

    Prism Solutions, Evolutionary Technology Inc.,


    Meta data helps the users to understand content and find the data. Meta data are stored in a separate data store known as the informational directory or Meta data repository, which helps to integrate, maintain and view the contents of the data warehouse. The following lists the characteristics of the info directory / Meta data:

    • It is the gateway to the data warehouse environment
    • It supports easy distribution and replication of content for high performance and availability
    • It should be searchable by business oriented key words
    • It should act as a launch platform for end users to access data and analysis tools
    • It should support the sharing of info
    • It should support scheduling options for requests
    • It should support and provide interfaces to other applications
    • It should support end user monitoring of the status of the data warehouse environment

    Access tools

    Their purpose is to provide info to business users for decision making. There are five categories:

    • Data query and reporting tools
    • Application development tools
    • Executive info system tools (EIS)
    • OLAP tools
    • Data mining tools

    Query and reporting tools are used to generate queries and reports. There are two types of reporting tools:

    • Production reporting tools are used to generate regular operational reports
    • Desktop report writers are inexpensive desktop tools designed for end users

    Managed Query tools: used to generate SQL queries. They use Meta layer software between users and databases, which offers point-and-click creation of SQL statements. This tool is a preferred choice of users to perform segment identification, demographic analysis, territory management, preparation of customer mailing lists, etc.

    Application development tools: This is a graphical data access environment which integrates OLAP tools with the data warehouse and can be used to access all db systems.

    OLAP Tools: are used to analyze the data in multi dimensional and complex views. To enable multidimensional properties they use MDDB and MRDB, where MDDB refers to multi dimensional data bases and MRDB refers to multi relational data bases.

    Data mining tools: are used to discover knowledge from the data warehouse data; they can also be used for data visualization and data correction purposes.


    5. Data marts

    Departmental subsets that focus on selected subjects. They are independent and used by a dedicated user group. They are used for rapid delivery of enhanced decision support functionality to end users. A data mart is used in the following situations:

    • Extremely urgent user requirements
    • The absence of a budget for a full scale data warehouse strategy
    • The decentralization of business needs
    • The attraction of easy to use tools and a mind sized project

    Data marts present two problems:

    1. Scalability: A small data mart can grow quickly in multiple dimensions, so while designing it the organization has to pay more attention to system scalability, consistency and manageability issues
    2. Data integration

    6. Data warehouse admin and management

    The management of a data warehouse includes:

    • Security and priority management
    • Monitoring updates from multiple sources
    • Data quality checks
    • Managing and updating meta data
    • Auditing and reporting data warehouse usage and status
    • Purging data
    • Replicating, sub setting and distributing data
    • Backup and recovery
    • Data warehouse storage management, which includes capacity planning, hierarchical storage management, purging of aged data, etc.

    7. Information delivery system

    • It is used to enable the process of subscribing for data warehouse info.
    • Delivery to one or more destinations according to a specified scheduling algorithm.

    2. Building a Data warehouse


    There are two reasons why organizations consider data warehousing a critical need. In other words, there are two factors that drive you to build and use a data warehouse. They are:

    Business factors:

    • Business users want to make decisions quickly and correctly using all available data.

    Technological factors:

    • To address the incompatibility of operational data stores
    • IT infrastructure is changing rapidly. Its capacity is increasing and its cost is decreasing, so that building a data warehouse is easy

    There are several things to be considered while building a successful data warehouse.

    Business considerations:

    Organizations interested in development of a data warehouse can choose one of the following two approaches:

    1. Top - Down Approach (suggested by Bill Inmon)
    2. Bottom - Up Approach (suggested by Ralph Kimball)

    1. Top - Down Approach

    In the top down approach suggested by Bill Inmon, we build a centralized repository to house corporate wide business data. This repository is called the Enterprise Data Warehouse (EDW). The data in the EDW is stored in a normalized form in order to avoid redundancy.

    The central repository for corporate wide data helps us maintain one version of truth of the data. The data in the EDW is stored at the most detailed level. The reason to build the EDW at the most detailed level is to leverage

    1. Flexibility to be used by multiple departments.
    2. Flexibility to cater for future requirements.

    The disadvantages of storing data at the detail level are

    1. The complexity of design increases with increasing level of detail.
    2. It takes a large amount of space to store data at the detail level, hence increased cost.


    Once the EDW is implemented we start building subject area specific data marts which contain data in a de normalized form, also called star schema. The data in the marts is usually summarized based on the end users' analytical requirements. The reason to de normalize the data in the mart is to provide faster access to the data for the end users' analytics. If we were to query a normalized schema for the same analytics, we would end up with complex multiple level joins that would be much slower than the query on the de normalized schema.
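    As a minimal sketch of the star schema described above, the following uses Python's built-in sqlite3 module with one fact table and two dimension tables. The table names, columns and sample rows are assumptions made for illustration; a real data mart would be far larger and would run on the warehouse RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one de normalized row per product and per date.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- Fact table: measures plus one foreign key per dimension.
CREATE TABLE fact_sales  (product_key INTEGER, date_key INTEGER, amount REAL);

INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Furniture');
INSERT INTO dim_date VALUES (20190101, 2019, 1), (20190201, 2019, 2);
INSERT INTO fact_sales VALUES (1, 20190101, 10.0), (1, 20190201, 15.0), (2, 20190201, 40.0);
""")

# Every dimension is a single join away from the fact table.
rows = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.category, d.year
    ORDER BY p.category
""").fetchall()
```

    In a normalized design the category would typically live in its own table, adding another join level to the same question; the star layout keeps each dimension one join from the facts.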

    We should implement the top-down approach when

    1. The business has complete clarity on the data warehouse requirements of all or multiple subject areas.
    2. The business is ready to invest considerable time and money.

    The advantage of using the Top Down approach is that we build a centralized repository to cater for one version of truth for business data. This is very important for the data to be reliable and consistent across subject areas, and for reconciliation in case of data related contention between subject areas.

    The disadvantage of using the Top Down approach is that it requires more time and initial investment. The business has to wait for the EDW to be implemented, followed by the building of the data marts, before it can access its reports.

    2. Bottom Up Approach

    The bottom up approach suggested by Ralph Kimball is an incremental approach to building a data warehouse. Here we build the data marts separately at different points of time, as and when the specific subject area requirements are clear. The data marts are integrated or combined together to form a data warehouse. Separate data marts are combined through the use of conformed dimensions and conformed facts. A conformed dimension or a conformed fact is one that can be shared across data marts.

    A conformed dimension has consistent dimension keys, consistent attribute names and consistent values across separate data marts. The conformed dimension means exactly the same thing with every fact table it is joined to. A conformed fact has the same definition of measures, the same dimensions joined to it, and the same granularity across data marts.
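    The three conditions for a conformed dimension (consistent keys, attribute names and values) can be expressed as a simple check. This is only a sketch: the dictionaries standing in for dimension tables, the customer data and the rule that only shared keys are compared are all assumptions.

```python
def attribute_names(dimension):
    """Collect every attribute name used by any row of the dimension."""
    return {name for row in dimension.values() for name in row}

def is_conformed(dim_a, dim_b):
    """Check two copies of a dimension, each given as {key: {attribute: value}}.

    They are treated as conformed when they use the same attribute names and
    every dimension key present in both maps to identical attribute values.
    """
    if attribute_names(dim_a) != attribute_names(dim_b):
        return False
    shared_keys = dim_a.keys() & dim_b.keys()
    return all(dim_a[key] == dim_b[key] for key in shared_keys)

# Hypothetical customer dimension as held by three data marts.
sales_customer = {
    1: {"name": "Asha", "region": "South"},
    2: {"name": "Ravi", "region": "North"},
}
billing_customer = {1: {"name": "Asha", "region": "South"}}
legacy_customer = {1: {"cust_name": "Asha", "region": "S"}}

sales_vs_billing = is_conformed(sales_customer, billing_customer)
sales_vs_legacy = is_conformed(sales_customer, legacy_customer)
```

    Here the sales and billing marts can be combined on the customer dimension, while the legacy copy would first need its attribute names and region codes aligned.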

    The bottom up approach helps us incrementally build the warehouse by developing and integrating data marts as and when the requirements are clear. We don't have to wait to know the overall requirements of the warehouse. We should implement the bottom up approach when

    1. We have initial cost and time constraints.
    2. The complete warehouse requirements are not clear. We have clarity on only one data mart.


    The advantage of using the Bottom Up approach is that the data marts do not require high initial costs and have a faster implementation time; hence the business can start using the marts much earlier as compared to the top-down approach.

    The disadvantage of using the Bottom Up approach is that it stores data in the de normalized format, hence there would be high space usage for detailed data. We have a tendency of not keeping detailed data in this approach, hence losing out on the advantage of having detailed data, i.e. flexibility to easily cater to future requirements. The bottom up approach is more realistic, but the complexity of the integration may become a serious obstacle.

    DESIGN CONSIDERATIONS

    To be successful, a data warehouse designer must adopt a holistic approach, that is, consider all data warehouse components as parts of a single complex system, and take into account all possible data sources and all known usage requirements.

    Most successful data warehouses that meet these requirements have these common characteristics:

    • Are based on a dimensional model
    • Contain historical and current data
    • Include both detailed and summarized data
    • Consolidate disparate data from multiple sources while retaining consistency

    A data warehouse is difficult to build for the following reasons:

    • Heterogeneity of data sources
    • Use of historical data
    • Growing nature of the data base

    The data warehouse design approach must be a business driven, continuous and iterative engineering approach. In addition to the general considerations, the following specific points are relevant to data warehouse design:

    Data content

    The content and structure of the data warehouse are reflected in its data model. The data model is the template that describes how information will be organized within the integrated warehouse framework. The data warehouse data must be detailed data. It must be formatted, cleaned up and transformed to fit the warehouse data model.


    Meta data

    It defines the location and contents of data in the warehouse. Meta data is searchable by users to find definitions or subject areas. In other words, it must provide decision support oriented pointers to warehouse data, and thus provides a logical link between warehouse data and decision support applications.

    Data distribution

    One of the biggest challenges when designing a data warehouse is the data placement and distribution strategy. Data volumes continue to grow. Therefore, it becomes necessary to know how the data should be divided across multiple servers and which users should get access to which types of data. The data can be distributed based on the subject area, location (geographical region), or time (current, month, year).

    Tools

    A number of tools are available that are specifically designed to help in the implementation of the data warehouse. All selected tools must be compatible with the given data warehouse environment and with each other. All tools must be able to use a common Meta data repository.

    Design steps

    The following nine-step method is followed in the design of a data warehouse:

    1. Choosing the subject matter
    2. Deciding what a fact table represents
    3. Identifying and conforming the dimensions
    4. Choosing the facts
    5. Storing pre calculations in the fact table
    6. Rounding out the dimension table
    7. Choosing the duration of the db
    8. The need to track slowly changing dimensions
    9. Deciding the query priorities and query models

    TECHNICAL CONSIDERATIONS

    A number of technical issues are to be considered when designing a data warehouse environment. These issues include:


    • The hardware platform that would house the data warehouse
    • The dbms that supports the warehouse data
    • The communication infrastructure that connects data marts, operational systems and end users
    • The hardware and software to support the meta data repository
    • The systems management framework that enables admin of the entire environment

    IMPLEMENTATION CONSIDERATIONS

    The following logical steps are needed to implement a data warehouse:

    • Collect and analyze business requirements
    • Create a data model and a physical design
    • Define data sources
    • Choose the db tech and platform
    • Extract the data from operational db, transform it, clean it up and load it into the warehouse
    • Choose db access and reporting tools
    • Choose db connectivity software
    • Choose data analysis and presentation software
    • Update the data warehouse

    Access tools

    Data warehouse implementation relies on selecting suitable data access tools. The best way to choose a tool is based on the type of data that can be selected using it and the kind of access it permits for a particular user. The following lists the various types of data that can be accessed:

    • Simple tabular form data
    • Ranking data
    • Multivariable data
    • Time series data
    • Graphing, charting and pivoting data
    • Complex textual search data
    • Statistical analysis data
    • Data for testing of hypothesis, trends and patterns
    • Predefined repeatable queries
    • Ad hoc user specified queries


    • Reporting and analysis data
    • Complex queries with multiple joins, multi level sub queries and sophisticated search criteria

    Data extraction, clean up, transformation and migration

    Proper attention must be paid to data extraction, which represents a success factor for a data warehouse architecture. When implementing a data warehouse, the following selection criteria, which affect the ability to transform, consolidate, integrate and repair the data, should be considered:

    • Timeliness of data delivery to the warehouse
    • The tool must have the ability to identify the particular data so that it can be read by the conversion tool
    • The tool must support flat files and indexed files, since corporate data is still stored in these formats
    • The tool must have the capability to merge data from multiple data stores
    • The tool should have a specification interface to indicate the data to be extracted
    • The tool should have the ability to read data from the data dictionary
    • The code generated by the tool should be completely maintainable
    • The tool should permit the user to extract the required data
    • The tool must have the facility to perform data type and character set translation
    • The tool must have the capability to create summarization, aggregation and derivation of records
    • The data warehouse database system must be able to load data directly from these tools

    Data placement strategies

    • As a data warehouse grows, there are at least two options for data placement. One is to put some of the data in the data warehouse onto another storage medium.
    • The second option is to distribute the data in the data warehouse across multiple servers.

    User levels

    The users of data warehouse data can be classified on the basis of their skill level in accessing the warehouse. There are three classes of users:

    Casual users: are most comfortable in retrieving info from the warehouse in pre defined formats and running pre existing queries and reports. These users do not need tools that allow for building standard and ad hoc reports.

    Power Users: can use pre defined as well as user defined queries to create simple and ad hoc reports. These users can engage in drill down operations. These users may have experience of using reporting and query tools.


    Expert users: These users tend to create their own complex queries and perform standard analysis on the info they retrieve. These users have knowledge about the use of query and report tools.

    Benefits of data warehousing

    Data warehouse usage includes:

    • Locating the right info
    • Presentation of info
    • Testing of hypothesis
    • Discovery of info
    • Sharing the analysis

    The benefits can be classified into two:

    Tangible benefits (quantified / measurable): These include

    • Improvement in product inventory
    • Decrement in production cost
    • Improvement in selection of target markets
    • Enhancement in asset and liability management

    Intangible benefits (not easy to quantify): These include

    • Improvement in productivity by keeping all data in a single location and eliminating rekeying of data
    • Reduced redundant processing
    • Enhanced customer relations

    3. Mapping the data warehouse architecture to multiprocessor architecture

    The functions of a data warehouse are based on relational data base technology, which is implemented in a parallel manner. There are two advantages of having parallel relational data base technology for a data warehouse:

    Linear Speed up: refers to the ability to increase the number of processors to reduce response time.

    Linear Scale up: refers to the ability to provide the same performance on the same requests as the database size increases.
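    Linear speed up and linear scale up can be made concrete as ratios of elapsed times. The sketch below assumes the usual textbook formulation (speed up compares the same workload on 1 and on n processors; scale up compares a small job on a small system with a proportionally larger job on a proportionally larger system). The function names and the 5% tolerance are illustrative assumptions.

```python
def speed_up(time_on_one, time_on_n):
    """Elapsed time on one processor divided by elapsed time on n processors."""
    return time_on_one / time_on_n

def scale_up(time_small, time_large):
    """Elapsed time of the small job on the small system divided by the
    elapsed time of the larger job on the proportionally larger system."""
    return time_small / time_large

def is_linear_speed_up(time_on_one, time_on_n, n, tolerance=0.05):
    """Speed up is linear when n processors make the job about n times faster."""
    return abs(speed_up(time_on_one, time_on_n) - n) <= tolerance * n

def is_linear_scale_up(time_small, time_large, tolerance=0.05):
    """Scale up is linear when response time stays about the same (ratio near 1)."""
    return abs(scale_up(time_small, time_large) - 1.0) <= tolerance
```

    For example, a query taking 120 seconds on one processor and 30 seconds on four gives speed_up(120.0, 30.0) equal to 4.0, which is linear; taking 60 seconds on four processors would give only 2.0.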

    Types of parallelism

    There are two types of parallelism:


    Inter Query Parallelism: In which different server threads or processes handle multiple requests at the same time.

    Intra Query Parallelism: This form of parallelism decomposes the serial SQL query into lower level operations such as scan, join, sort, etc. These lower level operations are then executed concurrently in parallel.

    Intra query parallelism can be done in either of two ways:

    Horizontal parallelism: which means that the data base is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data.

    Vertical parallelism: This occurs among different tasks. All query components such as scan, join, sort, etc. are executed in parallel in a pipelined fashion. In other words, an output from one task becomes an input into another task.
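    The two forms of intra query parallelism can be sketched on a toy query ("return all values above a threshold, sorted"). This is only an illustration under stated assumptions: the partitions stand in for disks, a thread pool stands in for multiple processors, and Python generators model the pipelined hand-off between tasks without running the stages on separate processors.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data base partitioned across three disks.
PARTITIONS = [[5, 12, 7], [20, 3], [9, 15]]

def scan_filter(partition, threshold):
    """One task (scan plus filter) applied to a single partition."""
    return [value for value in partition if value > threshold]

def horizontal(partitions, threshold):
    """Horizontal: the same task runs concurrently on different partitions."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        parts = pool.map(scan_filter, partitions, [threshold] * len(partitions))
        return sorted(value for part in parts for value in part)

def scan(partitions):
    """Pipeline stage 1: emit rows one at a time."""
    for partition in partitions:
        yield from partition

def select(rows, threshold):
    """Pipeline stage 2: consume rows as soon as the scan produces them."""
    for value in rows:
        if value > threshold:
            yield value

def vertical(partitions, threshold):
    """Vertical: different tasks are chained so that the output of one task
    becomes the input of the next; the sort consumes the select stage."""
    return sorted(select(scan(partitions), threshold))

horizontal_result = horizontal(PARTITIONS, 8)
vertical_result = vertical(PARTITIONS, 8)
```

    Both strategies return the same answer; they differ only in how the work is divided, by data in the horizontal case and by operation in the vertical case.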

    Data partitioning:

    Data partitioning is the key component for effective parallel execution of data base operations. Partitioning can be done randomly or intelligently.

    Random partitioning includes random data striping across multiple disks on a single server. Another option for random partitioning is round robin partitioning, in which each record is placed on the next disk assigned to the data base.

    Intelligent partitioning assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks. The various intelligent partitioning schemes include:

    Hash partitioning: A hash algorithm is used to calculate the partition number based on the value of the partitioning key for each row.


    Key range partitioning: Rows are placed and located in the partitions according to the value of the partitioning key. That is, all the rows with the key value from A to K are in partition 1, L to T are in partition 2, and so on.

    Schema partitioning: an entire table is placed on one disk; another table is placed on a different disk, etc. This is useful for small reference tables.

    User defined partitioning: It allows a table to be partitioned on the basis of a user defined expression.
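    The partitioning schemes described above can be sketched as small functions that map a record or key to a partition. The function names, the CRC32 hash and the range boundaries K, T and Z are assumptions for illustration; a real DBMS uses its own hash function and partition catalog.

```python
import zlib

def round_robin(records, partition_count):
    """Random partitioning: each record goes to the next partition in turn."""
    partitions = [[] for _ in range(partition_count)]
    for index, record in enumerate(records):
        partitions[index % partition_count].append(record)
    return partitions

def hash_partition(key, partition_count):
    """Hash partitioning: partition number (0-based) computed from the key."""
    return zlib.crc32(str(key).encode("utf-8")) % partition_count

def key_range_partition(key, upper_bounds=("K", "T", "Z")):
    """Key range partitioning: A to K is partition 1, L to T is partition 2,
    and so on, decided by the first letter of the key."""
    first_letter = key[0].upper()
    for number, bound in enumerate(upper_bounds, start=1):
        if first_letter <= bound:
            return number
    raise ValueError(f"no partition covers key {key!r}")

def user_defined_partition(row, expression):
    """User defined partitioning: the caller supplies the expression."""
    return expression(row)
```

    Because hash and key range partitioning compute the partition from the key alone, a lookup can go straight to the right partition instead of searching all disks, which is what distinguishes them from round robin placement.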

    Data base architectures of parallel processing

    There are three DBMS software architecture styles for parallel processing:

    1. Shared memory or shared everything architecture
    2. Shared disk architecture
    3. Shared nothing architecture

    Shared Memory Architecture

    Tightly coupled shared memory systems, illustrated in the following figure, have the following characteristics:

    • Multiple PUs share memory.
    • Each PU has full access to all shared memory through a common bus.
    • Communication between nodes occurs via shared memory.
    • Performance is limited by the bandwidth of the memory bus.

    Symmetric multiprocessor (SMP) machines are often nodes in a cluster. Multiple SMP nodes can be used with Oracle Parallel Server in a tightly coupled system, where memory is shared among the multiple PUs, and is accessible by all the PUs through a memory bus. Examples of tightly coupled systems include the Pyramid, Sequent, and Sun SparcServer.


    Performance is potentially limited in a tightly coupled system by a number of factors. These include various system components such as the memory bandwidth, PU to PU communication bandwidth, the memory available on the system, the I/O bandwidth, and the bandwidth of the common bus.

    Parallel processing advantages of shared memory systems are these:

    • Memory access is cheaper than inter-node communication. This means that internal synchronization is faster than using the Lock Manager.
    • Shared memory systems are easier to administer than a cluster.

    A disadvantage of shared memory systems for parallel processing is as follows:

    • Scalability is limited by bus bandwidth and latency, and by available memory.

    Shared Disk Architecture

    Shared disk systems are typically loosely coupled. Such systems, illustrated in the following figure, have the following characteristics:

    • Each node consists of one or more PUs and associated memory.
    • Memory is not shared between nodes.
    • Communication occurs over a common high-speed bus.


    • Each node has access to the same disks and other resources.
    • A node can be an SMP if the hardware supports it.
    • Bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.

    The cluster illustrated in the figure is composed of multiple tightly coupled nodes. The Distributed Lock Manager (DLM) is required. Examples of loosely coupled systems are


    Parallel processing disadvantages of shared disk systems are these:

    • Inter-node synchronization is required, involving DLM overhead and greater dependency on the high-speed interconnect.
    • If the workload is not partitioned well, there may be high synchronization overhead.
    • There is operating system overhead of running shared disk software.

    Shared Nothing Architecture

    Shared nothing systems are typically loosely coupled. In shared nothing systems only one CPU is connected to a given disk. If a table or database is located on that disk, access depends entirely on the PU which owns it. Shared nothing systems can be represented as follows:

    Shared nothing systems are concerned with access to disks, not access to memory. Nonetheless, adding more PUs and disks can improve scale up. Oracle Parallel Server can access the disks on a shared nothing system as long as the operating system provides transparent disk access, but this access is expensive in terms of latency.

    Shared nothing systems have advantages and disadvantages for parallel processing:

    Advantages

    Shared nothing systems provide for incremental growth.

    System growth is practically unlimited.

    MPPs are good for read-only databases and decision support applications.

    Failure is local: if one node fails, the others stay up.

    Disadvantages

    More coordination is required.

    More overhead is required for a process working on a disk belonging to another node.

    If there is a heavy workload of updates or inserts, as in an online transaction processing system, it may be worthwhile to consider data-dependent routing to alleviate contention.

    Parallel DBMS features

    Scope and techniques of parallel DBMS operations

    Optimizer implementation

    Application transparency

    Parallel environment, which allows the DBMS server to take full advantage of the existing facilities at a very low level

    DBMS management tools, which help to configure, tune, administer and monitor a parallel RDBMS as effectively as if it were a serial RDBMS

    Price/performance: the parallel RDBMS can demonstrate near-linear speed-up and scale-up at reasonable cost.

    Parallel DBMS vendors

    Oracle: Parallel Query Option (PQO)

    Architecture: shared disk architecture

    Data partition: key range, hash, round robin

    Parallel operations: hash joins, scan and sort

    Informix: eXtended Parallel Server (XPS)

    Architecture: shared memory, shared disk and shared nothing models

    Data partition: round robin, hash, schema, key range and user defined

    Parallel operations: INSERT, UPDATE, DELETE

    IBM: DB2 Parallel Edition (DB2 PE)

    Architecture: shared nothing model

    Data partition: hash

    Parallel operations: INSERT, UPDATE, DELETE, load, recovery, index creation, backup, table reorganization

    Sybase: Sybase MPP

    Architecture: shared nothing model

    Data partition: hash, key range, schema

    Parallel operations: horizontal and vertical parallelism
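
    The partitioning schemes named in the vendor list above (round robin, hash and key range) can be illustrated with a minimal Python sketch. It is not any vendor's implementation: the function names, the sample rows and the choice of `crc32` as the hash function are assumptions made only for illustration.

```python
from bisect import bisect_right
from zlib import crc32


def round_robin_partition(rows, n_nodes):
    """Deal rows out to the nodes in turn, like dealing cards."""
    parts = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        parts[i % n_nodes].append(row)
    return parts


def hash_partition(rows, n_nodes, key):
    """Send each row to the node chosen by a hash of its partitioning key."""
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        # crc32 is deterministic across runs (an illustrative choice of hash)
        node = crc32(str(row[key]).encode()) % n_nodes
        parts[node].append(row)
    return parts


def key_range_partition(rows, boundaries, key):
    """boundaries[i] is the smallest key value stored in partition i + 1."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row[key])].append(row)
    return parts


# Hypothetical sample data: eight orders with ids 1..8
rows = [{"order_id": i, "amount": 10 * i} for i in range(1, 9)]

rr = round_robin_partition(rows, 3)
hp = hash_partition(rows, 3, "order_id")
kr = key_range_partition(rows, [3, 6], "order_id")

rr_ids = [[r["order_id"] for r in part] for part in rr]
kr_ids = [[r["order_id"] for r in part] for part in kr]
```

    Round robin spreads rows evenly but gives no way to locate a row by key; hash and key range partitioning let a query on the partitioning key go straight to the node that owns the row, which is the idea behind the data-dependent routing mentioned earlier.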

    DBMS schemas for decision support

    The basic concepts of dimensional modeling are: facts, dimensions and measures. A fact is a collection of related data items, consisting of measures and context data. It typically represents business items or business transactions. A dimension is a collection of data that describes one business dimension. Dimensions determine the contextual background for the facts; they are the parameters over which we want to perform OLAP. A measure is a numeric attribute of a fact, representing the performance or behavior of the business relative to the dimensions.

    Considering the relational context, there are three basic schemas that are used in dimensional modeling:

    1. Star schema

    2. Snowflake schema

    3. Fact constellation schema

    Star schema

    fact that the star schema is the simplest architecture, it is most commonly used nowadays and is recommended by Oracle.

    Fact Tables

    A fact table is a table that contains summarized numerical and historical data (facts) and a multipart index composed of foreign keys that reference the primary keys of related dimension tables. A fact table typically has two types of columns: foreign keys to dimension tables, and measures (the columns that contain numeric facts). A fact table can hold facts at the detail level or at an aggregated level.

    Dimension Tables

    Dimensions are categories by which summarized data can be viewed. For example, a profit summary in a fact table can be viewed by a Time dimension (profit by month, quarter, year), a Region dimension (profit by country, state, city) or a Product dimension (profit for product1, product2).

    A dimension is a structure, usually composed of one or more hierarchies, that categorizes data. A dimension that has no hierarchies or levels is called a flat dimension or list. The primary keys of the dimension tables together form the composite primary key of the fact table. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Dimension tables are generally smaller than the fact table.

    Typical fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities), clients, products, times and channels.
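
    As a rough illustration of the fact and dimension tables described above, the sketch below builds a tiny star schema in an in-memory SQLite database and views one profit measure by two different dimensions. All table names, column names and figures are made up for the example; they do not come from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: small, descriptive, one row per member
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT, month TEXT);
CREATE TABLE dim_region  (region_id INTEGER PRIMARY KEY, country TEXT, city TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);

-- Fact table: foreign keys to every dimension plus a numeric measure;
-- the foreign keys together form the composite primary key
CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    profit     REAL,
    PRIMARY KEY (time_id, region_id, product_id)
);

INSERT INTO dim_time VALUES (1, 2019, 'Q1', 'Jan'), (2, 2019, 'Q1', 'Feb'), (3, 2019, 'Q2', 'Apr');
INSERT INTO dim_region VALUES (1, 'India', 'Chennai'), (2, 'India', 'Mumbai');
INSERT INTO dim_product VALUES (1, 'product1'), (2, 'product2');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 100.0), (1, 2, 1, 50.0), (2, 1, 2, 70.0), (3, 2, 2, 30.0);
""")

# The same measure viewed by the Time dimension ...
profit_by_quarter = conn.execute("""
    SELECT t.quarter, SUM(f.profit)
    FROM fact_sales f JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY t.quarter ORDER BY t.quarter
""").fetchall()

# ... and by the Region dimension
profit_by_city = conn.execute("""
    SELECT r.city, SUM(f.profit)
    FROM fact_sales f JOIN dim_region r ON f.region_id = r.region_id
    GROUP BY r.city ORDER BY r.city
""").fetchall()
```

    Each query joins the fact table to exactly one dimension table, which is why star schema queries need only a small number of joins.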

    Measures

    Measures are numeric data based on columns in a fact table. They are the primary data that end users are interested in. For example, a sales fact table may contain a profit measure which represents the profit on each sale.

    Aggregations are precalculated numeric data. By calculating and storing the answers to a query before users ask for it, the query processing time can be reduced. This is key to providing fast query performance in OLAP.
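
    The idea of an aggregation, computing an answer once and storing it before anyone asks, can be sketched in a few lines of Python. The sample rows and function names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical detail-level facts: (month, city, profit)
sales = [
    ("2019-01", "Chennai", 100.0),
    ("2019-01", "Mumbai", 50.0),
    ("2019-02", "Chennai", 70.0),
    ("2019-02", "Mumbai", 30.0),
]


def build_aggregate(rows):
    """Scan the detail rows once and store profit totals per month."""
    totals = defaultdict(float)
    for month, _city, profit in rows:
        totals[month] += profit
    return dict(totals)


# Computed once, for example at load time, not at query time
profit_by_month = build_aggregate(sales)


def monthly_profit(month):
    """Answer the query from the stored aggregate without rescanning the facts."""
    return profit_by_month.get(month, 0.0)
```

    The trade-off is the usual one: the aggregate costs extra storage and must be rebuilt or updated when new facts are loaded, in exchange for queries that no longer scan the detail rows.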

    Cubes are data processing units composed of fact tables and dimensions from the data warehouse. They provide multidimensional views of data, and querying and analytical capabilities, to clients.

    The main characteristics of the star schema:

    Simple structure: easy-to-understand schema

    Efficient queries: only a small number of tables need to be joined

    Relatively long loading time for dimension tables: denormalization introduces redundant data, so the tables can become large

    The most commonly used schema in data warehouse implementations: widely supported by a large number of business intelligence tools

    Snowflake schema:

    The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimension table, whereas in a snowflake schema, that dimension table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.

    For example, consider a Time dimension that consists of 2 different hierarchies:

    1. Year -> Month -> Day

    2. Week -> Day

    We will have four lookup tables in a snowflake schema: a lookup table for year, a lookup table for month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is then connected to Day. Week is only connected to Day.

    The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables.

    The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.
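
    The snowflaked Time dimension from the example can be sketched with four small lookup tables, here modelled as Python dictionaries. The ids, dates and the helper name `describe_day` are illustrative assumptions.

```python
# One lookup table per level of the hierarchy (all values are made up)
year = {1: {"year": 2019}}
month = {
    1: {"name": "January", "year_id": 1},
    2: {"name": "February", "year_id": 1},
}
week = {2: {"week_no": 2}, 7: {"week_no": 7}}
day = {
    1: {"date": "2019-01-07", "month_id": 1, "week_id": 2},
    2: {"date": "2019-02-11", "month_id": 2, "week_id": 7},
}


def describe_day(day_id):
    """Follow Day -> Month -> Year and Day -> Week to rebuild the flat row."""
    d = day[day_id]
    m = month[d["month_id"]]
    y = year[m["year_id"]]
    w = week[d["week_id"]]
    return {
        "date": d["date"],
        "month": m["name"],
        "year": y["year"],
        "week": w["week_no"],
    }
```

    In a star schema the same information would sit in one wide dimension row per day; the snowflake stores each month name and year only once, at the price of the extra lookups shown in `describe_day`.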

    The snowflake schema is the result of decomposing one or more of the dimensions. The many-to-one relationships among sets of attributes of a dimension can be separated into new dimension tables, forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical structure of dimensions very well.

    Fact constellation schema: For each star schema it is possible to construct a fact constellation schema (for example, by splitting the original star schema into several star schemas, each of which describes facts at a different level of the dimension hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables.

    The main shortcoming of the fact constellation schema is a more complicated design, because many variants for particular kinds of aggregation must be considered and selected. Moreover, dimension tables are still large.

    Data Extraction, Cleanup, and Transformation Tools

    ETL stands for Extract, Transform, Load. It is the data warehouse acquisition process, which involves these steps:

    Extract the data from outside sources.

    Transform the data to fit business needs.

    Load the transformed data into the data warehouse.

    Examples of ETL tools:

    1. Informatica

    2. DataStage

    3. Oracle Warehouse Builder

    4. Ab Initio

    ETL can also be used for integration with legacy systems.
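
    The three ETL steps can be sketched end to end in Python. This is a toy pipeline under stated assumptions, not how any of the listed tools works: the CSV source, the cleanup rules (trim names, skip rows with a missing amount) and the `sales` target table are all invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real one would come from an operational system
SOURCE_CSV = """order_id,customer,amount,order_date
1, alice ,100.50,2019-01-05
2,BOB,,2019-01-06
3,carol,42,2019-02-10
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean names, convert types, drop rows with no amount."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # assumed business rule: skip rows with a missing amount
        cleaned.append((
            int(row["order_id"]),
            row["customer"].strip().title(),
            float(row["amount"]),
            row["order_date"],
        ))
    return cleaned


def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM sales ORDER BY order_id"
).fetchall()
```

    Keeping the three steps as separate functions mirrors the way the process is described: each stage can be tested, rerun or replaced without touching the others.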

    Extraction

    Extraction is the operation of extracting data from a source system for further use in a data warehouse environment. This is the first step of the ETL process. After the extraction, this data can be transformed and loaded into the data warehouse.

    Introduction to Extraction Methods in Data Warehouses

    The extraction method you should choose is highly dependent on the source system and also on the business needs in the target data warehouse environment.

    users to perform segment identification, demographic analysis, territory management and preparation of

    customer mailing lists etc.

    Application development tools: These provide a graphical data access environment that integrates OLAP tools with the data warehouse and can be used to access all database systems.

    OLAP tools: These are used to analyze the data in multidimensional and complex views. To enable multidimensional properties they use MDDB and MRDB, where MDDB refers to a multidimensional database and MRDB refers to a multirelational database.

    Data mining tools: These are used to discover knowledge from the data warehouse data; they can also be used for data visualization and data correction.

    Metadata

    Metadata: data about data

    Metadata in Data Warehouse

    Metadata is one of the most important aspects of data warehousing. It is the data about the data stored in the data warehouse and about its users.

    Metadata provides decision-support-oriented pointers to warehouse data and thus provides a logical link between warehouse data and the decision support application.

    Metadata is the key to providing users and applications with a road map to the information stored in the warehouse.

    Metadata can define all attributes, data sources and timing, and the rules that govern data use and data transformation of all data elements.

    Metadata (metacontent) is defined as data providing information about one or more aspects of the data, such as:

    Means of creation of the data

    Purpose of the data

    Time and date of creation

    Creator or author of data

    Location on a computer network where the data was created

    Standards used

    Types:

    Technical metadata:

    It contains information about warehouse data that warehouse designers and administrators use to carry out development and management tasks. It includes:

    Information about data stores

    Transformation descriptions, that is, the mapping methods from the operational database to the warehouse database

    Warehouse object and data structure definitions for target data

    The rules used to perform cleanup and data enhancement

    Data mapping operations

    Access authorization, backup history, archive history, information delivery history, data acquisition history, data access, etc.

    Business metadata:

    It contains information that describes the data stored in the data warehouse to its users. It includes:

    Subject areas and information object types, including queries, reports, images, video, audio clips, etc.

    Internet home pages

    Information related to the information delivery system

    Data warehouse operational information such as ownerships, audit trails, etc.

    Other Types:

    Structural metadata is used to describe the structure of computer systems such as tables, columns and indexes. Guide metadata is used to help humans find specific items and is usually expressed as a set of keywords in a natural language.

    According to Ralph Kimball, metadata can be divided into 2 similar categories: technical metadata and business metadata. Technical metadata corresponds to internal metadata, business metadata to external metadata.

    Kimball adds a third category named process metadata. On the other hand, NISO distinguishes between three types of metadata: descriptive, structural and administrative.

    Descriptive metadata is the information used to search and locate an object, such as title, author, subjects, keywords and publisher; structural metadata gives a description of how the components of the object are organized; and administrative metadata refers to the technical information, including file type. Two subtypes of administrative metadata are rights management metadata and preservation metadata.

    Types of Data Warehouse

    There are mainly three types of data warehouse.

    1) Enterprise data warehouse

    2) Operational data store

    3) Data mart

    An enterprise data warehouse provides a central database for decision support throughout the enterprise.

    An operational data store has a broad, enterprise-wide scope, but unlike a true enterprise data warehouse, its data is refreshed in near real time and used for routine business activity.

    A data mart is a subset of the data warehouse. It supports a particular region, or it is designed for a particular line of business such as sales, marketing or finance; the data of a single department in an organization can form a data mart.

    UNIT II

    BUSINESS ANALYSIS

    Reporting and Query Tools and Applications - Tool Categories - The Need for Applications

    Data query and reporting tools

    Query and reporting tools fall into two categories:

    Reporting tools

    Managed query tools

    Reporting tools are further divided into two kinds.

    Production reporting tools let companies generate regular operational reports or support high-volume batch jobs, such as calculating and printing paychecks.

    Report writers, on the other hand, are inexpensive desktop tools designed for end users.

    • Interactive reporting capability

    • Enterprise-wide scalability

    • Superior user interface

    • Fastest time to result

    • Lowest cost of ownership

    Catalogs

    Impromptu stores metadata in subject-related folders. This metadata is what will be used to develop a query for a report. The metadata set is stored in a file called a 'catalog'. The catalog does not contain any data. It just contains information about connecting to the database and the fields that will be accessible for reports.

    A catalog contains:

    • Folders: meaningful groups of information representing columns from one or more tables

    • Columns: individual data elements that can appear in one or more folders

    • Calculations: expressions used to compute required values from existing data

    • Conditions: used to filter information so that only a certain type of information is displayed

    • Prompts: predefined selection criteria prompts that users can include in reports they create

    • Other components, such as metadata, a logical database name, join information, and user classes

    You can use catalogs to

    • view, run, and print reports

    • export reports to other applications

    • disconnect from and connect to the database

    • create reports

    • change the contents of the catalog

    • add user classes

    Prompts

    You can use prompts to

    • filter reports

    • calculate data items

    • format data

    Picklist Prompts

    A picklist prompt presents you with a list of data items from which you select one or more values, so you need not be familiar with the database. The values listed in picklist prompts can be retrieved from

    One limitation of SQL is that it cannot represent these complex problems directly. A query has to be translated into several SQL statements. These SQL statements involve multiple joins, intermediate tables, sorting, aggregations and a large amount of temporary memory to store the intermediate tables. These procedures require a lot of computation, which takes a long time. The second limitation of SQL is its inability to use mathematical models in these SQL statements. Even if an analyst could express these complex problems in SQL, a large amount of computation and memory would still be needed. Therefore the use of OLAP is preferable for this kind of problem.

    Categories of OLAP Tools

    MOLAP

    This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats. That is, the data is stored in array-based structures.

    Advantages:

    Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.

    Can perform complex calculations: All calculations have been pregenerated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

    Disadvantages:

    Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.

    Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

    Examples: Hyperion Essbase, Fusion (Information Builders)

    ROLAP

    This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement. The data is stored in relational tables.

    Advantages:

    Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.

    Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

    Disadvantages:

    Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.

    Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.

    Examples: MicroStrategy Intelligence Server, MetaCube (Informix/IBM)
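
    The statement that each slice or dice in ROLAP amounts to adding a WHERE clause can be made concrete with a small sketch. The helper `rolap_query`, the `sales` table and its rows are hypothetical; a real ROLAP engine generates far more elaborate SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (city TEXT, year INTEGER, product TEXT, amount REAL);
INSERT INTO sales VALUES
    ('Chennai', 2018, 'pen', 10), ('Chennai', 2019, 'pen', 20),
    ('Mumbai',  2019, 'pen', 30), ('Mumbai',  2019, 'ink', 40);
""")

ALLOWED_COLUMNS = {"city", "year", "product"}


def rolap_query(conn, group_by, filters=None):
    """Generate SQL for a report; every slice/dice adds one WHERE predicate."""
    filters = filters or {}
    if group_by not in ALLOWED_COLUMNS or not set(filters) <= ALLOWED_COLUMNS:
        raise ValueError("unknown column")
    sql = f"SELECT {group_by}, SUM(amount) FROM sales"
    if filters:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    sql += f" GROUP BY {group_by} ORDER BY {group_by}"
    return conn.execute(sql, list(filters.values())).fetchall()


all_years = rolap_query(conn, "city")                                   # no slice
slice_2019 = rolap_query(conn, "city", {"year": 2019})                  # slice on year
dice_pen = rolap_query(conn, "city", {"year": 2019, "product": "pen"})  # dice
```

    Column names are checked against a whitelist because they are interpolated into the SQL text, while the filter values are passed as bound parameters.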

    HOLAP (MQE: Managed Query Environment)

    HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. It stores only the indexes and aggregations in the multidimensional form, while the rest of the data is stored in the relational database.

    Examples: PowerPlay (Cognos), Brio, Microsoft Analysis Services, Oracle Advanced Analytic Services

    Multidimensional Versus Multirelational OLAP

    These relational implementations of multidimensional database systems are sometimes referred to as multirelational database systems. To achieve the required speed, these products use the star or snowflake schemas: specially optimized and denormalized data models that involve data restructuring and aggregation. (The snowflake schema is an extension of the star schema that supports multiple fact tables and joins between them.)

    One benefit of the star schema approach is reduced complexity in the data model, which increases data "legibility," making it easier for users to pose business questions of an OLAP nature. Data warehouse queries can be answered up to 10 times faster because of improved navigation.

    Two types of database activity:

    1. OLTP: On-Line Transaction Processing

    • Short transactions, both queries and updates (e.g., update account balance, enroll in course)

    • Queries are simple (e.g., find account balance, find grade in course)

    • Updates are frequent (e.g., concert tickets, seat reservations, shopping carts)

    2. OLAP: On-Line Analytical Processing

    • Long transactions, usually complex queries (e.g., all statistics about all sales, grouped by dept and month)

    • "Data mining" operations

    • Infrequent updates

    OLTP vs OLAP

    OLTP stands for On-Line Transaction Processing and is a data modeling approach typically used to facilitate and manage usual business applications. Most of the applications you see and use are OLTP based. OLTP technology is used to perform updates on operational or transactional systems (e.g., point-of-sale systems).

    OLAP stands for On-Line Analytical Processing and is an approach to answering multidimensional queries. OLAP was conceived for Management Information Systems and Decision Support Systems. OLAP technology is used to perform complex analysis of the data in a data warehouse.

    The following table summarizes the major differences between OLTP and OLAP system design.

    OLTP System: Online Transaction Processing (Operational System)
    OLAP System: Online Analytical Processing (Data Warehouse)

    Source of data
    OLTP: Operational data; OLTP systems are the original source of the data.
    OLAP: Consolidation data; OLAP data comes from the various OLTP databases.

    Purpose of data
    OLTP: To control and run fundamental business tasks.
    OLAP: To help with planning, problem solving, and decision support.

    What the data reveals
    OLTP: A snapshot of ongoing business processes.
    OLAP: Multidimensional views of various kinds of business activities.

    Inserts and updates
    OLTP: Short and fast inserts and updates initiated by end users.
    OLAP: Periodic long-running batch jobs refresh the data.

    Queries
    OLTP: Relatively standardized and simple queries returning relatively few records.
    OLAP: Often complex queries involving aggregations.

    Processing speed
    OLTP: Typically very fast.
    OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

    Space requirements
    OLTP: Can be relatively small if historical data is archived.
    OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP.

    Database design
    OLTP: Highly normalized with many tables.
    OLAP: Typically denormalized with fewer tables; use of star and/or snowflake schemas.

    Backup and recovery
    OLTP: Back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
    OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.

    The Multidimensional Data Model

    The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP. Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during interactive sessions, not in batch jobs that run overnight. And because OLAP is also analytic, the queries are complex. The multidimensional data model is designed to solve complex queries in real time.

    One way to picture the multidimensional data model is as a cube. The table at the left contains detailed sales data by product, market and time. The cube on the right associates the sales number (units sold) with the dimensions (product type, market and time), with the unit variables organized as cells in an array.

    This cube can be expanded to include another array (price), which can be associated with all or only some dimensions. As the number of dimensions increases, the number of cube cells increases exponentially.
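
    The growth in the number of cells can be checked with a line of arithmetic: a dense cube has one cell for every combination of dimension members, so the cell count is the product of the dimension sizes. The dimension sizes below are invented for illustration.

```python
from math import prod


def cube_cells(dimension_sizes):
    """Number of cells in a dense cube: the product of the dimension sizes."""
    return prod(dimension_sizes)


# Hypothetical cube: 12 months x 50 products x 10 markets
three_dims = cube_cells([12, 50, 10])

# Adding a fourth dimension with 20 members multiplies the cell count by 20
four_dims = cube_cells([12, 50, 10, 20])

# With n dimensions of 10 members each the count is 10**n
growth = [cube_cells([10] * n) for n in range(1, 5)]
```

    Many of these cells are empty in practice, since not every product sells in every market every month, which is why multidimensional databases care about how sparse data is stored.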

    Dimensions are hierarchical in nature; for example, the time dimension may contain hierarchies for years, quarters, months, weeks and days, and GEOGRAPHY may contain country, state, city, etc.

    In this cube we can observe that each side of the cube represents one of the elements of the question. The x-axis represents the time, the y-axis represents the products and the z-axis represents the different centers. The cells in the cube represent the number of products sold, or can represent the price of the items.

    This figure also gives a different understanding of the drill-down operations. The relations defined need not be directly related; they can be related indirectly.

    As the dimensions grow, the cube grows with them: the number of cells is the product of the dimension sizes, so it rises exponentially with the number of dimensions. The response time of the cube depends on the size of the cube.

    Operations in Multidimensional Data Model:

    • Aggregation (roll-up)

    - dimension reduction: e.g., total sales by city

    - summarization over aggregate hierarchy: e.g., total sales by city and year -> total sales by region and by year

    • Selection (slice) defines a subcube

    - e.g., sales where city = Palo Alto and date = 1/1/96

    • Navigation to detailed data (drill-down)

    - e.g., (sales - expense) by city, top 3% of cities by average income
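
    The three operations listed above (roll-up, slice and drill-down) can be tried on a toy fact list in Python. The rows, field names and helper functions are hypothetical and stand in for what an OLAP server would do against a real cube.

```python
from collections import defaultdict

FIELDS = ("city", "region", "year", "month")

# Hypothetical facts: (city, region, year, month, sales)
facts = [
    ("Palo Alto", "West", 1996, 1, 100),
    ("Palo Alto", "West", 1996, 2, 150),
    ("San Jose", "West", 1996, 1, 200),
    ("Boston", "East", 1996, 1, 300),
]


def aggregate(rows, dims):
    """Total sales grouped by the given dimensions."""
    idx = [FIELDS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_(rows, **conditions):
    """Selection: keep only the rows matching every condition (a subcube)."""
    return [
        row for row in rows
        if all(row[FIELDS.index(k)] == v for k, v in conditions.items())
    ]


by_city_year = aggregate(facts, ["city", "year"])
# Roll-up: climb the geography hierarchy from city to region
by_region_year = aggregate(facts, ["region", "year"])
# Slice: restrict the cube to one city
palo_alto = slice_(facts, city="Palo Alto")
# Drill-down: go from yearly totals to monthly detail for that city
palo_alto_by_month = aggregate(palo_alto, ["city", "year", "month"])
```

    Roll-up and drill-down are the same grouping operation run at a coarser or finer level of a hierarchy, which is why a single `aggregate` helper serves both.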

    • Unlimited dimensions and aggregation levels: This depends on the kind of business, where multiple dimensions and hierarchies can be defined.

    In addition to these guidelines an OLAP system should also support:

    Comprehensive database management tools: These give database management the ability to control distributed businesses.

    The ability to drill down to detail source record level: This requires that the OLAP tool allow smooth transitions in the multidimensional database.

    Incremental database refresh: The OLAP tool should provide partial refresh.

    Structured Query Language (SQL) interface: The OLAP system should be able to integrate effectively in the surrounding enterprise environment.

    UNIT III

    DATA MINING

    Data mining (knowledge discovery in databases)

    Extraction of interesting (nontrivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.

    Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery in Data (KDD).

    The key properties of data mining are:

    Automatic discovery of patterns

    Prediction of likely outcomes

    Creation of actionable information

    Focus on large data sets and databases

    Data mining can answer questions that cannot be addressed through simple query and reporting techniques.

    Data Mining Functions

    A basic understanding of data mining functions and algorithms is required for using Oracle Data Mining. This section introduces the concept of data mining functions. Algorithms are introduced in "Data Mining Algorithms".

    Each data mining function specifies a class of problems that can be modeled and solved. Data mining functions fall generally into two categories: supervised and unsupervised. Notions of supervised and unsupervised learning are derived from the science of machine learning, which has been called a sub-area of artificial intelligence.

    Artificial intelligence refers to the implementation and study of systems that exhibit autonomous intelligence or behavior of their own. Machine learning deals with techniques that enable devices to learn from their own performance and modify their own functioning. Data mining applies machine learning concepts to data.

    Supervised Data Mining:

    Supervised learning is also known as directed learning. The learning process is directed by a previously known dependent attribute or target. Directed data mining attempts to explain the behavior of the target as a function of a set of independent attributes or predictors.

    Supervised learning generally results in predictive models. This is in contrast to unsupervised
    learning, where the goal is pattern detection.

    The building of a supervised model involves training, a process whereby the software analyzes
    many cases where the target value is already known. In the training process, the model "learns" the logic
    for making the prediction. For example, a model that seeks to identify the customers who are likely to
    respond to a promotion must be trained by analyzing the characteristics of many customers who are known
    to have responded or not responded to a promotion in the past.
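
    The promotion example can be sketched in a few lines of Python. This is only an illustration under
    stated assumptions: the customer attributes (age, purchases last year), the six training cases and the
    choice of a 1-nearest-neighbour rule are all hypothetical and are not taken from these notes. The point
    is that the target value ("yes"/"no") is known for every training case and is used to predict the
    target for a new case.

```python
import math

# Hypothetical training cases: (age, purchases_last_year) -> responded to a past promotion?
# The target value is already known for every case; that is what makes this supervised.
training = [
    ((25, 1), "no"),
    ((32, 2), "no"),
    ((41, 8), "yes"),
    ((47, 10), "yes"),
    ((52, 7), "yes"),
    ((23, 0), "no"),
]

def predict(customer):
    """Predict the target of a new customer from the closest known case (1-nearest neighbour)."""
    nearest_case = min(training, key=lambda case: math.dist(case[0], customer))
    return nearest_case[1]

print(predict((45, 9)))  # closest known case is (47, 10) -> "yes"
print(predict((24, 1)))  # closest known case is (25, 1)  -> "no"
```

    A real model would be trained on many more cases and evaluated on data that was held back from training.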

    Unsupervised Data Mining

    Unsupervised learning is non-directed. There is no distinction between dependent and independent
    attributes. There is no previously known result to guide the algorithm in building the model.

    Unsupervised learning can be used for descriptive purposes. It can also be used to make
    predictions.

    Data pre-processing

    Data pre-processing is an often neglected but important step in the data mining process. The
    phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects.
    Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100),
    impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, etc. Analyzing data that
    has not been carefully screened for such problems can produce misleading results. Thus, the representation
    and quality of the data must be addressed first, before running an analysis.

    If there is much irrelevant and redundant information present, or noisy and unreliable data, then
    knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can
    take a considerable amount of processing time. Data pre-processing includes cleaning, normalization,
    transformation, feature extraction and selection, etc. The product of data pre-processing is the final training
    set. Kotsiantis et al. (2006) present a well-known algorithm for each step of data pre-processing.
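
    The screening described above can be sketched as a small validation pass. The record layout, field
    names and rules below are assumptions made for illustration; they simply mirror the three problem
    types mentioned in the text (out-of-range values, impossible combinations, missing values).

```python
# Hypothetical records; the field names are illustrative only.
records = [
    {"income": 42000, "gender": "Female", "pregnant": "Yes"},
    {"income": -100, "gender": "Male", "pregnant": "No"},    # out-of-range value
    {"income": 38000, "gender": "Male", "pregnant": "Yes"},  # impossible combination
    {"income": None, "gender": "Female", "pregnant": "No"},  # missing value
]

def problems(record):
    """Return a list of data-quality problems found in one record."""
    found = []
    if record["income"] is None:
        found.append("missing income")
    elif record["income"] < 0:
        found.append("income out of range")
    if record["gender"] == "Male" and record["pregnant"] == "Yes":
        found.append("impossible combination")
    return found

# Keep only the records that pass every check.
clean = [r for r in records if not problems(r)]

for r in records:
    print(r, "->", problems(r) or "ok")
```

    In practice the flagged records would be corrected or imputed where possible rather than simply dropped.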

    Classification of Data Mining Systems

    Data mining classification scheme:

    1. Decisions in data mining
       - Kinds of databases to be mined

       - Kinds of knowledge to be discovered
       - Kinds of techniques utilized
       - Kinds of applications adapted

    2. Data mining tasks
       - Descriptive data mining
       - Predictive data mining

    Decisions in data mining

    - Databases to be mined
      o Relational, transactional, object-oriented, object-relational, active, spatial, time
        series, text, multimedia, heterogeneous, legacy, WWW, etc.
    - Knowledge to be mined
      o Characterization, discrimination, association, classification, clustering, trend,
        deviation and outlier analysis, etc.
      o Multiple/integrated functions and mining at multiple levels
    - Techniques utilized
      o Database-oriented, data warehouse (OLAP), machine learning, statistics,
        visualization, neural network, etc.
    - Applications adapted
      o Retail, telecommunication, banking, fraud analysis, DNA mining, stock market
        analysis, Web mining, Weblog analysis, etc.

    Data mining tasks

    - Prediction tasks
      o Use some variables to predict unknown or future values of other variables
    - Description tasks
      o Find human-interpretable patterns that describe the data.

    Common data mining tasks

    - Classification [Predictive]
    - Clustering [Descriptive]
    - Association Rule Discovery [Descriptive]
    - Sequential Pattern Discovery [Descriptive]
    - Regression [Predictive]
    - Deviation Detection [Predictive]

    Classifications of data mining systems:

    Supervised learning (classification)
    - Supervision: The training data (observations, measurements, etc.) are
      accompanied by labels indicating the class of the observations.
    - New data is classified based on the training set.

    Unsupervised learning (clustering)
    - The class labels of the training data are unknown.
    - Given a set of measurements, observations, etc., the aim is to establish the existence of
      classes or clusters in the data.

    Classification
    - predicts categorical class labels (discrete or nominal)
    - classifies data (constructs a model) based on the training set and the values (class
      labels) in a classifying attribute, and uses the model to classify new data

    Numeric Prediction
    - models continuous-valued functions, i.e., predicts unknown or missing values

    Typical applications
    - Credit/loan approval
    - Medical diagnosis: whether a tumor is cancerous or benign
    - Fraud detection: whether a transaction is fraudulent
    - Web page categorization: which category a page belongs to

    Data Mining Task Primitives

    The set of task-relevant data to be mined: This specifies the portions of the database or the set of
    data in which the user is interested. This includes the database attributes or data warehouse dimensions of
    interest (referred to as the relevant attributes or dimensions).

    The kind of knowledge to be mined: This specifies the data mining functions to be performed,
    such as characterization, discrimination, association or correlation analysis, classification,
    prediction, clustering, outlier analysis, or evolution analysis.

    The background knowledge to be used in the discovery process: This knowledge about the domain
    to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found.
    Concept hierarchies are a popular form of background knowledge, which allow data to be mined
    at multiple levels of abstraction. An example of a concept hierarchy for the attribute (or dimension) age is
    shown in the figure. User beliefs regarding relationships in the data are another form of background
    knowledge.

    The interestingness measures and thresholds for pattern evaluation: They may be used to guide the
    mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may
    have different interestingness measures. For example, interestingness measures for association rules
    include support and confidence. Rules whose support and confidence values are below user-specified
    thresholds are considered uninteresting.

    The expected representation for visualizing the discovered patterns: This refers to the
    form in which discovered patterns are to be displayed, which may include rules, tables, charts, graphs,
    decision trees, and cubes.
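
    Support and confidence, the interestingness measures named above for association rules, can be
    computed directly from a list of transactions. The sketch below uses five made-up transactions and
    hypothetical thresholds (min_support = 0.5, min_confidence = 0.7); only the two definitions themselves
    are standard.

```python
# Hypothetical market-basket transactions (each transaction is a set of items).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def is_interesting(antecedent, consequent, min_support=0.5, min_confidence=0.7):
    """A rule is kept only if it meets both user-specified thresholds."""
    return (support(antecedent | consequent) >= min_support
            and confidence(antecedent, consequent) >= min_confidence)

# Rule: diapers => beer. It holds in 3 of 5 transactions; diapers occur in 4 of 5.
print(support({"diapers", "beer"}))           # 0.6
print(confidence({"diapers"}, {"beer"}))      # about 0.75
print(is_interesting({"diapers"}, {"beer"}))  # True
```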

    Data Preprocessing

    Real-world data to be analyzed by data mining techniques usually need preprocessing, for the following reasons:

    1. Incomplete: lacking attribute values or certain attributes of interest, or containing only aggregate
       data. Missing data, particularly for tuples with missing values for some attributes, may need to be
       inferred.

    2. Noisy: containing errors, or outlier values that deviate from the expected. Incorrect data may also
       result from inconsistencies in naming conventions or data codes used, or inconsistent formats for
       input fields, such as date. It is hence necessary to use some techniques to replace the noisy data.

    3. Inconsistent: containing discrepancies between different data items. Some attributes representing
       a given concept may have different names in different databases, causing inconsistencies and
       redundancies. Naming inconsistencies may also occur for attribute values. The inconsistency in
       data needs to be removed.

    4. Aggregate information: It would be useful to obtain aggregate information, such as the sales
       per customer region, which is not part of any precomputed data cube in the data
       warehouse.

    5. Enhancing the mining process: A large number of data sets may make the data mining process slow.
       Hence, reducing the number of data sets to enhance the performance of the mining process is
       important.

    6. Improving data quality: Data preprocessing techniques can improve the quality of the data,
       thereby helping to improve the accuracy and efficiency of the subsequent mining process. Data
       preprocessing is an important step in the knowledge discovery process, because quality decisions
       must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the
       data to be analyzed can lead to huge payoffs for decision making.

    Different forms of Data Preprocessing

    Data Cleaning:

    Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy
    data, identifying or removing outliers, and resolving inconsistencies.

    If users believe the data are dirty, they are unlikely to trust the results of any data mining that
    has been applied to it. Also, dirty data can cause confusion for the mining procedure,
    resulting in unreliable output. Most mining routines have some procedures for dealing with
    incomplete or noisy data, but they are not always robust.

    Therefore, a useful preprocessing step is to run the data through some data-cleaning routines.
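
    Two of the cleaning steps named above, filling in missing values and smoothing noisy data, can be
    sketched as follows. Filling with the attribute mean and smoothing by bin means are just two common
    choices assumed here for illustration; the function names and the sample price list are not from
    these notes.

```python
from statistics import mean

def fill_missing(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins, and replace
    every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

print(fill_missing([10, None, 20]))
# Three bins of three values each: (4, 8, 15), (21, 21, 24), (25, 28, 34)
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3))
```

    Smoothing by bin means reduces noise at the cost of detail: each bin of three prices collapses to one value.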

    Data Integration:

    Data integration involves integrating data from multiple databases, data cubes, or files.

    Some attributes representing a given concept may have different names in different databases,
    causing inconsistencies and redundancies. For example, the attribute for customer
    identification may be referred to as customer_id in one data store and cust_id in another.
    Naming inconsistencies may also occur for attribute values.

    Also, some attributes may be inferred from others (e.g., annual revenue).

    Having a large amount of redundant data may slow down or confuse the knowledge
    discovery process. Additional data cleaning can be performed to detect and remove
    redundancies that may have resulted from data integration.

    Data Transformation:

    Data transformation operations, such as normalization and aggregation, are additional data
    preprocessing procedures that would contribute toward the success of the mining process.

    Normalization: Normalization is scaling the data to be analyzed to a specific range, such as
    [0.0, 1.0], for providing better results.
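
    Scaling to the range [0.0, 1.0] is commonly done with min-max normalization. The sketch below is a
    minimal version of that technique; the function name and the income values are illustrative
    assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them all to the lower bound.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

incomes = [12000, 55000, 98000]
print(min_max_normalize(incomes))  # [0.0, 0.5, 1.0]
```

    Min-max scaling is sensitive to outliers, because a single extreme value stretches the range for all other values.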

    Aggregation: It would also be useful for data analysis to obtain aggregate information, such
    as the sales per customer region. As this is not part of any precomputed data cube, it would
    need to be computed. This process is called aggregation.
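
    Computing the sales per customer region amounts to a group-and-sum over the detailed sales
    records. The rows below are made-up (region, amount) pairs used only to show the idea.

```python
from collections import defaultdict

# Hypothetical detail rows: (customer_region, sale_amount)
sales = [
    ("North", 120.0),
    ("South", 80.0),
    ("North", 45.5),
    ("East", 60.0),
    ("South", 20.0),
]

def sales_per_region(rows):
    """Aggregate individual sales into one total per customer region."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)

print(sales_per_region(sales))
```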

    Data Reduction:

    Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet
    produces the same (or almost the same) analytical results. There are a number of strategies for
    data reduction:

    - data aggregation (e.g., building a data cube),
    - attribute subset selection (e.g., removing irrelevant attributes through correlation analysis),
    - dimensionality reduction (e.g., using encoding schemes such as minimum length encoding or
      wavelets),
    - numerosity reduction (e.g., "replacing" the data by alternative, smaller representations
      such as clusters or parametric models),
    - generalization with the use of concept hierarchies, by organizing the concepts into varying
      levels of abstraction.

    Data discretization is very useful for the automatic generation of concept hierarchies from
    numerical data.
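
    As a sketch of discretization feeding a concept hierarchy, the code below maps a numeric age to an
    interval and then to a higher-level concept. The interval boundaries (20, 40, 60) and the labels are
    hypothetical choices for illustration, not values from these notes.

```python
from bisect import bisect_right

# Hypothetical interval boundaries and labels for the attribute "age".
BOUNDARIES = [20, 40, 60]
INTERVALS = ["under 20", "20-39", "40-59", "60 and over"]

# Hypothetical higher level of the concept hierarchy.
CONCEPTS = {
    "under 20": "young",
    "20-39": "young",
    "40-59": "middle-aged",
    "60 and over": "senior",
}

def discretize(age):
    """Map a numeric age to its interval label (lower level of the hierarchy)."""
    return INTERVALS[bisect_right(BOUNDARIES, age)]

def generalize(age):
    """Map a numeric age to its concept (higher level of the hierarchy)."""
    return CONCEPTS[discretize(age)]

for age in (19, 20, 45, 60):
    print(age, "->", discretize(age), "->", generalize(age))
```

    Mining can then be run at either level of abstraction: on the intervals or on the coarser concepts.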

    UNIT IV

    ASSOCIATION RULE MINING AND CLASSIFICATION

    Frequent Pattern Analysis

    - Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently
      in a data set
    - First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and
      association rule mining
    - Motivation: Finding inherent regularities in data
      o What products were often purchased together? Beer and diapers?!
      o What are the subsequent purchases after buying a PC?
      o What kinds of DNA are sensitive to this new drug?
      o Can we automatically classify web documents?
    - Applications:
      o Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click
        stream) analysis, and DNA sequence analysis

    Why Is Freq. Pattern Mining Important?

    - Dimension/level constraint
      o in relevance to region, price, brand, customer category
    - Rule (or pattern) constraint
      o small sales (price < $10) triggers big sales (sum > $200)
    - Interestingness constraint
      o strong rules: min_support >= 3%, min_confidence >= 60%

    Constrained Mining vs. Constraint-Based Search

    - Constrained mining vs. constraint-based search/reasoning
      o Both are aimed at reducing the search space
      o Finding all patterns satisfying constraints vs. finding some (or one) answer in
        constraint-based search in AI
      o Constraint-pushing vs. heuristic search
      o How to integrate them is an interesting research problem
    - Constrained mining vs. query processing in DBMS
      o Database query processing requires finding all answers
      o Constrained pattern mining shares a similar philosophy with pushing selections
        deeply in query processing

    The Apriori Algorithm: Example
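
    A minimal sketch of the Apriori idea is given below: frequent itemsets are found level by level, and
    a candidate k-itemset is counted only if all of its (k-1)-subsets are already frequent. The
    four-transaction database and the minimum support count of 2 are assumptions chosen for
    illustration; the function name and structure are likewise illustrative rather than a reference
    implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support_count} for every itemset whose count is >= min_support."""
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Scan the database once to count each candidate k-itemset.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unite frequent (k-1)-itemsets into candidate k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

# Hypothetical transaction database.
database = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]

result = apriori(database, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

    With these inputs the frequent itemsets are four 1-itemsets (A, B, C, E), four 2-itemsets (AC, BC, BE, CE)
    and one 3-itemset (BCE); item D is dropped in the first pass because it occurs only once.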

    Decision Tree Induction

    Information produced by data mining techniques can be represented in many different
    ways. Decision tree structures are a common way to organize classification schemes. In
    classification tasks, decision trees visualize what steps are taken to arrive at a classification. Every
    decision tree begins with what is termed a root node, considered to be the "parent" of every other
    node. Each node in the tree evaluates an attribute in the data and determines which path it should
    follow. Typically, the decision test is based on comparing a value against some constant.
    Classification using a decision tree is performed by routing from the root node until arriving at a
    leaf node.

    The illustration provided here is a canonical example in data mining, involving the
    decision to play or not play based on climate conditions. In this case, outlook is in the position of
    the root node. The branches leaving the node correspond to its attribute values. In this example, the
    child nodes are tests of humidity and windy, leading to the leaf nodes, which are the actual
    classifications. This example also includes the corresponding data, also referred to as instances. In
    our example, there are 9 "play" days and 5 "no play" days.

    Decision trees can represent diverse types of data. The simplest and most familiar is
    numerical data. It is often desirable to organize nominal data as well. Nominal quantities are
    formally described by a discrete set of symbols. For example, weather can be described in either
    numeric or nominal fashion. We can quantify the temperature by saying that it is 11 degrees
    Celsius or 52 degrees Fahrenheit. We could also say that it is cold, cool, mild, warm or hot. The
    former is an example of numeric data, and the latter is a type of nominal data. More accurately,
    the example of cold, cool, mild, warm and hot is a special type of nominal data, described as
    ordinal data. Ordinal data has an implicit assumption of ordered relationships between the values.
    Continuing with the weather example, we could also have a purely nominal description like
    sunny, overcast and rainy. These values have no relationships or distance measures.

    The type of data organized by a tree is important for understanding how the tree works at
    the node level. Recalling that each node is effectively a test, numeric data is often evaluated in
    terms of simple mathematical inequality. For example, numeric weather data could be tested by
    finding if it is greater than 10 degrees Fahrenheit. Nominal data is tested in Boolean fashion; in
    other words, whether or not it has a particular value. The illustration shows both types of tests. In
    the weather example, outlook is a nominal data type. The test simply asks which attribute value is
    represented and routes accordingly. The humidity node reflects numeric tests, with an inequality
    of less than or equal to a threshold value, or greater than that threshold.
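
    Routing an instance from the root to a leaf can be written as nested tests. The sketch below
    assumes the usual shape of the weather tree (outlook at the root, humidity tested under "sunny",
    windy tested under "rainy", "overcast" leading straight to a leaf). The humidity threshold of 75 is
    an assumed value chosen for illustration, as is the function name.

```python
HUMIDITY_THRESHOLD = 75  # assumed value for illustration only

def classify(outlook, humidity, windy):
    """Route one instance from the root node (outlook) down to a leaf."""
    # Nominal test at the root: which value does outlook have?
    if outlook == "overcast":
        return "play"
    if outlook == "sunny":
        # Numeric test: inequality against a constant.
        return "play" if humidity <= HUMIDITY_THRESHOLD else "no play"
    if outlook == "rainy":
        # Boolean test on the windy attribute.
        return "no play" if windy else "play"
    raise ValueError(f"unknown outlook: {outlook!r}")

print(classify("sunny", humidity=85, windy=False))    # no play
print(classify("overcast", humidity=90, windy=True))  # play
print(classify("rainy", humidity=80, windy=False))    # play
```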

    Decision tree induction algorithms function recursively. First, an attribute must be selected
    as the root node. In order to create the most efficient (i.e., smallest) tree, the root node must
    effectively split the data. Each split attempts to pare down a set of instances (the actual data) until
    they all have the same classification. The best split is the one that provides what is termed the
    most information gain.

    Information in this context comes from the concept of entropy from information theory, as
    developed by Claude Shannon. Although "information" has many contexts, it has a very specific
    mathematical meaning relating to certainty in decision making. Ideally, each split in the decision
    tree should bring us closer to a classification. One way to conceptualize this is to see each step
    along the tree as removing randomness or entropy. Information, expressed as a mathematical
    quantity, reflects this. For example, consider a very simple classification problem that requires
    creating a decision tree to decide yes or no based on some data. This is exactly the scenario
    visualized in the decision tree. Each attribute's values will have a certain number of yes or no
    classifications. If there are equal numbers of yeses and nos, then there is a great deal of entropy in

    that value. In this situation, the entropy reaches its maximum and the attribute value tells us
    nothing about the classification. Conversely, if there are only yeses or only nos, the entropy is
    zero, and the attribute value is very useful for making a decision.

    The formula for calculating intermediate values is as follows:
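
    One standard way to compute these values, assuming Shannon entropy measured in bits, is
    entropy(p1, ..., pn) = -p1 log2 p1 - ... - pn log2 pn, where pi is the fraction of instances in
    class i. The information gain of a split is the entropy before the split minus the weighted average
    entropy of the resulting subsets. The sketch below applies this to the weather example; the
    per-outlook counts (sunny 2 yes / 3 no, overcast 4 / 0, rainy 3 / 2) are those of the commonly used
    14-instance weather data set and are an assumption here, as are the function names.

```python
from math import log2

def entropy(counts):
    """Shannon entropy in bits of a class distribution given as counts, e.g. [yes, no]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child subsets."""
    total = sum(parent)
    remainder = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - remainder

print(entropy([7, 7]))   # equal yes/no counts: maximum entropy of 1 bit
print(entropy([4, 0]))   # only yeses: entropy 0
print(entropy([9, 5]))   # 9 play / 5 no play: about 0.940 bits

# Splitting on outlook: [yes, no] counts for sunny, overcast, rainy (assumed data set).
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # about 0.247 bits
```

    The induction algorithm would compute this gain for every candidate attribute and split on the one with the largest value.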

    Machine Learning

    The general problem of machine learning is to search a (usually very large) space of potential
    hypotheses to determine the one that will best fit the data and any prior knowledge. The data may be
    labelled or unlabelled. If labels are given, then the problem is one of supervised learning, in that the true
    answer is known for a given set of data. If the labels are categorical, then the problem is one of
    classification, e.g. predicting the species of a flower given petal and sepal measurements. If the labels are
    real-valued, the problem is one of regression, e.g. predicting property values from crime, pollution, etc.
    statistics. If labels are not given, then the problem is one of unsupervised learning and the aim is to
    characterize the structure of the data, e.g. by identifying groups of examples in the data that are
    collectively similar to each other and distinct from the other data.

    Supervised Learning

    Given some examples, we wish to predict certain properties. Where a set of examples is available
    whose properties have already been characterized, the task is to learn the relationship
    between the two. One common early approach was to present the examples in turn to a learner. The learner
    makes a prediction of the property of interest, the correct answer is presented, and the learner adjusts its
    hypothesis accordingly. This is known as learning with a teacher, or supervised learning.

    In supervised learning there is necessarily the assumption that the descriptors available are in some
    way related to a quantity of interest. For instance, suppose that a bank wishes to detect fraudulent credit card
    transactions. In order to do this, some domain knowledge is required to identify factors that are likely to be
    indicative of fraudulent use. These may include frequency of usage, amount of transaction, spending
    patterns, type of business engaging in the transaction and so forth. These variables are the predictive, or
    independent, variables x. It would be hoped that these were in some way related to the target, or
    dependent, variable y. Deciding which variables to use in a model is a very difficult problem in general; this
    is known as the problem of feature selection and is NP-complete. Many methods exist for choosing the
    predictive variables; if domain knowledge is available, then this can be very useful in this context. Here we
    assume that at least some of the predictive variables are in fact predictive. Assume, then, that the
    relationship between x and y is given by the joint probability density p(x, y).

    UNIT V

    CLUSTERING AND APPLICATIONS AND TRENDS IN DATA MINING

    Cluster Analysis

    Data clustering is a method in which we form clusters of objects that are similar in their
    characteristics. The criterion for checking the similarity is implementation dependent.

    Clustering is often confused with classification, but there is a difference between the two. In
    classification the objects are assigned to predefined classes, whereas in clustering the classes themselves
    also have to be defined.

    More precisely, data clustering is a technique in which information that is logically similar is
    physically stored together. In order to increase the efficiency of database systems, the number of disk
    accesses is to be minimized. In clustering, the objects with similar properties are placed in one class of

    objects and a single access to the disk makes the entire cl