Dw Tutorial Index

  • 8/13/2019 Dw Tutorial Index

    1/38

    Index

    What are the Source systems?

    ETL process

    EDW (Enterprise Data Warehouse) and DM (Data Mart)

    OLAP (Online Analytical Processing)

    Dimensional Modeling

    Topology (all data marts, dependent, independent)

    Audience

    Data Warehousing.

    Data Warehouse basic concepts

    Data Warehouse Approach

    Data Warehouse Implementation

    OLAP (Online Analytical Processing)

    Next steps in Data Warehousing

    By V.S.Rajesh Kumar

    November 2004

    Data Warehouse - Concepts

    Module 1

    Data Warehouse basic concepts

    What is DSS?

    A Decision Support System (DSS) is mainly used by the business to make strategic decisions based on trends (for example, comparing the current fiscal year to the previous one) and to project numbers based on history and some parameters.

    A DSS is not used to run the business. OLTP systems take care of the day-to-day activities of a business. For example, SAP Order Management takes care of the orders the organization receives. In the DSS we collect all the data to do the analysis.

    OLTP

    Online Transaction Processing system

    Examples of OLTP systems are order management, TERA, etc.

    Usually follows third normal form (3NF) when designing the database

    All DML types (insert, update, delete) are active

    Deals with specific data (customer x, product z, etc.)

    OLTP vs DSS

    OLTP:

      More DML operations (updates, deletes, inserts)

      Point queries

      Very specific when issuing queries

      Less history (approximately 6 months to 1 year)

      Used for day-to-day activities (a must to run the business)

    DSS:

      No change in the data (no updates or deletes)

      Queries based on time period, set of products, set of customers, etc.

      Maintains the history

      Used mainly for analytics (trend analysis, customer behavior, etc.)
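
    To make the contrast concrete, here is a minimal sketch of the two query styles using Python's built-in sqlite3 module. The orders table, its columns, and its rows are hypothetical and chosen only for illustration; a real OLTP system and a real DSS would normally be separate databases.

```python
import sqlite3

# Hypothetical in-memory "orders" table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, "
    "product TEXT, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, "x", "z", "2004-01-15", 100.0),
        (2, "y", "z", "2004-02-10", 250.0),
        (3, "x", "w", "2004-02-20", 75.0),
        (4, "y", "w", "2003-02-05", 60.0),
    ],
)


def point_query(order_id):
    """OLTP style: fetch one specific row by its key."""
    return conn.execute(
        "SELECT customer, product, amount FROM orders WHERE order_id = ?",
        (order_id,),
    ).fetchone()


def sales_by_product(start_date, end_date):
    """DSS style: aggregate over a time period across products."""
    return conn.execute(
        "SELECT product, SUM(amount) FROM orders "
        "WHERE order_date BETWEEN ? AND ? "
        "GROUP BY product ORDER BY product",
        (start_date, end_date),
    ).fetchall()
```

    The point query touches a single order, while the second query scans a whole period and summarizes it, which is the access pattern a DSS is designed for.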

    General DSS Architecture

    [Diagram: source data (OLTP 1, OLTP 2, Market Place, web clicks) passes through ETL (a tool or T-SQL), with a staging DB and an ODS, into the data warehouse database. The warehouse feeds pre-defined reports, ad hoc reporting, OLAP cubes, and data mining. Closing the loop: findings in the DSS are written back to OLTP.]

    Architecture Diagram

    [Diagram: source data (HR data, Finance, Payroll, Project) is moved by ET&L, using Microsoft DTS (Data Transformation Services) and stored procedures, into the data warehouse database (a SQL Server database). The warehouse feeds pre-defined reports, ad hoc reporting, and OLAP cubes.]

    Example for a DSS

    [Diagram: OLTP 1, OLTP 2, OLTP 3, and OLTP 4 feed the data warehouse, which supports OLAP, reporting, and analytics.]

    DSS Categories

    Operational Data Store (ODS)

      Support for: consolidated and reconciled operational data capture and access

      Detailed, lightly summarized

      Process oriented, subject oriented, integrated

      Volatile (updateable)

      Current

      Short; business process life (30 to 90 days of history), purge

    Enterprise Data Warehouse (EDW)

      Support for: single source of consistent, integrated, cross-functional data for access and distribution

      Detailed atomic record of events, reference and dimension masters, derived, summarized

      Subject oriented, integrated

      Non-volatile; periodic loads, read only

      Time variant

      Long; institutional memory (2 years or more of history), archive

    Relational Data Mart (RDM)

      Support for: subset of integrated data, separated for autonomous processing, optimized for access

      Aggregated, summarized, specialized

      Subject oriented, integrated

      Non-volatile; periodic load, can contain separate updateable structures for OLTP support

      Time variant

      Variable retention; some archive

    Online Analytical Processing (OLAP)

      Support for: subset of integrated data, separated for autonomous processing, optimized for access

      Aggregated, summarized, specialized

      Subject oriented, integrated

      Non-volatile; periodic load, can contain separate updateable structures for OLTP support

      Time variant

      Variable retention; some archive

    ETL (E Extract)

    Extract: getting data out of the source systems. This may be just a DTS package that pulls the data, or an export of a table to a flat file in the source system.

    In Teradata, the FastExport utility can export the data to a flat file.

    In Oracle, data can be written to a flat file by spooling query results from SQL*Plus. (SQL*Loader is Oracle's load utility; it does not export data.)

    In SQL Server, a DTS package can do the same job.
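
    As a rough illustration of the extract step, the sketch below dumps a table to a CSV flat file with Python's standard library. It is not any vendor's utility; the resellers table, its columns, and the helper name extract_to_flat_file are assumptions made for the example, and SQLite stands in for the source system.

```python
import csv
import io
import sqlite3


def extract_to_flat_file(conn, table, out):
    """Write every row of `table` to `out` as CSV, header first; return the row count."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# Hypothetical source table standing in for an OLTP system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE resellers (reseller_id INTEGER, name TEXT, country TEXT)")
source.executemany(
    "INSERT INTO resellers VALUES (?, ?, ?)",
    [(1, "Acme", "IN"), (2, "Globex", "US")],
)

# StringIO stands in for a real file opened with open(path, "w", newline="").
buffer = io.StringIO()
rows_written = extract_to_flat_file(source, "resellers", buffer)
```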

    ETL (T Transform)

    Transform: the source and the destination do not need to share the same data model. When the destination's data model differs from the source's, we have to reshape the source data to fit the destination's data model. This process is called transformation.

    Example: when we receive reseller information from various distributors, it does not include geo information. The transformation logic therefore contains code that assigns the respective geo based on the country the data comes from.

    This is a simple example of transformation.
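
    A minimal sketch of that geo assignment follows. The country-to-geo mapping, the field names, and the "Unknown" fallback are all hypothetical; the source does not give the actual rules.

```python
# Hypothetical country-to-geo mapping; the real assignment rules are not given in the source.
COUNTRY_TO_GEO = {"US": "Americas", "IN": "Asia Pacific", "DE": "EMEA"}


def add_geo(record, default="Unknown"):
    """Return a copy of a reseller record with a 'geo' field derived from 'country'."""
    transformed = dict(record)
    country = record["country"].strip().upper()
    transformed["geo"] = COUNTRY_TO_GEO.get(country, default)
    return transformed


incoming = [
    {"reseller": "Acme", "country": "in"},
    {"reseller": "Globex", "country": "US"},
    {"reseller": "Initech", "country": "BR"},
]
transformed_rows = [add_geo(r) for r in incoming]
```

    Countries missing from the mapping fall back to a default instead of failing, so unmapped rows can be reviewed later.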

    ETL (L Load)

    Load: loading the transformed data into the destination data model (the data warehouse).

    Just as each RDBMS provides export functionality, each also provides a utility to import data into the database.

    Teradata: FastLoad

    Oracle: SQL*Loader

    Sybase: bcp
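
    The load step can be sketched in the same spirit: read a flat file and bulk-insert its rows into a warehouse table. This uses SQLite and the csv module as stand-ins for the vendor loaders named above; the dim_reseller table and the load_flat_file helper are assumptions for the example.

```python
import csv
import io
import sqlite3


def load_flat_file(conn, table, columns, source):
    """Bulk-insert CSV rows (with a header line) from `source` into `table`; return the row count."""
    reader = csv.DictReader(source)
    rows = [tuple(record[c] for c in columns) for record in reader]
    placeholders = ", ".join("?" for _ in columns)
    column_list = ", ".join(columns)
    with conn:  # commit on success, roll back on error
        conn.executemany(
            f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})", rows
        )
    return len(rows)


# Hypothetical warehouse table and flat file contents.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE dim_reseller (reseller TEXT, country TEXT, geo TEXT)")
flat_file = io.StringIO(
    "reseller,country,geo\nAcme,IN,Asia Pacific\nGlobex,US,Americas\n"
)
loaded = load_flat_file(
    warehouse, "dim_reseller", ["reseller", "country", "geo"], flat_file
)
```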

    Data Modeling for OLTP

    Usually third normal form (3NF).

    Advantages: flexibility to modify the model as requirements change. No redundancy of data in the model.

    Disadvantages: report queries are complex, because the number of tables to join is usually high.

    Dimensional Modeling for DSS

    Star schema, snowflake schema

    Which type of model suits better depends on the RDBMS.

    Example: Teradata is a parallel-processing database engine, so it can return results in reasonable time even when the enterprise data model is designed in third normal form. We cannot take the same approach for SQL Server or Oracle; there we should think about denormalizing the data model.

    A star schema makes queries run faster because fewer tables need to be joined.

    In a star schema, all the hierarchies defined for a dimension are stored in a single table, so data redundancy is high. In a snowflake schema, a hierarchy can be moved into one more table. That is the difference between the star schema and the snowflake schema.
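
    The difference can be shown with one small product dimension modeled both ways. The sketch below uses SQLite through Python; every table name, column name, and value is hypothetical. Both queries return the same totals, but the snowflake version needs one more join.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE fact_sales (product_id INTEGER, amount REAL);

-- Star: the product -> category hierarchy is flattened into one dimension table.
CREATE TABLE star_product (product_id INTEGER PRIMARY KEY, product TEXT, category TEXT);

-- Snowflake: the category level moves to its own lookup table.
CREATE TABLE snow_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE snow_product (product_id INTEGER PRIMARY KEY, product TEXT, category_id INTEGER);

INSERT INTO fact_sales VALUES (1, 10.0), (2, 20.0), (3, 5.0), (1, 15.0);
INSERT INTO star_product VALUES (1, 'Milk', 'Dairy'), (2, 'Cheese', 'Dairy'), (3, 'Bread', 'Bakery');
INSERT INTO snow_category VALUES (100, 'Dairy'), (200, 'Bakery');
INSERT INTO snow_product VALUES (1, 'Milk', 100), (2, 'Cheese', 100), (3, 'Bread', 200);
""")

# Star schema: one join from the fact table to the dimension.
STAR_QUERY = """
SELECT d.category, SUM(f.amount)
FROM fact_sales f
JOIN star_product d ON d.product_id = f.product_id
GROUP BY d.category ORDER BY d.category
"""

# Snowflake schema: a second join is needed to reach the category name.
SNOW_QUERY = """
SELECT c.category, SUM(f.amount)
FROM fact_sales f
JOIN snow_product p ON p.product_id = f.product_id
JOIN snow_category c ON c.category_id = p.category_id
GROUP BY c.category ORDER BY c.category
"""

star_result = db.execute(STAR_QUERY).fetchall()
snow_result = db.execute(SNOW_QUERY).fetchall()
```

    Note that 'Dairy' is stored twice in star_product but only once in snow_category, which is the redundancy trade-off described above.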

    Star Schema

    A star schema is optimized for queries. A data model based on a star schema contains redundant data.

    Snowflake Schema

    A snowflake schema has little redundant data, because most of the dimensions have a lookup table.

    As a result, the number of joins between the tables increases.

    Both have advantages and disadvantages, so analyze the end users' requirements and the space constraints to pick the better fit.

    Data Refresh in DSS

    We have to refresh the data in the DSS from the various source systems in a timely manner.

    When doing so, we either do a full refresh of a particular table or capture only the changed data (this process is called a delta refresh).

    Usually we use a delta refresh for fact tables and a full refresh for dimension tables. As the environment grows, almost all the tables become delta loads.
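
    The two refresh strategies can be sketched as follows. This assumes the source rows carry a last-modified timestamp (here an updated_at column), which is one common way to detect changes but is not something the source specifies; the customer table and both function names are hypothetical.

```python
import sqlite3

DDL = "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"


def full_refresh(source, target):
    """Replace the whole target table with the current source rows."""
    rows = source.execute("SELECT id, name, updated_at FROM customer").fetchall()
    with target:
        target.execute("DELETE FROM customer")
        target.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    return len(rows)


def delta_refresh(source, target, last_load):
    """Copy only rows changed after `last_load`; return (rows applied, new watermark)."""
    rows = source.execute(
        "SELECT id, name, updated_at FROM customer "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_load,),
    ).fetchall()
    with target:
        # INSERT OR REPLACE updates existing keys and inserts new ones.
        target.executemany("INSERT OR REPLACE INTO customer VALUES (?, ?, ?)", rows)
    new_watermark = rows[-1][2] if rows else last_load
    return len(rows), new_watermark


source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute(DDL)
target.execute(DDL)
source.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ann", "2004-11-01"), (2, "Bob", "2004-11-02")],
)
source.commit()

initial_count = full_refresh(source, target)

# Later, one row changes and one row is added in the source.
source.execute("UPDATE customer SET name = 'Robert', updated_at = '2004-11-03' WHERE id = 2")
source.execute("INSERT INTO customer VALUES (3, 'Cy', '2004-11-04')")
source.commit()

applied, watermark = delta_refresh(source, target, "2004-11-02")
```

    A timestamp-based delta like this does not notice rows deleted in the source; handling deletes needs a separate mechanism.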

    Advantages of DSS

    Safeway, a grocery store chain in the US, gives various information from the DSS directly to store managers. For example, the system can predict a stock outage for a particular item in a store. Based on history, the system knows that there should be a sale of a particular item every 3 hours; if the DSS has not seen a transaction for it in the last 2 hours, it sends an SMS to the current shift manager's mobile phone. That is the level you can reach with a DSS. It takes time to get there.

    Walmart does customer profiling, store sales analysis, etc. on their data warehouse, which is implemented on Teradata.

    FedEx uses Teradata, Ab Initio, and MicroStrategy as their DSS tools.

    Data Warehouse - Concepts

    Module 2

    Data Warehouse Approach

    Distributed Approach

    Various departments can start creating different data marts. Each can work independently and see the ROI in a short span. In the long run, integrating these data marts adds complexity, and the cost will be higher because there are more systems to maintain.

    Distributed Approach to DSS

    Gives only part of the answer

    Requires time and effort to put the pieces together

    No guarantee it's the right answer

    How We Are Different

    Centralized Approach

    A centralized data warehouse keeps the data in one place, which makes it easy to answer any business question. In the long run it has a cost advantage over a non-centralized data warehouse. It is not very easy to implement, as it needs more time and resources, and the ROI will not be seen until the implementation is completed. The recommended way to implement a centralized data warehouse is therefore to start with one subject area and keep adding one subject area at a time. This way the organization gets to see the ROI at various stages.

    Centralized Approach to DSS

    Delivers one version of the truth for increased confidence and speed in decision-making

    How We Are Different

    Data Warehouse - Concepts

    Module 3

    Data Warehouse Implementation

    Steps

    Typical Approach

    Data modeling is a cyclic process involving the following steps:

    Requirement Gathering

    Requirement Analysis

    Requirement Validation

    Logical Modeling

    Physical Design

    Implementation

    Validation

    The above cycle repeats for any upgrades or enhancements.

    Requirement Gathering

    Identify the business objectives

    Identify the reporting requirements

    Identify the frequency of report generation

    Granularity of Information

    Business rules

    Requirement Analysis

    Study the requirements captured

    Identify the subject areas

    Identify the measures and criteria fields

    Identify the granularity of information required

    Requirement Validation

    Validate the analysis with the customer

    Document sign-off

    Logical Modeling

    Identify facts and dimensions

    Create the logical model

    Physical Design

    Analyze source systems with respect to the logical model

    Data quality analysis

    Physical design

      Data types

      Indexes

      Partitioning

      Database creation, etc.

    Source to target mapping

    Capture Transformation rules

    Capture Derivation rules for derived fields

    Implementation

    Database Creation

    Staging Design (Design Extraction Jobs)

    Develop ETL Jobs

    Unit testing of ETL Jobs

    Schedule Jobs

    Test Load

    Data Validation

    Performance monitoring

    ETL Job tuning

    Test Database performance tuning

    Final loading of data from source to target

    Data Warehouse - Concepts

    Module 4

    OLAP (Online Analytical Processing)

    What is OLAP?

    OLAP stands for Online Analytical Processing.

    It means viewing data in a multidimensional way.

    Why OLAP?

    Slice and dice for the data warehouse.

    An RDBMS is a two-dimensional way of storing and viewing the data.
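
    Slicing and dicing can be illustrated without any OLAP product. The sketch below treats a list of fact rows as a tiny cube with three dimensions and one measure; the data and the helper names dice and aggregate are hypothetical, and a real OLAP engine would precompute or index these aggregates rather than scan a list.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimensions (product, region, quarter) and one measure (sales).
FACTS = [
    {"product": "Milk", "region": "East", "quarter": "Q1", "sales": 10},
    {"product": "Milk", "region": "West", "quarter": "Q1", "sales": 20},
    {"product": "Bread", "region": "East", "quarter": "Q1", "sales": 5},
    {"product": "Bread", "region": "East", "quarter": "Q2", "sales": 7},
    {"product": "Milk", "region": "East", "quarter": "Q2", "sales": 12},
]


def dice(facts, **filters):
    """Keep rows whose dimension values fall in the allowed sets.

    Restricting a single dimension to a single value is a "slice".
    """
    return [
        f for f in facts
        if all(f[dim] in allowed for dim, allowed in filters.items())
    ]


def aggregate(facts, *dims):
    """Total the sales measure for each combination of the chosen dimensions."""
    totals = defaultdict(int)
    for f in facts:
        totals[tuple(f[d] for d in dims)] += f["sales"]
    return dict(totals)


# Slice: fix quarter = Q1, then view sales by product and region.
q1_by_product_region = aggregate(dice(FACTS, quarter={"Q1"}), "product", "region")

# Dice: restrict two dimensions at once, then view sales by quarter.
milk_east_by_quarter = aggregate(
    dice(FACTS, product={"Milk"}, region={"East"}), "quarter"
)
```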

    Types of OLAP

    There are three types of OLAP in the industry.

    1. MOLAP: Multidimensional OLAP (e.g. MSOLAP, Essbase, Cognos).

    2. ROLAP: Relational OLAP (e.g. Business Objects, MicroStrategy).

    3. HOLAP: Hybrid OLAP

    Data Warehouse - Concepts

    Module 5

    Next steps in Data Warehousing

    Data Mining

    OLAP is like fishing (one trend at a time). Data mining is like fishing using a net.

    Mining tools provide sophisticated algorithms to find specific trends in the available data.

    Example: MS Analysis Server provides algorithms such as clustering.

    Mainly used to identify sets of customers who think alike, for fraud detection, etc.
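
    To give a feel for what a clustering algorithm does, here is a tiny one-dimensional k-means written in plain Python. It is a generic textbook technique used purely for illustration, not the algorithm of MS Analysis Server or any other product; the spend figures and starting centroids are made up.

```python
def k_means_1d(values, centroids, iterations=10):
    """Tiny 1-D k-means: returns (final centroids, cluster index per value)."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        for i in range(len(centroids)):
            members = [v for v, a in zip(values, assignment) if a == i]
            if members:
                centroids[i] = sum(members) / len(members)
    return centroids, assignment


# Hypothetical monthly spend per customer; two groups are visible by eye.
monthly_spend = [20, 25, 30, 200, 210, 220]
centers, labels = k_means_1d(monthly_spend, centroids=[0, 100])
```

    Customers with the same label behave alike on this one measure; real mining tools cluster on many attributes at once.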

    Business Activity Monitoring(BAM)

    BAM is the technology used to actively monitor the DW or OLTP systems for certain values.

    When the system finds an exception, it can run a set of processes and send the information to the relevant owners so they can take action.

    Based on the findings, it can immediately update the relevant OLTP system (conceptually, this is called closing the loop between the DSS and OLTP).

    Example: INFORAY is a BAM tool that you can use on the DW.
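
    The monitoring idea can be sketched in a few lines. This is a generic illustration and does not reflect how INFORAY or any other BAM tool works; the monitor function, the threshold rule, and the callbacks are assumptions. The notify callback stands in for alerting the owner, and write_back stands in for updating the OLTP system.

```python
def monitor(readings, threshold, notify, write_back):
    """Check each (item, value) reading; on an exception, alert the owner and close the loop."""
    exceptions = []
    for item, value in readings:
        if value < threshold:
            notify(f"{item}: value {value} is below {threshold}")
            write_back(item)
            exceptions.append(item)
    return exceptions


# Plain lists stand in for a messaging system and an OLTP reorder queue.
alerts = []
reorders = []
flagged = monitor(
    [("milk", 12), ("bread", 3), ("eggs", 0)],
    threshold=5,
    notify=alerts.append,
    write_back=reorders.append,
)
```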