DW Architecture & DataFlow

  • 8/3/2019 DW Architecture & DataFlow

    Architecture of Data Warehouse

    By: Er. Manu Bansal

    (Assistant Professor)

    Dept. of IT

    Data Warehouse: Concept

    A data warehouse is a database that is maintained separately from an organization's operational databases.

    Constructing a data warehouse involves data cleaning, data integration, and data transformation.

    Data warehousing also forms an essential step in the knowledge discovery process.

    Data Warehouse vs. Database

    Four keywords distinguish data warehouses from other data repository systems, such as relational database systems, transaction processing systems, and file systems:

      Subject-oriented
      Integrated
      Time-variant
      Nonvolatile

    Three-Tier Architecture

    [Figure: three-tier data warehouse architecture. Data sources (operational DBs and other sources) are extracted, transformed, loaded, and refreshed into the data storage tier (data warehouse, data marts, metadata, monitor & integrator). The OLAP server tier (OLAP engine) serves the front-end tools: analysis, query, reports, and data mining.]

    Typical Components of a Data Warehouse Architecture

    Operational Data

    Without source systems, there would be no data.

    The data for the data warehouse is supplied from the following sources:

      Operational data held in network databases
      Departmental data held in file systems
      Private data held on workstations and private servers
      External systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers

    Operational Data Store (ODS)

    An ODS is a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.

    ODS objectives: to integrate information from day-to-day systems and allow operational lookup, relieving day-to-day systems of reporting and current-data analysis demands.

    An ODS can be a helpful step towards building a data warehouse because it can supply data that has already been extracted from the source systems and cleaned.

    Load Manager

    Called the front-end component.

    Data is extracted from the operational systems, either directly or via the operational data store, and then loaded into the data warehouse.

    Performs all the operations associated with the extraction and loading of data into the warehouse.

    These operations rely on sourcing, acquisition, cleanup, and transformation tools, which prepare the data for entry into the warehouse. The functionality includes:

      Removing unwanted data from operational databases
      Converting to common data names and definitions
      Calculating summaries
      Establishing defaults for missing data
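
    The four load-manager functions listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names, the COMMON_NAMES mapping, the rule for "unwanted" rows, and the default region value are all hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical mapping from source-specific column names to common names.
COMMON_NAMES = {
    "cust_id": "customer_id", "CustomerNo": "customer_id",
    "amt": "amount", "SaleAmount": "amount",
    "region": "region",
}
DEFAULTS = {"region": "UNKNOWN"}  # assumed default for missing data


def prepare(rows):
    """Clean raw operational rows before they enter the warehouse."""
    cleaned = []
    for row in rows:
        # Convert to common data names; columns without a mapping are dropped.
        renamed = {COMMON_NAMES[k]: v for k, v in row.items() if k in COMMON_NAMES}
        # Remove unwanted data: here, rows lacking a customer or an amount.
        if renamed.get("customer_id") is None or renamed.get("amount") is None:
            continue
        # Establish defaults for missing data.
        for field, default in DEFAULTS.items():
            if renamed.get(field) is None:
                renamed[field] = default
        cleaned.append(renamed)
    return cleaned


def summarize(rows):
    """Calculate a simple summary: total amount per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


raw = [
    {"cust_id": 1, "amt": 10.0, "region": "north", "debug_flag": "x"},
    {"CustomerNo": 2, "SaleAmount": 5.5},
    {"cust_id": None, "amt": 3.0, "region": "south"},
]
clean = prepare(raw)
totals = summarize(clean)
```

    A real load manager would typically read these rules from the warehouse metadata rather than hard-coding them.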

    Warehouse Manager

    Performs all the operations associated with the management of the data in the warehouse, as follows:

      Analysis of data to ensure consistency
      Transformation and merging of source data from temporary storage into the data warehouse tables
      Creation of indexes and views
      Backing up and archiving data
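
    As a rough sketch of these warehouse-manager duties, the example below uses Python's built-in sqlite3 module with an in-memory database. The table names (staging_sales, sales), the consistency rule (no negative amounts), and the sample rows are assumptions made for illustration; a production warehouse would use its own RDBMS and rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_sales (sale_id INTEGER, region TEXT, amount REAL);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO staging_sales VALUES
        (1, 'north', 10.0), (2, 'south', 4.0), (2, 'south', 4.0), (3, 'north', -1.0);
""")

# Analysis for consistency: count staging rows that break the assumed rule.
bad_rows = conn.execute(
    "SELECT COUNT(*) FROM staging_sales WHERE amount < 0"
).fetchone()[0]

# Merge consistent, de-duplicated rows from temporary storage into the warehouse table.
conn.execute("""
    INSERT OR IGNORE INTO sales (sale_id, region, amount)
    SELECT DISTINCT sale_id, region, amount FROM staging_sales WHERE amount >= 0
""")

# Create an index and a view over the warehouse table.
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
conn.execute("""
    CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total FROM sales GROUP BY region
""")
conn.commit()

# Back up the warehouse into a second database (here also in memory).
backup = sqlite3.connect(":memory:")
conn.backup(backup)

view_rows = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
```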

    Data Warehouse Database

    The central repository for information. This database is almost always implemented on relational database management system (RDBMS) technology.

    Certain data warehouse attributes, such as very large database size, ad hoc query processing, and the need for flexible user view creation (including aggregates, multi-table joins, and drill-downs), have become drivers for different technology approaches to the data warehouse database. These approaches include:

    Data Warehouse Database (Contd.)

      Parallel relational database designs that require a parallel computing platform, such as symmetric multiprocessors (SMPs) and massively parallel processors (MPPs)
      Multidimensional databases (MDDBs)

    Query Manager

    Called the back-end component.

    Performs all the operations associated with the management of user queries, such as directing queries to the appropriate tables and scheduling the execution of queries.
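
    One way to picture the two query-manager tasks (directing and scheduling) is the sketch below. The CATALOG of tables, the "smallest covering table" routing rule, and the priority scheme are assumptions chosen for illustration, not something the slides prescribe.

```python
import heapq

# Hypothetical catalog: table name -> (available columns, approximate row count).
CATALOG = {
    "sales_detail": ({"day", "store", "product", "region"}, 1_000_000),
    "sales_by_day_region": ({"day", "region"}, 10_000),
    "sales_by_region": ({"region"}, 10),
}


def route(needed_columns):
    """Direct a query to the smallest table that has every column it needs."""
    candidates = [
        (rows, name)
        for name, (cols, rows) in CATALOG.items()
        if set(needed_columns) <= cols
    ]
    if not candidates:
        raise LookupError(f"no table covers {sorted(needed_columns)}")
    return min(candidates)[1]


class QueryScheduler:
    """Run queries in priority order (lower number first), FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker that preserves submission order

    def submit(self, priority, needed_columns):
        heapq.heappush(self._heap, (priority, self._counter, route(needed_columns)))
        self._counter += 1

    def run_order(self):
        order = []
        while self._heap:
            _, _, table = heapq.heappop(self._heap)
            order.append(table)
        return order


scheduler = QueryScheduler()
scheduler.submit(2, {"product"})
scheduler.submit(1, {"region"})
scheduler.submit(2, {"day", "region"})
order = scheduler.run_order()
```

    Routing to the smallest sufficient table is why pre-built summary tables pay off: a query that only needs regional totals never touches the detail rows.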

    Detailed Data

    Stores all the detailed data in the database schema.

    On a regular basis, detailed data is added to the warehouse to supplement the aggregated data.

    Lightly and Highly Summarized Data

    Stores all the predefined lightly and highly aggregated data generated by the warehouse manager.

    The purpose of summary information is to speed up query performance. It also removes the need to continually perform summary operations (such as sort or group by) when answering user queries.

    The summarized data is updated continuously as new data is loaded into the warehouse.
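
    A small sqlite3 sketch of the idea, assuming hypothetical tables sales_detail (detail), sales_daily (lightly summarized), and sales_region (highly summarized). For simplicity it rebuilds both summary levels on every load; an actual warehouse manager would more likely refresh them incrementally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_detail (day TEXT, region TEXT, amount REAL);
    CREATE TABLE sales_daily  (day TEXT, region TEXT, total REAL);  -- lightly summarized
    CREATE TABLE sales_region (region TEXT, total REAL);            -- highly summarized
""")


def load(rows):
    """Load detail rows, then rebuild both summary levels."""
    conn.executemany("INSERT INTO sales_detail VALUES (?, ?, ?)", rows)
    conn.execute("DELETE FROM sales_daily")
    conn.execute(
        "INSERT INTO sales_daily "
        "SELECT day, region, SUM(amount) FROM sales_detail GROUP BY day, region"
    )
    conn.execute("DELETE FROM sales_region")
    conn.execute(
        "INSERT INTO sales_region "
        "SELECT region, SUM(total) FROM sales_daily GROUP BY region"
    )
    conn.commit()


load([("2019-08-01", "north", 10.0), ("2019-08-01", "north", 5.0),
      ("2019-08-02", "north", 1.0)])
load([("2019-08-02", "south", 2.0)])

# User queries read the small summary table instead of grouping the detail rows.
region_totals = conn.execute(
    "SELECT region, total FROM sales_region ORDER BY region"
).fetchall()
daily_count = conn.execute("SELECT COUNT(*) FROM sales_daily").fetchone()[0]
```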

    Archive/Backup Data

    Stores detailed and summarized data for the purposes of archiving and backup.

    It may be necessary to back up online summary data if this data is kept beyond the retention period for detailed data.

    The data is transferred to storage archives such as magnetic tape or optical disk.

    Metadata

    This area of the warehouse stores all the metadata definitions used by all the processes in the warehouse. Metadata is used for a variety of purposes:

      Extraction and loading processes
      Warehouse management process: used to automate the production of summary tables
      Query management process: used to direct a query to the most appropriate data source
      End-user access tools: use metadata to understand how to build a query
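
    To make "metadata automates the production of summary tables" concrete, here is a minimal sketch in which a metadata entry drives SQL generation. The SummaryDef class, its fields, and the table names are hypothetical; real warehouses keep such definitions in a metadata repository with a much richer schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummaryDef:
    """Hypothetical metadata entry describing one summary table."""
    name: str        # summary table to produce
    source: str      # detail table it is derived from
    group_by: tuple  # grouping columns
    measure: str     # numeric column to total


def build_summary_sql(meta):
    """Generate the SQL that produces the summary table described by `meta`."""
    keys = ", ".join(meta.group_by)
    return (
        f"CREATE TABLE {meta.name} AS "
        f"SELECT {keys}, SUM({meta.measure}) AS total_{meta.measure} "
        f"FROM {meta.source} GROUP BY {keys}"
    )


meta = SummaryDef("sales_by_region", "sales_detail", ("region",), "amount")
sql = build_summary_sql(meta)
```

    The same entry could also serve the query manager and end-user tools, since it records which table holds which level of aggregation.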

    End-user Access Tools

    Users interact with the warehouse using end-user access tools.

    These tools can be categorized into five main groups:

      Data reporting and query tools (e.g. Query by Example, MS Access DBMS)
      Application development tools (applications used to access major database systems such as Oracle, Sybase, etc.)
      Executive information system (EIS) tools (for sales, marketing, and finance)
      Online analytical processing (OLAP) tools (allow users to analyze the data using complex, multidimensional views from multiple databases)
      Data mining tools (allow the discovery of new patterns and trends by mining large amounts of data using statistical and mathematical tools)

    Data Warehousing: Data flows

    Inflow

    The processes associated with the extraction, cleansing, and loading of data from the source systems into the data warehouse.

      Cleaning includes removing inconsistencies, adding missing fields, and cross-checking for data integrity.
      Transformation includes adding date/time stamp fields, summarizing detailed data, and deriving new fields to store calculated data.
      Relevant data is extracted from multiple, heterogeneous, and external sources (commercial tools are used), then mapped and loaded into the warehouse.
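
    The inflow steps can be sketched as a tiny extract, clean, transform, and load pipeline. Everything concrete here is an assumption for illustration: the two sample sources, the field names, and the choice to default a missing quantity to 0.

```python
import csv
import io
from datetime import datetime, timezone

# Two heterogeneous sources (hypothetical sample data): a CSV export and in-memory records.
CSV_SOURCE = "order_id,region,qty,unit_price\n1,North,2,3.50\n2,SOUTH,,4.00\n"
DICT_SOURCE = [{"order_id": 3, "region": "north ", "qty": 1, "unit_price": 9.0}]


def extract():
    """Extract rows from both sources into one list of dicts."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows.extend(dict(r) for r in DICT_SOURCE)
    return rows


def clean(row):
    """Remove inconsistencies (case, whitespace, types) and fill a missing field."""
    return {
        "order_id": int(row["order_id"]),
        "region": str(row["region"]).strip().lower(),
        "qty": int(row["qty"]) if row["qty"] not in ("", None) else 0,  # assumed default
        "unit_price": float(row["unit_price"]),
    }


def transform(row, loaded_at):
    """Add a date/time stamp field and derive a calculated field."""
    return {**row, "total": row["qty"] * row["unit_price"], "loaded_at": loaded_at}


def inflow():
    loaded_at = datetime.now(timezone.utc).isoformat()
    loaded = [transform(clean(r), loaded_at) for r in extract()]
    # Cross-check integrity: order ids must be unique across all sources.
    ids = [r["order_id"] for r in loaded]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate order_id in load")
    return loaded


warehouse = inflow()
```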

    Upflow

    The processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.

      Summarizing the data works by choosing, projecting, joining, and grouping relational data into views that are more convenient and useful to the end users.
      Packaging the data involves converting the detailed or summarized information into more useful formats, such as spreadsheets, text documents, charts, other graphical presentations, private databases, and animation.
      Distributing the data to appropriate groups increases its availability and accessibility.
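
    The summarizing and packaging steps can be illustrated as follows. The sample sales and stores data are hypothetical, and CSV is used as a stand-in for a spreadsheet-friendly format.

```python
import csv
import io
from collections import defaultdict

# Hypothetical detail data and a lookup table to join against.
sales = [
    {"store_id": 1, "amount": 10.0, "year": 2019},
    {"store_id": 2, "amount": 4.0, "year": 2019},
    {"store_id": 1, "amount": 6.0, "year": 2018},
]
stores = {1: "north", 2: "south"}  # store_id -> region


def summarize(year):
    """Choose, project, join, and group the detail rows into a per-region view."""
    totals = defaultdict(float)
    for sale in sales:
        if sale["year"] != year:           # choose (select) the relevant rows
            continue
        amount = sale["amount"]            # project the column of interest
        region = stores[sale["store_id"]]  # join with the store lookup
        totals[region] += amount           # group by region
    return sorted(totals.items())


def package_csv(summary):
    """Package the summary as CSV text that a spreadsheet can open."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["region", "total"])
    writer.writerows(summary)
    return buf.getvalue()


report = package_csv(summarize(2019))
```

    Distribution would then amount to delivering the report text to the appropriate user groups, for example by writing it to a shared location.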

    Downflow

    The processes associated with archiving and backing up data in the warehouse.

    Archiving: the effectiveness and performance of the warehouse are maintained by transferring older data of limited value to storage archives such as magnetic tape, optical disk, or digital storage devices.

    The downflow of data also includes the processes that ensure the current state of the data warehouse can be rebuilt following data loss or software/hardware failures. Archived data should be stored in a way that allows the data to be re-established in the warehouse when required.
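
    A minimal sketch of archiving and re-establishing data, assuming a hypothetical cutoff date and a JSON file standing in for the archive medium:

```python
import json
import os
import tempfile
from datetime import date


def archive_old_rows(rows, cutoff, path):
    """Write rows dated before `cutoff` to a JSON archive; return the rows kept online."""
    old = [r for r in rows if date.fromisoformat(r["day"]) < cutoff]
    kept = [r for r in rows if date.fromisoformat(r["day"]) >= cutoff]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(old, f)
    return kept


def restore(path):
    """Read archived rows back so they can be re-established in the warehouse."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


warehouse = [
    {"day": "2015-01-10", "amount": 1.0},
    {"day": "2019-08-03", "amount": 2.0},
]

with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "archive.json")
    online = archive_old_rows(warehouse, date(2018, 1, 1), archive_path)
    # Rebuild the full state from the online rows plus the archive.
    rebuilt = sorted(online + restore(archive_path), key=lambda r: r["day"])
```

    Storing the archive in a self-describing format is one way to satisfy the requirement that archived data can be reloaded later.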

    Outflow

    The processes associated with making the data available to the end users.

    This involves two activities: data accessing and delivering.

    Data accessing is concerned with satisfying the end users' requests for the data they need. The main problem here is creating an environment in which users can effectively use the query tools to access the most appropriate data source.

    Delivering makes it possible to deliver information to the users' systems and workstations.

    Metaflow

    The processes associated with managing the metadata (data about the data).

    Metadata is a description of the data contents of the data warehouse: what is in it, where it came from originally, and what has been done to it by way of cleansing, integrating, and summarizing.
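
    As a rough illustration, a metadata record of this kind could track a table's origin and processing history. The Lineage class and the table and source names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Lineage:
    """Hypothetical metadata record for one warehouse table."""
    table: str   # what is in the warehouse
    source: str  # where it came from originally
    steps: list = field(default_factory=list)  # what has been done to it

    def record(self, step):
        self.steps.append(step)
        return self  # allow chained calls

    def describe(self):
        return f"{self.table} <- {self.source}: " + " -> ".join(self.steps)


lineage = Lineage("sales_by_region", "orders_db.sales")
lineage.record("cleansed").record("integrated").record("summarized")
description = lineage.describe()
```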

    Thanks