Data Warehouse System
Faculty : Gaurav Garg
Table of Contents
• Data warehouse
• Data Warehouse vs. Operational DBMS
• OLTP vs. OLAP
• Why Need Separate Data Warehouse?
• Data Mart
• Data warehouse Architecture
• Data warehouse backend process
• Metadata Repository
• OLAP Server Architecture
• Schemas for multidimensional data
• Partitioning the Data warehouse
What is Data Warehouse?
• Defined in many different ways, but not rigorously.
– A decision support database that is maintained separately from the
organization’s operational database
– Support information processing by providing a solid platform of
consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-making
process.”—W. H. Inmon
• Data warehousing:
– The process of constructing and using data warehouses
Gaurav Garg (AITM, Palwal)
Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product, sales
• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data sources
  – relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
  – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    • E.g., Hotel price: currency, tax, breakfast covered, etc.
  – When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems
  – Operational database: current value data
  – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  – Contains an element of time, explicitly or implicitly
  – But the key of operational data may or may not contain a “time element”
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
  – Major task of traditional relational DBMS
  – Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
  – Major task of data warehouse system
  – Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
  – User and system orientation: customer vs. market
  – Data contents: current, detailed vs. historical, consolidated
  – Database design: ER + application vs. star + subject
  – View: current, local vs. evolutionary, integrated
  – Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP

                     OLTP                            OLAP
users                clerk, IT professional          knowledge worker
function             day-to-day operations           decision support
DB design            application-oriented            subject-oriented
data                 current, up-to-date,            historical, summarized,
                     detailed, flat relational,      multidimensional,
                     isolated                        integrated, consolidated
usage                repetitive                      ad-hoc
access               read/write, index/hash          lots of scans
                     on primary key
unit of work         short, simple transaction       complex query
# records accessed   tens                            millions
# users              thousands                       hundreds
DB size              100MB-GB                        100GB-TB
metric               transaction throughput          query throughput, response time
Why Separate Data Warehouse?
• High performance for both systems
  – DBMS—tuned for OLTP: access methods, indexing, concurrency control, recovery
  – Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
• Different functions and different data:
  – missing data: Decision support requires historical data which operational DBs do not typically maintain
  – data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  – data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Data Warehouse Usage
• Three kinds of data warehouse applications
  – Information processing
    • supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
  – Analytical processing
    • multidimensional analysis of data warehouse data
    • supports basic OLAP operations: slice-dice, drilling, pivoting
  – Data mining
    • knowledge discovery from hidden patterns
    • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
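As a rough illustration of the basic OLAP operations named above, the sketch below runs slice, dice, and roll-up over a tiny in-memory fact list. All product, region, and month values are invented for illustration; a real OLAP engine would of course work over much larger, indexed structures.

```python
# A tiny sales fact table: (product, region, month, units_sold).
# All names and numbers are made up for illustration.
facts = [
    ("TV",    "North", "Jan", 120),
    ("TV",    "South", "Jan", 80),
    ("Phone", "North", "Feb", 200),
    ("Phone", "South", "Feb", 150),
    ("TV",    "North", "Feb", 90),
]

def slice_cube(facts, month):
    """Slice: fix one dimension (month) to get a 2-D sub-cube."""
    return [f for f in facts if f[2] == month]

def dice_cube(facts, products, regions):
    """Dice: select sub-ranges on two dimensions at once."""
    return [f for f in facts if f[0] in products and f[1] in regions]

def roll_up(facts, dim):
    """Roll up: aggregate units_sold over one dimension index."""
    totals = {}
    for f in facts:
        totals[f[dim]] = totals.get(f[dim], 0) + f[3]
    return totals

print(slice_cube(facts, "Jan"))   # the January slice
print(roll_up(facts, 1))          # units_sold totals per region
```

Drilling down would be the inverse of roll_up: re-expanding a total along a finer level of the dimension hierarchy.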
Data Mart
• Data marts are partitions of the overall data warehouse.
• A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as Sales, Finance, or Marketing.
• Data marts may contain some overlapping data. E.g., a store sales data mart would also need some data from inventory and payroll.

Types of data marts
• Dependent data marts draw data from a central data warehouse that has already been created.
• Independent data marts are standalone systems built by drawing data directly from operational or external sources of data, or both.
The main difference between these is how you get data out of the sources and into the data mart. This step, called the Extraction, Transformation and Loading (ETL) process, involves moving data from operational systems, filtering it, and loading it into the data mart.
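The ETL step described above can be sketched in a few lines of Python. This is only a toy pipeline: the source rows, the field names (item, qty, dept), and the "sales" filter are all invented for illustration, not part of any real system.

```python
# Minimal ETL sketch for an independent data mart: extract rows from a
# hypothetical operational source, filter/transform them, and load them
# into the mart. Field names and values are invented for illustration.

def extract():
    # In practice this would read from an operational DB or a flat file.
    return [
        {"item": "pen",  "qty": "10", "dept": "sales"},
        {"item": "ink",  "qty": "3",  "dept": "hr"},
        {"item": "book", "qty": "7",  "dept": "sales"},
    ]

def transform(rows):
    # Filter down to the mart's single subject (sales) and fix types.
    return [
        {"item": r["item"], "qty": int(r["qty"])}
        for r in rows if r["dept"] == "sales"
    ]

def load(rows, mart):
    # Append the cleaned rows to the (in-memory) mart.
    mart.extend(rows)

sales_mart = []
load(transform(extract()), sales_mart)
print(sales_mart)
```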
A 3-tier Data warehouse Architecture
• Bottom tier (Data Storage): operational DBs and other sources feed the data warehouse and its data marts through an Extract/Transform/Load/Refresh process, coordinated by a monitor & integrator that maintains the metadata repository.
• Middle tier (OLAP Server): the OLAP engine that serves multidimensional data to the tools above.
• Top tier (Front-End Tools): analysis, query, reports, and data mining tools.
Data warehouse backend process

Data extraction
It is the process of extracting data for the warehouse from various sources.

Data cleaning
It is an essential operation for constructing a quality data warehouse. Because a large volume of data from heterogeneous sources is involved, there is a high probability of errors in the data. The data cleaning process includes:
• Using transformation rules, e.g., translating an attribute name like ‘age’ to ‘DOB’
• Using domain-specific knowledge
• Performing parsing and fuzzy matching, e.g., for multiple data sources one can designate a preferred source as the matching standard
• Auditing
It is difficult and costly to clean the data.
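The fuzzy-matching idea above can be sketched with Python's standard-library difflib, reconciling incoming names against a preferred "matching standard" source. The supplier names and the 0.6 similarity cutoff below are arbitrary choices for illustration.

```python
# Fuzzy matching sketch: map noisy names from a secondary source onto a
# designated standard source. Names and the cutoff are illustrative only.
import difflib

standard = ["International Business Machines", "Hewlett-Packard"]

def match_to_standard(name, threshold=0.6):
    # get_close_matches returns up to n candidates whose similarity
    # ratio to `name` is at least `cutoff`, best match first.
    candidates = difflib.get_close_matches(name, standard, n=1,
                                           cutoff=threshold)
    return candidates[0] if candidates else None

print(match_to_standard("Hewlet Packard"))           # matches the standard spelling
print(match_to_standard("Intl. Business Machines"))  # likewise
print(match_to_standard("zzzz"))                     # no match: flagged for auditing
```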
Data Transformation
It is the process of transforming heterogeneous data to a uniform structure so that data can be combined and integrated, i.e., converting data from legacy or host format to warehouse format.

Data Loading
For loading correct, refined, processed data (huge in size) into the data warehouse, there are several data loading strategies:
• Batch loading
• Sequential loading
• Incremental loading

Refresh
The process of updating the data.
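As one possible illustration of the incremental loading strategy, this sketch keeps a high-water mark and, on each refresh, loads only rows that arrived since the last load. The id/amt fields and the integer watermark are invented for the example.

```python
# Incremental loading sketch: only rows newer than the watermark are
# loaded on each refresh. Field names and values are illustrative only.
warehouse = []
last_loaded = 0  # watermark: highest source id already loaded

def incremental_load(source_rows):
    global last_loaded
    new_rows = [r for r in source_rows if r["id"] > last_loaded]
    warehouse.extend(new_rows)
    if new_rows:
        last_loaded = max(r["id"] for r in new_rows)
    return len(new_rows)

src = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
incremental_load(src)              # initial load brings both rows
src.append({"id": 3, "amt": 5})
incremental_load(src)              # refresh brings only the new row
print(len(warehouse), last_loaded)
```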
Metadata Repository
• Metadata is the data defining warehouse objects. It stores:
• Description of the structure of the data warehouse
  – schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
• Operational metadata
  – data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
• The algorithms used for summarization
• The mapping from operational environment to the data warehouse
• Data related to system performance
  – warehouse schema, view and derived data definitions
• Business data
  – business terms and definitions, ownership of data, charging policies
OLAP Server Architecture
• Relational OLAP (ROLAP)
  – Uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
  – Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  – Greater scalability
• Multidimensional OLAP (MOLAP)
  – Sparse array-based multidimensional storage engine
  – Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
  – Flexibility, e.g., low level: relational, high level: array
• Specialized SQL servers (e.g., Redbricks)
  – Specialized support for SQL queries over star/snowflake schemas
Multidimensional Data
• Sales volume as a function of product, month, and region (a cube with axes Product, Region, Month)
• Dimensions: Product, Location, Time
• Hierarchical summarization paths:
  – Product: Industry → Category → Product
  – Location: Region → Country → City → Office
  – Time: Year → Quarter → Month → Day (or Year → Week → Day)
Multidimensional Data Model
“What is a data cube?” “What do you mean by dimensions?”
A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
Dimensions are the perspectives or entities with respect to which an organization wants to keep records. Each dimension may have a table associated with it, called a dimension table. A dimension table for item may contain the attributes item_name, brand, etc.
A multidimensional data model is typically organized around a central theme, like sales. This theme is represented by a fact table.
Facts are numerical measures. Examples of facts for a sales data warehouse include dollars_sold, units_sold, amount_budgeted.
The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.
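A minimal sketch of this arrangement: fact rows carry keys into a dimension table, and a query aggregates the dollars_sold measure by a dimension attribute (brand). The item names, brands, and numbers are toy data invented for the example.

```python
# Fact table rows hold dimension keys plus numerical measures; the
# dimension table maps keys to descriptive attributes. Toy data only.

item_dim = {
    1: {"item_name": "pen",  "brand": "Acme"},
    2: {"item_name": "book", "brand": "Zen"},
}

sales_fact = [
    # (item_key, units_sold, dollars_sold)
    (1, 10, 25.0),
    (2, 3, 45.0),
    (1, 5, 12.5),
]

def dollars_by_brand(fact, dim):
    # Resolve each fact's item_key in the dimension table, then sum the
    # dollars_sold measure per brand.
    totals = {}
    for item_key, units, dollars in fact:
        brand = dim[item_key]["brand"]
        totals[brand] = totals.get(brand, 0.0) + dollars
    return totals

print(dollars_by_brand(sales_fact, item_dim))
```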
Schemas for multidimensional data
This multidimensional model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema (galaxy schema).

Star Schema
It is the most common schema type, in which the data warehouse contains a large central table (fact table) holding the bulk of the data with no redundancy, and a set of smaller attendant tables (dimension tables), one for each dimension. The dimension tables are displayed in a radial pattern around the central fact table.

Snowflake schema
It is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables:
• time (time_key, day, day_of_the_week, month, quarter, year)
• item (item_key, item_name, brand, type, supplier_type)
• branch (branch_key, branch_name, branch_type)
• location (location_key, street, city, state_or_province, country)
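One way to picture a query against such a star schema is a lookup from fact rows into a dimension table, here rolling units_sold up to the quarter level of the time dimension. The keys and values below are invented toy data.

```python
# Star-schema query sketch: "units sold per quarter". The fact table
# joins to the time dimension through time_key. Toy data only.

time_dim = {
    100: {"month": "Jan", "quarter": "Q1", "year": 2024},
    101: {"month": "Apr", "quarter": "Q2", "year": 2024},
}
sales_fact = [
    # (time_key, item_key, units_sold)
    (100, 1, 10),
    (101, 1, 4),
    (100, 1, 6),
]

def units_per_quarter(fact, tdim):
    # Resolve each fact row's time_key to its quarter and sum units_sold.
    out = {}
    for time_key, _item_key, units in fact:
        q = tdim[time_key]["quarter"]
        out[q] = out.get(q, 0) + units
    return out

print(units_per_quarter(sales_fact, time_dim))
```

In a snowflake schema the same query would need one more hop wherever a dimension attribute lives in a normalized sub-table.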
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables (partly normalized):
• time (time_key, day, day_of_the_week, month, quarter, year)
• item (item_key, item_name, brand, type, supplier_key) → supplier (supplier_key, supplier_type)
• branch (branch_key, branch_name, branch_type)
• location (location_key, street, city_key) → city (city_key, city, state_or_province, country)
Fact constellation (galaxy schema)
Some sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or fact constellation.
Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
Shared dimension tables:
• time (time_key, day, day_of_the_week, month, quarter, year)
• item (item_key, item_name, brand, type, supplier_type)
• branch (branch_key, branch_name, branch_type)
• location (location_key, street, city, province_or_state, country)
• shipper (shipper_key, shipper_name, location_key, shipper_type)
Partitioning the Data warehouse
Data warehouses often contain large tables and require techniques both for managing these large tables and for providing good query performance across them.

Partitioning has many advantages; some of them are:

Scalability
Partitioning helps scale a data warehouse by dividing database objects into smaller pieces, enabling access to smaller, more manageable objects.

Performance
Good performance is key to the success of a data warehouse. Analyses run against the database should return within a reasonable amount of time, even if the queries access large amounts of data in tables that are terabytes in size. Partitioning increases the performance of the data warehouse.

Note: partitions of a database are generally known as clusters.
Types of partitioning
• Basically there are two types:
1. Horizontal partitioning
2. Vertical partitioning
There are two main categories of partitioning (clustering) algorithms:
1. k-means algorithm, where each cluster is represented by the center of gravity of the cluster.
2. k-medoid algorithm, where each cluster is represented by one of the objects of the cluster located near the center.
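A minimal one-dimensional k-means sketch, in which each cluster's representative is its center of gravity as stated above. The data points, initial centers, and iteration count are arbitrary choices for illustration.

```python
# Toy 1-D k-means: alternate assignment (nearest center) and update
# (center moves to its cluster's mean) steps. Illustrative values only.
def kmeans_1d(points, centers, iters=10):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Update step: each center becomes its cluster's mean
        # (an empty cluster keeps its old center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 2.0, 9.0, 10.0], [0.0, 5.0])
print(centers, clusters)
```

A k-medoid variant would instead pick, as each cluster's representative, the member object closest to that mean.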
Some general methods of partitioning

Range Partitioning
Range partitioning maps data to partitions based on ranges of partition key values that you establish for each partition. It is the most common type of partitioning and is often used with dates. For example, you might want to partition sales data into monthly partitions.

Hash Partitioning
Hash partitioning maps data to partitions based on a hashing algorithm. The hashing algorithm evenly distributes rows among partitions, giving partitions approximately the same size.

List Partitioning
List partitioning enables you to explicitly control how rows map to partitions. This is different from range partitioning, where a range of values is associated with a partition, and from hash partitioning, where you have no control over the row-to-partition mapping. The advantage of list partitioning is that you can group and organize unordered and unrelated sets of data in a natural way.
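The three mapping rules can be sketched as plain functions. The partition counts, range boundaries, and region-to-partition mapping below are arbitrary illustrations, not the API of any particular DBMS.

```python
# Sketches of range, hash, and list partitioning over (key, row) pairs.
# Boundaries, partition counts, and mappings are illustrative only.

def range_partition(rows, boundaries):
    """Assign each row to the first partition whose upper bound its key
    falls under (e.g. monthly date boundaries); the last partition
    catches everything beyond the final boundary."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for key, row in rows:
        for i, b in enumerate(boundaries):
            if key < b:
                parts[i].append(row)
                break
        else:
            parts[-1].append(row)
    return parts

def hash_partition(rows, n):
    """Spread rows across n partitions by hashing the key, giving
    partitions of roughly equal size."""
    parts = [[] for _ in range(n)]
    for key, row in rows:
        parts[hash(key) % n].append(row)
    return parts

def list_partition(rows, mapping):
    """Explicit key -> partition-name mapping, e.g. regions."""
    parts = {name: [] for name in set(mapping.values())}
    for key, row in rows:
        parts[mapping[key]].append(row)
    return parts

rows = [(5, "a"), (15, "b"), (25, "c")]
print(range_partition(rows, [10, 20]))  # keys split at 10 and 20
```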