Data Warehouse System
Faculty : Gaurav Garg
Table of Contents
• Data warehouse
• Data Warehouse vs. Operational DBMS
• OLTP vs. OLAP
• Why Need Separate Data Warehouse?
• Data Mart
• Data warehouse Architecture
• Data warehouse backend process
• Metadata Repository
• OLAP Server Architecture
• Schemas for multidimensional data
• Partitioning the Data warehouse
What is Data Warehouse?
• Defined in many different ways, but not rigorously.
– A decision support database that is maintained separately from the
organization’s operational database
– Support information processing by providing a solid platform of
consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-making
process.”—W. H. Inmon
• Data warehousing:
– The process of constructing and using data warehouses
Gaurav Garg (AITM, Palwal)
Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product, sales
• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data sources
  – relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
  – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    • E.g., Hotel price: currency, tax, breakfast covered, etc.
  – When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems
  – Operational database: current value data
  – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  – Contains an element of time, explicitly or implicitly
  – But the key of operational data may or may not contain a “time element”
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
  – Major task of traditional relational DBMS
  – Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
  – Major task of data warehouse system
  – Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
  – User and system orientation: customer vs. market
  – Data contents: current, detailed vs. historical, consolidated
  – Database design: ER + application vs. star + subject
  – View: current, local vs. evolutionary, integrated
  – Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP

                     OLTP                            OLAP
users                clerk, IT professional          knowledge worker
function             day-to-day operations           decision support
DB design            application-oriented            subject-oriented
data                 current, up-to-date,            historical, summarized,
                     detailed, flat relational,      multidimensional,
                     isolated                        integrated, consolidated
usage                repetitive                      ad-hoc
access               read/write, index/hash          lots of scans
                     on primary key
unit of work         short, simple transaction       complex query
# records accessed   tens                            millions
# users              thousands                       hundreds
DB size              100MB-GB                        100GB-TB
metric               transaction throughput          query throughput, response time
Why Separate Data Warehouse?
• High performance for both systems
  – DBMS—tuned for OLTP: access methods, indexing, concurrency control, recovery
  – Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
• Different functions and different data:
  – missing data: Decision support requires historical data which operational DBs do not typically maintain
  – data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  – data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Data Warehouse Usage
• Three kinds of data warehouse applications
  – Information processing
    • supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
  – Analytical processing
    • multidimensional analysis of data warehouse data
    • supports basic OLAP operations: slice-dice, drilling, pivoting
  – Data mining
    • knowledge discovery from hidden patterns
    • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
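As a rough illustration of the basic OLAP operations named above, the sketch below runs slice, dice, and roll-up over a tiny in-memory fact list. All product, region, and month values are invented for illustration; a real OLAP engine would of course work over much larger, indexed structures.

```python
# A tiny sales fact table: (product, region, month, units_sold).
# All names and numbers are made up for illustration.
facts = [
    ("TV",    "North", "Jan", 120),
    ("TV",    "South", "Jan", 80),
    ("Phone", "North", "Feb", 200),
    ("Phone", "South", "Feb", 150),
    ("TV",    "North", "Feb", 90),
]

def slice_cube(facts, month):
    """Slice: fix one dimension (month) to get a 2-D sub-cube."""
    return [f for f in facts if f[2] == month]

def dice_cube(facts, products, regions):
    """Dice: select sub-ranges on two dimensions at once."""
    return [f for f in facts if f[0] in products and f[1] in regions]

def roll_up(facts, dim):
    """Roll up: aggregate units_sold over one dimension index."""
    totals = {}
    for f in facts:
        totals[f[dim]] = totals.get(f[dim], 0) + f[3]
    return totals

print(slice_cube(facts, "Jan"))   # the January slice
print(roll_up(facts, 1))          # units_sold totals per region
```

Drilling down would be the inverse of roll_up: re-expanding a total along a finer level of the dimension hierarchy.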
Data Mart
• Data marts are partitions of the overall data warehouse.
• A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as Sales, Finance, or Marketing.
• Data marts may contain some overlapping data. E.g., a store sales data mart would also need some data from inventory and payroll.

Types of data marts
• Dependent data marts draw data from a central data warehouse that has already been created.
• Independent data marts are standalone systems built by drawing data directly from operational or external sources of data, or both.
The main difference between these is how you get data out of the sources and into the data mart. This step, called the Extraction, Transformation and Loading (ETL) process, involves moving data from operational systems, filtering it, and loading it into the data mart.
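The ETL step described above can be sketched in a few lines of Python. This is only a toy pipeline: the source rows, the field names (item, qty, dept), and the "sales" filter are all invented for illustration, not part of any real system.

```python
# Minimal ETL sketch for an independent data mart: extract rows from a
# hypothetical operational source, filter/transform them, and load them
# into the mart. Field names and values are invented for illustration.

def extract():
    # In practice this would read from an operational DB or a flat file.
    return [
        {"item": "pen",  "qty": "10", "dept": "sales"},
        {"item": "ink",  "qty": "3",  "dept": "hr"},
        {"item": "book", "qty": "7",  "dept": "sales"},
    ]

def transform(rows):
    # Filter down to the mart's single subject (sales) and fix types.
    return [
        {"item": r["item"], "qty": int(r["qty"])}
        for r in rows if r["dept"] == "sales"
    ]

def load(rows, mart):
    # Append the cleaned rows to the (in-memory) mart.
    mart.extend(rows)

sales_mart = []
load(transform(extract()), sales_mart)
print(sales_mart)
```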
A 3-tier Data warehouse Architecture
• Bottom tier (Data Storage): operational DBs and other sources feed the data warehouse and its data marts through an Extract/Transform/Load/Refresh process, coordinated by a monitor & integrator that maintains the metadata repository.
• Middle tier (OLAP Server): the OLAP engine that serves multidimensional data to the tools above.
• Top tier (Front-End Tools): analysis, query, reports, and data mining tools.
Data warehouse backend process

Data extraction
It is the process of extracting data for the warehouse from various sources.

Data cleaning
It is an essential operation for constructing a quality data warehouse. Because a large volume of data from heterogeneous sources is involved, there is a high probability of errors in the data. The data cleaning process includes:
• Using transformation rules, e.g., translating an attribute name like ‘age’ to ‘DOB’
• Using domain-specific knowledge
• Performing parsing and fuzzy matching, e.g., for multiple data sources one can designate a preferred source as the matching standard
• Auditing
It is difficult and costly to clean the data.
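The fuzzy-matching idea above can be sketched with Python's standard-library difflib, reconciling incoming names against a preferred "matching standard" source. The supplier names and the 0.6 similarity cutoff below are arbitrary choices for illustration.

```python
# Fuzzy matching sketch: map noisy names from a secondary source onto a
# designated standard source. Names and the cutoff are illustrative only.
import difflib

standard = ["International Business Machines", "Hewlett-Packard"]

def match_to_standard(name, threshold=0.6):
    # get_close_matches returns up to n candidates whose similarity
    # ratio to `name` is at least `cutoff`, best match first.
    candidates = difflib.get_close_matches(name, standard, n=1,
                                           cutoff=threshold)
    return candidates[0] if candidates else None

print(match_to_standard("Hewlet Packard"))           # matches the standard spelling
print(match_to_standard("Intl. Business Machines"))  # likewise
print(match_to_standard("zzzz"))                     # no match: flagged for auditing
```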
Data Transformation
It is the process of transforming heterogeneous data to a uniform structure so that data can be combined and integrated, i.e., converting data from legacy or host format to warehouse format.

Data Loading
For loading correct, refined, processed data (huge in size) into the data warehouse, there are several data loading strategies:
• Batch loading
• Sequential loading
• Incremental loading

Refresh
The process of updating the data.
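As one possible illustration of the incremental loading strategy, this sketch keeps a high-water mark and, on each refresh, loads only rows that arrived since the last load. The id/amt fields and the integer watermark are invented for the example.

```python
# Incremental loading sketch: only rows newer than the watermark are
# loaded on each refresh. Field names and values are illustrative only.
warehouse = []
last_loaded = 0  # watermark: highest source id already loaded

def incremental_load(source_rows):
    global last_loaded
    new_rows = [r for r in source_rows if r["id"] > last_loaded]
    warehouse.extend(new_rows)
    if new_rows:
        last_loaded = max(r["id"] for r in new_rows)
    return len(new_rows)

src = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
incremental_load(src)              # initial load brings both rows
src.append({"id": 3, "amt": 5})
incremental_load(src)              # refresh brings only the new row
print(len(warehouse), last_loaded)
```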
Metadata Repository
• Metadata is the data defining warehouse objects. It stores:
• Description of the structure of the data warehouse
  – schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
• Operational metadata
  – data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
• The algorithms used for summarization
• The mapping from operational environment to the data warehouse
• Data related to system performance
  – warehouse schema, view and derived data definitions
• Business data
  – business terms and definitions, ownership of data, charging policies
OLAP Server Architecture
• Relational OLAP (ROLAP)
  – Uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
  – Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  – Greater scalability
• Multidimensional OLAP (MOLAP)
  – Sparse array-based multidimensional storage engine
  – Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
  – Flexibility, e.g., low level: relational, high level: array
• Specialized SQL servers (e.g., Redbricks)
  – Specialized support for SQL queries over star/snowflake schemas
Multidimensional Data
• Sales volume as a function of product, month, and region (a cube with axes Product, Region, Month)
• Dimensions: Product, Location, Time
• Hierarchical summarization paths:
  – Product: Industry → Category → Product
  – Location: Region → Country → City → Office
  – Time: Year → Quarter → Month → Day (or Year → Week → Day)
Multidimensional Data Model
“What is a data cube?” “What do you mean by dimensions?”
A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
Dimensions are the perspectives or entities with respect to which an organization wants to keep records. Each dimension may have a table associated with it, called a dimension table. A dimension table for item may contain the attributes item_name, brand, etc.
A multidimensional data model is typically organized around a central theme, like sales. This theme is represented by a fact table.
Facts are numerical measures. Examples of facts for a sales data warehouse include dollars_sold, units_sold, amount_budgeted.
The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.
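A minimal sketch of this arrangement: fact rows carry keys into a dimension table, and a query aggregates the dollars_sold measure by a dimension attribute (brand). The item names, brands, and numbers are toy data invented for the example.

```python
# Fact table rows hold dimension keys plus numerical measures; the
# dimension table maps keys to descriptive attributes. Toy data only.

item_dim = {
    1: {"item_name": "pen",  "brand": "Acme"},
    2: {"item_name": "book", "brand": "Zen"},
}

sales_fact = [
    # (item_key, units_sold, dollars_sold)
    (1, 10, 25.0),
    (2, 3, 45.0),
    (1, 5, 12.5),
]

def dollars_by_brand(fact, dim):
    # Resolve each fact's item_key in the dimension table, then sum the
    # dollars_sold measure per brand.
    totals = {}
    for item_key, units, dollars in fact:
        brand = dim[item_key]["brand"]
        totals[brand] = totals.get(brand, 0.0) + dollars
    return totals

print(dollars_by_brand(sales_fact, item_dim))
```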
Schemas for multidimensional data
This multidimensional model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema (galaxy schema).

Star Schema
It is the most common schema type, in which the data warehouse contains a large central table (fact table) holding the bulk of the data with no redundancy, and a set of smaller attendant tables (dimension tables), one for each dimension. The dimension tables are displayed in a radial pattern around the central fact table.

Snowflake schema
It is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables:
• time (time_key, day, day_of_the_week, month, quarter, year)
• item (item_key, item_name, brand, type, supplier_type)
• branch (branch_key, branch_name, branch_type)
• location (location_key, street, city, state_or_province, country)
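One way to picture a query against such a star schema is a lookup from fact rows into a dimension table, here rolling units_sold up to the quarter level of the time dimension. The keys and values below are invented toy data.

```python
# Star-schema query sketch: "units sold per quarter". The fact table
# joins to the time dimension through time_key. Toy data only.

time_dim = {
    100: {"month": "Jan", "quarter": "Q1", "year": 2024},
    101: {"month": "Apr", "quarter": "Q2", "year": 2024},
}
sales_fact = [
    # (time_key, item_key, units_sold)
    (100, 1, 10),
    (101, 1, 4),
    (100, 1, 6),
]

def units_per_quarter(fact, tdim):
    # Resolve each fact row's time_key to its quarter and sum units_sold.
    out = {}
    for time_key, _item_key, units in fact:
        q = tdim[time_key]["quarter"]
        out[q] = out.get(q, 0) + units
    return out

print(units_per_quarter(sales_fact, time_dim))
```

In a snowflake schema the same query would need one more hop wherever a dimension attribute lives in a normalized sub-table.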
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables (partly normalized):
• time (time_key, day, day_of_the_week, month, quarter, year)
• item (item_key, item_name, brand, type, supplier_key) → supplier (supplier_key, supplier_type)
• branch (branch_key, branch_name, branch_type)
• location (location_key, street, city_key) → city (city_key, city, state_or_province, country)
Fact constellation (galaxy schema)
Some sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or fact constellation.
Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
Shared dimension tables:
• time (time_key, day, day_of_the_week, month, quarter, year)
• item (item_key, item_name, brand, type, supplier_type)
• branch (branch_key, branch_name, branch_type)
• location (location_key, street, city, province_or_state, country)
• shipper (shipper_key, shipper_name, location_key, shipper_type)
Partitioning the Data warehouse
Data warehouses often contain large tables and require techniques both for managing these large tables and for providing good query performance across them.

Partitioning has many advantages; some of them are:

Scalability
Partitioning helps scale a data warehouse by dividing database objects into smaller pieces, enabling access to smaller, more manageable objects.

Performance
Good performance is key to the success of a data warehouse. Analyses run against the database should return within a reasonable amount of time, even if the queries access large amounts of data in tables that are terabytes in size. Partitioning increases the performance of the data warehouse.

Note: partitions of a database are generally known as clusters.
Types of partitioning
• Basically there are two types:
1. Horizontal partitioning
2. Vertical partitioning
There are two main categories of partitioning (clustering) algorithms:
1. k-means algorithm, where each cluster is represented by the center of gravity of the cluster.
2. k-medoid algorithm, where each cluster is represented by one of the objects of the cluster located near the center.
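A minimal one-dimensional k-means sketch, in which each cluster's representative is its center of gravity as stated above. The data points, initial centers, and iteration count are arbitrary choices for illustration.

```python
# Toy 1-D k-means: alternate assignment (nearest center) and update
# (center moves to its cluster's mean) steps. Illustrative values only.
def kmeans_1d(points, centers, iters=10):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Update step: each center becomes its cluster's mean
        # (an empty cluster keeps its old center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 2.0, 9.0, 10.0], [0.0, 5.0])
print(centers, clusters)
```

A k-medoid variant would instead pick, as each cluster's representative, the member object closest to that mean.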
Some general methods of partitioning

Range Partitioning
Range partitioning maps data to partitions based on ranges of partition key values that you establish for each partition. It is the most common type of partitioning and is often used with dates. For example, you might want to partition sales data into monthly partitions.

Hash Partitioning
Hash partitioning maps data to partitions based on a hashing algorithm. The hashing algorithm evenly distributes rows among partitions, giving partitions approximately the same size.

List Partitioning
List partitioning enables you to explicitly control how rows map to partitions. This is different from range partitioning, where a range of values is associated with a partition, and from hash partitioning, where you have no control over the row-to-partition mapping. The advantage of list partitioning is that you can group and organize unordered and unrelated sets of data in a natural way.
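The three mapping rules can be sketched as plain functions. The partition counts, range boundaries, and region-to-partition mapping below are arbitrary illustrations, not the API of any particular DBMS.

```python
# Sketches of range, hash, and list partitioning over (key, row) pairs.
# Boundaries, partition counts, and mappings are illustrative only.

def range_partition(rows, boundaries):
    """Assign each row to the first partition whose upper bound its key
    falls under (e.g. monthly date boundaries); the last partition
    catches everything beyond the final boundary."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for key, row in rows:
        for i, b in enumerate(boundaries):
            if key < b:
                parts[i].append(row)
                break
        else:
            parts[-1].append(row)
    return parts

def hash_partition(rows, n):
    """Spread rows across n partitions by hashing the key, giving
    partitions of roughly equal size."""
    parts = [[] for _ in range(n)]
    for key, row in rows:
        parts[hash(key) % n].append(row)
    return parts

def list_partition(rows, mapping):
    """Explicit key -> partition-name mapping, e.g. regions."""
    parts = {name: [] for name in set(mapping.values())}
    for key, row in rows:
        parts[mapping[key]].append(row)
    return parts

rows = [(5, "a"), (15, "b"), (25, "c")]
print(range_partition(rows, [10, 20]))  # keys split at 10 and 20
```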