Upload
iliana-millner
View
223
Download
2
Tags:
Embed Size (px)
Citation preview
04/11/23 1
Lecture 10: Data Warehouses
Introduction Operational vs. Warehouse
Multidimensional Data Examples MOLAP vs ROLAP Dimensional Hierarchies OLAP Queries Demos Comparison with SQL
Queries CUBE Operator Multidimensional Design Star/Snowflake Schemas
Online Aggregation Implementation Issues
Bitmap Index Constructing a Data
Warehouse Views
Materialized View Example Materialized View is an
Index Issues in Materialized
Views Maintaining Materialized
Views
04/11/23 2
Introduction
In the late 80s and early 90s, companies began to use their DBMSs for complex, interactive, exploratory analysis of historical data.
This was called Decision Support, and On-Line Analytic Processing (OLAP).
DS slowed down the operation of the company, called On-Line Transaction Processing (OLTP).
This led to the creation of Data Warehouses, separate from operational Databases.
04/11/23 3
Operational vs Data Warehouse Requirements
OLTP / Operational / Production
DSS / Warehouse / DataMart
Operate the business / Clerks Diagnose the business / Managers
Short queries, small amts of data
opposite
Queries change data opposite
Customer inquiry, Order Entry, etc.
OLAP, Statistics, Visualization, Data Mining, etc.
Legacy Applications, Heterogeneous databases
Opposite
Often Distributed Often Centralized (Warehouse)
Current data Current and Historical data
04/11/23 4
Operational vs Data Warehouse Requirements, ctd
Operational Warehouse
General E-R Diagrams Multidimensional data model common
Locks necessary No Locks necessary
Crash recovery required Crash recovery optional
Smaller volume of data Huge volume of data
Need indexes designed to access small amounts of data
Need indexes designed to access large volumes of data
04/11/23 5
Data Warehousing
Integrated data spanning long time periods, often augmented with summary information.
Several terabytes to petabytes common.
Interactive response times expected for complex queries; ad-hoc updates uncommon.
Operational Data
EXTRACTTRANSFORM LOAD REFRESH
DATAWAREHOUSE Metadata
Repository
SUPPORTS
OLAPDATAMINING
04/11/23 6
Multidimensional Data In order to support OLAP, warehouse data is
often structured multidimensionally, as measures and dimensions.
Measure: Numeric attribute, e.g. sales amount
Dimension: attribute categorizing the measure, e.g. product, store, date of sale.
The fact table is a foreign key for each dimension, plus an attribute for each measure.
There will also be a dimension table for each dimension.
On the next page, the fact tables are red, the dimension tables are green.
04/11/23 7
Examples of MultiDimensional Data
Purchase(ProductID, StoreID, DateID, Amt) Product(ID, SKU, size, brand) Store(ID, Address, Sales District, Region, Manager) Date (ID, Week, Month, Holiday, Promotion)
Claims(ProvID, MembID, Procedure, DateID, Cost) Providers(ID, Practice, Address, ZIP, City, State) Members(ID, Contract, Name, Address) Procedure (ID, Name, Type)
Telecomm (CustID, SalesRepID, ServiceID, DateID) SalesRep(ID, Address, Sales District, Region, Manager) Service(ID, Name, Category)
04/11/23 8
MOLAP vs ROLAP
Multidimensional data can be stored physically in a (disk-resident, persistent) array; called MOLAP systems. Alternatively, can store as a relation; called ROLAP systems.
The main relation, which relates dimensions to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table. E.g., Products(pid, locid, timeid, amt) Fact tables are much larger than dimensional tables.
04/11/23 9
25.2 Multidimensional Data Model
Collection of numeric measures, which depend on a set of dimensions. E.g., measure Amt, dimensions Product
(key: pid), Location (locid), and Time (timeid).
8 10 10
30 20 50
25 8 15
1 2 3 timeid
p
id11
12
13
11 1 1 25
11 2 1 8
11 3 1 15
12 1 1 30
12 2 1 20
12 3 1 50
13 1 1 8
13 2 1 10
13 3 1 10
11 1 2 35
pid
tim
eid
locid
am
t
locid
Slice locid=1is shown:
04/11/23 10
Dimension Hierarchies
For each dimension, some of the attributes may be organized in a hierarchy:
PRODUCT TIME LOCATION
pname week city
PID date ZIP
year
category quarter state
04/11/23 11
25.3 OLAP Queries
Influenced by SQL and by spreadsheets. A common operation is to aggregate a
measure over one or more dimensions. Find total sales. Find total sales for each city, or for each state. Find top five products ranked by total sales.
Roll-up: Aggregating at different levels of a dimension hierarchy. E.g., Given total sales by city, we can roll-up to
get sales by state.
04/11/23 12
OLAP Queries Drill-down: The inverse of roll-up.
E.g., Given total sales by state, can drill-down to get total sales by city.
E.g., Can also drill-down on different dimension to get total sales by product for each state.
Pivoting: Aggregation on selected dimensions. E.g., Pivoting on State and Year
yields this cross-tabulation:63 81 144
38 107 145
75 35 110
OR CA Total
2007
2008
2009
176 223 339Total
Slicing and Dicing: Equality and range selections on one or more dimensions.
04/11/23 13
Cognos Demo
Now we watch a demo of Cognos (bought by IBM) Dimensions: Products…Margin ranges Measure: Order value (sales)
First pivot from Product dimension to Margin Range Notice how quickly the cube changes
Slice to Low Margin, pivot to Product and Company Region
Drill Down to High Tech, IDES AG Now the guilty product is clear.
04/11/23 14
Tableau Demo http://www.tableausoftware.com/products/tour
2
Note the many measures. Pivot on sales, date (drill down to
month), region as color. Clear date, pivot on product and drill
down on subcategory. Change region from color to rows Move profit into color Change bars to circles Pivot on dates (columns)
04/11/23 15
Comparison with SQL Queries The cross-tabulation obtained by pivoting can also be
computed using a collection of SQLqueries:
SELECT T.year, L.state, SUM(S.amt)FROM Sales S, Times T, Locations LWHERE S.timeid=T.timeid AND S.locid=L.locidGROUP BY T.year, L.state
SELECT T.year, SUM(S.amt)FROM Sales S, Times TWHERE S.timeid=T.timeidGROUP BY T.year
SELECT L.state,SUM(S.amt)FROM Sales S, Location LWHERE S.locid=L.locidGROUP BY L.state
04/11/23 16
The CUBE Operator
Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.
CUBE pid, locid, timeid BY SUM Sales Equivalent to rolling up Sales on all eight subsets
of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form:
SELECT SUM(S.amt)FROM Sales SGROUP BY grouping-list
Lots of work on optimizing the CUBE operator!
04/11/23 17
Example Multidimensional Design
This kind of schema is very common in OLAP applications
It is called a star schema What is “wrong” with it?
price
category
pname
pid country
statecitylocid
amtlocidtimeid
pid
holiday_flag
weekdate
timeid month
quarter
year
(Fact table)SALES
TIMES
PRODUCTS LOCATIONS
04/11/23 18
Star/Snowflake Schemas
Why normalize? Space Redundancy, anomalies
Why unnormalize? Performance
Which is more important in D. Warehouses?
If normalized, it is a snowflake schema
04/11/23 19
Online Aggregation Consider an aggregate query, e.g., finding
the average sales by state. Can we provide the user with some information before the exact average is computed for all states? Can show the current “running average” for each
state as the computation proceeds. Even better, if we use statistical techniques and
sample tuples to aggregate instead of simply scanning the aggregated table, we can provide bounds such as “the average for Oregon is 2000102 with 95% probability”.• Should also use nonblocking algorithms!
04/11/23 20
25.6 Implementation Issues New indexing techniques: Bitmap indexes,
Join indexes, array representations, compression, precomputation of aggregations, etc.
E.g., Bitmap index:
10100110
112 Joe M 3115 Ram M 5
119 Sue F 5
112 Woo M 4
00100000010000100010
sex custid name sex rating ratingBit-vector:1 bit for eachpossible value.
MF
04/11/23 21
Bitmap Indexes
Work when an attribute has few values, e.g. gender or rating
Advantage: Small enough to fit in memory
Many queries can be answered by bit-vector ops, e.g. females with rating = 3.
04/11/23 22
25.7 Constructing a D. Warehouse Extract
Is the data in native format? Clean
How many ways can you spell Mr.? Errors, missing information
Transform Fix semantic mismatches.
• E.g. Last+first vs. Name
Load Do it in parallel or else….
Refresh Both data and indexes
04/11/23 23
25.8,9 Views and Decision Support In large databases, precomputation is
necessary for decent response times Examples: brain, google
Example: Precompute daily sums for the cube. What can be derived from those
precomputations? These precomputed queries are called
Materialized Views (SQL Server: Indexed views).
04/11/23 24
Materialized View Example
CREATE VIEW DailySum(date, sumamt)AS SELECT date, SUM(amt)FROM Times Join Sales USING(timeid) GROUP BY date
SELECT week, SUM(amt)FROM Times Join Sales USING(timeid)Group By week
SELECT week, SUM(sumamt)FROM Times Join DailySum USING (week)GROUP BY week
Mat.View
Query
ModifiedQuery
04/11/23 25
Pros and Cons of Materialized Views
Pro: Modified query is a join of two small tables; original query is a join with one huge table.
Con: Materialized views take up space, need to be updated.
04/11/23 26
A Materialized View is an Index
Recall the definition of an index Data structure that provides fast access to
data Table indexes were of the form {(value,
pointer)}, perhaps at leaf level of a search structure. This is different.
Needs to be maintained as underlying tables change.
Ideally, we want incremental view maintenance algorithms.
04/11/23 27
What views should we materialize?
Remember the software that automatically chooses optimal index configurations?
The same software will choose optimal materialized views, given a workload and available space.
04/11/23 28
What about the optimizer?
Given a query and a set of materialized views, can we use the materialized views to answer the query?
This is tricky. Best reference is [348]
04/11/23 29
Refreshing Materialized Views
How often should we refresh the materialized view?
Many enterprises refresh warehouse data only weekly/nightly, so can afford to completely rebuild their materialized views.
Others want their warehouses to be current, so materialized views must be updated incrementally if possible.
Let's look at some simple examples.
04/11/23 30
25.10 Maintaining Materialized Views*
Incremental view maintenance Defn: make changes in view that
correspond to changes in the base tables Example: V = SELECT a FROM R
How is V modified if r is inserted to R?
How is V modified if r is deleted from R?
04/11/23 31
Maintaining Materialized Views* Consider V = R ⋈ S
How is V modified if r is inserted to R? How is V modified if r is deleted from R?
Consider V = SELECT g,COUNT(*) FROM R GROUP BY g
How is V modified if r is inserted to R?
How is V modified if r is deleted from R
For more general cases, see [348]