Data Warehouses and OLAP
What are data warehousing systems?
Data Warehouse Architecture & Design
Multidimensional Data Model
ROLAP and MOLAP Systems
View Design in Data Warehouses
Object-Oriented Data Warehousing
Summary and New Aspects
What Is Data Warehousing?
Data warehousing is a collection of decision support technologies, aimed at enabling the knowledge worker (e.g., chief executive, manager, analyst) to make better and faster decisions.
-Chaudhuri and Dayal, SIGMOD Record, March 1997
Characteristics of Data Warehousing Systems
Historical, summarized and consolidated data; very large databases
Query-intensive processing, driven by query throughput and response time
Multidimensional model, new operations
ROLAP, MOLAP, Data Marts
Operational Database Systems
Mainly on-line transaction processing systems.
Have been the bread-and-butter systems for most database system vendors.
A lot of work has gone into building these systems.
But these systems offer only limited decision support functionality (data analysis), which competitive business environments require.
Why Need Data Analysis?
To know your customers and yourself better,
for effective business strategies,
to provide future directions to business organizations.
This kind of data analysis has been going on for a long time, but there is now an urgency to get it done faster. The main obstacle has been disparate and heterogeneous data sources.
Data warehousing systems aim to solve this problem!
Data Warehousing Architecture
[Figure: Data sources (operational DBs and external sources) are extracted, transformed, loaded and refreshed into the data warehouse and data marts, supported by a metadata repository and by monitoring & administration. OLAP servers serve the front-end tools: analysis, query/reporting and data mining.]
Taken from Chaudhuri & Dayal, SIGMOD Record, March 1997
Data Warehouse Design
Define the architecture, do capacity planning, and select storage servers, database and OLAP servers, and tools
Integrate the servers, storage and client tools
Design the warehouse schema and views
Define the physical warehouse organization, data placement, partitioning, and access methods
Data Warehouse Design (Cont.)
Connect the servers using gateways, ODBC drivers, or other wrappers
Design and implement scripts for data extraction, cleaning, transformation, load and refresh
Populate the repository with the schema and view definitions, scripts, and other metadata
Design and implement end-user applications
Roll out the warehouse and applications
Back-end Tools and Utilities
Data Cleansing: consists of data migration, data scrubbing and data auditing
Data Loading: consists of checking integrity constraints; sorting; summarization; aggregation and other computation to build derived tables stored in the warehouse; building indices and other access paths; and partitioning to multiple target storage areas
Refresh: data shipping (triggers) vs. transaction shipping (based on logs)
Multidimensional Data Model
Multidimensional view of data in the warehouse
Each dimension is described by a set of attributes; the attributes of a dimension may be related via a hierarchy of relationships.
Dimensions: Product, City, Date
Hierarchical summarization paths:
Industry > Category > Product
Country > State > City
Year > Quarter > Month > Date, with Week as an alternative level directly above Date
On-Line Analytical Processing (OLAP)
OLAP tools provide an environment for decision making and business modeling activities by supporting ad hoc queries. They
provide a multidimensional conceptual view of the data, usually as a
  star schema, in which a single fact table relates to each dimension table, or a
  snowflake schema, in which dimension tables are normalized to simplify the data operations related to the dimension
provide easy-to-use end-user interfaces
OLAP (Front-end) Tools
The multidimensional data model grew out of the view of business data popularized by PC spreadsheet programs.
Operations supported by the multidimensional data model:
Aggregation: total sales by store and by year
Selection (slicing): sales where toys = "soft" and store = "LA" and year = 1996
Roll up (multiple group by): sales by city to sales by state
Drill-down: sales by state to sales by city
Calculation by positioning: top 5 stores by total sales
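The operations listed above can be sketched in a few lines of Python. The sample rows, the field names and the helper functions (`aggregate`, `slice_`, `top_n`) are illustrative assumptions rather than part of the slides, and the slide's store = "LA" condition is represented here by the city field.

```python
from collections import defaultdict

# Hypothetical sales facts: (toys, city, state, year, amount).
FIELDS = ("toys", "city", "state", "year", "amount")
SALES = [
    ("soft", "LA", "CA", 1996, 120),
    ("soft", "LA", "CA", 1995, 80),
    ("hard", "LA", "CA", 1996, 50),
    ("soft", "SF", "CA", 1996, 70),
    ("soft", "Austin", "TX", 1996, 40),
]


def aggregate(rows, *dims):
    """Aggregation: total amount grouped by the named dimensions."""
    idx = [FIELDS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[4]
    return dict(totals)


def slice_(rows, **conditions):
    """Selection (slicing): keep rows where every named field equals its value."""
    return [r for r in rows
            if all(r[FIELDS.index(k)] == v for k, v in conditions.items())]


def top_n(totals, n):
    """Calculation by positioning: the n groups with the largest totals."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


by_city = aggregate(SALES, "city")     # drill-down level
by_state = aggregate(SALES, "state")   # roll-up of the city level
soft_la_1996 = slice_(SALES, toys="soft", city="LA", year=1996)
best_cities = top_n(by_city, 2)
```

Roll-up and drill-down are the same grouping applied at a coarser or finer level of the City hierarchy, which is why one `aggregate` helper covers both.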
Star Schema
Fact Table: OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice
City: CityName, State, Country
Date: DateKey, Date, Month, Year
Product: ProdNo, ProdName, ProdDescr, Category, CategoryDescr, UnitPrice, QOH
Order: OrderNo, OrderDate
Salesperson: SalespersonID, SalespersonName, City, Quota
Customer: CustomerNo, CustomerName, CustomerAddress, City
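As a rough illustration, part of this star schema can be built in an in-memory SQLite database and queried with one join per dimension. Only the City, Date and Product dimensions are included, the column types and sample rows are assumptions, and some columns are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE City    (CityName TEXT PRIMARY KEY, State TEXT, Country TEXT);
CREATE TABLE "Date"  (DateKey INTEGER PRIMARY KEY, Month INTEGER, Year INTEGER);
CREATE TABLE Product (ProdNo INTEGER PRIMARY KEY, ProdName TEXT,
                      Category TEXT, UnitPrice REAL);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE Fact (
    OrderNo    INTEGER,
    ProdNo     INTEGER REFERENCES Product(ProdNo),
    DateKey    INTEGER REFERENCES "Date"(DateKey),
    CityName   TEXT    REFERENCES City(CityName),
    Quantity   INTEGER,
    TotalPrice REAL
);
INSERT INTO City VALUES ('LA', 'CA', 'USA'), ('SF', 'CA', 'USA'),
                        ('Austin', 'TX', 'USA');
INSERT INTO "Date" VALUES (1, 1, 1996), (2, 6, 1996), (3, 3, 1995);
INSERT INTO Product VALUES (10, 'bear', 'toys', 10.0), (11, 'train', 'toys', 25.0);
INSERT INTO Fact VALUES (1, 10, 1, 'LA', 2, 20.0), (2, 10, 2, 'SF', 1, 10.0),
                        (3, 11, 3, 'LA', 4, 100.0), (4, 11, 1, 'Austin', 1, 25.0);
""")

# Total sales by state and year: the fact table joins to two dimension tables.
sales_by_state_year = conn.execute("""
    SELECT c.State, d.Year, SUM(f.TotalPrice)
    FROM Fact f
    JOIN City c   ON f.CityName = c.CityName
    JOIN "Date" d ON f.DateKey = d.DateKey
    GROUP BY c.State, d.Year
    ORDER BY c.State, d.Year
""").fetchall()
```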
Relational OLAP (ROLAP)
Stores the data in specialized relational tables (star schema);
ROLAP offers flexibility; the cost is the many joins needed for each query
ROLAP extends SQL for decision support data requests
Bitmapped indexes are more useful than B-trees for handling large amounts of data
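A minimal sketch of a bitmapped index, assuming one bitmap per distinct column value with bit i set when row i holds that value. The column data and function names are hypothetical; Python integers stand in for the bit vectors.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i has it."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def rows_of(bitmap):
    """Row ids whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two hypothetical columns of a five-row fact table.
city = ["LA", "SF", "LA", "NY", "LA"]
year = [1996, 1996, 1995, 1996, 1996]

city_index = build_bitmap_index(city)
year_index = build_bitmap_index(year)

# A conjunctive selection becomes a bitwise AND of two bitmaps.
matches = rows_of(city_index["LA"] & year_index[1996])
```

Combining predicates with bitwise operations instead of index traversals is what makes this structure attractive for low-cardinality dimension attributes.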
Multidimensional OLAP (MOLAP)
Stores data in an N-dimensional cube (hypercube) using an array-based storage structure
Each cell is formed by the intersection of all the dimensions; not all cells have a value (e.g., not every product is sold in every store)
Cubes must be built before they can be used, and are static
Suited for small and medium data sets
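The array-based storage can be pictured as a dense three-dimensional array addressed by dimension positions. This sketch assumes three small hypothetical dimensions and uses `None` for cells that have no value.

```python
# Hypothetical dimension members; a cell is addressed by one position per dimension.
PRODUCTS = ["toy", "book"]
STORES = ["LA", "SF", "NY"]
YEARS = [1995, 1996]


def empty_cube():
    """Dense array with one cell per (product, store, year) combination."""
    return [[[None] * len(YEARS) for _ in STORES] for _ in PRODUCTS]


def load(cube, facts):
    """Build the cube up front from (product, store, year, amount) facts."""
    for product, store, year, amount in facts:
        p, s, y = PRODUCTS.index(product), STORES.index(store), YEARS.index(year)
        cube[p][s][y] = (cube[p][s][y] or 0) + amount


def cell(cube, product, store, year):
    """Direct array lookup, with no join needed."""
    return cube[PRODUCTS.index(product)][STORES.index(store)][YEARS.index(year)]


def total_by_store(cube):
    """Aggregate over the product and year dimensions."""
    return {store: sum(cube[p][s][y] or 0
                       for p in range(len(PRODUCTS))
                       for y in range(len(YEARS)))
            for s, store in enumerate(STORES)}


def filled_cells(cube):
    return sum(v is not None for plane in cube for row in plane for v in row)


cube = empty_cube()
load(cube, [("toy", "LA", 1996, 100), ("toy", "SF", 1996, 40),
            ("book", "LA", 1995, 30)])
```

Here only 3 of the 12 cells hold a value, which hints at why sparsity becomes a storage concern as the number of dimensions grows.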
View Design & Data Warehousing
The virtual view approach may be better if the information sources are changing frequently.
The materialized view approach would be superior if the information sources change infrequently and very fast query response time is needed.
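The trade-off can be seen in a small SQLite sketch. SQLite has no built-in materialized views, so one is emulated here with a table created from the view's query plus an explicit refresh step; the relation names and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sales (I_id INTEGER, month INTEGER, year INTEGER, amount INTEGER);
INSERT INTO Sales VALUES (1, 1, 1996, 10), (1, 2, 1996, 5), (2, 1, 1996, 7);

-- Virtual view: recomputed from the source on every query.
CREATE VIEW virtual_total AS
    SELECT I_id, SUM(amount) AS total FROM Sales GROUP BY I_id;

-- Emulated materialized view: the result is stored and must be maintained.
CREATE TABLE materialized_total AS
    SELECT I_id, SUM(amount) AS total FROM Sales GROUP BY I_id;
""")


def totals(relation):
    return dict(conn.execute(f"SELECT I_id, total FROM {relation}"))


def refresh():
    """Full recomputation; incremental maintenance would be cheaper."""
    conn.executescript("""
    DELETE FROM materialized_total;
    INSERT INTO materialized_total
        SELECT I_id, SUM(amount) FROM Sales GROUP BY I_id;
    """)


conn.execute("INSERT INTO Sales VALUES (2, 3, 1996, 3)")  # the source changes
fresh = totals("virtual_total")        # always current, pays the query cost
stale = totals("materialized_total")   # fast to read, but out of date
refresh()
refreshed = totals("materialized_total")
```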
A Motivating Example
Suppose the member databases contain the following tables:
Item(I_id, I_name, I_price)
Part(P_id, P_name, I_id)
Supplier(S_id, S_name, P_id, city, cost, preference)
Sales(I_id, month, year, amount)
Example Continued
Assume we have the following frequently asked queries:
Q1: Select Item.I_id, sum(amount*I_price)
    From Item, Sales
    Where I_name in ('Mazda', 'Nissan', 'Toyota') And year=1996 And Item.I_id=Sales.I_id
    Group by Item.I_id
Q2: Select P_id, month, sum(amount)
    From Item, Sales, Part
    Where I_name in ('Mazda', 'Nissan', 'Toyota') And year=1996 And Item.I_id=Sales.I_id And Part.I_id=Item.I_id
    Group by P_id, month
Example Continued
Q3: Select Part.P_id, min(cost), max(cost)
    From Part, Supplier
    Where Part.P_id=Supplier.P_id And P_name in ('spark_plug', 'gas_kit')
    Group by Part.P_id
Q4: Select Item.I_id, sum(amount*min_cost)
    From Item, Sales, Part,
         (Select P_id, min(cost) as min_cost From Supplier Group by P_id) as M
    Where I_name in ('Mazda', 'Nissan', 'Toyota') And year=1996 And Item.I_id=Sales.I_id And Item.I_id=Part.I_id And Part.P_id=M.P_id
    Group by Item.I_id
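To make the example concrete, Q3 and Q4 can be run against a tiny SQLite database. The sample rows and column types are invented for illustration. Q4 is written with a derived table that supplies `min_cost` per part, which is one runnable reading of the query and not necessarily the authors' exact formulation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Item     (I_id INTEGER, I_name TEXT, I_price INTEGER);
CREATE TABLE Part     (P_id INTEGER, P_name TEXT, I_id INTEGER);
CREATE TABLE Supplier (S_id INTEGER, S_name TEXT, P_id INTEGER,
                       city TEXT, cost INTEGER, preference INTEGER);
CREATE TABLE Sales    (I_id INTEGER, month INTEGER, year INTEGER, amount INTEGER);

INSERT INTO Item VALUES (1, 'Mazda', 100), (2, 'Ford', 90);
INSERT INTO Part VALUES (10, 'spark_plug', 1), (11, 'gas_kit', 1),
                        (12, 'spark_plug', 2);
INSERT INTO Supplier VALUES (1, 'A', 10, 'HK', 5, 1), (2, 'B', 10, 'HK', 3, 1),
                            (3, 'C', 11, 'HK', 8, 1), (4, 'D', 12, 'HK', 2, 1);
INSERT INTO Sales VALUES (1, 1, 1996, 4), (1, 2, 1996, 6),
                         (1, 1, 1995, 9), (2, 1, 1996, 7);
""")

Q3 = """
SELECT Part.P_id, MIN(cost), MAX(cost)
FROM Part, Supplier
WHERE Part.P_id = Supplier.P_id
  AND P_name IN ('spark_plug', 'gas_kit')
GROUP BY Part.P_id
ORDER BY Part.P_id
"""

# The derived table M computes the minimum supplier cost per part,
# the same grouping over Supplier that Q3 performs.
Q4 = """
SELECT Item.I_id, SUM(amount * min_cost)
FROM Item, Sales, Part,
     (SELECT P_id, MIN(cost) AS min_cost FROM Supplier GROUP BY P_id) AS M
WHERE I_name IN ('Mazda', 'Nissan', 'Toyota')
  AND year = 1996
  AND Item.I_id = Sales.I_id
  AND Item.I_id = Part.I_id
  AND Part.P_id = M.P_id
GROUP BY Item.I_id
"""

q3_rows = conn.execute(Q3).fetchall()
q4_rows = conn.execute(Q4).fetchall()
```

Both queries group Supplier by P_id, which is the kind of common subexpression a shared plan can compute once.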
An MVPP for the Example
[Figure: MVPP (multiple view processing plan) for the example. The leaves are the base relations Item, Sales, Part and Supplier, labelled 1k, 12k, 10k and 50k. Intermediate nodes tmp1 to tmp8 include the selections I_name like {Mazda, Nissan, Toyota}, year = "1996" and P_name like {spark_plug, gas_kit}, and the aggregation P_id, min(cost), max(cost). The roots are result1 (I_id, sum(amount*I_price)), result2 (P_id, month, sum(amount*no)), result3 (P_id, min(cost), max(cost)) and result4 (I_id, sum(mincost*amount*no)), which answer Q1 to Q4.]
Different Materialization Strategies
Materialized Views                   Cost of Query Processing   Cost of Maintenance   Total Cost
Item, Sales, Part, Supplier          8b980m860k                 0                     8b980m860k
tmp3, tmp4, tmp8                     7b201m547k                 1b350m125k            8b551m672k
tmp3, tmp5                           416m747k                   16b32m204k            16b448m951k
tmp3, tmp4, tmp7                     7b276m497k                 1b220m55k             8b496m552k
tmp3, tmp7                           8b281m547k                 126m122k              8b407m669k
result1, result2, result3, result4   1m447k                     17b384m934k           17b386m381k
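The comparison can be checked mechanically. The sketch below assumes the b/m/k suffixes stand for billions, millions and thousands, recomputes each total as query processing cost plus maintenance cost, and picks the cheapest strategy; `parse_cost` is a helper introduced here for illustration.

```python
import re

UNITS = {"b": 10**9, "m": 10**6, "k": 10**3}


def parse_cost(text):
    """Convert slide notation such as '8b980m860k' to an integer ('0' gives 0)."""
    return sum(int(n) * UNITS[u] for n, u in re.findall(r"(\d+)([bmk])", text))


# (query processing cost, maintenance cost) per strategy, copied from the table.
STRATEGIES = {
    "Item, Sales, Part, Supplier": ("8b980m860k", "0"),
    "tmp3, tmp4, tmp8": ("7b201m547k", "1b350m125k"),
    "tmp3, tmp5": ("416m747k", "16b32m204k"),
    "tmp3, tmp4, tmp7": ("7b276m497k", "1b220m55k"),
    "tmp3, tmp7": ("8b281m547k", "126m122k"),
    "result1, result2, result3, result4": ("1m447k", "17b384m934k"),
}

totals = {name: parse_cost(query) + parse_cost(maintenance)
          for name, (query, maintenance) in STRATEGIES.items()}
cheapest = min(totals, key=totals.get)
```

Under this reading, materializing tmp3 and tmp7 gives the lowest total, while materializing only the query results minimizes query cost at the price of the highest maintenance cost.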
Issues & Problems
Finding all the common subexpressions and combining individual query access plans into one MVPP, such that all the common subexpressions are merged;
Finding a set of intermediate nodes in the MVPP, such that if the members of this set are materialized, the total cost of global query access and view maintenance is minimal.
Algorithms for Materialized View Selection
Algorithms for multiple MVPP design:
  a feasible solution: working with individual optimal plans;
  generating optimal plan(s): applying a 0-1 integer programming technique.
Given an MVPP, use heuristic rules to find a set of nodes to be materialized so that the total cost is minimal.
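The selection problem can be illustrated on a toy MVPP. Everything in this sketch is an assumption made for illustration: the node names, operator costs and frequencies are invented, each query plan is simplified to a chain of nodes, and maintenance is modelled as full recomputation on every update. Exhaustive enumeration replaces the heuristic rules mentioned above, which is feasible only for very small plans.

```python
from itertools import combinations

# Hypothetical MVPP: cost of computing each node from its input.
OP_COST = {"tmp1": 10, "tmp2": 40, "result1": 5, "result2": 8}
# Each query plan as a chain from leaf-most node to its result; tmp1 and tmp2 are shared.
PLANS = {"Q1": ["tmp1", "tmp2", "result1"], "Q2": ["tmp1", "tmp2", "result2"]}
QUERY_FREQ = {"Q1": 10, "Q2": 5}
UPDATE_FREQ = 2  # how often the base relations change


def build_cost(node):
    """Cost of recomputing a node from the base relations."""
    for plan in PLANS.values():
        if node in plan:
            return sum(OP_COST[n] for n in plan[: plan.index(node) + 1])
    raise KeyError(node)


def query_cost(plan, materialized):
    """Walk down from the result; stop at the first materialized node."""
    cost = 0
    for node in reversed(plan):
        if node in materialized:
            break
        cost += OP_COST[node]
    return cost


def total_cost(materialized):
    query = sum(QUERY_FREQ[q] * query_cost(plan, materialized)
                for q, plan in PLANS.items())
    maintenance = UPDATE_FREQ * sum(build_cost(n) for n in materialized)
    return query + maintenance


def best_materialization():
    """Try every subset of nodes and keep the one with minimal total cost."""
    nodes = sorted(OP_COST)
    candidates = (set(c) for r in range(len(nodes) + 1)
                  for c in combinations(nodes, r))
    return min(candidates, key=total_cost)


best = best_materialization()
```

In this toy instance the shared intermediate node wins over both extremes, which mirrors the pattern in the strategy comparison: materializing nothing makes queries expensive, and materializing every result makes maintenance expensive.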
Dynamic Materialized View Selection
Monitor the queries being executed over time
Maintain the MVPP by incorporating the most frequently executed queries (common subexpressions)
Modify the MVPP incrementally by executing the MVPP generation algorithm (in the background)
Decide on the views to be materialized
Reorganize the existing views
Materialized View Selection Costs
The dynamic materialized view selection problem has to take into consideration:
Benefit to the query processing cost in future
Cost of maintaining the materialized views
Cost of reorganization
Materialized View Reorganization
Given a set of views V1, V2, …, Vn currently materialized
Let V'1, V'2, …, V'm be the new views that need to be materialized
Need to design algorithms for efficient view reorganization
on-line (concurrency, failure recovery) & off-line (efficiency) algorithms
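A first step in any reorganization is to work out which views carry over unchanged. The sketch below only compares view names, assuming that a view appearing in both sets can be kept as it is; it does not address the on-line or off-line algorithms themselves, and the function name is hypothetical.

```python
def reorganization_plan(current, target):
    """Split views into those to keep, drop and newly materialize."""
    current, target = set(current), set(target)
    return {
        "keep": sorted(current & target),    # already materialized, still wanted
        "drop": sorted(current - target),    # no longer worth maintaining
        "create": sorted(target - current),  # must be computed and stored
    }


plan = reorganization_plan(
    current=["tmp3", "tmp4", "tmp8"],
    target=["tmp3", "tmp7"],
)
```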
Relational Schema
Fact Table: OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice
City: CityName, State, Country
Date: DateKey, Date, Month, Year
Product: ProdNo, ProdName, ProdDescr, Category, CategoryDescr, UnitPrice, QOH
Order: OrderNo, OrderDate
Salesperson: SalespersonID, SalespersonName, City, Quota
Customer: CustomerNo, CustomerName, CustomerAddress, City
An Object Model
[Figure: object model of the order schema, with classes linked by ISA and IS PART-OF relationships. Classes shown: Order (OrderNo, Quantity, TotalPrice), SalesPerson (SalesPersonID, Quota), Customer (CustomerNo), Person (Name, DateOfBirth, Address, GetAge()), Product (ProdName, UnitPrice, GetProdName()), Category (CategoryName, CategoryDescr, GetCategName()), City (GetRegion()), State, Country, OrderDate (GetDate()), Date, Month, Year, and the view classes OrderView (OrderSet, Summarize()), OrderPYView (Product, Year) and OrderPYCView (City).]
Why Object Oriented Data Warehouse?
Object identity reduces data redundancy: can it help materialized view maintenance?
Is-a hierarchy facilitates reuse of common data objects and methods (overloading)
Class composition hierarchy helps fast traversal using OIDs
Methods facilitate implementation of complex aggregate functions (over complex objects, such as volume of a CAD object)
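The last point can be sketched with a method-based aggregate. The `Box` and `Assembly` classes are hypothetical stand-ins for CAD objects: each class knows how to compute its own volume, so an aggregate over a mixed collection only needs to call the method.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """A simple CAD primitive."""
    width: float
    height: float
    depth: float

    def volume(self):
        return self.width * self.height * self.depth


@dataclass
class Assembly:
    """A complex object composed of parts (class composition hierarchy)."""
    parts: list = field(default_factory=list)

    def volume(self):
        # The aggregate traverses the composition by calling each part's method.
        return sum(part.volume() for part in self.parts)


def total_volume(objects):
    """Aggregate function over complex objects, independent of their class."""
    return sum(obj.volume() for obj in objects)


engine = Assembly([Box(1, 2, 3), Box(2, 2, 2)])
grand_total = total_volume([engine, Box(1, 1, 1)])
```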
Efficiency considerations
Structural join index hierarchies and class partitioning can help in:
evaluating multiple path operations
efficiently processing methods
calculating multidimensional aggregate operations, such as data cube and pivoting
Architecture considerations
The following issues need to be addressed:
Is the preferred architecture an OO front-end with a relational back-end?
What about an OO back-end and front-end?
How does one integrate data mining and OO data warehousing components?
How does one build distributed object-oriented data warehousing systems?
Summary
Data warehousing systems are about 5 years old
Most of the work has concentrated on materialized view maintenance, preliminaries
New aspects of data warehousing have to be considered to build next generation systems:
dynamic materialized view design
object-orientation, etc.
Some References
Dynamic Materialized View Design/Selection: Timos Sellis group, Stanford group, CSIRO/HKUST/CityU
Object-Oriented Data Warehousing: Rundensteiner group, Tore Risch group, CityU/HKUST, Univ. of South Australia