38
DATA WAREHOUSING & MINING Chapter 2 Online Analytical Processing

Chapter 2 Online Analytical Processing.ppt

Embed Size (px)

DESCRIPTION

fytf

Citation preview

Page 1: Chapter 2 Online Analytical Processing.ppt

DATA WAREHOUSING & MINING

Chapter 2 Online Analytical Processing

Page 2: Chapter 2 Online Analytical Processing.ppt

Outline

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

Further development of data cube technology

From data warehousing to data mining

Wednesday, April 19, 2023

2

Chapter 2 Online Analytical Processing

Page 3: Chapter 2 Online Analytical Processing.ppt

Limitations of SQL

3

“A Freshman in Business needs a Ph.D. in SQL”

-- Ralph Kimball

Wednesday, April 19, 2023Chapter 2 Online Analytical Processing

Page 4: Chapter 2 Online Analytical Processing.ppt

Data Warehouse vs. Operational DBMS

OLTP (On-Line Transaction Processing) Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking,

manufacturing, payroll, registration, accounting, etc.

OLAP (On-Line Analytical Processing) Major task of data warehouse system Data analysis and decision making

Wednesday, April 19, 2023

4

Chapter 2 Online Analytical Processing

Page 5: Chapter 2 Online Analytical Processing.ppt

OLTP vs OLAP

Wednesday, April 19, 2023Chapter 2 Online Analytical Processing

5

Data WarehouseOLAP(On-line Analytical Processing)

OLTP(On-line Transaction Processing)

AnalyticsData MiningDecision Making

Short online transactions:update, insert, delete

Online-TransactionProcessing Tx.

database

current & detailed data,Versatile

aggregated & historical data, Static and Low volume

Complex Queries

Page 6: Chapter 2 Online Analytical Processing.ppt

OLTP vs OLAP

Distinct features (OLTP vs. OLAP): User and system orientation: customer vs. market Data contents: current, detailed vs. historical,

consolidated Database design: ER + application vs. star + subject View: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex

queries

Wednesday, April 19, 2023

6

Chapter 2 Online Analytical Processing

Page 7: Chapter 2 Online Analytical Processing.ppt

Typical OLAP Queries

Write a multi-table join to compare sales for each

product line YTD this year vs. last year.

Repeat the above process to find the top 5

product contributors to margin.

Repeat the above process to find the sales of a

product line to new vs. existing customers.

Repeat the above process to find the customers

that have had negative sales growth.

Wednesday, April 19, 2023

7

Chapter 2 Online Analytical Processing

Page 8: Chapter 2 Online Analytical Processing.ppt

OLAP Pros

It is a powerful visualization paradigm

It provides fast, interactive response

times

It is good for analyzing time series

It can be useful to find some clusters

and outliers

Many vendors offer OLAP toolsWednesday, April 19, 2023

8

Chapter 2 Online Analytical Processing

Page 9: Chapter 2 Online Analytical Processing.ppt

Nature of OLAP Analysis Aggregation -- (total sales,

percent-to-total) Comparison -- Budget vs.

Expenses Ranking -- Top 10, quartile

analysis Access to detailed and

aggregate data Complex criteria

specification Visualization Wednesday, April 19, 2023

9

Chapter 2 Online Analytical Processing

Page 10: Chapter 2 Online Analytical Processing.ppt

Why Separate Data Warehouse? High performance for both systems

DBMS— tuned for OLTP access methods, indexing, concurrency control, recovery

Warehouse—tuned for OLAP complex OLAP queries, multidimensional view, consolidation.

Different functions and different data Missing data: Decision support requires historical data

which operational DBs do not typically maintain Data consolidation: DS requires consolidation

(aggregation, summarization) of data from heterogeneous sources

Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Wednesday, April 19, 2023

10

Chapter 2 Online Analytical Processing

Page 11: Chapter 2 Online Analytical Processing.ppt

From Tables and Spreadsheets to Data Cubes

A data warehouse is based on multidimensional data model which views data in the form of a

data cube

A data cube allows data to be modeled and viewed in multiple dimensions (such as sales) Dimension tables, such as item (item_name, brand, type), or

time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to

each of the related dimension tables

Definitions an n-Dimensional base cube is called a base cuboid The top most 0-D cuboid, which holds the highest-level of

summarization, is called the apex cuboid The lattice of cuboids forms a data cube

Wednesday, April 19, 2023

11

Chapter 2 Online Analytical Processing

Page 12: Chapter 2 Online Analytical Processing.ppt

Cube: A Lattice of Cuboids

all

time item location supplier

time,item time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,location

time,item,supplier

time,location,supplier

item,location,supplier

time, item, location, supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

Wednesday, April 19, 2023

12

Chapter 2 Online Analytical Processing

Page 13: Chapter 2 Online Analytical Processing.ppt

Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures Star schema

A fact table in the middle connected to a set of dimension

tables

Snowflake schema A refinement of star schema where some dimensional

hierarchy is normalized into a set of smaller dimension

tables, forming a shape similar to snowflake

Fact constellations Multiple fact tables share dimension tables, viewed as a

collection of stars, therefore called galaxy schema or fact

constellation Wednesday, April 19, 2023

13

Chapter 2 Online Analytical Processing

Page 14: Chapter 2 Online Analytical Processing.ppt

Avg_sales

Euros_sold

Unit_sold

Location_keybranch_keybranch_namebranch_type

Example of Star Schema

time_keydayday_of_the_weekmonthquarteryear

location_keystreetcityprovince_or_streetcountry

Measures

item_keyitem_namebrandtypesupplier_type

Branch_key

Branch

Time

Item

Location

Sales Fact Table

Item_key

Time_key

Wednesday, April 19, 2023

14

Chapter 2 Online Analytical Processing

Page 15: Chapter 2 Online Analytical Processing.ppt

branch_keybranch_namebranch_type

Example of Snowflake Schema

time_keydayday_of_the_weekmonthquarteryear

Measures

item_keyitem_namebrandtypesupplier_key

Branch

Time

Item

location_keystreetcity_key

Location

Sales Fact Table

Avg_sales

Euros_sold

Unit_sold

Location_key

Branch_key

Item_key

Time_key

supplier_keysupplier_type

city_keycityprovince_or_streetcountry

City

Supplier

Wednesday, April 19, 2023

15

Chapter 2 Online Analytical Processing

Page 16: Chapter 2 Online Analytical Processing.ppt

branch_keybranch_namebranch_type

Example of Fact Constellation

time_keydayday_of_the_weekmonthquarteryear

Measures

Branch

Time

item_keyitem_namebrandtypesupplier_key

Item

location_keystreetcityProvince/streetcountry

Location

Sales Fact Table

Avg_sales

Euros_sold

Unit_sold

Location_key

Branch_key

Item_key

Time_key

shipper_keyshipper_namelocation_keyshipper_type

shipper

unit_shipped

Euros_sold

to_location

from_location

shipper_key

Item_key

Time_key

Shipping Fact Table

Wednesday, April 19, 2023

16

Chapter 2 Online Analytical Processing

Page 17: Chapter 2 Online Analytical Processing.ppt

DMQL: Language Primitives

Cube Definition (Fact Table) define cube <cube_name> [<dimension_list>]:

<measure_list>

Dimension Definition (Dimension Table) define dimension <dimension_name> as

(<attribute_or_subdimension_list>)

Special Case (Shared Dimension Tables) First time as “cube definition” define dimension <dimension_name> as

<dimension_name_first_time> in cube <cube_name_first_time>

Wednesday, April 19, 2023

17

Chapter 2 Online Analytical Processing

Page 18: Chapter 2 Online Analytical Processing.ppt

Defining a Star Schema in DMQL

define cube sales_star [time, item, branch, location]:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

Wednesday, April 19, 2023

18

Chapter 2 Online Analytical Processing

Page 19: Chapter 2 Online Analytical Processing.ppt

Defining a Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:dollars_sold = sum(sales_in_dollars),

avg_sales = avg(sales_in_dollars),

units_sold = count(*)

define dimension time as

(time_key, day, day_of_week, month, quarter, year)

define dimension item as

(item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

define dimension branch as

(branch_key, branch_name, branch_type)

define dimension location as

(location_key, street, city(city_key, province_or_state, country))Wednesday, April 19, 2023

19

Chapter 2 Online Analytical Processing

Page 20: Chapter 2 Online Analytical Processing.ppt

Defining a Fact Constellation in DMQL

define cube sales [time, item, branch, location]:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)define dimension location as (location_key, street, city,

province_or_state, country)define cube shipping [time, item, shipper, from_location, to_location]:

dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

define dimension time as time in cube salesdefine dimension item as item in cube salesdefine dimension shipper as (shipper_key, shipper_name, location as

location in cube sales, shipper_type)define dimension from_location as location in cube salesdefine dimension to_location as location in cube sales

Wednesday, April 19, 2023

20

Chapter 2 Online Analytical Processing

Page 21: Chapter 2 Online Analytical Processing.ppt

Measures: Three Categories

Distributive if the result derived by applying the function to n aggregate

values is the same as that derived by applying the function on all the data without partitioning. E.g., count(), sum(), min(), max()

Algebraic if it can be computed by an algebraic function with M arguments

(where M is a bounded integer), each of which is obtained by applying a distributive aggregate function. E.g., avg(), min_N(), standard_deviation()

Holistic if there is no constant bound on the storage size needed to

describe a subaggregate. E.g., median(), mode(), rank()

Wednesday, April 19, 2023

21

Chapter 2 Online Analytical Processing

Page 22: Chapter 2 Online Analytical Processing.ppt

A Concept Hierarchy: Dimension (location)

all

North_America Europe

FranceIrelandMexicoCanada

Dublin

BlackrockBelfield

...

......

... ...

...

all

region

office

country

BelfastTorontocity

Wednesday, April 19, 2023

22

Chapter 2 Online Analytical Processing

Page 23: Chapter 2 Online Analytical Processing.ppt

Multidimensional Data

Sales volume as a function of product, month, and region

Pro

duct

Regio

n

Month

Dimensions: Product, Location, TimeHierarchical summarization paths

Industry Region Year

Category Country Quarter

Product City Month Week

Office Day

Wednesday, April 19, 2023

23

Chapter 2 Online Analytical Processing

Page 24: Chapter 2 Online Analytical Processing.ppt

MDDM““Sales by product line over the past six Sales by product line over the past six

months”months”

““Sales by store between 1990 and 1995”Sales by store between 1990 and 1995”

Prod Code Time Code Store Code Sales Qty

Store Info

Product Info

Time Info

. . .

Numerical MeasuresKey columns joining fact table

to dimension tables

Fact table for measures

Dimension tables

Wednesday, April 19, 2023

24

Chapter 2 Online Analytical Processing

Page 25: Chapter 2 Online Analytical Processing.ppt

Star Schema

Wednesday, April 19, 2023

25

Chapter 2 Online Analytical Processing

Page 26: Chapter 2 Online Analytical Processing.ppt

A Sample Data CubeTotal annual salesof TV in IrelandDate

Produ

ct

Cou

ntr

ysum

sum TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

Ireland

France

Germany

sum

Wednesday, April 19, 2023

26

Chapter 2 Online Analytical Processing

Page 27: Chapter 2 Online Analytical Processing.ppt

Cuboids Corresponding to the Cube

all

product date country

product,date product,country date, country

product, date, country

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D(base) cuboid

Wednesday, April 19, 2023

27

Chapter 2 Online Analytical Processing

Page 28: Chapter 2 Online Analytical Processing.ppt

Browsing a Data Cube

Visualization OLAP capabilities Interactive

manipulation

Wednesday, April 19, 2023

28

Chapter 2 Online Analytical Processing

Page 29: Chapter 2 Online Analytical Processing.ppt

Typical OLAP Operations

Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up from higher level summary to lower level summary or detailed

data, or introducing new dimensions

Slice and dice project and select

Pivot (rotate) reorient the cube, visualization, 3D to series of 2D planes.

Other operations drill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its back-

end relational tables (using SQL)Wednesday, April 19, 2023

29

Chapter 2 Online Analytical Processing

Page 30: Chapter 2 Online Analytical Processing.ppt

A Star-Net Query Model

Shipping Method

AIR-EXPRESS

TRUCKORDER

Customer Orders

CONTRACTS

Customer

Product

PRODUCT GROUP

PRODUCT LINE

PRODUCT ITEM

SALES PERSON

DISTRICT

DIVISION

OrganizationPromotion

CITY

COUNTRY

REGION

Location

DAILYQTRLYANNUALYTime

Each circle is called a footprint

Wednesday, April 19, 2023

30

Chapter 2 Online Analytical Processing

Page 31: Chapter 2 Online Analytical Processing.ppt

OLAP Server Architectures

Relational OLAP (ROLAP) Use relational or extended-relational DBMS to store and manage

warehouse data and OLAP middle ware to support missing pieces Include optimization of DBMS backend, implementation of

aggregation navigation logic, and additional tools and services greater scalability

Multidimensional OLAP (MOLAP) Array-based multidimensional storage engine (sparse matrix

techniques) fast indexing to pre-computed summarized data

Hybrid OLAP (HOLAP) User flexibility, e.g., low level: relational, high-level: array

Specialized SQL servers specialized support for SQL queries over star/snowflake schemas

Wednesday, April 19, 2023

31

Chapter 2 Online Analytical Processing

Page 32: Chapter 2 Online Analytical Processing.ppt

Client/Server Architecture

Framework for the new systems to be designed, developed and implemented

Divide the OLAP system into several components that define its architecture Same Computer Distributed among several computer

Wednesday, April 19, 2023

32

Chapter 2 Online Analytical Processing

Page 33: Chapter 2 Online Analytical Processing.ppt

C/S

Wednesday, April 19, 2023

33

Chapter 2 Online Analytical Processing

Page 34: Chapter 2 Online Analytical Processing.ppt

Which Technology?

1) Performance: • How fast will the system appear to the end-user?

• MDD server vendors believe this is a key point in their favor.

2) Data volume and scalability:

• While MDD servers can handle up to 50GB of storage, RDBMS servers can handle hundreds of gigabytes and terabytes.

Wednesday, April 19, 2023

34

Chapter 2 Online Analytical Processing

Page 35: Chapter 2 Online Analytical Processing.ppt

What if AnalysisIF

A. You require write access B. Your data is under 50 GBC. Your timetable to implement is 60-90 daysD. Lowest level already aggregatedE. Data access on aggregated levelF. You’re developing a general-purpose application for inventory movement or assets management

THENConsider an MDD /MOLAP solution for your data mart

 IF

A. Your data is over 100 GBB. You have a "read-only" requirementC. Historical data at the lowest level of granularityD. Detailed access, long-running queriesE. Data assigned to lowest level elements

THENConsider an RDBMS/ROLAP solution for your data mart.

IFA. OLAP on aggregated and detailed dataB. Different user groupsC. Ease of use and detailed data

THENConsider an HOLAP for your data mart

Wednesday, April 19, 2023

35

Chapter 2 Online Analytical Processing

Page 36: Chapter 2 Online Analytical Processing.ppt

Examples

• ROLAP– Telecommunication startup: call data records

(CDRs) – ECommerce Site– Credit Card Company

• MOLAP– Analysis and budgeting in a financial

department– Sales analysis

• HOLAP– Sales department of a multi-national company– Banks and Financial Service Providers

Wednesday, April 19, 2023

36

Chapter 2 Online Analytical Processing

Page 37: Chapter 2 Online Analytical Processing.ppt

Tools

• ROLAP:– ORACLE 8i– ORACLE Reports; ORACLE Discoverer– ORACLE Warehouse Builder– Arbors Software’s Essbase

• MOLAP:– ORACLE Express Server– ORACLE Express Clients (C/S and Web)– MicroStrategy’s DSS server– Platinum Technologies’ Plantinum InfoBeacon

• HOLAP:– ORACLE 8i– ORACLE Express Serve– ORACLE Relational Access Manager– ORACLE Express Clients (C/S and Web)

Wednesday, April 19, 2023

37

Chapter 2 Online Analytical Processing

Page 38: Chapter 2 Online Analytical Processing.ppt

Conclusion

• ROLAP: RDBMS -> star/snowflake schema

• MOLAP: MDD -> Cube structures

• ROLAP or MOLAP: Data models used play major role in performance differences

• MOLAP: for summarized and relatively lesser volumes of data (10-50GB)

• ROLAP: for detailed and larger volumes of data

• Both storage methods have strengths and weaknesses

• The choice is requirement specific, though currently data warehouses are predominantly built using RDBMSs/ROLAP.

Wednesday, April 19, 2023

38

Chapter 2 Online Analytical Processing