17.04.2013

DATA WAREHOUSE AND OLAP TECHNOLOGIES

Data Mining

Keep order, and the order shall save thee. Latin maxim

Outline

Data Warehouse
  Definition
  Architecture
OLAP
  Multidimensional data model
  OLAP cube computing

© Mikhail Zymbler

Data Warehouse


A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.

Data warehousing is the process of constructing and using data warehouses.

(W. Inmon)


Data Warehouse: subject-oriented


Organized around major subjects, such as customer, product, sales

Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Data Warehouse: integrated


Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records

Data cleaning and data integration techniques are applied.

Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources

E.g., hotel price: currency, tax, breakfast covered, etc.

When data is moved to the warehouse, it is converted.
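The conversion step for the hotel price example could be sketched as follows. This is only an illustration: the record fields, the exchange rates, and the tax rate are assumptions made up for the example, not values from the lecture.

```python
# Hypothetical source records: field names, rates, and tax are assumptions.
RATE_TO_USD = {"USD": 1.0, "EUR": 1.25}   # illustrative fixed exchange rates
TAX = 0.10                                # illustrative tax rate

source_a = {"hotel": "A", "price": 100.0, "currency": "EUR", "tax_included": True}
source_b = {"hotel": "B", "price": 110.0, "currency": "USD", "tax_included": False}

def to_warehouse(record):
    """Convert a source record to one unified form: price in USD, tax included."""
    price = record["price"] * RATE_TO_USD[record["currency"]]
    if not record["tax_included"]:
        price *= 1 + TAX
    return {"hotel": record["hotel"], "price_usd": round(price, 2)}

for rec in (source_a, source_b):
    print(to_warehouse(rec))
```

After the conversion both prices use the same currency and the same tax convention, so they can be compared directly in the warehouse.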

Data Warehouse: time variant


The time horizon for the data warehouse is significantly longer than that of operational systems
  Operational database: current value data
  Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouse contains an element of time, explicitly or implicitly
  The key of operational data may or may not contain a “time element”


Data Warehouse: nonvolatile


A physically separate store of data transformed from the operational environment

Operational update of data does not occur in the data warehouse environment
  Does not require transaction processing, recovery, and concurrency control mechanisms
  Requires only two operations in data accessing: initial loading of data and access of data

Data Warehouse

Information from heterogeneous sources is integrated in advance and stored in a physically separate warehouse for direct query and analysis

[Figure: several Sources feed one Warehouse, which is queried by several Clients]

Data Warehouse vs. Operational DBMS


OLTP (On-Line Transaction Processing)
  Major task of traditional relational DBMS
  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

OLAP (On-Line Analytical Processing)
  Major task of data warehouse system
  Data analysis and decision making

Distinct features (OLTP vs. OLAP)
  User and system orientation: customer vs. market
  Data contents: current, detailed vs. historical, consolidated
  Database design: ER + application vs. star + subject
  View: current, local vs. evolutionary, integrated
  Access patterns: update vs. read-only but complex queries


Multidimensional data model


Data in a data warehouse is modeled and viewed in multiple dimensions

A dimension is a set of attribute values

Suppliers={MEXX, Bvlgari, Versace, Ecco, …}

Products={Clothes, Shoes, Cosmetic, Haberdashery, …}

Locations={Chelyabinsk, Moscow, Yekaterinburg, …}

A measure is a numerical function of the dimensions

Cost: Suppliers × Products × Locations → R+

Cost(Ecco, Shoes, Chelyabinsk) = 50300.75 (USD)

Amount: Suppliers × Products × Locations → Z+

Amount(Versace, Clothes, Moscow) = 10 (items)

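A measure can be pictured as a lookup from a tuple of dimension values to a number. The sketch below is a minimal illustration of that idea; the dict layout and the helper name `measure` are assumptions of this example, while the two sample values are the ones quoted above.

```python
# Measures stored as dicts keyed by (supplier, product, location) tuples.
cost = {("Ecco", "Shoes", "Chelyabinsk"): 50300.75}   # USD
amount = {("Versace", "Clothes", "Moscow"): 10}       # items

def measure(cube, supplier, product, location):
    """Return the measure value for one point, or None if no fact exists."""
    return cube.get((supplier, product, location))

print(measure(cost, "Ecco", "Shoes", "Chelyabinsk"))
print(measure(amount, "Versace", "Clothes", "Moscow"))
print(measure(cost, "MEXX", "Shoes", "Moscow"))   # no such fact
```

Points of the dimension space without a fact simply have no entry, which is why real cubes are usually sparse.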
Data Warehouse: tables


A dimension table stores the data about one dimension

Dimension(ID, Attr1, Attr2, …)

Suppliers(SID, Name, Rating, …)

Products(PID, Name, Price, Color, …)

Locations(LID, Name, Address, …)

The fact table stores the data cube

Fact(ID_Dim1, ID_Dim2, …, Measure1, Measure2, …)

Sales(SID, PID, LID, Cost, Amount)
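How a fact table refers to its dimension tables can be sketched with plain Python structures. All rows below are made-up sample data, and the helper `total_cost_by` is a hypothetical name used only for this illustration.

```python
from collections import defaultdict

# Dimension tables: ID -> attributes (made-up sample rows).
suppliers = {1: {"Name": "Ecco"}, 2: {"Name": "Versace"}}
products = {10: {"Name": "Shoes"}, 11: {"Name": "Clothes"}}
locations = {100: {"Name": "Chelyabinsk"}, 101: {"Name": "Moscow"}}

# Fact table Sales(SID, PID, LID, Cost, Amount).
sales = [
    (1, 10, 100, 500.0, 5),
    (1, 10, 101, 300.0, 3),
    (2, 11, 101, 900.0, 10),
]

def total_cost_by(dimension, key_index):
    """Join the facts to one dimension table and sum Cost per dimension name."""
    totals = defaultdict(float)
    for row in sales:
        name = dimension[row[key_index]]["Name"]
        totals[name] += row[3]          # row[3] is the Cost measure
    return dict(totals)

print(total_cost_by(suppliers, 0))   # group by supplier
print(total_cost_by(locations, 2))   # group by location
```

The fact rows hold only keys and measures; descriptive attributes are looked up in the dimension tables when a query needs them.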

Data Warehouse: typical schemas


Star schema: a fact table in the middle connected to a set of dimension tables.

Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables.

Constellation (galaxy) schema: multiple fact tables share dimension tables, viewed as a collection of stars.


Star schema

Fact table:

Supplements(ProductID, LocationID, SupplierID, Cost, Amount)

Dimension tables:

Products(ID, Name, Brand)

Locations(ID, Address, City)

Suppliers(ID, Name)

Snowflake schema

Fact table:

Supplements(ProductID, LocationID, SupplierID, Cost, Amount)

Dimension tables:

Products(ID, Name, BrandID)

Locations(ID, Address, CityID)

Suppliers(ID, Name, …)

Normalized sub-dimension tables:

Cities(ID, City, Country, Population)

Brands(ID, Name, Country)

Constellation (galaxy) schema

Fact tables:

Supplements(ProductID, LocationID, DeliverymanID, Cost, Amount)

Deliveries(ProductID, FromID, ToID, DeliverymanID, Cost, Amount)

Shared dimension tables:

Products(ID, Name, Brand)

Locations(ID, Address, City, Country)

Suppliers(ID, Name)

Deliverymen(ID, Name, LocaleID)

Time dimension


DD-MMM-YYYY day is unambiguously identified by

number of the day in the month

number of the month in the year

number of the year

number of the week in the year

number of the day in the week

number of the quarter in the year

25-JAN-2010=(ID, 25, 1, 2010, 4, 2, 1, …)
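The attributes of such a time dimension row can be derived from a calendar date. The sketch below assumes ISO 8601 week numbers and days of the week numbered from Sunday = 1; both conventions are inferred from the example row above rather than stated in the slides, and the function name is hypothetical.

```python
from datetime import date

def time_dimension_row(d):
    """Build (day, month, year, week, day_of_week, quarter) for a date.

    Assumptions: ISO 8601 week numbers; days of the week numbered
    from Sunday = 1 (so Monday = 2), as the slide's example implies.
    """
    week = d.isocalendar()[1]
    day_of_week = d.isoweekday() % 7 + 1   # Sunday -> 1, Monday -> 2, ...
    quarter = (d.month - 1) // 3 + 1
    return (d.day, d.month, d.year, week, day_of_week, quarter)

print(time_dimension_row(date(2010, 1, 25)))   # (25, 1, 2010, 4, 2, 1)
```

Precomputing these attributes once per day lets queries group by week or quarter without any date arithmetic.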

Hierarchy in dimensions

[Figure: a hierarchy tree with one level per row]

Levels, from the most general to the most detailed: ALL, Region, Country, City, Shop

Region: West Europe, East Europe

Country: Portugal, Poland, …, Belarus, Russia, …

City: Lisbon, Krakow, …, Minsk, …, Moscow, …

Shop: Milhares de trivialidades, Tysiąc ciekawostek, Адна тысяча драбніц, Тысяча мелочей, Всё для вас

Multidimensional data model

Facts of the domain are points in an n-dimensional space.

[Figure: a point in the space with axes Location, Time, and Product; the point carries the measure value Cost = 280.7]

Data cube

[Figure, two slides: a data cube with axes Location (Yekaterinburg, Moscow, Chelyabinsk), Time, and Product (Footwear, Cosmetics, Clothing). Each cell holds one measure value, for example 280 or 160; some cells are shown as "?"]

OLAP cube


An OLAP cube is a data cube in which every dimension has an additional ALL value, and the corresponding points of the data space are computed by aggregate function(s).

Aggregate functions

Distributive

count(), sum(), min(), max(), etc.

Algebraic

avg(), stddev(), etc.

Holistic

median(), mode(), etc.
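The practical difference between the three classes is whether an aggregate over the whole data set can be assembled from aggregates over its parts. A small sketch with made-up numbers (the partitions and the helper name are assumptions of this example):

```python
from statistics import median

part1, part2 = [10, 20, 30], [40, 100]   # two partitions of one data set
whole = part1 + part2

# Distributive: the result is obtained from the sub-aggregates directly
# (partial counts are combined by summing them).
assert sum([sum(part1), sum(part2)]) == sum(whole)
assert max([max(part1), max(part2)]) == max(whole)
assert len(part1) + len(part2) == len(whole)

# Algebraic: computed from a fixed number of distributive aggregates;
# avg() needs only (sum, count) per partition.
def avg_from_parts(parts):
    total = sum(sum(p) for p in parts)
    n = sum(len(p) for p in parts)
    return total / n

assert avg_from_parts([part1, part2]) == sum(whole) / len(whole)

# Holistic: no bounded summary per partition is enough in general;
# the median of the partition medians is not the overall median here.
median_of_medians = median([median(part1), median(part2)])
print(median_of_medians, median(whole))   # 45.0 30
```

This is why distributive and algebraic measures are cheap to precompute in a cube, while holistic ones usually require access to the detailed data.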

[Figure sequence, four slides: the data cube with axes Location (Yekaterinburg, Moscow, Chelyabinsk), Time, and Product (Shoes, Cosmetics, Clothing) is extended with an ALL value one dimension at a time, starting with Location. The new cells hold the aggregated values, and in the last step the cell (ALL, ALL, ALL) holds the grand total]

Manipulation with OLAP cube


Slice and dice
  projection and/or restriction

Roll-up (drill-up)
  computing a measure while moving upward in a dimension hierarchy

Drill-down (roll-down)
  computing a measure while moving downward in a dimension hierarchy

Pivot
  changing the order of dimensions

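These operations can be sketched on a cube stored as a dict keyed by (time, location, product). This is a minimal illustration, not a description of how an OLAP server implements them: the 2000 values follow the Time=2000 table used in the slides, the 2001 values and the city-to-region mapping are illustrative, and all function names are hypothetical.

```python
from collections import defaultdict

# Cube cells keyed by (time, location, product).
cube = {
    (2000, "Yekaterinburg", "Clothing"): 180,
    (2000, "Moscow", "Clothing"): 250,
    (2000, "Chelyabinsk", "Clothing"): 120,
    (2000, "Yekaterinburg", "Shoes"): 320,
    (2000, "Moscow", "Shoes"): 170,
    (2000, "Chelyabinsk", "Shoes"): 320,
    (2001, "Moscow", "Clothing"): 300,       # illustrative value
    (2001, "Chelyabinsk", "Clothing"): 180,  # illustrative value
}

def slice_(cube, time):
    """Slice: fix one value of Time and drop that dimension."""
    return {(loc, prod): v for (t, loc, prod), v in cube.items() if t == time}

def dice(cube, times, locations, products):
    """Dice: restrict every dimension to a subset of its values."""
    return {k: v for k, v in cube.items()
            if k[0] in times and k[1] in locations and k[2] in products}

# Assumed City -> Region mapping for the roll-up.
REGION = {"Yekaterinburg": "Ural", "Chelyabinsk": "Ural", "Moscow": "Center"}

def roll_up_location(cube):
    """Roll-up: aggregate (sum) Location from City level to Region level."""
    out = defaultdict(int)
    for (t, loc, prod), v in cube.items():
        out[(t, REGION[loc], prod)] += v
    return dict(out)

def pivot(table):
    """Pivot: swap the two dimensions of a 2-D slice."""
    return {(b, a): v for (a, b), v in table.items()}

print(slice_(cube, 2000)[("Moscow", "Clothing")])
print(roll_up_location(cube)[(2000, "Ural", "Clothing")])
```

Drill-down is the inverse of roll-up and cannot be computed from the aggregated cube alone; it needs the more detailed data.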
Slice

[Figure, two slides: the cube with axes Location, Time, and Product; the slice for Time=2000 is cut out of the cube and shown as a two-dimensional table]

slice for Time=2000

2000        Yekaterinburg  Moscow  Chelyabinsk
Cosmetics   100            240     210
Shoes       320            170     320
Clothing    180            250     120

Pivot

2000        Yekaterinburg  Moscow  Chelyabinsk
Cosmetics   100            240     210
Shoes       320            170     320
Clothing    180            250     120

pivot

2000           Clothing  Shoes  Cosmetics
Yekaterinburg  180       320    100
Moscow         250       170    240
Chelyabinsk    120       320    210

Dice

dice for (Time=2000 or Time=2001) and (Location=Chelyabinsk or Location=Moscow) and (Product=Shoes or Product=Clothing)

[Figure: the full cube and the resulting 2 x 2 x 2 sub-cube with Time in {2000, 2001}, Location in {Moscow, Chelyabinsk}, and Product in {Shoes, Clothing}]

Roll-up

roll-up on Location (from City to Region)

[Figure: the cube before and after the roll-up. The cities Yekaterinburg, Moscow, and Chelyabinsk are replaced by the regions Center (Moscow) and Ural (Yekaterinburg and Chelyabinsk). Each Ural cell is the sum of the two city cells, e.g. 180 + 120 = 300]

Drill-down

drill-down on Time (from Year to Quarter)

[Figure: the Time=2000 slice of the cube is split into four quarter slices. The quarter cells visible in the figure sum to the year cells they refine]

        Yekaterinburg  Moscow  Chelyabinsk
1 qrt.  80             50      60
2 qrt.  40             150     30
3 qrt.  20             25      20
4 qrt.  40             25      10
2000    180            250     120

Model of OLAP queries

[Figure: four dimensions, each with its hierarchy of levels]

Location: Seller, Shop, City, Region

Time: Day, Week, Month, Quarter, Year

Product: Name, Brand, Category

Customer: Name, Category, TargetGroup

Data Warehouse is a result of ETL

[Figure: ETL pipeline. Extraction turns "raw" data into target data, Transforming (data preprocessing) yields cleansed data, and Loading puts it into the Data Warehouse. OLAP and Data Mining run on the warehouse and produce reports and diagrams for data interpretation]

Data Warehouse: a multi-tiered architecture

[Figure: four tiers]

Data Sources: operational databases, other sources

Data Storage: Extract, Transform, Load, and Refresh into the Data Warehouse and Data Marts; Monitor & Integrator; Metadata repository

OLAP Engine: OLAP Servers

Front-End Tools: Analysis, Queries, Reports, Data mining

Metadata Repository


Description of the structure of the data warehouse
  schema, view, dimensions, hierarchies, derived data definition, data mart locations and contents

Operational meta-data
  data lineage (history of migrated data and transformation path)
  currency of data (active, archived, or purged)
  monitoring information (warehouse usage statistics, error reports, audit trails)

Algorithms used for summarization
  measure and dimension definition algorithms
  data granularity, partitions, subject areas, aggregation, summarization, and predefined queries and reports


The mapping from operational environment to the data warehouse
  source databases and their contents, gateway descriptions, data partitions, data extraction, cleaning, transformation rules and defaults, data refresh and purge rules
  security

Data related to system performance
  indices, profiles
  timing and scheduling of refresh

Business metadata
  business terms and definitions
  ownership of data
  charging policies

OLAP Server Architectures


Relational OLAP (ROLAP)
  Uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
  Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services

Multidimensional OLAP (MOLAP)
  Sparse array-based multidimensional storage engine
  Fast indexing to pre-computed summarized data

Hybrid OLAP (HOLAP)
  Low level: relational; high level: multidimensional arrays

Computing OLAP cube in SQL


ROLLUP BY (in standard SQL: GROUP BY ROLLUP (…))
  creates subtotals that roll up from the most detailed level to a grand total, following a specified grouping list
  takes as its argument an ordered list of grouping columns
  calculates the standard aggregate values specified in the GROUP BY clause
  creates progressively higher-level subtotals, moving from right to left through the list of grouping columns
  creates a grand total

CUBE BY (in standard SQL: GROUP BY CUBE (…))
  generates all the subtotals that could be calculated for a data cube with the specified dimensions


ROLLUP BY


-- ROLLUP BY, written as GROUP BY ROLLUP in standard SQL
select Time, Location, Product, sum(Cost) as Profit
from Sales
group by rollup (Time, Location, Product)

is equivalent to

select Time, Location, Product, sum(Cost) as Profit
from Sales
group by Time, Location, Product
union all
select Time, Location, NULL, sum(Cost) as Profit
from Sales
group by Time, Location
union all
select Time, NULL, NULL, sum(Cost) as Profit
from Sales
group by Time
union all
select NULL, NULL, NULL, sum(Cost) as Profit
from Sales

ROLLUP BY


Time Location Product Cost

2000 Chelyabinsk Clothing 100

2000 Chelyabinsk Cosmetics 120

2000 Moscow Clothing 250

2000 Moscow Cosmetics 75

2001 Chelyabinsk Clothing 230

2001 Chelyabinsk Cosmetics 310

2001 Moscow Clothing 170

2001 Moscow Cosmetics 350

ROLLUP BY

-- ROLLUP BY, written as GROUP BY ROLLUP in standard SQL
select Time, Location, Product, sum(Cost) as Profit
from Sales
group by rollup (Time, Location, Product)

Time    Location     Product    Profit
2000    Chelyabinsk  Clothing   100
2000    Chelyabinsk  Cosmetics  120
2000    Chelyabinsk  [NULL]     220
2000    Moscow       Clothing   250
2000    Moscow       Cosmetics  75
2000    Moscow       [NULL]     325
2000    [NULL]       [NULL]     545
2001    Chelyabinsk  Clothing   230
2001    Chelyabinsk  Cosmetics  310
2001    Chelyabinsk  [NULL]     540
2001    Moscow       Clothing   170
2001    Moscow       Cosmetics  350
2001    Moscow       [NULL]     520
2001    [NULL]       [NULL]     1060
[NULL]  [NULL]       [NULL]     1605

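The rollup result can be reproduced outside a DBMS to check the subtotals. The sketch below is a plain-Python emulation, not how a database engine evaluates the query: the function name `rollup` is hypothetical, and None plays the role of NULL. The input rows are the Sales rows listed in the slides.

```python
from collections import defaultdict

# Sales rows (Time, Location, Product, Cost).
SALES = [
    (2000, "Chelyabinsk", "Clothing", 100),
    (2000, "Chelyabinsk", "Cosmetics", 120),
    (2000, "Moscow", "Clothing", 250),
    (2000, "Moscow", "Cosmetics", 75),
    (2001, "Chelyabinsk", "Clothing", 230),
    (2001, "Chelyabinsk", "Cosmetics", 310),
    (2001, "Moscow", "Clothing", 170),
    (2001, "Moscow", "Cosmetics", 350),
]

def rollup(rows, n_dims=3):
    """Sum the measure for every prefix of the ordered dimension list.

    Columns dropped from the right are replaced by None (NULL).
    """
    totals = defaultdict(int)
    for row in rows:
        dims, cost = row[:n_dims], row[n_dims]
        for k in range(n_dims, -1, -1):   # keep the first k columns
            key = dims[:k] + (None,) * (n_dims - k)
            totals[key] += cost
    return dict(totals)

result = rollup(SALES)
print(len(result))                      # 15 rows, as in the table
print(result[(2000, "Moscow", None)])   # 325
print(result[(None, None, None)])       # 1605
```

With n grouping columns, a rollup produces n + 1 grouping levels, which is why the order of the columns in the list matters.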

CUBE BY


Time Location Product Cost

2000 Chelyabinsk Clothing 100

2000 Chelyabinsk Cosmetics 120

2000 Moscow Clothing 250

2000 Moscow Cosmetics 75

2001 Chelyabinsk Clothing 230

2001 Chelyabinsk Cosmetics 310

2001 Moscow Clothing 170

2001 Moscow Cosmetics 350

CUBE BY

-- CUBE BY, written as GROUP BY CUBE in standard SQL
select Time, Location, Product, sum(Cost) as Profit
from Sales
group by cube (Time, Location, Product)

Time    Location     Product    Profit
2000    Chelyabinsk  Clothing   100
2000    Chelyabinsk  Cosmetics  120
2000    Chelyabinsk  [NULL]     220
2000    Moscow       Clothing   250
2000    Moscow       Cosmetics  75
2000    Moscow       [NULL]     325
2000    [NULL]       Clothing   350
2000    [NULL]       Cosmetics  195
2000    [NULL]       [NULL]     545
2001    Chelyabinsk  Clothing   230
2001    Chelyabinsk  Cosmetics  310
2001    Chelyabinsk  [NULL]     540
2001    Moscow       Clothing   170
2001    Moscow       Cosmetics  350
2001    Moscow       [NULL]     520
2001    [NULL]       Clothing   400
2001    [NULL]       Cosmetics  660
2001    [NULL]       [NULL]     1060
[NULL]  Chelyabinsk  Clothing   330
[NULL]  Chelyabinsk  Cosmetics  430
[NULL]  Chelyabinsk  [NULL]     760
[NULL]  Moscow       Clothing   420
[NULL]  Moscow       Cosmetics  425
[NULL]  Moscow       [NULL]     845
[NULL]  [NULL]       Clothing   750
[NULL]  [NULL]       Cosmetics  855
[NULL]  [NULL]       [NULL]     1605

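The cube result can be checked the same way. This sketch emulates the cube operator in plain Python by aggregating over every subset of the grouping columns; the function name `cube` is hypothetical, None stands for NULL (the ALL value), and the input rows are the Sales rows from the slides.

```python
from collections import defaultdict
from itertools import product

# Sales rows (Time, Location, Product, Cost).
SALES = [
    (2000, "Chelyabinsk", "Clothing", 100),
    (2000, "Chelyabinsk", "Cosmetics", 120),
    (2000, "Moscow", "Clothing", 250),
    (2000, "Moscow", "Cosmetics", 75),
    (2001, "Chelyabinsk", "Clothing", 230),
    (2001, "Chelyabinsk", "Cosmetics", 310),
    (2001, "Moscow", "Clothing", 170),
    (2001, "Moscow", "Cosmetics", 350),
]

def cube(rows, n_dims=3):
    """Sum the measure for every subset of the dimensions.

    A dimension left out of the subset is replaced by None (ALL).
    """
    totals = defaultdict(int)
    masks = list(product([True, False], repeat=n_dims))   # 2**n_dims subsets
    for row in rows:
        dims, cost = row[:n_dims], row[n_dims]
        for mask in masks:
            key = tuple(d if keep else None for d, keep in zip(dims, mask))
            totals[key] += cost
    return dict(totals)

result = cube(SALES)
print(len(result))                        # 27 = 3 * 3 * 3 cells
print(result[(2001, None, "Clothing")])   # 400
print(result[(None, "Moscow", None)])     # 845
```

Each dimension here has two values plus ALL, so the full OLAP cube has 3 x 3 x 3 = 27 cells, compared with 15 rows for the rollup.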

Conclusion

A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile, and physically separate collection of data in support of management’s decision-making process.

A data warehouse is based on the multidimensional data model.

There are three basic data warehouse schemas: star, snowflake, and constellation.

An OLAP cube is a data cube in which every dimension has an additional ALL value, and the corresponding points of the data space are computed by aggregate function(s).

OLAP operations: slice and dice, roll-up, drill-down, pivot.

OLAP cube computing in SQL: ROLLUP BY and CUBE BY (GROUP BY ROLLUP and GROUP BY CUBE in standard SQL).