
DWH Concepts


Design and Maintenance of Data Warehouses

ABIS 2002 – Timos Sellis


Timos Sellis
National Technical University of Athens
KDBS Laboratory
http://www.dbnet.ece.ntua.gr/

Many thanks to P. Vassiliadis and A. Tsois

EDBT Summer School - Cargese 2002

Outline

• What’s and Why’s for DW’s
• DW architecture
• DW Schema
• Back End of the DW
• Front End of the DW
• DW Servers
• Metadata Repository
• Conclusions


OLTP

On-line transaction processing (OLTP) is the traditional way of using a database

• Legacy systems: relational, hierarchical, network databases / COBOL applications / …
• Short transactions (read/update few records) with ACID properties
• Normally, only the latest version of the data is stored in the database


DSS & OLAP

Decision support systems - help the executive, manager, analyst make faster and better decisions.

• What were the sales volumes by region and product category for the last year?
• Will a 10% discount increase sales volumes sufficiently?

On-line analytical processing (OLAP) is an element of decision support systems (DSS)


OLTP vs. OLAP

Aspect        OLTP                       OLAP
User          Clerk                      Manager
Function      Day-to-day operations      Decision support
Access        Read/write                 Mostly read
Data          Detailed, up-to-date,      Summarised, historical,
              flat relational            multidimensional
DB size       100 MB - 1 GB              100 GB - 1 TB

Source: Chaudhuri & Dayal, VLDB’96


Data Warehouse

A decision support database that is maintained separately from the organization’s operational database.

• S. Chaudhuri, U. Dayal, VLDB’96 tutorial

A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.

• W.H. Inmon, Building the Data Warehouse, 1992


Reasons for Building Data Warehouses

Semantic Reconciliation
• Dispersed data sources within the same organization
• Different encodings of the same entities
• The DW encompasses the full volume of these data under a single, reconciled schema
• Keeps the history of these data, too


Reasons for Building Data Warehouses

Performance
• OLAP applications need a different organization of data
• Complex OLAP queries would degrade OLTP performance

Availability
• Separation increases availability
• Possibly the only way to query the dispersed data sources


Reasons for Building Data Warehouses

Data Quality
• The validity of source data is not guaranteed (data can be missing, inconsistent, out of date, violating business and database rules…)
• Data errors amount to at least 10% in most data stores
• This can waste 25-40% of resources
• The DW acts as a data cleaning buffer

... and the market is there!


The Market

Estimated sales in millions of dollars [ShTy98] (*estimates are from [Pend00]).

Segment                                      1998     1999     2000     2001     2002   CAGR (%)
RDBMS sales for DW                          900.0   1110.0   1390.0   1750.0   2200.0       25.0
Data Marts                                   92.4    125.0    172.0    243.0    355.0       40.0
ETL tools                                   101.0    125.0    150.0    180.0    210.0       20.1
Data Quality                                 48.0     55.0     64.5     76.0     90.0       17.0
Metadata Management                          35.0     40.0     46.0     53.0     60.0       14.4
OLAP (including implementation services)*    2000     2500     3000     3600     4000       18.9
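The CAGR column can be cross-checked from the first and last year of each row. The sketch below assumes the usual definition of compound annual growth rate over the four yearly steps from 1998 to 2002; the function and variable names are illustrative only.

```python
def cagr(first: float, last: float, periods: int) -> float:
    """Compound annual growth rate between two values `periods` years apart."""
    return (last / first) ** (1.0 / periods) - 1.0

# 1998 and 2002 figures copied from the table (millions of dollars).
market = {
    "RDBMS sales for DW": (900.0, 2200.0),
    "Data Marts": (92.4, 355.0),
    "ETL tools": (101.0, 210.0),
    "Data Quality": (48.0, 90.0),
    "Metadata Management": (35.0, 60.0),
    "OLAP": (2000.0, 4000.0),
}

# 1998 -> 2002 spans four yearly growth periods.
cagr_percent = {
    name: round(100 * cagr(first, last, 4), 1)
    for name, (first, last) in market.items()
}

if __name__ == "__main__":
    for name, pct in cagr_percent.items():
        print(f"{name}: {pct}%")
```

Under that assumption the recomputed percentages agree with the CAGR column of the table to one decimal place.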


Data Warehouse Architecture: A Simple View

[Figure: several Sources feed the Warehouse through an Integration layer; Clients reach the Warehouse through a Query & Analysis layer; Metadata is kept alongside the Warehouse.]


Data Warehouse Architecture

[Figure: overall architecture. Components: Sources, DSA, DW, Data Marts, Metadata Repository, Reporting / OLAP tools. Roles: Administrator (shown twice), Designer, End User. “Quality Issues” are marked at four points of the diagram.]


Two / Three Tier Architecture

• Warehouse database server
  - almost always relational (RDBMS)
• Data Marts / OLAP server
  - Relational OLAP (ROLAP)
  - Multidimensional OLAP (MOLAP)
• Clients
  - Query and reporting tools
  - Analysis tools / Data mining tools


Data Warehouse Architecture

• Enterprise warehouse: collects all information about subjects
  - requires extensive business modeling
  - may take years to design and build
• Data Marts: departmental subsets that focus on selected subjects
• Virtual warehouse: views over operational dbs


How to build the DW: Top-down

• Single integrated enterprise model
• Reduce all sources (and clients, if necessary) to the central model

− Time consuming; labor intensive; slow to produce results
− Increases the risk of the DW project due to late delivery of results
+ Provides a consistent, global view of the enterprise data


How to build the DW: Bottom-up

• Build smaller data marts first
• Progressively combine pairwise

− Fails to provide a global view of the enterprise data
− Possibly increases the risk, since a complete integration might prove impossible late in the project
+ Early delivery of results
+ Less time consuming, less labor intensive


Data Warehouse Back-End

[Figure: the architecture diagram again (Sources, DSA, DW, Data Marts, Metadata Repository, Reporting / OLAP tools, with the Administrator, Designer and End User roles and the four “Quality Issues” marks).]


Design: Global-As-View Integration

• Preintegration. What schemata to integrate and in which order
• Schema Comparison. Determine the correlations among concepts of different schemata and detect possible naming, semantic, structural, … conflicts
• Schema Conforming. Conflict resolution for heterogeneous schemata
• Schema Merging and Restructuring. Production of a single conformed schema


Design: Local-As-View Integration

• Works the other way around.
• The main deliverable is a central conceptual model, produced by interactively examining user needs and existing schemata
• All source and client schemata are expressed in terms of the central data warehouse schema and not the other way around.


DW = Materialized Views?

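[Figure: “Simple View of a DW”. On the source side, S1_PARTSUPP and S2_PARTSUPP are combined by a union (U) into DW.PARTSUPP. On the DW side, Aggregate1 produces V1 (PKEY, DAY, MIN(COST)) and Aggregate2 produces V2 (PKEY, MONTH, AVG(COST)); a TIME table relates DW.PARTSUPP.DATE to DAY.]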


DW ≠ Materialized Views!

[Figure: the same scenario with the back-end workflow made explicit, spanning Sources, DSA and DW. S1_PARTSUPP and S2_PARTSUPP are transferred (FTP1, FTP2) into the DSA as DS.PS_NEW1 and DS.PS_NEW2. DIFF1 and DIFF2 compare each new snapshot with the previous one (DS.PS_OLD1, DS.PS_OLD2) on PKEY and yield DS.PS1 and DS.PS2. The activities shown in the DSA include NotNULL, Add_SPK1 (SUPPKEY=1), Add_SPK2 (SUPPKEY=2), surrogate key assignment SK1 and SK2 (DS.PS1.PKEY or DS.PS2.PKEY with LOOKUP_PS.SKEY, SUPPKEY), $2€ on COST, A2EDate on DATE, AddDate (DATE=SYSDATE) and CheckQTY (QTY>0); rows rejected by an activity are written to a log. The results are unioned (U) into DW.PARTSUPP, which feeds Aggregate1 (V1: PKEY, DAY, MIN(COST)) and Aggregate2 (V2: PKEY, MONTH, AVG(COST)), with TIME relating DW.PARTSUPP.DATE to DAY.]


Operational Processes

• Data extraction, transform & load
  - Originally treated as the ‘refreshment’ problem
  - Requires transforming, cleaning and integrating data from different sources
• Build/refresh derived data and views
• Service queries
• Monitor the warehouse


The Refreshment Problem

Propagate updates on source data to the warehouse

Issues:
• when to refresh
  - on every update
  - periodically
  - refresh policy set by administrator
• how to refresh


Refreshment Techniques

• Full extract from base tables
• Incremental techniques
  - detect changes on base tables
  - snapshots
  - transaction shipping
  - active rules
• logical correctness
• transactional correctness

Currently, in practice we use ETL tools/scripts (see next)…


Data Extraction

• Can take a snapshot or differentials (new/deleted/updated) of source data
• Transfer, encryption and compression are also involved
• Time window and source system overhead involved
• In general, faced with the requirement of minimal changes to the existing configuration of sources
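A differential can be derived by comparing the previous snapshot of a source table with the new one on its key. The following is only a sketch: it assumes snapshots small enough to hold in memory as dictionaries keyed by PKEY, and the (COST, QTY) payload is a hypothetical row layout.

```python
from typing import Dict, Tuple

Row = Tuple[float, int]      # hypothetical (COST, QTY) payload
Snapshot = Dict[int, Row]    # keyed by PKEY

def snapshot_diff(old: Snapshot, new: Snapshot):
    """Return the (inserted, deleted, updated) rows between two snapshots."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserted, deleted, updated

# Illustrative data: row 2 changes its cost, row 3 disappears, row 4 is new.
old = {1: (10.0, 5), 2: (7.5, 3), 3: (4.0, 9)}
new = {1: (10.0, 5), 2: (8.0, 3), 4: (2.5, 1)}
inserted, deleted, updated = snapshot_diff(old, new)
```

Only the three changed rows would then need to be shipped to the warehouse, instead of the full extract.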


Data Transformation

• Schema Reconciliation: conflicts at the schema level (different attributes for the same information)
• Value Identification & Reconciliation: different (same) id’s for same (different) objects (use surrogate keys)
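Surrogate keys are typically handed out through a lookup table that maps each (source, source-local id) pair to a warehouse-wide key. The class below is a minimal in-memory sketch of that idea; its name, its methods and the sample ids are hypothetical, and deciding that two source ids denote the same object is assumed to happen elsewhere.

```python
from typing import Dict, Tuple

class SurrogateKeyLookup:
    """Maps (source, source-local key) to a warehouse-wide surrogate key."""

    def __init__(self) -> None:
        self._table: Dict[Tuple[str, str], int] = {}
        self._next = 1

    def key_for(self, source: str, local_key: str) -> int:
        """Return the surrogate key for a source id, allocating one if needed."""
        pair = (source, local_key)
        if pair not in self._table:
            self._table[pair] = self._next
            self._next += 1
        return self._table[pair]

    def alias(self, source: str, local_key: str, existing: int) -> None:
        """Record that a source id denotes an object that already has a key."""
        self._table[(source, local_key)] = existing

lookup = SurrogateKeyLookup()
# Same id, different objects: id "10" in S1 and id "10" in S2 get distinct keys.
k1 = lookup.key_for("S1", "10")
k2 = lookup.key_for("S2", "10")
# Different ids, same object: S2's "A7" is declared to be S1's "10".
lookup.alias("S2", "A7", k1)
k3 = lookup.key_for("S2", "A7")
```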


Data Cleaning

• Offending Data: duplicates, integrity/business rules/format violations …
• Incompleteness: missing data
• Renicing: esp. addresses


Data Loading

This final stage may still require additional preprocessing: sorting, summarizing, performing computations

Issues:
• huge volumes of data to be loaded
• small time window
• when to build indexes and summary tables
• restart after failure with no loss of data integrity


Loading Techniques

• Cannot use the SQL language interface to update or append data
  - record at a time
  - too slow since it uses random disk I/O
  - can cause the rollback segment or log file to overflow
• Use a batch load utility
  - sort input records on a clustering key
  - sequential I/O is 100 times faster than random I/O
  - build the index at the same time
  - use parallelism to accelerate load operations


Incremental Loading

Use incremental loads during refresh to reduce data volume (e.g. Redbrick)

• insert only updated tuples
• incremental load conflicts with queries
• break into a sequence of shorter transactions
• coordinate this sequence of transactions: must ensure consistency between base and derived tables and indices


Data Warehouse Front-End

[Figure: the architecture diagram again (Sources, DSA, DW, Data Marts, Metadata Repository, Reporting / OLAP tools, with the Administrator, Designer and End User roles and the four “Quality Issues” marks).]


Front End Tools

• Ad hoc query and reporting
  - Example: MS Excel, ProReports
• OLAP: ‘Multidimensional spreadsheet’
  - pivot tables, drill down, roll up, slice, dice
• Data Mining


Basic ideas for OLAP

• Several numeric measures that are analyzed
  - sales, budget, revenue, inventory
• Dimensions
  - contexts in which a measure appears
  - Example: store, product, date information associated with a sale
  - each context is a dimension and the measure is a point in a multi-dimensional world


Basic ideas for OLAP

Nature of Analysis
• aggregation (total sales, percent-to-total)
• comparison (budget vs. expense)
• ranking (top 10)
• access to detailed and aggregate data
• complex criteria specification
• visualization


Basic ideas for OLAP

• Attributes
  - information associated with a dimension
  - example: owner of store, county in which the store is located
• Attribute Hierarchies
  - Attributes of a dimension are often related in a hierarchical way
  - example: street → city → country


Multidimensional Data

Dimensions: Product, Region, Date

[Figure: a cube of sales volume with axes Product, Region and Month]

Hierarchical summarization paths:
• Industry → Category → Product
• Country → Region → City → Office
• Year → Quarter → Month / Week → Day


Operations

• Roll up: summarize data
• Drill down: go from higher level summary to lower level summary or detailed data
• Slice and dice: select and project
• Pivot: re-orient cube


Roll up

Sales volume by quarter, product and store:

Quarter  Products     Store1  Store2
Q1       Electronics  $5,2    $5,6
Q1       Toys         $1,9    $1,4
Q1       Clothing     $2,3    $2,6
Q1       Cosmetics    $1,1    $1,1
Q2       Electronics  $8,9    $7,2
Q2       Toys         $0,75   $0,4
Q2       Clothing     $4,6    $4,6
Q2       Cosmetics    $1,5    $0,5

Rolled up from quarters to the year:

Year  Products     Store1  Store2
1996  Electronics  $14,1   $12,8
1996  Toys         $2,65   $1,8
1996  Clothing     $6,9    $7,2
1996  Cosmetics    $2,6    $1,6


Drill down

Drilling down from the Electronics category of the quarterly sales cube (Q1 and Q2, by product category and store) to individual products:

Quarter  Electronics  Store1  Store2
Q1       VCR          $1,4    $1,4
Q1       Camcorder    $0,6    $0,6
Q1       TV           $2,0    $2,4
Q1       CD player    $1,2    $1,2
Q2       VCR          $2,4    $2,4
Q2       Camcorder    $3,3    $1,3
Q2       TV           $2,2    $2,5
Q2       CD player    $1,0    $1,0


Pivot

Pivoting the quarterly sales cube so that stores label the row groups and quarters the columns:

Store    Products     Q1     Q2
Store 1  Electronics  $5,2   $8,9
Store 1  Toys         $1,9   $0,75
Store 1  Clothing     $2,3   $4,6
Store 1  Cosmetics    $1,1   $1,5
Store 2  Electronics  $5,6   $7,2
Store 2  Toys         $1,4   $0,4
Store 2  Clothing     $2,6   $4,6
Store 2  Cosmetics    $1,1   $0,5


Slice and Dice

Selecting only Electronics and Toys, and only Store1, from the quarterly sales cube:

Quarter  Products     Store1
Q1       Electronics  $5,2
Q1       Toys         $1,9
Q2       Electronics  $8,9
Q2       Toys         $0,75
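The roll up, slice and dice, and pivot operations can be tried out on the sales figures of these examples. The sketch below assumes a cube stored as a plain dictionary keyed by (quarter, product, store); the function names are illustrative, and amounts are held as decimals so that the sums are exact.

```python
from collections import defaultdict
from decimal import Decimal

PRODUCTS = ["Electronics", "Toys", "Clothing", "Cosmetics"]
RAW = {
    ("Q1", "Store1"): ["5.2", "1.9", "2.3", "1.1"],
    ("Q1", "Store2"): ["5.6", "1.4", "2.6", "1.1"],
    ("Q2", "Store1"): ["8.9", "0.75", "4.6", "1.5"],
    ("Q2", "Store2"): ["7.2", "0.4", "4.6", "0.5"],
}
# Base cube: (quarter, product, store) -> sales volume.
cube = {
    (quarter, product, store): Decimal(value)
    for (quarter, store), values in RAW.items()
    for product, value in zip(PRODUCTS, values)
}

def roll_up_quarters(cube):
    """Roll up: sum the quarters away, leaving (product, store) for the year."""
    year = defaultdict(Decimal)
    for (_, product, store), amount in cube.items():
        year[(product, store)] += amount
    return dict(year)

def slice_and_dice(cube, products, stores):
    """Slice and dice: keep only the selected products and stores."""
    return {k: v for k, v in cube.items() if k[1] in products and k[2] in stores}

def pivot(cube):
    """Pivot: re-orient the cube so that store leads and quarter comes last."""
    return {(store, product, quarter): v for (quarter, product, store), v in cube.items()}

year_1996 = roll_up_quarters(cube)
selection = slice_and_dice(cube, {"Electronics", "Toys"}, {"Store1"})
by_store = pivot(cube)
```

Drill down is the inverse direction of roll up and needs the finer-grained data (here, sales per individual product), so it cannot be computed from this cube alone.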


Data Warehouse Server

[Figure: the architecture diagram again (Sources, DSA, DW, Data Marts, Metadata Repository, Reporting / OLAP tools, with the Administrator, Designer and End User roles and the four “Quality Issues” marks).]


Data Warehouse Servers - Outline

• Server Technology: ROLAP & MOLAP
• Indexing Techniques
• Query Processing and Optimization


Database Servers

• Relational and Specialized Relational DBMS
• Relational OLAP (ROLAP) DBMS
• Multidimensional OLAP (MOLAP) DBMS


Relational DBMS

Features that support DSS
• Specialized indexing techniques
• Specialized join and scan methods
• Data partitioning and use of parallelism
• Complex query processing
• Intelligent processing of aggregates
• Extensions to SQL and their processing


ROLAP Servers

• Exploits services of a relational engine effectively
• Key functionality
  - needs aggregation navigation logic
  - ability to generate multi-statement SQL
  - optimize for each individual database backend
• Additional services
  - cost-based query governor
  - design tool for DSS schema
  - performance analysis tool


Database Schemata for DW & ROLAP

• Star Schema
• Snowflake Schema
• Fact Constellation
• Aggregated data


Star Schema

• A star schema consists of one central fact table and several denormalized dimension tables.
• The measures of interest for OLAP are stored in the fact table (e.g. Dollar Amount, Units in the table SALES).
• For each dimension of the multidimensional model there exists a dimension table (e.g. Geography, Product, Time, Account) with all the levels of aggregation and the extra properties of these levels.


Star Schema

SALES (fact table): Geography Code, Time Code, Account Code, Product Code, Dollar Amount, Units

Geography: Geography Code, Region Code, Region Manager, State Code, City Code, .....
Product: Product Code, Product Name, Brand Code, Brand Name, Prod. Line Code, Prod. Line Name
Time: Time Code, Quarter Code, Quarter Name, Month Code, Month Name, Date
Account: Account Code, KeyAccount Code, KeyAccountName, Account Name, Account Type, Account Market

Stanford Technology Group, Inc., 1996
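A cut-down version of this star schema can be exercised with SQLite from Python's standard library. The sketch keeps only a few columns per dimension, omits the Account dimension, names the time dimension `time_dim`, and uses invented sample rows; all of these are simplifying assumptions, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE geography (geography_code INTEGER PRIMARY KEY, region_code TEXT, city_code TEXT);
CREATE TABLE product   (product_code INTEGER PRIMARY KEY, product_name TEXT, brand_name TEXT);
CREATE TABLE time_dim  (time_code INTEGER PRIMARY KEY, quarter_name TEXT, month_name TEXT);
-- Fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales (
    geography_code INTEGER REFERENCES geography,
    time_code      INTEGER REFERENCES time_dim,
    product_code   INTEGER REFERENCES product,
    dollar_amount  INTEGER,
    units          INTEGER
);
INSERT INTO geography VALUES (1, 'South', 'ATH'), (2, 'East', 'NYC');
INSERT INTO product   VALUES (1, 'TV', 'BrandA'), (2, 'VCR', 'BrandB');
INSERT INTO time_dim  VALUES (1, 'Q1', 'Jan'), (2, 'Q2', 'Apr');
INSERT INTO sales VALUES
    (1, 1, 1, 100, 1), (1, 2, 1, 250, 2), (2, 1, 2, 300, 3), (2, 2, 1, 450, 4);
""")

# Star join: the fact table joined with two dimensions, aggregated per brand and quarter.
rows = conn.execute("""
    SELECT p.brand_name, t.quarter_name, SUM(s.dollar_amount)
    FROM sales s
    JOIN product p  ON p.product_code = s.product_code
    JOIN time_dim t ON t.time_code = s.time_code
    GROUP BY p.brand_name, t.quarter_name
    ORDER BY p.brand_name, t.quarter_name
""").fetchall()
```

Because the dimension tables are denormalized, grouping by brand or quarter needs exactly one join per dimension involved.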


Snowflake Schema

• The normalized version of the star schema
• Explicit treatment of dimension hierarchies (each level has its own table)
• Easier to maintain, slower in query answering


Snowflake Schema

SALES (fact table): Postal Code, Time Code, Account Code, Product Code, Dollar Amount, Units

Time: Time Code, Quarter Code, Month Code
  Quarter: Quarter Code, QuarterName
  Month: Month Code, Month Name
Account: Account Code, KeyAccountCode
  Accountattributes: Account Code, AccountName
  KeyAccount: KeyAcc Code, KeyAcc Name
Geography: Postal Code, Region Code, State Code, City Code
  Region: Region Code, Region Mgr
  State: State Code, State Name
  City: City Code, City Name
Product: Product Code, Prod Line Code, Brand Code
  Product: Product Code, ProductName
  Brand: Brand Code, Brand Name
  ProdLine: ProdLineCode, ProdLineName

Stanford Technology Group, Inc., 1996


Fact Constellation

• Multiple fact tables that share many dimension tables
• Example: projected expense and actual expense may share dimension tables


Aggregated Tables

In addition to base fact and dimension tables, a data warehouse keeps aggregated (summary) data for efficiency.

Two approaches:
• store as separate summary fact and dimension tables
• add to the existing base tables


Aggregated Tables

Sales table

RID  City    Amount
1    Athens  $100
2    N.Y.    $300
3    Rome    $120
4    Athens  $250
5    Rome    $180
6    Rome    $65
7    N.Y.    $450

• Separate sum-table: City-dimension sum table

City    Amount
Athens  $350
N.Y.    $750
Rome    $365

• Extend existing base tables: Extended Sales table

RID  City    Amount  Level
1    Athens  $100    NULL
2    N.Y.    $300    NULL
3    Rome    $120    NULL
4    Athens  $250    NULL
5    Rome    $180    NULL
6    Rome    $65     NULL
7    N.Y.    $450    NULL
8    Athens  $350    City
9    N.Y.    $750    City
10   Rome    $365    City
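Both approaches can be reproduced from the Sales table of this example. The sketch below uses plain Python lists and dictionaries in place of database tables; the helper name `city_totals` is illustrative.

```python
from collections import defaultdict

# Sales table from the example: (RID, City, Amount).
sales = [
    (1, "Athens", 100), (2, "N.Y.", 300), (3, "Rome", 120), (4, "Athens", 250),
    (5, "Rome", 180), (6, "Rome", 65), (7, "N.Y.", 450),
]

def city_totals(rows):
    """Sum the amounts per city."""
    totals = defaultdict(int)
    for _, city, amount in rows:
        totals[city] += amount
    return dict(totals)

# Approach 1: a separate summary table keyed by city.
sum_table = city_totals(sales)

# Approach 2: append summary rows to the base table, tagged in a Level column.
extended = [(rid, city, amount, None) for rid, city, amount in sales]
next_rid = len(sales) + 1
for city in sorted(sum_table):
    extended.append((next_rid, city, sum_table[city], "City"))
    next_rid += 1

# Queries over the extended table must filter on Level to avoid double counting.
detail_total = sum(amount for _, _, amount, level in extended if level is None)
```

The second approach saves a table but pushes the burden onto every query, which has to state the level it wants.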


MOLAP Servers

• The storage model is an n-dimensional array
• Very fast in computations and OLAP operations
• Normally they require pre-computation of the available cubes
• Compression of data to save storage space
• Currently: 98% of the market for client tools

SISYPHUS: A Chunk-Based Storage Manager for OLAP Cubes

PhD work of Nikos Karayannidis
National Technical University of Athens (NTUA)


ERATOSTHENES project

• ERATOSTHENES is a specialized database management system for OLAP cubes, which is under development.
• In the context of ERATOSTHENES, a prototype storage manager for OLAP cubes, called SISYPHUS, has been developed.

[Figure: three components: Storage Engine (SISYPHUS), Processing Engine, Presentation Engine]


Why does OLAP pose new requirements for storage management?

• Small response time: good physical clustering + efficient access paths
• Multidimensionality: md-storage structures, address by location
• Hierarchies: access paths, clustering
• Sparseness: not random but according to hierarchies.


Architecture: levels of abstraction in SISYPHUS

[Figure: layered architecture with the following components]
• SSM: record-oriented storage management
• File Manager: bucket-oriented file management
• Buffer Manager: buffer management
• Access Manager: chunk-oriented file management
• Cube Access Methods / OLAP Processing
• Logging/Recovery

Interfaces between the levels: record-oriented access, bucket-oriented access, chunk-oriented access, cell-oriented access


Dimension data encoding

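[Figure: the LOCATION dimension with the levels Country, Region and City. Within each level the members carry order-codes: CountryA = 0; RegionA = 0, RegionB = 1; CityA = 0, CityB = 1, CityC = 2, CityD = 3. A member-code concatenates the order-codes along the hierarchy path, e.g. 0.1.2 (country 0, region 1, city 2).]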


A chunk-oriented file system: the hierarchically chunked cube

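• Use the bucket file system.
• Chunking Method: partition the data space by forming a hierarchy of chunks that is based on the dimension hierarchies.

[Figure: the two dimension hierarchies used for chunking. LOCATION has the levels continent, country, region and city; PRODUCT has the levels category, type and item plus a pseudo level. Member ranges printed on the figure: [0..18], [0..10], [0..4], [0..2] and [0..5], [0..2], [0..2], [0..1].]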


[Figure: D = 0. The whole cube is a single chunk (0,0) spanning LOCATION [0..18] and PRODUCT [0..5].]


[Figure: D = 1. LOCATION is split into [0..5], [6..10], [11..18]; PRODUCT into [0..3], [4..5].]


[Figure: D = 2. LOCATION is split into [0..2], [3..5], [6..10], [11..14], [15..18]; PRODUCT into [0..1], [2..3], [4..5].]


[Figure: D = 3 (Max Depth). LOCATION is split into [0], [1..2], [3], [4..5], [6..7], [8..9], [10], [11], [12..14], [15..16], [17..18]; PRODUCT into [0..1], [2..3], [4..5].]


Chunk Identifiers (chunk-ids)

• Chunk addressing.
• Unique identifier of a chunk within a cube + depicts the hierarchy path of the chunk.
• Interleave the member-codes of the pivot-level members that define a chunk (at any depth).

e.g. D = 2, LOCATION: 2.3, PRODUCT: 1.2 gives the chunk-id 2|1.3|2
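The interleaving of member-codes can be sketched in a few lines. The separators are an assumption taken from the example (a dot between hierarchy levels, a vertical bar between dimensions within a level), and the function names are hypothetical rather than part of SISYPHUS.

```python
from typing import List

def chunk_id(*member_codes: str) -> str:
    """Interleave per-dimension member-codes level by level.

    Levels are separated by '.', dimensions within a level by '|'
    (assumed notation, based on the LOCATION/PRODUCT example).
    """
    paths = [code.split(".") for code in member_codes]
    depth = len(paths[0])
    if any(len(path) != depth for path in paths):
        raise ValueError("all member-codes must have the same depth")
    return ".".join("|".join(path[level] for path in paths) for level in range(depth))

def split_chunk_id(cid: str) -> List[str]:
    """Inverse of chunk_id: recover one member-code per dimension."""
    levels = [part.split("|") for part in cid.split(".")]
    return [".".join(level[d] for level in levels) for d in range(len(levels[0]))]

example = chunk_id("2.3", "1.2")       # LOCATION 2.3, PRODUCT 1.2
recovered = split_chunk_id(example)
```

Since the id can be split back into one hierarchy path per dimension, it identifies the chunk and describes its position in the hierarchies at the same time.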


Accessing the chunks of a cube

• Need some chunk directory.
• Idea: use intermediate depth chunks as directory chunks that will guide us to the data chunks (Dmax + 1)
• Create a chunk-tree.


[Figure: a chunk-tree over LOCATION and PRODUCT. The Root Chunk points to directory chunks at D = 1, D = 2 and D = 3 (Max Depth), which in turn lead to the data chunks at the grain level. Each chunk is labelled with its chunk-id, for example 0|0 at D = 1, 0|0.0|0 and 0|0.1|0 at D = 2, and 0|0.0|0.0|P at D = 3, where P marks the pseudo level of PRODUCT.]


Bucket Organization

• 3 parts: bucket header, directory chunk vector, data chunk vector.
• Main idea: try to store in the same bucket whole families (i.e. sub-trees of chunks)!

A) A single sub-tree
B) Many sub-trees that form a bucket region
C) A single tree of directory chunks (root bucket)
D) A single data chunk


Chunk organization

Implementation data structure: multidimensional arrays:
• Offer data address by-location, native to cubes.
• Enable chunk id exploitation.
• We don’t have to store the chunk ids.
• Are FAST!

Compression schemes:
• Data chunks: allocate only non-empty cells, maintain bitmap.
• Directory chunks: full cell allocation but no allocation for empty sub-trees.


Summary

• Storage management in OLAP
• SISYPHUS storage manager for OLAP
• Chunk-oriented file system:
  - Natively multidimensional and supports hierarchies.
  - Clusters data hierarchically.
  - It is space conservative.
  - Adopts a location-based rather than a content-based data addressing scheme.
• Also: the data-access interface can be used for defining access paths and OLAP operations.


Future Work

• Experimental tests.
• Design/Implementation of algorithms for typical OLAP operations.
• Other research issues:
  - Finding optimal bucket regions
  - Updating interface for common OLAP updating operations.
  - Efficient file organization for dimension data


Data Warehouse Servers - Outline

• Server Technology: ROLAP & MOLAP
• Indexing Techniques
• Query Processing and Optimization


Why specialized indexing

• Join-intensive queries
  - Almost all queries demand joins of the fact table with some dimensions
• Very large tables
  - traditional indexes become too large to be efficient
• Complex queries
  - selections based on complex criteria
• Read-intensive workload


BitMap Indexes

• An alternative representation of the RID-list
• Advantageous for low-cardinality domains
• Represent each row of a table by a bit and the table as a bit vector
• There is a distinct bit vector Bv for each value v of the domain.
• The j-th bit in the vector Bv is set if the j-th row of the table has the value v for the column


BitMap Indexes

• Example: the attribute sex has values M and F. A table of 100 million people needs 2 lists of 100 million bits
• Comparison, join and aggregation operations are reduced to bit arithmetic, with dramatic improvement in processing time
• Significant reduction in space and I/O (30:1)


BitMap Indexes

Base Table

Cust  Region  Rating
C1    N       H
C2    S       M
C3    W       L
C4    W       H
C5    S       L
C6    W       L
C7    N       H

Region Index

RID  N  S  E  W
1    1  0  0  0
2    0  1  0  0
3    0  0  0  1
4    0  0  0  1
5    0  1  0  0
6    0  0  0  1
7    1  0  0  0

Rating Index

RID  H  M  L
1    1  0  0
2    0  1  0
3    0  0  1
4    1  0  0
5    0  0  1
6    0  0  1
7    1  0  0
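The Region and Rating indexes of this example can be built and queried with ordinary integers used as bit vectors. This is a minimal sketch: the function names are illustrative, and real systems store compressed bitmaps rather than Python integers.

```python
from collections import defaultdict

# Base table from the example: (Cust, Region, Rating); the j-th entry has RID j.
base = [
    ("C1", "N", "H"), ("C2", "S", "M"), ("C3", "W", "L"), ("C4", "W", "H"),
    ("C5", "S", "L"), ("C6", "W", "L"), ("C7", "N", "H"),
]

def build_bitmap_index(rows, column):
    """One bit vector per distinct value; bit j-1 is set when row j has that value."""
    index = defaultdict(int)
    for position, row in enumerate(rows):
        index[row[column]] |= 1 << position
    return dict(index)

def rids(bitmap):
    """Decode a bit vector back into a list of 1-based RIDs."""
    return [pos + 1 for pos in range(bitmap.bit_length()) if (bitmap >> pos) & 1]

region_index = build_bitmap_index(base, 1)
rating_index = build_bitmap_index(base, 2)

# Region = 'W' AND Rating = 'L' becomes a single bitwise AND.
west_and_low = rids(region_index["W"] & rating_index["L"])
# Counting qualifying rows needs no access to the base table at all.
north_count = bin(region_index["N"]).count("1")
```

The selection touches only two bit vectors of seven bits each, which is the effect the slides describe as reducing comparisons to bit arithmetic.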


BitMap Indexes

• Works poorly for high-cardinality domains, since the number of vectors increases
• However, compression often gives good performance, since sparsity also increases
• Products that support bitmaps: Model 204, TargetIndex (Redbrick), IQ (Sybase), Oracle


Join Indexes

• Traditional indexes map the value in a column to a list of rows with that value
• Join indexes maintain relationships between the primary key and the foreign keys
• Thus, join indexes relate the values of the dimensions of a star schema to rows in the fact table.
• Join indexes may span multiple dimensions


Join Indexes

Join index for a single dimension:
• Consider a schema with a Sales fact table and two dimensions, city and product
• If there is a join index on city, then for each distinct city the index maintains a list of RIDs of the tuples recording a sale in that city
• Example: the node Athens in the index points to the list of RIDs in the fact table corresponding to transactions (sales) in Athens.

Join indexes can span multiple dimensions:
• the node (Athens, oranges) points to transactions that took place in Athens and correspond to purchases of oranges


Join Indexes

Sales table

RID  City    Amount
1    Athens  $100
2    N.Y.    $300
3    Rome    $120
4    Athens  $250
5    Rome    $180
6    Rome    $65
7    N.Y.    $450

City table

City    Country  Population
Athens  Greece   3.507.000
Rome    Italy    3.033.000
N.Y.    USA      17.953.000

Index on City-Sales

City    RIDs
Athens  1, 4
Rome    3, 5, 6
N.Y.    2, 7
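The City-Sales join index of this example can be reproduced with a dictionary from dimension value to fact-table RIDs. The sketch below assumes in-memory tables and an illustrative function name; a real join index would be a persistent structure maintained by the DBMS.

```python
# Sales fact table: (RID, City, Amount); City table: city -> country.
sales = [
    (1, "Athens", 100), (2, "N.Y.", 300), (3, "Rome", 120), (4, "Athens", 250),
    (5, "Rome", 180), (6, "Rome", 65), (7, "N.Y.", 450),
]
city = {"Athens": "Greece", "Rome": "Italy", "N.Y.": "USA"}

def build_join_index(fact_rows, dimension_keys):
    """For each dimension value, list the RIDs of the fact rows that reference it."""
    index = {key: [] for key in dimension_keys}
    for rid, city_name, _ in fact_rows:
        index[city_name].append(rid)
    return index

join_index = build_join_index(sales, city)

# Total sales in Italy: pick the Italian cities in the dimension table and
# follow their RID lists instead of scanning the whole fact table.
amount_by_rid = {rid: amount for rid, _, amount in sales}
italy_total = sum(
    amount_by_rid[rid]
    for name, country in city.items() if country == "Italy"
    for rid in join_index[name]
)
```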


Data Warehouse Servers - Outline

• Server Technology: ROLAP & MOLAP
• Indexing Techniques
• Query Processing and Optimization


Specialized Join Methods

Traditional systems limit themselves to binary joins
• results in many intermediate tables
• For a query over many dimensions, the optimization time can be substantial


Specialized Join Methods

• StarJoin Algorithm (Redbrick)
  - use join indexes to identify regions of the Cartesian product that are of interest
• Intelligent Scan (Redbrick)
  - take advantage of the “read-only” environment
• Parallel Join Methods


Complex Query Processing

• Extensible optimization frameworks (e.g. Starburst [IBM Almaden])
• Estimation of statistics (histograms, sampling)
• Some of the ideas useful for DSS:
  - Interleaving GroupBy and Join
  - Merging views
  - Propagating selection through views
  - Optimizing nested subqueries


Example of Optimizing Nested Subqueries

Find all employees younger than 35 who earn more than the average of their department

Alternatives:
• Iterate over each employee: (1) find the department of the employee, (2) compute the average salary in the department, (3) check if the employee’s salary is above the average
• Compute the average salary of each department. For each employee, check if his/her salary is above the corresponding average salary
• Find the set of all departments where at least one of the employees is younger than 35. Compute the average salary of only those departments. Repeat the previous step.
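The first and the third alternative can be compared on a toy table. The employee data below is invented for illustration and the function names are hypothetical; the point is only that both strategies return the same answer while the third computes far fewer averages.

```python
from collections import defaultdict

# Hypothetical sample data: (name, age, department, salary).
employees = [
    ("Ann", 30, "R&D", 5000), ("Bob", 50, "R&D", 3000), ("Cid", 28, "R&D", 3500),
    ("Dee", 45, "Sales", 4000), ("Eve", 60, "Sales", 2000),
    ("Fay", 33, "Ops", 2500), ("Gus", 34, "Ops", 2500),
]

def naive(emps):
    """Alternative 1: recompute the department average for every employee."""
    result = []
    for name, age, dept, salary in emps:
        dept_salaries = [s for _, _, d, s in emps if d == dept]
        if age < 35 and salary > sum(dept_salaries) / len(dept_salaries):
            result.append(name)
    return result

def decorrelated(emps):
    """Alternative 3: average only the departments that have an employee under 35."""
    relevant = {dept for _, age, dept, _ in emps if age < 35}
    sums, counts = defaultdict(int), defaultdict(int)
    for _, _, dept, salary in emps:
        if dept in relevant:
            sums[dept] += salary
            counts[dept] += 1
    average = {dept: sums[dept] / counts[dept] for dept in relevant}
    return [n for n, age, dept, s in emps if age < 35 and s > average[dept]]

slow_answer = naive(employees)
fast_answer = decorrelated(employees)
```

In this data the Sales department has no employee under 35, so the third alternative never computes its average.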


Rollup and Cube operators

[Gray et al. 1996]

Rollup operator for nested aggregations
• rollup product, store, city
  - group by product, store, city
  - group by store, city
  - group by city

Cube operator for all possible combinations
• group by product, store, city cube
  - group by each subset of {product, store, city}, independently of the order of columns in the statement
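The grouping sets implied by the two operators can be enumerated directly. The sketch below follows the rollup listing as given here, dropping the leading column at each step; the function names are illustrative. Note that the standard SQL `ROLLUP(a, b, c)` instead drops columns from the right and also includes the grand total.

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """All subsets of the columns, from the full list down to the empty grand total."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]

def rollup_grouping_sets(columns):
    """Nested groupings as listed above: drop the leading column at each step."""
    return [tuple(columns[i:]) for i in range(len(columns))]

columns = ["product", "store", "city"]
cube_sets = cube_grouping_sets(columns)
rollup_sets = rollup_grouping_sets(columns)
```

For three columns the cube yields 2^3 = 8 groupings while the rollup yields 3, and every rollup grouping is also one of the cube groupings.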


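The CUBE operator

Jim Gray, Adam Bosworth, Andrew Layman (Microsoft); Hamid Pirahesh (IBM)

[Figure: “The Data Cube and The Sub-Space Aggregates”. Four panels of increasing dimensionality: Aggregate (a single Sum); Group By with total (Sum by Color: RED, WHITE, BLUE); Cross Tab (Color against Make: Chevy, Ford, with totals By Make, By Color and Sum); and the data cube over Make (CHEVY, FORD), Year (1990 to 1993) and Color (RED, WHITE, BLUE), with the sub-aggregates By Make & Color, By Make & Year, By Color & Year, By Make, By Year, By Color and Sum.]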


Processing Star Queries on Hierarchically-Clustered Fact Tables

Nikos Karayannidis (1), Aris Tsois (1), Timos Sellis (1), Roland Pieringer (2), Volker Markl (4), Frank Ramsak (3), Robert Fenk (3), Klaus Elhardt (2), Rudolf Bayer (5)

(1) I.C.C.S. - N.T.U. Athens, (2) TransAction Software GmbH, (3) FORWISS, (4) IBM Almaden Research Center, (5) T.U. München

Key Points

Star queries are ubiquitous in DW and OLAP
New trend: hierarchically clustered star schemata
New processing framework
New optimization challenges
Implemented in TransBase HyperCube
Tested with a real-world application (up to 40-fold speed-up)


EDITH

EDITH - the European Development on Indexing Techniques for Databases with Multidimensional Hierarchies
Information Society Technologies Programme (IST) - grant No. IST-1999-20722
http://edith.in.tum.de

Motivation – Problem statement

Not just reports! What about ad hoc queries?
OLAP requires efficient processing of ad hoc star queries
Major bottleneck: processing of the star join

  Cartesian product, bitmap indexes, …
  NOT enough: efficiency requires good physical clustering of data


Hierarchical Clustering

A new trend:

  Hierarchical clustering of fact table data through path-based surrogate keys
  Exploitation of multidimensional indexes
  Star join transforms to a multidimensional range query

The overall processing framework of star queries changes radically

Contributions

Present a novel processing framework for star queries over hierarchically clustered data
Discuss optimizations
Realization of our technology in a real system
Evaluation on a real-world application has shown significant speed-ups


Hierarchical Surrogate Keys

Apply hierarchical encoding on each dimension table
System-assigned h-surrogate key:

e.g., oc1(“Greece”)/oc2(“Athens”)/oc3(“Store5”)

Implementation based on underlying physical data structure
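A path-based surrogate key of this kind can be sketched as follows. The slides do not give the actual encoding, so the fixed width of 8 bits per level, the sample hierarchy and the helper names (`build_ordinals`, `hsk`, `prefix_range`) are all assumptions made for illustration. The point of the sketch is that a restriction on a hierarchy prefix (e.g. Greece/Athens) becomes a contiguous key range.

```python
# Hypothetical location hierarchy: country / city / store.
PATHS = [
    ("Greece", "Athens", "Store1"),
    ("Greece", "Athens", "Store5"),
    ("Greece", "Patras", "Store2"),
    ("Italy", "Rome", "Store3"),
]
BITS_PER_LEVEL = 8  # assumed fixed width for each level's ordinal code

def build_ordinals(paths):
    """Assign each member an ordinal code that is unique among its siblings."""
    ordinals = {}    # path prefix -> ordinal code of its last member
    next_child = {}  # parent prefix -> next free ordinal code
    for path in paths:
        for depth in range(1, len(path) + 1):
            prefix = path[:depth]
            if prefix not in ordinals:
                parent = prefix[:-1]
                ordinals[prefix] = next_child.get(parent, 0)
                next_child[parent] = ordinals[prefix] + 1
    return ordinals

def hsk(path, ordinals, levels=3):
    """Concatenate the ordinal codes along the path into one integer key."""
    key = 0
    for depth in range(1, levels + 1):
        code = ordinals[path[:depth]] if depth <= len(path) else 0
        key = (key << BITS_PER_LEVEL) | code
    return key

def prefix_range(prefix, ordinals, levels=3):
    """Smallest and largest possible key of any member below the path prefix."""
    low = hsk(prefix, ordinals, levels)
    free_bits = BITS_PER_LEVEL * (levels - len(prefix))
    return low, low | ((1 << free_bits) - 1)

ORDINALS = build_ordinals(PATHS)
KEYS = {path: hsk(path, ORDINALS) for path in PATHS}
LOW, HIGH = prefix_range(("Greece", "Athens"), ORDINALS)
IN_RANGE = sorted(p for p, k in KEYS.items() if LOW <= k <= HIGH)
```

Because members that share a path prefix receive neighbouring keys, storing the fact table in key order clusters it along the dimension hierarchies.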

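Database Schema

[Figure: star schema. The fact table FT holds the dimension keys d1, d2, …, dN and the measures m1, m2. The dimension tables D1 (h1, h2, h3, f1, f2), D2 (h1, h2, h3, h4), …, DN (h1, h2, f1, f2, f3) each carry an h-surrogate key hsk1, hsk2, …, hskN, which also appears in FT.]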


Star Queries

SELECT {Di.hj}, {Di.fj}, {aggr(…) AS AMj}
FROM FT, D1, …, DN
WHERE FT.d1 = D1.h1 AND …        (star-join conditions)
      LOCPRED({D1}) AND …        (dimension restrictions)
      MPRED({FT.mi})             (measure restrictions)
GROUP BY {Di.hj}, {Di.fj}, {FT.mj}
HAVING <having clause>
ORDER BY <ordering fields>
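As a concrete instance of this template, the sketch below runs a small star query with Python's built-in sqlite3 module. The schema (`sales`, `location`, `date_dim`), the column names and the data are hypothetical and chosen only to show the three kinds of predicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location (loc_id INTEGER PRIMARY KEY, country TEXT, city TEXT, store TEXT);
CREATE TABLE date_dim (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE sales (loc_id INTEGER, date_id INTEGER, amount REAL);
INSERT INTO location VALUES
  (1, 'Greece', 'Athens', 'Store1'),
  (2, 'Greece', 'Athens', 'Store5'),
  (3, 'Italy', 'Rome', 'Store3');
INSERT INTO date_dim VALUES (1, 2001, 12), (2, 2002, 1), (3, 2002, 2);
INSERT INTO sales VALUES
  (1, 1, 100), (1, 2, 40), (2, 2, 60), (2, 3, 5), (3, 2, 70), (1, 3, 30);
""")

STAR_QUERY = """
SELECT l.city, d.month, SUM(s.amount) AS total
FROM sales s, location l, date_dim d
WHERE s.loc_id = l.loc_id AND s.date_id = d.date_id   -- star-join conditions
  AND l.country = 'Greece' AND d.year = 2002           -- dimension restrictions
  AND s.amount >= 10                                   -- measure restriction
GROUP BY l.city, d.month
HAVING SUM(s.amount) > 50
ORDER BY d.month
"""

ROWS = conn.execute(STAR_QUERY).fetchall()
```

Only the Athens sales of January 2002 survive all restrictions and the HAVING clause, so the query returns a single row.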

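The Abstract Processing Plan

[Figure: abstract processing plan in two phases. H-surrogate processing: Create_Range operators over the restricted dimension tables (D1, …, Di) produce the h-surrogate ranges. Main execution phase: MD Range Access on FT, followed by Residual Joins with dimension tables (Dj, …, Dn), Group-Select and Order_By.]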


Optimization Issues

Optimizing h-surrogate processing

  Single tuple retrieval for hierarchical prefix path restrictions
  Exploit composite index on (hm, hm-1, …, h1, hski)

Pregrouping transformation

  Reduces tuples for residual join and speeds up grouping
  Heuristic algorithm based on query syntax

Pre-grouping Transformation

[Figure: two plans for a query grouped by month, store. Original plan: MD Range Access on F, Residual Joins with Date and Location, Group Select by month, store. Transformed plan: MD Range Access on F, Group Select by hsk1, hsk2, Residual Joins with Date and Location, Group Select by month, store.]
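The effect of the transformation can be sketched in Python. This is a simplified illustration and not the authors' algorithm: it assumes a distributive aggregate (SUM), groups on the full h-surrogate values, and uses hypothetical lookup tables `DATE` and `LOCATION` in place of the residual joins.

```python
from collections import defaultdict

# Hypothetical dimension tables keyed by h-surrogate: hsk -> grouping attribute.
DATE = {10: "2002-01", 11: "2002-01", 20: "2002-02"}  # day-level hsk -> month
LOCATION = {1: "Store1", 2: "Store5"}                 # hsk -> store

# Fact tuples returned by the range access: (hsk_date, hsk_location, sales).
FACTS = [(10, 1, 5), (10, 1, 7), (11, 1, 3), (20, 1, 4),
         (10, 2, 8), (11, 2, 2), (11, 2, 1)]

def join_then_group(facts):
    """Plain plan: residual-join every fact tuple, then group by month, store."""
    totals = defaultdict(int)
    joined = 0
    for d, l, sales in facts:
        totals[(DATE[d], LOCATION[l])] += sales
        joined += 1
    return dict(totals), joined

def pregroup_then_join(facts):
    """Transformed plan: group by the h-surrogates first, join the groups only."""
    partial = defaultdict(int)
    for d, l, sales in facts:
        partial[(d, l)] += sales
    totals = defaultdict(int)
    joined = 0
    for (d, l), sales in partial.items():
        totals[(DATE[d], LOCATION[l])] += sales
        joined += 1
    return dict(totals), joined

PLAIN, PLAIN_JOINED = join_then_group(FACTS)
PRE, PRE_JOINED = pregroup_then_join(FACTS)
```

Both plans produce the same totals, but the transformed plan sends 5 pre-grouped tuples instead of 7 fact tuples through the joins; on a real fact table the reduction can be far larger.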


Performance Evaluation

Greek electronic retailer data:

  3 dimensions (1.4M, 27K and 2.5K tuples)
  Fact table: 15.5M tuples (1.5 GB)
  220 ad hoc star queries from a real application

Compare 3 plans: STAR, AEP and OPT
FT selectivity range: 0.0% to 5.0% of FT
Result:

  AEP vs. STAR: 20-fold average speed-up
  OPT vs. STAR: 40-fold average speed-up

Summary

Efficient star query processing is a must in DW and OLAP
New trend: hierarchically clustered star schemata
Presented a novel processing framework for star queries over hierarchically clustered data
Discussed optimization issues
Fully implemented our technology in TransBase
Evaluation with a real-world application has shown significant speed-ups


Future Work

Extensive experimental evaluation
Investigate applicability of our processing framework to other areas
Further optimization issues: reducing the number of produced h-surrogate ranges

Metadata Repository

[Figure: data warehouse architecture with the Metadata Repository at its centre: Sources, DSA, DW and Data Marts, feeding Reporting / OLAP tools for the End User; the Administrators and the Designer interact with the components, and Quality Issues arise at each stage.]


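The Lack of Conceptual Support

[Figure: Information Source (OLTP, Operational Department), Wrapper/Loader, Data Warehouse (Enterprise), Aggregation/Customization, Multidimensional Data Mart (OLAP, Analyst). Source Quality, DW Quality and Mart Quality are attached to the three stores; the links are numbered (1) to (5), with "Observation" and "?" labels on the conceptual side.]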

Conceptual-Logical-Physical

[Figure: three perspectives across source, DW and client. Conceptual Perspective: Operational Department Model (OLTP), Enterprise Model, Client Model (OLAP), with Observation and Aggregation/Customization labels. Logical Perspective: Source Schema, DW Schema, Client Schema. Physical Perspective: Source Data Store, Wrapper, Transportation Agents, DW Data Store, Client Data Store.]


The DWQ Approach

[Figure: grid of Conceptual, Logical and Physical Perspectives against Client, DW and Source Levels, instantiated over three layers (Meta Model Level, Models/Meta Data Level, Real World) connected by "in" links. Alongside: Process Meta Model, Process Model and Processes (with a "uses" link), and Quality Metamodel, Quality Model and Quality Measurements.]

DWQ Repository

The DWQ approach for managing data warehouse quality is organized around an extended, semantically rich metadata repository (prototypically implemented using ConceptBase), which controls all relevant metadata
We have developed meta models for DW architecture, quality, processes and evolution
Metadata can be provided and queried by external tools; via active rules, external tools could even be activated

[Jarke et al., CAiSE98]


DWQ Metadata Framework

[Figure: DWQ metadata framework. Columns: meta level (Meta Model), conceptual level, logical level, physical level. Conceptual level: Enterprise Model, Client Model_1 … Client Model_m, Source Model_1 … Source Model_n. Logical level: DW Schema, Client Schema_1 … Client Schema_m, Source Schema_1 … Source Schema_n, Mediators. Physical level: DW Store, Data Store_1 … Data Store_n, Wrappers, Sources, Interface. Legend: conceptual/logical mapping, physical/logical mapping, conceptual link, logical link, data flow.]

Quality Model: An Adapted GQM Approach

[Figure: adapted Goal-Question-Metric model. Nodes: DW Designers, Decision Maker, DW Administrator; Quality Goal; Quality Query; Quality Factor; Measurement Processes; DW Objects, Processes and Data; Metadata for DW Architecture, Quality and Processes. Link labels: establish, evaluated by, evidence for, defined on.]

[Jarke et al., IS99]


Quality Factors by Perspective

Conceptual Perspective

• Completeness
• Redundancy
• Consistency
• Correctness
• Traceability
of Concepts and Models

Logical Perspective

• Usefulness of schemas
• Correctness of mappings
• Interpretability of schemas

Physical Perspective

• Efficiency
• Interpretability of schemas
• Timeliness of stored data
• Maintainability/Usability of software components

Towards Quality-Oriented DW Design

[Figure: quality-oriented design cycle driven by a Quality Goal. 1. Design: define quality factor types, object types, object instances and properties, metrics and agents; decompose complex objects and iterate; analytically or empirically derive "functions". 2. Evaluation: compute; acquire values for quality factors (current status). 3. Analysis & Improvement: produce a scenario for a goal; produce expected/acceptable values; negotiate; feed values to the quality scenario and play; discover/refine new/old "functions"; take actions. 4. Re-evaluation & evolution.]

[Vassiliadis et al., IS00]


DWQ Methodology: Summary

[Figure: DWQ methodology around the Metadata Repository. Steps: 1. Conceptual Enterprise Model; 2. Conceptual Source Models; 3. Conceptual Client Modeling; 4. Translate aggregates into OLAP operations; 5. Design Optimization; 6. Data Reconciliation. Sources S1 … Sn and clients C1 … Cm (with relations R1, R2, R3) are related to the Enterprise Model and the Materialized Views through conjunctive queries; further labels: User queries, Rewriting of Aggregate Queries, OLTP updates, Refreshment.]

Key Formal Results on Quality Impacts

conceptual: description logic theory and tools for complete reasoning about the relationships between source, enterprise, and client models
conceptual/logical: containment, satisfiability, and rewriting of queries over views with and without aggregates
logical/physical: incremental cost-based optimization of view materializations
physical: detailed impact analysis of replication and refreshment policies


ConceptBase User Interface


DW Quality Example


Metadata Standards

Metadata Coalition

  MetaData Interchange Specification (MDIS)
  Open Information Model (OIM)

OMG (latest development)

  Common Warehouse Metamodel (CWM)

Microsoft Repository

Summary

OLAP - multidimensional data
Drill down, roll up, pivot, slice and dice
Data warehouse architecture
Warehouse operational process

  Loading - Cleaning - Serving (ROLAP/MOLAP)
  Refreshing

Warehouse server requirements
Star and snowflake schemas
Specialized indexes: bitmap and join indexes


Research issues

Data cleaning
  focus on schema inconsistencies

Data warehouse design
  summary tables, indexing

Query processing
  use summary data, statistics management, dynamic optimization

Warehouse management
  resource management, runaway queries
  incremental refresh techniques

References

W. H. Inmon: Building the Data Warehouse (2nd Edition), John Wiley, 1996.
R. Kimball: The Data Warehouse Toolkit, John Wiley, 1996.
H. Garcia-Molina: Data Warehousing Overview, class notes, Stanford University.
S. Chaudhuri & U. Dayal: Data Warehousing and OLAP for Decision Support, VLDB’96 tutorial.
Oracle, IBM, Redbrick, Sybase, Informix, Tandem, Teradata, HP, … web sites.
The DWQ project: http://www.dbnet.ece.ntua.gr/~dwq/