32
DATA WAREHOUSE: DESIGN - 47 Copyright All rights reserved Database and data mining group, Politecnico di Torino Elena Baralis Politecnico di Torino DataBase and Data Mining Group of Politecnico di Torino D B M G Materialized views Elena Baralis Politecnico di Torino DataBase and Data Mining Group of Politecnico di Torino D B M G of Politecnico di Torino f Politecnico di Tori G G G

Politecnico di Torino - dbdmg.polito.it

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 47Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Materialized views

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torinoroup of Politecnico di Torino

GGG

Page 2: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 48Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Materialized views

• Precomputed summaries for the fact table

– explicitly stored in the data warehouse

– provide a performance increase for aggregate queries

v5 = {quarter, region}

v4 = {type, month, region} v3 = {category, month, city}

v2 = {type, date, city}

v1 = {product, date, shop}

From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006

Page 3: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 49Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Materialized views• Defined by SQL statements

• Example: definition of v3

– Starting from base tables or views with higher granularity

group by City, Category, Month

– Aggregation (SUM) on Quantity, Income measures– Reduction of detail in dimensions

City

Month

Category

Month_ID

Month

Year

Category_ID

Category

Department

City_ID

Month_ID

Category_ID

TotalQuantity

TotalIncome

City_ID

City

State

Page 4: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 50Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

{a,c}

{a,d}{b,c}

{b,d}{c} {a}

{b}{d}

{ }

Multidimensional lattice

Materialized views• Materialized views may be exploited for answering several

different queries– not for all aggregation operators

From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006

a

b

c

d

Page 5: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 51Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Materialized view selection

• Huge number of allowed aggregations

– most attribute combinations are eligible

• Selection of the “best” materialized view set

• Cost function minimization

– query execution cost

– view maintainance (update) cost

• Constraints

– available space

– time window for update

– response time

– data freshness

Page 6: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 52Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Materialized view selection

q1

q3

q2

+ = candidate views,

possibly useful to

increase workload

query performance

From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006

{a,c}

{a,d}{b,c}

{b,d}

{c}{a}

{b}{d}

{ }

Multidimensional lattice

Page 7: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 53Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Materialized view selection

Querycost

Update window

Diskspace

Space and time

minimization

q1

q3

q2

From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006

{a,c}

{a,d}{b,c}

{b,d}

{c}{a}

{b}{d}

{ }

Page 8: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 54Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Materialized view selection

q3

q1

q2

Querycost

Updatewindow

Diskspace

Cost

minimization

From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006

{a,c}

{a,d}{b,c}

{b,d}

{c}{a}

{b}{d}

{ }

Page 9: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 55Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Materialized view selection

q3

q1

q2

Querycost

Diskspace

All

constraints

Updatewindow

From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006

{a,c}

{a,d}{b,c}

{b,d}

{c}{a}

{b}{d}

{ }

Page 10: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 56Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Physical design

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torinoroup of Politecnico di Torino

GGG

Page 11: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 57Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Physical design• Workload characteristics

– aggregate queries which require accessing a large fraction of each table

– read-only access

– periodic data refresh, possibly rebuilding physical access structures (indices, views)

• Physical structures– index types different from OLTP

• bitmap index, join index, bitmapped join index, ...

• B+-tree index not appropriate for

– attributes with low cardinality domains

– queries with low selectivity

– materialized views• query optimizer should be able to exploit them

Page 12: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 58Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Physical design

• Optimizer characteristics

– should consider statistics when defining the access

plan (cost based)

– aggregate navigation

• Physical design procedure

– selection of physical structures supporting most

frequent (or most relevant) queries

– selection of structures improving performance of more

than one query

– constraints

• disk space

• available time window for data update

Page 13: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 59Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Physical design

• Tuning

– a posteriori change of physical access structures

– workload monitoring tools are needed

– frequently required for OLAP applications

• Parallelism

– data fragmentation

– query parallelization

• inter-query

• intra-query

– join and group by lend themselves well to parallel

execution

Page 14: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 60Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Index selection

• Indexing dimensions– attributes frequently involved in selection predicates

– if domain cardinality is high, then B-tree index

– if domain cardinality is low, then bitmap index

• Indices for join– indexing only foreign keys in the fact table is rarely

appropriate

– bitmapped join index is suggested (if available)

• Indices for group by– use materialized views

Page 15: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 61Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

ETL Process

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torinoroup of Politecnico di Torino

GGG

Page 16: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 62Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Extraction, Transformation

and Loading (ETL)

• Prepares data to be loaded into the data warehouse– data extraction from (OLTP and external) sources

– data cleaning

– data transformation

– data loading

• Eased by exploiting the staging area

• Performed– when the DW is first loaded

– during periodical DW refresh

Page 17: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 63Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Extraction

• Data acquisition from sources

• Extraction methods

– static: snapshot of operational data

• performed during the first DW population

– incremental: selection of updates that took place after

last extraction

• exploited for periodical DW refresh

• immediate or deferred

• The selection of which data to extract is based on

their quality

Page 18: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 64Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Extraction

• It depends on how operational data is collected

– historical: all modifications are stored for a given time in

the OLTP system

• bank transactions, insurance data

• operationally simple

– partly historical: only a limited number of states is

stored in the OLTP system

• operationally complex

– transient: the OLTP system only keeps the current data

state

• example: stock inventory

• operationally complex

Page 19: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 65Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Incremental extraction

• Application assisted– data modifications are captured by ad hoc application

functions

– requires changing OLTP applications (or APIs for database access)

– increases application load

– hardly avoidable in legacy systems

• Log based– log data is accessed by means of appropriate APIs

– log data format is usually proprietary

– efficient, no interference with application load

Page 20: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 66Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

• Trigger based– triggers capture interesting data modifications

– does not require changing OLTP applications

– increases application load

• Timestamp based– modified records are marked by the (last) modification

timestamp

– requires modifying the OLTP database schema (and applications)

– deferred extraction, may lose intermediate states if data is transient

Incremental extraction

Page 21: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 67Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Comparison of extraction techniques

Static TimestampsAppilcation

assistedTrigger Log

Management of transient or semi-

periodic dataNo Incomplete Complete Complete Complete

Support to file-basedsystems

Yes Yes Yes No Rare

Implementation technique

ToolsTools or internal developments

Internal developments

Tools Tools

Costs of enterprise specific development

None Medium High None None

Use with legacy systems

Yes Difficult Difficult Difficult Yes

Changes to applications

None Likely Likely None None

DBMS-dependentprocedures

Limited Limited Variabile High Limited

Impact on operational system performance

None None Medium Medium None

Complexity of extraction procedures

Low Low High Medium Low

From Devlin, Data warehouse: from architecture to implementation, Addisono-Wesley, 1997

Page 22: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 68Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Incremental extraction

75LuminiBarbera3

45CappelliSangiovese4

150MaioBarolo2

50MalavasiGreco di tufo1

QtyCustomerProductCod du

4/

uc

4/4

Cuct

///4/2010

150MaltoniTrebbiano6

145CappelliSangiovese4

25MaltoniVermentino5

150MaioBarolo2

50MalavasiGreco di tufo1

QtyCustomerProductCod

I150MaltoniTrebbiano6

I25MaltoniVermentino5

U145CappelliSangiovese4

D75LuminiBarbera3

ActionQtyCustomerProductCod

Cusuct

6/4/2010

QtC t

Incremental difference

75LuminiBarbera3

Barbera3 D75Lumini

145CappelliSangiovese4

U145CappelliSangiovese4

45CappelliSangiovese4150MaltoniTrebbiano6

25MaltoniVermentino5

I150MaltoniTrebbiano6

I25MaltoniVermentino5

From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006

Page 23: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 69Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Data cleaning

• Techniques for improving data quality (correctness and consistency)– duplicate data

– missing data

– unexpected use of a field

– impossible or wrong data values

– inconsistency between logically connected data

• Problems due to– data entry errors

– different field formats

– evolving business practices

Page 24: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 70Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Data cleaning

• Each problem is solved by an ad hoc technique

– data dictionary

• appropriate for data entry errors or format errors

• can be exploited only for data domains with limited cardinality

– approximate fusion

• appropriate for detecting duplicates/similar data correlations

– approximate join

– purge/merge problem

– outlier identification, deviations from business rules

• Prevention is the best strategy

– reliable and rigorous OLTP data entry procedures

Page 25: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 71Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Approximate join

(0,n) (1,1)

CUSTOMER

Customer

surname

Customer

address

Customer

name

ORDER

DataOrder_ID

Quantity

CUSTOMER

Customer

surname

Customer

address

Customer

name

Cust_ID

Customer

surname

Customer

address

Data

ORDER

Order_ID

Customer

Code

Quantity

Marketing DB Administration DB

Cust_ID

From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006

• The join operation should be executed based on common fields, not representing the customer identifier

PLACES

Page 26: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 72Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Purge/Merge problem

CUSTOMER

(Roma)

Cutomer_ID

Marketing DB (Milano)Marketing DB (Roma)

CUSTOMER

(Milano)

CUSTOMER

Customer

surname

Customer

address

Customer

name

Customer

surname

Customer

address

Customer

name

Customer

surname

Customer

address

Customer

name

From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006

Cutomer_ID

Cutomer_ID

• Duplicate tuples should be identified and removed

• A criterion is needed to evaluate record similarity

Page 27: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 73Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GGData cleaning and

transformation example

Adapted from Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006

name: Elena

surname: Baralis

address: Corso Duca degli Abruzzi 24

ZIP: 10129

city: Torino

country: Italia

Correction

Elena Baralis

C.so Duca degli Abruzzi 24

20129 Torino (I)

name: Elena

surname: Baralis

address: C.so Duca degli Abruzzi 24

ZIP: 20129

city: Torino

country: I

Normalization

Standardizationname: Elena

surname: Baralis

address: Corso Duca degli Abruzzi 24

ZIP: 20129

city: Torino

country: Italia

Page 28: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 74Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Transformation• Data conversion from operational format to data

warehouse format– requires data integration

• A uniform operational data representation (reconciled schema) is needed

• Two steps– from operational sources to reconciled data in the staging

area• conversion and normalization

• matching

• (possibly) significant data selection

– from reconciled data to the data warehouse• surrogate keys generation

• aggregation computation

Page 29: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 75Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Data warehouse loading

• Update propagation to the data warehouse

• Update order that preserves data integrity

1. dimensions

2. fact tables

3. materialized views and indices

• Limited time window to perform updates

• Transactional properties are needed

– reliability

– atomicity

Page 30: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 76Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Dimension table loading

ID2

attr 3

attr 4

…….

ID3

attr 5

attr 6

…….

ID1

attr 1

attr 2

…….ODS

ID2

attr 1

attr 3

attr 5

attr 6

Identify

updates

New/updated

tuples for DT

Map identifiers

and sur. keys

ID2

Sur. Key SLook-up

table

New/updated tuples

for DT

Load

new/updated

tuples in DT

Staging area

Sur. Key S

attr 1

attr 3

attr 5

attr 6

Dimension Table

Data mart

Sur. Key S

attr 1

attr 3

attr 5

attr 6

From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006

Page 31: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 77Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Fact table loading

ID5

attr 3

attr 4

ID6

attr 5

attr 6

ID4

attr 1

attr 2

ODS

ID4

ID5

ID6

mes 1

mes 3

mes 5

Identify

updates

New/updated tuples

for FT

Map identifiers

and surrogate keys

ID5

Sur.Key S5

Sur key S4

Sur key S5

Sur key S6

mes 1

mes 3

mes 5

mes 6

New/updated tuples

for FT

ID4

Sur.Key S4

ID6

Sur.Key S5

Look-up table

Load

new/updated tuples

in FT

Sur key S4

Sur key S5

Sur key S6

mes 1

mes 3

mes 5

mes 6

Fact Table

Data mart

Staging area

From Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006

Page 32: Politecnico di Torino - dbdmg.polito.it

DATA WAREHOUSE: DESIGN - 78Copyright – All rights reserved

Database and data mining group, Politecnico di Torino

Elena Baralis

Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

roup of Politecnico di Torino

GG

Materialized view loading

Tratto da Golfarelli, Rizzi,”Data warehouse, teoria e pratica della progettazione”, McGraw Hill 2006

{a,b}

{a,b'}{a',b}

{a',b'}{b} {a}

{a'}{b'}

{ }