Page 1: Contents of this slideshow :

Contents of this slideshow:

• What is a data warehouse?

• Multi-dimensional data modelling

• Data warehouse architecture

The hidden slides of this slideshow may be important. However, I will focus on learning by exercises, and therefore rattling off new concepts is often done in hidden slides.

Page 2: Contents of this slideshow :

OLTP versus OLAP

OLTP = On-Line Transaction Processing
OLAP = On-Line Analytical Processing

                   | OLTP                                   | OLAP
users              | clerk, IT professional                 | knowledge worker, decision maker
function           | day-to-day operations                  | decision support
DB design          | application-oriented                   | subject-oriented (business functions)
data               | current, up-to-date, detailed,         | historical, summarized, multidimensional,
                   | flat relational, isolated data         | integrated, consolidated data
usage              | repetitive                             | ad hoc
access             | read/write, index/hash on primary key  | lots of scans
unit of work       | short, simple transaction              | complex query
# records accessed | tens                                   | millions
# users            | thousands                              | hundreds
DB size            | 100 MB-GB                              | 100 GB-TB

Page 3: Contents of this slideshow :

An example of a Datawarehouse:

Fact table:
  Orderdetails: Product#, Order#, Qty, Date#, Salesman#

Dimensions:
  Products: Product#, Product-name, Price
  Orders: Order#, Ordertype
  Salesmen: Salesman#, Salesman-name
  Time: Date#, Date-Name

A star schema data warehouse has a central table (the fact table) surrounded by dimension tables with one-to-many relationships towards the fact table.

The fixed database structure implies that application programs (drilling functions/aggregates) can be generated automatically!
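The star schema above can be sketched as table definitions. This is a minimal illustration in Python with the standard sqlite3 module, not the lecturer's implementation; the snake_case names (product_no, time_dim and so on) are assumed stand-ins for the slide's Product#, Time, etc.

```python
import sqlite3

# Hypothetical snake_case names stand in for the slide's Product#, Order#, ...
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_no INTEGER PRIMARY KEY, product_name TEXT, price REAL);
CREATE TABLE orders   (order_no INTEGER PRIMARY KEY, ordertype TEXT);
CREATE TABLE salesmen (salesman_no INTEGER PRIMARY KEY, salesman_name TEXT);
CREATE TABLE time_dim (date_no INTEGER PRIMARY KEY, date_name TEXT);
-- Fact table: one foreign key per dimension plus the measure qty.
CREATE TABLE orderdetails (
    product_no  INTEGER REFERENCES products(product_no),
    order_no    INTEGER REFERENCES orders(order_no),
    date_no     INTEGER REFERENCES time_dim(date_no),
    salesman_no INTEGER REFERENCES salesmen(salesman_no),
    qty         INTEGER
);
""")

# Each foreign key of the fact table points at exactly one dimension table.
fks = conn.execute("PRAGMA foreign_key_list(orderdetails)").fetchall()
dimension_tables = sorted(row[2] for row in fks)
print(dimension_tables)
```

Because the fact table always relates to its dimensions in the same way, a tool can read this catalogue information and build drilling queries from it.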

Page 4: Contents of this slideshow :

Dimension hierarchies:

A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table:

Fact table:
  Orderdetails: Product#, Order#, Qty, Price

Dimension hierarchy:
  Orders: Order#, Customer#, Date
  Customers: Customer#, Customer-name

In a dimension hierarchy it is possible to aggregate data from the fact table to the different levels of the hierarchy.

Roll-up = aggregate along one or more dimensions.

Drill-down = “de-aggregate” = break an aggregate into its constituents.
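Roll-up and drill-down along the Orders/Customers hierarchy can be sketched as two queries over the same fact rows. The example uses Python's sqlite3 module; the snake_case column names and all data values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_no INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE orders (order_no INTEGER PRIMARY KEY,
                     customer_no INTEGER REFERENCES customers(customer_no),
                     order_date TEXT);
CREATE TABLE orderdetails (product_no INTEGER,
                           order_no INTEGER REFERENCES orders(order_no),
                           qty INTEGER, price REAL);
-- Invented demo data.
INSERT INTO customers VALUES (1, 'Acme'), (2, 'Bolt Ltd');
INSERT INTO orders VALUES (10, 1, '2024-01-05'), (11, 1, '2024-01-09'),
                          (12, 2, '2024-01-09');
INSERT INTO orderdetails VALUES (100, 10, 2, 5.0), (101, 10, 1, 20.0),
                                (100, 11, 4, 5.0), (101, 12, 3, 20.0);
""")

# Drill-down level: turnover per order (lowest level of the hierarchy).
per_order = conn.execute("""
    SELECT o.order_no, SUM(d.qty * d.price)
    FROM orderdetails d JOIN orders o ON o.order_no = d.order_no
    GROUP BY o.order_no ORDER BY o.order_no
""").fetchall()

# Roll-up: the same fact rows aggregated one level higher, per customer.
per_customer = conn.execute("""
    SELECT c.customer_name, SUM(d.qty * d.price)
    FROM orderdetails d
    JOIN orders o ON o.order_no = d.order_no
    JOIN customers c ON c.customer_no = o.customer_no
    GROUP BY c.customer_no, c.customer_name ORDER BY c.customer_no
""").fetchall()
```

Each customer total is the sum of that customer's order totals, which is what makes moving between the two levels safe.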

Page 5: Contents of this slideshow :

Two different types of drilling:

• Drilling in dimension hierarchies

• Drilling between dimensions

[Figure: star schema with the fact table Orderdetails (Product#, Order#, Qty, Date#, Salesman#) and the dimensions Products (Product#, Product-name, Price), Orders (Order#, Ordertype), Salesmen (Salesman#, Salesman-name) and Time (Date#, Date-Name).]

Page 6: Contents of this slideshow :

Which star schemas or data marts can be built by using the illustrated integrated E-commerce/ERP data model?

Which star schema would you recommend to be implemented first?

Entities in the data model:

Location: Location#, Address
UserSession: Session#, IPaddress, #Click, Timestamp
Product: Product#, ProductName, Price
Order: Order#, OrderDate, Balance, State
Order-DetailHistory: Inv-Item#, Order#, Seq#, State, Timestamp
UserAccount: Salesman#, PassWord, Timestamp, #visits, #trans, Ttl-tr-amount
Order-Detail: Product#, Order#, Qty, Price, Timestamp
Shipping: Shipping#, ShipMethod, ShipCharge, State, ShipDate
CreditCard: Card#, HolderName, ExpireDate
Payment: Payment#, Amount, State, Timestamp
InvoiceHistory: Invoice#, Timestamp, State, Notes
Address: Address#, Name, Add1, Add2, City, State, Zip
Invoice: Invoice#, CreationDate
Product-Stock: Product#, Location#, Qty
Customer: Customer#, Credit-Limit, Balance

Relationship labels in the figure: Billing, Shipping

Page 7: Contents of this slideshow :

A galaxy is a set of star schema fact tables with conformed (shared and mutually adapted) dimensions:

Fact tables (the value chain):
  Purchase-orderdetails: Product#, Purchase-order#, Purchase-price, Qty, Date#
  Storage-per-product: Product#, Date#, End-of-day-storage-qty
  Sale-Orderdetails: Product#, Sale-order#, Qty, Discount, Sale-price, Date#

Time dimension hierarchy:
  Day: yy, mm, dd
  Month: yy, mm
  Year: yy

Product dimension hierarchy:
  Products: Product#, Product-name
  Productgroups: Product-group#, Product-group-name
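Conformed dimensions are what allow facts from different stars in a galaxy to be compared. A minimal sketch with Python's sqlite3 module follows; the table and column names are snake_case stand-ins for the slide's names, and the data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_no INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sale_orderdetails (product_no INTEGER, sale_order_no INTEGER,
                                qty INTEGER, sale_price REAL);
CREATE TABLE purchase_orderdetails (product_no INTEGER, purchase_order_no INTEGER,
                                    qty INTEGER, purchase_price REAL);
-- Invented demo data.
INSERT INTO products VALUES (1, 'Screw'), (2, 'Nut');
INSERT INTO sale_orderdetails VALUES (1, 500, 10, 2.0), (1, 501, 5, 2.0),
                                     (2, 500, 4, 3.0);
INSERT INTO purchase_orderdetails VALUES (1, 900, 20, 1.0), (2, 900, 10, 1.5);
""")

# Aggregate each fact table to the shared product level first, then join the
# two aggregates on the conformed key. Joining the raw fact tables directly
# would multiply rows and inflate the sums.
rows = conn.execute("""
    SELECT p.product_name, s.sold, b.bought
    FROM products p
    JOIN (SELECT product_no, SUM(qty * sale_price) AS sold
          FROM sale_orderdetails GROUP BY product_no) s USING (product_no)
    JOIN (SELECT product_no, SUM(qty * purchase_price) AS bought
          FROM purchase_orderdetails GROUP BY product_no) b USING (product_no)
    ORDER BY p.product_name
""").fetchall()
```

The result has one row per product with sales and purchases side by side, which only works because both fact tables use the same Product# values.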

Page 8: Contents of this slideshow :

Conceptual Modeling of Data Warehouses

– Star schema: A fact table in the middle connected to a set of dimension tables.

– Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.

– Galaxy schema: Multiple fact tables share dimension tables (conformed dimensions). It is viewed as a collection of stars and is therefore called a galaxy schema or fact constellation.

Page 9: Contents of this slideshow :

The aggregation level is given by the arguments of the GROUP BY clause:

[Figure: star schema with the fact table Orderdetails (Product#, Order#, Qty, Date#, Salesman#) and the dimensions Products (Product#, Product-name, Price), Orders (Order#, Ordertype), Salesmen (Salesman#, Salesman-name) and Time (Date#, Date-Name).]

SELECT Orderdetails.Product#, SUM(Qty * Price) AS Turnover
FROM Orderdetails
JOIN Products ON Products.Product# = Orderdetails.Product#
GROUP BY Orderdetails.Product#;
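A runnable version of the query, sketched with Python's sqlite3 module. Since Product# is not a legal unquoted identifier in most SQL dialects, the assumed names product_no, order_no and so on are used instead, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_no INTEGER PRIMARY KEY, product_name TEXT, price REAL);
CREATE TABLE orderdetails (product_no INTEGER, order_no INTEGER, qty INTEGER,
                           date_no INTEGER, salesman_no INTEGER);
-- Invented demo data.
INSERT INTO products VALUES (1, 'Screw', 2.0), (2, 'Bolt', 5.0);
INSERT INTO orderdetails VALUES (1, 10, 3, 1, 7), (1, 11, 2, 1, 8), (2, 10, 4, 2, 7);
""")

# The GROUP BY argument (product_no) is the aggregation level.
turnover = conn.execute("""
    SELECT d.product_no, SUM(d.qty * p.price) AS turnover
    FROM orderdetails d
    JOIN products p ON p.product_no = d.product_no
    GROUP BY d.product_no
    ORDER BY d.product_no
""").fetchall()
print(turnover)
```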

Page 10: Contents of this slideshow :

Drill down to the Product per Salesman level:

[Figure: star schema with the fact table Orderdetails (Product#, Order#, Qty, Date#, Salesman#) and the dimensions Products (Product#, Product-name, Price), Orders (Order#, Ordertype), Salesmen (Salesman#, Salesman-name) and Time (Date#, Date-Name).]

SELECT Orderdetails.Product#, Orderdetails.Salesman#, SUM(Qty * Price) AS Turnover
FROM Orderdetails
JOIN Products ON Products.Product# = Orderdetails.Product#
JOIN Salesmen ON Salesmen.Salesman# = Orderdetails.Salesman#
GROUP BY Orderdetails.Product#, Orderdetails.Salesman#;

Where should the Price be stored?
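One consideration for the price question can be tried out in code: what happens to historic turnover when the price list changes. This is only a sketch with Python's sqlite3 module; sold_price is a hypothetical fact table column that is not on the slide, and the data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_no INTEGER PRIMARY KEY, price REAL);
-- sold_price is a hypothetical extra column: the price at the time of sale.
CREATE TABLE orderdetails (product_no INTEGER, salesman_no INTEGER,
                           qty INTEGER, sold_price REAL);
INSERT INTO products VALUES (1, 2.0);
INSERT INTO orderdetails VALUES (1, 7, 3, 2.0), (1, 8, 2, 2.0);
""")

def turnover(price_expr):
    # Drill down to the (product, salesman) level with the given price source.
    return conn.execute(f"""
        SELECT d.product_no, d.salesman_no, SUM(d.qty * {price_expr})
        FROM orderdetails d JOIN products p ON p.product_no = d.product_no
        GROUP BY d.product_no, d.salesman_no
        ORDER BY d.product_no, d.salesman_no
    """).fetchall()

before = turnover("p.price")
conn.execute("UPDATE products SET price = 3.0 WHERE product_no = 1")  # price list changes
from_dimension = turnover("p.price")   # history is re-valued at the new price
from_fact = turnover("d.sold_price")   # history keeps the price actually charged
```

With the price only in the dimension, old sales are silently re-valued after the update; with the price copied into the fact table they are not.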

Page 11: Contents of this slideshow :

Snowflake schema with branches:

A Snowflake schema may have branches in the dimension hierarchies:

Fact table:
  Orderdetails: Product#, Order#, Qty

Customer hierarchy:
  Orders: Order#, Customer#, Date
  Customers: Customer#, Customer-name

Product hierarchy:
  Products: Product#, Product-name, Price, Group#
  Productgroups: Group#, Group-name, Department#
  Departments: Department#, Department-name

Salesman hierarchy:
  Salesmen: Salesman#, Salesman-name, Branch-office#
  Branchoffices: Branch-office#, Branch-office-name, Region#
  Regions: Region#, Region-name

Are Customers related to the Regions?

Page 12: Contents of this slideshow :

Drilling in dimension hierarchies:

[Figure: snowflake schema with the fact table Orderdetails (Product#, Order#, Qty) and the hierarchies Orders/Customers, Products/Product groups/Departments and Salesmen/Branch offices.]

Turnover per branch office:

Branch-office#   Turnover
LA               400,000
SF               200,000

Drilled down to turnover per salesman:

Salesman#   Turnover   Branch-office#
Smith       100,000    LA
Jones       300,000    LA
Adams       200,000    SF
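Drilling within the salesman hierarchy amounts to changing the GROUP BY column from the branch office to the salesman. The sketch below uses Python's sqlite3 module and reproduces the turnover figures of the two tables; the split of Smith's turnover into two fact rows and the column name amount are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salesmen (salesman_no TEXT PRIMARY KEY, branch_office_no TEXT);
-- amount is a hypothetical pre-multiplied Qty * Price per fact row.
CREATE TABLE orderdetails (salesman_no TEXT, amount INTEGER);
INSERT INTO salesmen VALUES ('Smith', 'LA'), ('Jones', 'LA'), ('Adams', 'SF');
INSERT INTO orderdetails VALUES ('Smith', 40000), ('Smith', 60000),
                                ('Jones', 300000), ('Adams', 200000);
""")

def turnover_by(level):
    # level is a column of the salesman hierarchy; it becomes the GROUP BY argument.
    assert level in ("s.salesman_no", "s.branch_office_no")
    return conn.execute(f"""
        SELECT {level}, SUM(d.amount)
        FROM orderdetails d JOIN salesmen s ON s.salesman_no = d.salesman_no
        GROUP BY {level} ORDER BY {level}
    """).fetchall()

per_branch = turnover_by("s.branch_office_no")   # rolled-up level
per_salesman = turnover_by("s.salesman_no")      # drilled-down level
```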

Page 13: Contents of this slideshow :

Drilling between dimension hierarchies:

[Figure: snowflake schema with the fact table Orderdetails (Product#, Order#, Qty) and the hierarchies Orders/Customers, Products and Salesmen/Branch offices.]

Turnover per salesman:

Salesman#   Turnover   Branch-office#
Smith       100,000    LA
Jones       300,000    LA
Adams       200,000    SF

Drilled down across to the Products dimension:

Salesman#   Product-name   Turnover   Branch-office#
Smith       Screw          10,000     LA
Smith       Bolt           30,000     LA
Smith       Nut            60,000     LA
Jones       Screw          20,000     LA
Jones       Nut            40,000     LA
. . .

Page 14: Contents of this slideshow :

Roll up to the top level:

Roll up can be executed by removing one or more arguments from the GROUP BY clause.

Salesman#   Product-name   Turnover   Branch-office#
Smith       Screw          10,000     LA
Smith       Bolt           30,000     LA
Smith       Nut            60,000     LA
Jones       Screw          20,000     LA
Jones       Nut            40,000     LA
. . .

Roll up to the product level:

Product-name   Turnover
Screw          100,000
Bolt           200,000
Nut            300,000

Roll up to the top level:

Top level   Turnover
            600,000
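Removing GROUP BY arguments can be done mechanically, which is one way drilling functions could be generated. The sketch below uses Python's sqlite3 module; the aggregate helper and the flat sales table are hypothetical. It loads only the five rows visible on the slide, so its totals are smaller than the slide's, which include the elided rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (salesman TEXT, product TEXT, turnover INTEGER)")
# Only the five rows visible on the slide; the elided rows are not invented.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Smith", "Screw", 10000), ("Smith", "Bolt", 30000), ("Smith", "Nut", 60000),
    ("Jones", "Screw", 20000), ("Jones", "Nut", 40000),
])

def aggregate(group_by):
    # Build the query from the list of GROUP BY arguments; an empty list
    # means the top level (one grand total).
    allowed = {"salesman", "product"}
    assert set(group_by) <= allowed
    cols = ", ".join(group_by)
    select = f"{cols}, " if cols else ""
    tail = f" GROUP BY {cols} ORDER BY {cols}" if cols else ""
    return conn.execute(f"SELECT {select}SUM(turnover) FROM sales{tail}").fetchall()

detail = aggregate(["salesman", "product"])  # salesman/product level
per_product = aggregate(["product"])         # salesman removed
top_level = aggregate([])                    # all arguments removed
```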

Page 15: Contents of this slideshow :

The aggregation level is given by the arguments of the GROUP BY clause.

[Figure labels: x1, x2, …, xn; aggregated data; non-aggregated data.]

Salesman#   Product-name   Turnover   Branch-office#
Smith       Screw          10,000     LA
Smith       Bolt           30,000     LA
Smith       Nut            60,000     LA
Jones       Screw          20,000     LA
Jones       Nut            40,000     LA
. . .

Fact table:
  Orderdetails: Product#, Order#, Qty, Date#, Salesman#

Dimensions:
  Products: Product#, Product-name, Price
  Orders: Order#, Ordertype
  Salesmen: Salesman#, Salesman-name, Branch-Office#
  Time: Date#, Date-Name

Page 16: Contents of this slideshow :

Dimension hierarchies:

A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table:

Fact table:
  Orderdetails: Product#, Order#, Qty, Price

Dimension hierarchy:
  Orders: Order#, Customer#, Date
  Customers: Customer#, Customer-name

In contrast to a star schema, a snowflake schema may have dimension hierarchies.

Describe the advantages and disadvantages of using dimension hierarchies, that is, a snowflake schema.

Page 17: Contents of this slideshow :

Exercise: The figure illustrates an ER diagram of a car rental company like Hertz or Avis.

Entities in the ER diagram: Customers, Car types, Reservations, Orders, Branch offices, Cars, Garages, Garage services, Pick up, Contracts, Car return.

Question 1. Design a star schema or galaxy for the car rental company.

Question 2. Are there advantages to storing suppliers as customers in, for example, an e-commerce data warehouse?

Page 18: Contents of this slideshow :

Contents of this slideshow:

• What is a data warehouse?

• Multi-dimensional data modelling

• Data warehouse architecture

Page 19: Contents of this slideshow :

Data Models

– Relational models/ER diagrams are used for OLTP databases

– Stars, snowflakes and galaxies are used for OLAP databases

– Cubes are used for OLAP databases

Page 20: Contents of this slideshow :

A star schema DW can be illustrated as a multidimensional cube:

Page 21: Contents of this slideshow :

Describe the advantages and disadvantages of storing data in a cube in memory.

Page 22: Contents of this slideshow :

OLAP Cube operations:

Roll Up = Aggregating to a higher level (for example, from month to year).
Drill Down = Recalculation with more details.
Slice = Selecting a subset by using a fixed dimension value.
Drill Across = Join of fact data across conformed dimensions.
Drill Through = Accessing related data from an OLTP system.
Aggregating = Summarizing data, for example with SUM, AVG or COUNT.
Pivoting = See next slide!

Page 23: Contents of this slideshow :

Pivoting = Transforming SQL query output into a user-friendly two-dimensional screen layout

Fact table view (sale):

prodId   storeId   date   amt
p1       c1        1      12
p2       c1        1      11
p1       c3        1      50
p2       c2        1      8
p1       c1        2      44
p1       c2        2      4

Multi-dimensional cube:

day 1    c1    c2    c3
p1       12          50
p2       11    8

day 2    c1    c2    c3
p1       44    4
p2

All days c1    c2    c3
p1       56    4     50
p2       11    8
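The pivot can be expressed in SQL with one conditional aggregate per output column. The sketch below uses Python's sqlite3 module with the slide's sale rows; the pivot helper and its where parameter are hypothetical names, and the store list c1 to c3 is hard-coded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

def pivot(where="1 = 1"):
    # One output column per store; SUM ignores the NULLs produced by CASE,
    # so a product/store pair without sales stays NULL (None in Python).
    return conn.execute(f"""
        SELECT prodId,
               SUM(CASE WHEN storeId = 'c1' THEN amt END) AS c1,
               SUM(CASE WHEN storeId = 'c2' THEN amt END) AS c2,
               SUM(CASE WHEN storeId = 'c3' THEN amt END) AS c3
        FROM sale WHERE {where}
        GROUP BY prodId ORDER BY prodId
    """).fetchall()

all_days = pivot()            # rolled up over the date dimension
day_1 = pivot("date = 1")     # slice: date fixed to 1
```

Fixing the date before pivoting is a slice of the cube; leaving it out rolls the cube up over all days.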

Page 24: Contents of this slideshow :

OLAP Server Architectures

• Relational OLAP (ROLAP)
  – Uses a relational or extended-relational DBMS to store and manage warehouse data
  – Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  – Greater scalability

• Multidimensional OLAP (MOLAP)
  – Array-based multidimensional storage engine (sparse matrix techniques)
  – Fast indexing to pre-computed summarized data

• Hybrid OLAP (HOLAP)
  – Storage flexibility with a mix of ROLAP and MOLAP

• POLAP: personal HOLAP

Page 25: Contents of this slideshow :

Contents of this slideshow:

• What is a data warehouse?

• Multi-dimensional data modeling

• Data warehouse design/implementation architectures
  1. Kimball has a bottom-up architecture.
  2. Inmon has a top-down architecture.
  3. The Data Vault architecture consists of normalized tables extended with historic data tables. That is, the Data Vault can be used to generate any data mart when needed.

Page 26: Contents of this slideshow :

Kimball’s Bottom-Up DW architecture:

Kimball’s architecture uses conformed dimensions and conformed facts. Conformed dimensions make it possible to drill across from one data mart to another and to present data from different marts in the same view.

Only the conformed data have top-down design.

Page 27: Contents of this slideshow :

Kimball’s Data Warehousing Architecture

[Figure: Kimball’s data warehousing architecture.
ETL side: Data sources → Data Staging Area (Extract, Transform, Load).
Presentation servers: data marts with atomic data and data marts with aggregate-only data, tied together by the Data Warehouse Bus of conformed dimensions and facts.
Query side: Query Services (warehouse browsing, access and security, query management, standard reporting, activity monitor) → desktop data access tools, reporting tools and data mining.
Metadata covers both sides.]

Surrogate key = A sequence number used as primary key.
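A surrogate key can be sketched with an auto-incrementing integer column. The example uses Python's sqlite3 module; the column names product_key and product_code and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        product_key  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        product_code TEXT,                               -- key from the source system
        product_name TEXT
    )
""")
# The warehouse assigns the sequence number; the source key is kept as an attribute.
for code, name in [("SCR-01", "Screw"), ("BLT-07", "Bolt")]:
    conn.execute("INSERT INTO products (product_code, product_name) VALUES (?, ?)",
                 (code, name))
keys = conn.execute(
    "SELECT product_key, product_code FROM products ORDER BY product_key").fetchall()
```

The fact tables then reference product_key, so a change of key format in a source system does not ripple into the warehouse.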

Page 28: Contents of this slideshow :

William Inmon’s data warehouse architecture from 1990 has a top-down design without conformed data:

[Figure labels: Department data warehouses; EDW.]

EDW = Enterprise Data Warehouse.

The DSA (Data Staging Area) where transformation takes place is not illustrated.

Page 29: Contents of this slideshow :

The Data Vault architecture from 2002-2005 has a full top-down design and a bottom-up implementation:

Normalized Data Vault with historic data

In the Data Vault database with historic information, only the Extract activity has taken place. Therefore, the Data Vault architecture does not get bogged down in the design phase.
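The idea of normalized tables extended with historic data tables, from which data marts are generated when needed, can be sketched as follows. This is a simplified illustration with Python's sqlite3 module, not the full Data Vault modelling method; the tables customer and customer_history, the helper customer_dimension and the data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_no INTEGER PRIMARY KEY);
-- History table: every extracted version is appended, never updated.
CREATE TABLE customer_history (customer_no INTEGER REFERENCES customer(customer_no),
                               load_date TEXT, city TEXT);
INSERT INTO customer VALUES (1), (2);
INSERT INTO customer_history VALUES (1, '2024-01-01', 'Odense'),
                                    (1, '2024-03-01', 'Aarhus'),
                                    (2, '2024-02-01', 'Aalborg');
""")

def customer_dimension(as_of):
    # Derive a data mart dimension on demand: the latest version per customer
    # that was loaded on or before the requested date.
    return conn.execute("""
        SELECT h.customer_no, h.city
        FROM customer_history h
        WHERE h.load_date = (SELECT MAX(load_date) FROM customer_history
                             WHERE customer_no = h.customer_no AND load_date <= ?)
        ORDER BY h.customer_no
    """, (as_of,)).fetchall()

february = customer_dimension("2024-02-15")
april = customer_dimension("2024-04-01")
```

Because nothing is overwritten, a dimension can be regenerated for any point in time without redesigning the stored data.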

Page 30: Contents of this slideshow :

Classical Data warehousing

[Figure: Source (OLTP) → step 1 → DSA → step 2 → EDW → step 3 → DM. Activities shown along the pipeline: Extraction, Delta detection, Cleansing, Filter, Transformation, Business rules, Aggregate, Error handling.]

DSA = Data Staging Area
EDW = Enterprise Data Warehouse
DM = Data Mart

Page 31: Contents of this slideshow :

[Figure: two pipelines compared. Upper pipeline: Source (OLTP) → step 1 → DSA. Lower pipeline, labelled Classical Data warehousing: Source (OLTP) → step 1 → DSA → step 2 → EDW → step 3 → DM. Activities shown along both pipelines: Extraction, Delta detection, Cleansing, Filter, Transformation, Business rules, Aggregate, Error handling.]

HANA from SAP is an in-memory data warehouse product.

Page 32: Contents of this slideshow :

[Figure: two pipelines compared. In-memory data warehousing: Source (OLTP) → step 1 → DSA. Classical data warehousing: Source (OLTP) → step 1 → DSA → step 2 → EDW → step 3 → DM. Activities shown along both pipelines: Extraction, Delta detection, Cleansing, Filter, Transformation, Business rules, Aggregate, Error handling.]

How can OLTP and OLAP be integrated in a common in-memory database?

Page 33: Contents of this slideshow :

ER-diagram for a hospital.

Figure 2. Generalized ER diagram of a local hospital database

Entities: Patients (Patient ID, Name, Address), Health records, Patient admits, Symptoms and test results, Diagnoses/diseases, Treatments, Prescriptions, Prescription lines, Patient discharges, Employees, ...

Type entities: Patient admit types, Health record subtypes, Medical test subtypes, Symptom types, Disease types, Treatment types, Patient discharge types, Medicine types, Medicine products, Medicine companies, ...

Basic health records are above the dotted line.

Conceptual hospital entities in general are below the dotted line.

Exercise: Transform the OLTP database to a Star schema DW for a Hospital.

Page 34: Contents of this slideshow :

Exercise: Design an Airline DW.

Entities: Flight routes, Subroutes, Departures, Airports, Tickets, Travel arrangements, Customers, Airline companies.

Page 35: Contents of this slideshow :

Exercise: Design a Hotel DW.

Entities: Hotels, Rooms, Room reservations, Services/tours/car rentals, Check-in periods, Customers, Customer groups, Hotel chains.

Page 36: Contents of this slideshow :

Exercise: Design a data warehouse for a travel agency.

Entities: Customers, Reservations, Orders, Departures/Hotel rooms/Car rentals/etc., Flight routes/Room types/Car types/service types, Product owners.

Relationship labels in the figure: Buyer, Bookings, Traveler.

Page 37: Contents of this slideshow :

End of session

Thank you!

Page 38: Contents of this slideshow :

Inmon versus Kimball’s DW definitions:

Why do you think Kimball’s DW architecture is used most in practice?

Kimball and Inmon agree that OLAP data warehouses do not use the OLTP databases. However, what is the difference between their architectures?

Page 39: Contents of this slideshow :

Dates may be stored in different formats. As an example, the first purchase date may be stored as a foreign key (FK) to a hierarchical time dimension and the birth date as an SQL timestamp. Why are different date formats used in the Customer table?
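The two date formats can be tried side by side. The sketch below uses Python's sqlite3 module and shows only what each format makes easy, not a full answer to the question; all table names, column names and rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (date_no INTEGER PRIMARY KEY, yy INTEGER, mm INTEGER, dd INTEGER);
CREATE TABLE customers (customer_no INTEGER PRIMARY KEY,
                        first_purchase_date_no INTEGER REFERENCES time_dim(date_no),
                        birth_date TEXT);  -- ISO-8601 text, SQLite's date convention
-- Invented demo data.
INSERT INTO time_dim VALUES (1, 2023, 12, 30), (2, 2024, 1, 2);
INSERT INTO customers VALUES (1, 1, '1980-05-17'), (2, 2, '1992-11-03'),
                             (3, 2, '1975-01-20');
""")

# The foreign key lets first purchases roll up along the time hierarchy.
new_customers_per_year = conn.execute("""
    SELECT t.yy, COUNT(*) FROM customers c
    JOIN time_dim t ON t.date_no = c.first_purchase_date_no
    GROUP BY t.yy ORDER BY t.yy
""").fetchall()

# The plain date supports date functions directly, here strftime for the year.
birth_years = [row[0] for row in conn.execute(
    "SELECT CAST(strftime('%Y', birth_date) AS INTEGER) "
    "FROM customers ORDER BY customer_no")]
```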

Page 40: Contents of this slideshow :

OLAP

• OLAP = On-Line Analytical Processing
  – Interactive analysis
  – Exploratory discovery
  – Requires fast response times

• Data can be shown as multidimensional cubes
  – Cubes can have an arbitrary number of dimensions
  – Dimensions have hierarchies, e.g. day-month-year

• OLAP operations
  – Aggregation = Summarizing data, e.g. with SUM, AVG, COUNT…
  – Starting level: (Quarter, Product)
  – Roll Up: less detail, Quarter → Year
  – Drill Down: more detail, Quarter → Month
  – Slice: projection/selection, Year = 1999
  – Drill Across: “join” on shared dimensions
  – Drill Through: looking up the source data in the operational systems
  – Pivoting

Page 41: Contents of this slideshow :

The Business Dimensional Lifecycle = Kimball’s activity model for data warehouse development has three parallel tracks:

[Figure: Project planning → Requirements specification → three parallel tracks → Deployment → Maintenance and growth, with Project management running alongside all activities.
Track 1: Technical architecture design → Product selection and installation.
Track 2: Dimensional modelling → Physical design → ETL design and development.
Track 3: Application specification → Application development.]

Page 42: Contents of this slideshow :

The Data Warehouse Bus Architecture =

An architecture for designing a set of data marts that together make up the company's data warehouse, with shared conformed dimensions and conformed facts.

Data marts = Departmental data warehouses. Kimball uses the word more generally for a single multidimensional database.

Conformed dimensions = Shared dimensions that are adapted to the requirements of several data marts.

Stovepipe = Derogatory term for a data warehouse without conformed dimensions.

Page 43: Contents of this slideshow :

Kimball’s datawarehouse concepts:

[Figure: Kimball’s data warehousing architecture.
ETL side: Data sources (operational systems) → Data Staging Area (Extract, Transform, Load).
Presentation servers: data marts with atomic data and data marts with aggregate-only data, tied together by the Data Warehouse Bus of conformed dimensions and facts.
Query side: Query Services (warehouse browsing, access and security, query management, standard reporting, activity monitor) → desktop data access tools, reporting tools and data mining.
Other labels: Metadata, Data Service Element.]

Inmon does not use the concepts of conformed facts and conformed dimension tables!

Page 44: Contents of this slideshow :

[Figure: Existing databases and systems (OLTP), shown as applications (Appl.) with their databases (DB) → ETL → Data Vault → data marts (DM) → new databases and systems (OLAP): OLAP, visualization, data mining and applications.]

In the DATA VAULT Architecture the data marts are loaded from a normalized database with historic information.

Page 45: Contents of this slideshow :

[Figure: applications (Appl.) with their databases (DB) → ETL → Data Vault → data marts (DM) → OLAP, visualization, data mining and applications.]

In the future, the Data Vault may be the only database and may be stored in memory.

SAP has already developed an in-memory OLAP database called HANA.