Contents of this slideshow:
• What is a datawarehouse?
• Multi-dimensional data modelling
• Data warehouse architecture
The hidden slides of this slideshow may be important. However, I will focus on learning by exercises, and therefore new concepts are often only listed in the hidden slides.
OLTP versus OLAP
OLTP = On-Line Transaction Processing
OLAP = On-Line Analytical Processing

Feature | OLTP | OLAP
users | clerk, IT professional | knowledge worker, decision maker
function | day-to-day operations | decision support
DB design | application-oriented | subject-oriented (business functions)
data | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage | repetitive | ad hoc
access | read/write, index/hash on primary key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100 MB to GB | 100 GB to TB
An example of a data warehouse:
Fact table:
Orderdetails: Product#, Order#, Qty, Date#, Salesman#
Dimensions:
Products: Product#, Product-name, Price
Orders: Order#, Ordertype
Salesmen: Salesman#, Salesman-name
Time: Date#, Date-Name
A star schema data warehouse has a central table (the fact table) surrounded by dimension tables with one-to-many relationships towards the fact table.
The fixed database structure implies that application programs (drilling functions/aggregates) can be generated automatically!
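As a minimal sketch of what "generated automatically" could mean, the Python snippet below builds the example star schema in an in-memory SQLite database and generates a turnover query for any list of fact-table key columns. The column names (product_no, order_no and so on), the function turnover_query and the sample rows are assumptions made for illustration; they are not taken from the slides.

```python
import sqlite3

# Hypothetical column names: the slide's Product#, Order#, ... are written
# as product_no, order_no, ... because '#' is awkward in SQL identifiers.
SCHEMA = """
CREATE TABLE products (product_no INTEGER PRIMARY KEY, product_name TEXT, price REAL);
CREATE TABLE orders   (order_no INTEGER PRIMARY KEY, ordertype TEXT);
CREATE TABLE salesmen (salesman_no INTEGER PRIMARY KEY, salesman_name TEXT);
CREATE TABLE time_dim (date_no INTEGER PRIMARY KEY, date_name TEXT);
CREATE TABLE orderdetails (
    product_no  INTEGER REFERENCES products(product_no),
    order_no    INTEGER REFERENCES orders(order_no),
    date_no     INTEGER REFERENCES time_dim(date_no),
    salesman_no INTEGER REFERENCES salesmen(salesman_no),
    qty         INTEGER
);
"""


def turnover_query(levels):
    """Generate a turnover query grouped by the given fact-table key columns."""
    cols = ", ".join(f"f.{c}" for c in levels)
    select = f"{cols}, " if levels else ""
    group = f" GROUP BY {cols}" if levels else ""
    return (f"SELECT {select}SUM(f.qty * p.price) AS turnover "
            f"FROM orderdetails f JOIN products p ON p.product_no = f.product_no"
            f"{group}")


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Illustrative sample rows only.
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [(1, "Screw", 2.0), (2, "Bolt", 5.0)])
conn.executemany("INSERT INTO salesmen VALUES (?, ?)", [(1, "Smith"), (2, "Jones")])
conn.executemany("INSERT INTO orderdetails VALUES (?, ?, ?, ?, ?)",
                 [(1, 10, 1, 1, 3), (2, 10, 1, 1, 1), (1, 11, 1, 2, 4)])

per_product = conn.execute(turnover_query(["product_no"]) + " ORDER BY 1").fetchall()
total = conn.execute(turnover_query([])).fetchall()
print(per_product, total)
```

Because every star schema has the same shape, the same generator works for any combination of the fact table's key columns; only the list passed to turnover_query changes.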
Dimension hierarchies:
A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table:
Fact table:
Orderdetails: Product#, Order#, Qty, Price
Dimension hierarchy:
Orders: Order#, Customer#, Date
Customers: Customer#, Customer-name
In a dimension hierarchy it is possible to aggregate data from the fact table to the different levels of the hierarchy.
Roll-up = aggregate along one or more dimensions.
Drill-down = “de-aggregate” = break an aggregate into its constituents.
Two different types of drilling:
• Drilling in dimension hierarchies
• Drilling between dimensions
Which star schemas or data marts can be built by using the illustrated integrated E-commerce/ERP data model?
Which star schema would you recommend to be implemented first?
Entities in the data model:
Location: Location#, Address
UserSession: Session#, IPaddress, #Click, Timestamp
Product: Product#, ProductName, Price
Order: Order#, OrderDate, Balance, State
Order-DetailHistory: Inv-Item#, Order#, Seq#, State, Timestamp
UserAccount: Salesman#, PassWord, Timestamp, #visits, #trans, Ttl-tr-amount
Order-Detail: Product#, Order#, Qty, Price, Timestamp
Shipping: Shipping#, ShipMethod, ShipCharge, State, ShipDate
CreditCard: Card#, HolderName, ExpireDate
Payment: Payment#, Amount, State, Timestamp
InvoiceHistory: Invoice#, Timestamp, State, Notes
Address: Address#, Name, Add1, Add2, City, State, Zip
Invoice: Invoice#, CreationDate
Product-Stock: Product#, Location#, Qty
Customer: Customer#, Credit-Limit, Balance
Relationship labels in the figure: Billing, Shipping
A galaxy is a set of star fact tables with conformed (jointly adapted) dimensions:
Fact table Sale-Orderdetails: Product#, Sale-order#, Qty, Discount, Sale-price, Date#
Fact table Storage-per-product: Product#, Date#, End-of-day-storage-qty
Fact table Purchase-orderdetails: Product#, Purchase-order#, Purchase-price, Qty, Date#
Time dimension hierarchy:
Day: yy, mm, dd
Month: yy, mm
Year: yy
Product dimension hierarchy:
Products: Product#, Product-name
Productgroups: Product-group#, Product-group-name
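One thing a shared Products dimension makes possible is putting measures from two fact tables side by side. The sketch below, in Python with SQLite, aggregates a sales fact table and a purchase fact table separately and then joins the two results on the shared product key. Table names, column names and sample rows are hypothetical simplifications of the galaxy above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_no INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sale_orderdetails (product_no INTEGER, qty INTEGER, sale_price INTEGER);
CREATE TABLE purchase_orderdetails (product_no INTEGER, qty INTEGER, purchase_price INTEGER);
INSERT INTO products VALUES (1, 'Screw'), (2, 'Bolt');
INSERT INTO sale_orderdetails VALUES (1, 10, 5), (1, 5, 5), (2, 2, 20);
INSERT INTO purchase_orderdetails VALUES (1, 20, 3), (2, 4, 12);
""")

# Each fact table is aggregated to the product level first; joining the raw
# fact rows directly would multiply rows and inflate the sums.
rows = conn.execute("""
    SELECT p.product_name, s.sold, b.bought
    FROM products p
    JOIN (SELECT product_no, SUM(qty * sale_price) AS sold
          FROM sale_orderdetails GROUP BY product_no) s
      ON s.product_no = p.product_no
    JOIN (SELECT product_no, SUM(qty * purchase_price) AS bought
          FROM purchase_orderdetails GROUP BY product_no) b
      ON b.product_no = p.product_no
    ORDER BY p.product_name
""").fetchall()
print(rows)
```

The join only works because both fact tables use the same product key with the same meaning, which is what a conformed dimension provides.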
The value chain
Conceptual Modeling of Data Warehouses
– Star schema: A fact table in the middle connected to a set of dimension tables
– Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
– Galaxy schema: Multiple fact tables share dimension tables (conformed dimensions), viewed as a collection of stars, therefore called galaxy schema or fact constellation
The aggregating level is the argument to the GROUP BY statement:
SELECT Product#, SUM(Qty*Price) AS Turnover
FROM Orderdetails NATURAL JOIN Products
GROUP BY Product#;
Drill down to the Product per Salesman level:
SELECT Product#, Salesman#, SUM(Qty*Price) AS Turnover
FROM Orderdetails NATURAL JOIN Products NATURAL JOIN Salesmen
GROUP BY Product#, Salesman#;
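The two queries can be tried out with a small runnable sketch. The Python snippet below uses an in-memory SQLite database with hypothetical column names (product_no, salesman_no) and made-up sample rows; the joins are written with explicit ON conditions, which is one possible way to express the joins in the slide queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_no INTEGER PRIMARY KEY, product_name TEXT, price INTEGER);
CREATE TABLE salesmen (salesman_no INTEGER PRIMARY KEY, salesman_name TEXT);
CREATE TABLE orderdetails (product_no INTEGER, salesman_no INTEGER, qty INTEGER);
INSERT INTO products VALUES (1, 'Screw', 10), (2, 'Nut', 20);
INSERT INTO salesmen VALUES (1, 'Smith'), (2, 'Jones');
INSERT INTO orderdetails VALUES (1, 1, 5), (1, 2, 3), (2, 1, 2), (2, 2, 1), (1, 1, 1);
""")

# Aggregation at the Product level.
product_level = conn.execute("""
    SELECT p.product_name, SUM(o.qty * p.price) AS turnover
    FROM orderdetails o
    JOIN products p ON p.product_no = o.product_no
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()

# Drill down: one more GROUP BY argument gives the Product per Salesman level.
drill_down = conn.execute("""
    SELECT p.product_name, s.salesman_name, SUM(o.qty * p.price) AS turnover
    FROM orderdetails o
    JOIN products p ON p.product_no = o.product_no
    JOIN salesmen s ON s.salesman_no = o.salesman_no
    GROUP BY p.product_name, s.salesman_name
    ORDER BY p.product_name, s.salesman_name
""").fetchall()

print(product_level)
print(drill_down)
```

The drill-down rows for one product always add up to that product's row at the higher level, which is a useful sanity check for any generated drilling query.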
Where should the Price be stored?
Snowflake schema with branches:
A Snowflake schema may have branches in the dimension hierarchies:
Fact table:
Orderdetails: Product#, Order#, Qty
Dimension hierarchy:
Orders: Order#, Customer#, Date
Customers: Customer#, Customer-name
Dimension hierarchy:
Products: Product#, Product-name, Price, Group#
Productgroups: Group#, Group-name, Department#
Departments: Department#, Department-name
Snowflake hierarchy:
Salesmen: Salesman#, Salesman-name, Branch-office#
Branchoffices: Branch-office#, Region#
Regions: Region#, Region-name
Are Customers related to the Regions?
Drilling in dimension hierarchies:
Figure: the snowflake schema from the previous slide, without the Regions table.

Branch office level:
Branch-office# | Turnover
LA | 400,000
SF | 200,000

Salesman level (drill-down):
Salesman# | Turnover | Branch-office#
Smith | 100,000 | LA
Jones | 300,000 | LA
Adams | 200,000 | SF
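The two result tables above can be reproduced with one query per hierarchy level. The following Python sketch stores the slide's turnover figures in a small SQLite database; the table layout (a sales table keyed by salesman name), the way Smith's turnover is split into two rows, and the function turnover_by are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salesmen (salesman TEXT PRIMARY KEY, branch_office TEXT);
CREATE TABLE sales (salesman TEXT, amount INTEGER);
INSERT INTO salesmen VALUES ('Smith', 'LA'), ('Jones', 'LA'), ('Adams', 'SF');
INSERT INTO sales VALUES ('Smith', 40000), ('Smith', 60000),
                         ('Jones', 300000), ('Adams', 200000);
""")

QUERIES = {
    # Higher level of the hierarchy: one row per branch office.
    "branch_office": """
        SELECT d.branch_office, SUM(f.amount)
        FROM sales f JOIN salesmen d ON d.salesman = f.salesman
        GROUP BY d.branch_office ORDER BY d.branch_office""",
    # Lower level: one row per salesman, with the parent level kept as a column.
    "salesman": """
        SELECT d.salesman, SUM(f.amount), d.branch_office
        FROM sales f JOIN salesmen d ON d.salesman = f.salesman
        GROUP BY d.salesman, d.branch_office ORDER BY d.salesman""",
}


def turnover_by(level):
    """Return the turnover rows for one level of the salesman hierarchy."""
    if level not in QUERIES:
        raise ValueError(f"unknown hierarchy level: {level}")
    return conn.execute(QUERIES[level]).fetchall()


print(turnover_by("branch_office"))
print(turnover_by("salesman"))
```

Drilling down from the branch office level to the salesman level only changes the GROUP BY columns; the fact rows and the join stay the same.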
Drilling between dimension hierarchies:
Figure: the Orderdetails fact table with the Orders → Customers dimension hierarchy, the Products dimension and the Salesmen → Branch offices snowflake hierarchy.

Salesman level:
Salesman# | Turnover | Branch-office#
Smith | 100,000 | LA
Jones | 300,000 | LA
Adams | 200,000 | SF

Salesman per product level (drilling into the Products dimension):
Salesman# | Product-name | Turnover | Branch-office#
Smith | Screw | 10,000 | LA
Smith | Bolt | 30,000 | LA
Smith | Nut | 60,000 | LA
Jones | Screw | 20,000 | LA
Jones | Nut | 40,000 | LA
. . .
Roll up to the top level:
Roll up can be executed by removing one or more arguments from the GROUP BY clause.

Salesman per product level:
Salesman# | Product-name | Turnover | Branch-office#
Smith | Screw | 10,000 | LA
Smith | Bolt | 30,000 | LA
Smith | Nut | 60,000 | LA
Jones | Screw | 20,000 | LA
Jones | Nut | 40,000 | LA
. . .

Roll up to the product level:
Productname | Turnover
Screw | 100,000
Bolt | 200,000
Nut | 300,000

Roll up to the top level:
Top level turnover: 600,000
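Removing GROUP BY arguments step by step can be sketched in plain Python. The function roll_up below is a hypothetical helper, and it is only fed the five detail rows that are spelled out on the slide; the slide elides further rows, so the totals computed here are smaller than the 100,000, 200,000, 300,000 and 600,000 shown above.

```python
from collections import defaultdict

# Only the detail rows that are written out on the slide.
ROWS = [
    {"salesman": "Smith", "product": "Screw", "turnover": 10_000},
    {"salesman": "Smith", "product": "Bolt", "turnover": 30_000},
    {"salesman": "Smith", "product": "Nut", "turnover": 60_000},
    {"salesman": "Jones", "product": "Screw", "turnover": 20_000},
    {"salesman": "Jones", "product": "Nut", "turnover": 40_000},
]


def roll_up(rows, group_by):
    """Sum turnover per combination of the group_by columns."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[col] for col in group_by)
        totals[key] += row["turnover"]
    return dict(totals)


detail = roll_up(ROWS, ["salesman", "product"])  # GROUP BY salesman, product
by_product = roll_up(ROWS, ["product"])          # salesman removed
top_level = roll_up(ROWS, [])                    # no grouping columns left

print(by_product)
print(top_level)
```

With an empty list of grouping columns every row falls into the same group, which corresponds to the single top-level total.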
The aggregation level is the argument to the GROUP BY clause.
Figure labels: grouping columns x1, x2, …, xn; aggregated data; non-aggregated data.

Salesman# | Productname | Turnover | Branch-office#
Smith | Screw | 10,000 | LA
Smith | Bolt | 30,000 | LA
Smith | Nut | 60,000 | LA
Jones | Screw | 20,000 | LA
Jones | Nut | 40,000 | LA
. . .

Fact table:
Orderdetails: Product#, Order#, Qty, Date#, Salesman#
Dimensions:
Products: Product#, Product-name, Price
Orders: Order#, Ordertype
Salesmen: Salesman#, Salesman-name, Branch-Office#
Time: Date#, Date-Name
Dimension hierarchies:
A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table:
Fact table:
Orderdetails: Product#, Order#, Qty, Price
Dimension hierarchy:
Orders: Order#, Customer#, Date
Customers: Customer#, Customer-name
In contrast to a star schema, a snowflake schema may have dimension hierarchies.
Describe the advantages and disadvantages of using dimension hierarchies (a snowflake schema).
Exercise:
The figure illustrates an ER diagram of a car rental company like Hertz or Avis.
Entities in the figure: Customers, Car types, Reservations, Orders, Branch offices, Cars, Garages, Garage services, Pick up, Contracts, Car return
Question 1. Design a star schema or galaxy for the car rental company.
Question 2. Are there advantages to storing suppliers as customers in, e.g., an e-commerce data warehouse?
Data Models
– Relational models/ER diagrams used for OLTP databases
– Stars, snowflakes and galaxies used for OLAP databases
– Cubes used for OLAP databases
A star schema DW can be illustrated as a multidimensional cube:
Describe the advantages and disadvantages of storing data in a cube in memory.
OLAP cube operations:
Roll Up = Aggregating to a higher level, for example from month to year.
Drill Down = Recalculation with more details.
Slice = Selecting a subset by using a fixed dimension value.
Drill Across = Join of fact data across conformed dimensions.
Drill Through = Accessing related data from an OLTP system.
Aggregating, Pivoting = See next slide.
Pivoting = Transforming SQL query output into a user-friendly two-dimensional screen layout.

Fact table view (sale):
prodId | storeId | date | amt
p1 | c1 | 1 | 12
p2 | c1 | 1 | 11
p1 | c3 | 1 | 50
p2 | c2 | 1 | 8
p1 | c1 | 2 | 44
p1 | c2 | 2 | 4

Multi-dimensional cube:
day 1 | c1 | c2 | c3
p1 | 12 | | 50
p2 | 11 | 8 |

day 2 | c1 | c2 | c3
p1 | 44 | 4 |
p2 | | |

Aggregated over the days:
 | c1 | c2 | c3
p1 | 56 | 4 | 50
p2 | 11 | 8 |
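The pivot above can be sketched in a few lines of Python. The rows are the six sale rows from the slide; the functions pivot and render are hypothetical helpers written for this illustration.

```python
from collections import defaultdict

# (prodId, storeId, date, amt) rows from the fact table view.
SALE = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def pivot(rows, day=None):
    """Sum amt per (product, store); restrict to one day if day is given."""
    cells = defaultdict(int)
    for prod, store, date, amt in rows:
        if day is None or date == day:
            cells[(prod, store)] += amt
    return dict(cells)


def render(cells, prods=("p1", "p2"), stores=("c1", "c2", "c3")):
    """Lay the cells out as a product-by-store grid; empty cells stay blank."""
    lines = ["   " + " ".join(f"{s:>3}" for s in stores)]
    for p in prods:
        row = " ".join(f"{cells.get((p, s), ''):>3}" for s in stores)
        lines.append(f"{p:<3}" + row)
    return "\n".join(lines)


print(render(pivot(SALE, day=1)))
print(render(pivot(SALE)))
```

Calling pivot with a fixed day corresponds to one slice of the cube, while leaving the day out sums over the date dimension.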
OLAP Server Architectures
• Relational OLAP (ROLAP)
– Uses a relational or extended-relational DBMS to store and manage warehouse data
– Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
– Greater scalability
• Multidimensional OLAP (MOLAP)
– Array-based multidimensional storage engine (sparse matrix techniques)
– Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP)
– Storage flexibility with a mix of ROLAP and MOLAP
• POLAP: personal HOLAP
Contents of this slideshow:
• What is a datawarehouse?
• Multi-dimensional data modeling
• Data warehouse design/implementation architectures
1. Kimball has a bottom-up architecture
2. Inmon has a top-down architecture
3. Data Vault architecture is normalized tables extended with historic data tables. That is, the Data Vault can be used to generate any data mart when needed.
Kimball’s Bottom-Up DW architecture:
Kimball’s architecture uses conformed dimensions and conformed facts. Conformed dimensions make it possible to drill across from one data mart to another and to present data from different marts in the same view.
Only the conformed data have top-down design.
Kimball’s Data Warehousing Architecture
Figure components:
– Data sources
– ETL side: Data Staging Area (Extract, Transform, Load)
– Metadata
– Presentation servers: data marts with atomic data and data marts with aggregate-only data, connected by the Data Warehouse Bus (conformed dimensions and facts)
– Query side: Query Services (Warehouse Browsing, Access and Security, Query Management, Standard Reporting, Activity Monitor)
– Desktop Data Access Tools, Reporting Tools, Data mining
Surrogate key (Danish: surrogatnøgle) = A sequence number used as primary key.
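A surrogate key can be sketched with SQLite's auto-incrementing integer primary key. In the Python example below, the table customer_dim, its columns and the helper add_version are hypothetical; the business key customer_no stands for whatever key the source system uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        customer_key  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_no   TEXT NOT NULL,                      -- business key from the source
        customer_name TEXT
    )
""")


def add_version(customer_no, name):
    """Insert a dimension row and return the surrogate key assigned to it."""
    cur = conn.execute(
        "INSERT INTO customer_dim (customer_no, customer_name) VALUES (?, ?)",
        (customer_no, name),
    )
    return cur.lastrowid


k1 = add_version("C-17", "Acme Ltd")
k2 = add_version("C-17", "Acme Inc")   # same business key, new dimension row
k3 = add_version("C-42", "Bolt Bros")
print(k1, k2, k3)
```

Because the sequence number carries no business meaning, the same source customer can appear in several dimension rows, which is one common reason for using surrogate keys when history is kept.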
William Inmon’s data warehouse architecture from 1990 has a top-down design without conformed data:
Figure label: Department data warehouses
EDW = Enterprise Data Warehouse.
The DSA (Data Staging Area) where transformation takes place is not illustrated.
The Data Vault architecture from 2002-2005 has a full top-down design and a bottom-up implementation:
Figure label: Normalized Data Vault with historic data
In the Data Vault database with historic information, only the Extract activity has taken place. Therefore, the Data Vault architecture does not get bogged down in the design phase.
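Classical Data warehousing
Figure: data moves from the Source in three numbered steps (1, 2, 3), involving the stores DSA, EDW and DM. The processing activities shown are Extraction, Delta Detection, Filter, Cleansing, Error handling, Transformation, Business Rules and Aggregate.
DSA = Data Staging Area
EDW = Enterprise Data Warehouse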
Figure: an OLTP source with Extraction into the DSA (step 1), shown together with the classical data warehousing pipeline (Source, DSA, EDW, DM; steps 1 to 3) and the same processing activities.
HANA from SAP is an in-memory data warehouse product.
Figure: the two pipelines above, labelled Classical Data warehousing and In memory Data warehousing.
How can OLTP and OLAP be integrated in a common In Memory database?
ER diagram for a hospital (OLTP).
Figure 2. Generalized ER diagram of a local hospital database
Basic health records are above the dotted line:
Patients (Patient ID, Name, Address), Health records, Patient admits, Symptoms and test results, Diagnoses/diseases, Treatments, Prescriptions, Prescription lines, Patient discharges, Employees, ...
Conceptual hospital entities in general are below the dotted line:
Patient admit type, Health record subtypes, Medical tests subtypes, Symptom types, Disease types, Treatment types, Patient discharges type, Medicine types, Medicine products, Medicine companies, ...
Exercise: Transform the OLTP database to a Star schema DW for a Hospital.
Exercise: Design an Airline DW.
Entities in the figure: Flight routes, Subroutes, Departures, Airports, Tickets, Travel arrangement, Customers, Airline companies
Exercise: Design a Hotel DW.
Entities in the figure: Hotels, Rooms, Room reservations, Services/tours/car rentals, Check-in periods, Customers, Customer groups, Hotel chains
Exercise: Design a data warehouse for a travel agency.
Entities in the figure: Customers, Reservations, Orders, Departures/Hotel rooms/Car rentals/etc., Flight routes/Room types/Car types/service types, Buyer, Bookings, Traveler, Product owners
End of session
Thank you!
Inmon versus Kimball’s DW definitions:
Why do you think Kimball’s DW architecture is used most in practice?
Kimball and Inmon agree that OLAP data warehouses do not use the OLTP databases. However, what is the difference between their architectures?
Dates may be stored in different formats. As an example, the First purchase date may be stored as a FK to a hierarchical time dimension and the Birth date as an SQL timestamp. Why are different date formats used in the Customer table?
OLAP
• OLAP = On-Line Analytical Processing
– Interactive analysis
– Exploratory discovery
– Requires fast response times
• Data can be shown as multidimensional cubes
– Cubes can have an arbitrary number of dimensions
– Dimensions have hierarchies, e.g. day-month-year
• OLAP operations
– Aggregation = summing up data, e.g. with SUM, AVG, COUNT…
– Starting level, (Quarter, Product)
– Roll Up: less detail, Quarter->Year
– Drill Down: more detail, Quarter->Month
– Slice: projection/selection, Year=1999
– Drill Across: “join” on shared dimensions
– Drill Through: looking up the source data in the operational systems
– Pivoting
The Business Dimensional Lifecycle = Kimball’s activity model for data warehouse development has three parallel tracks:
Figure (activities, translated from Danish):
– Project planning
– Requirements specification
– Track 1: Technical architecture design, Product selection and installation
– Track 2: Dimensional modeling, Physical design, ETL design and development
– Track 3: Application specification, Application development
– Deployment
– Maintenance and growth
– Project management
The Data Warehouse Bus Architecture = An architecture for designing a series of data marts that together make up the company's data warehouse, with shared conformed dimensions and conformed facts.
Data marts = Departmental data warehouses. Kimball uses the term more generally for a single multidimensional database.
Conformed dimensions = Shared dimensions that are adapted to the requirements of several data marts.
Stovepipe = Derogatory term for a data warehouse without conformed dimensions.
Kimball’s data warehouse concepts:
Figure components:
– Data sources: Operational systems
– ETL side: Data Staging Area (Extract, Transform, Load), Data Service Element
– Metadata
– Presentation servers: data marts with atomic data and data marts with aggregate-only data, connected by the Data Warehouse Bus (conformed dimensions and facts)
– Query side: Query Services (Warehouse Browsing, Access and Security, Query Management, Standard Reporting, Activity Monitor)
– Desktop Data Access Tools, Reporting Tools, Data mining
Inmon does not use the conformed facts and dimension table concepts!
Figure: existing databases and systems (OLTP), shown as DB and Appl. boxes, are loaded via ETL into the Data Vault. The data marts (DM) serve the new databases and systems (OLAP): OLAP, Visualization, Data mining and Appl.
In the DATA VAULT Architecture the data marts are loaded from a normalized database with historic information.
In the future, the Data Vault may be the only database, stored in memory.
SAP has already developed an in-memory OLAP database called HANA.