Optimizing the Design of your Data Warehouse
Michael Wacey
PAGE 2
Introduction
• Who am I?
– Michael Wacey
– Partner with CSC since 1986
– Architected many large scale data warehouses
• What are we going to discuss today?
– Motivation
– Tools
– Approach
PAGE 3
Motivation
• Data Here, Data There, Data Everywhere
• Solutions
– Architecture – the SAP approach – very hard to sustain, and SAP cannot solve all problems
– Data Integration – requires architecture on the boundaries and infrastructure, lots of infrastructure
– Data Warehouse – Periodically collect the data and bring it all together for one or more purposes – the best bet for the foreseeable future
• Every solution is trying to answer the same question: how do we get this data to fit together?
PAGE 4
Motivation
• Making data fit together is difficult
– Individual countries report numbers in their local (possibly multiple) currencies, and there is no agreed-upon set of conversion rates
– The Trust department would rather not share that data with finance
– The current policy administration system has serious data quality issues; a new system is being built and is scheduled to go online in June 2011, but that date may be in jeopardy
• We need a way to collect and analyze all this knowledge about the data
PAGE 5
Motivation
• A high level view:
• May help with scoping
• Each line could represent many files or feeds
• Each box could represent many applications
[Diagram: boxes labeled Accounting, Sales, Marketing, Data Warehouse, Customer Profitability, and Sales Forecasts, connected by lines]
PAGE 6
Motivation
• A detailed view:
• Too much detail to plan and analyze and understand
• As usual, we have a forest and trees problem
BEGIN
  SELECT ml.sequence, al.sequence, m.msgkey
    INTO mseq, aseq, mkey
    FROM mqseries.levelcodes ml, mqseries.messages m,
         mqseries.appctl a, mqseries.levelcodes al
   WHERE m.msglevel = ml.levelcodekey
     AND m.msgcode = inmsgcode
     AND a.msglevel = al.levelcodekey
     AND a.appctlkey = 1;
  IF sql%ROWCOUNT = 1 THEN
    IF aseq <= mseq THEN
      SELECT statuscodekey INTO sck
        FROM mqseries.statuscodes
       WHERE statuscode = 'n';
      INSERT INTO mqseries.msglog (msglogkey, msgkey, msgdata, msgstatus,
                                   msgsqlcode, msgsqlerrm)
      VALUES (mqseries.msgseq.nextval, mkey, inmsgdata, sck, inmsgsqlcode,
              SUBSTR(inmsgsqlerrm, 1, 4000));
      IF incommit = true THEN
        COMMIT;
      END IF;
    END IF;
  ELSE
PAGE 7
Motivation
• What to do?
– PowerPoint?
– Visio?
– ERwin?
• They all help, but none gives us that right picture
• We need a way to see the problem and the solution at the right level of detail
PAGE 8
Motivation
• What is a data warehouse?
• It includes:
– Sources of data
– Processing of data
– Storage of data – probably multiple times in different structures
– Analytics
• Except for Analytics, these are either static views of data or dynamic processing of data
• ERwin DM is great for the static views of data; we just need a way to capture the dynamic processing
PAGE 9
Motivation
• I have used many techniques to capture the dynamic processing
• Spreadsheets to capture data mapping (who hasn't?)
• Process flow diagrams in PowerPoint and Visio
• UML Diagrams in the IBM and Sparx tools
• They all worked to an extent but were hard to maintain and did not provide a leveling mechanism
PAGE 10
Motivation
• Many years ago, I used Data Flow Diagrams to describe systems under development
• They provided insight into the flow of data and leveling of those processes
• So, I tried that – first in Visio and later in ERwin PM
• The rest of this talk is an approach to using ERwin DM and ERwin PM together to model a Data Warehouse
• I have used this approach for the past five years and have found it very successful
• It provides information to both the user community and developers
PAGE 11
The Tools
• ERwin Data Modeler
– Used to model databases
– Supports both Logical and Physical models
– If needed, I create conceptual models in PowerPoint or Visio
– Each model has to represent one type of database
– But data warehouses use many – Flat Files, Oracle, SQL Server, Cubes, etc.
– I use a UDP (user-defined property) to record the actual type of an Entity/Table
– For example, a table that represents a flat file would have that setting in a UDP
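The UDP idea above can be pictured outside the tool as a property attached to each entity. The sketch below is only an illustration in Python, not ERwin's API: the UDP name "Store Type" and the type assigned to each entity are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical UDP name; the slides do not say what the UDP is called.
STORE_TYPE_UDP = "Store Type"

# Entity names come from the example model; the store types are assumed.
entities = {
    "Commercial Loans": {STORE_TYPE_UDP: "Flat File"},
    "Customer Profitability Staging": {STORE_TYPE_UDP: "Oracle"},
    "Customer Profitability Data Warehouse": {STORE_TYPE_UDP: "Oracle"},
    "Commercial Profitability Cube": {STORE_TYPE_UDP: "Cube"},
}


def group_by_store_type(entity_udps):
    """Group entity names by the value of their store-type UDP."""
    groups = defaultdict(list)
    for entity_name, udps in entity_udps.items():
        # Entities without the UDP are reported rather than silently dropped.
        groups[udps.get(STORE_TYPE_UDP, "Unspecified")].append(entity_name)
    return dict(groups)


by_type = group_by_store_type(entities)
```

Grouping by the property gives a quick inventory of which entities are really tables and which stand for files or cubes.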
PAGE 12
The Tools
• ERwin Process Modeler (ERwin PM)
– Previously called BPwin
– Supports several diagram types
– I have only found the Data Flow diagrams useful for the design of a data warehouse
– The other diagrams could be used in analysis to understand how the data warehouse will be used
PAGE 13
The Tools
• ERwin DM and ERwin PM
• There is a connection between the tools
• I have not used it extensively
PAGE 14
The Tools
• Other Tools
– These are minor but needed
– PDF Viewer
– Microsoft Excel
– Microsoft Word
PAGE 15
The Approach
• So, we have two tools to design a data warehouse
• ERwin DM will be used to design and document static data stores
• ERwin PM will be used to design the processing
• Let's take a look at an example and then discuss how it works
PAGE 16
The Approach
• Start in ERwin PM
• Create a new model that is a data flow model
• First we will create a context model
• This will provide a view of the sources and uses of data
• On the left side, the sources of data are listed – using the external entity symbol
– Sources can be Systems, Databases, People, etc.
• On the right hand side, the uses of data are listed – using the external entity symbol
– Uses can be reports, cubes, analytics, data feeds, etc.
PAGE 17
The Approach
[Context diagram – Node A-0, Title: Customer Profitability]
• Central process: A0 Customer Profitability
• External entities: E1 Allocation Factors, E2 Demand Deposit Accounts, E3 Consumer Loans, E4 Mortgages, E5 Commercial Loans, E6 Treasury, E7 Trust Accounts, E8 Organization, E9 General Ledger, E11 Exception Report, E12 Balancing Report, E13 Commercial Customer Analytics, E14 Retail Customer Analytics
• Data flows: Allocation Factors, Consumer Loan Data, Commercial Loan Data, Demand Deposit Data, Mortgage Data, Treasury Data, Trust Data, Organization Data, General Ledger Data, Exception Report Data, Balancing Report Data, Commercial Customer Data, Retail Customer Data
PAGE 18
The Approach
• The Context Diagram is a good start
• It sets the scope
• But does not provide any details about what is going to be done
• This comes in the next diagram – The details of the central process
PAGE 19
The Approach
[Data flow diagram – Node A0, Title: Customer Profitability]
• Activities: A1 Sourcing, A2 Customer Profitability Calculation, A3 Exception Output, A4 Commercial BI, A5 Retail BI, A6 Balance Input and Output
• Data stores: D1 Customer Profitability Staging, D2 Customer Profitability Data Warehouse, D3 Exceptions, D4 Balancing Values
• Data flows: Allocation Factors, Consumer Loan Data, Commercial Loan Data, Demand Deposit Data, Mortgage Data, Treasury Data, Trust Data, Organization Data, General Ledger Data, Validated Dimension Data, Validated Fact Data, Dimension Data for Calculation, Fact Data for Calculation, Customer Profitability Data, Source Exceptions, Calculation Exceptions, Exception Report Data, Input Balance Values, Calculation Balance Values, Balancing Report Data, Commercial BI Data, Commercial Balancing Data, Commercial Customer Data, Retail BI Data, Retail Balancing Data, Retail Customer Data
PAGE 20
The Approach
• This level one diagram shows all the key components of the solution.
• There is no magic formula for what should be included here
• There needs to be at least some sort of sourcing, processing, and display/output activity
• In this case, there is one sourcing activity, one calculation activity, and four output activities
• Each can be broken down into more detail
• Let's look at the Commercial BI activity
PAGE 21
The Approach
[Data flow diagram – Node A4, Title: Commercial BI]
• Activities: A4.1 Load Commercial Cube, A4.3 Cube Provider, A4.6 Commercial Profitability Reporting
• Data store: D16 Commercial Profitability Cube
• Data flows: Commercial BI Data, Data for Cube In, Data for Cube Out, Data for Reporting, Commercial Balancing Data, Commercial Customer Data
PAGE 22
The Approach
• This decomposition can continue until you are comfortable
• I try to get to the point where one developer can implement it in one module
• At this point, we will have a series of diagrams that show the flow of data through the system
• The diagrams contain:
– Activities
– Data Stores (note that a single data store can be used on multiple diagrams)
– Data Flows
– External Entities
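The leveling described above can be sketched as a tree of activities in which each activity may decompose into child activities. This is a minimal illustration in Python, not how ERwin PM stores its models. The activity numbers and names are taken from the example diagrams; data stores, data flows, and external entities are left out to keep the sketch short.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Activity:
    number: str                      # hierarchical number, e.g. "A4" or "A4.1"
    name: str
    children: list[Activity] = field(default_factory=list)


def leaf_activities(activity: Activity) -> list[Activity]:
    """Return the activities that are not decomposed any further."""
    if not activity.children:
        return [activity]
    leaves: list[Activity] = []
    for child in activity.children:
        leaves.extend(leaf_activities(child))
    return leaves


# Only A4 is decomposed here, mirroring the diagrams shown in the slides.
model = Activity("A0", "Customer Profitability", [
    Activity("A1", "Sourcing"),
    Activity("A2", "Customer Profitability Calculation"),
    Activity("A3", "Exception Output"),
    Activity("A4", "Commercial BI", [
        Activity("A4.1", "Load Commercial Cube"),
        Activity("A4.3", "Cube Provider"),
        Activity("A4.6", "Commercial Profitability Reporting"),
    ]),
    Activity("A5", "Retail BI"),
    Activity("A6", "Balance Input and Output"),
])

for leaf in leaf_activities(model):
    print(leaf.number, leaf.name)
```

The leaves are the candidates for "one developer, one module"; any leaf that still looks too large is a sign that another level of decomposition is needed.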
PAGE 23
The Approach
• Each of the diagram elements, except for the Data Flows, can be further modeled in ERwin DM
• This gives the developer a further level of detail of what is intended
• It also provides the physical names that will be used
• To maintain the mapping between the models, I use a naming convention for ERwin DM Subject Areas
• The convention is:
– A01.01.01 – {Activity Name}
– D01 – {Data Store Name}
– E01 – {External Name}
PAGE 24
The Approach
• Some examples for External Entities and Data Stores from the model above:
– D01 – Customer Profitability Staging
– E05 – Commercial Loans
• Each of these subject areas should have the portion of the data model relevant to it
• Note that these are just typical ER models
• They can represent more than just tables – for example, an external entity could be a flat file
• Below is an example – the E05 – Commercial Loans external entity
PAGE 25
The Approach
PAGE 26
The Approach
• Next we need to look at the activities
• Because activities have a hierarchical numbering system, we need one for the subject areas
• We simply start with A and separate each level with a period
• For example, Combine Retail Loans is Activity 7 inside Activity 2, so it is called A2.7 Combine Retail Loans in the model
• The associated subject area will be:
– A02.07 – Combine Retail Loans
• The data model will show the input and output entities and how they are processed
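The naming convention can be applied mechanically. The sketch below, in Python, turns an ERwin PM element number into the matching ERwin DM subject area name by zero-padding each level to two digits. The function name and the validation rules are assumptions made for illustration; the slides only give the resulting names.

```python
import re

# The convention separates the padded ID and the name with an en dash.
SEPARATOR = " \u2013 "


def subject_area_name(element_id: str, element_name: str) -> str:
    """Build a subject area name such as 'A02.07 – Combine Retail Loans'.

    element_id is the number shown in the process model: 'A2.7' for an
    activity, 'D1' for a data store, 'E5' for an external entity.
    """
    match = re.fullmatch(r"([ADE])(\d+(?:\.\d+)*)", element_id.strip())
    if match is None:
        raise ValueError(f"unrecognized element ID: {element_id!r}")
    prefix, numbers = match.groups()
    # Only activities are hierarchical; data stores and externals are flat.
    if prefix != "A" and "." in numbers:
        raise ValueError(f"only activities may have levels: {element_id!r}")
    padded = ".".join(f"{int(part):02d}" for part in numbers.split("."))
    return f"{prefix}{padded}{SEPARATOR}{element_name}"


print(subject_area_name("A2.7", "Combine Retail Loans"))
print(subject_area_name("D1", "Customer Profitability Staging"))
print(subject_area_name("E5", "Commercial Loans"))
```

Zero-padding keeps the subject areas in the same order as the process model when they are sorted alphabetically.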
PAGE 27
The Approach
PAGE 28
The Approach
• With the diagrams from ERwin DM and ERwin PM, and the narrative in ERwin PM, the developer has all the information needed to implement a portion of the solution
• The diagrams and narratives are also accessible to technical users
• Twice, I have had the user community write papers to explain the details of specific areas of the ERwin PM model
PAGE 29
The Approach
• Notes
– Using ERwin DM we can quickly build detailed reports with diagrams and descriptions
– The developers use these reports to track what they have to do
– The Project Managers use these reports as an inventory for project planning
– The ERwin PM reports are like a roadmap that ties everything together
– It takes some effort to keep everything synchronized but it is well worth it
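Part of the synchronization effort can be checked with a small script once the element numbers and subject area names are exported from the two tools. The Python sketch below assumes both lists are already available as plain strings; how they are exported is not covered here, and the function names are hypothetical.

```python
def padded_id(element_id: str) -> str:
    """Convert a process model number like 'A2.7' to 'A02.07'."""
    prefix, numbers = element_id[0], element_id[1:]
    return prefix + ".".join(f"{int(part):02d}" for part in numbers.split("."))


def sync_report(pm_ids, dm_subject_areas):
    """Compare process model element IDs with data model subject areas.

    Returns (missing, orphaned): padded IDs with no subject area, and
    subject area IDs with no matching process model element.
    """
    expected = {padded_id(element_id) for element_id in pm_ids}
    # The padded ID is everything before the first space in the name.
    actual = {name.split(" ", 1)[0] for name in dm_subject_areas}
    return sorted(expected - actual), sorted(actual - expected)


missing, orphaned = sync_report(
    ["D1", "E5", "A2.7"],
    [
        "D01 \u2013 Customer Profitability Staging",
        "A02.07 \u2013 Combine Retail Loans",
        "E06 \u2013 Treasury",
    ],
)
print("Missing subject areas:", missing)
print("Subject areas without a model element:", orphaned)
```

In this sample, E05 has no subject area yet and E06 has a subject area but was not in the list of model elements, so both are flagged.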
PAGE 30
The Approach
• In Summary
– A data warehouse is very much a store of data and a flow of data
– ERwin DM and ERwin PM can model both of these areas
– Use ERwin PM to decompose the solution
• There is no right or best decomposition
• Try it until it works
– Use ERwin DM to model the internals of External Entities, Data Stores, and Activities
• Tie the two models together through an appropriate naming convention
• Do not worry if the entities model more than tables
– The goal is to communicate with users and developers
PAGE 31
Questions?