This is the presentation I gave at TDWI EU in Munich on June 22nd, 2012. It is about a robust, agile and reliable way of deploying data warehouse environments. The majority of data warehouses in the Netherlands are now Data Vault based, which instigated a wave of innovation among engineers and software vendors pursuing model-driven development based on pattern-based ETL, standardized modeling and a certain architectural style.
R.D.Damhof
Data Vault, What is the buzz about
TDWI München June 18, 2012 Ronald Damhof
Agile Data Warehousing
“Our highest priority is to satisfy the customer through early and continuous delivery of valuable software”
Agile Manifesto, 2001
Kent Beck, Mike Beedle, Arie van Bennekum, Alistair Cockburn, Ward Cunningham, Martin Fowler, James Grenning, Jim Highsmith, Andrew Hunt, Ron Jeffries, Jon Kern, Brian Marick, Robert C. Martin, Steve Mellor, Ken Schwaber, Jeff Sutherland, Dave Thomas
[Figure labels: Source; Source; ‘Semantic gap’; ‘Calculating risk’; ‘Yield modules’; ‘Customer segmentation’]
Everybody mines their own data
Everybody enriches their own data
Everybody uses their own data
User = Developer
With their self-made tools
Data quality determined by the individual
It’s a grind, with limited reusability
Lead times unpredictable
No management
Let’s ‘order’ an information product
And hire a master/expert
Separation between user/developer
Developer/expert mines the data
The information product = custom made
Data quality is mostly dependent on the developer/expert
Lead times unpredictable
Still not much reusability
A central department that knows what information you need
That assembles information products, ready to be used for you
‘I know what you want’ - black
Efficiency is the name of the game
At least I got something, but it does not meet my needs, not even remotely
Even worse: the guild days are still there. The expert is now submerged, but still needed to get the data you actually need.
Introduction of management: you want something? Please apply in triplicate…
Stephen Denning (2011), Radical Management
Creating information products the moment they are asked for
Against quality criteria that are in line with the customer’s expectations
Empower the customer with skills and facilities to be more self-sufficient
Minimize ‘data’ stock as much as possible
Embrace new wishes and changes required by the customer
The customer is the most important part of the production process
A modern data management environment:
The ‘Supermarket’
The ‘Restaurant’
The ‘Do it yourself buffet’
Push characteristics
§ Mass production
§ Known specifications, operational definitions, standards
§ Repeatable, predictable, and even better: uniform process
§ Part of the system that needs statistical control
§ Inventory allowed/necessary
§ Supply driven
§ Reliability over flexibility
Pull characteristics
§ Just in time
§ Demand driven
§ Build to order
§ Preferably no inventory
§ Flexibility over reliability
Back to the issue at hand…
§ What: the ‘production process of data’
§ Where: Coordination - Local versus central
§ How: System Engineering - Systematic vs. Opportunistic
§ What principles guide us - leading principles
Local vs Central deployment
[Figure: Information Delivery Process, offered as a data & function service. Labels: data sources (internal & external); generic process (Central); end-user (Local); recipient; information delivery process. Steps:]
1. Get the raw uncut data
2. Register & Standardize
3. Enrich and cleanse data
4. Generate Information products
System Engineering - Systematic vs. Opportunistic
[Figure: three development styles on a scale from sustainability (systematic approach) to manoeuvrability (opportunistic approach)]
IT Development: development line discipline (OTAP, i.e. development, test, acceptance, production); developers at a distance from users; mutually dependent, within frameworks; heavy separation of function
Delegated Development: lightweight development process; minimum of specialisation/distinction of roles; self-sufficient, limited freedom
Selfservice Development: ad-hoc development process; developer = user; self-sufficient, great degree of freedom; very broad tasks
Leading principles
Adaptable
Sustainable
Decoupled
Centralized
Compliant
Standardized & Industrialized
Effective
Company xxx data management domain
[Figure: components Sources, Source store, Enterprise Data Warehouse, Business View, BI apps (Analysis, Reports, Ad-hoc), with stages numbered 1 2 3 4; dimensions: Data (‘What’), Function (‘How’), ‘Where’, ‘Whom’]
[Figure: the leading principles (Sustainable, Compliant, Decoupled, Standardized, Centralized, Adaptable, Effective) shown against three options: Source to product; Sourcestore to product; Sourcestore to BV EDW (DV)]
Company xxx data warehouse & Business Intelligence domain
[Figure: components Sources, Source store, Enterprise Data Warehouse, Business View, Data feeds, BI apps (Analysis, Reports, Ad-hoc), with stages numbered 1 2 3 4; dimensions: Data (‘What’), Function (‘How’), ‘Where’, ‘Whom’]
Why DV?
[Figure: Information Delivery Process. Labels: administrative process; systems (internal & external); data & information recipients; decision & control process. Steps: Attain; Register & Standardize; Enrich; Generate & Distribute. Components: Staging; DV based Data Warehouse; Business rules; Control / Metadata; PDCA. Products: data products; information products; compliance reporting; supply chain optimization; risk management; performance management; fraud detection; market basket analysis. Flow markers: Push, Push, Pull]
Data Vault implementations
Metamodel driven automation
- Models (process, rules and data) determine the metadata, the metadata determines the automation artifacts
- Aim is to be 100% declarative
- Not everything can be generated; specifically tailored metadata will remain necessary
Metadata driven automation
- Inputs: source model(s), target model, template design, naming conventions
- Advanced inputs: normalization preferences, ontologies
Taken from Dan Linstedt’s blog post: http://danlinstedt.com/datavaultcat/code-generation-for-data-vault-not-as-easy-as-you-think/
Template driven automation
- In the most basic form: documentation describing a pattern
- More advanced: generating XML code for 2nd gen. ETL tooling
- E.g. http://www.grundsatzlich-it.nl/bi-tools-templator.html
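As an illustration of the template driven idea, the sketch below renders hub-load statements from a small metadata list. It is a minimal example under stated assumptions, not how any particular tool works: the template text, the placeholder names and the metadata entries (including column names such as `customer_no`) are all hypothetical.

```python
from string import Template

# Hypothetical template for the hub-load pattern. Placeholder names are
# illustrative and not taken from any specific automation tool.
HUB_LOAD_TEMPLATE = Template(
    "INSERT INTO ${hub} (${key}, load_dts, record_src)\n"
    "SELECT DISTINCT src.${src_key}, :load_dts, :record_src\n"
    "FROM ${source} AS src\n"
    "WHERE NOT EXISTS (\n"
    "    SELECT 1 FROM ${hub} AS hub WHERE hub.${key} = src.${src_key}\n"
    ")"
)

# Hypothetical metadata: one entry per hub. In a metadata driven setup this
# would be derived from the source and target models.
HUB_METADATA = [
    {"hub": "customer_hub", "key": "customer_no",
     "source": "source_customer", "src_key": "customer_no"},
    {"hub": "contact_hub", "key": "contact_no",
     "source": "source_contact", "src_key": "contact_no"},
]


def generate_hub_loads(metadata):
    """Return one rendered hub-load statement per metadata entry."""
    return [HUB_LOAD_TEMPLATE.substitute(entry) for entry in metadata]


statements = generate_hub_loads(HUB_METADATA)
```

Adding a hub then means adding a metadata entry rather than writing new load code, which is the property the three automation styles above build on to different degrees.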
My PoV about (Data Vault) automation tooling
§ Generation is an aid, not a goal in itself
  Do not accommodate the principles to fit the tool....
  Look for decoupling
§ Truly understand the mechanics - handcraft it first!
  Invest in proper education and learning
  Invest in ‘getting ready’ time
  Involve your ‘customers’ from the start
§ PoC, PoC, PoC
§ Deliver, Deliver, Deliver
Agility & Data Vault (1)
Why is it that you can build and deploy extremely small particles in Data Vault and not in other approaches, without having an increase in the overhead and coordination of these particles? In other words: ‘Divide and Conquer to beat the Size / Complexity Dynamic’
Agility & Data Vault (2)
Why is it that you can re-engineer your existing model and guarantee that the changes remain local? This is hugely beneficial in data warehouses, which by definition grow over time.
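One way to read the locality claim: a new requirement is modelled as an additional satellite, so the definitions of existing tables are left untouched. The sketch below checks exactly that in an in-memory SQLite database. It only illustrates the idea; the table and column names (`customer_credit_sat`, `credit_rating`, `customer_no`) are hypothetical.

```python
import sqlite3


def table_definitions(conn):
    """Map each table name to the CREATE statement SQLite stored for it."""
    cur = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return dict(cur.fetchall())


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer_hub (
        id INTEGER PRIMARY KEY, customer_no TEXT UNIQUE,
        load_dts TEXT, record_src TEXT);
    CREATE TABLE customer_sat (
        hub_id INTEGER REFERENCES customer_hub (id),
        load_dts TEXT, name TEXT, record_src TEXT,
        PRIMARY KEY (hub_id, load_dts));
    """
)
before = table_definitions(conn)

# Hypothetical new requirement: track a credit rating per customer.
# It is added as a new satellite; no existing table is altered.
conn.execute(
    """
    CREATE TABLE customer_credit_sat (
        hub_id INTEGER REFERENCES customer_hub (id),
        load_dts TEXT, credit_rating TEXT, record_src TEXT,
        PRIMARY KEY (hub_id, load_dts))
    """
)
after = table_definitions(conn)

# Existing definitions are identical; only one table was added.
unchanged = all(after[name] == sql for name, sql in before.items())
added = sorted(set(after) - set(before))
```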
Agility & Data Vault (3)
Why is it that, as your (Data Vault based) data warehouse grows, your costs initially grow ‘merely’ in linear fashion and, as you approach the end state, the marginal growth in cost decreases exponentially?
Data Vault as such is not Agile; it is the development process that needs to be agile. DV merely supports the agile development process.
“Our highest priority is to satisfy the customer through early and continuous delivery of valuable software”
Agile Manifesto, 2001
Kent Beck, Mike Beedle, Arie van Bennekum, Alistair Cockburn, Ward Cunningham, Martin Fowler, James Grenning, Jim Highsmith, Andrew Hunt, Ron Jeffries, Jon Kern, Brian Marick, Robert C. Martin, Steve Mellor, Ken Schwaber, Jeff Sutherland, Dave Thomas
Data Model Time Line Historic Overview
© (Linstedt, Graziano, & Hultgren, The New Business Supermodel, The Business of Data Vault Modeling, 2008, p. 36)
§ Created by Dan Linstedt
§ Released in 2000
§ Formally introduced in the Netherlands in 2007
§ First DV book: The Business of Data Vault Modeling, 2008
§ First (Dutch) user group in 2010
§ Technical book from Dan Linstedt in 2011
Application Architecture
Top Down Approach
Bottom Up Approach
Irony
Hybrid Approach (Data Vault)
ETL/Load Architecture
- 100% of the data (within scope) 100% of the time
- Source driven / Auditable
- “Fact oriented”
- Template/metadata driven
- No business rules
Kimball or Inmon ETL
- Complex ETL
- Truth oriented
- Business rules before EDW
Pictures: Genesee Academy ©
Classic Data Vault Application Architecture
[Figure labels: Business Transaction System (x2); Data Vault: structure transformation, Hub = business keys; Datasets: business rule execution, structure and value transformation; Staging Out; Rule Vault: generic business rules (‘? ?’). Principles: Adaptable, Sustainable, Compliant, Decoupled, Effectiveness, Standardized, Centralized]
Data Vault Application Architecture
§ Central EDW
§ Business rules downstream
§ Incremental/Non-destructive loading
§ 100% of the data (within scope) 100% of the time
§ Auditable/Partly source driven
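A small sketch of what ‘business rules downstream’ can look like in practice: the satellite keeps source values exactly as delivered, and a rule is applied afterwards in a view. This is an illustration under assumptions, not the deck's implementation; the `country` column, the sample rows and the normalisation rule are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer_sat (
        hub_id INTEGER, load_dts TEXT, country TEXT, record_src TEXT);

    -- Raw load: values are stored as delivered, inconsistencies included.
    INSERT INTO customer_sat VALUES
        (1, '2012-06-18', 'NL', 'crm'),
        (2, '2012-06-18', 'nl ', 'webshop'),
        (3, '2012-06-18', 'Holland', 'legacy');

    -- Hypothetical business rule, applied downstream of the raw data.
    CREATE VIEW customer_country_bv AS
    SELECT hub_id, load_dts,
           CASE WHEN UPPER(TRIM(country)) IN ('NL', 'HOLLAND') THEN 'NL'
                ELSE UPPER(TRIM(country)) END AS country
    FROM customer_sat;
    """
)

raw = [row[0] for row in conn.execute(
    "SELECT country FROM customer_sat ORDER BY hub_id")]
derived = [row[0] for row in conn.execute(
    "SELECT country FROM customer_country_bv ORDER BY hub_id")]
```

Because the rule lives in the view, changing it does not require reloading or rewriting the stored data, and the original values stay available for audit.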
Modeling
Data Vault Constructs
Core Components
Data Vault Core Components
Hubs
Satellites
Links
Loading
HUB load
-- Insert only business keys that are not yet present in the hub.
INSERT INTO customer_hub (customer#, load_dts, record_src)
SELECT DISTINCT source.customer#, @load_dts, @record_src  -- DISTINCT added: guards against duplicate keys within one batch
FROM source_customer AS source
WHERE NOT EXISTS (
    SELECT * FROM customer_hub AS hub
    WHERE hub.customer# = source.customer#
)
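The hub-load pattern can be tried out end to end with an in-memory SQLite database. The sketch below is a minimal illustration, assuming `customer_no` as a stand-in for `customer#`, named parameters in place of `@load_dts` and `@record_src`, and made-up sample rows.

```python
import sqlite3


def load_customer_hub(conn, load_dts, record_src):
    """Insert business keys not yet in the hub; return the rows added."""
    cur = conn.execute(
        """
        INSERT INTO customer_hub (customer_no, load_dts, record_src)
        SELECT DISTINCT src.customer_no, :load_dts, :record_src
        FROM source_customer AS src
        WHERE NOT EXISTS (
            SELECT 1 FROM customer_hub AS hub
            WHERE hub.customer_no = src.customer_no
        )
        """,
        {"load_dts": load_dts, "record_src": record_src},
    )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_customer (customer_no TEXT, cust_name TEXT);
    CREATE TABLE customer_hub (
        id INTEGER PRIMARY KEY, customer_no TEXT UNIQUE,
        load_dts TEXT, record_src TEXT);
    -- Sample data; 'C1' is delivered twice on purpose.
    INSERT INTO source_customer VALUES
        ('C1', 'Ann'), ('C2', 'Bob'), ('C1', 'Ann');
    """
)

first_run = load_customer_hub(conn, "2012-06-18", "crm")
second_run = load_customer_hub(conn, "2012-06-19", "crm")  # same source again
hub_keys = [row[0] for row in conn.execute(
    "SELECT customer_no FROM customer_hub ORDER BY customer_no")]
```

Running the load twice against the same source adds nothing the second time, which is what makes the pattern safe to repeat.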
Loading a Link
Link Load
-- Resolve both business keys to hub ids, then insert only new combinations.
INSERT INTO custcontact_link (cust_id, contact_id, load_dts, record_src)
SELECT cust.id, contact.id, @load_dts, @record_src
FROM source_table AS source
INNER JOIN contact_hub AS contact ON contact.contact# = source.contact#
INNER JOIN customer_hub AS cust ON cust.customer# = source.customer#
WHERE NOT EXISTS (
    SELECT * FROM custcontact_link AS link
    WHERE link.contact_id = contact.id AND link.cust_id = cust.id
)
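The link-load pattern can be exercised the same way. The sketch below is illustrative only: `customer_no` and `contact_no` stand in for `customer#` and `contact#`, the sample rows are made up, and `DISTINCT` is an added assumption so that a pair delivered twice in one batch is inserted once.

```python
import sqlite3

# Link-load pattern: look up both hub ids, insert only unseen combinations.
LINK_LOAD_SQL = """
INSERT INTO custcontact_link (cust_id, contact_id, load_dts, record_src)
SELECT DISTINCT cust.id, contact.id, :load_dts, :record_src
FROM source_table AS src
JOIN contact_hub AS contact ON contact.contact_no = src.contact_no
JOIN customer_hub AS cust ON cust.customer_no = src.customer_no
WHERE NOT EXISTS (
    SELECT 1 FROM custcontact_link AS lnk
    WHERE lnk.contact_id = contact.id AND lnk.cust_id = cust.id
)
"""


def load_link(conn, load_dts, record_src):
    """Run the link load and return all (cust_id, contact_id) pairs."""
    conn.execute(LINK_LOAD_SQL, {"load_dts": load_dts, "record_src": record_src})
    return conn.execute(
        "SELECT cust_id, contact_id FROM custcontact_link "
        "ORDER BY cust_id, contact_id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer_hub (id INTEGER PRIMARY KEY, customer_no TEXT UNIQUE);
    CREATE TABLE contact_hub (id INTEGER PRIMARY KEY, contact_no TEXT UNIQUE);
    CREATE TABLE custcontact_link (
        id INTEGER PRIMARY KEY, cust_id INTEGER, contact_id INTEGER,
        load_dts TEXT, record_src TEXT, UNIQUE (cust_id, contact_id));
    CREATE TABLE source_table (customer_no TEXT, contact_no TEXT);
    INSERT INTO customer_hub (customer_no) VALUES ('C1'), ('C2');
    INSERT INTO contact_hub (contact_no) VALUES ('K1'), ('K2');
    INSERT INTO source_table VALUES ('C1', 'K1'), ('C2', 'K2'), ('C1', 'K1');
    """
)

after_first = load_link(conn, "2012-06-18", "crm")
after_repeat = load_link(conn, "2012-06-19", "crm")  # nothing new in source
conn.execute("INSERT INTO source_table VALUES ('C2', 'K1')")
after_new_pair = load_link(conn, "2012-06-20", "crm")
```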
Loading a Satellite
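Satellite Load
-- Insert a new satellite row when the key has no row yet or the name changed.
INSERT INTO customer_sat (hub_id, load_dts, name, record_src)
SELECT hub.id, @load_dts, source.cust_name, @record_src
FROM source_customer AS source
INNER JOIN customer_hub AS hub ON hub.customer# = source.customer#
-- LEFT JOIN (the slide used INNER JOIN) so that keys without a satellite row are also loaded.
LEFT JOIN customer_sat AS sat
    ON sat.hub_id = hub.id
    -- “Is most recent”, written here as the latest load_dts per hub key (an assumption).
    AND sat.load_dts = (SELECT MAX(load_dts) FROM customer_sat WHERE hub_id = hub.id)
WHERE sat.hub_id IS NULL OR sat.name <> source.cust_name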
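A runnable sketch of the satellite-load pattern follows. It rests on several assumptions: ‘is most recent’ is taken to mean the row with the latest `load_dts` per hub key, a `LEFT JOIN` is used so that keys without any satellite row are loaded too, `customer_no` stands in for `customer#`, and the sample data is made up.

```python
import sqlite3

# Insert a satellite row for new keys and for keys whose name changed
# compared with the most recent satellite row.
SAT_LOAD_SQL = """
INSERT INTO customer_sat (hub_id, load_dts, name, record_src)
SELECT hub.id, :load_dts, src.cust_name, :record_src
FROM source_customer AS src
JOIN customer_hub AS hub ON hub.customer_no = src.customer_no
LEFT JOIN customer_sat AS sat
       ON sat.hub_id = hub.id
      AND sat.load_dts = (SELECT MAX(s.load_dts) FROM customer_sat AS s
                          WHERE s.hub_id = hub.id)
WHERE sat.hub_id IS NULL OR sat.name <> src.cust_name
"""


def load_customer_sat(conn, load_dts, record_src):
    """Run the satellite load and return the number of rows inserted."""
    cur = conn.execute(
        SAT_LOAD_SQL, {"load_dts": load_dts, "record_src": record_src})
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_customer (customer_no TEXT, cust_name TEXT);
    CREATE TABLE customer_hub (id INTEGER PRIMARY KEY, customer_no TEXT UNIQUE);
    CREATE TABLE customer_sat (
        hub_id INTEGER, load_dts TEXT, name TEXT, record_src TEXT,
        PRIMARY KEY (hub_id, load_dts));
    INSERT INTO customer_hub (customer_no) VALUES ('C1'), ('C2');
    INSERT INTO source_customer VALUES ('C1', 'Ann'), ('C2', 'Bob');
    """
)

initial = load_customer_sat(conn, "2012-06-18", "crm")    # first rows for both keys
unchanged = load_customer_sat(conn, "2012-06-19", "crm")  # source did not change
conn.execute(
    "UPDATE source_customer SET cust_name = 'Robert' WHERE customer_no = 'C2'")
changed = load_customer_sat(conn, "2012-06-20", "crm")    # one changed name
history = [row[0] for row in conn.execute(
    "SELECT name FROM customer_sat WHERE hub_id = 2 ORDER BY load_dts")]
```

The earlier row for the changed customer is kept rather than updated, so the satellite accumulates history.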
Data Vault Loading Paradigm
Top 10 Rules for Data Vault Modeling
Agility & Data Vault - recap (1)
Why is it that you can build and deploy extremely small particles in Data Vault and not in other approaches, without having an increase in the overhead and coordination of these particles? In other words: ‘Divide and Conquer to beat the Size / Complexity Dynamic’
Why is it that you can re-engineer your existing model and guarantee that the changes remain local? This is hugely beneficial in data warehouses, which by definition grow over time.
Why is it that, as your (Data Vault based) data warehouse grows, your costs initially grow ‘merely’ in linear fashion and, as you approach the end state, the marginal growth in cost decreases exponentially?
Agility & Data Vault - recap (2)
Remember the Push characteristics; on the slide each one is marked ‘Data Vault’:
➡ Mass production
➡ Known specifications, operational definitions, standards
➡ Repeatable, predictable, and even better: uniform process
➡ Part of the system that needs statistical control
➡ Inventory allowed/necessary
➡ Mainly supply driven
➡ Reliability over flexibility
Automation of a Data Vault ‘production process’ is just common sense
Bonus Slides: Forks and mutations in DV ‘evolution’
Type 1 - Classic Data Vault
[Figure labels: Business Transaction System (x2); Data Vault: structure transformation, Hub = business keys; Datasets: business rule execution, structure and value transformation; Staging Out; Rule Vault: generic business rules (‘? ?’). Principles: Adaptable, Sustainable, Compliant, Decoupled, Effectiveness, Standardized, Centralized]
Type 2 - Source Data Vault
[Figure labels: Business Transaction System (x2); Staging Vault: structure transformation, no integration, Hub = surrogate keys, persisting staging in DV format; Business Data Vault: business rule execution, integration, DV modelled; Data Marts: structure transformation. Principles: Sustainable, Compliant, Decoupled, Standardized, Centralized; marked ‘? ? ?’: Adaptable, Effectiveness]
[Figure labels: Source (x4); 100% semantic gap (x2); Staging DV (x2); Business DV: integration, cleansing, consolidation, business rule execution upstream ??, DV modelled; ‘Still the source’]
[Figure labels: Source (x4); 100% semantic gap (x2); Staging DV (x2); Business DV: integration, cleansing, consolidation, business rule execution upstream ??, DV modelled; ‘Still the source’; Source; Source Data Warehouse]
Wanna know more?
§ Training & certification: www.geneseeacademy.com
§ Books: ‘Super Charge Your Data Warehouse: Invaluable Data Modeling Rules to Implement Your Data Vault’ – D.Linstedt / K.Graziano
§ LinkedIn: Data Vault Discussions (approx. 800 members)
§ Niche non-commercial conferences: www.dwhautomation.com
§ Many blogs, articles, presentations on the World Wide Web
§ The best way to learn: try it, make some code, experience, engage
Drs. Ronald D. Damhof
Blog http://prudenza.typepad.com/ http://www.b-eye-network.com/blogs/damhof/
LinkedIn http://nl.linkedin.com/in/ronalddamhof
Email ronald.damhof@prudenza.nl
Twitter RonaldDamhof
Skype Ronald.Damhof
Mobile +31(0)6 269 67 184
Others Information Quality Certified Professional (IQCP); Data Vault Certified Grand Master; Certified Scrum Master; Member of the Boulder BI Brain Trust (#BBBT)
Ronald Damhof is an independent practitioner in the field of data management and decision support. He graduated in Economics in 1995. Since 1995 he has worked as a practitioner in the field of Information Management with a focus on decision support and data management, trying hard to enhance the rigor and relevance in these fields by combining scientific research with the everyday challenges of the practitioner. Ronald is mainly hired by customers in the role of business/IT architect, auditor, coach & trainer. He blogs on B-Eye-Network.com as well as on his own blog, is a member of the prestigious BBBT, has written several articles on decision support architectures and is a researcher in the field of Information Management. Although Ronald likes to work with theoretically grounded research and proven practices, he is not a 'white paper' architect; 'put your money where your mouth is' is his motto. He likes to see architectures 'live' in enterprises, not just write about them. In most organizations his role often extends beyond architecture. In truly agile spirit, the roles he plays depend on the context of the client; he can be a missionary (selling the value), a project manager (getting it done), a scrum master (removing impediments), a specialist (educating hardware peeps, data architects, data logistics etc.) or a leader.
Thank You