
PDI data vault framework #pcmams 2012


DESCRIPTION

Presentation given by Edwin Weber at #pcmams 2012


Page 1: PDI data vault framework #pcmams 2012

Introduction

[email protected]

Page 2: PDI data vault framework #pcmams 2012

Data Vault Definition

Source: Dan Linstedt, http://www.tdan.com/view-articles/5054/

The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of enterprise data warehouses.

Page 3: PDI data vault framework #pcmams 2012

Data Vault Building Blocks

Source: Dan Linstedt, http://www.slideshare.net/dlinstedt/introduction-to-data-vault-dama-oregon-2012

different sources/rate of change

Page 4: PDI data vault framework #pcmams 2012

Data Vault Fundamentals: Hub

Source: data-vault-modeling-guide, Genesee Academy, LLC, Hans Hultgren

Page 5: PDI data vault framework #pcmams 2012

Data Vault Fundamentals: Link

Source: data-vault-modeling-guide, Genesee Academy, LLC, Hans Hultgren

Page 6: PDI data vault framework #pcmams 2012

Data Vault Fundamentals: Satellite

Source: data-vault-modeling-guide, Genesee Academy, LLC, Hans Hultgren

Page 7: PDI data vault framework #pcmams 2012

Data Vault Fundamentals: Model

Source: data-vault-modeling-guide, Genesee Academy, LLC, Hans Hultgren

Page 8: PDI data vault framework #pcmams 2012

Data Vault ETL

Many objects to load, standardized procedures

This screams for a generic solution!

I don't want to:

throw ETL tool away and code it all myself

manage too many ETL objects

connect similar columns in mappings by hand

I do want to:

generate ETL (Kettle) objects? No

Take it one step further: there is only one parameterised hub load object. No need to know the XML structure of PDI objects.
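One way to picture a single parameterised hub load is a loop that starts the same job once per hub, passing the hub's metadata as named parameters. The sketch below only builds the argument lists for PDI's Kitchen runner using its `-file=` and `-param:NAME=VALUE` options. The job file name `load_hub.kjb` and the parameter names are assumptions made for illustration, not the framework's actual names, and the framework may well run this loop inside PDI rather than from a script.

```python
from typing import Dict, List


def kitchen_command(job_file: str, params: Dict[str, str]) -> List[str]:
    """Build an argument list for Kitchen; one -param option per named parameter."""
    cmd = ["kitchen.sh", f"-file={job_file}"]
    for name, value in sorted(params.items()):
        cmd.append(f"-param:{name}={value}")
    return cmd


# Hypothetical hub metadata; in the framework this would come from the metadata tables.
hubs = [
    {"HUB_NAME": "hub_customer", "SOURCE_TABLE": "stg_customer", "BUSINESS_KEY": "customer_nr"},
    {"HUB_NAME": "hub_product", "SOURCE_TABLE": "stg_product", "BUSINESS_KEY": "product_code"},
]

# The same job file is reused for every hub; only the parameters differ.
commands = [kitchen_command("load_hub.kjb", hub) for hub in hubs]
```

Each list in `commands` could be handed to `subprocess.run` on a machine where PDI is installed.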

Page 9: PDI data vault framework #pcmams 2012

Tools

Version Control

Database

Virtualization

Data Integration

Operating System

'Productivity'

SQL Development

Page 10: PDI data vault framework #pcmams 2012

Place of framework in architecture

[Architecture diagram. Stages: Sources, ETL Process, Data Warehouse, EUL. Labels in the diagram: ERP, DBMS, CSV Files, ETL, Staging Area, MySQL, Files, ETL: Kettle Data Vault Framework, Central DWH & Data Marts, MySQL Data Vault, ETL.]

Page 11: PDI data vault framework #pcmams 2012

What has to be taken care of?

Data Vault designed and implemented in database

Staging tables and loading procedures in place (can also be generic; we use the PDI Metadata Injection step for loading files)

Mapping from source to Data Vault specified (now in an Excel sheet)

What

Page 12: PDI data vault framework #pcmams 2012

Framework components

PDI repository (file based), jobs and transformations

Configuration files:

kettle.properties

shared.xml

repositories.xml

Excel sheet that contains the specifications

MySQL database for metadata

Virtual machine with Ubuntu 12.04 Server

Page 13: PDI data vault framework #pcmams 2012

Design decisions

Updateable views with generic column names

(MySQL more lenient than PostgreSQL)

Compare satellite attributes via string comparison (concatenate all columns, with | (pipe) as delimiter)

'inject' the metadata using Kettle parameters

Generate and use an error table for each Data Vault table
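The string comparison for satellite attributes can be sketched as follows. This is a minimal illustration in Python, not the framework's PDI implementation; the function names and the choice to render NULL as an empty string are assumptions.

```python
from typing import Mapping, Optional, Sequence


def attribute_string(row: Mapping[str, object], columns: Sequence[str]) -> str:
    """Concatenate the attribute values in a fixed column order, pipe-delimited."""
    # Assumption: NULL (None) is rendered as an empty string.
    return "|".join("" if row[c] is None else str(row[c]) for c in columns)


def satellite_changed(
    source_row: Mapping[str, object],
    current_row: Optional[Mapping[str, object]],
    columns: Sequence[str],
) -> bool:
    """Return True when a new satellite row should be written."""
    if current_row is None:  # no satellite row yet for this key
        return True
    return attribute_string(source_row, columns) != attribute_string(current_row, columns)
```

A single string comparison keeps the load generic, since it does not depend on how many attributes a satellite has. The trade-off is that a value containing the delimiter, or a NULL next to an empty string, can make two different rows compare as equal.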

Page 14: PDI data vault framework #pcmams 2012

Metadata tables

All have history tables

Page 15: PDI data vault framework #pcmams 2012

Metadata in Excel

Data Vault

connections

source systems

source tables

Page 16: PDI data vault framework #pcmams 2012

Metadata in Excel (hub + sat)

x 200 (max)

Page 17: PDI data vault framework #pcmams 2012

Metadata in Excel (link)

link attributes

x 10

Page 18: PDI data vault framework #pcmams 2012

Metadata in Excel (link satellite)

x 10

x 5

x 200 (max)

Page 19: PDI data vault framework #pcmams 2012

Last seen date

applicable for hubs and links

for existing hub and link rows: update 'last_seen_dts'!
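The last-seen behaviour can be sketched with an in-memory SQLite table: a business key that already exists only gets its `last_seen_dts` refreshed, while a new key is inserted with both timestamps set. The table and column names other than `last_seen_dts` are assumptions, and SQLite stands in for the MySQL database used by the framework.

```python
import sqlite3


def load_hub(conn: sqlite3.Connection, business_keys, load_dts: str) -> None:
    """Insert unseen business keys; refresh last_seen_dts on existing ones."""
    for key in business_keys:
        updated = conn.execute(
            "UPDATE hub_customer SET last_seen_dts = ? WHERE customer_bk = ?",
            (load_dts, key),
        )
        if updated.rowcount == 0:  # key not in the hub yet
            conn.execute(
                "INSERT INTO hub_customer (customer_bk, load_dts, last_seen_dts) "
                "VALUES (?, ?, ?)",
                (key, load_dts, load_dts),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hub_customer ("
    "hub_customer_id INTEGER PRIMARY KEY, customer_bk TEXT UNIQUE, "
    "load_dts TEXT, last_seen_dts TEXT)"
)
load_hub(conn, ["A", "B"], "2012-11-01")
load_hub(conn, ["B", "C"], "2012-11-02")
rows = conn.execute(
    "SELECT customer_bk, load_dts, last_seen_dts FROM hub_customer ORDER BY customer_bk"
).fetchall()
```

After the second run, key A keeps its original last-seen date, which is how a key that has disappeared from the source can be recognised.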

Page 20: PDI data vault framework #pcmams 2012

Link validity satellite

Link has a 'business key': not all hub IDs

Page 21: PDI data vault framework #pcmams 2012

Loading the metadata

Page 22: PDI data vault framework #pcmams 2012

'design errors'

Checks to avoid debugging (compares design metadata with the Data Vault DB information_schema):

hubs, links, satellites that don't exist in the DV

key columns that do not exist in the DV

missing connection data (source db)

missing attribute columns
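A check of this kind can be sketched as a comparison between two dictionaries: what the design sheet expects and what the database actually contains. In this illustration `existing` is assumed to have been filled beforehand from `information_schema.columns` (table name and column name); the function name and message texts are hypothetical.

```python
from typing import Dict, List, Set


def design_errors(design: Dict[str, Set[str]], existing: Dict[str, Set[str]]) -> List[str]:
    """Compare designed tables and columns with those present in the Data Vault."""
    errors: List[str] = []
    for table, columns in sorted(design.items()):
        if table not in existing:
            errors.append(f"table {table} does not exist in the Data Vault")
            continue  # no point checking columns of a missing table
        for column in sorted(columns - existing[table]):
            errors.append(f"column {table}.{column} does not exist in the Data Vault")
    return errors


# Hypothetical design metadata versus the actual database contents.
design = {
    "hub_customer": {"hub_customer_id", "customer_bk"},
    "sat_customer": {"hub_customer_id", "name", "city"},
    "link_order": {"link_order_id"},
}
existing = {
    "hub_customer": {"hub_customer_id", "customer_bk", "load_dts"},
    "sat_customer": {"hub_customer_id", "name"},
}
errors = design_errors(design, existing)
```

Running such a check before the load reports every design mistake at once, instead of letting the first failing transformation reveal them one by one.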

Page 23: PDI data vault framework #pcmams 2012

A complete run

Page 24: PDI data vault framework #pcmams 2012

Metadata needed for a hub

name

key column

business key column

source table

source table business key column (can be an expression, e.g. concatenate for a composite key)
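These five items are enough to derive a generic hub load. The sketch below holds them in a small data class and builds the SQL that selects business keys not yet present in the hub. The class, the function, and the example names are assumptions for illustration; the framework itself passes this metadata to a PDI transformation rather than generating SQL in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HubSpec:
    name: str                 # hub table name
    key_column: str           # surrogate key column of the hub
    business_key_column: str  # business key column in the hub
    source_table: str
    source_business_key: str  # column or expression in the source table


def new_keys_query(hub: HubSpec) -> str:
    """Build a query returning source business keys that are not in the hub yet."""
    # Identifiers come from trusted design metadata, so they are interpolated directly.
    return (
        f"SELECT DISTINCT {hub.source_business_key} AS {hub.business_key_column} "
        f"FROM {hub.source_table} "
        f"WHERE {hub.source_business_key} NOT IN "
        f"(SELECT {hub.business_key_column} FROM {hub.name})"
    )


# Composite business key expressed as a MySQL CONCAT expression (hypothetical names).
hub_order = HubSpec(
    name="hub_order",
    key_column="hub_order_id",
    business_key_column="order_bk",
    source_table="stg_order",
    source_business_key="CONCAT(company, '-', order_nr)",
)
query = new_keys_query(hub_order)
```

The surrogate `key_column` does not appear in the query because it is assumed to be generated by the database on insert.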

Page 25: PDI data vault framework #pcmams 2012

Job for hub

Page 26: PDI data vault framework #pcmams 2012

Transformation for hub

Page 27: PDI data vault framework #pcmams 2012

Metadata needed for a link

name

key column

for each hub (maximum 10, can be a ref-table)

hub name

column name for the hub key in the link (roles!)

column in the source table → business key of hub

link 'attributes' (part of key, no hub, maximum 5)

link validity satellite needed?

last seen date needed?

source table
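The link metadata can be pictured as a record with a list of hub references, each carrying its own column name so that one hub can appear in several roles. The sketch below also checks the limits mentioned on the slide (10 hubs, 5 link attributes). All class, field, and example names are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

MAX_HUBS = 10
MAX_LINK_ATTRIBUTES = 5


@dataclass
class HubReference:
    hub_name: str
    link_column: str    # column for the hub key in the link; differs per role
    source_column: str  # column in the source table holding the hub's business key


@dataclass
class LinkSpec:
    name: str
    key_column: str
    source_table: str
    hubs: List[HubReference]
    attributes: List[str] = field(default_factory=list)  # part of the key, no hub
    validity_satellite: bool = False
    last_seen: bool = False


def validate_link(link: LinkSpec) -> List[str]:
    """Return a list of problems found in a link specification."""
    problems: List[str] = []
    if len(link.hubs) > MAX_HUBS:
        problems.append(f"{link.name}: more than {MAX_HUBS} hubs")
    if len(link.attributes) > MAX_LINK_ATTRIBUTES:
        problems.append(f"{link.name}: more than {MAX_LINK_ATTRIBUTES} link attributes")
    columns = [h.link_column for h in link.hubs]
    if len(set(columns)) != len(columns):
        problems.append(f"{link.name}: duplicate hub key column in link")
    return problems


# The same hub used in two roles needs two distinct link columns.
reports_to = LinkSpec(
    name="link_reports_to",
    key_column="link_reports_to_id",
    source_table="stg_employee",
    hubs=[
        HubReference("hub_employee", "employee_id", "employee_nr"),
        HubReference("hub_employee", "manager_id", "manager_nr"),
    ],
)
problems = validate_link(reports_to)
```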

Page 28: PDI data vault framework #pcmams 2012

Job for link

Page 29: PDI data vault framework #pcmams 2012

Transformation for link

Run table needed for validity sat?

Lookup hubs

Remove columns not in link

Last seen?
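The "lookup hubs" and "remove columns not in link" steps can be sketched per row as follows. This is a simplified Python illustration of the idea, not the PDI transformation itself; the in-memory lookup dictionaries and all names are assumptions.

```python
from typing import Dict, List, Mapping, Tuple


def build_link_row(
    source_row: Mapping[str, object],
    hub_refs: Dict[str, Tuple[str, str]],
    hub_lookups: Dict[str, Dict[object, int]],
    attributes: List[str],
) -> Dict[str, object]:
    """Resolve hub keys for one source row and keep only the link's columns."""
    link_row: Dict[str, object] = {}
    # hub_refs maps link column -> (hub name, source column with the business key)
    for link_column, (hub_name, source_column) in hub_refs.items():
        business_key = source_row[source_column]
        link_row[link_column] = hub_lookups[hub_name][business_key]
    # Link attributes are copied as they are; every other source column is dropped.
    for attribute in attributes:
        link_row[attribute] = source_row[attribute]
    return link_row


hub_lookups = {
    "hub_customer": {"C-001": 1, "C-002": 2},
    "hub_product": {"P-9": 7},
}
hub_refs = {
    "hub_customer_id": ("hub_customer", "customer_nr"),
    "hub_product_id": ("hub_product", "product_code"),
}
source_row = {"customer_nr": "C-002", "product_code": "P-9", "line_nr": 3, "comment": "x"}
link_row = build_link_row(source_row, hub_refs, hub_lookups, ["line_nr"])
```

Because the hub keys are looked up by business key, the hubs have to be loaded before the links that refer to them.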

Page 30: PDI data vault framework #pcmams 2012

Metadata needed for a hub satellite

name

key column

hub name

column in the source table → business key of hub

for each attribute (maximum 200)

source column → target column

source table

Page 31: PDI data vault framework #pcmams 2012

Job for hub satellite

Page 32: PDI data vault framework #pcmams 2012

Transformation for hub satellite

Page 33: PDI data vault framework #pcmams 2012

Metadata needed for a link satellite

name

key column

link name

for each hub of the link:

column in the source table → business key of hub

for each key attribute: source column

for each attribute: source column → target column

source table

Page 34: PDI data vault framework #pcmams 2012

Job for link satellite

Page 35: PDI data vault framework #pcmams 2012

Transformation for link satellite

Page 36: PDI data vault framework #pcmams 2012

Executing in a loop ..

Page 37: PDI data vault framework #pcmams 2012

.. and parallel
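Looping and parallel execution can be combined by running the objects of one type concurrently and the types one after another. The sketch below uses a thread pool for this. The phase order (hubs, then links, then satellites) is an assumption based on links and satellites needing the hub keys to exist, and `fake_load` is a stand-in for starting the real PDI job.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_phase(load: Callable[[str], str], names: Sequence[str], max_workers: int = 4) -> List[str]:
    """Run one load per object name in parallel and wait for all of them."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load, names))  # results keep the order of names


def run_all(load: Callable[[str], str], phases: Sequence[Sequence[str]]) -> List[str]:
    """Run the phases one after another; objects within a phase run in parallel."""
    results: List[str] = []
    for names in phases:
        results.extend(run_phase(load, names))
    return results


def fake_load(name: str) -> str:
    # Placeholder for the real work, e.g. starting a parameterised job.
    return f"loaded {name}"


results = run_all(
    fake_load,
    [
        ["hub_customer", "hub_product"],  # phase 1: hubs
        ["link_order"],                   # phase 2: links
        ["sat_customer", "sat_order"],    # phase 3: satellites
    ],
)
```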

Page 38: PDI data vault framework #pcmams 2012

Logging

Configuring log tables for concurrent access

PDI logging

Custom logging

Page 39: PDI data vault framework #pcmams 2012

Version Control: PDI objects

Page 40: PDI data vault framework #pcmams 2012

Version Control: database objects

Page 41: PDI data vault framework #pcmams 2012

Some points of interest

Easy to make a mistake in the design sheet

Generic → a bit harder to maintain and debug

Application/tool to maintain metadata?

Data Vault generators (e.g. Quipu)?

Spinoff using Informatica and Oracle: Sander Robijns

Thanks to: Jos van Dongen, Kasper de Graaf

Page 42: PDI data vault framework #pcmams 2012

Sourceforge!