Internal and Confidential
Data Integration – Techniques for Extracting, Transforming and Loading Data
April 10, 2023
Agenda
• Module 1 – Source Data Analysis & Modeling
• Module 2 – Data Capture Analysis & Design
• Module 3 – Data Transformation Analysis & Design
• Module 4 – Data Transport & Load Design
• Module 5 – Implementation Guidelines
Module 1 – Source Data Analysis and Modeling
Module 1 – Source Data Analysis & Modeling
Data Acquisition Concepts
Source Data Analysis
Source Data Modeling
Understanding the Scope of Data sources
Understanding Source Content
Logical to Physical Mapping
Data Acquisition Roles and Skills
• What is Data Acquisition?
  – Data acquisition is the process of moving data to and within the warehousing environment
• Goal oriented
  – Not an isolated activity. Like data modeling, it is driven by the goals and purpose of the data warehouse
• Source Driven / Target Driven
  – Source Driven – the necessary activities for getting data into the warehousing environment from various sources
  – Target Driven – activities for acquisition within the warehousing environment
• Data Acquisition activities
  – Identifying the set of data needed
  – Choosing the extract approach
  – Extracting data from the source
  – Applying transformations
  – Loading the data
Data Acquisition Concepts

[Diagram: operational data and external data sources feed ETL into a staging area (Data Intake – get data from sources); ETL moves staged data into the data warehouse (Data Distribution – get data into warehouse); ETL then populates the data marts (Information Delivery/Access – get data for delivery).]
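The intake → distribution → delivery flow above can be sketched as a minimal extract/transform/load pipeline. This is an illustrative sketch only; the function names and row layout are assumptions, not part of any particular ETL tool.

```python
# Minimal ETL sketch of the intake -> distribution -> delivery flow.
# All names and the row layout are illustrative assumptions.

def extract(source_rows):
    """Get data from sources (data intake); drop unreadable rows."""
    return [row for row in source_rows if row is not None]

def transform(rows):
    """Apply a trivial transformation: normalize key names."""
    return [{k.lower(): v for k, v in row.items()} for row in rows]

def load(rows, warehouse):
    """Get data into the warehouse (data distribution)."""
    warehouse.extend(rows)
    return warehouse

warehouse = []
source = [{"CUST_ID": 1, "NAME": "Ann"}, None, {"CUST_ID": 2, "NAME": "Ben"}]
load(transform(extract(source)), warehouse)
print(len(warehouse))  # 2
```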
Module 1 – Source Data Analysis & Modeling
Data Acquisition Concepts
Source Data Analysis
Source Data Modeling
Understanding the Scope of Data sources
Understanding Source Content
Logical to Physical Mapping
Data Acquisition Roles and Skills
Source Data Analysis – Scope of Data Sources
• Data acquisition issues
– Identifying and understanding data sources
– Mapping source data to target data
– Deciding which data to capture
– Effectively and efficiently capturing the data, and
– Determining how and when to transform the data
• Acquisition process design must pay attention to:– Ensuring data quality
– Efficiently loading warehouse databases
– Using tools effectively
– Planning for error recovery
[Diagram: scoping questions for data sources – Subjects? (product, customer, finance, process, organization, HR); Kinds of data? (entities such as Customer, Order, Product; business events such as receive order, ship order, cancel order; enterprise events such as merger with…, acquisition of…, termination of…); History? (e.g. 2003, 2004, 2005).]
Source Data Analysis – Identification of sources
• Types of sources
– Operational Systems
– Secondary Systems
– Backups, Logs, and Archives
– Shadow Systems
– DSS/EIS Systems, and
– External Data
• On-Going versus Single Load Sources
Source Data Analysis – Evaluation of Sources
Qualifying criteria Assessment Questions
Availability – How available and accessible is the data? Are there technical obstacles to access? Or ownership and access authority issues?
Understandability – How easily understood is the data? Is it well documented? Does someone in the organization have depth of knowledge? Who works regularly with this data?
Stability – How frequently do data structures change? What is the history of change for the data? What is the expected life span of the potential data source?
Accuracy – How reliable is the data? Do the business people who work with the data trust it?
Timeliness – When and how often is the data updated? How current is the data? How much history is available? How available is it for extraction?
Completeness – Does the scope of data correspond to the scope of the data warehouse? Is any data missing?
Granularity – Is the source the lowest available grain (most detailed level) for this data?
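The seven qualifying criteria above lend themselves to a simple scoring sheet. A hypothetical sketch, assuming the "+" / "-" / "X" rating convention used in the matrix example later in this module:

```python
# Hypothetical scoring of a candidate source against the seven
# qualifying criteria ('+' strong, '-' weak, 'X' disqualifying).

CRITERIA = ["availability", "understandability", "stability",
            "accuracy", "timeliness", "completeness", "granularity"]

def score(ratings):
    """Return a simple numeric score, or None if any rating disqualifies."""
    if "X" in ratings.values():
        return None
    return sum(1 if r == "+" else -1 for r in ratings.values())

policy_master = dict(zip(CRITERIA, ["+", "-", "-", "+", "+", "+", "+"]))
print(score(policy_master))  # 3
```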
Evaluation of Sources – Origin of data
• Original Point of Entry
  – The best-practice technique is to evaluate the original point of entry. Ask: “Is this the very first place that the data is recorded anywhere within the business?”
  – If “yes”, then you have found the original point of entry. If “no”, then the source may not be the original point of entry; ask the follow-up question: “Can the element be updated in this file/table?”
  – If not, then this is not the original point of entry. If “yes”, then the data element may still be useful as a data warehousing source.
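The two point-of-entry questions above form a small decision procedure. A sketch, with both inputs assumed to be answered by the analyst:

```python
# The point-of-entry questions expressed as a decision function.
# Both boolean arguments are assumptions supplied by the analyst.

def is_original_point_of_entry(first_recorded_here, updatable_here):
    """Apply the two questions from the slide, in order."""
    if first_recorded_here:
        return True          # data first enters the business here
    if not updatable_here:
        return False         # downstream copy; not the point of entry
    return None              # possibly useful source; investigate further

print(is_original_point_of_entry(False, True))  # None -> needs follow-up
```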
CUSTOMER MASTER FILE — Point of Origin?
  CUSTOMER-NUMBER  N(9)
  CUSTOMER-NAME    A(32)
  GENDER           A(1)
  DATE-OF-BIRTH    N(6)
  SSN              N(9)

CLAIM MASTER FILE
  CLAIM-NUMBER        Unique number to identify the claim                   N(9)
  POLICY-NUMBER       ID of policy against which the claim is filed         NNNN-XXX-NN
  CUSTOMER-NUMBER     ID number of the customer who filed the claim         N(9)
  CUSTOMER-NAME       Name of the policy holder                             A(32)
  DRIVER-NAME         Name of the driver (if any) involved in the accident  A(32)
  DRIVER-ID           State and driver's license number of the driver       X(22)
  VIN                 Vehicle ID number of involved vehicle                 X(40)
  STATUS-CODE         Status of claim: blank=open, C=closed                 A(1)
  INCIDENT-DATE       Date of incident that initiated the claim             YYYYMMDD
  FILING-DATE         Date that the claim was received                      YYYYMMDD
  INCIDENT-TYPE-CODE  Kind of incident that resulted in the claim:          N(1)
                      1=accident, 2=theft, 3=vandalism, 4=fire, 5=earthquake,
                      6=hurricane, 7=tornado, 8=flood, 9=act of god
Evaluation of Sources – Origin of data
• Original Point of Entry – This practice has many benefits
– Data timeliness and accuracy are improved
– Simplifies the set of extracts from the source system
• Business System of Record
– To what system do the business people go when they are validating results?
– If business identifies a system as the “System of Record” then it must be considered as a probable warehousing data source
• Data Stewardship
– In organizations that have a data stewardship program, involve the data stewards
Evaluation of Sources – An example

Source Data Store Matrix (columns: Availability, Understandability, Stability, Accuracy, Timeliness, Completeness, Granularity, Point of Origin?, System of Record?):

APMS Policy Master File   +  -  -  +  +  +  +   policy    policy
APMS Driver File          X  -  -  +  -  X  +   driver    driver
APMS Premium File         -  X  +  -  X  -  +   premium
CPS Claim Master File     +  X  +  -  X  +  +
CPS Claim Detail File     +  X  -  X  X  +  +
CPS Claim Action File     -  +  +  -  +  -  +
Source Data Element Matrix (CPS Claim Master File)

FIELD            ATTRIBUTE                                       ID(KEY)  ENTITY    RELATIONSHIP  POE  SOR
CLAIM-NUMBER     Unique number to identify the claim             Y        Claim     Y             Y    Y
POLICY-NUMBER    ID of policy against which the claim is filed   N        Policy    Y             N    Y
CUSTOMER-NUMBER  ID number of the customer who filed the claim   N        Customer  Y             N    N
CUSTOMER-NAME    Name of the policy holder                       N        Customer  N             N    N
DRIVER-NAME      Name of the driver (if any) in the accident     N        Claim     N             Y    N
DRIVER-ID        State and driver's license number of driver     N        Claim     N             Y    N
VIN              Vehicle ID number of involved vehicle           N        Claim     N             Y    Y
STATUS-CODE      Status of claim: blank=open, C=closed           N        Claim     N             Y    Y
INCIDENT-DATE    Date of incident that initiated the claim       N        Claim     N             Y    Y
FILING-DATE      Date that the claim was received                N        Claim     N             Y    Y
Module 1 – Source Data Analysis & Modeling
Data Acquisition Concepts
Source Data Analysis
Source Data Modeling
Understanding the Scope of Data sources
Understanding Source Content
Logical to Physical Mapping
Data Acquisition Roles and Skills
Source Data Modeling – Overview of Warehouse Data Modeling

[Diagram: the warehouse data modeling framework pairs source-side and target-side models at each level –
  Contextual models: business goals and drivers; information needs; source composition
  Conceptual models: source subjects; warehousing subjects, business questions, facts and qualifiers, targets configuration (the two sides connected by triage)
  Logical models: integrated source data model (ERM); staging, warehouse, & mart ER models; data mart dimensional models
  Structural models: source data structure model; staging area structure, warehouse structure, relational and dimensional mart structures
  Physical models: source data file descriptions; staging, warehouse, and data mart physical designs
  Functional databases: implemented warehousing databases]
Source Data Modeling
Source Data Modeling Objectives
• A single logical model representing a design view of all source data within scope
• An entity relationship model in 3rd normal form (a business model without implementation redundancies)
• Traceability from logical entities to the specific data sources that implement those entities
• Traceability from logical relationships to the specific data sources that implement those relationships
• Verification that each logical attribute is identifiable in implemented data sources
Source Data Modeling Challenges
• Many data sources do not have data models
• Where data models exist, they are probably outdated and almost certainly not integrated
• Many source structures are only documented in code (e.g. COBOL definitions of VSAM files)
• Sometimes multiple and conflicting file descriptions exist for a single data structure
Source Data Modeling – The activities

[Diagram: the modeling flow runs from contextual (scope) through conceptual (analyze) to logical (design). Business drivers, business goals, and information needs determine what kinds of data stores are in scope and feed the source composition model and source subject model. For each source, ask “does a source model exist?” – if yes, validate the existing data model and integrate it into the source logical model (ERM), working top-down; if no, locate and extract existing file descriptions from the existing data stores and work bottom-up via the structure-of-data-store matrix. The source logical model then feeds target modeling, which continues through the structural (specify), physical (optimize), and functional (implement) levels.]
Module 1 – Source Data Analysis & Modeling
Data Acquisition Concepts
Source Data Analysis
Source Data Modeling
Understanding the Scope of Data sources
Understanding Source Content
Logical to Physical Mapping
Data Acquisition Roles and Skills
Understand the Scope of Data Sources

[Diagram: repeats the contextual (scope) and conceptual (analyze) portion of the modeling flow – business drivers, goals, and information needs determine what kinds of data stores feed the source composition model and source subject model. Steps: identify & name subjects, then associate subjects.]

• The source composition model uses set notation to develop a subject area model
• Classifies each source by the business subjects that it supports
• Helps to understand:
  – which subjects have a robust set of sources
  – which sources address a broad range of business subjects
• Helpful to plan, size, sequence and schedule development of the DW increments
Composition Subject Model – Example

[Diagram: source files/tables placed into overlapping business-subject sets. Sources: MIS customer table, LIS policy file, RPS policy file, APMS policy master, MIS product table, APMS premium file, CPS claim master, LIS claim file, CPS claim detail file, CPS claim action file, CPS party file, MIS auto marketplace table, MIS residential marketplace table. Subjects: CUSTOMER, REVENUE, POLICY, CLAIM, EXPENSE, INCIDENT, ORGANIZATION, MARKETPLACE, PARTY.]
Composition Subject Matrix – Example

Subjects (columns): MARKETPLACE, PARTY, ORGANIZATION, REVENUE, EXPENSE, CUSTOMER, POLICY, CLAIM, INCIDENT

Files/tables (rows, grouped by system):
  CPS:  CLAIM MASTER FILE, CLAIM DETAIL FILE, CLAIM ACTION FILE, PARTY FILE
  APMS: POLICY MASTER, PREMIUM FILE, DRIVER FILE
  RIP:  RESIDENTIAL POLICY FILE
  LIS:  LIFE CLAIMS FILE, LIFE POLICY FILE
  MIS:  CUSTOMER TABLE, PRODUCT TABLE, AUTO MARKET TABLE, RESIDENTIAL MARKET TABLE

[Check marks in the matrix indicate which subjects each file supports – e.g. the claim files support the CLAIM and INCIDENT subjects, the policy files support POLICY and CUSTOMER, and the market tables support MARKETPLACE.]
Module 1 – Source Data Analysis & Modeling
Data Acquisition Concepts
Source Data Analysis
Source Data Modeling
Understanding the Scope of Data sources
Understanding Source Content
Logical to Physical Mapping
Data Acquisition Roles and Skills
Understanding Source Content – An Integrated View of Non-integrated Sources

“Not within the charter of the warehousing program to redesign data sources”

• Understand the existing source data designs
• Merge all of the designs into one representative model
• The source logical data modeling process is not one of design but of integration
• To develop a logical source data model, you will need to integrate design information from multiple inputs, including:
  – Merging and integrating existing data models
  – Extending the subject model that represents each source
  – Extracting design structures from source data stores with reverse engineering techniques
Source Data Modeling…
Understanding Source Content – Using Existing Models

• This modeling activity begins with collection of existing data models
• Models must be validated – to ensure accuracy and currency
• Existing models “jump start” the process
• Merging models – identifying and resolving redundancy and conflict across source models

[Diagram: existing data models are checked for currency and accuracy (validate), then combined into a single model (integrate) to produce the source logical model (ERM) – the top-down branch of the modeling flow.]
Understanding Source Content – Working Top-Down

[Diagram: the top-down branch of the modeling flow.] Steps:
• Identify, name & describe entities
• Map to data stores
• Identify, name & describe relationships
• Identify, name & describe attributes
Understanding Source Content – Working Bottom-Up
• Derive the data model from the File descriptions
• The source data element matrix serves as the tool to perform source data modeling
• Source modeling and source assessment work well together and share the same set of documentation techniques.
File          Field            Attribute (what fact?)     ID (key?)  Entity (what subject?)  Relationship (foreign key?)
APMS Premium  POLICY-NUMBER    Unique Policy ID           Yes        POLICY
APMS Premium  NAME             Policy Holder Name                    CUSTOMER
APMS Premium  ADDRESS          Policy Holder Address                 CUSTOMER
APMS Premium  PREMIUM-AMOUNT   Cost of Policy Premium                POLICY
APMS Premium  POLICY-TERM      Coverage Duration                     POLICY
APMS Premium  BEGIN-DATE       Start date of coverage                POLICY
APMS Premium  END-DATE         End date of coverage                  POLICY
APMS Premium  DISCOUNT-CD      Identify kind of discount  Partial    DISCOUNT
APMS Premium  SCHEDULE         Basis of discount amt                 DISCOUNT
APMS Policy   POLICY-NUMBER    Unique Policy ID           Yes        POLICY
APMS Policy   CUSTOMER-NUMBER  Unique customer ID         Yes        CUSTOMER   POLICY->CUSTOMER
APMS Policy   VIN              Vehicle ID number          Yes        VEHICLE    POLICY->VEHICLE
APMS Policy   MAKE             Vehicle Manufacturer                  VEHICLE
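Working bottom-up, the element matrix rows can be processed mechanically to propose entities and relationships for the logical model. A sketch over an illustrative subset of the APMS rows above:

```python
# Bottom-up modeling sketch: derive candidate entities and relationships
# from source data element matrix rows (illustrative subset).

rows = [
    # (file, field, key?, entity, relationship)
    ("APMS Policy",  "POLICY-NUMBER",   True,  "POLICY",   None),
    ("APMS Policy",  "CUSTOMER-NUMBER", True,  "CUSTOMER", "POLICY->CUSTOMER"),
    ("APMS Policy",  "VIN",             True,  "VEHICLE",  "POLICY->VEHICLE"),
    ("APMS Premium", "PREMIUM-AMOUNT",  False, "POLICY",   None),
]

entities = sorted({entity for _, _, _, entity, _ in rows})
relationships = sorted({rel for *_, rel in rows if rel})
print(entities)        # ['CUSTOMER', 'POLICY', 'VEHICLE']
print(relationships)   # ['POLICY->CUSTOMER', 'POLICY->VEHICLE']
```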
Integrating Multiple Views – Resolving Redundancy & Conflict
[Diagram steps: resolve redundancy and conflict; normalize; verify model; model states.]
Examine sets of entities and relationships:
• “customer places order” and “person places order” – customer and person are redundant
• “customer places order” and “customer sends order” – places is redundant with sends

Examine sets of entities and attributes:
• When differently named entities have a high degree of similarity in their sets of attributes, they are likely redundant
Understanding Source Content – Data Profiling: Looking at the Data

Look at the data to:
• discover patterns
• know how it is used
• understand data quality
• identify all data values
• classify and organize

3 types of profiling:
• Column profiling
• Dependency profiling
• Redundancy profiling
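Column profiling, the first of the three types, can be sketched as a per-column summary of counts, nulls, distinct values, and frequent values. The data values here are illustrative:

```python
# Column profiling sketch: summarize one column's counts, nulls,
# distinct values, and most frequent values. Data is illustrative.

from collections import Counter

def profile_column(values):
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }

status_codes = ["C", "", "C", "O", None, "C"]
print(profile_column(status_codes))
```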
Module 1 – Source Data Analysis & Modeling
Data Acquisition Concepts
Source Data Analysis
Source Data Modeling
Understanding the Scope of Data sources
Understanding Source Content
Logical to Physical Mapping
Data Acquisition Roles and Skills
Logical to Physical Mapping
• Two way connection
– What attribute is implemented by this field/column?
– Which fields/columns implement this attribute?
• Documenting the mapping
– Extend the source data element matrix to include all data sources
– Provides comprehensive documentation of source data, and detailed mapping from the business view to implemented data elements
Tracing Business data to Physical Implementation
Module 1 – Source Data Analysis & Modeling
Data Acquisition Concepts
Source Data Analysis
Source Data Modeling
Understanding the Scope of Data sources
Understanding Source Content
Logical to Physical Mapping
Data Acquisition Roles and Skills
Understanding the Source Systems – A Team Effort

• Understanding the source systems feeding the warehousing environment is a critical success factor
• All members of the warehousing team have a role in this effort
• The acquisition (ETL) team is generally responsible to:
  – document the source layouts
  – perform reverse engineering as needed
  – determine point of entry
  – identify system of record
  – look at actual data values
• The data modelers are likely to:
  – create the single source logical model (using the inputs gathered by the acquisition team)
• Business analysts/representatives are involved to:
  – look at the data and help understand the values
  – help to identify point of entry
  – help to determine system of record
Module 2 – Data Capture Analysis and Design
Module 2 – Data Capture Analysis & Design
Data Capture Concepts
Source/Target Mapping
Source Data Triage
Data Capture Design Considerations
Time and Data Capture
Data Capture Concepts – An Overview

• Data capture – the activities involved in getting data out of sources
• Synonym for data extraction

Data capture analysis – what to extract?
– Performed to understand requirements for data capture
  • Which data entities and elements are required by target data stores?
  • Which data sources are needed to meet target requirements?
  • What data elements need to be extracted from each data source?

Data capture design – when to extract? How to extract?
– Performed to understand and specify methods of data capture
  • Timing of extracts from each data source
  • Kinds of data to be extracted
  • Occurrences of data (all or changes only) to be captured
  • Change detection methods
  • Extract technique (snapshot or audit trail)
  • Data capture method (push from source or pull from warehouse)

[Diagram: source → Extract → Transform → Load → target]
Module 2 – Data Capture Analysis & Design
Data Capture Concepts
Source/Target Mapping
Source Data Triage
Data Capture Design Considerations
Time and Data Capture
Source/Target Mapping – Mapping Objectives

– Primary technique used to perform data capture analysis
– Mapping source data to target data
– Three levels of detail:
  • Entities
  • Data stores
  • Data elements
– The terms “source” and “target” describe roles that a data store may play, not fixed characteristics of a data store

[Diagram: source → Extract → Transform → Load → target; the source/target mapping balances data availability against data requirements.]
Source and Target as Roles

[Diagram: the intake/distribution/delivery flow, annotated with role pairs –
  Source: operational/external data → Target: staging data
  Source: staging data → Target: data warehouse
  Source: data warehouse → Target: data marts]

The terms “source” and “target” describe roles that a data store may play, not fixed characteristics of a data store.
Mapping Techniques

• Map source entities to target entities – using the source & target logical data models
• Map source data stores to target data stores – using structural & physical models
• Map source data elements to target data elements – the basis for designing transformations
Source/Target Mapping: EXAMPLES

ENTITY MAPPING – entities from the source logical data model (MEMBER, SALES TRANSACTION, PRODUCT, PRODUCT INVENTORY, RESTOCK TRANSACTION) are mapped to entities from the target logical models (CUSTOMER, CUSTOMER SALES TRANSACTION, PRODUCT SALES TRANSACTION, PRODUCT, INVENTORY TURNOVER) – e.g. MEMBER maps to CUSTOMER, while SALES TRANSACTION maps to both CUSTOMER SALES TRANSACTION and PRODUCT SALES TRANSACTION.

DATA STORE MAPPING – files/tables from the source structural model (MEMBERSHIP MASTER FILE, MEMBERSHIP ACTIVITY FILE, POINT-OF-SALE DETAIL FILE, PRODUCT INVENTORY FILE, WAREHOUSE TO STORE ACTIVITY TABLE) are mapped to target tables (CUSTOMER TABLE, CUSTOMER SALES TRANSACTION TABLE, PRODUCT SALES TRANSACTION TABLE, PRODUCT TABLE, INVENTORY TURNOVER TABLE).

DATA ELEMENT MAPPING – elements of the MEMBERSHIP MASTER and POINT OF SALE DETAIL files (member-number, membership-type, date-joined, date-last-renewed, term-last-renewed, date-of-last-activity, name, address, city-and-state, zip-code, transaction-ID, line-number) are mapped to target elements of the CUSTOMER, CUSTOMER ADDRESS, and SALES TRANSACTION tables (customer number, customer first name, customer last name, customer biz name, renewal date, customer address, customer city, customer state, customer zip code, transaction ID, line number, terminal ID, SKU ID, transaction quantity).
Source/Target Mapping: Full Set of Data Elements

[Diagram: the source/target map is built from the logical models, the data element matrix, file/table descriptions, business questions, and the physical designs. Beyond the elements required by business questions, the map grows with elements added by business input, elements added by triage, and elements added by transformation logic; the completed map then feeds triage and transform design.]
Module 2 – Data Capture Analysis & Design
Data Capture Concepts
Source/Target Mapping
Source Data Triage
Data Capture Design Considerations
Time and Data Capture
Source Data Triage

What to extract – opportunities
• Source/target mapping analyzes need; triage analyzes opportunity
• Triage is about extracting all data with potential value

What is triage?
• Source data structures are analyzed to determine the appropriate data elements for inclusion

Why triage?
• Ensures that a complete set of attributes is captured in the warehousing environment
• Rework is minimized

Triage and acquisition
• Performing triage is a joint effort between the acquisition team and the warehousing data modelers
The Triage Technique

Input: source systems needed for the increment
1. Select needed files
2. Identify elements addressing known business questions
3. Eliminate operational and redundant elements
4. Take all other business elements
Output: first draft of element mapping for the staging area or atomic DW
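The triage steps reduce to a simple filter: keep elements that answer known business questions or carry other business meaning, and drop purely operational or redundant elements. The element names and classifications below are assumptions for illustration:

```python
# Triage sketch: keep business elements, drop operational ones.
# Element names and classifications are illustrative assumptions.

elements = {
    "member-number":    "business-question",
    "membership-type":  "business-question",
    "last-update-user": "operational",
    "record-checksum":  "operational",
    "date-joined":      "other-business",
}

kept = [name for name, kind in elements.items() if kind != "operational"]
print(sorted(kept))  # ['date-joined', 'member-number', 'membership-type']
```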
Module 2 – Data Capture Analysis & Design
Data Capture Concepts
Source/Target Mapping
Source Data Triage
Data Capture Design Considerations
Time and Data Capture
Kinds of Data

[Example model:]
Customer – Member Number, Membership Type, Last Name, First Name, Date Joined, Last Activity Date, Last Update Date, Last Update Time, Last Update User ID
Sales Transaction – Transaction Date, Transaction Time, Payment Method, Transaction Status, Register ID, Store Number, Checker Employee ID, Member Number
Employee – Employee ID, Employee Name, Employee Address, Store Number
Store – Store Number, Store Name, Store Address, Phone Number, Manager Employee ID

Kinds of data illustrated: source system metadata (e.g. last update date/time/user), reference data (e.g. employee, store), event data (e.g. sales transaction), and source system keys (e.g. member number, store number).
Data Capture Methods

[Quadrant diagram:]
                     ALL DATA                        CHANGED DATA
PULL FROM SOURCE     Extract source files/tables     Extract source changes or transactions
PUSH TO WAREHOUSE    Replicate source files/tables   Replicate source changes or transactions
Detecting Data Changes

How to know which data has changed?

• Detecting changes at source
  • source date/time stamps
  • source transaction files/logs
  • replicate source data changes
  • DBMS logs
  • compare back-up files
• Detecting changes after extract
  • compare warehouse extract generations
  • compare warehouse extract to source system
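Of the "after extract" techniques, comparing warehouse extract generations can be sketched as a keyed diff between two extracts. The keys and row values are illustrative:

```python
# Change detection sketch: diff two extract generations keyed on a
# business key. Keys and row values are illustrative.

def diff_generations(previous, current):
    """Return inserts, updates, and deletes between two keyed extracts."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

prev = {1: ("Ann", "open"), 2: ("Ben", "open")}
curr = {1: ("Ann", "closed"), 3: ("Cy", "open")}
inserts, updates, deletes = diff_generations(prev, curr)
print(inserts, updates, deletes)
```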
Module 2 – Data Capture Analysis & Design
Data Capture Concepts
Source/Target Mapping
Source Data Triage
Data Capture Design Considerations
Time and Data Capture
Timing Issues

[Diagram: sources (OLTP) → data extraction → work tables → data transformation → warehouse loading → intake layer → data mart. Three timing concerns along the flow: frequency of acquisition (at extraction), latency of load (into the warehouse), and periodicity of the data marts.]
Source System Considerations

[Diagram: sources (OLTP) → data extraction → work tables.] Questions to answer:
• When is the data ready in each source system?
• How will I know when it’s ready?
• How long will it remain in the steady state?
• How will I know when source systems fail?
• How will I respond to source system failures?
• How will I recover from a failure?
Handling Time Variance – Techniques and Methods

• SNAPSHOT
  – Periodically posts records as of a specific point in time
  – Records all data of interest without regard to changes
  – Acquisition techniques to create snapshots:
    • DBMS replication
    • Full file unload or copy
• AUDIT TRAIL
  – Records details of each change to data of interest
  – Details may include date and time of change, how the change was detected, reason for change, before and after data values, etc.
  – Acquisition techniques:
    • DBMS triggers
    • DBMS replication
    • Incremental selection
    • Full file unload/copy

Important distinction between snapshot and audit trail: with audit trail techniques, only changed data is extracted and loaded; with a snapshot, all data is extracted and loaded, whether changed or not.
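The snapshot/audit-trail distinction can be sketched directly: a snapshot copies all data of interest as of a point in time, while an audit trail records only each change, with supporting detail. The field names are illustrative assumptions:

```python
# Snapshot vs. audit trail, sketched. Field names are illustrative.

from datetime import datetime

def take_snapshot(table):
    """Snapshot: record all data of interest, changed or not."""
    return {"as_of": datetime.now(), "rows": dict(table)}

def record_change(audit_trail, key, before, after, reason):
    """Audit trail: record only the change, with supporting detail."""
    audit_trail.append({"when": datetime.now(), "key": key,
                        "before": before, "after": after, "reason": reason})

claims = {42: "open"}
snap = take_snapshot(claims)           # full copy as of a point in time

trail = []
claims[42] = "closed"
record_change(trail, key=42, before="open", after="closed",
              reason="claim paid")     # only the change is captured
print(len(snap["rows"]), len(trail))   # 1 1
```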
Module 3 – Data Transformation Analysis & Design
Module 3 – Data Transformation Analysis & Design
Data Transformation Concepts
Transformation Analysis
Transformation Design
Transformation Rules and Logic
Transformation Sequences and Processes
Transformation Concepts – An Overview

• Data transformation
  • Changes that occur to the data after it is extracted
  • Transformation processing removes:
    – Complexities of operational environments
    – Conflicts and redundancies of multiple databases
    – Details of daily operations
    – Obscurity of highly encoded data
• Transformation analysis determines the requirements to:
  • Integrate disparate data
  • Change granularity of data
  • Assure data quality
• Transformation design
  • Specifies the processing needed to meet the requirements determined by transformation analysis
  • Determines the kinds of transformations:
    – Selection
    – Filtering
    – Conversion
    – Translation
    – Derivation
    – Summarization
  • Organizes them into programs, scripts, modules, jobs, etc. that are compatible with the chosen tools and technology
Module 3 – Data Transformation Analysis & Design
Data Transformation Concepts
Transformation Analysis
Transformation Design
Transformation Rules and Logic
Transformation Sequences and Processes
Data Integration Requirements

• Integration
  – Create a single view of the data
• Integration & staging data
  – organize data by business subjects
  – ensure integrated identity through use of common, shared business keys
• Integration & warehouse data
  – implement data standards, including derivation of conformed facts and structuring of conformed dimensions
  – ensure integration of internal identifiers – where staging integrates real-world keys, the warehouse needs to do the same for surrogate keys
• Integration & data marts
  – intended to satisfy business-specific/department-specific requirements
Data Granularity Requirements
• Granularity
  – Each change of data grain, from atomic data to progressively higher levels of summary, is achieved through transformation
• Granularity & Staging Data
– Staging data kept at atomic level
• Granularity & warehouse data
– In a 3 tier environment, warehouse should contain all common and standard summaries
• Granularity & Data marts
– Derivation of summaries specific to individual needs
Data Quality Requirements
• Data Cleansing
– process by which data quality needs are met
– ranges from filtering out bad data to replacing data values with some alternative default or derived values
• Cleansing & Staging Data
– the earlier the data is cleansed, the better the result
– sometimes important for staging data to reflect what was contained in the source systems
– Delay data cleansing transformation until data is moved from staging to warehouse
– Keep both cleansed and un-cleansed data in staging area
• Cleansing & Warehouse data
– data not cleansed in staging is cleansed before loaded into the warehouse
• Cleansing & Data marts
– cleansing at data marts is not necessarily desirable, however as a practical matter may be necessary
Module 3 – Data Transformation Analysis & Design
Data Transformation Concepts
Transformation Analysis
Transformation Design
Transformation Rules and Logic
Transformation Sequences and Processes
Transformation Design – Approach

transformation requirements → identify transformation rules & logic → determine transformation sequences → specify transformation processes → transformation specifications
Module 3 – Data Transformation Analysis & Design
Data Transformation Concepts
Transformation Analysis
Transformation Design
Transformation Rules and Logic
Transformation Sequences and Processes
Kinds of transformations
This Transformation type… is Used to…
Selection Choose one source to be used among multiple possibilities
Filtering Choose a subset of rows from a source data table, or a subset of records from a source data file
Conversion and Translation Change the format of data elements
Derivation Create new data values, which can be inferred from the values of existing data elements
Summarization Create new summary data values by aggregating existing detailed data (a change of data granularity)
Selection

Choose among alternative sources based upon selection rules.

[Diagram: extracted source #1 and extracted source #2 → Select → transformed target data; sometimes from source 1, sometimes from source 2.]

Example rule: “If membership type is individual, use member name from the membership master file; otherwise use member name from the business contact table.”
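The selection rule quoted above, sketched as code; the record layouts are assumptions:

```python
# Selection sketch: choose the name source based on membership type.
# Record layouts are illustrative assumptions.

def select_member_name(member, business_contact):
    if member["membership_type"] == "individual":
        return member["member_name"]           # membership master file
    return business_contact["member_name"]     # business contact table

member = {"membership_type": "business", "member_name": "A. Smith"}
contact = {"member_name": "Smith Hardware Inc."}
print(select_member_name(member, contact))  # Smith Hardware Inc.
```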
Filtering

Eliminate some data from the target set of data based on filtering rules.

[Diagram: extracted source data → Filter → transformed target data; some rows or values discarded.]

Example rule: “If the last 2 digits of the policy number are 04, 27, 46, or 89, extract data for the data mart; otherwise exclude the policy and all associated data.”
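The filtering rule quoted above, sketched as code; the policy numbers are illustrative:

```python
# Filtering sketch: keep policies whose last two digits fall in the
# sample set from the rule. Policy numbers are illustrative.

SAMPLE_SUFFIXES = {"04", "27", "46", "89"}

def keep_for_mart(policy_number):
    return policy_number[-2:] in SAMPLE_SUFFIXES

policies = ["1001-AX-04", "1002-BX-13", "1003-CX-89"]
print([p for p in policies if keep_for_mart(p)])  # ['1001-AX-04', '1003-CX-89']
```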
Conversion

Change data content and/or format based on conversion rules.

[Diagram: extracted source data → Convert → transformed target data; value/format in is different than value/format out.]

Example rule: “For policy history prior to 1994, reformat from Julian date to YYYYMMDD format. Default the century to 19.”
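The conversion rule quoted above, sketched for a two-digit-year Julian date (assumed YYDDD layout), defaulting the century to 19:

```python
# Conversion sketch: YYDDD Julian date -> YYYYMMDD, century default 19.
# The YYDDD layout is an assumption about the source format.

from datetime import date, timedelta

def julian_to_yyyymmdd(julian):
    """Convert e.g. '93046' (day 46 of 1993) to '19930215'."""
    year = 1900 + int(julian[:2])          # century defaults to 19
    day_of_year = int(julian[2:])
    d = date(year, 1, 1) + timedelta(days=day_of_year - 1)
    return d.strftime("%Y%m%d")

print(julian_to_yyyymmdd("93046"))  # 19930215
```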
Translation

Decode data whose values are encoded, based on rules for translation.

[Diagram: extracted source data (encoded values in) → Translate → transformed target data (both encoded and decoded values out).]

Example rule: “If membership-type-code is ‘C’, translate to ‘Business’; if membership-type-code is ‘P’, blank, or null, translate to ‘Individual’; otherwise translate to ‘Unknown’.”
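The translation rule quoted above, sketched as a lookup with a default:

```python
# Translation sketch: decode membership-type-code per the quoted rule.

def translate_membership_type(code):
    if code == "C":
        return "Business"
    if code in ("P", "", None) or (code and code.isspace()):
        return "Individual"       # 'P', blank, or null
    return "Unknown"              # any other value

print(translate_membership_type("C"))   # Business
print(translate_membership_type(None))  # Individual
print(translate_membership_type("Q"))   # Unknown
```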
Derivation

Use existing data values to create new data based on derivation rules.

[Diagram: extracted source data → Derive → transformed target data; new data values created, more values out than in.]

Example rule: “Total Premium Cost = base-premium-amount + (sum of all additional coverage amounts) − (sum of all discount amounts).”
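The derivation rule quoted above, sketched directly; the amounts are illustrative:

```python
# Derivation sketch: total premium cost per the quoted rule.

def total_premium_cost(base_premium, coverage_amounts, discount_amounts):
    return base_premium + sum(coverage_amounts) - sum(discount_amounts)

# 500 + (120 + 80) - 50 = 650
print(total_premium_cost(500.0, [120.0, 80.0], [50.0]))  # 650.0
```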
Summarization

Change data granularity based on rules of summarization.

[Diagram: extracted source data (atomic or base data in) → Summarize → transformed target data (summary data out).]

Example rules: “For each store (for each product line (for each day (count the number of transactions, accumulate the total dollar value of the transactions)))” and “for each week (sum daily transaction count, sum daily dollar total).”
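The store/product-line/day rule quoted above, sketched as a grouped aggregation; the transaction records are illustrative:

```python
# Summarization sketch: per (store, product line, day), count the
# transactions and accumulate their dollar value. Data is illustrative.

from collections import defaultdict

def summarize(transactions):
    summary = defaultdict(lambda: [0, 0.0])  # key -> [count, total dollars]
    for t in transactions:
        key = (t["store"], t["product_line"], t["day"])
        summary[key][0] += 1
        summary[key][1] += t["amount"]
    return dict(summary)

txns = [
    {"store": 1, "product_line": "dairy", "day": "20230410", "amount": 9.50},
    {"store": 1, "product_line": "dairy", "day": "20230410", "amount": 4.25},
]
print(summarize(txns))  # {(1, 'dairy', '20230410'): [2, 13.75]}
```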
Identifying Transformation Rules

[Source-to-target mapping matrix: source attributes of MEMBERSHIP MASTER (member-number, membership-type, date-joined, date-last-renewed, term-last-renewed, date-of-last-activity, name, address, city-and-state, zip-code) and POINT OF SALE DETAIL (transaction-ID, line-number, ...) are mapped (√) to target attributes of CUSTOMER (customer number, customer first name, customer last name, customer biz name, renewal date), CUSTOMER ADDRESS (customer number, customer address, customer city, customer state, customer zip code), and SALES TRANSACTION (transaction ID, line number, terminal ID, SKU ID, transaction quantity).]

For any source-to-target data element association, what needs exist for:
• selection?
• filtering?
• conversion?
• translation?
• derivation?
• summarization?
Specifying Transformation Rules

[The same source-to-target mapping matrix, with cells expanded to identify transformations by type and name, e.g. cleansing DTR027 (default value) and derivation DTR008 (derive name). The logic of the transformations is separately documented:]

DTR027 (Default Membership Type)
  If membership-type is null or invalid, assume “family” membership

DTR008 (Derive Name)
  If membership-type is “family”
    separate name using comma
    insert characters prior to comma in customer-last-name
    insert characters after comma in customer-first-name
  else move name to customer-biz-name
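The DTR008 logic above translates almost line-for-line into code. A minimal sketch; the output field names are assumptions, and the comma check is an added guard not part of the stated rule.

```python
def derive_name(name, membership_type):
    """Sketch of DTR008: for "family" memberships split 'Last, First'
    around the comma; otherwise move the whole name to the business name.
    (The comma check is an added guard, not part of the stated rule.)"""
    if membership_type == "family" and "," in name:
        last, first = name.split(",", 1)
        return {"customer_last_name": last.strip(),
                "customer_first_name": first.strip(),
                "customer_biz_name": None}
    return {"customer_last_name": None,
            "customer_first_name": None,
            "customer_biz_name": name}

print(derive_name("Smith, Pat", "family")["customer_first_name"])  # → Pat
print(derive_name("Acme Corp", "business")["customer_biz_name"])   # → Acme Corp
```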
Module 3 – Data Transformation Analysis & Design
Data Transformation Concepts
Transformation Analysis
Transformation Design
Transformation Rules and Logic
Transformation Sequences and Processes
Dependencies and Sequences
• Time Dependency – when one transformation rule must execute before another
  • example: summarization of derived data cannot occur before the derivation
• Rule Dependency – when execution of a transformation rule is based upon the result of another rule
  • example: different translations occur depending on the source chosen by a selection rule
• Grain Dependency – when developing one level of summary is based on the results of a previous summarization
  • example: quarters can’t be summarized into years before months are summarized into quarters
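These dependencies amount to a partial order over the rules, so one way to schedule them is a topological sort. A minimal sketch using Python's standard `graphlib`; the rule names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical rule dependencies: each rule maps to the rules it must wait for.
dependencies = {
    "derive_totals": {"select_rows"},          # time dependency
    "summarize_daily": {"derive_totals"},      # summarize only derived data
    "summarize_weekly": {"summarize_daily"},   # grain dependency
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)  # prerequisites always appear before their dependents
```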
Dependencies and Sequences

1. Identify the transformation rules
2. Understand rule dependency – package as modules
3. Understand time dependency – package as processes
4. Validate and define the test plan

Specify selection → Specify filtering → Specify conversion & translation → Specify derivation → Specify summarization
Modules and Programs

Transformation rules and the dependencies among rules drive the structures of modules, programs, scripts, etc.

DTR027 (Default Membership Type)
  If membership-type is null or invalid, assume “family” membership
DTR008 (Derive Name)
  If membership-type is “family”
    separate name using comma
    insert characters prior to comma in customer-last-name
    insert characters after comma in customer-first-name
  else move name to customer-biz-name
Job Streams & Manual Procedures – completing the ETL design

Transformation rules and their implementation, combined with extract & load dependencies, are packaged as automated & manual procedures covering scheduling, execution, verification, communication, and restart.
Internal and Confidential April 10, 2023
Module 4 – Data Transportation & Loading Design
Module 4 – Data Transport & Load Design
Data Transport and Load Concepts
Data Transport Design
Database Load Design
Overview

Source Data → Extract → Transform → Load → Target Data

Where do platform changes occur? (data transport, database load)
Module 4 – Data Transport & Load Design
Data Transport and Load Concepts
Data Transport Design
Database Load Design
Data Transport Issues

Source Data → Extract → Transform → Load → Target Data (data transport)

• which platforms?
• data volumes?
• transport frequency?
• network capacity?
• ASCII vs EBCDIC?
• data security?
• transport methods?
Data Transport Techniques

Source Data → Extract → Transform → Load → Target Data (data transport)

• Open FTP
• Secure FTP
• Alternatives to FTP
• Data compression
• Data encryption
• ETL tools
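Of the techniques listed, data compression is easy to illustrate: compressing an extract file before transport can cut network time substantially for text-heavy extracts. A minimal sketch using Python's standard `gzip`; the file layout and function name are assumptions.

```python
import gzip
import shutil
import tempfile

def compress_extract(path):
    """Gzip an extract file before transport; return the compressed file name."""
    out = path + ".gz"
    with open(path, "rb") as src, gzip.open(out, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out

# Demonstrate a round trip on a throwaway extract file.
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".dat")
tmp.write(b"member-number|membership-type\n0001|C\n")
tmp.close()

gz = compress_extract(tmp.name)
with gzip.open(gz, "rb") as f:
    restored = f.read()
print(restored == b"member-number|membership-type\n0001|C\n")  # → True
```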
Module 4 – Data Transport & Load Design
Data Transport and Load Concepts
Data Transport Design
Database Load Design
Database Load Issues

Source Data → Extract → Transform → Load → Target Data (database load)

• which DBMS?
• relational vs dimensional?
• tables & indices?
• load frequency?
• load timing?
• data volumes?
• exception handling?
• restart & recovery?
• load methods?
• referential integrity?
Populating Tables• Drop and rebuild the tables
• Insert (only) rows into a table
• Delete old rows and insert changed rows
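The third option, delete-and-insert, can be sketched against an in-memory SQLite table. The table and rows are illustrative assumptions, not from the deck.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_number TEXT PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO customer VALUES ('C1', 'Old Name')")

# Changed rows arriving from the transform stage (one update, one new row)
changed_rows = [("C1", "New Name"), ("C2", "Added Row")]

# Delete old versions of the changed rows, then insert the new versions.
con.executemany("DELETE FROM customer WHERE customer_number = ?",
                [(c,) for c, _ in changed_rows])
con.executemany("INSERT INTO customer VALUES (?, ?)", changed_rows)

rows = sorted(con.execute("SELECT * FROM customer"))
print(rows)  # → [('C1', 'New Name'), ('C2', 'Added Row')]
```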
Indexing

Load → Tables, Indices

• update at load?
• drop & rebuild?
• index segmentation?
Updating

• Isn’t the warehouse read only?
• Updating Business Data
• Updating Row-level Metadata

Allow updating of rows in tables?
Referential Integrity
• RI is the condition where every reference to another table has a foreign key/primary key match.
• Three common options for RI:
  • DBMS checking
  • Test load files before load using a tool/custom application
  • Test database(s) after load using a tool/custom application
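The second option, testing load files before the load, can be sketched as a check that every foreign key in the load file has a matching primary key in the parent table. The field names are illustrative assumptions.

```python
def find_orphans(load_rows, parent_keys):
    """Pre-load RI test: return rows whose foreign key has no
    matching primary key in the parent table."""
    return [row for row in load_rows if row["customer_number"] not in parent_keys]

parent_keys = {"C1", "C2"}  # primary keys already in the CUSTOMER table
load_rows = [
    {"transaction_id": "T1", "customer_number": "C1"},
    {"transaction_id": "T2", "customer_number": "C9"},  # no such customer
]

orphans = find_orphans(load_rows, parent_keys)
print([r["transaction_id"] for r in orphans])  # → ['T2']
```

Orphan rows would then be routed to exception processing rather than loaded.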
Timing Considerations• User Expectations
• Data Readiness
• Database synchronization
Exception Processing

[Flow: Transform → Load → Target data when rows are ok; exceptions are suspended to a log and reported, then corrected and reloaded or discarded.]
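The routing decision in that flow can be sketched as a small dispatcher: valid rows load, recoverable exceptions suspend, the rest are discarded. The validation and recoverability predicates are illustrative assumptions.

```python
def route_row(row, is_valid, is_recoverable):
    """Route a transformed row: load it when valid, suspend it when the
    exception is recoverable, otherwise discard it."""
    if is_valid(row):
        return "load"
    return "suspend" if is_recoverable(row) else "discard"

# Hypothetical predicates: a row is valid when its amount is positive;
# a row with no amount at all is considered unrecoverable.
is_valid = lambda r: r.get("amount", 0) > 0
is_recoverable = lambda r: "amount" in r

print(route_row({"amount": 10}, is_valid, is_recoverable))  # → load
print(route_row({"amount": -1}, is_valid, is_recoverable))  # → suspend
print(route_row({}, is_valid, is_recoverable))              # → discard
```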
Integrating with ETL processes

• Loading as a part of a single transform & load job stream
• Loads triggered by completion of transform job stream
• Loads triggered by verification of transforms
• Parallel ETL processing
• Loading partitions
• Updating summary tables

[Diagram: EXTRACT, TRANSFORM, and LOAD stages each manage scheduling, execution, verification, and restart; the stages are connected by communication, dependencies, restart/recovery, process metadata, tool capabilities, scheduling dependencies, and parallel processing.]
Module 5 – Implementation Guidelines
Module 5 – Implementation Guidelines
Data Acquisition Technology
ETL Summary
Technology in Data Acquisition

ETL Technology:
• Data Mapping
• Data Conversion
• Data Cleansing
• Data Movement
• Data Transformation
• Database Loading
• Data Access

Supporting technology: Source Systems, Database Management, Storage Management, Metadata Management
ETL - Critical Success Factors

[Matrix mapping each factor to the areas it applies to: Intake, Distribution, Information Delivery, Integration, Granularity, Cleansing.]

1. Design for the future, not for the present
2. Capture and store only changed data
3. Fully understand source systems and data
4. Allow enough time to do the job right
5. Use the right sources, not the easy ones
6. Pay attention to data quality
7. Capture comprehensive ETL metadata
8. Test thoroughly and according to a test plan
9. Distinguish between one-time and ongoing loads
10. Use the right technology for the right reasons
11. Triage source attributes
12. Capture atomic level detail
13. Strive for subject orientation and integration
14. Capture history of changes in audit trail form
15. Modularize ETL processing
16. Ensure that business data is non-volatile
17. Use bulk loads and/or insert-only processing
18. Complete subject orientation and integration
19. Use the right data structures (relational vs. dimensional)
20. Use shared transformation rules and logic
21. Design for distribution first, then for access
22. Fully understand each unique access need
23. Use DBMS update capabilities
24. Design for access before other purposes
25. Design for access tool capabilities
26. Capture quality metadata and report data quality
Data Store Roles
Data Transformation Roles
Exercises
Exercise 1: Source Data Options
Exercise 2: Source Data Modeling
Exercise 3: Data Capture
Exercise 4: Data Transformation
Exercise 5: Data Acquisition Decision