clyde-potter
ETL
Extract
Design Logical before Physical
• Have a plan
• Identify data source candidates
• Analyze source systems with data-profiling tools
• Receive walk-through of data lineage and business rules
• Receive walk-through of data warehouse model
• Validate calculations and formulas
Logical Data Map
• Used to collect and document source systems to be used for DW
• Should contain the following:
– Target table name
– Target column name
– Table type
– SCD type
– Source db
– Source table name
– Source column name
– Transformation
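The logical data map can be kept as structured records rather than free text, which makes incomplete mappings easy to spot. The following is a minimal sketch, not a prescribed format: the class name, the helper `missing_fields`, and every value in the example row are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class LogicalMapEntry:
    """One row of a logical data map; fields mirror the list above."""
    target_table: str
    target_column: str
    table_type: str       # e.g. "dimension" or "fact"
    scd_type: str         # slowly changing dimension type; "n/a" where it does not apply
    source_db: str
    source_table: str
    source_column: str
    transformation: str


# Hypothetical example row; none of these names come from a real system.
entry = LogicalMapEntry(
    target_table="dim_product",
    target_column="status_code",
    table_type="dimension",
    scd_type="1",
    source_db="legacy_erp",
    source_table="PROD_MASTER",
    source_column="STAT_CD",
    transformation="convert 3-digit legacy codes to 4 digits",
)


def missing_fields(e):
    """Return the names of fields left blank in a map entry."""
    return [f.name for f in fields(e) if not getattr(e, f.name).strip()]
```

Running `missing_fields` over every entry before development starts is one way to catch target columns that still have no documented source.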
Track Volume
• Volume worksheet
– Staging table name
– Update strategy
– Load frequency
– ETL jobs/programs
– Initial row count
– Avg. row length
– Grows with
– Expected monthly rows
– Expected monthly bytes
– Initial table size (bytes)
– Table size at 6 months
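Several worksheet columns can be derived from the others. The sketch below assumes simple linear growth (bytes = rows × average row length) and ignores indexes, compression, and storage overhead; the function name and the sample numbers are illustrative only.

```python
def volume_estimate(initial_rows, avg_row_length, monthly_rows, months=6):
    """Derive the size columns of a volume worksheet from the row counts.

    ASSUMPTION: growth is linear and size is rows * average row length.
    """
    initial_bytes = initial_rows * avg_row_length
    monthly_bytes = monthly_rows * avg_row_length
    return {
        "initial_table_size_bytes": initial_bytes,
        "expected_monthly_bytes": monthly_bytes,
        "table_size_after_months": initial_bytes + months * monthly_bytes,
    }


# Hypothetical staging table: 1,000,000 initial rows of 120 bytes,
# growing by 50,000 rows per month.
est = volume_estimate(initial_rows=1_000_000, avg_row_length=120, monthly_rows=50_000)
```

With these sample inputs the initial size is 120,000,000 bytes, monthly growth is 6,000,000 bytes, and the six-month size is 156,000,000 bytes.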
Source System Tracking
• Used to document source systems and who is responsible for them
• Should be maintained continuously, not a one-time effort
• May also serve as reference for future phases of the DW
• Should contain the following:
– Subject area
– Interface name
– Business name
– Priority
– Department/business use
– Business owner
– Technical owner
– DBMS
– Production server/OS
– # of daily users
– DB size
– DB complexity
– # transactions per day
– Comments
System of Record
• Originating source of data
• As much as possible, extract only from the system-of-record
• The farther away from the system-of-record, the higher the risk that the data is corrupted
Data Profiling: Source System Analysis
• Reverse-engineer the ERD of the source system
• Focus on the following:
– Primary keys
– Data types
– Relationships
– Cardinalities
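When no profiling tool is at hand, much of this information can be read from the database catalog. The sketch below uses an in-memory SQLite database as a stand-in for a source system; the `customer` and `orders` tables are hypothetical, and other DBMSs expose the same facts through their own catalog views rather than SQLite's `PRAGMA` statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        amount      REAL
    );
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);
""")


def profile_table(conn, table):
    """Collect primary key, data types, and foreign keys for one table."""
    # table_info rows: (cid, name, type, notnull, dflt_value, pk)
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
    fks = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    return {
        "primary_key": [c[1] for c in cols if c[5] > 0],
        "data_types": {c[1]: c[2] for c in cols},
        "relationships": [(fk[3], fk[2], fk[4]) for fk in fks],
    }


profile = profile_table(conn, "orders")

# Cardinality check: the largest number of orders held by any one customer.
# A value above 1 indicates a one-to-many relationship in the actual data.
max_orders_per_customer = conn.execute(
    "SELECT MAX(n) FROM (SELECT COUNT(*) AS n FROM orders GROUP BY customer_id)"
).fetchone()[0]
```

Declared constraints describe the intended model; the cardinality query checks what the data actually contains, which is the part that tends to surprise.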
Data Profiling: Data Content Analysis
• Nulls
– Null is not equal to Null
– Data loss
• Dates
– Different formatting
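Both points can be shown in a few lines. The sketch below uses in-memory SQLite to show that `NULL = NULL` is not true and that a join on a nullable column silently drops rows, then normalizes dates arriving in several formats. The table names and the `DATE_FORMATS` list are assumptions made for the example; the formats a real source uses have to be established during profiling.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")

# In SQL, comparing NULL with NULL yields NULL (unknown), not true.
null_eq = conn.execute("SELECT NULL = NULL").fetchone()[0]

conn.executescript("""
    CREATE TABLE src (id INTEGER, region TEXT);
    CREATE TABLE lookup (region TEXT, label TEXT);
    INSERT INTO src VALUES (1, 'N'), (2, NULL);
    INSERT INTO lookup VALUES ('N', 'North'), (NULL, 'Unknown');
""")

# The row with a NULL region never matches, so the inner join loses it.
inner_rows = conn.execute(
    "SELECT COUNT(*) FROM src JOIN lookup ON src.region = lookup.region"
).fetchone()[0]
total_rows = conn.execute("SELECT COUNT(*) FROM src").fetchone()[0]

# ASSUMPTION: these are the formats found in the sources.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y")


def parse_date(text):
    """Try each known format in turn and return a date object."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")
```

Here the join returns one row out of two. A value such as 03/04/2004 is ambiguous between month-first and day-first conventions, so the format order should follow what profiling shows for each source.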
Business Rules
• Dimensional model
– STATUS CODE: 4-digit code that uniquely identifies the status of the product. It has a short description (usually 1 word) and a long description (usually 1 sentence).
• ETL
– STATUS CODE: 4-digit code; however, some existing legacy codes that are still in use have 3 digits. These have to be converted to 4-digit codes. If the name of the code contains “OBSOLETE”, that word needs to be removed and the obsolete flag set to “Y”.
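The ETL version of the rule translates fairly directly into a cleansing function. The sketch below makes three assumptions the slide does not state: 3-digit codes are converted by left-padding with a zero, "removed" means the word OBSOLETE is stripped from the name, and the flag is "N" when the word is absent. The function name and output keys are hypothetical.

```python
import re

OBSOLETE_WORD = re.compile(r"\bOBSOLETE\b", re.IGNORECASE)


def clean_status(code, name):
    """Apply the ETL business rule for STATUS CODE to one source row."""
    code = code.strip()
    if len(code) == 3 and code.isdigit():
        # ASSUMPTION: legacy 3-digit codes become 4 digits by zero-padding.
        code = code.zfill(4)
    if not (len(code) == 4 and code.isdigit()):
        raise ValueError(f"status code is not 3 or 4 digits: {code!r}")

    obsolete = bool(OBSOLETE_WORD.search(name))
    if obsolete:
        # Strip the word and collapse any leftover whitespace.
        name = " ".join(OBSOLETE_WORD.sub(" ", name).split())

    return {
        "status_code": code,
        "name": name,
        # ASSUMPTION: "N" is the value for codes that are not obsolete.
        "obsolete_flag": "Y" if obsolete else "N",
    }
```

For example, `clean_status("123", "Discontinued OBSOLETE")` yields code `0123`, name `Discontinued`, and flag `Y`. The actual conversion from 3 to 4 digits has to come from the business owners; zero-padding is only a placeholder.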
Heterogeneous Sources
• Sources may be of the following formats/platforms:
– ODBC
– Mainframes (EBCDIC, ASCII)
– Flat files (delimited, fixed length)
– XML
– Web logs
– ERP systems
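Two of these formats are easy to illustrate with the Python standard library. The sketch below decodes a fixed-length EBCDIC record using the `cp037` codec (EBCDIC US/Canada) and reads a pipe-delimited record with the `csv` module. The record layout, field names, and sample values are hypothetical; a real layout comes from the source system's copybook or file specification.

```python
import csv
import io

# ASSUMPTION: a 14-character record with a 4-character id and a 10-character name.
LAYOUT = [("cust_id", 0, 4), ("name", 4, 14)]


def parse_fixed(record_bytes, encoding="cp037"):
    """Decode one fixed-length record and slice it into named fields."""
    text = record_bytes.decode(encoding)
    return {field: text[start:end].strip() for field, start, end in LAYOUT}


# Build a sample EBCDIC record; the bytes differ from the ASCII encoding.
raw = "0042SMITH     ".encode("cp037")
row = parse_fixed(raw)

# The same data as a delimited flat file.
delimited = io.StringIO("0042|SMITH\n")
rows = list(csv.reader(delimited, delimiter="|"))
```

This only covers character data. Mainframe files often also carry packed-decimal numeric fields, which cannot be decoded as text and need separate handling.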
Extracting Changed Data
• Initial vs. incremental
– Initial: loading all data from a predetermined point in time
– Incremental: loading changes to data
Detecting Changes
• DB audit columns and tables
• DB log scraping or sniffing
• Timed extracts
• Elimination
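Two of these techniques can be sketched in plain Python. The first filters on an audit column; the second detects changes by elimination, comparing the previous full extract with the current one. The column name `last_modified`, the function names, and the sample rows are all hypothetical.

```python
from datetime import datetime


def changed_since(rows, last_extract):
    """Audit-column approach: keep rows modified after the previous run."""
    return [r for r in rows if r["last_modified"] > last_extract]


def diff_by_elimination(previous, current):
    """Elimination approach: compare two full extracts keyed by primary key."""
    inserts = [k for k in current if k not in previous]
    deletes = [k for k in previous if k not in current]
    updates = [k for k in current if k in previous and current[k] != previous[k]]
    return {"inserts": inserts, "updates": updates, "deletes": deletes}


source = [
    {"id": 1, "name": "Ann", "last_modified": datetime(2024, 1, 1)},
    {"id": 2, "name": "Bob", "last_modified": datetime(2024, 1, 10)},
]
recent = changed_since(source, last_extract=datetime(2024, 1, 5))

previous = {1: ("Ann",), 2: ("Bob",), 4: ("Dee",)}
current = {1: ("Ann",), 2: ("Robert",), 3: ("Cy",)}
changes = diff_by_elimination(previous, current)
```

The audit-column approach is cheap but depends on the source reliably maintaining the timestamp and cannot see deleted rows. Elimination needs a retained copy of the previous extract, yet it catches inserts, updates, and deletes without relying on the source.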