ETL Extract

ETL Extract. Design Logical before Physical Have a plan Identify Data source candidates Analyze source systems with data- profiling tools Receive walk-through

ETL

Extract

Design Logical before Physical

• Have a plan
• Identify data source candidates
• Analyze source systems with data-profiling tools
• Receive walk-through of data lineage and business rules
• Receive walk-through of data warehouse model
• Validate calculations and formulas

Logical Data Map

• Used to collect and document source systems to be used for DW

• Should contain the following:
  – Target table name
  – Target column name
  – Table type
  – SCD type
  – Source db
  – Source table name
  – Source column name
  – Transformation
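
As an illustration, the fields listed above could be captured as one record per target column. This is only a sketch: the class name, the helper `sources_for`, and the sample rows are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogicalDataMapEntry:
    """One row of a logical data map: a target column and where it comes from."""
    target_table: str
    target_column: str
    table_type: str           # e.g. "dimension" or "fact"
    scd_type: Optional[int]   # slowly changing dimension type; None if not applicable
    source_db: str
    source_table: str
    source_column: str
    transformation: str       # free-text description of the rule


def sources_for(entries, target_table):
    """Return the distinct (source db, source table) pairs feeding one target table."""
    return sorted({(e.source_db, e.source_table)
                   for e in entries if e.target_table == target_table})


# Hypothetical sample rows, for illustration only.
entries = [
    LogicalDataMapEntry("dim_product", "status_code", "dimension", 1,
                        "erp", "prod_status", "stat_cd", "convert to 4-digit code"),
    LogicalDataMapEntry("dim_product", "product_name", "dimension", 2,
                        "erp", "product", "name", "trim whitespace"),
]
print(sources_for(entries, "dim_product"))
```

Keeping the map as structured rows rather than free text makes it possible to answer questions such as "which source tables feed this target?" mechanically.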

Track Volume

• Volume worksheet
  – Staging table name
  – Update strategy
  – Load frequency
  – ETL jobs/programs
  – Initial row count
  – Avg. row length
  – Grows with
  – Expected monthly rows
  – Expected monthly bytes
  – Initial table size bytes
  – Table size 6 mo.
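
The size columns of the worksheet can be derived from the row counts and the average row length. The sketch below assumes simple linear growth and ignores index and storage overhead; the function name and the sample figures are hypothetical.

```python
def estimate_volume(initial_rows, avg_row_bytes, monthly_rows, months=6):
    """Derive the byte-size columns of a volume worksheet.

    ASSUMPTION: growth is linear and every row has the average length.
    """
    initial_bytes = initial_rows * avg_row_bytes
    monthly_bytes = monthly_rows * avg_row_bytes
    return {
        "initial_table_size_bytes": initial_bytes,
        "expected_monthly_bytes": monthly_bytes,
        "table_size_after_months_bytes": initial_bytes + months * monthly_bytes,
    }


# Hypothetical staging table: 1,000,000 initial rows of 200 bytes,
# growing by 50,000 rows per month.
result = estimate_volume(1_000_000, 200, 50_000)
print(result)
```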

Source System Tracking

• Used to document source systems and who is responsible for them

• Should be maintained continuously, not treated as a one-time effort

• May also serve as reference for future phases of the DW

Source System Tracking

• Should contain the following:
  – Subject area
  – Interface name
  – Business name
  – Priority
  – Department/business use
  – Business owner
  – Technical owner
  – DBMS
  – Production server/OS
  – # of daily users
  – DB size
  – DB complexity
  – # transactions per day
  – Comments

System of Record

• Originating source of data
• As much as possible, extract only from the system-of-record
• The farther away from the system-of-record, the higher the risk that the data is corrupted

Data Profiling: Source System Analysis

• Reengineer ERD of source system
• Focus on the following:
  – Primary keys
  – Data types
  – Relationships
  – Cardinalities
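
Much of this information can be read from the source database's catalog. The sketch below uses Python's built-in `sqlite3` module as a stand-in for a real source system; the function `profile_schema` and the two sample tables are hypothetical, and other DBMSs expose the same metadata through their own catalogs. Cardinalities are not covered here because they require counting actual rows.

```python
import sqlite3


def profile_schema(conn):
    """Collect primary keys, data types, and foreign-key relationships per table."""
    profile = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    for table in tables:
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        fks = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        profile[table] = {
            "primary_key": [c[1] for c in columns if c[5] > 0],
            "data_types": {c[1]: c[2] for c in columns},
            # (local column, referenced table, referenced column)
            "references": [(fk[3], fk[2], fk[4]) for fk in fks],
        }
    return profile


# Hypothetical source schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        amount REAL
    );
""")
profile = profile_schema(conn)
for table, info in profile.items():
    print(table, info)
```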

Data Profiling: Data Content Analysis

• Nulls
  – Null is not equal to Null
  – Data loss
• Dates
  – Different formatting
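
Both pitfalls can be shown in a few lines. In the sketch below, the tables, the sample rows, and the list of accepted date formats are assumptions made for illustration. The first part shows that a join on a nullable column silently drops rows, because `NULL = NULL` does not evaluate to true in SQL; the second part normalizes dates that arrive in different formats.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, region_code TEXT);
    CREATE TABLE regions (region_code TEXT, region_name TEXT);
    INSERT INTO orders VALUES (1, 'N'), (2, NULL), (3, 'S');
    INSERT INTO regions VALUES ('N', 'North'), ('S', 'South');
""")

# NULL = NULL yields NULL (Python None), not true.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# An inner join on the nullable column loses order 2.
inner_ids = [row[0] for row in conn.execute(
    "SELECT o.order_id FROM orders o "
    "JOIN regions r ON o.region_code = r.region_code ORDER BY o.order_id")]

# A left outer join keeps every order; the missing region comes back as None.
outer_rows = conn.execute(
    "SELECT o.order_id, r.region_name FROM orders o "
    "LEFT JOIN regions r ON o.region_code = r.region_code "
    "ORDER BY o.order_id").fetchall()

# ASSUMPTION: these are the formats found while profiling the sources.
# Order matters when a value could match more than one format.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def parse_date(text, formats=DATE_FORMATS):
    """Try each known format in turn and return a datetime.date."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


print(null_equals_null, inner_ids, outer_rows)
print(parse_date("15/03/2024"), parse_date("20240315"))
```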

Business Rules

• Dimensional model
  – STATUS CODE: 4-digit code that uniquely identifies the status of the product. It has a short description (usually 1 word) and a long description (usually 1 sentence)
• ETL
  – STATUS CODE: 4-digit code; however, some existing legacy codes have 3 digits and are still being used. These have to be converted to 4-digit codes. If the name of the code contains “OBSOLETE”, it needs to be removed and the obsolete flag set to “Y”
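
The ETL rule above might be sketched as follows. The slide does not say how a 3-digit code maps to a 4-digit one, so left-padding with a zero is an assumption here, as are the reading that the word “OBSOLETE” is removed from the name, the flag value "N" for non-obsolete codes, and the function name.

```python
def clean_status_code(code, name):
    """Apply the STATUS CODE business rule to one source row.

    Returns (four_digit_code, cleaned_name, obsolete_flag).
    """
    code = code.strip()
    if len(code) == 3 and code.isdigit():
        # ASSUMPTION: legacy 3-digit codes are converted by left-padding with "0".
        code = code.zfill(4)
    elif not (len(code) == 4 and code.isdigit()):
        raise ValueError(f"unexpected status code: {code!r}")

    obsolete = "OBSOLETE" in name
    if obsolete:
        # ASSUMPTION: the marker word is stripped from the name;
        # leftover whitespace is collapsed.
        name = " ".join(name.replace("OBSOLETE", " ").split())
    # ASSUMPTION: "N" is used when the code is not obsolete.
    return code, name, "Y" if obsolete else "N"


print(clean_status_code("123", "Discontinued OBSOLETE"))
print(clean_status_code("4567", "Active"))
```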

Heterogeneous Sources

• Sources may be of the following formats/platforms:
  – ODBC
  – Mainframes (EBCDIC, ASCII)
  – Flat files (delimited, fixed length)
  – XML
  – Web logs
  – ERP systems
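
Three of these formats can be read with the Python standard library alone. In the sketch below, the record layout, field names, and sample data are hypothetical. `cp037` is one of several EBCDIC code pages that Python ships with; the right one depends on the mainframe in question.

```python
import csv
import io


def parse_delimited(text, delimiter=","):
    """Parse a delimited flat file whose first line holds the column names."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def parse_fixed_length(line, layout):
    """Parse one fixed-length record; layout is a list of (name, start, end)."""
    return {name: line[start:end].strip() for name, start, end in layout}


def decode_ebcdic(raw, codec="cp037"):
    """Decode EBCDIC bytes to text. ASSUMPTION: the source uses code page 037."""
    return raw.decode(codec)


# Hypothetical layout: 5-char product id, 4-char status code, 20-char name.
LAYOUT = [("product_id", 0, 5), ("status", 5, 9), ("name", 9, 29)]
record = "00042" + "0123" + "Widget".ljust(20)

print(parse_delimited("product_id|status\n00042|0123\n", delimiter="|"))
print(parse_fixed_length(record, LAYOUT))
print(decode_ebcdic(b"\xc8\xc5\xd3\xd3\xd6"))
```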

Extracting Changed Data

• Initial vs. Incremental
  – Initial: loading all data from a predetermined point in time
  – Incremental: loading changes to data
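
The difference between the two kinds of extract can be sketched as two queries. This example assumes the source table carries `created_at` and `last_modified` columns holding ISO-8601 date strings; the table, the column names, and the sample rows are hypothetical, and `sqlite3` stands in for the real source.

```python
import sqlite3


def initial_extract(conn, start_point):
    """Initial load: every row created on or after the predetermined start point."""
    return conn.execute(
        "SELECT id, name FROM product WHERE created_at >= ? ORDER BY id",
        (start_point,)).fetchall()


def incremental_extract(conn, last_extract):
    """Incremental load: only rows changed since the previous extract."""
    return conn.execute(
        "SELECT id, name FROM product WHERE last_modified > ? ORDER BY id",
        (last_extract,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER, name TEXT, created_at TEXT, last_modified TEXT);
    INSERT INTO product VALUES
        (1, 'Widget',    '2023-01-10', '2023-01-10'),
        (2, 'Gadget',    '2024-02-01', '2024-02-01'),
        (3, 'Gizmo',     '2024-03-05', '2024-06-20'),
        (4, 'Doohickey', '2024-07-01', '2024-07-01');
""")

print(initial_extract(conn, "2024-01-01"))      # rows 2, 3 and 4
print(incremental_extract(conn, "2024-06-01"))  # rows 3 and 4
```

ISO-8601 strings sort in chronological order, which is why plain string comparison is enough in this sketch.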

Detecting Changes

• DB audit columns and tables
• DB log scraping or sniffing
• Timed extracts
• Elimination
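
Of these techniques, elimination needs no support from the source system. The sketch below reads "elimination" as keeping a copy of the previous full extract and comparing it with the current one; that reading, the function name, and the sample rows are assumptions for illustration.

```python
def diff_extracts(previous, current):
    """Compare two full extracts keyed by primary key.

    Both arguments map a key to the row's remaining column values.
    Returns (inserted, updated, deleted) dictionaries.
    """
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserted, updated, deleted


# Hypothetical extracts: key -> (name, status code).
previous = {1: ("Widget", "0123"), 2: ("Gadget", "0456"), 3: ("Gizmo", "0789")}
current = {1: ("Widget", "0123"), 2: ("Gadget", "0999"), 4: ("Doohickey", "0001")}

inserted, updated, deleted = diff_extracts(previous, current)
print(inserted, updated, deleted)
```

Unlike audit columns or timed extracts, this comparison also detects rows deleted from the source, at the cost of storing and scanning a full copy of the previous extract.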