Zhangxi Lin
Texas Tech University
ISQS 6339, Data Management & Business Intelligence
Extraction, Transformation, and Loading (II)
Agenda
I. Using SSIS for ETL
◦ Integration Services
◦ Learn by doing
◦ Package items
◦ Problem-oriented package development
II. The Principle of ETL
◦ Extraction
◦ Transformation
◦ Loading
II. The Principle of ETL
Structure and Components of Business Intelligence
[Diagram labels: SSMS, SSIS, SSAS, SSRS, SAS EM, SAS EG]
Automating your routine information processing tasks
Your routine information processing tasks:
◦ Read online news at 8:00 a.m. and collect a few of the most important pieces
◦ Retrieve data from a database to draft a short daily report at 10 a.m.
◦ View and reply to emails and take notes that are saved in a database
◦ View 10 companies' web pages to see the updates, and enter the summaries into a database
◦ Browse three popular magazines twice a week, and enter the summaries into a database
◦ Generate a few one-way and two-way frequency tables and put them on the web
◦ Merge datasets collected by other people into a main database
◦ Prepare a weekly report from the database at 4 p.m. every Monday and publish it to the internal portal site
◦ Prepare a monthly report at 11 a.m. on the first day of each month, which must be converted into a PDF file and uploaded to the website
Many things are going on at once. How do you handle them properly and at the right time?
◦ An organizer: yes
◦ How about regular data processing tasks?
Information Processing and Information Flow
Transaction processing: interactions between a user and a computer application system, with immediate responses from the application
Operational processing: use of a computer to control a process
Batch processing: a series of executions, each of which is applied to a set of data and passes its result to the next one
Analytical processing: the interaction between analysts and collections of aggregated data that may have been reformulated into alternative representational forms for improved analytical performance
Extraction, Transformation, Loading (ETL) Processes
Extract source data
Transform/clean data
Index and summarize
Load data into warehouse
Detect changes
Refresh data
ETL
Operational systems
Data Warehouse
Programs
Tools
Gateways
ETL: Tasks, Importance, and Cost
Operational systems
Relevant
Useful
Quality
Accurate
Accessible
Data Warehouse
ETL
Extract
Clean up
Consolidate
Restructure
Load
Maintain
Refresh
Data mapping
Transform
Operational databases
Data staging area
Warehouse database
Extracting Data
Source systems
◦ Data from various data sources in various formats
Extraction routines
◦ Developed to select data fields from sources
◦ Consist of business rules, audit trails, and error-correction facilities
Production Data
Operating system platforms
File systems
Database systems and vertical applications
IMS
DB2
Oracle
Sybase
Informix
VSAM
SAP
Shared Medical Systems
Dun and Bradstreet Financials
Hogan Financials
Oracle Financials
Archive Data
Historical data
Useful for analysis over long periods of time
Useful for first-time load
May require unique transformations
Operational databases
Warehouse database
Internal Data
Planning, sales, and marketing organization data
Maintained in the form of:
◦ Spreadsheets (structured)
◦ Documents (unstructured)
Treated like any other source data
Warehouse database
Planning
Accounting
Marketing
External Data
Information from outside the organization
Issues of frequency, format, and predictability
Described and tracked using metadata
A.C. Nielsen, IRI, IMS, Walsh America
Barron's
Dun and Bradstreet
Purchased databases
Wall Street Journal
Economic forecasts
Competitive information
Warehousing databases
Possible ETL Failures
A missing source file
A system failure
Inadequate metadata
Poor mapping information
Inadequate storage planning
A source structural change
No contingency plan
Inadequate data validation
Maintaining ETL Quality
ETL must be:
◦ Tested
◦ Documented
◦ Monitored and reviewed
Disparate metadata must be coordinated.
Transformation
Transformation eliminates anomalies from operational data:
◦ Cleans and standardizes
◦ Presents subject-oriented data
Extract
Warehouse
Load
Operational systems
Data Staging Area
Transform:
Clean up
Consolidate
Restructure
Remote Staging Model
◦ Data staging area within the warehouse environment
◦ Data staging area in its own environment
[Diagram labels, both variants: Operational system, Extract, Staging area, Transform, Load, Warehouse]
On-site Staging Model
Data staging area within the operational environment, possibly affecting the operational system
Extract Load
Warehouse
Operational system
Transform
Staging area
Data Anomalies
No unique key
Data naming and coding anomalies
Data meaning anomalies between groups
Spelling and text inconsistencies

CUSNUM    NAME                ADDRESS
90233479  Oracle Limited      100 N.E. 1st St.
90233489  Oracle Computing    15 Main Road, Ft. Lauderdale
90234889  Oracle Corp. UK     15 Main Road, Ft. Lauderdale, FLA
90345672  Oracle Corp UK Ltd  181 North Street, Key West, FLA
Transformation Routines
Cleaning data
Eliminating inconsistencies
Adding elements
Merging data
Integrating data
Transforming data before load
Transforming Data: Problems and Solutions
Multipart keys
Multiple local standards
Multiple files
Missing values
Duplicate values
Element names
Element meanings
Input formats
Referential integrity constraints
Name and address
Multipart Keys Problem
Multipart keys
Product code = 12 M 654313 45
◦ 12: Country code
◦ M: Sales territory
◦ 654313: Product number
◦ 45: Salesperson code
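A minimal sketch of splitting the multipart key above into separate columns. It assumes the parts are space-delimited and appear in the order of the slide's labels (country code, sales territory, product number, salesperson code); the names ProductKey and split_product_code are hypothetical.

```python
from typing import NamedTuple


class ProductKey(NamedTuple):
    """One column per component of the multipart product code."""
    country_code: str
    sales_territory: str
    product_number: str
    salesperson_code: str


def split_product_code(code: str) -> ProductKey:
    """Split a space-delimited multipart key into its four components."""
    parts = code.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts, got {len(parts)}: {code!r}")
    return ProductKey(*parts)


key = split_product_code("12 M 654313 45")
print(key.product_number)  # 654313
```

Keeping each component in its own column lets the warehouse filter or group on any single part without string parsing at query time.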
Multiple Local Standards Problem
Multiple local standards
Tools or filters to preprocess
cm
inches
cm
USD 600
1,000 GBP
FF 9,990
DD/MM/YY
MM/DD/YY
DD-Mon-YY
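A preprocessing filter for the date and length standards above might look like the following sketch. The function names are hypothetical, the source format of each feed is assumed to be known in advance, and the month-name pattern assumes an English locale. Currency conversion is left out because it needs exchange rates that the slides do not give.

```python
from datetime import date, datetime

# strptime patterns for the three date layouts named on the slide.
DATE_FORMATS = {
    "DD/MM/YY": "%d/%m/%y",
    "MM/DD/YY": "%m/%d/%y",
    "DD-Mon-YY": "%d-%b-%y",  # %b is locale-dependent; English assumed
}

CM_PER_INCH = 2.54


def normalize_date(text: str, source_format: str) -> date:
    """Parse a date written in a known local layout into one standard type."""
    return datetime.strptime(text, DATE_FORMATS[source_format]).date()


def to_cm(value: float, unit: str) -> float:
    """Convert a length to centimetres, the assumed warehouse standard."""
    if unit == "cm":
        return value
    if unit == "inches":
        return value * CM_PER_INCH
    raise ValueError(f"unknown unit: {unit!r}")


# The same text means different days under different local standards.
print(normalize_date("01/02/02", "DD/MM/YY"))  # 2002-02-01
print(normalize_date("01/02/02", "MM/DD/YY"))  # 2002-01-02
```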
Multiple Files Problem
Added complexity of multiple source files
Start simple
Transformed data
Multiple source files
Logic to detect correct source
Missing Values Problem
Solution:
◦ Ignore
◦ Wait
◦ Mark rows
◦ Extract when time-stamped
If NULL then field = 'A'
A
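As an illustration of the "If NULL then field = 'A'" rule combined with the "mark rows" option, the sketch below fills a missing value with a default and records that it did so. The function name, the dictionary row layout, and the "_was_missing" flag column are assumptions made for this example.

```python
def fill_missing(rows, field, default="A"):
    """Return copies of rows with a missing `field` set to `default`.

    Each returned row also gets a '<field>_was_missing' flag so the
    substitution stays visible downstream (the "mark rows" option).
    """
    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the extracted rows stay untouched
        was_missing = new_row.get(field) is None
        if was_missing:
            new_row[field] = default
        new_row[f"{field}_was_missing"] = was_missing
        filled.append(new_row)
    return filled


rows = [{"id": 1, "grade": "B"}, {"id": 2, "grade": None}]
print(fill_missing(rows, "grade"))
```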
Duplicate Values Problem
Solution:
◦ SQL self-join techniques
◦ RDBMS constraint utilities
ACME Inc
ACME Inc
ACME Inc
SQL> SELECT ...
     FROM table_a, table_b
     WHERE table_a.key (+) = table_b.key
     UNION
     SELECT ...
     FROM table_a, table_b
     WHERE table_a.key = table_b.key (+);
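One way to apply a self-join to the repeated "ACME Inc" rows is sketched below. It uses Python's sqlite3 module with an in-memory database as a stand-in for the warehouse RDBMS; the table customer, its columns id and name, and the extra "Widget Co" row are hypothetical.

```python
import sqlite3

# Self-join: a row is a duplicate if an earlier row (smaller id)
# carries the same name.
DUPLICATES_SQL = """
    SELECT DISTINCT b.id, b.name
    FROM customer AS a
    JOIN customer AS b
      ON a.name = b.name
     AND a.id < b.id
    ORDER BY b.id
"""


def find_duplicates(conn):
    """Return (id, name) for every row that repeats an earlier name."""
    return conn.execute(DUPLICATES_SQL).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer (name) VALUES (?)",
    [("ACME Inc",), ("ACME Inc",), ("ACME Inc",), ("Widget Co",)],
)
print(find_duplicates(conn))  # [(2, 'ACME Inc'), (3, 'ACME Inc')]
```

The first occurrence (id 1) is kept as the surviving row; only the later copies are reported.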
Element Names Problem
Solution: Common naming conventions
Customer
Customer
Client
Contact
Name
Element Meaning Problem
Avoid misinterpretation
Complex solution
Document meaning in metadata
Customer’s name
Customer_detail
All customer details
All details except name
Input Format Problem
ASCII
EBCDIC
12373
“123-73”
ACME Co.
ברוכים הבאים
Beer (Pack of 8)
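The sketch below illustrates two of the input-format issues above: decoding EBCDIC bytes and reducing a punctuated number to plain digits. The code page cp037 is only one of several EBCDIC variants and is an assumption here, as are the function names.

```python
import re


def decode_ebcdic(raw: bytes) -> str:
    """Decode EBCDIC bytes to text (cp037 is an assumed code page)."""
    return raw.decode("cp037")


def normalize_number(text: str) -> int:
    """Strip quotes, dashes, and other non-digits, then parse as an integer."""
    digits = re.sub(r"\D", "", text)
    if not digits:
        raise ValueError(f"no digits in {text!r}")
    return int(digits)


ebcdic_bytes = "ACME Co.".encode("cp037")  # what a mainframe feed might hold
print(decode_ebcdic(ebcdic_bytes))   # ACME Co.
print(normalize_number('"123-73"'))  # 12373
```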
Referential Integrity Problem
Solution:
◦ SQL anti-join
◦ Server constraints
◦ Dedicated tools
Department
10
20
30
40
Emp Name Department
1099 Smith 10
1289 Jones 20
1234 Doe 50
6786 Harris 60
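The anti-join can be sketched against the sample rows above. This example uses Python's sqlite3 module as a stand-in for the warehouse database; the table and column names (department, employee, emp, name) are assumptions based on the slide's headings.

```python
import sqlite3

# Anti-join: keep employees whose department has no match in the parent table.
ORPHANS_SQL = """
    SELECT e.emp, e.name, e.department
    FROM employee AS e
    LEFT JOIN department AS d
      ON d.department = e.department
    WHERE d.department IS NULL
    ORDER BY e.emp
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (department INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE employee (emp INTEGER, name TEXT, department INTEGER)")
conn.executemany("INSERT INTO department VALUES (?)", [(10,), (20,), (30,), (40,)])
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1099, "Smith", 10), (1289, "Jones", 20), (1234, "Doe", 50), (6786, "Harris", 60)],
)

orphans = conn.execute(ORPHANS_SQL).fetchall()
print(orphans)  # [(1234, 'Doe', 50), (6786, 'Harris', 60)]
```

Doe and Harris refer to departments 50 and 60, which do not exist in the department list, so they are the rows to fix or reject before load.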
Name and Address Problem
Single-field format
Multiple-field format
Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565
Database 1
NAME LOCATION
DIANNE ZIEFELD N100
HARRY H. ENFIELD M300
Database 2
NAME LOCATION
ZIEFELD, DIANNE 100
ENFIELD, HARRY H 300
Name Mr. J. Smith
Street 100 Main St.
Town Bigtown
Country County Luth
Code 23565
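A rough sketch of moving from the single-field format to the multiple-field format, and of aligning the two name orders used by Database 1 and Database 2. It assumes commas separate exactly the five fields listed on the slide; real addresses are messier, which is why dedicated tools exist. The function names are hypothetical.

```python
# Field labels as they appear on the slide.
FIELDS = ("Name", "Street", "Town", "Country", "Code")


def split_address(single_field: str) -> dict:
    """Split a comma-separated address line into the slide's five fields."""
    parts = [part.strip() for part in single_field.split(",")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} parts, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def first_name_first(name: str) -> str:
    """Turn 'LAST, FIRST' into 'FIRST LAST'; leave other layouts unchanged."""
    if "," not in name:
        return name.strip()
    last, first = (part.strip() for part in name.split(",", 1))
    return f"{first} {last}"


print(split_address("Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565"))
print(first_name_first("ZIEFELD, DIANNE"))  # DIANNE ZIEFELD
```

Reordering alone does not reconcile every pair: "ENFIELD, HARRY H" becomes "HARRY H ENFIELD", which still differs from "HARRY H. ENFIELD" by a period, so matching needs further normalization.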
Quality Data: Importance and Benefits
Quality data:
◦ Key to a successful warehouse implementation
Quality data helps you in:
◦ Targeting right customers
◦ Determining buying patterns
◦ Identifying householders: private and commercial
◦ Matching customers
◦ Identifying historical data
Data Quality Guidelines
Operational data:
◦ Should not be used directly in the warehouse
◦ Must be cleaned for each increment
◦ Is not simply fixed by modifying applications
Transformation Techniques
Merging data
Adding a Date Stamp
Adding Keys to Data
Merging Data
Operational transactions do not usually map one-to-one with warehouse data.
Data for the warehouse is merged to provide information for analysis.
Pizza sales/returns by day, hour, seconds
Sale 1/2/02 12:00:01 Ham Pizza $10.00
Sale 1/2/02 12:00:02 Cheese Pizza $15.00
Sale 1/2/02 12:00:02 Anchovy Pizza $12.00
Return 1/2/02 12:00:03 Anchovy Pizza - $12.00
Sale 1/2/02 12:00:04 Sausage Pizza $11.00
Merging Data
Pizza sales
Sale 1/2/02 12:00:01 Ham Pizza $10.00
Sale 1/2/02 12:00:02 Cheese Pizza $15.00
Sale 1/2/02 12:00:04 Sausage Pizza $11.00
Pizza sales/returns by day, hour, seconds
Sale 1/2/02 12:00:01 Ham Pizza $10.00
Sale 1/2/02 12:00:02 Cheese Pizza $15.00
Sale 1/2/02 12:00:02 Anchovy Pizza $12.00
Return 1/2/02 12:00:03 Anchovy Pizza - $12.00
Sale 1/2/02 12:00:04 Sausage Pizza $11.00
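The merge shown above, where the anchovy sale and its return cancel out, could be sketched as follows. The matching rule (same product, opposite amount, most recent earlier sale) is an assumption; the slides do not state how a return is paired with its sale. The function name net_sales is hypothetical.

```python
from decimal import Decimal


def net_sales(transactions):
    """Drop each return together with the earlier sale it cancels.

    `transactions` must be in time order; returns carry negative amounts.
    """
    kept = []
    for kind, timestamp, product, amount in transactions:
        if kind == "Sale":
            kept.append((kind, timestamp, product, amount))
            continue
        # Search backwards for the latest sale this return offsets.
        for i in range(len(kept) - 1, -1, -1):
            if kept[i][2] == product and kept[i][3] == -amount:
                del kept[i]
                break
        else:
            raise ValueError(f"return without matching sale: {product}")
    return kept


transactions = [
    ("Sale", "1/2/02 12:00:01", "Ham Pizza", Decimal("10.00")),
    ("Sale", "1/2/02 12:00:02", "Cheese Pizza", Decimal("15.00")),
    ("Sale", "1/2/02 12:00:02", "Anchovy Pizza", Decimal("12.00")),
    ("Return", "1/2/02 12:00:03", "Anchovy Pizza", Decimal("-12.00")),
    ("Sale", "1/2/02 12:00:04", "Sausage Pizza", Decimal("11.00")),
]
for row in net_sales(transactions):
    print(row[2], row[3])
```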
Adding a Date Stamp
Time element can be represented as a:
◦ Single point in time
◦ Time span
Add time element to:
◦ Fact tables
◦ Dimension data
Adding a Date Stamp: Fact Tables and Dimensions
Item Table: Item_id, Dept_id, Time_key
Store Table: Store_id, District_id, Time_key
Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units
Time Table: Week_id, Period_id, Year_id, Time_key
Product Table: Product_id, Time_key, Product_desc
Adding Keys to Data
#1 Sale 1/2/98 12:00:01 Ham Pizza $10.00
#2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00
#3 Sale 1/2/98 12:00:02 Anchovy Pizza $12.00
#5 Sale 1/2/98 12:00:04 Sausage Pizza $11.00
#4 Return 1/2/98 12:00:03 Anchovy Pizza - $12.00
#dw1 Sale 1/2/98 12:00:01 Ham Pizza $10.00
#dw2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00
#dw3 Sale 1/2/98 12:00:04 Sausage Pizza $11.00
Data values or artificial keys
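A minimal sketch of assigning artificial keys to the merged rows, following the #dw1, #dw2, #dw3 pattern above. The "dw" prefix comes from the slide; the function name and the tuple row layout are assumptions.

```python
def add_warehouse_keys(rows, prefix="dw"):
    """Prefix each row with a sequential artificial key: dw1, dw2, ..."""
    return [(f"{prefix}{n}", *row) for n, row in enumerate(rows, start=1)]


merged_rows = [
    ("Sale", "1/2/98 12:00:01", "Ham Pizza", "$10.00"),
    ("Sale", "1/2/98 12:00:02", "Cheese Pizza", "$15.00"),
    ("Sale", "1/2/98 12:00:04", "Sausage Pizza", "$11.00"),
]
for row in add_warehouse_keys(merged_rows):
    print(*row)
```

The warehouse keys are independent of the operational keys #1 to #5, so gaps left by merged-away rows do not carry over.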
Summarizing Data
1. During extraction on staging area
2. After loading to the warehouse server
Operational databases
Warehouse database
Staging area
Maintaining Transformation Metadata
Transformation metadata contains:
◦ Transformation rules
◦ Algorithms and routines
Sources, Stage, Rules, Publish
Extract, Transform, Load, Query
Maintaining Transformation Metadata
Restructure keys
Identify and resolve coding differences
Validate data from multiple sources
Handle exception rules
Identify and resolve format differences
Fix referential integrity inconsistencies
Identify summary data
Transformation Timing and Location
Transformation is performed:
◦ Before load
◦ In parallel
Can be initiated at different points:
◦ On the operational platform
◦ In a separate staging area
Monitoring and Tracking
Transformations should:
◦ Be self-documenting
◦ Provide summary statistics
◦ Handle process exceptions
Loading Data into the Warehouse
Loading moves the data into the warehouse
Loading can be time-consuming:
◦ Consider the load window
◦ Schedule and automate the loading
Initial load moves large volumes of data
Subsequent refresh moves smaller volumes of data
Operational databases
Warehouse database
Staging area
Extract
Transform
Transport, Load
Initial Load and Refresh
Initial Load:
◦ Single event that populates the database with historical data
◦ Involves large volumes of data
◦ Employs distinct ETL tasks
◦ Involves large amounts of processing after load
Refresh:
◦ Performed according to a business cycle
◦ Less data to load than first-time load
◦ Less-complex ETL tasks
◦ Smaller amounts of post-load processing
Data Refresh Models: Extract Processing Environment
After each time interval, build a new snapshot of the database.
Purge old snapshots.
T1 T2 T3
Operational databases
Data Refresh Models: Warehouse Processing Environment
Build a new database.
After each time interval, add changes to database.
Archive or purge oldest data.
T1 T2 T3
Operational databases
Building the Loading Process
Techniques and tools
File transfer methods
The load window
Time window for other tasks
First-time and refresh volumes
Frequency of the refresh cycle
Connectivity bandwidth
Building the Loading Process
Test the proposed technique
Document proposed load
Monitor, review, and revise
Data Granularity
Important design and operational issue
Low-level grain:
◦ Expensive, high level of processing, more disk space, more detail
High-level grain:
◦ Cheaper, less processing, less disk space, little detail
Loading Techniques
Tools
Utilities and 3GL
Gateways
Customized copy programs
Replication
FTP
Manual
Loading Technique Considerations
Tools are comprehensive, but costly.
Data-movement utilities are fast and powerful.
Gateways are suitable for specific instances:
◦ Access other databases
◦ Supply dependent data marts
◦ Support a distributed environment
◦ Provide real-time access if needed
Use customized programs as a last resort.
Replication is limited by data-transfer rates.
Post-Processing of Loaded Data
Create indexes
Generate keys
Summarize
Filter
Extract
Transform
Load
Warehouse
Staging area
Creating Derived Keys
The use of derived or generalized keys is recommended to maintain the uniqueness of a row.
Methods:
◦ Concatenate operational key with a number
◦ Assign a number sequentially from a list
109908 01109908
109908 100
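Both methods can be sketched against the sample key 109908. The two-digit prefix width and the starting value of 100 are chosen only to reproduce the slide's examples (01109908 and 100); the names concatenate_key and SequentialKeys are hypothetical.

```python
def concatenate_key(operational_key: str, number: int, width: int = 2) -> str:
    """Method 1: prefix the operational key with a zero-padded number."""
    return f"{number:0{width}d}{operational_key}"


class SequentialKeys:
    """Method 2: hand out the next number in sequence for each new key."""

    def __init__(self, start: int = 100):
        self._next = start
        self._assigned = {}

    def key_for(self, operational_key: str) -> int:
        # Reuse the number already assigned so the mapping stays stable.
        if operational_key not in self._assigned:
            self._assigned[operational_key] = self._next
            self._next += 1
        return self._assigned[operational_key]


print(concatenate_key("109908", 1))  # 01109908
keys = SequentialKeys()
print(keys.key_for("109908"))        # 100
```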
Summary Management
Summary tables
Materialized views
Summary data
Filtering Data
From warehouse to data marts
Data marts
Summary data
Warehouse
Verifying Data Integrity
Load data into intermediate file.
Compare target flash totals with totals before load.
Target
=
=
Load
Preserve, inspect, fix, then load
Counts & Amounts
Flash totals
Counts & Amounts
Flash totals
Intermediate file
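The flash-total comparison can be sketched as a count and an amount total computed on both sides of the load. The dictionary row layout, the "amount" field, and the sample values are assumptions for illustration only.

```python
from decimal import Decimal


def flash_totals(rows, amount_field="amount"):
    """Return (row count, summed amount) for a set of rows."""
    count = 0
    total = Decimal("0")
    for row in rows:
        count += 1
        total += Decimal(str(row[amount_field]))
    return count, total


def verify_load(source_rows, target_rows):
    """True when counts and amounts agree before and after the load."""
    return flash_totals(source_rows) == flash_totals(target_rows)


# Hypothetical sample: one row was lost during the load.
intermediate = [{"amount": "10.00"}, {"amount": "15.00"}, {"amount": "11.00"}]
target = [{"amount": "10.00"}, {"amount": "15.00"}]
print(verify_load(intermediate, intermediate))  # True
print(verify_load(intermediate, target))        # False
```

When the totals disagree, the intermediate file is still available to inspect and fix before the load is repeated.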
Steps for Verifying Data Integrity
[Diagram, steps 1 to 7: Source files, Extract, Control, SQL*Loader, .log, .bad, Target]
Standard Quality Assurance Checks
Load status
Completion of the process
Completeness of the data
Data reconciliation
Referential integrity violations
Reprocessing
Comparison of counts and amounts
1 + 1 = 3