
Zhangxi Lin

Texas Tech University

ISQS 6339, Data Management & Business Intelligence

Extraction, Transformation, and Loading (II)

Agenda

I. Using SSIS for ETL
◦ Integration Services
◦ Learn by doing
◦ Package items
◦ Problem-oriented package development

II. The Principle of ETL
◦ Extraction
◦ Transformation
◦ Loading


II. The Principle of ETL

Structure and Components of Business Intelligence

[Figure: business intelligence components: SSMS, SSIS, SSAS, SSRS, SAS EM, SAS EG]

Automating your routine information processing tasks

Your routine information processing tasks:
◦ Read online news at 8:00 a.m. and collect the few most important pieces.
◦ Retrieve data from a database to draft a short daily report at 10 a.m.
◦ View and reply to emails, and take notes that are saved in a database.
◦ View 10 companies' webpages to see the updates, and enter the summaries into a database.
◦ Browse three popular magazines twice a week, and enter the summaries into a database.
◦ Generate a few one-way and two-way frequency tables and put them on the web.
◦ Merge datasets collected by other people into a main database.
◦ Prepare a weekly report from the database at 4 p.m. every Monday and publish it to the internal portal site.
◦ Prepare a monthly report at 11 a.m. on the first day of each month; it must be converted into a PDF file and uploaded to the website.

Many things are going on at once. How can they be handled properly and at the right time?
◦ An organizer: yes.
◦ How about regular data processing tasks?

Information Processing and Information Flow

Transaction processing
◦ Interactions between a user and a computer application system, with immediate responses from the application

Operational processing
◦ Use of a computer to control a process

Batch processing
◦ A series of executions, each of which is applied to a set of data and passes its result to the next one

Analytical processing
◦ The interaction between analysts and collections of aggregated data that may have been reformulated into alternative representational forms for improved analytical performance

Extraction, Transformation, Loading (ETL) Processes

◦ Extract source data
◦ Transform/clean data
◦ Index and summarize
◦ Load data into warehouse
◦ Detect changes
◦ Refresh data

[Figure: ETL between operational systems and the data warehouse; labels: Programs, Tools, Gateways]

ETL: Tasks, Importance, and Cost

[Figure: operational systems feed ETL (extract, clean up, consolidate, restructure, load, maintain, refresh), which feeds a data warehouse labeled relevant, useful, quality, accurate, accessible]

Extracting Data

Source systems
◦ Data from various data sources in various formats

Extraction routines
◦ Developed to select data fields from sources
◦ Consist of business rules, audit trails, and error-correction facilities

[Figure: operational databases, data staging area (data mapping, transform), warehouse database]

Production Data

◦ Operating system platforms
◦ File systems
◦ Database systems and vertical applications

[Figure: example sources: IMS, DB2, Oracle, Sybase, Informix, VSAM, SAP, Shared Medical Systems, Dun and Bradstreet Financials, Hogan Financials, Oracle Financials]

Archive Data

◦ Historical data
◦ Useful for analysis over long periods of time
◦ Useful for first-time load
◦ May require unique transformations

[Figure: operational databases and warehouse database]

Internal Data

◦ Planning, sales, and marketing organization data
◦ Maintained in the form of spreadsheets (structured) and documents (unstructured)
◦ Treated like any other source data

[Figure: planning, accounting, and marketing data and the warehouse database]

External Data

◦ Information from outside the organization
◦ Issues of frequency, format, and predictability
◦ Described and tracked using metadata

[Figure: external sources and the warehousing databases: A.C. Nielsen, IRI, IMS, Walsh America; Barron's; Dun and Bradstreet; purchased databases; Wall Street Journal; economic forecasts; competitive information]

Possible ETL Failures

◦ A missing source file
◦ A system failure
◦ Inadequate metadata
◦ Poor mapping information
◦ Inadequate storage planning
◦ A source structural change
◦ No contingency plan
◦ Inadequate data validation

Maintaining ETL Quality

ETL must be:
◦ Tested
◦ Documented
◦ Monitored and reviewed

Disparate metadata must be coordinated.

Transformation

Transformation eliminates anomalies from operational data:
◦ Cleans and standardizes
◦ Presents subject-oriented data

[Figure: operational systems, extract, data staging area (transform: clean up, consolidate, restructure), load, warehouse]

Remote Staging Model

◦ Data staging area within the warehouse environment
◦ Data staging area in its own environment

[Figure: two diagrams, each showing operational system, extract, staging area (transform), load, warehouse]

On-site Staging Model

Data staging area within the operational environment, possibly affecting the operational system

[Figure: operational system, staging area (transform), extract, load, warehouse]

Data Anomalies

◦ No unique key
◦ Data naming and coding anomalies
◦ Data meaning anomalies between groups
◦ Spelling and text inconsistencies

CUSNUM    NAME                ADDRESS
90233479  Oracle Limited      100 N.E. 1st St.
90233489  Oracle Computing    15 Main Road, Ft. Lauderdale
90234889  Oracle Corp. UK     15 Main Road, Ft. Lauderdale, FLA
90345672  Oracle Corp UK Ltd  181 North Street, Key West, FLA

Transformation Routines

◦ Cleaning data
◦ Eliminating inconsistencies
◦ Adding elements
◦ Merging data
◦ Integrating data
◦ Transforming data before load

Transforming Data: Problems and Solutions

◦ Multipart keys
◦ Multiple local standards
◦ Multiple files
◦ Missing values
◦ Duplicate values
◦ Element names
◦ Element meanings
◦ Input formats
◦ Referential integrity constraints
◦ Name and address

Multipart Keys Problem

◦ Multipart keys

Product code = 12 M 654313 45

[Figure: the parts of the product code are labeled country code, sales territory, product number, and salesperson code]
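One way to handle a multipart key is to split it into separate fields during transformation. The sketch below assumes the parts of the example product code are space-separated and appear in the order country code, sales territory, product number, salesperson code. That ordering, the `ProductKey` type, and the `split_product_code` function are illustrative assumptions, not something the slide specifies.

```python
from typing import NamedTuple


class ProductKey(NamedTuple):
    # Field order is an assumption about how the slide's labels map to the code.
    country_code: str
    sales_territory: str
    product_number: str
    salesperson_code: str


def split_product_code(code: str) -> ProductKey:
    """Split a space-separated multipart key into named fields."""
    parts = code.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts, got {len(parts)}: {code!r}")
    return ProductKey(*parts)


key = split_product_code("12 M 654313 45")
print(key.product_number)
```

Once the parts are separate columns, each one can be validated and joined against its own dimension instead of being parsed repeatedly at query time.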

Multiple Local Standards Problem

◦ Multiple local standards
◦ Tools or filters to preprocess

[Figure: examples of differing local standards: units (cm, inches), currencies (USD 600, 1,000 GBP, FF 9,990), and date formats (DD/MM/YY, MM/DD/YY, DD-Mon-YY)]
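A preprocessing filter for local standards might look like the following sketch. It assumes each source declares which date convention and length unit it uses, since a string such as 01/02/02 cannot be disambiguated on its own. The function names and the sample values are hypothetical, and currency conversion is left out because it would need exchange-rate data the slides do not provide.

```python
from datetime import date, datetime

# strptime patterns for the three date conventions named on the slide.
# %b matches month abbreviations such as "Jan" in an English or C locale.
DATE_FORMATS = {
    "DD/MM/YY": "%d/%m/%y",
    "MM/DD/YY": "%m/%d/%y",
    "DD-Mon-YY": "%d-%b-%y",
}

CM_PER_INCH = 2.54


def normalize_date(text: str, convention: str) -> date:
    """Parse a date string written in a known source convention."""
    return datetime.strptime(text, DATE_FORMATS[convention]).date()


def to_cm(value: float, unit: str) -> float:
    """Convert a length to centimetres; only 'cm' and 'inches' are handled."""
    if unit == "cm":
        return value
    if unit == "inches":
        return value * CM_PER_INCH
    raise ValueError(f"unknown unit: {unit}")


# The same calendar day written three ways maps to one value.
d1 = normalize_date("02/01/02", "DD/MM/YY")
d2 = normalize_date("01/02/02", "MM/DD/YY")
d3 = normalize_date("02-Jan-02", "DD-Mon-YY")
```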

Multiple Files Problem

◦ Added complexity of multiple source files
◦ Start simple

[Figure: multiple source files, logic to detect the correct source, transformed data]

Missing Values Problem

Solution:
◦ Ignore
◦ Wait
◦ Mark rows
◦ Extract when time-stamped

Example rule: If NULL then field = 'A'
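The "If NULL then field = 'A'" rule, combined with the "mark rows" option, could be sketched as below. The row layout, the `status` field, and the `was_missing` flag are hypothetical names chosen for illustration.

```python
from typing import Optional

DEFAULT_CODE = "A"  # default value taken from the slide's example rule


def fill_missing(value: Optional[str], default: str = DEFAULT_CODE) -> str:
    """Return the default when the source value is NULL (None in Python)."""
    return default if value is None else value


# Hypothetical source rows; the second one has a missing status.
rows = [{"id": 1, "status": "B"}, {"id": 2, "status": None}]

# Fill the default and mark the row so the substitution stays traceable.
cleaned = [
    {
        **row,
        "status": fill_missing(row["status"]),
        "was_missing": row["status"] is None,
    }
    for row in rows
]
```

Keeping a flag alongside the filled value lets later analysis distinguish a genuine 'A' from a substituted one.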

Duplicate Values Problem

Solution:
◦ SQL self-join techniques
◦ RDBMS constraint utilities

[Figure: three duplicate "ACME Inc" records]

SELECT ...
FROM table_a, table_b
WHERE table_a.key (+) = table_b.key
UNION
SELECT ...
FROM table_a, table_b
WHERE table_a.key = table_b.key (+);
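As a minimal sketch of the self-join technique, the example below uses Python's built-in `sqlite3` module rather than the Oracle syntax shown on the slide. The `customer` table, its columns, and the "Other Co" row are hypothetical; only the three "ACME Inc" records come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)",
    [(1, "ACME Inc"), (2, "ACME Inc"), (3, "ACME Inc"), (4, "Other Co")],
)

# Self-join: pair each row with every later row that has the same name.
# The a.id < b.id condition avoids pairing a row with itself or twice.
duplicate_pairs = conn.execute(
    """
    SELECT a.id, b.id, a.name
    FROM customer AS a
    JOIN customer AS b
      ON a.name = b.name AND a.id < b.id
    ORDER BY a.id, b.id
    """
).fetchall()

for pair in duplicate_pairs:
    print(pair)
```

An exact match on name only catches literal duplicates; near-duplicates such as differing punctuation would need the names normalized first.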

Element Names Problem

Solution:
◦ Common naming conventions

[Figure: the same element under different names: Customer, Client, Contact, Name]

Element Meaning Problem

◦ Avoid misinterpretation
◦ Complex solution
◦ Document meaning in metadata

[Figure: the element Customer_detail read as the customer's name, all customer details, or all details except name]

Input Format Problem

[Figure: examples of differing input formats: ASCII and EBCDIC; 12373 and "123-73"; ACME Co.; ברוכים הבאים; Beer (Pack of 8)]

Referential Integrity Problem

Solution:
◦ SQL anti-join
◦ Server constraints
◦ Dedicated tools

Department
10
20
30
40

Emp   Name    Department
1099  Smith   10
1289  Jones   20
1234  Doe     50
6786  Harris  60
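The anti-join solution can be illustrated with the slide's own data. This is a sketch using Python's built-in `sqlite3` module; the table and column names (`department`, `employee`, `dept`, `emp`) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (dept INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE employee (emp INTEGER PRIMARY KEY, name TEXT, dept INTEGER)"
)
conn.executemany("INSERT INTO department VALUES (?)", [(10,), (20,), (30,), (40,)])
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1099, "Smith", 10), (1289, "Jones", 20), (1234, "Doe", 50), (6786, "Harris", 60)],
)

# Anti-join: keep only employees for which no matching department exists.
orphans = conn.execute(
    """
    SELECT e.emp, e.name, e.dept
    FROM employee AS e
    LEFT JOIN department AS d ON d.dept = e.dept
    WHERE d.dept IS NULL
    ORDER BY e.emp
    """
).fetchall()

for row in orphans:
    print(row)
```

The rows returned are the ones that would violate a foreign-key constraint if the warehouse enforced one, so they can be held back or corrected before load.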

Name and Address Problem

◦ Single-field format
◦ Multiple-field format

Single-field format:
Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565

Multiple-field format:
Name     Mr. J. Smith
Street   100 Main St.
Town     Bigtown
Country  County Luth
Code     23565

Database 1
NAME              LOCATION
DIANNE ZIEFELD    N100
HARRY H. ENFIELD  M300

Database 2
NAME              LOCATION
ZIEFELD, DIANNE   100
ENFIELD, HARRY H  300
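A small sketch of both conversions follows: splitting the single-field record into the multiple-field format, and normalizing names so that the two databases' spellings compare equal. It assumes five comma-separated parts with no commas inside a part, and a simple "LAST, FIRST" convention. The function names are hypothetical, and real name and address matching usually needs far more rules than this.

```python
FIELD_NAMES = ["Name", "Street", "Town", "Country", "Code"]  # labels from the slide


def split_address(record: str) -> dict:
    """Split a comma-separated single-field record into labeled fields."""
    parts = [part.strip() for part in record.split(",")]
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} parts, got {len(parts)}")
    return dict(zip(FIELD_NAMES, parts))


def normalize_name(name: str) -> str:
    """Rewrite 'LAST, FIRST' as 'FIRST LAST', drop periods, and uppercase."""
    if "," in name:
        last, first = (piece.strip() for piece in name.split(",", 1))
        name = f"{first} {last}"
    return name.replace(".", "").upper()


address = split_address("Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565")
same_person = normalize_name("HARRY H. ENFIELD") == normalize_name("ENFIELD, HARRY H")
```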

Quality Data: Importance and Benefits

Quality data:
◦ Key to a successful warehouse implementation

Quality data helps you in:
◦ Targeting the right customers
◦ Determining buying patterns
◦ Identifying householders: private and commercial
◦ Matching customers
◦ Identifying historical data

Data Quality Guidelines

Operational data:
◦ Should not be used directly in the warehouse
◦ Must be cleaned for each increment
◦ Is not simply fixed by modifying applications

Transformation Techniques

◦ Merging data
◦ Adding a date stamp
◦ Adding keys to data

Merging Data

Operational transactions do not usually map one-to-one with warehouse data.

Data for the warehouse is merged to provide information for analysis.

Pizza sales/returns by day, hour, seconds

Sale    1/2/02  12:00:01  Ham Pizza      $10.00
Sale    1/2/02  12:00:02  Cheese Pizza   $15.00
Sale    1/2/02  12:00:02  Anchovy Pizza  $12.00
Return  1/2/02  12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/02  12:00:04  Sausage Pizza  $11.00

Merging Data

Pizza sales

Sale    1/2/02  12:00:01  Ham Pizza      $10.00
Sale    1/2/02  12:00:02  Cheese Pizza   $15.00
Sale    1/2/02  12:00:04  Sausage Pizza  $11.00

Pizza sales/returns by day, hour, seconds

Sale    1/2/02  12:00:01  Ham Pizza      $10.00
Sale    1/2/02  12:00:02  Cheese Pizza   $15.00
Sale    1/2/02  12:00:02  Anchovy Pizza  $12.00
Return  1/2/02  12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/02  12:00:04  Sausage Pizza  $11.00
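The merge shown in the two pizza tables can be sketched as a rule that cancels each return against one earlier sale of the same product and amount. The matching rule and the `merge_sales` function are assumptions made for illustration; the slides show only the input and the merged result.

```python
from decimal import Decimal

# Operational rows from the slide: (type, day, time, product, amount).
transactions = [
    ("Sale", "1/2/02", "12:00:01", "Ham Pizza", Decimal("10.00")),
    ("Sale", "1/2/02", "12:00:02", "Cheese Pizza", Decimal("15.00")),
    ("Sale", "1/2/02", "12:00:02", "Anchovy Pizza", Decimal("12.00")),
    ("Return", "1/2/02", "12:00:03", "Anchovy Pizza", Decimal("-12.00")),
    ("Sale", "1/2/02", "12:00:04", "Sausage Pizza", Decimal("11.00")),
]


def merge_sales(rows):
    """Drop each return together with one earlier sale that it cancels.

    A return with no matching sale is simply dropped in this sketch.
    """
    kept = []
    for row in rows:
        kind, _day, _time, product, amount = row
        if kind == "Return":
            # Search backwards for the most recent matching sale.
            for i in range(len(kept) - 1, -1, -1):
                if kept[i][3] == product and kept[i][4] == -amount:
                    del kept[i]
                    break
        else:
            kept.append(row)
    return kept


pizza_sales = merge_sales(transactions)
```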

Adding a Date Stamp

Time element can be represented as a:
◦ Single point in time
◦ Time span

Add time element to:
◦ Fact tables
◦ Dimension data

Adding a Date Stamp: Fact Tables and Dimensions

Item Table: Item_id, Dept_id, Time_key

Store Table: Store_id, District_id, Time_key

Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units

Time Table: Week_id, Period_id, Year_id, Time_key

Product Table: Product_id, Time_key, Product_desc

Adding Keys to Data

#1  Sale    1/2/98  12:00:01  Ham Pizza      $10.00
#2  Sale    1/2/98  12:00:02  Cheese Pizza   $15.00
#3  Sale    1/2/98  12:00:02  Anchovy Pizza  $12.00
#4  Return  1/2/98  12:00:03  Anchovy Pizza  -$12.00
#5  Sale    1/2/98  12:00:04  Sausage Pizza  $11.00

#dw1  Sale  1/2/98  12:00:01  Ham Pizza      $10.00
#dw2  Sale  1/2/98  12:00:02  Cheese Pizza   $15.00
#dw3  Sale  1/2/98  12:00:04  Sausage Pizza  $11.00

Data values or artificial keys

Summarizing Data

1. During extraction on staging area
2. After loading to the warehouse server

[Figure: operational databases, staging area, warehouse database]

Maintaining Transformation Metadata

Transformation metadata contains:
◦ Transformation rules
◦ Algorithms and routines

[Figure: sources, extract, stage, transform, rules, load, publish, query]

Maintaining Transformation Metadata

◦ Restructure keys
◦ Identify and resolve coding differences
◦ Validate data from multiple sources
◦ Handle exception rules
◦ Identify and resolve format differences
◦ Fix referential integrity inconsistencies
◦ Identify summary data

Transformation Timing and Location

Transformation is performed:
◦ Before load
◦ In parallel

Can be initiated at different points:
◦ On the operational platform
◦ In a separate staging area

Monitoring and Tracking

Transformations should:
◦ Be self-documenting
◦ Provide summary statistics
◦ Handle process exceptions

Loading Data into the Warehouse

Loading moves the data into the warehouse.

Loading can be time-consuming:
◦ Consider the load window
◦ Schedule and automate the loading

Initial load moves large volumes of data.

Subsequent refresh moves smaller volumes of data.

[Figure: operational databases, extract, staging area (transform), transport and load, warehouse database]

Initial Load and Refresh

Initial load:
◦ Single event that populates the database with historical data
◦ Involves large volumes of data
◦ Employs distinct ETL tasks
◦ Involves large amounts of processing after load

Refresh:
◦ Performed according to a business cycle
◦ Less data to load than first-time load
◦ Less-complex ETL tasks
◦ Smaller amounts of post-load processing

Data Refresh Models: Extract Processing Environment

◦ After each time interval, build a new snapshot of the database.
◦ Purge old snapshots.

[Figure: operational databases at times T1, T2, T3]

Data Refresh Models: Warehouse Processing Environment

◦ Build a new database.
◦ After each time interval, add changes to the database.
◦ Archive or purge the oldest data.

[Figure: operational databases at times T1, T2, T3]

Building the Loading Process

◦ Techniques and tools
◦ File transfer methods
◦ The load window
◦ Time window for other tasks
◦ First-time and refresh volumes
◦ Frequency of the refresh cycle
◦ Connectivity bandwidth

Building the Loading Process

◦ Test the proposed technique
◦ Document proposed load
◦ Monitor, review, and revise

Data Granularity

Important design and operational issue

Low-level grain:
◦ Expensive, high level of processing, more disk space, more detail

High-level grain:
◦ Cheaper, less processing, less disk space, little detail
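To make the trade-off concrete, the sketch below rolls transaction-level rows (low-level grain) up to one row per day (high-level grain). The sample rows are hypothetical, loosely modeled on the pizza example earlier in the deck.

```python
from collections import defaultdict
from decimal import Decimal

# Low-level grain: one row per transaction (day, time, amount).
detail_rows = [
    ("1/2/02", "12:00:01", Decimal("10.00")),
    ("1/2/02", "12:00:02", Decimal("15.00")),
    ("1/2/02", "12:00:04", Decimal("11.00")),
    ("1/3/02", "09:30:00", Decimal("12.00")),
]

# High-level grain: one row per day. Decimal() with no argument is zero,
# so each new day starts from a zero total.
daily_totals = defaultdict(Decimal)
for day, _time, amount in detail_rows:
    daily_totals[day] += amount

for day, total in daily_totals.items():
    print(day, total)
```

The daily table is smaller and cheaper to query, but the time of each sale can no longer be recovered from it, which is the loss of detail the slide refers to.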

Loading Techniques

◦ Tools
◦ Utilities and 3GL
◦ Gateways
◦ Customized copy programs
◦ Replication
◦ FTP
◦ Manual

Loading Technique Considerations

Tools are comprehensive, but costly.

Data-movement utilities are fast and powerful.

Gateways are suitable for specific instances:
◦ Access other databases
◦ Supply dependent data marts
◦ Support a distributed environment
◦ Provide real-time access if needed

Use customized programs as a last resort.

Replication is limited by data-transfer rates.

Post-Processing of Loaded Data

Post-processing of loaded data:
◦ Create indexes
◦ Generate keys
◦ Summarize
◦ Filter

[Figure: extract, staging area (transform), load, warehouse]

Creating Derived Keys

The use of derived or generalized keys is recommended to maintain the uniqueness of a row.

Methods:
◦ Concatenate operational key with a number (example: 109908 becomes 01109908)
◦ Assign a number sequentially from a list (example: 109908 becomes 100)
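Both methods can be sketched in a few lines. The prefix position and the starting number 100 are read off the slide's examples; the function and class names are hypothetical, and a production warehouse would normally persist the key mapping rather than keep it in memory.

```python
from itertools import count


def concatenated_key(operational_key: str, prefix: str) -> str:
    """Method 1: concatenate a number with the operational key."""
    return f"{prefix}{operational_key}"


class SequentialKeys:
    """Method 2: hand out the next number in sequence for each new key."""

    def __init__(self, start: int = 100):
        self._next = count(start)
        self._assigned = {}

    def key_for(self, operational_key: str) -> int:
        # Reuse an already assigned number so the mapping stays stable
        # when the same operational key arrives in a later load.
        if operational_key not in self._assigned:
            self._assigned[operational_key] = next(self._next)
        return self._assigned[operational_key]


keys = SequentialKeys(start=100)
print(concatenated_key("109908", "01"))
print(keys.key_for("109908"))
```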

Summary Management

◦ Summary tables
◦ Materialized views

[Figure: summary data]

Filtering Data

From warehouse to data marts

[Figure: warehouse, summary data, data marts]

Verifying Data Integrity

◦ Load data into an intermediate file.
◦ Compare target flash totals with totals before load.

[Figure: flash totals (counts and amounts) for the intermediate file and for the target are compared for equality; labels: Load; Preserve, inspect, fix, then load]
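A flash-total comparison could be sketched as follows, assuming each row carries a key and a monetary amount. The row layout, the function names, and the sample data are hypothetical.

```python
from decimal import Decimal


def flash_totals(rows):
    """Return (row count, summed amount) for a list of (key, amount) rows."""
    return len(rows), sum((amount for _key, amount in rows), Decimal("0"))


def verify_load(intermediate_rows, target_rows) -> bool:
    """True when counts and amounts agree between intermediate file and target."""
    return flash_totals(intermediate_rows) == flash_totals(target_rows)


intermediate = [
    ("dw1", Decimal("10.00")),
    ("dw2", Decimal("15.00")),
    ("dw3", Decimal("11.00")),
]
good_target = list(intermediate)
bad_target = intermediate[:2]  # simulates a row lost during load

print(verify_load(intermediate, good_target))
print(verify_load(intermediate, bad_target))
```

Matching totals do not prove that every row is correct, since offsetting errors can cancel out, so this check complements row-level validation rather than replacing it.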

Steps for Verifying Data Integrity

[Figure: numbered steps 1 to 7 connecting source files, extract, control, SQL*Loader, .log and .bad files, and target]

Standard Quality Assurance Checks

◦ Load status
◦ Completion of the process
◦ Completeness of the data
◦ Data reconciliation
◦ Referential integrity violations
◦ Reprocessing
◦ Comparison of counts and amounts

[Figure: 1 + 1 = 3]