ETL for type 2 dimension

ETL for Type 2 dimension Overview

ETL Definitions and Data Flow for Type 2 dimensions

Overview

The purpose of this document is to provide a detailed example of how data is extracted from the OLTP production database to the Staging database. Specifically, this document examines the steps required to extract data from a table that contains Type 1, Type 2, and Inferred data.

Views and UDFs

Views and user-defined, table-valued functions are used to extract data from the production database. They provide a consistent naming convention to the Staging database and a layer of abstraction between the ETL and the production database, which minimizes downstream problems when minor changes are made to the production database.

Naming Conventions

Many of the fields in the production database do not have descriptive names. To make maintenance easier and to ease future upgrades in the ODS system, fields in the production database are renamed in views and UDFs using a consistent and descriptive naming convention. The following naming convention rules are used in views and UDFs on the production database:

Use the prefix “vw” for the name of a view. For example: vwReservations or vwGuests. Tables and views share the same namespace, so the vw prefix allows you to easily differentiate between tables and views when they are presented to the user in a drop-down list.

User-defined functions are prefixed with uf_. Stored procedures are prefixed with up_.

When naming the primary key field, use the singular form of the table name + ID. For example, if the table name is People, the primary key field name would be PersonID; for the table named Guests, it would be GuestID. Do not use just ID or Code for a primary key field name. Maintenance and troubleshooting become very difficult if the primary key field in every table is named either ID or Code.

Use Upper CamelCase for closed, compound field names (capitalize the first character of each word in the field name). For example: UpperCamelCaseField, or WorkPhone.

Do not use underscores in field names unless you are separating two parts of the name that contain abbreviations. For example: CC_ID for credit card ID. For this field, however, the preferred approach is to spell out the words credit and card, giving the field name CreditCardID.
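The naming rules above can be expressed as a few small helpers. This is an illustrative Python sketch, not part of the ETL itself: the function names (to_upper_camel, primary_key_name, view_name) and the tiny irregular-plural table are assumptions made for the example.

```python
import re

# Hypothetical lookup for irregular plurals; extend as needed.
IRREGULAR_SINGULARS = {"People": "Person"}


def to_upper_camel(raw_name: str) -> str:
    """Convert a raw column name such as 'contact_name' to 'ContactName'."""
    words = [w for w in re.split(r"[_\s]+", raw_name) if w]
    # Capitalize only the first character so existing capitals (e.g. 'ID') survive.
    return "".join(w[0].upper() + w[1:] for w in words)


def primary_key_name(table_name: str) -> str:
    """Build '<SingularTableName>ID' for a table name such as 'Guests'."""
    singular = IRREGULAR_SINGULARS.get(table_name)
    if singular is None:
        # Naive rule: drop a trailing 's'. Real table names may need more care.
        singular = table_name[:-1] if table_name.endswith("s") else table_name
    return singular + "ID"


def view_name(table_name: str) -> str:
    """Prefix a view over the given table with 'vw'."""
    return "vw" + table_name
```

For instance, under these assumptions the raw column contact_name would surface in a view as ContactName, and a view over Reservations would be named vwReservations.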

© 2006 STATÊRA. All rights reserved. Proprietary & Confidential.

Three-letter prefixes on task names in SSIS

It is easier to interpret package execution logs if the task and transform names indicate the type of the task or transform. The three-letter prefixes used to name tasks in SSIS are:

EPT: an Execute Process task

DFT: a Data Flow task

DER: a Derived Column transform

ASP: an Analysis Services package

SCD: a Slowly Changing Dimension

SEQ: a sequence container

Source descriptors

SRC: a data source in a data flow task

STAT: a script transform that you will see in most of the data flows. It collects row count and throughput information, which is ultimately reported in the enhanced logging.
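Because the prefix carries the component type, a log reader can recover that type from the name alone. The following Python sketch illustrates the idea; the dictionary simply restates the prefixes listed above, and the helper name describe_component is an assumption for the example.

```python
# Prefixes as listed in this document.
TASK_PREFIXES = {
    "EPT": "Execute Process task",
    "DFT": "Data Flow task",
    "DER": "Derived Column transform",
    "ASP": "Analysis Services package",
    "SCD": "Slowly Changing Dimension",
    "SEQ": "Sequence container",
    "SRC": "Data source in a data flow task",
    "STAT": "Script transform collecting row counts and throughput",
}


def describe_component(name: str) -> str:
    """Return the component type implied by the prefix of a name like 'DFT Load Company'."""
    prefix = name.split(" ", 1)[0].upper()
    return TASK_PREFIXES.get(prefix, "Unknown component type")
```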

Data Extraction on the OLTP Production database

Data is pulled from the OLTP database on a nightly basis. The data is accessed through a layering of structures as diagrammed below:

In SSIS there are four levels of control. The first three levels provide modularity and determine the order in which ETL packages are executed.

Flow Control

Configuration Information

When a package is started, connection information and variables are set from the package configuration. For the ADE BI System, all configuration information is kept in the table admin.Configuration in the relational Staging database. Any information passed from one package to another goes through this database.

Top Level Flow Control

The daily ETL process is controlled by the package LoadGroup_Full_Daily.dtsx. It contains four main packages, as shown below:

Figure 1. LoadGroup_Full_Daily.dtsx Top level flow control

The job of LoadGroup_Full_Daily.dtsx is to specify the order of operations: First dimension data is loaded into the relational warehouse, then fact data is loaded, then dimensions are processed in the AS cube, then facts are processed.
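The ordering described above amounts to running a fixed list of steps one after another. The Python sketch below illustrates that control pattern only. It assumes each step reports success or failure and that later steps are skipped after a failure, which is a common precedence-constraint setup but is not stated in this document. The function name run_load_group is hypothetical.

```python
from typing import Callable, List, Tuple


def run_load_group(steps: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run steps in the given order and return the names of those that succeeded.

    Assumption: a failed step stops the sequence, so facts are never loaded
    on top of dimensions that failed to load.
    """
    completed: List[str] = []
    for name, step in steps:
        if not step():
            break
        completed.append(name)
    return completed
```

A caller would pass the four steps in the order given above: load dimensions, load facts, process dimensions in the cube, process facts in the cube.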

Second Level Flow Control

When you open the task EPT Dimensions, you will see that it calls a package to carry out dimension processing: LoadGroup_Dimensions_Daily.dtsx. That package, in turn, calls a separate package for each of the dimensions processed in a daily load. For demonstration purposes, the first 7 dimensions are listed in the sequence container shown below.

Figure 2. LoadGroup_Dimensions_Daily.dtsx Sequence container holds packages to load dimension data into staging db

Type 2 Dimension Logic Flow

DimCompany contains some columns that are Type 1 (data changes are overwritten) and some that are Type 2 (historical changes need to be tracked); logic will also be added to allow for Type 1.5 (addition of an inferred member). We will use the Company dimension as an example of how the data flows to populate the staging table.

The high level logic control for DimCompany contains the three elements as shown below:

Figure 3. Dim_Company.DTSX High level flow control to load dimCompany into staging

When you drill down into Data Flow Task (DFT) Load Company the following flow control is revealed:

Figure 4. Flow control to load DimCompany into staging table

SRC DimCompany

SRC DimCompany is an OLE DB source component in the data flow. It executes the stored procedure etl.up_DimCompany to extract the data from etl.uf_DimCompany in the Source database. The stored procedure has two parameters, @logicalDate and @debug:

exec etl.up_DimCompany @logicalDate = ?,@debug = 0

The create script for the stored procedure is shown below:

    CREATE procedure [etl].[up_DimCompany]
        @logicalDate datetime,
        @debug bit = 0 --Debug mode?
    with execute as caller
    as
    /* This procedure is used to extract Company information into
     * the staging database
     *
     * exec etl.up_DimCompany '2006-12-20', 1
     */
    begin
        set nocount on

        if @debug = 1 begin
            --Debug mode returns only a sample of the rows
            select top (100) * from etl.uf_DimCompany (@logicalDate)
        end else begin
            select * from etl.uf_DimCompany (@logicalDate)
        end --if

        set nocount off
    end --proc

etl.uf_DimCompany is a user-defined, table-valued function that returns the desired records from the company (company_profile) and address tables. The create script for etl.uf_DimCompany is shown below:

    CREATE function [etl].[uf_DimCompany](@logicalDate datetime)
    returns table
    as
    return
    (
        SELECT dbo.company_profile.account AS CompanyAccountID,
               dbo.company_profile.name AS CompanyName,
               dbo.company_profile.contact_name AS ContactName,
               dbo.company_profile.contact_title AS ContactTitle,
               dbo.address.address AS Address,
               dbo.address.Address_2 AS Address2,
               dbo.address.city,
               dbo.address.state,
               dbo.address.country,
               dbo.address.zip,
               dbo.address.phone,
               dbo.address.fax,
               dbo.address.email,
               dbo.company_profile.credit_limit,
               dbo.company_profile.status,
               dbo.company_profile.property AS PropertyID,
               dbo.company_profile.locale_id AS LocaleID
        FROM dbo.company_profile
             INNER JOIN dbo.address ON dbo.company_profile.property = dbo.address.propertyfrom
        WHERE Logical_Date > @logicalDate
    );
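The extraction pattern used by the procedure and function above (join the two tables, keep only rows newer than the logical date, and optionally cap the result in debug mode) can be tried out in isolation. The following Python sketch uses an in-memory SQLite database as a stand-in. The reduced table layouts, the sample rows, the placement of logical_date on company_profile, and the helper names build_demo_db and extract_companies are all assumptions for illustration.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for the source tables (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE company_profile (account TEXT, name TEXT, property TEXT, logical_date TEXT);
        CREATE TABLE address (property TEXT, city TEXT);
        INSERT INTO company_profile VALUES ('A1', 'Acme', 'P1', '2006-12-19');
        INSERT INTO company_profile VALUES ('B2', 'Bolt', 'P2', '2006-12-21');
        INSERT INTO address VALUES ('P1', 'Denver');
        INSERT INTO address VALUES ('P2', 'Boulder');
        """
    )
    return conn


def extract_companies(conn: sqlite3.Connection, logical_date: str, debug: bool = False):
    """Return renamed company rows changed after logical_date; debug caps the result at 100 rows."""
    query = """
        SELECT cp.account AS CompanyAccountID,
               cp.name    AS CompanyName,
               a.city     AS City
        FROM company_profile AS cp
        INNER JOIN address AS a ON cp.property = a.property
        WHERE cp.logical_date > ?
        ORDER BY cp.account
    """
    if debug:
        query += " LIMIT 100"  # mirrors 'select top (100)' in debug mode
    # The date is passed as a bound parameter, never concatenated into the SQL text.
    return conn.execute(query, (logical_date,)).fetchall()
```

Dates are stored here as ISO-8601 strings so that plain string comparison orders them correctly; a real source database would use a datetime column.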

STAT Source

STAT Source is a script component that runs custom Visual Basic .NET code. Its purpose is to count the source records and measure throughput; the results are logged through the stored procedure audit.up_Event_Package_OnCount. The script defined in STAT Source is as follows:

    Imports System
    Imports System.Data
    Imports System.Data.OleDb
    Imports System.Collections

    Public Class ScriptMain
        Inherits UserComponent

        Private startTicks, totalTicks As Long
        Private rowCount, totalRows As Integer
        Private rps As New ArrayList() 'rps = rows per second

        Public Overrides Sub Input0_ProcessInput(ByVal Buffer As Input0Buffer)
            'Save the rate statistic for this buffer
            If startTicks <> 0 Then
                totalRows += rowCount
                Dim ticks As Long = CLng(DateTime.Now.Ticks - startTicks)
                If ticks > 0 Then
                    totalTicks += ticks
                    Dim rate As Integer = CInt(rowCount * (TimeSpan.TicksPerSecond / ticks))
                    rps.Add(rate)
                End If
            End If
            'Reinitialize the counters
            rowCount = 0
            startTicks = DateTime.Now.Ticks
            'Call the base method
            MyBase.Input0_ProcessInput(Buffer)
        End Sub

        Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
            rowCount += 1 'No exposed Buffer.RowCount property so have to count manually
        End Sub

        Public Overrides Sub PostExecute()
            MyBase.PostExecute()

            'Define the Stored Procedure object
            With New OleDbCommand("audit.up_Event_Package_OnCount")
                .CommandType = CommandType.StoredProcedure
                'Define the common parameters
                .Parameters.Add("@logID", OleDbType.Integer).Value = Variables.LogID
                .Parameters.Add("@componentName", OleDbType.VarChar, 50).Value = Me.ComponentMetaData.Name
                .Parameters.Add("@rows", OleDbType.Integer).Value = totalRows
                .Parameters.Add("@timeMS", OleDbType.Integer).Value = CInt(totalTicks \ TimeSpan.TicksPerMillisecond)
                'Only write the extended stats if RowCount > 0
                If rps.Count > 0 Then
                    'Calculations depend on sorted array
                    rps.Sort()
                    'Remove boundary-case statistics
                    If rps.Count >= 3 Then rps.RemoveAt(0)
                    'Calculate min & max
                    Dim min As Integer = CInt(rps.Item(0))
                    Dim max As Integer = CInt(rps.Item(rps.Count - 1))
                    'Define the statistical parameters
                    .Parameters.Add("@minRowsPerSec", OleDbType.Integer).Value = min
                    .Parameters.Add("@maxRowsPerSec", OleDbType.Integer).Value = max
                End If
                'Define and open the database connection
                .Connection = New OleDbConnection(Connections.SQLRealWarehouse.ConnectionString)
                .Connection.Open()
                Try
                    .ExecuteNonQuery() 'Execute the procedure
                Finally 'Always finalize expensive objects
                    .Connection.Close()
                    .Connection.Dispose()
                End Try
            End With
        End Sub
    End Class
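The throughput arithmetic in the script is easier to follow outside the SSIS scaffolding. The Python sketch below restates it under the same logic: one rows-per-second sample per buffer, then the minimum and maximum after the samples are sorted and the lowest one is dropped once at least three exist. The function names are assumptions, and int() truncates where the VB CInt rounds, so results may differ by one in edge cases.

```python
from typing import List, Optional, Tuple

TICKS_PER_SECOND = 10_000_000  # .NET ticks are 100-nanosecond intervals


def buffer_rate(row_count: int, elapsed_ticks: int) -> Optional[int]:
    """Rows per second for one buffer, or None when no measurable time elapsed."""
    if elapsed_ticks <= 0:
        return None
    return int(row_count * (TICKS_PER_SECOND / elapsed_ticks))


def min_max_rates(rates: List[int]) -> Optional[Tuple[int, int]]:
    """Return (min, max) rows per second, mirroring the script's boundary handling."""
    if not rates:
        return None
    ordered = sorted(rates)
    if len(ordered) >= 3:
        ordered = ordered[1:]  # drop the lowest sample, as the script does
    return ordered[0], ordered[-1]
```

Dropping the lowest sample keeps one unusually slow buffer, which is often the first one, from dominating the reported minimum.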

DER Coalesce

DER Coalesce is a Derived Column transform that replaces any NULL values in the incoming source data with pre-defined "unknown" strings and numbers for the non-nullable target fields.
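As a rough illustration of the coalescing step, the Python sketch below fills missing values from a table of defaults. The column names and default values shown are hypothetical; the actual defaults used by DER Coalesce are not listed in this document.

```python
from typing import Any, Dict

# Hypothetical defaults for non-nullable target fields.
UNKNOWN_DEFAULTS: Dict[str, Any] = {"CompanyName": "Unknown", "CreditLimit": 0}


def coalesce_row(row: Dict[str, Any], defaults: Dict[str, Any] = UNKNOWN_DEFAULTS) -> Dict[str, Any]:
    """Return a copy of row with None (NULL) or absent values replaced by their defaults."""
    cleaned = dict(row)  # copy so the incoming row is left untouched
    for column, default in defaults.items():
        if cleaned.get(column) is None:
            cleaned[column] = default
    return cleaned
```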

SCD Data Flow Component

The Slowly Changing Dimension Wizard allows you to select which columns are Type 1 (changes overwritten), Type 1.5 (Inferred), and Type 2 (historical changes stored).

Once you have classified each of the columns in the dimension, the wizard generates most of the code needed to support this functionality. You can then modify the generated code for the inferred output, the Type 1 output, and the Type 2 output to accommodate any custom functionality.
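The routing decision the SCD component makes for each incoming row can be sketched as follows. This Python example is a simplification and not the wizard's generated logic: the column classification is hypothetical, and it assumes that a Type 2 change takes precedence when both kinds of column have changed.

```python
from typing import Any, Dict, Optional

# Hypothetical classification of DimCompany columns.
TYPE1_COLUMNS = ("ContactName",)
TYPE2_COLUMNS = ("CompanyName",)


def route_row(incoming: Dict[str, Any], current: Optional[Dict[str, Any]]) -> str:
    """Decide which output branch an incoming row belongs to.

    current is the existing current dimension row for the same business key,
    or None when the key is not yet in the dimension.
    """
    if current is None:
        return "new"
    if current.get("InferredMember"):
        return "inferred"  # placeholder row created earlier from a fact
    if any(incoming[c] != current[c] for c in TYPE2_COLUMNS):
        return "scd2"  # assumption: Type 2 wins when both kinds changed
    if any(incoming[c] != current[c] for c in TYPE1_COLUMNS):
        return "scd1"
    return "unchanged"
```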

Inferred Output

The first component in the Inferred Output branch, STAT Inferred, counts the number of records that include inferred members. Inferred members would be added if a fact includes, in this case, a Company that doesn’t already exist in the Company table.

The final step for the Inferred Output branch updates a record in the dimension table that is blank except for the business primary key that was retrieved from a fact record. The remaining data in this record will be updated with actual data in a future import.
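The inferred-member update described above can be pictured as filling in a placeholder row. In this Python sketch the column names CompanyKey (surrogate key) and CompanyAccountID (business key) are assumptions, as is clearing the InferredMember flag once real data arrives.

```python
from typing import Any, Dict


def fill_inferred_member(placeholder: Dict[str, Any], incoming: Dict[str, Any]) -> Dict[str, Any]:
    """Return the placeholder row completed with real source data.

    The surrogate key and business key stay as they are, so facts that
    already reference the placeholder remain valid.
    """
    updated = dict(placeholder)
    for column, value in incoming.items():
        if column != "CompanyAccountID":  # business key is already set
            updated[column] = value
    updated["InferredMember"] = False  # assumption: flag is cleared on first real load
    return updated
```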

New Output

The first component in the New Output branch, STAT New, counts the number of records that include new type 2 members. New members are added when the source data includes, in this case, a Company that doesn’t already exist in the DimCompany table.

The next component, a Union All Transform Component named All New SCD – 2, combines the rows produced by updates to type 2 columns with the new records containing type 2 data.

The next component, a Derived Column Transform Component named DER New SCD 2, sets the values for all the derived columns, such as the surrogate key, CurrentRow, StartDate, EndDate, InferredMember, and LastModifiedDate.

The final component in the New Output branch, an OLE DB Destination Component named DTS New SCD-2, writes the data to the DimCompany table in the staging database.

SCD-2 Output

The first component in the SCD-2 Output branch, STAT SCD-2, counts the number of records that include type 2 members that have updates. Historical changes will be saved with all type 2 data columns.

The next component, a Derived Column Transform Component named DER SCD 2, sets the values for all the derived columns, such as the surrogate key, CurrentRow, StartDate, EndDate, InferredMember, and LastModifiedDate.

The next component in the SCD-2 Output branch, an OLE DB Destination Component named DTS New SCD-2, updates the CurrentRow field for the given business primary key.

The SCD-2 Output branch then merges with the New Output branch at the Union All Transform Component named All New SCD – 2.
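Taken together, the SCD-2 steps expire the existing current row and add a new version of the member. The Python sketch below shows that pattern on an in-memory list of rows. The column names CurrentRow, StartDate, EndDate, InferredMember, and LastModifiedDate follow the derived columns named above; CompanyKey and CompanyAccountID are assumed names, and stamping EndDate on the expired row is an assumption, since the text only mentions updating CurrentRow.

```python
from typing import Any, Dict, List


def apply_type2_change(
    dimension_rows: List[Dict[str, Any]],
    incoming: Dict[str, Any],
    business_key: str,
    load_date: Any,
    next_surrogate_key: int,
) -> Dict[str, Any]:
    """Expire the current version for business_key and append a new current version."""
    for row in dimension_rows:
        if row["CompanyAccountID"] == business_key and row["CurrentRow"]:
            row["CurrentRow"] = False
            row["EndDate"] = load_date  # assumption: expiry is stamped with the load date
    new_row = dict(incoming)
    new_row.update(
        {
            "CompanyKey": next_surrogate_key,
            "CompanyAccountID": business_key,
            "CurrentRow": True,
            "StartDate": load_date,
            "EndDate": None,
            "InferredMember": False,
            "LastModifiedDate": load_date,
        }
    )
    dimension_rows.append(new_row)
    return new_row
```

The old version keeps its own surrogate key, so facts loaded before the change still point at the historical attribute values.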

SCD-1 Output

The first component in the SCD-1 Output branch, STAT SCD-1, counts the number of records that include type 1 members that have updates. Historical changes will be overwritten with all type 1 data columns.

The second and final step in this branch updates all data for type 1 changes in the DimCompany table.
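A Type 1 change overwrites values in place rather than adding a row. The Python sketch below shows the idea on an in-memory list of rows. It assumes the overwrite is applied to every stored version of the member, not only the current one, and the key column name CompanyAccountID is likewise an assumption.

```python
from typing import Any, Dict, Iterable, List


def apply_type1_change(
    dimension_rows: List[Dict[str, Any]],
    incoming: Dict[str, Any],
    business_key: str,
    type1_columns: Iterable[str],
) -> int:
    """Overwrite the Type 1 columns on every row for business_key; return rows touched."""
    columns = list(type1_columns)
    updated = 0
    for row in dimension_rows:
        if row["CompanyAccountID"] == business_key:
            for column in columns:
                row[column] = incoming[column]  # no history is kept for Type 1 columns
            updated += 1
    return updated
```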
