ETL Basics
(A-Introduction to ETL & DataStage)

ETL Basics

• Extraction, Transformation & Load

– Extracts data from source systems

– Enforces data quality and consistency standards

– Conforms data from different sources

– Loads data to target systems

• Usually a batch process involving large volumes of data

• Scenarios

– Load a data warehouse or data mart for analytical and reporting applications

– Data Integration

– Load packaged applications, or external systems through their APIs or interface databases

– Data Migration
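As a concrete (if toy) illustration of the three phases, here is a minimal Python sketch; the file names, field names and quality rules are hypothetical, not from the slides:

    import csv

    def extract(path):
        """Extract: read rows from a source system (here, a CSV file)."""
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def transform(rows):
        """Transform: enforce quality standards and conform the data."""
        for row in rows:
            if not row["customer_id"]:                # quality rule: drop incomplete rows
                continue
            row["country"] = row["country"].upper()   # conform codes across sources
            yield row

    def load(rows, out_path):
        """Load: write the conformed rows to the target system."""
        with open(out_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["customer_id", "country"])
            writer.writeheader()
            for row in rows:
                writer.writerow({k: row[k] for k in ("customer_id", "country")})

    load(transform(extract("source.csv")), "target.csv")   # typically run as a batch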

ETL Basics: ETL Tool or Hand Coding?

Tool-based ETL

– Simpler, faster, cheaper development

– Integrated metadata repositories

– Built-in scheduler

– Built-in connectors for a variety of sources/targets

– Delivers good performance

– Can call external routines

Hand-coded ETL

– Object-oriented programming techniques

– Automated unit testing tools

– Can develop in a common and well-known language

– Unlimited flexibility

– In-house programmers

ETL Basics: Advantages of Tool-based ETL

– Reusability

– Metadata repository

– Incremental load

– Managed batch loading

– Simpler connectivity

– Parallel operation

– Vendor experience

ETL Basics

ETL products come from

– Pure-play ETL vendors

– Database vendors

– Business Intelligence vendors

ETL Basics

• Usual features provided by ETL tools:

– Graphical data flow definition interfaces for easy development

– Native & ODBC connectivity to standard databases, packages, etc.

– Metadata maintenance components

– Metadata import & export from standard databases, packages, etc.

– Inbuilt standard functions & transformations – e.g. date, aggregate, sort, etc.

– Options for sharing or reusing developed components

– Facility to call external routines or write custom code for complex requirements

– Batch definition to handle dependencies between data flows to create the application

– ETL engines that handle the data manipulation without depending on the database engines

– Run-time support for monitoring the data flow and reading message logs

– Scheduling options

ETL Basics: Architecture of a Typical ETL Tool

[Diagram] The typical components and the data/metadata flows between them:

• GUI-Based Development Environment
  – Metadata Definition/Import/Export
  – Data Flow & Transformation Definition
  – Batch Definition
  – Test & Debug
  – Schedule

• Run-time Environment
  – Trigger ETL
  – Monitor flow
  – View logs

• ETL Engine – moves the data between the source & target databases

• ETL Metadata Repository – stores the metadata, exchanged with the source & target databases


DataStage Overview

IBM WebSphere DataStage

What is IBM WebSphere DataStage?

• Design jobs for Extraction, Transformation, and Loading (ETL)

• Ideal tool for data integration projects – such as data warehouses, data marts, and system migrations

• Import, export, create, and manage metadata for use within jobs

• Schedule, run, and monitor jobs all within DataStage

• Administer the DataStage development and execution environments

• Create batch (controlling) jobs

DataStage Architecture

[Diagram] Data flows from Sources through the Server Engine to Targets; the client tools work against the Repository:

• Server
  – Engine: executes the jobs
  – Repository: ETL metadata, maintained in an internal format

• Designer (client): assemble jobs, debug, compile jobs, execute jobs

• Director (client): execute jobs, monitor jobs, view job logs

• Manager (client): manage the Repository, create custom routines & transforms, import & export component definitions

Some Product Flavors

• Enterprise Edition

– Includes Parallel Engine, Server Engine & MetaStage

– Supports Parallel & Server Jobs in a SMP & MPP environment

• Server Edition

– Lower-end version, much less expensive

– Includes Server Engine, supports only Server Jobs

– Sufficient for less performance critical applications

– MetaStage can also be packaged with it

• MVS Edition

– An extension that allows generation of COBOL code & JCL for execution on mainframes

– Common development environment, but involves porting & compiling code on to the mainframe

• SOA Edition

– RTI component to handle real-time interface

– Allows job components to be exposed as web-services

– Multiple servers can service requests routed through the RTI component

– Note that the web service client component is available even without purchasing the SOA Edition

NOTE:

• This material covers ONLY the Parallel Engine Component

DataStage Architecture

DataStage Server Architecture

• Server (Parallel Engine):

• Windows - Windows Server 2003 (Standard & Enterprise) (DS 7.5.2 only),

• Unix - HP-UX, Tru64, IBM AIX,

• Linux - Red Hat Enterprise Linux AS 3.0 & Linux SUSE LINUX Enterprise Server 9

• Solaris 2.8/2.9/2.10

• USS z/OS

– The engine runs the job executables, managing the data

• Repository:

– Contains all the metadata, mapping rules, etc.

– DataStage applications are organized into Projects, each server can handle multiple projects

– The DataStage repository is maintained in an internal file format, not in a database

DataStage Architecture

• DataStage Client Products

– Windows-based components

– Need to access the server at development time

– Designer: create DataStage ‘jobs’, which are compiled to create the executables

– Director: validate, schedule, run, and monitor jobs

– Manager: view and edit the contents of the Repository.

– Administrator: set up users, create and move projects, set purging criteria, set environment variables

– Designer, Director & Manager can connect to one Project at a time

Key DataStage Components

• Project

– Usually created for each application (or version of an application, e.g. Test, Dev, etc.)

– Multiple projects can exist on a single server box

– Associated with a specific directory with the same name as the Project: the “Repository”, which contains all metadata associated with the project

– Consists of

• DataStage Server & Parallel Jobs

• Pre-built components (Stages, Functions, etc.)

• User-defined components

– User Roles & Privileges set at this level

– Managed through the DS Administrator client tool

– Connected to through other client components

Key DataStage Components

• Category

– Folder-structure within the Project.

– Separate “Trees” for Jobs, Table Definitions, Routines, etc.

– Managed through the DS Manager client tool

– Used for better organization of project components.

• Table Definition

– Metadata: record structure with column definitions

– Can be imported or manually entered

– Not necessarily associated with a specific table or file.

• Association only made within the job (and stage) definition

• Metadata definition also possible directly through the Stage, but may not result in creation of a table definition

– Created using the DS Manager client tool

• Schema Files

– External metadata definition for a sequential file, with a specific format & syntax; associated with a data file at run-time (see the sketch below)
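For illustration, a schema file for a comma-delimited sales file (like the demo input later in this deck) might look roughly as follows; the syntax is the Orchestrate record-schema format recalled from memory, so treat the details as an assumption and verify against your DataStage version:

    record {record_delim='\n', delim=',', final_delim=end}
    (
      Region_ID: int32;
      City: string[max=255];
      Zone_ID: string[max=10];
      Regional_Sales: int32;
    )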

Key DataStage Components

• Job

– Executable unit of work that can be compiled & executed independently or as part of a data flow stream

– Created using DS Designer Client (Compile & Execute also available through Designer)

– Managed (copy, rename) through DS Manager

– Executed, monitored through DS Director, Log also available through Director

– Parallel Jobs (Available with Enterprise Edition):

• have built-in functionality for Pipeline and Partitioning Parallelism

• Compiled into OSH (Orchestrate Scripting Language).

• The OSH executes “Operators” which are executable C++ class instances

– Server Jobs (Available with Enterprise as well as Server Editions):

• Compiled into BASIC (interpreted pseudo-code)

• Limited functionality and parallelism

– Can accept parameters **

– Reads & writes from one or more files/tables, may include transformations

– Collection of stages & links

Key DataStage Components

• Stages

– Pre-built component to

• Perform a frequently required operation on a record or set of records, e.g. Aggregate, Sort, Join, Transform, etc.

• Read or write into a source or target table or file

• Links

– Depict the flow of data between stages

• Data Sets

– Data is internally carried through links in the form of Data Sets

– DataStage provides facility to “land” or store this data in the form of files

– Recommended for staging data, as the data is already partitioned & sorted; hence a fast way of sharing/passing data between jobs

– Not recommended for back-ups or for sharing between applications as it is not readable, except through DataStage

• Shared Containers

– Reusable job elements, comprising stages and links

Key DataStage Components

• Routines

– Pre-built & Custom built

– Two Types

• Before/After Job: can be executed before or after a job (or some stages); takes multiple input arguments and returns a single error code

• Transform: called within a Transformer Stage to process a record & produce a single return value that can be assigned to, or used in the computation of, an output field

– Custom Built

• Written & compiled using a C++ utility. The Object File created is registered as a routine & is invoked from within DataStage

– Note that server jobs use routines written within the DS environment using an extended version of the BASIC language

• Job Sequence

– Definition of a workflow, executing jobs (or sub sequences), routines, OS commands, etc.

– Can accept specifications for dependency, e.g.

• when file A arrives, execute Job B

• Execute Job A; on failure of Job A execute OS command <<XXX>>; on completion of Job A execute Jobs B & C

• Can invoke parallel as well as server jobs

• DS API

– SDK functions

– Can be embedded into C++ code, invoked through the command line or from shell scripts

– Can retrieve information, compile, start, & stop jobs
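For example, a shell script might start a job and pass it a parameter through the dsjob command-line client; a minimal sketch, with hypothetical project/job/parameter names (verify the exact options against your release):

    # Run the job, passing a parameter value, then query its status
    dsjob -run -mode NORMAL -param Conversion_Rate=40 SalesProject SalesDemoJob
    dsjob -jobinfo SalesProject SalesDemoJob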

Key DataStage Components

• Configuration File

– Defines the system size & configuration applicable to the job, in terms of nodes and node pools, each mapped to disk space & assigned scratch disk space

– Details maintained external to the job design

– Different files can be used according to individual job requirements
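A minimal sketch of such a configuration file, assuming two logical nodes on a single SMP host (hostname and paths are hypothetical):

    {
      node "node1" {
        fastname "etlserver"
        pools ""
        resource disk "/data/ds/pds" {pools ""}
        resource scratchdisk "/data/ds/scratch" {pools ""}
      }
      node "node2" {
        fastname "etlserver"
        pools ""
        resource disk "/data/ds/pds" {pools ""}
        resource scratchdisk "/data/ds/scratch" {pools ""}
      }
    }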

• Environment Variables

– Set or defined through the Administrator at a project level

– Can be overridden at a job level

– Types

• Standard/generic variables: control the design and running of parallel jobs, e.g. buffering, message logging, etc.

• User Defined Variables

• DSX or XML files

– Created through export option

– Can select components by type, category & name

Other DataStage Features

Source & Target data supported:

• Text files

• Complex data structures in XML

• Enterprise application systems such as SAP, PeopleSoft, Siebel and Oracle Applications

• Almost any database, including partitioned databases: Oracle, IBM DB2 EE/EEE/ESE (with and without DPF), Informix, Sybase, Teradata, SQL Server, and more, including access via ODBC

• Web services

• Messaging and EAI, including WebSphere MQ and SeeBeyond

• SAS

• DataStage is National Language Support (NLS) enabled using Unicode.

• 400 pre-built functions and routines

• Job templates & wizards

• DataStage uses the OS-level security for restricting access to projects.

– Only root/admin user can administer the server

– Roles can be assigned to users & groups to control access to projects

Recap

• We Saw:

– What, Why & How ETL

– DataStage

• Architecture

• Flavors

• Components & Other Features

A Quick Demo Job

• Case:

– Input file contains sales data with attributes including <Region ID, Zone, Total Sales>

– Note that

• Region ID is the unique key

• The file contains attributes other than the 3 mentioned above

– The required calculation is to

• compute the Regional Total as a percentage of the Zonal Total

• Compute the rupee equivalent of the Regional Total by multiplying it by the exchange rate, which should be a parameter

– e.g. if the input is

  Region ID  City    Zone ID  Regional Sales
  1          City 1  Z1       10
  2          City 2  Z1       10
  3          City 3  Z1       20
  4          City 4  Z2       20
  5          City 5  Z2       30

  and the conversion rate is 40, the expected output is

  Region ID  City    Zone ID  Regional Sales  PCT  Rs_Sales
  1          City 1  Z1       10              25   400
  2          City 2  Z1       10              25   400
  3          City 3  Z1       20              50   800
  4          City 4  Z2       20              40   800
  5          City 5  Z2       30              60   1200
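The arithmetic can be checked with a small plain-Python sketch of the job's logic (an illustration of the computation only, not of DataStage itself):

    # Input rows, as in the table above
    rows = [
        {"region_id": 1, "city": "City 1", "zone_id": "Z1", "regional_sales": 10},
        {"region_id": 2, "city": "City 2", "zone_id": "Z1", "regional_sales": 10},
        {"region_id": 3, "city": "City 3", "zone_id": "Z1", "regional_sales": 20},
        {"region_id": 4, "city": "City 4", "zone_id": "Z2", "regional_sales": 20},
        {"region_id": 5, "city": "City 5", "zone_id": "Z2", "regional_sales": 30},
    ]
    conversion_rate = 40            # the job parameter in the DataStage version

    # Aggregator Stage: zonal totals (Z1 -> 40, Z2 -> 50)
    zonal = {}
    for r in rows:
        zonal[r["zone_id"]] = zonal.get(r["zone_id"], 0) + r["regional_sales"]

    # Join + Transformer Stages: PCT of zonal total and rupee equivalent
    for r in rows:
        pct = 100 * r["regional_sales"] // zonal[r["zone_id"]]
        rs_sales = r["regional_sales"] * conversion_rate
        print(r["region_id"], r["city"], r["zone_id"], r["regional_sales"], pct, rs_sales)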

A Quick Demo Job

• Step 0

– Project has been created

– User groups have been assigned appropriate roles

– Source Data is available

– ODBC connection DSNs to the source & target databases have been created <Not required for this particular example>

• Step 1 : Connect to the DataStage Project

– Open Manager, connecting to the appropriate server & the specific project

– Note that the OS-level User ID & Password of the server box are used

A Quick Demo Job

• Step 2 : Define Metadata of source and/or target files

– Menu Option: Import > Table Definitions > Sequential File Definitions

– Note: Can also be done directly through the Designer Interface

• Browse to the directory & select source file.

• Select category under which to save the table definition & the name of the table definition

• Click on Import

Note: the path & file are w.r.t. the DS server, not the client!

A Quick Demo Job

• Step 2 …

– Define formatting (e.g. fixed-width/delimited, what end-of-line character has been used, does the first line contain column names, etc.)

– Set Column Names (if file does not already contain them), & widths

Designer Interface

• Step 3: Create the job

– Open Designer

• through tools menu in Manager OR

• Directly from the desktop; in this case, you will need to provide authentication & select the project again

• Create a new “Parallel Job”

• Save within the chosen ‘Category’ or folder

Designer Interface

[Designer window: Repository pane, Palette, Design pane]

• Step 4 – Design the job

– Drag & drop icons & links from the palette as shown in the next slide

A Quick Demo Job

The demo job, stage by stage:

• Sequential File Stage: read the source file

• Copy Stage: to use the data stream twice

• Aggregator Stage: group by Zone, Sum(Sales Total)

• Join Stage: join aggregated & un-aggregated data by Zone

• Transformer Stage: compute PCT at the record level

• Sequential File Stage: write into the target file, metadata defined through the job

A Quick Demo Job

• Step 4 contd.

– Define Job Parameters (default value is optional)

A Quick Demo Job

• Step 4 Contd. - Design the job..

– Double-Click icons to open stages for settings & options.

– Note that individual stage options will be discussed shortly

– Stage & link names have defaults; these should be changed to meaningful tags

• Step 5 – Save & Compile the job

– Compile the job: Designer menu/icon

• Step 6 – Run

– Designer menu/icon (or Director menu/icon)

– View Log: Director menu/icon

Tip!

• Table definitions can also be created through the DataStage Designer

• Always import table definitions from the database to ensure that datatypes are consistent

• Ensure data definition is a project-level controlled activity, to avoid proliferation of metadata with redundancies and inconsistencies

Director Interface

Director view …

A Quick Demo Job

• View sample records in the output

– Designer: option available on Right-click on stage icon or within stage dialog box

• Demo Job Completed

Sequential File Stage

• Features

– Normally executes in sequential mode**

– Can read from multiple files with same metadata

– Can accept wild-card path & names.

– The stage needs to be told:

• How file is divided into rows (record format)

• How row is divided into columns (column format)

• Stage Rules

– Accepts 1 input link OR 1 stream output link

– Rejects record(s) that have a metadata mismatch. Options on reject:

• Continue: ignore record

• Fail: Job aborts

• Output: reject link metadata is a single column, not alterable; can be written into a file/table

** parallelization options to be discussed shortly
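The three reject options can be pictured with a small Python sketch; parse_record and the mode names here are illustrative, not the stage's real implementation:

    def parse_record(line, n_columns=4, delim=","):
        fields = line.rstrip("\n").split(delim)
        if len(fields) != n_columns:              # metadata mismatch
            raise ValueError("metadata mismatch")
        return fields

    def read_file(lines, reject_mode="continue"):
        good, rejects = [], []
        for line in lines:
            try:
                good.append(parse_record(line))
            except ValueError:
                if reject_mode == "fail":         # Fail: the job aborts
                    raise
                if reject_mode == "output":       # Output: raw record goes down the reject link
                    rejects.append(line)
                # Continue: the record is simply ignored
        return good, rejects

    rows, rejects = read_file(["1,City 1,Z1,10\n", "bad-row\n"], reject_mode="output")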

Sequential File Stage

Adding a reject link:

• The reject link is shown as a dotted line

• Options for the output link: Reject Mode = “Output”

• Options for the reject link: none; the column format is raw and not editable

Sequential File Stage

Sequential File Stage properties …

Columns: load from Table Definitions, or enter manually & “Save” as a table definition

Copy Stage

• Features of Copy Stage

– Copies single input link dataset to a number of output datasets

– Records can be copied with or without some modifications

– Modifications can be:

• Drop columns

• Change the order of columns

– Note that this functionality is also provided by the Transformer Stage, but Copy is faster

Separate settings for each output link: drop columns, change the order of columns, rename columns

Transformer Stage

• Single input

• One or more output links

• Optional Reject link

• Column mappings – for each output link, selection of columns & creation of new derived columns also possible

• Derivations

– Expressions written in BASIC

– Final compiled code is C++ generated object code (Specified compiler must be available on the DS Server)

– Powerful but expensive stage in terms of performance

• Stage variables

– For readability & for performance when the same complex expression is used in multiple derivations

– Be aware that the values are retained across rows, so the order of definition of stage variables matters – and they are retained only within each partition (see the sketch after this list)

• Expressions for constraints and derivations can reference

– Input columns

– Job parameters

– Functions (built-in or user-defined)

– System variables and constants

– Stage variables (again, scoped within each partition)

– External routines

• Link Ordering - to use derivations from previous links
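A small Python sketch of the per-partition behaviour of stage variables (illustrative only): each Transformer instance keeps its own copy of the variables, so values carry across rows but never across partitions.

    def transform_partition(rows):
        sv_prev = None                 # stage variable, initialised once per partition
        for row in rows:
            is_new_zone = (row["zone_id"] != sv_prev)   # reads the *previous* row's value
            sv_prev = row["zone_id"]                    # update order matters
            yield {**row, "first_in_zone": is_new_zone}

    partition_0 = [{"zone_id": "Z1"}, {"zone_id": "Z1"}, {"zone_id": "Z2"}]
    partition_1 = [{"zone_id": "Z2"}]
    for part in (partition_0, partition_1):
        print(list(transform_partition(part)))   # Z2 counts as "new" again in partition 1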

Transformer Stage

Inside the Transformer Stage …

[Stage dialog: input links & output links in the link area; metadata area; expressions/transforms]

Transformer Stage

Column mappings:

• Not all input columns need to be used

• Metadata is defined for derived columns

• There is a section for each output link

• Stage variable derivations, expressions & properties are set here

Transformer Stage

• Constraints

– Filter data

– Direct data down different output links

• For different processing or storage

– Output links may also be set to be “Otherwise/Log” to catch records that have not passed through any of the links processed so far (link ordering is critical)

– Optional reject link to catch records that failed to be written into any output because of write errors or NULLs

Example constraints:

• Do not output if Region_ID is NULL

• Output records where all previous constraints have failed, i.e. Region_ID is NULL

• Abort the job if 10 rows have Region_ID = NULL
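Schematically, constraints route each record down the first link whose expression it satisfies, with an otherwise link catching the rest; a toy Python sketch (not the stage's actual mechanism):

    def route(row, links, otherwise=None):
        for name, constraint in links:        # constraints evaluated in link order
            if constraint(row):
                return name
        return otherwise                      # no constraint matched

    links = [("valid_out", lambda r: r["region_id"] is not None)]
    for row in [{"region_id": 1}, {"region_id": None}]:
        print(route(row, links, otherwise="null_log"))   # -> valid_out, null_log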

Join Stage

• Four types:

• Inner

• Left outer

• Right outer

• Full outer

• Follow the RDBMS-style relational model

– Cross-products in case of duplicates

– Matching entries are reusable for multiple matches

– Non-matching entries can be captured (Left, Right, Full)

• Join keys must have the same name; rename in a previous stage if required

• 2 or more input links, 1 output link

• No fail/reject option for missed matches

• All input link data is pre-sorted & partitioned** on the join key

– By default, the sort is inserted by DataStage

– If the data is pre-sorted (by a previous stage), DataStage does not pre-sort

** - to be discussed shortly
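The relational-model behaviour – matches reusable for multiple rows, non-matching entries kept by outer joins – can be sketched in Python (a toy illustration, not DataStage code):

    left  = [("Z1", "City 1"), ("Z1", "City 2"), ("Z3", "City 9")]
    right = [("Z1", 40), ("Z2", 50)]

    def left_outer_join(left, right):
        for lkey, lval in left:
            matches = [rval for rkey, rval in right if rkey == lkey]
            if matches:
                for rval in matches:        # cross-product when keys repeat
                    yield (lkey, lval, rval)
            else:
                yield (lkey, lval, None)    # unmatched row: a nullable output field reveals it

    print(list(left_outer_join(left, right)))
    # [('Z1', 'City 1', 40), ('Z1', 'City 2', 40), ('Z3', 'City 9', None)]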

Join Stage

Join Stage implementation:

• Can have multiple keys

• Option for case-sensitive or case-insensitive joins

• Join type is selected in the stage properties

• Key candidates are listed in a drop-down box, i.e. fields with common names

• Important: In case outer joins are specified

• the left & right links must be specified & the downstream checks must consider this

• Non-null joining fields must be made nullable on output to allow detection of join failures

Aggregator Stage

• Performs data aggregations

• Specify zero or more key columns that define the aggregation units (or groups)

• Aggregation functions available are:

– Count (nulls/non-nulls)

– Sum

– Max/Min/Range/Mean

– Missing/non-missing value count

– % coefficient of variation

• Output link has “Mapping” tab to select, reorder & rename fields

• Input is key-partitioned** on the grouping columns

** - to be discussed shortly

Aggregator Stage

• Grouping methods available are:

– Hash

• Intermediate results for each group are stored in a hash table

• Final results are written out after all input has been processed

• No sort required

• Use when number of unique groups is small

– Running tally for each group’s aggregate calculations needs to fit into memory. Requires about 1K RAM / group

– Sort

• Only a single aggregation group is kept in memory

– When new group is seen, current group is written out

• Requires input to be sorted by grouping keys

• Can handle unlimited numbers of groups

• Example: average daily balance by credit card
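A Python sketch contrasting the two methods (illustrative): hash keeps one running tally per distinct group in memory, while sort relies on key-sorted input and holds only the current group.

    def hash_aggregate(rows, key, val):
        totals = {}                                # one in-memory entry per group
        for r in rows:
            totals[r[key]] = totals.get(r[key], 0) + r[val]
        return totals                              # emitted after all input is read

    def sort_aggregate(rows, key, val):
        current, total = None, 0
        for r in rows:                             # input must be sorted by key
            if r[key] != current:
                if current is not None:
                    yield current, total           # group complete: write it out
                current, total = r[key], 0
            total += r[val]
        if current is not None:
            yield current, total

    rows = [{"zone": "Z1", "sales": 10}, {"zone": "Z1", "sales": 30},
            {"zone": "Z2", "sales": 50}]
    print(hash_aggregate(rows, "zone", "sales"))       # {'Z1': 40, 'Z2': 50}
    print(list(sort_aggregate(rows, "zone", "sales"))) # [('Z1', 40), ('Z2', 50)]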

Using Job Parameters

• Defined through Job Properties > Parameters (default value is optional)

• Used to pass business & control parameters to the jobs

• Direct usage in expression evaluation; usage as a stage parameter for string substitution via #XXX#

Job Parameters

• Setting Parameter Values

– Provided at run-time

– The default value is used if not reset

– If no default value, the value must be provided at run-time
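Those rules amount to the following resolution logic, sketched in Python (function and parameter names are hypothetical):

    def resolve_parameters(definitions, supplied):
        values = {}
        for name, default in definitions.items():
            if name in supplied:
                values[name] = supplied[name]      # value provided at run-time wins
            elif default is not None:
                values[name] = default             # otherwise fall back to the default
            else:
                raise ValueError(f"parameter {name!r} has no default; a value is required")
        return values

    defs = {"Conversion_Rate": "40", "Target_Dir": None}       # None = no default
    print(resolve_parameters(defs, {"Target_Dir": "/data/out"}))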

Recap

• We Saw:

– Table Definition

– Job

– Stages

• Sequential File as source & target

• Aggregator

• Join

• Transformer

– Job Parameters


• Case Study 1