
DataStage NOTES


:: FUNDAMENTAL CONCEPTS::

DAY 1

Introduction to the Phases of DataStage

There are four different phases in DataStage:

Phase I: Data Profiling

It is for source system analysis; the analyses are:

1. Column analysis,

2. Primary key analysis,

3. Foreign key analysis — by these analyses we can find whether the data is "dirty" or not,

4. Base Line analysis, and

5. Cross domain analysis.

Phase II: Data Quality (also called cleansing). In this phase the steps are interdependent, i.e., they run one after another as shown below:

Parsing → Correcting → Standardizing → Matching → Consolidation

Phase III: Data Transformation

The ETL process is done here; the data is transformed as it moves from one stage to another.

ETL means

E - Extract

T - Transform

L- Load.
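As a rough illustration of these three steps (outside DataStage), here is a minimal Python sketch of an extract - transform - load flow over a comma-separated flat file; the file names, column layout, and transformation rule are assumptions made up for the example.

    import csv

    def extract(path):
        # E - read raw records from the source flat file
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # T - a simple business rule: uppercase the name, keep dept 10 only
        return [{**r, "ename": r["ename"].upper()} for r in rows if r["dno"] == "10"]

    def load(rows, path):
        # L - write the transformed records to the target flat file
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["eno", "ename", "dno"])
            writer.writeheader()
            writer.writerows(rows)

    load(transform(extract("src_emp.txt")), "trg_emp.txt")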

Phase IV: Metadata Management — "metadata means data about data".


DAY 2

How does the ETL programming tool work?

Pictorial view:

Figure: ETL programming process — sources (database, flat files, MS Excel) feed the ETL process, which loads the data warehouse (DWH) and data marts (DM) for the business interface (BI).


DAY 3

Continue…

Extraction reads from .txt files (ASCII); DataStage converts the data into its own native format (the format DataStage understands); loading writes the data back into .txt (ASCII), into a database, or into the local repository.

ETL is a process that is performed in stages:

OLTP → staging area → staging area → staging area → DWH (each hop has a source S and a target T; the extract window reads the source, data is staged — first as permanent data, then after transformation — and the load window writes to the DWH).

Here, S = source and T = target.

Home Work (HW): one record for each kindle (multiple records for multiple addresses and

dummy records for joint accounts);


DAY 4

ETL Developer Requirements

Q: One record for each kindle(multiple records for multiple addresses and dummy

records for joint accounts);

Kindle means the information of customers.

(Figure: a bank kindle holds the customer's loan, credit card, and savings information.)

Maintaining one record per customer while handling different addresses is called a 'single view of customer' or 'single version of truth'.

HW explanation: here we must read the requirement very carefully and understand the terminology in the business perspective. 'Multiple records' means multiple records of the customers, and 'multiple addresses' means one customer (one account) maintaining multiple addresses, such as savings / credit card / current account / loan.

ETL Developer Requirements:

The inputs to the developer are the HLD and LLD documents. Here, HLD = high-level document and LLD = low-level document.

ETL Developer Requirements are:

1. Understanding

2. Prepare questions: after reading the given document, ask friends / forums / team leads / project leads.

3. Logical design: means paper work.

4. Physical model: using the tool.

5. Unit test

6. Performance tuning

7. Peer reviews: this is nothing but releasing versions (version control *.**), where * means a digit in the range 1-9.

8. Design Turn-over Document (DTD) / Detailed Design Document (DDD) / Technical Design Document (TDD)

9. Backups: means importing and exporting the data as required.

10. Job sequencing


DAY 5

How is a DWH project undertaken?

Process:

Requirements → HLD → Warehouse (WH) → TD → Developer → TEST → Production → Migration.

Jobs in %: development 70% - 80%, production 10%, migration 30%.

Here, x is the cross mark for the stages of the flow where the developer is not involved; the developer is involved only in the development stage, where all TEN requirements shown above are implemented.

Product-based companies are like IBM and so on. Migration means support-based companies like TCS, Cognizant, Mahindra Satyam, and so on.

In migration, both server jobs and parallel jobs are worked on (server jobs → parallel jobs).

The server-job environment was used up to 2002; after 2002 and up to now the parallel environment is used. IBM launched X-Migrator, which converts server jobs to parallel jobs; it converts up to 70% automatically, and the remaining 30% is done manually.


Projects are divided into categories with respect to their period (the duration of the project):

Simple: 6 months
Medium: 6 months - 1 year
Complex: 1 - 1.5 years
Too complex: 1.5 - 5 years and so on (it may take many years depending upon the project)

5.1. Project Process:

High-level documents (HLD) — prepared by the business analyzer / subject matter expert:
  SRS
  BRD
  Architecture

Warehouse:
  Schema (structure)
  Dimensions and facts (target tables)

Low-level documents (LLD) — TD:
  Mapping documents (specifications - specs)
  Test specs
  Naming documents


5.2. Mapping Document:

For example, suppose the query requirements are: 1 - experience of the employee, 2 - dname, and 3 - first name, middle name, last name.

The mapping for this, shown in a pictorial way:

Common fields of the mapping document: S.no, load order, target entity, target attributes, source tables, source fields, transformation, constraint, error handling.

For the example: the target entity Exp_tbl / Exp_emp has the target attributes FName, MName, LName, DName and the experience column; the source tables are Emp (Eno, Ename, hire date) and Dept (Dno, Dname); the transformation for experience is Current Date - Hire Date (CD - HD); the constraints are the keys (Pk - primary key, Fk - foreign key, Sk - surrogate key) and 'C' - combining, i.e., getting data from multiple tables.

Funneling: sources S1 and S2 are combined into the target, either by horizontal combining or by vertical combining.
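As a small illustration of the CD - HD transformation in the mapping above (plain Python, not DataStage), this sketch derives an employee's experience from the hire date; the sample hire date is an assumption.

    from datetime import date

    def experience_from_hire_date(hire_date, current_date=None):
        # CD - HD: current date minus hire date, expressed in whole years
        current_date = current_date or date.today()
        years = current_date.year - hire_date.year
        # the hire-date anniversary has not been reached yet this year -> one year less
        if (current_date.month, current_date.day) < (hire_date.month, hire_date.day):
            years -= 1
        return years

    print(experience_from_hire_date(date(2005, 6, 15)))   # experience in years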


As per the example, horizontal combination is used here: Emp and Dept are combined (HC) into Trg. HC (horizontal combination) is used to combine primary rows with secondary rows.

As a developer you will get a maximum of about 30 target fields and a maximum of about 100 source fields.

"Look up" means cross-verification against the primary table.

After the document: the sources S1 and S2 are flat files or databases (.txt formats such as fwf, csv, vl, sc, s & t, h & t; F / dB and the types of dB), which are horizontally combined (HC) through a transformer into the target (TRG). This is the format of the mapping document.


DAY 6

Architecture of DWH

For example: Reliance Group has branches — Reliance Comm., Reliance Power, Reliance Fresh — and every branch has its own manager (and database), with one top-level manager (TLM) over them. The TLM needs the details of sales, customers, employees, period, and orders as input.

Explanation of the above example: Reliance Group has some branches, and every branch has one manager. Above all these managers there is one top-level manager (TLM), and the TLM needs the details listed above for analysis.

For the above example, the ETL process works as shown below: each branch (e.g., the Reliance Fresh ERP) feeds an ETL process into a mini WH / data mart (bottom level), and the data marts feed the DWH (dependent data mart). Taking one branch from the group directly through its own ETL process gives an independent data mart.


Dependent data mart: the ETL process takes all the managers' information (or databases) and keeps it in the warehouse, so the data movement between the warehouse and the data marts depends on each other. Here a data mart is also called the 'bottom level' / 'mini WH' as shown in the figure above, i.e., the data of an individual manager (RF, RC, RP, and so on). Since the data mart depends upon the WH, it is called a dependent data mart.

Independent data mart: only one individual manager, i.e., the data mart accesses the ETL process directly without any help from the warehouse; that is why it is called an independent data mart.

6.1 Two level approaches:

For both approaches, a two-layer architecture applies.

1. Top-Bottom level approach, and

2. Bottom- Top level approach.

6.1.1. Top - bottom level approach:

The flow starts from the top: as per the example, data from the Reliance Group and its individual managers goes through the ETL process into the data warehouse (top level), and from there to all the separate data marts (bottom level).

Figure: top - bottom level approach — Reliance Group → ETL process → warehouse (top level, Layer I) → R Comm. / R Power / R Fresh data marts (bottom level, Layer II).

ETL PROCESS


The top - bottom level approach is defined above; this approach was invented by W. H. Inmon. Here the warehouse is the top level and all the data marts are the bottom level, as shown in the figure above.

6.1.2. Bottom - top level approach:

Here the ETL process loads the data marts (DM) directly, and the data is then put in the warehouse for reference purposes, i.e., the DMs are stored in the data warehouse (DWH).

Figure: bottom - top level approach — Reliance Group → ETL process → R Comm. / R Power / R Fresh data marts (bottom level, Layer I) → DWH (top level, Layer II).

The bottom - top level approach was invented by R. Kimball. Here one data mart (DM) contains information such as customers, products, employees, locations, and so on.

Both the top - bottom and the bottom - top level approaches come under the two-layer architecture.


ETL can be done by programming (coding) or with ETL tools that provide a GUI (graphical user interface). These tools are used to "extract the data from heterogeneous sources". Examples of such ETL/database technologies are Teradata, Oracle, DB2, and so on.

6.2. Four layers of DWH Architecture:

6.2.1. Layer I:

Layer I: Source → DWH, or Source → DMs.

In this layer the data is sent directly — in the first case from the source to the data warehouse (DWH), and in the second case from the source to a group of data marts (DM).

6.2.2. Layer II:

Figure: two-layer architecture — SRC → DWH (Layer I) → DMs (Layer II) for the top - bottom approach, and SRC → DMs (Layer I) → DWH (Layer II) for the bottom - top approach.


In this architecture the data flows from source → data warehouse → data marts; this type of flow is called the "top - bottom approach". In the other case the data flows from source → data marts → data warehouse; this is called the "bottom - top approach". This two-layer (Layer II) architecture was explained in the earlier example (the Reliance Group).

* (99.99% using layer 3 and layer 4)

6.2.3. Layer III:

Figure: three-layer architecture — source → ODS (Layer I) → DWH (Layer II) → data marts (Layer III).

In this architecture the data flows from source → ODS (operational data store) → DWH → data marts. The new concept added here is the ODS: it stores operational data for a period such as 6 months or one year, and that data is used to solve immediate problems; the ETL developer is not involved here.

The team that solves these immediate/temporary problems is called the interface team. After the period, the ODS data is stored in the DWH, and from there it goes to the DMs; that is where the ETL developers are involved in layer 3.

The layer-3 architecture is explained clearly in the example below.


Example #1:

Figure: the source is a plane waiting to land because of some technical problem (it may take up to 2 hours to solve; the ETL developer is not involved here). Layer I: the airport base station — the interface team is involved here, and the problem information is captured in a database, the operational data store (which keeps the technical problems for 1 year). Layer II: the DWH stores the problem information for future reference. Layer III: the data marts — the ETL developers are involved here.

Example explanation:

In this example, source is aero plan that is for waiting for landing to the airport

terminal. But it is not to suppose to land because of some technical problem in the airport base

station. To solve this type operations special team involves here i.e., interface team. In the

airport base station the technical problems and the Operations Data Store (ODS) in db i.e.,

simple say problem information captured.

But the ODS stores the data for one year only. And years of database stores in the data

warehouse because of some technical problems to be not repeat or for future reference. From

Navs notes Page 15

Airport base station

ODS

Page 16: DataStage NOTES

2010DataStage

DWH to it goes to Data Marts here ETL developers involves for solve technical problems i.e.,

is also called layer 3 architecture of data warehouse.

DAY 7

Continues…..

Project Architecture:

7.1. Layer IV: layer 4 is also called the "project architecture".

It is for data backup of the DWH & SVC.

Figure: project architecture / layer IV — Source1, Source2, Source3 → interface files (flat files) → (Layer I) the ETL reads the flat files through DS → (L2) ODS → (L3) DW → (L4) SVC → DM and BI DM → reporting. Condition-mismatch and format-mismatch records are rejected along the way, and lookups use reference data.

Here,
ODS - operational data store,
DW - data warehouse,
DM - data mart,
SVC - single view of customer,
BI - business intelligence.


L2, L3, L4 - layers 2, 3, 4.
A solid line denotes reference data; a dashed line denotes reject data.

About the project architecture:

In the project architecture there are 4 layers.

In the first layer, the sources feed the interface files (flat files).

Coming to the second layer, the ETL reads the flat files through DataStage (DS) and sends them to the ODS. When the ETL sends the flat files to the ODS, any mismatched data is dropped. There are two types of mismatched data: 1. condition mismatch, 2. format mismatch.

In the third layer, the ETL transfers the data to the warehouse.

In the last layer, the data warehouse checks whether there is a single view of each customer, and the data is loaded/moved between the DWH and the DM (business intelligence).

Note: (the following is information about the data dropped when the ETL reads the flat files (.txt, .csv, .xml, and so on) and sends them to the ODS.)

Two types of mismatched data:

Condition mismatch (CM): this verifies whether the records from the flat files satisfy the conditions; if a record mismatches, it is dropped automatically. To see the dropped data a reject link is used, which shows which records were condition-mismatched.

Format mismatch (FM): this is like condition mismatch, but it checks whether the format of the incoming records is correct or mismatched. Here also a reject link is used to see the dropped data.

Example for condition mismatch: an employee table contains some data.

SQL> select * from emp;

EID  ENAME   DNO
08   Naveen  10
19   Munna   20
99   Suman   30
15   Sravan  10

The emp table contains dno values 10, 20, 30, 10, but the target (TRG) requires only dno = 10, so the records with dno 20 and 30 are dropped from emp and shown on the reject link.
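A rough Python sketch of the same idea (not DataStage): records that fail the condition dno = 10 are dropped into a reject list instead of the target; the data is the example table above.

    emp = [
        {"eid": "08", "ename": "Naveen", "dno": 10},
        {"eid": "19", "ename": "Munna",  "dno": 20},
        {"eid": "99", "ename": "Suman",  "dno": 30},
        {"eid": "15", "ename": "Sravan", "dno": 10},
    ]

    target, rejects = [], []
    for rec in emp:
        # condition mismatch: the target requires only dno = 10
        (target if rec["dno"] == 10 else rejects).append(rec)

    print("loaded:", target)     # Naveen, Sravan
    print("rejected:", rejects)  # Munna, Suman (dno 20 and 30)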

Example for format mismatch: here the table format is tab/space separated. A record whose format does not match (the cross-marked record in the figure) is simply rejected.

7.2. Single View of Customer (SVC):

It is also called the "single version of truth".

For example: how do we make a unique customer from the same records? In Phase II the duplicates are identified field by field; in Phase III they cannot be identified. Five multiple records of customers are transformed into consolidated records — DataStage people are involved in this process.

Customer base records: EID, EName, Place — 111 naveen mncl; 222 munna knl; 333 suman hyd.

Before consolidation:

CName   Adds.
naveen  savings
munna   insurance
suman   credit
naveen  loan
munna   deposit

After consolidation (SVC):

CName   Adds.
naveen  savings, loan
munna   insurance, deposit
suman   credit


SVC/ single version of truth

This type of transformation is also called reverse pivoting.
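A small Python sketch of this consolidation (reverse pivoting) with the example data above — the multiple customer rows are grouped into one row per customer; this only illustrates the idea, not how DataStage does it.

    from collections import OrderedDict

    rows = [
        ("naveen", "savings"), ("munna", "insurance"), ("suman", "credit"),
        ("naveen", "loan"), ("munna", "deposit"),
    ]

    svc = OrderedDict()                 # one record per customer
    for cname, account in rows:
        svc.setdefault(cname, []).append(account)

    for cname, accounts in svc.items():
        print(cname, ", ".join(accounts))
    # naveen savings, loan
    # munna insurance, deposit
    # suman credit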

NOTE: Business intelligence(BI DM) is for data backup of DWH & SVC(single version of truth).

DAY 8

Dimensional Model

Modeling: it represents the physical or logical design from our source system to the target system.

o Logical design: client perspective,

o Physical design: data base perspective.

Designing can be done manually (optional), or data modelers can use data-modeling (DM) tools:

  - ERWIN — forward engineering
  - ER-STUDIO — reverse engineering

ERwin (entity relation for Windows) and ER-Studio (entity relation studio) are the two data modelers where the logical and physical design is done.

Metadata: every entity has a structure, which is called metadata (simply, 'data about data').
  - In a table there are attributes and domains; there are two types of domains: 1. alphabetical and 2. numeric.


Forward engineering (FE): starting from scratch.

Reverse engineering (RE): creating from an existing system is known as RE — simply put, "altering the existing process".

For example:

Q: A client requires the experience of an employee.

The source (SRC) is the EMP_table with the hire date; the implicit requirement is the experience of the employee. From the developer's point of view there is an explicit requirement: to find out everything the client may want to see — the target (TRG) holds the lowest level of detailed information.

8.1. Dimensional Table:

A target table that provides everything the client may want to see, i.e., the "lowest level of detailed information", is called a dimensional table.

Q: How are the tables interconnected? This is shown below.

- Here we take some tables and link them with the related tables, as in the product table example.

(Employee hire detailed information: ENo, EName, Years, Months, Days, Hours, Minutes, Seconds, Nano_Seconds.)


- These links are created using foreign keys and primary keys.
- Foreign key: a constraint used as a reference to another table.
- Primary key: a constraint that is a combination of unique and not null.
- Surrogate key.
- The tables follow: the link between tables is established using Fk & Pk (the primary key of one table referenced as the foreign key of another).

8.2. Normalized Data:

Repetitive information or records in a table is called redundancy; the technique of minimizing that information is called normalization.

For example, the following table has repetitive information (redundancy) in the qualification/address columns:

ENO  EName   Designation      DNo  Higher Quali.  Add1  Add2
111  naveen  ETL Developer    10   M.TECH         JNTU  HYD
222  munna   System analysis  20   M.SC           SVU   HYD
333  Sravan  JAVA Developer   10   M.TECH         JNTU  HYD
444  Raju    Call Center      30   M.SC           KU    WRL
555  Rajesh  JAVA Developer   10   M.TECH         JNTU  HYD
666  Hari    .NET Developer   20   M.SC           SVU   HYD
777  Mahesh  Bank PO          10   M.TECH         JNTU  HYD

Dividing it into two tables linked by Fk/Pk removes the redundancy; this dividing technique is known as normalization (reducing redundancy):

ENO  EName   Designation      DNo
111  naveen  ETL Developer    10
222  munna   System analysis  20
333  Sravan  JAVA Developer   10
444  Raju    Call Center      30
555  Rajesh  JAVA Developer   10
666  Hari    .NET Developer   20
777  Mahesh  Bank PO          10

DNo  Higher Quali.  Add1  Add2
10   M.TECH         JNTU  HYD
20   M.SC           SVU   HYD
30   M.SC           KU    WRL

(Related product tables — Product_ID/PRD_Desc/PRD_TYPE_ID, PRD_TYPE_ID/PRD_Category/PRD_SP_ID, PRD_SP_ID/SName/ADD1 — are linked in the same way through Fk and Pk.)

The target table must always be in de-normalized format.

De-normalization means combining the multiple tables back into one table, and the combining is done by horizontal combining (HC). But it is not mandatory in all cases that the target be de-normalized.


DAY 9

E-R Model

An Entity-Relationship Model:

In logical design there are two options to design a job: 1. optional, and 2. manual.

Mandatory: 1 primary table & n secondary tables.

The given two tables are EMP and DEPT:

EMP table:
ENO  EName   Designation      DNo
111  naveen  ETL Developer    10
222  munna   System analysis  20
333  Sravan  JAVA Developer   10
444  Raju    Call Center      30
555  Rajesh  JAVA Developer   10
666  Hari    .NET Developer   20
777  Mahesh  Bank PO          10

DEPT table:
DNo  Higher Quali.  Add1  Add2
10   M.TECH         JNTU  HYD
20   M.SC           SVU   HYD
30   M.SC           KU    WRL

Primary table (also known as the master table); secondary table (also known as the child table).


From the above two tables, the primary table is the DEPT table, because it does not depend on any other table, and the EMP table is the secondary table because it depends on the DEPT table. But in real time, when we join the two tables using horizontal combining, the EMP table is taken as the primary table and the DEPT table as the secondary table.

9.1. Horizontal Combine:

To perform horizontal combining we must satisfy these cases:

  It must have multiple sources.

  There should be a dependency.

  1 primary table, n secondary tables.

Horizontal combining is also called JOIN.

HC means combining primary rows with secondary rows based on the primary key column values.

There are three types of keys:
  - primary key,
  - foreign key, and
  - surrogate key.

For example, combining these two tables (EMP.DNo is the Fk, DEPT.DNo is the Pk):

EMP:
ENO  EName   Designation      DNo
111  naveen  ETL Developer    10
222  munna   System analysis  20
333  Sravan  JAVA Developer   10
444  Raju    Call Center      30
555  naveen  ETL Developer    10
666  munna   System analysis  20
777  naveen  ETL Developer    10

DEPT:
DNo  Higher Quali.  Add1  Add2
10   M.TECH         JNTU  HYD
20   M.SC           SVU   HYD
30   M.SC           KU    WRL


After combining (joining) the tables using HC, the result looks like this (a sketch of the same join in code follows):

ENO  EName   Designation      DNo  Higher Quali.  Add1  Add2
111  naveen  ETL Developer    10   M.TECH         JNTU  HYD
222  munna   System analysis  20   M.SC           SVU   HYD
333  Sravan  JAVA Developer   10   M.TECH         JNTU  HYD
444  Raju    Call Center      30   M.SC           KU    WRL
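Here is the horizontal-combining idea as a minimal Python sketch (a plain join on the DNo key, not DataStage's Join stage); the rows are taken from the example tables above.

    emp = [
        {"eno": 111, "ename": "naveen", "designation": "ETL Developer",   "dno": 10},
        {"eno": 222, "ename": "munna",  "designation": "System analysis", "dno": 20},
        {"eno": 333, "ename": "Sravan", "designation": "JAVA Developer",  "dno": 10},
        {"eno": 444, "ename": "Raju",   "designation": "Call Center",     "dno": 30},
    ]
    dept = {   # primary key DNo -> secondary columns
        10: {"quali": "M.TECH", "add1": "JNTU", "add2": "HYD"},
        20: {"quali": "M.SC",   "add1": "SVU",  "add2": "HYD"},
        30: {"quali": "M.SC",   "add1": "KU",   "add2": "WRL"},
    }

    # horizontal combining (join): primary rows + secondary rows on the key column
    joined = [{**e, **dept[e["dno"]]} for e in emp]
    for row in joined:
        print(row)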

9.2. Different types of schemas:

There are four types of schemas:

o STAR Schema,

o Snow Flake Schema,

o Multi STAR Schema, and

o Galaxy Schema.

1. STAR Schema:

In the star schema, you must know about two things

o Dimensional table, and

o Fact table.

Dimensional table: means ‘Lowest level detailed information’ of a table.

Fact table: a collection of foreign keys from the n dimensional tables.

Definition of STAR schema:

"A fact table (a collection of foreign keys) surrounded by multiple dimensional tables, where each dimensional table is a collection of de-normalized data, is called a STAR schema."

The data movement is done in two different methods; in a pictorial way it looks like this: source → (T) → DIM table → (T) → FACT table.


In the practical way, the data goes directly from the source to the dimensional tables and the fact table.

Example for the STAR schema: taking some tables as below, we derive a star schema from them.

Q: Display what Suman bought — a Lux product, in Ameerpet, in the first week of January.

Figure: the bridge / intermediate (fact) table is linked (Pk - primary key, Fk - foreign key) to the surrounding tables — product, brand, category, unit, customer, customer-category, and location tables.

As shown above, the fact table is surrounded by the dimensional tables; the fact table is a collection of foreign keys, while a dimensional table is the lowest level of detailed information.

The fact table is also called the bridge or intermediate table.

But in the current market the STAR schema and the snow flake schema are used rarely.

Figure: star schema for the example — PRD_Dim_tbl, Cust_Dim_tbl, Date_Dim_tbl, and Loc_Dim_tbl surround the fact table, which holds their foreign keys (Fk) and the measurements.


In the fact table, measurements mean the information taken as per the client or user requirement.

For the above question, it needs information from PRD_dim_tbl, Date_dim_tbl, Cust_dim_tbl, and Loc_dim_tbl; the links to the measurements, i.e., to the fact table, are created by foreign key and primary key.
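A tiny Python sketch of how the example star schema answers the question: the fact row holds only foreign keys plus the measurement, and the dimension tables are looked up through those keys. The key values and quantity are assumptions for illustration.

    # dimension tables: lowest-level detail, keyed by a primary/surrogate key
    prd_dim  = {1: "Lux"}
    cust_dim = {7: "Suman"}
    date_dim = {3: "2010-01-04 (Jan, week 1)"}
    loc_dim  = {5: "Ameerpet"}

    # fact table: foreign keys into the dimensions plus the measurement
    fact = [
        {"prd_id": 1, "cust_id": 7, "date_id": 3, "loc_id": 5, "qty": 2},
    ]

    # "what did Suman buy in Ameerpet in the first week of January?"
    for row in fact:
        if cust_dim[row["cust_id"]] == "Suman" and loc_dim[row["loc_id"]] == "Ameerpet":
            print(prd_dim[row["prd_id"]], date_dim[row["date_id"]], "qty:", row["qty"])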

2. Snow Flake Schema:

The fact table surrounded by dimensional tables, where each dimensional table has lookup tables, is called a snow flake schema.

For example (EMP_tbl → Dept_tbl → Locations → Area, each pair linked by Fk/Pk): if we require information from the location table, it is fetched from that table and displayed as the client requires.

When there is huge data in one dimensional table, it is sometimes not possible to retrieve it quickly. That is the reason we divide the dimensional table into several smaller tables, and those tables are known as "lookup tables".

Definition of Snow Flake schema:

"The fact table surrounded by dimensional tables, where each dimensional table has lookup tables, is called a snow flake schema."

With de-normalization the STAR schema works effectively; with normalization the snow flake schema works effectively. Reports are generated on top with BI tools such as Cognos / BO.

NOTE: the selection of the schema depends on the report generation at run time.

:: DataStage CONCEPTS::

DAY 10

DataStage (DS) Concepts:

History of DS,

Feature of DS,

Differences between 7.5.x2 and 8.0.1 versions,

Architecture of 7.5.x2 and 8.0.1 versions,

Enhancements and new features of version 8.0.1

HISTORY of DataStage

As of 2006 there were about 600 ETL tools in the market; some of them are:

  DataStage Parallel Extender,
  ODI (OWB),
  SAS (ETL Studio),
  BODI,
  Ab Initio, and so on…

But DataStage is famous and widely used in the market, and it is expensive too.


Q: What is DataStage?

ANS: DataStage is a comprehensive ETL tool, which provides End – to – End Enterprise

Resource Planning (ERP) solution (here, comprehensive means good in all areas)


History begins:

- In 1997 the first version of DataStage was released by VMARK, a US-based company, and Lee Scheffler is the father of DataStage.
- Only 5 members were involved in releasing the software into the market.
- In those days DataStage was called "Data Integrator".
- There are about 90% changes from 1997 to 2010 across the released versions.
- In 1997, Data Integrator was acquired by a company called TORRENT.
- Two years later, in 1999, the INFORMIX company acquired Data Integrator from TORRENT.
- In 2000, the ASCENTIAL company acquired both the database and Data Integrator, and the Ascential DataStage Server Edition was released in that year.
  - Through this company DataStage became popular in the market from that year.
  - The released software shipped with around 30 tools.

- In 2002, ADSS + ORCHESTRATE: the ASCENTIAL company integrated with the ORCHESTRATE company for the parallel capabilities.
  - ORCHESTRATE (PX, UNIX) had parallel-extender capabilities in the UNIX environment.
  - By integrating ADSS + ORCHESTRATE they named the product ADSSPX.
  - The ADSSPX version was 6.0; from that version the parallel operations / parallel capabilities start.
  - From there the parallel versions were developed up to version 7.5.1.
  - But versions 6.0 to 7.5.1 supported only UNIX-flavor environments, because the server could be configured only on the UNIX platform.
- In 2004, version 7.5.x2 was released, which supports server configuration on the Windows platform as well.


  - For this, ADSSPX was integrated with MKS_TOOL_KIT.
  - MKS_TOOL_KIT is a virtual UNIX machine that brings the capabilities to Windows to support the server configuration.
  - NOTE: after installing ADSSPX + MKS_TOOL_KIT on Windows, all the UNIX commands work on the Windows platform.
- In December 2004, version 7.5.x2 had the ASCENTIAL suite components. They are:
    Profile stage,
    Quality stage,
    Audit stage (these are individual tools),
    Meta stage,
    DataStage PX,
    DataStage TX,
    DataStage MVS, and so on.
  - There are 12 types of ASCENTIAL suite components.

- In February 2005, IBM acquired all the ASCENTIAL suite components and released IBM DS EE, i.e., the Enterprise Edition.
- In 2006, IBM made some changes to IBM DS EE: the profiling stage and the audit stage were integrated into one, together with the quality stage, the meta stage, and DataStage PX.
  - With the combination of these four stages they released "IBM WEBSPHERE DS & QS 8.0".
  - This is also called the "Integrated Development Environment", i.e., IDE.


- In 2009, IBM released another version, "IBM INFOSPHERE DS & QS 8.1".
  - In the current market: 7.5.x2 is used by 40 - 50%, 8.0.1 by 30 - 40%, and 8.1 by 10 - 20% of projects.

DAY 11

DataStage FEATURES

Features of DS:

There are 5 important features of DataStage, they are

- Any to any,
- Platform independent,
- Node configuration,
- Partition parallelism, and
- Pipeline parallelism.

Any to any:
  - DataStage is capable of reading from any source and writing to any target.

Platform independent:
  - "A job that can run on any processor is called platform independent."
  - There are three types of processors:
    uniprocessor (UNI),
    Symmetric Multiprocessing (SMP), and
    Massively Parallel Processing (MPP).

(Figure: UNI, SMP, and MPP processor layouts — one or more CPUs sharing RAM and disk.)

NOTE: DataStage is a front end; nothing is stored in it.

Node configuration:
  - A node is software that is created in the operating system.
  - "A node is a logical CPU, i.e., an instance of a physical CPU."
  - Hence, "the process of creating virtual CPUs using software is called node configuration."
  - The node configuration concept works exclusively in DataStage; it is the best feature compared with other ETL tools.

  - For example: an ETL job needs to execute 1000 records.
    For the above question a UNI processor takes 10 minutes to execute the 1000 records, but an SMP processor takes 2.5 minutes to execute the same 1000 records, because the 1000 records are shared across its four CPUs and hence the execution time is reduced. This is explained in the diagram below.

(Figure: a UNI machine — one CPU with RAM and HDD — takes 10 minutes; an SMP machine — four CPUs sharing RAM and HDD — takes 2.5 minutes; MPP combines several SMP machines, SMP-1 … SMP-n.)

  - As per the above example, node configuration can also create virtual CPUs to reduce the execution time on a UNI processor.
  - Using node configuration for the above example on a UNI processor, the figure below shows how the virtual CPUs are created and how they reduce the execution time of the process.

(Figure: the 1000 records are split across the virtual nodes Node1 - Node4 created on the single CPU, and the 10 minutes reduces to 2.5 minutes.)
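A rough Python analogy for node configuration (not DataStage's implementation): the same batch of 1000 records is split across four worker processes acting as "nodes", which is why the elapsed time drops; the per-record work here is a made-up stand-in.

    from concurrent.futures import ProcessPoolExecutor

    def transform(chunk):
        # stand-in for the per-record work an ETL job would do
        return [rec * 2 for rec in chunk]

    if __name__ == "__main__":
        records = list(range(1000))
        nodes = 4
        # split the 1000 records across 4 "nodes" (worker processes)
        chunks = [records[i::nodes] for i in range(nodes)]
        with ProcessPoolExecutor(max_workers=nodes) as pool:
            results = list(pool.map(transform, chunks))
        print(sum(len(r) for r in results), "records processed on", nodes, "nodes")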

Partition parallelism:
  - Partitioning is distributing the data across the nodes, based on partitioning techniques.
  - Consider one example of why we use the partitioning techniques.
  - Example: take some records in an EMP table and some in a DEPT table —
    the EMP table has 9 records,
    the DEPT table has 3 records.


  - After partitioning these records, the output must have 9 records, because the primary table has 9 records.
    EMP (10,20,10,30,20,10,10,20,30) and DEPT (10,20,30); if the data is split without a key:
      N1 gets EMP 10,20,10 and DEPT 10 — only 2 records match
      N2 gets EMP 30,20,10 and DEPT 20 — 1 record matches
      N3 gets EMP 10,20,30 and DEPT 30 — 1 record matches
    In total only 4 records match, but the output must be 9 records.
  - In the above example only 4 records reach the final output and 5 records are missing; for this reason the partitioning techniques are introduced.

  - There are two categories of partition parallelism, with a total of 8 partitioning techniques:

    Key-based:
      Hash
      Modulus
      Range
      DB2

    Key-less:
      Same
      Random
      Entire
      Round robin

  - The key-based category (the key-based techniques) gives the assurance that the same key-column values are collected into the same partition.
  - The key-less techniques are used to append the data when joining the given tables.

From the records taken above, we partition using a key-based technique (see the figure and the sketch that follow).


(Figure: key-based partitioning on DNO — the EMP and DEPT rows with DNO 10 go to N1, DNO 20 to N2, and DNO 30 to N3, and the join then happens per node.)
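A minimal Python sketch of key-based partitioning (a modulus-style rule, chosen here as an assumption): because both tables are routed by the same key, matching DNO values always land on the same node and all 9 joined records survive.

    emp_dno  = [10, 20, 10, 30, 20, 10, 10, 20, 30]   # 9 primary rows
    dept_dno = [10, 20, 30]                            # 3 secondary rows
    nodes = 3

    def node_for(key):
        # modulus-style key-based partitioning: same key -> same node
        return key % nodes

    emp_parts  = {n: [k for k in emp_dno  if node_for(k) == n] for n in range(nodes)}
    dept_parts = {n: [k for k in dept_dno if node_for(k) == n] for n in range(nodes)}

    matches = sum(1 for n in range(nodes)
                    for e in emp_parts[n] if e in dept_parts[n])
    print("joined records:", matches)   # 9, none lost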

DAY 12

Continues…

Features of DataStage

Partition parallelism (continued):
  - Re-partition means re-distributing the already distributed data.

(Figure: the EMP/DEPT data is first partitioned on Dno across N1 - N3 and joined, then re-partitioned on Loc — AP, TN, KN.)

ENO  EName   DNo  Loc
111  naveen  10   AP
222  munna   20   TN
333  Sravan  10   KN
444  Raju    30   AP
555  naveen  10   TN
666  munna   20   TN
777  naveen  30   KN

  - The first partition is done by key-based partitioning on dno; then, taking a separate column (location), the already distributed data is re-distributed on it — this is known as re-partitioning.

  - Reverse partitioning:
    It is also called collecting, but it happens in only one situation: "when the data moves from a parallel stage to a sequential stage, collecting happens."

Designing a job is done in "stages"; a link (also called a pipe) is the channel that moves the data from one stage to another.

(Figure: SRC → TRSF → TRG — stages S1, S2, S3 connected by links.)

Example: collecting brings the data from the nodes N1 … Nn — parallel files — into a sequential/single file.

There are four categories of collecting techniques:

Order


Round robin

Sort – merge

Auto

Example for the collecting techniques: node N1 holds a,x; N2 holds b,y; N3 holds c,z; they are collected into a single node N. The order, round robin, sort merge, and auto techniques each produce a different record order in the collected output.

Pipeline Parallelism:

"All pipes carry the data in parallel and the processes run simultaneously."
  - In the server environment the execution process is called traditional batch processing.
  - For example, in the server environment the execution runs one stage at a time: Extract (10 min) → Transform (10 min) → Load (10 min); the execution takes 30 minutes to complete.
  - For the same job in the parallel environment, records flow through Extract, Transform, and Load at the same time (records R1, R2, R3, … are in different stages simultaneously).


Here all the pipes carry the data in parallel and the job is processed simultaneously, so the execution takes only about 10 minutes to complete. By using pipeline parallelism we can reduce the processing time.
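A tiny Python sketch of the pipeline idea using generators: each record flows through extract → transform → load as soon as it is produced, instead of waiting for the whole extract to finish. It only illustrates the streaming behaviour, not DataStage's engine.

    def extract():
        for rec in ["r1", "r2", "r3", "r4", "r5"]:
            yield rec                      # records stream out one by one

    def transform(records):
        for rec in records:
            yield rec.upper()              # transformed as soon as it arrives

    def load(records):
        for rec in records:
            print("loaded", rec)           # loaded while extract is still running

    load(transform(extract()))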

DAY 13

Differences between 7.5.x2 & 8.0.1

Differences:

7.5.x2:
- 4 client components: DS Designer, DS Manager, DS Director, DS Administrator
- Architecture components: server component, client component
- Two-tier architecture
- OS-dependent with respect to users
- Capable of Phase III & Phase IV only
- No web-based administration
- File-based repository

8.0.1:
- 5 client components: DS Designer, DS Director, DS Administrator, Web Console, Information Analyzer
- Architecture components: common user interface, common repository, common engine, common connectivity, common shared services
- N-tier architecture
- OS-independent with respect to users (but one-time dependent only)
- Capable of all phases
- Web-based administration through the Web Console (simply put, work from home)
- Database-based repository

13.1. Client components of 7.5.x2:

- DS Designer: it is to create jobs, compile, run and multiple job compile.

4 types of jobs can handle by DS Designer.

Mainframes job

Server job

Parallel job

Job sequence job

- DS Director: it can handle the given list below.

Schedule , run job’s

Monitor, Unlock, batch jobs

Views(job, status, logs)

Message Handling.

- DS Manager: it can handle the given list below.

Import and Export of Repository components

Node Configuration

- DS Administrator: it can handle the given list below.

Create project

Delete project

Organize project


13.2. Client components of 8.0.1:

- DS Designer: it is to create jobs, compile, run and multiple job compile.

4 types of jobs can handle by DS Designer.

Mainframes job

Server job

Parallel job

Job sequence job

Data quality job

- DS Director : same in as above shown in 7.5.x2

- DS Administrator : same in as above shown in 7.5.x2

- Web Console: an administration component through which the following services are performed.

Security services

Scheduling services

Logging services

Reporting services

Session management

Domain manager

- Information Analyzer: also called the console for IBM INFORMATION SERVER.

It performs all Phase I activities:

Column analysis,

Primary key analysis,

Foreign key analysis,

Base Line analysis, and

Cross domain analysis.


As an ETL developer you will mostly come across DS Designer and DS Director, but some information should also be known about the Web Console, Information Analyzer, and DS Administrator.


DAY 14

Description of 7.5.x2 & 8.0.1 Architecture

14.1. Architecture of 7.5.x2:

* Server Components: it is divided into 3 categories, they are

a. Repository

b. Engine

c. Package Installer

Repository: is also called as project or work area.

  - The repository is also an Integrated Development Environment (IDE):
    the IDE performs design, compile, run, and save of jobs.
  - The repository organizes the different components in one area; it is a collection of components. Some of the components are:
    jobs,
    table definitions,
    shared containers,
    routines ….. etc.
  - The repository is for developing applications as well as storing them.


Engine: it executes the DataStage jobs and it automatically selects the partitioning technique.
  - Never leave any stage on auto: if we leave it on auto, it selects the auto partitioning technique, which can affect the performance.

Package Installer: this component contains two types of package installers — one is plug-ins and the other is packs.

  Plug-in example: a printer (say, an 1100) needs its driver to be installed; the 1100 driver is provided, and the interface between the computer and the printer is called the plug-in.

  Packs: the best example is that normal Windows XP acquires Service Pack 2 for more capabilities; here, packs are used to configure DataStage for ERP solutions.

*Client components: it is divided into 4 categories, they are

a. DS Designer

b. DS Manager

c. DS Director

d. DS Administrator

What these categories handle is listed earlier, under the client components of 7.5.x2.


14.2. Architecture of 8.0.1:

1. Common user interface: also called the unified user interface.

a. Web console

b. Information analyzer

c. DS Designer

d. DS Director

e. DS Administrator

2. Common Repository: is divided into two types

a. Global repository: the DataStage job files are stored here (it checks security issues).
b. Local repository: individual files are stored here (it is for performance).
  - The common repository is also called the metadata (MD) server.
  - It holds three types of metadata:
    project-level MD
    design-level MD
    operation-level MD

3. Common engine:
  - It is responsible for
    data profiling analysis,
    data quality analysis, and
    data transformation analysis.

4. Common connectivity:
  It provides the connections to the common repository.


Common shared services.

Table representation of the "8.0.1 architecture": the common user interface (WC, IA, DE, DI, DA) sits on the common shared services; underneath are the common repository (the MD server with project-level, design-level, and operation-level MD), the common engine (DP, DQ, DT, DA), and the common connectivity.

DAY 15

Enhancements & new features of version 8

In version 8.0.1, there are 8 categories of stages.

Processing stage:

  - New stages:
    1. SCD (Slowly Changing Dimension)
    2. FTP (File Transfer Protocol)
    3. WTX (WebSphere Transformation Extender)
  - Enhanced stages:
    1. Surrogate key stage: a new concept introduced here.
    2. Lookup stage: previously lookup had
      i. normal lookup
      ii. sparse lookup
      and newly added are
      iii. range lookup
      iv. caseless lookup

Data Base Stage:

  - New stages:
    iWay
    Classic Federation
    ODBC Connector
    Netezza
  - Enhanced stages:
    all stages are now used with respect to the SQL builder.

Palette of version 8.0.1:

  General — no changes
  Data Quality — new
  Data Base — has changes
  File — no changes
  Development & Debug — no changes
  Processing — has changes
  Real Time — no changes
  Restructure — no changes

  - The palette holds the shortcuts of the stages, which we drag and drop onto the canvas to design the job.
  - Data Quality is an exclusively new concept of 8.0.1.
  - The Data Base and Processing stages have some changes, as shown above.
  - The other stages are the same as in version 7.5.x2, i.e., there are no changes in this version.


:: Stages Process & Lab Work::

DAY 16

Starting steps for DataStage tool

To start with DataStage on the system we must follow these steps to do a job — a five-step job development process (this is for designing a job):

The DB2 repository is started and the DataStage server is started.

After they are started: select DS Designer, enter the uid, enter the pwd, and attach the appropriate project.

Select the appropriate stages in the palette and drag them onto the CANVAS.

Link them (give the connectivity); after that, setting the properties is important.

The palette contains all the stage shortcuts, i.e., 7 stage groups in 7.5.x2 & 8 in 8.0.

These stages are categorized into two groups: 1 → active stages (whatever stage does the transformation is called an active stage) and 2 → passive stages (whatever stage does the extracting or loading is called a passive stage).

(The palette — opened from the tool bar — lists General, Data Quality, Database, File, Development & Debug, Processing, Real Time, and Restructure. The designer canvas, or editor, is the place where we design the job, e.g., SEQ → SEQ. Login example: user admin, password **** (e.g., phil), project Project\navs…..)

Among the 8 categories we use the sequential-stage and parallel-stage jobs.

Save, compile, and run the job.

Run Director (to see the views), i.e., to view the status of your job.

DAY 17

My first job creating process

Process:

On the computer desktop, the currently running processes are shown in the corner of the taskbar; a round green symbol shows whether the DataStage server has started or not — if it does not start automatically, start it manually.

When version 8 of DataStage is installed, the shortcuts of five client components are visible on the desktop:

Web Console

Information Analyzer

DS Administrator

DS Designer

DS Director

Web Console: when you click it, "the login page appears".
  - If the server is not started, a "the page cannot open" error will appear.
  - If that error occurs, the server must be restarted before doing or creating jobs.

DS Administrator: it is for creating / deleting / organizing the projects.


DS Director: it is for viewing the status of the executed jobs, and to view the log, status, and warnings.

DS Designer: when you click on the designer icon, it will ask you to attach a project for creating a new job, as shown below.
  - User id: admin
  - Password: ****
  - If authentication fails at login, it is because of a repository interface error.

The figure below shows how to authenticate; after authentication, the designer canvas for creating jobs is displayed.

  - It asks which type of job you want to do: mainframe, parallel, sequential, or server jobs.
  - After clicking on parallel jobs, go to the tool bar → View → Palette.

(Attach-the-project dialog: Domain = localhost:8080, User Name = admin, Password = **** (e.g., phil), Project = Teleco, with OK / Cancel buttons.)

In the palette, the 8 groups of stages displayed for designing a job are:

General

Data Quality

Data Base

File

Development & Debug

Processing

Real Time

Re – Structure

17.1. File Stage:

Q: How can data be read from files?

The file stage can read only flat files, and the formats of flat files are .txt, .csv, .xml.

In .txt there are different formats like fwf, sc, csv, s & t, H & T.

.csv means comma-separated values.

.xml means extensible markup language.

- In the file stage there are sub-stages like the sequential file stage, data set, file set, and so on.
  - Example of how a job executes: one sequential file (SF) to another SF, source → target.
    - The source file requires the target/output properties, and
    - the target file requires the input/source properties.
- In the source file, how do we read a file? On double-clicking the source file we must set the properties below:
    File name \\ browse the file name
    Location \\ for example in c:\
    Format \\ .txt, .csv, .xml
    Structure \\ metadata


General properties of sequential file:

1. Setting / importing source file from local server.

2. Format selection:

- The data must be in the given format as per the input file taken —

- one of "tab / space / comma" must be selected.

3. Column structure defining:

To get the structure of file.

- Steps for load a structure

- Import

o Sequential file

Browse the file and import

Select the import file

o Define the structure.

These three are general properties when we design for simple job.

(File-selection dialog: File = c:\data\se_source_file.txt — the \? option is for multiple purposes — with a Browse button and a Load button for the structure.)


DAY 18

Sequential File Stage

Sequential file stage also says as “output properties”

- For single structure format only we going to use sequential file stage.

Output Input

Properties Properties

- About the sequential file stage and how it works:

Step 1: the sequential file is a file stage that reads flat files with different extensions (.txt, .csv, .xml).

Step 2: SF reads/writes sequentially by default when it reads/writes a single file,
  - and it also reads/writes in parallel when it reads/writes to or from multiple files.


Step 3: the sequential stage supports one input (or one output) and one reject link.

Link:

A link is also a stage; it moves data from one stage to another.
  - Links are divided into categories:
    stream link (SF → SF)
    reject link (SF - - -> SF)
    reference link (SF ----- SF)

Link marker:

It shows how the link behaves during the transfer from source to target.

1. Ready BOX: it indicates that "a stage is ready with metadata", with data moving from sequential stage to sequential stage.

2. FAN IN: it indicates that "data moves from a parallel stage to a sequential stage"; it happens when collecting is done.

3. FAN OUT: it indicates that "data moves from a sequential stage to a parallel stage"; it is also called auto partitioning.

4. BOX: it indicates that "data moves from a parallel stage to a parallel stage"; it is also known as partitioning.

5. BOW-TIE: it indicates that "data moves from a parallel stage to a parallel stage"; it is also known as re-partitioning.

Link color:

The link color indicates the state of a job's execution.

RED: a link in RED means case 1 - a stage is not connected properly, or case 2 - the job aborted.

BLACK: a link in BLACK means "the stage is ready".


BLUE: a link in BLUE means "the job execution is in process".

GREEN: a link in GREEN means "the execution of the job has finished".

Compile:

The compiler is a translator from source code to target code.

Compiling a .C program: HLL → ALL → BC (.c → .obj → .exe), where HLL = high-level language, ALL = assembly-level language, BC = binary code, MC = machine code.

Compiling process in DataStage: the GUI design is translated into OSH code (and C++ for the transformer), and then into .obj and .exe machine code.

NOTE: "A stage is an operator; an operator is a pre-built component." The stage imports the import operator for the purpose of creating the data in native format; native format is the format DataStage understands. So a stage is an operator.


*OSH - Orchestrate shell script.

Note: Orchestrate shell script is generated for every stage except one — the transformer stage, which is done in C++.

During compilation it checks for:
  link requirements (checks the links),
  mandatory stage properties, and
  syntax rules.

DAY 19

Sequential File Stage Properties

Properties:

Read Method: there are two options —
  - Specific File: the user or client gives each file name specifically.
  - File Pattern: we can use wildcard characters and search for a pattern, i.e., * and ? (see the sketch below).
    For example: C:\eid*.txt
                 C:\eid??.txt
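As a rough illustration of those wildcard patterns (using Python's fnmatch rather than DataStage's own file-pattern matching), the hypothetical file names below show what * and ? each match.

    import fnmatch

    files = ["eid1.txt", "eid22.txt", "eid_old.txt", "emp.txt"]

    print(fnmatch.filter(files, "eid*.txt"))   # ['eid1.txt', 'eid22.txt', 'eid_old.txt']
    print(fnmatch.filter(files, "eid??.txt"))  # ['eid22.txt']  (? matches exactly one character)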

Reject Mode: used to handle records with a format / data type / condition mismatch.
Three options:
  - Continue: drops the mismatched records and continues with the other records.
  - Fail: the job is aborted.
  - Output: captures the dropped data through a link to another sequential file.

First Line is Column Names: true/false.
  - If it is false, the first line is also shown as a dropped record.
  - If it is true, it does not drop the first record (it is used as the column names).

Missing File Mode: used if any file name is missing.
Two options:
  - OK: drops the file name when it is missing and continues.
  - Error: if a file name is missing, it aborts the job.

File Name Column: "source information at the target" — it gives information about which record came from which address on the local server. It directly adds a new column to the existing table and displays the information in that column.

Row Number Column: "source record number at the target" — it gives information about which source record number produced each target record. It also directly adds a new column to the existing table and displays the number in that column.

Read First Rows: "gets you the top n rows" — the option asks for an n value and displays the first n records.

Filter: "blocking unwanted data based on UNIX filter commands" like grep, egrep, …… and so on.

Examples:
  - grep "moon"    \\ case-sensitive; displays only the records that contain moon.
  - grep -i "moon" \\ ignores case; displays all moon records.
  - grep -w "moon" \\ displays only exact-word matches.

Read From Multiple Nodes: we can read the data in parallel using the sequential stage —
  reading in parallel is possible,
  loading in parallel is not possible.

LIMITATIONS of SF:
  - It is sequential processing (it processes the data sequentially).
  - Memory limit of 2 GB (.txt format).
  - The problem with sequential files is the conversions, like ASCII - NF - ASCII - NF.
  - It lands (resides) the data "outside the boundary" of DataStage.

DAY 20

General settings DataStage and about Data Set

Default setting for startup with parallel job:

- Tools

o Options

Select a default

And to create new: it ask which type of job u want.

- Types of jobs are main frames/parallel/sequential/server.

- After setting above when we restart the DS Designer it directly goes designer canvas.

According naming standards every stage has to be named.

o Naming a stage is simple, just right click on a stage rename option is visible

and name a stage as naming standards.

General stage group:

Some of these stages are used for commenting on what a stage does or performs, i.e., simply giving comments for a stage.


Let us discuss Annotation & Description Annotation:

- Annotation: it is for a stage comment.
- Description Annotation: it is used for the job title (only one title can be kept).

(Figure: a parallel job — SRC → TRG — extracting the data, which resides into or lands in LS / RR / db.)

Q: In which format is the data sent between the source file and the target file?

A: If we send a .txt file from the source, it is in ASCII format, because a .txt file supports only ASCII, while DataStage supports only its native format; so the ASCII code is converted into native format, which DataStage understands, and at the target the native format is converted back into .txt (ASCII) for the user/client to see.

"Native format" is also called the virtual data set.

(Figure: src_f.txt (ASCII) → NF → trg_f.txt (ASCII).)

When we convert ASCII code into NF, the source needs to import an operator; when we convert NF code back into ASCII, the target needs to import an operator.


Data Set (DS):

"It is a file stage, and it is used for staging the data when we design dependent jobs."

The data set overcomes the limitations of the sequential file stage, for better performance.

By default the data set sends the data in parallel.

In a data set the data lands in "native format".

Q: How does the data set overcome the sequential file limitations?

- By default the data is processed in parallel.
- More than 2 GB is allowed.
- No need for conversion, because the data set's data directly resides in native format.
- The data lands in the DataStage repository.
- The data set extension is *.ds.

(Figure: src_f.txt → trg_f.ds, saving the structure as "st_trg".)

Q: How does the data set make reuse easy?

- We can copy the "trg_f.ds" file name, and we must also save the structure of trg_f.ds, for example as st_trg.
- We can then use the saved file name and structure of the target in another job (copying the structure st_trg and trg_f.ds for reuse here, e.g., trg_f.ds → trg_f.txt).


- A data set can read only native-format files, just as DataStage reads only the orchestrate format.

DAY 21

Types of Data Set (DS)

There are two types of data sets:

  virtual (temporary), and

  persistent (permanent).

- Virtual: a data set whose data moves in the link from one stage to another, i.e., the link holds the data temporarily.

- Persistent: the data sent through the link lands directly in the repository; that data is permanent.

Alias of Data Set:

o ORCHESTRATE FILE

o OS FILE

Q: How many files are created internally when we create a data set?


A: A data set is not a single file; it creates multiple files internally:
  - descriptor file
  - data file
  - control file
  - header file

Descriptor file: it contains the schema details and the address of the data.

Data file: consists of the data in native format, and resides in the DataStage repository.

Control file and header file: they reside in the operating system, and both act as the interface between the descriptor file and the data file.

The physical files are stored on the local drive / local server; they are stored permanently under the install program files, c:\ibm\inser..\server\dataset {"pools"}.

Q: How can we organize a data set (view / copy / delete it) in real time, etc.?

A: Case 1: we cannot delete the data set directly.
   Case 2: we cannot see or view it directly.

A data set is organized using utilities:
  - using the GUI, i.e., the utility in the Tools menu (data set management), or
  - using the command line, where commands start with $orchadmin.

Navigation for organizing a data set in the GUI:
  - Tools
    - Data Set Management
      - file_name.ds (e.g., dataset.ds)
  - Then we see the general information of the data set:
    schema window
    data window


    copy window
    delete window

At the command line:
  - $orchadmin rm dataset.ds   \\ the correct way to remove a data set file
  - $rm dataset.ds             \\ the wrong way — it cannot be written like this
  - $ds records                \\ to view the files in a folder

Q: Which operator is associated with the data set?
A: The data set doesn't have its own operator, but it uses the copy operator as its operator.

Data set version:
- Data sets have version control.
- A data set has a version for each DataStage version.
- The default in version 8 is that it saves as version 4.1, i.e., v41.

Q: How do we control the version at run time?
A: We have to set an environment variable for this.

Navigation for setting the environment variable:
  Job properties
    - Parameters
      - Add Environment Variable
        - Compile
          - Data set version ($APT_WRITE_DS_VERSION)
            - click on that.

After doing this, when we want to save the job, it will ask which version you want.

DAY 22

File Set & Sequential File (SF) input properties

File Set (FS): "it is also for staging the data".

- It is a file stage, likewise used to design dependent jobs.
- The data set & file set are similar, but they have minor differences.
- The differences between DS & FS are shown below.

Data Set:
- has parallel extendable capabilities
- more than 2 GB limit
- NO reject link with the data set
- DS is exclusively for internal use in the DataStage environment
- copy (file name) operator
- native format
- saves with the .ds extension
- multiple segments

File Set:
- has parallel extendable capabilities
- more than 2 GB limit
- reject link within the file set
- an external application can create an FS, and we can use it with any other application
- import / export operator
- binary format
- .fs extension
- multiple files


But the data set has better performance than the file set.

Sequential file stage: input properties

- Setting the input properties at the target file; at the target there are four properties:
  1. File update mode
  2. Cleanup on failure
  3. First line is column names
  4. Reject mode

File Update Mode: having three options – append/create (error if exists)/overwrite

o Append: when a single file or multiple files are sent to the sequential target, the records are appended one after another into the single target file.

o Create (error if exists): creates the file if it does not exist; if the file already exists the job fails with an error.

o Overwrite: overwrites the existing file with the new file.

Setting passing value in Run time(for file update mode)

o Job properties

Parameters

- Add environment variables

o Parallel

Automatically overwrite

($APT_CLOBBER_OUTPUT)

Cleanup on Failure: having two options – true/false,


True – partially written records are cleaned up (discarded) if the job fails; it applies when "file update mode" is append.

False – partially written records are simply left in the appended or overwritten file.

First Line is Column Names: having two options – true/false.

True – treats the first row of the file as the column (field) names.

False – simply reads every row, including the first row, as a data record.

Reject mode: here reject mode is same like as output properties we discussed already before.

In this we have three options – continue/fail/output.

Continue – it simply drops the records whose format/condition/data type mismatch and continues processing the remaining records.

Fail – it aborts the job when a format/condition/data type mismatch is found.

Output – it captures the dropped records through the reject link.

DAY 23

Development & Debug Stage

The development and debug stage having three categories, they are

1. Stages that generate data:

a. Row Generator

b. Column Generator

2. Stages used to pick sample data:

a. Head

b. Tail

c. Sample

3. The stage that helps in debugging:


a. Peek

Simply put, in development and debug we have 6 stages, and these 6 stages are divided into the three categories shown above.

23.1. Stages that Generated Data:

Row Generator: "It has only one output".

- The Row Generator is for generating sample data; it is used in some cases.

- Some cases are,

o When the client is unable to give the data.

o For testing purposes.

o To keep the job design simple while building and testing jobs.

- The Row Generator can generate junk data automatically by considering the data type, or we can manually set some related, understandable data by giving user-defined values.

- It has only one property, plus selecting a structure (table definition) for creating the junk data.

Row Generator design as below:

ROW Generator DS_TRG

Navigation for Row Generator:

- Opening the RG properties

- Properties

o Number of records = XXX( user define value)

- Column

o Load structure or Meta data if existing or we can type their.

For example n=30


- Data generated for the 30 records and the junk data also generated considering the data

type.

Q: how to generate User define value instead of junk data?

A: first we must go to the RG properties

- Column

o Double click serial number or press ctrl+E

Generator

Type = cycle/random (it is integer data type)

In integer data type we have three option

Under cycle type:

There are three types of cycle generated data Increment, Initial value, and limit.

Q: when we select initial value=30?

A: it starts from 30 only.

Q: when we select increment=45?

A: it generates a cycle of values in which each value is 45 more than the previous one.

Q: when we select limit=20?

A: it generates values up to the limit and then cycles around again.

Under Random type:

There are three types of random generated data – limit, seed, and signed.

Q: when we select limit=20?

A: it going to generate random value up to limit=20 and continues if more than 20 rows.

Q: when we select seed=XX;

A: it is going to generate the junk data for random values.


Q: when we select signed?

A: it going to generate signed values for the field (values between –limit and +limit),

otherwise generate values between 0 and +limit.

Column Generator: "it has one input and one output".

- The main purpose of the Column Generator is to add extra columns to the incoming table.

- For every added column, junk data will be generated in the output.

- Mapping should be done in the Column Generator properties; that means simply dragging and dropping the created columns into the existing table.

Sequential file Column Generator DataSet

- Coming to the column generator properties.

- To open the properties just double clicking on that.

Navigation:

- Stage

o Options

Column to generate =?

And so on we can give up to required.

- Output

o Mapping

After adding extra column it will visible here, and for mapping we drag

simple to existing table into right side of a table.

- Column

o We can change data type as you require.

In the output,

- The junk data will generate automatically for extra added columns.


- For manual we can generate some meaning full data to extra column’s

- Navigation for manual:

o Column

Ctrl+E

Generator

o Algorithm = two options “cycle/ alphabet”

o Cycle – it have only one option i.e., value

o Alphabet – it also have only one option i.e., string.

- Cycle is same like above shown in row generator.

Q: when we select alphabet where string=naveen?

A: it going to generate different rows with given alphabetical wise.

DAY 24

Pick sample Data & Peek

24.1. Pick sample data: “it is a debug stage; there are three types of pick sample data”.

- Head

- Tail

- Sample

Head : "it reads the top 'n' records of every partition".

o It has one input and one output.

o In the Head stage, mapping must be done.


SF_SRC HEAD DS_TRG

Properties of Head:

o Rows

All Rows(after skip)=false

- It is to copy all rows to the output following any requested skip

positioning

Number of rows(per partition)=XX

- It copy number of rows from input to output per partition.

o Partitions

All partition = true

- True: copies row from all partitions

- False: copies from specific partition numbers, which must be

specified.

Tail: "it is a debug stage that reads the bottom 'n' rows from every partition".

o The Tail stage has one input and one output.

o In this stage mapping must be done; the mapping is done in the Tail output properties.

SF_SRC TAIL_F DS_TRG

Properties of Tail:

o The properties of head and tail are similar way as show above.

o Mainly we must give the value for “number of rows to display”


Sample: "it is also a debug stage and has two modes, period and percentage".

o Period: in this mode it supports one input and one output.

o Percentage: in this mode it supports one input and multiple outputs.

SF_SRC SAMPLE DS_TRG

Period: if the source table has some records and we give a period value 'n', it retrieves every nth record from the source table.

Skip: it skips the first 'n' records of the given source table before sampling starts.

Percentage: it reads from one input and distributes the data to multiple outputs.

o Coming to the properties

Options

- Percentage = 25 and we must set target =1

- Percentage = 50 , target = 0

- Percentage = 15 , target = 2

o Here we setting target number that is called link order.

o Link Order: it specifies to which output the specific data has to be send.

o Mapping: it should be done for multiple outputs.


Target1

Target2

SF_SRC SAMPLE

Target3

NOTE: the sum of the percentages of all outputs must be less than or equal to (<=) 100% of the input records.

o In percentage mode it distributes the data in percentage form. When Sample receives only 90% of the data from the source, it treats that 90% as 100% and distributes it as we specify.

24.2. PEEK: “it is a debug stage and it helps in debugging stage”

SF_SRC PEEK

It is used in three types they are

1. It can use as copying the data from Source to multiple outputs.

2. Send the data into logs.

3. And it can use as stub stage.


Q: How to send the data into logs?

Opening properties of peek stage, we must assign

o Number of row = value?

o Peek record output mode = job log and so on, as per options

o If we put column name = false, it doesn’t shows the column in the log.

For seeing the log records that we stored.

o In DS Director

From Peek – log – peek - We see here ‘n’ values of records and fields

Q: When does the Peek act as a copy stage?

A: When the Sequential File cannot send the data to multiple outputs; at that time the Peek acts as a copy stage.

Q: What is a Stub Stage?

A: A Stub Stage is a placeholder. In some situations a client requires only the dropped data; in that case the stub stage acts as a placeholder that holds the main output data temporarily, while the rejected data is sent to another file.

DAY 25

Database Stages

Of the database stages we generally use Oracle Enterprise, ODBC Enterprise, Teradata with ODBC, and Dynamic RDBMS, and so on.

25.1. Oracle Enterprise:

"Oracle Enterprise is a database stage; it reads tables from the Oracle database and moves them from source to target."

o Oracle Enterprise can read from multiple tables, but it loads into one output.


Oracle Enterprise Data Set

o Properties of Oracle Enterprise(OE):

Read Method have four options

Auto Generated \\ it generated auto query

SQL Builder \\ its new concept apart comparing from v7 to v8.

Table \\ giving table name here

User Defined \\ here we are giving user defined SQL query.

If we select table option

Table = “<table name>”

Connection

Password = *****

User = Scott

Remote server = oracle

o Navigations for how the data load to the column

This is for already data present in plug-in.

Select load option in column

Going to the table definitions

Than to plug-in

Loading EMP table from their.

If table not in the not their in plug-in.

Select load option in column

Then we go to import

Import “meta data definition”

o Select related plug-in


Oracle

User id: Scott

Password: tiger

After loading select specific table and import.

After importing into column, in define we must change hired

date data type as “Time Stamp”.

Q: A table contains 300 records, but I need to read only 100 of them?

A: In the read method we use a user-defined SQL query to solve this, by writing a query that reads only 100 records.

Alternatively, with the auto-generate read method we can auto-generate the query and then copy that query statement into the user-defined SQL and edit it.
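For example, a user-defined SQL query of roughly the following shape would return only 100 rows (a sketch against the standard Oracle EMP-style table; adjust the table and column names to the actual source):

    -- read only the first 100 rows from the source table
    SELECT *
    FROM EMP
    WHERE ROWNUM <= 100;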

Q: What we can do when we don’t know how to write a select command?

A: Selecting in read method = SQL Builder

After selecting SQL Builder option from read method

o Oracle 10g

o From their dragging which table you want

o And select column or double clicking in the dragged table

There we can select what condition we need to get.

It is totally automated.

NOTE: in version 7.5.x2 we don’t have saving and reusing the properties.

Data connection: its main purpose is reusing the saved properties.

Q: How to reuse the saved properties?

A: navigation for how to save and reuse the properties

Opening the OE properties

o Select stage

Data connection


There load saved dc

o Naveen_dbc \\ it is a saved dc

o Save in table definition.

DAY 26

ODBC Enterprise

ODBC Enterprise is a data base stage

About ODBC Enterprise:

Oracle needs some plug-ins to connect to DataStage. When DataStage version 7 was released, Oracle 9i provided some drivers to use.

Coming to the connection, Oracle Enterprise connects directly to the Oracle database, but ODBC needs OS drivers to hit (connect to) the Oracle database.

Figure: Oracle Enterprise hits the Oracle DB directly, whereas ODBC Enterprise goes through the OS ODBC drivers to hit the Oracle DB.

Difference between Oracle Enterprise (OE) and ODBC Enterprise:

OE:
- Version dependent
- Good performance
- Specific to Oracle
- Uses plug-ins
- No rejects at source

ODBCE:
- Version independent
- Poor performance
- For multiple databases
- Uses OS drivers
- Rejects at SRC & TRG

Q: How database connect using ODBC?

ODBCE Data Set

First step: opening the properties of ODBCE

Read method = table

o Table = EMP

Connection


o Data Source = WHR \\ WHR means name of ODBC driver

o Password = ******

o User = Scott

Creating of WHR ODBC driver at OS level.

o Administration tools

ODBC

Add

o MS ODBC for Oracle

Giving name as WHR

Providing user name= Scott

And server= tiger.

The ODBCE driver at OS level has a lengthy connection process; to overcome this, the ODBC Connector was introduced.

Using the ODBC Connector is a quicker process compared with ODBCE.

The best feature of the ODBC Connector is "Schema reconciliation", which automatically handles data type mismatches between the source data types and the DataStage data types.

Differences between ODBCE and ODBC Connector:

ODBCE:
- It cannot list the available Data Source Names (DSN).
- There is no way to test the connection.
- It reads sequentially and loads in parallel (poor performance).

ODBC Connector:
- It provides the list of available ODBC DSNs.
- The connection can be tested with the Test button.
- It reads in parallel and loads in parallel (good performance).


Properties of ODBC Connector:

o Selecting Data Source Name DSN = WHR

o User name = Scott

o Password = *****

o SQL query

26.1. MS Excel with ODBCE:

First step is to create MS Excel that is called “work book”. It’s having ‘n’ number of

sheets in that.

For example CUST work book is created

Q: How to read Excel work book with ODBCE?

A: opening the properties of ODBCE

Read method = table

o Table = "empl$" \\ when reading from Excel, the sheet name must be in double quotes and end with the $ symbol.

Connections

o DSN = EXE

o Password = *****

o User = xxxxx

Column

o Load

Import ODBC table definitions

DSN \\ here select work book

User id & password


o Filter \\ enable by click on include system tables

o And select which you need & ok

In Operating System

o Add in ODBC

MS EXCEL drivers

Name = EXE \\ it is DSN

Q: How do you read Excel format in Sequential File?

A: By changing the CUST excel format into CUST.csv

26.2. Teradata with ODBCE:

Teradata is a database from Teradata Corporation, comparable to Oracle, and it is used as a database.

Q: How to read Teradata with ODBC?

A: we must start the Teradata connection (by clicking the shortcut).

o And in OS also we must start

Start ->control panel ->Administrator tools -> services ->

Tara Data db initiator \\ must start here

o Add DSN in ODBC drivers

Select Tara data in add list

We must provide details as shown below

User id = tduser

Password = tduser

Server : 127.0.0.1

After these things we must open the properties of ODBCE

o Read method = table

Table = financial.customer

o Connections


DSN = tduser

Uid = tduser

Pwd = tduser

Column

o Load

Import

Table definitions\plug-in\taradata

Server: 127.0.0.1

Uid = tduser

Pwd = tduser

After all this navigation at last we view the data, which we have load in source.

DAY 27

Dynamic RDBMS and PROCESSING STAGE

27.1. Dynamic RDBMS:

“It is data base stage; it is also called as DRS”

It supports multiple inputs and multiple outputs


Ln_EMP_Data Data Set

DRS

Ln_DEPT_Data

Data Set

It has almost the same properties as Oracle Enterprise.

Coming to DRS properties

o Select db type i.e., oracle

o Oracle

Scott \\ for authentication

Tiger

o At output

Ln_EMP_Data \\ set emp table here

And Ln_DEPT_Data \\ set dept table here

o Column

Load

Meta data for table EMP & DEPT.

In oracle enterprise we can read multiple files, but we can’t load into multiple files.

We can solve this problem with DRS that we can read multiple files and load in to

multiple files.

Some of data base stages:

iWay can be used only as a source, i.e. set in the output properties.


Netezza can be used only as a target, i.e. set in the input properties.

27.2. Processing Stage:

In this 28 processing stages are there, but we use 10 stages generally. And the 10

stages are very important.

They are,

1. Transformer

2. Look UP

3. Join

4. Copy

5. Funnel

6. Remove duplicates

7. Slowly changing dimension

8. Modify

9. Sort

10. Surrogate key

27.3. Transformer Stage:

The symbol of Transformer Stage is

A simple query that we solving by using transformer i.e,

Q: calculate the salary and commission of an employee from EMP table.


Oracle Enterprise Transformer Data Set

Here, set the connection in the source stage, where the source fields and structure are available, load the metadata into the columns, and do the mapping.

The Transformer Stage is an "all in one" stage.

Properties of Transformer Stage:

o For above question we must create a column to write description

In the down at output properties clicking in empty position.

That column we name as NETSAL

By double clicking on the NETSAL, we can write derivation here.

For example, IN.SAL + IN.COMM \\ we can write it by right-clicking there;

the input columns\functions\and so on are visible in the expression editor.

After that, when we execute the job, the records with null COMM values are dropped and only the remaining records are sent to the target.

o To avoid this we can use a function in the derivation:

IN.SAL + NullToZero (IN.COMM)

o With this derivation the records with null values also reach the target.

Q: NETSAL= SAL + COMM

Logic: if NETSAL > 2000 then TakeHome = NETSAL – 200 else TakeHome = NETSAL

+200; how to include this logic in derivation?


A: adding THome column in output properties.

In THome derivation part we include this logic

o If (IN.SAL + NullToZero (IN.COMM))> 2000

Then (IN.SAL + NullToZero (IN.COMM)) – 200

Else (IN.SAL + NullToZero (IN.COMM) ) + 200

o By this logic it takes more time in huge records, so the best way to over this

problem is Stage Variable.

Stage Variable: "it is a temporary variable which holds a value until the process completes and which does not send its result to the output".

The stage variable button is shown in the toolbar of the Transformer properties.

After clicking it, the stage variables area becomes visible above the input properties.

In the stage variables we must add a column, for example NS

(e.g. NS, initial value 0, SQL type Integer, precision 4, scale 0)

After adding NS column

To NS column including the derivation, IN.SAL + NullToZero (IN.COMM).

Adding these derivations to the input properties to created columns.

o NETSAL = NS

o THome = if (NS > 2000) then (NS -200) else (NS + 200).
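For comparison only, the same NETSAL / TakeHome logic can be written as an Oracle SQL query (a hedged sketch, not what the Transformer generates; it assumes the standard EMP columns used above):

    -- NETSAL treats a null COMM as zero; TAKEHOME applies the > 2000 rule
    SELECT EMPNO,
           ENAME,
           SAL + NVL(COMM, 0) AS NETSAL,
           CASE
             WHEN SAL + NVL(COMM, 0) > 2000 THEN SAL + NVL(COMM, 0) - 200
             ELSE SAL + NVL(COMM, 0) + 200
           END AS TAKEHOME
    FROM EMP;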

DAY 28

Transformer Functions-I

Examples on Transformer Functions:

1. Left Function

2. Right Function

3. Substring Function

4. Concatenate Function

5. Field Function


6. Constraints Function (Filter)

For example, from the word MINDQUEST we need only QUE.

Using the Right function – Right(Left("MINDQUEST", 7), 3)

Using the Left function – Left(Right("MINDQUEST", 5), 3)

Using Substring – "MINDQUEST"[5, 3]

Filter: DataStage in 3 different ways

1. Source level

2. Stages (filter, switch, extended filter)

3. Constraints (transformer, lookup)

Constraints:

“In transformer constraints used as filter, means constraints is also called as filter”

Q: how a constraint used in Transformer?

A: in transformer properties, we will see a constraints row in output link. There we can write

the derivation by double clicking.

Differences between Basic transformer and parallel transformer:


Basic Transformer:
- It affects performance.
- A Basic Tx can only execute up to SMP systems.
- A Basic Tx can call routines written in BASIC and shell scripting language.

Parallel Transformer:
- It does not affect run-time performance, but it does affect compile time.
- It can execute on any platform.
- It supports a wide range of languages (multiple languages).


NOTE: the Tx is very sensitive with respect to data types; the source and target columns cannot have different data types.

Q: How can the below file be read and operated on, i.e. filtering and separating using the Left, Right and Substring functions, and displaying the date as DD-MM-YYYY?

A: File.txt contents (an invoice header record starting with H, followed by its TPID detail records):

HINVC23409CID45432120080203DOL
TPID5650 5 8261.99
TPID5655 4 2861.69
TPID5657 7 6218.96
HINVC12304CID46762120080304EUO
TPID5640 3 5234.00
TPID5645 2 7855.67
TPID5657 9 7452.28
HINVC43205CID67632120080405EUO
TPID5630 8 1657.57
TPID5635 6 9564.13
TPID5637 1 2343.64

Design:

IN1 IN2

SF Tx1 Tx2

IN3


OUT

Tx3 DS

Total five steps to need to solve the given question:

Step 1: Loading file.txt into sequential file, in the properties of sequential file loading the

whole data into one record. Means here creating one column called REC and no need of

loading of Meta data for this.

Step 2: IN1 Tx- Properties, in this step we are filtering the “H” staring records from the given

file. Here, we are creating two columns TYPE and DATA.

Step 3: IN2 Tx properties, here creating four column and separating the data as per created

columns.

Step 2 mapping (IN1 -> IN2), input column REC:

Constraint: Left(IN1.REC, 1) = "H"

Derivation            Column Name
Left(IN1.REC, 1)      TYPE
IN1.REC               DATA

Step 3 mapping (IN2 -> IN3), input columns TYPE, DATA:

Derivation                        Column Name
Left(IN1.REC, 1)                  INVCNO
Left(Right(IN2.DATA, 21), 9)      CID
IN2.DATA[20, 8]                   BILL_DATE
Right(IN2.DATA, 4)                CURR


Step 4: IN3 Tx properties; here the BILL_DATE column is changed into DD-MM-YYYY format using stage variables:

Stage variables:
Derivation                          Column Name
Right(IN3.BILL_DATE, 2)             D
Right(Left(IN3.BILL_DATE, 6), 2)    M
Left(IN3.BILL_DATE, 4)              Y

Output mapping (IN3 -> OUT):
Derivation                 Column Name
IN3.INVCNO                 INVCNO
IN3.CID                    CID
D : '-' : M : '-' : Y      BILL_DATE
IN3.CURR                   CURR

Step 5: here, setting the output file name for displaying the BILL_DATE.

DAY 29

Transformer Functions-II

Examples on Transformer Functions II:

1. Field Function : “it separates the fields using delimiter support”.

2. Trim : “it removes all special characters”.

3. Trim B : “it removes all after spaces”.

4. Trim F : “it removes all before spaces”.


5. Trim T & L : “it removes all after and before spaces”.

6. Strip White Spaces : “it removes all spaces”.

7. Compact White Spaces : “it removes before, after, middle one, spaces”.
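For comparison only, rough Oracle SQL counterparts of these cleansing functions look like this (a sketch; REPLACE/TRIM/UPPER are standard SQL functions, the mapping to the DataStage functions is approximate, and CUST is a hypothetical table):

    -- strip '@', drop surrounding spaces and upper-case a name column
    SELECT UPPER(TRIM(REPLACE(ENAME, '@', ''))) AS ENAME_CLEAN
    FROM CUST;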

Q: A file.txt consisting of special character, comma delimiters and spaces (before, after, and in

between). How to solve by above functions and at last it to be one record?

File.txt

Design:

IN1 IN2

SF Tx Tx

IN3

OUT


EID,ENAME,STATE

111, NaVeen, AP

222@, MUnNA, TN

@333, Sra van, KN@

444, @ San DeeP, KN

555, anvesh,MH


Tx DS

Total Five steps to solve the File.txt using above functions:

Step 1: Here, extracting the file.txt and setting into all data into one record to the new column

created that REC. no need of load meta data to this.

Point to remember keep that first line is column name = true.

Step 2: IN1 Tx properties

In link IN1 we have the single column REC; that REC is divided into fields on the comma delimiter using the Field function:

Derivation                 Column Name
Field(IN1.REC, ',', 1)     EID
Field(IN1.REC, ',', 2)     ENAME
Field(IN1.REC, ',', 3)     STATE

Step 3: IN2 Tx properties

Here, special characters and spaces are removed and lower case is changed to upper case using the Trim, Strip Whitespaces (SWS) and Upcase functions:

Derivation                                Column Name
Trim(IN2.EID, "@", "")                    EID
Upcase(Trim(SWS(IN2.ENAME), "@", ""))     ENAME
Trim(SWS(IN2.STATE), "@", "")             STATE

Step 4: IN3 Tx properties



Here, all the rows that were divided into fields are concatenated, i.e. the fields are added back into one column REC:

Derivation                          Column Name
IN3.EID : IN3.ENAME : IN3.STATE     REC

Step 5:

For the output, a target file is assigned here. In the end the answer is displayed as one record per row, with all special characters and spaces removed, after applying the transformer functions to the above file.txt.

Final output (Trg_file.ds), column REC:
111NAVEENAP
222MUNNATN
333SRAVANKN
444SANDEEPKN
555ANVESHMH

29.1. Re-Structure Stage:

1. Column Export

2. Column Import

Column Export:

“it is used to combine the multiple of columns into single column” and it is also like

concatenate in the transformer function.

Properties:

o Input

Column method = explicit



Column To Export = EID

Column To Export = ENAME

Column To Export = STATE

o Output

Export column type = “varchar”

Export output column = REC

Column Import:

“it is used to explore from single column into multiple columns” and it is also like field

separator in the transformer function.

Properties:

o Input

Column method=

Column To Import = REC

o Output

Import column type = “varchar”

Import output column= EID

Import output column= ENAME

Import output column= STATE
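For comparison only, the combine and split that Column Export / Column Import perform can be sketched in Oracle SQL (hypothetical CUST / CUST_REC tables and fixed-width positions, purely for illustration):

    -- combine (like Column Export): concatenate several columns into one REC column
    SELECT EID || ENAME || STATE AS REC
    FROM CUST;

    -- split (like Column Import): cut a fixed-width REC column back into fields
    SELECT SUBSTR(REC, 1, 3)  AS EID,
           SUBSTR(REC, 4, 6)  AS ENAME,
           SUBSTR(REC, 10, 2) AS STATE
    FROM CUST_REC;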

DAY 30

JOB Parameters (Dynamic Binding)

Dynamic Binding:

"Compiling the job first and passing the values during runtime is known as dynamic binding."

Consider a scenario where we take an Oracle Enterprise stage: we must provide the table and load its metadata, so the table name must be statically bound.


But the Oracle authentication details need not be statically bound, for security reasons; for these we can use job parameters, which provide the values at runtime to authenticate.

Job parameters:

“job parameters is a technique that passing values at the runtime, it is also called

dynamic binding”.

Job parameters are divided into two types, they are

o Local variables

o Global Variable

Local variables (params): “it is created by the DS Designer only, it can use with in the

job only”.

Global Variables : “it is also called as environment variables”, it is divided into two

types. They are,

o Existing: comes with in DataStage, in this two types one general and another

one parallel. Under parallel compiler, operator specific, reporting will

available.

o User Defining : it is created in the DataStage administrator only.

Q: How to give Runtime values using parameters for the following list?

a. To give runtime values for user ID, password, and remote server?

b. Department number (DNO) to keep as constraint and runtime to select list of any

number to display it?

c. Add BONUS to SAL + COMM at runtime?

d. Providing target file name at runtime?

e. Re-using the global and parameter set?


NOTE: "Local parameters created in one job cannot be reused in another job; this is the case up to version 7. In version 8 we can reuse them using a technique called a parameter set." In version 7 we can still reuse parameters as user-defined values via the DataStage Administrator.


Design:

ORACLE Tx Data Set

Step1:

“Creating job parameters for given question in local variable”.

Job parameters

o Parameters

Name DNAME Type Default value

UID USER string SCOTT

PWD Password Encrypted ******

RS SERVER String ORACLE

DNO DEPT List 10

BONUS BONUS Integer 1000

IP DRIVE String C:\

FOLDER FOLDER String Repository\

TRG FILE TARGET String dataset.ds

Here, a, b, c, d are represents a solution for the given question.

Step 2:“Creating global job parameters and parameter set”.

DS Administrator

o Select a project

Properties

General

o Environment variables

User defined (there we can write parameters)

Name DNAME Type Default value


UID USER string SCOTT

PWD Password Encrypted ******

RS SERVER String ORACLE

Here, global parameters are preceded by $ symbol.

For Re-use, we must

o Add environment variables

User defined

UID $UID

PWD $PWD

RS $RS

Step 3:

“Creating parameter set for multiple values & providing UID and PWD other values

for DEV, PRD, and TEST”.

In local variables job parameters

o Select multiple of values by clicking on

And create parameter set

Providing name to the set

o SUN_ORA

Saving in Table definition

In table definition

o Edit SUN_ORA values to add

Name UID PWD SERVER

DEV SYSTEM ****** MOON

PRD PRD ****** SUN

TEST TEST ****** ORACLE

For re-using this to another job.

o Add parameters set (in job parameters)

Table definitions

Navs


o SUN_ORA(select here to use)

NOTE: “Parameter set use in the jobs with in the project only”.

Step 4:

“In oracle enterprise properties selecting the table name and later assign created job

parameter as shown below”.

Properties:

Read method = table

o Table = EMP

Connection

o Password = #PWD#

o User = #UID#

o Remote Server = #RS#

Column:

Load

o Meta data for EMP table

Step 5:

“In Tx properties dept no using as a constraint and assign bonus to bonus column”.

Step 5 stage variable (input columns EID, ENAME, STATE, SAL, COMM, DEPTNO):

Derivation                        Column Name
IN.SAL + NullToZero(IN.COMM)      NS

Step 5 output mapping (OUT):

Constraint: IN.DEPTNO = DNO

Derivation      Column Name
IN.EID          EID
IN.ENAME        ENAME
NS              NETSAL
NS + BONUS      BONUS

The "Insert job parameter" list in the expression editor offers all three kinds of parameters:
- Global environment variables: $UID, $PWD, $RS
- Parameter set: SUN_ORA.UID, SUN_ORA.PWD, SUN_ORA.RS
- Local variables: UID, PWD, RS


Here, DNO and BONUS are the job parameters we have created above to use here.

For that simply right click->job parameters->DNO/BONUS (choose what you want)

Step 6:

“Target file set at runtime, means following below steps to follow to keep at runtime”.

Data set properties

o Target file= #IP##FOLDER##TRGFILE#

Here, when run the job it asks in what drive, and in which folder. At last it asks what target

file name you want.

DAY 31

Sort Stage (Processing Stage)

Q: What is sorting?

"Here, sorting means more than the simple ordering we already know."

Q: Why to sort the data?



“To provide sorted data to some sort stages like join/ aggregator/ merge/ remove

duplicates for the good performance”.

Two types of sorting:

1. Traditional sorting : “simple sort arranging the data in ascending order or descending

order”.

2. Complex sorting : “it is only for sort stages and to create group id, blocking unwanted

sorting, and group wise sorting”.

In DataStage we can perform sorting in three levels:

Source level: “it can only possible in data base”.

Link level: “it can use in traditional sort”.

Stage level: “it can use in traditional sorting as well as complex sorting”.

Q: What is best level to sort when we consider the performance?

“At Link level sort is the best we can perform”.

Source level sort:

o It can be done in only data base, like oracle enterprise and so on.

o How it will be done in Oracle Enterprise (OE)?

Go to OE properties

Select user define SQL

o Query: select * from EMP order by DEPTNO.

Link level sort:

o Here sorting will be done in the link stage that is shown how in pictorial way.

o And it will use in traditional sorting only.

o Link sort is best sort in case of performance.


OE

JOIN DS

Q: How to perform a Link Sort?

“Here as per above design, open the JOIN properties”.

And go to partitions

o Select partition technique (here default is ‘auto’)

Mark “perform sort”

When we select unique (it removes duplicates)

When we select stable (it displays the stable data)

Q: Get all unique records to target1 and remaining to another target2?

“For this we must create group id, it indicates the group identification”.

It is done in a stage called sort stage, in the properties of the sort stage and in the

options by keeping create key change column (CKCC) = “true”, default is false.

Here we must select to which column group id you want.

Sort Stage:

“It is a processing stage, that it can sort the data in traditional sort or in complex sort”.

Sort Stage


Complex sort means to create group id, blocking unwanted sorting, and group wise

sorting in some sort stage like join, merge, aggregate, and remove duplicates.

Traditional sort means sorting in ascending order or descending order.

Sort Properties:

Input properties

o Sorting key = EID (select the column from source table)

o Key mode = sort (sort/ don’t sort (previously sorted)/ don’t sort (previously

grouped))

o Options

Create cluster key change column = false (true/ false)

Create key change column = (true/ false)

True = enables group id.

False = disables the group id.

Output properties

o Mapping should be done here.

DAY 32

A Transformer & Sort stage job

Q: Sort the given file and extract the all addresses to one column of a unique record and count

of the addresses to new column.


File.txt:

EID, ENAME, ACCTYPE
111, munna, savings
333, naveen, loans
222, kumar, credit
111, munna, current
222, kumar, loans
111, munna, insurance
333, naveen, current
111, munna, loans
222, kumar, savings

Design:

SF Sort1 DS

Tx Sort2

Sequential File (SF): here reads the file.txt for the process.

Sort1: here sorting key = EID

And enables the CKCC for group id.

Transformer (TX): here logic to implement operation for target.

o Properties of TX :

Stage variables (input columns EID, ENAME, ACCTYPE, KeyChange):

Derivation                                                                Column Name
if (IN2.keychange = 1) then IN2.ACCTYPE else func1 : ',' : IN2.ACCTYPE    func1
if (IN2.keychange = 1) then 1 else c + 1                                  c


The full output mapping (OUT) is:

Derivation    Column Name
IN2.EID       EID
IN2.ENAME     ENAME
func1         ACCTYPE
c             COUNT

For this logic the output (the input to Sort2) is displayed as below:

EID, ENAME, ACCTYPE                                   COUNT
111, munna, savings                                   1
111, munna, savings, current                          2
111, munna, savings, current, insurance               3
111, munna, savings, current, insurance, loans        4
222, kumar, credit                                    1
222, kumar, credit, loans                             2
222, kumar, credit, loans, savings                    3
333, naveen, current                                  1
333, naveen, current, loans                           2

Sort2:

o Here, in the properties we must set as below.

Stage

Key=ACCTYPE

o Sort key mode = sort

o Sort order = Descending order



Input

Partition type: hash

Sorting

o Perform sort

Stable (uncheck)

Unique (check this)

o Selected

Key= count

Usage= sorting, partitioning

Options= ascending, case sensitive

Output

Mapping should be doing here.

Data Set (DS):

o Input:

partition type: hash

o Sorting:

Perform sort

Stable (check this)

Unique (check this)

o Selected

Key= EID

Usage= sorting, partition

Ascending

Final output:

EID, ENAME, ACCTYPE                       COUNT
111, munna, sav, curr, insu, loans        4
222, kumar, credit, loans, sav            3
333, naveen, current, loans               2
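For comparison only, the same result (one row per customer with the account types collected into one column plus a count) can be produced in Oracle SQL with LISTAGG, available from Oracle 11.2 onwards (a sketch; CUST_ACC is a hypothetical table holding the File.txt rows):

    -- build the comma-separated ACCTYPE list and the count per EID/ENAME
    SELECT EID,
           ENAME,
           LISTAGG(ACCTYPE, ',') WITHIN GROUP (ORDER BY ACCTYPE) AS ACCTYPE,
           COUNT(*) AS CNT
    FROM CUST_ACC
    GROUP BY EID, ENAME;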

DAY 33

FILTER STAGE

Filter means “blocking the unwanted data”. In DataStage Filter stage can perform in three

level, they are

1. Source level



2. Stage level

3. Constraints

Source Level Filter : “it can be done in data base and as well as in file at source level”.

o Data Base: by write filter quires like “select * from EMP where DEPTNO =

10”.

o Source File: here we have option called filter there we can write filter

commands like “grep “moon”/ grep –I “moon”/ grep –w “moon” ”.

Stage Filter :

o “Stage filters use in three stages, and they are 1. Filter, 2. Switch and 3.

External filter”.

o Difference between if and switch:

o Here filter is like an IF, switch as switch.

o Differences between three filter stages.

IF vs SWITCH:

IF (filter):
- Poor performance.
- An IF condition can be written on 'n' number of columns.
- It can have 'n' number of cases.

SWITCH:
- Better performance than IF.
- SWITCH can evaluate a condition on only one column.
- It can have only 128 cases.

FILTER vs SWITCH vs EXTERNAL FILTER:

FILTER:
- Condition on multiple columns.
- It has: 1 input, n outputs, 1 reject.

SWITCH:
- Condition on a single column.
- It has: 1 input, 128 outputs, 1 default.

EXTERNAL FILTER:
- It works using grep commands.
- It has: 1 input, 1 output, no rejects.


Filter stage: “it having one input, n outputs, and one reject link”.

The symbol of filter is

Filter

Q: How the filter stage to send the data from source to target?

Design:

DS

Filter

OE

DS

DS

Step1:

Connecting to the oracle for extracting the EMP table from it.

Step2:

Filter properties

Predicates


o Where clauses = DEPT NO =10

Output link =1

o Where clauses = SAL > 1000 and SAL < 3000

Output link = 2

o Output rejects = true // it is for output reject data.

Link ordering

o Order of the following output links

Output:

o Mapping should be done for links of the targets we have.

Here, Mapping for T1 and T2 should be done separately for both.

Step3:

“Assigning a target files names in the target”.

It have no reject link, we must convert a link as reject link. Because it has ‘n’ number of

outputs.

DAY 34

Jobs on Filter and properties of Switch stage

Assignment Job 1:

a. Only DEPTNO 10 to target1?

b. Condition SAL>1000 and SAL<3000 satisfied records to target2?


c. Only DEPTNO 20 where clause = SAL<1000 and SAL>3000 to target3?

d. Reject data to target4?

Design to the JOB1:

Filter

EMP_TBL

Filter

Step1: “For target1: In filter where clause for target1 is DEPTNO=10 and link order=0”.

Step2: “For target2: where clause = SAL>1000 and SAL<3000 and link order=1”.

Step3: “For target3: where clause= DEPTNO=20 and link order=0”.

Step4: “For target4: convert link into reject link and output reject link=true”.

Job 2:

a. All records from source to target1?

b. Only DEPTNO=30 to target2?

c. Where clause = SAL<1000 and SAL>3000 to target3?

d. Reject data to target4?


Design to the JOB2:

Copy

EMP_TBL

Filter

Step1: “For target1 mapping should be done output links for this”.

Step2: “For target2 where clause = DEPTNO=30 and link order =0”.

Step3: “For target3 where clause = SAL<1000 and SAL>3000 and link order=1”.

Step4: “For target4 convert link into reject link and output reject link=true”.

Job 3:

a. All unique records of DEPTNO to target1?

b. All duplicates records of DEPTNO to target2?

c. All records to target3?

d. Only DEPTNO 10 records to target4?

e. Condition SAL>1000 & SAL<3000, but no DEPTNO=10 to target5?


Design to the JOB3:

Filter

EMP_TBL

Filter

Step1: “For target1: where clause = keychange=1 and link order=0”.

Step2: “For target2: where clause = keychange=0 and link order=1”.

Step3: “For target3: mapping should be done output links for this”.

Step4: “For target4: where clause= DEPTNO=10”.

Step5: “For target5: in filter properties put output rows only once= true for where clause

SAL>1000 & SAL<3000”.

SWITCH Stage:


“Condition on single column and it has only 1 – input, 128 – outputs and 1- default”.

Picture of switch stage:

Properties of Switch stage:

Input

o Selector column = DEPTNO

Cases values

o Case = 10 = 0 link order

o Case = 20 = 1

Options

o If not found = options (drop/ fail/ output)

Drop = drops the non-matching records and continues the process.

Fail = the job aborts if any record does not match a case.

Output = sends the non-matching records to the default link so the reject data can be viewed.
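For intuition only, the routing that the Switch stage does on DEPTNO can be sketched as an Oracle SQL CASE expression (link numbers as configured above; this is just an illustration, not what DataStage generates):

    -- 0 and 1 correspond to the case link order; NULL plays the role of "not found"
    SELECT e.*,
           CASE DEPTNO
             WHEN 10 THEN 0
             WHEN 20 THEN 1
             ELSE NULL
           END AS OUTPUT_LINK
    FROM EMP e;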

DAY 35

External Filter and Combining

External Filter: "It is a processing stage which can perform filtering using UNIX commands".

It has 1 input, 1 output, and 1 reject link.

To perform a text file, first it must read in single record in the input.

Navs notes Page 111

Page 112: DataStage NOTES

2010DataStage

Example filter command: grep “newyork”.

Sequential File External Filter Data Set

External Filter properties:

o Filter command = grep “newyork”

o grep -v "newyork" \\ passes everything other than "newyork", i.e. filters the "newyork" records out.

Combining: “in DataStage combining can done in three types”.

They are

o Horizontal combining

o Vertical combining

o Funneling combining

Horizontal combining: combining primary rows with secondary rows based on primary key.

o This stage that perform by JOIN, LOOKUP, and MERGE.

These three stages differs with each other with respect to,

o Inputs requirements,

o Treatment of unmatched records, and

o Memory usage.

DAY 36

Horizontal Combining (HC) and Description of HC stages

Horizontal Combining (HC): “combining the primary rows with secondary rows based on

primary key”.

Navs notes Page 112

Page 113: DataStage NOTES

2010DataStage

Selection of primary table is situation based.

Here we can combine

Inner join,

Left outer join,

Right outer join, and

full outer join

If T1 = {10, 20, 30} and T2 = {10, 20, 40}:

Inner Join: "matched primary and secondary records" -> {10, 20}

Left Outer Join: "matched primary & secondary plus unmatched primary records" -> {10, 20, 30}

Right Outer Join: "matched primary & secondary plus unmatched secondary records" -> {10, 20, 40}

Full Outer Join: "matched primary & secondary plus unmatched primary & unmatched secondary records" -> {10, 20, 30, 40}


Example primary table (EMP):
ENO   EName    DNo
111   naveen   10
222   munna    20
333   kumar    30

Example secondary table (DEPT):
DNo   DName   LOC
10    IT      HYD
20    SE      SEC
40    SA      WRL

HC output columns: DNO, DNAME, LOC, ENO, ENAME
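For reference, the four join types on this EMP/DEPT example can be expressed in Oracle SQL (a sketch only; the DataStage stages do the equivalent combining internally):

    -- inner join: matched records only
    SELECT d.DNO, d.DNAME, d.LOC, e.ENO, e.ENAME
    FROM EMP e
    INNER JOIN DEPT d ON e.DNO = d.DNO;

    -- LEFT OUTER JOIN also keeps unmatched EMP rows,
    -- RIGHT OUTER JOIN keeps unmatched DEPT rows,
    -- FULL OUTER JOIN keeps both; only the join keyword changes.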


Description of HC stages: “The description of horizontal combining is divided into nine

parts”. They are,

o Input names,

o Input output rejects,

o Join types,

o Input requirements with respect to sorting,

o De – duplication (removing duplicates),

o Treatment of unmatched records,

o Memory usage,

o Key column names, and

o Types of inner join.

The differences between join, lookup, and merge with respect to above nine points are

shown below.

Input names:
- JOIN: when we work on HC with JOIN, the first source is the left table, the last source is the right table, and all middle sources are intermediate tables.
- LOOKUP: the first link from the source is the primary/input link and the remaining links are lookup/reference links.
- MERGE: the first table is the master table and the remaining tables are update tables.

Join types:
- JOIN: inner join, left outer join, right outer join, and full outer join.
- LOOKUP: inner join, left outer join.
- MERGE: inner join, left outer join.

Input, output and rejects:
- JOIN: n inputs (inner, LOJ, ROJ), 2 inputs (FOJ), 1 output, no reject.
- LOOKUP: n inputs (normal), 2 inputs (sparse), 1 output, and 1 reject.
- MERGE: n inputs, 1 output, (n - 1) rejects.

Input requirements with respect to sorting:
- JOIN: primary mandatory, secondary mandatory.
- LOOKUP: primary optional, secondary optional.
- MERGE: primary mandatory, secondary mandatory.

De-duplication (removing the duplicates):
- JOIN: primary OK (nothing happens), secondary OK.
- LOOKUP: primary OK, secondary gives warnings.
- MERGE: primary gives warnings, secondary OK.

Treatment of unmatched records:
- JOIN: primary drop (inner) / target (left); secondary drop (inner) / target (right).
- LOOKUP: primary drop, target (continue) or reject (unmatched primary records); secondary drop.
- MERGE: primary drop or target (keep); secondary drop or reject (unmatched secondary records).

Memory usage:
- JOIN: light memory; LOOKUP: heavy memory; MERGE: light memory.

Key column names:
- JOIN: must be the SAME; LOOKUP: optional (same in case of a lookup file set); MERGE: must be the SAME.

Type of inner join:
- JOIN: ALL; LOOKUP: ALL; MERGE: ANY.

DAY 37

LOOKUP stage (Processing Stage)

Lookup stage:



In real time projects, 95% of horizontal combining is used by this stage.

“Look up stage is for cross verification of primary records with secondary records”.

DataStage version8 supports four types of LOOKUP, they are

o Normal LOOKUP

o Sparse LOOKUP

o Range LOOKUP

o Case less LOOKUP

For example in simple job with EMP and DEPT tables:

Primary table as EMP with column consisting of EID, ENAME, DNO

Reference table as DEPT with column consisting of DNO, DNAME, LOC

DEPT table (reference/ lookup)

EMP table (Primary/ input) LOOKUP Data Set (target)

LOOKUP properties for two tables:

Figure: EMP (ENO, ENAME, DNO) as the primary/input link, DEPT (DNO, DNAME, LOC) as the reference link, and target columns ENO, ENAME, DNAME.

Key column for both tables

It can set by just drag from primary table to reference table to DNO column.

The toolbar of the LOOKUP stage has a constraints button, in which we have to select:

Continue : this option gives a left outer join.

Drop : this gives an inner join.

Fail : it aborts the job if there are unmatched primary records.

Reject : it captures the unmatched primary records.

Caseless LOOKUP: during execution the lookup is case sensitive by default.

But we have an option to remove the case sensitivity, i.e.

o Key type = caseless.

DAY 38

Sparse and Range LOOKUP

Sparse LOOKUP:


If the reference source is a database, sparse lookup supports only two inputs.

Normal lookup: "cross verification of primary records with secondary records in memory".

Sparse lookup: "cross verification of primary records with secondary records at the source level itself".

To set sparse lookup we must set the lookup type to sparse on the reference link only.

By default the Lookup stage does a Normal LOOKUP.

Note: sparse lookup does not support another reference link when the reference is a database.

But in ONE case a sparse LOOKUP can support 'n' references: by taking a lookup file set.

Job1: a sequential file extracts a text file and loads it into a lookup file set (lfs).

Sequential file Lookup file set

Here in lookup file set properties:

o Column names should same as in sequential file.

o Target file stored in .lfs extension.

o Address of the target must save to use in another job.

Job2: in this job we are using lookup file set as sparse lookup.

LFS LFS

……………………


SF LOOKUP DS

In lookup file set, we must paste the address of the above lfs.

The lookup file set supports 'n' references, which indirectly means that sparse lookup supports 'n' references.

Range LOOKUP:

"Range lookup applies a condition over a range of values between the two tables".

How to set the range lookup:

In LOOKUP properties:

Select the check box for column you need to condition.
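For intuition only, a range lookup behaves like a SQL join on a range condition (a sketch with hypothetical CUST and AGE_BAND tables):

    -- each customer row is matched to the band whose range contains its AGE
    SELECT c.CID, c.AGE, b.BAND_NAME
    FROM CUST c
    JOIN AGE_BAND b
      ON c.AGE BETWEEN b.AGE_FROM AND b.AGE_TO;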

Condition for LOOKUP stage:

How to write a condition in the lookup stage?

o Go to tool bar constraint, there we will see condition box.

o In condition, for example: in.primary= “AP”

o For multiple links we can write multiple conditions for ‘n’ references.

DAY 39

Funnel, Copy and Modify stages

Funnel Stage:


“It is a processing stage which performs combining of multiple sources to a target”.

To use the Funnel stage, some conditions must be followed:

1. The number of columns should be the same.

2. The column names should also be the same.

3. The column names are case sensitive, so their case must match.

4. The data types should be the same.

The Funnel stage appends the records of one table after another, but the above four conditions have to be met.

Simple example for funnel stage:

Funnel operation three modes:

Continues funnel : it’s random.

Sequence : collection of records is based on link order.

Sort funnel : it’s based on key column values.

Example for the funnel stage:

Table 1 (ENO, EN, Loc, GEN):
111  naveen  HYD  M
222  munna   SEC  F
333  kumar   BAN  M

Table 2 (EMPID, EName, Loc, Country, Company, GEN):
444  IT  DEL  INDIA  IBM  1
555  SA  NY   USA    IBM  0

Before funneling, Table 2 goes through a Tx (in this stage the GEN column has to be exchanged so that M = 1 and F = 0 in both tables) and a Copy/Modify stage (in this stage the column names have to be changed to match the primary table), giving the common structure ENO, EN, ADD, GEN.
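For comparison only, the funnel operation corresponds to a SQL UNION ALL once the two tables share the same structure (a sketch using the example above; T1 and T2 are hypothetical table names):

    -- append the rows of T2 after the rows of T1 (no duplicate removal);
    -- this assumes GEN has already been converted to the same type in both
    -- tables, just as the Tx step above requires
    SELECT ENO, EN, LOC, GEN FROM T1
    UNION ALL
    SELECT EMPID, ENAME, LOC, GEN FROM T2;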


Copy Stage:

"It is a processing stage which can be used for:"

1. Copying source data to multiple targets.

2. Changing the column names.

3. Dropping columns.

4. Acting as a stub stage.

NOTE: it is best for changing column names and dropping columns.

Modify Stage:

“It is processing stage which can perform”.

1. Drop the columns.

2. Keep the columns.

3. Change the column names.

4. Modify the data types.

5. Alter the data.

Oracle Enterprise Modify Data Set

From OE using modify stage send data into data set with respect to above five points.

In modify properties:

Specification: drop SAL, MGR, DEPTNO

o Here drops the above columns.


Specification: keep SAL, MGR, DEPTNO

o Here accept the columns, remaining columns were drops.

At runtime: Data Set Management (view the operation process)

Specification: <new column name> DOJ=HIREDATE<old column>

o Here to change column name.

Specification: <new column name>DOJ=DATE_FROM_TIMESTAMP(HIREDATE)

<old column>

o Here changing the column name with data type.
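For comparison only, the same keep/rename/convert operations look like this in Oracle SQL (a sketch; the Modify stage does this inside the job, without a database):

    -- keep only some columns, rename HIREDATE to DOJ and convert it to a DATE
    SELECT EMPNO,
           ENAME,
           CAST(HIREDATE AS DATE) AS DOJ
    FROM EMP;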

DAY 40

JOIN Stage (processing stage)


Join stage it used in horizontal combining with respect to input requirements, treatment of

unmatched records, and memory usage.

Join stage input names are left table, right table, and intermediate tables.

Join stage having n – inputs (inner, LOJ, ROJ), 2 – inputs (FOJ), 1- output, no

reject.

Types of Join stage are inner, left outer join, right outer join, and full outer join.

Input requirements with respect to sorting: it is mandatory in primary and secondary

tables.

Input requirements with respect to de – duplication: nothing happens means it’s OK

when de – duplication.

Treatment of unmatched records: for the primary table, with the inner option unmatched records are simply dropped, and with LOJ all records are kept in the target. For the secondary table, with the inner option they are dropped, and with ROJ all records are kept in the target.

Memory usage: light memory in the Join stage.

Key column names should be SAME in this stage.

All types of inner join will supports.

A simple job for JOIN Stage:

JOIN properties:

Need a key column


o Inner JOIN, Left Outer JOIN comes in left table.

o Right Outer JOIN comes in right table.

o Full Outer JOIN comes both tables, in this no scope from third table that’s why

FOJ have two inputs.

In the Join stage, when we sort with different key column names the job can still execute, but it affects performance (simply put, WARNINGS will occur).

We can change the column name by two types

Copy stage and with query statement.

Example of SQL query: select DEPTNO1 as DEPTNO, DN, and Loc from DEPT;

DAY 41

MERGE Stage (processing stage)


Merge stage is a processing stage it perform horizontal combining with respect to input

requirements, treatment of unmatched records, and memory usage.

Merge stage input names are master and updates.

N – inputs, 1 – output, and (n – 1) rejects for merge stage.

Join types of this stage are inner join, and left outer join.

Input requirements with respect to sorting is mandatory to sort before perform merge

stage.

Input requirements with respect to de-duplication: we get warnings when we do not remove duplicates from the primary (master) table; in the secondary (update) tables nothing happens, it is OK if duplicates are not removed.

Treatment of unmatched records: for the primary table, Drop drops and Target (keep) keeps the unmatched master records; for the secondary tables, Drop drops them and Reject captures the unmatched update records.

In the merge stage the memory usage is LIGHT memory.

The key column names must be the SAME.

In type of inner join it compares in ANY update tables.

NOTE:

Static information stores in the master table.

All changes information stores in the update tables.

Merge operates with only two options

o Keep (left outer join)

o Drop (inner Join)

Simple job for MERGE stage:

Master Table (PID, PRD_DESC, PRD_MANF):
11  indica  tata
22  swift   maruthi
33  civic   hundai
44  logon

Update U1 (PID, PRD_SUPP, PRD_CAT):
11  abc  XXX
33  xyz  XXX
55  pqr  XXX
77  mno  XXX

Update U2 (PID, PRD_AGE, PRD_PRICE):
11  4  1000
22  9  1200
66  3  1500
88  9  1020

Master table

TRG

U1

U2

or

Reject (U1) Reject (U2)

In MERGE properties:

Merge have inbuilt sort = (Ascending Order/Descending Order)

Must to follow link order.

Merge supports (n-1) reject links.

NOTE: there has to be same number of reject links as update links or zero reject links.

Here COPY stage is acting as STUB Stage means holding the data with out sending

the data into the target.

DAY 42

Remove Duplicates & Aggregator Stages


Remove Duplicates:

“It is a processing stage which removes the duplicates from a column and retains the

first or last duplicate rows”.

Sequential File Remove Duplicates Data Set

Properties of Remove Duplicates:

Two options in this stage.

o Key column= <column name>

o Dup to retain=(first/last)

Remove Duplicates stage supports 1 – input and 1 – output.

NOTE: for every stage with n inputs and n outputs, mapping must be done.
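For intuition only, "retain the first duplicate" corresponds to the following Oracle SQL pattern (a sketch; the key and ordering columns are illustrative):

    -- keep one row per DEPTNO: the first one by EMPNO order
    SELECT *
    FROM (
      SELECT e.*,
             ROW_NUMBER() OVER (PARTITION BY DEPTNO ORDER BY EMPNO) AS rn
      FROM EMP e
    )
    WHERE rn = 1;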

Aggregator:

"It is a processing stage that counts rows and performs different calculations on columns per group, i.e. the same as a GROUP BY operation in Oracle".

SF Aggregator DS

Properties of Aggregator:

Grouping keys:

o Group= Deptno

Aggregator

o Aggregator type = count rows (count rows/ calculation/ re – calculation)

o Count output column= count <column name>


1Q: Count the number of all records and deptno wise in a EMP table?

1 Design:

OE_EMP Copy of EMP Counting rows of deptno TRG1

Generating a column counting rows of created column TRG2

For doing some group calculation between columns:

Example:

Select group key

Group= DEPTNO

- Aggregation type = calculation

- Column for calculation = SAL <column name>

Operations are

Maximum value output column = max <new column name>

Minimum value output column = min <new column name>

Sum of column = sum <new column name> and so on.

Here, doing calculation on SAL based on DEPTNO;
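For comparison only, the same grouping can be written as an Oracle GROUP BY query (a sketch of the count/max/min/sum calculations described above):

    -- one output row per DEPTNO with the row count and SAL calculations
    SELECT DEPTNO,
           COUNT(*) AS CNT,
           MAX(SAL) AS MAX_SAL,
           MIN(SAL) AS MIN_SAL,
           SUM(SAL) AS SUM_SAL
    FROM EMP
    GROUP BY DEPTNO;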

2Q In Target one dept no wise to find maximum, minimum, and sum of rows, and in

target two company wise maximum?

2 Design:


OE_emp copy of emp max, min, sum of deptno trg1

Company: IBM max of IBM trg2

3Q: To find max salary from emp table of a company and find all the details of that?

&

4Q: To find max, min, sum of salary of a deptno wise in a emp table?

3 & 4 Design: dummy dno=10

compare

emp

max(deptno) dno=20

UNION ALL diving

compare dummy dno=30

copy

min(deptno)

company: IBM compare

maximum SAL with his details

max (IBM)

DAY 43

Slowly Changing Dimensions (SCD) Stage

Before SCD we must understand: types of loading


1. Initial load

2. Incremental load

Initial load: the complete dump into the dimensions or data warehouse, i.e. the target; this existing data is also called the "before" data.

The subsequent, incremental changes are called the incremental load; this data coming from the OLTP source is also called the "after" data.

Example: #1

Before data (already data in a table)

CID CNAME ADD GEN BALANCE Phone No AGE

11 A HYD M 30000 9885310688 24

After data (update n insert at source level data)

CID CNAME ADD GEN BALANCE Phone No AGE

11 A SEC M 60000 9885865422 25

Column fields that have changes types:

Address – slowly change

Balance – rapid change

Phone No – often change

Age – frequently

Example: #2

Before Data:

CID CNAME ADD


11 A HYD

22 B SEC

33 C DEL

After Data: (update ‘n’ insert option loading a table)

CID CNAME ADD

11 A HYD

22 B CUL

33 D PUN

Extracting after and before data from DW (or) database to compare and upsert.

We have SIX Types of SCD’s are there, they are

SCD – I

SCD – II

SCD – III

SCD – IV or V

SCD – VI

Explanation:

SCD – I: "it maintains only the current update; no historical data is kept".

As per SCD – I, the before data is updated with the after data, and no history is present after the execution.

SCD – II: "it maintains both the current data and the historical data", with some special operational columns: surrogate key, active flag, effective start date, and effective end date.

In SCD – II there is no natural primary key, so a system-generated primary key is needed, i.e. the surrogate key; here the surrogate key acts as the primary key.


When SCD – II is performed we get a practical problem: identifying the old and the current

record. That is solved by the active flag: “Y” or “N”.

In SCD – II, new concepts are introduced, i.e., effective start date (ESDATE) and

effective end date (EEDATE).

Record version: a concept used when ESDATE and EEDATE cannot be used in some conditions.

Unique key: a unique identification maintained by comparing records.

SCD – III: SCD – I (+) SCD – II “maintain the history but no duplicates”.

SCD – IV or V: SCD – II + record version

“When we do not maintain the date versions, then the record version is useful”.

SCD – VI: SCD – I + unique identification.

Example table of SCD data:

SID CID CNAME ADD AF ESDATE EEDATE RV UID

1 11 A HYD N 03-06-06 29-11-10 1 1

2 22 B SEC N 03-06-06 07-09-07 1 2

3 33 C DEL Y 03-06-06 9999-12-31 1 3

4 22 B DEL N 08-09-07 29-11-10 2 2

5 44 D MCI Y 08-09-07 9999-12-31 1 5

6 11 A GDK Y 30-11-10 9999-12-31 2 1

7 22 B RAJ Y 30-11-10 9999-12-31 3 2

8 55 E CUL Y 30-11-10 9999-12-31 1 8

Table: this table is describing the SCD six types and the description is shown above.
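For intuition, a minimal Oracle SQL sketch of how one SCD – II change from the table above (customer 22 getting the new address RAJ on 30-11-10) would be applied: expire the old current row, then insert the new current row. Column names ADD and UID are written as ADDR and U_ID here only because they clash with Oracle reserved words; the statements are illustrative, not what the SCD stage generates.

UPDATE scd_dim
SET    eedate = TO_DATE('29-11-10', 'DD-MM-RR'),   -- close the previously current version
       af     = 'N'
WHERE  cid = 22
AND    af  = 'Y';

INSERT INTO scd_dim (sid, cid, cname, addr, af, esdate, eedate, rv, u_id)
VALUES (7, 22, 'B', 'RAJ', 'Y',
        TO_DATE('30-11-10', 'DD-MM-RR'),
        TO_DATE('9999-12-31', 'YYYY-MM-DD'),
        3, 2);                                     -- new surrogate key, record version 3, same unique id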

DAY 44

SCD I & SCD II (Design and Properties)


SCD – I: Type1 (Design and Properties):

Transfer job: OE_DIM (before data: 10, 20, 30) and OE_SRC (after data: 10, 20, 40) → SCD stage → dim output to DS_TRG_DIM (10, 20, 40) and fact output to DS_FACT (10, 20, 40).

Load job: DS_TRG_DIM (10, 20, 40) → OE_UPSERT (update and insert into Oracle).

In oracle we have to create table1 and table2,

Table1:

Create table SRC(SNO number, SNAME varchar2(25));

o Insert into src values(111, ‘naveen’);

o Insert into src values(222, ‘munna’);

o Insert into src values(333, ‘kumar’);

Table2:

Create table DIM(SKID number, SNO number, SNAME varchar2(25));

o No records to display;

Processes of transform job SCD1:

Step 1: Load the plug-in metadata from Oracle for the before and after data, as shown in the above

links coming from the different sources.

Step 2: “SCD1 properties”


Fast path 1 of 5: select output link as:

Fast path 2 of 5: navigating the key column value between before and after tables

Fast path 3 of 5: selecting source type and source name.

Source type: source name:

NOTE: before every run of the job we should empty the source file (empty.txt); otherwise the

surrogate key will continue from the last stored value.

Fast path 4 of 5: select output in DIM.

Fast path 5 of 5: setting the output paths to the FACT data set.

Screen settings recovered from the figures:

Fast path 2 (key/purpose mapping): SKID = surrogate key; AFTER.SNO → SNO = business key; SNAME = Type1.

Fast path 3: Source type = Flat file; Source name = D:\study\navs\empty.txt.

Fast path 4 (DIM output derivations): next sk() → SKID (surrogate key); AFTER.SNO → SNO (business key); AFTER.SNAME → SNAME (Type1).

Fast path 5 (FACT output derivations): BEFORE.SKID → SKID; AFTER.SNO → SNO.

Step 3: In the next job, i.e., the load job, if we change or edit the source table, then when

loading into Oracle we must set the write method = upsert, which has two options:

- update then insert \\ if the key column value already exists.

- insert then update \\ if the key column value is new.
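For intuition, a minimal Oracle MERGE sketch of the same upsert logic on the DIM table created earlier (update when the business key already exists, insert when it is new). The sq9 sequence used for the surrogate key is an assumption for this sketch; in the job itself the SCD stage generates SKID from the flat file.

MERGE INTO dim d
USING src s
ON (d.sno = s.sno)                       -- business key match
WHEN MATCHED THEN
  UPDATE SET d.sname = s.sname           -- SCD I: overwrite, keep no history
WHEN NOT MATCHED THEN
  INSERT (d.skid, d.sno, d.sname)
  VALUES (sq9.NEXTVAL, s.sno, s.sname);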

Here the SCD I result for the below input:

Before table (DIM):
CID  CNAME  SKID
10   abc    1
20   xyz    2
30   pqr    3

After table (source):
CID  CNAME
10   abc
20   nav
40   pqr

Target dimensional table of SCD I:
CID  CNAME  SKID
10   abc    1
20   nav    2
40   pqr    3

SCD – II: (Design and Properties):

Transfer job: OE_DIM (before data: 10, 20, 30) and OE_SRC (after data: 10, 20, 40) → transformer → SCD stage → dim output to DS_TRG_DIM (10, 20, 20, 30, 40) and fact output to DS_FACT (10, 20, 20, 30, 40).

Load job: DS_TRG_DIM → OE_UPSERT (update and insert into Oracle).

Step 1: in the transformer stage:

Add some columns to the before table, i.e., convert the ESDATE and EEDATE timestamp columns

into dates in the transformer stage, to perform SCD II.

In TX properties (derivations recovered from the figure): BEFORE.SKID → SKID; BEFORE.SNO → SNO; BEFORE.SNAME → SNAME; TS_to_D(BEFORE.ESDATE) → ESDATE; TS_to_D(BEFORE.EEDATE) → EEDATE; BEFORE.ACF → ACF.

In SCD II properties:

Fast path 1 of 5: select output link as:

Fast path 2 of 5: navigating the key column value between before and after tables

Fast path 3 of 5: selecting source type and source name.

Source type: source name:

NOTE: for every time of running the program we should empty the source name i.e.,

empty.txt, else surrogate key will continue with last stored value.

Fast path 4 of 5: select output in DIM.

Expire-date expression: DateFromJulianDay(JulianDayFromDate(CurrentDate()) - 1)

Fast path 5 of 5: setting the output paths to the FACT data set.

Screen settings recovered from the figures:

Fast path 2 (key/purpose mapping): SKID = surrogate key; AFTER.SNO → SNO = business key; SNAME = Type2; ESDATE = effective date; EEDATE = expire date; ACF = current indicator.

Fast path 3: Source type = Flat file; Source name = D:\study\navs\empty.txt.

Fast path 4 (DIM output derivations, with expire values): next sk() → SKID (surrogate key); AFTER.SNO → SNO (business key); AFTER.SNAME → SNAME (Type2); current date() → ESDATE (effective date); “9999-12-31” → EEDATE (expire date, expired using the expression above); “Y” → ACF (current indicator, expired to “N”).

Step 3: In the next job, i.e., the load job, if we change or edit the source table, then when

loading into Oracle we must set the write method = upsert, which has two options:

- update then insert \\ if the key column value already exists.

- insert then update \\ if the key column value is new.

Here the SCD II result for the below input:

Before table (DIM):
CID  CNAME  SKID  ESDATE    EEDATE    ACF
10   abc    1     01-10-08  99-12-31  Y
20   xyz    2     01-10-08  99-12-31  Y
30   pqr    3     01-10-08  99-12-31  Y

After table (source):
CID  CNAME
10   abc
20   nav
40   mun

Target dimensional table of SCD II:
CID  CNAME  SKID  ESDATE    EEDATE    ACF
10   abc    1     01-10-08  99-12-31  Y
20   xyz    2     01-10-08  09-12-10  N
20   nav    4     10-12-10  99-12-31  Y
30   pqr    3     01-10-08  99-12-31  Y
40   mun    5     01-10-08  99-12-31  Y

Fast path 5 (FACT output derivations, recovered from the figure): BEFORE.SKID → SKID; AFTER.SNO → SNO; AFTER.SNAME → SNAME; BEFORE.ESDATE → ESDATE; BEFORE.EEDATE → EEDATE; BEFORE.ACF → ACF.

DAY 45

Change Capture, Change Apply & Surrogate Key stages

Change Capture Stage:

“It is a processing stage that captures whether a record from a table is a copy, an edit, an

insert, or a delete, marking each record with a code column”.

Simple example of change capture:

Change_capture

Properties of Change Capture:

Change keys

o Key = EID (key column name)

Sort order = ascending order

Change values

o Values =? \\ ENAME

o Values =? \\ ADD

Options

o Change mode = (explicit keys & values / explicit keys, values)

o Drop output for copy = (false/ true) “false – default ”

o Drop output for delete = (false/ true) “false – default”

o Drop output for edit = (false/ true) “false – default”

o Drop output for insert = (false/ true) “false – default”


Copy code = 0

Delete code = 2

Edit code = 3

Insert code = 1

Code column name = <column name>

o Log statistics = (false/ true) “false – default”
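For intuition, a minimal Oracle SQL sketch of the same before/after comparison producing the change codes (the FULL OUTER JOIN and the table names before_data/after_data are illustrative; ADD is written as ADDR to avoid the reserved word):

SELECT COALESCE(b.eid, a.eid) AS eid,
       CASE
         WHEN b.eid IS NULL THEN 1                              -- insert: present only in after data
         WHEN a.eid IS NULL THEN 2                              -- delete: present only in before data
         WHEN a.ename <> b.ename OR a.addr <> b.addr THEN 3     -- edit: value columns changed
         ELSE 0                                                 -- copy: unchanged
       END AS change_code
FROM   before_data b
FULL OUTER JOIN after_data a ON a.eid = b.eid;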

Change Apply Stage:

“It is a processing stage that applies the captured changes to the records of a table”.

Change Apply

Properties of Change Apply:

Change keys

o Key = EID

Sort order = ascending order

Options

o Change mode = explicit key & values

o Check value columns on delete = (false/ true) “true - default”

o Log statistics = false

o Code column name = <column name> \\ this has to be the SAME column name that was used in

Change Capture for the apply operation to work.


SCD II in version 7.5.x2

Design recovered from the figure (Change Capture + transformer + Change Apply):

Change Capture: key = EID; option: explicit keys & values; inputs before.txt and after.txt; output code column c (c = all codes, edited records have c = 3).

Derivations for the inserted/current records: ESDATE = current date(); EEDATE = “9999-12-31”; ACF = “Y”.

Derivations for the records being expired: ESDATE = current date(); EEDATE = if c = 3 then DateFromJulianDay(JulianDayFromDate(CurrentDate()) - 1) else “9999-12-31”; ACF = if (c = 3) then “N” else “Y”.

Change Apply: key = EID; option: explicit keys & values; applied against before.txt.

SURROGATE KEY Stage:

In version 7.5.x2: “the surrogate key stage generates values only on the first compile and

run of the job and does not remember the last generated value; for that reason, in version 7 we have to build another job

to store the last generated value”.

And that job in version 7.5.x2, design:

SF → SK → Copy → DS

Copy → Tail → Peek


In this job, the surrogate key stage generates the system key column values, which are

like primary key values. But it generates them only from the first compile.

By adding a Tail stage we trace the last value and store it into the Peek

stage, i.e., in a buffer.

With that buffered value we can generate the next sequence values, i.e., the surrogate keys, in

version 7.5.x2.

In version 8.0:

“The above problem with version 7 is overcome by the version 8.0 surrogate key stage, by

taking an empty text file (empty.txt), storing the last-value information in that file, and using

that to generate the sequence values”.

before.txt → SK → Data Set

Properties of SK version8:

Option 1: generated output column name = skid

Source name = g:\data\empty.txt

Source type = flat file

Option 2: database type= oracle (DB2/ oracle)

Source name = sq9 (in oracle – create sequence sq9)\\ it is like empty.txt

Password= tiger

User id= scott

Server name= oracle

Source type = database sequence
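For reference, the database sequence named above can be created and tested in Oracle like this (a minimal sketch):

CREATE SEQUENCE sq9 START WITH 1 INCREMENT BY 1;   -- the sequence used as the source name

SELECT sq9.NEXTVAL FROM dual;                      -- each call returns the next surrogate key value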


DAY 46

DataStage Manager

Export:

“Export is used to save a group of jobs to a file at whatever location we want”.

Navigation - “how to export”?

DataStage toolbar

Change selection: Add / Remove / Select all

o Job components to export

Here there are three options are

- Export job designs with executables(where applicable)

- Export job designs without executables

- Export job executables without designs

o Export to file

Where we want to locate the export file.

o Type of export

By two options we can export file

- dsx 7 – bit encoded

- xml

Import:

“It is used to import the .dsx or .xml extensions to a particular project and also to

import some definitions as shown below”.

Options of import are

o DataStage components…


o DataStage components (xml)…

o External function definitions

o Web services function definitions

o Table definitions

o IMS definitions

In IMS two options are,

Database description (DBD)

Program Specification Block (PSB / PCB)

In DataStage components..

o Import from file

Import all / Import selected

Overwrite without query / Perform impact analysis

Generate Report:

“It is used to generate a report for a job or a specific item; it generates the report for a job

instantly”.

For that, go to

File

o Generate report

Report name

Options

Use default style sheet

Use custom style sheet

After finishing the settings:

It is generated in the default location:

“/reportingsendfile/ send file/ tempDir.tmp”


Node Configuration:

Q: To see nodes in a project:

o Go to the DS Director and run the job

Check in the logs

Double-click on the “main program: APT config file” entry

Q: What are Node Components?

1. Node name – logical CPU name.

2. Fast name – server name or system name.

3. Pools – logical area where stages are executed.

4. Resource – memory associated with node.

o Node components are stored permanently on disk at the address below.

“c:\ibm\information server\server\parasets”

o Node components are stored temporarily at the address below.

“c:\ibm\information server\scratch”

Q: Which node configuration handles the run of each and every job, and what is the name of the configuration file?

o Every job runs on the APT configuration named below, which is the default for every job.

o The name of the configuration file is C:\ibm\.........\default.apt

Q: How to run a job on specific configuration file?

o Job properties

Parameters

Add environment variables

o Parallel

Compiler

Config file (Add $APT_CONFIG_FILE)


Q: How to create a new Node configuration File?

o Tools

Configurations

There we see the default.apt file, which holds the single-node information.

o We can create a new node configuration with the NEW option.

o Save the changes after creating new nodes,

by “Save Configuration As”.

o NOTE: it is best to create 8 or 16 nodes in a project, for systems with

2^0, 2^1 (say) CPUs and so on.

Q: If a uniprocessing system with 1 CPU needs a minimum of 1 node to run a job, then how many minimum

nodes does an SMP system with 4 CPUs need?

o Only 1 node.

Advanced Find:

“It is a new feature in version 8”

It is used to find objects of a job, like the list shown below

1. Where used,

2. Dependency,

3. Compared report.

Q: How to run a job in a job?

Navigation for how to run a job in a job

Job properties

o Job control

Select a job


(here the Job Control Language (JCL) script is present)

o Dependencies

Select job (first compile this job before the main

job)

Q: Repository of Advanced Find (i.e., the palette of Advanced Find)?

o Name to find:

o Folder to search:

o Type

o Creation

o Last modification

o Where used

Find objects that use any of the following objects.

Options: Add, remove, remove all

o Dependencies of job

Q: Advance Find of repository through tool bar?

o Cross project compare….

o Compare against

o Export

o Multiple job compile

o Add to palette

o Create copy

o Locate in tree

o Find dependencies

Q: How to find dependency in a job?

o Go to tool bar


Repository

Find dependency: all types of a job

DAY 47

DataStage Director

DS Director maintains:

Schedule

Monitor

Views

o Job view

o Status view

o Log view

Message Handling

Batch jobs

Unlocking

Schedule:

“Schedule means a job can run in specific timings”

To set timings for that,

o Right click on job in the DS Director

Click on “add to schedule…”

And set the timings.

In real time, the job sequence is scheduled by the tools shown below

o Tools to schedule jobs (this happens only in production)

Control M

Cron tab

Autosys


Purge:

“It means cleaning or washing out, i.e., deleting the already created logs”

- We can clear the logs per job.

- The job log has a FILTER option; by right-clicking we can

filter.

Navigation for set the purge.

o Tool bar

Job

- Clear log (choose the option)

o Immediate purge

o Auto purge

Monitor:

“It shows the status of the job, the number of rows executed, the start time, the elapsed

time, the rows/sec, and the percentage of CPU used”

Navigation for job that how to monitor.

o Right click on job

Click monitor

“it shows performance of a job”

Like below figure for a simple job.

Status    No. rows  Started at  Elapsed time  Rows/sec  %CPU
Finished  6         (sys time)  00:00:03      2         9
Finished  6         (sys time)  00:00:03      2         7
Finished  6         (sys time)  00:00:03      2         0

NOTE: Based on this we can check the performance tuning of a stage in a particular

job.

Reasons for warnings:

Default warnings in sequential file are

1. Field “<column name>” has import error and no default value; data : { e i d },

at offset: 0

2. Import warnings at record 0.

3. Import unsuccessful at record 0.

o These three warnings can solve by a simple option in sequential file, i.e.,

First line is column names = set as true (here the default option is false).

Missing record delimiter “\r\n”, saw EOF instead (format mismatch)

When we are working with Lookup and the secondary (reference) stage has duplicates, we get a

warning.

When there is a length mismatch, like source length (10) and target length (20).

When sorting on a different key column than the one used in the Join.

When there is a problem with the secondary stage input in Merge.

Abort a job:

Q: How can we abort a job conditionally?

Conditionally

o When we Run a job

Their we can keep a constraint

Like warnings

o No limit

o Abort job after:

In transformer stage

o Constraint

Otherwise/log


Abort after rows: 5 (if 5 records do not meet the constraint, it simply

aborts the job)

We can keep a constraint like this in a Range Lookup as well.

Message Handling:

“If the warnings cannot be handled otherwise, then we come across message handling”

Navigation for adding a rule to a message handler to handle the warnings.

Job logs

o Right click on a warning

Add rule to message handler

Two options

Suppress from log

Demote to information

Choose any one of above option and add rule.

Batch jobs:

“Executing a set of jobs in an order”

Q: How to create a Batch?

Navigation for creating a batch

DS Director

o Tools

Batch

New (give the name of batch)

Add jobs in created job batch

o Just compile after adding in new batch.

Allow multiple instances:

“The same job can be opened by multiple clients, and each can run the job”

If we do not enable this option, the job opens read-only and cannot be edited.

But a job can be executed by multiple users at the same time in the Director.


Navigation for enable the allow multiple instance

Go to tool bar in DS Designer

o Job properties

Check the box on “allow multiple instances”

Unlock the jobs:

“We can unlock the jobs opened by multiple instances by releasing all the held locks”

Navigation for unlock the job

DS Director

Tool bar

o Job

Cleanup resources

Processes

Show by job

Show all

o Release all

To see the PIDs for jobs globally:

DS Administrator

o General

Environment variables

Parallel

o Reporting

Add (APT_PM_SHOW_PIDS)

Set as (true/false)


DAY 48

Web Console Administrator

Components of administrator:

Administration:

o User & group

Users

User name & password is created here.

And assigning permissions

Session managements:

o Active sessions

For admin

Reports:

o DS

INDIA (server/system name)

View report.

We can create the reports.

Domain Management:

o License

Update the license here

Upload to review

Scheduling management:

“It is to know what each user is doing and which scheduled tasks have run”


o Scheduling views

New

schedule | Run

creation task run | last update

DAY 49

Job Sequencing

Stages of job sequencing:

“It is for executing jobs in a sequence, and we can schedule the job sequence”

Or

“It controls the order of execution of jobs”

A simple job sequence follows the process below.

o Extract

o Transform

o Load

o Master jobs: “they control the order of execution”.

Important stages in job sequencing are

1. Job activity

2. Sequencer

3. Terminator activity

4. Exception handler

5. Notification activity

6. Wait for file activity

Job Activity:

“The Job Activity stage holds a job, and it has 1 input and n outputs”

Job activity


How is the Job Activity dragged onto the design canvas?

- There are two methods:

1. Go to tool bar – view – repository – jobs – just drag the job to the canvas.

2. Go to tool bar – view – palette – job activity – just drag the icon to the canvas.

Simple job:

Student → Sequencer → Student rank

Terminator activity

Properties of Job Activity:

Load a job what you want in active

o Job name: <the job to run>

Execution action: Run, with a “Do not checkpoint run” check box

options - (Run / Reset if required, then run / Validate only / Reset only)

Check Point:

“The sequence is re-started from where it aborted; this is called a check point”

It is a special option that we must enable manually

Go to

o Job properties of DS Designer

Enable check point


Parameter mapping:

“If the job already has some parameters, we can map them to another job if we need”

Triggers:

“It holds the expression type of each output link, i.e., how the link should act”

And some more options in “Expression type”

- Unconditional “N/A (its default)”

- Otherwise “N/A”

- User Status = “<user define message>”

- Custom-(conditional) - “custom”

Terminator Activity:

“It is a stage that handles the error if a job fails”

Properties:

It consists of two options, for the case where any subordinate jobs are still running.

For a job failure:

o Send STOP requests to all running jobs

and wait for all jobs to finish.

For the server going down while the process is running:

o Abort without sending STOP requests;

wait for all jobs to finish first.

Trigger settings for the links in the example job:

Name of output link   Expression type         Expression
OK                    OK-(conditional)        “executed OK”
WAR                   WAR-(conditional)       “execution finished with warnings”
Fail                  Failed-(conditional)    “execution failed”

Sequencer: “it holds multiple inputs and multiple outputs”

It has two options or modes: ALL – it’s for OK & WAR links

ANY – it’s for FAIL (‘n’ number of links)

Exception handler:

“It handles server interrupts”

We do not connect it to any other stage; it sits separately in the sequence job.

A simple job for exception handler:

Exception handler → Notification activity → Terminator activity

Exception handler properties:

“It has only general information”

Notification Activity:

“It sends an acknowledgement in between the process”

Option to fill in the properties:

SMTP Mail server Name:

Senders email address:

Recipients email address:

Email subject:

Attachments:

Email body:

Wait for file Activity: “To place the job in pause”


File name: <browse to the file, e.g. D:\DS\SCD_LOAD>

Two options: wait for file to appear

Wait for file to disappear

Timeout length (hh:mm:ss), or Do not timeout (no time limit for the above options)

DAY 50

Performance tuning w.r.t partition techniques & Stages

Partition techniques: “are two categories”

Key based:

1. Hash

2. Modulus

3. DB2

4. Range

Key less:

1. Same

2. Round Robin

3. Entire

4. Random

In key based partition technique:

DB2 partitioning is used when the target is a DB2 database.

DB2 and Range techniques are used rarely.

Hash partition technique:

o It is selected when there are multiple key columns, i.e., key columns (>1)

and heterogeneous data types (meaning different data types).

o Other than this situation we can select the “modulus partition technique”.

Modulus partition technique:

o It distributes the data based on mod values.

o And the mod formula is MOD(key value, number of nodes)
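For intuition, a small Oracle SQL sketch of how modulus would assign rows on, say, a 4-node configuration (the node count and the EMP table are illustrative):

SELECT empno, MOD(empno, 4) AS node_number   -- key value MOD number of nodes
FROM   emp;

-- e.g. empno 7369 goes to node MOD(7369, 4) = 1.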


NOTE: Modulus gives higher performance than Hash, because of the way it groups the data

based on the mod value.

NOTE: But Modulus can only be selected if there is only one key column and its data type

is integer.

In Key less partition technique:

Same: it never redistributes the data; it carries on with the previous partitioning as it is.

Entire: distributes the same complete set of records to all nodes. The purpose is

to avoid mismatched records in between the operation.

Round Robin: generating stages like Column Generator and so on are associated

with this partition technique.

o It is a better partition technique compared to Random.

Random: the key-less partitioned stages use this technique as their default.

Performance tuning w.r.t Stages:

If sorting has already been performed, then we can use the JOIN stage.

Else the LOOKUP stage is the best.

LOOKUP FILE SET: an option used to remove duplicates in the Lookup stage.

SORT stage: for a complex sort: go to the Sort stage;

else: go to the link sort.

Remove Duplicates: if the data is already sorted – use the Remove Duplicates stage;

for sorting and removing duplicates together – go to the link sort (unique option).

Constraints: when an operation and constraints are needed – go to the Transformer stage;

else, for constraints only – simply go to the FILTER stage.


Conversions: Modify stage or Transformer stage (the Transformer takes more compile time).

DAY 51

Compress, Expand, Generic, Pivot, xml input & output Stages

Compress Stage:

“It is a processing stage that compresses the records into a single compressed format, i.e., into a single

file; in other words it zips the records”.

It supports – “1 input and 1 output”.

Properties:

Stage

o Options

Command= (compress/gzip)

Input

o <do nothing>

Output

o Load the ‘Meta’ data of the source file.

Expand Stage:

“It is a processing stage that extracts the compressed data, i.e., it extracts the zipped data into

unzipped data”.

It supports – “1 input and 1 output”.


Properties:

Stage:

o Options : - command= (uncompress/gunzip)

Input:

o <do nothing>

Output:

o Load the Meta data of the source file for the further process.

Encode Stage:

“It is a processing stage that encodes the records into a single format with the support of a

command-line command”.

It supports – “1-input and 1-output”.

Properties:

Stage

o Options: Command line = (compress/ gzip)

Input

o <do nothing>

Output

o Load the ‘Meta’ data of the source file.

Decode Stage:

“It is processing stage that decodes the encoded data”.

It supports – “1-input and 1-output”.


Properties:

Stage

o Options: command line = (uncompress/gunzip)

Output

o Load the ‘Meta’ data of the source file.

Generic Stage:

“It is a processing stage in which any operator can be called, but it must

fulfil the required properties”.

It supports – “n- inputs and n-outputs, but no rejects”

When compiling the job, the job-related OSH code is generated.

The Generic stage can call any operator of DataStage.

Its purpose is the migration of server jobs to parallel jobs (IBM has the X-Migrator, which converts

about 70%).

And it can call ANY operator here, but the properties must be fulfilled.

Properties:

Stage

o Options

Operator: copy (we can write any stage operator here)

Input

o <do nothing>

Output

o Load the Meta data of the source file.

Pivot Stage:

“It is a processing stage that converts columns into rows in a table (the input columns listed in the output column's derivation are pivoted into rows)”.

It supports – “1-input and 1-output”.


Properties: Stage – <do nothing>

Input: <do nothing>

Output: define the pivoted output column (recovered from the figure): Column name = REC, Derivation = <col_n with comma separated>, SQL Type = varchar, Length = 25.
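For intuition, an Oracle SQL sketch of the same column-to-row pivot on an illustrative table q_sales(empno, q1, q2, q3); this is not the stage's own syntax:

SELECT empno, q1 AS rec FROM q_sales
UNION ALL
SELECT empno, q2 AS rec FROM q_sales
UNION ALL
SELECT empno, q3 AS rec FROM q_sales;   -- each listed input column becomes a separate output row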

XML Stages:

“It is a real-time stage in which the data is stored as single records or aggregated records in the

XML format”.

And XML Stage divided into two types, they are

1. XML Output

2. XML Input

XML Input:

“”.
