DataStage NOTES



DataStage:: FUNDAMENTAL CONCEPTS:: 2010

DAY 1
Introduction to the Phases of DataStage

There are four different phases in DataStage:

Phase I: Data Profiling
This phase is for source system analysis. The analyses are:
1. Column analysis,
2. Primary key analysis,
3. Foreign key analysis,
4. Base line analysis, and
5. Cross domain analysis.
From these analyses we can find out whether the data is dirty or not.

Phase II: Data Quality (also called cleansing)
The steps of this phase are interdependent, i.e., they must be carried out one after another in the order shown below:
Parsing -> Correcting -> Standardizing -> Matching -> Consolidating

Phase III: Data Transformation
The ETL process is done here: the data is transformed while it passes from one stage to another. ETL means E - Extract, T - Transform, L - Load.

Phase IV: Meta Data Management
Meta data means data about data.
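The first three analyses of Phase I can be illustrated with a small sketch. This is plain Python, not DataStage itself, and the table and column names (emp_rows, dept_rows, eno, ename, dno) are assumptions made only for the illustration.

```python
from collections import Counter


def _is_blank(value):
    """Treat None and empty/whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def column_analysis(rows, column):
    """Column analysis: count rows, missing values and distinct values."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if not _is_blank(v)]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
    }


def primary_key_analysis(rows, key):
    """Primary key analysis: a key must be present and unique."""
    counts = Counter(row.get(key) for row in rows)
    duplicates = sorted(k for k, n in counts.items() if k is not None and n > 1)
    return {"missing": counts.get(None, 0), "duplicates": duplicates}


def foreign_key_analysis(child_rows, fk, parent_rows, pk):
    """Foreign key analysis: list child values with no matching parent key."""
    parent_keys = {row[pk] for row in parent_rows}
    orphans = {
        row[fk]
        for row in child_rows
        if row[fk] is not None and row[fk] not in parent_keys
    }
    return sorted(orphans)


# Hypothetical sample data with three deliberate defects:
# a blank name, a repeated eno, and a dno that has no department.
emp_rows = [
    {"eno": 1, "ename": "Smith", "dno": 10},
    {"eno": 2, "ename": "", "dno": 20},
    {"eno": 2, "ename": "Jones", "dno": 99},
]
dept_rows = [{"dno": 10}, {"dno": 20}]

print(column_analysis(emp_rows, "ename"))
print(primary_key_analysis(emp_rows, "eno"))
print(foreign_key_analysis(emp_rows, "dno", dept_rows, "dno"))
```

If any of the three reports is non-empty (missing values, duplicate keys, orphan foreign keys), the source data is dirty in the sense used above and needs the cleansing of Phase II.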

Navs notes


DAY 2
How does the ETL programming tool work?

Pictorial view: Data Base, Flat files -> ETL Process -> Business Interface






Figure: ETL programming process
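The flow in the figure (flat file or database in, ETL process, target out) can be sketched in a few lines. This is a minimal illustration in plain Python under stated assumptions, not how DataStage implements it: the flat file content RAW_TEXT, the column names and the target table emp_dwh are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical flat file content (comma separated, with a header row).
RAW_TEXT = "eno,ename,sal\n1, smith ,1000\n2,jones,abc\n3,ADAMS,1500\n"


def extract(text):
    """Extract: read the flat file into a staging list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(staged_rows):
    """Transform: convert types and tidy names; reject rows that fail."""
    clean, rejected = [], []
    for row in staged_rows:
        try:
            clean.append(
                (int(row["eno"]), row["ename"].strip().title(), float(row["sal"]))
            )
        except ValueError:
            rejected.append(row)
    return clean, rejected


def load(conn, rows):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS emp_dwh "
        "(eno INTEGER PRIMARY KEY, ename TEXT, sal REAL)"
    )
    conn.executemany("INSERT INTO emp_dwh VALUES (?, ?, ?)", rows)
    conn.commit()


staged = extract(RAW_TEXT)
clean_rows, rejected_rows = transform(staged)
conn = sqlite3.connect(":memory:")  # stands in for the warehouse
load(conn, clean_rows)
loaded = conn.execute("SELECT eno, ename, sal FROM emp_dwh ORDER BY eno").fetchall()
print(loaded)
print(rejected_rows)
```

The row whose salary is not numeric is kept aside in rejected_rows instead of being loaded, which is one simple way to keep dirty data out of the target.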


DAY 3 (continued)

Flow (figure): Source -> Extract window -> Staging (permanent data) -> Staging (after transformation) -> Load window -> DWH

Extract window: extracting from .txt (ASCII code); the data is converted to the format DataStage understands (native format).
Load window: loading the data into .txt (ASCII code), or into the DWH, which is a database or resides in the local repository.

ETL is a process that is performed in stages:
OLTP (S) -> sa (T, S) -> sa (T, S) -> DWH (T)

Here, S - source, T - target and sa - stage area.

Home Work (HW): one record for each kindle (multiple records for multiple addresses and dummy records for joint accounts);


DAY 4
ETL Developer Requirements

Q: One record for each kindle (multiple records for multiple addresses and dummy records for joint accounts).
Kindle means the information of customers. (Figure: Customer with Loan, Bank, Credit card and Savings -> kindle.)

A customer who is maintained as one record while holding different addresses is called a single view of the customer, or a single version of the truth.

HW explanation: Here we must read the query very carefully and understand the terminology from the business perspective. Multiple records means multiple customers (records), and multiple addresses means one customer (one account) maintaining multiple addresses across savings / credit card / current account / loan.

ETL Developer Requirements:
Inputs: HLD, LLD -> Developer
Here, HLD - high level document and LLD - low level document.



ETL Developer Requirements are:
1. Understanding.
2. Prepare questions: after reading the document which is given, ask friends / forums / team leads / project leads.
3. Logical design: means paper work.
4. Physical model: using the tool.
5. Unit test.
6. Performance tuning.
7. Peer reviews: it is nothing but releasing versions (version control *.**); here, * means a digit in the range 1-9.
8. Design Turn Over Document (DTD) / Detailed Design Document (DDD) / Technical Design Document (TDD).
9. Backups: means importing and exporting the data required.
10. Job sequencing.


DAY 5
How is a DWH project undertaken?

Flow (figure): HLD Requirements (x), HLD Warehouse (WH) (x), TD (developer, as system engineer), Production (x), Migration (x)

Here, x (cross mark) means the developer is not involved in that part of the flow. The other mark shows where the developer is involved in the project and implements all TEN requirements shown above.

Jobs in %: Developer (70% - 80%), Production (10%), Migration (30%)

Production based companies are like IBM and so on. Migration means support based companies like TCS, Cognizant, Satyam Mahindra and so on.

In Migration: the work covers both server jobs and parallel jobs.
Server jobs: this environment was used up to 2002.
Parallel jobs: this environment has been used after 2002 and up to now.

IBM launched X-Migrator, which converts server jobs to parallel jobs. It converts up to 70% automatically; the remaining 30% is done manually.


A project is divided into categories with respect to its period (the time the project takes), as shown below.

Categories      Period (in months and years)
Simple          6 m
Medium          6 m - 1 y
Complex         1 - 1 1/2 y
Too complex     1 1/2 y - 5 y and so on (it may take many years depending on the project)

5.1. Project Process:

HLD (high level documents)
  Requirements: SRS, BRD (here, business analyst / subject matter expert)
  Warehouse: Architecture, Schema (structure), Dimensions and Facts (target tables)

LLD (low level docs)
  TD: Mapping Docs (specifications, specs), Test Specs, Naming Docs


5.2. Mapping Document:
For example, suppose the requirements of a query are 1 - experience of the employee, 2 - dname, and 3 - first name, middle name, last name. The mapping for this is laid out as follows (column groups recovered from the figure):

Common fields:    Load order
Target:           Entity (Exp_tbl); Attributes (Eno, FName, MName, LName, Exp_emp, DName)
Source:           Tables (Emp, Dept); Fields (Ename, Eno, Hire date, Dno, Dname)
Transformation:   Current Date - Hire date (CD - HD)
Constant
Keys:             Pk, Fk, Sk
Error Handling:   F C D C

Funneling (S1, S2): get data from multiple tables. C is combining, which is either horizontal combining or vertical combining. As per the example, horizontal combination is used here.




Here, HC means Horizontal Combination, which is used to combine primary rows with secondary rows (here, Emp rows with Dept rows).

As a developer you will get a maximum of 30 target fields.
As a developer you will get a maximum of 100 source fields.

Look Up means cross verification from the primary table.

After the document (figure labels): source is a file or a database (F / dB), with .txt (fwf, cv, vl, sc, s & t, h & t) as file types and the types of dB; S1, T, HC, TRG.

Format of Mapping Document.
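The mapping example above (Emp and Dept as sources, experience computed as Current Date - Hire date, and the name split into first, middle and last) can be sketched as follows. This is an illustration in plain Python, not a DataStage job. The sample rows are invented, and two choices are assumptions of this sketch rather than something the notes state: experience is counted in whole years, and a name is split on spaces.

```python
from datetime import date

# Hypothetical source rows for the Emp and Dept tables.
emp = [
    {"eno": 1, "ename": "John Ronald Smith", "hire_date": date(2005, 6, 15), "dno": 10},
    {"eno": 2, "ename": "Mary Jones", "hire_date": date(2009, 1, 1), "dno": 20},
]
dept = [{"dno": 10, "dname": "Sales"}, {"dno": 20, "dname": "HR"}]


def split_name(ename):
    """Split a full name into (first, middle, last); missing parts are ''."""
    parts = ename.split()
    first = parts[0] if parts else ""
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1])
    return first, middle, last


def experience_years(hire_date, current_date):
    """Transformation CD - HD, expressed in completed years (an assumption)."""
    years = current_date.year - hire_date.year
    if (current_date.month, current_date.day) < (hire_date.month, hire_date.day):
        years -= 1
    return years


def build_target(emp_rows, dept_rows, current_date):
    """Horizontal combining: look up Dname by Dno and widen each Emp row."""
    dname_by_dno = {d["dno"]: d["dname"] for d in dept_rows}
    target = []
    for e in emp_rows:
        first, middle, last = split_name(e["ename"])
        target.append(
            {
                "eno": e["eno"],
                "fname": first,
                "mname": middle,
                "lname": last,
                "exp_emp": experience_years(e["hire_date"], current_date),
                "dname": dname_by_dno.get(e["dno"]),  # None if lookup fails
            }
        )
    return target


target_rows = build_target(emp, dept, date(2010, 6, 14))
for row in target_rows:
    print(row)
```

The current date is passed in as a parameter so that the result is repeatable. A row whose Dno has no match in Dept gets dname None here; in a real mapping document the Error Handling column would say what to do with such rows.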

DAY 6
Architecture of DWH


For example:

Reliance Group has the branches Reliance Comm., Reliance Power and Reliance Fresh, and every branch has its own manager. The Top Level mgr (TLM) needs the details below as input: sales, customer, employee, period, order.

Explanation of the above example: Reliance Group has several branches, and every branch has one manager. Above all these managers there is one Top Level Manager (TLM). The TLM needs the details in the list shown above for analysis.

How the ETL process is done for the above example is shown below.

Figure labels: RC-mgr, ERP, ETL PROCESS, DWH, mini WH / Data mart (bottom level), Dependent Data Mart, Independent Data Mart (Reliance Fresh, taking one from the group directly).

Dependent Data Mart: the ETL process takes all the managers' information (or databases) and keeps it in the warehouse. The data then moves from the warehouse to the data marts, so each data mart depends on the warehouse. Here a data mart is also called the bottom level or mini WH, as shown in blue in the above figure, i.e., it holds the data of an individual manager (like RF, RC, RP and so on). Because the data mart depends on the WH, it is called a dependent data mart.

Independent Data Mart: the data mart of a single, individual manager is fed directly by the ETL process without any help from the warehouse. That is why it is called an independent data mart.

6.1 Two level approaches:
The two-layer architecture applies to both approaches:
1. Top-Bottom level approach, and
2. Bottom-Top level approach.

6.1.1. Top-Bottom level approach:
The flow starts from the top: as per the example, the data of the Reliance Group and its individual managers goes through the ETL process into the data warehouse (top level), and from there into all the separate data marts (bottom level).

Figure: Top-Bottom level approach. Reliance Group (R Comm., R Power, R Fresh) -> Warehouse (top level) -> Data Marts (bottom level); the figure also marks Layer I and Layer II.

The above figure defines the Top-Bottom level approach, which was introduced by W. H. Inmon. Here, the warehouse is the top level and all data marts are the bottom level, as shown in the above figure.


6.1.2. Bottom-Top level approach:
Here the ETL process loads the data marts (DM) directly, and the data is then put in the warehouse for reference purposes, i.e., the DMs are stored in the Data WareHouse (DWH).

Figure: Bottom-Top level approach. Reliance Group (R Comm., R Power, R Fresh) -> ETL PROCESS -> DM (bottom level, Layer I) -> DWH (top level)

The Bottom-Top level approach was introduced by R. Kimball. Here, one data mart (DM) contains information like customer, products, employees, location and so on.

Both the Top-Bottom level approach and the Bottom-Top level approach come under the two-layer architecture.
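The difference between the two approaches is only the order in which the warehouse and the data marts are filled. The sketch below shows this with plain Python lists and dictionaries. It is an illustration under assumptions, not a warehouse design: the branch names come from the example above, while the columns (customer, sales) and the sample values are invented.

```python
# Hypothetical source rows; one row per sale, tagged with its branch.
SOURCE_ROWS = [
    {"branch": "R Comm", "customer": "C1", "sales": 100},
    {"branch": "R Power", "customer": "C2", "sales": 250},
    {"branch": "R Fresh", "customer": "C3", "sales": 75},
    {"branch": "R Fresh", "customer": "C4", "sales": 40},
]


def split_by_branch(rows):
    """Group rows into one data mart (a list) per branch."""
    marts = {}
    for row in rows:
        marts.setdefault(row["branch"], []).append(row)
    return marts


def top_bottom(source_rows):
    """Top-Bottom: load the warehouse first, then derive dependent marts."""
    warehouse = list(source_rows)        # top level
    marts = split_by_branch(warehouse)   # bottom level, fed by the warehouse
    return warehouse, marts


def bottom_top(source_rows):
    """Bottom-Top: load the marts first, then store them in the warehouse."""
    marts = split_by_branch(source_rows)                          # bottom level
    warehouse = [row for mart in marts.values() for row in mart]  # top level
    return marts, warehouse


td_warehouse, td_marts = top_bottom(SOURCE_ROWS)
bu_marts, bu_warehouse = bottom_top(SOURCE_ROWS)

print(sorted(td_marts))
print(len(td_warehouse), len(bu_warehouse))
```

In this small sketch both approaches end with the same rows in the warehouse and the same marts. What differs is the dependency: in top_bottom the marts are built from the warehouse (dependent data marts), while in bottom_top each mart is built straight from the source.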

ETL Tools:

GUI (graphical user interface): these tools extract the data from heterogeneous sources.

Programming (coding): ETL programming tools are Teradata / Oracle / DB2 and so on.

6.2. Four layers of DWH Architecture:

6.2.1. Layer I:

Figure (Layer I): first case, Source -> DWH; second case, Source -> DM, DM, DM.

In this layer the data is sent directly: in the first case from the source to the Data WareHouse (DWH), and in the second case from the source to a group of Data Marts (DM).

6.2.2. Layer II:






In this layer the data follow f