Datastage Experiments for important stages
GETTING STARTED WITH DATASTAGE
Opening Virtual machine:
1) Run the Datastage shortcut.
2) Go to the Action menu in the menu bar and select "Ctrl+Alt+Delete".
3) Give the login password as "P@ssw0rd". Press "OK".
4) Wait about 5 minutes for all the services to load.
NOTE: Avoid moving the mouse cursor frequently and do not open Internet Explorer, as both make the services slower.
To check whether all the services are running or not:
1) Go to Run.
2) Type "services.msc".
3) Press Enter.
4) Check whether the "IBM WebSphere" service is started or not.
To Cleanup temporary Files
1) Run Cleanup.exe.
2) Click the Cleanup button.
3) Wait for some time until all the temporary files get cleared.
4) Close.
Opening the Designer client (InfoSphere DataStage and QualityStage):
1) Run "Designer client.exe".
2) Enter the username and password, then click OK.
Exercise-1 : Loading data from oltpsrc file to a dwhtarget file
Step 1:
File->New->Parallel Job.
Create a project in the repository by right clicking on dtstage1 and creating a new folder.
Name that folder.
Goto file->sequential file on palette
Drag and Drop the sequential file option twice to the work area.
Goto general->link on palette
Connect two sequential files by using link in work area(like drawing arrow in paint).
sequential_file(oltp) -> sequential_file(DWH)
This copies the contents of the oltp flat file to the DWH flat file.
Step 2:
Create a txt file named “src.txt”.
Type some records with the structure (eno,ename,sal)
Rename Sequential_File_0 and Sequential_File_1 as 'oltpsrc' and 'dwhtarget' respectively.
Step 3:
Setting oltpsrc properties
Double click ‘oltpsrc’ file on the work area
Set the properties as follows
File: Location of the source file
First Line is Column Names: Set True if the first line of the src file has column names, else False
Set Format as follows:
Final Delimiter = end (represents end of file)
Delimiter = the delimiter you have used in the src file for separating each field
Quote = single | double | none, as per the usage in the src file fields.
Define Column name and datatype
Step 4: Setting ‘dwhtarget’ file properties
File=path of target file
File Update Mode=Overwrite (overwrites the target file if it exists) | Create (creates a new file) | Append (appends to the target file)
First Line is Column Names=True (treats the first line of your src file as column names and skips it) | False (loads the first line to the target file)
Step 5: Save Your Project:
Goto file-> save as
Item name: Project name
Folder Path: Path of your Project Folder
Step 6: Compiling Project:
Click the compile button on the toolbar.
Step 7: Run the Project:
Click the run button on the toolbar.
Warnings
No limit: Runs the process even if n warnings are present
Abort job after: Aborts the process after encountering the specified no. of warnings.
Note:
Before clicking Run close your src file and target file
Link Color status during run time.
Black-process not started
Blue-process is going on
Red- Process aborted
Green-Process completed successfully
Step 8: Run Director:
Now Goto->Tools->Run Director
It maintains run logs for all the projects.
To view logs: select the desired project and goto ->view -> log
Exercise 2: Pump the data from source to target with some constraints using ‘FILTER’ Stage
Filter restricts the rows of a file based on conditions set against one or more fields in each row.
Eg: Select * from emp where sal>10000;
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop three sequential files into the work area.
Step 4: Drag and Drop a Filter from processing option on palette into work area
Step 5: Create a source file named “src.txt”
Step 6: Set Sequential_File_0 properties same as in exercise 1.
Step 7: Set Filter properties as follows.
Setting Constraints:
Predicates:
1st Where clause condition for the link DSLink12 (sal<=10000)
Sequential_File_1 will have the rows that satisfy the above constraint.
2nd where clause condition for the link DSLink11 (sal>10000 and sal<=20000)
Sequential_File_2 will have the rows that satisfy the above constraint.
Options:
Output Rejects=True for DSLink10, then right click on DSLink10 and select Convert to Stream.
Keep Output Rejects=False if no reject link is needed.
Now Sequential_File_3 will have the rows that are rejected by the above two constraints.
Output Settings:
Mapping Columns:
1. Select the output link from the combo box.
2. Drag and Drop the columns from the left side to the right side.
3. Repeat the above steps for all the output links.
Step 8: Set Sequential_File_1, Sequential_File_2 and Sequential_File_3 properties same as in exercise 1.
Step 9: Compile.
Step 10: Run the project and observe the output.
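The Filter stage's row routing can be sketched as plain Python, assuming rows are dicts with an integer sal field (the DSLink names and the two predicates are taken from the exercise above):

```python
def filter_stage(rows):
    """Sketch of the Filter stage with two where clauses and a reject link.

    Returns (link12, link11, rejects):
    link12  -> rows with sal <= 10000           (sequential_file_1)
    link11  -> rows with 10000 < sal <= 20000   (sequential_file_2)
    rejects -> rows matching neither predicate  (sequential_file_3)
    """
    link12, link11, rejects = [], [], []
    for row in rows:
        matched = False
        if row["sal"] <= 10000:              # 1st where clause (DSLink12)
            link12.append(row)
            matched = True
        if 10000 < row["sal"] <= 20000:      # 2nd where clause (DSLink11)
            link11.append(row)
            matched = True
        if not matched:                      # Output Rejects=True (DSLink10)
            rejects.append(row)
    return link12, link11, rejects
```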
Exercise-3: Load the target file from multiple src files using ‘Funnel’ stage
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop four sequential files into the work area and rename them as src1, src2, src3 and target respectively.
Step 4: Drag and Drop a funnel from processing option on the palette into the work area.
Step 5: Set the src1, src2, src3 properties same as in exercise 1.
Step 6: Set Funnel Properties as follows
Properties settings
Funnel Type=Continuous Funnel.
Target file is loaded with the rows of all the src files in the order in which they arrive at the funnel.
Funnel Type=Sequence Funnel.
Target file is loaded with all the src files in the order in which the src files are placed in the work area, i.e., from top to bottom.
Funnel Type=Sort Funnel.
Target file is loaded with all the src files in sorted order based on the sort key value and sort order.
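The three funnel types can be sketched as follows. This is an approximation: a real Continuous Funnel gives no ordering guarantee, so the sketch models it as a simple round-robin over the source links.

```python
def funnel(sources, funnel_type="continuous", key=None):
    """Sketch of the Funnel stage.

    sources: list of row lists, in the order the links are placed (top to bottom).
    sequence:   source files concatenated top to bottom.
    sort:       merged rows sorted ascending on the sort key.
    continuous: rows interleaved as they arrive (round-robin approximation).
    """
    if funnel_type == "sequence":
        return [row for src in sources for row in src]
    if funnel_type == "sort":
        merged = [row for src in sources for row in src]
        return sorted(merged, key=lambda r: r[key])
    out, iters = [], [iter(s) for s in sources]
    while iters:
        for it in list(iters):
            try:
                out.append(next(it))
            except StopIteration:
                iters.remove(it)
    return out
```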
Output settings:
Step 7: set target file properties same as in exercise 1.
Step 8: Compile
Step 9: Run the project
Output:
Source files:
Target File on
1. Funnel Type=Continuous Funnel
2. Funnel Type=Sequence Funnel
3. Funnel Type=Sort Funnel with key=ename and sort order=Ascending.
Exercise- 4: Pump the target file from the source file in the sorted order using ‘SORT’ stage
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop two sequential files into the work area.
Step 4: Drag and Drop sort from processing option on the palette into the work area.
Step 5: set sequential_file_0 properties same as in exercise 1.
Step 6: set sort properties as follows
Output setting:
Step 7: set sequential_file_1 properties same as in exercise 1.
Step 8: compile and run the project.
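The Sort stage boils down to ordering rows on a key column with an ascending or descending sort order, as this small sketch shows (rows are assumed to be dicts):

```python
def sort_stage(rows, key, order="ascending"):
    """Sketch of the Sort stage: sort rows on a single key column."""
    return sorted(rows, key=lambda r: r[key], reverse=(order == "descending"))
```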
OUTPUT:
Source file:
Target File:
Sort can also be performed on the link directed from a Funnel.
The above case won't work, because the Funnel link should be directed directly to the Sort stage.
Exercise -5: Load the target file after removing duplicate rows from the src file using ‘Remove Duplicates’ stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop two sequential files into the work area.
Step 4: Drag and Drop ‘Remove Duplicates’ from processing option on the palette into the work area.
Step 5: set sequential_file_0 properties same as in exercise 1.
Step 6: set ‘remove duplicates’ properties as follows.
Key=eno (Key column for the operation)
Duplicate to Retain=Last.
Row Duplicates:
Eno, ename, salary
101,gokul,10000
102,gopal,20000
101,gokul,15000
101,gokul,25000
103,kumar,20000
The record (101,gokul) appears three times with different salary values. We need the latest updated row, so use the 'Remove Duplicates' stage, as it removes all the duplicate rows while retaining the last (or first) row.
The duplicate row search is made using the key, 'eno' in our case.
We can customize the duplicate to be retained by setting Duplicate to Retain=Last | First.
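The retain-first/retain-last behaviour described above can be sketched with a dict keyed on the duplicate-search column:

```python
def remove_duplicates(rows, key, retain="last"):
    """Sketch of the Remove Duplicates stage: keep one row per key value,
    either the first or the last occurrence (Duplicate to Retain)."""
    kept = {}
    for row in rows:
        # retain="last" always overwrites; retain="first" keeps the first hit
        if retain == "last" or row[key] not in kept:
            kept[row[key]] = row
    return list(kept.values())
```

Running it on the sample records above keeps (101,gokul,25000) when Duplicate to Retain=Last.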
Output Settings:
Step 7: set sequential_file_1 properties same as in exercise 1.
Step 8: compile and run the project.
OUTPUT FOR THE ABOVE SETTINGS:
SOURCE FILE:
TARGET FILE:
Exercise 6: Join the rows in two src files and load them into the target using ‘JOIN’ stage
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop three sequential files into the work area.
Step 4: Drag and Drop ‘Join’ from processing option on the palette into the work area.
Step 5: Set sequential_file_0 and sequential_file_1 properties same as in exercise 1 but select a key in both files with which the join has to be made. In our example we have selected the key as ‘eno’.
Step 6: set join properties as follows.
Key= eno
Join Type= Inner|Left outer|Right outer|Full Outer
Output Settings:
Note:
While Joining keep your small table as left table and big table as right table for better performance.
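The four join types of the Join stage can be sketched on a single key column. Rows are assumed to be dicts sharing the key column name; as the note above advises, the smaller table goes on the left:

```python
def join_stage(left, right, key, join_type="inner"):
    """Sketch of the Join stage: Inner | Left outer | Right outer | Full outer."""
    right_index = {}
    for row in right:                         # index the right table on the key
        right_index.setdefault(row[key], []).append(row)
    out, matched_right = [], set()
    for lrow in left:
        matches = right_index.get(lrow[key], [])
        for rrow in matches:                  # matched pair: merge the columns
            out.append({**lrow, **rrow})
            matched_right.add(id(rrow))
        if not matches and join_type in ("left outer", "full outer"):
            out.append(dict(lrow))            # unmatched left row survives
    if join_type in ("right outer", "full outer"):
        for rrow in right:
            if id(rrow) not in matched_right:
                out.append(dict(rrow))        # unmatched right row survives
    return out
```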
Step 7: set sequential_file_2 properties as same as in exercise 1.
Step 8: Compile and Run the project.
OUTPUT:
Source File 1 and 2:
Target file after Inner Join:
Target file after Left outer join:
Target file after Right outer join:
Target file after full outer join:
Exercise -7: Generate n number of dummy records under a defined table or structure using ‘Row Generator’ stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop a sequential file into the work area.
Step 4: Drag and Drop Row Generator from Development/Debug option on the palette into the work area.
Step 5: Set Row_Generator properties as follows
Output Settings:
Specifying the length and scale values is important here.
Sal=12000.00 (length=7 and scale=2)//generates all the values of decimal domain column with same no. of digits.
Length value for char is fixed length.(all the values of char domain column have fixed no. of characters)
Length value for integer and varchar is their upper limit i.e., the max no. of digits for integer and the max no. of characters for varchar.
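The length/scale rules above can be illustrated with a small generator. The (eno, ename, sal) structure and the specific lengths (integer of up to 5 digits, char(5), decimal(7,2)) are illustrative assumptions, not settings from the exercise:

```python
import random
import string

def row_generator(n, seed=0):
    """Sketch of the Row Generator stage for an (eno, ename, sal) structure.

    eno:   integer, length 5 = upper limit on the number of digits
    ename: char(5), fixed length
    sal:   decimal with length=7, scale=2 (up to 5 digits before the point)
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        eno = rng.randint(0, 99999)
        ename = "".join(rng.choice(string.ascii_lowercase) for _ in range(5))
        sal = round(rng.uniform(0, 99999), 2)
        rows.append({"eno": eno, "ename": ename, "sal": sal})
    return rows
```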
Step 6: Set sequential_file_1 properties as same as in exercise 1.
Output:
Target File:
Exercise 8: Load data from a flat src file to a target oracle database using ‘oracle connector’ stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop a sequential file into the work area.
Step 4: Drag and Drop oracle connector from Database option on the palette into the work area.
Step 5: Set sequential_file_1 properties as same as in exercise 1.
Step 6: Starting Oracle services.
Start the OracleJobSchedulerORCL, OracleOraDb11g_home1TNSListener and OracleServiceORCL services.
Step 7: set oracle_connector properties as follows.
Check oracle connectivity by pressing the Test button under connection.
You can also View Data that has been imported using View Data button under usage.
Output Settings:
Specifying the length and scale values is important here.
Sal=12000.00 (length=7 and scale=2)//generates all the values of decimal domain column with same no. of digits.
Length value for char is fixed length.(all the values of char domain column have fixed no. of characters)
Length value for integer and varchar is their upper limit i.e., the max no. of digits for integer and the max no. of characters for varchar.
Step 8: Compile and run the project.
Output:
Source File:
Target:
Username: Scott/tiger@orcl
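What the Oracle connector does in this job is, in essence, create/insert into a target table from the flat-file rows. As a rough stand-in for illustration only (sqlite3 here replaces the Oracle connection, and the table name emp is an assumption; the real job connects as scott/tiger@orcl):

```python
import sqlite3

def load_to_db(rows, table="emp"):
    """Stand-in for the Oracle connector write mode: create the target
    table and insert each (eno, ename, sal) record from the source file."""
    con = sqlite3.connect(":memory:")
    con.execute(f"CREATE TABLE {table} (eno INTEGER, ename TEXT, sal REAL)")
    con.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    con.commit()
    return con
```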
Exercise 9: Load data from an oracle database to a target flat file using ‘oracle connector’ stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop a sequential file into the work area.
Step 4: Drag and Drop oracle connector from Database option on the palette into the work area.
Step 5: Starting Oracle services.
Start the OracleJobSchedulerORCL, OracleOraDb11g_home1TNSListener and OracleServiceORCL services.
Step 6: Import a table. (This takes a snapshot of the original table, and the snapshot is used for further processing with better performance, since reading every record from the oracle database over an oracle connection involves more overhead.)
Since the imported table is only a snapshot, you have to re-import it each time the table changes.
Any changes you make in the table must be committed before importing it into datastage, especially in oracle.
Username : scott
Password : tiger
Step 7: Set the oracle_connector properties as follows.
Column Settings:
Load the columns from the ‘employee’ table as follows
a. Click the Load button.
b. Select the table from the 'table definitions' wizard.
c. Select the desired columns from the 'select columns' wizard.
Step 8: set sequential_file_0 properties as same as in exercise 1.
Step 9: compile and run the project.
OUTPUT:
Target File:
Exercise 10: Load data from teradata database to oracle database using ‘Teradata connector and Oracle Connector’ stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop Teradata Connector and Oracle Connector from the Database option on the palette into the work area.
Step 4: Start teradata services.
Step 5: Import a teradata database.
Username: tduser
Password: tduser
Step 6: Set Teradata_Connector properties as follows.
Check teradata connectivity by pressing the Test button under connection.
You can also View Data that has been imported using View Data button under usage.
Column Settings:
Procedure is same as in exercise 9.
Specifying the length and scale values is important here. (from any db to db (or) from file to any db)
Sal=12000.00 (length=7 and scale=2)//generates all the values of decimal domain column with same no. of digits.
Length value for char is fixed length.(all the values of char domain column have fixed no. of characters)
Length value for integer and varchar is their upper limit i.e., the max no. of digits for integer and the max no. of characters for varchar.
Step 7: Set Oracle Connector properties as same as in exercise 8.
Step 8: Compile and run the project.
OUTPUT:
Target:
Username: Scott/tiger@orcl
Exercise 11: Load data from oracle database to teradata database using ‘Teradata connector and Oracle Connector’ stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop Teradata Connector and Oracle Connector from the Database option on the palette into the work area.
Step 4: Start oracle and teradata services.
Step 5. Import an oracle table.
Step 6: Set Oracle_Connector properties as same as in exercise 9.
Step 7: Set Teradata_Connector properties as follows.
Step 8: Compile and run the project.
Output:
At Teradata
Exercise 12: Load data from a Teradata database to a target flat file using ‘Teradata connector’ stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop a sequential file into the work area.
Step 4: Drag and Drop teradata connector from Database option on the palette into the work area.
Step 5: Start teradata services.
Step 6: Import a teradata table.
Step 7: Set Teradata_Connector properties as same as in exercise 10.
Step 8: Set Sequential_File properties as same as in exercise 1.
Step 9: Compile and run the project.
OUTPUT:
Source table and Target Flat file.
Exercise 13: Load data from a source flat file to a Teradata database using ‘Teradata connector’ stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop a sequential file into the work area.
Step 4: Drag and Drop teradata connector from Database option on the palette into the work area.
Step 5: Start teradata services.
Step 6: Set Sequential_File properties as same as in exercise 1.
Step 7: Set Teradata_Connector properties as same as in exercise 10.
Step 8: Compile and run the project.
OUTPUT:
Source flat file and target teradata table.
Exercise 14: Load data from teradata database to a teradata database using ‘Teradata connector’stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop two Teradata Connectors from the Database option on the palette into the work area.
Step 4: Start teradata services.
Step 5. Import a teradata table.
Step 6: Set teradata_connector_0 properties as same as in exercise 10.
Step 7: Set teradata_connector_1 properties as same as in exercise 11.
Step 8: Compile and Run the project.
OUTPUT:
Source new_emp teradata table and Target cpy_emp teradata table.
Exercise 15: Load data from an oracle database to an oracle database using ‘oracle connector’ stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop two oracle Connectors from the Database option on the palette into the work area.
Step 4: Start oracle services.
Step 5. Import an oracle table.
Step 6: Set oracle_connector_0 properties as same as in exercise 11.
Step 7: Set oracle_connector_1 properties as same as in exercise 10.
Step 8: Compile and Run the project.
OUTPUT:
Source oracle table ‘dept’:
Target Oracle table ‘cpy_dept’:
Exercise 16: Perform some aggregations on the src flat file and load them into a target flat file using ‘Aggregator’ stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop two sequential files into the work area.
Step 4: Drag and Drop Aggregator from processing option on the palette into the work area.
Step 5: Set sequential_file_0 properties as same as in exercise 1.
Step 6: Set Aggregator properties as follows.
Select deptid, max(sal) “Max_Sal” from emp group by deptid;
Group = deptid (group by column)
Aggregation Type=Calculation|Count Rows | Re-calculation
Column For Calculation=sal (column on which the aggregation has to be performed)
Maximum Value Output Column=Max_Sal (Alias name )
Column Mapping:
Column Settings
By default the data type of every aggregation output column is Double, so reset the type as needed.
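The Aggregator settings above are equivalent to the sample query 'select deptid, max(sal) "Max_Sal" from emp group by deptid', which can be sketched as:

```python
def aggregator(rows, group, column, output_column):
    """Sketch of the Aggregator stage with Aggregation Type=Calculation
    and a Maximum Value Output Column."""
    groups = {}
    for row in rows:
        k = row[group]
        groups[k] = max(groups.get(k, float("-inf")), row[column])
    # the stage emits the aggregation result as Double by default
    return [{group: k, output_column: float(v)} for k, v in groups.items()]
```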
Step 7: Set Sequential_File_1 properties as same as in exercise 1.
Step 8: Compile and Run the project.
OUTPUT:
Source File
Target File on ‘Select deptid, max(sal) “Max_Sal” from emp group by deptid;’
Exercise 17: Load from src flat file to a target flat file with some derived columns using ‘Transformer’ stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop two sequential files into the work area.
Step 4: Drag and Drop ‘Transformer’ from processing option on the palette into the work area.
Step 5: set sequential_file_0 properties as same as in exercise 1.
Step 6: Set transformer properties as follows.
Drag and Drop the columns on which derivations have to be performed from left to right (Column Mapping).
In the right hand side right click on each column and select function->any desired function, then the function prototype will be loaded in the column.
Edit the column as per the prototype (for ex: on selecting UpCase, UpCase(%string%) will be loaded. Edit the parameter value as DSLink5.ename)
Deriving ‘Grade’ column from the sal column using If Else with the same procedure as above.
At the right bottom side rename the columns if you want (Here we are renaming ‘ename’ as ‘Emp_Name’, ‘sal’ as ‘Annual_salary’ ). Changes will get updated in DSLink6 table.
Be careful when setting the datatype for each derived column.
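The Transformer derivations above can be sketched in Python: UpCase(ename) becomes Emp_Name, sal is renamed Annual_salary, and an If Else derives Grade. The grade boundaries here are illustrative assumptions, not values from the exercise:

```python
def transformer(rows):
    """Sketch of the Transformer stage: derived and renamed output columns."""
    out = []
    for r in rows:
        out.append({
            "eno": r["eno"],
            "Emp_Name": r["ename"].upper(),   # UpCase(DSLink5.ename)
            "Annual_salary": r["sal"],        # 'sal' renamed
            # If Else derivation (assumed thresholds for illustration)
            "Grade": "A" if r["sal"] > 20000 else "B" if r["sal"] > 10000 else "C",
        })
    return out
```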
Step 7: set sequential_file_1 properties as same as in exercise 1.
Step 8: Compile and run the project.
OUTPUT:
Source File:
Target File:
Exercise 18: Compare two tables (DWH and OLTP) and Capture the changes in OLTP table with respect to DWH table then load the changes to a flat file using ‘Change Capture’ stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop oracle connectors from database option on the palette into the work area.
Step 4: Drag and Drop a ‘sequential file’ from file option on the palette into the work area.
Step 5: Drag and Drop ‘change capture’ from processing option on the palette into the work area.
Step 6: Create two tables student and dupstudent with the structure (rollno,name,age,deptid) and insert same records in student and dupstudent. Make some changes in the dupstudent table (new insert,delete,update).
Step 7: set oracle connector properties as same as in exercise 9.
Step 8: set change capture properties as follows.
Setting Properties
Change key=rollno (a column that never changes, on which the comparison between the tables is made).
Change Value=Age, Deptid, Name (columns whose values change over time)
Drop Output For Copy, delete, edit, insert=False
If the two tables contain an identical record, don't drop it; forward that record to the flat file.
If a record in student is not present in dupstudent (deleted), forward that record to the flat file.
Similar actions occur on edit (update) and insert.
Column Settings:
The change capture generates a column called change_code by default which indicates the following.
Copy-0
Insert-1
Update-2
Delete-3
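The comparison logic can be sketched as follows, using the change_code values listed above (copy=0, insert=1, update=2, delete=3) and rollno as the change key:

```python
def change_capture(before, after, key):
    """Sketch of the Change Capture stage: compare the before (DWH) and
    after (OLTP) tables on the change key and tag each row with change_code."""
    before_by_key = {r[key]: r for r in before}
    after_by_key = {r[key]: r for r in after}
    out = []
    for k, row in after_by_key.items():
        if k not in before_by_key:
            out.append({**row, "change_code": 1})      # insert
        elif row != before_by_key[k]:
            out.append({**row, "change_code": 2})      # update (edit)
        else:
            out.append({**row, "change_code": 0})      # copy
    for k, row in before_by_key.items():
        if k not in after_by_key:
            out.append({**row, "change_code": 3})      # delete
    return out
```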
Column Mappings:
Step 9: set sequential_file properties as same as In exercise 1.
Step 10: compile and run the project.
OUTPUT:
Source tables:
Target File:
Exercise 19: Look up for the existence of records in DWH table with respect to OLTP table and join the records using ‘Look Up’ Stage
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop three sequential files into the work area.
Step 4: Drag and Drop ‘Look Up’ from processing option on the palette into the work area.
Step 5: Set OLTPSRC and DWHSRC file properties as same as in exercise 1.
NOTE: The oltp file should always be at the top and the dwh file at the bottom in the work area; otherwise an error occurs on running the project.
Step 6: set look up properties as follows.
Create a link with ‘dno’ from oltp_link to dwh_link which act as a key for comparison.
Drag and Drop the desired columns from oltp_link and dwh_link to target_link.
Step 7: set target file properties as same as in exercise 1.
Step 8: Compile and run the project.
OUTPUT:
Source Files (DWH and OLTP):
Result: Execution success
Target File:
Inference:
If look up finds all the related records in the DWH table with respect to the OLTP table using a key (here dno), it joins those records, and the join type is ‘natural join with using clause’.
So lookup can act as join with the above restriction.
Source files (DWH and OLTP):
Result:
Inference:
Since a record with the key (dno=6) in the oltp table does not exist in the dwh table, an error occurred.
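Both runs above can be sketched as follows: each OLTP row looks up the DWH row with the same dno and joins with it, and a missing key fails the job (the drop/reject alternative shown in the sketch is an assumption about how lookup failure could be configured, not a setting from the exercise):

```python
def lookup_stage(oltp_rows, dwh_rows, key, on_failure="fail"):
    """Sketch of the Look Up stage: join each OLTP row with the DWH row
    that has the same key; a missing key aborts the job by default."""
    dwh_by_key = {r[key]: r for r in dwh_rows}
    out = []
    for row in oltp_rows:
        match = dwh_by_key.get(row[key])
        if match is None:
            if on_failure == "fail":          # the behaviour seen in run 2
                raise LookupError(f"no DWH record for {key}={row[key]}")
            continue                          # drop/reject the row instead
        out.append({**match, **row})
    return out
```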
Exercise 20: Maintain logs of changes made in DWH table with respect to OLTP table using ‘SLOWLY CHANGING DIMENSION’ stage.
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop three oracle connectors from database option on the palette into the work area.
Step 4: Drag and Drop a ‘sequential file’ from file option on the palette into the work area.
Step 5: Drag and Drop ‘Slowly Changing Dimension’ from processing option on the palette into the work area.
Step 6: Create a table oltp with the following description and insert some records then commit.
Step 7: Create a table deptdwh with the following description.
Step 8: set OLTP oracle connector properties as same as in exercise 9 and use the oltp table.
Step 9: Set DWH oracle connector properties as same as in exercise 9 and use the deptdwh table.
Step 10: Set Target_DWH oracle connector properties as same as in exercise 9 and use the deptdwh table.
Step 11: Set Fact sequential file properties as same as in exercise 1.
Step 12: Set Slowly Changing Dimension as follows.
Fast Path: 1 of 5
Select output link as fact (sequential file).
Fast Path: 2 of 5 (Input)
Map the key column between oltp and dwh table.
Fast Path: 3 of 5 (Input)
Set Initial Value as 1
Create a txt file ‘System.txt’ in C:\ for system reference.
Give that file path under Source name:
Fast Path: 4 of 5 (Output)
Map columns for the Fact (sequential file).
Always map common columns from oltp table.
Fast Path: 5 of 5 (Output)
At Initial Stage:
Set Derivation, Purpose and Expire for columns.
Derivation and Expire can be set by double click->right click->function->desired function on the respective columns.
Purpose Settings:
Business Key: primary key
Surrogate key: to locate changes (for system reference)
Type 1: Non-changeable values that are not a business key (eg: Date of birth).
Type 2: Changeable values.
Effective Date: Entry date of the record
Expiration Date: Entry date of immediate duplicate record (so initially set it as null)
Current Indicator: Indicates the active record
Active-1
Inactive-0
Fast Path 5 of 5 (output) at final stage:
After setting the fast path: 5 of 5, fast path: 2 of 5 will become as
Step 13: Compile and Run the project.
OUTPUT:
Deptdwh table is inserted with the records from oltp table with stdate as current date, expdate as null and CID as 1(active record).
Fact file content:
After Making the following changes on oltp table
Deptdwh table is inserted with changed records as well as newly inserted records at oltp with stdate as current date, expdate and cid.
The ‘dname’ value of the row with deptno =10 is changed from ‘C’ to ‘JAVA’ .
The old record gets the expiration date as the starting date of the newly updated record
Current indicator (cid) of old record= 0 and for new record, cid=1.
Fact file content:
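The Type 2 behaviour described above (expire the old record, insert the new version with a fresh surrogate key) can be sketched like this; column names follow the exercise (stdate, expdate, cid, deptno, dname), while skey as the surrogate key column name is an assumption:

```python
def scd_type2_apply(dim, incoming, business_key, type2_columns, today):
    """Sketch of Type 2 SCD: when a tracked column changes, expire the
    active row (expdate = start date of the new row, cid = 0) and insert
    the new version with stdate = today, expdate = null and cid = 1."""
    active = {r[business_key]: r for r in dim if r["cid"] == 1}
    next_sk = max((r["skey"] for r in dim), default=0) + 1   # surrogate key
    for row in incoming:
        old = active.get(row[business_key])
        if old and all(old[c] == row[c] for c in type2_columns):
            continue                                # exact copy: nothing to do
        if old:                                     # changed: expire the old row
            old["expdate"], old["cid"] = today, 0
        dim.append({**row, "skey": next_sk, "stdate": today,
                    "expdate": None, "cid": 1})
        next_sk += 1
    return dim
```

This reproduces the deptno=10 case above, where 'dname' changes from 'C' to 'JAVA'.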
Exercise 21: PIVOT STAGE
Step1: Create a new parallel project
Step 2: Save the project with a name.
Step 3: Drag and Drop two sequential files into the work area.
Step 4: Drag and Drop ‘pivot’ from processing option on the palette into the work area.
Step 5: set sequential_file_0 properties as same as in exercise1.
Step 6: Set Pivot properties as follows.
Input settings:
Output Settings:
Step 7: set sequential_file_2 properties as same as in exercise1.
Step 8: Compile and run the project.
OUTPUT:
Source File:
Target File:
NOTE: The datatype of all horizontal columns except the primary key column in the source table should be the same. In our case the q1, q2, q3 columns in the source table are integers, so all of them can fit into the column ‘q’ with integer datatype in the target table.
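The horizontal pivot can be sketched as follows: the q1, q2, q3 columns of each source row become separate output rows under a single column q, keyed by the primary key column:

```python
def horizontal_pivot(rows, key, pivot_columns, output_column):
    """Sketch of the Pivot stage (horizontal pivot): one output row per
    pivoted column of each input row."""
    out = []
    for row in rows:
        for col in pivot_columns:
            out.append({key: row[key], output_column: row[col]})
    return out
```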
Exercise 22: Run the jobs in sequential manner (one after other) using ‘Sequence Job’
Sequence Job is mainly used for executing jobs one after another.
It is essential when jobs must execute in a particular order, with one job depending on the finished execution state of another.
For example consider the following query,
Select e.eno,e.ename,e.deptno,d.deptname from emp e join dept d on(e.deptno=d.deptno) where e.deptno in(10,20,30) order by 2;
The above query needs to execute three jobs (1. Join, 2. Filter, 3. Sort) in sequence.
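The Join -> Filter -> Sort pipeline implied by the query above can be sketched as a sequence runner that feeds each job's output to the next (the three lambdas stand in for the three compiled jobs):

```python
def run_sequence(jobs, rows):
    """Sketch of a sequence job: run each job in order, each one consuming
    the previous job's output."""
    for job in jobs:
        rows = job(rows)
    return rows
```

For example, with join, filter and sort stand-ins applied to (eno, ename, deptno) rows: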
Step1: Create a new sequence project
Step 2: Save the project with a name.
Step 3: Drag and Drop the jobs you want to execute sequentially from repository into the work area.
Step 4: Link the Jobs
Step 5: Compile and run the project.
Step 6: Open the Run Director and observe the logs for successful execution of all the jobs.