DataStage FAQs


1. Main differences between DataStage 7.5.2 and 8.0.1

A. In 7.5.2, Manager is a separate client. In 8.0.1 there is no Manager client; it is embedded in the Designer client.

B. In 7.5.2, QualityStage has a separate designer. In 8.0.1, QualityStage is integrated into the Designer.

C. In 7.5.2, code and metadata are stored in a file-based system. In 8.0.1, code is stored in a file-based system, whereas metadata is stored in a database.

D. In 7.5.2, operating system authentication is required. In 8.0.1, both operating system authentication and DataStage authentication are required.

E. 7.5.2 has no range lookup. 8.0.1 has range lookup.

F. In 7.5.2, a single Join stage can't support multiple references. In 8.0.1, a single Join stage can support multiple references.

G. In 7.5.2, when one developer has a job open, another developer cannot open the same job. In 8.0.1, the second developer can open it as a read-only job.

H. 8.0.1 provides a compare utility to compare two jobs, for example one in development and one in production. 7.5.2 does not.

I. 8.0.1 provides Quick Find and Advanced Find features; 7.5.2 does not.

J. In 7.5.2, surrogate keys are generated from the initial value to n on the first run, and when the same job is compiled and run again they are generated from the initial value to n again; there is no automatic increment across runs. In 8.0.1, the surrogate key is incremented automatically, and a state file is used to store the maximum surrogate key value.
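The state-file behaviour described for 8.0.1 surrogate keys can be sketched in a few lines of Python. This is only an illustration of the idea, not DataStage's actual state file format: the function name `next_keys`, the plain-text file layout and the file name `customer_sk.state` are assumptions made for the example.

```python
import os
import tempfile


def next_keys(state_path, count, initial=1):
    """Return `count` consecutive keys, continuing from the stored maximum."""
    last = initial - 1
    if os.path.exists(state_path):
        # The state file holds a single number: the highest key issued so far.
        with open(state_path) as f:
            last = int(f.read().strip())
    keys = list(range(last + 1, last + 1 + count))
    # Persist the new maximum so the next run continues instead of restarting.
    with open(state_path, "w") as f:
        f.write(str(last + count))
    return keys


state = os.path.join(tempfile.mkdtemp(), "customer_sk.state")
first_run = next_keys(state, 3)   # no state file yet: starts at the initial value
second_run = next_keys(state, 2)  # continues after the stored maximum
print(first_run, second_run)
```

Without the state file, the second run would start from the initial value again, which is the 7.5.2 behaviour described above.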


Data Modeling

1) E-R Modeling (Entity-Relationship Modeling) (OLTP)

a) Logical Modeling: Logical modeling deals with gathering the business requirements and converting them into a model.
b) Physical Modeling: Physical modeling involves the actual design of a database according to the requirements that were established during logical modeling.

2) Dimensional Modeling

Dimensional modeling is divided into 2 types.
a) Star Schema - Simple and denormalized form. Much faster than snowflake.
b) Snowflake Schema - Complex, with more granularities. More normalized form. Slower.


Importance of Surrogate Key in Data warehousing?

A surrogate key is the primary key in a dimension table. It is independent of the underlying database, i.e. the surrogate key is not affected by changes in the source database.

3. Differentiate between Database data and Data warehouse data?

Data in a database is for OLTP:
a) Detailed or transactional
b) Both readable and writable
c) Current
d) Volatile

Data in a DWH is for OLAP:
a) For analysis and BI
b) We can only read from the DWH
c) Historical data
d) Non-volatile

4. What is the flow of loading data into fact & dimensional tables?

First, data should be loaded into the dimension tables, where the surrogate keys are generated, and then into the fact tables. The surrogate keys are referenced as foreign keys in the fact tables.
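The load order can be illustrated with a small Python sketch. It is not DataStage code: the function names, the column names (`product_code`, `product_sk`, `amount`) and the sample rows are all made up for the example.

```python
def load_dimension(source_rows, natural_key):
    """Assign a surrogate key to each distinct natural key, in arrival order."""
    dimension = {}
    for row in source_rows:
        nk = row[natural_key]
        if nk not in dimension:
            dimension[nk] = len(dimension) + 1
    return dimension


def load_facts(source_rows, dimension, natural_key, measure):
    """Replace the natural key with the surrogate key from the dimension."""
    return [
        {"product_sk": dimension[row[natural_key]], measure: row[measure]}
        for row in source_rows
    ]


sales = [
    {"product_code": "P-10", "amount": 40},
    {"product_code": "P-20", "amount": 15},
    {"product_code": "P-10", "amount": 25},
]

# Step 1: dimension first, so every surrogate key exists.
product_dim = load_dimension(sales, "product_code")
# Step 2: facts second, resolving each row to its surrogate (foreign) key.
fact_sales = load_facts(sales, product_dim, "product_code", "amount")
print(product_dim)
print(fact_sales)
```

Loading the facts first would fail at the dictionary lookup, which mirrors a foreign key that has no matching dimension row.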


5. Orchestrate vs. DataStage Parallel Extender?

Orchestrate is itself an ETL tool with extensive parallel processing capabilities, running on the UNIX platform. DataStage used Orchestrate with DataStage XE (the beta version of 6.0) to incorporate parallel processing capabilities. The DataStage vendor (Ascential) then purchased Orchestrate, integrated it with DataStage XE, and released a new version, DataStage 6.0, i.e. Parallel Extender.


6. What are Stage Variables, Derivations and Constraints?

Stage Variable - An intermediate processing variable that retains its value while rows are read and does not pass the value to a target column (useful for comparisons and if-else logic). It can also be used to compute a result that is needed in multiple columns of the target table(s).
Constraint - A filter condition that limits the number of records coming from the input according to a business rule.
Derivation - An expression that specifies the value to be passed to the target column.
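The three concepts can be seen side by side in a short Python sketch that imitates what a Transformer does row by row. It is an analogy only, not DataStage syntax; the column names and the business rule (positive amounts only) are assumptions chosen for the example.

```python
def transform(rows):
    """Imitate a Transformer: one stage variable, one constraint, two derivations."""
    prev_customer = None  # stage variable: kept between rows, never written to the target
    output = []
    for row in rows:
        # Compare the current row with the value retained from the previous row.
        is_new_customer = row["customer"] != prev_customer
        prev_customer = row["customer"]

        # Constraint: only rows with a positive amount reach the output.
        if row["amount"] <= 0:
            continue

        output.append({
            "customer": row["customer"].upper(),                # derivation
            "first_row_flag": "Y" if is_new_customer else "N",  # derivation using the stage variable
        })
    return output


rows = [
    {"customer": "acme", "amount": 10},
    {"customer": "acme", "amount": 0},
    {"customer": "acme", "amount": 5},
    {"customer": "zenith", "amount": 7},
]
result = transform(rows)
print(result)
```

The stage variable is updated for every input row, including rows the constraint later rejects, and it never appears as an output column.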


7. DataStage Lookup types?

Normal Lookup: DataStage places the reference (lookup) table data into a memory buffer and looks values up there. It is used when the reference table or file contains little data.
Sparse Lookup: Instead of placing the data into a buffer, it fires a SQL query at the database for each lookup. To use a sparse lookup, the reference table should be larger than the source data, and the reference must be a database (of any type).
Range Lookup: Searches for records based on a particular range. It searches only the records in that range, which gives better performance than searching the entire record set.

8. Explain about Error Handling in DataStage and best practices.

In a DataStage sequence there is an "Exception Handler" activity. When you call your jobs from a DataStage sequence, do the following:
Step 1: Go to the job properties of the master sequence and check the checkboxes "Add Checkpoints so sequence is restartable on Failure" and "Automatically handle activities that fail".
Step 2: In your sequence, use an Exception Handler activity. After the Exception Handler activity you may include an email notification activity. If a job fails, control passes to the Exception Handler activity and an email is sent notifying the user that the sequence has failed.

9. What are the different types of links in DataStage?

There are 3 different types of links in DataStage.
1. Stream link: a straight link.
2. Reference link: acts like a lookup.
3. Reject link: used in parallel jobs.

10. How to use Excel file as input in DataStage?

You can use an Excel file as input by importing the .xls file.
Step 1: Go to Administrative Tools -> Data Sources (ODBC) -> System DSN. Click the Add button and configure the corresponding .xls file in your system DSN. Make sure that the workbook contains the name of your Excel sheet.
Step 2: Import the Excel file into DataStage as an ODBC table definition.
Step 3: Use an ODBC stage as the input stage.


11. What is the default cache size? How do you change the cache size if needed?

The default cache size is 128 MB. We can increase it in DataStage Administrator by selecting the Tunables tab and specifying the cache size there.

12. Differentiate Primary Key and Partition Key?

A primary key is the key defined on a table column or set of columns (composite PK) to make sure that all the rows are unique in that column or combination of columns. A partition key is the key used when partitioning a table (in the database) for processing the source records in ETL. We should define the partitioning based on the stages (in DataStage) or transformations (in Informatica) used in a job (in DataStage) or mapping (in Informatica). We partition the data to improve the target load process.

13. How to remove the locked jobs using DataStage?

Go to Director -> Tools -> Clear the Job Resources option. There you will find the PID number. Select that PID and click Logout. The job is released.


14. How do you execute a DataStage job from the UNIX command line prompt?

/opt/Ascential/DataStage/DSEngine/bin/dsjob -server $ServerName \
    -user $User \
    -password $ServerPassword \
    -run -wait -mode NORMAL \
    -param FmtDirectory=$FMT_PATH \
    -param ReFmtDirectory=$REFMT_PATH \
    -param Twin=$TWIN \
    -param Group=$GROUP \
    -param PrcsDte=$PROCDTE \
    -param Cmelnd=$CME_IND \
    -param Mndlnd=$MND_IND \
    IDL $DSSEQUENCE.${APPLCDE}
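When the same call has to be issued from a scheduler script with varying parameters, it can help to assemble the argument list programmatically. The following is a minimal Python sketch that only uses the dsjob options shown in the command above. The helper name `build_dsjob_command`, the host, user, password and job name are placeholders invented for the example.

```python
def build_dsjob_command(dsjob, server, user, password, project, job, params):
    """Assemble the argument list for a `dsjob -run` call like the one above."""
    cmd = [
        dsjob,
        "-server", server,
        "-user", user,
        "-password", password,
        "-run", "-wait", "-mode", "NORMAL",
    ]
    # One "-param name=value" pair per job parameter, in the order given.
    for name, value in params.items():
        cmd += ["-param", f"{name}={value}"]
    # Project and job name come last, as in the original command.
    cmd += [project, job]
    return cmd


cmd = build_dsjob_command(
    "/opt/Ascential/DataStage/DSEngine/bin/dsjob",
    server="etlhost", user="dsadm", password="secret",  # placeholder values
    project="IDL", job="SeqLoad.APP1",                   # hypothetical job name
    params={"FmtDirectory": "/data/fmt", "Twin": "A"},
)
print(" ".join(cmd))
# To launch the job, the list can be passed to subprocess.run(cmd, check=True).
```

Passing the arguments as a list rather than one shell string avoids quoting problems when a parameter value contains spaces.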


15. What are the types of Hashed File?

Hashed files are classified into 2 types.
a) Static - Subdivided into 17 types based on the primary key pattern.
b) Dynamic - The default hashed file is "Dynamic - Type Random 30 D".
Three files are created when we create a hashed file: .data, .type and .over.

16. How to call a Stored Procedure which is in SQL Server database in DataStage job?

In the ODBC stage properties, click Outputs -> General -> select Stored Procedure -> browse to get the stored procedure. We can also use a Stored Procedure stage while designing parallel jobs.

17. Explain what are SUMMARY TABLES and use?

Summary tables contain summarized or "aggregated" data to tune up query performance. Example: suppose that we have a table which contains the following columns:
a) Medicine_Name
b) Sell
c) Time
Now the business requirement is to get the sales of medicine on a monthly basis. If a query is fired against this table, it has to aggregate the individual sales rows to get the monthly sales each time. If instead a summary table is created that contains the monthly sales records, the query cost decreases because the query gets the aggregated data directly from the summary table. In this scenario the summary table will contain the following columns:
a) Medicine_Name
b) Sell
c) MONTH
Hence, for the sales of all days of a month, only one aggregated record appears in the summary table, i.e. for each month there is one row (per medicine) containing aggregated data. This improves query performance.
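The medicine example can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the table names `sales` and `sales_monthly`, the ISO date format in the Time column and the sample rows are invented for illustration, and a real warehouse would refresh the summary table as part of the load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Detail table with the three columns from the example.
conn.execute("CREATE TABLE sales (Medicine_Name TEXT, Sell REAL, Time TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Aspirin", 10.0, "2024-01-05"),
        ("Aspirin", 15.0, "2024-01-20"),
        ("Aspirin", 7.0, "2024-02-02"),
        ("Ibuprofen", 4.0, "2024-01-11"),
    ],
)

# Build the summary table once; strftime('%Y-%m', ...) reduces each date to its month.
conn.execute(
    """
    CREATE TABLE sales_monthly AS
    SELECT Medicine_Name, SUM(Sell) AS Sell, strftime('%Y-%m', Time) AS MONTH
    FROM sales
    GROUP BY Medicine_Name, strftime('%Y-%m', Time)
    """
)

# Monthly reports now read the pre-aggregated rows instead of re-aggregating the detail.
monthly = conn.execute(
    "SELECT Medicine_Name, MONTH, Sell FROM sales_monthly "
    "ORDER BY Medicine_Name, MONTH"
).fetchall()
print(monthly)
```

The four detail rows collapse into three summary rows, one per medicine and month, so the reporting query no longer needs a GROUP BY.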

18. Containers: Usage and Types?

A container is a collection of stages used for the purpose of reusability. There are 2 types of containers.
a) Local Container: job specific.
b) Shared Container: for any job within a project. It again has 2 types.
1. Server shared container: used in server jobs (can also be used in parallel jobs).
2. Parallel shared container: used only in parallel jobs.

19. Where the DataStage hash files are stored when they are created?

There are two ways of specifying where the Hash Files will be created.
a) Account: This is generally the project name. If this is selec