
Parallel Architecture, DataStage v8 Configuration, Metadata

Parallel processing = executing your application on multiple CPUs

Parallel processing environments

The environment in which you run your parallel jobs is defined by your system's architecture and hardware resources. All parallel processing environments are categorized as one of:

1. SMP (symmetric multiprocessing), in which some hardware resources may be shared among processors. The processors communicate via shared memory and have a single operating system.

2. Cluster or MPP (massively parallel processing), also known as shared-nothing, in which each processor has exclusive access to hardware resources. MPP systems are physically housed in the same box, whereas cluster systems can be physically dispersed. The processors each have their own operating system, and communicate via a high-speed network.

Pipeline Parallelism

1. Extract, Transform and Load processes execute simultaneously
2. The downstream process starts while the upstream process is running, like a conveyor belt moving rows from process to process
3. Advantages: reduces disk usage for staging areas and keeps processors busy
4. Still has limits on scalability

Partition Parallelism

1. Divide the incoming stream of data into subsets known as partitions, to be processed separately
2. Each partition is processed in the same way
3. Facilitates near-linear scalability. However, the data needs to be evenly distributed across the partitions; otherwise the benefits of partitioning are reduced

Within parallel jobs, pipelining, partitioning and repartitioning are automatic. The job developer only identifies:

1. Sequential or Parallel mode (by stage)
2. Partitioning method
3. Collection method
4. Configuration file

Configuration File

One of the great strengths of the WebSphere DataStage Enterprise Edition is that, when designing parallel jobs, you don’t have to worry too much about the underlying structure of your system, beyond appreciating its parallel processing capabilities. If your system changes, is upgraded or improved, or if you develop a job on one platform and implement it on another, you don’t necessarily have to change your job design.

WebSphere DataStage learns about the shape and size of the system from the configuration file. It organizes the resources needed for a job according to what is defined in the configuration file. When your system changes, you change the file, not the jobs.

The WebSphere DataStage Designer provides a configuration file editor to help you define configuration files for the parallel engine. To use the editor, choose Tools → Configurations; the Configurations dialog box appears.

You specify which configuration will be used by setting the $APT_CONFIG_FILE environment variable. This is set on installation to point to the default configuration file, but you can set it on a project-wide level from the WebSphere DataStage Administrator or for individual jobs from the Job Properties dialog.
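For illustration, the variable can also be exported in the environment before a command-line job run (the path shown is an assumption; substitute the location of your own configuration file):

$ export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/4node.apt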

Configuration files are text files containing string data. The general form of a configuration file is as follows:


{node "n1" {fastname "s1"pool "" "n1" "s1" "app2" "sort"resource disk "/orch/n1/d1" {}resource disk "/orch/n1/d2" {"bigdata"}resource scratchdisk "/temp" {"sort"}}}

Node names

Each node you define is followed by its name enclosed in quotation marks, for example: node "orch0"

For a single CPU node or workstation, the node’s name is typically the network name of a processing node on a connection such as a high-speed switch or Ethernet. Issue the following UNIX command to learn a node’s network name: $ uname -n

On an SMP, if you are defining multiple logical nodes corresponding to the same physical node, you replace the network name with a logical node name. In this case, you need a fast name for each logical node. If you run an application from a node that is undefined in the corresponding configuration file, each user must set the environment variable APT_PM_CONDUCTOR_NODENAME to the fast name of the node invoking the parallel job.
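As a minimal sketch (assuming a Bourne-style shell, and that the node's network name is also its fast name, as is typical on an SMP), the variable could be set like this before invoking the job:

$ export APT_PM_CONDUCTOR_NODENAME=$(uname -n)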

Fastname

Syntax: fastname "name"

This option takes as its quoted attribute the name of the node as it is referred to on the fastest network in the system, such as an IBM switch, FDDI, or BYNET. The fastname is the physical node name that stages use to open connections for high volume data transfers. The attribute of this option is often the network name. For an SMP, all CPUs share a single connection to the network, and this setting is the same for all parallel engine processing nodes defined for an SMP. Typically, this is the principal node name, as returned by the UNIX command uname -n.

Node pools and the default node pool

Node pools allow association of processing nodes based on their characteristics. For example, certain nodes can have large amounts of physical memory, and you can designate them as compute nodes. Others can connect directly to a mainframe or some form of high-speed I/O. These nodes can be grouped into an I/O node pool.

The option pools is followed by the quoted names of the node pools to which the node belongs. A node can be assigned to multiple pools, as in the following example, where node1 is assigned to the default pool ("") as well as the pools node1, node1_css, and pool4.

node "node1" {
  fastname "node1_css"
  pools "" "node1" "node1_css" "pool4"
  resource disk "/orch/s0" {}
  resource scratchdisk "/scratch" {}
}

A node belongs to the default pool unless you explicitly specify a pools list for it, and omit the default pool name ("") from the list.


Once you have defined a node pool, you can constrain a parallel stage or parallel job to run only on that pool, that is, only on the processing nodes belonging to it. If you constrain both a stage and a job, the stage runs only on the nodes that appear in both pools.

Nodes or resources that name a pool declare their membership in that pool.

We suggest that when you initially configure your system you place all nodes in pools that are named after the node’s name and fast name. Additionally include the default node pool in this pool, as in the following example:

node "n1" { fastname "nfast" pools "" "n1" "nfast" }

By default, the parallel engine executes a parallel stage on all nodes defined in the default node pool. You can constrain the processing nodes used by the parallel engine either by removing node descriptions from the configuration file or by constraining a job or stage to a particular node pool.
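As an illustrative sketch (the node names, fast names, and paths below are invented for the example), a configuration file might define a dedicated "sort" pool on one node; a stage or job constrained to that pool would then run only on n1, while unconstrained stages run on both nodes via the default pool:

{
  node "n1" {
    fastname "n1_css"
    pools "" "n1" "sort"
    resource disk "/orch/n1/d0" {}
    resource scratchdisk "/scratch/n1" {}
  }
  node "n2" {
    fastname "n2_css"
    pools "" "n2"
    resource disk "/orch/n2/d0" {}
    resource scratchdisk "/scratch/n2" {}
  }
}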

Disk and scratch disk pools and their defaults

When you define a processing node, you can specify the options resource disk and resource scratchdisk. They indicate the directories of file systems available to the node. You can also group disks and scratch disks in pools. Pools reserve storage for a particular use, such as holding very large data sets.

Pools defined by disk and scratchdisk are not combined; therefore, two pools that have the same name and belong to both resource disk and resource scratchdisk define two separate pools.

A disk that does not specify a pool is assigned to the default pool. The default pool may also be identified by "" and by { } (the empty pool list). For example, the following code configures the disks for node1:

node "n1" {resource disk "/orch/s0" {pools "" "pool1"}resource disk "/orch/s1" {pools "" "pool1"}resource disk "/orch/s2" { } /* empty pool list */resource disk "/orch/s3" {pools "pool2"}resource scratchdisk "/scratch" {pools "" "scratch_pool1"} }

In this example:

1. The first two disks are assigned to the default pool.
2. The first two disks are assigned to pool1.
3. The third disk is also assigned to the default pool, indicated by { }.
4. The fourth disk is assigned to pool2 and is not assigned to the default pool.
5. The scratch disk is assigned to the default scratch disk pool and to scratch_pool1.

Buffer scratch disk pools

Under certain circumstances, the parallel engine uses both memory and disk storage to buffer virtual data set records. The amount of memory defaults to 3 MB per buffer per processing node. The amount of disk space for each processing node defaults to the amount of available disk space specified in the default scratchdisk setting for the node.

The parallel engine uses the default scratch disk for temporary storage other than buffering. If you define a buffer scratch disk pool for a node in the configuration file, the parallel engine uses that scratch disk pool rather than the default scratch disk for buffering, and all other scratch disk pools defined are used for temporary storage other than buffering.


Here is an example configuration file that defines a buffer scratch disk pool:

{
  node node1 {
    fastname "node1_css"
    pools "" "node1" "node1_css"
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch0" {pools "buffer"}
    resource scratchdisk "/scratch1" {}
  }
  node node2 {
    fastname "node2_css"
    pools "" "node2" "node2_css"
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch0" {pools "buffer"}
    resource scratchdisk "/scratch1" {}
  }
}

In this example, each processing node has a single scratch disk resource in the buffer pool, so buffering will use /scratch0 but not /scratch1. However, if /scratch0 were not in the buffer pool, both /scratch0 and /scratch1 would be used because both would then be in the default pool.

Partitioning

The aim of most partitioning operations is to end up with a set of partitions that are as near equal size as possible, ensuring an even load across your processors.

When performing some operations however, you will need to take control of partitioning to ensure that you get consistent results. A good example of this would be where you are using an aggregator stage to summarize your data. To get the answers you want (and need) you must ensure that related data is grouped together in the same partition before the summary operation is performed on that partition.

Round robin partitioner

The first record goes to the first processing node, the second to the second processing node, and so on. When WebSphere DataStage reaches the last processing node in the system, it starts over. This method is useful for resizing partitions of an input data set that are not equal in size. The round robin method always creates approximately equal-sized partitions. This method is the one normally used when WebSphere DataStage initially partitions data.
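For instance, with a three-node configuration the distribution looks like this (a simple illustration, not engine output): row 1 goes to partition 0, row 2 to partition 1, row 3 to partition 2, row 4 back to partition 0, and so on, so each partition ends up with roughly one third of the rows.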

Random partitioner

Records are randomly distributed across all processing nodes. Like round robin, random partitioning can rebalance the partitions of an input data set to guarantee that each processing node receives an approximately equal-sized partition. The random partitioning has a slightly higher overhead than round robin because of the extra processing required to calculate a random value for each record.

Entire partitioner

Every instance of a stage on every processing node receives the complete data set as input. It is useful when you want the benefits of parallel execution, but every instance of the operator needs access to the entire input data set. You are most likely to use this partitioning method with stages that create lookup tables from their input.


Same partitioner

The stage using the data set as input performs no repartitioning and takes as input the partitions output by the preceding stage. With this partitioning method, records stay on the same processing node; that is, they are not redistributed. Same is the fastest partitioning method. This is normally the method WebSphere DataStage uses when passing data between stages in your job.

Hash partitioner

Partitioning is based on a function of one or more columns (the hash partitioning keys) in each record. The hash partitioner examines one or more fields of each input record (the hash key fields). Records with the same values for all hash key fields are assigned to the same processing node.

This method is useful for ensuring that related records are in the same partition, which may be a prerequisite for a processing operation. For example, for a remove duplicates operation, you can hash partition records so that records with the same partitioning key values are on the same node. You can then sort the records on each node using the hash key fields as sorting key fields, then remove duplicates, again using the same keys. Although the data is distributed across partitions, the hash partitioner ensures that records with identical keys are in the same partition, allowing duplicates to be found.

Hash partitioning does not necessarily result in an even distribution of data between partitions. For example, if you hash partition a data set based on a zip code field, where a large percentage of your records are from one or two zip codes, you can end up with a few partitions containing most of your records. This behavior can lead to bottlenecks because some nodes are required to process more records than other nodes.
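As a simple illustration (the key values and partition assignments below are invented; actual placement depends on the hash function), hashing on a customer ID might behave as follows: customer 1001 always lands in, say, partition 2; customer 1002 might land in partition 0; every further record for customer 1001 also lands in partition 2. Different keys can share a partition, which is why partition sizes are not guaranteed to be even.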

Modulus partitioner

Partitioning is based on a key column modulo the number of partitions. This method is similar to hash by field, but involves simpler computation.

In data mining, data is often arranged in buckets, that is, each record has a tag containing its bucket number. You can use the modulus partitioner to partition the records according to this number. The modulus partitioner assigns each record of an input data set to a partition of its output data set as determined by a specified key field in the input data set. This field can be the tag field.

The partition number of each record is calculated as follows: partition_number = fieldname mod number_of_partitions

where: fieldname is a numeric field of the input data set and number_of_partitions is the number of processing nodes on which the partitioner executes. If a partitioner is executed on three processing nodes it has three partitions.
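For example, if the partitioner executes on three processing nodes, a record whose key field contains 9 is assigned to partition 9 mod 3 = 0, a record with 10 goes to partition 1, and a record with 11 goes to partition 2.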

Range partitioner

Divides a data set into approximately equal-sized partitions, each of which contains records with key columns within a specified range. This method is also useful for ensuring that related records are in the same partition. A range partitioner divides a data set into approximately equal size partitions based on one or more partitioning keys.

In order to use a range partitioner, you have to make a range map. You can do this using the Write Range Map stage. The range partitioner guarantees that all records with the same partitioning key values are assigned to the same partition and that the partitions are approximately equal in size so all nodes perform an equal amount of work when processing the data set.

Range partitioning is not the only partitioning method that guarantees equivalent-sized partitions. The random and round robin partitioning methods also guarantee that the partitions of a data set are equivalent in size. However,


these partitioning methods are keyless; that is, they do not allow you to control how records of a data set are grouped together within a partition.

DB2 partitioner

Partitions an input data set in the same way that DB2® would partition it. For example, if you use this method to partition an input data set containing update information for an existing DB2 table, records are assigned to the processing node containing the corresponding DB2 record. Then, during the execution of the parallel operator, both the input record and the DB2 table record are local to the processing node. Any reads and writes of the DB2 table would entail no network activity.

Auto partitioner

The most common method you will see on the WebSphere DataStage stages is Auto. This just means that you are leaving it to WebSphere DataStage to determine the best partitioning method to use depending on the type of stage, and what the previous stage in the job has done. Typically WebSphere DataStage would use round robin when initially partitioning data, and same for the intermediate stages of a job.

Collecting

Collecting is the process of joining the multiple partitions of a single data set back together again into a single partition. There may be a stage in your job that you want to run sequentially rather than in parallel, in which case you will need to collect all your partitioned data at this stage to make sure it is operating on the whole data set.

Note that collecting methods are mostly non-deterministic. That is, if you run the same job twice with the same data, you are unlikely to get data collected in the same order each time. If order matters, you need to use the sorted merge collection method.

Round robin collector

Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, starts over. After reaching the final record in any partition, skips that partition in the remaining rounds.

Ordered collector

Reads all records from the first partition, then all records from the second partition, and so on. This collection method preserves the order of totally sorted input data sets. In a totally sorted data set, both the records in each partition and the partitions themselves are ordered. This may be useful as a preprocessing action before exporting a sorted data set to a single data file.

Sorted merge collector

Reads records in an order based on one or more columns of the record. The columns used to define record order are called collecting keys. Typically, you use the sorted merge collector with a partition-sorted data set (as created by a sort stage). In this case, you specify as the collecting key fields those fields you specified as sorting key fields to the sort stage.

The data type of a collecting key can be any type except raw, subrec, tagged, or vector.

Auto collector


The most common method you will see on the parallel stages is Auto. This normally means that WebSphere DataStage will eagerly read any row from any input partition as it becomes available, but if it detects that, for example, the data needs sorting as it is collected, it will do that. This is the fastest collecting method.

Preserve partitioning flag

A stage can also request that the next stage in the job preserves whatever partitioning it has implemented. It does this by setting the preserve partitioning flag for its output link. Note, however, that the next stage may ignore this request.

In most cases you are best leaving the preserve partitioning flag in its default state. The exception to this is where preserving existing partitioning is important. The flag will not prevent repartitioning, but it will warn you that it has happened when you run the job.

If the Preserve Partitioning flag is cleared, this means that the current stage doesn't care what the next stage in the job does about partitioning. On some stages, the Preserve Partitioning flag can be set to Propagate. In this case the stage sets the flag on its output link according to what the previous stage in the job has set. If the previous stage is also set to Propagate, the setting from the stage before is used, and so on until a Set or Clear flag is encountered earlier in the job. If the stage has multiple inputs and has a flag set to Propagate, its Preserve Partitioning flag is set if it is set on any of the inputs, or cleared if all the inputs are clear.

Parallel Job Score

At runtime, the Job SCORE can be examined to identify:

1. Number of UNIX processes generated for a given job and $APT_CONFIG_FILE
2. Operator combination
3. Partitioning methods between operators
4. Framework-inserted components, including Sorts, Partitioners, and Buffer operators

Set $APT_DUMP_SCORE=1 to output the Score to the DataStage job log

For each job run, 2 separate Score Dumps are written to the log:

1. First score is actually from the license operator
2. Second score entry is the actual job score

Job scores are divided into two sections:

1. Datasets - partitioning and collecting
2. Operators - node/operator mapping

Example score dump

The following score dump shows a flow with a single data set, which has a hash partitioner, partitioning on key "a". It shows three operators: generator, tsort, and peek. Tsort and peek are "combined", indicating that they have been optimized into the same process. All the operators in this flow are running on one node.


The DataStage Parallel Framework implements a producer-consumer data flow model. Upstream stages (operators or persistent data sets) produce rows that are consumed by downstream stages (operators or data sets). The partitioning method is associated with the producer; the collector method is associated with the consumer. “eCollectAny” is specified for parallel consumers, although no collection occurs!

The producer and consumer are separated by the following indicators:

->  Sequential to Sequential
<>  Sequential to Parallel
=>  Parallel to Parallel (SAME)
#>  Parallel to Parallel (not SAME)
>>  Parallel to Sequential
>   No producer or no consumer

May also include [pp] notation when the Preserve Partitioning flag is set.

At runtime, the DataStage Parallel Framework can only combine stages (operators) that:

1. Use the same partitioning method. Repartitioning prevents operator combination between the corresponding producer and consumer stages. Implicit repartitioning (e.g. Sequential operators, node maps) also prevents combination.

2. Are combinable. This is set automatically within the stage/operator definition, and can be set within DataStage Designer under Advanced stage properties.

The Lookup stage is a composite operator. Internally it contains more than one component, but to the user it appears to be one stage

1. LUTCreateImpl - Reads the reference data into memory
2. LUTProcessImpl - Performs actual lookup processing once reference data has been loaded

At runtime, each internal component is assigned to operators independently

Job Compilation

1. Operators. These underlie the stages in a WebSphere DataStage job. A single stage may correspond to a single operator, or a number of operators, depending on the properties you have set, and whether you have chosen to partition or collect or sort data on the input link to a stage. At compilation, WebSphere DataStage evaluates your job design and will sometimes optimize operators out if they are judged to be superfluous, or insert other operators if they are needed for the logic of the job.

2. OSH. This is the scripting language used internally by the WebSphere DataStage parallel engine.


3. Players. Players are the workhorse processes in a parallel job. There is generally a player for each operator on each node. Players are the children of section leaders; there is one section leader per processing node. Section leaders are started by the conductor process running on the conductor node (the conductor node is defined in the configuration file).
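As a rough worked example (assuming no operator combination and ignoring any framework-inserted operators), a job whose score contains 3 operators running with a 4-node configuration file would start 1 conductor, 4 section leaders (one per node), and 3 x 4 = 12 players, or 17 processes in total.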

DataStage Designer client generates all code - Validates link requirements, mandatory stage options, transformer logic, etc.

1. Generates OSH representation of job data flow and stages
   - GUI “stages” are representations of Framework “operators”
   - Stages in parallel shared containers are statically inserted in the job flow
   - Each server shared container becomes a dsjobsh operator

2. Generates transform code for each parallel Transformer
   - Compiled on the DataStage server into C++ and then to corresponding native operators
   - To improve compilation times, previously compiled Transformers that have not been modified are not recompiled
   - Force Compile recompiles all Transformers (use after client upgrades)

3. Buildop stages must be compiled manually within the GUI or using buildop UNIX command line

Viewing of generated OSH is enabled in DS Administrator

OSH is visible in:

1. Job Properties
2. Job run log
3. View Data
4. Table Definitions


Generated OSH Primer

Designer inserts comment blocks to assist in understanding the generated OSH. Note that operator order within the generated OSH is the order a stage was added to the job canvas.

OSH uses the familiar syntax of the UNIX shell to create applications for DataStage Enterprise Edition:

1. Operator name
2. Operator options (use “-name value” format)
3. Schema (for generator, import, export)
4. Inputs
5. Outputs

The following data sources are supported as input/output:

- Virtual data set (name.v)
- Persistent data set (name.ds or [ds] name)
- File sets (name.fs or [fs] name)
- External files (name or [file] name)

Every operator has inputs numbered sequentially starting from 0. For example: op1 0> dst op1 1< src

Terminology


Framework                   DataStage
schema                      table definition
property                    format
type                        SQL type + length [and scale]
virtual dataset             link
record/field                row/column
operator                    stage
step, flow, OSH command     job
Framework                   DS engine

• GUI uses both terminologies
• Log messages (info, warnings, errors) use Framework terminology

Example Stage / Operator Mapping

Within Designer, stages represent operators, but there is not always a 1:1 correspondence.

Sequential File
  Source: import
  Target: export
Data Set: copy
Sort (DataStage): tsort
Aggregator: group
Row Generator, Column Generator, Surrogate Key Generator: generator
Oracle
  Source: oraread
  Sparse Lookup: oralookup
  Target Load: orawrite
  Target Upsert: oraupsert
Lookup File Set
  Target: lookup -createOnly

Runtime Architecture

Generated OSH and the Configuration file are used to “compose” a job SCORE, similar to the way an RDBMS builds a query optimization plan:

1. Identifies degree of parallelism and node assignment for each operator
2. Inserts sorts and partitioners as needed to ensure correct results
3. Defines connection topology (datasets) between adjacent operators
4. Inserts buffer operators to prevent deadlocks (e.g. fork-joins)
5. Defines the number of actual UNIX processes - where possible, multiple operators are combined within a single UNIX process to improve performance and optimize resource requirements
6. The job SCORE is used to fork UNIX processes with communication interconnects for data, message, and control

Set $APT_PM_SHOW_PIDS to show UNIX process IDs in the DataStage log.

It is only after these steps that processing begins. This is the “startup overhead” of an Enterprise Edition job.

Job processing ends when:
- Last row (end of data) is processed by the final operator in the flow, or
- A fatal error is encountered by any operator, or
- The job is halted (SIGINT) by DataStage Job Control or human intervention (e.g. DataStage Director STOP)

Job Execution: The Orchestra


• Conductor - initial Framework process
  – Score Composer
  – Creates Section Leader processes (one per node)
  – Consolidates messages to the DataStage log
  – Manages orderly shutdown

• Section Leader (one per node)
  – Forks Player processes (one per stage)
  – Manages up/down communication

• Players
  – The actual processes associated with stages
  – Combined players: one process only
  – Send stderr, stdout to the Section Leader
  – Establish connections to other players for data flow
  – Clean up upon completion

• Default communication:
  – SMP: Shared Memory
  – MPP: Shared Memory (within a hardware node) and TCP (across hardware nodes)


Introduction

What is IBM Websphere DataStage?

1. Design jobs for ETL
2. Ideal tool for data integration projects
3. Import, export, create and manage metadata for use within jobs
4. Schedule, run and monitor jobs all within DataStage
5. Administer your DataStage development and execution environments
6. Create batch (controlling) jobs

What are the components/applications in IBM Information Server Suite?

1. DataStage
2. QualityStage
3. Metadata Server, consisting of Metadata Access Services and Metadata Analysis Services
4. Repository, which is DB2 by default
5. Business Glossary
6. Federation Server
7. Information Services Director
8. Information Analyzer
9. Information Server console

Explain the DataStage Architecture?

The DataStage client components are:

Administrator - Administers DataStage projects and conducts housekeeping on the server
Designer - Creates DataStage jobs that are compiled into executable programs
Director - Used to run and monitor DataStage jobs

The Repository is used to store DataStage objects. The Repository, which is DB2 by default, is shared by other applications in the Suite.

What are the uses of DataStage Administrator?

The Administrator is used to add and delete projects, and to set project properties. The Administrator also provides a command line interface to the DataStage repository.

Use the Administrator Project Properties window to:

1. Enable job administration in Director, enable runtime column propagation, set auto-purging options, protect the project and set environment variables on the General tab
2. Set user and group privileges on the Permissions tab
3. Enable or disable server-side tracing on the Tracing tab
4. Specify a username and password for scheduling jobs on the Schedule tab
5. Specify parallel job defaults on the Parallel tab
6. Specify job sequencer defaults on the Sequencer tab

Explain the DataStage Development workflow?

1. Define project properties - Administrator
2. Open (attach to) your project
3. Import metadata that defines the format of data stores your jobs will read from or write to
4. Design the job - Designer
5. Compile and debug the job - Designer


6. Run and monitor the job - Director

What is the DataStage project repository?

All your work is stored in a DataStage project. Projects are created during and after the installation process. You can add projects after installation on the Projects tab of Administrator.

The project directory is used by DataStage to store your jobs and other DataStage objects and metadata on your server.

Although multiple projects can be open at the same time, they are separate environments. You can, however, import and export objects between them.

Multiple users can be working in the same project at the same time. However, DataStage will prevent multiple users from editing the same DataStage object (job, table definition, etc.) at the same time.

What are the different types of DataStage jobs?

Parallel Jobs
1. Executed by the DataStage parallel engine
2. Built-in functionality for pipeline and partition parallelism
3. Compiled into OSH (Orchestrate Scripting Language)
4. OSH executes operators (instances of C++ classes)

Server Jobs
1. Executed by the DataStage server engine
2. Compiled into BASIC

Job Sequencers
1. Master Server jobs that kick off jobs and other activities
2. Can kick off Server or Parallel jobs
3. Executed by the DataStage server engine

What are the design elements of parallel jobs?

Stages - Implemented as OSH operators
  Passive Stages (E and L of ETL) - Read/Write data. E.g., Sequential File, DB2, Oracle, Peek stages
  Active Stages (T of ETL) - Transform/Filter/Aggregate/Generate/Split/Merge data. E.g., Transformer, Aggregator, Join, Sort stages

Links - Pipes through which the data moves from stage to stage

What are the different types of parallelism?

Pipeline Parallelism
1. Transform, clean, load processes execute simultaneously
2. Start downstream process while upstream process is running
3. Reduces disk usage for staging areas
4. Keeps processors busy
5. Still has limits on scalability

Partition Parallelism
1. Divide the incoming stream of data into subsets (partitions) to be processed by the same operator
2. The operation is performed on each partition of data separately and in parallel
3. Facilitates near-linear scalability provided the data is evenly distributed
4. If the data is evenly distributed, the data will be processed n times faster on n nodes


Installation and Deployment

What gets deployed as part of Information Server Domain?

1. Metadata Server, hosted by an IBM WebSphere Application Server instance
2. One or more DataStage servers
3. One DB2 UDB instance containing the repository database

Additional server applications:

1. Business Glossary
2. Federation Server
3. Information Analyzer
4. Information Services Director
5. Rational Data Architect

What are the Information Server clients?

1. Administration Console
2. Reporting Console
3. DataStage Clients - Administrator, Designer, Director

What are the different types of Information Server deployment?

1. Everything on one machine - all the applications in the domain are deployed on one machine
2. The domain is split between two machines - DataStage Server on one machine; Metadata Server and DB2 Repository on another
3. The domain is split between three machines - DataStage Server, Metadata Server and DB2 Repository on three different machines

Additional DataStage Servers can be part of this domain, but they would have to be separate from one another.

There is a possibility of additional DataStage player-node machines connected to the DataStage server machine using a high speed network

What are the components that should be running if the Application Server (hosting the Metadata Server) and the DataStage server are running on different machines?

1. The Application Server
2. The ASB agent

Administering DataStage

Explain the User and Group Management?

Suite Authorization can be provided to users or groups. Users that are members of a group acquire authorizations of the group.

Authorizations are provided in the form of roles:

1. Suite roles

a. Administrator - Performs user and group management tasks. Includes all the privileges of the Suite User role
b. User - Create views of scheduled tasks and logged messages. Create and run reports


2. Suite Component roles

a. DataStage Administrator - Full permission to work in DataStage Administrator, Designer and Director
b. DataStage User - Permissions are assigned within DataStage - Developer, Operator, Super Operator and Production Manager

A DataStage user cannot delete projects and cannot set permissions

A user ID that is assigned Suite roles can immediately log onto the Information Server Console.

What about a user ID that is assigned a DataStage Suite Component role? If the user ID is assigned the DataStage Administrator role, then the user will immediately acquire the DataStage Administrator permission for all projects.

If the user ID is assigned the DataStage User role, one more step is required. A DataStage administrator must assign a corresponding role to that user ID on the Permissions tab. When Suite users or groups have been assigned the DataStage Administrator role, they automatically appear on the Permissions tab. Suite users or groups that have a DataStage User role need to be manually added.

Explain The DataStage Credential Mapping?

All the Suite users without their own DataStage credentials will be mapped to this user ID and password. Here the username and password are demohawk/demohawk. demohawk is assumed to be a valid user on the DataStage Server machine and has file permissions on the DataStage engine and project directories.

Suite users can also be mapped individually to specific users

Note that demohawk need not be a Suite administrator or user

What information is required to log in to DataStage Administrator?

Domain - Host name and port number of the application server.

Recall that multiple DataStage servers can exist in a domain, although they must be on different machines.

DataStage server - The Server that has the DataStage projects you want to administer

Explain the DataStage roles?

1. DataStage Developer - full access to all areas of a DataStage project
2. DataStage Operator - run and manage released DataStage jobs
3. DataStage Super Operator - can open the Designer and view the repository in read-only mode
4. DataStage Production Manager - create and manipulate protected projects

DataStage Designer

Explain Import and Export and their corresponding procedures?

1. Backing up jobs and projects
2. Maintaining different versions of a job or project
3. Moving DataStage objects from one project to another
4. Sharing jobs and projects between developers

Export-->DataStage components


By default, objects are exported to a text file in a specific format. By default, the extension is dsx. Alternatively, you can export the objects to an XML document.

The directory you export to is on the DataStage client, not the server.

Objects can also be exported from the list of found objects using search functionality.

Import-->DataStage components

Use Import all to begin the import process. Use Import selected to import selected objects from the list.

Select the Overwrite without query button to overwrite objects with the same name without warning.

For large imports you may want to disable "Perform impact analysis." This adds overhead to the import process

Import-->Table Definitions

A table definition describes the columns and format of files and tables.

Table definitions for the following can be imported:
1. Sequential files
2. Relational tables
3. COBOL files
4. XML
5. ODBC data sources
etc.

Table definitions can be loaded into job stages that access data with the same format. In this sense the metadata is reusable.

Creating Parallel Jobs

What is a Parallel Job?

A parallel job is an executable DataStage program created in DataStage Designer using components from the repository. It compiles into Orchestrate Scripting Language (OSH) and object code (from generated C++).

DataStage jobs are:
1. Designed and built in Designer
2. Scheduled, invoked and monitored in Director
3. Executed under the control of DataStage

Use the import process in Designer to import metadata defining sources and targets

What are the benefits of renaming links and stages?

1. Documentation
2. Clarity
3. Fewer development errors

Explain the Row Generator stage?

1. Produces mock data
2. No input link; single output link
3. On the Properties tab, specify the number of rows
4. On the Columns tab, load or specify column definitions


You have a cluster of nodes available to run DataStage jobs. The network configuration between the servers is a private network with a 1 GB connection between each node. The public name is on a 100 MB network, which is what each hostname is identified with. In order to use the private network for communications between each node you need to use an alias for each node in the cluster. The Information Server Engine node (conductor node) is where the DataStage job starts.

Which environment variable must be used to identify the hostname for the Engine node?
A. $APT_SERVER_ENGINE
B. $APT_ENGINE_NODE
C. $APT_PM_CONDUCTOR_HOSTNAME
D. $APT_PM_NETWORK_NAME

Answer: C

Which three privileges must the user possess when running a parallel job? (Choose three.)
A. read access to APT_ORCHHOME
B. execute permissions on local copies of programs and scripts
C. read/write permissions to the UNIX /etc directory
D. read/write permissions to APT_ORCHHOME
E. read/write access to disk and scratch disk resources

Answer: A,B,E

Which two tasks will create DataStage projects? (Choose two.)
A. Export and import a DataStage project from DataStage Manager.
B. Add new projects from DataStage Administrator.
C. Install the DataStage engine.
D. Copy a project in DataStage Administrator.

Answer: B,C

Which three defaults are set in DataStage Administrator? (Choose three.)
A. default prompting options, such as Autosave job before compile
B. default SMTP mail server name
C. project level default for Runtime Column Propagation
D. project level defaults for environment variables
E. project level default for Auto-purge of job log entries

Answer: C,D,E

Which two must be specified to manage Runtime Column Propagation? (Choose two.)
A. enabled in DataStage Administrator
B. attached to a table definition in DataStage Manager
C. enabled at the stage level
D. enabled with environmental parameters set at runtime

Answer: A,C

You are reading customer data using a Sequential File stage and transforming it using the Transformer stage. The Transformer is used to cleanse the data by trimming spaces from character fields in the input. The cleansed data is to be written to a target DB2 table. Which partitioning method would yield optimal performance without violating the business requirements?
A. Hash on the customer ID field
B. Round Robin
C. Random
D. Entire


Answer: B

A job contains a Sort stage that sorts a large volume of data across a cluster of servers. The customer has requested that this sorting be done on a subset of servers identified in the configuration file to minimize impact on database nodes. Which two steps will accomplish this? (Choose two.)

A. Create a sort scratch disk pool with a subset of nodes in the parallel configuration file.
B. Set the execution mode of the Sort stage to sequential.
C. Specify the appropriate node constraint within the Sort stage.
D. Define a non-default node pool with a subset of nodes in the parallel configuration file.

Answer: C,D

You have a compiled job and parallel configuration file. Which three methods can be used to determine the number of nodes actually used to run the job in parallel? (Choose three.)
A. within DataStage Designer, generate report and retain intermediate XML
B. within DataStage Designer, show performance statistics
C. within DataStage Director, examine log entry for parallel configuration file
D. within DataStage Director, examine log entry for parallel job score
E. within DataStage Director, open a new DataStage Job Monitor

Answer: C,D,E

Which environment variable, when set to true, causes a report to be produced which shows the operators, processes and data sets in the job?
A. APT_DUMP_SCORE
B. APT_JOB_REPORT
C. APT_MONITOR_SIZE
D. APT_RECORD_COUNTS

Answer: A

A job reads from a dataset using a DataSet stage. This data goes to a Transformer stage and then is written to a sequential file using a Sequential File stage. The default configuration file has 3 nodes. The job creating the dataset and the current job both use the default configuration file. How many instances of the Transformer run in parallel?
A. 3
B. 1
C. 7
D. 9

Answer: A

Your job reads from a file using a Sequential File stage running sequentially. The DataStage server is running on a single SMP system. One of the columns contains a product ID. In a Lookup stage following the Sequential File stage, you decide to look up the product description from a reference table. Which two partition settings would correctly find matching product descriptions? (Choose two.)
A. Hash algorithm, specifying the product ID field as the key, on both the link coming from the Sequential File stage and the link coming from the reference table.
B. Round Robin on both the link coming from the Sequential File stage and the link coming from the reference table.
C. Round Robin on the link coming from the Sequential File stage and Entire on the link coming from the reference table.
D. Entire on the link coming from the Sequential File stage and Hash, specifying the product ID field as the key, on the link coming from the reference table.


Answer: A,C


A job design consists of an input fileset followed by a Peek stage, followed by a Filter stage, followed by an output fileset. The environment variable APT_DISABLE_COMBINATION is set to true, and the job executes on an SMP using a configuration file with 8 nodes defined. Assume also that the input dataset was created with the same 8 node configuration file. Approximately how many data processing processes will this job create?
A. 32
B. 8
C. 16
D. 1

Answer: A

Which two statements are true of the column data types used in Orchestrate schemas? (Choose two.)
A. Orchestrate schema column data types are the same as those used in DataStage stages.
B. Examples of Orchestrate schema column data types are varchar and integer.
C. Examples of Orchestrate schema column data types are int32 and string[max=30].
D. OSH import operators are needed to convert data read from sequential files into schema types.

Answer: C,D

You have set the "Preserve Partitioning" flag for a Sort stage to request that the next stage preserves whatever partitioning it has implemented. Which statement describes what will happen next?
A. The job will compile but will abort when run.
B. The job will not compile.
C. The next stage can ignore this request but a warning is logged when the job is run depending on the stage type that ignores the flag.
D. The next stage disables the partition options that are normally available in the Partitioning tab.

Answer: C

What is the purpose of the uv command in a UNIX DataStage server?
A. Cleanup resources from a failed DataStage job.
B. Start and stop the DataStage engine.
C. Provide read access to a DataStage EE configuration file.
D. Report DataStage client connections.

Answer: B

Which two statements regarding the usage of data types in the parallel engine are correct? (Choose two.)
A. The best way to import RDBMS data types is using the ODBC importer.
B. The parallel engine will use its interpretation of the Oracle meta data (e.g., exact data types) based on interrogation of Oracle, overriding what you may have specified in the Columns tabs.
C. The best way to import RDBMS data types is using the Import Orchestrate Schema Definitions using orchdbutil.
D. The parallel engine and server engine have exactly the same data types so there is no conversion cost overhead from moving data between the engines.

Answer: B,C

Which two describe a DataStage EE installation in a clustered environment? (Choose two.)
A. The C++ compiler must be installed on all cluster nodes.
B. Transform operators must be copied to all nodes of the cluster.
C. The DataStage parallel engine must be installed or accessible in the same directory on all machines in the cluster.
D. A remote shell must be configured to support communication between the conductor and section leader nodes.

Answer: C,D


Which partitioning method would yield the most even distribution of data without duplication?
A. Entire
B. Round Robin
C. Hash
D. Random

Answer: B

Which three accurately describe the differences between a DataStage server root installation and a non-root installation? (Choose three.)
A. A non-root installation enables auto-start on reboot.
B. A root installation must specify the user "dsadm" as the DataStage administrative user.
C. A non-root installation inherits the permissions of the user who starts the DataStage services.
D. A root installation will start DataStage services in impersonation mode.
E. A root installation enables auto-start on reboot.

Answer: C,D,E

Your job reads from a file using a Sequential File stage running sequentially. You are using a Transformer following the Sequential File stage to format the data in some of the columns. Which partitioning algorithm would yield optimized performance?
A. Hash
B. Random
C. Round Robin
D. Entire

Answer: C

Which three UNIX kernel parameters have minimum requirements for DataStage installations? (Choose three.)
A. MAXUPROC - maximum number of processes per user
B. NOFILES - number of open files
C. MAXPERM - disk cache threshold
D. NOPROC - no process limit
E. SHMMAX - maximum shared memory segment size

Answer: A,B,E

Which partitioning method requires specifying a key?
A. Random
B. DB2
C. Entire
D. Modulus

Answer: D

When a sequential file is written using a Sequential File stage, the parallel engine inserts an operator to convert the data from the internal format to the external format. Which operator is inserted?
A. export operator
B. copy operator
C. import operator
D. tsort operator

Answer: A


Which statement is true when Runtime Column Propagation (RCP) is enabled?
A. DataStage Manager does not import meta data.
B. DataStage Director does not supply row counts in the job log.
C. DataStage Designer does not enforce mapping rules.
D. DataStage Administrator does not allow default settings for environment variables.

Answer: C


Persistent Storage

Sequential File stage

The Sequential File stage is a file stage. It allows you to read data from or write data to one or more flat files. The stage can have a single input link or a single output link, and a single rejects link.

The stage executes in parallel mode if reading multiple files but executes sequentially if it is only reading one file. By default a complete file will be read by a single node (although each node might read more than one file). For fixed-width files, however, you can configure the stage to behave differently:

1. You can specify that single file can be read by multiple nodes. This can improve performance on cluster systems.

2. You can specify that a number of readers run on a single node. This means, for example, that a single file can be partitioned as it is read (even though the stage is constrained to running sequentially on the conductor node).

(These two options are mutually exclusive.)

File This property defines the flat file that data will be read from. You can type in a pathname, or browse for a file. You can specify multiple files by repeating the File property

File pattern Specifies a group of files to import. Specify a file containing a list of files, or a job parameter representing the file. The file could also contain any valid shell expression, in Bourne shell syntax, that generates a list of file names.

Read method This property specifies whether you are reading from a specific file or files or using a file pattern to select files (e.g., *.txt).

Missing file mode Specifies the action to take if one of your File properties has specified a file that does not exist. Choose from Error to stop the job, OK to skip the file, or Depends, which means the default is Error, unless the file has a node name prefix of *: in which case it is OK. The default is Depends.

Keep file partitions Set this to True to partition the imported data set according to the organization of the input file(s). So, for example, if you are reading three files you will have three partitions. Defaults to False.

Reject mode Allows you to specify behavior if a read record does not match the expected schema (record does not match the metadata defined in column definition). Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.

Report progress Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each 10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed length, and there is no filter on the file.

Number Of readers per node This is an optional property and only applies to files containing fixed-length records; it is mutually exclusive with the Read from multiple nodes property. Specifies the number of instances of the file read operator on a processing node. The default is one operator per node per input data file. If numReaders is greater than one, each instance of the file read operator reads a contiguous range of records from the input file.


This provides a way of partitioning the data contained in a single file. Each node reads a single file, but the file can be divided according to the number of readers per node, and written to separate partitions. This method can result in better I/O performance on an SMP system.
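For example (illustrative figures only), reading a fixed-width file of 1,000,000 records with Number of readers per node set to 4 would give each of the four reader instances a contiguous range of roughly 250,000 records, each written to its own partition.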

Read from multiple nodes This is an optional property and only applies to files containing fixed-length records; it is mutually exclusive with the Number of Readers Per Node property. Set this to Yes to allow individual files to be read by several nodes. This can improve performance on a cluster system. WebSphere DataStage knows the number of nodes available, and using the fixed-length record size and the actual size of the file to be read, allocates the reader on each node a separate region within the file to process. The regions will be of roughly equal size.

Note that sequential row order cannot be maintained when reading a file in parallel

File update mode This property defines how the specified file or files are updated. The same method applies to all files being written to. Choose from Append to append to existing files, Overwrite to overwrite existing files, or Create to create a new file. If you specify the Create property for a file that already exists you will get an error at runtime. By default this property is set to Overwrite.

Using RCP With Sequential Stages

Runtime column propagation (RCP) allows WebSphere DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, but ask WebSphere DataStage to propagate the other columns through the various stages. So such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between.

Sequential files, unlike most other data sources, do not have inherent column definitions, and so WebSphere DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property to specify a schema which describes all the columns in the sequential file. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that will require a schema file are:

1. Sequential File

2. File Set

3. External Source

4. External Target

5. Column Import

6. Column Export

Improving Sequential File Performance

If the source file is fixed width, the Readers Per Node option can be used to read a single input file in parallel at evenly-spaced offsets. Note that in this manner, input row order is not maintained.

If the input sequential file cannot be read in parallel, performance can still be improved by separating the file I/O from the column parsing operation. To accomplish this, define a single large string column for the non-parallel Sequential File read, and then pass this to a Column Import stage to parse the file in parallel. The formatting and column properties of the Column Import stage match those of the Sequential File stage.

On heavily-loaded file servers or some RAID/SAN array configurations, the environment variables $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE can be used to improve I/O performance. These settings specify the size of the read (import) and write (export) buffers in kilobytes, with a default of 128 (128K). Increasing this may improve performance.

Finally, in some disk array configurations, setting the environment variable $APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can significantly improve performance of Sequential File operations.

$APT_CONSISTENT_BUFFERIO_SIZE - Some disk arrays have read ahead caches that are only effective when data is read repeatedly in like-sized chunks. Setting APT_CONSISTENT_BUFFERIO_SIZE=N will force stages to read data in chunks which are size N or a multiple of N.
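As a sketch, these variables could be set in the job or project environment (for example through the DataStage Administrator or the dsenv file); the values below are illustrative only and should be tuned against your own hardware:

export APT_IMPORT_BUFFER_SIZE=256            # read (import) buffer in KB, default 128
export APT_EXPORT_BUFFER_SIZE=256            # write (export) buffer in KB, default 128
export APT_CONSISTENT_BUFFERIO_SIZE=1048576  # read in 1 MB chunks (or multiples of 1 MB)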

Partitioning Sequential File Reads
Care must be taken to choose the appropriate partitioning method for a Sequential File read:
1. Do not read from Sequential File using SAME partitioning! Unless more than one source file is specified, SAME will read the entire file into a single partition, making the entire downstream flow run sequentially (unless it is later repartitioned).
2. When multiple files are read by a single Sequential File stage (using multiple files, or by using a File Pattern), each file's data is read into a separate partition. It is important to use ROUND-ROBIN partitioning (or other partitioning appropriate to downstream components) to evenly distribute the data in the flow.

Sequential File (Export) Buffering
By default, the Sequential File (export operator) stage buffers its writes to optimize performance. When a job completes successfully, the buffers are always flushed to disk. The environment variable $APT_EXPORT_FLUSH_COUNT allows the job developer to specify how frequently (in number of rows) the Sequential File stage flushes its internal buffer on writes. Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small performance penalty associated with the increased I/O.
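For example, a near-real-time job that must make every row visible on disk as soon as it is written might use the setting below (the value is illustrative and carries an I/O cost):

export APT_EXPORT_FLUSH_COUNT=1   # flush the Sequential File write buffer after every row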

Reading from and Writing to Fixed-Length Files
Particular attention must be paid when processing fixed-length fields using the Sequential File stage:
1. If the incoming columns are variable-length data types (e.g., Integer, Decimal, Varchar), the field width column property must be set to match the fixed width of the input column. Double-click on the column number in the grid dialog to set this column property.
2. If a field is nullable, you must define the null field value and length in the Nullable section of the column properties. Double-click on the column number in the grid dialog to set these properties.
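As a hedged illustration, the equivalent record-level metadata for a fixed-width, nullable int32 column of width 10 whose nulls are represented by spaces might be expressed in a schema file roughly as follows (the column name is hypothetical):

record {record_delim='\n', delim=none}
(
  monthly_income: nullable int32 {width=10, null_field='          '};
)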

Data set stage

The Data Set stage is a file stage. It allows you to read data from or write data to a data set. The stage can have a single input link or a single output link. It can be configured to execute in parallel or sequential mode.

What is a data set? Parallel jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other WebSphere DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs. You can also manage data sets independently of a job using the Data Set Management utility, available from the WebSphere DataStage Designer or Director

A data set comprises a descriptor file and a number of other files that are added as the data set grows. These files are stored on multiple disks in your system.

The descriptor file for a data set contains the following information:


1. Data set header information.

2. Creation time and date of the data set.

3. The schema (metadata) of the data set.

4. A copy of the configuration file used when the data set was created.

Data Sets are the structured internal representation of data within the Parallel Framework

Consist of:

1. Framework schema (format: name, type, nullability)
2. Data records (the data itself)
3. Partitions (the subset of rows for each node)

Virtual Data Sets exist in memory and correspond to DataStage Designer links.

Persistent Data Sets are stored on disk. A persistent data set consists of:
1. A descriptor file (metadata, configuration file, data file locations, flags)
2. Multiple data files (one per node, stored in the disk resource file systems), e.g.:
   node1:/local/disk1/…
   node2:/local/disk2/…

There is no “DataSet” operator – the Designer GUI inserts a copy operator

When to Use Persistent Data Sets

When writing intermediate results between DataStage EE jobs, always write to persistent Data Sets (checkpoints):
1. Stored in native internal format (no conversion overhead)
2. Retain data partitioning and sort order (end-to-end parallelism across jobs)
3. Maximum performance through parallel I/O

Why Data Sets are not intended for long-term or archive storage

1. The internal format is subject to change with new DataStage releases
2. They require access to named resources (node names, file system paths, etc.)
3. The binary format is platform-specific

For fail-over scenarios, servers should be able to cross-mount file systems:
1. A data set can be read as long as your current $APT_CONFIG_FILE defines the same node names (fastnames may differ)
2. orchadmin -x lets you recover data from a data set if the node names are no longer available
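For reference, a minimal configuration file defining two nodes might look like the sketch below (node names, fastnames, and paths are illustrative; on a fail-over server typically only the fastname entries would change):

{
  node "node1" {
    fastname "etlserver01"
    pools ""
    resource disk "/local/disk1/datasets" {pools ""}
    resource scratchdisk "/local/scratch1" {pools ""}
  }
  node "node2" {
    fastname "etlserver01"
    pools ""
    resource disk "/local/disk2/datasets" {pools ""}
    resource scratchdisk "/local/scratch2" {pools ""}
  }
}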

Data Set Management

1. Viewing the schema

Click the Schema icon from the tool bar to view the record schema of the current data set. This is presented in text form in the Record Schema window.


2. Viewing the data

Click the Data icon from the tool bar to view the data held by the current data set. This opens the Data Viewer Options dialog box, which allows you to select a subset of the data to view.

Rows to display. Specify the number of rows of data you want the data browser to display.

Skip count. Skip the specified number of rows before viewing data.

Period. Display every Pth record where P is the period. You can start after records have been skipped by using the Skip property. P must equal or be greater than 1.

Partitions. Choose between viewing the data in All partitions or the data in the partition selected from the drop-down list. Click OK to view the selected data; the Data Viewer window appears.

3. Copying data sets

Click the Copy icon on the tool bar to copy the selected data set. The Copy data set dialog box appears, allowing you to specify a path where the new data set will be stored. The new data set will have the same record schema, number of partitions and contents as the original data set.

Note: You cannot use the UNIX cp command to copy a data set because WebSphere DataStage represents a single data set with multiple files.

4. Deleting data sets

Click the Delete icon on the tool bar to delete the current data set. You will be asked to confirm the deletion.

Note: You cannot use the UNIX rm command to delete a data set because WebSphere DataStage represents a single data set with multiple files. Using rm simply removes the descriptor file, leaving the much larger data files behind.

Orchadmin Commands

Orchadmin is a command line utility provided by DataStage for examining and managing data sets.

The general callable format is : $orchadmin <command> [options] [descriptor file]

Before using orchadmin, you should make sure that either the working directory or $APT_ORCHHOME/etc contains the file "config.apt", or that the environment variable $APT_CONFIG_FILE is defined for your session.

The various commands available with orchadmin are listed below; a short example session follows the list.

1. CHECK: $orchadmin check

Validates the configuration file contents, such as the accessibility of all nodes defined in the configuration file and the scratch disk definitions. Throws an error when the configuration file is not found or not defined properly.

2. COPY : $orchadmin copy <source.ds> <destination.ds>

Makes a complete copy of the source data set under the new destination descriptor file name. Please note that:


a. You cannot use the UNIX cp command, as it just copies the descriptor file to a new name; the data is not copied.

b. The new data set will be arranged according to the configuration file currently in use, not according to the old configuration file that was in use with the source.

3. DELETE : $orchadmin < delete | del | rm > [-f | -x] descriptorfiles….

The UNIX rm utility cannot be used to delete data sets. The orchadmin delete or rm command should be used to delete one or more persistent data sets.

-f makes a force delete. If some nodes are not accessible, -f deletes the data set partitions from the accessible nodes and leaves the partitions on inaccessible nodes as orphans.

-x forces the current configuration file to be used while deleting, rather than the one stored in the data set.

4. DESCRIBE: $orchadmin describe [options] descriptorfile.ds

This is the single most important command.

Without any options, it lists the number of partitions, number of segments, valid segments, and the preserve-partitioning flag of the persistent data set.

-c: Prints the configuration file stored in the data set, if any.

-p: Lists the partition-level information.

-f: Lists the file-level information in each partition.

-e: Lists the segment-level information.

-s: Lists the metadata schema of the data set.

-v: Lists all segments, valid or otherwise.

-l: Long listing. Equivalent to -f -p -s -v -e.

5. DUMP: $orchadmin dump [options] descriptorfile.ds

The dump command is used to dump (extract) the records from the data set. Without any options, the dump command lists all the records, starting from the first record of the first partition through the last record of the last partition.

-delim '<string>': Uses the given string as the field delimiter instead of a space.

-field <name>: Lists only the given field instead of all fields.

-name: Lists all values preceded by the field name and a colon.

-n numrecs: Lists only the given number of records per partition.

-p period(N): Lists every Nth record from each partition, starting from the first record.

-skip N: Skips the first N records in each partition.

-x: Uses the current system configuration file rather than the one stored in the data set.


6. TRUNCATE: $orchadmin truncate [options] descriptorfile.ds

Without options, deletes all the data (i.e., segments) from the data set.

-f: Force truncate. Truncates accessible segments and leaves the inaccessible ones.

-x: Uses the current system configuration file rather than the one stored in the data set.

-n N: Leaves the first N segments in each partition and truncates the remaining.

7. HELP: $orchadmin -help OR $orchadmin <command> -help

Displays the help manual for orchadmin or for a specific orchadmin command.
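A brief example session is sketched below (the paths, data set names, and chosen options are hypothetical):

$ export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/default.apt
$ orchadmin check                                    # validate the configuration file
$ orchadmin describe -l /data/ds/customers.ds        # partition, segment and schema details
$ orchadmin dump -name -n 5 /data/ds/customers.ds    # first 5 records per partition, with field names
$ orchadmin copy /data/ds/customers.ds /backup/ds/customers_bkp.ds
$ orchadmin rm /data/ds/customers_bkp.ds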

File set stage

The File Set stage is a file stage. It allows you to read data from or write data to a file set. The stage can have a single input link, a single output link, and a single rejects link. It only executes in parallel mode.

What is a file set? WebSphere DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns.

The amount of data that can be stored in each destination data file is limited by the characteristics of the file system and the amount of free disk space available. The number of files created by a file set depends on:
1. The number of processing nodes in the default node pool
2. The number of disks in the export or default disk pool connected to each processing node in the default node pool
3. The size of the partitions of the data set

The File Set stage enables you to create and write to file sets, and to read data back from file sets. Unlike data sets, file sets carry formatting information that describes the format of the files to be read or written.

File sets are similar to data sets:
1. Partitioned
2. Implemented with a header file and data files

File sets are different from data sets:
1. The data files of file sets are text files and hence are readable by other applications, whereas the data files of data sets are stored in native internal format and are readable only by DataStage.

Lookup file set stage

The Lookup File Set stage is a file stage. It allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link. The output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link.

When creating Lookup file sets, one file will be created for each partition. The individual files are referenced by a single descriptor file, which by convention has the suffix .fs.


When performing lookups, Lookup File Set stages are used with Lookup stages.

When you use a Lookup File Set stage as a source for lookup data, there are special considerations about column naming. If you have columns of the same name in both the source and lookup data sets, the source data set column will go to the output data. If you want this column to be replaced by the column from the lookup data source, you need to drop the source data column before you perform the lookup

http://www.dsxchange.com/viewtopic.php?t=113394

A Hashed File is only available in server jobs. It uses a hashing algorithm (without building an index) to determine the location of keys within its structure. It is not amenable to parallelism. The contents of a hashed file may be cached in memory when using the Hashed File stage to service a reference input link. New rows to be written to a hashed file may first be written to a memory cache, then flushed to disk. All writes to a hashed file using an existing key overwrite the previous row. Duplicate key values are not permitted.

A Lookup File Set is only available in parallel jobs. It uses an index (based on a hash table) to determine the location of keys within its structure. It is a parallel structure; it has its records spread over the processing nodes specified when it was created. The records in the Lookup File Set are loaded into a virtual Data Set before use, and the index is also loaded into memory. Duplicate key values are (optionally) permitted. If the option is not selected, duplicates are rejected when writing to the Lookup File Set.

http://www.dsxchange.com/viewtopic.php?t=93287

I did testing on a Windows machine processing 100,000 primary rows against 100,000 lookup rows with a 1 to 1 match. Two key fields of char 255 and two non key fields also of char 255. I deliberately chose fat key fields. The dataset as a lookup took 2-3 minutes. The fileset as a lookup took about 40 seconds. Ran it a few times with the same results.

One interesting result was memory utilisation: the fileset was consistently lighter than the dataset, by as much as 30% on RAM memory. This may be due to the keep/drop key field option of the fileset stage. If you set keep to false, the key fields in the fileset are not loaded into memory, as they are not required on the output side of the lookup. I am guessing that the fileset version was moving and storing 510 characters less for each lookup than the dataset version. In a normal lookup these key fields travel up the reference link and back down it again; in a lookup fileset they only travel up.

When I switch the same job onto an AIX box with several gig of RAM I get 7 seconds for the dataset and 4 for the fileset. With an increase to 500,000 rows I get 23 seconds for the dataset and 7 seconds for the fileset. This difference may not be so apparent if your key fields are shorter. The major drawback of a lookup fileset is that it doesn't have the Append option of a dataset; you can only overwrite it.

Creating a lookup file set
1. In the Input Link Properties tab:
– Specify the key that the lookup on this file set will ultimately be performed on. You can repeat this property to specify multiple key columns. You must specify the key when you create the file set; you cannot specify it when performing the lookup.
– Specify the name of the Lookup File Set.
– Specify a lookup range, or accept the default setting of No.
– Set Allow Duplicates, or accept the default setting of False.
2. Ensure column metadata has been specified for the lookup file set.

Looking up a lookup file set


1. In the Output Link Properties tab, specify the name of the lookup file set being used in the lookup.
2. Ensure column metadata has been specified for the lookup file set.

By default the stage will write to the file set in entire mode. The complete data set is written to each partition. If the Lookup File Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default (auto) collection method.


Complex Flat File stage

The Complex Flat File (CFF) stage is a file stage. You can use the stage to read a file or write to a file, but you cannot use the same stage to do both.

As a source, the CFF stage can have multiple output links and a single reject link. You can read data from one or more complex flat files, including MVS data sets with QSAM and VSAM files. You can also read data from files that contain multiple record types. The source data can contain one or more of the following clauses:

1. GROUP 2. REDEFINES 3. OCCURS 4. OCCURS DEPENDING ON

CFF source stages run in parallel mode when they are used to read multiple files, but you can configure the stage to run sequentially if it is reading only one file with a single reader.

As a target, the CFF stage can have a single input link and a single reject link. You can write data to one or more complex flat files. You cannot write to MVS data sets or to files that contain multiple record types.

Editing a Complex Flat File stage as a source

To edit a CFF stage as a source, you must provide details about the file that the stage will read, create record definitions for the data, define the column metadata, specify record ID constraints, and select output columns.

To edit a CFF stage as a source:
1. Open the CFF stage editor.
2. On the Stage page, specify information about the stage data:
a. On the File Options tab, provide details about the file that the stage will read.
b. On the Record Options tab, describe the format of the data in the file.
c. If the stage is reading a file that contains multiple record types, on the Records tab, create record definitions for the data.
d. On the Records tab, create or load column definitions for the data.
e. If the stage is reading a file that contains multiple record types, on the Records ID tab, define the record ID constraint for each record.
f. Optional: On the Advanced tab, change the processing settings.
3. On the Output page, specify how to read data from the source file:
a. On the Selection tab, select one or more columns for each output link.
b. Optional: On the Constraint tab, define a constraint to filter the rows on each output link.
c. Optional: On the Advanced tab, change the buffering settings.

4. Click OK to save your changes and to close the CFF stage editor.

Creating record definitions

If you are reading data from a file that contains multiple record types, you must create a separate record definition for each type. COBOL copybooks with multiple record types can be imported as a COBOL file definition (e.g., Insurance.cfd). Each record type is stored as a separate DataStage table definition (e.g., if Insurance.cfd has 3 record types for Client, Policy and Coverage, there will be 3 table definitions, one for each record type).

To create record definitions:
1. Click the Records tab on the Stage page.
2. Clear the Single record check box.
3. Right-click the default record definition RECORD_1 and select Rename Current Record.
4. Type a new name for the default record definition.


5. Add another record by clicking one of the buttons at the bottom of the records list. Each button offers a different insertion point. A new record is created with the default name of NEWRECORD.

6. Double-click NEWRECORD to rename it.
7. Repeat steps 3 and 4 for each new record that you need to create.
8. Right-click the master record in the list and select Toggle Master Record. Only one master record is permitted.

Column definitions
You must define columns to specify what data the CFF stage will read or write.

If the stage will read data from a file that contains multiple record types, you must first create record definitions on the Records tab. If the source file contains only one record type, or if the stage will write data to a target file, then the columns belong to the default record called RECORD_1.

You can load column definitions from a table in the repository, or you can type column definitions into the columns grid. You can also define columns by dragging a table definition from the Repository window to the CFF stage icon on the Designer canvas.

Loading columns
The fastest way to define column metadata is to load columns from a table definition in the repository.

To load columns:
1. Click the Records tab on the Stage page.
2. Click Load to open the Table Definitions window. This window displays all of the repository objects that are in the current project.
3. Select a table definition in the repository tree and click OK.
4. Select the columns to load in the Select Columns From Table window and click OK.
5. If flattening is an option for any arrays in the column structure, specify how to handle array data in the Complex File Load Option window.

Typing columns
You can also define column metadata by typing column definitions in the columns grid.

To type columns:
1. Click the Records tab on the Stage page.
2. In the Level number field of the grid, specify the COBOL level number where the data is defined. If you do not specify a level number, a default value of 05 is used.
3. In the Column name field, type the name of the column.
4. In the Native type field, select the native data type.
5. In the Length field, specify the data precision.
6. In the Scale field, specify the data scale factor.
7. Optional: In the Description field, type a description of the column.

Defining record ID constraints
If you are using the CFF stage to read data from a file that contains multiple record types, you must specify a record ID constraint to identify the format of each record.

Columns that are identified in the record ID clause must be in the same physical storage location across records. The constraint must be a simple equality expression, where a column equals a value.

To define a record ID constraint:
1. Click the Records ID tab on the Stage page.
2. Select a record from the Records list.


3. Select the record ID column from the Column list. This list displays all columns from the selected record, except the first OCCURS DEPENDING ON (ODO) column and any columns that follow it.

4. Select the = operator from the Op list.
5. Type the identifying value for the record ID column in the Value field. Character values must be enclosed in single quotation marks.

Selecting output columns
By selecting output columns, you specify which columns from the source file the CFF stage should pass to the output links.

You can select columns from multiple record types to output from the stage. If you do not select columns to output on each link, the CFF stage automatically propagates all of the stage columns except group columns to each empty output link when you click OK to exit the stage.

To select output columns:
1. Click the Selection tab on the Output page.
2. If you have multiple output links, select the link that you want from the Output name list.

Defining output link constraints
By defining a constraint, you can filter the data on each output link from the CFF stage.

You can set the output link constraint to match the record ID constraint for each selected output record by clicking Default on the Constraint tab on the Output page. The Default button is available only when the constraint grid is empty.

To define an output link constraint:
1. Click the Constraint tab on the Output page.
2. In the ( field of the grid, select an opening parenthesis if needed. You can use parentheses to specify the order in which a complex constraint expression is evaluated.
3. In the Column field, select a column or job parameter. (Group columns cannot be used in constraint expressions and are not displayed.)
4. In the Op field, select an operator or a logical function.
5. In the Column/Value field, select a column or job parameter, or double-click in the cell to type a value. Enclose character values in single quotation marks.
6. In the ) field, select a closing parenthesis if needed.
7. If you are building a complex expression, in the Logical field, select AND or OR to continue the expression in the next row.
8. Click Verify. If errors are found, you must either correct the expression, click Clear All to start over, or cancel. You cannot save an incorrect constraint.

Editing a Complex Flat File stage as a target
To edit a CFF stage as a target, you must provide details about the file that the stage will write, define the record format of the data, and define the column metadata.

To edit a CFF stage as a target:
1. Open the CFF stage editor.
2. On the Stage page, specify information about the stage data:
a. On the File Options tab, provide details about the file that the stage will write.
b. On the Record Options tab, describe the format of the data in the file.
c. On the Records tab, create or load column definitions for the data.
d. Optional: On the Advanced tab, change the processing settings.
3. Optional: On the Input page, specify how to write data to the target file:
a. On the Advanced tab, change the buffering settings.


b. On the Partitioning tab, change the partitioning settings.
4. Click OK to save your changes and to close the CFF stage editor.

Reject links
The CFF stage can have a single reject link, whether you use the stage as a source or a target.

For CFF source stages, reject links are supported only if the source file contains a single record type without any OCCURS DEPENDING ON (ODO) columns. For CFF target stages, reject links are supported only if the target file does not contain ODO columns.

You cannot change the selection properties of a reject link. The Selection tab for a reject link is blank.

You cannot edit the column definitions for a reject link. For writing files, the reject link uses the input link column definitions. For reading files, the reject link uses a single column named "rejected" that contains raw data for the columns that were rejected after reading because they did not match the schema.

FTP Enterprise Stage

The FTP Enterprise stage transfers multiple files in parallel. These are sets of files that are transferred from one or more FTP servers into WebSphere DataStage or from WebSphere DataStage to one or more FTP servers. The source or target for the file is identified by a URI (Universal Resource Identifier). The FTP Enterprise stage invokes an FTP client program and transfers files to or from a remote host using the FTP Protocol.

URI Is a pathname connecting the Stage to a target file on a remote host. It has the Open dependent property. You can repeat this property to specify multiple URIs. You can specify an absolute or a relative pathname.

Open command Is required if you perform any operation besides navigating to the directory where the file exists. There can be multiple Open commands. This is a dependent property of URI.

ftp command Is an optional command that you can specify if you do not want to use the default ftp command. For example, you could specify /opt/gnu/bin/wuftp. You can enter the path of the command (on the server) directly in this field. You can also specify a job parameter if you want to be able to specify the ftp command at run time.

User Name Specify the user name for the transfer. You can enter it directly in this field, or you can specify a job parameter if you want to be able to specify the user name at run time. You can specify multiple user names. User1 corresponds to URI1 and so on. When the number of users is less than the number of URIs, the last user name is set for remaining URIs. If no User Name is specified, the FTP Enterprise Stage tries to use .netrc file in the home directory.

Password Enter the password in this field. You can also specify a job parameter if you want to be able to specify the password at run time. Specify a password for each user name. Password1 corresponds to URI1. When the number of passwords is less than the number of URIs, the last password is set for the remaining URIs.

Transfer Protocol Select the type of FTP service to transfer files between computers. You can choose either FTP or Secure FTP (SFTP).
1. FTP Select this option if you want to transfer files using the standard FTP protocol. This is a nonsecure protocol. By default, the FTP Enterprise stage uses this protocol to transfer files.


2. Secure FTP (SFTP) Select this option if you want to transfer files between computers in a secured channel. Secure FTP (SFTP) uses the SSH (Secured Shell) protected channel for data transfer between computers over a nonsecure network such as a TCP/IP network. Before you can use SFTP to transfer files, you should configure the SSH connection without any pass phrase for RSA authentication.

Force Parallelism You can set either Yes or No. In general, the FTP Enterprise stage tries to start as many processes as needed to transfer the n files in parallel. However, you can force the parallel transfer of data by setting this property to Yes. This allows m processes at a time, where m is the number specified in the WebSphere DataStage configuration file. If m is less than n, the stage transfers the first m files and then starts the next m, until all n files are transferred.

When you set Force Parallelism to Yes, you should only give one URI.

Overwrite Set this option to have any existing files overwritten by this transfer.

Restartable Mode When you specify a restartable mode of Restartable transfer, WebSphere DataStage creates a directory for recording information about the transfer in a restart directory. If the transfer fails, you can run an identical job with the restartable mode property set to Restart transfer, which will reattempt the transfer. If the transfer repeatedly fails, you can run an identical job with the restartable mode option set to Abandon transfer, which will delete the restart directory

Restartable mode has the following dependent properties:
1. Job Id Identifies a restartable transfer job. This is used to name the restart directory.
2. Checkpoint directory Optionally specifies a checkpoint directory to contain restart directories. If you do not specify this, the current working directory is used. For example, if you specify a job_id of 100 and a checkpoint directory of /home/bgamsworth/checkpoint, the files would be written to /home/bgamsworth/checkpoint/pftp_jobid_100.

Schema file Contains a schema for storing data. Setting this option overrides any settings on the Columns tab. You can enter the path name of a schema file, or specify a job parameter, so the schema file name can be specified at run time.

Transfer Type Select a data transfer type to transfer files between computers. You can select either the Binary or ASCII mode of data transfer. The default data transfer mode is binary.

When reading a delimited Sequential File, you are instructed to interpret two contiguous field delimiters as NULL for the corresponding field regardless of data type. Which three actions must you take? (Choose three.)
A. Set the data type to Varchar.
B. Set the field to nullable.
C. Set the "NULL Field Value" to two field delimiters (e.g., "||" for pipes).
D. Set the "NULL Field Value" to ''.
E. Set the environment variable $APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL.

Answer: B,D,E

$APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL - When set, allows zero length null_field value with fixed length fields. This should be used with care as poorly formatted data will cause incorrect results. By default a zero length null_field value will cause an error.


Which two attributes are found in a Data Set descriptor file? (Choose two.)
A. A copy of the job score.
B. The schema of the Data Set.
C. A copy of the partitioned data.
D. A copy of the configuration file used when Data Set was created.

Answer: B,D

When importing a COBOL file definition, which two are required? (Choose two.)
A. The file you are importing is accessible from your client workstation.
B. The file you are importing contains level 01 items.
C. The column definitions are in a COBOL copybook file and not, for example, in a COBOL source file.
D. The file does not contain any OCCURS DEPENDING ON clauses.

Answer: A,B

Which three features of datasets make them suitable for job restart points? (Choose three.)
A. They are indexed for fast data access.
B. They are partitioned.
C. They use datatypes that are in the parallel engine internal format.
D. They are persistent.
E. They are compressed to minimize storage space.

Answer: B,C,D

Which statement describes a process for capturing a COBOL copybook from a z/OS system?
A. FTP the COBOL copybook to the server platform in text mode and capture the metadata through Manager.
B. Select the COBOL copybook using the Browse button and capture the COBOL copybook with Manager.
C. FTP the COBOL copybook to the client workstation in text mode and capture the copybook with Manager.
D. FTP the COBOL copybook to the client workstation in binary and capture the metadata through Manager.

Answer: C

The high performance ETL server on which DataStage EE is installed is networked with several other servers in the IT department with a very high bandwidth switch. A list of seven files (all of which contain records with the same record layout) must be retrieved from three of the other servers using FTP. Given the high bandwidth network and high performance ETL server, which approach will retrieve and process all seven files in the minimal amount of time?
A. In a single job, use seven separate FTP Enterprise stages the output links of which lead to a single Sort Funnel stage, then process the records without landing to disk.
B. Setup a sequence of seven separate DataStage EE jobs, each of which retrieves a single file and appends to a common dataset, then process the resulting dataset in an eighth DataStage EE job.
C. Use three FTP Plug-in stages (one for each machine) to retrieve the seven files and store them to a single file on the fourth server, then use the FTP Enterprise stage to retrieve the single file and process the records without landing to disk.
D. Use a single FTP Enterprise stage and specify seven URI properties, one for each file, then process the records without landing to disk.

Answer: D

An XML file is being processed by the XML Input stage. How can repetition elements be identified on the stage?
A. No special settings are required. XML Input stage automatically detects the repetition element from the XPath expression.
B. Set the "Key" property for the column on the output link to "Yes".


C. Check the "Repetition Element Required" box on the output link tab.
D. Set the "Nullable" property for the column on the output link to "Yes".

Answer: B

Using FTP, a file is transferred from an MVS system to a LINUX system in binary transfer mode. Which data conversion must be used to read a packed decimal field in the file?
A. treat the field as a packed decimal
B. packed decimal fields are not supported
C. treat the field as ASCII
D. treat the field as EBCDIC

Answer: A

When a sequential file is read using a Sequential File stage, the parallel engine inserts an operator to convert the data to the internal format. Which operator is inserted?
A. import operator
B. copy operator
C. tsort operator
D. export operator

Answer: A

Which type of file is both partitioned and readable by external applications?
A. fileset
B. Lookup fileset
C. dataset
D. sequential file

Answer: A

Which two statements are true about XML Meta Data Importer? (Choose two.)
A. XML Meta Data Importer is capable of reporting syntax and semantic errors from an XML file.
B. XPATH expressions that are created during XML metadata import cannot be modified.
C. XML Meta Data Importer can import Table Definitions from only XML documents.
D. XPATH expressions that are created during XML metadata import are used by XML Input stage and XML Output stage.

Answer: A,D

Which two statements are correct about XML stages and their usage? (Choose two.)
A. XML Input stage converts XML data to tabular format.
B. XML Output stage converts tabular data to XML hierarchical structure.
C. XML Output stage uses XSLT stylesheet for XML to tabular transformations.
D. XML Transformer stage converts XML data to tabular format.

Answer: A,B

Which "Reject Mode" option in the Sequential File stage will write records to a reject link?
A. Output
B. Fail
C. Drop
D. Continue

Answer: A


A single sequential file exists on a single node. To read this sequential file in parallel, what should be done?
A. Set the Execution mode to "Parallel".
B. A sequential file cannot be read in parallel using the Sequential File stage.
C. Select "File Pattern" as the Read Method.
D. Set the "Number of Readers Per Node" optional property to a value greater than 1.

Answer: D

When a sequential file is written using a Sequential File stage, the parallel engine inserts an operator to convert the data from the internal format to the external format. Which operator is inserted?
A. export operator
B. copy operator
C. import operator
D. tsort operator

Answer: A

A bank receives daily credit score updates from a credit agency in the form of a fixed width flat file. The monthly_income column is an unsigned nullable integer (int32) whose width is specified as 10, and null values are represented as spaces. Which Sequential File property will properly import any nulls in the monthly_income column of the input file?
A. Set the record level fill char property to the space character (' ').
B. Set the null field value property to a single space (' ').
C. Set the C_format property to '"%d. 10"'.
D. Set the null field value property to ten spaces (' ').

Answer: D

An XML file is being processed by the XML Input stage. How can repetition elements be identified on the stage?
A. Set the "Nullable" property for the column on the output link to "Yes".
B. Set the "Key" property for the column on the output link to "Yes".
C. Check the "Repetition Element Required" box on the output link tab.
D. No special settings are required. XML Input stage automatically detects the repetition element from the XPath expression.

Answer: B

During a sequential file read, you experience an error with the data. What is a valid technique for identifying the column causing the difficulty?
A. Set the "data format" option to text on the Record Options tab.
B. Enable tracing in the DataStage Administrator Tracing panel.
C. Enable the "print field" option at the Record Options tab.
D. Set the APT_IMPORT_DEBUG environmental variable.

Answer: C

On which two does the number of data files created by a fileset depend? (Choose two.)
A. the size of the partitions of the dataset
B. the number of CPUs
C. the schema of the file
D. the number of processing nodes in the default node pool

Answer: A,D

What are two ways to delete a persistent parallel dataset? (Choose two.)
A. standard UNIX command rm
B. orchadmin command rm


C. delete the dataset Table Definition in DataStage Manager
D. delete the dataset in Data Set Manager

Answer: B,D

A parts supplier has a single fixed width sequential file. Reading the file has been slow, so the supplier would like to try to read it in parallel. If the job executes using a configuration file consisting of four nodes, which two Sequential File stage settings will cause the DataStage parallel engine to read the file using four parallel readers? (Choose two.) (Note: Assume the file path and name is /data/parts_input.txt.)
A. Set the read method to specific file(s), set the file property to '/data/parts_input.txt', and set the number of readers per node option to 2.
B. Set the read method to specific file(s), set the file property to '/data/parts_input.txt', and set the read from multiple nodes option to yes.
C. Set read method to file pattern, and set the file pattern property to '/data/(@PART_COUNT)parts_input.txt'.
D. Set the read method to specific file(s), set the file property to '/data/parts_input.txt', and set the number of readers per node option to 4.

Answer: B,D


Data Transformation

Transformer Stage

A Transformer stage can have a single input and any number of outputs. It can also have a reject link that takes any rows which have not been written to any of the output links because of a write failure or expression evaluation failure.

In order to write efficient Transformer stage derivations, it is useful to understand what items get evaluated and when. The evaluation sequence is as follows:

Evaluate each stage variable initial value

For each input row to process:

Evaluate each stage variable derivation value, unless the derivation is empty

For each output link:

Evaluate each column derivation value

Write the output record

Next output link

Next input row

The stage variables and the columns within a link are evaluated in the order in which they are displayed on the parallel job canvas. Similarly, the output links are also evaluated in the order in which they are displayed.
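Because stage variables are evaluated top-down before any output derivations for each row, they can carry values forward from the previous row. A common sketch for detecting the first row of a group (the column and variable names are hypothetical) is:

svIsNewGroup:      If in.CustomerID <> svPrevCustomerID Then 1 Else 0
svPrevCustomerID:  in.CustomerID

Here svIsNewGroup must appear above svPrevCustomerID in the stage variable list so that it compares the current row against the value saved from the previous row.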

System variables
WebSphere DataStage provides a set of variables containing useful system information that you can access from an output derivation or constraint.

1. @FALSE The value is replaced with 0.
2. @TRUE The value is replaced with 1.
3. @INROWNUM Input row counter.
4. @OUTROWNUM Output row counter (per link).
5. @NUMPARTITIONS The total number of partitions for the stage.
6. @PARTITIONNUM The partition number for the particular instance.
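For example, a derivation that generates integers unique across all partitions (assuming @INROWNUM starts at 1 and @PARTITIONNUM is zero-based) can be sketched as:

(@INROWNUM - 1) * @NUMPARTITIONS + @PARTITIONNUM + 1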

Triggers tab The Triggers tab allows you to choose routines to be executed at specific execution points as the transformer stage runs in a job. The execution point is per-instance, i.e., if a job has two transformer stage instances running in parallel, the routine will be called twice, once for each instance.

The available execution points are Before-stage and After-stage. At this release, the only available built-in routine is SetCustomSummaryInfo. You can also define custom routines to be executed; to do this you define a C function, make it available in a UNIX shared library, and then define a Parallel routine which calls it (see the WebSphere DataStage Designer Client Guide for details on defining a Parallel Routine). Note that the function should not return a value.


A constraint otherwise link can be defined by:
1. Clicking on the Otherwise/Log field so a tick appears and leaving the Constraint fields blank. This will catch any rows that have failed to meet constraints on all the previous output links.
2. Setting the constraint to OTHERWISE. This will be set whenever a row is rejected on a link because the row fails to match a constraint. OTHERWISE is cleared by any output link that accepts the row.
3. The otherwise link must occur after the output links in link order so it will catch rows that have failed to meet the constraints of all the output links. If it is not last, rows that satisfy a constraint on a later link may be sent down the otherwise link as well as down that later link.
4. Clicking on the Otherwise/Log field so a tick appears and defining a Constraint. This will result in the number of rows written to that link (i.e., rows which satisfy the constraint) being recorded in the job log as a warning message.

Note: You can also specify a reject link which will catch rows that have not been written on any output links due to a write error or null expression error. Define this outside the Transformer stage by adding a link and using the shortcut menu to convert it to a reject link.

Conditionally Aborting a Job
Use the "Abort After Rows" setting in the output link constraints of the parallel Transformer to conditionally abort a parallel job. You can specify an abort condition for any output link. The abort occurs after the specified number of rows occurs in one of the partitions. When the "Abort After Rows" threshold is reached, the Transformer immediately aborts the job flow, potentially leaving uncommitted database rows or un-flushed file buffers.

Functions and Operators
Concatenation operator: ":"
Substring operator: Input_String[starting position, length]
String functions
1. Len(<string>)
2. Trim(<string>)
3. UpCase/DownCase(<string>)
Null handling functions
1. IsNull
2. IsNotNull
3. NullToValue
4. NullToZero
5. SetNull()
Type conversions
1. StringToTimestamp
2. StringToDecimal
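A short, hedged example of how these operators and functions might combine in an output column derivation (the input column names are hypothetical):

If IsNull(in.MiddleName)
Then UpCase(Trim(in.FirstName)) : ' ' : UpCase(Trim(in.LastName))
Else UpCase(Trim(in.FirstName)) : ' ' : Trim(in.MiddleName) : ' ' : UpCase(Trim(in.LastName))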

Using Transformer stages
In general, it is good practice not to use more Transformer stages than you have to. You should especially avoid using multiple Transformer stages where the logic can be combined into a single stage. It is often better to use other stage types for certain types of operation:
1. Use a Copy stage rather than a Transformer for simple operations such as:
– Providing a job design placeholder on the canvas. (Provided you do not set the Force property to True on the Copy stage, the copy will be optimized out of the job at run time.)
– Renaming columns.
– Dropping columns.
– Implicit type conversions.
Note that, if runtime column propagation is disabled, you can also use output mapping on a stage to rename, drop, or convert columns on a stage that has both inputs and outputs.
2. Use the Modify stage for explicit type conversion and null handling.
3. Where complex, reusable logic is required, or where existing Transformer-stage based job flows do not meet performance requirements, consider building your own custom stage.


4. Use a BASIC Transformer stage where you want to take advantage of user-defined functions and routines.

SCD Stage

The SCD stage reads source data on the input link, performs a dimension table lookup on the reference link, and writes data on the output link. The output link can pass data to another SCD stage, to a different type of processing stage, or to a fact table. The dimension update link is a separate output link that carries changes to the dimension. You can perform these steps in a single job or a series of jobs, depending on the number of dimensions in your database and your performance requirements.

SCD stages support both SCD Type 1 and SCD Type 2 processing: 1. SCD Type 1 Overwrites an attribute in a dimension table. 2. SCD Type 2 Adds a new row to a dimension table.

Each SCD stage processes a single dimension and performs lookups by using an equality matching technique. If the dimension is a database table, the stage reads the database to build a lookup table in memory. If a match is found, the SCD stage updates rows in the dimension table to reflect the changed data. If a match is not found, the stage creates a new row in the dimension table. All of the columns that are needed to create a new dimension row must be present in the source data.

Purpose codes in a Slowly Changing Dimension stage
Purpose codes are an attribute of dimension columns in SCD stages. Purpose codes are used to build the lookup table, to detect dimension changes, and to update the dimension table.

Building the lookup table The SCD stage uses purpose codes to determine how to build the lookup table for the dimension lookup. If a dimension has only Type 1 columns, the stage builds the lookup table by using all dimension rows. If any Type 2 columns exist, the stage builds the lookup table by using only the current rows. If a dimension has a Current Indicator column, the stage uses the derivation value of this column on the Dim Update tab to identify the current rows of the dimension table. If a dimension does not have a Current Indicator column, then the stage uses the Expiration Date column and its derivation value to identify the current rows. Any dimension columns that are not needed are not used. This technique minimizes the amount of memory that is required by the lookup table.

Detecting dimension changes Purpose codes are also used to detect dimension changes. The SCD stage compares Type 1 and Type 2 column values to source column values to determine whether to update an existing row, insert a new row, or expire a row in the dimension table.

Updating the dimension table Purpose codes are part of the column metadata that the SCD stage propagates to the dimension update link. You can send this column metadata to a database stage in the same job, or you can save the metadata on the Columns tab and load it into a database stage in a different job. When the database stage uses the auto-generated SQL option to perform inserts and updates, it uses the purpose codes to generate the correct SQL statements.

Selecting purpose codes Purpose codes specify how the SCD stage should process dimension data. Purpose codes apply to columns on the dimension reference link and on the dimension update link. Select purpose codes according to the type of columns in a dimension:


1. If a dimension contains a Type 2 column, you must select a Current Indicator column, an Expiration Date column, or both. An Effective Date column is optional. You cannot assign Type 2 and Current Indicator to the same column.

2. If a dimension contains only Type 1 columns, no Current Indicator, Effective Date, Expiration Date, or SK Chain columns are allowed.

Purpose code definitions
The SCD stage provides nine purpose codes to support dimension processing.

1. (blank) The column has no SCD purpose. This purpose code is the default.

2. Surrogate Key The column is a surrogate key that is used to identify dimension records.

3. Business Key The column is a business key that is typically used in the lookup condition.

4. Type 1 The column is an SCD Type 1 field. SCD Type 1 column values are always current. When changes occur, the SCD stage overwrites existing values in the dimension table.

5. Type 2 The column is an SCD Type 2 field. SCD Type 2 column values represent a point in time. When changes occur, the SCD stage creates a new dimension row.

6. Current Indicator (Type 2) The column is the current record indicator for SCD Type 2 processing. Only one Current Indicator column is allowed.

7. Effective Date (Type 2) The column is the effective date for SCD Type 2 processing. Only one Effective Date column is allowed.

8. Expiration Date (Type 2) The column is the expiration date for SCD Type 2 processing. An Expiration Date column is required if there is no Current Indicator column; otherwise it is optional.

9. SK Chain The column is used to link a record to the previous record or the next record by using the value of the Surrogate Key column. Only one Surrogate Key column can exist if you have an SK Chain column.

Surrogate keys in a Slowly Changing Dimension stage Surrogate keys are used to join a dimension table to a fact table in a star schema database.

When the SCD stage performs a dimension lookup, it retrieves the value of the existing surrogate key if a matching record is found. If a match is not found, the stage obtains a new surrogate key value by using the derivation of the Surrogate Key column on the Dim Update tab. If you want the SCD stage to generate new surrogate keys by using a key source that you created with a Surrogate Key Generator stage, you must use the NextSurrogateKey function to derive the Surrogate Key column. If you want to use your own method to handle surrogate keys, you should derive the Surrogate Key column from a source column.

You can replace the dimension information in the source data stream with the surrogate key value by mapping the Surrogate Key column to the output link.

Specifying information about a key source If you created a key source with a Surrogate Key Generator stage, you must specify how the SCD stage should use the source to generate surrogate keys.

The key source can be a flat file or a database sequence. The key source must exist before the job runs. If the key source is a flat file, the file must be accessible from all nodes that run the SCD stage.

To use the key source:

1. On the Input page, select the reference link in the Input name field.


2. Click the Surrogate Key tab.
3. In the Source type field, select the source type.
4. In the Source name field, type the name of the key source, or click the arrow button to browse for a file or to insert a job parameter. If the source is a flat file, type the name and fully qualified path of the state file, such as C:/SKG/ProdDim. If the source is a database sequence, type the name of the sequence, such as PRODUCT_KEY_SEQ.
5. Provide additional information about the key source according to the type: if the source is a flat file, specify information in the Flat File area; if the source is a database sequence, specify information in the DB sequence area.

Calls to the key source are made by the NextSurrogateKey function. On the Dim Update tab, create a derivation that uses the NextSurrogateKey function for the column that has a purpose code of Surrogate Key. The NextSurrogateKey function returns the value of the next surrogate key when the SCD stage creates a new dimension row.
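For instance, the derivation of the Surrogate Key column on the Dim Update tab would typically be just the function call below (a sketch); the returned value can then be mapped to the output link in place of the business key:

NextSurrogateKey()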

A DataStage job contains a parallel Transformer with a single input link and a single output link. The Transformer has a constraint that should produce 1000 records, however only 900 came out through the output link. What should be done to identify the missing records?
A. Turn trace on using DataStage Administrator.
B. Add a Reject link to the Transformer stage.
C. Scan generated osh script for possible errors.
D. Remove the constraint on the output link.

Answer: B

Which three actions are performed using stage variables in a parallel Transformer stage? (Choose three.)
A. A function can be executed once per record.
B. A function can be executed once per run.
C. Identify the first row of an input group.
D. Identify the last row of an input group.
E. Look up a value from a reference dataset.

Answer: A,B,C

Which two system variables must be used in a parallel Transformer derivation to generate a unique sequence of integers across partitions? (Choose two.)
A. @PARTITIONNUM
B. @INROWNUM
C. @DATE
D. @NUMPARTITIONS

Answer: A,D

What would require creating a new parallel Custom stage rather than a new parallel BuildOp stage?
A. A Custom stage can be created with properties. BuildOp stages cannot be created with properties.
B. In a Custom stage, the number of input links does not have to be fixed, but can vary, for example from one to two. BuildOp stages require a fixed number of input links.
C. Creating a Custom stage requires knowledge of C/C++. You do not need knowledge of C/C++ to create a BuildOp stage.
D. Custom stages can be created for parallel execution. BuildOp stages can only be built to run sequentially.

Answer: B


Your input rows contain customer data from a variety of locations. You want to select just those rows from a specified location based on a parameter value. You are trying to decide whether to use a Transformer or a Filter stage to accomplish this. Which statement is true?
A. The Transformer stage will yield better performance because the Filter stage Where clause is interpreted at runtime.
B. You cannot use a Filter stage because you cannot use parameters in a Filter stage Where clause.
C. The Filter stage will yield better performance because it has less overhead than a Transformer stage.
D. You cannot use the Transformer stage because you cannot use parameters in a Transformer stage constraint.

Answer: A

In a Transformer you add a new column to an output link named JobName that is to contain the name of the job that is running. What can be used to derive values for this column?
A. a DataStage function
B. a link variable
C. a system variable
D. a DataStage macro

Answer: D

Which statement describes how to add functionality to the Transformer stage?
A. Create a new parallel routine in the Routines category that specifies the name, path, type, and return type of a function written and compiled in C++.
B. Create a new parallel routine in the Routines category that specifies the name, path, type, and return type of an external program.
C. Create a new server routine in the Routines category that specifies the name and category of a function written in DataStage Basic.
D. Edit the C++ code generated by the Transformer stage.

Answer: A

Which three statements about the Enterprise Edition parallel Transformer stage are correct? (Choose three.)
A. The Transformer allows you to copy columns.
B. The Transformer allows you to do lookups.
C. The Transformer allows you to apply transforms using routines.
D. The Transformer stage automatically applies the 'NullToValue' function to all non-nullable output columns.
E. The Transformer allows you to do data type conversions.

Answer: A,C,E

Which two stages allow field names to be specified using job parameters? (Choose two.)
A. Transformer stage
B. Funnel stage
C. Modify stage
D. Filter stage

Answer: C,D

The parallel dataset input into a Transformer stage contains null values. What should you do to properly handle these null values?
A. Convert null values to valid values in a stage variable.
B. Convert null values to a valid value in the output column derivation.
C. Null values are automatically converted to blanks and zero, depending on the target data type.
D. Trap the null values in a link constraint to avoid derivations.

Answer: A


Which two would require the use of a Transformer stage instead of a Copy stage? (Choose two.)
A. Drop a column.
B. Send the input data to multiple output streams.
C. Trim spaces from a character field.
D. Select certain output rows based on a condition.

Answer: C,D

In which situation should a BASIC Transformer stage be used in a DataStage EE job?
A. in a job containing complex routines migrated from DataStage Server Edition
B. in a job requiring lookups to hashed files
C. in a large-volume job flow
D. in a job requiring complex, reusable logic

Answer: A

You have three output links coming out of a Transformer. Two of them (A and B) have constraints you have defined. The third you want to be an Otherwise link that is to contain all of the rows that do not satisfy the constraints of A and B. This Otherwise link must work correctly even if the A and B constraints are modified. Which two are required? (Choose two.)
A. The Otherwise link must be first in the link ordering.
B. A constraint must be coded for the Otherwise link.
C. The Otherwise link must be last in the link ordering.
D. The Otherwise check box must be checked.

Answer: C,D

Which two statements are true about DataStage Parallel BuildOp stages? (Choose two.)
A. Unlike standard DataStage stages, they do not have properties.
B. They are coded using C/C++.
C. They are coded using DataStage Basic.
D. Table Definitions are used to define the input and output interfaces of the BuildOp.

Answer: B,D


Job Control and Run time Management

Message Handlers

When you run a parallel job, any error messages and warnings are written to an error log and can be viewed from the Director. You can choose to handle specified errors in a different way by creating one or more message handlers.

A message handler defines rules about how to handle messages generated when a parallel job is running. You can, for example, use one to specify that certain types of message should not be written to the log.

You can edit message handlers in the DataStage Manager or in the DataStage Director. The recommended way to create them is by using the Add rule to message handler feature in the Director.

You can specify message handler use at different levels:

Project Level. You define a project level message handler in the DataStage Administrator, and this applies to all parallel jobs within the specified project.

Job Level. From the Designer and Manager you can specify that any existing handler should apply to a specific job. When you compile the job, the handler is included in the job executable as a local handler (and so can be exported to other systems if required).

You can also add rules to handlers when you run a job from the Director (regardless of whether it currently has a local handler included). This is useful, for example, where a job is generating a message for every row it is processing. You can suppress that particular message.

When the job runs it will look in the local handler (if one exists) for each message to see if any rules exist for that message type. If a particular message is not handled locally, it will look to the project-wide handler for rules. If there are none there, it writes the message to the job log.

Note that message handlers do not deal with fatal error messages; these will always be written to the job log. You cannot add message rules to jobs from an earlier release of DataStage without first re-running those jobs.

Adding Rules to Message Handlers

You can add rules to message handlers 'on the fly' from within the Director. Using this method, you can add rules to handlers that are local to the current job, to the project default handler, or to any previously-defined handler.

To add rules in this way, highlight the message you want to add a rule about in the job log and choose Add rule to message handler... from the job log shortcut menu or from the Job menu on the menu bar. The Add rule to message handler dialog box appears.

To add a rule:
1. Choose an option to specify which handler you want to add the new rule to. Choose between the local runtime handler for the currently selected job, the project-level message handler, or a specific message handler. If you want to edit a specific message handler, select the handler from the Message Handler dropdown list. Choose (New) to create a new message handler.

2. Choose an Action from the drop down list. Choose from:
– Suppress from log. The message is not written to the job's log as it runs.
– Promote to Warning. Promote an informational message to a warning message.
– Demote to Informational. Demote a warning message to become an informational one.
The Message ID, Message type and Example of message text fields are all filled in from the log entry you have currently selected. You cannot edit these.


3. Click Add Rule to add the new message rule to the chosen handler.

Managing Message Handlers
To open the Message Handler Manager, choose Tools Message Handlers (you can also open the manager from the Add rule to message handler dialog box). The Edit Message Handlers dialog box appears.


Message Handler File Format

A message handler is a plain text file and has the suffix .msh. It is stored in the folder $DSHOME/../DataStage/MsgHandlers. The following is an example message file.

TUTL 000031 1 1 The open file limit is 100; raising to 1024…
TFSC 000001 1 2 APT configuration file…
TFSC 000043 2 3 Attempt to Cleanup after ABORT raised in stage…

Each line in the file represents a message rule, and comprises four tab-separated fields:
- Message ID. Case-specific string uniquely identifying the message
- Type. 1 for Info, 2 for Warn
- Action. 1 = Suppress, 2 = Promote, 3 = Demote
- Message. Example text of the message

Identify the use of dsjob command line utility

You can start, stop, validate, and reset jobs using the –run option.

Running a job

dsjob –run
[ –mode [ NORMAL | RESET | VALIDATE ] ]
[ –param name=value ]
[ –warn n ]
[ –rows n ]
[ –wait ]
[ –stop ]
[ –jobstatus ]
[ –userstatus ]
[ –local ]
[ –opmetadata [TRUE | FALSE] ]
[ -disableprjhandler ]
[ -disablejobhandler ]
[useid] project job|job_id

–mode specifies the type of job run. NORMAL starts a job run, RESET resets the job and VALIDATE validates the job. If mode is not specified, a normal job run is started.
–param specifies a parameter value to pass to the job. The value is in the format name=value, where name is the parameter name, and value is the value to be set. If you use this to pass a value of an environment variable for a job (as you may do for parallel jobs), you need to quote the environment variable and its value, for example -param '$APT_CONFIG_FILE=chris.apt'; otherwise the current value of the environment variable will be used.
–warn n sets warning limits to the value specified by n (equivalent to the DSSetJobLimit function used with DSJ_LIMITWARN specified as the LimitType parameter).
–rows n sets row limits to the value specified by n (equivalent to the DSSetJobLimit function used with DSJ_LIMITROWS specified as the LimitType parameter).
–wait waits for the job to complete (equivalent to the DSWaitForJob function).
–stop terminates a running job (equivalent to the DSStopJob function).
–jobstatus waits for the job to complete, then returns an exit code derived from the job status.
–userstatus waits for the job to complete, then returns an exit code derived from the user status if that status is defined. The user status is a string, and it is converted to an integer exit code. The exit code 0 indicates that the job completed without an error, but that the user status string could not be converted. If a job returns a negative user status value, it is interpreted as an error.


-local use this when running a DataStage job from within a shell script on a UNIX server. Provided the script is run in the project directory, the job will pick up the settings for any environment variables set in the script and any settings specific to the user environment.
-opmetadata use this to have the job generate operational meta data as it runs. If MetaStage, or the Process Meta Data MetaBroker, is not installed on the machine, then the option has no effect. If you specify TRUE, operational meta data is generated, whatever the default setting for the project. If you specify FALSE, the job will not generate operational meta data, whatever the default setting for the project.
-disableprjhandler use this to disable any error message handler that has been set on a project-wide basis.
-disablejobhandler use this to disable any error message handler that has been set for this job.
useid specify this if you intend to use a job alias (jobid) rather than a job name (job) to identify the job.
project is the name of the project containing the job.
job is the name of the job. To run a job invocation, use the format job.invocation_id.
job_id is an alias for the job that has been set using the dsjob –jobid command.
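For illustration, a typical invocation from a UNIX shell might look like the following (the project name dstage1, job name LoadCustomerDim, and parameter values are hypothetical):

dsjob -run -mode NORMAL -param '$APT_CONFIG_FILE=/opt/configs/4node.apt' -param SRC_DIR=/data/in -warn 50 -jobstatus dstage1 LoadCustomerDim

Because -jobstatus waits for completion and returns an exit code derived from the job status, a calling script can simply test the command's return code to decide whether to continue.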

Stopping a job
You can stop a job using the –stop option:
dsjob –stop [useid] project job|job_id

–stop terminates a running job (equivalent to the DSStopJob function).
useid specify this if you intend to use a job alias (jobid) rather than a job name (job) to identify the job.
project is the name of the project containing the job.
job is the name of the job. To stop a job invocation, use the format job.invocation_id.
job_id is an alias for the job that has been set using the dsjob –jobid command.

Listing Projects
The following syntax displays a list of all known projects on the server:
dsjob –lprojects
This syntax is equivalent to the DSGetProjectList function.

Listing Jobs
The following syntax displays a list of all jobs in the specified project:
dsjob –ljobs project
project is the name of the project containing the jobs to list. This syntax is equivalent to the DSGetProjectInfo function.

Listing Stages
The following syntax displays a list of all stages in a job:
dsjob –lstages [useid] project job|job_id

This syntax is equivalent to the DSGetJobInfo function with DSJ_STAGELIST specified as the InfoType parameter.

Listing Links
The following syntax displays a list of all the links to or from a stage:
dsjob –llinks [useid] project job|job_id stage

This syntax is equivalent to the DSGetStageInfo function with DSJ_LINKLIST specified as the InfoType parameter.

Listing Parameters
The following syntax displays a list of all the parameters in a job and their values:
dsjob –lparams [useid] project job|job_id
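For example, the listing options can be combined to drill down from server to job (the project and job names below are hypothetical):

dsjob -lprojects
dsjob -ljobs dstage1
dsjob -lstages dstage1 LoadCustomerDim
dsjob -lparams dstage1 LoadCustomerDim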


Listing Invocations
The following syntax displays a list of the invocations of a job:
dsjob –linvocations

Setting an Alias for a Job
The dsjob command can be used to specify your own ID for a DataStage job. Other commands can then use that alias to refer to the job.
dsjob –jobid [my_ID] project job
my_ID is the alias you want to set for the job. If you omit my_ID, the command will return the current alias for the specified job. An alias must be unique within the project; if the alias already exists, an error message is displayed.

Displaying Job Information
The following syntax displays the available information about a specified job:
dsjob –jobinfo [useid] project job|job_id
This syntax is equivalent to the DSGetJobInfo function.

Displaying Stage Information
The following syntax displays all the available information about a stage:
dsjob –stageinfo [useid] project job|job_id stage
This syntax is equivalent to the DSGetStageInfo function.

Displaying Link Information
The following syntax displays information about a specified link to or from a stage:
dsjob –linkinfo [useid] project job|job_id stage link
This syntax is equivalent to the DSGetLinkInfo function.

Displaying Parameter Information
This syntax displays information about the specified parameter:
dsjob –paraminfo [useid] project job|job_id param
The following information is displayed:

- The parameter type
- The parameter value
- Help text for the parameter that was provided by the job's designer
- Whether the value should be prompted for
- The default value that was specified by the job's designer
- Any list of values
- The list of values provided by the job's designer

This syntax is equivalent to the DSGetParamInfo function.

Adding a Log Entry
The following syntax adds an entry to the specified log file. The text for the entry is taken from standard input to the terminal, ending with Ctrl-D.
dsjob –log [ –info | –warn ] [useid] project job|job_id
–info specifies an information message. This is the default if no log entry type is specified.
–warn specifies a warning message.

Displaying a Short Log Entry
The following syntax displays a summary of entries in a job log file:
dsjob –logsum [–type type] [ –max n ] [useid] project job|job_id
–type type specifies the type of log entry to retrieve. If –type type is not specified, all the entries are retrieved. type can be one of the following options:


INFO Information.
WARNING Warning.
FATAL Fatal error.
REJECT Rejected rows from a Transformer stage.
STARTED All control logs.
RESET Job reset.
BATCH Batch control.
ANY All entries of any type. This is the default if type is not specified.

–max n limits the number of entries retrieved to n.

Displaying a Specific Log Entry
The following syntax displays the specified entry in a job log file:
dsjob –logdetail [useid] project job|job_id entry

entry is the event number assigned to the entry. The first entry in the file is 0.
This syntax is equivalent to the DSGetLogEntry function.

Identifying the Newest Entry
The following syntax displays the ID of the newest log entry of the specified type:
dsjob –lognewest [useid] project job|job_id type

INFO Information.
WARNING Warning.
FATAL Fatal error.
REJECT Rejected rows from a Transformer stage.
STARTED Job started.
RESET Job reset.
BATCH Batch control.

This syntax is equivalent to the DSGetNewestLogId function.
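A hypothetical troubleshooting sequence using these log options (the project, job, and entry number are illustrative only):

dsjob -logsum -type WARNING -max 20 dstage1 LoadCustomerDim
dsjob -logdetail dstage1 LoadCustomerDim 125
dsjob -lognewest dstage1 LoadCustomerDim FATAL

The first command lists up to 20 warning entries, the second expands a single entry by its event number, and the third returns the ID of the most recent fatal entry.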

Importing Job Executables
The dsjob command can be used to import job executables from a DSX file into a specified project. Note that this command is only available on UNIX servers.

dsjob –import project DSXfilename [-OVERWRITE] [-JOB[S] jobname …] | [-LIST]
project is the project to import into.
DSXfilename is the DSX file containing the job executables.
-OVERWRITE specifies that any existing jobs in the project with the same name will be overwritten.
-JOB[S] jobname specifies that one or more named job executables should be imported (otherwise all the executables in the DSX file are imported).
-LIST causes DataStage to list the executables in a DSX file rather than import them.
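For example, assuming a DSX file at /tmp/nightly_jobs.dsx (the path and job name are hypothetical), you could first list its contents and then import a single executable:

dsjob -import dstage1 /tmp/nightly_jobs.dsx -LIST
dsjob -import dstage1 /tmp/nightly_jobs.dsx -OVERWRITE -JOBS LoadCustomerDim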

Generating a Report
The dsjob command can be used to generate an XML format report containing job, stage, and link information.
dsjob –report [useid] project job|jobid [report_type]

report_type is one of the following:
- BASIC – Text string containing start/end time, time elapsed and status of job.
- DETAIL – As basic report, but also contains information about individual stages and links within the job.
- LIST – Text string containing full XML report.

By default the generated XML will not contain a <?xml-stylesheet?> processing instruction. If a stylesheet is required, specify a ReportLevel of 2 and append the name of the required stylesheet URL, i.e., 2:styleSheetURL. This inserts a processing instruction into the generated XML of the form:


<?xml-stylesheet type="text/xsl" href="styleSheetURL"?>
The generated report is written to stdout.

This syntax is equivalent to the DSMakeJobReport function.
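For example (names hypothetical), a detailed report can be captured to a file for later inspection:

dsjob -report dstage1 LoadCustomerDim DETAIL > LoadCustomerDim_report.xml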

Job Sequence

What is a Job Sequence?
1. A master controlling job that controls the execution set of subordinate jobs
2. Passes values to subordinate job parameters
3. Controls the order of execution (links)
4. Specifies conditions under which the subordinate jobs get executed (triggers)
5. Specifies complex flow of control – Loops, All/Any sequencer, Wait for file
6. Performs system activities (Email, Execute system commands and executables)
7. Can include Restart checkpoints

What are the Job Sequence stages?
1. Run stages – Job Activity: Run a job; Execute Command/Routine Activity: Run a system command; Notification Activity: Send an email
2. Flow Control stages – Sequencer: Go All/Any; Wait for file: Go when file exists/doesn't exist; Loop: Start Loop and End Loop; Nested Condition: Go if condition satisfied
3. Error handling – Exception Handler, Terminator
4. Variables – User Variables

What are the compilation options in Job Sequence properties?
1. Add checkpoints so sequence is restartable on failure – Restart functionality
2. Automatically handle activities that fail – Exception stage to handle aborts
3. Log warnings after activities that finish with status other than OK
4. Log report messages after each run

What are the inputs for the Job Activity stage?
1. Job name (select from list)
2. Execution Action (select from list)
3. Parameters
4. Do not checkpoint run (select/unselect checkbox)

What are the Job Activity Execution Actions?
1. Run
2. Reset if required, then run
3. Validate

What are the different types of triggers for a Job Activity?
OK – (Conditional)
Failed – (Conditional)
Warning – (Conditional)
Custom – (Conditional)
UserStatus – (Conditional)
Unconditional
Otherwise

Custom Trigger Example – Job_1.$JobStatus=DSJS.RUNOK or Job_1.$JobStatus= DSJS.RUNWARN


What are the inputs for the Execute Command stage?
1. Command
2. Parameters
3. Do not checkpoint run (select/unselect checkbox)

What are the inputs for the Notification stage?
1. SMTP mail server name
2. Sender's email address
3. Recipient's email address
4. Email subject
5. Attachment
6. Email body
7. Include job status in email (select/unselect checkbox)
8. Do not checkpoint run (select/unselect checkbox)

What are the inputs for the Wait for file stage?
1. Filename
2. Wait for file to appear / Wait for file to disappear (select one of the two options)
3. Timeout length (disabled if the "Do not timeout" option is selected)
4. Do not timeout
5. Do not checkpoint run

Explain the Nested Condition stage?The Nested Condition stage is used to branch out to other activities based on trigger conditions.

Explain the Loop stage?The Loop stage is made up of Start Loop and End Loop. The Start Loop connects to one of the Run activities (preferably Job Activity). This Activity stage connects to the End Loop. The End Loop connects to the Start Loop activity by means of a reference link.

The two types of looping are:
1. Numeric (For counter n to n Step n)
2. List (For each thing in list)

Explain the Error handling and Restartability?
Error handling is enabled using the "Automatically handle activities that fail" option. Control is passed to the Exception Handler stage when an Activity fails.

Restartability is enabled using “Add checkpoints so sequence is restartable on failure” option. If a sequence fails, then when the Sequence is re-run, activities that completed successfully in the prior run are skipped over (unless the “Do not checkpoint run” option was set for an activity).

Which three are valid ways within a Job Sequence to pass parameters to Activity stages? (Choose three.)
A. ExecCommand Activity stage
B. UserVariables Activity stage
C. Sequencer Activity stage
D. Routine Activity stage
E. Nested Condition Activity stage

Answer: A,B,D

Which three are valid trigger expressions in a stage in a Job Sequence? (Choose three.)
A. Equality(Conditional)
B. Unconditional
C. ReturnValue(Conditional)
D. Difference(Conditional)
E. Custom(Conditional)

Answer: B,C,E

A client requires that any job that aborts in a Job Sequence halt processing. Which three activities would provide this capability? (Choose three.)
A. Nested Condition Activity
B. Exception Handler
C. Sequencer Activity
D. Sendmail Activity
E. Job trigger

Answer: A,B,E

Which command can be used to execute DataStage jobs from a UNIX shell script?
A. dsjob
B. DSRunJob
C. osh
D. DSExecute

Answer: A

Which three are the critical stages that would be necessary to build a Job Sequence that: picks up data from a file that will arrive in a directory overnight, launches a job once the file has arrived, and sends an email to the administrator upon successful completion of the flow? (Choose three.)
A. Sequencer
B. Notification Activity
C. Wait For File Activity
D. Job Activity
E. Terminator Activity

Answer: B,C,D

Which two statements describe functionality that is available using the dsjob command? (Choose two.)
A. dsjob can be used to get a report containing job, stage, and link information.
B. dsjob can be used to add a log entry for a specified job.
C. dsjob can be used to compile a job.
D. dsjob can be used to export job executables.

Answer: A,B


Other Topics

Environment Variables

APT_BUFFER_FREE_RUN
This environment variable is available in the DataStage Administrator, under the Parallel category. It specifies how much of the available in-memory buffer to consume before the buffer resists. This is expressed as a decimal representing the percentage of Maximum memory buffer size (for example, 0.5 is 50%). When the amount of data in the buffer is less than this value, new data is accepted automatically. When the data exceeds it, the buffer first tries to write some of the data it contains before accepting more. The default value is 50% of the Maximum memory buffer size. You can set it to greater than 100%, in which case the buffer continues to store data up to the indicated multiple of Maximum memory buffer size before writing to disk.

APT_BUFFER_MAXIMUM_MEMORY
Sets the default value of Maximum memory buffer size. The default value is 3145728 (3 MB). Specifies the maximum amount of virtual memory, in bytes, used per buffer.
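As a worked example, with the default Maximum memory buffer size of 3145728 bytes (3 MB) and APT_BUFFER_FREE_RUN at its default of 0.5, each buffer accepts roughly 1.5 MB of data before it starts trying to write data to its consumer; a setting of 2.0 would let the buffer hold up to about 6 MB before spilling to disk.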

APT_BUFFER_MAXIMUM_TIMEOUT
DataStage buffering is self tuning, which can theoretically lead to long delays between retries. This environment variable specifies the maximum wait before a retry in seconds, and is by default set to 1.

APT_BUFFERING_POLICY
This environment variable is available in the DataStage Administrator, under the Parallel category. Controls the buffering policy for all virtual data sets in all steps. The variable has the following settings:

- AUTOMATIC_BUFFERING (default). Buffer a data set only if necessary to prevent a data flow deadlock.
- FORCE_BUFFERING. Unconditionally buffer all virtual data sets. Note that this can slow down processing considerably.
- NO_BUFFERING. Do not buffer data sets. This setting can cause data flow deadlock if used inappropriately.

APT_DECIMAL_INTERM_PRECISION
Specifies the default maximum precision value for any decimal intermediate variables required in calculations. Default value is 38.

APT_DECIMAL_INTERM_SCALE
Specifies the default scale value for any decimal intermediate variables required in calculations. Default value is 10.

APT_CONFIG_FILE
Sets the path name of the configuration file. (You may want to include this as a job parameter, so that you can specify the configuration file at job run time.)

APT_DISABLE_COMBINATION
Globally disables operator combining. Operator combining is DataStage's default behavior, in which two or more (in fact any number of) operators within a step are combined into one process where possible. You may need to disable combining to facilitate debugging. Note that disabling combining generates more UNIX processes, and hence requires more system resources and memory. It also disables internal optimizations for job efficiency and run times.

APT_EXECUTION_MODE
By default, the execution mode is parallel, with multiple processes. Set this variable to one of the following values to run an application in sequential execution mode:
- ONE_PROCESS: one-process mode
- MANY_PROCESS: many-process mode
- NO_SERIALIZE: many-process mode, without serialization


APT_ORCHHOME
Must be set by all DataStage Enterprise Edition users to point to the top-level directory of the DataStage Enterprise Edition installation.

APT_STARTUP_SCRIPT
As part of running an application, DataStage creates a remote shell on all DataStage processing nodes on which the job runs. By default, the remote shell is given the same environment as the shell from which DataStage is invoked. However, you can write an optional startup shell script to modify the shell configuration of one or more processing nodes. If a startup script exists, DataStage runs it on remote shells before running your application. APT_STARTUP_SCRIPT specifies the script to be run. If it is not defined, DataStage searches ./startup.apt, $APT_ORCHHOME/etc/startup.apt and $APT_ORCHHOME/etc/startup, in that order. APT_NO_STARTUP_SCRIPT disables running the startup script.

APT_NO_STARTUP_SCRIPT
Prevents DataStage from executing a startup script. By default, this variable is not set, and DataStage runs the startup script. If this variable is set, DataStage ignores the startup script. This may be useful when debugging a startup script. See also APT_STARTUP_SCRIPT.

APT_STARTUP_STATUS
Set this to cause messages to be generated as parallel job startup moves from phase to phase. This can be useful as a diagnostic if parallel job startup is failing.

APT_MONITOR_SIZE
This environment variable is available in the DataStage Administrator under the Parallel branch. Determines the minimum number of records the DataStage Job Monitor reports. The default is 5000 records.

APT_MONITOR_TIME
This environment variable is available in the DataStage Administrator under the Parallel branch. Determines the minimum time interval in seconds for generating monitor information at runtime. The default is 5 seconds. This variable takes precedence over APT_MONITOR_SIZE.

APT_NO_JOBMON
Turn off job monitoring entirely.

APT_PM_NO_SHARED_MEMORY
By default, shared memory is used for local connections. If this variable is set, named pipes rather than shared memory are used for local connections. If both APT_PM_NO_NAMED_PIPES and APT_PM_NO_SHARED_MEMORY are set, then TCP sockets are used for local connections.

APT_PM_NO_NAMED_PIPES
Specifies not to use named pipes for local connections. Named pipes will still be used in other areas of DataStage, including subprocs and setting up of the shared memory transport protocol in the process manager.

APT_RECORD_COUNTS
Causes DataStage to print, for each operator Player, the number of records consumed by getRecord() and produced by putRecord(). Abandoned input records are not necessarily accounted for. Buffer operators do not print this information.

APT_NO_PART_INSERTION
DataStage automatically inserts partition components in your application to optimize the performance of the stages in your job. Set this variable to prevent this automatic insertion.

APT_NO_SORT_INSERTION
DataStage automatically inserts sort components in your job to optimize the performance of the operators in your data flow. Set this variable to prevent this automatic insertion.


APT_SORT_INSERTION_CHECK_ONLY
When sorts are inserted automatically by DataStage, if this is set, the sorts will just check that the order is correct; they won't actually sort. This is a better alternative to shutting partition and sort insertion off using APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION.

APT_DUMP_SCORE
Configures DataStage to print a report showing the operators, processes, and data sets in a running job.

APT_PM_PLAYER_MEMORY
Setting this variable causes each player process to report the process heap memory allocation in the job log when returning.

APT_PM_PLAYER_TIMING
Setting this variable causes each player process to report its call and return in the job log. The message with the return is annotated with CPU times for the player process.

OSH_DUMP
If set, it causes DataStage to put a verbose description of a job in the job log before attempting to execute it.

OSH_ECHO
If set, it causes DataStage to echo its job specification to the job log after the shell has expanded all arguments.

OSH_EXPLAIN
If set, it causes DataStage to place a terse description of the job in the job log before attempting to run it.

OSH_PRINT_SCHEMAS
If set, it causes DataStage to print the record schema of all data sets and the interface schema of all operators in the job log.

APT_STRING_PADCHAR
Overrides the pad character of 0x0 (ASCII null), used by default when DataStage extends, or pads, a string field to a fixed length.


XML Stages

Xml Importer

The XML Meta Data Importer window has the following panes:
• Tree View, which depicts the hierarchical structure in the XML source. This pane is the main view. It is always present and cannot be hidden or docked.
• Source, which contains the original XML schema or XML document, in read-only mode. To compare the tree view with the XML source, you can dock this pane next to the tree view.
• Node Properties, which describes XML and XPath information of the selected element.
• Table Definition, which maps elements that you select in the Tree View.
• Parser Output, which presents XML syntax and semantic errors.

The following illustration shows all XML Meta Data Importer panes except Parser Output:


XML Meta Data Importer reports any syntax and semantic errors when you open a source file. In the following example, the Parser Output pane indicates that at least one quote is missing from line 3.

To highlight the error in the Source pane, double-click the error in the Parser Output pane. After correcting the error outside of the XML Meta Data Importer, you can load the revised source file. To reload the file, choose File Refresh.

You can process an XML schema file (.xsd) or an XML document (.xml). The file can be located on your file system or accessed with a URL.

Processing XML Documents
The XML Meta Data Importer retains namespaces and considers every node in an XML hierarchy to be fully-qualified with a namespace prefix. The form is: prefix:nodename. This approach applies to documents in which the prefixes are included or unspecified. When prefixes are unspecified, XML Meta Data Importer generates prefixes using the pattern ns#, where # is a sequence number.

Example
The following input does not include a namespace prefix.
Input:
<Person xmlns="mynamespace">
<firstName>John</firstName>
</Person>


Output:
<ns1:Person xmlns:ns1="mynamespace">
<ns1:firstName>John</ns1:firstName>
</ns1:Person>

Processing XML Schemas
The XML Meta Data Importer processes namespaces in XML schemas according to three rules:
• General
• Import By Reference
• Target Namespace Unspecified

General Rule
In general, the XML Meta Data Importer assigns the prefix defns to the target namespace. For example:
<xsd:schema targetNamespace="mynamespace" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="Person">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="firstName" type="xsd:string" minOccurs="1" maxOccurs="1"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
The firstName node generates the following XPath expression:
/defns:Person/defns:firstName
where defns=mynamespace

Import By Reference Rule
If the schema imports by reference other schemas with different target namespaces, the XML Meta Data Importer assigns a prefix in the form ns# to each of them. To enable this processing, the dependent schema must specify elementFormDefault="qualified". If this is omitted, the elements are considered as belonging to the caller's target namespace.

Example
The following example imports by reference the schema mysecondschema.
<xsd:schema targetNamespace="demonamespace" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:other="othernamespace">
<xsd:import namespace="othernamespace" schemaLocation="mysecondschema.xsd"/>
<xsd:element name="Person">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="address" type="other:Address" minOccurs="1" maxOccurs="1"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
The schema mysecondschema contains the following statements:
<xsd:schema targetNamespace="othernamespace" xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
<xsd:complexType name="Address">
<xsd:sequence>
<xsd:element name="street" minOccurs="1" maxOccurs="1" />
<xsd:element name="city" minOccurs="1" maxOccurs="1" />
<xsd:element name="state" minOccurs="1" maxOccurs="1" />
<xsd:element name="zip" minOccurs="1" maxOccurs="1" />
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
The street node generates the following XPath expression:
/defns:Person/defns:address/ns2:street
where defns=demonamespace and ns2=othernamespace

The Target Namespace Unspecified Rule
When the target namespace is unspecified, XML Meta Data Importer omits the prefix defns from XPath expressions.

For example:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="Person">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="firstName" type="xsd:string" minOccurs="1" maxOccurs="1"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>

The firstName tree node generates the following XPath expression:/Person/firstName

Mapping Nodes from an XML Schema
You can individually choose elements and attributes, or select all leaf nodes except empty ones in one step.

Choosing Individual Items
Select the box that is next to the item that you want to map. In the following example, there are elements and text nodes. The three TEXT nodes are selected.

If you select an element box, you get all the sub nodes and the actual content of the element. Your selection is reflected in the Table Definition pane:


An asterisk appears after the title Table Definition when you modify the table definition. It disappears when you save the information.

Selecting All Nodes
You can simplify selecting all leaf nodes by using the Auto-check command. This command checks leaf nodes. XML Meta Data Importer ignores leaf nodes in the following circumstances:
• Nodes are empty.
• In a branch in which a node represents a reference to an element or a type defined elsewhere, such as an included schema. To avoid recursive looping, which may be deep in the sub-schema, the node is not expanded. You may manually expand the reference branch down to a specific level, and run the Auto-check command on the top branch node. This action selects all nodes in the branch.
• Node represents a detected recursion. This happens with a schema that has the following form:
parent = person children
child = person
You may manually expand the recursive branch and run the Auto-check command to select all nodes in the branch.

To run Auto-check:
Choose Edit Auto-check. The nodes appear in the Table Definition pane.

The default table definition name depends on the XML source:

Source file      Default table definition name
UNC-name         Original file name without extension
URL              The value New
XML document     Original XML document filename
XML schema       Original XML schema filename

Xml Input Stage

XML Input stage is used to transform hierarchical XML data to flat relational tables. XML Input stage supports a single input link and one or more output links.

XML Input performs two XML validations when the server job runs:
• Checks for well-formed XML.
• Optionally checks that elements and attributes conform to any XML schema that is referenced in the document. You control this option.

The XML parser reports three types of conditions: fatal, error, and warning.
• Fatal errors are thrown when the XML is not well-formed.
• Non-fatal errors are thrown when the XML violates a validity constraint. For example, the root element in the document is not found in the validating XML schema.
• Warnings may be thrown when the schema has duplicate definitions.

XML Input supports one Reject link, which can store rejection messages and rejected rows.

Writing Rejection Messages to the Link
To write rejection messages to a Reject link:
1. Add a column on the Reject link.
2. Using the General page of the Output Link properties, identify the column as the target for rejection messages.

Writing Rejected Rows to the Link
To write rejected rows to a Reject link: add a column on the Reject link that has the same name as the column on the input link that contains or references the XML document. This is a pass-through operation. Column names for this operation are case-sensitive. Pass-through is available for any input column.


Controlling Output Rows
To populate the columns of an output row, XML Input uses XPath expressions that are specified on the output link. XPath expressions locate elements, attributes, and text nodes.

Controlling the Number of Output Rows
You must designate one column on the output link as the repetition element. A repetition element consists of an XPath expression. For each occurrence of the repetition element, XML Input always generates a row. By varying the repetition element and using a related option, you can control the number of output rows.

Identifying the Repetition Element
To identify the repetition element, set the Key property to Yes on the output link.
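A minimal sketch, reusing the Person document from the XML Importer examples above (the column name is hypothetical, and the XPath expression is assumed to be held in the output column's Description property, as it is for XML Output): an output link that produces one row per person could define a column firstName with the XPath expression /ns1:Person/ns1:firstName/text() and its Key property set to Yes, making firstName the repetition element.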

Transformation Settings
These properties control the values that can be shared by multiple output links of the XML Input stage. They fall into these categories:
• Requiring the repetition element
• Processing NULLs and empty values
• Processing namespaces
• Formatting extracted XML fragments
To use these values with a specific output link, select the Inherit Stage properties box on the Transformation Settings tab of the output link.

Xml Output Stage

XML Output stage is used to transform tabular data, such as relational tables and sequential files, to XML hierarchical structures. XML Output stage supports a single input link and zero or one output links.


XML Output requires XPath expressions to transform tabular data to XML. A table definition stores the XPath expressions. Using the Description property on the Columns pages within the stage, you record or maintain the XPath expressions.

Aggregating Input Rows on Output
You have several options for aggregating input rows on output:
• Aggregate all rows in a single output row. This is the default option.
• Generate one output row per input row. This is the Single row option.
• Trigger a new output row when the value of an input column changes.
• Trigger a new output row when the value of a pass-through column changes.
A pass-through column is an output column that has no XPath expression in the Description property and whose name exactly matches the name of an input column.
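A minimal sketch of the reverse mapping (the column names and element structure are hypothetical): input columns firstName and city could carry the XPath expressions /Person/firstName/text() and /Person/address/city/text() in their Description properties; combined with the Single row option, each input row would then produce one Person element containing those two values.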

Job Management and Deployment

Quick Find
1. Name to find
2. Types to find
3. Include descriptions (If checked, the text in short and long descriptions will be searched)

Advanced Find filtering options
1. Type – Type of object (Job, Table Definition, etc.)
2. Creation – Date range
3. Last Modification – Date range
4. Where used
5. Dependencies of
6. Options – Case sensitivity and Search within last result set

Impact Analysis
Right-click over a stage or table definition:
1. Select “Find where table definitions used”
2. Select “Find where table definitions used (deep)” – Deep includes additional object types
Displays a list of objects using the table definition.

1. Select “Find dependencies”
2. Select “Find dependencies (deep)” – Deep includes additional object types
Displays a list of objects dependent on the one selected.

Graphical Functionality
1. Display the dependency path
2. Collapse selected objects
3. Move the graphical object
4. “Birds-eye” view

Comparison
1. Cross project compare
2. Compare against
The two objects that can be compared are 1. Jobs and 2. Table Definitions.


Aggregator Stage
1. Grouping Keys
2. Aggregations

Aggregation Type - Count Rows, Calculation, Re-Calculation

Aggregation Type - Count Rows
Count Output Column - Name of the output column which consists of the number of records based on grouping keys

Aggregation Type - Calculation, Re-Calculation
Column for Calculation - Input column to be selected for calculation

Options

Allow Null Output - True means that NULL is a valid output value when calculating minimum value, maximum value, mean value, standard deviation, standard error, sum, sum of weights, and variance. False means 0 is output when all input values for calculation column are NULL.

Method – Hash (Hash table) or Sort (Pre-Sort). The default method is Hash

Use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of memory. Sort mode requires the input data set to have been partition sorted with all of the grouping keys specified as hashing and sorting keys.

Use the Hash method for inputs with a limited number of distinct groups:
1. Uses 2K of memory/group
2. Calculations are made for all groups and stored in memory (hash table structure, hence the name)
3. Incoming data does not need to be pre-sorted
4. Results are output after all rows have been read
5. Useful when the number of unique groups is small
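As a rough sizing check based on the 2K-per-group figure above: an input with 100,000 distinct grouping-key values would need on the order of 200 MB of memory per partition in hash mode, which is usually a signal to pre-sort the data and use the Sort method instead.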

Use the Sort method with a large (or unknown) number of distinct key column values:
1. Requires inputs pre-sorted on key columns (does not perform the sort! Expects the sort)
2. Results are output after each group
3. Can handle an unlimited number of groups

Sort Aggregator - one of the lightweight stages that minimize memory usage by requiring data in key column sort order

Lightweight stages that minimize memory usage by requiring data in key column sort order:
1. Join
2. Merge
3. Sort Aggregator

Sort Stage

DataStage Designer provides two methods for parallel (group) sorting:
1. Sort stage - Parallel execution
2. Sort on a link when the partitioning is not Auto - Identified by the Sort icon

Both methods use the same tsort operator

Sorting on a link provides easier job maintenance (fewer stages on job canvas) but fewer options.


The Sort stage offers more options than a link sort.

The Sort Utility should be DataStage as it is faster than the Unix Sort.

Stable sort preserves the order of non-key columns within each sort group but is slightly slower than a non-stable sort. Stable sort is enabled by default on Sort stages but not on link sorts. If disabled, no prior ordering of records is guaranteed to be preserved by the sorting operation.

Sort Key Modes
1. Don't Sort (Previously Sorted) means that input records are already sorted by this key. The Sort stage will then sort on secondary keys, if any.
2. Don't Sort (Previously Grouped) means that input records are already grouped by that key, but not sorted.
3. Sort – Sort by this key.

Advantages of Don't Sort (Previously Sorted):
1. Uses significantly less memory/disk
2. Sort is now on previously sorted key column groups, not the entire data set
3. Outputs rows after each group

DataStage provides two methods for generating a sequentially (totally) sorted result:
1. Sort stage - Sequential execution mode
2. Sort Merge Collector

In general a parallel Sort + Sort Merge Collector will be faster than a Sequential Sort.

By default the Parallel Framework will insert tsort operators as necessary to ensure correct results. But by setting $APT_SORT_INSERTION_CHECK_ONLY we can force the inserted tsort operators to verify that the data is sorted instead of actually performing the sort operation.

By default each tsort operator (Sort stage, link sort and inserted sort) uses 20MB per partition as an internal memory buffer.

But the Sort stage provides the "Restrict Memory Usage" option:
1. Increasing this value can improve performance if the entire (or group) data can fit into memory
2. Decreasing this value may hurt performance, but will use less memory

When the memory buffer is filled, sort uses temporary disk space in the following order:
1. Scratch disks in the $APT_CONFIG_FILE "sort" named disk pool
2. Scratch disks in the $APT_CONFIG_FILE default disk pool
3. The default directory specified by $TMPDIR
4. The UNIX /tmp directory
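A sketch of how a configuration file node entry can declare a dedicated scratch disk in the "sort" pool (the host name and paths are hypothetical):

{
  node "node1"
  {
    fastname "etlserver1"
    pools ""
    resource disk "/ibm/ds/datasets" {pools ""}
    resource scratchdisk "/scratch/sort" {pools "sort"}
    resource scratchdisk "/scratch" {pools ""}
  }
}

With such an entry, an overflowing tsort spills first to /scratch/sort, then to /scratch, before falling back to $TMPDIR and /tmp as described above.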

Removing Duplicates

Can be done by the Sort stage – use the Unique option:
• No choice on which duplicate to keep
• Stable sort always retains the first row in the group
• Non-stable sort is indeterminate

Remove Duplicates stage
• Can choose to retain first or last