
IBM® InfoSphere® DataStage®

Best Practices Performance Guidelines for IBM InfoSphere DataStage Jobs

Containing Sort Operations on Intel® Xeon® Servers

Garrett Drysdale, Intel® Corporation

Jantz Tran, Intel® Corporation

Sriram Padmanabhan, IBM

Brian Caufield, IBM

Fan Ding, IBM

Ron Liu, IBM

Pin Lp Lv, IBM

Mi Wan Shum, IBM

Jackson Dong Jie Wei, IBM

Samuel Wong, IBM

IBM®


Executive Summary

Introduction

Overview of IBM InfoSphere DataStage

Overview of Intel® Xeon® Series X7500 Processors

Sort Operation in IBM InfoSphere DataStage
    Testing Configurations

Summary for Sort Performance Optimizations

Recommendations for Optimizing Sort Performance
    Optimal RMU Tuning
        Configuration / Job Tuning Recommendations
    Final Merge Sort Phase Tuning using Linux Read Ahead
        Configuration / Job Tuning Recommendations
    Using a Buffer Operator to minimize latency for Sort Input
        Configuration / Job Tuning Recommendations
    Minimizing I/O for Sort Data containing Variable length fields
        Configuration / Job Tuning Recommendations
    Future Study: Using Memory for RAM Based Scratch Disk

Best Practices

Conclusion

Further reading
    Contributors

Notices
    Trademarks


Executive Summary

The objective of this document is to communicate best practices for tuning IBM InfoSphere DataStage jobs containing sort operations on Intel® Xeon® servers. Sort operations are I/O intensive and can place a significant I/O load on the temporary or scratch file system. To optimize server CPU utilization, the scratch I/O storage system must be capable of providing the disk bandwidth demanded by the sort operations. A scratch storage system that cannot write or read data at a high enough bandwidth will leave the computing capability of the system under-utilized, which will be observed as low CPU utilization.

This paper provides recommendations that reduce the bandwidth demand placed on the scratch storage I/O system by sort operations. These I/O reductions result in performance improvements that can be quite significant for systems where the scratch I/O storage system is significantly undersized in comparison to the compute capability of the processors.


Introduction

This whitepaper is the first in an anticipated series intended to provide IBM InfoSphere DataStage customers with helpful performance tuning guidelines for deployment on Intel® Xeon® processor-based platforms. IBM and Intel

began collaborating to optimize performance and ROI of the combination of IBM

InfoSphere DataStage and Intel® Xeon® based platforms in 2007. Our goal is to not only

optimize the performance, and therefore, reduce the total cost of ownership of this

powerful combination in future versions of IBM InfoSphere DataStage on future Intel®

processors, but also to pass along tuning and configuration guidance that we discover

along the way.

In our work together, we are striving to understand the execution characteristics of

DataStage jobs on Intel® platforms. This information is used to determine the hardware

configurations, the operating system settings, and the job design and tuning techniques

to optimize performance. Because IBM InfoSphere DataStage is highly scalable, our tests focus on the latest Intel® Xeon® X7560 EX processors, which support 4-socket and 8-socket configurations. Initially, we are testing with four-socket configurations.

We have presented information about IBM InfoSphere DataStage on Intel® platforms at

the 2009 and 2010 IBM Information on Demand Conferences. In 2009, our audience

applauded the great scalability of IBM InfoSphere DataStage on Intel® platforms, but

asked us to provide more information on the I/O requirements of jobs and how to get the

most out of existing platform I/O capability. Since then, we have found ways to increase

the overall performance of all jobs in the new Information Server 8.5 version of IBM

InfoSphere DataStage, which is now a 64-bit binary on Intel® platforms, and we

investigated the I/O requirements of sorting.

The focus of this paper is on the key pieces of information we obtained about configuring the platform, the operating system, and DataStage jobs that contain sort operators. Sort is a crucial operation in data integration software. Sort operations are I/O intensive and can place a significant I/O load on the temporary or scratch file system. To optimize server CPU utilization, the scratch I/O storage system must be capable of providing the disk bandwidth demanded by the sort operations. A scratch storage system that cannot write or read data at a high enough bandwidth will leave the computing capability of the system under-utilized, which will be observed as low CPU utilization.

The paper provides recommendations that reduce the bandwidth demand placed on the scratch storage I/O system by sort operations. These I/O reductions result in performance improvements that can be quite significant for systems where the scratch I/O storage system is significantly undersized in comparison to the compute capability of the processors. We show such a scenario in this paper. Ideally, the best solution is to upgrade the scratch I/O storage subsystem to match the compute capability of the server.


Overview of IBM InfoSphere DataStage

IBM InfoSphere DataStage is a product for data integration via Extract-Transform-Load capabilities. It provides a designer tool that allows developers to visually create integration jobs. The term job is used within IBM InfoSphere DataStage to describe extract, transform and load (ETL) tasks. Jobs are composed from a rich palette of operators called stages. These stages include:

• Source and target access for databases, applications and files

• General processing stages such as filter, sort, join, union, lookup and aggregations

• Built-in and custom transformations

• Copy, move, FTP and other data movement stages

• Real-time, XML, SOA and Message queue processing

Additionally, IBM InfoSphere DataStage allows pre- and post-conditions to be applied to

all these stages. Multiple jobs can be controlled and linked by a sequencer. The sequencer

provides the control logic that can be used to process the appropriate data integration

jobs. IBM InfoSphere DataStage also supports a rich administration capability for

deploying, scheduling and monitoring jobs.

One of the great strengths of IBM InfoSphere DataStage is that designing jobs requires very little consideration of the underlying structure of the system, and job designs typically do not need to change when that structure does. If the system is upgraded or improved, or if a job is developed on one platform and implemented on another, the job design does not necessarily have to change. IBM InfoSphere DataStage has the capability to learn about

the shape and size of the system from the IBM InfoSphere DataStage configuration file.

Further, it has the capability to organize the resources needed for a job according to what

is defined in the configuration file. When a system changes, the file is changed, not the

jobs. A configuration file defines one or more processing nodes with which the job will

run. The processing nodes are logical rather than physical. The number of processing

nodes does not necessarily correspond to the number of cores in the system.

The following are factors that affect the optimal degree of parallelism:

• CPU-intensive applications, which typically perform multiple CPU-demanding

operations on each record, benefit from the greatest possible parallelism up to the

capacity supported by a given system.

• Jobs with large memory requirements can benefit from parallelism if they act on data

that has been partitioned and if the required memory is also divided among partitions.

• Applications that are disk- or I/O-intensive, such as those that extract data from and

load data into databases, benefit from configurations in which the number of logical

nodes equals the number of I/O paths being accessed. For example, if a table is


partitioned 16 ways inside a database or if a data set is spread across 16 disk drives, one

should set up a node pool consisting of 16 processing nodes.

Another great strength of IBM InfoSphere DataStage is that it does not rely on the

functions and processes of a database to perform transformations: while IBM InfoSphere

DataStage can generate complex SQL and leverages databases, IBM InfoSphere

DataStage is designed from the ground up as a multipath data integration engine equally

at home with files, streams, databases, and internal caching in single-machine, cluster,

and grid implementations. As a result, customers in many circumstances find they do not

also need to invest in staging databases to support IBM InfoSphere DataStage.

Overview of Intel® Xeon® Series X7500 Processors

Servers using the Intel® Xeon® series 7500 processor deliver dramatic increases in

performance and scalability versus previous generation servers. The chipset includes

new embedded technologies that give professionals in business, information

management, creative, and scientific fields, the tools to solve problems faster, process

larger data sets, and meet bigger challenges.

With intelligent performance, a new high-bandwidth interconnect architecture, and

greater memory capacity, platforms based on the Intel® Xeon® series 7500 processor are

ideal for demanding workloads. A standard four-socket server provides up to 32

processor cores, 64 execution threads and a full terabyte of memory. Eight-socket and

larger systems are in development by leading system vendors. The Intel® Xeon® series

7500 processor also includes more than 20 new reliability, availability and serviceability

(RAS) features that improve data integrity and uptime. One of the most important is

Intel® Machine Check Architecture Recovery, which allows the operating system to take

corrective action and continue running when uncorrected errors are detected. These

highly scalable servers can be used to support enormous user populations.

Server platforms based on the Intel® Xeon® series 7500 processor deliver a number of

additional features that help to improve performance, scalability and energy-efficiency.

• Next-generation Intel® Virtualization Technology (Intel® VT) provides extensive

hardware assists in processors, chipsets and I/O devices to enable fast application

performance in virtual machines, including near-native I/O performance. Intel®

VT also supports live virtual machine migration among current and future Intel®

Xeon® processor-based servers, so businesses maintain a common pool of

virtualized resources as they add new servers.

• Intel® QuickPath Interconnect Technology provides point-to-point links to

distributed shared memory. The Intel® Xeon® 7500 series processors with QPI

feature two integrated memory controllers and 3 QPI links to deliver scalable interconnect bandwidth, outstanding memory performance and flexibility, and tightly integrated interconnect RAS features. Technical articles on

QPI can be found at http://www.intel.com/technology/quickpath/.


• Intel® Turbo Boost Technology boosts performance when it’s needed most by

dynamically increasing core frequencies beyond rated values for peak

workloads.

• Intel® Intelligent Power Technology adjusts core frequencies to conserve power

when demand is lower.

• Intel® Hyper-Threading Technology can improve throughput and reduce

latency for multithreaded applications and for multiple workloads running

concurrently in virtualized environments.

For additional information on the Intel® Xeon® Series 7500 Processor for mission critical

applications, please see

http://www.intel.com/pressroom/archive/releases/20100330comp_sm.htm.

Sort Operation in IBM InfoSphere DataStage

A brief overall description of the Sort operation is given here. The Sort operator implements a segmented merge sort and accomplishes sorting in two phases.

First, the initial sort phase sorts chunks of data into the correct order and stores this data as files on the scratch file system. The sort operator uses a buffer whose size is defined by the RMU parameter. This buffer is divided into two halves. The sorting thread fills one half of the buffer until it is full and then moves to the other half to begin inserts. The full half is sorted and then written out as a chunk to the scratch file system by a separate writer thread. See the figure below.

Figure 1 - Sort operation overview

The sort buffer is used during both the initial sort phase and the final merge phase of the

sort operation. During the final merge phase, a block of data is read from the beginning

of each of the temporary sorted files stored on the scratch file system. If the sort buffer is

too small, there will not be enough memory to read a chunk of data from each of the

temporary sort files from the initial sort phase. This condition will be detected during

the initial sort phase and if it occurs, a second thread will run to perform pre-merging of


the temporary sort files. This will reduce the number of temporary sort files so that the

buffer will have sufficient space to load a block of data from each of the temporary sort

files during final merging.
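The two phases described above can be illustrated with a minimal external merge sort (a sketch in Python of the general technique, not of DataStage internals; the function names, the newline-delimited record format, and the single-threaded structure are our own simplifications):

```python
import heapq
import tempfile

def external_sort(records, buffer_records):
    """Two-phase external sort: sorted runs, then a final k-way merge.

    records        -- iterable of newline-free strings to sort
    buffer_records -- how many records fit in the in-memory sort buffer
                      (the role played by the RMU-sized buffer)
    """
    run_files = []
    buffer = []

    def flush_run():
        # Sort the full buffer and write it out as one temporary
        # "run" file on scratch storage.
        buffer.sort()
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(r + "\n" for r in buffer)
        run.seek(0)
        run_files.append(run)
        buffer.clear()

    # Phase 1 (initial sort): fill the buffer, spilling sorted runs.
    for record in records:
        buffer.append(record)
        if len(buffer) >= buffer_records:
            flush_run()
    if buffer:
        flush_run()

    # Phase 2 (final merge): stream data from the front of every run
    # at once and merge. This is why the buffer must be able to hold
    # at least one block per run, or pre-merging becomes necessary.
    runs = [(line.rstrip("\n") for line in run) for run in run_files]
    yield from heapq.merge(*runs)
```

With `buffer_records=2`, for example, five input records produce three sorted runs that the final phase merges into fully ordered output.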

In the following tests, we will show several tuning and configuration settings that can be

used to reduce the I/O demand placed on the system by sort operations.

Testing Configurations

The testing was done on a single Intel® server with the Intel® Xeon® 7500 series chipset

and four Intel® Xeon® X7560 processors. The X7560 processors are based on the

Nehalem-EX microarchitecture. The system has 4 sockets, 8 cores per socket, and 2 threads

per core using Intel® Hyper-Threading Technology for a total of 64 threads of execution.

Our test configuration uses 64 GB of memory though the platform has a maximum

capacity of 1 TB. The processor operating frequency is 2.26 GHz and each processor has

24 MB of L3 cache shared across the 8 cores.

The system uses 5 Intel® X-25E solid state drives (SSDs) for temporary I/O storage

configured in a RAID-0 array using the on board RAID controller. This storage is used as

scratch storage for the sort tests. The bandwidth capability of the 5 SSDs was not

sufficient to maximize the CPU utilization of the system given the high performance capabilities of DataStage; this is explained in more detail later. We recommend

sizing the I/O subsystem to maximize CPU utilization although we were not able to do

this given the equipment available at the time of data collection.

The operating system is Red Hat* Enterprise Linux* 5.3, 64 bit version.

The test environment is a standard Information Server two tier configuration. The client

tier is used to run just the DataStage client applications. All the remaining Information

Server tiers are installed on a single Intel® Xeon® X7560 server.

Client
• Windows Server 2003
• Processor Type: x86-based PC
• Processor Speed: 2.4 GHz
• Memory Size: 8 GB RAM

Services + Repository + Engine Tiers
• Platform: Red Hat EL 5.3, 64-bit
• Processor: Intel® Xeon® X7560, 4 sockets, 32 cores, 64 threads
• Processor Speed: 2.26 GHz
• Memory Size: 64 GB RAM
• Metadata Repository: DB2/LUW 9.7 GA
• 5 Intel X25-E SSDs for scratch space, configured as a RAID-0 array using the onboard controller

IS Topology: Standalone (test clients plus a single Intel® Xeon® X7560 server hosting the Services, Repository, and Engine tiers)


Figure 2 - System Test Configuration

The following table lists the specifics of the platform tested:

OEM: Intel®
CPU Model ID: 7560
Platform Name: Boxboro
Sockets: 4
Cores per Socket: 8
Threads per Core: 2
CPU Code Name: Nehalem-EX
CPU Frequency (GHz): 2.24
QPI GT/s: 6.4
Hyper-Threading: Enabled
Prefetch Settings: Default
LLC Size (MB): 24
BIOS Version: R21
Memory Installed (GB): 64
DIMM Type: DDR3-1066
DIMM Size (GB): 4
Number of DIMMs: 16
NUMA: Enabled
OS: RHEL 5.3 64-bit

Table 1 – Intel® Platform Tested

Summary for Sort Performance Optimizations

This section provides a brief summary of the recommendations from this performance

study. Section 6 provides more detail for those seeking a deeper technical dive.

Reducing I/O contention is critical to optimizing Sort stage performance. Spreading

sorting I/O usage across different physical disks is a simple first step. A sample

DataStage configuration file to implement this method is shown below.

{
    node "node1"
    {
        fastname "DataStage1.ibm.com"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets1" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch1" {pools ""}
    }
    node "node2"
    {
        fastname "DataStage2.ibm.com"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets2" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch2" {pools ""}
    }
}

In this configuration file, each DataStage processing node has its own scratch space

defined in a directory that resides on a separate physical device. This helps prevent contention for I/O subsystem resources among DataStage processing nodes. This is a fairly well-known technique and was not studied for this paper.

This paper describes additional techniques to achieve the optimal performance for

DataStage jobs containing Sort operations:

1. Setting the Restrict Memory Usage (RMU) parameter for sort operations to an

appropriate value for large data sets will reduce I/O demand on the scratch file

system. The recommended RMU size varies with the data set size and node count.

The formula is shown in section 6.1 along with a reference table that summarizes

the suggested RMU sizes for a variety of data set sizes and node counts. The

RMU parameter provides users with the flexibility of defining the sort buffer size

to optimize memory usage of their system.

2. Increasing the default Linux read-ahead value for the disk storage system(s) for

scratch space can increase the performance of the final merge phase of sort

operations. The recommended setting for the read ahead value is 512 or 1024

sectors (256KB or 512KB) for the scratch file system. See section 6.2 for

information on how to change the read ahead value in Linux.

3. Sort operations can benefit from having a buffer operator inserted prior to the

sort in the data flow graph. Because sort operations work on large amounts of

data, a buffer operator provides extra storage to get the data to the sort operator

as fast as possible. See section 6.3 for details.

4. Enabling the APT_OLD_BOUNDED_LENGTH setting can decrease I/O demand

during sorting when bounded length VARCHAR data is involved, potentially

resulting in improved overall throughput for the job.

Recommendations for Optimizing Sort Performance

We investigated the Sort operation in detail and considered the effect of a number of

performance tuning factors on the I/O characteristics. The input data size and the format

of the data are critical input factors affecting sort. The layout of the I/O subsystem and

the file cache and prefetching characteristics are also important. The RMU buffer size


configuration parameter has a significant effect on the behavior of the sort as the input

data set size is adjusted. These factors are considered in greater detail below.

In our tests, a job consisting of one sort stage and running on a one node configuration

was capable of sorting and writing data at the rate of 120 MB/s to the scratch file system.

Increasing the node count of the job quickly resulted in more I/O requests to the scratch

I/O storage array than it was able to service in a timely manner. Due to the limitation of

the scratch I/O system, the Intel® Xeon® server CPUs were greatly underutilized. The

scratch I/O file system was simply under-configured for a server with such a high

computational capability. This illustrates the high compute power available on the

Intel® Xeon® processors and the ability of IBM InfoSphere DataStage to efficiently

harness this compute power. Configuring sufficient I/O to harness the computational

capability of this powerful combination of hardware and software is of paramount

importance to enable efficient utilization of the system.

For our test system, we chose a configuration for the scratch storage I/O system that was

significantly undersized in comparison to the compute capability of the server. While we

recommend always configuring for optimal performance which would include a more

capable scratch storage system, customer feedback has indicated that many deployed

systems have insufficient bandwidth capability to the scratch storage system. The

tuning and configuration tips found in this paper are designed to increase performance on

all systems, but will be especially beneficial for systems constrained by the scratch I/O

storage system. In all cases, the amount of data transferred may be reduced by these

tuning and configuration tips.

By adjusting the DataStage parallel node count, we were able to match the scratch

storage capabilities and prevent the scratch storage system from being saturated. This

allowed us to study and develop this tuning guidance in a balanced environment. Using

this strategy, we developed several tuning guidelines to reduce the demand for scratch

storage I/O which we used to effectively increase performance. This is likely to be the

situation for many customers as growth in CPU processing performance continues to

outpace the I/O capability of storage subsystems.

Several valuable tuning guidelines were discovered and we present the findings here.

While these findings are significant, and we highly recommend them, we also want to

make clear that there is no substitute for having a high performance scratch storage

system capable of supplying sufficient bandwidth and I/Os per second (IOPS) to

maintain high CPU utilization. The tuning guidance given here will help even a high

performance scratch I/O system deliver better performance to DataStage jobs using sort

operations.

The remainder of this section describes the tuning results we found to improve sort

performance through I/O reduction.

Optimal RMU Tuning


This section describes how to tune the sort buffer size parameter called RMU to minimize

I/O demand on the scratch I/O system. An RMU value that is too small will result in

intermediate merges of temporary files during the initial sort phase. These intermediate

merges can significantly increase the I/O demand on the scratch file system. Tuning the

RMU value appropriately can eliminate all intermediate merge operations and greatly

increase throughput of sort operations for systems with limited I/O bandwidth to the

scratch I/O file system.

The scratch disk I/O system on many systems is a performance bottleneck because the disks, or the interconnect to them, cannot deliver the bandwidth needed to maximize CPU utilization. The elimination of pre-merging can

reduce the overall I/O demand on the scratch file system therefore allowing the scratch

file system to complete I/O faster, increasing throughput and decreasing job run time.

Configuration / Job Tuning Recommendations

Given knowledge of the size of data to be sorted, it is possible to calculate the optimal

RMU value that will prevent the pre-merge thread from running and thus reduce I/O

demand. The RMU formula is:

RMU (MB) >= SQRT( DataSizeToSort (MB) / NodeCount ) / 2

Notes about using the above formula:

1. The total data size is divided by the node count because the data sorted per node

decreases with increasing node count. A node in this context refers to the

number of parallel instances of the job when it is instantiated.

2. Our tests indicate that the RMU value can span a fairly large range and still

provide good performance. Sometimes the amount of data to be sorted is not

known precisely. We recommend attempting to estimate the input data size

within one or two factors of the actual value. In other words, overestimating the

data set size by a factor of 2x will still result in an RMU value from the above

equation that will provide good performance results.

3. The default RMU value is 20MB. This RMU value can sort up to 1.6 GB of data

per node while avoiding costly pre-merge operations. If your data set size

divided by node count is less than 1.6 GB, then no change is necessary to the

RMU.
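The formula and the 20 MB floor from note 3 can be captured in a small helper (a Python sketch; the function name and the rounding up to whole megabytes are our own conventions, not part of DataStage):

```python
import math

DEFAULT_RMU_MB = 20  # DataStage default sort buffer size

def recommended_rmu_mb(data_size_mb, node_count):
    """Minimum RMU (MB) that avoids pre-merge passes:
    RMU >= SQRT(DataSizeToSort(MB) / NodeCount) / 2,
    never below the 20 MB default."""
    rmu = math.sqrt(data_size_mb / node_count) / 2
    return max(math.ceil(rmu), DEFAULT_RMU_MB)
```

For example, sorting 100 GB (102,400 MB) on one node gives 160 MB and on four nodes gives 80 MB, matching the corresponding entries in Table 2; anything at or below roughly 1.6 GB per node falls back to the default.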

The following table is a handy reference of RMU settings for different sizes of input data

(data set size) and node counts. The table assumes the user knows the data set size to be

sorted. Knowing the precise size of the data being sorted may not be feasible. Overestimating the data set size by up to a factor of 2 times the actual data size will still result in

good performance. The default RMU value is 20 MB. The table contains the word

“Default” where the formula results in less than 20 MB indicating that the user should

use the default value. It is not necessary to decrease the RMU value below the 20 MB

default, though doing so is allowed.


Data Size to be Sorted (GB) | 1 Node | 4 Nodes | 8 Nodes | 16 Nodes | 24 Nodes | 32 Nodes | 48 Nodes | 64 Nodes
1     | Default | Default | Default | Default | Default | Default | Default | Default
1.5   | Default | Default | Default | Default | Default | Default | Default | Default
3     | 28      | Default | Default | Default | Default | Default | Default | Default
10    | 51      | 25      | Default | Default | Default | Default | Default | Default
30    | 88      | 44      | 31      | 22      | Default | Default | Default | Default
100   | 160     | 80      | 57      | 40      | 33      | 28      | 23      | 20
300   | 277     | 139     | 98      | 69      | 57      | 49      | 40      | 35
1000  | 506     | 253     | 179     | 126     | 103     | 89      | 73      | 63
3000  | 876     | 438     | 310     | 219     | 179     | 155     | 126     | 110
10000 | 1600    | 800     | 566     | 400     | 327     | 283     | 231     | 200

Entries are the minimum RMU in MB; "Default" indicates that the 20 MB default value is sufficient.

Table 2 – RMU Buffer Size Table

Our test results of a job consisting of one sort stage running with 4 parallel nodes with

two different RMU values are shown in Figure 3. The correct sizing of the RMU value

resulted in a 36% throughput increase. In the tests, the I/O bandwidth did not decrease

because the I/O subsystem was delivering the maximum bandwidth it was capable of in

both cases. However, because the total quantity of data transferred was much lower, the

CPU cores were able to operate at higher CPU utilization and complete the sort in a

shorter amount of time. This optimization is very effective for scratch disks that are

unable to deliver enough scratch file I/O bandwidth to feed the high performing Intel®

Xeon® Server and highly efficient IBM InfoSphere DataStage Software.

The results shown here are for a sort only job where we have isolated the effect of the

RMU parameter. This optimization will help more complex jobs, but will only directly

affect the performance of the sort operators within the job.

RMU Size | Read Ahead Setting     | Run Time
10 MB    | 128 KB (Linux default) | 4.05 minutes
30 MB    | 128 KB (Linux default) | 2.97 minutes

Figure 3 - Performance Tuning Sort with Sort Operator RMU value

To modify the RMU setting for a Sort stage in a job, open the Sort stage on the DataStage Designer client canvas, click the 'Stage' tab, then 'Properties'. Click 'Options' in the left window, and select 'Restrict Memory Usage (MB)' from the 'Available properties to add' window to add it.

Figure 4 – Adding RMU Option

Once the Restrict Memory Usage option is added, its value can be set to the recommended value based on the formula mentioned above.


Figure 5 – Setting RMU Option

Final Merge Sort Phase Tuning using Linux Read Ahead

During testing of the single node sort job, we found that CPU utilization of the final merge phase can be improved by changing the scratch disk read ahead setting in Linux, resulting in substantial throughput improvements for that phase.

Configuration / Job Tuning Recommendations

The default Linux file system read ahead value is 256 sectors. A sector is 512 bytes so the

total default read ahead is 128 kB. Our testing indicated that increasing the read ahead

value to 1024 sectors (512 kB) increased CPU utilization and reduced the final merge time

by reducing the amount of time that DataStage had to wait for I/Os from the scratch file

system. This resulted in an increase in throughput of the final merge phase of sort of

approximately 30%.

Test results for a job consisting of one sort stage running with 4 parallel nodes with two

different values for the Linux read ahead setting are shown in Figure 6. Increasing the

Linux default read ahead setting of 128 kB to 512 kB resulted in a 9% improvement in

throughput of the job.


RMU Size   Read Ahead Setting       Run Time
30 MB      128 kB (Linux default)   2.97 minutes
30 MB      512 kB                   2.72 minutes

Figure 6 - Performance Tuning Sort Operator with Linux Read Ahead Setting

The current read ahead setting for a disk device in Linux can be obtained using the following command:

>hdparm -a /dev/sdb1

To set the read ahead value for a specific disk device in Linux, use the following command:

>hdparm -a 1024 /dev/sdb1 (sets read ahead to 1024 sectors on disk device /dev/sdb1)

To make the setting persist across reboots, add the command to the /etc/init.d/boot.local file.

Recommended settings to try are 512 sectors (256 kB) or 1024 sectors (512 kB).
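For persistence, the boot.local entry might look like the following. Note that /dev/sdb1 is the scratch device from our test configuration and the boot script path varies by distribution, so treat this as a sketch rather than a drop-in file:

```shell
# /etc/init.d/boot.local (some distributions use /etc/rc.local instead)
# Set read ahead to 1024 sectors (512 kB) on the scratch device at boot.
hdparm -a 1024 /dev/sdb1
```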

Increasing read ahead size results in more data being read from the disk and stored in

the OS disk cache memory. As a result, more read requests by the sort operator get the

requested data directly from the OS disk cache instead of waiting for the full latency of a

data read from the scratch storage system. (Note that the Linux file system cache is

controlled by the kernel and uses memory that is not allocated to processes.)

In our tests, the scratch storage system consists of SSDs configured in a RAID-0 array. I/O

request latencies are low on this system compared to typical rotating media storage

arrays. Increasing OS read ahead will benefit scratch storage arrays consisting of HDDs

even more. Larger read ahead values than those tested may be more beneficial for HDD

arrays. We chose to use SSDs because they provide higher bandwidth, much improved

IOPS (I/Os per second) and much lower latency than an equivalent number of hard disk

drives.

Many RAID controllers found in commercial storage systems also have the capability to perform read ahead on read requests and cache the data. It is good to enable this feature if it is available on the storage array being used for scratch storage. Even so, it is still important to increase read ahead in the OS: serving requests from the OS disk cache is faster than waiting for data from the RAID engine.

The results shown here are on a job with a sort operation only. Tuning of read ahead will

not impact performance of other operations in the job that are not performing scratch

disk I/O.

Page 17: Best Practices - IBM · PDF filePage 3 3 Executive Summary The objective of this document is to communicate best practices for tuning IBM InfoSphere DataStage jobs containing sort

Page 17

17

Using a Buffer Operator to minimize latency for Sort Input

The DataStage parallel engine employs buffering automatically as part of its normal

operations. Because the initial sort phase has such a high demand for input data, it is

especially sensitive to latency spikes in the data source feeding the sort. These latency

spikes can occur due to data being sourced from local or remote disks, or due to

scheduling of operators by the operating system. By adding an additional buffer in front

of the sort, we were able to maintain the CPU utilization on the core running the sort

thread at 100% during the entire initial sort phase, thus increasing the performance of the

initial sort phase by nearly 7%.

Configuration / Job Tuning Recommendations

We recommend adding a buffer prior to the sort with a size equal to the RMU value. To add the buffer, open the Sort stage in a DataStage job on the DataStage Designer client canvas, click the 'Input' tab, then 'Advanced'. Select 'Buffer' from the 'Buffering mode' drop-down menu and set the 'Maximum memory buffer size (bytes)' field.

Figure 7 - Adding buffer in front of the sort
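The RMU option is specified in MB while the buffer field takes bytes, so sizing the buffer "equal to the RMU value" needs a unit conversion; a trivial sketch:

```python
def rmu_to_buffer_bytes(rmu_mb: int) -> int:
    """Convert a sort RMU value (MB) to the byte count expected by the
    'Maximum memory buffer size (bytes)' field on the Input tab."""
    return rmu_mb * 1024 * 1024


# A 30 MB RMU, as used in our tests, implies a 31457280-byte buffer.
```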


Minimizing I/O for Sort Data containing Variable length fields

By default, the parallel engine internally handles bounded length VARCHAR fields

(those that specify a maximum length) as essentially fixed length character strings. If the

actual data in the field is less than the maximum length, the string is padded to the

maximum size. This behavior is efficient for CPU processing of records throughout the

course of an entire job flow but it increases the I/O demands for operations such as Sort.

When the environment variable APT_OLD_BOUNDED_LENGTH is set, the data within each VARCHAR field is processed without additional padding, decreasing the amount of data written to disk. The lower I/O bandwidth demand can increase job throughput when the scratch file system is unable to keep up with the processing capability of DataStage and the Intel® Xeon® server. Note that additional CPU cycles are used to process variable length data when APT_OLD_BOUNDED_LENGTH is set; in effect, the setting trades CPU processing power for a reduction in the amount of I/O required from the scratch file system.

Our test results for a job consisting of one sort stage running with 16 parallel nodes using APT_OLD_BOUNDED_LENGTH showed a 25% reduction in the size of temporary sort files and a 26% increase in throughput (a 21% reduction in runtime).

Normalized Comparison            Default   With APT_OLD_BOUNDED_LENGTH
Scratch Storage Space Consumed   1.0       0.75x (75% of the original storage space used)
Runtime                          1.0       0.79x (79% of the original runtime)
Throughput                       1.0       1.26x (26% increase in job processing rate)

Table 3 – Sort Operation performance comparison using APT_OLD_BOUNDED_LENGTH
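The runtime and throughput rows in Table 3 are reciprocal views of the same measurement, which is easy to verify:

```python
# Table 3: throughput rose to 1.26x; runtime fell to the reciprocal, ~0.79x.
throughput_ratio = 1.26
runtime_ratio = 1 / throughput_ratio                      # ~0.794, i.e. ~0.79x
runtime_reduction_pct = round((1 - runtime_ratio) * 100)  # ~21% less runtime
```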

Please note that the performance benefit of this tuning parameter will vary based on several factors. It applies only to data records that contain varchar fields. The actual file size reduction realized on the scratch storage system will depend heavily on the maximum size specified for the varchar fields, the size of the actual data contained in them, and whether the varchar fields are a sort key for the records. The amount of

performance benefit will depend on how much the total file size is reduced, along with

the data request rate of the sort operations compared to the capability of the scratch file

system to supply the data. In our test configuration, the 16 node test resulted in the

scratch I/O system being driven to its maximum bandwidth limit. By setting

APT_OLD_BOUNDED_LENGTH, the amount of data that was written and subsequently

read from the disk decreased substantially over the length of the job allowing faster

completion.
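To illustrate how declared versus actual lengths drive the file size reduction, consider a hypothetical record layout; the field sizes below are invented for illustration and are not the schema used in our tests:

```python
def padded_record_bytes(varchar_fields):
    """Bytes written per record when bounded VARCHARs are padded to their
    declared maximum length (the parallel engine's default behavior)."""
    return sum(max_len for max_len, _actual in varchar_fields)


def unpadded_record_bytes(varchar_fields):
    """Bytes written per record when only the actual data is written
    (APT_OLD_BOUNDED_LENGTH set)."""
    return sum(actual for _max_len, actual in varchar_fields)


# Hypothetical record: (declared max length, typical actual length) pairs.
fields = [(100, 40), (50, 35)]
padded = padded_record_bytes(fields)      # 150 bytes per record
unpadded = unpadded_record_bytes(fields)  # 75 bytes per record
```

Here the scratch footprint halves; with wider declared maxima and shorter actual data the reduction grows, and with nearly full fields it shrinks, which is why the measured 25% reduction is workload dependent.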


Configuration / Job Tuning Recommendations

This optimization will only affect data sets that use bounded length VARCHAR data

types. APT_OLD_BOUNDED_LENGTH is a user defined variable for DataStage. The

variable can be added either at the project level or job level. You can follow the

instructions in the IBM InfoSphere DataStage and QualityStage Administrator Client

Guide and the IBM InfoSphere DataStage and QualityStage Designer Client Guide to add

and set a new variable.

We recommend trying this setting if low CPU utilization is observed during sorting or if

it is known that the scratch file system is unable to keep up with job demands.

Future Study: Using Memory for RAM Based Scratch Disk

As a future study, we intend to investigate performance when using a RAM based disk for scratch storage. The memory bandwidth available in the Nehalem-EX test system is

greater than 70 GB/s when correctly configured. While SSDs offer some bandwidth

improvements over hard disk drives, they cannot begin to match the performance of

main memory bandwidth. The system's PCI Express lanes can deliver ~35 GB/s of I/O in each direction if all PCIe lanes are utilized. However, such an I/O solution would

be expensive.

The currently available 4 socket Intel® X7560 systems can address 1 TB of memory and 8

socket systems can address 2 TB of memory. DRAM capacity will continue to rise with

new product releases and IBM X series systems also offer options to increase DRAM

capacity beyond the baseline. While DRAM is expensive when compared to disk drives

on a per capacity basis, it is more favorable when comparing bandwidth capability in and

out of the system. We plan to evaluate the performance and cost benefit analysis of large

in-memory storage compared to disk drive based storage solutions and provides the

results in the near future.


Best Practices

This paper describes additional techniques to achieve the optimal performance for

DataStage jobs containing Sort operations:

• Setting the Restrict Memory Usage (RMU) parameter for sort operations to an

appropriate value for large data sets will reduce I/O demand on the scratch file

system. The recommended RMU size varies with the data set size and node count.

The formula is shown in section 6.1 along with a reference table that summarizes

the suggested RMU sizes for a variety of data set sizes and node counts. The

RMU parameter provides users with the flexibility of defining the sort buffer size

to optimize memory usage of their system.

• Increasing the default Linux read-ahead value for the disk storage system(s) for

scratch space can increase the performance of the final merge phase of sort

operations. The recommended setting for the read ahead value is 512 or 1024

sectors (256KB or 512KB) for the scratch file system. See section 6.2 for

information on how to change the read ahead value in Linux.

• Sort operations can benefit from having a buffer operator inserted prior to the

sort in the data flow graph. Because sort operations work on large amounts of

data, a buffer operator provides extra storage to get the data to the sort operator

as fast as possible. See section 6.3 for details.

• Enabling the APT_OLD_BOUNDED_LENGTH setting can decrease I/O demand

during sorting when bounded length VARCHAR data is involved, potentially

resulting in improved overall throughput for the job.


Conclusion

We have shown how to optimize IBM InfoSphere DataStage sort performance on Intel®

Xeon® processors using a variety of tuning options such as Sort buffer RMU size, Linux

read ahead settings, additional Buffer operator, and configuring the Varchar length

parameter.

Our results reinforce the necessity of correctly sizing I/O to optimize server performance. For sort, it is imperative to have sufficient scratch I/O storage performance to keep all sort operators running in the system concurrently busy, in order to fully utilize the server.

Powerful mission critical servers like the Intel® Xeon® Platforms based on the X7500

series processor running the IBM InfoSphere DataStage parallel engine can efficiently

process data at extremely high data rates. As a result, I/O and network bandwidth are

extremely important for high performance. Network interconnects like 10 Gbit/s Ethernet or 40 Gbit/s Fibre Channel are necessary to fully realize the computation

potential of this powerful combination of hardware and software. In the near future, we

plan to analyze the cost and benefit trade off of using large DRAM capacity as a

replacement for disk subsystems for scratch I/O. We also will be looking at tuning high

bandwidth networking solutions to optimize performance.


Further reading

Other documentation you might be interested in:

• IBM InfoSphere Information Server, Version 8.5 Information Center

http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/index.jsp

Contributors

Garrett Drysdale is a Sr. Software Performance Engineer for

Intel. Garrett has analyzed and optimized software on Intel®

platforms since 1995 spanning client, workstation, and

enterprise server market segments. Garrett currently works

with enterprise software developers to analyze and optimize

server applications, and with internal design teams to assist

in evaluating the impact of new technologies on software

performance for future Intel® platforms. Garrett has a BSEE from the University of Missouri-Rolla and an MSEE from the Georgia

Institute of Technology. His email is

[email protected].

Jantz Tran is a Software Performance Engineer for Intel. He

has been analyzing and optimizing enterprise software on Intel

server platforms for 10 years. Jantz has a BSCE from Texas A&M

University. His email is [email protected].

Dr. Sriram Padmanabhan is an IBM Distinguished Engineer, and

Chief Architect for IBM InfoSphere Servers. Most recently, he

had led the Information Management Advanced Technologies team

investigating new technical areas such as the impact of Web

2.0 information access and delivery. He was a Research Staff

Member and then a manager of the Database Technology group at

IBM T.J. Watson Research Center for several years. He was a

key technologist for DB2’s shared-nothing parallel database

feature and one of the originators of DB2’s multi-dimensional

clustering feature. He was also a chief architect for Data

Warehouse Edition which provides integrated warehousing and

business intelligence capabilities enhancing DB2. Dr.

Padmanabhan has authored more than 25 publications including a

book chapter on DB2 in a popular database text book, several

journal articles, and many papers in leading database

conferences. His email is [email protected].


Brian Caufield is a Software Architect for InfoSphere

Information Server responsible for the definition and design

of new IBM InfoSphere DataStage features, and also works with

the Information Server Performance Team. Brian represents IBM

at the TPC, working to define an industry standard benchmark

for data integration. Previously, Brian worked for 10 years

as a developer on IBM InfoSphere DataStage specializing in the

parallel engine. His email is [email protected].

Fan Ding is currently a member of the Information Server

Performance Team. Prior to joining the team, he worked in

Information Integration Federation Server Development. Fan has a

Ph.D. in Mechanical Engineering and a Master's in Computer Science from the University of Wisconsin. His email is [email protected].

Ron Liu is currently a member of the IBM InfoSphere

Information Server Performance Team with focus on performance

tuning and information integration benchmark development.

Prior to his current job, Ron had 7 years in Database Server

development (federation runtime, wrapper, query gateway,

process model, and database security). Ron has a Master of

Science in Computer Science and Bachelor of Science in

Physics. His email is [email protected].

Pin Lp Lv is a Software Performance Engineer from IBM. Pin has

worked for IBM since 2006. He worked as a software tester for IBM

WebSphere Product Center Team and RFID Team from September

2006 to March 2009, and joined IBM InfoSphere Information Server

Performance Team in April 2009. Pin has a Master of Science degree in Computer Science from the University of West Scotland. His email is

[email protected]

Mi Wan Shum is the manager of the IBM InfoSphere Information Server performance team at the IBM Silicon Valley Lab. She graduated from the University of Texas at Austin and has many years of software development experience at IBM. Her email is

[email protected]

Jackson (Dong Jie) Wei is a Staff Software Performance Engineer for

IBM. He once worked as a DBA in CSRC before joining IBM in 2006.

Since then, he has been working on the Information Server product.

In 2009, he began to focus his work on the ETL performance. Jackson

is also the technical lead for the IBM China Lab Information Server

performance group. He received his bachelor's and master's degrees in Electronic Engineering from Peking University in 2000 and 2003, respectively. His email is [email protected].


Samuel Wong is a member of the IBM InfoSphere Information Server performance team at the IBM Silicon Valley Lab. He graduated from the University of Toronto and has 12 years of software development experience with IBM. His email is

[email protected]


Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other

countries. Consult your local IBM representative for information on the products and services

currently available in your area. Any reference to an IBM product, program, or service is not

intended to state or imply that only that IBM product, program, or service may be used. Any

functionally equivalent product, program, or service that does not infringe any IBM

intellectual property right may be used instead. However, it is the user's responsibility to

evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in

this document. The furnishing of this document does not grant you any license to these

patents. You can send license inquiries, in writing, to:

IBM Director of Licensing

IBM Corporation

North Castle Drive

Armonk, NY 10504-1785

U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where

such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES

CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER

EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-

INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do

not allow disclaimer of express or implied warranties in certain transactions, therefore, this

statement may not apply to you.

Without limiting the above disclaimers, IBM provides no representations or warranties

regarding the accuracy, reliability or serviceability of any information or recommendations

provided in this publication, or with respect to any results that may be obtained by the use of

the information or observance of any recommendations provided herein. The information

contained in this document has not been submitted to any formal IBM test and is distributed

AS IS. The use of this information or the implementation of any recommendations or

techniques herein is a customer responsibility and depends on the customer’s ability to

evaluate and integrate them into the customer’s operational environment. While each item

may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee

that the same or similar results will be obtained elsewhere. Anyone attempting to adapt

these techniques to their own environment do so at their own risk.

This document and the information contained herein may be used solely in connection with

the IBM products discussed in this document.

This information could include technical inaccuracies or typographical errors. Changes are

periodically made to the information herein; these changes will be incorporated in new

editions of the publication. IBM may make improvements and/or changes in the product(s)

and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only

and do not in any manner serve as an endorsement of those Web sites. The materials at

those Web sites are not part of the materials for this IBM product and use of those Web sites is

at your own risk.

IBM may use or distribute any of the information you supply in any way it believes

appropriate without incurring any obligation to you.

Any performance data contained herein was determined in a controlled environment.

Therefore, the results obtained in other operating environments may vary significantly. Some

measurements may have been made on development-level systems and there is no

guarantee that these measurements will be the same on generally available systems.

Furthermore, some measurements may have been estimated through extrapolation. Actual

results may vary. Users of this document should verify the applicable data for their specific

environment.


Information concerning non-IBM products was obtained from the suppliers of those products,

their published announcements or other publicly available sources. IBM has not tested those

products and cannot confirm the accuracy of performance, compatibility or any other

claims related to non-IBM products. Questions on the capabilities of non-IBM products should

be addressed to the suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or withdrawal

without notice, and represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To

illustrate them as completely as possible, the examples include the names of individuals,

companies, brands, and products. All of these names are fictitious and any similarity to the

names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate

programming techniques on various operating platforms. You may copy, modify, and

distribute these sample programs in any form without payment to IBM, for the purposes of

developing, using, marketing or distributing application programs conforming to the

application programming interface for the operating platform for which the sample

programs are written. These examples have not been thoroughly tested under all conditions.

IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these

programs.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International

Business Machines Corporation in the United States, other countries, or both. If these and

other IBM trademarked terms are marked on their first occurrence in this information with a

trademark symbol (® or ™), these symbols indicate U.S. registered or common law

trademarks owned by IBM at the time this information was published. Such trademarks may

also be registered or common law trademarks in other countries. A current list of IBM

trademarks is available on the Web at “Copyright and trademark information” at

www.ibm.com/legal/copytrade.shtml

Windows is a trademark of Microsoft Corporation in the United States, other countries, or

both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.