
APEX-Map: a parameterized scalable memory access probe for high-performance computing systems


CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2007; 19:2185–2205
Published online 31 May 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe.1166


Erich Strohmaier∗,† and Hongzhang Shan

Future Technology Group, Computational Research Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720, U.S.A.

SUMMARY

The memory wall between the peak performance of microprocessors and their memory performance has become the prominent performance bottleneck for many scientific application codes. New benchmarks measuring data access speeds locally and globally in a variety of different ways are needed to explore the ever increasing diversity of architectures for high-performance computing. In this paper, we introduce a novel benchmark, APEX-Map, which focuses on global data movement and measures how fast global data can be fed into computational units. APEX-Map is a parameterized, synthetic performance probe and integrates concepts for temporal and spatial locality into its design. Our first parallel implementation in MPI and various results obtained with it are discussed in detail. By measuring the APEX-Map performance with parameter sweeps for a whole range of temporal and spatial localities, performance surfaces can be generated. These surfaces are ideally suited to study the characteristics of the computational platforms and are useful for performance comparison. Results on a global-memory vector platform and distributed-memory superscalar platforms clearly reflect the design differences between these different architectures. Published in 2007 by John Wiley & Sons, Ltd.

Received 7 April 2006; Revised 11 September 2006; Accepted 19 December 2006

KEY WORDS: performance evaluation; benchmarking; workload characterization; high-performance computing; performance modeling; data locality

∗Correspondence to: Erich Strohmaier, Future Technology Group, Computational Research Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720, U.S.A.
†E-mail: [email protected]

This article is a U.S. Government work and is in the public domain in the U.S.A.

1. INTRODUCTION

The memory wall has become the prominent performance bottleneck for many scientific application codes during the last few decades. However, many benchmarking efforts in scientific computing have focused on measuring the floating-point computing capabilities of a system and have often ignored or downplayed the memory subsystem and processor interconnect. One prominent example is the Linpack benchmark, which is used to rank systems in the TOP500 Project [1]. This type of benchmark can provide guidance for the performance of some compute-intensive applications, but fails to provide reasonable guidance for the performance of any memory-bound real applications.

The increasing gap between CPU speed and memory speed is the main reason that the capability to load and store data locally and globally has become the dominant performance bottleneck for many scientific applications. To remedy this situation, system architects design increasingly complex memory systems and interconnect networks in an effort to increase data-transfer bandwidths and reduce data-transfer latencies. However, we still lack a widely accepted benchmark to measure our ability to access globally distributed data in ways relevant for application performance.

We introduce a synthetic, parameterized memory access probe called APEX-Map [2–4], which measures global data access performance based on parameterized concepts for data localities. APEX-Map has three main parameters: the global memory size M used, the temporal locality α, and the spatial locality L. Our basic idea is that an application's global memory access can be approximated by multiple data access streams, each of which can be characterized with the three parameters introduced above. The execution profile of APEX-Map can then be tuned by its set of input parameters to match the data access characteristics of a chosen scientific application. This allows us to use APEX-Map as a performance proxy for the performance behavior of the actual codes. An advantage of our synthetic benchmark probe is that its simplicity makes it easy to run in simulators, which allows its usage in the early stages of architecture design; many applications themselves are too complex to be run by simulators.

APEX-Map differs from many benchmarks in that it is designed to allow its input parameters to be varied independently of each other between extreme values. This allows exhaustive multi-dimensional parameter sweeps, which can be used to generate continuous performance surfaces to explore the performance effects of all potential values of the characterizing parameters. By analyzing these surfaces, we can understand how changes in spatial locality or temporal locality affect the performance of applications and which factors are more important for performance. Moreover, we can compare these performance surfaces across different platforms and explore the advantages and disadvantages of each platform.

The rest of this paper is organized as follows. Section 2 gives an overview of related work in the field of benchmarking. The general design concepts of APEX-Map are described in Section 3. Section 4 provides details about the parallel implementation using MPI. In Section 5, we analyze our results on our test platforms, which include hierarchical superscalar designs and vector architectures. We find that the APEX-Map performance results clearly reflect the different design philosophies of the superscalar and the vector systems. Finally, we analyze the scalability of these platforms based on the APEX-Map results. Section 6 summarizes our results and discusses our ongoing and future work.

2. RELATED WORK

2.1. Current situation in the field of benchmarking and performance characterization

During the last few decades the variety and complexity of architectures used in high-performance computing (HPC) has increased steadily [5–7]. At the same time we have seen an increasing user community with new and constantly changing applications. The space of performance requirements of HPC applications is growing more varied and complex. No longer can any single architecture satisfy the need for cost-effective high-performance execution of applications for the whole scientific HPC community. Despite this variety of system architectures and application requirements, benchmarking and performance evaluation is still dominated by spot measurements using isolated benchmarks based on specific scientific kernels or applications. Several benchmarking initiatives have attempted to remedy this situation using the traditional approach of constructing benchmark suites. These suites contain a variety of kernels or applications, which ideally represent a broad variety of requirements. Relating these multiple spot measurements to other applications is still difficult, however, because no general application performance characterization methodology exists. Using real applications as benchmarks is also very time consuming, and collections of results are limited and hard to maintain. Due to the difficulty of defining performance characteristics of applications, many synthetic benchmarks are instead designed to measure specific hardware features. While this allows hardware features to be understood in great detail, it does not help in understanding overall application performance. Due to renewed interest in new architectures and new programming paradigms there is also a growing need for flexible and hardware-independent approaches to benchmarking and performance evaluation across an increasing space of architectures. These benchmarks also need to be general enough so that they can be adapted to many different parallel-programming paradigms.

2.2. Benchmarking initiatives

The memory wall between the peak performance of microprocessors and their memory performance has become the prominent performance bottleneck for many scientific application codes. Despite this development, many benchmarking efforts in scientific computing in the past have focused on measuring the floating-point computing capabilities of a system and have often ignored or downplayed the memory subsystem and processor interconnect. One prominent example is the Linpack benchmark, which is used to rank systems in the TOP500 Project [1]. This type of benchmark can provide guidance for the performance of some compute-intensive applications, but fails to provide reasonable guidance for the performance of any memory-bound real applications. On most platforms, Linpack can achieve well over 70% of peak performance, while on the same systems real applications typically achieve substantially lower performance rates. Nevertheless, Linpack is still the most widely used and cited benchmark for HPC systems and has its use as a first test for the stability and accuracy of any new system.

Despite this situation there is still no standard or widely accepted way to measure progress in our ability to access globally distributed data. STREAM [8] is a common measure for memory bandwidth, but its use is limited to single processors and, with some restrictions, to single shared-memory nodes. In addition, it exclusively emphasizes regular stride-one access to main memory, which makes it irrelevant for many modern scientific codes.

During recent years several new approaches to benchmarking HPC systems have been explored. The HPC Challenge benchmark [9] is a major community-driven benchmarking effort backed by the DARPA HPCS program [10]. It is built on the traditional idea of spanning the range of possible application performances with a set of different benchmarks. HPC Challenge is a suite of kernels with memory access patterns more challenging than those of the High Performance Linpack (HPL) benchmark used in the TOP500 list. Thus, the suite is designed to provide benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g. spatial and temporal locality, and to provide a framework for including additional tests. In particular, the suite is composed of several well-known computational kernels (STREAM, HPL, matrix multiply (DGEMM), parallel matrix transpose (PTRANS), FFT, RandomAccess, and bandwidth/latency tests (b_eff)) that attempt to span the space of high and low spatial and temporal locality. Other than STREAM, the RandomAccess benchmark comes closest to being a data access benchmark by measuring the rate of integer random updates possible in global memory. Unfortunately, the structure of the RandomAccess benchmark cannot easily be related to scientific applications and thus does not help much for application performance prediction. The major problem the HPC Challenge benchmark faces is how to become widely accepted and used as a reference benchmark.

The DARPA HPCS program itself pursues an extensive traditional layered approach to benchmarking by developing benchmarks on all levels of complexity from simple kernels to full applications. Due to the goal of the HPCS program of developing a new highly productive HPC system in the petaflops performance range by the end of the decade, it will also need methodologies for modeling performance of non-existing workloads on non-existing machines.

Most current benchmark suites (HPCC [9], NAS [11], and SPEC [12]) contain only several application codes, or their synthetic benchmarks have other features strongly limiting the scope of performance behaviors they can explore. The results of these application benchmarks provide very good indications of how similar applications will perform on a specific platform. However, these benchmarks are restricted to spot measurements and cannot reveal performance behavior for different parameter ranges.

In a different research effort at the Berkeley Institute for Performance Studies (BIPS) [13], several small synthetic tunable benchmark probes have been developed. These probes focus on exploring the efficiencies of specific architectural features and are designed with a specific limited scope in mind. Sqmat [14] is a sequential probe to explore the influence of spatial locality and of different aspects of computational kernels on performance. It is limited to sequential execution only and has no concept of temporal locality, but it has tackled the problem of how to characterize the detail of computation for performance characterization purposes.

Benchmarking using a full application is also pursued at BIPS [15,16]. While application performance is the best measure of performance, using an application is also very time consuming and large collections of comparable results are also very hard to obtain and to maintain. Application benchmarks are also not suitable for simulators and thus hard to use for the evaluation of future systems.

2.3. Performance modeling

Another approach to understanding and predicting application performance on various architectural platforms is through performance modeling. The goal of performance modeling is to gain an understanding of a computer system's performance on various applications, by means of measurement and analysis, and to encapsulate these characteristics in a derived model. A number of important research efforts are developing these modeling methodologies, including the Performance Evaluation Research Center (PERC) [17], which includes the Performance Modeling and Characterization (PMaC) group [18], and the Performance and Architecture Laboratory (PAL) at Los Alamos National Laboratory [19]. Performance modeling is an important component in understanding performance on current and, especially, future machines, but it does not replace the need for detailed application performance analysis. Model development is time consuming and specific to each application, and there is a tradeoff between the accuracy of the model and its simplicity. Performance models are often calibrated with architectural parameters measured by synthetic benchmarks or by measurements of code fragments. APEX-Map can be used to measure several architectural parameters, and it is conceivable to build analytic performance models based on these parameters. This would open up a new alternative to existing approaches in this field. Performance models for applications are becoming increasingly important as the concurrency levels of our computing platforms increase rapidly [20].

2.4. Data locality concepts and application performance characterization

Theoretical research by Snir and Yu [21] recently showed that spatial and temporal locality cannot easily be characterized with single parameters, as their dimensionality is potentially unbounded. In addition, they also showed that spatial and temporal locality cannot be studied in isolation from each other. From a practical point of view these results are not surprising, considering the large variety of issues influencing data access performance in general. One important question in this context is whether a simple performance characterization of data access with a few parameters is able to capture the majority of performance effects for most scientific kernels.

APEX-Map [2–4,22,23] is a tunable synthetic benchmark that measures global data access performance. It is designed based on parameterized concepts for temporal and spatial locality and generates a global data access stream according to specified levels of these measures of locality. This allows the whole range of performance to be explored for all levels of spatial and temporal locality. APEX-Map stresses a machine's memory subsystem and processor interconnect according to the parameterized degrees of spatial and temporal locality. By selecting specific values for these measures of temporal and spatial locality, it can serve as a performance proxy for a specific scientific kernel [2]. APEX-Map is described in detail in the following sections.

Several different characterizations of data access along the notion of spatial and temporal locality have recently been proposed [24–27]. The MetaSim tracer was developed to analyze a program's memory reference stream [28]. Its output can be used to derive appropriately defined spatial and temporal locality scores [29]. While this is a necessary first step for a profiling-based application performance characterization, it still needs to be demonstrated that scores thus measured are consistent between different applications and truly reflect the performance behavior of these codes. Such profiling tools would, however, be an ideal complement to synthetic benchmarks that are based on the notion of locality, such as APEX-Map, if the underlying definitions of locality are reasonably aligned and can be unified. In this case, MetaSim could be used to profile codes and derive locality parameters, which in turn could be used to execute APEX-Map as a performance proxy for the actual application.

3. CONCEPTS AND VALIDATION OF APEX-MAP

In the Application Performance Characterization (APEX) project we developed a characterization for global data access streams together with a related synthetic benchmark called a memory access probe (APEX-Map). This characterization and APEX-Map allowed us to capture the performance behavior of several scientific kernels across a wide range of problem sizes and computational platforms [2]. We further extended these performance characterization and benchmarking concepts for parallel execution [3,4]. In this section we briefly describe the principles behind APEX-Map, our experience implementing it, and the results we obtained using it.

Figure 1. The data access model of APEX-Map.

Figure 2. Effect of α on hit/miss rates and on the ratio of local to remote data requests (1:256).

3.1. Design principles of APEX-Map

APEX-Map assumes that a synthetic address stream, which is generated based on parameterized concepts for temporal and spatial locality, can approximate the performance of a data access pattern of an application. It uses a blocked data access to a global array of size M to simulate the effects of spatial locality. The block length L is used as a measure for spatial locality, and L can take any value between 1 (single word access) and M. A non-uniform random selection of starting addresses for these blocks is used to simulate the effects of temporal locality (Figure 1).

A power function was chosen as the generating function, as a simple, scale-invariant, one-parameter approximation of the behavior of real applications [2]. Non-uniform random numbers X are generated from uniform random numbers r with the generating function X = r^(1/α). The characteristic parameter α of the generating function is used as a measure for temporal locality and can take values between 0 and 1. A value of α = 1 generates uniform random numbers, while small values of α generate random numbers concentrated towards the starting address of the global data array. The effect of the temporal locality parameter α on cache hit and miss rates is illustrated in Figure 2 for a ratio of cache size to used memory of 1:256.
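To make the role of α concrete, the following small sketch (our own illustration, not part of the APEX-Map distribution) draws samples with the generating function X = r^(1/α) and estimates which fraction of the generated addresses falls into the first 1/256 of the address range, mirroring the 1:256 cache-to-memory ratio used for Figure 2:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Illustration only: estimate the fraction of power-distributed addresses
   X = r^(1/alpha) that fall into the first 1/256 of the address range,
   i.e. an idealized hit rate for a cache covering 1/256 of memory. */
int main(void)
{
    const long   samples         = 1000000;
    const double cached_fraction = 1.0 / 256.0;      /* assumed cache:memory ratio of 1:256 */
    const double alphas[]        = { 1.0, 0.1, 0.01, 0.001 };

    for (int a = 0; a < 4; a++) {
        long hits = 0;
        for (long i = 0; i < samples; i++) {
            double x = pow(drand48(), 1.0 / alphas[a]);   /* non-uniform address in [0,1) */
            if (x < cached_fraction)
                hits++;
        }
        printf("alpha = %5.3f: %.1f%% of accesses fall into the first 1/256\n",
               alphas[a], 100.0 * hits / samples);
    }
    return 0;
}

For α = 1 this estimate is roughly 0.4%, while for α = 0.001 it approaches 99%, consistent with the hit/miss behavior shown in Figure 2.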


Figure 3. Ratio of Nbody and APEX-Map runtimes as a function of memory size (in KB) for Power3 (200 MHz), Power3 (375 MHz), Power4 (1.3 GHz), Opteron (1.6 GHz), Xeon (2.8 GHz), and Cray X1 (800 MHz).

Temporal locality in actual programs can be caused by different reasons. In some cases (e.g. a blocked matrix–matrix multiplication) variables are not used for long periods of time, but once they are used they are reused in close time proximity multiple times; overall, however, all variables are accessed with equal frequency. In other codes (e.g. matrix–vector multiplication) some variables are simply accessed more often than others (in our example the elements of the original vector). Exploiting these different flavors of temporal locality in the sequential case requires different caching strategies, such as dynamic caching in our first example and static caching in our second example. In practice, dynamic caching is used almost exclusively, as it tends to work reasonably well in many (but not all!) situations that conceptually would require static caching. APEX-Map clearly uses more frequent accesses to certain addresses to simulate the effects of temporal locality.

In the parallel case the difference between these flavors becomes more important, as placement (and possibly sharing) of data and their affinity to processes becomes a performance issue. In APEX-Map we assume that each process accesses certain variables more often and that these variables can be placed closer in memory to this process. We do not address the different question of how to address and exploit temporal locality in cases where all variables are accessed with equal frequency overall and data placement is thus an ineffective strategy for exploiting temporal locality. APEX-Map also assumes that sharing of variables is not a performance constraint, as we only read global data but do not modify them. More discussion of the rationale behind the APEX-Map parameters can be found in [2–4].

3.2. Sequential validation

To validate the performance characterization developed for APEX, we conducted a study using six different scientific kernels (Radix, FFT, matrix–matrix multiplication, strided matrix–matrix multiplication, Nbody, NAS CG) [2]. We executed these kernels and APEX-Map sequentially on six different test platforms for a variety of problem sizes and selected one set of characteristic parameters for each kernel. With this we were able to capture the performance scaling behavior for most kernels with satisfactory accuracy. Figure 3 shows the results for the Nbody kernel. In this example APEX-Map is able to replicate the performance scaling of seven different problem sizes ranging from 100 kB to 500 MB on six different systems with only one set of locality parameters.

Overall we found that kernels with a very regular data access structure had to be approximated with a regular, strided data access stream, and only in one case (CG) was it necessary to combine one irregular and one regular data access stream to achieve satisfactory precision.

4. PARALLEL IMPLEMENTATION

APEX-Map is designed to use the same three main parameters for both the sequential version and the parallel version. These parameters are the global memory size M, the measure of temporal locality α, and the measure of spatial locality L. For the parallel version, an important question is whether the effects of temporal locality and process locality should be treated independently of each other by implementing different parameters and execution models for them, or whether a global usage of the temporal locality model can provide a sufficient first approximation of the data access behavior of scientific application kernels. For the parallel execution of a single scientific kernel, any method to divide the problem to increase process locality should also be usable to improve temporal locality in a sequential execution. At the same time, any algorithm with good temporal locality should, in turn, exhibit good process locality in a global parallel implementation. Thus, the only cases where process and temporal localities can differ substantially are algorithms for which the problem solved in each process is different from the problem between processes. One example would be the embarrassingly parallel execution of a kernel with low temporal locality by running multiple copies of the individual problem with different parameters at the same time, thus generating high process locality. For simplicity reasons we therefore decided to treat temporal and process locality in a unified way by extending the sequential temporal locality concept to global memory access in the parallel case. One important implementation detail here is that we access the global array only with load operations, which avoids any possibility of race conditions for memory update operations.

In the parallel version of APEX-Map the global data array of size M is evenly distributed across all processes (see Figure 1). Data will again be accessed in block mode, i.e. L contiguous memory addresses are accessed in succession, and the block length L is used to characterize spatial locality. The starting addresses X of these data blocks are computed in advance using a non-uniform random address generator driven by a power function with the shape parameter α.

The basic flowchart of any parallel version of APEX-Map is shown in Table I (the 'Basic parallel' flowchart). The indices X are generated and stored in an index array before the measurement starts. Then, for each index, it is tested whether the addressed data reside in local memory; if so, the computation proceeds immediately, and if the data reside in remote memory they are fetched into local memory first. APEX-Map is designed to measure the rate at which global data can be fed into the CPU itself and not only into the memory or into a cache. Therefore, it is essential that an actual computation be performed in the Compute module. This computation is currently a global sum of all accessed array elements.
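As an illustration of the timed access loop and its Compute step, the minimal sketch below (our own; names such as timed_pass, data, and idx are assumptions) shows a pass over pre-generated block indices for the purely local case, accumulating the accessed words into a global sum as described above:

#include <mpi.h>

/* Sketch of the timed inner loop for the purely local case.  data points to
   this process's segment of the global array, idx holds I pre-generated block
   indices, and L is the block length in words.  The names are illustrative. */
double timed_pass(const double *data, const long *idx, int I, int L, double *sum)
{
    double t0 = MPI_Wtime();
    for (int i = 0; i < I; i++) {
        const double *block = data + idx[i] * L;   /* start of the selected block */
        for (int j = 0; j < L; j++)
            *sum += block[j];                      /* Compute: global sum of accessed words */
    }
    return MPI_Wtime() - t0;                       /* caller adds this to RunningTime */
}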

Table I. The flowchart of APEX-Map implementation.

Basic parallel:
    Repeat N times
        Generate index array
        CLOCK(start)
        For each index i in the array
            If (data not in local memory)
                Get remote data
            End If
            Compute
        CLOCK(end)
        RunningTime += end - start
    End Repeat

MPI:
    Repeat N times
        Generate index array
        CLOCK(start)
        For each index i in the array
            If (local data)
                Compute
            Else
                Generate remote request
            End If
            Serve incoming requests
            Process replies
        CLOCK(end)
        RunningTime += end - start
    End Repeat
    CLOCK(start)
    Wait for finish
    CLOCK(end)
    RunningTime += end - start

The pre-computed indices X are stored in an array whose size is given by a parameter I, with a default size of 1024. The time to pre-compute the indices is not included in the measurements. Generating these non-uniform random indices in the parallel implementation involves the following two steps:

Step 1: Index = Power(drand48(), 1/α) * (M/L * P − 1)
Step 2: Index = (Index + M/L * myid) % (M/L * P)

In the first step, a non-uniform random address based on a power distribution function controlled by the parameters M, L, and α is produced. However, because each process should access its own segment of the global data array with the highest probability, the computed global address has to be shifted according to the rank of the process in the second step (P is the total number of processes and myid is the rank of a process). The frequency with which remote data accesses occur is mainly determined by the parameter α, the temporal locality. Figure 2 illustrates how the percentage of remote accesses varies with the value of α for 256 processes. For α = 1, the data accesses follow a uniform random distribution and the percentage of remote accesses is 255/256 (= 99.6%). With increasing temporal locality, the percentage of remote accesses drops to 0.55% for α = 0.001.
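A compact sketch of these two steps in C is given below; it follows the formulas above (with M/L taken as the number of blocks per process), but it is an illustration rather than the actual APEX-Map source:

#include <stdlib.h>
#include <math.h>

/* Generate 'count' non-uniform block indices for one process.
   M/L is the number of blocks per process, P the number of processes,
   myid this process's rank, and alpha the temporal-locality parameter. */
void generate_indices(long *idx, int count, long M, long L,
                      int P, int myid, double alpha)
{
    long blocks_per_proc = M / L;
    long total_blocks    = blocks_per_proc * P;

    for (int i = 0; i < count; i++) {
        /* Step 1: power-distributed global block index */
        long x = (long)(pow(drand48(), 1.0 / alpha) * (double)(total_blocks - 1));
        /* Step 2: shift so that this process favours its own segment */
        idx[i] = (x + blocks_per_proc * myid) % total_blocks;
    }
}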

One major non-trivial issue is how the remote access is carried out. The actual implementation will be highly affected by the available parallel programming paradigm and by different programming styles. However, we assume that the operations for different indices are independent and that multiple remote accesses can be executed independently of each other at the same time.

The main output of APEX-Map is the average number of cycles per data access for one process and the aggregate bandwidth in megabytes per second for the given parameters. The results are directly comparable across different platforms. By running a set of parameters, such as α = 0.001 to 1.0 and L = 1 to 16 384 words, APEX-Map can generate a performance surface to explore the performance effects of temporal locality and spatial locality.
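Such a surface can be produced with a simple sweep driver; the sketch below is only illustrative, and run_apex_map() is a hypothetical stand-in for a complete benchmark run that returns the measured aggregate bandwidth:

#include <stdio.h>

/* Hypothetical probe entry point: runs APEX-Map for one (alpha, L) point on a
   global array of M words and returns the aggregate bandwidth in MB/s.
   A dummy stub is provided here only so the sketch is self-contained. */
static double run_apex_map(double alpha, long L, long M)
{
    (void)alpha; (void)L; (void)M;
    return 0.0;   /* placeholder; the real probe would perform the measurement */
}

int main(void)
{
    const long   M        = 64L * 1024 * 1024;             /* example size in words */
    const double alphas[] = { 0.001, 0.01, 0.1, 1.0 };

    for (int a = 0; a < 4; a++)
        for (long L = 1; L <= 16384; L *= 2)               /* L = 1 ... 16 384 words */
            printf("%.3f %6ld %10.2f\n",
                   alphas[a], L, run_apex_map(alphas[a], L, M));
    return 0;
}

Each printed line corresponds to one point of the performance surface and can be plotted directly as in Figures 9–11.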

4.1. MPI implementation

Our first parallel version of APEX-Map was developed using two-sided MPI because it is the most popular and portable parallel programming model available today. Even with this restriction there are many different possible implementations. One possibility is to aggregate the remote requests instead of sending them one by one. We explored several different strategies to do this in depth, but had to conclude that we ended up only benchmarking our inventiveness for new algorithms to assemble and exchange these messages and our skills in implementing them. This approach not only further complicates the code, but also conflicts with our locality concept. By extensively rearranging the order of data accesses, the actual executed address stream would no longer show the intended features to achieve the given localities. In effect, such rearranging would substantially change the actual localities from the intended localities and would go against our design principles. We therefore decided not to permit such message aggregation and to exchange messages for each remote access.

However, we permit multiple outstanding requests for data and out-of-order processing of the received data. In APEX-Map, because the process numbers for message exchanges are generated based on a non-uniform random access pattern, non-blocking asynchronous MPI functions are used to avoid blocking and deadlock. Given our non-deterministic random message pattern, it was not clear whether a scalable implementation of APEX-Map in MPI was possible. However, we succeeded with an efficient and scalable implementation, which shows increasing performance up to thousands of processors.

Due to the unpredictable communication patterns, the flowchart becomes substantially more complex (see the MPI flowchart in Table I and the module details in Table II), and several MPI-related implementation parameters have to be introduced. The first parameter is B, the number of receive buffers allocated, which are needed for each call of MPI_Irecv. It defines the maximum possible number of concurrent outstanding remote data requests per process. Another parameter is SMSG, the maximum number of outstanding send handles defined for MPI_Isend. The last parameter is NSER, with which we limit how many remote requests can be served at one time by our 'Serve incoming requests' module. This parameter is especially useful when the remote request distribution is imbalanced. Without it, a process may get completely stuck serving remote requests for a long time and might not make any progress on its own local computation, which would cause a severe load imbalance at the end of the global execution.
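The sketch below illustrates one way such bounded non-blocking communication can be organized, with B pre-posted receives and at most SMSG request messages in flight; the buffer layout, tags, and function names are our own assumptions and not the actual APEX-Map source:

#include <mpi.h>

enum { B = 8, SMSG = 64, MAXL = 1024 };         /* illustrative defaults */

static double      reply_buf[B][MAXL];          /* reply data, one buffer per outstanding request */
static MPI_Request reply_req[B];
static long        req_payload[SMSG];           /* block index carried by each request message */
static MPI_Request send_req[SMSG];
static int         next_send = 0;               /* round-robin slot for send handles */

/* Pre-post B non-blocking receives for incoming reply blocks of L words (L <= MAXL). */
void init_buffers(int L)
{
    for (int s = 0; s < SMSG; s++)
        send_req[s] = MPI_REQUEST_NULL;
    for (int b = 0; b < B; b++)
        MPI_Irecv(reply_buf[b], L, MPI_DOUBLE, MPI_ANY_SOURCE,
                  0 /* reply tag */, MPI_COMM_WORLD, &reply_req[b]);
}

/* Issue one remote data request; waiting on the reused slot bounds the
   number of outstanding sends to SMSG. */
void request_block(long block_index, int target_rank)
{
    int slot = next_send;
    next_send = (next_send + 1) % SMSG;
    MPI_Wait(&send_req[slot], MPI_STATUS_IGNORE);      /* returns immediately for MPI_REQUEST_NULL */
    req_payload[slot] = block_index;
    MPI_Isend(&req_payload[slot], 1, MPI_LONG, target_rank,
              1 /* request tag */, MPI_COMM_WORLD, &send_req[slot]);
}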

In summary, there are three kinds of APEX-Map parameters. The first category comprises M, L, and α, the characteristic parameters of interest. The second category comprises general implementation-related parameters: the index array size I and the number of times N the experiment is repeated. The third category comprises parameters related to the MPI implementation: the number of receive buffers B, the number of send handles SMSG, and the maximum number of served requests in one iteration NSER. Fortunately, experiments on several systems indicate that our default values for all implementation parameters work reasonably well on all of them. The default values for I and N are 1024 and 1000, respectively; however, on some systems an N value as small as 10 is sufficient to achieve consistent results. For small messages (L < 1024), the optimal number of receive buffers B is typically eight; for larger messages it is typically four. Using more receive buffers than this typically causes an increase in overhead larger than the possible performance gains. The SMSG and NSER parameters are set to min(64, 4·P).


Table II. The MPI implementation of different program modules.

Generate remote request:
    Check receive buffers (B = total buffer number)
    While (no receive buffer available)
        Check incoming request
        If (have incoming request)
            Serve incoming request
        End If
        Check receive buffers
        If (some buffer available)
            Compute
            Free buffer and break
        End If
    End While
    Send request to remote process
    Post non-blocking receive

Serve incoming requests:
    Check incoming requests
    Served number = 0
    While (have request && served number < max allowed: NSER)
        If (reached maximum outstanding sends: SMSG)
            Check finished non-blocking sends
            If (none)
                Break
            End If
        End If
        Call non-blocking send
        Served number++
        Check incoming requests
    End While

Process replies:
    Check incoming data
    If (have)
        Compute data in the receive buffer
        Release the receive buffer
    End If

Wait for finish:
    If (my work done)
        Inform master
    End If
    While (not all processes finished)
        Check and serve incoming requests
        If (my work not done)
            Process replies
            If (my work done)
                Inform master
            End If
        End If
        Check whether all finished
    End While

The flowcharts for the modules 'Generate remote request', 'Serve incoming requests', 'Process replies', and 'Wait for finish' are shown in Table II. The 'Wait for finish' module is needed for MPI because even if a process has finished its own task, it may still need to provide data for other processes and hence cannot complete its execution.

5. RESULTS AND ANALYSIS

In this section, we first look at the achieved scalability of the MPI-based implementation of APEX-Map. We then show the utility of the performance surfaces generated with APEX-Map for analyzing individual systems or comparing different systems. Finally, we compare the APEX-Map results between different classes of systems and examine how these results reflect their architectural differences.

Table III. Some characteristics of the three platforms used.

            CPU                                           Memory bandwidth                              Network
Seaborg     IBM Power3, 375 MHz                           16 GB/s per node; 1 GB/s per processor        IBM Colony-II, 1 GB/s per node
Cheetah     IBM Power4, 1.3 GHz                           44 GB/s per node; 1.375 GB/s per processor    IBM Federation, 4 GB/s per node
Phoenix     Cray X1, 400 MHz (800 MHz for vector units)   25.6 GB/s per MSP                             Cray SeaStar, 25 GB/s per node

Most results reported in this paper were obtained on three platforms: Seaborg, Cheetah, and Phoenix. Seaborg is currently the main computing platform at NERSC, a DOE Office of Science user facility at Lawrence Berkeley National Laboratory. It is an IBM Power3 based distributed-memory machine. Each node has 16 IBM Power3 processors running at 375 MHz; the peak performance of each processor is 1.5 Gflops. Its network switch is the IBM Colony II, which is connected to two 'GX Bus Colony' network adapters per node.

Cheetah is a 27-node IBM p690 system with the IBM Federation switch, where each node has 32 Power4 processors at 1.3 GHz. The peak performance of each processor is 5.2 Gflops. Phoenix is a Cray X1 platform consisting of 512 multi-streaming processors (MSPs). Each MSP has four single-stream vector processors and a 2 MB cache. Four MSPs form a node with 16 GB of shared memory. The interconnect functions as an extension of the memory system, offering each node direct access to memory on other nodes. These two machines are currently operated by the Center for Computational Sciences at Oak Ridge National Laboratory. Table III lists some main characteristics of these three systems.

5.1. Scalability of APEX-Map

In APEX-Map, because the process numbers for message exchanges are generated based on a non-uniform random access pattern, non-blocking asynchronous MPI functions are used to avoid blocking and deadlock. Given our non-deterministic random message pattern, it was not clear whether a scalable implementation of APEX-Map in MPI was possible. However, we succeeded with an efficient and scalable implementation, which shows increasing performance up to thousands of processors. At the same time, APEX-Map results on many systems follow a very simple performance model (Figure 4) of A + B * log2(P), where A and B are parameters characterizing each level of the system hierarchy and P is the number of processors. Achieving such high scalability for this type of code requires careful selection of numerous implementation-related details.
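As a small worked example (our own, not from the paper), the model parameters A and B can be obtained within one level of the system hierarchy by a least-squares fit of measured cycles-per-access values in x = log2(P):

#include <stdio.h>
#include <math.h>

/* Fit cycles(P) = A + B*log2(P) to measured pairs "<P> <cycles per access>"
   read from standard input, using ordinary least squares in x = log2(P). */
int main(void)
{
    double P, y, sx = 0, sy = 0, sxx = 0, sxy = 0;
    long   n = 0;

    while (scanf("%lf %lf", &P, &y) == 2) {
        double x = log2(P);
        sx += x; sy += y; sxx += x * x; sxy += x * y;
        n++;
    }
    if (n < 2)
        return 1;                                   /* need at least two points */

    double B = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double A = (sy - B * sx) / n;
    printf("A = %.2f  B = %.2f  (cycles = A + B*log2(P))\n", A, B);
    return 0;
}

A separate fit would be made for each level of the system hierarchy, as indicated in the text.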


Figure 4. Scalability of APEX-Map and its performance model for three different block sizes (L = 1, 64, 4096): CPU cycles per data access versus number of processes. Results were obtained on Seaborg, a Power3-based IBM SP system.

Figure 5. The performance scalability of APEX-Map (MB/s versus number of processors) on different parallel systems (Phoenix, Seaborg, Cheetah) with L = 4096 words and α = 1 (random access).

We can then use APEX-Map to analyze the scalability of different systems. We focus on low temporal locality (α = 1) because codes with high temporal locality usually scale better. We analyze the cross-section bandwidth for L = 4096 and L = 1 over a range of processor counts in Figures 5 and 6.

The two SMP-based IBM systems scale well within their SMPs (Seaborg up to 16 processors and Cheetah up to 32) but show a performance drop once more than a single SMP is used. For larger numbers of SMPs, the cross-section bandwidth starts scaling again. The architecture of Phoenix is not hierarchical and shows no such effect of architectural hierarchies. Its performance scales up very well across the whole range of processors tested. The total aggregate bandwidth of Phoenix at 16 processors is almost equal to the total aggregate bandwidth of 256 processors on Cheetah and of 1024 processors on Seaborg. For small messages (L = 1 word; see Figure 6) the absolute performance levels become much smaller. Performance is more dominated by message latencies, and the hierarchies of the interconnect networks have less effect.


Figure 6. The performance scalability of APEX-Map (MB/s versus number of processors) on different parallel systems (Phoenix, Seaborg, Cheetah) with L = 1 word and α = 1 (random access).

All three platforms still scale well at this low performance level, especially Cheetah, which has the fastest scalar processor. Comparisons between APEX-Map performance and other known benchmarks, such as Ping-Pong, can be found in [3].

5.2. Usage of performance surfaces to analyze systems

One feature that distinguishes APEX-Map from many other benchmarks is that its input parameters can be varied independently of each other between extreme values. This allows us to generate continuous performance surfaces, which can be used to explore the performance effects of all potential values of the characterizing parameters. By examining these surfaces, we can understand how changes in spatial or temporal locality affect performance and which factors are more important on a specific system. These performance surfaces are not only useful for gaining a direct impression of the overall performance behavior of a system, but have also proven useful in a variety of situations, such as the detection of performance anomalies shown in Figure 7. The performance degradation for L > 1024 in this example is caused by an MPI protocol change, which caused excessive pinning of memory pages.

Being able to analyze complete performance surfaces is also very helpful for system comparison. Figure 8 shows the ratios of cycles per data access for the IBM Power3 and Power4 processors. This ratio of architectural efficiencies is larger than one if the Power3 processor takes fewer cycles to access data and smaller than one if the Power4 processor is more efficient. The effect of reduced architectural efficiency on the Power4 architecture for a restricted range of temporal and spatial localities is clearly visible. This is due to the Power4's substantially more complex memory hierarchy. As this effect only appears for certain locality levels, it is easily overlooked with spot measurements only, which would make it difficult to explain the behavior of applications with execution phases affected by it.


Figure 7. An MPI performance anomaly for L > 1024 words due to excessive memory pinning on a clustered system with Infiniband interconnect (256 processes).

Figure 8. Architectural efficiency ratio for Power4 and Power3 processors: ratios above one indicate a higher efficiency of the Power3 memory hierarchy.

5.3. Comparing parallel systems with APEX-Map

We now use APEX-Map surfaces to compare different systems. Figures 9–11 show the performance surfaces for α = 0.001 to 1.0 and L = 1 to 65 536 words on 256 processors for M = 64 MWords*256 on Seaborg, Cheetah, and Phoenix. The Z-axis shows the achieved bandwidth per process on a log scale.

Figure 9 shows that both temporal reuse and spatial locality affect the total aggregate bandwidth substantially. The worst performance is observed when α = 1 and L = 1, which are the lowest values for temporal and spatial locality. By increasing either the temporal locality or the spatial locality, the performance improves. The best performance is obtained when α = 0.001 and L = 4096 words.


Figure 9. The aggregate bandwidth per process (in MB/s, as a function of α and L) on Seaborg (IBM Power3 SP) for 256 processes.

Figure 10. The total aggregate bandwidth per process (in MB/s, as a function of α and L) on Cheetah (IBM Power4 SP) for 256 processes.


Figure 11. The achieved bandwidth per process (in MB/s, as a function of α and L) on Phoenix (Cray X1) for 256 processes.

Figure 12. The bandwidth performance ratio between Cheetah and Seaborg, as a function of α and L.


Figure 13. The bandwidth performance ratio between Phoenix and Cheetah, as a function of α and L.

Increasing L further does not improve performance, mainly because the sum computation on this platform is less efficient for very large messages. Beyond L = 4096, spatial locality has only a minor influence on performance, while temporal locality α still has a large influence. If we look at an intermediate performance level such as 1 MB/s, we see that temporal locality and spatial locality can be substituted for each other to some degree. To achieve 1 MB/s at a high temporal locality of α = 0.005, a very low spatial locality of L = 1 is sufficient. With decreasing temporal locality (increasing α), a higher spatial locality of up to L = 85 is needed to maintain this performance.

The performance characteristics of Cheetah (Figure 10) are very similar to those of Seaborg. Figure 12 shows the performance ratio between Cheetah and Seaborg. From Table III we see that the ratio of processor speeds is 3.47, the ratio of local memory bandwidths is 1.375, and the ratio of network bandwidths is 4. For high temporal locality or high spatial locality the performance ratio of 2–4 seems to be dominated by the ratio of the respective memory bandwidths. For low localities, the performance ratio between these two systems is in the range of 6–8 and thus higher than any ratio of simple architectural parameters. In this locality range, performance is dominated by a large number of very short messages. The details of the MPI implementation as well as the cross-section bandwidth of the interconnect can be expected to have a large influence on performance in this corner of low localities, where it is notoriously difficult to achieve high absolute performance.

Figure 11 shows the performance surface for the Cray X1, for which the effects of increasing spatial locality are significant even for values of L beyond 4096. Spatial locality has a greater effect on the performance in general. For example, on Cheetah, in order to maintain a total aggregate bandwidth of around 10 MB/s when the temporal locality α is reduced from 0.001 to 1, the spatial locality needs to increase 128 times; on Phoenix, it only needs to increase 16 times. We also note that the performance drops when L changes from 32 to 64. This is an effect of the MPI implementation on the Cray X1: when the message size becomes larger than 32 words or 256 bytes, MPI communication switches from eager mode to rendezvous mode and the implementation overhead increases.


Figure 14. Contour plots of the performance surfaces for Seaborg (left) and Phoenix (right) for 256 processes.

The performance ratio between Phoenix and Cheetah is shown in Figure 13. Interestingly, when the spatial locality is poor or the temporal locality is high, the vector-based X1 delivers less performance than the superscalar Power4. In these cases, performance is dominated either by short MPI messages, for which the Power4 has the clear advantage of a much faster scalar processor, or by very localized memory accesses, for which the Power4 can effectively use its cache hierarchy. In this locality range, the Cray X1 is unable to show its true potential with our current MPI-based benchmark implementation. The X1 clearly shows the best performance when spatial locality becomes high, especially in the area with poor temporal locality (the bottom-right corner). In the best case, it can deliver 12 times better performance than the Power4 platform. Performance in this corner is dominated by the exchange of many long messages, which requires an interconnect network with a large cross-section bandwidth.

5.4. Architectural signatures

The IBM SP systems have a very hierarchical architecture with several levels of cache, large SMP nodes, and distributed global memory. The Cray vector system is designed without traditional caches, without an obvious SMP structure, and with a comparably flat global interconnect structure. To compare the APEX-Map performance surfaces for these different classes of architectures, we put contour plots for Seaborg and Phoenix next to each other in Figure 14. For the IBM systems, the area of highest performance is of rectangular shape and clearly elongated parallel to the spatial locality axis, while for the Cray system it is elongated parallel to the temporal locality axis. The IBM system can tolerate a decrease in spatial locality more easily but is much more sensitive to a loss of temporal locality. This reflects the elaborate cache and memory hierarchy on the individual nodes as well as the global system hierarchy, which also heavily relies on reuse of data, as the interconnect bandwidth is substantially lower than the local memory bandwidth. The Cray system can tolerate a decrease in temporal locality much better but is sensitive to a loss in spatial locality. This reflects an architecture that depends very little on local caching of data and has an interconnect bandwidth equal to the local memory bandwidth. Seeing such a clear signature of the Cray architecture is even more surprising considering that we use an MPI-based benchmark, which does not fully exploit the capability of this system. The lines of equal performance on the Cray system are, in general, more vertical, rather than diagonal as on the IBM system, which further confirms our interpretation. Overall, these differences in our performance surfaces clearly reflect the different design philosophies of these two systems and demonstrate the utility of our approach.

6. CONCLUSION AND FUTURE WORK

The memory wall between the peak performance of microprocessors and their memory performance has become the prominent performance bottleneck for many scientific application codes. Therefore, benchmarks measuring data access speeds locally and globally in a variety of different ways are needed to explore the ever increasing diversity of architectures for HPC.

In this paper, we have described the concept and the parallel MPI implementation of a parameterized synthetic performance probe, APEX-Map. It focuses on measuring the performance of global data movement and has three main parameters: the global data size M, the temporal locality α, and the spatial locality L. APEX-Map generates a generic address stream based on non-uniform random, blocked access to global data defined by its parameters. We have run multiple experiments with APEX-Map on hierarchical, superscalar, and shared-memory vector architectures. APEX-Map allows the generation of continuous multi-dimensional performance surfaces, which enable the effects of spatial and temporal locality on performance to be studied. The results show that APEX-Map can be used to compare efficiency and scalability across different platforms. The performance surfaces generated by APEX-Map also clearly reflect the design differences between these architectures.

Currently we are extending our characterization framework and APEX-Map to include parameters reflecting the performance effects of computational intensity and register pressure. We are also investigating different approaches to refine our concepts for characterizing parallel communication behavior.

REFERENCES

1. TOP500 Supercomputer Sites. http://www.top500.org [15 March 2007].
2. Strohmaier E, Shan H. Architecture independent performance characterization and benchmarking for scientific applications. Proceedings of the International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, Volendam, The Netherlands, October 2004. IEEE Computer Society: Washington, DC, 2004.
3. Strohmaier E, Shan H. APEX-Map: A synthetic scalable benchmark probe to explore data access performance on highly parallel systems. Proceedings of EuroPar 2005, Lisbon, Portugal, August 2005. Springer: Berlin, 2005.
4. Strohmaier E, Shan H. APEX-Map: A global data access benchmark to analyze HPC systems and parallel programming paradigms. Proceedings of Supercomputing 2005 (SC05), November 2005. IEEE Computer Society: Washington, DC, 2005.
5. Dongarra JJ, Sterling T, Simon HD, Strohmaier E. High-performance computing: Clusters, constellations, MPPs, and future directions. Computing in Science and Engineering 2005; 7(2):51–59.
6. Strohmaier E, Dongarra JJ, Meuer HW, Simon HD. The marketplace of high-performance computing. Parallel Computing 1999; 25:1517–1544.
7. Strohmaier E, Dongarra JJ, Meuer HW, Simon HD. Recent trends in the marketplace of high performance computing. Parallel Computing 2005; 31:261–273.
8. STREAM Benchmark. http://www.cs.virginia.edu/stream [15 March 2007].
9. HPC Challenge Benchmark. http://icl.cs.utk.edu/hpcc [15 March 2007].
10. DARPA HPCS. http://www.highproductivity.org [15 March 2007].
11. NAS Parallel Benchmarks. http://www.nas.nasa.gov/Software/NPB/ [15 March 2007].
12. SPEC. http://www.spec.org/ [15 March 2007].
13. The Berkeley Institute for Performance Studies. http://crd.lbl.gov/html/bips.html [15 March 2007].
14. Griem G, Oliker L, Shalf J, Yelick K. Identifying performance bottlenecks on modern microarchitectures using an adaptable probe. Proceedings of the 18th Parallel and Distributed Processing Symposium (IPDPS 2004), Santa Fe, NM, 2004, vol. 15(15). IEEE Computer Society: Washington, DC, 2004; 255a.
15. Modern Vector Architecture. http://ftg.lbl.gov/ModernVectorarch/ModernVectorarch.shtml [15 March 2007].
16. Oliker L, Carter J, Wehner M, Canning A, Ethier S, Govindasamy B, Mirin A, Parks D. Leading computational methods on scalar and vector HEC platforms. Proceedings of SC2005: High Performance Computing, Networking, and Storage Conference, Seattle, WA, 2005. IEEE Computer Society: Washington, DC, 2005.
17. SciDAC Performance Evaluation Research Center. http://perc.nersc.gov [15 March 2007].
18. Performance Modeling and Characterization (PMaC) Laboratory, SDSC. http://www.sdsc.edu/PMaC/ [15 March 2007].
19. Performance and Architecture Laboratory, Los Alamos National Laboratory. http://www.c3.lanl.gov/pal/ [15 March 2007].
20. Strohmaier E. 20 years supercomputer market analysis. Proceedings of the 20th International Supercomputing Conference 2005, June 2005. Prometeus: Waibstadt-Daisbach, 2005.
21. Snir M, Yu J. On the theory of spatial and temporal locality. Technical Report UIUCDCS-R-2005-2611, University of Illinois at Urbana-Champaign, Urbana, IL, July 2005.
22. Shan H, Strohmaier E. Performance characteristics of the Cray X1 and their implications for application performance tuning. Proceedings of the 18th Annual International Conference on Supercomputing (ICS'04). ACM Press: New York, 2004; 175–183.
23. Shan H, Strohmaier E. MPI, SHMEM, and UPC performance on the Cray X1—a case study using APEX-Map. Proceedings of the Cray Users Group Meeting (CUG 2005), May 2005. Cray User Group: Albuquerque, NM, 2005.
24. Bunt RB, Murphy JM. The measurement of locality and the behavior of programs. The Computer Journal 1984; 27(3):238–245.
25. Bunt RB, Williamson CL. Temporal and spatial locality: A time and a place for everything. Proceedings of the International Symposium in Honour of Professor Guenter Haring's 60th Birthday, University of Vienna, Vienna, Austria, 4–5 December 2003. Oxford University Press: Oxford, 2003.
26. Hennessy JL, Patterson DA, Goldberg D. Computer Architecture: A Quantitative Approach. Morgan Kaufmann: San Francisco, CA, 1996.
27. Yu J, Baghsorkhi S, Snir M. A new locality metric and case studies for HPCS benchmarks. Technical Report UIUCDCS-R-2005-2564, University of Illinois at Urbana-Champaign, Urbana, IL, April 2005.
28. Snavely A, Wolter N, Carrington L. Modeling application performance by convolving machine signatures with application profiles. Proceedings of the IEEE 4th Annual Workshop on Workload Characterization, Austin, TX, December 2001. IEEE Computer Society: Washington, DC, 2001.
29. Weinberg J, Snavely A, McCracken MO, Strohmaier E. Measurement of spatial and temporal locality in memory access patterns. Proceedings of Supercomputing 2005 (SC05), November 2005. IEEE Computer Society: Washington, DC, 2005.
