
FEATURE ARTICLE

OpenMP: An Industry-Standard API for Shared-Memory Programming

LEONARDO DAGUM AND RAMESH MENON, SILICON GRAPHICS INC.

OpenMP, the portable alternative to message passing, offers a powerful new way to achieve scalability in software. This article compares OpenMP to existing parallel-programming models.


Application developers have long recognized that scalable hardware and software are necessary for parallel scalability in application performance. Both have existed for some time in their lowest common denominator form, and scalable hardware—as physically distributed memories connected through a scalable interconnection network (as a multistage interconnect, k-ary n-cube, or fat tree)—has been commercially available since the 1980s. When developers build such systems without any provision for cache coherence, the systems are essentially “zeroth order” scalable architectures. They provide only a scalable interconnection network, and the burden of scalability falls on the software. As a result, scalable software for such systems exists, at some level, only in a message-passing model. Message passing is the native model for these architectures, and developers can only build higher-level models on top of it.

Unfortunately, many in the high-performance computing world implicitly assume that the only way to achieve scalability in parallel software is with a message-passing programming model. This is not necessarily true. A class of multiprocessor architectures is now emerging that offers scalable hardware support for cache coherence. These are generally called scalable shared-memory multiprocessor (SSMP) architectures.1 For SSMP systems, the native programming model is shared memory, and message passing is built on top of the shared-memory model. On such systems, software scalability is straightforward to achieve with a shared-memory programming model.

In a shared-memory system, every processor has direct access to the memory of every other processor, meaning it can directly load or store any shared address. The programmer also can declare certain pieces of memory as private to the processor, which provides a simple yet powerful model for expressing and managing parallelism in an application.

Despite its simplicity and scalability, many parallel applications developers have resisted adopting a shared-memory programming model for one reason: portability. Shared-memory system vendors have created their own proprietary extensions to Fortran or C for parallel-software development. However, the absence of portability has forced many developers to adopt a portable message-passing model such as the Message Passing Interface (MPI) or Parallel Virtual Machine (PVM). This article presents a portable alternative to message passing: OpenMP.



OpenMP was designed to exploit certain characteristics of shared-memory architectures. The ability to directly access memory throughout the system (with minimum latency and no explicit address mapping), combined with fast shared-memory locks, makes shared-memory architectures best suited for supporting OpenMP.

Why a new standard?

The closest approximation to a standard shared-memory programming model is the now-dormant ANSI X3H5 standards effort.2 X3H5 was never formally adopted as a standard, largely because interest waned as distributed-memory message-passing systems (MPPs) came into vogue. However, even though hardware vendors support it to varying degrees, X3H5 has limitations that make it unsuitable for anything other than loop-level parallelism. Consequently, applications adopting this model are often limited in their parallel scalability.

MPI has effectively standardized the message-passing programming model. It is a portable, widely available, and accepted standard for writing message-passing programs. Unfortunately, message passing is generally a difficult way to program. It requires that the program’s data structures be explicitly partitioned, so the entire application must be parallelized to work with the partitioned data structures. There is no incremental path to parallelize an application. Furthermore, modern multiprocessor architectures increasingly provide hardware support for cache coherence; therefore, message passing is becoming unnecessary and overly restrictive for these systems.

Pthreads is an accepted standard for shared memory in low-end systems. However, it is not targeted at the technical, HPC space. There is little Fortran support for pthreads, and it is not a scalable approach. Even for C applications, the pthreads model is awkward, because it is lower-level than necessary for most scientific applications and is targeted more at providing task parallelism, not data parallelism. Also, portability to unsupported platforms requires a stub library or equivalent workaround.

Researchers have defined many new languages for parallel computing, but these have not found mainstream acceptance. High-Performance Fortran (HPF) is the most popular multiprocessing derivative of Fortran, but it is mostly geared toward distributed-memory systems.

Independent software developers of scientific applications, as well as government laboratories, have a large volume of Fortran 77 code that needs to be parallelized in a portable fashion. The rapid and widespread acceptance of shared-memory multiprocessor architectures—from the desktop to “glass houses”—has created a pressing demand for a portable way to program these systems. Developers need to parallelize existing code without completely rewriting it, but this is not possible with most existing parallel-language standards. Only OpenMP and X3H5 allow incremental parallelization of existing code, and of the two only OpenMP is scalable (see Table 1). OpenMP is targeted at developers who need to quickly parallelize existing scientific code, but it remains flexible enough to support a much broader application set. OpenMP provides an incremental path for parallel conversion of any existing software. It also provides scalability and performance for a complete rewrite or entirely new development.

What is OpenMP?

At its most elemental level, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran (and separately, C and C++) to express shared-memory parallelism. It leaves the base language unspecified, and vendors can implement OpenMP in any Fortran compiler. Naturally, to support pointers and allocatables, Fortran 90 and Fortran 95 require the OpenMP implementation to include additional semantics over Fortran 77.
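As a small illustration of how directives and runtime library routines combine (a sketch written for this rewrite, not an example from the article), the fragment below parallelizes a loop and queries the thread number with one of the standard OpenMP runtime routines:

c     Illustrative sketch (not from the article). OMP_GET_THREAD_NUM
c     is an OpenMP runtime library query routine; the directive
c     distributes the loop iterations over the team.
      program hello
      integer i, n, id
      integer OMP_GET_THREAD_NUM
      n = 8
!$OMP PARALLEL DO PRIVATE(id)
      do i = 1, n
         id = OMP_GET_THREAD_NUM()
         print *, 'iteration ', i, ' ran on thread ', id
      enddo
!$OMP END PARALLEL DO
      end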

Table 1: Comparing standard parallel-programming models.

                              X3H5   MPI   Pthreads    HPF     OpenMP
Scalable                      no     yes   sometimes   yes     yes
Incremental parallelization   yes    no    no          no      yes
Portable                      yes    yes   yes         yes     yes
Fortran binding               yes    yes   no          yes     yes
High level                    yes    no    no          yes     yes
Supports data parallelism     yes    no    no          yes     yes
Performance oriented          no     yes   no          tries   yes


OpenMP leverages many of the X3H5 concepts while extending them to support coarse-grain parallelism. Table 2 compares OpenMP with the directive bindings specified by X3H5 and the MIPS Pro Doacross model,3 and it summarizes the language extensions into one of three categories: control structure, data environment, or synchronization. The standard also includes a callable runtime library with accompanying environment variables.

Several vendors have products—including compilers, development tools, and performance-analysis tools—that are OpenMP aware. Typically, these tools understand the semantics of OpenMP constructs and hence aid the process of writing programs. The OpenMP Architecture Review Board includes representatives from Digital, Hewlett-Packard, Intel, IBM, Kuck and Associates, and Silicon Graphics.

All of these companies are actively developing compilers and tools for OpenMP. OpenMP products are available today from Silicon Graphics and other vendors. In addition, a number of independent software vendors plan to use OpenMP in future products. (For information on individual products, see www.openmp.org.)

A simple example

Figure 1 presents a simple example of computing π using OpenMP.4 This example illustrates how to parallelize a simple loop in a shared-memory programming model. The code would look similar with either the Doacross or the X3H5 set of directives (except that X3H5 does not have a reduction attribute, so you would have to code it yourself).

Program execution begins as a single process. This initial process executes serially, and we can set up our problem in a standard sequential manner, reading and writing stdout as necessary. When we first encounter a parallel construct, in this case a parallel do, the runtime forms a team of one or more processes and creates the data environment for each team member. The data environment consists of one private variable, x, one reduction variable, sum, and one shared variable, w. All references to x and sum inside the parallel region address private, nonshared copies.

Table 2: Comparing X3H5 directives, OpenMP, and MIPS Pro Doacross functionality.

                            X3H5                   OpenMP                 MIPS Pro
Overview
  Orphan scope              None, lexical          Yes, binding           Yes, through
                            scope only             rules specified        callable runtime
  Query functions           None                   Standard               Yes
  Runtime functions         None                   Standard               Yes
  Environment variables     None                   Standard               Yes
  Nested parallelism        Allowed                Allowed                Serialized
  Throughput mode           Not defined            Yes                    Yes
  Conditional compilation   None                   _OPENMP, !$            C$
  Sentinel                  C$PAR, C$PAR&          !$OMP, !$OMP&          C$, C$&

Control structure
  Parallel region           Parallel               Parallel               Doacross
  Iterative                 Pdo                    Do                     Doacross
  Noniterative              Psection               Section                User coded
  Single process            Psingle                Single, Master         User coded
  Early completion          Pdone                  User coded             User coded
  Sequential ordering       Ordered PDO            Ordered                None

Data environment
  Autoscope                 None                   Default(private)       shared default
                                                   Default(shared)
  Global objects            Instance Parallel      Threadprivate          Linker: -Xlocal
                            (p + 1 instances)      (p instances)          (p instances)
  Reduction attribute       None                   Reduction              Reduction
  Private initialization    None                   Firstprivate           None
                                                   Copyin                 Copyin
  Private persistence       None                   Lastprivate            Lastlocal

Synchronization
  Barrier                   Barrier                Barrier                mp_barrier
  Synchronize               Synchronize            Flush                  synchronize
  Critical section          Critical Section       Critical               mp_setlock,
                                                                          mp_unsetlock
  Atomic update             None                   Atomic                 None
  Locks                     None                   Full functionality     mp_setlock,
                                                                          mp_unsetlock


The reduction attribute takes an operator, such that at the end of the parallel region it reduces the private copies to the master copy using the specified operator. All references to w in the parallel region address the single master copy. The loop index variable, i, is private by default. The compiler takes care of assigning the appropriate iterations to the individual team members, so in parallelizing this loop the user need not even know how many processors it runs on.

There might be additional control and synchronization constructs within the parallel region, but not in this example. The parallel region terminates with the end do, which has an implied barrier. On exit from the parallel region, the initial process resumes execution using its updated data environment. In this case, the only change to the master’s data environment is the reduced value of sum.

This model of execution is referred to as the fork/join model. Throughout the course of a program, the initial process can fork and join many times. The fork/join execution model makes it easy to get loop-level parallelism out of a sequential program. Unlike in message passing, where the program must be completely decomposed for parallel execution, the shared-memory model makes it possible to parallelize just at the loop level without decomposing the data structures. Given a working sequential program, it becomes fairly straightforward to parallelize individual loops incrementally and thereby immediately realize the performance advantages of a multiprocessor system.

For comparison with message passing, Figure 2 presents the same example using MPI. Clearly, there is additional complexity just in setting up the problem, because we must begin with a team of parallel processes. Consequently, we need to isolate a root process to read and write stdout. Because there is no globally shared data, we must explicitly broadcast the input parameters (in this case, the number of intervals for the integration) to all the processors. Furthermore, we must explicitly manage the loop bounds. This requires identifying each processor (myid) and knowing how many processors will be used to execute the loop (numprocs).

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(x), SHARED(w)
!$OMP& REDUCTION(+: sum)
      do i = 1, n
        x = w * (i - 0.5d0)
        sum = sum + f(x)
      enddo
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end

Figure 1. Computing π in parallel using OpenMP.

      program compute_pi
      include 'mpif.h'
      double precision mypi, pi, w, sum, x, f, a
      integer n, myid, numprocs, i, ierr, rc
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      if ( myid .eq. 0 ) then
        print *, 'Enter number of intervals:'
        read *, n
      endif
      call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = myid+1, n, numprocs
        x = w * (i - 0.5d0)
        sum = sum + f(x)
      enddo
      mypi = w * sum
c     collect all the partial sums
      call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
     $     MPI_COMM_WORLD,ierr)
c     node 0 prints the answer
      if (myid .eq. 0) then
        print *, 'computed pi = ', pi
      endif
      call MPI_FINALIZE(rc)
      stop
      end

Figure 2. Computing π in parallel using MPI.


When we finally get to the loop, we can only sum into our private value for mypi. To reduce across processors we use the MPI_Reduce routine and sum into pi. The storage for pi is replicated across all processors, even though only the root process needs it. As a general rule, message-passing programs waste more storage than shared-memory programs.5 Finally, we can print the result, again making sure to isolate just one process for this step to avoid printing numprocs messages.

It is also interesting to see how this example looks using pthreads (see Figure 3). Naturally, it’s written in C, but we can still compare functionality with the Fortran examples given in Figures 1 and 2.

The pthreads version is more complex than either the OpenMP or the MPI versions:

• First, pthreads is aimed at providing task parallelism, whereas the example is one of data parallelism—parallelizing a loop. The example shows why pthreads has not been widely used for scientific applications.

• Second, pthreads is somewhat lower-level than we need, even in a task- or threads-based model. This becomes clearer as we go through the example.

As with the MPI version, we need to know how many threads will execute the loop and we must determine their IDs so we can manage the loop bounds. We get the thread number as a command-line argument and use it to allocate an array of thread IDs. At this time, we also initialize a lock, reduction_mutex, which we’ll need for reducing our partial sums into a global sum for π.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

pthread_mutex_t reduction_mutex;
pthread_t *tid;
int n, num_threads;
double pi, w;

double f(a)
double a;
{
    return (4.0 / (1.0 + a*a));
}

void *PIworker(void *arg)
{
    int i, myid;
    double sum, mypi, x;

    /* set individual id to start at 0 */
    myid = pthread_self() - tid[0];

    /* integrate function */
    sum = 0.0;
    for (i = myid+1; i <= n; i += num_threads) {
        x = w*((double)i - 0.5);
        sum += f(x);
    }
    mypi = w*sum;

    /* reduce value */
    pthread_mutex_lock(&reduction_mutex);
    pi += mypi;
    pthread_mutex_unlock(&reduction_mutex);
    return(0);
}

void main(argc, argv)
int argc;
char *argv[];
{
    int i;

    /* check command line */
    if (argc != 3) {
        printf("Usage: %s Num-intervals Num-threads\n", argv[0]);
        exit(0);
    }

    /* get num intervals and num threads from command line */
    n = atoi(argv[1]);
    num_threads = atoi(argv[2]);
    w = 1.0 / (double) n;
    pi = 0.0;
    tid = (pthread_t *) calloc(num_threads, sizeof(pthread_t));

    /* initialize lock */
    if (pthread_mutex_init(&reduction_mutex, NULL))
        fprintf(stderr, "Cannot init lock\n"), exit(1);

    /* create the threads */
    for (i = 0; i < num_threads; i++)
        if (pthread_create(&tid[i], NULL, PIworker, NULL))
            fprintf(stderr, "Cannot create thread %d\n", i), exit(1);

    /* join threads */
    for (i = 0; i < num_threads; i++)
        pthread_join(tid[i], NULL);

    printf("computed pi = %.16f\n", pi);
}

Figure 3. Computing π in parallel using pthreads.


Our basic approach is to start a worker thread, PIworker, for every processor we want to work on the loop. In PIworker, we first compute a zero-based thread ID and use this to map the loop iterations. The loop then computes the partial sums into mypi. We add these into the global result pi, making sure to protect against a race condition by locking. Finally, we need to explicitly join all our threads before we can print out the result of the integration.

All the data scoping is implicit; that is, global variables are shared and automatic variables are private. There is no simple mechanism in pthreads for making global variables private. Also, implicit scoping is more awkward in Fortran because the language is not as strongly scoped as C.

In terms of performance, all three models are comparable for this simple example. Table 3 presents the elapsed time in seconds for each program when run on a Silicon Graphics Origin2000 server, using 10⁹ intervals for each integration. All three models exhibit excellent scalability on a per-node basis (there are two CPUs per node in the Origin2000), as expected for this embarrassingly parallel algorithm.

Scalability

Although simple and effective, loop-level parallelism is usually limited in its scalability, because it leaves some constant fraction of sequential work in the program that, by Amdahl’s law, can quickly overtake the gains from parallel execution. It is important, however, to distinguish between the type of parallelism (for example, loop-level versus coarse-grained) and the programming model. The type of parallelism exposed in a program depends on the algorithm and data structures employed and not on the programming model (to the extent that those algorithms and data structures can be reasonably expressed within a given model). Therefore, given a parallel algorithm and a scalable shared-memory architecture, a shared-memory implementation scales as well as a message-passing implementation.
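For reference, the standard formulation of Amdahl's law (stated here for clarity; the formula is not spelled out in the article) is as follows: if a fraction f of the work is parallelized across N processors and the remainder stays sequential, the speedup is bounded by

    S(N) = 1 / ((1 - f) + f/N)  <=  1 / (1 - f),

or, in LaTeX form, \( S(N) = \frac{1}{(1-f) + f/N} \le \frac{1}{1-f} \). Parallelizing 90 percent of the work (f = 0.9), for example, caps the speedup at 10 no matter how many processors are used, which is why purely loop-level parallelism eventually stops scaling.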

OpenMP introduces the powerful concept of orphan directives that simplify the task of implementing coarse-grain parallel algorithms. Orphan directives are directives encountered outside the lexical extent of the parallel region. Coarse-grain parallel algorithms typically consist of only a few parallel regions, with most of the execution taking place within those regions.

In implementing a coarse-grained parallel algorithm, it becomes desirable, and often necessary, to be able to specify control or synchronization from anywhere inside the parallel region, not just from the lexically contained portion. OpenMP provides this functionality by specifying binding rules for all directives and allowing them to be encountered dynamically in the call chain originating from the parallel region. In contrast, X3H5 does not allow directives to be orphaned, so all the control and synchronization for the program must be lexically visible in the parallel construct. This limitation restricts the programmer and makes any nontrivial coarse-grained parallel application virtually impossible to write.
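To make orphaning concrete, the following sketch (constructed for this rewrite, not taken from the article) places an !$OMP DO in a subroutine called from inside a parallel region; the work-sharing directive binds dynamically to that region:

c     Illustrative sketch (not from the article).
      program orphan_example
      parameter (NPTS = 1024)
      real a(NPTS)
      common /arr/ a
!$OMP PARALLEL
      call scale_array(NPTS)
!$OMP END PARALLEL
      end

      subroutine scale_array(n)
      integer n, i
      parameter (NPTS = 1024)
      real a(NPTS)
      common /arr/ a
c     Orphaned work-sharing directive: it binds to the parallel
c     region in the caller, so the iterations are divided among
c     the team even though no parallel directive appears here.
!$OMP DO
      do i = 1, n
         a(i) = 2.0 * a(i)
      enddo
!$OMP END DO
      return
      end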

A coarse-grain example

To highlight additional features in the standard, Figure 4 presents a slightly more complicated example, computing the energy spectrum for a field. This is essentially a histogramming problem with a slight twist—it also generates the sequence in parallel. We could easily parallelize the histogramming loop and the sequence generation as in the previous example, but in the interest of performance we would like to histogram as we compute in order to preserve locality.

The program goes immediately into a parallel region with a parallel directive, declaring the variables field and ispectrum as shared, and making everything else private with a default clause. The default clause does not affect common blocks, so setup remains a shared data structure.

Within the parallel region, we call initialize_field() to initialize the field and ispectrum arrays.

Table 3: Time (in seconds) to compute π using 10⁹ intervals with three standard parallel-programming models.

CPUs   OpenMP   MPI     Pthreads
1      107.7    121.4   115.4
2       53.9     60.7    62.5
4       27.0     30.3    32.4
6       17.9     20.4    22.0
8       13.5     15.2    16.7


Here we have an example of orphaning the do directive. With the X3H5 directives, we would have to move these loops into the main program so that they could be lexically visible within the parallel directive. Clearly, that restriction makes it difficult to write good modular parallel programs. We use the nowait clause on the end do directives to eliminate the implicit barrier. Finally, we use the single directive when we initialize a single internal field point. The end single directive also can take a nowait clause, but to guarantee correctness we need to synchronize here.

The field gets computed in compute_field. This could be any parallel Laplacian solver, but in the interest of brevity we don’t include it here. With the field computed, we are ready to compute the spectrum, so we histogram the field values using the atomic directive to eliminate race conditions in the updates to ispectrum. The end do here has a nowait because the parallel region ends after compute_spectrum() and there is an implied barrier when the threads join.

OpenMP design objective

OpenMP was designed to be a flexible standard, easily implemented across different platforms. As we discussed, the standard comprises four distinct parts:

• control structure,
• the data environment,
• synchronization, and
• the runtime library.

Control structure

OpenMP strives for a minimalist set of control structures. Experience has indicated that only a few control structures are necessary for writing most parallel applications. For example, in the Doacross model, the only control structure is the doacross directive, yet this is arguably the most widely used shared-memory programming model for scientific computing. Many of the control structures provided by X3H5 can be trivially programmed in OpenMP with no performance penalty. OpenMP includes control structures only in those instances where a compiler can provide both functionality and performance over what a user could reasonably program.

Our examples used only three control structures: parallel, do, and single. Clearly, the compiler adds functionality in parallel and do directives. For single, the compiler adds performance by allowing the first thread reaching the single directive to execute the code. This is nontrivial for a user to program.

Data environment

Associated with each process is a unique data environment providing a context for execution.

      parameter(N = 512, NZ = 16)
      common /setup/ npoints, nzone
      dimension field(N), ispectrum(NZ)
      data npoints, nzone / N, NZ /
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(field, ispectrum)
      call initialize_field(field, ispectrum)
      call compute_field(field, ispectrum)
      call compute_spectrum(field, ispectrum)
!$OMP END PARALLEL
      call display(ispectrum)
      stop
      end

      subroutine initialize_field(field, ispectrum)
      common /setup/ npoints, nzone
      dimension field(npoints), ispectrum(nzone)
!$OMP DO
      do i = 1, nzone
        ispectrum(i) = 0.0
      enddo
!$OMP END DO NOWAIT
!$OMP DO
      do i = 1, npoints
        field(i) = 0.0
      enddo
!$OMP END DO NOWAIT
!$OMP SINGLE
      field(npoints/4) = 1.0
!$OMP END SINGLE
      return
      end

      subroutine compute_spectrum(field, ispectrum)
      common /setup/ npoints, nzone
      dimension field(npoints), ispectrum(nzone)
!$OMP DO
      do i = 1, npoints
        index = field(i)*nzone + 1
!$OMP ATOMIC
        ispectrum(index) = ispectrum(index) + 1
      enddo
!$OMP END DO NOWAIT
      return
      end

Figure 4. A coarse-grained example.


The initial process at program start-up has an initial data environment that exists for the duration of the program. New data environments are constructed only for new processes created during program execution. The objects constituting a data environment might have one of three basic attributes: shared, private, or reduction.

The concept of reduction as an attribute is generalized in OpenMP. It allows the compiler to efficiently implement reduction operations. This is especially important on cache-based systems, where the compiler can eliminate any false sharing. On large-scale SSMP architectures, the compiler also might choose to implement tree-based reductions for even better performance.

OpenMP has a rich data environment. In addition to the reduction attribute, it allows private initialization with firstprivate and copyin, and private persistence with lastprivate. None of these features exist in X3H5, but experience has indicated a real need for them.
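A minimal sketch of these clauses (an illustration written for this rewrite, not from the article): firstprivate initializes each thread’s private copy from the value the variable held before the region, and lastprivate copies the value from the sequentially last iteration back to the shared variable.

c     Illustrative sketch (not from the article).
      parameter (M = 100)
      real a(M), b(M), scale, xlast
      do i = 1, M
         a(i) = i
      enddo
      scale = 2.0
!$OMP PARALLEL DO FIRSTPRIVATE(scale) LASTPRIVATE(xlast)
      do i = 1, M
c        each thread's private copy of scale starts at 2.0
         xlast = scale * a(i)
         b(i)  = xlast
      enddo
!$OMP END PARALLEL DO
c     after the region, xlast holds the value from iteration M
      print *, 'xlast = ', xlast
      end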

Global objects can be made private with the threadprivate directive. In the interest of performance, OpenMP implements a “p-copy” model for privatizing global objects: threadprivate will create p copies of the global object, one for each of the p members in the team executing the parallel region. Often, however, it is desirable, either from memory constraints or for algorithmic reasons, to privatize only certain elements of a compound global object. OpenMP allows individual elements of a compound global object to appear in a private list.
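The following sketch (again an illustration written for this rewrite, not from the article) shows threadprivate applied to a common block, with copyin initializing each thread’s copy from the master’s values at entry to the parallel region:

c     Illustrative sketch (not from the article): each thread gets
c     its own copy of /params/; COPYIN fills the copies from the
c     master's values when the parallel region starts.
      program tp_sketch
      integer nzone, myzone
      common /params/ nzone, myzone
!$OMP THREADPRIVATE(/params/)
      integer OMP_GET_THREAD_NUM
      nzone = 16
!$OMP PARALLEL COPYIN(/params/)
      myzone = OMP_GET_THREAD_NUM() + 1
      print *, 'zone ', myzone, ' of ', nzone
!$OMP END PARALLEL
      end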

Synchronization

There are two types of synchronization: implicit and explicit. Implicit synchronization points exist at the beginning and end of parallel constructs and at the end of control constructs (for example, do and single). In the case of do, sections, and single, the implicit synchronization can be removed with the nowait clause.


The user specifies explicit synchronization to manage order or data dependencies. Synchronization is a form of interprocess communication and, as such, can greatly affect program performance. In general, minimizing a program’s synchronization requirements (explicit and implicit) achieves the best performance. For this reason, OpenMP provides a rich set of synchronization features so developers can best tune the synchronization in an application.

We saw an example using the Atomic directive. This directive allows the compiler to take advantage of available hardware for implementing atomic updates to a variable. OpenMP also provides a Flush directive for creating more complex synchronization constructs such as point-to-point synchronization. For ultimate performance, point-to-point synchronization can eliminate the implicit barriers in the energy-spectrum example. All the OpenMP synchronization directives can be orphaned. As discussed earlier, this is critically important for implementing coarse-grained parallel algorithms.
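As an illustration of what can be built from Flush (a sketch written for this rewrite, not an example from the article), one thread can signal another through a shared flag, giving simple point-to-point synchronization:

c     Illustrative sketch (not from the article): thread 0 produces
c     a value and raises a flag; thread 1 spins on the flag, using
c     FLUSH so it sees the update, then consumes the value.
c     (Assumes at least two threads in the team.)
      integer flag, OMP_GET_THREAD_NUM
      real value
      flag = 0
!$OMP PARALLEL SHARED(flag, value)
      if (OMP_GET_THREAD_NUM() .eq. 0) then
         value = 42.0
!$OMP    FLUSH(value)
         flag = 1
!$OMP    FLUSH(flag)
      else if (OMP_GET_THREAD_NUM() .eq. 1) then
10       continue
!$OMP    FLUSH(flag)
         if (flag .eq. 0) goto 10
!$OMP    FLUSH(value)
         print *, 'consumer sees ', value
      endif
!$OMP END PARALLEL
      end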

Runtime library and environment variables

In addition to the directive set described, OpenMP provides a callable runtime library and accompanying environment variables. The runtime library includes query and lock functions. The runtime functions allow an application to specify the mode in which it should run. An application developer might wish to maximize the system’s throughput performance, rather than time to completion. In such cases, the developer can tell the system to dynamically set the number of processes used to execute parallel regions. This can have a dramatic effect on the system’s throughput performance with only a minimal impact on the program’s time to completion.

The runtime functions also allow a developer to specify when to enable nested parallelism, which allows the system to act accordingly when it encounters a nested parallel construct. On the other hand, by disabling it, a developer can write a parallel library that will perform in an easily predictable fashion whether encountered dynamically from within or outside a parallel region.
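A brief sketch of these runtime controls (illustrative only, not from the article); the routine names are those defined by the OpenMP Fortran API:

c     Illustrative sketch (not from the article): enable dynamic
c     adjustment of team sizes for throughput mode, disable nested
c     parallelism, and query the processor count.
      integer OMP_GET_NUM_PROCS
      logical OMP_GET_DYNAMIC
      call OMP_SET_DYNAMIC(.true.)
      call OMP_SET_NESTED(.false.)
      print *, 'processors available: ', OMP_GET_NUM_PROCS()
      print *, 'dynamic adjustment:   ', OMP_GET_DYNAMIC()
!$OMP PARALLEL
c     the runtime may now choose the team size for this region
!$OMP END PARALLEL
      end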

OpenMP also provides a conditional compilation facility, both through the C-language preprocessor (CPP) and with a Fortran comment sentinel. This allows calls to the runtime library to be protected as compiler directives, so OpenMP code can be compiled on non-OpenMP systems without linking in a stub library or using some other awkward workaround.
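For example (an illustrative sketch, not from the article), the Fortran !$ sentinel hides a runtime library call from compilers that do not support OpenMP; under CPP, the _OPENMP macro serves the same purpose for C code:

c     Illustrative sketch (not from the article). A non-OpenMP
c     Fortran compiler treats the !$ lines as comments, so nprocs
c     stays 1 and no stub library is needed at link time.
      integer nprocs
!$    integer OMP_GET_NUM_PROCS
      nprocs = 1
!$    nprocs = OMP_GET_NUM_PROCS()
      print *, 'running on up to ', nprocs, ' processors'
      end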

OpenMP provides standard environment variables to accompany the runtime library functions where it makes sense and to simplify the start-up scripts for portable applications. This helps application developers who, in addition to creating portable applications, need a portable runtime environment.

OpenMP is supported by a number of hardware and software vendors, and we expect support to grow. OpenMP has been designed to be extensible and evolve with user requirements. The OpenMP Architecture Review Board (OARB) was created to provide long-term support and enhancements of the OpenMP specifications. The OARB charter includes interpreting OpenMP specifications, developing future OpenMP standards, addressing issues of validation of OpenMP implementations, and promoting OpenMP as a de facto standard.

Possible extensions for Fortran include greater support for nested parallelism and support for shaped arrays. Nested parallelism is the ability to create a new team of processes from within an existing team. It can be useful in problems exhibiting both task and data parallelism. For example, a natural application for nested parallelism would be parallelizing a task queue wherein the tasks involve large matrix multiplies.

Shaped arrays refers to the ability to explicitly assign the storage for arrays to specific memory nodes. This ability is useful for improving performance on nonuniform memory access (NUMA) architectures by reducing the number of nonlocal memory references made by a processor.

The OARB is currently developing the specification of C and C++ bindings and is also developing validation suites for testing OpenMP implementations. ♦

References

1. D.E. Lenoski and W.D. Weber, Scalable Shared-Memory Multiprocessing, Morgan Kaufmann, San Francisco, 1995.


2. B. Leasure, ed., Parallel Processing Model for High-Level Programming Languages, proposed draft, American National Standard for Information Processing Systems, Apr. 5, 1994.

3. MIPSpro Fortran77 Programmer’s Guide, Silicon Graphics, Mountain View, Calif., 1996; http://techpubs.sgi.com/library/dynaweb_bin/0640/bi/nph-dynaweb.cgi/dynaweb/SGI_Developer/MproF77_PG/.

4. S. Ragsdale, ed., Parallel Programming Primer, Intel Scientific Computers, Santa Clara, Calif., March 1990.

5. J. Brown, T. Elken, and J. Taft, Silicon Graphics Technical Servers in the High Throughput Environment, Silicon Graphics Inc., 1995; http://www.sgi.com/tech/challenge.html.

Leonardo Dagum works for Silicon Graphics in the System Performance group, where he helped define the OpenMP Fortran API. His research interests include parallel algorithms and performance modeling for parallel systems. He is the author of over 30 refereed publications relating to these subjects. He received his MS and PhD in aeronautics and astronautics from Stanford. Contact him at M/S 580, 2011 N. Shoreline Blvd., Mountain View, CA 94043-1389; [email protected].

Ramesh Menon is Silicon Graphics’ representative to the OpenMP Architecture Review Board and served as the board’s first chairman. He managed the writing of the OpenMP Fortran API. His research interests include parallel-programming models, performance characterization, and computational mechanics. He received an MS in mechanical engineering from Duke University and a PhD in aerospace engineering from Texas A&M. He was awarded a National Science Foundation Fellowship and was a principal contributor to the NSF Grand Challenge Coupled Fields project at the University of Colorado, Boulder. Contact him at [email protected].
