
COOL: An object-based language for parallel programming




Rohit Chandra, Anoop Gupta, and John L. Hennessy

Stanford University

Although multiprocessors can provide high computational rates, using them efficiently requires program partitioning and restructuring for parallel execution. Parallelizing compilers can restructure programs automatically to

extract simple loop-level parallelism, but most codes (including those used in scientific applications) are not amenable to compiler-based parallelization. They require the exploitation of unstructured task-level parallelism, which must be indicated by the programmer. We present the programming language COOL (Concurrent Object-Oriented Language), which was designed to exploit coarse-grained parallelism at the task level in shared-memory multiprocessors.

COOL's primary design goals are efficiency and expressiveness. By efficiency we mean that the language constructs should be efficient to implement and that a program should not have to pay for features it does not use. By expressiveness, we mean that the language should flexibly support different concurrency patterns, thereby allowing various decompositions of a problem.

COOL emphasizes the integration of concurrency and synchronization with data abstraction. It is an extension of C++, which was chosen because it supports abstract data type definitions and is widely used. Extending an existing language also has the advantage of offering concurrency features in a familiar programming environment.

Concurrent execution in COOL is expressed through parallel functions that execute asynchronously when invoked. Parallel functions communicate through data in the shared address space and synchronize using monitors and condition variables. Operations on shared objects can be declared mutex, signifying that they must acquire exclusive access to that object instance before being executed.

Finally, event synchronization is expressed through operations on condition variables, which allow tasks to wait for an event or to signal a waiting task when the appropriate event occurs. In our experience, monitors and condition variables are



attractive because they are simple primitives that provide the basic elements of mutual exclusion and event synchronization. They can also be used as building blocks for more complex synchronization. Furthermore, as we show, monitors can be implemented efficiently in a parallel programming language like COOL.

The benefits of data abstraction for writing programs are well known. Programs structured around objects are modular, with side effects encapsulated within the class functions of an object. Data dependencies are made explicit in the interface between an object and its use. The programs are therefore easier to understand and modify. However, in addition to these advantages, integrating concurrency and synchronization with data abstraction offers benefits that are particular to parallel programming. First, the expressiveness of the language is enhanced since programmers can build new abstractions that encapsulate concurrency and synchronization within the object while providing a simple interface for its use. Second, the synchronization operations (and the data protected by the synchronization) can be clearly identified by the compiler system, enabling optimizations for efficient synchronization support. Finally, the object structure of COOL programs makes it easier to address two important performance issues: load balancing and data locality. Because a COOL program is naturally organized into tasks and objects, the objects referenced by tasks are more easily identified. Using this information, the runtime system can improve performance by moving tasks and/or data close to each other in the memory hierarchy while keeping


the load balanced. As we show later, constructs must be integrated with the object structure in the language, since these benefits cannot be realized by offering similar constructs through runtime libraries.

Finally, we note that the majority of the individual constructs in COOL are not novel; rather, we have tried to put together a simple set of features that mesh well to afford the programmer flexibility, control, and efficient implementation. Thus, we view COOL as a vehicle for gaining experience with parallel applications.

Language overview

As stated, concurrency in a COOL program is expressed through parallel functions. Communication between parallel functions occurs through shared variables, and the two basic elements of synchronization - mutual exclusion and event synchronization - are expressed through monitors and condition variables, respectively. For convenience, we also provide a construct to express fork-join-style synchronization at the task level.

Before presenting the language constructs, we briefly describe the runtime execution model for a COOL program. The runtime system implements user-level (or lightweight) threads. This runtime layer manages units of work called tasks entirely within user space. Tasks are typically coarse grained (several thousand instructions) and are managed efficiently, since they are highly streamlined and execute in user space. An executing task (or process) is commonly referred to as a thread of execution. The


programmer can therefore identify the concurrency in the program and depend on the runtime system to efficiently take care of the details of task management and scheduling.

Expressing concurrency. Both C++ class member functions and C-style functions can be declared parallel by prefixing the keyword parallel to the function header. Invoking a parallel function creates a task to execute that function; this task executes asynchronously with all other tasks, including the caller. The task executes in the same shared address space as all other tasks and can access the program variables like an ordinary sequential invocation does. The COOL runtime system automatically manages the creation and scheduling of user-level tasks. Since tasks are lightweight threads of execution, they can be dynamically created to express relatively fine-grained concurrency, making it easier to specify parallelism at a level natural to the problem.

It is useful to have certain invocations of a parallel function execute serially within the caller when the default of parallel execution results in unnecessary parallelism. For instance, when implementing a mergesort operation, one may want to invoke the sort on the left half of the array in parallel, while the sort on the right half is invoked serially. This is because the caller must wait for both halves to be sorted before merging them. Therefore, we allow a parallel function to be invoked serially by specifying the keyword serial at the invocation.
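As a concrete sketch of this pattern (written in the COOL syntax summarized in Table 1; the class, function, and variable names here are hypothetical and not taken from the paper):

class Sorter_c {
public:
    // Declared parallel: each invocation normally creates a new task and
    // implicitly returns a condH* event that is signaled on completion.
    parallel condH* sort (int* data, int lo, int hi);
};

parallel condH* Sorter_c::sort (int* data, int lo, int hi)
{
    if (hi - lo >= 2) {
        int mid = (lo + hi) / 2;
        condH* left = sort (data, lo, mid);   // left half: runs as a new task
        serial sort (data, mid, hi);          // right half: runs serially in the caller
        left->wait ();                        // block until the left half completes
        // ... merge the two sorted halves ...
    }
}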

To enable synchronization for the completion of a parallel function, all parallel functions implicitly return a pointer to an event of type condH (see the "Event synchronization" section). Invoking the parallel function automatically allocates an event, and a pointer to the event is returned immediately to the caller. To synchronize for the completion of the parallel function, the caller can store this pointer and continue execution, waiting later on this event (when necessary) for the parallel function to complete. The called parallel function automatically broadcasts a signal on the event upon completion, resuming any waiting tasks.

Parallel functions cannot return a value other than a pointer to an event. This restriction is necessary because of the complex semantics of returning a value that may be read (or even modified) by the calling task while a parallel function is executing to produce the return value.


Therefore, the results produced by a parallel function must be communicated through global variables and pointer arguments to the function. Returning a pointer to an event is a simple mechanism to synchronize for the completion of parallel functions. Synchronization can therefore be expressed within the caller without requiring modifications to the called function.
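For example, a result might be communicated through a pointer argument while the implicitly returned event is used only to synchronize for completion (a sketch with hypothetical names, following the description above):

parallel condH* computeSum (double* vec, int n, double* result);

main ()
{
    double v[1000];
    double sum;
    // ... fill v ...
    // Creates a task; the event pointer is returned immediately.
    condH* done = computeSum (v, 1000, &sum);
    // ... perform other useful work ...
    done->wait ();    // block here (if necessary) until computeSum completes
    // 'sum' is now valid. The programmer remains responsible for freeing
    // the event if the compiler cannot deallocate it automatically.
}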

Mutual exclusion. Parallel functions communicate through objects in shared memory. With traditional monitors (see the sidebar "Monitor-based languages"), synchronization for a shared object (an instance of a class) is expressed by specifying the class to be a monitor. All operations on that class are assumed to modify the object and must acquire exclusive access to the object before executing. Instead of specifying a class as a monitor, COOL specifies an attribute, either mutex or nonmutex, for the individual member functions of that class. A function so attributed must appropriately acquire access to the object while executing, thereby serializing concurrent accesses to the object.

A mutex function requires exclusive access to the object instance it is invoked on; it is therefore assured that no other function (mutex or nonmutex) executes concurrently on the object. Because a nonmutex function does not require exclusive access to the object, it can execute concurrently with other nonmutex functions on the same object - but not with another mutex function. Functions with neither attribute execute without synchronization, like ordinary C++ functions.
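A small sketch of these attributes on a shared class (hypothetical names; only the mutex and nonmutex keywords are COOL constructs):

class counter_c {
    int count;
public:
    // Writer: requires exclusive access to this instance.
    mutex void increment () { count++; }
    // Reader: may run concurrently with other nonmutex functions on the
    // same object, but never with a mutex function on that object.
    nonmutex int value () { return count; }
    // No attribute: executes without any synchronization.
    void reset () { count = 0; }
};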

Typically, functions that modify the object are declared mutex, while those that reference the object without modifying it are declared nonmutex, automatically providing multiple-reader (nonmutex), single-writer (mutex) style synchronization for the shared object. Synchronization occurs between functions on the same object and is unaffected by functions on other instances of the class. Because this synchronization is based on objects, C-style functions cannot be declared mutex or nonmutex. A function can have only one of these attributes. However, these attributes are orthogonal to the parallel attribute. A function that is both parallel and mutex/nonmutex executes asynchronously when invoked and appropriately locks the object before executing.

Monitor-based languages

Hoare1 proposed monitors as a synchronization mechanism for coordinating concurrent processes on a uniprocessor. In a language with abstract data types, the programmer can declare a shared data type (like a C++ class) as a monitor. A monitor object (an instance of a monitor class) has the property that only one operation (member function of that class) can execute on the object at any one time, thereby serializing multiple simultaneous operations on the object. This ensures exclusive access to the object for individual operations. Event synchronization is provided through instances of a special type called condition variables, which provide the wait and signal operations on the event.

Languages like Concurrent Pascal,2 Mesa,3 and Modula4 have integrated monitors with support for data abstraction in the language. However, their primary focus has been concurrent programming on a uniprocessor, rather than parallel programming on a multiprocessor. Consequently, the primary concerns have been synchronization for shared resources among operating system processes and providing real-time response and fair scheduling in a multiprogrammed environment. Recent research efforts, such as the Parmacs macros,5 Presto,6 and Modula-3,7 have provided monitor-based support for writing parallel programs on multiprocessors. However, these programming environments are based on runtime libraries, in contrast with COOL, where concurrency and synchronization are tightly integrated within the object structure of the language. Consequently, their abstractions are lower level. The programmer explicitly creates tasks for parallel execution and expresses synchronization by wrapping enter and exit operations around critical code sections. These operations are invoked on instances of a predefined monitor type. Associating the monitor with the shared object - and the structured usage of the monitor - becomes the programmer's responsibility; it is not enforced by the language, as is done in COOL.

References

1. C.A.R. Hoare, "Monitors: An Operating System Structuring Concept," Comm. ACM, Vol. 17, No. 10, Oct. 1974, pp. 549-557.

2. P.B. Hansen, "The Programming Language Concurrent Pascal," IEEE Trans. Software Eng., Vol. SE-1, No. 2, June 1975, pp. 199-207.

3. B.W. Lampson and D.D. Redell, "Experience with Processes and Monitors in Mesa," Comm. ACM, Vol. 23, No. 2, Feb. 1980, pp. 105-117.

4. N. Wirth, "Modula: A Language for Modular Multiprogramming," Software Practice and Experience, Vol. 7, No. 1, Jan. 1977, pp. 3-36.

5. B. Beck, "Shared-Memory Parallel Programming in C++," IEEE Software, Vol. 7, No. 4, July 1990, pp. 38-48.

6. B. Bershad, E. Lazowska, and H. Levy, "Presto: A System for Object-Oriented Parallel Programming," Software Practice and Experience, Vol. 18, No. 8, Aug. 1988, pp. 713-732.

7. L. Cardelli et al., "Modula-3 Report," Tech. Report 31, Digital Systems Research Ctr., Palo Alto, Calif., Aug. 1988.


Often the programmer may know that certain references to a shared object can be performed safely without synchronization. In such cases, we allow the programmer to reference the object directly, without regard for executing mutex and nonmutex functions. For instance, one may wish to simply examine the length of a queue object to decide whether to dequeue an element from this queue or to move on to another queue. Although unsynchronized accesses violate the abstraction offered by the mutex and nonmutex attributes, they avoid synchronization overhead and serialization when they can be performed safely.

We permit two types of unsynchronized access. First, we do not require functions to have a mutex/nonmutex attribute. Those without an attribute can execute on the object without regard for other concurrently executing functions - mutex or nonmutex. Second, we maintain the C++ property that allows the public fields of an object to be accessed directly rather than through member functions alone - again, independently of other executing functions. When performed safely, these unsynchronized references can avoid synchronization overhead and enable additional concurrency. Although the burden for determining when these unsynchronized references can be performed safely is left entirely to the programmer, we believe that such compromises are frequently necessary for efficiency reasons.

Event synchronization. Waiting for an event (say, waiting for some data to become available) is expressed through operations on condition variables. A condition variable is an instance of a predefined class cond. Operations on condition variables allow a task to wait for or signal the event represented by the condition variable. The main operations on condition variables are wait, signal, and broadcast. A wait operation always blocks the current task until a subsequent signal operation is invoked. A signal wakes up a single task waiting on an event, if any (it has no effect otherwise). A broadcast wakes up all tasks waiting for that event and is useful for synchronizations like barriers that resume all waiting tasks.

Within a mutex/nonmutex function, waiting for an event that is signaled by another function on the same object is initiated by the release operation. Invoking a wait operation can lead to deadlock, since the function has locked the object, preventing the signaling function from acquiring access to it. The release operation atomically blocks the function on an event and releases the object for other functions. When the event is signaled, the function resumes execution at the point where it left off after (a) waiting for the signaling function to complete and (b) reacquiring the appropriate lock on the object. This maintains the synchronization requirements of mutex and nonmutex functions.
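For instance, a one-slot buffer shared between a single producer and a single consumer might use release and signal inside its mutex functions (a sketch with hypothetical names, using the semantics just described):

class buffer_c {
    int item;
    int full;                     // nonzero when an item is present
    cond emptied, filled;
public:
    mutex void put (int v) {
        if (full)
            emptied.release ();   // unlock the object and wait for a get
        item = v;
        full = 1;
        filled.signal ();         // wake a waiting consumer, if any
    }
    mutex int get () {
        if (!full)
            filled.release ();    // unlock the object and wait for a put
        full = 0;
        emptied.signal ();        // wake a waiting producer, if any
        return item;
    }
};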

A wait operation is resumed by subsequent signals only - that is, condition variables do not have history. In some situations, such as a wait operation on the event returned by a parallel function, the wait should obviously continue without blocking if the event had been signaled before the wait operation. Although this functionality can be implemented with existing language mechanisms, for convenience we provide the class condH (cond with history), in which a signal operation is never lost. If a task is waiting, it is resumed; otherwise the signal is stored in the condition variable. A wait operation blocks only if there is no stored signal; otherwise it consumes a signal and continues without blocking. The broadcast operation is equivalent to infinitely many signals; all subsequent wait operations continue without blocking. We provide an uncast operation to reset the stored signals to zero and a nonblocking count operation to return the number of stored signals. The latter operation is useful for writing nondeterministic programs, which test an event and take different actions depending on whether the event has been signaled.
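A minimal sketch of these operations (hypothetical names; the operations are those listed in Table 1):

condH ready;                     // an event with history

void producer_fn ()
{
    // ... produce the data ...
    ready.broadcast ();          // all later waits continue without blocking
}

void consumer_fn ()
{
    if (ready.count () > 0) {
        // the event has already been signaled; use the data immediately
    } else {
        ready.wait ();           // consumes a stored signal, or blocks if none
    }
}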

In contrast to COOL, condition variables in traditional monitors are restricted to be private variables within a monitor. The traditional monitor wait operation is like our release: It releases the monitor and blocks until a subsequent signal is performed. However, we also allow condition variables to be used outside a monitor for general event synchronization between tasks. The release operation is therefore distinct from a wait; both operations block on the event, while a release unlocks the object as well.

Task-level synchronization. Programs often have clearly identifiable phases that are performed in parallel (for example, a time step in a simulation program). Rather than synchronize for each individual object to be produced by different parallel tasks in the phase, it is often simpler and more efficient to wait for the entire phase to complete. In COOL, this task-level synchronization is expressed by the waitfor construct. Programmers attach the keyword waitfor to a scope wrapped around the block of statements of that phase. The end of the scope causes the task (that executed the scope) to block until all tasks created within the scope of the waitfor have completed. This set is precisely the dynamic call graph of all functions invoked within the scope of the waitfor; it includes all parallel functions invoked directly by the task or indirectly on its behalf by other functions.

Table 1. Summary of COOL constructs.

Feature                             Construct                            Syntax
Concurrency                         Parallel functions                   parallel condH* egClass::foo ()
Mutual exclusion synchronization    Mutex and nonmutex functions         mutex int egClass::bar ()
Concurrency and mutual exclusion    Parallel mutex functions             parallel mutex condH* egClass::baz ()
Event synchronization               Condition variables                  condH x; cond y;
                                    Wait for an event                    x.wait ()
                                    Signal an event                      x.signal ()
                                    Wait and release monitor             x.release ()
                                    Broadcast an event (condH)           x.broadcast ()
                                    Reset an event (condH)               x.uncast ()
                                    Number of stored signals (condH)     x.count ()
Task-level synchronization          Waitfor                              waitfor { ... }



Using a waitfor wrapped around a phase in a COOL program is similar to using barriers at the end of a phase in programming models without lightweight threads, such as the ANL (Argonne National Laboratory) macros.5 However, a barrier is tied to the number of processes that participate in that barrier. In contrast, since the computation within a phase is usually expressed by calling parallel functions in COOL, synchronization for the phase to complete is the same as waiting for the various parallel functions to complete. Table 1 summarizes the COOL constructs.

Discussion. We now briefly elaborate on some issues that we have side-stepped so far. The first concerns events returned by a parallel function, while the other two deal with the semantics of monitor operations, that is, mutex and nonmutex functions.

The programmer is responsible for freeing the storage associated with the event returned by a parallel function once the event is no longer required. The compiler performs simple optimizations such as (a) not allocating the event at all for invocations of parallel functions where the return value is ignored and (b) deallocating the event once it is no longer accessible in the program (for example, exiting the scope of an automatic variable that stores the only pointer to the event). The compiler performs these optimizations safely, that is, it ensures that an event is not deallocated by both programmer and compiler. However, although the compiler can optimize simple situations, the ultimate responsibility for deallocating the event rests with the programmer.

Second, in a traditional monitor, recursive monitor calls to the same object by a task result in deadlock, since the task is trying to reacquire an object it already holds. However, such recursive monitor calls (both direct and indirect) proceed in COOL without attempting to reacquire the object, since the task has already satisfied the necessary synchronization constraints. These semantics for monitor functions are more reasonable, since they avoid obscure deadlock situations even though the program has performed the desired synchronization.
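A brief sketch of such a recursive monitor call (hypothetical names):

class set_c {
    // ...
public:
    mutex void insert (int x);
    mutex void insertAll (int* xs, int n) {
        for (int i = 0; i < n; i++)
            insert (xs[i]);   // a recursive monitor call on the same object:
                              // in COOL it proceeds without reacquiring the object
    }
};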

Figure 1. A wind tunnel in PSIM4, with particles contained in space cells.

However, note that deadlock is still possible in a COOL program when two tasks attempt to acquire multiple objects in a different order. This is a common problem when acquiring multiple resources; well-structured programs solve it by acquiring the resources in a specific total order. Finally, invocations of monitor operations can rely on not being starved. The implementation guarantees that a blocked monitor operation is not indefinitely overtaken by subsequent monitor operations and that it is ultimately serviced.

Note that allowances made in the language design let programmers leverage their knowledge of the code for additional expressiveness and/or performance benefits. The features that relate to monitor operations include specifying different kinds of synchronization for functions on shared objects, permitting unsynchronized access to an object, and supporting recursive monitor calls without deadlock. Other features include condition variables with history, general event synchronization through condition variables, and task-level synchronization. Some of these allowances are not desirable in terms of safety concerns, but we believe they offer sufficient performance benefits to justify offering them in a controlled manner.

Examples

Now we illustrate the basic concurrency and synchronization features of COOL with a particle-in-cell code (PSIM4). We then show how to build abstractions in COOL with two examples that perform object-level synchronization. The first example implements synchronization for multiple values produced by a parallel function (taken from our Water application). The second example shows how to acquire exclusive access to multiple objects (common in database applications).

Illustrating the constructs. PSIM4 is an improved version of the more widely known Splash4 (Stanford Parallel Applications for Shared Memory) benchmark program MP3D. It is a three-dimensional particle-in-cell code used to study the pressure and temperature profiles created when an object flies through the upper atmosphere (see Figure 1). The program contains particle objects that represent air molecules and space cell objects that represent the wind tunnel, the flying object, and the boundary conditions. PSIM4 evaluates the position, velocity, and other parameters of every particle over a sequence of time steps. The main computation in each simulated time step consists of a move phase and a collide phase. The first phase calculates a particle's new position by using the current position and velocity vectors, and moves the particles to their calculated destination cell. The collide phase models collisions of particles within the same space cell.

Concurrency is organized around space-cell objects in the parallel version of the algorithm, as shown in Figure 2. Within the move phase, particles in different space cells can be moved concurrently, expressed through the parallel move function on a space cell. In the collide phase, only particles in the same space cell can collide. Consequently, collisions in different space cells can be modeled concurrently through the parallel collide function on a space cell.

During the course of the simulation, particles that move from one space cell to another are removed from their present cell and added to the destination cell by invoking the addParticle function on the destination cell. Each cell maintains a list of incoming particles (incomingParticleList) that provides the mutex function add. The addParticle function calls the add function to enqueue particles on this list. In the beginning of the collide phase, the incoming particles are incorporated with other particles in the space cell. While the new particles are being transferred in the collide phase, the incomingParticleList can be manipulated directly without synchronization, since all cells have already completed their move phase. Finally, since each phase must complete before the program can go on to the next phase, synchronization is naturally expressed by wrapping the waitfor construct around each phase.


class ParticleList_c {
    ...
public:
    mutex void add (Particle*);
};

class cell_c {
    ...
    ParticleList_c incomingParticleList;
public:
    ...
    // Perform the move phase for particles in this cell.
    parallel void move () {
        ...
        // To move a particle p to the cell dest:
        Cell[dest].addParticle (p);
    }
    // Model collisions for particles in this cell.
    parallel void collide () {
        // Transfer particles from the incomingParticleList
        // and add them to the list of particles in this cell.
        // Access the list without synchronization.
        ...
    }
    void addParticle (Particle* p) {
        incomingParticleList.add (p);
    }
} Cell[N];

main () {
    ...
    while (timestep > 0) {      // for each time step do
        waitfor {
            // Move particles in different cells concurrently.
            for (i = 0; i < N; i++)
                Cell[i].move ();
        }   // Wait for the move phase to complete.
        waitfor {
            // Model collisions in different cells concurrently.
            for (i = 0; i < N; i++)
                Cell[i].collide ();
        }   // Wait for the collide phase to complete.
        timestep--;
    }
}

Figure 2. Expressing concurrency and synchronization in PSIM4.


Object-level synchronization. We use the Water code6 to illustrate object-level synchronization in COOL. This code is an n-body molecular dynamics application that evaluates forces and potentials in a system of water molecules in the liquid state. The molecules are partitioned into groups. A group contains a parallel function stats for computing internal statistics like the position, velocity, and potential energy of its molecules. In this example, we show how to synchronize for individual values produced by a parallel function as they become available, instead of waiting for the entire function to complete, thus enabling the caller to exploit additional concurrency.

Figure 3 shows the synchronization for the variable PotEnergy, which represents a molecule group's potential energy. This example exploits C++'s definitions of type conversion operators, which are automatically invoked to construct values of the desired type. We define a class double_s that adds a condition variable for synchronization to a double. This class has two operators. The first is an assignment operator invoked when a value of type double is assigned to a variable of type double_s. The operator stores the value in the variable and broadcasts a signal on the condition variable. The second is a cast operator invoked when a value of type double_s is used where a double is expected. The operator waits for the event signal and returns the value. With these operators, a double can be safely used in place of a double_s, and vice versa.

The caller passes the address of PotEnergy to the stats function and continues execution. When it later references the value of PotEnergy, it invokes the cast operator, which automatically blocks until the value is available. The pointer to PotEnergy can be passed to other functions without blocking, thus exploiting additional concurrency. The called function computes the potential energy and stores it in the supplied variable. This assignment invokes the assignment operator, which stores the value and signals all tasks waiting on the condition variable. The function stats continues execution to compute other parameters.

The example shows that the desired synchronization is expressed naturally at the object level. This lets the caller execute asynchronously until it needs the value and lets the called function execute independently, making the various parameters available as they are computed. In addition, we use C++'s capability for defining operators to build a double_s abstraction. This abstraction encapsulates synchronization within the object and can be used like an ordinary double.

Exclusive access to multiple objects. Mutex and nonmutex functions protect access to a single object only. In some situations, the program updates two (or more) objects and requires exclusive access to all of them.


class double_s {
    double value;
    condH x;
public:
    // Initialize the value to be unavailable.
    double_s () {
        x.uncast ();
    }
    // Operator to cast a double to a double_s.
    // Called when a double is assigned
    // to a variable of this type.
    void operator= (double v) {
        value = v;
        x.broadcast ();
    }
    // Operator to cast a double_s to a double.
    // Called when the value is referenced.
    operator double () {
        x.wait ();
        return value;
    }
};

// Function to compute internal statistics of the
// molecules in the group, like position, velocity
// and potential energy.
parallel void moleculeGroup::stats (double_s* poteng, ...)
{
    double subtotal;
    // ... Compute local potential energy into subtotal ...
    *poteng = subtotal;
    // ... Continue - compute position, velocity, etc. ...
}

main () {
    moleculeGroup mg;
    double totalEnergy = 0;
    double_s PotEnergy;     // The value is initially unavailable.

    // Parallel invocation.
    mg.stats (&PotEnergy, ... /* other arguments */ ...);

    // ... perform other useful computation ...

    // Now the value of PotEnergy is required.
    // The reference to PotEnergy blocks
    // if the corresponding function has not completed,
    // e.g., compute the total energy:
    totalEnergy += PotEnergy;
    // Continue after the value becomes available.
}

Figure 3. Object-level synchronization in the Water code.

For instance, moving an object from one list to another requires exclusive access to both lists (otherwise another task may find either no copy or duplicate copies of that object). Or, say, a transaction-based system requires that updates to several objects be atomic. The programmer can easily build this synchronization by using mutex functions and condition variables, explicitly maintaining the status of the objects and providing operations to acquire and release them (see Figure 4). A claim operation acquires the object if it is free but blocks otherwise. The surrender operation grants access to a waiting task, if any. Exclusive access to all objects is obtained by claiming each individual object. All tasks must claim the objects in the same order to avoid deadlock; this can be ensured through macros or functions that claim the objects in order of increasing virtual addresses. This simple protocol can be easily extended to support specific application requirements such as operations to test whether an object is available, and either blocking or retrying later.
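Such an ordered claim might look like the following sketch (a hypothetical helper; sh_obj and claim are from Figure 4):

// Claim two objects in order of increasing virtual address so that
// every task acquires them in the same total order.
void claimBoth (sh_obj* a, sh_obj* b)
{
    if ((unsigned long) a > (unsigned long) b) {
        sh_obj* t = a; a = b; b = t;      // swap so a has the lower address
    }
    a->claim ();
    b->claim ();
}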

Summary. We have implemented several significant applications in COOL, including several from the Splash parallel benchmark suite.6 These include a routing program for standard cell circuits (LocusRoute), a program to simulate the evolution of a system of water molecules (Water), a program to simulate the eddy currents in an ocean basin (Ocean), a rarefied hypersonic flow simulation program (PSIM4), and a program to perform Cholesky factorization of a sparse matrix (Cholesky). For most of these programs, the primary synchronization requirement was serializing access to shared objects, and only a few programs required more complex synchronization abstractions. As shown by the previous examples, the various synchronization requirements were expressed naturally using monitors in COOL.

Data locality and load balancing

We have expressed a variety of applications in COOL. However, expressing the concurrency and the accompanying coordination doesn't automatically ensure great speedups.


class sh_obj {
    objtype value;
    enum { FREE, BUSY } status;
    cond sync;
    ...
public:
    mutex void claim () {
        if (status != FREE) {
            // Release mutex and wait until the object
            // becomes available.
            sync.release ();
        }
        // The object is now available; continue.
        status = BUSY;
    }
    mutex void surrender () {
        status = FREE;
        // Wake up a task waiting for this object, if any.
        sync.signal ();
    }
};

main () {
    sh_obj x, y;

    // Begin Transaction: Acquire exclusive access to
    // both objects.
    // To avoid deadlock, objects must be claimed
    // in the same order by all transactions.
    x.claim ();
    y.claim ();

    // ... Reference and modify the two objects ...

    // End Transaction: Release the objects.
    x.surrender ();
    y.surrender ();
}

Figure 4. Acquiring access to multiple objects in a transaction-based system.

Obtaining good performance on a multiprocessor requires (a) good load balance so that processors do not sit idle for lack of work and (b) good data locality so that individual processors do not waste time waiting for memory references to complete. The latter problem is particularly severe on modern multiprocessors, where the latency of memory references can range from tens to hundreds of processor cycles. Any parallel programming system must address these performance issues.

For this study, we assume a three-level memory hierarchy consisting of (a) per-processor caches, (b) a local portion of shared memory, and (c) remote shared memory (see Figure 5). We believe that this is a reasonable abstraction for many large-scale multiprocessors, since it captures the essential components of their memory hierarchy. The primary mechanisms for improving data locality in such architectures are task scheduling and object distribution. For instance, we can exploit cache locality (that is, cache reuse) by scheduling tasks that reference the same objects on the same processor. Cache locality can be further enhanced by scheduling tasks back to back to reduce possible cache interference caused by the intervening execution of other unrelated tasks. When cache locality alone is insufficient, we can exploit memory locality by identifying the primary object(s) referenced by a task and executing that task on the processor that contains the object(s) in its local memory.

Figure 5. Multiprocessor architecture and memory hierarchy: cache, local memory, and remote memory.

Thus, references to the object that miss in the cache are serviced in local, rather than remote, memory. This results in lower latency.

Along with task scheduling, object placement is often necessary. For example, exploiting memory locality and simultaneously ensuring a good load balance requires task scheduling and an appropriate distribution of objects across processors (local memories). Thus, it is important to provide mechanisms to distribute and dynamically migrate objects across processor memories.

Our approach. Determining efficient task and object distributions often requires program knowledge beyond the compiler's scope but possessed by the programmer. Therefore, COOL provides abstractions that enable the programmer to supply hints about the data objects referenced by parallel tasks. The runtime system uses these hints to appropriately schedule tasks and migrate data, thereby exploiting locality in the memory hierarchy. The hints do not affect program semantics. This approach provides a clear separation of functionality. The programmer can focus on exploiting parallelism and supplying hints about the object reference patterns, leaving the task-creation details and management to the implementation.

Since our approach depends on programmer participation, it is important that the abstractions be intuitive and easy to use. At the same time, the abstractions should be powerful enough to exploit locality at each level in the memory hierarchy without compromising performance. We address these goals through the following key design elements. First, the abstractions are integrated with the task and object structure of the underlying COOL program so that the hints are easily supplied by the programmer.


class column_c {
public:
    ...
    // Parallel function to update this column
    // by the given src column.
    parallel mutex void update (column_c* src)
    // Code specifying affinity hints. The runtime
    // executes this code when the function is invoked,
    // to determine how to schedule the corresponding
    // task.
    {
        // By default the task has affinity for the column
        // being reduced (this):
        affinity (this);

        // To instead express affinity for the column 'src'
        // being used to reduce the column:
        affinity (src);
    };
} *column;

main () {
    int i, j;
    ...
    for (i = 0; i < N; i++) {
        for (j = 0; j < i; j++) {
            // Invocations of the parallel function.
            // The corresponding tasks are scheduled
            // based on the specified affinity hints (or default).
            column[i].update (column+j);
        }
    }
}

Figure 6. COOL code illustrating the specification of affinity hints.

Second, the abstractions are structured as a hierarchy of optional hints so that simple optimizations are easily obtained as defaults, while more complex ones require incremental amounts of programmer effort. Finally, the abstractions are designed to enable experimentation with different optimizations and to address locality at each memory-hierarchy level.

For simplicity, we currently leave the burden of distributing and migrating objects across processor memories to the programmer and simply provide constructs to allow this to be expressed in the language. Ongoing compiler research to automatically determine good object distribution, which could reduce programmer burden, has enjoyed some success for regular data-access patterns.

Abstractions. In COOL, information about the program is provided by identifying objects important for task locality. Along with a parallel function, the programmer can specify a block of code that contains affinity hints (see Figure 6). This block of code is executed when the program invokes a parallel function and creates a corresponding task. The hints are simply evaluated by the runtime system to determine their effect on task scheduling; they do not affect program semantics. Here we explain the hierarchy of affinity abstractions.

Default affinity. Parallel functions in COOL have a natural association with the object that they are invoked on. By default, tasks created by invocations of this function are scheduled on the processor that contains that object in local memory. (C-style functions are scheduled on the server that created them.) The task is therefore likely to reference the object in local, rather than remote, memory. In addition, if several tasks operate on that object, the runtime system executes those tasks back to back on that processor to further improve cache locality.

Simple affinity. When a parallel function would benefit from locality on an object(s) other than the default object, the programmer can override the default by explicitly identifying the object through an affinity specification for the function. When affinity for an object is explicitly identified, the runtime system schedules the task in a manner similar to the default described above, except that scheduling is based on the specified, rather than the default, object.

Task and object affinity. It is often useful to simultaneously exploit memory locality on one object and cache locality on another. Consider a column-oriented Gaussian elimination on a matrix. In the algorithm, each column of the matrix is updated by the columns to its left to zero out the entries above the diagonal element. Once the column has received all such updates (that is, all entries above the diagonal element are zero), it is used to update other columns to its right in the matrix. This algorithm invokes the parallel function update (see Figure 7), which modifies a destination column by a given source column. A desirable execution schedule and object distribution for this column-oriented algorithm are as follows (see Rothberg and Gupta6). The algorithm executes an update on the processor where the destination column is allocated, thereby exploiting memory locality on the destination column (the number of columns per processor is so large that they are not expected to fit in the cache). The columns are distributed across processors in a round-robin fashion to obtain good load distribution. In addition, each processor executes multiple updates that involve the same source column. By executing these updates back to back, the algorithm also exploits cache locality on the source column.


class column_c {
public:
    ...
    // Parallel function to update 'this' column by the given 'src' column.
    parallel mutex void update (column_c* src)
    {
        // Object affinity for 'this' exploits
        // memory locality on the destination column.
        affinity (this, OBJECT);

        // Task affinity for 'src' exploits
        // cache locality on the source column.
        affinity (src, TASK);
    };
};

Figure 7. COOL code illustrating TASK and OBJECT affinity in Gaussian elimination.


Because of these needs for exploiting different kinds of locality, we allow the keywords TASK and OBJECT to be specified with an affinity statement. The TASK affinity statement identifies tasks that reference a common object as a task-affinity set; these tasks are executed back to back to increase cache reuse. The OBJECT affinity statement identifies the object for memory locality. The task is colocated with the object. As Figure 7 shows, we can simultaneously exploit cache locality through task affinity on the source column and memory locality through object affinity on the destination column. This exactly captures the way the algorithm was hand-coded using ANL macros.5 The same scheduling is very simply expressed in COOL.

Processor affinity. For load-balancing reasons, one may need to directly schedule a task on a particular processor (in practice, the corresponding server process), rather than indirectly through the objects it references. We therefore provide processor affinity through the PROCESSOR keyword, which can be specified for an affinity declaration. The programmer supplies an integer argument instead of an object address. The argument's value (modulo the number of server processes) becomes the server number on which the task is scheduled.
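For example, tasks might be spread across servers purely by index (a sketch following the layout of Figures 6 and 7; the class and field names are hypothetical):

class work_c {
    int index;        // hypothetical field used only for scheduling
public:
    ...
    parallel void process ()
    {
        // Schedule the task on server (index modulo the number of servers),
        // regardless of where the object's data resides.
        affinity (index, PROCESSOR);
    };
};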

When affinity is specified for multiple objects, the runtime system simply schedules the task based on the first object. The programmer must supply appropriate affinity hints for tasks that would benefit from locality on multiple objects. For instance, to exploit memory locality on multiple objects, the programmer could allocate all objects from the memory of one processor and supply an object-affinity hint for any one. To exploit cache locality on multiple objects, the programmer could pick one of several objects and supply a task-affinity hint for that particular object consistently across the various tasks. We are evaluating heuristics that would automatically make an intelligent scheduling decision when multiple affinity hints are supplied, such as those that determine the relative importance of objects based on their size and schedule the task on the processor that has the most objects in its local memory.

Object distribution. In addition to affinity hints, we let the programmer distribute objects across memory modules when they are allocated. The programmer can also dynamically migrate them to another processor's local memory. Unless otherwise specified, memory is allocated from the requesting processor's local memory. For dynamic object distribution, we provide a migrate function that takes a pointer to an object and a processor number, and migrates the object to the local memory of the specified processor (modulo the number of server processes). An optional third argument that specifies the number of objects to be migrated is useful for migrating an array of objects. (The operating system on the SGI and Dash multiprocessors supports data allocation at the page level only. Therefore, the migrate call on these machines is implemented through the migration of entire pages spanned by the object, rather than the object alone.) Finally, the home function returns the number of the processor that contains the given object allocated in its local memory.
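A small sketch of these constructs (using the column_c class from Figure 6; N and P are assumed constants for the number of columns and of processors):

column_c* col[N];
for (int i = 0; i < N; i++)
    col[i] = new (i % P) column_c;   // allocate from the memory of processor i % P

migrate (col[0], 3);                 // later, move col[0] to processor 3's local memory
int p = home (col[0]);               // returns the processor holding col[0], now 3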

Table 2. Summary of affinity specifications and data distribution constructs.

Construct                       Description           Effect
affinity (obj-addr)             Simple affinity       Colocate with object, schedule back to back
affinity (obj-addr, OBJECT)     Object affinity       Colocate with object
affinity (obj-addr, TASK)       Task affinity         Schedule tasks in task-affinity set back to back
affinity (num, PROCESSOR)       Processor affinity    Schedule on specified processor
x = new (P) type                Object distribution   Allocate the object from the memory of processor P
migrate (obj-addr, P)           Object distribution   Migrate the object to the memory of processor P
migrate (obj-addr, P, size)     Object distribution   Migrate an array of "size" objects to the memory of processor P
home (obj-addr)                 Object distribution   Return the processor P that contains the object in its local memory


Table 2 summarizes the affinity hints. The runtime scheduler uses them to schedule tasks, as previously described. In addition to scheduling tasks for good locality, the scheduler uses several heuristics to maintain good load balancing at runtime. For instance, an idle processor steals tasks from other processors. An idle processor will also try to steal an entire task-affinity set, since the set can execute on any processor and benefit from cache locality. Several processors can execute tasks from the same task-affinity set if the common object is not being modified (modifications to the object invalidate it in other processors' caches).

Implementation

Here we provide a brief overview of the COOL implementation, present some compiler optimizations, and describe the runtime task queue structures to support task scheduling for better data locality.

Overview. A COOL program is translated to a C++ program by a yacc-based source-to-source translator. The generated C++ program can then be compiled and linked with the COOL runtime libraries. When a COOL program begins execution, several server processes are created, usually one per available processor. Each server process is assigned to a processor, on which it executes without migrating. Server processes correspond to traditional heavyweight Unix processes but share the same address space. They are created initially when the program starts up and execute for the entire duration of program execution. Invoking a parallel function creates a task, or a lightweight unit of execution. Tasks are implemented entirely within the COOL runtime system. They have their own stack and execute in the shared address space. Each server continually fetches a task and executes it without preemption until the task completes or blocks (perhaps to acquire a mutex object or wait on an event). The program finishes execution when all tasks complete.

Reducing synchronization overhead. A naive implementation of monitors can incur high overheads. Below, we present some optimizations to help eliminate this problem. These optimizations also illustrate the benefits of integrating monitors with the object structure.

Figure 8. The effect of synchronization optimizations in the Water code: speedup versus number of processors for the ideal, spin-monitor, and blocking-monitor cases.

This integration enables the compiler to (a) identify the data associated with the monitor synchronization and (b) identify the synchronization operations on an object. The compiler can therefore analyze the synchronization operations and optimize their implementation in various ways that we outline below.

Monitor operations are implemented by first executing some entry code to acquire access to the object, then executing the user function code, and finally executing some exit code to surrender the object to a waiting operation, if any. When another monitor operation is already executing on that object, the task executing the entry code must wait until the object becomes available. This waiting must be implemented by blocking the task that executes the entry code so as to free the underlying processor to fetch and execute other tasks. Implementing the waiting with busy-waiting can lead to deadlock in some programs.

The primary overhead in this scheme occurs when it blocks a task and switches to another context. Although blocking the task immediately on an unavailable monitor is efficient in situations where the operations execute for long durations, it has excessive overhead for shared objects with small critical sections. We therefore implement the entry to a monitor operation with a two-phase algorithm in which the entry code busy-waits for a "while" (heuristically chosen by the runtime system) and then blocks if the monitor object is still unavailable. (This context switch occurs between user-level tasks and is analogous, but orthogonal, to implementing locks in the context of operating system scheduling, where locks can be implemented with pure spin or with blocking.) Furthermore, we have developed compiler techniques to automatically identify those monitors (called spin monitors) that are suitable to implement with pure busy-wait semantics. A monitor can safely be implemented with busy-wait alone if no monitor function calls an operation that could block, that is, a monitor operation on another object or a wait on an event. In addition, a monitor is worth implementing as a spin monitor if all monitor operations have a "small" body (determined heuristically). Spin monitors are protected by a simple lock that must be held for the duration of the monitor function. They are thus attractive for small shared objects, since they have minimal runtime and storage overhead.
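The idea behind the two-phase entry can be sketched in ordinary C++ (an illustration of the technique only, not the COOL runtime's actual code; SPIN_LIMIT stands in for the heuristically chosen bound):

#include <mutex>

std::mutex monitor_lock;
const int SPIN_LIMIT = 1000;     // stand-in for the runtime's heuristic "while"

void monitor_enter ()
{
    // Phase 1: spin briefly, hoping the current operation is a short
    // critical section that will release the object soon.
    for (int i = 0; i < SPIN_LIMIT; i++)
        if (monitor_lock.try_lock ())
            return;              // acquired without blocking
    // Phase 2: the object is still busy; block. (In COOL this blocks the
    // user-level task, freeing the server process to run other tasks.)
    monitor_lock.lock ();
}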

In several of the applications that we have programmed in COOL, the synchronizations built by using monitors meet our heuristic criteria for spin monitors. These include synchronization for shared objects (the Water code and Cholesky) and providing exclusive access to multiple objects (see above). Figure 8 shows the performance impact of these optimizations on the Water code, executing on a 32-processor Dash multiprocessor.


As we can see, a two-phase implementation of monitors performs better than a blocking monitor. In addition, this program has a shared object that meets our criteria for a spin monitor. Implementing this object as a spin monitor further improves performance by over 20 percent as compared with a blocking monitor.

Monitor implementations incur other overheads that can frequently be optimized by a compiler.

Figure 9. Task queue structure to support task and object affinity: each processor has an array of queues, one per task-affinity set.

Figure 10. Absolute speedup on a 32-processor Dash for Water, Ocean, Barnes-Hut, Block Cholesky, and Panel Cholesky, compared with ideal speedup.

For instance, a monitor often has mutex functions but no nonmutex functions. A compiler can identify such monitors and entirely discard the support for nonmutex functions, thereby reducing storage and runtime overhead. Second, all monitor calls require a runtime check to detect whether they are recursive calls, in which case they should proceed without synchronization. A compiler can often identify recursive calls statically, again doing away with the storage and runtime check. Finally, synchronization abstractions are frequently built with a condition variable declared private within the monitor class. In such situations, condition-variable accesses need not be protected by a separate lock, since accesses to the variable are automatically serialized through mutex functions on the object.

Automatically identifying spin monitors, private condition variables, or classes without nonmutex functions is not implemented in our current system, since they each require a two-pass compiler. We expect to implement them in the near future. However, because these techniques are highly effective for optimizing the overheads associated with the most common synchronizations built using monitors, they are necessary for any monitor-based environment. Finally, note how the integration of monitors with the object structure enables the analysis required for these optimizations. This analysis would not be possible if monitors were provided in a runtime library.


Scheduling support for locality. Object affinity requires that a task be executed on the processor that has the object allocated in its local memory, while task affinity requires that tasks be executed back to back on a processor. To support task scheduling for object affinity, there is a task queue for each server process, and a task with object affinity is colocated with the object (see Figure 9). To support task-affinity sets, each processor must distinguish tasks belonging to different sets. Each server thus actually has an array of queues, with each array element corresponding to a task-affinity set.
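A rough sketch of this structure follows, in plain C++; the type names and the array size are assumptions made for illustration and are not details of the COOL runtime.

```cpp
// Rough sketch of the per-server scheduling structure; all names and the
// array size are assumptions, not taken from the COOL implementation.
#include <array>
#include <cstddef>
#include <deque>
#include <list>
#include <vector>

struct Task;   // opaque descriptor for a COOL task

constexpr std::size_t kAffinityQueues = 64;   // assumed queues per server

struct ServerQueues {
    // One queue per task-affinity set; a task with plain object affinity is
    // colocated with its object on this server and enqueued here as well.
    std::array<std::deque<Task*>, kAffinityQueues> queues;
    // Indices of the currently nonempty queues (kept linked, as described
    // below) so the server can enqueue, dequeue, and scan for work quickly.
    std::list<std::size_t> nonempty;
};

// One server process (and hence one ServerQueues instance) per processor.
std::vector<ServerQueues> servers;
```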

Given the above task queue structure, when a task is created with affinity specified for some object, we schedule it on the processor that has the object allocated in local memory. In addition, within the array of queues for that processor, the task is enqueued on the queue given by the value of the object's virtual address modulo the size of the array.

When both task and object affinity are specified for a task, the task is scheduled on the processor determined by the object-affinity hint, that is, on the processor that contains the object specified with the object-affinity hint in local memory. However, within that processor, it is scheduled on the queue given by the object specified in the task-affinity hint, thereby identifying tasks in that task-affinity set. This lets us simultaneously exploit both cache and memory locality on different objects (see the "Abstraction" section). Tasks in the same task-affinity set therefore are mapped onto the same queue and can be serviced back to back. Collisions of different task-affinity sets on the same queue can be minimized by choosing a suitably large array.
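Under the same assumptions as the sketch above, the placement decision can be illustrated as follows; the home_processor helper is a hypothetical stand-in for the runtime's record of where each object's memory was allocated.

```cpp
// Illustrative placement logic; all names and constants are assumptions.
#include <cstddef>
#include <cstdint>

constexpr std::size_t kNumProcessors  = 32;    // e.g., a 32-processor Dash
constexpr std::size_t kAffinityQueues = 64;    // must match the queue array size
constexpr std::size_t kPageSize       = 4096;  // assumed page size

// Hypothetical stand-in for the runtime's page-placement information;
// here pages are simply assumed to be distributed round-robin.
std::size_t home_processor(const void* obj) {
    auto page = reinterpret_cast<std::uintptr_t>(obj) / kPageSize;
    return page % kNumProcessors;
}

struct Placement {
    std::size_t processor;   // which server process receives the task
    std::size_t queue;       // which queue within that server's array
};

// object_hint: object whose home memory should run the task (object affinity).
// task_hint: object identifying the task-affinity set, or nullptr if none.
Placement place_task(const void* object_hint, const void* task_hint) {
    std::size_t proc = home_processor(object_hint);
    // With only object affinity, the object itself names the queue; with a
    // task-affinity hint, every task hinted on that object maps to the same
    // queue and can be serviced back to back.
    const void* key = task_hint ? task_hint : object_hint;
    std::size_t queue =
        reinterpret_cast<std::uintptr_t>(key) % kAffinityQueues;
    return {proc, queue};
}
```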

This implementation of the task-queue structure is very efficient. Determining where to schedule a task simply requires two modulo operations. Within the task-queue structure of each processor, the nonempty queues in the array are linked to form a doubly linked list and provide fast enqueue and dequeue operations.

Performance results

Figure 10 presents the performance of some COOL programs on a 32-processor Dash. We exclude the sequential start-up time in our measurements and plot the speedup relative to the serial code.


Figure 11. Performance improvement with affinity hints for Ocean (194x194 grid, left) and Panel Cholesky (matrix BCSTK33, right): speedup versus number of processors for the Base and Distr+Affinity versions (Distr+Affinity+ClustSteal for Panel Cholesky), with the ideal curve shown for reference.

We present results for several applications (mostly from the Splash4 benchmark suite) - Water, Ocean, Barnes-Hut, Block Cholesky, Panel Cholesky, and LocusRoute. As the figure shows, the speedups reach 24 times on 32 processors. Most programs exhibit good speedup, with the exception of LocusRoute, where performance gains are limited by a high degree of communication of shared data between processors. Overall, the COOL overhead is low: a COOL parallel program on one processor usually runs about 2 to 3 percent slower than the original serial program. Finally, we note that these performance results include optimizations based on affinity hints. We analyze the effect of these optimizations next.

Effect of locality optimizations. To evaluate our approach to improving data locality, we supplied affinity hints in various applications and measured their effect on performance. The hints resulted in significant (and sometimes dramatic) performance improvements in most programs. Here we present results from experiments on a 32-processor Dash for two of the applications, Ocean and Panel Cholesky. On Dash, references that are satisfied in the cache take one processor cycle. Memory references to data in the local cluster memory take nearly 30 cycles, while references to the remote memory of another cluster take 100 to 150 cycles. The hints supplied for Ocean were migrate calls to distribute objects (portions of a grid) and object-affinity hints (for tasks on different grid portions). For Cholesky, we added migrate calls to distribute panels (a collection of columns) and supplied both task- and object-affinity hints (similar to the hints supplied for the Gaussian elimination program shown in Figure 7). Figure 11 shows the performance improvements from using these hints - over 75 percent for Ocean and 135 percent for Cholesky.

We also analyzed the performance improvements on Dash by using a hardware performance monitor that let us monitor the bus and network activity nonintrusively. We plotted the total number of cache misses before and after optimization during the execution of the parallel portion of the program for different numbers of processors (see Figures 12 and 13). Each bar corresponds to the total number of cache misses and is partitioned into two regions. The dark region represents the cache misses serviced in local memory, while the light region corresponds to misses to data in remote memory. In the Ocean code (Figure 12), most misses are to remote memory before optimization (B), but with the optimizations (DA), nearly all (80 to 90 percent) of the cache misses are satisfied in local memory. In Panel Cholesky (Figure 13), the affinity hints (DAC) reduce the total number of cache misses compared with the base version (B). In addition, since the tasks are also colocated with the panels being updated, more misses are serviced in local than in remote memory. (More extensive results can be found in the literature.7)

Figure 12. Cache miss statistics for Ocean, before and after affinity optimizations.

Figure 13. Cache miss statistics for Panel Cholesky, before and after affinity optimizations.

Our experience has shown this approach to be simple and effective for improving COOL program performance. It becomes easy for the programmer to experiment with different optimizations simply by changing the supplied affinity hints. We also hope that further experience with this approach will suggest automatic techniques that could reduce the burden on the programmer.

We have implemented COOL on several shared-memory multiprocessors and have programmed a variety of application programs in the language. Our experience in writing programs in COOL, as well as the performance of the programs on actual multiprocessors, has demonstrated the benefits of data abstraction in expressing concurrency and synchronization, and in exploiting data locality.

Acknowledgments

The implementation of COOL owes a great deal to the efforts of Arul Menezes and Shigeru Urushibara. Greg S. Lindhorst implemented the synchronization optimizations in part. We have benefited from the programming experience of and the feedback from Avneesh Agrawal, Maneesh Agrawala, Denis Bohm, Robert Wilson, and Kaoru Uchida.


We thank Kourosh Gharachorloo and J.P. Singh for many useful discussions. This research has been supported by the Defense Advanced Research Projects Agency under DARPA contract #N00039-91-C-0138. Anoop Gupta is also supported by an NSF Presidential Young Investigator Award.

References

1. B. Stroustrup, The C++ Programming Language, Addison-Wesley, Reading, Mass., 1986.

2. C.A.R. Hoare, "Monitors: An Operating System Structuring Concept," Comm. ACM, Vol. 17, No. 10, Oct. 1974, pp. 549-557.

3. D. Lenoski et al., "The Stanford Dash Multiprocessor," Computer, Vol. 25, No. 3, Mar. 1992, pp. 63-79.

4. J.P. Singh, W.-D. Weber, and A. Gupta, "Splash: Stanford Parallel Applications for Shared Memory," Computer Architecture News, Vol. 20, No. 1, Mar. 1992, pp. 5-44.

5. E. Lusk et al., Portable Programs for Parallel Processors, Holt, Rinehart, and Winston, New York, 1987.

6. E. Rothberg and A. Gupta, "Techniques for Improving the Performance of Sparse Matrix Factorization on Multiprocessor Workstations," Proc. Supercomputing 90, IEEE CS Press, Los Alamitos, Calif., Order No. 2056 (microfiche only), 1990, pp. 232-243.

7. R. Chandra, A. Gupta, and J.L. Hennessy, "Data Locality and Load Balancing in COOL," Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, ACM Press, New York, 1993, pp. 249-259.

Rohit Chandra is employed by the Languages and Compilers group at Silicon Graphics, where he continues to work on multiprocessor systems. His research interests are in developing programming languages, runtime systems, and operating system support for parallel processing. Chandra received a BTech in computer science from the Indian Institute of Technology, Kanpur, India, in 1986, and is a 1994 PhD candidate in computer science at Stanford University. He is a member of the IEEE Computer Society.

Anoop Gupta is an associate professor of computer science and electrical engineering at Stanford University. His research interests are in the design of hardware and software for large-scale multiprocessors. With John L. Hennessy, he was a coleader in the design and construction of the Dash scalable shared-memory multiprocessor at Stanford. Gupta received a BTech in electrical engineering from the Indian Institute of Technology in Delhi in 1980 and a PhD from Carnegie Mellon University in 1986. He won the National Science Foundation's Presidential Young Investigator Award in 1990 and holds the Robert Noyce faculty scholar chair in the School of Engineering at Stanford. He is a member of the IEEE Computer Society.

John L. Hennessy, a member of the faculty at Stanford University since 1977, led the Mips project there and continues research in processor architectures. He cofounded Mips Computer Systems in 1984. Hennessy received a BE in electrical engineering from Villanova University in 1973, and master's and PhD degrees in computer science from the State University of New York at Stony Brook in 1975 and 1977, respectively. He is an IEEE fellow and a member of the IEEE Computer Society.

Readers can contact the authors at the Computer Systems Laboratory, Stanford University, Stanford, CA 94305. Chandra's Internet address is [email protected].
