
RPM: a rapid prototyping engine for multiprocessor systems



Luiz Andre Barroso, Sasan Iman, Jaeheon Jeong, Koray Oner, and Michel Dubois, University of Southern California

Krishnan Ramamurthy, LSI Logic Corporation

RPM enables rapid prototyping of different multiprocessor architectures. It uses hardware emulation for reliable design verification and performance evaluation.


As multiprocessor systems become commonplace in the computer industry, there is growing interest in tools that can evaluate different architectural features as early as possible in the system-development cycle. The RPM (Rapid Prototyping engine for Multiprocessors) project is exploring a rapid-prototyping methodology for multiprocessor systems that is based on hardware emulation. The flexibility of emulation is important, since the design space for multiprocessor systems is arguably much wider than that of uniprocessors.

Most machine designers favor asynchronous MIMD (multiple-instruction, multiple-data) systems, where processors execute their own instructions and run on different clocks. In these systems, processing elements contain a processor, some cache memory, and a share of the system memory, and are connected by a high-speed interconnection, such as a bus or mesh, that facilitates machine packaging (see Figure 1). Although this physical model prevails, there is disagreement in the computer science community about the interprocessor communication mechanism, which is represented by two dominant models. One model is based on disjoint memories and message passing, the other on shared memory. In a message-passing system, processors communicate by exchanging explicit messages through send and receive primitives. In the shared-memory model, they communicate through load and store instructions, and require explicit synchronization to avoid data-access races.1

The shared-memory model facilitates fine-grained (word-level) communication but requires many instructions to transmit large chunks of data, whereas the message-passing model can transmit large amounts of data in one message. For programming ease, the shared-memory model has thus far been the favored transition path from uniprocessors to multiprocessors. On the other hand, message-passing systems are generally perceived as more scalable than shared-memory systems. The growing disparity between processor and communication speeds is a problem in both systems. Message-passing primitives typically suffer from high software overhead, while the large latency of loads and stores on shared data in shared-memory systems usually requires complex shared-memory access mechanisms.

Some researchers advocate private caches2 with hardware- or software-based coherence maintenance (for example, the Stanford Dash3 prototype). The coherence protocol, constraints on memory access ordering,1 cache parameters, and the interconnection latency and bandwidth all affect a multiprocessor's performance and programming ease. Machines like Dash are called cache-coherent nonuniform memory access architectures (CC-NUMAs), distinguished from cache-only memory architectures (COMAs) such as the Data Diffusion Machine.4 A COMA has the same general architecture as the one shown in Figure 1, with communication accomplished through shared variables; however, instead of main memory in each processor node, a COMA contains a huge cache called attraction memory.

Because message passing is more efficient than shared memory for some forms of communication, there is a trend toward integrating the message-passing and shared-memory paradigms to draw on the strengths of each.5 However, effective integration requires comparisons, which are difficult to make and hard to validate without a common hardware platform to implement different models.

Because multiprocessors are complex and powerful, correctness of design and expected performance are very difficult to evaluate before the machine is built. There have traditionally been two approaches to verifying a design: breadboard prototyping and software simulation. A breadboard prototype is costly, takes years to build, and explores only one or a few design points. Software simulation is flexible, but slow when the design is simulated in detail; it also has validity problems, because the target must be considerably abstracted to keep simulation times reasonable. Some industrial projects accomplish detailed, faithful software simulation of a target, but the speed reaches only a few target cycles per second of simulation. Parallelizing a software simulation is an ad hoc procedure that usually exhibits low speedup. Most simulators6 directly execute each target instruction on the host. Because the code (either source or binary) must be instrumented, it is difficult to efficiently simulate the execution of general-purpose workloads other than scientific programs with little or no I/O.

Therefore, the major objective of the RPM project is to develop a common, configurable hardware platform to accurately emulate different MIMD systems with up to eight execution processors. Because emulation is orders of magnitude faster than simulation, an emulator can run problems with large data sets more representative of the workloads for which the target machine is designed. Because an emulation is closer to the target implementation than an abstracted simulation, it can accomplish more reliable performance evaluation and design verification. Finally, an emulator is a real computer with its own I/O; the code running on the emulator is not instrumented. As a result, the emulator "looks" exactly like the target machine (to the programmer) and can run several different workloads, including code from production compilers, operating systems, databases, and software utilities.

THE HARDWARE EMULATION APPROACH

Emulators have been used to experiment with instruction sets, and more recently to validate complex VLSI circuits in the presilicon stage. But we know of no attempt to use this approach to design and verify multiprocessor systems.

Several technologies, including field-programmable gate arrays (FPGAs), and efficient computer-aided design (CAD) tools are converging, making it possible to build and program flexible multiprocessor emulators. FPGAs are high-density ASIC devices that users can program in-circuit via software.7 Sophisticated design-automation tools greatly facilitate FPGA programming, letting designers use high-level description languages (HDLs) such as VHDL and Verilog to specify their designs at the register transfer level (RTL). Such tools enable more complex designs in a much shorter time frame.

RPM is made of mostly off-the-shelf components, except for its cache, memory, coherence, and communication controllers, which are implemented with FPGAs. It emulates a particular machine model via these FPGAs and part of the memory to which they are attached. The RPM approach enables the emulation of an entire multiprocessor system at low cost.

Figure 1. Physical organization of a typical MIMD system: processing elements, each holding a share of the system memory, connected by an interconnection.

The clock rate is 10 MHz, which is about 10 times slower than the rate permitted by current board technologies and programmable logic devices (PLDs). This compromise in emulation efficiency results from two trade-offs. First, the PC boards' design and fabrication are greatly simplified. Second, the lower clock rate facilitates the configuration of the FPGAs, which are slower than other PLDs or custom circuits, especially when they are programmed with VHDL synthesizers; clocking them at a lower speed promotes the mapping of more complex circuits. Additionally, to further simplify the design and maintain the flexibility to emulate complex mechanisms, each processor clock (pclock) is emulated in several consecutive system clocks. So the overall processing speed is 10 MHz divided by the number of clocks per pclock. This relatively low rate lets RPM use a standard interconnection fabric yet still have enough bandwidth to emulate useful interconnections.

We preserve timing in RPM by scaling down the speed of all components (time scaling). Performance data is collected with a set of counters stored in each memory (count memory).

RPM ARCHITECTURE

RPM is geared to evaluate multiprocessors with the general architecture shown in Figure 1. Possible target interconnections are limited to FIFO (first-in, first-out) networks with uniform access latencies, such as crossbars or buses; other interconnections, such as rings, can be modeled only approximately. (In a FIFO interconnection, messages sent between nodes reach the destination in the same order they are sent.)

Hardware organization

Figure 2 shows the RPM emulator, packaged in a standard Mupac Futurebus+ cage. Each board is a 10-layer printed-circuit board measuring 22 inches by 18 inches.


Figure 2. RPM: Nine processor boards connected to a standard Futurebus+ backplane.

Figure 3 depicts the hardware organization. Eight identical boards, each with one Sparc processor, are connected to a 64-bit-wide (data) Futurebus+, which is used for transmitting packets among processors as well as for data broadcasting and interprocessor interrupts. An I/O processor board identical to the other boards connects the system with a Sun SparcStation 2 through a SCSI interface. This workstation serves as the console for RPM and executes its I/O requests. (For more details on RPM's hardware design, see Oner et al.8)

Figure 3. Overall configuration of RPM: the host (a Sun SparcStation 2) connects through the SCSI interface to the I/O processor board; the execution processor boards, with their local memories, communicate over the Futurebus+.

With a clock rate of 10 MHz and a pclock of eight clocks, the emulation rate is 1.25 million cycles per second of the target machine, or at most 10 million instructions per second (MIPS) over the eight execution processors. The peak I/O bandwidth is 1.25 Mbytes per second.

The board architecture is shown in Figure 4. Each board has an LSI Logic L64831 Sparc IU/FPU, which can reach 40 MHz and executes both integer and floating-point instructions. It has no on-chip cache, so all instruction fetches and data accesses are visible on the chip's pins.

Each board contains three memories, each controlled by a set of Xilinx FPGAs, as follows.

MC1/RAM1. Each processor is attached to 2 Mbytes of static RAM (RAM1) controlled by two Xilinx XC4013 FPGAs, called MC1. MC1 controls the processor's cycle-by-cycle execution by blocking the processor at the beginning of each access, executing the sequence of steps needed to satisfy the access, and unblocking the processor when the access is completed. It also manages RAM1 as a cache and interacts with the second-level memory. MC1 implements extensions to the Sparc instruction set, such as shared and exclusive prefetches. These extensions are possible through Sparc's alternate space identifiers (ASIs) or through unused address bits. Finally, MC1 can remap addresses in many ways, including full virtual-to-physical translation through a translation lookaside buffer (TLB).
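As a concrete illustration of how such extensions can ride on ordinary loads and stores, the sketch below decodes an access into a prefetch command from either its ASI value or an unused address bit. The ASI values and the bit position are made-up placeholders, not RPM's actual encoding.

```python
# Hypothetical sketch of decoding Sparc accesses, as seen on the pins of a
# cacheless CPU, into extended operations (shared/exclusive prefetch) using the
# alternate space identifier (ASI) or an unused high address bit. The ASI
# values and bit position below are illustrative, not RPM's encoding.

ASI_PREFETCH_SHARED = 0x40     # assumed ASI value for a shared prefetch
ASI_PREFETCH_EXCLUSIVE = 0x41  # assumed ASI value for an exclusive prefetch

def decode_access(address: int, asi: int, is_store: bool) -> str:
    """Classify a processor access into a memory-controller command."""
    if asi == ASI_PREFETCH_SHARED:
        return "prefetch-shared"
    if asi == ASI_PREFETCH_EXCLUSIVE:
        return "prefetch-exclusive"
    # Alternative: steal an unused address bit (bit 31 here, purely illustrative)
    if address & (1 << 31):
        return "prefetch-exclusive" if is_store else "prefetch-shared"
    return "store" if is_store else "load"

print(decode_access(0x0000_1000, ASI_PREFETCH_SHARED, False))  # prefetch-shared
print(decode_access(0x8000_2000, 0x0B, True))                  # prefetch-exclusive
```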

MC2/RAM2. An additional 8 Mbytes of static RAM (RAM2) controlled by three Xilinx XC4013 FPGAs, called MC2, comprise the second-level memory. MC2 interfaces the processor to the rest of the system and usually acts as a second-level cache controller.

MC3/RAM3. System memory is emulated by 96 Mbytes of dynamic RAM (RAM3) controlled by MC3, which includes two Xilinx XC4013 FPGAs and one Cypress CYM7232 DRAM controller.

MC3 supports the virtual interleaving of on-board memory. This interleaving effect is achieved through time multiplexing of the memory controller. Virtual interleaving of memory relies on the many cycles available in the emulator for memory transactions. It is supported by eight interleaving registers (for up to eight interleaved banks) along with some buffer space in RAM3. The interleaving register corresponding to a memory bank contains a counter that is decremented at every pclock. When the counter reaches zero, the memory bank is free. A request to a busy bank is queued at the controller.

A typical RPM memory transaction has three phases: prelude, suspension, and completion. During prelude, the packet is received and decoded. RPM then suspends the request by storing a transaction completion record, which contains all the information needed to resume and complete the request, in a memory location associated with the virtual memory bank, and by filling the interleaving register with a value in pclocks equal to the suspension time. While the request is suspended, accesses to a different virtual memory bank can be serviced. When the interleaving register reaches zero, the memory controller is interrupted; the controller then fetches the transaction completion record in memory and completes the transaction by sending some messages (completion phase), if necessary.
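The following minimal sketch models this suspend-and-resume scheme in software; the bank mapping, the single per-request suspension value, and the data structures are simplifying assumptions for illustration, not RPM's hardware.

```python
# Minimal model of virtual memory interleaving: one physical controller emulates
# n banks by suspending requests (transaction completion records) and counting
# down one interleaving register per bank every pclock.

from collections import deque

class VirtualInterleavedMemory:
    def __init__(self, num_banks: int, suspension_pclocks: int):
        self.num_banks = num_banks
        self.suspension = suspension_pclocks
        self.counters = [0] * num_banks            # interleaving registers
        self.pending = [None] * num_banks          # transaction completion records
        self.queues = [deque() for _ in range(num_banks)]
        self.completed = []

    def bank_of(self, block_addr: int) -> int:
        return block_addr % self.num_banks

    def request(self, block_addr: int):
        bank = self.bank_of(block_addr)
        if self.counters[bank] > 0:                # bank busy: queue the request
            self.queues[bank].append(block_addr)
        else:                                      # suspend: store record, load counter
            self.pending[bank] = block_addr
            self.counters[bank] = self.suspension

    def tick(self):                                # called once per emulated pclock
        for bank in range(self.num_banks):
            if self.counters[bank] > 0:
                self.counters[bank] -= 1
                if self.counters[bank] == 0:       # "interrupt": complete the transaction
                    self.completed.append(self.pending[bank])
                    self.pending[bank] = None
                    if self.queues[bank]:          # start a queued request, if any
                        self.request(self.queues[bank].popleft())

mem = VirtualInterleavedMemory(num_banks=2, suspension_pclocks=14)
mem.request(0x10)    # bank 0
mem.request(0x11)    # bank 1 can be serviced while bank 0 is suspended
for _ in range(14):
    mem.tick()
print(mem.completed)  # both blocks complete after 14 pclocks: [16, 17]
```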

The internal bus is synchronous, with a protocol similar to that of the Sun Microsystems Mbus. This 32-bit-wide, packet-switched bus transfers packets that range from 16 to 128 bytes. Controllers MC2 and MC3 connect to the internal bus through very large, two-way FIFO buffers, which prevent deadlocks and relieve the controllers from managing data transfers. All on-board data paths are 32 bits wide.

The delay unit (DU) is a programmable controller that emulates variable interconnection delays. It contains a FIFO controlled by one Advanced Micro Devices (AMD) Mach 210 chip. The 16-Kbyte FIFO contains blocks and messages that are sent to the bus interface after a programmable delay, which depends on the target machine's interconnect latencies and packet size. This delay is computed by the formula

Latency = T_start + (number of words in packet) * T_word
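A small sketch of this delay computation follows; the parameter names and example values are illustrative, and the name of the fixed startup term is an assumption.

```python
# Sketch of the delay unit's programmable latency: a fixed startup term plus a
# per-word term, both expressed in emulated pclocks. Names and values assumed.

def interconnect_delay(words_in_packet: int, t_start: float, t_word: float) -> float:
    """Latency = T_start + (number of words in packet) * T_word."""
    return t_start + words_in_packet * t_word

# Example: an assumed 4-pclock startup and 1 pclock per 32-bit word,
# for a 16-byte (four-word) packet.
print(interconnect_delay(4, t_start=4.0, t_word=1.0))   # 8.0 pclocks
```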

Figure 4. Block diagram of a processor node: the CPU, the three memory controllers (MC) with their RAMs, the FIFOs, the delay unit, the SCSI interface (SI), the internal bus, and the Futurebus+ interface.

The Futurebus+ interface contains off-the-shelf chipsets. It includes bus transceivers, a bus protocol controller chip from Newbridge, and a distributed arbiter chip from National Semiconductor.

Emulation of CC-NUMAs with central directory

Our first emulator has hardware-enforced cache coherence with strong ordering of memory accesses to enforce sequential consistency.1 In this emulator, MC1/RAM1 is a first-level write-through cache containing both data and instructions, MC2/RAM2 is a second-level write-back cache, and MC3/RAM3 is the main memory. The protocol is directory-based, and each memory block has a home node where a directory records the presence and state of copies in every cache. The memory and cache directories have pending states, so that transactions for different blocks are executed concurrently. The protocol is a pure write-invalidate protocol, but write-update and competitive-update protocols (as described in Dahlgren, Dubois, and Stenstrom9) can easily be added.
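A textbook-style sketch of the directory bookkeeping described above is shown below; the state names, message names, and the omission of pending states are simplifications, not RPM's exact protocol.

```python
# Illustrative write-invalidate directory entry kept at a block's home node.
# State and message names are generic; the real protocol also has pending
# states so that transactions for different blocks can proceed concurrently.

from dataclasses import dataclass, field

@dataclass
class DirectoryEntry:
    state: str = "uncached"                    # uncached, shared, or exclusive
    sharers: set = field(default_factory=set)  # IDs of caches holding a copy

    def read_miss(self, node: int) -> list:
        msgs = []
        if self.state == "exclusive":          # dirty copy elsewhere: recall it
            owner = next(iter(self.sharers))
            msgs.append(("writeback-request", owner))
        self.sharers.add(node)
        self.state = "shared"
        msgs.append(("data-reply", node))
        return msgs

    def write_miss(self, node: int) -> list:
        # Pure write-invalidate: every other copy is invalidated before the
        # requester is granted exclusive ownership.
        msgs = [("invalidate", n) for n in self.sharers if n != node]
        self.sharers = {node}
        self.state = "exclusive"
        msgs.append(("data-reply-exclusive", node))
        return msgs

entry = DirectoryEntry()
print(entry.read_miss(0))    # [('data-reply', 0)]
print(entry.read_miss(2))    # [('data-reply', 2)]
print(entry.write_miss(1))   # invalidations to 0 and 2, then exclusive reply to 1
```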

The static RAM (SRAM) that implements the first-level cache is divided into five parts: the data memory (up to 1 Mbyte), the cache directory, the TLB needed for virtual-memory support, space to emulate prefetch and write buffers, and space for performance statistics. The controller is partitioned across two FPGAs, one for the control unit and the other for the data unit. The first-level cache is currently write-through and direct-mapped with a 16-byte block size. A processor-issued store always propagates to the second-level cache. The first-level cache and the processor both block on every write access that misses or requires coherence activity in the second-level cache.

A typical data-read cycle in the first-level cache involves receiving an address from the processor, translating the address in TLB space, accessing cache directory space, retrieving the data from cache data space, fetching the counter for that event in count memory, updating the counter, and returning the data to the processor.
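A schematic software walk-through of that read cycle follows; dictionaries stand in for the TLB, directory, data, and count-memory regions of RAM1, and the page and block sizes are assumptions.

```python
# Schematic model of the first-level data-read cycle: TLB translation, cache
# directory lookup, data access, and a count-memory update. Sizes and layout
# are illustrative, not RPM's actual RAM1 partitioning.

PAGE_SIZE = 4096
BLOCK_SIZE = 16

def l1_read(vaddr, tlb, directory, data_mem, count_mem):
    ppage = tlb[vaddr // PAGE_SIZE]                       # translate in TLB space
    paddr = ppage * PAGE_SIZE + vaddr % PAGE_SIZE
    block = paddr // BLOCK_SIZE
    hit = directory.get(block) == "valid"                 # access cache directory space
    value = data_mem.get(block) if hit else None          # retrieve data on a hit
    event = "read-hit" if hit else "read-miss"
    count_mem[event] = count_mem.get(event, 0) + 1        # fetch and update the counter
    return value, hit                                     # return data to the processor

tlb = {0: 5}                                              # virtual page 0 -> physical page 5
directory = {5 * PAGE_SIZE // BLOCK_SIZE: "valid"}
data_mem = {5 * PAGE_SIZE // BLOCK_SIZE: "16-byte block"}
count_mem = {}
print(l1_read(0x0, tlb, directory, data_mem, count_mem))   # ('16-byte block', True)
print(l1_read(0x40, tlb, directory, data_mem, count_mem))  # (None, False)
print(count_mem)                                           # {'read-hit': 1, 'read-miss': 1}
```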

MC2/RAM2 implements the second-level cache. The current configuration is a two-way, set-associative, write-back cache with a 16-byte block size. Half of RAM2 (up to 4 Mbytes) is for the second-level cache data memory. The other half is dedicated to the cache directory, various buffers, and performance counters. The second-level cache controller is by far RPM's most complex controller. It includes three FPGAs:

• the data unit, which contains the hardware resources needed by the controller, including the buses to the RAM;
• the control unit, which implements basic cache control functions; and
• the consistency unit, which is currently unused and is reserved for implementing sophisticated memory-consistency models.

The second-level cache supports a virtually unlimited number of pending prefetches. The compiler issues these nonblocking accesses to direct the second-level cache to prefetch cache blocks before they are needed.

MC3/RAM3 implements the main memory. There are 64 Mbytes devoted to the data memory, 16 Mbytes to the directory (one 32-bit word per 16-byte block), 8 Mbytes for performance counters, and 8 Mbytes to emulate various hardware mechanisms (including virtual interleaving of memory).

Emulation of other architectures

RPM's generic board architecture allows fully detailed emulation of highly complex target multiprocessors with very different architectures, provided they fit the generic block diagram of Figure 1. Examples include

• CC-NUMA-CDs (CC-NUMAs with centralized directories) under weakly ordered memory models. Weak ordering of memory accesses1 may involve different buffering schemes, such as write buffers between the first- and second-level caches and between the second-level cache and the internal bus. A write cache can be used to track partially modified blocks. Prefetching hardware can also be added.
• CC-NUMA-DDs (CC-NUMAs with distributed directories). Instead of a centralized directory located at the home memory, the directory is distributed by linking the caches containing a copy of the memory block through hardware pointers in each cache entry. The Scalable Coherent Interface (SCI) standard10 has adopted this scheme.
• COMAs (cache-only memory architectures).4 The system memory (DRAM) is replaced by a huge cache (also called attraction memory). The memory directory then becomes the attraction memory state. MC1/RAM1 and/or MC2/RAM2 may be configured as caches.
• MPs (message-passing systems). In this configuration, MC3/RAM3 acts as the processor's local (private) memory, MC2/RAM2 acts as a message-passing controller (MPC), and MC1/RAM1 acts as the processor cache. The MPC buffers the messages to or from the processor, formats outgoing packets according to the target protocol, decodes the received messages, and, if needed, interrupts the processor when messages are received.
• Mixed shared-memory and message-passing systems. Every shared-memory scheme can be augmented with a message-passing facility for bulk data transfers among processors (as was done in the Alewife project, described in Kranz et al.5).
• Virtual shared memory. In a distributed, message-passing system, shared memory can be implemented through virtual-memory mechanisms. Hardware support is often needed for efficiency,11 and RPM is an ideal vehicle to experiment with such hardware.

EMULATION METHODOLOGY

The success of the RPM project is partially due to several novel approaches for measuring time and collecting performance data.

Keeping track of time

Since the speeds of the hardware emulation and the target differ, timing measured on the emulator must be related to the timing in the target machine. Rather than keeping track of simulated time through event-driven mechanisms and time stamping (as software simulators do), RPM scales time. Time scaling preserves the relative timing of components in the emulator and target, and simple scaling arguments yield absolute times in the target. For example, a system with processors running at 100 MHz and average memory latencies of 100 nanoseconds has the same processor utilization as a system with the same architecture and processors running at 1 MHz with average memory latencies of 10 microseconds. With time scaling, it is not necessary to build the emulator with the fastest, most up-to-date technology, which allows cost reduction and an extended lifetime.
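A quick numerical check of this scaling argument, using the two systems from the example above (the underlying assumption being that the workload only sees latency measured in processor cycles):

```python
# Time scaling in a nutshell: only latency expressed in processor cycles matters,
# so the two systems from the example above are indistinguishable to the workload.

def latency_in_cycles(clock_hz: float, latency_seconds: float) -> float:
    return latency_seconds * clock_hz

fast = latency_in_cycles(100e6, 100e-9)   # 100 MHz processor, 100 ns memory latency
slow = latency_in_cycles(1e6, 10e-6)      # 1 MHz processor, 10 microsecond latency
print(fast, slow)                         # 10.0 10.0 -> same processor utilization
```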

Every component (interconnect, cache, memory, and I/O processor) is characterized by two fundamental performance measures: latency and bandwidth. These two measures can be independent. For example, two networks can have the same latency, but one may have more bandwidth because it has more links; similarly, memory bandwidth can be increased (while its latency remains the same) through interleaving.

The pclock, the processor's clock period, is a convenient unit for all timing. If all component latencies are expressed in terms of pclocks and all component bandwidths are expressed in terms of bytes per pclock, systems with equal component latencies and bandwidths are equivalent.

Currently, RPM's pclock is eight cycles, during which the emulator must simulate all activities that occur in one pclock of the target system. RPM simulates variable latencies by delaying requests, and a given resource's variable bandwidth by varying the number of cycles during which each request uses that resource.

This adjustment's flexibility varies with the resource. All RPM on-board data paths are 32 bits wide, and the system bus (Futurebus+) is 64 bits wide. RPM can emulate one data-path cycle of 64, 128, or even 256 bits with two, four, or eight cycles, respectively, of its 32-bit data path. We do not adjust latencies and bandwidths for the first-level cache and internal bus: The implicit assumption is that, in any target system, these latencies and bandwidths will scale up proportionally with the processor speed. The DU (see Figure 4) simulates variable latencies in the interconnection (emulated by the Futurebus+). As configured, the available Futurebus+ bandwidth is very large, in effect implementing an interconnection with infinite bandwidth. To run experiments under limited interconnection bandwidth, we can reduce the bus width (by reprogramming the Futurebus+ controllers) and artificially increase each packet size. To increase the interconnection bandwidth, we can double or even quadruple the number of clocks per pclock.

To illustrate time scaling and virtual interleaving of memory, we consider the simple case of a miss in the second-level cache, whereby the block address is mapped to the local on-board memory (local miss) and the block is uncached elsewhere. The memory can deliver n blocks every T_m pclocks, where T_m is the latency of the miss and n is the degree of memory interleaving.

RPM has a monolithic memory controller that must emulate the operations of the target system's n parallel memory controllers. The memory transaction in RPM must keep MC3 busy for T_m/n pclocks by idling it before suspending the miss request. On the other hand, the total miss latency in RPM must be T_m pclocks, the same as in the target system.

Figure 5 illustrates the phases of a memory access in RPM, where T_p, T_i, T_s, and T_c represent the prelude, idle, suspension, and completion times, respectively, of the miss request. The memory controller is busy for the entire access period except during T_s, when an access to another memory bank can be accepted by the controller. T_p and T_c are known from the emulation. T_i and T_s can be found from the following constraints:

T_m = T_p + T_i + T_s + T_c

T_m/n = T_p + T_i + T_c

The first equation enforces the same latency in RPM and the target, whereas the second equation enforces the same utilization of the controller in RPM and the target. The unknowns T_i and T_s are then given by

T_i = T_m/n - T_p - T_c  and  T_s = T_m(n - 1)/n

As an example, consider a target with a 5-ns pclock, where a memory access to a 16-byte block takes 140 ns. Because RPM's pclock is 800 ns, the target's 140-ns block access time translates to 140/5 = 28 pclocks, or 28 * 8 = 224 clocks in RPM. During this large number of cycles, the memory controller can emulate highly complex directory mechanisms. Now assume that, based on the emulator implementation, the prelude time and the completion time for a block access in memory are each 30 clocks. If the memory is two-way interleaved, the idle time is 224/2 - 30 - 30 = 52 clocks, and the suspension time is 224/2 = 112 clocks.
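The arithmetic of this example follows directly from the two constraints; a minimal sketch:

```python
# Reproducing the worked example from the two constraints above: equal miss
# latency and equal controller utilization in RPM and the target.

CLOCKS_PER_PCLOCK = 8

def interleaving_times(t_m: int, n: int, t_p: int, t_c: int):
    """t_m is the miss latency in emulator clocks; n the degree of interleaving."""
    t_s = t_m * (n - 1) // n          # suspension time (controller free)
    t_i = t_m // n - t_p - t_c        # idle time (controller busy but waiting)
    return t_i, t_s

target_pclock_ns, block_access_ns = 5, 140
t_m = (block_access_ns // target_pclock_ns) * CLOCKS_PER_PCLOCK   # 28 pclocks = 224 clocks
print(t_m, interleaving_times(t_m, n=2, t_p=30, t_c=30))          # 224 (52, 112)
```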

Timing is typically tighter in the second-level cache, which is usually built in the target with SRAMs. For example, a block transfer from the second- to the first-level cache that takes 4 pclocks must be emulated in 32 clocks. As the block size increases, the number of extra cycles available for emulation also increases. The second-level cache's response time can be adjusted if dummy cycles are added to each VHDL controller's hardware sequence.

I/O scaling in RPM is automatic. When we simulate target systems with different processor speeds, we do not adjust I/O bandwidth, because RPM always runs at the same speed (10 MIPS peak) regardless of the target speed, provided one pclock equals eight clocks. I/O latency, however, must be scaled. The service of I/O requests must be delayed as faster processors are emulated. This is accomplished in software with the support of an interrupt timer.


Collecting performance data

The primary mechanism to collect performance data involves event counters stored in a special area of each memory called count memory. Counters that track mutually exclusive events are updated in all three on-board memories whenever a transaction is completed in the controller. Event counter addresses are formed automatically by the merging of signals corresponding to basic events. These events are mutually exclusive, so that in each memory, only one counter is updated at a time. Each memory has thousands of counters and can therefore count thousands of event types. At the end of the emulation run, the counters are uploaded and postprocessed (basically, they are selectively added together) to obtain the required performance data. This counting mechanism can be started and stopped through software.

Table 1 shows how count memory addresses are generated for mutually exclusive events in the first-level cache. In this simple example, there are three signals (programmed in MC1), each corresponding to one property of a first-level cache access. One signal indicates whether the access is to private or shared data, the second is the processor's read/write signal, and the third is high when the first-level cache access hits and low when it misses. The combination of these signals forms a three-bit address, which can be used to address a counter in the part of RAM1 allocated to count memory. This example includes only eight addresses, but up to 20 signals can actually be defined in each controller. For instance, one signal could distinguish between instructions and data, and the instruction opcode could also be used as part of the performance-counter address.

Table 1. Generation of addresses for event counting in the first-level cache's count memory.

  Private/shared  Read/write  Hit/miss    Basic event
  0               0           0           Shared-write-miss
  0               0           1           Shared-write-hit
  0               1           0           Shared-read-miss
  0               1           1           Shared-read-hit
  1               0           0           Private-write-miss
  1               0           1           Private-write-hit
  1               1           0           Private-read-miss
  1               1           1           Private-read-hit
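In software terms, the address formation amounts to concatenating the event signals into an index; the sketch below follows the bit order of Table 1 but invents a count-memory base address and counter width for the example.

```python
# Sketch of how mutually exclusive event signals could be concatenated into a
# count-memory address. Bit positions follow Table 1; the count-memory base
# address and the 32-bit counter width are assumptions for the example.

COUNT_MEM_BASE = 0x1F0000   # assumed offset of count memory inside RAM1

def counter_address(private: int, read: int, hit: int) -> int:
    event_index = (private << 2) | (read << 1) | hit    # 3-bit event number
    return COUNT_MEM_BASE + 4 * event_index             # one 32-bit counter per event

count_memory = {}

def record_event(private: int, read: int, hit: int):
    addr = counter_address(private, read, hit)
    count_memory[addr] = count_memory.get(addr, 0) + 1  # single read-modify-write

record_event(private=0, read=1, hit=0)   # shared-read-miss
record_event(private=1, read=0, hit=1)   # private-write-hit
print({hex(a): c for a, c in count_memory.items()})
```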

Several other special-purpose monitoring strategies can be implemented at all levels of the memory hierarchy. They can supplement or replace the basic count memory mechanism, depending on the amount of hardware required.

Figure 6. Three dimensions of RPM flexibility: system architecture (CC-NUMA with a centralized or distributed directory, COMA, message passing), hardware mechanisms (instruction set extensions, cache protocol, directory structure, write buffers, write cache, prefetch), and hardware parameters. (COMA: cache-only memory architecture; CC-NUMA: cache-coherent nonuniform memory access architecture.)

Programming RPM

We program RPM by mapping the target's controllers into the FPGAs. Currently, we do not have tools to automatically partition a design across multiple FPGAs. The data path and the RTL description of each controller are specified separately in VHDL. VHDL descriptions specify design parameters, such as block and cache sizes, latencies, and bandwidth, as constants.

In the current approach, modifying each FPGA's functionality requires describing and specifying each controller's RTL description in VHDL. In reality, however, many common functions are shared by all the possible designs for a given FPGA. Nevertheless, the task of reprogramming the FPGAs is error-prone and increases the turnaround time for emulating different machines. But we expect turnaround time for this design stage to be reduced as behavioral compilers become more widely available.

A behavioral compiler is a high-level synthesis tool that creates hardware from a circuit's algorithmic description. This algorithmic description can be derived directly from each protocol's high-level description. (In the short term, we will develop libraries of parameterized designs for every possible configuration of each FPGA.)

Figure 6 illustrates the three dimensions of flexibility in RPM: hardware parameters, hardware mechanisms, and system architecture. We can change hardware parameters such as cache size, block size, or latency by modifying constants in the VHDL programs, which must then be recompiled. (The only exception is the interconnection latency and bandwidth, which are adjusted at system-configuration time.) Within a class of systems, changing basic hardware mechanisms, such as the cache protocol, requires only moderate reprogramming. But designing the FPGA programs for different architectures is a major endeavor.

RPM PERFORMANCE

Table 2 shows the slowdown factor (the speed ratio between the target and the emulator) for various uniprocessor technologies in the targets. Our emulation approach can predict these slowdown factors accurately. They are independent of the number of processors (from two to eight), since this number is the same in RPM and the target. Table 2 also shows the time taken by RPM to execute the equivalent of one second or one minute of the target system. We can reasonably expect to obtain experimental points for realistic workloads on systems with processor speeds of up to 1 GIPS (billion instructions per second). Some of these experiments would take months to run on a software simulator, and the simulation would have to be significantly simplified and abstracted.

Table 2. Slowdown factors between target and RPM.

  Target uniprocessor speed                   50 MIPS    100 MIPS        200 MIPS        500 MIPS        1 GIPS
  Slowdown                                    40         80              160             400             800
  Time on RPM per second of target execution  40 sec.    1 min. 20 sec.  2 min. 40 sec.  6 min. 40 sec.  13 min. 20 sec.
  Time on RPM per minute of target execution  40 min.    1 hr. 20 min.   2 hr. 40 min.   6 hr. 40 min.   13 hr. 20 min.
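The entries in Table 2 follow from simple arithmetic on RPM's peak emulation rate; a minimal check, assuming roughly one target instruction per target cycle:

```python
# The slowdown factors in Table 2 follow from RPM's emulation rate of
# 1.25 million target cycles per second (10 MHz / 8 clocks per pclock),
# assuming roughly one target instruction per target cycle.

EMULATED_CYCLES_PER_SEC = 10e6 / 8          # 1.25 million target cycles per second

def slowdown(target_mips: float) -> float:
    return target_mips * 1e6 / EMULATED_CYCLES_PER_SEC

for mips in (50, 100, 200, 500, 1000):
    s = slowdown(mips)
    print(f"{mips:>5} MIPS target: slowdown {s:.0f}, "
          f"{s:.0f} s of RPM time per target second")
```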

COMPARISON WITH OTHER APPROACHES

Recent breakthroughs in simulation methodology enable efficient, detailed, and flexible evaluations. In an execution-driven simulator such as Tango,6 each instruction runs directly on the host machine, and only some access types generate simulation events. The processors and the memory-system components are simulated as processes or threads. Each event requires a context switch. Instrumentation code is added at basic block boundaries and at "sensitive" data accesses (usually shared-data accesses and synchronization primitives). By contrast, program-driven simulators such as Cache-Mire12 interpret each instruction in software.

Software simulation, whether execution- or program-driven, is slow. For a target multiprocessor with a complex memory model, a simulator such as Tango or Cache-Mire can execute about 10,000 instructions per second on a 50-MIPS workstation. This low performance is due to

• overhead from event scheduling (such as context switching, event-list management, or activity scanning);
• either code expansion for instrumentation to track target-instruction execution times in execution-driven simulators (reported to be between a factor of 2 and 3 by Davis, Goldschmidt, and Hennessy6), or the overhead for decoding and executing instructions in program-driven simulations;
• management of time stamps associated with events;
• the collection of performance data;
• the semantic gap between hardware mechanisms and their execution on the simulator (one cycle in target hardware for each basic activity requires several instructions for simulation on a workstation); and
• target multiprocessor speedup.

In addition, to keep simulation times reasonable, the workload's data set sizes must be drastically reduced. Observations made on the small data sets must then be extrapolated to the workload with the actual data set size, a difficult task that has never really been validated.

A common drawback of all simulations is their abstracting of target multiprocessor behavior. Many effects are considered negligible and are ignored or approximated. In some cases, key hardware components and physical events are totally removed from the simulation. For example, a simulator often avoids simulating the caches and data transfers among the various caches.

To compare RPM speedups with respect to simulators such as Cache-Mire or Tango, we simulate the target and measure the target simulation and execution times. Based on such measurements, we have observed speedups between 100 and 1,000 with respect to execution-driven simulations. However, RPM is actually closer to a cycle-by-cycle, RTL simulation in its level of implementation detail. In practical cases, such simulations run at a few cycles per second, and RPM speedups with respect to these detailed simulations can reach one million.

Another approach uses breadboard prototypes to validate or explore new architectures. Breadboard prototypes are fast and faithful to the target, but their cost is high and they provide little research information. In an academic environment, building a breadboard prototype can be valuable for students and can help direct simulation experiments. But in most projects, the majority of research results are still derived through simulations.

STATUS AND FUTURE PLANS

The hardware is running, and our first emulator (a CC-NUMA-CD under sequential consistency) is in the late debugging phase. We are replacing the Xilinx XC4013 in the controllers with the XC4025 (the XC4025 is pin-to-pin compatible with the XC4013). This upgrade will double the number of equivalent gates in each controller. We are also developing FPGA programs for the SCI protocol, for a CC-NUMA with release consistency, and for a COMA. We are upgrading RPM's I/O capability with a 2-Gbyte disk connected to each processor node, which will yield 16 Gbytes of total disk space and enable RPM to run applications with distributed I/O. On the software side, we are porting some of the Splash benchmarks to the first emulator. And we are planning to port an operating-system kernel and a database engine to run database applications.

RPM EMULATION IS VERY COST-EFFECTIVE for prototyping multiprocessors, and in this sense, it is an extremely valuable research tool. Each system architecture mapped onto RPM is an actual system design, which a graduate student can complete in a few months by reusing the same hardware platform and thus concentrating on the essential design parts.

Nevertheless, the current emulator also has some limitations. First, it has a relatively low pclock rate. (We were somewhat conservative in our design because of our lack of experience with FPGAs and FPGA CAD tools.) The low pclock rate and simple board-level design facilitated RPM construction in an academic environment. The total time for design and construction was only 15 months. A second limitation is the small number of processors, due to our limited budget. Unfortunately, we have no technical solution to this problem: To investigate larger machines, we would have to build an emulator with more processors.

The RPM approach to hardware emulation differs from other FPGA-based emulators, where circuits are emulated by FPGA arrays. Because of cost constraints and performance concerns, we restrict FPGA use to the controllers in the memory hierarchy. While this approach makes it feasible to emulate an entire multiprocessor system, it is somewhat limited in that RPM can only emulate machines with the overall organization shown in Figure 1. In the future, programmable interconnection technology might allow more flexibility in emulators. Like crossbar switches, field-programmable interconnection components (FPICs) at the chip level would let multiple input signals be directed to multiple output pins.

Finally, as with any hardware, an emulator's efficiency advantage over simulation erodes every year, as faster workstations and PCs are introduced. Nevertheless, given RPM's current speed, we expect that it will remain competitive with software simulators for at least 10 years. Moreover, there are several ways to improve the emulator's speed. With current CAD tools and FPGA technologies, a more aggressive design could raise the clock speed to 20 MHz. We could also cut the number of clock cycles in each pclock in half by using a processor with an on-chip instruction cache to avoid emulating instruction fetches. Such an emulator would emulate 5 million target cycles per second, and large speedups over software simulation could be expected for large-scale emulators with 128 to 256 processors.

Acknowledgments

The National Science Foundation funded this project through Grant MIP-9223812. Besides the authors, several individuals contributed to the design of RPM. In particular, we thank Per Stenstrom at Lund University (Sweden); Massoud Pedram at EE-Systems, University of Southern California; and Jacqueline Chame, also at USC. Through grants, gifts, or rebates, several companies have helped reduce the cost of the hardware and software needed for this project. These companies are Advanced Micro Devices, Synopsys, Viewlogic, Axil Workstations, and Xilinx. Finally, we thank John Granacki from the Information Sciences Institute for offering the services of Ezfab, which is part of the Systems Assembly Project sponsored by ARPA.

References

1. M. Dubois, C. Scheurich, and F.A. Briggs, "Synchronization, Coherence, and Event Ordering in Multiprocessors," Computer, Vol. 21, No. 2, Feb. 1988, pp. 9-21.
2. P. Stenstrom, "A Survey of Cache Coherence Schemes for Multiprocessors," Computer, Vol. 23, No. 6, June 1990, pp. 12-24.
3. D. Lenoski et al., "The Stanford Dash Multiprocessor," Computer, Vol. 25, No. 3, Mar. 1992, pp. 63-79.
4. E. Hagersten, A. Landin, and S. Haridi, "DDM-A Cache-Only Memory Architecture," Computer, Vol. 25, No. 9, Sept. 1992, pp. 44-54.
5. D. Kranz et al., "Integrating Message Passing and Shared Memory: Early Experience," Proc. Fourth ACM SIGPlan Symp. Principles and Practice of Parallel Programming, ACM Press, New York, 1993, pp. 54-63.
6. H. Davis, S.R. Goldschmidt, and J. Hennessy, "Multiprocessor Simulation and Tracing Using Tango," Proc. Int'l Conf. Parallel Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2355, Vol. 2, 1991, pp. 99-107.
7. S. Trimberger, "A Reprogrammable Gate Array and Applications," Proc. IEEE, Vol. 81, No. 7, 1993, pp. 1030-1041.
8. K. Oner et al., "The Design of RPM: An FPGA-Based Multiprocessor Emulator," Proc. Third ACM Int'l Symp. Field-Programmable Gate Arrays, ACM Press, New York, 1995.
9. F. Dahlgren, M. Dubois, and P. Stenstrom, "Combined Performance Gains of Simple Cache Protocol Extensions," Proc. Int'l Symp. Computer Architecture, ACM Press, New York, Apr. 1994, pp. 187-199.
10. D. Gustavson, "The Scalable Coherent Interface and Related Standards Projects," IEEE Micro, Vol. 12, No. 1, Feb. 1992.
11. M.A. Blumrich et al., "Virtual-Memory-Mapped Network Interface for the Shrimp Multicomputer," Proc. Int'l Symp. Computer Architecture, ACM Press, New York, 1994, pp. 142-153.
12. M. Brorsson et al., "The Cache-Mire Test Bench: A Flexible and Effective Approach for Simulation of Multiprocessors," Proc. 26th Ann. Simulation Symp., IEEE CS Press, Los Alamitos, Calif., Order No. 3620, 1993, pp. 41-49.

Luiz Andre Barroso is a PhD candidate in computer engineering at the University of Southern California. His research interests include parallel processing, performance evaluation, and computer architecture. He is particularly interested in the design of general-purpose, shared-memory multiprocessors. He received a BS degree in electrical engineering/electronics and an MS degree in computer systems from Pontificia Universidade Catolica do Rio de Janeiro, Brazil, and an MS degree in computer engineering from USC.

Sasan Iman is pursuing a PhD degree in computer engineering at the University of Southern California. His research interests include design automation for system prototyping and design of low-power digital circuits. He received a BS degree in electrical engineering from the University of Iowa in 1989 and an MS degree in electrical engineering from USC in 1991.

Jaeheon Jeong is a PhD student at the University of Southern California. His research interests include computer architecture and performance evaluation of shared-memory multiprocessors. He received his BS degree in electronics engineering from Korea University, Korea, in 1985, and an MS degree in electrical engineering from USC in 1993. He was an engineer at the System R&D Laboratory at Samsung Electronics, Korea, from 1985 to 1992.

Koray Oner is pursuing a PhD degree in computer engineering at the University of Southern California. His research interests include computer architecture, parallel processing, and rapid prototyping systems. He received a BS degree in electrical engineering from the Middle East Technical University, Ankara, Turkey, in 1989, and an MS degree in computer engineering from USC in 1990.

Krishnan Ramamurthy is a hardware designer for LSI Logic Corporation, Milpitas, California. His research interests include multiprocessor systems, modeling, and performance evaluation. He received a BS degree in mathematics from the University of Madras, India, BS and MS degrees in electrical engineering from the Indian Institute of Science, Bangalore, India, and an MS degree in electrical engineering from USC.

Michel Dubois is an associate professor in the Department of Electrical Engineering at the University of Southern California, where he leads the RPM project. His main interests include computer architecture and parallel processing, with a focus on multiprocessor architecture, performance, and algorithms. His research in parallel algorithms concerns iterative algorithms for numerical and nonnumerical problems and their asynchronous implementations on multiprocessors.

Dubois received an engineering degree from the Faculte Polytechnique de Mons in Belgium, an MS degree from the University of Minnesota, and a PhD degree from Purdue University, all in electrical engineering. He has edited two books, one on multiprocessor caches and another on scalable shared-memory multiprocessors. He is a member of the ACM and a senior member of the IEEE Computer Society.


Readers can contact the authors at the Department of Electrical Engineering Systems, University of Southern California, Los Angeles, CA 90089-2562; e-mail {dubois, barroso, oner, jaeheonj}@paris.usc.edu.

Future developments on the RPM project will be posted on the WWW at http://www.usc.edu/dept/ceng/dubois/RPM.html.