
Multiprocessor simulators

Introduction


Nowadays, multicore and manycore computer architectures represent the solution for further increasing the overall performance of computers, since raising the clock frequency is no longer feasible because of thermal dissipation constraints (power consumption density per square centimeter).

Computer architects use simulators in order to evaluate the performance of different designs.

Monolithic simulators, like the popular SimpleScalar, were intensively used to evaluate uniprocessor architectures. The design space of multiprocessor architectures is, however, considerably larger than that of uniprocessor systems, which makes the design space exploration of multicores and manycores far more time-consuming. To cope with this increased complexity, multiprocessor simulators should become modular, taking advantage of modern software engineering techniques.

In a multiprocessor simulator the processor is no longer unique in the architecture; it becomes one "functional unit" among several. Nevertheless, Instruction Level Parallelism (ILP) must still be taken into account, because otherwise ILP features could significantly distort the simulation results [1].

Furthermore, it was shown that simulators should run an operating system too, because otherwise the simulation results might contain significant errors.

The purpose of this paper is to evaluate a few multiprocessor simulators in order to find out to what extent the larger design space of multicore architectures can be explored and architectural optimizations identified.

The first section presents an overview of the evaluated simulators. The next section compares the simulators by taking into account different aspects, like simulation speed and accuracy. The last section contains our conclusions about the studied simulators.

Overview of the simulators

This section presents an overview of the evaluated multiprocessor simulators. We have focused on three application-only simulators (Multi2Sim, RSIM, and SESC) and three full-system simulators (M5, Simics, and GEMS).

Multi2Sim

Multi2Sim [2] is an application-only simulator developed at Universidad Politecnica de Valencia, Spain. It is built on top of the popular SimpleScalar simulator and adds support for multithreading, multiprocessors, a memory hierarchy, and an interconnection network.


Multi2Sim uses a timing-first simulation approach, so that the functional part of the simulator is separated from the timing module.

Multi2Sim offers both multithreaded and multicore support. It provides a set of parameters that specify how the five stages of the modeled pipeline (similar to SimpleScalar's pipeline) are organized in a multithreaded design.

(Figure: the pipeline of a dual-threaded processor.)

Stages can be shared among threads or private per thread. When a stage is shared, an algorithm must schedule one of the threads onto the stage each cycle, so Multi2Sim can be employed to evaluate different sharing strategies for the pipeline stages. Multithreaded designs are classified as fine-grain, coarse-grain, or simultaneous multithreading (SMT), and Multi2Sim allows all three designs to be simulated.
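
A shared stage's per-cycle thread selection can be as simple as round-robin over the threads that are not stalled. The sketch below is a generic illustration of this idea, not Multi2Sim's actual scheduler:

```cpp
// Illustrative round-robin selection of the thread that owns a shared
// pipeline stage this cycle (not Multi2Sim code).
#include <cstddef>
#include <vector>

struct Thread { bool stalled = false; };

class SharedStageScheduler {
    std::vector<Thread>* threads;
    std::size_t last = 0;  // thread scheduled in the previous cycle
public:
    explicit SharedStageScheduler(std::vector<Thread>& t) : threads(&t) {}

    // Returns the index of the thread scheduled on the stage this cycle,
    // or -1 if every thread is stalled.
    int pick() {
        for (std::size_t i = 1; i <= threads->size(); ++i) {
            std::size_t cand = (last + i) % threads->size();
            if (!(*threads)[cand].stalled) { last = cand; return (int)cand; }
        }
        return -1;
    }
};
```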

In Multi2Sim, a multicore architecture is achieved by replicating all the data structures that represent a single processor core. To solve the cache coherence problem, a MOESI cache coherence protocol is implemented. For the interconnection network, Multi2Sim implements a simple bus. Additionally, the number of interconnects and their location vary depending on the sharing strategy of the data and instruction cache memories. The next figure shows the possible sharing schemes for L1 and L2 caches, using a dual-core architecture as an example.


In the figure, t means that the cache is private per thread, c that it is private per core, and s that it is shared by the whole multiprocessor.
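
To make the coherence discussion concrete, here is a deliberately simplified sketch of MOESI state transitions. It is illustrative only; real protocols, including Multi2Sim's, also track transient states, write-backs, and races:

```cpp
// Simplified MOESI next-state function (illustrative sketch).
enum class Moesi { M, O, E, S, I };
enum class Event { LocalRead, LocalWrite, RemoteRead, RemoteWrite };

Moesi next_state(Moesi s, Event e, bool otherSharers) {
    switch (e) {
    case Event::LocalRead:   // read miss from I; read hit keeps the state
        return s == Moesi::I ? (otherSharers ? Moesi::S : Moesi::E) : s;
    case Event::LocalWrite:  // gain ownership, other copies are invalidated
        return Moesi::M;
    case Event::RemoteRead:
        if (s == Moesi::M) return Moesi::O;  // supply data, keep ownership
        if (s == Moesi::E) return Moesi::S;
        return s;
    case Event::RemoteWrite: // another core takes exclusive access
        return Moesi::I;
    }
    return s;
}
```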

RSIM

RSIM [3] (Rice Simulator for ILP Multiprocessors) was developed by Sarita Adve's research group at Rice University, later maintained at the University of Illinois, and was released in 1997.

RSIM models a processor microarchitecture that aggressively exploits ILP. It incorporates features from a variety of current commercial processors. The default processor features include: superscalar execution, out-of-order scheduling, register renaming, branch prediction, non-blocking memory load and store operations, and register windows.

RSIM simulates a hardware cache-coherent distributed shared memory system (a CC-NUMA), with variations of a full-mapped invalidation-based directory coherence protocol. Each processing node consists of a processor, a two-level cache hierarchy (with a coalescing write buffer if the first-level cache is write-through), a portion of the system's distributed physical memory and its associated directory, and a network interface. A pipelined split-transaction bus connects the secondary cache, the memory and directory modules, and the network interface. Local communication within the node takes place on the bus. The network interface connects the node to a multiprocessor interconnection network for remote communication.

Both cache levels are lockup-free and store the state of outstanding requests using miss status holding registers (MSHRs).

The first-level cache can either be a write-through cache with a no-allocate policy on writes, or a write-back cache with a write-allocate policy. RSIM allows for a multiported and pipelined first level cache. Lines are replaced only on incoming replies.

The coalescing write buffer is implemented as a buffer with cache-line-sized entries. All writes are buffered here and sent to the second level cache as soon as the second level cache is free to accept a new request. The number of entries in the write buffer is configurable.

The second-level cache is a write-back cache with a write-allocate policy. RSIM allows for a pipelined secondary cache. Lines are replaced only on incoming replies. The secondary cache maintains inclusion with respect to the first-level cache.

For both levels of cache, the size, line size, set associativity, cache latency, and number of MSHRs can be varied.
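
As an illustration of the MSHR mechanism mentioned above, the following sketch (our own, not RSIM code) shows how a lockup-free cache can coalesce requests to a line that already has an outstanding miss:

```cpp
// Illustrative MSHR table: outstanding misses are recorded so that later
// requests to the same line coalesce instead of issuing duplicate misses.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Request { int id; bool isWrite; };

class MshrTable {
    std::unordered_map<uint64_t, std::vector<Request>> pending; // line -> waiters
    std::size_t capacity;
public:
    explicit MshrTable(std::size_t n) : capacity(n) {}

    // Returns false if all MSHRs are busy (the cache must stall the request).
    bool onMiss(uint64_t lineAddr, Request r) {
        auto it = pending.find(lineAddr);
        if (it != pending.end()) { it->second.push_back(r); return true; } // coalesce
        if (pending.size() == capacity) return false;                      // structural stall
        pending[lineAddr] = {r};   // allocate a new MSHR and issue the miss
        return true;
    }

    // Called when the reply for a line arrives; returns the waiting requests.
    std::vector<Request> onReply(uint64_t lineAddr) {
        auto waiters = std::move(pending[lineAddr]);
        pending.erase(lineAddr);
        return waiters;
    }
};
```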


The memory is interleaved, with multiple modules available on each node. The memory is accessed in parallel with an interleaved directory, which implements a full-mapped cache coherence protocol. The memory access time, the memory interleaving factor, the minimum directory access time, and the time to create coherence packets at the directory are all configurable parameters.

The directory can support either a MESI or an MSI protocol. The RSIM directory protocol and cache controllers support cache-to-cache transfers.

For local communication within a node, RSIM models a pipelined split-transaction bus connecting the L2 cache, the local memory, and the local network interface. The bus speed, bus width, and bus arbitration delay are all configurable.

For remote communication, RSIM currently supports a two-dimensional mesh network. RSIM models a pipelined wormhole-routed network with contention at the various switches. For deadlock avoidance, the system includes separate request and reply networks. The flit delay per network hop, the width of the network, the buffer size at each switch, and the length of each packet's control header are user-configurable parameters.
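
RSIM's documentation, as summarized here, does not spell out the routing function, but dimension-order (XY) routing is the usual scheme in wormhole-routed 2D meshes. The following sketch is a generic illustration of it:

```cpp
// Dimension-order (XY) routing sketch for a 2D mesh: correct the X
// coordinate first, then Y, then deliver locally (illustrative only).
enum class Dir { East, West, North, South, Local };

Dir routeXY(int curX, int curY, int dstX, int dstY) {
    if (curX < dstX) return Dir::East;   // correct X first
    if (curX > dstX) return Dir::West;
    if (curY < dstY) return Dir::North;  // then correct Y
    if (curY > dstY) return Dir::South;
    return Dir::Local;                   // arrived: deliver to the node
}
```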

RSIM supports several multiprocessor memory consistency models:

Sequential Consistency (SC); Processor Consistency (PC) and Total Store Ordering (TSO); Relaxed Memory Ordering (RMO) and Release Consistency (RC).

Each of these memory models is supported with both a straightforward implementation and optimized implementations.
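
The practical difference between these models can be seen with the classic store-buffer litmus test. The self-contained C++ sketch below shows an outcome that Sequential Consistency forbids but TSO and weaker models allow:

```cpp
// Store-buffer litmus test: under Sequential Consistency the outcome
// r1 == 0 && r2 == 0 is impossible; under TSO/relaxed models it can
// occur because stores may be delayed past subsequent loads.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t0() {
    x.store(1, std::memory_order_relaxed);   // W x = 1
    r1 = y.load(std::memory_order_relaxed);  // R y
}
void t1() {
    y.store(1, std::memory_order_relaxed);   // W y = 1
    r2 = x.load(std::memory_order_relaxed);  // R x
}

int main() {
    std::thread a(t0), b(t1);
    a.join(); b.join();
    // SC forbids (r1, r2) == (0, 0); TSO, RMO, and RC allow it.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```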

SESC

SESC [4][5] is a microprocessor architectural simulator developed primarily by the IACOMA research group at the University of Illinois at Urbana-Champaign and various groups at other universities. It models different processor architectures, such as single processors and chip multiprocessors. It models a full out-of-order pipeline with branch prediction, caches, buses, and every other component of a modern processor necessary for accurate simulation.

SESC is an event-driven simulator: many functions in the core of the simulator are called every processor cycle, but many others are called only as needed, using events. SESC is written in C++, chosen because it is faster than Java while still offering good object-oriented programming support, and it runs on both big-endian and little-endian hosts.

In SESC, the actual instructions are executed in an emulation module built from MINT, an older project that emulates a MIPS processor. The emulator executes the instructions of the application binary in order, emulating the MIPS Instruction Set Architecture (ISA).

The emulator returns instruction objects to SESC which are then used for the timing simulator. These instruction objects contain all the relevant information necessary for accurate timing. This includes the address of the instruction, the addresses of any loads or stores to memory, the source and destination registers, and the functional units used by the instruction. The bulk of the simulator uses this information to calculate how much time it takes for the instruction to execute through the pipeline.
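
The following struct sketches the kind of per-instruction record such an emulator/timing interface might exchange. The field names are ours for illustration, not SESC's actual types:

```cpp
// Hypothetical per-instruction record handed from an emulator to a
// timing model (illustrative; not SESC's actual data structures).
#include <cstdint>

enum class FuncUnit { IntALU, IntMult, FPALU, LdSt, Branch };

struct DynInst {
    uint64_t pc;          // address of the instruction
    uint64_t effAddr;     // address of a load/store, if any
    bool     isLoad;
    bool     isStore;
    int      srcRegs[2];  // source registers
    int      dstReg;      // destination register
    FuncUnit fu;          // functional unit the instruction occupies
};
```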

From our experience, almost every parameter of the SESC simulator can be changed. It takes a large number of inputs (given through configuration files), but we were not able to test them. We suppose it supports different cache hierarchies, and we have seen that a bus is implemented, possibly along with other interconnection networks.


M5

M5 [6] is a modular platform for computer system architecture research, encompassing system-level architecture as well as processor microarchitecture. It is developed at the University of Michigan. M5 was mainly developed to simulate network communication but, because of its high simulation accuracy, it is also used in computer architecture design.

M5 is implemented using two object-oriented languages: Python for high-level object configuration and simulation scripting, where flexibility and ease of programming matter, and C++ for low-level object implementation, where performance matters. All simulation objects (CPUs, busses, caches, etc.) are represented as objects in both Python and C++. Using Python objects for configuration allows flexible script-based object composition to describe complex simulation targets. Once the configuration is constructed in Python, M5 instantiates the corresponding C++ objects, which provide good run-time performance for detailed modeling.

M5's use of object-oriented programming techniques provides a clear interface between a simulation object and the rest of the system. This has several benefits. First, researchers can modify a component's behavior with localized code changes, with a higher likelihood of not breaking seemingly unrelated parts of the simulator. Second, different models for a particular component, such as a CPU, can easily be substituted within a particular configuration. These interchangeable models may differ in level of detail (allowing simulation speed vs. accuracy trade-offs), in the component's functional behavior (to study design alternatives), or both.

M5 contains two primary CPU models, SimpleCPU and O3CPU (models an Alpha 21264 [9]). Both models derive from a base CPU class and export the same interface, allowing them to be used interchangeably. M5 can also switch between CPU models during run-time, allowing the use of SimpleCPU for fast-forwarding and warm-up phases and O3CPU for taking statistics.

SimpleCPU is an in-order, non-pipelined functional model that can be configured to execute one or more instructions per cycle, but can only have one outstanding memory operation. In addition to fast-forwarding and warm-up, SimpleCPU is useful for modeling network client systems whose only goal is to generate traffic for a detailed server system under test.

O3CPU is an out-of-order, superscalar, pipelined, simultaneous multithreading (SMT; it does not work in full-system mode) model. The O3CPU model simulates an out-of-order pipeline in detail. Individual stages such as fetch, decode, etc. are configurable in width and latency. Both forward and backward communication between stages is modeled explicitly using time buffers (a small illustrative sketch of such a buffer is given after the CPU-model list below). The model includes detailed branch predictors, instruction queues, load/store queues, functional units, and memory dependence predictors.

The O3CPU model has been developed with a strong focus on timing accuracy. To promote accuracy, its authors integrated both timing and functional modeling into a single "execute-in-execute" pipeline, in which functional instruction execution occurs in the execute stage of the timing pipeline. Most other models are "execute-in-fetch" models, which functionally execute instructions as they are fetched and do the timing modeling afterwards. The execute-in-execute design provides accurate modeling of timing-dependent instructions, such as synchronization operations and I/O device accesses. It also provides higher confidence in the realism of the timing model, as unrealistic timing often manifests as incorrect functional operation as well.

Because M5 supports booting entire operating systems as well as running application binaries with system call emulation, all the CPU models support the full privileged instruction set, virtual address translation, and asynchronous interrupts. M5 currently boots unmodified versions of Linux 2.4/2.6, FreeBSD, HP/Compaq Tru64 Unix, and the L4Ka::Pistachio microkernel.

M5 currently provides only a bus model; the device-to-interconnect interface has been designed to easily allow extension to point-to-point networks as well. M5 supports configurable caches with parameters for size, latency, associativity, and replacement policy. Bus objects model a split-transaction bus that is configurable in both latency and bandwidth. A simple bus bridge object is available to connect busses of different speeds, e.g., the PCI bus and the system bus.

Though intended for simulating networked systems, M5's structured design, rich feature set, and library of predefined object models provide a powerful framework for building complex architectural simulations in a variety of domains. M5 offers three CPU models from the perspective of simulation accuracy:


atomic - not accurate: every operation takes one cycle, and requests/responses propagate instantly through the entire system. It is very fast, about 50 times slower than the host machine;

timing - almost the same as the atomic model, but messages take time to propagate. For example, the memory does not respond immediately; it waits for an amount of time and then sends the response;

detailed - the CPU is modeled with all the pipeline stages, the ROB, the branch predictor, etc. It is very accurate.
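
As promised above, here is a simplified sketch of stage-to-stage communication through a time buffer: a stage writes its output this cycle, and the consuming stage sees it a fixed number of cycles later. This is our own ring-buffer illustration, not M5's actual TimeBuffer class:

```cpp
// Minimal time-buffer sketch: values written now become visible to the
// reader exactly Delay cycles later (illustrative, not M5 code).
#include <array>
#include <cstddef>

template <typename T, std::size_t Delay>
class TimeBuffer {
    std::array<T, Delay + 1> slots{};
    std::size_t now = 0;
public:
    T&       write()      { return slots[now]; }                    // produced this cycle
    const T& read() const { return slots[(now + 1) % (Delay + 1)]; } // visible after Delay
    void advance()        { now = (now + 1) % (Delay + 1); }        // call once per cycle
};
```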

We were able to install the simulator and ran it in both Full System (FS) mode and System Call Emulation (SE) mode. We managed to boot Linux and to add arbitrary files to the Linux image (benchmarks, input files). We also managed to run SPLASH-2 and our own benchmarks in FS and SE modes. We modified the simulator so that the simulated architecture can now be changed from the command line (we added options to change cache parameters and the cache hierarchy). We were able to simulate up to 4 cores in FS mode (it should work with 64, but we have not tried it yet; it involves some code changes). In SE mode it should work with as many cores as we like: we simulated FFT to completion with 64 cores and started a simulation with 128 cores. We discovered that the limitation is the host system memory: for 64 cores, 1 GB of free memory should be enough, but 128 cores needed more. If we need more cores, we should switch to a 64-bit computer/operating system.

We have created a program (in Python) that lets us simulate automatically as many benchmarks as we like and also vary the simulation parameters (ranges of values can be given). We also enabled support for the MySQL database in M5, so we can now save the simulation results directly in the database. This is convenient when we simulate across several computers and want the results stored in a single place. We still need to fully understand the database structure and to create a program for querying it.

Simics

Simics is a commercial full-system simulator provided by Virtutech [7], who claim that Simics is the fastest full-system simulator ever implemented. Although commercial, Simics is also available under a Personal Academic License, which allows researchers to use it for academic work. However, the source code is not provided.

Simics models the target system at the level of individual instructions, executing them one at a time. It includes a very wide range of Instruction Set Architectures: Alpha, ARM, MIPS, PowerPC, SPARC, x86-64.

Being a system-level framework on which simulation tools can be built, Simics is able to boot an unmodified Operating System (Solaris, Linux, Windows) and it also includes devices like video and Ethernet cards and disks.

Simics fully virtualizes the target computer, allowing simulation of multiprocessor systems as well as clusters of independent systems and even networks, regardless of the simulator host type. The virtualization also allows Simics to be cross-platform. For instance, Simics/SunFire can run on a Linux/x86 system, thus simulating a 64-bit big-endian system on a 32-bit little-endian host.

Simics provides checkpointing, which allows saving the entire simulation state on the hard disk. The checkpoint files are saved in a portable format and can be loaded at a later time.

Another feature of Simics, called Hindsight, enables the simulation to run backwards in time. Using Hindsight, we are able to restore a previous simulation state, without explicitly creating checkpoint files.

A full-system simulation can be run completely isolated, but Simics also provides a way for the simulated system to access files located on the real computer (the host). SimicsFS is a Linux kernel file system module that talks to a simulated device.

Simics also provides tracing for memory accesses, I/O accesses, control register writes, and exceptions (like system calls).


Virtual networks are also provided by Simics: it can simulate several machines simultaneously, connected together by a simulated Ethernet link. Furthermore, a simulation can be connected to a real network, allowing simulated computers and real computers to communicate with each other.

GEMS

GEMS [8], the General Execution-driven Multiprocessor Simulator, is a modular simulation infrastructure, developed at the University of Wisconsin-Madison, which decouples simulation functionality from timing (timing-first simulation). To speed up the development of the simulator, GEMS uses Simics as a foundation on which various timing simulation modules can be dynamically loaded.

Since Simics is robust enough to boot an unmodified OS, its functional simulation was used in order to avoid implementing rare but effort-consuming instructions in the timing simulator. The timing modules of GEMS interact with Simics to determine when Simics should execute an instruction; however, the result of executing an instruction is ultimately determined by Simics. Such a decoupling allows the timing models to focus on the most common 99.9% of all dynamic instructions. This task is much easier than requiring a monolithic simulator to model the timing and correctness of every function in all aspects of the full-system simulation. For example, a small mistake in handling an I/O request or in modeling a special case in floating-point arithmetic is unlikely to cause a significant change in timing fidelity; however, such a mistake will likely affect functional fidelity, which may prevent the simulation from continuing to execute. By allowing Simics to always determine the result of execution, the program always continues to execute correctly.

The GEMS simulation system has primarily been used to study cache-coherent shared memory systems (both on-chip and off-chip) and related issues. For example, GEMS models the transient states of cache coherence protocols in great detail. However, GEMS has a more approximate timing model for the interconnection network, a simplified DRAM subsystem, and a simple I/O timing model. GEMS includes a detailed model of a modern dynamically-scheduled processor, called Opal.

The heart of GEMS is the Ruby memory system simulator. GEMS provides multiple drivers that can serve as a source of memory operation requests to Ruby.

The simplest driver of Ruby is a random testing module used to stress test the corner cases of the memory system. It uses false sharing and action-check pairs to detect many possible coherence errors, race conditions and deadlocks.

The microbenchmarks module can be used for basic timing verification, as well as detailed performance analysis of specific conditions.


The Simics driver uses the Simics functional simulator to approximate a simple in-order processor. Simics passes all load, store and instruction fetch requests to Ruby, which performs the first level cache access to determine if the operation hits or misses in the primary cache. On a hit, Simics continues executing instructions, switching between processors in a multiple processor setting. On a miss, Ruby stalls Simics’ request from the issuing processor, and then simulates the cache miss. Each processor can have only a single miss outstanding, but contention and other timing effects among the processors will determine when the request completes. By controlling the timing of when Simics advances, Ruby determines the timing-dependent functional simulation in Simics (e.g., to determine which processor next acquires a memory block).
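
The gating between Ruby and Simics described above can be sketched as a toy, self-contained model. This is our own simplification (one outstanding miss, fixed miss latency), not GEMS code:

```cpp
// Toy timing-first gating sketch: the functional step only happens when
// the timing model says the memory access has completed (illustrative).
#include <cstdio>

struct TimingModel {
    int busyUntil = 0;                                       // cycle when the miss resolves
    bool hit(unsigned long addr) { return addr % 4 != 0; }   // toy L1 lookup
    void startMiss(int now)      { busyUntil = now + 100; }  // fixed 100-cycle miss
    bool done(int now) const     { return now >= busyUntil; }
};

int main() {
    TimingModel ruby;
    int cycle = 0;
    unsigned long addrs[] = {12, 16, 20};     // one toy "load" per instruction
    for (unsigned long a : addrs) {
        if (!ruby.hit(a)) {
            ruby.startMiss(cycle);
            while (!ruby.done(cycle)) ++cycle; // stall this CPU while Ruby ticks
        }
        ++cycle;                               // functional step: instruction retires
        std::printf("addr %lu retired at cycle %d\n", a, cycle);
    }
}
```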

Opal models a dynamically-scheduled SPARC v9 processor and uses Simics to verify its functional correctness.

The memory system simulator is independent of the out-of-order processor simulator. Preliminary results can be obtained for a memory system enhancement using the simple in-order processor model provided by Simics, which runs much faster than Opal. Based on these preliminary results, the researcher can then determine whether the accompanying processor enhancement should be implemented and simulated in the detailed out-of-order simulator.

GEMS also provides flexibility in specifying many different cache coherence protocols that can be simulated. The protocol-dependent details are separated from the protocol-independent system components and mechanisms. To facilitate specifying different protocols and systems, the protocol specification language SLICC was implemented. SLICC protocols specify the entire memory system logic. Protocols are generally classified as SMP, CMP, or SCMP.

SMP protocols assume each node consists of a processor, a private L1, a private L2, and a memory/directory controller. The SMP protocols can be used to model a CMP with private caches. Both L1 and L2 caching are implemented in a single controller.

CMP protocols assume each node consists of processors with their private L1 caches, and a banked, shared L2 cache. The CMP protocols generally support Multiple-CMP systems unless otherwise noted. The L1 and L2 controllers are split.

SCMP protocols assume a Single-CMP with split L1 and L2 controllers.

The interconnection network is the unified communication substrate used to communicate between cache and memory controllers. A single monolithic interconnection network model is used to simulate all communication, even between controllers that would be on the same chip in a simulated CMP system. As such, all intra-chip and inter-chip communication is handled as part of the interconnect, although each individual link can have different latency and bandwidth parameters.

Ruby models a point-to-point switched interconnection network that can be configured similarly to interconnection networks in current high-end multiprocessor systems, including both directory-based and snooping-based systems.

For simulating systems based on directory protocols, Ruby supports three non-ordered networks:

a simplified fully connected point-to-point network, a dynamically routed 2D-torus interconnect, and a flexible user-defined network interface.

The first two networks are automatically generated using certain simulator configuration parameters, while the third creates an arbitrary network by reading a user-defined configuration file. This file-specified network can create complicated networks.

For snooping-based systems, Ruby has two totally-ordered networks:


a crossbar network and a hierarchical switch network.

Both ordered networks use a hierarchy of one or more switches to create a total order of coherence requests at the network’s root.

The topology of the interconnect is specified by a set of links between switches, and the actual routing tables are re-calculated for each execution, allowing for additional topologies to be easily added to the system. The interconnect models virtual networks for different types and classes of messages, and it allows dynamic routing to be enabled or disabled on a per-virtual-network basis (to provide point-to-point order if required). Each link of the interconnect has limited bandwidth, but the interconnect does not model the details of the physical or link-level layer. By default, infinite network buffering is assumed at the switches, but Ruby also supports finite buffering in certain networks.
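
Recomputing routing tables from a set of links, as described above, can be done with a breadth-first search per destination. The sketch below is illustrative, not Ruby's actual code:

```cpp
// Build next-hop routing tables from an undirected link list with BFS.
// next_hop[src][dst] = neighbour of src on a shortest path to dst (-1 = none).
#include <queue>
#include <utility>
#include <vector>

std::vector<std::vector<int>> buildRoutes(
        int n, const std::vector<std::pair<int, int>>& links) {
    std::vector<std::vector<int>> adj(n);
    for (auto [a, b] : links) { adj[a].push_back(b); adj[b].push_back(a); }

    std::vector<std::vector<int>> next_hop(n, std::vector<int>(n, -1));
    for (int dst = 0; dst < n; ++dst) {        // BFS outward from each destination
        std::vector<bool> seen(n, false);
        std::queue<int> q;
        q.push(dst); seen[dst] = true;
        while (!q.empty()) {
            int v = q.front(); q.pop();
            for (int u : adj[v])
                if (!seen[u]) {
                    seen[u] = true;
                    next_hop[u][dst] = v;      // first hop from u towards dst
                    q.push(u);
                }
        }
    }
    return next_hop;
}
```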

Comparison of the simulators

The following table presents a comparison of the six simulators that we have evaluated. From the beginning, we can divide the simulators into two categories:

system-call emulation (application-only): Multi2Sim, RSIM, and SESC; full-system: M5, Simics, and GEMS.

Actually, M5 is designed to work in both system-call emulation and full-system modes. It was shown that the effects of the operating system on the simulation results can be very significant. Thus, whether a simulator is full-system or not matters a great deal, because a full-system simulation gives increased simulation realism and accuracy.

But simulation accuracy means more than including an operating system in the simulation framework: accurate results can be obtained only if the simulator models the hardware precisely. Obviously, the more accurate a simulator is, the more its simulation speed decreases.

Nevertheless, we were interested to find out to what extent the tested simulators model real architectures. Simics is the one that provides support for the most Instruction Set Architectures; actually, Simics models whole systems: graphics cards, network cards, bus-based devices, etc. We can also notice that most simulators model a CPU architecture that is closest to MIPS. Since Instruction Level Parallelism features could also alter the simulation results, we checked and found that all the evaluated simulators provide ILP support: out-of-order execution (as in GEMS' Opal module, or M5), branch prediction, and pipelines (Multi2Sim, RSIM, GEMS, M5). A very interesting idea is the "execute-in-execute" model of M5, as opposed to the classic "execute-in-fetch" approach (SESC seems to run this way); we think that modeling the pipeline correctly has a deep influence on the coherence/consistency behavior of the simulated system.

Also, multithreading support is provided by M5 (only in system-call emulation mode; not tested yet), GEMS (only with Opal), and Multi2Sim, which besides simultaneous multithreading allows fine-grain and coarse-grain configurations.

The main part of our comparison is obviously focused on the multiprocessor features. We were first interested in how many processors each simulator can support. We successfully tested Multi2Sim with 2, 4, 8, 16, 64, and 128 cores using the FFT SPLASH-2 kernel benchmark; it probably works with even more cores. GEMS is limited only by the capabilities of Simics. Since GEMS uses the Simics/Serengeti model, which represents the Sun Fire 3800-6800 class of servers, we found that such a server provides up to 24 UltraSPARC processors: 6 boards with 4 CPUs per board. The GEMS authors state that Simics supports a maximum of 384 (= 24 * 16) processors for Serengeti. However, there is also a restriction: the number of processors used by GEMS must be a power of 2. We only tested GEMS with 2, 4, and 8 processors, and the authors claim they used up to 64; therefore, 256 processors is a theoretical limit. RSIM uses a 2D mesh, and the number of nodes (one CPU per node) must be a square, so we concluded that RSIM supports a maximum of 11*11 = 121 processors. M5 can support, with some modifications, 64 processors in full-system mode; in system-call emulation mode we ran FFT to completion with 64 cores and also started it with 128 cores, but it took too long to complete.

Our next concern was the memory system. Two-level cache hierarchies are supported by basically all the simulators (Simics does not use caches by default, in order to be fast). Multi2Sim and GEMS, for example, allow creating different kinds of cache hierarchies. M5 can also simulate different cache hierarchies, but this was made possible by sacrificing the accuracy of the cache coherence protocol, so M5 should not be used to study cache coherence.

The simulators show their limitations when it comes to cache coherence and memory consistency models. MESI and MOESI are the most common cache coherence protocols supported. Nevertheless, GEMS is without doubt the most suitable simulator for cache coherence work, because it provides a specification language for implementing cache coherence protocols. The support for memory consistency models is even weaker: some simulators, like Multi2Sim and SESC, do not appear to be concerned with this problem at all. Simics, and implicitly GEMS, use Sequential Consistency. GEMS should also support the weaker Total Store Ordering model, which is used by SPARC and Solaris, but we found that it is neither fully implemented nor tested. M5 provides some weak consistency model, but RSIM is the best-suited simulator from this point of view.

The interconnection network is another place where simulators are limited: besides GEMS, and maybe RSIM, the other simulators do not provide a very extensible interconnection module. GEMS gives the best support for specifying different network topologies.

Another feature related to interconnection networks is Ethernet simulation, offered by M5, Simics, and implicitly GEMS. This allows simulating several machines. Simics even provides the ability to connect simulated computers with real computers.

As for measuring power consumption and estimating integration area, we can note SESC, which supports both, and GEMS, which offers an interconnection network power model developed at Princeton University. SESC also integrates HotSpot (besides CACTI and Wattch).

The next section of the table briefly shows how flexible the simulators are. Our impression is that all the simulators provide a significant number of inputs and outputs. Simics is more a simulation framework than a specific simulator, so we did not expect it to provide a lot of results; for example, it does not simulate cache memories by default. Multi2Sim allows us to configure the memory hierarchy and to specify some bus parameters, the number of cores, and the multithreading model; it outputs the resulting IPC, the hit ratio of the cache memories, and the branch prediction accuracy. RSIM, GEMS, and M5 allow us to specify processor, cache, and interconnection parameters, and their outputs are richer. In particular, RSIM gives the percentage of time occupied by different kinds of operations, like lock, unlock, spin, and barrier, and GEMS provides separate results for user and kernel operations. For all the simulators except SESC and RSIM, we managed to compile and run our own benchmarks.

The programmer's view section tries to evaluate the modularity of the simulators, which is important for multiprocessor simulators. Basically, we think that the simulators written in C++ are more suitable than the ones written in C. We have also noticed that GEMS and M5 are better structured; their source code is also not very large, considering that they are full-system simulators. SESC, for example, is application-only and (mostly because of the thermal module) has a very large source code.

The documentation provided with the simulators and other sources of information are also important when considering a particular simulator for development. M5 and GEMS, for example, provide online documentation and a discussion forum. For RSIM, there is a manual but no community activity (this simulator was released in 1997).

All the simulators support the shared memory programming model. For the simulators that allow Ethernet simulation, we believe that message passing could also be supported.

The simulation speed is not included in the table because we find it difficult to evaluate the six simulators uniformly in terms of speed. We have found that, generally, GEMS runs 15 times slower than Simics with Ruby only, and 45 times slower with both Ruby and Opal. And Simics itself runs impressively fast: it can be only 10 times slower than the host machine. M5 runs between 2,000 and 50,000 times slower than the host machine. Additionally, we managed to run FFT on Multi2Sim with 128 cores in about 17 minutes; the same FFT kernel from the SPLASH-2 benchmark suite ran in about 5 minutes with RSIM, using an 8x8 2D mesh (64 processors).

The last section of the table presents a few simulation techniques and shows how they are supported by the evaluated simulators.

Truncated execution is a technique that can significantly reduce the simulation time. With this technique, instead of simulating the entire benchmark, the architect simulates only Z million contiguous instructions from somewhere in the benchmark. In M5, you can limit the number of instructions, or you can run for a period of time using a simple/fast CPU and then switch to a more detailed one. The three main variations of truncated execution are Run Z, Fast-Forward X + Run Z (FF X + Run Z), and Fast-Forward X + Warmup Y + Run Z (FF X + WU Y + Run Z).

Direct execution increases the speed of functional simulation by running the benchmark directly on the host machine instead of simulating it during the functional simulation period.

To eliminate the fast-forwarding time, architects can use checkpoints to minimize the simulation time. To create a checkpoint, the architect executes the program up to the checkpoint and then saves the program state to a checkpoint file; before simulation, the user-visible registers are loaded with the contents of the checkpoint file. M5 has this feature, but checkpoints cannot be created in detailed mode: to restore a benchmark from a checkpoint and run it in detailed mode, we must first load the checkpoint in atomic/timing mode and then switch the CPU to detailed after a warm-up period. Simics and GEMS also support checkpointing.
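
The FF X + WU Y + Run Z scheme reduces, in essence, to three loops over CPU models of increasing detail. The sketch below illustrates this with a hypothetical simulator interface (the step functions are ours, not any real simulator's API):

```cpp
// Illustrative truncated-execution driver: fast-forward X instructions
// functionally, warm up caches/predictors for Y, then measure Z in detail.
struct Simulator {
    void step_functional() {}  // fast, no timing model
    void step_warmup()     {}  // updates caches/predictors; stats discarded
    void step_detailed()   {}  // full timing model; stats collected
};

void truncated_run(Simulator& sim, long X, long Y, long Z) {
    for (long i = 0; i < X; ++i) sim.step_functional(); // FF X
    for (long i = 0; i < Y; ++i) sim.step_warmup();     // WU Y
    for (long i = 0; i < Z; ++i) sim.step_detailed();   // Run Z
}

int main() {
    Simulator sim;
    truncated_run(sim, 1'000'000, 100'000, 10'000);
}
```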

| Feature | Multi2Sim | GEMS | RSIM | SESC | Simics | M5 |
|---|---|---|---|---|---|---|
| Full system | n | y | n | n | y | y |
| ISA | x86 | SPARC v9 (32-bit) | SPARC v9 (32-bit), ported to MIPS and x86 | MIPS | Alpha, ARM, PowerPC, MIPS, SPARC, x86-64 | Alpha, MIPS*, SPARC* |
| CPU architecture | MIPS | - | MIPS R10000 | closest to the MIPS R10000 | - | Alpha |
| ILP features | y | y | y | y | y | y |
| Multithreading | fine-grain, coarse-grain, SMT | SMT (only with Opal) | n | n | n | SMT (application-only mode, not tested) |
| *Multiprocessor features* | | | | | | |
| max. number of processors | tested with 128 | 256 | 121 | 64 | target dependent | 4 in FS (64 with modifications); in SE limited by system memory (tested with 128) |
| cache hierarchies | y | y | y | y | n | y |
| cache coherence protocols | MOESI | SLICC (Specification Language for Implementing Cache Coherence) | MSI, MESI | MESI | MESI | MOESI (not very accurate; allows different cache hierarchies) |
| snoopy protocols | y | y | n | y | n | y |
| directory protocols | n | y | y | y | n | n |
| memory consistency models | not specified | Sequential Consistency, Total Store Ordering (TSO) | Sequential, Processor, and Release Consistency | not specified | Sequential Consistency | weak consistency (detailed mode); Sequential Consistency (atomic/timing modes) |
| interconnection network | bus | hierarchical switch, crossbar, 2D torus, point-to-point, file-specified | wormhole-routed mesh | n | n | bus |
| *Other features* | | | | | | |
| Ethernet simulation | n | y | n | n | y | y |
| power consumption | n | Opal uses Wattch; Orion (Princeton) interconnection power model | n | Wattch, HotSpot | n | n |
| integration area estimation | n | n | n | y | n | n |
| *Effectiveness* | | | | | | |
| inputs (1 to 10) | 7 | 10 | 10 | 10 | 7 | 7 |
| outputs (1 to 10) | 7 | 10 | 10 | 7 | 7 | 10 |
| benchmarks | SPLASH-2, own benchmarks | SPLASH-2, own benchmarks | quicksort, SOR, SPLASH-2 | own benchmarks | SPLASH-2, own benchmarks | SPLASH-2, own benchmarks |
| *Programmer's view* | | | | | | |
| programming models | shared memory (POSIX Threads) | shared memory / message passing (Ethernet) | shared memory | shared memory | shared memory / message passing (Ethernet) | shared memory / message passing (Ethernet) |
| source code complexity | ~23K lines of code | Ruby: ~46K lines (~28K without LogTM and the Princeton network); Opal: ~76K lines | ~41K lines of code | ~700K lines of code (~178K without sesctherm) | N/A | ~91K lines of C++, ~21K lines of Python |
| programming languages | C | C++, Python | C/C++ (mainly C) | C++ | C/C++, Python | C++/Python |
| documentation (1 to 10) | 7 | 7 | 9 | 4 | 10 | 9 |
| community activity (1 to 10) | 5 | 9 | 4 | 4 | 10 | 10 |
| *Simulation techniques* | | | | | | |
| truncated execution | Run Z | Run Z, FF X + Run Z, FF X + WU Y + Run Z | Run Z | Run Z | Run Z, FF X + Run Z, FF X + WU Y + Run Z | Run Z, FF X + Run Z, FF X + WU Y + Run Z |
| direct execution | n | y | n | n | y | n |
| checkpointing | n | y | n | n | y | y |
| parallel simulation | n | y | n | n | y | n |

Conclusions

We conclude our evaluation of the six multiprocessor simulators with a few arguments that reflect both positive and negative aspects of each simulator, based on our experience with them.

Multi2Sim pros

Multi2Sim has a configurable multithreading feature. We were not that interested in multithreading support, but it is worth mentioning.


On the other hand, support for creating different cache hierarchies is something we consider quite useful. Finally, Multi2Sim runs with a large number of processors: we tested it with up to 128 (and stopped there because it takes significant resources), but we think it could work with even more.

Multi2Sim cons

This application-only simulator supports only one cache coherence protocol, MOESI. It has only a bus for the interconnection network, and it supports only snooping, not directory-based, protocols. The Multi2Sim documentation says nothing about which memory consistency model it uses; it probably uses Sequential Consistency, and we do not think other consistency models are supported. Additionally, since Multi2Sim was developed on top of SimpleScalar, we have doubts about its modularity. The source code is not large, but it is written in C.

Since GEMS uses Simics, we do not present separate Simics pros and cons. Simics is anyway more of a simulation framework on which simulators can be built. We think the Simics authors were mainly concerned with simulation speed rather than simulation accuracy, and indeed Simics is impressively fast. We managed to boot a Red Hat Linux 7.2 operating system with a graphical user interface; true, the mouse did not work, but the fact remains: Simics is very fast. It also supports a lot of target machines, which gives it a wide scope of application. We also think that Simics does not limit simulation accuracy: you may use Simics to build an accurate simulator, but you would naturally lose simulation speed. This is, in fact, GEMS' case.

GEMS pros

It can easily be observed that GEMS has two major advantages over the other simulators we evaluated: cache coherence and interconnection network support.

Defining a coherence protocol with GEMS is not easy, because coherence protocols are difficult to specify in general. However, we believe that, using SLICC and the already defined protocols, the task of implementing cache coherence becomes substantially easier. An example with a simple MI protocol shows how SLICC can be used. Additionally, there are a lot of coherence protocols already defined, both snooping and directory-based. The snooping protocols are invalidation-based; the authors say that update-based protocols are not supported. A random tester module is also provided for automatically testing cache coherence protocols.

GEMS also comes with a well-defined interface for network interconnections. It provides point-to-point, crossbar, hierarchical switch, and 2D torus network topologies. Additionally, there is a "file-specified" network that the authors claim allows us to easily define many network topologies. We do not know yet exactly how to use this feature, but we have enough examples that might help us.

Those are, in our opinion, the major advantages of GEMS, but this simulator has other good features:

- it uses Simics, which makes it reliable and able to benefit from all of Simics' features;
- it provides an out-of-order processor called Opal, which is independent from Ruby (we may use Opal only if we want to);
- it is modular enough to be extended; this is proven at least by Princeton University, which used GEMS to create a more advanced interconnection module that supports power consumption estimation;
- we think that the Ethernet simulation provided by Simics could support the Message Passing programming model (MPI);
- it provides separate statistics for the user and kernel operating modes.

GEMS cons

Unfortunately, GEMS does not appear to provide much support for memory consistency models. Besides Sequential Consistency (which is used by Simics), GEMS has some implementation of TSO, but we suspect that it was not tested and might not even be fully implemented.


About the network interconnection module, GEMS authors claim that it is "sufficient for coherence protocol and memory hierarchy research, but a more detailed model of the interconnection network may need to be integrated for research focusing on low-level interconnection network issues" [8].


RSIM pros

The main advantage of RSIM is clearly its extended support for memory consistency: this simulator appears to be the only one concerned with relaxed memory consistency models. It also uses directory-based cache coherence, with two protocols implemented, MSI and MESI. For the interconnection network, a 2D mesh with wormhole routing is provided. Most of RSIM's parameters are runtime-configurable: RSIM provides a lot of parameters for configuring the processors, cache memories, and interconnection network.

RSIM cons

This simulator is application-only, and although its source code complexity is not very high, RSIM might not be modular enough, since it is mainly written in C. The main problem (because otherwise this simulator left a good impression) is that it was released in 1997. Since then, in 2005, it was ported from SPARC to x86. We suspect that this simulator is no longer maintained and that community support is therefore unavailable. We had problems when trying to compile our own benchmarks for RSIM; actually, we have not managed to do this yet.

SESC pros

It has power consumption modeling integrated, but we could not install the simulator. We believe it is fast, faster than M5 at least, and probably about as accurate.

SESC cons

It seriously lacks documentation; we have not been able to understand its code structure so far. It seems it is no longer in active development, and there are not many users of this simulator. We cannot run any of our own benchmarks: the two test benchmarks provided do work, but when we try compiling other benchmarks using the cross compiler provided by the SESC developers, or another MIPS cross compiler taken from the Unisim site, errors occur. We tried to ask the developers some questions, but we were not able to use their mailing list. We did find out from the mailing list that SESC theoretically supports 64 processors and that there should not be a limit from this point of view. However, one of the SESC authors says: "We kept the libnet because it works fine, but the previous coherence had many problems. Currently, no sesc module uses libnet. The libsmp is simpler a ring based coherence, so no network is used". We have concluded from this that SESC does not really provide support for interconnection networks. This simulator is application-only.

M5 pros

M5 works flawlessly: no tricks are needed to install it, and everything works from the start. It may be the first simulator we have ever tried that installs cleanly, which, we believe, says a lot about the development team behind it. M5 is also well documented: the site is kept up to date, and there are a few tutorials (there could be more) and some wiki pages.


The big advantage, in our opinion, is the mailing list. The users are very active: we are currently subscribed to it and receive around three mails a day. Users ask questions, and we can see what they are developing (and maybe even join their work), while other users or the developers give answers. M5 is in constant development, so we may receive interesting new updates in the future; we know the developers are currently working on other interconnection networks.

The code seems quite well structured, and the simulator is modular. It is implemented in C++ and also uses Python for the modules that are not time-critical. Python is nicer and faster to program in than C++, so we consider this an advantage. At the time of writing, we have managed to change some Python scripts and thereby change the system architecture (cache parameters, cache hierarchy); these scripts are very easy to understand and modify. We also wrote a program in Python that resembles PThreadGUI. M5 has a lot of outputs, which can also be saved in a MySQL database.

One big advantage is the "execute-in-execute" model, which means that the execution of an instruction actually takes place in the execute stage of the pipeline. We believe this has strong implications for the coherence/consistency problem. Other simulators use the "execute-in-fetch" model, which executes an instruction as soon as it is brought into the pipeline and then applies a delay.

M5 cons

The C++ part of the code seems quite complicated. The simulator looks like a framework, which is why it is so complex (we are fairly sure you could use M5 to simulate the operations in a factory, or write any program that moves data between modules and needs timing information). This complexity is hidden, and using the simulator as its authors intended should not cause any problems; but we think that if we needed to modify the simulator to a greater extent, we might find ourselves blocked, or it would take a great amount of time.

M5 lacks support for more interesting interconnection networks: it only supports a bus. The cache coherence protocol is not modeled accurately; this was done so that any cache hierarchy configuration could be supported. Integrated power consumption modeling would also have been nice to have.

Now that we have started working with M5, we have found that it still has bugs and is not as good as it is "marketed". For example, in SE mode you cannot simulate the SPLASH benchmarks in detailed mode (the most accurate mode) with more than one core. We are still searching for a way around this problem. The solution seemed to be to simulate in timing or atomic mode for a number of instructions, create a checkpoint, and then simulate with the detailed model; but it was not: the simulator was not able to restore its state from a less detailed model. So we need another approach.

Finally, we would like to draw a few general conclusions about the evaluated simulators.

Multi2Sim is a simple enough simulator that could be used for some research in the field of multicore architectures. However, it does not support more capable interconnection networks, and it is application-only.

We definitely do not recommend SESC, another application-only simulator. The process of installing it brought back memories of the Unisim installation: you do not know what to do or where to start; we spent a lot of time trying to install it and did not get it right. Still, if we have some time, we should try to make it work; the power integration might be useful in the future.

RSIM is application-only too, but this simulator left a good impression. It models a NUMA architecture instead of a UMA one, it is highly configurable, and it addresses relaxed memory consistency models. On the other hand, although it has a documentation manual, we think this simulator is no longer supported by the community, and we do not consider it very modular.

Initially, our impression of M5 was: "We strongly believe that this is the best simulator we have and the easiest to use." Now we have mixed feelings: we found bugs that have remained unsolved for almost a year, and not everything the M5 authors promised in their presentation actually works, or it works but not in every configuration. But we expect this kind of surprise in any simulator.

Finally, Simics with GEMS left the best impression on us. GEMS, like M5, is a full-system simulator. We like M5 too, but GEMS seems more developed and provides more features. The support for cache coherence protocols is substantial, because GEMS was mainly intended for studying coherence protocols. Also, GEMS has more support for interconnection networks than M5, and it relies on the very fast Simics. One small drawback of GEMS is licensing: GEMS is based on Simics, so you automatically have to accept the Simics license, which is not "free software". From this perspective, M5 has an advantage, because it adopts a Berkeley-style open-source license (BSD), which means you can do almost whatever you like with the software; you still have to keep some headers, but that is it.

References

[1] Joshua J. Yi, David J. Lilja, "Simulation of Computer Architectures: Simulators, Benchmarks, Methodologies, and Recommendations", IEEE Transactions on Computers, Vol. 55, No. 3, March 2006

[2] http://www.gap.upv.es/~raurte/tools/multi2sim.html

[3] http://rsim.cs.uiuc.edu/rsim/

[4] http://sesc.sourceforge.net/

[5] http://iacoma.cs.uiuc.edu/~paulsack/sescdoc/

[6] http://www.m5sim.org/wiki/index.php/Main_Page

[7] https://www.simics.net/

[8] http://www.cs.wisc.edu/gems/

[9] http://www.m5sim.org/wiki/index.php/O3CPU
