A comparison of three architectures: Superscalar, Simultaneous
Multithreading CPUs and Single-Chip Multiprocessor.
Recent years have seen a great deal of interest in multiple-issue machines or superscalar
processors, processors that can issue several mutually independent instructions in the same cycle. These
machines exploit the parallelism that programs exhibit at the instruction level. The superscalar processor
designs dynamically extract parallelism by executing many instructions within a single, sequential
program in parallel. To find independent instructions within a sequential sequence of instructions, or
thread of control, today’s processors increasingly make use of sophisticated architectural features.
Examples are out-of-order instruction execution and speculative execution of instructions after branches
predicted with dynamic hardware branch prediction techniques. However, it is important to know how
much parallelism is available in typical applications. Machines providing a high degree of multiple-issue
would be of little use if applications did not display that much parallelism. The available parallelism
depends strongly on how hard we are willing to work to find it.
Future performance improvements will require processors to be enlarged to execute more
instructions per clock cycle. One key technique is speculative execution: issuing an instruction whose
data dependences are satisfied but whose control dependences are not. That is, we issue a potential
future instruction early even though an intervening branch may send us in another direction entirely.
However, reliance on a single
thread of control limits the parallelism available for many applications, and the cost of extracting
parallelism from a single thread is becoming prohibitive. This cost manifests itself in numerous ways,
including increased die area and longer design and verification times. In general, we see diminishing
returns when trying to extract parallelism from a single thread. To continue this trend will trade only
incremental performance increases for large increases in overall complexity.
Exploiting Parallelism:
Parallelism exists at multiple levels in modern systems. Parallelism between individual,
independent instructions in a single application is instruction-level parallelism (ILP). Loop-level
parallelism results when the instruction-level parallelism comes from data-independent loop iterations.
The finite number of instructions that can be examined at once by hardware looking for instruction level
parallelism to exploit is called the instruction window size. Compilers, which have essentially infinite
virtual instruction windows as they generate code, can help increase usable parallelism by reordering
instructions. Instructions are reordered so that instructions that can be issued in parallel are close to each
other in executable code, allowing the hardware’s finite window to detect the resulting instruction-level
parallelism. Some compilers can also divide a program into multiple threads of control, exposing thread-
level parallelism (TLP). This form of parallelism simulates a single, large, hardware instruction window
by allowing multiple, smaller instruction windows—one for each thread—to work together on one
application. A third form of very coarse parallelism, process-level parallelism, involves completely
application. A third form of very coarse parallelism, processlevel parallelism, involves completely
independent applications running in independent processes controlled by the operating system.
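As a concrete illustration of these levels (a minimal C++ sketch; the loop body, array size, and thread count are hypothetical), the data-independent loop below exhibits loop-level parallelism, and splitting its iterations across threads exposes that same parallelism as TLP, with one instruction window per thread:

    #include <cstddef>
    #include <thread>
    #include <vector>

    // Each iteration writes only out[i], so iterations are data-independent:
    // classic loop-level parallelism, visible to hardware as ILP.
    void scale(std::vector<double>& out, const std::vector<double>& in,
               std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i)
            out[i] = 2.0 * in[i];
    }

    int main() {
        const std::size_t n = 1 << 20;
        std::vector<double> in(n, 1.0), out(n);

        // Expose the same work as TLP: each thread gets its own chunk.
        const unsigned t = 4;  // hypothetical thread count
        std::vector<std::thread> workers;
        for (unsigned k = 0; k < t; ++k)
            workers.emplace_back(scale, std::ref(out), std::cref(in),
                                 k * n / t, (k + 1) * n / t);
        for (auto& w : workers) w.join();
    }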
In the future, we expect thread and process parallelism to become widespread, for two reasons:
the nature of the applications and the nature of the operating system. As a result, researchers have
proposed two alternative microarchitectures that exploit multiple threads of control: simultaneous
multithreading (SMT)[1, 3] and chip multiprocessors (CMP)[4, 7, 9].
Simultaneous multithreading is a technique permitting several independent threads to issue
instructions to a superscalar's multiple functional units in a single cycle. It is a processor design that
combines hardware multithreading with superscalar processor technology to allow multiple threads to
issue instructions each cycle.
Chip multiprocessors (CMPs) use relatively simple single-thread processor cores to exploit only
moderate amounts of parallelism within any one thread, while executing multiple threads in parallel
across multiple processor cores[5]. Architecturally, CMPs resemble today's multichip SMP machines,
but having multiple CPUs on a single chip speeds up data transactions among processors. This speedup
makes a CMP faster than a conventional multichip multiprocessor at running parallel programs, especially
when threads communicate frequently.
Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single
program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on
different processors. Unfortunately, both parallel-processing styles statically partition processor
resources, thus preventing them from adapting to dynamically-changing levels of ILP and TLP in a
program. With insufficient TLP, processors in an MP will be idle; with insufficient ILP, multiple-issue
hardware on a superscalar is wasted. Simultaneous multithreading (SMT) [Tullsen et al. 1995; 1996;
Gulati et al. 1996] allows multiple threads to compete for and share available processor resources every
cycle. One of its key advantages when executing parallel applications is its ability to use thread-level
parallelism and instruction-level parallelism interchangeably.
Software trends favor multithreaded programming for its various benefits: multiprocessor
systems can provide multiple simultaneous points of execution, and with the help of the operating system,
independent threads can run on independent processors simultaneously. However, the need to limit the
effects of interconnect delays, which are shrinking much more slowly than transistor gate delays, and the
ability to exploit the increasing transistor count on a chip both favor CMPs[5].
1. Trends in Multiprocessor Architecture:
The major trend in commercial microprocessor architecture is the use of complex architectures to
exploit ILP. Two approaches are used to exploit ILP: superscalar and Very Long Instruction Word
(VLIW). Both approaches attempt to issue multiple instructions to independent functional units at every
clock cycle. A superscalar processor uses hardware to dynamically find data-independent instructions in
an instruction window and issue them to independent functional units. VLIW, on the other hand, relies on
the compiler to find ILP and schedule the execution of independent instructions statically.
Superscalar is more appealing in commercial microprocessors because it can improve the
performance of existing application binaries[7]. However, superscalar processors are complex to design
and difficult to implement. Looking for parallelism in a large instruction window requires a significant
amount of hardware and usually does not improve performance as much as one might expect. Due to this
complexity, it is difficult not only to get the architecture correct but also to optimize the pipeline and
circuits to achieve a high clock frequency.
VLIW, on the other hand, relies on the compiler to find bundles of independent instructions. Since
VLIW does not require hardware for dynamic scheduling, it can be much simpler to design and implement.
However, it requires significant compiler support, such as trace scheduling, to find the ILP in an application
program. VLIW is preferred over superscalar when the issue width is so large that the dynamic
scheduling hardware of a superscalar becomes too complex and expensive to implement. However, such
a wide-issue VLIW machine has a centralized register file that must have many ports to supply operands to
independent functional units. The access time of the register file and the complexity of the buses connecting
it to the functional units may limit clock frequency. Another disadvantage of VLIW machines is that they
cannot use precompiled binaries, which is a problem when the source code is not available. VLIW also
forces a bundle of instructions to execute together: if one instruction in the bundle stalls, the other
instructions in the bundle must stall too. This limits VLIW's ability to deal with unpredictable events such
as data accesses that cause cache misses.
Currently, most commercial microprocessors, such as the Intel Pentium, Compaq Alpha 21264,
IBM PowerPC 620, Sun UltraSPARC, HP PA-8000 and MIPS R10000, use the superscalar design technique.
The performance of these microprocessors has been improving at a phenomenal rate for decades. This
performance growth has been driven by (1) innovation in compilers, (2) improvements in
architecture and (3) tremendous improvements in VLSI technology. The latest superscalar
microprocessors can execute four to six instructions concurrently using many nontrivial techniques,
including dynamic branch prediction, out-of-order execution, and speculative execution. However, the
expected speedup may not be achieved with these techniques because of the limits on instruction
window size and on the ILP in a typical program. Moreover, considerable design effort is required to
develop such a high-performance microprocessor. Therefore, developing a complex wide-issue superscalar
microprocessor as the next-generation microprocessor may not be an efficient approach to satisfying the
required performance.
• Superscalar Bottlenecks: Where Have All the Cycles Gone?
Figure 1 gives the issue utilization, i.e., the percentage of issue slots that are filled each cycle, for
most of the SPEC benchmarks. The cause of each empty issue slot is also recorded. For example, if the
next instruction cannot be scheduled in the same cycle as the current instruction, then the remaining issue
slots this cycle, as well as all issue slots for idle cycles between the execution of the current instruction
and the next (delayed) instruction, are assigned to the cause of the delay. When there are overlapping
causes, all cycles are assigned to the cause that delays the instruction the most; if the delays are additive,
such as an I TLB miss and an I cache miss, the wasted cycles are divided up appropriately[1].
Thus it can be seen that the functional units in the wide superscalar used here are highly underutilized.
These results also indicate that there is no dominant source of wasted issue bandwidth. Although there are
dominant items in individual applications (e.g., mdljsp2, swm, fpppp), the dominant cause is different in
each case. In the composite results we see that the largest cause (short FP dependences) is responsible for
37% of the issue bandwidth, but there are six other causes that account for at least 4.5% of wasted cycles.
Even completely eliminating any one factor will not necessarily improve performance to the degree that
this graph might imply, because many of the causes overlap. Not only is there no dominant cause of
wasted cycles — there appears to be no dominant solution. If specific latency-hiding techniques are
limited, then any dramatic increase in parallelism needs to come from a general latency-hiding solution,
of which multithreading or multiprocessing are examples.
Table 1 lists the possible causes of wasted issue slots, along with the latency-hiding or latency-reducing
techniques that can reduce the number of cycles wasted by each cause.
2. Hardware Multithreading:
Increasing miss rates and increasing latency of cache misses are having a compounding effect on
the portion of execution time that is wasted on cache misses. One solution to this problem is coarse-grained
multithreading, which enables the processor to perform useful instructions during cache misses.
• Why are miss rates and cache-miss latencies increasing?
Workload Characteristics:
Consider, for instance, server workloads, which represent market segments such as on-line transaction
processing (OLTP), business intelligence, enterprise resource planning (ERP), web serving, and
collaborative groupware. These applications are often large and function-rich; they use a large number of
operating system services and access large databases. These characteristics make the instruction and data
working sets large. These workloads are also inherently multi-user and multitasking. The large working
set and high frequency of task switches cause the cache-miss rates to be high. In addition, research in this
area points out that such applications can also have data that is frequently read–write shared. In
multiprocessors, this can make the miss rates significantly higher. Also, because of the large instruction
working set, branch-prediction rates can be poor. These characteristics are all detrimental to the
performance of the processor.
Application Characteristics:
Current trends in application characteristics and languages are likely to make this worse. Object-
oriented programming with languages such as C++ and Java has been popular for several years and is
increasing in popularity. Virtual-function pointers are a feature of these languages that did not exist in the
languages used in older applications. Virtual-function pointers lead to indirect branches that can have very
poor branch prediction rates. The frequency of dynamic memory allocation in these languages is also
higher than in older languages, which leads to more allocation of memory from the heap. Memory from
the heap is more scattered than memory from the stack, which can cause higher cache-miss rates. Java
also does “garbage collection.” Garbage collection has access patterns that lead to poor cache-miss rates
because it references many objects and uses each only a small number of times. All of these factors are
causing the already high miss rates to become even higher.
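As a small illustration (a hedged C++ sketch; the class names are hypothetical), the pattern described above looks like this: each call through a base-class pointer is an indirect branch whose target depends on the object's dynamic type, and each object is a separate heap allocation, so a traversal scatters across the heap:

    #include <memory>
    #include <vector>

    // Hypothetical hierarchy: a call through Shape* is an indirect branch.
    struct Shape {
        virtual ~Shape() = default;
        virtual double area() const = 0;
    };
    struct Circle : Shape {
        double r = 1.0;
        double area() const override { return 3.14159 * r * r; }
    };
    struct Square : Shape {
        double s = 1.0;
        double area() const override { return s * s; }
    };

    double total(const std::vector<std::unique_ptr<Shape>>& shapes) {
        double sum = 0.0;
        for (const auto& p : shapes)
            sum += p->area();  // branch target varies per element (hard to
                               // predict); each object is a separate heap
                               // allocation, so accesses are scattered
        return sum;
    }

    int main() {
        std::vector<std::unique_ptr<Shape>> shapes;
        shapes.push_back(std::make_unique<Circle>());  // heap allocation
        shapes.push_back(std::make_unique<Square>());  // heap allocation
        return total(shapes) > 0.0 ? 0 : 1;
    }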
Faster clock rates:
A large portion of the execution time can already be spent on cache misses and branch
mispredictions. The trend in processor microarchitecture is toward decreasing cycle time at a faster rate
than the decrease in memory access time. This is causing the number of processor cycles for a cache-miss
latency to increase. For a given miss rate, this causes the portion of the execution time due to cache
misses to become larger. This trend, combined with the trend toward higher miss rates in workloads that
already have high miss rates, causes a compounding effect on the cycles-per-instruction (CPI) increase
due to cache misses.
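This compounding effect can be captured with the standard, simplified CPI decomposition; the numbers below are illustrative assumptions, not measurements from the cited studies:

\[
\mathrm{CPI}_{\text{total}} = \mathrm{CPI}_{\text{core}} + m \cdot p ,
\]

where m is the number of cache misses per instruction and p is the miss penalty in processor cycles. If a faster clock raises p from 50 to 100 cycles while m = 0.02 stays fixed, the memory-stall term doubles from 1.0 to 2.0 CPI; if the workload trends above also push m to 0.03, that term becomes 3.0 CPI, even though CPI_core never changed.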
• Multithreading:
In a multithreaded processor, the processor holds the state of several tasks/threads. The several
threads provide additional instruction-level parallelism, enabling the processor to better utilize all of its
resources. When one of the threads would normally be stalled, instructions from the other threads can
utilize the processor’s resources. The observation that cache misses were becoming a very large portion
of the execution time led to the investigation of multithreaded hardware as a way to execute useful
instructions during cache misses.
In fine-grained multithreading, a different thread is executed every cycle. While fine-grained
multithreading covers control and data dependencies quite well (although this may require more than two
threads), the impact of cycle interleaving on single-task performance was deemed too large.
In coarse-grained multithreading, a single thread, called the foreground thread, executes until
some long-latency event such as a cache miss occurs, causing execution to switch to the background
thread. If there are no such events, a single thread can consume all execution cycles. This minimizes the
impact on single-task execution speed, making it performance-competitive with non-multithreaded
processors. For a processor that executes instructions in order, coarse-grained multithreading is therefore
the natural choice.
As defined earlier, simultaneous multithreading combines hardware multithreading with superscalar
processor technology so that several independent threads can issue instructions to a superscalar's multiple
functional units in a single cycle; it suits a deeply pipelined, out-of-order execution processor. Because
simultaneous multithreading builds directly on superscalar technology, it is straightforward to compare the
performance of a simultaneous multithreaded processor with that of a superscalar processor. So I chose the
simultaneous multithreaded processor for my study.
• Simultaneous Multithreading (SMT):
Multiple instruction issue has the potential to increase performance, but is ultimately limited by
instruction dependencies (i.e., the available parallelism) and long-latency operations within the single
executing thread. The effects of these are shown as horizontal waste and vertical waste in Figure 2.
Multithreaded architectures, on the other hand, such as HEP [28], Tera [3], MASA [15] and Alewife [2],
employ multiple threads with fast context switch between threads. Traditional multithreading hides
memory and functional unit latencies, attacking vertical waste. In any one cycle, though, these
architectures issue instructions from only one thread. The technique is thus limited by the amount of
parallelism that can be found in a single thread in a single cycle. And as issue width increases, the ability
of traditional multithreading to utilize processor resources will decrease. Simultaneous multithreading, in
contrast, attacks both horizontal and vertical waste.
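The distinction can be made concrete with a toy issue-slot model (a minimal C++ sketch; the thread count, readiness distribution, and issue width are invented for illustration, not taken from the cited studies). Each cycle every thread has between 0 and 4 ready instructions, with 0 modeling a stall: a single-threaded superscalar suffers both kinds of waste, fine-grained multithreading removes vertical waste by picking a non-stalled thread, and SMT also removes horizontal waste by filling leftover slots from other threads:

    #include <algorithm>
    #include <array>
    #include <cstdio>
    #include <random>

    int main() {
        constexpr int W = 8, T = 4, N = 100000;  // issue width, threads, cycles
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> ready(0, 4);  // 0 models a stall

        long ss = 0, fg = 0, smt = 0;
        for (int c = 0; c < N; ++c) {
            std::array<int, T> r;
            for (int& x : r) x = ready(rng);

            // Superscalar: one fixed thread; a stall wastes the whole cycle
            // (vertical waste), fewer than W ready wastes slots (horizontal).
            ss += std::min(r[0], W);

            // Fine-grained MT: one thread per cycle, round-robin among
            // non-stalled threads; hides vertical waste only.
            for (int k = 0; k < T; ++k) {
                int t = (c + k) % T;
                if (r[t] > 0) { fg += std::min(r[t], W); break; }
            }

            // SMT: fills leftover slots from other threads in the same cycle.
            int slots = W;
            for (int x : r) { int take = std::min(x, slots); smt += take; slots -= take; }
        }
        std::printf("avg issue/cycle: superscalar %.2f, fine-grained %.2f, SMT %.2f\n",
                    ss / double(N), fg / double(N), smt / double(N));
    }

Under this toy model the superscalar averages about 2 instructions per cycle, fine-grained multithreading somewhat more, and SMT comes much closer to the full issue width.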
Simultaneous multithreading (SMT) allows multiple threads to compete for and share all of the
processor's resources every cycle. By permitting multiple threads to share the processor's functional units
simultaneously, the processor can use both ILP and TLP to accommodate variations in parallelism. When
a program has only a single thread, i.e., it lacks TLP, all of the SMT processor's resources can be dedicated
to that thread; when more TLP exists, this parallelism can compensate for a lack of per-thread ILP. An SMT
processor can thus exploit whichever type of parallelism is available, utilizing the functional
units more effectively to achieve the goals of greater throughput and significant program speedups.
• Performance of Simultaneous Multithreading: (the results in this section are based on the observations in [1] )
This section presents performance results for simultaneous multithreaded processors. Several
machine models have been defined for simultaneous multithreading, spanning a range of hardware
complexities. It is also shown that simultaneous multithreading provides significant performance
improvement over both single-thread superscalar and fine-grain multithreaded processors, both in the
limit, and also under less ambitious hardware assumptions.
Table 3: Details of the Cache Hierarchy

                        I Cache    D Cache    L2 Cache    L3 Cache
  Size                  64 KB      64 KB      256 KB      2 MB
  Associativity         DM         DM         4-way       4-way
  Line size (bytes)     32         32         32          32
  Banks                 8          8          4           1
  Transfer time/bank    1 cycle    1 cycle    2 cycles    2 cycles

Table 4: Simulated Instruction Latencies

  Instruction class                           Latency (cycles)
  integer multiply                            8, 16
  conditional move                            2
  compare                                     0
  all other integer                           1
  FP divide                                   17, 30
  all other FP                                4
  load (L1 cache hit, no bank conflicts)      2
  load (L2 cache hit)                         8
  load (L3 cache hit)                         14
  load (memory)                               50
  control hazard (br or jmp predicted)        1
  control hazard (br or jmp mispredicted)     6
The Machine Models:
The following models reflect several possible design choices for a combined multithreaded,
superscalar processor. The models differ in how threads can use issue slots and functional units each
cycle; in all cases, however, the basic machine is a wide superscalar with 10 functional units capable of
issuing 8 instructions per cycle (the same core machine described below). The models are:
Fine-Grain Multithreading. Only one thread issues instructions each cycle, but it can use the entire
issue width of the processor. This hides all sources of vertical waste, but does not hide horizontal waste. It
is the only model that does not feature simultaneous multithreading. Among existing or proposed
architectures, this is most similar to the Tera processor [3], which issues one 3-operation LIW instruction
per cycle.
SM: Full Simultaneous Issue. This is a completely flexible simultaneous multithreaded superscalar: all
eight threads compete for each of the issue slots each cycle. This is the least realistic model in terms of
hardware complexity, but it provides insight into the potential for simultaneous multithreading. The
following models each represent restrictions to this scheme that decrease hardware complexity.
SM: Single Issue, SM: Dual Issue, and SM: Four Issue. These three models limit the number of
instructions each thread can issue, or have active in the scheduling window, each cycle. For example, in
an SM: Dual Issue processor, each thread can issue a maximum of 2 instructions per cycle; therefore, a
minimum of 4 threads would be required to fill the 8 issue slots in one cycle.
SM: Limited Connection. Each hardware context is directly connected to exactly one of each type of
functional unit. For example, if the hardware supports eight threads and there are four integer units, each
integer unit could receive instructions from exactly two threads. The partitioning of functional units
among threads is thus less dynamic than in the other models, but each functional unit is still shared (the
critical factor in achieving high utilization). Since the choice of functional units available to a single
thread is different than in the original target machine, recompilation is done for a 4-issue (one of each
type of functional unit) processor for this model. Table 2 shows the important differences in hardware
implementation complexity.
The simulator models the execution pipelines, the memory hierarchy (both in terms of hit rates
and bandwidths), the TLBs, and the branch prediction logic of a wide superscalar processor. It is based on
the Alpha AXP 21164, augmented first for wider superscalar execution and then for multithreaded
execution. The typical simulated configuration contains 10 functional units of four types (four integer,
two floating point, three load/store and one branch) and a maximum issue rate of 8 instructions per cycle.
We assume that all functional units are completely pipelined. Tables 3 and 4 show details of the cache
hierarchy and the simulated instruction latencies, respectively. Figure 3 shows the performance of the
various models as a function of the number of threads.
Observations:
• Each of these models becomes increasingly competitive with full simultaneous issue as the ratio of threads to
issue slots increases.
• The increase in processor utilization is a direct result of threads dynamically sharing processor resources that
would otherwise remain idle much of the time.
• The lowest-priority thread (at 8 threads) runs at 55% of the speed of the highest-priority thread.
• Competition for non-execution resources plays nearly as significant a role in this performance region as the
competition for execution resources.
• Caches are more strained by a multithreaded workload than by a single-threaded workload, due to the decrease
in locality.
• Sharing of caches is the dominant effect in the wasted issue cycles.
• Data TLB waste also increases.
• Total speedups remain relatively constant across a wide range of cache sizes.
• Instruction throughput of the various SM models is somewhat hampered by the sharing of caches and TLBs.
• Cache Design for a Simultaneous Multithreaded Processor:
The measurements show a performance degradation due to cache sharing in simultaneous
multithreaded processors. In this section, the cache problem is explored further. The study focuses on the
organization of the first-level (L1) caches, comparing the use of private per-thread caches to shared
caches for both instructions and data. (It is assumed that the L2 and L3 caches are shared among all
threads.) All experiments use the 4-issue model with up to 8 threads. Not all of the private caches will be
utilized when fewer than eight threads are running. Figure 4 exposes several interesting properties for
multithreaded caches. It is seen that shared caches optimize for a small number of threads (where the few
threads can use all available cache), while private caches perform better with a large number of threads.
For example, the 64s.64s cache ranks first among all models at 1 thread and last at 8 threads, while the
64p.64p cache gives nearly the opposite result. (In this naming scheme the first field describes the
instruction cache and the second the data cache; s denotes a 64 KB cache shared by all threads, and p
denotes eight private per-thread 8 KB caches, 64 KB in total.) However, the tradeoffs are not the same for
both instructions and data. A shared data cache outperforms a private data cache over all numbers of threads
(e.g., compare 64p.64s with 64p.64p), while instruction caches benefit from private caches at 8 threads.
One reason for this is the differing access patterns between instructions and data. Private I caches
eliminate conflicts between different threads in the I cache, while a shared D cache allows a single thread
to issue multiple memory instructions to different banks.
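To make the banking point concrete (a minimal sketch; the line size and bank count follow the 8-bank, 32-byte-line caches of Table 3): consecutive cache lines map to different banks, so one thread's simultaneous loads usually fall in distinct banks of a shared data cache:

    #include <cstdint>
    #include <cstdio>

    // 8-bank cache with 32-byte lines: bank = line address mod 8.
    constexpr std::uint64_t kLineBytes = 32, kBanks = 8;

    unsigned bank_of(std::uint64_t addr) {
        // The low bits of the line address select the bank.
        return static_cast<unsigned>((addr / kLineBytes) % kBanks);
    }

    int main() {
        // Two loads from one thread to consecutive lines land in adjacent
        // banks, so a shared banked D cache can service both in one cycle.
        std::uint64_t a = 0x1000;
        std::printf("addr 0x%llx -> bank %u, addr+32 -> bank %u\n",
                    (unsigned long long)a, bank_of(a), bank_of(a + 32));
    }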
There are two configurations that appear to be good choices. Because there is little performance
difference at 8 threads, the cost of optimizing for a small number of threads is small, making 64s.64s an
attractive option. However, typically operating with all or most thread slots full, the 64p.64s gives the
best performance in that region and is never worse than the second best performer with fewer threads. The
shared data cache in this scheme allows it to take advantage of more flexible cache
partitioning, while the private instruction caches make each thread less sensitive to the presence of other
threads. Shared data caches also have a significant advantage in a data-sharing environment by allowing
sharing at the lowest level of the data cache hierarchy without any special hardware for cache coherence.
For SMT processors, potential bottlenecks may occur in the fetch stages, particularly when
instructions from different blocks are fetched simultaneously, causing contention at the instruction cache.
Furthermore, the cache size becomes more critical as the threads share the same cache [1]. In addition to
memory I/O, the pipeline is lengthened by two stages for reading and writing the larger register file. The
increase in pipeline length places potential strain on the branch prediction unit. However, single-thread
performance degraded by only 2% with the insertion of these two stages [1, 13, 14].
SMT provides an option by which a processor can exploit TLP. Threads are executed in parallel
by scheduling instructions from multiple threads simultaneously. This is done to increase usage of the
functional units already present in multiple-issue processors. Logically, SMT is a chip multiprocessor
(CMP) in which all of the functional units are pooled to allow very flexible scheduling. Unlike in a CMP,
threads on an SMT system share the same caches.
Presently, SMT technologies are scheduled to be used in the upcoming Pentium IV and future Alpha
processors. While SMT is transparent to the user, applications must be multithreaded so that TLP can be
exploited; only multithreaded applications can take full advantage of SMT-capable processors. In
particular, such hardware thread support facilitates improved performance through fine-grained threading
of programs, which attempts to expose TLP wherever possible by threading every independent unit of
work. With the imminent arrival of SMT support in commercial microprocessors, multithreaded programs
will be needed to take advantage of these enhancements.
3. Chip Multiprocessor:
CMPs use relatively simple single-thread processor cores to exploit only moderate amounts of
parallelism within any one thread, while executing multiple threads in parallel across multiple processor
cores.
• Implementation technology concerns that favor CMPs:
Today, as most microprocessor designers use the increased transistor budgets to build larger and
more complex uniprocessors, several problems are beginning to make this approach to microprocessor
design difficult to continue. To address these problems, the future processor design methodology is
shifting from simply making progressively larger uniprocessors to implementing more than one
processor on each chip. The following discusses the key reasons why single-chip microprocessors are a
good idea.
Parallelism
Superscalar processors can extract greater amounts of instruction-level parallelism, or ILP, by
finding nondependent instructions that occur near each other in the original program code. Designers
primarily use additional transistors on chips to extract more parallelism from programs to perform more
work per clock cycle. Unfortunately, there is only a finite amount of ILP present in any particular
sequence of instructions that the processor executes because instructions from the same sequence are
typically highly interdependent. As a result, processors that use this technique are seeing diminishing
returns as they attempt to execute more instructions per clock cycle, even as the logic required to process
multiple instructions per clock cycle increases quadratically.
A CMP avoids this limitation by primarily using a completely different type of parallelism:
thread-level parallelism. A CMP may also exploit small amounts of ILP within each of its individual
processors, since ILP and TLP are orthogonal to each other.
Wire delay
As CMOS gates become faster and chips become physically larger, the delay caused by
interconnects between gates is becoming more significant. Due to rapid process technology improvement,
within the next few years wires will only be able to transmit signals over a small portion of large
processor chips during each clock cycle. However, a CMP can be designed so that each of its small
processors takes up a relatively small area on a large processor chip, minimizing the length of its wires
and simplifying the design of critical paths. Only the more infrequently used, and therefore less critical,
wires connecting the processors need to be long.
Design time
Processors are already difficult to design. Larger numbers of transistors, increasingly complex
methods of extracting ILP, and wire delay considerations will only make this worse. A CMP can help
reduce design time, however, because it allows a single, proven processor design to be replicated multiple
times over a die. Each processor core on a CMP can be much smaller than a competitive uniprocessor,
minimizing the core design time. Also, a core design can be used over more chip generations simply by
scaling the number of cores present on a chip. Only the processor interconnection logic is not entirely
replicated on a CMP.
• Why aren’t CMPs used now?
Although a CMP addresses all of these potential problems in a straightforward, scalable manner, the
reasons CMPs are not yet common are as follows:
Integration densities are just reaching levels where these problems are becoming significant
enough to consider a paradigm shift in processor design. The primary reason, however, is because it is
very difficult to convert today’s important uniprocessor programs into multiprocessor ones. Conventional
multiprocessor programming techniques typically require careful data layout in memory to avoid conflicts
between processors, minimization of data communication between processors, and explicit
synchronization at any point in a program where processors may actively share data. A CMP is much less
sensitive to poor data layout and poor communication management, since the interprocessor
communication latencies are lower and bandwidths are higher. However, sequential programs must still
be explicitly broken into threads and synchronized properly.
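For example (a minimal sketch; the shared counter is a hypothetical stand-in for actively shared data), each point where threads may touch the same data needs explicit synchronization once a sequential program is broken into threads:

    #include <mutex>
    #include <thread>

    long counter = 0;   // data actively shared by both threads
    std::mutex m;

    void add(int n) {
        for (int i = 0; i < n; ++i) {
            std::lock_guard<std::mutex> lock(m);  // explicit synchronization:
            ++counter;                            // without it, this update races
        }
    }

    int main() {
        std::thread a(add, 100000), b(add, 100000);
        a.join(); b.join();
        return counter == 200000 ? 0 : 1;  // correct only because of the lock
    }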
Parallelizing compilers have been only partially successful at automatically handling these tasks
for programmers. As a result, acceptance of multiprocessors has been slowed because only a limited
number of programmers have mastered these techniques.
• The architectures’ major design considerations, in qualitative terms:
CPU cores:
To keep the processors’ execution units busy, the superscalar and SMT processors as shown
above are assumed to feature advanced branch prediction, register renaming, out-of-order instruction
issue, and nonblocking data caches. As a result, the processors have numerous multiported rename
buffers, issue queues, and register files. The inherent complexity of these architectures results in three
major hardware design problems:
First, their area increases quadratically with the core’s complexity. The number of registers in each structure
must increase proportionally to the instruction window size. Additionally, the number of ports on each
register must increase proportionally to the processor’s issue width.
The CMP approach minimizes this problem because it attempts to exploit higher levels of
parallelism using more processors instead of a larger issue width within a single
processor. This results in an approximately linear area-to-issue-width relationship, since the area of each
additional processor is essentially constant and each adds a constant number of issue slots. Using this
relationship, an 8 x 2-issue CMP (16 total issue slots) has an area similar to that of a single 12-issue
processor.
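A back-of-the-envelope version of this scaling argument (the symbols and constants are illustrative assumptions, not figures from the cited work):

\[
A_{\text{superscalar}}(w) \approx k\,w^{2}, \qquad
A_{\text{CMP}}(n, c) \approx n\,(A_{0} + k\,c^{2}),
\]

where w is the issue width, n the number of cores, c the per-core issue width, A_0 a fixed per-core overhead, and k a technology constant. The CMP total grows linearly in the number of cores (and hence in total issue slots, n times c), while the superscalar term grows quadratically: the issue-dependent area of eight 2-issue cores is 8 x k x 4 = 32k, versus k x 144 = 144k for a single 12-issue core, which is how the 16-slot CMP can fit in roughly the die area of one 12-issue processor once per-core overhead and interconnect are added.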
Second, they can require longer cycle times. Long, high-capacitance I/O wires span the large buffers, queues,
and register files. Extensive use of multiplexers and crossbars to interconnect these units adds more
capacitance. Delays associated with these wires will probably dominate the delay along the CPU’s critical
path. The cycle time impact of these structures can be mitigated by careful design using deep pipelining,
by breaking up the structures into small, fast clusters of closely related components connected by short
wires, or both. But deeper pipelining increases branch misprediction penalties, and clustering tends to
reduce the ability of the processor to find and exploit instruction-level parallelism.
The CMP approach allows a fairly short cycle time to be targeted with relatively little design effort, since
its hardware is naturally clustered — each of the small CPUs is already a very small, fast cluster of
components. Since the operating system allocates a single software thread of control to each processor,
the partitioning of work among the “clusters” is natural and requires no hardware to dynamically allocate
instructions to different component clusters. This heavy reliance on software to direct instructions to
clusters limits the amount of instruction-level parallelism that can be dynamically exploited by the entire
CMP, but it allows the structures within each CPU to be small and fast.
Since these factors are difficult to quantify, the evaluated superscalar and SMT architectures
represent how these systems would perform if it were possible to build an optimal implementation with a
fairly shallow pipeline and no clustering, a combination that would result in an unacceptably low clock
cycle time in reality. This probably gives the CMP a handicap in the simulations.
Third, the CPU cores are complicated and composed of many closely interconnected components. As
a result, design and verification costs will increase since they must be designed and verified as single,
large units.
The CMP architecture uses a group of small, identical processors. This allows the design and
verification costs for a single CPU core to be lower, and amortizes those costs over a larger number of
processor cores. It may also be possible to utilize the same core design across a family of processor
designs, simply by including more or fewer cores.
With even more advanced IC technologies, the logic, wire, and design complexity advantages
will increasingly favor a multiprocessor implementation over a superscalar or SMT implementation.
Memory:
A 12-issue superscalar or SMT processor can place large demands on the memory system. For
example, to handle load and store instructions quickly enough, the processors would require a large
primary data cache with four to six independent ports. The SMT processor requires more bandwidth from
the primary cache than the superscalar processor, because its multiple independent threads will typically
allow the core to issue more loads and stores in each cycle, some from each thread. To accommodate
these accesses, the superscalar and SMT architectures have 128-Kbyte, multibanked primary caches with
a two-cycle latency due to the size of the primary caches and the bank interconnection complexity.
The CMP architecture features sixteen 16-Kbyte caches: the eight cores are completely
independent and tightly integrated with their individual pairs of caches, another form of clustering, which
leads to a simple, high-frequency design for the primary cache system. The small cache size and tight
connection to these caches allows single-cycle access. The rest of the memory system remains essentially
unchanged, except that the secondary cache controller must add two extra cycles of secondary cache
latency to handle requests from multiple processors. To make a shared memory multiprocessor, the data
caches could be made write-through, or a MESI (modified, exclusive, shared, and invalid) cache-
coherence protocol could be established between the primary data caches. Because the bandwidth to an
on-chip cache can easily be made high enough to handle the write-through traffic, the simpler coherence
scheme is chosen for the CMP.
In this way, designers can implement a small-scale multiprocessor with very low interprocessor
communication latency. To provide enough off-chip memory bandwidth for the high-performance
processors, all simulations were made with main memory composed of multiple banks of Rambus
DRAMs (RDRAMs), attached via multiple Rambus channels to each processor.
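The write-through scheme chosen above can be sketched in a few lines (a toy software model under simplifying assumptions, not the simulated hardware): every store updates the shared secondary cache and invalidates the line in the other cores' primary caches, so no MESI state machine is required:

    #include <array>
    #include <cstdint>
    #include <unordered_map>

    constexpr int kCores = 8;

    struct L1 { std::unordered_map<std::uint64_t, std::uint32_t> lines; };
    std::array<L1, kCores> l1;                               // private L1s
    std::unordered_map<std::uint64_t, std::uint32_t> l2;     // shared L2

    // Write-through, write-invalidate: the store updates the shared L2 and
    // removes stale copies from every other core's private L1.
    void store(int core, std::uint64_t line, std::uint32_t data) {
        l1[core].lines[line] = data;                 // update own L1
        l2[line] = data;                             // write through to L2
        for (int c = 0; c < kCores; ++c)
            if (c != core) l1[c].lines.erase(line);  // invalidate other copies
    }

    std::uint32_t load(int core, std::uint64_t line) {
        auto it = l1[core].lines.find(line);
        if (it != l1[core].lines.end()) return it->second;  // L1 hit
        return l1[core].lines[line] = l2[line];             // miss: fill from L2
    }

    int main() {
        store(0, 0x40, 7);                   // core 0 writes; others invalidated
        return load(1, 0x40) == 7 ? 0 : 1;   // core 1 misses, reads 7 from L2
    }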
Compiler support:
The main challenge for the compiler targeting the superscalar processor is finding enough
instruction-level parallelism in applications to use a 12-issue processor effectively. Code reordering is
fundamentally limited by true data dependencies and control dependencies within a thread of instructions.
It is likely that most integer applications will be unable to use a 12-issue processor effectively, even with
very aggressive branch prediction and advanced compiler support for exposing instruction-level
parallelism. Limit studies with large instruction windows and perfect branch prediction have shown that a
maximum of approximately 10–15 instructions per cycle are possible for general-purpose integer
applications [9]. Branch mispredictions will reduce this number further in a real processor.
On the other hand, programmers must find thread-level parallelism in order to maximize CMP
performance. The SMT also requires programmers to explicitly divide code into threads to get maximum
performance, but, unlike the CMP, it can dynamically find more instruction-level parallelism if thread-
level parallelism is limited. With current trends in parallelizing compilers, multithreaded operating
systems, and the awareness of programmers about how to program parallel computers, however, these
problems should prove less daunting in the future. Additionally, having all eight of the CPUs on a single
chip allows designers to exploit thread-level parallelism even when threads communicate frequently.
This has been a limiting factor on today’s multichip multiprocessors, preventing some parallel programs
from attaining speedups, but the low communication latencies inherent in a single-chip microarchitecture
allow speedups to occur across a wide range of parallelism [4].
Hardware Performance & Comparison:
In this section I have tried to compare the three architectures based on the simulations and experiments
conducted by the various research groups. I have presented the results from refs. [??] and [??] to draw
conclusions for my study.
CMP versus Superscalar:
Two main concerns: (1) area and (2) cycle time.
An instruction window that enables the dynamic issue of instructions requires a large die area. The
PA-8000, a 4-issue superscalar, devotes 20% of its die area solely to the instruction window. In general, the
area requirement increases quadratically with issue width. An increase in issue width typically requires an
increase in the number of ports in the register file; alternatively, it may involve replicating the register file,
as in the Alpha 21264.
The number of datapaths between the functional units and register files also increases quadratically
with the issue width. A CMP requires extra hardware for speculation support, but the overhead for
register communication is quite modest. The register bypass network (which forwards values directly from
the outputs of functional units to their inputs, permitting back-to-back issue of data-dependent instructions)
may be an important factor in determining the cycle time of future high-issue processors.
Other concerns:
The inability to extract a significant amount of parallelism from the application leads to an uneven
distribution of work among the different processors in a CMP.
A CMP exploits parallelism better than a 12-issue superscalar in applications that are heavily loop-based,
where most of the loops have few or no loop-carried dependences. Each processor in the CMP
executes an iteration and most of the time can issue instructions independently, without being affected by
dependences on other threads. In the 12-issue superscalar, by contrast, the centralized instruction window is
often clogged by instructions that are either data dependent on long-latency FP operations or waiting
on cache misses. On average, the IPC of a 4 x 4-issue CMP is nearly twice that of a 12-issue
superscalar.
Thus it can be seen that:
Superscalar
• The norm of today's high-performance microprocessors.
• The issue rate of these microprocessors has continued to increase over the past few years; the
Compaq Alpha 21264, IBM PowerPC, Intel Pentium Pro and MIPS R10000 issue four instructions
per cycle.
• Special hardware dynamically identifies independent instructions:
maintaining a large pool of instructions in a large associative window;
register renaming to eliminate false dependences.
• Out-of-order issue (an instruction is issued as soon as its operands and a functional unit are available).
Thus parallelism is extracted only from the ILP of the program at run time.
• Requires centralized hardware structures that lengthen the critical path of the processor pipeline:
register renaming logic;
instruction window wake-up and select mechanism;
register bypass logic.
• Long-latency interconnects in this centralized approach.
Chip multiprocessor
• Exploits thread-level parallelism.
• Exploits the increasing transistor count on a chip.
• The growing complexity of wide-issue dynamic processors, together with fast on-chip communication
at the register level, will soon make CMPs popular.
• Speculative execution across threads improves performance but requires true memory dependence
violations to be handled.
• Decentralized architecture:
divides the application into multiple threads and exploits ILP across them;
multiple threads run on multiple simple processing units on a single chip (the CMP architecture).
• Design simplicity:
o fast clocking of each of the processing units;
o eases the time-consuming design validation phase.
• Fast communication between processing units localizes interconnects (versus the long-latency
interconnects of the centralized approach).
• Better utilization of silicon space, avoiding the extra logic devoted to a centralized architecture
=> higher overall issue bandwidth.
Olukotun et al. show how a CMP with eight 2-issue superscalar processing units would occupy
the same area as a conventional 12-issue superscalar processor.
• ideal for running multithreaded applications
• May not give good performance when running sequential applications, since parallelizing compilers
succeed only on a restricted class of applications, typically numeric ones, and so cannot handle a
large class of sequential applications.
Speculation can help: the compiler must assume inter-thread dependences exist whenever it cannot
guarantee data independence among threads. Speculative execution improves performance but requires
true memory dependence violations to be handled; a technique to solve this problem is discussed in [21].
1. Superscalar: The superscalar processor, shown in Figure 1a, can dynamically issue up to 12
instructions per cycle.
2. Simultaneous Multithreading: The SMT processor, shown in Figure 1b, is identical to the superscalar
except that it has eight separate program counters and executes instructions from up to eight different
threads of control concurrently. The processor core dynamically allocates instruction fetch and execution
resources among the different threads on a cycle-by-cycle basis to find as much thread-level and
instruction-level parallelism as possible.
3. Chip Multiprocessor: The CMP, shown in Figure 1c, is composed of eight small 2-issue superscalar
processors. This processor depends on thread-level parallelism, since its ability to find instruction-level
parallelism is limited by the small size of each processor.

Characteristics of the superscalar, simultaneous multithreading, and chip multiprocessor architectures:

  Characteristic                                  Superscalar    SMT              CMP
  Number of CPUs                                  1              1                8
  CPU issue width                                 12             12               2 per CPU
  Number of threads                               1              8                1 per CPU
  Architecture registers (integer and FP)         32             32 per thread    32 per CPU
  Physical registers (integer and FP)             32 + 256       256 + 256        32 + 32 per CPU
  Instruction window size                         256            256              32 per CPU
  Branch predictor table size (entries)           32,768         32,768           8 x 4,096
  Return stack size                               64 entries     64 entries       8 x 8 entries
  I and D cache organization                      1 x 8 banks    1 x 8 banks      1 bank
  I and D cache sizes                             128 Kbytes     128 Kbytes       16 Kbytes per CPU
  I and D cache associativity                     4-way          4-way            4-way
  I and D cache line sizes (bytes)                32             32               32
  I and D cache access times (cycles)             2              2                1
  Secondary cache organization                    1 x 8 banks    1 x 8 banks      1 x 8 banks
  Secondary cache size (Mbytes)                   8              8                8
  Secondary cache associativity                   4-way          4-way            4-way
  Secondary cache line size (bytes)               32             32               32
  Secondary cache access time (cycles)            5              5                7
  Secondary cache occupancy per access (cycles)   1              1                1
  Memory organization (no. of banks)              4              4                4
  Memory access time (cycles)                     50             50               50
  Memory occupancy per access (cycles)            13             13               13
Figure 2. Relative performance of superscalar, simultaneous multithreading, and chip
multiprocessor architectures compared to a baseline, 2-issue superscalar architecture.
Performance results:
Figure ?? shows the performance of the superscalar, SMT, and CMP architectures on the four
benchmarks relative to a baseline architecture—a single 2-issue processor attached to the
superscalar/SMT memory system.
The first two benchmarks show performance on applications with moderate memory behavior
and no thread-level parallelism (compress) or large amounts of thread-level parallelism (mpeg).
The CMP experienced a nearly eight-times performance improvement over the single 2-issue processor.
The separate primary caches are beneficial because they can be accessed by all processors in parallel. In a
separate test with eight processors sharing a single cache, bank contention between accesses from
different processors degraded performance significantly. The average memory access time to the primary
cache alone went up from 1.1 to 5.7 cycles, mostly due to extra queuing delays at the contended banks,
and overall performance dropped 24 percent. In contrast, the shared secondary cache is not a bottleneck in
the CMP because it received an order of magnitude fewer accesses. SMT results showed similar trends.
The speedups tracked the CMP results closely when modeling similar degrees of data cache contention.
The nominal performance was similar to that of the CMP with a single primary cache, and performance
improved 17 percent when primary cache contention was temporarily deactivated.
control in the SMT allowed it to exploit thread-level parallelism. Additionally, the dynamic resource
allocation in the SMT allowed it to be competitive with the CMP, even though it had fewer total issue
slots.
However, tomcatv’s memory behavior highlighted a fundamental problem with the SMT
architecture: the unified data cache architecture was a bandwidth limitation. Making a data cache with
enough banks or ports to keep up with the memory requirements of eight threads requires a more
sophisticated crossbar network that will add more latency to every cache access, and may not help if there
is a particular bank that is heavily accessed. The CMP’s independent data caches avoid this problem but
are not possible in an SMT.
As with compress, the multiprogramming workload has limited amounts of instruction-level
parallelism, so the speedup of the superscalar architecture was only a 35 percent increase over the
baseline processor. Unlike compress, however, the multiprogramming workload had large amounts of
process-level parallelism, which both the SMT and CMP exploited effectively. This resulted in a linear
eight-times speedup for the CMP. The SMT achieved nearly a seven-times speedup over the 2-issue
baseline processor, more than the increase in the number of issue slots alone would suggest is possible,
because it efficiently utilized processor resources by interleaving threads cycle by cycle.
Thus this study shows that the CMP achieves superior performance using relatively simple hardware.
Fine Comparison of Simultaneous Multithreading versus Single-Chip Multiprocessing: (These are the
results as shown in ref [??])
This section compares the performance of simultaneous multithreading to small-scale, single-chip
multiprocessing (MP). At the organizational level, the two approaches are extremely similar: both have
multiple register sets, multiple functional units, and high issue bandwidth on a single chip. The key
difference is in the way those resources are partitioned and scheduled: the multiprocessor statically
partitions resources, devoting a fixed number of functional units to each thread, while the SM processor
allows the partitioning to change every cycle. Clearly, scheduling is more complex for an SM processor;
however, it is shown that in other areas the SM model requires fewer resources, relative to
multiprocessing, to achieve a desired level of performance.
For these experiments, SM and MP configurations are chosen to be reasonably equivalent. For most of the
comparisons, all or most of the following are kept equal: the number of register sets (i.e., the number of
threads for SM and the number of processors for MP), the total issue bandwidth, and the specific
functional unit configuration. A consequence of the last item is that the functional unit configuration is
often optimized for the multiprocessor and represents an inefficient configuration for simultaneous
multithreading. All experiments use 8 KB private instruction and data caches (per thread for SM, per
processor for MP), a 256 KB 4-way set-associative shared second-level cache, and a 2 MB direct-mapped
third-level cache. It is desired to keep the caches constant in the comparisons, and this (private I and D
caches) is the most natural configuration for the multiprocessor.
MPs are evaluated with 1, 2, and 4 issues per cycle on each processor. SM processors are
evaluated with 4 and 8 issues per cycle; however the SM: Four Issue model (defined in Section ??) is
used, for all of the SM measurements (i.e., each thread is limited to four issues per cycle). Using this
model minimizes some of the inherent complexity differences between the SM and MP architectures. For
example, an SM: Four Issue processor is similar to a single-threaded processor with 4 issues per cycle in
terms of both the number of ports on each register file and the amount of inter-instruction dependence
checking. In each experiment the same version of the benchmarks is run for both configurations
(compiled for a 4-issue, 4 functional unit processor, which most closely matches the MP configuration)
on both the MP and SM models; this typically favors the MP.
It must be noted that, while in general the experiments are biased in favor of the MP, the SM results may be
optimistic in two respects: the amount of time required to schedule instructions onto functional units,
and the shared cache access time. The distance between the load/store units and the data cache can have a
large impact on cache access time. The multiprocessor, with private caches and private load/store units,
can minimize the distances between them. The SM processor cannot do so, even with private caches,
because the load/store units are shared. However, two alternate configurations could eliminate this
difference. Having eight load/store units (one private unit per thread, associated with a private cache)
would still allow SM to match MP performance with fewer than half the total number of MP functional
units (15 vs. 32). Or, with 4 load/store units and 8 threads, it is possible to statically share a single
cache/load-store-unit combination among each set of 2 threads. Threads 0 and 1 might share one load/store
unit, and all accesses through that load/store unit would go to the same cache, thus minimizing the
distance between cache and load/store unit while still allowing resource sharing. Figure ?? shows the
results of the SM/MP comparison for various configurations.
Tests A, B, and C compare the performance of the two schemes with an essentially unlimited
number of functional units (FUs); i.e., there is a functional unit of each type available to every issue slot.
The number of register sets and total issue bandwidth are constant for each experiment. In these models,
the ratio of functional units (and threads) to issue bandwidth is high, so both configurations should be
able to utilize most of their issue bandwidth. Simultaneous multithreading, however, does so more
effectively.
Test D repeats test A but limits the SM processor to a more reasonable configuration (the same
10 functional unit configuration used throughout this paper). This configuration outperforms the
multiprocessor by nearly as much as test A, even though the SM configuration has 22 fewer functional
units and requires fewer forwarding connections.
In tests E and F, the MP is allowed a much larger total issue bandwidth. In test E, each MP
processor can issue 4 instructions per cycle for a total issue bandwidth of 32 across the 8 processors; each
SM thread can also issue 4 instructions per cycle, but the 8 threads share only 8 issue slots. The results are
similar despite the disparity in issue slots. In test F, the 4-thread, 8-issue SM slightly outperforms a 4-
processor, 4-issue per processor MP, which has twice the total issue bandwidth. Simultaneous
multithreading performs well in these tests, despite its handicap, because the MP is constrained with
respect to which 4 instructions a single processor can issue in a single cycle.
Test G shows the greater ability of SM to utilize a fixed number of functional units. Here both
SM and MP have 8 functional units and 8 issues per cycle. However, while the SM is allowed to have 8
contexts (8 register sets), the MP is limited to two processors (2 register sets), because each processor
must have at least 1 of each of the 4 functional unit types. Simultaneous multithreading’s ability to drive
up the utilization of a fixed number of functional units through the addition of thread contexts achieves
more than 2.5 times the throughput.
These comparisons show that simultaneous multithreading outperforms single-chip
multiprocessing in a variety of configurations because of the dynamic partitioning of functional
units. More important, SM requires many fewer resources (functional units and instruction issue slots)
to achieve a given performance level. For example, a single 8-thread, 8-issue SM processor with 10
functional units is 24~o faster than the 8-processor, single-issue MP (Test D), which has identical issue
bandwidth but requires 32 functional units; to equal the throughput of that 8-thread 8-issue SM, an MP
system requires eight 4-issue processors (Test E), which consume 32 functional units and 32 issue slots
per cycle.
Finally, there are further advantages of SM over MP that are not shown by the experiments:
• Performance with few threads — These results show only the performance at maximum utilization.
The advantage of SM (over MP) grows as some of the contexts (processors) become unutilized.
An idle processor leaves 1/p of an MP idle, while with SM, the other threads can expand to use the
available resources. This is important when (1) running parallel code where the degree of parallelism
varies over time, (2) the performance of a small number of threads is important in the target
environment, or (3) the workload is sized for the exact size of the machine (e.g., 8 threads). In the last
case, a processor and all of its resources is lost when a thread experiences a latency orders of
magnitude larger than that simulated (e.g., I/O).
• Granularity and flexibility of design — The configuration options are much richer with SM, because
the units of design have finer granularity. With a multiprocessor, it is typical to add computing power
in units of entire processors. With simultaneous multithreading, it is possible to benefit from the
addition of a single resource, such as a functional unit, a register context, or an instruction issue
slot; furthermore, all threads would be able to share that resource. The comparisons above did not
take advantage of this flexibility, so processor designers, taking full advantage of the
configurability of simultaneous multithreading, should be able to construct configurations that
out-distance multiprocessing even further.
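The first advantage can be stated as a simple bound. Assuming p identical processors (or hardware contexts), k <= p runnable threads, an average per-thread ILP of I instructions per cycle, and a total issue width of W (a deliberately simplified model, not the papers' simulation):

\[
T_{\mathrm{MP}}(k) \le k \cdot \min\!\left(I, \frac{W}{p}\right),
\qquad
T_{\mathrm{SM}}(k) \le \min\!\left(k \cdot I,\; W\right).
\]

When a thread goes idle, the MP bound drops by a full 1/p of the machine, whereas the SM bound drops only if k·I falls below W, because the remaining threads expand into the freed issue slots.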
Performance Comparison of SMT and CMP Using Parallel Workloads

Why Parallel Applications?
SMT is most effective when threads have complementary hardware resource requirements.
Multiprogrammed workloads and workloads consisting of parallel applications both provide TLP via
independent streams of control, but they compete for hardware resources differently. Because a
multiprogrammed workload (used in our previous work [Tullsen et al. 1995; 1996]) does not share
memory references across threads, it places more stress on the caches. Furthermore, its threads have
different instruction execution patterns, causing interference in branch prediction hardware. On the other
hand, multiprogrammed workloads are less likely to compete for identical functional units.
Although parallel applications have the benefit of sharing the caches and branch prediction
hardware, they are an interesting and different test of SMT for several reasons. First, unlike the
multiprogrammed workload, all threads in a parallel application execute the same code and,
therefore, have similar execution resource requirements, memory reference patterns, and levels
of ILP. Because all threads tend to have the same resource needs at the same time, there is
potentially more contention for these resources compared to a multiprogrammed workload. For
example, a particular loop may have a large degree of instruction-level parallelism, so each
thread will require a large number of renaming registers and functional units. Because all
threads have the same resource needs, they may exacerbate or create bottlenecks in these
resources. Parallel applications are therefore particularly appropriate for this study, which
focuses on these execution resources. Second, parallel applications illustrate the promise of SMT
as an architecture for improving the performance of single applications. By using threads to
parallelize programs, SMT can improve processor utilization, but more important, it can achieve
program speedups. Finally, parallel applications are a natural workload for traditional parallel
architectures and therefore serve as a fair basis for comparing SMT and multiprocessors. For the
sake of comparison, in Section 7, we also briefly contrast our parallel results with the
multiprogrammed results from Tullsen et al. [1996].

Table V. Throughput comparison of MP2, MP4, and SMT, measured in instructions per cycle

                        Number of threads
Configuration       1       2       4       8
MP2               2.08    3.32     --      --
MP4               1.38    2.25    3.27     --
SMT               2.40    3.49    4.24    4.88
Another set of experiments is described in ref [??]; the processor instruction latencies and memory
hierarchy details used there are shown in figure ??.
Contributions regarding design tradeoffs for future high-end processors
First, the performance costs of resource partitioning for various multiprocessor configurations have
been identified. By partitioning execution resources between processors, multiprocessors enforce the
distinction between instruction- and thread-level parallelism. In this study, we examined two MP design
choices with similar hardware cost in terms of execution resources: one design with more resources per
processor (MP2) and one with twice as many processors, but fewer resources on each (MP4). Our results
showed that both alternatives frequently suffered from an inefficient use of their resources and that
improvements could only be obtained with costly upgrades in processor resources. The MP designs were
unable to adapt to varying levels of ILP and TLP, so their performance depended heavily on the
parallelism characteristics of the applications. For programs with more ILP, MP2 outperformed MP4; for
programs with less ILP, MP4 was superior because it exploited more thread-level parallelism. To
maximize performance on an MP, compilers and parallel programmers are therefore faced with the
difficult task of partitioning program parallelism (ILP and TLP) in a manner that matches the physical
partitioning of resources.
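The partitioning dilemma can be illustrated with a deliberately simplified throughput model (our sketch, not the study's methodology), assuming MP2 is two 4-issue processors and MP4 is four 2-issue processors, consistent with the thread counts in Table V:

    def mp_ipc(n_procs, width_per_proc, ilp, tlp):
        """Toy MP throughput: one thread per processor, each capped by its width."""
        active = min(n_procs, tlp)
        return active * min(width_per_proc, ilp)

    def smt_ipc(total_width, contexts, ilp, tlp):
        """Toy SMT throughput: all contexts share one wide issue window."""
        active = min(contexts, tlp)
        return min(active * ilp, total_width)

    cases = [("high ILP, low TLP (ilp=4, tlp=2)", 4, 2),
             ("low ILP, high TLP (ilp=1, tlp=8)", 1, 8)]
    for name, ilp, tlp in cases:
        print(name,
              "| MP2:", mp_ipc(2, 4, ilp, tlp),
              "| MP4:", mp_ipc(4, 2, ilp, tlp),
              "| SMT:", smt_ipc(8, 8, ilp, tlp))

In the high-ILP case the model favors MP2, in the high-TLP case it favors MP4, while the SMT configuration reaches its full issue bandwidth in both, which is exactly the adaptability described above.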
Second, it has been shown that, in contrast, simultaneous multithreading allows compilers and
programmers to focus on extracting whatever parallelism exists, by treating instruction- and thread-level
parallelism equally. ILP and TLP are fundamentally identical; they both represent independent
instructions that can be used to increase processor utilization and improve performance. SMT has the
flexibility to use both forms of parallelism interchangeably, because threads can share resources
dynamically. Rather than adding more resources to further improve performance, existing resources are
used more effectively. By using more hardware contexts, SMT can take advantage of TLP to expose more
parallelism and attain an average throughput of 4.88 instructions per cycle, while increasing its
performance edge over MP2 and MP4 to 64% and 52%, respectively.
Third, our results demonstrate that SMT can achieve large program speedups on parallel
applications. Even though these parallel threads have greater potential for interference because of similar
resource usage patterns (including memory references and demands for renaming registers and functional
units), simultaneous multithreading has the ability to compensate for these potential conflicts. We found
that interthread cache interference, bank contention, and branch prediction interference on an SMT
processor had only minimal effects on performance. The latency hiding characteristics of simultaneous
multithreading allow it to achieve a 2.68 average speedup over a single MP2 processor, whereas MP2 and
MP4 speedups are limited to 1.63 and 1.76, respectively. The bottom line is that simultaneous
multithreading makes better utilization of on-chip resources to run parallel applications effectively.
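As a consistency check, these speedups reproduce the performance edges quoted earlier:

\[
\frac{2.68}{1.63} \approx 1.64,
\qquad
\frac{2.68}{1.76} \approx 1.52,
\]

i.e., margins of 64% over MP2 and 52% over MP4.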
For these reasons, as well as the performance and complexity results shown, we believe that when
component densities permit us to put multiple hardware contexts and wide issue bandwidth on a single
chip, simultaneous multithreading will represent the most efficient organization of those resources.
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

1. Superscalar: The superscalar processor, shown in Figure 1a, can dynamically issue up to 12
instructions per cycle.
2. Simultaneous multithreading: The SMT processor, shown in Figure 1b, is identical to the
superscalar except that it has eight separate program counters and executes instructions from up to
eight different threads of control concurrently. The processor core dynamically allocates instruction
fetch and execution resources among the different threads on a cycle-by-cycle basis to find as much
thread-level and instruction-level parallelism as possible.
3. Chip multiprocessor: The CMP, shown in Figure 1c, is composed of eight small 2-issue superscalar
processors. This processor depends on thread-level parallelism, since its ability to find
instruction-level parallelism is limited by the small size of each processor.

Characteristic                                     Superscalar    Simultaneous     Chip
                                                                  multithreading   multiprocessor
Number of CPUs                                     1              1                8
CPU issue width                                    12             12               2 per CPU
Number of threads                                  1              8                1 per CPU
Architecture registers (for integer and FP)        32             32 per thread    32 per CPU
Physical registers (for integer and FP)            32 + 256       256 + 256        32 + 32 per CPU
Instruction window size                            256            256              32 per CPU
Branch predictor table size (entries)              32,768         32,768           8 x 4,096
Return stack size                                  64 entries     64 entries       8 x 8 entries
Instruction (I) and data (D) cache organization    1 x 8 banks    1 x 8 banks      1 bank
I and D cache sizes                                128 Kbytes     128 Kbytes       16 Kbytes per CPU
I and D cache associativities                      4-way          4-way            4-way
I and D cache line sizes (bytes)                   32             32               32
I and D cache access times (cycles)                2              2                1
Secondary cache organization                       1 x 8 banks    1 x 8 banks      1 x 8 banks
Secondary cache size (Mbytes)                      8              8                8
Secondary cache associativity                      4-way          4-way            4-way
Secondary cache line size (bytes)                  32             32               32
Secondary cache access time (cycles)               5              5                7
Secondary cache occupancy per access (cycles)      1              1                1
Memory organization (no. of banks)                 50             50               50
Memory access time (cycles)                        4              4                4
Memory occupancy per access (cycles)               13             13               13
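For readers who want to experiment with these parameters, the table can be transcribed into structured data; the sketch below (our convenience, with hypothetical field names) captures a subset of the fields:

    from dataclasses import dataclass

    @dataclass
    class CoreConfig:
        name: str
        cpus: int
        issue_width_per_cpu: int
        threads_per_cpu: int       # hardware contexts per CPU
        l1_size_kbytes: int        # I and D cache size (per CPU for the CMP)
        l1_banks: int              # cache banks (per CPU for the CMP)
        l1_access_cycles: int

    SUPERSCALAR = CoreConfig("superscalar", 1, 12, 1, 128, 8, 2)
    SMT         = CoreConfig("SMT",         1, 12, 8, 128, 8, 2)
    CMP         = CoreConfig("CMP",         8,  2, 1,  16, 1, 1)

    for c in (SUPERSCALAR, SMT, CMP):
        print(f"{c.name}: total issue width {c.cpus * c.issue_width_per_cpu}, "
              f"hardware threads {c.cpus * c.threads_per_cpu}")

The totals make the comparison's framing explicit: the superscalar and SMT share one wide 12-issue core (with 1 and 8 threads, respectively), while the CMP reaches 8 threads by replicating narrow 2-issue cores.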
Figure 2. Relative performance of superscalar, simultaneous multithreading, and chip
multiprocessor architectures compared to a baseline, 2-issue superscalar architecture.
The CMP experienced a nearly eight-times performance improvement over the single 2-issue processor.
The separate primary caches are beneficial because they can be accessed by all processors in parallel. In
a separate test with eight processors sharing a single cache, bank contention between accesses from
different processors degraded performance significantly. The average memory access time to the primary
cache alone went up from 1.1 to 5.7 cycles, mostly due to extra queuing delays at the contended banks,
and overall performance dropped 24 percent. In contrast, the shared secondary cache was not a bottleneck
in the CMP because it received an order of magnitude fewer accesses. SMT results showed similar trends.
The SMT speedups tracked the CMP results closely when similar degrees of data cache contention were
modeled: its nominal performance was similar to that of the CMP with a single shared primary cache, and
performance improved by 17 percent when primary cache contention was artificially removed.
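The contention effect can be reproduced qualitatively with a small queuing sketch (our illustration; the access probability, bank hashing, and latencies are assumptions, not the paper's parameters):

    import random
    from collections import deque

    def avg_access_time(n_procs, n_banks, p_access=0.5, base_latency=1, cycles=20000):
        """Average primary-cache access time with random bank conflicts.

        Each cycle every processor issues an access with probability p_access to
        a random bank; a bank retires one access per cycle, so collisions queue.
        """
        queues = [deque() for _ in range(n_banks)]
        total_wait = total_accesses = 0
        for t in range(cycles):
            for _ in range(n_procs):
                if random.random() < p_access:
                    queues[random.randrange(n_banks)].append(t)
            for q in queues:
                if q:
                    total_wait += t - q.popleft()
                    total_accesses += 1
        return base_latency + total_wait / total_accesses

    random.seed(1)
    print("private cache (1 proc, 1 bank):  ", avg_access_time(1, 1))
    print("shared cache (8 procs, 8 banks): ", avg_access_time(8, 8))

With private caches a processor never competes for its bank, while in the shared configuration bursts of accesses that hash to the same bank queue behind one another; this is the same queuing-delay effect that the measurements above attribute to the shared-cache configurations.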
The multiple threads of control in the SMT allowed it to exploit thread-level parallelism.
Additionally, the dynamic resource allocation in the SMT allowed it to be competitive with the CMP,
even though it had fewer total issue slots.
However, tomcatv’s memory behavior highlighted a fundamental problem with the SMT
architecture: the unified data cache architecture was a bandwidth limitation.
Making a data cache with enough banks or ports to keep up with the memory requirements of
eight threads requires a more sophisticated crossbar network that will add more latency to every cache
access, and may not help if there is a particular bank that is heavily accessed.
The CMP’s independent data caches avoid this problem but are not possible in an SMT.
As with compress, the multiprogramming workload had limited amounts of instruction-level
parallelism, so the speedup of the superscalar architecture was only 35 percent over the baseline
processor.
Unlike compress, however, the multiprogramming workload had large amounts of process-level
parallelism, which both the SMT and CMP exploited effectively.
This resulted in a linear eight-times speedup for the CMP.
The SMT achieved nearly a seven-times speedup over the 2-issue baseline processor, more than the
six-fold increase in issue slots (from 2 to 12) alone would suggest, because it efficiently utilized
processor resources by interleaving threads cycle by cycle.
Overall, the CMP achieved superior performance using relatively simple hardware.
CMP or SMT?

• The performance race between SMT and CMP is not yet decided.
• CMP is easier to implement, but only SMT has the ability to hide latencies.
• A functional partitioning is not easily reached within an SMT processor due to the centralized
  instruction issue. Separating the thread queues is a possible solution, although it does not
  remove the central instruction issue.
• A combination of simultaneous multithreading with the CMP may be superior.
• A research direction is to combine the SMT or CMP organization with the ability to create threads
  with compiler support or fully dynamically out of a single thread (thread-level speculation, an
  approach close to the multiscalar architecture).
It is difficult to decide which of the two approaches will power the next generation of
microprocessors. With operating systems now able to handle multiple threads, simultaneous
multithreading seems a strong option; at the same time, with chip densities increasing, chip
multiprocessors are worth considering because of their simplicity of design. The combination of the
two approaches, as seen in reference [7], is worth considering as well.
References
1. D. Tullsen, S. Eggers, and H. Levy. "Simultaneous Multithreading: Maximizing On-Chip
Parallelism." In Proc. 22nd Annual International Symposium on Computer Architecture, ACM Press,
New York, 1995, pp. 392-403.
2. J. Borkenhagen, R. Eickemeyer, and R. Kalla. "A Multithreaded PowerPC Processor for Commercial
Servers." IBM Journal of Research and Development, Vol. 44, No. 6, November 2000, pp. 885-898.
3. J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. "Converting Thread-Level
Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading." ACM Transactions
on Computer Systems, 15(2), August 1997.
4. K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang. "The Case for a Single-Chip
Multiprocessor." In Proc. 7th International Conference on Architectural Support for Programming
Languages and Operating Systems, Cambridge, Massachusetts, October 1996, pp. 2-11.
5. L. Hammond, B. A. Nayfeh, and K. Olukotun. "A Single-Chip Multiprocessor." IEEE Computer,
September 1997.
6. M. Gulati and N. Bagherzadeh. "Performance Study of a Multithreaded Superscalar
Microprocessor." In Proc. 2nd International Symposium on High-Performance Computer Architecture,
February 1996, pp. 291-301.
7. K. Park, S.-H. Choi, Y. Chung, W.-J. Hahn, and S.-H. Yoon. "On-Chip Multiprocessor with
Simultaneous Multithreading." http://etrij.etri.re.kr/etrij/pdfdata/22-04-02.pdf
8. B. A. Nayfeh, L. Hammond, and K. Olukotun. "Evaluation of Design Alternatives for a
Multiprocessor Microprocessor." In Proc. 23rd Annual International Symposium on Computer
Architecture, May 1996, pp. 67-77.
9. K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang. "The Case for a Single-Chip
Multiprocessor." In Proc. 7th International Conference on Architectural Support for Programming
Languages and Operating Systems, ACM, New York, October 1996, pp. 2-11.
10. L. Hammond, B. A. Hubbert, M. Siu, M. K. Prabhu, M. Chen, and K. Olukotun. "The Stanford
Hydra CMP." IEEE Micro, Vol. 20, No. 2, March/April 2000.
11. S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen. "Simultaneous Multithreading:
A Platform for Next-Generation Processors." IEEE Micro, September/October 1997, pp. 12-18.
12. V. Krishnan and J. Torrellas. "Hardware and Software Support for Speculative Execution of
Sequential Binaries on a Chip-Multiprocessor." In Proc. ACM International Conference on
Supercomputing (ICS'98), June 1998, pp. 85-92.
13. goethe.ira.uka.de/people/ungerer/proc-arch/EUROPAR-tutorial-slides.ppt
14. http://www.acm.uiuc.edu/banks/20/6/page4.html
15. Simultaneous Multithreading home page: http://www.cs.washington.edu/research/smt/