A comparison of three architectures: Superscalar, Simultaneous
Multithreading CPUs and Single-Chip Multiprocessor.
Recent years have seen a great deal of interest in multiple-issue machines or superscalar
processors, processors that can issue several mutually independent instructions in the same cycle. These
machines exploit the parallelism that programs exhibit at the instruction level. The superscalar processor
designs dynamically extract parallelism by executing many instructions within a single, sequential
program in parallel. To find independent instructions within a sequential sequence of instructions, or
thread of control, today’s processors increasingly make use of sophisticated architectural features.
Examples are out-of-order instruction execution and speculative execution of instructions after branches
predicted with dynamic hardware branch prediction techniques. However, it is important to know how
much parallelism is available in typical applications. Machines providing a high degree of multiple-issue
would be of little use if applications did not display that much parallelism. The available parallelism
depends strongly on how hard we are willing to work to find it.
Future performance improvements will require processors to be enlarged to execute more
instructions per clock cycle. One key technique is speculative execution: issuing an instruction whose
data dependences are satisfied but whose control dependences are not. That is, we issue a potential
future instruction early even though an intervening branch may send us in another direction entirely.
However, reliance on a single
thread of control limits the parallelism available for many applications, and the cost of extracting
parallelism from a single thread is becoming prohibitive. This cost manifests itself in numerous ways,
including increased die area and longer design and verification times. In general, we see diminishing
returns when trying to extract parallelism from a single thread. To continue this trend will trade only
incremental performance increases for large increases in overall complexity.
Exploiting Parallelism:
Parallelism exists at multiple levels in modern systems. Parallelism between individual,
independent instructions in a single application is instruction-level parallelism (ILP). Loop-level
parallelism results when the instruction-level parallelism comes from data-independent loop iterations.
The finite number of instructions that can be examined at once by hardware looking for instruction level
parallelism to exploit is called the instruction window size. Compilers, which have essentially infinite
virtual instruction windows as they generate code, can help increase usable parallelism by reordering
instructions. Instructions are reordered so that instructions that can be issued in parallel are close to each
other in executable code, allowing the hardware’s finite window to detect the resulting instruction-level
parallelism. Some compilers can also divide a program into multiple threads of control, exposing thread-
level parallelism (TLP). This form of parallelism simulates a single, large, hardware instruction window
by allowing multiple, smaller instruction windows—one for each thread—to work together on one
application. A third form of very coarse parallelism, process-level parallelism, involves completely
application. A third form of very coarse parallelism, processlevel parallelism, involves completely
independent applications running in independent processes controlled by the operating system.
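As a concrete illustration of these levels (a minimal C++ sketch; the loop body, array size, and thread count are hypothetical), the data-independent loop below exhibits loop-level parallelism, and splitting its iterations across threads exposes that same parallelism as TLP, with one instruction window per thread:

    #include <cstddef>
    #include <thread>
    #include <vector>

    // Each iteration writes only out[i], so iterations are data-independent:
    // classic loop-level parallelism, visible to hardware as ILP.
    void scale(std::vector<double>& out, const std::vector<double>& in,
               std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i)
            out[i] = 2.0 * in[i];
    }

    int main() {
        const std::size_t n = 1 << 20;
        std::vector<double> in(n, 1.0), out(n);

        // Expose the same work as TLP: each thread gets its own chunk.
        const unsigned t = 4;  // hypothetical thread count
        std::vector<std::thread> workers;
        for (unsigned k = 0; k < t; ++k)
            workers.emplace_back(scale, std::ref(out), std::cref(in),
                                 k * n / t, (k + 1) * n / t);
        for (auto& w : workers) w.join();
    }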
In the future, we expect thread and process parallelism to become widespread, for two reasons:
the nature of the applications and the nature of the operating system. As a result, researchers have
proposed two alternative microarchitectures that exploit multiple threads of control: simultaneous
multithreading (SMT)[1, 3] and chip multiprocessors (CMP)[4, 7, 9].
Simultaneous multithreading is a technique permitting several independent threads to issue
instructions to a superscalar's multiple functional units in a single cycle. It is a processor design that
combines hardware multithreading with superscalar processor technology to allow multiple threads to
issue instructions each cycle.
Chip multiprocessors (CMPs) use relatively simple single-thread processor cores to exploit only
moderate amounts of parallelism within any one thread, while executing multiple threads in parallel
across multiple processor cores[5]. Architecturally, CMPs resemble today's multichip SMP machines,
but having multiple CPUs on a single chip speeds up data transactions among processors. This speedup
makes a CMP faster than a conventional multichip multiprocessor at running parallel programs, especially
when threads communicate frequently.
Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single
program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on
different processors. Unfortunately, both parallel-processing styles statically partition processor
resources, thus preventing them from adapting to dynamically-changing levels of ILP and TLP in a
program. With insufficient TLP, processors in an MP will be idle; with insufficient ILP, multiple-issue
hardware on a superscalar is wasted. Simultaneous multithreading (SMT) [Tullsen et al. 1995; 1996;
Gulati et al. 1996] allows multiple threads to compete for and share available processor resources every
cycle. One of its key advantages when executing parallel applications is its ability to use thread-level
parallelism and instruction-level parallelism interchangeably.
Software trends favor multithreaded programming for its various benefits: multiprocessor
systems can provide multiple simultaneous points of execution, and with the help of the operating system,
independent threads can run on independent processors simultaneously. However, the need to limit the
effects of interconnect delays, which are shrinking much more slowly than transistor gate delays, and the
ability to exploit the increasing transistor count on a chip both favor CMPs[5].
1. Trends in Multiprocessor Architecture:
The major trend in commercial microprocessor architecture is the use of complex architectures to
exploit ILP. Two approaches are used to exploit ILP: superscalar and Very Long Instruction Word
(VLIW). Both approaches attempt to issue multiple instructions to independent functional units at every
clock cycle. A superscalar processor uses hardware to dynamically find data-independent instructions in
an instruction window and issue them to independent functional units. VLIW, on the other hand, relies on
the compiler to find ILP and schedule the execution of independent instructions statically.
Superscalar is more appealing in commercial microprocessors because it can improve the
performance of existing application binaries[7]. However, superscalar processors are complex to design
and difficult to implement. Looking for parallelism in a large instruction window requires a significant
amount of hardware and usually does not improve performance as much as one might expect. Due to this
complexity, it is difficult not only to get the architecture correct but also to optimize the pipeline and
circuits to achieve a high clock frequency.
VLIW, on the other hand, relies on the compiler to find bundles of independent instructions. Since
VLIW does not require hardware for dynamic scheduling, it can be much simpler to design and implement.
However, it requires significant compiler support, such as trace scheduling, to find the ILP in an application
program. VLIW is preferred over superscalar when the issue width is so large that the dynamic
scheduling hardware of a superscalar becomes too complex and expensive to implement. However, such
a wide-issue VLIW machine has a centralized register file that must have many ports to supply operands to
independent functional units. The access time of the register file and the complexity of the buses connecting
it to the functional units may limit clock frequency. Another disadvantage of VLIW machines is that they
cannot use precompiled binaries, which is a problem when the source code is not available. VLIW also
forces a bundle of instructions to execute together: if one instruction in the bundle stalls, the other
instructions in the bundle must stall too. This limits VLIW's ability to deal with unpredictable events such
as data accesses that cause cache misses.
Currently, most commercial microprocessors, such as the Intel Pentium, Compaq Alpha 21264,
IBM PowerPC 620, Sun UltraSPARC, HP PA-8000 and MIPS R10000, use the superscalar design technique.
The performance of these microprocessors has been improving at a phenomenal rate for decades. This
performance growth has been driven by (1) innovation in compilers, (2) improvements in
architecture and (3) tremendous improvements in VLSI technology. The latest superscalar
microprocessors can execute four to six instructions concurrently using many nontrivial techniques,
including dynamic branch prediction, out-of-order execution, and speculative execution. However, the
expected speedup may not be achieved with these techniques because of the limits on instruction
window size and on the ILP in a typical program. Moreover, considerable design effort is required to
develop such a high-performance microprocessor. Therefore, developing a complex wide-issue superscalar
microprocessor as the next-generation microprocessor may not be an efficient approach to satisfying the
required performance.
• Superscalar Bottlenecks: Where Have All the Cycles Gone?
Figure 1 gives the issue utilization, i.e., the percentage of issue slots that are filled each cycle, for
most of the SPEC benchmarks. The cause of each empty issue slot is also recorded. For example, if the
next instruction cannot be scheduled in the same cycle as the current instruction, then the remaining issue
slots this cycle, as well as all issue slots for idle cycles between the execution of the current instruction
and the next (delayed) instruction, are assigned to the cause of the delay. When there are overlapping
causes, all cycles are assigned to the cause that delays the instruction the most; if the delays are additive,
such as an I TLB miss and an I cache miss, the wasted cycles are divided up appropriately[1].
Thus it can be seen that the functional units in the wide superscalar used here are highly underutilized.
These results also indicate that there is no dominant source of wasted issue bandwidth. Although there are
dominant items in individual applications (e.g., mdljsp2, swm, fpppp), the dominant cause is different in
each case. In the composite results we see that the largest cause (short FP dependences) is responsible for
37% of the issue bandwidth, but there are six other causes that account for at least 4.5% of wasted cycles.
Even completely eliminating any one factor will not necessarily improve performance to the degree that
this graph might imply, because many of the causes overlap. Not only is there no dominant cause of
wasted cycles — there appears to be no dominant solution. If specific latency-hiding techniques are
limited, then any dramatic increase in parallelism needs to come from a general latency-hiding solution,
of which multithreading or multiprocessing are examples.
Table 1 lists the possible causes of wasted issue slots, along with the latency-hiding or latency-reducing
techniques that can reduce the number of cycles wasted by each cause.
2. Hardware Multithreading:
Increasing miss rates and increasing latency of cache misses are having a compounding effect on
the portion of execution time that is wasted on cache misses. One solution to this problem is coarse-grained
multithreading, which enables the processor to perform useful instructions during cache misses.
• Why are miss rates and cache-miss latencies increasing?
Workload Characteristics:
Consider, for instance, server workloads, which represent market segments such as on-line transaction
processing (OLTP), business intelligence, enterprise resource planning (ERP), web serving, and
collaborative groupware. These applications are often large and function-rich; they use a large number of
operating system services and access large databases. These characteristics make the instruction and data
working sets large. These workloads are also inherently multi-user and multitasking. The large working
set and high frequency of task switches cause the cache-miss rates to be high. In addition, research in this
area points out that such applications can also have data that is frequently read–write shared. In
multiprocessors, this can make the miss rates significantly higher. Also, because of the large instruction
working set, branch-prediction rates can be poor. These characteristics are all detrimental to the
performance of the processor.
Application Characteristics:
Current trends in application characteristics and languages are likely to make this worse. Object-
oriented programming with languages such as C++ and Java has been popular for several years and is
increasing in popularity. Virtual-function pointers are a feature of these languages that did not exist in the
languages used in older applications. Virtual-function pointers lead to indirect branches that can have very
poor branch prediction rates. The frequency of dynamic memory allocation in these languages is also
higher than in older languages, which leads to more allocation of memory from the heap. Memory from
the heap is more scattered than memory from the stack, which can cause higher cache-miss rates. Java
also does “garbage collection.” Garbage collection has access patterns that lead to poor cache-miss rates
because it references many objects and uses each only a small number of times. All of these factors are
causing the already high miss rates to become even higher.
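As a small illustration (a hedged C++ sketch; the class names are hypothetical), the pattern described above looks like this: each call through a base-class pointer is an indirect branch whose target depends on the object's dynamic type, and each object is a separate heap allocation, so a traversal scatters across the heap:

    #include <memory>
    #include <vector>

    // Hypothetical hierarchy: a call through Shape* is an indirect branch.
    struct Shape {
        virtual ~Shape() = default;
        virtual double area() const = 0;
    };
    struct Circle : Shape {
        double r = 1.0;
        double area() const override { return 3.14159 * r * r; }
    };
    struct Square : Shape {
        double s = 1.0;
        double area() const override { return s * s; }
    };

    double total(const std::vector<std::unique_ptr<Shape>>& shapes) {
        double sum = 0.0;
        for (const auto& p : shapes)
            sum += p->area();  // branch target varies per element (hard to
                               // predict); each object is a separate heap
                               // allocation, so accesses are scattered
        return sum;
    }

    int main() {
        std::vector<std::unique_ptr<Shape>> shapes;
        shapes.push_back(std::make_unique<Circle>());  // heap allocation
        shapes.push_back(std::make_unique<Square>());  // heap allocation
        return total(shapes) > 0.0 ? 0 : 1;
    }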
Faster clock rates:
A large portion of the execution time can already be spent on cache misses and branch
mispredictions. The trend in processor microarchitecture is toward decreasing cycle time at a faster rate
than the decrease in memory access time. This is causing the number of processor cycles for a cache-miss
latency to increase. For a given miss rate, this causes the portion of the execution time due to cache
misses to become larger. This trend, combined with the trend toward higher miss rates in workloads that
already have high miss rates, causes a compounding effect on the cycles-per-instruction (CPI) increase
due to cache misses.
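This compounding effect can be captured with the standard, simplified CPI decomposition; the numbers below are illustrative assumptions, not measurements from the cited studies:

\[
\mathrm{CPI}_{\text{total}} = \mathrm{CPI}_{\text{core}} + m \cdot p ,
\]

where m is the number of cache misses per instruction and p is the miss penalty in processor cycles. If a faster clock raises p from 50 to 100 cycles while m = 0.02 stays fixed, the memory-stall term doubles from 1.0 to 2.0 CPI; if the workload trends above also push m to 0.03, that term becomes 3.0 CPI, even though CPI_core never changed.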
• Multithreading:
In a multithreaded processor, the processor holds the state of several tasks/threads. The several
threads provide additional instruction-level parallelism, enabling the processor to better utilize all of its
resources. When one of the threads would normally be stalled, instructions from the other threads can
utilize the processor’s resources. The observation that cache misses were becoming a very large portion
of the execution time led to the investigation of multithreaded hardware as a way to execute useful
instructions during cache misses.
In fine-grained multithreading, a different thread is executed every cycle. While fine-grained
multithreading covers control and data dependencies quite well (although this may require more than two
threads), the impact of cycle interleaving on single-task performance was deemed too large.
In coarse-grained multithreading, a single thread, called the foreground thread, executes until
some long-latency event such as a cache miss occurs, causing execution to switch to the background
thread. If there are no such events, a single thread can consume all execution cycles. This minimizes the
impact on single-task execution speed, making it performance-competitive with non-multithreaded
processors. For a processor that executes instructions in order, coarse-grained multithreading is therefore
the natural choice.
As defined earlier, simultaneous multithreading combines hardware multithreading with superscalar
processor technology so that several independent threads can issue instructions to a superscalar's multiple
functional units in a single cycle; it suits a deeply pipelined, out-of-order execution processor. Because
simultaneous multithreading builds directly on superscalar technology, it is straightforward to compare the
performance of a simultaneous multithreaded processor with that of a superscalar processor. So I chose the
simultaneous multithreaded processor for my study.
• Simultaneous Multithreading (SMT):
Multiple instruction issue has the potential to increase performance, but is ultimately limited by
instruction dependencies (i.e., the available parallelism) and long-latency operations within the single
executing thread. The effects of these are shown as horizontal waste and vertical waste in Figure 2.
Multithreaded architectures, on the other hand, such as HEP [28], Tera [3], MASA [15] and Alewife [2],
employ multiple threads with fast context switch between threads. Traditional multithreading hides
memory and functional unit latencies, attacking vertical waste. In any one cycle, though, these
architectures issue instructions from only one thread. The technique is thus limited by the amount of
parallelism that can be found in a single thread in a single cycle. And as issue width increases, the ability
of traditional multithreading to utilize processor resources will decrease. Simultaneous multithreading, in
contrast, attacks both horizontal and vertical waste.
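The distinction can be made concrete with a toy issue-slot model (a minimal C++ sketch; the thread count, readiness distribution, and issue width are invented for illustration, not taken from the cited studies). Each cycle every thread has between 0 and 4 ready instructions, with 0 modeling a stall: a single-threaded superscalar suffers both kinds of waste, fine-grained multithreading removes vertical waste by picking a non-stalled thread, and SMT also removes horizontal waste by filling leftover slots from other threads:

    #include <algorithm>
    #include <array>
    #include <cstdio>
    #include <random>

    int main() {
        constexpr int W = 8, T = 4, N = 100000;  // issue width, threads, cycles
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> ready(0, 4);  // 0 models a stall

        long ss = 0, fg = 0, smt = 0;
        for (int c = 0; c < N; ++c) {
            std::array<int, T> r;
            for (int& x : r) x = ready(rng);

            // Superscalar: one fixed thread; a stall wastes the whole cycle
            // (vertical waste), fewer than W ready wastes slots (horizontal).
            ss += std::min(r[0], W);

            // Fine-grained MT: one thread per cycle, round-robin among
            // non-stalled threads; hides vertical waste only.
            for (int k = 0; k < T; ++k) {
                int t = (c + k) % T;
                if (r[t] > 0) { fg += std::min(r[t], W); break; }
            }

            // SMT: fills leftover slots from other threads in the same cycle.
            int slots = W;
            for (int x : r) { int take = std::min(x, slots); smt += take; slots -= take; }
        }
        std::printf("avg issue/cycle: superscalar %.2f, fine-grained %.2f, SMT %.2f\n",
                    ss / double(N), fg / double(N), smt / double(N));
    }

Under this toy model the superscalar averages about 2 instructions per cycle, fine-grained multithreading somewhat more, and SMT comes much closer to the full issue width.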
Simultaneous multithreading (SMT) allows multiple threads to compete for and share all of the
processor's resources every cycle. By permitting multiple threads to share the processor's functional units
simultaneously, the processor can use both ILP and TLP to accommodate variations in parallelism. When
a program has only a single thread, i.e., it lacks TLP, all of the SMT processor's resources can be dedicated
to that thread; when more TLP exists, this parallelism can compensate for a lack of per-thread ILP. An SMT
processor can thus exploit whichever type of parallelism is available, utilizing the functional
units more effectively to achieve the goals of greater throughput and significant program speedups.
• Performance of Simultaneous Multithreading: (the results in this section are based on the observations in [1] )
This section presents performance results for simultaneous multithreaded processors. Several
machine models have been defined for simultaneous multithreading, spanning a range of hardware
complexities. It is also shown that simultaneous multithreading provides significant performance
improvement over both single-thread superscalar and fine-grain multithreaded processors, both in the
limit, and also under less ambitious hardware assumptions.
Table 3: Details of the Cache Hierarchy

                        I Cache    D Cache    L2 Cache    L3 Cache
  Size                  64 KB      64 KB      256 KB      2 MB
  Associativity         DM         DM         4-way       4-way
  Line size (bytes)     32         32         32          32
  Banks                 8          8          4           1
  Transfer time/bank    1 cycle    1 cycle    2 cycles    2 cycles

Table 4: Simulated Instruction Latencies

  Instruction class                           Latency (cycles)
  integer multiply                            8, 16
  conditional move                            2
  compare                                     0
  all other integer                           1
  FP divide                                   17, 30
  all other FP                                4
  load (L1 cache hit, no bank conflicts)      2
  load (L2 cache hit)                         8
  load (L3 cache hit)                         14
  load (memory)                               50
  control hazard (br or jmp predicted)        1
  control hazard (br or jmp mispredicted)     6
The Machine Models:
The following models reflect several possible design choices for a combined multithreaded,
superscalar processor. The models differ in how threads can use issue slots and functional units each
cycle; in all cases, however, the basic machine is a wide superscalar with 10 functional units capable of
issuing 8 instructions per cycle (the same core machine described below). The models are:
Fine-Grain Multithreading. Only one thread issues instructions each cycle, but it can use the entire
issue width of the processor. This hides all sources of vertical waste, but does not hide horizontal waste. It
is the only model that does not feature simultaneous multithreading. Among existing or proposed
architectures, this is most similar to the Tera processor [3], which issues one 3-operation LIW instruction
per cycle.
SM: Full Simultaneous Issue. This is a completely flexible simultaneous multithreaded superscalar: all
eight threads compete for each of the issue slots each cycle. This is the least realistic model in terms of
hardware complexity, but it provides insight into the potential for simultaneous multithreading. The
following models each represent restrictions to this scheme that decrease hardware complexity.
SM: Single Issue, SM: Dual Issue, and SM: Four Issue. These three models limit the number of
instructions each thread can issue, or have active in the scheduling window, each cycle. For example, in
an SM: Dual Issue processor, each thread can issue a maximum of 2 instructions per cycle; therefore, a
minimum of 4 threads would be required to fill the 8 issue slots in one cycle.
SM: Limited Connection. Each hardware context is directly connected to exactly one of each type of
functional unit. For example, if the hardware supports eight threads and there are four integer units, each
integer unit could receive instructions from exactly two threads. The partitioning of functional units
among threads is thus less dynamic than in the other models, but each functional unit is still shared (the
critical factor in achieving high utilization). Since the choice of functional units available to a single
thread is different than in the original target machine, recompilation is done for a 4-issue (one of each
type of functional unit) processor for this model. Table 2 shows the important differences in hardware
implementation complexity.
The simulator models the execution pipelines, the memory hierarchy (both in terms of hit rates
and bandwidths), the TLBs, and the branch prediction logic of a wide superscalar processor. It is based on
the Alpha AXP 21164, augmented first for wider superscalar execution and then for multithreaded
execution. The typical simulated configuration contains 10 functional units of four types (four integer,
two floating point, three load/store and one branch) and a maximum issue rate of 8 instructions per cycle.
We assume that all functional units are completely pipelined. Tables 3 and 4 show details of the cache
hierarchy and the simulated instruction latencies, respectively. Figure 3 shows the performance of the
various models as a function of the number of threads.
Observations:
• Each of these models becomes increasingly competitive with full simultaneous issue as the ratio of threads to
issue slots increases.
• The increase in processor utilization is a direct result of threads dynamically sharing processor resources that
would otherwise remain idle much of the time.
• The lowest-priority thread (at 8 threads) runs at 55% of the speed of the highest-priority thread.
• Competition for non-execution resources plays nearly as significant a role in this performance region as the
competition for execution resources.
• Caches are more strained by a multithreaded workload than by a single-threaded workload, due to the decrease
in locality.
• Sharing of caches is the dominant effect in the wasted issue cycles.
• Data TLB waste also increases.
• Total speedups remain relatively constant across a wide range of cache sizes.
• Instruction throughput of the various SM models is somewhat hampered by the sharing of caches and TLBs.
• Cache Design for a Simultaneous Multithreaded Processor:
The measurements show a performance degradation due to cache sharing in simultaneous
multithreaded processors. In this section, the cache problem is explored further. The study focuses on the
organization of the first-level (L1) caches, comparing the use of private per-thread caches to shared
caches for both instructions and data. (It is assumed that the L2 and L3 caches are shared among all
threads.) All experiments use the 4-issue model with up to 8 threads. Not all of the private caches will be
utilized when fewer than eight threads are running. Figure 4 exposes several interesting properties for
multithreaded caches. It is seen that shared caches optimize for a small number of threads (where the few
threads can use all available cache), while private caches perform better with a large number of threads.
For example, the 64s.64s cache ranks first among all models at 1 thread and last at 8 threads, while the
64p.64p cache gives nearly the opposite result. (In this naming scheme the first field describes the
instruction cache and the second the data cache; s denotes a 64 KB cache shared by all threads, and p
denotes eight private per-thread 8 KB caches, 64 KB in total.) However, the tradeoffs are not the same for
both instructions and data. A shared data cache outperforms a private data cache over all numbers of threads
(e.g., compare 64p.64s with 64p.64p), while instruction caches benefit from private caches at 8 threads.
One reason for this is the differing access patterns between instructions and data. Private I caches
eliminate conflicts between different threads in the I cache, while a shared D cache allows a single thread
to issue multiple memory instructions to different banks.
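To make the banking point concrete (a minimal sketch; the line size and bank count follow the 8-bank, 32-byte-line caches of Table 3): consecutive cache lines map to different banks, so one thread's simultaneous loads usually fall in distinct banks of a shared data cache:

    #include <cstdint>
    #include <cstdio>

    // 8-bank cache with 32-byte lines: bank = line address mod 8.
    constexpr std::uint64_t kLineBytes = 32, kBanks = 8;

    unsigned bank_of(std::uint64_t addr) {
        // The low bits of the line address select the bank.
        return static_cast<unsigned>((addr / kLineBytes) % kBanks);
    }

    int main() {
        // Two loads from one thread to consecutive lines land in adjacent
        // banks, so a shared banked D cache can service both in one cycle.
        std::uint64_t a = 0x1000;
        std::printf("addr 0x%llx -> bank %u, addr+32 -> bank %u\n",
                    (unsigned long long)a, bank_of(a), bank_of(a + 32));
    }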
There are two configurations that appear to be good choices. Because there is little performance
difference at 8 threads, the cost of optimizing for a small number of threads is small, making 64s.64s an
attractive option. However, typically operating with all or most thread slots full, the 64p.64s gives the
best performance in that region and is never worse than the second best performer with fewer threads. The
shared data cache in this scheme allows it to take advantage of more flexible cache
partitioning, while the private instruction caches make each thread less sensitive to the presence of other
threads. Shared data caches also have a significant advantage in a data-sharing environment by allowing
sharing at the lowest level of the data cache hierarchy without any special hardware for cache coherence.
For SMT processors, potential bottlenecks may occur in the fetch stages, particularly when
instructions from different blocks are fetched simultaneously, causing contention at the instruction cache.
Furthermore, the cache size becomes more critical as the threads share the same cache [1]. In addition to
memory I/O, the pipeline is lengthened by two stages for reading and writing the larger register file. The
increase in pipeline length places potential strain on the branch prediction unit. However, single-thread
performance degraded by only 2% with the insertion of these two stages [1, 13, 14].
SMT provides an option by which a processor can exploit TLP. Threads are executed in parallel
by scheduling instructions from multiple threads simultaneously. This is done to increase usage of the
functional units already present in multiple-issue processors. Logically, SMT is a chip multiprocessor
(CMP) in which all of the functional units are pooled to allow very flexible scheduling. Unlike in a CMP,
threads on an SMT system share the same caches.
Presently, SMT technologies are scheduled to be used in the upcoming Pentium IV and future Alpha
processors. While SMT is transparent to the user, applications must be multithreaded so that TLP can be
exploited; only multithreaded applications can take full advantage of SMT-capable processors. In
particular, such hardware thread support facilitates improved performance through fine-grained threading
of programs, which attempts to expose TLP wherever possible by threading every independent unit of
work. With the imminent arrival of SMT support in commercial microprocessors, multithreaded programs
will be needed to take advantage of these enhancements.
3. Chip Multiprocessor:
CMPs use relatively simple single-thread processor cores to exploit only moderate amounts of
parallelism within any one thread, while executing multiple threads in parallel across multiple processor
cores.
• Implementation technology concerns that favor CMPs:
Today, as most microprocessor designers use the increased transistor budgets to build larger and
more complex uniprocessors, several problems are beginning to make this approach to microprocessor
design difficult to continue. To address these problems, the future processor design methodology is
shifting from simply making progressively larger uniprocessors to implementing more than one
processor on each chip. The following discusses the key reasons why single-chip microprocessors are a
good idea.
Parallelism
Superscalar processors can extract greater amounts of instruction-level parallelism, or ILP, by
finding nondependent instructions that occur near each other in the original program code. Designers
primarily use additional transistors on chips to extract more parallelism from programs to perform more
work per clock cycle. Unfortunately, there is only a finite amount of ILP present in any particular
sequence of instructions that the processor executes because instructions from the same sequence are
typically highly interdependent. As a result, processors that use this technique are seeing diminishing
returns as they attempt to execute more instructions per clock cycle, even as the logic required to process
multiple instructions per clock cycle increases quadratically.
A CMP avoids this limitation by primarily using a completely different type of parallelism:
thread-level parallelism. A CMP may also exploit small amounts of ILP within each of its individual
processors, since ILP and TLP are orthogonal to each other.
Wire delay
As CMOS gates become faster and chips become physically larger, the delay caused by
interconnects between gates is becoming more significant. Due to rapid process technology improvement,
within the next few years wires will only be able to transmit signals over a small portion of large
processor chips during each clock cycle. However, a CMP can be designed so that each of its small
processors takes up a relatively small area on a large processor chip, minimizing the length of its wires
and simplifying the design of critical paths. Only the more infrequently used, and therefore less critical,
wires connecting the processors need to be long.
Design time
Processors are already difficult to design. Larger numbers of transistors, increasingly complex
methods of extracting ILP, and wire delay considerations will only make this worse. A CMP can help
reduce design time, however, because it allows a single, proven processor design to be replicated multiple
times over a die. Each processor core on a CMP can be much smaller than a competitive uniprocessor,
minimizing the core design time. Also, a core design can be used over more chip generations simply by
scaling the number of cores present on a chip. Only the processor interconnection logic is not entirely
replicated on a CMP.
• Why aren’t CMPs used now?
Although a CMP addresses all of these potential problems in a straightforward, scalable manner, the
reasons CMPs are not yet common are as follows:
Integration densities are just reaching levels where these problems are becoming significant
enough to consider a paradigm shift in processor design. The primary reason, however, is because it is
very difficult to convert today’s important uniprocessor programs into multiprocessor ones. Conventional
multiprocessor programming techniques typically require careful data layout in memory to avoid conflicts
between processors, minimization of data communication between processors, and explicit
synchronization at any point in a program where processors may actively share data. A CMP is much less
sensitive to poor data layout and poor communication management, since the interprocessor
communication latencies are lower and bandwidths are higher. However, sequential programs must still
be explicitly broken into threads and synchronized properly.
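For example (a minimal sketch; the shared counter is a hypothetical stand-in for actively shared data), each point where threads may touch the same data needs explicit synchronization once a sequential program is broken into threads:

    #include <mutex>
    #include <thread>

    long counter = 0;   // data actively shared by both threads
    std::mutex m;

    void add(int n) {
        for (int i = 0; i < n; ++i) {
            std::lock_guard<std::mutex> lock(m);  // explicit synchronization:
            ++counter;                            // without it, this update races
        }
    }

    int main() {
        std::thread a(add, 100000), b(add, 100000);
        a.join(); b.join();
        return counter == 200000 ? 0 : 1;  // correct only because of the lock
    }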
Parallelizing compilers have been only partially successful at automatically handling these tasks
for programmers. As a result, acceptance of multiprocessors has been slowed because only a limited
number of programmers have mastered these techniques.
• The architectures’ major design considerations, in qualitative terms:
CPU cores:
To keep the processors’ execution units busy, the superscalar and SMT processors as shown
above are assumed to feature advanced branch prediction, register renaming, out-of-order instruction
issue, and nonblocking data caches. As a result, the processors have numerous multiported rename
buffers, issue queues, and register files. The inherent complexity of these architectures results in three
major hardware design problems:
First, their area increases quadratically with the core’s complexity. The number of registers in each structure
must increase proportionally to the instruction window size. Additionally, the number of ports on each
register must increase proportionally to the processor’s issue width.
The CMP approach minimizes this problem because it attempts to exploit higher levels of
parallelism using more processors instead of a larger issue width within a single
processor. This results in an approximately linear area-to-issue-width relationship, since the area of each
additional processor is essentially constant and each adds a constant number of issue slots. Using this
relationship, an 8 x 2-issue CMP (16 total issue slots) has an area similar to that of a single 12-issue
processor.
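A back-of-the-envelope version of this scaling argument (the symbols and constants are illustrative assumptions, not figures from the cited work):

\[
A_{\text{superscalar}}(w) \approx k\,w^{2}, \qquad
A_{\text{CMP}}(n, c) \approx n\,(A_{0} + k\,c^{2}),
\]

where w is the issue width, n the number of cores, c the per-core issue width, A_0 a fixed per-core overhead, and k a technology constant. The CMP total grows linearly in the number of cores (and hence in total issue slots, n times c), while the superscalar term grows quadratically: the issue-dependent area of eight 2-issue cores is 8 x k x 4 = 32k, versus k x 144 = 144k for a single 12-issue core, which is how the 16-slot CMP can fit in roughly the die area of one 12-issue processor once per-core overhead and interconnect are added.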
Second, they can require longer cycle times. Long, high-capacitance I/O wires span the large buffers, queues,
and register files. Extensive use of multiplexers and crossbars to interconnect these units adds more
capacitance. Delays associated with these wires will probably dominate the delay along the CPU’s critical
path. The cycle time impact of these structures can be mitigated by careful design using deep pipelining,
by breaking up the structures into small, fast clusters of closely related components connected by short
wires, or both. But deeper pipelining increases branch misprediction penalties, and clustering tends to
reduce the ability of the processor to find and exploit instruction-level parallelism.
The CMP approach allows a fairly short cycle time to be targeted with relatively little design effort, since
its hardware is naturally clustered — each of the small CPUs is already a very small, fast cluster of
components. Since the operating system allocates a single software thread of control to each processor,
the partitioning of work among the “clusters” is natural and requires no hardware to dynamically allocate
instructions to different component clusters. This heavy reliance on software to direct instructions to
clusters limits the amount of instruction-level parallelism that can be dynamically exploited by the entire
CMP, but it allows the structures within each CPU to be small and fast.
Since these factors are difficult to quantify, the evaluated superscalar and SMT architectures
represent how these systems would perform if it were possible to build an optimal implementation with a
fairly shallow pipeline and no clustering, a combination that would result in an unacceptably low clock
cycle time in reality. This probably gives the CMP a handicap in the simulations.
Third, the CPU cores are complicated and composed of many closely interconnected components. As
a result, design and verification costs will increase since they must be designed and verified as single,
large units.
The CMP architecture uses a group of small, identical processors. This allows the design and
verification costs for a single CPU core to be lower, and amortizes those costs over a larger number of
processor cores. It may also be possible to utilize the same core design across a family of processor
designs, simply by including more or fewer cores.
With even more advanced IC technologies, the logic, wire, and design complexity advantages
will increasingly favor a multiprocessor implementation over a superscalar or SMT implementation.
Memory:
A 12-issue superscalar or SMT processor can place large demands on the memory system. For
example, to handle load and store instructions quickly enough, the processors would require a large
primary data cache with four to six independent ports. The SMT processor requires more bandwidth from
the primary cache than the superscalar processor, because its multiple independent threads will typically
allow the core to issue more loads and stores in each cycle, some from each thread. To accommodate
these accesses, the superscalar and SMT architectures have 128-Kbyte, multibanked primary caches with
a two-cycle latency due to the size of the primary caches and the bank interconnection complexity.
The CMP architecture features sixteen 16-Kbyte caches: the eight cores are completely
independent and tightly integrated with their individual pairs of caches, another form of clustering, which
leads to a simple, high-frequency design for the primary cache system. The small cache size and tight
connection to these caches allows single-cycle access. The rest of the memory system remains essentially
unchanged, except that the secondary cache controller must add two extra cycles of secondary cache
latency to handle requests from multiple processors. To make a shared memory multiprocessor, the data
caches could be made write-through, or a MESI (modified, exclusive, shared, and invalid) cache-
coherence protocol could be established between the primary data caches. Because the bandwidth to an
on-chip cache can easily be made high enough to handle the write-through traffic, the simpler coherence
scheme is chosen for the CMP.
In this way, designers can implement a small-scale multiprocessor with very low interprocessor
communication latency. To provide enough off-chip memory bandwidth for the high-performance
processors, all simulations were made with main memory composed of multiple banks of Rambus
DRAMs (RDRAMs), attached via multiple Rambus channels to each processor.
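The write-through scheme chosen above can be sketched in a few lines (a toy software model under simplifying assumptions, not the simulated hardware): every store updates the shared secondary cache and invalidates the line in the other cores' primary caches, so no MESI state machine is required:

    #include <array>
    #include <cstdint>
    #include <unordered_map>

    constexpr int kCores = 8;

    struct L1 { std::unordered_map<std::uint64_t, std::uint32_t> lines; };
    std::array<L1, kCores> l1;                               // private L1s
    std::unordered_map<std::uint64_t, std::uint32_t> l2;     // shared L2

    // Write-through, write-invalidate: the store updates the shared L2 and
    // removes stale copies from every other core's private L1.
    void store(int core, std::uint64_t line, std::uint32_t data) {
        l1[core].lines[line] = data;                 // update own L1
        l2[line] = data;                             // write through to L2
        for (int c = 0; c < kCores; ++c)
            if (c != core) l1[c].lines.erase(line);  // invalidate other copies
    }

    std::uint32_t load(int core, std::uint64_t line) {
        auto it = l1[core].lines.find(line);
        if (it != l1[core].lines.end()) return it->second;  // L1 hit
        return l1[core].lines[line] = l2[line];             // miss: fill from L2
    }

    int main() {
        store(0, 0x40, 7);                   // core 0 writes; others invalidated
        return load(1, 0x40) == 7 ? 0 : 1;   // core 1 misses, reads 7 from L2
    }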
Compiler support:
The main challenge for the compiler targeting the superscalar processor is finding enough
instruction-level parallelism in applications to use a 12-issue processor effectively. Code reordering is
fundamentally limited by true data dependencies and control dependencies within a thread of instructions.
It is likely that most integer applications will be unable to use a 12-issue processor effectively, even with
very aggressive branch prediction and advanced compiler support for exposing instruction-level
parallelism. Limit studies with large instruction windows and perfect branch prediction have shown that a
maximum of approximately 10–15 instructions per cycle are possible for general-purpose integer
applications [9]. Branch mispredictions will reduce this number further in a real processor.
On the other hand, programmers must find thread-level parallelism in order to maximize CMP
performance. The SMT also requires programmers to explicitly divide code into threads to get maximum
performance, but, unlike the CMP, it can dynamically find more instruction-level parallelism if thread-
level parallelism is limited. With current trends in parallelizing compilers, multithreaded operating
systems, and the awareness of programmers about how to program parallel computers, however, these
problems should prove less daunting in the future. Additionally, having all eight of the CPUs on a single
chip allows designers to exploit thread-level parallelism even when threads communicate frequently.
This has been a limiting factor on today’s multichip multiprocessors, preventing some parallel programs
from attaining speedups, but the low communication latencies inherent in a single-chip microarchitecture
allow speedups to occur across a wide range of parallelism [4].
Hardware Performance & Comparison:
In this section I have tried to compare the three architectures based on the simulations and experiments
conducted by the various research groups. I have presented the results from refs. [??] and [??] to draw
conclusions for my study.
CMP versus Superscalar:
Two main concerns: (1) area and (2) cycle time.
An instruction window that enables the dynamic issue of instructions requires a large die area. The
PA-8000, a 4-issue superscalar, devotes 20% of its die area solely to the instruction window. In general, the
area requirement increases quadratically with issue width. An increase in issue width typically requires an
increase in the number of ports in the register file; alternatively, it may involve replicating the register file,
as in the Alpha 21264.
The number of datapaths between the functional units and register files also increases quadratically
with the issue width. A CMP requires extra hardware for speculation support, but the overhead for
register communication is quite modest. The register bypass network (which forwards values directly from
the outputs of functional units to their inputs, permitting back-to-back issue of data-dependent instructions)
may be an important factor in determining the cycle time of future high-issue processors.
Other concerns:
The inability to extract a significant amount of parallelism from the application leads to an uneven
distribution of work among the different processors in a CMP.
A CMP exploits parallelism better than a 12-issue superscalar in applications that are heavily loop-based,
where most of the loops have few or no loop-carried dependences. Each processor in the CMP
executes an iteration and most of the time can issue instructions independently, without being affected by
dependences on other threads. In the 12-issue superscalar, by contrast, the centralized instruction window is
often clogged by instructions that are either data dependent on long-latency FP operations or waiting
on cache misses. On average, the IPC of a 4 x 4-issue CMP is nearly twice that of a 12-issue
superscalar.
Thus it can be seen that:
Superscalar
• The norm of today's high-performance microprocessors.
• The issue rate of these microprocessors has continued to increase over the past few years; the
Compaq Alpha 21264, IBM PowerPC, Intel Pentium Pro and MIPS R10000 issue four instructions
per cycle.
• Special hardware dynamically identifies independent instructions:
maintaining a large pool of instructions in a large associative window;
register renaming to eliminate false dependences.
• Out-of-order issue (an instruction is issued as soon as its operands and a functional unit are available).
Thus parallelism is extracted only from the ILP of the program at run time.
• Requires centralized hardware structures that lengthen the critical path of the processor pipeline:
register renaming logic;
instruction window wake-up and select mechanism;
register bypass logic.
• Long-latency interconnects in this centralized approach.
Chip multiprocessor
• Exploits thread-level parallelism.
• Exploits the increasing transistor count on a chip.
• The growing complexity of wide-issue dynamic processors, together with fast on-chip communication
at the register level, will soon make CMPs popular.
• Speculative execution across threads improves performance but requires true memory dependence
violations to be handled.
• Decentralized architecture:
divides the application into multiple threads and exploits ILP across them;
multiple threads run on multiple simple processing units on a single chip (the CMP architecture).
• Design simplicity:
o fast clocking of each of the processing units;
o eases the time-consuming design validation phase.
• Fast communication between processing units localizes interconnects (versus the long-latency
interconnects of the centralized approach).
• Better utilization of silicon space, avoiding the extra logic devoted to a centralized architecture
=> higher overall issue bandwidth.
Olukotun et al. show how a CMP with eight 2-issue superscalar processing units would occupy
the same area as a conventional 12-issue superscalar processor.
• ideal for running multithreaded applications
• May not give good performance when running sequential applications, since parallelizing compilers
succeed only on a restricted class of applications, typically numeric ones, and so cannot handle a
large class of sequential applications.
Speculation can help: the compiler must assume inter-thread dependences exist whenever it cannot
guarantee data independence among threads. Speculative execution improves performance but requires
true memory dependence violations to be handled; a technique to solve this problem is discussed in [21].
1. Superscalar: The superscalar processor, shown in Figure 1a, can dynamically issue up to 12
instructions per cycle.
2. Simultaneous Multithreading: The SMT processor, shown in Figure 1b, is identical to the superscalar
except that it has eight separate program counters and executes instructions from up to eight different
threads of control concurrently. The processor core dynamically allocates instruction fetch and execution
resources among the different threads on a cycle-by-cycle basis to find as much thread-level and
instruction-level parallelism as possible.
3. Chip Multiprocessor: The CMP, shown in Figure 1c, is composed of eight small 2-issue superscalar
processors. This processor depends on thread-level parallelism, since its ability to find instruction-level
parallelism is limited by the small size of each processor.

Characteristics of the superscalar, simultaneous multithreading, and chip multiprocessor architectures:

  Characteristic                                  Superscalar    SMT              CMP
  Number of CPUs                                  1              1                8
  CPU issue width                                 12             12               2 per CPU
  Number of threads                               1              8                1 per CPU
  Architecture registers (integer and FP)         32             32 per thread    32 per CPU
  Physical registers (integer and FP)             32 + 256       256 + 256        32 + 32 per CPU
  Instruction window size                         256            256              32 per CPU
  Branch predictor table size (entries)           32,768         32,768           8 x 4,096
  Return stack size                               64 entries     64 entries       8 x 8 entries
  I and D cache organization                      1 x 8 banks    1 x 8 banks      1 bank
  I and D cache sizes                             128 Kbytes     128 Kbytes       16 Kbytes per CPU
  I and D cache associativity                     4-way          4-way            4-way
  I and D cache line sizes (bytes)                32             32               32
  I and D cache access times (cycles)             2              2                1
  Secondary cache organization                    1 x 8 banks    1 x 8 banks      1 x 8 banks
  Secondary cache size (Mbytes)                   8              8                8
  Secondary cache associativity                   4-way          4-way            4-way
  Secondary cache line size (bytes)               32             32               32
  Secondary cache access time (cycles)            5              5                7
  Secondary cache occupancy per access (cycles)   1              1                1
  Memory organization (no. of banks)              4              4                4
  Memory access time (cycles)                     50             50               50
  Memory occupancy per access (cycles)            13             13               13
Figure 2. Relative performance of superscalar, simultaneous multithreading, and chip
multiprocessor architectures compared to a baseline, 2-issue superscalar architecture.
Performance results:
Figure ?? shows the performance of the superscalar, SMT, and CMP architectures on the four
benchmarks relative to a baseline architecture—a single 2-issue processor attached to the
superscalar/SMT memory system.
The first two benchmarks show performance on applications with moderate memory behavior
and no thread-level parallelism (compress) or large amounts of thread-level parallelism (mpeg).
The CMP experienced a nearly eight-times performance improvement over the single 2-issue processor.
The separate primary caches are beneficial because they can be accessed by all processors in parallel. In a
separate test with eight processors sharing a single cache, bank contention between accesses from
different processors degraded performance significantly. The average memory access time to the primary
cache alone went up from 1.1 to 5.7 cycles, mostly due to extra queuing delays at the contended banks,
and overall performance dropped 24 percent. In contrast, the shared secondary cache is not a bottleneck in
the CMP because it received an order of magnitude fewer accesses. SMT results showed similar trends.
The speedups tracked the CMP results closely when modeling similar degrees of data cache contention.
The nominal performance was similar to that of the CMP with a single primary cache, and performance
improved 17 percent when primary cache contention was temporarily deactivated.
control in the SMT allowed it to exploit thread-level parallelism. Additionally, the dynamic resource
allocation in the SMT allowed it to be competitive with the CMP, even though it had fewer total issue
slots.
However, tomcatv’s memory behavior highlighted a fundamental problem with the SMT
architecture: the unified data cache architecture was a bandwidth limitation. Making a data cache with
enough banks or ports to keep up with the memory requirements of eight threads requires a more
sophisticated crossbar network that will add more latency to every cache access, and may not help if there
is a particular bank that is heavily accessed. The CMP’s independent data caches avoid this problem but
are not possible in an SMT.
As with compress, the multiprogramming workload has limited amounts of instruction-level
parallelism, so the speedup of the superscalar architecture was only a 35 percent increase over the
baseline processor. Unlike compress, however, the multiprogramming workload had large amounts of
process-level parallelism, which both the SMT and CMP exploited effectively. This resulted in a linear
eight-times speedup for the CMP. The SMT achieved nearly a seven-times speedup over the 2-issue
baseline processor, more than the increase in the number of issue slots alone would suggest is possible,
because it efficiently utilized processor resources by interleaving threads cycle by cycle.
Thus this study shows that the CMP achieves superior performance using relatively simple hardware.
Fine Comparison of Simultaneous Multithreading versus Single-Chip Multiprocessing: (These are the
results as shown in ref [??])
This section compares the performance of simultaneous multithreading to small-scale, single-chip
multiprocessing (MP). At the organizational level, the two approaches are extremely similar: both have
multiple register sets, multiple functional units, and high issue bandwidth on a single chip. The key
difference is in the way those resources are partitioned and scheduled: the multiprocessor statically
partitions resources, devoting a fixed number of functional units to each thread, while the SM processor
allows the partitioning to change every cycle. Clearly, scheduling is more complex for an SM processor;
however, it is shown that in other areas the SM model requires fewer resources, relative to
multiprocessing, to achieve a desired level of performance.
For these experiments, SM and MP configurations are chosen to be reasonably equivalent. For most of the
comparisons, all or most of the following are kept equal: the number of register sets (i.e., the number of
threads for SM and the number of processors for MP), the total issue bandwidth, and the specific
functional unit configuration. A consequence of the last item is that the functional unit configuration is
often optimized for the multiprocessor and represents an inefficient configuration for simultaneous
multithreading. All experiments use 8 KB private instruction and data caches (per thread for SM, per
processor for MP), a 256 KB 4-way set-associative shared second-level cache, and a 2 MB direct-mapped
third-level cache. It is desired to keep the caches constant in the comparisons, and this (private I and D
caches) is the most natural configuration for the multiprocessor.
MPs are evaluated with 1, 2, and 4 issues per cycle on each processor. SM processors are
evaluated with 4 and 8 issues per cycle; however the SM: Four Issue model (defined in Section ??) is
used, for all of the SM measurements (i.e., each thread is limited to four issues per cycle). Using this
model minimizes some of the inherent complexity differences between the SM and MP architectures. For
example, an SM: Four Issue processor is similar to a single-threaded processor with 4 issues per cycle in
terms of both the number of ports on each register file and the amount of inter-instruction dependence
checking. In each experiment the same version of the benchmarks is run for both configurations
(compiled for a 4-issue, 4 functional unit processor, which most closely matches the MP configuration)
on both the MP and SM models; this typically favors the MP.
It must be noted that, while in general the experiments are biased in favor of the MP, the SM results may be
optimistic in two respects: the amount of time required to schedule instructions onto functional units,
and the shared cache access time. The distance between the load/store units and the data cache can have a
large impact on cache access time. The multiprocessor, with private caches and private load/store units,
can minimize the distances between them. The SM processor cannot do so, even with private caches,
because the load/store units are shared. However, two alternate configurations could eliminate this
difference. Having eight load/store units (one private unit per thread, associated with a private cache)
would still allow SM to match MP performance with fewer than half the total number of MP functional
units (15 vs. 32). Or, with 4 load/store units and 8 threads, it is possible to statically share a single
cache/load-store-unit combination among each set of 2 threads. Threads 0 and 1 might share one load/store
unit, and all accesses through that load/store unit would go to the same cache, thus minimizing the
distance between cache and load/store unit while still allowing resource sharing. Figure ?? shows the
results of the SM/MP comparison for various configurations.
Tests A, B, and C compare the performance of the two schemes with an essentially unlimited
number of functional units (FUs); i.e., there is a functional unit of each type available to every issue slot.
The number of register sets and total issue bandwidth are constant for each experiment. In these models,
the ratio of functional units (and threads) to issue bandwidth is high, so both configurations should be
able to utilize most of their issue bandwidth. Simultaneous multithreading, however, does so more
effectively.
Test D repeats test A but limits the SM processor to a more reasonable configuration (the same
10 functional unit configuration used throughout this paper). This configuration outperforms the
multiprocessor by nearly as much as test A, even though the SM configuration has 22 fewer functional
units and requires fewer forwarding connections.
In tests E and F, the MP is allowed a much larger total issue bandwidth. In test E, each MP
processor can issue 4 instructions per cycle for a total issue bandwidth of 32 across the 8 processors; each
SM thread can also issue 4 instructions per cycle, but the 8 threads share only 8 issue slots. The results are
similar despite the disparity in issue slots. In test F, the 4-thread, 8-issue SM slightly outperforms a 4-
processor, 4-issue per processor MP, which has twice the total issue bandwidth. Simultaneous
multithreading performs well in these tests, despite its handicap, because the MP is constrained with
respect to which 4 instructions a single processor can issue in a single cycle.
Test G shows the greater ability of SM to utilize a fixed number of functional units. Here both
SM and MP have 8 functional units and 8 issues per cycle. However, while the SM is allowed to have 8
contexts (8 register sets), the MP is limited to two processors (2 register sets), because each processor
must have at least 1 of each of the 4 functional unit types. Simultaneous multithreading’s ability to drive
up the utilization of a fixed number of functional units through the addition of thread contexts achieves
more than 2.5 times the throughput.
These comparisons show that simultaneous multithreading outperforms single-chip
multiprocessing in a variety of configurations because of the dynamic partitioning of functional
units. More important, SM requires many fewer resources (functional units and instruction issue slots)
to achieve a given performance level. For example, a single 8-thread, 8-issue SM processor with 10
functional units is 24~o faster than the 8-processor, single-issue MP (Test D), which has identical issue
bandwidth but requires 32 functional units; to equal the throughput of that 8-thread 8-issue SM, an MP
system requires eight 4-issue processors (Test E), which consume 32 functional units and 32 issue slots
per cycle.
Finally, there are further advantages of SM over MP that are not shown by the experiments:
• Performance with few threads — These results show only the performance at maximum utilization.
The advantage of SM (over MP) grows as some of the contexts (processors) become unutilized.
An idle processor leaves 1/p of an MP idle, while with SM, the other threads can expand to use the
available resources. This is important when (1) running parallel code where the degree of parallelism
varies over time, (2) the performance of a small number of threads is important in the target
environment, or (3) the workload is sized for the exact size of the machine (e.g., 8 threads). In the last
case, a processor and all of its resources is lost when a thread experiences a latency orders of
magnitude larger than that simulated (e.g., I/O).
• Granularity and flexibility of design — The configuration options are much richer with SM, because
the units of design have finer granularity. With a multiprocessor, it is typical to add computing power
in units of entire processors. With simultaneous multithreading, it is possible to benefit from the
addition of a single resource, such as a functional unit, a register context, or an instruction issue
slot; furthermore, all threads would be able to share that resource. The comparisons above did not
take advantage of this flexibility, so processor designers, taking full advantage of the
configurability of simultaneous multithreading, should be able to construct configurations that
out-distance multiprocessing even further.
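The first advantage can be stated as a simple bound. Assuming p identical processors (or hardware contexts), k <= p runnable threads, an average per-thread ILP of I instructions per cycle, and a total issue width of W (a deliberately simplified model, not the papers' simulation):

\[
T_{\mathrm{MP}}(k) \le k \cdot \min\!\left(I, \frac{W}{p}\right),
\qquad
T_{\mathrm{SM}}(k) \le \min\!\left(k \cdot I,\; W\right).
\]

When a thread goes idle, the MP bound drops by a full 1/p of the machine, whereas the SM bound drops only if k·I falls below W, because the remaining threads expand into the freed issue slots.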
Performance Comparison of SMT and CMP Using Parallel Workloads

Why Parallel Applications?
SMT is most effective when threads have complementary hardware resource requirements.
Multiprogrammed workloads and workloads consisting of parallel applications both provide TLP via
independent streams of control, but they compete for hardware resources differently. Because a
multiprogrammed workload (used in our previous work [Tullsen et al. 1995; 1996]) does not share
memory references across threads, it places more stress on the caches. Furthermore, its threads have
different instruction execution patterns, causing interference in branch prediction hardware. On the other
hand, multiprogrammed workloads are less likely to compete for identical functional units.
Although parallel applications have the benefit of sharing the caches and branch prediction
hardware, they are an interesting and different test of SMT for several reasons. First, unlike the
multiprogrammed workload, all threads in a parallel application execute the same code and,
therefore, have similar execution resource requirements, memory reference patterns, and levels
of ILP. Because all threads tend to have the same resource needs at the same time, there is
potentially more contention for these resources compared to a multiprogrammed workload. For
example, a particular loop may have a large degree of instruction-level parallelism, so each
thread will require a large number of renaming registers and functional units. Because all
threads have the same resource needs, they may exacerbate or create bottlenecks in these
resources. Parallel applications are therefore particularly appropriate for this study, which
focuses on these execution resources. Second, parallel applications illustrate the promise of SMT
as an architecture for improving the performance of single applications. By using threads to
parallelize programs, SMT can improve processor utilization, but more important, it can achieve
program speedups. Finally, parallel applications are a natural workload for traditional parallel
architectures and therefore serve as a fair basis for comparing SMT and multiprocessors. For the
sake of comparison, in Section 7, we also briefly contrast our parallel results with the
multiprogrammed results from Tullsen et al. [1996].

Table V. Throughput comparison of MP2, MP4, and SMT, measured in instructions per cycle

                        Number of threads
Configuration       1       2       4       8
MP2               2.08    3.32     --      --
MP4               1.38    2.25    3.27     --
SMT               2.40    3.49    4.24    4.88
Another set of experiments is described in ref [??]; the processor instruction latencies and memory
hierarchy details used there are shown in figure ??.
Contributions regarding design tradeoffs for future high-end processors
First, the performance costs of resource partitioning for various multiprocessor configurations have
been identified. By partitioning execution resources between processors, multiprocessors enforce the
distinction between instruction- and thread-level parallelism. In this study, we examined two MP design
choices with similar hardware cost in terms of execution resources: one design with more resources per
processor (MP2) and one with twice as many processors, but fewer resources on each (MP4). Our results
showed that both alternatives frequently suffered from an inefficient use of their resources and that
improvements could only be obtained with costly upgrades in processor resources. The MP designs were
unable to adapt to varying levels of ILP and TLP, so their performance depended heavily on the
parallelism characteristics of the applications. For programs with more ILP, MP2 outperformed MP4; for
programs with less ILP, MP4 was superior because it exploited more thread-level parallelism. To
maximize performance on an MP, compilers and parallel programmers are therefore faced with the
difficult task of partitioning program parallelism (ILP and TLP) in a manner that matches the physical
partitioning of resources.
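The partitioning dilemma can be illustrated with a deliberately simplified throughput model (our sketch, not the study's methodology), assuming MP2 is two 4-issue processors and MP4 is four 2-issue processors, consistent with the thread counts in Table V:

    def mp_ipc(n_procs, width_per_proc, ilp, tlp):
        """Toy MP throughput: one thread per processor, each capped by its width."""
        active = min(n_procs, tlp)
        return active * min(width_per_proc, ilp)

    def smt_ipc(total_width, contexts, ilp, tlp):
        """Toy SMT throughput: all contexts share one wide issue window."""
        active = min(contexts, tlp)
        return min(active * ilp, total_width)

    cases = [("high ILP, low TLP (ilp=4, tlp=2)", 4, 2),
             ("low ILP, high TLP (ilp=1, tlp=8)", 1, 8)]
    for name, ilp, tlp in cases:
        print(name,
              "| MP2:", mp_ipc(2, 4, ilp, tlp),
              "| MP4:", mp_ipc(4, 2, ilp, tlp),
              "| SMT:", smt_ipc(8, 8, ilp, tlp))

In the high-ILP case the model favors MP2, in the high-TLP case it favors MP4, while the SMT configuration reaches its full issue bandwidth in both, which is exactly the adaptability described above.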
Second, it has been shown that, in contrast, simultaneous multithreading allows compilers and
programmers to focus on extracting whatever parallelism exists, by treating instruction- and thread-level
parallelism equally. ILP and TLP are fundamentally identical; they both represent independent
instructions that can be used to increase processor utilization and improve performance. SMT has the
flexibility to use both forms of parallelism interchangeably, because threads can share resources
dynamically. Rather than adding more resources to further improve performance, existing resources are
used more effectively. By using more hardware contexts, SMT can take advantage of TLP to expose more
parallelism and attain an average throughput of 4.88 instructions per cycle, while increasing its
performance edge over MP2 and MP4 to 64% and 52%, respectively.
Third, our results demonstrate that SMT can achieve large program speedups on parallel
applications. Even though these parallel threads have greater potential for interference because of similar
resource usage patterns (including memory references and demands for renaming registers and functional
units), simultaneous multithreading has the ability to compensate for these potential conflicts. We found
that interthread cache interference, bank contention, and branch prediction interference on an SMT
processor had only minimal effects on performance. The latency hiding characteristics of simultaneous
multithreading allow it to achieve a 2.68 average speedup over a single MP2 processor, whereas MP2 and
MP4 speedups are limited to 1.63 and 1.76, respectively. The bottom line is that simultaneous
multithreading makes better utilization of on-chip resources to run parallel applications effectively.
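As a consistency check, these speedups reproduce the performance edges quoted earlier:

\[
\frac{2.68}{1.63} \approx 1.64,
\qquad
\frac{2.68}{1.76} \approx 1.52,
\]

i.e., margins of 64% over MP2 and 52% over MP4.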
For these reasons, as well as the performance and complexity results shown, we believe that when
component densities permit us to put multiple hardware contexts and wide issue bandwidth on a single
chip, simultaneous multithreading will represent the most efficient organization of those resources.
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

1. Superscalar: The superscalar processor, shown in Figure 1a, can dynamically issue up to 12
instructions per cycle.
2. Simultaneous multithreading: The SMT processor, shown in Figure 1b, is identical to the
superscalar except that it has eight separate program counters and executes instructions from up to
eight different threads of control concurrently. The processor core dynamically allocates instruction
fetch and execution resources among the different threads on a cycle-by-cycle basis to find as much
thread-level and instruction-level parallelism as possible.
3. Chip multiprocessor: The CMP, shown in Figure 1c, is composed of eight small 2-issue superscalar
processors. This processor depends on thread-level parallelism, since its ability to find
instruction-level parallelism is limited by the small size of each processor.

Characteristic                                     Superscalar    Simultaneous     Chip
                                                                  multithreading   multiprocessor
Number of CPUs                                     1              1                8
CPU issue width                                    12             12               2 per CPU
Number of threads                                  1              8                1 per CPU
Architecture registers (for integer and FP)        32             32 per thread    32 per CPU
Physical registers (for integer and FP)            32 + 256       256 + 256        32 + 32 per CPU
Instruction window size                            256            256              32 per CPU
Branch predictor table size (entries)              32,768         32,768           8 x 4,096
Return stack size                                  64 entries     64 entries       8 x 8 entries
Instruction (I) and data (D) cache organization    1 x 8 banks    1 x 8 banks      1 bank
I and D cache sizes                                128 Kbytes     128 Kbytes       16 Kbytes per CPU
I and D cache associativities                      4-way          4-way            4-way
I and D cache line sizes (bytes)                   32             32               32
I and D cache access times (cycles)                2              2                1
Secondary cache organization                       1 x 8 banks    1 x 8 banks      1 x 8 banks
Secondary cache size (Mbytes)                      8              8                8
Secondary cache associativity                      4-way          4-way            4-way
Secondary cache line size (bytes)                  32             32               32
Secondary cache access time (cycles)               5              5                7
Secondary cache occupancy per access (cycles)      1              1                1
Memory organization (no. of banks)                 50             50               50
Memory access time (cycles)                        4              4                4
Memory occupancy per access (cycles)               13             13               13
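For readers who want to experiment with these parameters, the table can be transcribed into structured data; the sketch below (our convenience, with hypothetical field names) captures a subset of the fields:

    from dataclasses import dataclass

    @dataclass
    class CoreConfig:
        name: str
        cpus: int
        issue_width_per_cpu: int
        threads_per_cpu: int       # hardware contexts per CPU
        l1_size_kbytes: int        # I and D cache size (per CPU for the CMP)
        l1_banks: int              # cache banks (per CPU for the CMP)
        l1_access_cycles: int

    SUPERSCALAR = CoreConfig("superscalar", 1, 12, 1, 128, 8, 2)
    SMT         = CoreConfig("SMT",         1, 12, 8, 128, 8, 2)
    CMP         = CoreConfig("CMP",         8,  2, 1,  16, 1, 1)

    for c in (SUPERSCALAR, SMT, CMP):
        print(f"{c.name}: total issue width {c.cpus * c.issue_width_per_cpu}, "
              f"hardware threads {c.cpus * c.threads_per_cpu}")

The totals make the comparison's framing explicit: the superscalar and SMT share one wide 12-issue core (with 1 and 8 threads, respectively), while the CMP reaches 8 threads by replicating narrow 2-issue cores.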
Figure 2. Relative performance of superscalar, simultaneous multithreading, and chip
multiprocessor architectures compared to a baseline, 2-issue superscalar architecture.
The CMP experienced a nearly eight-times performance improvement over the single 2-issue processor.
The separate primary caches are beneficial because they can be accessed by all processors in parallel. In
a separate test with eight processors sharing a single cache, bank contention between accesses from
different processors degraded performance significantly. The average memory access time to the primary
cache alone went up from 1.1 to 5.7 cycles, mostly due to extra queuing delays at the contended banks,
and overall performance dropped 24 percent. In contrast, the shared secondary cache was not a bottleneck
in the CMP because it received an order of magnitude fewer accesses. SMT results showed similar trends.
The SMT speedups tracked the CMP results closely when similar degrees of data cache contention were
modeled: its nominal performance was similar to that of the CMP with a single shared primary cache, and
performance improved by 17 percent when primary cache contention was artificially removed.
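The contention effect can be reproduced qualitatively with a small queuing sketch (our illustration; the access probability, bank hashing, and latencies are assumptions, not the paper's parameters):

    import random
    from collections import deque

    def avg_access_time(n_procs, n_banks, p_access=0.5, base_latency=1, cycles=20000):
        """Average primary-cache access time with random bank conflicts.

        Each cycle every processor issues an access with probability p_access to
        a random bank; a bank retires one access per cycle, so collisions queue.
        """
        queues = [deque() for _ in range(n_banks)]
        total_wait = total_accesses = 0
        for t in range(cycles):
            for _ in range(n_procs):
                if random.random() < p_access:
                    queues[random.randrange(n_banks)].append(t)
            for q in queues:
                if q:
                    total_wait += t - q.popleft()
                    total_accesses += 1
        return base_latency + total_wait / total_accesses

    random.seed(1)
    print("private cache (1 proc, 1 bank):  ", avg_access_time(1, 1))
    print("shared cache (8 procs, 8 banks): ", avg_access_time(8, 8))

With private caches a processor never competes for its bank, while in the shared configuration bursts of accesses that hash to the same bank queue behind one another; this is the same queuing-delay effect that the measurements above attribute to the shared-cache configurations.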
The multiple threads of control in the SMT allowed it to exploit thread-level parallelism.
Additionally, the dynamic resource allocation in the SMT allowed it to be competitive with the CMP,
even though it had fewer total issue slots.
However, tomcatv’s memory behavior highlighted a fundamental problem with the SMT
architecture: the unified data cache architecture was a bandwidth limitation.
Making a data cache with enough banks or ports to keep up with the memory requirements of
eight threads requires a more sophisticated crossbar network that will add more latency to every cache
access, and may not help if there is a particular bank that is heavily accessed.
The CMP’s independent data caches avoid this problem but are not possible in an SMT.
As with compress, the multiprogramming workload had limited amounts of instruction-level
parallelism, so the speedup of the superscalar architecture was only 35 percent over the baseline
processor.
Unlike compress, however, the multiprogramming workload had large amounts of process-level
parallelism, which both the SMT and CMP exploited effectively.
This resulted in a linear eight-times speedup for the CMP.
The SMT achieved nearly a seven-times speedup over the 2-issue baseline processor, more than the
six-fold increase in issue slots (from 2 to 12) alone would suggest, because it efficiently utilized
processor resources by interleaving threads cycle by cycle.
Overall, the CMP achieved superior performance using relatively simple hardware.
CMP or SMT?

• The performance race between SMT and CMP is not yet decided.
• CMP is easier to implement, but only SMT has the ability to hide latencies.
• A functional partitioning is not easily reached within an SMT processor due to the centralized
  instruction issue. Separating the thread queues is a possible solution, although it does not
  remove the central instruction issue.
• A combination of simultaneous multithreading with the CMP may be superior.
• A research direction is to combine the SMT or CMP organization with the ability to create threads
  with compiler support or fully dynamically out of a single thread (thread-level speculation, an
  approach close to the multiscalar architecture).
It is difficult to decide which of the two approaches will power the next generation of
microprocessors. With operating systems now able to handle multiple threads, simultaneous
multithreading seems a strong option; at the same time, with chip densities increasing, chip
multiprocessors are worth considering because of their simplicity of design. The combination of the
two approaches, as seen in reference [7], is worth considering as well.
References
1. D. Tullsen, S. Eggers, and H. Levy. "Simultaneous Multithreading: Maximizing On-Chip
Parallelism." In Proc. 22nd Annual International Symposium on Computer Architecture, ACM Press,
New York, 1995, pp. 392-403.
2. J. Borkenhagen, R. Eickemeyer, and R. Kalla. "A Multithreaded PowerPC Processor for Commercial
Servers." IBM Journal of Research and Development, Vol. 44, No. 6, November 2000, pp. 885-898.
3. J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. "Converting Thread-Level
Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading." ACM Transactions
on Computer Systems, 15(2), August 1997.
4. K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang. "The Case for a Single-Chip
Multiprocessor." In Proc. 7th International Conference on Architectural Support for Programming
Languages and Operating Systems, Cambridge, Massachusetts, October 1996, pp. 2-11.
5. L. Hammond, B. A. Nayfeh, and K. Olukotun. "A Single-Chip Multiprocessor." IEEE Computer,
September 1997.
6. M. Gulati and N. Bagherzadeh. "Performance Study of a Multithreaded Superscalar
Microprocessor." In Proc. 2nd International Symposium on High-Performance Computer Architecture,
February 1996, pp. 291-301.
7. K. Park, S.-H. Choi, Y. Chung, W.-J. Hahn, and S.-H. Yoon. "On-Chip Multiprocessor with
Simultaneous Multithreading." http://etrij.etri.re.kr/etrij/pdfdata/22-04-02.pdf
8. B. A. Nayfeh, L. Hammond, and K. Olukotun. "Evaluation of Design Alternatives for a
Multiprocessor Microprocessor." In Proc. 23rd Annual International Symposium on Computer
Architecture, May 1996, pp. 67-77.
9. K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang. "The Case for a Single-Chip
Multiprocessor." In Proc. 7th International Conference on Architectural Support for Programming
Languages and Operating Systems, ACM, New York, October 1996, pp. 2-11.
10. L. Hammond, B. A. Hubbert, M. Siu, M. K. Prabhu, M. Chen, and K. Olukotun. "The Stanford
Hydra CMP." IEEE Micro, Vol. 20, No. 2, March/April 2000.
11. S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen. "Simultaneous Multithreading:
A Platform for Next-Generation Processors." IEEE Micro, September/October 1997, pp. 12-18.
12. V. Krishnan and J. Torrellas. "Hardware and Software Support for Speculative Execution of
Sequential Binaries on a Chip-Multiprocessor." In Proc. ACM International Conference on
Supercomputing (ICS'98), June 1998, pp. 85-92.
13. goethe.ira.uka.de/people/ungerer/proc-arch/EUROPAR-tutorial-slides.ppt
14. http://www.acm.uiuc.edu/banks/20/6/page4.html
15. Simultaneous Multithreading home page: http://www.cs.washington.edu/research/smt/