
Multiprocessing on a Chip
Proseminar Technische Informatik (WS 07/08)

Johannes Kulick (4127375)

January 21, 2008

Abstract

Multi-core processors are widely used today. So what is the usual technology in these processors, and what is state of the art? This report provides an insight into the development of multi-core processors, shown by the following three examples: the AMD K10 and Intel Core microarchitectures as examples of end-user hardware, and the IBM POWER6 microarchitecture as an example of high-end hardware. In addition, the report gives a perspective on future developments in this domain.

1 Introduction

After years of increasing the frequency of the central processing unit (CPU) as the main improvement in processor development, Intel and AMD introduced dual-core processors in 2005 and began a new way of enhancing processor performance. Today, multi-core processors are widely used in all kinds of computers, from laptops to database servers built for high traffic and throughput. Both AMD and Intel offer processors with four cores on a chip. They promise high performance and lower energy consumption at a good cost/performance ratio. But how do they work internally? How fast are they, and how will they develop?

This report provides an insight into the development of multi-core processing, based on exemplary descriptions of modern multi-core microarchitectures and processors. It describes the AMD Phenom and the Intel Core 2 Quad as examples of consumer hardware, as well as the IBM POWER6, a processor meant for mainframes. The paper also discusses their strengths and weaknesses. At the end, it gives a perspective on future developments in multi-core processors.

1.1 Related work

The most important sources for this work are the technical reports on the different microarchitectures. IBM provides a lot of information about the POWER6 microarchitecture in its Research and Development Journal [11] and on its website [6]. All important aspects of the microarchitecture are described there, leaving few questions open.

A similarly complete documentation of its microarchitecture is provided by AMD in several papers [1, 2, 3], mainly the "Software Optimization Guide for AMD Family 10h Processors" [4], though AMD withholds some information about specific technologies, such as details of the cache implementation.

Intel provides most information about the Core microarchitecture in its "Intel 64 and IA-32 Architectures Optimization Reference Manual" [7]. It is the least documented of the three microarchitectures, and few details are elaborated upon.

For future developments, the report draws on Intel's TeraScale research program, which is described in the Intel Technology Journal [5].


2 Basics of multi-core processing

A multi-core processor is basically a package of multiple processors made out of one piece of silicon, or multiple CPUs packed into a single case (the latter are sometimes called 'non-native multiprocessors'). Therefore most components of a normal CPU appear duplicated in a multi-core processor: ALUs, pipelines, FPUs (if they exist at all), etc. All these components are, in principle, equal to their single-core equivalents, except the caches.

There are two different types of multi-core processors: symmetric and asymmetric. In a symmetric multi-core processor, all cores have the same task and can execute the same set of instructions. An asymmetric multi-core processor, by contrast, contains different processor types, such as a main CPU and coprocessors; often only one processor handles the interrupts. The IBM Cell processor is an example of such an asymmetric processor. Only symmetric processors will be discussed in this report, as they are the CPUs used most frequently in modern computer systems.

The design of the caches plays an important role in multi-core processing. Unlike most other CPU parts, caches are often shared by all cores of a multi-core processor and do not appear in each core separately. Most modern multi-core systems have a hierarchical cache structure with up to three levels. The first level cache, often called L1 cache, is in most implementations a private cache for each core, while the higher level caches are often shared between the cores. This design stems from the fact that it is nearly impossible to build a cache that is both fast and big. Therefore, each core has a very fast but small L1 cache located close to the execution units of the processor, so it can access its data and instructions in a very short time. Because the L1 caches are duplicated, it is important to make sure that all the caches are coherent at all times: if they keep the same data, it must be guaranteed that these data are updated in all caches when one processor changes them.
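To make the cost of this coherence traffic tangible, here is a small, self-contained C++ sketch (my own illustration, not taken from any of the discussed designs): two threads increment two counters that either share one cache line or are padded onto separate lines. On typical multi-core hardware the padded version runs noticeably faster, because the cores no longer invalidate each other's L1 copies on every write.

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    // Two counters on the same cache line: every write by one core
    // invalidates the line in the other core's L1 cache.
    struct Unpadded { std::atomic<long> a{0}, b{0}; };

    // Padding pushes each counter onto its own 64-byte line.
    struct Padded {
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    template <typename Counters>
    static double run(Counters& c) {
        auto start = std::chrono::steady_clock::now();
        std::thread t1([&] { for (long i = 0; i < 10'000'000; ++i) c.a++; });
        std::thread t2([&] { for (long i = 0; i < 10'000'000; ++i) c.b++; });
        t1.join(); t2.join();
        return std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
    }

    int main() {
        Unpadded u;
        Padded   p;
        std::printf("shared line: %.3f s\n", run(u));
        std::printf("padded:      %.3f s\n", run(p));
    }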

The bigger but slower L2 or L3 caches are shared, which allows two or more cores to work on the same data without any coherence problems.

Additionally, most modern implementations of multi-core processors have some special hardware integrated. Often there is a memory controller on the chip, at least one bus - sometimes more - and a crossbar to connect all the components. Today the complete northbridge is often integrated into the processor.

3 The microarchitectures

3.1 AMD K10 - The “Phenom" processor

The K10 or Family 10h microarchitecture is AMD's newest development in multi-core processors. It was presented in the third quarter of 2007, and the first K10 processor, called "Phenom", was released in the fourth quarter of the same year. The Phenom is the first consumer processor with four cores made out of one piece of silicon rather than a package of two conventional dual-core processors. The first Phenom is clocked at 2.2 GHz and built in 65 nm technology.

With this new microarchitecture AMD introduced a lot of new features, such as a completely new cache design with three cache levels and a new version of the HyperTransport bus.

3.1.1 Basic concepts

The AMD K10 microarchitecture is a 3-way superscalar out-of-order microarchitecture. Out-of-order means that K10 processors can rearrange instructions to better resolve their dependencies. The fetch and decode unit (FDU) fetches up to three x86 instructions (with AMD64 extensions). While the instructions are being fetched, branch prediction takes place: after one cycle of a one-bit prediction, a complex branch prediction table is used to predict whether a branch is taken or not [4, cf. p. 217].

Figure 1: The AMD K10 microarchitecture as a block diagram (source: [4], p. 218)

The FDU has two separate decoders: the DirectPath decoder and the VectorPath decoder. These decoders determine whether an instruction is a DirectPath Single, a DirectPath Double or a VectorPath instruction, and they decode it into macro-ops [4, cf. p. 220]. Instructions are classified according to how they are decoded: DirectPath Single instructions are decoded directly into one macro-op, DirectPath Double instructions into two macro-ops. Up to three DirectPath Single or one and a half DirectPath Double instructions can be decoded per cycle. VectorPath instructions are decoded into one or more, usually more than two, macro-ops by a microcode-engine ROM. Decoding of VectorPath instructions blocks decoding of DirectPath instructions [4, cf. p. 5, p. 52].
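These decode rules can be condensed into a toy model (my own simplification, not AMD's hardware): the DirectPath decoders share a budget of three macro-ops per cycle, a Double costs two of them, and a VectorPath instruction occupies the decoders alone for at least one cycle.

    #include <cstdio>
    #include <vector>

    enum class Kind { DirectSingle, DirectDouble, VectorPath };

    // Count decode cycles for an instruction stream under the rules above:
    // the DirectPath decoders emit up to three macro-ops per cycle
    // (a Double costs two of them); a VectorPath instruction occupies
    // the decoders on its own for at least one cycle.
    int decodeCycles(const std::vector<Kind>& stream) {
        int cycles = 0;
        size_t i = 0;
        while (i < stream.size()) {
            if (stream[i] == Kind::VectorPath) {  // blocks DirectPath decoding
                ++cycles;
                ++i;
                continue;
            }
            int budget = 3;  // macro-ops per cycle
            while (i < stream.size() && stream[i] != Kind::VectorPath) {
                int cost = (stream[i] == Kind::DirectDouble) ? 2 : 1;
                if (cost > budget) break;
                budget -= cost;
                ++i;
            }
            ++cycles;
        }
        return cycles;
    }

    int main() {
        using K = Kind;
        std::vector<K> s = {K::DirectSingle, K::DirectSingle, K::DirectSingle,
                            K::DirectDouble, K::DirectSingle, K::VectorPath,
                            K::DirectSingle};
        std::printf("decode cycles: %d\n", decodeCycles(s));  // 4 with this model
    }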

The central instruction control unit takes the fetched and decoded macro-ops and saves them in a centralized buffer, which can hold up to 72 macro-ops. It then distributes the instructions to two independent instruction schedulers: an integer scheduler and a floating point scheduler [4, cf. p. 221].

The integer scheduler consists of three essentially equal queues, each with eight entries. Each queue directs instructions to an arithmetic logic unit (ALU) or an address generation unit (AGU). Queue 0 also includes a multiplier, which can forward its result to ALU 0 and 1, while queue 2 has an SSE bit-manipulation extension which can forward its result to ALU 2 [4, cf. p. 222 et seq.].

The floating point scheduler has twelve lines of three macro-ops as its queue. It provides superforwarding, AMD's term for load-forwarding floating point instructions. There is also a 64-bit wide floating-point-to-integer bus for conversions. There are three floating point execution units: one for multiplication, one for addition and one for stores. Each can execute 128-bit instructions [4, cf. p. 223 et seq.].


Finally, the load-store unit (LSU) can write the results to the L1 cache. There are two load-store queues: LS1 can issue two operations per cycle, and LS2 holds the requests that missed in the L1 cache after they probe out of LS1 [4, cf. p. 225].

3.1.2 The cache design

AMD implemented a three-level hierarchical cache structure for its new K10 microarchitecture. The L1 cache is a dedicated cache for each core, split into a 64 KB instruction cache and a 64 KB data cache; both are 2-way set-associative. Unlike typical L1 caches, the K10 L1 cache is not fed with data from the L2 cache but is the only place where fetched data is placed. So the K10 architecture has no conventional pass-through cache design, where all cache levels are filled when fetching data [4, cf. p. 219].

This leads to very high performance when fetching data and instructions. Additionally, a prefetch mechanism is implemented, which tries to detect repeating data and instruction patterns and prefetches them into the L1 cache to reduce cache misses.

With 64 KB, the L1 cache's capacity is quite limited, so there is a dedicated general-purpose 512 KB 16-way set-associative (in the first generation of K10 processors) L2 cache per core, which acts as a victim cache. A victim cache holds the data that was evicted from the original cache, in this case the L1 cache. To evict data, the L1 cache uses the well-known least-recently-used algorithm. Reloading data from the L2 cache into the L1 cache is quite fast, so latency times when requesting data from the L2 cache are short as well (9 cycles beyond the L1 cache). Data loaded from L2 to L1 is removed from L2 to avoid redundancy [4, cf. p. 219].
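The victim-cache flow just described - the L1 evicts its least-recently-used line into the L2, and a line reloaded from L2 into L1 is removed from L2 so the data is never held twice - can be sketched as follows. This is an illustrative model with made-up sizes and fully associative toy caches, not AMD's implementation.

    #include <cstdio>
    #include <list>
    #include <optional>
    #include <unordered_map>

    // A fully associative toy cache with LRU eviction; the real K10 caches
    // are set-associative, but the eviction flow is the same per set.
    class ToyCache {
        size_t capacity_;
        std::list<int> lru_;  // front = most recently used line address
        std::unordered_map<int, std::list<int>::iterator> map_;
    public:
        explicit ToyCache(size_t capacity) : capacity_(capacity) {}
        bool contains(int addr) const { return map_.count(addr) != 0; }
        void remove(int addr) {
            auto it = map_.find(addr);
            if (it != map_.end()) { lru_.erase(it->second); map_.erase(it); }
        }
        // Insert a line; return the evicted victim, if any.
        std::optional<int> insert(int addr) {
            std::optional<int> victim;
            if (lru_.size() == capacity_) {
                victim = lru_.back();       // least recently used line
                map_.erase(*victim);
                lru_.pop_back();
            }
            lru_.push_front(addr);
            map_[addr] = lru_.begin();
            return victim;
        }
        void touch(int addr) {
            auto it = map_.find(addr);
            lru_.splice(lru_.begin(), lru_, it->second);  // move to front
        }
    };

    // L1 fill with the L2 acting as a victim cache.
    void access(ToyCache& l1, ToyCache& l2, int addr) {
        if (l1.contains(addr)) { l1.touch(addr); return; }  // L1 hit
        if (l2.contains(addr)) l2.remove(addr);  // reload: drop from L2
        if (auto victim = l1.insert(addr))       // L1 eviction...
            l2.insert(*victim);                  // ...lands in the victim cache
    }

    int main() {
        ToyCache l1(2), l2(4);  // made-up line counts
        for (int addr : {1, 2, 3, 1})
            access(l1, l2, addr);
        std::printf("addr 2 now in L2 victim cache: %s\n",
                    l2.contains(2) ? "yes" : "no");  // yes
    }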

To keep these dedicated caches coherent, AMD implemented the MOESI cache-coherence protocol. As AMD describes it, the protocol defines the following five states a cache line can have [1, p. 167] (a simplified transition sketch follows the list):

• Invalid—A cache line in the invalid state does not hold a valid copy of the data. Valid copies of the data can be either in main memory or another processor cache.

• Exclusive—A cache line in the exclusive state holds the most recent, correct copy of the data. The copy in main memory is also the most recent, correct copy of the data. No other processor holds a copy of the data.

• Shared—A cache line in the shared state holds the most recent, correct copy of the data. Other processors in the system may hold copies of the data in the shared state, as well. If no other processor holds it in the owned state, then the copy in main memory is also the most recent.

• Modified—A cache line in the modified state holds the most recent, correct copy of the data. The copy in main memory is stale (incorrect), and no other processor holds a copy.

• Owned—A cache line in the owned state holds the most recent, correct copy of the data. The owned state is similar to the shared state in that other processors can hold a copy of the most recent, correct data. Unlike the shared state, however, the copy in main memory can be stale (incorrect). Only one processor can hold the data in the owned state—all other processors must hold the data in the shared state.
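Below is a minimal sketch of the transitions these five states imply for common bus events; the event names and the simplifications (no probe responses, no memory writebacks, a read miss fills a line Exclusive when no other cache holds it) are mine.

    #include <cstdio>

    enum class Moesi { Invalid, Exclusive, Shared, Modified, Owned };
    enum class Event { LocalRead, LocalWrite, RemoteRead, RemoteWrite };

    // Simplified next-state function for one cache line.
    Moesi next(Moesi s, Event e, bool othersHoldCopy) {
        switch (e) {
        case Event::LocalRead:
            if (s == Moesi::Invalid)
                return othersHoldCopy ? Moesi::Shared : Moesi::Exclusive;
            return s;  // reads do not degrade an existing valid state
        case Event::LocalWrite:
            return Moesi::Modified;  // we now hold the only current copy
        case Event::RemoteRead:
            if (s == Moesi::Modified) return Moesi::Owned;   // supply data, keep it dirty
            if (s == Moesi::Exclusive) return Moesi::Shared;
            return s;
        case Event::RemoteWrite:
            return Moesi::Invalid;   // another core's write invalidates our copy
        }
        return s;
    }

    int main() {
        Moesi s = Moesi::Invalid;
        s = next(s, Event::LocalRead, false);   // -> Exclusive
        s = next(s, Event::LocalWrite, false);  // -> Modified
        s = next(s, Event::RemoteRead, true);   // -> Owned
        std::printf("final state: %d\n", static_cast<int>(s));  // 4 = Owned
    }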

Additionally, there is a level 3 cache, which is shared between all cores. Current implementations have at most a 2 MB 32-way set-associative L3 cache. It is considered a non-inclusive victim cache. The L2 also uses the least-recently-used algorithm to evict data into the L3 cache. However, upon reloading data into the L1 cache, the data may or may not be removed from the L3 cache, depending on the probability that other cores need this data [4, cf. p. 219].

AMD does not provide any information about the algorithm used to estimate this probability.


3.1.3 The integrated northbridge

AMD has integrated a complete northbridge on the chip, a component normally found on the mainboard of common PCs. It includes the L3 cache, the memory controller, the crossbar, all bus links and the system request interface.

The system request interface with its system request queue is the central interface for all cores. This is where they request data from memory and ask the other cores for cached data and coherence information.

The integrated memory controller is the interface through which data is received from the main system memory. It supports 144-bit wide DDR2 RAM operating at frequencies up to 533 MHz. It includes a data prefetcher, which holds the prefetched data in the memory controller and thus does not occupy space within the L1, L2 or L3 cache [4, cf. p. 226 et seq.].

To connect the processor to the periphery there is a HyperTransport 3 link on the chip. HyperTransport is a protocol designed for fast, high-volume data transfers; it has a data rate of 20.8 GB/s. All PCI, PCI-X, USB and other bus requests are tunneled through HyperTransport requests. If the processor is used in a multi-processor environment, the HyperTransport bus is also used for coherent inter-processor communication [4, cf. p. 226].

With this integrated northbridge all cores can use speed-relevant functions directly on the chip, which means these functions are also physically close to the cores; this results in low latency times.

3.1.4 Power efficiency features

The AMD quad-core processors can adjust the frequency of their cores independently to reduce unnecessary power consumption.

AMD also implemented a new technology called CoolCore. It detects unused cores and parts of the processor and disables the power supply of these parts. Additionally, AMD exclusively uses energy-efficient DDR2 memory, which consumes about 8 watts less power than comparable fully buffered DIMMs.

3.2 Intel Core - The “Core 2 Quad” processor

In the first quarter of 2006, Intel presented its new Core microarchitecture, based on the P6 architecture. Most of the processors based on Core are multi-core processors. Intel does not build its quad-core processors out of one piece of silicon, but puts two dual-core dies into one case.

Until 2008 Intel used 65 nm technology, but has now switched to 45 nm. The processor discussed here is a 65 nm processor clocked at 2.66 GHz.

3.2.1 Basic concepts

Intel's Core microarchitecture is a 4-way superscalar out-of-order microarchitecture. Its instruction fetch and decode unit fetches 16 bytes, which is usually about four instructions, assisted by a branch prediction unit (BPU). Intel does not provide any information about the prediction algorithm used in the Core microarchitecture. The predecoder can decode up to six instructions and write them into the instruction queue. If there are more than six instructions, it returns after a cycle with the rest of the instructions [7, cf. p. 2-3, 2-6 et seq.].

The instruction queue can hold up to 18 instructions. For loops smaller than 18 instructions there is a loop stream detector, which detects the loop and locks its instructions in the queue. The loop is then allowed to stream from the instruction queue until a misprediction from the BPU ends it [7, cf. p. 2-7 et seq.].

There are four decoders in the Core microarchitecture. While decoder 0 can decode x86 instructions as big as four micro-ops, the others can only decode instructions that map to a single micro-op. Additionally, there is a function called macro-fusion, which can merge two 32-bit instructions into one single micro-op. After the macro-ops are decoded into micro-ops, micro-fusion takes place, which merges multiple micro-ops into one single complex micro-op in order to improve bandwidth [7, cf. p. 2-8].


Figure 2: The Intel Core 2 microarchitecture as a block diagram (source: [7], p. 2-4)

The execution unit contains a reservation station which holds up to 32 instructions until all their operands are fetched. The instructions are then issued to three execution stacks: one for regular integer computing, one for SIMD integer computing and one for floating point computing [7, cf. p. 2-9 et seq.].

After execution of the instructions, data can be written back to the L1 cache by the load-store unit. It provides store forwarding: if a load instruction follows a store instruction of the data currently being written, this data can be forwarded to the load [7, cf. p. 2-13 et seq.].
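One way such forwarding can work is sketched below: before going to the cache, a load first searches the pending-store buffer for the youngest store to the same address. The structure and names are my own illustration, not Intel's actual implementation.

    #include <cstdint>
    #include <cstdio>
    #include <deque>
    #include <unordered_map>

    struct PendingStore { uint64_t addr; uint32_t value; };

    std::deque<PendingStore> storeBuffer;          // not yet written to L1
    std::unordered_map<uint64_t, uint32_t> cache;  // stands in for the L1

    void store(uint64_t addr, uint32_t value) {
        storeBuffer.push_back({addr, value});
    }

    uint32_t load(uint64_t addr) {
        // Scan youngest-first: the most recent matching store wins.
        for (auto it = storeBuffer.rbegin(); it != storeBuffer.rend(); ++it)
            if (it->addr == addr)
                return it->value;  // forwarded without touching the cache
        return cache.count(addr) ? cache[addr] : 0;
    }

    int main() {
        cache[0x40] = 7;
        store(0x40, 42);                        // still sitting in the store buffer
        std::printf("load = %u\n", load(0x40)); // 42, forwarded from the buffer
    }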

3.2.2 The cache design

Intel implemented a two-level cache hierarchy. Each core in the Core microarchitecture has a dedicated 32 KB 8-way set-associative L1 data cache and a 32 KB 8-way set-associative L1 instruction cache [7, cf. p. 2-18].

Each L1 data cache is equipped with two prefetchers. One of them is a streaming prefetcher, which detects accesses to successive data and fetches the next lines. The other prefetcher detects load instructions with a regular stride, adds the stride to the current address and fetches that address [7, cf. p. 2-14].

These prefetchers can improve performance, especially when accessing successive data, but can also cause performance degradation when too much unneeded data evicts required lines. However, this is not very likely.
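The stride prefetcher can be pictured with the following sketch, a generic textbook scheme rather than Intel's exact logic: per load instruction it remembers the last address and stride and, once the same stride repeats, requests the next address ahead of time.

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>

    struct StrideEntry { uint64_t lastAddr = 0; int64_t stride = 0; };

    std::unordered_map<uint64_t, StrideEntry> table;  // keyed by load PC

    void prefetch(uint64_t addr) {
        std::printf("prefetch 0x%llx\n", (unsigned long long)addr);
    }

    // Called on every executed load; issues a prefetch once the same
    // stride has been seen twice in a row for this instruction.
    void observeLoad(uint64_t pc, uint64_t addr) {
        StrideEntry& e = table[pc];
        int64_t stride = (int64_t)(addr - e.lastAddr);
        if (e.lastAddr != 0 && stride != 0 && stride == e.stride)
            prefetch(addr + stride);  // fetch the next element early
        e.stride = stride;
        e.lastAddr = addr;
    }

    int main() {
        for (uint64_t a = 0x1000; a < 0x1000 + 5 * 64; a += 64)
            observeLoad(/*pc=*/0x400123, a);  // regular 64-byte stride
    }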

To keep all L1 caches coherent, Intel implements the MESI protocol. This protocol defines four states each cache line can have [8, cf. p. 10-11 et seq.]:

• Modified—A cache line in the modified state holds the current, correct copy of the data. All other copies, including the one in main memory, are invalid.

• Exclusive—A cache line in the exclusive state holds the current, correct copy of the data. It is the only cache which holds this data; the data in main memory is also correct.

• Shared—A cache line in the shared state holds the current, correct copy of the data. Other caches may hold this data as well; the data in main memory is also correct.

• Invalid—A cache line in the invalid state holds invalid data.

In short, MESI is MOESI without the owned state: a modified line must be written back to memory before another cache may share it.

In addition to the L1 caches there is a shared general-purpose 4 MB 16-way set-associative L2 cache. It is shared by only two cores, not all four, because the quad-core package consists of two dual-core dies. To keep the two L2 caches coherent, the MESI protocol is used as well [7, cf. p. 2-17 et seq.].

There is a data prefetcher on each L2 cache as well. It detects streams of data and fetches the next lines. It can detect not only successive streams, but also more complicated patterns, such as streams that skip lines. It prefetches data depending on the available bus bandwidth [7, cf. p. 2-15 et seq., 2-43].

3.2.3 The front side bus

Intel uses the front side bus (FSB) as the only bus on the chip. All traffic to and from the processor is sent over this bus. The two dual-core dies in the quad-core package also communicate over this bus, and all memory accesses are sent over it. Its base frequency is 266 MHz in quad-data-rate mode, which means that four data packets are transferred per cycle. Intel calls this concept quad-pumped and thus quotes the frequency as 1066 MHz. The bandwidth of the FSB is 10.7 GB/s [10, cf. p. 2].
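As a plausibility check on these numbers (assuming the usual 64-bit wide FSB data path): 266 MHz x 4 transfers per cycle x 8 bytes = 8.5 GB/s. The 10.7 GB/s quoted from [10] matches the faster 1333 MT/s FSB variant (333 MHz x 4 x 8 bytes ≈ 10.7 GB/s), so the two figures presumably refer to different processor models.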

3.2.4 Power efficiency features

Micro- and macro-fusion are referred to as power efficiency technologies, because more instructions can be executed per cycle, which reduces the amount of energy needed [7, cf. p. 2-9].

Additionally, there are some features used to reduce power consumption which Intel summarizes as Intelligent Power Capability. It is a mechanism which detects unused parts of the processor and deactivates them. If there is no need for high bus bandwidth, only parts of the FSB are in use.

Furthermore, the Core processors have a deep sleep mode, in which all caches are written back to memory while the CPU is inactive.

3.3 IBM POWER6

In May 2007 IBM introduced its new chip, the POWER6. Unlike the other two microarchitectures, the POWER processors are not used in consumer hardware but in servers and data centers.

There are only dual-core processors based on the POWER6 microarchitecture, but they are very scalable, as one can put up to 16 processors on a board. The POWER6 is made in 65 nm technology and clocked at 4.7 GHz.

3.3.1 Basic concepts

The POWER6 microarchitecture is a 7-way superscalar, 2-way simultaneous multithreading microarchitecture, so it can execute two threads per core simultaneously. It is designed for high frequency: the core itself runs at twice the frequency of the other parts, such as the L1 and L2 caches or the integrated memory controller.

Figure 3: The POWER6 pipeline as a block diagram (AG: address generation; BHT: branch table access and predict; BR: branch; DC: data-cache access; DISP: dispatch; ECC: error-correction code; EX: execute; FMT: formatting; IB: instruction buffer; IC0/IC1: instruction-cache access; IFA: instruction fetch address; ISS: issue; P1–P4: pre-decode; PD: post-decode; RF: register file access) (source: [11], p. 645)

The instruction fetch unit fetches up to 32 instructions from the L2 cache and decodes them in four pre-decode stages before writing them into the L1 instruction cache. From there, up to eight instructions are fetched and decoded by the instruction decode pipeline, which includes a 64-entry instruction buffer. All this is assisted by a branch prediction mechanism based on a 2-bit branch history table [11, cf. p. 645].

After fetching, the instructions are merged into groups of up to seven instructions of one thread and are sent to the appropriate execution units. Both threads can be dispatched simultaneously if they together contain at most seven instructions; a single thread can have up to five instructions dispatched simultaneously. Instruction groups are not dispatched until all dependencies in the group are resolved [11, cf. p. 647].

There are seven execution units: two floating point units (FPU), two fixed point units (FXU), two load-store units (LSU) and a branch unit (BRU).

The FXU does all the fixed point computations except multiplication and division; these are undertaken by the FPU. The FXU also generates all load and store addresses, and it can forward a result without any cycle delay [11, cf. p. 649]. The FPU is an out-of-order execution unit, unlike all other parts of the POWER6, which are generally in-order. It does all floating point and vector computations and additionally fixed point multiplication and division. It can forward a result before it is rounded and thus often appears to need only six cycles instead of seven [11, cf. p. 650 et seq.]. The LSU executes the loads and stores; on a cache miss it manages a load forward once the data arrives from the L2 cache [11, cf. p. 652].

To achieve higher reliability against soft errors, which occur because of cosmic rays and become more frequent as production processes shrink, IBM added the concept of instruction retry recovery. The recovery unit (RU) constantly saves stable states of the registers, called checkpoints, through separate busses protected by an error-correcting code (ECC). If a soft error occurs, the RU detects it and can reset the core to the state of the last known checkpoint. If the error seems to be persistent, the RU can send the last checkpoint to another core, so that this core can execute the erroneous instructions [12].

3.3.2 Cache design

The POWER6 microarchitecture has a three-level cache hierarchy. Each core has a dedicated 64 KB 4-way set-associative instruction cache and a dedicated 64 KB 8-way set-associative data cache [11, cf. p. 646, p. 652], located directly in the core in the IFU and the LSU, respectively.

The private 4 MB 8-way set-associative general-purpose L2 cache contains all data from the L1 caches in a store-through design. It is located on the chip. The second level cache is protected by a single-error-correct double-error-detect ECC system, which corrects one error automatically, can detect up to two errors, and also changes the coherence states of the erroneous cache lines. To evict data, IBM implemented a pseudo least-recently-used replacement algorithm in the L2 cache. On misses, the L2 first asks the shared L3 cache, and if it misses there too, it asks for the data on the SMP interconnect bus. On a hit, the L2 can do a load forward to the core [11, cf. p. 653 et seq.].

The L3 cache is a shared 32 MB 16-way set-associative victim cache. It is also protected by the ECC system and uses the pseudo-LRU replacement algorithm to evict data. The L3 cache is placed on an external chip; only the L3 controller is implemented on chip [11, cf. p. 655 et seq.].
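Pseudo-LRU avoids tracking the exact recency order of all ways. A common variant - a binary decision tree per set, sketched below; IBM does not document its exact scheme, so this is only illustrative - needs just 7 bits for an 8-way set instead of a full permutation, at the price of only approximating true LRU.

    #include <bitset>
    #include <cstdio>

    // Tree pseudo-LRU for one 8-way set: 7 internal tree nodes, one bit each.
    // A bit of 0 means "the LRU side is the left subtree", 1 means right.
    struct TreePlru {
        std::bitset<7> bits;

        // Follow the bits down the tree to find the victim way.
        int victim() const {
            int node = 0;
            for (int level = 0; level < 3; ++level)
                node = 2 * node + 1 + (bits[node] ? 1 : 0);
            return node - 7;  // leaves 7..14 map to ways 0..7
        }

        // On an access, flip the path bits to point *away* from this way.
        void touch(int way) {
            int node = way + 7;
            while (node > 0) {
                int parent = (node - 1) / 2;
                bits[parent] = (node == 2 * parent + 1);  // accessed left -> LRU is right
                node = parent;
            }
        }
    };

    int main() {
        TreePlru set;
        for (int way : {0, 4, 2, 6, 1, 5, 3})
            set.touch(way);
        std::printf("victim way: %d\n", set.victim());  // 7: the only untouched way
    }

Each access flips the bits on its path to point away from the accessed way; following the bits from the root then yields a way that has not been used recently, though not necessarily the least recently used one - hence "pseudo".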

To keep all caches coherent, IBM developed a new, complex cache coherence protocol. It can send a broadcast either globally or only locally, if it can determine that all the necessary information can be obtained within the local scope. This makes it a very scalable cache coherence protocol, since latency times for big SMP systems can be kept short. The protocol has the 13 states described in Table 1.

Table 1: IBM POWER6 cache coherence states. Adapted from [11], p. 642 et seq.

State  Description                      Authority  Sharers  Castout  Source data
I      Invalid                          None       N/A      N/A      N/A
ID     Deleted, do not allocate         None       N/A      N/A      N/A
S      Shared                           Read       Yes      No       No
SL     Shared, local data source        Read       Yes      No       At request
T      Formerly MU, now shared          Update     Yes      Yes      If notification
TE     Formerly ME, now shared          Update     Yes      No       If notification
M      Modified, avoid sharing          Update     No       Yes      At request
ME     Exclusive                        Update     No       No       At request
MU     Modified, bias toward sharing    Update     No       Yes      At request
IG     Invalid, cached scope-state      N/A        N/A      N/A      N/A
IN     Invalid, scope predictor         N/A        N/A      N/A      N/A
TN     Formerly MU, now shared          Update     Yes      Yes      If notification
TEN    Formerly ME, now shared          Update     Yes      No       If notification

3.3.3 Additional integrated hardware

IBM integrated two memory controllers directly on the chip. Each of them can control up to four DDR2 DRAM DIMMs. The POWER6 processor can thus communicate with the memory DIMMs with a bandwidth of 51.2 GB/s for reads and 25.6 GB/s for writes if 800 MHz memory is used. The memory is protected by the same ECC system as the L2 and L3 caches.
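One way to make these figures plausible (my own back-of-the-envelope estimate, not IBM's documented breakdown): eight DDR2-800 DIMMs (two controllers x four DIMMs), each delivering 8 bytes per transfer, give 8 x 800 MT/s x 8 B = 51.2 GB/s; a write path half that wide would yield the quoted 25.6 GB/s.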

Each memory controller has two parts. The first is the asynchronous region, which operates at four times the memory frequency and manages all memory accesses. The second is the synchronous region, which operates at half the processor frequency and manages all communication with the SMP interconnect fabric [11, cf. p. 657].


The integrated SMP interconnect bus is used to connect more than one processor to the system. It is a combined bus which transports both data and coherence traffic. It can be configured to use 67% of its bandwidth for data and 33% for coherence, or 50% for each [11, cf. p. 657].

There is also an I/O controller on the POWER6 chip, which connects the CPU to an I/O hub; the hub in turn connects to all other I/O devices [11, cf. p. 657].

3.3.4 Energy efficiency features

IBM summarizes its energy efficiency features under the term EnergyScale. Five main technologies are implemented. The first one is Power Trending: IBM integrated many sensors onto the POWER6 chips, and these collect data about the voltage and the thermal status of the CPU. All other features act depending on this data [6, cf. p. 4].

Power Saver Mode lets the administrator drop the CPU voltage and therefore the frequency of the core by a fixed percentage, 14% in current implementations. Power Saver Mode cannot be active while booting the system [6, cf. p. 4].

Power Capping is a feature which lets the administrator set a limit on power consumption. This is mainly used when a fixed power budget must be distributed across several machines [6, cf. p. 5 et seq.].

These two features are typically not meant for consumers, because considerable knowledge about computers and electricity is needed. Therefore they are not seen on normal PC hardware, but on high-performance CPUs like the POWER6.

Processor Core Nap detects cores which are not currently in use and stops them to reduce power consumption [6, cf. p. 6].

EnergyScale for I/O shuts down PCI slots that are not currently in use and thus reduces power consumption by approximately 14 watts. This, of course, is not really a CPU feature, but it is also summarized under the name EnergyScale [6, cf. p. 6].

3.4 Comparison between the microarchitectures

The described microarchitectures are not fundamentally different from one another. Well-known and tested technology, such as superscalarity and pipelining, is used in all three processors. But the slight differences between the architectures cause considerable performance differences.

The biggest gap is, of course, between AMD K10 and Intel Core on the one hand and POWER6 on the other; in performance they are hardly comparable. Although POWER6 processors are only available as dual-core processors, they received a much better SPEC rating [14] than the Intel Core 2 Quad [13].

The high frequency of 4.7 GHz is the first big factor in its high performance: it is about twice the frequency of the AMD and Intel processors. Additionally, the POWER6 microarchitecture is a simultaneous multithreading microarchitecture, so it actually has four logical processors, which leads to a performance increase of about 15% to 30% [11, cf. p. 653]. The big caches also improve performance considerably; 32 MB is much bigger than any consumer processor's cache. Furthermore, the POWER6 benefits from the high bandwidth of its memory subsystem and the possibility of up to 16 GB of main memory.

So the POWER6 should not really be compared with the other two microarchitectures; rather, it shows the leading technology and thus lets one imagine what may be realized in consumer hardware in the future. It is interesting that all main components of the POWER6 are also implemented in the consumer microarchitectures, only at a smaller or slower scale. Only some features required for high availability, such as the recovery unit, are not integrated in the K10 or the Core microarchitecture.

Comparing the K10 and Core microarchitectures, AMD seems to have the more modern concept. The cache design is a new development and is well optimized for multi-core processors: the heavily accessed L2 cache is close to the core, and for extra capacity there is an additional shared L3 cache. Furthermore, the integrated memory controller provides high bandwidth between CPU and main memory and thus eliminates a likely bottleneck. HyperTransport 3.0 offers very good bandwidth to other CPUs and I/O devices, so a front side bus is no longer necessary. The fact that AMD integrated all four cores on one single die also increases performance due to physical proximity.

Unlike AMD, Intel implemented quite a conservative microarchitecture with Core. The main examples: it uses only a two-level cache hierarchy, and the front side bus connects memory, other CPUs and I/O devices. It is based on the well-known P6 architecture, which was introduced in 1995 with the Pentium Pro processor, and it even replaces its own successor, the NetBurst architecture.

It may seem obvious that the AMD processor should outperform the Intel CPU, but this conclusion is wrong. In most benchmarks run by tecchannel.de¹, Intel processors perform better than AMD's.

Figure 4: Some exemplary benchmark results (source: [15])

As Figure 4 shows, the new Core 2 Extreme leads the field. These processors are made with the new 45 nm technology and include further enhancements, such as a faster front side bus. The Core 2 Quad Q6600 runs at 2.4 GHz, yet in some benchmarks it outperforms the Phenom 9900 with its 2.6 GHz.

There are several reasons for this difference. First, Intel implemented a bigger L2 cache; since this cache is accessed frequently and misses in it are expensive, the larger capacity pays off. The 3ds Max 2008 benchmark in particular benefits from this, because rendering needs good memory performance. The prefetchers also improve the performance of the Core 2 Quad considerably, and being a 4-way rather than only 3-way superscalar microarchitecture yields higher throughput. Another point is the use of faster DDR3 memory DIMMs, which compensates for the integrated memory controller of the K10 microarchitecture.

So one can say that for all jobs where high single-thread performance is needed, the Intel Core processors surpass the Phenoms.

¹ Because of delivery problems at AMD, the well-known SPEC benchmark website spec.org deleted all Phenom benchmarks in December 2007.


On the other hand, the AMD processors scale considerably better. The SunGard ACR 3.0 economic analysis tool, for example, makes heavy use of multithreading and is highly optimized for multiprocessing. The Phenom has a direct internal communication path between its four cores, whereas on Intel processors only two cores share a die, so data and coherence traffic for the other two cores has to travel over the FSB. The L3 cache, which is shared between all four cores on the K10 microarchitecture, also improves scalability, while Intel only has an L2 cache shared by two of the four cores.

On today's computers, usually only one or two programs are active at a time, so in overall benchmarks the Intel processors achieve the better results as well. The single-thread throughput of a CPU is in most environments the most important factor of overall performance. So although AMD seems to have the more modern concept, their CPUs do not fit the needs of most computer users, since multithreading is not yet used much in today's programs. The number of programs using multithreading is increasing rapidly, but by the time the majority of programs do, the Phenom will likely be out of date.

4 The future of multiprocessing

Multi-processing is everywhere today. Quad-cores are offered by every big vendor, and most new computers are based on a multi-core chip. So what is the future of multiprocessing? In the near future both AMD and Intel will update their products. While AMD has announced a 3 GHz Phenom, Intel has switched to 45 nm technology and made some enhancements to its processors. In 2008 Intel will introduce a totally new, more modern microarchitecture, the Nehalem microarchitecture [9]. This includes an integrated memory controller, a multi-level cache and simultaneous multithreading, so some of the POWER6 features will be seen in the next consumer microarchitecture.

Further in the future, the number of cores will increase strongly. Intel has started a research program called TeraScale, which works on processors with hundreds of cores on one chip, also called many-cores, and has implemented a prototype with eighty cores on one chip. Intel names the following three points as the big challenges of implementing such a microarchitecture [5, cf. p. 174]:

1. Interconnection—A fast, efficient and scalable bus interface is needed to interconnect all the CPUs and the other parts of the system, such as I/O interfaces.

2. Cache hierarchy and coherence—The CPUs must be able to share the caches usefully, and coherence must therefore be guaranteed over many cores with an appropriate coherence protocol.

3. Memory—The memory system must be able to feed hundreds of CPU caches with the required data and instructions.

The paper introduces some technical requirements that a many-core architecture has to meet to achieve these goals.

The interconnect must be very scalable, which means that adding more cores should lead to a noteworthy performance increase. In detail, the average distance between two cores should grow sub-linearly as cores are added, the latency per core should be low, and latency growth under heavier loads must remain manageable. Secondly, the topology of the architecture should be partitionable, so that the architecture can be split dynamically into single partitions for each job and faults can be isolated. Generally, the fault tolerance of the architecture should be as high as possible, so that the CPU can continue to operate even when some cores are defective [5, cf. p. 174].
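A short worked example of sub-linear distance growth: if N cores are arranged in a two-dimensional mesh of sqrt(N) x sqrt(N) tiles (the topology reportedly used in the 80-core prototype), the average hop count between two cores grows proportionally to sqrt(N). Quadrupling the core count therefore only doubles the average communication distance.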

Determining the requirements of the cache hierarchy is not as easy as for the interconnect. There must be multiple levels of cache that can be assigned dynamically to partitions in order to scale over many processors. This leads to directory-based cache coherence protocols, which apply only to a specific part of the whole CPU. They are widely used in multi-processor systems with many distributed CPU chips and are therefore well understood [5, cf. p. 179 et seq.]. Well-integrated caches can also reduce the required off-chip memory bandwidth. The biggest problem of a many-core processor is the need to feed it with enough data, and integrating more memory on the chip helps to achieve this [5, cf. p. 180].

So there will be some new problems in CPU architectures, such as the topology of the cores, which do not even appear when implementing two cores on a chip. Due to Moore's law it will be possible to integrate more and more external hardware directly on the chip, so the CPU will, for example, also be able to compute 3D graphics on a special partition of the CPU.

5 Summary

As we have seen, only physical limits will stop the development towards many-core processors, but do users really need such processors?

The answer depends on which programs are used. For today's home computer with typical tasks such as office work, playing music or browsing the internet, the answer is clearly no. Most of the time only one or two programs are active; all other programs run idle in the background and become active only rarely and for a short time. Current processors with four cores are enough for most of this kind of work. Even as more multithreading appears in program code, this will not change much, because much office work is inherently serial and cannot be parallelized. Single-thread throughput is the most important factor in this domain, though most users also run programs from domains that can benefit from multi-core processing.

Current games often use multithreading, and their number is increasing rapidly. Complex AI calculations can be highly parallelized, so games can benefit strongly from multi-core processors. Most professional domains can profit from multi-core processors as well, provided the corresponding programs offer multithreading capabilities, which is often practicable: video rendering, graphics processing, virtual instruments - all of this can be parallelized. With these domains in mind, most users can benefit from multi-processor systems, but it is up to the software vendors to deliver multithreading-capable software that exploits these benefits.

In data centers and server rooms, multiprocessing also provides better performance and is widely used. Mainframes have used multi-processor configurations (though not on a single chip) for a long time, so software for these domains is optimized for multiprocessing and makes heavy use of multithreading.

Energy efficiency is an issue most users are concerned with. Multi-core processors offer a much better power consumption/performance ratio than configurations of single-core chips and thus help to reduce costs, heat and even environmental pollution.

Multi-core processors are the modern way to improve processor performance, and most users will benefit from this development. Many-core processors are on the way and may take over the work of some specialized hardware in the near future.


References

[1] AMD (ed.). AMD64 Architecture Programmer's Manual - Volume 2: System Programming. Tech. Rep. 24593, AMD, September 2007. http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf.

[2] AMD (ed.). BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h Processors. Tech. Rep. 31116, AMD, September 2007. http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/31116.pdf.

[3] AMD (ed.). Family 10h AMD Phenom Processor Product Data Sheet. Tech. Rep. 44109, AMD, November 2007. http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/44109.pdf.

[4] AMD (ed.). Software Optimization Guide for AMD Family 10h Processors. Tech. Rep. 40546, AMD, September 2007. http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf.

[5] Azimi, M.; Cherukuri, N.; et al. Integration Challenges and Tradeoffs for Tera-scale Architectures. Intel Technology Journal 11, 3 (August 2007), 173–184. http://download.intel.com/technology/itj/2007/v11i3/vol11-iss03.pdf.

[6] McCreary, H.-Y.; Broyles, M.; et al. EnergyScale for IBM POWER6 microprocessor-based systems, November 2007. Available at http://www-03.ibm.com/systems/p/library/wp_lit.html.

[7] Intel (ed.). Intel 64 and IA-32 Architectures Optimization Reference Manual. Tech. Rep. 248966-015, Intel, May 2007. http://developer.intel.com/design/processor/manuals/248966.pdf.

[8] Intel (ed.). Intel 64 and IA-32 Architectures Software Developer's Manual - Volume 3A: System Programming Guide, Part 1. Tech. Rep. 253668, Intel, November 2007. http://developer.intel.com/design/processor/manuals/253668.pdf.

[9] Intel (ed.). News Flash: Intel Details Upcoming New Processor Generations, March 2007. http://www.intel.com/pressroom/archive/releases/Intel_New_Processor_Generations.pdf.

[10] Intel (ed.). Product Brief: Intel Core 2 Quad Processor, 2007. http://www.intel.com/products/processor/core2quad/prod_brief.pdf.

[11] Le, H. Q.; Starke, W. J.; et al. IBM POWER6 microarchitecture. IBM Journal of Research and Development 51, 6 (November 2007), 639–662. http://www.research.ibm.com/journal/rd/516/le.pdf.

[12] Mack, M. J.; Sauer, W. M.; et al. IBM POWER6 reliability. IBM Journal of Research and Development 51, 6 (November 2007), 763–774. http://www.research.ibm.com/journal/rd/516/mack.pdf.

[13] SPEC (ed.). SPEC CFP2006 Result - DQ965GF motherboard (Intel Core 2 Quad Q6700), July 2007. http://www.spec.org/cpu2006/results/res2007q3/cpu2006-20070723-01534.pdf.

[14] SPEC (ed.). SPEC CFP2006 Result - IBM System p (4.7 GHz, 2 core), May 2007. http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070518-01098.pdf.

[15] Vilsbeck, C.; Haluschak, B. Test: AMD Phenom 9900 mit 2,6 GHz, November 2007. http://www.tecchannel.de/pc_mobile/prozessoren/1739204/.
