UC Regents Fall 2009 © UCB CS 250 L9: Floorplanning
2009-9-24 John Wawrzynek and Krste Asanovic
with John Lazzaro
CS 250 VLSI System Design
Lecture 9 – Floorplanning
www-inst.eecs.berkeley.edu/~cs250/
TA: Yunsup Lee
Many-core Example Chips
Intel Larrabee
Sun Niagara II and Intel Larrabee are both:
Complete implementations of long-established ISAs (SPARC, x86).
Provide a conventional shared-memory virtual-address memory model.
Micro-architecture and ISA extensions target a specific application area.
Not boutique parts - manufacturable.
Many-cores, multi-threaded cores.
Sun Niagara II Target Market
Some applications unavoidably spend their lives waiting for memory.
memory machines with discrete single-threaded processors and coherent interconnect have tended to perform well because they exploit TLP. However, the use of an SMP composed of multiple processors designed to exploit ILP is neither power efficient nor cost-efficient. A more efficient approach is to build a machine using simple cores aggregated on a single die, with a shared on-chip cache and high bandwidth to large off-chip memory, thereby aggregating an SMP server on a chip. This has the added benefit of low-latency communication between the cores for efficient data sharing in commercial server applications.

Niagara overview
The Niagara approach to increasing throughput on commercial server applications involves a dramatic increase in the number of threads supported on the processor and a memory subsystem scaled for higher bandwidths. Niagara supports 32 threads of execution in hardware. The architecture organizes four threads into a thread group; the group shares a processing pipeline, referred to as the Sparc pipe. Niagara uses eight such thread groups, resulting in 32 threads on the CPU. Each SPARC pipe contains level-1 caches for instructions and data. The hardware hides memory and pipeline stalls on a given thread by scheduling the other threads in the group onto the SPARC pipe with a zero cycle switch penalty. Figure 1 schematically shows how reusing the shared processing pipeline results in higher throughput.

The 32 threads share a 3-Mbyte level-2 cache. This cache is 4-way banked and pipelined for bandwidth; it is 12-way set-associative to minimize conflict misses from the many threads. Commercial server code has data sharing, which can lead to high coherence miss rates. In conventional SMP systems using discrete processors with coherent system interconnects, coherence misses go out over low-frequency off-chip buses or links, and can have high latencies. The Niagara design with its shared on-chip cache eliminates these misses and replaces them with low-latency shared-cache communication.

The crossbar interconnect provides the communication link between Sparc pipes, L2 cache banks, and other shared resources on the CPU; it provides more than 200 Gbytes/s of bandwidth. A two-entry queue is available for each source-destination pair, and it can queue up to 96 transactions each way in the crossbar. The crossbar also provides a port for communication with the I/O subsystem. Arbitration for destination ports uses a simple age-based priority scheme that ensures fair scheduling across all requestors. The crossbar is also the point of memory ordering for the machine.

The memory interface is four channels of dual-data rate 2 (DDR2) DRAM, supporting a maximum bandwidth in excess of 20 Gbytes/s, and a capacity of up to 128 Gbytes. Figure 2 shows a block diagram of the Niagara processor.

Sparc pipeline
Here we describe the Sparc pipe implementation, which supports four threads. Each thread has a unique set of registers and instruction and store buffers. The thread group shares the L1 caches, translation look-aside buffers (TLBs), execution units, and most pipeline registers. We implemented a single-issue pipeline with six stages (fetch, thread select, decode, execute, memory, and write back).

In the fetch stage, the instruction cache and instruction TLB (ITLB) are accessed. The following stage completes the cache access by selecting the way. The critical path is set by the 64-entry, fully associative ITLB access. A thread-select multiplexer determines which of
23 MARCH–APRIL 2005
[Figure 1 timelines: compute (C) and memory (M) latency segments for single-issue, ILP, and TLP (on a shared single-issue pipeline) execution, with the time saved marked.]
Figure 1. Behavior of processors optimized for TLP and ILP on commercial server workloads. In comparison to the single-issue machine, the ILP processor mainly reduces compute time, so memory access time dominates application performance. In the TLP case, multiple threads share a single-issue pipeline, and overlapped execution of these threads results in higher performance for a multithreaded application.
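The comparison in Figure 1 can be sketched with a small timing model. This is an illustrative simplification, not the paper's methodology: every task is assumed to be one compute phase followed by one memory wait, memory waits of different threads are assumed to overlap fully, and the function names and the latency values in the usage note are hypothetical.

```python
def time_single_issue(n_tasks, compute, memory):
    """One thread on a single-issue pipeline: every task pays compute + memory."""
    return n_tasks * (compute + memory)


def time_ilp(n_tasks, compute, memory, ilp_speedup):
    """ILP shrinks only the compute part; the memory wait is unchanged."""
    return n_tasks * (compute / ilp_speedup + memory)


def time_tlp(n_tasks, compute, memory, n_threads):
    """n_threads share one single-issue pipeline (assumed model).

    Only one thread computes at a time, but memory waits overlap.
    Thread k starts its first compute phase at k * compute; afterwards each
    thread repeats with period max(n_threads * compute, compute + memory).
    """
    rounds = -(-n_tasks // n_threads)  # ceiling division: tasks per thread
    period = max(n_threads * compute, compute + memory)
    return (n_threads - 1) * compute + (rounds - 1) * period + compute + memory
```

With hypothetical latencies of 1 time unit of compute and 3 of memory, 8 tasks take 32 units on the single-issue machine, 28 with a 2x ILP speedup, and 11 with four threads sharing the pipeline, which mirrors the qualitative point of the figure: ILP trims compute time while TLP hides memory time.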
Cores that focus on extracting instruction-level parallelism are wasted on these apps.
Instead, build simple cores that are multi-threaded, and focus on maximizing throughput of a large number of threads.
Krste, November 10, 2004
6.823, L18--5
Simple Multithreaded Pipeline
Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
[Figure: multithreaded pipeline datapath with a thread-select mux, per-thread PCs with a +1 incrementer, I$, IR, per-thread GPR files, operands X and Y, and D$.]
Sun Niagara II Multithreading
Multi-threading static pipeline CPUs is simple and efficient, and goes back to the 1960s (CDC).
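The note about carrying the thread select down the pipeline can be illustrated with a toy simulation. This is a sketch under assumptions: a five-stage pipeline, strict round-robin thread select, no stalls, and hypothetical names (`STAGES`, `run`). Each pipeline register holds a `(thread_id, pc)` pair, so the writeback stage knows which thread's state to update.

```python
from collections import deque

STAGES = ["fetch", "decode", "execute", "memory", "writeback"]  # assumed stage list


def run(n_threads, n_cycles):
    """Return, per thread, the PCs of the instructions that have retired."""
    pcs = [0] * n_threads                    # one PC per thread
    retired = [[] for _ in range(n_threads)]
    # One slot per stage; a full deque with maxlen drops its right end on appendleft.
    pipe = deque([None] * len(STAGES), maxlen=len(STAGES))
    for cycle in range(n_cycles):
        done = pipe[-1]                      # instruction leaving writeback
        if done is not None:
            tid, pc = done                   # thread id travelled with the instruction
            retired[tid].append(pc)          # update that thread's state only
        tid = cycle % n_threads              # round-robin thread select
        pipe.appendleft((tid, pcs[tid]))     # fetch for the selected thread
        pcs[tid] += 1                        # the "+1" PC incrementer
    return retired
```

For example, `run(4, 13)` retires the first two instructions of each of the four threads. Without the carried thread id, the writeback stage would have no way to tell which register file an in-flight instruction belongs to.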
Sun Niagara II Sizing the chip
8 threads/core: Enough to keep one core busy, given clock speed, memory system latency, and target application characteristics.
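The threads-per-core sizing argument can be sketched with a common back-of-the-envelope model. The model and its numbers are assumptions for illustration, not figures from the slides: each thread alternates `compute_cycles` of work with `memory_cycles` of stall, and stalls of different threads overlap perfectly.

```python
import math


def pipeline_utilization(n_threads, compute_cycles, memory_cycles):
    """Fraction of cycles the shared pipeline does useful work (assumed model)."""
    return min(1.0, n_threads * compute_cycles / (compute_cycles + memory_cycles))


def threads_to_saturate(compute_cycles, memory_cycles):
    """Smallest thread count that keeps the pipeline busy every cycle."""
    return math.ceil((compute_cycles + memory_cycles) / compute_cycles)
```

With hypothetical values of 10 compute cycles per 70 stall cycles, four threads reach 50% utilization and eight threads saturate the core. A faster clock or a slower memory system raises the stall count in cycles and therefore the number of threads needed.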
6 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 43, NO. 1, JANUARY 2008
Implementation of an 8-Core, 64-Thread, Power-Efficient SPARC Server on a Chip
Umesh Gajanan Nawathe, Mahmudul Hassan, King C. Yen, Ashok Kumar, Aparna Ramachandran, and David Greenhill
Abstract—The second in the Niagara series of processors (Niagara2) from Sun Microsystems is based on the power-efficient chip multi-threading (CMT) architecture optimized for Space, Watts (Power), and Performance (SWaP) [SWaP Rating = Performance / (Space × Power)]. It doubles the throughput performance and performance/watt, and provides 10× improvement in floating point throughput performance as compared to UltraSPARC T1 (Niagara1). There are two 10 Gb Ethernet ports on chip. Niagara2 has eight SPARC cores, each supporting concurrent execution of eight threads for 64 threads total. Each SPARC core has a Floating Point and Graphics unit and an advanced Cryptographic unit which provides high enough bandwidth to run the two 10 Gb Ethernet ports encrypted at wire speeds. There is a 4 MB Level2 cache on chip. Each of the four on-chip memory controllers controls two FBDIMM channels. Niagara2 has 503 million transistors on a 342 mm² die packaged in a flip-chip glass ceramic package with 1831 pins. The chip is built in Texas Instruments’ 65 nm 11LM triple-Vt CMOS process. It operates at 1.4 GHz at 1.1 V and consumes 84 W.
Index Terms—Chip multi-threading (CMT), clocking, computer architecture, cryptography, low power, microprocessor, multi-core, multi-threaded, Niagara series of processors, power efficient, power management, SerDes, SPARC architecture, synchronous and asynchronous clock domains, system on a chip (SoC), throughput computing, UltraSPARC T2.
I. INTRODUCTION
TODAY’S datacenters face extreme throughput, space, and power challenges. Throughput demands continue increasing while space and power are fixed. The increase in power consumed by the servers and the cost of cooling has caused a rapid increase in the cost of operating a datacenter. The Niagara1 processor [5] (also known as the UltraSPARC T1) made a substantial attempt at solving this problem. This paper describes the implementation of the Niagara2 processor, designed with a wide range of applications in mind, including database, web-tier, floating-point, and secure applications. Niagara2, as the name suggests, is the follow-on to the Niagara1 processor based on the CMT architecture optimized for SWaP.
Fig. 1 illustrates the advantages of the CMT architecture. For a single thread, memory access is the single biggest bottleneck to improving performance. For workloads which exhibit poor memory locality, only a modest throughput speedup is possible by reducing compute time. As a result, conventional single-thread processors which are optimized for Instruction-Level-Parallelism have low utilization and wasted power. Having many threads makes it easier to find something useful to execute every cycle. As a result, processor utilization is higher and significant throughput speedups are achievable.

Fig. 1. Throughput computing using the CMT architecture.

Manuscript received April 17, 2007; revised September 27, 2007.
U. G. Nawathe is with Sun Microsystems, Santa Clara, CA 95054 USA (e-mail: [email protected]).
M. Hassan, K. C. Yen, A. Kumar, A. Ramachandran, and D. Greenhill are with Sun Microsystems, Sunnyvale, CA 94085 USA.
Digital Object Identifier 10.1109/JSSC.2007.910967
The design of the Niagara2 processor started off with three primary goals in mind: 1) 2× throughput performance and performance/watt as compared to UltraSPARC T1; 2) 10× floating point throughput performance as compared to UltraSPARC T1; and 3) integration of the major system components on chip. Two options were considered to achieve these goals: 1) double the number of cores to 16 as compared to eight on UltraSPARC T1, with each core supporting four threads as in UltraSPARC T1; and 2) double the number of threads/core from four to eight and correspondingly double the number of execution units per core from one to two. Both options would have enabled us to achieve our first goal. The first option would have doubled the SPARC core area as compared to a lot smaller area increase with the second option. The second option was chosen as the area saved using this option allowed us to integrate a Floating Point and Graphics unit and a Cryptographic unit inside each SPARC core and also allowed integration of the critical SoC components on chip, thus enabling us to achieve our second and third goals as well.
II. ARCHITECTURE AND KEY STATISTICAL HIGHLIGHTS
A. Niagara2 Architecture
Fig. 2 shows the Niagara2 block diagram, and Fig. 3 shows the die micrograph. The chip has eight SPARC Cores, a 4 MB shared Level2 cache, and supports concurrent execution of 64 threads. The Level2 cache is divided into eight banks of 512 kB each. The SPARC Cores communicate with the Level2 cache banks through a high bandwidth crossbar. Niagara2 has a ×8 PCI-Express channel, two 10 Gb Ethernet ports with XAUI interfaces and four memory controllers each controlling
0018-9200/$25.00 © 2008 IEEE
Authorized licensed use limited to: Univ of Calif Berkeley. Downloaded on September 18, 2009 at 17:03 from IEEE Xplore. Restrictions apply.
Cores/chip?: The number of cores that can be put to good use is limited by the memory bandwidth available to keep all threads on all cores busy.
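The bandwidth limit on core count can be sketched as a simple budget. Everything numeric here is an assumption for illustration (the line size, the per-thread miss interval, and the 42 GB/s memory bandwidth are hypothetical inputs, not values from the slides), and the function names are made up for this example.

```python
import math


def core_demand_gbytes_s(threads_per_core, line_bytes, ns_between_misses):
    """Memory traffic one core generates when every thread misses regularly.

    Bytes per nanosecond is numerically equal to GB/s.
    """
    return threads_per_core * line_bytes / ns_between_misses


def max_useful_cores(memory_bw_gbytes_s, demand_per_core_gbytes_s):
    """Number of cores the memory system can keep fed (assumed linear model)."""
    return math.floor(memory_bw_gbytes_s / demand_per_core_gbytes_s)
```

For instance, 8 threads per core, 64-byte lines, and one miss per thread every 100 ns give a demand of 5.12 GB/s per core; a hypothetical 42 GB/s memory system would then keep 8 such cores busy, and adding more cores would leave threads idle.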
NAWATHE et al.: IMPLEMENTATION OF AN 8-CORE, 64-THREAD, POWER-EFFICIENT SPARC SERVER ON A CHIP 7
Fig. 2. Niagara2 block diagram.
Fig. 3. Niagara2 die micrograph.
two FBDIMM channels. These three major I/O interfaces are serializer/deserializer (SerDes) based and provide a total pin bandwidth in excess of 1 Tb/s. All the SerDes are on chip. The high levels of system integration truly makes Niagara2 a “server-on-a-chip”, thus reducing system component count, complexity and power, and hence improving system reliability.
B. SPARC Core Architecture

Fig. 4 shows the block diagram of the SPARC Core. Each SPARC core (SPC) implements the 64-bit SPARC V9 instruction set while supporting concurrent execution of eight threads. Each SPC has one load/store unit (LSU), two Execution units (EXU0 and EXU1), and one Floating Point and Graphics Unit (FGU). The Instruction Fetch unit (IFU) and the LSU contain an 8-way 16 kB Instruction cache and a 4-way 8 kB Data cache respectively. Each SPC also contains a 64-entry Instruction-TLB (ITLB), and a 128-entry Data-TLB (DTLB). Both the TLBs are fully associative. The memory Management Unit (MMU) supports 8 K, 64 K, 4 M, and 256 M page sizes and has Hardware TableWalk to reduce TLB miss penalty. “TLU” in the block diagram is the Trap Logic Unit. The “Gasket” performs arbitration for access to the Crossbar. Each SPC also has an advanced Cryptographic/Stream Processing Unit (SPU). The combined bandwidth of the eight Cryptographic units from the eight SPCs is sufficient for running the two 10 Gb Ethernet ports encrypted. This enables Niagara2 to run secure applications at wire speed.

Fig. 5 and Fig. 6 illustrate the Niagara2 integer and floating point pipelines, respectively. The integer pipeline is eight stages long. The floating point pipeline has 12 stages for most operations. Divide and Square-root operations have a longer pipeline.

Fig. 4. SPC block diagram.
Fig. 5. Integer pipeline: eight stages.
Fig. 6. Floating point pipeline: 12 stages.
Sun Niagara II: 8 cores, 4MB L2, 4 DRAM channels
“Small” L2 cache because apps are locality-poor.
Each DRAM channel: 50 GB/s Read, 25 GB/s Write BW.
Crossbar BW: 270 GB/s total (Read + Write).
(Also shared by an I/O port, not shown)
Fig. 9. L2 cache row redundancy scheme.
2-cycle latency. Addresses can be hashed to distribute accesses across different sets in case of hot cache sets caused by reference conflicts. All arrays are protected by single error correction, double error detection ECC, and parity. Data from different ways and different words is interleaved to improve soft error rates.

The L2 cache used a unique row-redundancy scheme. It is implemented at the 32 kB level and is illustrated in Fig. 9. Spare rows for one array are located in the adjacent array as opposed to the same array. In other words, spare rows for the top array are located in the bottom array and vice versa. When redundancy is enabled, the incoming address is compared with the address of the defective row and if it matches, the adjacent array (which is normally not enabled) is enabled to read from or write into the spare row. Using this kind of scheme enables a large ( 30%) reduction in X-decoder area. The area reduction is achieved because the multiplexing required in the X-decoder to bypass the defective row/rows in the traditional row redundancy scheme is no longer needed in this scheme.
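The address-compare step of the row-redundancy scheme can be sketched behaviourally. This is a toy model under assumptions: one spare row per array, a single defective row per array, and a hypothetical class name and interface; it says nothing about the actual circuit.

```python
class RedundantArrayPair:
    """Two adjacent arrays; each one hosts the spare row for the other."""

    def __init__(self, n_rows):
        self.rows = {"top": [0] * n_rows, "bottom": [0] * n_rows}
        # spare[a] sits physically in array a but stands in for a row of the other array.
        self.spare = {"top": 0, "bottom": 0}
        # Defective row address per array; None means redundancy is not enabled.
        self.defective = {"top": None, "bottom": None}

    @staticmethod
    def _adjacent(array):
        return "bottom" if array == "top" else "top"

    def mark_defective(self, array, row):
        self.defective[array] = row

    def write(self, array, row, value):
        if self.defective[array] == row:      # address matches the defective row
            self.spare[self._adjacent(array)] = value
        else:
            self.rows[array][row] = value

    def read(self, array, row):
        if self.defective[array] == row:      # redirect to the adjacent array's spare
            return self.spare[self._adjacent(array)]
        return self.rows[array][row]
```

In this model a write to a defective row of the top array lands in the spare held by the bottom array and the defective row itself is never touched, so the row decoder of the top array needs no bypass multiplexing.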
N-well power for the Primary and L2 cache memory cells is separated out as a test hook. This allows weakening of the pMOS loads of the SRAM bit cells by raising their threshold voltage, thus enabling screening cells with marginal static noise margin. This significantly reduces defective parts per million (DPPM) and improves reliability.
Fig. 10 shows the Niagara2 Crossbar (CCX). CCX serves as a high bandwidth interface between the eight SPARC Cores, shown on top, and the eight L2 cache banks, and the non-cacheable unit (NCU) shown at the bottom. CCX consists of two blocks: PCX and CPX. PCX (“Processor-to-Cache-Transfer”) is a 8-input 9-output multiplexer (mux). It transfers data from the eight SPARC cores to the eight L2 cache banks and the NCU. Likewise, CPX (“Cache-to-Processor Transfer”) is a 9-input 8-output mux, and it transfers data in the reverse direction. The PCX and CPX combined provide a Read/Write bandwidth of 270 GB/s. All crossbar data transfer requests are processed using a four-stage pipeline. The pipeline stages are: Request, Arbitration, Selection, and Transmission. As can be seen from the figure, there are possible source–destination pairs for each data transfer request. There is a two-deep queue for each source–destination pair to hold data transfer requests for that pair.

Fig. 10. Crossbar.
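The per-pair queueing described for the crossbar can be sketched as a small data structure. This is an illustrative model only: the class name and methods are hypothetical, arbitration is not modelled, and the pair count is simply derived from the 8-input, 9-output (PCX) and 9-input, 8-output (CPX) shapes given in the text.

```python
from collections import deque


class CrossbarQueues:
    """One bounded FIFO per source-destination pair (depth 2 by default)."""

    def __init__(self, n_sources, n_destinations, depth=2):
        self.depth = depth
        self.queues = {
            (s, d): deque()
            for s in range(n_sources)
            for d in range(n_destinations)
        }

    @property
    def n_pairs(self):
        return len(self.queues)

    @property
    def capacity(self):
        """Total requests the crossbar block can hold across all pairs."""
        return self.n_pairs * self.depth

    def try_enqueue(self, src, dst, request):
        """Accept a request unless that pair's queue is already full."""
        queue = self.queues[(src, dst)]
        if len(queue) >= self.depth:
            return False
        queue.append(request)
        return True


pcx = CrossbarQueues(n_sources=8, n_destinations=9)  # cores -> L2 banks + NCU
cpx = CrossbarQueues(n_sources=9, n_destinations=8)  # L2 banks + NCU -> cores
```

Under these assumptions each direction has 72 source-destination pairs and can buffer 144 requests. A third request from the same core to the same bank is refused until one drains, while requests to other banks are unaffected.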
IV. CLOCKING
Niagara2 contains a mix of many clocking styles—synchronous, mesochronous and asynchronous—and hence a large number of clock domains. Managing all these clock domains and domain crossings between them was one of the biggest challenges the design team faced. A subset of synchronous methodology, ratioed synchronous clocking (RSC) is used extensively. The concept works well for functional mode while being equally applicable to at-speed test of the core using the SerDes interfaces.

A. Clock Sources and Distribution

An on-chip phase-locked loop (PLL) uses a fractional divider [8], [9] to generate Ratioed Synchronous Clocks with support for a wide range of integer and fractional divide ratios. The distribution of these clocks uses a combination of H-trees and grids. This ensures they meet tight clock skew budgets while keeping power consumption under control. Clock Tree Synthesis is used for routing the asynchronous clocks. Asynchronous clock domain crossings are handled using FIFOs and meta-stability hardened flip-flops. All clock headers are designed to support clock gating to save clock power.

Fig. 11 shows the block diagram of the PLL. Its architecture is similar to the one described in [8]. It uses a loop filter capacitor referenced to a regulated 1.1 V supply (VREG). VREG is generated by a voltage regulator from the 1.5 V supply coming
Sun Niagara II 8 x 9 Crossbar
8 ports on CPU side (one per core)
8 ports for L2 banks, plus one for I/O
4-cycle latency (715 ps/cycle).
Cycles 1-3 are for arbitration.
Transmit data on cycle 4.
100-200 wires/port (each way).
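The latency figures on this slide can be checked with a few lines of arithmetic. The helper names are hypothetical; the 1.4 GHz clock comes from the paper abstract, and the slide's 715 ps is taken as a rounded cycle time.

```python
def cycle_time_ps(clock_ghz):
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / clock_ghz


def latency_ns(n_cycles, cycle_ps):
    """Latency of an n-cycle operation, in nanoseconds."""
    return n_cycles * cycle_ps / 1000.0
```

At 1.4 GHz the period is about 714 ps, consistent with the rounded 715 ps on the slide, so the 4-cycle crossbar traversal costs roughly 2.86 ns, of which three cycles go to arbitration and one to data transmission.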
Sun Niagara II 8 x 9 Crossbar
NAWATHE et al.: IMPLEMENTATION OF AN 8-CORE, 64-THREAD, POWER-EFFICIENT SPARC SERVER ON A CHIP 9
Fig. 9. L2 cache row redundancy scheme.
2-cycle latency. Addresses can be hashed to distribute accessesacross different sets in case of hot cache sets caused by refer-ence conflicts. All arrays are protected by single error correc-tion, double error detection ECC, and parity. Data from differentways and different words is interleaved to improve soft errorrates.
The L2 cache used a unique row-redundancy scheme. It is im-plemented at the 32 kB level and is illustrated in Fig. 9. Sparerows for one array are located in the adjacent array as opposed tothe same array. In other words, spare rows for the top array arelocated in the bottom array and vice versa. When redundancy isenabled, the incoming address is compared with the address ofthe defective row and if it matches, the adjacent array (which isnormally not enabled) is enabled to read from or write into thespare row. Using this kind of scheme enables a large ( 30%)reduction in X-decoder area. The area reduction is achieved be-cause the multiplexing required in the X-decoder to bypass thedefective row/rows in the traditional row redundancy scheme isno longer needed in this scheme.
N-well power for the Primary and L2 cache memory cellsis separated out as a test hook. This allows weakening of thepMOS loads of the SRAM bit cells by raising their thresholdvoltage, thus enabling screening cells with marginal static noisemargin. This significantly reduces defective parts per million(DPPM) and improves reliability.
Fig. 10 shows the Niagara2 Crossbar (CCX). CCX serves as a high bandwidth interface between the eight SPARC Cores, shown on top, and the eight L2 cache banks, and the non-cacheable unit (NCU) shown at the bottom. CCX consists of two blocks: PCX and CPX. PCX (“Processor-to-Cache-Transfer”) is an 8-input 9-output multiplexer (mux). It transfers data from the eight SPARC cores to the eight L2 cache banks and the NCU. Likewise, CPX (“Cache-to-Processor Transfer”) is a 9-input 8-output mux, and it transfers data in the reverse direction. The PCX and CPX combined provide a Read/Write bandwidth of 270 GB/s. All crossbar data transfer requests are processed using a four-stage pipeline. The pipeline stages are: Request, Arbitration, Selection, and Transmission. As can be seen from the figure, there are 8 × 9 = 72 possible source–destination pairs for each data transfer request. There is a two-deep queue for each source–destination pair to hold data transfer requests for that pair.
Fig. 10. Crossbar.
IV. CLOCKING
Niagara2 contains a mix of many clocking styles (synchronous, mesochronous and asynchronous) and hence a large number of clock domains. Managing all these clock domains and domain crossings between them was one of the biggest challenges the design team faced. A subset of synchronous methodology, ratioed synchronous clocking (RSC), is used extensively. The concept works well for functional mode while being equally applicable to at-speed test of the core using the SerDes interfaces.
A. Clock Sources and Distribution
An on-chip phase-locked loop (PLL) uses a fractional divider [8], [9] to generate Ratioed Synchronous Clocks with support for a wide range of integer and fractional divide ratios. The distribution of these clocks uses a combination of H-trees and grids. This ensures they meet tight clock skew budgets while keeping power consumption under control. Clock Tree Synthesis is used for routing the asynchronous clocks. Asynchronous clock domain crossings are handled using FIFOs and meta-stability hardened flip-flops. All clock headers are designed to support clock gating to save clock power.
Fig. 11 shows the block diagram of the PLL. Its architecture is similar to the one described in [8]. It uses a loop filter capacitor referenced to a regulated 1.1 V supply (VREG). VREG is generated by a voltage regulator from the 1.5 V supply coming
Authorized licensed use limited to: Univ of Calif Berkeley. Downloaded on September 18, 2009 at 17:03 from IEEE Xplore. Restrictions apply.
Every cross of blue and purple is a pass gate with a unique control signal.
72 control signals (if distributed unencoded).
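The count on this slide can be reproduced with a few lines of arithmetic. The sketch below also shows, purely as an illustrative assumption, how many select wires a binary-encoded alternative would need (one select field per output, decoded locally); the slide only states the unencoded figure, and the function name is hypothetical.

```python
def crossbar_controls(sources, destinations):
    """Return (unencoded, encoded) control-wire counts for a crossbar.

    unencoded: one wire per pass gate (source x destination crosspoint).
    encoded:   one binary select field per destination, wide enough to
               name any source (an assumed alternative, for comparison).
    """
    unencoded = sources * destinations
    select_bits = (sources - 1).bit_length()  # ceil(log2(sources))
    encoded = destinations * select_bits
    return unencoded, encoded


pcx = crossbar_controls(8, 9)  # 8 cores to 8 L2 banks + NCU
cpx = crossbar_controls(9, 8)  # reverse direction
```

Note that a plain binary select cannot express "no source selected" without an extra code or enable, so the encoded figure is a lower bound for that style.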
Sun Niagara II Crossbar Notes
Crossbar defines floorplan: all port devices should be equidistant to the crossbar.
Uniform latency between all port pairs.
Did not scale up for 16-core Rainbow Falls. Rainbow Falls keeps the 8 x 9 crossbar, and shares each CPU-side port with two cores.
Low latency: 4 cycles (less than 3 ns).
Design alternatives to crossbar?
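As a quick check of the latency figure on this slide, the sketch below converts 4 cycles to nanoseconds. It assumes the crossbar runs at the 1.4 GHz operating frequency quoted in the paper excerpt; the slide itself does not state which clock it uses.

```python
CLOCK_HZ = 1.4e9  # assumed: crossbar clocked at the 1.4 GHz core frequency
CYCLES = 4        # crossbar pipeline depth from the slide

# Time for one pass through the crossbar, in nanoseconds.
latency_ns = CYCLES / CLOCK_HZ * 1e9
```

Under that assumption the result is roughly 2.86 ns, consistent with "less than 3 ns".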
Sun Niagara II Energy Facts
8 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 43, NO. 1, JANUARY 2008
Fig. 7. Various threads occupying different pipeline stages.
The load-use latency is three cycles. There is a six-cycle latency for dependent FP operations. The ICACHE is shared between all eight threads. Each thread has its own instruction buffer. The Fetch stage/unit fetches up to four instructions per cycle and puts them into the thread’s instruction buffer. Threads can be in “Wait” (as opposed to “Ready”) state due to an ITLB miss, ICACHE miss, or their Instruction Buffer being full. The “Least-Recently-Fetched” algorithm is used to select one of the “Ready” threads for which the next instruction will be fetched. Fig. 7 shows the Integer/Load/Store pipeline and illustrates how different threads can occupy different pipeline stages in a given cycle. In other words, threads are interleaved between pipeline stages with very few restrictions. The Load/Store and Floating Point units are shared between all eight threads. The eight threads within each SPARC core are divided into two thread groups (TGs) of four threads each. Once again, the threads could be in “Wait” states due to events such as a DCACHE miss, DTLB miss, or data dependency. The “Pick” stage tries to find one instruction from all the “Ready” threads (using the “Least-Recently-Picked” algorithm) from each of the two TGs to execute every cycle. Since each TG picks independently (w.r.t. the other TG), it can lead to hazards such as load instructions being picked from both TGs even though each SPC has only one load/store unit. These hazards are resolved in the “Decode” stage.
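A least-recently-picked selection within one thread group can be sketched as follows. This is a behavioural toy, not the hardware: the function names, the use of a cycle counter as the recency record, and the tie-break (first thread in the ready list wins) are all assumptions.

```python
def pick_thread(ready, last_picked_cycle):
    """Return the ready thread that was picked longest ago, or None."""
    if not ready:
        return None
    # min() keeps the first element on ties, so list order breaks ties.
    return min(ready, key=lambda t: last_picked_cycle[t])


def simulate(ready_per_cycle, threads=(0, 1, 2, 3)):
    """Run the picker over a list of per-cycle ready sets for one TG."""
    last = {t: -1 for t in threads}  # -1 means "never picked"
    picks = []
    for cycle, ready in enumerate(ready_per_cycle):
        t = pick_thread(ready, last)
        if t is not None:
            last[t] = cycle
        picks.append(t)
    return picks


all_ready = simulate([[0, 1, 2, 3]] * 4)
some_waiting = simulate([[0, 1], [0, 1], [0], []])
```

With every thread ready the picker degenerates to round-robin; when threads drop into "Wait" the remaining ready threads absorb the slots, which is the latency-hiding behaviour the text describes.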
Niagara2’s Primary and L2 cache sizes are relatively small compared to some other processors. Even though this may cause higher cache miss rates, the miss latency is well hidden by the presence of other threads whose operands/data are available and hence can make good use of the “compute” time slots, thus minimizing wastage of “compute” resources. This factor explains why the optimum design point moved towards having higher thread counts and lower cache sizes. In effect, this can be thought of as devoting more transistors on chip to the intelligent “processing” function as opposed to the nonintelligent “data-storing” function.
Performance measurements using several commercial applications and performance benchmarks (SpecJBB, SpecWeb, TPC-C, SpecIntRate, SpecFPRate, etc.) confirm that Niagara2 has achieved its goal of doubling the throughput performance and performance/watt as compared to UltraSPARC T1. Most of the gain comes from doubling the thread count and the number of execution units. Some of the gain comes from a higher operating frequency. Similarly, performance measurements using commonly used Floating Point benchmarks confirm that Niagara2’s Floating Point throughput performance is more than an order of magnitude higher compared to UltraSPARC T1. Niagara2 has eight Floating Point units (FPUs), each occupying only 1.3 mm², against only one FPU for UltraSPARC T1. Also, the Niagara2 FPUs are within the SPCs as compared to UltraSPARC T1 where the SPCs had to access the FPU through the Crossbar. Another factor that helps performance is the higher memory bandwidth on Niagara2.
Fig. 8. Key statistical highlights.
C. Key Statistical Highlights
The table in Fig. 8 lists some key statistical highlights of Niagara2’s physical implementation. Niagara2 is built in Texas Instruments’ 65 nm, 11LM, Triple-Vt CMOS process. The chip has 503 million transistors on a 342 mm² die packaged in a flip-chip glass ceramic package with 1831 pins. It operates at 1.4 GHz @ 1.1 V and consumes 84 W.
III. MEMORY HIERARCHY, CACHES, AND CROSSBAR
Niagara2 has four memory controllers on chip, each controlling two FBDIMM channels. They are clocked by the DR (DRAM) clock, which nominally runs at 400 MHz corresponding to the FBDIMM SerDes link rate of 4.8 Gb/s. Up to eight DIMMs can be connected to each channel. Every transaction from each controller consists of 64 data bytes and ECC. Read transactions take two DR clock cycles, while Write transactions take four DR clock cycles. This yields a Read bandwidth of 51.2 GB/s and a Write bandwidth of 25.6 GB/s.
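The two bandwidth figures follow directly from the numbers in the paragraph. The sketch below reproduces them under a few stated assumptions: each controller completes one 64-byte transaction every two (read) or four (write) DR cycles, ECC bytes are not counted as payload, and 1 GB is taken as 10^9 bytes. The names are illustrative.

```python
DR_CLOCK_HZ = 400e6          # nominal DR (DRAM) clock
CONTROLLERS = 4              # on-chip memory controllers
BYTES_PER_TRANSACTION = 64   # data bytes only, ECC excluded (assumption)


def bandwidth_gb_s(cycles_per_transaction):
    """Aggregate payload bandwidth in GB/s (1 GB = 1e9 bytes)."""
    per_controller = BYTES_PER_TRANSACTION * DR_CLOCK_HZ / cycles_per_transaction
    return CONTROLLERS * per_controller / 1e9


read_bw = bandwidth_gb_s(2)   # reads take two DR cycles
write_bw = bandwidth_gb_s(4)  # writes take four DR cycles
```

The 2:1 read/write ratio is therefore purely a consequence of the cycle counts per transaction.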
Niagara2 has two levels of caches on chip. Each SPC has a 16 kB Primary Instruction cache (ICACHE), and an 8 kB Primary Data cache (DCACHE). The ICACHE is 8-way set-associative with a 32 B line size. The DCACHE is 4-way set-associative with a 16 B line size. The 4 MB shared Level2 (L2) cache is divided into eight banks of 512 kB each. The number of banks is doubled to support the doubling of thread count as compared to UltraSPARC T1. The L2 cache is 16-way set associative with a 64 B line size. Each bank can read up to 16 B per cycle with a
16 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 43, NO. 1, JANUARY 2008
Fig. 21. Power consumption.
Power EM spec. For example, for a typical layout of big buffers, the worst power EM violation reduced from 1.57 with a conventional M3 power grid to 0.93 with M3 post grid. Similarly, IR drop from M4 to devices improved from 89.1 mV to 55.8 mV.
An interactive script was used for insertion of DECAPs based on available empty space, and the placement was always assured to be DRC/LVS clean. The width of DECAPs used in standard cell blocks matched the width of the standard cells. The channel DECAPs were similar to DECAPs used in standard cell blocks, except that they had M4 and below embedded into the leaf DECAP cell to reduce data base size. About 700 nF of explicit DECAP was added on chip (this does not include implicit DECAP due to the metal power grid and the quiet cap).
VIII. POWER AND POWER MANAGEMENT
A. Power
Niagara2’s SOC design is optimized for performance/watt and enables reduction of total power consumption and power density at the chip and system level. Niagara2’s simplified pipeline and reduced speculation in instruction execution reduces wasted power. A Niagara2-based system is a lot more power efficient as compared to, for example, a system with eight single-core processors (on separate chips) each having their own I/O (DRAM, networking, and PCI-Express) interfaces. Such a system will have 8 times the I/O interfaces and hence will consume a lot more power in those interfaces. Also, extra power will be consumed in driving the off-chip multi-processor coherency fabric. In comparison, for Niagara2 there is only one set of I/O interfaces and the coherency between the eight processor cores is handled on chip by the crossbar, which consumes less than 1 W of power.
Niagara2’s total power consumption is 84 W @ 1.1 V and 1.4 GHz operation. The pie-chart in Fig. 21 shows the power consumed by the various blocks inside Niagara2. Almost a third of the total power is consumed by the eight SPARC cores. L2 cache Data, Tag and Buffer blocks together account for 20% of the total. SOC logic consumes 6% while I/Os consume 13%. Leakage is about 21% of the total power. Clocks to unused clusters are gated off to save dynamic power. Within units, clocks to groups of flops are separated into independent domains depending upon the functionality of the corresponding logic. Clocks to each domain can be turned off independently of one another when the related logic is not processing valid instructions. This saves dynamic power.
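To put the percentages in absolute terms, the sketch below converts the shares quoted above into watts against the 84 W total. Only the shares given as explicit percentages are included; the SPARC cores ("almost a third") and any blocks the text does not itemize fall into the remainder. The dictionary keys are just labels chosen here.

```python
TOTAL_W = 84.0  # total chip power at 1.1 V, 1.4 GHz

# Shares stated as explicit percentages in the text.
shares = {
    "L2 cache (data, tag, buffer)": 0.20,
    "SOC logic": 0.06,
    "I/O": 0.13,
    "Leakage": 0.21,
}

watts = {block: TOTAL_W * share for block, share in shares.items()}

# Whatever is left covers the eight SPARC cores plus blocks not itemized.
remaining_share = 1.0 - sum(shares.values())
remaining_w = TOTAL_W * remaining_share
```

On these numbers leakage alone is roughly 17.6 W, which gives a sense of scale for the leakage-reduction technique that the paper describes next.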
B. Technique to Reduce Leakage Power
Niagara2 uses gate-bias (GBIAS) cells to reduce leakage power. GBIAS cells are footprint-compatible with the corresponding standard-Vt non-GBIAS versions. The only layout difference is that the GBIAS cell has an additional identification layer (GBIAS). All the transistors of any cell having this layer are fabricated with 10% longer channel length.
The table in Fig. 22 illustrates the reduction in leakage and corresponding increase in delay as the channel length was increased above minimum for three different gates. 10% larger channel length was chosen, resulting in, on an average, about 50% reduced leakage on a per-cell basis with about 14% impact on the cell delay. High-Vt (HVT) cells were considered as well for leakage reduction. We did not have an unconstrained choice of Vt for the HVT cells. For HVT cells using the available HVT transistors, the delay impact was much larger. As a result, approximate calculations lead to the conclusion that using HVT cells would have enabled substitution of only about one-third of the number of gates as compared to using GBIAS gates with 10% larger channel length. Hence, the GBIAS option was chosen. This enabled about 77% additional leakage saving as compared to using HVT cells. Cells in non-timing-critical paths could be replaced by their GBIAS versions as long as this did not result in any new noise, slew, or timing violations. Because of footprint compatibility, the swapping was easily done at the end of the design cycle without any timing, noise, or area impact. This reduced leakage power by 10%–15%.
The swapping algorithm works as follows. The project STA tool and a set of scripts determine which cells can be swapped without creating new timing, noise, and slew violations. First, all cells that have timing margins larger than a set threshold are swapped to their GBIAS equivalent. The STA tool then computes the timing graph and swaps back all GBIAS cells that are on paths with less than a predefined positive slack. Then, a script evaluates all the receivers connected to the outputs of the GBIAS cells that went through timing qualification, and determines if sufficient noise margin exists. The script calculates the new noise values by increasing the latest noise snapshot by a percentage that was derived from simulations and analysis. Once a list of cells that can be safely swapped is built, a custom program performs the swaps in the actual layout database. A footprint-compatibility check for identical pins and metal shapes is built into the program to maintain LVS and DRC cleanliness.
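The three qualification steps above (optimistic swap, timing swap-back, noise check) can be sketched on a toy netlist. Everything about the data model is an assumption made for illustration: each cell sits on exactly one path, path slack shrinks by 14% of each swapped cell's delay (the per-cell delay impact quoted earlier), and receiver noise is expressed as a fraction of its allowed limit. The real flow uses the project STA tool, and the final layout swap with its footprint check is not modeled.

```python
from dataclasses import dataclass

DELAY_PENALTY = 0.14  # about 14% cell-delay impact for a GBIAS cell


@dataclass
class Cell:
    name: str
    path: str              # the single timing path this cell is on (toy model)
    delay_ps: float        # cell delay before swapping
    receiver_noise: float  # latest noise snapshot, as a fraction of the limit
    gbias: bool = False


def path_slack(path, cells, base_slack):
    """Slack of a path after the delay added by its swapped cells."""
    extra = sum(c.delay_ps * DELAY_PENALTY
                for c in cells if c.path == path and c.gbias)
    return base_slack[path] - extra


def select_swaps(cells, base_slack, margin_ps, min_slack_ps, noise_increase):
    # Step 1: swap every cell whose timing margin exceeds the threshold.
    for c in cells:
        c.gbias = base_slack[c.path] > margin_ps
    # Step 2: recompute timing; swap back cells on paths left with
    # less than the required positive slack.
    for path in base_slack:
        if path_slack(path, cells, base_slack) < min_slack_ps:
            for c in cells:
                if c.path == path:
                    c.gbias = False
    # Step 3: scale the noise snapshot and reject cells whose receivers
    # would exceed their noise limit.
    for c in cells:
        if c.gbias and c.receiver_noise * (1 + noise_increase) > 1.0:
            c.gbias = False
    return [c.name for c in cells if c.gbias]


cells = [
    Cell("a1", "A", 100, 0.50), Cell("a2", "A", 100, 0.95),
    Cell("b1", "B", 200, 0.50), Cell("b2", "B", 200, 0.50),
    Cell("c1", "C", 50, 0.50),
]
base_slack = {"A": 60.0, "B": 60.0, "C": 10.0}
swapped = select_swaps(cells, base_slack, margin_ps=40,
                       min_slack_ps=20, noise_increase=0.10)
```

In this example path C never qualifies, path B qualifies at first but is swapped back once the added delay eats its slack, and cell a2 passes timing but fails the noise check, leaving only a1 in the final swap list.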
C. Power Management
Niagara2 has several power management features to manage power consumption. Software can dynamically turn threads on or off as required. Niagara2 has a Power Throttling mode which provides the ability to control instruction issue rates to manage power. The graph in Fig. 23 shows that this can reduce dynamic power by up to 30% depending upon the level of workload.
18 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 43, NO. 1, JANUARY 2008
Fig. 24. Clock–data recovery of a single FBDIMM 1.0 receive channel within an FSR cluster.
Fig. 25. Clock-data recovery of a single FBDIMM 1.0 receive channel (expanded) for DTM.
i) the link training pattern (TS0) can be used to identify the start of a clock-bundle pair, and hence control the generation of the slow byte clock with respect to TS0 start times;
ii) the queue in the MCU which performs aggregation of the data to construct frames needs to wait for the slowest bundle, and therefore the slowest link.
Fig. 25 illustrates how the FBDIMM (and similarly PCI-Express) channels may be manipulated to generate clocks within an acceptable tolerance for synchronous crossing into the controller. The core clock speed is allowed to vary independently such that the CMP:DR and CMP:I/O clock ratios are always integers in DTM. In this manner, the processor core clock may be swept from 8 × sys_clk to 15 × sys_clk for functional at-speed testing.
Fig. 26 shows the Vdd versus frequency shmoo plot at 95 °C from first silicon. As can be seen, the chip operates at 1.4 GHz at 1.1 V with sufficient margin.
Fig. 26. Vdd versus frequency shmoo plot.
Intel Larrabee
Intel Larrabee Market Concept
It’s easier to program a “normal” CPU than a GPU - so let’s make a GPU out of Intel CPU cores.
But mainstream out-of-order cores provide insufficient FLOPS for graphics work.
CPUs can also use multi-threading to gain parallelism. Niagara is a multi-core general purpose microprocessor [Kongetira et al. 2005] featuring eight in-order cores, each capable of executing four simultaneous threads, and a shared cache. But given its focus on commercial server workloads, Niagara lacks architectural elements critical for visual computing, such as SIMD floating-point execution, scatter-gather, or fixed function texture support.
3. Larrabee Hardware Architecture
Figure 1 above shows a block diagram of the basic Larrabee architecture. Larrabee is designed around multiple instantiations of an in-order CPU core that is augmented with a wide vector processor (VPU). Cores communicate through a high-bandwidth interconnect network with some fixed function logic, memory I/O interfaces, and other necessary I/O logic, depending on the exact application. For example, an implementation of Larrabee as a stand-alone GPU would typically include a PCIe bus.
The data in Table 1 motivates Larrabee's use of in-order cores with wide VPUs. The middle column shows the peak performance of a modern out-of-order CPU, the Intel® Core™2 Duo processor. The right-hand column shows a test CPU design based on the Pentium® processor, which was introduced in 1992 and used dual-issue in-order instruction execution [Alpert 1993]. The Pentium processor core was modified to support four threads and a 16-wide VPU. The final two rows specify the number of non-vector instructions that can be issued per clock by one CPU and the total number of vector operations that can be issued per clock. The two configurations use roughly the same area and power.
# CPU cores:          2 out-of-order    10 in-order
Instruction issue:    4 per clock       2 per clock
VPU per core:         4-wide SSE        16-wide
L2 cache size:        4 MB              4 MB
Single-stream:        4 per clock       2 per clock
Vector throughput:    8 per clock       160 per clock
Table 1: Out-of-order vs. in-order CPU comparison: designing the processor for increased throughput can result in ½ the peak single-stream performance, but 20x the peak vector throughput with roughly the same area and power. This difference is 40x in FLOPS, since the wide VPU supports fused multiply-add but SSE doesn't. These in-order cores are not Larrabee, but are similar.
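The ratios quoted in the Table 1 caption can be checked from the table rows. In the sketch below the dictionary layout and function name are illustrative only, and the factor of 2 for fused multiply-add reflects the usual convention of counting it as two floating-point operations per lane.

```python
# Rows of Table 1 for the two configurations.
out_of_order = {"cores": 2, "vpu_width": 4, "single_stream": 4}
in_order = {"cores": 10, "vpu_width": 16, "single_stream": 2}


def vector_throughput(cfg):
    """Vector operations issued per clock across all cores."""
    return cfg["cores"] * cfg["vpu_width"]


ooo_vec = vector_throughput(out_of_order)  # 2 cores x 4-wide SSE
ino_vec = vector_throughput(in_order)      # 10 cores x 16-wide VPU

vector_ratio = ino_vec / ooo_vec
single_stream_ratio = in_order["single_stream"] / out_of_order["single_stream"]

# Fused multiply-add does two FLOPs per lane per op; SSE here does one.
flops_ratio = vector_ratio * 2
```

The comparison makes the trade explicit: half the single-stream issue rate is exchanged for a 20x gain in vector operations per clock at similar area and power.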
The test design in Table 1 is not identical to Larrabee. To provide a more direct comparison, the in-order core test design uses the same process and clock rate as the out-of-order cores and includes no fixed function graphics logic. This comparison motivates design decisions for Larrabee since it shows that a wide VPU with a simple in-order core allows CPUs to reach a dramatically higher computational density for parallel applications.
Sections 3.1 to 3.5 below describe the key features of the Larrabee architecture: the CPU core, the scalar unit and cache control instructions, the vector processor, the interprocessor ring network, and the choices for what is implemented in fixed function logic.
3.1 Larrabee Core and Caches
Figure 3 shows a schematic of a single Larrabee CPU core, plus its connection to the on-die interconnect network and the core's local subset of the L2 cache. The instruction decoder supports the standard Pentium processor x86 instruction set, with the addition of new instructions that are described in Sections 3.2 and 3.3. To simplify the design the scalar and vector units use separate register sets. Data transferred between them is written to memory and then read back in from the L1 cache.
Larrabee's L1 cache allows low-latency accesses to cache memory into the scalar and vector units. Together with Larrabee's load-op VPU instructions, this means that the L1 cache can be treated somewhat like an extended register file. This significantly improves the performance of many algorithms, especially with the cache control instructions described in Section 3.2. The single-threaded Pentium processor provided an 8KB Icache and 8KB Dcache. We specify a 32KB Icache and 32KB Dcache to support four execution threads per CPU core.
Figure 3: Larrabee CPU core and associated system blocks: the CPU is derived from the Pentium processor in-order design, plus 64-bit instructions, multi-threading and a wide VPU. Each core has fast access to its 256KB local subset of a coherent 2nd level cache. L1 cache sizes are 32KB for Icache and 32KB for Dcache. Ring network accesses pass through the L2 cache for coherency.
Larrabee's global 2nd level (L2) cache is divided into separate local subsets, one per CPU core. Each CPU has a fast direct access path to its own local subset of the L2 cache. Data read by a CPU core is stored in its L2 cache subset and can be accessed quickly, in parallel with other CPUs accessing their own local L2 cache subsets. Data written by a CPU core is stored in its own L2 cache subset and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data, as described in Section 3.4. We specify 256KB for each L2 cache subset. This supports large tile sizes for software rendering, as described in Section 4.1.
3.2 Scalar Unit and Cache Control Instructions
Larrabee's scalar pipeline is derived from the dual-issue Pentium processor, which uses a short, inexpensive execution pipeline. Larrabee provides modern additions such as multi-threading, 64-bit extensions, and sophisticated prefetching. The cores support the full Pentium processor x86 instruction set so they can run existing code including operating system kernels and applications. Larrabee adds new scalar instructions such as bit count and bit scan, which finds the next set bit within a register.
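The behaviour of the two scalar operations named above can be modeled in a few lines. This is only a functional sketch on non-negative Python integers: the function names, the `start` parameter, and the "return None when no bit is found" convention are assumptions, and nothing here reflects the actual Larrabee instruction encodings or operand forms.

```python
def bit_count(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def bit_scan(x, start=0):
    """Index of the lowest set bit at or above `start`, or None if none."""
    x >>= start
    if x == 0:
        return None
    # x & -x isolates the lowest set bit; its bit_length gives the index.
    return start + (x & -x).bit_length() - 1


reg = 0b101100  # bits 2, 3 and 5 are set

count = bit_count(reg)
first = bit_scan(reg)
after_first = bit_scan(reg, first + 1)
```

Calling `bit_scan` repeatedly with `start` set one past the previous result walks through the set bits of a mask in order, which is the typical use of a "find next set bit" operation.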
Larrabee also adds new instructions and instruction modes for explicit cache control. Examples include instructions to prefetch data into the L1 or L2 caches and instruction modes to reduce the priority of a cache line. For example, streaming data typically sweeps existing data out of a cache. Larrabee is able to mark each streaming cache line for early eviction after it is accessed. These cache control instructions also allow the L2 cache to be used similarly to a scratchpad memory, while remaining fully coherent.
Within a single core, synchronizing access to shared memory by multiple threads is inexpensive. The threads on a single core share the same local L1 cache, so a single atomic semaphore read within the L1 cache is sufficient. Synchronizing access between
Larrabee: A Many-Core x86 Architecture for Visual Computing • 18:3
ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, Publication date: August 2008.
Solution: Make a many-core GPU out of an in-order implementation of IA-32 that includes a 16-wide vector unit.
CPUs can also use multi-threading to gain parallelism. Niagara is a multi-core general purpose microprocessor [Kongetira et al. 2005] featuring eight in-order cores, each capable of executing four simultaneous threads, and a shared cache. But given its focus on commercial server workloads, Niagara lacks architectural elements critical for visual computing, such as SIMD floating-point execution, scatter-gather, or fixed function texture support.
3 Larrabee Hardware Architecture
Figure 1 above shows a block diagram of the basic Larrabee architecture. Larrabee is designed around multiple instantiations of an in-order CPU core that is augmented with a wide vector processor (VPU). Cores communicate through a high-bandwidth interconnect network with some fixed function logic, memory I/O interfaces, and other necessary I/O logic, depending on the exact application. For example, an implementation of Larrabee as a stand-alone GPU would typically include a PCIe bus.
The data in Table 1 motivates Larrabee's use of in-order cores with wide VPUs. The middle column shows the peak performance of a modern out-of-order CPU, the Intel® Core™2 Duo processor. The right-hand column shows a test CPU design based on the Pentium® processor, which was introduced in 1992 and used dual-issue in-order instruction execution [Alpert 1993]. The Pentium processor core was modified to support four threads and a 16-wide VPU. The final two rows specify the number of non-vector instructions that can be issued per clock by one CPU and the total number of vector operations that can be issued per clock. The two configurations use roughly the same area and power.
# CPU cores:          2 out-of-order    10 in-order
Instruction issue:    4 per clock       2 per clock
VPU per core:         4-wide SSE        16-wide
L2 cache size:        4 MB              4 MB
Single-stream:        4 per clock       2 per clock
Vector throughput:    8 per clock       160 per clock
Table 1: Out-of-order vs. in-order CPU comparison: designing the processor for increased throughput can result in ½ the peak single-stream performance, but 20x the peak vector throughput with roughly the same area and power. This difference is 40x in FLOPS, since the wide VPU supports fused multiply-add but SSE doesn't. These in-order cores are not Larrabee, but are similar.
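The ratios quoted in the Table 1 caption follow directly from the table rows. A minimal Python check, assuming one VPU per core and, as the caption states, that fused multiply-add doubles the floating-point work per lane on the wide VPU but not on SSE (the dictionary and function names are illustrative only):

```python
# Per-clock figures copied from Table 1 (out-of-order CPU vs. in-order test design).
out_of_order = {"cores": 2, "issue_per_core": 4, "vpu_width": 4}
in_order = {"cores": 10, "issue_per_core": 2, "vpu_width": 16}


def vector_throughput(cfg):
    """Total vector operations issued per clock, assuming one VPU per core."""
    return cfg["cores"] * cfg["vpu_width"]


# Single-stream performance is limited by the issue width of one core.
single_stream_ratio = in_order["issue_per_core"] / out_of_order["issue_per_core"]

# Vector throughput scales with cores x lanes.
vector_ratio = vector_throughput(in_order) / vector_throughput(out_of_order)

# Assumption from the caption: fused multiply-add counts as 2 flops per lane
# on the wide VPU, while SSE delivers 1 flop per lane per clock.
flops_ratio = vector_ratio * 2

print(single_stream_ratio, vector_ratio, flops_ratio)
```

The three printed ratios correspond to the "½", "20x", and "40x" figures in the caption.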
The test design in Table 1 is not identical to Larrabee. To provide a more direct comparison, the in-order core test design uses the same process and clock rate as the out-of-order cores and includes no fixed function graphics logic. This comparison motivates design decisions for Larrabee since it shows that a wide VPU with a simple in-order core allows CPUs to reach a dramatically higher computational density for parallel applications.
Sections 3.1 to 3.5 below describe the key features of the Larrabee architecture: the CPU core, the scalar unit and cache control instructions, the vector processor, the interprocessor ring network, and the choices for what is implemented in fixed function logic.
3.1 Larrabee Core and Caches
Figure 3 shows a schematic of a single Larrabee CPU core, plus its connection to the on-die interconnect network and the core's local subset of the L2 cache. The instruction decoder supports the standard Pentium processor x86 instruction set, with the addition of new instructions that are described in Sections 3.2 and 3.3. To simplify the design, the scalar and vector units use separate register sets. Data transferred between them is written to memory and then read back in from the L1 cache.
Larrabee's L1 cache allows low-latency accesses to cache memory into the scalar and vector units. Together with Larrabee's load-op VPU instructions, this means that the L1 cache can be treated somewhat like an extended register file. This significantly improves the performance of many algorithms, especially with the cache control instructions described in Section 3.2. The single-threaded Pentium processor provided an 8KB Icache and 8KB Dcache. We specify a 32KB Icache and 32KB Dcache to support four execution threads per CPU core.
Figure 3: Larrabee CPU core and associated system blocks: the CPU is derived from the Pentium processor in-order design, plus 64-bit instructions, multi-threading and a wide VPU. Each core has fast access to its 256KB local subset of a coherent 2nd level cache. L1 cache sizes are 32KB for Icache and 32KB for Dcache. Ring network accesses pass through the L2 cache for coherency.
Larrabee's global 2nd level (L2) cache is divided into separate local subsets, one per CPU core. Each CPU has a fast direct access path to its own local subset of the L2 cache. Data read by a CPU core is stored in its L2 cache subset and can be accessed quickly, in parallel with other CPUs accessing their own local L2 cache subsets. Data written by a CPU core is stored in its own L2 cache subset and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data, as described in Section 3.4. We specify 256KB for each L2 cache subset. This supports large tile sizes for software rendering, as described in Section 4.1.
3.2 Scalar Unit and Cache Control Instructions
Larrabee's scalar pipeline is derived from the dual-issue Pentium processor, which uses a short, inexpensive execution pipeline. Larrabee provides modern additions such as multi-threading, 64-bit extensions, and sophisticated prefetching. The cores support the full Pentium processor x86 instruction set so they can run existing code including operating system kernels and applications. Larrabee adds new scalar instructions such as bit count and bit scan, which finds the next set bit within a register.
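The behavior of the two scalar operations named above can be sketched in software. This is only a behavioral model in Python: the function names, the `start` argument, and the 64-bit register width are assumptions for illustration, not the actual Larrabee instruction encodings or semantics.

```python
def bit_count(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def bit_scan(x: int, start: int = 0, width: int = 64):
    """Index of the lowest set bit at position >= start, or None if there is none."""
    masked = x & ((1 << width) - 1)      # keep only the register's bits
    masked = (masked >> start) << start  # clear the bits below `start`
    if masked == 0:
        return None
    lowest = masked & -masked            # isolate the lowest set bit
    return lowest.bit_length() - 1


def set_bits(x: int, width: int = 64):
    """Walk the set bits of x by repeated bit scans, lowest first."""
    pos = bit_scan(x, 0, width)
    while pos is not None:
        yield pos
        pos = bit_scan(x, pos + 1, width)


mask = 0b101000
print(bit_count(mask), list(set_bits(mask)))
```

A loop of this shape is a common way to visit only the active entries of a bit mask, which is one plausible use for a "find the next set bit" instruction.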
Larrabee also adds new instructions and instruction modes for explicit cache control. Examples include instructions to prefetch data into the L1 or L2 caches and instruction modes to reduce the priority of a cache line. For example, streaming data typically sweeps existing data out of a cache. Larrabee is able to mark each streaming cache line for early eviction after it is accessed. These cache control instructions also allow the L2 cache to be used similarly to a scratchpad memory, while remaining fully coherent.
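The effect of marking streaming lines for early eviction can be illustrated with a toy cache model. The sketch below assumes a small fully associative cache with LRU replacement; the excerpt does not describe Larrabee's actual replacement policy, so the `Cache` class and its `streaming` flag are hypothetical stand-ins for the instruction modes mentioned above.

```python
from collections import OrderedDict


class Cache:
    """Toy fully associative cache; the front of the dict is the next victim."""

    def __init__(self, n_lines):
        self.n_lines = n_lines
        self.lines = OrderedDict()

    def access(self, addr, streaming=False):
        hit = addr in self.lines
        if hit:
            del self.lines[addr]
        elif len(self.lines) >= self.n_lines:
            self.lines.popitem(last=False)            # evict the front entry
        self.lines[addr] = None                       # insert as most recently used
        if streaming:
            self.lines.move_to_end(addr, last=False)  # mark for early eviction
        return hit


def run(streaming_hint):
    """Load a small working set, stream 10 lines, then re-touch the working set."""
    cache = Cache(n_lines=4)
    working_set = ["a", "b", "c"]
    for addr in working_set:
        cache.access(addr)
    for i in range(10):
        cache.access(("stream", i), streaming=streaming_hint)
    return [cache.access(addr) for addr in working_set]


print("with hint:   ", run(True))
print("without hint:", run(False))
```

With the hint, each streaming line displaces only the previous streaming line, so the working set stays resident; without it, the stream sweeps the working set out of the cache.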
Within a single core, synchronizing access to shared memory by multiple threads is inexpensive. The threads on a single core share the same local L1 cache, so a single atomic semaphore read within the L1 cache is sufficient. Synchronizing access between
Larrabee: A Many-Core x86 Architecture for Visual Computing • 18:3
ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, Publication date: August 2008.
15
Intel Larrabee: Memory BW.
Larrabee Core
Each core has a low-latency path to a 256 KB subset of the coherent L2 cache.
Stores from a core are placed in its L2 slice. Tuned code works out of the slice.
Accesses to the L2 slices of other cores go over the ring network (higher latency).
16
Intel Larrabee: Ring Network
Bi-directional ring network. 512 wires each way. For large chips, multiple linked rings.
Ring also provides path to off-chip DRAM and to special-purpose accelerators.
Contrast with Niagara II: In Larrabee, access to different L2 banks has non-uniform latency. Acceptable, given Larrabee’s graphics focus.
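One way to see why L2 latency is non-uniform on a ring is to count hops. The Python sketch below assumes one L2 slice per ring stop, traffic that takes the shorter direction around the bi-directional ring, and latency that grows with hop count; the 16-stop ring and the function name are hypothetical, not figures from the slides.

```python
def ring_hops(src: int, dst: int, n_stops: int) -> int:
    """Hops between two stops on a bi-directional ring, taking the shorter way."""
    clockwise = (dst - src) % n_stops
    return min(clockwise, n_stops - clockwise)


N_STOPS = 16  # hypothetical ring size, chosen only for illustration

# Distance from core 0 to every L2 slice, including its own (0 hops).
hops_from_core0 = [ring_hops(0, slice_id, N_STOPS) for slice_id in range(N_STOPS)]
average_hops = sum(hops_from_core0) / N_STOPS

print(hops_from_core0)
print("average hops:", average_hops)
```

The local slice costs zero hops while the farthest slice is half the ring away, which is the non-uniformity the slide contrasts with Niagara II.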
ACM Reference Format: Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P. 2008. Larrabee: A Many-Core x86 Architecture for Visual Computing. ACM Trans. Graph. 27, 3, Article 18 (August 2008), 15 pages. DOI = 10.1145/1360612.1360617 http://doi.acm.org/10.1145/1360612.1360617.
Copyright Notice: Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701, fax +1 (212) 869-0481, or permissions@acm.org. © 2008 ACM 0730-0301/2008/03-ART18 $5.00 DOI 10.1145/1360612.1360617 http://doi.acm.org/10.1145/1360612.1360617
Larrabee: A Many-Core x86 Architecture for Visual Computing
Larry Seiler¹, Doug Carmean¹, Eric Sprangle¹, Tom Forsyth¹, Michael Abrash², Pradeep Dubey¹, Stephen Junkins¹, Adam Lake¹, Jeremy Sugerman³, Robert Cavin¹, Roger Espasa¹, Ed Grochowski¹, Toni Juan¹, and Pat Hanrahan³
¹ Intel® Corporation: larry.seiler, doug.carmean, eric.sprangle, tom.forsyth, pradeep.dubey, stephen.junkins, adam.t.lake, robert.d.cavin, roger.espasa, edward.grochowski & toni.juan @intel.com
² RAD Game Tools: mikea@radgametools.com
³ Stanford University: yoel & hanrahan @cs.stanford.edu
Abstract
This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a many-core programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2nd level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee's potential for a broad range of parallel computation.
CR Categories: I.3.1 [Computer Graphics]: Hardware Architecture--Graphics Processors, Parallel Processing; I.3.3 [Computer Graphics]: Picture/Image Generation--Display Algorithms; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism--Color, shading, shadowing, and texture
Keywords: graphics architecture, many-core computing, real-time graphics, software rendering, throughput computing, visual computing, parallel processing, SIMD, GPGPU.
1 Introduction
Modern GPUs are increasingly programmable in order to support advanced graphics algorithms and other parallel applications. However, general purpose programmability of the graphics pipeline is restricted by limitations on the memory model and by fixed function blocks that schedule the parallel threads of execution. For example, pixel processing order is controlled by the rasterization logic and other dedicated scheduling logic.
This paper describes a highly parallel architecture that makes the rendering pipeline completely programmable. The Larrabee architecture is based on in-order CPU cores that run an extended version of the x86 instruction set, including wide vector processing operations and some specialized scalar instructions. Figure 1 shows a schematic illustration of the architecture. The cores each access their own subset of a coherent L2 cache to provide high-bandwidth L2 cache access from each core and to simplify data sharing and synchronization.
Larrabee is more flexible than current GPUs. Its CPU-like x86-based architecture supports subroutines and page faulting. Some operations that GPUs traditionally perform with fixed function logic, such as rasterization and post-shader blending, are performed entirely in software in Larrabee. Like GPUs, Larrabee uses fixed function logic for texture filtering, but the cores assist the fixed function logic, e.g. by supporting page faults.
Figure 1: Schematic of the Larrabee many-core architecture: The number of CPU cores and the number and type of co-processors and I/O blocks are implementation-dependent, as are the positions of the CPU and non-CPU blocks on the chip.
This paper also describes a software rendering pipeline that runs efficiently on this architecture. It uses binning to increase parallelism and reduce memory bandwidth, while avoiding the problems of some previous tile-based architectures. Implementing the renderer in software allows existing features to be optimized based on workload and allows new features to be added. For example, programmable blending and order-independent transparency fit easily into the Larrabee software pipeline.
Finally, this paper describes a programming model that supports more general parallel applications, such as image processing, physical simulation, and medical & financial analytics. Larrabee's support for irregular data structures and its scatter-gather capability make it suitable for these throughput applications as demonstrated by our scalability and performance analysis.
ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, Publication date: August 2008.
17
Intel Larrabee
Ring network
18
Core + 256KB L2 Cache slice
19
Texture filtering units (for graphics)
20
DRAM and I/O Interfaces
21
Intel Larrabee: Performance
shaders and may limit performance for other shaders. Section 5.4 provides breakdowns of the total processing time devoted to post-shader blending and parameter interpolation.
One remaining issue is texture co-processor accesses, which can have hundreds of clocks of latency. This is hidden by computing multiple qquads on each hardware thread. Each qquad's shader is called a fiber. The different fibers on a thread co-operatively switch between themselves without any OS intervention. A fiber switch is performed after each texture read command, and processing passes to the other fibers running on the thread. Fibers execute in a circular queue. The number of fibers is chosen so that by the time control flows back to a fiber, its texture access has had time to execute and the results are ready for processing.
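The fiber scheme described above can be modeled with cooperative generators. This is a minimal Python sketch of the idea, not the Larrabee renderer's implementation: the `fiber`, `run_thread`, and `fibers_needed` names are hypothetical, and the latency figures in the example are made up for illustration.

```python
from collections import deque


def fiber(name, n_texture_reads, log):
    """One qquad's shader: hands control back after issuing each texture read."""
    for i in range(n_texture_reads):
        log.append((name, f"issue texture read {i}"))
        yield  # cooperative switch point, no OS involvement
        log.append((name, f"use texture result {i}"))


def run_thread(fibers):
    """Run the fibers of one hardware thread round-robin in a circular queue."""
    queue = deque(fibers)
    while queue:
        current = queue.popleft()
        try:
            next(current)          # run until the next texture read
            queue.append(current)  # not finished: rejoin the back of the queue
        except StopIteration:
            pass                   # fiber finished, drop it


def fibers_needed(texture_latency_clocks, clocks_between_reads):
    """Smallest fiber count such that the other fibers cover the latency."""
    covering = -(-texture_latency_clocks // clocks_between_reads)  # ceiling division
    return 1 + covering


log = []
run_thread([fiber(f"fiber{i}", 2, log) for i in range(3)])
for entry in log:
    print(entry)
print(fibers_needed(300, 50))  # hypothetical: 300-clock latency, 50 clocks of work per fiber
```

In the log, every fiber issues its first texture read before any fiber consumes a result, which is the latency-hiding behavior the paragraph describes.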
5 Renderer Performance Studies
This section describes performance and scaling studies for the Larrabee software renderer described in Section 4. Studies include scalability experiments for software rendering, load balancing studies, bandwidth comparisons of binning to immediate mode renderers, performance on several game workloads, and charts showing how the total processing time is divided among different parts of the software renderer.
5.1 Game Workloads and Simulation Method
Performance tests use workloads derived from three well-known games: Gears of War*, F.E.A.R.*, and Half Life* 2 Episode 2. Table 2 contains information about the tested frames from each game. Since we are scaling out to large numbers of cores we use a high-end screen size with multisampling when supported.
Half Life 2 ep. 2       F.E.A.R.                 Gears of War
1600x1200 4 sample      1600x1200 4 sample       1600x1200 1 sample
25 frames (1 in 30)     25 frames (1 in 100)     25 frames (1 in 250)
Valve Corp.             Monolith Productions     Epic Games Inc
Table 2: Workload summary for the three tested games: the frames are widely separated to catch different scene characteristics as the games progress.
We captured the frames by intercepting the DirectX 9 command stream being sent to a conventional graphics card while the game was played at a normal speed, along with the contents of textures and surfaces at the start of the frame. We tested them through a functional model to ensure the algorithms were correct and that the right images were produced. Next, we estimated the cost of each section of code in the functional model, being aggressively pessimistic, and built a rough profile of each frame. We wrote assembly code for the highest-cost sections, ran it through cycle-accurate simulators, fed the clock cycle results back into the functional model, and re-ran the tests. This iterative cycle of refinement was repeated until 90% of the clock cycles executed during a frame had been run through the simulators, giving the overall profiles a high degree of confidence. Texture unit throughput, cache performance and memory bandwidth limitations were all included in the various simulations.
In these studies we measure workload performance in terms of Larrabee units. A Larrabee unit is defined to be one Larrabee core running at 1 GHz. The clock rate is chosen solely for ease of calculation, since real devices would ship with multiple cores and a variety of clock rates. Using Larrabee units allows us to compare performance of Larrabee implementations with different numbers of cores running at different clock rates. A single Larrabee unit corresponds to a theoretical peak throughput of 32 GFLOPS, counting fused multiply-add as two operations.
* Other names & brands may be claimed as the property of others.
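The 32 GFLOPS figure for one Larrabee unit is consistent with a 16-wide vector unit issuing one fused multiply-add per lane per clock at 1 GHz. A small Python check of that arithmetic follows; the 16-lane width is the vector width quoted elsewhere in these slides, while the function names and the 24-core, 1.5 GHz configuration are hypothetical examples rather than a real product.

```python
def larrabee_units(cores: int, clock_ghz: float) -> float:
    """One Larrabee unit is defined as one core running at 1 GHz."""
    return cores * clock_ghz


def peak_gflops(cores: int, clock_ghz: float,
                vpu_lanes: int = 16, flops_per_lane_per_clock: int = 2) -> float:
    """Theoretical peak, counting fused multiply-add as two operations."""
    return larrabee_units(cores, clock_ghz) * vpu_lanes * flops_per_lane_per_clock


print(peak_gflops(1, 1.0))        # one Larrabee unit
print(larrabee_units(24, 1.5))    # hypothetical 24-core part at 1.5 GHz
print(peak_gflops(24, 1.5))
```

Expressing results in units lets configurations with different core counts and clock rates be compared on one axis, as the paragraph above explains.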
5.2 Scalability Studies
The Larrabee software renderer is designed to allow efficient load balancing over a large number of cores. Figure 9 shows the results of testing load balancing for six configurations, each of which scales the memory bandwidth and texture filtering speed relative to the number of cores. This test uses the simulation methodology described in Section 5.1 in combination with a time-based performance model that tracks dependencies and scheduling. This tool is used for multiple graphics products within Intel.
Figure 9: Relative Scaling as a Function of Core Count: This shows configurations with 8 to 48 cores, with each game's results plotted relative to the performance of an 8-core system.
The results of the load balancing simulation show a falloff of 7% to 10% from a linear speedup at 48 cores. For these tests, PrimSets are subdivided if they contain more than 1000 primitives, as described in Section 4.2. Additional tests show that F.E.A.R. falls off by only 2% if PrimSets are subdivided into groups of 200 primitives, so code tuning should improve the linearity.
Figure 10 shows the number of Larrabee units required to render sample frames from the three games at 60 frames/second. These results were simulated on a single core with the assumption that performance scales linearly. For Half Life 2 episode 2, roughly 10 Larrabee Units are sufficient to ensure that all frames run at 60 fps or faster. For F.E.A.R. and Gears of War, roughly 25 Larrabee Units suffice.
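Under the linear-scaling assumption stated above, the number of Larrabee units needed for a target frame rate follows from the clock cycles one frame takes on a single core. The Python sketch below shows that arithmetic; the cycle counts are invented for illustration and are not measurements from the paper.

```python
CLOCKS_PER_SECOND_PER_UNIT = 10**9  # one Larrabee unit = one core at 1 GHz


def units_needed(clocks_per_frame: int, target_fps: int = 60) -> int:
    """Units required so every frame finishes within 1/target_fps seconds,
    assuming performance scales linearly with the number of units."""
    clocks_per_second = clocks_per_frame * target_fps
    # Ceiling division in integers avoids floating-point rounding surprises.
    return -(-clocks_per_second // CLOCKS_PER_SECOND_PER_UNIT)


# Hypothetical single-core frame costs, in clock cycles.
print(units_needed(160_000_000))  # a frame costing 160M clocks
print(units_needed(400_000_000))  # a frame costing 400M clocks
```

Because the heaviest frame sets the requirement, the figure's per-game unit count is the maximum of this quantity over the sampled frames.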
Figure 10: Overall performance: shows the number of Larrabee Units (cores running at 1 GHz) needed to achieve 60fps for the series of sample frames in each game.
The remaining issue that can limit scalability is software locks. Simulating multiple frames of rendering at such a fine level of detail is extremely costly. However, this software rendering pipeline was explicitly designed to minimize the number of locks
18:8 • L. Seiler et al.
ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, Publication date: August 2008.
August 2008: Larrabee DirectX 9 pipeline, used by video games.
Good scaling with # of cores.
# of cores for high frame rates is a work in progress.
22
Next Week: Practical design techniques
Thursday: Design blocks
Tuesday: Micro-architecture
23