CS 250 VLSI System Design
Lecture 9: Floorplanning
Many-core Example Chips

2009-9-24
John Wawrzynek and Krste Asanovic with John Lazzaro
TA: Yunsup Lee
www-inst.eecs.berkeley.edu/~cs250/

UC Regents Fall 2009 © UCB
CS 250 L9: Floorplanning



Intel Larrabee


Sun Niagara II and Intel Larrabee are both:

Complete implementations of long-established ISAs (SPARC, x86).

Provide a conventional shared-memory virtual-address memory model.

Micro-architecture and ISA extensions target a specific application area.

Not boutique parts - manufacturable.

Many-cores, multi-threaded cores.


Sun Niagara II Target Market

Some applications unavoidably spend their lives waiting for memory.

memory machines with discrete single-threaded processors and coherent interconnect have tended to perform well because they exploit TLP. However, the use of an SMP composed of multiple processors designed to exploit ILP is neither power efficient nor cost-efficient. A more efficient approach is to build a machine using simple cores aggregated on a single die, with a shared on-chip cache and high bandwidth to large off-chip memory, thereby aggregating an SMP server on a chip. This has the added benefit of low-latency communication between the cores for efficient data sharing in commercial server applications.

Niagara overview

The Niagara approach to increasing throughput on commercial server applications involves a dramatic increase in the number of threads supported on the processor and a memory subsystem scaled for higher bandwidths. Niagara supports 32 threads of execution in hardware. The architecture organizes four threads into a thread group; the group shares a processing pipeline, referred to as the Sparc pipe. Niagara uses eight such thread groups, resulting in 32 threads on the CPU. Each SPARC pipe contains level-1 caches for instructions and data. The hardware hides memory and pipeline stalls on a given thread by scheduling the other threads in the group onto the SPARC pipe with a zero cycle switch penalty. Figure 1 schematically shows how reusing the shared processing pipeline results in higher throughput.

The 32 threads share a 3-Mbyte level-2 cache. This cache is 4-way banked and pipelined for bandwidth; it is 12-way set-associative to minimize conflict misses from the many threads. Commercial server code has data sharing, which can lead to high coherence miss rates. In conventional SMP systems using discrete processors with coherent system interconnects, coherence misses go out over low-frequency off-chip buses or links, and can have high latencies. The Niagara design with its shared on-chip cache eliminates these misses and replaces them with low-latency shared-cache communication.

The crossbar interconnect provides the communication link between Sparc pipes, L2 cache banks, and other shared resources on the CPU; it provides more than 200 Gbytes/s of bandwidth. A two-entry queue is available for each source-destination pair, and it can queue up to 96 transactions each way in the crossbar. The crossbar also provides a port for communication with the I/O subsystem. Arbitration for destination ports uses a simple age-based priority scheme that ensures fair scheduling across all requestors. The crossbar is also the point of memory ordering for the machine.

The memory interface is four channels of dual-data rate 2 (DDR2) DRAM, supporting a maximum bandwidth in excess of 20 Gbytes/s, and a capacity of up to 128 Gbytes. Figure 2 shows a block diagram of the Niagara processor.

Sparc pipeline

Here we describe the Sparc pipe implementation, which supports four threads. Each thread has a unique set of registers and instruction and store buffers. The thread group shares the L1 caches, translation look-aside buffers (TLBs), execution units, and most pipeline registers. We implemented a single-issue pipeline with six stages (fetch, thread select, decode, execute, memory, and write back).

In the fetch stage, the instruction cache and instruction TLB (ITLB) are accessed. The following stage completes the cache access by selecting the way. The critical path is set by the 64-entry, fully associative ITLB access. A thread-select multiplexer determines which of

23 MARCH–APRIL 2005

[Figure 1 diagram: timelines of compute latency (C) and memory latency (M) for three cases: single issue, ILP, and TLP (on shared single-issue pipeline), with the time saved marked.]

Figure 1. Behavior of processors optimized for TLP and ILP on commercial server workloads. In comparison to the single-issue machine, the ILP processor mainly reduces compute time, so memory access time dominates application performance. In the TLP case, multiple threads share a single-issue pipeline, and overlapped execution of these threads results in higher performance for a multithreaded application.
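The comparison in Figure 1 can be approximated with a few lines of arithmetic. The following is a minimal sketch, not taken from the paper: it assumes each unit of work needs a fixed number of compute cycles followed by a fixed memory stall, and that stalls of different threads overlap perfectly. The function name and the cycle counts are hypothetical.

```python
def throughput(compute, memory, threads=1, ilp_speedup=1.0):
    """Work items completed per cycle on one single-issue pipeline.

    Each item needs `compute` pipeline cycles followed by `memory` stall
    cycles. Memory stalls of different threads overlap; compute does not.
    """
    c = compute / ilp_speedup  # ILP only shortens the compute part
    # One thread finishes an item every c + memory cycles, and the shared
    # pipeline can never retire more than one item per c cycles.
    return min(threads / (c + memory), 1.0 / c)


# Hypothetical cycle counts for a memory-bound server workload.
C, M = 10, 30
single = throughput(C, M)
ilp = throughput(C, M, ilp_speedup=2.0)
tlp = throughput(C, M, threads=4)

print(f"ILP speedup over single issue: {ilp / single:.2f}x")
print(f"TLP speedup over single issue: {tlp / single:.2f}x")
```

Under these assumed numbers, halving compute time helps only a little because the memory stall dominates, while four threads keep the shared pipeline busy for the whole stall.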


Cores that focus on extracting instruction-level parallelism are wasted on these apps.


Instead, build simple cores that are multi-threaded, and focus on maximizing throughput of a large number of threads.


Krste November 10, 2004

6.823, L18--5

Simple Multithreaded Pipeline

Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage

[Figure: multithreaded pipeline diagram with a 2-bit thread select, four per-thread PCs (with +1 increment), I$, IR, four per-thread GPR files, operands X and Y, and D$.]

Sun Niagara II Multithreading

Multi-threading static pipeline CPUs is simple and efficient, and goes back to the 1960s (CDC).
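To make the "carry thread select down the pipeline" point concrete, here is a minimal sketch of a fine-grained multithreaded pipe. It assumes a three-stage pipeline (fetch, execute, writeback), round-robin thread selection, and a toy instruction format `(dst, src, imm)` meaning `gpr[dst] = gpr[src] + imm`; none of these details come from the slides. The thread id travels in every pipeline latch so each stage reads or writes the right thread's state.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadState:
    """Per-thread architectural state: one PC and one register file."""
    pc: int = 0
    gpr: list = field(default_factory=lambda: [0] * 4)


def run(programs, cycles):
    """Simulate `cycles` cycles with one program per hardware thread."""
    n = len(programs)
    threads = [ThreadState() for _ in range(n)]
    fetch_latch = None  # (tid, instr) leaving the fetch stage
    exec_latch = None   # (tid, dst, value) leaving the execute stage
    for cycle in range(cycles):
        # Writeback: the carried tid selects which register file is written.
        if exec_latch is not None:
            tid, dst, value = exec_latch
            threads[tid].gpr[dst] = value
        # Execute: the carried tid selects which register file is read.
        if fetch_latch is not None:
            tid, (dst, src, imm) = fetch_latch
            exec_latch = (tid, dst, threads[tid].gpr[src] + imm)
        else:
            exec_latch = None
        # Fetch: round-robin thread select chooses which PC to use.
        tid = cycle % n
        t = threads[tid]
        if t.pc < len(programs[tid]):
            fetch_latch = (tid, programs[tid][t.pc])
            t.pc += 1
        else:
            fetch_latch = None
    return threads


result = run([[(0, 0, 1), (1, 0, 10)], [(0, 0, 5)]], cycles=8)
print([t.gpr for t in result])
```

Dropping the `tid` from either latch would make the execute or writeback stage touch the wrong thread's registers, which is the hazard the slide warns about.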


Sun Niagara II Sizing the chip

8 threads/core: Enough to keep one core busy, given clock speed, memory system latency, and target application characteristics.

6 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 43, NO. 1, JANUARY 2008

Implementation of an 8-Core, 64-Thread, Power-Efficient SPARC Server on a Chip

Umesh Gajanan Nawathe, Mahmudul Hassan, King C. Yen, Ashok Kumar, Aparna Ramachandran, and David Greenhill

Abstract: The second in the Niagara series of processors (Niagara2) from Sun Microsystems is based on the power-efficient chip multi-threading (CMT) architecture optimized for Space, Watts (Power), and Performance (SWaP) [SWaP Rating = Performance / (Space × Power)]. It doubles the throughput performance and performance/watt, and provides 10× improvement in floating point throughput performance as compared to UltraSPARC T1 (Niagara1). There are two 10 Gb Ethernet ports on chip. Niagara2 has eight SPARC cores, each supporting concurrent execution of eight threads for 64 threads total. Each SPARC core has a Floating Point and Graphics unit and an advanced Cryptographic unit which provides high enough bandwidth to run the two 10 Gb Ethernet ports encrypted at wire speeds. There is a 4 MB Level2 cache on chip. Each of the four on-chip memory controllers controls two FBDIMM channels. Niagara2 has 503 million transistors on a 342 mm² die packaged in a flip-chip glass ceramic package with 1831 pins. The chip is built in Texas Instruments’ 65 nm 11LM triple-Vt CMOS process. It operates at 1.4 GHz at 1.1 V and consumes 84 W.

Index Terms: Chip multi-threading (CMT), clocking, computer architecture, cryptography, low power, microprocessor, multi-core, multi-threaded, Niagara series of processors, power efficient, power management, SerDes, SPARC architecture, synchronous and asynchronous clock domains, system on a chip (SoC), throughput computing, UltraSPARC T2.

I. INTRODUCTION

Today’s datacenters face extreme throughput, space, and power challenges. Throughput demands continue increasing while space and power are fixed. The increase in power consumed by the servers and the cost of cooling has caused a rapid increase in the cost of operating a datacenter. The Niagara1 processor [5] (also known as the UltraSPARC T1) made a substantial attempt at solving this problem. This paper describes the implementation of the Niagara2 processor, designed with a wide range of applications in mind, including database, web-tier, floating-point, and secure applications. Niagara2, as the name suggests, is the follow-on to the Niagara1 processor based on the CMT architecture optimized for SWaP.

Fig. 1 illustrates the advantages of the CMT architecture. For a single thread, memory access is the single biggest bottleneck to improving performance. For workloads which exhibit poor memory locality, only a modest throughput speedup is possible by reducing compute time. As a result, conventional single-thread processors which are optimized for Instruction-Level-Parallelism have low utilization and wasted power. Having many threads makes it easier to find something useful to execute every cycle. As a result, processor utilization is higher and significant throughput speedups are achievable.

Fig. 1. Throughput computing using the CMT architecture.

The design of the Niagara2 processor started off with three primary goals in mind: 1) 2× throughput performance and performance/watt as compared to UltraSPARC T1; 2) 10× floating point throughput performance as compared to UltraSPARC T1; and 3) integration of the major system components on chip. Two options were considered to achieve these goals: 1) double the number of cores to 16 as compared to eight on UltraSPARC T1, with each core supporting four threads as in UltraSPARC T1; and 2) double the number of threads/core from four to eight and correspondingly double the number of execution units per core from one to two. Both options would have enabled us to achieve our first goal. The first option would have doubled the SPARC core area as compared to a lot smaller area increase with the second option. The second option was chosen as the area saved using this option allowed us to integrate a Floating Point and Graphics unit and a Cryptographic unit inside each SPARC core and also allowed integration of the critical SoC components on chip, thus enabling us to achieve our second and third goals as well.

II. ARCHITECTURE AND KEY STATISTICAL HIGHLIGHTS

A. Niagara2 Architecture

Fig. 2 shows the Niagara2 block diagram, and Fig. 3 shows the die micrograph. The chip has eight SPARC Cores, a 4 MB shared Level2 cache, and supports concurrent execution of 64 threads. The Level2 cache is divided into eight banks of 512 kB each. The SPARC Cores communicate with the Level2 cache banks through a high bandwidth crossbar. Niagara2 has a ×8 PCI-Express channel, two 10 Gb Ethernet ports with XAUI interfaces and four memory controllers each controlling

Manuscript received April 17, 2007; revised September 27, 2007. U. G. Nawathe is with Sun Microsystems, Santa Clara, CA 95054 USA (e-mail: [email protected]). M. Hassan, K. C. Yen, A. Kumar, A. Ramachandran, and D. Greenhill are with Sun Microsystems, Sunnyvale, CA 94085 USA. Digital Object Identifier 10.1109/JSSC.2007.910967

0018-9200/$25.00 © 2008 IEEE

Authorized licensed use limited to: Univ of Calif Berkeley. Downloaded on September 18, 2009 at 17:03 from IEEE Xplore. Restrictions apply.

Cores/chip?: The number of cores that can be put to good use is limited by the memory bandwidth available to keep all threads on all cores busy.
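The bandwidth limit on core count can be written as a one-line estimate. This is a rough sketch under stated assumptions: every thread is assumed to generate the same average off-chip traffic, and the function name and all numbers below are hypothetical, not figures from the lecture or the paper.

```python
import math


def max_useful_cores(memory_bw_gb_s, threads_per_core, demand_per_thread_gb_s):
    """Largest core count whose threads can all be fed by off-chip memory."""
    per_core_demand = threads_per_core * demand_per_thread_gb_s
    return math.floor(memory_bw_gb_s / per_core_demand)


# Hypothetical example: 40 GB/s of memory bandwidth, 8 threads per core,
# and 0.5 GB/s of average miss traffic per thread.
cores = max_useful_cores(40.0, 8, 0.5)
print(cores)
```

Adding cores beyond this estimate would leave threads stalled on memory with no bandwidth left to serve them, so the extra area would not raise throughput.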


NAWATHE et al.: IMPLEMENTATION OF AN 8-CORE, 64-THREAD, POWER-EFFICIENT SPARC SERVER ON A CHIP 7

Fig. 2. Niagara2 block diagram.

Fig. 3. Niagara2 die micrograph.

two FBDIMM channels. These three major I/O interfaces are serializer/deserializer (SerDes) based and provide a total pin bandwidth in excess of 1 Tb/s. All the SerDes are on chip. The high levels of system integration truly makes Niagara2 a “server-on-a-chip”, thus reducing system component count, complexity and power, and hence improving system reliability.

B. SPARC Core Architecture

Fig. 4 shows the block diagram of the SPARC Core. Each SPARC core (SPC) implements the 64-bit SPARC V9 instruction set while supporting concurrent execution of eight threads. Each SPC has one load/store unit (LSU), two Execution units (EXU0 and EXU1), and one Floating Point and Graphics Unit (FGU). The Instruction Fetch unit (IFU) and the LSU contain an 8-way 16 kB Instruction cache and a 4-way 8 kB Data cache respectively. Each SPC also contains a 64-entry Instruction-TLB (ITLB), and a 128-entry Data-TLB (DTLB). Both the TLBs are fully associative. The memory Management Unit (MMU) supports 8 K, 64 K, 4 M, and 256 M page sizes and has Hardware TableWalk to reduce TLB miss penalty. “TLU” in the block diagram is the Trap Logic Unit. The “Gasket” performs arbitration for access to the Crossbar. Each SPC also has an advanced Cryptographic/Stream Processing Unit (SPU). The combined bandwidth of the eight Cryptographic units from the eight SPCs is sufficient for running the two 10 Gb Ethernet ports encrypted. This enables Niagara2 to run secure applications at wire speed.

Fig. 4. SPC block diagram.

Fig. 5. Integer pipeline: eight stages.

Fig. 6. Floating point pipeline: 12 stages.

Fig. 5 and Fig. 6 illustrate the Niagara2 integer and floating point pipelines, respectively. The integer pipeline is eight stages long. The floating point pipeline has 12 stages for most operations. Divide and Square-root operations have a longer pipeline.
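The cache and page-size figures above translate into set counts and address-bit splits with simple arithmetic. The sketch below is illustrative only: the cache sizes, associativities, and page sizes come from the excerpt, but the line sizes (32 bytes for the instruction cache, 16 bytes for the data cache) are assumptions, since the excerpt does not state them.

```python
def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


def page_offset_bits(page_bytes):
    """Bits of page offset for a power-of-two page size."""
    return page_bytes.bit_length() - 1


KB, MB = 1024, 1024 * 1024

# Line sizes are assumed values, not taken from the excerpt.
icache_sets = cache_sets(16 * KB, ways=8, line_bytes=32)
dcache_sets = cache_sets(8 * KB, ways=4, line_bytes=16)
print(icache_sets, dcache_sets)

# Page sizes listed for the MMU: 8 K, 64 K, 4 M, and 256 M.
for page in (8 * KB, 64 * KB, 4 * MB, 256 * MB):
    print(page, page_offset_bits(page))
```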


Sun Niagara II: 8 cores, 4MB L2, 4 DRAM channels

“Small” L2 cache because apps are locality-poor.

Each DRAM channel: 50 GB/s Read, 25 GB/s Write BW.

Crossbar BW: 270 GB/s total (Read + Write).

(Also shared by an I/O port, not shown)


NAWATHE et al.: IMPLEMENTATION OF AN 8-CORE, 64-THREAD, POWER-EFFICIENT SPARC SERVER ON A CHIP 9

Fig. 9. L2 cache row redundancy scheme.

2-cycle latency. Addresses can be hashed to distribute accesses across different sets in case of hot cache sets caused by reference conflicts. All arrays are protected by single error correction, double error detection ECC, and parity. Data from different ways and different words is interleaved to improve soft error rates.

The L2 cache used a unique row-redundancy scheme. It is implemented at the 32 kB level and is illustrated in Fig. 9. Spare rows for one array are located in the adjacent array as opposed to the same array. In other words, spare rows for the top array are located in the bottom array and vice versa. When redundancy is enabled, the incoming address is compared with the address of the defective row and if it matches, the adjacent array (which is normally not enabled) is enabled to read from or write into the spare row. Using this kind of scheme enables a large ( 30%) reduction in X-decoder area. The area reduction is achieved because the multiplexing required in the X-decoder to bypass the defective row/rows in the traditional row redundancy scheme is no longer needed in this scheme.

N-well power for the Primary and L2 cache memory cells is separated out as a test hook. This allows weakening of the pMOS loads of the SRAM bit cells by raising their threshold voltage, thus enabling screening cells with marginal static noise margin. This significantly reduces defective parts per million (DPPM) and improves reliability.
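The row-redundancy scheme described above can be modeled behaviorally. This is a minimal sketch, not the circuit itself: the class name, the row and spare counts, and the dictionary used for the defective-row compare are all hypothetical. It shows only the key idea that a defective row in one array is served by a spare row stored in the adjacent array.

```python
class RedundantArrayPair:
    """Two arrays (0 = top, 1 = bottom); each stores the other's spare rows."""

    def __init__(self, rows, spares):
        self.rows = rows
        self.spares = spares
        # storage[a] = normal rows of array a, then spares serving array 1 - a.
        self.storage = [[0] * (rows + spares) for _ in range(2)]
        # remap[a][row] = index of the spare row that replaces a defective row.
        self.remap = [{}, {}]

    def mark_defective(self, array, row):
        table = self.remap[array]
        if row not in table:
            if len(table) >= self.spares:
                raise ValueError("no spare rows left for this array")
            table[row] = len(table)

    def locate(self, array, row):
        """Compare the address with known defective rows; redirect on a match."""
        spare = self.remap[array].get(row)
        if spare is None:
            return array, row                  # normal access, same array
        return 1 - array, self.rows + spare    # spare row in the adjacent array

    def write(self, array, row, value):
        a, r = self.locate(array, row)
        self.storage[a][r] = value

    def read(self, array, row):
        a, r = self.locate(array, row)
        return self.storage[a][r]


pair = RedundantArrayPair(rows=8, spares=2)
pair.mark_defective(0, 3)
pair.write(0, 3, 42)
print(pair.locate(0, 3), pair.read(0, 3))
```

Because the redirect selects a whole different array rather than shifting rows inside the same one, the model needs no per-row bypass logic, which mirrors the X-decoder saving the paper describes.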

Fig. 10 shows the Niagara2 Crossbar (CCX). CCX serves as a high bandwidth interface between the eight SPARC Cores, shown on top, and the eight L2 cache banks, and the non-cacheable unit (NCU) shown at the bottom. CCX consists of two blocks: PCX and CPX. PCX (“Processor-to-Cache-Transfer”) is an 8-input 9-output multiplexer (mux). It transfers data from the eight SPARC cores to the eight L2 cache banks and the NCU. Likewise, CPX (“Cache-to-Processor Transfer”) is a 9-input 8-output mux, and it transfers data in the reverse direction. The PCX and CPX combined provide a Read/Write bandwidth of 270 GB/s. All crossbar data transfer requests are processed using a four-stage pipeline. The pipeline stages are: Request, Arbitration, Selection, and Transmission. As can be seen from the figure, there are possible source-destination pairs for each data transfer request. There is a two-deep queue for each source-destination pair to hold data transfer requests for that pair.

Fig. 10. Crossbar.

IV. CLOCKING

Niagara2 contains a mix of many clocking styles (synchronous, mesochronous and asynchronous) and hence a large number of clock domains. Managing all these clock domains and domain crossings between them was one of the biggest challenges the design team faced. A subset of synchronous methodology, ratioed synchronous clocking (RSC) is used extensively. The concept works well for functional mode while being equally applicable to at-speed test of the core using the SerDes interfaces.

A. Clock Sources and Distribution

An on-chip phase-locked loop (PLL) uses a fractional divider [8], [9] to generate Ratioed Synchronous Clocks with support for a wide range of integer and fractional divide ratios. The distribution of these clocks uses a combination of H-trees and grids. This ensures they meet tight clock skew budgets while keeping power consumption under control. Clock Tree Synthesis is used for routing the asynchronous clocks. Asynchronous clock domain crossings are handled using FIFOs and meta-stability hardened flip-flops. All clock headers are designed to support clock gating to save clock power.

Fig. 11 shows the block diagram of the PLL. Its architecture is similar to the one described in [8]. It uses a loop filter capacitor referenced to a regulated 1.1 V supply (VREG). VREG is generated by a voltage regulator from the 1.5 V supply coming


Sun Niagara II 8 x 9 Crossbar

8 ports on CPU side (one per core)

8 ports for L2 banks, plus one for I/O

4-cycle latency (715 ps/cycle).

Cycles 1-3 are for arbitration.

Transmit data on cycle 4.

100-200 wires/port (each way).
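The processor-to-cache half of the crossbar can be sketched as a set of small queues with per-destination arbitration. This is a simplified model under stated assumptions: it keeps the 8 sources, 9 destinations, and two-deep queue per source-destination pair, but it collapses the four pipeline cycles into a single step, and the oldest-first arbitration is borrowed from the age-based scheme described for the first Niagara rather than confirmed for Niagara II. The class and method names are hypothetical.

```python
from collections import deque


class Crossbar:
    """8 x 9 crossbar with a two-deep queue per source-destination pair."""

    def __init__(self, sources=8, dests=9, depth=2):
        self.sources, self.dests, self.depth = sources, dests, depth
        self.queues = {(s, d): deque()
                       for s in range(sources) for d in range(dests)}
        self.clock = 0

    def request(self, src, dst, payload):
        """Enqueue a transfer; returns False if that pair's queue is full."""
        q = self.queues[(src, dst)]
        if len(q) >= self.depth:
            return False
        q.append((self.clock, payload))
        return True

    def step(self):
        """One arbitration round: each destination grants its oldest request."""
        delivered = []
        for d in range(self.dests):
            heads = [(self.queues[(s, d)][0][0], s)
                     for s in range(self.sources) if self.queues[(s, d)]]
            if heads:
                _, s = min(heads)  # oldest first; ties go to the lowest source
                _, payload = self.queues[(s, d)].popleft()
                delivered.append((s, d, payload))
        self.clock += 1
        return delivered


xbar = Crossbar()
accepted = [xbar.request(0, 3, "a"), xbar.request(0, 3, "b"),
            xbar.request(0, 3, "c"), xbar.request(1, 3, "d"),
            xbar.request(2, 5, "e")]
first = xbar.step()
second = xbar.step()
third = xbar.step()
print(accepted, first, second, third)
```

Each destination port delivers at most one transfer per step, so requests from different cores to the same L2 bank serialize while requests to different banks proceed in parallel.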


Sun Niagara II 8 x 9 Crossbar

NAWATHE et al.: IMPLEMENTATION OF AN 8-CORE, 64-THREAD, POWER-EFFICIENT SPARC SERVER ON A CHIP 9

Fig. 9. L2 cache row redundancy scheme.

2-cycle latency. Addresses can be hashed to distribute accesses across different sets in case of hot cache sets caused by reference conflicts. All arrays are protected by single error correction, double error detection ECC, and parity. Data from different ways and different words is interleaved to improve soft error rates.

The L2 cache used a unique row-redundancy scheme. It is implemented at the 32 kB level and is illustrated in Fig. 9. Spare rows for one array are located in the adjacent array as opposed to the same array. In other words, spare rows for the top array are located in the bottom array and vice versa. When redundancy is enabled, the incoming address is compared with the address of the defective row and if it matches, the adjacent array (which is normally not enabled) is enabled to read from or write into the spare row. Using this kind of scheme enables a large (~30%) reduction in X-decoder area. The area reduction is achieved because the multiplexing required in the X-decoder to bypass the defective row/rows in the traditional row redundancy scheme is no longer needed in this scheme.
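The address-compare step of this row-redundancy scheme can be sketched in a few lines of Python. This is only a behavioural illustration under stated assumptions: the class names `Array` and `ArrayPair`, the single defective row per array, and the dictionary-based storage are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Array:
    """One array: its normal rows, plus spare rows that repair the *adjacent* array."""
    rows: dict = field(default_factory=dict)        # row address -> data
    spare_rows: dict = field(default_factory=dict)  # neighbour's defective row -> data


class ArrayPair:
    """Top/bottom array pair where each array hosts the other's spare rows."""

    def __init__(self):
        self.arrays = {"top": Array(), "bottom": Array()}
        # Address of the defective row in each array; None means redundancy is off.
        self.defective_row = {"top": None, "bottom": None}

    def _storage(self, which, row):
        """Compare the incoming address with the defective row and pick the storage."""
        other = "bottom" if which == "top" else "top"
        if self.defective_row[which] is not None and row == self.defective_row[which]:
            # Match: enable the adjacent array and use its spare row.
            return self.arrays[other].spare_rows
        return self.arrays[which].rows

    def write(self, which, row, data):
        self._storage(which, row)[row] = data

    def read(self, which, row):
        return self._storage(which, row).get(row)
```

Because the repair is a redirect to the neighbouring array rather than a shift inside the same array, the sketch needs no bypass logic in the normal row path, which mirrors the reason the text gives for the smaller X-decoder.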

N-well power for the Primary and L2 cache memory cells is separated out as a test hook. This allows weakening of the pMOS loads of the SRAM bit cells by raising their threshold voltage, thus enabling screening cells with marginal static noise margin. This significantly reduces defective parts per million (DPPM) and improves reliability.

Fig. 10. Crossbar.

Fig. 10 shows the Niagara2 Crossbar (CCX). CCX serves as a high bandwidth interface between the eight SPARC Cores, shown on top, and the eight L2 cache banks, and the non-cacheable unit (NCU) shown at the bottom. CCX consists of two blocks: PCX and CPX. PCX (“Processor-to-Cache-Transfer”) is an 8-input 9-output multiplexer (mux). It transfers data from the eight SPARC cores to the eight L2 cache banks and the NCU. Likewise, CPX (“Cache-to-Processor Transfer”) is a 9-input 8-output mux, and it transfers data in the reverse direction. The PCX and CPX combined provide a Read/Write bandwidth of 270 GB/s. All crossbar data transfer requests are processed using a four-stage pipeline. The pipeline stages are: Request, Arbitration, Selection, and Transmission. As can be seen from the figure, there are possible source–destination pairs for each data transfer request. There is a two-deep queue for each source–destination pair to hold data transfer requests for that pair.
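The per-pair queueing described above can be modelled with a short Python sketch of the PCX direction only. The port counts (8 sources, 9 destinations) and the queue depth of two come from the text; the class name `PcxModel` and the fixed-priority arbitration (lowest source index wins) are assumptions made for illustration, since the excerpt does not state the arbitration policy.

```python
from collections import deque

NUM_SOURCES = 8   # SPARC cores feeding the PCX
NUM_DESTS = 9     # eight L2 cache banks plus the NCU
QUEUE_DEPTH = 2   # two-deep queue per source-destination pair


class PcxModel:
    def __init__(self):
        # One small queue for every (source, destination) pair.
        self.queues = {
            (s, d): deque() for s in range(NUM_SOURCES) for d in range(NUM_DESTS)
        }

    def request(self, src, dst, payload):
        """Request stage: enqueue, or return False if this pair's queue is full."""
        q = self.queues[(src, dst)]
        if len(q) >= QUEUE_DEPTH:
            return False
        q.append(payload)
        return True

    def arbitrate(self):
        """Arbitration stage: grant at most one request per destination.

        Lowest source index wins here; that policy is an assumption.
        """
        grants = {}
        for d in range(NUM_DESTS):
            for s in range(NUM_SOURCES):
                q = self.queues[(s, d)]
                if q:
                    grants[d] = (s, q.popleft())
                    break
        return grants
```

With these port counts the model allocates 8 x 9 = 72 queues for the PCX; a third back-to-back request from the same core to the same bank is refused until arbitration drains an entry.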

IV. CLOCKING

Niagara2 contains a mix of many clocking styles (synchronous, mesochronous and asynchronous) and hence a large number of clock domains. Managing all these clock domains and domain crossings between them was one of the biggest challenges the design team faced. A subset of synchronous methodology, ratioed synchronous clocking (RSC) is used extensively. The concept works well for functional mode while being equally applicable to at-speed test of the core using the SerDes interfaces.

A. Clock Sources and Distribution

An on-chip phase-locked loop (PLL) uses a fractional divider [8], [9] to generate Ratioed Synchronous Clocks with support for a wide range of integer and fractional divide ratios. The distribution of these clocks uses a combination of H-trees and grids. This ensures they meet tight clock skew budgets while keeping power consumption under control. Clock Tree Synthesis is used for routing the asynchronous clocks. Asynchronous clock domain crossings are handled using FIFOs and meta-stability hardened flip-flops. All clock headers are designed to support clock gating to save clock power.

Fig. 11 shows the block diagram of the PLL. Its architecture is similar to the one described in [8]. It uses a loop filter capacitor referenced to a regulated 1.1 V supply (VREG). VREG is generated by a voltage regulator from the 1.5 V supply coming

Authorized licensed use limited to: Univ of Calif Berkeley. Downloaded on September 18, 2009 at 17:03 from IEEE Xplore. Restrictions apply.


Every cross of blue and purple is a pass gate with a unique control signal.

72 control signals (if distributed unencoded).
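The 72-signal count is one select line per pass gate. A minimal sketch comparing that with a binary-encoded select per output, which is assumed here to be the alternative the slide hints at (the function name `crossbar_control_signals` is hypothetical):

```python
def crossbar_control_signals(inputs, outputs):
    """Return (unencoded, encoded) control-signal counts for an inputs x outputs mux.

    unencoded: one select wire per pass gate (one per input/output crossing).
    encoded:   a binary select field per output, decoded locally at the mux.
               This ignores any extra "no input selected" encoding (assumption).
    """
    unencoded = inputs * outputs
    bits_per_output = (inputs - 1).bit_length()  # ceil(log2(inputs)) for inputs >= 2
    encoded = outputs * bits_per_output
    return unencoded, encoded


pcx = crossbar_control_signals(inputs=8, outputs=9)  # processor-to-cache direction
cpx = crossbar_control_signals(inputs=9, outputs=8)  # cache-to-processor direction
print(pcx, cpx)
```

Encoding saves wires across the floorplan at the cost of a decoder next to every output mux.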



Sun Niagara II Crossbar Notes

Crossbar defines floorplan: all port devices should be equidistant to the crossbar.

Uniform latency between all port pairs.

Did not scale up for 16-core Rainbow Falls. Rainbow Falls keeps the 8 x 9 crossbar, and shares each CPU-side port with two cores.

Low latency: 4 cycles (less than 3 ns).

Design alternatives to crossbar?
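The "less than 3 ns" figure can be checked with a few lines of arithmetic, assuming the crossbar pipeline runs at the 1.4 GHz core clock quoted in the paper excerpts (that clock assignment is an assumption, not stated on the slide):

```python
CLOCK_HZ = 1.4e9        # core clock from the paper; assumed to clock the crossbar
PIPELINE_CYCLES = 4     # Request, Arbitration, Selection, Transmission

cycle_ns = 1e9 / CLOCK_HZ
latency_ns = PIPELINE_CYCLES * cycle_ns
print(f"cycle = {cycle_ns:.3f} ns, crossbar latency = {latency_ns:.2f} ns")
```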


Sun Niagara II Energy Facts

8 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 43, NO. 1, JANUARY 2008

Fig. 7. Various threads occupying different pipeline stages.

The load-use latency is three cycles. There is a six-cycle latency for dependent FP operations. The ICACHE is shared between all eight threads. Each thread has its own instruction buffer. The Fetch stage/unit fetches up to four instructions per cycle and puts them into the thread’s instruction buffer. Threads can be in “Wait” (as opposed to “Ready”) state due to an ITLB miss, ICACHE miss, or their Instruction Buffer being full. The “Least-Recently-Fetched” algorithm is used to select one of “Ready” threads for which the next instruction will be fetched. Fig. 7 shows the Integer/Load/Store pipeline and illustrates how different threads can occupy different pipeline stages in a given cycle. In other words, threads are interleaved between pipeline stages with very few restrictions. The Load/Store and Floating Point units are shared between all eight threads. The eight threads within each SPARC core are divided into two thread groups (TGs) of four threads each. Once again, the threads could be in “Wait” states due to events such as a DCACHE miss, DTLB miss, or data dependency. The “Pick” stage tries to find one instruction from all the “Ready” threads (using the “Least-Recently-Picked” algorithm) from each of the two TGs to execute every cycle. Since each TG picks independently (w.r.t. the other TG), it can lead to hazards such as load instructions being picked from both TGs even though each SPC has only one load/store unit. These hazards are resolved in the “Decode” stage.
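A least-recently-picked selection among ready threads can be sketched as follows. This is a behavioural illustration only: the class name `LeastRecentlyPicked` and the list-based ordering are assumptions, and the paper does not describe how the hardware tracks pick history.

```python
class LeastRecentlyPicked:
    """Pick the ready thread that was picked longest ago (one instance per thread group)."""

    def __init__(self, thread_ids):
        # Front of the list = least recently picked thread.
        self.order = list(thread_ids)

    def pick(self, ready):
        """Return the chosen thread id, or None if no thread in the group is ready."""
        for tid in self.order:
            if tid in ready:
                # Move the winner to the back: it is now the most recently picked.
                self.order.remove(tid)
                self.order.append(tid)
                return tid
        return None


# Two thread groups of four threads each, picking independently every cycle.
tg0 = LeastRecentlyPicked([0, 1, 2, 3])
tg1 = LeastRecentlyPicked([4, 5, 6, 7])
picked = (tg0.pick({1, 3}), tg1.pick({4, 5, 6, 7}))
print(picked)
```

Because the two groups pick independently, both may select a load in the same cycle, which is the structural hazard the text says is resolved in the Decode stage.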

Niagara2’s Primary and L2 cache sizes are relatively small compared to some other processors. Even though this may cause higher cache miss rates, the miss latency is well hidden by the presence of other threads whose operands/data is available and hence can make good use of the “compute” time slots, thus minimizing wastage of “compute” resources. This factor explains why the optimum design point moved towards having higher thread counts and lower cache sizes. In effect, this can be thought of as devoting more transistors on chip to the intelligent “processing” function as opposed to the nonintelligent “data-storing” function.

Fig. 8. Key statistical highlights.

Performance measurements using several commercial applications and performance benchmarks (SpecJBB, SpecWeb, TPC-C, SpecIntRate, SpecFPRate, etc.) confirm that Niagara2 has achieved its goal of doubling the throughput performance and performance/watt as compared to UltraSPARC T1. Most of the gain comes from doubling the thread count and the number of execution units. Some of the gain comes from a higher operating frequency. Similarly, performance measurements using commonly used Floating Point benchmarks confirm that Niagara2’s Floating Point throughput performance is more than an order of magnitude higher compared to UltraSPARC T1. Niagara2 has eight Floating Point units (FPUs), each occupying only 1.3 mm², against only one FPU for UltraSPARC T1. Also, the Niagara2 FPUs are within the SPCs as compared to UltraSPARC T1 where the SPCs had to access the FPU through the Crossbar. Another factor that helps performance is the higher memory bandwidth on Niagara2.

C. Key Statistical Highlights

The table in Fig. 8 lists some key statistical highlights of Niagara2’s physical implementation. Niagara2 is built in Texas Instruments’ 65 nm, 11LM, Triple-Vt CMOS process. The chip has 503 million transistors on a 342 mm² die packaged in a flip-chip glass ceramic package with 1831 pins. It operates at 1.4 GHz @ 1.1 V and consumes 84 W.

III. MEMORY HIERARCHY, CACHES, AND CROSSBAR

Niagara2 has four memory controllers on chip, each controlling two FBDIMM channels. They are clocked by the DR (DRAM) clock, which nominally runs at 400 MHz corresponding to the FBDIMM SerDes link rate of 4.8 Gb/s. Up to eight DIMMs can be connected to each channel. Every transaction from each controller consists of 64 data bytes and ECC. Read transactions take two DR clock cycles, while Write transactions take four DR clock cycles. This yields a Read bandwidth of 51.2 GB/s and a Write bandwidth of 25.6 GB/s.
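The quoted bandwidths follow directly from the transaction size, the DR clock, and the cycles per transaction. A short check (constant and function names are illustrative only):

```python
DR_CLOCK_HZ = 400e6          # nominal DR (DRAM) clock
NUM_CONTROLLERS = 4          # memory controllers on chip
BYTES_PER_TRANSACTION = 64   # data bytes per transaction (ECC not counted)
READ_CYCLES = 2              # DR cycles per read transaction
WRITE_CYCLES = 4             # DR cycles per write transaction


def bandwidth_gb_per_s(cycles_per_transaction):
    """Aggregate bandwidth across all controllers, in GB/s (1 GB = 1e9 bytes)."""
    per_controller = BYTES_PER_TRANSACTION * DR_CLOCK_HZ / cycles_per_transaction
    return NUM_CONTROLLERS * per_controller / 1e9


read_bw = bandwidth_gb_per_s(READ_CYCLES)
write_bw = bandwidth_gb_per_s(WRITE_CYCLES)
print(f"read = {read_bw:.1f} GB/s, write = {write_bw:.1f} GB/s")
```

Each controller therefore sustains 12.8 GB/s on reads and 6.4 GB/s on writes; the chip totals are four times those values.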

Niagara2 has two levels of caches on chip. Each SPC has a 16 kB Primary Instruction cache (ICACHE), and an 8 kB Primary Data cache (DCACHE). The ICACHE is 8-way set-associative with a 32 B line size. The DCACHE is 4-way set-associative with a 16 B line size. The 4 MB shared Level2 (L2) cache is divided into eight banks of 512 kB each. The number of banks is doubled to support the doubling of thread count as compared to UltraSPARC T1. The L2 cache is 16-way set associative with a 64 B line size. Each bank can read up to 16 B per cycle with a


16 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 43, NO. 1, JANUARY 2008

Fig. 21. Power consumption.

Power EM spec. For example, for a typical layout of big buffers, the worst power EM violation reduced from of 1.57 with a conventional M3 power grid to of 0.93 with M3 post grid. Similarly, IR drop from M4 to devices improved from 89.1 mV to 55.8 mV.

An interactive script was used for insertion of DECAPs based on available empty space, and the placement was always assured to be DRC/LVS clean. The width of DECAPs used in standard cell blocks matched the width of the standard cells. The channel DECAPs were similar to DECAPs used in standard cell blocks, except that they had M4 and below embedded into the leaf DECAP cell to reduce data base size. About 700 nF of explicit DECAP was added on chip (this does not include implicit DECAP due to the metal power grid and the quiet cap).

VIII. POWER AND POWER MANAGEMENT

A. Power

Niagara2’s SOC design is optimized for performance/watt and enables reduction of total power consumption and power density at the chip and system level. Niagara2’s simplified pipeline and reduced speculation in instruction execution reduce wasted power. A Niagara2-based system is a lot more power efficient as compared to, for example, a system with eight single-core processors (on separate chips) each having their own I/O (DRAM, networking, and PCI-Express) interfaces. Such a system will have 8 times the I/O interfaces and hence will consume a lot more power in those interfaces. Also, extra power will be consumed in driving the off-chip multi-processor coherency fabric. In comparison, for Niagara2 there is only one set of I/O interfaces and the coherency between the eight processor cores is handled on chip by the crossbar, which consumes less than 1 W of power.

Niagara2’s total power consumption is 84 W @ 1.1 V and 1.4 GHz operation. The pie-chart in Fig. 21 shows the power consumed by the various blocks inside Niagara2. Almost a third of the total power is consumed by the eight SPARC cores. L2 cache Data, Tag and Buffer blocks together account for 20% of the total. SOC logic consumes 6% while I/Os consume 13%. Leakage is about 21% of the total power. Clocks to unused clusters are gated off to save dynamic power. Within units, clocks to groups of flops are separated into independent domains depending upon the functionality of the corresponding logic. Clocks to each domain can be turned off independently of one another when the related logic is not processing valid instructions. This saves dynamic power.
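The percentages above can be turned into approximate watts. In this sketch "almost a third" is approximated as exactly 1/3, which is an assumption; the remainder is whatever the text does not itemize.

```python
TOTAL_POWER_W = 84.0

# Shares quoted in the text; the SPARC-core share is an approximation (assumption).
shares = {
    "SPARC cores": 1 / 3,
    "L2 cache (data, tag, buffer)": 0.20,
    "SOC logic": 0.06,
    "I/O": 0.13,
    "Leakage": 0.21,
}

watts = {block: share * TOTAL_POWER_W for block, share in shares.items()}
unlisted_share = 1.0 - sum(shares.values())

for block, w in watts.items():
    print(f"{block:30s} {w:5.1f} W")
print(f"{'Not itemized in the text':30s} {unlisted_share * TOTAL_POWER_W:5.1f} W")
```

Under that approximation the itemized blocks cover roughly 93% of the 84 W total, so a few watts belong to blocks the excerpt does not list.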

B. Technique to Reduce Leakage Power

Niagara2 uses gate-bias (GBIAS) cells to reduce leakage power. GBIAS cells are footprint-compatible with the corresponding standard-Vt non-GBIAS versions. The only layout difference is that the GBIAS cell has an additional identification layer (GBIAS). All the transistors of any cell having this layer are fabricated with 10% longer channel length.

The table in Fig. 22 illustrates the reduction in leakage and corresponding increase in delay as the channel length was increased above minimum for three different gates. 10% larger channel length was chosen, resulting in, on an average, about 50% reduced leakage on a per-cell basis with about 14% impact on the cell delay. High-Vt (HVT) cells were considered as well for leakage reduction. We did not have an unconstrained choice of Vt for the HVT cells. For HVT cells using the available HVT transistors, the delay impact was much larger. As a result, approximate calculations lead to the conclusion that using HVT cells would have enabled substitution of only about one-third of the number of gates as compared to using GBIAS gates with 10% larger channel length. Hence, the GBIAS option was chosen. This enabled about 77% additional leakage saving as compared to using HVT cells. Cells in non-timing-critical paths could be replaced by their GBIAS versions as long as this did not result in any new noise, slew, or timing violations. Because of footprint compatibility, the swapping was easily done at the end of the design cycle without any timing, noise, or area impact. This reduced leakage power by 10%–15%.

The swapping algorithm works as follows. The project STA tool and a set of scripts determine which cells can be swapped without creating new timing, noise, and slew violations. First, all cells that have timing margins larger than a set threshold are swapped to their GBIAS equivalent. The STA tool then computes the timing graph and swaps back all GBIAS cells that are on paths with less than a predefined positive slack. Then, a script evaluates all the receivers connected to the outputs of the GBIAS cells that went through timing qualification, and determines if sufficient noise margin exists. The script calculates the new noise values by increasing the latest noise snapshot by a percentage that was derived from simulations and analysis. Once a list of cells that can be safely swapped is built, a custom program performs the swaps in the actual layout database. A footprint-compatibility check for identical pins and metal shapes is built into the program to maintain LVS and DRC cleanliness.
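The three filtering passes of this flow can be sketched as below. Everything here is hypothetical scaffolding: the `Cell` fields stand in for values that the real STA tool and noise scripts would compute, and the thresholds are parameters, not numbers from the paper.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    name: str
    timing_margin_ps: float     # margin before any swap (from STA)
    slack_after_swap_ps: float  # path slack recomputed with the slower GBIAS cell
    receiver_noise_mv: float    # latest noise snapshot at the cell's receivers
    noise_limit_mv: float       # noise the receivers can tolerate


def select_gbias_swaps(cells, margin_threshold_ps, min_slack_ps, noise_uplift):
    """Return the names of cells that can safely be swapped to GBIAS versions."""
    # Pass 1: tentatively swap every cell whose timing margin exceeds the threshold.
    swapped = [c for c in cells if c.timing_margin_ps > margin_threshold_ps]

    # Pass 2: swap back cells on paths with less than the required positive slack.
    timing_ok = [c for c in swapped if c.slack_after_swap_ps >= min_slack_ps]

    # Pass 3: scale the noise snapshot by a fixed percentage and check the margin.
    safe = [
        c for c in timing_ok
        if c.receiver_noise_mv * (1.0 + noise_uplift) <= c.noise_limit_mv
    ]
    return [c.name for c in safe]


cells = [
    Cell("u_ok", 100.0, 50.0, 100.0, 200.0),
    Cell("u_tight_margin", 10.0, 50.0, 100.0, 200.0),
    Cell("u_low_slack", 100.0, 5.0, 100.0, 200.0),
    Cell("u_noisy", 100.0, 50.0, 190.0, 200.0),
]
print(select_gbias_swaps(cells, margin_threshold_ps=40.0, min_slack_ps=20.0, noise_uplift=0.10))
```

The final list would then be handed to the program that edits the layout database, where the footprint-compatibility check described in the text takes over.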

C. Power Management

Niagara2 has several power management features to manage power consumption. Software can dynamically turn threads on or off as required. Niagara2 has a Power Throttling mode which provides the ability to control instruction issue rates to manage power. The graph in Fig. 23 shows that this can reduce dynamic power by up to 30% depending upon the level of workload.



18 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 43, NO. 1, JANUARY 2008

Fig. 24. Clock–data recovery of a single FBDIMM 1.0 receive channel within an FSR cluster.

Fig. 25. Clock-data recovery of a single FBDIMM 1.0 receive channel (expanded) for DTM.

i) the link training pattern (TS0) can be used to identify the start of a clock-bundle pair, and hence control the generation of the slow byte clock with respect to TS0 start times;

ii) the queue in the MCU which performs aggregation of the data to construct frames needs to wait for the slowest bundle, and therefore the slowest link.

Fig. 25 illustrates how the FBDIMM (and similarly PCI-Express) channels may be manipulated to generate clocks within an acceptable tolerance for synchronous crossing into the controller. The core clock speed is allowed to vary independently such that the CMP:DR and CMP:I/O clock ratios are always integers in DTM. In this manner, the processor core clock may be swept from 8 sys_clk to 15 sys_clk for functional at-speed testing.

Fig. 26 shows the frequency versus shmoo plot at 95 °C from first silicon. As can be seen, the chip operates at 1.4 GHz at 1.1 V with sufficient margin.

Fig. 26. versus frequency shmoo plot.



Intel Larrabee


Intel Larrabee Market Concept

It’s easier to program a “normal” CPU than a GPU - so let’s make a GPU out of Intel CPU cores.

But mainstream out-of-order cores provide insufficient FLOPS for graphics work.

CPUs can also use multi-threading to gain parallelism. Niagara is a multi-core general purpose microprocessor [Kongetira et al. 2005] featuring eight in-order cores, each capable of executing four simultaneous threads, and a shared cache. But given its focus on commercial server workloads, Niagara lacks architectural elements critical for visual computing, such as SIMD floating-point execution, scatter-gather, or fixed function texture support.

3. Larrabee Hardware Architecture

Figure 1 above shows a block diagram of the basic Larrabee architecture. Larrabee is designed around multiple instantiations of an in-order CPU core that is augmented with a wide vector processor (VPU). Cores communicate through a high-bandwidth interconnect network with some fixed function logic, memory I/O interfaces, and other necessary I/O logic, depending on the exact application. For example, an implementation of Larrabee as a stand-alone GPU would typically include a PCIe bus.

The data in Table 1 motivates Larrabee's use of in-order cores with wide VPUs. The middle column shows the peak performance of a modern out-of-order CPU, the Intel® Core™2 Duo processor. The right-hand column shows a test CPU design based on the Pentium® processor, which was introduced in 1992 and used dual-issue in-order instruction execution [Alpert 1993]. The Pentium processor core was modified to support four threads and a 16-wide VPU. The final two rows specify the number of non-vector instructions that can be issued per clock by one CPU and the total number of vector operations that can be issued per clock. The two configurations use roughly the same area and power.

# CPU cores:          2 out-of-order   10 in-order
Instruction issue:    4 per clock      2 per clock
VPU per core:         4-wide SSE       16-wide
L2 cache size:        4 MB             4 MB
Single-stream:        4 per clock      2 per clock
Vector throughput:    8 per clock      160 per clock

Table 1: Out-of-order vs. in-order CPU comparison: designing the processor for increased throughput can result in ½ the peak single-stream performance, but 20x the peak vector throughput with roughly the same area and power. This difference is 40x in FLOPS, since the wide VPU supports fused multiply-add but SSE doesn't. These in-order cores are not Larrabee, but are similar.
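The ratios in the Table 1 caption can be reproduced from the table rows. The sketch below takes the two configurations as 2 out-of-order cores with a 4-wide SSE unit and 10 in-order cores with a 16-wide VPU, and counts a fused multiply-add as two floating-point operations; the dictionary layout and function names are illustrative only.

```python
configs = {
    "out-of-order": {"cores": 2, "issue_per_clock": 4, "vpu_width": 4, "fused_mul_add": False},
    "in-order": {"cores": 10, "issue_per_clock": 2, "vpu_width": 16, "fused_mul_add": True},
}


def vector_ops_per_clock(cfg):
    """Total vector lanes that can issue per clock across all cores."""
    return cfg["cores"] * cfg["vpu_width"]


def flops_per_clock(cfg):
    """Peak FLOPs per clock; a fused multiply-add counts as two operations."""
    return vector_ops_per_clock(cfg) * (2 if cfg["fused_mul_add"] else 1)


ooo, ino = configs["out-of-order"], configs["in-order"]
single_stream_ratio = ino["issue_per_clock"] / ooo["issue_per_clock"]
vector_ratio = vector_ops_per_clock(ino) / vector_ops_per_clock(ooo)
flops_ratio = flops_per_clock(ino) / flops_per_clock(ooo)
print(single_stream_ratio, vector_ratio, flops_ratio)
```

The output matches the caption: half the single-stream issue rate, 20x the vector throughput, and 40x the peak FLOPS.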

The test design in Table 1 is not identical to Larrabee. To provide a more direct comparison, the in-order core test design uses the same process and clock rate as the out-of-order cores and includes no fixed function graphics logic. This comparison motivates design decisions for Larrabee since it shows that a wide VPU with a simple in-order core allows CPUs to reach a dramatically higher computational density for parallel applications.

Sections 3.1 to 3.5 below describe the key features of the Larrabee architecture: the CPU core, the scalar unit and cache control instructions, the vector processor, the interprocessor ring network, and the choices for what is implemented in fixed function logic.

!"2#$%&&%'((#34&(#%5*#3%-.(6#

J/4+2,%Z%$1*D$%'%$&1,-'./&%*>%'%$/(4),%L'22'@,,%!"#%&*2,?%5)+$%/.$% &*((,&./*(% .*% .1,% *(03/,% /(.,2&*((,&.% (,.D*2E% '(3% .1,% &*2,U$%)*&')%$+@$,.%*>%.1,%L:%&'&1,6%T1,%/($.2+&./*(%3,&*3,2%$+55*2.$%.1,%$.'(3'23%",(./+-%52*&,$$*2%Aa[%/($.2+&./*(%$,.?%D/.1%.1,%'33/./*(%*>%(,D%/($.2+&./*($%.1'.%'2,%3,$&2/@,3%/(%F,&./*($%Z6:%'(3%Z6Z6%T*%

$/-5)/>P% .1,% 3,$/4(% .1,% $&')'2% '(3% C,&.*2% +(/.$% +$,% $,5'2'.,%2,4/$.,2%$,.$6%I'.'%.2'($>,22,3%@,.D,,(%.1,-%/$%D2/..,(%.*%-,-*2P%'(3%.1,(%2,'3%@'&E%/(%>2*-%.1,%LK%&'&1,6%

L'22'@,,U$% LK% &'&1,% '))*D$% )*D0)'.,(&P% '&&,$$,$% .*% &'&1,%-,-*2P%/(.*%.1,%$&')'2%'(3%C,&.*2%+(/.$6%T*4,.1,2%D/.1%L'22'@,,U$%)*'30*5% N"#% /($.2+&./*($?% .1/$% -,'($% .1'.% .1,% LK% &'&1,% &'(% @,%.2,'.,3%$*-,D1'.%)/E,%'(%,A.,(3,3%2,4/$.,2%>/),6%T1/$%$/4(/>/&'(.)P%/-52*C,$%.1,%5,2>*2-'(&,%*>%-'(P%')4*2/.1-$?%,$5,&/'))P%D/.1%.1,%&'&1,% &*(.2*)% /($.2+&./*($% 3,$&2/@,3% F,&./*(% Z6:6% T1,% $/(4),0.12,'3,3% ",(./+-% 52*&,$$*2% 52*C/3,3% '(% a9B% G&'&1,% '(3% a9B%I&'&1,6%b,%$5,&/>P%'%Z:9B%G&'&1,%'(3%Z:9B%I&'&1,%.*%$+55*2.%>*+2%,A,&+./*(%.12,'3$%5,2%!"#%&*2,6%

Figure 3: Larrabee CPU core and associated system blocks: the CPU is derived from the Pentium processor in-order design, plus 64-bit instructions, multi-threading and a wide VPU. Each core has fast access to its 256KB local subset of a coherent 2nd level cache. L1 cache sizes are 32KB for Icache and 32KB for Dcache. Ring network accesses pass through the L2 cache for coherency.

Larrabee's global 2nd level (L2) cache is divided into separate local subsets, one per CPU core. Each CPU has a fast direct access path to its own local subset of the L2 cache. Data read by a CPU core is stored in its L2 cache subset and can be accessed quickly, in parallel with other CPUs accessing their own local L2 cache subsets. Data written by a CPU core is stored in its own L2 cache subset and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data, as described in Section 3.4. We specify 256KB for each L2 cache subset. This supports large tile sizes for software rendering, as described in Section 4.1.

3.2 Scalar Unit and Cache Control Instructions

Larrabee's scalar pipeline is derived from the dual-issue Pentium processor, which uses a short, inexpensive execution pipeline. Larrabee provides modern additions such as multi-threading, 64-bit extensions, and sophisticated prefetching. The cores support the full Pentium processor x86 instruction set so they can run existing code including operating system kernels and applications. Larrabee adds new scalar instructions such as bit count and bit scan, which finds the next set bit within a register.
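To make the two scalar operations concrete, here is a minimal Python sketch of what bit count and bit scan compute. The text does not give the exact semantics of Larrabee's instructions (scan direction, whether the start position is inclusive, register width), so those details, along with the function names, are assumptions chosen for illustration.

```python
def bit_count(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")

def bit_scan_next(x: int, start: int = 0, width: int = 64) -> int:
    """Index of the first set bit at or above `start`, or -1 if none.

    Assumption: scanning goes from low bits to high bits in a 64-bit register.
    """
    for i in range(start, width):
        if (x >> i) & 1:
            return i
    return -1

mask = 0b1010_0100                # bits 2, 5 and 7 are set
print(bit_count(mask))            # 3
print(bit_scan_next(mask))        # 2
print(bit_scan_next(mask, 3))     # 5
```

Calling `bit_scan_next` repeatedly with `start` set to one past the previous result walks through the set bits of a mask, which is one plausible use for such an instruction.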

Larrabee also adds new instructions and instruction modes for explicit cache control. Examples include instructions to prefetch data into the L1 or L2 caches and instruction modes to reduce the priority of a cache line. For example, streaming data typically sweeps existing data out of a cache. Larrabee is able to mark each streaming cache line for early eviction after it is accessed. These cache control instructions also allow the L2 cache to be used similarly to a scratchpad memory, while remaining fully coherent.
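The effect of marking streaming lines for early eviction can be illustrated with a toy model. The sketch below assumes a small fully associative cache with LRU replacement, where a streaming access makes its own line the next eviction victim. It is not a model of Larrabee's actual cache organization or replacement policy; `ToyCache` and `surviving_hot_lines` are hypothetical names.

```python
from collections import OrderedDict

class ToyCache:
    """Fully associative LRU cache; the first key is the next victim."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr: int, streaming: bool = False) -> None:
        if addr in self.lines:
            self.lines.pop(addr)
        elif len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)      # evict the LRU line
        self.lines[addr] = True                 # insert as most recently used
        if streaming:
            # Early-eviction hint: make this line the next victim.
            self.lines.move_to_end(addr, last=False)

def surviving_hot_lines(streaming_hint: bool) -> int:
    """Count reused lines left in the cache after one streaming pass."""
    cache = ToyCache(capacity=4)
    hot = [0, 1, 2]
    for a in hot:
        cache.access(a)
    for a in range(100, 110):                   # one pass over streamed data
        cache.access(a, streaming=streaming_hint)
    return sum(a in cache.lines for a in hot)

print(surviving_hot_lines(False))  # 0
print(surviving_hot_lines(True))   # 3
```

Without the hint, the streamed addresses sweep the three reused lines out of the cache. With it, each streamed line displaces only the previous streamed line.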

Within a single core, synchronizing access to shared memory by multiple threads is inexpensive. The threads on a single core share the same local L1 cache, so a single atomic semaphore read within the L1 cache is sufficient. Synchronizing access between

Larrabee: A Many-Core x86 Architecture for Visual Computing • 18:3

ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, Publication date: August 2008.

Solution: Make a many-core GPU out of an in-order implementation of IA-32 that includes a 16-wide vector unit.



Intel Larrabee: Memory BW.


Larrabee Core

Each core has a low-latency path to a 256 KB subset of the coherent L2 cache.

Stores from a core are placed in its L2 slice. Tuned code works out of the slice.

Accesses to the L2 slices of other cores go over the ring network (higher latency).


Intel Larrabee: Ring Network

Bi-directional ring network. 512 wires each way. For large chips, multiple linked rings.

Ring also provides path to off-chip DRAM and to special-purpose accelerators.

Contrast with Niagara II: In Larrabee, access to different L2 banks has non-uniform latency. Acceptable, given Larrabee’s graphics focus.
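The non-uniform latency follows from the ring topology: the distance to another core's L2 slice depends on how many ring stops lie between the two. A minimal sketch, assuming a single bi-directional ring and counting one hop per stop (the stop count of 16 is hypothetical, and linked rings are not modeled):

```python
def ring_hops(src: int, dst: int, n_stops: int) -> int:
    """Fewest hops between two stops on a bi-directional ring."""
    d = (dst - src) % n_stops
    return min(d, n_stops - d)

N = 16  # hypothetical number of ring stops
hops = [ring_hops(0, k, N) for k in range(N)]
print(hops)        # [0, 1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1]
print(max(hops))   # 8
```

Because traffic can travel in either direction, the worst case is half the ring, and a core's own slice (0 hops) is the cheapest to reach.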

ACM Reference Format: Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P. 2008. Larrabee: A Many-Core x86 Architecture for Visual Computing. ACM Trans. Graph. 27, 3, Article 18 (August 2008), 15 pages. DOI = 10.1145/1360612.1360617 http://doi.acm.org/10.1145/1360612.1360617.

Copyright Notice: Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701, fax +1 (212) 869-0481, or permissions@acm.org. © 2008 ACM 0730-0301/2008/03-ART18 $5.00 DOI 10.1145/1360612.1360617 http://doi.acm.org/10.1145/1360612.1360617

Larrabee: A Many-Core x86 Architecture for Visual Computing

Larry Seiler¹, Doug Carmean¹, Eric Sprangle¹, Tom Forsyth¹, Michael Abrash², Pradeep Dubey¹, Stephen Junkins¹, Adam Lake¹, Jeremy Sugerman³, Robert Cavin¹, Roger Espasa¹, Ed Grochowski¹, Toni Juan¹, and Pat Hanrahan³

¹ Intel® Corporation: larry.seiler, doug.carmean, eric.sprangle, tom.forsyth, pradeep.dubey, stephen.junkins, adam.t.lake, robert.d.cavin, roger.espasa, edward.grochowski & toni.juan @intel.com

² RAD Game Tools: mikea@radgametools.com

³ Stanford University: yoel & hanrahan @cs.stanford.edu

Abstract

This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a many-core programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2nd level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee's potential for a broad range of parallel computation.

CR Categories: I.3.1 [Computer Graphics]: Hardware Architecture--Graphics Processors, Parallel Processing; I.3.3 [Computer Graphics]: Picture/Image Generation--Display Algorithms; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism--Color, shading, shadowing, and texture

Keywords: graphics architecture, many-core computing, real-time graphics, software rendering, throughput computing, visual computing, parallel processing, SIMD, GPGPU.

1. Introduction

Modern GPUs are increasingly programmable in order to support advanced graphics algorithms and other parallel applications. However, general purpose programmability of the graphics pipeline is restricted by limitations on the memory model and by fixed function blocks that schedule the parallel threads of execution. For example, pixel processing order is controlled by the rasterization logic and other dedicated scheduling logic.

This paper describes a highly parallel architecture that makes the rendering pipeline completely programmable. The Larrabee architecture is based on in-order CPU cores that run an extended version of the x86 instruction set, including wide vector processing operations and some specialized scalar instructions. Figure 1 shows a schematic illustration of the architecture. The cores each access their own subset of a coherent L2 cache to provide high-bandwidth L2 cache access from each core and to simplify data sharing and synchronization.

Larrabee is more flexible than current GPUs. Its CPU-like x86-based architecture supports subroutines and page faulting. Some operations that GPUs traditionally perform with fixed function logic, such as rasterization and post-shader blending, are performed entirely in software in Larrabee. Like GPUs, Larrabee uses fixed function logic for texture filtering, but the cores assist the fixed function logic, e.g. by supporting page faults.

Figure 1: Schematic of the Larrabee many-core architecture: The number of CPU cores and the number and type of co-processors and I/O blocks are implementation-dependent, as are the positions of the CPU and non-CPU blocks on the chip.

This paper also describes a software rendering pipeline that runs efficiently on this architecture. It uses binning to increase parallelism and reduce memory bandwidth, while avoiding the problems of some previous tile-based architectures. Implementing the renderer in software allows existing features to be optimized based on workload and allows new features to be added. For example, programmable blending and order-independent transparency fit easily into the Larrabee software pipeline.

Finally, this paper describes a programming model that supports more general parallel applications, such as image processing, physical simulation, and medical & financial analytics. Larrabee's support for irregular data structures and its scatter-gather capability make it suitable for these throughput applications as demonstrated by our scalability and performance analysis.


Intel Larrabee

Ring network


Core + 256KB L2 Cache slice


Texture filtering units (for graphics)


DRAM and I/O Interfaces


Intel Larrabee: Performance

shaders and may limit performance for other shaders. Section 5.4 provides breakdowns of the total processing time devoted to post-shader blending and parameter interpolation.

One remaining issue is texture co-processor accesses, which can have hundreds of clocks of latency. This is hidden by computing multiple qquads on each hardware thread. Each qquad's shader is called a fiber. The different fibers on a thread co-operatively switch between themselves without any OS intervention. A fiber switch is performed after each texture read command, and processing passes to the other fibers running on the thread. Fibers execute in a circular queue. The number of fibers is chosen so that by the time control flows back to a fiber, its texture access has had time to execute and the results are ready for processing.
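The fiber mechanism resembles cooperative round-robin scheduling. The sketch below mimics it with Python generators: each fiber yields after issuing a texture read, and a circular queue hands control to the next fiber. This is only an analogy under stated assumptions, not Larrabee's implementation; the names `fiber` and `run_thread`, and the log messages, are hypothetical.

```python
from collections import deque

def fiber(name, n_texture_reads, log):
    """One qquad's shader: yields after issuing each texture read."""
    for i in range(n_texture_reads):
        log.append(f"{name}: issue texture read {i}")
        yield                      # cooperative switch; no OS involvement
        log.append(f"{name}: use texture result {i}")

def run_thread(fibers):
    """Round-robin the fibers of one hardware thread in a circular queue."""
    queue = deque(fibers)
    while queue:
        f = queue.popleft()
        try:
            next(f)                # run until the next texture read
            queue.append(f)        # rejoin the back of the queue
        except StopIteration:
            pass                   # fiber finished

log = []
run_thread([fiber(f"fiber{k}", 2, log) for k in range(3)])
print("\n".join(log))
```

In the log, the other two fibers issue their reads between `fiber0` issuing a read and using its result. With enough fibers, that interval is what covers the texture latency.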

5. Renderer Performance Studies

This section describes performance and scaling studies for the Larrabee software renderer described in Section 4. Studies include scalability experiments for software rendering, load balancing studies, bandwidth comparisons of binning to immediate mode renderers, performance on several game workloads, and charts showing how the total processing time is divided among different parts of the software renderer.

5.1 Game Workloads and Simulation Method

Performance tests use workloads derived from three well-known games: Gears of War*, F.E.A.R.*, and Half Life* 2 Episode 2. Table 2 contains information about the tested frames from each game. Since we are scaling out to large numbers of cores we use a high-end screen size with multisampling when supported.

Half Life 2 ep. 2      F.E.A.R.                Gears of War
1600x1200 4 sample     1600x1200 4 sample      1600x1200 1 sample
25 frames (1 in 30)    25 frames (1 in 100)    25 frames (1 in 250)
Valve Corp.            Monolith Productions    Epic Games Inc

Table 2: Workload summary for the three tested games: the frames are widely separated to catch different scene characteristics as the games progress.

We captured the frames by intercepting the DirectX 9 command stream being sent to a conventional graphics card while the game was played at a normal speed, along with the contents of textures and surfaces at the start of the frame. We tested them through a functional model to ensure the algorithms were correct and that the right images were produced. Next, we estimated the cost of each section of code in the functional model, being aggressively pessimistic, and built a rough profile of each frame. We wrote assembly code for the highest-cost sections, ran it through cycle-accurate simulators, fed the clock cycle results back into the functional model, and re-ran the traces. This iterative cycle of refinement was repeated until 90% of the clock cycles executed during a frame had been run through the simulators, giving the overall profiles a high degree of confidence. Texture unit throughput, cache performance and memory bandwidth limitations were all included in the various simulations.

In these studies we measure workload performance in terms of Larrabee units. A Larrabee unit is defined to be one Larrabee core running at 1 GHz. The clock rate is chosen solely for ease of calculation, since real devices would ship with multiple cores and a variety of clock rates. Using Larrabee units allows us to compare performance of Larrabee implementations with different numbers of cores running at different clock rates. A single Larrabee unit corresponds to a theoretical peak throughput of 32 GFLOPS, counting fused multiply-add as two operations.

* Other names & brands may be claimed as the property of others.
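The 32 GFLOPS figure is consistent with a 16-wide VPU issuing one fused multiply-add per lane per clock at 1 GHz. A minimal sketch of that arithmetic, assuming exactly one vector FMA issue per clock (the 16-core, 1.5 GHz configuration is hypothetical, as are the function names):

```python
VPU_LANES = 16          # 16-wide VPU, as stated earlier in the lecture
FLOPS_PER_FMA = 2       # fused multiply-add counted as two operations

def peak_gflops(cores: int, clock_ghz: float) -> float:
    """Theoretical peak, assuming one vector FMA issued per core per clock."""
    return cores * clock_ghz * VPU_LANES * FLOPS_PER_FMA

def larrabee_units(cores: int, clock_ghz: float) -> float:
    """One Larrabee unit is one core running at 1 GHz."""
    return cores * clock_ghz

print(peak_gflops(1, 1.0))        # 32.0
print(larrabee_units(16, 1.5))    # 24.0
print(larrabee_units(24, 1.0))    # 24.0
```

The last two lines show why the unit is convenient: a 16-core part at 1.5 GHz and a 24-core part at 1 GHz count as the same number of Larrabee units.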

5.2 Scalability Studies

The Larrabee software renderer is designed to allow efficient load balancing over a large number of cores. Figure 9 shows the results of testing load balancing for six configurations, each of which scales the memory bandwidth and texture filtering speed relative to the number of cores. This test uses the simulation methodology described in Section 5.1 in combination with a time-based performance model that tracks dependencies and scheduling. This tool is used for multiple graphics products within Intel.

Figure 9: Relative Scaling as a Function of Core Count: This shows configurations with 8 to 48 cores, with each game's results plotted relative to the performance of an 8-core system.

@"%'&%!=+-!'0/'-"%'+0#$'7#+#(1,(:'!,)=+#-,0('!"09'#'/#++0//'0/'a^'-0' PR^' /&0)' #' +,(%#&' !.%%$=.' #-' 5b' 10&%!2' E0&' -"%!%' -%!-!?'G&,)3%-!' #&%' !=7$,6,$%$' ,/' -"%*' 10(-#,(' )0&%' -"#(' PRRR'.&,),-,6%!?'#!'$%!1&,7%$',('3%1-,0('52N2'D$$,-,0(#+'-%!-!'!"09'-"#-'E2B2D2L2' /#++!' 0//' 7*' 0(+*' N^' ,/' G&,)3%-!' #&%' !=7$,6,$%$' ,(-0':&0=.!' 0/' NRR' .&,),-,6%!?' !0' 10$%' -=(,(:' !"0=+$' ,).&06%' -"%'+,(%#&,-*2''

E,:=&%'PR'!"09!'-"%'(=)7%&'0/'F#&&#7%%'=(,-!'&%A=,&%$'-0'&%($%&'!#).+%' /&#)%!' /&0)' -"%' -"&%%':#)%!'#-'QR' /&#)%!c!%10($2'@"%!%'&%!=+-!'9%&%' !,)=+#-%$'0('#' !,(:+%'10&%'9,-"' -"%'#!!=).-,0(' -"#-'.%&/0&)#(1%'!1#+%!'+,(%#&+*2'E0&'M#+/'F,/%'N'%.,!0$%'N?'&0=:"+*'PR'F#&&#7%%'`(,-!'#&%'!=//,1,%(-'-0'%(!=&%'-"#-'#++'/&#)%!'&=('#-'QR'/.!'0&' /#!-%&2' E0&' E2B2D2L2' #($' I%#&!' 0/'J#&?' &0=:"+*' N4' F#&&#7%%'`(,-!'!=//,1%2''

'

'()*+%&-.'&F<$%,++&6$%!)%0,87$'&.3)5.& 23$&8/0#$%&)!&:,%%,#$$&G8"2.&H7)%$.&%/88"84&,2&I&JKLM&8$$-$-&2)&,73"$<$&NO!6.&!)%&)!&23$&.$%"$.&)!&.,06+$&!%,0$.&"8&$,73&4,0$9&

@"%' &%)#,(,(:' ,!!=%' -"#-' 1#(' +,),-' !1#+#7,+,-*' ,!' !0/-9#&%' +018!2'3,)=+#-,(:' )=+-,.+%' /&#)%!' 0/' &%($%&,(:' #-' !=1"' #' /,(%' +%6%+' 0/'$%-#,+' ,!' %>-&%)%+*' 10!-+*2' M09%6%&?' -",!' !0/-9#&%' &%($%&,(:'.,.%+,(%'9#!'%>.+,1,-+*'$%!,:(%$'-0'),(,),O%'-"%'(=)7%&'0/'+018!'

18:8 • L. Seiler et al.

ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, Publication date: August 2008.

August 2008: Larrabee DirectX 9 pipeline, used by video games.

Good scaling with # of cores.

... shaders and may limit performance for other shaders. Section 5.4 provides breakdowns of the total processing time devoted to post-shader blending and parameter interpolation.

One remaining issue is texture co-processor accesses, which can have hundreds of clocks of latency. This is hidden by computing multiple qquads on each hardware thread. Each qquad's shader is called a fiber. The different fibers on a thread co-operatively switch between themselves without any OS intervention. A fiber switch is performed after each texture read command, and processing passes to the other fibers running on the thread. Fibers execute in a circular queue. The number of fibers is chosen so that by the time control flows back to a fiber, its texture access has had time to execute and the results are ready for processing.
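The fiber-count rule in the paragraph above can be illustrated with a small calculation. This is only a sketch under simplifying assumptions that are not in the paper: every fiber does a fixed amount of work between texture reads, switching is free, and the latency is a single constant. The function names and the example numbers are hypothetical.

```python
def min_fibers(texture_latency_clocks: int, work_clocks_per_fiber: int) -> int:
    """Smallest fiber count that hides texture latency under round-robin switching.

    After a fiber issues a texture read, the other (n - 1) fibers each run for
    work_clocks_per_fiber before control returns to it, so we need
    (n - 1) * work_clocks_per_fiber >= texture_latency_clocks.
    """
    if work_clocks_per_fiber <= 0:
        raise ValueError("work_clocks_per_fiber must be positive")
    # Integer ceiling division, then one extra fiber for the one that is waiting.
    return -(-texture_latency_clocks // work_clocks_per_fiber) + 1


def stall_clocks_per_switch(texture_latency_clocks: int,
                            work_clocks_per_fiber: int,
                            fibers: int) -> int:
    """Clocks a fiber would still wait for its texture result when it is resumed."""
    hidden = (fibers - 1) * work_clocks_per_fiber
    return max(0, texture_latency_clocks - hidden)


if __name__ == "__main__":
    # Hypothetical numbers: 300 clocks of latency, 100 clocks of work per fiber.
    print(min_fibers(300, 100))
    print(stall_clocks_per_switch(300, 100, 3))
```

With too few fibers the thread stalls on every switch; adding fibers beyond the minimum buys nothing in this simple model.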

5. Renderer Performance Studies

This section describes performance and scaling studies for the Larrabee software renderer described in Section 4. Studies include scalability experiments for software rendering, load balancing studies, bandwidth comparisons of binning to immediate mode renderers, performance on several game workloads, and charts showing how the total processing time is divided among different parts of the software renderer.

5.1 Game Workloads and Simulation Method

Performance tests use workloads derived from three well-known games: Gears of War*, F.E.A.R.*, and Half Life* 2 Episode 2. Table 2 contains information about the tested frames from each game. Since we are scaling out to large numbers of cores we use a high-end screen size with multisampling when supported.

              Half Life 2 ep. 2       F.E.A.R.                Gears of War
Resolution    1600x1200, 4 sample     1600x1200, 4 sample     1600x1200, 1 sample
Frames        25 frames (1 in 30)     25 frames (1 in 100)    25 frames (1 in 250)
Developer     Valve Corp.             Monolith Productions    Epic Games Inc

Table 2: Workload summary for the three tested games: the frames are widely separated to catch different scene characteristics as the games progress.
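To make the sampling in Table 2 concrete, the sketch below lists which frames a "25 frames (1 in N)" capture would pick. It assumes "1 in N" means every N-th frame starting from frame 0, which is a reading of the table and not something the paper spells out; the function name is hypothetical.

```python
def sampled_frame_indices(num_samples: int, stride: int, start: int = 0) -> list[int]:
    """Frame indices picked when keeping one frame out of every `stride` frames."""
    return [start + i * stride for i in range(num_samples)]


if __name__ == "__main__":
    # Strides taken from Table 2; the starting frame is an assumption.
    for game, stride in [("Half Life 2 ep. 2", 30), ("F.E.A.R.", 100), ("Gears of War", 250)]:
        frames = sampled_frame_indices(25, stride)
        print(game, "first:", frames[0], "last:", frames[-1])
```

Under this reading the 25 samples span a few hundred to several thousand frames of play, which is consistent with the caption's point that the frames are widely separated.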

We captured the frames by intercepting the DirectX 9 command stream being sent to a conventional graphics card while the game was played at a normal speed, along with the contents of textures and surfaces at the start of the frame. We tested them through a functional model to ensure the algorithms were correct and that the right images were produced. Next, we estimated the cost of each section of code in the functional model, being aggressively pessimistic, and built a rough profile of each frame. We wrote assembly code for the highest-cost sections, ran it through cycle-accurate simulators, fed the clock cycle results back into the functional model, and re-ran the traces. This iterative cycle of refinement was repeated until 90% of the clock cycles executed during a frame had been run through the simulators, giving the overall profiles a high degree of confidence. Texture unit throughput, cache performance and memory bandwidth limitations were all included in the various simulations.
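The refinement loop described above can be approximated as a greedy selection: keep promoting the highest-cost code sections to cycle-accurate simulation until the simulated sections account for 90% of a frame's clock cycles. The sketch below is a simplification, not the authors' tooling: it does one pass over fixed estimates, whereas the paper feeds simulator results back and re-runs the traces. The section names and cycle counts are hypothetical.

```python
def sections_to_simulate(estimated_cycles: dict[str, int], target: float = 0.90) -> list[str]:
    """Pick the highest-cost sections until they cover `target` of all cycles."""
    total = sum(estimated_cycles.values())
    chosen: list[str] = []
    covered = 0
    # Visit sections from most to least expensive.
    for name, cycles in sorted(estimated_cycles.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= target * total:
            break
        chosen.append(name)
        covered += cycles
    return chosen


if __name__ == "__main__":
    # Hypothetical per-frame cost estimates, in clock cycles.
    profile = {"pixel_shading": 50, "rasterization": 30, "setup": 15, "misc": 5}
    print(sections_to_simulate(profile))
```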

In these studies we measure workload performance in terms of Larrabee units. A Larrabee unit is defined to be one Larrabee core running at 1 GHz. The clock rate is chosen solely for ease of calculation, since real devices would ship with multiple cores and a variety of clock rates. Using Larrabee units allows us to compare performance of Larrabee implementations with different numbers of cores running at different clock rates. A single Larrabee unit corresponds to a theoretical peak throughput of 32 GFLOPS, counting fused multiply-add as two operations.

* Other names & brands may be claimed as the property of others.
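The 32 GFLOPS figure can be reproduced with a short calculation. The sketch assumes a 16-lane single-precision vector unit per core issuing one fused multiply-add per lane per clock; the lane count comes from the Larrabee paper's architecture description, not from this excerpt, and the function names are hypothetical.

```python
VECTOR_LANES = 16   # assumption: 16-wide vector unit per core (not stated in this excerpt)
FLOPS_PER_FMA = 2   # a fused multiply-add counts as two operations, as in the text


def larrabee_units(cores: int, clock_ghz: float) -> float:
    """One Larrabee unit is one core running at 1 GHz."""
    return cores * clock_ghz


def peak_gflops(cores: int, clock_ghz: float, lanes: int = VECTOR_LANES) -> float:
    """Theoretical peak throughput in GFLOPS for a given configuration."""
    return larrabee_units(cores, clock_ghz) * lanes * FLOPS_PER_FMA


if __name__ == "__main__":
    print(peak_gflops(1, 1.0))       # a single Larrabee unit
    print(larrabee_units(32, 0.5))   # 32 cores at 0.5 GHz, expressed in units
```

Because units multiply core count by clock rate, a 32-core part at 0.5 GHz and a 16-core part at 1 GHz are treated as equivalent in these studies.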

5.2 Scalability Studies

The Larrabee software renderer is designed to allow efficient load balancing over a large number of cores. Figure 9 shows the results of testing load balancing for six configurations, each of which scales the memory bandwidth and texture filtering speed relative to the number of cores. This test uses the simulation methodology described in Section 5.1 in combination with a time-based performance model that tracks dependencies and scheduling. This tool is used for multiple graphics products within Intel.

Figure 9: Relative Scaling as a Function of Core Count: This shows configurations with 8 to 48 cores, with each game's results plotted relative to the performance of an 8-core system.

The results of the load balancing simulation show a falloff of 7% to 10% from a linear speedup at 48 cores. For these tests, PrimSets are subdivided if they contain more than 1000 primitives, as described in Section 4.2. Additional tests show that F.E.A.R. falls off by only 2% if PrimSets are subdivided into groups of 200 primitives, so code tuning should improve the linearity.
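As a rough worked example of what these falloff figures mean, the sketch below converts a falloff percentage into a speedup relative to the 8-core baseline used in Figure 9. It assumes "falloff" is the fractional shortfall from ideal linear speedup, which is one reading of the text; the function name is hypothetical.

```python
def relative_speedup(cores: int, baseline_cores: int, falloff: float) -> float:
    """Speedup over the baseline after losing `falloff` of the ideal linear speedup."""
    ideal = cores / baseline_cores
    return ideal * (1.0 - falloff)


if __name__ == "__main__":
    # Ideal speedup from 8 to 48 cores is 6x.
    for falloff in (0.07, 0.10, 0.02):
        print(f"falloff {falloff:.0%}: {relative_speedup(48, 8, falloff):.2f}x of a possible 6.00x")
```

Under this reading, 48 cores deliver roughly 5.4x to 5.6x the 8-core performance, and about 5.9x for F.E.A.R. with the smaller PrimSet groups.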

Figure 10 shows the number of Larrabee units required to render sample frames from the three games at 60 frames/second. These results were simulated on a single core with the assumption that performance scales linearly. For Half Life 2 episode 2, roughly 10 Larrabee Units are sufficient to ensure that all frames run at 60 fps or faster. For F.E.A.R. and Gears of War, roughly 25 Larrabee Units suffice.

Figure 10: Overall performance: shows the number of Larrabee units (cores running at 1 GHz) needed to achieve 60 fps for the series of sample frames in each game.
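The conversion from a single-core simulation to a Larrabee unit count can be sketched as follows, using the linear-scaling assumption stated above. The cycle counts are hypothetical and the function names are not from the paper; only the 1 GHz unit definition and the 60 fps target come from the text.

```python
CLOCKS_PER_SECOND_PER_UNIT = 10**9  # one Larrabee unit = one core at 1 GHz


def units_for_frame(cycles_per_frame: int, target_fps: int = 60) -> int:
    """Larrabee units needed to render one frame at target_fps, assuming linear scaling."""
    cycles_per_second_needed = cycles_per_frame * target_fps
    # Integer ceiling division: a fractional unit rounds up to a whole unit.
    return -(-cycles_per_second_needed // CLOCKS_PER_SECOND_PER_UNIT)


def units_for_all_frames(cycles_per_frame_list: list[int], target_fps: int = 60) -> int:
    """Units needed so that every sampled frame meets the target (the worst frame decides)."""
    return max(units_for_frame(c, target_fps) for c in cycles_per_frame_list)


if __name__ == "__main__":
    # Hypothetical single-core cycle counts for three sampled frames.
    frames = [150_000_000, 400_000_000, 310_000_000]
    print(units_for_all_frames(frames))
```

Sizing for the most expensive frame matches the text's criterion that all frames run at 60 fps or faster.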

The remaining issue that can limit scalability is software locks. Simulating multiple frames of rendering at such a fine level of detail is extremely costly. However, this software rendering pipeline was explicitly designed to minimize the number of locks ...

18:8 • L. Seiler et al.

ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, Publication date: August 2008.

# of cores for high frame rates is a work in progress.

22


Next Week: Practical design techniques

Thursday: Design blocks

Tuesday: Micro-architecture

23