
MPSoC

University of Tehran, Electrical and Computer Engineering School

Design of ASIC CMOS Systems Course

Presented by: Mahdi Hamzeh

Instructor: Dr S.M. Fakhraie

Spring 2006

This is a class presentation. All data are copyrighted by the respective authors listed in the references and are used here for educational purposes only.

Outline
    Introduction
    A Power-Efficient High-Throughput 32-Thread SPARC Processor
    A 16-Core RISC Microprocessor with Network Extensions
    A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache
    A 2.6GHz Dual-Core 64b x86 Microprocessor with DDR2 Memory Support
    Conclusion

Introduction
    Multi-processor systems-on-chip (MPSoCs) pose many new challenges to the design of embedded systems [5]
    MPSoCs are becoming a necessary way to balance performance, power, and reliability while maintaining the maximum degree of flexibility [6]
    Energy-efficient design is strongly required in MPSoCs [6]
        Dynamic power and leakage power

Features
    High throughput
    Performance/watt optimized
    Concurrent execution of 32 threads
    8 symmetrical 4-way 64b multithreaded SPARC cores
    16KB ICache per core
    8KB DCache per core
    High-bandwidth low-latency cache/memory
    Single-issue 6-stage pipeline for each core
    Maximize pipeline utilization
    CPU-to-cache crossbar of 134GB/s
    4-banked 12-way L2 cache
    Pipelined shared 3MB L2 cache of 153.6GB/s
    Four 144b DDR2 DIMM channels at 400MT/s (mega-transfers/s) delivering 25.6GB/s
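The 25.6GB/s memory figure can be checked with a short calculation. This is a minimal sketch, assuming each 144b channel carries 128 data bits plus 16 ECC bits (the usual split for ECC DIMMs; the slide does not state it). The function name and constants are illustrative.

```python
# Peak DDR2 bandwidth for the memory interface listed in the features.
CHANNELS = 4
BUS_BITS = 144               # channel width from the slide
ECC_BITS = 16                # assumption: 144b = 128 data bits + 16 ECC bits
TRANSFERS_PER_SEC = 400e6    # 400 MT/s


def peak_bandwidth_gb_s(channels, bus_bits, ecc_bits, transfers_per_sec):
    """Return peak data bandwidth in GB/s (ECC bits carry no payload)."""
    data_bytes_per_transfer = (bus_bits - ecc_bits) / 8
    return channels * data_bytes_per_transfer * transfers_per_sec / 1e9


bandwidth = peak_bandwidth_gb_s(CHANNELS, BUS_BITS, ECC_BITS, TRANSFERS_PER_SEC)
print(f"{bandwidth:.1f} GB/s")  # 25.6 GB/s
```

Under that assumption, 4 channels × 16 bytes × 400 MT/s reproduces the quoted 25.6GB/s.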

Processor block diagram [1]

A Power-Efficient High-Throughput 32-Thread SPARC Processor

Niagara processor micrograph and overview [1]

A Power-Efficient High-Throughput 32-Thread SPARC Processor (Contd.)

Thread features
    4 threads are interleaved cycle by cycle with zero thread-switch cost
    When any thread is blocked by a cache miss or branch penalty, the other threads issue instructions more frequently, effectively hiding the miss latency of the first thread
Measured IPC (instructions per cycle)
    5.76 with an actual L2 latency of 20.9 CPU cycles and memory latency of 106ns on the Java Business Benchmark (SPECjbb)
    Pipeline efficiency 71% (5.76 out of a maximum of 8)
A balanced H-tree scheme is used to distribute the global clock
Thermal gradient of only 7°C
At 63W, worst-case junction temperature is 66°C (compared to a typical Tj of 105°C, reliability improves by 5×)
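The latency-hiding effect of interleaved threads can be illustrated with a toy model of one single-issue core. This is a sketch under stated assumptions, not the processor's actual thread-select logic: the round-robin policy, the `simulate` function, and the stall parameters are all hypothetical.

```python
def simulate(num_threads=4, cycles=1000, stall_thread=0,
             stall_every=10, stall_len=20):
    """Round-robin issue among ready threads, one instruction per cycle.

    Hypothetical model: thread `stall_thread` blocks for `stall_len` cycles
    after every `stall_every` instructions it issues (a stand-in for a miss).
    Returns the number of instructions issued by each thread.
    """
    ready_at = [0] * num_threads   # first cycle each thread may issue again
    issued = [0] * num_threads
    last = -1                      # thread that issued most recently
    for cycle in range(cycles):
        # Scan threads in round-robin order, starting after the last issuer.
        for offset in range(1, num_threads + 1):
            t = (last + offset) % num_threads
            if ready_at[t] <= cycle:
                issued[t] += 1
                last = t
                if t == stall_thread and issued[t] % stall_every == 0:
                    ready_at[t] = cycle + 1 + stall_len  # thread is blocked
                break
    return issued


issued = simulate()
print("instructions per thread:", issued)
print("pipeline utilization:", sum(issued) / 1000)
```

In this model the blocked thread issues fewer instructions, but the other three fill its slots, so the core still issues on every cycle.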

Technology
    90nm CMOS process
    9 layers of Cu interconnect
    378mm² die
    279M transistors
    Packaged in a flip-chip ceramic LGA with 1933 pins
    Power dissipation is 63W at 1.2V and 1.2GHz
    Library cells are static CMOS with a 1.5 P/N width ratio

Chip power consumption: 63W [1]

Components
Peak power control
    Active power- and temperature-control mechanisms allow threads and cores to be dynamically scheduled or idled
    Clock-gating techniques include coarse-grain gating to disable selected cores, and fine-grain gating to disable about 30% of the datapath flops on average
PLL generates 3 ratioed clock domains
    CPU, crossbar, L2 cache
    Memory interface
    System interface

Components (Contd.)
On-chip L2 cache
    12-way set-associative
    Divided into 4 independent banks
    Operate concurrently to read out up to 256B
    Each sub-bank supplies 16B with 2-cycle throughput, providing a maximum data array read bandwidth of 153.6GB/s

L2 cache data array floorplan and interlocking clock header [1]
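The 153.6GB/s read bandwidth follows from the figures above if the L2 data array runs at the 1.2GHz CPU clock. That clock rate is an assumption here (the slides place the CPU, crossbar, and L2 cache in one clock domain but do not give the L2 frequency explicitly). A minimal sketch with illustrative names:

```python
# Peak L2 data-array read bandwidth from the slide's figures.
BYTES_PER_READ = 256     # bytes read out by all sub-banks operating together
BYTES_PER_SUB_BANK = 16  # bytes supplied by each sub-bank
CYCLES_PER_READ = 2      # 2-cycle throughput
CLOCK_HZ = 1.2e9         # assumption: L2 array clocked at the 1.2GHz CPU clock

sub_banks = BYTES_PER_READ // BYTES_PER_SUB_BANK
read_bandwidth_gb_s = BYTES_PER_READ / CYCLES_PER_READ * CLOCK_HZ / 1e9

print(sub_banks)                           # 16
print(f"{read_bandwidth_gb_s:.1f} GB/s")   # 153.6 GB/s
```

Reading 256B every 2 cycles is 128B per cycle, which at 1.2GHz gives the quoted 153.6GB/s.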

Design Methodology

The hold-time methodology is based on metal-programmable delay buffers, allowing the top-level route to be frozen while hold violations are still being resolved

A 16-Core RISC Microprocessor with Network Extensions

Features
    Size of Icache (each core): 32kB
    Size of Dcache (each core): 8kB
    L2 cache: 1MB
    Number of MIPS cores: 16
    Number of metal layers: 9
    Process: 0.13μm CMOS
    Voltage: 1.2V
    Frequency: 600MHz
    Power: 25W
    Power for an individual processor: 450mW @ 600MHz
    Number of transistors: 180 million

Chip plot [2]

A 16-Core RISC Microprocessor with Network Extensions (Contd.)

Targeted for layer-4 through layer-7 network applications
Designed for power efficiency
Components (hardwired)
    Security engines
    Network function accelerators
    Memory/network/bus controllers
Most of the silicon area is dedicated to the 16 RISC processors and the 1MB L2 cache
    The remaining area is occupied by network coprocessors and physical interfaces
Chip interfaces
    64b 133MHz PCI/PCI-X
    144b 800MHz DDR2
    36b 600MHz low-latency DRAM interface, in addition to miscellaneous and general-purpose I/Os

Peak performance [2]

Architecture
    Each RISC core can issue two MIPS instructions per cycle, in order
    32kB 4-way set-associative virtual instruction cache
    The execute unit consists of two pipelines
        The first handles all instructions
        The second handles only ALU/insert/extract/shift/move instructions
    The memory section consists of
        8kB fully associative Dcache
        2kB write buffer
        32-entry (64-page) unified translation look-aside buffer (TLB)
    Multiplication/division unit, in addition to support for the standard MIPS instructions
    Cryptographic operations are accelerated by dedicated units supporting different cryptographic algorithms: 3DES, AES, MD5, SHA1/256/512, and GF2
    The 16 processors share a 1MB fully coherent L2 write-back cache

RISC Processor [2]

Power
    Aggressive clock gating of all place-and-route and custom islands
    Some blocks present natural exclusivity, and hardware enforces exclusivity to reduce peak power
        Example: in the execution unit, only the ALU or the shifter need be enabled
    Power performance is approximately 2000 MIPS/W
    Global clock distribution power is <1W at 1.2V, 600MHz for a skew of <50ps
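For context on the MIPS/W figure, the peak issue rate implied by the listed features can be worked out directly. This is a rough sketch that assumes every issue slot retires one instruction on every cycle, which gives an upper bound only. The slides do not say how the 2000 MIPS/W figure was measured, and all names below are illustrative.

```python
# Upper-bound instruction throughput from the feature list.
CORES = 16
ISSUE_WIDTH = 2        # two MIPS instructions per cycle per core
FREQ_MHZ = 600
CORE_POWER_W = 0.450   # power for an individual processor at 600MHz

core_peak_mips = ISSUE_WIDTH * FREQ_MHZ       # million instructions per second
chip_peak_mips = CORES * core_peak_mips
core_peak_mips_per_watt = core_peak_mips / CORE_POWER_W

print(core_peak_mips)                     # 1200
print(chip_peak_mips)                     # 19200
print(round(core_peak_mips_per_watt))     # 2667
```

The quoted figure of approximately 2000 MIPS/W sits below this per-core ceiling of about 2667 MIPS/W, as expected for sustained rather than peak issue.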

Design Methodology
    A combination of an industry-standard synthesis and place-and-route flow for control blocks, and full-custom schematic/layout design for the datapath-style units
    Global clock distribution is full custom and consists of a power-efficient variable-density grid that minimizes total metal capacitance while maintaining low-resistance paths to the heaviest clock loads
    Local conditional clocks are two gain stages from the global clock and are designed on an ad hoc basis
    Global floorplanning and wiring are done with an in-house tool that handles routing in addition to optimal placement of repeaters, local clock drivers, and decoupling capacitance

Chip floorplan [3]

A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache

Features
    Two 64b cores
    16MB unified L3 cache
    Each core has two threads
    Each core has a unified 1MB L2 cache
    1.328B transistors
    Frequency: 3.0GHz
    435mm² die
    1.25V core supply
    Worst-case power dissipation: 165W
    Typical server workload power: 110W
    Process: 65nm
    8 copper interconnect layers
    Low-k carbon-doped oxide (k=2.9) inter-level dielectric
    Flip-chip (C4) attached to a 12-layer (4-4-4) organic package with an integrated heat spreader
    Package has 604 pins (238 are signal pins and the rest are power and ground)

Die micrograph [3]

Components
L3 cache
    Uses 256 data sub-arrays (64kB each)
    32 redundancy sub-arrays (68kB each)
    Data sub-array stores 32 bits
    Redundancy sub-array stores 34 bits
    6T memory-cell size is 0.624μm²
    Physical address is 40b wide
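The sub-array counts above tie back to the headline cache size. A minimal sketch, with illustrative names, that checks the data capacity and the address space reachable with a 40b physical address:

```python
# L3 data capacity and physical address reach from the slide's figures.
DATA_SUB_ARRAYS = 256
SUB_ARRAY_KB = 64
PHYSICAL_ADDRESS_BITS = 40

l3_capacity_mb = DATA_SUB_ARRAYS * SUB_ARRAY_KB // 1024
addressable_tib = 2 ** PHYSICAL_ADDRESS_BITS // 2 ** 40

print(l3_capacity_mb)    # 16
print(addressable_tib)   # 1
```

256 sub-arrays of 64kB give exactly the 16MB of data storage; the 32 redundancy sub-arrays are not counted toward that capacity in this check.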

Clock and PLL
    3 PLLs
    The uncore clock is distributed through a balanced tree embedded in nine vertical spines
    De-skew circuits controlled by on-die fuses reduce the uncore clock skew to less than 11ps

Clock distribution map [3]

Components (Contd.)
Level shifters
    Used between voltage domains
DFT and debug features
    Scan observability registers (scan-out)
    I/O loopback and I/O test generator (IBIST)
    On-die clock shrink
    …

Power
    Only 0.8% of all L3 cache array blocks are powered up for each cache access
    To reduce L3 cache leakage, NMOS sleep transistors are implemented in the SRAM sub-arrays and PMOS power-gating devices in the cache periphery (both saving about 3W of leakage)
Supply voltage
    Three supply voltages
        One for the two cores
        A separate supply for the L3 cache together with the associated control logic
        A third one for the FSB I/O circuits
The design uses longer-Le devices (10% longer than nominal) in non-timing-critical paths to reduce subthreshold leakage (about 54% of the transistor width in the cores and 76% of the transistor width in the uncore, excluding cache arrays)

L3 cache sleep circuit and shut-off mode [3]

Voltage domains and power breakdown [3]

A 2.6GHz Dual-Core 64b x86 Microprocessor with DDR2 Memory Support

Features
    Process technology: 90nm triple-Vt, partially-depleted SOI; 9-layer Cu metallization; dual gate-oxide thickness
    CPU cores: 2
    Die area: 220mm²
    L2 cache area: 77.4mm²
    Transistor count: 243M (134M L2 array, 13M L1 array)
    L1 instruction cache: 64kB per core, parity protected
    L1 data cache: 64kB per core, ECC protected
    L2 cache: 1MB per core, ECC protected
    Memory interface: 128b DDR2-800, 12.8GB/s
    Frequency: 2.6GHz
    Core supply: 1.35V
    Power dissipation: 95W
    The chip implements the Pacifica architecture for hardware support of virtualization
    The design has 7% frequency margin and 10% voltage margin at its operating point

die micrograph [4]

Components
    2 Hammer cores
    On-chip DDR2 memory controller
    3 identical PLLs
        2 PLLs provide clocks for the 3 HyperTransport links
        The third provides a clock for the memory controller and both cores

Clock Distribution
    A balanced H-tree drives the clock signal from the PLL to the final clock buffers
    Worst-case clock skew is 21ps

Power
    Fine-grained clock gating reduces the load on the clock grid and reduces power consumption
    The clock grids over the 2 cores can be separately enabled
    Low-power operating modes
        Clock grids over the CPU cores are disabled
        Clock grid over the memory controller runs at 1/256th the frequency of the system clock
    The grid provides a low-resistance path to all clock receivers, so clock drivers do not have to be tuned based on loading at the end of the design cycle
    Reducing the supply voltage from 1.35V to 1.1V achieves a three-fold reduction in static leakage and a 47% reduction in dynamic power, at a cost of 20% in frequency
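The 47% figure quoted above is consistent with the standard first-order model of dynamic (switching) power, P ∝ C·V²·f. The sketch below applies that textbook model with the switched capacitance held constant; the model and the function name are assumptions for illustration, not taken from the cited paper.

```python
def dynamic_power_ratio(v_old, v_new, freq_scale):
    """Ratio of new to old dynamic power under P ~ C * V^2 * f (C constant)."""
    return (v_new / v_old) ** 2 * freq_scale


# 1.35V -> 1.1V with a 20% frequency reduction (frequency scaled to 0.8x).
ratio = dynamic_power_ratio(1.35, 1.1, 0.8)
reduction_pct = (1 - ratio) * 100

print(f"{reduction_pct:.0f}% reduction")  # 47% reduction
```

The voltage term alone contributes (1.1/1.35)² ≈ 0.66, and the lower frequency brings the product to about 0.53 of the original dynamic power.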

Static leakage versus frequency [4]

Conclusion
    Power-hungry techniques like memory speculation, out-of-order execution, and predication are not needed to achieve the desired performance [1]
    Extensive use of simple static CMOS circuits improves the robustness [2]
    Designed for power efficiency, which is a key requirement for MPSoC embedded applications [3]

References
[1] A. S. Leon, J. L. Shin, K. W. Tam, W. Bryg, F. Schumacher, P. Kongetira, W. Weisner, A. Strong, "A Power-Efficient High-Throughput 32-Thread SPARC Processor", International Solid-State Circuits Conference, February 2006.

[2] V. Yalala, D. Brasili, D. Carlson, A. Hughes, A. Jain, T. Kiszely, K. Kodandapani, A. Varadharajan, T. Xanthopoulos, "A 16-Core RISC Microprocessor with Network Extensions", International Solid-State Circuits Conference, February 2006.

[3] S. Rusu, S. Tam, H. Muljono, D. Ayers, J. Chang, "A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache", International Solid-State Circuits Conference, February 2006.

[4] M. Golden, S. Arekapudi, G. Dabney, M. Haertel, S. Hale, L. Herlinger, Y. Kim, K. McGrath, V. Palisetti, M. Singh, "A 2.6GHz Dual-Core 64b x86 Microprocessor with DDR2 Memory Support", International Solid-State Circuits Conference, February 2006.

[5] K. C. Chang, J. S. Shen, T. F. Chen, "Evaluation and Design Trade-Offs Between Circuit-Switched and Packet-Switched NOCs for Application-Specific SOCs", 43rd Design Automation Conference, July 2006.

[6] I. Issenin, E. Brockmeyer, B. Durinck, N. Dutt, "Multiprocessor System-on-Chip Data Reuse Analysis for Exploring Customized Memory Hierarchies", 43rd Design Automation Conference, July 2006.

Questions?
