MPSoC
University of Tehran, Electrical and Computer Engineering School
Design of ASIC CMOS Systems Course
Presented by: Mahdi Hamzeh
Instructor: Dr. S.M. Fakhraie
Spring 2006
This is a class presentation. All data are copyrighted to their respective authors, as listed in the references, and have been used here for educational purposes only.
Outline
- Introduction
- A Power-Efficient High-Throughput 32-Thread SPARC Processor
- A 16-Core RISC Microprocessor with Network Extensions
- A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache
- A 2.6GHz Dual-Core 64b x86 Microprocessor with DDR2 Memory Support
- Conclusion
Introduction
- Multi-processor systems-on-chip (MPSoCs) pose many new challenges to the design of embedded systems [5]
- MPSoCs are becoming a necessary way to balance performance, power, and reliability while maintaining the maximum degree of flexibility [6]
- Energy-efficient design is strongly required in MPSoCs [6]
- Dynamic power and leakage power
Features
- High throughput
- Performance/watt optimized
- Concurrent execution of 32 threads
- 8 symmetrical 4-way 64b multithreaded SPARC cores
- 16KB ICache per core
- 8KB DCache per core
- High-bandwidth, low-latency cache/memory
- Single-issue 6-stage pipeline for each core
- Maximize pipeline utilization
- CPU-to-cache crossbar of 134GB/s
- 4-banked 12-way L2 cache
- Pipelined shared 3MB L2 cache of 153.6GB/s
- Four 144b DDR2 DIMM channels at 400MT/s (mega-transfers/s) delivering 25.6GB/s
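The 25.6GB/s memory figure can be reproduced from the channel count and transfer rate. The sketch below assumes (this is not stated on the slide) that each 144b channel carries 128 data bits plus 16 ECC bits, so only 128 bits count as payload. The function name is hypothetical.

```python
# Sketch: peak DDR2 payload bandwidth from the figures on the slide.
# Assumption (not from the slide): 144b channel = 128 data bits + 16 ECC bits.

def peak_bandwidth_gb_s(channels: int, data_bits: int, mega_transfers_per_s: float) -> float:
    """Peak payload bandwidth in GB/s, with 1 GB = 1e9 bytes."""
    bytes_per_transfer = data_bits / 8
    return channels * bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9

# Four channels, 128 data bits each, 400 MT/s.
memory_bw_gb_s = peak_bandwidth_gb_s(channels=4, data_bits=128, mega_transfers_per_s=400)
print(f"Peak memory bandwidth: {memory_bw_gb_s:.1f} GB/s")
```

Under that assumption the result matches the quoted 25.6GB/s.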
Processor block diagram [1]
A Power-Efficient High-Throughput 32-Thread SPARC Processor
Niagara processor micrograph and overview [1]
A Power-Efficient High-Throughput 32-Thread SPARC Processor (Contd.)
Thread features
- 4 threads are interleaved per cycle with zero thread-switch cost
- When any thread is blocked by a cache miss or branch penalty, the other threads issue instructions more frequently, effectively hiding the miss latency of the first thread
- Measured IPC (instructions per cycle): 5.76, with an actual L2 latency of 20.9 CPU cycles and a memory latency of 106ns, on the Java Business Benchmark (SpecJBB)
- Pipeline efficiency: 71% (5.76 out of a maximum of 8)
- A balanced H-tree scheme is used to distribute the global clock
- Thermal gradient of only 7°C
- At 63W, the worst-case junction temperature is 66°C (compared to a typical Tj of 105°C, reliability improves by 5×)
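The latency-hiding effect of interleaving threads can be illustrated with a toy model. This is a minimal sketch, not the Niagara thread-select logic: it assumes a round-robin pick among ready threads, and the miss interval and miss latency are hypothetical values chosen only for illustration.

```python
# Toy model of fine-grained multithreading on one single-issue core.
# The round-robin policy and all parameters are assumptions for
# illustration; they are not measured Niagara data.

def pipeline_utilization(num_threads: int, miss_every: int,
                         miss_latency: int, cycles: int = 10_000) -> float:
    """Fraction of cycles in which some thread issues an instruction."""
    ready_at = [0] * num_threads   # cycle at which each thread may issue again
    issued = [0] * num_threads     # instructions issued so far, per thread
    next_thread = 0                # round-robin pointer
    busy_cycles = 0
    for cycle in range(cycles):
        # Pick the first ready thread, starting at the round-robin pointer.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                issued[t] += 1
                busy_cycles += 1
                # Every `miss_every`-th instruction blocks its thread.
                if issued[t] % miss_every == 0:
                    ready_at[t] = cycle + 1 + miss_latency
                next_thread = (t + 1) % num_threads
                break
    return busy_cycles / cycles

single = pipeline_utilization(num_threads=1, miss_every=10, miss_latency=20)
four = pipeline_utilization(num_threads=4, miss_every=10, miss_latency=20)
print(f"1 thread: {single:.2f}, 4 threads: {four:.2f}")
```

With one thread the pipeline idles for the whole miss; with four threads the other threads fill most of those slots, so utilization rises.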
Technology
- 90nm CMOS process
- 9 layers of Cu interconnect
- 378mm² die
- 279M transistors
- Packaged in a flip-chip ceramic LGA with 1933 pins
- Power dissipation is 63W at 1.2V and 1.2GHz
- Library cells are static CMOS with a 1.5 P/N width ratio
Chip power consumption: 63W [1]
Components
Peak power control
- Active power- and temperature-control mechanisms allow threads and cores to be dynamically scheduled or idled
- Clock-gating techniques include coarse-grain gating to disable selected cores and fine-grain gating to disable about 30% of the datapath flops on average
- A PLL generates 3 ratioed clock domains
- CPU, crossbar, and L2 cache
- Memory interface
- System interface
Components (Contd.)
On-chip L2 cache
- 12-way set-associative
- Divided into 4 independent banks
- Banks operate concurrently to read out up to 256B
- Each sub-bank supplies 16B with 2-cycle throughput, providing a maximum data array read bandwidth of 153.6GB/s
L2 cache data array floorplan and interlocking clock header [1]
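The 153.6GB/s read bandwidth follows from the numbers above. The sketch assumes the L2 runs at the 1.2GHz CPU clock quoted elsewhere in the deck; the count of 16 concurrently active sub-banks is inferred from 256B / 16B and is not stated on the slide.

```python
# Sketch: L2 data-array read bandwidth from the slide's figures.
# Assumption: the L2 is clocked at the 1.2GHz CPU frequency.

CPU_CLOCK_HZ = 1.2e9      # CPU clock
BYTES_PER_READ = 256      # all banks together read out up to 256B
SUB_BANK_BYTES = 16       # each sub-bank supplies 16B
CYCLES_PER_READ = 2       # 2-cycle throughput per sub-bank

# Inferred, not stated on the slide: sub-banks active at once.
active_sub_banks = BYTES_PER_READ // SUB_BANK_BYTES

# 256B every 2 cycles = 128B per cycle.
l2_read_bw_gb_s = BYTES_PER_READ / CYCLES_PER_READ * CPU_CLOCK_HZ / 1e9
print(f"{active_sub_banks} sub-banks, {l2_read_bw_gb_s:.1f} GB/s")
```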
Design Methodology
- Hold-time methodology based on metal-programmable delay buffers, allowing the top-level route to be frozen while hold violations are still being resolved
A 16-Core RISC Microprocessor with Network Extensions
Features
- Size of Icache (each core): 32kB
- Size of Dcache (each core): 8kB
- L2 cache: 1MB
- Number of MIPS cores: 16
- Number of metal layers: 9
- Process: 0.13μm CMOS
- Voltage: 1.2V
- Frequency: 600MHz
- Power: 25W
- Power for an individual processor: 450mW @ 600MHz
- Number of transistors: 180 million
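As a quick check on the power figures above, the share of chip power taken by the 16 cores can be worked out directly. The sketch assumes all 16 cores draw the quoted 450mW at the same time, which the slide does not state.

```python
# Sketch: how much of the 25W chip power the 16 cores account for.
# Assumption: every core dissipates the quoted 450mW simultaneously.

NUM_CORES = 16
CORE_POWER_W = 0.450    # per-core power at 600MHz
CHIP_POWER_W = 25.0     # total chip power

total_core_power_w = NUM_CORES * CORE_POWER_W
core_share = total_core_power_w / CHIP_POWER_W
print(f"Cores: {total_core_power_w:.1f} W ({core_share:.1%} of chip power)")
```

Under that assumption the cores take well under half of the power budget, leaving the rest for the L2 cache, coprocessors, and interfaces.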
Chip plot [2]
A 16-Core RISC Microprocessor with Network Extensions (Contd.)
- Targeted for layer-4 through layer-7 network applications
- Designed for power efficiency
- Components (hardwired)
- Security engines
- Network function accelerators
- Memory/network/bus controllers
- Most of the silicon area is dedicated to the 16 RISC processors and the 1MB L2 cache
- The remaining area is occupied by network coprocessors and physical interfaces
- Chip interfaces
- 64b 133MHz PCI/PCI-X
- 144b 800MHz DDR2
- 36b 600MHz low-latency DRAM interface
- Miscellaneous and general-purpose I/Os
Peak performance [2]
Architecture
- Each RISC core can issue two MIPS instructions per cycle, in order
- 32kB 4-way set-associative virtual instruction cache
- The execute unit consists of two pipelines
- The first handles all instructions
- The second handles only ALU/insert/extract/shift/move instructions
- The memory section consists of
- 8kB fully associative Dcache
- 2kB write buffer
- 32-entry (64-page) unified translation look-aside buffer (TLB)
- Multiplication/division unit in addition to supporting the standard MIPS instructions
- Cryptographic operations are accelerated by dedicated units supporting different encryption methods: 3DES, AES, MD5, SHA1/256/512, and GF2
- The 16 processors share a 1MB fully coherent L2 write-back cache
RISC Processor [2]
Power
- Aggressive clock gating of all place-and-route and custom islands
- Some blocks present natural exclusivity, and hardware enforces exclusivity to reduce peak power
- Example: in the execution unit, only the ALU or the shifter needs to be enabled
- Power performance is approximately 2000 MIPS/W
- Global clock distribution power is <1W at 1.2V, 600MHz for a skew of <50ps
Design Methodology
- Combination of an industry-standard synthesis and place-and-route flow for control blocks, and full-custom schematic/layout design for the datapath-style units
- Global clock distribution is full custom and consists of a power-efficient variable-density grid that minimizes total metal capacitance while maintaining low-resistance paths to the heaviest clock loads
- Local conditional clocks are two gain stages from the global clock and are designed on an ad hoc basis
- Global floorplanning and wiring are done with an in-house tool that handles routing in addition to optimal repeater, local clock driver, and decoupling capacitance placement
Chip floorplan [3]
A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache
Features
- Two 64b cores
- 16MB unified L3 cache
- Each core has two threads
- Each core has a unified 1MB L2 cache
- 1.328B transistors
- Frequency: 3.0GHz
- 435mm² die
- 1.25V core supply
- Worst-case power dissipation: 165W
- Power on a typical server workload: 110W
- Process: 65nm
- 8 copper interconnect layers
- Low-k carbon-doped oxide (k=2.9) inter-level dielectric
- Flip-chip (C4) attached to a 12-layer (4-4-4) organic package with an integrated heat spreader
- Package has 604 pins (238 are signal pins and the rest are power and ground)
Die micrograph [3]
Components
L3 cache
- Uses 256 data sub-arrays (64kB each)
- 32 redundancy sub-arrays (68kB each)
- Data sub-array stores 32 bits
- Redundancy sub-array stores 34 bits
- 6T memory-cell bit size is 0.624μm²
- Physical address is 40b wide
Clock and PLL
- 3 PLLs
- Uncore clock is distributed through a balanced tree embedded in nine vertical spines
- De-skew circuits controlled by on-die fuses reduce the uncore clock skew to less than 11ps
Clock distribution map [3]
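The L3 sub-array figures above are self-consistent, as the sketch below checks. The cell-area estimate is a rough lower bound only: it assumes the 16MB of data bits are the only cells counted and ignores redundancy, tags, ECC, and periphery, none of which the slide quantifies.

```python
# Sketch: consistency check on the L3 sub-array figures.
# Assumption for the area estimate: only the 16MB of data cells are
# counted (no redundancy, tags, ECC, or periphery).

DATA_SUBARRAYS = 256
DATA_SUBARRAY_KB = 64
REDUNDANCY_SUBARRAY_KB = 68
CELL_AREA_UM2 = 0.624     # 6T bit cell

# 256 sub-arrays x 64kB = 16MB, matching the quoted L3 size.
l3_capacity_mb = DATA_SUBARRAYS * DATA_SUBARRAY_KB / 1024

# 68kB vs 64kB mirrors the 34-bit vs 32-bit sub-array widths.
size_ratio = REDUNDANCY_SUBARRAY_KB / DATA_SUBARRAY_KB
width_ratio = 34 / 32

data_bits = DATA_SUBARRAYS * DATA_SUBARRAY_KB * 1024 * 8
data_cell_area_mm2 = data_bits * CELL_AREA_UM2 / 1e6
print(f"{l3_capacity_mb:.0f} MB, about {data_cell_area_mm2:.1f} mm^2 of data cells")
```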
Components (Contd.)
Level shifters
- Used between voltage domains
DFT and debug features
- Scan observability registers (scan-out)
- I/O loopback and I/O test generator (IBIST)
- On-die clock shrink
- …
Power
- Only 0.8% of all L3 cache array blocks are powered up for each cache access
- To reduce L3 cache leakage, NMOS sleep transistors are implemented in the SRAM sub-arrays and PMOS power-gating devices in the cache periphery (together saving about 3W of leakage)
Supply voltage
- Three supply voltages
- One for the two cores
- A separate supply for the L3 cache together with the associated control logic
- A third for the FSB I/O circuits
- The design uses longer-Le devices (10% longer than nominal) in non-timing-critical paths to reduce subthreshold leakage: about 54% of the transistor width in the cores and 76% of the transistor width in the uncore (excluding cache arrays)
L3 cache sleep circuit and shut-off mode [3]
Voltage domains and power breakdown [3]
A 2.6GHz Dual-Core 64b x86 Microprocessor with DDR2 Memory Support
Features
- Process technology: 90nm triple-Vt, partially-depleted SOI; 9-layer Cu metallization; dual gate-oxide thickness
- CPU cores: 2
- Die area: 220mm²
- L2 cache area: 77.4mm²
- Transistor count: 243M (134M in the L2 array, 13M in the L1 array)
- L1 instruction cache: 64kB per core, parity protected
- L1 data cache: 64kB per core, ECC protected
- L2 cache: 1MB per core, ECC protected
- Memory interface: 128b DDR2-800, 12.8GB/s
- Frequency: 2.6GHz
- Core supply: 1.35V
- Power dissipation: 95W
- The chip implements the Pacifica architecture for hardware support of virtualization
- The design has 7% frequency margin and 10% voltage margin at its operating point
die micrograph [4]
Components
- 2 Hammer cores
- On-chip DDR2 memory controller
- 3 identical PLLs
- 2 PLLs provide clocks for the 3 HyperTransport links
- The third provides a clock for the memory controller and both cores
Clock distribution
- A balanced H-tree drives the clock signal from the PLL to the final clock buffers
- Worst-case clock skew is 21ps
Power
- Fine-grained clock gating reduces the load on the clock grid and reduces power consumption
- The clock grids over the 2 cores can be separately enabled
- Low-power operating modes
- Clock grids over the CPU cores are disabled
- Clock grid over the memory controller runs at 1/256th the frequency of the system clock
- The grid provides a low-resistance path to all clock receivers, so clock drivers do not have to be tuned based on loading at the end of the design cycle
- Reducing the supply voltage from 1.35V to 1.1V achieves a three-fold reduction in static leakage and a 47% reduction in dynamic power, at a cost of 20% in frequency
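The 47% figure is consistent with the first-order CMOS relation for switching power, P ∝ C·V²·f. The sketch below assumes switched capacitance and activity factor stay constant across the two operating points, which is a textbook simplification and not a claim from [4].

```python
# Sketch: first-order dynamic power scaling, P ~ C * V^2 * f.
# Assumption: capacitance and activity factor are unchanged.

def dynamic_power_ratio(v_new: float, v_old: float, freq_ratio: float) -> float:
    """New dynamic power as a fraction of the old dynamic power."""
    return (v_new / v_old) ** 2 * freq_ratio

# 1.35V -> 1.1V, with frequency reduced by 20% (ratio 0.80).
ratio = dynamic_power_ratio(v_new=1.1, v_old=1.35, freq_ratio=0.80)
reduction = 1.0 - ratio
print(f"Dynamic power reduction: {reduction:.0%}")
```

The voltage term alone gives about a one-third reduction; the 20% frequency drop accounts for the rest.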
Static leakage versus frequency [4]
Conclusion
- Power-hungry techniques like memory speculation, out-of-order execution, and predication are not needed to achieve the desired performance [1]
- Extensive use of simple static CMOS circuits improves robustness [2]
- Designed for power efficiency, which is a key requirement for MPSoC embedded applications [3]
References
[1] A. S. Leon, J. L. Shin, K. W. Tam, W. Bryg, F. Schumacher, P. Kongetira, W. Weisner, A. Strong, "A Power-Efficient High-Throughput 32-Thread SPARC Processor", International Solid-State Circuits Conference, February 2006.
[2] V. Yalala, D. Brasili, D. Carlson, A. Hughes, A. Jain, T. Kiszely, K. Kodandapani, A. Varadharajan, T. Xanthopoulos, "A 16-Core RISC Microprocessor with Network Extensions", International Solid-State Circuits Conference, February 2006.
[3] S. Rusu, S. Tam, H. Muljono, D. Ayers, J. Chang, "A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache", International Solid-State Circuits Conference, February 2006.
[4] M. Golden, S. Arekapudi, G. Dabney, M. Haertel, S. Hale, L. Herlinger, Y. Kim, K. McGrath, V. Palisetti, M. Singh, "A 2.6GHz Dual-Core 64b x86 Microprocessor with DDR2 Memory Support", International Solid-State Circuits Conference, February 2006.
[5] K. C. Chang, J. S. Shen, T. F. Chen, "Evaluation and Design Trade-Offs Between Circuit-Switched and Packet-Switched NOCs for Application-Specific SOCs", 43rd Design Automation Conference, July 2006.
[6] I. Issenin, E. Brockmeyer, B. Durinck, N. Dutt, "Multiprocessor System-on-Chip Data Reuse Analysis for Exploring Customized Memory Hierarchies", 43rd Design Automation Conference, July 2006.
Questions?