Upload
cameroon45
View
380
Download
1
Tags:
Embed Size (px)
Citation preview
The Power Architecture and Power.org word marks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.
IBM POWER6TM Based ServersMay 27, 2008Jeff Stuecheli, IBM
2
» Historical Background» POWER6TM Chip Components» Interconnect Topology» System Innovations» POWER6TM Systems» Benchmark Results
Agenda
Copyright © 2007 by IBM
All rights reserved.
3
IBM POWER development team
UnixServersGames
SOC Mainframes
collaboration
4
IBM POWER SYSTEMS
Leading up to POWER6TM
POWER4+ (130nm)
POWER4 (180nm)
POWER5 (130nm)
POWER5+ (90nm)
POWER6 (65nm)
20012003
20042006
2007
Copyright © 2007 by IBM
All rights reserved.
5
POWER6: Innovations in all areas of system design
Technology
Performance
Flexibility
RAS
Features and Function
System Energy Management
Copyright © 2007 by IBM
All rights reserved.
POWER6TM System Components
7
POWER6TM Chip Overview
» Ultra-high frequency dual-core chip• 7-way superscalar, 2-way SMT core
• 9 execution units
– 2LS, 2FP, 2FX, 1BR, 1VMX,1DFU
• 790M transistors
• Up to 64-core SMP systems
• 2x4MB on-chip L2
• 32MB On-chip L3 directory and controller
• Two memory controllers on-chip
» Technology• CMOS 65nm lithography, SOI
» High-speed elastic bus interface at 2:1 freq
• I/Os: 1953 signal, 5399 Power/Gnd
CoreCore
CoreCore
L2L2CtrlCtrl
L2L2CtrlCtrl
L2L2DataData
L2L2DataData
L2L2DataData
L2L2DataData
L3L3CtrlCtrl
L3L3CtrlCtrl
MemMemCtrlCtrl
MemMemCtrlCtrl
I/OI/OCtrlCtrl
SMPSMPInterconnectInterconnect
Copyright © 2007 by IBM
All rights reserved.
8
eDRAM L3 cache chip
» 32 MB eDRAM• 8 x 36 Mb macros• 16 banks per macro
» Cache Data controller• 8 entry read queue• 8 entry write queue• 8x128 Byte Write buffering
» 3 GHz max interface
Copyright © 2007 by IBM
All rights reserved.
9
POWER5 vs POWER6 Pipeline Comparison
Instruction Fetch Instruction Buffer/Decode Instruction Dispatch/Issue Data Fetch/Execute
FXU Dependent execution
Load Dependent execution
» Pipeline Comparison• POWER6 processor is ~2X frequency of POWER5• POWER6 instruction pipeline depth equivalent to POWER5
POWER5
POWER6
10
POWER6 Core Pipeline
»A
Instruction fetch pipelineInstruction fetch pipeline
BR/FX/Load pipelineBR/FX/Load pipeline
Floating Point PipelineFloating Point Pipeline Check Point Recovery PipelineCheck Point Recovery Pipeline
BR/CRBR/CR
FXFX
LOADLOAD
Legend :Legend : Pre-decode stage
Ifetch/Branch stage
Delayed/Transmit stage
Instruction Decode stage
Instruction Dispatch/Issue stage
Operand access/execution stage
Write back stage
Completion stage
Check Point stage
FX result bypass
Load result bypass
Float result bypass
Cache access stage
P1P1
P2P2
P3P3
P4P4 IC0IC0 ROTROTIC1IC1
EX1EX1
FMTFMTAGAGDISPDISPPDPDIB0IB0 IB1IB1
RFRF
RFRF
RFRF
RFRF DC0DC0 DC1DC1
EX2EX2 EX3EX3 EX4EX4 EX5EX5 EX6EX6 EX7EX7
EXEX
ISSISS ECCECC
ECCECC
BHTBHT
BHTBHT
IFARIFAR
Instruction dispatch pipelineInstruction dispatch pipeline
Interconnect Topology
12
POWER6TM: Extreme Bandwidth Capability
L3 Cache32MB 2X8B
2X8B
2X2B A
D
D L3 Dir
Fabric
L3 Cntl
Mem Cntl
A/D A/D
8B
Corona
Core (SMT+, VMX)
Core
4X2B4X1B 4X1B A
D D
8B
A/D
A/D
A/D
A/D
8B
8B
8B
8B
Intra-Node8W SMP Buses
4X1B A D D Mem Cntl4X1B
4X2B
GX+ Cntl GX+ Bus4B4B A/D
A/D
(SMT+, VMX)
A/DA/D
8B 8B
A/DA/D
8B 8B
L2 Cache (8.0MB)
75 GB/s
50GB/s 80GB/s
20GB/s
80GB/s
Total = 300 GB/s
Copyright © 2007 by IBM
All rights reserved.
13
M
POWER6TM Interconnect
» Memory DIMMs• 8 channels• Point to Point Elastic interface
at 4X DRAM frequency• Custom on-DIMM buffer chip
» L3 data• 32 MB capacity• Dual 16 byte (8 read + 8
write) interface at ½ CPU frequency
» SMP interconnect• Three Intra-Node SMP buses
for 8-way Node• Two Inter-Node SMP buses for
up to 8 Nodes
L3L3
POWER6TMM M M
M
M MM
Intra-NodeLinks
Inter-NodeLinks
DDR2/3 DRAM
Custom eDRAML3 Data
Copyright © 2007 by IBM
All rights reserved.
14
Example System Topology
Node Node
Node
Node Node
Node Node
P6
P6L3L3
P6
P6
M M MM
M M MM
Mem Ctrl
Mem Ctrl
64-way SMP: 8 Nodes x 4 Chips
Dual L3 Data Chips per POWER6TM chipDual Memory Controllers: 4 Channels each
Node is logical: System design drives physical embodiment
Node
Copyright © 2007 by IBM
All rights reserved.
System Innovation
16
System Innovation
» Bulletproof Computing– Error detection and recovery
» System Level Power– Datacenter centric control
» Partition Mobility– Datacenter virtualization
» Decimal Floating Point» Water Cooling» Coherence
– Scalable SMP
» Prefetch
Copyright © 2007 by IBM
All rights reserved.
17
POWER6TM Cache Coherence
»SubSpace Snooping– Hybrid approach: Combination of
cache snooping and directory lookup– Design Principle: Provide ideal latency
of cache snooping, with high scalability provided with full directories.
»Design Components– Cache States: MESI enhanced with line
location residue.– Memory Directory: Low cost coherence
information stored with cache line data
– Coherence Predictor: Prediction of cache line system state.
– Two Tiered: Node local and System scope
Copyright © 2007 by IBM
All rights reserved.
18
Sustained system memory bandwidth running HPC loop
0
200
400
600
800
1000
1200
1400
8 16 24 32 40 48 56 64
number cores
GB
/sec
sustained memory bandwidth(74% of peak)
peak 'global' snoop bandwidth
peak 'subspace' snoopbandwidth
Copyright © 2007 by IBM
All rights reserved.
19
Prefetch
»Core Centric Control»Design Details,
• 16 entry stream detection table• Differentiation of load, store, and load+store streams• Hardware/software synergy
– Prefetch table exposed to sw via enhanced dcbt instructions– Software controlled stream characteristics and initiation
• Multi-line prefetch– Multi-cacheline bus memory prefetch operation– Power and resource efficient
Copyright © 2007 by IBM
All rights reserved.
POWER6 Systems
21
The New Power Server Family
BladeCenter
Power 550
Power 570
Power 520
JS12 JS22
Power 595
Power 575
2 Processor Sockets(20 sockets/rack)
4 Processor Sockets(40 sockets/rack)
2,4,6,8 SocketsScalable Size
High RAS(20 sockets/rack)
32 SocketsHighest Bandwidth Design
Highest RAS
224 Sockets(14 - 16 skt 2U servers)
Direct Water Cooling
1 skt 2 skt(112 LP sockets/rack)
22
P6-p575P6-p575602 GFLOPs
8.42
TF
LO
Ps
121.6 GFLOPs
1.46
TF
LO
Ps
The P6-p575: Comparison to P5-p575+ for Reference
< 1/5th the space / Compute Op
Same size Rack as P5-575+
~ 1/3rd the energy / compute Op
~6x Density of P6 Rack Servers
~3x Density of P6 Blade Servers
23
• Save: • Energy• Space• Management Effort
• Replace:• 3 Racks of High Performance 1U Servers• 224 Ethernet Links and Switch ports• 112 Operating System Images• Associated on-Floor PDU Power capacity• Associated on-Floor CRAC Cooling capacity
• With:• 1 p6-575 rack (completely factory pre-assembled & pre-tested) • 48 ports of 10Gb Ethernet• 14 Operating System Images• ~15% on the on-Floor CRAC capacity the 1U Servers require
A Greener Alternative to 1U/2U Servers for Technical or Commercial Compute Intense Tasks
24
• P6-p575 Gains by Eliminating Components• The larger SMP of a P6-p575 is 8x larger then that of the typical 2-socket
• 1/8th the I/O• 1/8th the Ethernet switch ports & cables• 1Gb & 10Gb Ethernet are fully integrated into chipset • Less Memory is required due to 1/8th the OS images• Shared Memory usage across 8x the amount of compute power -> more efficient processing (virtualization).
I/O Unit
PC
I Ris
er(P
CI &
Gx
Ad
apte
r)
Processor Unit
AMD
PC
I Riser
(PC
I & G
xA
dap
ter)
Efficiency in Consolidation
25
• P6-p575 Reduces Power & Cooling Overhead through Advanced Design Techniques
• Processors are directly water cooled allowing them to run at <50 deg-C versus the typical 90 deg-C
• Leakage reduction• 2 Larger 3-phase electronically commutated fans are used in place of many small fans typically used• Efficient Voltage Regulators with lower “on-resistance” are used to convert power more efficiently
• Regulators are precisely adjusted for each P6 chip • AC-to-DC Conversion is moved out of the server package
• Logic is not heated by AC-to-DC conversion process• AC-to-DC Centralization• Centralized AC-to-DC enables better facility power efficiency
• Ability to accept higher voltages (480V in North America) • True balanced 3-phase power input with active PFC• Fewer server power feeds to manage
• Water Conditioning Units in the rack – with full redundancy• System accepts standard building chilled water – no preconditioning
Energy Efficient 575 design
26
P6-p575 100TFP6-p575 100TFpeakpeak 12 Rack Cluster12 Rack Cluster
27
NCAR bluefire 575 installation (76.4 TeraFLOPs)
Munich