Page 1

The Power Architecture and Power.org word marks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.

IBM POWER6™ Based Servers
May 27, 2008
Jeff Stuecheli, IBM

Page 2

Agenda

» Historical Background
» POWER6™ Chip Components
» Interconnect Topology
» System Innovations
» POWER6™ Systems
» Benchmark Results

Page 3

IBM POWER development team

[Diagram: one POWER development team collaborating across Unix servers, games, SOC, and mainframes]

Page 4

IBM POWER SYSTEMS

Leading up to POWER6™

» 2001: POWER4 (180nm)
» 2003: POWER4+ (130nm)
» 2004: POWER5 (130nm)
» 2006: POWER5+ (90nm)
» 2007: POWER6 (65nm)

Page 5

POWER6: Innovations in all areas of system design

Technology

Performance

Flexibility

RAS

Features and Function

System Energy Management

Page 6

POWER6™ System Components

Page 7

POWER6™ Chip Overview

» Ultra-high frequency dual-core chip
• 7-way superscalar, 2-way SMT core
• 9 execution units
  – 2 LS, 2 FP, 2 FX, 1 BR, 1 VMX, 1 DFU
• 790M transistors
• Up to 64-core SMP systems
• 2 x 4MB on-chip L2
• 32MB L3 with on-chip directory and controller
• Two memory controllers on-chip
» Technology
• CMOS 65nm lithography, SOI
» High-speed elastic bus interface at 2:1 frequency
• I/Os: 1953 signal, 5399 power/ground

[Die floorplan: two cores, L2 control and L2 data arrays, L3 control, memory controllers, I/O control, and SMP interconnect]

Page 8

eDRAM L3 cache chip

» 32 MB eDRAM
• 8 x 36 Mb macros
• 16 banks per macro
» Cache data controller
• 8-entry read queue
• 8-entry write queue
• 8 x 128-byte write buffering

» 3 GHz max interface

Page 9

POWER5 vs POWER6 Pipeline Comparison

[Pipeline diagram: POWER5 vs. POWER6 stages across instruction fetch, instruction buffer/decode, instruction dispatch/issue, and data fetch/execute, with FXU-dependent and load-dependent execution paths marked]

» Pipeline Comparison
• POWER6 processor is ~2x the frequency of POWER5
• POWER6 instruction pipeline depth equivalent to POWER5

Page 10

POWER6 Core Pipeline

[Pipeline diagram: instruction fetch, instruction dispatch, BR/FX/load, floating-point, and checkpoint-recovery pipelines. The legend lists pre-decode, ifetch/branch, delayed/transmit, instruction decode, instruction dispatch/issue, operand access/execution, cache access, write back, completion, and checkpoint stages, plus FX, load, and float result bypasses.]

Page 11

Interconnect Topology

Page 12

POWER6™: Extreme Bandwidth Capability

[Bandwidth diagram: two cores (SMT+, VMX) sharing 8.0 MB of L2; 32 MB L3 over dual 2x8B data buses; two memory controllers, each with 4x2B read and 4x1B write channels; 8B intra-node (8-way SMP) and inter-node SMP buses; GX+ I/O bus, 4B each direction. Labeled bandwidths: 75 GB/s memory, 80 GB/s L3, 80 GB/s and 50 GB/s for the intra-node and inter-node SMP fabric, 20 GB/s GX+.]

Total = 300 GB/s
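As a rough cross-check of the 80 GB/s L3 figure (the 5 GHz core clock below is an assumption; shipping POWER6 parts spanned roughly 3.5 to 5 GHz), bus bandwidth is simply width times transfer rate:

\[
\underbrace{2 \times (8 + 8)\,\mathrm{B}}_{\text{dual read + write ports}} \times \underbrace{2.5\,\mathrm{GHz}}_{\text{interface at } \tfrac{1}{2}\ \text{core clock}} = 80\ \mathrm{GB/s}.
\]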

Page 13

POWER6™ Interconnect

» Memory DIMMs
• 8 channels
• Point-to-point elastic interface at 4x DRAM frequency (see the worked bandwidth sketch below)
• Custom on-DIMM buffer chip
» L3 data
• 32 MB capacity
• Dual 16-byte (8 read + 8 write) interface at ½ CPU frequency
» SMP interconnect
• Three intra-node SMP buses for an 8-way node
• Two inter-node SMP buses for up to 8 nodes

[Diagram: POWER6™ chip with intra-node and inter-node SMP links, DDR2/3 DRAM DIMMs on eight channels, and custom eDRAM L3 data chips]
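A similar hedged check for the memory path: the bandwidth diagram on the previous page shows 2 B of read and 1 B of write per channel across eight channels, and if the DIMMs are DDR2-800 parts (an assumption, not stated here), the 4x elastic interface runs at about 3.2 GT/s:

\[
8\ \text{channels} \times (2 + 1)\,\mathrm{B} \times 3.2\ \mathrm{GT/s} \approx 77\ \mathrm{GB/s},
\]

which lines up with the ~75 GB/s memory figure shown there.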

Page 14

Example System Topology

[Diagram: eight nodes, each holding four POWER6™ chips with their L3 data chips, memory controllers, and DIMMs]

64-way SMP: 8 nodes x 4 chips
Dual L3 data chips per POWER6™ chip
Dual memory controllers: 4 channels each
Node is logical: system design drives physical embodiment

Page 15

System Innovation

Page 16

System Innovation

» Bulletproof Computing – Error detection and recovery
» System Level Power – Datacenter-centric control
» Partition Mobility – Datacenter virtualization
» Decimal Floating Point
» Water Cooling
» Coherence – Scalable SMP
» Prefetch

Page 17

POWER6™ Cache Coherence

» SubSpace Snooping
– Hybrid approach: combination of cache snooping and directory lookup
– Design principle: provide the ideal latency of cache snooping with the high scalability of full directories
» Design Components
– Cache states: MESI enhanced with line-location residue
– Memory directory: low-cost coherence information stored with the cache line data
– Coherence predictor: prediction of a cache line's system state
– Two-tiered: node-local and system scope (see the sketch below)
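A minimal sketch of how these pieces could fit together, not IBM's implementation: a predictor chooses a node-local snoop first, and a memory-directory bit kept with the line decides whether a full system broadcast is still needed. All type and function names here are hypothetical.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of two-tiered (node-local vs. system-wide) snooping. */
enum scope { SCOPE_NODE, SCOPE_SYSTEM };

typedef struct {
    bool may_be_cached_off_node;   /* coarse "memory directory" state kept with the line */
} mem_dir_entry;

typedef struct {
    uint8_t off_node_hits;         /* small counter: recent requests that needed system scope */
} scope_predictor;

/* Predicted starting scope for a new request. */
enum scope choose_scope(const scope_predictor *pred)
{
    return (pred->off_node_hits > 1) ? SCOPE_SYSTEM : SCOPE_NODE;
}

/* Resolve a cache miss: try the node first, escalate when the node-local
 * snoop misses or the memory directory says a remote copy may exist. */
enum scope resolve_miss(scope_predictor *pred,
                        const mem_dir_entry *dir,
                        bool node_snoop_hit)
{
    if (choose_scope(pred) == SCOPE_NODE &&
        node_snoop_hit && !dir->may_be_cached_off_node)
        return SCOPE_NODE;              /* satisfied without a system-wide broadcast */

    if (pred->off_node_hits < 3)
        pred->off_node_hits++;          /* learn: this line tends to need global scope */
    return SCOPE_SYSTEM;                /* safe fallback: full system snoop */
}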

Page 18

Sustained system memory bandwidth running HPC loop

[Chart: sustained memory bandwidth (GB/sec, scale 0-1400) vs. number of cores (8 to 64), plotting sustained memory bandwidth (74% of peak) against peak 'global' snoop bandwidth and peak 'subspace' snoop bandwidth]
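The slide does not say which loop was measured; a STREAM-style triad like the sketch below is the kind of bandwidth-bound kernel such numbers usually come from. The array size and timing method are assumptions, not details from the presentation.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Bandwidth-bound "HPC loop" in the STREAM-triad style (illustrative only). */
#define N (1 << 25)   /* 32M doubles per array, ~256 MB each: far larger than the caches */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];          /* triad: two loads and one store per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = 3.0 * N * sizeof(double);   /* bytes moved to/from memory by the triad */
    printf("a[0]=%g  sustained bandwidth: %.1f GB/s\n", a[0], bytes / secs / 1e9);

    free(a); free(b); free(c);
    return 0;
}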

Page 19

Prefetch

» Core-centric control
» Design details
• 16-entry stream detection table
• Differentiation of load, store, and load+store streams
• Hardware/software synergy
  – Prefetch table exposed to software via enhanced dcbt instructions (see the sketch below)
  – Software-controlled stream characteristics and initiation
• Multi-line prefetch
  – Multi-cacheline bus memory prefetch operation
  – Power and resource efficient
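A hedged illustration of the hardware/software synergy above: the mechanism named on the slide is the enhanced dcbt instruction, while the sketch below uses GCC's portable __builtin_prefetch as a stand-in; the ~1 KB prefetch distance is an assumed value, not POWER6 tuning guidance.

#include <stddef.h>

#define LINE_BYTES     128                                  /* POWER6 cache line size */
#define PREFETCH_AHEAD (8 * LINE_BYTES / sizeof(double))    /* ~1 KB ahead (assumed distance) */

/* Software-initiated streaming read: touch cache lines ahead of the consumer
 * loop. On POWER6 the same intent can be expressed with dcbt stream hints. */
double sum_stream(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&x[i + PREFETCH_AHEAD], 0 /* read */, 0 /* streaming */);
        sum += x[i];
    }
    return sum;
}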

Page 20

POWER6 Systems

Page 21

The New Power Server Family

» BladeCenter JS12 / JS22: 1-socket / 2-socket blades (112 LP sockets/rack)
» Power 520: 2 processor sockets (20 sockets/rack)
» Power 550: 4 processor sockets (40 sockets/rack)
» Power 570: 2, 4, 6, or 8 sockets, scalable size, high RAS (20 sockets/rack)
» Power 575: 224 sockets (14 x 16-socket 2U servers), direct water cooling
» Power 595: 32 sockets, highest-bandwidth design, highest RAS

Page 22

The P6-p575: Comparison to P5-p575+ for Reference

[Figure: P6-p575 at 602 GFLOPs per node and 8.42 TFLOPs per rack vs. P5-p575+ at 121.6 GFLOPs per node and 1.46 TFLOPs per rack]

• < 1/5th the space per compute op
• Same size rack as the P5-575+
• ~1/3rd the energy per compute op
• ~6x the density of P6 rack servers
• ~3x the density of P6 blade servers
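These figures are consistent with simple peak arithmetic, under assumptions that do not appear on the slide: a 4.7 GHz POWER6 clock with 32 cores per P6-p575 node, a 1.9 GHz POWER5+ clock with 16 cores per P5-p575+ node, 4 flops per core per cycle, and 14 versus 12 nodes per rack:

\[
32 \times 4.7\,\mathrm{GHz} \times 4\,\tfrac{\mathrm{flops}}{\mathrm{cycle}} \approx 602\ \mathrm{GFLOPs}, \qquad 14 \times 602\ \mathrm{GFLOPs} \approx 8.42\ \mathrm{TFLOPs},
\]
\[
16 \times 1.9\,\mathrm{GHz} \times 4\,\tfrac{\mathrm{flops}}{\mathrm{cycle}} = 121.6\ \mathrm{GFLOPs}, \qquad 12 \times 121.6\ \mathrm{GFLOPs} \approx 1.46\ \mathrm{TFLOPs}.
\]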

Page 23

A Greener Alternative to 1U/2U Servers for Technical or Commercial Compute-Intensive Tasks

• Save: energy, space, and management effort
• Replace:
  • 3 racks of high-performance 1U servers
  • 224 Ethernet links and switch ports
  • 112 operating system images
  • Associated on-floor PDU power capacity
  • Associated on-floor CRAC cooling capacity
• With:
  • 1 p6-575 rack (completely factory pre-assembled and pre-tested)
  • 48 ports of 10Gb Ethernet
  • 14 operating system images
  • ~15% of the on-floor CRAC capacity the 1U servers require

Page 24

• P6-p575 gains by eliminating components
• The SMP of a P6-p575 node is 8x larger than that of the typical 2-socket server (worked numbers below):
  • 1/8th the I/O
  • 1/8th the Ethernet switch ports and cables
  • 1Gb and 10Gb Ethernet are fully integrated into the chipset
  • Less memory is required due to 1/8th the OS images
  • Shared memory usage across 8x the compute power -> more efficient processing (virtualization)
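The 8x ratio follows from the counts on the previous slides, assuming a rack of 14 of the 16-socket 2U nodes listed in the server-family overview:

\[
112 \times 2 = 224 = 14 \times 16\ \text{sockets}, \qquad \frac{112\ \text{OS images}}{14\ \text{nodes}} = 8\ \text{images consolidated per node}.
\]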

[Photo: 1U server internals showing the processor unit and I/O unit with PCI riser (PCI & GX adapter)]

Efficiency in Consolidation

Page 25

Energy Efficient 575 design

• P6-p575 reduces power and cooling overhead through advanced design techniques
  • Processors are directly water cooled, allowing them to run at <50 deg C versus the typical 90 deg C
    • Leakage reduction
  • Two larger 3-phase electronically commutated fans are used in place of the many small fans typically used
  • Efficient voltage regulators with lower "on-resistance" convert power more efficiently
    • Regulators are precisely adjusted for each P6 chip
  • AC-to-DC conversion is moved out of the server package
    • Logic is not heated by the AC-to-DC conversion process
  • AC-to-DC centralization
    • Centralized AC-to-DC enables better facility power efficiency
    • Ability to accept higher voltages (480V in North America)
    • True balanced 3-phase power input with active PFC
    • Fewer server power feeds to manage
  • Water conditioning units in the rack, with full redundancy
    • System accepts standard building chilled water; no preconditioning

Page 26

P6-p575 100 TF peak, 12-rack cluster

Page 27

NCAR bluefire 575 installation (76.4 TeraFLOPs)

Page 28

Munich