
Dr. George Chiu, IEEE Fellow, IBM T.J. Watson Research Center, Yorktown Heights, NY

Architecture of the IBM Blue Gene Supercomputer

President Obama Honors IBM's Blue Gene Supercomputer With National Medal Of Technology And Innovation
Ninth time IBM has received nation's most prestigious tech award
Blue Gene has led to breakthroughs in science, energy efficiency and analytics

WASHINGTON, D.C. - 18 Sep 2009: President Obama recognized IBM (NYSE: IBM) and its Blue Gene family of supercomputers with the National Medal of Technology and Innovation, the country's most prestigious award given to leading innovators for technological achievement.

President Obama will personally bestow the award at a special White House ceremony on October 7. IBM, which earned the National Medal of Technology and Innovation on eight other occasions, is the only company recognized with the award this year.

Blue Gene's speed and expandability have enabled business and science to address a wide range of complex problems and make more informed decisions -- not just in the life sciences, but also in astronomy, climate, simulations, modeling and many other areas. Blue Gene systems have helped map the human genome, investigated medical therapies, safeguarded nuclear arsenals, simulated radioactive decay, replicated brain power, flown airplanes, pinpointed tumors, predicted climate trends, and identified fossil fuels -- all without the time and money that would have been required to physically complete these tasks.

The system also reflects breakthroughs in energy efficiency. With the creation of Blue Gene, IBM dramatically shrank the physical size and energy needs of a computing system whose processing speed would otherwise have required a dedicated power plant capable of powering thousands of homes.

The influence of the Blue Gene supercomputer's energy-efficient design and computing model can be seen today across the Information Technology industry. Today, 18 of the top 20 most energy-efficient supercomputers in the world are built on IBM high performance computing technology, according to the latest Supercomputing 'Green500 List' announced by Green500.org in July 2009.


CMOS Scaling in the Petaflop Era
• Three decades of exponential clock rate (and electrical power!) growth has ended
• Instruction Level Parallelism (ILP) growth has ended
• Single-threaded performance improvement is dead (Bill Dally)
• Yet Moore's Law continues in transistor count
• Industry response: multi-core, i.e. double the number of cores every 18 months instead of the clock frequency (and power!)

Source: “The Landscape of Computer Architecture,” John Shalf, NERSC/LBNL, presented at ISC07, Dresden, June 25, 2007

[Chart: frequency (GHz), ops/cycle, and number of compute engines (log scale, 1 to 10,000,000) for clusters, Blue Gene, and future systems (2010-2015)]

Over the next 8-10 years:
• Frequency might improve by 2x
• Ops/cycle might improve by 2-4x
• The only opportunity for dramatic performance improvement is in the number of compute engines: frequency and ops/cycle together offer at most ~8x, so everything beyond that must come from parallelism

Performance Improvement Trend

Source: Tilak Agerwala, ICS 08

Blue Gene Roadmap
• QCDSP – 600 GF, based on TI DSP C31 (1998)
• QCDOC – 20 TF, based on IBM 180nm ASIC (2003)
• BG/L (5.7 TF/rack) – 130nm ASIC (1999-2004 GA)
  – 104 racks, 212,992 cores, 596 TF/s, 210 MF/W; dual-core system-on-chip
  – 0.5/1 GB/node
• BG/P (13.9 TF/rack) – 90nm ASIC (2004-2007 GA)
  – 72 racks, 294,912 cores, 1 PF/s, 357 MF/W; quad-core SOC, DMA
  – 2/4 GB/node
  – SMP support, OpenMP, MPI
• BG/Q (209 TF/rack) – 20 PF/s

HPCC 2009

IBM BG/P – 0.501 PF peak (36 racks)
• Class 1: Number 1 on G-RandomAccess (117 GUPS)
• Class 2: Number 1

Cray XT5 – 2.331 PF peak
• Class 1: Number 1 on G-HPL (1533 TF/s)
• Class 1: Number 1 on EP-Stream (398 TB/s)
• Class 1: Number 1 on G-FFT (11 TF/s)

Source: www.top500.org

BlueGene/P Packaging Hierarchy
• Chip – 4 processors; 13.6 GF/s; 8 MB EDRAM
• Compute Card – 1 chip, 20 DRAMs; 13.6 GF/s; 2.0 GB DDR2 (4.0 GB as of 6/30/08)
• Node Card – 32 compute cards (32 chips, 4x4x2), 0-1 I/O cards; 435 GF/s; 64 (128) GB
• Rack – 32 node cards, cabled 8x8x16; 13.9 TF/s; 2 (4) TB
• System – 72 racks, 72x32x32; 1 PF/s; 144 (288) TB
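Each level's peak follows from the clock rate and FPU width. Below is a minimal sanity-check sketch in C, assuming the PPC450's published 850 MHz clock and 4 flops/cycle per core (two-pipe fused multiply-add); neither figure is stated on this slide.

```c
#include <stdio.h>

/* Sanity-check the BlueGene/P packaging-hierarchy peak numbers.
 * Assumptions (not on the slide): 850 MHz clock, double FPU
 * retiring 4 flops/cycle per core. */
int main(void) {
    double clock_hz = 850e6;            /* PPC450 clock (assumed)     */
    double flops_per_cycle = 4.0;       /* 2-wide FMA = 4 flops       */
    int    cores_per_chip = 4;

    double chip      = clock_hz * flops_per_cycle * cores_per_chip;
    double node_card = chip * 32;       /* 32 compute cards/node card */
    double rack      = node_card * 32;  /* 32 node cards/rack         */
    double system    = rack * 72;       /* 72 racks                   */

    printf("chip:      %6.1f GF/s\n", chip / 1e9);       /* 13.6   */
    printf("node card: %6.1f GF/s\n", node_card / 1e9);  /* 435.2  */
    printf("rack:      %6.1f TF/s\n", rack / 1e12);      /* 13.9   */
    printf("system:    %6.3f PF/s\n", system / 1e15);    /* ~1.003 */
    return 0;
}
```

Multiplying up the hierarchy reproduces the 13.6 GF/s chip, 435 GF/s node card, 13.9 TF/s rack, and ~1 PF/s system figures above.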

BlueGene/P compute ASIC

[Block diagram: four PPC450 cores, each with 32 KB L1 instruction and 32 KB L1 data caches and a double FPU; a private L2 with snoop filter per core; multiplexing switches and a DMA engine connecting the cores to two shared 4 MB eDRAM banks (usable as L3 cache or on-chip memory) with shared L3 directories with ECC; two DDR-2 controllers with ECC driving a 13.6 GB/s DDR-2 DRAM bus (512b data + 72b ECC); shared SRAM; a hybrid PMU with 256x64b SRAM; and the network interfaces: 6 torus links (3.4 Gb/s bidirectional), 3 collective links (6.8 Gb/s bidirectional), 4 global barriers or interrupts, 10 Gbit Ethernet, and JTAG access.]

[Chart: relative utilization of Blue Gene/P chip area by component type: hard cores, EDRAMs, I/O cells, decaps, fuse/BIST, soft cores, arrays, custom logic]

Port snoop filter
• Each of the 4 port filters contains three complementary filters for optimal filtering:
  – Snoop cache: keeps track of recent snoops. Addresses that were recently invalidated need not be invalidated again, because we know they are no longer in L1.
  – Stream registers: addresses that L1 requests from L2 are monitored and recorded in stream registers. Using this information, we can discard invalidations for addresses known not to be in L1.
  – Range filter: address ranges are set to filter out all coherence requests with addresses either inside or outside the specified range.
• All filters run concurrently (a sketch of the combined decision logic follows):
  – Combined rejection of unnecessary snoops is typically over 90%
  – All required snoops are forwarded to the L1s
  – Performance improvements of up to 35%, depending on the application
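To make the combination concrete, here is a minimal C sketch of the decision logic. All names, sizes, and field layouts are hypothetical simplifications rather than the BG/P design; what it preserves is the policy described above: a snoop is dropped only if some filter can prove the line is not in L1, and forwarded otherwise.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified model of one port snoop filter. */
#define SNOOP_CACHE_LINES 32
#define STREAM_REGS        4

typedef struct {
    uint64_t snoop_cache[SNOOP_CACHE_LINES]; /* recently invalidated lines */
    uint64_t stream_base[STREAM_REGS];       /* base of each L1 stream     */
    uint64_t stream_mask[STREAM_REGS];       /* low bits allowed to differ */
    uint64_t range_lo, range_hi;             /* accepted coherence window  */
} port_filter_t;

/* Snoop cache: line already invalidated and not re-fetched -> not in L1. */
static bool recently_invalidated(const port_filter_t *f, uint64_t line) {
    for (int i = 0; i < SNOOP_CACHE_LINES; i++)
        if (f->snoop_cache[i] == line) return true;
    return false;
}

/* Stream registers: L1 only holds lines it actually requested from L2,
 * so a line matching no recorded stream cannot be in L1. */
static bool possibly_in_l1(const port_filter_t *f, uint64_t line) {
    for (int i = 0; i < STREAM_REGS; i++)
        if ((line & ~f->stream_mask[i]) == f->stream_base[i]) return true;
    return false;
}

/* All three filters run concurrently; any one may prove the snoop
 * unnecessary. Required snoops are always forwarded to L1. */
bool forward_snoop(const port_filter_t *f, uint64_t line) {
    if (line < f->range_lo || line > f->range_hi) return false; /* range   */
    if (recently_invalidated(f, line))            return false; /* cache   */
    if (!possibly_in_l1(f, line))                 return false; /* streams */
    return true;  /* cannot prove it is absent from L1: forward */
}

int main(void) {
    port_filter_t f = {
        .snoop_cache = {0x100},              /* line 0x100 just invalidated */
        .stream_base = {0x200}, .stream_mask = {0xFF},
        .range_lo = 0, .range_hi = ~0ULL,
    };
    printf("snoop 0x100 -> %s\n", forward_snoop(&f, 0x100) ? "fwd" : "drop");
    printf("snoop 0x210 -> %s\n", forward_snoop(&f, 0x210) ? "fwd" : "drop");
    return 0;
}
```

The conservative default in forward_snoop is what guarantees correctness: the filters may only reject snoops they can prove unnecessary, never required ones.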

[Diagram: port snoop filter microarchitecture. Incoming snoops (snp_address, snp_request) are checked in parallel against the snoop cache, the stream registers (kept up to date from local processor cache misses), and the range filter; decision logic, gated by an enable control, either forwards the request into the snoop queue or returns the token.]

Snoop filter efficiency

[Chart: simulation results showing filter rate (%, scale 0-120) for FFT, Barnes, LU, Ocean, Raytrace, and Cholesky; bigger is better.]


Execution Modes in BG/P per Node

(Hardware abstractions shown in black, software abstractions in blue in the original figure.)

[Diagram: one 4-core node per mode, showing how processes (P0-P3) and threads (T0-T3) map onto the cores.]

• SMP Mode – 1 process, 1-4 threads/process
• Dual Mode – 2 processes, 1-2 threads/process
• Quad Mode (VNM) – 4 processes, 1 thread/process

Next Generation HPC
– Many core
– Expensive memory
– Two-tiered programming model (see the sketch below)
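As a concrete (hypothetical) illustration of the two-tiered model, here is a minimal C sketch combining MPI across processes with OpenMP threads inside each process; the BG/P roadmap slide lists both. Launched with one rank per node in SMP mode it matches the 1-process, 4-thread picture above; launcher flags and thread counts are system-specific assumptions.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Two-tiered parallelism: MPI ranks across nodes (or per core in
 * quad/VNM mode), OpenMP threads within each rank (SMP/dual mode). */
int main(int argc, char **argv) {
    int provided, rank, nranks;
    /* Ask for FUNNELED: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0;
    /* Thread tier: e.g. 4 threads per process in SMP mode. */
    #pragma omp parallel reduction(+:local)
    {
        local += 1.0;  /* stand-in for per-thread work */
    }

    /* Process tier: combine per-rank results across the machine. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d ranks contributed %.0f units of work\n", nranks, total);
    MPI_Finalize();
    return 0;
}
```

Quad (VNM) mode corresponds to running four single-threaded ranks per node instead, with no OpenMP tier.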

[Diagram, drawn to scale: rack cooling evolution from air-cooled BG/L to hydro-air-cooled BG/P (rack footprints labeled 36" and 48"):
• Air-cooled BG/L – 25 kW/rack, 3000 CFM/rack
• Air-cooled BG/P – 40 kW/rack, 5000 CFM/rack
• Hydro-air concept for BlueGene/P (hydro-air-cooled BG/P) – 40 kW/rack, 5000 CFM/row
Key: BG rack with cards and fans; airflow; air plenum; air-to-water heat exchanger.]

Main Memory Capacity per Rack

[Chart: main memory capacity per rack (GB, scale 0-4500) for LRZ IA64, Cray XT4, ASC Purple, Roadrunner (RR), BG/P, Sun TACC, and SGI ICE]

Peak Memory Bandwidth per node (byte/flop)

[Chart: peak memory bandwidth per node (byte/flop, scale 0-2) for BG/P 4-core, Roadrunner, Cray XT3 2-core, Cray XT5 4-core, POWER5, Itanium 2, Sun TACC, and SGI ICE]
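The byte/flop metric in this chart can be checked against earlier numbers. For BG/P, assuming the 13.6 GB/s DDR-2 DRAM bus from the compute ASIC diagram and the 13.6 GF/s node peak:

$$\frac{13.6\ \text{GB/s per node}}{13.6\ \text{GF/s per node}} = 1.0\ \text{byte/flop},$$

which is where the BG/P 4-core bar sits.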

Main Memory Bandwidth per Rack

[Chart: main memory bandwidth per rack (GB/s, scale 0-14000) for LRZ Itanium, Cray XT5, ASC Purple, Roadrunner (RR), BG/P, Sun TACC, and SGI ICE]

BlueGene/P Interconnection Networks

3-Dimensional Torus
• Interconnects all compute nodes (73,728); see the MPI sketch after this slide
• Virtual cut-through hardware routing
• 3.4 Gb/s on all 12 node links (5.1 GB/s per node)
• 0.5 µs latency between nearest neighbors, 5 µs to the farthest
• MPI: 3 µs latency for one hop, 10 µs to the farthest
• Communications backbone for computations
• 1.7/3.9 TB/s bisection bandwidth, 188 TB/s total bandwidth

Collective Network
• One-to-all broadcast functionality
• Reduction operations functionality
• 6.8 Gb/s of bandwidth per link per direction
• Latency of one-way tree traversal 1.3 µs, MPI 5 µs
• ~62 TB/s total binary-tree bandwidth (72k machine)
• Interconnects all compute and I/O nodes (1152)

Low-Latency Global Barrier and Interrupt
• Latency of one way to reach all 72K nodes: 0.65 µs, MPI 1.6 µs
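Applications typically reach the 3D torus through MPI's Cartesian topology interface. Below is a minimal sketch in C, assuming the full 72x32x32 machine (on a real system the dims would be derived from the partition size) and periodic wrap-around in all three dimensions:

```c
#include <mpi.h>
#include <stdio.h>

/* Map MPI ranks onto the 3D torus and find nearest neighbors.
 * dims = 72x32x32 assumes the full system; periods = 1 makes
 * each dimension wrap around, matching the torus topology. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int dims[3]    = {72, 32, 32};
    int periods[3] = {1, 1, 1};        /* torus: wrap in x, y, z */
    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, /*reorder=*/1, &torus);

    if (torus != MPI_COMM_NULL) {
        int coords[3], rank;
        MPI_Comm_rank(torus, &rank);
        MPI_Cart_coords(torus, rank, 3, coords);

        /* -x / +x neighbors; nearest-neighbor exchanges like these
         * see the 0.5 us hardware (3 us MPI) one-hop latency. */
        int xminus, xplus;
        MPI_Cart_shift(torus, /*dim=*/0, /*disp=*/1, &xminus, &xplus);
        if (rank == 0)
            printf("rank 0 at (%d,%d,%d): -x nbr %d, +x nbr %d\n",
                   coords[0], coords[1], coords[2], xminus, xplus);
        MPI_Comm_free(&torus);
    }
    MPI_Finalize();
    return 0;
}
```

Passing reorder=1 lets the MPI library renumber ranks to match the physical torus layout, which is what keeps logical neighbors one hop apart.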

Interprocessor Peak Bandwidth per node (byte/flop)

[Chart: interprocessor peak bandwidth per node (byte/flop, scale 0-0.8) for BG/L,P, Cray XT5 4c, Cray XT4 2c, NEC ES, Power5, Itanium 2, Sun TACC, x86 cluster, Dell Myrinet, and Roadrunner]

Total power consumption of the BlueGene/P chip configuration

[Chart: average power (W, scale 0-35) for Idle, UMT2k, SPHOT, SPPM, CrystalMK, and DGEMM workloads, split into average node power and average memory power]


IBM® System Blue Gene®/P Solution: Expanding the Limits of Breakthrough Science

Summary

Blue Gene/P: Facilitating Extreme Scalability
– Ultrascale capability computing when nothing else will satisfy
– Provides customers with enough computing resources to help solve grand-challenge problems
– Provides competitive advantages for customers' applications looking for extreme computing power
– Energy-conscious solution supporting green initiatives
– Familiar open/standards operating environment
– Simple porting of parallel codes

Key Solution Highlights
– Leadership performance, space-saving design, low power requirements, high reliability, and easy manageability