William R. Pulleyblank, Director, Exploratory Server Systems and DCI, T.J. Watson Research Lab
Application Driven Supercomputing: An IBM Perspective
The Third Node of Science and Engineering
Experiment, Theory, and Computing & Simulation
Computer simulation examples: climate and weather modeling; fusion reactor and accelerator design; materials science; astrophysics; aircraft and automobile design
What Drives HPC? "The Need for Speed…"
Computational needs of technical, scientific, digital media and business applications approach or exceed the Petaflop/s range (the arithmetic is worked through in the sketch below):
CFD wing simulation: 512x64x256 grid (8.3 x 10^6 mesh points), 5,000 FLOPs per mesh point, 5,000 time steps/cycles: 2.15 x 10^14 FLOPs
CFD full plane simulation: 3.5 x 10^17 mesh points, 5,000 FLOPs per mesh point, 5,000 time steps/cycles: 8.7 x 10^24 FLOPs
Materials Science
Magnetic materials: current: 2,000 atoms, 2.64 TF/s, 512 GB; future: HDD simulation, 30 TF/s, 2 TB
Electronic structures: current: 300 atoms, 0.5 TF/s, 100 GB; future: 3,000 atoms, 50 TF/s, 2 TB
Digital Movies and Special Effects
~1 x 10^14 FLOPs per frame, 50 frames/sec, 90-minute movie: 2.7 x 10^19 FLOPs
~150 days on 2,000 1 GFLOP/s CPUs
Sources: D. Bailey, NERSC; Pixar; A. Jameson, et al.
Spare Parts Inventory Planning
Modeling the optimized deployment of 10,000 part numbers across 100 parts depots requires:
2 x 10^14 FLOPs (12 hours on 10 650 MHz CPUs)
2.4 Petaflop/s sustained performance (1-hour turn-around time)
The industry trend toward rapid, frequent modeling for timely business decision support drives higher sustained performance.
Source: B. Dietrich, IBM
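The estimates above are simple products of problem size, work per element, and number of steps. A minimal sketch of that arithmetic, using only the figures quoted on this slide (the helper function and variable names are ours, for illustration):

```python
# Back-of-envelope arithmetic behind the estimates quoted above.

def total_flops(points: float, flops_per_point: float, steps: float) -> float:
    """Total work = problem size x work per element x number of time steps."""
    return points * flops_per_point * steps

# CFD wing: 512x64x256 grid, 5,000 FLOPs per mesh point, 5,000 time steps
wing = total_flops(512 * 64 * 256, 5000, 5000)
print(f"CFD wing:       {wing:.2e} FLOPs")        # ~2.1e14

# CFD full plane: 3.5e17 mesh points, same per-point work and step count
plane = total_flops(3.5e17, 5000, 5000)
print(f"CFD full plane: {plane:.2e} FLOPs")       # ~8.7e24

# Digital movie: ~1e14 FLOPs per frame, 50 frames/s, 90-minute movie
movie = 1e14 * 50 * (90 * 60)
days = movie / (2000 * 1e9) / 86400               # 2,000 CPUs at 1 GFLOP/s each
print(f"Movie: {movie:.1e} FLOPs, ~{days:.0f} days on 2000 x 1 GFLOP/s CPUs")
```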
Supercomputer Peak Speed
[Chart: peak speed (flops, 1E+2 to 1E+16) vs. year introduced (1940-2010); doubling time = 1.5 yr. Systems plotted: ENIAC (vacuum tubes), UNIVAC, IBM 701, IBM 704, IBM 7090 (transistors), IBM Stretch, CDC 6600 (ICs), CDC 7600, CDC STAR-100 (vectors), CRAY-1, Cyber 205, X-MP2 (parallel vectors), CRAY-2, X-MP4, Y-MP8, i860 (MPPs), Delta, CM-5, Paragon, NWT, ASCI Red, CP-PACS, Blue Pacific, ASCI White, NEC Earth Simulator, Blue Gene/L, Blue Gene/P.]
HPC Systems: Advancing State-of-the-Art in Modeling and Simulation
Very High Resolution Simulation of Compressible Turbulence (1999 Gordon Bell Award recipient)
24 billion zones achieved 1.18 teraOPS on 5,832 IBM SP processors
Source: LLNL
HPC Systems: Advancing State-of-the-Art in Modeling and Simulation
Black Hole Merger Simulations
700,000 CPU hours on an IBM SP completed three-fourths of a full orbit coalescence
Source: NERSC and DoE Office of Science
HPC Systems: Impacting Science and Technology
Parallel Nuclear Weapons Explosion Simulation for the ASCI Primary Burn Milepost on the ASCI White and Blue Pacific machines
Supporting a stockpile of aging, highly optimized nuclear weapons
Source: LLNL
HPC Systems: Impacting Science and Technology
Supernova Explosions and Cosmology
Parallel smoothed particle hydrodynamics coupled with flux-limited diffusion radiation transport
The code incorporated the four forces of physics. Completing a 1 million particle simulation with 100,000 time steps took an IBM SP 3 months.
Source: NERSC and DoE Office of Science
What is a protein?
Examples of protein function:
Structural: keratin (skin, hair, nail), collagen (tendon), fibrin (clot)
Motive: actomyosin (muscle)
Transport: hemoglobin (blood)
Signaling: growth factors, insulin, hormones (blood)
Regulation: transcription factors (gene expression)
Catalysis: enzymes
A protein is a linear polymer, 30 to several hundred residues long.
There are 20 natural amino acids with different physicochemical properties, such as shape, volume, flexibility, hydrophobicity or hydrophilicity, and charge.
[Diagram: polypeptide backbone, repeating N-C-C units with H, O, and side-chain R groups.]
Molecular Dynamics Simulation
The protein is folded by mimicking its atomic mechanics in the computer; the protein drops into the free-energy funnel to a unique folded native state.
Goal: compute the folded structure of the protein.
Goal: study the folding process and understand its dynamics.
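As an illustration of what "mimicking its atomic mechanics" means, here is a minimal velocity-Verlet time-stepping sketch with a toy Lennard-Jones pair force. This is a generic MD skeleton for illustration only, not IBM's production code; all function names are ours:

```python
import numpy as np

def pair_forces(x, eps=1.0, sigma=1.0):
    """Toy Lennard-Jones pairwise forces (O(N^2)); real MD codes add bonded
    terms, electrostatics, cutoffs and neighbor lists."""
    n = len(x)
    f = np.zeros_like(x)
    for i in range(n):
        for j in range(i + 1, n):
            r = x[i] - x[j]
            d2 = np.dot(r, r)
            s6 = (sigma**2 / d2) ** 3
            fij = 24 * eps * (2 * s6**2 - s6) / d2 * r
            f[i] += fij
            f[j] -= fij
    return f

def velocity_verlet(x, v, mass, dt, steps):
    """Advance positions and velocities through `steps` femtosecond-scale steps."""
    f = pair_forces(x)
    for _ in range(steps):
        v += 0.5 * dt * f / mass   # half-kick
        x += dt * v                # drift
        f = pair_forces(x)
        v += 0.5 * dt * f / mass   # half-kick
    return x, v
```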
Ab initio protein folding calculation requirements: why does it take 1 Petaflop/s?

Description                      Count          Comment
Atoms                            ~32,000        300 amino acid protein + water
Force evaluations / time step    10^9           Pairwise atom-atom interactions
FLOPs / force evaluation         150            Typical molecular dynamics
FLOPs / time step                1.5 x 10^11
Each time step                   ~10^-15 s      1-5 femtoseconds
Total simulation time            10^-3 s        Protein folds in ~1 millisecond
Total time steps                 2 x 10^11
FLOPs / simulation               3 x 10^22

Total FLOP/s to fold a protein:
Execution time                   3 x 10^7 s     1 year
Required FLOP/s                  ~1 x 10^15     1 Petaflop/s

The estimate is conservatively based on a quadratic algorithm. Better algorithms will reduce the running time (somewhat), but the usual surprises will increase it, and good science will require multiple simulations.
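Chaining the table's rows together gives the Petaflop/s figure; a small sketch of that arithmetic (all values from the table, variable names ours):

```python
# Chaining the table entries to see where the 1 Petaflop/s figure comes from.
atoms           = 32_000            # 300-residue protein + water
pair_forces     = atoms * atoms     # ~1e9 pairwise atom-atom interactions per step
flops_per_force = 150               # typical molecular dynamics
flops_per_step  = pair_forces * flops_per_force   # ~1.5e11
timestep        = 5e-15             # seconds (1-5 fs per step)
sim_time        = 1e-3              # protein folds in ~1 millisecond
steps           = sim_time / timestep             # ~2e11
total_flops     = flops_per_step * steps          # ~3e22
wallclock       = 3.15e7            # one year, in seconds
required_rate   = total_flops / wallclock         # ~1e15 FLOP/s = 1 Petaflop/s
print(f"{required_rate:.1e} FLOP/s needed to fold one protein in a year")
```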
IBM's Business Model: Increased Application Capability, Manageable Costs
Maintain industry leadership in systems designs through a continued partnership with the scientific community
Leverage technology improvements that drive system performance
System Design Challenges: Cost-Effective Uniprocessor Building Blocks Exploit Concurrency
Balanced Systems Design
[Diagram: algorithms and application software running over processor execution units (MIPS and FLOPS), L1 cache, L2 cache, memory & I/O bridge, main memory and storage, and the interconnect, with latency and bandwidth constraints at each level (the "memory wall").]
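One common way to make the "memory wall" in this diagram concrete is a roofline-style bound that compares a kernel's arithmetic intensity to the machine's compute-to-bandwidth balance. The sketch below is illustrative only; the peak and bandwidth numbers are placeholders, not figures from this talk:

```python
def attainable_gflops(intensity, peak_gflops, mem_bw_gbs):
    """Roofline-style bound: a kernel is memory-bound until its arithmetic
    intensity (FLOPs per byte moved) exceeds peak_gflops / mem_bw_gbs."""
    return min(peak_gflops, intensity * mem_bw_gbs)

# Illustrative machine balance: 5.6 GFLOP/s peak, 5.5 GB/s memory bandwidth
peak, bw = 5.6, 5.5
for name, intensity in [("stream triad (~0.08 FLOP/byte)", 0.083),
                        ("blocked dense matrix multiply",   8.0)]:
    print(f"{name}: {attainable_gflops(intensity, peak, bw):.2f} GFLOP/s attainable")
```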
System Design Issues: Sustained Performance Is More Than Just Hardware
[Stack diagram: semiconductors; packaging; processor, cache, memory, and I/O; multiprocessor enablement; operating system; middleware and utilities; development tools and environment (compilers, debuggers, optimizers, GUIs, message passing libraries, visualizers, math libraries); algorithms; applications. Together these layers determine application execution performance and application development performance.]
Power Efficient Computing: IT Electrical Power Needs Projected to Reach Excessive Proportions
Power density growth is imposing constraints on server capability.
Power-efficient CMOS processors can achieve high performance with significantly lower power dissipation.
[Chart: web pages served (MB/sec) vs. power dissipation (Watts), comparing a power-efficient CMOS microprocessor with a high-performance CMOS microprocessor.]
[Chart: module heat flux (W/cm^2) vs. year (1950-2010) for bipolar and CMOS technologies.]
IBM's HPC Strategy: Solving Problems More Quickly at Lower Cost
Aggressively evolve and improve our POWER architecture based HPC product line
Develop additional advanced systems based on loosely coupled clusters
Research and overcome obstacles to parallelism and other revolutionary approaches to supercomputing
ASCI Purple and BlueGene/L
Immediate-term Mid-term Long-term
BlueGene/L
Chip (2 processors): 2.8/5.6 GF/s, 4 MB
Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
Node Board (16 compute cards, 32 chips, 4x4x2): 90/180 GF/s, 8 GB DDR
Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
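The per-level figures follow from multiplying the chip numbers up through the packaging counts; a quick sketch of that scaling (counts and the 5.6 GF/s per-chip peak are from the slide, the loop is ours):

```python
# Multiplying the BlueGene/L chip figures up the packaging hierarchy.
chip_gflops = 5.6                 # peak GF/s per chip (2 processors)
chips = 1
for level, factor in [("chip", 1), ("compute card", 2),
                      ("node board", 16), ("cabinet", 32), ("system", 64)]:
    chips *= factor
    print(f"{level:12s}: {chips:6d} chips, {chips * chip_gflops:10.1f} GF/s peak")
# system: 65,536 chips -> ~367,000 GF/s, i.e. the ~360 TF/s peak quoted above
```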
The Principal Networks
65,536 nodes interconnected with three integrated networks
3-Dimensional Torus: virtual cut-through hardware routing to maximize efficiency; 1.4 Gb/s on all 12 node links (total of 2.1 GB/s per node); communication backbone; 67 TB/s total torus interconnect bandwidth; 1.4/2.8 TB/s bisection bandwidth
Global Tree: one-to-all or all-to-all broadcast functionality; arithmetic operations implemented in the tree; 2.8 GB/s of bandwidth from any node to all other nodes; tree latency less than 12 µs; ~90 TB/s total binary tree bandwidth (64k machine)
Ethernet: incorporated into every node ASIC; disk I/O; host control, booting and diagnostics
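In a 3D torus every node has six nearest neighbours with wraparound at the edges, which is where the 12 per-node links (six each for send and receive) and the 2.1 GB/s per-node figure come from. A minimal addressing sketch, assuming the 64x32x32 system dimensions quoted above (function names ours):

```python
DIMS = (64, 32, 32)                  # x, y, z extent of the full 64k-node torus
LINK_GBS = 1.4 / 8                   # 1.4 Gb/s per link ~ 0.175 GB/s

def torus_neighbors(node):
    """Return the six nearest neighbours of a node, wrapping at the edges."""
    neighbours = []
    for axis, size in enumerate(DIMS):
        for step in (-1, +1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % size
            neighbours.append(tuple(coord))
    return neighbours

print(torus_neighbors((0, 0, 0)))              # wraps to (63,0,0), (1,0,0), (0,31,0), ...
print("per-node bandwidth:", 12 * LINK_GBS)    # 6 in + 6 out links -> ~2.1 GB/s
```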
Physical Design
Compute Card
9 x 256Mb DRAM; 16B interface
Heatsinks designed for 15W (measuring ~13W @1.6V)
54 mm (2.125”)
206 mm (8.125”) wide, 14 layers
Metral 4000 connector
BlueGene/L - two node compute card
Node Card
32-way (4x4x2) node card
DC-DC converters
Gb Ethernet connectors through tailstock
Latching and retention
Midplane torus, tree, barrier, clock, Ethernet service port connects
16 compute cards
2 IO cards
Ethernet-JTAG FPGA
BlueGene/L - system view
Blue Matter - a Molecular Dynamics Code
Separate the MD program into three subpackages (offload function to the host where possible):
MD core engine (massively parallel, minimal in size)
Setup programs to set up force field assignments, etc.
Analysis tools to analyze MD trajectories, etc.
Multiple force field support: CHARMM force field (done); OPLS-AA force field (done); AMBER force field (done); polarizable force field (desired)
Potential parallelization strategies: interaction-based; volume-based; atom-based
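Of the strategies listed, volume-based decomposition assigns each atom to the node that owns the region of space containing it. A minimal sketch of such a mapping follows; it is a hypothetical illustration, not Blue Matter's actual scheme, and all names and box sizes are ours:

```python
import numpy as np

def volume_decompose(positions, box, grid=(8, 8, 8)):
    """Map each atom to a node in a 3D processor grid based on the
    sub-volume of the simulation box that contains it."""
    cell = np.asarray(box) / np.asarray(grid)          # size of one sub-volume
    idx = np.floor(np.asarray(positions) / cell).astype(int) % np.asarray(grid)
    # Node id = linearized (x, y, z) index into the processor grid
    return idx[:, 0] * grid[1] * grid[2] + idx[:, 1] * grid[2] + idx[:, 2]

# 32,000 atoms in a 100 x 100 x 100 box spread over an 8x8x8 processor grid
atoms = np.random.rand(32_000, 3) * 100.0
owners = volume_decompose(atoms, box=(100.0, 100.0, 100.0))
print("atoms owned by node 0:", int((owners == 0).sum()))
```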
Simulation Capacity
[Chart: time steps/month (1E+6 to 1E+13) vs. system size (1,000 to 100,000 atoms) for: 1 rack of Power3 ('01); a 512-node BG/L partition (2H03); 40 x 512-node BG/L partitions (4Q04); 1,000,000 GFLOP/second (2H06).]
Science on Blue Gene
Alzheimer's Research: Helping Drug Discovery
Alzheimer's Disease has been associated with the accumulation of amyloid plaque in the brain
Beta-secretase is a prime therapeutic target for Alzheimer’s drug discovery efforts
No experimental data exists for the details of the relationship between the protein and membrane
HPC Applications and Algorithms
[Map of application areas against basic algorithms & numerical methods, with parallelizability rated good / better / best. Algorithm classes: transport, partial differential equations, ordinary differential equations, fields, N-body, Fourier methods, Monte Carlo, discrete events, graph theoretic, pattern matching, symbolic processing, raster graphics. Application areas include weather and climate, cloud physics, biosphere/geosphere, geophysical fluids, data assimilation, fluid dynamics, aerodynamics, pipeline and multiphase flows, flows in porous media, reaction-diffusion, petroleum reservoirs, seismic processing, structural and fracture mechanics, condensed matter electronic structure, molecular modeling, nanotechnology, quantum and actinide chemistry, chemical dynamics, chemical reactors, CVD, plasma processing, atomic scatterings, biomolecular dynamics / protein folding, rational drug design, genome processing, phylogenetic trees, population genetics, astrophysics and cosmology, quantum chromodynamics, nuclear structure, neutron transport, radiation, electromagnetics, magnet design, orbital mechanics, multibody dynamics, VLSI design, manufacturing systems, military logistics, transportation systems, air traffic control, distribution networks, electrical grids, economics and ecosystems models, signal processing, MRI imaging, diffraction and inversion problems, tomographic reconstruction, crystallography, computer vision, scientific visualization, virtual reality, virtual prototypes, computational steering, multimedia collaboration tools, CAD, databases, large-scale data mining, intelligent agents, intelligent search, cryptography, number theory, automated deduction, computer algebra, and neural networks.]
Source: Rick Stevens, Argonne National Lab and The University of Chicago
Power Systems Architectural Enhancements
Immediate-term Mid-term Long-term
Achieve and Sustain Multiple Design Points
Continued evolutionary technological improvements for current HPC systems
Package level integration technologies provide differentiation
Silicon semiconductor technology and performance advancements continue
Open standard software: Linux, MPI, OGSA
Satisfy the Spectrum of Customer Performance and Price Needs
POWER6 Server Roadmap
2001 Power4 (180 nm): two 1+ GHz cores, shared L2, distributed switch; chip multiprocessing; dynamic LPARs (16)
2002-3 Power4+ (130 nm): two 1.7+ GHz cores, shared L2, distributed switch; larger L2; increased bandwidths; more LPARs (32)
2004 Power5 (130 nm): two > GHz cores, shared L2, distributed switch; simultaneous multi-threading; sub-processor partitioning; enhanced scalability and parallelism; high throughput performance; enhanced memory subsystem
2005 Power5+ (90 nm): two >> GHz cores, shared L2, distributed switch
2006 Power6 (65 nm): ultra high frequency cores, L2 caches, advanced system features; total virtualization; mainframe RAS; larger SMPs; blade optimized; 4X performance of POWER5; reduced size/power
PERCS Project
Immediate-term Mid-term Long-term
[Diagram: a self-adapting system surrounded by target application domains: CFD, CAE, chemistry / electronic structures, materials science, bioinformatics, climate and weather, nuclear energy.]
PERCS: a consortium of IBM, LANL, and 13 universities
[Diagram: applications running on OS & middleware on the machine; adapt the application to the system, and adapt the system to the application.]
PERCS - key technologies
Basic technology: low-FO4 circuits; SiGe; modular packaging; power management
System architecture: polymorphic processors; intelligent memory controllers; total virtualization; power-aware HW-SW codesign; intelligent storage
System software: K42 operating system; dynamic & continuous optimization; self-healing, self-management; fail-in-place strategy
Applications & development: morphogenic SW development; new programming languages (StreamIt, UPC); atomic sections; user-transparent reliability; MindFrames programming environment
Application Driven Design
DARPA PERCS Project: explore innovative adaptive system architectures for high efficiency, scalability, software tools and physical constraints
Explore innovative extensions to IBM's Power architecture to optimize system designs for the broadest possible range of application computational requirements, in conjunction with LLNL
Blue Gene: advance the state-of-the-art for parallelism in computer design and software; deliver a Limited Production System in conjunction with LLNL, ANL, and several universities
Close collaboration and partnerships with the national labs, universities and government agencies
Supercomputing Roadmap
[Chart: TeraFlops (1 to 100,000) vs. year (1995-2015), showing IBM Deep Blue, US Dept. of Energy ASCI systems, IBM BlueGene/L, and IBM BlueGene/P.]
Source: ASCI Roadmap (www.llnl.gov/asci), IBM. Brain ops/sec: Kurzweil 1999, The Age of Spiritual Machines; Moravec 1998, www.transhumanist.com/volume1/moravec.htm