William R. Pulleyblank, Director, Exploratory Server Systems and DCI, T.J. Watson Research Lab
Application Driven Supercomputing: An IBM Perspective
The Third Node of Science and Engineering
Experiment, Theory, and Computing & Simulation
Computer simulation examples: climate and weather modeling; fusion reactor and accelerator design; materials science; astrophysics; aircraft and automobile design
What Drives HPC? "The Need for Speed…"
Computational needs of technical, scientific, digital media and business applications approach or exceed the Petaflop/s range (the arithmetic is worked through in the sketch below):
CFD wing simulation: 512x64x256 grid (8.3 x 10^6 mesh points), 5,000 FLOPs per mesh point, 5,000 time steps/cycles: 2.15 x 10^14 FLOPs
CFD full plane simulation: 3.5 x 10^17 mesh points, 5,000 FLOPs per mesh point, 5,000 time steps/cycles: 8.7 x 10^24 FLOPs
Materials Science
Magnetic materials: current: 2,000 atoms, 2.64 TF/s, 512 GB; future: HDD simulation, 30 TF/s, 2 TB
Electronic structures: current: 300 atoms, 0.5 TF/s, 100 GB; future: 3,000 atoms, 50 TF/s, 2 TB
Digital Movies and Special Effects
~1 x 10^14 FLOPs per frame, 50 frames/sec, 90-minute movie: 2.7 x 10^19 FLOPs
~150 days on 2,000 1 GFLOP/s CPUs
Sources: D. Bailey, NERSC; Pixar; A. Jameson, et al.
Spare Parts Inventory Planning
Modeling the optimized deployment of 10,000 part numbers across 100 parts depots requires:
2 x 10^14 FLOPs (12 hours on 10 650 MHz CPUs)
2.4 Petaflop/s sustained performance (1-hour turn-around time)
The industry trend toward rapid, frequent modeling for timely business decision support drives higher sustained performance.
Source: B. Dietrich, IBM
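The estimates above are simple products of problem size, work per element, and number of steps. A minimal sketch of that arithmetic, using only the figures quoted on this slide (the helper function and variable names are ours, for illustration):

```python
# Back-of-envelope arithmetic behind the estimates quoted above.

def total_flops(points: float, flops_per_point: float, steps: float) -> float:
    """Total work = problem size x work per element x number of time steps."""
    return points * flops_per_point * steps

# CFD wing: 512x64x256 grid, 5,000 FLOPs per mesh point, 5,000 time steps
wing = total_flops(512 * 64 * 256, 5000, 5000)
print(f"CFD wing:       {wing:.2e} FLOPs")        # ~2.1e14

# CFD full plane: 3.5e17 mesh points, same per-point work and step count
plane = total_flops(3.5e17, 5000, 5000)
print(f"CFD full plane: {plane:.2e} FLOPs")       # ~8.7e24

# Digital movie: ~1e14 FLOPs per frame, 50 frames/s, 90-minute movie
movie = 1e14 * 50 * (90 * 60)
days = movie / (2000 * 1e9) / 86400               # 2,000 CPUs at 1 GFLOP/s each
print(f"Movie: {movie:.1e} FLOPs, ~{days:.0f} days on 2000 x 1 GFLOP/s CPUs")
```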
Supercomputer Peak Speed
[Chart: peak speed (flops, 1E+2 to 1E+16) vs. year introduced (1940-2010); doubling time = 1.5 yr. Systems plotted: ENIAC (vacuum tubes), UNIVAC, IBM 701, IBM 704, IBM 7090 (transistors), IBM Stretch, CDC 6600 (ICs), CDC 7600, CDC STAR-100 (vectors), CRAY-1, Cyber 205, X-MP2 (parallel vectors), CRAY-2, X-MP4, Y-MP8, i860 (MPPs), Delta, CM-5, Paragon, NWT, ASCI Red, CP-PACS, Blue Pacific, ASCI White, NEC Earth Simulator, Blue Gene/L, Blue Gene/P.]
HPC Systems: Advancing State-of-the-Art in Modeling and Simulation
Very High Resolution Simulation of Compressible Turbulence (1999 Gordon Bell Award recipient)
24 billion zones achieved 1.18 teraOPS on 5,832 IBM SP processors
Source: LLNL
HPC Systems: Advancing State-of-the-Art in Modeling and Simulation
Black Hole Merger Simulations
700,000 CPU hours on an IBM SP completed three-fourths of a full orbit coalescence
Source: NERSC and DoE Office of Science
HPC Systems: Impacting Science and Technology
Parallel Nuclear Weapons Explosion Simulation for the ASCI Primary Burn Milepost on the ASCI White and Blue Pacific machines
Supporting a stockpile of aging, highly optimized nuclear weapons
Source: LLNL
HPC Systems: Impacting Science and Technology
Supernova Explosions and Cosmology
Parallel smoothed particle hydrodynamics coupled with flux-limited diffusion radiation transport
The code incorporated the four forces of physics. Completing a 1 million particle simulation with 100,000 time steps took an IBM SP 3 months.
Source: NERSC and DoE Office of Science
What is a protein?
Examples of protein function:
Structural: keratin (skin, hair, nail), collagen (tendon), fibrin (clot)
Motive: actomyosin (muscle)
Transport: hemoglobin (blood)
Signaling: growth factors, insulin, hormones (blood)
Regulation: transcription factors (gene expression)
Catalysis: enzymes
A protein is a linear polymer, 30 to several hundred residues long.
There are 20 natural amino acids with different physicochemical properties, such as shape, volume, flexibility, hydrophobicity or hydrophilicity, and charge.
[Diagram: polypeptide backbone, repeating N-C-C units with H, O, and side-chain R groups.]
Molecular Dynamics Simulation
The protein is folded by mimicking its atomic mechanics in the computer; the protein drops into the free-energy funnel to a unique folded native state.
Goal: compute the folded structure of the protein.
Goal: study the folding process and understand its dynamics.
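As an illustration of what "mimicking its atomic mechanics" means, here is a minimal velocity-Verlet time-stepping sketch with a toy Lennard-Jones pair force. This is a generic MD skeleton for illustration only, not IBM's production code; all function names are ours:

```python
import numpy as np

def pair_forces(x, eps=1.0, sigma=1.0):
    """Toy Lennard-Jones pairwise forces (O(N^2)); real MD codes add bonded
    terms, electrostatics, cutoffs and neighbor lists."""
    n = len(x)
    f = np.zeros_like(x)
    for i in range(n):
        for j in range(i + 1, n):
            r = x[i] - x[j]
            d2 = np.dot(r, r)
            s6 = (sigma**2 / d2) ** 3
            fij = 24 * eps * (2 * s6**2 - s6) / d2 * r
            f[i] += fij
            f[j] -= fij
    return f

def velocity_verlet(x, v, mass, dt, steps):
    """Advance positions and velocities through `steps` femtosecond-scale steps."""
    f = pair_forces(x)
    for _ in range(steps):
        v += 0.5 * dt * f / mass   # half-kick
        x += dt * v                # drift
        f = pair_forces(x)
        v += 0.5 * dt * f / mass   # half-kick
    return x, v
```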
Ab initio protein folding calculation requirements: why does it take 1 Petaflop/s?

Description                      Count          Comment
Atoms                            ~32,000        300 amino acid protein + water
Force evaluations / time step    10^9           Pairwise atom-atom interactions
FLOPs / force evaluation         150            Typical molecular dynamics
FLOPs / time step                1.5 x 10^11
Each time step                   ~10^-15 s      1-5 femtoseconds
Total simulation time            10^-3 s        Protein folds in ~1 millisecond
Total time steps                 2 x 10^11
FLOPs / simulation               3 x 10^22

Total FLOP/s to fold a protein:
Execution time                   3 x 10^7 s     1 year
Required FLOP/s                  ~1 x 10^15     1 Petaflop/s

The estimate is conservatively based on a quadratic algorithm. Better algorithms will reduce the running time (somewhat), but the usual surprises will increase it, and good science will require multiple simulations.
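Chaining the table's rows together gives the Petaflop/s figure; a small sketch of that arithmetic (all values from the table, variable names ours):

```python
# Chaining the table entries to see where the 1 Petaflop/s figure comes from.
atoms           = 32_000            # 300-residue protein + water
pair_forces     = atoms * atoms     # ~1e9 pairwise atom-atom interactions per step
flops_per_force = 150               # typical molecular dynamics
flops_per_step  = pair_forces * flops_per_force   # ~1.5e11
timestep        = 5e-15             # seconds (1-5 fs per step)
sim_time        = 1e-3              # protein folds in ~1 millisecond
steps           = sim_time / timestep             # ~2e11
total_flops     = flops_per_step * steps          # ~3e22
wallclock       = 3.15e7            # one year, in seconds
required_rate   = total_flops / wallclock         # ~1e15 FLOP/s = 1 Petaflop/s
print(f"{required_rate:.1e} FLOP/s needed to fold one protein in a year")
```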
IBM's Business Model: Increased Application Capability, Manageable Costs
Maintain industry leadership in systems designs through a continued partnership with the scientific community
Leverage technology improvements that drive system performance
System Design Challenges: Cost-Effective Uniprocessor Building Blocks Exploit Concurrency
Balanced Systems Design
[Diagram: algorithms and application software running over processor execution units (MIPS and FLOPS), L1 cache, L2 cache, memory & I/O bridge, main memory and storage, and the interconnect, with latency and bandwidth constraints at each level (the "memory wall").]
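One common way to make the "memory wall" in this diagram concrete is a roofline-style bound that compares a kernel's arithmetic intensity to the machine's compute-to-bandwidth balance. The sketch below is illustrative only; the peak and bandwidth numbers are placeholders, not figures from this talk:

```python
def attainable_gflops(intensity, peak_gflops, mem_bw_gbs):
    """Roofline-style bound: a kernel is memory-bound until its arithmetic
    intensity (FLOPs per byte moved) exceeds peak_gflops / mem_bw_gbs."""
    return min(peak_gflops, intensity * mem_bw_gbs)

# Illustrative machine balance: 5.6 GFLOP/s peak, 5.5 GB/s memory bandwidth
peak, bw = 5.6, 5.5
for name, intensity in [("stream triad (~0.08 FLOP/byte)", 0.083),
                        ("blocked dense matrix multiply",   8.0)]:
    print(f"{name}: {attainable_gflops(intensity, peak, bw):.2f} GFLOP/s attainable")
```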
System Design Issues: Sustained Performance Is More Than Just Hardware
[Stack diagram: semiconductors; packaging; processor, cache, memory, and I/O; multiprocessor enablement; operating system; middleware and utilities; development tools and environment (compilers, debuggers, optimizers, GUIs, message passing libraries, visualizers, math libraries); algorithms; applications. Together these layers determine application execution performance and application development performance.]
Power Efficient Computing: IT Electrical Power Needs Projected to Reach Excessive Proportions
Power density growth is imposing constraints on server capability.
Power-efficient CMOS processors can achieve high performance with significantly lower power dissipation.
[Chart: web pages served (MB/sec) vs. power dissipation (Watts), comparing a power-efficient CMOS microprocessor with a high-performance CMOS microprocessor.]
[Chart: module heat flux (W/cm^2) vs. year (1950-2010) for bipolar and CMOS technologies.]
IBM's HPC Strategy: Solving Problems More Quickly at Lower Cost
Aggressively evolve and improve our POWER architecture based HPC product line
Develop additional advanced systems based on loosely coupled clusters
Research and overcome obstacles to parallelism and other revolutionary approaches to supercomputing
ASCI Purple and BlueGene/L
Immediate-term Mid-term Long-term
BlueGene/L
Chip (2 processors): 2.8/5.6 GF/s, 4 MB
Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
Node Board (16 compute cards, 32 chips, 4x4x2): 90/180 GF/s, 8 GB DDR
Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
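The per-level figures follow from multiplying the chip numbers up through the packaging counts; a quick sketch of that scaling (counts and the 5.6 GF/s per-chip peak are from the slide, the loop is ours):

```python
# Multiplying the BlueGene/L chip figures up the packaging hierarchy.
chip_gflops = 5.6                 # peak GF/s per chip (2 processors)
chips = 1
for level, factor in [("chip", 1), ("compute card", 2),
                      ("node board", 16), ("cabinet", 32), ("system", 64)]:
    chips *= factor
    print(f"{level:12s}: {chips:6d} chips, {chips * chip_gflops:10.1f} GF/s peak")
# system: 65,536 chips -> ~367,000 GF/s, i.e. the ~360 TF/s peak quoted above
```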
The Principal Networks
65,536 nodes interconnected with three integrated networks
3-Dimensional Torus: virtual cut-through hardware routing to maximize efficiency; 1.4 Gb/s on all 12 node links (total of 2.1 GB/s per node); communication backbone; 67 TB/s total torus interconnect bandwidth; 1.4/2.8 TB/s bisection bandwidth
Global Tree: one-to-all or all-to-all broadcast functionality; arithmetic operations implemented in the tree; 2.8 GB/s of bandwidth from any node to all other nodes; tree latency less than 12 µs; ~90 TB/s total binary tree bandwidth (64k machine)
Ethernet: incorporated into every node ASIC; disk I/O; host control, booting and diagnostics
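In a 3D torus every node has six nearest neighbours with wraparound at the edges, which is where the 12 per-node links (six each for send and receive) and the 2.1 GB/s per-node figure come from. A minimal addressing sketch, assuming the 64x32x32 system dimensions quoted above (function names ours):

```python
DIMS = (64, 32, 32)                  # x, y, z extent of the full 64k-node torus
LINK_GBS = 1.4 / 8                   # 1.4 Gb/s per link ~ 0.175 GB/s

def torus_neighbors(node):
    """Return the six nearest neighbours of a node, wrapping at the edges."""
    neighbours = []
    for axis, size in enumerate(DIMS):
        for step in (-1, +1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % size
            neighbours.append(tuple(coord))
    return neighbours

print(torus_neighbors((0, 0, 0)))              # wraps to (63,0,0), (1,0,0), (0,31,0), ...
print("per-node bandwidth:", 12 * LINK_GBS)    # 6 in + 6 out links -> ~2.1 GB/s
```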
Physical Design
Compute Card
9 x 256Mb DRAM; 16B interface
Heatsinks designed for 15W (measuring ~13W @1.6V)
54 mm (2.125”)
206 mm (8.125”) wide, 14 layers
Metral 4000 connector
BlueGene/L - two node compute card
Node Card
32-way (4x4x2) node card
DC-DC converters
Gb Ethernet connectors through tailstock
Latching and retention
Midplane torus, tree, barrier, clock, Ethernet service port connects
16 compute cards
2 IO cards
Ethernet-JTAG FPGA
BlueGene/L - system view
Blue Matter - a Molecular Dynamics Code
Separate the MD program into three subpackages (offload function to the host where possible):
MD core engine (massively parallel, minimal in size)
Setup programs to set up force field assignments, etc.
Analysis tools to analyze MD trajectories, etc.
Multiple force field support: CHARMM force field (done); OPLS-AA force field (done); AMBER force field (done); polarizable force field (desired)
Potential parallelization strategies: interaction-based; volume-based; atom-based
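Of the strategies listed, volume-based decomposition assigns each atom to the node that owns the region of space containing it. A minimal sketch of such a mapping follows; it is a hypothetical illustration, not Blue Matter's actual scheme, and all names and box sizes are ours:

```python
import numpy as np

def volume_decompose(positions, box, grid=(8, 8, 8)):
    """Map each atom to a node in a 3D processor grid based on the
    sub-volume of the simulation box that contains it."""
    cell = np.asarray(box) / np.asarray(grid)          # size of one sub-volume
    idx = np.floor(np.asarray(positions) / cell).astype(int) % np.asarray(grid)
    # Node id = linearized (x, y, z) index into the processor grid
    return idx[:, 0] * grid[1] * grid[2] + idx[:, 1] * grid[2] + idx[:, 2]

# 32,000 atoms in a 100 x 100 x 100 box spread over an 8x8x8 processor grid
atoms = np.random.rand(32_000, 3) * 100.0
owners = volume_decompose(atoms, box=(100.0, 100.0, 100.0))
print("atoms owned by node 0:", int((owners == 0).sum()))
```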
Simulation Capacity
[Chart: time steps/month (1E+6 to 1E+13) vs. system size (1,000 to 100,000 atoms) for: 1 rack of Power3 ('01); a 512-node BG/L partition (2H03); 40 x 512-node BG/L partitions (4Q04); 1,000,000 GFLOP/second (2H06).]
Science on Blue Gene
Alzheimer's Research: Helping Drug Discovery
Alzheimer's Disease has been associated with the accumulation of amyloid plaque in the brain
Beta-secretase is a prime therapeutic target for Alzheimer’s drug discovery efforts
No experimental data exists for the details of the relationship between the protein and membrane
HPC Applications and Algorithms
[Map of application areas against basic algorithms & numerical methods, with parallelizability rated good / better / best. Algorithm classes: transport, partial differential equations, ordinary differential equations, fields, N-body, Fourier methods, Monte Carlo, discrete events, graph theoretic, pattern matching, symbolic processing, raster graphics. Application areas include weather and climate, cloud physics, biosphere/geosphere, geophysical fluids, data assimilation, fluid dynamics, aerodynamics, pipeline and multiphase flows, flows in porous media, reaction-diffusion, petroleum reservoirs, seismic processing, structural and fracture mechanics, condensed matter electronic structure, molecular modeling, nanotechnology, quantum and actinide chemistry, chemical dynamics, chemical reactors, CVD, plasma processing, atomic scatterings, biomolecular dynamics / protein folding, rational drug design, genome processing, phylogenetic trees, population genetics, astrophysics and cosmology, quantum chromodynamics, nuclear structure, neutron transport, radiation, electromagnetics, magnet design, orbital mechanics, multibody dynamics, VLSI design, manufacturing systems, military logistics, transportation systems, air traffic control, distribution networks, electrical grids, economics and ecosystems models, signal processing, MRI imaging, diffraction and inversion problems, tomographic reconstruction, crystallography, computer vision, scientific visualization, virtual reality, virtual prototypes, computational steering, multimedia collaboration tools, CAD, databases, large-scale data mining, intelligent agents, intelligent search, cryptography, number theory, automated deduction, computer algebra, and neural networks.]
Source: Rick Stevens, Argonne National Lab and The University of Chicago
Power Systems Architectural Enhancements
Immediate-term Mid-term Long-term
Achieve and Sustain Multiple Design Points
Continued evolutionary technological improvements for current HPC systems
Package level integration technologies provide differentiation
Silicon semiconductor technology and performance advancements continue
Open standard software: Linux, MPI, OGSA
Satisfy the Spectrum of Customer Performance and Price Needs
POWER6 Server Roadmap
2001 Power4 (180 nm): two 1+ GHz cores, shared L2, distributed switch; chip multiprocessing; dynamic LPARs (16)
2002-3 Power4+ (130 nm): two 1.7+ GHz cores, shared L2, distributed switch; larger L2; increased bandwidths; more LPARs (32)
2004 Power5 (130 nm): two > GHz cores, shared L2, distributed switch; simultaneous multi-threading; sub-processor partitioning; enhanced scalability and parallelism; high throughput performance; enhanced memory subsystem
2005 Power5+ (90 nm): two >> GHz cores, shared L2, distributed switch
2006 Power6 (65 nm): ultra high frequency cores, L2 caches, advanced system features; total virtualization; mainframe RAS; larger SMPs; blade optimized; 4X performance of POWER5; reduced size/power
PERCS Project
Immediate-term Mid-term Long-term
[Diagram: a self-adapting system surrounded by target application domains: CFD, CAE, chemistry / electronic structures, materials science, bioinformatics, climate and weather, nuclear energy.]
PERCS: a consortium of IBM, LANL, and 13 universities
[Diagram: applications running on OS & middleware on the machine; adapt the application to the system, and adapt the system to the application.]
PERCS - key technologies
Basic technology: low-FO4 circuits; SiGe; modular packaging; power management
System architecture: polymorphic processors; intelligent memory controllers; total virtualization; power-aware HW-SW codesign; intelligent storage
System software: K42 operating system; dynamic & continuous optimization; self-healing, self-management; fail-in-place strategy
Applications & development: morphogenic SW development; new programming languages (StreamIt, UPC); atomic sections; user-transparent reliability; MindFrames programming environment
Application Driven Design
DARPA PERCS Project: explore innovative adaptive system architectures for high efficiency, scalability, software tools and physical constraints
Explore innovative extensions to IBM's Power architecture to optimize system designs for the broadest possible range of application computational requirements, in conjunction with LLNL
Blue Gene: advance the state-of-the-art for parallelism in computer design and software; deliver a Limited Production System in conjunction with LLNL, ANL, and several universities
Close collaboration and partnerships with the national labs, universities and government agencies
Supercomputing Roadmap
[Chart: TeraFlops (1 to 100,000) vs. year (1995-2015), showing IBM Deep Blue, US Dept. of Energy ASCI systems, IBM BlueGene/L, and IBM BlueGene/P.]
Source: ASCI Roadmap (www.llnl.gov/asci), IBM. Brain ops/sec: Kurzweil 1999, The Age of Spiritual Machines; Moravec 1998, www.transhumanist.com/volume1/moravec.htm