Keeping Programming in MIND for the MTV Petaflops Generation
Thomas Sterling
California Institute of Technology and
NASA Jet Propulsion Laboratory
June 24, 2002
ICS’02 Panel on Programming Models for Future HPC Systems
June 24, 2002 Thomas Sterling - Caltech & JPL 2
Future HPC Systems?
Commodity Clusters: Beowulf, NOW, Chiba City, HP SC
MPP: T3E, ASCI White
PVP: SX-6, SV-2, Earth Simulator
Hybrid Technology: HTMT
Petaflops in 2010 to 2012
Flops performance gain: 25X at system level
- Clock rate > 10 GHz (chip wide)
- The rest is ILP and SOC
- Petaflops in < 10K chips

Memory capacity: 64X
- Petabyte in 128K chips for ~$4M
- 7%/yr speedup in access time
- Will take ~30 times longer to read the contents of a memory chip
- Will take > 100X longer in clock cycles (not including latency)

I/O interface will grow slowly toward 1 Tbps (maybe optics?)
1 Petaflops system in 2010 will cost between $50M and $200M
Power is unclear: between 3 MWatts and 25 MWatts
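The "~30 times longer" figure follows from capacity outpacing access speed. A back-of-envelope check, assuming a 10-year horizon and a ~2 GHz baseline clock in 2002 (both my assumptions, not figures from the slide):

```python
# Back-of-envelope check of the slide's memory-read-time claims.
# Assumptions (not from the slide): a 10-year horizon and a ~2 GHz
# baseline clock in 2002 versus the projected >10 GHz.

years = 10
capacity_growth = 64                 # memory capacity grows 64X
access_speedup = 1.07 ** years       # 7%/yr access-time gain, ~2X overall

# Time to stream a chip's contents scales as capacity / access speed.
read_time_growth = capacity_growth / access_speedup   # ~32X: "~30 times longer"

clock_speedup = 10e9 / 2e9           # assumed 2 GHz -> 10 GHz
cycles_growth = read_time_growth * clock_speedup      # ~160X: "> 100X in cycles"

print(round(read_time_growth, 1), round(cycles_growth))
```

The point of the arithmetic: even with steady 7%/yr access-time improvement, a 64X capacity gain leaves memory touch time an order of magnitude worse, and worse still when measured in (faster) clock cycles.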
Problems with Conventional Approach
Processor bound
- ALU and IP constrained – ALU utilization is the wrong metric
- Latency intolerant
- Parallelism exposure inadequate – e.g. irregular data
- Overhead too large – performed in software

Memory bound
- Memory bandwidth strangled
- Memory access latency isolation – 100’s to 1000’s of cycles
- 100X time to touch all of memory
- Memory hierarchy dies with low temporal locality
- TLB coherency and scalability

Too difficult to program – resource management by hand
Too big, too hot, too $$
Vulnerable to reliability failures
Multithreaded Vector with MIND: A Locality-Driven Architecture
Distributed shared memory
- Not cache coherent
- Support for atomic distributed object operations

High efficiency through dynamic management of latency and overhead
- Will make programming easier by simplifying the burden of resource control

Two execution regimes:
- High temporal locality
- Low or no data reuse

Very high-speed Multithreaded Vector compute processors (MTV)
- Multithreaded vector architecture
- Compile-time managed caches with hardware assist
- Parcel message-driven computation and synchronization

Processor-in-Memory data processors (MIND)
- Expose very high memory bandwidth while reducing latency and power
- Perform low/no-reuse data operations directly in memory
- Support system overhead functions for high efficiency

Percolation: prestaging to initialize computations, hiding latency and overhead while providing dynamic load balancing
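The parcel idea above — move work to data rather than data to work — can be sketched as a toy simulation. `Parcel`, `MemoryNode`, and the `atomic_add` handler are hypothetical names for illustration, not the MIND instruction set:

```python
# Toy sketch of parcel message-driven computation: instead of fetching
# data to a remote ALU, a parcel carries a method name to the memory
# node that owns the data, which executes it locally and atomically.

class Parcel:
    def __init__(self, target_addr, method, *args):
        self.target_addr = target_addr
        self.method = method
        self.args = args

class MemoryNode:
    def __init__(self, size):
        self.mem = [0] * size
        # handlers run in the memory's own logic -- no network round trip
        self.handlers = {
            "atomic_add": lambda a, v: self.mem.__setitem__(a, self.mem[a] + v),
            "read": lambda a: self.mem[a],
        }

    def receive(self, parcel):
        return self.handlers[parcel.method](parcel.target_addr, *parcel.args)

node = MemoryNode(size=16)
node.receive(Parcel(3, "atomic_add", 5))
node.receive(Parcel(3, "atomic_add", 2))
print(node.receive(Parcel(3, "read")))   # -> 7
```

Because both updates execute inside the owning node, no cache-coherence traffic or fetch latency is incurred — the essence of the "low or no data reuse" regime handled by MIND.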
Next Generation HPC Architecture with Dynamic Adaptive Resource Management

[Diagram: N MTV compute processors (MTV0 … MTVN-1), each with a high-speed buffer, share a global high-speed buffer over a low-latency network; P MIND memory processors (MIND0 … MINDP-1), each with 3/2 RAM, sit on a high-bandwidth optical network. Parcels carry data-intensive user function instantiation to the MIND nodes; percolation (data/task prestaging) provides dynamic scheduling and load balancing.]
Current PIM Projects
IBM Blue Gene: Pflops computer for protein folding
Stanford Streaming: power efficient
UC Berkeley IRAM: attached to conventional servers for multimedia
USC ISI DIVA: irregular data structure manipulation
U of Notre Dame PIM-lite: multithreaded
Caltech MIND: virtual everything, for scalable fault-tolerant general-purpose computing
Why is PIM Inevitable?
The separation between memory and logic is artificial
- The von Neumann bottleneck
- Imposed by technology limitations
- Not a desirable property of computer architecture

Technology now brings down the barrier
- We didn’t do it because we couldn’t do it
- We can do it, so we will do it

What to do with a billion transistors
- Complexity cannot be extended indefinitely
- Synthesis of simple elements through replication
- A means to fault tolerance and lower power

Normalize memory touch time through bandwidth scaled with capacity
- Without it, it takes ever longer to look at each memory block

Will be a mass-market commodity
- Drivers outside of the HPC thrust
- Cousin to embedded computing
Limitations of Current PIM Architectures
No global address space
- No virtual-to-physical address translation
- DIVA recognizes pointers for irregular data handling

Do not exploit full potential memory bandwidth
- Most use the full row buffer
- Blue Gene/Cyclops has 32 nodes

No memory-to-memory process invocation
- PIM-lite and DIVA use parcels for method-driven computation

No low-overhead context switching
- BG/C and PIM-lite have some support for multithreading
Attributes of MIND PIM Architecture
Memory, Intelligence, and Network Device

Parcel active-message-driven computing
- Decoupled split-transaction execution
- System-wide latency hiding
- Move work to data instead of data to work

Multithreaded control
- Unified dynamic mechanism for resource management
- Latency hiding
- Real-time response

Virtual-to-physical address translation in memory
- Global distributed shared memory through a distributed directory table
- Dynamic page migration
- Wide registers serve as a context-sensitive TLB

Graceful degradation for fault tolerance
- Highly replicated fine-grain elements
- Hardware architecture includes fault detection mechanisms
- Software tags for bit-checking at hardware speeds; constant memory scrubbing
- Virtual data and tasks permit rapid reconfiguration without software regeneration
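The distributed-directory translation described above can be sketched in a few lines: each virtual page has a directory entry naming its current home, so pages migrate without changing the addresses programs use. The page size and node layout here are invented for illustration:

```python
# Sketch of global virtual-to-physical translation via a distributed
# directory: each virtual page maps to (node, frame), so dynamic page
# migration changes only the directory entry, never program addresses.

PAGE = 4096

class Directory:
    def __init__(self):
        self.home = {}                         # virtual page -> (node, frame)

    def map(self, vpage, node, frame):
        self.home[vpage] = (node, frame)

    def translate(self, vaddr):
        node, frame = self.home[vaddr // PAGE]
        return node, frame * PAGE + vaddr % PAGE

    def migrate(self, vpage, new_node, new_frame):
        # dynamic page migration: only the directory entry changes
        self.map(vpage, new_node, new_frame)

d = Directory()
d.map(0, node=2, frame=7)
print(d.translate(100))      # -> (2, 28772)
d.migrate(0, new_node=5, new_frame=1)
print(d.translate(100))      # -> (5, 4196)
```

A real design would cache recent translations in the wide registers (the "context-sensitive TLB" of the slide); the directory is the backing truth that makes migration and fault reconfiguration cheap.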
[Diagram: MIND chip floor plan — four memory stacks, each read through its own sense amps, surrounding the central node logic and address decode.]
A Software Architecture

[Diagram: layered software architecture. Users sit atop conventional and advanced programming models; compilers, the operating system, and the global and local runtime systems map these onto the hardware architecture, a conventional microprocessor, mass storage, and I/O.]
MIND Programming Opportunities
Provides lightweight, efficient threads
Reduces overhead burden
Avoids data access latency
Rapid streaming for associative searches
Exposes massive parallelism
Enables efficient and concurrent gather-scatter
Enables efficient and concurrent array transpose
Permits fine-grain manipulation of sparse and irregular data structures
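The gather-scatter opportunity can be illustrated with a toy memory-side operation: the processor ships one index list, and the PIM node returns only the selected values, so traffic scales with the selection rather than the array. All names here are illustrative:

```python
# Memory-side gather/scatter: the PIM node applies the index list
# locally and moves only the selected elements, so traffic is
# proportional to the selection size, not the resident array size.

class PimGather:
    def __init__(self, data):
        self.data = data                       # array resident in PIM memory

    def gather(self, indices):
        return [self.data[i] for i in indices]

    def scatter(self, indices, values):
        for i, v in zip(indices, values):
            self.data[i] = v

node = PimGather(list(range(0, 1000, 10)))     # sparse resident data
picked = node.gather([1, 50, 99])
print(picked)                                  # -> [10, 500, 990]
node.scatter([0, 2], [-1, -2])
print(node.data[:3])                           # -> [-1, 10, -2]
```

On a conventional machine the same gather streams 100 elements through the cache hierarchy to extract 3; in memory, only the 3 results ever cross the interface.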
MTV/MIND Programming Challenges
Decoupled activities
- Compiler/programmer must delineate between two classes of work

Heterogeneous
- Two or more separate instruction sets
- More than one uniform cost function

Coherence among cache and memory copies

Separate ALU, IP, and memory
- Not a conventional node; semi-independent and self-contained
- Part of activity state, shared with other nodes

Portability of legacy codes
- e.g. Fortran, C, MPI
Advanced Programming Models for MTV/MIND
Never predict a new language
- There will be a new language
- Called BOCCE, from C3PO New Republic Software Inc.; $49.95, delivery 2007

Expose diversity of parallelism
Reflect a semantics of affinity
Support an active decision space for allocation and scheduling
Provide easy code structuring for partial rollback and restarts
- Micro-checkpointing for fault tolerance
Integrate performance information as part of regular output
- Possibly with an intelligent closed loop
Exploit dynamic adaptive resource management mechanisms
- Multithreading, percolation, parcels
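The micro-checkpointing point above — partial rollback and restart — can be sketched as per-task snapshots: a fault rolls back one task to its last safe point rather than restarting the whole job. All names are invented for illustration:

```python
# Sketch of micro-checkpointing for partial rollback: each task snapshots
# its own state at safe points; a fault restores only that task's last
# snapshot instead of restarting the whole computation.

import copy

class Task:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self._snap = copy.deepcopy(state)      # initial safe point

    def checkpoint(self):
        self._snap = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._snap)

t = Task("stencil_tile_7", {"step": 0, "data": [1, 2, 3]})
t.state["step"] = 5
t.checkpoint()               # safe point after step 5
t.state["data"][0] = 999     # work after the checkpoint...
t.rollback()                 # ...a fault undoes only this task's work
print(t.state)               # -> {'step': 5, 'data': [1, 2, 3]}
```

The language-level ask on this slide is that such safe points fall out of the code structure the programmer already writes, instead of requiring hand-placed checkpoint calls as here.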
Keeping MPI in MIND
MPI as a programming model for MTV/MIND
- Migration of legacy codes: MPI with C or Fortran
- Transition methodology for leveraging the learning curve
- Reduces barriers to getting new applications from old programmers
- An excellent programming model for certain classes of parallel algorithms

Use as an intra-MPI-process accelerator or as an MPI process host
- Unlike most accelerators, the data is already in the MIND PIM chips

Incremental performance enhancements
- Runs normally on first startup
- Target data-intensive performance bottlenecks
- Build on libraries with conventional interfaces and new wrappers

Spanning multiple MIND nodes with an MPI process
- Parcels satisfy MPI message requirements and also migrate work
- Collective operations

Very efficient MPI implementations with MIND/PIM
- Threaded MPI processes
- Exploit MPI-process internal parallelism
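One way to read "parcels satisfy MPI message requirements": a send/receive pair maps naturally onto a parcel that delivers the payload to the destination's memory and matches a posted receive there. A toy sketch — this is not mpi4py or any real MPI binding, and all names are invented:

```python
# Toy mapping of MPI point-to-point semantics onto parcels: a send
# becomes a parcel carrying the payload plus matching info (tag), and
# delivery completes the matching posted receive at the target node.

class MpiOnParcels:
    def __init__(self, nranks):
        self.posted = {r: {} for r in range(nranks)}   # rank -> tag -> buffer
        self.inbox = {r: {} for r in range(nranks)}    # unexpected messages

    def irecv(self, rank, tag):
        self.posted[rank][tag] = None                  # post a receive

    def send(self, src, dst, tag, payload):
        # the "parcel": payload travels to the data's owner in one trip
        if tag in self.posted[dst]:
            self.posted[dst][tag] = payload            # matches posted recv
        else:
            self.inbox[dst][tag] = payload             # queue as unexpected

    def wait(self, rank, tag):
        return self.posted[rank].pop(tag)

comm = MpiOnParcels(nranks=2)
comm.irecv(rank=1, tag=0)
comm.send(src=0, dst=1, tag=0, payload=[3.14, 2.72])
print(comm.wait(rank=1, tag=0))    # -> [3.14, 2.72]
```

The difference from a conventional implementation is where the match happens: in the memory logic that owns the receive buffer, so the same parcel mechanism that delivers the message can also spawn the work that consumes it.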
Percolation of Active Tasks from MIND to MTV
Latency
- Prestage all relevant data near the execution engine
- Multiple-stage latency management methodology
- Augmented multithreaded resource scheduling

Heterogeneity
- Compute-intensive work in high-speed processors
- Low data-reuse work in memory processors
- Hierarchy of task contexts

Overhead
- Perform all data manipulation and scheduling in separate logic
- Coarse-grain contexts coordinate in PIM memory
- Ready contexts migrate to SRAM under PIM control, releasing threads for scheduling
- Threads pushed into SRAM/CRAM frame buffers
- Task stack loaded into register banks on a space-available basis
- Write back to main memory

Hierarchy: contexts stored in MIND DRAM; tasks stored in MIND SRAM; stacks stored in MTV buffers
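The three-level hierarchy on this slide — contexts in MIND DRAM, ready tasks in MIND SRAM, stacks in MTV buffers — can be sketched as staged queues where work percolates upward only once it is ready, so the fast MTV processors never wait on memory. The stage names, the two-slot buffer, and the readiness test are invented for illustration:

```python
# Sketch of percolation: coarse-grain contexts wait in (simulated) DRAM;
# when their data dependencies are satisfied they migrate to SRAM as
# ready tasks, and finally into MTV buffers where they execute.

from collections import deque

class Percolator:
    def __init__(self):
        self.dram = deque()      # coarse-grain contexts
        self.sram = deque()      # ready tasks
        self.mtv = deque()       # prestaged stacks, ready to fire

    def submit(self, task, ready=False):
        self.dram.append((task, ready))

    def percolate(self):
        # DRAM -> SRAM: only tasks whose data is staged move up
        for task, ready in list(self.dram):
            if ready:
                self.dram.remove((task, ready))
                self.sram.append(task)
        # SRAM -> MTV buffers: fill register banks, space permitting
        while self.sram and len(self.mtv) < 2:
            self.mtv.append(self.sram.popleft())

    def run(self):
        return [self.mtv.popleft() for _ in range(len(self.mtv))]

p = Percolator()
p.submit("fft_block", ready=True)
p.submit("waiting_on_gather", ready=False)
p.percolate()
print(p.run())          # -> ['fft_block']  (the unready task stays in DRAM)
```

The PIM side owns the DRAM-to-SRAM decision, which is how the scheme doubles as dynamic load balancing: whichever MTV buffer has space pulls the next ready context.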
Panel Questions and Issues
Which are the key requirements of HPC applications?
- Support for substantial and diverse abstract parallelism
- Locality management through a nomenclature of affinity
- Gradual performance sensitivity to adjustments in application code and in system resource parameters and availability
- Convenient restart points for transparent resumption in the event of a fault

What is wrong with the state of the art in programming HPC systems?
- Nothing, if you are a sadomasochist
- There is a close and uncompromising relationship between the programmer and the physical properties of the HPC machine structure
- Limited in the classes and forms of parallelism that can be exposed
- Precludes exploitation of runtime information

Which major directions will HPC architecture development take? Will they make programming simpler or even more difficult than today?
- Plan A: The lowest common denominator – clusters of commodity SMPs (harder)
- Plan B: The right thing – MTV with MIND-altering products (simpler)
Panel Questions and Issues (2)
How will high-level languages, compilers, runtime systems, and software tools evolve during this decade? Which new features are required to support fault tolerance?
- Impossible to predict
- Emergence of runtime agents enabled by cheap, pervasive logic
- Compiler-to-runtime relationship through an intermediate API
- Active decision choice space
- Introspection or monitor threads for resource measurement and fault detection

How can legacy codes be efficiently ported to new architectures?
- At the risk of being Clintonesque, how do you define “efficiency”?
- Definition A: as good as it would run on a conventional system of comparable capacity
- Definition B: as good as it should run given the superior properties of the new system
- Yes to A, no to B
- Incremental rewrites required for adaptation of applications to new machines: O/S services, common shared libraries, user critical sections