Keeping Programming in MIND for the MTV Petaflops Generation
Thomas Sterling
California Institute of Technology and
NASA Jet Propulsion Laboratory
June 24, 2002
ICS’02 Panel on Programming Models for Future HPC Systems
June 24, 2002 Thomas Sterling - Caltech & JPL 2
Future HPC Systems?
Commodity Clusters: Beowulf, NOW, Chiba City, HP SC
MPP: T3E, ASCI White
PVP: SX-6, SV-2, Earth Simulator
Hybrid Technology: HTMT
Petaflops in 2010 to 2012
Flops performance gain: 25X at system level
- Clock rate > 10 GHz (chip wide)
- The rest is ILP and SOC
- Petaflops in < 10K chips

Memory capacity: 64X
- Petabyte in 128K chips for ~$4M
- 7%/yr speedup in access time
- Will take ~30 times longer to read the contents of a memory chip
- Will take > 100X longer in clock cycles (not including latency)

I/O interface will grow slowly toward 1 Tbps (maybe optics?)
1 Petaflops system in 2010 will cost between $50M and $200M
Power is unclear: between 3 MWatts and 25 MWatts
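The "~30 times longer" figure follows from capacity outpacing access speed. A back-of-envelope check, assuming a 10-year horizon and a ~2 GHz baseline clock in 2002 (both my assumptions, not figures from the slide):

```python
# Back-of-envelope check of the slide's memory-read-time claims.
# Assumptions (not from the slide): a 10-year horizon and a ~2 GHz
# baseline clock in 2002 versus the projected >10 GHz.

years = 10
capacity_growth = 64                 # memory capacity grows 64X
access_speedup = 1.07 ** years       # 7%/yr access-time gain, ~2X overall

# Time to stream a chip's contents scales as capacity / access speed.
read_time_growth = capacity_growth / access_speedup   # ~32X: "~30 times longer"

clock_speedup = 10e9 / 2e9           # assumed 2 GHz -> 10 GHz
cycles_growth = read_time_growth * clock_speedup      # ~160X: "> 100X in cycles"

print(round(read_time_growth, 1), round(cycles_growth))
```

The point of the arithmetic: even with steady 7%/yr access-time improvement, a 64X capacity gain leaves memory touch time an order of magnitude worse, and worse still when measured in (faster) clock cycles.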
Problems with Conventional Approach
Processor bound
- ALU and IP constrained – ALU utilization is the wrong metric
- Latency intolerant
- Parallelism exposure inadequate – e.g. irregular data
- Overhead too large – performed in software

Memory bound
- Memory bandwidth strangled
- Memory access latency isolation – 100’s to 1000’s of cycles
- 100X time to touch all of memory
- Memory hierarchy dies with low temporal locality
- TLB coherency and scalability

Too difficult to program – resource management by hand
Too big, too hot, too $$
Vulnerable to reliability failures
Multithreaded Vector with MIND: A Locality-Driven Architecture
Distributed shared memory
- Not cache coherent
- Support for atomic distributed object operations

High efficiency through dynamic management of latency and overhead
- Will make programming easier by simplifying the burden of resource control

Two execution regimes:
- High temporal locality
- Low or no data reuse

Very high-speed Multithreaded Vector compute processors (MTV)
- Multithreaded vector architecture
- Compile-time managed caches with hardware assist
- Parcel message-driven computation and synchronization

Processor-in-Memory data processors (MIND)
- Expose very high memory bandwidth while reducing latency and power
- Perform low/no-reuse data operations directly in memory
- Support system overhead functions for high efficiency

Percolation: prestaging to initialize computations, hiding latency and overhead while providing dynamic load balancing
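The parcel idea above — move work to data rather than data to work — can be sketched as a toy simulation. `Parcel`, `MemoryNode`, and the `atomic_add` handler are hypothetical names for illustration, not the MIND instruction set:

```python
# Toy sketch of parcel message-driven computation: instead of fetching
# data to a remote ALU, a parcel carries a method name to the memory
# node that owns the data, which executes it locally and atomically.

class Parcel:
    def __init__(self, target_addr, method, *args):
        self.target_addr = target_addr
        self.method = method
        self.args = args

class MemoryNode:
    def __init__(self, size):
        self.mem = [0] * size
        # handlers run in the memory's own logic -- no network round trip
        self.handlers = {
            "atomic_add": lambda a, v: self.mem.__setitem__(a, self.mem[a] + v),
            "read": lambda a: self.mem[a],
        }

    def receive(self, parcel):
        return self.handlers[parcel.method](parcel.target_addr, *parcel.args)

node = MemoryNode(size=16)
node.receive(Parcel(3, "atomic_add", 5))
node.receive(Parcel(3, "atomic_add", 2))
print(node.receive(Parcel(3, "read")))   # -> 7
```

Because both updates execute inside the owning node, no cache-coherence traffic or fetch latency is incurred — the essence of the "low or no data reuse" regime handled by MIND.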
Next Generation HPC Architecture with Dynamic Adaptive Resource Management

[Diagram: N MTV compute processors (MTV0 … MTVN-1), each with a high-speed buffer, share a global high-speed buffer over a low-latency network; P MIND memory processors (MIND0 … MINDP-1), each with 3/2 RAM, sit on a high-bandwidth optical network. Parcels carry data-intensive user function instantiation to the MIND nodes; percolation (data/task prestaging) provides dynamic scheduling and load balancing.]
Current PIM Projects
IBM Blue Gene: Pflops computer for protein folding
Stanford Streaming: power efficient
UC Berkeley IRAM: attached to conventional servers for multimedia
USC ISI DIVA: irregular data structure manipulation
U of Notre Dame PIM-lite: multithreaded
Caltech MIND: virtual everything, for scalable fault-tolerant general-purpose computing
Why is PIM Inevitable?
The separation between memory and logic is artificial
- The von Neumann bottleneck
- Imposed by technology limitations
- Not a desirable property of computer architecture

Technology now brings down the barrier
- We didn’t do it because we couldn’t do it
- We can do it, so we will do it

What to do with a billion transistors
- Complexity cannot be extended indefinitely
- Synthesis of simple elements through replication
- A means to fault tolerance and lower power

Normalize memory touch time through bandwidth scaled with capacity
- Without it, it takes ever longer to look at each memory block

Will be a mass-market commodity
- Drivers outside of the HPC thrust
- Cousin to embedded computing
Limitations of Current PIM Architectures
No global address space
- No virtual-to-physical address translation
- DIVA recognizes pointers for irregular data handling

Do not exploit full potential memory bandwidth
- Most use the full row buffer
- Blue Gene/Cyclops has 32 nodes

No memory-to-memory process invocation
- PIM-lite and DIVA use parcels for method-driven computation

No low-overhead context switching
- BG/C and PIM-lite have some support for multithreading
Attributes of MIND PIM Architecture
Memory, Intelligence, and Network Device

Parcel active-message-driven computing
- Decoupled split-transaction execution
- System-wide latency hiding
- Move work to data instead of data to work

Multithreaded control
- Unified dynamic mechanism for resource management
- Latency hiding
- Real-time response

Virtual-to-physical address translation in memory
- Global distributed shared memory through a distributed directory table
- Dynamic page migration
- Wide registers serve as a context-sensitive TLB

Graceful degradation for fault tolerance
- Highly replicated fine-grain elements
- Hardware architecture includes fault detection mechanisms
- Software tags for bit-checking at hardware speeds; constant memory scrubbing
- Virtual data and tasks permit rapid reconfiguration without software regeneration
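The distributed-directory translation described above can be sketched in a few lines: each virtual page has a directory entry naming its current home, so pages migrate without changing the addresses programs use. The page size and node layout here are invented for illustration:

```python
# Sketch of global virtual-to-physical translation via a distributed
# directory: each virtual page maps to (node, frame), so dynamic page
# migration changes only the directory entry, never program addresses.

PAGE = 4096

class Directory:
    def __init__(self):
        self.home = {}                         # virtual page -> (node, frame)

    def map(self, vpage, node, frame):
        self.home[vpage] = (node, frame)

    def translate(self, vaddr):
        node, frame = self.home[vaddr // PAGE]
        return node, frame * PAGE + vaddr % PAGE

    def migrate(self, vpage, new_node, new_frame):
        # dynamic page migration: only the directory entry changes
        self.map(vpage, new_node, new_frame)

d = Directory()
d.map(0, node=2, frame=7)
print(d.translate(100))      # -> (2, 28772)
d.migrate(0, new_node=5, new_frame=1)
print(d.translate(100))      # -> (5, 4196)
```

A real design would cache recent translations in the wide registers (the "context-sensitive TLB" of the slide); the directory is the backing truth that makes migration and fault reconfiguration cheap.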
[Diagram: MIND chip floor plan — four memory stacks, each read through its own sense amps, surrounding the central node logic and address decode.]
A Software Architecture

[Diagram: layered software architecture. Users sit atop conventional and advanced programming models; compilers, the operating system, and the global and local runtime systems map these onto the hardware architecture, a conventional microprocessor, mass storage, and I/O.]
MIND Programming Opportunities
Provides lightweight, efficient threads
Reduces overhead burden
Avoids data access latency
Rapid streaming for associative searches
Exposes massive parallelism
Enables efficient and concurrent gather-scatter
Enables efficient and concurrent array transpose
Permits fine-grain manipulation of sparse and irregular data structures
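The gather-scatter opportunity can be illustrated with a toy memory-side operation: the processor ships one index list, and the PIM node returns only the selected values, so traffic scales with the selection rather than the array. All names here are illustrative:

```python
# Memory-side gather/scatter: the PIM node applies the index list
# locally and moves only the selected elements, so traffic is
# proportional to the selection size, not the resident array size.

class PimGather:
    def __init__(self, data):
        self.data = data                       # array resident in PIM memory

    def gather(self, indices):
        return [self.data[i] for i in indices]

    def scatter(self, indices, values):
        for i, v in zip(indices, values):
            self.data[i] = v

node = PimGather(list(range(0, 1000, 10)))     # sparse resident data
picked = node.gather([1, 50, 99])
print(picked)                                  # -> [10, 500, 990]
node.scatter([0, 2], [-1, -2])
print(node.data[:3])                           # -> [-1, 10, -2]
```

On a conventional machine the same gather streams 100 elements through the cache hierarchy to extract 3; in memory, only the 3 results ever cross the interface.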
MTV/MIND Programming Challenges
Decoupled activities
- Compiler/programmer must delineate between two classes of work

Heterogeneous
- Two or more separate instruction sets
- More than one uniform cost function

Coherence among cache and memory copies

Separate ALU, IP, and memory
- Not a conventional node; semi-independent and self-contained
- Part of activity state, shared with other nodes

Portability of legacy codes
- e.g. Fortran, C, MPI
Advanced Programming Models for MTV/MIND
Never predict a new language
- There will be a new language
- Called BOCCE, from C3PO New Republic Software Inc.; $49.95, delivery 2007

Expose diversity of parallelism
Reflect a semantics of affinity
Support an active decision space for allocation and scheduling
Provide easy code structuring for partial rollback and restarts
- Micro-checkpointing for fault tolerance
Integrate performance information as part of regular output
- Possibly with an intelligent closed loop
Exploit dynamic adaptive resource management mechanisms
- Multithreading, percolation, parcels
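The micro-checkpointing point above — partial rollback and restart — can be sketched as per-task snapshots: a fault rolls back one task to its last safe point rather than restarting the whole job. All names are invented for illustration:

```python
# Sketch of micro-checkpointing for partial rollback: each task snapshots
# its own state at safe points; a fault restores only that task's last
# snapshot instead of restarting the whole computation.

import copy

class Task:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self._snap = copy.deepcopy(state)      # initial safe point

    def checkpoint(self):
        self._snap = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._snap)

t = Task("stencil_tile_7", {"step": 0, "data": [1, 2, 3]})
t.state["step"] = 5
t.checkpoint()               # safe point after step 5
t.state["data"][0] = 999     # work after the checkpoint...
t.rollback()                 # ...a fault undoes only this task's work
print(t.state)               # -> {'step': 5, 'data': [1, 2, 3]}
```

The language-level ask on this slide is that such safe points fall out of the code structure the programmer already writes, instead of requiring hand-placed checkpoint calls as here.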
Keeping MPI in MIND
MPI as a programming model for MTV/MIND
- Migration of legacy codes: MPI with C or Fortran
- Transition methodology for leveraging the learning curve
- Reduces barriers to getting new applications from old programmers
- An excellent programming model for certain classes of parallel algorithms

Use as an intra-MPI-process accelerator or as an MPI process host
- Unlike most accelerators, the data is already in the MIND PIM chips

Incremental performance enhancements
- Runs normally on first startup
- Target data-intensive performance bottlenecks
- Build on libraries with conventional interfaces and new wrappers

Spanning multiple MIND nodes with an MPI process
- Parcels satisfy MPI message requirements and also migrate work
- Collective operations

Very efficient MPI implementations with MIND/PIM
- Threaded MPI processes
- Exploit MPI-process internal parallelism
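One way to read "parcels satisfy MPI message requirements": a send/receive pair maps naturally onto a parcel that delivers the payload to the destination's memory and matches a posted receive there. A toy sketch — this is not mpi4py or any real MPI binding, and all names are invented:

```python
# Toy mapping of MPI point-to-point semantics onto parcels: a send
# becomes a parcel carrying the payload plus matching info (tag), and
# delivery completes the matching posted receive at the target node.

class MpiOnParcels:
    def __init__(self, nranks):
        self.posted = {r: {} for r in range(nranks)}   # rank -> tag -> buffer
        self.inbox = {r: {} for r in range(nranks)}    # unexpected messages

    def irecv(self, rank, tag):
        self.posted[rank][tag] = None                  # post a receive

    def send(self, src, dst, tag, payload):
        # the "parcel": payload travels to the data's owner in one trip
        if tag in self.posted[dst]:
            self.posted[dst][tag] = payload            # matches posted recv
        else:
            self.inbox[dst][tag] = payload             # queue as unexpected

    def wait(self, rank, tag):
        return self.posted[rank].pop(tag)

comm = MpiOnParcels(nranks=2)
comm.irecv(rank=1, tag=0)
comm.send(src=0, dst=1, tag=0, payload=[3.14, 2.72])
print(comm.wait(rank=1, tag=0))    # -> [3.14, 2.72]
```

The difference from a conventional implementation is where the match happens: in the memory logic that owns the receive buffer, so the same parcel mechanism that delivers the message can also spawn the work that consumes it.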
Percolation of Active Tasks from MIND to MTV
Latency
- Prestage all relevant data near the execution engine
- Multiple-stage latency management methodology
- Augmented multithreaded resource scheduling

Heterogeneity
- Compute-intensive work in high-speed processors
- Low data-reuse work in memory processors
- Hierarchy of task contexts

Overhead
- Perform all data manipulation and scheduling in separate logic
- Coarse-grain contexts coordinate in PIM memory
- Ready contexts migrate to SRAM under PIM control, releasing threads for scheduling
- Threads pushed into SRAM/CRAM frame buffers
- Task stack loaded into register banks on a space-available basis
- Write back to main memory

Hierarchy: contexts stored in MIND DRAM; tasks stored in MIND SRAM; stacks stored in MTV buffers
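The three-level hierarchy on this slide — contexts in MIND DRAM, ready tasks in MIND SRAM, stacks in MTV buffers — can be sketched as staged queues where work percolates upward only once it is ready, so the fast MTV processors never wait on memory. The stage names, the two-slot buffer, and the readiness test are invented for illustration:

```python
# Sketch of percolation: coarse-grain contexts wait in (simulated) DRAM;
# when their data dependencies are satisfied they migrate to SRAM as
# ready tasks, and finally into MTV buffers where they execute.

from collections import deque

class Percolator:
    def __init__(self):
        self.dram = deque()      # coarse-grain contexts
        self.sram = deque()      # ready tasks
        self.mtv = deque()       # prestaged stacks, ready to fire

    def submit(self, task, ready=False):
        self.dram.append((task, ready))

    def percolate(self):
        # DRAM -> SRAM: only tasks whose data is staged move up
        for task, ready in list(self.dram):
            if ready:
                self.dram.remove((task, ready))
                self.sram.append(task)
        # SRAM -> MTV buffers: fill register banks, space permitting
        while self.sram and len(self.mtv) < 2:
            self.mtv.append(self.sram.popleft())

    def run(self):
        return [self.mtv.popleft() for _ in range(len(self.mtv))]

p = Percolator()
p.submit("fft_block", ready=True)
p.submit("waiting_on_gather", ready=False)
p.percolate()
print(p.run())          # -> ['fft_block']  (the unready task stays in DRAM)
```

The PIM side owns the DRAM-to-SRAM decision, which is how the scheme doubles as dynamic load balancing: whichever MTV buffer has space pulls the next ready context.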
Panel Questions and Issues
Which are the key requirements of HPC applications?
- Support for substantial and diverse abstract parallelism
- Locality management through a nomenclature of affinity
- Gradual performance sensitivity to adjustments in application code and in system resource parameters and availability
- Convenient restart points for transparent resumption in the event of a fault

What is wrong with the state of the art in programming HPC systems?
- Nothing, if you are a sadomasochist
- There is a close and uncompromising relationship between the programmer and the physical properties of the HPC machine structure
- Limited in the classes and forms of parallelism that can be exposed
- Precludes exploitation of runtime information

Which major directions will HPC architecture development take? Will they make programming simpler or even more difficult than today?
- Plan A: The lowest common denominator – clusters of commodity SMPs (harder)
- Plan B: The right thing – MTV with MIND-altering products (simpler)
Panel Questions and Issues (2)
How will high-level languages, compilers, runtime systems, and software tools evolve during this decade? Which new features are required to support fault tolerance?
- Impossible to predict
- Emergence of runtime agents enabled by cheap, pervasive logic
- Compiler-to-runtime relationship through an intermediate API
- Active decision choice space
- Introspection or monitor threads for resource measurement and fault detection

How can legacy codes be efficiently ported to new architectures?
- At the risk of being Clintonesque, how do you define “efficiency”?
- Definition A: as good as it would run on a conventional system of comparable capacity
- Definition B: as good as it should run given the superior properties of the new system
- Yes to A, no to B
- Incremental rewrites required for adaptation of applications to new machines: O/S services, common shared libraries, user critical sections