Prof. Srinidhi Varadarajan, Director
Center for High-End Computing Systems


We need a paradigm shift to make supercomputers more usable for mainstream computational scientists. A similar shift occurred in computing in the 1970s, when the advent of inexpensive minicomputers in academia spurred a large body of computing research.

Results from that research flowed back to industry, creating a growth cycle that led to computing becoming a commodity.

This requires a comprehensive “rethink” of programming languages, runtime systems, operating systems, scheduling, reliability, and operations and management. Moving to petascale- and exascale-class systems significantly complicates this challenge.

We need a computing environment that can efficiently and usably span the scales from department-sized systems to national resources.

The majority of our supercomputers today are distributed memory systems that use the message passing model of parallel computation.

The shared/distributed memory view is a dichotomy imposed by hardware constraints.

Modern high-performance interconnects such as InfiniBand are memory-based systems. This provides the hardware basis to envision DSM systems that deepen the memory hierarchy. The most common operations are accelerated through hardware offload.
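To make the mechanism concrete: software DSM systems have classically intercepted accesses to remote pages through the virtual memory hardware. The sketch below illustrates that general technique, not this system's actual implementation; fetch_remote_page is an assumed stand-in for an RDMA-style transport call.

```c
/* Minimal page-fault-based DSM sketch (illustrative only).
 * Accesses to unmapped global pages raise SIGSEGV; the handler
 * makes the page accessible and fills it from a remote server. */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL

/* Assumed transport call: fetch one page from its memory server. */
extern void fetch_remote_page(void *local_addr, uintptr_t global_addr);

static void dsm_fault_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)info->si_addr & ~(PAGE_SIZE - 1));
    /* Make the page accessible, pull its contents, then resume the thread. */
    mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
    fetch_remote_page(page, (uintptr_t)page);
}

void dsm_init(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = dsm_fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}
```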

A common question: my application runs on my desktop, but it takes too long.

Can I just run it on the supercomputer and make it run faster?

Short answer: no. Longer answer: almost certainly not.

As core frequencies have flattened, multi-core and many-core architectures are here to stay. This is increasing the prevalence of threaded codes.

Can we take standard threaded codes and run them on a cluster supercomputer without any modifications or recompilation?

The goal of our work is to enable Pthread-based threaded codes to run transparently on cluster supercomputers.

The DSM system acts as the runtime, and provides a globally consistent memory abstraction.

A new consistency algorithm with release consistency semantics guarantees correct operation for valid threaded codes. No, it won’t fix your bugs, but it may make deadlock and livelock detection easier, possibly even automatic.
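Here, "valid" means properly synchronized. In a program like the following standard Pthreads sketch, release consistency only has to propagate updates to counter at each lock release, which is exactly what the code already expresses:

```c
/* Properly synchronized Pthreads code: all shared accesses are guarded.
 * Under release consistency, each unlock (release) publishes local
 * updates and each lock (acquire) pulls in remote ones. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* acquire */
        counter++;
        pthread_mutex_unlock(&lock);  /* release */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("%ld\n", counter);  /* always 400000 for this valid code */
    return 0;
}
```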

Separation of concerns: the system is divided into a consistency layer and a lower-level communication layer. The communication layer uses a well-defined architecture, similar to MPI’s ADI, to enable a wide variety of lower-level interconnects.
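One plausible shape for such a layer is a small table of transport entry points that the consistency layer calls; all names below are illustrative, not the system's actual interface.

```c
/* Hypothetical device-interface sketch, in the spirit of MPI's ADI:
 * the consistency layer calls only these entry points, so a new
 * interconnect is supported by supplying a new implementation. */
#include <stddef.h>
#include <stdint.h>

typedef struct dsm_transport {
    int  (*init)(int *argc, char ***argv);
    int  (*get)(void *dst, uintptr_t gaddr, size_t len);        /* remote read  */
    int  (*put)(uintptr_t gaddr, const void *src, size_t len);  /* remote write */
    int  (*notify)(int node, const void *msg, size_t len);      /* control msgs */
    void (*finalize)(void);
} dsm_transport;

/* Selected at build or run time, e.g.: */
extern const dsm_transport ib_verbs_transport;  /* InfiniBand        */
extern const dsm_transport tcp_transport;       /* portable fallback */
```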

The system consists of either dedicated memory servers, or nodes may share a portion of their memory into a global pool. Dedicated memory servers are essentially low-end servers that can host a large amount of memory behind a fast interconnect.

Memory striping algorithms are employed to mitigate memory access hotspots.
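The simplest such policy is round-robin page interleaving, sketched below. The actual striping algorithm may differ, but the effect is the same: consecutive pages land on different servers, so no single server becomes a hotspot.

```c
/* Illustrative round-robin page striping across memory servers. */
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4 KiB pages (assumed) */

static inline int home_server(uintptr_t global_addr, int num_servers) {
    uintptr_t page = global_addr >> PAGE_SHIFT;
    return (int)(page % num_servers);  /* pages interleave across servers */
}
```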

The DSM architecture uses a global scheduler that treats cluster nodes as a set of processors.

Thread migration is simple and relatively inexpensive.

This enables load balancing through runtime migration.

Two issues arise: compute load imbalance and data affinity. A migration policy must weigh both, as the sketch below illustrates.
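The heuristic below is assumed purely for illustration, not the system's actual policy: migrate a thread only when the load gap is large and the thread's working set is not strongly bound to its current node.

```c
/* Hypothetical migration heuristic: balance load without destroying
 * data affinity. All fields and thresholds are illustrative. */
typedef struct thread_stats {
    double load_here;         /* load on the thread's current node          */
    double load_there;        /* load on the candidate destination node     */
    double local_page_ratio;  /* fraction of recent accesses served locally */
} thread_stats;

static int should_migrate(const thread_stats *s) {
    const double LOAD_GAP = 0.25;       /* tunable thresholds (assumed) */
    const double AFFINITY_CUTOFF = 0.6;
    return (s->load_here - s->load_there > LOAD_GAP)
        && (s->local_page_ratio < AFFINITY_CUTOFF);
}
```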

Extending the threads model to support adaptivity.

Transactional memory: an artifact of our consistency protocol enables us to provide transactional memory semantics fairly inexpensively.

This enables speculative and/or adaptive execution models, particularly in hard-to-parallelize sequential sections of code. Speculation enables us to explore multiple execution paths, with the DSM guaranteeing that there are no memory side effects. Invalid paths are simply pruned.

Adaptive execution enables optimistic and conservative algorithms to be started concurrently.
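A hypothetical API sketch of the idea (all names assumed): the DSM buffers a transaction's writes, so an aborted speculative path leaves no memory side effects and the conservative algorithm can take over.

```c
/* Assumed transactional interface; not the system's actual API. */
extern int  dsm_tx_begin(void);   /* start buffering writes              */
extern int  dsm_tx_commit(void);  /* publish writes; nonzero on conflict */
extern void dsm_tx_abort(void);   /* discard buffered writes             */

int speculate(int (*optimistic)(void), int (*conservative)(void)) {
    if (dsm_tx_begin() == 0) {
        int r = optimistic();
        if (r >= 0 && dsm_tx_commit() == 0)
            return r;      /* speculation paid off */
        dsm_tx_abort();    /* prune the invalid path; assumed safe
                            * even after a failed commit */
    }
    return conservative(); /* fall back to the safe algorithm */
}
```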

Current threaded and message-passing models are inadequate for peta- and exascale systems. Growth in heterogeneous multi-core systems significantly complicates this problem.

We need more comprehensive runtime systems that can aid in load balancing, profile-guided optimization, and code adaptation. The move in the compilers community towards greater emphasis on dynamic analysis is a step in this direction.

We are working on hybrid programming models that embed von Neumann, program-counter-based elements in dataflow constructs.
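One way to picture such a hybrid, purely as an assumed sketch: sequential task bodies (ordinary program-counter-driven code) wired together by dataflow edges, each task firing when its inputs are ready.

```c
/* Conceptual hybrid-model sketch; all names are illustrative. */
typedef struct task {
    void (*body)(void **inputs, void *output);  /* sequential von Neumann body */
    int   num_inputs;
    struct task **producers;                    /* dataflow edges */
} task;

/* Assumed runtime call: run the graph, firing each task once all
 * of its producers have completed. */
extern void dataflow_run(task *sink);
```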

A new model must provide insights into problem decomposition as well as map existing decomposition methods.

Coordination models that can operate at peta- and exascale.

Methods to evolve applications easily when requirements change.

Working with the compilers, programming languages, architectures, software engineering, applications and systems communities to realize this goal.

System G: 2600-core Intel x86 Xeon (Penryn) cluster with Quad Data Rate InfiniBand; 12,000 thermal sensors and 5000 power sensors.

System X: 2200-processor PowerPC cluster with InfiniBand interconnect.

Anantham: 400-processor Opteron cluster with Myrinet interconnect.

Several 8-32 processor research clusters.
12-processor SGI Altix shared-memory system.
8-processor AMD Opteron shared-memory system.
16-core AMD Opteron shared-memory system.
16-node PlayStation 3 cluster.