
High performance horizons [high performance computing]


The field of high performance computing is undergoing a lot of changes. At the low end, supercomputers are becoming ever more affordable while, at the high end, it is becoming increasingly difficult to exploit the full potential of the most advanced hardware, given the sheer scale of the software challenges involved. There are also signs that the boundary between the worlds of the supercomputer and the desktop - traditionally very distinct - is starting to blur.

The term ‘high performance computing’ (HPC) has entered common usage in recent years. Many use it to refer to that part of the IT world requiring high performance systems for computationally intensive or data intensive applications in technical computing. This includes areas like bioinformatics, climate modelling and computational applications in engineering (but excludes things like commercial database applications).

At present, the term HPC covers a wide range of platforms and computing environments, and the level of diversity is growing. The ubiquity of affordable and powerful off-the-shelf silicon, software and networking technology is changing the landscape very quickly.

Supercomputers continue to scale, with performance advancing at a rate that exceeds the rate of chip-level improvements, as characterised by Moore’s law. These advances largely stem from the dramatic growth in cluster technology, which allows many processors to be closely coupled to solve large computational problems. Clusters have been around since the early 1990s but recent times have seen the arrival of affordable, off-the-shelf processor platforms and networking hardware to support the building of extremely high performance cluster platforms. Most of these use Linux as an OS, and the availability of such open source software to support cluster designs has been a big factor in the uptake of the technology. According to Ed Turkel, a manager with HP’s High Performance Computing Division: “The net result of all this is that, at

SIMD A processor architecture that supports the use of single instructions that can operate on many data points at once. Examples are vector and array processors. The alternative is MIMD, multiple instruction multiple data, which is a label that applies to SMP or DMP systems (such as clusters).

Cluster Generally, this refers to a collection of individual systems or processor platforms linked together by a network or interconnect to operate co-operatively on one or more computational problems. Each of the processor platforms or individual ‘nodes’ in a cluster is capable of operating alone, and in that sense clusters differ from certain other multiprocessor systems. Clusters are distributed memory processing (DMP) systems, meaning that programs can’t directly access the memory of remote systems in the cluster, but have to use some sort of message passing between the nodes.

DMP Systems where each processor has its own associated memory (see ‘Cluster’ above). These can be SIMD or MIMD designs and the processing node can be vector or scalar.

SMP Systems where multiple CPUs or processing nodes all share the same address space. These can be SIMD or MIMD designs and the processing node can be vector or scalar.

Constellation An architecture similar to a cluster but made up of larger SMP nodes, suggesting that it can be used for either shared memory processing (SMP) or distributed memory (DMP) processing.

MPP These are usually special purpose systems with very large numbers of processors, with the processors tied together using a special purpose processor-to-processor interconnect. At one time very popular - with systems from companies like Thinking Machines, MasPar, Cray and others - most of those platforms died out. There has been renewed interest in MPPs, with the release of systems from Cray and IBM, called Red Storm and BlueGene/L, respectively. However, because of their architectures, generally characterised by many low power processors and little memory per processor, there are only a handful of application classes that run well on MPPs.

taking account of factors such as the speed with which data can move between processor nodes, in addition to Linpack.

CLUSTER OR CUSTOM BUILD? The use of the cluster approach and off-the-shelf technology - to build platforms for parallel computation - is now in evidence with the vast majority of today’s supercomputers, including eight of the top 10 in the TOP500 list. In September 2003, Virginia Tech turned on its SystemX machine, at the time offering a peak performance level of 17.7Tflops - in a system that cost only $5.2m to build, of the order of a fifth to a tenth of the traditional supercomputer price tag of the time. SystemX is currently ranked seventh on the TOP500 list.

However, there are many applications that appear to call for supercomputers that are custom-built using proprietary technologies, platforms that are more in the mould of the traditional, or ‘heavy iron’, supercomputer. “Clusters are not a panacea,” said Dr Charles Holland of the US Department of Defence, the keynote speaker at November’s Supercomputing 2004 conference in Pittsburgh. “There exists a need, driven by government and business requirements, for supercomputing systems with high memory bandwidth, low latency, and improved I/O, for example, that far exceed those offered by existing COTS-based supercomputers.”

The current number one spot in the TOP500 list is occupied by BlueGene/L, a machine co-developed by IBM and the US Department of Energy, which shares many of the characteristics of both off-the-shelf cluster platforms and traditional supercomputers (see box ‘Looking after number one’). In November, the BlueGene/L system gained the Linpack performance crown, rated at 70.72Tflops - nearly double the incumbent, the Japanese Earth Simulator, which occupied the top spot for nearly three years with Linpack performance rated at 35.86Tflops.

Although it is more difficult to characterise using a one-size-fits-all benchmark, supercomputers in the more traditional mould - custom-built and optimised for a particular application area - clearly offer a performance advantage in their area of deployment. Cluster platforms, on the other hand, tend to be more general purpose in nature. Broadly, the more traditional platforms come under the heading of ‘vector’ machines, because of their reliance on vector processing techniques. Vector processors are much better suited to certain types of mathematical calculation, heavy on matrices and linear algebra, which crop up frequently in certain classes of scientific or technical calculation such as climate modelling. Many platforms, including the Earth Simulator, deploy a multitude of vector processors connected in parallel.
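As a loose illustration (a hypothetical C sketch, not code from any of the machines mentioned), the kernel below shows the sort of regular, matrix-heavy arithmetic that vector hardware is designed to stream through - the same multiply-add applied to long runs of data.

```c
#include <stddef.h>

/* y = A*x for a dense n x n matrix stored row-major.
 * The inner loop performs the same multiply-add over a long, regular
 * stream of elements - the pattern that vector processors (and SIMD
 * units) can execute as single instructions over many data points. */
void matvec(size_t n, const double *a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j] * x[j];   /* vectorisable multiply-add */
        y[i] = sum;
    }
}
```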

HIGH SPEED INTERCONNECTS The ability to support very high bandwidth, low latency communication between processing nodes is one of the factors that traditional supercomputer makers argue still distinguishes their platforms from the new breed of clusters. “All of these platforms have lots of processing power at the node level but the real distinction is between those with the bandwidth to keep their processors busy and those that don’t,” said Steve Conway, a spokesman for Cray.


The development of a high speed interconnect was one of the primary goals of the $90m ‘Red Storm’ project (a joint venture between Cray and Sandia National Laboratories). This machine uses more than 11,000 64bit AMD Opteron processors and allows an interconnect bandwidth measured in terabytes per second.

With certain classes of application, the performance of a parallel computing platform depends critically on the ability of the system to pass data effectively between each of these nodes. A distinction tends to be made between applications with ‘tightly-coupled’ and ‘loosely-coupled’ processing requirements. In the former case, the execution of algorithms requires data to be passed between processor nodes at very high speed and with very low latency. Loosely-coupled processing, on the other hand, is acceptable in applications where discrete computational tasks can be completed on single nodes with a minimal amount of inter-nodal communication - the results of the calculations made at each node can simply be aggregated at a later stage in the processing sequence of the whole system, to yield a result. The ray tracing algorithms used in 3D graphics are a good example of the latter - each ray of light can be calculated on a single node and the results combined to provide a complete image, with minimal need for inter-nodal communication.
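A rough sketch of that loosely-coupled pattern is given below (a hypothetical MPI example in C; shade_pixel stands in for a real per-pixel calculation and is not drawn from any project discussed). Each rank renders its own band of image rows with no communication at all, and the partial results are only combined in a single step at the end.

```c
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-pixel ray-tracing calculation. */
static float shade_pixel(int row, int col) { return (float)(row + col); }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int width = 640, height = 480;
    int rows_per_rank = height / size;      /* assume height divides evenly */
    float *my_rows = malloc((size_t)rows_per_rank * width * sizeof(float));

    /* Each node works on its own band of rows, independently of the others. */
    for (int r = 0; r < rows_per_rank; r++)
        for (int c = 0; c < width; c++)
            my_rows[r * width + c] = shade_pixel(rank * rows_per_rank + r, c);

    /* The results are aggregated at the end to form the complete image. */
    float *image = NULL;
    if (rank == 0)
        image = malloc((size_t)height * width * sizeof(float));
    MPI_Gather(my_rows, rows_per_rank * width, MPI_FLOAT,
               image, rows_per_rank * width, MPI_FLOAT, 0, MPI_COMM_WORLD);

    free(my_rows);
    free(image);
    MPI_Finalize();
    return 0;
}
```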

Applications that fit the loosely-coupled processing model, which can be very effectively decomposed into distinct sub-tasks, are also candidates for more distributed processing environments. For example, the SETI@home project - which crunches through radio telescope data, making calculations that are intended to help spot evidence of extraterrestrial life - utilises spare processor time available on ordinary PCs that are linked to the Internet and whose users have signed up to participate in the project. Another interesting focus of current development work is in finding ways to allow geographically distributed researchers to access supercomputers using grid and distributed computing technologies. An example of this is the RealityGrid project (www.realitygrid.org), in which a number of UK universities are participating.

THE HOMOGENEOUS CLUSTER The vast majority of clusters are classed as ‘homogeneous’, in the sense that each node is much like any other. Each has the same amount of memory and all are connected using the same quantity and style of interconnect. The remainder - heterogeneous clusters - might feature, for instance, a subset of nodes that have more memory or a faster interconnect.

Horst Simon, an associate lab director at Lawrence Berkeley National Laboratory, suggested that many applications don’t fit with this kind of homogeneous cluster. For instance, he cited certain classes of ‘multi-scale’ problems, where calculations made at a microscopic level of a model have to be resolved in terms of their influence at the macroscopic level. Certain climate models require detailed simulation of effects like ocean turbulence, on a scale of a few kilometres. These effects can have a discernible influence on the climate for hundreds of miles around - global effects that have to be calculated using the earlier calculation results, made at a smaller scale. The tendency with these climate models is for a lot of change to occur in one or a few of these tiny sub-models, while nothing much is happening in the others. Where this kind of problem has been hierarchically split into sub-tasks, suitable for execution on a parallel processing platform, there will be a tendency for one node, or subset of nodes, to be more overloaded than the others, which will have to wait while these sub-tasks are completed. Nonetheless, the important strides in affordability that cluster platforms have made rely on the use of these homogeneous structures, built from off-the-shelf components. “The challenge is to come up with algorithms that are a good fit with a homogeneous cluster,” he said.
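A toy illustration of the load imbalance described above (hypothetical numbers, not from any cited model): the time for a parallel step is set by the busiest node, not by the average load, so lightly loaded nodes simply wait.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical work assigned to 8 identical nodes (arbitrary units).
     * One multi-scale sub-model (node 3) is far more active than the rest. */
    const double work[8] = {10, 10, 10, 95, 10, 10, 10, 10};

    double total = 0.0, max = 0.0;
    for (int i = 0; i < 8; i++) {
        total += work[i];
        if (work[i] > max)
            max = work[i];
    }

    /* The step finishes only when the busiest node finishes, so the
     * other seven nodes spend most of the step idle. */
    printf("ideal (perfectly balanced) time: %.1f\n", total / 8); /* 20.6 */
    printf("actual time (busiest node):      %.1f\n", max);       /* 95.0 */
    return 0;
}
```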

HARDWARE AND SOFTWARE DECISIONS For those designing cluster platforms, a critical decision at the outset is the choice of processing hardware, generally made in conjunction with the choice of OS. It is a choice guided by factors that include price, performance, power consumption and OS compatibility. Processor vendors - including Intel, AMD, Apple and Sun Microsystems - provide platforms suitable for use in clustering. Apple’s 64bit XServe G5 platform, for instance, which was used for the nodes in Virginia Tech’s SystemX, is available as a 1U rack-mounted server, which can be configured for multiprocessor applications. The relatively low cost of these elements is down to the fact that they are mainly sold into larger markets, such as servers. For instance, only 0.1% of AMD’s Opteron processors end up in supercomputers, according to the company.

Most clusters run Linux as an OS, although a minority run other OSs, such as Solaris, Unix, Mac OS X and FreeBSD. Microsoft has demonstrated an HPC version of Windows, for a planned release in 2005, which includes a Windows-optimised version of MPI - the standard method of coding message-passing functionality in clusters - and facilities such as job scheduling and cluster management. The OS is already being used in some installations, such as in a cluster being used to analyse human genome data by Perlegen Sciences, a bioinformatics organisation.

However, some are sceptical about the likely import of Windows in the HPC arena. The culture of supercomputing tends to prize flexibility, and system developers generally prefer to carry out their own tweaks to optimise their designs. Microsoft has no plans to release its source code, which some see as a potential bugbear. Wil Mayers of Streamline Computing, a UK based cluster integration specialist, said that, while the changes required in common Linux distributions - to achieve a speedup or fix bugs, for example - are minimal, it is often necessary to do at least some. “Time will tell if this lack of source code access is genuinely a limit to what people can achieve,” he said.

One of the more difficult issues to manage in cluster development is the nonlinearity of scaling. Ideally, N CPUs should yield a performance level N times that obtained with a single CPU. Unfortunately, the real world is a much messier place. “You tend to get a law of diminishing returns as you scale up a cluster and add more nodes. In other words, the performance benefit from parallelism doesn’t increase linearly, it plateaus and you tend to get little benefit from use of additional processors,” according to Richard Groves of Streamline Computing.
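The article does not name it, but the effect Groves describes is usually captured by Amdahl’s law: if a fraction s of a job is inherently serial, the best speedup on n processors is 1/(s + (1-s)/n). The short C sketch below works through hypothetical numbers to show the plateau.

```c
#include <stdio.h>

/* Amdahl's law: with a serial fraction s, the best possible speedup
 * on n processors is 1 / (s + (1 - s)/n).  Even a small serial
 * fraction makes the curve plateau as nodes are added. */
static double amdahl(double serial_fraction, int n)
{
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n);
}

int main(void)
{
    const double s = 0.05;                 /* hypothetical: 5% serial work */
    const int nodes[] = {1, 8, 64, 512, 4096};

    for (size_t i = 0; i < sizeof nodes / sizeof nodes[0]; i++)
        printf("%4d nodes -> speedup %6.1fx\n", nodes[i], amdahl(s, nodes[i]));

    /* With s = 0.05 the speedup can never exceed 20x, however many
     * nodes are added - the 'diminishing returns' Groves describes. */
    return 0;
}
```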

There are various ways of managing this effect. A priority of researchers in the parallel processing field is to develop algorithms that minimise the need for inter-nodal communication. However, another angle of attack on this problem is to optimise the speed of communication between nodes. “Cluster interconnects with high bandwidth and low latency, and heavily optimised MPI libraries, will minimise the time spent in communication,” said Turkel of HP.
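One generic way of hiding communication cost (a technique common in MPI codes, not something attributed here to any particular vendor) is to overlap it with computation using non-blocking calls. The sketch below, with a hypothetical one-dimensional halo exchange, starts the transfers, does useful interior work, and only then waits for the messages to complete.

```c
#include <mpi.h>

/* Sketch: exchange boundary (halo) data with neighbouring ranks while
 * working on interior points, so communication and computation overlap. */
void exchange_and_compute(double *local, int n, int left, int right)
{
    MPI_Request reqs[2];

    /* Start the transfers, but don't wait for them yet. */
    MPI_Irecv(&local[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&local[n - 2], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Useful work on interior points that touch neither the incoming
     * halo cell nor the buffer being sent. */
    for (int i = 2; i < n - 3; i++)
        local[i] = 0.5 * (local[i - 1] + local[i + 1]);

    /* Only now block until both messages have completed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```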

Developers of cluster systems have an increasingly rich menu of interconnects to choose from, to enable backplane level communication. A wide array of off-the-shelf components is available to support interconnects using Gigabit Ethernet, Myricom’s Myrinet, Infiniband, and Quadrics’ Qsnet.

THE AD HOC SUPERCOMPUTER Nowadays, the availability of low cost Gigabit Ethernet interfaces - which often come as a standard component in a laptop or desktop PC - has given some developers cause to explore the feasibility of constructing quick-and-easy supercomputers using standard PCs or laptops. Rather than permanent installations, these could be ad hoc arrangements, which could be assembled quickly by a group of people without serious computing expertise.

The feasibility of setting up such an ‘ad hoc supercomputer’ is a hobbyhorse of Pat Miller, a computer science lecturer at the University of San Francisco, who runs a class called ‘DIY Supercomputing’. The first event to see the construction of such a cluster, dubbed a ‘flash mob supercomputer’ (using the popular term ‘flash mob’, recently coined to refer to a group of people who assemble suddenly in a public place to do something unusual or notable, normally organised through the Internet), took place in April 2004. This was in a gymnasium at USF, using over 700 computers donated by members of the public, who had responded to an announcement made about the event on the website Slashdot (http://slashdot.org/). See the box ‘Build your own supercomputer’ for more details.

APPLICATION DEVELOPMENT One of the biggest challenges in the design of HPC platforms that use parallel processing is the difficulty of application development. Clusters are distributed memory systems, which means that programs can’t directly access the memory of remote systems in the cluster, but have to use some sort of message passing between the nodes. The most popular method for that message passing is MPI (Message Passing Interface). The use of MPI requires some skill, and higher level languages are emerging to enable programming of parallel applications for clusters by non-computer scientists.
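As a minimal sketch of that message-passing model (a generic MPI example, not code from the article), one rank below sends a value that another rank can obtain only by receiving an explicit message, never by reading remote memory directly.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double value = 0.0;
    if (rank == 0) {
        /* Rank 0 owns this value; no other node can read it directly. */
        value = 3.14159;
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 must receive an explicit message to obtain the value. */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}
```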

The use of commercially-available software applications can also present difficulties. “Many vendors of technical applications have issues supporting clusters because of the variability of configurations between different vendor and end-user designed clusters,” according to Turkel.

AT THE HIGH END However, the software challenge is particularly acute at the high end of supercomputing, in the domain of problems that, at this stage, cannot realistically be solved using standard clusters running Linux. Here, the barriers to entry are actually becoming higher, according to Simon. As supercomputers become more powerful, they are naturally used to address more complex problems. With certain classes of emerging problem - such as applications in the area of multi-scale and multi-physics system modelling - the investment in software and algorithmic research is beyond the means of anything but a large and well-organised community of dedicated researchers and engineers, he said.

Such a scale of project requires large scale collaboration between mathematicians and computer scientists, for instance, and introduces elements that are quite alien to the existing supercomputing culture - for example, the employment of a professional staff of software engineers to manage aspects of a project such as porting software to run on different platforms. The importance of this kind of approach has been recognised in certain areas, such as climate research (for instance, the Community Climate System Model) and high energy physics, but it is lacking elsewhere, said Simon.

Such feelings hint at the need for change in the culture of HPC. Related sentiments are expressed by the DoD’s Charles Holland. “Looking at the press generated by winning the race for Number 1 on the Top500 Supercomputer Sites list, I become very concerned that such discussions focus the community in the wrong direction. The correct question is whether we are developing systems that improve mission or business effectiveness,” he said. That people are now asking such a question is evident from initiatives such as DARPA’s High Productivity Computing Systems (HPCS) program which, its representatives say, is initiating a fundamental reassessment of how we define and measure parameters like performance, programmability and robustness in the HPC domain. One emerging area it addresses, for instance, is the development of quantitative approaches for measuring the programmability of systems.

So it would seem that, while an unprecedented level of democratisation of HPC is underway at the lower end of the market, the highest performance capabilities are increasingly out of reach.