CLUSTER COMPUTING

PLUG-AND-PLAY CLUSTER COMPUTING: HIGH-PERFORMANCE COMPUTING FOR THE MAINSTREAM

To achieve accessible computational power for their research goals, the authors developed the tools to build easy-to-use, numerically intensive parallel computing clusters using the Macintosh platform. Their approach enables the user, without expertise in the operating system, to develop and run parallel code efficiently, maximizing the advancement of scientific research.

DEAN E. DAUGER
Dauger Research

VIKTOR K. DECYK
University of California, Los Angeles

1521-9615/05/$20.00 © 2005 IEEE
Copublished by the IEEE CS and the AIP

Accessible computing power has become the main motivation for cluster computing—some wish to tap the proliferation of desktop computers, while others seek clustering because they find access to large supercomputing centers to be difficult or unattainable. Both want to combine smaller machines to provide sufficient access to computational power. In this article, we describe our approach to cluster computing to best achieve these goals for scientific users and, ultimately, for the mainstream end user.

One approach, introduced in the mid-1990s, used a parallel computing message-passing library with the Linux operating system and became known as Beowulf-style cluster computing.1 Today, the message-passing interface (MPI)2 is a dominant industry standard, and many MPI implementations are available under open-source license. (Most major supercomputing centers only use MPI for distributed-memory parallel computing; the absence of other message-passing schemes on new hardware is evident at http://hpcf.nersc.gov/software/libs and www.npaci.edu/BlueHorizon/guide/ref.html.)

Our goal is to minimize the time needed to assemble and run a working cluster. Simplicity and straightforwardness are just as important as processing power because power provides nothing if it can’t be used effectively. Moreover, our solution should provide a better total price-to-performance ratio and a higher commitment to the original purpose of such systems: providing the user with large amounts of accessible computing power.

Since 1998, we in the University of California, Los Angeles, Plasma Physics Group have been developing and using a solution to meet these design criteria. It’s based on the Macintosh operating system and uses PowerPC-based Macintosh (Power Mac) hardware; we call it a Mac cluster.3 In our ongoing effort to improve user experience, we continue to streamline the software and add numerous new features. With OS X, the latest, Unix-based version of the Mac OS (www.apple.com/macosx), we’re seeing the convergence of the best of Unix with the best of the Mac.

We’ve extended the Mac’s ease of use to parallel computing for the sake of accessible computing power. In this article, we describe how a user can build a Mac cluster and demonstrate how to operate it.


We then describe the technology we built to accomplish our goals. Part of our effort involves rethinking and streamlining cluster design, installation, and operation. We believe this experience has led us to a cluster solution that maximizes the user’s accessibility to computational power.

Building, Running, and Debugging the Cluster

We streamlined cluster setup to the bare minimum: building a Mac cluster essentially involves connecting the computers to the network, assigning network names and addresses to the nodes, and installing the software.

Building a Mac cluster begins by collecting the hardware: the Power Mac G5s, one category 5 Ethernet cable with RJ-45 jacks per Mac, and an Ethernet switch. The latest Power Mac models have either Fast (100BaseT) or Gigabit Ethernet, so a switch of either type will function well. For each Mac, one end of a cable plugs into the Ethernet jack on the Mac and the other end into a port on the switch. System software is a simple matter: Macs come preinstalled with Mac OS X, so configuring the Macs generally involves making sure each one has a working Internet or IP connection and a unique name, specified in the Network and Sharing System Preferences. Finally, a software package called Pooch operates the cluster (see http://daugerresearch.com/pooch/ for a download version). Running the installer on each Mac’s hard drive completes the parallel computer. Software installation on a node takes only a few seconds, a brevity not found in other cluster types.

Because the intention is that the cluster user will spend most of his or her time interacting with the cluster performing job-launching activities, we’ve invested considerable effort in refining the interface design to minimize the time for the user to run a parallel job. In our documentation, we recommend that users first test their Mac cluster with a simple, reliable parallel computing job. For the purpose of this initial test, the AltiVec Fractal Carbon demo, a demonstration parallel application, is available for free download at http://daugerresearch.com/pooch/. This demonstration of high-performance computing also runs on a single node. The user runs the application in parallel by selecting New Job from Pooch’s File menu. This action opens up a new Job window; the user can drag the AltiVec Fractal Carbon demo from the Finder to this Job window, as depicted in Figure 1.

Next, the user chooses nodes to run in parallel. By default, Pooch selects the node where the job is specified. To add more, the user clicks on Select Nodes…, which invokes a Network Scan window, shown in Figure 2. Double-clicking on a node moves it to the Job window’s node list. If a machine running OS X has two processors, Pooch can use them as if they were separate nodes. Finally, the user starts the parallel job by clicking on Launch Job. Pooch should now be distributing copies of the parallel application to the other nodes and initiating them in parallel. Upon completion of its computational task, the program then calculates its achieved performance, which should be significantly greater than single-node performance.

This initial test also trains the user to accomplish the fundamental tasks required to run a parallel job: selecting an executable, selecting computational resources, and combining these selections through job initiation. Streamlining this user interface is important because submitting jobs is a repetitive task that can potentially occupy much of the user’s time. We chose a GUI because it tolerates the type of error and imprecision that users can accidentally introduce when operating a device. This use of a GUI is meant to contribute to the efficiency with which the user can operate the cluster.

Figure 1. Using the cluster. To set up a parallel computing job, the user drags a parallel application, in this case the AltiVec Fractal Carbon demo, and drops it in Pooch’s Job window.

Figure 2. Selecting nodes. The user chooses computational resources via the Network Scan window, invoked by clicking on Select Nodes from the window in Figure 1.

So that our group’s researchers could maximize their time studying physics, we’ve added enhancements, beyond basic message passing, to the MPI implementation we call MacMPI that make it easier for them to develop parallel programs. One of these is the monitoring of MPI messages, controlled by a monitor flag in MacMPI: it can log every message sent or received. In its default setting, a small monitor window appears (see Figure 3). In this window, status lights indicate whether the node whose screen is being examined is sending or receiving messages from any other node. Green indicates sending, red indicates receiving, and yellow means both. Because messages are normally sent very quickly, these lights blink rapidly. However, if a deadlock occurs, a common occurrence for beginning programmers, the lights will stay lit. The moment such a problem occurs, though, a particular color pattern is immediately visible to the user, who can then apply the new information to debugging the code.

The monitor window also shows a color-coded histogram of the message sizes sent or received to draw the user’s attention to the length of the messages the code is sending. The two dials in MacMPI’s monitor window show the approximate percent of time spent in communication and the average and instantaneous speeds achieved during communication. Although approximate, these indicators are invaluable in revealing problems in the code and network.
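As an illustration of the kind of bug these lights expose (our sketch, not code from MacMPI or the article), the following program has every rank call a blocking send before its receive. For messages too large to be buffered, both partners block in MPI_Send forever, and the monitor’s send lights stay lit. It assumes only standard MPI calls, which we take to be available in MacMPI’s subset.

/* deadlock_demo.c -- minimal sketch of a head-to-head blocking-send
   deadlock, the pattern MacMPI's monitor lights make visible.
   Illustrative only; assumes standard MPI routines. */
#include <mpi.h>
#include <stdio.h>

#define N (1 << 20)              /* large enough that MPI_Send blocks */

int main(int argc, char **argv)
{
    static float buf[N], tmp[N];
    int rank, size, partner;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    partner = rank ^ 1;          /* pair up ranks 0-1, 2-3, ... */

    if (partner < size) {
        /* BUG: every rank sends first; for large messages both sides
           block in MPI_Send waiting for a receive that is never posted. */
        MPI_Send(buf, N, MPI_FLOAT, partner, 0, MPI_COMM_WORLD);
        MPI_Recv(tmp, N, MPI_FLOAT, partner, 0, MPI_COMM_WORLD, &status);
    }

    printf("rank %d done\n", rank);
    MPI_Finalize();
    return 0;
}

Reordering the calls by rank (or using MPI_Sendrecv) resolves the hang; with the monitor window, the stuck light pattern is visible the moment it happens.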

Implementation

In the Mac cluster’s design, we made the responsibilities of the communications library distinct and separate from the code that launches jobs and manages the cluster.

MacMPI

MacMPI, freely available from the AppleSeed site (http://exodus.physics.ucla.edu/appleseed/), is Viktor Decyk’s 45-routine subset of MPI implemented via the Mac OS’s networking APIs. It exists in two forms: MacMPI_X, which uses Apple’s latest Open Transport implementation of TCP/IP available in both OS 9 and OS X, and MacMPI_S, which uses the Unix socket implementation in OS X (see http://developer.apple.com/documentation/CoreFoundation/Networking-date.html).

Using MacMPI, we achieve excellent network performance comparable to other implementations. On OS X, we achieve near peak speed of 100BaseT for large messages. Apple’s Power Mac G5 hardware also comes with built-in Gbit Ethernet ports. On current hardware using a crossover Ethernet cable, the Open Transport implementation’s latency is longer than MacMPI_S’s, but both achieve more than 100 Mbytes/s bandwidth.
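Figures like these come from simple ping-pong timings. The sketch below is illustrative rather than our actual benchmark code; it uses only MPI_Send, MPI_Recv, and MPI_Wtime, routines any MPI implementation provides (we assume they are part of MacMPI’s subset).

/* pingpong.c -- illustrative round-trip timing sketch (not the authors'
   benchmark). Run with at least two processes: rank 0 bounces a 1-Mbyte
   message off rank 1 and reports latency and bandwidth. */
#include <mpi.h>
#include <stdio.h>

#define NBYTES (1 << 20)         /* 1-Mbyte message */
#define REPS   100

int main(int argc, char **argv)
{
    static char buf[NBYTES];
    int rank, i;
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        double rt = (t1 - t0) / REPS;     /* seconds per round trip */
        /* each round trip moves NBYTES in each direction */
        printf("round trip: %g s, bandwidth: %g Mbytes/s\n",
               rt, 2.0 * NBYTES / rt / 1.0e6);
    }
    MPI_Finalize();
    return 0;
}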

MacMPI is a source-code library that users can integrate into their executables. It forms a wrapper library for MPI code written in Fortran and C that assumes that just the fundamental preinstalled operating system is present. MacMPI takes advantage of as much of the operating system as possible to minimize its size and complexity. We’ve used this library on hardware normally not designated for cluster operation, in virtually every possible configuration.

The Pooch Application

Pooch is a parallel computing and cluster management tool that can organize a job’s files into subdirectories on other nodes and retrieve files on those nodes containing output from completed jobs. It can queue jobs and launch them only when certain conditions are met; it can also kill running, launching, or queued jobs and exploit machines on which no one is logged in.

Pooch supports a wide variety of parallel programming environments, enabled by the convergence of technologies in OS X: Carbon, Cocoa, Mach-O, Unix shell scripts, and AppleScripts. As of this writing, Pooch supports five different MPIs: MacMPI, mpich, MPI/Pro, mpich-gm (for Myrinet hardware), and LAM/MPI (see http://daugerresearch.com/pooch/mpi.html).

Figure 3. MPI modifications. The MacMPI monitor window keeps track of statistics about the execution of parallel applications.

Pooch also features four user interfaces. In addition to its drag-and-drop GUI, Pooch’s AppleScript interface makes it possible to write automatic and interactive scripts that perform customized job queuing and other cluster operations. A suite of command-line utilities makes it easy for users to log in from other platforms and control cluster operations.

Additionally, by sending commands through interapplication messages called AppleEvents, other applications can directly control Pooch to achieve Grid-like behavior. The Fresnel Diffraction Explorer (FDE), a parallel desktop application for optics, can initiate its own utilization of a local cluster via Pooch. Although the current incarnation of Globus (www.globus.org) and related technologies combine resources on a supercomputer level, our technology combines desktop machines. Unlike Globus and Condor (www.cs.wisc.edu/condor/), these features are installed, configured, and run by using an accessible GUI. With only a menu selection, desktop applications such as FDE can today automatically take advantage of resources elsewhere on the cluster. Such powerful yet easy-to-use features are essential for parallel computing to become mainstream.

Mac cluster security consists of administrative domains. Each copy of Pooch is hard-coded to a 512-bit encryption key that rotates for each encoded command sent and received; the way it rotates is unique to a particular user or group of Mac cluster users. This approach is much like how a research group shares access to office resources, such as a network printer or a copy machine. The cluster can similarly be shared, which enables it to operate independently of password databases or other external security systems. At the same time, the cluster is accessible the moment users open Pooch, which makes it very easy for them to use. Pooch makes little distinction between users and administrators because current administrative needs are so minor. A recently introduced variant called Pooch Pro adds user login and hierarchy, CPU quotas, and other supercomputer-like administrative features to Mac clustering.

Pooch can also include nodes on almost any subnet of the Internet. We’ve exercised that capability many times, either by initiating jobs on our cluster from home or by combining nodes at arbitrary distances. We’ve even used Pooch to combine nodes at UCLA with machines in Munich, Germany, 10,000 km away.

Distinctions from Other Implementations

Several fundamental differences exist between our approach to cluster computing and that of others.

Division of API and launching mechanism. A fundamental difference from most other cluster types is the clear distinction and separation between the code that performs internode communications for the job and the code that performs job initiation and other cluster management. In most MPI implementations, such as mpich and LAM, these tasks are merged in one package. Only recently has work begun on versions whose organization identifies distinctions between these tasks, such as the emerging MPICH2 rewrite of mpich (http://www-unix.mcs.anl.gov/mpi/mpich2/).

No modification to the operating system. Making no modifications to the operating system let us simplify much of our software design. We didn’t even add a runtime-linked library to the system, much less the system-level or even kernel-level modifications other cluster implementations make. We took this approach so that parallel executables could run on any node regardless of such modifications. We added as little as possible to the system: only one extra piece of executable code, Pooch, to run and operate the cluster. This approach keeps installation time to a minimum, which helps satisfy our design goals with regard to cluster setup.

No static data. All Pooch operations exclusively use dynamically determined information. Pooch, therefore, doesn’t require an administrator to maintain any static data files about the cluster. On OS X 10.2 and later, Pooch’s node discovery implementation uses the Service Location Protocol and Apple’s Rendezvous (also called ZeroConf, www.zeroconf.org) simultaneously (http://developer.apple.com/macosx/rendezvous). These TCP/IP-based discovery services provide Pooch with the information needed by MacMPI to organize and set up the internode connections for the parallel job.
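Although Pooch’s discovery code isn’t reproduced here, the flavor of ZeroConf-based browsing on OS X can be sketched with Apple’s dns_sd API. The service type "_cluster._tcp" below is a hypothetical placeholder, not the registration Pooch actually uses; the sketch only shows how a launcher can enumerate candidate nodes with no static configuration files.

/* browse.c -- illustrative ZeroConf (Rendezvous/Bonjour) browsing sketch.
   This is NOT Pooch's implementation; "_cluster._tcp" is a hypothetical
   service type. It shows dynamic node discovery with no static data. */
#include <dns_sd.h>
#include <stdio.h>

static void browse_reply(DNSServiceRef sdRef, DNSServiceFlags flags,
                         uint32_t ifIndex, DNSServiceErrorType err,
                         const char *name, const char *regtype,
                         const char *domain, void *context)
{
    /* Report each service instance as it appears on the local network. */
    if (err == kDNSServiceErr_NoError && (flags & kDNSServiceFlagsAdd))
        printf("found node service: %s.%s%s\n", name, regtype, domain);
}

int main(void)
{
    DNSServiceRef ref;
    DNSServiceErrorType err;

    /* Start browsing for the (hypothetical) cluster service type. */
    err = DNSServiceBrowse(&ref, 0, kDNSServiceInterfaceIndexAny,
                           "_cluster._tcp", NULL, browse_reply, NULL);
    if (err != kDNSServiceErr_NoError) {
        fprintf(stderr, "DNSServiceBrowse failed: %d\n", (int)err);
        return 1;
    }

    /* Block and deliver discovery callbacks as replies arrive. */
    while (DNSServiceProcessResult(ref) == kDNSServiceErr_NoError)
        ;

    DNSServiceRefDeallocate(ref);
    return 0;
}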

Minimal assumptions about configuration. That we need no further configuration details about the cluster reflects how reliably it tolerates variations in configuration while interfacing and operating with hardware and software. The hardware need not be identical, command-line login isn’t needed, network interfaces can vary, and computing hardware can differ. Again, as far as the platform is concerned, Pooch and MacMPI are just application code. When we run demonstrations, we often ask the audience to volunteer their computers to add to the Mac cluster. The cluster still runs despite the wide variety in configuration of these volunteered machines. This design has many implications for the mainstream because most end users don’t want to worry about such details.

Minimal centralization. A common philosophy used to increase the performance of parallel codes is to eliminate bottlenecks. We’ve extended this concept to the Mac cluster by minimizing centralization such as single points of failure or other bottlenecks. Many Linux clusters assume some sort of shared storage is fully configured and operational before distributing executables or data files, but this is a well-known point of failure. Our cluster approach doesn’t use shared storage, thus eliminating the potential for a bottleneck. Finally, Mac clusters don’t have a head node, at least not in the sense of the head node found in typical Linux-based clusters. There is no permanent controlling unit or units, so the behavior is much more peer to peer. A node is designated node zero for the duration of the job only, but any node can be node zero for the next job, and more than one node can be designated node zero simultaneously. All nodes can act as temporary head nodes, a transient state that occurs only during the brief seconds of the launch process.
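In MPI terms, this transient node zero is simply rank 0 of the job’s MPI_COMM_WORLD, assigned when the job starts and gone when it ends. The sketch below uses standard MPI (not Pooch internals) to show that nothing about rank 0 is tied to a particular machine.

/* rank0.c -- minimal sketch (standard MPI, not Pooch internals) of the
   "node zero" idea: rank 0 exists only for this job's MPI_COMM_WORLD,
   so whichever node receives rank 0 in this run coordinates output,
   and a different node may receive rank 0 in the next job. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double local = 1.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The result is gathered to whichever process is rank 0 for this job
       only; no permanent head node is implied. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("this job's temporary node zero (of %d): total = %g\n",
               size, total);

    MPI_Finalize();
    return 0;
}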

Real-World Experience

The Mac cluster’s performance is excellent for certain classes of problems, mainly those in which communication is small compared to the calculation, yet the message packet size is large. In 2002, for example, Apple introduced the Xserve, a rack-mounted version of a Power Mac meant for server solutions. In collaboration with the Applied Cluster Computing Group at NASA’s Jet Propulsion Laboratory, we ran the AltiVec Fractal Carbon demo at over 217 Gflops on that group’s 33-Xserve dual-processor G4/1000 cluster. (For further details, see http://daugerresearch.com/fractaldemos/JPLXServes/JPLXServeClusterBenchmark.html.)

The University of Southern California gave our team the opportunity to run the Fractal demo and Pooch on 56 of its dual-processor Power Mac G4/533’s plus 20 of its dual-processor Power Mac G4/450’s, where we achieved more than 233 Gflops. (For further details, see http://daugerresearch.com/fractaldemos/USCCluster/USCMacClusterBenchmark.html.) Note that these machines were part of the Language Arts undergraduate computer lab and weren’t meant for cluster work, yet they achieved supercomputer-level results. The latest version of our software operates on the unused, logged-out machines in computer labs of this sort.

We built a new cluster, called the Dawson cluster, in 2004 out of 128 dual-processor Xserves using the PowerPC 970FX (“G5”) processor. Connected via a gigabit network and managed by UCLA’s Academic Technical Services, this cluster achieved 1.21 Tflops using the AltiVec Fractal benchmark and Pooch. (For further details, see http://daugerresearch.com/fractaldemos/DawsonCluster/DawsonCluster128Benchmark.html.) It achieved a similar result using Linpack, placing the Dawson cluster in the Top500 list. Our group is actively using the Dawson cluster for physics research, and we hope to extend the machine to 256 Xserves (512 processors) in the coming year.

As an example of performance specific to our physics research, we successfully ran a 127 million particle 3D electrostatic particle-in-cell (PIC) simulation on a four-node Macintosh G4/1000 dual-processor cluster.4,5 The total time was 17.6 seconds per time step, with a 128 × 128 × 256 grid, and the cost of the machines was less than US$10,000. Just a decade ago, such calculations required the world’s largest and most expensive supercomputers!

Pictured in Figure 4, our inexpensive and powerful cluster of Power Mac G3s, G4s, and G5s has become a valuable addition to our UCLA Plasma Physics Group. We use it to introduce new members to parallel computing and run large calculations for extended periods. The solution we’ve found is unusual in that half of the nodes aren’t dedicated to parallel computing. We regularly purchase high-end Macs and devote them to computation, reassigning the older, slower Macs for individual (desktop) use and data storage. This reuse of Macs in the cluster makes for a very cost-effective solution that satisfies both our parallel computing and desktop computing needs.

In addition, the cluster’s flexibility lets us redirect computational resources very quickly within our group, which is useful for unfunded research or exploratory projects so that we can better prepare for an official proposal later. If one investigator needs to meet a short deadline, he or she can ask the research group, borrow their desktop Macs, and combine them with the dedicated Macs for one large job or many smaller ones.

The cluster’s presence has encouraged new members of our group and visitors to learn how to write portable parallel MPI programs, which they can run later on larger computers elsewhere. The cluster also encourages a more interactive style of parallel programming, in contrast to the batch-oriented processing encouraged by most other cluster types. We can also display on desktop machines the results of calculations made elsewhere in the cluster, which lets us study a simulation partway through the calculation. Checking for mistakes early on saves a great deal of computation time that might otherwise be wasted.

Plasma Physics and Additional Applications

The PIC codes we use in our group are also used in several other high-performance computing projects, such as modeling fusion reactors6 and advanced accelerators.7 Such projects require massively parallel computers, but we’ve found it very convenient to perform research projects on more modest and user-friendly parallel machines such as those in Mac clusters.

Recent calculations by James Kniep and Jean-Noel Leboeuf have concentrated on studying various mechanisms of turbulence suppression in devices such as the Electric Tokamak at UCLA8 and the DIII-D tokamak at General Atomics.9 The researchers involved in these projects use the Mac cluster for smaller problems when they need fast turnaround for quick scoping, as well as for the production calculations that 8-node subsets of our cluster can accommodate. Their results compare favorably to experimental observations of plasma microturbulence characteristics in DIII-D discharges.10

We’ve also been highly successful in applying the cluster’s computational power to the classroom environment, and the experience serendipitously led to a plasma physics conference presentation. At UCLA, the Physics 260 course, “Exploring Plasmas Using Computer Models,” teaches computational plasma physics; students learn about plasmas and operate and analyze data from a plasma physics code. The professor assigned the cluster for the students’ coursework with Parsec, a 3D fully electromagnetic PIC code that John Tonge wrote for investigating the physics of plasma confinement in a levitated internal conductor device. Some of the runs contained as many as 50 million particles on up to a 256 × 256 × 128 grid. Such assignments would have been impossible without the local cluster. The runs could be started at the end of class on one day, and the output could be retrieved for analysis at the next class meeting—an impossibility at supercomputing centers because of their large queues and runtime limitations. The student work even led to an academic contribution to the field.11

Recently, John Huelsenbeck (University of California, San Diego) and Fredrik Ronquist (Uppsala University) wrote and released pMrBayes, a parallel application that performs Bayesian estimates of phylogeny for biology. While discussing approaches for running their parallel code, they described the simplest method as being “to use Dauger’s program Pooch to control the jobs” (http://morphbank.ebc.uu.se/mrbayes3/).

Our approach is unique because, while other solutions seem to direct little, if any, attention to usability, tolerance to variations in configuration, and reliability outside tightly controlled conditions, we find such issues to be as important as raw performance. We believe the ultimate vision of parallel computing is technology so reliable and trivial to install, configure, and use that the user will barely be aware that computations are occurring in parallel.

Figure 4. A portion of our Mac cluster. The cluster in the Department of Physics at the University of California, Los Angeles, features a mix of Power Mac G5s and their predecessors. It routinely combines dedicated nodes with desktop machines to perform physics calculations.

We organize the problems of using parallel computers into two categories: one, building, operating, managing, and maintaining such a machine, and two, determining how best to solve an application on that machine. We’ve described how our work solves the former so that users can devote their energy to the latter. The simplicity of Mac cluster technology makes it a highly effective solution for all but the largest calculations, and we’re continuing to improve on our work for the sake of those users and to respond to their feedback. We’re partnering with industry and other entities to enable shrink-wrapped applications of clusters. Our next step is to expand the cluster to incorporate additional Grid-like and supercomputing features and further enhance its service to users.

Acknowledgments

Many people have provided us useful advice over the last few years. We acknowledge help given by Bedros Afeyan from Polymath Research; Ricardo Fonseca from Instituto Superior Técnico, Lisbon, Portugal; and Frank Tsung, John Tonge, and Warren Mori from the University of California, Los Angeles, and the Applied Cluster Computing Group at NASA’s Jet Propulsion Laboratory.

References

1. T.L. Sterling et al., How to Build a Beowulf, MIT Press, 1999.

2. M. Snir et al., MPI: The Complete Reference, MIT Press, 1996.

3. V.K. Decyk, D. Dauger, and P. Kokelaar, “How to Build an AppleSeed: A Parallel Macintosh Cluster for Numerically Intensive Computing,” Physica Scripta, vol. T84, 2000, pp. 85–88.

4. V.K. Decyk, “Benchmark Timings with Particle Plasma Simulation Codes,” Supercomputer 27, vol. V-5, 1988, pp. 33–43.

5. V.K. Decyk, “Skeleton PIC Codes for Parallel Computers,” Computer Physics Comm., vol. 87, nos. 1 and 2, 1995, pp. 87–94.

6. R.D. Sydora, V.K. Decyk, and J.M. Dawson, “Fluctuation-Induced Heat Transport Results from a Large Global 3D Toroidal Particle Simulation Model,” Plasma Physics and Controlled Fusion, vol. 38, no. 12A, 1996, pp. A281–A294.

7. K.-C. Tzeng, W.B. Mori, and T. Katsouleas, “Electron Beam Characteristics from Laser-Driven Wave Breaking,” Physical Rev. Letters, vol. 79, no. 26, 1997, pp. 5258–5261.

8. M.W. Kissick et al., “Radial Electric Field Required to Suppress Ion Temperature Gradient Modes in the Electric Tokamak,” Physics of Plasmas, vol. 6, no. 12, 1999, pp. 4722–4727.

9. J.C. Kniep, J.-N. Leboeuf, and V.K. Decyk, “Gyrokinetic Particle-In-Cell Calculations of Ion Temperature Gradient Driven Turbulence with Parallel Nonlinearity and Strong Flow Corrections,” Poster 1E25, Int’l Sherwood Fusion Theory Conf., 2003, www.sherwoodtheory.org/sherwood03/agenda.html.

10. T.L. Rhodes et al., “Comparison of Turbulence Measurements from DIII-D L-Mode and High-Performance Plasmas to Turbulence Simulations and Models,” Physics of Plasmas, vol. 9, no. 5, 2002, pp. 2141–2148.

11. J. Tonge et al., “Kinetic Simulations of the Stability of a Plasma Confined by the Magnetic Field of a Current Rod,” Physics of Plasmas, vol. 10, no. 9, 2003, pp. 3475–3483.

Dean E. Dauger is president of Dauger Research. His technical interests include computational physics, particularly the dynamics of complex quantum systems, and high-performance computing and visualization. Dauger has a PhD in physics from the University of California, Los Angeles. He is a member of the APS and the IEEE Computer Society. Contact him at [email protected].

Viktor K. Decyk is a research physicist in the Department of Physics and Astronomy at the University of California, Los Angeles. His technical interests include computational plasma physics using parallel particle-in-cell methods. Decyk has a PhD in physics from the University of California, Los Angeles. He is a member of the APS, SIAM, ACM, and the IEEE Computer Society. Contact him at [email protected].