[IEEE Comput. Soc. Press Scalable High Performance Computing Conference SHPCC-92. - Williamsburg, VA, USA (26-29 April 1992)] Proceedings Scalable High Performance Computing Conference

Visual-Aural Representations of Performance for a Scalable Application Program

Joan M. Francioni and Diane T. Rover

Computer Science Department, University of Southwestern Louisiana

Lafayette, LA 70504†

Abstract

Visual and aural portrayals of parallel program execution are used to gain insight into how a program is working. The combination of portrayals in a coordinated performance environment provides the user with multiple perspectives and stimuli to comprehend complex, multidimensional run-time information. An open question for either medium is how well does it scale? That is, how effectively can it be used to represent program performance on a large parallel computer system? This paper investigates using sound in conjunction with graphics to represent the performance of a scalable application program, the SLALOM benchmark program, executed on the nCUBE 2 distributed memory parallel computer. Custom auralization software is coupled with the PICL and ParaGraph tools. The techniques and results of visually and aurally monitoring program execution on increasing numbers of processors are presented.

1 Introduction

In the arena of high performance computing, especially for large-scale parallel systems, the importance of using software tools to perceive the dynamics, intricacies, and properties of program execution cannot be overstated [9, 13]. Consider a parallel program for a distributed memory parallel machine. There are a number of ways to describe the execution behavior of such a program. For example, the temporal ordering of all sends and receives, the utilization of the processors over time, the activities of the processors (and their relative durations), and the distribution of the work throughout the system each describe certain aspects of the program’s behavior.

One common way of depicting these kinds of behavior is by means of graphical displays. This type of portrayal provides an effective mechanism by which users can gain insight into the complex interactions of the concurrent parts of a program. Such visualization techniques can usually be enhanced by the use of color. Another enhancement is to map the run-time events to sounds and hence provide an auralization of the data that corresponds with the graphical display [2]. In this way, the user is provided with (1) more information, as visual and aural data can be processed simultaneously by humans, and (2) different kinds of information, as our ears and eyes detect different patterns and different temporal cues.

† This work was supported in part by the National Science Foundation under Grant No. CCE-9008702 and Grant No. CCR-9296029, respectively.

0-8186-2775-3/92 $3.00 © 1992 IEEE

Department of Electrical Engineering, Michigan State University, East Lansing, MI 48824

There are a number of general characteristics and properties related to the run-time behavior of a parallel program that can be defined. Some of these lend themselves well to visual displays, some to aural displays, and some to both. These characteristics include:
1. Relative timing of events.
2. Duration of events.
3. Patterns of events.
4. Phases of behavior over time.
5. Frequency over time of some behavior.
6. Balance of behavior over the entire system.
7. Specifics of an event.

Each of these has been displayed for small-scale systems and in forms that are useful. Of special note, experience has shown that patterns are a key to effectively using visual and aural representation techniques to study program execution [6, 7, 11]. Detecting patterns will be an essential step to reduce the apparent complexity inherent in large systems. Beyond detection, it is necessary to determine if a given pattern, in graphics or in sound, is correct, failed, optimum, good, bad, etc. That is a challenge for both small-scale and large-scale systems.

This paper focuses on the scalability of graphics and sound, i.e., their ability to manifest the properties of a parallel program executing on a large system having many processors. After highlighting background work (§2) and introducing the SLALOM benchmark program as the topic of study (§3), this paper presents a collection of performance experiments (§4).

2 Background

It has been demonstrated that presenting multi-dimensional data using sound and graphics increases the probability that more is understood about the data in a shorter amount of time than if either type of presentation is used alone [1, 14]. Moreover, the studies investigated a user’s ability to simultaneously perceive, in parallel, the entire set of data values at a particular instant in time. It was found that the combined sounds of multiple graphs did, in fact, help users identify appropriate correlations among a set of variables. There is reason to believe that these phenomena are true for presentations of parallel program behavior as well.

As evidence of the positive effects of visualization, one need only consult current computing literature and the myriad of references therein, for example in Heath [5], Pancake [9], Reed [10], and Simmons [13]. Reasons for using auralization in general and in parallel computing in particular are well-documented in the feasibility studies done by Francioni and Jackson [2, 7]. Sound is also considered in [9, 10]. In these studies, sound was found to be effective in depicting certain patterns and timing information related to the program behaviors. This allows a user to not only detect patterns of program behavior but also detect anomalies with respect to expected patterns. It was also found that, by listening to the auralization while looking at a related graphical display, the speed of recognition and distinction of whole and partial programs was increased over using either sound or graphics alone. Simply, since certain patterns are more pronounced in one or the other medium, patterns detected in one complement and/or emphasize patterns detected in the other. Just as importantly, neither portrayal detracts from the other.

An open question for either medium is how well does it scale? That is, how effectively can it be used to represent program performance on a large parallel computer system? For visualization, answers have been put forth, and scalability is an active area of study. However, the scalability of auralization, specifically for representing program behavior, is essentially an unknown and remains to be empirically determined.

3 The SLALOM Benchmark Program

The application chosen for this study is the SLALOM™ benchmark program because of its ability to scale to varying problem sizes and its portability across varying machine sizes and architectures. Results of visualizing SLALOM program performance on two parallel machines were reported in [12]. SLALOM stands for Scalable, Language-independent Ames Laboratory One-minute Measurement. This acronym embodies several of the performance measurement goals which guided its design. A detailed description of the SLALOM benchmark program is given in [4]. SLALOM has proven to be a testbed for performance fine-tuning of scientific applications. There are currently over one hundred entries in the benchmark report, ranging from vector and parallel supercomputers to laptop personal computers. SLALOM solves a real application, a radiosity problem in which the walls of a room are decomposed into patches. The program is designed to be scalable to allow any number of patches, n. In fact, the figure of merit for ranking machines is the number of patches, not execution time, MFLOPS rate, etc. That is, it fixes the execution time and lets the problem scale (fixed-time model).

So, as a scalable application program, it is a vehicle to study large-scale systems. Here, we assess what happens to visual-aural representations as the number of processors is increased. Conveniently, SLALOM’s constituent tasks, each commonly found in mainstream scientific computing, can be considered modularly. In this paper, we consider the Solver and Storer tasks, where Solver comprises factoring and backsolving. The n x n matrix in the parallel version of SLALOM (Ver. 1) is distributed among the processors using a two-dimensional scattered decomposition. As will be shown, an important effect of this mapping is that a row of processors “owns” a row of the matrix (similarly for columns).
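The ownership property described above can be sketched as follows. This is only an illustration under stated assumptions: it takes "scattered" to mean a cyclic (wrapped) assignment in which matrix entry (i, j) goes to mesh position (i mod P, j mod Q) on a P x Q processor mesh. The function names `owner` and `owners_of_row` are hypothetical and are not taken from the SLALOM source, and the actual node numbering on the nCUBE 2 may differ.

```python
def owner(i, j, p_rows, p_cols):
    """Return the (mesh_row, mesh_col) that owns matrix entry (i, j).

    Assumption: a cyclic (wrapped) two-dimensional scattered decomposition.
    """
    return (i % p_rows, j % p_cols)


def owners_of_row(i, n, p_rows, p_cols):
    """Return the set of mesh positions holding some entry of matrix row i."""
    return {owner(i, j, p_rows, p_cols) for j in range(n)}


# Example: n = 100 on a 4 x 4 mesh (the 16-processor case).
row_owners = owners_of_row(5, 100, 4, 4)
# All owners of matrix row 5 share one mesh row, namely 5 % 4.
mesh_rows = {r for r, _ in row_owners}
```

Under this assumption, every entry of a given matrix row falls in a single row of the processor mesh, which is the sense in which a row of processors "owns" a row of the matrix.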

Additional studies using SLALOM are possible, including a compare/contrast analysis of “old” versus “new” algorithms and cross-architectural comparisons. These are beyond the scope of this paper.

4 Performance Studies

In this section, we describe the experiments done to study the scalability of auralization. Results are in the form of graphical views, sound playbacks, and a critique of our experiences.

4.1 Tools

The machine used to run the SLALOM program for this study is the nCUBE 2 distributed memory parallel computer. This paper reports on using up to 64 processors. The collection of program event trace data is facilitated by using the PICL subroutine library [3]. Visual portrayal of trace data after program completion is supported by the ParaGraph tool [5]. The auralizations are synchronized with ParaGraph displays. Aural portrayal is done by mapping trace data to specific sounds according to the MIDI protocol [8], converting the data to sound, and then generating the sound (using appropriate sound equipment). Although this is not a fully integrated environment, it is sufficiently coordinated to provide a flexible set-up for experimentation.

4.2 Representation Techniques

Special consideration needs to be given to visual-aural representation schemes for large-scale parallel systems. Regardless of the medium, the approach should support a hierarchical presentation and/or logical grouping of information. In general, however, the suitability of a particular approach to large-scale systems is related to the following:
1. What is the capacity for scaling the sound mapping?
2. What is a means for using sound and graphics together in a scalable way?

Related to the first question, there are definite limitations to scaling auralizations in terms of how much sound can be generated and/or assimilated at once. On the other hand, sound is well suited to portraying a large number of parallel events at any one time. Thus, an effective, scalable sound mapping is one that balances these two properties. The second question deals with the potential synergism between sound and graphics. That is, use of the additional dimensions and properties of sound can reduce the amount of information to be visually portrayed and hence simplify visual displays, possibly making them more scalable. The result is a representation that is perhaps more scalable than either can be individually. In this section we describe two sound mappings with respect to both questions: scalability of sound mappings and synergism of sound and graphics in a meaningful way.

4.2.1 Sending and Receiving Patterns by Groups

In [2], Francioni et al. described a sound mapping that corresponds to the ParaGraph Space-Time diagram. The basic technique is to map each processor to a different note and play the processor’s note whenever it sends or receives a message. The send-notes can be played on different stereo channels or with different instruments from the receive-notes. This mapping works well for relatively small numbers of processors (2-16). The problem with the mapping for large numbers of processors is that, for a given instrument, it is unpleasing to hear certain notes played at the same (or similar) time. With current sound equipment, it is possible to generate on the order of 128 distinct notes for upwards of 100 different timbres. However, it is also very easy to generate noise as opposed to meaningful sounds.
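A minimal sketch of such a per-processor mapping is given here, assuming trace events of the form (time, kind, processor). The constants `BASE_NOTE`, `SEND_CHANNEL`, and `RECV_CHANNEL` and the function names are illustrative assumptions, not details of the software described in [2]. The only MIDI facts relied on are that a note-on message has a status byte of 0x90 plus the channel number, followed by a note number and a velocity, each in the range 0 to 127.

```python
NOTE_ON = 0x90          # MIDI note-on status (upper nibble)
SEND_CHANNEL = 0        # assumed channel for send-notes
RECV_CHANNEL = 1        # assumed channel for receive-notes
BASE_NOTE = 48          # assumed lowest note (arbitrary choice)


def note_for_processor(proc):
    """Map a processor number to a distinct MIDI note number (0-127)."""
    note = BASE_NOTE + proc
    if not 0 <= note <= 127:
        raise ValueError("processor %d has no distinct MIDI note" % proc)
    return note


def event_to_midi(kind, proc, velocity=64):
    """Return the 3-byte note-on message for a 'send' or 'recv' event."""
    channel = SEND_CHANNEL if kind == "send" else RECV_CHANNEL
    return bytes([NOTE_ON | channel, note_for_processor(proc), velocity])


# Hypothetical trace fragment: (time, kind, processor).
trace = [(0.0, "send", 0), (0.1, "recv", 3), (0.2, "send", 3)]
messages = [(t, event_to_midi(kind, proc)) for t, kind, proc in trace]
```

With these assumed constants, only 80 processors can receive distinct notes before the 0 to 127 range is exhausted, which is one concrete form of the scaling limit discussed above.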

A variation of the base mapping is as follows. The processors are separated into a (relatively) small number of groups and individual notes are assigned to each group rather than each processor. In addition, when a processor sends a message to another processor in its same group, the sound for the send is directed to one channel; when a message transcends group boundaries, the sounds for both the send and receive are directed to a different channel. By using two distinct channels, it is possible to listen to the inter-group communication traffic in conjunction with or separately from the intra-group traffic.

The capacity for scaling this group-send-receive mapping is basically unlimited. Any number of processors can be assigned to one group. Also, a user can experiment with different processor-to-group assignments to study the behavior of a program in varying ways. For example, possibilities include dividing the processors into simply two groups to find out which group is doing the most communication, or assigning only the processors working on boundary data to groups in order to isolate this communication behavior from that of the processors working on the internal regions.
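The group variation can be sketched in the same spirit. The sketch assumes that groups are formed from consecutive processor numbers and that only the send is sounded for intra-group messages; the text does not say how an intra-group receive is handled, so that choice is an assumption. The names `group_of` and `route_message`, the channel numbers, and the base note are hypothetical.

```python
INTRA_CHANNEL = 0   # assumed channel for traffic inside a group
INTER_CHANNEL = 1   # assumed channel for traffic crossing group boundaries


def group_of(proc, group_size):
    """Assign consecutive processors to groups of equal size."""
    return proc // group_size


def route_message(src, dst, group_size, base_note=60):
    """Return (channel, notes) for one message from src to dst.

    Intra-group: only the send is sounded, on INTRA_CHANNEL (assumption).
    Inter-group: both the send and the receive are sounded, on INTER_CHANNEL.
    Each note identifies a group, not an individual processor.
    """
    g_src = group_of(src, group_size)
    g_dst = group_of(dst, group_size)
    if g_src == g_dst:
        return INTRA_CHANNEL, [base_note + g_src]
    return INTER_CHANNEL, [base_note + g_src, base_note + g_dst]
```

Because the number of notes depends only on the number of groups, changing `group_size` is all that is needed to try a different processor-to-group assignment.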

A synergistic coupling of this sound mapping with graphics is to have the visual display capture the spatial behavior of the system at a moment in time (through animation) and have the aural display track the temporal behavior. Thus, system activity can be portrayed without dedicating a spatial dimension of the screen to the time variable. In particular, we have combined the visual display of ParaGraph’s Processor Views with the aural display described in this section. (Note that a Processor View is just a re-dimensioned cross-section of a Space-Time Diagram or a Gantt Chart.) This replaces a modestly scalable view, the Space-Time Diagram, with a highly scalable one, a Processor View, and lets the user aurally monitor what happens over time, coupling noticeable visual patterns with audible aural patterns.

4.2.2 Phase Meters for Utilization and Communication

A second sound mapping is designed to depict, over time, the relative amount of system utilization and communication that is occurring. At any single point in time, system utilization is defined to be the percentage of busy processors, and system communication is defined to be the percentage of outstanding messages (those that have been sent but not yet received). Both percentages can be easily represented by pitch: the higher the pitch, the closer the value is to 100 percent. Thus the sound corresponds directly to meters of both values (aural alternatives to ParaGraph’s Utilization and Communication Meters). We assign a different instrument to each sound-meter, sustain each note until that meter changes, and have the sounds directed to two separate channels. By using two channels, both the utilization and communication information can be considered at the same time or individually. The benefit of listening to both sound-meters together comes when trying to detect phases of a program.
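A sound-meter of this kind can be sketched as a mapping from a percentage to a pitch. The linear mapping, the note range (MIDI notes 36 to 96), and the function names are assumptions made for illustration; the paper does not state which pitch range or scale was used. The same `meter_notes` generator could be fed either utilization or communication percentages, each on its own channel.

```python
def percent_to_pitch(percent, low_note=36, high_note=96):
    """Map a value in [0, 100] linearly onto an assumed MIDI note range."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent out of range")
    return low_note + round((high_note - low_note) * percent / 100.0)


def utilization(busy_flags):
    """Percentage of processors that are busy at one instant."""
    return 100.0 * sum(busy_flags) / len(busy_flags)


def meter_notes(samples):
    """Yield (time, note) only when the meter value changes pitch.

    Each yielded note is meant to be sustained until the next one.
    """
    last = None
    for time, percent in samples:
        note = percent_to_pitch(percent)
        if note != last:
            yield time, note
            last = note
```

Because the input is always a percentage, nothing in this sketch depends on the number of processors.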

Since the range of values to be sounded is restricted to be from 0 to 100 percent, this mapping is independent of the actual number of processors in the system. Hence, it is completely scalable. In addition, the sound-meters correspond well with a number of visual displays where the sound represents an aggregate statistic and the graphics display specific values in detail. Such an aural-visual combination serves to save screen space and also relieves the user of simultaneously watching two or more graphical displays.

With a slight variation of this mapping, it is also possible to depict aurally the density of send events over time. Rather than sustaining the communication note, a short note is played on every send where the pitch of the note corresponds to the new percentage value. This allows three threads of information to be depicted at once (system utilization, volume of pending messages, and amount of communication) and can significantly aid in the detection of phases.

4.3 Results

To consider one aspect of scalability, the machine size was varied. SLALOM executions were traced for machine sizes equal to 16 and 64 processors. In each case reported here, a problem size of n = 100 was used. While this represents a relatively small problem size, it is a basis for comparison with larger, more typical problem sizes (and hence a study on the effects of varying problem size). It is noteworthy that monitoring large systems requires not only scalable machines and programs but also scalable tools. We found that using present tools on larger machines and problems is not without its “technical difficulties.”

To manage the volume of event trace data, selective instrumenting of the SLALOM program was done. Each task within SLALOM generates a separate trace file. Within a task, iterative sections of code are traced only for a subset of the full range (based on programmer knowledge of what’s representative). So for each SLALOM execution, there is a suite of event traces, including traces for the Factoring, Backsolve, and Storer tasks. This is what is used to compare behavior and representations across machine sizes.

4.3.1 Results of Group Communication Mapping

The decomposition topology for SLALOM (Ver. 1) on the nCUBE 2 computer is a two-dimensional mesh. So for 16 processors, it is a 4 x 4 mesh, and for 64 processors, an 8 x 8 mesh. Each row of processors in the mesh is mapped to a row of the matrix. In the backsolve phase, shown in Figure 1, activity “moves” row by row up the mesh and matrix. Figure 2a, representing a Processor View captured at the point in time corresponding to mark A in Figure 1, shows such row activity (where three nodes are active on one row of the matrix, and four nodes have already progressed to the next row). Figure 5 illustrates this repetitive “stepping” behavior in more detail for a 16-node run. Based on this data-processor mapping, processors are associated with one of four groups for the 16-node case and with one of eight groups for the 64-node case. Therefore, each group includes an entire row of processors.

The auralization corresponding to this group assignment was able to portray significant information about the run-time behavior of this part of SLALOM. In particular it was aurally detected that
• Inter-group communication is always between nearest group neighbors (except at the end of the task).
• Most communication takes place within a group, and inter-group communication takes place at the end of each phase of intra-group communication.
• The basic communication pattern is repeated the same over and over, and occurs one last time intertwined with the “wrap-up” communication.

The “wrap-up” communication referred to is a set of collective communications (sum collapses) done within each column at the conclusion of an iteration of the block method used. It begins in a staggered fashion as processors complete their last step. As shown in Figure 2b, which corresponds to mark B in Figure 1, the wrap-up communication visually obscures the basic communication pattern. However, it is discernible aurally.

This auralization depicts information similar to that of the enhanced version of the ParaGraph Processor Status View described in [12] (e.g., see the Machine Views in Figures 5 and 6). These views show edges for communication between cells and are configurable for the topology. Although the information presented by the edges visually shows communication within a row of processors versus between rows of processors (or vice versa), the views tend to become cluttered as the number of processors increases. In that sense, aurally detecting intra- versus inter-group communication is a more appropriate extension.

Figure 1: 64-node Space-Time Diagram of Backsolving Task.

Figure 2: 64-node Processor View Snapshots of Backsolving Task (panels a and b).

It is worth mentioning that the initial assignment of processors to groups for the backsolve was made without considering the movement of the data. By experimenting with different group sizes on the 64-node version, it was completely obvious that eight consecutive nodes to a group was appropriate. This is an example of how sound can be used to help define equivalence classes. This classification is supported on the visual end by, for example, ParaGraph’s Communication Matrix, which can indicate sets of processors that regularly and intensively communicate with other sets of processors.
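A non-aural analogue of this group-size experiment is to count, for each candidate size, the fraction of messages whose endpoints fall in the same group. The sketch below uses synthetic traffic, not SLALOM trace data: it assumes an 8 x 8 mesh numbered row by row in which every node sends to its right-hand neighbour in the same row. The function name `intra_group_fraction` is hypothetical.

```python
def intra_group_fraction(messages, group_size):
    """Fraction of (src, dst) messages whose endpoints share a group."""
    same = sum(1 for s, d in messages if s // group_size == d // group_size)
    return same / len(messages)


# Synthetic traffic (an assumption, not measured data): on an 8 x 8 mesh
# numbered row by row, each node sends to its right-hand neighbour in the
# same row, with wrap-around at the end of the row.
messages = [(r * 8 + c, r * 8 + (c + 1) % 8)
            for r in range(8) for c in range(8)]

for size in (2, 4, 8, 16):
    print(size, intra_group_fraction(messages, size))
```

For row-wise traffic of this kind, the smallest group size at which all messages become intra-group is the natural candidate; larger sizes capture the same traffic but merge rows that do not need to be merged.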

We experimented with two other phases of SLALOM with this sound mapping. Figures 3 and 4 show 64-node versions of the Space-Time Diagrams for the Factoring and Storer phases, respectively. In the Factoring phase (Figure 3), the auralizations emphasize that there is a repeated pattern in the communication corresponding to eight segments in the graph (see the region between the bars). Similarly, four segments can be heard in the 16-node version. The factoring algorithm involves advancing down the matrix diagonal and hence repetitively “moving” down the mesh diagonal. As each diagonal processor takes its turn at leading the activity within a segment, a series of overlapping communication events occurs. So, each segment corresponds to the activity led by a particular diagonal processor.

Within each segment, it can also be heard that the inter-group communication pattern (between different rows, possibly column-wise) is essentially the same as the intra-group communication pattern (row-wise). In fact, each segment’s communication activity concludes with a row-wise broadcast followed by a column-wise broadcast. Figure 6 illustrates one of these segments in more detail for a 16-node run.

The Storer phase of SLALOM (Figure 4) appears as three regions: an active one involving processors in the top row of the mesh (low-numbered processors), an apparent transition region, and another active one involving processors along the diagonal of the mesh (low-to-high-numbered processors). In the first active region, processor 0 is collecting data from the top row of processors. In the second active region, processor 0 is collecting data from the diagonal processors. There is no obvious algorithmic reason for the middle transition region, which here interrupts the beginning of the second active region. One possible explanation is that it is an artifact of the tracing (e.g., a delay while trace buffers are being flushed, though this has not been verified). In the auralization of this phase, the tempo of communication events is very similar in both active regions. In the first, only one group incurs any communication (the top row), and in the second, all eight groups are involved in communication (a diagonal processor from each row); again, four groups are heard in the 16-node version.

4.3.2 Results of Phase Meters

The combined meters for utilization and communication were set up for the 16- and 64-node versions of each of the three program phases (Backsolving, Factoring, and Storer) discussed above. In all cases, the sound-meters depicted very rhythmic patterns in the variations between pitches. Although each phase sounded different, the auralizations imply that within a phase both the utilization and communication loads are cyclic. A few specific observations include:
• In the Backsolve phase (Figure 1), both meters are consistently low until the end when they go up and down in opposite directions.
• In the Factoring phase (Figure 3), the sound-meters confirm that the overall repeated pattern occurs every eighth segment. In addition, they depict that, on the average, the utilization is roughly constant throughout; and the communication (both number of sends and volume of pending messages) increases and then decreases within each segment in essentially the same manner each time.
• In the Storer phase (Figure 4), the communication meter depicts very similar behavior for the two active regions, as does the utilization meter. Although the overall utilization is higher in the second active region, the rhythm and melody are largely the same in both regions. Note that since Storer is the final program task, increased utilization at the end is a consequence of an increased percentage of the processors completing their work (because a processor is tagged “busy” at the end of the program).

Figure 3: 64-node Space-Time Diagram of Factoring Task (vertical axis: processor number).

Figure 4: 64-node Space-Time Diagram of Storer Task (vertical axis: processor number).

The sound-meters work well for emphasizing patterns within a trace file. In these cases in particular, the rhythm of the sound-meters was very regular and distinct for each trace.

4.4 Discussion

So, how well do sound and graphics scale? Within the scope of this study, we found that in general visual-aural representations can be used effectively on larger systems.

As expected, suitable approaches must support a hierarchical presentation and/or logical grouping of information. Special consideration needs to be given to a particular approach, however, based on the capacity for scaling a sound mapping and the utility of combining sound and graphic portrayals. Two approaches were focused on: group communication and phase meters. In the former, auralizations highlighted significant information about the run-time behavior of the program. Auralizations emphasized repeated communication patterns of groups (or classes) of processors, and aurally distinguishing intra- versus inter-group communication was a practical and useful strategy. In the latter approach, sound-meters, too, emphasized patterns. Regular and distinct rhythms characterized each program phase.

Empirical studies such as this, combined with tool development and basic research in multimedia representations of complex, multidimensional information, are needed to determine how to practically apply visual-aural techniques to parallel program analysis in general.

5 Acknowledgements

We would like to thank Jay Jackson for his help in fine-tuning the voice assignments of the synthesizer for the sound mappings used. We also appreciate the contributions of Michael Carter in generating PICL traces (amidst several “technical difficulties”) and providing general system assistance. SLALOM runs were executed on the nCUBE 2 computer in the Scalable Computing Laboratory of the USDOE Ames Laboratory, Ames, Iowa.

Figure 5: 16-node Space-Time Diagram and Machine View of Backsolving Task.

Figure 6: 16-node Space-Time Diagram and Machine View of a Segment within the Factoring Task.

References

[1] Bly, Sarah, “Presenting Information in Sound,” Proceedings of the CHI ‘82 Conference on Human Factors in Computer Systems, 1982.

[2] Francioni, Joan, Jay Jackson, and Larry Albright, “The Sounds of Parallel Programs,” Proceedings of the Sixth Distributed Memory Computing Conference, New York: IEEE Computer Society, 1991.

[3] Geist, G., M. Heath, B. Peyton, and P. Worley, “A Machine-independent Communication Library,” Proceedings of the Fourth Conference on Hypercubes, Concurrent Computers, and Applications, Los Altos, CA: Golden Gate Enterprises, 1990.

[4] Gustafson, John, Diane Rover, Stephen Elbert, and Michael Carter, “The Design of a Scalable, Fixed-Time Computer Benchmark,” Journal of Parallel and Distributed Computing, Special issue on modeling of parallel computers, August 1991.

[5] Heath, Michael, and Jennifer Etheridge, “Visualizing the Performance of Parallel Programs,” IEEE Software, Vol. 8, No. 5, September 1991, pp. 29-39.

[6] Hough, A. and J. Cuny, “Initial Experiences with a Pattern-oriented Parallel Debugger,” SIGPLAN Notices, 24, 1989.

[7] Jackson, Jay, and Joan Francioni, “Aural Signatures of Parallel Programs,” Proceedings of the Twenty-Fifth Hawaii International Conference on System Sciences, New York: IEEE, 1992.

[8] MIDI-Musical Instrument Digital Interface, Specification 1.0, The International MIDI Association, 1983.

[9] Pancake, Cherri M., “Software Support for Parallel Computing: Where Are We Headed?”, Communications of the ACM, November 1991.

[10] Reed, Daniel, et al., “Scalable Performance Environments for Parallel Systems,” Proceedings of the Sixth Distributed Memory Computing Conference, New York: IEEE Computer Society, 1991.

[11] Rover, Diane T., “A Performance Visualization Paradigm for Data Parallel Computing,” Proceedings of the Twenty-Fifth Hawaii International Conference on System Sciences, New York: IEEE, 1992.

[12] Rover, Diane T., Michael B. Carter, and John L. Gustafson, “Performance Visualization of SLALOM,” Proceedings of the Sixth Distributed Memory Computing Conference, New York: IEEE Computer Society, 1991.

[13] Simmons, Margaret, and Rebecca Koskela, editors, Performance Instrumentation and Visualization, New York: ACM & Addison-Wesley, 1990.

[14] Mezrich, J., S. Frysinger, and R. Slivjanovski, “Dynamic Representation of Multivariate Time Series Data,” Journal of the American Statistical Association, Vol. 79, No. 385, March 1984, pp. 34-40.
