
For some time, researchers have done production visualization almost exclusively using high-end graphics workstations. They routinely archived and analyzed the outputs of simulations running on massively parallel supercomputers. Generally, a feature extraction step and a geometric modeling step to significantly reduce the data's size preceded the actual data rendering. Researchers also used this procedure to visualize large-scale data produced by high-resolution sensors and scanners. While the graphics workstation allowed interactive visualization of the extracted data, looking only at a lower resolution and polygonal representation of the data defeated the original purpose of performing the high-resolution simulation or scanning.

To look at the data more closely, researchers could run batch-mode software rendering of the data at the highest possible resolution on a parallel supercomputer, using the rendering parameters suggested by the interactive viewing. However, researchers frequently didn't do this for several reasons. First, a supercomputer is a precious resource. Scientists wanted to reserve their limited computer time on the supercomputers for running simulations rather than visualization calculations. Second, many of the parallel-rendering algorithms don't scale well, so the large number of massively parallel computer processors couldn't be fully and efficiently used. Third, most of the parallel-rendering algorithms were developed to satisfy research curiosity rather than for production use. As a result, large and complex data couldn't be rendered cost effectively.

However, the current technology trend of cheaper, more powerful computers prompted us to revisit the option of using parallel software rendering (and in some cases, discarding hardware rendering entirely). Most graphics cards were mainly optimized for polygon rendering and texture mapping. Scientists can now model physical phenomena with greater accuracy and complexity. Analyzing the resulting data demands advanced rendering features that weren't generally offered by commercial graphics workstations. In addition, the short lifespan, limited resolution, and high cost of graphics workstations constrain what scientists can do.

However, the decreasing cost and rapidly increasing performance of commodity PC and network technologies have let us build powerful PC clusters for large-scale computing. Supercomputing is no longer a shared resource. Scientists can build cluster systems dedicated to their own research. They can also build such systems incrementally to solve problems with increasing complexity and scale. More importantly, they can now afford to use the same machine for visualization calculations, either for runtime visual monitoring of the simulation or for postprocessing visualization calculations. Therefore, parallel software rendering is becoming a viable solution for visualizing large-scale data sets.

In this tutorial, we describe two highly scalable, parallel software volume-rendering algorithms. We designed one algorithm for distributed-memory parallel architectures to render unstructured-grid volume data.1 We designed the other for shared-memory parallel architectures to directly render isosurfaces.2

Through the discussion of these two algorithms, we address the most relevant issues when using massively parallel computers to render large-scale volumetric data. The focus of our discussion is direct rendering of volumetric data, so we don't consider other techniques for treating the large-scale data visualization problem, such as feature extraction, multiresolution schemes, and compression.

0272-1716/01/$10.00 © 2001 IEEE

Tutorial

72 July/August 2001

We describe two highly scalable, parallel software volume-rendering algorithms—one renders unstructured grid volume data and the other renders isosurfaces.

Kwan-Liu Ma
University of California, Davis

Steven Parker
University of Utah

Massively Parallel Software Rendering for Visualizing Large-Scale Data Sets

Page 2: Massively Parallel Software Rendering for Visualizing Large

Parallel volume visualization

Volume rendering3 is a powerful technique because it can potentially display more information in a single visualization than techniques such as isosurfacing or slicing. It's more flexible because we can also implement it to generate isosurfaces and cut planes or a mixture of them. Most importantly, direct volume rendering is particularly effective for visualizing fine features and those features that can't be defined analytically or with a binary classification of the data. Figure 1 shows two volume-rendered images.

As Figure 2 shows, the basic volume-rendering algorithm steps through the data volume, integrating color and opacity along a ray. A transfer function defines the relationship between values in the data set and the color and opacity at each point. Areas of high opacity will contribute more to the pixel's final color. Another method for performing volume rendering starts with the data set's cells and integrates color and opacity for multiple pixels under the data cell's projected area. The difference between these two algorithms is roughly analogous to the difference between ray tracing and z-buffer rendering. Other algorithms, such as the first one we describe in the following section, employ a hybrid approach between these two.
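The ray integration just described can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the gray-ramp `transfer_function` is a hypothetical stand-in for a user-specified transfer function.

```python
# Sketch of the ray-integration loop described above (not the authors' code).
# A transfer function maps a scalar sample to (r, g, b, opacity); samples are
# composited front to back, so high-opacity regions dominate the pixel color.

def transfer_function(value):
    """Hypothetical transfer function: gray ramp, opacity proportional to value."""
    return (value, value, value, value)

def integrate_ray(samples, opacity_threshold=0.99):
    """Composite scalar samples (ordered front to back) into an RGBA pixel."""
    r = g = b = alpha = 0.0
    for v in samples:
        sr, sg, sb, sa = transfer_function(v)
        # Front-to-back compositing: weight each sample by remaining transparency.
        weight = (1.0 - alpha) * sa
        r += weight * sr
        g += weight * sg
        b += weight * sb
        alpha += weight
        if alpha >= opacity_threshold:  # early ray termination
            break
    return (r, g, b, alpha)
```

Because opacity accumulates toward 1, a fully opaque sample near the eye ends the loop early, which is the early ray termination the article mentions later.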

Volume rendering is computationally expensive because of the interpolation and shading calculations required for every sample point in the data's spatial domain. We can achieve interactive volume rendering of large-scale data sets by using graphics hardware4,5 or a parallel computer.6 Kniss et al.4 used a hybrid parallel-software and texture-hardware capability of a multipipe SGI Onyx2 and rendered a time-varying 1024³ volume at multiple frames per second. Lum, Ma, and Clyne5 demonstrated that a single PC with an nVidia GeForce 2 card and a disk array (for less than $2,500) could render a time-varying 640 × 256 × 256 volume at high frame rates, permitting interactive exploration in both the data's temporal and spatial domains. While hardware rendering has many advantages, parallel systems often use software rendering algorithms to provide maximum flexibility and to avoid any constraints on data size, image size, or image quality.

A parallel volume-rendering algorithm for 3D unstructured-grid data

Increasingly, we use unstructured meshes to model some of the most challenging scientific and engineering problems, from aerodynamics calculations, accelerator simulations, and blast simulations to bioelectric field simulations. Applying finer meshes only to regions with complex geometry or requiring high accuracy can significantly reduce computing time and storage space. This adaptive approach results in computational meshes containing irregular data cells (in both size and shape). The lack of a simple indexing scheme for traversing the data cells makes visualization calculations on such meshes more expensive than on regular meshes. In a distributed computing environment, irregularities in cell size and shape make balanced load distribution especially difficult.

Ma and Crockett1 introduced a parallel cell-projection volume-rendering algorithm for visualizing unstructured-grid data. The basic algorithm performs the following sequence of tasks:

1. distribute data and visualization parameters,
2. space partitioning,
3. view transformation,
4. scan conversion of data cells,
5. merge of ray segments, and
6. assemble and output final images.

In this tutorial, we discuss the design philosophy behind this algorithm and the implementation strategies we used to achieve high scalability. Our test results show that an implementation of this algorithm on the Cray T3E can achieve above 75 percent parallel efficiency when rendering 18 million tetrahedra using up to 512 processors. For complete performance numbers on several different parallel architectures, refer to the Ma and Crockett article.1

IEEE Computer Graphics and Applications 73

1 Volume visualizations: (a) A volume rendering of the human brain, showing an aneurysm near the right of the image. (Image courtesy of Gordon Kindlmann.) (b) A volume rendering of a CT scan of a Microsoft computer mouse. (Data courtesy of Tony Davis and Matthew Sheats.)

2 A ray traverses a volume, integrating color and opacity through the entire volume. (The figure labels the eye, screen, and volume; the value is searched for along the part of the ray that is inside the volume.)

Key design criteria

One problem shared by many visualization algorithms for unstructured data is the need for a significant amount of preprocessing. One step extracts additional information about the mesh, such as connectivity, to speed up later rendering calculations. Another step may be needed to partition the data based on the particular parallel computing configuration being used (number of processors, communication parameters, and so on). To reduce the user-hassle factor as much as possible and avoid increasing the data size or replicating data, we suggest avoiding offline preprocessing whenever possible—especially for rendering large-scale data sets.

While eliminating preprocessing provides flexibility and convenience, it also means less information is available for optimizing the rendering computations. The algorithm introduced here sacrifices a small amount of performance in favor of enhanced usability. An optimal renderer design should seek a balance between preprocessing cost and parallel-rendering performance. One good design criterion is to conduct low-cost preprocessing calculations (for example, for view-dependent optimization) prior to rendering on the same parallel computer to increase parallel-rendering efficiency. That is, the preprocessing is also parallelized to reduce storage requirements and data transport.

In addition to offering flexibility, the parallel renderer must be highly scalable so that whenever many processors are available, they can be used to render large-scale data efficiently. As we describe in the next section, we can achieve high scalability by using fine-grain load interleaving coupled with an asynchronous communication strategy that overlaps the rendering calculations with interprocessor communication.

Load distribution

Ideally, data should be distributed so that every processor requires the same amount of storage space and incurs the same computational load. Several factors affect this. For the sake of concreteness, let's consider meshes composed of tetrahedral cells. First, there's some cost for scan converting each cell. Variations in the number of cells assigned to each processor will produce variations in workloads. Second, cells come in different sizes and shapes. The difference in size can be as large as several orders of magnitude due to the mesh's adaptive nature. As a result, a cell's projected image area can vary dramatically, which produces similar variations in scan-conversion costs. Furthermore, a cell's projected area also depends on the viewing direction. Finally, voxel values are mapped to color and opacity values according to user-specified transfer functions. An opaque cell can terminate a ray early, thereby saving further merging calculations but introducing further variability in the workload.

One common approach assigns groups of connected cells to each processor so that exploiting cell-to-cell coherence can optimize the rendering process. But connected cells are often similar in size and opacity, so grouping them together exacerbates load imbalances, making it difficult to obtain satisfactory partitionings. For parallel rendering of large-scale unstructured-grid data, the opposite approach—dispersing connected cells as widely as possible among the processors—is in fact better. That is, each processor is loaded with cells taken from the whole spatial domain rather than from a small neighborhood.

We can generally achieve satisfactory scattering of the input data with a simple round-robin assignment policy. With enough cells, the computational requirements for each processor tend to average out, producing an approximate load balance. This approach also satisfies our requirement for flexibility, since the data distribution can be computed trivially for any number of processors, without the need for an expensive preprocessing step.
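A minimal sketch of such a round-robin assignment (illustrative only; `assign_cells` is a hypothetical helper, not from the article's implementation):

```python
# Round-robin cell distribution: cell i simply goes to processor i mod P.
# No preprocessing or spatial analysis is needed, and the mapping can be
# recomputed trivially for any number of processors.

def assign_cells(num_cells, num_procs):
    """Return, for each processor, the list of cell indices it owns."""
    owned = [[] for _ in range(num_procs)]
    for cell in range(num_cells):
        owned[cell % num_procs].append(cell)
    return owned
```

With 10 cells and 4 processors, processor 0 owns cells 0, 4, and 8; connected cells (which tend to have consecutive indices in a mesh file) end up scattered across processors, which is exactly the dispersal the text argues for.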

Dispersing the grid cells among processors also facilitates an important visualization operation for unstructured data—zoom-in viewing. Because of the highly adaptive nature of unstructured meshes, the most important simulation results are usually associated with a relatively small portion of the overall spatial domain. The viewer normally takes a peek at the overall domain, then immediately focuses on localized regions of interest, such as areas with high velocity or gradient values. This zooming operation introduces challenges for efficient visualization in a distributed computing environment. First, locating all the cells that reside within the viewing region can be expensive. Our solution, described in the next section, employs a spatial partitioning tree to speed up this cell searching. Second, if data cells are distributed to processors as connected components, zooming in on a local region will result in severe load imbalances, because only a few processors handle all the rendering calculations while the others are idle.

Although the round-robin distribution discourages data sharing, this rendering algorithm only requires minimal data—the cell and node information. It doesn't need connectivity data. Each cell takes 16 bytes to store four node indices, and each node takes 16 bytes to store three coordinates and a scalar value. As a result, in the worst case of not sharing any node information, 80n bytes of data must be transferred for distributing n cells per processor. For example, on average, about 640,000 bytes of data are transferred to and stored at each processor when distributing a data set consisting of 1 million cells to 128 processors.

In addition to the object-space operations on mesh cells, pixel-oriented ray-merging computations must also be evenly distributed to processors. Local variations in cell sizes within the mesh lead directly to variations in depth complexity in image space. We also need an image partitioning strategy that disperses the ray-merging operations. Scanline interleaving, which assigns successive image scanlines to processors in round-robin fashion, generally works well as long as the image's vertical resolution is several times larger than the number of processors. When using numerous processors, we need finer-grained pixel interleaving to more effectively distribute high-density regions over more processors, resulting in better load balancing and improved scalability. Our test results show we can speed up the overall rendering by nearly 10 percent with pixel interleaving when using a large system.
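The two image partitionings can be sketched as index mappings (hypothetical helper names; the article gives no code, and a real renderer would fold these into its segment-routing step):

```python
# Scanline interleaving assigns image row y to processor y mod P.
# Finer-grained pixel interleaving assigns each pixel individually, so a
# dense image region spreads across all processors rather than a few rows.

def owner_scanline(x, y, num_procs):
    """Processor that merges pixel (x, y) under scanline interleaving."""
    return y % num_procs

def owner_pixel(x, y, width, num_procs):
    """Processor that merges pixel (x, y) under pixel interleaving."""
    return (y * width + x) % num_procs
```

Under scanline interleaving, a hot spot confined to a few rows loads only a few processors; under pixel interleaving, even a single row already touches every processor.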

Parallel space partitioning

The round-robin data distribution scheme helps to achieve flexibility and produces an approximate static load balance. However, it destroys the spatial relationship between mesh cells, making an unstructured data set even more irregular. A certain ordering must be restored so that the rendering step can be performed more efficiently.

The central idea is to have all processors render the cells in the same neighborhood at about the same time. Ray segments generated for a particular region would consequently arrive at their image-space destinations within a relatively short window of time, letting them merge early. This early merging tends to limit the length of the ray-segment list that each processor maintains, which benefits the rendering process in two ways:

• a shorter list reduces the cost of inserting a ray segment in its proper position within the list, and
• the memory needed to store unmerged ray segments is reduced.

To provide the desired ordering, we can group data cells into local regions using a hierarchical spatial data structure such as an octree or k-d tree. Figure 3 shows rendering a region within such a partitioning, where different processors store the different colored cells and scan convert them.

The tree should be constructed cooperatively so that the resulting spatial partitioning is exactly the same on every processor. After the host computer initially distributes the data cells, all processors participate in such a synchronized parallel partitioning process. The algorithm works as follows:

• Each processor examines its local collection of cells and establishes a cutting position so that the two resulting subregions contain about the same number of cells. The cut's direction is the same on each processor and alternates at each level of the partitioning.
• The processors report their proposed local cutting positions to a designated host node, which averages them to obtain a global cutting position. The host node then broadcasts this information to each processor, along with the host's choice of the next subregion to be partitioned. We assign a cell crossing the cut boundary to the region containing its centroid.
• The procedure repeats until the desired number of regions has been generated.
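The synchronized cut selection above can be sketched as follows, simulating the processors sequentially. `global_cut` and `split` are hypothetical helpers; a real implementation would exchange the proposals via message passing rather than a shared list.

```python
# Sketch of one level of the synchronized k-d partitioning. Each entry of
# `local_cells` stands for one processor's cells, given as centroid
# coordinates. Every "processor" proposes its local median as a cut; the
# proposals are averaged into one global cutting position, so every
# processor derives the identical region list.

def global_cut(local_cells, axis):
    """Average the per-processor median positions along `axis`."""
    proposals = []
    for cells in local_cells:
        coords = sorted(c[axis] for c in cells)
        proposals.append(coords[len(coords) // 2])  # local median
    return sum(proposals) / len(proposals)

def split(cells, axis, cut):
    """Assign each cell to the side of the cut containing its centroid."""
    left = [c for c in cells if c[axis] < cut]
    right = [c for c in cells if c[axis] >= cut]
    return left, right
```

Recursing on the resulting subregions, alternating `axis` at each level, produces the same region boundaries on every processor without ever gathering the full mesh in one place.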

At the end of the partitioning process, each processor has an identical list of regions, with each region representing approximately the same rendering load as the corresponding region on every other processor. If all processors render their local regions in the same order, loose synchronization can be achieved because of the similar workloads, letting early ray merging take place within the local neighborhoods. Test results show that, for each pixel, the maximum number of ray segments that can't be premerged and therefore must be stored in the temporary list is always less than 10 percent of the total number of ray segments received for the same pixel. Figure 4 displays plots of the total number of ray segments received for each pixel and the actual number stored. The k-d tree also allows for fast searching of cells within a spatial region specified by a zoom-in view. In addition, the spatial regions can also serve as workload units should we ever need to perform dynamic load balancing.

3 A spatial data structure directs all processors to work on cells in the same neighborhood at about the same time. Different processors handle the different colored cells.

4 Runtime memory consumption for a data set from modeling a flow field surrounding an airplane wing. The left image shows the total number of ray segments received for each pixel. The spike coincides with the fine-mesh region. The right images show the number of ray segments actually stored at two different times during the rendering. It's also apparent from the images how using a spatial partitioning tree affects the order in which ray segments are generated and delivered.

Cell-projection rendering

We use a cell-projection rendering method to render the volume data because it offers more flexibility in data distribution. During rendering, processors follow the same path through the spatial partitioning tree, processing all the cells at each leaf node of the tree. Each processor scan converts local cells independently of other processors and sends the resulting ray segments to the processor that owns the corresponding image scanline. Ray segments received by each processor are merged in depth order using the standard Porter-Duff over operator.

Because the ray segments that contribute to a given pixel arrive in an unpredictable order, each ray segment must contain not only a sample value and pixel coordinates but also starting and ending depth values for sorting and merging within the pixel's ray-segment list. For the typical applications we envision, 10⁶ to 10⁸ ray segments may be generated for each image; at 16 bytes per segment, aggregate communication requirements are approximately 10⁷ to 10⁹ bytes per frame. Clearly, managing the communication efficiently is essential to this approach's viability.
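A sketch of per-pixel merging with the over operator. The field layout here is illustrative; the article only specifies that each 16-byte segment carries pixel coordinates, a sample value, and starting/ending depths, so the premultiplied RGBA tuple below is an assumption for the sake of a runnable example.

```python
# Depth-ordered ray-segment merging with the Porter-Duff "over" operator.
# Segments for one pixel arrive in arbitrary order, so they are sorted by
# starting depth before compositing front to back.

from typing import List, NamedTuple

class Segment(NamedTuple):
    z_start: float
    z_end: float
    rgba: tuple  # premultiplied (r, g, b, a) contribution of this segment

def over(front, back):
    """Porter-Duff over for premultiplied colors: front + (1 - a_front) * back."""
    fr, fg, fb, fa = front
    br, bg, bb, ba = back
    t = 1.0 - fa
    return (fr + t * br, fg + t * bg, fb + t * bb, fa + t * ba)

def merge_pixel(segments: List[Segment]):
    """Composite one pixel's segments front to back after sorting by depth."""
    color = (0.0, 0.0, 0.0, 0.0)
    for seg in sorted(segments, key=lambda s: s.z_start):
        color = over(color, seg.rgba)
    return color
```

In the real algorithm the sort is replaced by insertion into a per-pixel list as segments arrive, which is why keeping that list short (through early merging) matters.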

Task management with asynchronous communication

Good scalability and parallel efficiency can only be achieved if we keep the parallelization penalty and communication cost low. To manage communication costs, we use an asynchronous communication strategy. Its features include

• asynchronous operation, which allows processors to proceed independently of each other during the rendering computations;
• multiplexing of the object-space cell computations with the image-space ray-merging computations;
• overlapped computation and communication, which hides data transfer overheads and spreads the communication load over time; and
• buffering of intermediate results to amortize communication overheads.

During the course of rendering, two main tasks must be performed: scan conversion and image compositing. We can attain high efficiency if we can keep all processors busy doing either of these two tasks. Logically, the scan conversion and merging operations represent separate threads of control, operating in different computational spaces and using different data structures. For the sake of efficiency and portability, however, it's a good strategy to interleave these two operations using a polling strategy, as Figure 5 illustrates. Each processor starts by scan converting one or more data cells. Periodically the processor checks to see if incoming ray segments are available; if so, it switches to the merging task, sorting and merging incoming rays until no more input is pending.
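The polling loop can be sketched as follows. This is a sequential simulation in which a queue stands in for the message-passing layer; the function and parameter names are hypothetical.

```python
# Sketch of the polling strategy from Figure 5: scan convert cells, and every
# `poll_interval` cells switch to the merging task and drain any ray segments
# that have arrived from other processors.

from collections import deque

def render_loop(cells, incoming: deque, scan_convert, merge, poll_interval=4):
    for i, cell in enumerate(cells, start=1):
        scan_convert(cell)
        if i % poll_interval == 0:
            while incoming:          # merge until no more input is pending
                merge(incoming.popleft())
    while incoming:                  # drain whatever remains at the end
        merge(incoming.popleft())
```

The `poll_interval` parameter plays the role of the user-specified polling interval discussed below: polling too often adds overhead, polling too rarely lets outbound buffers clog.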

Because of the large number of ray segments generated, the overhead for communicating each of them individually would be prohibitive on most architectures. Instead, it's better to buffer them up locally and send many ray segments together in one operation. To supplement this, we use asynchronous send and receive operations, which let us overlap communication and computation, reduce data-copying overheads in message-passing systems, and decouple the sending and receiving tasks. This strategy proves most effective when each destination has two or more ray-segment buffers. While a send operation is pending for a full buffer, the scan conversion process can be placing additional ray segments in its companion buffer. In the event that both buffers for a particular destination fill up before the first send completes, rendering can switch to the ray-merging task and process incoming segments while waiting for the outbound congestion to clear (in fact, this is essential to prevent deadlock).

Users may specify two parameters to control the frequency of task switching and communication. The first parameter is the polling interval—that is, the number of cells to be processed before checking for incoming ray segments. If polling occurs too frequently, excessive overheads will be introduced; if it occurs too infrequently, the asynchronous communication scheme will perform poorly as outbound buffers clog up due to pending send operations.

The second parameter is the buffer depth, which indicates how many ray segments should be accumulated before the system posts an asynchronous send. If the buffer size is too small, the overheads for initiating send and receive operations will be excessive, resulting in lowered efficiency. On the other hand, large buffers can introduce delays for processors that have finished their scan conversion work and are waiting for ray segments to merge. Large buffers are also less effective at spreading the communication load across time, resulting in contention delays in bandwidth-limited systems.

5 Task management with asynchronous communication.

When choosing an ideal buffer size, we must take into account the number of processors in use, the number of ray segments to be communicated, and the target architecture's characteristics. As you might suspect, the polling frequency should be selected in accordance with the buffer size. As a general rule, polling should be performed more frequently with smaller buffer sizes or larger numbers of processors. Empirical results agree with this rule.

Because of the rendering algorithm's asynchronous nature, individual processors can't determine when a frame is complete. Hence, we must use a distributed termination-detection protocol. A straightforward approach would use a designated node (like the host processor) to coordinate the termination process by collecting local termination messages and broadcasting a global termination signal. A final global synchronization operation then ends the overall rendering process. When using a large number of processors, this approach would prolong the termination step.

A more scalable solution calls for using a binary merging algorithm based on ray-segment counts. That is, because each processor knows exactly how many ray segments it has sent to each of the other processors, a binary-swap summing process yields the total number of ray segments each processor should eventually receive. At the end of this globally synchronous process, each processor holds exactly the number it needs. If this number is the same as the number of segments the processor has actually received, the processor can stop receiving. Figure 6 illustrates such a summing process for a four-processor case. This binary-swap summing process can be carried out as soon as all processors finish projecting local cells. Such an algorithm runs in logarithmic, rather than linear, time and doesn't involve the host, making it more efficient and scalable to a larger number of processors.
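The quantity the binary-swap process computes can be sketched as a column sum over the send-count matrix. Here a plain reduction stands in for the log-time pairwise exchange; the numbers in the test are the four-processor send counts shown in Figure 6.

```python
# Each processor i holds sent[i][j], the number of ray segments it sent to
# processor j. Summing these vectors across processors (done in the article
# with a log-time binary-swap exchange; a direct reduction here) gives, for
# each processor, the total number of segments it should eventually receive.

def expected_counts(sent):
    """Column sums of the send-count matrix: the per-processor receive totals."""
    num_procs = len(sent)
    return [sum(sent[i][j] for i in range(num_procs)) for j in range(num_procs)]
```

Once a processor has received its expected total, it knows its part of the frame is complete, so no host-coordinated termination broadcast is needed.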

Discussion

We performed tests of the parallel volume-rendering algorithm using an 18 million cell data set obtained from a simulation of flow surrounding an aircraft wing (see Figure 7). Direct volume rendering of the flow speed (with the wing taken out) reveals many fine features in the flow field that would be invisible with conventional 2D or 3D contour plots. In particular, we can verify the low-pressure region (red spherical cloud) above the wing and the high-pressure region (yellow and orange blobs) below the wing. We can also see the extreme low-velocity values on the flaps (white stripes) and the complex flow patterns ahead of the leading edge and behind the trailing edge of the wing. None of these detailed phenomena could be seen with either low-resolution data or rendering.

The algorithm scales well on large parallel systems.As we mentioned earlier, we obtained more than 75 per-cent parallel efficiency on the Cray T3E using up to asmany as 512 processors. Even on a cluster of Sun Ultra5 and Ultra 60 CPUs over Fast Ethernet, our experi-mental results show more than 63 percent parallel effi-ciency using up to 128 processors. A PC cluster with a1-Gbit network should easily outperform the T3E forcomparable numbers of processors and network per-formance. Nevertheless, we should point out that thisalgorithm doesn’t suit shared-memory architectures.Our experiments on the SGI Origin 2000 architectureshow the algorithm doesn’t scale beyond 32 processors.7

The algorithm we describe next was designed specifically for shared-memory architectures.

A parallel ray-tracing algorithm for isosurface rendering

The parallel rendering system we discussed in the previous section operates from object space, rendering individual volume elements to a screen. The other approach is to do the opposite: for each pixel on the screen, determine which object or objects contribute to that pixel's final color. This process is ray tracing.


Figure 6. Termination is based on a binary-swap summing process. At the end of the process, the total number of ray segments each processor should receive becomes available locally. (In the four-processor example, processors 1 through 4 hold per-destination send counts of 3 5 2 6, 4 4 9 2, 3 10 5 0, and 7 6 7 2, which sum to the totals 17, 25, 23, and 10.)

Figure 7. Visualization of flow surrounding an aircraft wing, with the wing taken out.


Conventional wisdom holds that ray tracing is too slow to be competitive with hardware z-buffers. However, when rendering a sufficiently large data set, ray tracing should be competitive because its low time complexity ultimately overcomes its large time constant.8

Because of the asymptotic complexity of each algorithm, ray tracing becomes a faster algorithm for rendering images with large numbers of primitives. This crossover will happen sooner on a multiple-CPU computer because of ray tracing's high degree of intrinsic parallelism. The same arguments apply to the volume traversal problem.

Figure 2 shows the basic ray-volume traversal method we describe here. This framework lets us implement volume visualization methods that find exactly one value along a ray. Here, we describe an isosurfacing algorithm for parallel ray tracing. Another example is maximum-intensity projection, a direct volume-rendering technique where the opacity is a function of the maximum intensity seen along a ray.

We know that ray tracing is accelerated through two main techniques:9 accelerating or eliminating ray–voxel intersection tests and parallelization. Acceleration is usually accomplished by a combination of spatial subdivision and early ray termination.3

Ray tracing for volume visualization naturally lends itself toward parallel implementations.10 The computation for each pixel occurs independently of all other pixels, and the data structures used for casting rays are usually read only. These properties have resulted in many parallel implementations. Researchers have used a variety of techniques to make such systems parallel and have built many successful systems.11,12 (Whitman13 surveyed these techniques.)

Our system organizes the data into a shallow rectilinear hierarchy for ray tracing. For unstructured or curvilinear grids, we imposed a rectilinear hierarchy over the data domain. Within a given level of the hierarchy, we traverse from cell to cell using the incremental method that Amanatides and Woo14 described.
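The incremental cell-to-cell stepping can be sketched as follows. This is a minimal, serial rendition of the Amanatides–Woo idea under our own naming, not the system's actual code; it assumes a unit-spaced grid and returns the list of visited cells.

```python
# Incremental grid traversal: step from cell to cell along a ray, always
# advancing the axis whose next grid boundary is closest.
import math

def traverse(origin, direction, grid_dims):
    pos = [int(math.floor(c)) for c in origin]   # starting cell
    step, t_max, t_delta = [], [], []
    for o, d in zip(origin, direction):
        if d > 0:
            step.append(1)
            t_max.append((math.floor(o) + 1 - o) / d)   # t to next + boundary
        elif d < 0:
            step.append(-1)
            t_max.append((o - math.floor(o)) / -d)      # t to next - boundary
        else:
            step.append(0)
            t_max.append(math.inf)                      # never crosses this axis
        t_delta.append(abs(1.0 / d) if d != 0 else math.inf)
    cells = []
    while all(0 <= pos[a] < grid_dims[a] for a in range(3)):
        cells.append(tuple(pos))
        a = t_max.index(min(t_max))    # axis with the nearest boundary
        pos[a] += step[a]
        t_max[a] += t_delta[a]
    return cells
```

For an axis-aligned ray the traversal degenerates to a simple march along one axis; for oblique rays it interleaves steps along the axes in the order their boundaries are crossed.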

Memory bricking

The first optimization improves data locality by organizing the volume into bricks, which are analogous to the image tiles used in image-processing software and in other volume-rendering programs15 (see Figure 8).

Effectively using the cache hierarchy is a crucial task in designing algorithms for modern architectures. Bricking, or 3D tiling, has been a popular method for increasing locality for ray-cast volume rendering. The system reorders the data set into n × n × n cells, which then fill the entire volume. On a machine with 128-byte cache lines and 16-bit data values, n is exactly 4 (a 4 × 4 × 4 brick of 16-bit values occupies exactly one 128-byte cache line). For float (32-bit) data sets, n is closer to 3.

Using the translation look-aside buffer (TLB) effectively is also becoming a crucial factor in algorithm performance. The same bricking technique can improve TLB hit rates by creating m × m × m bricks of n × n × n cells. For example, a 40 × 20 × 19 volume could be decomposed into 4 × 2 × 2 macrobricks of 2 × 2 × 2 bricks of 5 × 5 × 5 cells. This corresponds to m = 2 and n = 5. Because 19 can't be factored by mn = 10, we need one level of padding. We use m = 5 for 16-bit data sets and m = 6 for 32-bit data sets.

The resulting offset q into the data array can be computed for any (x, y, z) triple with the expression

q = ((x ÷ n) ÷ m) n³m³ ((Nz ÷ n) ÷ m)((Ny ÷ n) ÷ m)
    + ((y ÷ n) ÷ m) n³m³ ((Nz ÷ n) ÷ m)
    + ((z ÷ n) ÷ m) n³m³
    + ((x ÷ n) mod m) n³m²
    + ((y ÷ n) mod m) n³m
    + ((z ÷ n) mod m) n³
    + (x mod n) n² + (y mod n) n + (z mod n)     (1)

where Ny and Nz are the respective sizes of the data set.

Equation 1 contains many integer multiplication, divide, and modulus operations. On modern processors, these operations are extremely costly (32 or more cycles on the MIPS R10000). When n and m are powers of two, we can convert these operations to bit shifts and bitwise logical operations. However, the ideal size is rarely a power of two. Thus, we need a method that addresses arbitrary sizes. We can convert some of the multiplications to shift–add operations, but the divide and modulus operations prove more problematic. Computing the indices incrementally would require tracking nine counters, with numerous comparisons and poor branch-prediction performance. Note that we can write Equation 1 as

q = Fx(x) + Fy(y) + Fz(z) (2)

where


Figure 8. We can organize cells into tiles or bricks in memory to improve locality. The numbers in the first brick represent layout in memory. Neither the number of voxels nor the number of bricks need be a power of two.

Figure 9. With a two-level hierarchy, rays can skip empty space by traversing larger cells. We used a three-level hierarchy for most of the examples in this article.


Fx(x) = ((x ÷ n) ÷ m) n³m³ ((Nz ÷ n) ÷ m)((Ny ÷ n) ÷ m) + ((x ÷ n) mod m) n³m² + (x mod n) n²

Fy(y) = ((y ÷ n) ÷ m) n³m³ ((Nz ÷ n) ÷ m) + ((y ÷ n) mod m) n³m + (y mod n) n

Fz(z) = ((z ÷ n) ÷ m) n³m³ + ((z ÷ n) mod m) n³ + (z mod n)

We tabulate Fx, Fy, and Fz and use x, y, and z, respectively, to find three offsets in the array. We sum these three values to compute the index into the data array. These tables will consist of Nx, Ny, and Nz elements, respectively. The total sizes of the tables will fit in the processor's primary data cache even for large data-set sizes. Using this technique, we note that you could produce more complex mappings than the two-level bricking we describe here, although it's not obvious which would achieve the highest cache use.
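The table-lookup scheme can be written directly from the equations above. The following is an illustrative Python rendition (the function names are ours, and Nx, Ny, Nz are assumed to be padded to multiples of nm); a useful sanity check is that summing the three table entries visits every array offset exactly once.

```python
# Precompute the per-axis offset tables Fx, Fy, Fz of the two-level bricking
# scheme (Equation 1). Nx, Ny, Nz must be multiples of n*m (padded if needed).
def make_offset_tables(Nx, Ny, Nz, n, m):
    My = (Ny // n) // m          # macrobricks along y
    Mz = (Nz // n) // m          # macrobricks along z
    n3, m3 = n**3, m**3
    Fx = [((x // n) // m) * n3 * m3 * Mz * My
          + ((x // n) % m) * n3 * m**2
          + (x % n) * n**2 for x in range(Nx)]
    Fy = [((y // n) // m) * n3 * m3 * Mz
          + ((y // n) % m) * n3 * m
          + (y % n) * n for y in range(Ny)]
    Fz = [((z // n) // m) * n3 * m3
          + ((z // n) % m) * n3
          + (z % n) for z in range(Nz)]
    return Fx, Fy, Fz

def offset(Fx, Fy, Fz, x, y, z):
    # Three table lookups and two adds replace all the div/mod work.
    return Fx[x] + Fy[y] + Fz[z]
```

For a 10 × 10 × 10 volume with n = 5 and m = 2, the resulting offsets form a permutation of 0 through 999, confirming that the mapping is one to one.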

For many algorithms, each iteration through the loop examines the eight corners of a cell. To find these eight values, we only need to look up Fx(x), Fx(x + 1), Fy(y), Fy(y + 1), Fz(z), and Fz(z + 1). This amounts to six index-table lookups for the eight data-value lookups.

Multilevel grid

The other basic optimization we use is a multilevel spatial hierarchy to accelerate the traversal of empty cells, as Figure 9 shows. Cells are divided into equal portions, and then the system creates a "macrocell," which contains the minimum and maximum data values of its child cells. This is a common variant of standard ray-grid techniques16 and resembles previous multilevel grids.17 Others have shown minimum–maximum caching to be useful.18,19

The ray–isosurface traversal algorithm examines the min and max at each macrocell before deciding whether to recursively examine a deeper level or to proceed to the next cell. The typical complexity of this search will be O(∛n) for a three-level hierarchy. While the worst-case complexity is still O(n), it's difficult to imagine an isosurface approaching this worst case. Using a deeper hierarchy can theoretically reduce the average-case complexity slightly, but it also dramatically increases the storage cost of intermediate levels.
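As an illustration of the macrocell test, the following is a serial toy version under our own naming (the dict-based volume and a macrocell edge length that divides the volume are simplifying assumptions, not the system's representation): build min/max macrocells, then descend only into those whose value range spans the isovalue.

```python
# Build one level of k*k*k min/max macrocells over a cell volume, then
# report which cells actually need an isosurface test for a given isovalue.
def build_macrocells(vol, dims, k):
    mdims = tuple(d // k for d in dims)
    macro = {}
    for mx in range(mdims[0]):
        for my in range(mdims[1]):
            for mz in range(mdims[2]):
                vals = [vol[(mx*k + i, my*k + j, mz*k + l)]
                        for i in range(k) for j in range(k) for l in range(k)]
                macro[(mx, my, mz)] = (min(vals), max(vals))
    return macro

def cells_to_test(vol, dims, k, iso):
    cells = []
    for (mx, my, mz), (lo, hi) in build_macrocells(vol, dims, k).items():
        if lo <= iso <= hi:          # descend only if the range spans iso
            cells.extend((mx*k + i, my*k + j, mz*k + l)
                         for i in range(k) for j in range(k) for l in range(k))
    return cells
```

On a 4 × 4 × 4 volume whose values equal the x coordinate, an isovalue of 2.5 prunes the low-x half of the volume: only the 32 cells under macrocells spanning x = 2 to 3 remain candidates.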

We've experimented with modifying the number of levels in the hierarchy and empirically determined that a trilevel hierarchy (one top-level cell, two intermediate macrocell levels, and the data cells) is efficient. This optimum may be data dependent and is modifiable at program startup. Using a trilevel hierarchy, the storage overhead is negligible (typically less than 0.5 percent of the data size). The cell sizes used in the hierarchy are independent of the brick sizes used for cache locality in the first optimization.

We can index macrocells with the same approach that we used for memory bricking the data values. However, in this case each macrocell requires three table lookups. This, combined with the macrocells' significantly smaller memory footprint, made the effect of bricking the macrocells negligible.

Rectilinear isosurfacing

Once the data have been organized into bricks and macrocells, the system can use these data structures to render isosurfaces interactively. Our algorithm has three phases:

1. traversing a ray through cells that don't contain an isosurface,
2. analytically computing the isosurface when intersecting a voxel containing the isosurface, and
3. shading the resulting intersection point.

This process repeats for each pixel on the screen. A benefit is that adding incremental features to the rendering has only incremental cost. For example, if you visualize multiple isosurfaces with some of them rendered transparently, the correct compositing order is guaranteed because we traverse the volume in a front-to-back order along the rays. Additional shading techniques, such as shadows and specular reflection, can easily be incorporated for enhanced visual cues. Another benefit is the ability to exploit texture maps without being limited by a fixed quantity of special texture memory. We've demonstrated results with 5.1-Gbyte 3D textures interactively.

If we assume a regular volume with even grid-point spacing arranged in a rectilinear array, then ray–isosurface intersection is straightforward. Analogous simple schemes exist for the intersection of tetrahedral cells.

Figure 10. The ray traverses each cell (left); when it encounters a cell that contains the isosurface (right), the system performs an analytic ray–isosurface intersection computation. The ray is x = xa + t xb, y = ya + t yb, z = za + t zb, and the intersection satisfies ρ(x, y, z) = ρiso.

To find an intersection (see Figure 10), the ray a + tb traverses cells in the volume, checking each cell to see if its data range bounds the isovalue. If it does, the system performs an analytic computation to solve for the ray parameter t at the intersection with the isosurface:

ρ(xa + t xb, ya + t yb, za + t zb) − ρiso = 0     (3)

When approximating ρ with a trilinear interpolation between discrete grid points, this equation expands to a cubic polynomial in t. This cubic can then be solved in closed form to find the intersections of the ray with the isosurface in that cell. We use the closed-form solution because its stability and efficiency haven't been major issues for the data we've used in our tests. The system examines only the roots of the polynomial contained in the cell. There may be multiple roots, corresponding to multiple intersection points. In this case, we use the smallest t (closest to the eye). There may also be no roots of the polynomial, in which case the ray misses the isosurface in the cell. Note that using trilinear interpolation directly produces more complex isosurfaces than is possible with a marching cubes algorithm. Figure 11 shows an example of this, which illustrates case 4 from Lorensen and Cline's paper.20 Techniques such as the Asymptotic Decider21 could disambiguate such cases, but they would still miss the correct topology because of the isosurface interpolation scheme.
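The per-cell intersection step can be illustrated as follows. The paper solves the cubic in closed form; as a simpler stand-in, this sketch (all names ours) samples the trilinear field along the ray for a sign change and bisects, which recovers the same smallest root for well-behaved cells.

```python
# Trilinear interpolation of the eight corner values rho[i][j][k] inside a
# unit cell, in local coordinates (x, y, z) in [0, 1].
def trilinear(rho, x, y, z):
    v = 0.0
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                w = (x if i else 1 - x) * (y if j else 1 - y) * (z if k else 1 - z)
                v += w * rho[i][j][k]
    return v

# Find the smallest t in [0, t_exit] where the interpolated density along
# x(t) = a + t*b crosses iso. Numerical stand-in for the closed-form cubic:
# sample for a sign change, then bisect the bracketing interval.
def first_intersection(rho, a, b, iso, t_exit, samples=64, iters=40):
    f = lambda t: trilinear(rho, *(a[c] + t * b[c] for c in range(3))) - iso
    t0, f0 = 0.0, f(0.0)
    for s in range(1, samples + 1):
        t1 = t_exit * s / samples
        f1 = f(t1)
        if f0 == 0.0:
            return t0                       # landed exactly on the surface
        if f0 * f1 < 0:                     # sign change brackets a root
            for _ in range(iters):
                tm = 0.5 * (t0 + t1)
                if f0 * f(tm) <= 0:
                    t1 = tm
                else:
                    t0, f0 = tm, f(tm)
            return 0.5 * (t0 + t1)
        t0, f0 = t1, f1
    return t0 if f0 == 0.0 else None        # None: ray misses the isosurface
```

For a cell whose corner values equal the x coordinate, the interpolated field along an x-aligned ray is linear, and the sketch returns the expected crossing.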

Unstructured isosurfacing

For unstructured meshes, we use the same memory hierarchy as we did in the rectilinear case. However, we can control the cell size's resolution at the finest level. We chose a resolution that uses approximately the same number of leaf nodes as there are tetrahedral elements. The leaf nodes store a list of references to overlapping tetrahedra (see Figure 12). For efficiency, we store these lists as integer indices into an array of all tetrahedra.

Rays traverse the cell hierarchy in a manner identical to the rectilinear case. However, when the system detects a cell that might contain an isosurface for the current isovalue, it tests each of the tetrahedra in that cell for intersection. The system doesn't use connectivity information for the tetrahedra; instead it treats them as independent items, just as in a traditional surface-based ray tracer.

The system computes the isosurface for a tetrahedron implicitly by using barycentric coordinates. It computes the intersection of the parameterized ray and the isoplane directly, using the implicit equation for the plane and the parametric equation for the ray. The system checks whether the intersection point is still within the bounds of the tetrahedron by ensuring that the barycentric coordinates are all positive. Parker et al. describe the details of this intersection.4
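A minimal sketch of the tetrahedral case (our own naming, with Cramer's-rule barycentrics): since the linearly interpolated field is affine along the ray, one division yields the candidate t, and the barycentric coordinates decide whether the hit lies inside the element.

```python
# Determinant of a 3x3 matrix given by columns c0, c1, c2.
def det3(c0, c1, c2):
    return (c0[0] * (c1[1] * c2[2] - c1[2] * c2[1])
          - c1[0] * (c0[1] * c2[2] - c0[2] * c2[1])
          + c2[0] * (c0[1] * c1[2] - c0[2] * c1[1]))

# Barycentric coordinates of point p with respect to tetrahedron verts.
def barycentric(verts, p):
    v0, v1, v2, v3 = verts
    cols = [tuple(v1[i] - v0[i] for i in range(3)),
            tuple(v2[i] - v0[i] for i in range(3)),
            tuple(v3[i] - v0[i] for i in range(3))]
    rhs = tuple(p[i] - v0[i] for i in range(3))
    d = det3(*cols)
    l1 = det3(rhs, cols[1], cols[2]) / d      # Cramer's rule
    l2 = det3(cols[0], rhs, cols[2]) / d
    l3 = det3(cols[0], cols[1], rhs) / d
    return (1 - l1 - l2 - l3, l1, l2, l3)

# Intersect the ray a + t*b with the iso-plane of the linear field defined
# by vertex values vals; accept only hits inside the tetrahedron.
def ray_tet_iso(verts, vals, a, b, iso):
    s = lambda p: sum(l * v for l, v in zip(barycentric(verts, p), vals))
    s0 = s(a)
    s1 = s(tuple(a[i] + b[i] for i in range(3)))  # the field is affine in t
    if s1 == s0:
        return None                               # ray parallel to iso-plane
    t = (iso - s0) / (s1 - s0)
    p = tuple(a[i] + t * b[i] for i in range(3))
    if t >= 0 and all(l >= 0 for l in barycentric(verts, p)):
        return t
    return None
```

On the unit tetrahedron with a field equal to the x coordinate, an x-aligned ray starting at x = -0.5 hits the iso-plane x = 0.25 at t = 0.75, and the barycentric test confirms that the point lies inside the element.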

Discussion

We contrasted this algorithm with explicitly extracting polygonal isosurfaces from the Visible Woman data set. For the skin isosurface, we generated 18,068,534 polygons; for the bone isosurface, 12,922,628 polygons. These numbers are consistent with those Lorensen reported, given that he used a cropped version of the volume.22 With this number of polygons, it would be challenging to achieve interactive rendering rates on conventional high-end graphics hardware.

Figure 11. (a) The isosurface from the marching cubes algorithm. (b) The isosurface resulting from the true cubic behavior inside the cell.

Figure 12. For a given leaf cell in the rectilinear grid, indices to the shaded elements of the unstructured mesh are stored.

Table 1. Explicit bone isosurface extraction times.

Number of CPUs    Noise Build (seconds)    Noise Extract (seconds)    Marching Cubes (seconds)
 1                4,838                    110                        627
 2                2,109                     81                        324
 4                1,006                     56                        171
 8                  885                     31                         93
16                  437                     24                         49
32                  118                     14                         26
64                   59                     12                         24

Our method can render a ray-traced isosurface of this data at roughly 10 frames per second using a 512 × 512 image on 64 processors. Table 1 shows the extraction time (in seconds) for the bone isosurface using both near-optimal isosurface extraction (Noise)23 and marching cubes.20 Note that these numbers would improve if we used a dynamic load-balancing scheme. However, this wouldn't let us interactively modify the isovalue while displaying the isosurface. Using a downsampled or simplified detail volume would allow interaction at the cost of some resolution. Simplified, precomputed isosurfaces could also yield interaction, but storage and precomputation time would be significant. Triangle stripping could improve display rates by up to a factor of three because isosurface meshes are usually transform bound. Note that we gain efficiency for both the extraction and rendering components by not explicitly extracting the geometry. Therefore, our algorithm doesn't suit applications that will use the geometry for nongraphics purposes.

Our system's interactivity lets us explore the data by interactively changing the isovalue or viewpoint. For example, you could view the entire skeleton and interactively zoom in and modify the isovalue to examine the detail in the toes, all at about 10 fps. Figure 13 shows the variation in frame rate.

Figure 13. Variation in frame rate as the viewpoint and isovalue change. (The original figure plots frames per second, on 32 processors, against frame number over time as the interaction moves through rotate, zoom-in, zoom-out, and change-isovalue phases. An accompanying table lists frame rate and speedup for 1 to 124 processors across several rendering sequences, showing near-linear speedup: roughly 100 times at 124 processors.)

Conclusions

We described two different algorithms for rendering large-scale data sets on a parallel machine. Each algorithm has its own strengths. We've pushed the first algorithm to a higher number of processors (512), while the second is limited by the number of processors available in a shared-memory system (currently tested to 128). For larger image sizes, the first algorithm may be more appropriate, as large images slow down the ray-tracing process. However, for large data sets and smaller images, the ray-tracing method can outperform high-end graphics hardware even on a modest number of processors.

Parallel rendering is reaching a stage where it's usable for interactive rendering of large-scale data sets, not just for offline generation of high-quality images. The two algorithms we presented here are just a sample of those that are making interactive large-scale scientific visualization a reality. We anticipate that data sizes will continue to grow at such a rate that direct parallel rendering alone won't suffice to guarantee interactive rendering. Additional techniques such as compression should be applied whenever possible to accelerate both rendering and communication. A multiresolution framework is particularly desirable when there's a mismatch between the data size and the number of processors. That is, when the available processors can't render the data at full resolution, the user should be allowed to choose interactivity over accuracy by using a coarser version of the data. Finally, we'll extend these algorithms to less expensive commodity workstation clusters. The distributed-memory architecture and the relatively slow communication networks associated with such clusters will require novel techniques to achieve interactive rates for large data sets.

Acknowledgments

The Visible Woman data set is available through the National Library of Medicine as part of its Visible Human Project. The Los Alamos National Laboratory released the CT data set, which was generated with FlashCT, a software product of Hytec. Dimitri Mavriplis provided the aerodynamic-flow data set. Peter Shirley helped generate several of the explanatory figures. Tom Crockett performed the tests on the 128-CPU Sun cluster.

References

1. K.-L. Ma and T.W. Crockett, "A Scalable, Cell-Projection Volume Rendering Algorithm for 3D Unstructured Data," Proc. 1997 Symp. Parallel Rendering, IEEE CS Press, Los Alamitos, Calif., 1997, pp. 95-104.
2. S. Parker et al., "Interactive Ray Tracing for Isosurface Rendering," Proc. Visualization 98, CD-ROM, ACM Press, New York, Oct. 1998.
3. M. Levoy, "Display of Surfaces from Volume Data," IEEE Computer Graphics and Applications, vol. 8, no. 3, May 1988, pp. 29-37.
4. J. Kniss et al., "Interactive Texture-Based Volume Rendering for Large Data Sets," IEEE Computer Graphics and Applications, vol. 21, no. 4, July/Aug. 2001, pp. 52-61.
5. E.B. Lum, K.-L. Ma, and J. Clyne, "Texture Hardware Assisted Rendering of Time-Varying Volume Data," to appear in Proc. IEEE Visualization 2001, CD-ROM, ACM Press, New York, Oct. 2001.
6. K.-L. Ma and D. Camp, "High Performance Visualization of Time-Varying Volume Data over a Wide-Area Network," Proc. Supercomputing 2000 Conf., CD-ROM, ACM Press, New York, 2000.
7. C. Hofsetz and K.-L. Ma, "Multi-Threaded Rendering Unstructured-Grid Volume Data on the SGI Origin 2000," Proc. Third Eurographics Workshop on Parallel Graphics and Visualization, Eurographics Assoc., Switzerland, 2000, pp. 133-140.
8. J.T. Kajiya, "An Overview and Comparison of Rendering Methods," A Consumer's and Developer's Guide to Image Synthesis, ACM Siggraph 1988 Course 12 Notes, ACM Press, New York, 1988, pp. 259-263.
9. E. Reinhard, A.G. Chalmers, and F.W. Jansen, "Overview of Parallel Photo-Realistic Graphics," Proc. Eurographics 98, Eurographics Assoc., Switzerland, 1998, pp. 1-25.
10. K.-L. Ma et al., "Parallel Volume Rendering Using Binary-Swap Image Composition," IEEE Computer Graphics and Applications, vol. 14, no. 4, July 1994, pp. 59-68.
11. G. Vézina, P.A. Fletcher, and P.K. Robertson, "Volume Rendering on the MasPar MP-1," Proc. 1992 Workshop on Volume Visualization, Oct. 1992, pp. 3-8.
12. M.J. Muuss, "Towards Real-Time Ray-Tracing of Combinatorial Solid Geometric Models," Proc. Ballistic Research Laboratories Computer-Aided Design (BRL-CAD) Symp., Army Research Lab, Adelphi, Md., 1995.
13. S. Whitman, Multiprocessor Methods for Computer Graphics Rendering, Jones and Bartlett Publishers, London, 1992.
14. J. Amanatides and A. Woo, "A Fast Voxel Traversal Algorithm for Ray Tracing," Proc. Eurographics 87, Eurographics Assoc., Switzerland, 1987, pp. 3-10.
15. M.B. Cox and D. Ellsworth, "Application-Controlled Demand Paging for Out-of-Core Visualization," Proc. Visualization 97, ACM Press, New York, Oct. 1997, pp. 235-244.
16. J. Arvo and D. Kirk, "A Survey of Ray Tracing Acceleration Techniques," An Introduction to Ray Tracing, A.S. Glassner, ed., Academic Press, San Diego, 1989.
17. K.S. Klimaszewski and T.W. Sederberg, "Faster Ray Tracing Using Adaptive Grids," IEEE Computer Graphics and Applications, vol. 17, no. 1, Jan./Feb. 1997, pp. 42-51.
18. J. Wilhelms and A. Van Gelder, "Octrees for Faster Isosurface Generation," ACM Trans. Graphics, vol. 11, no. 3, July 1992, pp. 201-227.
19. A. Globus, Octree Optimization, tech. report RNR-90-011, NASA Ames Research Center, Moffett Field, Calif., 1990.
20. W.E. Lorensen and H.E. Cline, "Marching Cubes: A High Resolution 3D Surface Construction Algorithm," Computer Graphics (Proc. Siggraph 87), vol. 21, no. 4, ACM Press, New York, July 1987, pp. 163-169.
21. G. Nielson and B. Hamann, "The Asymptotic Decider: Resolving the Ambiguity in Marching Cubes," Proc. Visualization 91, IEEE CS Press, Los Alamitos, Calif., 1991, pp. 83-91.
22. W.E. Lorensen, Marching through the Visible Woman, http://www.crd.ge.com/cgi-bin/vw.pl, 1997.
23. Y. Livnat, H. Shen, and C.R. Johnson, "A Near Optimal Isosurface Extraction Algorithm Using the Span Space," IEEE Trans. Visualization and Computer Graphics, vol. 2, no. 1, Mar. 1996, pp. 73-84.


Kwan-Liu Ma is an associate professor of computer science at the University of California, Davis, where he teaches and conducts research in computer graphics and scientific visualization. His career research goal is to improve the overall experience and performance of data visualization through more effective user-interface designs, interaction techniques, and high-performance computing. He received his PhD in computer science from the University of Utah in 1993. He served as co-chair for the 1997 IEEE Symposium on Parallel Rendering, Case Studies of the IEEE Visualization Conference in 1998 and 1999, and the first National Science Foundation/US Department of Energy Workshop on Large-Scale Data Visualization. He recently received the PECASE award for his work in parallel visualization and large-scale data visualization.

Readers may contact Ma at the Dept. of Computer Science, Univ. of California, Davis, One Shields Ave., Davis, CA 95616-8562, email [email protected].

Steven Parker is currently a research assistant professor in the School of Computing at the University of Utah. Additionally, he is a faculty member at the university's Scientific Computing and Imaging (SCI) Institute. His primary research interests are in component architectures, large-scale scientific visualization, and high-performance computing. Parker is the primary developer of the problem-solving environment SCIRun, which was the basis of his PhD dissertation and is also the foundation for a number of projects at the SCI Institute. He is also the lead software architect for the Center for Accidental Fires and Explosions at the University of Utah, where he is developing a problem-solving environment for a complex multiphysics simulation running on thousands of processors. He received a PhD in computer science from the University of Utah.

For further information on this or any other computing topic, please visit our Digital Library at http://computer.org/publications/dlib.

