

A Distributed Rendering System for Scientific Visualization

Robert Geist, Karl Rasche, Rishi Srivatsavai, James Westall
Department of Computer Science

Clemson University
Clemson, SC 29634-0974

{rmg|rkarl|rsrivat|westall}@cs.clemson.edu

ABSTRACT
Parallel, real-time rendering using clusters of commodity components has rapidly become a topic of significant interest within the scientific visualization community. This paper describes the design and implementation of a very large scale, distributed system that renders 6144 × 3072 pixel images and projects them across a 14′ × 7′ display wall at 35 frames per second.

Categories and Subject Descriptors
I.3.2 [Computer Graphics]: Graphics Systems—Distributed/network graphics

General Terms
Design, Performance

Keywords
Chromium, distributed rendering, tiled displays, workload balancing, projector calibration

1. INTRODUCTION
Over the past several years, few categories of hardware systems have witnessed a price/performance ratio decline comparable to that seen in computer graphics. A high-end SGI Onyx 2 IR system, purchased in 1997 at a cost of several hundreds of thousands of dollars, now has a difficult time matching the performance of a variety of PC graphics cards that cost a hundred dollars or less. This rapid decline, coupled with a typical graphics card product cycle of six months, has led to a fundamental change in the direction of high-performance graphics. Manufacturers of high-end systems must now focus on interconnect structure for scalable systems built from off-the-shelf, commodity components.

In May of 2002, with support from the W.M. Keck Foundation and the National Science Foundation, we installed a 265-node, distributed rendering system in Clemson's W.M. Keck Visualization Laboratory. It is the purpose of this paper to describe the design and implementation of this system. In particular, we focus on the system software modifications necessary to achieve interactive frame rates (30 fps) for a 6144 × 3072 pixel display projected across a 14′ × 7′ screen.

Parallel rendering systems constructed from commodity components are not new, and our own design is inspired by the prototype system built at Princeton University [10]. That prototype comprises nine PC stations; eight of these are display stations and one is a control station. They are linked via a Myrinet [1]. The system is sort-first (see [7] for a taxonomy of distributed rendering) in that it is based on screen-space or image-space partitioning of the rendering workload.

Partitioning creates rectangular image tiles, and tiles are assigned to rendering tasks. Geometry to be rendered is sent to a rendering task if that geometry, when projected to image space, would fall within the tile assigned to that rendering task. Unfortunately, this assignment is unknown until the geometry is projected from (3D) world coordinates to (2D) image space. To avoid projecting all geometric primitives, bounding boxes are created in world coordinates for large sections of geometry. The bounding boxes are projected, assigned to tasks, and, thereafter, geometry is sent to the rendering task determined by its world coordinate bounds.
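As an illustration of this sort-first assignment, a minimal sketch (our own helper names, not Chromium's tilesort code) that projects a bounding box and returns the overlapping tiles might look as follows; clipping of boxes that cross the near plane is ignored for brevity.

import numpy as np

def tiles_for_bounds(box_corners, mvp, tiles):
    # box_corners: (8, 3) world-space corners of a chunk's bounding box
    # mvp: (4, 4) modelview-projection matrix
    # tiles: list of (x0, y0, x1, y1) tile rectangles in normalized device coordinates
    pts = np.hstack([box_corners, np.ones((len(box_corners), 1))]) @ mvp.T
    ndc = pts[:, :2] / pts[:, 3:4]                 # perspective divide to screen space
    lo, hi = ndc.min(axis=0), ndc.max(axis=0)      # screen-space extent of the box
    hits = []
    for i, (x0, y0, x1, y1) in enumerate(tiles):
        if lo[0] <= x1 and hi[0] >= x0 and lo[1] <= y1 and hi[1] >= y0:
            hits.append(i)                         # overlap: this tile's renderer gets the chunk
    return hits

Geometry whose box overlaps several tiles is simply sent to every matching renderer, which is the redundancy discussed next.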

Sort-first designs can reduce communication among nodes, at the expense of redundant geometry calculations for primitives that overlap partitions. Whenever a geometric primitive overlaps two (or more) tiles, its geometry must be sent to every associated rendering task. A sort-first design is essential for our approach, which calls for expansion in a direction that may be regarded as orthogonal to that of the Princeton Wall.

In particular, we significantly expand the number of processing stations allocated to rendering each wall tile, rather than expanding the size of the wall. Thus, unlike the Princeton system, where a substantial portion of a tile rendering is likely to be carried out on the system responsible for its display, our display stations are relatively passive collectors of pixel data and require only fast connections.

The remainder of the paper is organized as follows. In section 2 we describe the basic hardware and software components of our system including the interconnect design. In section 3 we describe the Chromium software framework in more detail and provide evidence of the overhead attached to distributed rendering operations. In section 4 we provide performance milestones achieved through system software modifications. In section 5 we describe projector calibration techniques used for alignment and edge-blending. Conclusions follow in section 6.

Figure 1: Distributed rendering: conceptual design

2. SYSTEM ARCHITECTURE
The system hardware platform is shown (conceptually) in Figure 1 and (physically) in Figure 2. There are 265 rack-mounted nodes. Each node has a 1.6GHz Pentium IV CPU, a 58GB IDE drive, and dual Ethernet cards (NICs). The 240 nodes used for geometry generation and rendering each have 512MB main memory, an Nvidia GeForce4 TI 4400 graphics card, an Intel e100 100Mb NIC, and a 3Com 905c 100Mb NIC. The 24 nodes used for display, which collectively drive the 6 × 4 projector array, each have 1GB main memory, a Matrox G450 graphics card, a 1Gb Intel e1000 NIC, and a 100Mb Intel e100 NIC. The G450 graphics card does not have the rendering speed of the TI 4400s, but its open source driver and performance in 2D operations [3] combined to make it the choice for the display stations. The 905c NIC was selected due to the availability of its architectural specifications (data book), which facilitates driver modifications. All nodes are connected via a dedicated Gigabit Ethernet switch. The 3Com 905cs and the Intel e1000s comprise the pixel network. The 24 projectors will accept a 1280 × 1024 resolution input, but they downsample to a physical limit of 1024 × 768, and so we use the latter resolution. This configuration (6 × 4 @ 1024 × 768) yields the 6144 × 3072 display.

Figure 2: Distributed rendering: physical racks

The configuration design of the processing pipeline has undergone two changes since installation. Since there were no established configuration designs for comparable systems, we anticipated that logical changes to our processing pipeline would be required as performance tests revealed new bottlenecks. Our original design called for a single control node to distribute workload to 240 rendering nodes. Each of the 24 display nodes was then to receive pixel data from 10 rendering nodes through the gigabit Ethernet pipe. Tests quickly showed that, with the default 1500 byte Ethernet packet size, ten 100Mb TCP data streams could not feed a 1Gb pipe without significant loss and retransmission. We reconfigured for a 9-to-1 fan-in, rather than 10-to-1, and performance improved dramatically. Although this reconfiguration left 24 nodes without assigned purpose, the second bottleneck discovery called them back into action. We found that a single control node was unable to generate scene geometry fast enough to fully utilize the 216 (24 × 9) rendering nodes. The 216 rendering nodes could receive triangle data at 100Mbps, and so the potential bandwidth was 21.6Gbps. Thus, a second reconfiguration was needed.

The new configuration design calls for the control node to activate 24 geometry nodes (formerly unused rendering nodes, now upgraded with Gb NICs), each of which generates geometry for 9 rendering nodes. This increase (by a factor of 24) in the rate of scene geometry generation should suffice to run the 216 attached rendering nodes at full utilization. Each of the 24 display nodes still receives pixel data from 9 rendering nodes.
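To make the fan-in and fan-out arithmetic concrete, the following back-of-the-envelope check (our illustration, not part of the system software) restates the numbers above:

# Link rates in Mb/s and node counts from the configuration described above.
RENDER_NIC = 100                 # each rendering node transmits pixels at 100 Mb/s
DISPLAY_NIC = 1000               # each display node receives on a 1 Gb/s NIC
DISPLAYS = 24

fan_in = 9                                   # 10-to-1 saturated the Gb pipe; 9-to-1 did not
assert fan_in * RENDER_NIC < DISPLAY_NIC     # 900 Mb/s leaves headroom for protocol overhead

render_nodes = DISPLAYS * fan_in             # 216 rendering nodes
geometry_demand_gbps = render_nodes * RENDER_NIC / 1000.0
print(render_nodes, geometry_demand_gbps)    # 216 nodes could consume 21.6 Gb/s of triangles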

The system software is based on Linux, OpenGL, and Chromium. Linux kernel 2.4.7-10, as installed by RedHat 7.1, runs on all nodes. OpenGL [12] is the principal open graphics API in use today. Chromium [5] is a freely distributed, generic software framework for distributing and processing streams of OpenGL commands. The rendering application context is described, in a Python script, as a directed graph of stream processing units (SPUs). Calls by the application to OpenGL are intercepted by Chromium and routed through the SPU graph. Numerous SPUs are supplied with the Chromium distribution, e.g., tile-sorting and rendering, but writing and installing new SPUs is straightforward.

Initial rendering performance on the system for a standard dynamic display application was a disappointing 7 fps. Substantial modifications were needed to achieve interactive frame rates. Understanding these modifications is facilitated by some additional background on Chromium.

Figure 3: Nearly trivial SPU graph

import sys
sys.path.append("../server")
from mothership import *

cr = CR()
renderspu = SPU('render')
renderspu.Conf('window_geometry', 10, 10, 300, 300)
display_node = CRNetworkNode()
display_node.AddSPU(renderspu)
cr.AddNode(display_node)

packspu = SPU('pack')
packspu.AddServer(display_node, protocol='tcpip')
client_node = CRApplicationNode()
client_node.SetApplication('gears')
client_node.AddSPU(packspu)
cr.AddNode(client_node)
cr.Go()

Figure 4: Nearly trivial SPU script

3. CHROMIUM
Chromium was developed as a successor to WireGL [4] to provide a highly flexible software platform for distributed rendering on a variety of architectures. It is freely available (see sourceforge.net/projects/chromium), installs easily, and is remarkably robust. A nearly trivial example SPU graph is shown in Figure 3, and the associated Python script is shown in Figure 4. This rendering context runs on a single node. A pack SPU simply packages OpenGL commands from the application, gears, and hands them off to a render SPU. The connection between the client node and the display node is made by the two lines:

packspu.AddServer(display_node, protocol='tcpip')
client_node.AddSPU(packspu)

To distribute the rendering workload to a second, remote node (on which Chromium is installed) we need only pass the node name as an argument, e.g.,

display_node = CRNetworkNode('zztop.clemson.edu')

The OpenGL applications, like gears, need not be modified in any way and need not be re-compiled.

In spite of its exceptional flexibility, robustness, and ease of installation, Chromium does carry a performance penalty. On mildly difficult rendering tasks, the overhead of running Chromium, even on a single node, can be substantial. We tested a simply-specified application that generates a non-trivial workload: render 10,000 randomly specified points. We tested this in “native” mode, i.e., without Chromium, and under three Chromium rendering contexts. The first is that shown in Figures 3 and 4. The second, whose graph is shown in Figure 5, substitutes a tilesort SPU for the pack SPU. The tilesort SPU can partition the workload among multiple local or remote nodes, but for this test we used only one tile, the entire image, and the render node and tilesort node were again on the same physical node. The third, whose graph is shown in Figure 6, is still restricted to a single physical node, but it is much closer to the SPU graphs that match our basic, three-stage pipeline configuration design. A tilesort SPU distributes work to a readback SPU, which renders but then makes the results of rendering (framebuffer) available for downstream processing. The results are packaged by the pack SPU and transmitted (as a pixel stream) to a terminating render SPU.

Figure 5: Tilesort to render SPU graph

Figure 6: Near target configuration SPU graph
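A mothership script for a context like that of Figure 6 might look roughly as follows. This is a sketch modeled on Figure 4, with the SPU names taken from the description above; the configuration options each SPU needs (window geometry, tile layout, readback settings, hostnames) are omitted and would have to be filled in from the Chromium documentation.

import sys
sys.path.append("../server")
from mothership import *

cr = CR()

# Display node: a terminating render SPU that receives the pixel stream.
display_node = CRNetworkNode()
display_node.AddSPU(SPU('render'))
cr.AddNode(display_node)

# Rendering node: readback renders the tile and exposes the framebuffer;
# pack packages the resulting pixels and ships them to the display node.
render_node = CRNetworkNode()
render_node.AddSPU(SPU('readback'))
packspu = SPU('pack')
packspu.AddServer(display_node, protocol='tcpip')
render_node.AddSPU(packspu)
cr.AddNode(render_node)

# Application node: tilesort partitions the OpenGL stream among its servers.
client_node = CRApplicationNode()
client_node.SetApplication('gears')
tilesortspu = SPU('tilesort')
tilesortspu.AddServer(render_node, protocol='tcpip')
client_node.AddSPU(tilesortspu)
cr.AddNode(client_node)

cr.Go()

In the full system the tilesort SPU would be given one such rendering node per tile (nine per display node), but the per-node SPU chain is the same.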

In Table 1 we show frame rate performance under each rendering context for rendering on both the TI 4400 card and the G450 card. We note that the older generation, G450, is no match for the TI 4400 on this task. Further, any use of Chromium incurs a substantial frame rate performance penalty, but the readback is particularly taxing. The window size in each case was 256 × 256, so the extent of the readback, 64K pixels, was not large. Although the final TI 4400 frame rate is still well above our target of 30 fps, both of the final frame rates are clearly a small percentage of that available from the card. Although the target 6144 × 3072 resolution at 30 fps can be reached, doing so is a non-trivial task.

4. PERFORMANCE MILESTONES
Initially we focused our system performance investigations on the application “Atlantis,” due to Mark Kilgard, which displays whales, dolphins, and sharks swimming on a blue background. It was a natural choice for us, since it is distributed with Chromium and provides an interesting display with non-trivial, yet non-taxing, geometry and shading. Unfortunately, with an unmodified Chromium and unmodified drivers, the system delivered only 7 fps.

context                    TI 4400 (fps)   G450 (fps)
native OpenGL              2300            101
pack-render (fig. 3)       395             85
tilesort-render (fig. 5)   327             82
three stage (fig. 6)       120             19

Table 1: Chromium context performance overhead

Upon investigation of the Chromium TCP/IP pipeline, we discovered that buffer receive() calls (crTCPIPRecv() in cr/util/tcpip.c) on the display stations would stall if insufficient data was available for a complete read. Since other buffers may be ready, we added a check to determine whether the current buffer contained sufficient data to complete the receive() call. This allowed us to move to another buffer if the current buffer was not ready and thereby significantly reduce the stalls. The frame rate went to 16 fps.
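The idea behind the change, in a minimal Python sketch (the actual fix is in C, in crTCPIPRecv(); the helper and its bookkeeping dictionaries are hypothetical):

import select

def next_complete_buffer(socks, pending, needed):
    # pending: socket -> bytes accumulated so far for its current message
    # needed:  socket -> total length of the message being assembled
    readable, _, _ = select.select(socks, [], [], 0.0)    # poll; never block here
    for s in readable:
        pending[s] += s.recv(needed[s] - len(pending[s]))
        if len(pending[s]) == needed[s]:                  # this buffer is now complete
            msg, pending[s] = pending[s], b""
            return s, msg
    return None, None    # nothing complete yet: move on to other buffers instead of stalling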

Further investigation revealed that the transfer of frames from the readback SPUs on the render nodes to the render SPUs on the display nodes was serialized. Modifications to the readback SPU, the render SPU, and the crserver to parallelize this transfer brought the frame rate to 21 fps.

At this stage, we turned to the display station routines that deliver pixel data to the frame buffers. This is usually a 3-copy operation: a DMA copy from the NIC to kernel buffers, a processor copy from kernel buffers to user space buffers, and a final copy from user space buffers to the frame buffer. The final copy might or might not involve DMA or AGP. We reduced this to a relatively fast 3-copy operation by locating the user level buffers in AGP memory and then invoking bitBLT operations for transfer to the frame buffer. This required modifications to the Matrox G450 driver (GL/mesa/src/drv/mga/mgapixel.c). Frame rate increased to 34 fps.

Finally, the standard Ethernet packet size, 1500 bytes, was less than ideal for transferring massive amounts of pixel data. The availability of specifications for the 3Com 905c NIC allowed us to experiment with different packet sizes. We used 9 nodes transmitting through 100Mb NICs to a single receive node with a 1Gb NIC and measured CPU utilization and net throughput. The results are shown in Figure 7. As expected, throughput increases with packet size and approaches the hardware limit of 100Mbps per transmitter. Unexpected was the minimum in receiver CPU utilization achieved at a packet size of 6K bytes. Minimizing CPU activity in handling receive buffers is clearly advantageous, and so we tested our application with 6KB packets. The frame rate reached 37 fps.
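The measurement itself is simple to reproduce in outline. A rough sketch of the receive side (our harness, not the code behind these numbers; the Ethernet frame size is set at the NIC driver level, e.g. via the interface MTU, not in the script, and the port and durations are assumed values):

import select, socket, time

PORT, SENDERS, SECONDS = 9000, 9, 10

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("", PORT))
srv.listen(SENDERS)
conns = [srv.accept()[0] for _ in range(SENDERS)]   # one stream per transmitting node

total, t0, c0 = 0, time.time(), time.process_time()
while time.time() - t0 < SECONDS:
    readable, _, _ = select.select(conns, [], [], 0.1)
    for c in readable:
        total += len(c.recv(65536))

wall, cpu = time.time() - t0, time.process_time() - c0
print("%.1f Mb/s aggregate, %.0f%% CPU" % (total * 8 / wall / 1e6, 100 * cpu / wall))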

5. PROJECTOR CALIBRATION
When one constructs a single image from multiple projected image tiles, projector alignment and tile seam handling are of crucial importance to the quality of the visual display. Pan-tilt-rotate units for multi-projector displays require frequent adjustment and are too expensive to be considered commodity, off-the-shelf equipment. We have chosen, therefore, to align our systems completely in software.

We mount our projectors on a conventional rack and align them, very roughly, by hand. We then project alignment points onto the screen and read them back with an inexpensive (640 × 480) camera. Using these images, we correct for camera lens distortion (see [2]) and then compute the alignment distortion transformation that is in effect for each projector. The general 2D projective transformation in homogeneous coordinates has the form

\begin{pmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{pmatrix}

which includes rotations, translations, scales, shears, and keystone (perspective projection) effects. Mapping 4 alignment points per projector thus allows us to solve for the 8 unknowns. By applying the inverse of the transformation to the data in the associated frame buffer, we correct for the projector distortion and bring the tile into visual alignment. As a simple example of the effect, if a user turns a projector upside down, the 180° rotation distortion is detected and the image rights itself. Our procedure is similar to that described by Sukthankar et al. [11], except that we hand-align one projector and then use the coordinate system determined by its 4 alignment points as our display space reference system.
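With 4 correspondences the 8 unknowns follow from a small linear system. A sketch (our own helper; u = (ax + by + c)/(gx + hy + 1) and v = (dx + ey + f)/(gx + hy + 1) for a source point (x, y) mapped to (u, v)):

import numpy as np

def projector_map(src, dst):
    # src, dst: four (x, y) -> (u, v) alignment-point correspondences
    A, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); rhs.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); rhs.append(v)
    a, b, c, d, e, f, g, h = np.linalg.solve(np.array(A, float), np.array(rhs, float))
    return np.array([[a, b, c], [d, e, f], [g, h, 1.0]])

Warping the frame buffer with the inverse of this matrix (followed by the perspective divide) is what brings the tile into alignment.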

In Figure 8 we show a snapshot of our display wall on which we have rendered a cloud of 9M points scanned from the recently salvaged Civil War submarine, the H. L. Hunley [8]. Although the software tile alignment is fairly good, an important calibration step clearly remains. The 24 projectors, all the same model from the same supplier, show vast differences in color and intensity of the images they project.

Fortunately, from our projected calibration points, we can determine projector overlap regions precisely and employ a simple linear ramp roll-off function [9] to smooth the intensity transition between and among projectors. Nevertheless, as observed by Majumder et al. [6], simple roll-off functions do not address all the photometric non-uniformity issues that arise with such displays. They propose a more elaborate correction procedure that makes extensive use of a spectroradiometer. Since we view such equipment as outside our self-imposed constraint of commodity, off-the-shelf equipment, we are currently developing an approximation to their technique that would use an inexpensive camera to register the results of intensity matching experiments.
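A linear ramp of this kind is easy to state precisely. A minimal sketch (our own helper; overlap widths in pixels would come from the projected calibration points, and a horizontally adjacent projector applies the complementary ramp so the weights sum to one across the seam):

import numpy as np

def ramp_mask(width, height, overlap_left=0, overlap_right=0):
    # Per-pixel intensity weights for one projector tile: a linear ramp from
    # 0 to 1 across each overlap region, 1.0 everywhere else.
    w = np.ones(width)
    if overlap_left:
        w[:overlap_left] = np.linspace(0.0, 1.0, overlap_left)
    if overlap_right:
        w[-overlap_right:] = np.linspace(1.0, 0.0, overlap_right)
    return np.tile(w, (height, 1))     # repeat the ramp on every scan line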

6. CONCLUSIONS
We have described the design and implementation of a very large scale, distributed rendering system that generates 6144 × 3072 pixel images and projects them across a 14′ × 7′ display wall at 35 frames per second. Achieving such frame rates required significant modifications to both Linux device drivers and the Chromium distributed rendering package.

Significant work remains in achieving low-cost photometric uniformity, achieving comparable frame rates for more taxing geometry, and making effective use of the 15 terabytes of available disk space.

7. ACKNOWLEDGMENTS
This work was supported in part by the ERC Program of the U.S. National Science Foundation under award EEC-9731680, the ITR Program of the National Science Foundation under award ACI-0113139, the IUCRC Program of the National Science Foundation under award EEC-0116924, and a grant from the W. M. Keck Foundation.

Figure 7: Frame size effects on performance

Figure 8: Full-screen view of Hunley data

8. REFERENCES
[1] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W.-K. Su. Myrinet: a gigabit-per-second local-area network. IEEE Micro, 15:29–36, February 1995.
[2] F. Devernay and O. Faugeras. Straight lines have to be straight. Machine Vision and Applications, 13(1):14–24, 2001.

[3] R. Geist, V. Sekhar, and J. Westall. Graphics benchmarking. In Proc. of the 27th Annual Int. Conf. of the Computer Measurement Group (CMG 2001), pages 151–160, Anaheim, California, December 2001.
[4] G. Humphreys, M. Eldridge, I. Buck, G. Stoll, M. Everett, and P. Hanrahan. WireGL: A scalable graphics system for clusters. In Proc. ACM SIGGRAPH 2001, pages 129–140, August 2001.
[5] G. Humphreys, M. Houston, R. Ng, R. Frank, S. Ahern, P. Kirchner, and J. Klosowski. Chromium: a stream-processing framework for interactive rendering on clusters. ACM Transactions on Graphics (Proc. SIGGRAPH 2002), pages 693–702, July 2002.
[6] A. Majumder, Z. He, H. Towles, and G. Welch. Achieving color uniformity across multi-projector displays. In Proceedings 11th IEEE Visualization Conference (VIS 2000), pages 117–124, Salt Lake City, UT, October 2000.
[7] S. Molnar, M. Cox, D. Ellsworth, and H. Fuchs. A sorting classification of parallel rendering. IEEE Computer Graphics and Applications, 14:23–32, July 1994.
[8] G. Oeland and I. Block. The H.L. Hunley. National Geographic, pages 82–101, July 2002.
[9] R. Raskar, M. Brown, R. Yang, W. Chen, G. Welch, H. Towles, B. Seales, and H. Fuchs. Multi-projector displays using camera-based registration. In Proc. IEEE Visualization 99, San Francisco, CA, October 1999.
[10] R. Samanta, J. Zheng, T. Funkhouser, K. Li, and J. P. Singh. Load balancing for multi-projector rendering systems. In Proc. SIGGRAPH/Eurographics Workshop on Graphics Hardware, August 1999.
[11] R. Sukthankar, R. Stockton, and M. Mullin. Smarter presentations: Exploiting homography in camera-projector systems. In Proceedings of International Conference on Computer Vision, pages 247–253, January 2001.
[12] M. Woo, J. Neider, T. Davis, and D. Shreiner. OpenGL Programming Guide. Addison-Wesley, third edition, 1999.