Hybrid Sort-First and Sort-Last Parallel Rendering with a Cluster of PCs

Rudrajit Samanta, Thomas Funkhouser, Kai Li, and Jaswinder Pal Singh
Princeton University

Abstract

We investigate a new hybrid of the sort-first and sort-last approaches to parallel polygon rendering, using as a target platform a cluster of PCs. Unlike previous methods that statically partition the 3D model and/or the 2D image, our approach performs dynamic, view-dependent and coordinated partitioning of both the 3D model and the 2D image. Using a specific algorithm that follows this approach, we show that it performs better than previous approaches and scales better with both processor count and screen resolution. Overall, our algorithm is able to achieve interactive frame rates with efficiencies of 55.0% to 70.5% during simulations of a system with 64 PCs. While it does have potential disadvantages in client-side processing and in dynamic data management—which also stem from its dynamic, view-dependent nature—these problems are likely to diminish with technology trends in the future.

Keywords: Parallel rendering, cluster computing.

1 Introduction

The objective of our research is to investigate whether it is possible to construct a fast and inexpensive parallel rendering system leveraging the aggregate performance of multiple commodity graphics accelerators in PCs connected by a system area network. The motivations for this system architecture are numerous:

• Lower cost: The price-to-performance ratio of commodity graphics hardware far exceeds that of custom-designed, high-end rendering systems, and some inexpensive PC graphics accelerators can already deliver performance competitive with an SGI InfiniteReality while costing orders of magnitude less. A system composed only of off-the-shelf commodity components produced for a mass market costs far less than traditional high-end, custom-designed rendering systems.

• Technology tracking: The performance of PC graphics accelerators has been improving at a rate far beyond Moore's law for the last several years, and their development cycles are very short (six months) as compared to custom-designed, high-end hardware (one year or more). Accessing hardware components (PC graphics accelerators) only through standard software APIs (OpenGL) makes it easy to replace them on a frequent basis as faster versions become available from any hardware vendor.

• Modularity & flexibility: In networked systems, computers communicate only via network protocols, which allows PCs to be added to and removed from the system easily, and they can even be heterogeneous. Network protocols also allow specific rendering processors to be accessed directly by remote computers attached to the network, and the processors can be used for other computing purposes when not in use for high-performance graphics.

• Scalable capacity: The aggregate hardware compute, storage, and bandwidth capacity of a PC cluster grows linearly with increasing numbers of PCs. Since each PC has its own CPU, memory, and AGP bus driving a single graphics accelerator, the scalability of a cluster-based system is far better than a tightly-integrated graphics architecture where multiple graphics pipelines share a bus, memory, and I/O subsystem. Also, when using a cross-bar system area network, such as Myrinet [1], the aggregate communication bandwidth of the system scales as more PCs are added to the cluster, while the point-to-point bandwidth for any pair of PCs remains constant.
The main challenge is to develop efficient parallel rendering algorithms that scale well within the processing, storage, and communication characteristics of a PC cluster. As compared to traditional, tightly-integrated parallel computers, the relevant limitations of a PC cluster are that the processors (PCs) do not have fast access to a shared virtual address space, and the bandwidths and latencies of inter-processor communication are significantly inferior. Moreover, commodity graphics accelerators usually do not allow efficient access through standard APIs to intermediate rendering data (e.g., fragments), and thus the design space of practical parallel rendering strategies is severely limited. The challenge is to develop algorithms that partition the workload evenly among PCs, minimize extra work due to parallelization, scale as more PCs are added to the system, and work efficiently within the constraints of commodity components.

Our approach is to partition the rendering computation dynamically into coarse-grained tasks requiring practical inter-process communication bandwidths using a hybrid sort-first and sort-last strategy. We take advantage of the point-to-point nature of a system area network, employing a peer-to-peer sort-last scheme to composite images for tiles constructed with a sort-first screen decomposition. The most important contribution of our work is the partitioning algorithm that we use to simultaneously decompose the 2D screen into tiles and the 3D polygonal model into groups and assign them to PCs to balance the load and minimize overheads. The key idea is that both 2D and 3D partitions are created together dynamically in a view-dependent context for every frame of an interactive visualization session. As a result, our algorithm can create tiles and groups such that the region of the screen covered by any group of 3D polygons is closely correlated with the 2D tile of pixels assigned to the same PC. In this case, relatively little network bandwidth is required to re-distribute pixels from PCs that have rendered polygons to the PCs whose tiles they overlap.

In this paper, we describe our research efforts aimed at using a cluster of PCs to construct a high-performance polygon rendering system. We propose a hybrid sort-first and sort-last partitioning algorithm that both balances rendering load and requires practical image composition bandwidths for typical screen resolutions. This algorithm is used to drive a prototype system in which server PCs render partial images independently and then composite them with peer-to-peer pixel redistribution. We report the results of simulations aimed at investigating the scalability and feasibility of this approach and evaluating algorithmic trade-offs and performance bottlenecks within such a system.

The paper is organized as follows. The following section reviews previous work in parallel rendering, while Section 3 discusses potential partitioning strategies for our system. Section 4 gives an overview of our approach. The details of our hybrid load balancing algorithm are described in Section 5, followed by a communication overhead analysis in Section 6. Section 7 presents the results and analysis of simulations with our algorithm compared to previous sort-first and sort-last methods. Section 8 discusses limitations and possible extensions of our approach, while Section 9 contains a brief summary and conclusion.

2 Previous Work

Previous parallel systems for polygon rendering can be classified in many ways: hardware vs. software, shared memory vs. distributed memory, object decomposition vs. image decomposition, SIMD vs. MIMD, sort-first vs. sort-middle vs. sort-last, or functional vs. data vs. temporal partitioning. See [6, 7, 17, 32] for more on parallel rendering taxonomies.

Hardware-based systems employ custom integrated circuits with fast, dedicated interconnections to achieve high processing and communication bandwidths. Early parallel rendering hardware was developed for z-buffer scan conversion [9, 10, 24] and for geometry processing pipelining [3, 4]. Most current systems are based on a sort-middle architecture and rely upon a fast, global interconnection to distribute primitives from geometry processors to rasterizers. For instance, SGI's InfiniteReality Engine [18] uses a fast Vertex Bus to broadcast screen-space vertex information from semi-custom ASIC geometry processors to ASIC rasterization processors. The main drawback of the hardware approach is that the special-purpose processors and interconnects in these systems are custom-designed and therefore very expensive, sometimes costing millions of dollars.

Software-based systems have been implemented on massively parallel architectures, such as Thinking Machines' CM-5 [23], BBN's Butterfly TC2000 [33], Intel's Touchstone Delta system [8], and SGI's Origin 2000 [14, 30]. Unfortunately, tightly-coupled, scalable parallel computers are also expensive, and since they are typically designed or configured for scientific computations and not for graphics, the price-to-performance ratio of pure software rendering is not competitive with current graphics hardware.

Relatively little work has been done on interactive polygon rendering using a cluster of networked PCs [13, 28, 29]. Traditionally, the latency and bandwidths of typical networks have not been adequate for fine-grained parallel rendering algorithms. Rather, most prior cluster-based polygon rendering systems have utilized only inter-frame parallelism [13, 25], rendering separate frames of an image sequence on separate computers in parallel. Other parallel rendering algorithms, such as ray tracing, radiosity [11, 26], and volumetric rendering [12], have been implemented with PC clusters. However, they generally have not achieved fast, interactive frame rates (i.e., thirty frames per second). We are not aware of any prior system that has achieved scalable speedups via intra-frame data parallelism while rendering polygonal models with a network of workstations.

Last year, Samanta et al. [28] described a sort-first parallel rendering system running on a cluster of PCs driving a high-resolution, multi-projector display wall. In that paper, the goal was to balance the load among a fixed number of PCs (eight), each driving a separate projector corresponding to part of a very large and high resolution seamless image. They focused on sort-first algorithms aimed at avoiding large overheads due to redistribution of big blocks of pixels between multiple PCs driving full-screen projectors. The method described in this paper is similar in concept, but it uses a sort-last image composition scheme to achieve scalable speedups for large clusters of PCs driving a single display.

3 Choosing a Partitioning Strategy

The first challenge in implementing a parallel rendering system is to choose a partitioning strategy. Following the taxonomy of Molnar et al. [5, 17], we consider sort-middle, sort-first, and sort-last approaches.

In sort-middle systems, processing of graphics primitives is partitioned equally among geometry processors, while processing of pixels is partitioned among rasterization processors according to overlaps with screen-space tiles. This approach is best suited for tightly-coupled systems that use a fast, global interconnection to send primitives between geometry and rasterization processors based on overlaps with simple and static tilings, such as a regular, rectangular grid. Using a sort-middle approach would be very difficult in a cluster-of-PCs system for two reasons. First, commodity graphics accelerators do not generally provide high-performance access to the results of geometry processing through standard APIs (e.g., "feedback" mode in OpenGL). Second, network communication performance is currently too slow for every primitive to be redistributed between processors on different PCs during every frame.

In sort-first systems, screen space is partitioned into non-overlapping 2D tiles, each of which is rendered independently by a tightly-coupled pair of geometry and rasterization processors (i.e., a PC graphics accelerator, in our case), and the subimages for all 2D tiles are composited (without depth comparisons) to form the final image. The main advantage of sort-first is that its communication requirements are relatively small. Unlike sort-middle, sort-first can utilize retained-mode scene graphs to avoid most data transfer for graphics primitives between processors, as graphics primitives can be replicated and/or sent dynamically between processors as they migrate between tiles [20]. The disadvantage is that extra work must be done to transform graphics primitives, compute an appropriate screen space partition, determine primitive-tile overlaps for each frame, and render graphics primitives redundantly if they overlap multiple tiles. Most significant among these is the extra rendering work, which can be characterized by the overlap factor – the ratio of the total rendering work performed over the ideal rendering work required without redundancy. Since overlap factors grow linearly with increasing numbers of processors, the scalability of sort-first systems is limited. In our experience, the efficiency of sort-first algorithms [19] drops below 50% with 16 processors [27].

In sort-last systems, each processor renders a separate image containing a portion of the graphics primitives, and then the resulting images are composited (with depth comparisons) into a single image for display. The main advantage of sort-last is its scalability. Since each graphics primitive is rendered by exactly one processor, the overlap factor is always 1. The main disadvantage of sort-last is that it usually requires an image composition network with very high bandwidth and processing capabilities to support transmission and composition of overlapping depth images. Also, the sort-last approach usually provides no strict primitive ordering semantics, and it incurs latency as subimages must be composited before display.

To summarize, no prior parallel rendering algorithm is both scalable and practical for a PC cluster. At one extreme, sort-first systems require relatively little communication bandwidth; however, they are not scalable, as overlap factors grow linearly with increasing numbers of processors. At the other extreme, sort-last systems are scalable, but their bandwidth requirements exceed the capabilities of current system area networks. The goal of our work is to develop a parallel rendering algorithm which strikes a practical balance between these trade-offs.

4 Overview of Our Approach

Our approach is to use a hybrid parallel rendering algorithm that combines features of both "sort-first" and "sort-last" strategies. We execute a view-dependent algorithm that dynamically partitions both the 2D screen into tiles and the 3D polygons into groups in order to balance the rendering load among the PCs and to minimize the bandwidths required for compositing tiles with a peer-to-peer redistribution of pixels.

The key idea is to cluster 3D polygons into groups for rendering by each server dynamically based on the overlaps of their projected bounding volumes in screen space. The motivation for this approach is best demonstrated with an example. Consider the arrangement of polygons shown in Figure 1. If polygons are assigned randomly to two groups, as shown in Figure 1(a), the area of overlap between the bounding boxes of the two groups is quite large (shown in hatch pattern). On the other hand, if the polygons are grouped according to their screen space overlaps, as shown in Figure 1(b), the area of intersection between the groups' bounding boxes is much smaller. This difference has critical implications for the bandwidth required for image compositing in a sort-last system.

Figure 1: Polygon groupings affect image composition bandwidths. (a) Random grouping; (b) spatial grouping.

We have implemented a view-dependent object-partition strategy motivated by examples such as this one, using a sort-first screen-partition algorithm in a client PC controlling a parallel rendering system comprising N server PCs and one display PC. Specifically, for every frame of an interactive visualization session, the system proceeds in a three-phase pipeline, as shown in Figure 2; a code sketch of this pipeline follows the phase descriptions below.

• Phase 1: In the first phase, the client executes a partitioning algorithm that simultaneously decomposes the 3D polygonal model into N disjoint groups, assigning each group to a different server PC, and partitions the pixels of the 2D screen into non-overlapping tiles, also assigning each tile to a different server PC.

• Phase 2: In the second phase, every server renders the group of 3D primitives it has been assigned into its local frame buffer. It then reads back from the frame buffer into memory the color and depth values of pixels that reside within any intersection of the group's projected screen-space bounding box and a tile assigned to another server, and it sends them over the system area network to that server. Meanwhile, after every server has completed rendering its own group of polygons, it receives pixels rendered by other servers from the network, and it composites them into its local frame buffer with depth comparisons to form a complete image for its tile. Finally, each server reads back the color values of pixels within its tile and sends them to a display PC.

• Phase 3: In the third phase, the display PC receives subimages from all servers, and composites them (without depth comparisons) to form a final complete image in its frame buffer for display.
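The per-frame pipeline above can be summarized by the following minimal sketch. The type and function names are illustrative, not those of our prototype; the actual OpenGL rendering and VMMC-2 messaging calls are elided behind comments.

    // Illustrative sketch of the three-phase, per-frame pipeline (names are
    // hypothetical; rendering and network calls are elided behind comments).
    #include <cstdint>
    #include <vector>

    struct Tile     { int x0, y0, x1, y1; };           // screen-space tile
    struct Group    { std::vector<int> objectIds; };   // objects for one server
    struct Subimage { Tile tile; std::vector<uint32_t> color; };

    // Phase 1 (client): coordinated view-dependent partition for this frame.
    void partitionFrame(int numServers,
                        std::vector<Tile>& tiles, std::vector<Group>& groups) {
        tiles.resize(numServers);
        groups.resize(numServers);
        // run the hybrid 2D/3D partitioning algorithm of Section 5 here
    }

    // Phase 2 (one server): render own group, exchange overlap pixels
    // peer-to-peer, and depth-composite a complete subimage for its tile.
    Subimage renderAndComposite(int self, const Group& group,
                                const std::vector<Tile>& tiles) {
        // 1. render `group` into the local frame buffer
        // 2. for every other server t: read back color+depth where this group's
        //    screen bounding box overlaps tiles[t], and send them to server t
        // 3. receive pixels from peers and composite them (with depth tests)
        //    into the local frame buffer for tiles[self]
        // 4. read back color for tiles[self] and return it for the display PC
        return Subimage{tiles[self], {}};
    }

    // Phase 3 (display): tiles are disjoint, so no depth comparison is needed.
    void assembleFinalImage(const std::vector<Subimage>& subimages,
                            std::vector<uint32_t>& frameBuffer) {
        for (const Subimage& s : subimages) {
            (void)s;  // copy s.color into frameBuffer at s.tile's location
        }
        (void)frameBuffer;
    }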

Figure 2: System architecture with client, server, and display phases.

This system architecture has two important advantages. First, it uses peer-to-peer communication to redistribute pixels between servers, a strategy that is very well-suited for point-to-point system area networks, which are available for clusters of PCs. The result is lower latencies and higher network utilization than other, more traditional sort-last methods. Second, the aggregate pixel bandwidth received by the display machine is minimal, as every pixel is sent to the display exactly once for each image in the third phase of the pipeline. This feature is important for our system because the bandwidth of network communication into any one PC is limited (~100 MB/s), and sort-last methods in which the display processor receives pixels with depth values for the whole screen would be impractical for an interactive system.

5 Hybrid Partitioning Algorithm

In order to make our system architecture viable, we must develop a partitioning algorithm that constructs tiles of pixels and groups of objects and assigns them to servers for each frame. An effective algorithm should achieve three goals:

• It should balance the work load among the servers.
• It should minimize the overheads due to pixel redistribution.
• It should run fast enough such that the client is not the limiting pipeline stage.


Prior work on dynamic screen-space partitioning has focused on constructing partitions with balanced rendering loads. For instance, Whelan developed a median-cut method in which the screen was partitioned recursively into a number of tiles exactly equal to the number of processors [31]. For each recursive step, a tile was split by a line perpendicular to its longest axis so that the centroids of its overlapping graphics primitives were partitioned most equally. In later work, Mueller developed a mesh-based median-cut method in which primitives were first tallied up according to how their bounding boxes overlapped a fine mesh, and an estimated cost was calculated for each overlapped mesh cell [19]. Then, using this data as a hint, screen space tiles were recursively split along their longest dimensions until the number of regions equaled the number of processors. Whitman used a dynamic scheduling method in which he started with a set of initial tiles and "stole" part of another processor's work dynamically when no initial tiles remained [33]. The stealing is achieved by splitting the remaining tile region on the maximally loaded processor into two vertical strips having the same number of scan lines.

In contrast to this previous work, we not only try to balance the rendering load across the partition, but we also aim to minimize the screen space overlaps of the subimages rendered by different servers. Our method is a recursive binary partition using a two-line sweep.

Each single partition step proceeds as follows: We split the current region along the longest axis. Without loss of generality, let us assume the width is larger than the height of the region. Hence, we shall sweep vertical lines. We begin with two vertical lines, one at each side of the region to be partitioned. The vertical line on the left will always move to the right, and the one on the right will always move left (as shown in Figure 3(a)). The objects assigned as we move the left line will belong to the left group. Similarly, the right line will assign objects for the right group. At every step, the algorithm moves the line associated with the group with the least work in an attempt to maintain a good load balance. The line is moved until it passes a currently unassigned object. In our figure, we move the line on the left until it completely passes the leftmost object on the screen (which is unassigned since the algorithm has just begun executing). This object is assigned to the left group (see Figure 3(b)). This process is repeated until all objects have been assigned, in which case the sweep lines may have crossed and there is a narrow swath (labeled C in Figure 3(c)) where the bounding boxes of the left and right groups overlap. This swath is a region requiring composition of the two images rendered for the objects assigned to the left and right groups. On average, the swath's width is the mean width of a bounding box. The screen partition continues recursively until exactly N tiles and groups are formed, and one is assigned to each of the N servers.
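The single split step can be made concrete with the following sketch (in C++, the language of our implementation, though this listing is illustrative rather than the production code). It assumes the split is along the x axis and that each object carries a precomputed work estimate, such as its polygon count; the two sorted index orderings play the role of the two sweep lines.

    // Illustrative two-line sweep split along x. Objects carry screen-space
    // x bounds and an estimated rendering cost ("work"); the caller recurses
    // on the two resulting groups until N tiles/groups exist.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Object { float xmin, xmax; double work; };

    // Returns, for each object, 0 if it was assigned to the left group or
    // 1 if it was assigned to the right group.
    std::vector<int> sweepSplit(const std::vector<Object>& objs) {
        const size_t n = objs.size();
        std::vector<size_t> byMin(n), byMax(n);
        for (size_t i = 0; i < n; ++i) byMin[i] = byMax[i] = i;
        std::sort(byMin.begin(), byMin.end(),
                  [&](size_t a, size_t b) { return objs[a].xmin < objs[b].xmin; });
        std::sort(byMax.begin(), byMax.end(),
                  [&](size_t a, size_t b) { return objs[a].xmax > objs[b].xmax; });

        std::vector<int> side(n, -1);        // -1 = not yet assigned
        double leftWork = 0.0, rightWork = 0.0;
        size_t li = 0, ri = 0;               // sweep positions of the two lines

        for (size_t assigned = 0; assigned < n; ++assigned) {
            if (leftWork <= rightWork) {     // advance the lighter group's line
                while (side[byMin[li]] != -1) ++li;   // skip objects already taken
                side[byMin[li]] = 0;
                leftWork += objs[byMin[li]].work;
            } else {
                while (side[byMax[ri]] != -1) ++ri;   // skip objects already taken
                side[byMax[ri]] = 1;
                rightWork += objs[byMax[ri]].work;
            }
        }
        return side;
    }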

Figure 3: Example execution of the hybrid partition algorithm. (a) Beginning; (b) one object assigned; (c) all objects assigned.

This algorithm has a running time complexity of O(n log n + Nn), where n is the number of bounding boxes in the scene and N is the number of servers. The n log n term arises from the stage in which the n bounding boxes are sorted according to their screen position, and the Nn factor comes from the fact that N-1 cuts are made, while O(n) bounding boxes are considered for each cut.

6 Analysis of Communication Overheads

We analyze the average case communication requirements with our algorithm by making the following simplifying assumptions:

• All objects are evenly distributed.
• All objects are squares of equal size (in screen space). The edge of each square is of length B.
• There are P total pixels on the screen and N servers.
• Each server is assigned a square tile with area P/N and dimensions sqrt(P/N) x sqrt(P/N).

Based on these assumptions, each tile will have a region of width B/2 around its perimeter which its server will need to read back from its frame buffer and send to other servers for compositing (see Figure 4). Similarly, every server will receive pixels within approximately the same sized regions from other servers.

Figure 4: Pixel areas to be sent out.

The number of pixels composited by each server is the sum of the four corner squares and the four rectangles along the perimeter of the box. The four corner squares each have area:

    (B/2)^2 = B^2 / 4                              (1)

The area of each rectangle on the four sides is:

    (B/2) * sqrt(P/N)                              (2)

Summing the area of these regions, we have:

    4 * (B^2 / 4) + 4 * (B/2) * sqrt(P/N)          (3)

or,

    B^2 + 2B * sqrt(P/N)                           (4)

which represents the average number of pixels transferred one way from a single server for each frame.

This pixel redistribution overhead compares favorably with previous sort-last methods. For instance, several volume rendering systems employ static partitions which result in an approximate depth complexity of N^(1/3), since the volume is divided into N^(1/3) blocks along each of its three axes [15, 21]. If we assume each server is responsible for a tile covering P/N pixels, the total number of pixels composited by each server is:

    N^(1/3) * (P/N) = P / N^(2/3)                  (5)


Figure 5: Test models. (a) Hand (643 objects, 654,666 polygons); (b) Dragon (1,257 objects, 871,414 polygons); (c) Buddha (1,476 objects, 1,087,716 polygons).

We can now compare the overheads in the two methods with respect to N and P. Increasing N reduces the size of the compositing areas by a factor of N^(2/3) with the previous algorithms, whereas it affects our composite regions by a factor of sqrt(N). This factor favors previous algorithms.

However, increasing P causes the overheads of image compositing to grow linearly using previous algorithms, while they only increase with sqrt(P) using our method. Since P is generally quite large (e.g., 1280 x 960), this is usually the dominant factor in practice, and thus our algorithm has significantly lower compositing overheads than previous schemes.
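To give a feel for the magnitudes involved, the short program below evaluates expressions (4) and (5) for the default simulation parameters of Section 7 (P = 1280 x 960, N = 64). The average object width B = 40 pixels is an assumed, illustrative value, not a measured one.

    // Back-of-the-envelope comparison of per-server compositing traffic
    // using expressions (4) and (5). B is an assumed value for illustration.
    #include <cmath>
    #include <cstdio>

    int main() {
        const double P = 1280.0 * 960.0;  // total screen pixels
        const double N = 64.0;            // number of servers
        const double B = 40.0;            // assumed average object edge, in pixels

        const double hybrid   = 2.0 * B * std::sqrt(P / N) + B * B;  // Eq. (4)
        const double sortLast = P / std::pow(N, 2.0 / 3.0);          // Eq. (5)

        std::printf("hybrid:    %.0f pixels per server per frame\n", hybrid);    // ~12,700
        std::printf("sort-last: %.0f pixels per server per frame\n", sortLast);  // ~76,800
        return 0;
    }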

7 Results

We have implemented several parallel rendering algorithms in C++ under Windows NT and incorporated them into a cluster-based parallel rendering system for interactive visualization of 3D polygonal models in a VRML browser. Our prototype system is comprised of a 550 MHz Pentium-III client PC, eight 500 MHz Pentium-III server PCs with 256 MB RAM and nVidia GeForce 256 graphics accelerators, and a 500 MHz Pentium-III display PC with an Intergraph Wildcat graphics accelerator. All ten PCs are connected by a Myrinet system area network and use VMMC-2 communication firmware and software [2]. This implementation allows us to measure various architectural and system parameters for studies of feasibility and scalability.

In this section, we report data collected during a series of simulations of our parallel rendering system. The goals of these simulations are to investigate the algorithmic trade-offs of different partitioning strategies, to identify potential performance issues in our prototype system, to analyze the scalability of our hybrid algorithm, and to assess the feasibility of constructing a parallel rendering system with a cluster of PCs. Unless otherwise specified, the parameters of every simulation were set as follows (these "default" values reflect the measured parameters of our prototype system; a compact encoding of them is sketched after the list):

• Number of servers (N) = 64 PCs
• Number of pixels (P) = 1280 x 960 pixels
• Network i/o latency = 20 microseconds
• Network bandwidth = 100 MB/second
• Server polygon throughput = 750 Kpolygons/second
• Server pixel fill throughput = 300 Mpixels/second
• Server color buffer i/o latency = 50 microseconds
• Server color buffer read throughput = 20 Mpixels/second
• Server color buffer write throughput = 10 Mpixels/second
• Server Z buffer i/o latency = 50 microseconds
• Server Z buffer read throughput = 20 Mpixels/second
• Server Z buffer write throughput = 10 Mpixels/second
• Display color buffer write throughput = 20 Mpixels/second
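The structure below is a hypothetical grouping of these defaults for a simulator; it is not part of the prototype's source, but it records the same values.

    // Hypothetical grouping of the default simulation parameters listed above.
    struct SimulationParams {
        int    numServers       = 64;       // N
        int    screenWidth      = 1280;     // P = screenWidth * screenHeight
        int    screenHeight     = 960;
        double netLatencyUs     = 20.0;     // network i/o latency (microseconds)
        double netBandwidthMBps = 100.0;    // network bandwidth (MB/second)
        double polygonRate      = 750e3;    // server polygon throughput (polygons/second)
        double fillRate         = 300e6;    // server pixel fill throughput (pixels/second)
        double colorIoLatencyUs = 50.0;     // server color buffer i/o latency (microseconds)
        double colorReadRate    = 20e6;     // server color buffer read (pixels/second)
        double colorWriteRate   = 10e6;     // server color buffer write (pixels/second)
        double zIoLatencyUs     = 50.0;     // server Z buffer i/o latency (microseconds)
        double zReadRate        = 20e6;     // server Z buffer read (pixels/second)
        double zWriteRate       = 10e6;     // server Z buffer write (pixels/second)
        double displayWriteRate = 20e6;     // display color buffer write (pixels/second)
    };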

During every simulation, we logged statistics generated by the partitioning algorithms running on a 550 MHz Pentium-III PC while rendering a repeatable sequence of frames in an interactive visualization program. Every test was run for three 3D models (shown in Figure 5), each of which was represented as a scene graph in which multiple polygons were grouped at the leaf nodes and treated as atomic objects by the partitioning algorithms (the numbers of polygons and objects are listed under the image of each test model in Figure 5). The objects were formed by clustering polygons whose centroids lie within the same cell of an octree expanded to a user-specified depth (usually 3-5 levels in our cases). In all three models, the input polygons were rather small, as is typical for a high-performance rendering system, and thus the object bounding boxes closely resembled boxes of an octree, and rendering was always "geometry-bound." It was assumed that the 3D model was static and was replicated in the memory of every server, a current limitation of our system that will be discussed in Section 8. For each model, the camera traveled along a preset path of 50 frames that rotated at a fixed distance around the model, looking at it from a random variety of viewing directions while the model approximately covered the full height of the screen.

Our simulation study aims to answer the following performance questions about the proposed hybrid approach:

• Is it feasible for moderately-sized PC clusters?
• Does it scale with increasing numbers of processors?
• How does it compare with a sort-first algorithm?
• How does it compare with a sort-last algorithm?
• What is the impact of increasing display resolution?
• What is the impact of varying object granularity?

7.1 Feasibility of Proposed Rendering System

Our first study investigates whether it is practical to build a parallel rendering system with a cluster of PCs using our hybrid partitioning algorithm. We define success in this case to be interactive frame rates (e.g., 30 frames per second) and attractive efficiencies (e.g., above 50%) for moderately sized clusters (e.g., 8-64 PCs).


For this investigation, we executed a series of simulation tests with our hybrid algorithm using increasing numbers of server PCs. The timing results for client, server, and display PCs appear for all three test models in the top one-third of Table 1 (all times are in milliseconds). Note that the average time taken (per frame) by a uniprocessor system with the same hardware as the server PCs for the three models is as follows: Hand - 872.9 ms, Dragon - 1161.9 ms, and Buddha - 1450.3 ms. Bar charts of the total times for the client, server, and display phases are shown for the Buddha model in Figure 6. During analysis of these results, note that the client, servers, and display execute concurrently in a pipeline, so the slowest of these three stages determines the effective frame time.

Figure 6: Client, server, and display times for the Buddha model (time in ms versus number of servers).

Examining the bars in Figure 6, we see that the server is usually the limiting factor in these tests (the dark bar in the middle of each triple). The client (gray) was never limiting, and the display (white) was the bottleneck in only two tests (when 64 servers were used to render the simpler 3D models). In those cases, the servers were cumulatively able to deliver images to the display at approximately 40 frames per second. But, due to the limited I/O bandwidth of the display PC, it could only receive around 30 frames per second worth of data (100 MB/second divided by 4 bytes/pixel x 1280 x 960 pixels x 0.75 screen coverage). The net result is an interactive frame rate (30 frames per second), but with slight under-utilization of the server PCs. This result points out the need for faster digital display devices or for faster I/O bandwidths in commodity display PCs.

Examining the rightmost column of Table 1, we see that the speedups of the hybrid algorithm remain high for up to at least 64 processors. The efficiency of the system ranged between 55.0% and 70.5% for 64 processors for all three test models.

From these results, we conclude that our hybrid algorithm can provide a practical, low-cost solution for high-performance rendering of static, replicated 3D polygonal models on moderately sized clusters of PCs.

7.2 Scalability with Increasing Processors

Our second study investigates the scalability of the hybrid partitioning algorithm as the number of servers (N) increases. System speedups for up to 64 servers are shown in Figure 9. There are two concerns to be addressed in this study - scalability of the client and scalability of the servers - since the least scalable pipeline stage dictates overall system scalability.

First, consider client scalability. As stated in Section 5, the hybrid partitioning algorithm grows linearly with N. This trend can be seen clearly in the dark black line in Figure 7, which shows client processing times for the hybrid algorithm as a function of the number of servers, measured on a 550 MHz Pentium-III PC during tests with the Buddha model. Extrapolating this curve indicates that the client will be able to achieve 30 updates per second for at least up to 150 servers or so. This result is encouraging, as such large clusters are very rare today, and client PC processor speeds are growing more rapidly than cluster sizes.

Figure 7: Client times for sort-first, sort-last, and the hybrid algorithms for increasing numbers of servers.

Second, consider server scalability. As stated in Section 6, the pixel redistribution overheads of each server scale with sqrt(P/N). This result can be seen in Figure 8, which shows breakdowns of server processing times. In the case of the hybrid algorithm (the middle set of bars), the overheads are due primarily to pixel reads and pixel writes during server-to-server pixel redistribution. Although the reduction in overheads is not linear, the simulated speedups of our system are quite good. For 64 processors, the speedups of the hybrid approach are 36.3, 43.7, and 46.5 for Hand, Dragon, and Buddha, respectively, corresponding to effective frame times of 32.8 ms, 34.6 ms, and 31.1 ms, respectively (see Figure 9). These speedups compare favorably to previous parallel rendering systems, whose frame times are significantly longer.

Figure 8: Server times for sort-first, sort-last, and the hybrid algorithms for screen resolution 1280x960. The height of each bar represents the time required for processing in the server (broken down into Ideal Render, Overlap Render, Pixel Read, Pixel Write, Final Read, and Imbalance).

As an addendum, we note that bandwidth requirements of the display PC are not affected by increasing the number of servers. Exactly one screen-full of pixels arrives at the display PC for each image, no matter how finely the image is split into tiles. Latencies of pixel I/O operations do impact display scalability, but they are significant only for very large numbers of servers.


                               -------- Client --------   ---------------------------- Server ----------------------------   Display
Parallel     Test              Obj    Compute             Ideal   Overlap   Pixel   Pixel   Final   Wait for                  Final     Speedup
Algorithm    Model      N      Xform  Partition   Total   Render  Render    Read    Write   Read    Imbalance   Total         Write     Factor

Hybrid       Hand       8      2.1      5.6        7.7    108.8     -        2.6     4.9     3.7      4.9       124.9          29.7       7.0
             Hand      16      2.1      6.2        8.3     54.4     -        1.9     3.4     1.9      5.8        67.4          30.1      12.9
             Hand      32      2.1      7.2        9.3     27.2     -        1.8     3.1     1.0      6.8        39.9          30.8      21.8
             Hand      64      2.1      9.6       11.7     13.6     -        1.5     2.3     0.5      6.1        24.0          32.8      36.3
             Dragon     8      4.1     10.6       14.7    144.9     -        2.5     4.7     4.0      4.8       160.9          32.0       7.2
             Dragon    16      4.1     11.7       15.8     72.4     -        2.1     3.6     2.0      4.4        84.5          32.3      13.7
             Dragon    32      4.1     13.1       17.2     36.2     -        1.8     3.1     1.0      4.2        46.3          33.0      25.0
             Dragon    64      4.1     16.5       20.6     18.1     -        1.5     2.4     0.5      4.0        26.5          34.6      43.7
             Buddha     8      4.8      9.6       14.4    180.8     -        2.4     4.4     3.4      4.0       195.0          27.6       7.4
             Buddha    16      4.8     10.8       15.6     90.4     -        1.9     3.3     1.7      4.3       101.6          27.9      14.2
             Buddha    32      4.8     12.7       17.5     45.2     -        1.7     2.9     0.9      4.3        55.0          28.6      26.3
             Buddha    64      4.8     16.5       21.3     22.6     -        1.4     2.2     0.5      4.4        31.1          30.2      46.5

Sort-First   Hand       8      2.1      3.2        5.3    108.8   66.6        -       -      2.7     37.9       216.0          30.6       4.0
             Hand      16      2.1      4.0        6.1     54.4   59.5        -       -      1.3     42.9       158.1          31.0       5.5
             Hand      32      2.1      5.4        7.5     27.2   50.1        -       -      0.6     46.1       124.0          31.8       7.0
             Hand      64      2.1      8.4       10.5     13.6   42.1        -       -      0.3     45.0       101.0          33.4       8.0
             Dragon     8      4.1      4.8        8.9    144.9   60.7        -       -      3.4     28.2       237.2          32.9       4.9
             Dragon    16      4.1      6.1       10.2     72.4   49.7        -       -      1.6     29.0       152.7          33.3       7.6
             Dragon    32      4.1      8.6       12.7     36.2   39.4        -       -      0.8     26.2       102.6          34.1      11.3
             Dragon    64      4.1     13.5       17.6     18.1   32.5        -       -      0.4     27.0        78.0          35.6      14.9
             Buddha     8      4.8      5.5       10.3    180.8   67.2        -       -      3.1     25.0       276.1          28.4       5.2
             Buddha    16      4.8      6.8       11.6     90.4   60.0        -       -      1.5     30.6       182.5          28.8       7.9
             Buddha    32      4.8      9.7       14.5     45.2   50.2        -       -      0.7     32.2       128.3          29.6      11.3
             Buddha    64      4.8     15.1       19.9     22.6   41.0        -       -      0.4     32.0        96.0          31.2      15.1

Sort-Last    Hand       8      2.1      0.1        2.2    108.8     -       14.1    26.2     3.9      4.6       157.6          31.5       5.5
             Hand      16      2.1      0.2        2.3     54.4     -        9.2    16.8     1.8     11.2        93.4          29.5       9.3
             Hand      32      2.1      0.5        2.6     27.2     -        5.2     9.5     0.9      9.9        52.7          29.2      16.5
             Hand      64      2.1      0.6        2.7     13.6     -        3.1     5.5     0.4      9.4        32.0          28.8      27.2
             Dragon     8      4.1      0.1        4.2    144.9     -       14.0    26.0     4.4      6.7       196.0          34.8       5.9
             Dragon    16      4.1      0.2        4.3     72.4     -        9.3    17.1     2.0      7.9       108.7          32.4      10.7
             Dragon    32      4.1      0.5        4.6     36.2     -        5.6    10.1     1.0      9.6        62.5          31.6      18.5
             Dragon    64      4.1      0.6        4.7     18.1     -        3.2     5.7     0.4     10.7        38.1          30.8      30.4
             Buddha     8      4.8      0.1        4.9    180.8     -       14.2    26.1     4.0      9.7       234.8          31.8       6.2
             Buddha    16      4.8      0.2        5.0     90.4     -        9.5    17.5     1.9     14.2       133.5          31.2      10.8
             Buddha    32      4.8      0.5        5.3     45.2     -        6.3    11.5     0.9     10.8        74.7          29.8      19.4
             Buddha    64      4.8      0.6        5.4     22.6     -        3.8     6.7     0.4     17.3        50.8          29.2      28.5

Table 1: Timing statistics gathered during simulations on test models with sort-first, sort-last, and the hybrid parallel rendering algorithms. All times are in milliseconds.


Figure 9: Speedup curves with increasing numbers of processors for (a) Hand, (b) Dragon, and (c) Buddha.

7.3 Comparison to Sort-First Algorithms

In our third study, we compare our hybrid partitioning algorithm to a sort-first approach previously described in the literature. Although we have implemented two sort-first methods, Mueller's MAHD [19] and Samanta's KD-Split [28], we compare simulation results for only the MAHD algorithm in this paper. The comparison for KD-Split is similar.

In our implementation of the MAHD algorithm, the client PC first transforms the objects into 2D and marks a 2D grid-based data structure with an estimate of the rendering load associated with each cell in this grid. Next, the algorithm partitions this grid data structure into tiles with balanced rendering load by recursively partitioning the grid structure along the longest dimension at every step of the recursion. The algorithm terminates when the number of tiles is equal to the number of servers. Each tile is then assigned to a server PC, which renders an image containing all polygons in any object that at least partially overlaps the extent of its tile. The servers then read the resulting pixels back from their color buffers and send them to the display PC so that they can be composited into a complete image for display. The client processing times for this method grow with the product of the number of objects and the number of tiles, just like the hybrid algorithm. The main overheads incurred by the sort-first algorithm are due to client processing as the screen is partitioned, redundant server rendering of objects overlapping multiple tiles, and server-to-server pixel redistribution.
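One way to realize the recursive grid split described above is sketched below. The names, structure, and the "most balanced cut" criterion are ours, chosen for exposition; they are not Mueller's code or our production implementation.

    // Illustrative sketch of a grid-based recursive split over a per-cell
    // rendering-cost estimate (an MAHD-style decomposition).
    #include <cmath>
    #include <vector>

    struct GridRegion { int x0, y0, x1, y1; };   // half-open bounds in grid cells

    static double regionCost(const std::vector<std::vector<double>>& cost,
                             const GridRegion& r) {
        double sum = 0.0;
        for (int y = r.y0; y < r.y1; ++y)
            for (int x = r.x0; x < r.x1; ++x) sum += cost[y][x];
        return sum;
    }

    // Recursively split `r` along its longer dimension at the most balanced
    // cell boundary until `count` tiles have been produced.
    static void splitRegion(const std::vector<std::vector<double>>& cost,
                            GridRegion r, int count, std::vector<GridRegion>& tiles) {
        if (count == 1) { tiles.push_back(r); return; }
        const bool splitX = (r.x1 - r.x0) >= (r.y1 - r.y0);
        const int lo = splitX ? r.x0 : r.y0;
        const int hi = splitX ? r.x1 : r.y1;
        const double total = regionCost(cost, r);

        int cut = lo + 1;
        double best = 1e300;
        for (int c = lo + 1; c < hi; ++c) {      // choose the most balanced cut
            GridRegion left = splitX ? GridRegion{r.x0, r.y0, c, r.y1}
                                     : GridRegion{r.x0, r.y0, r.x1, c};
            const double diff = std::abs(2.0 * regionCost(cost, left) - total);
            if (diff < best) { best = diff; cut = c; }
        }
        GridRegion a = splitX ? GridRegion{r.x0, r.y0, cut, r.y1}
                              : GridRegion{r.x0, r.y0, r.x1, cut};
        GridRegion b = splitX ? GridRegion{cut, r.y0, r.x1, r.y1}
                              : GridRegion{r.x0, cut, r.x1, r.y1};
        splitRegion(cost, a, count / 2, tiles);
        splitRegion(cost, b, count - count / 2, tiles);
    }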

Timing results measured during tests with both the hybrid and sort-first methods appear for all three test models in Table 1, and more detailed breakdowns of the server processing times are shown for the tests with the Buddha model in Figure 8. From these breakdowns of Figure 8, we can visually compare the overheads of different algorithms. In particular, the dark bands ("Overlap Render") in the bars for sort-first (the leftmost set of bars) represent overheads due to redundant rendering of objects overlapping multiple tiles. Note how these bands become a larger percentage of the total server time as the overlap factor grows with increasing numbers of processors. The hybrid algorithm incurs no such overheads due to overlaps, as every object is rendered exactly once by a server. As a result, the server times simulated with the hybrid algorithm scale better than with sort-first.

Speedup curves for the hybrid and sort-first algorithms can be compared directly in Figure 9. For 64 processors, the speedups of the hybrid approach are 36.3, 43.7, and 46.5 for Hand, Dragon, and Buddha, respectively, whereas the sort-first approach speedups are 8.0, 14.9, and 15.1. Overall, the efficiency of the hybrid algorithm is generally around 3 to 4 times better than the sort-first algorithm in these tests.

7.4 Comparison to a Sort-Last Algorithm

In our fourth study, we compare our hybrid partitioning algorithm to a sort-last approach. For the purposes of this experiment, we have implemented a polygon rendering version of a sort-last algorithm motivated by Neumann's "Object Partition with Block Distribution" algorithm [22]. This algorithm is used for comparison because it requires the least composition bandwidth, provides the best balance, and fits into our peer-to-peer image composition system better than any other sort-last algorithm we are aware of.

In our implementationof Neumann’s algorithm,groupsof ob-jectsarepartitionedamongserversstaticallyusinganalgorithmthatis a direct3D extensionof the2D methoddescribedin Section5.Then,duringeachframeof aninteractivevisualizationsession,ev-ery server rendersits assignedgroupof objectsinto its own framebuffer. Then,theserversredistributerenderedpixelsaccordingto astaticassignmentof aninterleavedfine-grainedgridof tiles. Specif-ically, after renderingis complete,every server readsthepixels intheboundingboxof renderedobjectsfrom bothits colorbuffer andits Z-buffer andsendsthemto the server PCto whomthe tile hasbeenassigned.Uponreceiving pixels, theserverscompositetheminto their framebuffers.Finally, afterall tiles havebeenfully com-posited,theserversreadtheresultingpixelsbackfrom their framebuffers andsendthemto the displayPC so that they canbe com-positedinto acompleteimagefor display.
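Whichever algorithm supplies the pixels, the per-pixel composition step performed by the receiving server is a simple depth test. A minimal sketch is given below; the buffer layout and names are assumptions for illustration only.

    // Minimal depth-compositing step: merge received color+depth samples
    // into a tile's local buffers; the nearer fragment wins.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct DepthSample { uint32_t rgba; float z; size_t pixelIndex; };

    void compositeWithDepth(std::vector<uint32_t>& tileColor,
                            std::vector<float>& tileDepth,
                            const std::vector<DepthSample>& incoming) {
        for (const DepthSample& s : incoming) {
            if (s.z < tileDepth[s.pixelIndex]) {   // closer than what is stored
                tileDepth[s.pixelIndex] = s.z;
                tileColor[s.pixelIndex] = s.rgba;
            }
        }
    }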

Simulation results for this sort-last algorithm are shown in the bottom one-third of Table 1. The key comparison points are client processing times and server overheads.

Clearly, the client processing times for the sort-last algorithm are significantly less than for the hybrid and sort-first methods (see Figure 7). The only work that must be done by the sort-last client is to transform the 3D bounding boxes into the 2D screen space so that "active" tiles can be marked, and server-to-server pixel transfers can be avoided for the others. This simple processing could easily be done directly on the servers, or possibly on the display PC. So, in effect, the hybrid and sort-first algorithms require one more PC's processing power than the sort-last algorithm - an advantage of pure sort-last. However, the impact of this difference is rather small for large numbers of servers.

On the server side, there is a significant difference between the sort-last and hybrid algorithms in pixel redistribution costs (see the columns labeled "Pixel Read" and "Pixel Write" in Table 1). The sort-last algorithm incurs more overheads, as it must perform image composition transfers and processing for larger numbers of pixels (see Section 6). The difference can be seen in our simulation results as the white regions in the rightmost set of bars in Figure 8. In the sort-last case, the average number of pixels transferred between servers during each frame is related to the scene depth complexity (D) multiplied by the resolution of the screen (P), where D is approximately N^(1/3) and the per-server overhead is D * P / N [15, 21]. In contrast, for the hybrid algorithm, the number of pixels transferred is related to 2B * sqrt(P/N) + B^2, where B is the average size of an object's bounding box (see Section 6). Empirically, the pixel redistribution overheads of the hybrid algorithm are much smaller than for sort-last in all of our simulations.

Figure 10: Visualizations of server overheads with different algorithms: (a) sort-first; (b) sort-last; (c) hybrid. In (a), highlighted object bounding boxes span multiple tiles and must be rendered redundantly. In (b) and (c), brighter pixel intensities represent more image composition overheads.

The difference between the pixel redistribution costs can be seen very clearly in the visualizations of Figure 10(b) and Figure 10(c), which show pixels transferred across the network for the sort-last and hybrid algorithms, respectively, shaded in transparent white. Just by looking at "the amount of white" in these images, one gets an intuitive feel for how much pixel redistribution overhead is incurred by the system.

Overall, the efficiency of the hybrid algorithm is approximately 33% to 53% better than the sort-last algorithm in all of our tests.

7.5 Impact of Increasing Screen Resolution

In our fifth study, we analyze the effect of increasing screen resolution on each of the partitioning algorithms discussed so far: hybrid, sort-first, and sort-last. There are two issues to consider – display times (for all algorithms) and server times (for hybrid and sort-last).

First, consider display times. As previously discussed, the display PC can be the limiting factor if it is not able to receive data from the network or write it into its frame buffer quickly enough to keep up with the client and servers. Of course, this problem gets worse with increasing screen resolution. For example, increasing the resolution from 1280 x 960 to 2560 x 1920, the final display time will increase by a factor of 4. However, this concern may go away in a few years, as PC network and AGP bandwidths improve at a rate higher than screen resolution.

Second, consider server times. Here, there are fundamental differences between sort-first, sort-last, and our hybrid algorithm. For sort-first, the dominant overhead (overlap factor) is completely independent of screen resolution (until the rendering becomes rasterization bound). As a result, sort-first is probably appropriate for very high-resolution screens, such as multi-projector display walls [28].

In contrast, the sort-last and hybrid algorithms are definitely impacted by higher screen resolutions. It is important to note, however, that the pixel redistribution overheads of the sort-last algorithm grow linearly with P, while the overheads of our hybrid algorithm grow only with sqrt(P). Essentially, sort-last must composite areas, while the hybrid algorithm composites perimeters. This is a significant difference, which can be seen very clearly when comparing Figures 8 and 11, which show breakdowns of server times for display resolutions 1280x960 and 2560x1920. The significantly taller white areas in the bars for sort-last indicate larger pixel redistribution overheads that thwart speedups for high resolution screens. By comparison, pixel redistribution times are around 5 times less for 2560x1920 images with the hybrid algorithm.

Figure 11: Server times for sort-first, sort-last, and the hybrid algorithms at screen resolution 2560x1920. The height of each bar represents the time required for processing in the server.

7.6 Impact of Increasing Object Granularity

In our final study, we investigate the effects of different object granularities on partitioning algorithms. Figures 12 and 13 show the client and server times measured with three different object granularities (399, 1,476, and 6,521 objects, respectively) for the Buddha model – each case has the same number of polygons, just separated into different numbers of objects. In this case, we can trade off increases in client processing for better efficiencies in servers with the hybrid and sort-first algorithms. The sort-last algorithm is largely unaffected.

First, consider the impact of increasing the number of objects to be processed by the client. Judging from the curves in Figure 12, we see that the client times of all algorithms increase linearly with object granularity. This result is due to a combination of linearly increasing object transformations and linearly increasing partition processing times. For 6,521 objects, the client processing time for the hybrid method clearly becomes the bottleneck.


Figure 12: Plot of client processing times for different granularities of objects in the Buddha model.

Second, consider the impact on the servers. We see in Figure 13 that the relative overheads for the hybrid and sort-first algorithms decrease significantly for finer object granularities. Essentially, as object bounding boxes get smaller, the overlaps get smaller, as predicted by Molnar [16]. In the case of the hybrid algorithm, the effect is to make the region on the boundary of each tile to be composited with neighbors "thinner." This effect can be seen clearly in the images of Figure 14, which show visualizations of pixel redistribution costs (more white means more overhead) for three different object granularities – the "swaths" requiring image composition get thinner with shrinking bounding boxes.

Figure 13: Plot of server overheads as a percentage of ideal rendering time for different granularities of objects in the Buddha model.

In the limit, as the object granularity increases and bounding box widths approach zero, the overheads of the hybrid and sort-first algorithms tend toward zero. Hence, faster clients can significantly improve server efficiencies by partitioning at finer object granularities. This is a significant difference from the sort-last algorithm, which is unaffected by object granularity. Developing and analyzing the effects of hierarchical and/or parallel client algorithms is an interesting topic for future study.

8 Discussion

The novel idea presented in this paper is that of dynamic, view-dependent partitioning of both the 3D model space and the 2D screen space for parallel polygon rendering on a cluster. The key result is that this approach both performs better and scales better with processor count than the best known earlier approaches, at least in the cluster environment. Compared to pure sort-first approaches, this approach greatly reduces the problems that arise due to objects overlapping multiple screen partitions. Compared to sort-last approaches with one-time, view-independent partitioning of the model, it greatly reduces the bandwidth required for pixel transfer. The approach has further advantages in situations where applications not only render the model from different viewpoints but also zoom in on certain sub-areas and use greater detail in those areas (which would make static partitions perform poorly, since the area zoomed in on would be assigned to only a small number of processors in that case).

The approach is not without disadvantages, however, and these too stem from the dynamic, view-dependent nature of computing partitions. The major disadvantages are in the following two areas.

First, a separate client machine computes partitions for every frame, so the speed of partitioning on the client can limit the achieved frame rate. This tradeoff is particularly visible in our approach, since unlike in other approaches, making objects smaller continues to help the parallel rendering rate, asymptotically resulting in near-zero overheads. However, making objects smaller makes partitioning more expensive and hence makes it difficult for the client to keep up with the servers. A possible solution is to use a parallel client-side engine for partitioning. Another is to have the client maintain a hierarchical view of the objects in the model and subdivide objects adaptively as it computes its partitions, as opposed to the current method of fixing the sizes of objects up front. These methods remain topics for future investigation.

Second, with a dynamic, view-dependent approach, the 3D data rendered by a given server may change from frame to frame. In the absence of a shared address space, it can be both difficult and overhead-consuming to manage the partitioning or distribution of the 3D model data among processors and the dynamic replication of data on demand. We have therefore chosen to replicate the entire 3D model on all processing nodes for now, to eliminate the data management problem and focus on the partitioning issues. However, this clearly restricts our current implementation to rendering static models and limits the size of the models that can be rendered. Since dynamic data distribution and replication will introduce run-time overheads, to build a truly scalable approach we must examine methods to perform this data management efficiently. We plan to do this in future work, examining system-level approaches like software implementation of a shared memory programming model or memory servers on clusters, as well as application-level approaches like customized "shared memory" or conservative replication of data at partition boundaries based on the partition construction. It is worth noting, of course, that an approach that uses static partitioning of the 3D model does not suffer from this replication problem, so it is likely that for situations in which the 3D model is very large and the screen is small (so that sort-last bandwidth is not a major problem), a static sort-last approach that avoids the data management overheads may work best, as long as there is enough bandwidth to sustain the desired frame rate.

We expect that limitations of our approach should diminish with time and its performance and scalability should improve further, given technology trends. In particular, both client processor speed and bandwidth tend to grow much faster than display resolution. The former helps the client partition at finer granularities, while the latter helps reduce the overheads of pixel redistribution by alleviating the dynamic data management problem. We therefore believe that this type of dynamic, view-dependent approach is likely to be a very good one for parallel polygon rendering on commodity clusters (and potentially other parallel machines), as long as the data management problems can be solved satisfactorily.


Figure 14: Visualizations of tile overlaps in partitions created with our hybrid algorithm using different object granularities: (a) 293 objects; (b) 1,476 objects; (c) 6,521 objects. The top row of images shows 3D object bounding boxes. The bottom row shows tiles with overlap regions shaded. Larger overlap regions cause higher server pixel redistribution overheads.

9 Summary and Conclusion

This paper presents a hybrid parallel rendering algorithm that combines features of both "sort-first" and "sort-last" strategies for a PC-cluster architecture. The algorithm dynamically partitions both the 2D screen into tiles and the 3D polygons into groups in a view-dependent manner per frame, to balance the rendering load among the PCs and to minimize the bandwidths required for compositing tiles with a peer-to-peer redistribution of pixels.

During simulations of our working prototype system with varying system parameters, we find that our hybrid algorithm outperforms sort-first and sort-last algorithms in almost all tests, including ones with larger numbers of processors, higher screen resolutions, and finer object granularities. Overall, it is able to achieve interactive frame rates with efficiencies of 55.0% to 70.5% during simulations of a system with 64 PCs. We believe this approach provides a practical, low-cost solution for high-performance rendering of static, replicated 3D polygonal models on moderately sized clusters of PCs.

Topics for further study include development of algorithms for dynamic database management, support for fast download of images from a network into a frame buffer, and methods to partition graphics primitives among servers with finer granularity using either parallel clients, hierarchical algorithms, or faster algorithms.

Acknowledgements

This project is part of the Princeton Scalable Display Wall project, which is supported in part by the Department of Energy under grant ANI-9906704 and grant DE-FC02-99ER25387, by the Intel Research Council and an Intel Technology 2000 equipment grant, and by the National Science Foundation under grant CDA-9624099 and grant EIA-9975011. Thomas Funkhouser is also supported by an Alfred P. Sloan Fellowship.

We would like to thank Jiannan Zheng, who discussed the hybrid algorithm with us, and Yuqun Chen and Xiang Yu, who designed and implemented the VMMC-II communication mechanisms for the PC cluster. We also would like to thank Greg Turk and the Stanford Computer Graphics Laboratory for sharing with us the models used in our experiments.

References

[1] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawik, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su. Myrinet: A gigabit-per-second local area network. IEEE Micro, 15(1):29–36, February 1995.

[2] Y. Chen, A. Bilas, S. Damianakis, C. Dubnicki, and K. Li. UTLB: A mechanism for address translation on network interfaces. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 193–204, 1998.

[3] J. Clark. A VLSI geometry processor for graphics. Computer (IEEE), 13:59–62, 64, 66–68, July 1980.

[4] James H. Clark. The geometry engine: A VLSI geometry system for graphics. Computer Graphics, 16(3):127–133, July 1982.

[5] Michael Cox. Algorithms for Parallel Rendering. PhD thesis, Department of Computer Science, Princeton University, 1995.

[6] Thomas W. Crockett. Parallel rendering. Technical Report TR95-31, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, 1995.

[7] T. W. Crockett. An introduction to parallel rendering. Parallel Computing, 23:819–843, 1997.

[8] David Ellsworth. A new algorithm for interactive graphics on multicomputers. IEEE Computer Graphics and Applications, 14(4):33–40, 1994.

[9] H. Fuchs and B. Johnson. An expandable multiprocessor architecture for video graphics. In Proceedings of the 6th Annual ACM-IEEE Symposium on Computer Architecture, April 1979.

[10] Henry Fuchs. Distributing a visible surface algorithm over multiple processors. In Proceedings of the 1977 ACM Annual Conference, Seattle, WA, pages 449–451, October 1977.

[11] Thomas A. Funkhouser. Coarse-grained parallelism for hierarchical radiosity using group iterative methods. In Computer Graphics (SIGGRAPH 96), 1996.

[12] C. Grietsen and J. Petersen. Parallel volume rendering on a network of workstations. IEEE Computer Graphics and Applications, 13(6):16–23, 1993.

[13] Jeremy Hubbell. Network rendering. In Autodesk University Sourcebook, volume 2, pages 443–453. Miller Freeman, 1996.

[14] Homan Igehy, Gordon Stoll, and Pat Hanrahan. The design of a parallel graphics interface. In SIGGRAPH 98 Conference Proceedings, pages 141–150. ACM SIGGRAPH, July 1998.

[15] K.L. Ma, J.S. Painter, C.D. Hansen, and M.F. Krogh. Parallel volume rendering using binary-swap compositing. IEEE Computer Graphics and Applications, 14(4):59–68, 1994.

[16] S. Molnar. Image-composition architectures for real-time image generation. PhD thesis, University of North Carolina at Chapel Hill, 1991.

[17] Steve Molnar, Michael Cox, David Ellsworth, and Henry Fuchs. A sorting classification of parallel rendering. IEEE Computer Graphics and Applications, 14(4):23–32, 1994.

[18] J.S. Montrym, D.R. Baum, D.L. Dignam, and C.J. Migdal. InfiniteReality: A real-time graphics system. In Proceedings of Computer Graphics (SIGGRAPH 97), pages 293–303, 1997.

[19] Carl Mueller. The sort-first rendering architecture for high-performance graphics. In ACM SIGGRAPH Computer Graphics (Special Issue on 1995 Symposium on Interactive 3-D Graphics), 1995.

[20] Carl Mueller. Hierarchical graphics databases in sort-first. In Proceedings of the IEEE Symposium on Parallel Rendering, pages 49–57, 1997.

[21] Ulrich Neumann. Parallel volume-rendering algorithm performance on mesh-connected multicomputers. In ACM SIGGRAPH Symposium on Parallel Rendering, pages 97–104. ACM, November 1993.

[22] Ulrich Neumann. Communication costs for parallel volume-rendering algorithms. IEEE Computer Graphics and Applications, 14(4):49–58, 1994.

[23] Frank A. Ortega, Charles D. Hansen, and James P. Ahrens. Fast data parallel polygon rendering. In Proceedings, Supercomputing '93: Portland, Oregon, November 15–19, 1993, pages 709–718. IEEE Computer Society Press, 1993.

[24] F. I. Parke. Simulation and expected performance analysis of multiple processor z-buffer systems. In Proceedings of Computer Graphics (SIGGRAPH 80), pages 48–56, 1980.

[25] Pixar. PhotoRealistic RenderMan Toolkit. 1998.

[26] R.J. Recker, D.W. George, and D.P. Greenberg. Acceleration techniques for progressive refinement radiosity. In Computer Graphics (Proceedings of the 1990 Symposium on Interactive 3D Graphics), pages 59–66, 1990.

[27] Rudrajit Samanta, Thomas Funkhouser, Kai Li, and Jaswinder Pal Singh. Sort-first parallel rendering with a cluster of PCs. In SIGGRAPH 2000 Technical Sketches, August 2000.

[28] Rudrajit Samanta, Jiannan Zheng, Thomas Funkhouser, Kai Li, and Jaswinder Pal Singh. Load balancing for multi-projector rendering systems. In Proceedings of the 1999 Eurographics/SIGGRAPH Workshop on Graphics Hardware, Los Angeles, CA, August 8–9, 1999, pages 107–116. ACM Press, 1999.

[29] Bengt-Olaf Schneider. Parallel rendering on PC workstations. In International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 98), Las Vegas, NV, 1998.

[30] Jaswinder Pal Singh, Anoop Gupta, and Marc Levoy. Parallel visualization algorithms: Performance and architectural implications. Computer, 27(7):45–55, July 1994.

[31] Daniel S. Whelan. Animac: A multiprocessor architecture for real-time computer animation. PhD thesis, California Institute of Technology, 1985.

[32] Scott Whitman. Multiprocessor Methods for Computer Graphics Rendering. Jones and Bartlett Publishers, Boston, MA, March 1992.

[33] Scott Whitman. Dynamic load balancing for parallel polygon rendering. IEEE Computer Graphics and Applications, 14(4):41–48, 1994.