Large Scale Visualization using PC Clusters

ClusterWorld 2003

Brian Wylie
Sandia National Laboratories

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

Outline

Hardware Platform

Software Platform

Data Distribution

Parallel Rendering

Parallel Volume Rendering

Other Techniques

Hardware Platform

Visualization Clusters

Wilson (Classified)
64 nodes
800 MHz P3 CPU
GeForce3 cards
Myrinet 2000 interconnect

Europa (Unclassified)
128/256 Dell Workstations
Dual 2.0 GHz P4 Xeon
GeForce3 cards
Myrinet 2000 interconnect
1.27 TFLOP on Linpack
#32 on Top 500

Wilson

Europa

DLP Projector Alignment (March 2, 2002)

‘Blue Girl’ Color Test Image
First 48-Projector Alignment (March 2, 2002)

LLNL PPM Dataset
First 48-tile Rendering
470M Polygons w/64 nodes (April 5, 2002)

VIEWS Visualization Corridor

Display Screens
Seamless 12m x 3m RP
3 sections - each 4x4 array
0.5” Stewart FilmScreen AeroGlass100

Projection Systems
3 16-projector arrays
Primary Projectors are DPI
1280x1024, 3500 Lumen, 3-chip DLP (DMD)

Data and Visualization “Corridor” metaphor:

A wide path through which massive quantities of data can easily flow, and through which scientists and engineers can explore data and collaborate.



Use the Visualization Toolkit (VTK) as a framework

Only replace small parts of the framework with our research codes; the rest is tried and tested by a whole community.

VTK is criticized as being too slow… If necessary we will rewrite the slow parts.

Example: Poly Data Mappers

New classes perform significantly better:
Standard VTK mapper: 4.5 Mtri/sec.
New generic 3D accelerated mapper: 9.5 Mtri/sec.
New nVidia accelerated mapper: 20 Mtri/sec.

New mappers available in ParaView now (through factory method and dynamic loading).

More accelerated modules to become part of the supercharged “GT” component project.

VTK Contributions

The Sandia Vis Gang

3459344 Lee Ann Fisk
Crimes: D3, Possession with intent to Distribute

9345325 Brian Wylie
Crimes: Shadow rendering, Racketeering

9023842 Gary Templet
Crimes: vtkSNL Build System, Repository Maintenance, Aiding and Abetting

8934592 David Thompson
Crimes: HALO, Chromium, Laundering

0235098 Kenneth Moreland
Crimes: Rendering (Serial, Parallel, Volumetric), Public Indecency

VTK Info

As of the current release (VTK 4.2) there are approximately…

107 classes that read and write data.

317 classes that filter datasets. (Isosurface, Streamlines, etc)

145 classes that “map” the dataset into an image.

50 classes that support parallel processing

Distributed Parallelism

Parallel Processing with VTK

Data Size

Visible Woman CT Data
870 MBytes
1734 Slices at 512x512x2
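The quoted size can be checked from the slice dimensions. The sketch below assumes the trailing "x2" in 512x512x2 means two bytes per sample; the slide does not state this, so treat it as an assumption.

```python
# Back-of-the-envelope size check for the Visible Woman CT volume.
# Assumption: "512x512x2" means 512 x 512 samples at 2 bytes each.
SLICES = 1734
WIDTH = 512
HEIGHT = 512
BYTES_PER_SAMPLE = 2


def volume_bytes(slices, width, height, bytes_per_sample):
    """Total raw size of the volume in bytes."""
    return slices * width * height * bytes_per_sample


total = volume_bytes(SLICES, WIDTH, HEIGHT, BYTES_PER_SAMPLE)
total_mib = total / 2**20
print(f"{total} bytes = {total_mib:.0f} MiB")
```

Under that assumption the raw volume comes to 867 MiB, in line with the roughly 870 MBytes quoted on the slide.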

Flow Computation: Robert Meakin

Visualization: David Kenwright and David Lane

Numerical Aerodynamic Simulation Division at NASA Ames Research Center

• Bell-Boeing V-22 tiltrotor
140 Gbytes

Other Large Data

Modeling turbulence (Ken Jansen RPI)
8.5 million tetrahedra, 200 time steps
150 million tetrahedra, 2000 time steps (soon)

VTK Weaknesses

Parallel Data Distribution (inefficient, poor load balancing, etc)

Rendering (Serial and Parallel) (really not good all the way around)

Data Readers (not smart)







[Diagram: parallel VTK architecture, showing D3 on each of four nodes, the Full VTK Pipeline, an Optimized Renderer, and ICE-T / Chromium]

D3 is the backbone of our parallel VTK architecture.

ICE-T and Chromium based parallel VTK rendering.

Software

Load Balancing
D3 (Distributed Data Decomposition)

Scalable Rendering
Image Compositing Engine for Tiles (ICE-T)

Unstructured Volume Rendering
Gpu Accelerated Tetrahedral Rendering (GAToR)
Upcoming Parallel Work

Other Techniques
Higher Order Elements
Real Time Shadows

Load Balancing

D3 (Distributed Data Decomposition)

Spatial decomposition based on K-d tree

Spatial regions contain approximately equal number of mesh elements.

Fast execution for tree queries.

Axis aligned nature of boundaries can accelerate processing for some visualization algorithms.
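The decomposition described above can be illustrated with a small serial sketch. This is not the D3 code: the function name `kd_decompose`, the use of cell centroids, and the cycling split axis are all assumptions made for illustration, and the median is found with a plain sort purely for brevity.

```python
# Hypothetical sketch of a k-d tree spatial decomposition (not the D3 code).
# Each mesh cell is represented only by its centroid (x, y, z).


def kd_decompose(centroids, depth, axis=0):
    """Split centroids into up to 2**depth regions of near-equal size.

    Each level cuts the current region at the median along one axis,
    so every region boundary is axis aligned.
    """
    if depth == 0 or len(centroids) <= 1:
        return [centroids]
    ordered = sorted(centroids, key=lambda c: c[axis])
    mid = len(ordered) // 2
    next_axis = (axis + 1) % 3
    return (kd_decompose(ordered[:mid], depth - 1, next_axis)
            + kd_decompose(ordered[mid:], depth - 1, next_axis))


# 1000 synthetic cell centroids on a 10 x 10 x 10 lattice.
centroids = [(i % 10, (i // 10) % 10, i // 100) for i in range(1000)]
regions = kd_decompose(centroids, depth=3)
sizes = [len(region) for region in regions]
print(sizes)
```

With a depth of 3 the sketch yields eight regions holding 125 cells each, which is the "approximately equal number of mesh elements" property in miniature.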

Load Balancing Challenges

Sorting the data is computationally prohibitive.

Use median finding algorithm (Select).

Parallel implementation of Select is straightforward, although a scalable implementation took some work.

Parallel K-d tree build: Sub-groups of processors build sub-trees for sub-regions of the volume.

Bookkeeping (VTK data structures, attributes, ghost cells, etc).
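A serial version of median finding by selection, as named in the list above, might look like the following. It is a generic quickselect sketch, not the D3 implementation; the function name `select` and the fixed random seed are illustrative choices.

```python
import random


def select(values, k, rng=None):
    """Return the k-th smallest element (0-based) without a full sort.

    Each pass partitions around a random pivot and keeps only the part
    that can still contain the answer, giving expected linear time.
    """
    rng = rng or random.Random(0)
    items = list(values)
    while True:
        pivot = items[rng.randrange(len(items))]
        lower = [v for v in items if v < pivot]
        equal = [v for v in items if v == pivot]
        if k < len(lower):
            items = lower
        elif k < len(lower) + len(equal):
            return pivot
        else:
            k -= len(lower) + len(equal)
            items = [v for v in items if v > pivot]


coords = [7.5, 1.0, 3.2, 9.9, 4.4, 0.3, 6.1]
median = select(coords, len(coords) // 2)
print(median)
```

In a distributed setting the per-pass counts would presumably be summed across processors before choosing which part to keep; that reduction is not shown here.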

D3 Features

K-d Tree Build
Depth of k-d tree
Ask for shafts, slabs, or slices instead of blocks

Distribution
Assign spatially contiguous regions. (Normal mode)
Assign regions in a round robin fashion.
Assign regions to minimize data movement. *
Assign regions to processors according to a user supplied mapping.
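Three of the distribution modes listed above can be sketched as simple region-to-processor maps. The function names are hypothetical, and the contiguous mode assumes region ids are numbered so that consecutive ids are spatially adjacent (for example, k-d tree leaf order); neither detail comes from the slides.

```python
# Hypothetical region-to-processor assignment helpers (not the D3 API).
# Each function returns a list where entry r is the processor for region r.


def assign_contiguous(num_regions, num_procs):
    """Give each processor a block of consecutive region ids."""
    return [r * num_procs // num_regions for r in range(num_regions)]


def assign_round_robin(num_regions, num_procs):
    """Deal regions out to processors in turn."""
    return [r % num_procs for r in range(num_regions)]


def assign_user_mapping(num_regions, mapping):
    """Follow a user supplied region -> processor mapping."""
    return [mapping[r] for r in range(num_regions)]


contiguous = assign_contiguous(8, 4)
round_robin = assign_round_robin(8, 4)
custom = assign_user_mapping(4, {0: 3, 1: 3, 2: 0, 3: 1})
print(contiguous, round_robin, custom)
```

The "minimize data movement" mode is omitted because it depends on where the data already resides, which the slides do not describe.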

D3 Features

Output Options
vtkUnstructuredGrid (maybe others later)
Ghost level - include ghost cells bordering spatial region.
Omit duplicate points (reading 500 disk files).

Input Options (M x N)
M > N : 512 files with 16 Visualization nodes (Typical).
M < N : 16 files and 128 Vis nodes (more tricky).
Out of core option? Perhaps our ongoing VTK work with OGI (Claudio Silva) will provide that functionality.

D3 Pictures

Dataset info: 241,600 cells, 40 files. Contiguous spatial regions assigned to processors.

D3 Pictures

Round Robin region assignment

Scalable Rendering Technical Challenges

Use of Commodity Components
Desktop PCs
Gamers’ graphics cards

Graphics Cluster Throughput
Load Balancing
Effective use of aggregate performance
Networking issues

Tiled Display
Driving virtual display from PC cluster

Scalable Rendering Metrics

Focus is on large data.

Algorithms must adhere to the following criteria:
Excellent load balancing.
Scale to large number of nodes.
Insensitive to data size (fixed overhead).

Willing to trade off frame rates to enable large data rendering.

Image Composition for Tiles ?!?

Yes, it’s Crazy…but is it Crazy enough?

Questions:
Why use image compositing for a 62 Million pixel display?
Wouldn’t sort first be the obvious choice?

Constraints:
Each graphics adapter renders a 1280x1024 image
Target data is huge so…

Data must remain stationary (no network per frame).
No data replication.

1/N of the data on N computers.
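The 62 million pixel figure is consistent with the 48-projector wall described earlier in the deck. The check below assumes each projector shows one full 1280x1024 tile with no overlap between tiles, which the slides do not state.

```python
# Pixel budget of the tiled display.
# Assumption: 48 non-overlapping tiles, one 1280x1024 image per projector.
TILE_WIDTH = 1280
TILE_HEIGHT = 1024
NUM_TILES = 48  # three 16-projector arrays

pixels_per_tile = TILE_WIDTH * TILE_HEIGHT
total_pixels = pixels_per_tile * NUM_TILES
print(f"{pixels_per_tile} pixels per tile, {total_pixels} pixels in total")
```

That gives 62,914,560 pixels, so each node's 1280x1024 render covers about one forty-eighth of the wall.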

Data Transfer vs. Image Transfer

[Plot: network usage per frame (MB, 0 to 1000) versus input geometry size (GB, 0 to 12) for Sort-First, Sort-First (with 90% cache), and ICE-T. Annotations: Cross Over, ASCI-Sized Data.]

Virtual Trees Strategy

Performance Boosters
Inherently ‘geometry’ load balanced.

Composition responsibilities can be automatically adjusted based on the ‘load’ of the node.
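For readers unfamiliar with image compositing, here is a generic sketch of sort-last depth compositing reduced pairwise in a tree. It is not the ICE-T virtual trees strategy itself, whose details the slides do not give; the data layout (a list of `(depth, color)` pairs per image) and the function names are assumptions.

```python
# Generic sort-last depth compositing, reduced pairwise like a binary tree.
# Illustration only; this is not the ICE-T algorithm.
FAR = float("inf")
BG = (FAR, "bg")  # assumed marker for a pixel no geometry touched


def composite(a, b):
    """Merge two images pixel by pixel, keeping the nearer fragment."""
    return [pa if pa[0] <= pb[0] else pb for pa, pb in zip(a, b)]


def tree_composite(images):
    """Combine images level by level until one remains."""
    level = list(images)
    while len(level) > 1:
        merged = [composite(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            merged.append(level[-1])  # odd image out advances unchanged
        level = merged
    return level[0]


# Three nodes, each holding 1/3 of the geometry, rendering a 4-pixel image.
node0 = [(0.5, "red"), BG, BG, (0.9, "red")]
node1 = [BG, (0.3, "green"), BG, (0.2, "green")]
node2 = [(0.7, "blue"), (0.1, "blue"), BG, BG]
final = tree_composite([node0, node1, node2])
colors = [color for _, color in final]
print(colors)
```

Only images cross the network in this scheme, never geometry, which matches the "data must remain stationary" constraint.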

Active Pixel Encoding

Fast encoding.
– Three operations per pixel.

Free decoding. Faster depth compare.

Effective compression.
– Encoded 1/5 full image at beginning.

Good worst case behavior.
– Encoded image can only grow a few bytes.
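The slides do not show the actual encoding format, so the following is only a generic run-length sketch of the idea: store a count for each run of untouched (inactive) pixels and keep active pixels verbatim. The background depth value and the `(inactive_count, active_pixels)` layout are assumptions.

```python
# Generic run-length sketch of "active pixel" encoding (format assumed).
BACKGROUND_DEPTH = 1.0  # assumed depth value of a pixel nothing was drawn to


def encode(pixels):
    """Encode (color, depth) pixels as (inactive_count, active_pixels) runs."""
    runs = []
    inactive = 0
    active = []
    for pixel in pixels:
        if pixel[1] >= BACKGROUND_DEPTH:
            if active:  # an inactive pixel closes the current run
                runs.append((inactive, active))
                inactive, active = 0, []
            inactive += 1
        else:
            active.append(pixel)
    runs.append((inactive, active))
    return runs


def decode(runs, background):
    """Expand runs back into a full pixel list."""
    out = []
    for inactive, active in runs:
        out.extend([background] * inactive)
        out.extend(active)
    return out


bg = ("bg", BACKGROUND_DEPTH)
image = [bg, bg, ("red", 0.4), ("red", 0.5), bg, ("blue", 0.2), bg, bg]
runs = encode(image)
stored = sum(len(active) for _, active in runs)
print(runs, stored)
```

In this example only 3 of the 8 pixels are stored in full, and a fully active image would cost one extra run header, which mirrors the "can only grow a few bytes" worst case.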

ICE-T Results

Good load balancing characteristics

Performs well on large datasets

Excellent Scalability

VTK Rendering results
Over 1 Billion triangles/sec for desktop delivery (128 node cluster).
60 million triangles/sec on 60 Mpixel display (64 node cluster).

“Scalable Rendering on PC Clusters,” Computer Graphics and Applications, Large Data Visualization Issue, July/Aug 2001.
“Sort-Last Tiled Rendering for Viewing Extremely Large Datasets on Tiled Displays,” IEEE Parallel and Large Data Visualization and Graphics, San Diego, CA, 2001.

Leveraging GPUs

Why implement algorithms on GPUs?

GPU performance increases have consistently outpaced Moore’s Law.

GPUs are cheap.

Balance computations between the CPU and GPU.

Marketing numbers on GeForce 4 claim 1.2 TeraOp/sec.

Leveraging GPUs

Non visualization algorithms on GPUs?

Using the GPU as a ‘co-processor’.

BLAS library: Dense-Dense Matrix Multiply (DGEMM).

FFT calculations

“FFT on a GPU,” SIGGRAPH/Eurographics Workshop on Graphics Hardware 2003.

GAToR

Implements the ‘Projected Tetrahedra’ algorithm (Shirley & Tuchman) within the micro code of an NVidia GPU.

Moves all of the following functions from the CPU to the GPU.

Transform to screen space.
Determine projection class.
Calculate thick vertex location.
Determine depth at thick vertex.
Compute color and opacity for thick vertex.
Apply exponential attenuation texture.
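The last step in the list above can be illustrated on the CPU. Exponential attenuation through a cell of thickness l with extinction coefficient tau is commonly modelled as alpha = 1 - exp(-tau * l). The sketch tabulates that curve the way a 1D lookup texture might; the table size, range, and function names are assumptions, not details taken from GAToR.

```python
import math


def opacity(tau, thickness):
    """Opacity of a ray segment: alpha = 1 - exp(-tau * thickness)."""
    return 1.0 - math.exp(-tau * thickness)


def build_attenuation_table(size, max_optical_depth):
    """Sample 1 - exp(-x) at `size` evenly spaced optical depths x."""
    step = max_optical_depth / (size - 1)
    return [1.0 - math.exp(-i * step) for i in range(size)]


def lookup(table, max_optical_depth, tau, thickness):
    """Nearest-sample table lookup, clamped to the tabulated range."""
    x = min(tau * thickness, max_optical_depth)
    index = round(x / max_optical_depth * (len(table) - 1))
    return table[index]


table = build_attenuation_table(size=256, max_optical_depth=8.0)
exact = opacity(2.0, 1.0)
approx = lookup(table, 8.0, 2.0, 1.0)
print(f"exact={exact:.4f} table={approx:.4f}")
```

Tabulating the exponential replaces a per-vertex `exp` with a single fetch, which is the usual reason for using a texture for this term.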

Gpu Accelerated Tetrahedral Renderer

[Pipeline diagram: model with millions of cells, then Visibility Sort on the PC (CPU), then on the graphics card, for each cell in order: compute cell’s screen projection, decompose to triangles, find thickest cell distance, compute each triangle’s parameters; the output is the final image of model. Legend: Software, Programmable, Hardware.]

GPU: Computational Resource

[Figure labels: CPU, GPU, Cell Contribution]

Vertex Program Constraints

Each instance of a vertex shader program works independently on a single vertex in SIMD fashion.

No support for dynamic vertex creation or topology modification within the vertex program.

No branching (at the time…now supported)

No knowledge of neighboring vertices.

Cannot change execution based on past information.

Constraints seem insurmountable but we devised some very clever workarounds!

Clever constraint work-around

Programmable vertex shaders do not support dynamic vertex creation or topology modification within the vertex program.

[Figure: Basis Graph with vertices V0’ to V4’, isomorphic to all projection cases]

Isomorphic Property of Basis Graph

Meticulous Code Review

Just Kidding!

This shows what the code looks like.

Results: Test Datasets

Dataset Info

Dataset        Vertices    Tetrahedra
Blunt Fin        40,960       187,395
Oxygen Post     109,744       513,375
Delta Wing      211,680     1,005,675

Timings

Dataset        GPU time (Constant)   Tets/s   GPU time (Linear)   Tets/s
Blunt Fin      0.20 sec              937 K    0.38 sec            493 K
Oxygen Post    0.55 sec              933 K    1.04 sec            493 K
Delta Wing     1.07 sec              940 K    2.03 sec            495 K

Gratuitous Pictures

“Tetrahedral Projection using Vertex Shaders,” IEEE Volume Visualization, Boston, Massachusetts, 2002.

Parallel Volume Rendering

Ongoing work with OGI (Oregon Graduate Institute).

Approach: Leverage distributed data techniques for structured data and apply to unstructured data.

Structured data can be partitioned into convex sub domains.

Higher Order Elements

[Figure panels: Typical approximation, Correct]

The Interpolant

Parameters (r,s,t) specify field as a weighted sum of nodal values:

\[ \phi(r,s,t) = \sum_{i=1}^{n} N_i(r,s,t)\,\phi_i \]

Shape functions, N_i, are tensor products of Lagrange interpolants (for our example):

\[ N_i(r,s,t) = M_1(r)\,M_2(s)\,M_3(t) \]

\[ M_j(u) = \begin{cases} (1-u)(1-2u) & \text{near axis-origin node} \\ 4u(1-u) & \text{midnode} \\ u(2u-1) & \text{far node} \end{cases} \]
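The interpolant can be evaluated directly from its definition. The sketch below assumes a 27-node quadratic element with nodes at 0, 0.5, and 1 along each parametric axis, and a node ordering chosen for convenience; the node count and the ordering are assumptions, not details given on the slides.

```python
from itertools import product


def lagrange_1d(u):
    """Quadratic Lagrange basis on [0, 1]: near, mid, and far node."""
    return ((1 - u) * (1 - 2 * u), 4 * u * (1 - u), u * (2 * u - 1))


def shape_functions(r, s, t):
    """All 27 tensor-product shape functions N_i(r, s, t)."""
    mr, ms, mt = lagrange_1d(r), lagrange_1d(s), lagrange_1d(t)
    return [mr[a] * ms[b] * mt[c] for a, b, c in product(range(3), repeat=3)]


def interpolate(nodal_values, r, s, t):
    """phi(r, s, t) = sum_i N_i(r, s, t) * phi_i."""
    weights = shape_functions(r, s, t)
    return sum(n * phi for n, phi in zip(weights, nodal_values))


# Nodes listed in the same order as shape_functions (assumed ordering).
NODES = (0.0, 0.5, 1.0)
node_coords = [(NODES[a], NODES[b], NODES[c])
               for a, b, c in product(range(3), repeat=3)]

# Sample field that the quadratic element can represent exactly.
nodal_values = [1 + 2 * r + 3 * s * t for r, s, t in node_coords]
value = interpolate(nodal_values, 0.3, 0.7, 0.2)
print(value)
```

Because the sample field lies in the span of the shape functions, the interpolated value matches 1 + 2(0.3) + 3(0.7)(0.2) = 2.02 up to rounding, and the shape functions sum to one at every point.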

Rendering The Geometry

OpenGL Tessellators
– Overkill for small (screen-space) elements

Adaptive Triangulation
– Fast (Velho et al., Chung et al.)

Results

“Rendering Higher Order Finite Element Surfaces in Hardware,” Proceedings of GRAPHITE, Melbourne Australia, February 2003.

[Figure panels: Unlit, Lit]

Advanced rendering for Sci Vis

Sandia Project Percept
Utilize GPUs to provide realistic perceptual cues to general scientific visualizations. Perceptual cues include shadows, reflections, refractions, depth perception, and realistic lighting models.

Phase 1: vtkShadowRenderer is an easy to use module that provides real time shadows for VTK applications.

VTK Constraints

Shadow techniques are focused on Games
• Often preprocess the geometry
• No self shadowing
• Small amounts of geometry

Applying shadows to scientific vis (VTK)
• Needs to work with VTK
• No access to geometry
• Needs to be small ‘footprint’
• Must have ‘self’ shadow
• Must work on large amounts of geometry
• Shadow Mapping best option for these constraints
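Shadow mapping fits these constraints because it needs only a depth image rendered from the light, not access to the geometry itself. The sketch below is a CPU illustration of the two-pass idea, not vtkShadowRenderer: it assumes a directional light shining straight down the z axis, sample points with x and y in [0, 1), and a small depth bias, all chosen for illustration.

```python
# CPU illustration of two-pass shadow mapping (not vtkShadowRenderer).
# Assumptions: light shines straight down -z from height `light_height`;
# sample points have x, y in [0, 1).


def build_shadow_map(points, resolution, light_height):
    """Pass 1: per texel, record the depth of the point nearest the light."""
    depth_map = [[float("inf")] * resolution for _ in range(resolution)]
    for x, y, z in points:
        i, j = int(x * resolution), int(y * resolution)
        depth = light_height - z
        if depth < depth_map[j][i]:
            depth_map[j][i] = depth
    return depth_map


def in_shadow(point, depth_map, light_height, bias=1e-3):
    """Pass 2: a point is shadowed if something nearer the light covers it."""
    x, y, z = point
    resolution = len(depth_map)
    i, j = int(x * resolution), int(y * resolution)
    return (light_height - z) > depth_map[j][i] + bias


occluder = (0.5, 0.5, 2.0)
floor_under = (0.5, 0.5, 0.0)
floor_open = (0.1, 0.1, 0.0)
scene = [occluder, floor_under, floor_open]

shadow_map = build_shadow_map(scene, resolution=8, light_height=10.0)
results = [in_shadow(p, shadow_map, 10.0) for p in scene]
print(results)
```

The same depth comparison applies to every surface, including the one that cast the shadow, which is how the technique provides ‘self’ shadowing; the bias keeps a surface from shadowing itself through rounding error.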

Demo

Pictures

Results

• Works with unmodified VTK
• Really simple to add to VTK app

//vtkRenderer *ren = vtkRenderer::New();
vtkShadowRenderer *ren = vtkShadowRenderer::New();

• Next release of ParaView will include shadow option

• Performance
• 4 datasets tested, avg 1.7x slowdown

• How can this be less than 2x???
• Only writing to depth buffer
• No lighting

• Using glCopyTexSubImage2D(GL_TEXTURE_RECTANGLE_NV, …);
• Very fast! (cow at 30 Hz no problem)

Performance Sidebar

Overclockers on GPUs

Overclockers on CPUs

Overclock Yourself

Recent Publications (Recap)

“Scalable Rendering on PC Clusters,” Computer Graphics and Applications, Large Data Visualization Issue, July/Aug 2001.

“Sort-Last Tiled Rendering for Viewing Extremely Large Datasets on Tiled Displays,” IEEE Parallel and Large Data Visualization and Graphics, San Diego, CA, 2001.

“Tetrahedral Projection using Vertex Shaders,” IEEE Volume Visualization, Boston, Massachusetts, 2002.

“Rendering Higher Order Finite Element Surfaces in Hardware,” Proceedings of GRAPHITE, Melbourne Australia, Feb 2003.

“Cluster to Wall with VTK,” Parallel and Large Data Volume Graphics, 2003.

“FFT on a GPU,” SIGGRAPH/Eurographics Workshop on Graphics Hardware 2003.

END
