Large Scale Visualization using PC Clusters
ClusterWorld 2003
Brian Wylie, Sandia National Laboratories
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.
Outline
Hardware Platform
Software Platform
Data Distribution
Parallel Rendering
Parallel Volume Rendering
Other Techniques
Hardware Platform
Visualization Clusters
Wilson (Classified)
• 64 nodes
• 800 MHz P3 CPU, GeForce3 cards
• Myrinet 2000 interconnect
Europa (Unclassified)
• 128/256 Dell workstations
• Dual 2.0 GHz P4 Xeon, GeForce3 cards
• Myrinet 2000 interconnect
• 1.27 TFLOP on Linpack, #32 on Top 500
Wilson
Europa
DLP Projector Alignment, March 2, 2002
‘Blue Girl’ Color Test Image: First 48-Projector Alignment (March 2, 2002)
LLNL PPM Dataset: First 48-tile Rendering, 470M Polygons w/ 64 nodes (April 5, 2002)
VIEWS Visualization Corridor
Display Screens
• Seamless 12m x 3m RP
• 3 sections - each 4x4 array
• 0.5” Stewart FilmScreen AeroGlass100
Projection Systems
• 3 16-projector arrays
• Primary projectors are DPI 1280x1024, 3500 Lumen, 3-chip DLP (DMD)
Data and Visualization “Corridor” metaphor:
A wide path through which massive quantities of data can easily flow, and through which scientists and engineers can explore data and collaborate.
Software Platform
Use the Visualization Toolkit (VTK) as a framework.
Only replace small parts of the framework with our research codes; the rest is tried and tested by a whole community.
VTK is criticized as being too slow… If necessary we will rewrite the slow parts.
Example: Poly Data Mappers
New classes perform significantly better:
• Standard VTK mapper: 4.5 Mtri/sec.
• New generic 3D accelerated mapper: 9.5 Mtri/sec.
• New nVidia accelerated mapper: 20 Mtri/sec.
New mappers available in ParaView now (through factory method and dynamic loading).
More accelerated modules to become part of the supercharged “GT” component project.
VTK Contributions
3459344 Lee Ann Fisk. Crimes: D3, Possession with intent to Distribute
9345325 Brian Wylie. Crimes: Shadow rendering, Racketeering
9023842 Gary Templet. Crimes: vtkSNL Build System, Repository Maintenance, Aiding and Abetting
8934592 David Thompson. Crimes: HALO, Chromium, Laundering
0235098 Kenneth Moreland. Crimes: Rendering – Serial, Parallel, Volumetric; Public Indecency
The Sandia Vis Gang
VTK Info
As of the current release (VTK 4.2) there are approximately…
107 classes that read and write data.
317 classes that filter datasets. (Isosurface, Streamlines, etc)
145 classes that “map” the dataset into an image.
50 classes that support parallel processing
Distributed Parallelism
Parallel Processing with VTK
Data Size
Visible Woman CT Data: 870 MBytes, 1734 slices at 512x512x2
Bell-Boeing V-22 tiltrotor: 140 GBytes
• Flow Computation: Robert Meakin
• Visualization: David Kenwright and David Lane
• Numerical Aerodynamic Simulation Division at NASA Ames Research Center
Other Large Data
Modeling turbulence (Ken Jansen, RPI)
• 8.5 million tetrahedra, 200 time steps
• 150 million tetrahedra, 2000 time steps (soon)
VTK Weaknesses
• Parallel Data Distribution (inefficient, poor load balancing, etc.)
• Rendering (Serial and Parallel) (really not good all the way around)
[Diagram: four parallel pipelines, each running Data Readers (not smart) → D3 → Full VTK Pipeline → Optimized Renderer → ICE-T / Chromium.]
D3 is the backbone of our parallel VTK architecture.
ICE-T and Chromium based parallel VTK rendering.
Software
Load Balancing
• D3 (Distributed Data Decomposition)
Scalable Rendering
• Image Compositing Engine for Tiles (ICE-T)
Unstructured Volume Rendering
• Gpu Accelerated Tetrahedral Rendering (GAToR)
• Upcoming Parallel Work
Other Techniques
• Higher Order Elements
• Real Time Shadows
Load Balancing
D3 (Distributed Data Decomposition)
Spatial decomposition based on K-d tree
Spatial regions contain approximately equal number of mesh elements.
Fast execution for tree queries.
Axis aligned nature of boundaries can accelerate processing for some visualization algorithms.
Load Balancing Challenges
Sorting the data is computationally prohibitive.
Use median finding algorithm (Select).
Parallel implementation of Select is straightforward, although a scalable implementation took some work.
Parallel K-d tree build: Sub-groups of processors build sub-trees for sub-regions of the volume.
Bookkeeping (VTK data structures, attributes, ghost cells, etc.).
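The "select instead of sort" idea can be sketched serially as follows. This is only an illustration, not D3's code: `Point`, `kdSplit`, and `buildRegions` are hypothetical names, cells are represented by their centroids, and `std::nth_element` stands in for the Select step. Each split places the median along one axis and partitions the rest around it without fully sorting, so every leaf region ends up with a near-equal share of the elements.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Point = std::array<double, 3>;  // cell centroid (x, y, z); an assumption of this sketch

// Split order[first, last) into 2^depth regions by repeated median selection.
// `order` is permuted in place; the end offset of each leaf is appended to `bounds`.
void kdSplit(const std::vector<Point>& pts, std::vector<std::size_t>& order,
             std::size_t first, std::size_t last, int depth, int axis,
             std::vector<std::size_t>& bounds)
{
    if (depth == 0) {
        bounds.push_back(last);  // close this leaf region
        return;
    }
    const std::size_t mid = first + (last - first) / 2;
    // Selection, not sorting: only the median is placed, the rest is partitioned around it.
    std::nth_element(order.begin() + first, order.begin() + mid, order.begin() + last,
                     [&](std::size_t a, std::size_t b) { return pts[a][axis] < pts[b][axis]; });
    const int next = (axis + 1) % 3;  // cycle through x, y, z
    kdSplit(pts, order, first, mid, depth - 1, next, bounds);
    kdSplit(pts, order, mid, last, depth - 1, next, bounds);
}

// Returns leaf offsets b; region k holds the points order[b[k]] .. order[b[k + 1] - 1].
std::vector<std::size_t> buildRegions(const std::vector<Point>& pts,
                                      std::vector<std::size_t>& order, int depth)
{
    assert(depth >= 0);
    order.resize(pts.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::vector<std::size_t> bounds{0};
    kdSplit(pts, order, 0, pts.size(), depth, 0, bounds);
    return bounds;
}
```

Calling `buildRegions(pts, order, 2)` on 16 points yields four regions of four points each. The parallel build described above would distribute both the selection and the sub-tree construction across processor sub-groups, which this serial sketch does not attempt.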
D3 Features
K-d Tree Build
• Depth of k-d tree
• Ask for shafts, slabs, or slices instead of blocks
Distribution
• Assign spatially contiguous regions. (Normal mode)
• Assign regions in a round robin fashion.
• Assign regions to minimize data movement. *
• Assign regions to processors according to a user supplied mapping.
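The first two distribution modes can be illustrated with a small sketch. The function names are hypothetical, and the sketch assumes leaf regions are numbered so that neighbouring ids are spatially adjacent (for example, in k-d tree leaf order); D3's real interface may differ.

```cpp
#include <cassert>
#include <vector>

// Contiguous mode: consecutive region ids, assumed spatially adjacent, share a processor.
std::vector<int> assignContiguous(int numRegions, int numProcs)
{
    assert(numRegions > 0 && numProcs > 0);
    std::vector<int> owner(numRegions);
    for (int r = 0; r < numRegions; ++r)
        owner[r] = static_cast<int>((static_cast<long long>(r) * numProcs) / numRegions);
    return owner;
}

// Round-robin mode: consecutive region ids are dealt out to processors in turn.
std::vector<int> assignRoundRobin(int numRegions, int numProcs)
{
    assert(numRegions > 0 && numProcs > 0);
    std::vector<int> owner(numRegions);
    for (int r = 0; r < numRegions; ++r)
        owner[r] = r % numProcs;
    return owner;
}
```

With 8 regions and 4 processors, contiguous assignment gives owners 0,0,1,1,2,2,3,3 and round robin gives 0,1,2,3,0,1,2,3. A user supplied mapping is then simply a caller-provided vector of the same shape.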
D3 Features
Output Options
• vtkUnstructuredGrid (maybe others later)
• Ghost level - include ghost cells bordering spatial region.
• Omit duplicate points (reading 500 disk files).
Input Options (M x N)
• M > N : 512 files with 16 Visualization nodes (Typical).
• M < N : 16 files and 128 Vis nodes (more tricky).
• Out of core option? Perhaps our ongoing VTK work with OGI (Claudio Silva) will provide that functionality.
D3 Pictures
Dataset info: 241,600 cells, 40 files.
• Contiguous spatial regions assigned to processors.
• Round Robin region assignment
Scalable Rendering Technical Challenges
Use of Commodity Components
• Desktop PCs
• Gamers’ graphics cards
Graphics Cluster Throughput
• Load Balancing
• Effective use of aggregate performance
• Networking issues
Tiled Display
• Driving virtual display from PC cluster
Scalable Rendering Metrics
Focus is on large data.
Algorithms must adhere to the following criteria:
• Excellent load balancing.
• Scale to large number of nodes.
• Insensitive to data size (fixed overhead).
Willing to trade off frame rates to enable large data rendering.
Image Composition for Tiles ?!?
Yes, it’s Crazy…but is it Crazy enough?
Questions:
• Why use image compositing for a 62 Million pixel display?
• Wouldn’t sort first be the obvious choice?
Constraints:
• Each graphics adapter renders a 1280x1024 image
• Target data is huge so…
  – Data must remain stationary (no network per frame).
  – No data replication.
  – 1/N of the data on N computers.
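When the data stays put and every node renders only its own 1/N of the geometry, the partial images have to be merged by depth. The sketch below shows the basic per-pixel merge for one tile; `Image` and `depthComposite` are illustrative names and this is not ICE-T's implementation. For scale, 48 tiles of 1280x1024 come to roughly 62.9 million pixels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Image {
    std::vector<std::uint32_t> color;  // packed RGBA per pixel
    std::vector<float> depth;          // z-buffer value per pixel, smaller is nearer
};

// Merge `src` into `dst`, keeping the nearer fragment at each pixel.
void depthComposite(Image& dst, const Image& src)
{
    assert(dst.color.size() == dst.depth.size());
    assert(src.color.size() == src.depth.size());
    assert(dst.depth.size() == src.depth.size());
    for (std::size_t i = 0; i < dst.depth.size(); ++i) {
        if (src.depth[i] < dst.depth[i]) {
            dst.depth[i] = src.depth[i];
            dst.color[i] = src.color[i];
        }
    }
}
```

The network cost of this step depends on the image size rather than on how much geometry each node rendered, which is the property the constraints above are after.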
Cross Over: Data Transfer vs. Image Transfer
[Chart: Network Usage/Frame (MB), 0 to 1000, against Input Geometry Size (GB), 0 to 12, for Sort-First, Sort-First (with 90% cache), and ICE-T; ASCI-sized data marked.]
Virtual Trees Strategy
Performance Boosters
• Inherently ‘geometry’ load balanced.
• Composition responsibilities can be automatically adjusted based on the ‘load’ of the node.
Active Pixel Encoding
• Fast encoding.
  – Three operations per pixel.
• Free decoding. Faster depth compare.
• Effective compression.
  – Encoded 1/5 full image at beginning.
• Good worst case behavior.
  – Encoded image can only grow a few bytes.
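One way to picture active pixel encoding is as run-length encoding of the background: store how many untouched pixels to skip, then how many rendered pixels follow. The sketch below is an assumption about the general idea, not ICE-T's actual format; `Pixel`, `Run`, `SparseImage`, and the background depth value are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Pixel { std::uint32_t color; float depth; };
struct Run { std::uint32_t inactive; std::uint32_t active; };  // pixels skipped, then pixels stored
struct SparseImage {
    std::vector<Run> runs;
    std::vector<Pixel> pixels;  // active pixels only, in scan order
};

// Assumption: a cleared z-buffer value marks "nothing was rendered here".
constexpr float kBackgroundDepth = 1.0f;

SparseImage encode(const std::vector<Pixel>& full)
{
    SparseImage out;
    std::size_t i = 0;
    while (i < full.size()) {
        Run run{0, 0};
        while (i < full.size() && full[i].depth >= kBackgroundDepth) { ++run.inactive; ++i; }
        while (i < full.size() && full[i].depth < kBackgroundDepth) {
            out.pixels.push_back(full[i]);
            ++run.active;
            ++i;
        }
        out.runs.push_back(run);
    }
    return out;
}

std::vector<Pixel> decode(const SparseImage& img, std::size_t numPixels)
{
    std::vector<Pixel> full(numPixels, Pixel{0u, kBackgroundDepth});
    std::size_t dst = 0, src = 0;
    for (const Run& run : img.runs) {
        dst += run.inactive;  // background pixels are already in place
        for (std::uint32_t k = 0; k < run.active; ++k) full[dst++] = img.pixels[src++];
    }
    assert(dst == numPixels);
    return full;
}
```

With this layout a compositing step can skip whole inactive runs without a per-pixel depth test, which is one plausible reading of "free decoding" and "faster depth compare".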
ICE-T Results
Good load balancing characteristics
Performs well on large datasets
Excellent Scalability
VTK Rendering results
• Over 1 Billion triangles/sec for desktop delivery (128 node cluster).
• 60 million triangles/sec on 60 Mpixel display (64 node cluster).
“Scalable Rendering on PC Clusters”, Computer Graphics and Applications, Large Data Visualization Issue, July/Aug 2001.
“Sort-Last Tiled Rendering for Viewing Extremely Large Datasets on Tiled Displays”, IEEE Parallel and Large Data Visualization and Graphics, San Diego, CA, 2001.
Leveraging GPUs
Why implement algorithms on GPUs?
• GPU performance increases have consistently outpaced Moore’s Law.
• GPUs are cheap.
• Balance computations between the CPU and GPU.
• Marketing numbers on GeForce 4 claim 1.2 TeraOp/sec.
Leveraging GPUs
Non visualization algorithms on GPUs?
• Using the GPU as a ‘co-processor’.
• BLAS library: Dense-Dense Matrix Multiply (DGEMM).
• FFT calculations
“FFT on a GPU”, SIGGRAPH/Eurographics Workshop on Graphics Hardware 2003.
GAToR
Implements the ‘Projected Tetrahedra’ algorithm (Shirley & Tuchman) within the micro code of an NVidia GPU.
Moves all of the following functions from the CPU to the GPU:
• Transform to screen space.
• Determine projection class.
• Calculate thick vertex location.
• Determine depth at thick vertex.
• Compute color and opacity for thick vertex.
• Apply exponential attenuation texture.
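The opacity at the thick vertex is commonly modelled with exponential attenuation, alpha = 1 - exp(-tau * l), where tau is an extinction coefficient and l is the cell thickness along the view ray. The sketch below shows that formula and a precomputed table standing in for the attenuation texture. It runs on the CPU, the function names are hypothetical, and the exact parameterisation used in GAToR is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Opacity accumulated over a ray segment of length `thickness` through a
// medium with constant extinction coefficient `extinction`.
double thickVertexOpacity(double extinction, double thickness)
{
    assert(extinction >= 0.0 && thickness >= 0.0);
    return 1.0 - std::exp(-extinction * thickness);
}

// The same curve as a 1-D table, a CPU analogue of a lookup texture indexed
// by optical depth (extinction * thickness) in [0, maxOpticalDepth].
std::vector<float> buildAttenuationTable(std::size_t size, double maxOpticalDepth)
{
    assert(size >= 2 && maxOpticalDepth > 0.0);
    std::vector<float> table(size);
    for (std::size_t i = 0; i < size; ++i) {
        const double opticalDepth =
            maxOpticalDepth * static_cast<double>(i) / static_cast<double>(size - 1);
        table[i] = static_cast<float>(1.0 - std::exp(-opticalDepth));
    }
    return table;
}
```

Storing the curve as a texture lets the graphics hardware interpolate opacity across each triangle, rather than evaluating the exponential per pixel in a shader that may not support it.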
Gpu Accelerated Tetrahedral Renderer
[Diagram: model with millions of cells → Visibility Sort → for each cell in order: compute cell’s screen projection, decompose to triangles, find thickest cell distance, compute each triangle’s parameters → final image of model. Labels: PC (CPU), graphics card; Software, Programmable, Hardware.]
GPU: Computational Resource
[Diagram labels: CPU, GPU, Cell Contribution]
Vertex Program Constraints
Each instance of a vertex shader program works independently on a single vertex in SIMD fashion.
No support for dynamic vertex creation or topology modification within the vertex program.
No branching (at the time…now supported)
No knowledge of neighboring vertices.
Cannot change execution based on past information.
Constraints seem insurmountable but we devised some very clever workarounds!
Clever constraint work-around
Programmable vertex shaders do not support dynamic vertex creation or topology modification within the vertex program.
Basis Graph: Isomorphic to all projection cases
[Figure: basis graph with vertices V0’, V1’, V2’, V3’, V4’]
Isomorphic Property of Basis Graph
Meticulous Code Review
Just Kidding!
This shows what the code looks like.
Results: Test Datasets
Dataset Info
Dataset       Vertices   Tetrahedra
Blunt Fin     40,960     187,395
Oxygen Post   109,744    513,375
Delta Wing    211,680    1,005,675
Timings
Dataset       GPU time (Constant)   Tets/s   GPU time (Linear)   Tets/s
Blunt Fin     0.20 sec              937 K    0.38 sec            493 K
Oxygen Post   0.55 sec              933 K    1.04 sec            493 K
Delta Wing    1.07 sec              940 K    2.03 sec            495 K
Gratuitous Pictures
“Tetrahedral Projection using Vertex Shaders”, IEEE Volume Visualization, Boston, Massachusetts, 2002.
Parallel Volume Rendering
Ongoing work with OGI (Oregon Graduate Institute).
Approach: Leverage distributed data techniques for structured data and apply to unstructured data.
Structured data can be partitioned into convex sub domains.
Higher Order Elements
[Figure: Typical approximation vs. Correct]
The Interpolant
Parameters (r,s,t) specify field as a weighted sum of nodal values:
\[ \phi(r,s,t) = \sum_{i=1}^{n} N_i(r,s,t)\,\phi_i \]
Shape functions, N_i, are tensor products of Lagrange interpolants (for our example):
\[ N_i(r,s,t) = M_1(r)\,M_2(s)\,M_3(t) \]
\[ M_j(u) = \begin{cases} (1-u)(1-2u) & \text{near-origin node} \\ 4u(1-u) & \text{mid-axis node} \\ u(2u-1) & \text{far node} \end{cases} \]
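A minimal sketch of evaluating such an interpolant, assuming a 27-node triquadratic hexahedron with parametric coordinates in [0, 1] and the quadratic Lagrange interpolants (1-u)(1-2u), 4u(1-u), and u(2u-1) for the near-origin, mid-axis, and far nodes. The element type, the `nodal[i][j][k]` indexing, and the function names are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>

// 1-D quadratic Lagrange interpolants on u in [0, 1].
// j = 0: near-origin node (u = 0), j = 1: mid-axis node (u = 0.5), j = 2: far node (u = 1).
double lagrange(int j, double u)
{
    assert(j >= 0 && j <= 2);
    switch (j) {
        case 0:  return (1.0 - u) * (1.0 - 2.0 * u);
        case 1:  return 4.0 * u * (1.0 - u);
        default: return u * (2.0 * u - 1.0);
    }
}

// nodal[i][j][k] is the field value at the node with index i along r, j along s, k along t.
using NodalValues = std::array<std::array<std::array<double, 3>, 3>, 3>;

// Field value at (r, s, t): sum over nodes of the tensor-product shape function times the nodal value.
double interpolate(const NodalValues& nodal, double r, double s, double t)
{
    double phi = 0.0;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                phi += lagrange(i, r) * lagrange(j, s) * lagrange(k, t) * nodal[i][j][k];
    return phi;
}
```

The three interpolants sum to 1 for any u, so a constant nodal field is reproduced exactly, and each shape function equals 1 at its own node and 0 at the others.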
Rendering The Geometry
OpenGL Tessellators
– Overkill for small (screen-space) elements
Adaptive Triangulation
– Fast (Velho et al., Chung et al.)
Results
“Rendering Higher Order Finite Element Surfaces in Hardware”, Proceedings of GRAPHITE, Melbourne Australia, February 2003.
Unlit
Lit
Advanced rendering for Sci Vis
Sandia Project Percept
Utilize GPUs to provide realistic perceptual cues to general scientific visualizations. Perceptual cues include shadows, reflections, refractions, depth perception, and realistic lighting models.
Phase 1: vtkShadowRenderer is an easy to use module that provides real time shadows for vtk applications.
VTK Constraints
Shadow techniques are focused on Games
• Often preprocess the geometry
• No self shadowing
• Small amounts of geometry
Applying shadows to scientific vis (VTK)
• Needs to work with VTK
• No access to geometry
• Needs to be small ‘footprint’
• Must have ‘self’ shadow
• Must work on large amounts of geometry
• Shadow Mapping best option for these constraints
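Shadow mapping fits these constraints because it needs only rendered depth, not the geometry itself. The CPU-side sketch below shows the two passes: record the nearest depth seen from the light, then test each fragment against it. The names, the cleared depth of 1.0, and the bias value are illustrative assumptions; vtkShadowRenderer does this on the graphics card.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ShadowMap {
    std::size_t width;
    std::size_t height;
    std::vector<float> depth;  // nearest depth seen from the light, per texel
};

// Start with every texel at the far plane (nothing between the light and infinity).
ShadowMap makeShadowMap(std::size_t width, std::size_t height)
{
    return ShadowMap{width, height, std::vector<float>(width * height, 1.0f)};
}

// Pass 1: render from the light, keeping only the nearest depth per texel.
void recordOccluder(ShadowMap& map, std::size_t x, std::size_t y, float depthFromLight)
{
    assert(x < map.width && y < map.height);
    float& texel = map.depth[y * map.width + x];
    if (depthFromLight < texel) texel = depthFromLight;
}

// Pass 2: a fragment is shadowed if something nearer to the light covers its texel.
// The small bias keeps a surface from shadowing itself through depth round-off.
bool inShadow(const ShadowMap& map, std::size_t x, std::size_t y,
              float depthFromLight, float bias = 0.005f)
{
    assert(x < map.width && y < map.height);
    return depthFromLight - bias > map.depth[y * map.width + x];
}
```

Because every object is drawn into the same map, self shadowing comes for free, and the cost grows with one extra depth-only render pass rather than with any preprocessing of the geometry.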
Demo
Pictures
Results
• Works with unmodified VTK
• Really simple to add to VTK app
    //vtkRenderer *ren = vtkRenderer::New();
    vtkShadowRenderer *ren = vtkShadowRenderer::New();
• Next release of ParaView will include shadow option
• Performance
  – 4 datasets tested, avg 1.7x slowdown
• How can this be less than 2x???
  – Only writing to depth buffer
  – No lighting
  – Using glCopyTexSubImage2D(GL_TEXTURE_RECTANGLE_NV, …);
  – Very Fast! (cow at 30 Hz no problem)
Performance Sidebar
Overclockers on GPUs
Overclockers on CPUs
Overclock Yourself
Recent Publications (Recap)
“Scalable Rendering on PC Clusters”, Computer Graphics and Applications, Large Data Visualization Issue, July/Aug 2001.
“Sort-Last Tiled Rendering for Viewing Extremely Large Datasets on Tiled Displays”, IEEE Parallel and Large Data Visualization and Graphics, San Diego, CA, 2001.
“Tetrahedral Projection using Vertex Shaders”, IEEE Volume Visualization, Boston, Massachusetts, 2002.
“Rendering Higher Order Finite Element Surfaces in Hardware”, Proceedings of GRAPHITE, Melbourne Australia, Feb 2003.
“Cluster to Wall with VTK”, Parallel and Large Data Volume Graphics, 2003.
“FFT on a GPU”, SIGGRAPH/Eurographics Workshop on Graphics Hardware 2003.
END