
Page 1

Parallel Visualization of Large Scale Vector-Fields

Presented on Tue, 13th Nov 2012 by Olga Datskova


Page 2

Problem Description: Digital Tomosynthesis

Tomosynthesis is a screening and diagnostic method used for breast cancer detection.

The principal goal of the method is to produce high-resolution 3D images of the object tissue.


Figure 1: Generic X-ray imaging concept (X-ray source, object, detector).

Figure 2: Generic tomosynthesis concept (rotating X-ray source, object, rotating detector).

Page 3

Problem Description: Digital Tomosynthesis

Commonly, only per-slice visualization is produced and used for analysis (IDL).

Currently we are working with simulated projections.

Outlook for projection output: with the current simulation and a scintillator detector we obtain a projection of about 10 MB. With a pixel detector whose pixel size decreases toward 50 µm, the output is expected to reach 8 GB.

Figure 3: Example CT output (source: http://radiology.rsna.org/content/246/3/725/F11.expansion.html).

Page 4

Hardware and Tools

For a single node on the Xanadu cluster we have 8 cores (Barcelona) and 8 GB RAM. Shared-memory parallelization is possible, but not sufficient on its own (OpenMP, pthreads); a small sketch follows at the end of this slide.

Distributed computing: 20-50 nodes. Data partitioning may be needed (OpenMPI).

Cloud environment: a virtual set of nodes. Performance must be considered; the use of specialized processing frameworks is possible (Hadoop, MapReduce).

GPU-based visualization: the same memory limitations as a single node, plus additional data-transfer overheads (OpenCL and CUDA).

Visualization tools: ParaView, IDL, VisIt.
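To make the shared-memory option above concrete, here is a minimal sketch (not part of our actual pipeline) that parallelizes a per-pixel operation on one projection with OpenMP; the projection size and the log-scaling step are assumptions chosen only for illustration.

```cpp
// Minimal sketch: shared-memory parallelization over one simulated projection
// using OpenMP on a single 8-core node. Image dimensions and the log-scaling
// step are illustrative assumptions, not taken from the actual simulation.
#include <cmath>
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int width = 4096, height = 4096;           // assumed projection size
    std::vector<float> projection(width * (size_t)height, 1.0f);

    // Each thread processes a disjoint set of rows; no synchronization needed
    // because every pixel is written exactly once.
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float &p = projection[(size_t)y * width + x];
            p = std::log1p(p);                       // e.g. dynamic-range compression
        }
    }

    std::printf("processed %dx%d projection with %d threads\n",
                width, height, omp_get_max_threads());
    return 0;
}
```

Built with, e.g., g++ -fopenmp, this saturates the 8 cores of one node but remains bound by the node's 8 GB of RAM, which is the limitation noted above.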

Page 5

Suggested Solutions (I)

The problem described in [1] is concerned with the simulation of a coolant flow. The model contains 2.95M elements and 988M grid points.

Parallelization strategies:

Figure 4 (source: http://www.mcs.anl.gov/~fischer/sem1b/).

Parallelize-over-particles: good for small data and a large number of particles (GPU); see the sketch at the end of this slide.

Parallelize-over-data: suited to large data with a consistent distribution. Still limited!

Hybrid approach: combination of distributed and shared memory parallelism concepts.

Example: master-slave strategy (next slide, [2])

Time-varying data is specifically not addressed for the hybrid approaches.
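As a concrete illustration of the parallelize-over-particles strategy referenced above, the sketch below traces each seed independently with OpenMP; the analytic velocity field and the Euler integrator are placeholders, not the method of [1].

```cpp
// Minimal sketch of the parallelize-over-particles strategy: every seed point
// is traced independently, so the loop over seeds parallelizes trivially.
// The analytic vector field and the Euler integrator stand in for the real
// (gridded) data and a proper higher-order integrator.
#include <array>
#include <cstdio>
#include <vector>
#include <omp.h>

using Vec3 = std::array<float, 3>;

// Placeholder velocity field (a simple rotation around the z-axis).
static Vec3 velocity(const Vec3 &p) {
    return { -p[1], p[0], 0.1f };
}

int main() {
    const int num_seeds = 10000, num_steps = 500;
    const float dt = 0.01f;
    std::vector<Vec3> seeds(num_seeds);
    for (int i = 0; i < num_seeds; ++i)
        seeds[i] = { 1.0f + 0.001f * i, 0.0f, 0.0f };

    std::vector<std::vector<Vec3>> streamlines(num_seeds);

    // Particles are independent: each thread advances its own subset of seeds.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < num_seeds; ++i) {
        Vec3 p = seeds[i];
        streamlines[i].reserve(num_steps);
        for (int s = 0; s < num_steps; ++s) {
            Vec3 v = velocity(p);
            p = { p[0] + dt * v[0], p[1] + dt * v[1], p[2] + dt * v[2] };
            streamlines[i].push_back(p);
        }
    }
    std::printf("traced %zu streamlines\n", streamlines.size());
    return 0;
}
```

Because each particle is independent, the same loop maps naturally to a GPU kernel, which is why this strategy is attractive when the data is small and the particle count is large.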

Page 6

Suggested Solutions (II)

In [2] the authors present three parallelization strategies for vector-field streamline computation:

Static allocation: mesh blocks are assigned to specific processors. As a streamline moves into the next block, the processor owning that block is notified.

Load on demand: evenly spaced seeds are given to each processor. When a streamline moves into the next block, that data block is loaded into a least-recently-used (LRU) cache.

Master/slave approach: the master coordinates the workload for each slave based on its queue and streamline computation. The slaves largely follow static allocation and load on demand; when they are unable to proceed further, the master reassigns the workload (a simplified sketch follows below).

Figure 5: Slave process (page 6 of [2]).

Figure 6: Master assignments (page 6 of [2]).
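The following is a greatly simplified work-queue sketch of the master/slave coordination described above, not the implementation from [2]: there is no block loading, no LRU cache, and no streamline hand-off between processes, only the request/assign/terminate message pattern in MPI.

```cpp
// Minimal master/slave work queue for streamline seeds. Seed handling and the
// "trace" step are placeholders.
#include <cstdio>
#include <mpi.h>

enum Tag { TAG_REQUEST = 1, TAG_WORK = 2, TAG_STOP = 3 };

static void trace_streamline(int seed_id) {
    // Placeholder for advecting one seed through the locally loaded blocks.
    (void)seed_id;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int total_seeds = 1000;

    if (rank == 0) {                                   // master
        int next_seed = 0, stopped = 0;
        while (stopped < size - 1) {
            int dummy;
            MPI_Status st;
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            if (next_seed < total_seeds) {
                MPI_Send(&next_seed, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                ++next_seed;
            } else {
                MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                ++stopped;
            }
        }
    } else {                                           // slaves
        while (true) {
            int request = rank, seed;
            MPI_Send(&request, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Status st;
            MPI_Recv(&seed, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            trace_streamline(seed);
        }
    }

    MPI_Finalize();
    return 0;
}
```

Run with, e.g., mpirun -np 8; in the real algorithm the master's decision would also account for which blocks each slave already has in memory.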

Page 7

Suggested Solutions (II)

The authors then present the performance results obtained for the astrophysics dataset, seeded outside the proto-neutron star.

According to [2] the data represents the magnetic field inside the supernova shock front. The magnetic field was sampled onto 512 blocks with 1 million cells per block.

Figure 7: Astrophysics dataset streamline visualization (page 4 of [2]).

Figure 8: Results section of [2].

Page 8

Suggested Solutions (III)

From lecture [3], our principal interest was in volume-rendering approaches:


Image-space decomposition: each process works on a disjoint section of the final image.

Object-space decomposition: each process works on a disjoint subset of the data.

Hybrid volume rendering: a mixture of image- and object-space decomposition.

Figure 9: Hybrid Parallel Volume Rendering architecture [3].

Hybrid parallel volume rendering: the implementation shown in Figure 9 is for a ray-casting volume rendering case study (Levoy's method).

Ghost data is needed for trilinear interpolation and for the gradient-field computation.
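For reference, a minimal sketch of trilinear interpolation on one data brick; the extra layer of ghost cells mentioned above exists precisely so that the eight neighbouring samples are available when a ray samples near a brick boundary. The brick layout here is an assumption for illustration.

```cpp
// Trilinear interpolation on a brick of volume data stored with a ghost layer.
#include <cmath>
#include <cstdio>
#include <vector>

struct Brick {
    int nx, ny, nz;               // dimensions including the ghost layer
    std::vector<float> v;         // nx*ny*nz samples
    float at(int i, int j, int k) const {
        return v[(size_t)k * ny * nx + (size_t)j * nx + i];
    }
};

// x, y, z are continuous sample coordinates inside the brick (cell units).
static float trilinear(const Brick &b, float x, float y, float z) {
    int i = (int)std::floor(x), j = (int)std::floor(y), k = (int)std::floor(z);
    float fx = x - i, fy = y - j, fz = z - k;
    float c00 = b.at(i, j, k)         * (1 - fx) + b.at(i + 1, j, k)         * fx;
    float c10 = b.at(i, j + 1, k)     * (1 - fx) + b.at(i + 1, j + 1, k)     * fx;
    float c01 = b.at(i, j, k + 1)     * (1 - fx) + b.at(i + 1, j, k + 1)     * fx;
    float c11 = b.at(i, j + 1, k + 1) * (1 - fx) + b.at(i + 1, j + 1, k + 1) * fx;
    float c0 = c00 * (1 - fy) + c10 * fy;
    float c1 = c01 * (1 - fy) + c11 * fy;
    return c0 * (1 - fz) + c1 * fz;
}

int main() {
    Brick b{4, 4, 4, std::vector<float>(64, 0.0f)};
    for (int k = 0; k < 4; ++k)
        for (int j = 0; j < 4; ++j)
            for (int i = 0; i < 4; ++i)
                b.v[(size_t)k * 16 + (size_t)j * 4 + i] = (float)(i + j + k);
    std::printf("sample at (1.5,1.5,1.5) = %.2f\n", trilinear(b, 1.5f, 1.5f, 1.5f));
    return 0;
}
```

Without the ghost layer, the at(i + 1, ...) accesses near a brick face would need samples owned by a neighbouring process.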

Page 9

Suggested Solutions (III)

The case study presented in [3] focused on implementing the hybrid parallel algorithm for a hydrogen-flame combustion simulation.

The original data of size 1024x1024x1024 was downsampled to 512x512x512 and upscaled at runtime to 4608x4608x4608.

The authors suggest that the results indicate the hybrid parallel approach is better: 40% less communication for ghost data, and the hybrid approach is roughly twice as fast as the MPI-only method.


Figure 10: Hydrogen Flame Combustion Simulation - source: http://www.idav.ucdavis.edu/gallery2/gallery2Embedded.php?g2_itemId=60

Figure 11: Parallel Hybrid Volume Rendering absolute runtime [3]

Page 10

Suggested Solutions (IV)

The authors in [4] present a parallel pathline-construction algorithm for time-varying vector fields. The 3D time-varying vector fields are treated as 4D vector fields.

Using an agglomerative hierarchical clustering method, the streamlines are computed with an adaptive grid construction (to accommodate the large data size).

The pathlets (pathline segments within a time interval) are then computed at each time step; a small integration sketch follows at the end of this slide. Spatial and temporal coherence are important for accurately displaying the unsteady flow field.

To parallelize the above computations, a data and workload distribution analysis determines which processor receives which piece of the cluster tree, the boundary approximation, and the input data.

Figure 12: Load balancing and scalability for the supernova data from [4].
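As a small illustration of what a pathlet is, the sketch below integrates one particle through an unsteady field over a single time interval, letting time advance together with position; the analytic field and the RK4 step size are assumptions for illustration, not the clustering-based method of [4].

```cpp
// One pathlet: a pathline segment integrated over a single time interval
// [0, T] of a time-varying (unsteady) velocity field v(x, t).
#include <array>
#include <cstdio>
#include <vector>

using Vec3 = std::array<float, 3>;

// Placeholder unsteady field: a rotation whose angular speed grows with time.
static Vec3 velocity(const Vec3 &p, float t) {
    float w = 1.0f + 0.5f * t;
    return { -w * p[1], w * p[0], 0.2f };
}

static Vec3 axpy(const Vec3 &p, const Vec3 &v, float a) {
    return { p[0] + a * v[0], p[1] + a * v[1], p[2] + a * v[2] };
}

// Classic RK4 step in space-time: time advances together with position.
static Vec3 rk4_step(const Vec3 &p, float t, float dt) {
    Vec3 k1 = velocity(p, t);
    Vec3 k2 = velocity(axpy(p, k1, 0.5f * dt), t + 0.5f * dt);
    Vec3 k3 = velocity(axpy(p, k2, 0.5f * dt), t + 0.5f * dt);
    Vec3 k4 = velocity(axpy(p, k3, dt), t + dt);
    return { p[0] + dt / 6.0f * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
             p[1] + dt / 6.0f * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]),
             p[2] + dt / 6.0f * (k1[2] + 2 * k2[2] + 2 * k3[2] + k4[2]) };
}

int main() {
    Vec3 p = { 1.0f, 0.0f, 0.0f };
    const float dt = 0.01f, interval = 1.0f;           // one pathlet interval
    std::vector<Vec3> pathlet;
    for (float t = 0.0f; t < interval; t += dt) {
        pathlet.push_back(p);
        p = rk4_step(p, t, dt);
    }
    std::printf("pathlet with %zu points, end = (%.3f, %.3f, %.3f)\n",
                pathlet.size(), p[0], p[1], p[2]);
    return 0;
}
```

Concatenating the pathlets computed for successive intervals with consistent end points is where the spatial and temporal coherence mentioned above matters.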

Page 11

Lessons Learned

Know your data and hardware setup.

Principal algorithm factors: data size, memory-access pattern, communication and synchronization overheads, load balance, and fixed initialization costs.

Performance factors: wall time, I/O time, data-block efficiency, and parallel speedup and efficiency (defined below).

End goal: Finding the right balance between distributed and shared-memory programming approaches.

VisIt and ParaView already implement hybrid approaches.
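For reference, the standard definitions behind the parallel speedup and efficiency listed among the performance factors above, where T(p) is the wall time on p processes:

```latex
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}
```

E(p) = 1 corresponds to ideal linear scaling.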


Page 12

Our Approach

For our input data we will use a large-scale 8 GB CT projection file.

Using the Xanadu nodes and a ParaView/VisIt setup, we will establish a performance reference point.

Further ambitions:

Ideally, we would like to have a dedicated 3D projection visualization program (however, OpenGL on Linux at the moment thinks otherwise).

For parallel processing, the hybrid master/slave method seems most promising (using OpenMPI and OpenMP). However, a server/client implementation would need to be put in place.
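A minimal skeleton of the hybrid OpenMPI + OpenMP layout we have in mind, under the assumption that the 8 GB projection data can be split into slices: MPI ranks partition the slices across nodes, and OpenMP threads share the per-slice work within a node. The slice count, dimensions, and the per-pixel reduction are placeholders, not our final design.

```cpp
// Hybrid MPI + OpenMP skeleton: ranks own slices, threads share pixels.
#include <algorithm>
#include <cstdio>
#include <vector>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int total_slices = 512, width = 1024, height = 1024;   // assumed
    // Contiguous block decomposition of slices across ranks.
    int per_rank = (total_slices + size - 1) / size;
    int first = rank * per_rank;
    int last = std::min(total_slices, first + per_rank);

    double local_sum = 0.0;
    for (int s = first; s < last; ++s) {
        std::vector<float> slice((size_t)width * height, 1.0f);  // stand-in for loaded data
        // Shared-memory parallelism inside the rank.
        #pragma omp parallel for reduction(+ : local_sum)
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x)
                local_sum += slice[(size_t)y * width + x];
    }

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("ranks=%d threads=%d total intensity=%.1f\n",
                    size, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```

Run with, e.g., mpirun -np 4 and OMP_NUM_THREADS=8 to match the 8-core Xanadu nodes; the actual visualization work (server/client, master/slave reassignment) would replace the per-slice loop body.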


Page 13

Deliverables

Week from Oct 19th: Project proposal - Abstract.

Week from Oct 29th: Current methods overview. Looking for Data.

Week from Nov 5th: Finish related work. Simulated data to be used.

Week from Nov 12th: Presentation. To identify our Parallelization strategy.

Week from Nov 19th: Implement the approach.

Week from Nov 26th: Finish implementation.

Week from Dec 3rd: Results to be finalized.

Week from Dec 10th: Thursday 13th of December - Final project presentation. Final report by Tuesday, December 11th.


Page 14

References


[1] “Large Vector-Field Visualization, Theory and Practice: Large Data and Parallel Visualization”, Hank Childs et al. 2010. <http://www.idav.ucdavis.edu/~garth/vis10-tutorial/pdfs/childs.pdf>

[2] “Scalable Computation of Streamlines on Very Large Datasets”, Pugmire, Childs, Garth, Ahern, Weber. SuperComputing 2009. <www.idav.ucdavis.edu/func/return_pdf?pub_id=989>

[3] “MPI-hybrid Parallelism for Volume Rendering on Large, Multi-core Systems”. Mark Howison, E. Wes Bethel, and Hank Childs. EGPGV 2010 <graphics.cs.ucdavis.edu/~joy/ecs277/other.../childs_ecs277_lec3.pdf>

[4] “Parallel Hierarchical Visualization of Large Time-Varying 3D Vector Fields”. Hongfeng Yu et al. SuperComputing 2007. <sc07.supercomputing.org/schedule/pdf/pap291.pdf>