HOW PETASCALE VISUALIZATION WILL CHANGE THE RULES
Hank Childs
Lawrence Berkeley Lab & UC Davis
10/12/09
Supercomputing 101
Why simulation? Simulations are sometimes more cost effective than experiments, and the new model for science has three legs: theory, experiment, and simulation.

What is the "petascale"?
- 1 FLOPS = 1 FLoating point OPeration per Second
- 1 GigaFLOPS = 1 billion FLOPS; 1 TeraFLOPS = 1000 GigaFLOPS; 1 PetaFLOPS = 1,000,000 GigaFLOPS
- PetaFLOPS + petabytes on disk + petabytes of memory = petascale

Why petascale? More compute cycles, more memory, etc. lead to faster and/or more accurate simulations.
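The unit relationships above can be checked with trivial arithmetic; a small Python sketch (constant names are mine, not from the talk):

```python
# Trivial arithmetic behind the petascale definitions above.
FLOPS_PER_GIGAFLOPS = 10**9
FLOPS_PER_TERAFLOPS = 10**12
FLOPS_PER_PETAFLOPS = 10**15

# 1 TeraFLOPS = 1000 GigaFLOPS
print(FLOPS_PER_TERAFLOPS // FLOPS_PER_GIGAFLOPS)   # 1000
# 1 PetaFLOPS = 1,000,000 GigaFLOPS
print(FLOPS_PER_PETAFLOPS // FLOPS_PER_GIGAFLOPS)   # 1000000
```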
Petascale computing is here.
4 existing petascale machines
LANL Roadrunner
ORNL Jaguar
Jülich JUGENE
UTK Kraken
Supercomputing is not slowing down.
LLNL Sequoia, NCSA Blue Waters
How does the petascale affect visualization?
- Large # of time steps
- Large ensembles
- Large scale
- Large # of variables
Why is petascale visualization going to change the rules?
Michael Strayer (U.S. DoE Office of Science): "petascale is not business as usual." This is especially true for visualization and analysis!

Large scale data creates two incredible challenges: scale and complexity.
- Scale is not "business as usual": the supercomputing landscape is changing, so we will need "smart" techniques in production environments.
- More resolution leads to more and more complexity: will the "business as usual" techniques still suffice?
Outline
What are the software engineering ramifications?
Production visualization tools use “pure parallelism” to process data.
[Figure: pieces of data on disk (P0–P9), produced by a parallel simulation code, are distributed across processors; each processor runs its own read → process → render pipeline in a parallelized visualization data flow network.]
Pure parallelism: pros and cons
- Pros: easy to implement
- Cons: requires a large amount of primary memory; requires large I/O capabilities; therefore requires big machines
Pure parallelism performance is based on # bytes to process and I/O rates.
Vis is almost always >50% I/O and sometimes 98% I/O
Amount of data to visualize is typically O(total mem)
Relative I/O (ratio of total memory and I/O) is key
[Figure: FLOPs, memory, and I/O capacities compared for a terascale machine and a "petascale machine".]
Anecdotal evidence: relative I/O is getting slower.
Machine name       Main memory   I/O rate   Time to write memory to disk
ASC Purple         49.0 TB       140 GB/s   5.8 min
BGL-init           32.0 TB       24 GB/s    22.2 min
BGL-cur            69.0 TB       30 GB/s    38.3 min
Petascale machine  ??            ??         >40 min
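The last column of the table above is just main memory divided by I/O rate; a small Python sketch to reproduce it (the function and variable names are mine):

```python
# Sketch: reproduce the "time to write memory to disk" column of the table
# above, as main memory (TB) divided by I/O rate (GB/s), in minutes.
machines = {
    "ASC Purple": (49.0, 140.0),  # (main memory in TB, I/O rate in GB/s)
    "BGL-init":   (32.0, 24.0),
    "BGL-cur":    (69.0, 30.0),
}

def minutes_to_flush(memory_tb, io_gb_per_s):
    """Minutes needed to write all of main memory to disk (1 TB = 1000 GB)."""
    return (memory_tb * 1000.0) / io_gb_per_s / 60.0

for name, (mem, rate) in machines.items():
    print(f"{name}: {minutes_to_flush(mem, rate):.1f} min")
```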
Why is relative I/O getting slower? "I/O doesn't pay the bills," and simulation codes aren't affected.
Recent runs of trillion cell data sets provide further evidence that I/O dominates
● Weak scaling study: ~62.5M cells/core
Machine    Type        Problem size   # cores
Purple     AIX         0.5T cells     8K
Ranger     Sun Linux   1T cells       16K
Juno       Linux       1T cells       16K
JaguarPF   Cray XT5    2T cells       32K
Dawn       BG/P        4T cells       64K
Franklin   Cray XT4    1T, 2T cells   16K, 32K

For 2T cells on 32K procs (Jaguar and Franklin):
- Approx. I/O time: 2-5 minutes
- Approx. processing time: 10 seconds
Assumptions stated:
- I/O is a dominant term in visualization performance.
- Supercomputing centers are procuring "imbalanced" petascale machines. The trend is towards massively multi-core nodes with lots of shared memory, and I/O is provisioned per node: more cores means less I/O bandwidth per core. Overall I/O bandwidth is also deficient.

Pure parallelism is not well suited for the petascale. The emerging problem: pure parallelism emphasizes I/O and memory, and pure parallelism is the dominant processing paradigm for production visualization software.
Solution? … there are "smart techniques" that de-emphasize memory and I/O:
- Data subsetting
- Multi-resolution
- Out of core
- In situ
Data subsetting eliminates pieces that don’t contribute to the final picture.
[Figure: the same parallel pipeline as before, but pieces of data that do not contribute to the final picture are discarded before being read.]
Data subsetting: pros and cons
- Pros: less data to process (less I/O, less memory)
- Cons: extent of optimization is data dependent; only applicable to some algorithms
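One common way data subsetting is realized for contouring is min/max culling: cheap per-piece metadata (the scalar range of each piece) determines which pieces can possibly intersect the contour level, and only those are read. A sketch in Python; the data and function names are illustrative, not from any real tool:

```python
# Sketch of data subsetting for contouring: consult per-piece metadata
# (min/max of the scalar field) and discard pieces that cannot intersect
# the contour level, so they are never read from disk.
def select_pieces(piece_ranges, contour_level):
    """Keep only pieces whose [min, max] range straddles the contour level."""
    return [i for i, (lo, hi) in enumerate(piece_ranges)
            if lo <= contour_level <= hi]

# Ten pieces on disk, each summarized by its scalar range:
ranges = [(0.0, 0.4), (0.3, 0.9), (0.5, 1.2), (1.1, 2.0), (0.0, 0.2),
          (0.8, 1.5), (1.4, 3.0), (0.1, 0.6), (2.0, 2.5), (0.9, 1.1)]

print(select_pieces(ranges, 1.0))  # only these pieces are read and processed
```

The extent of the savings depends entirely on the data, which is exactly the "data dependent" con listed above.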
Multi-resolution techniques use coarse representations then refine.
[Figure: the same parallel pipeline, but processors read coarse representations of the pieces first and refine only where needed.]
Multi-resolution: pros and cons
- Pros: avoids I/O & memory requirements
- Cons: is it meaningful to process a simplified version of the data?
Out-of-core iterates pieces of data through the pipeline one at a time.
[Figure: the same parallel pipeline, but each processor streams its pieces of data through read → process → render one at a time.]
Out-of-core: pros and cons
- Pros: lower requirement for primary memory; doesn't require big machines
- Cons: still paying large I/O costs (slow!)
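A minimal Python sketch of the out-of-core idea: only one piece is ever resident in memory, at the cost of repeated (slow) reads. The read/process stand-ins and all names are illustrative:

```python
# Sketch of out-of-core execution: stream pieces of data through a
# read -> process pipeline one at a time, so only one piece is resident
# in memory at any moment.
def read(piece_id):
    return list(range(piece_id * 3, piece_id * 3 + 3))  # stand-in for disk I/O

def process(values):
    return [v * v for v in values]                      # stand-in for a filter

def out_of_core(num_pieces):
    rendered = []
    for piece_id in range(num_pieces):      # one piece in memory at a time
        rendered.extend(process(read(piece_id)))
    return rendered

print(out_of_core(4))
```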
In situ processing does visualization as part of the simulation.
[Figures: in the in situ configuration, the parallelized visualization data flow network runs inside the parallel simulation code; each processor (0 through 9) gets direct access to the simulation's data and runs process → render, with no data written to or read from disk.]
In situ: pros and cons
- Pros: no I/O! Lots of compute power available
- Cons: very memory constrained; many operations not possible; once the simulation has advanced, you cannot go back and analyze it; the user must know what to look for a priori; expensive resource to hold hostage!
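In situ processing can be sketched as the simulation invoking an analysis hook on its live, in-memory state. This is a toy loop and hook of my own invention, not a real integration API:

```python
# Sketch of in situ processing: the simulation calls a visualization hook on
# its in-memory state every few timesteps, so nothing touches disk.
def visualize(step, field):
    """In situ analysis: runs inside the simulation, on live memory."""
    return (step, min(field), max(field))

def simulate(num_steps, vis_every=2):
    field = [0.0] * 8
    results = []
    for step in range(num_steps):
        field = [v + 1.0 for v in field]            # advance the simulation
        if step % vis_every == 0:
            # Analyze now or never: once the simulation advances,
            # this state is gone and cannot be revisited.
            results.append(visualize(step, field))
    return results

print(simulate(5))
```

Note how the hook must be chosen before the run starts, which is the "know what to look for a priori" con above.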
Summary of techniques and strategies:
- Pure parallelism can be used for anything, but it takes a lot of resources.
- Smart techniques can only be used situationally.

Petascale strategy 1: stick with pure parallelism and live with high machine costs & I/O wait times.

Other petascale strategies? Assumptions:
- We can't afford massive dedicated clusters for visualization.
- We can fall back on the supercomputer, but only rarely.
Now we know the tools … what problem are we trying to solve? Three primary use cases:
- Exploration. Examples: scientific discovery, debugging
- Confirmation. Examples: data analysis, images / movies, comparison
- Communication. Examples: data analysis, images / movies
Notional decision process (spanning the exploration, confirmation, and communication use cases):
- Need all data at full resolution? No → Multi-resolution (debugging & scientific discovery). Yes → next question.
- Do operations require all the data? No → Data subsetting (comparison & data analysis). Yes → next question.
- Do you know what you want to do a priori? Yes → In situ (data analysis & images / movies). No → next question.
- Do algorithms require all data in memory? Yes → Pure parallelism (anything & esp. comparison). No → next question.
- Interactivity required? Yes → Pure parallelism. No → Out-of-core (data analysis & images / movies).
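The chain of questions above can be written down as a small decision function; a sketch in Python (the function and argument names are mine, not from the talk):

```python
# Sketch of the notional decision process as a chain of boolean questions,
# mirroring the flowchart order.
def choose_paradigm(full_resolution, all_data, know_a_priori,
                    all_in_memory, interactive):
    if not full_resolution:
        return "multi-resolution"
    if not all_data:
        return "data subsetting"
    if know_a_priori:
        return "in situ"
    if all_in_memory or interactive:
        return "pure parallelism"
    return "out-of-core"

# Debugging on coarse data -> multi-resolution
print(choose_paradigm(False, False, False, False, True))
# Movie making, known in advance -> in situ
print(choose_paradigm(True, True, True, False, False))
```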
Alternate strategy: smart techniques. Do the bulk of all visualization and analysis work with multi-resolution, in situ, out-of-core, and data subsetting; do the remaining ~5% on the supercomputer.
How petascale changes the rules: we can't use pure parallelism alone any more; we will need algorithms to work in multiple processing paradigms. This is an incredible research problem … but also an incredible software engineering problem.
Data flow networks… a love story
[Figure: a pipeline of File Reader (source) → Slice Filter → Contour Filter → Renderer (sink); Update requests travel upstream and Execute results travel downstream.]
Work is performed by a pipeline
A pipeline consists of data objects and components (sources, filters, and sinks)
Pipeline execution begins with a “pull”, which starts Update phase
Data flows from component to component during the Execute phase
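The Update/Execute mechanics above can be sketched as a tiny pull-based pipeline in Python. Class and method names are illustrative stand-ins, not VisIt's actual API:

```python
# Minimal sketch of a pull-based data flow network: a render "pull" at the
# sink starts the Update phase, which propagates upstream to the source;
# data then flows back downstream during the Execute phase.
class Source:
    def __init__(self, data):
        self.data = data
    def update(self):            # Update bottoms out at the source...
        return self.execute()
    def execute(self):           # ...then Execute flows data downstream.
        return self.data

class Filter:
    def __init__(self, upstream, fn):
        self.upstream, self.fn = upstream, fn
    def update(self):            # pull from upstream, then run this filter
        return self.fn(self.upstream.update())

class Sink:
    def __init__(self, upstream):
        self.upstream = upstream
    def render(self):            # the "pull" that starts the Update phase
        return self.upstream.update()

reader = Source([3, 1, 4, 1, 5])
slicer = Filter(reader, lambda d: d[:3])                 # stand-in slice filter
contour = Filter(slicer, lambda d: [v * 2 for v in d])   # stand-in contour filter
print(Sink(contour).render())
```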
Data flow networks: strengths
- Flexible usage
- Networks can be multi-input / multi-output
- Interoperability of modules
- Embarrassingly parallel algorithms handled by the base infrastructure
- Easy to extend: new derived types of filters
[Figures: an inheritance diagram (an abstract filter with derived slice, contour, and user-defined filters) and an example network of a source, filters A, B, and C, and a sink, illustrating the flow of data.]
Data flow networks: weaknesses
- Execution of modules happens in stages; each algorithm is executed all at one time, which is cache inefficient
- Memory footprint concerns
- Some implementations fix the data model
Data flow networks: observations
- The majority of code investment is in algorithms (derived types of filters), not in base classes (which manage data flow).
- Source code for managing the flow of data is small and in one place.
- Algorithms don't care about the data processing paradigm … they only care about operating on inputs and outputs.
Example filter: contouring
[Figure: a contour filter wraps the contour algorithm, taking a mesh as input and producing surface/line output; in the pipeline, a data reader feeds the contour filter, which feeds rendering.]
Example filter: contouring with data subsetting
[Figure: the same contour filter, now also communicating with the pipeline executive to discard pieces that cannot contribute.]
Example filter: contouring with out-of-core
[Figure: the same contour filter, but the reader streams the dataset's 12 pieces through the pipeline one at a time; the contour algorithm is called 12 times.]
Example filter: contouring with multi-resolution techniques
[Figure: the same contour filter, but the data reader delivers a coarse representation of the mesh.]
Example filter: contouring with in situ
[Figure: the same contour filter, but the data reader is replaced by direct access to the simulation code's data; nothing is read from disk.]
For each example, the contour algorithm didn't change, just its context.
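A minimal Python sketch of that point: one contour "algorithm" (here a stand-in that keeps cells whose value range crosses the level) driven unchanged by two different processing paradigms, with identical results. All names are illustrative:

```python
# The algorithm proper never changes; only the driver around it does.
def contour(cells, level):
    """Stand-in contour algorithm: keep cells crossing the contour level."""
    return [(lo, hi) for lo, hi in cells if lo <= level <= hi]

def pure_parallel_contour(all_cells, level):
    return contour(all_cells, level)          # whole dataset in memory at once

def out_of_core_contour(pieces, level):
    out = []
    for piece in pieces:                      # algorithm called once per piece
        out.extend(contour(piece, level))
    return out

cells = [(0.0, 0.5), (0.4, 1.2), (1.1, 2.0), (0.9, 1.0)]
assert pure_parallel_contour(cells, 1.0) == \
       out_of_core_contour([cells[:2], cells[2:]], 1.0)
print("same result from the same algorithm in both contexts")
```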
How big is this job?
- Many algorithms are basically "processing paradigm" indifferent.
- What percentage of a vis code is algorithms? What percentage is devoted to the "processing paradigm"? What else?

We can gain insight by looking at the breakdown in a real world example (VisIt).
VisIt is a richly featured, turnkey application for large data. The tool has two focal points: big data & providing a product for end users.
- VisIt is an open source, end user visualization and analysis tool for simulated and experimental data
- >100K downloads on the web
- R&D 100 award in 2005
- Used "heavily to exclusively" on 8 of the world's top 12 supercomputers
- Pure parallelism + out-of-core + data subsetting + in situ
[Figure: 27B element Rayleigh-Taylor instability (MIRANDA, BG/L).]
VisIt architecture & lines of code
[Figure: VisIt splits into a client side (gui, viewer, cli) and a server side (mdserver plus an engine that runs parallel or serial), augmented by custom interfaces, documentation, regression testing, user knowledge, a wiki, and mailing list archives. Plugin types (plots, operators, databases) account for a large share of the code; support libraries & tools are 178K lines and libsim is 10K lines. The code for handling large data and parallel algorithms is only about 32K of 559K total lines.]
Pure parallelism is the simplest paradigm. "Replacement" code may be significantly larger.
Summary
Petascale machines are not well suited for pure parallelism, because of its high I/O and memory costs.
This will force production visualization software to utilize more processing paradigms.
The majority of existing investments can be preserved, thanks in large part to the elegant design of data flow networks.
Hank Childs, [email protected] … and questions???