HOW PETASCALE VISUALIZATION WILL CHANGE THE RULES
Hank Childs
Lawrence Berkeley Lab & UC Davis
10/12/09
Supercomputing 101
Why simulation? Simulations are sometimes more cost effective than experiments, and the new model for science has three legs: theory, experiment, and simulation.

What is the "petascale"?
- 1 FLOPS = 1 FLoating point OPeration per Second
- 1 GigaFLOPS = 1 billion FLOPS; 1 TeraFLOPS = 1000 GigaFLOPS; 1 PetaFLOPS = 1,000,000 GigaFLOPS
- PetaFLOPS + petabytes on disk + petabytes of memory = petascale

Why petascale? More compute cycles, more memory, etc. lead to faster and/or more accurate simulations.
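The unit relationships above can be checked with trivial arithmetic; a small Python sketch (constant names are mine, not from the talk):

```python
# Trivial arithmetic behind the petascale definitions above.
FLOPS_PER_GIGAFLOPS = 10**9
FLOPS_PER_TERAFLOPS = 10**12
FLOPS_PER_PETAFLOPS = 10**15

# 1 TeraFLOPS = 1000 GigaFLOPS
print(FLOPS_PER_TERAFLOPS // FLOPS_PER_GIGAFLOPS)   # 1000
# 1 PetaFLOPS = 1,000,000 GigaFLOPS
print(FLOPS_PER_PETAFLOPS // FLOPS_PER_GIGAFLOPS)   # 1000000
```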
Petascale computing is here.
4 existing petascale machines
LANL Roadrunner
ORNL Jaguar
Jülich JUGENE
UTK Kraken
Supercomputing is not slowing down.
LLNL Sequoia, NCSA Blue Waters
How does the petascale affect visualization?
- Large # of time steps
- Large ensembles
- Large scale
- Large # of variables
Why is petascale visualization going to change the rules?
Michael Strayer (U.S. DoE Office of Science): "petascale is not business as usual." This is especially true for visualization and analysis!

Large scale data creates two incredible challenges: scale and complexity.
- Scale is not "business as usual": the supercomputing landscape is changing, so we will need "smart" techniques in production environments.
- More resolution leads to more and more complexity: will the "business as usual" techniques still suffice?
Outline
What are the software engineering ramifications?
Production visualization tools use “pure parallelism” to process data.
[Figure: pieces of data on disk (P0–P9), produced by a parallel simulation code, are distributed across processors; each processor runs its own read → process → render pipeline in a parallelized visualization data flow network.]
Pure parallelism: pros and cons
- Pros: easy to implement
- Cons: requires a large amount of primary memory; requires large I/O capabilities; therefore requires big machines
Pure parallelism performance is based on # bytes to process and I/O rates.
Vis is almost always >50% I/O and sometimes 98% I/O
Amount of data to visualize is typically O(total mem)
Relative I/O (ratio of total memory and I/O) is key
[Figure: FLOPs, memory, and I/O capacities compared for a terascale machine and a "petascale machine".]
Anecdotal evidence: relative I/O is getting slower.
Machine name       Main memory   I/O rate   Time to write memory to disk
ASC Purple         49.0 TB       140 GB/s   5.8 min
BGL-init           32.0 TB       24 GB/s    22.2 min
BGL-cur            69.0 TB       30 GB/s    38.3 min
Petascale machine  ??            ??         >40 min
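The last column of the table above is just main memory divided by I/O rate; a small Python sketch to reproduce it (the function and variable names are mine):

```python
# Sketch: reproduce the "time to write memory to disk" column of the table
# above, as main memory (TB) divided by I/O rate (GB/s), in minutes.
machines = {
    "ASC Purple": (49.0, 140.0),  # (main memory in TB, I/O rate in GB/s)
    "BGL-init":   (32.0, 24.0),
    "BGL-cur":    (69.0, 30.0),
}

def minutes_to_flush(memory_tb, io_gb_per_s):
    """Minutes needed to write all of main memory to disk (1 TB = 1000 GB)."""
    return (memory_tb * 1000.0) / io_gb_per_s / 60.0

for name, (mem, rate) in machines.items():
    print(f"{name}: {minutes_to_flush(mem, rate):.1f} min")
```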
Why is relative I/O getting slower? "I/O doesn't pay the bills," and simulation codes aren't affected.
Recent runs of trillion cell data sets provide further evidence that I/O dominates
● Weak scaling study: ~62.5M cells/core
Machine    Type        Problem size   # cores
Purple     AIX         0.5T cells     8K
Ranger     Sun Linux   1T cells       16K
Juno       Linux       1T cells       16K
JaguarPF   Cray XT5    2T cells       32K
Dawn       BG/P        4T cells       64K
Franklin   Cray XT4    1T, 2T cells   16K, 32K

For 2T cells on 32K procs (Jaguar and Franklin):
- Approx. I/O time: 2-5 minutes
- Approx. processing time: 10 seconds
Assumptions stated:
- I/O is a dominant term in visualization performance.
- Supercomputing centers are procuring "imbalanced" petascale machines. The trend is towards massively multi-core nodes with lots of shared memory, and I/O is provisioned per node: more cores means less I/O bandwidth per core. Overall I/O bandwidth is also deficient.

Pure parallelism is not well suited for the petascale. The emerging problem: pure parallelism emphasizes I/O and memory, and pure parallelism is the dominant processing paradigm for production visualization software.
Solution? … there are "smart techniques" that de-emphasize memory and I/O:
- Data subsetting
- Multi-resolution
- Out of core
- In situ
Data subsetting eliminates pieces that don’t contribute to the final picture.
[Figure: the same parallel pipeline as before, but pieces of data that do not contribute to the final picture are discarded before being read.]
Data subsetting: pros and cons
- Pros: less data to process (less I/O, less memory)
- Cons: extent of optimization is data dependent; only applicable to some algorithms
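One common way data subsetting is realized for contouring is min/max culling: cheap per-piece metadata (the scalar range of each piece) determines which pieces can possibly intersect the contour level, and only those are read. A sketch in Python; the data and function names are illustrative, not from any real tool:

```python
# Sketch of data subsetting for contouring: consult per-piece metadata
# (min/max of the scalar field) and discard pieces that cannot intersect
# the contour level, so they are never read from disk.
def select_pieces(piece_ranges, contour_level):
    """Keep only pieces whose [min, max] range straddles the contour level."""
    return [i for i, (lo, hi) in enumerate(piece_ranges)
            if lo <= contour_level <= hi]

# Ten pieces on disk, each summarized by its scalar range:
ranges = [(0.0, 0.4), (0.3, 0.9), (0.5, 1.2), (1.1, 2.0), (0.0, 0.2),
          (0.8, 1.5), (1.4, 3.0), (0.1, 0.6), (2.0, 2.5), (0.9, 1.1)]

print(select_pieces(ranges, 1.0))  # only these pieces are read and processed
```

The extent of the savings depends entirely on the data, which is exactly the "data dependent" con listed above.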
Multi-resolution techniques use coarse representations then refine.
[Figure: the same parallel pipeline, but processors read coarse representations of the pieces first and refine only where needed.]
Multi-resolution: pros and cons
- Pros: avoids I/O & memory requirements
- Cons: is it meaningful to process a simplified version of the data?
Out-of-core iterates pieces of data through the pipeline one at a time.
[Figure: the same parallel pipeline, but each processor streams its pieces of data through read → process → render one at a time.]
Out-of-core: pros and cons
- Pros: lower requirement for primary memory; doesn't require big machines
- Cons: still paying large I/O costs (slow!)
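A minimal Python sketch of the out-of-core idea: only one piece is ever resident in memory, at the cost of repeated (slow) reads. The read/process stand-ins and all names are illustrative:

```python
# Sketch of out-of-core execution: stream pieces of data through a
# read -> process pipeline one at a time, so only one piece is resident
# in memory at any moment.
def read(piece_id):
    return list(range(piece_id * 3, piece_id * 3 + 3))  # stand-in for disk I/O

def process(values):
    return [v * v for v in values]                      # stand-in for a filter

def out_of_core(num_pieces):
    rendered = []
    for piece_id in range(num_pieces):      # one piece in memory at a time
        rendered.extend(process(read(piece_id)))
    return rendered

print(out_of_core(4))
```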
In situ processing does visualization as part of the simulation.
[Figures: in the in situ configuration, the parallelized visualization data flow network runs inside the parallel simulation code; each processor (0 through 9) gets direct access to the simulation's data and runs process → render, with no data written to or read from disk.]
In situ: pros and cons
- Pros: no I/O! Lots of compute power available
- Cons: very memory constrained; many operations not possible; once the simulation has advanced, you cannot go back and analyze it; the user must know what to look for a priori; expensive resource to hold hostage!
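In situ processing can be sketched as the simulation invoking an analysis hook on its live, in-memory state. This is a toy loop and hook of my own invention, not a real integration API:

```python
# Sketch of in situ processing: the simulation calls a visualization hook on
# its in-memory state every few timesteps, so nothing touches disk.
def visualize(step, field):
    """In situ analysis: runs inside the simulation, on live memory."""
    return (step, min(field), max(field))

def simulate(num_steps, vis_every=2):
    field = [0.0] * 8
    results = []
    for step in range(num_steps):
        field = [v + 1.0 for v in field]            # advance the simulation
        if step % vis_every == 0:
            # Analyze now or never: once the simulation advances,
            # this state is gone and cannot be revisited.
            results.append(visualize(step, field))
    return results

print(simulate(5))
```

Note how the hook must be chosen before the run starts, which is the "know what to look for a priori" con above.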
Summary of techniques and strategies:
- Pure parallelism can be used for anything, but it takes a lot of resources.
- Smart techniques can only be used situationally.

Petascale strategy 1: stick with pure parallelism and live with high machine costs & I/O wait times.

Other petascale strategies? Assumptions:
- We can't afford massive dedicated clusters for visualization.
- We can fall back on the supercomputer, but only rarely.
Now we know the tools … what problem are we trying to solve? Three primary use cases:
- Exploration. Examples: scientific discovery, debugging
- Confirmation. Examples: data analysis, images / movies, comparison
- Communication. Examples: data analysis, images / movies
Notional decision process (spanning the exploration, confirmation, and communication use cases):
- Need all data at full resolution? No → Multi-resolution (debugging & scientific discovery). Yes → next question.
- Do operations require all the data? No → Data subsetting (comparison & data analysis). Yes → next question.
- Do you know what you want to do a priori? Yes → In situ (data analysis & images / movies). No → next question.
- Do algorithms require all data in memory? Yes → Pure parallelism (anything & esp. comparison). No → next question.
- Interactivity required? Yes → Pure parallelism. No → Out-of-core (data analysis & images / movies).
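The chain of questions above can be written down as a small decision function; a sketch in Python (the function and argument names are mine, not from the talk):

```python
# Sketch of the notional decision process as a chain of boolean questions,
# mirroring the flowchart order.
def choose_paradigm(full_resolution, all_data, know_a_priori,
                    all_in_memory, interactive):
    if not full_resolution:
        return "multi-resolution"
    if not all_data:
        return "data subsetting"
    if know_a_priori:
        return "in situ"
    if all_in_memory or interactive:
        return "pure parallelism"
    return "out-of-core"

# Debugging on coarse data -> multi-resolution
print(choose_paradigm(False, False, False, False, True))
# Movie making, known in advance -> in situ
print(choose_paradigm(True, True, True, False, False))
```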
Alternate strategy: smart techniques. Do the bulk of all visualization and analysis work with multi-resolution, in situ, out-of-core, and data subsetting; do the remaining ~5% on the supercomputer.
How petascale changes the rules: we can't use pure parallelism alone any more; we will need algorithms to work in multiple processing paradigms. This is an incredible research problem … but also an incredible software engineering problem.
Data flow networks… a love story
[Figure: a pipeline of File Reader (source) → Slice Filter → Contour Filter → Renderer (sink); Update requests travel upstream and Execute results travel downstream.]
Work is performed by a pipeline
A pipeline consists of data objects and components (sources, filters, and sinks)
Pipeline execution begins with a “pull”, which starts Update phase
Data flows from component to component during the Execute phase
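The Update/Execute mechanics above can be sketched as a tiny pull-based pipeline in Python. Class and method names are illustrative stand-ins, not VisIt's actual API:

```python
# Minimal sketch of a pull-based data flow network: a render "pull" at the
# sink starts the Update phase, which propagates upstream to the source;
# data then flows back downstream during the Execute phase.
class Source:
    def __init__(self, data):
        self.data = data
    def update(self):            # Update bottoms out at the source...
        return self.execute()
    def execute(self):           # ...then Execute flows data downstream.
        return self.data

class Filter:
    def __init__(self, upstream, fn):
        self.upstream, self.fn = upstream, fn
    def update(self):            # pull from upstream, then run this filter
        return self.fn(self.upstream.update())

class Sink:
    def __init__(self, upstream):
        self.upstream = upstream
    def render(self):            # the "pull" that starts the Update phase
        return self.upstream.update()

reader = Source([3, 1, 4, 1, 5])
slicer = Filter(reader, lambda d: d[:3])                 # stand-in slice filter
contour = Filter(slicer, lambda d: [v * 2 for v in d])   # stand-in contour filter
print(Sink(contour).render())
```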
Data flow networks: strengths
- Flexible usage
- Networks can be multi-input / multi-output
- Interoperability of modules
- Embarrassingly parallel algorithms handled by the base infrastructure
- Easy to extend: new derived types of filters
[Figures: an inheritance diagram (an abstract filter with derived slice, contour, and user-defined filters) and an example network of a source, filters A, B, and C, and a sink, illustrating the flow of data.]
Data flow networks: weaknesses
- Execution of modules happens in stages; each algorithm is executed all at one time, which is cache inefficient
- Memory footprint concerns
- Some implementations fix the data model
Data flow networks: observations
- The majority of code investment is in algorithms (derived types of filters), not in base classes (which manage data flow).
- Source code for managing the flow of data is small and in one place.
- Algorithms don't care about the data processing paradigm … they only care about operating on inputs and outputs.
Example filter: contouring
[Figure: a contour filter wraps the contour algorithm, taking a mesh as input and producing surface/line output; in the pipeline, a data reader feeds the contour filter, which feeds rendering.]
Example filter: contouring with data subsetting
[Figure: the same contour filter, now also communicating with the pipeline executive to discard pieces that cannot contribute.]
Example filter: contouring with out-of-core
[Figure: the same contour filter, but the reader streams the dataset's 12 pieces through the pipeline one at a time; the contour algorithm is called 12 times.]
Example filter: contouring with multi-resolution techniques
[Figure: the same contour filter, but the data reader delivers a coarse representation of the mesh.]
Example filter: contouring with in situ
[Figure: the same contour filter, but the data reader is replaced by direct access to the simulation code's data; nothing is read from disk.]
For each example, the contour algorithm didn't change, just its context.
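A minimal Python sketch of that point: one contour "algorithm" (here a stand-in that keeps cells whose value range crosses the level) driven unchanged by two different processing paradigms, with identical results. All names are illustrative:

```python
# The algorithm proper never changes; only the driver around it does.
def contour(cells, level):
    """Stand-in contour algorithm: keep cells crossing the contour level."""
    return [(lo, hi) for lo, hi in cells if lo <= level <= hi]

def pure_parallel_contour(all_cells, level):
    return contour(all_cells, level)          # whole dataset in memory at once

def out_of_core_contour(pieces, level):
    out = []
    for piece in pieces:                      # algorithm called once per piece
        out.extend(contour(piece, level))
    return out

cells = [(0.0, 0.5), (0.4, 1.2), (1.1, 2.0), (0.9, 1.0)]
assert pure_parallel_contour(cells, 1.0) == \
       out_of_core_contour([cells[:2], cells[2:]], 1.0)
print("same result from the same algorithm in both contexts")
```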
How big is this job?
- Many algorithms are basically "processing paradigm" indifferent.
- What percentage of a vis code is algorithms? What percentage is devoted to the "processing paradigm"? What else?

We can gain insight by looking at the breakdown in a real world example (VisIt).
VisIt is a richly featured, turnkey application for large data. The tool has two focal points: big data & providing a product for end users.
- VisIt is an open source, end user visualization and analysis tool for simulated and experimental data
- >100K downloads on the web
- R&D 100 award in 2005
- Used "heavily to exclusively" on 8 of the world's top 12 supercomputers
- Pure parallelism + out-of-core + data subsetting + in situ
[Figure: 27B element Rayleigh-Taylor instability (MIRANDA, BG/L).]
VisIt architecture & lines of code
[Figure: VisIt splits into a client side (gui, viewer, cli) and a server side (mdserver plus an engine that runs parallel or serial), augmented by custom interfaces, documentation, regression testing, user knowledge, a wiki, and mailing list archives. Plugin types (plots, operators, databases) account for a large share of the code; support libraries & tools are 178K lines and libsim is 10K lines. The code for handling large data and parallel algorithms is only about 32K of 559K total lines.]
Pure parallelism is the simplest paradigm. "Replacement" code may be significantly larger.
Summary
Petascale machines are not well suited for pure parallelism, because of its high I/O and memory costs.
This will force production visualization software to utilize more processing paradigms.
The majority of existing investments can be preserved, thanks in large part to the elegant design of data flow networks.
Hank Childs, [email protected] … and questions???