
Development of Seismic Imaging Tools (BSIT) on Cell/B.E. Architecture

Mauricio ARAYA-Polo
Computational Applications on Science and Engineering

➀ Seismic Imaging

➁ Barcelona Seismic Imaging Tools (BSIT)

➂ RTM

➃ RTM on Cell/B.E.

➄ Performance Evaluation

➅ Conclusions and Roadmap

Lyon, 26/11/2008.

PETASCALE CONSUMERS

• The Fact: the oil industry has a large (terascale) installed computational capacity (e.g. TotalFinaElf's 106.2 TeraFlops machine is ranked 17th on the top500.org list).
  The Need: complex oil-field analysis requires petascale capacity.
  The Examples: sub-salt seismic imaging of the Gulf of Mexico and offshore Brazil.

• One of BSC's contributions to the PRACE applications suite (WP6) is the Barcelona Seismic Imaging Tools (BSIT)

• Seismic imaging tools are fundamental in the decision-making chain for oil drilling (each drilling operation costs around $150 million, and the industry's average success ratio is 10%)


SEISMIC IMAGING

• Geophysics challenge: port the best seismic imaging technique so far (RTM) to an HPC environment. The goal is to assess the feasibility of the technique as a daily-work tool

• Multiple levels of parallelism (shots, domain, thread, data), and more than one tool involved


BSIT

• BSIT applications:

– Shots DB and management tool, which deals with the top-level parallelism

– Forward Modeling and Reverse Time Migration (RTM), which use MPI (domain decomposition), OpenMP (threads) and SIMD (data)

– All of them are implemented on MareNostrum (js21 blades - PowerPC), and their porting to MariCel (qs22 blades - Cell/B.E.) is underway

• RTM, among the mentioned applications, is the main tool of the suite
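As a rough illustration of the domain-decomposition level mentioned above, the sketch below splits one axis of a model among MPI-style ranks and adds halo planes for the stencil. The function name `decompose`, the halo width of 4 and the choice of a 560-plane axis are assumptions made for the example, not details taken from BSIT.

```python
def decompose(n_planes, n_ranks, halo):
    """Split n_planes along one axis into n_ranks contiguous slabs.

    Returns one tuple (own_start, own_stop, load_start, load_stop) per rank:
    the rank updates planes [own_start, own_stop) but also needs `halo`
    neighbouring planes on each interior side to apply the stencil.
    """
    base, extra = divmod(n_planes, n_ranks)
    slabs = []
    start = 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)  # spread the remainder
        stop = start + size
        load_start = max(0, start - halo)
        load_stop = min(n_planes, stop + halo)
        slabs.append((start, stop, load_start, load_stop))
        start = stop
    return slabs


# Hypothetical example: 560 planes over 4 ranks, halo of 4 planes.
for rank, slab in enumerate(decompose(560, 4, halo=4)):
    print(rank, slab)
```

In a real MPI code the halo planes would be exchanged with the neighbouring ranks after every time step; that communication is left out of this sketch.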


OUTLOOK

➀ Seismic Imaging

➁ Barcelona Seismic Imaging Tools (BSIT)

➂ RTM

➃ RTM on Cell/B.E.

➄ Performance Evaluation

➅ Conclusions and Roadmap


REVERSE TIME MIGRATION (I)

• Among seismic imaging techniques, Reverse Time Migration (RTM) is the tool of choice because of the resulting image quality.

• RTM consists of two-way acoustic wave propagation in a given medium.

• RTM computational costs used to be unaffordable, in both I/O and computing.


REVERSE TIME MIGRATION (II)

• The acoustic wave propagation equation is a partial differential equation (PDE) of the following form (assuming isotropic, non-elastic media):

$$\frac{\partial^2 p(t,z,x,y)}{\partial t^2} - c^2\,\nabla^2 p(t,z,x,y) = s(t)$$

The inputs are the velocity field c and the source wavelet s(t). The output is a pressure wave-field.

• The equation is solved using the Finite Difference (FD) method: derivatives are discretized in space (stencil computation) and time (integration depending on previous time steps)

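• Time discretization has the following form, written as an in-place update (the $p^{t}$ on the right-hand side is read here as the already-computed spatial stencil term):

$$p^{t} = p^{t} + 2p^{t-1} - p^{t-2}$$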
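To make the finite-difference scheme concrete, here is a minimal 1D sketch of the standard second-order leapfrog update, $p^{t+1} = 2p^{t} - p^{t-1} + (c\,\Delta t)^2 \nabla^2 p^{t}$, with the source injected at one grid point. It is not the BSIT kernel: the real kernel is 3D, and the names `step` and `ricker`, the grid spacing, the time step, the velocity and the zero boundaries are all assumptions chosen for illustration. The time index is shifted by one with respect to the slide.

```python
import math


def step(p_prev, p_curr, c, dt, dx, src_index, src_value):
    """Advance the 1D acoustic wave equation by one time step."""
    n = len(p_curr)
    p_next = [0.0] * n  # end points stay at zero (simple fixed boundary)
    for i in range(1, n - 1):
        # Second-order spatial stencil (discrete Laplacian).
        lap = (p_curr[i - 1] - 2.0 * p_curr[i] + p_curr[i + 1]) / dx ** 2
        # Time integration uses the two previous time levels.
        p_next[i] = 2.0 * p_curr[i] - p_prev[i] + (c[i] * dt) ** 2 * lap
    p_next[src_index] += dt ** 2 * src_value  # inject the source wavelet s(t)
    return p_next


def ricker(t, f0=15.0, t0=0.1):
    """Ricker wavelet, used here as a hypothetical source s(t)."""
    a = (math.pi * f0 * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)


n, dx, dt = 201, 10.0, 0.001          # hypothetical grid: 10 m spacing, 1 ms step
c = [2000.0] * n                      # constant velocity field in m/s
p_prev, p_curr = [0.0] * n, [0.0] * n
for it in range(200):
    p_prev, p_curr = p_curr, step(p_prev, p_curr, c, dt, dx, n // 2, ricker(it * dt))

print(max(abs(v) for v in p_curr))
```

With these values $c\,\Delta t/\Delta x = 0.2$, so the run stays well inside the stability limit of the scheme.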


REVERSE TIME MIGRATION (III)

• Cross-correlation between forward and backward wave propagation

• The output image is generated according to:

$$I(z,x,y) = \sum_{t} S(t,z,x,y)\,R(t,z,x,y)$$

where S is the source wave-field, and R is the receiver wave-field (Biondi and Shan 2002)

• A regular campaign runs on the order of 10,000 shots, which takes months of computation time. Also, velocity model sizes are on the order of 1-10 GB per shot.
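The imaging condition above is a zero-lag cross-correlation over time. The sketch below shows it on tiny hand-made wave-fields; the function name `imaging_condition` and the flattening of (z, x, y) into a single spatial index are assumptions made to keep the example short.

```python
def imaging_condition(source_wf, receiver_wf):
    """Return I[x] = sum over t of S[t][x] * R[t][x].

    source_wf and receiver_wf are lists of time snapshots; each snapshot
    is a list of values, one per (flattened) spatial point.
    """
    n_points = len(source_wf[0])
    image = [0.0] * n_points
    for s_t, r_t in zip(source_wf, receiver_wf):
        for x in range(n_points):
            image[x] += s_t[x] * r_t[x]
    return image


# Two time steps, three spatial points (made-up values).
S = [[0.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
R = [[0.0, 3.0, 0.0], [1.0, 0.0, 0.5]]
print(imaging_condition(S, R))  # [0.0, 3.0, 1.0]
```

Only points where both wave-fields are non-zero at the same time step contribute to the image.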


JS21 BLADE MAPPING

• JS21 blade features: 2 CPU x 2 cores, 2.3 GHz, RAM 8 GB, L2 1 MB per core

• Memory access pattern is critical; enhancement: blocking.

• Low data-access/computation ratio; solution: AltiVec Single Instruction Multiple Data (SIMD) operations (data parallelism)

• 4 threads per JS21 blade are exploited by using OpenMP (thread parallelism)

• Good scalability with OpenMP, up to 3.8x (out of 4)

• End-of-the-line RTM kernel for this platform, including fine tuning and extra optimizations

• It takes 45 s (8.3 GFlops) to process a benchmark velocity field of 192x384x560 (160 MB), 200 time steps

• For models bigger than the node’s RAM, we deploy domain decomposition by MPI
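The blocking enhancement mentioned above amounts to restructuring the stencil loops so that a small tile of the grid is finished before moving on. The sketch below only shows that loop structure on a 2D 5-point stencil and checks that it produces the same result as the plain loops; Python will not show the cache benefit, and the function names and block sizes are hypothetical.

```python
def stencil_naive(grid):
    """5-point stencil over the interior points, row by row."""
    ny, nx = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(1, ny - 1):
        for x in range(1, nx - 1):
            out[y][x] = (grid[y - 1][x] + grid[y + 1][x]
                         + grid[y][x - 1] + grid[y][x + 1]
                         - 4.0 * grid[y][x])
    return out


def stencil_blocked(grid, by, bx):
    """Same stencil, visiting the interior in tiles of by x bx points."""
    ny, nx = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y0 in range(1, ny - 1, by):            # loop over tiles
        for x0 in range(1, nx - 1, bx):
            for y in range(y0, min(y0 + by, ny - 1)):   # loop inside a tile
                for x in range(x0, min(x0 + bx, nx - 1)):
                    out[y][x] = (grid[y - 1][x] + grid[y + 1][x]
                                 + grid[y][x - 1] + grid[y][x + 1]
                                 - 4.0 * grid[y][x])
    return out


grid = [[float((3 * y + 7 * x) % 11) for x in range(13)] for y in range(9)]
print(stencil_blocked(grid, by=4, bx=5) == stencil_naive(grid))  # True
```

In a compiled kernel the tile sizes would be tuned so that the planes touched by one tile fit in cache.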


FROM MULTI TO MANY CORES

• The CPU clock race is over; multicores and manycores dominate.

• Multicores have many well-known problems: memory bandwidth, coherence of big caches and high power consumption

• Among manycores, Cell/B.E. is currently the best bet for HPC. Why?

– Commodity: the same chip is used from computers to game consoles

– Programming is not so hard

– Delivers high performance; a remarkable example is the 1 PFlops Roadrunner supercomputer

– Power efficient

• In the following sections we put the last three statements to the test


CELL/B.E. MAPPING (I)

• Programming the SPEs in an efficient way is a challenging task because:

1. load/store instructions only operate on the Local Store (LS), which is small (256 Kbytes, shared for code and data);

2. accesses to the main memory only happen via Direct Memory Access (DMA) operations, and are performed independently by a Memory Flow Controller (MFC), so computation and transfers have to be overlapped;

3. also, DMA performance is influenced by usage parameters (transfer block size and alignment, average concurrent requests, bank congestion, controller congestion, NUMA issues);

4. the use of SIMD instructions requires an appropriate data layout (padding and alignment);

5. the branch predictors are simple, and the misprediction penalty is high
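Points 3 and 4 of the list above (transfer block size, padding and alignment) can be illustrated with a small helper that pads a buffer and splits it into DMA-sized pieces. The 16 KB limit per single MFC DMA transfer is a Cell/B.E. property; the choice of 128-byte padding and the names `padded_size` and `dma_chunks` are assumptions of this sketch, not the scheme used in the RTM code.

```python
DMA_MAX_BYTES = 16 * 1024   # largest single MFC DMA transfer
DMA_ALIGN = 128             # padding granularity assumed in this sketch


def padded_size(n_bytes, align=DMA_ALIGN):
    """Round n_bytes up to the next multiple of align."""
    return (n_bytes + align - 1) // align * align


def dma_chunks(total_bytes, max_bytes=DMA_MAX_BYTES):
    """Split a padded buffer into (offset, size) transfers of at most max_bytes."""
    total = padded_size(total_bytes)
    return [(off, min(max_bytes, total - off))
            for off in range(0, total, max_bytes)]


# Hypothetical 50,000-byte buffer: padded to 50,048 bytes, four transfers.
for offset, size in dma_chunks(50_000):
    print(offset, size)
```

Every offset and size produced this way is a multiple of 128 bytes, so each transfer starts on an aligned address as long as the buffer itself does.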


CELL/B.E. MAPPING (II)

• We address points 1, 4 and 5 of the previous slide; at the same time, the parallelization strategy is presented

• point 1: the LS size sets the size of the planes that we process (streaming of planes)

• point 4: we just have to adapt the SIMDization made for JS21 blades

• point 5: fortunately this algorithm is not intensive in control-flow
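Point 1 can be turned into a simple budget calculation: the 256 KB Local Store has to hold the code plus every plane buffer that is resident at once. In the sketch below, only the LS size comes from the slides; the 64 KB reserved for code and stack, the count of 12 resident buffers and the 4-byte value size are hypothetical numbers chosen for illustration.

```python
LS_BYTES = 256 * 1024  # Local Store size, shared by code and data


def max_points_per_buffer(n_buffers, reserved_bytes, bytes_per_value=4):
    """Largest number of grid points one buffer can hold in the Local Store.

    n_buffers       -- buffers resident at the same time (input planes,
                       output planes, doubled if double-buffering is used)
    reserved_bytes  -- space kept for code, stack and other data
    bytes_per_value -- 4 assumes single-precision values
    """
    usable = LS_BYTES - reserved_bytes
    if usable <= 0 or n_buffers <= 0:
        raise ValueError("no Local Store space left for plane buffers")
    return usable // (n_buffers * bytes_per_value)


# Hypothetical budget: 64 KB reserved, 12 buffers resident at once.
print(max_points_per_buffer(n_buffers=12, reserved_bytes=64 * 1024))  # 4096
```

Under these assumed numbers each buffer holds 4096 points, which shows why the planes have to be streamed in pieces rather than loaded whole.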


CELL/B.E. MAPPING (III)

• point 2: using double-buffering we achieve almost perfect overlap between computation and communication. For instance, in the following figure, the body segment depicted has a computation time of 58 µs and a transfer time of 63.6 µs.

• point 3: through profiling and experiments we tune the parameters to avoid OS/architectural misbehaviors as much as possible
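A back-of-the-envelope model shows what the overlap from point 2 buys. With double-buffering, each steady-state step costs roughly the slower of the two stages instead of their sum. The sketch below uses the 58 µs and 63.6 µs figures quoted above; the function name `stream_time`, the block count of 100 and the omission of result write-back are assumptions of the model.

```python
def stream_time(n_blocks, t_compute, t_transfer, double_buffered):
    """Estimated time to stream n_blocks through one SPE (same unit as inputs)."""
    if n_blocks <= 0:
        return 0.0
    if not double_buffered:
        # Each block is fetched, then computed, with no overlap.
        return n_blocks * (t_transfer + t_compute)
    # The first fetch and the last compute are exposed; in between,
    # each step costs as much as the slower of the two stages.
    return t_transfer + (n_blocks - 1) * max(t_compute, t_transfer) + t_compute


serial = stream_time(100, 58.0, 63.6, double_buffered=False)
overlapped = stream_time(100, 58.0, 63.6, double_buffered=True)
print(f"serial: {serial:.1f} us, double-buffered: {overlapped:.1f} us")
print(f"gain: {serial / overlapped:.2f}x")
```

Because the transfer time is slightly longer than the computation time, this model is transfer-bound: the computation is hidden behind the DMA traffic.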


CELL/B.E. MAPPING (IV)

• Paraver trace (BSC in-house profiler): the first image shows the computation of one slice, and the second image shows only the body of the slice.

• Close to ideal workload balance among SPEs.


DEVELOPMENT EFFORT

• Case 1 of the following table summarizes the effort of mapping the RTM kernel to the JS21 blade. The total lines of code (LOC) include the kernel and the mentioned tasks (shot intro, etc.)

          Target   Project size (LOC)   Effort (man-months)   Speedup
case 1    JS21     1800                 2.25                  8.0× naïve version
case 2    QS21     2200                 3.25                  18.9× naïve version

• Case 2 summarizes the presented effort of mapping the RTM kernel to the QS21 blade

• The Cell RTM version has more LOC due to the explicit management of memory and threads. Also, this version takes longer to develop due to the initial steep learning curve. On the other hand, the speedup with respect to the naïve version is far larger than in case 1.


PERFORMANCE EVALUATION (I)

• 21.8 GB/s is reached, which corresponds to 95% of the max bandwidth.

[Figure: memory bandwidth [GB/s] versus number of Synergistic Processing Elements (1 to 8), for Z = 160, 168, 176, 184 and 192]

[Figure: scalability (1 to 8) for Z = 160, 168, 176, 184 and 192, compared with the ideal line]

• Scalability is linear up to 8 SPEs. It takes 2.5 s (101.6 GFlops) to process a benchmark velocity field of 192x384x560 (160 MB), 200 time steps.


PERFORMANCE EVALUATION (II)

• In terms of energy efficiency, our RTM Cell/B.E. implementation is 10.0x more power efficient than our JS21 RTM implementation, and within the same order of magnitude as other multicores.

Platform   Avg power [W]   Execution time [s]   Arithmetic throughput [GFlops]   Energy efficiency [GFlops/W]
JS21       267             45.0                 8.3                              0.03
QS21       370             2.5                  101.6                            0.32

• In terms of arithmetic throughput and execution time, our Cell RTM implementation is at least 12x faster than our JS21 RTM implementation.
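The ratios quoted on this slide can be recomputed from the execution time, throughput and energy efficiency columns of the table. The snippet below does only that arithmetic; the dictionary layout and variable names are made up for the example, and the values are copied from the table as printed (so the efficiency ratio inherits the rounding of the 0.03 entry).

```python
# Values copied from the table above.
rows = {
    "JS21": {"time_s": 45.0, "gflops": 8.3, "gflops_per_w": 0.03},
    "QS21": {"time_s": 2.5, "gflops": 101.6, "gflops_per_w": 0.32},
}

js21, qs21 = rows["JS21"], rows["QS21"]
time_speedup = js21["time_s"] / qs21["time_s"]
throughput_ratio = qs21["gflops"] / js21["gflops"]
efficiency_ratio = qs21["gflops_per_w"] / js21["gflops_per_w"]

print(f"execution time:    {time_speedup:.1f}x")
print(f"throughput:        {throughput_ratio:.1f}x")
print(f"energy efficiency: {efficiency_ratio:.1f}x")
```

The execution-time ratio (18x) is larger than the throughput ratio (about 12.2x), which is why "at least 12x" is the conservative figure.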


CONCLUSIONS

• Our RTM implementation is close to optimal according to performance indicators (e.g., 95% of the peak bandwidth throughput).

• Our RTM shows at least 12.0× speedup when compared against a reference traditional multi-core platform based on a PowerPC 970MP processor.

• The RTM implementation features an energy efficiency of 0.32 GFlops/W, which is 10.0× higher than the reference.

• Thanks to these performance results, RTM becomes a practically viable solution for everyday use in industrial-size deployments.

• Roadmap: integrate RTM into a workflow with tomography.


ACKNOWLEDGMENT

• Thanks to BSC for allowing us to publish this work.

• http://www.bsc.es

Thank you!


NOTES (I)

• CFL convergence condition, advection in hyperbolic PDE: $\nu = \frac{u\,\Delta t}{\Delta x}$
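As a small worked example of the CFL number above, the helper below computes $\nu$ for a given time step and the largest time step allowed by a chosen bound on $\nu$. The bound of 1.0 is the textbook limit for a 1D second-order explicit scheme; 3D and higher-order stencils have a tighter limit, so the default here is an assumption, as are the function names and the sample velocities.

```python
def cfl_number(u, dt, dx):
    """Courant number nu = u * dt / dx."""
    return u * dt / dx


def max_stable_dt(u_max, dx, nu_max=1.0):
    """Largest time step that keeps the Courant number at or below nu_max."""
    return nu_max * dx / u_max


# Hypothetical values: 10 m grid spacing, velocities in m/s.
print(cfl_number(2000.0, 0.001, 10.0))
print(max_stable_dt(4000.0, 10.0))
```

The time step has to be chosen for the fastest velocity in the model, since that is where the Courant number is largest.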

• time breakdown of RTM kernel phases

[Figure: relative execution time (%) per phase (Forward, Backward), broken down into Propagation, Shot/receivers, Boundary cond. and Cross-correlation]

[Figure: aggregate memory bandwidth (Gbytes/s), x axis from 1 to 8, with series for 64 bytes, 128 bytes, 256 bytes, and 512 bytes and larger]

• Measured bandwidth


NOTES (II)

• Paraver trace: the first shows the computation of one slice, the second only the body of the slice.

• Asmvis pipeline visualizer, this segment shows no stalls at all.
