Development of Seismic Imaging Tools (BSIT) on Cell/B.E. Architecture
Mauricio ARAYA-Polo
Computational Applications on Science and Engineering
➀ Seismic Imaging
➁ Barcelona Seismic Imaging Tools (BSIT)
➂ RTM
➃ RTM on Cell/B.E.
➄ Performance Evaluation
➅ Conclusions and Roadmap
Lyon, 26/11/2008
PETASCALE CONSUMERS
• The Fact: The oil industry has a large (terascale) installed computational capacity (e.g. TotalFinaElf is ranked 17th on the top500.org list, with a 106.2 TeraFlops machine).
  The Need: Complex oil-field analysis requires petascale capacity.
  The Examples: sub-salt seismic imaging of the Gulf of Mexico and offshore Brazil.
• One of BSC's contributions to the PRACE applications suite (WP6) is Barcelona Seismic Imaging Tools (BSIT).
• Seismic imaging tools are fundamental in the decision-making chain regarding oil drilling (each drilling costs around $150 million, and the industry's average success ratio is 10%).
SEISMIC IMAGING
• Geophysics challenge: port the best (so far) seismic imaging technique (RTM) to an HPC environment. The goal is to assess the feasibility of the technique as a daily-work tool.
• Multiple levels of parallelism (shots, domain, thread, data); more than one tool is involved.
BSIT
• BSIT applications:
– Shots DB and management tool, which deals with the top-level parallelism
– Forward Modeling and Reverse Time Migration (RTM) support MPI (domain decomposition), OpenMP (threads) and SIMD (data)
– All of them are implemented on MareNostrum (JS21 blades - PowerPC), and their porting to MariCel (QS22 blades - Cell/B.E.) is underway
• RTM, among the mentioned applications, is the main tool of the suite
REVERSE TIME MIGRATION (I)
• Among seismic imaging techniques, Reverse Time Migration (RTM) is the tool of choice because of the resulting image quality.
• RTM consists of two-way acoustic wave propagation in a given medium.
• RTM computational costs used to be unaffordable, in both I/O and computing.
REVERSE TIME MIGRATION (II)
• The acoustic wave propagation equation is a partial differential equation (PDE) with the following form (assuming isotropic, non-elastic media):
$$\frac{\partial^2 p(t,z,x,y)}{\partial t^2} - c^2\,\nabla^2 p(t,z,x,y) = s(t)$$
The inputs are the velocity field c and the source wavelet s(t). The output is a pressure wave-field.
• The equation is solved using the Finite Difference (FD) method: derivatives are discretized in space (stencil computation) and time (integration depending on previous time steps)
• Time discretization has the following form:
$$p^{t} = 2\,p^{t-1} - p^{t-2} + \Delta t^2\left(c^2\,\nabla^2 p^{t-1} + s^{t-1}\right)$$
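The time integration above can be illustrated with a minimal sketch. It assumes a 1D grid, a second-order spatial stencil, and zero-valued boundaries; the production kernel is 3D and its stencil order is not stated on the slides. The names `fd_step` and `source` are hypothetical.

```python
def fd_step(p_prev, p_curr, c, dt, dx, source=None):
    """Advance the pressure field by one time step (1D, 2nd-order stencil).

    p_prev, p_curr: pressure at the two previous time steps (lists of floats)
    c: velocity at each grid point
    source: optional (index, amplitude) pair injected at this step (assumption)
    """
    n = len(p_curr)
    p_next = [0.0] * n  # boundary points stay at zero for simplicity
    for i in range(1, n - 1):
        # Spatial stencil: discrete second derivative of the current field
        lap = (p_curr[i - 1] - 2.0 * p_curr[i] + p_curr[i + 1]) / (dx * dx)
        # Time integration depends on the two previous time steps
        p_next[i] = 2.0 * p_curr[i] - p_prev[i] + (dt * dt) * (c[i] ** 2) * lap
    if source is not None:
        idx, amp = source
        p_next[idx] += dt * dt * amp
    return p_next


n = 11
c = [2000.0] * n          # constant velocity, illustrative value
zeros = [0.0] * n
p1 = fd_step(zeros, zeros, c, dt=0.001, dx=10.0, source=(5, 1.0))
p2 = fd_step(zeros, p1, c, dt=0.001, dx=10.0)
```

After the second step, the impulse injected at the center has started to spread symmetrically to its neighbors.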
REVERSE TIME MIGRATION (III)
• Cross-correlation between forward and backward wave propagation
• The output image is generated according to:
$$I(z,x,y) = \sum_{t} S(t,z,x,y)\,R(t,z,x,y)$$
where S is the source wave-field, and R is the receiver wave-field (Biondi and Shan 2002)
• A regular campaign runs on the order of 10,000 shots, which means months of computation time. Also, velocity model sizes are on the order of 1-10 GB per shot.
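The imaging condition (the sum over time of S times R) can be sketched as follows. This is an illustration only: the wave-fields are stored as plain `[time][point]` lists with the spatial grid flattened, and the function name `imaging_condition` is hypothetical. In practice the receiver wave-field is propagated backward in time, so the source wave-field has to be stored or recomputed to line up the time steps.

```python
def imaging_condition(source_wf, receiver_wf):
    """Zero-lag cross-correlation: image[i] = sum over t of S[t][i] * R[t][i]."""
    n_points = len(source_wf[0])
    image = [0.0] * n_points
    for s_t, r_t in zip(source_wf, receiver_wf):
        for i in range(n_points):
            image[i] += s_t[i] * r_t[i]
    return image


# Two time steps, two grid points (toy values)
S = [[1.0, 2.0], [3.0, 4.0]]
R = [[5.0, 6.0], [7.0, 8.0]]
image = imaging_condition(S, R)
```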
JS21 BLADE MAPPING
• JS21 blade features: 2 CPUs x 2 cores, 2.3 GHz, 8 GB RAM, 1 MB L2 per core
• The memory access pattern is critical; enhancement: blocking.
• Low data-access/computation ratio; solution: AltiVec Single Instruction Multiple Data (SIMD) operations (data parallelism)
• 4 threads per JS21 blade are exploited by using OpenMP (thread parallelism)
• Good scalability with OpenMP, up to 3.8x (out of 4)
• This is the end-of-the-line RTM kernel for this platform, including fine tuning and extra optimizations
• It takes 45 s (8.3 GFlops) to process a benchmark velocity field of 192x384x560 (160 MB) for 200 time steps
• For models bigger than the node's RAM, we deploy domain decomposition with MPI
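The blocking enhancement mentioned above can be illustrated with a small sketch. It assumes a 2D plane and a 5-point stencil purely for brevity; the block sizes and the names `blocks` and `blocked_stencil_2d` are hypothetical and do not come from the BSIT code. The idea is that each tile is small enough to stay in cache while its points are updated.

```python
def blocks(n, block):
    """Yield (start, stop) ranges that tile range(n) in chunks of `block`."""
    for start in range(0, n, block):
        yield start, min(start + block, n)


def blocked_stencil_2d(p, bx, by):
    """Apply a 5-point stencil to the interior of `p`, one tile at a time."""
    nx, ny = len(p), len(p[0])
    out = [[0.0] * ny for _ in range(nx)]
    for x0, x1 in blocks(nx, bx):
        for y0, y1 in blocks(ny, by):
            # Work on one tile; boundary points are skipped
            for x in range(max(x0, 1), min(x1, nx - 1)):
                for y in range(max(y0, 1), min(y1, ny - 1)):
                    out[x][y] = (p[x - 1][y] + p[x + 1][y] +
                                 p[x][y - 1] + p[x][y + 1] - 4.0 * p[x][y])
    return out


# p = x^2 + y^2 has a discrete Laplacian of exactly 4 at every interior point
plane = [[float(x * x + y * y) for y in range(5)] for x in range(7)]
tiled = blocked_stencil_2d(plane, bx=3, by=2)
untiled = blocked_stencil_2d(plane, bx=7, by=5)
```

Blocking changes only the traversal order, so the tiled and untiled results are identical.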
FROM MULTI TO MANY CORES
• The CPU clock race is over; multicores and manycores dominate.
• Multicores have many well-known problems: memory bandwidth, coherence of big caches and high power consumption
• Among manycores, Cell/B.E. is currently the best bet for HPC. Why?
– Commodity: the same chip from computers to game consoles
– Programming is not so hard
– Delivers high performance; a remarkable example is the 1 PFlops Roadrunner supercomputer
– Power efficient
• In the following sections we will put the last three statements to the test
CELL/B.E. MAPPING (I)
• Programming the SPEs in an efficient way is a challenging task because:
1. load/store instructions only operate on the Local Store (LS), which is small (256 Kbytes, shared for code and data);
2. accesses to the main memory only happen via Direct Memory Access (DMA) operations, and are performed independently by a Memory Flow Controller (MFC); computation and transfers should be overlapped;
3. also, DMA performance is influenced by usage parameters (transfer block size and alignment, average concurrent requests, bank congestion, controller congestion, NUMA issues);
4. the use of SIMD instructions requires an appropriate data layout (padding and alignment);
5. the branch predictors are simple, and the misprediction penalty is high
CELL/B.E. MAPPING (II)
• We address points 1, 4 and 5 of the previous slide; at the same time, the parallelization strategy is presented
• point 1: the LS size sets the size of the planes that we process (streaming of planes)
• point 4: we just have to adapt the SIMDization made for JS21 blades
• point 5: fortunately, this algorithm is not control-flow intensive
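Point 1 can be made concrete with a back-of-the-envelope budget for the local store. Only the 256 KB LS size comes from the slides; the space reserved for code, the number of buffers, the single-precision element size, and the 128-byte alignment are illustrative assumptions, and `plane_budget` is a hypothetical helper.

```python
def plane_budget(ls_bytes=256 * 1024, reserved_bytes=64 * 1024,
                 n_buffers=6, elem_bytes=4, align_bytes=128):
    """Return the largest per-buffer element count that fits in the local store.

    All defaults except the 256 KB local store size are assumptions.
    """
    usable = ls_bytes - reserved_bytes        # space left for data buffers
    per_buffer = usable // n_buffers          # bytes available to each buffer
    per_buffer -= per_buffer % align_bytes    # round down to the alignment
    return per_buffer // elem_bytes


elements = plane_budget()
```

Under these assumptions each buffer holds 8192 single-precision values, which bounds the size of the plane segment streamed through an SPE at one time.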
CELL/B.E. MAPPING (III)
• point 2: using double-buffering we achieve almost perfect overlap between computation and communication. For instance, in the following figure, the body segment has a computation time of 58 µs and a transfer time of 63.6 µs.
• point 3: through profiling and experiments we tune the parameters to avoid OS/architectural misbehaviors as much as possible
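The double-buffering idea in point 2 can be sketched in a platform-neutral way. This is not SPE code: a single worker thread stands in for the MFC, and `fetch` and `compute` are hypothetical callbacks standing in for the DMA transfer and the stencil kernel. While chunk i is being computed, the transfer of chunk i+1 is already in flight.

```python
from concurrent.futures import ThreadPoolExecutor


def process_stream(n_chunks, fetch, compute):
    """Process chunks 0..n_chunks-1, fetching chunk i+1 while computing chunk i."""
    results = []
    if n_chunks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as dma:  # stands in for the MFC
        pending = dma.submit(fetch, 0)              # prime the first buffer
        for i in range(n_chunks):
            current = pending.result()              # wait for the transfer
            if i + 1 < n_chunks:
                pending = dma.submit(fetch, i + 1)  # start the next transfer
            results.append(compute(current))        # overlaps with the transfer
    return results


sums = process_stream(3, fetch=lambda i: [i] * 4, compute=sum)
```

Under an idealized model with full overlap, the per-segment cost for the figures quoted above is max(58, 63.6) = 63.6 µs, compared with 58 + 63.6 = 121.6 µs without overlap.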
CELL/B.E. MAPPING (IV)
• Paraver trace (BSC in-house profiler). The first image shows the computation of just one slice, and the second image shows just the body of the slice.
• Close to ideal workload balance among SPEs.
DEVELOPMENT EFFORT
• case 1 of the following table summarizes the effort of mapping the RTM kernel to the JS21 blade. The total lines of code (LOC) include the kernel and the mentioned tasks (shot intro, etc.)

         Target   Project size   Effort         Speedup
                  (LOC)          (man-months)   (vs. naïve version)
case 1   JS21     1800           2.25           8.0×
case 2   QS21     2200           3.25           18.9×

• case 2 summarizes the presented effort of mapping the RTM kernel to the QS21 blade
• The Cell RTM version has more LOC due to the explicit management of memory and threads. Also, this version takes longer to develop due to the initially steep learning curve. On the other hand, the speedup with respect to the naïve version is far larger than in case 1.
PERFORMANCE EVALUATION (I)
• 21.8 GB/s is reached, which corresponds to 95% of the maximum bandwidth.
[Figure: memory bandwidth [GB/s] vs. number of Synergistic Processing Elements (1-8), for Z = 160, 168, 176, 184, 192]
[Figure: scalability vs. number of Synergistic Processing Elements (1-8), for Z = 160, 168, 176, 184, 192, with the ideal line]
• Scalability is linear up to 8 SPEs. It takes 2.5 s (101.6 GFlops) to process a benchmark velocity field of 192x384x560 (160 MB) for 200 time steps.
PERFORMANCE EVALUATION (II)
• In terms of energy efficiency, our RTM Cell/B.E. implementation is 10.0x more power efficient than our JS21 RTM implementation, and of the same order of magnitude as other multicores.

Platform   Avg power   Execution time   Arithmetic throughput   Energy efficiency
           [W]         [s]              [GFlops]                [GFlops/W]
JS21       267         45.0             8.3                     0.03
QS21       370         2.5              101.6                   0.32

• In terms of arithmetic throughput and execution time, our Cell RTM implementation is at least 12x faster than our JS21 RTM implementation.
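The "at least 12x" figure can be checked directly from the tabulated execution times and throughputs. The snippet below is only a worked ratio over the numbers reported in the table; the dictionary layout is an illustrative choice.

```python
# Values copied from the performance table
js21 = {"time_s": 45.0, "gflops": 8.3}
qs21 = {"time_s": 2.5, "gflops": 101.6}

# Speedup measured by execution time
time_speedup = js21["time_s"] / qs21["time_s"]

# Speedup measured by arithmetic throughput
throughput_speedup = qs21["gflops"] / js21["gflops"]

# The conservative claim is the smaller of the two ratios
lower_bound = min(time_speedup, throughput_speedup)
```

The throughput ratio is the smaller of the two, a little above 12, which matches the conservative wording on the slide.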
CONCLUSIONS
• Our RTM implementation is close to optimal according to performance indicators (e.g., 95% of the peak bandwidth throughput).
• Our RTM shows at least a 12.0× speedup when compared against a reference traditional multi-core platform based on a PowerPC 970MP processor.
• The RTM implementation features an energy efficiency of 0.32 GFlops/W, which is 10.0× higher than the reference.
• Thanks to these performance results, RTM becomes a practically viable solution for everyday use in industrial-size deployments.
• Roadmap: integrate RTM into a workflow with tomography.
ACKNOWLEDGMENT
• Thanks to BSC for allowing us to publish this work.
• http://www.bsc.es
Thank you!
NOTES (I)
• CFL convergence condition, advection in hyperbolic PDE: $\nu = \frac{u\,\Delta t}{\Delta x}$
• time breakdown of RTM kernel phases
[Figure: relative execution time (%) of the Forward and Backward phases, broken down into Propagation; Shot, receivers; Boundary cond.; Cross-correlation]
[Figure: aggregate memory bandwidth (Gbytes/s) vs. number of Synergistic Processing Elements (1-8), for 64 bytes, 128 bytes, 256 bytes, and 512 bytes and larger]
• Measured bandwidth
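The CFL number noted at the top of this slide can be evaluated with a short sketch. The stability limit `nu_max` depends on the scheme and the number of dimensions, so it is left as a parameter here; its default and the sample values are assumptions, and both function names are hypothetical.

```python
def cfl_number(u, dt, dx):
    """CFL number nu = u * dt / dx for wave speed u, time step dt, spacing dx."""
    return u * dt / dx


def max_stable_dt(u_max, dx, nu_max=1.0):
    """Largest time step that keeps the CFL number at or below nu_max."""
    return nu_max * dx / u_max


nu = cfl_number(u=2000.0, dt=0.001, dx=10.0)          # illustrative values
dt_limit = max_stable_dt(u_max=4000.0, dx=10.0, nu_max=0.5)
```

The time step has to be chosen against the fastest velocity in the model, since that is where the CFL number is largest.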
NOTES (II)
• Paraver trace: the first shows one slice computation, the second just the body of the slice.
• Asmvis pipeline visualizer: this segment shows no stalls at all.