The basis and perspectives of an exascale algorithm: our ExaFMM project

Lorena A. Barba, Boston University

IMA Annual Program Year Workshop: High Performance Computing and Emerging Architectures, Minneapolis, January 10-14, 2011

Acknowledgements:

Work in Barba’s group was done in collaboration with Jaydeep Bardhan (Rush), Matthew Knepley (UChicago), Tsuyoshi Hamada (Nagasaki Advanced Computing Center), Rio Yokota (postdoc at BU) and graduate students Felipe Cruz, Christopher Cooper, Anush Krishnan and Simon Layton.

in Nagasaki Advanced Computing Center

“... the fundamental law of computer science [is]: the faster the computer, the greater the importance of speed of algorithms”

Trefethen & Bau “Numerical Linear Algebra” SIAM

The curious story of conjugate gradient (CG) algorithms

‣ Iterative methods:

‣ sequence of iterates converging to the solution

‣ CG matrix iterations bring the O(N^3) cost to O(N^2)

‣ 1950s — N too small for CG to be competitive

‣ 1970s — renewed attention
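The cost argument can be made concrete with a minimal conjugate gradient sketch in plain Python. It assumes a dense, symmetric positive-definite matrix, so each iteration costs one O(N^2) matrix-vector product; the overall O(N^2) figure additionally assumes the iteration count stays roughly independent of N. The helper `example_matrix` is a hypothetical well-conditioned test case, not something taken from the talk.

```python
def matvec(A, x):
    """Dense matrix-vector product: O(N^2) operations."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive-definite A.

    Returns the approximate solution and the number of iterations used.
    """
    n = len(b)
    if max_iter is None:
        max_iter = 10 * n  # generous cap; an assumption of this sketch
    x = [0.0] * n
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    iterations = 0
    while iterations < max_iter and rs_old ** 0.5 > tol:
        Ap = matvec(A, p)  # the only O(N^2) step in each iteration
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
        iterations += 1
    return x, iterations


def example_matrix(n):
    """Hypothetical SPD test matrix: 4 on the diagonal, -1 on the neighbours."""
    return [[4.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]
```

For this test matrix the solver needs far fewer than N iterations, which is the regime where the iterative method beats the O(N^3) elimination.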

[Figure: log-log plot comparing Gaussian elimination, O(N^3), with CG iterative methods, O(N^2), for N up to 10^6.]

‣ 1946: The Monte Carlo method.

‣ 1947: Simplex Method for Linear Programming.

‣ 1950: Krylov Subspace Iteration Method.

‣ 1951: The Decompositional Approach to Matrix Computations.

‣ 1957: The Fortran Compiler.

‣ 1959: QR Algorithm for Computing Eigenvalues.

‣ 1962: Quicksort Algorithms for Sorting.

‣ 1965: Fast Fourier Transform.

‣ 1977: Integer Relation Detection.

‣ 1987: Fast Multipole Method.

Dongarra & Sullivan, IEEE Comput. Sci. Eng., Vol. 2(1): 22-23 (2000)

Fast multipole method

‣ Solves N-body problems

๏ e.g. astrophysical gravity interactions

๏ reduces operation count from O(N^2) to O(N)

$$f(y) = \sum_{i=1}^{N} c_i K(y - x_i), \qquad y \in [1 \ldots N]$$
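For reference, the O(N^2) baseline that the FMM replaces is the brute-force evaluation of this sum. The sketch below assumes points in the plane and an illustrative 1/r kernel with K(0) set to zero to skip self-interaction; the names `inverse_distance` and `direct_sum` are hypothetical and the slide does not fix a particular kernel.

```python
import math


def inverse_distance(dx, dy):
    """Example kernel K: 1/r in the plane; K(0) is taken as 0 (assumption)."""
    r = math.hypot(dx, dy)
    return 0.0 if r == 0.0 else 1.0 / r


def direct_sum(targets, sources, weights, kernel=inverse_distance):
    """Evaluate f(y) = sum_i c_i K(y - x_i) at every target y by brute force.

    The double loop performs len(targets) * len(sources) kernel evaluations,
    which is the O(N^2) cost when targets and sources are the same N points.
    """
    values = []
    for (yx, yy) in targets:
        total = 0.0
        for (xx, xy), c in zip(sources, weights):
            total += c * kernel(yx - xx, yy - xy)
        values.append(total)
    return values
```

Calling `direct_sum(points, points, weights)` gives the all-pairs result that a fast method has to reproduce to within its error tolerance.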

O(N) advantage

‣ Hierarchical methods:

‣ sequence of refinements converging (or contributing) to the solution

‣ FMM brings the O(N^2) cost to O(N)

‣ 1990s: MD codes dropped FMM, as N was too small for it to be competitive

‣ Now: renewed attention

[Figure: log-log plot comparing O(N^2) and O(N) growth for N up to 10^6.]

‣ space subdivision tree structure

‣ to find “near” and “far” bodies
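A single level of such a subdivision can be sketched as follows. This assumes 2D points in the unit square and takes "near" to mean the target box and its adjacent boxes; both choices, and all function names, are assumptions for illustration. A real treecode or FMM repeats this over every level of the tree.

```python
def box_index(x, y, level):
    """Integer (ix, iy) of the box containing (x, y) on a 2^level by 2^level grid."""
    n = 1 << level
    ix = min(int(x * n), n - 1)  # clamp so that x == 1.0 stays inside the grid
    iy = min(int(y * n), n - 1)
    return ix, iy


def bin_particles(points, level):
    """Map each occupied box to the list of particle indices it contains."""
    boxes = {}
    for i, (x, y) in enumerate(points):
        boxes.setdefault(box_index(x, y, level), []).append(i)
    return boxes


def split_near_far(target_box, boxes):
    """Split particle indices into near (same or adjacent box) and far (the rest)."""
    tx, ty = target_box
    near, far = [], []
    for (bx, by), members in boxes.items():
        if max(abs(bx - tx), abs(by - ty)) <= 1:
            near.extend(members)
        else:
            far.extend(members)
    return sorted(near), sorted(far)
```

Near particles would be handled by direct summation, while far particles are candidates for approximation through expansions.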

Flow of FMM calculation

[Figure: source particles and target particles; information moves from red to blue.]

‣ P2M, particle to multipole: treecode & FMM

‣ M2M, multipole to multipole: treecode & FMM

‣ M2L, multipole to local: FMM

‣ L2L, local to local: FMM

‣ L2P, local to particle: FMM

‣ M2P, multipole to particle: treecode

‣ P2P, particle to particle: treecode & FMM

‣ The whole algorithm in a sketch

[Figure: upward sweep (create multipole expansions) and downward sweep (evaluate local expansions); stages P2M, M2M, M2L, L2L, L2P.]
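As a rough illustration of the upward sweep only, the sketch below implements P2M and M2M for the 2D Laplace potential, sum of q_i log(z - z_i), using complex-variable multipole expansions. The kernel, the truncation order `P` and all function names are assumptions; the talk does not specify them. The expansion is evaluated directly at a distant point (the M2P step used by treecodes), so the downward sweep (M2L, L2L, L2P) is not shown.

```python
import cmath
import math
from math import comb

P = 12  # truncation order of the expansion (assumed value)


def p2m(charges, positions, center, p=P):
    """P2M: coefficients [Q, a_1, ..., a_p] about `center`.

    The far field is approximated by Q*log(z - c) + sum_k a_k / (z - c)^k,
    with Q = sum q_i and a_k = -sum q_i (z_i - c)^k / k.
    """
    coeffs = [complex(sum(charges))] + [0j] * p
    for q, z in zip(charges, positions):
        d = z - center
        for k in range(1, p + 1):
            coeffs[k] -= q * d ** k / k
    return coeffs


def m2m(child, child_center, parent_center, p=P):
    """M2M: shift a child expansion to the parent's center."""
    z0 = child_center - parent_center
    parent = [child[0]] + [0j] * p
    for l in range(1, p + 1):
        b = -child[0] * z0 ** l / l
        for k in range(1, l + 1):
            b += child[k] * z0 ** (l - k) * comb(l - 1, k - 1)
        parent[l] = b
    return parent


def m2p(coeffs, center, z):
    """Evaluate the multipole expansion at a far point z (real part = potential)."""
    d = z - center
    series = sum(a / d ** k for k, a in enumerate(coeffs) if k > 0)
    return (coeffs[0] * cmath.log(d) + series).real


def direct(charges, positions, z):
    """Reference potential sum_i q_i log|z - z_i|."""
    return sum(q * math.log(abs(z - zi)) for q, zi in zip(charges, positions))
```

Summing the shifted child expansions reproduces the expansion formed directly at the parent center, which is what lets the upward sweep build coarse-level expansions without revisiting the particles.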

‣ Contributions from Barba group:

‣ Parallelization strategy:

[Figure: tree split at level k into a root tree and sub-trees 1 to 8. Legend: M2M and L2L translations, M2L transformation, local domain.]

‣ Graph representation:

[Figure: graph with vertex weights w_i and w_j joined by an edge of weight c_ij.]

Ref. — F. A Cruz, M. G. Knepley, L. A. Barba, PetFMM—A dynamically load-balancing parallel fast multipole library, Int. J. Num. Meth. Eng., Vol. 85(4): 403–428 (Jan. 2011)
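One way to read the weighted graph is as input to a partitioner. The sketch below only scores a given assignment; it assumes that w_i is a work estimate for sub-tree i and c_ij a communication estimate between sub-trees i and j, which the slide itself does not spell out. The function name and the sample numbers in the usage note are hypothetical.

```python
def partition_cost(vertex_work, edge_comm, assignment, num_parts):
    """Score an assignment of graph vertices to processes.

    vertex_work: {vertex: w_i}, edge_comm: {(i, j): c_ij},
    assignment: {vertex: part index}.
    Returns (loads per part, total weight of cut edges, load imbalance).
    """
    loads = [0.0] * num_parts
    for v, w in vertex_work.items():
        loads[assignment[v]] += w
    # An edge contributes communication only if its ends sit on different parts.
    cut = sum(c for (i, j), c in edge_comm.items()
              if assignment[i] != assignment[j])
    imbalance = max(loads) / (sum(loads) / num_parts)
    return loads, cut, imbalance
```

With work `{0: 4, 1: 1, 2: 1, 3: 2}` and edges `{(0, 1): 3, (1, 2): 1, (2, 3): 2}`, putting vertices 0 and 1 together cuts little communication but leaves the loads uneven, while isolating vertex 0 balances the work at a higher communication cost. A graph partitioner searches for an assignment that trades these off.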

The algorithmic and hardware speed-ups properly multiply

GPU implementation of FMM kernels

GPU Gems, Volume IV

In press, to appear February 2011 (?)

Codes in http://code.google.com/p/gemsfmm/

GPU Gems, Volume III

FMM on GPU

[Figure: log-log plot of time [s] versus N, for N from 10^3 to 10^7, with curves Direct (CPU), Direct (GPU), FMM (CPU) and FMM (GPU). Successive builds of the slide mark speed-ups of 200x and 40x.]

“Treecode and fast multipole method for N-body simulation with CUDA”, chapter in GPU Gems IV, in press

‣ the right methods and algorithms can provide leaps in capability many times what Moore’s law would deliver in a given period

‣ new hardware for HPC adds to the mix for a new era of discovery via computation

‣ open source & open data enable tackling large, complex computational projects

Parallel FMM on multi-GPUs

Strong Scaling:

parallel efficiency of 80% at 256, and 50% at 512 nodes

N = 10^8

p = 10

Degima cluster at NACC, with Infiniband comm

[Figure: breakdown of time x Nprocs [s] versus Nprocs, from 1 to 512. Legend: tree construction, mpisend p2p, mpisend m2l, P2P kernel, P2M kernel, M2M kernel, M2L kernel, L2L kernel, L2P kernel.]
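The quoted efficiencies can be related to timings with the usual strong-scaling definition, assumed here to be E = T(1) / (P * T(P)) with a single-process baseline; the slide does not state which baseline was used. The function names are hypothetical, and the timings in the usage note are made up to reproduce the 80% and 50% figures.

```python
def parallel_efficiency(t_serial, t_parallel, nprocs):
    """Strong-scaling efficiency: ideal time t_serial / nprocs over measured time."""
    return t_serial / (nprocs * t_parallel)


def time_times_nprocs(t_serial, efficiency):
    """Value of time x Nprocs at a given efficiency (flat when scaling is ideal)."""
    return t_serial / efficiency
```

For a hypothetical serial time of 1024 s, measured times of 5 s on 256 processes and 4 s on 512 processes correspond to efficiencies of 0.8 and 0.5. Plotting time x Nprocs, as in the figure, shows lost efficiency as growth above a flat line.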

GPU breakdown

‣ N = 10^8, on one node

[Figure: two breakdowns of cputime [s]. First legend: tree construction, mpisend p2p, mpisend m2l, P2P kernel, P2M kernel, M2M kernel, M2L kernel, L2L kernel, L2P kernel. Second legend: tree construction, mpisend p2p, mpisend m2l, chunking task, buffering data, cudaSetDevice, cudaMalloc, cudaMemcpy, cudaKernel.]

Under revision for Comput. Phys. Comm.

See also http://barbagroup.bu.edu/

Suitability of the FMM for achieving exascale

FMM is a particularly favorable algorithm for the emerging heterogeneous, many-core architectural landscape.

Spatial and temporal locality

‣ Algorithm has intrinsic geometric locality

‣ Access patterns could be non-local

๏ work with sorted particle indices, access via a start-offset combination

‣ Temporal locality:

๏ queue GPU tasks before execution, buffer the input and output of data making memory access contiguous

➡ The FMM is not a locality-sensitive application
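The sorted-index idea can be sketched as follows. It assumes each particle already carries the index of its cell, and it interprets the "start-offset combination" as a start position plus a count per cell; that reading, and the function names, are assumptions.

```python
def sort_by_cell(cell_of_particle):
    """Sort particle indices by cell and record where each cell's run begins.

    Returns (order, start, count): `order` lists particle indices cell by cell,
    and the particles of cell c occupy order[start[c] : start[c] + count[c]].
    """
    order = sorted(range(len(cell_of_particle)),
                   key=lambda i: cell_of_particle[i])
    start, count = {}, {}
    for pos, i in enumerate(order):
        c = cell_of_particle[i]
        if c not in start:
            start[c] = pos   # first position of this cell in the sorted order
            count[c] = 0
        count[c] += 1
    return order, start, count


def particles_in_cell(order, start, count, cell):
    """Contiguous slice of particle indices belonging to `cell`."""
    s = start[cell]
    return order[s:s + count[cell]]
```

Once particle data is permuted into `order`, every cell reads one contiguous block of memory, which is the property that makes buffered GPU transfers straightforward.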

Global data communications and synchronization

‣ The two most time-consuming operations in the FMM:

๏ p2p — purely local

๏ m2l — exhibits “hierarchical synchronization”

[Figure: upward and downward sweeps through the tree, labelled P2M, M2M, M2L, L2L, L2P evaluation, and P2P at the leaf level.]

Load balancing

‣ FMM load-balanced

๏ space-filling curves: Morton, Hilbert

๏ work-only (no comm)

‣ PetFMM:

๏ graph-partitioning

๏ will it scale?

๏ hierarchical partition?
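The space-filling-curve approach can be sketched for the Morton (Z-order) curve in 2D. Boxes are sorted along the curve and the curve is cut into pieces of roughly equal work, with no account taken of communication. The per-box work estimates and the function names are assumptions for illustration.

```python
def morton_key(ix, iy, bits=16):
    """Interleave the bits of (ix, iy): ix fills even bit positions, iy odd ones."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)
        key |= ((iy >> b) & 1) << (2 * b + 1)
    return key


def partition_by_work(boxes, work, nparts):
    """Cut the Morton-ordered list of boxes into nparts pieces of similar work.

    boxes: list of (ix, iy); work: matching list of work estimates.
    Each box goes to the part containing the midpoint of its work interval.
    """
    order = sorted(range(len(boxes)), key=lambda i: morton_key(*boxes[i]))
    total = sum(work)
    parts = [[] for _ in range(nparts)]
    running = 0.0
    for i in order:
        midpoint = running + 0.5 * work[i]
        part = min(int(midpoint * nparts / total), nparts - 1)
        parts[part].append(boxes[i])
        running += work[i]
    return parts
```

Because nearby keys tend to be nearby in space, each piece stays fairly compact, but nothing in this scheme accounts for the communication between pieces, which is the gap the graph-partitioning approach aims to address.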

Plan for an “ExaFMM”

1) our present FMM technology is state-of-the-art;

2) we possess the potential for a substantial performance hike

AND all our codes are always open!

Present FMM state-of-the-art

[Figure: log-log plot of time [s] versus N, for N from 10^3 to 10^6, with an O(N) reference. Legend: published KIFMM code, Salmon & Warren treecode, our published FMM code.]

Single-node performance:

timings of the published KIFMM code (2006), the S&W treecode (2000) and our code

‣ equal performance

‣ same accuracy, measured L2-norm error 10^-3

Single CPU core, Intel Core i7 2.67 GHz (no SSE)

New experimental FMM with higher performance

[Figure: log-log plot of time [s] versus N, for N from 10^3 to 10^7, with an O(N) reference. Legend: our published FMM, with new optimizations, with algorithmic improvements.]

Optimized code:

explicit inline assembly within the p2p kernel, implementing SIMD

‣ 5x speed-up, single precision

Algorithmic improvements:

i) hybridize FMM with treecode

ii) dynamic error-control

‣ other recent work

In IEEE International Symposium on Parallel & Distributed Processing (IPDPS), IEEE, pp. 1–12 (Atlanta, GA; April 2010)

Summary so far ...

‣ PetFMM: open library, dynamic load balancing, comm minimizing

๏ open question: will the strategy scale to 1000s of processors? hierarchical partition?

‣ Performance on single node:

๏ matching other state-of-the-art codes

‣ Algorithmic innovations:

๏ hybrid treecode/FMM

๏ variable order/variable box-opening for minimum work to achieve target accuracy

But there is more ...

‣ Fault-tolerance:

๏ traditional checkpointing no longer adequate by itself

‣ instead: replicate threads, correctness checks on-the-fly

๏ FMM allows natural correctness checks at the time of selecting p

‣ Autotuning the FMM:

๏ natural: use tests/work estimates to select particles per box, p, and box-opening parameters.

๏ parameter selection for load-balancing
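A toy version of such parameter selection is sketched below. It assumes a hypothetical work model in which near-field (P2P) work grows with the number of particles per box q, while far-field work shrinks with q and grows with p^2; the model, its cost constants and the function names are illustrative only. In practice the constants would come from timing tests on the target hardware.

```python
def estimated_work(n, q, p, near_cost=1.0, far_cost=1.0):
    """Hypothetical work estimate for n particles, q particles per box, order p."""
    near = near_cost * n * q              # direct interactions within neighbour boxes
    far = far_cost * (n / q) * p ** 2     # expansion work per box, n / q boxes
    return near + far


def autotune_particles_per_box(n, p, candidates, near_cost=1.0, far_cost=1.0):
    """Pick the candidate q with the smallest estimated work."""
    return min(candidates,
               key=lambda q: estimated_work(n, q, p, near_cost, far_cost))
```

Raising `far_cost` relative to `near_cost`, as might happen when the P2P kernel is much better optimized than the expansion kernels, shifts the selected q upward.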