38
Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter Krusche Department of Computer Science Centre for Scientific Computing University of Warwick [email protected]

Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Experimental Evaluation of BSP Programming Libraries - p. 1/26

Experimental Evaluation of BSP Programming Libraries

Peter Krusche

Department of Computer ScienceCentre for Scientific Computing

University of [email protected]

Page 2: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

•Outline

The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 2/26

Outline

Motivation...Study and compare the communicationcharacteristics and performance predictability of‘BSP-style’ communication libraries.

Outline1. The BSP Model

2. BSP Programming Libraries

3. Benchmarking

4. Performance and Predictability

Page 3: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

•The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 3/26

The BSP Model

The BSP model for parallel algorithms was usedwith some slight adaptations.

The model has been adapted here to use secondsinstead of flops as a base unit for the running time:

• p identical processor/memory pairs (computingnodes), computation speed f

• Arbitrary interconnection network, latency l,bandwidth g

Page 4: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

•The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 3/26

The BSP Model

M M M M M

Network

P1 P2 Pp...

• Programs are SPMD• Execution takes place in supersteps• Cost Formula : T = f · W + g · H + l · S

• As a base unit for communications, 8-byte doubles will beused

Page 5: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

•BSP Programming

•The BSPlib Standard

•BSPlib Implementations

•Other Libraries

• ‘BSP-style’ Programming

in MPI

Benchmarking

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 4/26

BSP Programming

‘BSP-style’ programming using a conventionalcommunications library (MPI/Cray shmem/...)• Barrier synchronizations for creating superstep structure

• Many libraries already provide functionality for one sidedcommunication/direct remote memory access (DRMA)

Using a specialized library (The Oxford BSPToolset/PUB/CGMlib/...)• Specialized communication primitives (bulk synchronous

message passing/DRMA)

• Some libraries (Oxford Toolset, PUB) include optimizedbarrier synchronization functions and routing

• Higher level of abstraction

Page 6: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

•BSP Programming

•The BSPlib Standard

•BSPlib Implementations

•Other Libraries

• ‘BSP-style’ Programming

in MPI

Benchmarking

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 5/26

The BSPlib Standard

Communication primitives:

• DRMA: buffered and unbuffered put, get

• BSMP: send and move

• Synchronization

• Combining and Broadcasting

For the experiments, a BSPlib-style wrapper librarywas created.

Page 7: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

•BSP Programming

•The BSPlib Standard

•BSPlib Implementations

•Other Libraries

• ‘BSP-style’ Programming

in MPI

Benchmarking

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 6/26

BSPlib Implementations

The Oxford BSP Toolset• Supports 3 kinds of base architecture: message passing,

shared memory, DRMA

• Experiments used message passing MPI interface

• Last release from ’98, compatibility issues on more modernsystems

PUB• Support for message passing and shared memory

architectures

• Experiments used message passing MPI interface

• Additional support for oblivious synchronization, processorgroups

• Less trouble with setup on all systems

• Advanced functionality e.g. for process migration

Page 8: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

•BSP Programming

•The BSPlib Standard

•BSPlib Implementations

•Other Libraries

• ‘BSP-style’ Programming

in MPI

Benchmarking

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 7/26

Other Libraries

CGMlib• Runs on top of message passing MPI

• Includes set of algorithms for sorting, list ranking, etc.

• Abstract C++ interface

• Lists of abstract datatypes with constant size are used fordata exchange

SSCRAP• Uses MPI (message passing) or Posix (SHMEM) for data

exchange

• Support for DRMA, BSMP, conventional message passing,collective operations, etc.

• ’Soft’ synchronization (send or receive)

• C++ interface

Page 9: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

•BSP Programming

•The BSPlib Standard

•BSPlib Implementations

•Other Libraries

• ‘BSP-style’ Programming

in MPI

Benchmarking

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 8/26

‘BSP-style’ Programming in MPI

Approach here: a BSPlib style MPI-1 library wasimplemented naively (without message combining,etc.)

• Isend/Recv for data exchange

• Barrier synchronization

• Emulated DRMA on top

Advantage: no overhead for send/putDrawback: high latency, presumably overhead forget operations

Page 10: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

•Systems used

•Measuring f

•Measuring g and l

•Bandwidth Surface

(aracari)•Bandwidth Surface

(argus)•Latency

•Benchmark Summary

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 9/26

Systems used

Measurements on parallel machines at

the Centre for Scientific Computing:

aracari: IBM cluster, 64 × 2-way SMP Pentium31.4 GHz/128 GB of memory(Interconnection Network: Myrinet 2000,MPI: mpich-gm)

argus: Linux cluster, 31 × 2-way SMP Pentium4Xeon 2.6 GHz processors/62 GB of memory(Interconnection Network: 100Mbit Ethernet,MPI: mpich-p4)

Page 11: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

•Systems used

•Measuring f

•Measuring g and l

•Bandwidth Surface

(aracari)•Bandwidth Surface

(argus)•Latency

•Benchmark Summary

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 10/26

Measuring f

Measuring algorithm perfor-mance on one node:

500 1000Matrix Size

0

100 M

200 M

300 M

400 M

500 M

Flo

p/s

BLASIJK loop

Measuring computation timeseparately in one run:

0 500 1000 1500 2000Matrix Size

300 M

350 M

400 M

450 M

500 M

550 M

600 M

Flo

p/s

Actual flopratef = 400 Mflops

(Example for Matrix-Matrix multiplication)

Page 12: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

•Systems used

•Measuring f

•Measuring g and l

•Bandwidth Surface

(aracari)•Bandwidth Surface

(argus)•Latency

•Benchmark Summary

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 11/26

Measuring g and l

Problems encountered: realistic values of g and l depend on

• The number of processors that are used• The communications pattern• The communication volume

E.g. for all-to-all communication on aracari

50 100 150 200 250Number of messages

0

2000

4000

6000

Com

mun

icat

ion

time

[us]

MPI, 4 nodesMPI, 16 nodesOxtool, 4 nodesOxtool, 16 nodes

Page 13: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

•Systems used

•Measuring f

•Measuring g and l

•Bandwidth Surface

(aracari)•Bandwidth Surface

(argus)•Latency

•Benchmark Summary

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 12/26

Bandwidth Surface (aracari)

For a better picture, the effective bandwidth can be sampleddepending on message size and count.

All-to-all communication on aracari:

tim

e p

er e

lem

ent

[us]

m (n

umber

of mes

sages

)

message size

PUB 16 nodesOXTOOL 16 nodes

MPI 16 nodes

2

8

32

128

28

32128

512

0.1

1

10

100

1000

Page 14: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

•Systems used

•Measuring f

•Measuring g and l

•Bandwidth Surface

(aracari)•Bandwidth Surface

(argus)•Latency

•Benchmark Summary

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 12/26

Bandwidth Surface (aracari)

For a better picture, the effective bandwidth can be sampleddepending on message size and count.

All-to-all communication on aracari:

tim

e p

er e

lem

ent

[us]

m (n

umber

of mes

sages

)

message size

PUB 16 nodesOXTOOL 16 nodes

2

8

32

128

28

32128

512

0.1

1

10

100

Page 15: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

•Systems used

•Measuring f

•Measuring g and l

•Bandwidth Surface

(aracari)•Bandwidth Surface

(argus)•Latency

•Benchmark Summary

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 12/26

Bandwidth Surface (aracari)

For a better picture, the effective bandwidth can be sampleddepending on message size and count.

Random permutation on aracari:

tim

e p

er e

lem

ent

[us]

m (n

umber

of mes

sages

)

message size

PUB 16 nodesOXTOOL 16 nodes

MPI 16 nodes

2 8

32 128

512

28

32128

512

0.1 1

10 100

10001e+04

Page 16: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

•Systems used

•Measuring f

•Measuring g and l

•Bandwidth Surface

(aracari)•Bandwidth Surface

(argus)•Latency

•Benchmark Summary

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 12/26

Bandwidth Surface (aracari)

For a better picture, the effective bandwidth can be sampleddepending on message size and count.

Random permutation on aracari:

tim

e p

er e

lem

ent

[us]

m (n

umber

of mes

sages

)

message size

PUB 16 nodesOXTOOL 16 nodes

2 8

32 128

512

28

32128

512

0.1

1

10

100

1000

Page 17: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

•Systems used

•Measuring f

•Measuring g and l

•Bandwidth Surface

(aracari)•Bandwidth Surface

(argus)•Latency

•Benchmark Summary

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 13/26

Bandwidth Surface (argus)

The picture looks different on the slower communicationsnetwork

All-to-all communication on argus:

tim

e p

er e

lem

ent

[us]

m (n

umber

of mes

sages

)

message size

PUB 4 nodesOXTOOL 4 nodes

MPI 4 nodes

2

8

32

128

28

32128

512

1

10

100

1000

1e+04

Page 18: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

•Systems used

•Measuring f

•Measuring g and l

•Bandwidth Surface

(aracari)•Bandwidth Surface

(argus)•Latency

•Benchmark Summary

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 13/26

Bandwidth Surface (argus)

The picture looks different on the slower communicationsnetwork

All-to-all communication on argus:

tim

e p

er e

lem

ent

[us]

m (n

umber

of mes

sages

)

message size

PUB 4 nodesOXTOOL 4 nodes

2

8

32

128

28

32128

512

1

10

100

1000

Page 19: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

•Systems used

•Measuring f

•Measuring g and l

•Bandwidth Surface

(aracari)•Bandwidth Surface

(argus)•Latency

•Benchmark Summary

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 13/26

Bandwidth Surface (argus)

The picture looks different on the slower communicationsnetwork

Random permutation on argus:

tim

e p

er e

lem

ent

[us]

m (n

umber

of mes

sages

)

message size

PUB 4 nodesOXTOOL 4 nodes

MPI 4 nodes

2 8

32 128

512

28

32128

512

0.01 0.1 1

10 100

10001e+04

Page 20: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

•Systems used

•Measuring f

•Measuring g and l

•Bandwidth Surface

(aracari)•Bandwidth Surface

(argus)•Latency

•Benchmark Summary

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 13/26

Bandwidth Surface (argus)

The picture looks different on the slower communicationsnetwork

Random permutation on argus:

tim

e p

er e

lem

ent

[us]

m (n

umber

of mes

sages

)

message size

PUB 4 nodesOXTOOL 4 nodes

2 8

32 128

512

28

32128

512

0.01 0.1 1

10 100

10001e+04

Page 21: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

•Systems used

•Measuring f

•Measuring g and l

•Bandwidth Surface

(aracari)•Bandwidth Surface

(argus)•Latency

•Benchmark Summary

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 14/26

Latency

The latency can be measured for synchronizations precededby different types of communication:

MPI Oxtool PUBaracari 4 processors

l (low) 210 µs 43 µs 39 µsl (high) 230 µs 67 µs 55 µsl (all-to-all) 252 µs 89 µs 72 µs

aracari 32 processors

l (low) 2203 µs 621 µs 142 µsl (high) 2242 µs 638 µs 163 µsl (all-to-all) 2881 µs 1250 µs 750 µs

argus 4 processors

l (low) 5642 µs 796 µs 975 µsl (high) 5789 µs 1442 µs 1176 µsl (all-to-all) 5086 µs 1613 µs 871 µs

Page 22: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

•Systems used

•Measuring f

•Measuring g and l

•Bandwidth Surface

(aracari)•Bandwidth Surface

(argus)•Latency

•Benchmark Summary

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 15/26

Benchmark Summary

Bandwidth depends on message count, messagesize and the communications pattern

On aracari:• Best all-to-all performance: Oxtool

• Best random permutation performance (fewmessages): PUB, > 64 messages: Oxtool

• Best self communication performance (fewmessages): PUB, > 32 messages: Oxtool

• MPI: good performance when message size islarge

Page 23: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

•Systems used

•Measuring f

•Measuring g and l

•Bandwidth Surface

(aracari)•Bandwidth Surface

(argus)•Latency

•Benchmark Summary

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 15/26

Benchmark Summary

Bandwidth depends on message count, messagesize and the communications pattern

On argus:• Best all-to-all performance: PUB

• Random permutation: little difference betweenPUB and Oxtool

• Best self communication performance (fewmessages): Oxtool

• MPI: good performance when message size andcount are larger than 16/32 doubles

Page 24: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

•Systems used

•Measuring f

•Measuring g and l

•Bandwidth Surface

(aracari)•Bandwidth Surface

(argus)•Latency

•Benchmark Summary

Performance/Predictions

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 15/26

Benchmark Summary

Latency:

• PUB consistently has best latency (without usingthe faster ‘oblivious’ synchronization)

• As expected, ‘naive’ MPI library has highestlatency

Page 25: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

•BSP Matrix-Matrix

Multiplication•BSP Matrix-Matrix

Multiplication (2)•Why this algorithm?

•Prediction model

•Prediction results on

aracari•Prediction results, more

processors•Speedup Results (1)

•Speedup Results (2)

•Performance Comparison

with PBLAS

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 16/26

BSP Matrix-Matrix Multiplication

We want to compute the product of two dense n × n matricesA and B

Simple formula:

cik =n

j=1

aijbjk, having A = [aij ] , B = [bij ] , C = [cij]

V

C

A

B

aij

bjk

vijk

cik

Page 26: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

•BSP Matrix-Matrix

Multiplication•BSP Matrix-Matrix

Multiplication (2)•Why this algorithm?

•Prediction model

•Prediction results on

aracari•Prediction results, more

processors•Speedup Results (1)

•Speedup Results (2)

•Performance Comparison

with PBLAS

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 17/26

BSP Matrix-Matrix Multiplication (2)

Block decomposition into q blocks for memory efficientparallel algorithm:

CIK =

q1/3

J=1

VIJK with I, K = 1, 2, ..., q1/3

V

C

A

B

Page 27: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

•BSP Matrix-Matrix

Multiplication•BSP Matrix-Matrix

Multiplication (2)•Why this algorithm?

•Prediction model

•Prediction results on

aracari•Prediction results, more

processors•Speedup Results (1)

•Speedup Results (2)

•Performance Comparison

with PBLAS

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 18/26

Why this algorithm?

• Communication block size can be controlled byparameter q

• Message combining has to be used when usingfixed initial data distribution (block-cyclic withblock width n/

√p)

• Can be compared e.g. to PBLAS

• ‘Nice’ version can predistribute the blocks beforethe computation to avoid spikes because of datadistribution

Page 28: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

•BSP Matrix-Matrix

Multiplication•BSP Matrix-Matrix

Multiplication (2)•Why this algorithm?

•Prediction model

•Prediction results on

aracari•Prediction results, more

processors•Speedup Results (1)

•Speedup Results (2)

•Performance Comparison

with PBLAS

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 19/26

Prediction model

BSP running time:

T = f ·

q

p

·n3

q+

g ·

q

p

·n2

q2/3·(

2 + 1/q1/3

)

+

l · 2

q

p

Two matrices are transferredrow by row→ value of g is taken fromthe red line as value for max-imum matrix size

Page 29: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

•BSP Matrix-Matrix

Multiplication•BSP Matrix-Matrix

Multiplication (2)•Why this algorithm?

•Prediction model

•Prediction results on

aracari•Prediction results, more

processors•Speedup Results (1)

•Speedup Results (2)

•Performance Comparison

with PBLAS

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 20/26

Prediction results on aracari

Oxtool, using 4 processors

0 500 1000 1500 2000Matrix Size

0

5

10

15

20

Run

ning

tim

e [s

] q=27 mean error 9.89% q=125 mean error 10.60% q=343 mean error 13.06% q=729 mean error 13.56%prediction q=27prediction q=125prediction q=343prediction q=729

p = 4, 1/f = 4e+08 g = 3.5e-07, l = 8.9e-05

Page 30: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

•BSP Matrix-Matrix

Multiplication•BSP Matrix-Matrix

Multiplication (2)•Why this algorithm?

•Prediction model

•Prediction results on

aracari•Prediction results, more

processors•Speedup Results (1)

•Speedup Results (2)

•Performance Comparison

with PBLAS

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 20/26

Prediction results on aracari

PUB, using 4 processors

0 500 1000 1500 2000Matrix Size

0

10

20

30

40

50

60

Run

ning

tim

e [s

] q=27 mean error 52.68% q=125 mean error 91.92% q=343 mean error 110.39% q=729 mean error 120.03%prediction q=27prediction q=125prediction q=343prediction q=729

p = 4, 1/f = 4e+08 g = 2.5e-06, l = 7.2e-05

Page 31: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

•BSP Matrix-Matrix

Multiplication•BSP Matrix-Matrix

Multiplication (2)•Why this algorithm?

•Prediction model

•Prediction results on

aracari•Prediction results, more

processors•Speedup Results (1)

•Speedup Results (2)

•Performance Comparison

with PBLAS

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 20/26

Prediction results on aracari

MPI, using 4 processors

0 500 1000 1500 2000Matrix Size

0

5

10

15

20

Run

ning

tim

e [s

] q=27 mean error 4.04% q=125 mean error 4.92% q=343 mean error 7.44% q=729 mean error 8.20%prediction q=27prediction q=125prediction q=343prediction q=729

p = 4, 1/f = 4e+08 g = 3.4e-07, l = 0.00025

Page 32: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

•BSP Matrix-Matrix

Multiplication•BSP Matrix-Matrix

Multiplication (2)•Why this algorithm?

•Prediction model

•Prediction results on

aracari•Prediction results, more

processors•Speedup Results (1)

•Speedup Results (2)

•Performance Comparison

with PBLAS

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 21/26

Prediction results, more processors

Oxtool, using 32 processors

0 500 1000 1500 2000Matrix Size

0

0.5

1

1.5

2

2.5

3

Run

ning

tim

e [s

] q=343 mean error 20.36% q=512 mean error 31.71% q=729 mean error 34.20%prediction q=343prediction q=512prediction q=729

p = 32, 1/f = 4e+08 g = 5.3e-07, l = 0.0013

Page 33: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

•BSP Matrix-Matrix

Multiplication•BSP Matrix-Matrix

Multiplication (2)•Why this algorithm?

•Prediction model

•Prediction results on

aracari•Prediction results, more

processors•Speedup Results (1)

•Speedup Results (2)

•Performance Comparison

with PBLAS

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 22/26

Speedup Results (1)

aracari, using 16 processors

0 500 1000 1500 2000Matrix Size

0

5

10

15

20

25

30S

peed

upsequentialOxtool: q=64Oxtool: q=729PUB: q=64PUB: q=729MPI: q=64MPI: q=729

Page 34: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

•BSP Matrix-Matrix

Multiplication•BSP Matrix-Matrix

Multiplication (2)•Why this algorithm?

•Prediction model

•Prediction results on

aracari•Prediction results, more

processors•Speedup Results (1)

•Speedup Results (2)

•Performance Comparison

with PBLAS

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 23/26

Speedup Results (2)

aracari, using 32 processors (spikes when matrix size mod6 == 0)

0 500 1000 1500 2000Matrix Size

0

10

20

30

40

50S

peed

upsequentialOxtool: q=216Oxtool: q=729PUB: q=216PUB: q=729MPI: q=216MPI: q=729

Page 35: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

•BSP Matrix-Matrix

Multiplication•BSP Matrix-Matrix

Multiplication (2)•Why this algorithm?

•Prediction model

•Prediction results on

aracari•Prediction results, more

processors•Speedup Results (1)

•Speedup Results (2)

•Performance Comparison

with PBLAS

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 24/26

Performance Comparison with PBLAS

4000 6000 8000 10000Matrix size

0

5 G

10 G

15 G

20 G

flop/

sPBLAS, 36 nodesPBLAS, 16 nodesMPI, 36 nodesMPI, 16 nodesOxtool, 36 nodesOxtool, 16 nodesPUB, 36 nodesPUB, 16 nodes

Page 36: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

•BSP Matrix-Matrix

Multiplication•BSP Matrix-Matrix

Multiplication (2)•Why this algorithm?

•Prediction model

•Prediction results on

aracari•Prediction results, more

processors•Speedup Results (1)

•Speedup Results (2)

•Performance Comparison

with PBLAS

Conclusion

Experimental Evaluation of BSP Programming Libraries - p. 24/26

Performance Comparison with PBLAS

2000 4000 6000 8000 10000Matrix size

2 G

4 G

6 G

8 G

10 G

12 G

flop/

sPBLAS, 36 nodesPUB, 32 nodesMPI, 32 nodesOxtool, 32 nodes

Page 37: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

Conclusion

•Summary

•Further Work

Experimental Evaluation of BSP Programming Libraries - p. 25/26

Summary

• Despite restrictions due to BSP model, all implementationsreach good speedup on aracari when number of blocksis low

• Overall benchmark results look better for PUB• Oxtool has best matrix multiplication performance onaracari (Myrinet)

• PUB has best matrix multiplication performance on argus(Ethernet)

• Predictably no real speedup on argus, due to slowcommunications network and fast nodes

• Performance of simple BSP algorithm is comparable withPBLAS

Page 38: Experimental Evaluation of BSP Programming Libraries€¦ · Experimental Evaluation of BSP Programming Libraries - p. 1/26 Experimental Evaluation of BSP Programming Libraries Peter

Introduction

The BSP Model

BSP Libraries

Benchmarking

Performance/Predictions

Conclusion

•Summary

•Further Work

Experimental Evaluation of BSP Programming Libraries - p. 26/26

Further Work

• Run experiments on shared memory machine

• Use more different communication patterns forbenchmarking

• Study other algorithms with differentcommunication patterns

• Keeping simplicity, extend prediction model formore accuracy