
Page 1: The High Performance Cluster for Lattice QCD Calculations: System Monitoring and Benchmarking, Part II – BENCHMARKING

Michał Kapałka
kapalka@icslab.agh.pl
Summer student @ DESY Hamburg

Supervisor: Andreas Gellrich

September 2002

Page 2: Outline

Benchmarks – an introduction

Single-node benchmarks

Parallel computing & MPI

How to benchmark a cluster?

• Point-to-point communication

• Collective communication

Summary & conclusions & questions

Page 3: Benchmarks – WHY?

Benchmarking – comparing or testing

Comparing – different hardware/software → relatively simple

Testing a given configuration & finding bottlenecks → difficult (and THAT’s what we’re going to talk about…)

Page 4: WHAT to test?

Single machine

CPU + memory + …

Cluster or parallel computer

communication:

• interprocessor

• inter-node

Page 5: HOWTO part I – one node

Lattice QCD basic operations: Dirac operator, complex matrices, square norm, …

QCD Benchmark (Martin Lüscher)

Optimization: SSE (PIII), SSE2 (P4 – operations on 2 doubles at once), cache prefetching (PIII); see www.intel.com
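For illustration, a minimal sketch of what "2 doubles at once" means with SSE2 compiler intrinsics (this is not code from the QCD Benchmark; the function name and the alignment assumption are mine):

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Add two arrays of doubles, two elements per instruction.
   Both arrays are assumed to be 16-byte aligned. */
void add_sse2(double *a, const double *b, int n)
{
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        __m128d va = _mm_load_pd(a + i);          /* load 2 doubles        */
        __m128d vb = _mm_load_pd(b + i);
        _mm_store_pd(a + i, _mm_add_pd(va, vb));  /* 2 additions at once   */
    }
    for (; i < n; i++)                            /* scalar remainder      */
        a[i] += b[i];
}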

Page 6: QCD Benchmark – results

D_psi [Mflops]                  | 32 bit SSE(2)    | 32 bit No SSE | 64 bit SSE(2)   | 64 bit No SSE
PIII 800 MHz, 256 KB (pal01)    | 554              | 127           | 186             | 92
Xeon 1.7 GHz, 256 KB (node20)   | 1668 (1177)      | 196 (270)     | 894 (395)       | 166 (195)
Xeon 2 GHz, 512 KB (node10)     | 1900 (1385/1960) | 357 (317/231) | 1006 (465/1052) | 201 (230/195)

Page 7: QCD Benchmark – results (2)

Add assign field: ψ(k) = ψ(k) + c·ψ(l)

[Mflops]                        | 32 bit SSE(2) | 32 bit No SSE | 64 bit SSE(2) | 64 bit No SSE
PIII 800 MHz, 256 KB (pal01)    | 90            | 63            | 44            | 42
Xeon 1.7 GHz, 256 KB (node20)   | 311           | 196           | 139           | 134
Xeon 2 GHz, 512 KB (node10)     | (292)         | (229)         | (127)         | (129)

Page 8: HOWTO part II – a cluster

CPUs & nodes have to COMMUNICATE

CPUs: shared memory

Nodes: sockets (grrrr…), virtual shared memory (hmm…), PVM, MPI, etc.

For clusters: MPI (here: MPICH-GM) → that’s exactly what I’ve tested

Remark: communication OVERHEAD

Page 9: MPI – point-to-point

Basic operations: send and receive

Calls: blocking & non-blocking (init + complete)

Modes: standard, synchronous, buffered, ready (see the sketch after this slide)

Uni- or bidirectional?

[Diagram: timeline of a non-blocking send/receive – init ×2, computation, complete ×2]
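For reference, a sketch showing the four send modes as plain MPI calls (illustrative only; the buffer handling around MPI_Bsend is an assumption of this example, not part of the benchmarks):

#include <mpi.h>
#include <stdlib.h>

/* The four point-to-point send modes: same argument list, different
   completion semantics. */
void send_modes(double *buf, int count, int dest)
{
    /* buffered mode needs user-provided buffer space */
    int bufsize = count * sizeof(double) + MPI_BSEND_OVERHEAD;
    void *bsendbuf = malloc(bufsize);
    MPI_Buffer_attach(bsendbuf, bufsize);

    MPI_Send (buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD); /* standard    */
    MPI_Ssend(buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD); /* synchronous */
    MPI_Bsend(buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD); /* buffered    */
    MPI_Rsend(buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD); /* ready: the
        matching receive must already be posted on the destination rank */

    MPI_Buffer_detach(&bsendbuf, &bufsize);
    free(bsendbuf);
}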

Page 10: First step – POE

Extremely simple – ping-pong test

Only blocking, standard-mode communication

Not user-friendly

But…

Page 11: POE – results

Non-local → details later

Local, no shmem → slow (90 MB/s)

Local with shmem → fast (esp. 31–130 KB), but…

Page 12: My point-to-point benchmarks

Using different MPI modes: standard, synchronous & buffered (no ready-mode)

Blocking & non-blocking calls

Fully configurable via command-line options

Text and LaTeX output

But still ping-pong tests
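As an illustration, a minimal ping-pong bandwidth test of the kind described above (a sketch, not the actual benchmark code; message size and repetition count are arbitrary). Run it with two MPI processes:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Rank 0 sends to rank 1 and waits for the echo; the total number of
   bytes divided by the elapsed time gives the bandwidth. */
int main(int argc, char **argv)
{
    const int size = 1 << 20;   /* 1 MB messages (assumption) */
    const int reps = 100;
    char *buf = malloc(size);
    int rank, i;
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)  /* 2*reps messages of `size` bytes were transferred */
        printf("bandwidth: %.1f MB/s\n", 2.0 * reps * size / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}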

Page 13: Problems…

Time measuring

• CPU time seems natural, but has very low resolution on Linux (clock() call)

• Real time has high resolution, but can be misleading on overloaded nodes (gettimeofday() call) – see the sketch below

MPICH-GM bug – problems when using shared memory
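For reference, a sketch of the two timer calls mentioned above (illustrative helper functions, not the benchmark code):

#include <time.h>        /* clock() */
#include <sys/time.h>    /* gettimeofday() */

/* clock() measures CPU time with coarse resolution on Linux;
   gettimeofday() measures wall-clock time with microsecond resolution,
   but also counts time spent by other processes on an overloaded node. */
double cpu_seconds(void)
{
    return (double) clock() / CLOCKS_PER_SEC;
}

double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}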

Page 14: Results (1)

Send: peak 1575 MB/s, drops to 151 MB/s @ 16 KB

Total: max 151 MB/s

Page 15: Results (2)

Send & receive completely different → losing sync

Send: peak 1205 MB/s

Total: max 120 MB/s

Page 16: Results (3)

Total bandwidth!

Standard seems to be the fastest

Buffered → use with care

Page 17: Results (4)

Blocking: max 151 MB/s

Non-blocking: max 176 MB/s + computation

WHY??? → when is it bidirectional?

Page 18: Uni- or bidirectional?

[Diagrams: message timelines between Node A and Node B.
Blocking communication: A sends while B receives, then B sends while A receives – the two messages travel one after the other.
Non-blocking communication: both nodes call init ×2, then complete – the two messages can travel at the same time.]

Page 19: Results (5)

Non-blocking calls use full duplex

Also MPI_Sendrecv

Blocking calls cannot use it → that’s why they’re slower
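A minimal sketch of such a full-duplex exchange with non-blocking calls (illustrative only; the function name and buffer handling are assumptions of this example):

#include <mpi.h>

/* Bidirectional exchange with non-blocking calls, which lets the link run
   in full duplex.  `peer` is the rank of the other node; buffers and size
   are assumed to be set up by the caller. */
void exchange(char *sendbuf, char *recvbuf, int size, int peer)
{
    MPI_Request req[2];
    MPI_Status  stat[2];

    /* init x 2: post the receive and the send before waiting on either */
    MPI_Irecv(recvbuf, size, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sendbuf, size, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[1]);

    /* ... computation can overlap with the communication here ... */

    /* complete x 2: wait until both transfers have finished */
    MPI_Waitall(2, req, stat);
}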

Page 20: Results (last but not least)

The ‘blocking calls story’ repeats…

However, buffered mode can sometimes be the most efficient

Page 21: Point-to-point – conclusions

Use standard-mode, non-blocking communication whenever possible

Use large messages

1. Write your parallel program
2. Benchmark
3. Analyze
4. Improve
5. Go to 2

Page 22: Collective communication

Collective operations:

• Broadcast

• Gather, gather to all

• Scatter

• All to all gather/scatter

• Global reduction operator, all reduce

Root and non-root nodes

Can be implemented with point-to-point calls, but this CAN be less effective
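As an illustration, a sketch using two of the collective operations listed above (not the benchmark code; the parameter name and value are hypothetical):

#include <mpi.h>

/* Broadcast a parameter from the root rank, then combine locally computed
   square norms with a global reduction. */
void collective_example(const double *local_field, int n, int root)
{
    double kappa = 0.0;               /* hypothetical parameter */
    double local_norm = 0.0, global_norm;
    int i, rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == root)
        kappa = 0.1575;

    /* Broadcast: the root distributes the parameter to all other ranks */
    MPI_Bcast(&kappa, 1, MPI_DOUBLE, root, MPI_COMM_WORLD);

    for (i = 0; i < n; i++)
        local_norm += local_field[i] * local_field[i];

    /* All-reduce: every rank receives the sum of all local square norms */
    MPI_Allreduce(&local_norm, &global_norm, 1, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);
}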

Page 23: What to measure?

Communication bandwidth:

    b = N · M / t

where: M – message size, N – number of messages, t – communication time

Summary bandwidth:

    b_summary = b · (K − 1)

where: K – number of nodes

Gives an impression of the speed of collective communication, but must be used with care!!!

Page 24: Results – example #1

Root: max 527 MB/s, drops down @ 16 KB

Non-root: max 229 MB/s

Saturation: 227 MB/s

Page 25: Results – example #2

Very effective algorithm used

Max: around 400 MB/s

Page 26: Results – example #3

The same for root & non-root

Drop @ 16 KB

Saturation: 75 MB/s

BUT…

Page 27: But…

We compute the summary bandwidth as:

    b_summary = N · M · (K − 1) / t

But the amount of data transmitted is K times higher, so we should write:

    b_summary = N · M · K · (K − 1) / t

So we should have 300 MB/s instead of 75 MB/s – this needs to be changed.

Page 28: Results – example #4

Max: 960 MB/s for 12 nodes (160 MB/s per connection)

Hard job to improve that

Page 29: Results – example #n

Strange behaviour

Stable for message size > 16 KB (max 162 MB/s)

Interpretation very difficult

Page 30: Collective – conclusions

Collective communication is usually NOT used very often, so one rarely needs to improve its speed

However, if it is a must, changing collective to point-to-point communication in a SMART way can in some cases improve things a little

Playing with message sizes can also help a lot – but BE CAREFUL

Page 31: To do…

Bidirectional communication

More flexible method for computing summary bandwidth in collective communication

Some other benchmarks – closer to the lattice QCD computations

And, most important of all – parallelizing all the lattice QCD programs and making USE of the benchmarks & results

Page 32: Summary

CPU benchmarks can help speed up serial programs (running on one node)

For parallel computations the real bottleneck is communication, and this has to be tested carefully

The interpretation of the results is NOT as important as using them to tune a program and make it fast