Performance Tools (Paraver/Dimemas). Jesús Labarta, Judit Gimenez (BSC). Enes workshop on exascale techs, Hamburg, March 18th 2014. www.bsc.es


Page 1: Performance Tools (Paraver/Dimemas)

www.bsc.es

Enes workshop on exascale techs. Hamburg, March 18th 2014

Jesús Labarta, Judit Gimenez BSC

Performance Tools (Paraver/Dimemas)

Page 2: Performance Tools (Paraver/Dimemas)

Our Tools

• Since 1991
• Based on traces
• Open Source
  – http://www.bsc.es/paraver
• Core tools:
  – Paraver (paramedir) – offline trace analysis
  – Dimemas – message passing simulator
  – Extrae – instrumentation
• Focus
  – Detail, flexibility, intelligence

Page 3: Performance Tools (Paraver/Dimemas)

A “different” view point

• Look at structure …
  – Of behavior, not syntax
  – Differentiated or repetitive patterns in time and space
  – Focus on computation regions (Burst)

[Figure: timeline from 0 to 3.5 s]

Page 4: Performance Tools (Paraver/Dimemas)

LB Ser Trf Eff 0.83 0.97 0.80 0.87 0.90 0.78 0.88 0.82 0.73 0.88 0.72 0.63

A “different” view point

• … and fundamental metrics

adv2 (gather–fft-scatter)* mono

Useful user function @ NMMB

M. Casas et al, “Automatic analysis of speedup of MPI applications”. ICS 2008.

LB Ser Trf Eff 0.83 0.97 0.80 0.87 0.90 0.78 0.88 0.97 0.84 0.73 0.88 0.96 0.75 0.61
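The LB, Ser, Trf and Eff columns correspond to the multiplicative efficiency model of the Casas et al. paper cited on this slide, in which parallel efficiency factors as Eff = LB × Ser × Trf. The following is a minimal sketch of that factorization, assuming we have the useful computation time of each process, the measured elapsed time, and the elapsed time predicted for an ideal network (for example from a Dimemas simulation). The function name, argument names and input numbers are illustrative and do not come from the talk.

```python
def efficiency_factors(useful, t_real, t_ideal):
    """Split parallel efficiency into load balance, serialization and transfer.

    useful  -- per-process time spent in useful computation (seconds)
    t_real  -- measured elapsed time of the run
    t_ideal -- elapsed time on an ideal network (zero latency, infinite bandwidth)
    """
    avg_useful = sum(useful) / len(useful)
    max_useful = max(useful)
    lb = avg_useful / max_useful   # load balance: average versus slowest process
    ser = max_useful / t_ideal     # serialization: time lost to dependences
    trf = t_ideal / t_real         # transfer: time lost to moving data
    return {"LB": lb, "Ser": ser, "Trf": trf, "Eff": lb * ser * trf}


# Hypothetical inputs, not measurements from the talk.
factors = efficiency_factors(useful=[8.0, 9.0, 10.0, 9.0], t_real=12.5, t_ideal=11.0)
for name, value in factors.items():
    print(f"{name}: {value:.2f}")
```

Because the three factors multiply, the product collapses to average useful time divided by elapsed time, so each factor shows how much of the total loss it is responsible for.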

Page 5: Performance Tools (Paraver/Dimemas)

More on structure and concurrency

Scalability tradeoffs between processes at different phases

?

Page 6: Performance Tools (Paraver/Dimemas)

More on structure and concurrency

How to find out:

Discussion with developer

Automatic?

V. Subotic et al, “Automatic exploration of potential parallelism in sequential applications”. ISC 2014.

Page 7: Performance Tools (Paraver/Dimemas)

More on structure and concurrency

Page 8: Performance Tools (Paraver/Dimemas)

More on structure and concurrency

Huge potential for concurrency and overlap to:

– tolerate latencies
– spread load across resource cores and network!!

Page 9: Performance Tools (Paraver/Dimemas)

More on structure and concurrency

You may even want to constrain potential concurrency !!!

Page 10: Performance Tools (Paraver/Dimemas)

More on structure and concurrency and syntax

WIP:

Taskify with OmpSs

OpenMP 4.0 accelerator features in OmpSs

Page 11: Performance Tools (Paraver/Dimemas)

Performance analytics

Page 12: Performance Tools (Paraver/Dimemas)

Using Clustering to identify structure

[Figure: clustered scatter plot; axes: IPC, Completed Instructions]

J. Gonzalez et al, “Automatic Detection of Parallel Applications Computation Phases”. IPDPS 2009.
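One way to find such structure is density-based clustering of computation bursts in the (IPC, completed instructions) plane. The sketch below is a minimal pure-Python DBSCAN, offered under stated assumptions rather than as the tools' implementation: the `eps` and `min_pts` values, the min-max normalization and the sample bursts are all hypothetical.

```python
import math


def normalize(points):
    """Rescale every dimension to [0, 1] so both axes weigh equally."""
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE              # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point reached from a core point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:    # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical bursts: (IPC, completed instructions).
bursts = [
    (0.50, 1.0e6), (0.52, 1.1e6), (0.49, 0.9e6),   # short, low-IPC region
    (1.50, 9.0e6), (1.48, 9.2e6), (1.52, 8.9e6),   # long, high-IPC region
    (1.00, 5.0e6),                                 # isolated burst
]
labels = dbscan(normalize(bursts), eps=0.1, min_pts=3)
print(labels)
```

Bursts that share a label behave alike and can be analysed as one computation phase; bursts labelled -1 are treated as noise.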

Page 13: Performance Tools (Paraver/Dimemas)

Projecting hardware counters based on clustering

• Full per region HWC characterization from a single run

[Figure panels: Miss ratios, Instruction mix, Stalls]

Page 14: Performance Tools (Paraver/Dimemas)

Tracking structural evolution

• Frame sequence: clustered scatterplot as core count increases

[Figure: OpenMX strong scaling; two sequences of frames at 64, 128, 192, 256, 384 and 512 cores]

G. Llort et al, “On the Usefulness of Object Tracking Techniques in Performance Analysis”, SC 2013

Page 15: Performance Tools (Paraver/Dimemas)

Mixing instrumentation and sampling …

• … to get extreme detail with minimal overhead
• Different roles
  – Instrumentation delimits regions
  – Sampling reports progress within region

[Figure: Iteration #1, Iteration #2, Iteration #3; Synthetic Iteration]

Harald Servat et al. “Unveiling Internal Evolution of Parallel Application Computation Phases” ICPP 2011

Harald Servat et al. “Detailed performance analysis using coarse grain sampling” PROPER@EUROPAR, 2009

Page 16: Performance Tools (Paraver/Dimemas)

Folding hardware counters

• Instructions evolution for routine copy_faces of NAS MPI BT.B
• Red crosses represent the folded samples and show the completed instructions from the start of the routine
• Green line is the curve fitting of the folded samples and is used to reintroduce the values into the tracefile
• Blue line is the derivative of the curve fitting over time (counter rate)
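Folding can be sketched as follows: samples taken in many instances of the same routine are projected onto one synthetic instance by normalizing each sample's time offset and counter value against its own instance, and the slope of the resulting curve gives the counter rate. This is a simplified illustration, not the tools' implementation: straight-line interpolation between folded samples stands in for the curve fitting described on the slide, and the function names and sample data are hypothetical.

```python
def fold(instances):
    """Pool samples from all instances into one synthetic instance.

    instances -- list of (duration, total_count, samples), where samples is a
                 list of (time_since_start, count_since_start) pairs.
    Returns (normalized_time, normalized_count) pairs sorted by time.
    """
    folded = []
    for duration, total, samples in instances:
        for t, count in samples:
            folded.append((t / duration, count / total))
    return sorted(folded)


def rates(folded):
    """Slope of the folded curve between consecutive samples (counter rate).

    Returns (interval_midpoint, slope) pairs in normalized units, where a
    slope of 1.0 equals the average rate over the whole routine.
    """
    curve = [(0.0, 0.0)] + folded + [(1.0, 1.0)]
    out = []
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x1 > x0:
            out.append(((x0 + x1) / 2, (y1 - y0) / (x1 - x0)))
    return out


# Hypothetical data: three instances of one routine, each hit by a single sample.
# Times are in microseconds, counts in completed instructions.
instances = [
    (10_000, 20_000_000, [(2_000, 2_000_000)]),
    (11_000, 20_000_000, [(5_500, 8_000_000)]),
    (9_000, 20_000_000, [(7_200, 16_000_000)]),
]
folded = fold(instances)
for midpoint, slope in rates(folded):
    print(f"t = {midpoint:.2f}: {slope:.2f} x average rate")
```

Each instance contributes only one sample here, yet the folded curve has three interior points, which is how coarse sampling can still yield fine detail.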

Page 17: Performance Tools (Paraver/Dimemas)

Combined clustering + folding

• Instantaneous values
• All metrics
• From a single run
• “No” overhead

[Figure: CGPOP -1D; annotations: MPI call; 17.20 M instructions ~ 1000 MIPS; 24.92 M instructions ~ 1100 MIPS; 32.53 M instructions ~ 1200 MIPS; MPI call]

Page 18: Performance Tools (Paraver/Dimemas)

CESM v18 – v19 trace

• User functions not instrumented

ATM: 384, LND: 16, ICE: 32, OCN: 10, CPL: 128

[Figure annotations: 2.54 GB; 160 s; 5 200 ms; 2.55 GB; 4.5 MB; 11.5 MB; 570]

Page 19: Performance Tools (Paraver/Dimemas)

CESM CAM v18

Convect_shallow_tend

Microp_driver_tend

aer_rad_props_sw

aer_rads_prop_lw

rrtmg_sg

rad_rrtmg_lw

Page 20: Performance Tools (Paraver/Dimemas)

CESM CAM v19

Convect_shallow_tend

Svp_water

M_list_mp_init_

Vertical_diffusion

rrtmg_sw

rad_rrtmg_lw

Microp_driver_tend

aer_rad_props_sw

Aerosol_dryed_intr_

Page 21: Performance Tools (Paraver/Dimemas)

Dimemas

Page 22: Performance Tools (Paraver/Dimemas)

Dimemas: Coarse grain, Trace driven simulation

• Simulation: Highly non linear model
  – Linear components
    • Point to point communication
    • Sequential processor performance
      – Global CPU speed
      – Per block/subroutine
  – Non linear components
    • Synchronization semantics
      – Blocking receives
      – Rendezvous
    • Resource contention
      – CPU
      – Communication subsystem
        » links (half/full duplex), busses

[Figure: machine model diagram; labels: CPU, Local Memory, L, B]

Page 23: Performance Tools (Paraver/Dimemas)

Ideal machine

• The impossible machine: BW = ∞, L = 0
• Actually describes/characterizes intrinsic application behavior
  – Load balance problems?
  – Dependence problems?

[Figure: GADGET @ Nehalem cluster, 256 processes; timelines for Real run and Ideal network; labelled MPI calls: waitall, sendrec, alltoall, Allgather + sendrecv, allreduce]

Impact on practical machines?
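The linear point-to-point component of a Dimemas-style model charges each message a latency plus its size divided by the bandwidth, and the ideal machine is the limit L = 0, BW = ∞. The sketch below illustrates only that linear term. It ignores the non-linear parts of the model (synchronization and contention), and the function name and parameter values are hypothetical.

```python
import math


def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Linear point-to-point cost: T = L + size / BW."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


# Hypothetical message size and network parameters.
message = 1_000_000  # bytes
real = transfer_time(message, latency_s=5e-6, bandwidth_bytes_per_s=1e9)
ideal = transfer_time(message, latency_s=0.0, bandwidth_bytes_per_s=math.inf)

print(f"practical network: {real * 1e3:.3f} ms")
print(f"ideal network:     {ideal * 1e3:.3f} ms")
```

With transfers costing nothing, whatever waiting time remains in the simulated run comes from load imbalance or dependences, which is why the ideal machine characterizes the application rather than the network.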

Page 24: Performance Tools (Paraver/Dimemas)

The potential of hybrid/accelerator parallelization

• Hybrid parallelization
  – Speedup SELECTED regions by the CPUratio factor
• We do need to overcome the hybrid Amdahl’s law
  – asynchrony + Load balancing mechanisms !!!

[Figure: % elapsed time by code region; annotations: 93.67%, 97.49%, 99.11%]

GADGET, 128 procs
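The hybrid Amdahl's law mentioned here can be made concrete: if the selected regions cover a fraction f of the elapsed time and run CPUratio times faster, the overall speedup is 1 / ((1 − f) + f / CPUratio), which can never exceed 1 / (1 − f). The sketch below assumes the three percentages on the slide are such coverage fractions, which is an interpretation rather than something the slide states; the CPUratio values are hypothetical.

```python
def hybrid_speedup(fraction, cpu_ratio):
    """Amdahl's law: speedup when `fraction` of the time runs `cpu_ratio` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / cpu_ratio)


coverages = [0.9367, 0.9749, 0.9911]  # read off the slide; their meaning is assumed
cpu_ratios = [5, 20, 100]             # hypothetical acceleration factors

for f in coverages:
    row = ", ".join(f"x{r}: {hybrid_speedup(f, r):6.2f}" for r in cpu_ratios)
    print(f"{f:.2%} accelerated -> {row}  (limit {1.0 / (1.0 - f):.1f})")
```

The limit column shows why the unaccelerated remainder dominates at large CPUratio, and why asynchrony and load balancing are needed to overlap the remainder with accelerated work.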

Page 25: Performance Tools (Paraver/Dimemas)

Conclusion

• BSC tools
  – Extremely powerful visualization and analysis capabilities
  – Performance Analytics
    • Performance data is big data
      – Management
      – analytics
  – Capturing knowledge and methodologies in algorithmic workflows
• Useful insight for informed decisions on code refactoring

http://www.bsc.es/paraver [email protected]

Page 26: Performance Tools (Paraver/Dimemas)

THANKS

Page 27: Performance Tools (Paraver/Dimemas)

Insight

• Observations / highly probable speculations / good questions
  – about fundamental behavior
  – Suggesting possibilities for optimization
• Identification of specific poorly performing sequential code
• Bimodal behavior in alternating “iterations?”
• Bimodal behavior in space:
  – Day-night imbalance
  – Moving load imbalance
    • Separate cause and potential solution
• Repetitive fine grain structure within phase
  – 2 / 3 sub iterations? parallelizable? Potential source for overlap of communication/computation?

Page 28: Performance Tools (Paraver/Dimemas)

A call for Performance analytics

• Data acquisition
  – A lot of data is captured
• Presentation
  – Profile: a few (or not so few) pre computed first order statistics
    • Far too summarized
  – Trace visualization
    • No summarization at all

Need for intelligent data processing to derive actual insight

Page 29: Performance Tools (Paraver/Dimemas)

CESM CLM v18

Page 30: Performance Tools (Paraver/Dimemas)

CESM POP v18

Page 31: Performance Tools (Paraver/Dimemas)

NMMB

Page 32: Performance Tools (Paraver/Dimemas)

Measuring Parallel efficiency