
Page 1: Understanding applications with Paraver

www.bsc.es

Understanding applications with Paraver

Judit Gimenez [email protected]

NERSC - Berkeley, August 2016

Page 2: Understanding applications with Paraver

Humans are visual creatures

  Films or books? PROCESS –  Two hours vs. days (months)

  Memorizing a deck of playing cards STORE –  Each card translated to an image (person, action, location)

  Our brain loves pattern recognition IDENTIFY –  What do you see in the pictures?

Page 3: Understanding applications with Paraver

Our Tools

Since 1991, based on traces, open source
–  http://www.bsc.es/paraver

Core tools: –  Paraver (paramedir) – offline trace analysis –  Dimemas – message passing simulator –  Extrae – instrumentation

Focus –  Detail, variability, flexibility –  Behavioral structure vs. syntactic structure –  Intelligence: Performance Analytics

Page 4: Understanding applications with Paraver

www.bsc.es

Paraver

Page 5: Understanding applications with Paraver

Paraver – Performance data browser

Trace visualization/analysis + trace manipulation
–  Timelines
–  2D/3D tables (statistics)
–  Raw data
Goal = flexibility
–  No semantics
–  Programmable
Comparative analyses
–  Multiple traces
–  Synchronize scales

Page 6: Understanding applications with Paraver

Timelines

  Each window displays one view –  Piecewise constant function of time

  Types of functions –  Categorical

•  State, user function, outlined routine

–  Logical •  In specific user function, In MPI call, In long MPI call

–  Numerical •  IPC, L2 miss ratio, duration of MPI call, duration of computation burst

s(t) = S_i, ∀ t ∈ [t_i, t_{i+1})
–  Categorical: S_i ∈ N ⊂ ℕ, S_i ∈ [0, n]
–  Logical: S_i ∈ {0, 1}
–  Numerical: S_i ∈ ℝ
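As a concrete picture of a timeline, the following toy C sketch (an illustration only, not the actual Paraver record format) stores a view as an ordered list of (t_i, S_i) pairs and looks up the semantic value at an arbitrary time t:

    /* Toy sketch: a timeline as a piecewise constant function, i.e. an
     * ordered list of (t_i, S_i) intervals, plus a lookup of the semantic
     * value at time t.  Not the real Paraver trace format. */
    #include <stddef.h>

    typedef struct {
        double t;      /* start time of the interval [t_i, t_{i+1}) */
        double value;  /* semantic value S_i (categorical, logical or numerical) */
    } Sample;

    /* Returns S_i such that t falls in [t_i, t_{i+1}); assumes the samples
     * are sorted by time and that t >= samples[0].t. */
    double timeline_value(const Sample *samples, size_t n, double t)
    {
        size_t lo = 0, hi = n;   /* binary search for the last t_i <= t */
        while (hi - lo > 1) {
            size_t mid = lo + (hi - lo) / 2;
            if (samples[mid].t <= t) lo = mid; else hi = mid;
        }
        return samples[lo].value;
    }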

Page 7: Understanding applications with Paraver

Tables: Profiles, histograms, correlations

  From timelines to tables
[Figure: a useful-duration timeline and an MPI-calls timeline summarized into an MPI-calls profile and a histogram of useful duration]

Page 8: Understanding applications with Paraver

Analyzing variability through histograms and timelines

[Views: useful duration, instructions, IPC, L2 miss ratio]

Page 9: Understanding applications with Paraver

  By the way: six months later ….

[Views: useful duration, instructions, IPC, L2 miss ratio]

Analyzing variability through histograms and timelines

Page 10: Understanding applications with Paraver

Variability … is everywhere   CESM: 16 processes, 2 simulated days

  Histogram of useful computation duration shows high variability

  How is it distributed?   Dynamic imbalance
–  In space and time –  Day and night –  Season? ☺

Page 11: Understanding applications with Paraver

  Data handling/summarization capability
–  Filtering (see the sketch below)
•  Subset of records in the original trace
•  By duration, type, value, …
•  A filtered trace IS a Paraver trace and can be analysed with the same cfgs (as long as the needed data is kept)
–  Cutting
•  All records in a given time interval
•  Only some processes
–  Software counters
•  Summarized values computed from those in the original trace, emitted as new event types
•  #MPI calls, total hardware count, …

Trace manipulation
Example (WRF-NMM, Peninsula 4 km, 128 procs): original trace with MPI + HWC, 570 s / 2.2 GB; software-counters trace, 570 s / 5 MB; cut, 4.6 s / 36.5 MB
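To make the filtering idea concrete, here is a toy C sketch; the record layout is invented for the example and is not the real Paraver trace format:

    /* Toy illustration of "filtering": keep only records whose duration
     * exceeds a threshold.  The Record layout is made up for the example. */
    #include <stddef.h>

    typedef struct {
        int    object;       /* process / thread the record belongs to */
        int    type;         /* event type (e.g. an MPI call id) */
        double begin, end;   /* timestamps */
    } Record;

    /* Copies into out[] the records longer than min_duration; returns how
     * many records were kept. */
    size_t filter_by_duration(const Record *in, size_t n,
                              Record *out, double min_duration)
    {
        size_t kept = 0;
        for (size_t i = 0; i < n; i++)
            if (in[i].end - in[i].begin > min_duration)
                out[kept++] = in[i];
        return kept;
    }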

Page 12: Understanding applications with Paraver

www.bsc.es

Dimemas

Page 13: Understanding applications with Paraver

Dimemas: coarse-grain, trace-driven simulation

  Simulation: Highly non linear model –  MPI protocols, resource contention…

  Parametric sweeps –  On abstract architectures –  On application computational regions

  What-if analysis –  Ideal machine (instantaneous network) –  Estimating the impact of porting to MPI+OpenMP/CUDA/… –  Should I use asynchronous communications? –  Are all parts equally sensitive to the network?

  MPI sanity check –  Modeling nominal

Paraver – Dimemas tandem –  Analysis and prediction –  What-if from selected time window

Detailed feedback on simulation (trace)

[Diagram: Dimemas machine model – nodes of CPUs with local memory, connected through local links (L) and an interconnection network with bandwidth B]

[Plot: impact of BW (L = 8; B = 0) – efficiency (0–1.2) vs. bandwidth (1–1024 MB/s) for NMM and ARW at 128, 256 and 512 processes]
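The parametric sweeps above rest on an abstract machine model. As a rough illustration only (the real Dimemas model also covers MPI protocols, resource contention, etc.), the sketch below uses the simplest linear network model, transfer time = latency + size / bandwidth, and sweeps the bandwidth values of the plot; message size and latency are made up:

    /* Minimal sketch of an abstract network model as used by a trace-driven
     * simulator: transfer time = latency + size / bandwidth. */
    #include <stdio.h>

    static double transfer_time(double bytes, double latency_s, double bw_bytes_s)
    {
        return latency_s + bytes / bw_bytes_s;
    }

    int main(void)
    {
        const double msg_bytes = 1.0e6;    /* 1 MB message (invented) */
        const double latency   = 8.0e-6;   /* 8 us latency (invented) */
        const double bws_mbs[] = { 1, 4, 16, 64, 256, 1024 };   /* MB/s */

        for (size_t i = 0; i < sizeof bws_mbs / sizeof bws_mbs[0]; i++)
            printf("BW = %6.0f MB/s -> transfer = %8.3f ms\n", bws_mbs[i],
                   1e3 * transfer_time(msg_bytes, latency, bws_mbs[i] * 1e6));
        return 0;
    }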

Page 14: Understanding applications with Paraver

Network sensitivity

  MPIRE 32 tasks, no network contention

[Timelines, all at the same scale: L = 5 µs / BW = 1 GB/s; L = 1000 µs / BW = 1 GB/s; L = 5 µs / BW = 100 MB/s]

Page 15: Understanding applications with Paraver

Would I benefit from asynchronous communications?

  SPECFEM3D

Courtesy Dimitri Komatitsch

[Timelines: real, ideal, prediction MN, prediction 100 MB/s, prediction 10 MB/s, prediction 5 MB/s, prediction 1 MB/s]

Page 16: Understanding applications with Paraver

Ideal machine

  The impossible machine: BW = ∞, L = 0   Actually describes/characterizes Intrinsic application behavior

–  Load balance problems? –  Dependence problems?

[Timelines: real run vs. ideal network – GADGET @ Nehalem cluster, 256 processes; MPI regions shown: waitall, sendrecv, alltoall, allgather, allreduce]

Impact on practical machines?

Page 17: Understanding applications with Paraver

Impact of architectural parameters

  Ideal: speeding up ALL the computation bursts by the CPUratio factor
–  The more processes, the less speedup (higher impact of bandwidth limitations)!!

[Plots (GADGET, 64 / 128 / 256 procs): predicted speedup (0–140) as a function of bandwidth (64–16384 MB/s) and CPUratio (1, 4, 16, 64)]

Page 18: Understanding applications with Paraver

Profile
[Chart: % of computation time per code region (regions 1–13), up to ~40%]

Hybrid parallelization

  Hybrid/accelerator parallelization –  Speed-up SELECTED regions by the CPUratio factor

[Plots (128 procs): predicted speedup (0–20) as a function of bandwidth (64–16384 MB/s) and CPUratio (1, 8, 64), when accelerating selected code regions that cover 93.67%, 97.49% and 99.11% of the computation time]

(Previous slide: speedups up to 100x)
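A back-of-the-envelope Amdahl bound (not taken from the slide) shows why the coverage percentages matter: if the selected regions cover a fraction c of the computation time and run CPUratio = r times faster, the compute-only speedup is at most 1 / ((1 − c) + c/r). With r = 16, c = 0.9367 gives about 1 / (0.0633 + 0.0585) ≈ 8.2, while c = 0.9911 gives about 1 / (0.0089 + 0.0619) ≈ 14.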

Page 19: Understanding applications with Paraver

www.bsc.es

Models and Extrapolation

Page 20: Understanding applications with Paraver

Parallel efficiency model

  Parallel efficiency = LB eff * Comm eff

[Timeline sketch: computation vs. communication (MPI_Send / MPI_Recv); labels: LB, Comm; "Do not blame MPI"]

Page 21: Understanding applications with Paraver

Parallel efficiency refinement: LB * µLB * Transfer

  Serializations / dependences (µLB): Dimemas ideal network → Transfer (efficiency) = 1

[Timeline sketch with LB = 1: computation vs. communication (MPI_Send / MPI_Recv); labels: LB, µLB, Transfer; "Do not blame MPI"]
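For reference, a minimal sketch of how the load balance and communication factors can be computed from the per-process "useful" computation time (time outside MPI), following the usual BSC/POP formulation — LB = avg(useful) / max(useful), Comm = max(useful) / runtime, Parallel = LB * Comm — with invented numbers:

    /* Sketch: parallel efficiency factors from per-rank useful time. */
    #include <stdio.h>

    int main(void)
    {
        const double useful[] = { 9.2, 8.7, 9.9, 7.5 };  /* seconds per rank (invented) */
        const double runtime  = 11.0;                    /* elapsed time of the run */
        const size_t n = sizeof useful / sizeof useful[0];

        double sum = 0.0, max = useful[0];
        for (size_t i = 0; i < n; i++) {
            sum += useful[i];
            if (useful[i] > max) max = useful[i];
        }
        const double avg  = sum / n;
        const double lb   = avg / max;      /* load balance efficiency  */
        const double comm = max / runtime;  /* communication efficiency */
        printf("LB = %.2f  Comm = %.2f  Parallel = %.2f\n", lb, comm, lb * comm);
        return 0;
    }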

Page 22: Understanding applications with Paraver

Why scaling?

Good scalability!! Should we be happy?
[Plot: speedup (0–20) vs. number of cores (0–400) – CG-POP mpi2s1D, 180x120]

η = LB * Ser * Trf
[Plot: parallel efficiency, LB, µLB and transfer (0.5–1.1) vs. number of cores (0–400)]
[Plot: efficiency – parallel eff., instructions eff. and IPC eff. (0.4–1.6) vs. number of cores (0–400)]
η = η_parallel * η_instr * η_IPC

Page 23: Understanding applications with Paraver

AVBP (strong scale) Input: runs 512 – 4096

Amdahl-like fit per metric: fit_metric = (1 − f_metric) + f_metric · Amdahl_P

Strong scale → small computations between MPI calls become more and more important

Page 24: Understanding applications with Paraver

TURBORVB (strong scale) Input: runs 512 – 4096

fit_metric = (1 − f_metric) + f_metric · Amdahl_P^(10/3)

Network contention can be reduced by balancing / limiting the random selection within a node

[Plot: average iteration time (s, 1.0–2.0) vs. # MPI ranks (100 to 1,000,000, log scale) – measured vs. fit]

Page 25: Understanding applications with Paraver

www.bsc.es

Clustering

Page 26: Understanding applications with Paraver

Using Clustering to identify structure

[Scatter plots: IPC vs. completed instructions of the computation bursts]

Automatic Detection of Parallel Applications Computation Phases (IPDPS 2009)
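The referenced tool clusters the computation bursts with a density-based algorithm (DBSCAN) on per-burst metrics. The toy C sketch below is not that algorithm; it is only a simple "leader" clustering in the (completed instructions, IPC) plane to show the idea, with invented data:

    /* Toy burst clustering: assign each burst to the nearest existing cluster
     * centre within a radius (in normalized space), or start a new cluster. */
    #include <math.h>
    #include <stdio.h>

    #define MAXC 16

    typedef struct { double instr, ipc; } Burst;

    int main(void)
    {
        const Burst bursts[] = {
            { 1.00e8, 1.90 }, { 1.10e8, 2.00 }, { 4.00e7, 0.80 },
            { 3.80e7, 0.70 }, { 1.05e8, 1.95 },
        };
        const size_t n = sizeof bursts / sizeof bursts[0];
        const double radius = 0.25;      /* threshold in (log10 instr, IPC) space */

        Burst centres[MAXC];
        int nclusters = 0;

        for (size_t i = 0; i < n; i++) {
            double x = log10(bursts[i].instr), y = bursts[i].ipc;
            int assigned = -1;
            for (int c = 0; c < nclusters; c++) {
                double dx = x - log10(centres[c].instr), dy = y - centres[c].ipc;
                if (sqrt(dx * dx + dy * dy) < radius) { assigned = c; break; }
            }
            if (assigned < 0 && nclusters < MAXC)
                { centres[nclusters] = bursts[i]; assigned = nclusters++; }
            printf("burst %zu -> cluster %d\n", i, assigned);
        }
        return 0;
    }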

Page 27: Understanding applications with Paraver

What should I improve? What if …
–  … we balance Clusters 1 & 2?
–  … we increase the IPC of Cluster 1?
Predicted gains: 19% and 13% (PEPC)

Page 28: Understanding applications with Paraver

Tracking scalability through clustering

[Cluster scatter plots for 64, 128, 192, 256, 384 and 512 tasks]

OpenMX (strong scale from 64 to 512 tasks)

Page 29: Understanding applications with Paraver

www.bsc.es

Folding

Page 30: Understanding applications with Paraver

Folding: Detailed metrics evolution

•  Performance of a sequential region = 2000 MIPS

Is it good enough?

Is it easy to improve?

Page 31: Understanding applications with Paraver

Folding: Instantaneous CPI stack

MRGENESIS
•  Trivial fix (loop interchange). Easy to locate?
•  Next step?
•  Availability of CPI stack models for production processors? Provided by manufacturers?

Page 32: Understanding applications with Paraver

  Instantaneous metrics with minimum overhead –  Combine instrumentation and sampling

•  Instrumentation delimits regions (routines, loops, …)
•  Sampling exposes progression within a region

–  Captures performance counters and call-stack references

Folding

[Diagram: samples from Iteration #1, #2 and #3 folded into one synthetic iteration; initialization and finalization shown separately]
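A minimal sketch of the folding step itself, under the assumption that each sample carries the counter value accumulated since the start of its iteration: every sample is mapped to a relative position inside its own iteration and accumulated into the bins of one synthetic iteration (data layout and numbers are invented):

    /* Toy folding: project samples from several iterations onto the bins of
     * a single synthetic iteration. */
    #include <stdio.h>

    #define BINS 10

    typedef struct {
        double t;        /* sample timestamp */
        double counter;  /* counter value accumulated since the iteration start */
        int    iter;     /* iteration the sample belongs to */
    } Sample;

    int main(void)
    {
        const double start[] = { 0.0, 1.0, 2.1, 3.0 };  /* iteration boundaries */
        const Sample s[] = {
            { 0.2, 10, 0 }, { 0.7, 45, 0 },
            { 1.5, 22, 1 }, { 1.9, 48, 1 },
            { 2.4, 15, 2 }, { 2.9, 50, 2 },
        };
        double sum[BINS] = { 0 };
        int    cnt[BINS] = { 0 };

        for (size_t i = 0; i < sizeof s / sizeof s[0]; i++) {
            double len = start[s[i].iter + 1] - start[s[i].iter];
            double rel = (s[i].t - start[s[i].iter]) / len;   /* position in [0,1) */
            int bin = (int)(rel * BINS);
            if (bin >= BINS) bin = BINS - 1;
            sum[bin] += s[i].counter;
            cnt[bin]++;
        }
        for (int b = 0; b < BINS; b++)
            if (cnt[b])
                printf("bin %d: avg counter progress %.1f (%d samples)\n",
                       b, sum[b] / cnt[b], cnt[b]);
        return 0;
    }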

Page 33: Understanding applications with Paraver

“Blind” optimization

  From folded samples of a few levels to timeline structure of “relevant” routines

Recommendation without access to source code

Harald Servat et al. “Bio-inspired call-stack reconstruction for performance analysis” Submitted, 2015

Page 34: Understanding applications with Paraver

17.20 M instructions ~ 1000 MIPS

24.92 M instructions ~ 1100 MIPS

32.53 M instructions ~ 1200 MIPS

MPI call

MPI call

Unbalanced MPI application –  Same code –  Different duration –  Different performance

CG-POP multicore MN3 study

Page 35: Understanding applications with Paraver

Address space references

  Based on PEBS events •  Loads: address, cost in cycles, level providing the data •  Stores: only address


Modified Stream
  for (i = 0; i < NITERS; i++) {
    Extrae_event (BEGIN);
    for (j = 0; j < N; j++)   c[j] = a[j];
    for (j = 0; j < N/8; j++) b[j] = s * c[random() % (j + 1)];  /* j+1 guards against modulo by zero */
    for (j = 0; j < N; j++)   c[j] = a[j] + b[j];
    for (j = 0; j < N; j++)   a[j] = b[j] + s * c[j];
    Extrae_event (END);
  }

Page 36: Understanding applications with Paraver

Lulesh memory access

Architecture impact – Stalls distribution

Page 37: Understanding applications with Paraver

www.bsc.es

Methodology

Page 38: Understanding applications with Paraver

Performance analysis tools objective

Help validate hypotheses

Help generate hypotheses

Qualitatively

Quantitatively

Page 39: Understanding applications with Paraver

First steps

  Parallel efficiency – percentage of time invested in computation
–  Identify sources of "inefficiency":
•  Load balance
•  Communication / synchronization
  Serial efficiency – how far from peak performance?
–  IPC, correlate with other counters
  Scalability – code replication?
–  Total #instructions
  Behavioral structure? Variability?

Paraver Tutorial:

Introduction to Paraver and Dimemas methodology

Page 40: Understanding applications with Paraver

BSC Tools web site

www.bsc.es/paraver
•  Downloads
–  Sources / binaries
–  Linux / Windows / Mac
•  Documentation
–  Training guides
–  Tutorial slides
  Getting started
–  Start wxparaver
–  Help → Tutorials and follow the instructions
–  Follow the training guides
•  Paraver introduction (MPI): navigation and basic understanding of Paraver operation

Page 41: Understanding applications with Paraver

Thanks!

 Use your brain, use visual tools :)   Look at your codes!

www.bsc.es/paraver

Page 42: Understanding applications with Paraver

www.bsc.es

Demo

Page 43: Understanding applications with Paraver

Some examples of efficiencies

Code                 Parallel eff. (%)   Communication eff. (%)   Load balance eff. (%)
Gromacs@mt                 66.77                 75.68                   88.22
BigDFT@altamira            59.64                 78.97                   75.52
CG-POP@mt                  80.98                 98.92                   81.86
ntchem_mini@pi             92.56                 94.94                   97.49
nicam@pi                   87.10                 75.97                   89.22
cp2k@jureca                75.34                 81.07                   92.93
lulesh@mn3                 90.55                 99.22                   91.26
lulesh@leftraru            69.15                 99.12                   69.76
lulesh@uv2 (mpt)           70.55                 96.56                   73.06
lulesh@uv2 (impi)          85.65                 95.09                   90.07
lulesh@mt                  83.68                 95.48                   87.64
lulesh@cori                90.92                 98.59                   92.20
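As a quick sanity check of the model on the first row: 0.7568 × 0.8822 ≈ 0.6676, i.e. Parallel efficiency ≈ Communication efficiency × Load balance efficiency; the remaining rows can be checked the same way.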