www.bsc.es
Understanding applications using the
BSC performance tools
Judit Gimenez ([email protected]), Harald Servat ([email protected])
HPCKP’15 – February 3rd, 2015
Humans are visual creatures
Films or books? PROCESS
– Two hours vs. days (months)
Memorizing a deck of playing cards STORE
– Each card translated to an image (person, action, location)
Our brain loves pattern recognition IDENTIFY
– What do you see in the pictures?
Our Tools
Since 1991
Based on traces
Open Source – http://www.bsc.es/paraver
Core tools:
– Paraver (paramedir) – offline trace analysis
– Dimemas – message passing simulator
– Extrae – instrumentation
Focus – Detail, variability, flexibility
– Behavioral structure vs. syntactic structure
– Intelligence: Performance Analytics
Paraver
Paraver – Performance data browser
Timelines
Raw data
2/3D tables
(Statistics)
Goal = Flexibility
No semantics
Programmable
Comparative analyses
Multiple traces
Synchronize scales
+ trace manipulation
Trace visualization/analysis
Timelines
Each window displays one view
– Piecewise constant function of time
Types of functions
– Categorical
• State, user function, outlined routine
– Logical
• In specific user function, In MPI call, In long MPI call
– Numerical
• IPC, L2 miss ratio, duration of MPI call, duration of computation burst
i.e. S(t) = S_i for t_i ≤ t < t_{i+1}, i = 0, …, n, where:
– Categorical: S_i ∈ ℕ
– Logical: S_i ∈ {0, 1}
– Numerical: S_i ∈ ℝ
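The piecewise-constant view above can be sketched in a few lines of Python. This is an illustrative model only, not Paraver's implementation; the class and value names are invented.

```python
import bisect

class Timeline:
    """A view as a piecewise-constant function: S(t) = S_i for t_i <= t < t_{i+1}."""
    def __init__(self, times, values):
        # times[i] is the start of interval i; values[i] is its semantic value
        self.times = times
        self.values = values

    def at(self, t):
        """Value of the function at time t (None before the first interval)."""
        i = bisect.bisect_right(self.times, t) - 1
        return self.values[i] if i >= 0 else None

# Categorical example: the state of one process over time
state = Timeline([0, 10, 25, 40], ["compute", "MPI_Send", "compute", "MPI_Recv"])
print(state.at(12))  # t = 12 falls in [10, 25) -> "MPI_Send"
```

A logical view would use values in {0, 1} and a numerical view floats (e.g. IPC), but the lookup mechanics are identical.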
Tables: Profiles, histograms, correlations
From timelines to tables
[Figures: MPI calls profile, useful duration timeline, useful duration histogram, MPI calls view]
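The timeline-to-table step can be illustrated with a toy profile computation. The record format and function name are invented for this sketch; this is not the Paraver code.

```python
from collections import defaultdict

def mpi_profile(records):
    """Build a 2D profile: rows = processes, columns = MPI calls,
    cells = % of that process's MPI time. records: (process, call, duration)."""
    table = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    for proc, call, dur in records:
        table[proc][call] += dur
        total[proc] += dur
    return {p: {c: 100.0 * d / total[p] for c, d in calls.items()}
            for p, calls in table.items()}

records = [(0, "MPI_Send", 2.0), (0, "MPI_Recv", 6.0),
           (1, "MPI_Send", 4.0), (1, "MPI_Recv", 4.0)]
profile = mpi_profile(records)
print(profile[0]["MPI_Recv"])  # 75.0 (% of process 0's MPI time)
```

A histogram is the same reduction with duration bins as columns instead of call names.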
Analyzing variability through histograms and timelines
[Views: useful duration, instructions, IPC, L2 miss ratio]
By the way: six months later ….
[Views: useful duration, instructions, IPC, L2 miss ratio]
Analyzing variability through histograms and timelines
From tables to timelines
Where in the timeline do the values in certain table columns appear?
E.g., want to see the time distribution of a given routine?
Variability … is everywhere
CESM: 16 processes, 2 simulated days
Histogram of useful computation duration shows high variability
How is it distributed?
Dynamic imbalance
– In space and time
– Day and night
– Season?
Data handling/summarization capability
– Filtering
  • Subset of records in original trace
  • By duration, type, value, …
  • A filtered trace IS a Paraver trace and can be analysed with the same cfgs (as long as the needed data is kept)
– Cutting
  • All records in a given time interval
  • Only some processes
– Software counters
  • Summarized values computed from those in the original trace, emitted as new event types
  • #MPI calls, total hardware count, …
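The filtering and software-counter ideas can be sketched as follows. The record layout and function names are invented for illustration; real Paraver traces have their own format.

```python
def filter_by_duration(records, min_duration):
    """Filtering: keep only records at least min_duration long.
    records: (time, type, value, duration) tuples."""
    return [r for r in records if r[3] >= min_duration]

def software_counter(records, window):
    """Software counters: summarize the original records into new events,
    here the number of records per time window (e.g. #MPI calls)."""
    counts = {}
    for time, rtype, value, dur in records:
        w = int(time // window)
        counts[w] = counts.get(w, 0) + 1
    return counts

recs = [(0.1, "MPI", 1, 0.5), (0.2, "MPI", 2, 0.01), (1.4, "MPI", 3, 0.2)]
print(filter_by_duration(recs, 0.1))  # drops the 0.01 s record
print(software_counter(recs, 1.0))    # {0: 2, 1: 1}
```

Cutting would instead keep all records inside a time interval, unchanged.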
WRF-NMM, Peninsula 4 km, 128 procs (MPI, HWC)
– Full trace: 570 s, 2.2 GB
– Summarized trace: 570 s, 5 MB
– Cut: 4.6 s, 36.5 MB
Trace manipulation
Dimemas
Dimemas: Coarse grain, trace-driven simulation
Simulation: highly non-linear model
– Linear components
  • Point-to-point communication
  • Sequential processor performance
    – Global CPU speed
    – Per block/subroutine
– Non-linear components
  • Synchronization semantics
    – Blocking receives
    – Rendezvous
    – Eager limit
  • Resource contention
    – CPU
    – Communication subsystem
      » links (half/full duplex), buses
T = L + MessageSize / BW
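The linear point-to-point model can be evaluated directly. The numbers below are illustrative, not calibrated to any real machine.

```python
def p2p_time(message_bytes, latency_s, bandwidth_Bps):
    """Dimemas' linear point-to-point model: T = L + MessageSize / BW."""
    return latency_s + message_bytes / bandwidth_Bps

# 1 MB message, 25 us latency, 100 MB/s bandwidth
t = p2p_time(1_000_000, 25e-6, 100e6)
print(t)  # 0.010025 s: transfer time dominated by bandwidth here
```

For small messages the latency term dominates instead, which is why the eager limit and rendezvous semantics (non-linear components) matter.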
[Figure: Dimemas abstract architecture – nodes of CPUs with local memory, connected by links (L) and buses (B)]
Dimemas: Type of analysis
Parametric sweeps
– On abstract architectures
– On application computational regions
What if analysis
– Ideal machine (instantaneous network)
– Estimating impact of ports to MPI+OpenMP/CUDA/…
– Should I use asynchronous communications?
– Are all parts of an app. equally sensitive to network?
MPI sanity check
– Modeling nominal
Detailed feedback on simulation (trace)
Impact of BW (L = 8; B = 0)
[Figure: efficiency vs. bandwidth (1–1024 MB/s) for NMM and ARW at 128, 256 and 512 processes]
[Figure: speedup vs. bandwidth (MB/s) and CPU ratio]
Paraver & Dimemas: Ecosystem and Integration
Paraver – Dimemas tandem
– Analysis and prediction
– What-if from selected time window
Analytics
– Clustering to model speeding or balancing regions
Scalability model
– Separate transfer and serialization efficiencies
Coupled simulation
– Multiscale simulation
– Combine with detailed network / CPU simulators
[Figure: GROMACS v4.5 – performance factors (Parallel_Eff, LB, uLB, Transfer) vs. number of processors (0–300); 93.67%]
González J., Giménez J., Casas M., Moretó M., Ramirez A., Labarta J., Valero M. "Simulating Whole Supercomputer Applications". IEEE Micro. 2011;31:32–45.
Network sensitivity
MPIRE 32 tasks, no network contention
PATC Tools Training. Barcelona, May 12–13th 2014
L = 25µs – BW = 100MB/s L = 1000µs – BW = 100MB/s
L = 25µs – BW = 10MB/s
All windows same scale
Ideal machine
The impossible machine: BW = ∞, L = 0
Actually describes/characterizes intrinsic application behavior
– Load balance problems?
– Dependence problems?
[Figure: GADGET @ Nehalem cluster, 256 processes – real run vs. ideal network; waitall, sendrecv, alltoall, allgather + sendrecv, allreduce regions]
Impact on practical machines?
Impact of architectural parameters
Ideal speedup: accelerating ALL the computation bursts by the CPUratio factor
– The more processes, the less speedup (higher impact of bandwidth limitations)!
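The what-if prediction behind a CPU-ratio sweep can be sketched as a toy model: computation time is scaled by the factor while communication time is held fixed. This is a simplification of what Dimemas actually simulates (no contention, no overlap) and the numbers are invented.

```python
def predict_runtime(comp_bursts, comm_time, cpu_ratio):
    """Toy what-if: scale every computation burst by cpu_ratio,
    keep communication unchanged (no contention modeled)."""
    return sum(b / cpu_ratio for b in comp_bursts) + comm_time

base = predict_runtime([10.0, 20.0], 5.0, 1.0)   # 35.0 s
fast = predict_runtime([10.0, 20.0], 5.0, 4.0)   # 12.5 s
print(base / fast)  # speedup 2.8, not 4: communication does not scale
```

This already shows the Amdahl-like effect the slide describes: the larger the communication share, the lower the achievable speedup from faster CPUs.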
[Figure: GADGET – predicted speedup vs. bandwidth (MB/s) and CPU ratio, for 64, 128 and 256 processes]
Models
Scaling model
η = LB · CommEff
LB = (Σ_{i=1..P} eff_i) / (P · max_i(eff_i))
CommEff = max_i(eff_i)
eff_i = T_i / T
Directly from real execution metrics (#instr, IPC)
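The model above can be computed from per-process computation times with a few lines of Python. This is a minimal sketch with made-up numbers, not the paramedir implementation.

```python
def efficiency_model(T_i, T):
    """From per-process computation times T_i and total runtime T:
    eff_i = T_i / T, CommEff = max(eff_i),
    LB = sum(eff_i) / (P * max(eff_i)), eta = LB * CommEff."""
    P = len(T_i)
    eff = [t / T for t in T_i]
    comm_eff = max(eff)
    lb = sum(eff) / (P * comm_eff)
    return lb, comm_eff, lb * comm_eff

# Four processes of a 100 s run spend 80/90/60/70 s computing
lb, comm_eff, eta = efficiency_model([80.0, 90.0, 60.0, 70.0], 100.0)
print(lb, comm_eff, eta)  # eta = 0.75: 25% of the resources are wasted
```

Note that η reduces to the average of the eff_i, which is why it can be read directly as the fraction of time invested in computation.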
Communication efficiency = µLB * Transfer
Serializations / dependences (µLB)
Measured using Dimemas
– With an ideal network, Transfer (efficiency) = 1
[Figure: computation/communication timelines with MPI_Send / MPI_Recv dependences – a perfectly balanced case (LB = 1) can still lose time to serialization]
Scaling model
Dimemas simulation with ideal target
– Latency = 0; BW = ∞
CommEff = µLB · Transfer
– µLB: migrating/local load imbalance, serialization
Transfer = T_ideal / T
µLB = max_i(T_i) / T_ideal
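Under these definitions the split can be computed directly; the ideal runtime T_ideal comes from the Dimemas ideal-network simulation. Numbers below are illustrative.

```python
def split_comm_eff(T_i, T, T_ideal):
    """Split CommEff into serialization (muLB) and transfer terms:
    Transfer = T_ideal / T, muLB = max(T_i) / T_ideal,
    so muLB * Transfer = max(T_i) / T = CommEff."""
    transfer = T_ideal / T
    mu_lb = max(T_i) / T_ideal
    return mu_lb, transfer, mu_lb * transfer

# Real run of 100 s; ideal-network simulation predicts 95 s
mu_lb, transfer, comm_eff = split_comm_eff([80.0, 90.0], 100.0, 95.0)
print(mu_lb, transfer, comm_eff)  # only 5% is actual transfer cost
```

Consistency check: µLB · Transfer collapses back to max(T_i)/T, the CommEff of the previous slide, so the decomposition adds information without changing the model.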
Why scaling?
[Figure: CG-POP mpi2s1D – 180x120: speedup vs. number of processors (0–400)]
Good scalability!! Should we be happy?
η = LB · Ser · Trf
[Figure: CG-POP performance factors vs. number of processors – Parallel eff, LB, uLB, transfer]
η_total = η_parallel · η_instr · η_IPC
[Figure: CG-POP efficiency vs. number of processors – Parallel eff, instr. eff, IPC eff]
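A toy calculation shows why raw speedup can mislead: multiplicative factors may compensate each other. The numbers are invented for illustration.

```python
def total_efficiency(par_eff, instr_eff, ipc_eff):
    """Multiplicative decomposition: eta_total = eta_parallel * eta_instr * eta_IPC."""
    return par_eff * instr_eff * ipc_eff

# Small run: good parallel efficiency, nominal IPC
print(total_efficiency(0.95, 1.0, 1.00))  # 0.95
# Large run: parallel efficiency degraded, but IPC improved (e.g. better
# cache behaviour with smaller per-process working sets)
print(total_efficiency(0.70, 1.0, 1.35))  # ~0.945: looks just as "scalable"
```

The two runs show nearly identical global efficiency, yet only the decomposition reveals that the second one hides a serious parallel inefficiency behind an IPC gain.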
M. Casas et al, “Automatic analysis of speedup of MPI applications”. ICS 2008.
Clustering
Using Clustering to identify structure
IPC
Completed Instructions
Automatic Detection of Parallel Applications Computation Phases (IPDPS 2009)
Performance @ serial computation bursts
[Figure: clustered computation bursts for WRF (128 cores), GROMACS and SPECFEM3D]
– WRF: asynchronous SPMD – balanced #instr, variability in IPC
– GROMACS: MPMD structure – different coupled imbalance trends
– SPECFEM3D: SPMD – repeated substructure, coupled imbalance
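A toy grid-based grouping illustrates the idea of finding structure in the (#instructions, IPC) plane. The actual BSC clustering tool uses density-based clustering (DBSCAN); this sketch is deliberately simpler and all values are invented.

```python
def grid_cluster(bursts, instr_bin, ipc_bin):
    """Group computation bursts that fall in the same cell of a
    (#instructions, IPC) grid -- a crude stand-in for density clustering."""
    clusters = {}
    for instr, ipc in bursts:
        key = (int(instr // instr_bin), int(ipc // ipc_bin))
        clusters.setdefault(key, []).append((instr, ipc))
    return list(clusters.values())

# Two behaviours: short high-IPC bursts and long low-IPC bursts
bursts = [(1.0e6, 1.1), (1.1e6, 1.0), (5.0e6, 0.4), (5.2e6, 0.5)]
groups = grid_cluster(bursts, 2e6, 1.0)
print(len(groups))  # 2 behavioural clusters
```

Each resulting cluster corresponds to a region of the code with a distinct performance behaviour, which can then be characterized with hardware counters.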
Full per region HWC characterization from a single run
Projecting hardware counters based on clustering
Miss ratios Instruction mix Stalls
González J. et al , “Performance Data Extrapolation in Parallel Codes”. ICPADS '10
Integrating models and analytics
What if …
– … we increase the IPC of Cluster 1?
– … we balance Clusters 1 & 2?
(Estimated gains: 19% and 13%)
PEPC
OpenMX (strong scale from 64 to 512 tasks)
Tracking: scalability through clustering
[Figure: cluster scatter plots at 64, 128, 192, 256, 384 and 512 tasks]
Folding
Folding: Detailed metrics evolution
• Performance of a sequential region = 2000 MIPS
Is it good enough?
Is it easy to improve?
Folding: Instantaneous CPI stack
MRGENESIS
• Trivial fix (loop interchange)
• Easy to locate?
• Next step?
• Availability of CPI stack models for production processors?
  • Provided by manufacturers?
“Blind” optimization
From folded samples of a few levels to timeline structure of “relevant” routines
Recommendation without access to source code
Harald Servat et al. “Bio-inspired call-stack reconstruction for performance analysis”. Submitted, 2015.
[Figure: three computation bursts between MPI calls – 17.20 M instructions @ ~1000 MIPS, 24.92 M instructions @ ~1100 MIPS, 32.53 M instructions @ ~1200 MIPS]
Unbalanced MPI application
– Same code
– Different duration
– Different performance
CG-POP multicore MN3 study
Methodology
Performance analysis tools objective
Help validate hypotheses
Help generate hypotheses
Qualitatively
Quantitatively
First steps
Parallel efficiency – percentage of time invested in computation
– Identify sources of “inefficiency”:
  • Load balance
  • Communication / synchronization
Serial efficiency – how far from peak performance?
– IPC, correlate with other counters
Scalability – code replication?
– Total #instructions
Behavioral structure? Variability?
Paraver Tutorial:
Introduction to Paraver and Dimemas methodology
BSC Tools web site
www.bsc.es/paraver
• downloads
– Sources / Binaries
– Linux / Windows / Mac
• documentation
– Training guides
– Tutorial slides
Getting started
– Start wxparaver
– Help → Tutorials and follow the instructions
– Follow training guides
• Paraver introduction (MPI): Navigation and basic understanding of Paraver
operation