www.bsc.es
Understanding applications using the
BSC performance tools
Judit Gimenez ([email protected]), Harald Servat ([email protected])
HPCKP’15 – February 3rd, 2015
Humans are visual creatures
Films or books? PROCESS
– Two hours vs. days (months)
Memorizing a deck of playing cards STORE
– Each card translated to an image (person, action, location)
Our brain loves pattern recognition IDENTIFY
– What do you see in the pictures?
Our Tools
Since 1991
Based on traces
Open Source – http://www.bsc.es/paraver
Core tools:
– Paraver (paramedir) – offline trace analysis
– Dimemas – message passing simulator
– Extrae – instrumentation
Focus – Detail, variability, flexibility
– Behavioral structure vs. syntactic structure
– Intelligence: Performance Analytics
Paraver
Paraver – Performance data browser
Timelines
Raw data
2/3D tables
(Statistics)
Goal = Flexibility
No semantics
Programmable
Comparative analyses
Multiple traces
Synchronize scales
+ trace manipulation
Trace visualization/analysis
Timelines
Each window displays one view
– Piecewise constant function of time
Types of functions
– Categorical
• State, user function, outlined routine
– Logical
• In specific user function, In MPI call, In long MPI call
– Numerical
• IPC, L2 miss ratio, duration of MPI call, duration of computation burst
i.e. S(t) = S_i for t_i ≤ t < t_{i+1}, i = 0, …, n, where:
– Categorical: S_i ∈ ℕ
– Logical: S_i ∈ {0, 1}
– Numerical: S_i ∈ ℝ
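The piecewise-constant view above can be sketched in a few lines of Python. This is an illustrative model only, not Paraver's implementation; the class and value names are invented.

```python
import bisect

class Timeline:
    """A view as a piecewise-constant function: S(t) = S_i for t_i <= t < t_{i+1}."""
    def __init__(self, times, values):
        # times[i] is the start of interval i; values[i] is its semantic value
        self.times = times
        self.values = values

    def at(self, t):
        """Value of the function at time t (None before the first interval)."""
        i = bisect.bisect_right(self.times, t) - 1
        return self.values[i] if i >= 0 else None

# Categorical example: the state of one process over time
state = Timeline([0, 10, 25, 40], ["compute", "MPI_Send", "compute", "MPI_Recv"])
print(state.at(12))  # t = 12 falls in [10, 25) -> "MPI_Send"
```

A logical view would use values in {0, 1} and a numerical view floats (e.g. IPC), but the lookup mechanics are identical.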
Tables: Profiles, histograms, correlations
From timelines to tables
[Figures: MPI calls profile, useful duration timeline, useful duration histogram, MPI calls view]
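The timeline-to-table step can be illustrated with a toy profile computation. The record format and function name are invented for this sketch; this is not the Paraver code.

```python
from collections import defaultdict

def mpi_profile(records):
    """Build a 2D profile: rows = processes, columns = MPI calls,
    cells = % of that process's MPI time. records: (process, call, duration)."""
    table = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    for proc, call, dur in records:
        table[proc][call] += dur
        total[proc] += dur
    return {p: {c: 100.0 * d / total[p] for c, d in calls.items()}
            for p, calls in table.items()}

records = [(0, "MPI_Send", 2.0), (0, "MPI_Recv", 6.0),
           (1, "MPI_Send", 4.0), (1, "MPI_Recv", 4.0)]
profile = mpi_profile(records)
print(profile[0]["MPI_Recv"])  # 75.0 (% of process 0's MPI time)
```

A histogram is the same reduction with duration bins as columns instead of call names.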
Analyzing variability through histograms and timelines
[Views: useful duration, instructions, IPC, L2 miss ratio]
By the way: six months later ….
[Views: useful duration, instructions, IPC, L2 miss ratio]
Analyzing variability through histograms and timelines
From tables to timelines
Where in the timeline do the values in certain table columns appear?
E.g., want to see the time distribution of a given routine?
Variability … is everywhere
CESM: 16 processes, 2 simulated days
Histogram of useful computation duration shows high variability
How is it distributed?
Dynamic imbalance
– In space and time
– Day and night
– Season?
Data handling/summarization capability
– Filtering
  • Subset of records in original trace
  • By duration, type, value, …
  • A filtered trace IS a Paraver trace and can be analysed with the same cfgs (as long as the needed data is kept)
– Cutting
  • All records in a given time interval
  • Only some processes
– Software counters
  • Summarized values computed from those in the original trace, emitted as new event types
  • #MPI calls, total hardware count, …
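The filtering and software-counter ideas can be sketched as follows. The record layout and function names are invented for illustration; real Paraver traces have their own format.

```python
def filter_by_duration(records, min_duration):
    """Filtering: keep only records at least min_duration long.
    records: (time, type, value, duration) tuples."""
    return [r for r in records if r[3] >= min_duration]

def software_counter(records, window):
    """Software counters: summarize the original records into new events,
    here the number of records per time window (e.g. #MPI calls)."""
    counts = {}
    for time, rtype, value, dur in records:
        w = int(time // window)
        counts[w] = counts.get(w, 0) + 1
    return counts

recs = [(0.1, "MPI", 1, 0.5), (0.2, "MPI", 2, 0.01), (1.4, "MPI", 3, 0.2)]
print(filter_by_duration(recs, 0.1))  # drops the 0.01 s record
print(software_counter(recs, 1.0))    # {0: 2, 1: 1}
```

Cutting would instead keep all records inside a time interval, unchanged.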
WRF-NMM, Peninsula 4 km, 128 procs (MPI, HWC)
– Full trace: 570 s, 2.2 GB
– Summarized trace: 570 s, 5 MB
– Cut: 4.6 s, 36.5 MB
Trace manipulation
Dimemas
Dimemas: Coarse grain, trace-driven simulation
Simulation: highly non-linear model
– Linear components
  • Point-to-point communication
  • Sequential processor performance
    – Global CPU speed
    – Per block/subroutine
– Non-linear components
  • Synchronization semantics
    – Blocking receives
    – Rendezvous
    – Eager limit
  • Resource contention
    – CPU
    – Communication subsystem
      » links (half/full duplex), buses
T = L + MessageSize / BW
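The linear point-to-point model can be evaluated directly. The numbers below are illustrative, not calibrated to any real machine.

```python
def p2p_time(message_bytes, latency_s, bandwidth_Bps):
    """Dimemas' linear point-to-point model: T = L + MessageSize / BW."""
    return latency_s + message_bytes / bandwidth_Bps

# 1 MB message, 25 us latency, 100 MB/s bandwidth
t = p2p_time(1_000_000, 25e-6, 100e6)
print(t)  # 0.010025 s: transfer time dominated by bandwidth here
```

For small messages the latency term dominates instead, which is why the eager limit and rendezvous semantics (non-linear components) matter.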
[Figure: Dimemas abstract architecture – nodes of CPUs with local memory, connected by links (L) and buses (B)]
Dimemas: Type of analysis
Parametric sweeps
– On abstract architectures
– On application computational regions
What if analysis
– Ideal machine (instantaneous network)
– Estimating impact of ports to MPI+OpenMP/CUDA/…
– Should I use asynchronous communications?
– Are all parts of an app. equally sensitive to network?
MPI sanity check
– Modeling nominal
Detailed feedback on simulation (trace)
Impact of BW (L = 8; B = 0)
[Figure: efficiency vs. bandwidth (1–1024 MB/s) for NMM and ARW at 128, 256 and 512 processes]
[Figure: speedup vs. bandwidth (MB/s) and CPU ratio]
Paraver & Dimemas: Ecosystem and Integration
Paraver – Dimemas tandem
– Analysis and prediction
– What-if from selected time window
Analytics
– Clustering to model speeding or balancing regions
Scalability model
– Separate transfer and serialization efficiencies
Coupled simulation
– Multiscale simulation
– Combine with detailed network / CPU simulators
[Figure: GROMACS v4.5 – performance factors (Parallel_Eff, LB, uLB, Transfer) vs. number of processors (0–300); 93.67%]
González J., Giménez J., Casas M., Moretó M., Ramirez A., Labarta J., Valero M. "Simulating Whole Supercomputer Applications". IEEE Micro. 2011;31:32–45.
Network sensitivity
MPIRE 32 tasks, no network contention
PATC Tools Training. Barcelona, May 12–13th 2014
L = 25µs – BW = 100MB/s L = 1000µs – BW = 100MB/s
L = 25µs – BW = 10MB/s
All windows same scale
Ideal machine
The impossible machine: BW = ∞, L = 0
Actually describes/characterizes intrinsic application behavior
– Load balance problems?
– Dependence problems?
[Figure: GADGET @ Nehalem cluster, 256 processes – real run vs. ideal network; waitall, sendrecv, alltoall, allgather + sendrecv, allreduce regions]
Impact on practical machines?
Impact of architectural parameters
Ideal speedup: accelerating ALL the computation bursts by the CPUratio factor
– The more processes, the less speedup (higher impact of bandwidth limitations)!
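The what-if prediction behind a CPU-ratio sweep can be sketched as a toy model: computation time is scaled by the factor while communication time is held fixed. This is a simplification of what Dimemas actually simulates (no contention, no overlap) and the numbers are invented.

```python
def predict_runtime(comp_bursts, comm_time, cpu_ratio):
    """Toy what-if: scale every computation burst by cpu_ratio,
    keep communication unchanged (no contention modeled)."""
    return sum(b / cpu_ratio for b in comp_bursts) + comm_time

base = predict_runtime([10.0, 20.0], 5.0, 1.0)   # 35.0 s
fast = predict_runtime([10.0, 20.0], 5.0, 4.0)   # 12.5 s
print(base / fast)  # speedup 2.8, not 4: communication does not scale
```

This already shows the Amdahl-like effect the slide describes: the larger the communication share, the lower the achievable speedup from faster CPUs.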
[Figure: GADGET – predicted speedup vs. bandwidth (MB/s) and CPU ratio, for 64, 128 and 256 processes]
Models
Scaling model
η = LB · CommEff
LB = (Σ_{i=1..P} eff_i) / (P · max_i(eff_i))
CommEff = max_i(eff_i)
eff_i = T_i / T
Directly from real execution metrics (#instr, IPC)
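The model above can be computed from per-process computation times with a few lines of Python. This is a minimal sketch with made-up numbers, not the paramedir implementation.

```python
def efficiency_model(T_i, T):
    """From per-process computation times T_i and total runtime T:
    eff_i = T_i / T, CommEff = max(eff_i),
    LB = sum(eff_i) / (P * max(eff_i)), eta = LB * CommEff."""
    P = len(T_i)
    eff = [t / T for t in T_i]
    comm_eff = max(eff)
    lb = sum(eff) / (P * comm_eff)
    return lb, comm_eff, lb * comm_eff

# Four processes of a 100 s run spend 80/90/60/70 s computing
lb, comm_eff, eta = efficiency_model([80.0, 90.0, 60.0, 70.0], 100.0)
print(lb, comm_eff, eta)  # eta = 0.75: 25% of the resources are wasted
```

Note that η reduces to the average of the eff_i, which is why it can be read directly as the fraction of time invested in computation.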
Communication efficiency = µLB * Transfer
Serializations / dependences (µLB)
Measured using Dimemas
– With an ideal network, Transfer (efficiency) = 1
[Figure: computation/communication timelines with MPI_Send / MPI_Recv dependences – a perfectly balanced case (LB = 1) can still lose time to serialization]
Scaling model
Dimemas simulation with ideal target
– Latency = 0; BW = ∞
CommEff = µLB · Transfer
– µLB: migrating/local load imbalance, serialization
Transfer = T_ideal / T
µLB = max_i(T_i) / T_ideal
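Under these definitions the split can be computed directly; the ideal runtime T_ideal comes from the Dimemas ideal-network simulation. Numbers below are illustrative.

```python
def split_comm_eff(T_i, T, T_ideal):
    """Split CommEff into serialization (muLB) and transfer terms:
    Transfer = T_ideal / T, muLB = max(T_i) / T_ideal,
    so muLB * Transfer = max(T_i) / T = CommEff."""
    transfer = T_ideal / T
    mu_lb = max(T_i) / T_ideal
    return mu_lb, transfer, mu_lb * transfer

# Real run of 100 s; ideal-network simulation predicts 95 s
mu_lb, transfer, comm_eff = split_comm_eff([80.0, 90.0], 100.0, 95.0)
print(mu_lb, transfer, comm_eff)  # only 5% is actual transfer cost
```

Consistency check: µLB · Transfer collapses back to max(T_i)/T, the CommEff of the previous slide, so the decomposition adds information without changing the model.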
Why scaling?
[Figure: CG-POP mpi2s1D – 180x120: speedup vs. number of processors (0–400)]
Good scalability!! Should we be happy?
η = LB · Ser · Trf
[Figure: CG-POP performance factors vs. number of processors – Parallel eff, LB, uLB, transfer]
η_total = η_parallel · η_instr · η_IPC
[Figure: CG-POP efficiency vs. number of processors – Parallel eff, instr. eff, IPC eff]
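A toy calculation shows why raw speedup can mislead: multiplicative factors may compensate each other. The numbers are invented for illustration.

```python
def total_efficiency(par_eff, instr_eff, ipc_eff):
    """Multiplicative decomposition: eta_total = eta_parallel * eta_instr * eta_IPC."""
    return par_eff * instr_eff * ipc_eff

# Small run: good parallel efficiency, nominal IPC
print(total_efficiency(0.95, 1.0, 1.00))  # 0.95
# Large run: parallel efficiency degraded, but IPC improved (e.g. better
# cache behaviour with smaller per-process working sets)
print(total_efficiency(0.70, 1.0, 1.35))  # ~0.945: looks just as "scalable"
```

The two runs show nearly identical global efficiency, yet only the decomposition reveals that the second one hides a serious parallel inefficiency behind an IPC gain.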
M. Casas et al, “Automatic analysis of speedup of MPI applications”. ICS 2008.
Clustering
Using Clustering to identify structure
IPC
Completed Instructions
Automatic Detection of Parallel Applications Computation Phases (IPDPS 2009)
Performance @ serial computation bursts
[Figure: clustered computation bursts for WRF (128 cores), GROMACS and SPECFEM3D]
– WRF: asynchronous SPMD – balanced #instr, variability in IPC
– GROMACS: MPMD structure – different coupled imbalance trends
– SPECFEM3D: SPMD – repeated substructure, coupled imbalance
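A toy grid-based grouping illustrates the idea of finding structure in the (#instructions, IPC) plane. The actual BSC clustering tool uses density-based clustering (DBSCAN); this sketch is deliberately simpler and all values are invented.

```python
def grid_cluster(bursts, instr_bin, ipc_bin):
    """Group computation bursts that fall in the same cell of a
    (#instructions, IPC) grid -- a crude stand-in for density clustering."""
    clusters = {}
    for instr, ipc in bursts:
        key = (int(instr // instr_bin), int(ipc // ipc_bin))
        clusters.setdefault(key, []).append((instr, ipc))
    return list(clusters.values())

# Two behaviours: short high-IPC bursts and long low-IPC bursts
bursts = [(1.0e6, 1.1), (1.1e6, 1.0), (5.0e6, 0.4), (5.2e6, 0.5)]
groups = grid_cluster(bursts, 2e6, 1.0)
print(len(groups))  # 2 behavioural clusters
```

Each resulting cluster corresponds to a region of the code with a distinct performance behaviour, which can then be characterized with hardware counters.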
Full per region HWC characterization from a single run
Projecting hardware counters based on clustering
Miss ratios Instruction mix Stalls
González J. et al , “Performance Data Extrapolation in Parallel Codes”. ICPADS '10
Integrating models and analytics
What if …
– … we increase the IPC of Cluster 1?
– … we balance Clusters 1 & 2?
(Estimated gains: 19% and 13%)
PEPC
OpenMX (strong scale from 64 to 512 tasks)
Tracking: scalability through clustering
[Figure: cluster scatter plots at 64, 128, 192, 256, 384 and 512 tasks]
Folding
Folding: Detailed metrics evolution
• Performance of a sequential region = 2000 MIPS
Is it good enough?
Is it easy to improve?
Folding: Instantaneous CPI stack
MRGENESIS
• Trivial fix (loop interchange)
• Easy to locate?
• Next step?
• Availability of CPI stack models for production processors?
  • Provided by manufacturers?
“Blind” optimization
From folded samples of a few levels to timeline structure of “relevant” routines
Recommendation without access to source code
Harald Servat et al. “Bio-inspired call-stack reconstruction for performance analysis”. Submitted, 2015.
[Figure: three computation bursts between MPI calls – 17.20 M instructions @ ~1000 MIPS, 24.92 M instructions @ ~1100 MIPS, 32.53 M instructions @ ~1200 MIPS]
Unbalanced MPI application
– Same code
– Different duration
– Different performance
CG-POP multicore MN3 study
Methodology
Performance analysis tools objective
Help validate hypotheses
Help generate hypotheses
Qualitatively
Quantitatively
First steps
Parallel efficiency – percentage of time invested in computation
– Identify sources of “inefficiency”:
  • Load balance
  • Communication / synchronization
Serial efficiency – how far from peak performance?
– IPC, correlate with other counters
Scalability – code replication?
– Total #instructions
Behavioral structure? Variability?
Paraver Tutorial:
Introduction to Paraver and Dimemas methodology
BSC Tools web site
www.bsc.es/paraver
• downloads
– Sources / Binaries
– Linux / Windows / Mac
• documentation
– Training guides
– Tutorial slides
Getting started
– Start wxparaver
– Help → Tutorials and follow the instructions
– Follow training guides
• Paraver introduction (MPI): Navigation and basic understanding of Paraver
operation