
Page 1: 2 Vampir Trace Visualization

Holger Brunst

e-mail: [email protected]

Center for Information Services and High Performance Computing (ZIH)

Performance Analysis on IU’s HPC Systems using Vampir

Trace Visualization

Page 2: 2 Vampir Trace Visualization

Outline

Introduction

Prominent Display Types

Performance Analysis Examples

New Vampir GUI

Page 3: 2 Vampir Trace Visualization

Post-mortem Event-based Performance Analysis

Performance optimization remains one of the key issues in parallel programming

Strong need for performance analysis; the analysis process is still not easy

Profilers do not give detailed insight into the timing behavior of an application

Detailed online analysis is pretty much impossible because of intrusion and data volume

Tracing is an option to capture the dynamic behavior of parallelapplications

Performance analysis done on a post-mortem basis
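
To make this workflow concrete, here is a minimal sketch (not from the slides) of how an MPI code might be instrumented and traced with VampirTrace; the wrapper invocation, application name (myapp) and process count are assumptions based on typical VampirTrace usage:

vtcc -vt:cc mpicc -o myapp myapp.c    # build with the VampirTrace compiler wrapper (instruments the code)
mpirun -np 64 ./myapp                 # run as usual; each process buffers its events locally
# afterwards an OTF trace (myapp.otf plus per-process parts) can be opened in Vampir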

Page 4: 2 Vampir Trace Visualization

Background

Performance visualization and analysis tool

Targets the visualization of dynamic processes on massively parallel (compute) systems

Available for major Unix-based OSes and for Windows

Development started more than 15 years ago at Research Centre Jülich, ZAM

Since 1997, developed at TU Dresden (first: collaboration with Pallas GmbH, from 2003-2005: Intel Software & Solutions Group, since January 2006: TU Dresden, ZIH / GWT-TUD)

Visualization components (Vampir) are commercial


Monitor components (VampirTrace) are Open Source

Page 5: 2 Vampir Trace Visualization

[Architecture diagram: application processes on CPUs 1 … 10,000 are each monitored by VampirTrace, which writes trace data in the Open Trace Format (OTF) as parts 1 … m; VampirServer reads the OTF parts with analysis tasks 1 … n (n << m) and serves the Vampir visualization.]

Page 6: 2 Vampir Trace Visualization

Flavors

Vampir

stabilized sequential version

similar set of displays and options as presented here

less scalable

no ongoing active development

VampirServer

distributed analysis engine

allows server and client on the same workstation as well

new features

Windows port in progress

Page 7: 2 Vampir Trace Visualization

VampirServer

Parallel/distributed server

runs in (part of) production environment

no need to transfer huge traces

parallel I/O

Lightweight client on local workstation

receive visual content only

already adapted to display resolution

moderate network load

Scalability

data volumes > 100 GB

number of processes > 10,000

Page 8: 2 Vampir Trace Visualization

Outline

Introduction

Prominent Display Types

Performance Analysis Examples

Vampir for Windows

Page 9: 2 Vampir Trace Visualization

Main Displays

Global Timeline

Process Timeline + Counter

Counter Timeline

Summary Timeline

Summary Chart (aka. Profile)

Message Statistics

Collective Communication Statistics

I/O Statistics

Call Tree, ...

Page 10: 2 Vampir Trace Visualization

Most Prominent Displays: Global Timeline

Time Axis

MPI Processes

Red: MPI Routines

Other Colors: Application Routines

Black Lines: MPI Messages

Thumbnail

Function Groups

Page 11: 2 Vampir Trace Visualization

Most Prominent Displays: Single Process Timeline

Time Axis

Call Stack Level

MPI I/O

Application Routines

Performance Counter

Page 12: 2 Vampir Trace Visualization

Other Tools

TAU profiling (University of Oregon, USA)

extensive profiling and tracing for parallel applications, with visualization, comparison, etc.

http://www.cs.uoregon.edu/research/tau/

KOJAK (JSC, FZ Jülich)

very scalable performance tracing

automatic performance analysis and classification

http://www.fz-juelich.de/jsc/kojak/

Paraver (CEPBA, Barcelona, Spain)

trace based parallel performance analysis and visualization

http://www.cepba.upc.edu/paraver/

Page 13: 2 Vampir Trace Visualization

Outline

Introduction

Prominent Display Types

Performance Analysis Examples

New Vampir GUI

Page 14: 2 Vampir Trace Visualization

Approaching Performance Problems

Trace Visualization

Vampir provides a number of display types

each provides many customization options

Advice

make a hypothesis about performance problems

consider the application's internal workings, if known

select the appropriate display

use statistic displays in conjunction with timelines

Page 15: 2 Vampir Trace Visualization

Finding Performance Bottlenecks

Four Categories

1. Computation

2. Communication

3. Memory, I/O, …

4. Tracing itself

Page 16: 2 Vampir Trace Visualization

Finding Performance Bottlenecks

Computation

unbalanced workload distribution: single late comer(s)

strictly serial parts of program: idle processes/threads

very frequent tiny function calls: call overhead (see the sketch after this list)

sparse loops
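
A hypothetical illustration of the "very frequent tiny function calls" item above (the function and names are made up, not taken from the slides):

/* A tiny function called once per array element: with automatic
   instrumentation every call becomes an enter/leave event pair, so n
   elements produce 2*n trace events on top of the plain call overhead. */
static double scale(double x) { return 2.0 * x; }

void apply_scale(double *a, long n)
{
    for (long i = 0; i < n; i++)
        a[i] = scale(a[i]);        /* n tiny calls inside a hot loop */
}
/* Remedies: fuse the work into the loop (or let the compiler inline it)
   and/or exclude such functions via selective instrumentation. */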

Page 17: 2 Vampir Trace Visualization

LM-MUSCAT Air Quality Model: Load Imbalance

Many processes waiting for individual processes to finish chemistry computation.

Page 18: 2 Vampir Trace Visualization

LM-MUSCAT Air Quality Model: Bad CPU Partitioning

Meteorology processes are waiting most of the time.

Page 19: 2 Vampir Trace Visualization

LM-MUSCAT Air Quality Model: Good CPU Partitioning

More processes for chemistry-transport, fewer for meteorology: better balance.

Page 20: 2 Vampir Trace Visualization

LM-MUSCAT Air Quality Model: Load Imbalance

High load imbalance only during (simulated) sunrise.

Examine the runtime behavior of the application.

Page 21: 2 Vampir Trace Visualization

SPEC OMP Benchmark fma3d: Unbalanced Threads

Not well balanced OpenMP threads

Page 22: 2 Vampir Trace Visualization

WRF Weather Model: MPI/OpenMP - Idle Threads

Idle Threads

Page 23: 2 Vampir Trace Visualization

Finding Performance Bottlenecks

Communication

communication as such (dominating over computation)

late sender, late receiver

point-to-point messages instead of collective communication (see the sketch after this list)

unmatched messages

overloading of MPI buffers

bursts of large messages (bandwidth)

frequent short messages (latency)

unnecessary synchronization (barrier)
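
As a sketch of the point-to-point vs. collective item above (not taken from the slides; values and counts are made up), the commented-out loop of sends would appear as many individual message lines in the global timeline, while the broadcast is a single collective call:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) value = 42;

    /* Point-to-point pattern: shows up as size-1 individual message lines.
       if (rank == 0)
           for (int i = 1; i < size; i++)
               MPI_Send(&value, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
       else
           MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    */

    /* Collective alternative: one call, typically tree-based inside the MPI library. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d of %d has value %d\n", rank, size, value);
    MPI_Finalize();
    return 0;
}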

Page 24: 2 Vampir Trace Visualization

High Performance Linpack Using Open MPI

Everything looks OK here

Page 25: 2 Vampir Trace Visualization

HPL with Alternative MPI Implementation

Several slow messages. MPI problem?

Page 26: 2 Vampir Trace Visualization

HPL with Alternative MPI Implementation

Transfer rate only 1.63 MB/s!

Tracking down performance problems to individual events

Page 27: 2 Vampir Trace Visualization

Finding Performance Bottlenecks

Memory bound computation

inefficient L1/L2/L3 cache usage

TLB misses

detectable via HW performance counters (see the sketch after this list)

I/O bound computation

slow input/output

sequential I/O on single process

I/O load imbalance

Exception handling
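
A minimal sketch of how such hardware counters might be requested with VampirTrace, assuming a build with PAPI support; the counter names and process count are only examples:

export VT_METRICS=PAPI_FP_OPS:PAPI_L2_TCM   # record floating-point operations and L2 cache misses
mpirun -np 64 ./myapp                       # counter values show up in the Counter Timeline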

Page 28: 2 Vampir Trace Visualization

Performance Counters: Floating Point Exceptions

Page 29: 2 Vampir Trace Visualization

WRF Weather Model: Floating Point Exceptions

FPU exceptions lead to long runtime of routine ADVECT. Timeline interval: 77.7 ms

With other optimization and no FPU exceptions: only 10 ms for the same program section

Page 30: 2 Vampir Trace Visualization

WRF Weather Model: Low I/O Performance

Transfer rate only 389 kB/s!

Page 31: 2 Vampir Trace Visualization

WRF Weather Model: Slow Metadata Operations

128 processes call open – takes more than 4 seconds

Page 32: 2 Vampir Trace Visualization

Semtex CFD Application: Serial I/O

Process 0 is performing I/O … while 127 processes are waiting
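
A minimal sketch (not from the slides) of the parallel alternative: each rank writes its own block with a collective MPI-IO call instead of funnelling all data through process 0; the file name and block size are made up:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double local_data[1024];                   /* this rank's share of the result */
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 1024; i++)
        local_data[i] = (double)rank;

    MPI_File_open(MPI_COMM_WORLD, "result.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* Collective write: every rank writes its own block at its own offset,
       so no single process becomes the I/O bottleneck. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof(local_data),
                          local_data, 1024, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}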

Page 33: 2 Vampir Trace Visualization

Complex Cell Application: RAxML (1)

RAxML (Randomized Accelerated Maximum Likelihood) with 8 SPEs, ramp-up phase

Page 34: 2 Vampir Trace Visualization

Complex Cell Application: RAxML (2)

RAxML with 8 SPEs, 4000 ns window; enlargement of a small loop

shifted start of loop, constant runtime

Page 35: 2 Vampir Trace Visualization

Complex Cell Application: RAxML (3)

RAxML with 8 SPEs, 4000 ns window; enlargement of a small loop (modified): synchronous start, memory contention

Page 36: 2 Vampir Trace Visualization

Complex Cell Application: RAxML (4)

RAxML with 16 SPEs, load imbalance

Page 37: 2 Vampir Trace Visualization

Finding Performance Bottlenecks

Tracing

measurement overhead

– especially severe for tiny function calls

– solve with selective instrumentation (see the filter sketch after this list)

long, asynchronous trace buffer flushes

too many concurrent counters

– more data

Heisenbugs
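
As an example of the selective-instrumentation remedy above (a sketch; the exact filter syntax may differ between VampirTrace versions, and the function names are made up): the first two lines form a filter file, e.g. myapp.filter, that excludes the listed functions entirely and caps all other functions at 3,000,000 recorded calls; VT_FILTER_SPEC then points the instrumented run at it.

scale;tiny_helper_* -- 0
* -- 3000000

export VT_FILTER_SPEC=$PWD/myapp.filter
mpirun -np 64 ./myapp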

Page 38: 2 Vampir Trace Visualization

Trace Buffer Flush

Page 39: 2 Vampir Trace Visualization

Outline

Introduction

Prominent Display Types

Performance Analysis Examples

New Vampir GUI

Page 40: 2 Vampir Trace Visualization

Product Family

Vampir for UNIX:

VampirClassic (all-in-one, single-threaded, Unix OpenMotif based)

VampirServer (parallelized (MPI) client/server program approach)

Vampir for Windows:

Based on VampirServer's parallel service engine

New Windows GUI on top of the harnessed VampirServer services

[Diagram of the three flavors: Vampir Classic is an all-in-one, single-threaded Motif application; VampirServer couples a parallelized service engine to a Motif visualization via sockets; Vampir for Windows HPC Server puts a Windows GUI on top of a threaded service DLL through an API.]

Page 41: 2 Vampir Trace Visualization


New GUI Layout

Global Time Selection with Summary

Shared Chart Area

Chart Selection

Page 42: 2 Vampir Trace Visualization

Chart Overview

Page 43: 2 Vampir Trace Visualization

Chart Arrangement

Page 44: 2 Vampir Trace Visualization

Windows Event Tracing

Windows HPC Server 2008

Microsoft MPI (MS-MPI) integrated with

Event Tracing for Windows (ETW) infrastructure

Allows MPI tracing

No "special" builds needed; just run the application with an extra mpiexec flag (-trace), as in the example after this list

High-precision CPU clock correction for MS-MPI (mpicsync)

Tracing prerequisites:

User must be in Administrator or Performance Log group

Jobs should be executed exclusively in the Windows HPC Server 2008 Scheduler to avoid confusion/conflicts with the trace data
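
A sketch of what such a run might look like, assuming the -trace switch mentioned above together with mpiexec's usual -n option (application name and process count are made up):

mpiexec -trace -n 16 myApp.exe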


Page 45: 2 Vampir Trace Visualization

Creation of OTF Traces

Run myApp with tracing enabled

Time-Sync the ETL logs

Convert the ETL logs to OTF

Copy OTF files to head node

[Workflow diagram: on each rank node (0 … N), MS-MPI and ETW write a local event log (trace.etl) while myApp.exe runs; mpicsync time-syncs the log and etl2otf converts it to OTF; the formatted trace files are then copied to \\share\userHome\Trace\ on the head node as trace.etl_otf.otf, trace.etl_otf.0.def, trace.etl_otf.1.events, trace.etl_otf.2.events, …]

Page 46: 2 Vampir Trace Visualization

Creation of OTF Traces

The four steps are created as individual tasks in a cluster job.

The task options allow choosing the number of cores for the job and other parameters.

The "Dependency" setting ensures the right order of execution of the tasks.

Page 47: 2 Vampir Trace Visualization

Creation of OTF Traces

File system prerequisites:

"\\share\userHome" is the shared user directory throughout the cluster

MPI executable "myApp.exe" is available in the shared directory

"\\share\userHome\Trace" is the directory where the OTF files are collected

Launch the program with the -tracefile option

mpiexec -wdir \\share\userHome\ -tracefile %USERPROFILE%\trace.etl myApp.exe

-wdir sets the working directory; myApp.exe has to be located there

%USERPROFILE% translates to the local home directory, e.g. "C:\Users\userHome", on each node

the event log file (.etl) is stored locally in this directory

Page 48: 2 Vampir Trace Visualization

Creation of OTF Traces

Time-sync the log files on all nodes

mpiexec -cores 1 -wdir %USERPROFILE% mpicsync trace.etl

"-cores 1": run only one instance of mpicsync on each node

Convert the log files to OTF files

mpiexec -cores 1 -wdir %USERPROFILE% etl2otf trace.etl

Copy all OTF files from the nodes to the trace directory on the share

mpiexec -cores 1 -wdir %USERPROFILE% cmd /c copy /y "*_otf*" "\\share\userHome\Trace"


Page 49: 2 Vampir Trace Visualization

Summary

Hybrid MPI/OpenMP trace file with 1024 cores

256 MPI ranks

4 OpenMP threads per rank

Some feedback from users:

"This was very impressive to see live as they had never seen their application profiled at this scale, and vampir pointed us at the problem straight away"

"I was impressed by how detailed MPI functions are visualized in a zoomable fashion into microseconds scale. We have some parallel C# programs currently run on Windows cluster of up to 128 core. I will use Vampir to test on other applications I have."

Work in progress with regular updates

Completion of charts

Additional information sources from ETW

Page 50: 2 Vampir Trace Visualization

Team

Ronny Brendel

Dr. Holger Brunst

Jens Doleschal

Matthias Jurenz

Dr. Andreas Knüpfer

Matthias Lieber

Christian Mach

Holger Mickler

Dr. Hartmut Mix

Dr. Matthias Müller

Prof. Wolfgang E. Nagel

Michael Peter

Matthias Weber

Thomas William

Thank You

http://www.vampir.eu

http://www.tu-dresden.de/zih/vampirtrace