14
John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit : Multi-platform Tools for Performance Analysis http://www.hipersoft.rice.edu/hpctoolkit/

John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit

Embed Size (px)

Citation preview

Page 1: John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit

John Mellor-Crummey

Robert Fowler Nathan Tallent Gabriel Marin

Department of Computer Science, Rice University

Los Alamos Computer Science Institute

HPCToolkit : Multi-platform Tools for Performance Analysis

http://www.hipersoft.rice.edu/hpctoolkit/

Page 2: John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit

http://www.hipersoft.rice.edu/hpctoolkit/ 2

The Big Picture

• Long-term: compiler and architecture research requires detailed performance understanding

—identify performance bottlenecks in complex applications

—understand the mismatch between application needs and architecture capabilities

—automate strategies for performance improvement

• Short-term result: programmer-accessible tools for understanding application performance

Page 3: John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit

http://www.hipersoft.rice.edu/hpctoolkit/ 3

Performance Analysis and Tuning

• Increasingly necessary

—gap between typical and peak performance is growing

• Increasingly hard

—complex architectures are harder to program effectively

– deeply-pipelined microprocessors

VLIW or superscalar

– complex memory hierarchy

non-blocking, multi-level caches

—large-scale scientific applications pose challenges for tools

Page 4: John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit

http://www.hipersoft.rice.edu/hpctoolkit/ 4

LACSI HPCToolkit

• Support large, multi-lingual applications—a mix of of Fortran, C, C++—hundreds of thousands of lines, many procedures—external libraries

• Eliminate manual labor from run, analyze tune cycle—use optimized application binaries directly

– no: manual instrumentation, build process changes, recompilation

• Platform, language, and compiler independence—emphasis on LANL ASC Platforms (Origin, AlphaServer, Opteron)—multiple data sources cross platform comparisons

• Scalable data collection

• Effective presentation of analysis results—intuitive, top-down user interface

– hierarchical program structure with loop level metrics

Page 5: John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit

http://www.hipersoft.rice.edu/hpctoolkit/ 5

HPCToolkit System Overview

applicationsource

applicationsource

profile execution

profile execution

performanceprofile

performanceprofile

binaryobject code

binaryobject code

compilation

linking

binary analysisbinary analysis

programstructure

programstructure

interpret profileinterpret profile

source correlation

source correlation

hyperlinkeddatabase

hyperlinkeddatabase

hpcviewer

hpcviewer

Page 6: John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit

http://www.hipersoft.rice.edu/hpctoolkit/ 6

HPCToolkit System Overview

profile execution

profile execution

performanceprofile

performanceprofile

applicationsource

applicationsource

binaryobject code

binaryobject code

compilation

linking

binary analysisbinary analysis

programstructure

programstructure

interpret profileinterpret profile

source correlation

source correlation

hyperlinkeddatabase

hyperlinkeddatabase

hpcviewer

hpcviewer

—launch unmodified, optimized application binaries—collect statistical profiles of events of interest

Page 7: John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit

http://www.hipersoft.rice.edu/hpctoolkit/ 7

HPCToolkit System Overview

profile execution

profile execution

performanceprofile

performanceprofile

applicationsource

applicationsource

binaryobject code

binaryobject code

compilation

linking

binary analysisbinary analysis

programstructure

programstructure

interpret profileinterpret profile

source correlation

source correlation

hyperlinkeddatabase

hyperlinkeddatabase

hpcviewer

hpcviewer

—decode instructions and combine with profile data

Page 8: John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit

http://www.hipersoft.rice.edu/hpctoolkit/ 8

HPCToolkit System Overview

profile execution

profile execution

performanceprofile

performanceprofile

applicationsource

applicationsource

binaryobject code

binaryobject code

compilation

linking

binary analysisbinary analysis

programstructure

programstructure

interpret profileinterpret profile

source correlation

source correlation

hyperlinkeddatabase

hyperlinkeddatabase

hpcviewer

hpcviewer

—extract loop nesting information from executables

Page 9: John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit

http://www.hipersoft.rice.edu/hpctoolkit/ 9

HPCToolkit System Overview

profile execution

profile execution

performanceprofile

performanceprofile

applicationsource

applicationsource

binaryobject code

binaryobject code

compilation

linking

binary analysisbinary analysis

programstructure

programstructure

interpret profileinterpret profile

source correlation

source correlation

hyperlinkeddatabase

hyperlinkeddatabase

hpcviewer

hpcviewer

—synthesize new metrics by combining metrics —relate metrics, structure, and program source

Page 10: John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit

http://www.hipersoft.rice.edu/hpctoolkit/ 10

HPCToolkit System Overview

profile execution

profile execution

performanceprofile

performanceprofile

applicationsource

applicationsource

binaryobject code

binaryobject code

compilation

linking

binary analysisbinary analysis

programstructure

programstructure

interpret profileinterpret profile

source correlation

source correlation

hyperlinkeddatabase

hyperlinkeddatabase

hpcviewer

hpcviewer

—support top-down analysis with interactive viewer—analyze results anytime, anywhere

Page 11: John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit

http://www.hipersoft.rice.edu/hpctoolkit/ 11

HPCViewer Screenshot

MetricsNavigation

Annotated Source View

Page 12: John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit

http://www.hipersoft.rice.edu/hpctoolkit/ 12

Impact on LANL Code Teams

• HPCToolkit deployed on Origin— improved SAGE by 2x on one example (see next slide)

• First performance workshop (Feb 03)— Feedback: needed on Q, smaller DB on large codes— Improvements: Sophisticated support for Alpha/Tru64 platform,

new Java browser using compact database

• Second performance workshop (July 03)— Feedback: ready to use, binary analysis too slow on large codes— Improvement: sped up binary analysis on large codes by 30x

• HPCToolkit deployed on secure machines (July 03)— used to evaluate FLAG for ASCI burn code review (Aug 03)

• Ongoing interactions— Feedback: better support for shared libraries and Opteron— Improvement: new support for shared libraries installed on Q— Ongoing work: LANL/Rice collaboration for Opteron support

Page 13: John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit

http://www.hipersoft.rice.edu/hpctoolkit/ 13

Sage Solver Performance Improvement

0

20

40

60

80

100

120

140

160

180

1-ProcMFLOPS

BlueMtn 1proc

BlueMtn 5proc

Q 1 Proc Q 5-proc

BeforeAfter

Page 14: John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit

http://www.hipersoft.rice.edu/hpctoolkit/ 14

Future

• Collect and present dynamic context— what path gets us to expensive computations

— accurate call-graph profiling of unmodified binaries

— analysis and presentation of dynamic context to explain performance

– solver is slow only when called on non-preconditioned matrices

– MPI wait cost is incurred in the backsolve

• Statistical clustering — effective analysis of large collections of processes

• Performance diagnosis — why rather than what