Upload
annice-constance-terry
View
224
Download
2
Embed Size (px)
Citation preview
John Mellor-Crummey
Robert Fowler Nathan Tallent Gabriel Marin
Department of Computer Science, Rice University
Los Alamos Computer Science Institute
HPCToolkit : Multi-platform Tools for Performance Analysis
http://www.hipersoft.rice.edu/hpctoolkit/
http://www.hipersoft.rice.edu/hpctoolkit/ 2
The Big Picture
• Long-term: compiler and architecture research requires detailed performance understanding
—identify performance bottlenecks in complex applications
—understand the mismatch between application needs and architecture capabilities
—automate strategies for performance improvement
• Short-term result: programmer-accessible tools for understanding application performance
http://www.hipersoft.rice.edu/hpctoolkit/ 3
Performance Analysis and Tuning
• Increasingly necessary
—gap between typical and peak performance is growing
• Increasingly hard
—complex architectures are harder to program effectively
– deeply-pipelined microprocessors
VLIW or superscalar
– complex memory hierarchy
non-blocking, multi-level caches
—large-scale scientific applications pose challenges for tools
http://www.hipersoft.rice.edu/hpctoolkit/ 4
LACSI HPCToolkit
• Support large, multi-lingual applications—a mix of of Fortran, C, C++—hundreds of thousands of lines, many procedures—external libraries
• Eliminate manual labor from run, analyze tune cycle—use optimized application binaries directly
– no: manual instrumentation, build process changes, recompilation
• Platform, language, and compiler independence—emphasis on LANL ASC Platforms (Origin, AlphaServer, Opteron)—multiple data sources cross platform comparisons
• Scalable data collection
• Effective presentation of analysis results—intuitive, top-down user interface
– hierarchical program structure with loop level metrics
http://www.hipersoft.rice.edu/hpctoolkit/ 5
HPCToolkit System Overview
applicationsource
applicationsource
profile execution
profile execution
performanceprofile
performanceprofile
binaryobject code
binaryobject code
compilation
linking
binary analysisbinary analysis
programstructure
programstructure
interpret profileinterpret profile
source correlation
source correlation
hyperlinkeddatabase
hyperlinkeddatabase
hpcviewer
hpcviewer
http://www.hipersoft.rice.edu/hpctoolkit/ 6
HPCToolkit System Overview
profile execution
profile execution
performanceprofile
performanceprofile
applicationsource
applicationsource
binaryobject code
binaryobject code
compilation
linking
binary analysisbinary analysis
programstructure
programstructure
interpret profileinterpret profile
source correlation
source correlation
hyperlinkeddatabase
hyperlinkeddatabase
hpcviewer
hpcviewer
—launch unmodified, optimized application binaries—collect statistical profiles of events of interest
http://www.hipersoft.rice.edu/hpctoolkit/ 7
HPCToolkit System Overview
profile execution
profile execution
performanceprofile
performanceprofile
applicationsource
applicationsource
binaryobject code
binaryobject code
compilation
linking
binary analysisbinary analysis
programstructure
programstructure
interpret profileinterpret profile
source correlation
source correlation
hyperlinkeddatabase
hyperlinkeddatabase
hpcviewer
hpcviewer
—decode instructions and combine with profile data
http://www.hipersoft.rice.edu/hpctoolkit/ 8
HPCToolkit System Overview
profile execution
profile execution
performanceprofile
performanceprofile
applicationsource
applicationsource
binaryobject code
binaryobject code
compilation
linking
binary analysisbinary analysis
programstructure
programstructure
interpret profileinterpret profile
source correlation
source correlation
hyperlinkeddatabase
hyperlinkeddatabase
hpcviewer
hpcviewer
—extract loop nesting information from executables
http://www.hipersoft.rice.edu/hpctoolkit/ 9
HPCToolkit System Overview
profile execution
profile execution
performanceprofile
performanceprofile
applicationsource
applicationsource
binaryobject code
binaryobject code
compilation
linking
binary analysisbinary analysis
programstructure
programstructure
interpret profileinterpret profile
source correlation
source correlation
hyperlinkeddatabase
hyperlinkeddatabase
hpcviewer
hpcviewer
—synthesize new metrics by combining metrics —relate metrics, structure, and program source
http://www.hipersoft.rice.edu/hpctoolkit/ 10
HPCToolkit System Overview
profile execution
profile execution
performanceprofile
performanceprofile
applicationsource
applicationsource
binaryobject code
binaryobject code
compilation
linking
binary analysisbinary analysis
programstructure
programstructure
interpret profileinterpret profile
source correlation
source correlation
hyperlinkeddatabase
hyperlinkeddatabase
hpcviewer
hpcviewer
—support top-down analysis with interactive viewer—analyze results anytime, anywhere
http://www.hipersoft.rice.edu/hpctoolkit/ 11
HPCViewer Screenshot
MetricsNavigation
Annotated Source View
http://www.hipersoft.rice.edu/hpctoolkit/ 12
Impact on LANL Code Teams
• HPCToolkit deployed on Origin— improved SAGE by 2x on one example (see next slide)
• First performance workshop (Feb 03)— Feedback: needed on Q, smaller DB on large codes— Improvements: Sophisticated support for Alpha/Tru64 platform,
new Java browser using compact database
• Second performance workshop (July 03)— Feedback: ready to use, binary analysis too slow on large codes— Improvement: sped up binary analysis on large codes by 30x
• HPCToolkit deployed on secure machines (July 03)— used to evaluate FLAG for ASCI burn code review (Aug 03)
• Ongoing interactions— Feedback: better support for shared libraries and Opteron— Improvement: new support for shared libraries installed on Q— Ongoing work: LANL/Rice collaboration for Opteron support
http://www.hipersoft.rice.edu/hpctoolkit/ 13
Sage Solver Performance Improvement
0
20
40
60
80
100
120
140
160
180
1-ProcMFLOPS
BlueMtn 1proc
BlueMtn 5proc
Q 1 Proc Q 5-proc
BeforeAfter
http://www.hipersoft.rice.edu/hpctoolkit/ 14
Future
• Collect and present dynamic context— what path gets us to expensive computations
— accurate call-graph profiling of unmodified binaries
— analysis and presentation of dynamic context to explain performance
– solver is slow only when called on non-preconditioned matrices
– MPI wait cost is incurred in the backsolve
• Statistical clustering — effective analysis of large collections of processes
• Performance diagnosis — why rather than what