Upload
jasper-mcgee
View
232
Download
4
Tags:
Embed Size (px)
Citation preview
Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3
Philip [email protected]
Shirley [email protected]
Daniel [email protected]
October 11,2001ScicomP4
Knoxville, TN2
Hardware Counters
• Small set of registers that count events, which are occurrences of specific signals related to the processor’s function
• Monitoring these events facilitates correlation between the structure of the source/object code and the efficiency of the mapping of that code to the underlying architecture.
October 11,2001ScicomP4
Knoxville, TN3
Goals of PAPI
• Solid foundation for cross platform performance analysis tools
• Free tool developers from re-implementing counter access
• Standardization between vendors, academics and users
• Encourage vendors to provide hardware and OS support for counter access
• Reference implementations for a number of HPC architectures
• Well documented and easy to use
October 11,2001ScicomP4
Knoxville, TN4
Overview of PAPI
• Performance Application Programming Interface
• The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
• Parallel Tools Consortium project http://www.ptools.org/
October 11,2001ScicomP4
Knoxville, TN5
PAPI Counter Interfaces
• PAPI provides three interfaces to the underlying counter hardware: 1. The low level interface manages hardware
events in user defined groups called EventSets.
2. The high level interface simply provides the ability to start, stop and read the counters for a specified list of events.
3. Graphical tools to visualize information.
October 11,2001ScicomP4
Knoxville, TN6
PAPI Implementation
Tools!!!
PAPI Low LevelPAPI High Level
Hardware Performance Counter
Operating System
Kernel Extension
PAPI Machine Dependent SubstrateMachine
SpecificLayer
PortableLayer
October 11,2001ScicomP4
Knoxville, TN7
PAPI Preset Events
• Proposed standard set of events deemed most relevant for application performance tuning
• Defined in papiStdEventDefs.h• Mapped to native events on a
given platform– Run tests/avail to see list of PAPI
preset events available on a platform
October 11,2001ScicomP4
Knoxville, TN8
High-level Interface
• Meant for application programmers wanting coarse-grained measurements
• Not thread safe• Calls the lower level API• Allows only PAPI preset events• Easier to use and less setup
(additional code) than low-level
October 11,2001ScicomP4
Knoxville, TN9
High-level API
• C interfacePAPI_start_countersPAPI_read_countersPAPI_stop_countersPAPI_accum_countersPAPI_num_countersPAPI_flops
• Fortran interfacePAPIF_start_countersPAPIF_read_countersPAPIF_stop_countersPAPIF_accum_countersPAPIF_num_countersPAPIF_flops
October 11,2001ScicomP4
Knoxville, TN10
Using the High-level API
• Int PAPI_num_counters(void)
– Initializes PAPI (if needed)
– Returns number of hardware counters
• int PAPI_start_counters(int *events, int len)
– Initializes PAPI (if needed)
– Sets up an event set with the given counters
– Starts counting in the event set
• int PAPI_library_init(int version)
– Low-level routine implicitly called by above
October 11,2001ScicomP4
Knoxville, TN11
Controlling the Counters
• PAPI_stop_counters(long_long *vals, int alen)– Stop counters and put counter values in array
• PAPI_accum_counters(long_long *vals, int alen)
– Accumulate counters into array and reset
• PAPI_read_counters(long_long *vals, int alen)– Copy counter values into array and reset counters
• PAPI_flops(float *rtime, float *ptime, long_long *flpins, float *mflops)– Wallclock time, process time, FP ins since start,– Mflop/s since last call
October 11,2001ScicomP4
Knoxville, TN12
PAPI_flops
• int PAPI_flops(float *real_time, float *proc_time, long_long *flpins, float *mflops)– Only two calls needed, PAPI_flops before and after
the code you want to monitor– real_time is the wall-clocktime between the two calls– proc_time is the “virtual” time or time the process
was actually executing between the two calls (not as fine grained as real_time but better for longer measurements)
– flpins is the total floating point instructions executed between the two calls
– mflops is the Mflop/s rating between the two calls
October 11,2001ScicomP4
Knoxville, TN13
PAPI High-level Example
long long values[NUM_EVENTS]; unsigned int
Events[NUM_EVENTS]={PAPI_TOT_INS,PAPI_TOT_CYC}; /* Start the counters */ PAPI_start_counters((int*)Events,NUM_EVENTS); /* What we are monitoring? */ do_work(); /* Stop the counters and store the results in values */ retval = PAPI_stop_counters(values,NUM_EVENTS);
October 11,2001ScicomP4
Knoxville, TN14
Name Description PAPI_OK No error PAPI_EINVAL Invalid argument PAPI_ENOMEM Insufficient memory PAPI_ESYS A system/C library call failed. Check errno variable PAPI_ESBSTR Substrate returned an error. E.g. unimplemented feature PAPI_ECLOST Access to the counters was lost or interrupted PAPI_EBUG Internal error PAPI_ENOEVNT Hardware event does not exist PAPI_ECNFLCT Hardware event exists, but resources are exhausted PAPI_ENOTRUN Event or envent set is currently counting PAPI_EISRUN Events or event set is currently running PAPI_ENOEVST No event set available PAPI_ENOTPRESET Argument is not a preset PAPI_ENOCNTR Hardware does not support counters PAPI_EMISC Any other error occured
Return Codes
October 11,2001ScicomP4
Knoxville, TN15
Low-level Interface
• Increased efficiency and functionality over the high level PAPI interface
• About 40 functions• Obtain information about the
executable and the hardware• Thread-safe• Fully programmable• Callbacks on counter overflow
October 11,2001ScicomP4
Knoxville, TN16
Low-level Functionality
• Library initializationPAPI_library_init, PAPI_thread_init,
PAPI_shutdown• Timing functions
PAPI_get_real_usec, PAPI_get_virt_usecPAPI_get_real_cyc, PAPI_get_virt_cyc
• Inquiry functions• Management functions• Simple lock
PAPI_lock/PAPI_unlock
October 11,2001ScicomP4
Knoxville, TN17
Event sets
• The event set contains key information– What low-level hardware counters to use
– Most recently read counter values
– The state of the event set (running/not running)
– Option settings (e.g., domain, granularity, overflow, profiling)
• Event sets can overlap if they map to the same hardware counter set-up.– Allows inclusive/exclusive measurements
October 11,2001ScicomP4
Knoxville, TN18
Event set operations
• Event set managementPAPI_create_eventset, PAPI_add_event[s], PAPI_rem_event[s], PAPI_destroy_eventset
• Event set controlPAPI_start, PAPI_stop, PAPI_read, PAPI_accum
• Event set inquiryPAPI_query_event, PAPI_list_events,...
October 11,2001ScicomP4
Knoxville, TN19
Simple example
#include "papi.h“#define NUM_EVENTS 2int Events[NUM_EVENTS]={PAPI_FP_INS,PAPI_TOT_CYC}, EventSet;
long_long values[NUM_EVENTS];/* Initialize the Library */retval = PAPI_library_init(PAPI_VER_CURRENT);/* Allocate space for the new eventset and do setup */retval = PAPI_create_eventset(&EventSet);/* Add Flops and total cycles to the eventset */retval = PAPI_add_events(&EventSet,Events,NUM_EVENTS);/* Start the counters */retval = PAPI_start(EventSet);
do_work(); /* What we want to monitor*/
/*Stop counters and store results in values */retval = PAPI_stop(EventSet,values);
October 11,2001ScicomP4
Knoxville, TN20
Overlapping counters
retval = PAPI_start(InclEventSet);retval = PAPI_start(OthersEventSet);
...
retval = PAPI_reset(OthersEventSet);do_flops(NUM_FLOPS); /* Function call */retval = PAPI_accum(OthersEventSet,Othersvalues);
...
retval = PAPI_stop(InclEventSet,Inclvalues);printf("Counts: %12lld %12lld\n",
Inclvalues[0], Inclvalues[0]-Othersvalues[0]);
October 11,2001ScicomP4
Knoxville, TN21
Counter Domains
• int PAPI_set_domain(int domain);– PAPI_DOM_USER User context counted
– PAPI_DOM_KERNEL Kernel/OS context counted
– PAPI_DOM_OTHER Exception/transient mode
– PAPI_DOM_ALL All contexts counted
– PAPI_DOM_MIN The smallest available context
– PAPI_DOM_MAX The largest available context
• Requires OS support, all domains not available on all platforms
• Default is PAPI_DOM_USER
October 11,2001ScicomP4
Knoxville, TN22
Using PAPI with Threads
• After PAPI_library_init need to register unique thread identifier function
• For Pthreads
retval=PAPI_thread_init(pthread_self, 0);
• OpenMP
retval=PAPI_thread_init(omp_get_thread_num, 0);
• Each thread responsible for creation, start, stop and read of its own counters
October 11,2001ScicomP4
Knoxville, TN23
Using PAPI with Multiplexing
• Multiplexing allows simultaneous use of more counters than are supported by the hardware.
• Some platforms support multiplexing in the OS (e.g., SGI IRIX); on those that don’t PAPI implements multiplexing in software.
• The more events you multiplex and the shorter the amount of time, the more likely the representation is not correct.
• See John May’s excellent whitepaper
October 11,2001ScicomP4
Knoxville, TN24
Issues with Multiplexing
• PAPI_multiplex_init() – should be called after
PAPI_library_init() to initialize multiplexing
• PAPI_set_multiplex( int *EventSet ); – Used after the eventset is created to
turn on multiplexing for that eventset
• Then use PAPI like normal
October 11,2001ScicomP4
Knoxville, TN25
Native Events
• An event countable by the CPU can be counted even if there is no matching preset PAPI event
• Same interface as when setting up a preset event, but a CPU-specific bit pattern is used instead of the PAPI event definition
• /usr/lpp/pmtoolkit/lib/POWER3-II.evs
October 11,2001ScicomP4
Knoxville, TN26
Power3 Native Event Example
native = (35 << 8) | 1; /* FPU1CMPL */
PAPI_add_event(&EventSet,native)
See for list of native events.
The encoding for native events is:
Lower 8 bits indicate which counter number: 0 – 7
Bits 8-16 indicate which event number: 0-50
October 11,2001ScicomP4
Knoxville, TN27
Power 3 General Events
1. INST_CMPL (1)2. FPU1_CMPL (35)3. LD_MISS_L1 (5)4. LD_CMPL (5)5. FPU0_CMPL (5)6. CYC (12)7. FMA (9)8. TLB (0)
October 11,2001ScicomP4
Knoxville, TN28
Power 3 False Sharing Events
1. L1_M_TO_E_OR_S (25)2. E_TO_S (23)3. L2_E_OR_S_TO_I (21)4. L2_M_TO_E_OR_S (27)5. L2_M_TO_I (10)6. PUSH_INT (8)7. 0INST_DISP (6)8. TLB (0)
October 11,2001ScicomP4
Knoxville, TN29
Callbacks on Counter Overflow
• PAPI provides the ability to call user-defined handlers when a specified event exceeds a specified threshold.
• For systems that do not support counter overflow at the OS level, PAPI sets up a high resolution interval timer and installs a timer interrupt handler.
October 11,2001ScicomP4
Knoxville, TN30
PAPI_overflow
• int PAPI_overflow(int EventSet, int EventCode, int threshold, int flags, PAPI_overflow_handler_t handler)
• Sets up an EventSet such that when it is PAPI_start()’d, it begins to register overflows
• The EventSet may contain multiple events, but only one may be an overflow trigger.
October 11,2001ScicomP4
Knoxville, TN31
Statistical Profiling
• PAPI provides support for execution profiling based on any counter event.
• PAPI_profil() creates a histogram by text address of overflow counts for a specified region of the application code.
• Used in vprof tool from Sandia Livermore
October 11,2001ScicomP4
Knoxville, TN32
PAPI Release Platforms
• Linux/x86, Windows 2000– Requires patch to Linux kernel, driver for
Windows
• Linux/IA-64• Sun Solaris 2.8/Ultra I/II• IBM AIX 4.3+/Power
– Contact IBM for pmtoolkit
• SGI IRIX/MIPS• Compaq Tru64/Alpha Ev6 & Ev67
• Requires OS device driver from Compaq
• Cray T3E/Unicos
October 11,2001ScicomP4
Knoxville, TN33
For More Information
• http://icl.cs.utk.edu/projects/papi/– Software and documentation– Reference materials– Papers and presentations– Third-party tools– Mailing lists
October 11,2001ScicomP4
Knoxville, TN34
PERC: Performance Evaluation Research Center
• Developing a Developing a sciencescience for understanding for understanding
performance of scientific applications on high-end performance of scientific applications on high-end
computer systems. computer systems.
• Developing Developing engineeringengineering strategies for improving strategies for improving
performance on these systems. performance on these systems.
• DOE Labs: ANL, LBNL, LLNL, ORNLDOE Labs: ANL, LBNL, LLNL, ORNL
• Universities: UCSD, UI-UC, UMD, UTKUniversities: UCSD, UI-UC, UMD, UTK
• Funded by SciDAC: Scientific Discovery through Funded by SciDAC: Scientific Discovery through
Advanced ComputingAdvanced Computing
October 11,2001ScicomP4
Knoxville, TN35
PERC: Real-World Applications
• High Energy and Nuclear PhysicsHigh Energy and Nuclear Physics– Shedding New Light on Exploding Stars: Terascale Simulations of Shedding New Light on Exploding Stars: Terascale Simulations of
Neutrino-Driven SuperNovae and Their NucleoSynthesisNeutrino-Driven SuperNovae and Their NucleoSynthesis– Advanced Computing for 21st Century Accelerator Science and Advanced Computing for 21st Century Accelerator Science and
TechnologyTechnology
• Biology and Environmental ResearchBiology and Environmental Research– Collaborative Design and Development of the Community Climate Collaborative Design and Development of the Community Climate
System Model for Terascale ComputersSystem Model for Terascale Computers
• Fusion Energy SciencesFusion Energy Sciences– Numerical Computation of Wave-Plasma Interactions in Multi-Numerical Computation of Wave-Plasma Interactions in Multi-
dimensional Systemsdimensional Systems
• Advanced Scientific ComputingAdvanced Scientific Computing– Terascale Optimal PDE Solvers (TOPS)Terascale Optimal PDE Solvers (TOPS)– Applied Partial Differential Equations Center (APDEC)Applied Partial Differential Equations Center (APDEC)– Scientific Data Management (SDM)Scientific Data Management (SDM)
• Chemical SciencesChemical Sciences– Accurate Properties for Open-Shell States of Large MoleculesAccurate Properties for Open-Shell States of Large Molecules
• ……and more…and more…
October 11,2001ScicomP4
Knoxville, TN36
Parallel Climate Transition Model
• Components for Ocean, Atmosphere, Sea Ice, Land Surface and River Transport
• Developed by Warren Washington’s group at NCAR
• POP: Parallel Ocean Program from LANL
• CCM3: Community Climate Model 3.2 from NCAR including LSM: Land Surface Model
• ICE: CICE from LANL and CCSM from NCAR
• RTM: River Transport Module from UT Austin
October 11,2001ScicomP4
Knoxville, TN37
PCTM: The Code
• 132K lines of code
• Mostly Fortran 90 with some Fortran 77 and C
• Each component runs sequentially
• Each component is parallel
• The overall model must run as pure MPI
• Code is already instrumented with timing library in flux coupler
• Add additional PAPI code to measure 8 hardware metrics for each timer
October 11,2001ScicomP4
Knoxville, TN38
PCTM: Parallel Climate Transition Model
Flux CouplerLand
SurfaceModel
OceanModel Atmosphere
Model
Sea Ice Model
Sequential Executionof Parallelized Modules
RiverModel
October 11,2001ScicomP4
Knoxville, TN39
PCTM: Performance Questions
• Are we cache, functional unit or bound?
• How does shared memory MPI affect performance?
• How many tasks on a 4-way node?
October 11,2001ScicomP4
Knoxville, TN40
PCTM: Answers
• No answers yet… Wait for SC2001– Measure coarse regions of code for
• Instructions per Cycle• Stall Cycles• 0 Instruction Dispatch Cycles
– Measure those regions with multiplexing and inspect code
– Statistically profile candidates for improvement, line by line