PAPI Directions

Dan Terpstra, Innovative Computing Lab, University of Tennessee

IBM Petascale Workshop 2006

Overview

- What's PAPI?
- What's New? Features; Platforms
- What's Next? Network PAPI; Thermal PAPI
- When? PAPI release roadmap
- What's ICL? (a word from our sponsor)

What's PAPI?

A software layer (library) that gives tool developers and application engineers a consistent interface and methodology for using the performance counter hardware found in most major microprocessors.

- Countable events are defined in two ways: platform-neutral Preset Events and platform-dependent Native Events
- Preset Events can be derived from multiple Native Events
- All events are referenced by name and collected in EventSets for sampling
- Events can be multiplexed if counters are limited
- Statistical sampling is implemented by software overflow with timer-driven sampling, or by hardware overflow if the platform supports it
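As a minimal usage sketch (not from the original slides; error checking omitted for brevity), counting two preset events around a region of code looks like this:

    #include <papi.h>
    #include <stdio.h>

    int main(void)
    {
        int EventSet = PAPI_NULL;
        long long values[2];

        /* initialize the library, then build an EventSet from preset events */
        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&EventSet);
        PAPI_add_event(EventSet, PAPI_FP_OPS);   /* floating point operations */
        PAPI_add_event(EventSet, PAPI_TOT_CYC);  /* total cycles */

        PAPI_start(EventSet);
        /* ... region of code to be measured ... */
        PAPI_stop(EventSet, values);             /* values[] holds the counts */

        printf("FP ops: %lld  cycles: %lld\n", values[0], values[1]);
        return 0;
    }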

Where's PAPI

PAPI runs on most modern processors and operating systems of interest to HPC:
- IBM POWER3, 4, 5 / AIX
- POWER4, 5 / Linux
- PowerPC-32 and -64 / Linux
- Blue Gene
- Intel Pentium II, III, 4, M, EM64T, etc. / Linux
- Intel Itanium
- AMD Athlon, Opteron / Linux
- Cray T3E, X1, XD3, XT3 Catamount
- Altix, Sparc, …

NOTE: All Linux implementations require the perfctr kernel patch, except Itanium, which uses the built-in perfmon interface. Perfmon2 development is underway to replace perfctr and be pre-installed in the kernel – NO PATCHES NEEDED!

Extending PAPI beyond the CPU

PAPI has historically targeted on-processor performance counters, but several categories of off-processor counters exist:
- network interfaces: Myrinet, Infiniband, GigE
- memory interfaces: Cray X1
- thermal and power interfaces: ACPI

CHALLENGE: Extend the PAPI interface to address multiple counter domains while preserving the PAPI calling semantics, ease of use, and platform independence for existing applications.

Multi-Substrate PAPI

Goals:
- Isolate hardware-dependent code in a separable 'substrate' module
- Extend platform-independent code to support multiple simultaneous substrates
- Add or modify API calls to support access to any of several substrates
- Modify the build environment for easy selection and configuration of multiple available substrates

PAPI 3.0 Design

[Architecture diagram: the PAPI High Level and Low Level APIs form a portable, hardware-independent layer above a single PAPI machine-dependent substrate (the machine-specific layer), which accesses the hardware performance counters through the operating system and a kernel extension.]

PAPI 4.0 Multiple Substrate Design

[Architecture diagram: the same hardware-independent layer (PAPI High Level and Low Level APIs) now sits above multiple machine-dependent substrates. CPU substrates access the hardware performance counters while an additional substrate accesses off-processor hardware counters, each through its own operating system interface and kernel extension.]

API Changes

Three calls are augmented with a substrate index; the old syntax is preserved in wrapper functions for backward compatibility.

Modified entry points:
- PAPI_create_eventset -> PAPI_create_sbstr_eventset
- PAPI_get_opt -> PAPI_get_sbstr_opt
- PAPI_num_hwctrs -> PAPI_num_sbstr_hwctrs

New entry points for new functionality:
- PAPI_num_substrates
- PAPI_get_sbstr_info

Old code can run with no source modifications.
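The renamed entry points suggest usage along these lines (a hypothetical sketch built only from the names on this slide; the shipped PAPI 4.0 API may differ in detail):

    #include <papi.h>
    #include <stdio.h>

    int main(void)
    {
        PAPI_library_init(PAPI_VER_CURRENT);

        /* enumerate substrates and ask each how many counters it offers */
        int nsub = PAPI_num_substrates();
        for (int s = 0; s < nsub; s++) {
            int EventSet = PAPI_NULL;
            /* an event set is now bound to one substrate at creation time */
            PAPI_create_sbstr_eventset(&EventSet, s);
            printf("substrate %d: %d hardware counters\n",
                   s, PAPI_num_sbstr_hwctrs(s));
            PAPI_destroy_eventset(&EventSet);
        }
        return 0;
    }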

PAPI 4.0 Status

- Multi-substrate development complete
- Some CPU platforms not yet ported
- Substrates available for:
  - ACPI (Advanced Configuration and Power Interface)
  - Myrinet MX
- Substrates under development for:
  - Infiniband
  - GigE
- Friendly User release available now for CVS checkout

Myrinet MX Counters

Native counters exposed by the Myrinet MX interface (roughly 100 in all):

LANAI_UPTIME, COUNTERS_UPTIME, BAD_CRC8, BAD_CRC32, UNSTRIPPED_ROUTE,
PKT_DESC_INVALID, RECV_PKT_ERRORS, PKT_MISROUTED, DATA_SRC_UNKNOWN,
DATA_BAD_ENDPT, DATA_ENDPT_CLOSED, DATA_BAD_SESSION, PUSH_BAD_WINDOW,
PUSH_DUPLICATE, PUSH_OBSOLETE, PUSH_RACE_DRIVER, PUSH_BAD_SEND_HANDLE_MAGIC,
PUSH_BAD_SRC_MAGIC, PULL_OBSOLETE, PULL_NOTIFY_OBSOLETE, PULL_RACE_DRIVER,
ACK_BAD_TYPE, ACK_BAD_MAGIC, ACK_RESEND_RACE, LATE_ACK,
ACK_NACK_FRAMES_IN_PIPE, NACK_BAD_ENDPT, NACK_ENDPT_CLOSED, NACK_BAD_SESSION,
NACK_BAD_RDMAWIN, NACK_EVENTQ_FULL, SEND_BAD_RDMAWIN, CONNECT_TIMEOUT,
CONNECT_SRC_UNKNOWN, QUERY_BAD_MAGIC, QUERY_TIMED_OUT, QUERY_SRC_UNKNOWN,
RAW_SENDS, RAW_RECEIVES, RAW_OVERSIZED_PACKETS, RAW_RECV_OVERRUN,
RAW_DISABLED, CONNECT_SEND, CONNECT_RECV, ACK_SEND, ACK_RECV, PUSH_SEND,
PUSH_RECV, QUERY_SEND, QUERY_RECV, REPLY_SEND, REPLY_RECV, QUERY_UNKNOWN,
DATA_SEND_NULL, DATA_SEND_SMALL, DATA_SEND_MEDIUM, DATA_SEND_RNDV,
DATA_SEND_PULL, DATA_RECV_NULL, DATA_RECV_SMALL_INLINE, DATA_RECV_SMALL_COPY,
DATA_RECV_MEDIUM, DATA_RECV_RNDV, DATA_RECV_PULL, ETHER_SEND_UNICAST_CNT,
ETHER_SEND_MULTICAST_CNT, ETHER_RECV_SMALL_CNT, ETHER_RECV_BIG_CNT,
ETHER_OVERRUN, ETHER_OVERSIZED, DATA_RECV_NO_CREDITS, PACKETS_RESENT,
PACKETS_DROPPED, MAPPER_ROUTES_UPDATE, ROUTE_DISPERSION, OUT_OF_SEND_HANDLES,
OUT_OF_PULL_HANDLES, OUT_OF_PUSH_HANDLES, MEDIUM_CONT_RACE, CMD_TYPE_UNKNOWN,
UREQ_TYPE_UNKNOWN, INTERRUPTS_OVERRUN, WAITING_FOR_INTERRUPT_DMA,
WAITING_FOR_INTERRUPT_ACK, WAITING_FOR_INTERRUPT_TIMER, SLABS_RECYCLING,
SLABS_PRESSURE, SLABS_STARVATION, OUT_OF_RDMA_HANDLES, EVENTQ_FULL,
BUFFER_DROP, MEMORY_DROP, HARDWARE_FLOW_CONTROL, SIMULATED_PACKETS_LOST,
LOGGING_FRAMES_DUMPED, WAKE_INTERRUPTS, AVERTED_WAKEUP_RACE,
DMA_METADATA_RACE

Multiple Measurements

The HPCC HPL benchmark with three performance metrics: FLOPS, temperature, and network sends/receives. Node 7:

[chart]

Multiple Measurements (continued)

The same three metrics for Node 3:

[chart]
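A sketch of how such a mixed measurement might be assembled with the multi-substrate API (hypothetical: the substrate indices and the "ACPI_TEMPERATURE" event name are illustrative assumptions; PUSH_SEND is taken from the Myrinet MX counter list above):

    #include <papi.h>

    /* assumed substrate indices: 0 = CPU, 1 = ACPI, 2 = Myrinet MX */
    void measure_hpl_region(void)
    {
        int cpu = PAPI_NULL, acpi = PAPI_NULL, mx = PAPI_NULL, code;
        long long flops, temp, sends;

        PAPI_create_sbstr_eventset(&cpu, 0);
        PAPI_add_event(cpu, PAPI_FP_OPS);                   /* FLOPS */

        PAPI_create_sbstr_eventset(&acpi, 1);
        PAPI_event_name_to_code("ACPI_TEMPERATURE", &code); /* hypothetical name */
        PAPI_add_event(acpi, code);                         /* temperature */

        PAPI_create_sbstr_eventset(&mx, 2);
        PAPI_event_name_to_code("PUSH_SEND", &code);        /* network sends */
        PAPI_add_event(mx, code);

        PAPI_start(cpu); PAPI_start(acpi); PAPI_start(mx);
        /* ... run the HPL kernel ... */
        PAPI_stop(mx, &sends); PAPI_stop(acpi, &temp); PAPI_stop(cpu, &flops);
    }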

Data Structure Addressing

Goal: Measure events related to specific data addresses (structures).

Availability:
- Itanium: 160 of its 475 native events
- Rumored on POWER4; POWER5?

PAPI example:

    ...
    opt.addr.eventset = EventSet;
    opt.addr.start = (caddr_t)array;                  /* requested range start */
    opt.addr.end   = (caddr_t)(array + size_array);   /* requested range end */
    retval = PAPI_set_opt(PAPI_DATA_ADDRESS, &opt);
    /* the hardware may adjust the range to its own granularity; the
       returned offsets describe the range actually in effect */
    actual.start = (caddr_t)array - opt.addr.start_off;
    actual.end   = (caddr_t)(array + size_array) + opt.addr.end_off;
    ...

Rensselaer to Build and House $100 Million Supercomputer

NY Times, May 11, 2006

Rensselaer Polytechnic Institute announced yesterday that it was combining forces with New York State and I.B.M. to build a $100 million supercomputer that will be among the 10 most powerful in the world. The computer, a type of I.B.M. system known as Blue Gene, will be on Rensselaer's campus in Troy, N.Y., and will have the power to perform more than 70 trillion calculations per second. It will mainly be used to help researchers make smaller, faster semiconductor devices and for nanotechnology research.

PAPI and BG/L

Performance Counters:
- 48 UPC counters: shared by both CPUs, external to the CPU cores, only 32 bits :(
- 2 counters on each FPU: 1 counts load/stores, 1 counts arithmetic operations
- Accessed via bgl_perfctr

[Diagram: each CPU core has 2 FPU PMCs; both cores share the 48-counter UPC module.]

PAPI and BG/L (2): Versions

PAPI 2.3.4
- Original release
- Poor native event support

PAPI 3.2.2 beta
- Currently being beta tested
- Full access to native events by name

Limitations
- Only events exposed by bgl_perfctr
- No control over native event edges
- Still no overflow/profile support (is there a timer available?)
- No configure script (cross-compilation)
- No scripted acceptance test suite (multiple queuing systems)

PAPI and BG/L (3): Presets

Test case avail.c: Available events and hardware information.
-------------------------------------------------------------------------
Vendor string and code   : (1312)
Model string and code    : PVR=0x5202:0x1891 Serial=R00-M0-N1-C:J16-U01 (1375869073)
CPU Revision             : 20994.062500
CPU Megahertz            : 700.000000
CPU's in this Node       : 1
Nodes in this System     : 32
Total CPU's              : 32
Number Hardware Counters : 52
Max Multiplex Counters   : 32
-------------------------------------------------------------------------
Name             Derived  Description (Mgr. Note)
PAPI_L3_TCM      No       Level 3 cache misses ()
PAPI_L3_LDM      Yes      Level 3 load misses ()
PAPI_L3_STM      No       Level 3 store misses ()
PAPI_FMA_INS     No       FMA instructions completed ()
PAPI_TOT_CYC     No       Total cycles ()
PAPI_L2_DCH      Yes      Level 2 data cache hits ()
PAPI_L2_DCA      Yes      Level 2 data cache accesses ()
PAPI_L3_TCH      No       Level 3 total cache hits ()
PAPI_FML_INS     No       Floating point multiply instructions ()
PAPI_FAD_INS     No       Floating point add instructions ()
PAPI_BGL_OED     No       BGL special event: Oedipus operations ()
PAPI_BGL_TS_32B  Yes      BGL special event: Torus 32B chunks sent ()
PAPI_BGL_TS_FULL Yes      BGL special event: Torus no token UPC cycles ()
PAPI_BGL_TR_DPKT Yes      BGL special event: Tree 256 byte packets ()
PAPI_BGL_TR_FULL Yes      BGL special event: UPC cycles (CLOCKx2) tree rcv is full ()
-------------------------------------------------------------------------
avail.c PASSED

PAPI and BG/L (4): Native Events

328 native events are available -- only events exposed by bgl_perfctr:
- 4 arithmetic events per FPU
- 4 load/store events per FPU
- 312 UPC events

BGL_FPU_ARITH_ADD_SUBTRACT  0x40000000  |Add and subtract, fadd, fadds, fsub, fsubs (Book E add, substract)|
BGL_FPU_ARITH_MULT_DIV      0x40000001  |Multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)|
BGL_FPU_ARITH_OEDIPUS_OP    0x40000002  |Oedipus operations, All symmetric, asymmetric, and complex Oedipus multiply-add instructions|
...
BGL_UPC_TS_ZP_VCD0_CHUNKS   0x40000145  |ZP vcd0 chunks|
BGL_UPC_TS_ZP_VCD1_CHUNKS   0x40000146  |ZP vcd1 chunks|
BGL_PAPI_TIMEBASE           0x40000148  |special event for getting the timebase reg|
-------------------------------------------------------------------------
Total events reported: 328
native_avail.c PASSED
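Counting one of these native events by name is straightforward (a minimal sketch; error checking omitted):

    #include <papi.h>
    #include <stdio.h>

    int main(void)
    {
        int EventSet = PAPI_NULL, code;
        long long count;

        PAPI_library_init(PAPI_VER_CURRENT);
        /* translate the native event's name into an event code */
        PAPI_event_name_to_code("BGL_FPU_ARITH_ADD_SUBTRACT", &code);
        PAPI_create_eventset(&EventSet);
        PAPI_add_event(EventSet, code);

        PAPI_start(EventSet);
        /* ... work to be measured ... */
        PAPI_stop(EventSet, &count);
        printf("adds/subtracts: %lld\n", count);
        return 0;
    }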

XT3 and Catamount

The Oak Ridger, February 21, 2006

“The Cray XT3 Jaguar, the flagship computing system in ORNL's Leadership Computing Facility, was ranked tenth in the world in a November 2005 survey of supercomputers, delivering 20.5 trillion operations per second (teraflops).”

PAPI and Catamount

- Opteron-based; the Catamount OS is similar to CNK
- Driven by a Sandia/Cray version of perfctr
- No overflow / profiling
- Configure works because compile node == compute node
- The test suite script works because there's only one queuing system

When? PAPI Release Schedule

PAPI 3.3.0: RealSoonNow™
- BG/L in beta testing
- Merging and deprecating PAPI 3.0.8.1
- Regression testing on other platforms

PAPI 4.0: Q2, 2006
- Porting some substrates to the multi-substrate model
- Developing additional non-CPU substrates

Wanna help? Distributed testing…

Distributed Testing

Problem: How do you develop/test/verify on multiple systems with multiple OSes at multiple sites? Automatically, transparently, repetitively.

Candidate frameworks:
- Dart / CTest
- Mozilla Tinderbox
- DejaGnu
- Homegrown
- Others?

A Word from our Sponsor…

Innovative Computing Laboratory

Jack's research group in the CS Department. Size: about 45-50 people (16 students; 19 scientific staff; 10 support staff; 1 visitor).

Funding:
- NSF: Supercomputer Centers (UCSD & NCSA), Next Generation Software (NGS), Info Tech Res. (ITR), Middleware Init. (NMI)
- DOE: Scientific Discovery through Advanced Computing (SciDAC), Math in Comp Sci (MICS)
- DARPA: High Productivity Computing Systems
- DOD Modernization
- Work with companies: AMD, Cray, Dolphin, Microsoft, MathWorks, Intel, Sun, Myricom, SGI, HP, IBM, Northrop Grumman

For students:
- PhD dissertation, MS project
- Equipment: a number of clusters, desktop machines, office setup
- Summer internships: industry, ORNL, …
- Travel to meetings
- Participation in publications

ICL Class of 2005

Speculative Performance Positions

Post-doc positions probably available:
- PAPI: new platforms (Cell?), new substrates (Infiniband?)
- KOJAK: automated performance analysis
- Eclipse PTP & TAU integration

See me for brochures or more info.

PAPI Directions

Dan Terpstra, Innovative Computing Lab, University of Tennessee