UPC/SHMEM Language Analysis and Usability Study

11 July 2005

Professor Alan D. George, Principal Investigator
Mr. Hung-Hsun Su, Sr. Research Assistant
Mr. Adam Leko, Sr. Research Assistant
Mr. Bryan Golden, Research Assistant
Mr. Hans Sherburne, Research Assistant

HCS Research Laboratory
University of Florida

PAT

Language Study

Purpose and Method

Purpose
  Determine performance factors purely from language’s perspective
  Gain insight into how to best incorporate performance measurement with various implementations

Method
  Create a complete and minimal factor list
  Analyze UPC and SHMEM (Quadrics) specs
  Analyze various UPC/SHMEM implementations + discussions with developers

Factor List Creation

Factor list developed based on observations from other (tool, analytical model, etc.) studies
  Ensures factors are measurable
  Provides insight into how they can be measured

Only basic events included to eliminate redundancy
  Sufficient for time-based analysis and memory system analysis

Completion notification – calling thread waiting for completion of a one-sided operation initiated by calling thread

Synchronization – multiple threads waiting for each other to complete a single task

Local access – refers only to access of local shared (global) variable

Factor List

Each entry reads "Factor: basic events"

Computation
  Execution (useful work): Time
  System overhead: Compiler, runtime, OS, thread, I/O

Communication
  Bulk transfer: Variable name, size, count, transfer time, overhead
  Small transfer: Variable name, count, value, total time (transfer + overhead)
  Synchronization: Notify time, wait time, count, overhead
  Completion notification: Wait time, count, overhead

Memory
  Local access: Variable name, time, count
  Cache (remote data): Size, miss/hit count, variable name

Other
  Scalability: System size
  Resource management: I/O, etc.
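One way to make the factor list usable by a tool is to store it as a lookup table and check each trace record against it. The sketch below is only an illustration: the dictionary keys are transcribed from the communication and memory rows of the list above, while the record layout and the helper name missing_events are assumptions, not part of the study.

```python
# Factor -> basic events, transcribed from the factor list above.
# Only the communication and memory factors are shown.
FACTORS = {
    "bulk transfer": {"variable name", "size", "count", "transfer time", "overhead"},
    "small transfer": {"variable name", "count", "value", "total time"},
    "synchronization": {"notify time", "wait time", "count", "overhead"},
    "completion notification": {"wait time", "count", "overhead"},
    "local access": {"variable name", "time", "count"},
    "cache": {"size", "miss/hit count", "variable name"},
}


def missing_events(factor, record):
    """Return the basic events a trace record lacks for the given factor.

    `record` is a hypothetical dict of measured values keyed by event name.
    """
    return sorted(FACTORS[factor] - set(record))


# Example: a completion record that did not capture overhead.
incomplete = {"wait time": 0.004, "count": 1}
print(missing_events("completion notification", incomplete))
```

A check like this only confirms that the basic events were captured; it says nothing about how accurately they were measured.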

SHMEM Analysis

Performed on Quadrics SHMEM specification and GPSHMEM library
  Great similarity between implementations
  Factors for each construct involve execution plus
    Small transfer (put/get)
    Synchronization (other)
  Variations between implementations troublesome

A standard for SHMEM/GPSHMEM function set is desirable
  General: provides user with a uniform library set
  PAT: reduces complexity of system (i.e., possibly only one wrapper library is sufficient)

Wrapper approach (e.g., PSHMEM) fits very well
  Can borrow many ideas from PMPI
  However, analysis of data transfers needs special care to handle one-sided communication

See Language Analysis sub-report for construct-factor assignments
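The wrapper idea and the one-sided caveat can be sketched in a few lines. This is a toy model, not SHMEM: put and quiet are hypothetical stand-ins for library entry points, and the profiled decorator plays the role a PMPI-style wrapper library would play in C. The point it illustrates is that a wrapper around a one-sided put sees only the time to issue the transfer; the wait for completion shows up later, in a different call.

```python
import functools
import time
from collections import defaultdict

# Per-factor statistics gathered by the wrappers.
stats = defaultdict(lambda: {"count": 0, "seconds": 0.0})


def profiled(factor):
    """Return a decorator that times a call and charges it to `factor`."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)  # forward to the wrapped routine
            finally:
                entry = stats[factor]
                entry["count"] += 1
                entry["seconds"] += time.perf_counter() - start
        return wrapper
    return decorate


# Hypothetical stand-ins for the underlying library (assumptions, not real bindings).
pending = []  # puts issued but not yet complete
remote = {}   # toy "remote memory": (pe, name) -> value


@profiled("small transfer")
def put(name, value, pe):
    pending.append((pe, name, value))  # one-sided: returns before completion


@profiled("completion notification")
def quiet():
    while pending:  # completion only becomes observable here
        pe, name, value = pending.pop(0)
        remote[(pe, name)] = value
```

After two calls to put and one to quiet, stats holds two small-transfer events and one completion event. Attributing the completion wait back to the individual puts is the part that needs the special care noted above.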

UPC Analysis (1)

Performed on UPC spec. 1.1, Berkeley UPC, Michigan Tech UPC, and HP UPC (in progress)
  See Language Analysis sub-report for construct-factor assignment

Specification analysis
  Educated guesses, attempts to cover all aspects of language
  Too generic for PAT development

Implementations
  Many similarities between implementations
  Wrapper mentality works with UPC function constructs: PUPC proposal
  Pre-processor needed to handle UPC non-function constructs

UPC Analysis (2)

Implementations (cont.)
  HP-UPC
    Composed of UPC Compiler (compiler), Run-Time System (RTS), and (optional) Run-Time Environment (RTE)
    UPC global variable access translates to HW shared-memory access, which impacts time of instrumentation
    Waiting for Brian at HP to send details on UPC functions to complete construct-factor assignment
  GCC-UPC: will be studied after completion of HP UPC

UPC Specification Construct-Factor Table (1)

Shared variable
  Declaration: Execution, overhead, synchronization, local access
  Assignment: Execution, overhead, bulk/small transfer, completion, local access, cache
  upc_threadof, upc_resetphase, upc_addrfield: Execution
  upc_affinitysize: Execution, synchronization

Shared lock
  upc_lock_t (declaration): Execution, overhead, synchronization
  upc_global_lock_alloc, upc_lock_free: Execution, overhead, small transfer, synchronization, local access
  upc_all_lock_alloc, upc_lock, upc_unlock, upc_lock_attempt: Execution, overhead, small transfer, completion, local access

Global address space
  upc_global_alloc, upc_all_alloc: Execution, overhead, small transfer, synchronization
  upc_alloc, upc_local_alloc, upc_free: Execution, overhead, small transfer, completion, local access

UPC Specification Construct-Factor Table (2)

Environment
  MYTHREAD, THREADS: Execution, overhead
  upc_strict / upc_strict.h, upc_relaxed / upc_relaxed.h: N/A. These affect how the other constructs need to be tracked but do not themselves relate to any factor.
  upc_global_exit: Execution, synchronization, resource management

Bulk memory transfer
  upc_memcpy, upc_memget, upc_memput, upc_memset: Execution, bulk/small transfer, completion, local access, cache

Synchronization
  upc_notify, upc_wait, upc_barrier, upc_fence: Execution, synchronization

Other
  upc_forall: Execution, synchronization, local access
  UPC_MAX_BLOCK_SIZE, upc_localsizeof, upc_blocksizeof, upc_elemsizeof (compile time constants): Execution, small transfer

Berkeley UPC Analysis (1)

Based on version 2.0.1

Analysis at UPC level with some consideration at communication level

Noteworthy implementation details
  upc_all_alloc and upc_all_lock_alloc: use of all-to-all broadcast
  upc_alloc and upc_global_alloc behave like upc_local_alloc: double size of heap when running out of space
  Multiple mechanisms for implementing barrier
    HW supported (e.g., InfiniBand)
    Custom barrier (e.g., SHMEM/LAPI)
    Centralized (other, current)
    Logarithmic dissemination (other, future)

Impact on PAT
  UPC-level-only instrumentation: 1 unit, less accurate
  UPC + communication-level instrumentation: multiple units, more accurate
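To see why one UPC-level barrier can turn into multiple units at the communication level, it helps to count the messages behind it. The sketch below uses the textbook form of the dissemination barrier (in round k, thread i signals thread (i + 2^k) mod P) and a simple report-then-release model of a centralized barrier. Both are assumptions for illustration; the slides name the mechanisms but do not describe how Berkeley UPC implements them.

```python
def dissemination_rounds(num_threads):
    """Number of rounds needed: ceil(log2(P)), and 0 for a single thread."""
    return (num_threads - 1).bit_length()


def dissemination_schedule(num_threads):
    """For each round k, list (sender, receiver) pairs: i signals (i + 2**k) mod P."""
    schedule = []
    for k in range(dissemination_rounds(num_threads)):
        pairs = [(i, (i + 2 ** k) % num_threads) for i in range(num_threads)]
        schedule.append(pairs)
    return schedule


def centralized_messages(num_threads):
    """Centralized model: every other thread reports in, then each is released."""
    return 2 * (num_threads - 1)


if __name__ == "__main__":
    for round_no, pairs in enumerate(dissemination_schedule(4)):
        print("round", round_no, pairs)
    print("centralized messages for 8 threads:", centralized_messages(8))
```

A tool that instruments only the UPC call sees a single barrier event; one that also instruments the communication layer sees every signal in the schedule.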

Berkeley UPC Analysis (2)

Noteworthy implementation details (cont.)
  Three different translations for upc_forall
    All tasks can be done by 1 thread: if statement followed by a regular for loop
    Tasks are cyclic distributed: for loop with stride factor equal to number of threads
    Tasks are block distributed: two-level for loops are used (outer level is same as in second case and inner loop is a regular for loop corresponding to all elements in block)

Impact on PAT – instrumentation needed before translation
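The three loop shapes can be sketched by listing which iterations each thread ends up executing. This is a Python model of the translated loops, not the translator's actual output; the function names and the block parameter are assumptions chosen for the illustration.

```python
def single_thread(n, owner, mythread):
    """Case 1: an if statement guarding a regular for loop."""
    return list(range(n)) if mythread == owner else []


def cyclic(n, threads, mythread):
    """Case 2: a for loop whose stride equals the number of threads."""
    return list(range(mythread, n, threads))


def block_distributed(n, threads, mythread, block):
    """Case 3: two-level loop.

    The outer loop steps from one of this thread's blocks to the next
    (a stride of `threads` blocks, i.e. threads * block elements);
    the inner loop is a regular loop over the elements of one block.
    """
    mine = []
    for start in range(mythread * block, n, threads * block):
        for i in range(start, min(start + block, n)):
            mine.append(i)
    return mine


if __name__ == "__main__":
    print(cyclic(10, 4, 1))
    print(block_distributed(10, 2, 0, 3))
    print(block_distributed(10, 2, 1, 3))
```

Once the loop has been rewritten into one of these forms, the original upc_forall is no longer visible in the code, which is why instrumentation has to be inserted before translation.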

Berkeley UPC Construct-Factor Table (1)

Shared variable
  Declaration: Execution
  Pointer assignment: Execution
  Scalar variable/array assignment: Execution, small transfer if data is remote, completion
  upc_threadof, upc_resetphase, upc_addrfield, upc_affinitysize: Execution

Shared lock
  Shared lock declaration (upc_lock_t): Execution
  upc_global_lock_alloc:
    Node 0: Execution, local access
    Other: Execution
    Rare case: + small transfer, completion
  upc_all_lock_alloc: Execution, synchronization, local access
  upc_lock, upc_lock_attempt: Execution, small transfer, completion
  upc_unlock, upc_lock_free: Execution, small transfer

Global address space
  upc_all_alloc: Execution, synchronization, local access
  upc_global_alloc, upc_alloc, upc_local_alloc:
    Node 0: Execution, local access
    Other: Execution
    Rare case: + small transfer, completion
  upc_free: Execution, small transfer (if not owner), local access

Berkeley UPC Construct-Factor Table (2)

Environment
  MYTHREAD, THREADS: Execution
  upc_strict / upc_strict.h, upc_relaxed / upc_relaxed.h: N/A
  upc_global_exit: Execution, small transfer, synchronization (system dependent), local access

Bulk memory transfer
  upc_memget, upc_memput, upc_memcpy: Execution, bulk/small transfer, completion, local access (for upc_memcpy, see item 12 in the notes and observations section for detail)

Synchronization
  upc_notify: Execution, synchronization [notify]
  upc_wait: Execution, synchronization [wait]
  upc_barrier: Execution, synchronization
  upc_fence: Execution

Other
  upc_forall: Execution
  UPC_MAX_BLOCK_SIZE, upc_localsizeof, upc_blocksizeof, upc_elemsizeof: Execution (expanded to constant at translation time)

Michigan Tech UPC Analysis

Based on version 1.1

Noteworthy implementation details
  Uses centralized control for most control processes (i.e., split and non-split barriers, collective array allocation, collective lock allocation, and global exit)
  Based on a two-pthread system using a producer-consumer mechanism
    Program thread (producer): adds entries to appropriate send queues
    Communication thread (consumer): sends and processes requests via MPI (no aggregation of data for optimization, bulk transfer = x small transfers)
    Impact on PAT – transfer, completion, and synchronization are much harder to track
  Uses flat broadcast and tree broadcast
  Caching capability complicates analysis
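The producer-consumer arrangement described above can be modeled with two threads and a queue. This is a sketch under stated assumptions: Python threads stand in for the two pthreads, appending to a list stands in for the MPI send, and the names send_queue, bulk_put, and run are hypothetical.

```python
import queue
import threading

send_queue = queue.Queue()  # hypothetical send queue shared by the two threads
delivered = []              # requests the communication thread has "sent"
DONE = object()             # sentinel telling the consumer to stop


def bulk_put(values, dest):
    """Program thread (producer): one queued request per element, no aggregation."""
    for offset, value in enumerate(values):
        send_queue.put((dest, offset, value))


def communication_thread():
    """Consumer: drains the queue; a real runtime would issue an MPI send here."""
    while True:
        request = send_queue.get()
        if request is DONE:
            break
        delivered.append(request)


def run(values, dest):
    """Run one bulk transfer through the two-thread pipeline and return what was sent."""
    delivered.clear()
    worker = threading.Thread(target=communication_thread)
    worker.start()
    bulk_put(values, dest)
    send_queue.put(DONE)
    worker.join()
    return list(delivered)


if __name__ == "__main__":
    print(run([10, 20, 30], dest=1))
```

The program thread returns from bulk_put as soon as the requests are queued, so the time it observes is not the transfer time. That gap between issuing and sending is what makes transfer, completion, and synchronization harder to track.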

MTU UPC Construct-Factor Table (1)

Shared variable
  Declaration: Execution, overhead, local access
  Pointer assignment: Execution
  Scalar variable/array assignment:
    Owner: Execution, local access, cache
    Other: Execution, small transfer
  upc_threadof, upc_resetphase, upc_addrfield, upc_affinitysize: Execution

Shared lock
  Shared lock declaration (upc_lock_t): Execution
  upc_global_lock_alloc: Execution
  upc_all_lock_alloc: Execution, synchronization (2, barrier + tree broadcast)
  upc_lock, upc_unlock, upc_lock_attempt: Execution, synchronization
  upc_lock_free:
    Owner: Execution
    Other: Execution, small transfer (should actually be classified as part of completion notification; however, the calling thread does not wait for a reply)

Global address space
  upc_global_alloc:
    Node 0: Execution, local access
    Other: Execution, completion
  upc_all_alloc, upc_free: Execution, synchronization (2, barrier + tree broadcast), local access
  upc_alloc, upc_local_alloc: Execution, local access

MTU UPC Construct-Factor Table (2)

Environment
  MYTHREAD, THREADS: Execution
  upc_strict / upc_strict.h, upc_relaxed / upc_relaxed.h: Execution, cache
  upc_global_exit: Execution, synchronization

Bulk memory transfer
  upc_memcpy: Execution, small transfer, synchronization, local access (see item 10 in the notes and observations section for detail)
  upc_memget, upc_memput, upc_memset:
    Owner: Execution, local access, cache
    Other: Execution, small transfer

Synchronization
  upc_notify: Execution, synchronization, cache
  upc_wait: Execution, synchronization, cache
  upc_barrier: Execution, small transfer, synchronization, cache
  upc_fence: Execution, cache

Other
  upc_forall: Execution
  UPC_MAX_BLOCK_SIZE, upc_localsizeof, upc_blocksizeof, upc_elemsizeof: Execution

Summary

Factor list and construct-factor assignment provide basis for practical event tracing in UPC and SHMEM

SHMEM
  Wrapper library approach appears ideal
  Push for SHMEM standardization will simplify development

UPC
  Hybrid pre-processor/wrapper library approach appears appropriate (compatible with GCC-UPC?)

Analysis provides insights on how to instrument UPC/SHMEM programs and raises awareness of possible difficulties

Usability Study

Usability: Purpose and Methods

Purpose
  Determine factors affecting usability of performance tools
  Determine how to incorporate knowledge about factors into PAT

Methods
  Elicit user feedback through a Performance Tool Usability Survey (survey generated after some literature reviews)
  Review and provide a concise summary of literature in area of usability for parallel performance tools

Outline
  Discuss common problems seen in performance tools
  Provide a discussion on factors influencing usability of performance tools
  Outline how to incorporate user-centered design into PAT
  Present guidelines to avoid common usability problems

Performance Tool Usability: General Performance Tool Problems

Difficult problem for tool developer
  Inherently unstable execution environment
  Monitoring behavior may disturb original behavior
  Short lifetime of parallel computers

Users
  Tools too difficult to use
  Too complex
  Unsuitable for real-world applications
  Users skeptical about value of tools

Discussion on Usability Factors* (1)

Ease-of-learning
  Concern
    Important for attracting new users
    Tool’s interface shapes user’s understanding of its functionality
    Inconsistency leads to confusion (e.g., providing defaults for some objects but not all)
  Possible solutions
    Strive for internally and externally consistent tool
    Stick to established conventions
    Provide uniform interface
    Target as many platforms as necessary so user can amortize time invested over many uses

Usefulness
  Concern: how directly tool helps user achieve their goal
  Possible solution: make common case simple even if that makes rare case complex

* C. Pancake, “Improving the Usability of Numerical Software through User-Centered Design,” The Quality of Numerical Software: Assessment and Enhancement, ed. B. Ford and J. Rice, pp. 44-60, 1997.

Discussion on Usability Factors (2)

Ease-of-use
  Concern: amount of effort required to accomplish work with tool too high to justify tool’s use
  Possible solutions
    Do not force user to memorize information about interface – use menus, mnemonics, and other mechanisms
    Provide a simple interface
    Make all user-required actions concrete and logical

Throughput
  Concern: how does tool contribute to user productivity in general
  Keep in mind that inherent goal of tool is to increase user productivity

User-Centered Design

Concept that usability should be driving factor in tool development

Based on premise that usability will only be achieved if design process is user-driven

Four-step model to incorporate user feedback* (chronological)
  Ensure initial functionality is based on user needs
    Solicit input directly from users
      MPI users (for information about existing tools)
      UPC/SHMEM users
      Sponsor
  Analyze how users identify and correct performance problems
    UPC/SHMEM users primarily
    Gain better idea of how the tool will actually be used on real programs
    Information from users then presented to sponsor for critique/feedback
  Implement incrementally
    Organize interface so that most useful features are best supported
    User evaluation of preliminary/prototype designs
    Maintain strong relationship with users to whom we have access
  Have users evaluate every aspect of tool’s interface, structure, and behavior
    Alpha/Beta testing
    User tests should be performed at many points along the way
    Feature-by-feature refinement in response to specific user feedback

* S. Musil, G. Pigel, M. Tscheligi. “User Centered Monitoring of Parallel Programs with InHouse.” HCI ’94 Ancillary Proceedings, 1994.

Performance Tool Usability: Guidelines

Issues for Performance Tools and Solutions
  Many tools begin by presenting windows with detailed info on a performance metric
    Users prefer broader perspective on application behavior
  Some tools provide multiple views of program behavior
    Good idea, but need support for comparing different metrics
    For example, does CPU utilization drop in the same place that L1 cache miss rate rises?
    Also essential to provide source-code correlation to be useful
  User does not want info that cannot be used to fix code
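The kind of cross-metric comparison mentioned above can be as simple as lining up two sampled series and flagging where they move in opposite directions. The sketch assumes both metrics were sampled at the same instants; the function name and sample values are made up for the illustration.

```python
def coinciding_changes(cpu_util, l1_miss_rate):
    """Return sample indices where CPU utilization falls while L1 miss rate rises."""
    hits = []
    for i in range(1, min(len(cpu_util), len(l1_miss_rate))):
        cpu_dropped = cpu_util[i] < cpu_util[i - 1]
        misses_rose = l1_miss_rate[i] > l1_miss_rate[i - 1]
        if cpu_dropped and misses_rose:
            hits.append(i)
    return hits


if __name__ == "__main__":
    cpu = [0.9, 0.9, 0.5, 0.6, 0.4]        # made-up utilization samples
    miss = [0.02, 0.02, 0.10, 0.09, 0.09]  # made-up L1 miss-rate samples
    print(coinciding_changes(cpu, miss))
```

Mapping each flagged sample back to a source line is the source-code correlation step, which this sketch leaves out.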

Performance Tool Usability: Summary

Tool will not gain user acceptance until usable in real-world environment
  Need to identify successful user strategies from existing tools for real applications
  Devise ways to apply successful strategies to tool in an intuitive manner
  Use this functionality in development of new tool

Presentation Methodology: Introduction

Why use visualizations?
  To facilitate user comprehension
  To convey complexity and intricacy of performance data
  To help bridge gap between raw performance data and performance improvements

When to use visualizations?
  On-line: visualization while application is running (can slow down execution significantly)
  Post mortem: after execution (usually based on trace data gathered at runtime)

What to visualize?
  Interactive displays to guide the user
  Default visualizations should provide high-level views
  Low-level information should be easily accessible

General Approaches to Performance Visualization

General Categories
  System/Application-independent: depict performance data for variety of systems and applications – most tools use this approach
  Meta-tools: facilitate development of custom visualization tools

Other Categories
  On-line: visualization during execution
    Can be intrusive
    Volume of information may be too large to interpret without playback functionality
    Allows user to observe only interesting parts of execution without waiting
  Post mortem: visualization after execution
    Have to wait to see visualizations
    Easier to implement
    Less intrusion on application behavior

Useful Visualization Techniques

Animation
  Has been employed by various tools to provide program execution replay
  Most commonly animated events are communication operations
  Viewing data dynamically may illuminate bottlenecks more efficiently
  However, animation usually very cumbersome in practice

Program graphs
  Generalized picture of entire system

Gantt charts
  De facto standard for displaying inter-process communication

Data access displays
  Each cell of 2D display is devoted to an element of the array
  Color distinguishes between local/remote and read/write

Critical path analysis
  Concerned with identifying program regions which most contribute to program execution time
  Graph depicts synchronization and communication dependencies among processes in program
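Critical path analysis reduces to finding the longest path through the dependency graph. The sketch below assumes the graph is acyclic and small enough for recursion; the region names, durations, and the function name critical_path are invented for the example.

```python
def critical_path(durations, deps):
    """Longest path through a dependency DAG.

    durations: region -> execution time
    deps: region -> list of regions that must finish before it starts
    Returns (list of regions on the path, total time along the path).
    """
    finish = {}  # earliest finish time of each region
    prev = {}    # predecessor on the longest path into each region

    def visit(node):
        if node in finish:
            return finish[node]
        best, best_pred = 0.0, None
        for pred in deps.get(node, []):
            t = visit(pred)
            if t > best:
                best, best_pred = t, pred
        finish[node] = best + durations[node]
        prev[node] = best_pred
        return finish[node]

    last = max(durations, key=visit)  # region that finishes latest
    path = []
    node = last
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1], finish[last]


if __name__ == "__main__":
    durations = {"p0_compute": 4, "p1_compute": 2, "p1_wait": 1, "p1_finish": 3}
    deps = {"p1_wait": ["p0_compute", "p1_compute"], "p1_finish": ["p1_wait"]}
    print(critical_path(durations, deps))
```

In this example the wait on process 1 depends on process 0's computation, so the path runs through p0_compute even though the final region belongs to process 1. Dependencies of that cross-process kind are the ones the graph is meant to expose.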

Summary of Visualizations

Animation
  Advantages: Adds another dimension to visualizations
  Disadvantages: CPU intensive, cumbersome at times
  Requirements document: Advanced
  Used for: Various

Program Graphs (N-ary tree)
  Advantages: Built-in zooming; integration of high and low-level data
  Disadvantages: Difficult to see inter-process data
  Requirements document: Other
  Used for: Comprehensive program visualization

Gantt Charts (Time histogram; Timeline)
  Advantages: Ubiquitous; intuitive
  Disadvantages: Not as applicable to shared memory as to message passing
  Requirements document: Core, Functional
  Used for: Communication graphs

Data Access Displays (2D array)
  Advantages: Provides detailed information regarding the dynamics of shared data
  Disadvantages: Narrow focus; users may not be familiar with this type of visualization
  Requirements document: Core, Functional
  Used for: Data structure visualization

Kiviat Diagrams
  Advantages: Provides an easy way to represent statistical data
  Disadvantages: Can be difficult to understand
  Requirements document: Advanced
  Used for: Various statistical data (processor utilization, cache miss rates, etc.)

Event Graph Displays (Timeline)
  Advantages: Can be used to display multiple data types (event-based)
  Disadvantages: Provides mostly high-level information
  Requirements document: Advanced
  Used for: Inter-process dependency

Guidelines and Interface Evaluation

General Guidelines*
  Visualization should guide, not rationalize
  Scalability is crucial
  Color should inform, not entertain
  Visualization should be interactive
  Visualizations should provide meaningful labels
  Default visualization should provide useful information
  Avoid showing too much detail
  Visualization controls should be simple

Goals, Operators, Methods, and Selection Rules (GOMS)
  Formal user interface evaluation technique
  Way to characterize a set of design decisions from point of view of user
  Description of what user must learn; may be basis for reference documentation
  May be able to use GOMS analysis in design of PAT
  Knowledge described in a form that can actually be executed (there have been several fairly successful attempts to implement GOMS analysis in software, e.g., GLEAN)
  Various incarnations of GOMS with different assumptions useful for more specific analyses (KLM, CMN-GOMS, NGOMSL, CPM-GOMS, etc.)

* B. Miller. “What to Draw? When to Draw? An Essay on Parallel Program Visualization,” Journal of Parallel and Distributed Computing, 18:2, pp. 265-269, 1993.

Simple GOMS Example: OS X

GOMS model for OS X
  Method for goal: delete a file
    Step 1. Think of file name and retain as first filespec (file specifier)
    Step 2. Accomplish goal: drag file to trash
    Step 3. Return with goal accomplished
  Method for goal: move a file
    Step 1. Think of file name and retain as first filespec
    Step 2. Think of destination directory name and retain as second filespec
    Step 3. Accomplish goal: drag file to destination
    Step 4. Return with goal accomplished

Simple GOMS Example: UNIX

GOMS model for UNIX
  Method for goal: delete a file
    Step 1. Recall that command verb is rm -f
    Step 2. Think of file name and retain as first filespec
    Step 3. Accomplish goal: enter and execute a command
    Step 4. Return with goal accomplished
  Method for goal: copy a file
    Step 1. Recall that command verb is cp
    Step 2. Think of file name and retain as first filespec
    Step 3. Think of destination directory name and retain as second filespec
    Step 4. Accomplish goal: enter and execute a command
    Step 5. Return with goal accomplished
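Because a GOMS model is written as explicit steps, it can be stored as data and compared mechanically. The sketch below transcribes the OS X and UNIX methods from the two example slides and compares them by step count and by the number of steps that require recalling something from memory. Both measures are crude proxies chosen for illustration, not the operator-level time estimates a full GOMS analysis would produce, and the function names are assumptions.

```python
# Each method is an ordered list of steps, transcribed from the two example models.
GOMS = {
    "OS X": {
        "delete a file": [
            "Think of file name and retain as first filespec",
            "Accomplish goal: drag file to trash",
            "Return with goal accomplished",
        ],
        "move a file": [
            "Think of file name and retain as first filespec",
            "Think of destination directory name and retain as second filespec",
            "Accomplish goal: drag file to destination",
            "Return with goal accomplished",
        ],
    },
    "UNIX": {
        "delete a file": [
            "Recall that command verb is rm -f",
            "Think of file name and retain as first filespec",
            "Accomplish goal: enter and execute a command",
            "Return with goal accomplished",
        ],
        "copy a file": [
            "Recall that command verb is cp",
            "Think of file name and retain as first filespec",
            "Think of destination directory name and retain as second filespec",
            "Accomplish goal: enter and execute a command",
            "Return with goal accomplished",
        ],
    },
}


def step_count(system, goal):
    """Number of steps in the method for `goal` on `system`."""
    return len(GOMS[system][goal])


def recall_steps(system, goal):
    """Number of steps that ask the user to recall a command from memory."""
    return sum(step.startswith("Recall") for step in GOMS[system][goal])


if __name__ == "__main__":
    for system in GOMS:
        print(system, step_count(system, "delete a file"), recall_steps(system, "delete a file"))
```

For the shared goal of deleting a file, the UNIX method has one more step than the OS X method, and the extra step is the recall of the command verb.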

Summary

Plan for development
  Develop a preliminary interface that provides functionality required by user while conforming to visualization guidelines
  After preliminary design is complete, elicit user feedback
  During periods when user contact is unavailable, may be able to use GOMS analysis or another formal interface evaluation technique