Page 1: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 1

Advanced Parallel Programming with OpenMP

Tim MattsonIntel Corporation

Computational Sciences Laboratory

Rudolf EigenmannPurdue University

School of Electrical and Computer Engineering

Page 2: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 2

SC’2000 Tutorial Agenda

OpenMP: A Quick Recap
OpenMP Case Studies
  – including performance tuning
Automatic Parallelism and Tools Support
Common Bugs in OpenMP programs
  – and how to avoid them
Mixing OpenMP and MPI
The Future of OpenMP

Page 3: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 3

OpenMP: Recap

omp_set_lock(lck)

#pragma omp parallel for private(A, B)

#pragma omp critical

C$OMP parallel do shared(a, b, c)

C$OMP PARALLEL REDUCTION (+: A, B)

call OMP_INIT_LOCK (ilok)

call omp_test_lock(jlok)

setenv OMP_SCHEDULE “dynamic”

CALL OMP_SET_NUM_THREADS(10)

C$OMP DO lastprivate(XX)

C$OMP ORDERED

C$OMP SINGLE PRIVATE(X)

C$OMP SECTIONS

C$OMP MASTER

C$OMP ATOMIC

C$OMP FLUSH

C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)

C$OMP THREADPRIVATE(/ABC/)

C$OMP PARALLEL COPYIN(/blk/)

Nthrds = OMP_GET_NUM_PROCS()

!$OMP BARRIER

OpenMP: An API for Writing Multithreaded Applications

– A set of compiler directives and library routines for parallel application programmers

– Makes it easy to create multi-threaded (MT) programs in Fortran, C and C++

– Standardizes last 15 years of SMP practice

Page 4: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 4

OpenMP: Supporters*

Hardware vendors
  – Intel, HP, SGI, IBM, SUN, Compaq
Software tools vendors
  – KAI, PGI, PSR, APR
Applications vendors
  – ANSYS, Fluent, Oxford Molecular, NAG, DOE ASCI, Dash, Livermore Software, and many others

*The names of these vendors were taken from the OpenMP web site (www.openmp.org). We have made no attempt to confirm OpenMP support, verify conformity to the specifications, or measure the degree of OpenMP utilization.

Page 5: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 5

OpenMP: Programming Model

Fork-Join Parallelism:
  – The master thread spawns a team of threads as needed.
  – Parallelism is added incrementally: i.e., the sequential program evolves into a parallel program.

[Figure: a master thread forking into teams of threads at successive parallel regions, then joining back to a single thread.]
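A minimal C sketch of this fork-join model (the printed messages and program structure are illustrative, not from the tutorial):

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      printf("master thread: serial part\n");       /* only the master thread runs here */

  #pragma omp parallel                               /* fork: a team of threads is created */
      {
          int id = omp_get_thread_num();
          printf("parallel region, thread %d\n", id);
      }                                              /* join: implicit barrier, team retires */

      printf("master thread: serial part again\n");  /* back to the single master thread */
      return 0;
  }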

Page 6: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 6

OpenMP: How is OpenMP typically used? (in C)

OpenMP is usually used to parallelize loops:
  – Find your most time-consuming loops.
  – Split them up between threads.

Sequential Program:

  void main()
  {
    double Res[1000];
    for(int i=0;i<1000;i++) {
      do_huge_comp(Res[i]);
    }
  }

Parallel Program (split up this loop between multiple threads):

  #include "omp.h"
  void main()
  {
    double Res[1000];
  #pragma omp parallel for
    for(int i=0;i<1000;i++) {
      do_huge_comp(Res[i]);
    }
  }

Page 7: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 7

OpenMP: How is OpenMP typically used? (Fortran)

OpenMP is usually used to parallelize loops:
  – Find your most time-consuming loops.
  – Split them up between threads.

Sequential Program:

      program example
      double precision Res(1000)
      do I=1,1000
        call huge_comp(Res(I))
      end do
      end

Parallel Program (split up this loop between multiple threads):

      program example
      double precision Res(1000)
C$OMP PARALLEL DO
      do I=1,1000
        call huge_comp(Res(I))
      end do
      end

Page 8: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 8

OpenMP: How do threads interact?

OpenMP is a shared memory model.
  – Threads communicate by sharing variables.
Unintended sharing of data causes race conditions:
  – race condition: when the program's outcome changes as the threads are scheduled differently.
To control race conditions:
  – Use synchronization to protect data conflicts.
Synchronization is expensive, so:
  – Change how data is accessed to minimize the need for synchronization.
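As a small illustration of the point above (the counter and loop bound are made up): updating a shared variable without protection is a race condition; an atomic (or critical) construct serializes just that update.

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      int count = 0;                  /* shared by all threads */

  #pragma omp parallel for
      for (int i = 0; i < 100000; i++) {
  #pragma omp atomic                  /* without this, the increment is a race condition */
          count++;
      }

      printf("count = %d\n", count);  /* reliably 100000 only because of the atomic */
      return 0;
  }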

Page 9: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 9

Summary of OpenMP Constructs

Parallel Region
  – C$omp parallel          #pragma omp parallel
Worksharing
  – C$omp do                #pragma omp for
  – C$omp sections          #pragma omp sections
  – C$omp single            #pragma omp single
  – C$omp workshare         (Fortran only; no C/C++ counterpart)
Data Environment
  – directive: threadprivate
  – clauses: shared, private, lastprivate, reduction, copyin, copyprivate
Synchronization
  – directives: critical, barrier, atomic, flush, ordered, master
Runtime functions/environment variables
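A compact C sketch (illustrative only) touching several of these construct classes: a parallel region, a worksharing for loop with a reduction clause, and a critical section.

  #include <stdio.h>
  #include <omp.h>

  #define N 1000

  int main(void)
  {
      double a[N], sum = 0.0, best = -1.0;

  #pragma omp parallel
      {
  #pragma omp for reduction(+:sum)        /* worksharing construct + reduction clause */
          for (int i = 0; i < N; i++) {
              a[i] = (double)i;
              sum += a[i];
          }

  #pragma omp critical                    /* synchronization construct */
          {
              if (a[N-1] > best) best = a[N-1];
          }
      }                                   /* implicit barrier at the end of the parallel region */

      printf("sum = %g, best = %g\n", sum, best);
      return 0;
  }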

Page 10: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 10

SC’2000 Tutorial Agenda

OpenMP: A Quick Recap
OpenMP Case Studies
  – including performance tuning
Automatic Parallelism and Tools Support
Common Bugs in OpenMP programs
  – and how to avoid them
Mixing OpenMP and MPI
The Future of OpenMP

Page 11: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 11

Performance Tuning and Case Studies with Realistic

Applications

1. Performance tuning of several benchmarks
2. Case study of a large-scale application

Page 12: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 12

Performance Tuning Example 1: MDG

MDG: A Fortran code of the "Perfect Benchmarks".
Automatic parallelization does not improve this code.

These performance improvements were achieved through manual tuning on a 4-processor Sun Ultra:

[Bar chart: speedup (0 to 3.5) of the original code vs. tuning step 1 and tuning step 2.]

Page 13: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 13

MDG: Tuning Steps

Step 1: Parallelize the most time-consuming loop. It consumes 95% of the serial execution time. This takes:
  – array privatization
  – reduction parallelization

Step 2: Balance the iteration space of this loop.
  – The loop is "triangular": by default, the assignment of iterations to processors is unbalanced.
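A C sketch of the Step 2 idea (the loop body is a placeholder, not MDG code): with the default block schedule, the threads assigned the early iterations of a triangular loop do far more work; a cyclic schedule such as schedule(static,1) evens this out.

  void triangular(double *result, int n)
  {
  #pragma omp parallel for schedule(static,1)   /* cyclic distribution balances the triangle */
      for (int i = 0; i < n; i++) {
          double s = 0.0;
          for (int j = i; j < n; j++)           /* iteration i does (n - i) units of work */
              s += 1.0;                         /* placeholder for the real inner-loop work */
          result[i] = s;
      }
  }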

Page 14: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 14

MDG Code Sample

Structure of the most time-consuming loop in MDG:

Original:

      c1 = x(1)>0
      c2 = x(1:10)>0
      DO i=1,n
        DO j=i,n
          IF (c1) THEN rl(1:100) = ...
          IF (c2) THEN ... = rl(1:100)
          sum(j) = sum(j) + ...
        ENDDO
      ENDDO

Parallel:

      c1 = x(1)>0
      c2 = x(1:10)>0
      Allocate(xsum(1:#proc,n))
C$OMP PARALLEL DO
C$OMP+ PRIVATE (i,j,rl,id)
C$OMP+ SCHEDULE (STATIC,1)
      DO i=1,n
        id = omp_get_thread_num()
        DO j=i,n
          IF (c1) THEN rl(1:100) = ...
          IF (c2) THEN ... = rl(1:100)
          xsum(id,j) = xsum(id,j) + ...
        ENDDO
      ENDDO

C$OMP PARALLEL DO
      DO i=1,n
        sum(i) = sum(i) + xsum(1:#proc,i)
      ENDDO

Page 15: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 15

Performance Tuning Example 2: ARC2D

ARC2D: A Fortran code of the "Perfect Benchmarks".

ARC2D is parallelized very well by available compilers. However, the mapping of the code to the machine could be improved.

[Bar chart: speedup (0 to 6) of the original code vs. the locality and granularity tuning steps.]

Page 16: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 16

ARC2D: Tuning Steps

Step 1: Loop interchanging increases cache locality through stride-1 references.
Step 2: Move parallel loops to outer positions.
Step 3: Move synchronization points outward.
Step 4: Coalesce loops.

Page 17: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 17

ARC2D: Code Samples

Loop interchanging increases cache locality:

Before (inner loop is not stride-1):

!$OMP PARALLEL DO
!$OMP+PRIVATE(R1,R2,K,J)
      DO j = jlow, jup
        DO k = 2, kmax-1
          r1 = prss(jminu(j), k) + prss(jplus(j), k) + (-2.)*prss(j, k)
          r2 = prss(jminu(j), k) + prss(jplus(j), k) + 2.*prss(j, k)
          coef(j, k) = ABS(r1/r2)
        ENDDO
      ENDDO
!$OMP END PARALLEL DO

After (inner loop makes stride-1 references):

!$OMP PARALLEL DO
!$OMP+PRIVATE(R1,R2,K,J)
      DO k = 2, kmax-1
        DO j = jlow, jup
          r1 = prss(jminu(j), k) + prss(jplus(j), k) + (-2.)*prss(j, k)
          r2 = prss(jminu(j), k) + prss(jplus(j), k) + 2.*prss(j, k)
          coef(j, k) = ABS(r1/r2)
        ENDDO
      ENDDO
!$OMP END PARALLEL DO
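For readers working in C, a sketch of the same transformation (hypothetical arrays and sizes; note that C is row-major, so stride-1 access requires the last index to vary in the inner loop, the reverse of Fortran):

  #define KMAX 500
  #define JMAX 500

  double prss[KMAX][JMAX], coef[KMAX][JMAX];

  void before(void)                 /* inner loop strides over the first index: poor locality */
  {
  #pragma omp parallel for
      for (int j = 0; j < JMAX; j++)
          for (int k = 0; k < KMAX; k++)
              coef[k][j] = prss[k][j] * 2.0;
  }

  void after(void)                  /* interchanged: inner loop is stride-1 in row-major C */
  {
  #pragma omp parallel for
      for (int k = 0; k < KMAX; k++)
          for (int j = 0; j < JMAX; j++)
              coef[k][j] = prss[k][j] * 2.0;
  }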

Page 18: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 18

ARC2D: Code Samples

Increasing parallel loop granularity through the NOWAIT clause:

!$OMP PARALLEL
!$OMP+PRIVATE(LDI,LD2,LD1,J,LD,K)
      DO k = 2+2, ku-2, 1
!$OMP DO
        DO j = jl, ju
          ld2 = a(j, k)
          ld1 = b(j, k)+(-x(j, k-2))*ld2
          ld  = c(j, k)+(-x(j, k-1))*ld1+(-y(j, k-1))*ld2
          ldi = 1./ld
          f(j, k, 1) = ldi*(f(j, k, 1)+(-f(j, k-2, 1))*ld2+(-f(j, k-1, 1))*ld1)
          f(j, k, 2) = ldi*(f(j, k, 2)+(-f(j, k-2, 2))*ld2+(-f(j, k-1, 2))*ld1)
          x(j, k) = ldi*(d(j, k)+(-y(j, k-1))*ld1)
          y(j, k) = e(j, k)*ldi
        ENDDO
!$OMP END DO NOWAIT
      ENDDO
!$OMP END PARALLEL

Page 19: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 19

ARC2D: Code Samples

Increasing parallel loop granularity through loop coalescing:

Before:

!$OMP PARALLEL DO
!$OMP+PRIVATE(n,k,j)
      DO n = 1, 4
        DO k = 2, kmax-1
          DO j = jlow, jup
            q(j, k, n) = q(j, k, n)+s(j, k, n)
            s(j, k, n) = s(j, k, n)*phic
          ENDDO
        ENDDO
      ENDDO
!$OMP END PARALLEL DO

After (the n and k loops are coalesced into a single parallel loop):

!$OMP PARALLEL DO
!$OMP+PRIVATE(nk,n,k,j)
      DO nk = 0, 4*(kmax-2)-1
        n = nk/(kmax-2) + 1
        k = MOD(nk,kmax-2) + 2
        DO j = jlow, jup
          q(j, k, n) = q(j, k, n)+s(j, k, n)
          s(j, k, n) = s(j, k, n)*phic
        ENDDO
      ENDDO
!$OMP END PARALLEL DO

Page 20: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 20

Performance Tuning Example 3: EQUAKE

EQUAKE: A C code of the new SPEC OpenMP benchmarks.

EQUAKE is hand-parallelized with relatively few code modifications. It achieves excellent speedup.

[Bar chart: speedup (0 to 8) of the original sequential code, the initial OpenMP version, and the version with improved allocation.]

Page 21: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 21

EQUAKE: Tuning Steps

Step 1: Parallelize the four most time-consuming loops.
  – inserted OpenMP pragmas for parallel loops and private data
  – array reduction transformation

Step 2: A change in memory allocation.

Page 22: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 22

EQUAKE Code Samples

  /* malloc w1[numthreads][ARCHnodes][3] */

  #pragma omp parallel for
  for (j = 0; j < numthreads; j++)
    for (i = 0; i < nodes; i++) { w1[j][i][0] = 0.0; ...; }

  #pragma omp parallel private(my_cpu_id,exp,...)
  {
    my_cpu_id = omp_get_thread_num();
  #pragma omp for
    for (i = 0; i < nodes; i++)
      while (...) {
        ...
        exp = loop-local computation;
        w1[my_cpu_id][...][1] += exp;
        ...
      }
  }

  #pragma omp parallel for
  for (j = 0; j < numthreads; j++) {
    for (i = 0; i < nodes; i++) {
      w[i][0] += w1[j][i][0]; ...;
    }
  }

Page 23: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 23

What Tools Did We Use for Performance Analysis and Tuning?

Compilers
  – The starting point for our performance tuning of Fortran codes was always the compiler-parallelized program. The compiler reports parallelized loops and data dependences.
Subroutine and loop profilers
  – Focusing attention on the most time-consuming loops is absolutely essential.
Performance tables
  – Typically comparing performance differences at the loop level.

Page 24: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 24

Guidelines for Fixing "Performance Bugs"

The methodology that worked for us:
  – Use compiler-parallelized code as a starting point.
  – Get a loop profile and the compiler listing.
  – Inspect time-consuming loops (biggest potential for improvement):
      Case 1. Check for parallelism where the compiler could not find it.
      Case 2. Improve parallel loops where the speedup is limited.

Page 25: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 25

Performance Tuning

Case 1: if the loop is not parallelized automatically, do this:

Check for parallelism:
  – Read the compiler explanation.
  – A variable may be independent even if the compiler detects dependences (compilers are conservative).
  – Check if a conflicting array is privatizable (compilers don't perform array privatization well).
If you find parallelism, add OpenMP parallel directives, or make the information explicit for the parallelizer.
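A C sketch of Case 1 (hypothetical code, not one of the benchmarks): the compiler cannot prove the indirect accesses independent, but if the programmer knows the index array holds distinct indices, adding the directive by hand is safe.

  #include <omp.h>

  /* The compiler sees a[idx[i]] and must assume two iterations may touch the
   * same element.  If the programmer knows idx[] is a permutation (assumed
   * here), the loop can be parallelized explicitly. */
  void scatter_update(double *a, const int *idx, const double *b, int n)
  {
  #pragma omp parallel for          /* safe only because idx[] holds distinct indices */
      for (int i = 0; i < n; i++)
          a[idx[i]] += b[i];
  }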

Page 26: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 26

Performance Tuning

Case 2: if the loop is parallel but does not perform well, consider several optimization factors:

[Diagram: a serial program spread into a parallel program running on several CPUs that share one memory.]

Parallelization overhead. High overheads are caused by:
  • parallel startup cost
  • small loops
  • additional parallel code
  • over-optimized inner loops
  • less optimization for parallel code

Spreading overhead, caused by:
  • load imbalance
  • synchronized sections
  • non-stride-1 references
  • many shared references
  • low cache affinity
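Two small C sketches of how such overheads are commonly attacked (the threshold, chunk size, and kernel are illustrative assumptions, not from the tutorial): the if clause keeps tiny loops serial to avoid startup cost, and a dynamic schedule counters load imbalance.

  #include <omp.h>

  /* Parallel startup cost on small loops: the if clause keeps tiny trip counts serial. */
  void scale(double *a, int n)
  {
  #pragma omp parallel for if(n > 10000)
      for (int i = 0; i < n; i++)
          a[i] *= 2.0;
  }

  /* Placeholder for a routine whose cost varies strongly with i (assumed). */
  static double expensive_kernel(int i)
  {
      double s = 0.0;
      for (int k = 0; k < i % 1000; k++) s += k;
      return s;
  }

  /* Load imbalance: a dynamic schedule lets fast threads pick up remaining chunks. */
  void uneven_work(double *a, int n)
  {
  #pragma omp parallel for schedule(dynamic, 16)
      for (int i = 0; i < n; i++)
          a[i] = expensive_kernel(i);
  }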

Page 27: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 27

Case Study of a Large-Scale Application

Converting a Seismic Processing Application to OpenMP:
  – Overview of the Application
  – Basic use of OpenMP
  – OpenMP Issues Encountered
  – Performance Results

Page 28: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 28

Overview of Seismic

Representative of modern seismic processing programs used in the search for oil and gas.
20,000 lines of Fortran; C subroutines interface with the operating system.
Available in a serial and a parallel variant.
Parallel code is available in a message-passing and an OpenMP form.
Part of the SPEChpc benchmark suite; includes 4 data sets, small to x-large.

Page 29: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 29

Seismic: Basic Characteristics

Program structure:
  – 240 Fortran and 119 C subroutines.
Main algorithms:
  – FFT, finite difference solvers.
Running time of Seismic (at 500 MFlops):
  – small data set: 0.1 hours
  – x-large data set: 48 hours
I/O requirement:
  – small data set: 110 MB
  – x-large data set: 93 GB

Page 30: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 30

Basic OpenMP Use: Parallelization Scheme

Split the work into p parallel tasks (p = number of processors).

      Program Seismic
        initialization           ! done by the master processor only
C$OMP PARALLEL
        call main_subroutine()   ! main computation enclosed in one large parallel region
C$OMP END PARALLEL

SPMD execution scheme.
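A C sketch of the same SPMD scheme (routine names and bodies are placeholders for Seismic's actual code):

  #include <stdio.h>
  #include <omp.h>

  static void initialize(void)                        /* placeholder for Seismic's setup */
  {
      printf("initialization by master thread\n");
  }

  static void main_subroutine(int id, int nthreads)   /* placeholder for the real solver */
  {
      printf("thread %d of %d working on its partition\n", id, nthreads);
  }

  int main(void)
  {
      initialize();                      /* master thread only */

  #pragma omp parallel                   /* one large parallel region: SPMD execution */
      {
          main_subroutine(omp_get_thread_num(), omp_get_num_threads());
      }
      return 0;
  }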

Page 31: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 31

Basic OpenMP Use: Data Privatization

Most data structures are private, i.e., each thread has its own copy.

Syntactic forms:

      Program Seismic
      ...
C$OMP PARALLEL
C$OMP+PRIVATE(a)
      a = "local computation"
      call x()
C$OMP END PARALLEL

      Subroutine x()
      common /cc/ d
c$omp threadprivate (/cc/)
      real b(100)
      ...
      b() = "local computation"
      d = "local computation"
      ...

Page 32: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 32

Basic OpenMP Use: Synchronization and Communication

[Diagram: alternating compute and communicate phases. Each communicate phase is implemented as: copy to shared buffer; barrier synchronization; copy from shared buffer.]

The copy-synchronize scheme corresponds to message send-receive operations in MPI programs.
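A C sketch of this copy-synchronize scheme (the buffer layout, sizes, and neighbor pattern are invented for illustration, and the routine is meant to be called from inside a parallel region): each thread posts its data into a shared buffer, a barrier separates the "send" from the "receive", and each thread then reads its neighbor's data.

  #include <omp.h>
  #define NTHREADS_MAX 64
  #define CHUNK 1024

  static double shared_buf[NTHREADS_MAX][CHUNK];   /* shared "mailboxes", one per thread */

  /* mine[] and from_left[] each hold CHUNK doubles; call with at most NTHREADS_MAX threads. */
  void exchange(const double *mine, double *from_left)
  {
      int id   = omp_get_thread_num();
      int p    = omp_get_num_threads();
      int left = (id + p - 1) % p;

      for (int i = 0; i < CHUNK; i++)              /* "send": copy into my shared slot */
          shared_buf[id][i] = mine[i];

  #pragma omp barrier                              /* everyone has posted its data */

      for (int i = 0; i < CHUNK; i++)              /* "receive": copy the neighbor's data out */
          from_left[i] = shared_buf[left][i];

  #pragma omp barrier                              /* don't overwrite before all have read */
  }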

Page 33: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 33

OpenMP Issues: Mixing Fortran and C

The bulk of the computation is done in Fortran. Utility routines are in C:
  – I/O operations
  – data partitioning routines
  – communication/synchronization operations

OpenMP-related issues:
  – If a C/OpenMP compiler is not available, data privatization must be done through "expansion".
  – Mixing Fortran and C is implementation dependent.

Data privatization in OpenMP/C:

  float item;
  #pragma omp threadprivate(item)
  void x()
  {
    ... = item;
  }

Data expansion in the absence of an OpenMP/C compiler:

  float item[num_proc];
  void x()
  {
    int thread;
    thread = omp_get_thread_num_();
    ... = item[thread];
  }

Page 34: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 34

OpenMP Issues: Broadcast Common Blocks

      common /cc/ cdata
      common /dd/ ddata
c     initialization
      cdata = ...
      ddata = ...

C$OMP PARALLEL
C$OMP+COPYIN(/cc/, /dd/)
      call main_subroutine()
C$OMP END PARALLEL

Issues in Seismic:
  • At the start of the parallel region it is not yet known which common blocks need to be copied in.
Solution:
  • copy in all common blocks (at the cost of extra overhead)

Page 35: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 35

OpenMP Issues: Multithreading I/O and malloc

I/O routines and memory allocation are called within parallel threads, inside C utility routines.
OpenMP requires all standard libraries and intrinsics to be thread-safe. However, the implementations are not always compliant; system-dependent solutions need to be found.
The same issue arises if standard C routines are called inside a parallel Fortran region (or through non-standard syntax). Standard C compilers do not know anything about OpenMP and the thread-safety requirement.
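One conservative workaround, sketched in C (not the actual Seismic code; the buffer scheme is invented): serialize calls whose thread safety is in doubt with a critical or single construct.

  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  /* buf[] must have at least omp_get_max_threads() entries (assumed by the caller). */
  void allocate_buffers(double **buf, int n)
  {
  #pragma omp parallel
      {
          double *p;

  #pragma omp critical(mem_alloc)          /* serialize malloc if its thread safety is unknown */
          p = (double *) malloc(n * sizeof(double));

          buf[omp_get_thread_num()] = p;

  #pragma omp single                       /* let one thread do the (unordered) I/O */
          printf("buffers of %d doubles allocated\n", n);
      }
  }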

Page 36: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 36

OpenMP Issues: Processor Affinity

OpenMP currently does not specify or provide constructs for controlling the binding of threads to processors.
Processors can migrate, causing overhead. This behavior is system-dependent. System-dependent solutions may be available.

[Diagram: four tasks of a parallel region mapped to processors p1 through p4; tasks may migrate to other processors as a result of an OS event.]

Page 37: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 37

Performance Results

[Two bar charts: speedups of the message-passing (PVM/MPI) and OpenMP versions of Seismic on 1, 2, 4, and 8 processors, for the small and medium data sets.]

Speedups of Seismic on an SGI Challenge system.

Page 38: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 38

SC’2000 Tutorial Agenda

OpenMP: A Quick Recap
OpenMP Case Studies
  – including performance tuning
Automatic Parallelism and Tools Support
Common Bugs in OpenMP programs
  – and how to avoid them
Mixing OpenMP and MPI
The Future of OpenMP

Page 39: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 39

Generating OpenMP Programs Automatically

[Diagram: an OpenMP program is produced either by the user inserting directives or by a parallelizing compiler inserting them; the user then tunes the program.]

Source-to-source restructurers:
  • F90 to F90/OpenMP
  • C to C/OpenMP

Examples:
  • SGI F77 compiler (-apo -mplist option)
  • Polaris compiler

Page 40: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 40

The Basics About Parallelizing Compilers

Loops are the primary source of parallelism in scientific and engineering applications.
Compilers detect loops that have independent iterations.

      DO I=1,N
        A(expression1) = ...
        ... = A(expression2)
      ENDDO

The loop is independent if, for different iterations, expression1 is always different from expression2.
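Two small C loops that illustrate the distinction (illustrative code, not from the benchmarks):

  /* Independent: iteration i writes a[i] and reads only b[i] and c[i]; a compiler
   * can prove the subscripts never collide and may parallelize the loop. */
  void independent(double *a, const double *b, const double *c, int n)
  {
      for (int i = 0; i < n; i++)
          a[i] = b[i] + c[i];
  }

  /* Dependent: iteration i reads a[i-1], which iteration i-1 wrote, so the
   * iterations cannot safely run in parallel as written. */
  void dependent(double *a, int n)
  {
      for (int i = 1; i < n; i++)
          a[i] = a[i-1] + 1.0;
  }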

Page 41: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 41

Basic Program Transformations

Data privatization:

      DO i=1,n
        work(1:n) = ...
        ...
        ... = work(1:n)
      ENDDO

becomes

C$OMP PARALLEL DO
C$OMP+ PRIVATE (work)
      DO i=1,n
        work(1:n) = ...
        ...
        ... = work(1:n)
      ENDDO

Each processor is given a separate version of the private data, so there is no sharing conflict.

Page 42: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 42

Basic Program Transformations

Reduction recognition:

      DO i=1,n
        ...
        sum = sum + a(i)
        ...
      ENDDO

becomes

C$OMP PARALLEL DO
C$OMP+ REDUCTION (+:sum)
      DO i=1,n
        ...
        sum = sum + a(i)
        ...
      ENDDO

Each processor accumulates a partial sum; these parts are combined at the end of the loop.

Page 43: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 43

Basic Program Transformations

Induction variable substitution:

      i1 = 0
      i2 = 0
      DO i = 1,n
        i1 = i1 + 1
        B(i1) = ...
        i2 = i2 + i
        A(i2) = ...
      ENDDO

becomes

C$OMP PARALLEL DO
      DO i = 1,n
        B(i) = ...
        A((i**2 + i)/2) = ...
      ENDDO

The original loop contains data dependences: each processor modifies the shared variables i1 and i2.

Page 44: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 44

Compiler Options

Examples of options from the KAP parallelizing compiler (KAP includes some 60 options):

Optimization levels
  – optimize: simple analysis, advanced analysis, loop interchanging, array expansion
  – aggressive: pad common blocks, adjust data layout
Subroutine inline expansion
  – inline all, specific routines, how to deal with libraries
Try specific optimizations
  – e.g., recurrence and reduction recognition, loop fusion
    (These transformations may degrade performance.)

Page 45: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 45

More About Compiler Options

Limits on the amount of optimization:
  – e.g., size of optimization data structures, number of optimization variants tried
Make certain assumptions:
  – e.g., array bounds are not violated, arrays are not aliased
Machine parameters:
  – e.g., cache size, line size, mapping
Listing control

Note: compiler options can be a substitute for advanced compiler strategies. If the compiler has limited information, the user can help out.

Page 46: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 46

Inspecting the Translated Program

Source-to-source restructurers:
  – The transformed source code is the actual output. Example: KAP.
Code-generating compilers:
  – Typically have an option for viewing the translated (parallel) code. Example: SGI f77 -apo -mplist.

This can be the starting point for code tuning.

Page 47: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 47

Compiler Listing

The listing gives many useful clues for improving performance:
  – Loop optimization tables
  – Reports about data dependences
  – Explanations about applied transformations
  – The annotated, transformed code
  – Calling tree
  – Performance statistics

The type of reports to be included in the listing can be set through compiler options.

Page 48: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 48

Performance of Parallelizing Compilers

[Bar chart: speedups (0 to 5) of ARC2D, BDNA, FLO52Q, HYDRO2D, MDG, SWIM, TOMCATV, and TRFD on a 5-processor Sun Ultra SMP, comparing the native parallelizer, Polaris generating native directives, and Polaris generating OpenMP.]

Page 49: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 49

Tuning Automatically-Parallelized Code

This task is similar to explicit parallel programming. Two important differences:
  – The compiler gives hints in its listing, which may tell you where to focus attention (e.g., which variables have data dependences).
  – You don't need to perform all transformations by hand. If you expose the right information to the compiler, it will do the translation for you (e.g., C$assert independent).

Page 50: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 50

Why Tune Automatically-Parallelized Code?

Hand improvements can pay off because:
  – Compiler techniques are limited. E.g., array reductions are parallelized by only a few compilers.
  – Compilers may have insufficient information. E.g., the loop iteration range may be input data, or variables may be defined in other subroutines (no interprocedural analysis).

Page 51: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 51

Performance Tuning Tools

[Diagram: as before, the OpenMP program comes either from the user inserting directives or from a parallelizing compiler inserting them, and the user then tunes the program. For the tuning step we need tool support.]

Page 52: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 52

Profiling Tools

Timing profiles (subroutine or loop level)
  – show the most time-consuming program sections
Cache profiles
  – point out memory/cache performance problems
Data-reference and transfer volumes
  – show performance-critical program properties
Input/output activities
  – point out possible I/O bottlenecks
Hardware counter profiles
  – give access to a large number of processor statistics

Page 53: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 53

KAI GuideView: Performance Analysis

Speedup curves
  – Amdahl's Law vs. actual times
Whole program time breakdown
  – Productive work vs. parallel overheads
Compare several runs
  – Scaling processors
Breakdown by section
  – Parallel regions, barrier sections, serial sections
Breakdown by thread
Breakdown of overhead
  – Types of runtime calls, frequency and time

Page 54: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 54

GuideView

Analyze each Parallel region

Find serial regions that are hurt by parallelism

Sort or filter regions to navigate to hotspots

www.kai.com

Page 55: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 55

SGI SpeedShop and WorkShop

Suite of performance tools from SGI.
Measurements based on:
  – pc-sampling and call-stack sampling
      – based on time [prof, gprof]
      – based on R10K/R12K hardware counters
  – basic block counting [pixie]
Analysis on various domains:
  – program graph, source and disassembled code
  – per-thread as well as cumulative data

Page 56: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 56

SpeedShop and WorkShop

Addresses the following performance issues:
Load imbalance
  – Call stack sampling based on time (gprof)
Synchronization overhead
  – Call stack sampling based on time (gprof)
  – Call stack sampling based on hardware counters
Memory hierarchy performance
  – Call stack sampling based on hardware counters

Page 57: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 57

WorkShop: Call Graph View

Page 58: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 58

WorkShop: Source View

Page 59: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 59

Purdue Ursa Minor/Major

Integrated environment for compilation and performance analysis/tuning.
Provides browsers for many sources of information:
  – call graphs, source and transformed program, compilation reports, timing data, parallelism estimation, data reference patterns, performance advice, etc.

www.ecn.purdue.edu/ParaMount/UM/

Page 60: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 60

Ursa Minor/Major

Performance Spreadsheet

Program Structure View

Page 61: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 61

TAU Tuning Analysis Utilities

Performance analysis environment for C++, Java, C, Fortran 90, HPF, and HPC++:
  – compilation facilitator
  – call graph browser
  – source code browser
  – profile browsers
  – speedup extrapolation

www.cs.uoregon.edu/research/paracomp/tau/

Page 62: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 62

TAU Tuning Analysis Utilities

Page 63: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 63

SC’2000 Tutorial Agenda

OpenMP: A Quick Recap
OpenMP Case Studies
  – including performance tuning
Automatic Parallelism and Tools Support
Common Bugs in OpenMP programs
  – and how to avoid them
Mixing OpenMP and MPI
The Future of OpenMP

Page 64: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 64

SMP Programming Errors

Shared memory parallel programming is a mixed bag:
  – It saves the programmer from having to map data onto multiple processors. In this sense, it's much easier.
  – It opens up a range of new errors coming from unanticipated shared resource conflicts.

Page 65: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 65

2 major SMP errors

Race Conditions
  – The outcome of a program depends on the detailed timing of the threads in the team.
Deadlock
  – Threads lock up waiting on a locked resource that will never become free.

Page 66: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 66

Race Conditions

The result varies unpredictably based on the detailed order of execution for each section.
Wrong answers are produced without warning!

C$OMP PARALLEL SECTIONS
      A = B + C
C$OMP SECTION
      B = A + C
C$OMP SECTION
      C = B + A
C$OMP END PARALLEL SECTIONS

Page 67: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 67

Race Conditions: A Complicated Solution

In this example, we choose the assignments to occur in the order A, B, C.
  – ICOUNT forces this order.
  – FLUSH ensures each thread sees updates to ICOUNT. NOTE: you need the flush on each read and each write.

      ICOUNT = 0
C$OMP PARALLEL SECTIONS
      A = B + C
      ICOUNT = 1
C$OMP FLUSH (ICOUNT)
C$OMP SECTION
 1000 CONTINUE
C$OMP FLUSH (ICOUNT)
      IF (ICOUNT .LT. 1) GO TO 1000
      B = A + C
      ICOUNT = 2
C$OMP FLUSH (ICOUNT)
C$OMP SECTION
 2000 CONTINUE
C$OMP FLUSH (ICOUNT)
      IF (ICOUNT .LT. 2) GO TO 2000
      C = B + A
C$OMP END PARALLEL SECTIONS

Page 68: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 68

Race Conditions

The result varies unpredictably because the value of X isn't dependable until the barrier at the end of the do loop.
Wrong answers are produced without warning!
Solution: Be careful when you use NOWAIT.

C$OMP PARALLEL SHARED (X)
C$OMP&         PRIVATE(TMP)
      ID = OMP_GET_THREAD_NUM()
C$OMP DO REDUCTION(+:X)
      DO 100 I=1,100
        TMP = WORK(I)
        X = X + TMP
 100  CONTINUE
C$OMP END DO NOWAIT
      Y(ID) = WORK(X, ID)
C$OMP END PARALLEL

Page 69: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 69

Race Conditions

The result varies unpredictably because access to the shared variable TMP is not protected.
Wrong answers are produced without warning!
The user probably wanted to make TMP private.

      REAL TMP, X
C$OMP PARALLEL DO REDUCTION(+:X)
      DO 100 I=1,100
        TMP = WORK(I)
        X = X + TMP
 100  CONTINUE
C$OMP END DO
      Y(ID) = WORK(X, ID)
C$OMP END PARALLEL

I lost an afternoon to this bug last year. After spinning my wheels and insisting there was a bug in KAI's compilers, the KAI tool Assure found the problem immediately!

Page 70: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 70

Deadlock

This shows a race condition and a deadlock.
  – If A is locked by one thread and B by another, you have deadlock.
  – If the same thread gets both locks, you get a race condition, i.e., different behavior depending on the detailed interleaving of the threads.
Avoid nesting different locks.

      CALL OMP_INIT_LOCK (LCKA)
      CALL OMP_INIT_LOCK (LCKB)
C$OMP PARALLEL SECTIONS
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKA)
      CALL OMP_SET_LOCK(LCKB)
      CALL USE_A_and_B (RES)
      CALL OMP_UNSET_LOCK(LCKB)
      CALL OMP_UNSET_LOCK(LCKA)
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKB)
      CALL OMP_SET_LOCK(LCKA)
      CALL USE_B_and_A (RES)
      CALL OMP_UNSET_LOCK(LCKA)
      CALL OMP_UNSET_LOCK(LCKB)
C$OMP END SECTIONS

Page 71: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 71

Deadlock

This shows a race condition and a deadlock.
  – If A is locked in the first section and the IF statement branches around the unset lock, threads running the other section deadlock waiting for the lock to be released.
Make sure you release your locks.

      CALL OMP_INIT_LOCK (LCKA)
C$OMP PARALLEL SECTIONS
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKA)
      IVAL = DOWORK()
      IF (IVAL .EQ. TOL) THEN
        CALL OMP_UNSET_LOCK (LCKA)
      ELSE
        CALL ERROR (IVAL)
      ENDIF
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKA)
      CALL USE_B_and_A (RES)
      CALL OMP_UNSET_LOCK(LCKA)
C$OMP END SECTIONS

Page 72: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 72

OpenMP death-traps

  – Are you using thread-safe libraries?
  – I/O inside a parallel region can interleave unpredictably.
  – Make sure you understand what your constructors are doing with private objects.
  – Private variables can mask globals.
  – Understand when shared memory is coherent. When in doubt, use FLUSH.
  – NOWAIT removes implied barriers.

Page 73: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 73

Navigating through the Danger Zones

Option 1: Analyze your code to make sure every semantically permitted interleaving of the threads yields the correct results.

This can be prohibitively difficult due to the explosion of possible interleavings.

Tools like KAI’s Assure can help.

Page 74: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 74

Navigating through the Danger Zones

Option 2: Write SMP code that is portable and equivalent to the sequential form.
  – Use a safe subset of OpenMP.
  – Follow a set of "rules" for Sequential Equivalence.

Page 75: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 75

Portable Sequential Equivalence

What is Portable Sequential Equivalence (PSE)?
  – A program is sequentially equivalent if its results are the same with one thread and many threads.
  – For a program to be portable (i.e., run the same on different platforms/compilers) it must execute identically when the OpenMP constructs are used or ignored.

Page 76: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 76

Portable Sequential Equivalence

Advantages of PSE:
  – A PSE program can run on a wide range of hardware and with different compilers, minimizing software development costs.
  – A PSE program can be tested and debugged in serial mode with off-the-shelf tools, even if they don't support OpenMP.

Page 77: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 77

2 Forms of Sequential Equivalence

Two forms of sequential equivalence, based on what you mean by the phrase "equivalent to the single-threaded execution":
  – Strong SE: bitwise identical results.
  – Weak SE: equivalent mathematically, but due to quirks of floating point arithmetic, not bitwise identical.

Page 78: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 78

Strong Sequential Equivalence: rules

Control data scope with the base language.
  – Avoid the data scope clauses.
  – Only use private for scratch variables local to a block (e.g., temporaries or loop control variables) whose global initialization doesn't matter.
Locate all cases where a shared variable can be written by multiple threads.
  – The access to the variable must be protected.
  – If multiple threads combine results into a single value, enforce sequential order.
  – Do not use the reduction clause.

Page 79: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 79

Strong Sequential Equivalence: example

C$OMP PARALLEL PRIVATE(I, TMP)
C$OMP DO ORDERED
      DO 100 I=1,NDIM
        TMP = ALG_KERNEL(I)
C$OMP ORDERED
        CALL COMBINE (TMP, RES)
C$OMP END ORDERED
 100  CONTINUE
C$OMP END PARALLEL

Everything is shared except I and TMP. These can be private since they are not initialized and are unused outside the loop.
The summation into RES occurs in the sequential order, so the result from the program is bitwise compatible with the sequential program.
Problem: this can be inefficient if threads finish in an order that's greatly different from the sequential order.

Page 80: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 80

Weak Sequential Equivalence

For weak sequential equivalence only mathematically valid constraints are enforced.
  – Floating point arithmetic is not associative and not commutative.
  – In most cases, no particular grouping of floating point operations is mathematically preferred, so why take a performance hit by forcing the sequential order? In most cases, if you need a particular grouping of floating point operations, you have a bad algorithm.

How do you write a program that is portable and satisfies weak sequential equivalence? Follow the same rules as the strong case, but relax the sequential ordering constraints.

Page 81: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 81

Weak equivalence: example

C$OMP PARALLEL PRIVATE(I, TMP)
C$OMP DO
      DO 100 I=1,NDIM
        TMP = ALG_KERNEL(I)
C$OMP CRITICAL
        CALL COMBINE (TMP, RES)
C$OMP END CRITICAL
 100  CONTINUE
C$OMP END PARALLEL

The summation into RES occurs one thread at a time, but in any order, so the result is not bitwise compatible with the sequential program.
Much more efficient, but some users get upset when low-order bits vary between program runs.

Page 82: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 82

Sequential Equivalence isn't a Silver Bullet

This program follows the weak PSE rules, but it's still wrong.
In this example, RAND() may not be thread safe. Even if it is, the pseudo-random sequences might overlap, thereby throwing off the basic statistics.

C$OMP PARALLEL
C$OMP& PRIVATE(I, ID, TMP, RVAL)
      ID = OMP_GET_THREAD_NUM()
      N = OMP_GET_NUM_THREADS()
      RVAL = RAND ( ID )
C$OMP DO
      DO 100 I=1,NDIM
        RVAL = RAND (RVAL)
        TMP = RAND_ALG_KERNEL(RVAL)
C$OMP CRITICAL
        CALL COMBINE (TMP, RES)
C$OMP END CRITICAL
 100  CONTINUE
C$OMP END PARALLEL
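One way to avoid both problems, sketched in C rather than Fortran (the seeding scheme is an illustrative assumption, not a rigorous stream-splitting method): give each thread its own generator state, using the POSIX rand_r routine, and seed the per-thread streams differently.

  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  int main(void)
  {
      double sum = 0.0;

  #pragma omp parallel reduction(+:sum)
      {
          /* Per-thread generator state; seeds chosen to differ per thread.
           * (A production code would use a generator designed for parallel streams.) */
          unsigned int seed = 1234u + 17u * (unsigned int) omp_get_thread_num();

  #pragma omp for
          for (int i = 0; i < 100000; i++) {
              double r = rand_r(&seed) / (double) RAND_MAX;   /* thread-safe: state is private */
              sum += r;
          }
      }

      printf("average = %g\n", sum / 100000.0);
      return 0;
  }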

Page 83: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 83

SC’2000 Tutorial Agenda

OpenMP: A Quick Recap
OpenMP Case Studies
  – including performance tuning
Automatic Parallelism and Tools Support
Common Bugs in OpenMP programs
  – and how to avoid them
Mixing OpenMP and MPI
The Future of OpenMP

Page 84: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000

What is MPI? The Message Passing Interface

MPI was created by an international forum in the early 90's.
It is huge: the union of many good ideas about message passing APIs.
  – over 500 pages in the spec
  – over 125 routines in MPI 1.1 alone
  – possible to write programs using only a couple of dozen of the routines
MPI 1.1: MPICH reference implementation.
MPI 2.0: exists as a spec; full implementations?

Page 85: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 85

How do people use MPI? The SPMD Model

A sequential program working on a data set becomes a parallel program working on a decomposed data set, with coordination by passing messages:
  – Replicate the program.
  – Add glue code.
  – Break up the data.

Page 86: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 86

Pi program in MPI

  #include <mpi.h>
  static long num_steps = 100000;

  void main (int argc, char *argv[])
  {
      int i, my_id, numprocs, my_steps;
      double x, pi, step, sum = 0.0;
      step = 1.0/(double) num_steps;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
      MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
      my_steps = num_steps/numprocs;
      for (i = my_id*my_steps; i < (my_id+1)*my_steps; i++) {
          x = (i+0.5)*step;
          sum += 4.0/(1.0+x*x);
      }
      sum *= step;
      MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  }

Page 87: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 87

How do people mix MPI and OpenMP?

Create the MPI program with its data decomposition (replicate the program, add glue code, break up the data), then use OpenMP inside each MPI process.

Page 88: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 88

Pi program in MPI and OpenMP

  #include <mpi.h>
  #include "omp.h"
  static long num_steps = 100000;

  void main (int argc, char *argv[])
  {
      int i, my_id, numprocs, my_steps;
      double x, pi, step, sum = 0.0;
      step = 1.0/(double) num_steps;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
      MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
      my_steps = num_steps/numprocs;
  #pragma omp parallel for private(x) reduction(+:sum)
      for (i = my_id*my_steps; i < (my_id+1)*my_steps; i++) {
          x = (i+0.5)*step;
          sum += 4.0/(1.0+x*x);
      }
      sum *= step;
      MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  }

Get the MPI part done first, then add OpenMP pragmas where it makes sense to do so.

Page 89: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 89

Mixing OpenMP and MPI

Let the programmer beware!
  – Messages are sent to a process on a system, not to a particular thread.
  – Safest approach: only do MPI inside serial regions,
    ... or do it inside MASTER constructs,
    ... or do it inside SINGLE or CRITICAL constructs. But this only works if your MPI is really thread safe!
  – Environment variables are not propagated by mpirun. You'll need to broadcast OpenMP parameters and set them with the library routines.
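A C sketch of both points (the environment handling and the placeholder computation are illustrative assumptions): rank 0 reads the desired thread count, broadcasts it, every process sets it with the library routine, and MPI calls stay in the serial region after the parallel loop has joined.

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>
  #include <omp.h>

  int main(int argc, char *argv[])
  {
      int rank, nthreads = 1;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {                              /* mpirun may not propagate OMP_NUM_THREADS */
          const char *env = getenv("OMP_NUM_THREADS");
          nthreads = env ? atoi(env) : 4;
      }
      MPI_Bcast(&nthreads, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* broadcast the OpenMP parameter */
      omp_set_num_threads(nthreads);                          /* set it with the library routine */

      double local = 0.0, global = 0.0;
  #pragma omp parallel for reduction(+:local)
      for (int i = 0; i < 1000000; i++)
          local += 1.0;                             /* placeholder computation */

      /* MPI call made only in the serial region, after the parallel loop has joined */
      MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0) printf("global = %g with %d threads per process\n", global, nthreads);
      MPI_Finalize();
      return 0;
  }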

Page 90: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 90

SC’2000 Tutorial Agenda

OpenMP: A Quick Recap
OpenMP Case Studies
  – including performance tuning
Automatic Parallelism and Tools Support
Common Bugs in OpenMP programs
  – and how to avoid them
Mixing OpenMP and MPI
The Future of OpenMP

Page 91: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000

OpenMP Futures: The ARB

The future of OpenMP is in the hands of the OpenMP Architecture Review Board (the ARB):
  – Intel, KAI, IBM, HP, Compaq, Sun, SGI, DOE ASCI
The ARB resolves interpretation issues and manages the evolution of new OpenMP APIs.
Membership in the ARB is open to any organization with a stake in OpenMP:
  – Research organizations (e.g., DOE ASCI)
  – Hardware vendors (e.g., Intel or HP)
  – Software vendors (e.g., KAI)

Page 92: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 92

The Future of OpenMP

OpenMP is an evolving standard. We will see to it that it is well matched to the changing needs of the shared memory programming community.

Here's what's coming in the future:
  – OpenMP 2.0 for Fortran:
      – This is a major update of OpenMP for Fortran 95.
      – Status: specification released at SC'00.
  – OpenMP 2.0 for C/C++:
      – Work to begin in January 2001.
      – Specification complete by SC'01.

To learn more about OpenMP 2.0, come to the OpenMP BOF on Tuesday evening.

Page 93: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 93

Reference Material on OpenMP

OpenMP Homepage www.openmp.org: The primary source of information about OpenMP and its development.

Books:
Parallel Programming in OpenMP, Chandra, Rohit. San Francisco, Calif.: Morgan Kaufmann; London: Harcourt, 2000. ISBN: 1558606718.

Research papers:
Sosa CP, Scalmani C, Gomperts R, Frisch MJ. Ab initio quantum chemistry on a ccNUMA architecture using OpenMP. III. Parallel Computing, vol. 26, no. 7-8, July 2000, pp. 843-56. Publisher: Elsevier, Netherlands.

Bova SW, Breshears CP, Cuicchi C, Demirbilek Z, Gabb H. Nesting OpenMP in an MPI application. Proceedings of the ISCA 12th International Conference. Parallel and Distributed Systems. ISCA. 1999, pp.566-71. Cary, NC, USA.

Gonzalez M, Serra A, Martorell X, Oliver J, Ayguade E, Labarta J, Navarro N. Applying interposition techniques for performance analysis of OPENMP parallel applications. Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000. IEEE Comput. Soc. 2000, pp.235-40. Los Alamitos, CA, USA.

J. M. Bull and M. E. Kambites. JOMP: an OpenMP-like interface for Java. Proceedings of the ACM 2000 Conference on Java Grande, 2000, pp. 44-53.

Page 94: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 94

Chapman B, Mehrotra P, Zima H. Enhancing OpenMP with features for locality control. Proceedings of Eighth ECMWF Workshop on the Use of Parallel Processors in Meteorology. Towards Teracomputing. World Scientific Publishing. 1999, pp.301-13. Singapore.

Cappello F, Richard O, Etiemble D. Performance of the NAS benchmarks on a cluster of SMP PCs using a parallelization of the MPI programs with OpenMP. Parallel Computing Technologies. 5th International Conference, PaCT-99. Proceedings (Lecture Notes in Computer Science Vol.1662). Springer-Verlag. 1999, pp.339-50. Berlin, Germany.

Couturier R, Chipot C. Parallel molecular dynamics using OPENMP on a shared memory machine. Computer Physics Communications, vol.124, no.1, Jan. 2000, pp.49-59. Publisher: Elsevier, Netherlands.

Bova SW, Breshearsz CP, Cuicchi CE, Demirbilek Z, Gabb HA. Dual-level parallel analysis of harbor wave response using MPI and OpenMP. International Journal of High Performance Computing Applications, vol.14, no.1, Spring 2000, pp.49-64. Publisher: Sage Science Press, USA.

Scherer A, Honghui Lu, Gross T, Zwaenepoel W. Transparent adaptive parallelism on NOWS using OpenMP. ACM. Sigplan Notices (Acm Special Interest Group on Programming Languages), vol.34, no.8, Aug. 1999, pp.96-106. USA.

Ayguade E, Martorell X, Labarta J, Gonzalez M, Navarro N. Exploiting multiple levels of parallelism in OpenMP: a case study. Proceedings of the 1999 International Conference on Parallel Processing. IEEE Comput. Soc. 1999, pp.172-80. Los Alamitos, CA, USA.

Page 95: Advanced Parallel Programming with OpenMP

Advanced OpenMP, SC'2000 95

Honghui Lu, Hu YC, Zwaenepoel W. OpenMP on networks of workstations. Proceedings of ACM/IEEE SC98: 10th Anniversary. High Performance Networking and Computing Conference (Cat. No. RS00192). IEEE Comput. Soc. 1998, 13 pp. Los Alamitos, CA, USA.

Throop J. OpenMP: shared-memory parallelism from the ashes. Computer, vol.32, no.5, May 1999, pp.108-9. Publisher: IEEE Comput. Soc, USA.

Hu YC, Honghui Lu, Cox AL, Zwaenepoel W. OpenMP for networks of SMPs. Proceedings 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IPPS/SPDP 1999. IEEE Comput. Soc. 1999, pp.302-10. Los Alamitos, CA, USA.

Parallel Programming with Message Passing and Directives; Steve W. Bova, Clay P. Breshears, Henry Gabb, Rudolf Eigenmann, Greg Gaertner, Bob Kuhn, Bill Magro, Stefano Salvini; SIAM News, Volume 32, No 9, Nov. 1999.

Still CH, Langer SH, Alley WE, Zimmerman GB. Shared memory programming with OpenMP. Computers in Physics, vol.12, no.6, Nov.-Dec. 1998, pp.577-84. Publisher: AIP, USA.

Chapman B, Mehrotra P. OpenMP and HPF: integrating two paradigms. [Conference Paper] Euro-Par'98 Parallel Processing. 4th International Euro-Par Conference. Proceedings. Springer-Verlag. 1998, pp.650-8. Berlin, Germany.

Dagum L, Menon R. OpenMP: an industry standard API for shared-memory programming. IEEE Computational Science & Engineering, vol.5, no.1, Jan.-March 1998, pp.46-55. Publisher: IEEE, USA.

Clark D. OpenMP: a parallel standard for the masses. IEEE Concurrency, vol.6, no.1, Jan.-March 1998, pp.10-12. Publisher: IEEE, USA.