Parallel Programming on the SGI Origin2000. With thanks to Moshe Goldberg, TCC and Igor Zacharov, SGI. Taub Computer Center, Technion, Mar 2005. Anne Weill-Zrahia


Parallel Programming on the SGI Origin2000

With thanks to Moshe Goldberg, TCC and Igor Zacharov, SGI

Taub Computer Center, Technion

Mar 2005

Anne Weill-Zrahia


Parallel Programming on the SGI Origin2000

1) Parallelization Concepts

2) SGI Computer Design

3) Efficient Scalar Design

4) Parallel Programming - OpenMP

5) Parallel Programming - MPI

4) Parallel Programming - OpenMP

[Figure: Limor in Haifa and Shimon in Tel Aviv, each labelled "IL 0". Caption: Is this your joint bank account?]


- Parallelization instruction to the compiler:
     f77 -o prog -mp prog.f
  or:
     f77 -o prog -pfa prog.f

- Now try to understand what a compiler has to determine when deciding how to parallelize

- Note that when we talk loosely about parallelization, what is meant is: “Is the program as presented here parallelizable?”

- This is an important distinction, because sometimes rewriting can transform non-parallelizable code into a parallelizable form, as we will see…

Data dependency types

1) Iteration i depends on values calculated in the previous iteration i-1 (loop-carried dependence):

      do i=2,n
         a(i) = a(i-1)          cannot be parallelized
      enddo

2) Data dependence within a single iteration (non-loop-carried dependence):

      do i=2,n
         c = . . . .
         a(i) = . . . c . . .   parallelizable
      enddo

3) Reduction:

      do i=1,n
         s = s + x              parallelizable (as a reduction)
      enddo

All data dependencies in programs are variations on these fundamental types.

Data dependency analysis

Question: Are the following loops parallelizable?

      do i=2,n
         a(i) = b(i-1)
      enddo

      do i=2,n
         a(i) = a(i-1)
      enddo




Data dependency analysis

      do i=2,n
         a(i) = b(i-1)
      enddo

[Figure: timeline, columns labelled cycle1 and cycle2]

Data dependency analysis

      do i=2,n
         a(i) = a(i-1)
      enddo

Scalar (non-parallel) run:

[Figure: CPU1 timeline, starting with A(2)=A(1)]

In each cycle, NEW data from the previous cycle is read.

Data dependency analysis

      do i=2,n
         a(i) = a(i-1)
      enddo

Parallel run: will probably read OLD data.

Data dependency analysis

      do i=2,n
         a(i) = a(i-1)
      enddo

[Figure: timeline, columns labelled cycle1 and cycle2; annotations: "May read NEW data" and "Will probably read OLD data"]

Data dependency analysis

Another question: Are the following loops parallelizable?

      do i=3,n,2
         a(i) = a(i-1)
      enddo

      do i=1,n
         s = s + a(i)
      enddo



Data dependency analysis

      do i=3,n,2
         a(i) = a(i-1)
      enddo

[Figure: timeline, columns labelled cycle1 and cycle2]

Data dependency analysis

      do i=1,n
         s = s + a(i)
      enddo

[Figure: timeline, columns labelled cycle1 and cycle2]

- If this loop is simply split across threads, the value of s will be undetermined, and typically it will vary from one run to the next
- This bug in parallel programming is called a “race condition”
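One common way to remove this race is the reduction clause described later in the deck. The following is a minimal C sketch, assuming a compiler with OpenMP enabled (without it the pragma is ignored and the loop simply runs serially). The function name `sum_array` is illustrative and does not come from the slides.

```c
#include <stddef.h>

/* Sums a[0..n-1]. The reduction clause gives each thread a private
   partial sum (initialised to 0) and combines the partial sums at the
   end of the loop, so no two threads update s at the same time. */
double sum_array(const double *a, size_t n)
{
    double s = 0.0;
    long i;
#pragma omp parallel for reduction(+:s)
    for (i = 0; i < (long)n; i++) {
        s += a[i];
    }
    return s;
}
```

Because the partial sums may be combined in a different order from run to run, floating-point results can differ in the last digits from the sequential loop, which is the "within rounding accuracy" situation mentioned in the summary.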

Data dependency analysis

What is the principle involved here?

The examples shown fall into two categories:

1) Data being read is independent of data that is written:

      a(i) = b(i-1)     i=2,3,4...
      a(i) = a(i-1)     i=3,5,7...

2) Data being read depends on data that is written:

      a(i) = a(i-1)     i=2,3,4...
      s = s + a(i)      i=1,2,3...
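The two categories can be checked mechanically for loops of the form a(i) = a(i-dist). The sketch below is a brute-force illustration only, not how a real compiler performs dependence analysis; the function name and parameters are assumptions made for this example. The case a(i) = b(i-1) needs no check, since the array read is a different array from the one written.

```c
#include <stdbool.h>

/* For the loop   do i = first, last, step ;  a(i) = a(i-dist)
   returns true when some element that is read, a(i-dist), is also
   written by an iteration of the same loop.
   Assumes step > 0 and dist != 0. */
bool reads_overlap_writes(int first, int last, int step, int dist)
{
    int i, j;
    for (i = first; i <= last; i += step) {        /* iteration that reads  */
        for (j = first; j <= last; j += step) {    /* iteration that writes */
            if (i - dist == j) {
                return true;
            }
        }
    }
    return false;
}
```

For example, `reads_overlap_writes(2, 10, 1, 1)` models a(i) = a(i-1) for i=2,3,4..., and `reads_overlap_writes(3, 11, 2, 1)` models the same statement for i=3,5,7...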

Data dependency analysis

Here is a typical situation:

Is there a data dependency in the following loop?

      do i = 1,n
         a(i) = sin(x(i))
         result = a(i) + b(i)
         c(i) = result * c(i)
      enddo

Clearly, “result” is a temporary variable that is reassigned in every iteration before it is used, so no value is carried from one iteration to the next.

Note: “result” must be a “private” variable (this will be discussed later).
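A C sketch of the same loop with the temporary made private is shown here. It is an illustration under stated assumptions: the function name `update_c` is hypothetical, and `0.5 * x[i]` is used as a stand-in for `sin(x(i))` so that the example does not depend on the math library.

```c
/* Each thread needs its own copy of "result"; otherwise one thread
   could overwrite the value between another thread's write and read. */
void update_c(const double *x, const double *b, double *a, double *c, int n)
{
    int i;
    double result;   /* declared outside the loop, so it must be listed as private */
#pragma omp parallel for private(result)
    for (i = 0; i < n; i++) {
        a[i] = 0.5 * x[i];        /* stand-in for sin(x(i)) */
        result = a[i] + b[i];
        c[i] = result * c[i];
    }
}
```

Declaring `result` inside the loop body would have the same effect without the clause, since variables declared inside the parallel construct are private to each thread.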


Data dependency analysis

Here is a (slightly different) typical situation:

Is there a data dependency in the following loop?

      do i = 1,n
         a(i) = sin(result)
         result = a(i) + b(i)
         c(i) = result * c(i)
      enddo

The value of “result” is carried over from one iteration to the next.

This is the classical read/write situation but now it is somewhat hidden.

Data dependency analysis

The loop could (symbolically) be rewritten:

      do i = 1,n
         a(i) = sin(result(i-1))
         result(i) = a(i) + b(i)
         c(i) = result(i) * c(i)
      enddo

Now substitute the expression for a(i):

      do i = 1,n
         a(i) = sin(result(i-1))
         result(i) = sin(result(i-1)) + b(i)
         c(i) = result(i) * c(i)
      enddo

This is really of the type “a(i)=a(i-1)”!

Data dependency analysis

One more: Can the following loop be parallelized?

      do i = 3,n
         a(i) = a(i-2)
      enddo

If this is parallelized, there will probably be different answers from one run to another.


Data dependency analysis

      do i = 3,n
         a(i) = a(i-2)
      enddo

[Figure: timeline, columns labelled cycle1 and cycle2]

This looks like it will be safe.

Data dependency analysis

      do i = 3,n
         a(i) = a(i-2)
      enddo

HOWEVER: what if there are 3 CPUs and not 2?

In this case, a(3) is read and written in two threads at once.
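This loop is one where rewriting helps. The recurrence a(i) = a(i-2) links odd indices only to odd indices and even indices only to even indices, so it splits into two chains that never touch the same element. The C sketch below is one possible rewrite, not taken from the slides; the function names are hypothetical and 1-based indexing is kept to match the Fortran.

```c
/* Propagates values along one chain:
   a[i] = a[i-2] for i = start, start+2, ... up to n.
   Each chain is still a serial recurrence. */
static void propagate_chain(double *a, int start, int n)
{
    int i;
    for (i = start; i <= n; i += 2) {
        a[i] = a[i - 2];
    }
}

/* Same result as   do i = 3,n ; a(i) = a(i-2)   with 1-based indices.
   a must have at least n+1 elements; a[0] is unused. */
void shift_by_two(double *a, int n)
{
#pragma omp parallel sections
    {
#pragma omp section
        propagate_chain(a, 3, n);   /* odd indices:  3, 5, 7, ... */
#pragma omp section
        propagate_chain(a, 4, n);   /* even indices: 4, 6, 8, ... */
    }
}
```

The rewrite is correct for any number of threads, but it exposes only two-way parallelism: at most two threads have work to do.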

RISC memory levels

[Figure: single CPU; labels: Main memory, Single CPU]

RISC memory levels

[Figure: multiple CPUs; labels: Main memory, Cache 0, Cache 1]


Definition of OpenMP

- Application Program Interface (API) for Shared Memory Parallel Programming

- Directive based approach with library support

- Targets existing applications and widely used languages:
   * Fortran API first released October 1997
   * C, C++ API first released October 1998

- Multi-vendor/platform support

Why was OpenMP developed?

- Parallel programming before OpenMP
   * Standards for distributed memory (MPI and PVM)
   * No standard for shared memory programming
- Vendors had different directive-based APIs for SMP
   * SGI, Cray, Kuck & Assoc, DEC
   * Vendor proprietary, similar but not the same
   * Most were targeted at loop-level parallelism
- Commercial users and high-end software vendors have a big investment in existing codes
- End result: users wanting portability were forced to use MPI even for shared memory
   * This sacrifices built-in SMP hardware benefits
   * Requires major effort

The Spread of OpenMP

Organization: Architecture Review Board
Web site: www.openmp.org

Hardware: HP/DEC, IBM, Intel, SGI, Sun

Software: Portland (PGI), NAG, Intel, Kuck & Assoc (KAI), Absoft

OpenMP Interface model

- Control structures
- Work sharing
- Data scope attributes
   * private, firstprivate, lastprivate
   * shared
   * reduction

- Control and query
   * number of threads
   * nested parallel?
   * throughput mode

- Lock API

- Runtime environment
   * schedule type
   * max number of threads
   * nested parallelism
   * throughput mode





OpenMP execution model

An OpenMP program starts in a single thread, in sequential mode

To create additional threads, the user opens a parallel region
   * additional slave threads are launched
   * the master thread is part of the team
   * threads “disappear” at the end of the parallel region run

This model is repeated as needed

[Figure: master thread timeline with parallel regions labelled 4 threads, 2 threads and 3 threads]

Creating parallel threads

Fortran:

   c$omp parallel [clause,clause]
         code to run in parallel
   c$omp end parallel

C/C++:

   #pragma omp parallel [clause,clause]
   {
      code to run in parallel
   }

Replicate execution:

         i=0
   c$omp parallel
         call foo(i,a,b)
   c$omp end parallel
         print*,i

[Figure: the call is replicated in each thread: foo foo foo foo]

Number of threads: set by a library call or an environment variable
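Replicated execution can be observed by counting how often the body of a parallel region runs. The following C sketch assumes an OpenMP-enabled compiler; `foo` here is a hypothetical stand-in for the routine on the slide, and `count_team_calls` is a name invented for this example.

```c
/* Stand-in for foo: every thread that executes the parallel region
   calls it once. The atomic directive protects the shared counter. */
static void foo(int *calls)
{
#pragma omp atomic
    (*calls)++;
}

/* Returns how many times foo ran, which equals the number of threads
   in the team (1 when compiled without OpenMP). */
int count_team_calls(void)
{
    int calls = 0;
#pragma omp parallel
    {
        foo(&calls);
    }
    return calls;
}
```

The returned value depends on how the number of threads was set, so a correct program should not rely on any particular count.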


OpenMP on the Origin 2000

Switches, formats

   f77 -mp

   c$omp parallel do
   c$omp+shared(a,b,c)
OR
   c$omp parallel do shared(a,b,c)

Conditional compilation

   c$    iam = omp_get_thread_num()+1
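In C, the counterpart of the `c$` sentinel is the `_OPENMP` macro, which an OpenMP compiler defines. A minimal sketch follows; the function name `my_thread_id` is illustrative.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns a 1-based thread id, like "iam" in the Fortran line above.
   The OpenMP call is compiled only when _OPENMP is defined, so the
   same source still builds as a plain serial program. */
int my_thread_id(void)
{
    int iam = 1;
#ifdef _OPENMP
    iam = omp_get_thread_num() + 1;
#endif
    return iam;
}
```

Called outside a parallel region, the function returns 1 in both builds, because the master thread has thread number 0.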

OpenMP on the Origin 2000 - C

Switches, formats

   cc -mp

   #pragma omp parallel for \
           shared(a,b,c)
OR
   #pragma omp parallel for shared(a,b,c)

OpenMP on the Origin 2000

Parallel Do Directive

   c$omp parallel do private(I)
         do I=1,n
            a(I) = I+1
         enddo
   c$omp end parallel do      --> optional

Topics: Clauses, Detailed construct


OpenMP on the Origin 2000

Parallel Do Directive - Clauses


Allocating private and shared variables

[Figure: single thread, then parallel region, then single thread. S = shared variable, P = private variable]

Clauses in OpenMP - 1

Clauses for the “parallel” directive specify data association rules and conditional computation

shared (list)
   - data accessible by all threads, which all refer to the same storage
private (list)
   - data private to each thread
   - a new storage location is created with that name for each thread, and the contents of the storage are not available outside the parallel region
default (private | shared | none)
   - default association for variables not otherwise mentioned
firstprivate (list)
   - same as for private(list), but the contents are given an initial value from the variable with the same name, from outside the parallel region
lastprivate (list)
   - available only for work sharing constructs
   - a shared variable with that name is set to the last computed value of a thread private variable in the work sharing construct
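The difference between firstprivate and lastprivate is easiest to see in a small loop. The C sketch below is illustrative only; `fill_with_offset` and its parameters are hypothetical names, and the loop is assumed to run at least one iteration.

```c
/* Fills out[i] = offset + i and returns the value computed in the
   sequentially last iteration (i = n-1). Assumes n >= 1. */
int fill_with_offset(int *out, int n, int offset)
{
    int i;
    int last = -1;   /* after the loop, holds the value from iteration n-1 */
#pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (i = 0; i < n; i++) {
        /* offset: private copy initialised from the outside value.
           last:   private copy, written in every iteration. */
        last = offset + i;
        out[i] = last;
    }
    return last;
}
```

With a plain `private(last)` clause instead, the value of `last` after the loop would not be defined.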

Clauses in OpenMP - 2

reduction ({op/intrinsic}:list)
   - variables in the list are named scalars of intrinsic type
   - a private copy of each variable will be made in each thread and initialized according to the intended operation
   - at the end of the parallel region or other synchronization point all private copies will be combined
   - the operation must be of one of the forms:
        x = x op expr
        x = intrinsic(x,expr)
        if (x.LT.expr) x = expr
        x++; x--; ++x; --x;
     where expr does not contain x

C/C++:

   Op          Init
   + or -      0
   *           1
   &           ~0
   |           0
   ^           0
   &&          1
   ||          0

Fortran:

   Op/intrinsic    Init
   + or -          0
   *               1
   .AND.           .TRUE.
   .OR.            .FALSE.
   .EQV.           .TRUE.
   .NEQV.          .FALSE.
   MAX             smallest number
   MIN             largest number
   IAND            all bits on
   IOR or IEOR     0

   - example:
     c$omp parallel do reduction(+:a,y) reduction (.OR.:s)
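What the reduction clause does can be written out by hand, which makes the roles of the private copy and its initial value explicit. This C sketch is an illustration of the idea, not the way any particular compiler implements reductions; the function name `sum_of_squares` is an assumption.

```c
/* Hand-written equivalent of reduction(+:total): each thread keeps a
   private partial sum initialised to 0 (the initial value for +), and
   the partial sums are combined one thread at a time. */
long sum_of_squares(int n)
{
    long total = 0;
#pragma omp parallel
    {
        long partial = 0;   /* private: declared inside the region */
        int i;
#pragma omp for
        for (i = 1; i <= n; i++) {
            partial += (long)i * i;
        }
#pragma omp critical
        total += partial;
    }
    return total;
}
```

In practice the clause form is shorter and leaves the combining strategy to the compiler.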

Clauses in OpenMP - 3

copyin (list)
   - the list must contain common block (or global) names that have been declared threadprivate
   - data in the master thread in that common block will be copied to the thread private storage at the beginning of the parallel region
   - there is no “copyout” clause: data in a private common block is not available outside of that thread
if (scalar_logical_expression)
   - when an “if” clause is present, the enclosed code block is executed in parallel only if the scalar_logical_expression is .TRUE.
ordered
   - only for do/for work sharing constructs: the code in the ORDERED block will be executed in the same sequence as in sequential execution
schedule (kind[,chunk])
   - only for do/for work sharing constructs: specifies the scheduling discipline for loop iterations
nowait
   - the end of a work sharing construct or SINGLE directive implies a synchronization point unless “nowait” is specified

OpenMP on the Origin 2000

Parallel Sections Directive

   c$omp parallel sections private(I)
   c$omp section
         block1
   c$omp section
         block2
   c$omp end parallel sections

Topics: Clauses, Detailed construct


OpenMP on the Origin 2000

Parallel Sections Directive - Clauses


OpenMP on the Origin 2000

Defining a Parallel Region - Individual Do Loops

   c$omp parallel shared(a,b)
   c$omp do private(j)
         do j=1,n
            a(j)=j
         enddo
   c$omp end do nowait
   c$omp do private(k)
         do k=1,n
            b(k)=k
         enddo
   c$omp end do
   c$omp end parallel
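A C counterpart of this example is sketched below, with 0-based arrays; the function name `fill_two_arrays` is hypothetical. The `nowait` on the first loop is safe here only because the second loop does not read anything the first loop writes.

```c
/* One parallel region containing two work-shared loops. The loops
   touch different arrays, so threads may start the second loop
   without waiting at the end of the first. */
void fill_two_arrays(double *a, double *b, int n)
{
#pragma omp parallel shared(a, b)
    {
        int j, k;   /* declared inside the region, hence private */
#pragma omp for nowait
        for (j = 0; j < n; j++) {
            a[j] = j + 1;
        }
#pragma omp for
        for (k = 0; k < n; k++) {
            b[k] = k + 1;
        }
    }
}
```

Opening the region once and sharing both loops inside it avoids starting and ending a team of threads twice.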

OpenMP on the Origin 2000

Defining a Parallel Region - Explicit Sections

   c$omp parallel shared(a,b)
   c$omp section
         block1
   c$omp single
         block2
   c$omp section
         block3
   c$omp end parallel

OpenMP on the Origin 2000

Synchronization Constructs

   master/end master
   critical/end critical
   barrier
   atomic
   flush
   ordered/end ordered
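Two of these constructs, atomic and critical, can be contrasted in a short C sketch. It is written to show the constructs rather than to be fast (a reduction would usually be the better choice for the counter), and the function name `count_positive_and_max` is an assumption made for this example.

```c
/* Counts the positive entries of v and finds the largest entry.
   atomic protects a single update of one memory location;
   critical protects a longer compare-and-assign sequence.
   Assumes n >= 1. */
int count_positive_and_max(const int *v, int n, int *max_out)
{
    int count = 0;
    int max = v[0];
    int i;
#pragma omp parallel for shared(count, max)
    for (i = 0; i < n; i++) {
        if (v[i] > 0) {
#pragma omp atomic
            count++;
        }
#pragma omp critical
        {
            if (v[i] > max) {
                max = v[i];
            }
        }
    }
    *max_out = max;
    return count;
}
```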


OpenMP on the Origin 2000

Run-Time Library Routines

Execution environment



OpenMP on the Origin 2000

Run-Time Library Routines

Lock routines



OpenMP on the Origin 2000

Environment Variables



Exercise 5 – OpenMP to parallelize a loop

[Figure: labels "initial values" and "main loop"]

Enhancing Performance

• Ensuring sufficient work: running a loop in parallel adds runtime costs

• Scheduling loops for load balancing

The SCHEDULE clause

Static: iterations are divided into chunks, either of a specified size or equally sized, and each thread is assigned its chunks in advance

Dynamic: at runtime, chunks are assigned to threads dynamically
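Dynamic scheduling tends to pay off when iterations do unequal amounts of work. The C sketch below uses a triangular loop as an example of such a workload; the function name and the chunk size of 4 are arbitrary choices for illustration, not recommendations from the slides.

```c
/* Row i needs i+1 additions, so later iterations cost more than
   earlier ones. schedule(dynamic, 4) hands out chunks of 4 rows at
   run time to whichever thread is free; schedule(static) would
   instead fix the assignment in advance. */
void triangular_row_sums(long *row_sum, int n)
{
    int i;
#pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < n; i++) {
        long s = 0;
        int j;
        for (j = 0; j <= i; j++) {
            s += j;
        }
        row_sum[i] = s;   /* equals i*(i+1)/2 */
    }
}
```

Dynamic assignment adds some run-time overhead per chunk, so for loops with evenly sized iterations a static schedule is usually the cheaper choice.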

OpenMP summary

- A small number of compiler directives to set up parallel execution of code, and a runtime library for locking functions
- Portable directives (supported by different vendors in the same way)
- Parallelization is for the SMP programming model: the machine should have a global address space
- Number of execution threads is controlled outside the program
- A correct OpenMP program should not depend on the exact number of execution threads nor on the scheduling mechanism for work distribution
- In addition, a correct OpenMP program should be (weakly) serially equivalent: that is, the results of the computation should be within rounding accuracy when compared to the sequential program
- On SGI, OpenMP programming can be mixed with the MPI library, so that it is possible to have “hierarchical parallelism”
   * OpenMP parallelism in a single node (Global Address Space)
   * MPI parallelism between nodes in a cluster (Network connection)