Lecture 3: PRAMs, Introduction to threads programming

Lecture 3 - University of California, San Diego
Source: cseweb.ucsd.edu/classes/fa08/cse260/Lectures/Lec03/Lec03.pdf


Page 1:

Lecture 3

PRAMs

Introduction to threads programming

Page 2:

10/2/08 Scott B. Baden /CSE 260/ Fall '08 2

Announcements

• Makeup lectures on 10/10 and 10/17 have been posted on the schedule

• Assignment #1 has been posted

Page 3:

Scalability

• A computation is scalable if performance increases as a “nice function” of the number of processors, e.g. linearly
• In practice scalability can be hard to achieve
  ▶ Serial sections: code that runs on only one processor
  ▶ “Non-productive” work associated with parallel execution, e.g. communication
  ▶ Load imbalance: uneven work assignments over the processors
• Some algorithms present intrinsic barriers to scalability, leading to alternatives

A serial summation is the classic example of a serial section:

for i = 0:n-1
    sum = sum + x[i]

Page 4:

Amdahl’s law (1967)

• A serial section limits scalability
• Let f = fraction of T1 that runs serially
• Speedup on P processors: SP = 1 / (f + (1 - f)/P)
• Amdahl's Law (1967): as P → ∞, SP → 1/f

[Figure: speedup curves for serial fractions f = 0.1, 0.2, 0.3]

Page 5:

Scaled Speedup

• Is Amdahl’s law pessimistic?

• Observation: Amdahl’s law assumes that the workload (W) remains fixed
• But parallel computers are used to tackle more ambitious workloads
  – W increases with P
  – f often decreases with W

Page 6:

Computing scaled speedup

• Instead of asking what the speedup is, let’s ask how long a parallel program would run on a single processor [J. Gustafson 1992]
  http://www.scl.ameslab.gov/Publications/Gus/FixedTime/FixedTime.pdf
• Let TP = 1
• f′ = fraction of the parallel program’s running time that is serial
• T1 = f′ + (1 - f′) × P = S′P = scaled speedup
• Scaled speedup is linear in P

Page 7:

Isoefficiency

• A consequence of Gustafson’s observation is that we increase N with P
• Kumar: we can maintain constant efficiency so long as we increase N appropriately
• The isoefficiency function specifies the growth of N in terms of P
• If N is linear in P, we have a scalable computation

• More on this later on

Page 8:

A theoretical basis: the PRAM

• Parallel Random Access Machine

• An idealized parallel computer
  – Unbounded number of processors
  – Shared memory of unbounded size
  – Constant access time
• Access time is comparable to that of a machine instruction
• All processors execute in lock step

[Figure: processing elements (PEs) connected to a single shared memory]

Page 9:

Why is the PRAM interesting?

• Inspires real-world system and algorithm designs
• Formal basis for fundamental limitations
  – If a PRAM algorithm is inefficient, then so is any parallel algorithm
  – If a PRAM algorithm is efficient, does it follow that any parallel algorithm is efficient?

Page 10:

How do we handle concurrent accesses?

• Our options are to prohibit or permit concurrency in reads and writes
• There are therefore 4 flavors
• We’ll focus on CRCW = Concurrent Read Concurrent Write

• All processors may read or write

Page 11:

CRCW PRAM

• What happens when more than one processor attempts to write to the same location?
• We need a rule for combining multiple writes
  – Common: all processors must write the same value
  – Arbitrary: only allow 1 arbitrarily chosen processor to write
  – Priority: assign priorities to the processors, and allow only the highest-priority processor’s write to succeed
  – Combining: combine the written values in some meaningful way, e.g. sum or max, using an associative operator

Page 12:

A natural programming model for a PRAM: the data parallel model

• Apply an operation uniformly over all processors in a single step
• Assign each array element to a virtual processor
• Implicit barrier synchronization between each step

Example (element-wise addition): [12 18 8 2] = [10 7 -2 1] + [2 11 10 1]

Page 13:

Sorting on a PRAM

• A 2-step algorithm called rank sort
• Compute the rank (position in sorted order) for each element in parallel
  – Compare all possible pairings of input values in parallel: n²-fold parallelism
  – CRCW model with update on write using summation
• Move each value to its correctly sorted position according to the rank: n-fold parallelism
• O(1) running time

Page 14:

Rank sort on a PRAM

1. Compute the rank of each key using n²-fold parallelism
2. Move each value into position according to the rank: n-fold parallelism

forall i = 0:n-1, j = 0:n-1
    if ( x[i] > x[j] ) then rank[i] = 1 end if   // CRCW writes combined by summation

forall i = 0:n-1
    y[rank[i]] = x[i]

Page 15:

Compute Ranks

forall i = 0:n-1, j = 0:n-1
    if ( x[i] > x[j] ) then rank[i] = 1 end if

[Figure: the n × n comparison matrix for x = [6, 5, -1, 3, 7, 1]; row i holds a 1 wherever x[i] > x[j], and the CRCW summation combiner collapses each row to rank = [4, 3, 0, 2, 5, 1]]

O(n²) parallelism; update on write: summation

Page 16:

Route the data using the ranks

forall i = 0:n-1
    y[rank[i]] = x[i]

[Figure: x = [5, 3, 7, 6, 1, -1] with rank = [3, 2, 5, 4, 1, 0]; routing each x[i] to position rank[i] produces the sorted array y = [-1, 1, 3, 5, 6, 7]]

Page 17:

Parallel speedup and efficiency

• Recall the parallel speedup on P processors:

  SP = (running time of the best serial program on 1 processor) / (running time of the parallel program on P processors)

• For rank sort the speedup is (n lg n) / O(1) = O(n lg n)
• No matter how many processors we have, the speedup for this workload is limited by the amount of available work
• This is an intrinsic limitation of the algorithm

Page 18:

Enter real world constraints

• The PRAM provides a necessary condition for an efficient algorithm on physical hardware
• But the condition is not sufficient; e.g. rank sort:

forall ( i = 0:n-1, j = 0:n-1 )
    if ( x[i] > x[j] ) then rank[i] = 1 end if

forall ( i = 0:n-1 )
    y[rank[i]] = x[i]

• Real-world computers have finite resources, including memory and network capacity
  – We cannot ignore communication network capacity, nor the cost of building a contention-free network
  – Not all computations can execute efficiently in lock step

Page 19:

Data parallelism in practice

• APL (1962)

• Matlab

• Fortran 90, 95, HPF (High Perf. Fortran) - 1994

• Titanium (UC Berkeley)

• Unified Parallel C (UPC)

• Co-Array Fortran

Page 20:

Programming with Threads

Page 21:

SPMD execution model

• Most parallel programming is implemented under the Same Program Multiple Data programming model = SPMD
• Other names for this model are “loosely synchronous” or “bulk synchronous”
• Programs execute as a set of P processes or threads
  – We specify P when we run the program
  – Each process/thread is usually assigned to a different physical processor
• Each process or thread
  – is initialized with the same code
  – has an associated rank, a unique integer in the range 0:P-1
  – executes instructions at its own rate
• Processes communicate via messages or shared memory; threads normally communicate through shared memory

Page 22:

Shared memory programming with threads

• A collection of concurrent instruction streams, called threads
• Each thread has a unique thread ID
• A new storage class: shared data
• A thread is similar to a procedure call, with notable differences
  – A procedure call is “synchronous”: a return indicates completion
  – A spawned thread executes asynchronously until it completes
  – Both share global storage with the caller
  – Synchronization is needed when updating shared state

Page 23:

Why threads?

• Processes are “heavy weight” objects scheduled by the OS
  – Protected address space, open files, and other state
• A thread, AKA a lightweight process (LWP), is sometimes more appropriate
  – Threads share the address space and open files of the parent, but have their own stack
  – Reduced management overheads
  – Kernel scheduler multiplexes threads

[Figure: several threads sharing one heap, each with its own stack]

Page 24:

Practical issues

• Thread creation is faster than process creation (real time)
• Moving data in shared memory is cheaper than passing a message through shared memory
  https://computing.llnl.gov/tutorials/pthreads

Platform                     CPU/node  Fork (µs)  Create (µs)  MPI Shared Mem (GB/s)  Mem-CPU (GB/s)
Intel 2.4 GHz Xeon               2       54.5        1.64             0.3                  4.3
IBM 1.9 GHz POWER5 p5-575        8       64.2        1.75             4.1                 16
AMD 2.4 GHz Opteron              8       41.1        0.66             1.2                  5.3
Intel 1.4 GHz Itanium 2          4       55.0        2.03             1.8                  6.4

Page 25:

Threads in practice

• A common interface is the POSIX Threads “standard” (pthreads): IEEE POSIX 1003.1c-1995
  – Beware of non-standard features
• Another approach is to use program annotations via OpenMP

Page 26:

Programming model

• Start with a single root thread
• Fork-join parallelism to create concurrently executing threads
• Threads may or may not execute on different processors, and might be interleaved
• Scheduling behavior is specified separately

Page 27:

OpenMP programming

• Simpler interface than explicit threads
• Parallelization handled via annotations
• See http://www.openmp.org
• Parallel loop:

#pragma omp parallel private(i) shared(n)
{
    #pragma omp for
    for (i = 0; i < n; i++)
        work(i);
}

Page 28:

Parallel Sections

#pragma omp parallel          // Begin a parallel construct: form a team;
{                             // each team member executes the same code
    #pragma omp sections      // Begin work sharing
    {
        #pragma omp section   // A unit of work
        { x = x + 1; }
        #pragma omp section   // Another unit
        { x = x + 1; }
    }                         // Wait until both units complete
}                             // End of parallel construct; disband team
// continue serial execution

Page 29:

Race conditions

• Consider the statement, assuming x == 0, executed by two threads:

    x = x + 1;        x = x + 1;

• Generated code:
    r1 ← (x)
    r1 ← r1 + #1
    r1 → (x)

• Possible interleaving with two threads:

    P1                                  P2
    r1 ← x      r1(P1) gets 0           r1 ← x      r1(P2) also gets 0
    r1 ← r1+#1  r1(P1) set to 1         r1 ← r1+#1  r1(P2) set to 1
    x ← r1      P1 writes its r1        x ← r1      P2 writes its r1

• Result: x == 1, even though two increments were executed

Page 30:

Race conditions

• A race condition arises when the timing of accesses to shared memory can affect the outcome
• We say we have a non-deterministic computation
• Sometimes we can use non-determinism to advantage, but we usually avoid it
• For the same input we want to obtain the same results from operations that do not have side effects (like I/O and random number generators)
• Memory consistency and cache coherence are necessary but not sufficient conditions for ensuring program correctness
• We need to take steps to avoid race conditions through appropriate program synchronization

Page 31:

Critical Sections

#pragma omp parallel          // Begin a parallel construct
{
    #pragma omp sections      // Begin worksharing
    {
        #pragma omp critical  // Critical section
        { x = x + 1; }
        #pragma omp critical  // Another critical section
        { x = x + 1; }
        ...                   // More replicated code
        #pragma omp barrier   // Wait for all members to arrive
    }                         // Wait until both units of work complete
}

• Only one thread at a time may run the code in a critical section
• Mutual exclusion is used to implement critical sections

Page 32:

How does mutual exclusion work?

• A simple solution is to use a mutex variable, e.g. as provided by pthreads
• Locks may be CLEAR or SET
• Lock() waits if the lock is set, else sets the lock
• Unlock() clears the lock if set

Mutex mtx;
mtx.lock();
// CRITICAL SECTION
mtx.unlock();

Page 33:

Coding with pthreads

#include <pthread.h>
#include <cassert>
#include <iostream>
using namespace std;

void *Hello(void *tid) {
    cout << "Hello from thread " << (long)tid << endl;
    pthread_exit(NULL);
}

int main(int argc, char *argv[]) {
    const int NT = 3;
    pthread_t th[NT];
    for (long t = 0; t < NT; t++)
        assert(!pthread_create(&th[t], NULL, Hello, (void *)t));
    for (int t = 0; t < NT; t++)
        assert(!pthread_join(th[t], NULL));
    return 0;
}

% g++ t.C -lpthread
% a.out
Hello from thread 0
Hello from thread 1
Hello from thread 2

Page 34:

Computing a sum in parallel

• Also see: dotprod_mutex.c in the LLNL tutorial

Globals:
    pthread_mutex_t mutex_sum;
    int *x, global_sum, N, NT;

Main:
    for (int i = 0; i < N; i++) x[i] = i;
    global_sum = 0;
    assert(!pthread_mutex_init(&mutex_sum, NULL));
    pthread_t thrd[NT];
    for (long t = 0; t < NT; t++)
        pthread_create(&thrd[t], NULL, summ, (void *)t);
    // Join threads ...
    cout << "The sum of 0 to " << N-1 << " is: " << global_sum << endl;

Page 35:

The computation

void *summ(void *arg) {
    long TID = (long)arg;
    long i0 = TID * (N/NT), i1 = i0 + (N/NT);
    int mysum = 0;
    for (long i = i0; i < i1; i++)
        mysum += x[i];
    pthread_mutex_lock(&mutex_sum);
    global_sum += mysum;
    pthread_mutex_unlock(&mutex_sum);
    pthread_exit((void *)0);
}

% g++ sum.C -lpthread
% a.out
The sum of 0 to 2047 is: 2096128

Page 36:

Correctness and synchronization

int sum = 0;                       // Global

void *sumIt(void *arg) {           // Thread body
    long TID = (long)arg;
    pthread_mutex_lock(&mutex_sum);
    sum += 2 * (TID + 1);
    pthread_mutex_unlock(&mutex_sum);
    if (TID == 0)
        cout << "Total sum is " << sum << endl;
    pthread_exit((void *)0);
}

% a.out 5
# threads: 5
Total sum is 2
The sum of 0 to 4 is: 30

Page 37:

Barrier synchronization

• Why was the sum incorrectly reported?
• We read a location updated by other threads that had not yet had the chance to produce their contribution (true dependence)
• Don’t overwrite the values used by other processes in the current iteration until they have been consumed (anti-dependence)
• A barrier can be built with locks

Page 38:

Building a linear time barrier with locks

Mutex arrival = UNLOCKED, departure = LOCKED;
int count = 0;

void Barrier() {
    arrival.lock();                   // atomically count the waiting threads
    count++;
    if (count < nproc) arrival.unlock();
    else departure.unlock();          // last processor enables all to go

    departure.lock();
    count--;                          // atomically decrement
    if (count > 0) departure.unlock();
    else arrival.unlock();            // last processor resets state
}

Page 39:

Assignment #2

• Compile and run two programs– Summation.C and Sum.C– Run on any machine you wish, including your laptop

• Due in class on Friday• See the schedule for details