33
CS4460 Ad d Al ith CS4460 Advanced Algorithms Batch 08, L4S2 Batch 08, L4S2 Lecture 11 Lecture 11 Multithreaded Algorithms – Part 1 N H N D de Silva N. H. N. D. de Silva Dept. of Computer Science & Eng University of Moratuwa

CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2

Lecture 11Lecture 11Multithreaded Algorithms – Part 1

N H N D de SilvaN. H. N. D. de SilvaDept. of Computer Science & Eng

University of Moratuwa

Page 2: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

AnnouncementsAnnouncements

• Last topic discussed is MultithreadingLast topic discussed is Multithreading– Will have two sessions

Nov 2012 N. H. N. D. de Silva 2

Page 3: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Outline: Multithreaded AlgorithmsOutline: Multithreaded Algorithms

• Part 1: todayPart 1: today– Introduction, Dynamic Multithreading

Model for Multithreaded Execution– Model for Multithreaded Execution– Performance Measures

Analyzing Multithreaded Algorithms– Analyzing Multithreaded Algorithms• Part 2: next session

• Parallel Loops• Race Conditions• Examples• Examples

3Nov 2012 N. H. N. D. de Silva

Page 4: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

IntroductionIntroduction

• Aim: extend models to parallel algorithmsAim: extend models to parallel algorithms– Can run on multiprocessors that permit

multiple instructions to execute concurrentlymultiple instructions to execute concurrently• E.g.: chip multiprocessors (multicore), clusters,

supercomputers

• Multiprocessor models– Shared memory vs. distributed memoryS a ed e o y s d st buted e o y– Shared memory seem to be the trend– We use shared memory model for this studyWe use shared memory model for this study

4Nov 2012 N. H. N. D. de Silva

Page 5: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

IntroductionIntroduction

• Threading modelsThreading models– Static threading

• Software abstraction of “virtual processors” with• Software abstraction of virtual processors with shared common memory

• Programmer can manage (create/destroy) threads with OS support but difficult, error prone, etc

• Work-load balancing, scheduling by programmerH l d t l tf th t id• Has lead to concurrency platforms that provide software layers to coordinate, schedule, manage

5Nov 2012 N. H. N. D. de Silva

Page 6: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

IntroductionIntroduction

• Threading modelsThreading models– Dynamic multithreading

• Allows programmers to specify parallelism without• Allows programmers to specify parallelism without worrying about load-balancing, managing etc

• Concurrency platform has a scheduler• Simplifies programming• Supports nested parallelism and parallel loops• We use this model

6Nov 2012 N. H. N. D. de Silva

Page 7: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

IntroductionIntroduction

• Nested ParallelismNested Parallelism– Allows a subroutine (procedure) to be

spawnedspawned• This allows the caller to proceed while the called

subroutine is computing

• Parallel Loop– Like a normal (sequential) loop, but the e a o a (seque t a ) oop, but t e

iterations of the loop can execute concurrently

7Nov 2012 N. H. N. D. de Silva

Page 8: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Dynamic MultithreadingDynamic Multithreading

• Advantages of the modelAdvantages of the model– Programmer specifies only the logical

parallelism within a computationparallelism within a computation• Threads within the concurrency platform schedule

and load balance within themselves– Simply extends the serial programming model

• Only need 3 keywords: parallel, spawn, sync• When these are deleted, results in serial code for

the same problem ( serialization of algorithm)

Nov 2012 N. H. N. D. de Silva 8

Page 9: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Dynamic MultithreadingDynamic Multithreading

• Advantages of the modelAdvantages of the model– Clean way to quantify parallelism

• Based on concepts of “work” and “span”• Based on concepts of work and span– Many multithreaded algorithms involving

nested parallelism follow naturally from thenested parallelism follow naturally from the divide-and-conquer paradigm

• Can analyze by solving recurrences also– Close to how parallel computing is evolving

Nov 2012 N. H. N. D. de Silva 9

Page 10: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Dynamic Multithreading: BasicsDynamic Multithreading: Basics

• Example: Fibonacci sequenceExample: Fibonacci sequence• Recall: F0 = 0 and F1 = 1

F F F f i 2Fi = Fi-1 + Fi-2 for i ≥ 2• Simple recursive serial algorithm

FIB(n)if n ≤ 1 return nelse x ← FIB (n-1)

y ← FIB (n-2)y ( )return x + y

Nov 2012 N. H. N. D. de Silva 10

Page 11: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Tree of Recursive ProceduresTree of Recursive Procedures

Nov 2012 N. H. N. D. de Silva 11

Page 12: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Dynamic Multithreading: BasicsDynamic Multithreading: Basics

• Example: Fibonacci sequenceExample: Fibonacci sequence– What about the running time of FIB?

Running time T(n) = T(n 1) + T(n 2) + Θ(1)– Running time T(n) = T(n-1) + T(n-2) + Θ(1)– Solution to the recurrence is T(n) = Θ(Fn)

This can be bound as T(n) Θ(ϕn)– This can be bound as T(n) = Θ(ϕn)• Where ϕ = (1 + √5)/2 is the golden ratio• T(n) grows exponentially in n• T(n) grows exponentially in n

Let us consider a multithreaded algorithm– Let us consider a multithreaded algorithm

Nov 2012 N. H. N. D. de Silva 12

Page 13: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Dynamic Multithreading: BasicsDynamic Multithreading: Basics

• Example: Fibonacci sequenceExample: Fibonacci sequence

FIB i d i ltith di• FIB using dynamic multithreadingP-FIB(n)

if n ≤ 1 return nelse x ← spawn P-FIB (n-1)y ← P-FIB (n-2)sync

Nested parallelism occurs when keyword spawn

return x + yNov 2012 N. H. N. D. de Silva 13

yprecedes a procedure call

Page 14: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Dynamic Multithreading: BasicsDynamic Multithreading: Basics

• Nested parallelismNested parallelism– The parent procedure instance may continue

instead of waiting for the child to continueinstead of waiting for the child to continue• E.g., while the child is computing P-FIB(n-1), the

parent may go on to compute P-FIB(n-2)– The two calls themselves and their children

create nested parallelism (they are recursive)– But, spawn does not say parent must

execute concurrently with spawned children• Shows logical parallelism; scheduler will decide

Nov 2012 N. H. N. D. de Silva 14

Page 15: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Dynamic Multithreading: BasicsDynamic Multithreading: Basics

• Nested parallelismNested parallelism– A procedure cannot safely use the values

returned by a child until it executes a syncreturned by a child until it executes a sync• sync says parent must wait there for all spawned

children to complete before proceeding

– Every procedure executes a sync implicitly before it returns

• Ensures all children terminate before it does

Nov 2012 N. H. N. D. de Silva 15

Page 16: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Model for Multithreaded ExecutionModel for Multithreaded Execution

• We can think of a multithreadedWe can think of a multithreaded computation as a computation dag

A vertex: A chain of instructions with no– A vertex: A chain of instructions with no parallel control (a strand)

– An edge (u v): strand u must execute before vAn edge (u,v): strand u must execute before v– If a strand has 2 successors?

• One of them must have been spawnedOne of them must have been spawned– A strand with multiple predecessors?

• Predecessors joined because of a syncPredecessors joined because of a sync

Nov 2012 N. H. N. D. de Silva 16

Page 17: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Model for Multithreaded ExecutionIf a computation dag has a directed path from strand u tofrom strand u to strand v, then u and v are (logically) in series; else they areseries; else they are (logically) in parallel.

Nov 2012 N. H. N. D. de Silva 17

Page 18: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Model for Multithreaded ExecutionModel for Multithreaded Execution

• Multithreaded algorithms are executed onMultithreaded algorithms are executed on an ideal parallel computer with

a set of processors of equal power– a set of processors of equal power– sequentially consistent shared memory

• Due to scheduling instruction ordering may differ• Due to scheduling, instruction ordering may differ from one program run to next, but can assume executed in some order consistent with the

i dcomputation dag

• We ignore cost of scheduling

Nov 2012 N. H. N. D. de Silva 18

Page 19: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Performance MeasuresPerformance Measures

• Two measures: work and spanTwo measures: work and span• Work

T t l ti t t– Total time to execute on one processor– i.e., sum of times taken by each strand = the

number of vertices in the dagnumber of vertices in the dag• Span

– The longest time to execute strands along any path in the dag

If h t d t k it ti # ti• If each strand takes unit time, span = # vertices along the critical path

Nov 2012 N. H. N. D. de Silva 19

Page 20: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Performance MeasuresPerformance Measures

• E g in Fig 27 2E.g., in Fig 27.2– Total 17 vertices, 8 vertices in the critical path

work = 17 time units span = 8 time units– work = 17 time units, span = 8 time units

• Actual running time also depends on – number of processors available – how the scheduler allocates strands to proc’s

Nov 2012 N. H. N. D. de Silva 20

Page 21: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Performance MeasuresPerformance Measures

• NotationsNotations– Tp = running time on P processors

T = the work = running time on one proc– T1 = the work = running time on one proc.– T∞ = the span = running time if each strand

can run on its processor (have infinite proc’s)can run on its processor (have infinite proc s)

Work span: lower bounds on running time T– Work, span: lower bounds on running time Tpof a multithreaded computation on P proc’s

Nov 2012 N. H. N. D. de Silva 21

Page 22: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Performance MeasuresPerformance Measures

• Work lawWork law– In a step, ideal computer with P proc’s can do

at most P units of work In TP time can do atat most P units of work. In TP time, can do at most P×Tp work. So, Tp ≥ T1/P

• Span lawSpan law– A P-processor ideal parallel computer cannot

run any faster than one with unlimited proc’srun any faster than one with unlimited proc s. So Tp ≥ T∞

Nov 2012 N. H. N. D. de Silva 22

Page 23: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Performance MeasuresPerformance Measures

• Speedup = T1 / TPSpeedup = T1 / TP– Of a computation on P processors

How many times faster the computation is on– How many times faster the computation is on P processors than on 1 processor

– At most P i e T / T ≤ P– At most P, i.e., T1 / TP ≤ P– When T1 / TP = Θ(P) linear speedup

When T / T = P perfect linear speedup– When T1 / TP = P perfect linear speedup

Nov 2012 N. H. N. D. de Silva 23

Page 24: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Performance MeasuresPerformance Measures

• Parallelism = T1 / TParallelism = T1 / T∞– As a ratio: denotes the average amount of

work that can be performed in parallel forwork that can be performed in parallel for each step along the critical path

– As an upper bound: gives the maximumAs an upper bound: gives the maximum possible speedup on any # of processors

– Also: Provides a limit on the possibility of p yattaining perfect linear speedup

• If # of proc’s exceeds parallelism, cannot possibly achieve perfect linear speedup

Nov 2012 N. H. N. D. de Silva 24

Page 25: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Performance MeasuresPerformance Measures

• (Parallel) Slackness = (T1 / T ) / P(Parallel) Slackness = (T1 / T∞) / P– = T1 / (PT∞)

The factor by which the parallelism of the– The factor by which the parallelism of the computation exceeds the number of processors in the machineprocessors in the machine

• E.g., if slackness < 1, cannot hope to achieve perfect linear speedup

– As the slackness decreases from 1 toward 0, speedup of the computation diverges further

d f th f f t li dand further from perfect linear speedup

Nov 2012 N. H. N. D. de Silva 25

Page 26: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

SchedulingScheduling

• Rely on concurrency platform’s schedulerRely on concurrency platform s scheduler• A greedy scheduling is good enough

Nov 2012 N. H. N. D. de Silva 26

Page 27: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

SchedulingScheduling

Nov 2012 N. H. N. D. de Silva 27

Page 28: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

SchedulingScheduling

Nov 2012 N. H. N. D. de Silva 28

Page 29: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Analyzing Multithreaded AlgorithmsAnalyzing Multithreaded Algorithms

Nov 2012 N. H. N. D. de Silva 29

Page 30: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Analysis of P-FIB(n)Analysis of P FIB(n)

• The span of P-FIB(n)The span of P FIB(n)T∞ = max (T∞(n-1), T∞(n-2)) + Θ(1)

T ( 1) Θ(1) Θ( )= T∞(n-1) + Θ(1) = Θ(n)

• The parallelism of P-FIB(n)T1(n)/ T (n) = Θ(ϕn /n)T1(n)/ T∞(n) Θ(ϕ /n)

– Grows dramatically as n gets largeeven on largest parallel computers modest– even on largest parallel computers, modest

values of n give near perfect linear speedupNov 2012 N. H. N. D. de Silva 30

Page 31: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

Next Session – Part 2Next Session Part 2

• TopicsTopics– Parallel loops

Race conditions– Race conditions– Examples: Multithreaded matrix-vector and

matrix-matrix multiplicationmatrix-matrix multiplication• Before next session, read if possible

• pp 785-796 (12 pages) from CLSR 3/e• pp. 785-796 (12 pages) from CLSR 3/e • Document “A Minicourse on Dynamic Multithreaded

Algorithms” (focus on relevant topics)

Nov 2012 N. H. N. D. de Silva 31

Page 32: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

ReferencesReferences• The lecture slides are based on the slides prepared by p p y

Prof. Sanath Jayasena for this class in previous years.

• Mainly: CLRS book, 3e– Part VII: Selected Topics– Chapter 27: Multithreaded AlgorithmsChapter 27: Multithreaded Algorithms

• Other resources (on LMS)– Document “A Minicourse on Dynamic Multithreaded Algorithms”– Slide sets: Cilk and Design and Analysis of Dynamic

Multithreaded Algorithms and Strassen’s Matrix Multiplication Algorithm

32Nov 2012 N. H. N. D. de Silva

Page 33: CS4460 Ad d Al ithCS4460 Advanced Algorithmsnisansa/Classes/01_University_of_Moratu… · CS4460 Ad d Al ithCS4460 Advanced Algorithms Batch 08, L4S2Batch 08, L4S2 Lecture 11Lecture

ConclusionConclusion

• We discussed multithreaded algorithmsWe discussed multithreaded algorithms– Dynamic Multithreading– Model for Multithreaded ExecutionModel for Multithreaded Execution– Performance Measures– Analyzing Multithreaded Algorithms

• Next session (last session) – Part 2( )– Parallel Loops, Race Conditions, Examples

33Nov 2012 N. H. N. D. de Silva