
Parallelization at a Glance


Page 1: Parallelization at  a Glance

Rechen- und Kommunikationszentrum (RZ)

Parallelization at a Glance

Christian Terboven <[email protected]>
20.11.2012 / Aachen, Germany

As of: 19.11.2012, Version 2.3

Page 2: Parallelization at  a Glance


Agenda

Basic Concepts
Parallelization Strategies
Possible Issues
Efficient Parallelization

Page 3: Parallelization at  a Glance


Basic Concepts

Page 4: Parallelization at  a Glance


Processes and Threads

Process: instance of a computer program
Thread: a sequence of instructions (with its own program counter)

Parallel / concurrent execution:
  multiple processes
  multiple threads

Per process: address space, files, environment, …
Per thread: program counter, stack, …

(Figure: a computer running a process; the process contains several threads, each with its own program counter.)
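To make the distinction concrete, here is a minimal illustrative sketch (not from the slides) using C++11 std::thread: both threads live in one process and see the same global variable, while each has its own stack holding its own local variable.

    // Illustrative sketch only; the names and values are made up for this example.
    #include <iostream>
    #include <thread>

    const int shared_value = 42;          // per process: one shared address space

    void work(int id) {
        int local = id + shared_value;    // per thread: own stack and program counter
        std::cout << "thread " << id << " computed " << local << "\n";  // output may interleave
    }

    int main() {
        std::thread t1(work, 1);          // two threads within one process
        std::thread t2(work, 2);
        t1.join();
        t2.join();
    }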

Page 5: Parallelization at  a Glance


Process and Thread Scheduling by the OS

Even on a multi-socket multi-core system you should not make any assumptions about which process / thread is executed when and where!

Two threads on one core: the OS scheduler interleaves them in time.
Two threads on two cores: they can run simultaneously, but the OS may still migrate them between cores unless the threads are "pinned" to specific cores.

(Figure: timelines of Thread 1, Thread 2 and a system thread, contrasting thread migration between cores with "pinned" threads.)
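If migration is unwanted, threads can be pinned. A hedged, Linux/glibc-specific sketch (not part of the slides; portable alternatives such as OpenMP's OMP_PROC_BIND or the taskset tool exist): bind a thread to one core with pthread_setaffinity_np().

    // Hedged, Linux/glibc-specific sketch: pin a thread to core 0 so the OS
    // scheduler will not migrate it. Error handling is minimal on purpose.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <iostream>
    #include <thread>

    int main() {
        std::thread t([] { /* ... some work ... */ });

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                                   // allow only core 0
        int rc = pthread_setaffinity_np(t.native_handle(),  // pthread_t on Linux
                                        sizeof(set), &set);
        if (rc != 0)
            std::cerr << "pinning failed, error code " << rc << "\n";

        t.join();
    }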

Page 6: Parallelization at  a Glance


Shared Memory Parallelization

Memory can be accessed by several threads running on different cores in a multi-socket multi-core system:

(Figure: two CPUs share one memory; a thread on CPU1 writes a = 4 while a thread on CPU2 reads the same location to compute c = 3 + a.)
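A tiny illustrative sketch of the situation in the figure (variable names as in the figure, everything else invented): one thread stores a = 4, and after it has finished, a second thread reads the same memory location to compute c = 3 + a.

    // Two threads access the same variable "a" in shared memory; joining the
    // writer first makes the read well-defined here.
    #include <iostream>
    #include <thread>

    int a = 1;                                  // shared between all threads

    int main() {
        std::thread writer([] { a = 4; });      // runs e.g. on CPU1
        writer.join();                          // ensure the store has finished

        std::thread reader([] {                 // runs e.g. on CPU2
            int c = 3 + a;                      // sees the value written above
            std::cout << "c = " << c << "\n";   // prints 7
        });
        reader.join();
    }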

Page 7: Parallelization at  a Glance


Parallelization Strategies

Page 8: Parallelization at  a Glance


Finding Concurrency

Chances for concurrent execution:
  Look for tasks that can be executed simultaneously (task parallelism)
  Decompose data into distinct chunks to be processed independently (data parallelism; a small sketch follows below)

Organize by task:
  Task parallelism
  Divide and conquer

Organize by data decomposition:
  Geometric decomposition
  Recursive data

Organize by flow of data:
  Pipeline
  Event-based coordination
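As announced above, a small illustrative data-parallelism sketch (names invented for this example): a vector is decomposed into contiguous chunks and each thread processes its own chunk independently.

    // Data-parallelism sketch: each thread sums one chunk of the vector.
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t n = 1000000;
        const int num_threads = 4;
        std::vector<double> data(n, 1.0);
        std::vector<double> partial(num_threads, 0.0);   // one slot per thread: no race
        std::vector<std::thread> workers;

        const std::size_t chunk = n / num_threads;
        for (int t = 0; t < num_threads; ++t) {
            std::size_t lo = t * chunk;
            std::size_t hi = (t == num_threads - 1) ? n : lo + chunk;
            workers.emplace_back([&, t, lo, hi] {
                partial[t] = std::accumulate(data.begin() + lo, data.begin() + hi, 0.0);
            });
        }
        for (auto& w : workers) w.join();

        std::cout << "sum = " << std::accumulate(partial.begin(), partial.end(), 0.0) << "\n";
    }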

Page 9: Parallelization at  a Glance


Divide and conquer

(Figure: the problem is split recursively into subproblems, the subproblems are solved, and the subsolutions are merged step by step back into the overall solution.)
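A hedged sketch of the split / solve / merge structure (illustrative, not the course's own implementation), using std::async to run one half of each split concurrently:

    // Divide and conquer: split the range, solve the halves (one of them
    // asynchronously), then merge the two subsolutions.
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <vector>

    double dc_sum(const std::vector<double>& v, std::size_t lo, std::size_t hi) {
        if (hi - lo < 100000)                                  // small problem: solve directly
            return std::accumulate(v.begin() + lo, v.begin() + hi, 0.0);
        std::size_t mid = lo + (hi - lo) / 2;                  // split
        auto left = std::async(std::launch::async, dc_sum, std::cref(v), lo, mid);
        double right = dc_sum(v, mid, hi);                     // solve
        return left.get() + right;                             // merge
    }

    int main() {
        std::vector<double> v(1000000, 1.0);
        std::cout << dc_sum(v, 0, v.size()) << "\n";           // prints 1e+06
    }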

Page 10: Parallelization at  a Glance


Geometric decomposition

Example: ventricular assist device (VAD)
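A full VAD flow simulation is of course beyond a slide; as a hedged sketch of the pattern itself (grid size and update rule invented for illustration), the domain is cut into strips and each thread updates only the cells it owns.

    // Geometric decomposition sketch: a 2D grid is split into horizontal
    // strips; each thread updates the cells of its own strip independently.
    #include <thread>
    #include <vector>

    int main() {
        const int nx = 512, ny = 512, num_threads = 4;
        std::vector<double> grid(static_cast<std::size_t>(nx) * ny, 0.0);

        std::vector<std::thread> workers;
        const int rows = ny / num_threads;
        for (int t = 0; t < num_threads; ++t) {
            int y0 = t * rows;
            int y1 = (t == num_threads - 1) ? ny : y0 + rows;
            workers.emplace_back([&, y0, y1] {
                for (int y = y0; y < y1; ++y)          // rows owned by this thread
                    for (int x = 0; x < nx; ++x)
                        grid[static_cast<std::size_t>(y) * nx + x] = 0.25 * (x + y);  // made-up update
            });
        }
        for (auto& w : workers) w.join();
    }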

Page 11: Parallelization at  a Glance


Pipeline

Analogy: assembly line. Assign different stages to different PEs (processing elements).

(Figure: four pipeline stages process work items C1…C5 over time; while stage 2 works on C1, stage 1 already works on C2, and so on, so once the pipeline is filled all stages are busy simultaneously.)
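A hedged shared-memory sketch of the pipeline pattern (the Channel type and the stage functions are invented for this example, C++17): three stages run as threads and pass items through small blocking queues, so different items are in different stages at the same time.

    // Pipeline sketch: stage 1 produces, stage 2 transforms, stage 3 consumes.
    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <thread>

    template <typename T>
    class Channel {                                    // minimal blocking queue between stages
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
        bool closed_ = false;
    public:
        void push(T v) {
            { std::lock_guard<std::mutex> l(m_); q_.push(std::move(v)); }
            cv_.notify_one();
        }
        void close() {
            { std::lock_guard<std::mutex> l(m_); closed_ = true; }
            cv_.notify_all();
        }
        std::optional<T> pop() {                       // blocks until an item arrives or close()
            std::unique_lock<std::mutex> l(m_);
            cv_.wait(l, [&] { return !q_.empty() || closed_; });
            if (q_.empty()) return std::nullopt;       // closed and drained
            T v = std::move(q_.front()); q_.pop();
            return v;
        }
    };

    int main() {
        Channel<int> c1, c2;
        std::thread stage1([&] { for (int i = 1; i <= 5; ++i) c1.push(i); c1.close(); });
        std::thread stage2([&] { while (auto v = c1.pop()) c2.push(*v * *v); c2.close(); });
        std::thread stage3([&] { while (auto v = c2.pop()) std::cout << *v << "\n"; });
        stage1.join(); stage2.join(); stage3.join();
    }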

Page 12: Parallelization at  a Glance


Further parallel design patterns

Recursive data pattern: parallelize operations on recursive data structures
  See example later on: Fibonacci (a hedged preview follows below)

Event-based coordination pattern: decomposition into semi-independent tasks interacting in an irregular fashion
  See example later on: while-loop with tasks
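The Fibonacci example itself comes later in the course; as a hedged preview (using std::async here rather than whatever tasking construct is used there), the recursive structure looks like this:

    // Hedged preview of a task-parallel Fibonacci; the cut-off keeps the
    // number of spawned threads small.
    #include <future>
    #include <iostream>

    long fib(int n) {
        if (n < 2) return n;
        if (n < 25) return fib(n - 1) + fib(n - 2);               // below cut-off: sequential
        auto child = std::async(std::launch::async, fib, n - 1);  // "task" for one branch
        long other = fib(n - 2);                                  // other branch in this thread
        return child.get() + other;
    }

    int main() {
        std::cout << "fib(30) = " << fib(30) << "\n";             // prints 832040
    }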

Page 13: Parallelization at  a Glance


Possible Issues

Page 14: Parallelization at  a Glance


Data Races / Race Conditions

Data race: concurrent access to the same memory location by multiple threads without proper synchronization.

Let variable x have the initial value 1; one thread executes x = 5; while another executes printf("%d\n", x);

Depending on which thread is faster, you will see either 1 or 5.
The result is nondeterministic (i.e., it depends on OS scheduling).

Data races (and how to detect and avoid them) will be covered in more detail later!
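A runnable sketch of exactly this situation (illustrative; note that in C++ the unsynchronized access is in fact undefined behavior, one more reason to avoid data races):

    // Data race sketch: x starts at 1, one thread writes 5 while another
    // prints; without synchronization the program may print 1 or 5.
    #include <cstdio>
    #include <thread>

    int x = 1;

    int main() {
        std::thread writer([] { x = 5; });
        std::thread reader([] { std::printf("%d\n", x); });
        writer.join();
        reader.join();
    }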

Page 15: Parallelization at  a Glance


Serialization

Serialization: when threads / processes spend "too much" time waiting for each other. The result is limited scalability, if the program scales at all.

Simple (and stupid) example:

(Figure: timeline of two processes that exchange data with blocking Send / Recv and data transfer phases; the Calc and Wait phases show one process waiting while the other computes, so the execution is effectively serialized.)
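The figure uses message passing, but the same effect is easy to reproduce in shared memory. In this hedged sketch (invented for illustration) four threads funnel every iteration through one mutex, so the "parallel" loop effectively runs one thread at a time:

    // Serialization sketch: the lock turns the parallel loop into a serial one.
    #include <mutex>
    #include <thread>
    #include <vector>

    int main() {
        std::mutex m;
        long counter = 0;
        std::vector<std::thread> workers;

        for (int t = 0; t < 4; ++t)
            workers.emplace_back([&] {
                for (int i = 0; i < 1000000; ++i) {
                    std::lock_guard<std::mutex> lock(m);   // all threads wait here
                    ++counter;                             // the actual "work" is serialized
                }
            });
        for (auto& w : workers) w.join();
    }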

Page 16: Parallelization at  a Glance


Efficient Parallelization

Page 17: Parallelization at  a Glance


Parallelization Overhead

Overhead introduced by the parallelization:
  Time to start / end / manage threads
  Time to send / exchange data
  Time spent in synchronization of threads / processes

With parallelization:
  the total CPU time increases,
  the wall time decreases,
  the system time stays the same.

Efficient parallelization is about minimizing the overhead introduced by the parallelization itself!
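One way to see this overhead is simply to measure wall time. A hedged sketch (the work is deliberately a stand-in): if the work per thread is tiny, the time to start and join the threads dominates, and the "parallel" version can be slower than a serial one.

    // Overhead sketch: measure the wall time of spawning and joining threads
    // whose work is (deliberately) almost nothing.
    #include <chrono>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        auto start = std::chrono::steady_clock::now();

        std::vector<std::thread> workers;
        for (int t = 0; t < 8; ++t)
            workers.emplace_back([] { /* almost no work */ });
        for (auto& w : workers) w.join();

        auto end = std::chrono::steady_clock::now();
        std::cout << "wall time: "
                  << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
                  << " us, spent almost entirely on thread management\n";
    }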

Page 18: Parallelization at  a Glance


Load Balancing

(Figure: two bar charts of per-thread execution time, one with perfect load balancing and one with load imbalance.)

Perfect load balancing: all threads / processes finish at the same time.

Load imbalance: some threads / processes take longer than others. But all threads / processes have to wait for the slowest thread / process, which thus limits the scalability.
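A common remedy is dynamic work distribution. In this hedged sketch (the work items are artificial), idle threads keep fetching the next index from a shared atomic counter, so no thread sits idle while others still have work:

    // Load balancing sketch: threads pull work items dynamically instead of
    // getting one fixed block each, so uneven item costs even out.
    #include <atomic>
    #include <chrono>
    #include <thread>
    #include <vector>

    int main() {
        const int num_items = 64;
        std::atomic<int> next{0};                       // shared work counter

        auto worker = [&] {
            for (int i = next.fetch_add(1); i < num_items; i = next.fetch_add(1)) {
                // artificial work whose duration varies from item to item
                std::this_thread::sleep_for(std::chrono::milliseconds(i % 5));
            }
        };

        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t) threads.emplace_back(worker);
        for (auto& t : threads) t.join();
    }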

Page 19: Parallelization at  a Glance


Speedup and Efficiency

Time using 1 CPU: T(1)
Time using p CPUs: T(p)

Speedup S: S(p) = T(1) / T(p)
  Measures how much faster the parallel computation is!

Efficiency E: E(p) = S(p) / p

Example: T(1) = 6s, T(2) = 4s
  S(2) = 1.5
  E(2) = 0.75

Ideal case: T(p) = T(1)/p, S(p) = p, E(p) = 1.0

Page 20: Parallelization at  a Glance


Amdahl's Law

Describes the influence of the serial part on scalability (without taking any overhead into account):

  S(p) = T(1) / T(p) = T(1) / (f * T(1) + (1 - f) * T(1)/p) = 1 / (f + (1 - f)/p)

  f: serial fraction (0 ≤ f ≤ 1)
  T(1): time using 1 CPU
  T(p): time using p CPUs
  S(p): speedup; S(p) = T(1)/T(p)
  E(p): efficiency; E(p) = S(p)/p

It is rather easy to scale to a small number of cores, but any parallelization is limited by the serial part of the program!

Page 21: Parallelization at  a Glance


Amdahl's Law illustrated

If 80% (measured in program runtime) of your work can be parallelized and "just" 20% still runs sequentially, then your speedup will be:

  1 processor:  time 100%, speedup 1
  2 processors: time  60%, speedup 1.7
  4 processors: time  40%, speedup 2.5
  ∞ processors: time  20%, speedup 5
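These numbers follow directly from the formula with f = 0.2; a small sketch that recomputes them:

    // Recompute the illustrated speedups from Amdahl's law with f = 0.2.
    #include <iostream>

    double amdahl(double f, double p) { return 1.0 / (f + (1.0 - f) / p); }

    int main() {
        const double f = 0.2;                      // 20% serial part
        for (double p : {1.0, 2.0, 4.0})
            std::cout << "p = " << p << ": S = " << amdahl(f, p) << "\n";  // 1, ~1.67, 2.5
        std::cout << "p -> infinity: S -> " << 1.0 / f << "\n";            // 5
    }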

Page 22: Parallelization at  a Glance


Speedup in Practice

After the initial parallelization of a program, you will typically see speedup curves like this:

(Figure: speedup S(p) plotted over the number of processors p = 1…8; shown are the ideal speedup S(p) = p, the speedup according to Amdahl's law, and the realistic speedup curve, which lies below the other two.)