ΕΠΛ372 Parallel Processing
Introduction: Parallel Processing
Yiannos Sazeides, Spring Semester 2014
READING
1. www.pearsonhighered.com/assets/hip/us/hip_us_pearsonhighered/samplechapter/0321487907.pdf
2. Illiac IV: http://www.cs.auckland.ac.nz/courses/compsci703s1c/resources/Bouknight-ILIAC-IV.pdf
3. Parallel Computing Landscape: http://www.youtube.com/watch?v=On-k-E5HpcQ
4. Homework #1
Slides based on notes by Calvin Lin and Lawrence Snyder and Pearson Pub.
Consider A Simple Task …
• Adding a sequence of numbers A[0],…,A[n-1]
– Standard way to express it:

sum = 0;
for (i=0; i<n; i++) {
  sum += A[i];
}
• Semantics require: (…((sum+A[0])+A[1])+…)+A[n-1]
– That is, sequential
• Can it be executed in parallel?
Parallel Summation
• To sum a sequence in parallel
– add pairs of values producing 1st-level results,
– add pairs of 1st-level results producing 2nd-level results,
– sum pairs of 2nd-level results …
• That is,
– (A[0]+A[1]) + (A[2]+A[3]) + … + (A[n-2]+A[n-1]), with the partial sums again grouped in pairs
Express the two formulations
• Same number of operations, different order
• Which version is more parallel?
• How to go from sequential to parallel?
The dream of automatic parallelization...
• Since the 70s (Illiac IV days) the dream has been to compile sequential programs into parallel object code
• More than three decades of continual, well-funded research suggest it's hopeless
– For a tight loop summing numbers, it's doable
– For other computations it has proved extremely challenging to generate parallel code, even with pragmas or other assistance from programmers
What’s the Problem?
• It’s not likely a compiler will produce parallel code from a C specification any time soon…
• Fact: For most computations, the “best” (practically, not theoretically) sequential solution and a “best” parallel solution are usually fundamentally different
• Different solution paradigms imply the computations are not “simply” related
• Compiler transformations generally preserve the solution paradigm
• Therefore… the programmer must discover the parallel (||) solution
A Related Computation
• Consider computing the prefix sums

for (i=1; i<n; i++) {
  A[i] += A[i-1];
}
• Semantics …
– A[0] is unchanged
– A[1] = A[1] + A[0]
– A[2] = A[2] + (A[1] + A[0])
– …
– A[n-1] = A[n-1] + (A[n-2] + ( … (A[1] + A[0]) … ))
A[i] is the sum of the first i + 1 elements
Comparison of two Approaches
• The sequential solution computes the prefixes … the parallel solution computes only the last…
Parallel Prefix Algorithm
R. E. Ladner and M. J. Fischer, “Parallel Prefix Computation”, Journal of the ACM, 1980
2 log n time
Applies to a wide class of operations
Definitions: reduction and scan
• Tree-like operation used for parallel sum is called Reduction
• Tree-like operation used for parallel-prefix is called Scan
• Reduction and scan apply to other operators: max, min, second largest, etc.
• When should we use these operators? Only when the input is large enough: due to communication overhead they can be slower than the sequential operation.
Parallel Compared to Sequential Programming
• Has different costs, different advantages
• Requires different, unfamiliar algorithms
• Must use different abstractions
• More complex to understand a program’s behavior
• More difficult to control the interactions of the program’s components
• Knowledge/tools/understanding more primitive
• NEXT: Illustrate the complexities of writing parallel programs
Consider a Simple Problem
• Count the 3s in array[] of length values
• Sequential program

count = 0;
for (i=0; i<length; i++) {
  if (array[i] == 3)
    count += 1;
}
Basic background on Thread parallel programming
• Thread is the unit of parallelism
• Each thread has its own PC and sequences through instructions independently of the others
• Has private registers and stack
• Shares text and global data
• Constructs to create, synchronize, and kill threads
•Each processor has a private L1 cache; it shares an L2 cache with its “chip-mate” and shares an L3 cache with the other processors.
•What is a possible bottleneck to this parallel system?
Multi-core computer system
Allocations are consecutive indices.
Data allocation/partitioning is a natural step in developing a parallel algorithm and program
Data allocation to threads
The first try at a Count 3s
• One of several possible interleavings of references to the unprotected variable count, illustrating a race condition
• This is not a coherence problem.
• Imagine debugging the above…
First Solution Incorrect: Race Condition
A high-level count++ in assembly =>

ld reg, [count]
++reg
store reg, [count]
• mutex protection for the count variable
• mutex: a construct that allows one thread at a time into the critical section. Is a mutex always good? No: it adds large overhead.
Second Solution: Avoid race condition protect shared variables
critical section
Performance of second Count 3s solution
Locks serialize execution. Can we avoid them?
IDEA: Do we really need to protect every count update?
• private_count array elements, one for each thread
• Still need a critical section, but only to combine each thread’s result. Much less frequent.
3rd Solution: private counts per thread
critical section
The algorithmically parallel solution still suffers from serialization…
Performance for third Count 3s solution
FALSE SHARING:
1. Granularity of coherence is the cache block
2. A block holds many variables that are independently and concurrently updated by different threads
System causes serialization: False Coherence Traffic
• Private count elements are padded to force them onto different cache lines
• Padding is platform dependent. Problem? Portability…
4th Solution: Per-thread counter in a distinct cache block
(code figure: the shared int count, and each thread’s private count padded so the ++ updates land in distinct cache lines)
Finally a performance improvement. But performance does not scale beyond 4???
Performance for fourth Count 3s solution
• Memory bandwidth limited
• How to validate? Measure performance with an array that does not contain any 3s: with no count updates at all, the results are the same as before, so memory bandwidth, not counting, is the bottleneck.
Analysis to determine source of serialization
Count 3s Summary
• The obvious “break into blocks” program
• Wrong answer: race condition
• Avoid the race condition by protecting the count variable
• We got the right answer but the program was slower … lock congestion
• Privatized counts: 1 thread was fast enough, 2 threads slow … false sharing
• Separated private variables onto their own cache lines
• Finally success.
• Analyze why performance does not scale
Parallel Programming Goals
Must be significantly faster than a single-thread implementation
• Goal: Correct and scalable programs with performance and portability
• Scalable: More processors can be “usefully” added to solve the problem faster
• Performance: Programs run as fast as those produced by experienced parallel programmers for the specific machine
• Portability: The solutions run well on all parallel platforms
• Minimize programming effort
Other non-obvious system causes of serialization
• Virtual-to-physical address translation can lead to cache conflicts (within and across threads)
• OS page-mapping algorithms
• What happens when there are more threads than processors?
– Hurts when compute bound: memory pressure, stressed memory (paging)
– Useful when I/O bound
• The same thread can have different performance depending on which threads it co-executes with
• How do we learn this?
Next
Parallel Architectures
• Hw#1
• Read about Illiac IV
• Questions for reading/material