ΕΠΛ372 Parallel Processing
Introduction: Parallel Processing
Yiannos Sazeides, Spring Semester 2014
READING
1. www.pearsonhighered.com/assets/hip/us/hip_us_pearsonhighered/samplechapter/0321487907.pdf
2. Illiac IV: http://www.cs.auckland.ac.nz/courses/compsci703s1c/resources/Bouknight-ILIAC-IV.pdf
3. Parallel Computing Landscape: http://www.youtube.com/watch?v=On-k-E5HpcQ
4. Homework #1
Slides based on notes by Calvin Lin and Lawrence Snyder and Pearson Pub.
Consider A Simple Task …
• Adding a sequence of numbers A[0],…,A[n-1]
– Standard way to express it:

sum = 0;
for (i=0; i<n; i++) {
  sum += A[i];
}
• Semantics require: (…((sum+A[0])+A[1])+…)+A[n-1]
– That is, sequential
• Can it be executed in parallel?
Parallel Summation
• To sum a sequence in parallel
– add pairs of values producing 1st-level results,
– add pairs of 1st-level results producing 2nd-level results,
– sum pairs of 2nd-level results …
• That is,
– (A[0]+A[1]) + (A[2]+A[3]) + … + (A[n-2]+A[n-1]), with the partial sums again grouped in pairs
Express the two formulations
• Same number of operations, different order
• Which version is more parallel?
• How to go from sequential to parallel?
The dream of automatic parallelization...
• Since the 70s (Illiac IV days) the dream has been to compile sequential programs into parallel object code
• More than three decades of continual, well-funded research suggest it's hopeless
– For a tight loop summing numbers, it's doable
– For other computations it has proved extremely challenging to generate parallel code, even with pragmas or other assistance from programmers
What’s the Problem?
• It’s not likely a compiler will produce parallel code from a C specification any time soon…
• Fact: For most computations, the “best” (practically, not theoretically) sequential solution and a “best” parallel solution are usually fundamentally different
• Different solution paradigms imply the computations are not “simply” related
• Compiler transformations generally preserve the solution paradigm
• Therefore… the programmer must discover the parallel (||) solution
A Related Computation
• Consider computing the prefix sums

for (i=1; i<n; i++) {
  A[i] += A[i-1];
}
• Semantics …
– A[0] is unchanged
– A[1] = A[1] + A[0]
– A[2] = A[2] + (A[1] + A[0])
– …
– A[n-1] = A[n-1] + (A[n-2] + ( … (A[1] + A[0]) … ))
A[i] is the sum of the first i + 1 elements
Comparison of two Approaches
• The sequential solution computes the prefixes … the parallel solution computes only the last…
Parallel Prefix Algorithm
R. E. Ladner and M. J. Fischer, “Parallel Prefix Computation”, Journal of the ACM, 1980
2 log n time
Applies to a wide class of operations
Definitions: reduction and scan
• Tree-like operation used for parallel sum is called Reduction
• Tree-like operation used for parallel-prefix is called Scan
• Reduction and scan apply to other operators: max, min, second largest, etc.
• When should we use these operators? Only when the input is large enough: due to communication overhead they can be slower than the sequential operation.
Parallel Compared to Sequential Programming
• Has different costs, different advantages
• Requires different, unfamiliar algorithms
• Must use different abstractions
• More complex to understand a program’s behavior
• More difficult to control the interactions of the program’s components
• Knowledge/tools/understanding more primitive
• NEXT: Illustrate the complexities of writing parallel programs
Consider a Simple Problem
• Count the 3s in array[] of length values
• Sequential program

count = 0;
for (i=0; i<length; i++) {
  if (array[i] == 3)
    count += 1;
}
Basic background on Thread parallel programming
• Thread is the unit of parallelism
• Each thread has its own PC and sequences through instructions independently of the others
• Has private registers and stack
• Shares text and global data
• Constructs to create, synchronize, and kill threads
•Each processor has a private L1 cache; it shares an L2 cache with its “chip-mate” and shares an L3 cache with the other processors.
•What is a possible bottleneck to this parallel system?
Multi-core computer system
Allocations are consecutive indices.
Data allocation/partitioning is a natural step in developing a parallel algorithm and program
Data allocation to threads
The first try at a Count 3s
• One of several possible interleavings of references to the unprotected variable count, illustrating a race condition
• This is not a coherence problem.
• Imagine debugging the above…
First Solution Incorrect: Race Condition
A high-level count++ in assembly =>

ld reg, [count]
++reg
store reg, [count]
• mutex protection for the count variable
• mutex: a construct that allows one thread at a time into the critical section. Is a mutex always good? No: it adds large overhead.
Second Solution: Avoid race condition protect shared variables
critical section
Performance of second Count 3s solution
Locks serialize execution. Can we avoid them?
IDEA: Do we really need to protect every count update?
• private_count array elements, one for each thread
• Still need a critical section, but only to combine each thread’s result. Much less frequent.
3rd Solution: private counts per thread
critical section
The algorithmically parallel solution still suffers from serialization…
Performance for third Count 3s solution
FALSE SHARING:
1. Granularity of coherence is the cache block
2. A block holds many variables that are independently and concurrently updated by different threads
System causes serialization: False Coherence Traffic
• Private count elements are padded to force them onto different cache lines
• Padding is platform dependent. Problem? Portability…
4th Solution: Per-thread counter in a distinct cache block
(code figure: the shared int count, and each thread’s private count padded so the ++ updates land in distinct cache lines)
Finally a performance improvement. But performance does not scale beyond 4???
Performance for fourth Count 3s solution
• Memory bandwidth limited
• How to validate? Measure performance with an array that does not contain any 3s: with no count updates at all, the results are the same as before, so memory bandwidth, not counting, is the bottleneck.
Analysis to determine source of serialization
Count 3s Summary
• The obvious “break into blocks” program
• Wrong answer: race condition
• Avoid the race condition by protecting the count variable
• We got the right answer but the program was slower … lock congestion
• Privatized counts: 1 thread was fast enough, 2 threads slow … false sharing
• Separated private variables onto their own cache lines
• Finally success.
• Analyze why performance does not scale
Parallel Programming Goals
Must be significantly faster than a single-thread implementation
• Goal: Correct and scalable programs with performance and portability
• Scalable: More processors can be “usefully” added to solve the problem faster
• Performance: Programs run as fast as those produced by experienced parallel programmers for the specific machine
• Portability: The solutions run well on all parallel platforms
• Minimize programming effort
Other non-obvious system causes of serialization
• Virtual-to-physical address translation can lead to cache conflicts (within and across threads)
• OS page-mapping algorithms
• What happens when there are more threads than processors?
– Hurts when compute bound: memory pressure, stressed memory (paging)
– Useful when I/O bound
• The same thread can have different performance depending on which threads it co-executes with
• How do we learn this?
Next
Parallel Architectures
• Hw#1
• Read about Illiac IV
• Questions for reading/material