Agenda
• Project discussion
• Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]
• Benchmarking guidelines
• Regular vs. irregular parallel applications
Last time: Amdahl’s law
Under what assumptions?
Speedup = 1 / ((1 − F) + F / N)
• Code is infinitely parallelizable
• No parallelization overheads
• No synchronization
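Under these assumptions the law reduces to a one-line function; a minimal sketch (variable names mine):

```python
# Amdahl's law: a fraction f of the work is perfectly parallelizable
# across n cores; the remaining 1 - f runs serially.
def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# Even 90%-parallel code tops out well below the core count:
print(amdahl_speedup(0.9, 256))   # ≈ 9.66, nowhere near 256
```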
Assuming multiple BCEs (base core equivalents). Q: how to design a multicore for maximum speedup?
• Assumed Perf(R) = √R for a single core built from R BCEs
• Two problems:
  – symmetric vs. asymmetric multicore chips
  – area allocations
• Sixteen 1-BCE cores (symmetric)
• Four 4-BCE cores (symmetric)
• One 16-BCE core (symmetric)
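The three symmetric configurations can be compared directly: with n BCEs split into n/r cores of r BCEs each, the serial phase runs on one core at Perf(r) and the parallel phase on all n/r cores. A sketch, assuming Perf(r) = √r and an illustrative 90%-parallel workload:

```python
import math

def perf(r):
    return math.sqrt(r)   # assumed performance of an r-BCE core

def symmetric_speedup(f, n, r):
    # n/r cores of r BCEs each: serial phase on one core at perf(r),
    # parallel phase on all n/r cores at aggregate rate perf(r) * n / r.
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n))

for r in (1, 4, 16):   # sixteen 1-BCE, four 4-BCE, one 16-BCE core
    print(f"r={r:2d}: speedup = {symmetric_speedup(0.9, 16, r):.2f}")
```

With f = 0.9 the sixteen small cores win (6.4×) over the single big core (4.0×); as the parallel fraction shrinks, the big core catches up.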
For Asymmetric Multicore Chips
• Serial fraction 1 – F: same as before, but runs on the big core, so serial time = (1 – F) / Perf(R)
• Parallel fraction F:
  – one core at rate Perf(R)
  – N – R cores at rate 1
  – parallel time = F / (Perf(R) + N – R)
• Therefore, w.r.t. one base core:
Asymmetric Speedup = 1 / ((1 − F) / Perf(R) + F / (Perf(R) + N − R))
[Figure: asymmetric speedup for 256 BCEs; configurations labeled 256, 253, 241, and 193 cores, and 1 core]
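The asymmetric formula drops straight into code; a sketch, again assuming Perf(R) = √R (the r = 64 example is illustrative):

```python
import math

def perf(r):
    return math.sqrt(r)   # assumed performance of an r-BCE core

def asymmetric_speedup(f, n, r):
    # One big r-BCE core plus n - r small 1-BCE cores.
    # Serial phase: big core alone. Parallel phase: all cores together.
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r))

# 256 BCEs, 99% parallel, one 64-BCE core + 192 small cores:
print(asymmetric_speedup(0.99, 256, 64))
```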
Amdahl assumptions:
• Code is infinitely parallelizable
• No parallelization overheads
• No synchronization
  – Now add synchronization, entered at random times (?!)
Without critical sections: fseq + fpar = 1
With critical sections:    fseq + fpar,ncs + fpar,cs = 1

Average time in critical sections (the paper also derives an estimate for the maximum time in critical sections). Normalized execution time on N cores:

T(N) = fseq + fpar,cs · Pcs · Pctn + fpar,cs · (1 − Pcs · Pctn) / N + fpar,ncs / N

(sequential code; serialized contended critical sections; uncontended critical sections, parallelized; non-critical parallel code)
Speedup for an asymmetric processor as a function of the big-core size (b) and small-core size (s) for different contention rates, assuming 256 BCEs. Fraction of time spent in sequential code: 1%.
Design space exploration across symmetric, asymmetric, and ACS multicore processors, varying the fraction of time spent in critical sections and their contention rates. Fraction of time spent in sequential code: 1%. (ACS = accelerated critical sections.)
Agenda
• Project discussion
• Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]
• 12 ways to fool the masses
• Regular vs. irregular parallel applications
If you were plowing a field, which would you rather use: two oxen, or 1024 chickens?
(Attributed to S. Cray)
David H. Bailey, "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers", Supercomputing Review, August 1991.

1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimized code on Crays.
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance.
Roadmap
• Project discussion
• Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]
• 12 ways to fool the masses
• Regular vs. irregular parallel applications
Definitions
• Regular applications
  – key data structures:
    • vectors
    • dense matrices
  – simple access patterns
    • (e.g.) array indices are affine functions of for-loop indices
  – examples: MMM, Cholesky & LU factorizations, stencil codes, FFT, …
• Irregular applications
  – key data structures:
    • lists, priority queues
    • trees, DAGs, graphs
    • usually implemented using pointers or references
  – complex access patterns
  – examples: see next slide
Regular application example: stencil computation
• (e.g.) Finite-difference method for solving PDEs
  – discrete representation of domain: grid
• Values at interior points are updated using values at neighbors
  – values at boundary points are fixed
• Data structure:
  – dense arrays
• Parallelism:
  – values at the next time step can be computed simultaneously
  – parallelism is not dependent on runtime values
• Compiler can find the parallelism
  – spatial loops are DO-ALL loops
// Jacobi iteration with 5-point stencil
// initialize array A
for time = 1, nsteps
  for <i,j> in [2,n-1] x [2,n-1]
    temp(i,j) = 0.25 * (A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1))
  for <i,j> in [2,n-1] x [2,n-1]
    A(i,j) = temp(i,j)
[Figure: Jacobi iteration, 5-point stencil; array A at time tn, temp at time tn+1]
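The pseudocode above translates almost line-for-line into NumPy; a minimal runnable sketch (grid size and boundary condition are illustrative):

```python
import numpy as np

def jacobi(a, nsteps):
    # 5-point stencil: every interior point becomes the average of its
    # four neighbors; boundary rows/columns stay fixed.
    a = a.copy()
    for _ in range(nsteps):
        temp = 0.25 * (a[:-2, 1:-1] + a[2:, 1:-1] + a[1:-1, :-2] + a[1:-1, 2:])
        a[1:-1, 1:-1] = temp   # all interior updates are independent (DO-ALL)
    return a

grid = np.zeros((6, 6))
grid[0, :] = 1.0               # fixed boundary value along one edge
result = jacobi(grid, 100)
```

Each temp is computed entirely from the previous a, so every interior point could be updated in parallel; that independence is exactly why the spatial loops are DO-ALL.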
Delaunay Mesh Refinement
• Iterative refinement to remove badly shaped triangles:

  while there are bad triangles do {
    pick a bad triangle;
    find its cavity;
    retriangulate cavity;  // may create new bad triangles
  }
• Don't-care non-determinism:
  – final mesh depends on the order in which bad triangles are processed
  – applications do not care which mesh is produced
• Data structure:
  – graph in which nodes represent triangles and edges represent triangle adjacencies
• Parallelism:
  – bad triangles with cavities that do not overlap can be processed in parallel
  – parallelism is dependent on runtime values
    • compilers cannot find this parallelism
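The refinement pattern above is a worklist loop: pick any bad element, fix it, and re-add whatever new bad work the fix creates. A toy sketch of that pattern (the "triangles" here are just numbers, and the hypothetical badness/split rules stand in for cavity retriangulation; this is not real mesh refinement):

```python
def refine(values):
    # Worklist loop: the pop order is arbitrary (don't-care non-determinism);
    # any processing order yields an acceptable final result.
    work = list(values)
    done = []
    while work:
        x = work.pop()
        if x > 1.0:                     # "bad triangle" (hypothetical rule)
            work += [x / 2.0, x / 2.0]  # "retriangulate": may create new bad work
        else:
            done.append(x)
    return done

out = refine([5.0, 0.5])
```

A parallel implementation lets multiple workers pop non-conflicting items at once; because whether two cavities overlap depends on runtime values, a compiler cannot discover this parallelism statically.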