Agenda
• Project discussion
• Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]
• Benchmarking guidelines
• Regular vs. irregular parallel applications
Last time: Amdahl’s law
Under what assumptions?
Speedup = 1 / ((1 − F) + F / N)
• Code is infinitely parallelizable
• No parallelization overheads
• No synchronization
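Under these assumptions the law reduces to a one-line function; a minimal sketch (variable names mine):

```python
# Amdahl's law: a fraction f of the work is perfectly parallelizable
# across n cores; the remaining 1 - f runs serially.
def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# Even 90%-parallel code tops out well below the core count:
print(amdahl_speedup(0.9, 256))   # ≈ 9.66, nowhere near 256
```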
Assuming multiple BCEs (base core equivalents). Q: how to design a multicore for maximum speedup?
• Assumed Perf(R) = √R for a single core built from R BCEs
• Two problems:
  – symmetric vs. asymmetric multicore chips
  – area allocations
• Sixteen 1-BCE cores (symmetric)
• Four 4-BCE cores (symmetric)
• One 16-BCE core (symmetric)
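The three symmetric configurations can be compared directly: with n BCEs split into n/r cores of r BCEs each, the serial phase runs on one core at Perf(r) and the parallel phase on all n/r cores. A sketch, assuming Perf(r) = √r and an illustrative 90%-parallel workload:

```python
import math

def perf(r):
    return math.sqrt(r)   # assumed performance of an r-BCE core

def symmetric_speedup(f, n, r):
    # n/r cores of r BCEs each: serial phase on one core at perf(r),
    # parallel phase on all n/r cores at aggregate rate perf(r) * n / r.
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n))

for r in (1, 4, 16):   # sixteen 1-BCE, four 4-BCE, one 16-BCE core
    print(f"r={r:2d}: speedup = {symmetric_speedup(0.9, 16, r):.2f}")
```

With f = 0.9 the sixteen small cores win (6.4×) over the single big core (4.0×); as the parallel fraction shrinks, the big core catches up.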
For Asymmetric Multicore Chips
• Serial fraction 1 – F: same as before, but runs on the big core, so serial time = (1 – F) / Perf(R)
• Parallel fraction F:
  – one core at rate Perf(R)
  – N – R cores at rate 1
  – parallel time = F / (Perf(R) + N – R)
• Therefore, w.r.t. one base core:
Asymmetric Speedup = 1 / ((1 − F) / Perf(R) + F / (Perf(R) + N − R))
[Figure: asymmetric speedup for 256 BCEs; configurations labeled 256, 253, 241, and 193 cores, and 1 core]
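The asymmetric formula drops straight into code; a sketch, again assuming Perf(R) = √R (the r = 64 example is illustrative):

```python
import math

def perf(r):
    return math.sqrt(r)   # assumed performance of an r-BCE core

def asymmetric_speedup(f, n, r):
    # One big r-BCE core plus n - r small 1-BCE cores.
    # Serial phase: big core alone. Parallel phase: all cores together.
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r))

# 256 BCEs, 99% parallel, one 64-BCE core + 192 small cores:
print(asymmetric_speedup(0.99, 256, 64))
```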
Amdahl assumptions:
• Code is infinitely parallelizable
• No parallelization overheads
• No synchronization
  – Now add synchronization, entered at random times (?!)
Without critical sections: fseq + fpar = 1
With critical sections:    fseq + fpar,ncs + fpar,cs = 1

Average time in critical sections (the paper also derives an estimate for the maximum time in critical sections). Normalized execution time on N cores:

T(N) = fseq + fpar,cs · Pcs · Pctn + fpar,cs · (1 − Pcs · Pctn) / N + fpar,ncs / N

(sequential code; serialized contended critical sections; uncontended critical sections, parallelized; non-critical parallel code)
Speedup for an asymmetric processor as a function of the big-core size (b) and small-core size (s) for different contention rates, assuming 256 BCEs. Fraction of time spent in sequential code: 1%.
Design space exploration across symmetric, asymmetric, and ACS multicore processors, varying the fraction of time spent in critical sections and their contention rates. Fraction of time spent in sequential code: 1%. (ACS = accelerated critical sections.)
Agenda
• Project discussion
• Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]
• 12 ways to fool the masses
• Regular vs. irregular parallel applications
If you were plowing a field, which would you rather use: two oxen, or 1024 chickens?
(Attributed to S. Cray)
David H. Bailey, "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers", Supercomputing Review, August 1991.

1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimized code on Crays.
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance.
Roadmap
• Project discussion
• Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]
• 12 ways to fool the masses
• Regular vs. irregular parallel applications
Definitions
• Regular applications
  – key data structures:
    • vectors
    • dense matrices
  – simple access patterns
    • (e.g.) array indices are affine functions of for-loop indices
  – examples: MMM, Cholesky & LU factorizations, stencil codes, FFT, …
• Irregular applications
  – key data structures:
    • lists, priority queues
    • trees, DAGs, graphs
    • usually implemented using pointers or references
  – complex access patterns
  – examples: see next slide
Regular application example: stencil computation
• (e.g.) Finite-difference method for solving PDEs
  – discrete representation of domain: grid
• Values at interior points are updated using values at neighbors
  – values at boundary points are fixed
• Data structure:
  – dense arrays
• Parallelism:
  – values at the next time step can be computed simultaneously
  – parallelism is not dependent on runtime values
• Compiler can find the parallelism
  – spatial loops are DO-ALL loops
// Jacobi iteration with 5-point stencil
// initialize array A
for time = 1, nsteps
  for <i,j> in [2,n-1] x [2,n-1]
    temp(i,j) = 0.25 * (A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1))
  for <i,j> in [2,n-1] x [2,n-1]
    A(i,j) = temp(i,j)
[Figure: Jacobi iteration, 5-point stencil; array A at time tn, temp at time tn+1]
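The pseudocode above translates almost line-for-line into NumPy; a minimal runnable sketch (grid size and boundary condition are illustrative):

```python
import numpy as np

def jacobi(a, nsteps):
    # 5-point stencil: every interior point becomes the average of its
    # four neighbors; boundary rows/columns stay fixed.
    a = a.copy()
    for _ in range(nsteps):
        temp = 0.25 * (a[:-2, 1:-1] + a[2:, 1:-1] + a[1:-1, :-2] + a[1:-1, 2:])
        a[1:-1, 1:-1] = temp   # all interior updates are independent (DO-ALL)
    return a

grid = np.zeros((6, 6))
grid[0, :] = 1.0               # fixed boundary value along one edge
result = jacobi(grid, 100)
```

Each temp is computed entirely from the previous a, so every interior point could be updated in parallel; that independence is exactly why the spatial loops are DO-ALL.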
Delaunay Mesh Refinement
• Iterative refinement to remove badly shaped triangles:

  while there are bad triangles do {
    pick a bad triangle;
    find its cavity;
    retriangulate cavity;  // may create new bad triangles
  }
• Don't-care non-determinism:
  – final mesh depends on the order in which bad triangles are processed
  – applications do not care which mesh is produced
• Data structure:
  – graph in which nodes represent triangles and edges represent triangle adjacencies
• Parallelism:
  – bad triangles with cavities that do not overlap can be processed in parallel
  – parallelism is dependent on runtime values
    • compilers cannot find this parallelism
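The refinement pattern above is a worklist loop: pick any bad element, fix it, and re-add whatever new bad work the fix creates. A toy sketch of that pattern (the "triangles" here are just numbers, and the hypothetical badness/split rules stand in for cavity retriangulation; this is not real mesh refinement):

```python
def refine(values):
    # Worklist loop: the pop order is arbitrary (don't-care non-determinism);
    # any processing order yields an acceptable final result.
    work = list(values)
    done = []
    while work:
        x = work.pop()
        if x > 1.0:                     # "bad triangle" (hypothetical rule)
            work += [x / 2.0, x / 2.0]  # "retriangulate": may create new bad work
        else:
            done.append(x)
    return done

out = refine([5.0, 0.5])
```

A parallel implementation lets multiple workers pop non-conflicting items at once; because whether two cavities overlap depends on runtime values, a compiler cannot discover this parallelism statically.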