(How) Can Programmers Conquer the
Multicore Menace?
Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo
• Algorithmic Choices via PetaBricks
• Conquering the Multicore Menace
Today: The Happily Oblivious Average Joe Programmer
• Joe is oblivious about the processor
– Moore's law brings Joe performance
– Sufficient for Joe's requirements
• Joe has built a solid boundary between hardware and software
– High-level languages abstract away the processors
– Ex: Java bytecode is machine independent
• This abstraction has provided a lot of freedom for Joe
• Parallel programming is only practiced by a few experts
Moore's Law
[Chart: uniprocessor performance (SPECint, relative to VAX-11/780), 1978–2016, for the 8086 through Itanium 2; growth of 25%/year, then 52%/year, then ??%/year, alongside transistor counts rising from 10,000 to 1,000,000,000. From David Patterson; transistor data from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
Squandering of Moore's Dividend
• 10,000x performance gain in 30 years! (~46% per year)
• Where did this performance go?
• For the last decade we concentrated on correctness and programmer productivity
• Little to no emphasis on performance
• This is reflected in:
– Languages
– Tools
– Research
– Education
• Software engineering: the only engineering discipline where performance or efficiency is not a central theme
Matrix Multiply: An Example of Unchecked Excesses
• Abstraction and Software Engineering
– Immutable types
– Dynamic dispatch
– Object oriented
• High-Level Languages
• Memory Management
– Transpose for unit stride
– Tile for cache locality
• Vectorization
• Prefetching
• Parallelization
[Chart: cumulative speedups as each excess is removed; labels: 220x, 522x, 1,117x, 2,271x, 7,514x, 12,316x, 33,453x, 87,042x, 296,260x]
Matrix Multiply: An Example of Unchecked Excesses
• Typical Software Engineering Approach
– In Java
– Object oriented
– Immutable
– Abstract types
– No memory optimizations
– No parallelization
• Good Performance Engineering Approach
– In C/Assembly
– Memory optimized (blocked)
– BLAS libraries
– Parallelized (to 4 cores)
• In comparison: lowest to highest MPG in transportation
[Chart labels: 14,700x, 296,260x, 294,000x]
Joe the Parallel Programmer
• Moore's law is no longer bringing performance gains
• If Joe needs performance he has to deal with multicores
– Joe has to deal with performance
– Joe has to deal with parallelism
[Photo: Joe]
Why Parallelism is Hard
• A huge increase in complexity and work for the programmer
– Programmer has to think about performance!
– Parallelism has to be designed in at every level
• Programmers are trained to think sequentially
– Deconstructing problems into parallel tasks is hard for many of us
• Parallelism is not easy to implement
– Parallelism cannot be abstracted or layered away
– Code and data have to be restructured in very different (non-intuitive) ways
• Parallel programs are very hard to debug
– Combinatorial explosion of possible execution orderings
– Race condition and deadlock bugs are non-deterministic and elusive
– Non-deterministic bugs go away in the lab environment and with instrumentation
Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo– Joint work with Marek Olszewski and Jason Ansel
• Algorithmic Choices via PetaBricks
• Conquering the Multicore Menace
Racing for Lock Acquisition
• Two threads
– Start at the same time
– 1st thread: 1000 instructions to the lock acquisition
– 2nd thread: 1100 instructions to the lock acquisition
[Chart: instruction number vs. time for the two threads]
Non-Determinism
• Inherent in parallel applications
– Accesses to shared data can experience many possible interleavings
– New! This was not the case for sequential applications
– Almost never part of program specifications
– Even the simplest parallel programs, e.g. a work queue, are non-deterministic
• Non-determinism is undesirable
– Hard to create programs with repeatable results
– Difficult to perform cyclic debugging
– Testing offers weaker guarantees
Deterministic Multithreading
• Observation:
– Non-determinism need not be a required property of threads
– We can interleave thread communication in a deterministic manner
– Call this Deterministic Multithreading
• Deterministic multithreading:
– Makes debugging easier
– Tests offer guarantees again
– Supports existing programming models/languages
– Allows programmers to "determinize" computations that have previously been difficult to determinize using today's programming idioms
– e.g.: Radiosity (Singh et al. 1994), LocusRoute (Rose 1988), and Delaunay Triangulation (Kulkarni et al. 2008)
Deterministic Multithreading
• Strong Determinism
– Deterministic interleaving for all accesses to shared data for a given input
– Attractive, but difficult to achieve efficiently without hardware support
• Weak Determinism
– Deterministic interleaving of all lock acquisitions for a given input
– Cheaper to enforce
– Offers the same guarantees as strong determinism for data-race-free program executions
– Can be checked with a dynamic race detector!
Kendo
• A Prototype Deterministic Locking Framework
– Provides weak determinism for C and C++ code
– Runs on commodity hardware today!
– Implements a subset of the pthreads API
– Enforces determinism without sacrificing load balance
– Tracks progress of threads to dynamically construct the deterministic interleaving: Deterministic Logical Time
– Incurs low performance overhead (16% geomean on Splash2)
Deterministic Logical Time
• Abstract counterpart to physical time
– Used to deterministically order events on an SMP machine
– Necessary to construct the deterministic interleaving
• Represented as P independently updated deterministic logical clocks
– Not updated based on the progress of other threads (unlike Lamport clocks)
– Event1 (on Thread 1) occurs before Event2 (on Thread 2) in deterministic logical time if Thread 1 has a lower deterministic logical clock than Thread 2 at the time of the events
Deterministic Logical Clocks
• Requirements
– Must be based on events that are deterministically reproducible from run to run
– Must track the progress of threads in physical time as closely as possible (for better load balancing of the deterministic interleaving)
– Must be cheap to compute
– Must be portable across micro-architectures
– Must be stored in memory for other threads to observe
Deterministic Logical Clocks
• Some x86 performance counter events satisfy many of these requirements
– Chose the "Retired Store Instructions" event
• Required changes to the Linux kernel
– Performance counters are accessible only at kernel level
– Added an interrupt service routine that increments each thread's deterministic logical clock (in memory) on every performance counter overflow
– Frequency of overflows can be controlled
Locking Algorithm
• Construct a deterministic interleaving of lock acquires from deterministic logical clocks
– Simulate the interleaving that would occur if running in deterministic logical time
• Uses the concept of a turn
– It is a thread's turn when:
– All threads with smaller IDs have greater deterministic logical clocks
– All threads with larger IDs have greater or equal deterministic logical clocks
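The turn rule above can be written as a small predicate. This is an illustrative sketch, not Kendo's actual code; the helper name and the clock values are hypothetical:

```python
def is_turn(tid, clocks):
    """True when it is thread tid's turn under the turn rule:
    every thread with a smaller ID has a strictly greater clock, and
    every thread with a larger ID has a greater-or-equal clock.
    Ties are thereby broken in favor of the smaller thread ID."""
    return (all(clocks[j] > clocks[tid] for j in range(tid)) and
            all(clocks[j] >= clocks[tid] for j in range(tid + 1, len(clocks))))
```

With clocks of 25 and 22 (as in the example slides that follow), it is the second thread's turn, because the smaller-ID thread has the greater clock; on a tie, the smaller ID wins.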
Locking Algorithm
function det_mutex_lock(l) {
  pause_logical_clock();
  wait_for_turn();
  lock(l);
  inc_logical_clock();
  enable_logical_clock();
}

function det_mutex_unlock(l) {
  unlock(l);
}
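A minimal single-process sketch of how det_mutex_lock serializes contended acquisitions. It assumes every thread is requesting the lock and that inc_logical_clock advances the winner's clock by one; the function name is hypothetical:

```python
def det_acquire_order(clocks):
    """Return the order in which threads acquire a contended lock under
    the turn rule: the thread with the lowest (clock, ID) pair goes next,
    and each winner increments its logical clock after acquiring."""
    clocks = list(clocks)
    waiting = set(range(len(clocks)))
    order = []
    while waiting:
        tid = min(waiting, key=lambda t: (clocks[t], t))  # whose turn it is
        order.append(tid)
        clocks[tid] += 1  # inc_logical_clock() by the lock holder
        waiting.remove(tid)
    return order
```

For the clocks in the example slides (25 and 22), the second thread acquires the lock first regardless of physical timing; with clocks 25 and 26, the first thread wins.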
Example
[Diagram sequence: two threads plotted in deterministic logical time against physical time.
• Thread 1 reaches det_lock(a) with clock t=25 and calls wait_for_turn(); Thread 2 is still running at t=18.
• Thread 2 reaches det_lock(a) at t=22; having the smaller clock, it is Thread 2's turn, so it calls lock() and acquires a. Thread 2 will always acquire the lock first, regardless of physical timing.
• In a second scenario, Thread 2 reaches det_lock(a) at t=26; now Thread 1 (t=25) has the smaller clock, so it acquires a, advances its clock to t=28, and releases it with det_unlock(a); Thread 2 (by then at t=32) acquires the lock next.]
Locking Algorithm Improvements
• Eliminate deadlocks in nested locks
– Make the thread increment its deterministic logical clock while it spins on the lock
– Must do so deterministically
• Queuing for fairness
• Lock priority boosting
• See the ASPLOS '09 paper on Kendo for details
Evaluation
• Methodology
– Converted the Splash2 benchmark suite to use the Kendo framework
– Eliminated data races
– Checked determinism by examining output and the final deterministic logical clocks of each thread
• Experimental Framework
– Processor: Intel Core2 Quad-core running at 2.66GHz
– OS: Linux 2.6.23 (modified for performance counter support)
Results
[Chart: execution time relative to non-deterministic execution (broken into application time, interrupt overhead, and deterministic wait overhead) for the Splash2 benchmarks tsp, quicksort, ocean, barnes, radiosity, raytrace, fmm, volrend, water-nsqrd, and their mean; y-axis 0 to 1.6]
Effect of Interrupt Frequency
[Chart: execution time relative to non-deterministic execution vs. interrupt period (64 to 16K), broken into application time, interrupt overhead, and deterministic wait overhead; y-axis 0 to 5]
Related Work
• DMP – Deterministic Multiprocessing
– Hardware design that provides strong determinism
• StreamIt Language
– Streaming programming model only allows one interleaving of inter-thread communication
• Cilk Language
– Fork/join programming model that can produce programs whose semantics always match a deterministic "serialization" of the code
– Cannot be used with locks
– Must be data-race free (can be checked with a Cilk race detector)
Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo
• Algorithmic Choices via PetaBricks– Joint work with Jason Ansel, Cy Chan, Yee Lok Wong,
Qin Zhao, and Alan Edelman
• Conquering the Multicore Menace
Observation 1: Algorithmic Choice
• For many problems there are multiple algorithms
– In most cases there is no single winner
– An algorithm will be the best performing for a given:
– Input size
– Amount of parallelism
– Communication bandwidth / synchronization cost
– Data layout
– Data itself (sparse data, convergence criteria, etc.)
• Multicores expose many of these to the programmer
– Exponential growth of cores (impact of Moore's law)
– Wide variation of memory systems, types of cores, etc.
• No single algorithm can be the best for all cases
Observation 2: Natural Parallelism
• The world is a parallel place
– It is natural to many, e.g. mathematicians
– ∑, sets, simultaneous equations, etc.
• It seems that computer scientists have a hard time thinking in parallel
– We have unnecessarily imposed sequential ordering on the world
– Statements executed in sequence
– for i = 1 to n
– Recursive decomposition (given f(n), find f(n+1))
• This was useful at one time to limit complexity… but it is a big problem in the era of multicores
Observation 3: Autotuning
• In the good old days: model-based optimization
• Now
– Machines are too complex to model accurately
– Compiler passes have many subtle interactions
– Thousands of knobs and billions of choices
• But…
– Computers are cheap
– We can do end-to-end execution of multiple runs
– Then use machine learning to find the best choice
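The "end-to-end execution of multiple runs" idea can be sketched as a toy autotuner that times each candidate and keeps the fastest. All names here are hypothetical, and a real tuner (like the hybrid genetic tuner used later) searches a far larger choice space:

```python
import time

def _timed_run(func, data):
    """Wall-clock one end-to-end execution of a candidate."""
    start = time.perf_counter()
    func(data)
    return time.perf_counter() - start

def autotune(candidates, make_input, trials=3):
    """Run each candidate on representative input and return the fastest,
    taking the best of several trials to reduce timing noise."""
    best, best_time = None, float("inf")
    for func in candidates:
        elapsed = min(_timed_run(func, make_input()) for _ in range(trials))
        if elapsed < best_time:
            best, best_time = func, elapsed
    return best

def fast_sort(xs):
    return sorted(xs)

def slow_sort(xs):
    time.sleep(0.01)  # deliberately handicapped stand-in for a slow choice
    return sorted(xs)

winner = autotune([slow_sort, fast_sort], lambda: list(range(1000, 0, -1)))
```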
PetaBricks Language
transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
// Base case, compute a single element
to(AB.cell(x,y) out)
from(A.row(y) a, B.column(x) b) {
out = dot(a, b);
}
}
• Implicitly parallel description
[Diagram: A is c wide and h tall, B is w wide and c tall, AB is w wide and h tall; row y of A and column x of B produce cell (x,y) of AB]
PetaBricks Language
transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
// Base case, compute a single element
to(AB.cell(x,y) out)
from(A.row(y) a, B.column(x) b) {
out = dot(a, b);
}
// Recursively decompose in c
to(AB ab)
from(A.region(0, 0, c/2, h ) a1,
A.region(c/2, 0, c, h ) a2,
B.region(0, 0, w, c/2) b1,
B.region(0, c/2, w, c ) b2) {
ab = MatrixAdd(MatrixMultiply(a1, b1),
MatrixMultiply(a2, b2));
}
• Implicitly parallel description
• Algorithmic choice
[Diagram: decomposition in c: A split into a1 and a2, B split into b1 and b2]
PetaBricks Language
transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
// Base case, compute a single element
to(AB.cell(x,y) out)
from(A.row(y) a, B.column(x) b) {
out = dot(a, b);
}
// Recursively decompose in c
to(AB ab)
from(A.region(0, 0, c/2, h ) a1,
A.region(c/2, 0, c, h ) a2,
B.region(0, 0, w, c/2) b1,
B.region(0, c/2, w, c ) b2) {
ab = MatrixAdd(MatrixMultiply(a1, b1),
MatrixMultiply(a2, b2));
}
// Recursively decompose in w
to(AB.region(0, 0, w/2, h ) ab1,
AB.region(w/2, 0, w, h ) ab2)
from( A a,
B.region(0, 0, w/2, c ) b1,
B.region(w/2, 0, w, c ) b2) {
ab1 = MatrixMultiply(a, b1);
ab2 = MatrixMultiply(a, b2);
}
[Diagram: decomposition in w: B split into b1 and b2, AB split into ab1 and ab2]
PetaBricks Language
transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
// Base case, compute a single element
to(AB.cell(x,y) out)
from(A.row(y) a, B.column(x) b) {
out = dot(a, b);
}
// Recursively decompose in c
to(AB ab)
from(A.region(0, 0, c/2, h ) a1,
A.region(c/2, 0, c, h ) a2,
B.region(0, 0, w, c/2) b1,
B.region(0, c/2, w, c ) b2) {
ab = MatrixAdd(MatrixMultiply(a1, b1),
MatrixMultiply(a2, b2));
}
// Recursively decompose in w
to(AB.region(0, 0, w/2, h ) ab1,
AB.region(w/2, 0, w, h ) ab2)
from( A a,
B.region(0, 0, w/2, c ) b1,
B.region(w/2, 0, w, c ) b2) {
ab1 = MatrixMultiply(a, b1);
ab2 = MatrixMultiply(a, b2);
}
// Recursively decompose in h
to(AB.region(0, 0, w, h/2) ab1,
AB.region(0, h/2, w, h ) ab2)
from(A.region(0, 0, c, h/2) a1,
A.region(0, h/2, c, h ) a2,
B b) {
ab1=MatrixMultiply(a1, b);
ab2=MatrixMultiply(a2, b);
}
}
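All four rules compute the same product, which is exactly what gives the compiler its choice. A plain-Python sketch (list-of-lists matrices; the helper names are hypothetical, not PetaBricks code) checks that each decomposition agrees with the base case:

```python
def mm(A, B):
    """Base case: AB[y][x] = dot(row y of A, column x of B)."""
    h, c, w = len(A), len(B), len(B[0])
    return [[sum(A[y][k] * B[k][x] for k in range(c)) for x in range(w)]
            for y in range(h)]

def madd(X, Y):
    """Element-wise matrix addition."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mm_split_c(A, B):
    """Decompose in c: AB = A1*B1 + A2*B2."""
    half = len(B) // 2
    A1 = [row[:half] for row in A]
    A2 = [row[half:] for row in A]
    return madd(mm(A1, B[:half]), mm(A2, B[half:]))

def mm_split_w(A, B):
    """Decompose in w: AB = [A*B1 | A*B2], joined side by side."""
    half = len(B[0]) // 2
    B1 = [row[:half] for row in B]
    B2 = [row[half:] for row in B]
    return [l + r for l, r in zip(mm(A, B1), mm(A, B2))]

def mm_split_h(A, B):
    """Decompose in h: AB = [A1*B ; A2*B], stacked vertically."""
    half = len(A) // 2
    return mm(A[:half], B) + mm(A[half:], B)
```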
PetaBricks Compiler Internals
ChoiceGridChoiceGridRule/TransformHeaders
Ru
le B
od
ies
Code Generation
Compiler Passes
C++
ChoiceGridChoice
Dependency Graph
ChoiceGridChoiceGridRule Body IR
CompilerPasses
Compiler Passes
Seq
uen
tialL
eaf Co
de
Parallel D
ynam
ically S
ched
uled
PetaBricksSource Code
RUNTIME
Choice Grids
transform RollingSum from A[n] to B[n] { … }
Rule1: to(B.cell(i) b) from(B.cell(i-1) left, A.cell(i) a) { … }
Rule2: to(B.cell(i) b) from(A.region(0, i) as) { … }
[Diagram: over the cells of B (0..n), cell 0 can use only Rule2, while cells 1..n can use Rule1 or Rule2; A is the input]
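The two rules can be sketched in plain Python (assuming region(0, i) denotes the inclusive prefix A[0..i]; function names are illustrative). Both produce the same B, which is what makes them interchangeable choices for the compiler:

```python
def rolling_sum_rule1(A):
    """Rule1: B[i] = B[i-1] + A[i]. Sequential, O(n) total work;
    cell 0 has no left neighbour, so it falls back to the direct sum."""
    B = []
    for i, a in enumerate(A):
        B.append(a if i == 0 else B[i - 1] + a)
    return B

def rolling_sum_rule2(A):
    """Rule2: B[i] = sum of A[0..i]. Each cell is independent, so all
    cells can be computed in parallel, at O(n^2) total work."""
    return [sum(A[:i + 1]) for i in range(len(A))]
```

The trade-off is exactly what the autotuner exploits: Rule1 does less work but is inherently sequential, while Rule2 exposes parallelism at the cost of redundant work.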
Choice Dependency Graph
transform RollingSum from A[n] to B[n] { … }
Rule1: to(B.cell(i) b) from(B.cell(i-1) left, A.cell(i) a) { … }
Rule2: to(B.cell(i) b) from(A.region(0, i) as) { … }
[Diagram: the choice grid from the previous slide annotated with dependency edges such as (r2, =), (r1, =), (r2, <=), (r1, =, -1), and (r1, <) between the input and regions of B]
PetaBricks Autotuning
[Diagram: the same compilation flow, extended with an autotuner that repeatedly executes the compiled user code on the parallel runtime engine and emits a choice configuration file]
PetaBricks Execution
[Diagram: at run time, the choice configuration file produced by the autotuner prunes the dependency graph, and the compiled user code executes on the parallel runtime engine]
Experimental Setup
• Test System
– Dual quad-core (8 cores) Xeon X5460 @ 3.16GHz w/ 8GB RAM
– CSAIL Debian 4.0 (etch), kernel 2.6.18
• Training
– Using our hybrid genetic tuner
– Trained using all 8 cores
– Training times varied from ~1 min to ~1 hour
Sort
[Chart: time vs. input size (0–2000) for Insertion Sort, Quick Sort, Merge Sort, and Radix Sort]
Sort
[Chart: the same comparison with the Autotuned variant added]
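The autotuned curve reflects switching algorithms by input size. A minimal sketch of such a hybrid follows; the 64-element cutoff is illustrative, whereas a real tuner would learn it from end-to-end timing:

```python
def insertion_sort(xs):
    """Simple quadratic sort; fast on small inputs."""
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

def merge(a, b):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def tuned_sort(xs, cutoff=64):
    """Merge sort that falls back to insertion sort below a tuned cutoff,
    mirroring how the autotuner composes algorithms by input size."""
    if len(xs) <= cutoff:
        return insertion_sort(xs)
    mid = len(xs) // 2
    return merge(tuned_sort(xs[:mid], cutoff), tuned_sort(xs[mid:], cutoff))
```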
Eigenvector Solve
[Chart: time vs. size (0–500) for Bisection, DC, and QR]
Eigenvector Solve
[Chart: the same comparison with the Autotuned variant added]
Poisson
[Chart: time (log scale) vs. matrix size (3–2049) for Direct, Jacobi, SOR, and Multigrid solvers]
Poisson
[Chart: the same comparison with the Autotuned variant added]
Scalability
[Chart: speedup vs. number of cores (1–8) for MM, Sort, Poisson, and Eigenvector Solve]
Impact of Autotuning
• Custom hybrid genetic tuner
• Huge gains by training on the target architecture:

Run on \ Trained on:            | SunFire T200 Niagara (8 cores) | Xeon E7340 (8 cores) | Xeon E7340 (1 core)
SunFire T200 Niagara (8 cores)  |             1.00x              |        0.72x         |        n/a
Xeon E7340 (8 cores)            |             0.43x              |        1.00x         |        0.30x
Related Work
• SPARSITY, OSKI – Sparse Matrices
• ATLAS, FLAME – Linear Algebra
• FFTW
• STAPL – Template Framework Library
• SPL – Digital signal processing
• High level optimization via automated statistical modeling. (Eric Brewer)
Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo
• Algorithmic Choices via PetaBricks
• Conquering the Multicore Menace
Conquering the Menace
• Parallelism Extraction
– The world is parallel, but most computer science is based in sequential thinking
– Parallel Languages
– A natural way to describe the maximal concurrency in the problem
– Parallel Thinking
– Theory, algorithms, data structures, education
• Parallelism Management
– Mapping algorithmic parallelism to a given architecture
– New hardware support
– Easier to enforce correctness
– Reduce the cost of bad decisions
– A Universal Parallel Compiler
Hardware Opportunities
• Don't have to contend with uniprocessors
• Not your same old multiprocessor problem
– How does going from multiprocessors to multicores impact programs?
– What changed?
– Where is the impact?
– Communication bandwidth
– Communication latency
Communication Bandwidth
• How much data can be communicated between two cores?
– Multiprocessor: ~32 Gigabits/sec → multicore: ~300 Terabits/sec (a ~10,000x increase)
• What changed?
– Number of wires
– Clock rate
– Multiplexing
• Impact on programming model?
– Massive data exchange is possible
– Data movement is not the bottleneck, so processor affinity is not that important
Parallel Language Opportunities
• We need a lot more innovation! Languages that…
– require no non-intuitive reorganization of data or code
– make the programmer focus on concurrency, but not performance: off-load the parallelism and performance issues to the compiler (akin to ILP compilation for VLIW machines)
– eliminate hard problems such as race conditions and deadlocks (akin to the elimination of memory bugs in Java)
– inform the programmer if they have done something illegal (akin to a type system or runtime null-pointer checks)
– take advantage of domains to reduce the parallelization burden (akin to the StreamIt language for the streaming domain)
– use novel hardware to eliminate problems and help the programmer (akin to cache coherence hardware)
Compilation Opportunities
• Universal compiler for uniprocessors: GCC
– Easily portable to any uniprocessor
– Able to obtain respectable performance
– A single program (in C) runs on all uniprocessors
• MultiCompiler: a universal compiler for parallel systems
– The language exposes maximal parallelism; the compiler manages it
– Unlike uniprocessors, many single decisions are performance critical
– Candidates: don't bind a single decision, keep multiple tracks
– Learning: learn and improve heuristics
– Adaptation: dynamically choose candidates and adapt the program to resources and runtime conditions
Conclusions
• Kendo
– The first system to efficiently provide weak determinism on commodity hardware
– Provides a systematic method of reproducing many non-deterministic bugs
– Incurs modest performance overhead when running on 4 processors
– This low overhead makes it possible to leave it on while an application is deployed
• PetaBricks
– The first language where micro-level algorithmic choice can be naturally expressed
– Autotuning can find the best choice
– Can switch between choices as the solution is constructed
• Switching to multicores without losing the gains in programmer productivity may be the Grandest of the Grand Challenges
– Half a century of work, still no winning solution
– Will affect everyone!
– A lot more work to do to solve this problem!!!
http://groups.csail.mit.edu/commit/