Charm++ on NUMA Platforms:
the impact of SMP Optimizations
and a NUMA-aware Load Balancer
Laércio Pilla (UFRGS/Brazil - INRIA)
Christiane Pousa (INRIA)
Daniel Cordeiro (USP/Brazil - INRIA)
Jean-François Méhaut (INRIA)
Outline
• Introduction
• Performance Evaluation of SMP Optimizations of
Charm++ on NUMA Machines
• NUMA-aware Load Balancer on Charm++
• Conclusion and Future Work
Motivation for NUMA platforms
• The number of cores per processor is
increasing
• Hierarchical shared memory multiprocessors
• ccNUMA is coming back (NUMA factor)
[Diagram: earlier NUMA machines had four nodes (Node 0–3), each with its memory bank (M0–M3) and a single CPU; now each node holds a multicore processor with several cores.]
NUMA problems
• Remote access
• Optimization of latency
[Diagram: 8 NUMA nodes (Node 0–7), each with its memory bank (M0–M7) and four cores; a core accessing memory on a remote node pays the interconnect latency.]
NUMA problems
• Memory contention
• Optimization of bandwidth
[Diagram: the same 8-node machine; many cores accessing the same memory bank contend for its bandwidth.]
NUMA problems
On NUMA machines,
data distribution
matters!
Charm++ Parallel
Programming System
• Platform independent
• Both shared and distributed memory
• Architecture abstraction
• Programmer productivity
From charm++ site: http://charm.cs.uiuc.edu/research/charm/
Charm++ Parallel
Programming System
• Communications originally implemented
with message passing
• Even on SMP machines
• Currently includes optimizations for SMP
systems
• Chao Mei et al., “Optimizing a parallel runtime system
for multicore clusters: a case study”, in TG ‘10
Charm++ & NUMA
• How do these optimizations perform on
NUMA machines?
• How can we use knowledge of the NUMA
system to improve Charm++
performance?
Charm++ & NUMA
• How do these optimizations perform on
NUMA machines?
• Our evaluation
• How can we use knowledge of the NUMA
system to improve Charm++
performance?
• NUMA-aware load balancer
Outline
• Introduction
• Performance Evaluation of SMP Optimizations
of Charm++ on NUMA Machines
• NUMA-aware Load Balancer on Charm++
• Conclusion and Future Work
Evaluation of SMP optimizations
• Different Charm++ versions
• With optimizations
• Without optimizations
• Different architecture compilations (flavors)
• net-linux: distributed memory
• multicore: shared memory
NUMA machines
• AMD Opteron
• 8 nodes x 2 cores
• @ 2.2 GHz
• 2 MB L2 cache
• 32 GB main memory
• Low latency for local
memory access
• Crossbar interconnect
• NUMA factor: 1.2 – 1.5
• Linux 2.6.32.6
[Diagram: 8 nodes (Node 0–7), each with its local memory bank (M0–M7) and two cores (C0–C15).]
NUMA machines
• Intel Xeon X7560
• 4 nodes x 8 cores
• @ 2.27 GHz
• 24 MB shared L3 cache
• 64 GB main memory
• QuickPath
• NUMA factor: 2 - 2.6
• Linux 2.6.32
[Diagram: 4 nodes (Node 0–3), each with its local memory bank (M0–M3) and eight cores (C0–C31).]
Experimental setup
• Exclusive access to the machines
• Minimum of 10 executions
• Low standard deviation (< 5%)
• Different numbers of cores
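The stability criterion above (relative standard deviation below 5% over at least 10 runs) can be checked with a small helper; a minimal C++ sketch, with illustrative names not taken from the talk:

```cpp
#include <cmath>
#include <vector>

// Relative standard deviation (stddev / mean) of repeated execution
// times, used to verify the "< 5%" stability criterion.
double relativeStddev(const std::vector<double>& times) {
  double mean = 0.0;
  for (double t : times) mean += t;
  mean /= times.size();
  double var = 0.0;
  for (double t : times) var += (t - mean) * (t - mean);
  var /= times.size();  // population variance over the sampled runs
  return std::sqrt(var) / mean;
}
```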
Benchmark: Jacobi2D
• Iterative benchmark
• Computations over 2D matrix
• Communications with 4 neighbors
• Stencil (CPU bound)
• Imbalanced
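The computation pattern above is a 5-point stencil; a minimal C++ sketch of one iteration (illustrative only, not the actual Charm++ benchmark code):

```cpp
#include <cstddef>
#include <vector>

// One Jacobi iteration on an N x N grid: each interior point becomes
// the average of its four neighbors; boundary values stay fixed.
std::vector<std::vector<double>>
jacobiStep(const std::vector<std::vector<double>>& a) {
  std::size_t n = a.size();
  auto next = a;  // copy keeps the boundary unchanged
  for (std::size_t i = 1; i + 1 < n; ++i)
    for (std::size_t j = 1; j + 1 < n; ++j)
      next[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] +
                           a[i][j - 1] + a[i][j + 1]);
  return next;
}
```

When the matrix is decomposed into chares, each chare only needs the border rows/columns of its 4 neighbors, which is the communication pattern the slide describes.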
Jacobi2D on Opteron Machine
[Plot: average iteration time (s, lower is better) vs. number of cores (2–16), for the multicore and net-linux flavors, with and without optimizations. No significant difference between versions.]
Jacobi2D on Xeon Machine
[Plot: average iteration time (s, lower is better) vs. number of cores (2–32), same four configurations. Differences are within the error margin.]
Benchmark: kNeighbor
• Synthetic benchmark
• Completely communication bound
• Each chare communicates with k
neighbors
• k = 3
• Message size = 1024 B
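The communication pattern can be illustrated by the destination set of each chare; a sketch assuming a ring of chares each sending to its k following neighbors (illustrative, not the benchmark's source):

```cpp
#include <vector>

// Destinations of chare `self` in a ring of `p` chares, each sending
// a message to its k following neighbors (k = 3 in the experiments).
std::vector<int> neighbors(int self, int p, int k) {
  std::vector<int> out;
  for (int d = 1; d <= k; ++d)
    out.push_back((self + d) % p);  // wrap around the ring
  return out;
}
```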
kNeighbor on Opteron Machine
[Plot: average iteration time (µs, lower is better) vs. number of cores (2–16), same four configurations. Annotated speedups: 1.2 and 9.]
kNeighbor on Xeon Machine
[Plot: average iteration time (µs, lower is better) vs. number of cores (2–32), same four configurations. Annotated speedups: 2 and 4.7.]
Partial conclusions
• Execution times can differ by up to 50%
between Charm++ versions
• Times up to 90% smaller when using
multicore instead of net-linux
• Impact is proportional to the amount of
communication
Outline
• Introduction
• Performance Evaluation of SMP Optimizations of
Charm++ on NUMA Machines
• NUMA-aware Load Balancer on Charm++
• Conclusion and Future Work
NUMA-aware Load Balancer
• Use knowledge about the system
• NUMA-factor among nodes
• Collected through libarchtopo
• Communication history
• No knowledge about the chare’s memory
• Improve performance by reducing
communication latency
• Avoid too many chare migrations
NUMA-aware Load Balancer
Calculate the processors' loads
Sort chares by decreasing load
While there are migratable chares:
  Pick the most loaded chare k
  Compute W(k,i) for all processors i
  Migrate k to the processor with the smallest W(k,i)
NUMA-aware Load Balancer

W(k,i) = L(i) + ɑ·( −M(k,i) + Σ j=1..N, j≠i M(k,j)·NF(j,i) )

where:
• L(i): load on the candidate processor (core) i
• ɑ: communication weight (constant)
• M(k,i): intra-core communications of chare k
(extended for intra-NUMA node)
• M(k,j)·NF(j,i): inter-core communications
(extended for inter-NUMA node), weighted by the NUMA factor NF(j,i)
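The greedy step and the cost function above can be sketched together in C++; this is a sketch with illustrative names and data layouts, not the actual NumaLB source:

```cpp
#include <cstddef>
#include <vector>

// Cost W(k,i) of placing chare k on core i: the core's current load
// plus a communication term in which remote traffic is weighted by
// the NUMA factor NF(j,i).
double W(std::size_t k, std::size_t i,
         const std::vector<double>& load,               // L(i)
         const std::vector<std::vector<double>>& comm,  // M(k,j)
         const std::vector<std::vector<double>>& nf,    // NF(j,i)
         double alpha) {
  double c = -comm[k][i];  // reward communication local to core i
  for (std::size_t j = 0; j < load.size(); ++j)
    if (j != i)
      c += comm[k][j] * nf[j][i];  // penalize remote communication
  return load[i] + alpha * c;
}

// Greedy step: the destination of chare k is the core minimizing W(k,i).
std::size_t bestCore(std::size_t k,
                     const std::vector<double>& load,
                     const std::vector<std::vector<double>>& comm,
                     const std::vector<std::vector<double>>& nf,
                     double alpha) {
  std::size_t best = 0;
  for (std::size_t i = 1; i < load.size(); ++i)
    if (W(k, i, load, comm, nf, alpha) < W(k, best, load, comm, nf, alpha))
      best = i;
  return best;
}
```

With ɑ = 0, this degenerates into a purely load-based greedy balancer; larger ɑ makes the NUMA-factor-weighted communication term dominate placement.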
Load Balancer Evaluation
• Benchmarks
• Imbalance
• Jacobi2D
• Poisson3D
• Comparison with different load balancers
• GreedyLB
• GreedyCommLB
Benchmark: Imbalance
• By Isaac Dooley
• Based on Fractography3D
• Iterative benchmark
• Imbalance increases with computations
• Computations over 2D array of chares
• Communications with 4 neighbors
Imbalance on Opteron Machine
[Plot: total time speedup (higher is better) vs. number of cores (8, 16), for No LB, GreedyCommLB, GreedyLB, and NumaLB. Annotated differences: 15% and 5%.]
Imbalance on Xeon Machine
[Plot: total time speedup (higher is better) vs. number of cores (8–32), same load balancers. Differences of ~5% are within the error margin.]
Benchmark: Jacobi2D
• Iterative benchmark
• Computations over 2D matrix
• Communications with 4 neighbors
• Stencil (CPU bound)
• Imbalanced
Jacobi2D on Opteron Machine
[Plot: iteration time speedup (higher is better) vs. number of cores (8, 16), for No LB, GreedyCommLB, GreedyLB, and NumaLB.]
Jacobi2D on Xeon Machine
[Plot: iteration time speedup (higher is better) vs. number of cores (8–32), same load balancers. Annotated differences: 17% and 3%.]
Benchmark: Poisson3D
• By Xavier Besseron and Thierry Gautier
• Solves the Poisson equation on a 3D
domain
• Parallelized by domain decomposition
• Well balanced
Poisson3D on Opteron Machine
[Plot: total time speedup (higher is better) vs. number of cores (8, 16), same load balancers. Labeled speedups: 0.900, 0.797, 0.999, 0.989.]
Poisson3D on Xeon Machine
[Plot: total time speedup (higher is better) vs. number of cores (8–32), same load balancers. Labeled speedups: 0.914, 0.837, 0.711 and 0.995, 0.993, 0.983.]
GreedyLB performance decreased due to
migration overhead
Outline
• Introduction
• Performance Evaluation of SMP Optimizations of
Charm++ on NUMA Machines
• NUMA-aware Load Balancer on Charm++
• Conclusion and Future Work
Conclusion
• SMP optimizations do affect the
performance on NUMA machines
• Up to 50% between versions and 90%
between architecture-specific compilations
• Gains with the NUMA-aware LB
• Speedups of up to 2.8 (compared to no LB)
• Performance close to GreedyLB
• While avoiding excessive migrations
Future Work
• Evolution of NUMA-aware LB
• Consider topology
• Number of hops
• Cache hierarchy
• Memory per chare
• Improve NUMA information discovery
• Initialization overheads
• Run experiments with communication intensive
benchmarks
• Interface for a memory LB?
Charm++ on NUMA Platforms:
the impact of SMP Optimizations
and a NUMA-aware Load Balancer
Thank you