Slide 1: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead

Zoltan Majo and Thomas R. Gross
Department of Computer Science, ETH Zurich
Slide 2: NUMA multicores

[Figure: two processors (Processor 0 and Processor 1), each with four cores, a shared cache, and a memory controller (MC) to its local DRAM memory; an interconnect (IC) links the processors, so each can also reach the other's DRAM remotely]
Slide 3: NUMA multicores

Two problems:
• NUMA: interconnect overhead

[Figure: processes A and B access their memories MA and MB across the interconnect (IC) because process and memory sit on different processors]
Slide 4: NUMA multicores

Two problems:
• NUMA: interconnect overhead
• multicore: cache contention

[Figure: processes A and B additionally share one processor's cache, so they contend for cache capacity as well as for the interconnect to their memories MA and MB]
Slide 5: Outline

• NUMA: experimental evaluation
• Scheduling
  – N-MASS
  – N-MASS evaluation
Slide 6: Multi-clone experiments

• Intel Xeon E5520
• 4 clones of soplex (SPEC CPU2006)
  – local clone: its memory is on the processor it runs on
  – remote clone: its memory is on the other processor
• Memory behavior of unrelated programs

[Figure: the four clones and their memory placed across the cores and DRAMs of the two processors]
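To make the setup concrete, here is a minimal launch sketch using numactl, which can pin each clone to a core (--physcpubind) and bind its memory to a node (--membind). The core numbers and the soplex invocation are illustrative assumptions, not the exact harness behind the experiments:

```python
import subprocess

# Illustrative placements for the four soplex clones; the experiment varies
# these to sweep the local-bandwidth share from 100% down to 0%.
placements = [
    {"core": 0, "mem_node": 0},  # local clone: runs where its memory is
    {"core": 1, "mem_node": 0},  # local clone
    {"core": 4, "mem_node": 0},  # remote clone: runs on the other processor
    {"core": 5, "mem_node": 0},  # remote clone
]

procs = []
for p in placements:
    cmd = [
        "numactl",
        f"--physcpubind={p['core']}",    # pin the clone to one core
        f"--membind={p['mem_node']}",    # allocate all its memory on one node
        "./soplex", "-m3500", "ref.mps",  # assumed SPEC-style invocation
    ]
    procs.append(subprocess.Popen(cmd))

for proc in procs:
    proc.wait()
```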
Slide 7

[Figure: five clone/memory placements on the two processors, giving local bandwidth shares of 100%, 80%, 57%, 32%, and 0%]
Slide 8: Performance of schedules

• Which is the best schedule?
• Baseline: single-program execution mode

[Figure: baseline configuration, a single clone running alone with its memory]
Slide 9

[Figure: slowdown relative to baseline (1.0-2.4) versus local memory bandwidth (0%-100%), with series for local clones, remote clones, and their average]
Slide 10: Outline

• NUMA: experimental evaluation
• Scheduling
  – N-MASS
  – N-MASS evaluation
Slide 11: N-MASS (NUMA-Multicore-Aware Scheduling Scheme)

Two steps:
– Step 1: maximum-local mapping
– Step 2: cache-aware refinement
Slide 12: Step 1: Maximum-local mapping

[Figure: each process (A, B, C, D) is placed on the processor that holds its memory (MA, MB, MC, MD), so all memory accesses stay local]
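A minimal sketch of step 1, assuming (as the implementation slide later states) that all memory of a process lives on a single processor; the function and variable names are illustrative:

```python
def maximum_local_mapping(processes, mem_node):
    """Step 1 sketch: place every process on the processor that holds
    its memory, so all of its memory accesses stay local.

    processes: list of process ids
    mem_node:  dict mapping process id -> node (0 or 1) of its memory
    """
    mapping = {0: [], 1: []}
    for pid in processes:
        mapping[mem_node[pid]].append(pid)
    return mapping

# Example: MA and MB on node 0, MC and MD on node 1.
print(maximum_local_mapping(["A", "B", "C", "D"],
                            {"A": 0, "B": 0, "C": 1, "D": 1}))
# {0: ['A', 'B'], 1: ['C', 'D']}
```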
Slide 13: Default OS scheduling

[Figure: the default scheduler places A, B, C, and D without regard to where MA, MB, MC, and MD live, so some processes access their memory over the interconnect]
Slide 14: N-MASS (NUMA-Multicore-Aware Scheduling Scheme)

Two steps:
– Step 1: maximum-local mapping
– Step 2: cache-aware refinement
Slide 15: Step 2: Cache-aware refinement

In an SMP:

[Figure: A, B, C, and D and their memories MA-MD on the two processors]
Slide 16: Step 2: Cache-aware refinement

In an SMP:

[Figure: the refinement moves a process to the other processor's cache; in an SMP the move carries no memory-placement cost]
Slide 17: Step 2: Cache-aware refinement

In an SMP:

[Figure: chart of performance degradation versus NUMA penalty for processes A, B, C, and D]
Slide 18: Step 2: Cache-aware refinement

In a NUMA:

[Figure: A, B, C, and D mapped with their memories MA-MD; moving a process off its memory's processor now causes remote accesses]
Slide 19: Step 2: Cache-aware refinement

In a NUMA:

[Figure: an alternative placement of A, B, C, and D, illustrating the migration considered by the refinement step]
Slide 20: Step 2: Cache-aware refinement

In a NUMA:

[Figure: performance degradation versus NUMA penalty for A, B, C, and D; the NUMA allowance marks how much NUMA penalty a process may incur before migrating it for cache balance stops paying off]
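A hedged sketch of step 2 under assumptions: the slides do not spell out the exact policy, so this version greedily moves the cheapest-to-move processes (smallest NUMA penalty) from the more loaded cache to the less loaded one, stopping when a move would no longer help. The `allowance` threshold stands in for the NUMA allowance shown above, and the plain MPKI sum stands in for the weighted pressure metric defined on the next slide:

```python
def cache_aware_refinement(mapping, mpki, numa_penalty, allowance=1.1):
    """Step 2 sketch: relieve cache pressure by migrating processes away
    from their memory node, but only processes whose NUMA penalty stays
    within the allowance."""
    def pressure(node):
        # Simplified pressure: sum of MPKI on the node (the next slide
        # refines this with a local/remote weighting).
        return sum(mpki[p] for p in mapping[node])

    while True:
        hot = max((0, 1), key=pressure)
        cold = 1 - hot
        # Candidates: processes on the hot node cheap enough to run remotely.
        candidates = [p for p in mapping[hot] if numa_penalty[p] <= allowance]
        if not candidates or pressure(hot) <= pressure(cold):
            return mapping
        victim = min(candidates, key=lambda p: numa_penalty[p])
        # Do not migrate if that would merely invert the imbalance.
        if pressure(hot) - mpki[victim] < pressure(cold):
            return mapping
        mapping[hot].remove(victim)
        mapping[cold].append(victim)
```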
Slide 21: Performance factors

Two factors cause performance degradation:
1. NUMA penalty: slowdown due to remote memory access
2. cache pressure:
   – local processes: misses per 1000 instructions (MPKI)
   – remote processes: MPKI x NUMA penalty

[Figure: NUMA penalty (1.0-1.5) across the SPEC programs]
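The cache-pressure metric from this slide, written out directly; the dictionary inputs are assumed shapes, not an interface from the talk:

```python
def cache_pressure(node, mapping, mem_node, mpki, numa_penalty):
    """Cache pressure of one node, as defined on the slide: local
    processes contribute their MPKI (misses per 1000 instructions),
    remote processes their MPKI weighted by the NUMA penalty."""
    total = 0.0
    for pid in mapping[node]:
        if mem_node[pid] == node:            # local process
            total += mpki[pid]
        else:                                # remote process
            total += mpki[pid] * numa_penalty[pid]
    return total
```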
Slide 22: Implementation

• User-mode extension to the Linux scheduler
• Performance metrics
  – hardware performance counter feedback
  – NUMA penalty
    • perfect information from program traces
    • estimate based on MPKI
• All memory for a process is allocated on one processor
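Since N-MASS runs in user mode, placement can be enforced purely through CPU affinity masks. A minimal sketch, assuming the Xeon E5520 testbed exposes cores 0-3 on processor 0 and cores 4-7 on processor 1 (the actual core numbering is machine-specific):

```python
import os

# Assumed core layout: cores 0-3 on node 0, cores 4-7 on node 1.
CORES = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}

def apply_mapping(mapping, pids):
    """Pin each process to the cores of the node chosen for it.
    mapping: {node: [name, ...]};  pids: dict name -> Linux pid."""
    for node, names in mapping.items():
        for name in names:
            # A user-mode scheduler extension can enforce placement
            # with nothing more than affinity masks.
            os.sched_setaffinity(pids[name], CORES[node])
```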
Slide 23: Outline

• NUMA: experimental evaluation
• Scheduling
  – N-MASS
  – N-MASS evaluation
Slide 24: Workloads

• SPEC CPU2006 subset
• 11 multi-program workloads (WL1-WL11)
  – 4-program workloads (WL1-WL9)
  – 8-program workloads (WL10, WL11)

[Figure: NUMA penalty (0.9-1.5) versus MPKI (log scale, 0.00001 to 100) for SPEC programs, marking used versus not-used programs and the CPU-bound to memory-bound range]
Slide 25: Memory allocation setup

• Where the memory of each process is allocated influences performance
• Controlled setup: memory allocation maps
Slide 26: Memory allocation maps

[Figure: processes A-D with memories MA-MD; the placement of the four memories is encoded as a 4-character allocation map, here 0000]
Slide 27: Memory allocation maps

[Figure: allocation map 0000 puts MA, MB, MC, and MD on one processor's DRAM; allocation map 0011 puts MA and MB on one processor and MC and MD on the other]
Slide 28: Memory allocation maps

[Figure: allocation map 0000, with all memories on one DRAM, is unbalanced; allocation map 0011, with two memories on each DRAM, is balanced]
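Assuming each character of the map gives the node that holds the corresponding process's memory (which matches the 0000 = unbalanced and 0011 = balanced examples above), a small helper makes the encoding explicit:

```python
def decode_allocation_map(bits, processes="ABCD"):
    """Map each process's memory to a node per the allocation map.
    '0000' -> all four memories on node 0 (unbalanced);
    '0011' -> A, B on node 0 and C, D on node 1 (balanced)."""
    return {p: int(b) for p, b in zip(processes, bits)}

print(decode_allocation_map("0011"))
# {'A': 0, 'B': 0, 'C': 1, 'D': 1}
```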
Slide 29: Evaluation

• Baseline: Linux average
  – the Linux scheduler is non-deterministic
  – so the baseline is the average performance degradation over all possible cases
• N-MASS with perfect NUMA penalty information
Slide 30: WL9: Linux average

[Figure: average slowdown relative to single-program mode (1.0-1.6) for allocation maps 0000, 1000, 0100, 0010, 0001, 1100, 1010, and 1001 under the Linux average]
Slide 31: WL9: N-MASS

[Figure: the same allocation maps, with N-MASS shown next to the Linux average; average slowdown relative to single-program mode (1.0-1.6)]
Slide 32: WL1: Linux average and N-MASS

[Figure: average slowdown relative to single-program mode (1.0-1.6) across the allocation maps for WL1, comparing the Linux average and N-MASS]
Slide 33: N-MASS performance

• N-MASS reduces performance degradation by up to 22%
• Which factor matters more: interconnect overhead or cache contention?
• Compare:
  – maximum-local
  – N-MASS (maximum-local + cache refinement step)
Slide 34: Data-locality vs. cache balancing (WL9)

[Figure: performance improvement relative to the Linux average (-10% to 25%) across the allocation maps, for maximum-local and for N-MASS (maximum-local + cache refinement step)]
Slide 35: Data-locality vs. cache balancing (WL1)

[Figure: performance improvement relative to the Linux average (-10% to 25%) across the allocation maps for WL1, for maximum-local and for N-MASS (maximum-local + cache refinement step)]
Slide 36: Data locality vs. cache balancing

• Data locality is more important than cache balancing
• Cache balancing gives performance benefits mostly with unbalanced allocation maps
• What if information about the NUMA penalty is not available?
Slide 37: Estimating NUMA penalty

• The NUMA penalty is not directly measurable
• Estimate: fit a linear regression onto the MPKI data

[Figure: NUMA penalty (1.0-1.5) versus MPKI (0-50) with the fitted regression line]
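A minimal least-squares sketch of that estimate; the sample points below are made up for illustration, and only the fitting procedure follows the slide:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Made-up (MPKI, measured NUMA penalty) samples, for illustration only.
mpki    = [1, 5, 10, 20, 40]
penalty = [1.01, 1.06, 1.12, 1.22, 1.43]
a, b = fit_line(mpki, penalty)

def estimate_numa_penalty(m):
    # Predicted slowdown for a process with MPKI m.
    return a * m + b
```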
Slide 38: Estimate-based N-MASS: performance

[Figure: performance improvement relative to the Linux average (-2% to 8%) for WL1-WL11, comparing maximum-local, N-MASS, and estimate-based N-MASS]
Slide 39: Conclusions

• N-MASS: a NUMA-multicore-aware scheduler
• Data locality optimizations are more beneficial than cache contention avoidance
• Better performance metrics are needed for scheduling
Slide 40: Thank you! Questions?