Page 1:

Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead

Zoltan Majo and Thomas R. Gross

Department of Computer Science, ETH Zurich

Page 2: NUMA multicores

[Diagram: two processors (Processor 0 and Processor 1), each with four cores, a shared cache, and a memory controller (MC) attached to local DRAM memory; the processors are connected by interconnect (IC) links.]

Page 3: NUMA multicores

Two problems:

• NUMA: interconnect overhead

[Diagram: processes A and B and their memories MA and MB on the two-processor system; accesses to memory on the other processor cross the interconnect (IC).]

Page 4: NUMA multicores

Two problems:

• NUMA: interconnect overhead

• multicore: cache contention

[Diagram: as before, but processes A and B now share one processor's cache, so they contend for cache capacity in addition to paying interconnect overhead.]
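
Both problems are tied to the machine topology: a scheduler has to know which cores share a cache and which DRAM is local to which processor. As a minimal, hedged sketch (not part of the talk), the node-to-core layout of a Linux machine like this can be read from sysfs:

    from pathlib import Path

    def numa_topology():
        # Map each NUMA node to the CPUs attached to it, read from
        # /sys/devices/system/node on Linux. Illustrative sketch only.
        nodes = {}
        for node_dir in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
            nodes[node_dir.name] = (node_dir / "cpulist").read_text().strip()
        return nodes

    # On a two-socket machine this typically yields something like
    # {'node0': '0-3', 'node1': '4-7'}; exact core numbering varies per system.
    print(numa_topology())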

Page 5: Outline

• NUMA: experimental evaluation

• Scheduling

– N-MASS

– N-MASS evaluation

Page 6: Multi-clone experiments

• Intel Xeon E5520

• 4 clones of soplex (SPEC CPU2006)

– local clone

– remote clone

• Clones model the memory behavior of unrelated programs

[Diagram: cores 0-3 on Processor 0 and cores 4-7 on Processor 1; the four clones (C) and their memory (M) are shown on the two-node system, with local clones running on the processor that holds the memory and remote clones on the other processor.]
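
The talk does not show how the clones are launched; as a hedged sketch of such a setup (the tool choice, core numbers, and binary name are my assumptions), all clone memory can be bound to one node with numactl while each clone is pinned to a core:

    import subprocess

    def run_clones(cores, cmd=("./soplex",)):
        # Bind every clone's memory to NUMA node 0 and pin each clone to one core.
        # Assumption: cores 0-3 sit on processor 0 and cores 4-7 on processor 1,
        # so clones placed on cores 4-7 run as remote clones.
        procs = [subprocess.Popen(["numactl", "--membind=0",
                                   f"--physcpubind={core}", *cmd])
                 for core in cores]
        for p in procs:
            p.wait()

    # Example schedule: two local clones (cores 0 and 1), two remote clones (cores 4 and 5).
    # run_clones([0, 1, 4, 5])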

Page 7:

[Diagram: the five possible schedules of the four clones across the two processors, with all clone memory on one node; the schedules are labeled by the resulting local memory bandwidth: 100%, 80%, 57%, 32%, and 0%.]

Page 8: Performance of schedules

• Which is the best schedule?

• Baseline: single-program execution mode

[Diagram: single-program baseline: one clone (C) runs alone with its memory (M).]

Page 9:

[Plot: execution-time slowdown relative to the single-program baseline (1.0 to 2.4) versus local memory bandwidth (0% to 100%), with curves for the local clones, the remote clones, and their average.]

Page 10: Outline

• NUMA: experimental evaluation

• Scheduling

– N-MASS

– N-MASS evaluation

Page 11: N-MASS (NUMA-Multicore-Aware Scheduling Scheme)

Two steps:

– Step 1: maximum-local mapping

– Step 2: cache-aware refinement

Page 12: Step 1: Maximum-local mapping

[Diagram: processes A, B, C, and D with memories MA, MB, MC, and MD; each process is mapped to a core on the processor that holds its memory, so its accesses stay local.]
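
A minimal sketch of this first step, assuming the scheduler knows each process's home node (the names and data structures below are mine, not the authors'):

    def maximum_local_mapping(procs, cores_per_node):
        # procs: {process: node holding its memory}; cores_per_node: {node: free cores}.
        # Greedy sketch: run each process on the node that holds its memory while
        # that node still has a free core; spill the rest onto the least-loaded node.
        free = dict(cores_per_node)
        mapping = {}
        for proc, home in procs.items():
            node = home if free.get(home, 0) > 0 else max(free, key=free.get)
            free[node] -= 1
            mapping[proc] = node
        return mapping

    # Hypothetical example: memories of A and B on node 0, of C and D on node 1.
    # maximum_local_mapping({"A": 0, "B": 0, "C": 1, "D": 1}, {0: 4, 1: 4})
    # -> {'A': 0, 'B': 0, 'C': 1, 'D': 1}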

Page 13: Default OS scheduling

[Diagram: the default OS scheduler spreads A, B, C, and D over the cores without regard to where their memories MA-MD reside, so some processes access their memory remotely.]

Page 14: N-MASS (NUMA-Multicore-Aware Scheduling Scheme)

Two steps:

– Step 1: maximum-local mapping

– Step 2: cache-aware refinement

Page 15: Step 2: Cache-aware refinement

In an SMP:

[Diagram: an initial placement of processes A, B, C, and D on the cores, with their memories MA-MD.]

Page 16: Step 2: Cache-aware refinement

In an SMP:

[Diagram: the four processes are redistributed across the two caches.]

Page 17: Step 2: Cache-aware refinement

In an SMP:

[Diagram and chart: per-process performance degradation (NUMA penalty) for A, B, C, and D, shown alongside the core placement.]

Page 18: Step 2: Cache-aware refinement

In a NUMA:

[Diagram: the maximum-local placement of A, B, C, and D, with each process on the processor that holds its memory.]

Page 19: Step 2: Cache-aware refinement

In a NUMA:

[Diagram: one process is moved to the other processor to relieve cache pressure, so it now accesses its memory across the interconnect.]

Page 20: Step 2: Cache-aware refinement

In a NUMA:

[Diagram and chart: performance degradation and NUMA penalty for A, B, C, and D, together with the NUMA allowance used by the refinement step.]

Page 21: Performance factors

Two factors cause performance degradation:

1. NUMA penalty: slowdown due to remote memory access

2. cache pressure

– local processes: misses / KINST (MPKI)

– remote processes: MPKI x NUMA penalty

[Plot: NUMA penalty (1.0 to 1.5) of the SPEC programs.]
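
A hedged sketch of how these two metrics could be combined into a per-cache pressure figure for the refinement step (the function, variable names, and numbers below are mine, not from the talk):

    def cache_pressure(processes, placement):
        # processes: {name: (mpki, numa_penalty, home_node)}
        # placement: {name: node the process runs on}; one shared cache per node.
        # Per this slide, a local process contributes its MPKI to its cache's
        # pressure, while a remote process contributes MPKI x NUMA penalty.
        pressure = {}
        for name, (mpki, penalty, home) in processes.items():
            node = placement[name]
            pressure[node] = pressure.get(node, 0.0) + (mpki if node == home
                                                        else mpki * penalty)
        return pressure

    # Made-up numbers: B is cache-intensive; moving it to node 1 relieves cache 0,
    # but its contribution there is weighted by its NUMA penalty.
    procs = {"A": (2.0, 1.05, 0), "B": (25.0, 1.30, 0),
             "C": (1.0, 1.02, 1), "D": (3.0, 1.10, 1)}
    print(cache_pressure(procs, {"A": 0, "B": 0, "C": 1, "D": 1}))
    print(cache_pressure(procs, {"A": 0, "B": 1, "C": 1, "D": 1}))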

Page 22: Implementation

• User-mode extension to the Linux scheduler

• Performance metrics

– hardware performance counter feedback

– NUMA penalty: perfect information from program traces, or an estimate based on MPKI

• All memory for a process allocated on one processor
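
As one possible illustration of the hardware-performance-counter feedback (my sketch, not the authors' implementation), MPKI can be derived from counts gathered with perf; event names and the CSV output format vary across perf versions and CPUs:

    import subprocess

    def measure_mpki(cmd):
        # Run a command under `perf stat` (Linux, CSV output via -x,) and compute
        # cache misses per kilo-instruction. Illustrative sketch only.
        out = subprocess.run(
            ["perf", "stat", "-x", ",", "-e", "cache-misses,instructions", *cmd],
            capture_output=True, text=True).stderr
        counts = {}
        for line in out.splitlines():
            fields = line.split(",")
            if len(fields) > 2 and fields[0].strip().isdigit():
                counts[fields[2]] = int(fields[0])
        return 1000.0 * counts["cache-misses"] / counts["instructions"]

    # Example (hypothetical binary and input): measure_mpki(["./soplex", "input.mps"])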

Page 23: Outline

• NUMA: experimental evaluation

• Scheduling

– N-MASS

– N-MASS evaluation

Page 24: Workloads

• SPEC CPU2006 subset

• 11 multi-program workloads (WL1-WL11)

– 4-program workloads (WL1-WL9)

– 8-program workloads (WL10, WL11)

[Plot: NUMA penalty (0.9 to 1.5) versus MPKI (log scale, roughly 0.00001 to 100) for the SPEC programs, marking used and not used programs and spanning CPU-bound to memory-bound behavior.]

Page 25: Memory allocation setup

• Where the memory of each process is allocated influences performance

• Controlled setup: memory allocation maps

Page 26: Memory allocation maps

[Diagram: processes A, B, C, and D with memories MA-MD; allocation map 0000 means the memory of all four processes is allocated on Processor 0.]

27

Memory allocation maps

BA C D

DRAM

Processor 1

Cache

Processor 0

DRAM

Cache

Allocation map: 0000

MA

MB

MC

MD

DRAM

Processor 1

Cache

Processor 0

DRAM

Cache

Allocation map: 0011

MA

MB

MC

MD

Page 28: Memory allocation maps

[Diagram: allocation map 0000 (all memory on one node) is unbalanced; allocation map 0011 (memory split evenly across the nodes) is balanced.]
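
A small sketch of how such a map can be interpreted, assuming the encoding the slides suggest (one digit per process, giving the node that holds that process's memory):

    def decode_allocation_map(alloc_map, processes="ABCD"):
        # One digit per process: the NUMA node on which that process's memory is
        # allocated, e.g. "0011" places MA and MB on node 0 and MC and MD on node 1.
        return {p: int(d) for p, d in zip(processes, alloc_map)}

    def is_balanced(alloc_map):
        # Balanced if the memory is split evenly between node 0 and node 1.
        nodes = [int(d) for d in alloc_map]
        return nodes.count(0) == nodes.count(1)

    print(decode_allocation_map("0011"))             # {'A': 0, 'B': 0, 'C': 1, 'D': 1}
    print(is_balanced("0000"), is_balanced("0011"))  # False True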

Page 29: Evaluation

• Baseline: Linux average

– Linux scheduler is non-deterministic

– average performance degradation in all possible cases

• N-MASS with perfect NUMA penalty information

Page 30: WL9: Linux average

[Plot: average slowdown relative to single-program mode (1.0 to 1.6) under the Linux average, across allocation maps 0000, 1000, 0100, 0010, 0001, 1100, 1010, and 1001.]

Page 31: WL9: N-MASS

[Plot: the same WL9 chart with N-MASS added next to the Linux average; average slowdown relative to single-program mode across the allocation maps.]

Page 32: WL1: Linux average and N-MASS

[Plot: average slowdown relative to single-program mode (1.0 to 1.6) for the Linux average and N-MASS across the allocation maps.]

Page 33: N-MASS performance

• N-MASS reduces performance degradation by up to 22%

• Which factor is more important: interconnect overhead or cache contention?

• Compare:

– maximum-local

– N-MASS (maximum-local + cache refinement step)

Page 34: Data-locality vs. cache balancing (WL9)

[Plot: performance improvement relative to the Linux average (-10% to 25%) for maximum-local and for N-MASS (maximum-local + cache refinement step), across the allocation maps.]

Page 35: Data-locality vs. cache balancing (WL1)

[Plot: the same comparison for WL1: performance improvement relative to the Linux average for maximum-local and for N-MASS, across the allocation maps.]

Page 36: Data locality vs. cache balancing

• Data locality is more important than cache balancing

• Cache balancing gives performance benefits mostly with unbalanced allocation maps

• What if information about the NUMA penalty is not available?

Page 37: Estimating NUMA penalty

• NUMA penalty is not directly measurable

• Estimate: fit linear regression onto MPKI data

[Plot: NUMA penalty (1.0 to 1.5) versus MPKI (0 to 50) with the fitted regression line.]
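
A hedged sketch of this estimate (the data points and resulting coefficients below are invented for illustration): fit a least-squares line to (MPKI, NUMA penalty) pairs from profiled programs, then predict the penalty of an unseen program from its MPKI alone.

    import numpy as np

    # Hypothetical profiling data: (MPKI, measured NUMA penalty) per program.
    mpki = np.array([0.1, 1.0, 5.0, 12.0, 25.0, 40.0])
    penalty = np.array([1.00, 1.02, 1.08, 1.15, 1.28, 1.42])

    # Least-squares linear fit: penalty ~ a * MPKI + b.
    a, b = np.polyfit(mpki, penalty, 1)

    def estimate_numa_penalty(mpki_value):
        # Estimate a process's NUMA penalty from its measured MPKI,
        # clamped so the penalty never falls below 1.0 (no speedup).
        return max(1.0, a * mpki_value + b)

    print(estimate_numa_penalty(20.0))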

Page 38: Estimate-based N-MASS: performance

[Plot: performance improvement relative to the Linux average (-2% to 8%) for maximum-local, N-MASS, and estimate-based N-MASS on workloads WL1 through WL11.]

Page 39: Conclusions

• N-MASS: NUMA-multicore-aware scheduler

• Data locality optimizations are more beneficial than cache contention avoidance

• Better performance metrics are needed for scheduling

Page 40:

Thank you! Questions?