Slide 1: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead

Zoltan Majo and Thomas R. Gross
Department of Computer Science, ETH Zurich
Slide 2: NUMA multicores

[Figure: two processors (Processor 0 and Processor 1), each with four cores, a shared cache, and a memory controller (MC) to its local DRAM memory; an interconnect (IC) links the processors, so each can also reach the other's DRAM remotely]
Slide 3: NUMA multicores

Two problems:
• NUMA: interconnect overhead

[Figure: processes A and B access their memories MA and MB across the interconnect (IC) because process and memory sit on different processors]
Slide 4: NUMA multicores

Two problems:
• NUMA: interconnect overhead
• multicore: cache contention

[Figure: processes A and B additionally share one processor's cache, so they contend for cache capacity as well as for the interconnect to their memories MA and MB]
Slide 5: Outline

• NUMA: experimental evaluation
• Scheduling
  – N-MASS
  – N-MASS evaluation
Slide 6: Multi-clone experiments

• Intel Xeon E5520
• 4 clones of soplex (SPEC CPU2006)
  – local clone: its memory is on the processor it runs on
  – remote clone: its memory is on the other processor
• Memory behavior of unrelated programs

[Figure: the four clones and their memory placed across the cores and DRAMs of the two processors]
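To make the setup concrete, here is a minimal launch sketch using numactl, which can pin each clone to a core (--physcpubind) and bind its memory to a node (--membind). The core numbers and the soplex invocation are illustrative assumptions, not the exact harness behind the experiments:

```python
import subprocess

# Illustrative placements for the four soplex clones; the experiment varies
# these to sweep the local-bandwidth share from 100% down to 0%.
placements = [
    {"core": 0, "mem_node": 0},  # local clone: runs where its memory is
    {"core": 1, "mem_node": 0},  # local clone
    {"core": 4, "mem_node": 0},  # remote clone: runs on the other processor
    {"core": 5, "mem_node": 0},  # remote clone
]

procs = []
for p in placements:
    cmd = [
        "numactl",
        f"--physcpubind={p['core']}",    # pin the clone to one core
        f"--membind={p['mem_node']}",    # allocate all its memory on one node
        "./soplex", "-m3500", "ref.mps",  # assumed SPEC-style invocation
    ]
    procs.append(subprocess.Popen(cmd))

for proc in procs:
    proc.wait()
```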
Slide 7

[Figure: five clone/memory placements on the two processors, giving local bandwidth shares of 100%, 80%, 57%, 32%, and 0%]
Slide 8: Performance of schedules

• Which is the best schedule?
• Baseline: single-program execution mode

[Figure: baseline configuration, a single clone running alone with its memory]
Slide 9

[Figure: slowdown relative to baseline (1.0-2.4) versus local memory bandwidth (0%-100%), with series for local clones, remote clones, and their average]
Slide 10: Outline

• NUMA: experimental evaluation
• Scheduling
  – N-MASS
  – N-MASS evaluation
Slide 11: N-MASS (NUMA-Multicore-Aware Scheduling Scheme)

Two steps:
– Step 1: maximum-local mapping
– Step 2: cache-aware refinement
Slide 12: Step 1: Maximum-local mapping

[Figure: each process (A, B, C, D) is placed on the processor that holds its memory (MA, MB, MC, MD), so all memory accesses stay local]
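A minimal sketch of step 1, assuming (as the implementation slide later states) that all memory of a process lives on a single processor; the function and variable names are illustrative:

```python
def maximum_local_mapping(processes, mem_node):
    """Step 1 sketch: place every process on the processor that holds
    its memory, so all of its memory accesses stay local.

    processes: list of process ids
    mem_node:  dict mapping process id -> node (0 or 1) of its memory
    """
    mapping = {0: [], 1: []}
    for pid in processes:
        mapping[mem_node[pid]].append(pid)
    return mapping

# Example: MA and MB on node 0, MC and MD on node 1.
print(maximum_local_mapping(["A", "B", "C", "D"],
                            {"A": 0, "B": 0, "C": 1, "D": 1}))
# {0: ['A', 'B'], 1: ['C', 'D']}
```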
Slide 13: Default OS scheduling

[Figure: the default scheduler places A, B, C, and D without regard to where MA, MB, MC, and MD live, so some processes access their memory over the interconnect]
Slide 14: N-MASS (NUMA-Multicore-Aware Scheduling Scheme)

Two steps:
– Step 1: maximum-local mapping
– Step 2: cache-aware refinement
Slide 15: Step 2: Cache-aware refinement

In an SMP:

[Figure: A, B, C, and D and their memories MA-MD on the two processors]
Slide 16: Step 2: Cache-aware refinement

In an SMP:

[Figure: the refinement moves a process to the other processor's cache; in an SMP the move carries no memory-placement cost]
Slide 17: Step 2: Cache-aware refinement

In an SMP:

[Figure: chart of performance degradation versus NUMA penalty for processes A, B, C, and D]
Slide 18: Step 2: Cache-aware refinement

In a NUMA:

[Figure: A, B, C, and D mapped with their memories MA-MD; moving a process off its memory's processor now causes remote accesses]
Slide 19: Step 2: Cache-aware refinement

In a NUMA:

[Figure: an alternative placement of A, B, C, and D, illustrating the migration considered by the refinement step]
Slide 20: Step 2: Cache-aware refinement

In a NUMA:

[Figure: performance degradation versus NUMA penalty for A, B, C, and D; the NUMA allowance marks how much NUMA penalty a process may incur before migrating it for cache balance stops paying off]
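A hedged sketch of step 2 under assumptions: the slides do not spell out the exact policy, so this version greedily moves the cheapest-to-move processes (smallest NUMA penalty) from the more loaded cache to the less loaded one, stopping when a move would no longer help. The `allowance` threshold stands in for the NUMA allowance shown above, and the plain MPKI sum stands in for the weighted pressure metric defined on the next slide:

```python
def cache_aware_refinement(mapping, mpki, numa_penalty, allowance=1.1):
    """Step 2 sketch: relieve cache pressure by migrating processes away
    from their memory node, but only processes whose NUMA penalty stays
    within the allowance."""
    def pressure(node):
        # Simplified pressure: sum of MPKI on the node (the next slide
        # refines this with a local/remote weighting).
        return sum(mpki[p] for p in mapping[node])

    while True:
        hot = max((0, 1), key=pressure)
        cold = 1 - hot
        # Candidates: processes on the hot node cheap enough to run remotely.
        candidates = [p for p in mapping[hot] if numa_penalty[p] <= allowance]
        if not candidates or pressure(hot) <= pressure(cold):
            return mapping
        victim = min(candidates, key=lambda p: numa_penalty[p])
        # Do not migrate if that would merely invert the imbalance.
        if pressure(hot) - mpki[victim] < pressure(cold):
            return mapping
        mapping[hot].remove(victim)
        mapping[cold].append(victim)
```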
Slide 21: Performance factors

Two factors cause performance degradation:
1. NUMA penalty: slowdown due to remote memory access
2. cache pressure:
   – local processes: misses per 1000 instructions (MPKI)
   – remote processes: MPKI x NUMA penalty

[Figure: NUMA penalty (1.0-1.5) across the SPEC programs]
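The cache-pressure metric from this slide, written out directly; the dictionary inputs are assumed shapes, not an interface from the talk:

```python
def cache_pressure(node, mapping, mem_node, mpki, numa_penalty):
    """Cache pressure of one node, as defined on the slide: local
    processes contribute their MPKI (misses per 1000 instructions),
    remote processes their MPKI weighted by the NUMA penalty."""
    total = 0.0
    for pid in mapping[node]:
        if mem_node[pid] == node:            # local process
            total += mpki[pid]
        else:                                # remote process
            total += mpki[pid] * numa_penalty[pid]
    return total
```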
Slide 22: Implementation

• User-mode extension to the Linux scheduler
• Performance metrics
  – hardware performance counter feedback
  – NUMA penalty
    • perfect information from program traces
    • estimate based on MPKI
• All memory for a process is allocated on one processor
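Since N-MASS runs in user mode, placement can be enforced purely through CPU affinity masks. A minimal sketch, assuming the Xeon E5520 testbed exposes cores 0-3 on processor 0 and cores 4-7 on processor 1 (the actual core numbering is machine-specific):

```python
import os

# Assumed core layout: cores 0-3 on node 0, cores 4-7 on node 1.
CORES = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}

def apply_mapping(mapping, pids):
    """Pin each process to the cores of the node chosen for it.
    mapping: {node: [name, ...]};  pids: dict name -> Linux pid."""
    for node, names in mapping.items():
        for name in names:
            # A user-mode scheduler extension can enforce placement
            # with nothing more than affinity masks.
            os.sched_setaffinity(pids[name], CORES[node])
```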
Slide 23: Outline

• NUMA: experimental evaluation
• Scheduling
  – N-MASS
  – N-MASS evaluation
Slide 24: Workloads

• SPEC CPU2006 subset
• 11 multi-program workloads (WL1-WL11)
  – 4-program workloads (WL1-WL9)
  – 8-program workloads (WL10, WL11)

[Figure: NUMA penalty (0.9-1.5) versus MPKI (log scale, 0.00001 to 100) for SPEC programs, marking used versus not-used programs and the CPU-bound to memory-bound range]
Slide 25: Memory allocation setup

• Where the memory of each process is allocated influences performance
• Controlled setup: memory allocation maps
Slide 26: Memory allocation maps

[Figure: processes A-D with memories MA-MD; the placement of the four memories is encoded as a 4-character allocation map, here 0000]
Slide 27: Memory allocation maps

[Figure: allocation map 0000 puts MA, MB, MC, and MD on one processor's DRAM; allocation map 0011 puts MA and MB on one processor and MC and MD on the other]
Slide 28: Memory allocation maps

[Figure: allocation map 0000, with all memories on one DRAM, is unbalanced; allocation map 0011, with two memories on each DRAM, is balanced]
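Assuming each character of the map gives the node that holds the corresponding process's memory (which matches the 0000 = unbalanced and 0011 = balanced examples above), a small helper makes the encoding explicit:

```python
def decode_allocation_map(bits, processes="ABCD"):
    """Map each process's memory to a node per the allocation map.
    '0000' -> all four memories on node 0 (unbalanced);
    '0011' -> A, B on node 0 and C, D on node 1 (balanced)."""
    return {p: int(b) for p, b in zip(processes, bits)}

print(decode_allocation_map("0011"))
# {'A': 0, 'B': 0, 'C': 1, 'D': 1}
```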
Slide 29: Evaluation

• Baseline: Linux average
  – the Linux scheduler is non-deterministic
  – so the baseline is the average performance degradation over all possible cases
• N-MASS with perfect NUMA penalty information
Slide 30: WL9: Linux average

[Figure: average slowdown relative to single-program mode (1.0-1.6) for allocation maps 0000, 1000, 0100, 0010, 0001, 1100, 1010, and 1001 under the Linux average]
Slide 31: WL9: N-MASS

[Figure: the same allocation maps, with N-MASS shown next to the Linux average; average slowdown relative to single-program mode (1.0-1.6)]
Slide 32: WL1: Linux average and N-MASS

[Figure: average slowdown relative to single-program mode (1.0-1.6) across the allocation maps for WL1, comparing the Linux average and N-MASS]
Slide 33: N-MASS performance

• N-MASS reduces performance degradation by up to 22%
• Which factor matters more: interconnect overhead or cache contention?
• Compare:
  – maximum-local
  – N-MASS (maximum-local + cache refinement step)
Slide 34: Data-locality vs. cache balancing (WL9)

[Figure: performance improvement relative to the Linux average (-10% to 25%) across the allocation maps, for maximum-local and for N-MASS (maximum-local + cache refinement step)]
Slide 35: Data-locality vs. cache balancing (WL1)

[Figure: performance improvement relative to the Linux average (-10% to 25%) across the allocation maps for WL1, for maximum-local and for N-MASS (maximum-local + cache refinement step)]
Slide 36: Data locality vs. cache balancing

• Data locality is more important than cache balancing
• Cache balancing gives performance benefits mostly with unbalanced allocation maps
• What if information about the NUMA penalty is not available?
Slide 37: Estimating NUMA penalty

• The NUMA penalty is not directly measurable
• Estimate: fit a linear regression onto the MPKI data

[Figure: NUMA penalty (1.0-1.5) versus MPKI (0-50) with the fitted regression line]
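A minimal least-squares sketch of that estimate; the sample points below are made up for illustration, and only the fitting procedure follows the slide:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Made-up (MPKI, measured NUMA penalty) samples, for illustration only.
mpki    = [1, 5, 10, 20, 40]
penalty = [1.01, 1.06, 1.12, 1.22, 1.43]
a, b = fit_line(mpki, penalty)

def estimate_numa_penalty(m):
    # Predicted slowdown for a process with MPKI m.
    return a * m + b
```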
Slide 38: Estimate-based N-MASS: performance

[Figure: performance improvement relative to the Linux average (-2% to 8%) for WL1-WL11, comparing maximum-local, N-MASS, and estimate-based N-MASS]
Slide 39: Conclusions

• N-MASS: a NUMA-multicore-aware scheduler
• Data locality optimizations are more beneficial than cache contention avoidance
• Better performance metrics are needed for scheduling
Slide 40: Thank you! Questions?