

Make the Most Out of Last Level Cache in Intel Processors
Alireza Farshin+, Amir Roozbeh+*, Gerald Q. Maguire Jr.+, Dejan Kostic+


Poster panels: 1 Problem · 2 Slice-aware Memory Management · 3 CacheDirector · 4 Impact on Tail Latency for NFV Service Chains · 5 Conclusions

Slice-aware Memory Management

[Diagram: an 8-core Intel processor with Cores 0–7, LLC Slices 0–7, Intel QPI, PCIe + DMA, and the DDR channels.]

[Figure: read access time (in cycles, 0–60) from Core 0 to each LLC slice (slice number 0–7).]

Measuring read access time from Core 0 to different LLC slices.

In modern (Intel) processors, the Last Level Cache (LLC) is divided into multiple slices.

For each core, accessing data stored in a closer slice is faster than accessing data stored in other slices.

We introduce a slice-aware memory management scheme, wherein data can be mapped to specific LLC slice(s).

CacheDirector

Testbed: Intel® Xeon® E5-2667 v3 processor, Mellanox ConnectX-4 NIC.

Sending/Receiving Packets via DDIO

Data Direct I/O (DDIO) technology sends packets to the LLC rather than to DRAM.

CacheDirector is a network I/O solution that extends DDIO by implementing slice-aware memory management as an extension to the Data Plane Development Kit (DPDK).

It sends the first 64 B of a packet (containing the packet's header) directly to the appropriate LLC slice.

The CPU core that is responsible for processing a packet can access the packet header in fewer CPU cycles.

Problem

Faster links expose processing elements to packets at a higher rate, but processor performance is no longer doubling at its earlier rate, making it harder to keep up with the growth in link speeds. For instance, a server receiving 64 B packets at 100 Gbps has only 5.12 ns (64 B × 8 = 512 bits, divided by 100 Gbps) to process each packet before the next one arrives.

It is essential to exploit every opportunity to optimize current computer systems. We focus on better management of the LLC.

Conclusions

With a slice-aware memory management scheme, which exploits the latency differences in accessing different LLC slices, it is possible to boost application performance and realize cache isolation.

We proposed CacheDirector, a network I/O solution that utilizes slice-aware memory management. CacheDirector increases efficiency and performance while reducing tail latencies over the state of the art.

We evaluated the performance of NFV service chains in the presence of CacheDirector: it reduces tail latencies by up to 21.5% for optimized NFV service chains running at 100 Gbps.

+ KTH Royal Institute of Technology (EECS/COM) · * Ericsson Research

[email protected] · [email protected] · [email protected] · [email protected]

Impact on Tail Latency for NFV Service Chains

[Figure: CDF (0–100%) of latency (0–1000 μs) for the service chain, comparing Traditional DDIO against CacheDirector + DDIO; the gap between the curves at the tail is Δ ≈ 119 μs.]

[Figure: Speedup (%) of CacheDirector + DDIO at the 75th, 90th, 95th, and 99th percentiles (y-axis 0–20%).]

Accessing the closest LLC slice can save up to 20 cycles, i.e., 6.25 ns at the testbed CPU's 3.2 GHz clock.

Stateful NFV service chain: Router → NAPT → Load Balancer.

Approach (diagram): each core has private L1i, L1d, and L2 caches; lower level caches are private to each core, while the LLC is shared.

Wallenberg AI, Autonomous Systems and Software Program. Work supported by SSF, WASP, and ERC.

Our Paper