
Selfish-LRU: Preemption-Aware Caching for Predictability and Performance

Jan Reineke, Saarland University, Germany
Sebastian Altmeyer, University of Amsterdam, Netherlands
Daniel Grund, Thales Germany
Sebastian Hahn, Saarland University, Germany
Claire Maiza, INP Grenoble, Verimag, France

20th IEEE Real-Time and Embedded Technology and Applications Symposium
April 15-17, 2014, Berlin, Germany

Context: Preemptive Scheduling

Non-preemptive Execution:
[timeline: Task 1, Task 2]

Preemptive Execution:
[timeline: Task 1, Task 2]

Caveat: Preemptions are not free!

Preemptive Execution:
[timeline: Task 1, Task 2, with the Cache-Related Preemption Delay (CRPD) marked]

Contribution of this Paper

Selfish-LRU: a new cache replacement policy that
➡ Increases performance by reducing the CRPD
➡ Simplifies static analysis of the CRPD

Selfish-LRU is a preemption-aware variant of least-recently used (LRU)

Least-Recently Used (LRU)

"Replace data that has not been used for the longest time"

➡ Usually works well due to temporal locality

Example (leftmost = most-recently used, rightmost = least-recently used):
ABCD → EABC (E: miss) → BEAC (B: hit)
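The bookkeeping above can be sketched as a small Python model of a single cache set (illustrative only; the class and variable names are not from the talk):

```python
from collections import deque

class LRUSet:
    """One k-way cache set under LRU; lines[0] is most-recently used."""
    def __init__(self, k):
        self.k = k
        self.lines = deque()

    def access(self, block):
        if block in self.lines:
            self.lines.remove(block)        # hit: refresh recency
            self.lines.appendleft(block)
            return "hit"
        if len(self.lines) == self.k:       # miss in a full set:
            self.lines.pop()                # evict the LRU line
        self.lines.appendleft(block)
        return "miss"

s = LRUSet(4)
for b in "DCBA":                            # warm up: state becomes ABCD
    s.access(b)
print(s.access("E"), "".join(s.lines))      # miss EABC
print(s.access("B"), "".join(s.lines))      # hit  BEAC
```

The two printed states match the slide's sequence: E misses and evicts the least-recently-used block D; the subsequent hit on B only moves B to the most-recently-used position.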

CRPD Example under LRU Replacement

Assume a simple preempted task:

for i in [1,10]:
    access A
    access B
    access C
    access D

One loop iteration after warmup (leftmost = most-recently used):
DCBA → ADCB (A: hit) → BADC (B: hit) → CBAD (C: hit) → DCBA (D: hit)

Without preemption (after warmup): 0 misses

Assume a simple preempting task:

do something_else()
access X

Preemption between loop iterations: 1 access
DCBA → XDCB (X: miss)

First loop iteration after preemption: 4 misses
XDCB → AXDC (A: miss) → BAXD (B: miss) → CBAX (C: miss) → DCBA (D: miss)

CRPD Example under LRU Replacement: Two Types of Misses Related to Preemption

1. Replaced misses: the accessed block was evicted by the preempting task itself (here: A, evicted by X).
2. Reordered misses: the accessed block was evicted by the preempted task's own reloads, because the preemption disturbed the recency order (here: B, C, and D).

Liu et al., PACT 2008: reordered misses account for 10% to 28% of all preemption-related misses.
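The 0-miss and 4-miss counts of this example can be reproduced by pushing the access traces through a minimal LRU model of one 4-way set (a sketch; `misses_after` is an illustrative helper, not from the paper):

```python
from collections import deque

def misses_after(prefix, tail, k=4):
    """Warm a k-way LRU set with `prefix`, then count misses in `tail`."""
    lines = deque()  # index 0 = most-recently used

    def access(b):
        hit = b in lines
        if hit:
            lines.remove(b)
        elif len(lines) == k:
            lines.pop()  # evict the LRU line
        lines.appendleft(b)
        return 0 if hit else 1

    for b in prefix:
        access(b)
    return sum(access(b) for b in tail)

loop = "ABCD" * 10                       # the preempted task's loop
print(misses_after(loop, "ABCD"))        # 0: no preemption
print(misses_after(loop + "X", "ABCD"))  # 4: every reload misses after one foreign access
```

A single foreign access X evicts A, and then each reload in turn evicts the next block the loop is about to touch.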

Selfish-LRU: Idea

Prioritize blocks of the currently running task.

Intuition: "Memory blocks of the currently running task are more likely to be accessed again soon."

Example (leftmost = most-recently used; C belongs to another task, all other blocks to the running task):
ABCD → EABD (E: miss, evicts the other task's block C rather than the LRU block D) → BEAD (B: hit) → FBEA (F: miss, evicts D, as no other task's block remains)
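A minimal sketch of the eviction rule described here: lines are tagged with the ID of the task that inserted them, and on a miss in a full set the least-recently-used line of another task is evicted if one exists. (Whether a hit should also transfer ownership is a hardware design detail; ownership is kept from insertion time in this sketch.)

```python
from collections import deque

class SelfishLRUSet:
    """One k-way set under Selfish-LRU; lines[0] is most-recently used.
    Each line is a (block, owner_tid) pair; ownership is set on insertion."""
    def __init__(self, k):
        self.k = k
        self.lines = deque()

    def access(self, block, tid):
        for entry in self.lines:
            if entry[0] == block:            # hit: refresh recency
                self.lines.remove(entry)
                self.lines.appendleft(entry)
                return "hit"
        if len(self.lines) == self.k:
            foreign = [e for e in self.lines if e[1] != tid]
            # evict the LRU line of another task if any, else the LRU line overall
            self.lines.remove(foreign[-1] if foreign else self.lines[-1])
        self.lines.appendleft((block, tid))
        return "miss"

s = SelfishLRUSet(4)
for b, t in [("D", 1), ("C", 2), ("B", 1), ("A", 1)]:  # state ABCD, C owned by task 2
    s.access(b, t)
print(s.access("E", 1), [e[0] for e in s.lines])  # miss ['E', 'A', 'B', 'D']: evicts C, not D
print(s.access("B", 1), [e[0] for e in s.lines])  # hit  ['B', 'E', 'A', 'D']
print(s.access("F", 1), [e[0] for e in s.lines])  # miss ['F', 'B', 'E', 'A']: no foreign line left
```

The trace matches the slide: E skips over the running task's LRU block D and evicts the foreign block C; once no foreign block remains, F falls back to plain LRU eviction.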

Selfish-LRU: CRPD Example Revisited

Assume the same preempted task:

for i in [1,10]:
    access A
    access B
    access C
    access D

DCBA → ADCB (A: hit) → BADC (B: hit) → CBAD (C: hit) → DCBA (D: hit)

Without preemption (after warmup): 0 misses
➡ Same behavior as LRU

Assume the same preempting task:

do something_else()
access X

Preemption between loop iterations: 1 access
DCBA → XDCB (X: miss)

First loop iteration after preemption: 1 miss
XDCB → ADCB (A: miss, evicts the preempting task's block X) → BADC (B: hit) → CBAD (C: hit) → DCBA (D: hit)

➡ No reordering misses
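The side-by-side miss counts (4 under LRU versus 1 under Selfish-LRU in the first iteration after preemption, plus X's own miss in both cases) can be checked with a small simulation; `misses` is an illustrative helper, and `selfish=True` enables the evict-other-tasks-first preference:

```python
from collections import deque

def misses(trace, k=4, selfish=False):
    """Misses of a (block, tid) trace through one k-way set."""
    lines, n = deque(), 0  # lines[0] = MRU; entries are (block, owner_tid)
    for block, tid in trace:
        entry = next((e for e in lines if e[0] == block), None)
        if entry:                            # hit: refresh recency
            lines.remove(entry)
            lines.appendleft(entry)
            continue
        n += 1
        if len(lines) == k:                  # pick a victim
            foreign = [e for e in lines if e[1] != tid] if selfish else []
            lines.remove(foreign[-1] if foreign else lines[-1])
        lines.appendleft((block, tid))
    return n

loop = [(b, 1) for b in "ABCD" * 10]                  # preempted task, warmed up
trace = loop + [("X", 2)] + [(b, 1) for b in "ABCD"]  # preemption, then one iteration
base = misses(loop)                                   # warmup misses only
print(misses(trace, selfish=False) - base)            # 5: X plus four misses (1 replaced, 3 reordered)
print(misses(trace, selfish=True) - base)             # 2: X plus a single replaced miss
```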

Selfish-LRU: Properties

Property 1: Selfish-LRU does not exhibit reordering misses.
➡ Often: smaller CRPD
➡ Simplifies static analysis of the CRPD

Property 2: In non-preempted execution, Selfish-LRU = LRU.
➡ No change in "regular" WCET analysis
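Property 2 can also be spot-checked empirically: in a single-task trace no cache line is ever owned by another task, so the selfish eviction preference never fires and the two policies coincide (a sketch; the simulator and its `selfish` flag are illustrative):

```python
import random
from collections import deque

def misses(trace, k, selfish):
    """Misses of a (block, tid) trace through one k-way set;
    selfish=True prefers evicting lines owned by other tasks."""
    lines, n = deque(), 0
    for block, tid in trace:
        entry = next((e for e in lines if e[0] == block), None)
        if entry:
            lines.remove(entry)
            lines.appendleft(entry)
            continue
        n += 1
        if len(lines) == k:
            foreign = [e for e in lines if e[1] != tid] if selfish else []
            lines.remove(foreign[-1] if foreign else lines[-1])
        lines.appendleft((block, tid))
    return n

random.seed(0)
for _ in range(100):  # random single-task traces: the counts must agree
    trace = [(random.randrange(10), 1) for _ in range(200)]
    assert misses(trace, 4, False) == misses(trace, 4, True)
print("Selfish-LRU matched LRU on all sampled single-task traces")
```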

Selfish-LRU: CRPD Analysis

[diagram: memory blocks of the preempting and the preempted task]

1. Number of useful cache blocks (UCBs)?
2. Number of evicting cache blocks (ECBs)? ➡ Smaller bound
3. Combination of ECBs and UCBs based on resilience ➡ Simplified and smaller bound

Selfish-LRU: Implementation

Required modifications:
• Manage task IDs (TIDs) in the operating system
• Make the TID available to the cache in a TID register
• Augment cache lines with the TID of the "owner" task
  ‣ Conservative estimate: < 3% space overhead
• Modified replacement logic

Similar to virtually-addressed caches

Experimental Evaluation

Main goal: compare Selfish-LRU with LRU in terms of performance and predictability
➡ Modified MPARM simulator
➡ CRPD analyses implemented in AbsInt's aiT

Secondary goal: compare the CRPD approach with cache partitioning (see paper for details)

Experimental Evaluation: Benchmarks and Cache Configuration

Benchmarks:
• Four of the largest Mälardalen benchmarks
• Four models from the SCADE distribution
• Two SCADE models from an embedded systems course

Cache configurations:
• Capacity: 2 KiB, 4 KiB, 8 KiB
• Associativity: 4, 8
• Number of sets: 32, 64, 128

Experimental Evaluation: Simulation Results, "Large" Preempting Task

[bar chart: measured number of additional misses per preempted task; capacity 2 KiB, associativity 4, 32 sets]

Figure captions (figures not reproduced here):
Fig. 1. Measured number of context-switch misses for each benchmark when preempted by pilot. k = 4, n = 32, and thus C = 2 KiB.
Fig. 2. Measured number of context-switch misses for each benchmark when preempted by pilot. k = 8, n = 32, and thus C = 4 KiB.
Fig. 3. Measured number of context-switch misses for each benchmark when preempted by edn. k = 4, n = 32, and thus C = 2 KiB.
Fig. 4. Bounds on the number of context-switch misses when preempted by pilot, determined by static analysis. k = 4, n = 32, and thus C = 2 KiB.
Fig. 5. Bounds on the number of context-switch misses when preempted by pilot, determined by static analysis. k = 8, n = 32, and thus C = 4 KiB.
Fig. 6. Bounds on the number of context-switch misses when preempted by edn, determined by static analysis. k = 4, n = 32, and thus C = 2 KiB.
Fig. 7. Observed misses during the execution of different task sets. k = 4, n = 128, and thus C = 8 KiB.
Fig. 8. Observed misses during the execution of different task sets. k = 8, n = 64, and thus C = 8 KiB.

Large share of replaced misses ➡ Fairly small improvement

Experimental Evaluation: Simulation Results, "Small" Preempting Task

[bar chart: measured number of additional misses per preempted task; capacity 2 KiB, associativity 4, 32 sets]

Small share of replaced misses ➡ Fairly significant improvement

Experimental Evaluation: Analysis Results, "Large" Preempting Task

[bar chart: bound on the number of additional misses per preempted task; capacity 2 KiB, associativity 4, 32 sets]

All misses are replaced misses ➡ No improvement

Experimental Evaluation: Analysis Results, "Small" Preempting Task

[bar chart: bound on the number of additional misses per preempted task; capacity 4 KiB, associativity 8, 32 sets]

Small share of replaced misses ➡ Fairly large improvement

Summary and Future Work

Selfish-LRU eliminates reordered misses:
➡ Increases performance by reducing the CRPD
➡ Simplifies static analysis of the CRPD
➡ Large improvements for small preempting tasks like interrupt handlers

Future work: apply the same idea to shared caches in multicores?