ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers
Yoongu Kim, Dongsu Han, Onur Mutlu, Mor Harchol-Balter
Motivation
Modern multi-core systems employ multiple memory controllers
Applications contend with each other in multiple controllers
How to perform memory scheduling for multiple controllers?
Desired Properties of Memory Scheduling Algorithm
Maximize system performance, without starving any cores
Configurable by system software, to enforce thread priorities and QoS/fairness policies
Scalable to a large number of controllers: should not require significant coordination between controllers
No previous scheduling algorithm satisfies all these requirements
Multiple Memory Controllers
[Figure: a single-MC system (cores sharing one MC and its memory) vs. a multiple-MC system (cores sharing several MCs, each attached to its own memory). The difference: the need for coordination.]
Thread Ranking in Single-MC
Assume all requests go to the same bank. Number of requests: Thread 1 < Thread 2, so Thread 1 is the shorter job.
[Figure: memory service timeline of Thread 1's and Thread 2's requests in MC 1, and the resulting execution timeline with each thread's stall time.]
Thread ranking: Thread 1 > Thread 2, i.e., the shorter job is assigned the higher rank. This yields the optimal average stall time: 2T.
Thread Ranking in Multiple-MC
[Figure: memory service timelines for MC 1 and MC 2 under uncoordinated vs. coordinated ranking, and the resulting execution timelines.]
Uncoordinated: MC 1's shorter job is Thread 1, but the global shorter job is Thread 2. MC 1 incorrectly assigns the higher rank to Thread 1. Average stall time: 3T.
Coordinated: the global shorter job is Thread 2, and MC 1 correctly assigns the higher rank to Thread 2. Average stall time: 2.5T (cycles saved!).
Coordination leads to better scheduling decisions.
Coordination Limits Scalability
[Figure: four MCs coordinating either pairwise (MC-to-MC) or through a centralized meta-MC; coordination consumes bandwidth.]
To be scalable, coordination should exchange little information and occur infrequently.
The Problem and Our Goal
Problem: previous memory scheduling algorithms are not scalable to many controllers. They were not designed for multiple MCs and require significant coordination.
Our goal: fundamentally redesign the memory scheduling algorithm such that it provides high system throughput and requires little or no coordination among MCs.
Outline
Motivation
Rethinking Memory Scheduling: Minimizing Memory Episode Time
ATLAS: Least Attained Service Memory Scheduling; Thread Ranking; Request Prioritization Rules; Coordination
Evaluation
Conclusion
Rethinking Memory Scheduling
A thread alternates between two states (episodes):
Compute episode: zero outstanding memory requests, high IPC
Memory episode: non-zero outstanding memory requests, low IPC
Goal: minimize time spent in memory episodes
[Figure: outstanding memory requests over time, alternating between memory episodes and compute episodes.]
How to Minimize Memory Episode Time
Prioritize the thread whose memory episode will end the soonest. This minimizes the time spent in memory episodes across all threads.
Supported by queueing theory: Shortest-Remaining-Processing-Time (SRPT) scheduling is optimal in a single-server queue.
But what is the remaining length of a memory episode?
[Figure: outstanding memory requests over time; for an in-progress episode, how much longer will it last?]
Predicting Memory Episode Lengths
We discovered: the past is an excellent predictor of the future.
Large attained service ⇒ large expected remaining service.
Q: Why? A: Memory episode lengths are Pareto distributed…
[Figure: outstanding memory requests over time; an episode's attained service (the past) predicts its remaining service (the future).]
Pareto Distribution of Memory Episode Lengths
[Figure: Pr{Mem. episode > x} vs. x (cycles) for 401.bzip2; the memory episode lengths of SPEC benchmarks follow a Pareto distribution.]
Attained service correlates with remaining service: the longer an episode has lasted, the longer it will last further.
Favoring the least-attained-service memory episode = favoring the memory episode which will end the soonest.
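To make this concrete, here is a short derivation (our addition, not from the slides), assuming episode lengths X follow a Pareto distribution with minimum x_m and shape k > 1:

$$\Pr\{X > x\} = \left(\frac{x_m}{x}\right)^{k}, \quad x \ge x_m$$

$$\mathbb{E}[X \mid X > x] = \frac{k}{k-1}\,x \quad\Rightarrow\quad \mathbb{E}[X - x \mid X > x] = \frac{x}{k-1}$$

The expected remaining length grows linearly with the length already attained, so the episode with the least attained service is the one expected to end soonest.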
Least Attained Service (LAS) Memory Scheduling
Queueing theory: prioritize the job with the shortest remaining processing time (provably optimal), i.e., prioritize the memory episode with the least remaining service.
Our approach: remaining service correlates with attained service, and attained service is tracked by a per-thread counter, so prioritize the memory episode with the least attained service.
Least-attained-service (LAS) scheduling minimizes memory episode time. However, LAS does not consider long-term thread behavior.
Long-Term Thread Behavior
[Figure: short-term vs. long-term behavior of Thread 1 and Thread 2, each alternating between memory episodes and compute episodes.]
Short-term behavior alone would give the thread with the short memory episode higher priority and the thread with the long memory episode lower priority. Over the long term, however, prioritizing Thread 2 is more beneficial: it results in very long stretches of compute episodes.
Quantum-Based Attained Service of a Thread
We divide time into large, fixed-length intervals: quanta (millions of cycles).
[Figure: outstanding memory requests over time. Attained service measured within a single episode reflects short-term thread behavior; attained service accumulated over a quantum reflects long-term thread behavior.]
LAS Thread Ranking
During a quantum: each thread's attained service (AS) is tracked by the MCs. AS_i = a thread's AS during only the i-th quantum.
End of a quantum: each thread's TotalAS is computed as TotalAS_i = α · TotalAS_(i-1) + (1 − α) · AS_i, where a high α biases more towards history.
Next quantum: threads are ranked, favoring threads with lower TotalAS, and are serviced according to their ranking.
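As a concrete illustration, here is a minimal Python sketch of the end-of-quantum update (variable names are our own; α = 0.875 is the empirically determined value from the backup slides):

```python
ALPHA = 0.875  # history weight: a high ALPHA biases the ranking towards history

def end_of_quantum(total_as, quantum_as):
    """Apply TotalAS_i = ALPHA * TotalAS_(i-1) + (1 - ALPHA) * AS_i per thread,
    then rank threads so that a lower TotalAS means a higher rank."""
    for tid in total_as:
        total_as[tid] = ALPHA * total_as[tid] + (1 - ALPHA) * quantum_as[tid]
        quantum_as[tid] = 0  # AS_i counts service during the i-th quantum only
    # Threads earlier in this list are serviced first in the next quantum
    return sorted(total_as, key=total_as.get)
```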
ATLAS Scheduling Algorithm
ATLAS: Adaptive per-Thread Least Attained Service
Request prioritization order (see the sketch below):
1. Prevent starvation: over-threshold request
2. Maximize performance: higher LAS rank
3. Exploit locality: row-hit request
4. Tie-breaker: oldest request
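A minimal sketch of these four rules as a single sort key (field and variable names are illustrative; the 100K-cycle threshold is the empirical value from the backup slides):

```python
STARVATION_THRESHOLD = 100_000  # cycles; requests older than this go first

def priority_key(req, now_cycle, rank_of_thread):
    """Encode the four prioritization rules in order; smaller tuples win."""
    starving = (now_cycle - req.arrival_cycle) > STARVATION_THRESHOLD
    return (
        0 if starving else 1,           # 1. prevent starvation: over-threshold request
        rank_of_thread[req.thread_id],  # 2. maximize performance: higher LAS rank
        0 if req.row_hit else 1,        # 3. exploit locality: row-hit request
        req.arrival_cycle,              # 4. tie-breaker: oldest request
    )

# Each scheduling decision:
# next_req = min(request_buffer, key=lambda r: priority_key(r, now, ranking))
```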
How to coordinate MCs to agree upon a consistent ranking?
ATLAS Coordination Mechanism
During a quantum: each MC increments the local AS of each thread.
End of a quantum: each MC sends the local AS of each thread to a centralized meta-MC; the meta-MC accumulates the local AS values, calculates the ranking, and broadcasts the ranking to all MCs.
Result: a consistent thread ranking.
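A minimal sketch of the meta-MC's end-of-quantum step (our illustration, reusing the TotalAS update from above):

```python
def meta_mc_end_of_quantum(local_as_per_mc, total_as, alpha=0.875):
    """Accumulate the per-thread local AS reported by every MC, update each
    thread's TotalAS, and return the ranking broadcast back to all MCs."""
    for tid in total_as:
        as_i = sum(local_as[tid] for local_as in local_as_per_mc)  # accumulate
        total_as[tid] = alpha * total_as[tid] + (1 - alpha) * as_i
    return sorted(total_as, key=total_as.get)  # consistent ranking, least AS first
```

Because this exchange happens only once per quantum (millions of cycles) and carries one counter per thread per MC, its bandwidth cost is negligible.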
Coordination Cost in ATLAS
How costly is coordination in ATLAS?

                         ATLAS                                PAR-BS (previous best work [ISCA08])
How often?               Very infrequently: every quantum     Frequently: every batch boundary
                         boundary (10M cycles)                (thousands of cycles)
Sensitive to             Insensitive: coordination latency    Sensitive: coordination latency
coordination latency?    << quantum length                    ~ batch length
Properties of ATLAS
Goal: maximize system performance → LAS ranking, bank-level parallelism, row-buffer locality
Goal: scalable to a large number of controllers → very infrequent coordination
Goal: configurable by system software → scale attained service with thread weight (in paper)
Low complexity: attained service requires a single counter per thread in each MC
Evaluation Methodology
4, 8, 16, 24, 32-core systems: 5 GHz processor, 128-entry instruction window, 512 KB per-core private L2 caches
1, 2, 4, 8, 16-MC systems: 128-entry memory request buffer, 4 banks, 2 KB row buffer, 40 ns (200-cycle) row-hit round-trip latency, 80 ns (400-cycle) row-conflict round-trip latency
Workloads: multiprogrammed SPEC CPU2006 applications; 32 program combinations for the 4, 8, 16, 24, 32-core experiments
Comparison to Previous Scheduling Algorithms
FCFS, FR-FCFS [Rixner+, ISCA00]: oldest-first, row-hit-first. Low multi-core performance: they do not distinguish between threads.
Network Fair Queueing [Nesbit+, MICRO06]: partitions memory bandwidth equally among threads. Low system performance: bank-level parallelism and locality are not exploited.
Stall-Time Fair Memory Scheduler [Mutlu+, MICRO07]: balances thread slowdowns relative to when each thread runs alone. High coordination costs: requires heavy cycle-by-cycle coordination.
Parallelism-Aware Batch Scheduler [Mutlu+, ISCA08]: batches requests and performs thread ranking to preserve bank-level parallelism. High coordination costs: batch duration is very short.
System Throughput: 24-Core System
[Figure: system throughput (= ∑ speedup) vs. number of memory controllers (1, 2, 4, 8, 16) for FCFS, FR-FCFS, STFM, PAR-BS, and ATLAS. ATLAS's improvement over the best previous algorithm: 17.0%, 9.8%, 8.4%, 5.9%, and 3.5% at 1, 2, 4, 8, and 16 MCs, respectively.]
ATLAS consistently provides higher system throughput than all previous scheduling algorithms.
System Throughput: 4-MC System
[Figure: system throughput vs. number of cores (4, 8, 16, 24, 32) for PAR-BS and ATLAS. ATLAS's improvement over PAR-BS: 1.1%, 3.5%, 4.0%, 8.4%, and 10.8% at 4, 8, 16, 24, and 32 cores, respectively.]
As the number of cores increases, ATLAS's performance benefit increases.
Other Evaluations in the Paper
System software support: ATLAS effectively enforces thread weights
Workload analysis: ATLAS performs best for mixed-intensity workloads
Effect of ATLAS on fairness
Sensitivity to algorithmic parameters
Sensitivity to system parameters: memory address mapping, cache size, memory latency
Conclusions
Multiple memory controllers require coordination: they need to agree upon a consistent ranking of threads.
ATLAS is a fundamentally new approach to memory scheduling:
Scalable: thread ranking decisions at coarse-grained intervals
High-performance: minimizes the system time spent in memory episodes (the least-attained-service scheduling principle)
Configurable: enforces thread priorities
ATLAS provides the highest system throughput compared to five previous scheduling algorithms, and its performance benefit increases as the number of cores increases.
THANK YOU.
QUESTIONS?
Hardware Cost
Additional hardware storage for a 24-core, 4-MC system: 9 Kb
Not on the critical path of execution
System Configuration
DDR2-800 memory timing
Benchmark Characteristics
Benchmarks vary wildly in memory intensity, with orders-of-magnitude differences. Prioritizing the "light" threads over the "heavy" threads is critical.
Workload-by-Workload Results
[Figure: results for each individual workload.]
Results on a 24-Core System
24-core system with a varying number of controllers.
PAR-BSc: an ideally coordinated version of PAR-BS with instantaneous knowledge of every other controller. ATLAS outperforms even this unrealistic version of PAR-BS.
Results on X-Core, Y-MC Systems
Systems where both the number of cores and the number of controllers are varied.
PAR-BSc: an ideally coordinated version of PAR-BS with instantaneous knowledge of every other controller. ATLAS outperforms even this unrealistic version of PAR-BS.
Varying Memory Intensity
Workloads composed of varying numbers of memory-intensive threads.
ATLAS performs best for a balanced-mix workload: it distinguishes memory-non-intensive threads and prioritizes them over memory-intensive threads.
ATLAS With and Without Coordination
ATLAS provides ~1% improvement over ATLASu (uncoordinated ATLAS); ATLAS is specifically designed to operate with little coordination.
PAR-BSc provides ~3% improvement over PAR-BS.
ATLAS Algorithmic Parameters
Empirically determined quantum length: 10M cycles. When the quantum length is large: coordination is infrequent, even long memory episodes can fit within the quantum, and the thread ranking is stable.
ATLAS Algorithmic Parameters
Empirically determined history weight: α = 0.875. When the history weight is large: attained service is retained for longer, TotalAS changes slowly over time, and memory-intensive threads are better distinguished.
ATLAS Algorithmic Parameters
Empirically determined starvation threshold: T = 100K cycles.
When T is too low: system performance is degraded (behavior becomes similar to FCFS).
When T is too high: the starvation-freedom guarantee is lost.
Multi-threaded Applications
In our evaluations, all threads are single-threaded applications.
For multi-threaded applications, we can generalize the notion of attained service: the attained service of multiple threads = the sum or average of their individual attained service.
Within-application ranking is future work.
Unfairness in ATLAS
Assuming threads have equal priority, ATLAS maximizes system throughput by prioritizing shorter jobs at the expense of longer jobs.
ATLAS increases unfairness: maximum slowdown worsens by 20.3%, and harmonic speedup drops by 10.3%.
If threads have different priorities, system software can ensure fairness/QoS by appropriately configuring the thread priorities supported by ATLAS.
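For reference, the usual definitions of these metrics in the multi-core scheduling literature (our addition; the slides do not restate them):

$$\text{Max. slowdown} = \max_i \frac{T_i^{\text{shared}}}{T_i^{\text{alone}}}, \qquad \text{Harmonic speedup} = \frac{N}{\sum_{i=1}^{N} \mathrm{IPC}_i^{\text{alone}} / \mathrm{IPC}_i^{\text{shared}}}$$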
System Software Support
ATLAS enforces system priorities, i.e., thread weights, with a linear relationship between thread weight and speedup.
System Parameters
ATLAS performance on systems with varying cache sizes and memory timings: ATLAS's performance benefit increases as contention for memory increases.
Handling Prefetches
Attained service encourages efficient prefetch behavior: prefetches disrupt the memory accesses of other threads while being less useful, and attained service can be thought of as a fee charged for them.
This is orthogonal to Prefetch-Aware DRAM Controllers [Lee+, MICRO08]. PADC tracks the accuracy of prefetches: if accurate, prefetches and demands get the same priority; if inaccurate, prefetches are deprioritized.
ATLAS + PADC: if accurate, treat prefetches as demands (i.e., increment attained service); if inaccurate, deprioritize prefetches.
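A minimal sketch of the combined policy (hypothetical names; the accuracy threshold and the choice not to charge inaccurate prefetches are our simplifications):

```python
ACCURACY_THRESHOLD = 0.5  # assumed cutoff between "accurate" and "inaccurate"

def account_and_prioritize(req, prefetch_accuracy, attained_service):
    """ATLAS + PADC glue: accurate prefetches are charged like demands,
    while inaccurate prefetches are simply deprioritized."""
    if req.is_prefetch and prefetch_accuracy[req.thread_id] < ACCURACY_THRESHOLD:
        req.deprioritized = True  # inaccurate prefetch: lowest priority
    else:
        # A demand, or an accurate prefetch treated as a demand:
        # charge its service time to the thread's attained service.
        attained_service[req.thread_id] += req.service_cycles
```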
Memory Addressing Scheme
Evaluated memory addressing schemes: block-interleaving (3.5% improvement) and row-interleaving (8.4% improvement).
Block-interleaving has more bank-level parallelism that ATLAS can exploit.
Row-Buffer Locality?
A balance exists between row-buffer locality and bank-level parallelism: emphasizing row-buffer locality makes bank-level parallelism suffer. We evaluated such an algorithm and saw that performance was worse.
We try to achieve a balance, but place more emphasis on bank-level parallelism; within a thread, however, we do preserve row-buffer locality.
GPU
ATLAS will remain applicable as long as threads are competing with each other.
PCM
ATLAS is applicable to any kind of memory technology that has banks and contention between threads.
Difference Between ATLAS and PAR-BS
Attained service: not utilized by PAR-BS; ATLAS leverages the least-attained-service principle from scheduling theory.
Long-term thread behavior: PAR-BS forms batches based on the instantaneous state of the request buffer, so memory episodes are broken across batch boundaries; ATLAS monitors long-term thread behavior.
Scalability: PAR-BS needs to coordinate frequently and is more complex, tracking the maximum number of requests to any bank and the total number of requests.
Non-Pareto-Distributed Benchmarks
Three benchmarks were not Pareto distributed. For these, we cannot take advantage of the short-term-behavior optimizations, i.e., returning quickly to the compute episode.
However, we can still take advantage of long-term thread behavior: we take into account the memory intensity of threads and favor non-intensive threads, which leads to higher system performance.
Writes
Writes are handled traditionally, using write buffers, so the read-write switch penalty is not incurred every time: whenever a write buffer becomes close to full, the buffered writes are issued.
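A minimal sketch of that drain policy (the buffer size and watermarks are illustrative assumptions):

```python
HIGH_WATERMARK = 56  # "close to full" for a 64-entry write buffer: start draining
LOW_WATERMARK = 8    # drain down to here before switching back to reads

def next_request(read_queue, write_buffer, state):
    """Serve reads normally; once the write buffer nears full, pay the
    read-write turnaround once and issue the buffered writes in a burst."""
    if len(write_buffer) >= HIGH_WATERMARK:
        state["draining"] = True
    if state["draining"]:
        if len(write_buffer) > LOW_WATERMARK:
            return write_buffer.pop(0)  # issue writes back-to-back
        state["draining"] = False       # buffer drained: resume reads
    return read_queue.pop(0) if read_queue else None
```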