ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers
Yoongu Kim, Dongsu Han, Onur Mutlu, Mor Harchol-Balter
Motivation
Modern multi-core systems employ multiple memory controllers
Applications contend with each other in multiple controllers
How to perform memory scheduling for multiple controllers?
Desired Properties of Memory Scheduling Algorithm
Maximize system performance, without starving any cores
Configurable by system software, to enforce thread priorities and QoS/fairness policies
Scalable to a large number of controllers: should not require significant coordination between controllers
No previous scheduling algorithm satisfies all these requirements
Multiple Memory Controllers
[Figure: a single-MC system (cores sharing one MC and its memory) vs. a multiple-MC system (cores sharing several MCs, each attached to its own memory). The difference: the need for coordination.]
Thread Ranking in Single-MC
Assume all requests go to the same bank. Number of requests: Thread 1 < Thread 2, so Thread 1 is the shorter job.
[Figure: memory service timeline of Thread 1's and Thread 2's requests in MC 1, and the resulting execution timeline with each thread's stall time.]
Thread ranking: Thread 1 > Thread 2, i.e., the shorter job is assigned the higher rank. This yields the optimal average stall time: 2T.
Thread Ranking in Multiple-MC
[Figure: memory service timelines for MC 1 and MC 2 under uncoordinated vs. coordinated ranking, and the resulting execution timelines.]
Uncoordinated: MC 1's shorter job is Thread 1, but the global shorter job is Thread 2. MC 1 incorrectly assigns the higher rank to Thread 1. Average stall time: 3T.
Coordinated: the global shorter job is Thread 2, and MC 1 correctly assigns the higher rank to Thread 2. Average stall time: 2.5T (cycles saved!).
Coordination leads to better scheduling decisions.
Coordination Limits Scalability
[Figure: four MCs coordinating either pairwise (MC-to-MC) or through a centralized meta-MC; coordination consumes bandwidth.]
To be scalable, coordination should exchange little information and occur infrequently.
The Problem and Our Goal
Problem: previous memory scheduling algorithms are not scalable to many controllers. They were not designed for multiple MCs and require significant coordination.
Our goal: fundamentally redesign the memory scheduling algorithm such that it provides high system throughput and requires little or no coordination among MCs.
Outline
Motivation
Rethinking Memory Scheduling: Minimizing Memory Episode Time
ATLAS: Least Attained Service Memory Scheduling; Thread Ranking; Request Prioritization Rules; Coordination
Evaluation
Conclusion
Rethinking Memory Scheduling
A thread alternates between two states (episodes):
Compute episode: zero outstanding memory requests, high IPC
Memory episode: non-zero outstanding memory requests, low IPC
Goal: minimize time spent in memory episodes
[Figure: outstanding memory requests over time, alternating between memory episodes and compute episodes.]
How to Minimize Memory Episode Time
Prioritize the thread whose memory episode will end the soonest. This minimizes the time spent in memory episodes across all threads.
Supported by queueing theory: Shortest-Remaining-Processing-Time (SRPT) scheduling is optimal in a single-server queue.
But what is the remaining length of a memory episode?
[Figure: outstanding memory requests over time; for an in-progress episode, how much longer will it last?]
Predicting Memory Episode Lengths
We discovered: the past is an excellent predictor of the future.
Large attained service ⇒ large expected remaining service.
Q: Why? A: Memory episode lengths are Pareto distributed…
[Figure: outstanding memory requests over time; an episode's attained service (the past) predicts its remaining service (the future).]
Pareto Distribution of Memory Episode Lengths
[Figure: Pr{Mem. episode > x} vs. x (cycles) for 401.bzip2; the memory episode lengths of SPEC benchmarks follow a Pareto distribution.]
Attained service correlates with remaining service: the longer an episode has lasted, the longer it will last further.
Favoring the least-attained-service memory episode = favoring the memory episode which will end the soonest.
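To make this concrete, here is a short derivation (our addition, not from the slides), assuming episode lengths X follow a Pareto distribution with minimum x_m and shape k > 1:

$$\Pr\{X > x\} = \left(\frac{x_m}{x}\right)^{k}, \quad x \ge x_m$$

$$\mathbb{E}[X \mid X > x] = \frac{k}{k-1}\,x \quad\Rightarrow\quad \mathbb{E}[X - x \mid X > x] = \frac{x}{k-1}$$

The expected remaining length grows linearly with the length already attained, so the episode with the least attained service is the one expected to end soonest.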
Least Attained Service (LAS) Memory Scheduling
Queueing theory: prioritize the job with the shortest remaining processing time (provably optimal), i.e., prioritize the memory episode with the least remaining service.
Our approach: remaining service correlates with attained service, and attained service is tracked by a per-thread counter, so prioritize the memory episode with the least attained service.
Least-attained-service (LAS) scheduling minimizes memory episode time. However, LAS does not consider long-term thread behavior.
Long-Term Thread Behavior
[Figure: short-term vs. long-term behavior of Thread 1 and Thread 2, each alternating between memory episodes and compute episodes.]
Short-term behavior alone would give the thread with the short memory episode higher priority and the thread with the long memory episode lower priority. Over the long term, however, prioritizing Thread 2 is more beneficial: it results in very long stretches of compute episodes.
Quantum-Based Attained Service of a Thread
We divide time into large, fixed-length intervals: quanta (millions of cycles).
[Figure: outstanding memory requests over time. Attained service measured within a single episode reflects short-term thread behavior; attained service accumulated over a quantum reflects long-term thread behavior.]
LAS Thread Ranking
During a quantum: each thread's attained service (AS) is tracked by the MCs. AS_i = a thread's AS during only the i-th quantum.
End of a quantum: each thread's TotalAS is computed as TotalAS_i = α · TotalAS_(i-1) + (1 − α) · AS_i, where a high α biases more towards history.
Next quantum: threads are ranked, favoring threads with lower TotalAS, and are serviced according to their ranking.
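As a concrete illustration, here is a minimal Python sketch of the end-of-quantum update (variable names are our own; α = 0.875 is the empirically determined value from the backup slides):

```python
ALPHA = 0.875  # history weight: a high ALPHA biases the ranking towards history

def end_of_quantum(total_as, quantum_as):
    """Apply TotalAS_i = ALPHA * TotalAS_(i-1) + (1 - ALPHA) * AS_i per thread,
    then rank threads so that a lower TotalAS means a higher rank."""
    for tid in total_as:
        total_as[tid] = ALPHA * total_as[tid] + (1 - ALPHA) * quantum_as[tid]
        quantum_as[tid] = 0  # AS_i counts service during the i-th quantum only
    # Threads earlier in this list are serviced first in the next quantum
    return sorted(total_as, key=total_as.get)
```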
ATLAS Scheduling Algorithm
ATLAS: Adaptive per-Thread Least Attained Service
Request prioritization order (see the sketch below):
1. Prevent starvation: over-threshold request
2. Maximize performance: higher LAS rank
3. Exploit locality: row-hit request
4. Tie-breaker: oldest request
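A minimal sketch of these four rules as a single sort key (field and variable names are illustrative; the 100K-cycle threshold is the empirical value from the backup slides):

```python
STARVATION_THRESHOLD = 100_000  # cycles; requests older than this go first

def priority_key(req, now_cycle, rank_of_thread):
    """Encode the four prioritization rules in order; smaller tuples win."""
    starving = (now_cycle - req.arrival_cycle) > STARVATION_THRESHOLD
    return (
        0 if starving else 1,           # 1. prevent starvation: over-threshold request
        rank_of_thread[req.thread_id],  # 2. maximize performance: higher LAS rank
        0 if req.row_hit else 1,        # 3. exploit locality: row-hit request
        req.arrival_cycle,              # 4. tie-breaker: oldest request
    )

# Each scheduling decision:
# next_req = min(request_buffer, key=lambda r: priority_key(r, now, ranking))
```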
How to coordinate MCs to agree upon a consistent ranking?
ATLAS Coordination Mechanism
During a quantum: each MC increments the local AS of each thread.
End of a quantum: each MC sends the local AS of each thread to a centralized meta-MC; the meta-MC accumulates the local AS values, calculates the ranking, and broadcasts the ranking to all MCs.
Result: a consistent thread ranking.
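A minimal sketch of the meta-MC's end-of-quantum step (our illustration, reusing the TotalAS update from above):

```python
def meta_mc_end_of_quantum(local_as_per_mc, total_as, alpha=0.875):
    """Accumulate the per-thread local AS reported by every MC, update each
    thread's TotalAS, and return the ranking broadcast back to all MCs."""
    for tid in total_as:
        as_i = sum(local_as[tid] for local_as in local_as_per_mc)  # accumulate
        total_as[tid] = alpha * total_as[tid] + (1 - alpha) * as_i
    return sorted(total_as, key=total_as.get)  # consistent ranking, least AS first
```

Because this exchange happens only once per quantum (millions of cycles) and carries one counter per thread per MC, its bandwidth cost is negligible.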
Coordination Cost in ATLAS
How costly is coordination in ATLAS?

                         ATLAS                                PAR-BS (previous best work [ISCA08])
How often?               Very infrequently: every quantum     Frequently: every batch boundary
                         boundary (10M cycles)                (thousands of cycles)
Sensitive to             Insensitive: coordination latency    Sensitive: coordination latency
coordination latency?    << quantum length                    ~ batch length
Properties of ATLAS
Goal: maximize system performance → LAS ranking, bank-level parallelism, row-buffer locality
Goal: scalable to a large number of controllers → very infrequent coordination
Goal: configurable by system software → scale attained service with thread weight (in paper)
Low complexity: attained service requires a single counter per thread in each MC
Evaluation Methodology
4, 8, 16, 24, 32-core systems: 5 GHz processor, 128-entry instruction window, 512 KB per-core private L2 caches
1, 2, 4, 8, 16-MC systems: 128-entry memory request buffer, 4 banks, 2 KB row buffer, 40 ns (200-cycle) row-hit round-trip latency, 80 ns (400-cycle) row-conflict round-trip latency
Workloads: multiprogrammed SPEC CPU2006 applications; 32 program combinations for the 4, 8, 16, 24, 32-core experiments
Comparison to Previous Scheduling Algorithms
FCFS, FR-FCFS [Rixner+, ISCA00]: oldest-first, row-hit-first. Low multi-core performance: they do not distinguish between threads.
Network Fair Queueing [Nesbit+, MICRO06]: partitions memory bandwidth equally among threads. Low system performance: bank-level parallelism and locality are not exploited.
Stall-Time Fair Memory Scheduler [Mutlu+, MICRO07]: balances thread slowdowns relative to when each thread runs alone. High coordination costs: requires heavy cycle-by-cycle coordination.
Parallelism-Aware Batch Scheduler [Mutlu+, ISCA08]: batches requests and performs thread ranking to preserve bank-level parallelism. High coordination costs: batch duration is very short.
System Throughput: 24-Core System
[Figure: system throughput (= ∑ speedup) vs. number of memory controllers (1, 2, 4, 8, 16) for FCFS, FR-FCFS, STFM, PAR-BS, and ATLAS. ATLAS's improvement over the best previous algorithm: 17.0%, 9.8%, 8.4%, 5.9%, and 3.5% at 1, 2, 4, 8, and 16 MCs, respectively.]
ATLAS consistently provides higher system throughput than all previous scheduling algorithms.
System Throughput: 4-MC System
[Figure: system throughput vs. number of cores (4, 8, 16, 24, 32) for PAR-BS and ATLAS. ATLAS's improvement over PAR-BS: 1.1%, 3.5%, 4.0%, 8.4%, and 10.8% at 4, 8, 16, 24, and 32 cores, respectively.]
As the number of cores increases, ATLAS's performance benefit increases.
Other Evaluations in the Paper
System software support: ATLAS effectively enforces thread weights
Workload analysis: ATLAS performs best for mixed-intensity workloads
Effect of ATLAS on fairness
Sensitivity to algorithmic parameters
Sensitivity to system parameters: memory address mapping, cache size, memory latency
Conclusions
Multiple memory controllers require coordination: they need to agree upon a consistent ranking of threads.
ATLAS is a fundamentally new approach to memory scheduling:
Scalable: thread ranking decisions at coarse-grained intervals
High-performance: minimizes the system time spent in memory episodes (the least-attained-service scheduling principle)
Configurable: enforces thread priorities
ATLAS provides the highest system throughput compared to five previous scheduling algorithms, and its performance benefit increases as the number of cores increases.
THANK YOU.
QUESTIONS?
Hardware Cost
Additional hardware storage for a 24-core, 4-MC system: 9 Kb
Not on the critical path of execution
System Configuration
DDR2-800 memory timing
Benchmark Characteristics
Benchmarks vary wildly in memory intensity, with orders-of-magnitude differences. Prioritizing the "light" threads over the "heavy" threads is critical.
Workload-by-Workload Results
[Figure: results for each individual workload.]
Results on a 24-Core System
24-core system with a varying number of controllers.
PAR-BSc: an ideally coordinated version of PAR-BS with instantaneous knowledge of every other controller. ATLAS outperforms even this unrealistic version of PAR-BS.
Results on X-Core, Y-MC Systems
Systems where both the number of cores and the number of controllers are varied.
PAR-BSc: an ideally coordinated version of PAR-BS with instantaneous knowledge of every other controller. ATLAS outperforms even this unrealistic version of PAR-BS.
Varying Memory Intensity
Workloads composed of varying numbers of memory-intensive threads.
ATLAS performs best for a balanced-mix workload: it distinguishes memory-non-intensive threads and prioritizes them over memory-intensive threads.
ATLAS With and Without Coordination
ATLAS provides ~1% improvement over ATLASu (uncoordinated ATLAS); ATLAS is specifically designed to operate with little coordination.
PAR-BSc provides ~3% improvement over PAR-BS.
ATLAS Algorithmic Parameters
Empirically determined quantum length: 10M cycles. When the quantum length is large: coordination is infrequent, even long memory episodes can fit within the quantum, and the thread ranking is stable.
ATLAS Algorithmic Parameters
Empirically determined history weight: α = 0.875. When the history weight is large: attained service is retained for longer, TotalAS changes slowly over time, and memory-intensive threads are better distinguished.
ATLAS Algorithmic Parameters
Empirically determined starvation threshold: T = 100K cycles.
When T is too low: system performance is degraded (behavior becomes similar to FCFS).
When T is too high: the starvation-freedom guarantee is lost.
Multi-threaded Applications
In our evaluations, all threads are single-threaded applications.
For multi-threaded applications, we can generalize the notion of attained service: the attained service of multiple threads = the sum or average of their individual attained service.
Within-application ranking is future work.
Unfairness in ATLAS
Assuming threads have equal priority, ATLAS maximizes system throughput by prioritizing shorter jobs at the expense of longer jobs.
ATLAS increases unfairness: maximum slowdown worsens by 20.3%, and harmonic speedup drops by 10.3%.
If threads have different priorities, system software can ensure fairness/QoS by appropriately configuring the thread priorities supported by ATLAS.
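For reference, the usual definitions of these metrics in the multi-core scheduling literature (our addition; the slides do not restate them):

$$\text{Max. slowdown} = \max_i \frac{T_i^{\text{shared}}}{T_i^{\text{alone}}}, \qquad \text{Harmonic speedup} = \frac{N}{\sum_{i=1}^{N} \mathrm{IPC}_i^{\text{alone}} / \mathrm{IPC}_i^{\text{shared}}}$$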
System Software Support
ATLAS enforces system priorities, i.e., thread weights, with a linear relationship between thread weight and speedup.
System Parameters
ATLAS performance on systems with varying cache sizes and memory timings: ATLAS's performance benefit increases as contention for memory increases.
Handling Prefetches
Attained service encourages efficient prefetch behavior: prefetches disrupt the memory accesses of other threads while being less useful, and attained service can be thought of as a fee charged for them.
This is orthogonal to Prefetch-Aware DRAM Controllers [Lee+, MICRO08]. PADC tracks the accuracy of prefetches: if accurate, prefetches and demands get the same priority; if inaccurate, prefetches are deprioritized.
ATLAS + PADC: if accurate, treat prefetches as demands (i.e., increment attained service); if inaccurate, deprioritize prefetches.
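A minimal sketch of the combined policy (hypothetical names; the accuracy threshold and the choice not to charge inaccurate prefetches are our simplifications):

```python
ACCURACY_THRESHOLD = 0.5  # assumed cutoff between "accurate" and "inaccurate"

def account_and_prioritize(req, prefetch_accuracy, attained_service):
    """ATLAS + PADC glue: accurate prefetches are charged like demands,
    while inaccurate prefetches are simply deprioritized."""
    if req.is_prefetch and prefetch_accuracy[req.thread_id] < ACCURACY_THRESHOLD:
        req.deprioritized = True  # inaccurate prefetch: lowest priority
    else:
        # A demand, or an accurate prefetch treated as a demand:
        # charge its service time to the thread's attained service.
        attained_service[req.thread_id] += req.service_cycles
```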
Memory Addressing Scheme
Evaluated memory addressing schemes: block-interleaving (3.5% improvement) and row-interleaving (8.4% improvement).
Block-interleaving has more bank-level parallelism that ATLAS can exploit.
Row-Buffer Locality?
A balance exists between row-buffer locality and bank-level parallelism: emphasizing row-buffer locality makes bank-level parallelism suffer. We evaluated such an algorithm and saw that performance was worse.
We try to achieve a balance, but place more emphasis on bank-level parallelism; within a thread, however, we do preserve row-buffer locality.
GPU
ATLAS will remain applicable as long as threads are competing with each other.
PCM
ATLAS is applicable to any kind of memory technology that has banks and contention between threads.
Difference Between ATLAS and PAR-BS
Attained service: not utilized by PAR-BS; ATLAS leverages the least-attained-service principle from scheduling theory.
Long-term thread behavior: PAR-BS forms batches based on the instantaneous state of the request buffer, so memory episodes are broken across batch boundaries; ATLAS monitors long-term thread behavior.
Scalability: PAR-BS needs to coordinate frequently and is more complex, tracking the maximum number of requests to any bank and the total number of requests.
Non-Pareto-Distributed Benchmarks
Three benchmarks were not Pareto distributed. For these, we cannot take advantage of the short-term-behavior optimizations, i.e., returning quickly to the compute episode.
However, we can still take advantage of long-term thread behavior: we take into account the memory intensity of threads and favor non-intensive threads, which leads to higher system performance.
Writes
Writes are handled traditionally, using write buffers, so the read-write switch penalty is not incurred every time: whenever a write buffer becomes close to full, the buffered writes are issued.
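A minimal sketch of that drain policy (the buffer size and watermarks are illustrative assumptions):

```python
HIGH_WATERMARK = 56  # "close to full" for a 64-entry write buffer: start draining
LOW_WATERMARK = 8    # drain down to here before switching back to reads

def next_request(read_queue, write_buffer, state):
    """Serve reads normally; once the write buffer nears full, pay the
    read-write turnaround once and issue the buffered writes in a burst."""
    if len(write_buffer) >= HIGH_WATERMARK:
        state["draining"] = True
    if state["draining"]:
        if len(write_buffer) > LOW_WATERMARK:
            return write_buffer.pop(0)  # issue writes back-to-back
        state["draining"] = False       # buffer drained: resume reads
    return read_queue.pop(0) if read_queue else None
```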