1
PhD Defense Presentation
Managing Shared Resources in Chip Multiprocessor Memory Systems
12 October 2010
Magnus Jahre
2
Outline
• Chip Multiprocessors (CMPs)
• CMP Resource Management
• Miss Bandwidth Management
  – Greedy Miss Bandwidth Management
  – Interference Measurement
  – Model-Based Miss Bandwidth Management
• Off-Chip Bandwidth Management
• Conclusion
3
CHIP MULTIPROCESSORS
4
Historical Processor Performance
[Chart: normalized value (log scale, 1 to 1,000,000) vs. year, 1978–2010, for processor performance and components per chip (Moore's Law)]
Moore's Law: 50% annual increase in the number of components per chip
Technology scaling was first used to increase clock frequency (52% annual performance increase) and is now used to add processor cores (20% annual performance increase)
Aggregate performance still follows Moore's Law
5
Power Dissipation Limits Practical Clock Frequency
[Charts: thermal design power (W) and clock frequency (GHz) for Intel processor families from Pentium through Core i7]
Source: Wikipedia, List of CPU Power Dissipation, Retrieved 02.06.10
Power Wall: technology scaling can no longer be used to increase clock frequency; instead it is used to add processor cores
6
Chip Multiprocessors (CMPs)
• CMPs utilize chip resources with a constant power budget
• How does technology scaling impact CMPs?
Intel Nehalem
7
Projected Number of Cores
[Chart: projected number of cores (0–90) vs. ITRS year of production, 2007–2016]
ITRS expects a 40% annual increase in the number of cores
Observation 1: Multiprogramming can provide near-term throughput improvement
Observation 2: Software parallelism is needed in the long term
8
Processor Memory Gap
[Chart: relative performance (log scale) vs. year, 1978–2010, for processor performance and main memory latency; memory latency improves by only 7% annually, creating the Memory Wall]
Observation 3: Latency hiding techniques are necessary
9
Performance vs. Bandwidth
[Chart: relative performance/bandwidth vs. ITRS year of production, 2007–2015, for processor performance and off-chip bandwidth]
Observation 4: Bandwidth must be used efficiently
10
Application Trends: software parallelism and multiprogramming mean that concurrent applications share hardware
Hardware Trends: latency hiding and bandwidth efficiency lead to complex memory systems
Together, these trends call for Shared Resource Management
11
CMP RESOURCE MANAGEMENT
12
Why Manage Shared Resources?
Provide predictable performance
Support OS scheduler assumptions
Cloud: Fulfill Service Level Agreement
13
Performance Variability Metrics
• Fairness
  – The performance reduction due to interference between processes is distributed across all processes in proportion to their priorities
  – Equal priorities: the performance reduction from sharing affects all processes equally
• Quality of Service
  – The performance of a process never drops below a certain limit, regardless of the behavior of co-scheduled processes
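A minimal sketch of how a fairness value in [0, 1] can be computed, assuming the common formulation where fairness is the ratio between the smallest and largest relative slowdown across processes (the exact metric used in the thesis may differ):

```python
def fairness(shared_ipc, private_ipc):
    """Fairness in [0, 1] for equal-priority processes.

    slowdown_i = private_ipc_i / shared_ipc_i; perfect fairness (1.0)
    means every process is slowed down by the same factor.
    """
    slowdowns = [p / s for p, s in zip(private_ipc, shared_ipc)]
    return min(slowdowns) / max(slowdowns)

# Example: two cores, one slowed 2x and one slowed 4x by sharing
print(fairness(shared_ipc=[0.5, 0.25], private_ipc=[1.0, 1.0]))  # 0.5
```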
14
Performance Variability (Fairness)
[Chart: lowest fairness value (0–1.2) vs. number of workloads (1–40) for crossbar- and ring-based interconnects with 1, 2, and 4 memory channels]
Paper B.I
15
Resource Management Tasks
Measurement
Allocation (Policy)
Enforcement (Mechanism)
16
Contributions
Miss Bandwidth Management:
• Off-line Interference Measurement
• On-line Interference Measurement
• Dynamic Miss Handling Architecture
• Greedy Miss Bandwidth Allocation
• Performance Model Based Miss Bandwidth Allocation
Prefetch Scheduling:
• Low-Cost Open Page Prefetching
• Opportunistic Prefetch Scheduling
17
GREEDY MISS BANDWIDTH MANAGEMENT
Miss Bandwidth Management
18
Conventional Resource Allocation Implementation
[Diagram: four CPUs with private I- and D-caches (private memory system) connect through a crossbar to a shared cache, memory controller, memory bus, and main memory; measurement, allocation, and enforcement are placed in the shared memory system]
19
Alternative Resource Allocation Implementation
[Diagram: the same 4-core CMP, but measurement, allocation, and enforcement are moved to the private memory system, where a Dynamic Miss Handling Architecture limits each core's number of outstanding misses (4 MSHRs in the example)]
20
Dynamic Miss Handling Architecture
[Diagram: accesses A–E reach the cache; each miss allocates an entry (address, target info., used bit) in the Miss Handling Architecture (MHA); when all MSHR entries are in use, the cache blocks until an entry is freed]
A DMHA controls the number of concurrent shared memory system requests that are allowed for each processor
21
Greedy Miss Bandwidth Management
• Idea: Reduce the number of MSHRs if a metric exceeds a certain threshold
• Metrics:
  – Paper A.II: Memory bus utilization
  – Paper A.III: Simple interference counters (Interference Points)
• Performance feedback avoids excessive performance degradation (a sketch of the control loop follows below)
Paper A.II and A.III
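A minimal sketch of how such a greedy controller could operate, assuming a periodic control loop, a bus-utilization metric, and hypothetical thresholds (the names and values are illustrative, not the parameters used in Papers A.II/A.III):

```python
MAX_MSHRS = 16          # hardware MSHRs per private last-level cache
UTIL_THRESHOLD = 0.9    # throttle miss bandwidth above this bus utilization
MAX_SLOWDOWN = 0.05     # roll back if a core loses more than 5% performance

class GreedyMSHRController:
    def __init__(self, num_cores):
        self.mshrs = [MAX_MSHRS] * num_cores

    def adjust(self, bus_utilization, slowdown):
        """Called once per measurement interval.

        bus_utilization: fraction of cycles the memory bus was busy
        slowdown: per-core performance loss relative to the previous interval
        """
        for core in range(len(self.mshrs)):
            if bus_utilization > UTIL_THRESHOLD and self.mshrs[core] > 1:
                # Greedy step: throttle miss bandwidth by halving the MSHRs
                self.mshrs[core] = max(1, self.mshrs[core] // 2)
            elif slowdown[core] > MAX_SLOWDOWN:
                # Performance feedback: give bandwidth back to avoid
                # excessive performance degradation
                self.mshrs[core] = min(MAX_MSHRS, self.mshrs[core] * 2)
        return self.mshrs
```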
22
INTERFERENCE MEASUREMENT
Miss Bandwidth Management
23
Resource Allocation Baselines
Baseline = Interference-free configuration
Quantify performance impact from interference
Private Mode and Shared Mode
24
Interference Definition
Interference is the difference between the shared mode latency and the private mode latency
Since the private mode latency must be estimated online, the private mode latency estimate differs from the private mode latency measurement by an estimate error
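In equation form (the symbols below are introduced here for illustration and are not necessarily the thesis' notation):

```latex
\text{Interference} = L^{\text{shared}} - L^{\text{private}}, \qquad
\widehat{\text{Interference}} = L^{\text{shared}} - \hat{L}^{\text{private}}, \qquad
\text{Estimate error} = \hat{L}^{\text{private}} - L^{\text{private}}
```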
25
Offline Interference Measurement
Interference Penalty Frequency (IPF) counts the number of requests that experienced an interference latency of i cycles
Interference Impact Factor (IIF) is the interference latency times the probability of it arising, i.e. IIF(i) = i ∙ P(i)
Paper B.I
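A minimal sketch of the relationship between the two quantities, assuming P(i) is taken as the fraction of requests that saw interference latency i (illustrative only):

```python
def iif_from_ipf(ipf):
    """ipf: dict mapping interference latency i (cycles) -> request count.
    Returns IIF(i) = i * P(i) for each observed latency."""
    total = sum(ipf.values())
    return {i: i * (count / total) for i, count in ipf.items()}

# Example: 80 requests saw 0 cycles of interference, 20 saw 100 cycles
print(iif_from_ipf({0: 80, 100: 20}))  # {0: 0.0, 100: 20.0}
```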
26
Aggregate Interference Impact
Paper B.I
[Chart: aggregate interference impact (0–300) for crossbar (CB) and ring interconnects with 1, 2, and 4 memory channels on 4-, 8-, and 16-core CMPs, broken down into memory bus, cache, and interconnect interference]
27
Resource Management Baselines
Multiprogrammed Baseline (MPB): each processor runs alone with a static, equal share of the shared cache, interconnect, and memory bus
Single Program Baseline (SPB): each processor runs alone with the full shared cache, interconnect, and memory bus
28
Baseline Weaknesses
• Multiprogrammed Baseline
  – Only accounts for interference in partitioned resources
  – Static and equal division of DRAM bandwidth does not give equal latency
  – Complex relationship between resource allocation and performance
• Single Program Baseline
  – Does not exist in shared mode
Online Interference Measurement: Dynamic Interference Estimation Framework (DIEF)
Paper B.II
29
Online Interference Measurement
• Dynamic Interference Estimation Framework (DIEF)
• Estimates private mode average memory latency
• General, component-based framework
Paper B.II
30
Shared Cache Interference: Auxiliary Tag Directories
[Diagram: the cache accesses of CPU 0 and CPU 1 go to the shared cache and, per CPU, to an auxiliary tag directory that tracks the contents that CPU's cache would have had in private mode]
An eviction may or may not be interference: it is interference only when the evicted block would still have been present in the private mode cache
Interference latency cost = miss penalty
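A minimal sketch of the idea, assuming simple fully associative LRU tag structures (sizes and names are illustrative, not the thesis' implementation):

```python
from collections import OrderedDict

class LRUTags:
    """Tag store with LRU replacement (models either the shared cache as
    seen by one CPU or its private-mode auxiliary tag directory)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tags = OrderedDict()

    def access(self, addr):
        hit = addr in self.tags
        if hit:
            self.tags.move_to_end(addr)
        else:
            if len(self.tags) >= self.capacity:
                self.tags.popitem(last=False)   # evict LRU block
            self.tags[addr] = True
        return hit

def is_interference_miss(shared_hit, atd_hit):
    """A shared-cache miss is interference if the auxiliary tag directory
    (private mode) would have hit; its latency cost is the miss penalty."""
    return (not shared_hit) and atd_hit
```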
31
Bus Interference Requirements
• Out-of-order memory bus scheduling
• Shared mode only cache misses and cache hits
• Shared cache writebacks
Computing private latency based on shared mode queue contents is difficult
Emulate private scheduling in the shared mode
32
Memory Latency Estimation Buffer
[Diagram: requests from the shared bus queue (arrival order E, D, C, B, A with a head pointer) are re-ordered into a private mode execution order; open page emulation registers hold the open row for each bank, a latency lookup table supplies per-request latencies, and a bank/page mapping (A → (0,15), B → (0,19), C → (0,15), D → (1,32)) is used to emulate private mode scheduling; the estimated queue latency is the sum of the emulated per-request latencies]
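A minimal sketch of the emulation idea, assuming a simple open-page timing model with hypothetical latencies (not the thesis' actual DRAM parameters):

```python
# Hypothetical open-page DRAM timings (cycles)
ROW_HIT_LATENCY = 40      # column access to an already-open row
ROW_MISS_LATENCY = 120    # precharge + activate + column access

def estimate_private_queue_latency(requests):
    """requests: list of (bank, row) pairs from one CPU, in private mode
    execution order. Emulates open-page scheduling: a request to the row
    already open in its bank is a row hit, otherwise a row miss."""
    open_rows = {}            # emulation registers: bank -> open row
    total = 0
    for bank, row in requests:
        if open_rows.get(bank) == row:
            total += ROW_HIT_LATENCY
        else:
            total += ROW_MISS_LATENCY
            open_rows[bank] = row
    return total

# Example using the slide's bank/page mapping for A, C, and D
print(estimate_private_queue_latency([(0, 15), (0, 15), (1, 32)]))  # 280
```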
33
MODEL-BASED MISS BANDWIDTH MANAGEMENT
Miss Bandwidth Management
34
Model-Based Miss Bandwidth Allocation
DIEF provides accurate estimates of the average private mode memory latency
Can we use the estimates provided by DIEF to choose miss bandwidth allocations?
We need a model that relates average memory latency to performance
Paper A.IV
35
Performance Model
Paper A.IV
Observation: The memory latency performance impact depends on the parallelism of memory requests
The parallelism of memory requests is very similar in private and shared mode
Shared mode measurements can therefore provide private mode performance estimates
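One way such a latency/parallelism model can be written (an illustrative form, not necessarily the exact model of Paper A.IV): stall time scales with the number of memory requests times the average latency, divided by the average request parallelism.

```python
def estimate_private_ipc(committed_insts, compute_cycles, num_requests,
                         avg_private_latency, avg_parallelism):
    """Illustrative latency/parallelism performance model.

    Shared-mode measurements give committed instructions, compute
    (non-stall) cycles, the number of memory requests, and their average
    parallelism; DIEF supplies the estimated private mode average latency.
    """
    est_stall_cycles = num_requests * avg_private_latency / avg_parallelism
    est_total_cycles = compute_cycles + est_stall_cycles
    return committed_insts / est_total_cycles
```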
36
Bandwidth Management Flow
Paper A.IV
Measurement → Modeling → Allocation
Measurement: shared mode memory latency, private mode memory latency, committed instructions, number of memory requests, CPU stall time
Modeling: per-CPU models feed a performance metric model
Allocation: find the MSHR allocation that maximizes the chosen performance metric, then set the number of MSHRs for all last-level private caches
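A minimal sketch of the allocation step, assuming an exhaustive search over a small set of per-core MSHR counts and a pluggable metric (the names and the search strategy are illustrative, not the thesis' allocation algorithm):

```python
from itertools import product

MSHR_OPTIONS = (1, 2, 4, 8, 16)   # candidate per-core MSHR counts

def choose_allocation(num_cores, predict_ipcs, metric):
    """predict_ipcs(allocation) -> per-core IPC estimates from the
    performance model; metric(ipcs) -> scalar to maximize (e.g. sum for
    system throughput, or a fairness-oriented metric)."""
    best_alloc, best_value = None, float("-inf")
    for alloc in product(MSHR_OPTIONS, repeat=num_cores):
        value = metric(predict_ipcs(alloc))
        if value > best_value:
            best_alloc, best_value = alloc, value
    return best_alloc
```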
37
OFF-CHIP BANDWIDTH MANAGEMENT
38
Modern DRAM Interfaces
• Maximize bandwidth with 3D organization
• Repeated requests to the row buffer are very efficient
[Diagram: DRAM organized as banks of rows and columns; a row address activates one row into the row buffer and a column address selects data within it]
39
Low-Cost Open Page Prefetching
• Idea: Piggyback prefetches to open DRAM pages on demand reads
• Performance win if prefetcher accuracy is above ~40%
Paper C.I
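A minimal sketch of the piggybacking idea, assuming a hook that is called on every demand read and a hypothetical issue_prefetch interface (names are illustrative, not the Paper C.I implementation):

```python
PREFETCH_DEGREE = 1   # extra cache lines fetched from the open row

def on_demand_read(bank, row, column, open_rows, issue_prefetch):
    """Piggyback prefetches on a demand read: if the read targets the row
    that is (or becomes) open in its bank, adjacent columns can be fetched
    without extra activate/precharge commands."""
    open_rows[bank] = row                      # demand read opens/keeps the row
    for i in range(1, PREFETCH_DEGREE + 1):
        issue_prefetch(bank, row, column + i)  # row hit: cheap extra transfer
```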
40
Opportunistic Prefetch Scheduling
Page Vector Table (PVT)
[Diagram: the PVT tracks demand accesses and prefetch requests to pages 99–102; queued prefetches are issued when their page is closed]
Idea: Issue prefetches when a page is closed
Increased efficiency: 8 transfers for 3 activations
Paper C.II
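A minimal sketch of the scheduling idea, assuming a per-page set of pending prefetches and a hook called when the memory controller is about to close a page (all names are illustrative, not the Paper C.II implementation):

```python
from collections import defaultdict

class PageVectorTable:
    """Tracks pending prefetch requests per DRAM page so they can be
    issued opportunistically just before the page is closed."""
    def __init__(self):
        self.pending = defaultdict(set)   # page -> set of pending columns

    def record_prefetch(self, page, column):
        self.pending[page].add(column)

    def on_page_close(self, page, issue_prefetch):
        # Issue all queued prefetches while the page is still open,
        # so they cost only extra transfers, not extra activations.
        for column in sorted(self.pending.pop(page, ())):
            issue_prefetch(page, column)
```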
41
CONCLUSION
42
Conclusion
• Managing bandwidth allocations can improve CMP system performance
• Miss bandwidth management
  – Greedy allocations
  – Management guided by accurate measurements and performance models
• Off-chip bandwidth management with prefetching
43
Thank You
Visit our website: http://research.idi.ntnu.no/multicore/
44
EXTRA SLIDES
45
Future Work
• Performance-directed management of shared caches and the memory bus
• Improving OS and system software with dynamic measurements
• Combining dynamic MHAs with prefetching to improve system performance
• Managing workloads of single-threaded and multi-threaded benchmarks
46
Example Chip Multiprocessor
[Diagram: four CPUs with private I- and D-caches (private memory system) connect through an interconnect to a shared cache, memory controller, memory bus, and main memory (shared memory system)]