Writeback-Aware Bandwidth Partitioning for Multi-core Systems with PCM

Miao Zhou, Yu Du, Bruce Childers, Rami Melhem, Daniel Mossé
University of Pittsburgh
http://www.cs.pitt.edu/PCM
Introduction
- DRAM memory is not energy efficient: data centers are energy hungry, and DRAM consumes 20-40% of their energy
- Applying PCM as main memory is energy efficient, but PCM has slower reads, much slower writes, and a shorter lifetime
- Hybrid memory adds a DRAM cache to improve performance (lower LLC miss rate) and extend lifetime (lower LLC writeback rate)
How to manage the shared resources?
[Diagram: four cores C0-C3, each with private L1 and L2 caches, sharing a DRAM LLC in front of PCM main memory]
Shared Resource Management
- CMP systems share resources: the last-level cache and the memory bandwidth
- Unmanaged resources cause interference and poor performance
- Partitioning resources reduces interference and improves performance

                        DRAM main memory                Hybrid main memory
Cache Partitioning      UCP [Qureshi et al., MICRO-39]  WCP [Zhou et al., HiPEAC'12]
Bandwidth Partitioning  RBP [Liu et al., HPCA'10]       This work

- Utility-based Cache Partitioning (UCP): tracks utility (LLC hits/misses) and minimizes overall LLC misses
- Read-only Bandwidth Partitioning (RBP): partitions the bus bandwidth based on LLC miss information
- Writeback-aware Cache Partitioning (WCP): tracks and minimizes LLC misses and writebacks

Questions:
1. Is read-only (LLC miss) information enough?
2. Is bus bandwidth still the bottleneck?
[Diagram: four cores C0-C3, each with private L1 and L2 caches, sharing an LLC and main memory]
Bandwidth Partitioning
- An analytic model guides run-time partitioning: use queuing theory to model delay, monitor performance to estimate the model's parameters, find the partition that maximizes system performance, and enforce that partition at run time
- DRAM vs. hybrid main memory: PCM writes are extremely slow and power hungry
- Issues specific to hybrid main memory: is the bottleneck the bus bandwidth or the device bandwidth? Can we ignore the bandwidth consumed by LLC writebacks?
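The monitor/estimate/optimize/enforce flow above can be sketched as a simple per-epoch control loop. All names here are illustrative, and the proportional split is only a placeholder for the paper's model-driven optimization:

```python
# Hedged sketch of epoch-based bandwidth partitioning. The "optimizer"
# below splits device-bandwidth tokens in proportion to each app's
# measured demand; the paper instead maximizes weighted speedup with a
# queuing model. Names and the proportional rule are assumptions.

EPOCH_CYCLES = 5_000_000  # the talk repartitions every 5 million cycles

def estimate_demand(miss_rate, writeback_rate):
    # Total memory traffic an app generates (requests per cycle).
    return miss_rate + writeback_rate

def partition_bandwidth(demands, total_tokens):
    # Placeholder optimizer: allocate tokens proportionally to demand.
    total_demand = sum(demands)
    return [round(total_tokens * d / total_demand) for d in demands]

# Two apps: (LLC miss rate, LLC writeback rate) measured in one epoch.
demands = [estimate_demand(m, w) for m, w in [(0.02, 0.01), (0.04, 0.05)]]
allocation = partition_bandwidth(demands, 120)  # [30, 90] tokens
```

The enforcement step (applying the allocation for the next epoch) is where the token bucket described later comes in.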
Device Bandwidth Utilization

[Figure: % device bandwidth utilization, split into read and write, per SPEC CPU2006 benchmark. Left panel, DRAM memory (perlbench, bzip2, gcc, zeusmp, cactusADM, calculix, hmmer, sjeng, astar, wrf, sphinx3, xalancbmk), y-axis 0-5%. Right panel, hybrid DRAM+PCM (bwaves, mcf, milc, leslie3d, gobmk, soplex, GemsFDTD, libquantum, lbm, omnetpp), y-axis 0-40%]
DRAM memory: (1) low device bandwidth utilization; (2) memory reads (LLC misses) dominate
Hybrid memory: (1) high device bandwidth utilization; (2) memory writes (LLC writebacks) often dominate
RBP on Hybrid Main Memory
[Figure: throughput of RBP normalized to SHARE, plotted against the percentage of device bandwidth consumed by PCM writes (LLC writebacks), from 10% to 90%; y-axis 0.0-1.8]

RBP vs. SHARE:
1. RBP outperforms SHARE for workloads dominated by PCM reads (LLC misses)
2. RBP loses against SHARE for workloads dominated by PCM writes (LLC writebacks)

A new bandwidth partitioning scheme is necessary for hybrid memory.
Writeback-Aware Bandwidth Partitioning
- Focuses on the collective bandwidth of the PCM devices
- Considers LLC writeback information
- Token bucket algorithm: device service units = tokens; tokens are allocated among applications every epoch (5 million cycles)
- Analytic model: maximizes weighted speedup and models contention on bandwidth as queuing delay
- Difficulty: a write is blocking only when the write queue is full
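The token-bucket enforcement above can be sketched as follows. The class name, the one-token-per-request cost, and the refill interface are assumptions of this sketch, not details from the talk:

```python
# Illustrative token-bucket regulator: one token per device service
# unit, refilled each epoch. Requests beyond the budget are deferred.

class TokenBucket:
    def __init__(self, tokens_per_epoch):
        self.budget = tokens_per_epoch
        self.tokens = tokens_per_epoch

    def try_issue(self):
        """Issue a memory request if tokens remain; else defer it."""
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

    def new_epoch(self, tokens_per_epoch=None):
        # The partitioner may hand this app a new budget each epoch.
        if tokens_per_epoch is not None:
            self.budget = tokens_per_epoch
        self.tokens = self.budget

bucket = TokenBucket(3)
issued = [bucket.try_issue() for _ in range(5)]  # only 3 succeed
bucket.new_epoch()  # budget restored for the next epoch
```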
Analytic Model for bandwidth partitioning
- For a single core, an additive CPI formula: CPI = CPI_LLC∞ + LLC miss frequency × LLC miss penalty, where CPI_LLC∞ is the CPI with an infinite LLC and the miss term is the CPI due to LLC misses
- The memory system is approximated as a queue: requests arrive at the LLC miss rate λm and are served at the memory bandwidth α, so the LLC miss penalty is approximated by the queuing delay
- For a CMP, each core i has its own LLC miss rate λm,i and is allocated memory bandwidth αi

[Diagram: a single core's misses form a queue with arrival rate λm and service rate α; in a CMP, cores 1..N with miss rates λm,1 ... λm,N share memory through per-core bandwidth allocations α1 ... αN]
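Under a simple M/M/1 queuing assumption (adopted here for illustration; the talk's exact queuing model may differ), the per-core model above can be written as:

```latex
% Additive CPI model with an M/M/1 approximation of the memory queue
% (illustrative assumption, not necessarily the paper's exact model)
\mathrm{CPI}_i \;=\; \mathrm{CPI}_{\mathrm{LLC}\infty,\,i}
  \;+\; \lambda_{m,i}\, T_{\mathrm{mem}}(\lambda_{m,i}, \alpha_i),
\qquad
T_{\mathrm{mem}}(\lambda, \alpha) \;=\; \frac{1}{\alpha - \lambda},
\quad \lambda < \alpha
```

The partitioner's job is then to choose the allocations α1 ... αN, subject to their sum not exceeding the device bandwidth, so that system performance is maximized.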
Maximize Weighted Speedup
Taking the LLC writebacks into account:
CPI = CPI_LLC∞ + LLC miss freq. × LLC miss penalty + P × LLC writeback freq. × LLC writeback penalty

Analytic Model for WBP

[Diagram: each core i issues LLC misses at rate λm,i into a read queue (RQ) served with read bandwidth αi, and LLC writebacks at rate λw,i into a write queue (WQ) served with write bandwidth βi; the writeback queuing delay contributes to CPI with probability P]

The model captures the CPI due to both LLC misses and writebacks. How to determine P, the probability that writebacks are on the critical path?
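The extended formula above can be evaluated directly. In this sketch the miss and writeback penalties use an M/M/1-style 1/(bandwidth − rate) queuing delay, which is an assumption of this illustration, not necessarily the talk's exact model:

```python
# Illustrative evaluation of the writeback-aware CPI model. The queuing
# delays are an assumed M/M/1-style approximation; rates and bandwidths
# are in requests per cycle and are made-up example values.

def cpi(cpi_llc_inf, miss_rate, read_bw, wb_rate, write_bw, p):
    read_delay = 1.0 / (read_bw - miss_rate)    # LLC miss penalty
    write_delay = 1.0 / (write_bw - wb_rate)    # LLC writeback penalty
    return cpi_llc_inf + miss_rate * read_delay + p * wb_rate * write_delay

# With p = 0 the model degenerates to the read-only (RBP-style) model;
# with p = 1 every writeback is assumed to be on the critical path.
base = cpi(1.0, 0.02, 0.1, 0.01, 0.05, p=0.0)  # 1.25
full = cpi(1.0, 0.02, 0.1, 0.01, 0.05, p=1.0)  # 1.50
```

Since the right value of P lies between these extremes and shifts with the workload, it is chosen dynamically, as the next slide describes.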
Maximize Weighted Speedup
Dynamic Weight Adjustment
- Choose P based on the expected number of executed instructions (EEI)
- WBP computes, for each candidate weight p1 ... pm, a bandwidth partition (αk,1, βk,1) ... (αk,N, βk,N) and its expected EEI (EEI1 ... EEIm); the candidate whose expected EEI best matches the actual EEI determines P
- Bandwidth Utilization ratio (BU): utilized bandwidth / allocated bandwidth, tracked per application (BU1 ... BUN)

[Diagram: per-app LLC miss rates λm,i and writeback rates λw,i feed WBP; candidate weights p1 ... pm yield candidate partitions and expected EEIs, which are compared against the actual EEI and the per-app BUs to select P]
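A minimal sketch of the EEI-based weight selection, assuming a simple "closest expected EEI wins" rule (the function name and exact rule are illustrative, not from the talk):

```python
# Hedged sketch of Dynamic Weight Adjustment: pick the candidate weight
# whose predicted instruction count (EEI) best matches what actually ran.

def choose_weight(candidates, actual_eei):
    """candidates: list of (weight_p, expected_eei) pairs."""
    best_p, _ = min(candidates, key=lambda c: abs(c[1] - actual_eei))
    return best_p

# Three candidate weights with their model-predicted EEIs (made-up values).
candidates = [(0.0, 9.0e8), (0.5, 1.1e9), (1.0, 1.3e9)]
p = choose_weight(candidates, actual_eei=1.05e9)  # closest prediction: 1.1e9
```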
Architecture Overview
BUMon tracks info during an epoch
DWA and WBP compute bandwidth partition for the next epoch
Bandwidth Regulator enforces the configuration
Enforcing Bandwidth Partitioning
Simulation Setup
- Configuration: 8-core CMP with a 168-entry instruction window; private 4-way 64KB L1 and private 8-way 2MB L2 per core; partitioned 32MB LLC with 12.5 ns latency; 64GB PCM with 4 channels of 2 ranks each, 50 ns read latency, 1000 ns write latency
- Benchmarks: SPEC CPU2006, classified into 3 types (W, R, RW) based on whether PCM reads and/or writes dominate bandwidth consumption; 15 workloads created (Light, High)
- Sensitivity study on write latency, number of channels, and number of cores
Effective Read Latency

[Figure: normalized effective read latency of RBP, WBP_0.5, WBP_1.0, and WBP+DWA for workloads Light1-Light7 and High1-High8 plus the average; y-axis 0.0-2.5]

1. Different workloads favor different policies (partitioning weights)
2. WBP+DWA can match the best static policy (partitioning weight)
3. WBP+DWA reduces the effective read latency by 31.9% over RBP
Throughput

[Figure: normalized throughput of RBP, WBP_0.5, WBP_1.0, and WBP+DWA for workloads Light1-Light7 and High1-High8 plus the average; y-axis 0.0-2.0]

1. The best writeback weight varies across workloads
2. WBP+DWA achieves performance comparable to the best static weight
3. WBP+DWA improves throughput by 24.2% over RBP
Fairness (Harmonic IPC)
WBP+DWA improves fairness by an average of 16.7% over RBP
[Figure: normalized harmonic IPC of RBP, WBP_0.5, WBP_1.0, and WBP+DWA for workloads Light1-Light7 and High1-High8 plus the average; y-axis 0.0-2.5]
Conclusions
PCM device bandwidth is the bottleneck in hybrid memory
Writeback information is important (LLC writebacks consume a substantial portion of memory bandwidth)
WBP can better partition the PCM bandwidth
WBP outperforms RBP by an average of 24.9% in terms of weighted speedup
Thank you
Questions ?