Writeback-Aware Bandwidth Partitioning for Multi-core Systems with PCM

Miao Zhou, Yu Du, Bruce Childers, Rami Melhem, Daniel Mossé
University of Pittsburgh
http://www.cs.pitt.edu/PCM
Introduction
- DRAM memory is not energy efficient: data centers are energy hungry, and DRAM consumes 20-40% of their energy
- Applying PCM as main memory is energy efficient, but PCM has slower reads, much slower writes, and a shorter lifetime
- Hybrid memory adds a DRAM cache to improve performance (lower LLC miss rate) and extend lifetime (lower LLC writeback rate)
How to manage the shared resources?
[Diagram: four cores C0-C3, each with private L1 and L2 caches, sharing a DRAM LLC in front of PCM main memory]
Shared Resource Management
- CMP systems share resources: the last-level cache and the memory bandwidth
- Unmanaged resources cause interference and poor performance
- Partitioning resources reduces interference and improves performance

                        DRAM main memory                Hybrid main memory
Cache Partitioning      UCP [Qureshi et al., MICRO-39]  WCP [Zhou et al., HiPEAC'12]
Bandwidth Partitioning  RBP [Liu et al., HPCA'10]       This work

- Utility-based Cache Partitioning (UCP): tracks utility (LLC hits/misses) and minimizes overall LLC misses
- Read-only Bandwidth Partitioning (RBP): partitions the bus bandwidth based on LLC miss information
- Writeback-aware Cache Partitioning (WCP): tracks and minimizes LLC misses and writebacks

Questions:
1. Is read-only (LLC miss) information enough?
2. Is bus bandwidth still the bottleneck?
[Diagram: four cores C0-C3, each with private L1 and L2 caches, sharing an LLC and main memory]
Bandwidth Partitioning
- An analytic model guides run-time partitioning: use queuing theory to model delay, monitor performance to estimate the model's parameters, find the partition that maximizes system performance, and enforce that partition at run time
- DRAM vs. hybrid main memory: PCM writes are extremely slow and power hungry
- Issues specific to hybrid main memory: is the bottleneck the bus bandwidth or the device bandwidth? Can we ignore the bandwidth consumed by LLC writebacks?
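The monitor/estimate/optimize/enforce flow above can be sketched as a simple per-epoch control loop. All names here are illustrative, and the proportional split is only a placeholder for the paper's model-driven optimization:

```python
# Hedged sketch of epoch-based bandwidth partitioning. The "optimizer"
# below splits device-bandwidth tokens in proportion to each app's
# measured demand; the paper instead maximizes weighted speedup with a
# queuing model. Names and the proportional rule are assumptions.

EPOCH_CYCLES = 5_000_000  # the talk repartitions every 5 million cycles

def estimate_demand(miss_rate, writeback_rate):
    # Total memory traffic an app generates (requests per cycle).
    return miss_rate + writeback_rate

def partition_bandwidth(demands, total_tokens):
    # Placeholder optimizer: allocate tokens proportionally to demand.
    total_demand = sum(demands)
    return [round(total_tokens * d / total_demand) for d in demands]

# Two apps: (LLC miss rate, LLC writeback rate) measured in one epoch.
demands = [estimate_demand(m, w) for m, w in [(0.02, 0.01), (0.04, 0.05)]]
allocation = partition_bandwidth(demands, 120)  # [30, 90] tokens
```

The enforcement step (applying the allocation for the next epoch) is where the token bucket described later comes in.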
Device Bandwidth Utilization

[Figure: % device bandwidth utilization, split into read and write, per SPEC CPU2006 benchmark. Left panel, DRAM memory (perlbench, bzip2, gcc, zeusmp, cactusADM, calculix, hmmer, sjeng, astar, wrf, sphinx3, xalancbmk), y-axis 0-5%. Right panel, hybrid DRAM+PCM (bwaves, mcf, milc, leslie3d, gobmk, soplex, GemsFDTD, libquantum, lbm, omnetpp), y-axis 0-40%]
DRAM memory: (1) low device bandwidth utilization; (2) memory reads (LLC misses) dominate
Hybrid memory: (1) high device bandwidth utilization; (2) memory writes (LLC writebacks) often dominate
RBP on Hybrid Main Memory
[Figure: throughput of RBP normalized to SHARE, plotted against the percentage of device bandwidth consumed by PCM writes (LLC writebacks), from 10% to 90%; y-axis 0.0-1.8]

RBP vs. SHARE:
1. RBP outperforms SHARE for workloads dominated by PCM reads (LLC misses)
2. RBP loses against SHARE for workloads dominated by PCM writes (LLC writebacks)

A new bandwidth partitioning scheme is necessary for hybrid memory.
Writeback-Aware Bandwidth Partitioning
- Focuses on the collective bandwidth of the PCM devices
- Considers LLC writeback information
- Token bucket algorithm: device service units = tokens; tokens are allocated among applications every epoch (5 million cycles)
- Analytic model: maximizes weighted speedup and models contention on bandwidth as queuing delay
- Difficulty: a write is blocking only when the write queue is full
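The token-bucket enforcement above can be sketched as follows. The class name, the one-token-per-request cost, and the refill interface are assumptions of this sketch, not details from the talk:

```python
# Illustrative token-bucket regulator: one token per device service
# unit, refilled each epoch. Requests beyond the budget are deferred.

class TokenBucket:
    def __init__(self, tokens_per_epoch):
        self.budget = tokens_per_epoch
        self.tokens = tokens_per_epoch

    def try_issue(self):
        """Issue a memory request if tokens remain; else defer it."""
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

    def new_epoch(self, tokens_per_epoch=None):
        # The partitioner may hand this app a new budget each epoch.
        if tokens_per_epoch is not None:
            self.budget = tokens_per_epoch
        self.tokens = self.budget

bucket = TokenBucket(3)
issued = [bucket.try_issue() for _ in range(5)]  # only 3 succeed
bucket.new_epoch()  # budget restored for the next epoch
```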
Analytic Model for bandwidth partitioning
- For a single core, an additive CPI formula: CPI = CPI_LLC∞ + LLC miss frequency × LLC miss penalty, where CPI_LLC∞ is the CPI with an infinite LLC and the miss term is the CPI due to LLC misses
- The memory system is approximated as a queue: requests arrive at the LLC miss rate λm and are served at the memory bandwidth α, so the LLC miss penalty is approximated by the queuing delay
- For a CMP, each core i has its own LLC miss rate λm,i and is allocated memory bandwidth αi

[Diagram: a single core's misses form a queue with arrival rate λm and service rate α; in a CMP, cores 1..N with miss rates λm,1 ... λm,N share memory through per-core bandwidth allocations α1 ... αN]
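Under a simple M/M/1 queuing assumption (adopted here for illustration; the talk's exact queuing model may differ), the per-core model above can be written as:

```latex
% Additive CPI model with an M/M/1 approximation of the memory queue
% (illustrative assumption, not necessarily the paper's exact model)
\mathrm{CPI}_i \;=\; \mathrm{CPI}_{\mathrm{LLC}\infty,\,i}
  \;+\; \lambda_{m,i}\, T_{\mathrm{mem}}(\lambda_{m,i}, \alpha_i),
\qquad
T_{\mathrm{mem}}(\lambda, \alpha) \;=\; \frac{1}{\alpha - \lambda},
\quad \lambda < \alpha
```

The partitioner's job is then to choose the allocations α1 ... αN, subject to their sum not exceeding the device bandwidth, so that system performance is maximized.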
Maximize Weighted Speedup
Taking the LLC writebacks into account:
CPI = CPI_LLC∞ + LLC miss freq. × LLC miss penalty + P × LLC writeback freq. × LLC writeback penalty

Analytic Model for WBP

[Diagram: each core i issues LLC misses at rate λm,i into a read queue (RQ) served with read bandwidth αi, and LLC writebacks at rate λw,i into a write queue (WQ) served with write bandwidth βi; the writeback queuing delay contributes to CPI with probability P]

The model captures the CPI due to both LLC misses and writebacks. How to determine P, the probability that writebacks are on the critical path?
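The extended formula above can be evaluated directly. In this sketch the miss and writeback penalties use an M/M/1-style 1/(bandwidth − rate) queuing delay, which is an assumption of this illustration, not necessarily the talk's exact model:

```python
# Illustrative evaluation of the writeback-aware CPI model. The queuing
# delays are an assumed M/M/1-style approximation; rates and bandwidths
# are in requests per cycle and are made-up example values.

def cpi(cpi_llc_inf, miss_rate, read_bw, wb_rate, write_bw, p):
    read_delay = 1.0 / (read_bw - miss_rate)    # LLC miss penalty
    write_delay = 1.0 / (write_bw - wb_rate)    # LLC writeback penalty
    return cpi_llc_inf + miss_rate * read_delay + p * wb_rate * write_delay

# With p = 0 the model degenerates to the read-only (RBP-style) model;
# with p = 1 every writeback is assumed to be on the critical path.
base = cpi(1.0, 0.02, 0.1, 0.01, 0.05, p=0.0)  # 1.25
full = cpi(1.0, 0.02, 0.1, 0.01, 0.05, p=1.0)  # 1.50
```

Since the right value of P lies between these extremes and shifts with the workload, it is chosen dynamically, as the next slide describes.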
Maximize Weighted Speedup
Dynamic Weight Adjustment
- Choose P based on the expected number of executed instructions (EEI)
- WBP computes, for each candidate weight p1 ... pm, a bandwidth partition (αk,1, βk,1) ... (αk,N, βk,N) and its expected EEI (EEI1 ... EEIm); the candidate whose expected EEI best matches the actual EEI determines P
- Bandwidth Utilization ratio (BU): utilized bandwidth / allocated bandwidth, tracked per application (BU1 ... BUN)

[Diagram: per-app LLC miss rates λm,i and writeback rates λw,i feed WBP; candidate weights p1 ... pm yield candidate partitions and expected EEIs, which are compared against the actual EEI and the per-app BUs to select P]
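A minimal sketch of the EEI-based weight selection, assuming a simple "closest expected EEI wins" rule (the function name and exact rule are illustrative, not from the talk):

```python
# Hedged sketch of Dynamic Weight Adjustment: pick the candidate weight
# whose predicted instruction count (EEI) best matches what actually ran.

def choose_weight(candidates, actual_eei):
    """candidates: list of (weight_p, expected_eei) pairs."""
    best_p, _ = min(candidates, key=lambda c: abs(c[1] - actual_eei))
    return best_p

# Three candidate weights with their model-predicted EEIs (made-up values).
candidates = [(0.0, 9.0e8), (0.5, 1.1e9), (1.0, 1.3e9)]
p = choose_weight(candidates, actual_eei=1.05e9)  # closest prediction: 1.1e9
```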
Architecture Overview
BUMon tracks info during an epoch
DWA and WBP compute bandwidth partition for the next epoch
Bandwidth Regulator enforces the configuration
Enforcing Bandwidth Partitioning
Simulation Setup
- Configuration: 8-core CMP with a 168-entry instruction window; private 4-way 64KB L1 and private 8-way 2MB L2 per core; partitioned 32MB LLC with 12.5 ns latency; 64GB PCM with 4 channels of 2 ranks each, 50 ns read latency, 1000 ns write latency
- Benchmarks: SPEC CPU2006, classified into 3 types (W, R, RW) based on whether PCM reads and/or writes dominate bandwidth consumption; 15 workloads created (Light, High)
- Sensitivity study on write latency, number of channels, and number of cores
Effective Read Latency

[Figure: normalized effective read latency of RBP, WBP_0.5, WBP_1.0, and WBP+DWA for workloads Light1-Light7 and High1-High8 plus the average; y-axis 0.0-2.5]

1. Different workloads favor different policies (partitioning weights)
2. WBP+DWA can match the best static policy (partitioning weight)
3. WBP+DWA reduces the effective read latency by 31.9% over RBP
Throughput

[Figure: normalized throughput of RBP, WBP_0.5, WBP_1.0, and WBP+DWA for workloads Light1-Light7 and High1-High8 plus the average; y-axis 0.0-2.0]

1. The best writeback weight varies across workloads
2. WBP+DWA achieves performance comparable to the best static weight
3. WBP+DWA improves throughput by 24.2% over RBP
Fairness (Harmonic IPC)
WBP+DWA improves fairness by an average of 16.7% over RBP
[Figure: normalized harmonic IPC of RBP, WBP_0.5, WBP_1.0, and WBP+DWA for workloads Light1-Light7 and High1-High8 plus the average; y-axis 0.0-2.5]
Conclusions
PCM device bandwidth is the bottleneck in hybrid memory
Writeback information is important (LLC writebacks consume a substantial portion of memory bandwidth)
WBP can better partition the PCM bandwidth
WBP outperforms RBP by an average of 24.9% in terms of weighted speedup
Thank you
Questions ?