CACHE SHARING WITH TRICT O FOR LATENCY CRITICAL...

UBIK: EFFICIENT CACHE SHARING WITH STRICT QOS

FOR LATENCY CRITICAL WORKLOADS

HARSHAD KASTURE, DANIEL SANCHEZ

ASPLOS 2014

Motivation 2

Low server utilization in datacenters is a major source of

inefficiency

L. Barroso and U. Hölzle, The Case for Energy-Proportional Computing

Common Industry Practice 3

Core 0

Last Level Cache

Core 1 Core 2

Core 3 Core 4 Core 5

Latency Critical Application

Dedicated machines for latency-critical applications

guarantees QoS

Core 0

Last Level Cache

Core 1 Core 2

Common Industry Practice 4

Latency Critical Application

Dedicated machines for latency-critical applications guarantees QoS

Under utilization of machine resources

Core 0

Last Level Cache

Core 1 Core 2

Colocation to Improve Utilization 5

Latency Critical Applications

Can utilize spare resources by colocating batch apps

Core 0

Last Level Cache

Core 1 Core 2

Sharing Causes Interference! 6

Latency Critical Applications

Can utilize spare resources by colocating batch apps

Contention in shared resources degrades QoS

Outline 7

Introduction

Analysis of latency-critical apps

Inertia-oblivious cache management schemes

Ubik: Inertia-aware cache management

Evaluation

Understanding Latency-Critical Applications 8

Large number of backend servers participate in handling every user request

Total service time determined by tail latency behavior of backend

Client

Front End

Back End

Front End

Datacenter

Service latency highly sensitive to changes in load

Short bursts of activity interspersed with idle periods

Need guaranteed high performance during active periods

Active

Inertia and Transient Behavior 11

Core 0 Core 1

Last Level Cache

Core 2

Inertia and Transient Behavior 12

Transient lengths can dominate tail latency!

Any dynamic reconfiguration scheme has to be inertia-aware

Many hardware resources exhibit inertia

branch predictors, prefetchers, memory bandwidth…

LLCs are one of the biggest sources of inertia

Core 0 Core 1

Last Level Cache

Core 2

Transient begin Transient end

Outline 13

Introduction

Evaluation

Inertia-Oblivious Cache Management 14

Last Level Cache (LLC)

Core 2 Core 3

Core 0 Core 1

LC1 LC2

Batch1 Batch2

Active

Idle Time

Unmanaged LLC (LRU Replacement) 15

Core 2 Core 3

Core 0 Core 1

LC1 LC2

Batch1 Batch2

Active

Idle Time

✖ Unconstrained interference

results in poor tail-latency

behavior

Utility Based Cache Partitioning (UCP) 16

Active

Idle Time

Reconfigure

Core 2 Core 3

Core 0 Core 1

LC1 LC2

Batch1 Batch2

✔ High batch throughput

✖ Poor tail latency (low allocation)

OnOff: Efficient but Unsafe 17

Active

Idle Time

Batch Reconfigure Time

Core 2 Core 3

Core 0 Core 1

LC1 LC2

Batch1 Batch2

✔ High batch throughput

Cross-Request LLC Inertia 18

Other applications qualitatively similar (see paper for details)

Shore-MT, 2 MB LLC

Cross-request hits

) Misses

Hits (same request)

StaticLC: Safe but Inefficient 19

Batch Reconfigure Time

Active

Idle Time

Core 2 Core 3

Core 0 Core 1

LC1 LC2

Batch1 Batch2

✔ Low tail latency (preserve LLC state)

✖ Low batch throughput (poor space

utilization)

Outline 20

Introduction

Evaluation

Ubik: Performance Guarantee 21

Performance as well as overall progress under Ubik after

the deadline is identical to static partitioning

Instructions

deadline

Progress

with constant

Progress

with Ubik

Request begins

Ubik: Overview 22

Activity

nominal static

Target Size

Actual Size

Ubik: Overview 23

Activity

Target Size

Actual Size nominal static

boosted

Ubik: Overview 24

Activity

Target Size

Actual Size nominal static

boosted

Ubik: Overview 25

Constraint: Cycles lost during should be compensated

for by the cycles gained during before the deadline

Activit

Time Size

Target Size

Actual Size nominal static size

idle size

boosted size

Analyzing Transients 26

Target

Size Actual

Ttransient

Need accurate predictions for

The length of the transient from s1 to s2

Cycles lost during the transient from s1 to s2

Instructions

Progress with

constant size (s2)

Progress

with Ubik

Transient

begins Transient ends

Performanc

Hardware Support 27

Utility monitors to measure per-application miss curves

Fine grained cache partitioning

Memory Level Parallelism (MLP) profiler

Size s1

Miss probability

Bounds on Transient Behavior 28

Target

Size Actual

Ttransient

Instructions

Progress with

constant size (s2)

Progress

with Ubik

Transient

begins Transient ends

Performance (L)

Ttransient

L M 1 ps2

Ubik: Partition Sizing 29

Use transient analysis to identify feasible (idle size,

boosted size) pairs

deadline

boosted size) pairs

deadline

boosted size) pairs

deadline

boosted size) pairs

I N F E A S I B L E Size

deadline

boosted size) pairs

Choose the pair that yields the maximum batch throughput

Time deadline

See paper for details

Outline 34

Introduction

Evaluation

Workloads 35

Five diverse latency-critical apps

xapian (search engine)

masstree (in-memory key-value store)

moses (statistical machine translation)

shore-mt (multi-threaded DBMS)

specjbb (java middleware)

Batch applications: random mixes of SPECCPU 2006

benchmarks

Target System 36

6 OOO cores

Private L1I, L1D and L2

caches

12MB shared LLC

400 6-app mixes: 3

latency-critical + 3 batch

Apps pinned to cores

Core 0

Bank 0

Core 1

Bank 1

Core 2

Bank 2

Core 3

Bank 3

Core 4

Bank 4

Core 5

Bank 5

Batch1 Batch2 Batch3

LC1 LC2 LC3

Metrics 37

Baseline system has

private LLCs

We report

Normalized tail latency

Throughput improvement

for batch applications Core 0

Core 1

Core 2

Core 3

Core 4

Core 5

Batch1 Batch2 Batch3

LC1 LC2 LC3

Results: Unmanaged LLC (LRU) 38

Higher is better

Results: UCP 39

Higher is better

Results: OnOff 40

Higher is better

Results: StaticLC 41

Higher is better

Results: Ubik 42

Higher is better

Results: Summary 43

Higher is better

StaticLC Private LLC

Conclusions 44

To guarantee tail latency, dynamic resource management

schemes must be inertia-aware

Ubik: Inertia-aware cache capacity management

Preserves tail of latency-critical apps

Achieves high cache space utilization for batch apps

Requires minimal additional hardware

THANKS FOR YOUR ATTENTION!

QUESTIONS?

CACHE SHARING WITH TRICT O FOR LATENCY CRITICAL...

Documents

RoADTRIP - learningforward.org · 30 JSD | June2010 | Vol.31No.3 approvedbytheboardofdirectors.Theyexpectedthat,asadis-trict,wewouldalignourcontracted19hoursofschoolim

Transient Fault Detection and Reducing Transient Error Rate

12 Transient

[people.csail.mit.edu]people.csail.mit.edu/jakobn/research/TalkPhDsem060403.pdfOutline of Part I: Proof Complexity and Resolution Introduction Propositional Proof Systems Proof Systems

Transient Paper

Optimization of Graph Neural ... - people.csail.mit.edu

W WW.P HONEBOOK.ST NEEW W MEXICO|2019 State Government Phonebook... · trict 4 ‐ Lyn trict 5 ‐ San Government Te 2019 CIALS ... NATU NEW NEW NEW NOR DEVE NOR GOV NUR O ... Anaya

Transient Ischaemic Attack - St George's, University of London · Transient RSW Transient LSW Transient RSW. ... • Neurovascular MDT pathway No RCTs of different types of TIA service

Transient Overvoltages

Transient Phenomena

TATEOF EWMEXICO SECOND J DICIAL DI TRICT … OF EWMEX ICO ECO D J DICIAL DI TRICT CO RT OFFICIAL RO TER Honorable William Parnall ... Honorable Deborah Davis Walker Honorable C. hannon

people.csail.mit.edu/seneﬀ/2015/ WAPF folate.pptx

transient stability

Transient Machine

Pressure Transient

Automation sensor transient and overvoltage … sensor transient and overvoltage protection ... Transient and surge voltage protection ... Automation sensor transient and overvoltage

FinalThesisWork Transient

transient steering.pdf

Download these slides people.csail.mit.edu/seneff/2015/WAPF_folate. pptx

Validation of OLGA HD against transient and pseudo- transient