Heracles: Improving Resource Efficiency at Scale. David Lo†, Liqun Cheng*, Rama Govindaraju*, Parthasarathy Ranganathan*, Christos Kozyrakis†. † Stanford University, * Google Inc. © 2012 Google Inc. All rights reserved. Google and the Google Logo are registered trademarks of Google Inc.

Heracles: Improving Resource Efficiency at Scale (ISCA-42, June 16, 2015)


Heracles: Improving Resource Efficiency at Scale

David Lo†, Liqun Cheng*, Rama Govindaraju*, Parthasarathy Ranganathan*, Christos Kozyrakis†

† Stanford University * Google Inc.

© 2012 Google Inc. All rights reserved. Google and the Google Logo are registered trademarks of Google Inc.


The case for oversubscription

Diurnal load variation

Idleness in latency critical workload!

Bigger opportunity

PEGASUS [ISCA’14]

Total Cost of Ownership [J. Hamilton, http://mvdirona.com]

Servers 61%, Energy 16%, Cooling 14%, Networking 6%, Other 3%

Oversubscription summary

Motivation: fill in idle cycles with useful work

How: Latency Critical (LC) + Best Effort (BE)

Plenty of analytics jobs, such as deep learning training

Challenges of oversubscription

Allocation of shared resources between LC and BE

Interference on shared resources

DRAM

LLC

Cores

Network

Power

Difficult to guarantee quality of service (QoS)


How bad can interference get?

Quick experiment with a latency critical job and a batch job

The latency critical job: Google websearch

The batch job: deep learning classifier

The setup:

Run batch job at very low priority to fill in idle CPU cycles

Hope that the Linux scheduler is sufficient for QoS


How bad can interference get?

[Plot annotation: SLO latency]

Cannot co-locate workload at any load!

Interference is different based on resource

Impact of interference on websearch’s latency (rows: resource; columns: websearch load)

Resource        10%    20%    30%    40%    50%    60%    70%    80%    90%
LLC           >300%  >300%  >300%  >300%  >300%  >300%  >300%   264%   123%
DRAM          >300%  >300%  >300%  >300%  >300%  >300%  >300%   270%   122%
HyperThread    110%   107%   114%   115%   105%   117%   120%   136%  >300%
CPU power      124%   107%   116%   109%   115%   105%   101%   100%   100%
Network         36%    36%    37%    37%    39%    42%    48%    55%    64%

Color scale: 0%, 100%, 300% (OK / NOT OK)

No oversubscription possible

Need to manage MULTIPLE resources with DYNAMIC controller

Oversubscription appears to be too hard

[Plots: Google [Barroso’09] and Twitter [Delimitrou’14]; annotations: 20% avg. utilization, 30% avg. utilization]

Even with cluster managers and lots of available jobs

Caused by fear of interference

Heracles: low latency and high utilization

Insights:

Use iso-latency to tolerate some interference

Fine-grained isolation on all shared resources to mitigate the rest

Implementation:

Dynamic controller to manage shared resource allocations

Evaluated on Google workloads, high utilization without QoS violations

What is iso-latency?

[Figure: websearch latency vs. cluster load; x-axis: % of maximum cluster load (0% to 100%); y-axis: overall query latency; marked line: SLO latency]

Can hide interference in this slack!

Fine-grained resource isolation mechanisms

CPU (HyperThread/L1/L2)

Use Linux cpuset cgroups to partition cores between LC and BE jobs

Single core granularity (Haswell has up to 18 cores)

~1ms response time
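A minimal sketch of the core partitioning described above, assuming a cgroup v1 cpuset hierarchy in which each group exposes a `cpuset.cpus` file. The group names `lc` and `be`, the contiguous split, and the `cgroup_root` parameter are illustrative assumptions, not details from the talk.

```python
from pathlib import Path


def _cpu_range(lo, hi):
    """Format an inclusive core range in cpuset list syntax, e.g. '0-13'."""
    return str(lo) if lo == hi else f"{lo}-{hi}"


def split_cores(num_cores, lc_cores):
    """Return (lc_cpus, be_cpus) strings for a contiguous split.

    Cores 0..lc_cores-1 go to the LC job, the remaining cores to the BE job.
    """
    if not 0 < lc_cores < num_cores:
        raise ValueError("LC and BE each need at least one core")
    return _cpu_range(0, lc_cores - 1), _cpu_range(lc_cores, num_cores - 1)


def apply_split(cgroup_root, num_cores, lc_cores):
    """Write the split to <cgroup_root>/lc and <cgroup_root>/be.

    Hypothetical layout: both group directories must already exist.
    """
    lc_cpus, be_cpus = split_cores(num_cores, lc_cores)
    (Path(cgroup_root) / "lc" / "cpuset.cpus").write_text(lc_cpus)
    (Path(cgroup_root) / "be" / "cpuset.cpus").write_text(be_cpus)
```

On an 18-core part, `split_cores(18, 14)` gives cores 0 to 13 to the LC job and cores 14 to 17 to the BE job; moving the boundary by one changes the allocation at single-core granularity.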

Example partitioning setup:

[Diagram: Core 1, Core 2, Core 3, Core 4, ..., Core N-1, Core N, split by cpuset between LC and BE]

Fine-grained resource isolation mechanisms

LLC

Hardware cache partitioning in latest Haswell Xeon

Partitioning by cache way (20 ways in Haswell)

<1ms adjustment latency
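Partitioning by cache way can be pictured as one bitmask per class, with one bit per way. The sketch below is only an illustration: the function name is hypothetical, and the choice of contiguous, non-overlapping masks (BE in the low-order ways, LC in the high-order ways) is an assumption.

```python
def way_masks(total_ways, lc_ways):
    """Return (lc_mask, be_mask): contiguous, non-overlapping way bitmasks."""
    if not 0 < lc_ways < total_ways:
        raise ValueError("LC and BE each need at least one way")
    be_ways = total_ways - lc_ways
    be_mask = (1 << be_ways) - 1                # low-order ways for BE
    lc_mask = ((1 << lc_ways) - 1) << be_ways   # high-order ways for LC
    return lc_mask, be_mask


if __name__ == "__main__":
    lc_mask, be_mask = way_masks(20, 16)
    print(f"LC mask: {lc_mask:#07x}, BE mask: {be_mask:#07x}")
```

With 20 ways, shifting one way from LC to BE changes each partition by 5% of the cache.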

Example partitioning setup:

[Diagram: Way 1, Way 2, Way 3, Way 4, ..., Way N-1, Way N, split into a partition for LC and a partition for BE]

Fine-grained resource isolation mechanisms

Network

Transmit rate limiting in Linux kernel with hierarchical token bucket

Extremely fine grained limits of at least 1Mbps

~1ms response time
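The building block of this kind of rate limiting is a token bucket. The class below is a simplified single-bucket illustration of the idea, not the kernel's hierarchical token bucket implementation; the class name, the parameters, and the caller-supplied clock are all assumptions.

```python
class TokenBucket:
    """Admit packets only while enough tokens (bits) have accumulated."""

    def __init__(self, rate_bps, burst_bits):
        self.rate_bps = rate_bps      # refill rate in bits per second
        self.burst_bits = burst_bits  # bucket capacity in bits
        self.tokens = burst_bits      # start with a full bucket
        self.last = 0.0               # time of the previous call, in seconds

    def allow(self, now, packet_bits):
        """Return True if a packet of packet_bits may be sent at time now."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never beyond the bucket capacity.
        self.tokens = min(self.burst_bits,
                          self.tokens + elapsed * self.rate_bps)
        if packet_bits <= self.tokens:
            self.tokens -= packet_bits
            return True
        return False
```

A BE flow limited to 1 Mbps would use `TokenBucket(1_000_000, burst_bits)`; LC traffic would bypass the bucket so that only BE flows are throttled.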

[Diagram: BE packets and LC packets wait in separate BE and LC queues; a packet scheduler sends them to the NIC]

Rate limit BE flows

Fine-grained resource isolation mechanisms

CPU power

Per-core DVFS to ensure minimum Turbo frequency for LC workload

Can change clock frequency in increments of 100MHz

<1 ms response of hardware
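One way to picture the power mechanism is a single control step that moves the BE core frequency by one 100 MHz increment. This is a sketch under assumptions: the function name, the frequency bounds, and the policy of speeding BE cores back up whenever the LC cores meet their guaranteed frequency are illustrative, and a real controller would presumably also watch package power.

```python
STEP_MHZ = 100  # frequency increment stated on the slide


def next_be_freq_mhz(lc_freq_mhz, lc_guaranteed_mhz, be_freq_mhz,
                     be_min_mhz, be_max_mhz):
    """Return the BE core frequency for the next control step."""
    if lc_freq_mhz < lc_guaranteed_mhz:
        # LC cores are below their guaranteed frequency: slow the BE cores
        # down so that power budget shifts to the LC cores.
        return max(be_min_mhz, be_freq_mhz - STEP_MHZ)
    # LC cores meet their guarantee: let BE cores speed up (assumed policy).
    return min(be_max_mhz, be_freq_mhz + STEP_MHZ)
```

Called repeatedly with fresh frequency readings, this settles the BE cores at the highest frequency that still leaves the LC cores at their guaranteed frequency.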

[Diagram: Core 1, Core 2, Core 3, Core 4, ..., Core N-1, Core N; LC cores at 3.0 GHz, BE cores at 2.0 GHz]

Shift power from BE to LC cores to maintain guaranteed LC freq.

Fine-grained resource isolation mechanisms

DRAM BW

Not available in hardware, have to simulate with other mechanisms

LLC partitioning influences the amount of traffic that is served by DRAM

Use number of cores to control DRAM BW

Intuition: each core can only issue so many requests/sec
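Under that intuition, BE bandwidth scales roughly with the number of BE cores, so a bandwidth limit translates into a core count. The helper below is a sketch of that arithmetic only; the function name, parameters, and GB/s units are assumptions, and the per-core figure would have to be measured.

```python
def max_be_cores(bw_limit_gbs, lc_bw_gbs, be_bw_per_core_gbs, free_cores):
    """Largest BE core count whose modeled DRAM bandwidth fits the limit.

    Model: BE bandwidth = be_bw_per_core_gbs * number of BE cores.
    """
    headroom = bw_limit_gbs - lc_bw_gbs
    if headroom <= 0:
        return 0                  # the LC job already uses the whole budget
    if be_bw_per_core_gbs <= 0:
        return free_cores         # BE job generates no DRAM traffic
    return min(free_cores, int(headroom // be_bw_per_core_gbs))
```

For example, with a 50 GB/s limit, 20 GB/s of LC traffic, and 4 GB/s per BE core, the model admits 7 BE cores.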

[Diagram: LLC partitioning between LC and BE, and core partitioning between LC and BE]

DRAM BW = BWPerCore × NumCores

But how should the knobs be set?

This looks like an optimization problem

Objective: maximize resources given to BE job

Constraints: preserve SLO of latency critical application

Challenge: 5-dimensional formulation!


Control insight #1: independence

Observation: latency violations occur when a shared

resource is extremely loaded

High demand for resource causes significant contention

LC workload is unable to obtain its required allocation

Insight: assume independent interference under 2 conditions

LC workload is not starved for any resource

Each resource has enough slack (~10%) to absorb bursts


Control insight #2: convexity

Performance as a function of resources is convex for benchmarked workloads

Use of gradient descent is guaranteed to produce optimality

Heracles: high level controller overview

Goal: meet SLO, keep BE from saturating shared resource

Runs on each machine

[Diagram: the controller takes latency readings from the LC workload and tells three subcontrollers whether BE can grow; each subcontroller runs an internal feedback loop]

CPU + Memory subcontroller: LLC, CPU, DRAM BW

CPU power subcontroller: DVFS, CPU power

Network subcontroller: HTB, Net. BW

[Diagram builds: allocation bars for Cores, LLC, Core freq., and Network BW, each split between LC and BE]

Example subcontroller: Core+Memory

Isolates: Cores, LLC, DRAM

Physical mechanisms: Partitioning of cores, LLC, and DRAM

Goal: maximize cores running BE job by minimizing DRAM BW

Guardband in DRAM BW to ensure LC job is not being starved

Iterative phases:

1. Reduce total DRAM BW through LLC partitioning

2. Grow allocation of BE cores
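The two iterative phases can be sketched as a loop over a bandwidth model. This is a toy illustration under stated assumptions, not the authors' controller: `measure_bw` is a hypothetical callable standing in for real measurements, `min_gain` is an invented stopping threshold for the "gradient near zero" case, and the 10% guardband borrows the ~10% slack figure from the independence slide. A real controller would also check LC latency before taking cache or cores from the LC job.

```python
def run_subcontroller(measure_bw, total_cores, total_ways, bw_limit,
                      guardband=0.10, min_gain=0.02, max_rounds=100):
    """Alternate the two phases until neither makes progress.

    measure_bw(be_cores, be_ways) -> total DRAM bandwidth, in the same
    unit as bw_limit. Returns the final (be_cores, be_ways).
    """
    be_cores, be_ways = 1, 1
    cap = bw_limit * (1.0 - guardband)  # stay below the limit by a guardband
    for _ in range(max_rounds):
        # Phase 1: grow the BE cache partition while that still cuts DRAM
        # traffic noticeably; stop once the benefit is negligible.
        while be_ways < total_ways - 1:
            before = measure_bw(be_cores, be_ways)
            after = measure_bw(be_cores, be_ways + 1)
            if before - after < min_gain * before:
                break
            be_ways += 1
        # Phase 2: add BE cores while total bandwidth stays under the cap.
        grew = False
        while (be_cores < total_cores - 1
               and measure_bw(be_cores + 1, be_ways) <= cap):
            be_cores += 1
            grew = True
        if not grew:
            break  # hit the bandwidth cap or ran out of cores
    return be_cores, be_ways
```

The loop leaves at least one core and one way for the LC job, and it returns to phase 1 after every growth in cores because more BE cores can make extra cache worthwhile again.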


[Animation: three plots over time, showing LC DRAM BW, BE DRAM BW, and Total DRAM BW]

Steps shown, in order:

1. Start here

2. Reduce BW

3. Reduce BW: ∇ ≈ 0, negligible benefit

4. + BE cores

5. + BE cores: danger zone

6. Reduce BW

7. + BE cores: hit BW cap

Evaluation of Heracles

Evaluation of Google production workloads on real hardware


Latency Critical workloads

websearch

Leaf node, document retrieval/scoring

99%-ile latency SLO of tens of milliseconds

ml_cluster

Machine learning for text clustering

95%-ile latency SLO of tens of milliseconds

memkeyval

In-memory key-value store

99%-ile latency SLO of hundreds of microseconds

[Slide label: Production]

Best Effort jobs

stream-LLC: LLC antagonist

stream-DRAM: DRAM BW antagonist

cpu_pwr: CPU power antagonist

brain: deep learning (LLC, DRAM, CPU, CPU power)

streetview: image stitching (DRAM BW)

Run Heracles on real hardware, measure latency and utilization

[Slide labels: Synthetic (stream-LLC, stream-DRAM, cpu_pwr); Production (brain, streetview)]

Latency validation: do no harm

[Plot annotation: SLO latency]

Iso-latency: recovering slack and turning it into work

Putting it together: resource efficiency

Effective Machine Utilization = (LC load) + (% BE throughput)
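As a worked instance of the formula, the helper below adds the two terms as fractions. The function name is hypothetical, and normalizing BE throughput to the BE job running alone on the machine is an assumption about what "% BE throughput" means.

```python
def effective_machine_utilization(lc_load, be_throughput):
    """EMU = LC load + BE throughput, both expressed as fractions.

    lc_load: LC load as a fraction of the LC job's peak load.
    be_throughput: BE throughput as a fraction of a reference throughput
    (assumed here to be the BE job running alone).
    """
    return lc_load + be_throughput


if __name__ == "__main__":
    emu = effective_machine_utilization(0.40, 0.50)
    print(f"EMU: {emu:.0%}")
```

Because the two terms are normalized separately, their sum is not bounded by 1.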

[Plot annotations: Load on LC app; Free batch processing capability]

Better than 100% is due to better binpacking

Bonus: energy efficiency too!

Power increase is far less than resource utilization increase!

300% more work for
60% more power
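The arithmetic behind this claim can be checked directly. The sketch assumes "300% more work" means four times the baseline work and "60% more power" means 1.6 times the baseline power; the function name is hypothetical.

```python
def work_per_watt_gain(extra_work, extra_power):
    """Ratio of new to old work-per-watt, given fractional increases."""
    return (1.0 + extra_work) / (1.0 + extra_power)


if __name__ == "__main__":
    gain = work_per_watt_gain(3.0, 0.6)
    print(f"Work per watt improves by about {gain:.1f}x")
```

Under that reading, work done per unit of power improves by roughly 2.5 times.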

Page 38: Heracles: Improving Resource Efficiency at Scale · 2019-05-02 · Heracles: Improving Resource Efficiency at Scale (ISCA-42 June 16, 2015) 15 Core 1 Core 2 Core 3 Core 4 ... Core

Cluster results

Use load trace for off-peak hours on websearch cluster


Conclusion

Increasing utilization is key to improving datacenter efficiency

Fine-grained knobs to control many sources of interference

Need coordinated policy to find optimal settings

Heracles significantly increases utilization

Achieves average of 90% utilization for Google workloads

Potential increase of >300% in cost efficiency
