
Streaming Auto-Scaling in Google Cloud Dataflow

Manuel Fahndrich
Software Engineer, Google

https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg

[Slide: an addictive mobile game, with team and individual rankings (hourly and daily) and example scores for players Sarah, Joe, and Milo]

An Unbounded Stream of Game Events

[Timeline: game events from 1:00 through 14:00]

… with unknown delays.

[Timeline: 8:00 events arriving late, long after 8:00]

The Resource Allocation Problem

[Charts: workload vs. time with fixed resources, one over-provisioned and one under-provisioned]

Matching Resources to Workload

[Chart: workload vs. time with auto-tuned resources tracking the workload]

Resources = Parallelism

[Chart: workload vs. time with auto-tuned parallelism tracking the workload]

More generally: VMs (including CPU, RAM, network, IO).

Assumptions

Big Data Problem

Embarrassingly Parallel

Scaling VMs ==> Scales Throughput

Horizontal Scaling

Agenda

1. Streaming Dataflow Pipelines
2. Pipeline Execution
3. Adjusting Parallelism Automatically
4. Summary + Future Work

1. Streaming Dataflow

Google's Data-Related Systems

[Timeline 2002–2016: GFS, MapReduce, BigTable, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow]

Google Dataflow SDK

Open Source SDK used to construct a Dataflow pipeline.

(Now Incubating as Apache Beam)

Computing Team Scores

// Collection of raw log lines
PCollection<String> raw = ...;

// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> output =
    input.apply(Window.into(FixedWindows.of(Duration.standardMinutes(60))))
         .apply(Sum.integersPerKey());

Google Cloud Dataflow

● Given code in the Dataflow SDK (incubating as Apache Beam)…

● Pipelines can run… (see the runner sketch below)

○ On your development machine

○ On the Dataflow Service on Google Cloud Platform

○ On third-party environments like Spark or Flink.
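A minimal sketch of selecting where a pipeline runs, assuming the pre-Beam Dataflow Java SDK 1.x; the project and staging-location values are placeholders, and third-party runners (Spark, Flink) ship their own runner classes.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

public class TeamScoresMain {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setProject("my-gcp-project");                 // placeholder project id
    options.setStagingLocation("gs://my-bucket/staging"); // placeholder GCS staging path
    // Run on the managed service; for local development use DirectPipelineRunner instead.
    options.setRunner(DataflowPipelineRunner.class);

    Pipeline p = Pipeline.create(options);
    // ... attach the transforms from the "Computing Team Scores" example ...
    p.run();
  }
}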

Cloud Dataflow

A fully-managed cloud service and programming model for batch and streaming big data processing.

Google Cloud Dataflow

[Diagram: Optimize → Schedule; the pipeline reads from and writes to GCS]

Back to the Problem at Hand

[Chart: workload vs. time with auto-tuned parallelism, as before]

Auto-Tuning Ingredients

Signals measuring Workload
Policy making Decisions
Mechanism actuating Change

2. Pipeline Execution

Optimized Pipeline = DAG of Stages

[Diagram: a pipeline optimized into a DAG of stages S0, S1, S2]

[Diagram: the DAG with edges labeled raw input → individual points → team points]

Stage Throughput Measure

[Diagram: throughput measured at the output of each stage]

Picture by Alexandre Duret-Lutz, Creative Commons 2.0 Generic

Queues of Data Ready for Processing

[Diagram: queues of pending data in front of stages S0, S1, S2]

Queue Size = Backlog

Backlog Size vs. Backlog Growth

Backlog Growth = Processing Deficit

Derived Signal: Stage Input Rate

[Diagram: stage S1 with its measured throughput and backlog growth]

Input Rate = Throughput + Backlog Growth

Constant Backlog... could be bad

Backlog Time = Backlog Size / Throughput

Backlog Time = Time to get through the backlog

Bad Backlog = Long Backlog Time
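To make the derived signals concrete, here is a minimal Java sketch (a hypothetical helper, not the Dataflow implementation) computing input rate and backlog time from a stage's measured throughput, backlog growth, and backlog size:

// Hypothetical helper; field names and units are assumptions for illustration.
public class StageSignals {
  final double throughputMBps;    // measured throughput (MB/s in the examples)
  final double backlogGrowthMBps; // change in backlog per second; positive means the stage is falling behind
  final double backlogSizeMB;     // data currently queued for this stage

  StageSignals(double throughputMBps, double backlogGrowthMBps, double backlogSizeMB) {
    this.throughputMBps = throughputMBps;
    this.backlogGrowthMBps = backlogGrowthMBps;
    this.backlogSizeMB = backlogSizeMB;
  }

  // Input Rate = Throughput + Backlog Growth
  double inputRateMBps() {
    return throughputMBps + backlogGrowthMBps;
  }

  // Backlog Time = Backlog Size / Throughput (time to get through the backlog)
  double backlogTimeSeconds() {
    return throughputMBps > 0 ? backlogSizeMB / throughputMBps : Double.POSITIVE_INFINITY;
  }
}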

Backlog Growth and Backlog Time inform Upscaling.

What Signals indicate Downscaling?

Low CPU Utilization

Signals Summary

Throughput
Backlog growth
Backlog time
CPU utilization

Policy: making Decisions

Goals:
1. No backlog growth
2. Short backlog time
3. Reasonable CPU utilization

Upscaling Policy: Keeping Up

Given M machines

For a stage, given:

average stage throughput T

average positive backlog growth G of stage

Machines needed for stage to keep up:

M' = M × (T + G) / T

Upscaling Policy: Catching Up

Given M machines

Given R (time to reduce backlog)

For a stage, given:

average backlog time B

Extra machines to remove backlog:

Extra = M × B / R

Upscaling Policy: All Stages

Want all stages to:

1. keep up

2. have low backlog time

Pick Maximum over all stages of M’ + Extra
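Putting the two rules together, a minimal sketch of the upscaling computation (hypothetical helper code, not the actual service); T, G, and B per stage are assumed to come from the signals above:

// Hypothetical upscaling policy sketch: M = current machines, R = desired time to drain backlog.
public class UpscalingPolicy {

  // M' + Extra for one stage: keep up with the input rate, plus capacity to drain the backlog in R seconds.
  static double desiredForStage(int m, double t, double g, double b, double r) {
    if (t <= 0) {
      return m;                                    // no throughput measured; leave the stage as-is
    }
    double keepUp = m * (t + Math.max(g, 0)) / t;  // M' = M (T + G) / T; only positive growth counts
    double extra = m * b / r;                      // Extra = M B / R
    return keepUp + extra;
  }

  // The pipeline upscales to the maximum over all stages of M' + Extra.
  static int desiredMachines(int m, double[][] stages, double r) {
    double desired = m;
    for (double[] s : stages) {                    // each entry: {throughput T, backlog growth G, backlog time B}
      desired = Math.max(desired, desiredForStage(m, s[0], s[1], s[2], r));
    }
    return (int) Math.ceil(desired);
  }
}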

Example (signals)

[Charts, shown across four build slides: input rate and throughput (MB/s), backlog growth, and backlog time (seconds) over time]

Example (policy)

[Charts, shown across four build slides: machines M, keep-up target M', and Extra (with R = 60s) over time]

Preconditions for Downscaling

Low backlog time

No backlog growth

Low CPU utilization

How far can we downscale?

Stay tuned...

3. Mechanism: actuating Change

Adjusting Parallelism of a Running Streaming Pipeline

Optimized Pipeline = DAG of Stages

[Diagram: the DAG of stages S0, S1, S2, repeated across three build slides]

Adding Parallelism

[Diagram: stages S0, S1, S2 running on Machine 0, then additional copies of the stages started on Machine 1]

Adding Parallelism = Splitting Key Ranges

[Diagram: the key range handled by each stage split between Machine 0 and Machine 1]

Migrating a Computation

Adding Parallelism = Migrating Computation Ranges

[Diagram: a computation range moving from Machine 0 to Machine 1]

Checkpoint and Recovery ~ Computation Migration
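A rough, hypothetical sketch of what migrating a computation range by checkpoint and recovery involves (names and steps are illustrative only, not the actual MillWheel/Dataflow code):

// Hypothetical sketch: move ownership of a key range from one worker to another.
public class RangeMigration {

  interface Worker {
    byte[] checkpoint(String keyRange);           // stop processing the range and persist its state
    void restore(String keyRange, byte[] state);  // load the state and resume processing the range
  }

  static void migrate(String keyRange, Worker from, Worker to) {
    byte[] state = from.checkpoint(keyRange);  // 1. checkpoint on the source machine
    to.restore(keyRange, state);               // 2. recover on the destination machine
    // In the service, per-range state lives on logical persistent disks (next slides),
    // so migration is largely a reassignment of which machine owns which ranges/disks.
  }
}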

Key Ranges and Persistence

[Diagram: Machines 0–3 each run stages S0, S1, S2 for their slice of key ranges, with state on persistent disks]

Downscaling from 4 to 2 Machines

[Diagram, across several build slides: the ranges and disks owned by Machines 0 and 3 are migrated onto Machines 1 and 2]

Upsizing = Steps in Reverse

Granularity of Parallelism

As of March 2016, Google Cloud Dataflow:

• Splits key ranges initially based on max machines
• At max: 1 logical persistent disk per machine (each disk holds a slice of key ranges from all stages)
• Allows only (relatively) even disk distributions
• Results in scaling quanta

Example Scaling Quanta: Max = 60 Machines

Parallelism    Disks per Machine
3              N/A
4              15
5              12
6              10
7              8, 9
8              7, 8
9              6, 7
10             6
12             5
15             4
20             3
30             2
60             1

Policy: making Decisions

Goals:
1. No backlog growth
2. Short backlog time
3. Reasonable CPU utilization

Preconditions for Downscaling

Low backlog time
No backlog growth
Low CPU utilization

Downscaling Policy

Next lower scaling quantum => M' machines

Estimate future CPU_M' per machine:

CPU_M' = CPU_M × M / M'

If new CPU_M' < threshold (say 90%), downscale to M'

4. Summary + Future Work

Artificial Experiment

Auto-Scaling Summary

Signals: throughput, backlog time, backlog growth, CPU utilization

Policy: keep up, reduce backlog, use CPUs

Mechanism: split key ranges, migrate computations

Future Work

• Experiment with non-uniform disk distributions to address hot ranges
• Dynamically split ranges finer than initially done
• Approximate model of the #VMs vs. throughput relation

Questions?

Further reading on streaming model:

The world beyond batch: Streaming 101

The world beyond batch: Streaming 102