

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

ARC348

Seagull: A Highly Fault-Tolerant Distributed System for Concurrent Task Execution

Osman Sarood, Software Engineer, Yelp

October 2015

[Yelp stats slide: monthly visitors, reviews, mobile searches, countries]

How Yelp:

• Runs millions of tests a day

• Downloads TBs of data in an extremely efficient manner

• Scales using our custom metric

What’s in it for me?

What to Expect

• High-level architecture talk

• No code

• Assumes basic understanding of Apache Mesos

A distributed system that allows concurrent task execution:

• at large scale

• while maintaining high cluster utilization

• and is highly fault tolerant and resilient

What Is Seagull?

The Developer

Run millions of tests each day

Seagull @ Yelp

• How does Seagull work?

• Major problem: Artifact downloads

• Auto Scaling and fault tolerance

What We’ll Cover

1: How Does Seagull Work?

• Each day, Seagull runs tests that would take 700 days (serially)!

• On average, 350 seagull-runs a day

• Each seagull-run has ~ 70,000 tests

• Each seagull-run would take 2 days to run serially

• Volatile but predictable demand

• 30% of load in 3 hours

• Predictable peak time (3PM-6PM)

What’s the Challenge?

• Run 3000 tests concurrently on a 200 machine cluster

• 2 days in serial => 14 mins!

• Running at such high scale involves:

• Downloading 36 TB a day (5 TB/hr peak)

• Up to 200 simultaneous downloads for a single large file

• 1.5 million Docker containers a day (210K/hr peak)

What Do We Do?

S3

Docker

Jenkins

Mesos

EC2

Elasticsearch

DynamoDB

Reporting

Monitoring

Seagull Ingredients

[Seagull overview diagram, steps 1–8: Yelp Developer, UI, Prioritizer, Schedulers 1…'y' on EC2, Slaves 1…'n', Elasticsearch, DynamoDB, S3]

Seagull Overview

• Cluster of 7 r3.8xlarges

• Builds our largest service (artifact)

• Uploads artifacts to Amazon S3

• Discovers tests

[Diagram: Yelp Developer → Jenkins → build artifact, discover tests → S3]

Jenkins

• The largest service, forming a major part of the website

• Several hundred MB in size

• Takes ~10 mins to build

• Uses lots of memory

• Huge setup cost

• Built once, downloaded later

Yelp Artifact

• Takes the artifact and determines test names

• Parses Python code to extract test names

• Finishes in ~2 mins

• Separate test list for each of the 7 different suites

Test Discovery
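Although the talk itself includes no code, a minimal sketch of what test discovery could look like is below, assuming tests are plain Python functions whose names start with "test_"; the paths and helper names are hypothetical, not Yelp's actual implementation.

```python
# Hypothetical sketch: walk a checked-out artifact and collect test names
# by parsing Python source with the standard-library ast module.
import ast
import os

def discover_tests(root_dir):
    """Yield 'path::test_name' for every function whose name starts with 'test_'."""
    for dirpath, _, filenames in os.walk(root_dir):
        for filename in filenames:
            if not filename.endswith(".py"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path) as source:
                tree = ast.parse(source.read(), filename=path)
            for node in ast.walk(tree):
                if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
                    yield "{}::{}".format(path, node.name)

if __name__ == "__main__":
    for test in discover_tests("artifact/tests"):   # assumed location
        print(test)
```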

[Recap: Seagull overview diagram]

Recap

• Schedule longest tests first

• Historical test timing data from DynamoDB

• Fetch 25 million records/day

• Why DynamoDB:

• We don’t need to maintain it!

• Low cost (just $200/month!)

[Diagram: test list + historical timing data from DynamoDB → Prioritizer]

Test Prioritization
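A minimal sketch of the longest-first ordering described above; the DynamoDB table name, key schema, and attribute names below are assumptions for illustration only.

```python
# Hypothetical sketch: order tests so the longest-running ones are scheduled first,
# using per-test durations stored in DynamoDB (table name/schema are made up).
import boto3

def fetch_historical_durations(table_name, test_names):
    """Return {test_name: avg_duration_seconds}, defaulting to 0 for unseen tests."""
    table = boto3.resource("dynamodb").Table(table_name)
    durations = {}
    for name in test_names:
        item = table.get_item(Key={"test_name": name}).get("Item")
        durations[name] = float(item["avg_duration"]) if item else 0.0
    return durations

def prioritize(test_names, durations):
    """Longest tests first, so stragglers start early and finish with the rest."""
    return sorted(test_names, key=lambda t: durations.get(t, 0.0), reverse=True)
```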

[Recap: Seagull overview diagram]

Recap

• Run ~350 seagull-runs/day:

• each run ~70000 tests (~ 25 million tests/day)

• total serial time of 48 hours/run

• Challenging to run lots of tests at scale during peak times

[Chart: runs submitted per 10 minutes, with the daily peak annotated]

The Testing Problem

• Resource management system

• mesos-master: on the master node

• mesos-slave: on every slave

• Slaves register their resources with the Mesos master

• Schedulers subscribe to the Mesos master to consume resources

• The master offers resources to schedulers in a fair manner

[Diagram: Mesos Master brokering offers between Slave1/Slave2 and Scheduler 1/Scheduler 2]

Apache Mesos

Seagull leverages the resource management abilities of Apache Mesos:

• Each run has its own Mesos scheduler

• Each scheduler distributes work amongst ~600 workers (executors)

• 200 r3.8xlarge instances (32 cores / 256 GB each)

Running Tests in Parallel
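The offer-handling logic can be sketched roughly as below. This is a plain-Python illustration of how a per-run scheduler might match test bundles to Mesos resource offers; it is not Yelp's actual scheduler and does not use the real Mesos API, and the reservation sizes are assumptions.

```python
# Hypothetical sketch of offer matching: accept a resource offer if it fits a
# bundle's reservation, otherwise leave the bundle queued for the next offer.
from collections import deque

BUNDLE_CPUS = 2.0      # assumed per-bundle CPU reservation
BUNDLE_MEM_MB = 4096   # assumed per-bundle memory reservation

class SeagullRunScheduler(object):
    def __init__(self, bundles):
        self.pending = deque(bundles)   # test bundles not yet launched

    def on_resource_offer(self, offer):
        """Launch as many pending bundles as this offer can hold."""
        launched = []
        cpus, mem = offer["cpus"], offer["mem_mb"]
        while self.pending and cpus >= BUNDLE_CPUS and mem >= BUNDLE_MEM_MB:
            launched.append(self.pending.popleft())
            cpus -= BUNDLE_CPUS
            mem -= BUNDLE_MEM_MB
        return launched   # a real scheduler would wrap these in Mesos TaskInfos
```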

Terminology (key): a test (color-coded per scheduler); a set of tests = a bundle; C1 = Scheduler C1; S1 = Slave 1.

[Diagram: Users 1 and 2 each trigger a scheduler (C1, C2); the Mesos Master offers resources from Slave1 (S1) and Slave2 (S2), and each scheduler's bundles run across the Seagull cluster]

Parallel Test Execution

2: Key Challenge: Artifact Downloads

• Each executor needs to have the artifact before running tests

• 18,000 requests per hour at peak

• Each request is for a large file (hundreds of MB)

• A single executor (out of 600) that is slow to download can delay the entire seagull-run

Why Is Artifact Download Critical?

[Recap: Seagull overview diagram]

Recap

[Seagull executor workflow (~10 mins on average): fetch artifact from Amazon S3 → start service in Docker → run tests → report results to Elasticsearch and Amazon DynamoDB]

Seagull Executor

Terminology (key): a test; a set of tests (bundle); C1 = Scheduler C1; S1 = Slave 1; A1 = artifact for scheduler C1; Exec C1 = executor of scheduler C1.

• Scheduler C1 starts and distributes work amongst 600 executors

• Each executor (a.k.a. a task):

• downloads its own artifact (keeps executors independent)

• runs for ~10 mins on average

• Each slave runs 15 executors (C1 uses a total of 40 slaves)

• 200 slaves * 15 executors * 6 artifact fetches/hr (one per ~10-min executor) = 18,000 reqs/hr! (13.5 TB/hour)

[Diagram: every executor (Exec C1) on Slaves 1–40 downloads artifact A1 directly from S3]

Artifact Handling

• Lots of requests took as long as 30 mins!

• We choked NAT boxes with tons of requests

• Avoiding the NAT boxes would have required a bigger effort

• We wanted a quick solution

Slow Download Times

• Executors from the same scheduler can share artifacts

• Disadvantages:

• executors are no longer independent

• requires a locking implementation around artifact downloads

[Diagram: executors for C1 and C2 on Slaves 1–40 share one local copy of A1 and A2, still fetched from S3. Still doesn't scale well]

Sharing Artifacts
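A minimal sketch of what per-slave download locking could look like, assuming a POSIX file lock; the helper names are hypothetical and the real implementation may differ.

```python
# Hypothetical sketch: executors on one slave share an artifact by serializing
# the download behind a file lock, so only the first one hits S3.
import fcntl
import os

def ensure_artifact(artifact_path, download_from_s3):
    """Download the artifact once per slave; other executors wait on the lock."""
    lock_path = artifact_path + ".lock"
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)    # blocks until the holder releases it
        try:
            if not os.path.exists(artifact_path):
                download_from_s3(artifact_path)  # caller-supplied S3 download
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return artifact_path
```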

• Artifactcache consisting of 9 r3.8xlarges

• Replicate each artifact across all 9 artifact caches

• Nginx distributes requests

• 10 Gbps network bandwidth per cache helped

[Diagram: executors on Slaves 1–40 fetch A1/A2 from the dedicated 9-node Artifactcache instead of S3]

Separate Artifactcache

[Charts: number of active schedulers per 10 min; download time (secs) per 10 min]

Artifact Download Metrics

• Why not use the ample network bandwidth of our Amazon EC2 compute nodes themselves?

• The entire cluster serves as the artifactcache

• The cache scales as the cluster scales

• Bandwidth comparison:

• Centralized cache: ~30 Mbps/executor [9 caches * 10 Gbps / 3000 executors]

• Distributed cache: ~666 Mbps/executor [200 caches * 10 Gbps / 3000 executors]

Distributed Artifactcache

[Diagram: a Random Selector directs requests for A1 to the artifact pool of one of the cluster slaves (Slaves 1–4)]

Benefits of distributed artifact caching:

• Very scalable

• No extra machines to maintain

• Significant reduction in out-of-space disk issues

• Fast downloads due to less contention


Distributed Artifact Caching
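The selection logic lends itself to a short sketch. This is an illustrative guess at how a random-selector lookup with an S3 fallback might work; the data structures and function names are assumptions, not the actual Seagull code.

```python
# Hypothetical sketch: pick a random cluster slave that holds the artifact and
# download from it; fall back to S3 if no peer has it yet.
import random

def pick_source(artifact_id, slave_pools):
    """slave_pools maps slave hostname -> set of artifact ids in its pool."""
    holders = [slave for slave, pool in slave_pools.items() if artifact_id in pool]
    return random.choice(holders) if holders else None

def fetch_artifact(artifact_id, slave_pools, fetch_from_slave, fetch_from_s3):
    source = pick_source(artifact_id, slave_pools)
    if source is not None:
        return fetch_from_slave(source, artifact_id)   # spread load across the cluster
    return fetch_from_s3(artifact_id)                  # the first copy comes from S3
```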

[Charts: artifact download time (secs) per 10 min; number of downloads per 10 min]

Can we improve download times further?

Distributed Artifactcache Performance

• At peak times:

• lots of downloads happen

• most artifacts end up being downloaded on 90% of slaves

• Once a machine has downloaded an artifact, it should serve other requests for that artifact

• Disadvantage: bookkeeping

Stealing Artifact

[Diagram: Random Selector over the Seagull cluster, Slaves 1–4 with their artifact pools]

1. Slave 4 gets A2

2. A bundle starts on Slave 2

3. Slave 2 pulls A2 from Slave 4

4. Bundles start on Slaves 1 & 3

5a. Slave 3 steals A2 from Slave 4

5b. Slave 1 steals A2 from Slave 2

Stealing Artifact
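A minimal sketch of the stealing flow above, under the assumption that some simple registry tracks which slaves already hold each artifact (the "bookkeeping" disadvantage); all names here are hypothetical.

```python
# Hypothetical sketch of artifact "stealing": before downloading, a slave asks a
# registry which peers already hold the artifact and copies from one of them,
# then registers itself so later slaves can steal from it too.
import random

class ArtifactRegistry(object):
    """Bookkeeping of which slaves hold which artifacts."""
    def __init__(self):
        self.holders = {}   # artifact_id -> set of slave hostnames

    def register(self, artifact_id, slave):
        self.holders.setdefault(artifact_id, set()).add(slave)

    def peers_with(self, artifact_id):
        return list(self.holders.get(artifact_id, ()))

def obtain_artifact(artifact_id, my_hostname, registry, copy_from_peer, fetch_from_s3):
    peers = registry.peers_with(artifact_id)
    if peers:
        copy_from_peer(random.choice(peers), artifact_id)   # steal from a peer
    else:
        fetch_from_s3(artifact_id)                          # nobody has it yet
    registry.register(artifact_id, my_hostname)             # now others can steal from us
```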

[Charts: artifact steal time per 10 min; number of steals per 10 min]

Performance: Stealing in Distributed Artifact Caching

Artifact Load-Balancing Viz

3: Auto Scaling and Fault Tolerance

• Used the Auto Scaling group provided by AWS, but it wasn't easy to 'select' which instances to terminate

• Mesos uses FIFO to assign work, whereas Auto Scaling also uses FIFO to terminate

• Example: only 10% of slaves are working -> scale in by 10% -> Auto Scaling terminates exactly the slaves doing work

[Chart: runs submitted per 10 mins]

Auto Scaling

• CPU and memory demand is volatile

• Seagull tells Mesos to reserve the max amount of memory a task requires (m_t for task t)

• Total memory required to run a set T of tasks concurrently: reserved = Σ_{t ∈ T} m_t

Reserved Memory

• Total available memory for slave 'i': a_i

• Let S denote the set of all slaves in our cluster

• Total available memory: available = Σ_{i ∈ S} a_i

• Gull-load: ratio of total reserved memory to total available memory, gull-load = reserved / available

Gull-load

[Chart: gull-load over time while running lots of executors]

Gull-load

[Flowchart, invoked every 10 minutes: calculate gull-load for each machine; if gull-load > 0.9, add 10% extra machines; if gull-load < 0.5, sort slaves by gull-load and terminate the 10% with the least gull-load; if 0.5 < gull-load < 0.9, do nothing]

Gull-load (GL) Action (# slaves)

0.5 < GL < 0.9 Nothing

GL > 0.9 Add 10%

GL < 0.5 Remove 10%

How Do We Scale Automatically?
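A minimal sketch of the gull-load–driven scaling loop described above; the thresholds match the table, while the data structures and helper callbacks are assumptions for illustration.

```python
# Hypothetical sketch: compute gull-load and apply the 10%-in/10%-out policy.
def gull_load(reserved_by_slave, available_by_slave):
    """Ratio of total reserved memory to total available memory across the cluster."""
    total_reserved = sum(reserved_by_slave.values())
    total_available = sum(available_by_slave.values())
    return total_reserved / float(total_available)

def autoscale_step(reserved_by_slave, available_by_slave, add_slaves, terminate_slaves):
    """Run every ~10 minutes."""
    load = gull_load(reserved_by_slave, available_by_slave)
    n = len(available_by_slave)
    if load > 0.9:
        add_slaves(max(1, n // 10))                        # add 10% extra machines
    elif load < 0.5:
        # terminate the 10% of slaves with the least per-slave gull-load
        least_loaded_first = sorted(
            available_by_slave,
            key=lambda s: reserved_by_slave.get(s, 0.0) / float(available_by_slave[s]),
        )
        terminate_slaves(least_loaded_first[:max(1, n // 10)])
    # 0.5 <= load <= 0.9: do nothing
```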

• Started with all Reserved Instances. Too expensive!

• Shifted to all Spot. We always knew it was risky…

• One fine day, all slaves were gone!

• Now a mix of On-Demand (25%) and Spot (75%) Instances

Reserved, On-Demand, or Spot Instances?

Seagull provides fault tolerance at two levels:

• Hardware level: spreading our machines geographically (preventive)

• Infrastructure level: Seagull retries upon failure (corrective)

Fault Tolerance and Reliability

• Equally dividing machines amongst Availability Zones

• us-west-2: a => 60, b => 66, c => 66

• Easy to terminate a slave and recreate it quickly

• In the event of losing Spot Instances:

• our seagull-runs keep running on the On-Demand instances

• we add On-Demand instances until Spot Instances are available again (manual)

Preventive Fault Tolerance (Reliability)

• Lots of reasons for executors to fail:

• bad service

• Docker problems (>100 concurrent containers/machine)

• external partners (e.g., Sauce Labs)

• How we do it:

• the Task Manager (inside each scheduler) tracks the life cycle of each executor/task

• a fixed number of retries upon failure/timeout

Corrective Fault Tolerance
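A minimal sketch of the retry bookkeeping a task manager could do; the class, state names, and retry budget below are assumptions rather than the actual implementation.

```python
# Hypothetical sketch: track each bundle's life cycle and rerun it a fixed
# number of times on failure or timeout before giving up.
MAX_RETRIES = 2   # assumed retry budget

class TaskManager(object):
    def __init__(self, bundles):
        # bundle_id -> life-cycle state ('queued', 'running', 'finished', 'failed')
        self.tasks = {b: {"state": "queued", "attempts": 0} for b in bundles}

    def mark_running(self, bundle_id):
        self.tasks[bundle_id]["state"] = "running"
        self.tasks[bundle_id]["attempts"] += 1

    def handle_update(self, bundle_id, succeeded):
        task = self.tasks[bundle_id]
        if succeeded:
            task["state"] = "finished"
        elif task["attempts"] <= MAX_RETRIES:
            task["state"] = "queued"          # requeue: slave crash, timeout, etc.
        else:
            task["state"] = "failed"          # out of retries, surface to the developer
```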

Terminology (key): a test; a set of tests (bundle); C1 = Scheduler C1; S1 = Slave 1; A1 = artifact for scheduler C1; Exec C1 = executor of scheduler C1; Task Manager = tracks the life cycle of each task (queued, running, finished).

[Diagram: User1's scheduler C1 runs bundles on slaves S1 (us-west-2a) and S2 (us-west-2b) via the Mesos Master; when S1 crashes, the Task Manager decides whether to rerun its bundles on the remaining slaves]

Corrective Fault Tolerance

• How Seagull works and interacts with other systems

• An extremely efficient artifact hosting design

• Custom scaling policy and its use of gull-load

• Fault tolerance at scale using:

• AWS

• Executor retry logic

What Did We Learn?

• Sanitize code for open source

• Explore why Amazon S3 downloads are so slow:

• avoiding the NAT box

• using multiple buckets

• breaking our artifact into smaller files

• Improve scaling:

• ability to use other instance types

• reduce cost by choosing Spot instance types with the lowest cost per GB of memory

Future Work

Remember to complete your evaluations!

Thank you!