© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
ARC348
Seagull: A Highly Fault-Tolerant Distributed System for Concurrent Task Execution
Osman Sarood, Software Engineer, Yelp
October 2015
How Yelp:
• Runs millions of tests a day
• Downloads TBs of data in an extremely efficient manner
• Scales using our custom metric
What’s in it for me?
What to Expect
• High-level architecture talk
• No code
• Assumes basic understanding of Apache Mesos
A distributed system that allows concurrent task execution:
• at large scale
• while maintaining high cluster utilization
• and is highly fault tolerant and resilient
What Is Seagull?
• How does Seagull work?
• Major problem: Artifact downloads
• Auto Scaling and fault tolerance
What We’ll Cover
• Each day, Seagull runs tests that would take 700 days (serially)!
• On average, 350 seagull-runs a day
• Each seagull-run has ~ 70,000 tests
• Each seagull-run would take 2 days to run serially
• Volatile but predictable demand
• 30% of load in 3 hours
• Predictable peak time (3PM-6PM)
What’s the Challenge?
• Run 3,000 tests concurrently on a 200-machine cluster
• 2 days in serial => 14 mins!
• Running at such high scale involves:
• Downloading 36 TB a day (5 TB/hr peak)
• Up to 200 simultaneous downloads for a single large file
• 1.5 million Docker containers a day (210K/hr peak)
What Do We Do?
[Architecture diagram on EC2: a Yelp developer, the Prioritizer, and the UI interact with schedulers Scheduler1 … Scheduler'y', which run tasks on slaves Slave1 … Slave'n'; Elasticsearch, DynamoDB, and S3 back the system; numbered arrows 1–8 (with 6a/6b) trace the flow]
Seagull Overview
• Cluster of 7 r3.8xlarges
• Builds our largest service (artifact)
• Uploads artifacts to Amazon S3
• Discovers tests
[Diagram: a Yelp developer's change triggers Jenkins, which builds the artifact, uploads it to S3, and discovers tests]
Jenkins
• The largest service, which forms a major part of the website
• Several hundred MB in size
• Takes ~10 mins to build
• Uses lots of memory
• Huge setup cost:
  • Build it once and download later
Yelp Artifact
• Takes the artifact and determines test names
• Parses Python code to extract test names
• Finishes in ~2 mins
• Produces a separate test list for each of the 7 different suites
Test Discovery
[Seagull architecture diagram, repeated as a recap]
Recap
• Schedule longest tests first
• Historical test timing data from DynamoDB
• Fetch 25 million records/day
• Why DynamoDB:
• We don’t need to maintain it!
• Low cost (just $200/month!)
[Diagram: the Prioritizer combines the test list with historical timing data from DynamoDB]
Test Prioritization
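Longest-tests-first prioritization can be sketched as a sort over historical durations; the test names, timings, and the default for tests with no history are illustrative assumptions:

```python
# Sketch of longest-first scheduling: given historical durations (as
# fetched from DynamoDB), sort descending so long tests start early and
# short tests backfill the tail of the run.
def prioritize(tests, history, default=60.0):
    """Order tests by historical duration, longest first.

    Tests with no recorded history get `default` seconds so brand-new
    tests are neither starved nor scheduled last."""
    return sorted(tests, key=lambda t: history.get(t, default), reverse=True)

history = {"test_search": 420.0, "test_login": 30.0}
print(prioritize(["test_login", "test_search", "test_new"], history))
# ['test_search', 'test_new', 'test_login']
```

Scheduling long tests first reduces the chance that one straggler test, started late, extends the whole seagull-run.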
[Seagull architecture diagram, repeated as a recap]
Recap
• Run ~350 seagull-runs/day:
• each run ~70000 tests (~ 25 million tests/day)
• total serial time of 48 hours/run
• Challenging to run lots of tests at scale during peak times
[Chart: runs submitted per 10 minutes, showing the 3PM-6PM peak]
The Testing Problem
• Resource management system
  • mesos-master: on the master node
  • mesos-slave: on every slave
• Slaves register their resources with the Mesos master
• Schedulers subscribe to the Mesos master to consume resources
• The master offers resources to schedulers in a fair manner
[Diagram: Scheduler 1 and Scheduler 2 receive offers from the Mesos Master, which manages Slave 1 and Slave 2]
Apache Mesos
Seagull leverages the resource management abilities of Apache Mesos:
• Each seagull-run has its own Mesos scheduler
• Each scheduler distributes work amongst ~600 workers (executors)
• Cluster of 200 r3.8xlarge instances (32 cores, 256 GB each)
Running Tests in Parallel
[Key: a test (color-coded per scheduler); a set of tests is a bundle; C1 is a scheduler; Slave1 (s1) is a slave]
[Diagram: Yelp devs trigger Seagull schedulers C1 and C2; the Mesos Master offers resources from Slave1 (s1) and Slave2 (s2) in the Seagull cluster, where each scheduler's bundles run]
Parallel Test Execution
• Each executor needs to have the artifact before running tests
• 18,000 requests per hour at peak
• Each request is for a large file (hundreds of MBs)
• A single executor (out of 600) taking too long to download could delay the entire seagull-run
Why Is Artifact Download Critical?
[Seagull architecture diagram, repeated as a recap]
Recap
[Diagram: a Seagull executor, running in Docker, fetches the artifact from Amazon S3, starts the service, runs tests, and reports results to Elasticsearch and Amazon DynamoDB; the cycle takes ~10 mins on average]
Seagull Executor
[Key: a bundle is a set of tests; C1 is a scheduler; S1 is a slave; A1 is the artifact for scheduler C1; "Exec C1 / A1" is an executor of scheduler C1 holding A1]
• Scheduler C1 starts and distributes work amongst 600 executors
• Each executor (a.k.a. task):
  • downloads its own artifact (independent)
  • runs for ~10 mins on average
• Each slave runs 15 executors (C1 uses a total of 40 slaves)
• 200 slaves × 15 executors × 6 fetches/hr = 18,000 reqs/hr! (13.5 TB/hour)
[Diagram: executors of C1 on each of Slave 1 … Slave 40 independently download A1 from S3]
Artifact Handling
• Lots of requests took as long as 30 mins!
• We choked NAT boxes with tons of requests
• Avoiding NAT would have required a bigger effort
• We wanted a quick solution
Slow Download Times
• Executors from the same scheduler can share artifacts
• Disadvantages:
  • Executors are no longer independent
  • Requires a locking implementation for artifact downloads
[Diagram: on each slave, executors of C1 and C2 share local copies of A1 and A2 instead of downloading per executor, but every slave still hits S3; this still doesn't scale well]
Sharing Artifacts
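The locking this slide mentions can be sketched with an advisory file lock (Unix `flock`): the first executor on a slave to grab the lock downloads the artifact, and later executors wait, find the file, and skip the download. `fetch_from_s3` is a hypothetical stand-in, not Seagull's real downloader:

```python
# Sketch of per-slave artifact sharing with a file lock (Unix-only).
import fcntl
import os
import tempfile

def ensure_artifact(path, fetch_from_s3):
    """Download `path` at most once per slave; concurrent callers block on the lock."""
    lock_path = path + ".lock"
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # blocks until this caller holds the lock
        try:
            if not os.path.exists(path):   # another executor may have won the race
                fetch_from_s3(path)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return path

# Demo: the second call finds the file already present and skips the download.
downloads = []
def fake_fetch(p):
    downloads.append(p)
    with open(p, "w") as f:
        f.write("artifact bytes")

artifact = os.path.join(tempfile.mkdtemp(), "yelp-main.tar.gz")
ensure_artifact(artifact, fake_fetch)
ensure_artifact(artifact, fake_fetch)
print(len(downloads))  # 1
```

The check-after-lock pattern is what makes executors no longer independent: correctness now depends on every executor on the slave going through the same lock.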
• Artifactcache consisting of 9 r3.8xlarges
• Replicate each artifact across each of the 9 artifact caches
• Nginx distributes requests
• 10 Gbps network bandwidth helped
[Diagram: executors on every slave of the Seagull cluster fetch A1/A2 from the dedicated artifactcache instead of S3]
Separate Artifactcache
• Why not use the network bandwidth already available in our Amazon EC2 compute cluster?
• The entire cluster serves as the artifactcache
• The cache scales as the cluster scales
• Bandwidth comparison:
  • Centralized cache: ~30 Mbps/executor [9 caches × 10 Gbps / 3,000 executors]
  • Distributed cache: ~666 Mbps/executor [200 caches × 10 Gbps / 3,000 executors]
Distributed Artifactcache
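The bandwidth comparison above checks out with a quick back-of-the-envelope calculation:

```python
# Per-executor serving bandwidth: (number of cache machines) x (NIC
# bandwidth each) divided evenly across all concurrent executors.
GBPS = 10          # NIC bandwidth per r3.8xlarge cache machine
EXECUTORS = 3000   # concurrent executors across the cluster

centralized = 9 * GBPS * 1000 / EXECUTORS    # 9 dedicated cache machines, in Mbps
distributed = 200 * GBPS * 1000 / EXECUTORS  # all 200 slaves serve the cache

print(centralized)  # 30.0
print(distributed)  # ~666.7
```

A roughly 22x improvement in per-executor bandwidth, without adding a single machine.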
[Diagram: a random selector routes each artifact request to a slave (Slave 1 … Slave 4) whose artifact pool holds the artifact; every slave contributes its artifact pool to the cache]
Benefits of distributed artifact caching:
• Very scalable
• No extra machines to maintain
• Significant reduction in out-of-disk-space issues
• Fast downloads due to less contention
Distributed Artifact Caching
[Charts: artifact download time (secs) and number of downloads, per 10 minutes]
Can we improve download times further?
Distributed Artifactcache Performance
• At peak times:
  • Lots of downloads happen
  • Most artifacts end up downloaded on 90% of slaves
• Once a machine downloads an artifact, it should serve other requests for that artifact
• Disadvantage: bookkeeping
Stealing Artifacts
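The bookkeeping disadvantage can be sketched as a small registry of which slaves already hold each artifact, so a new bundle can "steal" from a peer instead of hitting S3. The registry shape and method names are assumptions for illustration:

```python
# Sketch of artifact-steal bookkeeping: record holders as downloads
# complete, and pick a random peer to steal from on the next request.
import random

class ArtifactRegistry:
    def __init__(self):
        self._holders = {}  # artifact id -> set of slave hostnames

    def record(self, artifact, slave):
        """Note that `slave` now has a local copy of `artifact`."""
        self._holders.setdefault(artifact, set()).add(slave)

    def pick_source(self, artifact):
        """Return a peer slave to steal from, or None to fall back to S3."""
        holders = self._holders.get(artifact)
        return random.choice(sorted(holders)) if holders else None

reg = ArtifactRegistry()
assert reg.pick_source("A2") is None      # nobody has it yet: go to S3
reg.record("A2", "slave4")
print(reg.pick_source("A2"))              # slave4
```

In a real deployment this registry would itself need to be shared and kept fresh across slaves, which is exactly the bookkeeping cost the slide calls out.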
[Diagram: 1. Slave 4 gets A2 from S3. 2. A bundle starts on Slave 2. 3. Slave 2 pulls A2 from Slave 4. 4. Bundles start on Slaves 1 & 3. 5a. Slave 3 steals A2 from Slave 4. 5b. Slave 1 steals A2 from Slave 2. A random selector directs each request to a slave whose artifact pool already holds A2]
Stealing Artifacts
[Charts: artifact steal time and number of steals, per 10 minutes]
Performance: Stealing in Distributed Artifact Caching
• We used the Auto Scaling group provided by AWS, but it wasn't easy to select which instances to terminate
• Mesos assigns work in FIFO order, and Auto Scaling also terminates in FIFO order: the same slaves that get work first also get terminated first
• Example: 10% of slaves are working -> remove 10% of the fleet -> Auto Scaling terminates exactly the slaves doing work
[Chart: runs submitted per 10 minutes]
Auto Scaling
• CPU and memory demand is volatile
• Seagull tells Mesos to reserve the maximum amount of memory a task requires (call it m_t for task t)
• Total memory required to run a set T of tasks concurrently: R = Σ_{t ∈ T} m_t
Reserved Memory
• Let a_i denote the total available memory on slave 'i', and S the set of all slaves in our cluster
• Total available memory: A = Σ_{i ∈ S} a_i
• Gull-load: the ratio of total reserved memory to total available memory, GL = R / A
Gull-load
[Flowchart, invoked every 10 minutes: compute gull-load; if gull-load > 0.9, add 10% extra machines; if gull-load < 0.5, calculate gull-load for each machine, sort on gull-load, select the 10% of slaves with the least gull-load, and terminate them; otherwise do nothing]

Gull-load (GL)  | Action (# slaves)
0.5 < GL < 0.9  | Nothing
GL > 0.9        | Add 10%
GL < 0.5        | Remove 10%
How Do We Scale Automatically?
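The policy in the table can be sketched as follows, assuming simple per-slave maps of reserved and available memory in GB; this illustrates gull-load, not Seagull's actual implementation:

```python
# Gull-load scaling sketch: gull-load is total reserved memory over
# total available memory; scale out above 0.9, scale in below 0.5 by
# removing the 10% of slaves with the least per-slave gull-load.
def gull_load(reserved, available):
    return sum(reserved.values()) / sum(available.values())

def scaling_action(reserved, available):
    gl = gull_load(reserved, available)
    if gl > 0.9:
        return ("add", None)              # add 10% extra machines
    if gl < 0.5:
        # terminate the 10% of slaves carrying the least reserved memory
        by_load = sorted(available, key=lambda s: reserved.get(s, 0) / available[s])
        n = max(1, len(by_load) // 10)
        return ("remove", by_load[:n])
    return ("noop", None)

reserved = {f"s{i}": 20 for i in range(10)}    # GB reserved per slave
available = {f"s{i}": 100 for i in range(10)}  # GB capacity per slave
print(scaling_action(reserved, available))     # ('remove', ['s0'])
```

Choosing the least-loaded slaves to terminate is the key difference from Auto Scaling's FIFO termination, which tends to kill the busiest slaves first.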
• Started with all Reserved Instances. Too expensive!
• Shifted to all Spot. We always knew it was risky…
• One fine day, all our slaves were gone!
• Now a mix of On-Demand (25%) and Spot (75%) instances
Reserved, On-Demand, or Spot Instances?
Seagull provides fault tolerance at two levels:
• Hardware level: spreading our machines geographically (preventive)
• Infrastructure level: Seagull retries upon failure (corrective)
Fault Tolerance and Reliability
• Dividing machines roughly equally amongst Availability Zones
  • us-west-2: a => 60, b => 66, c => 66
• Easy to terminate a slave and recreate it quickly
• In the event of losing Spot Instances:
  • Seagull-runs keep running on the On-Demand instances
  • We add On-Demand instances until Spot Instances are available again (manual)
Preventive Fault Tolerance (Reliability)
• Lots of reasons for executors to fail:
  • Bad service
  • Docker problems (>100 concurrent containers/machine)
  • External partners (e.g., Sauce Labs)
• How we do it:
  • A Task Manager (inside each scheduler) tracks the life cycle of each executor/task
  • Fixed number of retries upon failure/timeout
Corrective Fault Tolerance
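The fixed-retry behavior can be sketched as a simple loop; `run_bundle`, the exception types, and the retry limit are illustrative assumptions, not Seagull's actual API:

```python
# Sketch of the Task Manager's corrective retry loop: run a bundle,
# retry on failure up to a fixed limit, then give up loudly.
def run_with_retries(run_bundle, max_retries=3):
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return run_bundle()
        except Exception as exc:   # bad service, Docker, external partner, ...
            last_error = exc
    raise RuntimeError(f"bundle failed after {max_retries} attempts") from last_error

# Demo: a bundle that fails twice, then succeeds on the third attempt.
attempts = []
def flaky_bundle():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("docker daemon hiccup")
    return "passed"

print(run_with_retries(flaky_bundle))  # passed
```

A timeout would be handled the same way in this sketch: the Task Manager treats a task that exceeds its deadline as a failure and re-queues the bundle.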
[Key: a bundle is a set of tests; C1 is a scheduler; S1 is a slave; A1 is the artifact for scheduler C1; "Exec C1 / A1" is an executor of scheduler C1; the Task Manager tracks the life cycle of each task, i.e., queued, running, finished]
[Diagram: slave S1 (us-west-2a) crashes mid-run; the scheduler's Task Manager notices and reruns the affected bundles on S2 (us-west-2b)]
Corrective Fault Tolerance
• How Seagull works and interacts with other systems
• An extremely efficient artifact hosting design
• Custom scaling policy and its use of gull-load
• Fault tolerance at scale using:
• AWS
• Executor retry logic
What Did We Learn?
• Sanitize code for open sourcing
• Explore why Amazon S3 downloads are so slow:
  • Avoiding the NAT boxes
  • Using multiple buckets
  • Breaking our artifact into smaller files
• Improve scaling:
  • Ability to use other instance types
  • Reduce cost by choosing the Spot Instance types with the best GB/$
Future Work