Scaling capacity while saving cash


Page 1: Scaling capacity while saving cash

Scaling Capacity While Saving Cash

Kim Moir, Mozilla @kmoir URES, Seattle, Nov 10, 2014

Good afternoon. My name is Kim Moir and I'm a release engineer at Mozilla. Today I'm going to discuss how Mozilla scaled our infrastructure on AWS to handle the increasing load on our continuous integration farm, while reducing our monthly bills at the same time.

References: Montreal Subway picture https://www.flickr.com/photos/dephineprieur/3841791164/sizes/o/

Page 2: Scaling capacity while saving cash

Mozilla is a non-profit. Our mission is to promote openness, innovation & opportunity on the Web.

You're probably familiar with the products we build, such as Firefox for Desktop and Android, and Firefox OS. Firefox OS is a relatively new product that Mozilla started working on a few years ago; it's an open source operating system for smartphones. When a new product was coming online, we knew we would have to be able to scale our build farm to handle the additional load.

Note that we ship Firefox on four platforms, and with ~97 locales, on the same day as US English.

Page 3: Scaling capacity while saving cash

Our release cadence is every six weeks for Firefox for Desktop and Android. We release betas every week. FirefoxOS is on a different cadence. https://wiki.mozilla.org/RapidRelease

Page 4: Scaling capacity while saving cash

Release Engineering is a very geographically distributed team, and many of us work remotely. Even those who work close to a physical Mozilla office work several days a week from home.

Page 5: Scaling capacity while saving cash

Before I talk about how we scaled our build and test infrastructure on AWS, I'm going to talk a bit about the scale of our operations: how many builds and tests we run, the number of platforms, the number of repositories, and so on.

Image: https://www.flickr.com/photos/30649191@N00/9002545206/sizes/l

Page 6: Scaling capacity while saving cash

Daily


4500 build jobs

70,000 test jobs

Each time a developer lands a change, it triggers a series of builds and associated tests on the relevant platforms. Within each test job, many individual test suites run.

Page 7: Scaling capacity while saving cash

We have a commitment to developers that build and test jobs should start within 15 minutes of being requested. We don't have a perfect record on this, but our numbers are certainly good. We have metrics that measure this every day so we can see which platforms need additional capacity. We adjust capacity as needed, and remove old platforms as they become less relevant in the marketplace.

References: Pizza picture https://www.flickr.com/photos/djwtwo/9864611814/sizes/l/

Page 8: Scaling capacity while saving cash

Platforms

• Windows
• Mac
• Linux
• Android
• all of the above x many OS versions

We build and test on the following platforms, and on many different releases of each.

Page 9: Scaling capacity while saving cash

You can see all the versions of the platforms we build for on Treeherder (http://treeherder.mozilla.org). It's a web page anyone can see. Our developers look at it to see the results of their builds and tests, organized by branch.

Page 10: Scaling capacity while saving cash

Devices

• 5600+ in total
• 1600+ for builds
• 4000+ for tests

We have a lot of hardware in our build farm, both physically in our two data centres and virtually in AWS.

References:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html
https://secure.pub.build.mozilla.org/slavealloc/ui/#silos

Page 11: Scaling capacity while saving cash

Most companies that do a lot of mobile device testing just have a roomful of devices that developers can test on. We actually run continuous integration tests on Android reference cards. We have about 800 of them; they are called pandas and are rack mounted. These devices are not as stable as desktop machines and are prone to failure. Given their numbers, dealing with machines failing all the time would be very expensive if they were managed by humans.

As an aside, the failure rate on these reference devices is much higher (18%) than running the tests on emulators in AWS (2%).

References: pictures of the panda chassis from Dustin's blog https://blog.mozilla.org/it/2013/01/04/mozpool/2012-11-09-08-30-03/

Page 12: Scaling capacity while saving cash

Bursty traffic: you can see that the number of jobs run each day varies as time zones wake up. The large trough is obviously the weekend.

Source: http://atlee.ca/blog/posts/bursty-load.html

Page 13: Scaling capacity while saving cash

We also have a lot of repositories to manage

Page 14: Scaling capacity while saving cash

We have many different branches in Hg at Mozilla; our Hg branches are all named after different tree species. Developers push to different branches depending on their purpose. Different branches have different scheduling priorities within our continuous integration engine. For instance, if a change lands on the mozilla-beta branch, the builds and tests associated with that change will have machines allocated to them at a higher priority than if a change lands on the cedar branch, which is just for testing purposes.
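
To make the scheduling-priority idea concrete, here is a minimal sketch using Buildbot's prioritizeBuilders hook. The branch names come from this slide, but the weights, the builder-name matching, and the hook body are illustrative only, not Mozilla's actual buildbot-configs.

```python
# Illustrative only: a Buildbot prioritizeBuilders hook that prefers
# release branches over experimental ones. Weights and name matching
# are hypothetical, not Mozilla's production configuration.
BRANCH_PRIORITY = {
    'mozilla-beta': 0,     # highest priority: release-bound work
    'mozilla-central': 1,
    'try': 2,
    'cedar': 3,            # testing-only project branch
}

def branch_of(builder_name):
    """Guess the branch from a builder name like 'linux64 mozilla-beta build'."""
    for branch in BRANCH_PRIORITY:
        if branch in builder_name:
            return branch
    return None

def prioritizeBuilders(buildmaster, builders):
    # Buildbot calls this hook to decide which builders get idle slaves first.
    return sorted(builders,
                  key=lambda b: BRANCH_PRIORITY.get(branch_of(b.name), 99))

# In master.cfg: c['prioritizeBuilders'] = prioritizeBuilders
```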

Page 15: Scaling capacity while saving cash

+ many Mozilla tools

Here are some of the projects that we use in our infrastructure.

Buildbot is our continuous integration engine. It's an open source project written in Python, and we spend a lot of time writing Python to extend and customize it.

We use Puppet for configuration management of all our Buildbot masters and of the Linux and Mac slaves. When we provision new hardware, we just boot the device and it puppetizes based on its role, which is defined by its hostname.

Our repository of record is hg.mozilla.org, but developers also commit to git repos, and those commits are transferred to the hg repository. We also use a lot of Mozilla tools that allow us to scale. These tools are open source as well, and I have links to the repos at the end of the talk.

References: octokitty http://www.flickr.com/photos/tachikoma/2760470578/sizes/l/

Page 16: Scaling capacity while saving cash

This is a picture of how the different parts of our build farm work together. Developers land changes on code repositories such as hg.mozilla.org.

As I mentioned before, we use an open source continuous integration engine called Buildbot. We have over 50 Buildbot masters. Masters are segregated by function: tests, builds, scheduling, and try. Test and build masters are further divided by function so we can limit the type of jobs they run and the types of slaves they serve. For instance, a master may have Windows build slaves allocated to it, or Android test slaves. This makes the masters more efficient, because you don't need every type of job loaded and consuming memory. It also makes maintenance more efficient: you can bring down, for example, the Android test masters for maintenance without having to touch other platforms.

Buildbot polls the hg push log for each of the code repositories (hgpoller). When the poller detects a change, the information about the change is written into the scheduler database. The Buildbot scheduler masters are responsible for taking this request from the database and creating a new build request. The build request then appears as pending in the web page in the previous slide.

The jobs may run on existing hardware in our data centre, or new VMs may be started or created in the cloud to run the pending jobs.
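
A stripped-down sketch of that polling step, assuming the json-pushes endpoint that hg.mozilla.org exposes; insert_build_request() and its schema are placeholders for the real scheduler database, and the real hgpoller does considerably more.

```python
# Toy sketch of the poll-and-schedule loop described above.
# The json-pushes endpoint exists on hg.mozilla.org, but
# insert_build_request() is a placeholder for the scheduler DB write.
import json
import time
import urllib2

PUSHLOG = "https://hg.mozilla.org/integration/mozilla-inbound/json-pushes"

def poll_pushes(last_push_id):
    """Return pushes newer than last_push_id from the hg pushlog."""
    url = "%s?startID=%d" % (PUSHLOG, last_push_id)
    return json.load(urllib2.urlopen(url))

def insert_build_request(revision, branch):
    """Placeholder: the real poller writes a row into the scheduler
    database, which the scheduler masters turn into build requests."""
    print "would schedule builds for %s on %s" % (revision, branch)

last_seen = 0
while True:
    pushes = poll_pushes(last_seen)
    for push_id, push in sorted(pushes.items(), key=lambda kv: int(kv[0])):
        insert_build_request(push["changesets"][-1], "mozilla-inbound")
        last_seen = max(last_seen, int(push_id))
    time.sleep(60)  # poll once a minute
```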

Page 17: Scaling capacity while saving cash

This is a street in Bangkok. As you can see, lots of traffic, not much movement. There used to be a problem at Mozilla where some platforms didn't have good wait times because we simply didn't have the slave capacity to handle them: many pending tests. This was a source of frustration for developers.

We used to run all our builds and tests on in-house hardware in our data centres. This was inefficient in that it took a long time to acquire, rack, install, and burn in the machines. Also, we could not dynamically bring machines up to deal with peak load and then take them offline when they were no longer needed.

Page 18: Scaling capacity while saving cash

So in early 2012 we started investigating how we could better scale to handle this traffic. We investigated running jobs on AWS, starting with CentOS machines. One of the things that allowed us to move to AWS more easily is that we use Puppet to manage the configuration of many of our build and test slaves (the exceptions are Windows and Android). Our Puppet modules are role based, so the modifications required to add Amazon VMs were not that difficult.

This move to AWS provided additional capacity, and some of the machines in our data centres were repurposed to pools that were lacking capacity.

Reference: cloud picture http://www.flickr.com/photos/paul-vallejo/2359829594/sizes/l/

Page 19: Scaling capacity while saving cash

AWS Terminology

• EC2 - Elastic Compute Cloud - machines as VMs

• EBS - Elastic block store - network attached storage

• Region - separate geographical area

• Availability zone - Multiple, isolated locations within a region

I'm going to talk a bit about some AWS terms for those of you who may not be familiar with them.

Notes: AWS instance types http://aws.amazon.com/ec2/instance-types/

Page 20: Scaling capacity while saving cash

More AWS terms

• AMI - Amazon machine image

• instance type - VM with defined specifications and cost per hour. For example:

- AMIs: Amazon has standard ones that you can modify, or you can create your own
- pricing for instance types can depend on the region
- an m3.medium currently costs around $0.07/hr in most regions
- some instance types may not be available in all availability zones

Page 21: Scaling capacity while saving cash

Source: http://oduinn.com/blog/2012/11/27/releng-production-systems-now-in-3-aws-regions/

We have most of our servers in us-east-1 and us-west-2. us-west-1 doesn't have much in it right now; it would be used as a hot backup if one of the other regions went down. Also, some traffic is now routed over the internet (ftp via SSL).

2 in-house data centres, 3 AWS regions, a VPC (a private cloud for us within Amazon), and a VPN link between our data centres and Amazon.

Other notes: using the internet for VCS traffic is also part of the story. The IPsec tunnel is a limited and expensive resource; moving traffic that has built-in security and integrity checking out of the tunnel allows greater capacity.

60% of our capacity is in AWS. This number does not reflect the amount of traffic, just the number of available devices.

Page 22: Scaling capacity while saving cash

We migrated

• Linux build and subset of test slaves

• Builds for Android and tests on Android Emulators

• Buildbot and Puppet masters to support these slaves

• vcssync servers

Buildbot has stateful connections, and having connections to slaves in another DC did not work well, so we created Buildbot and Puppet masters in AWS to support the slaves we instantiate there. We also have vcssync servers in AWS, which support a service that maintains bidirectional commits between our hg and git servers.

Page 23: Scaling capacity while saving cash

Not in cloud

• performance tests
• graphics tests
• builds and tests for:
  • Windows
  • Mac

- Need bare hardware for predictable performance results
- Graphics tests need a specific card
- It might be possible to build on Windows in the cloud in the future
- Macs are not available in AWS. Apple licensing prohibits more than two virtual machines on the same Mac. I investigated outsourcing them to a "Mac in the cloud" vendor earlier this year, but that is really just Macs in racks in another data centre.

Page 24: Scaling capacity while saving cash

Where’s the code?

• The tools we use are all open source

• https://github.com/mozilla/build-cloud-tools

• which uses the boto library (a Python interface to AWS) https://github.com/boto/boto

The code we use to interact with AWS APIs resides here
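
For flavour, the kind of EC2 call that code makes with boto looks roughly like this; it is a generic example rather than an excerpt from build-cloud-tools, and the moz-type tag is assumed for illustration.

```python
# Generic boto (v2) example of the kind of EC2 calls cloud-tools makes.
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")  # credentials come from the environment

# Find stopped build slaves by tag (the tag name here is hypothetical).
reservations = conn.get_all_instances(filters={"tag:moz-type": "bld-linux64",
                                               "instance-state-name": "stopped"})
instances = [i for r in reservations for i in r.instances]

# Start as many as we need to service pending jobs.
needed = 5
for instance in instances[:needed]:
    conn.start_instances(instance_ids=[instance.id])
```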

Page 25: Scaling capacity while saving cash

Smarter Bidding Algorithms

• Important scripts

• aws_stop_idle.py

• aws_watch_pending.py

- aws_stop_idle.py stops instances that are no longer needed given our current capacity (idle for a certain time period; the threshold depends on whether the instance is on-demand or spot)
- aws_watch_pending.py activates instances based on the criteria on the next slide
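
A toy version of the stop-idle idea, assuming boto and an idle-time lookup that is stubbed out here; the real aws_stop_idle.py also asks the Buildbot masters whether a slave is genuinely idle, and the thresholds below are invented.

```python
# Toy stop-idle logic: stop or terminate instances idle past a threshold.
import boto.ec2

# Spot instances get a shorter leash here; values are illustrative only.
IDLE_THRESHOLD = {"on-demand": 45 * 60, "spot": 10 * 60}  # seconds

def idle_seconds(instance):
    """Placeholder: the real script checks buildbot to see how long
    the slave has gone without running a job."""
    return 3600

conn = boto.ec2.connect_to_region("us-west-2")
reservations = conn.get_all_instances(filters={"instance-state-name": "running"})
for res in reservations:
    for inst in res.instances:
        kind = "spot" if inst.spot_instance_request_id else "on-demand"
        if idle_seconds(inst) > IDLE_THRESHOLD[kind]:
            if kind == "spot":
                # Spot instances cannot be stopped, only terminated.
                conn.terminate_instances(instance_ids=[inst.id])
            else:
                conn.stop_instances(instance_ids=[inst.id])
```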

Page 26: Scaling capacity while saving cash

Regions and instances

• Run instances in multiple regions

• Start instances in cheaper regions first

• Automatically shut down inactive instances

• Start instances that have been recently running

If you look at aws_watch_pending.py, these are some of the rules that it implements.

We use machines in multiple AWS regions, partly in case one region goes down, and partly for cost savings (some regions are cheaper). Currently we only use us-east-1 and us-west-2; since all of our CI infrastructure resides in California, we don't use most other regions. This is unlike some companies that need instances available everywhere instantly. For instance, I recently saw a talk by Bridget Kromhout (http://bridgetkromhout.com/speaking/2014/beyondthecode/), an operations engineer at DramaFever, a company that provides international movie content on demand. They use every single AWS region because their customer base is so distributed.

We get better build times and lower costs if we start instances that have recently been running (they still retain artifact directories, and there are billing advantages).
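
A sketch of "start instances in cheaper regions first" using boto's spot price history; the region list matches the talk, but the instance type and the pick-the-minimum logic are simplified compared to aws_watch_pending.py.

```python
# Illustrative: pick the cheaper of our two regions for the next batch
# of spot test instances by looking at spot price history.
import boto.ec2

REGIONS = ["us-east-1", "us-west-2"]

def current_spot_price(region, instance_type="m3.medium"):
    conn = boto.ec2.connect_to_region(region)
    history = conn.get_spot_price_history(instance_type=instance_type,
                                          product_description="Linux/UNIX")
    return min(h.price for h in history)

cheapest = min(REGIONS, key=current_spot_price)
print "starting new test instances in", cheapest
```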

Page 27: Scaling capacity while saving cash

Use spot instances

• Use spot instances vs on demand instances

• much cheaper

• however not brought up as quickly

• Useful for tests not builds

Amazon has many different types of instances. Initially, we used on-demand instances. They come up very quickly but cost more per hour than other instance types.

Spot instances are Amazon's way of auctioning off excess capacity. You bid for an instance, and if nobody else bids a price above your offer, the spot instance is instantiated for you. However, if you're running a spot instance and someone bids a higher price than you did, your instance can be killed. That's okay, because we have configured Buildbot to retry jobs that fail, and a very small percentage of jobs (< 1%) are killed this way.

Since spot instances aren't available as quickly as on-demand instances, some tests don't start within 15 minutes, but that's acceptable. To reduce costs, we initially started using spot instances for some of our test slaves.

Spot instances are instantiated every time with the AMI you specify, so they aren't really appropriate for builds: we run incremental builds, and having the build artifacts on disk is useful when rerunning the same build type on the machine to reduce wall time.

Other notes: smart spot bidding library https://bugzilla.mozilla.org/show_bug.cgi?id=972562
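
Requesting spot capacity with boto looks roughly like this; the AMI ID, bid price, key pair, and subnet are placeholders, not our production values.

```python
# Illustrative spot request with boto; IDs and the bid price are made up.
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")
requests = conn.request_spot_instances(
    price="0.08",                 # maximum bid, in dollars per hour
    image_id="ami-12345678",      # the test-slave AMI (placeholder)
    count=10,
    instance_type="m3.medium",
    key_name="aws-releng",        # placeholder key pair name
    subnet_id="subnet-abcdef01")  # placeholder VPC subnet
for req in requests:
    print "opened spot request", req.id
```

If the market price rises above the bid, Amazon reclaims the instance, and Buildbot's retry logic picks up the small number of jobs lost that way.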

Page 28: Scaling capacity while saving cash

Minimum viable instance type

• Run more tests in parallel on a cheaper instance type rather than upgrading the instance type

• Most tests run on m3.medium but some need more

• Limit the subset of tests run on more expensive instance types to those that actually need it

Each suite of tests has a timeout; if the tests don't complete within this timeout, they fail and retry. At our scale, it's much cheaper to run more tests in parallel on a cheaper instance type than to run them on a more expensive instance type.

For instance, we have Android tests that run on emulators in AWS. Some of the reference tests required a c3.xlarge to run, while the correctness tests were fine on an m3.medium.

Page 29: Scaling capacity while saving cash

Limit EBS use

• EBS is network-attached storage for the EC2 VM

• Much cheaper to use the disk that comes with the instance type
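
To make that concrete, this is roughly how you map the instance's own ephemeral disk into the VM with boto instead of attaching an extra EBS volume; the device name and AMI are examples only.

```python
# Illustrative: map the free ephemeral (instance-store) disk into the VM
# instead of attaching a paid-for EBS volume.
import boto.ec2
from boto.ec2.blockdevicemapping import BlockDeviceMapping, BlockDeviceType

mapping = BlockDeviceMapping()
mapping["/dev/xvdb"] = BlockDeviceType(ephemeral_name="ephemeral0")

conn = boto.ec2.connect_to_region("us-east-1")
conn.run_instances(
    image_id="ami-12345678",      # placeholder AMI
    instance_type="m3.medium",
    block_device_map=mapping)
```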

Page 30: Scaling capacity while saving cash

So that was good. We had a lot more capacity on our CI farm. But with every change, you encounter some new bottlenecks. At Mozilla, when a lot of jobs are failing, we say the trees are burning.

http://atlee.ca/blog/posts/aws-networks-and-burning-trees.html

References:
http://www.flickr.com/photos/ervins_strauhmanis/9554405492/sizes/l/
http://armenzg.blogspot.ca/search?updated-max=2014-02-27T14:07:00-05:00&max-results=3

Page 31: Scaling capacity while saving cash

Bottleneck: Network

• Firewall for VPN tunnel between Mozilla and AWS couldn’t keep up

• High latency connecting to scheduler database

• Jobs weren’t scheduled so unhappy developers!

All of the traffic from our infrastructure in EC2 was routed over the VPN tunnel to be handled by Mozilla's firewall in our SCL3 data center, and the firewall couldn't keep up. As a result, there was a lot of latency connecting to our scheduler database to add jobs.

Page 32: Scaling capacity while saving cash

Solution

• Created ftp-ssl endpoint

• gave our AWS instances public addresses (before only had private)

• Changed our routing tables in AWS to route traffic to ftp-ssl via the public internet rather than our VPN tunnel

• updated builds config to download files from ftp-ssl vs ftp

• changed scripts to cache some repos locally vs cloning each time

• added more capacity to the firewall
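
The routing-table change above, expressed with boto's VPC API, looks roughly like this; the route table ID, gateway ID, and the ftp-ssl address are placeholders.

```python
# Illustrative: send traffic destined for the ftp-ssl endpoint out the
# internet gateway instead of the VPN tunnel. IDs and CIDR are placeholders.
import boto.vpc

conn = boto.vpc.connect_to_region("us-east-1")
conn.create_route(
    route_table_id="rtb-0123abcd",
    destination_cidr_block="203.0.113.10/32",  # ftp-ssl public address (example)
    gateway_id="igw-4567cdef")                 # the VPC's internet gateway
```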

Page 33: Scaling capacity while saving cash

Cache all the things

• Reduce our VPN network utilization further

• Implement a tool called proxxy

• Cache build artifacts

• Cache static tools

• Note: This increased costs because we increased our reliance on EBS

AWS region-local caches for HTTPS artifacts: https://bugzilla.mozilla.org/show_bug.cgi?id=1017759

https://wiki.mozilla.org/ReleaseEngineering/Applications/Proxxy
http://atlee.ca/blog/posts/cache-em-all.html
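
The idea behind proxxy, reduced to a toy example: rewrite artifact URLs so they hit a region-local cache host first and fall back to the origin on failure. The cache host and URL scheme below are invented for illustration; see the wiki page above for the real tool.

```python
# Toy illustration of region-local caching of build artifacts.
# The cache host and naming scheme here are made up, not proxxy's real ones.
import urllib2
import urlparse

CACHE_HOST = "cache.use1.example.mozilla.com"  # hypothetical per-region cache

def cache_url(url):
    """Rewrite a download URL to go via the regional cache, which
    proxies and stores the artifact on a miss."""
    parts = urlparse.urlparse(url)
    return "http://%s.%s%s" % (parts.netloc, CACHE_HOST, parts.path)

def fetch(url):
    try:
        return urllib2.urlopen(cache_url(url), timeout=30).read()
    except urllib2.URLError:
        return urllib2.urlopen(url).read()  # fall back to the origin
```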

Page 34: Scaling capacity while saving cash

Bandwidth 50%

Source: Chris Atlee http://atlee.ca/blog/posts/cache-em-all.html

Page 35: Scaling capacity while saving cash

Another issue: increased wall times.

- We run incremental builds. This means that if a machine has recently run a certain build type, it will run faster the next time it runs that same type, because it will just update the existing files on the machine, such as checkouts and object directories.
- With the switch to AWS, we had large pools of devices allocated to certain types of builds. This means that a build might not run on a machine that has recently run a build of the same type. So a couple of people on the team looked at enabling smaller pools of build machines for certain types of builds. The nickname for these smaller build pools is a jacuzzi.
- Given a smaller pool, there is a higher chance that the previous artifacts remain on the machine the next time the job runs.
- We use a tool called mock that installs packages in a virtual environment. The team also optimized the mock environments so that packages weren't reinstalled if they already existed.
- These changes improved build times on these machines by 50%.

References:
http://atlee.ca/blog/posts/initial-jacuzzi-results.html
http://hearsum.ca/blog/experiments-with-smaller-pools-of-build-machines/
http://rail.merail.ca/posts/firefox-builds-are-way-cheaper-now.html
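
A toy version of the jacuzzi idea: pin a build type to a small, named pool of slaves so that incremental state is likely to still be on disk. The pool contents, the fallback behaviour, and the function shape are all invented; the real allocator is a separate service.

```python
# Toy jacuzzi allocation: each build type gets a small dedicated pool,
# with the big shared pool as a fallback. All names are made up.
JACUZZIS = {
    "linux64-mozilla-inbound-build": ["bld-linux64-ec2-001",
                                      "bld-linux64-ec2-002",
                                      "bld-linux64-ec2-003"],
}

GENERAL_POOL = ["bld-linux64-ec2-%03d" % i for i in range(4, 100)]

def allowed_slaves(builder_name, available):
    """Prefer the builder's jacuzzi; fall back to the shared pool if the
    jacuzzi has no available slaves right now (an assumption, for the sketch)."""
    pool = JACUZZIS.get(builder_name)
    if pool:
        ready = [s for s in pool if s in available]
        if ready:
            return ready
    return [s for s in GENERAL_POOL if s in available]
```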

Page 36: Scaling capacity while saving cash

Source: Chris Atlee http://atlee.ca/blog/posts/initial-jacuzzi-results.html

Page 37: Scaling capacity while saving cash

Puppet vs AMIs

• Originally we used Puppet to manage all of our build and test instances

• It was too slow to puppetize the spot instances

• Solution: Create golden AMIs from configs each night. These are used to instantiate the new spot instances.

note: we still use Puppet to manage our buildbot masters within AWS
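
A compressed sketch of that nightly flow with boto: image a freshly puppetized reference instance, then launch spot instances from the new AMI. The instance ID and naming are placeholders, and the real cloud-tools scripts do much more (cleanup, tagging, copying across regions).

```python
# Illustrative nightly AMI refresh with boto; the instance ID and AMI name
# are placeholders, and puppetization of the source instance is assumed
# to have already happened.
import time
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# Image the freshly puppetized reference instance.
ami_id = conn.create_image(
    instance_id="i-0abc1234",                       # placeholder
    name="tst-linux64-golden-%s" % time.strftime("%Y-%m-%d"),
    description="nightly golden AMI for Linux64 test spot instances")

# Later, once the AMI is available, spot requests reference it.
conn.request_spot_instances(price="0.08", image_id=ami_id,
                            instance_type="m3.medium")
```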

Page 38: Scaling capacity while saving cash

Summary: costs

• Optimize use of regions, instance types and capacity

• Use spot instances

• Smarter bidding algorithms

• Shorter wall time through use of jacuzzis

• Use instance storage vs EBS to save $

• Route over public internet where possible

• Cache artifacts with the proxxy tool

Page 39: Scaling capacity while saving cash

This chart shows the number of monthly pushes over the last six years. You can see that in 2014 our volume has increased significantly (it has doubled compared with the beginning of 2013). For instance, last month (October) we had 12,821 pushes.

Page 40: Scaling capacity while saving cash

This chart shows our monthly AWS bill since we started migrating machines. You can see how there is quite a dramatic drop off in our monthly AWS costs despite our increased load.

Page 41: Scaling capacity while saving cash

This chart shows the dollar cost per push. This does not include costs for on-premise equipment, just AWS.

Note: of course, some of these drops in cost are due to Amazon's price reductions over the years, not just our optimizations :-)

Page 42: Scaling capacity while saving cash

Selena Deckelmann's architecture diagram https://wiki.mozilla.org/ReleaseEngineering/OverviewArchitectureDiagram

For context, here is what our entire releng build pipeline looks like.

Page 43: Scaling capacity while saving cash

And here are the parts, highlighted, that now reside in AWS: Linux and Android builds and tests, plus the Buildbot and Puppet masters to support them. There is still some work left to do.

The red circles indicate parts that have been migrated, but they do not mean that all of the infrastructure for that service has been migrated. For instance, some of our Buildbot masters now reside in AWS, but those that support our on-premise equipment remain in our data centre.

Page 44: Scaling capacity while saving cash

Questions?

Page 45: Scaling capacity while saving cash

Learn more

• @MozRelEng

• http://planet.mozilla.org/releng/

• Mozilla Releng wiki https://wiki.mozilla.org/ReleaseEngineering

• IRC: channel #releng on moznet

Page 46: Scaling capacity while saving cash

Where's the code?

• Cloud tools: https://github.com/mozilla/build-cloud-tools

• buildbot configs https://github.com/mozilla/build-buildbot-configs

• buildbotcustom https://github.com/mozilla/build-buildbotcustom

• Mozharness https://github.com/mozilla/build-mozharness

• Mozpool https://github.com/mozilla/mozpool

• Puppet configs https://github.com/mozilla/build-puppet

Page 47: Scaling capacity while saving cash

More Reading 1

• Laura's talks on monitoring complex systems http://vimeo.com/album/3108317/video/110088288

• Armen’s talk on our hybrid infrastructure https://air.mozilla.org/problems-and-cutting-costs-for-mozillas-hybrid-ec2-in-house-continuous-integration/

• Move to AWS starting in 2012

• http://atlee.ca/blog/posts/blog20121002firefox-builds-in-the-cloud.html

• http://johnnybuild.blogspot.ca/2012/08/migrating-linux32-and-linux64-builds-to.html

• http://atlee.ca/blog/posts/blog20121214behind-the-clouds.html

• http://rail.merail.ca/posts/firefox-unit-tests-on-ubuntu.html


Page 48: Scaling capacity while saving cash

More Reading 2

• AWS spot instances vs reserved instances

• http://atlee.ca/blog/posts/now-using-aws-spot-instances.html

• http://rail.merail.ca/posts/firefox-builds-are-way-cheaper-now.html

• http://rail.merail.ca/posts/ec2-spot-instances-experiments.html

• http://taras.glek.net/blog/2014/05/09/how-amazon-ec2-got-15x-cheaper-in-6-months/

• http://taras.glek.net/blog/2014/03/05/more-and-faster-c-i-for-less-on-aws/

• AWS networking

• http://atlee.ca/blog/posts/aws-networks-and-burning-trees.html

• http://rail.merail.ca/posts/using-dns-to-query-aws.html

Page 49: Scaling capacity while saving cash

More Reading 3

• Scaling

• http://atlee.ca/blog/posts/bursty-load.html

• jacuzzis

• http://atlee.ca/blog/posts/initial-jacuzzi-results.html

• http://hearsum.ca/blog/experiments-with-smaller-pools-of-build-machines/

• Caching

• http://atlee.ca/blog/posts/cache-em-all.html