
Rebuilding the airplane at 10 000m
Continuous Deployment with Jenkins and Gerrit

R. Tyler Croy <tyler@linux.com>

Hello and thanks for coming. I'm R. Tyler Croy, and today I'm going to talk to you about two things at the same time!

I'm going to tell you how we rebuilt our engineering organization "mid-flight" for Continuous Deployment and at the same time, I'm going to tell you how you can too with Jenkins and Gerrit

@jenkinsci

Jenkins Infrastructure with Puppet

I work for Lookout

First of all, I work at Lookout Mobile Security, if you're an Android user you might already be familiar with some of our products.

If you're not familiar with us, we are primarily known for our security app on Android.

My job at Lookout primarily involves ->

Hacking with Ruby

Hacking with Ruby, you see while we have a nice fancy Java-based Android application, we also have a *large* server-backend which handles device notifications, backups, analysis and much more.

That entire backend is written in Ruby, and can benefit from ->

Let's talk about: continuous deployment

Continuous Deployment, which is also known as "Continuous Delivery". For the context of this talk, I will always refer to it as "Continuous Deployment"

Before I talk too much about what it *is*, I'd like to talk about ->

What it is:

NOT

What Continuous Deployment is NOT.

Above all else, it is not something you do *once* and then you're finished and can move on. Continuous Deployment is a process and mind-set that you and your team stick with.

"Release everythingas soon as possible!"

Continuous Deployment doesn't mean you release EVERYTHING as soon as it's committed. It doesn't mean you *MUST* deploy every single commit either.

(photo by thomen: <http://www.flickr.com/photos/thomen/364890522/>)

"Great! No need for a QA team"

One of the interesting things I've discovered at Lookout and at other organizations is that Continuous Deployment, and some of the practices involved in it, will free up the QA team to *do their jobs*.

Good QA engineers are most useful when they're exploring, hunting for bugs. Having QA engineers run through test plans every day of the week is boring, slow, and a good candidate for replacement with automated testing tools.

"Our users will be our testers!"

Lastly, I don't think Continuous Deployment means you can offload testing to your users.

To some extent this will be inevitable, as changes are more rapidly deployed, but I believe you should try everything you can to avoid your users experiencing issues because of bugs you've introduced

Continuous Deployment is about

Continuous Deployment, in my opinion, is all about ->

stability

stability. It is about being able to deploy changes ->

Faster with

More Confidence

faster and with more confidence. Releases shouldn't be scary or stressful; in fact, the more boring a deployment is, the better!

In order to make that happen, it is important to have:
* Well-thought-out procedures for rapid deployment of changes
* A good feedback loop from production, in order to inform your team about the current status of the production infrastructure

In my opinion, above everything else, these are the two most important factors to succeeding with Continuous Deployment

Overall, I think Continuous Deployment is ->

Continuous Deployment is

GOOD

GOOD for your organization. Even if you don't end up rapidly deploying your software, striving for continuous deployment will help improve so many other parts of your development process.

So that's Continuous Deployment, and that's all well and good, but let's go back ->

Meanwhile at Lookout

to Lookout, and talk about "where we were" when we started our journey on the road towards Continuous Deployment

"In the olden days" we used ->

Subversion branches for releases

Subversion branches for all our releases. Whenever we were going to prepare a release, we would create a branch in Subversion and those branches would live ->

10-18 days per release branch

for 10-18 days. That means we would take around two weeks every time we wanted to take the code in trunk, and create a release from it

We also performed ->

manual code review

Manual code review. The code review process used to be "Hey Kyle, come look at this real quick before I commit it to trunk."

In addition to this loosely enforced manual code review process, we had ->

very little automation

very little automation. We actually had some automated tests running somewhere, but there wasn't a "culture of automation"

Being a huge fan of Jenkins and Continuous Deployment, this was very ->

Sad

Besides being emotionally frustrating, if we look at the data, it was actually hurting us.

36% of deployments failed

This means that 1/3 of the time when we would try to deploy code into production, something would go wrong and we would have to roll back the deploy and find out what went wrong.

Unfortunately, since it took us two or more weeks to get the release out, we had on average ->

68 commits per deployment

68 commits per deployment, so one or more commits out of 68 could have caused the failure.

After a rollback, we'd have to sift through all those commits and find the bug, fix it and then re-deploy.

Because of this ->

62% of deployments slipped

About 2/3rds of our deployments slipped their planned deployment dates. As an engineering organization, we couldn't really tell the product owner when changes were going to be live for customers with *any* confidence!

So I was pretty frustrated with all of this, and as I talked to more people in the engineering organization, they were frustrated too!

SO! ->

Let's fix this.

When we started to come up with a plan to drive towards continuous deployment, the first thing we needed to do was ->

Step One: Automate Development

Jenkins

Before Jenkins we used a tool called Bitten. I won't tell you too much about Bitten, but it's not a great tool and we had a number of issues with it:
* Practically zero developer insight into the test/build process
* All the tests ran on *one* build machine, which was hand-crafted by the Operations team for the task

We installed Jenkins and started to work on migrating "jobs" over to Jenkins.

"Why don't our tests pass?"

The first major issue we hit was that some of our tests didn't actually *pass* reliably. Previously this was hidden from us, but once we were running the tests after every commit with Jenkins, we noticed that we had some technical debt in the test suite.

Never stop automating.

Like Continuous Deployment itself, automation is a mind-set. In the past 12 months at Lookout we have gone from zero jobs in Jenkins to over 200 different jobs performing different tasks.

It is important that you constantly recognize manual processes that can and should be automated. Over time these little automations add up into pipelines of tasks that Jenkins can perform entirely on its own

Once we had automation in place, we needed to focus on ->

Step Two: Better tools, better processes

Better tools and better processes than what we were using before.

This meant evaluating which tools were working and which weren't. In the process we migrated away from Pivotal Tracker and dropped Subversion ->

(I don't like SVN)

I'm not going to rant against Subversion, if you like it, that's fine. There are ways to accomplish continuous deployment with Subversion. At Lookout however, we viewed it as one of the things standing in our way.

So we got rid of Subversion and instead opted for ->

Git

Git. If you're not familiar with Git, it can be used in a number of different ways. There's the familiar ->

centralized workflow

This is more common for smaller companies

integration managerworkflow

This is common for GitHub-based projects

lieutenants workflow

This is how the Linux kernel is developed

Git + Gerrit

Git and Gerrit

Gerrit

Gerrit is a Git-based code review tool

code review

collaboration

Developer Workflow

What this means for an individual developer is that they can iterate on their code in Gerrit, based on feedback from their colleagues.

Once the code is all polished up, it can then be integrated into the "main" repository
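To make that loop concrete, here is a minimal command-line sketch of the standard Gerrit flow, assuming the Gerrit remote is named "origin" and the target branch is "master"; the commit message is just a hypothetical example.

    # propose a change for review by pushing to Gerrit's magic ref
    git commit -a -m "Add retry logic to the notification worker"
    git push origin HEAD:refs/for/master

    # address reviewer feedback by amending the same commit;
    # Gerrit treats the amended commit (same Change-Id) as a new patch set
    git commit -a --amend
    git push origin HEAD:refs/for/master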

"Pre-tested" Commits

An integral part of our Git + Gerrit workflow involved pre-testing commits.

The whole concept behind "pre-testing" a commit is that only changes which have passed the "tests" will be allowed to be integrated or merged.
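To sketch what that verification amounts to, here is roughly what a Jenkins job does for each proposed change; the Gerrit Trigger plugin wires this up automatically, so treat the hostname, port, change/patch-set numbers, and rake task below as hypothetical placeholders.

    # fetch the patch set that Gerrit advertises for change 1234, patch set 2
    git fetch origin refs/changes/34/1234/2 && git checkout FETCH_HEAD

    # run the test suite (whatever your project's test entry point is)
    bundle exec rake test

    # report the result back to Gerrit as a Verified vote
    ssh -p 29418 jenkins@gerrit.example.com \
        gerrit review --verified +1 -m "Build successful" 1234,2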

Gerrit Trigger plugin


Feedback in Gerrit

The Cycle

Demo!

Git + Gerrit at the same time!

We switched from Subversion directly to Git and Gerrit, all at once.

Instead of introducing Git as a separate tool to developers, we introduced them at the same time, so developers never learned a Git-based workflow that *didn't* involve Gerrit at its core.

We walked through the different teams, one by one, with a fully set-up "demo" project they could use to experiment with creating commits, code reviewing them, verifying them with Jenkins, and finally merging them into the "master" branch.

By immediately starting developers with "here's how you do work now", we were able to finish the migration in a matter of a couple weeks instead of the months it's taken me at previous companies.

One thing we immediately realized after switching to Git + Gerrit was ->

"We need more builders!"

When we first started moving things into Jenkins, we had 3 slaves that were properly configured for running our tests. As we started using pre-tested commits with Gerrit and Jenkins, we *very* quickly realized that we needed far more build capacity.

In the very early stages of this, we used the same hand-crafted VM base image with VMWare ESX, and then spun up multiple new machines.

If you're going to use a pre-tested commit workflow with an active engineering organization, make sure you plan ahead and have plenty of hardware, or virtualized hardware, for Jenkins.

Recap

Cleaning up and improving the development part of the workflow was long overdue, and was already gaining us noticeable improvements, even with our slow release process.

With every commit getting tested by Jenkins, the stability of our "master" branch had greatly improved, and our deploys became less error prone.

Continuous Deployment touches more than just your development team though, it affects your QA team, and your operations team too, so there's still work to be done! The next thing to be done is to ->

Step Three: Automate Everything Else

Automate *EVERYTHING*. There's so much more than writing code involved in getting software into the hands of your customers

Take for example ->

Deploying test environments

Deploying test environments, in the "olden days" QA engineers were responsible for deploying test environments manually.

Once the deployment of our test environments was managed through Jenkins, we could create pipelines with Jenkins, chaining off of a successful deployment to the test environment.

This opened us up to running ->

New kinds of tests

Entirely new kinds of tests! With deployment of a test environment driven from Jenkins, we could create downstream jobs which would spin up VMs with Sauce Labs to run hundreds of Selenium tests in parallel against the test environment, without any intervention from QA or developers.
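To give a feel for what one of those downstream Selenium jobs runs, here is a minimal Ruby sketch of driving a browser on Sauce Labs against a freshly deployed test environment; the credentials, capabilities, URLs, and element names are placeholder assumptions, not our actual configuration.

    require 'selenium-webdriver'

    # run the browser on Sauce Labs rather than on the Jenkins slave itself
    caps = Selenium::WebDriver::Remote::Capabilities.firefox(:platform => 'Windows 7')
    driver = Selenium::WebDriver.for(:remote,
      :url => "http://#{ENV['SAUCE_USERNAME']}:#{ENV['SAUCE_ACCESS_KEY']}@ondemand.saucelabs.com:80/wd/hub",
      :desired_capabilities => caps)

    # exercise the test environment that Jenkins just deployed
    driver.get 'https://test-env.example.com/login'
    driver.find_element(:name => 'email').send_keys 'qa@example.com'
    driver.quit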

With the automation of development, and then testing squared away, the last big thing to tackle was ->

Automating deployment to production

Automating the deployment of the production site is and was the hardest part of the entire workflow to accomplish. To automate a production deployment you need to be very confident in what code is going out and the sanity of the processes leading up to the production deployment.

Currently at Lookout, production deployments are "manually automated": we have all the steps *right up* to deployment automated, and the deployment itself automated, but there is currently no process in place for changes to automatically flow through the entire system, from a developer's workstation, through Gerrit, and ultimately out to production.

With Jenkins doing all of these things, you will likely have a similar realization to us, in that you need ->

Step Four: More Robots!

More robots

OpenStack and the

jclouds plugin

We have been investing in OpenStack to add more build capacity, and we're looking forward to the work being done on the jclouds plugin.

Slave Management with

Puppet

As the number of slaves we operate grows, we have found the need to manage the slaves all at once. Currently we use a masterless Puppet setup for all of our slave images: they all have a cron job that checks the Git repository for new Puppet manifests every hour and applies any outstanding updates.

This allows us to bring new system dependencies, such as ImageMagick or libxml2, onto the slave machines without needing to log into them manually or re-image the whole cluster of VMs
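For reference, a masterless arrangement like this boils down to a cron entry on every slave along these lines; the repository path, manifest name, and schedule are hypothetical stand-ins for our actual layout.

    # /etc/cron.d/puppet-apply: pull the latest manifests hourly and apply them locally
    0 * * * * root cd /opt/puppet-config && git pull --quiet && puppet apply manifests/site.pp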

TODO

(nobody's perfect)

No automated rollback

Currently there is no automated system for rolling back changes; if we notice any issues in production, we must manually roll back the deployment.

No production acceptance testing

We have to rely on error logging, Graphite, and manual acceptance testing to make sure that we didn't break production.

Tests are slow

Our workaround for slow tests has been to parallelize them with the Gerrit Trigger plugin.

Sure we have problems, but with:
* Introducing a lot of automation with Jenkins
* Using Gerrit and Jenkins for code review and pre-testing all commits before they get into the master branch
* Automating the pipeline from "merge to master" all the way through to "deploy to production"

we've managed to change the entire engineering organization such that ->

2% of deployments failed

14 commits per deployment

3% of deployments slipped

Questions?

Thank you
