AWS re:Invent 2016: DevOps on AWS: Advanced Continuous Delivery Techniques (DEV403)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Mark Mansour, Senior Manager, Continuous Delivery

November 30, 2016

DEV403

DevOps on AWS: Advanced Continuous Delivery Techniques

What to expect from the session

Make your pipeline safer by

1. Identifying production issues quickly

2. Deploying changes safely

3. Automatically deciding when to release changes

Techniques

1. Continuous production testing

2. Manage deployment health

3. Segment production

4. Halt promotions

5. Gates

Starting Point:

The release process is automated

Prerequisites

• Versioned source

• Automated build

• Automated deployments

• Deploy to > 1 instance

• Unit tests

• Integration tests

• Continuous Delivery

• Operations dashboard

Source

Build

Deploy to Integration Stack

Integration Tests

Deploy to Production

Best practices with your tools

• Focus on best practices

• Keep using your current tools where possible (deployment tools; continuous integration and continuous delivery tools)

• Extend your current tools when needed

• This talk uses AWS tools

Tools used in this talk

Monitoring

Amazon CloudWatch

Software Development

Amazon SNS

AWS Lambda

Deployment

AWS CodeDeploy

AWS CodePipeline

Model the release process in CodePipeline

[Diagram: the release process (Source, Build, Deploy to Integration Stack, Integration Tests, Deploy to Production) modeled as a pipeline with these stages and actions]

• Source: MyApp (CodeCommit)

• Build: Build

• Integration: DeployToInteg (CodeDeploy), IntegTest (End2EndTester)

• Production: DeployToProd (CodeDeploy)

Pipeline Run

[Diagram: a pipeline made up of stages and actions, with "Change 1" moving through it]

Source change

• starts a run; and

• creates an artifact to be used by other actions.

Release and deploy process: Starting point

[Diagram: the starting pipeline. Source: MyApp (CodeCommit); Build: Build; Integration: DeployToInteg (CodeDeploy), IntegTest (End2EndTester); Production: DeployToProd (CodeDeploy)]





Be aware when a service is unavailable

Problem:

A service can stop working at any time for reasons inside or outside of its control.

Consequence:

Your service may be unavailable without your team knowing about it.

1 of 5 – Continuous production testing

Use synthetic traffic to simulate real users

• Test all business critical functionality (UI and APIs)

• Tests must run quickly

• Measure client latencies

• Check for reachability


How synthetic traffic flows

[Diagram: CloudWatch Events (1m), Synthetic Traffic, CloudWatch Alarm]

Synthetic traffic flow – why two metric streams?

[Diagram: CloudWatch Events (1m), Synthetic Traffic, CloudWatch Alarm]

Building a synthetic traffic test

• Keep it simple

• Build logic in Lambda (invoke with CloudWatch Events)

• Capture data in CloudWatch metrics
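The three points above can be sketched as a small probe. This is only an illustration in Python, not the code from the talk: the function names `probe` and `http_fetch` and the metric names `Latency` and `Reachable` are assumptions. The fetcher and clock are injected so the logic can be exercised without a network; in a Lambda function the returned list would be published as CloudWatch metric data.

```python
import time
import urllib.request
from typing import Callable, Dict, List


def http_fetch(url: str, timeout: float = 5.0) -> int:
    """Default fetcher: plain HTTP GET that returns the status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def probe(
    url: str,
    fetch: Callable[[str], int] = http_fetch,
    clock: Callable[[], float] = time.monotonic,
) -> List[Dict]:
    """Send one synthetic request and describe the outcome as metric data."""
    started = clock()
    try:
        status = fetch(url)
        reachable = 200 <= status < 400
    except Exception:
        # Timeouts, DNS failures and HTTP errors all count as "not reachable".
        reachable = False
    latency_ms = (clock() - started) * 1000.0
    # Hypothetical metric names; the shape mirrors CloudWatch metric data.
    return [
        {"MetricName": "Latency", "Value": latency_ms, "Unit": "Milliseconds"},
        {"MetricName": "Reachable", "Value": 1.0 if reachable else 0.0, "Unit": "Count"},
    ]
```

Keeping the probe free of SDK calls keeps it simple to test; a scheduled rule would invoke it once a minute.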


Lambda’s synthetic traffic blueprint


Scheduling the synthetic traffic test


Building a synthetic traffic test - Code


Building a synthetic traffic test – Alarming
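As a rough sketch of the alarming idea, assume a hypothetical reachability metric that is 1 when a synthetic request succeeds and 0 when it fails. The function below decides whether the most recent datapoints all fall below a threshold, which is the general shape of a threshold alarm evaluated over several periods. The name `should_alarm` and its parameters are illustrative only.

```python
from typing import Sequence


def should_alarm(datapoints: Sequence[float], threshold: float, periods: int) -> bool:
    """Return True when the latest `periods` datapoints are all below `threshold`."""
    if periods < 1:
        raise ValueError("periods must be at least 1")
    if len(datapoints) < periods:
        # Not enough data yet to make a decision.
        return False
    return all(value < threshold for value in datapoints[-periods:])
```

Requiring several consecutive bad datapoints trades a little detection time for fewer false alarms from a single failed request.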


Release and deploy process: Synthetic traffic

[Diagram: the Production stage, DeployToProd (CodeDeploy), with synthetic traffic added]





2 of 5 – Manage deployment health

Rolling deployments – success

[Diagram: a production fleet of 10 hosts behind an ELB, updated from V1 to V2]

Rolling deployments – fail

[Diagram: the same production fleet behind an ELB, during a rolling deployment from V1 to V2 that fails]


Check for deployment failures in production

Problem:

There are no automated tests to verify a service is working after a new deployment.

Consequence:

Each production deployment needs to be checked manually.


Add safety to rolling deployments

1. Validate each host’s health

2. Ensure a minimum percentage of the fleet is healthy

3. Rollback if the deployment failed
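The three safety steps above can be sketched in a few lines of Python. This is a simplified model, not how CodeDeploy is implemented: the function name `rolling_deploy`, the in-memory `versions` mapping and the `validate` callback are all assumptions made for illustration.

```python
from typing import Callable, Dict


def rolling_deploy(
    versions: Dict[str, str],
    new_version: str,
    validate: Callable[[str], bool],
    min_healthy_percent: int,
) -> bool:
    """Update hosts one at a time; roll back and return False if health drops too low."""
    previous = dict(versions)          # snapshot used for rollback
    total = len(versions)
    failures = 0
    for host in list(versions):
        versions[host] = new_version
        if not validate(host):         # 1. validate each host's health
            failures += 1
        healthy_percent = 100 * (total - failures) / total
        if healthy_percent < min_healthy_percent:   # 2. minimum healthy fleet
            versions.update(previous)  # 3. roll back every host
            return False
    return True
```

With 10 hosts and a 70% minimum, a single failing host leaves the fleet 90% healthy and the deployment continues, while a fourth failure drops it to 60% and triggers the rollback.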


Configure CodeDeploy

Step 1: Deployment Validation – AppSpec.yml


Step 1: Working tests raise more issues

[Diagram: production fleet of 10 hosts behind an ELB, partway through a rolling deployment from V1 to V2]

Step 2: Use minimum healthy hosts

[Diagram: production fleet of 10 hosts behind an ELB during a rolling deployment from V1 to V2]

Minimum healthy hosts (MHH) 70%, 10 hosts:

• 4 failures – 60% healthy – failed deployment

• 1 failure – 90% healthy

Step 2: Use minimum healthy hosts – CodeDeploy


Step 3: Rollback when a deployment fails

• CodeDeploy: configured in deployment group


Release and deploy: Deployment health

[Diagram: the Production stage, DeployToProd (CodeDeploy), with synthetic traffic and CodeDeploy deployment health checks]





3 of 5 - Segment production

Bad changes must not affect all customers

Pipeline Problem:

When a critical issue reaches production, all hosts are affected.

Consequence:

Bad changes impact all customers.


Lower deployment risk by segmenting

1. Break production into multiple segments

2. Deploy to a segment

3. Test a segment after a deployment

4. Repeat 2 & 3 until done
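The deploy-then-test loop above can be sketched as follows. The names `deploy_by_segment`, `deploy` and `test` are hypothetical; in practice each callback would wrap whatever deployment and validation tooling is in use.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def deploy_by_segment(
    segments: Sequence[str],
    deploy: Callable[[str], None],
    test: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Deploy segment by segment; stop at the first segment whose test fails.

    Returns the segments that passed and the failing segment (or None).
    """
    passed: List[str] = []
    for segment in segments:
        deploy(segment)            # step 2: deploy to a segment
        if not test(segment):      # step 3: test the segment
            return passed, segment
        passed.append(segment)     # step 4: move on to the next segment
    return passed, None
```

Ordering the segments from smallest to largest means a bad change is caught while it affects as few customers as possible.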


Segment Production

Step 1: Break production into multiple segments

Typical segment types:

• Region

• Availability Zone

• Sub-Zonal

• Single Host (Canary)


Step 1: Typical deployment segmentation

[Diagram: the production fleet in region US-EAST-1 (Availability Zones US-EAST-1A and US-EAST-1B) moving from V1 to V2. Labels: Canary Deployment, Availability Zone based Deployment, Region based Deployment, Post-deployment test]

Step 1: Use deployment groups as segments

Create deployment groups per segment using:

• Tags

• Auto Scaling groups


Step 2: Deploy to each segment

1. Deploy to smallest segment

2. Post-deployment tests

3. Deploy to one Availability Zone

4. Post-deployment tests

5. Deploy to remaining Availability Zones

[Diagram: an Integration stage with DeployToInteg (CodeDeploy) and IntegTest (End2EndTester), and a Production stage with CanaryDeploy (CodeDeploy), PostDeployTest (Approval), Deploy-AZ-1 (CodeDeploy), PostDeployTest (Approval), Deploy-AZ-2 (CodeDeploy) and Deploy-AZ-3 (CodeDeploy)]


Step 3: Test each segment

A deployment is valid if:

• The test has gathered enough data to gain confidence (CloudWatch metrics)

• No service alarms have fired (CloudWatch alarms)

• The test has not timed out (code)
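The three conditions above can be combined into a single decision. The sketch below is one possible reading, with hypothetical names (`Verdict`, `evaluate_segment`) and inputs; how the sample count and alarm state are obtained is left out.

```python
from enum import Enum


class Verdict(Enum):
    APPROVE = "Approved"
    REJECT = "Rejected"
    WAIT = "Wait"


def evaluate_segment(
    samples_seen: int,
    samples_needed: int,
    alarms_firing: bool,
    elapsed_seconds: float,
    timeout_seconds: float,
) -> Verdict:
    """Decide whether a segment's deployment is valid, invalid, or still undecided."""
    if alarms_firing:
        return Verdict.REJECT      # a service alarm has fired
    if samples_seen >= samples_needed:
        return Verdict.APPROVE     # enough data gathered, no alarms
    if elapsed_seconds > timeout_seconds:
        return Verdict.REJECT      # timed out before gaining confidence
    return Verdict.WAIT            # keep collecting data
```

Checking alarms first means a firing alarm rejects the deployment even when plenty of data has been collected.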


Add segment tests to your pipeline

Extend CodePipeline with:

• Test Actions

• Lambda Invoke Actions

• Custom Actions

• Approval Actions


1 hour timeout

7 day timeout

Use CodePipeline approvals to trigger tests

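[Diagram: a Source stage with MyAppSource (CodeCommit) and a Deploy stage with DeployToSegment (CodeDeploy), ValidateSegment (Approval) and a further DeployToSegment (CodeDeploy). The approval action sends an approval message to an SNS topic, and the result is returned with putApprovalResult]

Use SNS to start an automated approval check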
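A minimal sketch of the hand-off, assuming the Lambda function is subscribed to the approval's SNS topic and that the notification body is JSON with an `approval` object carrying `pipelineName`, `stageName`, `actionName` and `token`. The function name `approval_result_from_sns` is hypothetical. It only builds the arguments for CodePipeline's PutApprovalResult call and does not call AWS itself.

```python
import json
from typing import Any, Dict


def approval_result_from_sns(event: Dict[str, Any], approved: bool, summary: str) -> Dict[str, Any]:
    """Turn an SNS-delivered approval notification into PutApprovalResult arguments."""
    # SNS wraps the notification as a JSON string inside the Lambda event.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    approval = message["approval"]  # assumed message layout
    return {
        "pipelineName": approval["pipelineName"],
        "stageName": approval["stageName"],
        "actionName": approval["actionName"],
        "token": approval["token"],
        "result": {
            "summary": summary,
            "status": "Approved" if approved else "Rejected",
        },
    }
```

With boto3, the returned dictionary could be passed as keyword arguments to the CodePipeline client's `put_approval_result`.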


Creating a post-deployment test

[Diagram: a Source stage with MyAppSource (CodeCommit), a Build stage with MyAppBuild, and a Deploy stage with CanaryDeploy (CodeDeploy) to Prod-us-east-1a and ValidateCanary (Approval), with "Change 1" moving through the pipeline. Other labels: SNS topic, Lambda function registerDeployTest(), Lambda function evaluateDeploy(), DynamoDB, CloudWatch Events (1m), alarm, time, usage]


Post-deployment test – registerDeployTest

[Diagram: the post-deployment test pipeline: MyAppSource (CodeCommit), MyAppBuild, CanaryDeploy (CodeDeploy) to Prod-us-east-1a, ValidateCanary (Approval), Lambda function evaluateDeploy(), DynamoDB, CloudWatch Events (1m), "Change 1"]



registerDeployTest function – (Node.js 4.3)


Post-deployment test – evaluateDeployTest

[Diagram: the post-deployment test pipeline: MyAppSource (CodeCommit), MyAppBuild, CanaryDeploy (CodeDeploy) to Prod-us-east-1a, ValidateCanary (Approval), Lambda function evaluateDeploy(), DynamoDB, CloudWatch Events (1m), "Change 1"]



approveValidation function (Node.js 4.3)


Canary Deployments – they’re different

All production hosts:

• Participate in serving production traffic

• Are configured as production instances

• Participate in the production metrics stream

Canary hosts:

• Have their own metrics stream

• Canary validations use the canary metrics stream


Summary: Segment production

• Segment production to reduce impact of a bad change

• Minimum segmentation: region, with a canary deployment per region

• Larger service segmentation: zonal, sub-zonal

• Test each segment before moving on


Release and deploy: Segment production

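[Diagram: the single DeployToProd (CodeDeploy) action in the Production stage shown alongside a segmented Production stage: CanaryDeploy (CodeDeploy), PostDeployTest (Approval), Deploy-AZ-1 (CodeDeploy), PostDeployTest (Approval), Deploy-AZ-2 (CodeDeploy) and Deploy-AZ-3 (CodeDeploy), with synthetic traffic and CodeDeploy deployment health checks]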






4 of 5 – Halt promotions

Don’t change the system under test

[Diagram: a Source stage with MyAppSource (CodeCommit), a Build stage with MyAppBuild, and a DeployToProd stage with MyApp (CodeDeploy), which deploys to an EC2 instance. Labels: Change 1, Change 2, Change 3]

Don’t compound problems during an outage

Pipeline Problem:

The pipeline is unaware of the health of the infrastructure that it is deploying to.

Consequence:

Production changes, usually deployments, can make it difficult for an operator to resolve a production event.


Build promotion blockers

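Automatically stop deploying to production during an event

[Diagram: a Source stage with MyAppSource (CodeCommit), a Build stage with MyAppBuild, and a DeployToProd stage with MyApp (CodeDeploy), which deploys to an EC2 instance; Change 1 and Change 2 are in the pipeline. CloudWatch Events (1m) triggers synthetic traffic, which checks the EC2 instance and emits metrics to CloudWatch. A CloudWatch alarm notifies SNS, which invokes disableTransition(), and that disables the transition into DeployToProd]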


disableTransition function (Lambda Node.js 4.3)
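A hedged sketch of what such a function might decide, written in Python rather than Node.js and not taken from the talk. It assumes the function is invoked through SNS by a CloudWatch alarm whose JSON message includes `AlarmName` and `NewStateValue`, and it only builds the arguments for CodePipeline's DisableStageTransition call. The name `transition_to_disable` and the character filter applied to the reason text are assumptions.

```python
import json
import re
from typing import Any, Dict, Optional


def transition_to_disable(
    event: Dict[str, Any], pipeline_name: str, stage_name: str
) -> Optional[Dict[str, Any]]:
    """Return DisableStageTransition arguments when the alarm is firing, else None."""
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    if alarm.get("NewStateValue") != "ALARM":
        return None  # only block promotions while the alarm is in ALARM
    reason = "Alarm " + alarm.get("AlarmName", "unknown") + " fired"
    # Conservative assumption: keep the reason short and limited to plain characters.
    reason = re.sub(r"[^a-zA-Z0-9!@ ().*?-]", " ", reason)[:300]
    return {
        "pipelineName": pipeline_name,
        "stageName": stage_name,
        "transitionType": "Inbound",  # stop changes from entering the stage
        "reason": reason,
    }
```

With boto3, a non-empty result could be passed as keyword arguments to the CodePipeline client's `disable_stage_transition`.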


Enable production deployments - CodePipeline


Summary: Halt promotions

• Halt promotions to production when your production environment has “issues”

• Automate by disabling stage transitions


Release and deploy: Halt promotions

[Diagram: the segmented Production stage: CanaryDeploy (CodeDeploy), PostDeployTest (Approval), Deploy-AZ-1 (CodeDeploy), PostDeployTest (Approval), Deploy-AZ-2 (CodeDeploy) and Deploy-AZ-3 (CodeDeploy), with synthetic traffic and CodeDeploy deployment health checks]






Do not deploy at sensitive times

Problem:

A bad change during sensitive times has a disproportionate effect on the business.

Consequence:

Issues during sensitive days risk reputational damage and financial loss.

5 of 5 - Gates

Adding safety with deployment black-days

Deploy to production during normal conditions

• Halt deployments during sensitive times

Building a black-day calendar with CodePipeline:

• Use Approvals to pause production deployments

• Lambda to automatically approve when the time is right
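The black-day check itself can be a simple time-window comparison. The sketch below assumes a hypothetical list of blocked windows expressed as timezone-aware start and end times; the names `in_black_day` and `gate_decision` are illustrative. A scheduled function would approve the pending approval only when the decision is "Approved" and otherwise leave it waiting.

```python
from datetime import datetime, timezone
from typing import Iterable, Tuple

Window = Tuple[datetime, datetime]  # (start inclusive, end exclusive)


def in_black_day(now: datetime, windows: Iterable[Window]) -> bool:
    """Return True when `now` falls inside any blocked window."""
    return any(start <= now < end for start, end in windows)


def gate_decision(now: datetime, windows: Iterable[Window]) -> str:
    """'Wait' during a black-day window, 'Approved' once the time is right."""
    return "Wait" if in_black_day(now, windows) else "Approved"


# Example: block deployments for one day (dates chosen only for illustration).
BLOCKED = [
    (
        datetime(2016, 11, 25, tzinfo=timezone.utc),
        datetime(2016, 11, 26, tzinfo=timezone.utc),
    )
]
```

Keeping the windows in one shared place is what lets the same gate apply to every pipeline.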


Build black-day gates

[Diagram: black-day test. A Source stage with MyAppSource (CodeCommit), a Build stage with MyAppBuild, and a Deploy stage with BlackDayCheck (Approval) and ProductionDeploy (CodeDeploy), with "Change 1" moving through the pipeline. Other labels: registerDeployment, Lambda function processTimeWindows, DynamoDB, CloudWatch Events (1m)]

This looks familiar…

[Diagram: the black-day gate pipeline: BlackDayCheck (Approval) and ProductionDeploy (CodeDeploy), with registerDeployment, the Lambda function processTimeWindows, DynamoDB and CloudWatch Events (1m)]

This looks familiar – post-deployment test

[Diagram: the post-deployment test pipeline: CanaryDeploy (CodeDeploy) to Prod-us-east-1a and ValidateCanary (Approval), with the Lambda function evaluateDeploy(), DynamoDB and CloudWatch Events (1m)]



What’s the difference?

[Diagram: the black-day gate pipeline: BlackDayCheck (Approval) and ProductionDeploy (CodeDeploy), with registerDeployment, the Lambda function processTimeWindows, DynamoDB and CloudWatch Events (1m)]

Summary: Gates

• Black-days provide centralized control

• Add common action to all pipelines

• Black-days are a type of gate

• Implement with Approval actions in CodePipeline

Release and deploy: Gates

[Diagram: the segmented Production stage with a CheckBlackDays (Approval) gate added: CanaryDeploy (CodeDeploy), PostDeployTest (Approval), Deploy-AZ-1 (CodeDeploy), PostDeployTest (Approval), Deploy-AZ-2 (CodeDeploy) and Deploy-AZ-3 (CodeDeploy), with synthetic traffic and CodeDeploy deployment health checks]

What we’ve learned

Goal: Make your pipeline safer…

1. Identify production issues quickly

• Continuous Production Testing

2. Safely deploy changes

• Manage deployment health

• Segment production

3. Automatically decide when to release changes

• Halt promotions

• Black-days and Gates

Release and deploy process: Ending point

[Diagram: the original Production stage, a single DeployToProd (CodeDeploy) action, compared with the final Production stage: CheckBlackDays (Approval), CanaryDeploy (CodeDeploy), PostDeployTest (Approval), Deploy-AZ-1 (CodeDeploy), PostDeployTest (Approval), Deploy-AZ-2 (CodeDeploy) and Deploy-AZ-3 (CodeDeploy), with synthetic traffic and CodeDeploy deployment health checks]

Thank you!

Remember to complete your evaluations!

Code is available online

• github.com/awslabs/aws-codepipeline-time-windows

• github.com/awslabs/aws-codepipeline-synthetic-tests

• github.com/awslabs/aws-codepipeline-block-production

Related Sessions

• DEV303 – Deploying and Managing .NET Pipelines and Microsoft Workloads

• DEV310 – DevOps on AWS: Choosing the Right Software Deployment Technique

• DEV313 – Infrastructure Continuous Deployment Using AWS CloudFormation

• SVR307 – Application Lifecycle Management in a Serverless World

Author:

Slides written and prepared by Mark Mansour, Senior Manager, Continuous Delivery, AWS.

This presentation, “DevOps on AWS: Advanced Continuous Delivery Techniques”, was originally given at re:Invent 2016 on Nov 30.
