Josh Evans - Director of Operations Engineering November 16, 2015 Beyond DevOps: How Netflix Bridges the Gap



Technical Debt

Java 6, Perforce, Single Master Jenkins, Ant, CentOS, Asgard/Mimir

Fall 2013

Java 6: needed to move forward on Java but struggled to drive adoption.
Perforce: many teams moving to Git; no story for supporting Perforce in the cloud.
Jenkins: long queues & build times.
Ant: long build times, inefficient dependency management.
CentOS: slow delivery of new kernel and userland binaries.
Asgard: served us well as a deployment & cloud management tool.
Mimir: gave us a great prototype, and we learned a lot.

Tech debt kept us from doing our jobs well

How do we drive broad-based change?

Does this sound familiar? Have any of you been on one side or the other of this situation?

The Paved Road: Java 7, Stash, Jenkins Shards, Gradle, Ubuntu

To move forward, we defined the concept of the paved road. The paved road promises a well-supported, integrated developer experience.
Java 7: just to move forward; Java 8 already on the horizon.
Git: organically adopted by many teams.
Gradle: build time reduced due to efficient dependency management.
Ubuntu: more frequent, well-vetted userland binaries & kernels.
Jenkins: shards to fix long build times.
Started building our next-generation cloud console & continuous delivery platform, Spinnaker.

We staffed up and went for it, big bang

Some said: "You're overloading us"
Too many projects
Poor targeting

Others said: "What took you so long?"
We've moved on
Now we need to migrate

"That's great, but we're paying a high tax"

Expectations gap: division of labor, timing of solutions, leadership

Affects: reputation, relationships, lost opportunities

Organizational Debt

How do we bridge the gap?

Remember that TIME is money

Read to the audience:

He that can earn ten shillings a day by his labour, and goes abroad, or sits idle one half of that day, tho' he spends but sixpence during his diversion or idleness, ought not to reckon that the only expense; he has really spent or rather thrown away five shillings besides.

- Advice to a Young Tradesman

Time is a form of currency

Please raise your hand if you know which puritanical workaholic wrote this. In addition to the obvious intent behind this, there is a more profound message. Time spent working is related to the money you make, but time is also, in and of itself, a form of currency. It's the exchange or giving of time that drives the economics of an engineering organization.

Our time today: Product Engineering, Operations Engineering, Challenges & Strategies

Product Innovation

winning moments of truth

Every facet of the product
1400 A/B tests in the last year & accelerating

Continuous Innovation

But wait, there's more

Build It: design, code, build, bake, test, deploy

Run It: configure, monitor, triage, fix (at scale, globally)
You build it, you run it

Netflix has a freedom & responsibility culture. "You build it, you run it" aligns perfectly with our values around autonomy & ownership.

Internet

1000s of starts per second
100,000s of requests per second
100,000,000 hours of content / day

3 AWS Regions, 3 AZs per region

Relentless product innovation
Building & running micro-services at scale, globally

This leads to a high-pressure situation that creates a shortage of time.

Our time today: Product Engineering, Operations Engineering, Challenges & Strategies

DevOps is a software development method that emphasizes the roles of both software developers and other information-technology (IT) professionals with an emphasis on IT Operations.

- Wikipedia

The Gap

Read the definition out loud. Out of curiosity, who agrees with this definition? Who disagrees? Not only is there disagreement, but the general construct isn't really that helpful.

Why? How?

It doesn't address how to bridge the gap or why it matters to do so. What are the strategies for success?

It's the practices, tools, culture.
Motivations: the reason for doing DevOps is to achieve operational excellence.

Quality, Velocity, Operational Excellence

Operational Excellence is the continuous improvement of the management, design, and function of operational environments to achieve greater quality, velocity, and competitive advantage.

Engineering Tools, Insight & Real-time Analytics, Performance & Reliability
Operations Engineering is the application of software engineering practices to achieve and sustain operational excellence.

We do the undifferentiated heavy lifting for our customers. This means we take on the operationally oriented common engineering work across teams so that each team can focus on their core charter.

Operations Engineering

Service provider
Operational excellence driver

Cross-cutting solutions
Undifferentiated heavy lifting


Our time today: Product Engineering, Operations Engineering, Challenges & Strategies

"You're overloading us"
"What took you so long?"

Remember that feedback? We made assumptions:
Requirements: what & when
Time for non-product work

How do we...
Move from assumptions to knowledge?
Effect change without imposing a tax?
Achieve and sustain operational excellence?

Time is a form of currency

Going back to our Ben Franklin quote: time is a form of currency. In our engineering world, time really is currency. We don't pay each other to do work. We commit time to projects. In other words, we have a time-based economy.

5 strategies for success in time-based economies

software & organizational engineering

Audience: can anyone name one of the strategies?

1. Reach out

What are your biggest operational pain points?
How can we help?
How well are we meeting your needs today?
What would you like to see from us in the future?

Listen

Shower, rinse, repeat
Talk to your engineering customers

Grease the Squeaky Wheels: low tolerance for tax, more vocal than most

Stop spamming us!

What they wanted
High impact solutions
Clarity on deliverables
Lower operational tax
Leadership, innovation, and partnership

Our commitments
Deliver on solutions
Better road map definition & communication
A more aggressive stance on automation
Deeper investment into leadership, innovation, planning

2. Make an impact
Apply what you've learned
Deliver what matters

global cloud console
end to end delivery
automation platform

velocity with confidence

Pipelines - Automated Global Delivery

3. Make it easy to do the right thing

Audience: can anyone name one of the strategies?

A free chaos monkey for good ones

Engineering time is scarce

We must do more heavy lifting

Supply & Demand

Spinnaker: manual step
Automated migrations: Mimir
Provide on-ramps

Automate proven practices

Alerting and Monitoring
Apache & Tomcat Hardening
Automated Canary Analysis
Autoscaling
Chaos Participation
Consistent Naming
ELB Configuration
Healthcheck Configured
Red-Black Pipeline
Squeeze Testing
Timeout & Fallback Tuning
Workload Reliability

Production Ready?


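[Diagram, "Canaries": customers reach a load balancer that sends 95% of traffic to the old version (v1.0, 100 servers) and 5% to the new version (v1.1, 5 servers); labelled "Metrics".]

[Diagram, "Canaries": the load balancer sends 100% of traffic to the new version (v1.1, 100 servers); the old version (v1.0) has 0 servers; labelled "Metrics".]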

Define: metrics, a threshold

Every n minutes: classify metrics, compute score, make a decision

Automated Canary Analysis
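The loop on these slides (define metrics and a threshold; then classify, score, decide) could be sketched roughly as follows. This is a minimal illustration, not Netflix's implementation: the names `MetricSample`, `classify`, `canary_score`, and `decide`, the 10% tolerance, the 95-point threshold, and the "share of passing metrics" scoring rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MetricSample:
    """One metric observed on both clusters during an analysis interval."""
    name: str
    baseline: list[float]  # values from the old version
    canary: list[float]    # values from the new version


def classify(sample: MetricSample, tolerance: float = 0.10) -> str:
    """Label a metric "pass" if the canary mean stays within `tolerance`
    (relative deviation) of the baseline mean, otherwise "fail".
    The relative-deviation rule is an assumption of this sketch."""
    base = mean(sample.baseline)
    canary = mean(sample.canary)
    if base == 0:
        return "pass" if canary == 0 else "fail"
    deviation = abs(canary - base) / abs(base)
    return "pass" if deviation <= tolerance else "fail"


def canary_score(samples: list[MetricSample]) -> float:
    """Score the canary as the percentage of metrics classified "pass"."""
    if not samples:
        raise ValueError("at least one metric is required")
    labels = [classify(s) for s in samples]
    return 100.0 * labels.count("pass") / len(labels)


def decide(score: float, threshold: float = 95.0) -> str:
    """Compare the score with the threshold and choose the next step."""
    return "continue" if score >= threshold else "rollback"


if __name__ == "__main__":
    # Hypothetical readings for one interval ("every n minutes").
    interval = [
        MetricSample("latency_ms", [100.0, 102.0, 98.0], [101.0, 103.0, 99.0]),
        MetricSample("error_rate", [1.0, 1.0], [2.0, 2.0]),
    ]
    score = canary_score(interval)
    print(score, decide(score))
```

In a real pipeline this function would be called on a timer against live metrics, and the decision would gate the next pipeline stage.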

Canary Analysis
Performance
Integration Tests
Chaos
Conformity
Static
Unit Tests

Make it easy to do the right thing

Static & Functional Testing

4. Reduce the cost of change


Ongoing migrations
Library propagation

100s of micro-services
Complex dependencies

Continuous, Broad-based Change

There are several approaches that you might take to solve this problem. I'll explore each one.

Change Engineering: locate, communicate, facilitate

Automated forensics
Who last touched x?
What team?
Who was their manager?
Who owns this artifact, repository, service?

Whitepages
Workday wrapper
App & REST API
Organization hierarchy
Metadata
Change log


Krieger
REST-based service
Sources: Whitepages, Stash, Edda, Jenkins, Spinnaker, etc.

{
  "content": {},
  "_links": {
    "employees": { "href": "/api/employees/" },
    "projects": { "href": "/api/projects/" },
    "teams": { "href": "/api/teams/" },
    "applications": { "href": "/api/applications/" },
    "jobs": { "href": "/api/build/jobs" },
    "masters": { "href": "/api/build/masters" },
    "projectDistribution": { "href": "/api/teams/projectDistribution" }
  }
}

/api/employees?q=jevans

{
  "employees": [
    {
      "id": "241",
      "firstName": "Josh",
      "lastName": "Evans",
      "username": "jevans",
      "email": "[email protected]",
      "jobTitle": "Director of Operations Engineering",
      "isManager": true,
      "isCurrent": true,
      "title": "Josh Evans (jevans) - Operations Engineering",
      "_links": {
        "self": { "href": "/api/employees/241" },
        "manager": { "href": "/api/employees/117890" },
        "team": { "href": "/api/teams/f9134a81" },
        "projects": { "href": "/api/teams/f9134a81/projects" }
      }
    }
  ]
}
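Because each resource in the sample response carries `_links`, a client can walk from a person to their manager, team, and projects without hard-coding paths. The sketch below works on a trimmed copy of the slide's sample payload rather than a live service; the helper names `link` and `next_hops` are made up for illustration, and it assumes the response is a JSON object wrapping the `employees` array.

```python
import json

# Trimmed copy of the sample /api/employees?q=jevans response from the slide.
SAMPLE = """
{
  "employees": [
    {
      "id": "241",
      "username": "jevans",
      "isManager": true,
      "_links": {
        "self": {"href": "/api/employees/241"},
        "manager": {"href": "/api/employees/117890"},
        "team": {"href": "/api/teams/f9134a81"},
        "projects": {"href": "/api/teams/f9134a81/projects"}
      }
    }
  ]
}
"""


def link(resource: dict, rel: str) -> str:
    """Return the href for a named relation in a resource's _links block."""
    return resource["_links"][rel]["href"]


def next_hops(payload: dict, username: str) -> dict:
    """Find an employee by username and collect the follow-up URLs needed
    to answer "what team?" and "who is their manager?"."""
    for employee in payload["employees"]:
        if employee["username"] == username:
            return {rel: link(employee, rel) for rel in ("manager", "team", "projects")}
    raise KeyError(username)


if __name__ == "__main__":
    hops = next_hops(json.loads(SAMPLE), "jevans")
    for rel, href in hops.items():
        print(rel, href)
```

Each returned path would then be fetched in turn to resolve the manager, team, or project list.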

Security vulnerabilities: Who owns this service?

Platform updates: Who is using this version of this library?
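Answering "who is using this version of this library?" comes down to joining dependency data with ownership data so that only the affected teams are contacted. A rough sketch follows, using a hand-written in-memory inventory; the service names, team names, versions, and the `affected_by` helper are all hypothetical, and a real system would pull this data from sources such as those listed for Krieger.

```python
from collections import defaultdict

# Hypothetical inventory: which service is owned by which team and which
# library versions it currently depends on.
INVENTORY = [
    {"service": "api-gateway", "team": "edge", "libs": {"guava": "14.0"}},
    {"service": "device-proxy", "team": "edge", "libs": {"guava": "14.0"}},
    {"service": "billing", "team": "payments", "libs": {"guava": "18.0"}},
    {"service": "search", "team": "discovery", "libs": {"guava": "14.0"}},
]


def affected_by(inventory: list[dict], lib: str, version: str) -> dict[str, list[str]]:
    """Group services that use `lib` at exactly `version` by owning team,
    so a change campaign can target only the teams that need to act."""
    targets: dict[str, list[str]] = defaultdict(list)
    for entry in inventory:
        if entry["libs"].get(lib) == version:
            targets[entry["team"]].append(entry["service"])
    return dict(targets)


if __name__ == "__main__":
    for team, services in affected_by(INVENTORY, "guava", "14.0").items():
        print(f"notify {team}: {', '.join(services)}")
```

Grouping by team rather than by service keeps the notification count low, which is the point of targeted coordination over broadcast email.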

Today: Targeted Coordination

Automated, efficient technical project management

Communication, Guidance, Tracking

Low tax for TPMs & engineers

Future Change Campaigns: Security Fix, Java 9, Guava

5. Develop Partnerships
Beyond supply & demand

And once you've proven that you can deliver, you have some money in the bank. You have earned a seat at the table. Now you're ready to build strong partnerships.

Spinnaker 1.0, 1H 2015
Nearing completion
Aggressive schedule
Unexpected delays
Commitment to June delivery

Edge Engineering
Built their own continuous delivery solution
Not positioned for engineering-wide support
Believes in common solutions

Partnership in Action
Strong relationship
Open discussions about concerns
Decision: leaned forward

+2 engineers on Spinnaker
Successful 1.0 launch

Moving Forward Together
Containers?
Achieving alignment
Collaborative exploration
Edge, Platform, Operations
A new paved road?

Payoffs
Paved Road: adopted, adding new ones
Production Ready: ongoing
Migrations: easier
Reputation: improving
Improved: service uptime, rate of change

Putting it to the test in 2016

Streaming production & test - EC2 Classic to VPC
Highly cross-functional
Complex dependencies
Zero downtime

Stay tuned

Five Strategies
1. Reach out
2. Make an impact
3. Make it easy to do the right thing
4. Reduce the cost of change
5. Develop partnerships

Open Sourced!
https://netflix.github.io/

Josh Evans
[email protected]
@ops_engineering
Questions?