DevOps Chicago - The Game Of Operations and the Operation of Games


DESCRIPTION

Operating online games is fun and challenging. Games are some of the spikiest workloads around, and real-time really means *real-time*. Randy shares many of the DevOps techniques he has been putting into practice at KIXEYE, including migrating to the cloud, organizing around services, and focusing on automation. He illustrates his points with war stories from operating large-scale services at Google and eBay. Please see companion video at https://vimeo.com/95841677.


The Game of Operations and the Operation of Games

Randy Shoup @randyshoup

linkedin.com/in/randyshoup

DevOps Chicago Meetup, May 19 2014

Background

CTO at KIXEYE
• Real-time strategy games for web and mobile

Director of Engineering for Google App Engine
• World’s largest Platform-as-a-Service

Chief Engineer at eBay
• Multiple generations of eBay’s real-time search infrastructure

Real-Time Strategy Games are …

• Real-time
• Spiky
• Computationally intensive
• Constantly evolving
• Constantly pushing boundaries

Technically and operationally demanding

Operating Games: Goals

Player Fun
• If players aren’t playing, we don’t have a business
• If players aren’t having fun, we don’t have a business for long
• Fun includes game mechanics, feature set, quality, performance

Studio Velocity
• 8 *highly independent* game studios
• Different tech stacks, tool chains, phases of development

Developer Productivity and Satisfaction
• We are a vendor; the studios are our customers
• Must be *strictly better* than the alternatives of build, buy, borrow

Cost Efficiency
• More output for less

The Game of Operations

Cloud
• All studios and services moving to AWS
• Strong focus on automation

Services
• Small, focused teams
• Clean, well-defined interface to customers

DevOps
• Developers behave like Ops
• Ops behaves like Developers

The Game of Operations

Cloud

Services

DevOps

Why Cloud? (The Obvious)

Provisioning Speed
• Minutes, not weeks
• Autoscaling in response to load

Near-Infinite Capacity
• No need to predict and plan for growth
• No need to defensively overprovision

Pay For What You Use
• No “utilization risk” from owning / renting
• If it’s not in use, spin it down

Why Cloud? (The Less Obvious)

Instance Optimization Opportunities
• Instance shapes to fit most parts of the solution space (compute-intensive, IO-intensive, etc.)
• If the shape does not fit, try another

Service Quality
• Amazon and Google know how to run data centers
• Battle-tested and highly automated
• World-class networking, both cluster fabric and external peering

Why Cloud? (The Fundamentals)

Right Side of History
• Almost impossible to beat Google / Amazon buying power or operating efficiencies
• 2010s in computing are like 1910s in electric power
• Soon it will be just as common to run your own data center as it is to run your own electric power generation (!)

Easy and Fun
• It Just Works™
• Makes it easy to fall in love with infrastructure

Autoscaling

Games are very spiky
• Very unpredictable
• Huge variability between peak and trough
• Hits are self-reinforcing

Services and clients have to “flex”
• Clients back off in response to latency
• Services grow / shrink based on load

Service Cluster == AWS Auto-Scale Group
• Scale up or down based on predefined metrics, thresholds
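
The two kinds of "flex" above can be sketched in a few lines. This is a minimal illustration, not KIXEYE's configuration: the `ScalingPolicy` fields, the CPU thresholds, the step size, and the backoff formula are all assumptions chosen for the example. A real AWS Auto Scaling group expresses the same idea through its own scaling policies and alarms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical policy; field names and defaults are illustrative only."""
    min_size: int = 2
    max_size: int = 20
    scale_up_above: float = 0.70    # average CPU utilization, 0.0 to 1.0
    scale_down_below: float = 0.30
    step: int = 2                   # instances added or removed per decision


def desired_capacity(current: int, avg_cpu: float, policy: ScalingPolicy) -> int:
    """Service side: grow or shrink the cluster based on a predefined threshold."""
    if avg_cpu > policy.scale_up_above:
        target = current + policy.step
    elif avg_cpu < policy.scale_down_below:
        target = current - policy.step
    else:
        target = current
    # Never leave the configured bounds of the group.
    return max(policy.min_size, min(policy.max_size, target))


def client_backoff_delay(observed_latency_s: float, target_latency_s: float,
                         base_delay_s: float = 0.1, max_delay_s: float = 5.0) -> float:
    """Client side: wait longer before the next request as latency rises."""
    if observed_latency_s <= target_latency_s:
        return 0.0
    overload = observed_latency_s / target_latency_s
    return min(max_delay_s, base_delay_s * overload)
```

With these assumed defaults, a cluster of 4 instances at 90% average CPU would grow to 6, and a client seeing twice its target latency would pause 0.2 seconds before retrying.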

Automation Work at KIXEYE

Build / Deploy Pipeline
• One button
• Puppet -> Packer -> AMI -> Asgard
• No-downtime red-black deployment
• Futures: canarying, auto-rollback

Manageability
• Flume -> ElasticSearch / Kibana for logging
• Shinken -> PagerDuty for monitoring and alerting
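
The no-downtime red-black deployment in the pipeline above follows a simple control flow: bring up the new cluster next to the old one, check its health, switch traffic, and keep the old cluster available for rollback. The sketch below simulates that flow in memory. The `Router` class and the `is_healthy` callback are hypothetical stand-ins, not Asgard's API. The automatic switch back at the end illustrates the "auto-rollback" item that the slide lists as future work, not something the pipeline is stated to do.

```python
from typing import Callable, List


class Router:
    """Hypothetical stand-in for a load balancer: tracks which cluster is live."""

    def __init__(self, live: str) -> None:
        self.live = live
        self.history: List[str] = [live]

    def switch_to(self, cluster: str) -> None:
        self.live = cluster
        self.history.append(cluster)


def red_black_deploy(router: Router, new_cluster: str,
                     is_healthy: Callable[[str], bool]) -> bool:
    """Cut traffic over to new_cluster; return True if the new cluster stays live."""
    old_cluster = router.live

    # 1. The new cluster runs alongside the old one and takes no traffic yet.
    if not is_healthy(new_cluster):
        return False  # the old cluster never stopped serving

    # 2. Switch traffic in one step; the old cluster stays up as a fallback.
    router.switch_to(new_cluster)

    # 3. Re-check under real traffic and roll back if the new cluster degrades.
    if not is_healthy(new_cluster):
        router.switch_to(old_cluster)
        return False
    return True
```

Because the old cluster is only retired after the new one has proven itself, a failed deploy costs a traffic switch rather than an outage.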

The Game of Operations

Cloud

Services

DevOps

Service Teams

• Give teams autonomy
• Freedom to choose technology, methodology, working environment
• Responsibility for the results of those choices

• Hold them accountable for *results*
• Give a team a goal, not a solution
• Let team own the best way to achieve the goal

KIXEYE Service Chassis

• Goal: Produce a “chassis” for building scalable game services

• Minimal resources, minimal direction
• 3 people x 1 month
• Consider building on open source projects

Team exceeded expectations
• Co-developed chassis, transport layer, service template, build pipeline, red-black deployment, etc.
• Operability and manageability from the beginning
• Heavy use of Netflix open source projects
• 15 minutes from no code to running service in AWS (!)
• Plan to open-source several parts of this work

Micro-Services

• Simple
• Well-defined interface
• Single-purpose
• Modular and independent
• Small teams
• Autonomy and responsibility


Transition to Building Services

Common Chassis
• Make it trivially easy to build and maintain a service

Define Service Interface (Formally!)
• Propose, Discuss, Agree

Prototype Implementation
• Simplest thing that could possibly work
• Client can integrate with prototype
• Implementor can learn what works and what does not

Real Implementation
• Throw away the prototype (!)

Rinse and Repeat
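
One way to picture the interface-first steps above: write the agreed interface down as a type, then put the simplest possible prototype behind it so the client can integrate right away. The leaderboard service here is a hypothetical example invented for this sketch, and all of its method names are assumptions. The point is only that client code depends on the interface, so the prototype can be thrown away later without touching the client.

```python
from typing import Dict, List, Protocol, Tuple


class LeaderboardService(Protocol):
    """Hypothetical service interface, agreed between provider and client."""

    def submit_score(self, player_id: str, score: int) -> None: ...

    def top(self, n: int) -> List[Tuple[str, int]]: ...


class InMemoryLeaderboard:
    """Throwaway prototype: keeps each player's best score in a dict."""

    def __init__(self) -> None:
        self._best: Dict[str, int] = {}

    def submit_score(self, player_id: str, score: int) -> None:
        self._best[player_id] = max(score, self._best.get(player_id, score))

    def top(self, n: int) -> List[Tuple[str, int]]:
        # Highest score first; ties broken by player id for a stable order.
        ranked = sorted(self._best.items(), key=lambda item: (-item[1], item[0]))
        return ranked[:n]


def show_podium(service: LeaderboardService) -> List[str]:
    """Client code: written against the interface, not against the prototype."""
    return [f"{player}: {score}" for player, score in service.top(3)]
```

A real implementation only has to provide the same two methods to replace `InMemoryLeaderboard` behind `show_podium`.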

Transition to Service Relationships

Vendor – Customer Relationship
• Friendly and cooperative, but structured
• Clear ownership and division of responsibility
• Customer can choose to use service or not (!)

Service-Level Agreement (SLA)
• Promise of service levels by the service provider
• Customer needs to be able to rely on the service, like a utility

Charging and Cost Allocation
• Charge customers for *usage* of the service
• Aligns economic incentives of customer and provider
• Motivates both sides to optimize
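
Charging by usage comes down to simple arithmetic: each customer pays a share of the service's cost in proportion to how much of it they used. The sketch below shows one possible allocation rule. The function name, the usage unit (for example requests per month), and the studio names in the note that follows are assumptions for illustration, not KIXEYE's billing model.

```python
from typing import Dict


def allocate_cost(total_cost: float, usage: Dict[str, float]) -> Dict[str, float]:
    """Split total_cost across customers in proportion to their measured usage."""
    total_usage = sum(usage.values())
    if total_usage <= 0:
        # Nobody used the service, so nothing is charged back in this sketch.
        return {customer: 0.0 for customer in usage}
    return {
        customer: total_cost * used / total_usage
        for customer, used in usage.items()
    }
```

For a service costing 1000 in a month where one studio made 300 requests and another made 100, the first studio would be charged 750 and the second 250. A studio that cuts its request volume pays less, and a provider that cuts its cost lowers everyone's bill.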

The Game of Operations

Cloud

Services

DevOps

Instrumentation and Measurement

Instrument Everything
• Machine / instance stats: CPU, memory, I/O
• Software infrastructure stats: database, message queue
• Application stats: game client, game server, services

Make It Easy to Do the Right Thing™
• Easy, reliable, low-latency
• Auto-tagged and searchable

Why?
• Measurement beats intuition every time; my own intuition is usually wrong
• If you need to ssh into a box, instrumentation failed you
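
Making instrumentation "easy" and "auto-tagged" can be as small as a decorator that times every call and attaches tags without the developer writing any measurement code. The following is a minimal sketch under stated assumptions: the in-memory `METRICS` list stands in for a real metrics backend, and the names `instrumented`, `find_metrics`, and the tag keys are hypothetical.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# Stand-in for a real metrics pipeline; in production this would ship elsewhere.
METRICS: List[Dict[str, Any]] = []


def instrumented(service: str) -> Callable:
    """Decorator factory: time each call and tag it with service, operation, outcome."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            outcome = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                outcome = "error"
                raise
            finally:
                # Recorded on success and on failure alike.
                METRICS.append({
                    "service": service,
                    "operation": func.__name__,
                    "outcome": outcome,
                    "duration_s": time.perf_counter() - start,
                })
        return wrapper
    return decorator


def find_metrics(**tags: Any) -> List[Dict[str, Any]]:
    """Search recorded metrics by exact tag match, e.g. find_metrics(outcome="error")."""
    return [m for m in METRICS if all(m.get(k) == v for k, v in tags.items())]


@instrumented(service="matchmaking")
def find_match(player_id: str) -> str:
    """Example application call; the body is a placeholder."""
    return f"match-for-{player_id}"
```

After a few calls to `find_match`, a query such as `find_metrics(service="matchmaking", outcome="error")` answers the kind of question that would otherwise mean logging in to a machine.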

One Team (!)

• Act as one team across development, product, operations, etc.

• Solve problems instead of blaming and pointing fingers

• Political games are not as fun as real-time strategy games

Everyone Is Responsible for Prod

Everyone’s incentives are aligned

Everyone is strongly motivated to have solid instrumentation and monitoring

Organization: Learning Culture

Learn from mistakes and improve
• What did you do -> What did you learn
• Take emotion and personalization out of it

Encourage iteration and velocity
• “Failure is not falling down but refusing to get back up” – Theodore Roosevelt

Google Blame-Free Post-Mortems

Post-mortem After Every Incident
• Document exactly what happened
• What went right
• What went wrong

Open and Honest Discussion
• What contributed to the incident?
• What could we have done better?

Engineers compete to take personal responsibility (!)

Transition to DevOps

Organization
• Studios make user-visible games
• Services provide common endpoints

Training / Retraining
• Common bootcamp
• Train devs as Ops, Ops as devs

You Build It, You Run It
• Transition on-call
• Use primary / secondary on-call as apprenticeship

Recap: The Game of Operations

Cloud

Services

DevOps

Come Join Us!

KIXEYE is hiring in SF, Seattle, Victoria, Brisbane, Amsterdam

@randyshoup
rshoup@kixeye.com
linkedin.com/in/randyshoup
slideshare.net/randyshoup
