Service Stampede Anil Gursel, PayPal

Service Stampede: Surviving a Thousand Services

Agenda

• Monoliths to Microservices
• Problems with microservices
• Solutions & Practices
• The need for standardization
• Introducing squbs

Monolith to Microservices

Requests

Congrats! Your monolith became a thousand microservices – now you’re in serious trouble!!!

Cost/Benefits of Moving to Microservices

Gains
• Independence – faster PDLC
• Freedom of choice for service implementation
• Easy evolution of service & technology
• Coexisting services across generations
• Complexity & Latency

Losses
• Homogeneity
• Consistency of implementation across services
• Timing & Determinism

Hmm. To be, or not to be… a service, that is...

Microservices Issues

Latency & Determinism

Service Boundaries: To be, or not to be a service

Scaling and rightsizing

Many failure points – need resiliency

Inconsistency – need standardization

Latency & Determinism

Latency by Deployment Topology

• Avoid too many layers of services
• Keep state close to the edge
• The more hops, the higher and less deterministic the latency is
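
To make the last bullet concrete, here is a small sketch under assumed numbers. It treats a synchronous call chain as a sum of per-hop latencies, and assumes each hop is independently "slow" with the same probability. The independence assumption, the names, and the 1% figure are illustrative and do not come from the talk.

```scala
object HopLatency {
  /** Probability that at least one of `hops` sequential calls is slow,
    * assuming each hop is independently slow with probability `pSlow`. */
  def probAnySlow(pSlow: Double, hops: Int): Double =
    1.0 - math.pow(1.0 - pSlow, hops.toDouble)

  /** End-to-end latency of a synchronous call chain is the sum of its hops. */
  def chainLatencyMs(perHopMs: Seq[Double]): Double = perHopMs.sum
}
```

With a 1% slow-path chance per hop, two hops give roughly a 2% chance of a slow request and five hops give roughly 4.9%, so every extra layer raises both the latency and its variability.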

Services Need to Scale

• Scale horizontally with increasing workload
  • More nodes, or…
  • More pods with increasing workload
• Scale vertically – why?
  • Keep the number of instances under control
  • 125 nodes @ 16 CPUs are easier to manage than 1000 nodes @ 2 CPUs (the same 2,000 CPUs in total)
  • Less load on network and switching infrastructure
  • Potentially better utilization & cache hits
  • Stateful systems: More limited horizontal scale
  • Need critical mass for redundancy

Practices for Successful Microservices

Deployment Topologies

Reactive Systems

Resilience with Circuit Breakers

Asynchronous Communication

Standardization

Individual Service Deployments

[Diagram: Service A and Service B deployed individually, each receiving its own requests]

Joint Deployments

[Diagram: Requests entering Service A, Service B, and Service C deployed together]

• Deployment orchestration using Chef, etc.
• Kubernetes Pods

The Reactive Manifesto

• Responsive
• Resilient
• Elastic
• Message Driven

Why Does it Matter?

• Responsive: Responds in a deterministic, timely manner. Controls determinism
• Resilient: Stays responsive in the face of failure – even cascading failures
• Elastic: Stays responsive under workload spikes
• Message Driven: Basic building block for responsive, resilient, and elastic systems

Circuit Breaker

• Keeps systems responsive under failure
• Avoids cascading failures
• Especially important with multi-generational downstream services
• Critical to keeping your 1000 services alive
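
The behavior described above can be sketched with a minimal, synchronous circuit breaker. This is an illustration only, not the Akka or squbs implementation: the class name, parameters, and injectable clock are assumptions made for the example. It opens after a number of consecutive failures, fails fast while open, and lets a trial call through once the reset timeout has elapsed.

```scala
import scala.util.{Failure, Success, Try}

/** Minimal illustrative circuit breaker (hypothetical, not a library API).
  * For simplicity it holds a lock while the protected call runs. */
final class SimpleCircuitBreaker(
    maxFailures: Int,
    resetTimeoutMs: Long,
    now: () => Long = () => System.currentTimeMillis()) {

  private var failures = 0
  private var openedAt: Option[Long] = None

  /** True while the breaker is open and the reset timeout has not elapsed. */
  def isOpen: Boolean = synchronized {
    openedAt.exists(t => now() - t < resetTimeoutMs)
  }

  /** Runs `body` unless the breaker is open, in which case it fails fast. */
  def call[T](body: => T): Try[T] = synchronized {
    if (isOpen) Failure(new IllegalStateException("circuit open: failing fast"))
    else {
      val result = Try(body)
      result match {
        case Success(_) =>
          // A success closes the breaker and resets the failure count.
          failures = 0
          openedAt = None
        case Failure(_) =>
          // Too many consecutive failures (or a failed trial call) opens it.
          failures += 1
          if (failures >= maxFailures) openedAt = Some(now())
      }
      result
    }
  }
}
```

Failing fast is what keeps a caller responsive: instead of waiting on a downstream service that is already struggling, the caller gets an immediate failure it can turn into a fallback response.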

Standardization

• Monitoring
  • Need to collect metrics, consistently
• Logging
  • Correlation across services
  • Uniformity in logs
• Security
  • Need to apply standard security configuration
• Environment Resolution
  • Staging, production, etc.

Consistency in the face of Heterogeneity

Standardized Reactive Platform for Large Scale Internet Deployments

Akka, Spray, Akka Http & Streams

Asynchronous

High Performance

Resilience & Supervision

Great Libraries for building Reactive Systems

Bootstrap and Lifecycle Management

Unicomplex: Lightweight bootstrap module

Emits lifecycle events: starting, active, stopping

Startup and shutdown hooks

Allows obtaining the current state
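
The idea of lifecycle states, hooks, and a queryable current state can be sketched as a small state holder. This is a hypothetical illustration and not the Unicomplex API: the object, class, and method names are invented for the example, and only the three state names come from the slide.

```scala
import scala.collection.mutable.ListBuffer

object LifecycleSketch {
  sealed trait LifecycleState
  case object Starting extends LifecycleState
  case object Active extends LifecycleState
  case object Stopping extends LifecycleState

  /** Hypothetical lifecycle holder that emits every transition to registered hooks. */
  final class Lifecycle {
    private var state: LifecycleState = Starting
    private val hooks = ListBuffer.empty[LifecycleState => Unit]

    /** Register a startup/shutdown hook; it is called on every state change. */
    def onTransition(hook: LifecycleState => Unit): Unit =
      synchronized { hooks += hook; () }

    /** Allows obtaining the current state. */
    def currentState: LifecycleState = synchronized(state)

    /** Record the new state, then notify hooks outside the lock. */
    def transitionTo(next: LifecycleState): Unit = {
      val toNotify = synchronized { state = next; hooks.toList }
      toNotify.foreach(hook => hook(next))
    }
  }
}
```

A component that must not take traffic before the system is ready would register a hook and start serving only when it sees the active state.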

Listener

• Declares configuration for port binding, interfaces, security, etc.

Service

• Akka Http/Spray Routes and Http Request Handler Actors
• Configured in squbs-meta.conf
• A service can be defined in a dependency artifact

Extension

• Starts low-level (non-actor) facilities needed for the environment

Request/Response Pipeline

Cubes: Another Deployment Topology

squbs: rhymes with cubes

Drop-in modules

Cubes can run in isolation as well as on a flat classpath

Easy to compose/decompose/refactor

Cubes share the actor system

Provide better predictability

Orchestration

[Diagram: task graph leading from Input through task1, task2, task3, task4, and task5 to Output]

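// task1 and task2 depend only on the input, so both start right away
val task1F = doTask1(input)
val task2F = doTask2(input)
// >> feeds the completed results on the left into the task on the right
val task3F = (task1F, task2F) >> doTask3
val task4F = task2F >> doTask4
val task5F = (task3F, task4F) >> doTask5
// Reply to the requester once task5 completes, then stop this actor
for { result <- task5F } {
  requester ! result
  context.stop(self)
}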
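
For comparison, the same dependency graph can be approximated with plain scala.concurrent.Future composition. This is a sketch and not the squbs Orchestration DSL: the task bodies are hypothetical arithmetic stand-ins for real service calls, and the result is returned as a Future instead of being sent to a requester actor.

```scala
import scala.concurrent.{ExecutionContext, Future}

object PlainFutureOrchestration {
  // Hypothetical stand-ins for the five tasks; real tasks would call other services.
  def doTask1(in: Int)(implicit ec: ExecutionContext): Future[Int] = Future(in + 1)
  def doTask2(in: Int)(implicit ec: ExecutionContext): Future[Int] = Future(in * 2)
  def doTask3(a: Int, b: Int)(implicit ec: ExecutionContext): Future[Int] = Future(a + b)
  def doTask4(b: Int)(implicit ec: ExecutionContext): Future[Int] = Future(b - 1)
  def doTask5(c: Int, d: Int)(implicit ec: ExecutionContext): Future[Int] = Future(c * d)

  def orchestrate(input: Int)(implicit ec: ExecutionContext): Future[Int] = {
    // Start the independent tasks first so they run concurrently.
    val task1F = doTask1(input)
    val task2F = doTask2(input)
    // task3 needs task1 and task2; task4 needs only task2.
    val task3F = for { a <- task1F; b <- task2F; r <- doTask3(a, b) } yield r
    val task4F = task2F.flatMap(b => doTask4(b))
    // Both branches are already in flight; join them for task5.
    for { c <- task3F; d <- task4F; r <- doTask5(c, d) } yield r
  }
}
```

Creating task3F and task4F as values before the final for-comprehension matters: if they were created inside it, task4 would not start until task3 had finished.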

Orchestration DSL

High-performance asynchronous orchestration

Responsive: Respond within SLA, with or without results

Streamlined error handling

Reduced code complexity
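
"Respond within SLA, with or without results" can be sketched with the standard library alone: race the real work against a timer and complete with a fallback if the timer wins. This is an illustration under assumptions, not how squbs implements it. The object name, the `withinSla` helper, and the fallback value are hypothetical.

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.concurrent.duration.FiniteDuration

object SlaResponder {
  // Single daemon thread used only to fire timeouts.
  private val timer = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "sla-timer")
      t.setDaemon(true)
      t
    }
  })

  /** Completes with the outcome of `work` if it arrives within `sla`,
    * otherwise with `fallback`. A failure of `work` is passed through. */
  def withinSla[T](work: Future[T], sla: FiniteDuration, fallback: => T)
                  (implicit ec: ExecutionContext): Future[T] = {
    val p = Promise[T]()
    val timeout = timer.schedule(
      new Runnable { def run(): Unit = { p.trySuccess(fallback); () } },
      sla.toMillis, TimeUnit.MILLISECONDS)
    work.onComplete { result =>
      timeout.cancel(false)   // the work finished first, so drop the timer
      p.tryComplete(result)   // no effect if the fallback already won
    }
    p.future
  }
}
```

The caller always gets an answer by the deadline; whether a partial or default answer is acceptable is a decision for each service.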

More Utilities

• Http Client
• Admin Console
• Actor Registry
• Perpetual Stream
• Persistence Buffer
• …

Summary

• A large number of services has benefits, but is more difficult to manage
• Control your service topology for more determinism and lower latency
  • Rule of thumb: No more than two hops of synchronous calls from edge
• Reactive systems – ideal for services
  • Responsive & resilient
• Standardization
  • Walk like a duck, quack like a duck, and manage it like a duck
• squbs: Have the cake, and eat it too

Q&A – Feedback Appreciated

Join us on – link from https://github.com/paypal/squbs
@squbs
