Service Stampede: Surviving a Thousand Services

Anil Gursel, PayPal

Agenda

• Monoliths to Microservices
• Problems with microservices
• Solutions & Practices
• The need for standardization
• Introducing squbs

Monolith to Microservices


Congrats! Your monolith became a thousand microservices – now you’re in serious trouble!!!

Cost/Benefits of Moving to Microservices

Gains

• Independence – faster PDLC
• Freedom of choice for service implementation
• Easy evolution of service & technology
• Coexisting services across generations
• Complexity & Latency

Losses

• Homogeneity
• Consistency of implementation across services
• Timing & Determinism

Hmm. To be, or not to be… a service, that is...

Microservices Issues

Latency & Determinism

Service Boundaries: To be, or not to be a service

Scaling and rightsizing

Many failure points – need resiliency

Inconsistency – need standardization

Latency & Determinism

Latency by Deployment Topology

• Avoid too many layers of services
• Keep state close to the edge
• The more hops, the higher and less deterministic the latency is
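
The effect of extra hops can be illustrated with a small model. This is only a sketch under stated assumptions: hops are synchronous and sequential, each hop meets its latency budget independently with the same probability, and the 0.99 figure is an assumed value, not a measurement from the talk.

```scala
// Illustrative model only: the probabilities and latencies are assumed values.
object HopLatency {
  // Synchronous hops run one after another, so their latencies add up.
  def endToEndMillis(perHopMillis: Seq[Double]): Double = perHopMillis.sum

  // If each hop independently meets its latency budget with probability p,
  // the whole chain meets it only when every hop does: p to the power of hops.
  def allHopsOnTime(p: Double, hops: Int): Double = math.pow(p, hops.toDouble)

  def main(args: Array[String]): Unit = {
    for (hops <- Seq(1, 2, 5, 10))
      println(f"$hops%2d hops: ${allHopsOnTime(0.99, hops) * 100}%.1f%% of requests on time")
  }
}
```

Under these assumptions the share of on-time requests shrinks with every added hop, which is one way to read "less deterministic".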

Services Need to Scale

• Scale horizontally with increasing workload
  • More nodes, or…
  • More pods with increasing workload

• Scale vertically – why?
  • Keep the number of instances under control
  • 125 nodes @ 16 CPUs are easier to manage than 1000 nodes @ 2 CPUs
  • Less load on network and switching infrastructure
  • Potentially better utilization & cache hits
  • Stateful systems: More limited horizontal scale
  • Need critical mass for redundancy
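
The two fleets in the example above provide the same total CPU count, which a quick calculation confirms. The `Fleet` type and its field names are hypothetical, introduced only for this arithmetic.

```scala
// Hypothetical helper type, used only to check the arithmetic on the slide.
object FleetSizing {
  final case class Fleet(nodes: Int, cpusPerNode: Int) {
    def totalCpus: Int = nodes * cpusPerNode
  }

  val large = Fleet(nodes = 125, cpusPerNode = 16)
  val small = Fleet(nodes = 1000, cpusPerNode = 2)

  def main(args: Array[String]): Unit = {
    // Same compute capacity, but far fewer instances to deploy and monitor.
    println(s"${large.totalCpus} vs ${small.totalCpus} CPUs")
    println(s"${small.nodes / large.nodes}x fewer instances with the larger nodes")
  }
}
```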

Practices for Successful Microservices

Deployment Topologies

Reactive Systems

Resilience with Circuit Breakers

Asynchronous Communication

Standardization

Individual Service Deployments

(Diagram labels: Requests, Service A; Requests, Service B)

Joint Deployments

(Diagram labels: Requests, Service A, Service B, Service C)

• Deployment orchestration using Chef, etc.
• Kubernetes Pods

The Reactive Manifesto

• Responsive
• Elastic
• Resilient
• Message Driven

Why Does it Matter?

• Responsive: Respond in a deterministic, timely manner. Controls determinism
• Resilient: Stays responsive in the face of failure – even cascading failures
• Elastic: Stays responsive under workload spikes
• Message Driven: Basic building block for responsive, resilient, and elastic systems

Circuit Breaker

Keeps systems responsive under failure

Avoids cascading failures

Especially with multi-generational downstream services

Critical to keeping your 1000 services alive
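
The idea can be sketched in a few lines of plain Scala. This is a deliberately simplified, hypothetical breaker written for illustration; it is not the Akka or squbs circuit breaker API, and the class name, parameters, and injected clock are all assumptions. After `maxFailures` consecutive failures it opens and answers with the fallback right away; once `resetTimeoutMillis` has passed it lets a trial call through again.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical, simplified breaker for illustration; not the Akka or squbs API.
final class SimpleCircuitBreaker(
    maxFailures: Int,
    resetTimeoutMillis: Long,
    now: () => Long = () => System.currentTimeMillis()) {

  private var failures = 0
  private var openedAt: Option[Long] = None // Some(t) once the breaker has tripped

  // Open means: tripped, and the reset timeout has not elapsed yet.
  def isOpen: Boolean = synchronized {
    openedAt.exists(t => now() - t < resetTimeoutMillis)
  }

  // Runs `call` unless the breaker is open; when open, returns `fallback`
  // immediately without touching the failing downstream service.
  def withFallback[A](fallback: => A)(call: => A): A =
    if (isOpen) fallback
    else Try(call) match {
      case Success(value) =>
        synchronized { failures = 0; openedAt = None } // close the breaker again
        value
      case Failure(_) =>
        synchronized {
          failures += 1
          if (failures >= maxFailures) openedAt = Some(now()) // trip or re-trip
        }
        fallback
    }
}

object CircuitBreakerDemo {
  def main(args: Array[String]): Unit = {
    val breaker = new SimpleCircuitBreaker(maxFailures = 2, resetTimeoutMillis = 1000L)
    def flaky(): String = throw new RuntimeException("downstream is down")
    for (_ <- 1 to 3) println(breaker.withFallback("cached response")(flaky()))
    println(s"breaker open: ${breaker.isOpen}")
  }
}
```

A production breaker would also bound how many trial calls pass after the timeout and would treat slow calls as failures; both are left out here to keep the sketch short.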

Standardization

• Monitoring
  • Need to collect metrics, consistently

• Logging
  • Correlation across services
  • Uniformity in logs

• Security
  • Need to apply standard security configuration

• Environment Resolution
  • Staging, production, etc.

Consistency in the face of Heterogeneity

Standardized Reactive Platform for Large Scale Internet Deployments

Akka, Spray, Akka Http & Streams

Asynchronous

High Performance

Resilience & Supervision

Great Libraries for building Reactive Systems

Bootstrap and Lifecycle Management

Unicomplex: Lightweight bootstrap module

Emits lifecycle events: starting, active, stopping

Startup and shutdown hooks

Allows obtaining the current state

Listener

• Declares configuration for port binding, interfaces, security, etc

Service

• Akka Http/Spray Routes and Http Request Handler Actors
• Configured in squbs-meta.conf
• A service can be defined in a dependency artifact

Extension

• To start low level (non-actor) facilities needed for the environment

Request/Response Pipeline

Cubes: Another Deployment Topology

squbs: rhymes with cubes

Drop-in modules

Cubes can run in isolation as well as on a flat classpath

Easy to compose/decompose/refactor

Cubes share the actor system

Provide better predictability

Orchestration

(Diagram: dependency graph from Input through task1, task2, task3, task4, and task5 to Output)

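// task1 and task2 both start from the input
val task1F = doTask1(input)
val task2F = doTask2(input)
// task3 needs the results of task1 and task2
val task3F = (task1F, task2F) >> doTask3
// task4 needs only the result of task2
val task4F = task2F >> doTask4
// task5 combines the results of task3 and task4
val task5F = (task3F, task4F) >> doTask5

// When task5 completes, reply to the requester and stop this actor
for { result <- task5F } {
  requester ! result
  context.stop(self)
}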
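
For readers without squbs at hand, the same dependency graph can be approximated with plain `scala.concurrent.Future`. This sketch is not the squbs Orchestration DSL: the `>>` operator is replaced by `zip` and `flatMap`, there is no actor, and the bodies of `doTask1` to `doTask5` are hypothetical stand-ins invented for the example.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object FutureOrchestration {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-ins for the tasks named on the slide.
  def doTask1(input: String): Future[Int] = Future(input.length)
  def doTask2(input: String): Future[String] = Future(input.toUpperCase)
  def doTask3(a: Int, b: String): Future[String] = Future(s"$b:$a")
  def doTask4(b: String): Future[String] = Future(b.reverse)
  def doTask5(c: String, d: String): Future[String] = Future(s"$c/$d")

  def orchestrate(input: String): Future[String] = {
    // task1 and task2 are created up front, so they run concurrently.
    val task1F = doTask1(input)
    val task2F = doTask2(input)
    // task3 waits for both task1 and task2; task4 waits only for task2.
    val task3F = task1F.zip(task2F).flatMap { case (a, b) => doTask3(a, b) }
    val task4F = task2F.flatMap(doTask4)
    // task5 waits for task3 and task4.
    task3F.zip(task4F).flatMap { case (c, d) => doTask5(c, d) }
  }

  def main(args: Array[String]): Unit =
    println(Await.result(orchestrate("squbs"), 5.seconds))
}
```

Blocking with `Await` is done only so that the demo prints a value before exiting; inside a service the resulting future would be handed on without blocking.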

Orchestration DSL

High-performance asynchronous orchestration

Responsive: Respond within SLA, with or without results

Streamlined error handling

Reduced code complexity
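
"Respond within SLA, with or without results" can be illustrated with plain Scala futures. The sketch below is an assumption about how such behavior might look, not how squbs implements it: `withinSla`, the timer, and the fallback value are hypothetical names. Whichever comes first, the real result or the SLA timer, completes the response.

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object SlaResponse {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Daemon timer thread so the JVM can exit once the demo is done.
  private val timer = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "sla-timer")
      t.setDaemon(true)
      t
    }
  })

  // Completes with the result of `work` if it arrives within `sla`,
  // otherwise with `fallback`. A failure of `work` is passed through as is.
  def withinSla[A](work: Future[A], sla: FiniteDuration, fallback: => A): Future[A] = {
    val response = Promise[A]()
    val timeout = timer.schedule(new Runnable {
      def run(): Unit = response.trySuccess(fallback)
    }, sla.toMillis, TimeUnit.MILLISECONDS)
    work.onComplete { result =>
      timeout.cancel(false) // the timer is no longer needed
      response.tryComplete(result)
    }
    response.future
  }

  def main(args: Array[String]): Unit = {
    val slow = Future { Thread.sleep(500); "full result" }
    println(Await.result(withinSla(slow, 100.millis, "partial result"), 2.seconds))
  }
}
```

Note that the slow work keeps running after the fallback has been sent; cancelling it is a separate concern that this sketch does not address.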

More Utilities

• Http Client
• Admin Console
• Actor Registry
• Perpetual Stream
• Persistence Buffer
• …

Summary

• A large number of services has benefits, but is more difficult to manage
• Control your service topology for more determinism and lower latency
  • Rule of thumb: No more than two hops of synchronous calls from edge
• Reactive systems – ideal for services
  • Responsive & resilient
• Standardization
  • Walk like a duck, quack like a duck, and manage it like a duck
• squbs: Have the cake, and eat it too

Q&A – Feedback Appreciated

Join us on – link from https://github.com/paypal/squbs

@squbs
