Without Resilience, Nothing Else Matters

Page 1: Without Resilience, Nothing Else Matters

Without Resilience Nothing Else Matters

Jonas Bonér, CTO, Typesafe

@jboner

Page 6: Without Resilience, Nothing Else Matters

This is Fault Tolerance

“But it ain’t how hard you’re hit; it’s about how hard you can get hit, and keep moving forward. How much you can take, and keep moving forward. That’s how winning is done.”

- Rocky Balboa

Page 7: Without Resilience, Nothing Else Matters

Resilience is Beyond Fault Tolerance

Page 8: Without Resilience, Nothing Else Matters

Resilience

“The ability of a substance or object to spring back into shape. The capacity to recover quickly from difficulties.”

- Merriam-Webster

Page 9: Without Resilience, Nothing Else Matters

Antifragility

“Antifragility is beyond resilience and robustness. The resilient resists shock and stays the same; the antifragile gets better.”

- Nassim Nicholas Taleb

Antifragile: Things That Gain from Disorder - Nassim Nicholas Taleb

Page 10: Without Resilience, Nothing Else Matters

“We can model and understand in isolation. But, when released into competitive nominally regulated societies, their connections proliferate, their interactions and interdependencies multiply, their complexities mushroom. And we are caught short.”

- Sidney Dekker

Drift into Failure - Sidney Dekker

Page 11: Without Resilience, Nothing Else Matters

Software systems today are incredibly complex

Netflix, Twitter

Page 12: Without Resilience, Nothing Else Matters

We need to study Resilience in Complex Systems

Page 13: Without Resilience, Nothing Else Matters

Complicated System

Page 15: Without Resilience, Nothing Else Matters

Complex System

Page 18: Without Resilience, Nothing Else Matters

Complicated ≠ Complex

Page 19: Without Resilience, Nothing Else Matters

“Complex systems run in degraded mode.”
“Complex systems run as broken systems.”

- Richard Cook

How Complex Systems Fail - Richard Cook

Page 20: Without Resilience, Nothing Else Matters

Humans generally make things worse

Page 21: Without Resilience, Nothing Else Matters

“Counterintuitive. That’s [Jay] Forrester’s word to describe complex systems. Leverage points are not intuitive. Or if they are, we intuitively use them backward, systematically worsening whatever problems we are trying to solve.”

- Donella Meadows

Leverage Points: Places to Intervene in a System - Donella Meadows

Page 22: Without Resilience, Nothing Else Matters

Operating at the Edge of Failure

[Diagram, built up across pages 22 to 43. The Operating Point sits in a space enclosed by three boundaries: the Economic Failure Boundary, the Unacceptable Workload Boundary and the Accident Boundary. Management Pressure Towards Economic Efficiency and the Gradient Towards Least Effort push the operating point towards the Accident Boundary; a Counter Gradient For More Resilience pushes back. A Marginal Boundary is drawn inside the Accident Boundary, and the space between the two is the Error Margin. Crossing the Accident Boundary is marked FAILURE. The later pages show only the Accident Boundary and the Marginal Boundary; page 39 shows only the Marginal Boundary and a "?".]

‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen
Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013

Page 44: Without Resilience, Nothing Else Matters

Embrace Failure

Page 45: Without Resilience, Nothing Else Matters

Resilience is by Design

Photo courtesy of FEMA/Joselyne Augustino

Page 46: Without Resilience, Nothing Else Matters

Resilience in Biological Systems

Page 47: Without Resilience, Nothing Else Matters

Meerkats

Puppies! Now that I’ve got your attention, complexity theory - Nicolas Perony, TED talk

Page 48: Without Resilience, Nothing Else Matters

“In three words, in the animal kingdom, simplicity leads to complexity which leads to resilience.”

- Nicolas Perony

Puppies! Now that I’ve got your attention, complexity theory - Nicolas Perony, TED talk

Page 49: Without Resilience, Nothing Else Matters

Resilience in Social Systems

Page 50: Without Resilience, Nothing Else Matters

Dealing in Security
Understanding vital services, and how they keep you safe

• 1 INDIVIDUAL
• 6 ways to die
• 3 sets of essential services
• 7 layers of PROTECTION

Dealing in Security - Mike Bennet, Vinay Gupta

Page 51: Without Resilience, Nothing Else Matters

What we can learn from Resilience in Biological and Social Systems

1. Feature diversity and redundancy
2. Interconnected network structure
3. Wide distribution across all scales
4. Capacity to self-adapt & self-organize

Toward Resilient Architectures 1: Biology Lessons - Michael Mehaffy, Nikos A. Salingaros

Applying resilience thinking: Seven principles for building resilience in social-ecological systems - Reinette Biggs et al.

Page 52: Without Resilience, Nothing Else Matters

Resilience in Computer Systems

Page 54: Without Resilience, Nothing Else Matters

We need to Manage Failure

Page 55: Without Resilience, Nothing Else Matters

“Post-accident attribution to a ‘root cause’ is fundamentally wrong: Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident.”

- Richard Cook

How Complex Systems Fail - Richard Cook

Page 56: Without Resilience, Nothing Else Matters

There is no Root Cause

Page 57: Without Resilience, Nothing Else Matters

Let it Crash

Page 58: Without Resilience, Nothing Else Matters

Crash Only Software

Crash-Only Software - George Candea, Armando Fox

Stop = Crash Safely
Start = Recover Fast

Page 59: Without Resilience, Nothing Else Matters

“To make a system of interconnected components crash-only, it must be designed so that components can tolerate the crashes and temporary unavailability of their peers. This means we require: [1] strong modularity with relatively impermeable component boundaries, [2] timeout-based communication and lease-based resource allocation, and [3] self-describing requests that carry a time-to-live and information on whether they are idempotent.”

- George Candea, Armando Fox

Crash-Only Software - George Candea, Armando Fox
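The third requirement in the quote can be made concrete with a small sketch. This is only an illustration in plain Scala, not code from the paper or from the talk: the `Request` envelope, its field names and the `decide` function are all hypothetical. It shows a request that carries its own time-to-live and an idempotency flag, so that a component that has just restarted can decide what to do with it without any extra context.

```scala
object CrashOnlyRequests {
  // Hypothetical request envelope: it carries its own time-to-live and
  // says whether it can safely be executed more than once.
  final case class Request(
    id: String,
    payload: String,
    ttlMillis: Long,
    idempotent: Boolean,
    createdAtMillis: Long
  )

  sealed trait Decision
  case object Process extends Decision
  case object DropExpired extends Decision
  case object RejectUnsafeRetry extends Decision

  // Decide what a freshly restarted component should do with a request.
  // `isRetry` marks a request that may already have been executed once.
  def decide(req: Request, nowMillis: Long, isRetry: Boolean): Decision =
    if (nowMillis - req.createdAtMillis > req.ttlMillis) DropExpired
    else if (isRetry && !req.idempotent) RejectUnsafeRetry
    else Process
}
```

Because the request describes itself, the receiver needs no shared session state to make this decision, which is what lets its peers crash and restart freely.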

Page 60: Without Resilience, Nothing Else Matters

Recursive Restartability
Turning the Crash-Only Sledgehammer into a Scalpel

Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - George Candea, Armando Fox

Page 61: Without Resilience, Nothing Else Matters

“Software components should be designed such that they can deny service for any request or call. Then, if an underlying component can say No, apps must be designed to take No for an answer and decide how to proceed: give up, wait and retry, reduce fidelity, etc.”

- George Candea, Armando Fox

Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - George Candea, Armando Fox

Page 62: Without Resilience, Nothing Else Matters

Services need to learn to accept NO for an answer
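A minimal sketch of what taking No for an answer could look like in plain Scala. The `Reply` type and `callWithFallback` are hypothetical names invented for this illustration. The caller retries a bounded number of times and then reduces fidelity by returning a fallback value; a real caller would normally also wait between attempts, which is left out here to keep the sketch short.

```scala
object TakeNoForAnAnswer {
  sealed trait Reply
  final case class Ok(value: String) extends Reply
  case object Denied extends Reply

  // Call `service` up to `maxAttempts` times. If it keeps saying No,
  // return a reduced-fidelity `fallback` instead of failing the caller.
  def callWithFallback(service: () => Reply, maxAttempts: Int, fallback: String): String = {
    @scala.annotation.tailrec
    def loop(attempt: Int): String =
      if (attempt > maxAttempts) fallback
      else service() match {
        case Ok(value) => value
        case Denied    => loop(attempt + 1)
      }
    loop(1)
  }
}
```

The fallback could be a cached or default value; giving up entirely is the special case where the caller propagates its own denial upstream.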

Page 63: Without Resilience, Nothing Else Matters

“The explosive growth of software has added greatly to systems’ interactive complexity. With software, the possible states that a system can end up in become mind-boggling.”

- Sidney Dekker

Drift into Failure - Sidney Dekker

Page 64: Without Resilience, Nothing Else Matters

We need a way out of the State Tar Pit

• Input Data
• Derived Data

Critical

Out of the Tar Pit - Ben Moseley, Peter Marks

Page 66: Without Resilience, Nothing Else Matters

Traditional State Management

[Diagram, built up across pages 66 to 72. Labels: Client, Object, Critical state that needs protection, Thread boundary, Synchronous dispatch. The build ends with a "?" and the verdict below.]

Utterly broken

Page 73: Without Resilience, Nothing Else Matters

“Accidents come from relationships not broken parts.”

- Sidney Dekker

Drift into Failure - Sidney Dekker

Page 74: Without Resilience, Nothing Else Matters

Requirements for a Sane Failure Model

Failures need to be

1. Contained: avoid cascading failures
2. Reified: as messages
3. Signalled: asynchronously
4. Observed: by 1-N
5. Managed: outside the failed context

Page 75: Without Resilience, Nothing Else Matters

Bulkhead Pattern
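The slides show the Bulkhead Pattern only as pictures. One common way to apply the idea in code, sketched here under the assumption that each downstream dependency gets its own fixed budget of concurrent calls, is a semaphore per dependency. The `Bulkhead` class is a hypothetical helper written for this illustration, not an Akka API.

```scala
import java.util.concurrent.Semaphore

object Bulkheads {
  // One bulkhead per downstream dependency: a fixed number of permits
  // caps how many calls may be in flight to that dependency at once.
  final class Bulkhead(maxConcurrent: Int) {
    private val permits = new Semaphore(maxConcurrent)

    // Runs `call` if a permit is free; otherwise rejects immediately
    // with None, so a slow dependency cannot absorb every caller.
    def run[A](call: => A): Option[A] =
      if (permits.tryAcquire()) {
        try Some(call)
        finally permits.release()
      } else None
  }
}
```

Rejecting instead of queueing keeps the failure contained: callers of a saturated compartment get a fast No, while other compartments keep their full capacity.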

Page 78: Without Resilience, Nothing Else Matters

Enter Supervision

Page 80: Without Resilience, Nothing Else Matters

The Vending Machine Pattern

Page 81: Without Resilience, Nothing Else Matters

Think Vending Machine

[Diagram sequence, built up across pages 81 to 100]

User error (pages 81 to 84):
• Programmer inserts coins into the Coffee Machine
• Coffee Machine replies: add more coins
• Programmer adds more coins and gets coffee

Failure sent to the user (pages 85 to 88):
• Programmer inserts coins
• Coffee Machine replies to the Programmer with an out of coffee beans error
• WRONG

Failure sent to a supervisor (pages 89 to 93):
• Programmer inserts coins
• Coffee Machine signals an out of coffee beans failure to the Service Guy
• Service Guy adds more beans
• Programmer gets coffee

The same pattern in software (pages 94 to 100):
• Client sends a Request to the Service
• Service replies to the Client with a Response or a Validation Error
• Service signals an Application Failure to the Supervisor
• Supervisor manages the failure
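The split between errors and failures can be sketched without any actor library. Everything below is a hypothetical illustration in plain Scala: the `Supervisor` mailbox is a simple in-memory queue standing in for asynchronous delivery, and the names are invented for this example. The point is the routing: validation errors go back to the client, while application failures are reified as messages and handed to someone else.

```scala
import scala.collection.mutable

object VendingMachineSketch {
  sealed trait Reply
  final case class Response(body: String) extends Reply
  final case class ValidationError(message: String) extends Reply

  // A failure reified as a plain message for the supervisor.
  final case class ApplicationFailure(reason: String, pendingRequest: String)

  // Stand-in for a supervisor's mailbox (hypothetical; a real actor
  // runtime would deliver this asynchronously).
  final class Supervisor {
    val mailbox: mutable.Queue[ApplicationFailure] = mutable.Queue.empty
    def notifyFailure(f: ApplicationFailure): Unit = mailbox.enqueue(f)
  }

  final class Service(supervisor: Supervisor, var healthy: Boolean) {
    // The client only ever sees a Response or a ValidationError.
    // Failures go to the supervisor; the client gets no reply for now.
    def handle(request: String): Option[Reply] =
      if (request.isEmpty) Some(ValidationError("request must not be empty"))
      else if (!healthy) {
        supervisor.notifyFailure(ApplicationFailure("out of coffee beans", request))
        None
      } else Some(Response(s"done: $request"))
  }
}
```

The failure message carries the pending request, so whoever repairs the service can resubmit it afterwards, which is what the Akka demo later in the deck does in `postRestart`.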

Page 101: Without Resilience, Nothing Else Matters

We need a way out of the State Tar Pit

• Essential State
• Essential Logic
• Accidental State and Control

Out of the Tar Pit - Ben Moseley, Peter Marks

Page 106: Without Resilience, Nothing Else Matters

Error Kernel Pattern
Onion-layered state & Failure management

Making reliable distributed systems in the presence of software errors - Joe Armstrong
On Erlang, State and Crashes - Jesper Louis Andersen

Page 107: Without Resilience, Nothing Else Matters

Onion Layered State Management

[Diagram, built up across pages 107 to 119. Labels: Client, Object, Critical state that needs protection, Thread boundary, Error Kernel, and up to three layers marked Supervision.]

Page 120: Without Resilience, Nothing Else Matters

Demo Time

Let’s model a resilient vending machine, in Akka

Page 121: Without Resilience, Nothing Else Matters

Demo Runner

import akka.actor.{ActorSystem, Inbox, Props}

object VendingMachineDemo extends App {
  val system = ActorSystem("vendingMachineDemo")
  val coffeeMachine = system.actorOf(Props[CoffeeMachineManager], "coffeeMachineManager")
  val customer = Inbox.create(system) // emulates the customer

  … // test runs

  system.shutdown()
}

https://gist.github.com/jboner/d24c0eb91417a5ec10a6

Page 122: Without Resilience, Nothing Else Matters

Test Happy Path

// Insert 2 coins and get an Espresso
customer.send(coffeeMachine, Coins(2))
customer.send(coffeeMachine, Selection(Espresso))
val Beverage(coffee1) = customer.receive(5.seconds)
println(s"Got myself an $coffee1")
assert(coffee1 == Espresso)

https://gist.github.com/jboner/d24c0eb91417a5ec10a6

Page 123: Without Resilience, Nothing Else Matters

Test User Error

customer.send(coffeeMachine, Coins(1))
customer.send(coffeeMachine, Selection(Latte))
val NotEnoughCoinsError(message) = customer.receive(5.seconds)
println(s"Got myself a validation error: $message")
assert(message == "Please insert [1] coins")

https://gist.github.com/jboner/d24c0eb91417a5ec10a6

Page 124: Without Resilience, Nothing Else Matters

Test System Failure

// Insert 1 coin (had 1 before) and try to get my Latte
// Machine should:
//   1. Fail
//   2. Restart
//   3. Resubmit my order
//   4. Give me my coffee
customer.send(coffeeMachine, Coins(1))
customer.send(coffeeMachine, TriggerOutOfCoffeeBeansFailure)
customer.send(coffeeMachine, Selection(Latte))
val Beverage(coffee2) = customer.receive(5.seconds)
println(s"Got myself a $coffee2")
assert(coffee2 == Latte)

https://gist.github.com/jboner/d24c0eb91417a5ec10a6

Page 125: Without Resilience, Nothing Else Matters

Protocol

// Coffee types
trait CoffeeType
case object BlackCoffee extends CoffeeType
case object Latte extends CoffeeType
case object Espresso extends CoffeeType

// Commands
case class Coins(number: Int)
case class Selection(coffee: CoffeeType)
case object TriggerOutOfCoffeeBeansFailure

// Events
case class CoinsReceived(number: Int)

// Replies
case class Beverage(coffee: CoffeeType)

// Errors
case class NotEnoughCoinsError(message: String)

// Failures
case class OutOfCoffeeBeansFailure(customer: ActorRef, pendingOrder: Selection, nrOfInsertedCoins: Int) extends Exception

https://gist.github.com/jboner/d24c0eb91417a5ec10a6

Page 126: Without Resilience, Nothing Else Matters

CoffeeMachine

class CoffeeMachine extends Actor {
  val price = 2
  var nrOfInsertedCoins = 0
  var outOfCoffeeBeans = false
  var totalNrOfCoins = 0

  def receive = { … }

  override def postRestart(failure: Throwable): Unit = { … }
}

https://gist.github.com/jboner/d24c0eb91417a5ec10a6

Page 127: Without Resilience, Nothing Else Matters

CoffeeMachine

def receive = {
  case Coins(nr) =>
    nrOfInsertedCoins += nr
    totalNrOfCoins += nr
    println(s"Inserted [$nr] coins")
    println(s"Total number of coins in machine is [$totalNrOfCoins]")

  case selection @ Selection(coffeeType) =>
    if (nrOfInsertedCoins < price)
      // User error: reply to the customer with a validation error
      sender.tell(NotEnoughCoinsError(
        s"Please insert [${price - nrOfInsertedCoins}] coins"), self)
    else {
      // System failure: thrown to the supervisor, not sent to the customer
      if (outOfCoffeeBeans)
        throw new OutOfCoffeeBeansFailure(sender, selection, nrOfInsertedCoins)
      println(s"Brewing your $coffeeType")
      sender.tell(Beverage(coffeeType), self)
      nrOfInsertedCoins = 0
    }

  case TriggerOutOfCoffeeBeansFailure =>
    outOfCoffeeBeans = true
}

https://gist.github.com/jboner/d24c0eb91417a5ec10a6

Page 128: Without Resilience, Nothing Else Matters

CoffeeMachine

override def postRestart(failure: Throwable): Unit = {
  println(s"Restarting coffee machine...")
  failure match {
    case OutOfCoffeeBeansFailure(customer, pendingOrder, coins) =>
      // Restore the state carried by the failure, then replay the order
      nrOfInsertedCoins = coins
      outOfCoffeeBeans = false
      println(s"Resubmitting pending order $pendingOrder")
      context.self.tell(pendingOrder, customer)
    case _ =>
      // Any other cause: nothing to restore (avoids a MatchError)
  }
}

https://gist.github.com/jboner/d24c0eb91417a5ec10a6

Page 129: Without Resilience, Nothing Else Matters

Supervisor

import akka.actor.{Actor, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy.{Escalate, Restart}
import scala.concurrent.duration._

class CoffeeMachineManager extends Actor {
  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case e: OutOfCoffeeBeansFailure =>
        println(s"ServiceGuy notified: $e")
        Restart
      case _: Exception =>
        Escalate
    }

  // to simplify things he is only managing 1 single machine
  val machine = context.actorOf(Props[CoffeeMachine], name = "coffeeMachine")

  def receive = {
    case request => machine.forward(request)
  }
}

https://gist.github.com/jboner/d24c0eb91417a5ec10a6

Page 130: Without Resilience, Nothing Else Matters

So... Are We Done?

Sorry... but Not really

Page 132: Without Resilience, Nothing Else Matters

We cannot keep putting all our eggs in the same basket

Page 133: Without Resilience, Nothing Else Matters

We need to Maintain Diversity and Redundancy

Page 135: Without Resilience, Nothing Else Matters

akka {
  actor {
    deployment {
      /coffeeMachine {
        router = round-robin-pool
        resizer {
          lower-bound = 12
          upper-bound = 24
        }
      }
    }
    provider = akka.cluster.ClusterActorRefProvider
  }
  cluster {
    seed-nodes = [
      akka.tcp://[email protected]:2551
      akka.tcp://[email protected]:2552
    ]
  }
}

Akka Cluster

Page 136: Without Resilience, Nothing Else Matters

Akka Cluster
Akka Distributed Data

[Diagram, built up across pages 136 to 141: a cluster of Member Nodes, starting from a single node and growing to ten.]

• Gossip of membership, data & metadata
• Failure detection heartbeat
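To make the heartbeat idea concrete, here is a deliberately simple sketch in plain Scala. It is an assumption made for illustration and not how Akka Cluster's failure detector is implemented: it just records the last time each node was heard from and suspects any node that has been silent for longer than a fixed timeout. The `Detector` class and its method names are hypothetical.

```scala
object HeartbeatDetector {
  // Fixed-timeout detector, kept simple on purpose (illustrative only).
  final class Detector(timeoutMillis: Long) {
    private var lastSeen = Map.empty[String, Long]

    // Record that `node` was heard from at `nowMillis`.
    def heartbeat(node: String, nowMillis: Long): Unit =
      lastSeen = lastSeen.updated(node, nowMillis)

    // Nodes whose last heartbeat is older than the timeout are suspected.
    def suspected(nowMillis: Long): Set[String] =
      lastSeen.collect {
        case (node, seen) if nowMillis - seen > timeoutMillis => node
      }.toSet
  }
}
```

A suspected node is not necessarily dead; a later heartbeat clears the suspicion, which is why the result is recomputed from the latest timestamps on every call.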

Page 142: Without Resilience, Nothing Else Matters

We need to decompose the system using Consistency Boundaries

Page 143: Without Resilience, Nothing Else Matters

WITHIN the Consistency Boundary we can have STRONG CONSISTENCY

Page 144: Without Resilience, Nothing Else Matters

BETWEEN the Consistency Boundaries it is a ZOO

Page 145: Without Resilience, Nothing Else Matters

The Network is Reliable

NAT Really

Page 147: Without Resilience, Nothing Else Matters

Here, we are living in the Looming Shadow of Impossibility Theorems

• CAP: Consistency is impossible
• FLP: Consensus is impossible

Page 150: Without Resilience, Nothing Else Matters

STRONG Consistency is the wrong default

Page 151: Without Resilience, Nothing Else Matters

We need Systems that are Decoupled in Time and Space

Page 155: Without Resilience, Nothing Else Matters

Resilient Protocols

Depend on
1. Asynchronous Communication
2. Eventual Consistency

• Are tolerant to
  • Message loss
  • Message reordering
  • Message duplication
• Embrace ACID 2.0
  • Associative
  • Commutative
  • Idempotent
  • Distributed
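The ACID 2.0 properties can be illustrated with a grow-only counter, a simple conflict-free replicated data type. This is a plain-Scala sketch under stated assumptions, not the Akka Distributed Data API; the name `GCounter` and the node identifiers are chosen here for illustration. Because merging takes the per-node maximum, replicas can exchange state in any order, any grouping, and any number of times and still converge, which is what makes reordered or duplicated messages harmless.

```scala
// Grow-only counter (G-Counter): one slot per node, merged by taking the per-node maximum.
final case class GCounter(counts: Map[String, Long] = Map.empty) {
  // A node only ever increments its own slot.
  def increment(node: String, by: Long = 1L): GCounter =
    copy(counts = counts.updated(node, counts.getOrElse(node, 0L) + by))

  // Merge is associative, commutative and idempotent because `max` is.
  def merge(that: GCounter): GCounter =
    GCounter((counts.keySet ++ that.counts.keySet).map { node =>
      node -> math.max(counts.getOrElse(node, 0L), that.counts.getOrElse(node, 0L))
    }.toMap)

  // The counter's value is the sum over all nodes.
  def value: Long = counts.values.sum
}
```

Merging the same replica state twice leaves the value unchanged, so a duplicated message does no harm.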

Page 156: Without Resilience, Nothing Else Matters

Let’s model a resilient & Persistent vending machine, in Akka

Demo Time

Page 157: Without Resilience, Nothing Else Matters

Persistent CoffeeMachine

import akka.persistence.PersistentActor

// Events
case class CoinsReceived(number: Int)

class CoffeeMachine extends PersistentActor {
  val price = 2
  var nrOfInsertedCoins = 0
  var outOfCoffeeBeans = false
  var totalNrOfCoins = 0

  // Stable identifier under which this actor's events are journaled
  override def persistenceId = "CoffeeMachine"

  override def receiveCommand: Receive = {
    case Coins(nr) =>
      nrOfInsertedCoins += nr
      println(s"Inserted [$nr] coins")
      // Update the durable total only after the event has been persisted
      persist(CoinsReceived(nr)) { evt =>
        totalNrOfCoins += nr
        println(s"Total number of coins in machine is [$totalNrOfCoins]")
      }
    …
  }

  // On recovery, replay the persisted events to rebuild the total
  override def receiveRecover: Receive = {
    case CoinsReceived(coins) =>
      totalNrOfCoins += coins
      println(s"Total number of coins in machine is [$totalNrOfCoins]")
  }
}

https://gist.github.com/jboner/1db37eeee3ed3c9422e4
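The recovery idea behind the persistent machine can be sketched without Akka. The following is a minimal plain-Scala illustration, assuming a hypothetical `InMemoryJournal` in place of a durable event store and a hypothetical `SimpleCoffeeMachine` in place of the actor: state is never saved directly, only events are, and a restarted instance rebuilds its state by replaying them.

```scala
// Events
final case class CoinsReceived(number: Int)

// Hypothetical in-memory journal standing in for a durable event store.
final class InMemoryJournal {
  private var events = Vector.empty[CoinsReceived]
  def append(event: CoinsReceived): Unit = events :+= event
  def replay: Vector[CoinsReceived] = events
}

// Hypothetical plain-Scala machine; its state is derived only from journaled events.
final class SimpleCoffeeMachine(journal: InMemoryJournal) {
  // Recovery: sum over every stored event to rebuild the total.
  private var total: Int = journal.replay.map(_.number).sum

  def totalNrOfCoins: Int = total

  // Command handling: store the event first, then apply it to the state.
  def insertCoins(nr: Int): Unit = {
    journal.append(CoinsReceived(nr))
    total += nr
  }
}
```

Creating a second `SimpleCoffeeMachine` over the same journal simulates a crash and restart: the new instance starts with the same total as the old one.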

Page 158: Without Resilience, Nothing Else Matters

“An escalator can never break: it can only become stairs. You should never see an Escalator Temporarily Out Of Order sign, just Escalator Temporarily Stairs. Sorry for the convenience.”
- Mitch Hedberg

Page 159: Without Resilience, Nothing Else Matters

Graceful Degradation

Page 161: Without Resilience, Nothing Else Matters

Always Rely on Asynchronous Communication

…And if not possible, Always use Timeouts
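A bounded wait can be sketched with the Scala standard library alone. This is one possible shape, assuming a hypothetical helper `BoundedCall.withTimeout`; it uses `Await.result`, which throws a `TimeoutException` when the deadline passes. Note that the slow computation keeps running in the background; the caller simply stops waiting for it.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BoundedCall {
  // Wait at most `timeout` for the result; a slow call yields Left instead of blocking forever.
  def withTimeout[A](timeout: FiniteDuration)(call: => A): Either[String, A] =
    try Right(Await.result(Future(call), timeout))
    catch { case _: TimeoutException => Left(s"no reply within $timeout") }
}
```

For example, `BoundedCall.withTimeout(50.millis) { Thread.sleep(1000); 1 }` gives up after 50 milliseconds and returns a `Left`.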

Page 162: Without Resilience, Nothing Else Matters

Circuit Breaker
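The pattern can be sketched as a small state machine. The following is a single-threaded illustration in plain Scala, not the Akka or Hystrix API; the class name `SimpleCircuitBreaker`, its parameters, and the injectable clock `now` are assumptions made for this example. After `maxFailures` consecutive failures the breaker opens and fails fast; once the reset timeout has elapsed it lets one trial call through, closing on success and reopening on failure.

```scala
import scala.util.{Failure, Success, Try}

// Single-threaded sketch; `now` is an injectable clock in milliseconds.
final class SimpleCircuitBreaker(maxFailures: Int, resetTimeoutMillis: Long, now: () => Long) {
  private var failures = 0
  private var openedAt: Option[Long] = None

  // Open means: failing fast until the reset timeout has elapsed.
  def isOpen: Boolean = openedAt.exists(t => now() - t < resetTimeoutMillis)

  def call[A](body: => A): Try[A] =
    if (isOpen) Failure(new IllegalStateException("circuit breaker is open"))
    else Try(body) match {
      case ok @ Success(_) =>
        // A success closes the breaker and clears the failure count.
        failures = 0
        openedAt = None
        ok
      case err @ Failure(_) =>
        failures += 1
        // Trip (or re-trip after a failed trial call) once the threshold is reached.
        if (failures >= maxFailures) openedAt = Some(now())
        err
    }
}
```

While the breaker is open, the protected body is not executed at all, which gives the failing dependency time to recover.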

Page 164: Without Resilience, Nothing Else Matters

Little’s Law

L = λW

Queue Length = Arrival Rate * Response Time

W: Response Time

L: Queue Length

λ: Arrival Rate

Page 165: Without Resilience, Nothing Else Matters

Little’s Law

W = L/λ

Response Time = Queue Length / Arrival Rate

W: Response Time

L: Queue Length
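As a worked example, both forms of Little's Law can be written as two small functions. The names and the numbers are illustrative assumptions: with an arrival rate of 100 requests per second and a response time of 0.5 seconds, on average 50 requests are in the system.

```scala
object LittlesLaw {
  // L = λW: average number of requests in the system.
  def queueLength(arrivalRatePerSec: Double, responseTimeSec: Double): Double =
    arrivalRatePerSec * responseTimeSec

  // W = L/λ: average time a request spends in the system.
  def responseTime(queueLength: Double, arrivalRatePerSec: Double): Double =
    queueLength / arrivalRatePerSec
}
```

Read the second form as a warning: if the queue grows while the arrival rate stays constant, response time grows with it.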

Page 167: Without Resilience, Nothing Else Matters

Flow Control

Always Apply Back Pressure
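One simple way to apply back pressure is a bounded buffer that refuses new work when it is full, so the producer learns immediately that the consumer has fallen behind. The sketch below is an assumption-laden illustration, not the Reactive Streams protocol (which signals demand upstream); the class name `BoundedMailbox` is hypothetical, and it wraps the JDK's `ArrayBlockingQueue`.

```scala
import java.util.concurrent.ArrayBlockingQueue

// Bounded mailbox: producers are told "no" when the consumer falls behind.
final class BoundedMailbox[A](capacity: Int) {
  private val queue = new ArrayBlockingQueue[A](capacity)

  // Non-blocking: returns false instead of buffering without limit.
  def offer(message: A): Boolean = queue.offer(message)

  // Returns the next message, or None when the mailbox is empty.
  def poll(): Option[A] = Option(queue.poll())

  def size: Int = queue.size
}
```

A producer that receives `false` from `offer` can slow down, retry later, or shed load, instead of letting the queue and the response time grow without bound.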

Page 168: Without Resilience, Nothing Else Matters

Feedback Control

Page 169: Without Resilience, Nothing Else Matters

“Humans should not be involved in setting timeouts.” “Human involvement in complex systems is the biggest source of trouble.”
- Ben Christensen, Netflix

Page 170: Without Resilience, Nothing Else Matters

The Feedback Principle

“Continuously compare the actual output to its desired reference value; then apply a change to the system inputs that counteracts any deviation of the actual output from the reference.”
- Philipp K. Janert

Feedback Control for Computer Systems - Philipp K. Janert
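The feedback principle can be sketched as a loop that repeatedly measures the output and nudges the input in proportion to the error. This is a minimal illustration in plain Scala and is not taken from the book; the object name `FeedbackLoop`, the `gain` parameter, and the toy system used in the example are assumptions.

```scala
object FeedbackLoop {
  // One control step: move the input in proportion to the error (reference - actual output).
  def step(input: Double, output: Double, reference: Double, gain: Double): Double =
    input + gain * (reference - output)

  // Run the loop for `steps` iterations against a `system` that maps input to output.
  def run(system: Double => Double, reference: Double, gain: Double, steps: Int): Double =
    (1 to steps).foldLeft(0.0) { (input, _) =>
      step(input, system(input), reference, gain)
    }
}
```

For a hypothetical system whose output is ten times its input, `FeedbackLoop.run(x => 10 * x, reference = 200.0, gain = 0.05, steps = 50)` halves the error on every step and settles at an input close to 20. A gain that is too large for the system makes the loop overshoot or oscillate, which is why the controller has to be tuned to the system it drives.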

Page 175: Without Resilience, Nothing Else Matters

Influencing a Complex System

Page 176: Without Resilience, Nothing Else Matters

Places to Intervene in a Complex System

1. The constants, parameters or numbers
2. The sizes of buffers relative to their flows
3. The structure of material stocks and flows
4. The lengths of delays, relative to the rate of system change
5. The strength of negative feedback loops
6. The gain around driving positive feedback loops
7. The structure of information flows
8. The rules of the system
9. The power to add, change, evolve, or self-organize structure
10. The goals of the system
11. The mindset or paradigm out of which the system arises
12. The power to transcend paradigms

Leverage Points: Places to Intervene in a System - Donella Meadows

Page 177: Without Resilience, Nothing Else Matters

Triple Loop Learning

Loop 1: Follow the rules
Loop 2: Change the rules
Loop 3: Learn how to learn

Triple Loop Learning - Chris Argyris

Page 178: Without Resilience, Nothing Else Matters

Testing

Page 181: Without Resilience, Nothing Else Matters

What can we learn from Arnold?

Blow things up

Page 182: Without Resilience, Nothing Else Matters

Shoot Your App Down

Page 183: Without Resilience, Nothing Else Matters

Pull the Plug… and see what happens

Page 185: Without Resilience, Nothing Else Matters

Executive Summary

Page 186: Without Resilience, Nothing Else Matters

“Complex systems run as broken systems.”
- Richard Cook

How Complex Systems Fail - Richard Cook

Page 187: Without Resilience, Nothing Else Matters

Resilience is by Design

Photo courtesy of FEMA/Joselyne Augustino

Page 188: Without Resilience, Nothing Else Matters

Without Resilience Nothing Else Matters

Page 189: Without Resilience, Nothing Else Matters

Thank You

Page 190: Without Resilience, Nothing Else Matters

References

Drift into Failure - http://www.amazon.com/Drift-into-Failure-Components-Understanding-ebook/dp/B009KOKXKY
How Complex Systems Fail - http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf
Antifragile: Things That Gain from Disorder - http://www.amazon.com/Antifragile-Things-that-Gain-Disorder-ebook/dp/B009K6DKTS
Leverage Points: Places to Intervene in a System - http://www.donellameadows.org/archives/leverage-points-places-to-intervene-in-a-system/
Going Solid: A Model of System Dynamics and Consequences for Patient Safety - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1743994/
Resilience in Complex Adaptive Systems: Operating at the Edge of Failure - https://www.youtube.com/watch?v=PGLYEDpNu60
Puppies! Now that I’ve got your attention, Complexity Theory - https://www.ted.com/talks/nicolas_perony_puppies_now_that_i_ve_got_your_attention_complexity_theory
How Bacteria Becomes Resistant - http://www.abc.net.au/science/slab/antibiotics/resistance.htm
Towards Resilient Architectures: Biology Lessons - http://www.metropolismag.com/Point-of-View/March-2013/Toward-Resilient-Architectures-1-Biology-Lessons/
Dealing in Security - http://resiliencemaps.org/files/Dealing_in_Security.July2010.en.pdf
What is resilience? An introduction to social-ecological research - http://www.stockholmresilience.org/download/18.10119fc11455d3c557d6d21/1398172490555/SU_SRC_whatisresilience_sidaApril2014.pdf
Applying resilience thinking: Seven principles for building resilience in social-ecological systems - http://www.stockholmresilience.org/download/18.10119fc11455d3c557d6928/1398150799790/SRC+Applying+Resilience+final.pdf
Crash-Only Software - https://www.usenix.org/legacy/events/hotos03/tech/full_papers/candea/candea.pdf
Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - http://roc.cs.berkeley.edu/papers/recursive_restartability.pdf
Out of the Tar Pit - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.93.8928
Bulkhead Pattern - http://skife.org/architecture/fault-tolerance/2009/12/31/bulkheads.html
Making Reliable Distributed Systems in the Presence of Software Errors - http://www.erlang.org/download/armstrong_thesis_2003.pdf
On Erlang, State and Crashes - http://jlouisramblings.blogspot.be/2010/11/on-erlang-state-and-crashes.html
Akka Supervision - http://doc.akka.io/docs/akka/snapshot/general/supervision.html
Release It!: Design and Deploy Production-Ready Software - https://pragprog.com/book/mnee/release-it
Feedback Control for Computer Systems - http://www.amazon.com/Feedback-Control-Computer-Systems-Philipp/dp/1449361692
Hystrix - https://github.com/Netflix/Hystrix
Akka Circuit Breaker - http://doc.akka.io/docs/akka/snapshot/common/circuitbreaker.html
Reactive Streams - http://reactive-streams.org
Simian Army - https://github.com/Netflix/SimianArmy
Gatling - http://gatling.io
Akka MultiNode Testing - http://doc.akka.io/docs/akka/snapshot/dev/multi-node-testing.html
Vending Machine Akka Supervision Demo - https://gist.github.com/jboner/d24c0eb91417a5ec10a6
Persistent Vending Machine Akka Supervision Demo - https://gist.github.com/jboner/1db37eeee3ed3c9422e4

Page 191: Without Resilience, Nothing Else Matters

Q & A
