Fault tolerant microservices BSkyB @chbatey

Fault tolerant microservices - LJC Skills Matter 4thNov2014




Who is this guy?

● Enthusiastic nerd
● Senior software engineer at BSkyB
● Builds a lot of distributed applications
● Apache Cassandra MVP


Agenda

1. Setting the scene
○ What do we mean by a fault?
○ What is a microservice?
○ Monolith application vs the micro(ish) service

2. A worked example
○ Identify an issue
○ Reproduce/test it
○ Show how to deal with the issue


So… what do applications look like?


So... what do systems look like now?


But different things go wrong...

[Diagram labels: down, slow network, slow app, missing packets, GC :(, 2 second max]


Fault tolerance

1. Don’t take forever - Timeouts
2. Don’t try if you can’t succeed
3. Fail gracefully
4. Know if it’s your fault
5. Don’t whack a dead horse
6. Turn broken stuff off


Time for an example...

● All examples are on GitHub
● Technologies used:
○ Dropwizard
○ Spring Boot
○ Wiremock
○ Hystrix
○ Graphite
○ Saboteur


Example: Movie player service

[Diagram: a Play Movie request goes to Shiny App, which depends on UserService, DeviceService and PinService]


Testing microservices

You don’t know a service is fault tolerant if you don’t test faults


Isolated service tests

[Diagram: the acceptance test primes mocks of the User, Device and Pin services, then sends a Play Movie request to Shiny App]


1 - Don’t take forever

● If at first you don’t succeed, don’t take forever to tell someone

● Timeout and fail fast


Which timeouts?

● Socket connection timeout
● Socket read timeout


Your service hung for 30 seconds :(

Customer

You :(


Which timeouts?

● Socket connection timeout
● Socket read timeout
● Resource acquisition
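
The two socket-level timeouts can be set with nothing but the JDK. The sketch below is illustrative only: the class name `SocketTimeouts`, the helper methods and the 500 ms values are assumptions, not taken from the talk's examples. The demo method points the client at a local server that accepts connections but never answers, which is the case the read timeout exists for.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;

final class SocketTimeouts {

    /** Issues a GET with explicit connect and read timeouts and returns the HTTP status code. */
    static int get(String url, int connectTimeoutMillis, int readTimeoutMillis) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) URI.create(url).toURL().openConnection();
        connection.setConnectTimeout(connectTimeoutMillis); // limit on establishing the TCP connection
        connection.setReadTimeout(readTimeoutMillis);       // limit on each blocking read, not on the whole call
        try {
            return connection.getResponseCode();
        } finally {
            connection.disconnect();
        }
    }

    /** Returns true if a server that never responds makes the client hit its read timeout. */
    static boolean timesOutAgainstSilentServer() {
        // The server socket is never accept()ed: the OS still completes the TCP
        // handshake, so the client connects fine and then waits for a response.
        try (ServerSocket silentServer = new ServerSocket(0)) {
            get("http://127.0.0.1:" + silentServer.getLocalPort() + "/", 500, 500);
            return false;
        } catch (SocketTimeoutException expected) {
            return true;
        } catch (IOException other) {
            return false;
        }
    }
}
```

Because the read timeout applies to each read rather than to the request as a whole, a dependency that trickles bytes slowly can still hold the caller far longer than either value.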


Your service hung for 10 minutes :(


Let’s think about this


A little more detail


Wiremock + Saboteur + Vagrant

● Vagrant - launches + provisions local VMs
● Saboteur - uses tc, iptables to simulate network issues
● Wiremock - used to mock HTTP dependencies
● Cucumber - acceptance tests


I can write an automated test for that?

[Diagram: Wiremock (standing in for the User, Device and Pin services) and Saboteur run inside a Vagrant + VirtualBox VM; the acceptance test primes Saboteur to drop traffic, exercises the Play Movie service, then resets Saboteur]


Implementing reliable timeouts

● Homemade: Worker Queue + Thread pool (executor)


Implementing reliable timeouts

● Homemade: Worker Queue + Thread pool (executor)

● Hystrix


Implementing reliable timeouts

● Homemade: Worker Queue + Thread pool (executor)

● Hystrix

● Spring Cloud Netflix
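
The "homemade" option can be sketched with the JDK's executor framework: hand the call to a worker pool and bound how long the caller waits for the result. Everything here (`TimeLimiter`, `callOrFallback`, the pool and queue sizes) is a hypothetical illustration rather than code from the talk's repositories.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeLimiter {
    private final ExecutorService workers;

    TimeLimiter(ExecutorService workers) {
        this.workers = workers;
    }

    /** Runs the call on a worker thread; returns the fallback if it fails, is rejected or overruns. */
    <T> T callOrFallback(Callable<T> call, long timeoutMillis, T fallback) {
        Future<T> future;
        try {
            future = workers.submit(call);
        } catch (RejectedExecutionException e) {
            return fallback; // pool and queue are full: fail immediately
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it can stop early
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the call itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }

    /** Demo: a task sleeping for taskMillis, called with the given time budget. */
    static String demo(long taskMillis, long timeoutMillis) {
        ExecutorService pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(10));
        try {
            return new TimeLimiter(pool).callOrFallback(() -> {
                Thread.sleep(taskMillis);
                return "Scary String";
            }, timeoutMillis, "fallback");
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The time budget is enforced by the calling thread, so it holds whatever the worker is stuck on: a slow read, a GC pause in the dependency, or a plain `Thread.sleep`.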


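A simple Spring RestController

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class Resource {

        private static final Logger LOGGER = LoggerFactory.getLogger(Resource.class);

        @Autowired
        private ScaryDependency scaryDependency;

        @RequestMapping("/scary")
        public String callTheScaryDependency() {
            LOGGER.info("RestController: I wonder which thread I am on!");
            return scaryDependency.getScaryString();
        }
    }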


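Scary dependency

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.stereotype.Component;

    @Component
    public class ScaryDependency {

        private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class);

        public String getScaryString() {
            LOGGER.info("Scary dependency: I wonder which thread I am on!");
            // Roughly half of all calls return at once; the rest stall for 10 seconds.
            if (System.currentTimeMillis() % 2 == 0) {
                return "Scary String";
            }
            try {
                Thread.sleep(10000);
            } catch (InterruptedException e) {
                // Thread.sleep throws a checked exception; restore the interrupt flag.
                Thread.currentThread().interrupt();
            }
            return "Really slow scary string";
        }
    }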


All on the tomcat thread

13:07:32.814 [http-nio-8080-exec-1] INFO info.batey.examples.Resource - RestController: I wonder which thread I am on!
13:07:32.896 [http-nio-8080-exec-1] INFO info.batey.examples.ScaryDependency - Scary dependency: I wonder which thread I am on!


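Seriously this simple now?

    import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.stereotype.Component;

    @Component
    public class ScaryDependency {

        private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class);

        // The annotation is the only change: the method body now runs as a Hystrix command.
        @HystrixCommand
        public String getScaryString() {
            LOGGER.info("Scary dependency: I wonder which thread I am on!");
            if (System.currentTimeMillis() % 2 == 0) {
                return "Scary String";
            }
            try {
                Thread.sleep(10000);
            } catch (InterruptedException e) {
                // Thread.sleep throws a checked exception; restore the interrupt flag.
                Thread.currentThread().interrupt();
            }
            return "Really slow scary string";
        }
    }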


What an annotation can do...

13:07:32.814 [http-nio-8080-exec-1] INFO info.batey.examples.Resource - RestController: I wonder which thread I am on!
13:07:32.896 [hystrix-ScaryDependency-1] INFO info.batey.examples.ScaryDependency - Scary Dependency: I wonder which thread I am on!


Timeouts take home

● You can’t use network level timeouts for SLAs

● Test your SLAs - if someone says you can’t, hit them with a stick

● Scary things happen without network issues


2 - Don’t try if you can’t succeed


Complexity

● When an application grows in complexity it will eventually start sending emails


Complexity

● When an application grows in complexity it will eventually contain queues and thread pools (not just start sending emails)


Don’t try if you can’t succeed

● Executors with unbounded queues :(
○ newFixedThreadPool
○ newSingleThreadExecutor
○ newCachedThreadPool (unbounded threads rather than an unbounded queue)
● Bound your queues and threads
● Fail quickly when the queue / maxPoolSize is met
● Know your drivers


This is a functional requirement

● Set the timeout very high
● Use wiremock to add a large delay to the requests
● Set queue size and thread pool size to 1
● Send in 2 requests to use the thread and fill the queue
● What happens on the 3rd request?
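
The scenario above can be reproduced in miniature with a plain `ThreadPoolExecutor`. This is a sketch under assumptions: `BoundedPoolDemo` and its methods are hypothetical names, and a `CountDownLatch` stands in for the slow Wiremock-backed dependency.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPoolDemo {

    /** One worker thread and a queue of one; AbortPolicy rejects anything beyond that. */
    static ThreadPoolExecutor newBoundedPool() {
        return new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.AbortPolicy());
    }

    /** Returns true if the third request is rejected straight away. */
    static boolean thirdRequestIsRejected() {
        ThreadPoolExecutor pool = newBoundedPool();
        CountDownLatch release = new CountDownLatch(1); // stand-in for a very slow dependency
        Runnable slowCall = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        try {
            pool.execute(slowCall); // occupies the only thread
            pool.execute(slowCall); // fills the queue
            pool.execute(slowCall); // nowhere to go
            return false;
        } catch (RejectedExecutionException e) {
            return true; // failed fast instead of waiting behind the other two
        } finally {
            release.countDown();
            pool.shutdownNow();
        }
    }
}
```

With `Executors.newFixedThreadPool(1)` the third call would instead be queued silently, which is exactly the behaviour the previous slide warns about.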


3 - Fail gracefully


Expect rubbish

● Expect invalid HTTP
● Expect malformed response bodies
● Expect connection failures
● Expect huge / tiny responses
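
On the consuming side, "expect rubbish" roughly means capping how much is read and refusing to guess at unparseable bodies. The sketch below is a hypothetical helper (`DefensiveResponse` and its methods are not from the talk) for a dependency that should answer `true` or `false`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class DefensiveResponse {

    /** Reads at most maxBytes from the body; a larger body is treated as an error. */
    static String readBounded(InputStream body, int maxBytes) throws IOException {
        byte[] bytes = body.readNBytes(maxBytes + 1); // one extra byte reveals an oversized body
        if (bytes.length > maxBytes) {
            throw new IOException("Response body exceeds " + maxBytes + " bytes");
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Accepts only "true" or "false" (any case, surrounding whitespace ignored). */
    static Optional<Boolean> parseStrictBoolean(String body) {
        String trimmed = body == null ? "" : body.trim();
        if (trimmed.equalsIgnoreCase("true")) {
            return Optional.of(true);
        }
        if (trimmed.equalsIgnoreCase("false")) {
            return Optional.of(false);
        }
        return Optional.empty(); // empty, malformed or unexpected content
    }

    /** Combines both checks; any I/O problem or bad body yields an empty result. */
    static Optional<Boolean> parsePinCheck(InputStream body, int maxBytes) {
        try {
            return parseStrictBoolean(readBounded(body, maxBytes));
        } catch (IOException e) {
            return Optional.empty();
        }
    }
}
```

The strict parse matters because `Boolean.valueOf` maps every unrecognised string, including an HTML error page, to `false`; returning an empty `Optional` lets the caller decide between failing and falling back.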


Testing with Wiremock

    stubFor(get(urlEqualTo("/dependencyPath"))
            .willReturn(aResponse()
                    .withFault(Fault.MALFORMED_RESPONSE_CHUNK)));

    {
        "request": {
            "method": "GET",
            "url": "/fault"
        },
        "response": {
            "fault": "RANDOM_DATA_THEN_CLOSE"
        }
    }

    {
        "request": {
            "method": "GET",
            "url": "/fault"
        },
        "response": {
            "fault": "EMPTY_RESPONSE"
        }
    }


4 - Know if it’s your fault


What to record

● Metrics: Timings, errors, concurrent incoming requests, thread pool statistics, connection pool statistics

● Logging: Boundary logging, elasticsearch / logstash

● Request identifiers
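
A minimal sketch of boundary recording, using only the JDK: every call to a dependency is timed, counted, and logged together with a request identifier. `BoundaryRecorder` and its methods are hypothetical; a real service would more likely hand these numbers to a metrics library and a logging framework.

```java
import java.util.UUID;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

final class BoundaryRecorder {
    private final LongAdder calls = new LongAdder();
    private final LongAdder errors = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();

    /** Generates an identifier to attach to every log line for one incoming request. */
    static String newRequestId() {
        return UUID.randomUUID().toString();
    }

    /** Times the call, counts failures, and logs one line at the service boundary. */
    <T> T record(String requestId, String dependency, Supplier<T> call) {
        long start = System.nanoTime();
        boolean failed = true;
        try {
            T result = call.get();
            failed = false;
            return result;
        } finally {
            long elapsed = System.nanoTime() - start;
            calls.increment();
            totalNanos.add(elapsed);
            if (failed) {
                errors.increment();
            }
            System.out.printf("requestId=%s dependency=%s failed=%b tookMicros=%d%n",
                    requestId, dependency, failed, elapsed / 1_000);
        }
    }

    long callCount() {
        return calls.sum();
    }

    long errorCount() {
        return errors.sum();
    }

    long totalMicros() {
        return totalNanos.sum() / 1_000;
    }
}
```

Passing the same `requestId` through every dependency call is what makes it possible to stitch one user's request back together from the logs of several services.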


Graphite + Codahale


Response times


Separate resource pools

● Don’t flood your dependencies
● Be able to answer the questions:
○ How many connections will you make to dependency X?
○ Are you getting close to your max connections?
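
One way to make both questions answerable is to give each dependency its own fixed budget of concurrent calls. The sketch below uses a `Semaphore` per dependency; `DependencyLimiter` is a hypothetical name and this is a simplification of what pooled HTTP clients or Hystrix provide.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class DependencyLimiter {
    private final int maxConcurrent;
    private final Semaphore permits;

    DependencyLimiter(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
        this.permits = new Semaphore(maxConcurrent);
    }

    /**
     * Runs the call if a permit is free; otherwise returns empty immediately
     * rather than queueing. The call must return a non-null value.
     */
    <T> Optional<T> tryCall(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            return Optional.empty();
        }
        try {
            return Optional.of(call.get());
        } finally {
            permits.release();
        }
    }

    /** How many calls to this dependency are in progress right now. */
    int inFlight() {
        return maxConcurrent - permits.availablePermits();
    }

    /** The most calls this service will ever make to the dependency at once. */
    int maxConcurrent() {
        return maxConcurrent;
    }
}
```

Creating one limiter per dependency keeps a slow PinService from consuming the capacity reserved for the UserService, and publishing `inFlight()` next to `maxConcurrent()` shows how close the service is to its ceiling.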


So easy with Dropwizard + Hystrix

    @Override
    public void initialize(Bootstrap<AppConfig> appConfigBootstrap) {
        HystrixCodaHaleMetricsPublisher metricsPublisher
                = new HystrixCodaHaleMetricsPublisher(appConfigBootstrap.getMetricRegistry());
        HystrixPlugins.getInstance().registerMetricsPublisher(metricsPublisher);
    }

    metrics:
      reporters:
        - type: graphite
          host: 192.168.10.120
          port: 2003
          prefix: shiny_app


5 - Don’t whack a dead horse

[Diagram: the movie player architecture again: a Play Movie request goes to Shiny App, which depends on UserService, DeviceService and PinService]


What to do..

● Yes this will happen..
● Mandatory dependency - fail *really* fast
● Throttling
● Fallbacks


Circuit breaker pattern


Implementation with Hystrix

    @GET
    @Timed
    public String integrate() {
        LOGGER.info("I best do some integration!");

        String user = new UserServiceDependency(userService).execute();
        String device = new DeviceServiceDependency(deviceService).execute();
        Boolean pinCheck = new PinCheckDependency(pinService).execute();

        return String.format("[User info: %s] \n[Device info: %s] \n[Pin check: %s] \n",
                user, device, pinCheck);
    }


Implementation with Hystrix

    public class PinCheckDependency extends HystrixCommand<Boolean> {

        // The slide omits the httpClient field and the constructor, which must
        // pass a HystrixCommandGroupKey to super(...).

        @Override
        protected Boolean run() throws Exception {
            HttpGet pinCheck = new HttpGet("http://localhost:9090/pincheck");
            HttpResponse pinCheckResponse = httpClient.execute(pinCheck);
            String pinCheckInfo = EntityUtils.toString(pinCheckResponse.getEntity());
            return Boolean.valueOf(pinCheckInfo);
        }
    }


Implementation with Hystrix

    public class PinCheckDependency extends HystrixCommand<Boolean> {

        // The slide omits the httpClient field and the constructor, which must
        // pass a HystrixCommandGroupKey to super(...).

        @Override
        protected Boolean run() throws Exception {
            HttpGet pinCheck = new HttpGet("http://localhost:9090/pincheck");
            HttpResponse pinCheckResponse = httpClient.execute(pinCheck);
            String pinCheckInfo = EntityUtils.toString(pinCheckResponse.getEntity());
            return Boolean.valueOf(pinCheckInfo);
        }

        // Used when run() fails, times out, is rejected, or the circuit is open.
        @Override
        public Boolean getFallback() {
            return true;
        }
    }


Triggering the fallback

● Error threshold percentage
● Bucket of time for the percentage
● Minimum number of requests to trigger
● Time before trying a request again
● Disable
● Per instance statistics
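
To show how these settings interact, here is a deliberately simplified circuit breaker in plain Java. It is not Hystrix's implementation: Hystrix keeps rolling buckets, while this sketch uses a single fixed window, and every name in it (`SimpleCircuitBreaker`, the constructor parameters, the injectable clock) is an assumption made for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private final int errorThresholdPercent; // error percentage that trips the breaker
    private final int minimumRequests;       // no tripping below this many requests in a window
    private final long windowMillis;         // period over which the percentage is measured
    private final long sleepMillis;          // time to stay open before allowing one trial request
    private final LongSupplier clock;        // millisecond clock, injectable for tests

    private long windowStart;
    private int requests;
    private int failures;
    private boolean open;
    private long openedAt;

    SimpleCircuitBreaker(int errorThresholdPercent, int minimumRequests,
                         long windowMillis, long sleepMillis, LongSupplier clock) {
        this.errorThresholdPercent = errorThresholdPercent;
        this.minimumRequests = minimumRequests;
        this.windowMillis = windowMillis;
        this.sleepMillis = sleepMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    /** Runs the action unless the circuit is open; failures and short-circuits use the fallback. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // fail really fast: the dependency is not even contacted
        }
        try {
            T result = action.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }

    synchronized boolean isOpen() {
        return open;
    }

    private synchronized boolean allowRequest() {
        long now = clock.getAsLong();
        if (open) {
            if (now - openedAt < sleepMillis) {
                return false;
            }
            openedAt = now; // let one trial through; others wait another sleep period
            return true;
        }
        if (now - windowStart >= windowMillis) {
            resetWindow(now);
        }
        return true;
    }

    private synchronized void recordSuccess() {
        if (open) {
            open = false; // a request succeeded while open: close the circuit
            resetWindow(clock.getAsLong());
        } else {
            requests++;
        }
    }

    private synchronized void recordFailure() {
        if (open) {
            return; // trial failed: stay open
        }
        requests++;
        failures++;
        if (requests >= minimumRequests && failures * 100 >= errorThresholdPercent * requests) {
            open = true;
            openedAt = clock.getAsLong();
        }
    }

    private void resetWindow(long now) {
        windowStart = now;
        requests = 0;
        failures = 0;
    }
}
```

Each instance keeps its own counters, which mirrors the "per instance statistics" point: one application node may open its circuit while another, with a healthy network path, keeps calling the dependency.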


6 - Turn off broken stuff

● The kill switch
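
At its simplest, a kill switch is a flag consulted before every call to a dependency. The sketch below is a hypothetical stand-alone version (`KillSwitch` is not a class from the talk); how the flag gets flipped at runtime, for example through an admin endpoint or configuration, is left open.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

final class KillSwitch {
    private final AtomicBoolean enabled = new AtomicBoolean(true);

    /** Stops all further calls to the dependency until turnOn() is called. */
    void turnOff() {
        enabled.set(false);
    }

    void turnOn() {
        enabled.set(true);
    }

    boolean isEnabled() {
        return enabled.get();
    }

    /** Runs the call while the switch is on; otherwise returns the fallback without calling. */
    <T> T callIfEnabled(Supplier<T> call, T fallback) {
        return enabled.get() ? call.get() : fallback;
    }
}
```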


To recap

1. Don’t take forever - Timeouts
2. Don’t try if you can’t succeed
3. Fail gracefully
4. Know if it’s your fault
5. Don’t whack a dead horse
6. Turn broken stuff off


Links

● Examples:
○ https://github.com/chbatey/spring-cloud-example
○ https://github.com/chbatey/dropwizard-hystrix
○ https://github.com/chbatey/vagrant-wiremock-saboteur

● Tech:
○ https://github.com/Netflix/Hystrix
○ https://www.vagrantup.com/
○ http://wiremock.org/
○ https://github.com/tomakehurst/saboteur


Questions?

● Thanks for listening!
● http://christopher-batey.blogspot.co.uk/


Developer takeaways

● Learn about TCP
● Love vagrant, docker etc to enable testing
● Don’t trust libraries


Hystrix cost - do this yourself


Hystrix metrics

● Failure count
● Percentiles from Hystrix point of view
● Error percentages


How to test metric publishing?

● Stub out graphite and verify calls?
● Programmatically call graphite and verify numbers?
● Make metrics + logs part of the story demo
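
The first option, stubbing out Graphite, can be sketched with a plain socket server, since Graphite's plaintext protocol is one `path value timestamp` line per metric. `FakeGraphite` and its methods are hypothetical names; in a real test the service's own Graphite reporter would be pointed at `port()` instead of the hand-written sender in `roundTrip`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class FakeGraphite implements AutoCloseable {
    private final ServerSocket server = new ServerSocket(0); // any free local port
    private final BlockingQueue<String> received = new LinkedBlockingQueue<>();

    FakeGraphite() throws IOException {
        Thread acceptor = new Thread(this::acceptLoop, "fake-graphite");
        acceptor.setDaemon(true);
        acceptor.start();
    }

    int port() {
        return server.getLocalPort();
    }

    /** Waits for the next metric line; returns null if none arrives in time. */
    String awaitLine(long timeoutMillis) throws InterruptedException {
        return received.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    private void acceptLoop() {
        while (!server.isClosed()) {
            try (Socket client = server.accept();
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    received.add(line);
                }
            } catch (IOException e) {
                // Server closed or client failed; the loop condition decides whether to go on.
            }
        }
    }

    @Override
    public void close() throws IOException {
        server.close();
    }

    /** Demo: sends one metric the way a reporter would and returns what the stub saw. */
    static String roundTrip(String path, long value, long epochSeconds) {
        try (FakeGraphite graphite = new FakeGraphite();
             Socket socket = new Socket("127.0.0.1", graphite.port());
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            out.write(path + " " + value + " " + epochSeconds + "\n");
            out.flush();
            return graphite.awaitLine(2_000);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```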