Fault tolerant microservices - LJC Skills Matter, 4th Nov 2014

Fault tolerant microservices

BSkyB @chbatey

Who is this guy?

● Enthusiastic nerd
● Senior software engineer at BSkyB
● Builds a lot of distributed applications
● Apache Cassandra MVP

Agenda

1. Setting the scene
   ○ What do we mean by a fault?
   ○ What is a microservice?
   ○ Monolith application vs the micro(ish) service

2. A worked example
   ○ Identify an issue
   ○ Reproduce/test it
   ○ Show how to deal with the issue

So… what do applications look like?

So... what do systems look like now?

But different things go wrong...

[Diagram labels: down, slow network, slow app, 2 second max, missing packets, GC :(]

Fault tolerance

1. Don’t take forever - Timeouts
2. Don’t try if you can’t succeed
3. Fail gracefully
4. Know if it’s your fault
5. Don’t whack a dead horse
6. Turn broken stuff off

Time for an example...

● All examples are on github
● Technologies used:
   ○ Dropwizard
   ○ Spring Boot
   ○ Wiremock
   ○ Hystrix
   ○ Graphite
   ○ Saboteur

Example: Movie player service

[Diagram labels: Shiny App (several instances), Play Movie, UserService, DeviceService, PinService]

Testing microservices

You don’t know a service is fault tolerant if you don’t test faults

Isolated service tests

[Diagram labels: Shiny App; Mocks (User service, Device service, Pin service); Play Movie AcceptanceTest; Prime]

1 - Don’t take forever

● If at first you don’t succeed, don’t take forever to tell someone

● Timeout and fail fast

Which timeouts?

● Socket connection timeout
● Socket read timeout

Your service hung for 30 seconds :(

[Picture labels: Customer; You :(]

Which timeouts?

● Socket connection timeout
● Socket read timeout
● Resource acquisition

Your service hung for 10 minutes :(

Let’s think about this

A little more detail

Wiremock + Saboteur + Vagrant

● Vagrant - launches + provisions local VMs
● Saboteur - uses tc, iptables to simulate network issues
● Wiremock - used to mock HTTP dependencies
● Cucumber - acceptance tests

I can write an automated test for that?

[Diagram labels: Wiremock (User Service, Device Service, Pin Service), Saboteur, Vagrant + Virtual box VM, PlayMovie Service, AcceptanceTest, prime to drop traffic, reset]

Implementing reliable timeouts

● Homemade: Worker Queue + Thread pool (executor)

● Hystrix
● Spring Cloud Netflix
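
To make the "homemade" option concrete, here is a minimal sketch using only the JDK: the call runs on a small bounded thread pool and the caller waits on the Future for a fixed time. The class name TimeLimitedCaller, the pool sizes and the fallback value are all illustrative assumptions, not code from the talk's repositories.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: a worker queue plus a thread pool gives a timeout
// that does not depend on socket settings.
class TimeLimitedCaller {
    // One worker thread and a queue of one: deliberately small and bounded.
    // submit() throws RejectedExecutionException when both are busy.
    private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));

    /** Runs the call on the pool; returns fallback if it fails or takes longer than timeoutMillis. */
    <T> T call(Callable<T> dependencyCall, long timeoutMillis, T fallback) {
        Future<T> future = executor.submit(dependencyCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so the thread can be reused
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the dependency call itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }

    void shutdown() {
        executor.shutdownNow();
    }

    public static void main(String[] args) {
        TimeLimitedCaller caller = new TimeLimitedCaller();
        String fast = caller.call(() -> "Scary String", 500, "fallback");
        String slow = caller.call(() -> {
            Thread.sleep(10_000); // simulated slow dependency
            return "Really slow scary string";
        }, 200, "fallback");
        System.out.println(fast + " / " + slow);
        caller.shutdown();
    }
}
```

The caller gets an answer within the time limit even when the work itself does not stop promptly, which is the property a socket read timeout alone cannot give.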

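A simple Spring RestController

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class Resource {

    private static final Logger LOGGER = LoggerFactory.getLogger(Resource.class);

    @Autowired
    private ScaryDependency scaryDependency;

    @RequestMapping("/scary")
    public String callTheScaryDependency() {
        // Log from the controller so the thread name shows up in the output
        LOGGER.info("RestController: I wonder which thread I am on!");
        return scaryDependency.getScaryString();
    }
}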

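Scary dependency

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class ScaryDependency {

    private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class);

    public String getScaryString() {
        LOGGER.info("Scary dependency: I wonder which thread I am on!");
        // Roughly half of the calls return immediately, the rest block for 10 seconds
        if (System.currentTimeMillis() % 2 == 0) {
            return "Scary String";
        } else {
            try {
                Thread.sleep(10000);
            } catch (InterruptedException e) {
                // Thread.sleep throws a checked exception; keep the interrupt flag set
                Thread.currentThread().interrupt();
            }
            return "Really slow scary string";
        }
    }
}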

All on the tomcat thread

13:07:32.814 [http-nio-8080-exec-1] INFO info.batey.examples.Resource - RestController: I wonder which thread I am on!
13:07:32.896 [http-nio-8080-exec-1] INFO info.batey.examples.ScaryDependency - Scary dependency: I wonder which thread I am on!

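Seriously this simple now?

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class ScaryDependency {

    private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class);

    // The only change: the call is now wrapped in a Hystrix command
    @HystrixCommand
    public String getScaryString() {
        LOGGER.info("Scary dependency: I wonder which thread I am on!");
        if (System.currentTimeMillis() % 2 == 0) {
            return "Scary String";
        } else {
            try {
                Thread.sleep(10000);
            } catch (InterruptedException e) {
                // Thread.sleep throws a checked exception; keep the interrupt flag set
                Thread.currentThread().interrupt();
            }
            return "Really slow scary string";
        }
    }
}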

What an annotation can do...

13:07:32.814 [http-nio-8080-exec-1] INFO info.batey.examples.Resource - RestController: I wonder which thread I am on!
13:07:32.896 [hystrix-ScaryDependency-1] INFO info.batey.examples.ScaryDependency - Scary dependency: I wonder which thread I am on!

Timeouts take home

● You can’t use network level timeouts for SLAs

● Test your SLAs - if someone says you can’t, hit them with a stick

● Scary things happen without network issues

2 - Don’t try if you can’t succeed

Complexity

● When an application grows in complexity it will eventually start sending emails

Complexity

● When an application grows in complexity it will eventually contain queues and thread pools

Don’t try if you can’t succeed

● Executors with unbounded queues or threads :(
   ○ newFixedThreadPool (unbounded queue)
   ○ newSingleThreadExecutor (unbounded queue)
   ○ newCachedThreadPool (unbounded number of threads)
● Bound your queues and threads
● Fail quickly when the queue / maxPoolSize is met
● Know your drivers
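
The limits of the factory-made executors can be checked directly. The sketch below inspects them and builds a bounded alternative; the class and method names are made up for illustration, and the casts rely on the JDK currently returning ThreadPoolExecutor from these two factory methods, which is an implementation detail rather than a documented guarantee.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class for comparing executor limits.
class ExecutorBounds {
    /** Remaining queue capacity of a fresh fixed thread pool. */
    static int fixedPoolQueueCapacity() {
        ThreadPoolExecutor fixed = (ThreadPoolExecutor) Executors.newFixedThreadPool(4);
        try {
            return fixed.getQueue().remainingCapacity();
        } finally {
            fixed.shutdown();
        }
    }

    /** Maximum number of threads a cached thread pool is allowed to create. */
    static int cachedPoolMaxThreads() {
        ThreadPoolExecutor cached = (ThreadPoolExecutor) Executors.newCachedThreadPool();
        try {
            return cached.getMaximumPoolSize();
        } finally {
            cached.shutdown();
        }
    }

    /** Bounded alternative: at most maxThreads workers and queueSize waiting tasks. */
    static ThreadPoolExecutor bounded(int maxThreads, int queueSize) {
        return new ThreadPoolExecutor(
                maxThreads, maxThreads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new ThreadPoolExecutor.AbortPolicy()); // reject instead of queueing forever
    }

    public static void main(String[] args) {
        System.out.println("fixed pool queue capacity: " + fixedPoolQueueCapacity());
        System.out.println("cached pool max threads:   " + cachedPoolMaxThreads());
        ThreadPoolExecutor safe = bounded(4, 10);
        System.out.println("bounded queue capacity:    " + safe.getQueue().remainingCapacity());
        safe.shutdown();
    }
}
```

The first two numbers come out as Integer.MAX_VALUE, which is effectively "no limit"; the third is whatever you chose.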

This is a functional requirement

● Set the timeout very high
● Use wiremock to add a large delay to the requests
● Set queue size and thread pool size to 1
● Send in 2 requests to use the thread and fill the queue
● What happens on the 3rd request?
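
The same scenario can be rehearsed in-process before involving Wiremock. This is a rough sketch with a plain ThreadPoolExecutor, where a CountDownLatch stands in for the delayed dependency; ThirdRequestCheck is a hypothetical name and the real test in the talk drives the service over HTTP instead.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical in-process version of the "what happens on the 3rd request?" test.
class ThirdRequestCheck {
    /** Returns true if the third request is rejected straight away. */
    static boolean thirdRequestIsRejected() {
        // Thread pool size 1, queue size 1; the default handler rejects when both are full
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));
        CountDownLatch release = new CountDownLatch(1); // stands in for the slow dependency
        Runnable slowRequest = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        try {
            pool.execute(slowRequest); // request 1 occupies the only thread
            pool.execute(slowRequest); // request 2 fills the queue
            pool.execute(slowRequest); // request 3 has nowhere to go
            return false;
        } catch (RejectedExecutionException expected) {
            return true;
        } finally {
            release.countDown(); // let the blocked requests finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("third request rejected: " + thirdRequestIsRejected());
    }
}
```

A rejection here is the desired outcome: the third caller finds out immediately instead of waiting behind work that cannot finish in time.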

3 - Fail gracefully

Expect rubbish

● Expect invalid HTTP
● Expect malformed response bodies
● Expect connection failures
● Expect huge / tiny responses
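
On the consuming side, expecting rubbish mostly means refusing to guess. As an illustration, the sketch below parses a pin check style response strictly, treating anything unexpected as "no answer". The class name, the 16 character limit and the assumption that the body is a bare true/false are all hypothetical.

```java
import java.util.Optional;

// Hypothetical strict parser for a response that should contain "true" or "false".
class PinCheckResponseParser {
    private static final int MAX_BODY_LENGTH = 16; // assumed upper bound for a sane body

    /** Returns the parsed flag, or empty if the status or body is not what we expect. */
    static Optional<Boolean> parse(int statusCode, String body) {
        if (statusCode != 200 || body == null) {
            return Optional.empty();
        }
        if (body.length() > MAX_BODY_LENGTH) {
            return Optional.empty(); // huge response: do not even look at it
        }
        String trimmed = body.trim();
        if (trimmed.equalsIgnoreCase("true")) {
            return Optional.of(true);
        }
        if (trimmed.equalsIgnoreCase("false")) {
            return Optional.of(false);
        }
        return Optional.empty(); // empty, malformed or otherwise unexpected body
    }

    public static void main(String[] args) {
        System.out.println(parse(200, "true"));
        System.out.println(parse(200, "<html>Bad gateway</html>"));
        System.out.println(parse(500, "true"));
    }
}
```

Returning an empty Optional forces the caller to decide what failing gracefully means for this dependency, rather than silently reading garbage as false.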

Testing with Wiremock

stubFor(get(urlEqualTo("/dependencyPath"))
        .willReturn(aResponse()
                .withFault(Fault.MALFORMED_RESPONSE_CHUNK)));

{
    "request": {
        "method": "GET",
        "url": "/fault"
    },
    "response": {
        "fault": "RANDOM_DATA_THEN_CLOSE"
    }
}

{
    "request": {
        "method": "GET",
        "url": "/fault"
    },
    "response": {
        "fault": "EMPTY_RESPONSE"
    }
}

4 - Know if it’s your fault

What to record

● Metrics: Timings, errors, concurrent incoming requests, thread pool statistics, connection pool statistics

● Logging: Boundary logging, elasticsearch / logstash

● Request identifiers
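
A bare-bones version of boundary recording might look like the sketch below: every call across a service boundary is timed, counted and logged with the request identifier. BoundaryRecorder and its fields are hypothetical; in the talk's examples this job is done by Codahale metrics and Hystrix rather than hand-rolled counters.

```java
import java.util.UUID;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical recorder wrapped around each call to a dependency.
class BoundaryRecorder {
    final LongAdder calls = new LongAdder();
    final LongAdder errors = new LongAdder();
    final LongAdder totalNanos = new LongAdder();

    /** Times the call, counts it, and logs one line tagged with the request identifier. */
    <T> T record(String requestId, String dependency, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } catch (RuntimeException e) {
            errors.increment();
            throw e; // recording must not change the outcome
        } finally {
            long elapsed = System.nanoTime() - start;
            calls.increment();
            totalNanos.add(elapsed);
            System.out.printf("requestId=%s dependency=%s tookMicros=%d%n",
                    requestId, dependency, elapsed / 1_000);
        }
    }

    public static void main(String[] args) {
        BoundaryRecorder recorder = new BoundaryRecorder();
        // Generated here; it could equally arrive with the incoming request
        String requestId = UUID.randomUUID().toString();
        String user = recorder.record(requestId, "user-service", () -> "some user");
        System.out.println(user + " calls=" + recorder.calls.sum()
                + " errors=" + recorder.errors.sum());
    }
}
```

With the same identifier on every log line for a request, a slow response can be traced to the dependency that caused it.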

Graphite + Codahale

Response times

Separate resource pools

● Don’t flood your dependencies
● Be able to answer the questions:
   ○ How many connections will you make to dependency X?
   ○ Are you getting close to your max connections?
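
One simple way to be able to answer both questions is to give each dependency its own fixed budget of concurrent calls. The sketch below does that with a Semaphore; DependencyLimiter is a hypothetical name and this is only one possible approach, alongside separate thread pools or connection pools per dependency.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical per-dependency limiter: create one instance per downstream service.
class DependencyLimiter {
    private final Semaphore permits;
    private final int maxConcurrent;

    DependencyLimiter(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the call if a permit is free, otherwise returns the fallback without waiting. */
    <T> T call(Supplier<T> dependencyCall, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback; // budget exhausted: do not add to the pile
        }
        try {
            return dependencyCall.get();
        } finally {
            permits.release();
        }
    }

    /** How many calls to this dependency are in flight right now. */
    int inFlight() {
        return maxConcurrent - permits.availablePermits();
    }

    public static void main(String[] args) {
        DependencyLimiter userService = new DependencyLimiter(10);
        String user = userService.call(() -> "some user", "unknown user");
        System.out.println(user + ", in flight: " + userService.inFlight());
    }
}
```

The constructor argument answers "how many connections will you make to dependency X?", and inFlight() is the number to publish as a metric to see how close you are to that maximum.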

So easy with Dropwizard + Hystrix

@Override
public void initialize(Bootstrap<AppConfig> appConfigBootstrap) {
    // Publish Hystrix command metrics through the application's metric registry
    HystrixCodaHaleMetricsPublisher metricsPublisher
            = new HystrixCodaHaleMetricsPublisher(appConfigBootstrap.getMetricRegistry());
    HystrixPlugins.getInstance().registerMetricsPublisher(metricsPublisher);
}

metrics:
  reporters:
    - type: graphite
      host: 192.168.10.120
      port: 2003
      prefix: shiny_app

5 - Don’t whack a dead horse

[Diagram labels: Shiny App (several instances), Play Movie, UserService, DeviceService, PinService]

What to do..

● Yes this will happen..
● Mandatory dependency - fail *really* fast
● Throttling
● Fallbacks

Circuit breaker pattern

Implementation with Hystrix

@GET
@Timed
public String integrate() {
    LOGGER.info("I best do some integration!");

    String user = new UserServiceDependency(userService).execute();
    String device = new DeviceServiceDependency(deviceService).execute();
    Boolean pinCheck = new PinCheckDependency(pinService).execute();

    return String.format("[User info: %s] \n[Device info: %s] \n[Pin check: %s] \n",
            user, device, pinCheck);
}

Implementation with Hystrix

public class PinCheckDependency extends HystrixCommand<Boolean> {

    // Constructor and the httpClient field are not shown on the slide

    @Override
    protected Boolean run() throws Exception {
        HttpGet pinCheck = new HttpGet("http://localhost:9090/pincheck");
        HttpResponse pinCheckResponse = httpClient.execute(pinCheck);
        String pinCheckInfo = EntityUtils.toString(pinCheckResponse.getEntity());
        return Boolean.valueOf(pinCheckInfo);
    }
}

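Implementation with Hystrix

public class PinCheckDependency extends HystrixCommand<Boolean> {

    private final HttpClient httpClient;

    // Not shown on the slide: HystrixCommand needs a group key; the name here is illustrative
    public PinCheckDependency(HttpClient httpClient) {
        super(HystrixCommandGroupKey.Factory.asKey("PinCheckService"));
        this.httpClient = httpClient;
    }

    @Override
    protected Boolean run() throws Exception {
        HttpGet pinCheck = new HttpGet("http://localhost:9090/pincheck");
        HttpResponse pinCheckResponse = httpClient.execute(pinCheck);
        String pinCheckInfo = EntityUtils.toString(pinCheckResponse.getEntity());
        return Boolean.valueOf(pinCheckInfo);
    }

    // Used when run() fails, times out, is rejected, or the circuit is open
    @Override
    public Boolean getFallback() {
        return true;
    }
}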

Triggering the fallback

● Error threshold percentage
● Bucket of time for the percentage
● Minimum number of requests to trigger
● Time before trying a request again
● Disable
● Per instance statistics
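
To show how these knobs interact, here is a deliberately simplified circuit breaker. It is not how Hystrix is implemented: it is single-threaded, it counts every call since the last reset instead of using buckets of time, and SimpleCircuitBreaker with all of its parameter names is made up for this sketch.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified breaker for illustration only (not thread-safe).
class SimpleCircuitBreaker {
    private final int errorThresholdPercent; // open at or above this error percentage
    private final int minimumRequests;       // but only once this many calls were seen
    private final long sleepWindowMillis;    // time before trying a request again
    private final LongSupplier clock;        // current time in millis, injectable for tests

    private int requests;
    private int failures;
    private boolean open;
    private long openedAt;

    SimpleCircuitBreaker(int errorThresholdPercent, int minimumRequests,
                         long sleepWindowMillis, LongSupplier clock) {
        this.errorThresholdPercent = errorThresholdPercent;
        this.minimumRequests = minimumRequests;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    <T> T call(Supplier<T> dependencyCall, Supplier<T> fallback) {
        if (open && clock.getAsLong() - openedAt < sleepWindowMillis) {
            return fallback.get(); // short-circuit without touching the dependency
        }
        try {
            T result = dependencyCall.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }

    boolean isOpen() {
        return open;
    }

    private void recordSuccess() {
        if (open) {
            // The trial call after the sleep window worked: close and start counting afresh
            open = false;
            requests = 0;
            failures = 0;
        } else {
            requests++;
        }
    }

    private void recordFailure() {
        if (open) {
            openedAt = clock.getAsLong(); // trial call failed: wait another sleep window
            return;
        }
        requests++;
        failures++;
        if (requests >= minimumRequests && failures * 100 / requests >= errorThresholdPercent) {
            open = true;
            openedAt = clock.getAsLong();
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker =
                new SimpleCircuitBreaker(50, 2, 5_000, System::currentTimeMillis);
        for (int i = 0; i < 3; i++) {
            String result = breaker.call(
                    () -> { throw new IllegalStateException("pin service is down"); },
                    () -> "fallback");
            System.out.println(result + ", open=" + breaker.isOpen());
        }
    }
}
```

In this run the circuit opens on the second failure, and the third call returns the fallback without the dependency being called at all. The minimum request count is what stops a single early failure from opening the circuit.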

6 - Turn off broken stuff

● The kill switch
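
At its simplest, a kill switch is a flag checked before the broken feature is called. The sketch below assumes a hypothetical KillSwitch class; how the flag gets flipped at runtime (admin endpoint, configuration, operator tooling) is left open.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical kill switch guarding one feature or dependency.
class KillSwitch {
    private final AtomicBoolean enabled = new AtomicBoolean(true);

    void turnOff() {
        enabled.set(false);
    }

    void turnOn() {
        enabled.set(true);
    }

    /** Runs the feature while the switch is on, otherwise returns the given default. */
    <T> T call(Supplier<T> feature, T whenOff) {
        return enabled.get() ? feature.get() : whenOff;
    }

    public static void main(String[] args) {
        KillSwitch pinCheckSwitch = new KillSwitch();
        System.out.println(pinCheckSwitch.call(() -> "checked pin", "pin check skipped"));
        pinCheckSwitch.turnOff(); // an operator decides the pin service is broken
        System.out.println(pinCheckSwitch.call(() -> "checked pin", "pin check skipped"));
    }
}
```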

To recap

1. Don’t take forever - Timeouts
2. Don’t try if you can’t succeed
3. Fail gracefully
4. Know if it’s your fault
5. Don’t whack a dead horse
6. Turn broken stuff off

Links

● Examples:
   ○ https://github.com/chbatey/spring-cloud-example
   ○ https://github.com/chbatey/dropwizard-hystrix
   ○ https://github.com/chbatey/vagrant-wiremock-saboteur

● Tech:
   ○ https://github.com/Netflix/Hystrix
   ○ https://www.vagrantup.com/
   ○ http://wiremock.org/
   ○ https://github.com/tomakehurst/saboteur

Questions?

● Thanks for listening!
● http://christopher-batey.blogspot.co.uk/

Developer takeaways

● Learn about TCP
● Love vagrant, docker etc to enable testing
● Don’t trust libraries

Hystrix cost - do this yourself

Hystrix metrics

● Failure count
● Percentiles from Hystrix point of view
● Error percentages

How to test metric publishing?

● Stub out graphite and verify calls?
● Programmatically call graphite and verify numbers?
● Make metrics + logs part of the story demo
