Fault tolerance - look, it's simple!

● freshly baked father

● SA at EPAM Systems

● primary skill is Java

● hands-on coding in Groovy and Ruby

● trying to get to know Erlang/Elixir

● passionate about Agile, clean code and DevOps

Agenda

● Why should I care?
● Philosophy
● Tools
● Patterns
● Design Concepts
● Demo
● Summary
● Q&A

Why Should I Care?

1500 USD / second

90 000 USD / minute

5 400 000 USD / hour

x 6 BMW 6 Series Gran Coupé

1800 USD / second

108 000 USD / minute

6 480 000 USD / hour

x 6 Jaguar F-Type S
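The figures are consistent with each other: each per-minute and per-hour amount is just the per-second loss multiplied by 60 and by 3,600. The snippet below only reproduces that arithmetic; the class and method names are illustrative, and the per-second rates are the ones quoted on the slide.

```java
// Hypothetical helper, only to show the arithmetic behind the slide:
// a loss per second scales linearly to a loss per minute (x 60) and per hour (x 3,600).
final class DowntimeCost {
    static long perMinute(long usdPerSecond) {
        return usdPerSecond * 60;
    }

    static long perHour(long usdPerSecond) {
        return perMinute(usdPerSecond) * 60;
    }
}
```

For example, `DowntimeCost.perHour(1800)` gives 6480000, the second hourly figure above.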

Philosophy

Everything Crashes

Let It Crash

Fail Fast

Try Again

It’s all about

Fault Tolerance

Resiliency

Availability

Production

Tools

JRugged https://github.com/Comcast/jrugged

A Java library of robustness design patterns that makes your Java code more rugged

- implements common patterns such as Circuit Breaker and Limited Retries;
- provides straightforward add-ons to existing code;
- offers performance monitoring capabilities.

Hystrix https://github.com/Netflix/Hystrix

Latency and Fault Tolerance for Distributed Systems

- isolates points of access to remote systems, services and 3rd-party libraries;
- stops cascading failures;
- enables resilience in complex distributed systems where failure is inevitable;
- supports real-time monitoring and configuration changes;
- supports parallel execution.
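Isolating each point of access is essentially the Bulkhead idea: every remote dependency gets its own small, bounded pool, so one slow dependency can exhaust only its own threads. The sketch below shows that idea with plain `java.util.concurrent`; it is not Hystrix's API, and the dependency names (`recommendations`, `payments`) and pool sizes are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Bulkhead sketch: one bounded pool per (hypothetical) dependency.
// When a pool and its queue are full, new calls are rejected immediately
// instead of stealing threads from the other dependencies.
final class Bulkheads {
    private final ExecutorService recommendations = boundedPool(2, 2);
    private final ExecutorService payments = boundedPool(4, 4);

    private static ExecutorService boundedPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                runnable -> {
                    Thread thread = new Thread(runnable);
                    thread.setDaemon(true); // a hung call must not keep the JVM alive
                    return thread;
                },
                new ThreadPoolExecutor.AbortPolicy()); // full: throw RejectedExecutionException
    }

    Future<String> callRecommendations(Callable<String> call) {
        return recommendations.submit(call);
    }

    Future<String> callPayments(Callable<String> call) {
        return payments.submit(call);
    }
}
```

If the recommendation service hangs, at most two threads and two queued calls are stuck; payments keep their own four threads.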

Patterns

#1 Timeout
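A Timeout bounds how long the caller waits on a remote call. A minimal sketch with a standard `Future`, assuming a fallback value is acceptable when the budget is exceeded; the class name and the fallback policy are illustrative, not taken from JRugged or Hystrix.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Timeout sketch: never wait on a remote call longer than a fixed budget.
final class TimeoutCall {
    // Daemon threads, so a hung call cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable, "remote-call");
        thread.setDaemon(true);
        return thread;
    });

    /** Returns the task's result, or the fallback if it takes longer than timeoutMillis. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        Future<T> future = POOL.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // stop waiting and interrupt the slow call
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("remote call failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }
}
```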

#2 Circuit Breaker
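A Circuit Breaker stops calling a dependency that keeps failing and answers from a fallback until a cool-down expires, following the closed / open / half-open state machine described in Martin Fowler's CircuitBreaker article. The sketch below is a simplified illustration under those assumptions, not JRugged's or Hystrix's implementation; the threshold and cool-down values are parameters you would tune.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Circuit Breaker sketch:
// CLOSED    - calls pass through; consecutive failures are counted,
// OPEN      - calls go straight to the fallback until the cool-down expires,
// HALF_OPEN - one trial call decides whether to close again or re-open.
// synchronized keeps the sketch simple but serialises calls.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openNanos;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openNanos = TimeUnit.MILLISECONDS.toNanos(openMillis);
    }

    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.nanoTime() - openedAt < openNanos) {
                return fallback.get(); // still open: don't touch the failing dependency
            }
            state = State.HALF_OPEN; // cool-down over: allow one trial call
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = System.nanoTime();
            }
            return fallback.get();
        }
    }

    synchronized State state() {
        return state;
    }
}
```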

#3 Fail Fast
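Fail Fast means checking everything an operation needs before doing any work, and rejecting immediately instead of failing halfway through. A minimal sketch, assuming a hypothetical `OrderService` whose payment dependency exposes a health flag (for example, a circuit breaker's state):

```java
import java.util.function.BooleanSupplier;

// Fail Fast sketch: validate input and dependency health up front,
// so a doomed request is rejected in microseconds, not after a long timeout.
final class OrderService { // hypothetical service
    private final BooleanSupplier paymentGatewayAvailable; // hypothetical health check

    OrderService(BooleanSupplier paymentGatewayAvailable) {
        this.paymentGatewayAvailable = paymentGatewayAvailable;
    }

    String placeOrder(String customerId, int quantity) {
        if (customerId == null || customerId.trim().isEmpty()) {
            throw new IllegalArgumentException("customerId is required");
        }
        if (quantity <= 0) {
            throw new IllegalArgumentException("quantity must be positive");
        }
        if (!paymentGatewayAvailable.getAsBoolean()) {
            throw new IllegalStateException("payment gateway unavailable, rejecting early");
        }
        // Only now do the expensive work: reserve stock, charge the card, etc.
        return "order accepted for " + customerId;
    }
}
```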

#4 Shed Load
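Shed Load caps how much concurrent work a service accepts and turns the excess away at once (for example with an HTTP 503) rather than queueing requests it cannot finish in time. A minimal sketch using a `Semaphore`; the class name and the "rejected" response are illustrative assumptions.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Shed Load sketch: at most maxConcurrent requests are served at once;
// anything beyond that gets the rejection response immediately.
final class LoadShedder {
    private final Semaphore permits;

    LoadShedder(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T handle(Supplier<T> request, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {
            return rejected.get(); // over capacity: shed this request, don't wait
        }
        try {
            return request.get();
        } finally {
            permits.release();
        }
    }
}
```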

#5 Deferrable Work
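Deferrable Work moves non-essential tasks (audit records, notification e-mails, reports) off the request path: the request only enqueues them, and a background worker drains the queue later. A minimal in-memory sketch, assuming losing queued work on a crash is acceptable; a durable message queue would be needed otherwise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Deferrable Work sketch: a slow or failing downstream for deferred tasks
// no longer slows down or fails the request that produced them.
final class DeferredWork {
    private final BlockingQueue<Runnable> backlog = new LinkedBlockingQueue<>(10_000);

    /** Returns false when the backlog is full, so the caller can drop or log the task. */
    boolean defer(Runnable task) {
        return backlog.offer(task);
    }

    /** Runs up to maxTasks queued tasks; meant to be called from a scheduler or idle thread. */
    int drain(int maxTasks) {
        List<Runnable> batch = new ArrayList<>();
        backlog.drainTo(batch, maxTasks);
        for (Runnable task : batch) {
            try {
                task.run();
            } catch (RuntimeException e) {
                // one failing task must not stop the rest (a real system would log or re-queue it)
            }
        }
        return batch.size();
    }
}
```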

#6 Limited Retries
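Limited Retries retry a transient failure a bounded number of times, pausing longer between attempts, and then give up so the failure surfaces instead of piling up load on a struggling service. A minimal sketch, assuming every `RuntimeException` is worth retrying (a real policy would usually retry only transient errors):

```java
import java.util.function.Supplier;

// Limited Retries sketch: at most maxAttempts calls, with exponential backoff in between.
final class Retry {
    static <T> T withLimitedRetries(Supplier<T> action, int maxAttempts, long initialBackoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        long backoff = initialBackoffMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(backoff);
                    backoff *= 2; // double the pause after each failed attempt
                }
            }
        }
        throw last; // retries exhausted: surface the last failure
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }
}
```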

More …

● Fallback

● Cached Fallback

● Fail Silent

● Bulkheads

● ...
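Fallback, Cached Fallback and Fail Silent differ only in what a failed read returns: a default value, the last good value seen, or nothing at all so the caller can simply omit the feature. A minimal sketch of all three, assuming the wrapped call never returns `null` for the cached variant; the class name and keys are illustrative.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Fallback variants for a read call that may fail.
final class Fallbacks {
    private final ConcurrentHashMap<String, String> lastGood = new ConcurrentHashMap<>();

    /** Fallback: on failure, return a static default. */
    String withFallback(Supplier<String> call, String defaultValue) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            return defaultValue;
        }
    }

    /** Cached Fallback: on failure, return the last good value for this key (possibly stale). */
    String withCachedFallback(String key, Supplier<String> call, String defaultValue) {
        try {
            String value = call.get();
            lastGood.put(key, value); // remember the last successful answer
            return value;
        } catch (RuntimeException e) {
            return lastGood.getOrDefault(key, defaultValue);
        }
    }

    /** Fail Silent: on failure, return "nothing" and let the caller leave the feature out. */
    Optional<String> failSilent(Supplier<String> call) {
        try {
            return Optional.ofNullable(call.get());
        } catch (RuntimeException e) {
            return Optional.empty();
        }
    }
}
```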

Design Concepts

#1 Modularity

● Small components = understandable

● Single Responsibility Principle

● Loose coupling, high cohesion

● Microservices?

#2 Redundancy

● Avoid single point of failure

● Avoid keeping data in a single place
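Redundancy only helps if clients can actually use the spare copies. A minimal failover sketch, assuming several interchangeable replicas (the endpoint names are hypothetical) that the client tries in order until one answers:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Redundancy sketch: the same data/service lives on several replicas,
// so losing one replica degrades latency instead of causing an outage.
final class FailoverClient {
    private final List<String> replicas; // hypothetical endpoint names

    FailoverClient(List<String> replicas) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("need at least one replica");
        }
        this.replicas = new ArrayList<>(replicas);
    }

    String fetch(Function<String, String> callReplica) {
        RuntimeException last = null;
        for (String replica : replicas) {
            try {
                return callReplica.apply(replica);
            } catch (RuntimeException e) {
                last = e; // this replica is down: try the next one
            }
        }
        throw new IllegalStateException("all replicas failed", last);
    }
}
```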

#3 Asynchronicity

● Non-blocking operations

● Messaging

● Reactive
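With non-blocking composition, independent remote calls run concurrently and a failure in one of them becomes a degraded value rather than an exception that blocks or breaks the whole request. A minimal sketch with `CompletableFuture`; the user and order services are hypothetical `Supplier`s standing in for remote calls.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Supplier;

// Asynchronicity sketch: start both calls without blocking the caller,
// combine the results when both complete, and degrade if the order call fails.
final class AsyncProfile {
    static CompletableFuture<String> loadProfile(Supplier<String> userService,
                                                 Supplier<String> orderService,
                                                 Executor io) {
        CompletableFuture<String> user = CompletableFuture.supplyAsync(userService, io);
        CompletableFuture<String> orders = CompletableFuture.supplyAsync(orderService, io)
                .exceptionally(error -> "no orders available"); // degrade, don't fail
        return user.thenCombine(orders, (u, o) -> u + ": " + o);
    }
}
```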

#4 Immutability

● Code

● State and Data

● Infrastructure
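At the code level, immutability means state cannot change after construction: such objects can be shared between threads without locks, and an "update" produces a new copy, so a half-finished change can never leave shared state corrupt. A minimal sketch with a hypothetical `OrderSnapshot` value type:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Immutability sketch: final fields, a defensive copy, and "with" methods that return new objects.
final class OrderSnapshot { // hypothetical domain type
    private final String id;
    private final List<String> items;

    OrderSnapshot(String id, List<String> items) {
        this.id = id;
        this.items = Collections.unmodifiableList(new ArrayList<>(items)); // defensive copy
    }

    String id() {
        return id;
    }

    List<String> items() {
        return items;
    }

    /** Returns a new snapshot; this one is never modified. */
    OrderSnapshot withItem(String item) {
        List<String> next = new ArrayList<>(items);
        next.add(item);
        return new OrderSnapshot(id, next);
    }
}
```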

Demo Time

Demo App Architecture

Summary

● Systems are distributed

● Systems are becoming even more distributed…

● Failures are the normal case

● Failures are not predictable

● Fault tolerant = scalable

● It's all about the availability of the production system

References

● http://en.wikipedia.org/wiki/Fault_tolerance

● http://martinfowler.com/bliki/CircuitBreaker.html

● https://github.com/Netflix/Hystrix

● https://github.com/webdizz/fault-tolerance-talk

Q&A

Izzet Mustafayev @ EPAM Systems
@webdizz
webdizz
izzetmustafaiev
http://webdizz.name