25
Chaos Engineering 13.11.2018 Dennis Pietruck 1

Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

Chaos Engineering13.11.2018 Dennis Pietruck

1

Page 2: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Agenda

● Motivation● Grundlagen● Forschung & Konferenzen

2

Page 3: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

Motivation

3

Page 4: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Fehler in einem Monolith

?

Was sieht der Client?

4

Page 5: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Fehler in einem Monolith

?

Was sieht der Client?

● Internal Server Error - 500● Timeout

5

Page 6: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Fehler in einem verteilten System

?

Was sieht der Client? A

B

C

6

Page 7: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Fehler in einem verteilten System

?

Was sieht der Client?

● Internal Server Error - 500● Timeout● OK - 200 (mit fehlenden Daten)

○ nach langem Warten

A

B

C

7

Page 8: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Fehler in einem verteilten System

?

Wie kann sich dieser Fehler auswirken?

● Keine Fehlerbehandlung○ A kann auf DB Fehler nicht reagieren○ B kann auf Fehler in A nicht reagieren○ C kann auf Fehler in B nicht reagieren

● A hat Retries nicht konfiguriert● A und B haben Timeout nicht konfiguriert

A

B

C

8

Page 9: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Verteilte Systeme heute

Josh Evans - A Netflix Guide to Microserviceshttps://de.slideshare.net/JoshEvans2/mastering-chaos-a-netflix-guide-to-microservices

9

Page 10: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

Grundlagen

10

Page 11: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Was ist Chaos Engineering?

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand

turbulent conditions in production.

principlesofchaos.org

11

Page 12: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Testing ≠ Chaos Engineering

Entspricht das Systemverhalten der Spezifikation?

Was passiert wenn…

● Latenz erhöht wird?● Services abstürzen?● Netzwerkkomponenten ausfallen?● Regionen nicht erreichbar sind?● ...

https://www.a1qa.com/blog/interview-with-daniel-knott-upside-down-testing

12

Page 13: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Ursprünge von Chaos Engineering (1/2)

● 2010 Netflix entwickelt Chaos Monkeyfür den Umzug nach AWS

● 2011Simian Army erweitert Chaos Monkey

● 2012Chaos Monkey auf github

Chaos Monkey

Simian Army

13

Page 14: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Ursprünge von Chaos Engineering (1/2)

● 2016Gremlin Inc. bietet “Failure as a service”

● 2017Chaos Engineering Book

Gremlin

14

Page 15: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Vorgehen

Voraussetzung: Monitoring und Tracing

1. Messbaren stabilen Zustand definieren2. Annehmen, dass dieser Zustand weiter besteht3. Fehler simulieren4. Annahme überprüfen5. Maßnahmen treffen

15

Page 16: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Monitoring

Anwendung

Container

Containerplattform

Host OS

(Virtuelle) Hardware

Netzwerk

Whitebox Monitoring (RED-Principle)

Blackbox Monitoring (USE-Principle)

16

Page 17: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Tracing

● Zeigt Kommunikation der Services● Zeigt Latenzen● Unterstützt Debugging

[1]A. Blohowiak, A. Basiri, L. Hochstein, und C. Rosenthal, „A Platform for Automating Chaos Experiments“

jaegertracing.io/docs/1.7/architecture/

17

Page 18: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Stabilen Zustand definieren

Metriken finden die zeigen, dass das System noch funktioniert

Bei Netflix: “Stream Starts per Second”

Streams per Second im stabilen Zustand A. Basiri u. a., „Chaos Engineering“, IEEE Software, Bd. 33, Nr. 3, S. 35–41, Mai 2016.

18

Page 19: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Fehler simulieren

● Nur für ausgewählte Komponenten● Für einen kleinen Anteil der

Benutzer

Tools:

Anwendung

Container

Containerplattform

Datenbank

Host OS

(Virtuelle) Hardware

Netzwerk

Host OS

Availability Zone

Mögliche Fehlerquellen:

Chaos Toolkit Chaos Monkey Toxi Proxy Pumba

19

Page 20: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

Forschungsthemen

20

Page 21: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Forschungsthemen

● Tooling○ Fehlersimulation○ Automatisierung

■ der Experimente■ der Auswertung

● Anpassung an kleinere Organisationen● Chaos Engineering in nicht technischen Bereichen

21

Page 22: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

ForschungsthemenPaper

● The Business Case for Chaos EngineeringH. Tucker, L. Hochstein, N. Jones, A. Basiri, und C. Rosenthal

● A Platform for Automating Chaos ExperimentsA. Blohowiak, A. Basiri, L. Hochstein, und C. Rosenthal

● Why is random testing effective for partition tolerance bugs?R. Majumdar und F. Niksic

Konferenzen

● IEEE International Conference on Cloud Computing● IEEE International Symposium on Software Reliability Engineering

22

Page 23: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

The Business Case for Chaos Engineering

● Quantitative Vorteile○ Return on Investment

Voraussetzung: bestehendes Monitoring & Incident ManagementKosten-Nutzen Verhältnis für Chaos Engineering

● Qualitative Vorteile○ Förderung zur Entwicklung von Widerstandsfähigen Anwendungen ○ Widerstandsfähigkeit als Priorität

23

Page 24: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

Automating Failure Testing Research at Internet Scale

● Chaos Automation Platform● Konfiguration eines Experiments über ein Dashboard● Automatische Ausführung des Experiments● Auswahl von Metriken über ein Dashboard

Weitere Entwicklung:

● Automatische Konfiguration von Tests● Automatische Auswertung der Tests

24

Page 25: Chaos Engineering - users.informatik.haw-hamburg.deubicomp/... · Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the

13.11.2018 Dennis Pietruck

QuellenA. Basiri u. a., „Chaos Engineering“, IEEE Software, Bd. 33, Nr. 3, S. 35–41, Mai 2016.

„Principles of Chaos Engineering“. [Online]. Verfügbar unter: https://principlesofchaos.org/. [Zugegriffen: 10-Nov-2018].

A. Blohowiak, A. Basiri, L. Hochstein, und C. Rosenthal, „A Platform for Automating Chaos Experiments“, in 2016 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), 2016, S. 5–8.

H. Tucker, L. Hochstein, N. Jones, A. Basiri, und C. Rosenthal, „The Business Case for Chaos Engineering“, IEEE Cloud Computing, Bd. 5, Nr. 3, S. 45–54, Mai 2018.

R. Majumdar und F. Niksic, „Why is random testing effective for partition tolerance bugs?“, Proceedings of the ACM on Programming Languages, Bd. 2, Nr. POPL, S. 1–24, Dez. 2017.

25