Distributed Fault Injection Testing (DiFIT) - The Flip way - Rahul Karmshil, Shwet Shashank

Slash n: Technical Session 9 - DiFIT - Distributed Fault Injection Testing - Rahul Karmshil, Shwet Shashank


Page 1: Slash n: Technical Session 9 - DiFIT - Distributed Fault Injection Testing - Rahul Karmshil,  Shwet Shashank

Distributed Fault Injection Testing (DiFIT) - The Flip way

Rahul Karmshil, Shwet Shashank

Page 2

Agenda

• Fault Injection testing – why we need it?
• Fault Scenarios and the fallacies
• DiFIT – High level architecture
• DiFIT – Tech Stack
• Q & A

Page 3

Fault Injection testing – why we need it?

• Distributed systems are unreliable!

Application fault examples
1. Target service instance(s) or cluster down
2. High Availability scenarios for infrastructure pieces (Best Effort, NSPOF, Session Failover)
3. Impact of one-off resource-intensive operations like large report generation, garbage collection, cron jobs

System fault examples
4. Network timeout
5. Disk full
6. File descriptors (FD) reaching limits
7. Network interface is down

The 8 fallacies of distributed computing - L. Peter Deutsch (the eighth was added by James Gosling)

1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

Page 4

How to test for faults?

[Diagram: Service 1 and Service 2, with a ✗ marking the failure]

Challenges in manual testing:
• Know the commands
• How to test operations like bringing down the network?
• Repeatability

Wouldn't it be easier if we had something like –

/v0.1/services/service2/stop [PUT]
Payload: {"service_port":80,"host":"192.168.1.50","forceful":0}
Response:
204 OK
404, SERVICE_NOT_FOUND
400, COULD_NOT_STOP_SERVICE
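As a rough illustration of how a test could drive such an endpoint, here is a minimal Ruby sketch using only the standard library. The path, payload fields and response codes come from the slide; the helper names (`build_stop_request`, `interpret_stop_response`, `stop_service`), the outcome symbols and the base URL in the usage note are assumptions made for this example, not part of DiFIT.

```ruby
require "json"
require "net/http"
require "uri"

# Response codes listed on the slide, mapped to illustrative symbols.
# The symbol names are assumptions for this sketch.
STOP_OUTCOMES = {
  204 => :stopped,
  404 => :service_not_found,
  400 => :could_not_stop_service
}.freeze

# Build the PUT request for /v0.1/services/<service>/stop.
def build_stop_request(base_url, service, host:, port:, forceful: 0)
  uri = URI.join(base_url, "/v0.1/services/#{service}/stop")
  request = Net::HTTP::Put.new(uri)
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(service_port: port, host: host, forceful: forceful)
  request
end

# Translate an HTTP status code (string or integer) into an outcome symbol.
def interpret_stop_response(code)
  STOP_OUTCOMES.fetch(Integer(code), :unexpected_response)
end

# Send the request and report the outcome. Needs a reachable controller.
def stop_service(base_url, service, **options)
  request = build_stop_request(base_url, service, **options)
  uri = request.uri
  response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
  interpret_stop_response(response.code)
end
```

A call such as `stop_service("http://difit.example:8080", "service2", host: "192.168.1.50", port: 80)` would then return one of the outcome symbols; the base URL here is a placeholder.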

Page 5

A Typical flow

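[Diagram: a typical order flow, with ✗ marks at the points that can fail. System components: Website, Controller, OMS, Fulfillment, Logistics, Supply Chain, Message Queue, Retry Queue. DiFIT components: X-Unit, UI, RESTful Controller, DiFIT Agents. Connection label: Backend HTTP Request. Slide build captions: What can break? / How to test? / DiFIT / DiFIT – The Complete Picture / An example]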

Page 6

DiFIT – Tech Stack

[Diagram: DiFIT tech stack. Labels: DiFIT Server, DiFIT REST Interface, DiFIT APIs, STAF, TCP/IP, DiFIT Agent, STAF, DiFIT Libraries, Module Operations, Network Operations, Infra Operations]

Page 7

Code snippet

    require "json"
    require "rest-client"

    def test_relayed_message_is_sidelined_when_target_is_down
      # Setup: bring down the fulfillment service
      stop_payload = { :host => "192.168.76.24", :forceful => 1, :options => [] }.to_json
      stop_response = RestClient.put(@difit_base_url + "/v0.1/services/fulfillment/stop", stop_payload)
      assert_equal(204, stop_response.code)

      # Business flow: create an order and wait for its message to be relayed
      order_object = OrderFactory.default_order
      order_id = @oms_client.create_order(order_object)
      message_id = @oms_client.find_message_id_by_order_id(order_id)
      wait_till_message_is_relayed(message_id)

      # Verification step: the message must be sidelined while the target is down
      assert_equal(SIDELINE_STATUS, @oms_client.message_status(message_id))

      # Restore, bring up the service
      …
    end
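The setup/restore shape of the test above can be factored into a small helper so that the restore step runs even when an assertion fails. This is only a sketch: `with_fault` and its `inject:` / `restore:` arguments are hypothetical names, and the lambdas below are stand-ins, since the slide elides how the service is actually brought back up.

```ruby
# Hypothetical helper: run a block while a fault is active, then always restore.
def with_fault(inject:, restore:)
  inject.call        # if the injection itself fails, there is nothing to restore
  begin
    yield            # business flow and assertions run while the fault is active
  ensure
    restore.call     # runs whether the block passes, fails, or raises
  end
end

# Usage with stand-in lambdas; a real test would call the DiFIT REST API here.
events = []
with_fault(
  inject:  -> { events << :service_stopped },
  restore: -> { events << :service_restored }
) do
  events << :business_flow_and_assertions
end
```

Keeping injection and restoration paired in one place makes it harder to leave a shared environment in a broken state after a failed run.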

Page 9

The best way to avoid failure is to fail constantly.