Distributed Fault Injection Testing (DiFIT) - The Flip way
Rahul Karmshil
Shwet Shashank
Agenda
• Fault Injection testing – why we need it?
• Fault Scenarios and the fallacies
• DiFIT – High level architecture
• DiFIT – Tech Stack
• Q & A
Fault Injection testing – why we need it?
• Distributed systems are unreliable!
Application fault examples
1. Target service instance(s) or cluster down
2. High Availability scenarios for infrastructure pieces (Best Effort, NSPOF, Session Failover)
3. Impact of one-off resource-intensive operations like large report generation, garbage collection, crons
System fault examples
4. Network timeout
5. Disk full
6. File descriptors (FD) reaching limits
7. Network interface is down
The 8 fallacies of distributed computing - L. Peter Deutsch and James Gosling
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
How to test for faults?
[Diagram: Service 1 and Service 2, with a ✗ marking the failure]
Challenges in manual testing:
• Know the commands
• How to test operations like bringing down the network?
• Repeatability
Wouldn't it be easier if we had something like this:
/v0.1/services/service2/stop [PUT]
Payload: {"service_port": 80, "host": "192.168.1.50", "forceful": 0}
Response:
204 OK
404, SERVICE_NOT_FOUND
400, COULD_NOT_STOP_SERVICE
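As a rough illustration of how a test client might talk to such an endpoint, here is a minimal Ruby sketch. It only builds the PUT request and maps the three response codes listed above to symbols. The base URL, the helper names (`build_stop_request`, `stop_outcome`) and the `:unexpected_response` fallback are assumptions made for this example and are not part of DiFIT.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical controller address; the slides do not give one.
DIFIT_BASE_URL = 'http://difit.example.test:8080'

# The three response codes listed on the slide for the stop call.
STOP_OUTCOMES = {
  204 => :stopped,
  404 => :service_not_found,
  400 => :could_not_stop_service
}.freeze

# Build (but do not send) the PUT request that asks DiFIT to stop a service.
def build_stop_request(service, host:, service_port:, forceful: 0)
  uri = URI.join(DIFIT_BASE_URL, "/v0.1/services/#{service}/stop")
  request = Net::HTTP::Put.new(uri)
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate({ service_port: service_port, host: host, forceful: forceful })
  request
end

# Translate an HTTP status code (Integer or String) into a symbol a test can assert on.
def stop_outcome(status_code)
  STOP_OUTCOMES.fetch(Integer(status_code), :unexpected_response)
end
```

A caller could send the request with `Net::HTTP.start(request.uri.host, request.uri.port) { |http| http.request(request) }` and pass the response's `code` to `stop_outcome`. Keeping request construction separate from sending makes the payload easy to check without a running controller.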
A Typical flow
[Diagram slides: "What can break?", "How to test?", "DiFIT", "DiFIT – The Complete Picture", "An example". Components shown: Website, Controller, OMS, Fulfillment, Logistics, Supply Chain, Message Queue, Retry Queue, with ✗ marking possible failure points. DiFIT components shown: RESTful Controller, DiFIT Agents, X-Unit, UI, Backend HTTP Request.]
DiFIT – Tech Stack
[Diagram labels: STAF (shown twice), DiFIT Libraries, DiFIT APIs, DiFIT REST Interface, TCP/IP, DiFIT Agent, DiFIT Server, Module Operations, Network Operations, Infra Operations]
Code snippet
# Requires the rest-client gem; meant to live inside an xUnit-style test class.
require 'json'
require 'rest-client'

def test_relayed_message_is_sidelined_when_target_is_down
  # Setup: bring down the fulfillment service
  stop_payload = { :host => "192.168.76.24", :forceful => 1, :options => [] }.to_json
  stop_response = RestClient.put(@difit_base_url + "/v0.1/services/fulfillment/stop", stop_payload)
  assert_equal(204, stop_response.code)

  # Business flow
  order_object = OrderFactory.default_order
  order_id = @oms_client.create_order(order_object)
  message_id = @oms_client.find_message_id_by_order_id(order_id)
  wait_till_message_is_relayed(message_id)

  # Verification step
  assert_equal(SIDELINE_STATUS, @oms_client.message_status(message_id), "")

  # Restore, bring up the service …
end
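The restore step matters because a failed assertion would otherwise leave the service down for the next test. One way to guarantee it runs is to wrap the stop and restore calls around a block, sketched below. The helper name `with_service_down` and the `/start` endpoint are assumptions made for illustration (the slides only show `/stop`), and `client` is any object with a RestClient-style `put(url, payload)` method.

```ruby
require 'json'

# Stops a service through DiFIT, runs the given block, and always tries to
# bring the service back up afterwards.
# ASSUMPTION: a "/start" endpoint mirroring "/stop"; it is not shown in the slides.
def with_service_down(client, base_url, service, host:, forceful: 1)
  payload = { host: host, forceful: forceful, options: [] }.to_json
  service_url = "#{base_url}/v0.1/services/#{service}"

  response = client.put("#{service_url}/stop", payload)
  raise "could not stop #{service} (HTTP #{response.code})" unless response.code == 204

  begin
    yield
  ensure
    # Runs even when an assertion inside the block raises.
    client.put("#{service_url}/start", payload)
  end
end
```

Inside a test this might read `with_service_down(RestClient, @difit_base_url, "fulfillment", host: "192.168.76.24") { ... }`, with the business flow and verification steps placed in the block.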
References
• http://www.rgoarchitects.com/Files/fallacies.pdf
• http://staf.sourceforge.net/
• http://dropwizard.codahale.com/
The best way to avoid failure is to fail constantly.