Use of Formal Methods at Amazon Web Services (Chris Newcombe, Tim Rath, Fan Zhang, Bogdan Munteanu, Marc Brooker, Michael Deardeuff)
ASAD RIAZ (021)
MALIK FARHAN (028)
HASSNAIN SHAH (086)
What is AWS?
o Cloud services
o Database storage
o Networking
o Pay-as-you-go pricing
AWS Services
o S3
o Launch a virtual machine
o Build a web app
o Machine learning (Rekognition)
o Databases (DynamoDB)
o Analytics
o AR & VR
AWS Business Growth & Cost-efficient Infrastructure
o S3 grew to store 1 trillion objects. Less than a year later it had grown to 2 trillion objects, and was regularly handling 1.1 million requests per second.
o Fault tolerant
o Replication
o Consistency
o Concurrency
o Load Balancing
Complexity
High complexity increases the probability of human error in design, code & operations.
What have we tried?
o Deep design reviews
o Standard verification techniques
o Code reviews
o Fault-injection testing
Still, subtle bugs slip through. The reason? Complexity.
Solution?
o TLA+ (Temporal Logic of Actions), a formal specification language.
o TLA+ is based on simple discrete math, i.e. basic set theory and predicates, with which all engineers are familiar.
o A TLA+ specification describes the set of all possible legal behaviors.
o TLA+ describes both the correctness properties (the ‘what’) and the design of the system (the ‘how’).
o Designs are verified with conventional mathematical reasoning and the TLC model checker.
What is TLC?
A tool which takes a TLA+ specification & exhaustively checks the desired correctness properties.
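To make "exhaustively checks" concrete, here is a minimal Python sketch of the idea behind a model checker. It is not TLC and not how TLC is implemented; the names `check_invariant`, `toy_next` and `toy_invariant` and the two-process counter are assumptions made up for illustration. The checker visits every reachable state breadth-first and, if a state breaks the invariant, returns the shortest trace leading to it.

```python
from collections import deque

def check_invariant(initial_states, next_states, invariant):
    """Visit every reachable state breadth-first.

    Returns None if the invariant holds in all reachable states,
    otherwise the shortest trace (list of states) to a violation.
    """
    parent = {}
    queue = deque()
    for s in initial_states:
        if s not in parent:
            parent[s] = None
            queue.append(s)
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:      # walk back to the initial state
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None

# Hypothetical toy system: two processes each read a shared counter,
# then write back "value read + 1" as a separate step.
def toy_next(state):
    pcs, tmps, counter = state
    for i in range(2):
        if pcs[i] == "read":
            yield (pcs[:i] + ("write",) + pcs[i + 1:],
                   tmps[:i] + (counter,) + tmps[i + 1:],
                   counter)
        elif pcs[i] == "write":
            yield (pcs[:i] + ("done",) + pcs[i + 1:], tmps, tmps[i] + 1)

def toy_invariant(state):
    pcs, _, counter = state
    # When both processes are done, both increments must be visible.
    return pcs != ("done", "done") or counter == 2

init = (("read", "read"), (0, 0), 0)
trace = check_invariant([init], toy_next, toy_invariant)
for step in trace or []:
    print(step)
```

The printed trace is an interleaving in which both processes read before either writes, so one increment is lost. That is the kind of subtle interleaving bug that exhaustive checking finds and that ordinary testing can easily miss.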
TLA+ (Temporal Logic of Actions)
PlusCal (similar to a C-style programming language)
PlusCal is automatically translated to TLA+ with a single key press.
System | Components | Line count (excl. comments) | Benefit
S3 | Fault-tolerant low-level network algorithm | 804 PlusCal | Found 2 bugs. Found further bugs in proposed optimizations.
S3 | Background redistribution of data | 645 PlusCal | Found 1 bug, and found a bug in the first proposed fix.
DynamoDB | Replication & group-membership system | 939 TLA+ | Found 3 bugs, some requiring traces of 35 steps.
EBS | Volume management | 102 PlusCal | Found 3 bugs.
Internal distributed lock manager | Lock-free data structure | 223 PlusCal | Improved confidence. Failed to find a liveness bug as we did not check liveness.
Internal distributed lock manager | Fault tolerant replication and reconfiguration algorithm | 318 TLA+ | Found 1 bug. Verified an aggressive optimization.
Starting steps of Formal Specifications
1. Safety properties: “what the system is allowed to do”
Example: at all times, all committed data is present and correct.
2. Liveness properties: “what the system must eventually do”
Example: whenever the system receives a request, it must eventually respond to that request.
3. Next step: “what must go right?”
4. Conforming to the design: the goal is to confirm that the design correctly handles all of the dynamic events in the environment.
What to confirm?
o Network errors & repairs
o Disk errors
o Crashes & restarts
o Data center failure and repairs
o Actions by human operators
5. Using the model checker to verify that the specification of the system in its environment implements the chosen correctness properties.
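The steps above can be sketched in Python on a deliberately tiny model. Everything here is an assumption for illustration and does not come from the AWS specifications: a single server that stores a write and then acknowledges it, an environment that may crash the server at most once, and a `can_restart` switch for comparing two designs. The liveness check is simplified. It assumes the system always takes some step when one is enabled, and it does not model TLA+ fairness conditions in general.

```python
from collections import deque

# State: (server, request, stored, crashes_left)
INIT = ("up", "none", False, 1)

def next_states(state, can_restart):
    server, request, stored, crashes_left = state
    succ = []
    if request == "none":                      # client sends a write request
        succ.append((server, "pending", stored, crashes_left))
    if server == "up" and request == "pending":
        if not stored:                         # persist the data first
            succ.append((server, request, True, crashes_left))
        else:                                  # then acknowledge the request
            succ.append((server, "answered", stored, crashes_left))
    if server == "up" and crashes_left > 0:    # environment event: crash
        succ.append(("down", request, stored, crashes_left - 1))
    if server == "down" and can_restart:       # environment event: restart
        succ.append(("up", request, stored, crashes_left))
    return succ

def reachable(init, can_restart):
    seen, queue = {init}, deque([init])
    while queue:
        for nxt in next_states(queue.popleft(), can_restart):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def safety_holds(states):
    # Safety: an acknowledged request always has its data stored.
    return all(stored for (_, request, stored, _) in states
               if request == "answered")

def liveness_holds(states, can_restart):
    # Liveness: every pending request is eventually answered.
    # 'stuck' ends up holding the states from which the system can
    # avoid "answered" forever (an endless path or a dead end).
    stuck = {s for s in states if s[1] != "answered"}
    changed = True
    while changed:
        changed = False
        for s in list(stuck):
            succ = next_states(s, can_restart)
            if succ and not any(n in stuck for n in succ):
                stuck.discard(s)
                changed = True
    return not any(s[1] == "pending" for s in stuck)

for can_restart in (True, False):
    states = reachable(INIT, can_restart)
    print(can_restart, safety_holds(states), liveness_holds(states, can_restart))
```

In this sketch the design without restarts still satisfies the safety property, but it fails the liveness property: a crash while a request is pending leaves that request unanswered forever. Checking only safety would not reveal this.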
TLA+ & PlusCal Example
The problem
You’re writing software for a bank. You have Alice and Bob as clients, each with a certain amount of money in their accounts. Alice wants to send some money to Bob. How do you model this? Assume all you care about is their bank accounts.
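As a starting point, the transfer can be modeled as a sequence of states, each holding only the two balances. The Python sketch below is an illustration, not the specification shown on the slides. The starting balances of 10, the amount of 5 and the name `transfer_behavior` are assumptions.

```python
def transfer_behavior(alice, bob, money):
    """Yield each state (alice, bob) of a two-step transfer."""
    yield (alice, bob)      # initial state
    alice -= money          # step A: withdraw from Alice's account
    yield (alice, bob)
    bob += money            # step B: deposit into Bob's account
    yield (alice, bob)

states = list(transfer_behavior(alice=10, bob=10, money=5))
for state in states:
    print(state)
```

Writing the transfer as two separate steps makes the intermediate state explicit: after step A the money has left Alice's account but has not yet reached Bob's.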
Step One
Assertions & Sets
Can Alice’s account go negative? Assertions in TLA+ are used for debugging.
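The idea of combining a set of possible values with an assertion can be sketched in Python as follows. Instead of testing one transfer amount, the check tries every amount in a set and reports the first one that breaks the assertion. The starting balance of 10, the range 1 to 20, the guard, and the names `withdraw` and `find_overdraft` are all assumptions for illustration.

```python
def withdraw(alice, money, guarded):
    """Return Alice's balance after trying to send 'money'."""
    if guarded and money > alice:
        return alice            # refuse the transfer, balance unchanged
    return alice - money

def find_overdraft(alice_start, amounts, guarded=False):
    """Return the first amount that makes Alice's balance negative, else None."""
    for money in amounts:
        if withdraw(alice_start, money, guarded) < 0:
            return money
    return None

print(find_overdraft(10, range(1, 21)))                # unguarded transfer
print(find_overdraft(10, range(1, 21), guarded=True))  # transfer with a balance check
```

With these assumed numbers, the unguarded version reports 11 as the smallest amount that overdraws the account, and the guarded version reports no violation.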
Step Two
We are going to get an error at this stage. Can you tell why? How are we going to fix it?
Fixing the issue
Conclusion
At AWS, formal methods have been a big success. They have helped us prevent subtle, serious bugs from reaching production, bugs that we would not have found via any other techniques.
In simple words, AWS would not be where it is today without formal methods.