Finding Liveness Bugs In Distributed Systems

R. Jhala [C. Killian, J. Anderson, A. Vahdat], UC San Diego

Concurrent, Distributed Systems

Stock Exchanges, Telecoms, Commuter Rail

System: Nodes exchanging Messages

Execution
1. Node gets message event
2. Executes event handler
   - Updates node state
   - Sends new messages
3. Repeat…
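The execution loop above can be sketched in a few lines of Python. This is only an illustration, not Mace code: the `Node` class, the message fields (`to`, `reply_to`, `hops`), and the FIFO delivery order are all assumptions made for the example.

```python
from collections import deque

class Node:
    """Hypothetical node whose state is a count of handled messages."""
    def __init__(self, name):
        self.name = name
        self.handled = 0

    def handle(self, msg):
        # Event handler: update node state, then return new messages to send.
        self.handled += 1
        hops_left = msg["hops"] - 1
        if hops_left > 0:
            return [{"to": msg["reply_to"], "reply_to": self.name, "hops": hops_left}]
        return []

def run(nodes, initial_messages):
    """Deliver pending messages in FIFO order until none remain."""
    pending = deque(initial_messages)
    while pending:                            # 3. repeat
        msg = pending.popleft()               # 1. node gets a message event
        new = nodes[msg["to"]].handle(msg)    # 2. execute the event handler
        pending.extend(new)
    return nodes

nodes = {"n1": Node("n1"), "n2": Node("n2")}
run(nodes, [{"to": "n1", "reply_to": "n2", "hops": 3}])
print(nodes["n1"].handled, nodes["n2"].handled)  # prints "2 1"
```

A real system has no global FIFO queue: the network may reorder or drop messages, which is exactly what the scheduler choices in the following slides model.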

Distributed Systems: Challenges

System: Nodes exchanging Messages

Challenges
- Nodes: enter, leave, fail
- Messages: reordered, lost

System must stay available
- Eventually, all nodes regroup
- Eventually, all packets delivered
- Eventually, some good happens

Liveness Properties

The Space of System Executions

[Figure: tree of states over nodes 1 and 2, rooted at the Initial State, with edges labeled event@1, event@2, fail@1, fail@2]

At each state, scheduler picks:
1. Node n
2. Event @n
3. Executes code

An Execution = Sequence of Choices

[Figure: the same state tree over nodes 1 and 2, with edges labeled event@1, event@2, fail@1, fail@2; an execution is one path through it]
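Treating an execution as a sequence of choices means any execution can be replayed deterministically from the initial state. The sketch below assumes a toy two-node system (a counter and an "up" flag per node) and reuses the `event@n` / `fail@n` labels from the diagram; the state layout and the functions `initial_state`, `step`, and `replay` are hypothetical.

```python
def initial_state():
    # Toy model (assumed): nodes 1 and 2, each with a counter and an "up" flag.
    return {1: {"count": 0, "up": True}, 2: {"count": 0, "up": True}}

def step(state, choice):
    """Apply one scheduler choice such as "event@1" or "fail@2"."""
    kind, node = choice.split("@")
    n = int(node)
    new = {k: dict(v) for k, v in state.items()}  # copy, so old states stay intact
    if kind == "fail":
        new[n]["up"] = False
    elif kind == "event" and new[n]["up"]:
        new[n]["count"] += 1                      # handler updates node state
    return new

def replay(choices):
    """Re-run an execution from the initial state, one choice at a time."""
    state = initial_state()
    for c in choices:
        state = step(state, c)
    return state

execution = ["event@1", "event@2", "fail@1", "event@1"]
final = replay(execution)
print(final)
```

Because the final state depends only on the list of choices, a model checker can store, shrink, or bisect an execution by manipulating that list.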

Safety Bugs

Safety Bugs: Execution that drives system to bad state

Bad States
• Null Dereferences
• Buffer overflows
• Assertion Failures
• Low-level crash

How to find Safety Bugs?
Find path from Initial to Bad
By systematically exploring executions
(Iterating over sequences of choices)

Model Checking for Safety Bugs

Find path from Initial to Bad
By systematically exploring executions
[Verisoft 97, Cmc 04, Chess 07]
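A minimal sketch of this kind of search is a bounded breadth-first exploration over sequences of choices. It is not the algorithm of any of the cited tools; the toy transition function and the definition of "bad" (messages stranded at a failed node) are assumptions chosen to keep the example small.

```python
from collections import deque

CHOICES = ["event@1", "event@2", "fail@2"]

def step(state, choice):
    """Toy system (assumed). State = (messages queued at node 2, node 2 up?)."""
    queued, up = state
    if choice == "event@1" and up:
        return (min(queued + 1, 3), up)   # node 1 sends a message to node 2
    if choice == "event@2" and up and queued > 0:
        return (queued - 1, up)           # node 2 handles one message
    if choice == "fail@2":
        return (queued, False)            # node 2 fails
    return state                          # otherwise the choice has no effect

def is_bad(state):
    queued, up = state
    return not up and queued > 0          # assumed bad state: stranded messages

def find_bad_path(initial, max_depth):
    """Breadth-first search over choice sequences, up to max_depth choices."""
    frontier = deque([(initial, [])])
    seen = {initial}
    while frontier:
        state, path = frontier.popleft()
        if is_bad(state):
            return path
        if len(path) == max_depth:
            continue
        for c in CHOICES:
            nxt = step(state, c)
            if nxt not in seen:           # skip states already explored
                seen.add(nxt)
                frontier.append((nxt, path + [c]))
    return None

path = find_bad_path((0, True), max_depth=5)
print(path)  # prints ['event@1', 'fail@2']
```

Breadth-first order returns a shortest path to a bad state, which keeps the error trace short.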

Safety Properties are too Low Level

Distributed Systems: Designed for crashes & failures

Challenge: End-to-end Problems
Liveness bugs

Live States

[Figure: state space with the Initial State, Bad States, and Live States]

Good States:
- All nodes regroup
- All packets delivered

Live States: Eventually Good Happens

Live Executions

[Figure: Initial State and Live States]

Liveness Violations

[Figure: Initial State and Live States]

Execution never reaches live state

How to Find Liveness Violations?

Explore all executions? Infinitely many...

Explore all executions up to a bound?
- Combinatorial explosion (depth < 50)
- Liveness at depth >> 50

[Verisoft 97, Cmc 04, Chess 07]
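A rough count illustrates the explosion. The numbers below are assumptions for illustration: four choices enabled at every step (the event@1, event@2, fail@1, fail@2 labels used earlier) and no merging of duplicate states, so this counts choice sequences rather than distinct states.

```python
choices_per_step = 4   # assumed: event@1, event@2, fail@1, fail@2
depth = 50

executions = choices_per_step ** depth   # number of choice sequences of length 50
print(executions)                        # 1267650600228229401496703205376
print(f"{executions:.3e}")               # 1.268e+30
```

Real checkers prune revisited states, so the practical cost is lower, but the growth is still exponential in the depth.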

Looks pretty hopeless...

Idea 1: Dead States

Dead States
- No execution can reach live states
- Recovery is impossible

To find liveness bugs, look for dead executions.
How to tell if a state is Dead?

Idea 2: Random Walks

How to tell if a state is Dead? Execute long random walks from the state.
- Dead state: Pr[reaching live] = 0
- Recoverable state: Pr[reaching live] = 1

Executions and Random Walks

At each execution step,
1. Scheduler picks node n
2. Scheduler picks event @n
3. Executes event code

Random Walk: Scheduler picks randomly (from some Prob. Dist. over nodes, events)
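A random walk replaces the systematic scheduler with a random one. The sketch below assumes a toy system with a deliberate bug (a failure at progress 1 wedges it permanently) and a uniform distribution over choices; `step`, `is_live`, and `random_walk` are hypothetical names, not MaceMC functions.

```python
import random

CHOICES = ["work@1", "work@2", "fail@1"]
GOAL = 3

def step(state, choice):
    """Toy system (assumed). State = (progress, wedged)."""
    progress, wedged = state
    if wedged:
        return state                      # dead: no event helps any more
    if choice == "fail@1":
        # Assumed bug: a failure at progress 1 wedges the system for good.
        return (progress, True) if progress == 1 else state
    return (min(progress + 1, GOAL), False)

def is_live(state):
    progress, wedged = state
    return progress == GOAL and not wedged

def random_walk(state, max_steps, rng):
    """Return the number of steps needed to reach a live state, or None."""
    for steps in range(max_steps):
        if is_live(state):
            return steps
        state = step(state, rng.choice(CHOICES))  # scheduler picks at random
    return max_steps if is_live(state) else None

rng = random.Random(0)
print(random_walk((2, False), 1000, rng))  # recoverable state: reaches live
print(random_walk((1, True), 1000, rng))   # dead state: always None
```

A walk that ends with `None` does not prove the state is dead, it only makes deadness likely when the walk is much longer than the usual time to reach liveness.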

Liveness Bugs = Search + Random Walks

1. Systematic Search: find candidates
2. Random Walk: test if candidate dead

Iterate
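The two phases can be combined as follows. This is a sketch under assumptions, not MaceMC's implementation: the toy system (a failure at progress 1 wedges it), the search depth, and the walk counts are all invented for the example.

```python
import random
from collections import deque

CHOICES, GOAL = ["work@1", "work@2", "fail@1"], 3

def step(state, choice):
    """Toy system (assumed). State = (progress, wedged)."""
    progress, wedged = state
    if wedged:
        return state
    if choice == "fail@1":
        return (progress, True) if progress == 1 else state  # assumed bug
    return (min(progress + 1, GOAL), False)

def is_live(state):
    return state == (GOAL, False)

def reachable(initial, max_depth):
    """Phase 1, systematic search: states reachable within max_depth choices."""
    seen, frontier = {initial}, deque([(initial, 0)])
    while frontier:
        state, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for c in CHOICES:
            nxt = step(state, c)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

def looks_dead(state, rng, walks=50, max_steps=1000):
    """Phase 2, random walks: True if no walk from `state` reaches liveness."""
    for _ in range(walks):
        s = state
        for _ in range(max_steps):
            if is_live(s):
                break
            s = step(s, rng.choice(CHOICES))
        if is_live(s):
            return False
    return True

rng = random.Random(0)
candidates = reachable((0, False), max_depth=3)
dead = sorted(s for s in candidates if looks_dead(s, rng))
print(dead)  # the wedged state (1, True) in this toy model
```

The search supplies states that a purely random execution might rarely visit, and the walks give each of them a cheap probabilistic deadness test.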

If walk length >> avg. steps to liveness
Then non-live walk is likely liveness bug!

[Figure labels: 100k Events, 1k Events]

100,000 Step Execution (2 GB Log file)
How to pinpoint bug?

Idea 3: The Critical Transition

System transitions from a recoverable to a dead state
How to find Critical Transition without knowing Dead States?

Binary Search using Random Walks!

Critical Transition
- System transitions from a recoverable to a dead state
- Pinpoints bug
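One way to read "binary search using random walks" is sketched below: replay a prefix of the failing execution, test the resulting state with random walks, and halve the interval. It assumes the initial state is recoverable, the final state is dead, and dead states stay dead. The toy system and all function names are hypothetical.

```python
import random

CHOICES, GOAL = ["work@1", "work@2", "fail@1"], 3

def step(state, choice):
    """Toy system (assumed). State = (progress, wedged)."""
    progress, wedged = state
    if wedged:
        return state
    if choice == "fail@1":
        return (progress, True) if progress == 1 else state  # assumed bug
    return (min(progress + 1, GOAL), False)

def is_live(state):
    return state == (GOAL, False)

def replay(initial, choices):
    state = initial
    for c in choices:
        state = step(state, c)
    return state

def looks_dead(state, rng, walks=50, max_steps=1000):
    """True if none of the random walks from `state` reaches liveness."""
    for _ in range(walks):
        s = state
        for _ in range(max_steps):
            if is_live(s):
                break
            s = step(s, rng.choice(CHOICES))
        if is_live(s):
            return False
    return True

def critical_transition(initial, execution, rng):
    """Index of the choice that moves the system from recoverable to dead."""
    lo, hi = 0, len(execution)  # after lo choices: recoverable; after hi: dead
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if looks_dead(replay(initial, execution[:mid]), rng):
            hi = mid
        else:
            lo = mid
    return hi - 1

execution = ["work@2", "fail@1", "work@1", "work@2", "fail@1", "work@1"]
idx = critical_transition((0, False), execution, random.Random(0))
print(idx, execution[idx])  # prints "1 fail@1"
```

For a 100,000-step execution this needs about 17 deadness tests instead of one per step.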

Recap

Liveness Bugs Found
- System has shot itself (but doesn't know it)

Systematic Search
- Finds candidate dead states

Random Walks
- Determine if candidate is dead

Critical Transition
- The event that makes recovery impossible

Bells and Whistles (1/2)

Random Walk Bias
• Assign “likely” events higher weight
• e.g. application > network > timer > fail

Bugs not missed
• Random walk only tests deadness

Live state reached sooner
• Error traces shorter, simpler
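A biased walk can be sketched with weighted sampling from the standard library. The slide gives only the ordering application > network > timer > fail; the numeric weights below are assumptions.

```python
import random
from collections import Counter

# Assumed weights; only their ordering comes from the slide.
WEIGHTS = {"application": 8, "network": 4, "timer": 2, "fail": 1}

def pick_event_kind(rng):
    """Pick an event category with probability proportional to its weight."""
    kinds = list(WEIGHTS)
    return rng.choices(kinds, weights=[WEIGHTS[k] for k in kinds], k=1)[0]

rng = random.Random(0)
counts = Counter(pick_event_kind(rng) for _ in range(10_000))
print(counts.most_common())
```

Every category keeps a nonzero weight, so failures still occur during walks, just less often than application events.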

Bells and Whistles (2/2)

Prefix-Based Search
• Restart search after reaching liveness
• Analyzes effect of failures in “steady-state”

Evaluation

[Figure: MaceMC takes a Mace (C++) System and Liveness Properties as input and reports Liveness Bugs and the Critical Transition]

Systems

RandTree: Random Overlay Tree with max degree.
MaceTransport: User-level, reliable messaging service.
Pastry: Key-based routing, using an overlay ring.
Chord: Key-based routing, using an overlay ring.

Liveness Properties

RandTree: Eventually, all nodes form single tree.
MaceTransport: Eventually, all messages acknowledged.
Pastry: Eventually, all nodes form a ring.
Chord: Eventually, all nodes form a ring.

Sample Bug: RandTree

Nodes with Child, Parent pointers

Property: Eventually nodes form tree

[Figure: nodes A, B, C]

1. C requests to join under A
2. A sends ack
3. C fails and restarts
4. C ignores ack from A
5. C joins under B

Bug: System stuck as a DAG!
C’s failure not propagated to A
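The stuck state can be made concrete with a small property check over parent and child pointers. The slides do not show the full topology, so the sketch assumes B is a child of A; the dictionaries and the function `forms_single_tree` are hypothetical and are not RandTree code.

```python
def forms_single_tree(parent, children):
    """True if parent/child pointers agree and describe one rooted tree."""
    roots = [n for n, p in parent.items() if p is None]
    if len(roots) != 1:
        return False
    # Every child pointer must be matched by the child's parent pointer.
    for p, kids in children.items():
        for c in kids:
            if parent.get(c) != p:
                return False
    # Every parent pointer must be matched by a child pointer.
    for c, p in parent.items():
        if p is not None and c not in children.get(p, set()):
            return False
    # Every node must reach the root via parent pointers (no cycles).
    for n in parent:
        cur, seen = n, set()
        while parent[cur] is not None:
            if cur in seen:
                return False
            seen.add(cur)
            cur = parent[cur]
    return True

# Assumed healthy outcome: C ends up under B, and B is under A.
parent = {"A": None, "B": "A", "C": "B"}
healthy_children = {"A": {"B"}, "B": {"C"}, "C": set()}

# Stuck state: A never learned that C failed, so A still lists C as a child.
stuck_children = {"A": {"B", "C"}, "B": {"C"}, "C": set()}

print(forms_single_tree(parent, healthy_children))  # True
print(forms_single_tree(parent, stuck_children))    # False
```

In the stuck state C is claimed by both A and B, which is the DAG from the slide; no later event removes A's stale child pointer, so the property never becomes true.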

Liveness Bugs Yield Safety Assertions

Dead States
- Violations of a priori unknown safety properties

Critical Transition
- Helps identify dead states
- Yields new safety properties and bugs

New Safety Property: Chord

Nodes with Fwd, Back pointers

Property: Eventually nodes form a ring

Critical Transition To Dead State
Where: n.back = n, n.fwd = m

New Safety Property
IF n.back = n THEN n.fwd = n
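Once discovered, the new property is cheap to check as an ordinary safety assertion. The sketch below assumes nodes are stored as a dictionary of `fwd` and `back` pointers; that representation, the function name, and the pointers given to node `m` are illustrative assumptions.

```python
def violations(nodes):
    """Return the nodes that break: IF n.back == n THEN n.fwd == n."""
    return [n for n, ptrs in nodes.items()
            if ptrs["back"] == n and ptrs["fwd"] != n]

# A node alone in the ring points to itself in both directions: allowed.
alone = {"n": {"fwd": "n", "back": "n"}}

# The dead state from the slide: n.back = n but n.fwd = m.
# The pointers of m are made up for the example.
dead = {"n": {"fwd": "m", "back": "n"},
        "m": {"fwd": "n", "back": "n"}}

print(violations(alone))  # []
print(violations(dead))   # ['n']
```

A safety check like this fails at the critical transition itself, rather than 100,000 events later.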

Scorecard

System          Bugs  Liveness  Safety
MaceTransport    11      5        6
RandTree         17     12        5
Pastry            5      5        0
Chord            19      9       10
Totals           52     31       21

Several “protocol level” bugs
Routinely used by Mace programmers

Programming Challenges

How to handle unexpected events?

How to propagate effects of failures?

How to limit impact on performance?

Take Away Message

Liveness Bugs Are Very Important
Randomness Helps.

www.macesystems.org (papers, code, etc.)
