Transcript
Page 1: Breakpoints and Halting in Distributed Systems

Breakpoints and Halting in Distributed Systems

Presented by

Abhishek Saxena

CS 739 Distributed Systems

Spring 2002

References

• Detecting Relational Global Predicates in Distributed Systems by Alexander I. Tomlinson and Vijay K. Garg, 1993

• Breakpoints and Halting in Distributed Programs by Barton P. Miller and Jong-Deok Choi, 1992

• Restoring Consistent Global States of Distributed Computations by Goldberg et al., 1991

Presentation Layout

• Introduction
• Motivation
• Halting in Distributed Systems
• Detecting Breakpoints for:
– Conjunctive/Disjunctive/Linked Predicates
– Relational Predicates
• Applications to Research
• Relevance to papers read
• Conclusions

Introduction

• General problems of:
– Halting distributed programs
– Detecting breakpoints
– Validating resource conflicts
– Recording, restoration and replay of program sequences

Motivation

• Why halt?
– Interactive debugging
– Issues in distributed systems:
  • No single global notion of time
  • Unpredictable communication delays
  • How to issue an instant command to all processes?
  • How can a command reach all processes simultaneously?

Halting

• 2 pertinent questions:
– How to halt a distributed program?
  • Halting Algorithm
– When to halt?
  • Breakpoint Detection

Halting Algorithm

• Extends Chandy & Lamport’s algorithm
• Sending rule:
– Increment last_halt_id
– Send a halt marker containing this value on all outgoing channels
• Receiving rule:
– Compare the marker's halt_id with own last_halt_id & update
– Send halt markers like the sender
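The two rules can be sketched as a small simulation. This is a minimal sketch, not the presented implementation: the `Process` class, the shared `network` queue standing in for FIFO channels, and the choice to halt a process at the moment a newer marker arrives are all assumptions made for illustration.

```python
from collections import deque


class Process:
    """Hypothetical process model; only the halting state is represented."""

    def __init__(self, name):
        self.name = name
        self.last_halt_id = 0
        self.halted = False
        self.outgoing = []  # processes at the far end of outgoing channels

    def initiate_halt(self, network):
        # Sending rule: increment last_halt_id, then send a halt marker
        # carrying that value on every outgoing channel.
        self.last_halt_id += 1
        self.halted = True
        self._send_markers(network)

    def receive_marker(self, halt_id, network):
        # Receiving rule: compare the marker's halt_id with last_halt_id;
        # a newer id updates it, halts the process, and is forwarded.
        if halt_id > self.last_halt_id:
            self.last_halt_id = halt_id
            self.halted = True
            self._send_markers(network)

    def _send_markers(self, network):
        for dest in self.outgoing:
            network.append((dest, self.last_halt_id))


def deliver_all(network):
    """Deliver queued markers in FIFO order until none remain."""
    while network:
        dest, halt_id = network.popleft()
        dest.receive_marker(halt_id, network)


p, q, r, x = (Process(n) for n in "PQRX")
p.outgoing = [q, r]
q.outgoing = [p, r]   # cycle P -> Q -> P: the repeated marker is ignored
x.outgoing = [p]      # X only sends; no channel leads into X

network = deque()
p.initiate_halt(network)
deliver_all(network)
print([(proc.name, proc.halted) for proc in (p, q, r, x)])
```

In this toy topology P, Q and R halt, while X never receives a marker because no channel leads into it.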

The Halting Algorithm

[Figure: halt markers propagating from sending process P to receiving process Q and among processes R, S, T and U.]

The Halting Algorithm

• Intuitive extension to Chandy & Lamport’s Algorithm [1]
• Leads to a global consistent state since:
– Process states same as recorded process states in [1]
– Undelivered messages same as recorded channel states in [1]

Problems with this Algorithm

• Processes that infrequently interact with other computation processes
• Long halting time
• Acyclic network connection

[Figure: producer P and consumer Q connected by a communication channel.]

A Solution…

• Centralized debugger process:

[Figure: debugger process d connected to processes p and q.]

Problems with this solution

• Communication overheads

• Possible change in execution of program

• Complex to build

Detecting Breakpoints

• Breakpoints & Predicates

• Predicate satisfaction = breakpoint detection

• A system of distributed processes needs:
– Simple predicates
– Disjunctive predicates
– Linked predicates…interesting!
– Conjunctive predicates…very interesting!

Simple Predicates

• Encapsulate single process behavior

• Detect simple events:
– Entered procedure
– Message sent / received
– Channel created / destroyed
– Process created / destroyed

Disjunctive predicates

• Form:

DP ::= SP [ ∪ SP ]*

• Satisfied when any SP is satisfied

• Initiate halting when DP is true
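As a rough illustration, a simple predicate can be modelled as a function over a single event, and a disjunctive predicate as "any of these". The event tuples `(process, kind, detail)` and the helper names below are hypothetical, chosen only for this sketch.

```python
def simple_predicate(process, kind, detail):
    """Return an SP that matches exactly one (process, kind, detail) event."""
    return lambda event: event == (process, kind, detail)


def disjunctive(*sps):
    """Return a DP that is satisfied as soon as any of its SPs is satisfied."""
    return lambda event: any(sp(event) for sp in sps)


dp = disjunctive(
    simple_predicate("P", "enter", "foo"),   # P entered procedure foo
    simple_predicate("Q", "send", "m1"),     # Q sent message m1
)

trace = [("R", "enter", "foo"), ("Q", "send", "m1"), ("P", "enter", "foo")]

# Index of the first event at which halting would be initiated.
first_hit = next(i for i, event in enumerate(trace) if dp(event))
print(first_hit)
```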

Linked Predicates

• Specify sequences of events

• Form:

LP ::= DP [ ->DP ]*

• Debugger process sends the LP {DP1->...} to processes involved in DP1

• Upon DP1, strip off DP1 & send stripped LP to processes involved in DP2
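The strip-and-forward idea can be sketched on a single, already ordered event trace. This is a simplification under stated assumptions: each DP is represented as a set of hypothetical event labels, and the forwarding between processes is replaced by a loop over one trace, so the distributed message passing itself is not modelled.

```python
def advance(lp, event):
    """Return the linked predicate that remains after observing `event`."""
    if lp and event in lp[0]:
        # The head DP is satisfied: strip it and carry on with the rest.
        return lp[1:]
    return lp


def detect(lp, events):
    """Return the index of the event that completes `lp`, or None."""
    for index, event in enumerate(events):
        lp = advance(lp, event)
        if not lp:
            return index  # last DP satisfied: this is where halting starts
    return None


# LP = DP1 -> DP2, with DP1 = {P enters foo, Q enters foo}, DP2 = {R sends m}
lp = [{"P:enter foo", "Q:enter foo"}, {"R:send m"}]
events = ["R:send m", "Q:enter foo", "S:recv m", "R:send m"]

halt_index = detect(lp, events)
print(halt_index)
```

The first `R:send m` is ignored because DP1 has not yet been satisfied at that point; only the occurrence after `Q:enter foo` completes the sequence.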

Linked predicates’ implementation

[Figure: the debugger process sends DP1->DP2 to the processes involved in DP1; once DP1 is satisfied, DP2 is sent to the processes involved in DP2, which start halting when DP2 is satisfied. Processes shown: P, Q, R, S, T.]

Conjunctive Predicates

• Form:

CP ::= SP [ ∩ SP ]*

• Hardest to detect!
• No single time reference across machines
• Interpretation based on virtual time:
– Consider processes P1, P2 with virtual time axes T1, T2
– Define

SCP = { (t1, t2) | t1 ∈ T1, t2 ∈ T2, SP1(t1) ∩ SP2(t2) }

Conjunctive predicates

• Split SCP into:
– Ordered-SCP:

{ (t1, t2) | (t1, t2) ∈ SCP, ((SP1)i -> (SP2)j) ∪ ((SP2)i -> (SP1)j) }

– Unordered-SCP:

{ (t1, t2) | (t1, t2) ∈ SCP, (t1, t2) ∉ ordered-SCP }
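One way to make the split concrete is to timestamp events with vector clocks and test the happened-before relation on each pair. Vector clocks are an assumption of this sketch, not something the slides prescribe, and the timestamps below are invented for a two-process run.

```python
from itertools import product


def happened_before(a, b):
    """Vector-clock test for a -> b: a <= b component-wise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def split_scp(sp1_times, sp2_times):
    """Split all (t1, t2) pairs into ordered-SCP and unordered-SCP."""
    ordered, unordered = [], []
    for t1, t2 in product(sp1_times, sp2_times):
        if happened_before(t1, t2) or happened_before(t2, t1):
            ordered.append((t1, t2))
        else:
            unordered.append((t1, t2))
    return ordered, unordered


# Times at which SP1 held on P1 and SP2 held on P2 (hypothetical run:
# P1 sends a message after (1, 0); P2 receives it and reaches (2, 2)).
sp1_times = [(1, 0)]
sp2_times = [(0, 1), (2, 2)]

ordered, unordered = split_scp(sp1_times, sp2_times)
print(ordered, unordered)
```

Here `((1, 0), (2, 2))` is an ordered pair because a message links the two events, while `((1, 0), (0, 1))` is unordered because neither event can have influenced the other.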

Conjunctive Predicates

[Figure: two virtual time axes with events t11, t12, t13 and t21, t22, t23, marking one unordered-SCP pair and one ordered-SCP pair.]

Conjunctive Predicates

• Detecting unordered-SCP events difficult

• Requires:
– Global information gathering process
– Time delay!
– Cannot preserve meaningful process states

Detecting Relational Global Predicates

• Resource conflict validation problems undetectable by earlier predicate classes

• Form:

( x0 + … + xn > C )

– xi: resource usage at Pi
– C: total resource available

• Undecomposable into earlier classes of predicates

How to detect such predicates?

• 2 algorithms:
– Decentralized: runs concurrently with the target program
– Centralized: decoupled from the target program

Model & Notation

• Partial ordering on S = { S0, …, Sn } where, Si <= Sj, for 0 <= i,j <= n

• Happens-before relation: “->”

• pred.u.i: Intuitively, the state just preceding u in process i

• succ.u.i: The state just succeeding u in process i

Concurrent States & Intervals

[Figure: processes P and Q with their local states (labelled 2, 3, 4 and 9, 10, 11), deterministic and non-deterministic events, a state interval and a receive interval.]

Concurrent Intervals

[Figure: intervals (0, lo0), (0, i), (0, hi0) on P0 and (1, lo1), (1, j), (1, hi1) on P1, with KEY and the pred relation marked.]

Concurrent Intervals

• Intervals (0, i) & (1, j) are concurrent iff KEY exists in P0 or P1 s.t.

lo0 < i <= hi0 & lo1 < j <= hi1,

where lo0, lo1, hi0, hi1 are as defined by the previous diagram

Overview of algorithms

• Gather information
– What?
– How?

• Consider 2 processes P0 & P1

• Gather concurrent interval sequences:
– { lo0 to hi0 } at P0 & { lo1 to hi1 } at P1

• Check resource violations at all possible pairs of states in these sequences!!
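The last step, taken literally, is a brute-force check over every pair of states drawn from the two concurrent sequences. The sketch below assumes the resource usage of each state has already been collected into two plain lists (`xs0` for P0, `xs1` for P1); the names and the sample numbers are illustrative only.

```python
from itertools import product


def violates(xs0, xs1, capacity):
    """True if some pair of concurrent states satisfies x0 + x1 > capacity."""
    return any(x0 + x1 > capacity for x0, x1 in product(xs0, xs1))


# Resource usage in the states of the concurrent sequences at P0 and P1.
xs0 = [2, 5, 3]
xs1 = [1, 4]

print(violates(xs0, xs1, capacity=8))   # the pair (5, 4) exceeds 8
print(violates(xs0, xs1, capacity=9))   # no pair exceeds 9
```

The check examines len(xs0) * len(xs1) pairs, which is the cost the clock-based algorithms on the following slides try to reduce.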

Algorithms contd…

• Representation of

(0, lo0) (0, hi0)
(1, lo1) (1, hi1)

as a 2x2 matrix clock
• Row i of Pi’s matrix clock = Pi’s vector clock
• Current interval at Pk = (k, Mk[k, k])
• Row k of Mk … pred() of current interval at Pk
• Row i<>k … pred.pred() of current interval at Pk

Maintaining Matrix Clocks

• Initialize
– Initialize matrix to 0
– If k=0 or k=1, Mk[k, k]++
• Send message tagged with Mk[., .]; increment Mk[k, k] for k = 0 or 1
• Upon message receive, update matrix clock; increment Mk[k, k];
– Mk[k, .] = diagonal(Mk)
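These maintenance rules can be sketched for two processes as follows. One step is an assumption: "update matrix clock" is taken here to mean a component-wise maximum with the matrix carried by the message. The class and method names are hypothetical.

```python
class MatrixClock:
    """2x2 matrix clock for process k (k is 0 or 1)."""

    def __init__(self, k):
        self.k = k
        self.m = [[0, 0], [0, 0]]   # initialize matrix to 0
        self.m[k][k] += 1           # then Mk[k, k]++

    def send(self):
        # Tag the outgoing message with a copy of Mk, then increment Mk[k, k].
        tag = [row[:] for row in self.m]
        self.m[self.k][self.k] += 1
        return tag

    def receive(self, tag):
        k = self.k
        # Assumed update step: component-wise maximum with the received tag.
        self.m = [[max(a, b) for a, b in zip(own, other)]
                  for own, other in zip(self.m, tag)]
        self.m[k][k] += 1
        # Row k becomes the diagonal of Mk.
        self.m[k] = [self.m[i][i] for i in range(2)]


p0, p1 = MatrixClock(0), MatrixClock(1)

tag_from_p1 = p1.send()       # P1 sends to P0
p0.receive(tag_from_p1)
after_first_receive = [row[:] for row in p0.m]

tag_from_p0 = p0.send()       # P0 sends to P1
p1.receive(tag_from_p0)

print(after_first_receive, p0.m, p1.m)
```

Under that assumption the run produces [2 1; 0 1] and then [3 1; 0 1] at P0, and [0 0; 0 2] and then [2 1; 2 3] at P1, which appear to be the matrices shown on the Matrix Clock Example slide.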

Matrix Clock Example

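[Figure: matrix clocks at P0 and P1 as messages are exchanged. P0: [1 0; 0 0], [2 1; 0 1], [3 1; 0 1]. P1: [0 0; 0 1], [0 0; 0 2], [2 1; 2 3]. Message tags: [0 0; 0 1] and [2 1; 0 1].]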

Decentralized Algorithm

• Consider process P0

• Upon message receive, evaluate lo0, lo1, hi0, hi1

• Find min value of resource (x) at P0

• Send debug message (min_x0, lo1, hi1) to P1

• P1 detects the predicate:

(min_x0 + min_x1 > C)

Overheads & Complexity at P0

• Message overheads:
– (# of receive intervals at P0) * sizeof(3 integers) … debug messages
– sizeof(4 integers) … application messages
• Memory:
– # intervals at P0; min_x for each interval
• Computation:
– (# intervals at P0) * (# debug messages sent + received)

Centralized Algorithm

• Checker process runs concurrently or post-mortem
• Consider the latter: processes P0 & P1
– Processes keep trace files containing:
  • min_x for each interval
  • an array of {lo0, lo1, hi0, hi1} for each interval
– Runs a check algorithm
  • Builds heaps by inserting the min_x values for all concurrent interval sequences at P0 & P1
  • Uses these heap-tops to detect the predicate

Overheads & Complexity for P0

• Memory:
– 4 integers for the matrix clock at each application process
• Computation:
– Monitor local variables
– Rest offloaded to checker
– O(R0 + M0 log M0 + M1 log M1)

where R0 & M0 = # receive intervals & total intervals at P0

Major Practical Problems

• Reduced complexity from exponential to O(n log n) but still…

• Large overheads even for 2 processes

• Lots of messages!

• Lots of memory space!

• Lots of computation!

Applications to Research

• Development of distributed debugging environment
– Recording of execution sequences
– Rollback
– Replay
– Exploration of new execution scenarios

• Command of mission-control distributed systems

Relevance to Papers Read

• The S/Net’s Linda kernel:
– Debugging distributed tuple space
– Detecting race conditions, deadlocks, probe effects

• Chandy & Lamport’s paper explores the detection of stable predicates and Garg’s paper explores unstable predicate detection

Conclusions

• Distributed debugging still challenging

• No efficient algorithm

• Hard to do away with overheads

• Need for efficient event monitoring & manipulation tools

• Message sequence chart generators

• Program flow analysis for more independent program splitting