Logical Clocks

Ken Birman

Time: A major issue in distributed systems

We tend to casually use temporal concepts

Example: “p suspects that q has failed”

Implies a notion of time: first q was believed correct, later q is suspected faulty

Challenge: relating local notion of time in a single process to a global notion of time

Discuss this issue before developing practical tools for dealing with other aspects, such as system state

Time in Distributed Systems

Three notions of time:

Time seen by external observer: a global clock of perfect accuracy

Time seen on clocks of individual processes: each has its own clock, and clocks may drift out of sync

Logical notion of time: event a occurs before event b and this is detectable because information about a may have reached b.

External Time

The “gold standard” against which many protocols are defined

Not implementable: no system can avoid uncertain details that limit temporal precision!

Use of external time is also risky: many protocols that seek to provide properties defined by external observers are extremely costly and, sometimes, are unable to cope with failures

Time seen on internal clocks

Most workstations have reasonable clocks

Clock synchronization is the big problem (will visit topic later in course): clocks can drift apart and resynchronization, in software, is inaccurate

Unpredictable speeds a feature of all computing systems, hence can’t predict how long events will take (e.g. how long it will take to send a message and be sure it was delivered to the destination)

Logical notion of time

Has no clock in the sense of “real-time”

Focus is on definition of the “happens before” relationship: “a happens before b” if:

both occur at same place and a finished before b started, or

a is the send of message m, b is the delivery of m, or

a and b are linked by a chain of such events

Logical time as a time-space picture

[Figure: time-space diagram with processes p0, p1, p2, p3 and events a, b, c, d]

a, b are concurrent

c happens after a, b

d happens after a, b, c

Notation

Use “arrow” to represent happens-before relation

For previous slide: a → c, b → c, c → d; hence a → d, b → d

a, b are concurrent

Also called the “potential causality” relation

Logical clocks

Proposed by Lamport to represent causal order

Write: LT(e) to denote logical timestamp of an event e, LT(m) for a timestamp on a message, LT(p) for the timestamp associated with process p

Algorithm ensures that if a → b, then LT(a) < LT(b)

Algorithm

Each process maintains a counter, LT(p)

For each event other than message delivery: set LT(p) = LT(p)+1

When sending message m: set LT(m) = LT(p)

When delivering message m to process q: set LT(q) = max(LT(m), LT(q))+1
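The three rules above can be sketched as a small class. This is a minimal illustration, not a reference implementation: the class and method names are invented here, and it assumes that sending a message counts as an event, so the counter ticks before the message is stamped.

```python
class LamportClock:
    """Logical clock for a single process (names are illustrative)."""

    def __init__(self):
        self.time = 0  # LT(p)

    def local_event(self):
        # Any event other than a message delivery: LT(p) = LT(p) + 1
        self.time += 1
        return self.time

    def send(self):
        # Assumption: a send is itself an event, so tick first.
        # The returned value is LT(m), the stamp carried by the message.
        self.time += 1
        return self.time

    def deliver(self, msg_time):
        # LT(q) = max(LT(m), LT(q)) + 1
        self.time = max(msg_time, self.time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
p.local_event()            # LT(p) becomes 1
stamp = p.send()           # LT(p) becomes 2, and LT(m) = 2
after = q.deliver(stamp)   # LT(q) = max(2, 0) + 1 = 3
```

Because the delivery rule takes the maximum and then adds one, the delivery always carries a larger timestamp than the send, which is what gives LT(a) < LT(b) along any chain of messages.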

Illustration of logical timestamps

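[Figure: time-space diagram for processes p0, p1, p2, p3 with events labeled by logical timestamps. Timestamp sequences shown: 0 1 2 7; 0 1 6; 0 2 3 4 5 6; 0 1]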

Concurrent events

If a, b are concurrent, LT(a) and LT(b) may have arbitrary values!

Thus, logical time lets us determine that a potentially happened before b, but not that a definitely did so!

Example: processes p and q never communicate. Both will have events 1, 2, ... but even if LT(e) < LT(e’), e may not have happened before e’

Vector timestamps

Extend logical timestamps into a list of counters, one per process in the system

Again, each process keeps its own copy

Event e occurs at process p: p increments VT(p)[p] (p’th entry in its own vector clock)

q receives a message from p: q sets VT(q)=max(VT(q),VT(p)) (element-by-element)
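A rough sketch of these update rules follows. The class and method names are hypothetical. It also makes one assumption the slide does not state: a receive is treated as an event at q, so q increments its own entry after taking the element-by-element maximum.

```python
class VectorClock:
    """Vector clock for process `me` in a fixed group of `n` processes."""

    def __init__(self, n, me):
        self.me = me
        self.vt = [0] * n  # VT(p), one counter per process

    def event(self):
        # Event at this process: increment our own entry.
        self.vt[self.me] += 1
        return list(self.vt)  # copy, so callers cannot mutate our state

    def send(self):
        # Assumption: a send is an event; the copy travels on the message.
        return self.event()

    def receive(self, msg_vt):
        # Element-by-element maximum of the local and the message vector.
        self.vt = [max(a, b) for a, b in zip(self.vt, msg_vt)]
        # Assumption: the receive counts as an event at the receiver.
        return self.event()


p0 = VectorClock(4, 0)
p1 = VectorClock(4, 1)
first = p0.event()      # [1, 0, 0, 0]
msg = p0.send()         # [2, 0, 0, 0]
got = p1.receive(msg)   # [2, 1, 0, 0]
```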

Illustration of vector timestamps

[Figure: time-space diagram with vector timestamps. p0: [1,0,0,0], [2,0,0,0]; p1: [2,1,1,0], [2,2,1,0]; p2: [0,0,1,0]; p3: [0,0,0,1]]

Vector timestamps accurately represent happens-before relation

Define VT(e)<VT(e’) if, for all i, VT(e)[i]≤VT(e’)[i], and for some j, VT(e)[j]<VT(e’)[j]

Example: if VT(e)=[2,1,1,0] and VT(e’)=[2,3,1,0] then VT(e)<VT(e’)

Notice that not all VT’s are “comparable” under this rule: consider [4,0,0,0] and [0,0,0,4]
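The comparison can be written directly as code. The sketch below takes the rule to be "no entry larger, and at least one entry strictly smaller", which is the reading that agrees with both examples above; the function names are illustrative.

```python
def vt_less(a, b):
    """True if vector timestamp a is strictly before b."""
    no_entry_larger = all(x <= y for x, y in zip(a, b))
    some_entry_smaller = any(x < y for x, y in zip(a, b))
    return no_entry_larger and some_entry_smaller


def concurrent(a, b):
    """True if a and b differ and neither is before the other."""
    return a != b and not vt_less(a, b) and not vt_less(b, a)


ordered = vt_less([2, 1, 1, 0], [2, 3, 1, 0])          # True
incomparable = concurrent([4, 0, 0, 0], [0, 0, 0, 4])  # True
```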

Vector timestamps accurately represent happens-before relation

Now can show that VT(e)<VT(e’) if and only if e → e’:

If e → e’, then there exists a chain e0 → e1 → ... → en on which vector timestamps increase “hop by hop”

If VT(e)<VT(e’), it suffices to look at VT(e’)[proc(e)], where proc(e) is the place that e occurred. By definition, we know that VT(e’)[proc(e)] is at least as large as VT(e)[proc(e)], and by construction, this implies a chain of events from e to e’

Examples of VT’s and happens-before

Example: suppose that VT(e)=[2,1,0,1] and VT(e’)=[2,3,0,1], so VT(e)<VT(e’)

How did e’ “learn” about the 3 and the 1?

Either these events occurred at the same place as e’, or

some chain of send/receive events carried the values!

If VT’s are not comparable, the corresponding events are concurrent

Notice that vector timestamps require a static notion of system membership

For vector to make sense, must agree on the number of entries

Later will see that vector timestamps are useful within groups of processes

Will also find ways to compress them and to deal with dynamic group membership changes

What about “real-time” clocks?

Accuracy of clock synchronization is ultimately limited by uncertainty in communication latencies

These latencies are “large” compared with speed of modern processors (typical latency may be 35us to 500us, time for thousands of instructions)

Limits use of real-time clocks to “coarse-grained” applications

Interpretations of temporal terms

Understand now that “a happens before b” means that information can flow from a to b

Understand that “a is concurrent with b” means that there is no information flow between a and b

What about the notion of an “instant in time”, over a set of processes?

Neither clock is appropriate

Problem is that with both clocks, there can be many events that are concurrent with a given event

Leads to a philosophical question: event e has happened at process p. Which events are “really” simultaneous with e?

Perspectives on logical time

One view is based on intuition from physics

Imagine a time-space diagram

Cones of causality define past and future

“Now” is any consistent cut across the system, including no future events and no past events

Next Tuesday will see algorithms based on this

Causal notions of past, future

[Figure: time-space diagram with processes p0, p1, p2, p3 and events a through g; one region is labeled FUTURE]

Causal notions of past, future

[Figure: time-space diagram with processes p0, p1, p2, p3 and events a through g; one region is labeled PAST]

Issues raised by time

Time is a tool

Typical uses of time?

To put events into some sort of order

Example: the order of updates on a replicated data item

With one item, logical time may make sense

With multiple items, consider VT with one element per item

Ways to extend time to a total order

Often extend a logical timestamp or vector timestamp with actual clock time when the event occurred and process id where it occurred

Combination breaks any possible ties

Or can use event “names”
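One way to picture the tie-breaking is as a lexicographic sort key. The sketch below is only illustrative: the `Stamp` type, its field names, and the sample values are invented for the example.

```python
from typing import NamedTuple


class Stamp(NamedTuple):
    logical: int   # logical timestamp of the event
    wall: float    # local clock reading when the event occurred
    pid: int       # id of the process where it occurred


# Hypothetical events: x and y tie on logical time and on clock time.
events = {
    "x": Stamp(3, 10.5, 2),
    "y": Stamp(3, 10.5, 1),
    "z": Stamp(2, 11.0, 7),
}

# Tuples compare field by field, so pid only matters when the rest ties.
order = sorted(events, key=lambda name: events[name])
# order is ["z", "y", "x"]
```

Since process ids are unique, two distinct events at different processes never compare equal, so the result is a total order that still respects the logical timestamps.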

An example

Suppose we are broadcasting messages

Atomic broadcast is:

Fault-tolerant: unless every process with a copy fails, the message is delivered everywhere (often expressed as all or nothing delivery)

Ordered: if p, q both receive m, n, either both receive m before n, or both receive n before m

How should we implement this policy?

Easy case

In many systems there is really just one source of broadcasts

Typically we see this pattern when there is really one reference copy of a replicated object and the replicas are viewed as cached copies

Accordingly we can use a FIFO ordered broadcast and reduce the problem to fault-tolerance

FIFO ordering simply requires a counter from sender
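A receiver-side sketch of FIFO ordering with a sender counter is shown below. It covers ordering only, not fault-tolerance, and the class name and the choice to start the counter at 0 are assumptions made for the example.

```python
class FifoReceiver:
    """Delivers messages from one sender in sequence-number order."""

    def __init__(self):
        self.next_seq = 0    # next sequence number we may deliver
        self.pending = {}    # messages that arrived early, keyed by seq
        self.delivered = []

    def receive(self, seq, payload):
        if seq < self.next_seq:
            return  # duplicate of something already delivered
        self.pending[seq] = payload
        # Deliver as long as the next expected message is available.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


r = FifoReceiver()
for seq, payload in [(1, "b"), (0, "a"), (2, "c")]:
    r.receive(seq, payload)
# r.delivered is ["a", "b", "c"] even though "b" arrived first
```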

A more complex example

Sender-ordered multicast

Sender places a timestamp in the broadcast

Receiver waits until it has full set of messages

Orders them by logical timestamp, breaks ties with sender-id

Then delivers in this order

How can it tell when it has the “full set”?


[Figure: two messages, m and n]

Deliver m,n or n,m?


Solution implicitly depends upon membership

In fact, most distributed systems depend upon membership

Membership is “the most fundamental” idea in many systems for this reason

Receiver can simply wait until all members have sent one message

System ends up running in rounds, where each member contributes zero or one messages per round

Use a “null” message if you have nothing to send
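The round-based rule can be sketched as a single function on the receiver. This is an illustration under simplifying assumptions: membership is a fixed, known list, a null message is represented by `None`, and the function and variable names are made up for the example.

```python
def deliver_round(members, received):
    """Return this round's delivery order, or None if still waiting.

    `received` maps sender id -> (logical timestamp, payload), where a
    payload of None stands for a null message.
    """
    if set(received) != set(members):
        return None  # some member has not contributed yet
    # Order by logical timestamp, breaking ties with the sender id.
    ordered = sorted(received.items(), key=lambda kv: (kv[1][0], kv[0]))
    return [payload for _, (_, payload) in ordered if payload is not None]


members = ["p", "q", "r"]
received = {"q": (1, "n"), "p": (1, "m")}
early = deliver_round(members, received)   # None: r is still missing

received["r"] = (2, None)                  # r had nothing to send
final = deliver_round(members, received)   # ["m", "n"]
```

Every receiver that sees the same set of messages for the round computes the same order, which is why the scheme depends on agreeing who the members are.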


[Figure: two messages, m and n]

Optimizations

We could agree in advance on “permission to send”

Now, perhaps only p, q have permission

We treat their messages in rounds but others must get permission before sending

Avoids all the null messages and ensures fairness if p, q send at same rate

Dolev: explored extensions for varied rates, gets quite elaborate…

Optimizations

In the limit, we end up with a token scheme

While holding the token, p has permission to send

If q requests the token p must release it (perhaps after a small delay)

Token carries the sequence number to use
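The role of the token as a carrier of the sequence number can be shown in a few lines. This sketch ignores token requests, release delays, and failures; `Token` and `send_with_token` are hypothetical names, and starting the sequence at 1 is an assumption.

```python
class Token:
    """The token holds the next sequence number to assign."""

    def __init__(self):
        self.next_seq = 1


def send_with_token(token, payload):
    # Only the current token holder may call this.
    stamped = (payload, token.next_seq)
    token.next_seq += 1
    return stamped


token = Token()
first = send_with_token(token, "m")    # p holds the token: ("m", 1)
# ... token is passed from p to q ...
second = send_with_token(token, "n")   # q holds the token: ("n", 2)
```

Receivers can then deliver in sequence-number order, since the single token guarantees that no two messages are given the same number.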


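[Figure: token scheme. Message m is sent with sequence number 1 (m:1); message n follows with sequence number 2 (n:2)]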

An example

Such solutions are expressed in many ways

With a ring: Chang and Maxemchuk; messages are like a “train” with new message tacked onto end and old ones delivered from front

Direct all-to-all broadcast

Like a token moving around the ring, but it carries the messages with it (inspired by FDDI)

Tree structured in various ways

More examples

Old Isis system uses logical clocks

Sender says “here is a message”

Receivers maintain logical clocks. Each proposes a delivery time

Sender gathers votes, picks maximum, says “commit delivery at time t”

Receivers deliver committed messages in timestamp order from front of a queue
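The propose-and-commit exchange can be sketched as follows. This is a simplified, failure-free illustration and not the Isis code: the `Receiver` class and its methods are invented here, proposals are modeled as (clock, process id) pairs so ties are broken by id, and message passing is replaced by direct method calls.

```python
class Receiver:
    def __init__(self, pid):
        self.pid = pid
        self.clock = 0
        self.queue = {}      # message -> [timestamp, committed?]
        self.delivered = []

    def propose(self, msg):
        # Propose a delivery time from the local logical clock.
        self.clock += 1
        ts = (self.clock, self.pid)
        self.queue[msg] = [ts, False]
        return ts

    def commit(self, msg, ts):
        # Adopt the sender's chosen time and mark the message committed.
        self.clock = max(self.clock, ts[0])
        self.queue[msg] = [ts, True]
        # Deliver from the front while the earliest message is committed.
        while self.queue:
            head = min(self.queue, key=lambda m: self.queue[m][0])
            if not self.queue[head][1]:
                break  # an earlier, uncommitted message blocks delivery
            self.delivered.append(head)
            del self.queue[head]


p, q, r = Receiver("p"), Receiver("q"), Receiver("r")
# m and n reach the receivers in different orders.
votes_m = [p.propose("m")]
votes_n = [q.propose("n")]
votes_n.append(p.propose("n"))
votes_m.append(q.propose("m"))
votes_m.append(r.propose("m"))
votes_n.append(r.propose("n"))

t_m, t_n = max(votes_m), max(votes_n)   # (2, "q") and (2, "r")
for node in (p, q, r):
    node.commit("n", t_n)   # commits may arrive in either order
    node.commit("m", t_m)
# every receiver ends with delivered == ["m", "n"]
```

A committed message is held back while any message with a smaller timestamp is still uncommitted, because that message's final time could still land ahead of it only if it were delivered first; the final time is never smaller than the proposal, so waiting is safe.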

More examples

[Figure: receivers propose delivery times for broadcasts m and n. p: m:[1,p], n:[2,p]; q: n:[1,q], m:[2,q]; r: m:[1,r], n:[2,r]]

More examples

[Figure: proposals p: m:[1,p], n:[2,p]; q: n:[1,q], m:[2,q]; r: m:[1,r], n:[2,r]. Commit times m:[2,q] and n:[2,r] now appear]

More examples

[Figure: proposals p: m:[1,p], n:[2,p]; q: n:[1,q], m:[2,q]; r: m:[1,r], n:[2,r], with commit times m:[2,q] and n:[2,r]. Each process then delivers m followed by n (m! n!)]

More examples

Later versions of Isis used vector times

Membership is handled separately

Each message is assigned a vector time

Delivered in vector time order, with ties broken using process id of the sender

Totem and Transis

These systems represent time using partial order information

Message m arrives and includes ordering fields:

Deliver m after n and o

By transitivity, if n is after p, then m is after p

Break ties using process id number
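Delivery under such “deliver after” fields amounts to a topological sort with a deterministic tie-break. The sketch below is not how Totem or Transis are implemented; it only illustrates the ordering rule. The function name, the `after` and `sender_id` maps, and the sample ids are assumptions, and every message named in a dependency is assumed to be present.

```python
import heapq


def delivery_order(after, sender_id):
    """Order messages so each follows everything it must come after.

    `after` maps message -> set of messages it must follow.
    `sender_id` maps message -> process id, used to break ties.
    """
    remaining = {m: set(deps) for m, deps in after.items()}
    ready = [(sender_id[m], m) for m, deps in remaining.items() if not deps]
    heapq.heapify(ready)
    order = []
    while ready:
        _, m = heapq.heappop(ready)   # lowest process id among ready ones
        order.append(m)
        for other, deps in remaining.items():
            if m in deps:
                deps.remove(m)
                if not deps:
                    heapq.heappush(ready, (sender_id[other], other))
    return order


# m follows n and o; n follows p (so m follows p by transitivity).
after = {"m": {"n", "o"}, "n": {"p"}, "o": set(), "p": set()}
sender_id = {"m": 1, "n": 2, "o": 3, "p": 4}   # hypothetical ids
order = delivery_order(after, sender_id)
# order is ["o", "p", "n", "m"]
```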

Totem and Transis

[Figure: ordering graph over messages m, n, o, p]

Things to notice

Time is just a programming tool

But membership and message atomicity are very fundamental

Waiting for m won’t work if m never arrives

And VT is only meaningful if we can agree on the meaning of the indices

With failures, these algorithms get surprisingly complicated: suppose p fails while sending m?

Major uses of time

To order updates on replicated data

To define versions of objects

To deal with processes that come and go in dynamic networked applications

Processes that joined earlier often have more complete knowledge of system state

Process that leaves and rejoins often needs some form of incrementing “incarnation number”

To prove correctness of complex protocols
