
1

Building Blocks for High-Performance, Fault-Tolerant Distributed Systems

Nancy Lynch
Theory of Distributed Systems
MIT

AFOSR project review
Cornell University
May, 2001


2

Project Participants

• Leaders: Nancy Lynch, Idit Keidar, Alex Shvartsman, Steve Garland

• PhD students: Victor Luchangco, Roger Khazan, Carl Livadas, Josh Tauber, Ziv Bar-Joseph, Rui Fan

• MEng: Rui Fan, Kyle Ingols, Igor Taraschanskiy, Andrej Bogdanov, Michael Tsai, Laura Dean

• Collaborators: Roberto De Prisco, Jeremy Sussman, Keith Marzullo, Danny Dolev, Alan Fekete, Gregory Chockler, Roman Vitenberg


3

Project Scope

• Define services to support high-performance distributed computing in dynamic environments: Failures, changing participants.

• Design algorithms to implement the services.

• Analyze algorithms: Correctness, performance, fault-tolerance.

• Develop necessary mathematical foundations: State machine models, analysis methods.

• Develop supporting languages and tools: IOA.


4

Talk Outline

I. View-oriented group communication services

II. Non-view-oriented group communication

III. Mathematical foundations

IV. IOA language and tools

V. Memory consistency models

VI. Plans


5

I. View-Oriented Group Communication Services


6

View-Oriented Group Communication Services

• Cope with changing participants using abstract groups of client processes with changing membership sets.

• Processes communicate with group members by sending messages to the group as a whole.

• GC services support management of groups (see the interface sketch below):
  – Maintain membership information, form views.
  – Manage communication.
  – Make guarantees about ordering and reliability of message delivery.

• Isis, Transis, Totem, Ensemble,…

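A minimal Java sketch of the client-side interface such a view-oriented GC service presents; all type and method names are illustrative, not taken from Isis, Transis, Totem, or Ensemble.

```java
import java.util.List;

/** A view: an identifier plus the current membership set. */
record View(long viewId, List<String> members) {}

/** Upcalls from the service to one client process. */
interface GcListener {
    void onView(View newView);                  // membership changed; new view installed
    void onDeliver(String sender, byte[] msg);  // reliable, ordered delivery within a view
}

/** Operations a client process invokes on the service. */
interface GcService {
    void join(String group, GcListener listener);  // become a member
    void leave(String group);                      // stop being a member
    void multicast(byte[] msg);                    // send to all members of the current view
}
```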


7

Using View-Oriented GC Services

• Advantages:
  – High-level programming abstraction
  – Hides complexity of coping with changes

• Disadvantages:
  – Can be costly, especially when forming new views.
  – May have problems scaling to large networks.

• Applications:
  – Managing replicated data
  – Distributed interactive games
  – Multi-media conferencing, collaborative work


8

Our Approach

• Mathematical, using state machines (I/O automata)

• Model everything:
  – Applications
  – Service specifications
  – Implementations of the services

• Prove correctness
• Analyze performance, fault-tolerance

[Figure: application processes composed with per-node algorithm automata implementing a service]


9

Our Earlier Work: VS [Fekete, Lynch, Shvartsman 97, 01]

• Defined automaton models for:

– VS, partitionable GC service, based on Transis

– TO, non-view-oriented totally ordered bcast service

– VStoTO, algorithm based on [Amir, Dolev, Keidar, Melliar-Smith, Moser]

• Proved correctness

• Analyzed performance, fault-tolerance: conditional performance analysis

[Figure: VStoTO processes layered over the VS service to implement the TO service; actions include bcast, brcv, gpsnd, gprcv, newview]


10

Conditional Performance Analysis

• Assume VS satisfies:
  – If a network component C stabilizes, then soon thereafter, views known within C become consistent, and messages sent in the final view are delivered everywhere in C, within bounded time.

• And VStoTO satisfies:
  – Simple timing and fault-tolerance assumptions.

• Then TO satisfies:
  – If C stabilizes, then soon thereafter, any message sent or delivered anywhere in C is delivered everywhere in C, within bounded time.


11

Ensemble [Hickey, Lynch, van Renesse 99]

• Ensemble system [Birman, Hayden]: layered design

• Worked with developers, following VS.
• Developed global specs for key layers.
• Modeled the Ensemble algorithm spanning between layers.
• Tried a proof; found an algorithmic error.
• Modeled and analyzed the repaired system.
• The same error was found in Horus.


12

More Recent Progress

1. GC with unique primary views

2. Scalable GC

3. Optimistic Virtual Synchrony

4. GC service specifications


13

1. GC With Unique Primary Views


14

GC With Unique Primaries

• Dynamic View Service [De Prisco, Fekete, Lynch, Shvartsman 98]:
  – Produces unique primary views.
  – Copes with long-term changes.

• Dynamic Configuration Service [DFLS 99]:
  – Adds quorums.
  – Copes with long-term and transient changes.

• Dynamic Leader Configuration Service [D 99], [DL 01]:
  – Adds leaders.


15

GC With Unique Primaries

• Algorithms to implement the services:
  – Based on the dynamic voting algorithm of [Yeger-Lotem, Keidar, Dolev 97].
  – Each primary needs a majority of all possible previous primaries (see the sketch below).
  – Models, proofs, …

• Applications:
  – Consistent total order and consistent replicated data algorithms that tolerate both long-term and transient changes.
  – Models, proofs, …
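A minimal sketch of the majority rule just stated, assuming the set of "possible previous primaries" (the last established primary plus any pending attempts) is already tracked; illustrative only, not the full bookkeeping of [Yeger-Lotem, Keidar, Dolev 97].

```java
import java.util.Collection;
import java.util.Set;

final class DynamicVoting {
    /** True iff candidate contains a strict majority of the members of prev. */
    static boolean hasMajorityOf(Set<String> candidate, Set<String> prev) {
        long common = prev.stream().filter(candidate::contains).count();
        return 2 * common > prev.size();
    }

    /** A candidate view may become primary only if it holds a majority
     *  of every possible previous primary. */
    static boolean mayFormPrimary(Set<String> candidate,
                                  Collection<Set<String>> possiblePrevPrimaries) {
        return possiblePrevPrimaries.stream()
                .allMatch(prev -> hasMajorityOf(candidate, prev));
    }
}
```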


16

Availability of Unique Primary Algorithms [Ingols, Keidar 01]

• Simulation study comparing unique primary algorithms:
  – [Yeger-Lotem, Keidar, Dolev], [DFLS]
  – 1-pending, like [Jajodia, Mutchler]
  – Majority-resilient 1-pending, like [Lamport], [Keidar, Dolev]

• Simulate repeated view changes, interrupting other view changes.

• Availability shown to depend heavily on:
  – Number of processes from the previous view needed to form a new view.
  – Number of message rounds needed to form a view.

• [YKD], [DFLS] have the highest availability.


17

2. Scalable Group Communication [Keidar, Khazan 00]


18

Group Communication Service

• Manages group membership, current view.

• Multicast communication among group members, with ordering, reliability guarantees.

• Virtual Synchrony [Birman, Joseph 87]:
  – Integrates group membership and group communication.
  – Processes that move together from one view to another deliver the same messages in the first view (see the sketch below).
  – Useful for replicated data management.
  – Before announcing a new view, processes must synchronize and exchange messages.
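A minimal sketch of the synchronization step this implies, assuming each process moving to the new view reports which old-view messages it delivered; names and types are illustrative, not any cited system's algorithm.

```java
import java.util.*;

final class VsSynchronizer {
    /** Messages this process delivered in the old view, keyed by message id. */
    private final Map<Long, byte[]> delivered = new HashMap<>();

    void recordDelivery(long msgId, byte[] msg) { delivered.put(msgId, msg); }

    /** Merge the delivery reports of all processes moving together; anything
     *  we missed must be delivered before the new view is installed. */
    List<byte[]> missingBeforeNewView(Collection<Map<Long, byte[]>> reports) {
        List<byte[]> missing = new ArrayList<>();
        for (Map<Long, byte[]> report : reports)
            for (Map.Entry<Long, byte[]> e : report.entrySet())
                if (delivered.putIfAbsent(e.getKey(), e.getValue()) == null)
                    missing.add(e.getValue());   // we had not delivered this one
        return missing;
    }
}
```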


19

Example: Virtual Synchrony

[Figure: processes i, j, k in view 3: {i,j,k}; m is multicast and received by some members; i and j move together to view 4: {i,j}, so the VS algorithm supplies the missing m before the new view]


20

Group Communication in WANs

• Difficulties:
  – High message latency; message exchanges are expensive.
  – Frequent connectivity changes.

• New, scalable GC algorithm:
  – Uses the scalable GM service of [Keidar, Sussman, et al. 00], implemented on a small set of membership servers.
  – GC (with virtual synchrony) implemented on the clients.

[Figure: VSGC processes at the clients, layered over the GM service and the network, together implementing VS]


21

Group Communication in WANs

• Try to minimize time from when network stabilizes until GC delivers new views to clients.

• After stabilization: GM forms view, VSGC algorithm synchronizes.

• Existing systems (LANs):
  – GM and VSGC use several message exchange rounds.
  – They continue in spite of new network events.

• Inappropriate for WANs.

[Figure: timeline from a net event through the GM algorithm and then the VSGC algorithm to view(v)]


22

New Algorithm

• VSGC uses one message exchange round, in parallel with GM's agreement on views.
• GM usually delivers views in one message exchange.

• Responds to new network events during reconfiguration:
  – GM produces new membership sets.
  – VSGC responds to membership changes.

• Distributed implementation [Tarashchanskiy 00]

[Figure: timeline from a net event through the GM and VSGC algorithms, running in parallel, to view(v)]


23

Correctness Proofs

• Models, proofs (safety and liveness)

• Developed new incremental modeling, proof methods [Keidar, Khazan, Lynch, Shvartsman 00]

• Proof Extension Theorem:

• Used new methods for the safety proofs.

[Figure: Proof Extension Theorem diagram relating specifications S, S′ and algorithms A, A′]


24

Performance Analysis

• Analyze time from when network stabilizes until GC delivers new views to clients.

• System is a composition:
  – Network service, GM services, VSGC processes

• Compositional analysis:
  – Analyze the VSGC algorithm alone, in terms of its inputs and timing assumptions.
  – State reasonable performance guarantees for GM and the Network.
  – Combine to get conditional performance properties for the system as a whole.


25

Analysis of VSGC Algorithm

• Assume component C stabilizes:
  – GM delivers the same views to the VSGC processes.
  – Net provides reliable communication with latency δ.

• Let:
  – T[start], T[view] be the times of the last GM events for C.
  – x be an upper bound on local step time.

• Then VSGC outputs new views by time (transcribed below):

  max(T[start] + δ + x, T[view]) + δ
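A direct transcription of this bound, assuming the reconstruction above (δ the network latency, x the local step-time bound); a sketch of the arithmetic, not the paper's notation.

```java
final class VsgcBound {
    /** Latest time by which VSGC outputs the new view. */
    static double viewOutputDeadline(double tStart, double tView,
                                     double delta, double x) {
        return Math.max(tStart + delta + x, tView) + delta;
    }
}
```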


26

Analysis of VSGC Algorithm

[Figure: timeline — after a net event, the GM algorithm runs from its start events to view(v) at T[view]; the VSGC algorithm's exchange completes within δ + x of T[start]]


27

Assumed Bounds for GM

• Bounds for “Fast Path” of [Keidar, et al. 00], observed empirically in almost all cases.

[Figure: timeline of GM start events, ending at T[start], and the view(v) event at T[view]]


28

Combining VSGC and GM Bounds

• Bounds for the system, conditional on the GM bounds.

[Figure: timeline combining the GM bounds (start events at T[start], view(v) at T[view]) with the VSGC bound of δ + x, ending at the delivery of view(v)]


29

3. Optimistic Virtual Synchrony [Sussman, Keidar, Marzullo 00]

• Most GC algorithms block sending during reconfiguration.

• OVS service provides:
  – Optimistic view proposal, before reconfiguration.
  – Optimistic sends after the proposal, during reconfiguration.
  – Delivery of optimistic messages in the next view, subject to application policy.

• Useful for applications:
  – Replicated data management
  – State transfer
  – Sending vectors of data


30

4. GC Service Specifications [Chockler, Keidar, Vitenberg 01]

• Comprehensive set of specifications for properties guaranteed by GC services.

• Unifying framework.

• Safety properties:
  – Membership: view order, partitionable, primary component
  – Multicast: sending view delivery, virtual synchrony
  – Safe notifications
  – Ordering, reliability: FIFO, causal, totally ordered, atomic

• Liveness properties:
  – For eventually stable components: view stability, multicast delivery, safe notification liveness
  – For eventually stable pairs


31

II. Non-View-Oriented Group Communication

1. Totally Ordered Multicast with QoS

[Bar-Joseph, Keidar, Anker, Lynch 00, 01]


32

Totally Ordered Multicast with QoS

• Multicast to dynamic group, subject to joins, leaves, and failures.

• Global total ordering of messages.
• QoS: message delivery latency.
• Built on a reliable network with latency guarantees.
• Adds ordering guarantees, preserves latency bounds.

• Applications:
  – State machine replication
  – Distributed games
  – Shared editing


33

Two Algorithms

• Algorithm 1: Basic Totally Ordered Multicast
  – Sends and receives are consistent with a total ordering of messages.
  – Non-failing processes agree on messages from non-failing processes.
  – Latency: constant, even with joins, leaves, failures.

• Algorithm 2: Atomic Multicast
  – Non-failing processes agree on all messages.
  – Latency:
    • Joins, leaves only: constant
    • With failures: linear in f

[Figure: TOM service layered over the network, with fail_i, fail_j inputs]


34

Local Node Process

[Figure: components of the local node process at node i — FrontEnd_i, Memb_i, Sniffer_i, and Ord_i over the network; actions include mcast(m), rcv(m), join, leave, mcast(join), mcast(leave), progress(s,j), joiners(s,J), leavers(s,J), end-slot(s), members(s,J)]


35

Local Algorithm Operation

• FrontEnd divides time into slots, tags messages with slots.

• Ord delivers messages by slot, in order of process indices (see the sketch below).

• Memb determines slot membership:
  – Join, leave messages
  – Failures:
    • Algorithm 1 uses a local failure detector.
    • Algorithm 2 uses consensus on failures; requires a new, dynamic version of consensus.

• Timing-dependent.
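A minimal sketch of the slot-based delivery rule, assuming each message arrives tagged with its slot and sender index and that end-of-slot notifications arrive in slot order; names are illustrative.

```java
import java.util.*;

final class SlotOrderer {
    record Tagged(int slot, int senderIndex, String payload) {}

    // Buffer: slot number -> messages received so far for that slot.
    private final SortedMap<Integer, List<Tagged>> pending = new TreeMap<>();

    void receive(Tagged m) {
        pending.computeIfAbsent(m.slot(), s -> new ArrayList<>()).add(m);
    }

    /** Called when slot s ends and its membership is settled: deliver the
     *  slot's messages in order of sender index, so every correct process
     *  delivers them in the same global order. */
    List<Tagged> endSlot(int s) {
        List<Tagged> msgs = pending.remove(s);
        if (msgs == null) return List.of();
        msgs.sort(Comparator.comparingInt(Tagged::senderIndex));
        return msgs;
    }
}
```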


36

Architecture for Algorithm 2

[Figure: TO-QoS service implemented over the network and GM services]


37

2. Scalable Reliable Multicast Services [Livadas 01]


38

SRM [Floyd, et al.]

• Reliable multicast to a dynamic group.
• Built over IP multicast.
• Based on requests (NACKs) and retransmissions.
• Limits duplicate requests/retransmissions using (see the timer sketch below):
  – Deterministic suppression: ancestors suppress descendants, by scheduling requests/replies based on distance to the source.
  – Probabilistic suppression: siblings suppress each other, by spreading out requests/replies.
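For concreteness, the SRM-style timer rule behind both mechanisms: a receiver that detects a loss schedules its request uniformly in [C1·d, (C1+C2)·d], where d is its estimated one-way delay to the source, and suppresses the request if it overhears another for the same data first. A sketch, with typical constant values:

```java
import java.util.Random;

final class SrmRequestTimer {
    static final double C1 = 2.0, C2 = 2.0;  // protocol constants; tunable
    private final Random rng = new Random();

    /** Delay before multicasting a NACK, given estimated one-way delay d
     *  (seconds) to the source. Nearer receivers fire first (deterministic
     *  suppression); the random term spreads out siblings (probabilistic
     *  suppression). */
    double requestDelay(double d) {
        return C1 * d + C2 * d * rng.nextDouble();  // uniform in [C1*d, (C1+C2)*d]
    }
}
```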


39

SRM Architecture

[Figure: SRM layered over IP multicast]


40

New Protocol

• Inspired by SRM.
• Assumes future losses occur on the same link (locality).
• Uses deterministic suppression for siblings.
• Elects and caches the best requestor and retransmitter (see the sketch below):
  – Chooses the requestor closest to the source.
  – Chooses the retransmitter closest to the requestor.
  – Breaks ties with processor ids.
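A minimal sketch of this election rule, assuming distance estimates are already available; the types and names are illustrative, not taken from [Livadas 01].

```java
import java.util.*;

final class RecoveryElection {
    record Candidate(int id, double distToSource, double distToRequestor) {}

    /** Best requestor: closest to the source, ties broken by processor id. */
    static Candidate electRequestor(List<Candidate> candidates) {
        return Collections.min(candidates,
            Comparator.comparingDouble(Candidate::distToSource)
                      .thenComparingInt(Candidate::id));
    }

    /** Best retransmitter: closest to the elected requestor, ties by id. */
    static Candidate electRetransmitter(List<Candidate> candidates) {
        return Collections.min(candidates,
            Comparator.comparingDouble(Candidate::distToRequestor)
                      .thenComparingInt(Candidate::id));
    }
}
```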


41

Best Requestor and Retransmitter

[Figure: multicast tree rooted at source S, with the elected best requestor and best retransmitter marked]


42

Performance Analysis

• Metrics:
  – Loss recovery latency: time from detection of a packet loss to receipt of the first retransmission.
  – Loss recovery overhead: number of messages multicast to recover from a message loss.

• Protocol performance benefits:
  – Removes delays caused by probabilistic suppression.
  – Following election of the requestor and retransmitter:
    • Reduces latency by using the best requestor and retransmitter.
    • Reduces overhead by using a single requestor and retransmitter.


43

III. Mathematical Foundations

• Incremental modeling and proof methods [Keidar, Khazan, Lynch, Shvartsman 00]
  – Proof Extension Theorem
  – Arose in the Scalable GC work [Keidar, Khazan 00]

• Hybrid Input/Output Automata [Lynch, Segala, Vaandrager 01]
  – Model for continuous and discrete system behavior
  – Useful for mobile computing?

• Conditional performance analysis methods
  – For analyzing communication protocols
  – AFOSR MURI project (Berkeley)


44

IV. IOA Language and Tools



45

IOA Language and Tools

• Language for describing I/O automata: Garland, Lynch
  – Used to describe services and algorithms.

• Front end: Garland
  – Translates to Java objects.
  – Completely rewritten this year.
  – Still needs support for composition.

• Theorem-prover connection: Garland, Bogdanov
  – Connection with LP.
  – Seeking connections: SAL, Isabelle, STeP, NuPRL


46

IOA Language and Tools

• Simulator: Chefter, Ramirez, Dean
  – Has support for paired simulation.
  – Needs additions.
  – Being instrumented for invariant discovery using Ernst's Daikon tool.

• Code generator: Tauber, Tsai
  – Local code-gen (translation to Java) running.
  – Needs composition, communication service calls, correctness proof.

• Challenge examples


47

V. Multiprocessor Memory Models [Luchangco 01]


48

Memory Models

• Establishes a general mathematical framework for specifying and reasoning about multiprocessor memories and the programs that use them.

• Also applies to distributed shared memory.

[Figure: processors P1, P2, …, Pn issuing read/write operations to a shared memory]


49

Memory Models

• Sequentially consistent memory:
  – Operations appear to happen in some sequential order.
  – A read returns the latest value written to the location.

• Processor consistent memory:
  – Reads may overtake writes to other locations (see the litmus test below).
  – SPARC TSO, IBM 370

• Coherent memory

• Memory with synchronization commands:
  – Fences, barriers, acquire/release, …
  – Release consistency, weak ordering, locking

• Transactional memory
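The classic store-buffering litmus test separates the first two models: under sequential consistency at least one of the reads below returns 1, while TSO-style processor consistency (and Java's plain, non-volatile fields) also permits both to return 0.

```java
public class StoreBuffering {
    static int x = 0, y = 0;
    static int r1, r2;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { x = 1; r1 = y; });  // write x, then read y
        Thread t2 = new Thread(() -> { y = 1; r2 = x; });  // write y, then read x
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Sequential consistency forbids r1 == 0 && r2 == 0;
        // weaker models let each read overtake the other thread's write.
        System.out.println("r1=" + r1 + " r2=" + r2);
    }
}
```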


50

Programming restrictions:

• Data-race-free (for use with weak ordering)
• Properly labelled (for use with release consistency)
• Two-phase locking (sketched below)
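A minimal sketch of the two-phase locking restriction, using java.util.concurrent locks: all acquires (the growing phase) must precede all releases (the shrinking phase). Illustrative only.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.Lock;

final class TwoPhaseLocking {
    private final Deque<Lock> held = new ArrayDeque<>();
    private boolean shrinking = false;

    /** Growing phase: acquiring after any release violates the discipline. */
    void acquire(Lock lock) {
        if (shrinking)
            throw new IllegalStateException("2PL violated: acquire after release");
        lock.lock();
        held.push(lock);
    }

    /** Shrinking phase: release everything; no further acquires allowed. */
    void releaseAll() {
        shrinking = true;
        while (!held.isEmpty()) held.pop().unlock();
    }
}
```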


51

Formal Modeling Framework

• Computation DAG (see the sketch below):
  – Partial order model for individual executions
  – Describes dependencies between operations
  – Doesn't model entire programs

• Memory = set of computations with return values

[Figure: example computation DAG over the operations read(x), write(y,1), read(y), write(x,2), write(x,3), read(x)]
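A minimal sketch of a computation DAG node, with illustrative names: an operation plus the operations it depends on; a memory is then a set of such computations together with return values for the reads.

```java
import java.util.ArrayList;
import java.util.List;

final class OpNode {
    final String op;                              // e.g. "write(x,2)" or "read(x)"
    final List<OpNode> deps = new ArrayList<>();  // operations this one depends on

    OpNode(String op) { this.op = op; }
    void dependsOn(OpNode earlier) { deps.add(earlier); }
}
```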


52

Results

Results I: Uses the framework to model, classify, and compare memories and programming disciplines.

Results II: Programming discipline + memory model ⇒ stronger memory model:
  – Completely race-free + any memory ⇒ sequential consistency
  – Race-free under locking + locking memory ⇒ sequential consistency
  – 2-phase locking + locking memory ⇒ serializable transactions

Results III: Extends the results to automaton models for programs and memory.


53

VI. Plans

• Finish Scalable GC, TO Mcast with QoS, SRM.

• More dynamic services:
  – Resource allocation, consensus, communication, distributed data management, location services, …
  – Services for mobile computing systems
  – Theory: algorithms and lower bounds

• Foundations:
  – Hybrid automata; add control theory and probability
  – Conditional performance analysis methods

• IOA:
  – Solidify front end, simulator, theorem-prover connections
  – Finish code generation