
1

Building Blocks for High-Performance, Fault-Tolerant Distributed Systems

Nancy Lynch
Theory of Distributed Systems
MIT

AFOSR project review
Cornell University
May, 2001


2

Project Participants

• Leaders: Nancy Lynch, Idit Keidar, Alex Shvartsman, Steve Garland

• PhD students: Victor Luchangco, Roger Khazan, Carl Livadas, Josh Tauber, Ziv Bar-Joseph, Rui Fan

• MEng: Rui Fan, Kyle Ingols, Igor Taraschanskiy, Andrej Bogdanov, Michael Tsai, Laura Dean

• Collaborators: Roberto De Prisco, Jeremy Sussman, Keith Marzullo, Danny Dolev, Alan Fekete, Gregory Chockler, Roman Vitenberg


3

Project Scope

• Define services to support high-performance distributed computing in dynamic environments: Failures, changing participants.

• Design algorithms to implement the services.

• Analyze algorithms: Correctness, performance, fault-tolerance.

• Develop necessary mathematical foundations: State machine models, analysis methods.

• Develop supporting languages and tools: IOA.


4

Talk Outline

I. View-oriented group communication services

II. Non-view-oriented group communication

III. Mathematical foundations

IV. IOA language and tools

V. Memory consistency models

VI. Plans


5

I. View-Oriented Group Communication Services


6

View-Oriented Group Communication Services

• Cope with changing participants using abstract groups of client processes with changing membership sets.

• Processes communicate with group members by sending messages to the group as a whole.

• GC services support management of groups (see the interface sketch below):
  – Maintain membership information, form views.
  – Manage communication.
  – Make guarantees about ordering and reliability of message delivery.

• Isis, Transis, Totem, Ensemble,…

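A minimal Java sketch of the client-side interface such a view-oriented GC service presents; all type and method names are illustrative, not taken from Isis, Transis, Totem, or Ensemble.

```java
import java.util.List;

/** A view: an identifier plus the current membership set. */
record View(long viewId, List<String> members) {}

/** Upcalls from the service to one client process. */
interface GcListener {
    void onView(View newView);                  // membership changed; new view installed
    void onDeliver(String sender, byte[] msg);  // reliable, ordered delivery within a view
}

/** Operations a client process invokes on the service. */
interface GcService {
    void join(String group, GcListener listener);  // become a member
    void leave(String group);                      // stop being a member
    void multicast(byte[] msg);                    // send to all members of the current view
}
```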


7

Using View-Oriented GC Services

• Advantages:
  – High-level programming abstraction
  – Hides complexity of coping with changes

• Disadvantages:
  – Can be costly, especially when forming new views.
  – May have problems scaling to large networks.

• Applications:
  – Managing replicated data
  – Distributed interactive games
  – Multi-media conferencing, collaborative work


8

Our Approach

• Mathematical, using state machines (I/O automata)

• Model everything:
  – Applications
  – Service specifications
  – Implementations of the services

• Prove correctness
• Analyze performance, fault-tolerance

[Figure: application processes composed with per-node algorithm automata implementing a service]


9

Our Earlier Work: VS [Fekete, Lynch, Shvartsman 97, 01]

• Defined automaton models for:

– VS, partitionable GC service, based on Transis

– TO, non-view-oriented totally ordered bcast service

– VStoTO, algorithm based on [Amir, Dolev, Keidar, Melliar-Smith, Moser]

• Proved correctness

• Analyzed performance, fault-tolerance: conditional performance analysis

[Figure: VStoTO processes layered over the VS service to implement the TO service; actions include bcast, brcv, gpsnd, gprcv, newview]


10

Conditional Performance Analysis

• Assume VS satisfies:
  – If a network component C stabilizes, then soon thereafter, views known within C become consistent, and messages sent in the final view are delivered everywhere in C, within bounded time.

• And VStoTO satisfies:
  – Simple timing and fault-tolerance assumptions.

• Then TO satisfies:
  – If C stabilizes, then soon thereafter, any message sent or delivered anywhere in C is delivered everywhere in C, within bounded time.


11

Ensemble [Hickey, Lynch, van Renesse 99]

• Ensemble system [Birman, Hayden]: layered design

• Worked with developers, following VS.
• Developed global specs for key layers.
• Modeled the Ensemble algorithm spanning between layers.
• Tried a proof; found an algorithmic error.
• Modeled and analyzed the repaired system.
• The same error was found in Horus.


12

More Recent Progress

1. GC with unique primary views

2. Scalable GC

3. Optimistic Virtual Synchrony

4. GC service specifications


13

1. GC With Unique Primary Views


14

GC With Unique Primaries

• Dynamic View Service [De Prisco, Fekete, Lynch, Shvartsman 98]:
  – Produces unique primary views.
  – Copes with long-term changes.

• Dynamic Configuration Service [DFLS 99]:
  – Adds quorums.
  – Copes with long-term and transient changes.

• Dynamic Leader Configuration Service [D 99], [DL 01]:
  – Adds leaders.


15

GC With Unique Primaries

• Algorithms to implement the services:
  – Based on the dynamic voting algorithm of [Yeger-Lotem, Keidar, Dolev 97].
  – Each primary needs a majority of all possible previous primaries (see the sketch below).
  – Models, proofs, …

• Applications:
  – Consistent total order and consistent replicated data algorithms that tolerate both long-term and transient changes.
  – Models, proofs, …
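A minimal sketch of the majority rule just stated, assuming the set of "possible previous primaries" (the last established primary plus any pending attempts) is already tracked; illustrative only, not the full bookkeeping of [Yeger-Lotem, Keidar, Dolev 97].

```java
import java.util.Collection;
import java.util.Set;

final class DynamicVoting {
    /** True iff candidate contains a strict majority of the members of prev. */
    static boolean hasMajorityOf(Set<String> candidate, Set<String> prev) {
        long common = prev.stream().filter(candidate::contains).count();
        return 2 * common > prev.size();
    }

    /** A candidate view may become primary only if it holds a majority
     *  of every possible previous primary. */
    static boolean mayFormPrimary(Set<String> candidate,
                                  Collection<Set<String>> possiblePrevPrimaries) {
        return possiblePrevPrimaries.stream()
                .allMatch(prev -> hasMajorityOf(candidate, prev));
    }
}
```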


16

Availability of Unique Primary Algorithms [Ingols, Keidar 01]

• Simulation study comparing unique primary algorithms:
  – [Yeger-Lotem, Keidar, Dolev], [DFLS]
  – 1-pending, like [Jajodia, Mutchler]
  – Majority-resilient 1-pending, like [Lamport], [Keidar, Dolev]

• Simulate repeated view changes, interrupting other view changes.

• Availability shown to depend heavily on:
  – Number of processes from the previous view needed to form a new view.
  – Number of message rounds needed to form a view.

• [YKD], [DFLS] have the highest availability.


17

2. Scalable Group Communication [Keidar, Khazan 00]


18

Group Communication Service

• Manages group membership, current view.

• Multicast communication among group members, with ordering, reliability guarantees.

• Virtual Synchrony [Birman, Joseph 87]:
  – Integrates group membership and group communication.
  – Processes that move together from one view to another deliver the same messages in the first view (see the sketch below).
  – Useful for replicated data management.
  – Before announcing a new view, processes must synchronize and exchange messages.
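A minimal sketch of the synchronization step this implies, assuming each process moving to the new view reports which old-view messages it delivered; names and types are illustrative, not any cited system's algorithm.

```java
import java.util.*;

final class VsSynchronizer {
    /** Messages this process delivered in the old view, keyed by message id. */
    private final Map<Long, byte[]> delivered = new HashMap<>();

    void recordDelivery(long msgId, byte[] msg) { delivered.put(msgId, msg); }

    /** Merge the delivery reports of all processes moving together; anything
     *  we missed must be delivered before the new view is installed. */
    List<byte[]> missingBeforeNewView(Collection<Map<Long, byte[]>> reports) {
        List<byte[]> missing = new ArrayList<>();
        for (Map<Long, byte[]> report : reports)
            for (Map.Entry<Long, byte[]> e : report.entrySet())
                if (delivered.putIfAbsent(e.getKey(), e.getValue()) == null)
                    missing.add(e.getValue());   // we had not delivered this one
        return missing;
    }
}
```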


19

Example: Virtual Synchrony

[Figure: processes i, j, k in view 3: {i,j,k}; m is multicast and received by some members; i and j move together to view 4: {i,j}, so the VS algorithm supplies the missing m before the new view]


20

Group Communication in WANs

• Difficulties:
  – High message latency; message exchanges are expensive.
  – Frequent connectivity changes.

• New, scalable GC algorithm:
  – Uses the scalable GM service of [Keidar, Sussman, et al. 00], implemented on a small set of membership servers.
  – GC (with virtual synchrony) implemented on the clients.

[Figure: VSGC processes at the clients, layered over the GM service and the network, together implementing VS]


21

Group Communication in WANs

• Try to minimize time from when network stabilizes until GC delivers new views to clients.

• After stabilization: GM forms view, VSGC algorithm synchronizes.

• Existing systems (LANs):
  – GM and VSGC use several message exchange rounds.
  – They continue in spite of new network events.

• Inappropriate for WANs.

[Figure: timeline from a net event through the GM algorithm and then the VSGC algorithm to view(v)]


22

New Algorithm

• VSGC uses one message exchange round, in parallel with GM's agreement on views.
• GM usually delivers views in one message exchange.

• Responds to new network events during reconfiguration:
  – GM produces new membership sets.
  – VSGC responds to membership changes.

• Distributed implementation [Tarashchanskiy 00]

[Figure: timeline from a net event through the GM and VSGC algorithms, running in parallel, to view(v)]


23

Correctness Proofs

• Models, proofs (safety and liveness)

• Developed new incremental modeling, proof methods [Keidar, Khazan, Lynch, Shvartsman 00]

• Proof Extension Theorem:

• Used new methods for the safety proofs.

[Figure: Proof Extension Theorem diagram relating specifications S, S′ and algorithms A, A′]


24

Performance Analysis

• Analyze time from when network stabilizes until GC delivers new views to clients.

• System is a composition:
  – Network service, GM services, VSGC processes

• Compositional analysis:
  – Analyze the VSGC algorithm alone, in terms of its inputs and timing assumptions.
  – State reasonable performance guarantees for GM and the Network.
  – Combine to get conditional performance properties for the system as a whole.


25

Analysis of VSGC Algorithm

• Assume component C stabilizes:
  – GM delivers the same views to the VSGC processes.
  – Net provides reliable communication with latency δ.

• Let:
  – T[start], T[view] be the times of the last GM events for C.
  – x be an upper bound on local step time.

• Then VSGC outputs new views by time (transcribed below):

  max(T[start] + δ + x, T[view]) + δ
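A direct transcription of this bound, assuming the reconstruction above (δ the network latency, x the local step-time bound); a sketch of the arithmetic, not the paper's notation.

```java
final class VsgcBound {
    /** Latest time by which VSGC outputs the new view. */
    static double viewOutputDeadline(double tStart, double tView,
                                     double delta, double x) {
        return Math.max(tStart + delta + x, tView) + delta;
    }
}
```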


26

Analysis of VSGC Algorithm

[Figure: timeline — after a net event, the GM algorithm runs from its start events to view(v) at T[view]; the VSGC algorithm's exchange completes within δ + x of T[start]]


27

Assumed Bounds for GM

• Bounds for “Fast Path” of [Keidar, et al. 00], observed empirically in almost all cases.

[Figure: timeline of GM start events, ending at T[start], and the view(v) event at T[view]]


28

Combining VSGC and GM Bounds

• Bounds for the system, conditional on the GM bounds.

[Figure: timeline combining the GM bounds (start events at T[start], view(v) at T[view]) with the VSGC bound of δ + x, ending at the delivery of view(v)]


29

3. Optimistic Virtual Synchrony [Sussman, Keidar, Marzullo 00]

• Most GC algorithms block sending during reconfiguration.

• OVS service provides:
  – Optimistic view proposal, before reconfiguration.
  – Optimistic sends after the proposal, during reconfiguration.
  – Delivery of optimistic messages in the next view, subject to application policy.

• Useful for applications:
  – Replicated data management
  – State transfer
  – Sending vectors of data


30

4. GC Service Specifications [Chockler, Keidar, Vitenberg 01]

• Comprehensive set of specifications for properties guaranteed by GC services.

• Unifying framework.

• Safety properties:
  – Membership: view order, partitionable, primary component
  – Multicast: sending view delivery, virtual synchrony
  – Safe notifications
  – Ordering, reliability: FIFO, causal, totally ordered, atomic

• Liveness properties:
  – For eventually stable components: view stability, multicast delivery, safe notification liveness
  – For eventually stable pairs


31

II. Non-View-Oriented Group Communication

1. Totally Ordered Multicast with QoS

[Bar-Joseph, Keidar, Anker, Lynch 00, 01]


32

Totally Ordered Multicast with QoS

• Multicast to dynamic group, subject to joins, leaves, and failures.

• Global total ordering of messages.
• QoS: message delivery latency.
• Built on a reliable network with latency guarantees.
• Adds ordering guarantees, preserves latency bounds.

• Applications:
  – State machine replication
  – Distributed games
  – Shared editing


33

Two Algorithms

• Algorithm 1: Basic Totally Ordered Multicast
  – Sends and receives are consistent with a total ordering of messages.
  – Non-failing processes agree on messages from non-failing processes.
  – Latency: constant, even with joins, leaves, failures.

• Algorithm 2: Atomic Multicast
  – Non-failing processes agree on all messages.
  – Latency:
    • Joins, leaves only: constant
    • With failures: linear in f

[Figure: TOM service layered over the network, with fail_i, fail_j inputs]


34

Local Node Process

[Figure: components of the local node process at node i — FrontEnd_i, Memb_i, Sniffer_i, and Ord_i over the network; actions include mcast(m), rcv(m), join, leave, mcast(join), mcast(leave), progress(s,j), joiners(s,J), leavers(s,J), end-slot(s), members(s,J)]


35

Local Algorithm Operation

• FrontEnd divides time into slots, tags messages with slots.

• Ord delivers messages by slot, in order of process indices (see the sketch below).

• Memb determines slot membership:
  – Join, leave messages
  – Failures:
    • Algorithm 1 uses a local failure detector.
    • Algorithm 2 uses consensus on failures; requires a new, dynamic version of consensus.

• Timing-dependent.
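A minimal sketch of the slot-based delivery rule, assuming each message arrives tagged with its slot and sender index and that end-of-slot notifications arrive in slot order; names are illustrative.

```java
import java.util.*;

final class SlotOrderer {
    record Tagged(int slot, int senderIndex, String payload) {}

    // Buffer: slot number -> messages received so far for that slot.
    private final SortedMap<Integer, List<Tagged>> pending = new TreeMap<>();

    void receive(Tagged m) {
        pending.computeIfAbsent(m.slot(), s -> new ArrayList<>()).add(m);
    }

    /** Called when slot s ends and its membership is settled: deliver the
     *  slot's messages in order of sender index, so every correct process
     *  delivers them in the same global order. */
    List<Tagged> endSlot(int s) {
        List<Tagged> msgs = pending.remove(s);
        if (msgs == null) return List.of();
        msgs.sort(Comparator.comparingInt(Tagged::senderIndex));
        return msgs;
    }
}
```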


36

Architecture for Algorithm 2

[Figure: TO-QoS service implemented over the network and GM services]


37

2. Scalable Reliable Multicast Services [Livadas 01]


38

SRM [Floyd, et al.]

• Reliable multicast to a dynamic group.
• Built over IP multicast.
• Based on requests (NACKs) and retransmissions.
• Limits duplicate requests/retransmissions using (see the timer sketch below):
  – Deterministic suppression: ancestors suppress descendants, by scheduling requests/replies based on distance to the source.
  – Probabilistic suppression: siblings suppress each other, by spreading out requests/replies.
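For concreteness, the SRM-style timer rule behind both mechanisms: a receiver that detects a loss schedules its request uniformly in [C1·d, (C1+C2)·d], where d is its estimated one-way delay to the source, and suppresses the request if it overhears another for the same data first. A sketch, with typical constant values:

```java
import java.util.Random;

final class SrmRequestTimer {
    static final double C1 = 2.0, C2 = 2.0;  // protocol constants; tunable
    private final Random rng = new Random();

    /** Delay before multicasting a NACK, given estimated one-way delay d
     *  (seconds) to the source. Nearer receivers fire first (deterministic
     *  suppression); the random term spreads out siblings (probabilistic
     *  suppression). */
    double requestDelay(double d) {
        return C1 * d + C2 * d * rng.nextDouble();  // uniform in [C1*d, (C1+C2)*d]
    }
}
```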


39

SRM Architecture

[Figure: SRM layered over IP multicast]


40

New Protocol

• Inspired by SRM.
• Assumes future losses occur on the same link (locality).
• Uses deterministic suppression for siblings.
• Elects and caches the best requestor and retransmitter (see the sketch below):
  – Chooses the requestor closest to the source.
  – Chooses the retransmitter closest to the requestor.
  – Breaks ties with processor ids.
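A minimal sketch of this election rule, assuming distance estimates are already available; the types and names are illustrative, not taken from [Livadas 01].

```java
import java.util.*;

final class RecoveryElection {
    record Candidate(int id, double distToSource, double distToRequestor) {}

    /** Best requestor: closest to the source, ties broken by processor id. */
    static Candidate electRequestor(List<Candidate> candidates) {
        return Collections.min(candidates,
            Comparator.comparingDouble(Candidate::distToSource)
                      .thenComparingInt(Candidate::id));
    }

    /** Best retransmitter: closest to the elected requestor, ties by id. */
    static Candidate electRetransmitter(List<Candidate> candidates) {
        return Collections.min(candidates,
            Comparator.comparingDouble(Candidate::distToRequestor)
                      .thenComparingInt(Candidate::id));
    }
}
```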


41

Best Requestor and Retransmitter

[Figure: multicast tree rooted at source S, with the elected best requestor and best retransmitter marked]


42

Performance Analysis

• Metrics:
  – Loss recovery latency: time from detection of a packet loss to receipt of the first retransmission.
  – Loss recovery overhead: number of messages multicast to recover from a message loss.

• Protocol performance benefits:
  – Removes delays caused by probabilistic suppression.
  – Following election of the requestor and retransmitter:
    • Reduces latency by using the best requestor and retransmitter.
    • Reduces overhead by using a single requestor and retransmitter.


43

III. Mathematical Foundations

• Incremental modeling and proof methods [Keidar, Khazan, Lynch, Shvartsman 00]
  – Proof Extension Theorem
  – Arose in the Scalable GC work [Keidar, Khazan 00]

• Hybrid Input/Output Automata [Lynch, Segala, Vaandrager 01]
  – Model for continuous and discrete system behavior
  – Useful for mobile computing?

• Conditional performance analysis methods
  – For analyzing communication protocols
  – AFOSR MURI project (Berkeley)


44

IV. IOA Language and Tools



45

IOA Language and Tools

• Language for describing I/O automata: Garland, Lynch
  – Used to describe services and algorithms.

• Front end: Garland
  – Translates to Java objects.
  – Completely rewritten this year.
  – Still needs support for composition.

• Theorem-prover connection: Garland, Bogdanov
  – Connection with LP.
  – Seeking connections: SAL, Isabelle, STeP, NuPRL


46

IOA Language and Tools

• Simulator: Chefter, Ramirez, Dean
  – Has support for paired simulation.
  – Needs additions.
  – Being instrumented for invariant discovery using Ernst's Daikon tool.

• Code generator: Tauber, Tsai
  – Local code-gen (translation to Java) running.
  – Needs composition, communication service calls, correctness proof.

• Challenge examples


47

V. Multiprocessor Memory Models [Luchangco 01]


48

Memory Models

• Establishes a general mathematical framework for specifying and reasoning about multiprocessor memories and the programs that use them.

• Also applies to distributed shared memory.

[Figure: processors P1, P2, …, Pn issuing read/write operations to a shared memory]


49

Memory Models

• Sequentially consistent memory:
  – Operations appear to happen in some sequential order.
  – A read returns the latest value written to the location.

• Processor consistent memory:
  – Reads may overtake writes to other locations (see the litmus test below).
  – SPARC TSO, IBM 370

• Coherent memory

• Memory with synchronization commands:
  – Fences, barriers, acquire/release, …
  – Release consistency, weak ordering, locking

• Transactional memory
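The classic store-buffering litmus test separates the first two models: under sequential consistency at least one of the reads below returns 1, while TSO-style processor consistency (and Java's plain, non-volatile fields) also permits both to return 0.

```java
public class StoreBuffering {
    static int x = 0, y = 0;
    static int r1, r2;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { x = 1; r1 = y; });  // write x, then read y
        Thread t2 = new Thread(() -> { y = 1; r2 = x; });  // write y, then read x
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Sequential consistency forbids r1 == 0 && r2 == 0;
        // weaker models let each read overtake the other thread's write.
        System.out.println("r1=" + r1 + " r2=" + r2);
    }
}
```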


50

Programming restrictions:

• Data-race-free (for use with weak ordering)
• Properly labelled (for use with release consistency)
• Two-phase locking (sketched below)
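A minimal sketch of the two-phase locking restriction, using java.util.concurrent locks: all acquires (the growing phase) must precede all releases (the shrinking phase). Illustrative only.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.Lock;

final class TwoPhaseLocking {
    private final Deque<Lock> held = new ArrayDeque<>();
    private boolean shrinking = false;

    /** Growing phase: acquiring after any release violates the discipline. */
    void acquire(Lock lock) {
        if (shrinking)
            throw new IllegalStateException("2PL violated: acquire after release");
        lock.lock();
        held.push(lock);
    }

    /** Shrinking phase: release everything; no further acquires allowed. */
    void releaseAll() {
        shrinking = true;
        while (!held.isEmpty()) held.pop().unlock();
    }
}
```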


51

Formal Modeling Framework

• Computation DAG (see the sketch below):
  – Partial order model for individual executions
  – Describes dependencies between operations
  – Doesn't model entire programs

• Memory = set of computations with return values

[Figure: example computation DAG over the operations read(x), write(y,1), read(y), write(x,2), write(x,3), read(x)]
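A minimal sketch of a computation DAG node, with illustrative names: an operation plus the operations it depends on; a memory is then a set of such computations together with return values for the reads.

```java
import java.util.ArrayList;
import java.util.List;

final class OpNode {
    final String op;                              // e.g. "write(x,2)" or "read(x)"
    final List<OpNode> deps = new ArrayList<>();  // operations this one depends on

    OpNode(String op) { this.op = op; }
    void dependsOn(OpNode earlier) { deps.add(earlier); }
}
```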


52

Results

Results I: Uses the framework to model, classify, and compare memories and programming disciplines.

Results II: Programming discipline + memory model ⇒ stronger memory model:
  – Completely race-free + any memory ⇒ sequential consistency
  – Race-free under locking + locking memory ⇒ sequential consistency
  – 2-phase locking + locking memory ⇒ serializable transactions

Results III: Extends the results to automaton models for programs and memory.


53

VI. Plans

• Finish Scalable GC, TO Mcast with QoS, SRM.

• More dynamic services:
  – Resource allocation, consensus, communication, distributed data management, location services, …
  – Services for mobile computing systems
  – Theory: algorithms and lower bounds

• Foundations:
  – Hybrid automata; add control theory and probability
  – Conditional performance analysis methods

• IOA:
  – Solidify front end, simulator, theorem-prover connections
  – Finish code generation