State Machines Sabina Petride




State Machines

Sabina Petride


General Problems

Consensus
• a particular problem
• algorithms and different formulations
• correctness and time analysis

Application To Data Replication
• replica coordination
• group membership; reintegration
• unique identifiers using logical/real clocks


The Paxos Parliament And The Consensus Problem

The Paxos Parliament
• determined the law of the land, defined by the sequence of decrees passed
• each legislator had his own ledger recording the decrees, their unique numbers and their contents
• entries in ledgers could not be modified or deleted
• legislators could leave the court for very long periods of time and return later
• communication was only by messengers, who could lose a message or deliver it many times

Requirements
• consistency of the ledgers
• progress: some decree will eventually be passed

The Synod
• basically the same problem as with the Parliament, except that only a single decree had to be passed
• the group of priests/legislators asked to vote for a decree was called the quorum


This can be modelled as a consensus problem:

•Agreement: no two ledgers should contain different decrees with the same number (no conflicts among ledgers)

•Validity: any decree should be written in the standard form

•Termination (the progress condition)

Agreement and validity are guaranteed, and progress is possible, if three conditions are satisfied:

B1 Each ballot has a unique number.

B2 The quorums of any two ballots have at least one priest in common.

B3 For every ballot, if any priest in a quorum has voted in an earlier ballot, then the decree equals the decree of the latest of those earlier ballots.
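As an illustration, conditions B1 to B3 can be checked mechanically on a finite set of ballots. The sketch below is only a hedged example: the `Ballot` record and the three `satisfies_*` helpers are hypothetical names introduced here, and each ballot is assumed to carry its number, decree, quorum, and the set of priests who actually voted.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Ballot:
    number: int
    decree: str
    quorum: FrozenSet[str]   # priests asked to vote in this ballot
    voters: FrozenSet[str]   # priests who actually voted in this ballot


def satisfies_b1(ballots: List[Ballot]) -> bool:
    """B1: every ballot has a unique number."""
    numbers = [b.number for b in ballots]
    return len(numbers) == len(set(numbers))


def satisfies_b2(ballots: List[Ballot]) -> bool:
    """B2: the quorums of any two ballots share at least one priest."""
    return all(a.quorum & b.quorum for a in ballots for b in ballots)


def satisfies_b3(ballots: List[Ballot]) -> bool:
    """B3: a ballot's decree equals that of the latest earlier ballot
    in which some member of its quorum voted (if there is one)."""
    for b in ballots:
        earlier = [e for e in ballots
                   if e.number < b.number and e.voters & b.quorum]
        if earlier and b.decree != max(earlier, key=lambda e: e.number).decree:
            return False
    return True


first = Ballot(1, "alpha", frozenset("ABC"), frozenset("A"))
second = Ballot(2, "alpha", frozenset("ACD"), frozenset("ACD"))
bad = Ballot(3, "beta", frozenset("BCD"), frozenset())

print(satisfies_b3([first, second]))       # True
print(satisfies_b3([first, second, bad]))  # False: C and D voted for "alpha" in ballot 2
```

The third ballot violates B3 because two members of its quorum voted in ballot 2, whose decree differs from the one it proposes.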


Assumptions About The System

• partially synchronous distributed system in which processes take actions within time l and messages are delivered within time d
• the system does not necessarily exhibit this “normal” timing behavior
• each process has a direct communication channel to every other process
• allowed failures: timing failures (the bounds l and d can occasionally be exceeded); loss, duplication or reordering of messages; process stopping
• some stable storage is needed
• process recovery is considered


The Synod Algorithm

(1) Priest p chooses a new ballot number b and sends a NextBallot(b) message to some set of priests.

(2) When a priest q receives NextBallot(b), he checks the notes in the back of his ledger and determines the vote v with the largest ballot number less than b that he has cast. If no such vote exists, a default value null(q) is used. q sends p a LastVote(b,v) message.

(3) After p receives a LastVote(b,v) message from every priest in a majority set Q, he initiates a new ballot with number b, quorum Q, and decree d chosen according to B3. p records the new ballot and sends BeginBallot(b,d) to Q.

(4) If q receives BeginBallot(b,d) and decides to vote, then he records the vote in the back of his ledger and sends Voted(b,q) to p.


(5) If p has received a Voted(b,q) message from every q in Q, then he writes d in his ledger and sends Success(d) to all priests.

(6) After receiving Success(d), a priest enters d in his ledger.
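The six steps can be sketched as a single-ballot simulation. This is a simplified illustration, not the protocol as specified: messages are modelled as direct, lossless method calls, the `Priest` class and `run_ballot` function are hypothetical names, and the refusal rule in step 4 is a conservative stand-in (a priest refuses any ballot numbered below the highest NextBallot he has answered).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vote = Tuple[int, str]  # (ballot number, decree)


@dataclass
class Priest:
    name: str
    votes: List[Vote] = field(default_factory=list)  # notes in the back of the ledger
    promised: int = -1                                # highest NextBallot number answered
    ledger: Optional[str] = None                      # decree written after Success

    def on_next_ballot(self, b: int) -> Optional[Vote]:
        """Step 2: answer with the latest vote cast in a ballot below b (None = null)."""
        self.promised = max(self.promised, b)
        earlier = [v for v in self.votes if v[0] < b]
        return max(earlier) if earlier else None

    def on_begin_ballot(self, b: int, d: str) -> bool:
        """Step 4: vote, unless a higher-numbered NextBallot was already answered."""
        if b < self.promised:
            return False
        self.votes.append((b, d))
        return True


def run_ballot(b: int, proposal: str, quorum: List[Priest],
               everyone: List[Priest]) -> Optional[str]:
    # Steps 1-3: collect the last votes and choose the decree according to B3.
    last_votes = [q.on_next_ballot(b) for q in quorum]
    cast = [v for v in last_votes if v is not None]
    d = max(cast)[1] if cast else proposal
    # Steps 4-5: the ballot succeeds only if every quorum member votes.
    replies = [q.on_begin_ballot(b, d) for q in quorum]
    if not all(replies):
        return None
    # Step 6: every priest who hears Success(d) enters d in his ledger.
    for p in everyone:
        p.ledger = d
    return d


a, b_, c = Priest("A"), Priest("B"), Priest("C")
print(run_ballot(1, "decree-x", [a, b_], [a, b_, c]))  # decree-x
print(run_ballot(2, "decree-y", [b_, c], [a, b_, c]))  # decree-x (forced by B3)
```

In the second ballot the proposer would have preferred "decree-y", but one quorum member already voted for "decree-x" in ballot 1, so B3 forces the same decree.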


Notes on The Synod Algorithm

• to maintain B1, each ballot has to receive a unique number; this can be done by having each priest note in his ledger the ballots he has initiated and by partitioning the set of possible ballot numbers among the priests (different implementations are discussed later)

• a priest should not cast a vote after receiving BeginBallot(b,d) if he has already sent a LastVote(b’,v’) message for some other ballot and v’.bal < b < b’, where v’.bal is the ballot number of the vote v’

It follows that a priest must record:
• the number of every ballot he has initiated
• every vote he has cast
• every LastVote message he has sent
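One possible way to partition ballot numbers among the priests is to give priest i exactly the numbers congruent to i modulo the number of priests. The sketch below assumes priests are identified by integers 0..n-1; `BallotNumberSource` is a hypothetical helper, and its counter stands for the record of initiated ballots that would be kept in the ledger.

```python
from dataclasses import dataclass


@dataclass
class BallotNumberSource:
    priest_id: int           # assumed to lie in 0 .. n_priests-1
    n_priests: int
    ballots_started: int = 0  # in a real system this lives on stable storage

    def next_number(self) -> int:
        """Return a fresh ballot number owned exclusively by this priest."""
        number = self.ballots_started * self.n_priests + self.priest_id
        self.ballots_started += 1
        return number


p0 = BallotNumberSource(priest_id=0, n_priests=3)
p1 = BallotNumberSource(priest_id=1, n_priests=3)
print([p0.next_number() for _ in range(3)])  # [0, 3, 6]
print([p1.next_number() for _ in range(3)])  # [1, 4, 7]
```

Because the residue classes are disjoint, no two priests can ever pick the same number, and each priest's own numbers increase strictly.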


Stating The Problem in Terms of State Machines

• a state machine consists of: state variables (encoded in states); commands (which transform the states)

• each command is implemented by a deterministic program, and its execution is atomic with respect to other commands

• clock I/O automaton: a specific state machine devised by Lynch and Tuttle for modelling, verifying, and analyzing time-based systems
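A minimal sketch of the state-machine definition above (state variables plus deterministic commands), using a hypothetical account balance as the only state variable. The point it illustrates is determinism: the same command sequence always yields the same outputs.

```python
from typing import List, Tuple

Command = Tuple[str, int]  # (command name, amount)


class StateMachine:
    def __init__(self) -> None:
        self.state = {"balance": 0}  # the state variables

    def execute(self, command: Command) -> int:
        """Apply one command; the result depends only on the state and the command."""
        name, amount = command
        if name == "deposit":
            self.state["balance"] += amount
        elif name == "withdraw" and self.state["balance"] >= amount:
            self.state["balance"] -= amount
        return self.state["balance"]


def run(commands: List[Command]) -> List[int]:
    sm = StateMachine()
    return [sm.execute(c) for c in commands]


commands = [("deposit", 10), ("withdraw", 4), ("withdraw", 100)]
print(run(commands))  # [10, 6, 6]
```

Two independent machines fed the same commands in the same order end in the same state, which is the property replica coordination later relies on.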


Clock I/O Automata

An I/O time automaton A consists of:
• a set of states: states(A)
• a nonempty set start(A) of start states
• a set of actions acts(A), partitioned into input, output, internal, and time-passage actions and specified in the signature of A
• a transition relation steps(A) ⊆ states(A) × acts(A) × states(A)

No input action can be blocked: for every state s and every input action a, there is a state s’ such that (s,a,s’) is a step of A.

A time-passage action τ(t) models the passage of real time t.

A special real-valued variable Clock is included in each state to model the local clock of the process. Clock does not necessarily track real time.
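The requirement that no input action can be blocked is easy to check on a finite automaton. The sketch below is a hedged illustration only: `Automaton` is a hypothetical record that keeps just the states, the input actions, and the transition relation, and it leaves out time-passage actions and the Clock variable.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Step = Tuple[str, str, str]  # (state, action, next state)


@dataclass(frozen=True)
class Automaton:
    states: FrozenSet[str]
    start: FrozenSet[str]
    inputs: FrozenSet[str]
    steps: FrozenSet[Step]

    def is_input_enabled(self) -> bool:
        """True if every input action has a step from every state."""
        enabled = {(s, a) for (s, a, _) in self.steps}
        return all((s, a) in enabled for s in self.states for a in self.inputs)


ok = Automaton(
    states=frozenset({"idle", "busy"}),
    start=frozenset({"idle"}),
    inputs=frozenset({"Receive"}),
    steps=frozenset({("idle", "Receive", "busy"), ("busy", "Receive", "busy")}),
)
blocked = Automaton(ok.states, ok.start, ok.inputs,
                    frozenset({("idle", "Receive", "busy")}))

print(ok.is_input_enabled())       # True
print(blocked.is_input_enabled())  # False: "busy" cannot take Receive
```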


The Synod Algorithm In Terms Of Clock GTA

The Distributed Setting
relation to the Paxos problem:

• priest / process
• law book / state
• passing a decree / executing a command

• complete network of n processes with unique identifiers drawn from a totally ordered set known to all processes

• clock GT automata are used to model both processes and channels; each automaton has a local clock, and the local clock of a channel is used to detect timing failures

The Algorithm
• idea: propose values until one of them is accepted by a majority of processes
• any process may propose a value by initiating a round for that value; it becomes the leader of that round
• the leader and the other processes are agents


(1) The leader sends a Collect message to all agents.

(2) If an agent receives a Collect message and is already committed to a round with a bigger round number, it sends an OldRound message; otherwise, it sends a Last message with its information about previously conducted rounds.

(3) If the leader receives more than n/2 Last messages, it initiates a new round and sends a Begin message to all agents.

(4) If an agent receives the Begin message and is committed to a round with a bigger round number, it sends an OldRound message; otherwise, it accepts the proposed value and responds with an Accept message.

(5) If the leader receives more than n/2 Accept messages, then the round is successful and the leader's output value is the value of the round.

(6) The leader broadcasts the decision reached.

Notes:
• the set of agents from which Last (Accept) messages are received is the info-quorum (accepting-quorum)
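The "more than n/2" threshold matters because any two such sets of agents overlap, which is how an info-quorum learns about values accepted by an earlier accepting-quorum. A small brute-force check, with hypothetical helper names, makes the point for small n:

```python
from itertools import combinations
from typing import List, Set


def majorities(n: int) -> List[Set[int]]:
    """All smallest majorities (size n//2 + 1) of processes 0..n-1."""
    size = n // 2 + 1
    return [set(c) for c in combinations(range(n), size)]


def any_two_intersect(quorums: List[Set[int]]) -> bool:
    """True if every pair of quorums has a process in common."""
    return all(a & b for a, b in combinations(quorums, 2))


print(any_two_intersect(majorities(5)))         # True
print(any_two_intersect([{0, 1}, {2, 3}]))      # False: two halves of n = 4
```

Only the smallest majorities are enumerated; larger majorities are supersets of them, so they intersect as well.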


Implementation(1)

STARTERALG(I)
Input: Leader(i), NotLeader(i), BeginCast(i), RndSuccess(i), Stop(i), Recover(i)
Output: NewRound(i)
Internal: CheckRndSuccess(i)
Time-passage: τ(t)

DETECTOR(I)
Input: Receive(m)(j,i), Stop(i), Recover(i)
Output: Send(m)(i,j), InformStopped(j)(i), InformAlive(j)(i)
Internal: Check(j)(i)
Time-passage: τ(t)

LEADERELECTOR(I)
Input: InformStopped(j)(i), InformAlive(j)(i), Stop(i), Recover(i)
Output: Leader(i), NotLeader(i)
Time-passage: τ(t)

Clock GTA STARTERALG(I) is responsible for starting a new round with i as leader. Clock GTA DETECTOR(I) is the failure detector running at process i, while LEADERELECTOR(I) is the detector of the leader running at process i.


Implementation(2)

BPLEADER(I) (clock GTA running the leader at process i)
Input: NewRound(i), Leader(i), NotLeader(i), Receive(m)(j,i) with m = Last, Accept, Success, OldRound
Output: Send(m)(j,i) with m = Collect, Begin; BeginCast(i); RndSuccess(v)(i)
Internal: Collect(i), GatherLast(i), ...
Time-passage: ...

BPAGENT(I) (clock GTA running an agent at process i)
Input: Receive(m)(j,i) with m = Collect, Begin
Output: Send(m)(j,i) with m = Last, Accept, OldRound
Internal: LastAccept(i), Accept(i), ...
Time-passage: ...


Correctness Proof
• execution fragment: an alternating sequence of states and actions in which every step is a step of the automaton
• problem specification: the set of allowable behaviors (behavior = the sequence of external actions of an execution fragment)
• an automaton A solves the problem if each of its behaviors is contained in the problem specification
• safety properties: must hold in every state of a computation
• liveness properties: specify events that must eventually be performed


Safety/Liveness Properties

• safety property: agreement and validity are guaranteed in any execution of the system

• liveness property: under some conditions, termination is guaranteed

An execution fragment is nice if:
• no loss or duplication of messages takes place
• at each time-passage action the local clock is incremented by the amount of real time that has passed
• every process is either stopped or alive
• a majority of processes are alive

Theorem: If a nice execution fragment starts in a reachable state, has a unique leader, and lasts for more than 16l+8nl+9d time units, then by time 16l+8nl+9d the leader has reached a decision.

Note: proofs are based on invariants.


Other Results On Time Performance

If a nice execution fragment starts in a reachable state and lasts more than 24l+10nl+13d time units, then:
• the leader decides by time 21l+8nl+11d, and at most 8n messages are sent
• all alive processes decide by time 24l+10nl+13d, and at most 2n additional messages are sent
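To get a feel for the bounds quoted in these slides, they can be evaluated for concrete parameters. The function names below are hypothetical, and the sample values (n = 5 processes, l = 1 and d = 10 in the same time unit) are chosen purely for illustration.

```python
def unique_leader_bound(n: int, l: float, d: float) -> float:
    """16l + 8nl + 9d: decision time with a unique leader (the theorem)."""
    return 16 * l + 8 * n * l + 9 * d


def leader_decides_bound(n: int, l: float, d: float) -> float:
    """21l + 8nl + 11d: time by which the leader decides."""
    return 21 * l + 8 * n * l + 11 * d


def all_decide_bound(n: int, l: float, d: float) -> float:
    """24l + 10nl + 13d: time by which all alive processes decide."""
    return 24 * l + 10 * n * l + 13 * d


n, l, d = 5, 1, 10
print(unique_leader_bound(n, l, d))   # 146
print(leader_decides_bound(n, l, d))  # 171
print(all_decide_bound(n, l, d))      # 204
print(8 * n + 2 * n)                  # 50 messages at most in total
```

With message delay dominating process step time, as in this example, the d terms account for most of each bound.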


Generalization Of The Synod Protocol: MULTIPAXOS

• consensus has to be reached on a sequence of values
• for each value we run BASICPAXOS
• the automata used for each instance of the algorithm are like the automata in BASICPAXOS, except that an additional parameter (the index of the proposed value) is present in each action
• concurrency: several leaders may concurrently initiate rounds, and these rounds are carried out concurrently
• several leaders initiating values concurrently is an important difference between the Paxos algorithm and the three-phase commit protocol
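Because instances run concurrently, the decision for index k+1 may be reached before the decision for index k. One way a process could turn per-index decisions into a sequence is sketched below; `DecisionLog` is a hypothetical helper, and the rule of delivering only the gap-free prefix is an assumption made for this illustration rather than something stated in the slides.

```python
from typing import Dict, List


class DecisionLog:
    """Collects the outputs of consensus instances, indexed by position."""

    def __init__(self) -> None:
        self.decided: Dict[int, str] = {}
        self.delivered: List[str] = []  # gap-free prefix of decided values
        self.next_index = 0

    def decide(self, index: int, value: str) -> None:
        """Record the output of instance `index`; agreement forbids conflicts."""
        if self.decided.setdefault(index, value) != value:
            raise ValueError(f"conflicting decisions for index {index}")
        # Deliver every value whose predecessors have all been decided.
        while self.next_index in self.decided:
            self.delivered.append(self.decided[self.next_index])
            self.next_index += 1


log = DecisionLog()
log.decide(1, "b")    # instance 1 finishes first
print(log.delivered)  # []
log.decide(0, "a")
print(log.delivered)  # ['a', 'b']
```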


Data Replication

• problem: providing distributed and concurrent access to data objects

• simple implementation: maintain the object at a single process accessed by multiple clients

some disadvantages:
• does not scale well as the number of clients increases
• not fault-tolerant

• other solution: data replication
• servers are replicated: each server runs the same state machine
• clients make requests, which are redirected to specific servers


Replica Coordination(1)

Requirements
• requests should be processed by state machines one at a time
• the order of processing should be consistent with potential causality
• outputs are determined only by the sequence of requests, independent of time or any other activity in the system

Replica coordination
• agreement: every nonfaulty state machine replica receives every request
• order: every nonfaulty state machine replica processes the requests it receives in the same relative order

• issues to be considered: fault-tolerance and reconfiguration
• MULTIPAXOS: a possible solution to the problem


Replica Coordination(2)
MULTIPAXOS For Replica Coordination

• each process in the system maintains a copy of the data object

A client requests an update operation:
• a process proposes the operation in an instance of MULTIPAXOS
• after some time, the update operation is the output value of that instance of MULTIPAXOS
• the leader of the round updates its local copy; because of correctness, all the alive processes update their copies, too
• a report is given to the client

A client requests a read operation:
• the request is immediately satisfied based on the local copy

Note:
• a majority is required to achieve consistency -> majority voting
• a unique leader is required to achieve termination -> primary copy replication


Replica Coordination(3)
Order and Stability

• unique identifiers (uids) for requests give a total order
• implementation: a replica next processes the stable request with the smallest unique identifier (stable request: no request from a correct client with a lower uid can subsequently be delivered to that state machine)

Using logical clocks to ensure order and stability:
• each process has a local counter
• the local counter is incremented after each event at that process
• each message sent is timestamped with the local clock
• upon receipt of a message, the local clock of the receiver becomes 1 + the maximum of the timestamp and the local clock
• a uid for each event is obtained by appending a fixed-length bit string (encoding the process id) to the counter value of the process where the event takes place
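The logical-clock rules above can be sketched as follows. `LogicalClock` is a hypothetical class; the 8-bit width of the process-id field is an arbitrary choice made for the example, and the uid is formed by shifting the counter left and placing the process id in the low bits.

```python
class LogicalClock:
    def __init__(self, process_id: int, id_bits: int = 8) -> None:
        assert 0 <= process_id < 2 ** id_bits, "process id must fit in id_bits"
        self.process_id = process_id
        self.id_bits = id_bits
        self.counter = 0

    def uid(self) -> int:
        """Counter value with the fixed-length process id appended."""
        return (self.counter << self.id_bits) | self.process_id

    def event(self) -> int:
        """A local event: take its uid, then increment the counter."""
        uid = self.uid()
        self.counter += 1
        return uid

    def send(self) -> int:
        """Sending is an event; the message carries the current counter."""
        timestamp = self.counter
        self.counter += 1
        return timestamp

    def receive(self, timestamp: int) -> int:
        """On receipt the clock jumps to 1 + max(timestamp, local clock)."""
        self.counter = 1 + max(self.counter, timestamp)
        return self.uid()


p, q = LogicalClock(1), LogicalClock(2)
first = p.event()          # uid 1   (counter 0, process 1)
stamp = p.send()           # timestamp 1
later = q.receive(stamp)   # uid 514 (counter 2, process 2)
print(first < later)       # True
```

Since the process id occupies the low-order bits, uids compare first by counter value and then by process id, which gives a total order consistent with message causality.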

Using real clocks to ensure order and stability:
• assumptions: the degree of clock synchronization is better than the minimum message delivery time; a request r will be received by every correct process no later than time uid(r)+Δ
• stability test: a request r is stable at a state machine if the local clock reads time t and t > uid(r)+Δ
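The real-clock stability test translates directly into code. In this sketch the uid of a request is assumed to be simply the clock reading at which it was issued, in integer milliseconds; `is_stable` and `next_request` are hypothetical helper names.

```python
from typing import List, Optional


def is_stable(uid: int, local_time: int, delta: int) -> bool:
    """A request is stable once the local clock has passed uid + delta."""
    return local_time > uid + delta


def next_request(pending: List[int], local_time: int, delta: int) -> Optional[int]:
    """The stable pending request with the smallest uid, or None if none is stable."""
    stable = [uid for uid in pending if is_stable(uid, local_time, delta)]
    return min(stable) if stable else None


pending = [1200, 1000, 1040]
print(next_request(pending, local_time=1150, delta=100))  # 1000
print(next_request(pending, local_time=1050, delta=100))  # None
```

At local time 1150 the requests stamped 1000 and 1040 are both stable, and the replica processes 1000 first; at time 1050 a request with a smaller uid could still arrive, so nothing is processed yet.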


Replica Coordination(4)
Reconfiguration

• at time t there are P(t) processes, of which F(t) are faulty
• necessary condition for correct output: P(t) > 2F(t) if Byzantine failures are possible; P(t) > F(t) if only fail-stop failures occur

• the system is described by 3 sets: clients (C), state machines (S), and output devices (O); information about them is stored in state variables and changed by commands
• C and O make periodic queries -> better sharing of processors
• messages sent by S always contain information about future reconfiguration -> permanent communication S<->C and S<->O
• requests to change the configuration of the system are made by a failure/recovery detection mechanism


Replica Coordination(6)
Integrating A Repaired Object

• goal: integrate element e at request r
• notation: e[r] is the state a non-faulty system element e should be in after processing all the requests up to r
• if processors are fail-stop and logical clocks are implemented, then the cooperation of only one state machine replica (sm) is needed: if the sm has not failed, then it is correct, and because of consensus among replicas, its information on the system is correct and complete with respect to the other sms -> the sm used should have access to enough information
• implementation: e[r] is sent to e before the output produced by processing any request with uid larger than uid(r)
• e in O: e[r] is usually device-specific setup information; it can be stored in state variables of the sm
• e in C: e[r] is usually based on sensor values read; use information from C to sm


Replica Coordination(7)
Integrating A Repaired State Machine

• first attempt: use the same algorithm; sm sends e the values of all its state variables before the output produced by processing any request with uid larger than uid(r)
• problem: some client request might be received by sm after it sends e[r], but delivered to e before e's repair
• solution: sm must relay to e the requests it receives from clients
• for how long: as soon as e has received a request directly from a client c, requests from the same c with larger uid need not be relayed to e; so e should inform sm of the uid of requests received directly from c

algorithm:
(1) sm sends e the values of its state variables and copies of pending requests
(2) sm sends e every subsequent request r received from a client c such that uid(r) < uid(rc), where rc is the first request e has received directly from c after e restarted
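Step (2) amounts to a per-client cutoff kept by sm. A minimal sketch of that bookkeeping follows; the `RelayTable` class and its method names are hypothetical, and uids are plain integers here.

```python
from typing import Dict


class RelayTable:
    """Kept by sm: which client requests must still be forwarded to e."""

    def __init__(self) -> None:
        self.first_direct_uid: Dict[str, int] = {}  # client -> uid(rc)

    def e_reports_direct(self, client: str, uid: int) -> None:
        """e informs sm of a request it received directly from `client`.
        Only the first report per client sets the cutoff uid(rc)."""
        self.first_direct_uid.setdefault(client, uid)

    def must_relay(self, client: str, uid: int) -> bool:
        """Relay while e has heard nothing from the client, or uid(r) < uid(rc)."""
        rc = self.first_direct_uid.get(client)
        return rc is None or uid < rc


table = RelayTable()
print(table.must_relay("c", 5))  # True: e has not yet heard from c directly
table.e_reports_direct("c", 7)
print(table.must_relay("c", 6))  # True: older than the first direct request
print(table.must_relay("c", 8))  # False: e receives such requests itself
```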