State Machine Replication

Project Presentation

Ido Zachevsky, Marat Radan

Supervisor: Ittay Eyal

Winter Semester 2010

Goals

• Learn and understand Paxos and Python.

• Design a program for a fault-tolerant distributed system using the Paxos algorithm.

• Test on a real Internet-scale system, Planet-Lab.

The Problem – Distributed Storage

• Using Distributed Algorithms on a network has many advantages

• It also has many problems

• This project focuses on the Synchronization Problem

Synchronization

• The task: Successfully run a single state machine that involves all the computers of a network

• All the computers need to be in sync regarding the Current State and the Next States.

• All the computers need to know the transitions.

Problems?

• Can any computer choose the next state?

• What if a computer disconnects ungracefully?

• What if a message is delayed due to congestion?

• Other problems…

• Solution: Use a dedicated algorithm

A Solution – Paxos

• Keeping the Safety requirements ensures that the chosen value is agreed upon by all computers

• Keeping the Liveness requirements ensures a value will be chosen

Paxos – Background

• Paxos Made Simple, Leslie Lamport, 01 Nov 2001

• Paxos Made Live

Principles

• The system consists of three agent classes:
– Proposers
– Acceptors
– Learners

• Some of them are distinguished

• Communicate via messages

Principles – continued

• A single computer – a Leader – is in charge

• Decision cycle in two phases:
1. A majority must promise to commit to a recent proposal.
2. Once a majority has committed, all computers are informed of the Decision.
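
The two phases can be sketched in a few lines of Python. This is a minimal in-memory illustration of single-decree Paxos, not the project's code: the names `Acceptor`, `on_prepare`, `on_accept` and `run_round` are hypothetical, messages are replaced by direct method calls, and failures are not modelled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    """Single-decree Paxos acceptor state (illustrative, in-memory only)."""
    promised: int = -1                    # highest proposal number promised
    accepted_n: int = -1                  # number of the last accepted proposal
    accepted_value: Optional[str] = None  # value of the last accepted proposal

    def on_prepare(self, n: int) -> Tuple[bool, int, Optional[str]]:
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, self.accepted_n, self.accepted_value

    def on_accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def run_round(acceptors, n, value):
    """Run one proposer round; return the chosen value, or None on failure."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: collect promises from the acceptors.
    replies = [a.on_prepare(n) for a in acceptors]
    promises = [r for r in replies if r[0]]
    if len(promises) < majority:
        return None

    # If any promising acceptor already accepted a value, adopt the one
    # with the highest proposal number instead of our own value.
    prev_n, prev_value = max(((r[1], r[2]) for r in promises),
                             key=lambda p: p[0])
    if prev_n >= 0:
        value = prev_value

    # Phase 2: the value is chosen once a majority accepts it.
    acks = sum(a.on_accept(n, value) for a in acceptors)
    return value if acks >= majority else None
```

For example, with three acceptors, a first round numbered 1 proposing "A" chooses "A"; a later round numbered 2 proposing "B" still returns "A", because the proposer must adopt the value that was already accepted.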

Safety requirements

• Only a value that has been proposed may be chosen,

• Only a single value is chosen, and

• A process never learns that a value has been chosen unless it actually has been.

Liveness requirements

• Some proposed value is eventually chosen.

• A process can eventually learn the value which has been chosen.

Implementing a State Machine

• Collection of servers, each implementing a state machine.

• The i-th state machine command in the sequence is the value chosen by the i-th instance of the Paxos consensus algorithm.

• A pre-decided set of commands is necessary.
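
As a rough illustration of the points above, the sketch below applies decided commands strictly in instance order, so replicas that learn the same decisions reach the same state even if the decisions arrive in a different order. The class `ReplicatedCounter` and the command set (`inc`, `dec`, `reset`) are assumptions made for the example; the presentation does not specify the project's actual commands.

```python
class ReplicatedCounter:
    """Toy state machine: an integer driven by a pre-decided command set."""

    # Hypothetical command set, fixed in advance on every server.
    COMMANDS = {
        "inc": lambda state: state + 1,
        "dec": lambda state: state - 1,
        "reset": lambda state: 0,
    }

    def __init__(self):
        self.state = 0
        self.next_slot = 0   # index of the next Paxos instance to apply
        self.decided = {}    # decisions learned but not yet applied

    def on_decision(self, slot, command):
        """Record the value chosen by Paxos instance `slot`, then apply
        every command whose predecessors have all been applied."""
        if command not in self.COMMANDS:
            raise ValueError("unknown command: %r" % (command,))
        if slot < self.next_slot:
            return  # already applied; ignore duplicate notifications
        self.decided.setdefault(slot, command)
        while self.next_slot in self.decided:
            cmd = self.decided.pop(self.next_slot)
            self.state = self.COMMANDS[cmd](self.state)
            self.next_slot += 1
```

Buffering decisions until the gap before them is filled is one simple design choice; it keeps the i-th applied command equal to the value chosen by the i-th instance.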

Planet-Lab

• Planet-Lab is a global research network that supports the development of new network services.

• Understanding the system is required

• Monitoring is necessary
– Generally, implemented via NSSL-lab.

Project Design

• Chosen language for implementation: Python

• Network framework: Twisted Matrix

• Implementation stages:
– Single Decision on NSSL
– Multiple Decisions on NSSL
– Single Decision on Planet-Lab
– Multiple Decisions on Planet-Lab

[Figure: system architecture. Clients 1 … N and Servers 1 … N communicate over The Network. Labels inside a server: Listening Socket, Transport, Protocol, ProtocolFactory (two groups of Transport/Protocol objects), Paxos Algorithm, Reactor Loop.]

Implementation

• Use Cases
– Acceptor disconnects?
– Leader disconnects?
   • At which stage?
– Acceptor message fails to deliver?
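
For the acceptor cases, one way to reason about them is quorum size: a decision only needs responses from a majority of acceptors, so a minority that disconnects or whose messages are lost does not block progress. The helper names below are hypothetical and the sketch covers only this counting argument, not leader failure.

```python
def majority(n_acceptors: int) -> int:
    """Smallest number of acceptors that forms a majority."""
    return n_acceptors // 2 + 1


def can_decide(n_acceptors: int, n_responding: int) -> bool:
    """True if enough acceptors respond for a value to be chosen."""
    return n_responding >= majority(n_acceptors)
```

With 5 acceptors, for instance, a round can still complete when 2 of them are unreachable, but not when 3 are.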

Implementation – continued

• Leader Election
– In fact an inherent part of the algorithm

• Output and monitoring
– Actual output not visible in general
– Only via monitoring

Flow

1. Register Nodes
2. Verify and install necessary files
3. Upload
4. Initiate Monitor
5. Run and wait for activity
6. Review results

Implementation – File Structure

• Initial Installation
– Installation: my_install (csh)
– Initial Communication: send_install (py)
– Alive Machines Server: install_serv (py)

• Uploading and Running
– Deployment: my_deploy (csh)
– Multi-Run: my_multirun (csh)
– Multi-Stop: my_multistop (csh)

• Core Paxos Program
– Paxos Instance: paxos_inst (py)
– Paxos Algorithm: paxos_alg (py)
– Network Data: paxos_net_data (txt)

• Service Scripts and Files
– Alive Nodes listnodes (txt)
– Paxos Monitor: paxos_mon_serv (py)
– Additional files: combine_nodes (csh), conv_nodes (csh), remove_done (csh)

Results

• Everything works at the NSSL

• In Real-Life, not necessarily

• Communication phenomena – messages arriving unordered, in large chunks, etc.

• Works well for up to 20-30 Nodes

• Use cases tested in Lab
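
The "large chunks" phenomenon is typical of stream transports: one read may deliver several messages at once, or only part of one. A common remedy is to buffer incoming bytes and split them on a delimiter. The sketch below assumes newline-delimited messages, which is a hypothetical wire format (the presentation does not describe the project's actual one), and the class name `LineFramer` and the sample message texts are likewise illustrative.

```python
class LineFramer:
    """Reassemble delimiter-terminated messages from arbitrary byte chunks."""

    def __init__(self, delimiter=b"\n"):
        self._buffer = b""
        self._delimiter = delimiter

    def feed(self, chunk: bytes):
        """Add a received chunk and return the list of complete messages.

        Any trailing partial message stays buffered until its delimiter
        arrives in a later chunk.
        """
        self._buffer += chunk
        *complete, self._buffer = self._buffer.split(self._delimiter)
        return complete
```

Feeding `b"PREPARE 1\nACC"` yields one complete message and keeps `ACC` buffered; a following chunk `b"EPT 1 x\nPROMISE 1\n"` then yields the two remaining messages.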

Conclusions

• Preliminary work needed to understand Twisted Matrix and Planet-Lab

• Dealing with network problems
– SSH Tunnel instead of “real” monitoring

• Requirements fulfilled

Further work

• Optimize networking protocol
– Improve client-server interface
– Inefficient startup – N(N-1) for N machines

• Partition Decision processes
– Only few nodes decide each resolution
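
To give a feel for the N(N-1) startup cost, the short sketch below counts connections under the assumption that every machine opens one connection to every other machine (one per ordered pair). That reading of the figure is an interpretation, and the function names are hypothetical.

```python
from itertools import permutations


def startup_connections(n_machines: int) -> int:
    """Connections opened if each machine connects to every other one."""
    return n_machines * (n_machines - 1)


def connection_pairs(nodes):
    """Enumerate the (source, destination) pairs behind that count."""
    return list(permutations(nodes, 2))
```

Under this assumption, 20 machines need 380 connections and 30 machines need 870, so the cost grows quadratically with the number of nodes.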

Thank you
