
Distributed Systems Research that Makes a Difference

Center for Networking and Distributed Systems, Johns Hopkins University

With: Baruch Awerbuch, R. Sean Borgstrom, Ryan Caudy, Claudiu Danilov, Ashima Munjal, Cristina Nita-Rotaru, Theo Schlossnagle, Jonathan Stanton, Ciprian Tutu

www.cnds.jhu.edu


Distributed Systems

• A distributed system is:
  – A collection of computers.
  – Connected by a network.
  – With a common goal.

• We at CNDS care about:
  – High availability.
  – High performance.


Hopkins/CNDS Research Objective

• Understand the problem: distributed communication and information access.
  – Availability, consistency, performance tradeoffs…
  – Local / wide area networks, latency, throughput…

• Invent correct and efficient algorithmic solutions to cope with the problematic aspects of distributed systems.

• Develop a small set of generic software tools encapsulating these solutions.

• Use these software tools to create a paradigm shift in the way distributed systems are actually built:
  – Through actual use of these software tools.
  – Through education.


Outline

• CNDS vision.
• Group communication.
  – What is it?
  – The Spread toolkit.
  – Distributed logging with Spread.
• Metacomputers.
  – Static and dynamic resource allocation.
  – The Cost-Benefit framework.
  – Backhand – load balancing a web cluster.
• Wackamole.
  – High availability for clusters.
  – High availability for edge routers.
• Making a difference.


Group Communication: A different communication paradigm

[Diagram: clients (C), each a member of one or more of the groups a, b, c, d; a message sent to group a is delivered to every member of a, and a message sent to group b to every member of b.]


Group Communication: A different communication paradigm

• Handles potential problems in the network: message loss, machine downtime, network partitions, security threats.

• Delivery guarantees: Unreliable, Reliable, Safe (stable).
• Message ordering: Unordered, FIFO, Agreed order.
• Membership services.
• Group security: Unsecure, Signed, Encrypted.
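In Spread (introduced below), these guarantees are selected per message through a service-type flag. A minimal sketch, assuming an already-connected mailbox that has joined a group; the group name "demo" and the payloads are illustrative:

    #include <sp.h>

    /* Sketch: one send per Spread service type. `mbox` is assumed to be
     * an already-connected mailbox that has joined the group "demo". */
    void send_examples(mailbox mbox)
    {
        /* FIFO: this sender's messages arrive in the order they were sent. */
        SP_multicast(mbox, FIFO_MESS, "demo", 0, 6, "fifo!");

        /* Agreed: all members deliver all messages in the same total order. */
        SP_multicast(mbox, AGREED_MESS, "demo", 0, 8, "agreed!");

        /* Safe: delivered only after all daemons have received the message. */
        SP_multicast(mbox, SAFE_MESS, "demo", 0, 6, "safe!");
    }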


What Can You Do With Group Communication?

• Efficient server replication:
  – High throughput.
  – High availability.

• And also:
  – Distributed system management.
  – Conferencing.
  – Collaborative design.
  – Distributed simulations.
  – …


Spread: A Group Communication Toolkit

• Client – daemon architecture.
• Very simple API (basically 6 calls). C/C++, Java, Perl, Python…
• Cross platform: Windows, Solaris, Alpha, MacOS, Embedded…
• Part of OpenLinux, Debian, FreeBSD.

http://www.spread.org.
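The six core calls are SP_connect, SP_join, SP_multicast, SP_receive, SP_leave, and SP_disconnect. A minimal sketch of a complete client, assuming a daemon at the default 4803@localhost; the client name, group name, and minimal error handling are illustrative:

    #include <stdio.h>
    #include <sp.h>

    int main(void)
    {
        mailbox mbox;
        char private_group[MAX_GROUP_NAME];
        char sender[MAX_GROUP_NAME], groups[10][MAX_GROUP_NAME], mess[1024];
        service svc_type = 0;
        int16 mess_type;
        int num_groups, endian_mismatch, ret;

        /* Connect to the daemon; `private_group` receives our private name.
         * The 0 asks not to be told about membership changes, keeping the
         * receive loop below simple. */
        ret = SP_connect("4803@localhost", "client1", 0, 0,
                         &mbox, private_group);
        if (ret != ACCEPT_SESSION) { SP_error(ret); return 1; }

        SP_join(mbox, "demo");                                  /* join a group */
        SP_multicast(mbox, AGREED_MESS, "demo", 0, 6, "hello"); /* ordered send */

        /* Receive one message - our own, delivered back to every member. */
        ret = SP_receive(mbox, &svc_type, sender, 10, &num_groups, groups,
                         &mess_type, &endian_mismatch, sizeof(mess), mess);
        if (ret >= 0)
            printf("received %d bytes from %s\n", ret, sender);

        SP_leave(mbox, "demo");
        SP_disconnect(mbox);
        return 0;
    }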

[Diagram: clients (C) connect to Spread daemons (S); each daemon serves several local clients, which are members of groups a, b, c, d.]

A Spread Overlay Network

• Uses membership knowledge to optimize performance.

[Diagram: a Spread overlay network spanning UCSB, OSU, Rutgers, Mae East, Hopkins, and CNDS; links use hardware broadcast, hardware multicast, and IP multicast.]

Distributed Logging

• Think of a cluster of 50 (web) servers.
• Operation logs for all of the servers should be consolidated in one logical place.
• The aggregate log should be resilient (so it must be replicated).


Distributed Logging (cont.)

Today, mainly a service for local area clusters.

[Diagram: a (web) cluster logging through Spread.]

Useful Spread properties:
• Scalability with the number of groups.
• Open group semantics - good support for publish-subscribe.
• Membership services.
• Efficient for small messages.
• Small footprint - low overhead.


Implementation

• Option 1: Distributed syslogd
  – Adding a Spread group as a potential destination of the operating system logger syslogd (in addition to file, screen, socket).

• Option 2: mod_log_spread
  – An Apache module enabling logging through Spread (courtesy of George Schlossnagle).
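Either option reduces to the same pattern: each server multicasts its log lines to a well-known Spread group, and any number of consolidators join that group and append to a replicated file. A minimal sketch of the publishing side, assuming a daemon at the default 4803@localhost; the group name "wwwlog" and function names are illustrative, not the actual distributed-syslogd or mod_log_spread code:

    #include <string.h>
    #include <sp.h>

    static mailbox mbox;

    /* Connect once at startup; a NULL private name asks the daemon to
     * pick a unique one, so all 50 servers can run the same code. */
    int log_open(void)
    {
        char private_group[MAX_GROUP_NAME];
        return SP_connect("4803@localhost", NULL, 0, 0, &mbox, private_group);
    }

    /* FIFO order suffices: each server's own lines stay in order, and the
     * consolidators interleave lines from different servers as they arrive. */
    int log_line(const char *line)
    {
        return SP_multicast(mbox, FIFO_MESS, "wwwlog", 0,
                            (int)strlen(line) + 1, line);
    }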


The Real Challenge

• Surprisingly, not the basic protocols.
• But:
  – Hostile environment - average load on the servers may top 100!
  – They keep adding these servers...
  – Has to be rock-solid, running constantly for days.

====================================================
Status at bp12 V3.12 (state 1, gstate 1) after 81113 seconds:
Membership: 34 procs in 1 segments, leader is bp12
rounds    : 2796352   tok_hurry : 8537       memb change: 5
sent pack : 625516    recv pack : 27229735   retrans    : 553
u retrans : 460       s retrans : 93         b retrans  : 0
My_aru    : 21829811  Aru       : 21829718   Highest seq: 21829811
Sessions  : 67        Groups    : 2          Window     : 60
Deliver M : 27552100  Deliver Pk: 27855251   Pers Window: 15
Delta Mes : 8076      Delta Pack: 8166       Delta sec  : 10
====================================================


Distributed Logging (cont.)

• Once you have it running, there is more:
  – Real-time monitoring.
  – Real-time data-mining and customization.


Outline

• CNDS vision.
• Group communication.
  – What is it?
  – The Spread toolkit.
  – Distributed logging with Spread.
• Metacomputers.
  – Static and dynamic resource allocation.
  – The Cost-Benefit framework.
  – Backhand – load balancing a web cluster.
• Wackamole.
  – High availability for clusters.
  – High availability for edge routers.
• Making a difference.


Static and Dynamic Distributed Resource Allocation

We looked at two metacomputing systems: PVM and Mosix.

• PVM - Parallel Virtual Machine
  – Static assignment of jobs to machines.
  – Default: round-robin assignment policy.
  – Widely used.

• Mosix
  – Dynamic process migration.
  – Main objective is load balancing, with some ad-hoc heuristics for memory depletion.


A Cost-Benefit Approach to Distributed Resource Allocation

[Figure: for each resource - A (CPU), B (memory), C (I/O) - a curve maps usage to cost, rising from 1 at zero usage to n at full usage on a cluster of n machines.]

cost(r) = n^utilization(r)

Machine_cost = cost(CPU) + cost(memory) + cost(I/O)
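The assignment rule this framework implies is to place each job on the machine whose total cost grows the least. A minimal sketch under that reading; the struct fields and the pow-based cost follow the formula above and are illustrative, not the EPVM/EMosix code:

    #include <math.h>

    /* Per-machine resource utilizations, each in [0, 1]. */
    struct machine { double cpu, mem, io; };

    /* cost(r) = n^utilization(r); a machine's cost is the sum over resources. */
    static double machine_cost(const struct machine *m, int n)
    {
        return pow(n, m->cpu) + pow(n, m->mem) + pow(n, m->io);
    }

    /* Assign a job to the machine with the smallest marginal cost: the
     * increase in its cost if the job's demands were added to it. */
    int pick_machine(const struct machine *ms, int n,
                     double job_cpu, double job_mem, double job_io)
    {
        int best = 0;
        double best_delta = 1e300;

        for (int i = 0; i < n; i++) {
            struct machine after = ms[i];
            after.cpu += job_cpu;
            after.mem += job_mem;
            after.io  += job_io;

            double delta = machine_cost(&after, n) - machine_cost(&ms[i], n);
            if (delta < best_delta) { best_delta = delta; best = i; }
        }
        return best;
    }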


New Assignment Policies

Based on the Cost-Benefit framework, we have created two additional policies:

• Enhanced PVM (EPVM).
• Enhanced Mosix (EMosix).

We compared the performance of the four policies according to the average slowdown criterion.


Simulation: PVM / EPVM

[Chart: average slowdown (lower is better) under PVM (x-axis) vs. Enhanced PVM (y-axis), both axes 0-80.]


Simulation: Mosix / EMosix

[Chart: average slowdown (lower is better) under Mosix (x-axis) vs. Enhanced Mosix (y-axis), both axes 0-70.]


Real Life: PVM / EPVM

[Chart: average slowdown (lower is better) under PVM (x-axis) vs. Enhanced PVM (y-axis), both axes 0-35.]


Interesting: Mosix / EPVM

[Chart: average slowdown (lower is better) under Mosix (x-axis) vs. Enhanced PVM (y-axis), both axes 0-30.]


How general is the framework? Managing a Web farm

• Applying the same concept to manage the resources of a Web farm.
• Web servers in the farm may have different capabilities. Tasks may have different requirements.
• Hard problems:
  – Relatively short-lived tasks.
  – Stale information (welcome to the real world).
  – Inaccurate information (welcome to the real world - did we say that?).


Backhand: Load Balancing a Web Cluster

• Peer-to-peer, cost-based resource allocation decisions.
• The necessary “machinery” implemented as an Apache module to minimize overhead.
• Linux, Solaris, FreeBSD, Windows*.
• Part of SuSE Linux and OpenLinux.
• www.backhand.org
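A minimal sketch of the per-request decision, in the spirit of Backhand's cost-based approach; the stats structure and the n^utilization cost follow the framework above and are illustrative, not the mod_backhand source:

    #include <math.h>

    /* Resource snapshot that each peer periodically shares with the others. */
    struct server_stats {
        double cpu_util;   /* in [0, 1] */
        double mem_util;   /* in [0, 1] */
    };

    /* Rank peers by cost and return the index of the cheapest server;
     * if it is not `self`, the Apache module proxies the request there. */
    int choose_server(const struct server_stats *peers, int n, int self)
    {
        int best = self;
        double best_cost = 1e300;

        for (int i = 0; i < n; i++) {
            double cost = pow(n, peers[i].cpu_util)
                        + pow(n, peers[i].mem_util);
            if (cost < best_cost) { best_cost = cost; best = i; }
        }
        return best;   /* best == self means: serve the request locally */
    }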


Backhand in Action

Wackamole

Outline

• CNDS vision.
• Group communication.
  – What is it?
  – The Spread toolkit.
  – Distributed logging with Spread.
• Metacomputers.
  – Static and dynamic resource allocation.
  – The Cost-Benefit framework.
  – Backhand – load balancing a web cluster.
• Wackamole.
  – High availability for clusters.
  – High availability for edge routers.
• Making a difference.


Wackamole - Dynamic Cluster with N-way Fail-over

• In contrast to available gateway solutions (Alteon, BigIP, Cisco’s LocalDirector, …).
• Exploiting Spread’s strong membership semantics to build a distributed agreement protocol that strictly distributes the cluster’s public IP addresses among the currently connected servers.
• Useful for locally replicated web servers, ftp servers, DNS servers, and even fail-over routers.
• Transparent!

www.wackamole.org
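The core idea: on every membership change, all members evaluate the same deterministic function over the agreed member list, so each public IP lands on exactly one live server with no coordination beyond the view itself. A minimal sketch under that reading; the round-robin rule, names, and printed actions are illustrative, not Wackamole's actual protocol:

    #include <stdio.h>
    #include <string.h>

    #define NUM_VIPS 4
    static const char *vips[NUM_VIPS] = {
        "192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"   /* example VIPs */
    };

    /* Called on each Spread membership change with the agreed view.
     * Every member runs this identically, so the assignment is consistent. */
    void apply_view(char members[][32], int n, const char *me)
    {
        for (int v = 0; v < NUM_VIPS; v++) {
            const char *owner = members[v % n];   /* round-robin over the view */
            if (strcmp(owner, me) == 0)
                printf("acquire %s (alias the interface, gratuitous ARP)\n",
                       vips[v]);
            else
                printf("release %s if currently held\n", vips[v]);
        }
    }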

Wackamole Architecture

Wackamole for Clusters

Wackamole for Edge Routers

Wackamole Performance

• Out-of-the-box average fail-over: 12 seconds.
• Tuned average fail-over: 2.5 seconds.
• Voluntary fail-over (maintenance, etc.): up to 250 milliseconds, most of the time within 10 milliseconds.

[Chart: fail-over latency in seconds (0-14) vs. number of servers (1-11), for out-of-the-box Spread and tuned Spread.]


Making a Difference

[Chart: number of downloads of Spread, Backhand, and Wackamole per period, Oct-97 through Oct-02, on a 0-6,000 scale. Annotations mark 100 and 10,000 actual Backhand-powered web sites discovered, thanks to Netcraft.]
