Yair Amir, 10 Dec 02
Distributed Systems Research that Makes a Difference
Center for Networking and Distributed Systems, Johns Hopkins University
With: Baruch Awerbuch, R. Sean Borgstrom, Ryan Caudy, Claudiu Danilov, Ashima Munjal, Cristina Nita-Rotaru, Theo Schlossnagle, Jonathan Stanton, Ciprian Tutu
www.cnds.jhu.edu
Distributed Systems

• A distributed system is:
– A collection of computers.
– Connected by a network.
– With a common goal.
• We at CNDS care about:
– High availability.
– High performance.
Hopkins/CNDS Research Objective

• Understand the problem: distributed communication and information access.
– Availability, consistency, performance tradeoffs…
– Local / wide area networks, latency, throughput…
• Invent correct and efficient algorithmic solutions to cope with the problematic aspects of distributed systems.
• Develop a small set of generic software tools encapsulating these solutions.
• Use these software tools to create a paradigm shift in the way distributed systems are actually built:
– Through actual use of these software tools.
– Through education.
Outline

• CNDS vision.
• Group communication.
– What is it?
– The Spread toolkit.
– Distributed logging with Spread.
• Metacomputers.
– Static and dynamic resource allocation.
– The Cost-Benefit framework.
– Backhand – load balancing a web cluster.
• Wackamole.
– High availability for clusters.
– High availability for edge routers.
• Making a difference.
Group Communication: A Different Communication Paradigm

[Diagram: eight clients, each belonging to one or more of the groups a, b, c, d. A message sent to group a is delivered to every member of a; a message sent to group b reaches every member of b.]
Group Communication: A Different Communication Paradigm

• Handles potential problems in the network: message loss, machine downtime, network partitions, security threats.
• Delivery guarantees: Unreliable, Reliable, Safe (stable).
• Message ordering: Unordered, FIFO, Agreed order.
• Membership services.
• Group security: Unsecure, Signed, Encrypted.
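The ordering guarantees can be made concrete with a toy sketch (plain Python, not Spread's actual protocols): "Agreed order" means every group member delivers the same messages in the same global order, even when they arrive in different orders on the wire. One classic way to illustrate this is a sequencer; all names below are illustrative.

```python
class Sequencer:
    """Stamps every message with a global sequence number, so all
    group members can deliver in the same (agreed) order."""
    def __init__(self):
        self.seq = 0
    def stamp(self, msg):
        self.seq += 1
        return (self.seq, msg)

class Member:
    def __init__(self, name):
        self.name = name
        self.delivered = []   # messages delivered so far, in order
        self.pending = {}     # seq -> msg, buffered until gaps close
        self.next_seq = 1
    def receive(self, seq, msg):
        # Buffer out-of-order arrivals; deliver only in sequence.
        self.pending[seq] = msg
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

seq = Sequencer()
a, b = Member("a"), Member("b")
m1 = seq.stamp("update-1")
m2 = seq.stamp("update-2")
# Member a sees the messages in order; member b sees them reversed.
a.receive(*m1); a.receive(*m2)
b.receive(*m2); b.receive(*m1)
assert a.delivered == b.delivered == ["update-1", "update-2"]
```

Despite the different arrival orders, both members deliver identically; real group communication systems achieve this without a single sequencer, but the delivered guarantee is the same.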
What Can You Do With Group Communication?

• Efficient server replication:
– High throughput.
– High availability.
• And also:
– Distributed system management.
– Conferencing.
– Collaborative design.
– Distributed simulations.
– …
Spread: A Group Communication Toolkit

• Client – daemon architecture.
• Very simple API (basically 6 calls). C/C++, Java, Perl, Python…
• Cross platform: Windows, Solaris, Alpha, MacOS, Embedded…
• Part of OpenLinux, Debian, FreeBSD.
http://www.spread.org.
[Diagram: three Spread daemons (S), each serving several local clients (C); clients belong to groups a, b, c, d and communicate through their local daemon.]
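The "basically 6 calls" are, roughly, connect, join, multicast, receive, leave, and disconnect (in Spread's C API these appear as SP_connect, SP_join, and so on). The toy in-process stand-in below mirrors that six-call shape only; it is not Spread's actual binding, and all names are illustrative.

```python
# Toy stand-in for the six-call shape of a group communication API:
# connect / join / multicast / receive / leave / disconnect.
class ToyDaemon:
    def __init__(self):
        self.groups = {}               # group name -> set of mailboxes

class Mailbox:
    def __init__(self, daemon, private_name):
        self.daemon, self.name, self.inbox = daemon, private_name, []
    def join(self, group):
        self.daemon.groups.setdefault(group, set()).add(self)
    def multicast(self, group, message):
        # Deliver to every current member of the group.
        for member in self.daemon.groups.get(group, ()):
            member.inbox.append((group, self.name, message))
    def receive(self):
        return self.inbox.pop(0)
    def leave(self, group):
        self.daemon.groups.get(group, set()).discard(self)

def connect(daemon, private_name):
    return Mailbox(daemon, private_name)

def disconnect(mbox):
    for members in mbox.daemon.groups.values():
        members.discard(mbox)

d = ToyDaemon()
alice, bob = connect(d, "alice"), connect(d, "bob")
alice.join("chat"); bob.join("chat")
alice.multicast("chat", "hello")
assert bob.receive() == ("chat", "alice", "hello")
```

In the real toolkit the daemon runs as a separate process and handles reliability, ordering, and membership; the client sees only this small surface.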
A Spread Overlay Network
• Uses membership knowledge to optimize performance.
[Map: a Spread overlay spanning UCSB, OSU, Rutgers, Mae East, Hopkins, and CNDS, using hardware broadcast, hardware multicast, or IP multicast on each local segment.]
Distributed Logging

• Think of a cluster of 50 (web) servers.
• Operation logs for all of the servers should be consolidated in one logical place.
• The aggregate log should be resilient (so it must be replicated).
Distributed Logging (cont.)

Today, mainly a service for local area clusters.

Useful Spread properties:
• Scalability with the number of groups.
• Open group semantics – good support for publish-subscribe.
• Membership services.
• Efficient for small messages.
• Small footprint – low overhead.
Implementation

• Option 1: Distributed syslogd
– Adding a Spread group as a potential destination of the operating system logger syslogd (in addition to file, screen, socket).
• Option 2: mod_log_spread
– An Apache module enabling logging through Spread (courtesy of George Schlossnagle).
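Either option boils down to the same pattern: every server multicasts its log lines to one logical group, and any number of logger processes subscribe to that group, so the consolidated log is replicated for free. A minimal sketch of that pattern, using a toy in-process bus rather than Spread itself (all names are illustrative):

```python
from collections import defaultdict

class Bus:
    """Toy publish-subscribe bus standing in for a Spread group."""
    def __init__(self):
        self.subscribers = defaultdict(list)   # group -> list of queues
    def subscribe(self, group):
        q = []
        self.subscribers[group].append(q)
        return q
    def multicast(self, group, line):
        # Every subscriber of the group receives every line.
        for q in self.subscribers[group]:
            q.append(line)

bus = Bus()
logger1 = bus.subscribe("weblog")   # primary aggregate log
logger2 = bus.subscribe("weblog")   # replica, ideally on another machine
for server in ("www1", "www2", "www3"):
    bus.multicast("weblog", f"{server}: GET /index.html 200")

# Both loggers hold the same consolidated log; losing one machine
# loses nothing.
assert logger1 == logger2 and len(logger1) == 3
```

Note how well this matches the Spread properties listed above: open groups let loggers come and go, and log lines are exactly the small messages the toolkit is efficient at.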
The Real Challenge

• Surprisingly, not the basic protocols.
• But:
– Hostile environment – average load on the servers may top 100!
– They keep adding these servers...
– Has to be rock-solid, running constantly for days.
====================================================
Status at bp12 V3.12 (state 1, gstate 1) after 81113 seconds:
Membership: 34 procs in 1 segments, leader is bp12
rounds   : 2796352    tok_hurry : 8537        memb change: 5
sent pack: 625516     recv pack : 27229735    retrans    : 553
u retrans: 460        s retrans : 93          b retrans  : 0
My_aru   : 21829811   Aru       : 21829718    Highest seq: 21829811
Sessions : 67         Groups    : 2           Window     : 60
Deliver M: 27552100   Deliver Pk: 27855251    Pers Window: 15
Delta Mes: 8076       Delta Pack: 8166        Delta sec  : 10
====================================================
Distributed Logging (cont.)

• Once you have it running, there is more:
– Real-time monitoring.
– Real-time data-mining and customization.
Static and Dynamic Distributed Resource Allocation

We looked at two metacomputing systems: PVM and Mosix.

• PVM – Parallel Virtual Machine
– Static assignment of jobs to machines.
– Default: round-robin assignment policy.
– Widely used.
• Mosix
– Dynamic process migration.
– Main objective is load balancing, with some ad-hoc heuristics for memory depletion.
A Cost-Benefit Approach to Distributed Resource Allocation
[Diagram: machines 1…n, each with resources A (CPU), B (memory), and C (I/O); each resource has a convex cost curve that rises with usage.]

cost(r) = n^utilization(r)

Machine_cost = cost(CPU) + cost(memory) + cost(I/O)
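A sketch of how such a cost function can drive assignment. One plausible reading of the slide's (partly garbled) formula is that each resource is charged cost(r) = n^utilization(r), with n the number of machines, so cost climbs steeply as a resource nears saturation; a job then goes to the machine whose total cost would rise least. All machine names and numbers below are illustrative, not measurements from the talk.

```python
def machine_cost(n, usage):
    """Total cost of a machine: sum of n**utilization over resources.
    usage: dict of resource name -> utilization in [0, 1]."""
    return sum(n ** u for u in usage.values())

def marginal_cost(n, usage, job):
    """How much the machine's cost rises if it accepts the job."""
    after = {r: usage[r] + job.get(r, 0.0) for r in usage}
    return machine_cost(n, after) - machine_cost(n, usage)

def assign(machines, job):
    """Place the job on the machine with the smallest marginal cost."""
    n = len(machines)
    return min(machines,
               key=lambda name: marginal_cost(n, machines[name], job))

machines = {
    "m1": {"cpu": 0.9, "memory": 0.2, "io": 0.1},   # CPU nearly saturated
    "m2": {"cpu": 0.3, "memory": 0.3, "io": 0.3},   # evenly loaded
}
job = {"cpu": 0.2, "memory": 0.1, "io": 0.0}
assert assign(machines, job) == "m2"   # steers clear of the loaded CPU
```

The convexity is the point: a plain load balancer might see similar total loads on m1 and m2, but the exponential cost makes the nearly saturated CPU on m1 expensive, which is what lets one framework handle CPU, memory, and I/O uniformly.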
New Assignment Policy

Based on the Cost-Benefit framework, we have created two additional policies:

• Enhanced PVM (EPVM).
• Enhanced Mosix (EMosix).

We compared the performance of the four policies according to the average slowdown criterion.
Simulation: PVM / EPVM

[Scatter plot: average slowdown (lower is better) of PVM vs. Enhanced PVM; both axes 0–80.]
Simulation: Mosix / EMosix

[Scatter plot: average slowdown (lower is better) of MOSIX vs. Enhanced MOSIX; both axes 0–70.]
Real Life: PVM / EPVM

[Scatter plot: average slowdown (lower is better) of PVM vs. Enhanced PVM; both axes 0–35.]
Interesting: Mosix / EPVM

[Scatter plot: average slowdown (lower is better) of Mosix vs. Enhanced PVM; both axes 0–30.]
How General Is the Framework? Managing a Web Farm

• Applying the same concept to manage the resources of a Web farm.
• Web servers in the farm may have different capabilities. Tasks may have different requirements.
• Hard problems:
– Relatively short-lived tasks.
– Stale information (welcome to the real world).
– Inaccurate information (welcome to the real world – did we say that?).
Backhand: Load Balancing a Web Cluster

• Peer-to-peer, cost-based resource allocation decisions.
• The necessary “machinery” implemented as an Apache module to minimize overhead.
• Linux, Solaris, FreeBSD, Windows*.
• Part of SuSE Linux and OpenLinux.
• www.backhand.org
Backhand�in�Action
Wackamole - Dynamic Cluster with N-way Fail-over

• In contrast to available gateway solutions (Alteon, BigIP, Cisco’s Local-director, …).
• Exploiting Spread’s strong membership semantics to build a distributed agreement protocol that strictly distributes the cluster’s public IP addresses between the currently connected servers.
• Useful for locally replicated web servers, ftp servers, DNS servers, and even fail-over routers.
• Transparent!

www.wackamole.org
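The agreement idea can be sketched as follows: because Spread delivers the same membership view to every surviving server, each one can independently compute the same IP-to-server assignment, so every public address is claimed by exactly one live machine with no extra coordination. The round-robin partition below is a toy illustration of that invariant, not Wackamole's actual algorithm; all names are made up.

```python
def distribute_ips(public_ips, members):
    """Deterministically round-robin the cluster's public IPs over the
    sorted current membership -- every server computes the same plan."""
    members = sorted(members)            # same order on every server
    plan = {m: [] for m in members}
    for i, ip in enumerate(sorted(public_ips)):
        plan[members[i % len(members)]].append(ip)
    return plan

ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
before = distribute_ips(ips, ["alpha", "beta"])
after = distribute_ips(ips, ["alpha"])     # beta crashes

# Every IP is owned by exactly one server, before and after the crash.
assert sorted(before["alpha"] + before["beta"]) == ips
assert after == {"alpha": ips}             # alpha takes over all IPs
```

When the membership changes, each server re-runs the same computation on the new view and claims (via ARP, in the real system) whatever addresses now fall to it, which is what makes the fail-over transparent to clients.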
Wackamole Architecture

Wackamole for Clusters

Wackamole for Edge Routers
Wackamole Performance

• Out of the box average for fail-over: 12 seconds.
• Tuned average for fail-over: 2.5 seconds.
• Voluntary fail-over (maintenance, etc.): up to 250 milliseconds, most of the time within 10 milliseconds.
[Chart: fail-over latency in seconds (0–14) vs. number of servers (1–11), for out-of-the-box Spread and tuned Spread.]
Making a Difference

[Chart: number of downloads of Spread, Backhand, and Wackamole, Oct-97 through Oct-02, rising toward 6,000. Annotations: “10,000 actual sites discovered”, “100 actual sites discovered”. Backhand-powered web site discovery thanks to Netcraft.]