1
Efficient and Fault Tolerant
Distributed Host Monitoring
Using
System-Level Diagnosis
Mark J. Bearden
and
Ronald Bianchini, Jr.
Carnegie Mellon University
Networked and Mobile Computing Laboratory
Electrical and Computer Engineering Dept.
Pittsburgh, Pennsylvania, USA
February 28, 1996
2
Overview
Research goal:
Monitor status of networked hosts
Focus on:
Fault-tolerance
Efficient communication
Approach:
Decentralize monitoring
Apply distributed diagnosis
Distributed filtering
Light-weight “condensed” broadcast
3
Introduction
Unreliable personal computers/workstations
Asynchronous or loosely synchronous network
Generate status at each host:
• User accounting
• CPU Load
• Disk Usage
• Network statistics
[Figure: four networked hosts A, B, C, D, each reporting its CPU load (A: 18%, B: 35%, C: 24%, D: 90%)]
4
Problem Abstraction
Want global "system state" = ∑i (local state)i
• Each host has a "current" view
• Consistent at each host
[Figure: hosts A, B, C, D each hold an identical load table: A: 18%, B: 35%, C: 24%, D: 90%]
5
Centralized Monitoring
Disadvantages:
• Expensive fault-tolerance
• Poor scalability
• Throughput bottleneck
[Figure: a single central monitor polling every host]
6
Decentralized Monitoring
Distribute monitoring task among the hosts being monitored.
Advantages:
• Cheaper fault-tolerance
• Scalability
• Concurrency
7
Decentralized Monitoring
Hosts cooperate to reduce cost:
• Each host monitors part of the system
• Distribute results
What about failures?
• Identify trusted (“fault-free”) hosts for
• Data generation
• Data distribution
[Figure: one host asking, "Can I trust him?"]
8
Monitoring
Single host:
[Figure: a query program Q on the host invokes monitor agents M1, M2, M3]
Distributed system:
[Figure: Q reaches monitor agents M1, M2, M3 on remote hosts through a reliable distribution layer ("dist")]
9
Tolerating Failures
Two approaches:
• mask failures (redundancy)
example: agreement protocol
• detect failures and reconfigure
example: locate faults, repair/reconfigure
Distributed system-level diagnosis theory:
• Identify faulty/fault-free hosts in system
• Hosts test each other (pass/fail)
• “Passed” tests → trust messages & operation
• Test results distributed to all hosts
• Diagnosis at each host
[Figure: one host tests another; messages flow between the hosts]
10
Adaptive Distributed Diagnosis
Adaptive DSD Algorithm (Bianchini & Buskens, 1991):
• On-line
• Logically fully connected network
• Adaptive testing topology:
• Hosts in logical ring
• Repeatedly test nearest fault-free neighbor
• Cycle of fault-free hosts
[Figure: hosts in a logical ring; arrows show each fault-free host testing its nearest fault-free successor]
11
Updating topology
Update changed test results
• reliably - along test cycle
• quickly - parallel distribution
[Figure: the ring (O(N) hops) is combined with a K-ary tree (O(logK N) hops) for parallel distribution]
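To illustrate the hop counts above, a small sketch (the function names are mine, not from the paper): forwarding an update all the way around the ring takes N−1 sequential hops, while a complete K-ary tree over the same hosts reaches everyone in O(logK N) sequential hops, because each host forwards to K children in parallel.

```python
def ring_hops(n):
    """Sequential hops to forward an update around an n-host ring."""
    return n - 1

def tree_depth(n, k):
    """Depth of a complete K-ary tree covering n hosts: the number of
    sequential forwarding steps when each host fans out to K children."""
    depth, covered = 0, 1
    while covered < n:
        covered += k ** (depth + 1)
        depth += 1
    return depth

# 100 hosts: 99 sequential hops on the ring, 6 levels of a binary tree.
print(ring_hops(100), tree_depth(100, 2))  # 99 6
```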
12
Adaptive Distributed Diagnosis
Algorithm Characteristics for N hosts:
• Tolerates N-1 host failures
• N tests (2N msgs)
• Failure/Recovery: N(1 + (K−1)/K) update messages
• Evenly distributed overhead
• Provably minimum cost!
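Plugging numbers into the N(1 + (K−1)/K) update-message count above; note the reading of the formula as ring messages plus extra tree copies is my interpretation, not stated on the slide:

```python
def update_messages(n, k):
    """Total update messages after one failure or recovery:
    N(1 + (K-1)/K), per the Adaptive DSD cost figure."""
    return n * (1 + (k - 1) / k)

# For 100 hosts: a binary tree (K=2) costs 150 messages,
# a 4-ary tree (K=4) costs 175.
print(update_messages(100, 2), update_messages(100, 4))  # 150.0 175.0
```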
13
Consistent Global State: Adaptive DSD
[Figure: each host holds a diagnosis array indexed by host # 0-5, recording every host's fault state]
14
Extending Consistent Global State
Extend diagnosis data structures
Monitored information
• forwarded by ring of fault-free hosts
• reliable “broadcast” (using point-to-point msgs)
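One way to sketch the extended data structure (the function name and the use of per-entry sequence numbers are illustrative assumptions): each host keeps a per-host table and, when a forwarded entry arrives along the ring, retains only the most recent status for each host.

```python
def merge_update(view, host_id, seq, status):
    """Merge a forwarded (host, seq, status) entry into this host's
    global view, keeping only the most recent entry per host."""
    cur = view.get(host_id)
    if cur is None or seq > cur[0]:
        view[host_id] = (seq, status)
    return view

view = {}
merge_update(view, 3, 1, "load=24%")
merge_update(view, 3, 2, "load=90%")  # newer entry replaces the old one
merge_update(view, 3, 1, "load=24%")  # stale duplicate is ignored
print(view)  # {3: (2, 'load=90%')}
```

Discarding stale duplicates is what keeps the view consistent even though the same entry may circulate the ring more than once.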
[Figure: the diagnosis array at each host, indexed by host # 0-5, extended with host status entries S0 . . . S5]
15
Distributed Monitoring
[Figure: every host runs its own query program Q, distribution layer ("dist"), and monitor agents M1, M2, M3; hosts test one another around the ring]
16
Distributed Filtering
Do not need to forward all sampled values
Filter at each host before distributing
Evaluate:
• Should new sample be distributed?
• EVENT (High Priority) - send immediately
• TRICKLE (Low Priority) - buffer, “piggyback”
• IGNORE
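The EVENT/TRICKLE/IGNORE decision above can be sketched as a threshold test on the change since the last distributed value (the function and parameter names are mine; the ±.50/±.20 CPU thresholds come from the variable-set table later in the talk):

```python
EVENT, TRICKLE, IGNORE = "EVENT", "TRICKLE", "IGNORE"

def classify(new, last_sent, event_delta, trickle_delta):
    """Decide whether a new sample is worth distributing, based on
    how far it has moved since the value last sent out."""
    delta = abs(new - last_sent)
    if delta >= event_delta:
        return EVENT    # large change: send immediately
    if delta >= trickle_delta:
        return TRICKLE  # small change: buffer and piggyback later
    return IGNORE       # negligible change: drop the sample

# CPU load, last distributed value .43, thresholds +-.50 / +-.20:
print(classify(0.39, 0.43, 0.50, 0.20))  # IGNORE  (delta = .04)
print(classify(0.66, 0.43, 0.50, 0.20))  # TRICKLE (delta = .23)
print(classify(0.95, 0.43, 0.50, 0.20))  # EVENT   (delta = .52)
```

Comparing against the last *distributed* value, rather than the previous sample, prevents a slow drift from being filtered forever.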
[Figure: CPU samples over time (.43, .38, .39, .80); samples with a small ∆ are filtered, the sample with a large ∆ (.80) is distributed]
17
Complete Broadcast
“Complete” broadcast:
• in-order delivery (per variable)
• within bounded time
[Figure: messages m0, m1, m2 sent over time; every host delivers all of m0, m1, m2 in order]
18
Condensed Broadcast
Special “light-weight” broadcast:
• no complete history
• no consistent history
“Condensed broadcast”:
• Each state update
• delivered in bounded time
UNLESS
• a more recent value is delivered
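A minimal sketch of the condensing behavior (the class name is illustrative): the outgoing buffer keeps at most one pending update per variable, so an older value that has not yet been forwarded is silently superseded by a newer one rather than ever reaching the wire.

```python
from collections import OrderedDict

class CondensingQueue:
    """Outgoing buffer for a condensed broadcast: a newer update to a
    variable overwrites the still-pending older one, so a slow link
    never forwards stale values."""
    def __init__(self):
        self.pending = OrderedDict()  # variable -> latest value

    def enqueue(self, variable, value):
        self.pending.pop(variable, None)  # drop a superseded update
        self.pending[variable] = value    # newest value goes to the back

    def drain(self):
        """Forward and clear everything currently pending."""
        out = list(self.pending.items())
        self.pending.clear()
        return out

q = CondensingQueue()
q.enqueue("cpu", 0.24)
q.enqueue("disk", 0.61)
q.enqueue("cpu", 0.90)  # condenses: 0.24 is never forwarded
print(q.drain())  # [('disk', 0.61), ('cpu', 0.90)]
```

This is exactly why the broadcast is cheaper than a complete one: under load, the queue length is bounded by the number of variables, not by the number of updates.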
[Figure: messages m0, m1, m2 sent over time; a condensing host delivers only m0 and m2, dropping m1 because the more recent m2 supersedes it]
19
Implementation
Distributed System Monitor (DSMon) running since 1993:
• 150+ Unix workstations on the department Ethernet LAN
• plus ~10 Windows 3.1, Linux, and Novell Server machines
Communication: IP/UDP + ack/retry
Diagnosis CPU overhead: 0.02% (max observed)
[Figure: DSMon architecture: a background daemon performs diagnosis, filtering, and distribution; query processes attach via an API/GUI; monitor agents include SNMP and others]
20
Network Overhead (Diagnosis)
Experiments:
• Data collected on 100 machines
• Network messages per host
[Figure: messages per 30 sec vs. time in sec (0-540); failures (F) and repairs (R) trigger bursts of update messages and fault EVENT distribution; annotations read "Detection (430)" and "Diagnosis (350-7)"]
21
Network Overhead (Monitor Updates)
Network messages & bytes communicated per host:
[Figure: messages/30 sec and bytes/30 sec vs. time in sec (0-420); EVENT updates (E) appear as large spikes, TRICKLE updates (t) as a low steady stream]
22
CMU ECE Dept. Variable Set
Variable      Polling Period   EVENT update   TRICKLE update
Fault State   30 sec           any change     -
CPU Load      60 sec           ± .50          ± .20
Disk Usage    60 sec           ± .15          ± .05
Users         60 sec           any change     -
23
Network Overhead (Total)
10 minutes during a typical weekday p.m.:
[Figure: total messages/30 sec and bytes/30 sec vs. time in sec (0-600); EVENT updates marked U (User), D (Disk), C (CPU); TRICKLE updates marked u, d, c; F = Failure, R = Recovery]
24
Summary
Fully distributed monitoring
• prevents bottlenecks
• tolerates multiple host failures
• filtering at each host conserves network bandwidth
Light-weight “condensed” reliable broadcast
• delivers most recent information
• does not preserve consistent “history”
• less costly than “complete” broadcast
Extension of system-level diagnosis algorithm
• maintain general global state
25
Future and Current Research
Other diagnosis algorithms
• general communication topology
• pessimistic fault models
Partial replication of monitored state
• always replicate at k hosts
Fault tolerant distributed shared memory
• networked workstations
• condensed reliable broadcast
For More Information:
Web Page: http://www.ece.cmu.edu/afs/ece/usr/dsd/
E-mail: [email protected]