Efficient and Fault Tolerant Distributed Host Monitoring Using System-Level Diagnosis. Mark J. Bearden and Ronald Bianchini, Jr., Carnegie Mellon University, Networked and Mobile Computing Laboratory, Electrical and Computer Engineering Dept., Pittsburgh, Pennsylvania, USA. February 28, 1996.

Page 1: Efficient and Fault Tolerant Distributed Host Monitoring Using

1

Efficient and Fault Tolerant Distributed Host Monitoring Using System-Level Diagnosis

Mark J. Bearden and Ronald Bianchini, Jr.

Carnegie Mellon University

Networked and Mobile Computing Laboratory

Electrical and Computer Engineering Dept.

Pittsburgh, Pennsylvania. USA

February 28, 1996

2

Overview

Research goal:

Monitor status of networked hosts

Focus on:

Fault-tolerance

Efficient communication

Approach:

Decentralize monitoring

Apply distributed diagnosis

Distributed filtering

Light-weight “condensed” broadcast

3

Introduction

Unreliable personal computers/workstations

Asynchronous or loosely synchronous network

Generate status at each host:

• User accounting

• CPU Load

• Disk Usage

• Network statistics

[Figure: four networked hosts, each reporting its own CPU load. A: 18%, B: 35%, C: 24%, D: 90%]

4

Problem Abstraction

Want global “system state”:

• $\sum_i (\text{local state})_i$

• Each host has “current” view

• Consistent at each host

[Figure: hosts A, B, C, D, each holding the same load table. A: 18%, B: 35%, C: 24%, D: 90%]

5

Centralized Monitoring

Disadvantages:

• Expensive fault-tolerance

• Poor scalability

• Throughput bottleneck

[Figure: hosts reporting to a single central monitor]

6

Decentralized Monitoring

Distribute monitoring task among the hosts being monitored.

Advantages:

• Cheaper fault-tolerance

• Scalability

• Concurrency

7

Decentralized Monitoring

Hosts cooperate to reduce cost:

• Each host monitors part of the system

• Distribute results

What about failures?

• Identify trusted (“fault-free”) hosts for:

  • Data generation

  • Data distribution

[Figure: one host asking of another, “Can I trust him?”]

8

Monitoring

Single host:

[Figure: a host running monitor agents M1, M2, M3 and a query program Q]

Distributed system:

[Figure: a host running monitor agents M1, M2, M3, a query program Q, and a reliable distribution layer (dist)]

9

Tolerating Failures

Two approaches:

• Mask failures (redundancy). Example: agreement protocol

• Detect failures and reconfigure. Example: locate faults, repair/reconfigure

Distributed system-level diagnosis theory:

• Identify faulty/fault-free hosts in system

• Hosts test each other (pass/fail)

• “Passed” test -> trust the tested host's messages & operation

• Test results distributed to all hosts

• Diagnosis at each host

[Figure: two hosts; one sends a test to the other, and a message passes between them]

10

Adaptive Distributed Diagnosis

Adaptive DSD Algorithm (Bianchini & Buskens, 1991):

• On-line

• Fully connected (logically) network

• Adaptive testing topology:

  • Hosts in logical ring

  • Repeatedly test nearest fault-free neighbor

  • Cycle of fault-free hosts

[Figure: hosts arranged in a logical ring; arrows show tests]

11

Updating topology

Update changed test results

• reliably - along test cycle

• quickly - parallel distribution

[Figure: ring + K-ary tree = combined update topology. Ring: $O(N)$ hops; K-ary tree: $O(\log_K N)$ hops]
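To make the hop counts concrete, the sketch below compares the worst-case number of hops for an update passed around a ring with the depth of a complete K-ary tree over the same hosts. The helper names `ring_hops` and `tree_hops` are illustrative, and the tree is assumed to be complete and filled level by level, with k at least 2.

```python
def ring_hops(n):
    """Worst-case hops for an update passed host-to-host around a ring of n hosts."""
    return n - 1


def tree_hops(n, k):
    """Depth of a complete k-ary tree (k >= 2) holding n hosts, root at depth 0."""
    depth, capacity, level_size = 0, 1, 1
    while capacity < n:
        level_size *= k          # number of hosts on the next level
        capacity += level_size   # total hosts the tree can hold so far
        depth += 1
    return depth


for n in (10, 100, 1000):
    print(n, ring_hops(n), tree_hops(n, k=2), tree_hops(n, k=4))
# 10 9 3 2
# 100 99 6 4
# 1000 999 9 5
```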

12

Adaptive Distributed Diagnosis

Algorithm Characteristics for N hosts:

• Tolerates N-1 host failures

• N tests (2N msgs)

• Failure/Recovery: $N\left(1 + \frac{K-1}{K}\right)$ update messages

• Evenly distributed overhead

• Provably minimum cost!

13

Consistent Global State: Adaptive DSD

[Figure: hosts 0 to 5; at each host, a diagnosis table indexed by host # (0 to 5)]
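One way a host might derive its diagnosis from the distributed test results is sketched below. It assumes each host stores, for every host i, the host that i last reported as passing its test, and follows that chain starting from itself. The name `tested_up` and the dictionary layout are assumptions made for this example.

```python
def diagnose(self_id, tested_up):
    """Return the set of hosts diagnosed fault-free by host `self_id`.

    tested_up[i] is the host that host i last reported as passing its test
    (missing or None if unknown). Hosts not reached by the chain are
    treated as faulty.
    """
    fault_free = set()
    current = self_id
    while current is not None and current not in fault_free:
        fault_free.add(current)
        current = tested_up.get(current)
    return fault_free


# Hosts 0..5; hosts 1 and 4 are faulty, so the test cycle is 0 -> 2 -> 3 -> 5 -> 0.
# The entries for hosts 1 and 4 are stale and are never reached by the chain.
tested_up = {0: 2, 2: 3, 3: 5, 5: 0, 1: 2, 4: 5}
for host in (0, 2, 3, 5):
    print(host, sorted(diagnose(host, tested_up)))  # each prints [0, 2, 3, 5]
```

Because every fault-free host walks the same cycle, each one arrives at the same set, which matches the consistent view the slide describes.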

14

Extending Consistent Global State

Extend diagnosis data structures

Monitored information

• forwarded by ring of fault-free hosts

• reliable “broadcast” (using point-point msgs)

[Table @ each host: the diagnosis table, indexed by host #, with host status added to each entry]

Host #   Diagnosis + host status
0        S0, . . .
1        S1, . . .
2        S2, . . .
3        S3, . . .
4        S4, . . .
5        S5, . . .
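The forwarding of monitored information around the ring of fault-free hosts can be sketched as below. This is an illustration under simplifying assumptions: the faulty set is known in advance, the origin host is fault-free, and no message is lost. The function name `broadcast_status` and the status string format are hypothetical.

```python
def broadcast_status(origin, status, n, faulty):
    """Forward `status` of host `origin` around the ring of fault-free hosts.

    Returns (tables, messages): tables[h] is host h's view {host: status},
    and messages lists the point-to-point sends as (sender, receiver) pairs.
    """
    alive = [h for h in range(n) if h not in faulty]   # fault-free hosts in ring order
    tables = {h: {} for h in alive}
    tables[origin][origin] = status                    # origin records its own status
    messages = []
    start = alive.index(origin)
    for i in range(len(alive) - 1):
        sender = alive[(start + i) % len(alive)]
        receiver = alive[(start + i + 1) % len(alive)]
        messages.append((sender, receiver))            # one point-to-point message
        tables[receiver][origin] = status
    return tables, messages


tables, messages = broadcast_status(origin=2, status="load=0.35", n=6, faulty={4})
print(messages)   # [(2, 3), (3, 5), (5, 0), (0, 1)]
print(tables[0])  # {2: 'load=0.35'}
```

Every fault-free host ends with the same entry for host 2, using only point-to-point messages and skipping the failed host 4.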

15

Distributed Monitoring

[Figure: several hosts linked by tests; each host runs a distribution layer (dist), a query program Q, and one or more monitor agents (M1, M2, M3)]

16

Distributed Filtering

Do not need to forward all sampled values

Filter at each host before distributing

Evaluate:

• Should new sample be distributed?

  • EVENT (High Priority) - send immediately

  • TRICKLE (Low Priority) - buffer, “piggyback”

  • IGNORE

[Figure: CPU samples over time (.39, .38, .43, .80), labeled small ∆ or large ∆; distributed values: .39, .80]
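The three-way filtering decision can be sketched as a comparison against the last value a host distributed. Comparing with the last distributed value (rather than the previous sample) is an assumption of this sketch, as is the function name `classify`; the thresholds used are the CPU Load values (± .50 and ± .20) listed in the variable set later in the slides.

```python
def classify(sample, last_sent, event_delta, trickle_delta):
    """Decide how a new sample should be handled.

    Compares the sample with the last value this host distributed.
    """
    change = abs(sample - last_sent)
    if change >= event_delta:
        return "EVENT"      # high priority: send immediately
    if change >= trickle_delta:
        return "TRICKLE"    # low priority: buffer and piggyback later
    return "IGNORE"


last_sent = 0.39
decisions = []
for sample in (0.38, 0.43, 0.80, 0.95, 0.20):
    decision = classify(sample, last_sent, event_delta=0.50, trickle_delta=0.20)
    decisions.append((sample, decision))
    if decision != "IGNORE":
        last_sent = sample   # a distributed value becomes the new reference
print(decisions)
# [(0.38, 'IGNORE'), (0.43, 'IGNORE'), (0.8, 'TRICKLE'), (0.95, 'IGNORE'), (0.2, 'EVENT')]
```

Small fluctuations around .39 are suppressed, the jump to .80 is distributed as a TRICKLE, and the later drop to .20 is large enough to count as an EVENT.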

17

Complete Broadcast

“Complete” broadcast:

• in order delivery (by variable)

• within bounded time

[Figure: messages m0, m1, m2 sent over time; the sending host holds m0, m1, m2 and the receiving host also gets m0, m1, m2]
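In-order delivery per variable can be illustrated with sequence numbers and a hold-back buffer. This sketch covers only the ordering property, not the bounded-time guarantee, and the class name `InOrderReceiver` and the use of integer sequence numbers starting at 0 are assumptions of the example.

```python
class InOrderReceiver:
    """Delivers updates for each variable strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = {}   # variable -> next sequence number expected
        self.pending = {}    # variable -> {seq: value} received early
        self.delivered = []  # (variable, seq, value) in delivery order

    def receive(self, variable, seq, value):
        expected = self.next_seq.setdefault(variable, 0)
        if seq < expected:
            return                       # duplicate of an already delivered update
        held = self.pending.setdefault(variable, {})
        held[seq] = value
        while expected in held:          # deliver every update that is now in order
            self.delivered.append((variable, expected, held.pop(expected)))
            expected += 1
        self.next_seq[variable] = expected


rx = InOrderReceiver()
for seq, value in [(0, "m0"), (2, "m2"), (1, "m1")]:   # m2 arrives before m1
    rx.receive("cpu", seq, value)
print([v for _, _, v in rx.delivered])  # ['m0', 'm1', 'm2']
```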

18

Condensed Broadcast

Special “light-weight” broadcast:

• no complete history

• no consistent history

“Condensed broadcast”:

• Each state update is delivered in bounded time UNLESS a more recent value is delivered

[Figure: messages m0, m1, m2 sent over time; an intermediate host condenses, so the receiving host gets only m0, m2]
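The condensing step can be sketched as a buffer that keeps at most one pending update per variable. The class name `CondensingBuffer` and the explicit `flush` call that stands in for forwarding to the next host are assumptions made for illustration.

```python
class CondensingBuffer:
    """Holds at most one pending update per variable: the most recent one."""

    def __init__(self):
        self.pending = {}   # variable -> (seq, value)

    def offer(self, variable, seq, value):
        """Queue an update, replacing any older pending update for the variable."""
        current = self.pending.get(variable)
        if current is None or seq > current[0]:
            self.pending[variable] = (seq, value)

    def flush(self):
        """Return the pending updates to forward and clear the buffer."""
        out, self.pending = self.pending, {}
        return out


buf = CondensingBuffer()
buf.offer("cpu", 0, "m0")
first = buf.flush()            # m0 is forwarded on its own
buf.offer("cpu", 1, "m1")
buf.offer("cpu", 2, "m2")      # arrives before m1 is forwarded, so m1 is dropped
second = buf.flush()
print(first, second)  # {'cpu': (0, 'm0')} {'cpu': (2, 'm2')}
```

Downstream hosts see m0 and then m2, so they always converge on the most recent value but do not see the full history.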

19

Implementation

Distributed System Monitor (DSMon) running since 1993:

• 150+ Unix workstations on the department Ethernet LAN

• plus ~10 other hosts: Windows 3.1, Linux, Novell Server

Communication: IP/UDP + ack/retry

Diagnosis CPU overhead: 0.02% (max observed)

[Figure: DSMon processes. A background daemon performs diagnosis, filtering, and distribution; monitor agents (DSMon, SNMP, other) supply data; query programs (API, GUI) read it]
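The “IP/UDP + ack/retry” scheme might look roughly like the sketch below. It is not DSMon's actual wire protocol: the `b"ACK"` reply format, the retry count, and the timeout are all assumptions, and `send_with_retry` is a hypothetical helper written against the standard `socket` API.

```python
import socket


def send_with_retry(sock, payload, addr, retries=3, timeout=1.0):
    """Send `payload` over a UDP socket and wait for an acknowledgement.

    Returns True if a datagram equal to b"ACK" arrives from `addr` within
    `timeout` seconds of some attempt, False after `retries` attempts fail.
    """
    sock.settimeout(timeout)
    for _ in range(retries):
        sock.sendto(payload, addr)
        try:
            data, sender = sock.recvfrom(1024)
        except socket.timeout:
            continue                      # no reply in time: retransmit
        if sender == addr and data == b"ACK":
            return True
    return False
```

A caller would create the socket with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)`. A `False` result after all retries could be treated as a failed test of the peer, though that mapping is also an assumption here.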

20

Network Overhead (Diagnosis)

Experiments:

• Data collected on 100 machines

• Network messages per host

[Figure: network messages per host per 30 sec. vs. time (0 to 540 sec.). Failures (F) and repairs (R) are marked on the time axis. Labeled features: Diagnosis (350-7), Detection (430), Updates, Fault EVENT Distribution]

21

Network Overhead (Monitor Updates)

Network messages & bytes communicated per host:

[Figure: messages per 30 sec. and bytes per 30 sec. vs. time (0 to 420 sec.). EVENT updates (E) and TRICKLE updates (t) are marked on the time axis]

22

CMU ECE Dept. Variable Set

Variable      Polling Period   EVENT update   TRICKLE update
Fault State   30 sec           any change     -
CPU Load      60 sec           ± .50          ± .20
Disk Usage    60 sec           ± .15          ± .05
Users         60 sec           any change     -
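As a small illustration of the polling periods in this variable set, the sketch below lists which variables would be sampled at a given elapsed time. The dictionary `POLLING_PERIOD` and the function `due` are names invented for the example, and aligning all polls to time zero is an assumption.

```python
# Polling periods in seconds, taken from the variable set table.
POLLING_PERIOD = {
    "Fault State": 30,
    "CPU Load": 60,
    "Disk Usage": 60,
    "Users": 60,
}


def due(t):
    """Return the variables whose polling period divides elapsed time t (seconds)."""
    return [name for name, period in POLLING_PERIOD.items() if t % period == 0]


for t in (30, 60, 90):
    print(t, due(t))
# 30 ['Fault State']
# 60 ['Fault State', 'CPU Load', 'Disk Usage', 'Users']
# 90 ['Fault State']
```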

23

Network Overhead (Total)

10 minutes during typical weekday p.m.:

[Figure: messages per 30 sec. and bytes per 30 sec. vs. time (0 to 600 sec.). Updates are marked by variable: User (U/u), Disk (D/d), CPU (C/c), with uppercase for EVENT and lowercase for TRICKLE. F = Failure, R = Recovery]

24

Summary

Fully distributed monitoring

• prevents bottlenecks

• tolerates multiple host failures

• filtering at each host conserves network bandwidth

Light-weight “condensed” reliable broadcast

• delivers most recent information

• does not preserve consistent “history”

• less costly than “complete” broadcast

Extension of system-level diagnosis algorithm

• maintain general global state

25

Future and Current Research

Other diagnosis algorithms

• general communication topology

• pessimistic fault models

Partial replication of monitored state

• always replicate at k hosts

Fault tolerant distributed shared memory

• networked workstations

• condensed reliable broadcast

For More Information:

Web Page: http://www.ece.cmu.edu/afs/ece/usr/dsd/

E-mail: [email protected]