Discussion of Failure Mode Assumptions for IEEE 802.1Qbt

Wilfried Steiner, Corporate Scientist
[email protected]

www.tttech.com
Ensuring Reliable Networks

Source: grouper.ieee.org/groups/802/1/files/public/docs2012/new-avb-wsteiner...


Clock Synchronization is a core building block of many RT Systems

[Figure: a network of TTE, 1588, and Eth devices synchronized to a Grand Master]

The local clocks in a distributed system can accurately be synchronized to each other.


Basic Questions in Fault-Tolerant Clock Synchronization

[Figure: a network of TTE, 1588, and Eth devices synchronized to a Grand Master]

Loss of the Grand Master clock requires a changeover

- How long does the changeover take?
- Is the changeover fault-tolerant?
- Is a malicious failure behavior of the Grand Master clock tolerated?


Fault-Tolerance through Redundancy

Situation: What is the color of the house?

[Figure: answers given by redundant observers in three cases: No Failure, Fail-Silence Failure, and Fail-Consistent Failure. Answer labels shown: Green, Don't Know, Green, Red, Green, Green.]
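The idea of masking a faulty answer through redundancy can be sketched as a majority vote. This is a minimal illustration, not taken from the slides: the function name `vote`, the use of `None` for a fail-silent observer, and the strict-majority rule are all assumptions.

```python
from collections import Counter

def vote(answers):
    """Majority vote over redundant answers.

    None models a fail-silent observer (no answer at all).
    Returns the winning answer, or None if no answer is backed
    by a strict majority of all observers.
    """
    received = [a for a in answers if a is not None]
    if not received:
        return None
    value, count = Counter(received).most_common(1)[0]
    return value if count > len(answers) / 2 else None

print(vote(["Green", "Green", "Green"]))  # no failure
print(vote(["Green", None, "Green"]))     # one fail-silent observer
print(vote(["Green", "Red", "Green"]))    # one fail-consistent observer
```

With three observers, one fail-silent or one fail-consistent observer is outvoted by the two correct ones. The vote assumes that every receiver sees the same set of answers, which is exactly what no longer holds for inconsistent (Byzantine) failures.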


Failure Mode: Fail-Silence

When the current grandmaster clock fails, gPTP ensures that another clock becomes the new grandmaster

• if such a clock exists in the system, which we will assume in the following.

This means that there is some fail-over time after which the system is running stably again, synchronized and syntonized to the new grandmaster clock.

The fail-silence failure mode is tolerated

• when the original grandmaster clock fails permanently.


Failure Mode: Fail-Silence

What happens when the original grandmaster clock fails transiently or intermittently?

• e.g., the original grandmaster clock periodically reboots

→ Will the network oscillate between the original and a secondary grandmaster clock?
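The question can be explored with a small simulation. The sketch below is a deliberate simplification and not the gPTP best master clock algorithm: it assumes each clock has a single hypothetical priority number, that the alive clock with the lowest number wins instantly, and that fail-over time is zero. The clock names `GM-A` and `GM-B` are made up for illustration.

```python
def elect(clocks, alive):
    """Return the alive clock with the lowest priority number, or None."""
    candidates = [name for name in clocks if alive[name]]
    return min(candidates, key=lambda name: clocks[name]) if candidates else None

def grandmaster_trace(clocks, alive_per_step):
    """Grandmaster elected at each time step."""
    return [elect(clocks, alive) for alive in alive_per_step]

def count_changeovers(trace):
    """Number of steps at which the elected grandmaster changes."""
    return sum(1 for a, b in zip(trace, trace[1:]) if a != b)

clocks = {"GM-A": 1, "GM-B": 2}  # lower number = preferred

# Permanent fail-silence: GM-A stops at step 2 and never returns.
permanent = [{"GM-A": t < 2, "GM-B": True} for t in range(8)]
# Intermittent fail-silence: GM-A is up for 2 steps, down for 2 steps.
intermittent = [{"GM-A": t % 4 < 2, "GM-B": True} for t in range(8)]

print(count_changeovers(grandmaster_trace(clocks, permanent)))
print(count_changeovers(grandmaster_trace(clocks, intermittent)))
```

Under these assumptions the permanent failure causes a single changeover, while the rebooting grandmaster reclaims its role every time it returns, so the network changes grandmaster repeatedly. Whether a real gPTP network behaves this way depends on details the sketch leaves out.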


Model-Based Development i

Development of fault-tolerant clock synchronization algorithms is non-trivial:

• the synchronization proof is hard for certain failure modes
• completeness has to be proven as well
  • i.e., we need to prove that we have covered all possible failure scenarios

Therefore, formal methods are used in the development and in the verification of such algorithms.

• Theorem Proving is the process of developing a deductive proof, typically interactively with a proof assistant.

• Model Checking is an automated approach.
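To give a flavor of what "automated" means here, the sketch below is a toy explicit-state search, not a production model checker and not the model used for any standard. It assumes a hypothetical abstraction with a preferred clock A and a backup clock B, where a state is `(a_alive, master, changeovers)`. The property checked is "the grandmaster changes at most once", and the failure hypothesis is switched with `allow_recovery`.

```python
from collections import deque

def successors(state, allow_recovery):
    """Possible next states of the toy fail-over model."""
    a_alive, master, changes = state
    if a_alive:
        yield (False, "B", changes + 1)   # A fails silently, B takes over
    elif allow_recovery:
        yield (True, "A", changes + 1)    # A reboots and reclaims the role

def check(initial, invariant, allow_recovery):
    """Breadth-first search over all reachable states.

    Returns None if the invariant holds in every reachable state,
    otherwise a shortest path from the initial state to a violation.
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state, allow_recovery):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None

def at_most_one_changeover(state):
    return state[2] <= 1

initial = (True, "A", 0)
print(check(initial, at_most_one_changeover, allow_recovery=False))  # permanent failure only
print(check(initial, at_most_one_changeover, allow_recovery=True))   # intermittent failure
```

For the permanent fail-silence hypothesis the search exhausts the state space and reports no violation. For the intermittent hypothesis it returns a counterexample trace, which corresponds to the "no, because…" answer a model checker gives.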


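Model-Based Development ii

[Figure: two protocol state machines feed a Model Checker, which answers "ok" or "no, because…".
State machine 1: INIT(1), LISTEN(2), COLDSTART(3), ACTIVE(4), with transitions 1.1, 2.1, 2.2, 3.1, 3.2.
State machine 2: INIT(1), LISTEN(2), STARTUP(3), SILENCE(4), Tentative ROUND(5), Protected STARTUP(6), ACTIVE(7), with transitions 1.1, 2.1, 2.2, 2.3, 3.1, 3.2, 4.1, 5.1, 5.2, 6.1, 6.2, 6.3.
Input labels: e.g., IEEE 802.1ASbt; e.g., fail-silence; e.g., system will sync.]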


Example: SAE AS6802

First Byzantine fault-tolerant clock synchronization algorithm verified by model-checking only.

Basic algorithm addresses only synchronization of the clocks.

Extension for syntonization (we call it clock-rate correction) has been modeled and studied as well.


Fault-Tolerant Clock Synchronization

[Figure: a network of TTE, 1588, and Eth devices with several Grand Master clocks]

Fault-tolerant synchronization services are needed for establishing a safe and highly available synchronized time.
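One generic way to combine readings from several time sources so that a few bad ones cannot drag the result away is a fault-tolerant midpoint. The sketch below illustrates that general technique only. It is not claimed to be the function specified in SAE AS6802 or in any IEEE 802.1 document, and the function name, the parameter `k`, and the sample offsets are assumptions.

```python
def fault_tolerant_midpoint(readings, k):
    """Discard the k smallest and k largest readings, then return the
    midpoint of the smallest and largest remaining reading.

    Note: masking k arbitrarily faulty clocks is classically stated to
    need at least 3k + 1 clocks; 2k + 1 is only what this function needs
    to have anything left after trimming.
    """
    if len(readings) < 2 * k + 1:
        raise ValueError("need at least 2k + 1 readings")
    trimmed = sorted(readings)[k:len(readings) - k]
    return (trimmed[0] + trimmed[-1]) / 2

# Clock offsets in microseconds as seen by one receiver; 250.0 is an outlier.
offsets = [-2.0, 1.0, 0.5, 250.0]
print(fault_tolerant_midpoint(offsets, k=1))
```

Because the extremes are dropped before averaging, the single outlier has no influence on the result as long as the remaining readings come from correct clocks.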


SAE AS6802 Clock Synchronization Algorithm (the case of five SM is updated in the standard)

Algorithm Specification


Byzantine Failure Tolerance

Occurrence of a Byzantine failure is a combination of a fail-arbitrary synchronization master (end station) and an inconsistent-omission faulty compression master (bridge).


Rate-Correction with Stable Clock Drifts

Store 1st state-correction term

Store 2nd state-correction term

Calculate and apply rate-correction term


Rate-Correction with Unstable Clock Drifts

Store 1st state-correction term

Store 2nd state-correction term

Calculate and apply rate-correction term

Coincidentally, the speed of the oscillator also changes
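The three steps on the rate-correction slides can be sketched with a small simulation. The formula used here (mean of the two stored state-correction terms divided by the resynchronization interval) is an assumption chosen for illustration, not the calculation from the slides or from a standard. Drifts are given in ppm and the interval in seconds, so the state-correction terms come out in microseconds.

```python
def run(drifts_ppm, interval_s):
    """Simulate state and rate correction over successive intervals.

    drifts_ppm[i] is the oscillator's rate error during interval i.
    Returns the state-correction term (in microseconds) observed at the
    end of each interval.
    """
    rate_correction = 0.0
    stored = []
    corrections = []
    for drift in drifts_ppm:
        # Offset accumulated over one interval with the current rate correction.
        offset = (drift - rate_correction) * interval_s
        corrections.append(offset)   # state correction removes this offset
        stored.append(offset)        # store 1st, then 2nd term
        if len(stored) == 2:
            # Calculate and apply the rate-correction term.
            rate_correction += sum(stored) / (2 * interval_s)
            stored.clear()
    return corrections

print(run([50.0, 50.0, 50.0, 50.0], 1.0))  # stable drift
print(run([50.0, 50.0, 80.0, 80.0], 1.0))  # drift changes when the rate correction is applied
```

With a stable drift the state-correction terms drop to zero once the rate correction is applied. If the oscillator speed changes at the same moment, the rate correction is based on outdated measurements and a residual offset remains until the next rate-correction step.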


What are the failure modes of IEEE 802.1ASbt?

Permanent fail-silence?

Transient/intermittent fail-silence?

Fail-consistent faulty?

• e.g., a grandmaster providing faulty time

Inconsistent faulty bridges?

• e.g., a bridge forwarding time information only on some ports

Byzantine faulty grandmaster clocks?



Backup


Static vs. Dynamic Systems

Situation: What is the color of the house?

Static Situation – one Truth

Situation: What is the color of the ball?

Dynamic Situation – more than one Truth


Origins: Byzantine Failures

[Figure: nodes N1 (faulty), N2, and N3 exchange HOT and COLD temperature readings.]

N1: COLD
N2: HOT
N3: COLD
==========
COLD

A distributed system that measures the temperature of a vessel shall raise an alarm when the temperature exceeds a certain threshold. The system shall tolerate the arbitrary failure of one node.
How many nodes are required?
How many messages are required?

In general, three nodes are insufficient to tolerate the arbitrary failure of a single node. The two correct nodes are not always able to agree on a value. A decent body of scientific literature exists that addresses this problem of dependable systems, in particular dependable communication.
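The disagreement can be reproduced in a few lines. The sketch assumes, in line with the vote shown on the slide, that N2 measures HOT and N3 measures COLD (a temperature close to the threshold), and that the faulty node N1 tells each receiver something different. The helper names are hypothetical.

```python
from collections import Counter

def majority(values):
    """Most frequent value among the readings a node has collected."""
    return Counter(values).most_common(1)[0][0]

# Own measurements of the two correct nodes.
own = {"N2": "HOT", "N3": "COLD"}
# The faulty node N1 sends a different value to each receiver.
from_n1 = {"N2": "HOT", "N3": "COLD"}

def view(receiver):
    """Readings collected by one correct node, including its own."""
    return {"N1": from_n1[receiver], "N2": own["N2"], "N3": own["N3"]}

decisions = {node: majority(view(node).values()) for node in own}
print(decisions)
```

Each correct node sees a two-to-one majority for its own measurement, so N2 decides HOT while N3 decides COLD. A single round of majority voting among three nodes therefore does not guarantee agreement between the correct nodes.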


Byzantine Clocks

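[Figure: nodes N1 (faulty), N2, and N3 exchange clock readings along a time axis; N2 reads 00:01, N3 reads 00:04, and N1 sends 00:04 and 00:01.]

N1: 00:04
N2: 00:01
N3: 00:04
==========
00:04

N1: 00:01
N2: 00:01
N3: 00:04
==========
00:01

[Figure: clock time plotted against real time for a perfect clock, a slow clock, and a fast clock, with intervals labeled R.int.]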

In a distributed system in which all nodes are equipped with local clocks, all clocks shall become and remain synchronized. The system shall tolerate the arbitrary failure of one node.
How many nodes are required?
How many messages are required?

In general, three nodes are insufficient to tolerate the arbitrary failure of a single node. The two correct nodes are not always able to bring their clocks into close agreement. A decent body of scientific literature exists that addresses this problem of fault-tolerant clock synchronization.
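The same effect can be shown for clocks. The sketch below assumes that each correct node resynchronizes to the median of the three readings it sees, which is an illustrative choice and not a rule taken from the slides. Clock values are in seconds, matching the 00:01 and 00:04 readings of N2 and N3.

```python
from statistics import median

def resync_round(n2, n3, n1_to_n2, n1_to_n3):
    """One resynchronization round for the correct nodes N2 and N3.

    Each node adopts the median of N1's report, N2's clock, and N3's clock.
    Returns the new (n2, n3) clock values.
    """
    new_n2 = median([n1_to_n2, n2, n3])
    new_n3 = median([n1_to_n3, n2, n3])
    return new_n2, new_n3

# Byzantine N1: reports 00:01 to N2 and 00:04 to N3.
print(resync_round(1, 4, n1_to_n2=1, n1_to_n3=4))
# Fail-consistent N1: reports 00:04 to both nodes.
print(resync_round(1, 4, n1_to_n2=4, n1_to_n3=4))
```

When N1 reports inconsistently, each correct node keeps its own value and the three-second gap never shrinks. When N1 reports the same value to both, even a wrong one, the two correct nodes end up with the same clock value.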