Mechanisms for Enhanced Dependability and Timeliness in CAN

Ricardo Alexandre Neves Correia Pinto

Dissertation submitted for the degree of Master in Electrical and Computer Engineering

Jury
President: Dr. Nuno Cavaco Gomes Horta
Supervisor: Dr. José Manuel de Sousa de Matos Rufino
Co-Supervisor: Dr. Carlos Manuel Ribeiro Almeida
Member: Dr. Carlos Jorge Ferreira Silvestre

December 2010

Acknowledgments

Firstly, I would like to express my gratitude to my supervisor, Prof. José Rufino, for the effort put into this work. His constant encouragement, motivation and support were fundamental for the writing of this document. A special thanks also goes to my co-supervisor, Prof. Carlos Almeida, for the support given.

To the people in the Computers Scientific Area for the support and for providing the technical means for developing this work, and for the pleasant coffee breaks.

To my friends, especially Pedro Fernandes, for all these years of friendship and encouragement.

Last, but certainly not least, I would like to thank my family, especially my parents, Carlos and Isabel, and my sister Andreia for their love and support all these years. This would not have been possible without you.

Abstract

A cost-effective solution for Distributed Control System (DCS) interconnection is the Controller Area Network (CAN) fieldbus. Designed to be used in the harsh automotive environment, its usage has spread to other domains, e.g. home automation, elevators, shop-floor control and even aerospace applications. However, there is a set of domains where CAN cannot be used without additional mechanisms: mission-critical applications. In fact, despite exhibiting fault-tolerant behaviour in the presence of errors, CAN fault coverage alone is not high enough to meet the stringent requirements regarding safety, availability and reliability that these domains demand.

The CAN Enhanced Layer (CANELy) architecture is a step towards a CAN-based high dependability architecture, through the provision of: reliable communication services, network reliability and availability, and channel timeliness guarantees.

This work discusses the design and implementation of effective mechanisms for network dependability and timeliness enhancement, in the context of the CANELy architecture. Our working basis is the extended fault model provided by the CANELy architecture, which contemplates the utilisation of media redundancy for the communication channel. From this basis we identified effective mechanisms to detect and isolate faults affecting either the channel or any of the redundant media conveying the channel.

The final result is a set of mechanisms specified in a Hardware Description Language (HDL), which can be fitted into a small-sized Field Programmable Gate Array (FPGA), thus providing CANELy-based applications with means for: effective redundancy management, channel and media fault detection and confinement, and upper layer signalling for network operation status assessment, all in a cost-effective manner.

Keywords

Controller Area Network, networked embedded systems, dependability, timeliness, CAN Enhanced Layer

Summary

The Controller Area Network (CAN) industrial network is an efficient solution for the interconnection of Distributed Control Systems. Designed for automotive applications, its use has spread to other domains, e.g. home automation, production control and even aerospace applications. There are, however, domains where CAN cannot be used without additional mechanisms: mission-critical applications. Although CAN has fault-tolerance characteristics, it lacks fault coverage broad enough to meet the stringent requirements of these domains with respect to safety and availability.

The CAN Enhanced Layer (CANELy) architecture is a step towards high-dependability solutions based on CAN networks, through the provision of: reliable communication services, a reliable and available network, and timeliness guarantees for the communication channel.

This work discusses the design and implementation of efficient mechanisms for enhancing dependability and timeliness, in the context of the CANELy architecture. The working basis is the extended fault model provided by the CANELy architecture, which contemplates redundancy of the physical medium for the communication channel, as well as mechanisms to detect and isolate faults affecting either the channel or any redundant medium supporting the channel.

The final result is a set of mechanisms specified in a Hardware Description Language and implemented in an FPGA device, providing CANELy-based applications with the means for: efficient redundancy management, detection and confinement of faults in the channel and physical medium, and signalling to upper layers for assessment of the network status.

Keywords

Controller Area Network, networked embedded systems, dependability, timeliness, CAN Enhanced Layer

Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Contributions
  1.4 Publications
  1.5 Document organisation
2 State of the Art
  2.1 Distributed Control Systems
  2.2 Distributed Real-time Systems
    2.2.1 Real-time Systems
    2.2.2 Communication System Operation Models
    2.2.3 Real-time Communication Networking Infrastructure
  2.3 Embedded Systems Networking (Fieldbus) Technologies
    2.3.1 Time-Triggered Ethernet
    2.3.2 SpaceWire
    2.3.3 Controller Area Network
  2.4 CAN in airborne and spaceborne applications
  2.5 High-Dependability CAN-based architectures
    2.5.1 RedCAN
    2.5.2 FlexCAN
    2.5.3 CAN Enhanced Layer - CANELy
3 CAN Enhanced Layer
  3.1 System Architecture
    3.1.1 Reliable Communication and Services
    3.1.2 Network Dependability
    3.1.3 Hard Real-time Operation
  3.2 System Components
    3.2.1 CANELy Dependability Engine
    3.2.2 Media Selection Unit
    3.2.3 Inaccessibility Control Unit
  3.3 Engineering Constraints
    3.3.1 CANELy Components
    3.3.2 CAN Data-link Layer
    3.3.3 CAN Physical Layer
  3.4 Summary
4 Dependability Enforcement
  4.1 Working Model
    4.1.1 CAN Physical Layer Fault-Tolerance
    4.1.2 CANELy Approach to Network Dependability
    4.1.3 Fault classes
  4.2 Physical Network Availability and Reliability
    4.2.1 Media Redundancy Provision and Management
    4.2.2 Stuck-at-dominant Fault Handling
  4.3 CAN Bit-Sequence Detection
  4.4 Channel Monitoring
  4.5 Medium Monitoring
    4.5.1 Medium Status
    4.5.2 Frame Monitoring
    4.5.3 Omission Degree Control
  4.6 Media Selection Unit
    4.6.1 System Interface
    4.6.2 Management Interface
  4.7 Summary
5 Timeliness Enforcement
  5.1 Channel Inaccessibility
  5.2 Inaccessibility Evaluation
    5.2.1 Assessment of Inaccessibility Events
    5.2.2 Extended Channel Monitoring
    5.2.3 Assessment of Inaccessibility Effects
  5.3 Usefulness of Inaccessibility Control Mechanisms
  5.4 Inaccessibility Control Unit
  5.5 Summary
6 CANELy Mechanism and Prototype Engineering
  6.1 CANELy Mechanism Verification and Validation
    6.1.1 Media Selection Unit
    6.1.2 Inaccessibility Control Unit
  6.2 FPGA Mechanism Engineering
    6.2.1 Media Selection Unit
    6.2.2 Inaccessibility Control Unit
    6.2.3 Resource usage comparison
  6.3 CANELy Prototype Board
    6.3.1 Architecture
    6.3.2 Prototype Implementation
  6.4 Summary
7 Conclusions and Future Work
Bibliography
A VHDL Snippets
  A.1 Sequence detection machinery and mapped sequences
  A.2 Omission Monitoring and Control
  A.3 Inaccessibility Monitoring and Evaluation
B Mechanism Design Verification
  B.1 Approach to Component Design Simulation
  B.2 CAN Channel Simulation
  B.3 Simulation Fragments

List of Figures

1.1 SSTL-900 Satellite Architecture Block Diagram
2.1 Block diagram of a generic control system [1]
2.2 Typical DCS infrastructure
2.3 Typical Time-Triggered Ethernet network
2.4 Typical SpaceWire network
2.5 Typical CAN network
2.6 Ring topology RedCAN network
2.7 Typical FlexCAN network
2.8 Typical CAN Enhanced Layer network
3.1 CANELy System Architecture
3.2 CANELy reliable communication and services block diagram
3.3 CANELy Dependability Engine interfaces
3.4 Channel redundant media management
3.5 Inaccessibility Control Unit
3.6 CANELy engineering model
3.7 CANELy Dependability Engine
3.8 Extended CiA Connector
4.1 CANELy network assumptions
4.2 CAN message termination sequence
4.3 CAN physical layer faults
4.4 Errors affecting a dual-media CAN network
4.5 Media-redundant network physical partition
4.6 Columbus’ Egg strategy block diagram
4.7 AND-based Media Selection description in VHDL
4.8 Medium Disable Receive description in VHDL
4.9 Sliding Window sequence detection
4.10 Signal assertion machinery
4.11 Sequence detector description in VHDL
4.12 ChEOT signal description in VHDL
4.13 CANELy Basic Channel Monitoring
4.14 Mdis−tx description in VHDL
4.15 Channel Monitoring signals
4.16 Sequences mapped into VHDL
4.17 CANELy Channel Monitoring functions
4.18 Medium Status Word VHDL data type
4.19 Medium Status Monitoring block diagram
4.20 Medium Omission Detection auxiliary functions
4.21 Omission Degree Control block diagram
4.22 Media Selection Unit block diagram
4.23 CANELy Media Selection Unit management primitives
5.1 CAN vs. CANELy normalised inaccessibility duration bounds
5.2 Timing of the CAN channel monitoring signals
5.3 Inaccessibility Event Count description in VHDL
5.4 Extended Channel Monitoring signals
5.5 Timeliness-related sequences mapped into VHDL
5.6 Inaccessibility duration evaluation description in VHDL
5.7 Optimised Diffusion-based protocol
5.8 Inaccessibility Control Unit block diagram
5.9 CANELy Inaccessibility Control Unit management primitives
6.1 Media Selection Unit simulation fragment
6.2 Inaccessibility Control Unit simulation fragment
6.3 Sequence detection description resource occupation
6.4 CANELy vs CAN Cores resource usage comparison
6.5 CANELy vs CAN Cores relative slice usage
6.6 CANELy Prototype Board block diagram
6.7 CANELy Prototype Board
B.1 Bus Media simulation data flow
B.2 Simulation text file content
B.3 Simulated CAN Channel
B.4 Simulation of the Basic Channel Monitoring mechanism
B.5 Simulation of the Message Identifier Extraction mechanism

List of Tables

2.1 Comparison of TTEthernet, SpW and CAN
6.1 Media Selection Unit FPGA resource occupation
6.2 Inaccessibility Control Unit FPGA resource occupation

List of Acronyms

ABS Anti-lock Braking System

AOCS Attitude and Orbit Control System

ARINC Aeronautical Radio, Inc.

ASIC Application Specific Integrated Circuit

AMBA Advanced Microcontroller Bus Architecture

CAM Content Addressable Memory

CAN Controller Area Network

CANELy CAN Enhanced Layer

CiA CAN in Automation

COTS Commercial Off-The-Shelf

CSMA/DCR Carrier-Sense Multiple Access / Deterministic Collision Resolution

CRC Cyclic Redundancy Check

DCS Distributed Control System

ECSS European Cooperation for Space Standardization

EMI Electromagnetic Interference

EOF End-of-Frame

EOT End-of-Transmission

ESA European Space Agency

FIFO First-In, First-Out

FPGA Field Programmable Gate Array

FSM Finite State Machine

FTU In FlexCAN: Fault Tolerant Unit

HDL Hardware Description Language

ICU In CANELy: Inaccessibility Control Unit

IEEE Institute of Electrical and Electronics Engineers

I/O Input/Output

IP Intellectual Property

IRQ Interrupt Request

LLC Logical Link Control

LUT Look-up Table

MAC Medium Access Control

MSU In CANELy: Media Selection Unit

OBC OnBoard Computer

OOB Out-of-band

OSI Open Systems Interconnection

PHY Physical Layer

PLD Programmable Logic Device

QoS Quality of Service

RAM Random-Access Memory

ROM Read-Only Memory

ROV Remotely Operated Vehicle

RTU Remote Terminal Unit

SCADA Supervisory Control and Data Acquisition

SNR Signal-to-Noise Ratio

SoC System-on-a-Chip

SpW SpaceWire

SRAM Static Random-Access Memory

SSTL Surrey Satellite Technology Ltd

SWaP Size, Weight and Power-consumption

TTEthernet Time-Triggered Ethernet

UAV Unmanned Aerial Vehicle

UTP Unshielded Twisted Pair

VHDL Very High Speed Integrated Circuit Hardware Description Language

VOIP Voice over Internet Protocol

WCET Worst Case Execution Time


1 Introduction

The advances in integration of electronic circuits during the last decades have allowed the proliferation of devices with significant processing capabilities into the various aspects of daily life. Nowadays it is common to see a mere cellular telephone having greater computing power than the most expensive computers of 20 years ago. This integration led to an explosion of systems with a reduced size footprint, containing the elements of a complete computer: processing unit, memory (both volatile and non-volatile) and input/output interfacing capabilities. These embedded systems are part of practically every current electronic system, be it a portable medical device or a ticket vending machine.

The domain of control and automation is one of the areas that has benefited the most from the proliferation of embedded systems. While in the past control applications like process control had to rely on large (central) computers in order to implement the control laws governing the system, nowadays one can easily build an equivalent control system with small and cheap electronic components.

The reduction in size and cost has also allowed the physical distribution of the control systems to the premises of the sub-systems being controlled and the consequent elimination of the point-to-point control wires. This saved money and space while increasing the reliability, availability and maintainability of the global control system.

The shift from centralised control systems to distributed systems raised many challenges, including the need for communication technologies enabling the dependable interconnection of these systems. A special class of computer networks called fieldbuses was devised, with the aim of providing support for communication services in the harsh industrial environment (where some of the problems are Electromagnetic Interference (EMI), power supply transients and physical damage to the cabling infrastructure), thus enabling the construction of distributed embedded systems for real-time control tasks.

Today, many of these technologies are all around us due to the ubiquity of distributed control systems in domains such as automotive or aviation. The trend is set to continue, with mechanical systems such as a car’s engine control system being replaced by more effective systems based on communicating electronic control units.

1.1 Motivation

High dependability application domains like aerospace, medical or process control have mostly been implemented using ad-hoc solutions, owing to their specific nature and low production volumes. However, in the last decade both the industrial and academic worlds have shown a growing interest in using Commercial Off-The-Shelf (COTS) solutions in application domains having stringent requirements with respect to dependability and timeliness.

The interest in COTS solutions has mainly been driven by their (higher) performance rather than by their cost (contrary to popular belief, COTS components may be more expensive than their military counterparts, mainly due to validation and certification processes [2, 3]). However, most of these solutions do not possess all the required characteristics for dependable operation due to their commercial- and performance-oriented nature, and thus must be complemented in order to behave in conformity with the strict requirements of critical applications. Furthermore, recent studies advocate the use of COTS, namely those implementing widely used standards and non-proprietary interfaces [4].

In this context, the Controller Area Network (CAN) fieldbus is an excellent candidate to be used as a building block for these application domains. Designed to be used in the harsh automotive environment, it was quickly adopted as a fundamental block for networked embedded control systems, due to its simplicity of operation, low mass cabling and behaviour in the presence of errors. Furthermore, its specifications are open [5] and have been standardised [6]. Variants of these standards for specific application domains, such as aeronautics [7] and space [8], have also been introduced.

Current applications

Even though CAN was not initially intended for critical applications such as spaceborne systems, there are spacecraft architectures already deployed using CAN to convey information between the several subsystems that compose them and the OnBoard Computer (OBC). One example of this type of architecture is the SSTL-900 from Surrey Satellite Technology Ltd, shown in Figure 1.1. In this satellite architecture the several sub-systems, such as the Attitude and Orbit Control System (AOCS) and the payload, are connected through CAN, in a dual communication media setup to achieve media redundancy.

However, this solution does not provide timeliness or extremely high availability guarantees: swapping between the communication media is performed through remotely operated physical switches, which takes time and might endanger the operation of a hard real-time system. Whilst in Earth-orbiting satellites this is acceptable, there are other spaceborne applications where stronger mission-critical guarantees are needed, e.g. an interplanetary exploration mission, especially during the phase of planetary probe deployment.

Figure 1.1: SSTL-900 Satellite Architecture Block Diagram (blocks shown: On-Board Computer, Payload, Thermal Management, Attitude & Orbit Control System, Power System, Communications & Data Handling System, Propulsion System; interconnect: Dual Media CAN bus, A Side and B Side)

Future trends

Looking into the future, the European Space Agency (ESA) Aurora Programme aims to push robotic exploration of the Solar System and development of manned spaceflight missions to Mars. One of the programme’s initial goals was the study and definition of future mission requirements, in order to achieve a unified hardware/software architecture. This (modular) architecture shall be used by all the missions under the programme’s scope, thus lowering development-related costs and mitigating the eventual technological obsolescence of system components.

Due to the long-term life-cycle of space missions (which usually extend over 10 years, from initial planning to mission launch and exploration [9]), these requirements must have a high longevity in order to cope with the temporal horizon of mission execution. It is in this scope that COTS components play a key role, not only opening room for the inclusion of industrial standardised components such as CAN, but also allowing cost savings w.r.t. special-purpose solutions in the long run due to component reuse.

According to the final report of the study commissioned by ESA with recommendations for the avionics architecture of the Aurora programme [10]:

“Concerning the low rate bus which is the most often used for acquisition/command exchanges, the trade-off has already been performed between CAN, (MIL-STD)1553, ODH485 and TTP/C (...). (MIL-STD)1553 and CAN busses have been selected: the first one is the current standard bus and allows to reuse existing units (AOCS units in particular) and the second one is the future standard and will allow to connect new units embedding CAN bus coupler.”

Therefore, CAN is more than just a popular fieldbus within current automotive and control applications: future systems are being designed based on this technology for sub-system networking. Moreover, the applications where the CAN bus will be incorporated tend to be critical in nature, which opens a window of opportunity for a CAN-based dependable communication infrastructure.

1.2 Objectives

The main objective of this work is to enable a design that enforces dependability and timeliness in CAN, demonstrating the feasibility of using the CAN fieldbus as the network building block for applications with stringent requirements regarding safety, availability and timeliness, such as hard real-time distributed control systems.

To achieve this objective we discuss an implementation of the CAN Enhanced Layer (CANELy) architecture defined in [11], specifically the low-level components dealing with dependable operation of the communication channel and media, in both the temporal and spatial domains. This implementation will lead to a specialised unit, managing the incoming bit streams from the several media and providing: bus media redundancy management, channel/media monitoring, and channel/media status signalling support for the upper layers.
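As an illustration of what managing the incoming bit streams from several media can mean, the following is a minimal sketch in Python, not the VHDL description developed in this work. It assumes the standard CAN bit encoding (dominant = 0, recessive = 1) and an AND-based combination of the per-medium receive signals, as suggested by the title of Figure 4.7; the function name `select_channel_bit` and the `enabled` mask are hypothetical names introduced only for this example.

```python
from typing import Sequence

RECESSIVE = 1  # CAN recessive bus level (logical 1)
DOMINANT = 0   # CAN dominant bus level (logical 0)


def select_channel_bit(media_rx: Sequence[int], enabled: Sequence[bool]) -> int:
    """Combine per-medium receive bits into a single channel bit (hypothetical helper).

    Assumption: the channel value is the logical AND of all enabled media,
    so a dominant bit on any enabled medium makes the channel dominant.
    A disabled medium is treated as recessive and cannot affect the result.
    """
    if len(media_rx) != len(enabled):
        raise ValueError("media_rx and enabled must have the same length")
    bit = RECESSIVE
    for rx, on in zip(media_rx, enabled):
        bit &= rx if on else RECESSIVE
    return bit


# Both media idle: the channel stays recessive.
print(select_channel_bit([RECESSIVE, RECESSIVE], [True, True]))
# Second medium stuck at dominant but disabled: the channel follows the first medium.
print(select_channel_bit([RECESSIVE, DOMINANT], [True, False]))
```

Under these assumptions, a medium that is permanently dominant would hold the whole channel dominant unless it is masked out, which is why the sketch carries a per-medium enable flag alongside the AND operation.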

The specification of these components involves studying the basic analytic models defined by the CANELy architecture and devising efficient machinery to perform the modelled low-level operations. Special attention must be paid to the suitability of implementing these mechanisms in a medium sized Programmable Logic Device (PLD), e.g. Xilinx’s Spartan-3E, a low-cost Field Programmable Gate Array (FPGA) family [12]. The complexity must be kept low, not only because PLD fabric is a scarce resource but also to ease verification, increasing reliability, maintainability and even composability.

The last, long-term objective is the wrapping of the components in an Intellectual Property (IP) core with an adequate Advanced Microcontroller Bus Architecture (AMBA) [13] interface, which allows near “Plug-n-play” integration within a System-on-a-Chip (SoC) design with AMBA bus support, such as the LEON3 spaceborne processor IP core [14].

1.3 Contributions

The contributions provided by this work to the field of highly dependable CAN-based solutions are as follows:

• Demonstration of the feasibility of effective CAN bus media redundancy and monitoring mechanisms through a proof-of-concept;

• Demonstration of the (low) complexity of the mechanisms supporting a dependable architecture based on the CAN fieldbus;

• A (portable) parametrised description of the hardware mechanisms in a hardware description language, e.g. VHDL, suitable for implementation in an FPGA or ASIC.


1.4 Publications

This work has led to the following papers, presented at international conferences:

• J. Rufino, R. Pinto, and C. Almeida, “A FPGA-based solution for enforcing dependability and timeliness in CAN,” in Proceedings of the 2007 IP Based Electronic System (IP’07), Grenoble, France, Dec. 2007 [15].

• J. Rufino, R. Pinto, and C. Almeida, “FPGA-based engineering of bus media redundancy in CAN,” in Proceedings of the 12th International CAN Conference (iCC’08), Barcelona, Spain, Mar. 2008 [16].

and to the following Technical Reports:

• R. Pinto, J. Rufino, and C. Almeida, “CANELy prototype board schematic specification,” FCUL/IST, Tech. Rep. DARIO RT-05-04, Dec. 2005 [17].

• R. Pinto, J. Rufino, and C. Almeida, “Specification and engineering of the CANELy prototype board,” FCUL/IST, Tech. Rep. DARIO RT-06-06, Oct. 2006 [18].

• J. Rufino, R. Pinto, and C. Almeida, “How to enforce dependability and timeliness in CANELy?” FCUL/IST, Tech. Rep. DARIO RT-07-02, Jul. 2007 [19].

1.5 Document organisation

The remainder of this document is organised as follows:

Chapter 2 documents the state of the art with respect to real-time operation and dependable

distributed embedded systems. A set of fieldbus technologies is presented, and CAN-based

solutions for dependable applications are documented.

Chapter 3 describes in detail the CANELy architecture, analysing its contributions to build

highly dependable hard real-time systems.

Chapter 4 discusses the problems that affect the dependability of communication bus op-

eration, and the materialisation of effective mechanisms to ensure proper operation even in

the presence of errors.

Chapter 5 addresses the issues affecting system timeliness and provides solutions to en-

force correctness in the time-domain.

Chapter 6 discusses the engineering of the proposed mechanisms, along with the design

of a prototype supporting the CANELy architecture.

Chapter 7 concludes this dissertation, with future work analysis.

2 State of the Art

The interconnection of embedded systems in a distributed real-time industrial scenario poses

several problems w.r.t. correct system behaviour. These problems stem from all the system layers:

from the interconnection technology, to the communication protocols, up to the requirements

for distributed operation. These problems can grow exponentially when these systems are used

in control and automation tasks, since they add dependability and safety concerns to the set of

issues to be addressed.

This Chapter documents the State of the Art pertaining to: requirements for distributed real-

time control systems, specifically the networking components; fieldbuses for embedded net-

worked system interconnection, including the CAN fieldbus; and CAN-based dependable archi-

tectures.

Firstly, we briefly introduce a generic approach to control systems, focusing then on distributed

real-time control systems. The real-time paradigms governing system and communication be-

haviour are introduced and discussed. A set of fieldbuses - in which CAN is included - is analysed

and compared, w.r.t. the several attributes they offer for (dependable) embedded system network-

ing. Finally we present a set of existing solutions for high dependability architectures using the

CAN fieldbus as the networking block.

2.1 Distributed Control Systems

Control systems are all around us. Whenever one physical quantity must change according

to another, there must exist a control system mediating these interactions.

A generic control system is composed of basic entities, which can be of the following types:

− Sensor, responsible for gathering information from the environment surrounding it;

− Controller, evaluates the information collected by the sensor(s) and according to a defined

control law, generates a piece of information for the entities interacting with the environment;

− Actuator, makes a change in the environment based on the information received from the

controller.
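The interplay of the three entity types can be illustrated with a minimal sketch. The environment model, the proportional control law and its gain are assumptions made purely for illustration; they are not taken from this work.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    temperature: float  # physical quantity under control (hypothetical example)


def sensor(env: Environment) -> float:
    """Sensor: gathers information from the environment."""
    return env.temperature


def controller(measured: float, setpoint: float, gain: float = 0.5) -> float:
    """Controller: applies an assumed proportional control law."""
    return gain * (setpoint - measured)


def actuator(env: Environment, command: float) -> None:
    """Actuator: changes the environment according to the controller output."""
    env.temperature += command


def run_loop(env: Environment, setpoint: float, steps: int) -> float:
    """Run the sensor -> controller -> actuator cycle a fixed number of times."""
    for _ in range(steps):
        actuator(env, controller(sensor(env), setpoint))
    return env.temperature
```

With the assumed gain, each iteration halves the distance between the measured value and the setpoint, so the loop converges towards the setpoint.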

The relations between entities and the environment are depicted in Figure 2.1.

Figure 2.1: Block diagram of a generic control system [1] (blocks: Sensor System, Controller System, Actuator System, Environment)

A Distributed Control System (DCS) shares the same basic entities and operating principle,

but with a fundamental difference: these elements are physically separated. An example of a

typical DCS is a car’s brake-by-wire facility, where the several entities are physically separated.

In this system, the braking pedal is connected to a sensing element, usually a variable resistor,

feeding a controller with data. Whenever the driver presses the pedal, the controller receives the

pedal’s data and can request more data from other sensors¹ before processing it and sending an

adequate piece of data to the brake’s actuating entities.

A solution to physically interconnect these entities appeared in the form of fieldbuses.

Fieldbuses are a class of computer networks that initially were designed to be used in large control

systems, such as chemical process control and power plants, as the interconnection technology

for the smaller distributed control entities replacing the large centralised control computers. An

instance of such a system is shown in Figure 2.2.

Figure 2.2: Typical DCS infrastructure (sensors, actuators and controllers, each controller implementing a control law, interconnected by a fieldbus)

¹ An example is the Anti-lock Braking System (ABS), where the braking order is combined with the wheel’s state.

The system of Figure 2.2 has several entities interconnected by a shared fieldbus, which allows

the controllers to gather data from the sensors. After using the sensor’s data as an input for the

control laws they implement, they send the generated data to the actuators, which should act

accordingly.

Although usually the DCSs are small in scope, they can be integrated in larger control sys-

tems, such as Supervisory Control and Data Acquisition (SCADA), exploiting the availability and

properties of these smaller systems. One problem remains, however: how should the entities of

this distributed system interact?

Notwithstanding the fact that control systems’ entities can be spatially distributed, they still

need to interact as if they were in the same physical platform. Since the components are phys-

ically separated and do not share the same computing platform, it is necessary to synchronise

their state with each other. Furthermore, this synchronisation must be reliable, to ensure safe

operation.

To bridge the gap between the centralised and distributed paradigms a set of services is

needed, allowing this distributed system to operate and provide a safe and secure platform for

reliable applications. The foundations of distributed systems are well understood [20], with several

services and attributes identified. However, not all of these services are useful to build distributed control

systems.

The subset of distributed systems’ functions and facilities that are relevant to the design of DCS

are mostly the ones common in fault-tolerant distributed systems: group communication; mem-

bership and failure detection; clock synchronisation; reliable network infrastructure. For example,

usually a DCS does not need naming and addressing facilities, since the available services (sen-

sors, actuators) and their location (addresses) are defined during system design, and are static

throughout system operation. On the other hand, an entity of this system might need to know if

another entity is available on the system and providing correct service. It can achieve this operation

by resorting to a membership and failure detection service.
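As an illustration of such a service, the sketch below keeps a membership view based on periodic heartbeats. This is a generic, assumed scheme (the class name, the heartbeat mechanism and the timeout are hypothetical) and not the specific mechanism discussed later in this document.

```python
class MembershipView:
    """Tracks which statically configured nodes are considered alive."""

    def __init__(self, nodes, timeout):
        self.timeout = timeout
        # All nodes are known at design time; none has been heard from yet.
        self.last_seen = {node: 0.0 for node in nodes}

    def heartbeat(self, node, now):
        """Record that a life-sign from `node` was observed at time `now`."""
        self.last_seen[node] = now

    def members(self, now):
        """Nodes heard from within the last `timeout` time units."""
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}

    def suspected(self, now):
        """Nodes that have been silent for too long and are suspected to have failed."""
        return set(self.last_seen) - self.members(now)
```

Since the set of nodes is fixed at design time, no naming or discovery facility is required: the view only answers whether a known entity is currently providing service.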

2.2 Distributed Real-time Systems

Most systems interacting with the real world are real-time, in the sense that the system’s

actions are bound by the progression of time. In this class of systems the failure to react to an

event in a timely fashion might yield results ranging from insignificant to catastrophic, e.g. the

previously mentioned car braking system taking too long to respond to a braking order might

endanger human lives.

This Section presents a brief introduction to real-time systems, with particular emphasis on

real-time communication, its operational models and properties.

2.2.1 Real-time Systems

The definition of “real-time systems” under a different perspective might be: real-time systems

are those whose tasks have timeliness requirements, i.e. the time the tasks require to be com-

pleted must be bounded and usually is computed as the Worst Case Execution Time (WCET).

The order in which these tasks are executed, or task schedule, depends on the type of scheduling:

fixed scheduling, which can be done “off-line” and ensures a cyclic behaviour, with the tasks be-

ing carried out in the same order periodically; dynamic scheduling where the order is dependent

on a priority metric, usually the task’s deadline, i.e. the time when the task’s actions must be

completed.

Independently of the type of scheduling, the scheduler may preempt tasks. A reason justifying

preemption might be a task whose deadline is nearer than the one of the currently running task.

Another reason for preemption might be a task whose deadline has been missed, i.e. execution

did not end before the deadline.
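A deadline-driven dynamic scheduling decision can be sketched as follows. The policy shown is Earliest Deadline First, used here only as a well-known instance of deadline-based priorities; the task names and values are hypothetical.

```python
from typing import NamedTuple


class Task(NamedTuple):
    name: str
    deadline: float  # absolute time by which the task must be completed


def earliest_deadline_first(ready):
    """Dynamic scheduling: pick the ready task with the nearest deadline."""
    return min(ready, key=lambda task: task.deadline)


def should_preempt(running, candidate):
    """Preempt when the candidate's deadline is nearer than the running task's."""
    return candidate.deadline < running.deadline


def missed_deadline(task, now):
    """A deadline is missed when execution has not ended before it."""
    return now > task.deadline
```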

The consequences and behaviour of the system in the presence of a missed deadline depend

on whether it is a soft real-time system, where occasional misses and their consequences are acceptable,

or a hard real-time system where the consequences of a missed deadline might be catastrophic.

An example of a soft real-time system is media streaming, such as Voice over Internet Protocol

(VOIP) telephony. This system is real-time, in the sense that the packets carrying the encoded

voice must arrive at their destination within bounded time. If any packet misses its deadline,

however, the consequences will be perceived by the receiver as a glitch or a small period of

silence, which are negligible.

On the other hand, a hard real-time system has stricter operational assumptions. Such a

system is the aforementioned brake-by-wire example, where missing a deadline - such as computing

and outputting data to the braking actuators - can have catastrophic effects.

Given the distribution of these real-time systems, a facility providing real-time communication

services is also needed, ensuring that a reliable, available and timely communication channel ex-

ists to convey the information between system elements, thus ensuring global real-time operation.

2.2.2 Communication System Operation Models

There are two fundamental paradigms defining the timeliness of the interaction between the

real-world and the distributed entities, i.e., when should the system react to external events, and

communicate with other system entities. These paradigms are: the time-triggered approach,

where actions² take place only at certain points in time, synchronised by a global clock; and the

event-triggered approach, where actions take place as soon as possible after the originating event.

² The term “actions” is deliberately generic, since its meaning can range from physical interaction with the environment to network message communication.

Although these two paradigms can be used to describe the behaviour of complete systems,

right from the application layer down to the communication infrastructure layer, we are mainly

interested in the latter. Both approaches have advantages and disadvantages, which will be

explored next.

Time-triggered Approach

In a time-triggered architecture the interaction between system elements is not done upon

event generation: it only takes place in predefined instants in time. This mode of operation im-

poses a synchronous behaviour on the system which has obvious advantages, such as main-

taining information consistency, or fault-detection. The time-triggered operation requires static

scheduling of the communication system, which must be done “off-line”, thus easing the design

and verification of such a system, and during operation allows fault detection, since all elements

know who should be communicating at any given instant.

This very same static communication scheduling, however, is its weakness: the time-triggered

architecture may not be able to accommodate changes to the system, such as adding new ele-

ments, at least without redesigning the schedule. Another consequence of the static communica-

tion scheduling is that the event generation distribution must be known at design time, in order to

provide the communicating entity with sufficient slots for event production/consumption.

Finally, cyclic operation increases the mean response time, since message

transmission is delayed until the node’s scheduled transmission time. This can be a hindrance

in real-time systems whose behaviour is dominated mainly by aperiodic events, i.e. events that

may occur at any instant in time.
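The delay induced by a static cyclic schedule can be quantified with a small sketch. It assumes a single transmission slot per node and per cycle, with all times expressed in integer microseconds; the function and parameter names are hypothetical.

```python
def slot_wait(event_time_us: int, slot_offset_us: int, cycle_us: int) -> int:
    """Time an event generated at `event_time_us` waits for the node's next slot.

    The node owns one slot per cycle, starting `slot_offset_us` after the
    beginning of each cycle of length `cycle_us`.
    """
    return (slot_offset_us - event_time_us) % cycle_us
```

An event generated exactly at the start of the slot is sent immediately, whereas one generated just after it waits almost a full cycle. For aperiodic events uniformly distributed over the cycle, the mean wait is therefore about half the cycle length.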

Event-triggered Approach

In an event-triggered architecture the network messages can be sent at any time, usually after

their generation. This paradigm is also called reactive, since the processing is done as a reaction

to an event. The previous example of the brake-by-wire system is such a system. Braking can

occur at any instant, and it must be dealt with as quickly as possible due to the possibly fast-changing

physical quantities involved, such as speed and distance.

The event-triggered approach is best suited for applications where some of the environment

variables might not be known a priori. This is in part due to the possibility of using dynamic

communication scheduling, which allows it to cope with the uncertainty of the environment. Hence,

this architecture is the most adequate to interact with the “real-world”, given its dynamic behaviour

- induced by the Input/Output (I/O) - and therefore is more suitable to adapt to new situations.

The responsiveness of an event-triggered architecture is better than its time-triggered coun-

terpart: there is no (artificially induced) delay time, the event is processed as soon as possible.

However, there may be situations where a burst of events is brought into the system, which must

be dealt with in a timely fashion and, most importantly, in a safe fashion. For example, if deadline

misses occur, the system should accommodate this scenario and have well defined semantics

regarding the provision of service.

2.2.3 Real-time Communication Networking Infrastructure

As previously discussed, the distribution of systems implies their interconnection through a

physical network. To ensure real-time operation of the interconnected systems one must also

extend the real-time paradigm into the communication network infrastructure, thus introducing

some constraints in the communication model.

The purpose of these constraints is to enforce determinism into the communication channel,

hence making it adequate for supporting real-time operation. Therefore, these constraints pertain

to: ensuring bounded transmission and processing delay; maintaining network connectivity; and controlling

partitioning.

Maintaining network connectivity is tightly coupled with fault-tolerance. In order to maintain

connectivity, one must avoid partitioning, which creates subsets of elements that are unable to

communicate with each other. This impairment has two facets: physical partitioning, where the

physical network infrastructure is affected; virtual partitioning, where the communication channel

is affected. The immediate result of both forms is an unavailable network infrastructure.

One of the most important guarantees that a real-time communication infrastructure must pro-

vide is bounded transmission delay, even in the presence of disturbances, such as other (unre-

lated) network traffic, or transient overload of the network. The knowledge of such delay is needed

to assess the (global) WCET of a task, for system scheduling purposes. This time can also be

used to detect abnormal protocol operation, and trigger fault recovery procedures.

Another concept that must be mapped from the real-time system onto the network is the notion

of priorities. Traffic prioritisation can be achieved by Quality of Service (QoS) measures, such as

different classes of traffic. This would allow the distinction between urgency classes, useful for

providing service to high priority tasks, in a process similar to dynamic task priority assignment.

Finally, to achieve (dependable) real-time communication one must have a reliable network,

which must be available and provide correct service even in the presence of disturbances.

These properties must be provided by any networking technology used in the construction of

real-time distributed systems, thus ensuring correct real-time operation of the (global) system.

2.3 Embedded Systems Networking (Fieldbus) Technologies

There are several fieldbus technologies in the industry, most intending to solve different prob-

lems in different applications. Whilst some aim at high data-rates, others aim at real-time be-

haviour. There is no technology satisfying all the requisites at once, e.g. timeliness, data-rate,

cost, dependability.

In this Section we present and analyse fieldbus technologies in use - or with potential to be

used - in domains requiring the interconnection of embedded systems, such as control or data-

handling. These fieldbus technologies are: Time-Triggered Ethernet (TTEthernet), SpaceWire

(SpW) and Controller Area Network (CAN).

These technologies can be mapped essentially into only three layers of the Open Systems

Interconnection (OSI) model: Application, Data-Link and Physical. The technical specifications of

these technologies, however, only cover the two lower layers, with Application layer being left to

the system designer’s discretion.

In order to be able to compare these fieldbus technologies, the analysis will focus on the

following figures of merit: network topology, network operation (including medium access), ad-

vanced services such as time information diffusion and fault-tolerance. Lastly, these results will

be compiled in a table, for a more effective comparison.

2.3.1 Time-Triggered Ethernet

Time-Triggered Ethernet (TTEthernet) [21] is a fieldbus being developed by TTTech Comput-

ertechnik AG as the network block for the construction of time-triggered applications. It uses

and extends standard Ethernet [22] as the physical (PHY) and data-link layers, providing coexis-

tence with standard Ethernet applications. An instance of a mixed-criticality TTEthernet network

is shown in Figure 2.3.

Figure 2.3: Typical Time-Triggered Ethernet network (TTEthernet switches interconnecting safety-critical TTEthernet hosts and standard Ethernet hosts through Ethernet links)

The typical network topology is a star where each node device connects directly to a TTEther-

net-aware switch in a full-duplex link, allowing simultaneous transmission and reception of data.

The full-duplex property has several benefits: it avoids (local) bus access contention and

therefore enhances the communication system’s timeliness by providing deterministic local

message transmission time; and it eliminates the need for minimum cabling size since there is no need for

collision detection. The TTEthernet specification, however, does not preclude direct connection

between nodes, given that the full-duplex link property is provided.

The TTEthernet fieldbus uses standard Ethernet infrastructure, namely the same data-link

layer up to a certain extent. Therefore, the frames have the same structure as Ethernet, whose

frame length goes from 64 bytes up to 1518 bytes. Since the used media is Ethernet, the data

rates can go from 10 Mbps up to 1 Gbps, and must always be full-duplex. A TTEthernet network

admits coexistence of both time-triggered and event-triggered (standard Ethernet) traffic. The

distinction, or multiplexing, between the types of traffic (payload) is done by the content of the

frame’s EtherType field, which either indicates the frame’s payload size, if the value is at most

1500, or identifies a user protocol, if the value is at least 1536. The utilisation of this field also

excludes the utilisation of a Logical Link Control (LLC) sublayer protocol, e.g. those defined by

Institute of Electrical and Electronics Engineers (IEEE) Standard 802.2.
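The interpretation of the EtherType field can be sketched as below. The thresholds are those of standard Ethernet (values up to 1500 denote a payload length, values from 1536 onwards denote a protocol identifier); the function name and the returned labels are hypothetical.

```python
def classify_ethertype(value: int) -> str:
    """Interpret the 16-bit Length/EtherType field of an Ethernet frame."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("the field is 16 bits wide")
    if value <= 1500:
        return "length"    # the field carries the payload size in bytes
    if value >= 1536:
        return "protocol"  # the field identifies the protocol of the payload
    return "undefined"     # values between 1501 and 1535 are not assigned a meaning
```

A TTEthernet-aware switch would perform this kind of test on the protocol identifier to separate time-triggered frames from standard Ethernet traffic.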

The switches must be TTEthernet-aware, thus being able to make the distinction between the

traffic flows. With this knowledge, the switch might preempt the frames pertaining to standard

Ethernet upon synchronisation events, giving priority to TTEthernet ones. This modification of

the data-link layer behaviour raises issues for standard Ethernet nodes, which may suddenly stop

receiving a frame, not being prepared to deal (often) with such events. According to the TTEth-

ernet specification, fault confinement is provided by the switch, both in the spatial and temporal

domains. There is also error detection at the data-link layer, provided by standard Ethernet layer,

such as frame Cyclic Redundancy Check (CRC). This mechanism, however, is the only one

provided for error detection, and none is provided for error recovery.

The TTEthernet specification also offers advanced services for the basic construction of (de-

pendable) distributed real-time systems, such as: site membership, failure detection and global

clock synchronisation, as a consequence of the time-triggered paradigm; channel redundancy

and bus guardian functions at the switches are contemplated, although reserved for safety-critical

applications due to the overhead they introduce. The bus guardian function avoids the “babbling

idiot” failure mode, where a component uses - or tries to use - the resources in an untimely fashion.

Depending on the application domain, however, the TTEthernet architecture might raise some

issues. These issues can stem both from its time-triggered nature, and the complexity of the

network and nodes, thus not being cost-effective.

2.3.2 SpaceWire

SpaceWire (SpW) [23] is a fieldbus being developed by ESA, other space agencies and

academia to provide high data-rate connectivity to spacecraft systems’ components. Its design

was inspired by the IEEE Standard 1355 [24], sharing several similarities, and even showing

compatibility at some layers.

The SpW fieldbus is a point-to-point, full-duplex serial data communication bus, embodied

in an entity called SpW link. The point-to-point characteristic implies that: the bus topology is

arbitrary; the messages between two nodes not directly connected must be routed through the

network; faults affecting a link are self-contained. An instance of a SpW network is depicted in

Figure 2.4.

Figure 2.4: Typical SpaceWire network (OBC, RTU and memory nodes with SpW interfaces, interconnected by SpW links and a router)

The network in Figure 2.4 mimics a typical aerospace application, such as a satellite, where

the OnBoard Computer (OBC) is connected to several Remote Terminal Units (RTUs). Each RTU

is responsible for one or more sub-systems, which can be either control or data. This network

is composed of several links and a router interconnecting two (logical) network segments. A

redundant setup is also shown, with two cross-connected SpW links between the replicated com-

ponents. The management of this setup must be done entirely by the application layer, since the

SpW specification [25] does not contemplate any type of redundancy provision and management.

Due to each link being full-duplex there is no contention for bus access between two directly

connected nodes, and therefore there is no need for bus media access arbitration. Whenever

a node needs to send a message it starts transmission immediately unless there is already a

previous transmission in progress.

Since the network topology is arbitrary but composed of point-to-point links, packets³ need

to be routed through the network. The used method is wormhole routing: an incoming packet

is transmitted as soon as it is possible to determine its destination, without waiting for it to be

received completely [26]. This architecture has the advantage of reducing message transmission

delay, being the opposite of store and forward which is used by Ethernet and TTEthernet.
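The difference between both forwarding methods can be made concrete with a first-order latency model. The sketch assumes equal bit rates on all links and ignores propagation and switching delays; the function and parameter names are hypothetical.

```python
def store_and_forward_latency(packet_bits: int, links: int, bit_rate: float) -> float:
    """Each node receives the whole packet before sending it over the next link."""
    return links * packet_bits / bit_rate


def wormhole_latency(packet_bits: int, header_bits: int, links: int, bit_rate: float) -> float:
    """Each router forwards as soon as the header (destination) has been received."""
    return ((links - 1) * header_bits + packet_bits) / bit_rate
```

Over a single link both methods are equivalent. As the number of traversed links grows, the store and forward latency grows with the full packet length per link, while wormhole routing only adds the header time per router.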

The basic information element is the character, which can be either control or data. Data

characters are 10 bit wide, having an 8 bit payload and two control bits, where one of them is the

parity. The packet length in a SpW link is not limited: it can go from 2 characters up to infinity.

This design option was taken both due to the limitation of 32 bytes per packet in IEEE-1355, and

in order to allow a simpler hardware implementation of the protocol, leaving most of the details to

be implemented in software.

³ The terminology used is the one defined in the SpW specification [25], where the packet is equivalent to a frame, and routing is equivalent to switching.

Packet routing admits two types of addressing: path addressing, where the (local) routing port

addresses are embedded in the packet itself; logical addressing where each port in the network

has a unique identifier, and is used by the packet to identify its destination. These modes have

implications in the logic of the routers: while the former needs little logic, just to remove one

address from the top of the packet, the latter needs memory to hold a routing table, usually a

Content Addressable Memory (CAM) mapping the logical addresses to router’s ports.
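The logic required by each addressing mode can be sketched as follows. Packets are modelled as byte strings, one byte per character, and the routing table as a plain dictionary standing in for the CAM. Keeping the logical address in the forwarded packet is an assumption of this sketch, as are all names used.

```python
def forward_path_addressed(packet: bytes):
    """Path addressing: the leading character selects the output port and is removed."""
    if len(packet) < 2:
        raise ValueError("expected an address followed by at least one character")
    return packet[0], packet[1:]


def forward_logical(packet: bytes, routing_table: dict):
    """Logical addressing: the leading character is looked up in the routing table.

    Assumption of this sketch: the address is kept, so that the following
    routers can perform their own lookup.
    """
    if not packet:
        raise ValueError("empty packet")
    return routing_table[packet[0]], packet
```

Path addressing needs no state in the router, at the cost of one address character per hop in every packet. Logical addressing keeps the packet short but requires each router to store the mapping.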

Error detection can take place through a parity bit embedded in each character. Upon detec-

tion, a special control character is appended, in order to notify the remaining routers that an error

has occurred. Regarding advanced features, SpW provides native means for the synchronisation

of time, through a special packet called “Time-Code”.

Being optimised for raw throughput and small hardware footprint, SpW does not possess na-

tive support for either fault-tolerance or broadcast communication. However, an enhanced speci-

fication named SpaceWire-RT (Reliable-Timely) is being prepared, aiming at securing properties

such as reliability and timeliness, both essential for safe real-time operation.

2.3.3 Controller Area Network

The Controller Area Network (CAN) fieldbus was designed by Bosch GmbH for automotive

applications. It features a bus network topology, operated in a simplex fashion: all nodes share

the same medium to transmit and receive. Figure 2.5 shows an instance of a CAN network,

composed of several communicating nodes.

Figure 2.5: Typical CAN network (hosts with CAN interfaces attached to a shared CAN link with bus terminations)

The network topology is a shared, multi-master communication bus with a Carrier-Sense Multi-

ple Access / Deterministic Collision Resolution (CSMA/DCR) medium access arbitration scheme.

Whenever the bus is in an idle state, any node with queued messages starts transmission by

sending the message’s identifier, which is broadcast to all nodes. Upon collision the node trans-

mitting the message with the lowest identifier goes through and gains bus access, while the other

competing nodes back-off and go into listening mode. At the end of the transmission any node

having pending messages starts the arbitration process again, until all messages are transmitted.
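The arbitration process can be simulated bit by bit. The sketch assumes 11-bit identifiers sent most significant bit first, and a bus where a dominant bit (0) overwrites a recessive bit (1); the function name is hypothetical.

```python
def arbitrate(identifiers, id_bits: int = 11) -> int:
    """Return the identifier that wins bus access among competing nodes."""
    contenders = set(identifiers)
    if not contenders:
        raise ValueError("no node is competing for the bus")
    for bit in reversed(range(id_bits)):
        # The bus carries a dominant (0) level if any contender sends 0.
        bus_level = min((ident >> bit) & 1 for ident in contenders)
        # Nodes that sent recessive but read dominant back off and listen.
        contenders = {i for i in contenders if (i >> bit) & 1 == bus_level}
    (winner,) = contenders  # identifiers are unique, so exactly one node remains
    return winner
```

The winner is always the numerically lowest identifier, and its transmission proceeds undisturbed: the collision is resolved without destroying the winning message.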

The payload of a CAN data message varies from 0 to 8 bytes, and the data rate can go up to

1 Mbps. The maximum data rate depends on the physical network length, because

all nodes sample the bit present on the network at the same time, save

for a small amount of jitter derived from each node’s local clock drift. Therefore, the bit

must be allowed to propagate to the farthest node in the network before being sampled.
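The trade-off between data rate and network length can be estimated with a rough sketch. It assumes that the round-trip delay between the two farthest nodes must fit before the sampling point of the bit; the sampling point position, the per-node delay and the cable propagation delay are illustrative assumptions, not values taken from this work.

```python
def max_bus_length(bit_rate: float,
                   sample_point: float = 0.75,   # assumed fraction of the bit time
                   node_delay: float = 150e-9,   # assumed controller + transceiver delay (s)
                   propagation: float = 5e-9) -> float:  # assumed cable delay (s/m)
    """Rough upper bound on the bus length, in metres, for a given bit rate."""
    bit_time = 1.0 / bit_rate
    # Time left for the signal to travel to the farthest node and back.
    budget = sample_point * bit_time - 2 * node_delay
    return max(0.0, budget / (2 * propagation))
```

Under these assumed values the estimate is about 45 m at 1 Mbps and grows as the data rate is lowered, which is the behaviour described above.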

The CAN fieldbus possesses native fault-tolerant mechanisms. During message transmission

all the nodes - including the transmitting node - listen and check for any violation of the protocol.

If detected, an error flag is sent by the detecting node(s), consisting of only dominant bits and

overwriting the current transmission. This ensures the error is perceived by all the nodes. Upon

an error, the transmitting node reschedules the message automatically to be transmitted in the

next available opportunity. Furthermore, each frame has a CRC field, allowing the detection of

errors that may have gone undetected by other means.
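The CRC computation can be sketched as a bit-serial polynomial division. The generator polynomial 0x4599 is the 15-bit one defined for CAN; bit stuffing and the exact set of frame fields covered by the CRC are left out of this sketch, and the function name is hypothetical.

```python
CAN_CRC15_POLY = 0x4599  # x^15 + x^14 + x^10 + x^8 + x^7 + x^4 + x^3 + 1


def crc15(bits) -> int:
    """Compute the 15-bit CRC of a sequence of bits (each 0 or 1), MSB first."""
    crc = 0
    for bit in bits:
        feedback = bit ^ ((crc >> 14) & 1)
        crc = (crc << 1) & 0x7FFF  # keep the register 15 bits wide
        if feedback:
            crc ^= CAN_CRC15_POLY
    return crc
```

A receiver can run the same computation over the received bits followed by the received CRC field: a result of zero indicates that no error was detected, while any single corrupted bit yields a non-zero result.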

During normal operation, CAN signals are transmitted in differential mode through two wires.

If any of the wires gets “stuck-at” a level, transmission falls back to single-wire mode. Hence, CAN

can keep the communication channel even in the presence of physical faults.

Whenever a message is transmitted, whatever its type, a set of counters internal to each node is

incremented or decremented, depending on the type of message. Upon reset, a CAN node is in

a state called error active, allowing it to fully participate in communication. This mechanism is

part of the fault-confinement functions, which may lead to node disconnection from the bus if the

error counter exceeds a certain threshold, and the node is put into bus-off state.
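A simplified model of these counters is sketched below. The thresholds (127 and 255) and the basic increments are those of the CAN specification, but the several special cases defined there are deliberately omitted; the class and method names are hypothetical.

```python
class ErrorCounters:
    """Simplified CAN fault-confinement counters of a single node."""

    def __init__(self):
        self.tec = 0  # transmit error counter
        self.rec = 0  # receive error counter

    def transmit_error(self):
        self.tec += 8

    def transmit_ok(self):
        self.tec = max(0, self.tec - 1)

    def receive_error(self):
        self.rec += 1

    def receive_ok(self):
        self.rec = max(0, self.rec - 1)

    @property
    def state(self) -> str:
        if self.tec > 255:
            return "bus off"        # node is disconnected from the bus
        if self.tec > 127 or self.rec > 127:
            return "error passive"  # node may no longer disturb the bus with active error flags
        return "error active"       # node fully participates in communication
```

Since errors weigh more than successes, a node that fails persistently is progressively confined, while sporadic errors are forgiven by subsequent successful transfers.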

The CAN fieldbus is particularly well suited for real-time communication, due to its prioritised

medium access scheme, where the message having the lowest identifier gets through. The mes-

sage identifiers can be mapped to traffic classes, depending on their urgency and thus provide

basic QoS measures where higher priority traffic, e.g. an alarm, can get bus access without

having to contend with lower priority traffic.

Fieldbus Technologies Comparison

To conclude the presentation of the fieldbus technologies, Table 2.1 compares them w.r.t.

several attributes.

A comment regarding the frame efficiency of SpW: the maximum figure comes from the fact that

in SpW there is no maximum size for a packet, hence the 99.9(9)% value. Regarding the

minimum value, the path addressing mode for packet routing specifies that the maximum number

of hops (routers) must be 32. Therefore, a (highly unlikely) scenario may arise, consisting of

one data character transmitted using path addressing mode through a network of 32 hops, i.e. one

data character out of 33 transmitted characters, hence the rounded 3.(03)%.
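Both bounds can be reproduced with a one-line model, in which every hop costs one address character and all other overheads (e.g. the end-of-packet marker) are ignored. The function name is hypothetical.

```python
def path_addressing_efficiency(data_chars: int, hops: int) -> float:
    """Fraction of the transmitted characters that carry data."""
    return data_chars / (data_chars + hops)
```

With one data character and 32 hops the ratio is 1/33, and it approaches 100% as the number of data characters grows without bound.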

2.4 CAN in airborne and spaceborne applications

The provision of native fault detection and recovery mechanisms, together with low weight and

size cabling have allowed CAN to be quickly incorporated into domains other than automotive.

Parameter                 | TTEthernet               | SpaceWire                 | Standard CAN
Maximum Data Rate         | 1 Gbps                   | 400 Mbps                  | 1 Mbps
Network Operation         | transmission line        | transmission line         | quasi-stationary bus
Media Access Control      | N/A (Full-Duplex)        | N/A (Full-Duplex)         | CSMA/DCR
Frame Efficiency          | 37.5% - 98.8%            | 3.03% - 99.9%             | 45.3% - 59.2%
Error Detection (Domains) | Value and Time           | Value only                | Value only
Fault Confinement         | Provided by the switch   | link active, link inactive | error active, error passive, bus off
Omission Handling         | no detection             | detection                 | detection/recovery frame retransmission
Media Redundancy          | no                       | no                        | no
Channel Redundancy        | possible                 | no                        | no
Babbling idiot avoidance  | bus guardian (in switch) | not provided              | not provided
Communications            | unicast/broadcast        | unicast/limited broadcast | broadcast

Table 2.1: Comparison of TTEthernet, SpW and CAN

One of those domains was aviation, where the Size, Weight and Power-consumption (SWaP)

requirements are paramount. Current aircraft such as the Airbus A380 and Boeing 787 have

CAN-enabled control sub-systems, complying with the Aeronautical Radio, Inc. (ARINC) 825

specification [7].

The CAN fieldbus has also found its way into space applications, being used both by

commercial satellite buses [27], and space agencies [28]. Currently the European Cooperation for

Space Standardization (ECSS) is coordinating the standardisation of CAN to be used in future

space missions. The document is still in the draft phase [8], but CAN is already being deployed in

current ESA missions. One of those missions is the ExoMars rover [29], which will perform robotic

exploration tasks in Mars. This mission has used the draft as starting point for on-board CAN bus

design, which also served the purpose of testing and refining the draft standard.

A strong argument for CAN deployment in space is cabling weight, which is at a premium.

The regular Unshielded Twisted Pair (UTP) has a much smaller weight footprint, compared with

other space technologies such as SpW and legacy MIL-STD1553B. Another argument is power-

consumption, which is very strong given the solar-cell and battery-based nature of space ap-

plications. The CAN communication uses much less power than their specialised counterparts,

making it a suitable solution for space applications.

Although space is usually regarded as a high-dependability domain, not all applications are designed to meet these requirements, especially the attributes concerning sub-system fault-tolerant real-time communication. One example is the Surrey Satellite Technology Ltd (SSTL) satellite buses⁴, which use a dual-media CAN solution for redundant sub-system communication. However, the detection and recovery from a failed bus medium can take up to five minutes [27], which can be unacceptable in applications dealing with short deadlines from aperiodic tasks, e.g. orbital manoeuvres for spacecraft docking.

⁴ A satellite bus is a generic platform having most sub-systems defined and only differing in the payload. This way the costs can be lowered and the mission cycle can be shortened.

2.5 High-Dependability CAN-based architectures

Industry and academia started to envisage new application domains for the CAN fieldbus after its debut in the automotive world. Some of these domains, such as mission-critical ones, demanded higher operational guarantees. The set of these high dependability domains includes medical, oil drilling, chemical process and power plant control, among others. Standard CAN, however, could not address all the strict requirements concerning dependable operation that these domains demand. Therefore, the CAN fieldbus weaknesses w.r.t. dependable operation had to be overcome.

There were two main approaches to endow CAN with dependable operation attributes: enhance or complement the standard CAN layer⁵, i.e. enhance CAN through additional mechanisms, or complement it through the execution of protocols and algorithms at the application level.

⁵ The option to modify the standard CAN layer has only become possible in the last ∼5-7 years due to the availability of cost-effective FPGA devices, such as the Xilinx Spartan-3 family, and IP cores. However, such an approach must ensure backward-compatibility with the standard CAN layer, mainly due to the large number of deployed standard CAN devices.

A first analysis and approach to the problem of real-time CAN dependable operation was given in [30], where a dual CAN channel solution was proposed to provide communication redundancy, together with full-space redundancy of the host itself. This approach, however, is not cost-effective: the replication of hardware, from the processing elements to the communication controller, can be very expensive, both in terms of cost and SWaP properties.

A simpler approach consists of replicating only the communication media, thus providing redundancy to the channel itself. This approach increases availability, but also poses another problem: managing the several redundant media, especially how to combine the information received from them and present it to the single Medium Access Control (MAC) entity.

The following Sections are devoted to the analysis of high-dependability CAN-based architectures, highlighting their main contributions to CAN dependable operation, such as maintaining network connectivity, and their architectural features.

2.5.1 RedCAN

The RedCAN architecture is a commercial solution for high dependability CAN based systems [31, 32]. An instance of a RedCAN network is depicted in Figure 2.6.

[Figure 2.6: Ring topology RedCAN network. Nodes 1, 3 and n attach through RedCAN interfaces with Left and Right ports; Node 2 attaches through a standard CAN interface inside a RedCAN section.]

The network topology is a self-healing ring, achieved by dividing the network into segments, called sections in the RedCAN context. The transformation of the CAN ring topology into a bus is made through special-purpose machinery inserted between the physical network cabling and the CAN transceivers. This machinery is composed of termination resistors and switches controlled by the node, embodied by the RedCAN interface (see Figure 2.6). The termination of the bus is made at a specific node possessing a RedCAN interface (such as Node n in Figure 2.6), while the remaining nodes possessing RedCAN interfaces are in transparent mode, bypassing the bus media (Node 1 in the same Figure).

In a RedCAN network not all nodes are required to have a RedCAN interface. Each section needs two RedCAN-enabled nodes, the leftmost and the rightmost, for bus termination purposes. Moreover, each section supports the connection of nodes with standard CAN interfaces (e.g. Node 2).

The RedCAN error confinement mechanisms act upon the sections, isolating the faults. These faults may originate in the sections themselves, e.g. a damaged network cable, or in faulty nodes, e.g. a node exhibiting “babbling idiot” behaviour.

Through the RedCAN ring reconfiguration mechanisms, nodes equipped with a RedCAN interface (Nodes 1, 3 and n in Figure 2.6) can maintain connectivity even if a section is disconnected.

Although RedCAN provides a fault-resilient network infrastructure, it may be inadequate for

hard real-time operation. The fault detection and recovery procedures involved in bus reconfigu-

ration - done through mechanical switches - take time, which can be in the order of hundreds of

milliseconds [32]. This time may be enough for deadline violation in applications with tasks having

short periods, and thus system failure.

2.5.2 FlexCAN

The FlexCAN architecture [33] aims at providing an ultra-dependable solution for generic em-

bedded systems requiring dependable operation. It resorts to the concept of SafeWare, a middle-

ware providing the necessary services to achieve high dependability. It also supports operation

with both standard CAN and FlexCAN-aware nodes. An instance of a FlexCAN network is shown

in Figure 2.7.

It offers a layered architecture with several options for fault-tolerance: channel redundancy, using multiple CAN controllers and buses; and node redundancy, forming a Fault Tolerant Unit (FTU). In the example of Figure 2.7 one can observe: a full-space redundant node (Node 1), two regular nodes (Nodes 2 and n), and a non-redundant, standard node (Node 3).

[Figure 2.7: Typical FlexCAN network. Labels: Nodes 1, 2, 3 and n; SafeWare; FTU; CAN controllers; CAN Interface; Redundant CAN Buses.]

In the FlexCAN architecture redundancy management is achieved through a specially crafted

protocol, dubbed SafeCAN [34]. This protocol is implemented at the application layer, and man-

ages the information from the several redundant controllers in order to guarantee reliable and

available communication.

Although this architecture is capable of offering highly dependable CAN-based operation, it does so at a high cost. This cost derives mainly from the provision of a fully space-redundant architecture, both at the communication channel and at the supporting computing platform.

2.5.3 CAN Enhanced Layer - CANELy

Another approach to highly dependable solutions based on the CAN bus is the CANELy architecture [11]. This architecture uses a replicated medium bus network topology to enhance the network availability. A typical CANELy network, composed of two redundant media conveying one CAN channel, is depicted in Figure 2.8.

[Figure 2.8: Typical CAN Enhanced Layer network. Nodes 1, 2, 3 and n each attach through a CANELy interface to the redundant CAN media.]

The nodes are all connected to the redundant media, having suitable mechanisms for re-

dundancy management. The CANELy architecture provides several services and functions for

building dependable distributed systems. Such services include: reliable communication protocol


suite, clock synchronisation, inaccessibility detection and control, medium and channel monitor-

ing, besides the aforementioned bus media redundancy.

The approach taken w.r.t. replicated network infrastructure is similar to the one in Delta-4 [35],

where only the Physical Layer (PHY) components are replicated, thus keeping one MAC entity

per communication channel. This provides a more cost-effective solution, since the cost - both

in terms of components and redundancy management - is kept low. This approach, however,

does not preclude a full-spatial redundancy scenario, with the replication of the channel. Such a level of redundancy is reserved for the applications requiring the highest guarantees of operation, where the cost of having extra machinery is acceptable. The CANELy architecture

is analysed in a more comprehensive manner in the following Chapter.


3 CAN Enhanced Layer

The design of a dependable hard real-time communication infrastructure must take into ac-

count the dynamics of both the computational and communication systems, and at the same time

exploit the operative mechanisms of the underlying fieldbus. Therefore a systemic approach is

in order, obliging the inclusion of communication behaviour in the system’s dynamics, paving the

way for a dependable hard real-time communication architecture using CAN.

In this Chapter the CAN Enhanced Layer (CANELy) architecture [11] is presented in detail,

giving emphasis to the contributions being addressed in the construction of dependable hard real-

time distributed control systems. The system architecture is presented and discussed in a top-

down approach, through the several system services and components. Finally, the engineering

aspects are detailed w.r.t. the CAN controller and physical layer components.

3.1 System Architecture

A dependable communication infrastructure must provide several services to the host system

using it, ranging from reliable message diffusion protocols to (physical) redundant communication

media management. Moreover, it must be organized in a layered structure in order to allow

composition of the several services.

The main objective of the CANELy architecture is to enhance and complement the standard

CAN layer with mechanisms pertaining to dependability and timeliness guarantees, but without

modifying it. This goal is achieved through a non-invasive approach w.r.t. the standard CAN

layer as defined in [6], thus promoting the reuse of currently deployed applications and devices,

while providing them with higher operational guarantees of service. The CANELy architecture is

shown in Figure 3.1.

Figure 3.1 clearly shows the hierarchy of the architecture and its components, and how the standard CAN layer is accommodated. Based on a modular philosophy, CANELy comprises both hardware and software components: the hardware-based mechanisms deal with dependability enhancements such as bus media redundancy and bus failure masking through media redundancy; the software-based mechanisms deal with reliable communication protocols and services common to distributed systems, such as group communication, membership and node failure detection, and clock synchronisation. The following Sections detail CANELy’s architecture, focusing on the main working areas: Reliable Communication, Network Dependability and hard real-time operation.

[Figure 3.1: CANELy System Architecture. Blocks: CAN Enhanced Layer Interface; Communication Management; reliable communication protocol suite; layer management; media/network monitoring; control of inaccessibility; AND-based media selection; CAN Standard Layer; Channel Interface (ChTx, ChRx); Media Redundant CAN Communication Channel.]

3.1.1 Reliable Communication and Services

A basic building block of a communication system is a reliable communication facility, for ef-

fective support of more advanced services common to distributed systems [20], thus providing an

important set of protocols and services for dependable distributed operation.

Message diffusion

Given the shared nature of the CAN communication media, all messages sent by any node are broadcast to all nodes. However, even with robust electrical encoding, that broadcast is not immune to (consistency) errors. A scenario has been shown in [36] where an error affecting the message transmission is not correctly perceived by all nodes, leaving the system in an inconsistent state.

In order to cope with these problems a set of communication protocols was devised [11], pro-

viding the foundations for building complex applications and replication/cooperation management

services based on reliable message diffusion.

Group Communication

A key feature of a distributed system is the concept of group: the set of elements participating in the (global) system’s actions, usually through the execution of distributed algorithms. Through group communication the application using CANELy’s services can access more advanced facilities such as QoS or message filtering, e.g. a set of controllers and actuators requesting a sensor’s value.

In distributed control applications it is usual to have replicated components, e.g. replicated actuators for safety-critical operation. Moreover, it is also common for some controllers to get input from more than one sensor, and a sensor can serve multiple controllers. Therefore, CANELy must provide facilities to filter messages, delivering to the upper layers only the messages intended for that node, i.e. it must introduce the notion of multicast communication.
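
The filtering idea can be illustrated with a minimal sketch. It assumes that every broadcast message carries a group address and that each node keeps the set of groups it has joined; the `Message` and `GroupFilter` names, and the way the group address is represented, are hypothetical and do not describe the actual CANELy protocols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    group: int      # hypothetical group (multicast) address
    payload: bytes


class GroupFilter:
    """Passes to the upper layers only messages addressed to joined groups."""

    def __init__(self):
        self._groups = set()

    def join(self, group: int) -> None:
        self._groups.add(group)

    def leave(self, group: int) -> None:
        self._groups.discard(group)

    def deliver(self, received):
        # Every node sees every broadcast message; filtering is purely local.
        return [m for m in received if m.group in self._groups]


controller = GroupFilter()
controller.join(1)  # e.g. the group of one particular sensor
bus_traffic = [Message(1, b"temp"), Message(2, b"rpm"), Message(1, b"temp2")]
print([m.payload for m in controller.deliver(bus_traffic)])
```

Since the underlying medium is a broadcast one, multicast in this sketch costs no extra bandwidth: the sender transmits once and each receiver decides locally whether to deliver.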

Site membership and failure detection

A site membership service provides consistent information regarding the sites (or nodes) present in the network. This information is usually called a view. Such a service might aid and

ease the provision of other services involving interactions between the sites, since it provides in-

formation on which nodes are active. For example, a group communication service can benefit

from the knowledge of exactly what nodes it is transmitting messages to.

The CANELy architecture provides a site membership service, together with a node failure

detection service to detect crashed sites. These two services ensure that there is at all times

a correct view of the elements participating in the system’s actions. A set of low-level micro-protocols has been devised to handle node join/leave events, detect node failures and enforce agreement. These protocols must be efficient in their use of the CAN bandwidth, thus lowering their impact on normal network operation.

Clock Synchronization

Another key feature in a distributed system is (event) causality and time-stamping, e.g. to

determine the state of a task composed by several processes [37]. This service is extremely useful

in a control system for keeping track of application state, and having the means for coordinating

system’s actions that progress over time.

In order to capture event causality, each node in the system must not only possess a time-

keeping facility but also a service to guarantee a globally coherent timebase, shared by all nodes.

Although this issue has not been discussed specifically in the CANELy architecture definition,

a suitable (distributed) algorithm is described in [38]. This algorithm provides CANELy with the

means for accurate clock synchronization. The integration of this service and all the previously

described services is depicted in Figure 3.2.

3.1.2 Network Dependability

The provision of dependable service must be supported by a physical network infrastructure providing reliable and available service. Some of the problems affecting the communication infrastructure are: transient faults such as bit corruption due to EMI; permanent faults such as physical bus media damage. These disturbances cause virtual and physical network partitioning, leading to subsets of nodes that cannot communicate with each other.

[Figure 3.2: CANELy reliable communication and services block diagram. Blocks: Group Communication; Site Membership & Failure Detection; Clock Synchronization; Broadcast Communication; Communication Management; Standard CAN Layer (ChTx, ChRx).]

Bus Media Redundancy

The basis for network dependability enhancement is bus media redundancy, providing redun-

dant communication paths. Although this solution allows the relaxation of the CAN fault model, it

also introduces a new problem: redundancy management.

The problem of redundant media management is ingeniously solved through the exploitation of

both the quasi-stationary bus operation mode and the wired-AND nature of CAN bus access: all

incoming media are AND’ed together into the Channel, through a component called AND-based

media selection, depicted in Figure 3.1. This scheme greatly simplifies the machinery involved,

since there is no need for complicated media bit-synchronization and decision mechanisms.
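
A minimal sketch of the AND-based selection follows, using the usual CAN convention that a dominant bit is logical 0 and a recessive bit is logical 1. The function names, the `enabled` mask and the modelling of a broken medium as permanently recessive are assumptions made for illustration; real hardware operates on the received signals bit by bit, not on Python lists.

```python
DOMINANT, RECESSIVE = 0, 1


def and_media_selection(media_bits, enabled=None):
    """Combine the bits received on each medium into one channel bit."""
    if enabled is None:
        enabled = [True] * len(media_bits)
    channel_bit = RECESSIVE
    for bit, is_enabled in zip(media_bits, enabled):
        if is_enabled:          # a disabled medium does not contribute
            channel_bit &= bit  # any dominant (0) input wins
    return channel_bit


def channel_stream(media_streams, enabled=None):
    """Apply the selection to every bit time of the per-medium streams."""
    return [and_media_selection(bits, enabled) for bits in zip(*media_streams)]


frame = [0, 1, 1, 0, 1, 0, 0, 1]
broken = [RECESSIVE] * len(frame)   # assumed: an interrupted medium reads recessive
stuck = [DOMINANT] * len(frame)     # medium stuck at the dominant level

print(channel_stream([frame, broken]))                        # frame survives
print(channel_stream([frame, stuck]))                         # channel blocked
print(channel_stream([frame, stuck], enabled=[True, False]))  # frame restored
```

The first case shows why no decision logic is needed: a medium that stays recessive is masked by the AND function itself. The second and third cases show why a stuck-at-dominant medium has to be detected and removed from the channel formation.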

Channel and Bus Medium Fault-Tolerance

The introduction of redundant bus media may have relaxed¹ the CAN fault model, but it also introduces new faults into this extended model. The CANELy architecture has clearly defined

the fault model affecting a CAN-based redundant media network, and provides the analytical

foundations for monitoring functions aiming at error detection and fault confinement.

These mechanisms act at all levels of the network: medium and channel. A fault affecting a

medium might propagate into the channel, as it happens with standard CAN. Given the redundant media, however, we can detect which medium or media are affected (fault detection) and disable their participation in the channel (fault confinement).

Lastly, each medium is also monitored w.r.t. its omission degree, i.e. the number of omission errors affecting that medium in a reference time interval. A medium exceeding its omission degree bound should be declared failed, its participation in the channel formation disabled, and the upper layers signalled to execute any recovery procedures.
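
The omission degree check can be sketched as a counter over a sliding time window. The class below is an illustration only: the sliding-window interpretation of the reference interval, the latched `failed` flag and all names are assumptions, not the monitoring functions defined by CANELy.

```python
from collections import deque


class OmissionDegreeMonitor:
    """Counts omission errors of one medium within a reference interval."""

    def __init__(self, bound: int, interval: float):
        self.bound = bound          # maximum tolerated omissions per interval
        self.interval = interval    # reference time interval, in seconds
        self._omissions = deque()   # timestamps of recent omissions
        self.failed = False         # latched once the bound is exceeded

    def record_omission(self, now: float) -> bool:
        self._omissions.append(now)
        # Forget omissions that fell out of the reference interval.
        while now - self._omissions[0] > self.interval:
            self._omissions.popleft()
        if len(self._omissions) > self.bound:
            self.failed = True      # caller should disable the medium and
                                    # signal the upper layers
        return self.failed


monitor = OmissionDegreeMonitor(bound=2, interval=10.0)
for t in (0.0, 1.0, 2.0):
    print(t, monitor.record_omission(t))
```

In this sketch three omissions within ten seconds exceed a bound of two, so the medium is declared failed on the third one, whereas the same number of omissions spread far apart would be tolerated.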


3.1.3 Hard Real-time Operation

Hard real-time operation assumes a greater importance when part of a distributed system

due to a common point of intersection: the communication network. The correctness of real-time

behaviour depends not only on the local resources, be it computational or regular I/O interfaces,

but also on the availability of the networking infrastructure.

Inaccessibility is a subtle form of partitioning, characterized by the channel being temporarily

unavailable for other nodes to communicate, i.e. inaccessible. In a CAN network the occurrence

of omissions is tightly coupled with inaccessibility events. An omission error implies (at least) the

signalling of the error, thus leaving the bus inaccessible for the duration. Therefore, omissions are

implicitly transformed into inaccessibility periods.

It is impossible to avoid inaccessibility - even an ideal and fault-free physical network infrastructure could suffer from local node circuitry malfunction, such as loss of synchronization. Therefore, inaccessibility control is needed to assess the number and duration of inaccessibility periods, optimize protocol timeout calculations and mitigate their effects.
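
One possible reading of the timeout optimization is sketched below: instead of padding every protocol timeout with a worst-case inaccessibility bound, the deadline is extended only by the inaccessibility actually observed while the timer was running. The function, its parameters and the representation of inaccessibility periods as `(begin, end)` pairs are assumptions made for illustration and are not taken from the CANELy definition.

```python
def extended_deadline(start, timeout, inaccessibility_periods):
    """Return the deadline of a timer started at `start`, stretched by the
    inaccessibility periods (begin, end) that overlap its running time."""
    deadline = start + timeout
    for begin, end in sorted(inaccessibility_periods):
        # Only periods overlapping the (possibly already extended) timer count.
        if begin < deadline and end > start:
            deadline += end - max(begin, start)
    return deadline


# A 10 ms timer started at t = 0 with two measured inaccessibility periods.
print(extended_deadline(0.0, 10.0, [(2.0, 5.0), (12.0, 14.0)]))
# A period that begins after the deadline does not affect the timer.
print(extended_deadline(0.0, 10.0, [(20.0, 25.0)]))
```

The design choice illustrated here is that a timeout expiring during an inaccessibility period says nothing about the remote node, so the time the channel was unusable is not charged against it.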

3.2 System Components

The services and mechanisms provided by CANELy can be mapped into components, each of them encompassing several functions and facilities. There are three main components: the CANELy Dependability Engine, the Media Selection Unit and the Inaccessibility Control Unit.

3.2.1 CANELy Dependability Engine

The CANELy Dependability Engine provides support for the higher layers of the CANELy ar-

chitecture, i.e. reliable communication and services. It supports the execution of protocols and

other computational tasks, such as management functions in order to assess network operation

status, and is illustrated in Figure 3.3.

[Figure 3.3: CANELy Dependability Engine interfaces. Labels: Host System; CANELy Dependability Engine; Management Interface; ChTx; ChRx.]

This component communicates with the host system through (buffer) channels, communicat-

ing with other (networked) nodes on the system through a standard CAN channel. It communi-

cates also with the remaining components of a CANELy node through a management interface,

using the information provided by lower level components to aid in higher-layer protocol operation.

3.2.2 Media Selection Unit

The Media Selection Unit component encapsulates most of the Network Dependability ser-

vices and mechanisms, being depicted in Figure 3.4.

[Figure 3.4: Channel redundant media management. Labels: Media Selection Unit; Standard Media Interfaces MTx(1) to MTx(m) and MRx(1) to MRx(m); Management Interface; ChTx; ChRx.]

This component receives the several redundant media, and extracts the unique representation of channel information to be provided to the MAC sublayer, through the bus redundancy management mechanism. It also performs the inverse function: it replicates the channel through all the media. This component’s duties also include the provision of monitoring functions w.r.t. the channel and the several media. This monitoring is done in order to assess the state of the elements participating in the communication and to perform error detection and confinement, e.g. of a failed medium permanently in the dominant state.

This unit communicates with the CANELy Dependability Engine with two purposes: to receive network operation configuration parameters, and to signal the upper layers upon an exceptional event, such as a failed medium. It achieves this communication via a special purpose interface, as shown in Figure 3.4.

3.2.3 Inaccessibility Control Unit

The Inaccessibility Control Unit (ICU) is responsible for the monitoring of the channel w.r.t. events affecting the communication timeliness. Its interfaces are shown in Figure 3.5.

[Figure 3.5: Inaccessibility Control Unit. Labels: Channel Monitoring; Management Interface; ChRx; ChIna.]

The ICU continuously monitors the channel, in order to detect a network inaccessibility event. When one is detected, it may inform the upper layers of the duration of such event, through the assertion of a signal. The contribution of this unit to the CANELy architecture is the assessment of the real duration of network inaccessibility and of how long its effects last in the operation of a CAN-based infrastructure.

3.3 Engineering Constraints

The CANELy architecture is built on top of theoretical and analytical results, making few or no assumptions about the supporting devices or technology. Hence, to be materialized it must face engineering and the constraints it introduces. The constraints are presented through the use of the collapsed three-layer communication stack (see Section 2.3.3, page 16). Figure 3.6 illustrates the interaction of the hardware components.

[Figure 3.6: CANELy engineering model. Labels: Host System; Microcontroller; Field Programmable Gate Array; CAN Single/Dual Channel Interfaces; Management Interfaces; CAN PHY Interfaces.]

The model in Figure 3.6 provides a processing unit for the execution of the higher layers of CANELy, and a PLD, namely an FPGA, for the execution of the monitoring and confinement functions. They interact through two different channels: the standard CAN layer and specialized management interfaces. The Figure makes explicit that the management information is conveyed Out-of-band (OOB) with respect to the CAN channels, and therefore the management functions have no influence whatsoever on CAN bus operation.

3.3.1 CANELy Components

CANELy Dependability Engine

Figure 3.7 shows the embodiment of the CANELy Dependability Engine component.

[Figure 3.7: CANELy Dependability Engine. Labels: Microcontroller; Execution Environment; EEPROM; RAM; Programmable Timers; CAN Controller; Input Channel; Output Channel; System Interface; Physical Layer Interface.]

This component is composed of three main blocks: the CAN Controller, providing an implementation of the standard CAN layer as defined in [6, 5]; the Microcontroller, commanding the CAN controller and providing support for the execution of the communication protocols and advanced services; and the Message Input/Output Channels, providing message buffers which must support a priority-based queuing policy, in order to provide QoS to higher priority messages, e.g. urgent control messages.

While the components need not be integrated in the same integrated circuit, such a feature is desirable to lower component count, lower implementation area and increase overall circuit reliability due to fewer points of failure. This component can then be implemented by resorting to state-of-the-art microcontrollers having at least one standard CAN controller, such as the Maxim/Dallas DS80C390 [39] (Dual-CAN, 8051 architecture) or the Texas Instruments LM3S2965 [40] (Dual-CAN, ARM Cortex-M3 architecture).

CANELy Media Selection and Inaccessibility Control Units

The engineering of the CANELy Media Selection and Inaccessibility Control Units has a strong

hardware design component, since they provide low-level functions. Therefore these must be

mapped into an FPGA device, for cost-effectiveness.

There is no strong restriction pertaining to the mapping of these units into hardware. The only necessary conditions are that the (reconfigurable) hardware providing support for the mechanisms has: enough resources so they can be fitted; enough I/O interfaces for the interconnection of the media, channel and management interfaces; suitable clock management circuitry to aid the interface and bit synchronization with the CAN network.

3.3.2 CAN Data-link Layer

In a CAN network the data-link layer is materialized by the CAN controller. While the CANELy

architecture does not enforce any particular controller device, besides fully supporting the stan-

dard CAN layer, it poses some restrictions regarding its features in order to secure hard real-time

operation attributes, and proper operation of the reliable communication protocols.

The CAN controller device must provide a transmitting message queueing buffer with more

than one slot, with the order of transmission being based on the message identifier - as opposed

to a First-In, First-Out (FIFO) policy. If not, a priority inversion scenario may arise, where a higher

priority message gets delayed due to a lower priority message being held in queue - and possibly

leading to timeliness violations.
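
The priority inversion scenario can be reproduced with a small slot-based simulation. It relies on the CAN rule that the frame with the numerically lowest identifier wins arbitration; the queue model, the identifiers and the function name are illustrative assumptions, and each loop iteration stands for the transmission of one whole frame.

```python
def transmit_slot(target_id, local_queue, policy, rival_id, rival_frames):
    """Return the bus slot in which `target_id` leaves the local node.

    policy "fifo": the oldest queued frame contends for the bus.
    policy "priority": the queued frame with the lowest identifier contends.
    A rival node has `rival_frames` frames with identifier `rival_id` pending.
    """
    queue = list(local_queue)
    slot = 0
    while True:
        candidate = queue[0] if policy == "fifo" else min(queue)
        if rival_frames > 0 and rival_id < candidate:
            rival_frames -= 1          # rival wins arbitration in this slot
        else:
            queue.remove(candidate)    # local frame is transmitted
            if candidate == target_id:
                return slot
        slot += 1


# A low-priority frame (0x400) was queued before an urgent one (0x010),
# while another node keeps sending five frames with identifier 0x200.
queue = [0x400, 0x010]
print(transmit_slot(0x010, queue, "fifo", rival_id=0x200, rival_frames=5))
print(transmit_slot(0x010, queue, "priority", rival_id=0x200, rival_frames=5))
```

With the FIFO policy the urgent frame waits behind a frame that keeps losing arbitration, so its delay depends on traffic of intermediate priority; with identifier-ordered transmission it is sent in the first slot.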

3.3.3 CAN Physical Layer

Following the top-down approach, the last layer remaining is the CAN physical layer. As the

name suggests, this layer is concerned with the physical interconnection of the systems, from the

electrical representation of the CAN bits to the physical attachment of cables to the nodes.

Transceivers

A fundamental piece in a CAN network - or any other computer network - is the transceiver,

which is the device that interfaces the controller with the physical means to convey the signals.

It is the transceiver’s function to convert the information into an electrical representation, suitable

for each one of the domains.

The CANELy architecture was designed without making any assumption regarding the physi-

cal medium used to propagate the CAN signals - at least beyond the quasi-stationary mode and

the wired-AND bus operation. Although optical interfaces exist, we are most concerned with a

more usual medium: twisted-pair cabling.

When using regular twisted-pair cabling the COTS philosophy is kept through usage of com-

mercially available CAN transceivers such as Maxim’s MAX13050 [41] or Microchip’s MCP2551 [42].

There is no need for special (fault-tolerant) transceivers with stuck-at-dominant fault masking since these mechanisms are provided by the Media Selection Unit. Furthermore, the mechanisms built into such transceivers might not even provide adequate protection to real-time systems. Since these devices have no knowledge regarding the network bit rate, they assume a “worst-case scenario”, which is the lowest bit rate possible. Hence, the amount of time these mechanisms require to act is in the order of hundreds of milliseconds, which may be longer than the timeliness requirements of the host.


Connectors

The usage of communication media redundancy requires physical attachment of the several cables to the node, which must be made through connectors. For the particular case of two redundant media, these connectors were made compliant with CAN in Automation (CiA) standards 102 [43] and 303 part 1 [44], which are used by CANopen, thus reusing an already deployed standard. The set of signals used is shown in Figure 3.8.

[Figure 3.8: Extended CiA Connector. Nine-pin connector (pins 1 to 9) carrying: CAN_H and CAN_L (CiA Standard); secondary CAN_H and CAN_L; CAN ground; optional CAN ground; optional CAN shield; CAN_V+ (optional power); reserved (error line).]

The connector supports not only the standard CAN signals (CAN_L and CAN_H), but also the possibility of power transmission, thus making room for applications where no local power supply is available, e.g. intelligent sensors.

This design also supports the connection of a secondary channel, shown in Figure 3.8 as the secondary set of CAN signals. The provision of such a connection enables a solution offering full-space redundancy, through dual-channel/quad-bus operation, enhancing even further the properties related to bus operation dependability and timeliness.

3.4 Summary

The Controller Area Network (CAN) bus is a fieldbus widely acknowledged for its attributes: low cost, robust operation, low complexity and flexibility. Being designed to be used in the automotive industry, it has shortcomings that must be addressed before it can be used to build highly dependable applications. It is in this context that the CANELy architecture was designed, providing the foundations for high dependability applications, such as Distributed Control Systems (DCSs) requiring hard real-time behaviour.

The CANELy architecture achieves the goal of highly dependable CAN-based operation through

a systemic approach to the problem of distributed hard real-time operation. It presents a highly

modular architecture, covering attributes from low-level media redundancy mechanisms to ensure

reliable and available real-time communication, to high-level advanced services common to dis-

tributed systems, such as: group communication, membership and node failure detection, clock

synchronisation.


4 Dependability Enforcement

The price of reliability is the pursuit of the utmost simplicity.
The Emperor’s Old Clothes, C.A.R. HOARE

The systemic approach to the design of dependable distributed systems dictates that full at-

tention must be given also to the network interconnecting the system’s nodes. One of the major

contributions stemming from the CANELy architecture was a formal and analytic model of the

CAN fieldbus operation. This model was a necessary condition for identifying the weaknesses

w.r.t. network dependability, which must be addressed.

The upper layers of CANELy work on the operational assumption that the (physical) network

is reliable to some extent, and mostly free from inconsistent errors, except for errors affecting the

last but one bit of a message. These errors are addressed by the reliable communication protocol

suite.

Although this type of error-free assumption is common in layered architectures due to layer

partitioning, it does not hold true in a standard CAN network. Therefore, the properties pertaining

to network dependability must be secured. This Chapter is then devoted to the problems affect-

ing the CAN fieldbus network dependability, and how to provide effective dependable operation

based on the CANELy architecture formal models and mechanisms, thus providing a dependable

channel for frame transmission and reception.

The concept of dependability is thoroughly discussed in [45]. To achieve high levels of de-

pendability, one must attain also high levels of the attributes encompassed by it. These attributes

are the following:

• availability: readiness for correct service.

• reliability: continuity of correct service.

• safety: absence of catastrophic consequences on the user(s) and the environment.

• integrity: absence of improper system alterations.

• maintainability: ability to undergo modifications and repairs.


The attributes of interest for achieving dependable network operation are availability and re-

liability. To secure reliability and availability in the communication medium one has to resort to

spatial redundancy. As discussed in Chapter 3, one of the most basic strategies of the CANELy

architecture is medium redundancy, i.e. redundancy of the physical medium which conveys the

communication channel used for message diffusion.

This Chapter discusses the reliability and availability of the CAN network, and how they can be secured based on the models offered by the CANELy architecture. Firstly, we present the operational assumptions of CAN operation, and the classes of faults affecting it. Then we describe the failures affecting both CAN and CANELy, and their consequences for correct network operation. Finally, following a bottom-up approach, we discuss the mechanisms that counter the several hindrances to dependable network operation: medium redundancy and its management, error detection and confinement at the physical medium level, and monitoring functions allowing the assessment of the omission degree of a medium. Along the discussion of each of these components we present their analytic foundations, and how such elements can be mapped to innovative structures, suitable to be engineered effectively in a PLD, such as an FPGA.

4.1 Working Model

Before discussing the models, functions and mechanisms that the CANELy architecture specifies to enhance network dependability, we must present the assumptions supporting them. The assumptions listed in Figure 4.1 are based on, and valid for, a network composed of N nodes interconnected by a channel. Each node n ∈ N connects to the channel by a channel transmitter (outgoing bit stream) and a channel receiver (incoming bit stream).

N1 channel redundancy is used, through replicated media (physical and medium layers), but only one MAC sub-layer.

N2 each medium replica is routed differently.

N3 all media are active, meaning every bit issued from the MAC sub-layer is transmitted simultaneously on all media.

N4 there is always a detectable minimum idle period preceding the start of every CAN data or remote frame transmission.

N5 there is a detectable and unique fixed form sequence that identifies the correct reception of a CAN data or remote frame.

N6 there is a detectable bit sequence that identifies the signalling of errors in the CAN bus.

Figure 4.1: CANELy network assumptions

While the first three assumptions (N1, N2 and N3) are related to the physical aspects of the

network deployment and operation, the other three (N4, N5 and N6) are related to CAN network

operation only, specifically correct operation, having been derived directly from the CAN specifi-

cation [6, 5].


Assumption N4 is guaranteed by the inter-frame spacing, intermission period, a sequence

corresponding to the channel being in a recessive state for a duration of at least two (normally

three) bits after the end of a frame, of any type [6, 5].

An illustration of assumption N5 is shown in Figure 4.2, w.r.t. the final part of a CAN message transmission, called End-of-Frame (EOF).

Figure 4.2: CAN message termination sequence (diagram labels: CRC Sequence, CRC Del, ACK Slot, ACK Del, EOF Delimiter, EFS - End of Frame Sequence, bit-stuffing coding; r - recessive, d - dominant)

The unique sequence of assumption N5 is the End of Frame Sequence, which ends a successful transmission of a data/remote frame. Assumption N6 is guaranteed by the CAN error flag (also called error frame), which is composed of a sequence violating the bit-stuffing coding of CAN. The bit-stuffing coding defines that the maximum length of a bit sequence having identical polarity is five bits, with the exception of the EOF Sequence.
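As an illustration of assumption N6, the bit-stuffing rule can be checked with a few lines of code. The following Python sketch is a behavioural model only and is not part of CANELy; the function name `find_stuff_violation` and the 'r'/'d' string encoding are assumptions made for this example, and the stream is assumed to exclude the EOF Sequence, which is exempt from the rule.

```python
def find_stuff_violation(bits, max_run=5):
    """Return the index at which a run of identical bits first exceeds max_run.

    bits: iterable of 'r' (recessive) / 'd' (dominant) symbols, one per bit time.
    Returns None when no run is longer than max_run.
    """
    run_symbol, run_length = None, 0
    for index, symbol in enumerate(bits):
        if symbol == run_symbol:
            run_length += 1                      # same polarity: the run grows
        else:
            run_symbol, run_length = symbol, 1   # polarity change: new run
        if run_length > max_run:
            return index                         # sixth identical bit in a row
    return None
```

A stream containing six consecutive dominant bits, as produced by an error flag, is reported at its sixth bit, while a run of exactly five bits is accepted.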

Finally, a useful construct is the normalisation of the CAN network bit rate, yielding the bit time,

Tbit . This unit of measurement is extremely useful, since it allows the analytic expressions to hold

no connection to the bus bit rate, thus allowing a more general analysis.
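A minimal Python sketch of this normalisation follows. The helper name `bit_times_to_seconds` is hypothetical, and the bit rates mentioned in the usage note are illustrative values rather than figures taken from the text.

```python
def bit_times_to_seconds(n_bits, bit_rate_bps):
    """Convert a duration expressed in bit times (multiples of Tbit) to seconds."""
    if bit_rate_bps <= 0:
        raise ValueError("bit rate must be positive")
    t_bit = 1.0 / bit_rate_bps      # duration of one bit time, in seconds
    return n_bits * t_bit
```

For instance, a duration of three bit times corresponds to about 3 µs at 1 Mbit/s and to about 24 µs at 125 kbit/s, while its normalised value stays the same.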

4.1.1 CAN Physical Layer Fault-Tolerance

The CAN transmission medium is usually a two-wire differential line. The CAN physical layer

specified in [6] allows resilience against some of the transmission medium failures illustrated in

Figure 4.3, by switching from the normal two-wire differential operation to a single-wire mode.

After mode switch-over bus operation is allowed to proceed, though with a reduced Signal-to-

Noise Ratio (SNR), in the presence of one of the following failures:

• one-wire interruption (A or B failures, in Figure 4.3);

• one-wire short-circuit either to ground (C or D) or power (E or F);

• two-wire short-circuit (G).

There are commercially available transceivers claimed to be fully compliant with the ISO-11898

standard [41, 42], i.e. they switch to single-wire operation upon the detection of any of these

failures, switching back to two-wire differential mode upon recovery.

The CAN standard provides coverage for the shorting failure modes (C to E in Figure 4.3).

Resilience to the failure of one termination (failure H, Figure 4.3) implies that extra time is needed for bus signal stabilisation, and can be overcome by adjusting the (local controller) parameter of propagation time segment [6], thus delaying the bit-sampling of the network.

Figure 4.3: CAN physical layer faults (diagram labels: two nodes with CAN interfaces, CAN_H, CAN_L, failure points A to H)

There is no standardised mechanism for providing resilience against the simultaneous inter-

ruption of both bus wires (A and B failures, in Figure 4.3). Upon such a failure, the network will be

partitioned, with each partition containing a subset of the N nodes.

We are interested in tolerating partitioning faults. Most of them are beyond our control, since

they involve physical damage to the network infrastructure. Nonetheless, they can be tolerated

by resorting to media redundancy, especially if the several media are routed through different

physical paths, in order to avoid a sort of physical “common-mode” faults.

4.1.2 CANELy Approach to Network Dependability

The fault model considered by the CANELy architecture complements and extends the one

defined by CAN, as a consequence of both the non-invasive and COTS approach. Therefore, the

CANELy architecture not only supports the CAN fault model - provided by the CAN controller and

transceivers - but widens the scope of the model, thus increasing the fault coverage.

One of the basic strategies in the CANELy architecture is the utilisation of media redundancy.

Although this strategy enhances the network availability and reliability, it also brings new problems

w.r.t. error-free bus operation. A set of errors affecting a media redundant network, e.g. a CANELy

network, is presented in Figure 4.4.

The network of Figure 4.4 is composed of two bus media replicas, P and S. The errors affecting it can be mapped to two classes: common-mode (error A), affecting all media; single-medium (errors B, C and D), affecting just one medium. The effects of these medium errors on the channel depend on their type, and will be addressed in Section 4.5.3.

Since the CANELy enhancements are transparent to the standard CAN layer, they can be

used to provide the node with fault confinement on both outgoing and incoming bit streams. This

allows local node fault confinement in a more effective manner, especially w.r.t. permanent faults that disrupt the network operation, and without resorting to specially enhanced COTS transceivers with fault confinement mechanisms.

Figure 4.4: Errors affecting a dual-media CAN network (panels: A - common mode errors; B - single-medium (d ⇝ r); C - single-medium (r ⇝ d); D - single-medium (both))

Partitioning

A class of faults originating from the physical medium is partitioning. Physical partitioning

happens whenever bus cabling interruption occurs, leaving the network with at least two physical

partitions (see Figure 4.5). The aim of media redundancy, and therefore redundancy management

is to mask this event, presenting a correct view of the network to the upper layers, even in the

presence of network errors, as specified in assumption N1.

Figure 4.5: Media-redundant network physical partition (diagram labels: P-Bus, S-Bus, Transmitting Node, Bus Medium Interruption, Partition, incorrect value)

This network partitioning phenomenon, however, should be explored further. A network disturbance might last only one Tbit, due to transient errors such as EMI or a loose network connector, or might last longer, as with a crushed cable. The latter should be dealt with by the upper layers, through specific algorithms for the assessment of the affected network segments. The former might lead to another, subtle form of partitioning that assumes a virtual form, in the sense that there is no interruption in the physical path of the communication channel. This form of partitioning, however, still has an impact on the provision of communication, affecting its timeliness, and will be a central theme in Chapter 5.

4.1.3 Fault classes

Correct CAN bus operation is mainly disturbed by errors originating from two classes of faults: stuck-at, where the affected components¹ experience the same logic level for an abnormal period of time w.r.t. CAN protocol operation; “omissions”, which are experienced by a component that does not receive service, usually due to errors.

Stuck-at Faults

The class of stuck-at faults pertains to the physical layer, affecting the bus media. This class

of faults has two elements in a CAN network: “stuck-at-dominant”, where the value present in the

bus has the dominant level; “stuck-at-recessive” where the level is recessive. Given the wired-

AND operation of CAN, a stuck-at-recessive fault usually indicates physical disconnection of the

medium, be it a faulty connector or a partitioning failure. Also due to the same operation mode,

the “stuck-at-dominant” faults are extremely disruptive, since they inhibit the communication.

Therefore, the CANELy architecture must tolerate this class of faults, allowing the provision of

service even in the presence of such faults.

Omission Faults

There is another class of faults that can be triggered by the former class: omissions. An omission occurs whenever a component fails to receive a message², and can have its origin in several factors, ranging from faulty circuitry and damaged network media to electromagnetic interference or other transient accidental faults affecting any network communication component.

These faults can also be very harmful for a system, especially a control system. For example,

if the actuator driving the brakes in a car-braking system does not receive the message with the

braking order from the controller, i.e. suffers an omission, it will not brake and the result can be

catastrophic. This omission might be caused by EMI emanating from another component, such as

the car’s alternator.

This class of faults affects the data-link layer, specifically the LLC sub-layer, since it is this

layer that is responsible for ensuring the correct communication of messages. This class of faults

must also be tolerated, and the CANELy architecture provides means to either tolerate them, or

at least perceive them and ultimately declare the involved media failed, thus allowing safe system

operation.

The assessment of whether a medium should be declared failed is based on the bus medium omission degree, which represents the number of consecutive omissions in a given interval of time. A medium violating its omission degree bound should be declared failed, its contribution to the channel disabled and the upper layers notified, so that adequate measures can be undertaken.

¹ A component might be: the physical network medium, the transceiver or the CAN controller, both on transmission and reception.

² In the CAN context, frame and message are interchangeable.
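The omission degree assessment described above can be sketched as a simple counter. The Python model below is an illustration under assumptions: the class name, the `bound` parameter and the per-transfer `observe` call are hypothetical, and the interval of time over which omissions are counted is not modelled.

```python
class OmissionDegreeMonitor:
    """Tracks consecutive omissions on one medium and flags it failed past a bound."""

    def __init__(self, bound):
        self.bound = bound        # hypothetical omission degree bound
        self.consecutive = 0      # current run of consecutive omissions
        self.failed = False       # latched once the bound is violated

    def observe(self, omission):
        """Record one transfer outcome; omission=True means the frame was not received."""
        if omission:
            self.consecutive += 1
        else:
            self.consecutive = 0  # a successful transfer ends the run
        if self.consecutive > self.bound:
            self.failed = True    # medium declared failed; stays failed
        return self.failed
```

Latching the failed state is a design choice of this sketch: once declared failed, the medium stays disabled until some recovery action, which is outside the scope of the example, clears it.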


4.2 Physical Network Availability and Reliability

Given our bottom-up approach, the first class of issues affecting network dependability is

related to the availability of the network. In order to enhance the availability w.r.t. standard CAN,

redundancy must be provided. The CANELy architecture contemplates bus media redundancy,

through a set of replicated media conveying the channel. This strategy, however, does not pre-

clude a full-space redundant architecture with replicated channels.

The utilisation of bus media redundancy, however, poses several new challenges. Questions

such as “How many redundant paths should be used?” and “How to manage the redundant paths,

recovering a coherent view of the communication channel?” are extremely pertinent, and should

be addressed.

4.2.1 Media Redundancy Provision and Management

The first step w.r.t. redundancy provision has already been taken, in the form of assumption N1, which states explicitly that media redundancy should be used. Given the several media conveying the CAN bus signal, there is the need to extract a unique representation of the channel from all the media. Therefore, some means of redundancy management is needed.

In CANELy, media redundancy management is solved by an ingenious mechanism exploiting: the wired-AND nature of the CAN PHY layer; the quasi-stationary bus operation, ensuring (almost) simultaneous bus bit sampling by all correct nodes in the network. This mechanism gathers all the signals received by each medium interface into a single representation through an AND function, before being interfaced with the MAC sub-layer. Dubbed the Columbus’ egg strategy due to its simple (in hindsight) nature, it allows a single incoming bit stream representing the channel, ChRx, to be formed. Its structure and relation with the standard CAN layer and physical media is depicted in Figure 4.6.

Figure 4.6: Columbus’ Egg strategy block diagram (diagram labels: CAN Controller, Channel Interface with ChTx and ChRx, one Medium Interface per bus with MTx(P)/MRx(P) and MTx(S)/MRx(S), P-Bus, S-Bus)

The AND function is used to gather the incoming media, right after the transceivers. The result

is then provided to the standard CAN controller. Although there are only two media depicted, this

approach is valid for any number of media. The integration of this strategy into more complex

models requires that it must be defined formally, yielding the following expression:

$$Ch_{Rx} = \prod_{m \in \mathcal{M}} M_{Rx}(m) \tag{4.1}$$

where: the symbol $\prod$ is used to denote the logical AND function; $\mathcal{M}$ is the set of medium interfaces. For example, in the dual-media architecture of Figure 4.6, $\mathcal{M} = \{P, S\}$.
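Equation (4.1) can be exercised with a small behavioural model. The Python sketch below assumes the mapping of recessive to logical '1' and dominant to logical '0'; the function name `channel_rx` and the dictionary keyed by medium name are assumptions made for illustration.

```python
RECESSIVE, DOMINANT = 1, 0   # CAN bus levels expressed as logic values


def channel_rx(media_rx):
    """Equation (4.1): logical AND of the bits received on all medium interfaces.

    media_rx: dict mapping a medium name (e.g. "P", "S") to its received bit.
    """
    all_recessive = all(bit == RECESSIVE for bit in media_rx.values())
    return RECESSIVE if all_recessive else DOMINANT
```

With this rule, a dominant bit received on any single medium reaches the channel, so a medium that wrongly stays recessive (e.g. a disconnected one) cannot hide traffic seen on the other medium.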

The materialisation of this mechanism can be effectively described in VHDL, through the exploitation of the language’s properties, such as vector attributes [46]. Such a description is illustrated in Figure 4.7.

```vhdl
-- MediumRX : Vector aggregating the several media
-- ChRx     : Channel incoming (Rx) bit stream, 1 if all media are logical '1', else '0'.
-- In CAN: logical '1' = recessive (r)
--         logical '0' = dominant (d)

ChRx <= '1' when MediumRX = (MediumRX'range => '1')
        else '0';
```

Figure 4.7: AND-based Media Selection description in VHDL

For the sake of clarity, from here on all signals written in a typewriter font refer to implementation (hardware description) signals, e.g. `ChRx`, while signals written in italic refer to analytic signals, e.g. $Ch_{Rx}$.

The `ChRx` signal, in Figure 4.7, is the mapping of the receiving component of the communication channel, $Ch_{Rx}$, recovered from the redundant bus media. The description takes a different approach to the AND function we wish to describe. Reasoning about the AND function, its behaviour can be described as: the output of an AND function is True iff all the inputs are True. This is the approach followed in the description, where we want to assess whether all the inputs, $M_{Rx}(m)$, are True, i.e. at the logical level ’1’, and express the evaluation through `ChRx`. In order to do so, the inputs, $M_{Rx}(m)$, are mapped to a vector (MediumRX in Figure 4.7) and compared bit-by-bit with a construct that has the same size and bit order (range) as MediumRX, and has all elements at logical ’1’. The result of this comparison is output through the signal `ChRx`, being logical ’1’ only when all the media are at logical ’1’.

The VHDL description in Figure 4.7 has the advantage of simplicity and clarity over other

descriptions, such as the ones using variables and loop unrolling. Those advantages can be

extended to the implementation, in the PLD. These, however, are extremely dependent on the

synthesis tools used.

Having recovered the incoming bit stream representing the channel, ChRx, we can not only pass it to the upper layers, namely the standard CAN layer, but also perform monitoring actions based on its information, and enhance the bus media redundancy management, e.g. implement fault detection and confinement functions for each independent medium, thus increasing the availability and reliability of the communication network.

4.2.2 Stuck-at-dominant Fault Handling

A failed medium stuck in a dominant state must be dealt with through confinement mechanisms,

due to its disruptive nature w.r.t. correct network operation. Therefore, the failed medium must be

detected and its contribution to the recovery of the channel disabled, i.e. equation (4.1) must be

complemented. This can be achieved easily, through the exploitation of the AND-function neutral

element - the logic value ’1’.

Based on the CAN standard [6, 5], the CANELy architecture defined the minimum length of

consecutive dominant bits present on a medium before declaring it “stuck-at-dominant”. This

length, formally defined lstk←dm can be used to assess if a given medium is indeed at such a

state. This length can be expressed as an amount of time by:

$$T_{stuck \leftarrow dm} = \left[ 2 \cdot l_{stk\_d} + (l_{stk\_d} + 1) \cdot err_{stuck \leftarrow rx}(bus) \right] \cdot T_{bit} \tag{4.2}$$

where: Tstuck←dm is the minimum (normalised) time for the channel to be considered stuck-at-

dominant; lstk_d represents the length of the sequence of consecutive dominant bits, tolerated

by the CAN fault confinement mechanisms upon the transmission of an active error flag [6, 5];

errstuck←rx(bus) is a parameter for allowing a tolerance margin in the violation of the active error

flag tolerance, which must be a positive integer; Tbit represents one bit-time. The sequence

defined by lstk_d is composed by 7 consecutive dominant bits. Regarding the tolerance margin, its

value must obey the following relation: 1 ≤ errstuck←rx(bus) ≤ 10, where the upper bound defines

a Tstuck←dm which is equivalent to native CAN fault confinement mechanisms.
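As a worked instance of equation (4.2), the threshold can be computed directly in units of Tbit. The Python sketch below is illustrative only; the function and constant names are assumptions, while the value of 7 dominant bits and the bounds on the tolerance margin are the ones stated in the text.

```python
L_STK_D = 7   # consecutive dominant bits tolerated after an active error flag


def stuck_dominant_bit_times(err_margin, l_stk_d=L_STK_D):
    """Equation (4.2) normalised by Tbit: detection threshold in bit times."""
    if not 1 <= err_margin <= 10:
        raise ValueError("tolerance margin must satisfy 1 <= err <= 10")
    return 2 * l_stk_d + (l_stk_d + 1) * err_margin
```

With a tolerance margin of 2 the threshold is 2 · 7 + 8 · 2 = 30 bit times; the bounds of the margin, 1 and 10, give 22 and 94 bit times respectively.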

Upon the detection of a stuck-at-dominant condition, an indication of Medium m failure is

provided:

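$$M_{stk-d}(m) \mapsto \begin{cases} \text{true} & \text{if } T(M_{Rx}(m) = d) > T_{stuck \leftarrow dm} \\ \text{false} & \text{when } T(M_{Rx}(m) = d) \leq T_{stuck \leftarrow dm} \lor M_{Rx}(m) = r \end{cases} \tag{4.3}$$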

where: $T(M_{Rx}(m) = d)$ represents the normalised time elapsed since Medium m entered a dominant state. If it exceeds $T_{stuck \leftarrow dm}$, the medium is declared failed. The signal $M_{stk-d}$ can be used to command the disabling of the affected medium, thus confining the fault. Equation (4.1) is then extended to accommodate the medium disabling function:

$$Ch_{Rx} = \prod_{m \in \mathcal{M}} \left( M_{Rx}(m) + M_{dis}(m) \right) \tag{4.4}$$

where: $M_{dis}(m)$, the medium disabling signal, can be derived directly from $M_{stk-d}(m)$, i.e. $M_{dis}(m) = M_{stk-d}(m)$; the symbol “+” denotes the OR function. With this expression we can build a self-contained block that: merges the channel information received by all media in a single entity, $Ch_{Rx}$; provides error confinement for any medium suffering a stuck-at-dominant fault.
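The interplay of equations (4.3) and (4.4) can be illustrated with a behavioural model advanced once per bit time. The Python sketch below is an assumption-laden illustration: the class and function names are hypothetical, the threshold is expressed in bit times, and the flag is evaluated on the same bit that is being sampled.

```python
RECESSIVE, DOMINANT = 1, 0   # CAN bus levels expressed as logic values


class StuckDominantMonitor:
    """Per-medium model of equation (4.3), advanced once per bit time."""

    def __init__(self, threshold_bits):
        self.threshold_bits = threshold_bits   # stuck-at-dominant threshold, in bit times
        self.dominant_bits = 0                 # bit times spent continuously dominant

    def clock(self, medium_bit):
        """Sample one bit from the transceiver and return the stuck-at-dominant flag."""
        if medium_bit == DOMINANT:
            self.dominant_bits += 1
        else:
            self.dominant_bits = 0             # a recessive bit clears the condition
        return self.dominant_bits > self.threshold_bits


def channel_rx(media_rx, media_dis):
    """Equation (4.4): AND over all media of (received bit OR disable signal)."""
    return int(all(media_rx[m] | media_dis[m] for m in media_rx))
```

Once the flag of a medium is raised and used as its disable signal, that medium contributes the neutral value '1' to the AND, so the channel follows the remaining media.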

Having already described the AND function (see Figure 4.7), is it possible to just add the masking component provided by the Mdis signal and OR function, without having to redesign it from scratch? The answer to this question is affirmative, and can be achieved through the exploitation of VHDL loop constructs, which provide the means for iteration³ over blocks of statements, as illustrated in Figure 4.8:

```vhdl
-- MediumRXtr : Vector aggregating the several media, from the transceivers
-- Mdis       : Vector aggregating the several Medium Disable signals
-- MediumRX   : Vector aggregating the several media, to be used in AND function

procMediumRXOR : process (MediumRXtr, Mdis) is
begin
  for m in 1 to NumberMedia loop               -- Iterate over each media m
    MediumRX(m) <= MediumRXtr(m) or Mdis(m);   -- executing the OR function
  end loop; -- m
end process procMediumRXOR;
```

Figure 4.8: Medium Disable Receive description in VHDL

The strategy followed is: to iterate over all the media, resorting to the variable m, and for

each medium execute the OR function, between the information received from the physical layer

(MediumRXtr, mapping of MRx(m) in equation (4.4)) and the information generated resorting to

equation (4.3) (Mdis). The result is a vector with the masked incoming bit stream (MediumRX). The

vector MediumRX can then be used in the VHDL description of Figure 4.7. The problems involving

iteration over vectors are a natural application of loop constructs, being the most efficient and

generic manner to accomplish that type of operations.

The generation of the information regarding signal Mdis can be thought of as a watchdog timer,

which expires after some time - given by equation (4.2) - if no recessive bit is detected in the

channel. This idea, however, can be extended even further by considering the quasi-stationary

operation mode of the CAN network: this problem can be expressed as a sequence detection

problem, where these signals can be mapped to a certain sequence, which can occur in the

network, hence able to be detected.

³ Although this is called iteration, what really happens is parallelisation, since the structures we are describing are hardware.

4.3 CAN Bit-Sequence Detection

The bit-serial nature of the CAN protocol operation permits the assessment of correct be-

haviour in a practical manner, through the detection of certain sequences occurring in the network

components, be it the channel or the several media replicas. The quasi-stationary network oper-

ation of CAN can (and should) be further exploited, to allow on-line CAN protocol processing and

evaluation, masking errors and notifying the relevant layers of any abnormal event pertaining to

correct system operation. Therefore, the monitoring functions must be transformed into sequence

detection functions when mapping them into VHDL.

A problem still remains, however: how effective and flexible can and should the mechanisms

providing these sequence detection functions be? A “brute force” approach would either involve

a “sliding-window”-based technique, comparing the entire sequence at once (see Figure 4.9), or

specially and individually crafted structures to handle each and every function requiring sequence

detection. This “ad-hoc” approach has several drawbacks: it implies a lack of flexibility, since each function would be designed individually, and high maintenance costs in the future, since there would be no common design basis. This type of design also presents impairments w.r.t. engineering, since it needs scarce PLD resources.

Figure 4.9: Sliding Window sequence detection (diagram labels: Sequence, Bit stream)
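The sliding-window technique can be modelled in software as a shift register that is compared in full against the target sequence at every bit time. The Python sketch below is a behavioural illustration, not the hardware structure itself; the function name and the use of symbol strings are assumptions.

```python
from collections import deque


def sliding_window_detect(stream, sequence):
    """Yield, for each incoming bit, whether the newest bits equal `sequence`."""
    window = deque(maxlen=len(sequence))   # shift register holding the newest bits
    target = list(sequence)
    for bit in stream:
        window.append(bit)                 # the oldest bit falls out automatically
        yield list(window) == target       # whole-sequence comparison every bit time
```

The model makes the cost of the approach visible: the window holds as many elements as the sequence has bits, and all of them take part in every comparison.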

To avoid all these drawbacks, these structures should be made as flexible and efficient as possible. After careful consideration, it was noted that most monitoring functions can be mapped into fixed-length, deterministic sequence detection problems⁴, sharing many similarities among them. The most relevant characteristics of the sequences are:

• possibly long, with some going up to 96 bit in length;

• unique, although some are sub-sequences of others.

These observations make the sequence detection problem amenable to a generic approach,

thus providing component reuse and even compositionality, w.r.t. sub-sequences.

Under this perspective, we can easily map the analytic expressions defined by the CANELy architecture into sequences of bits, which must occur in the network, either at the channel level or at the individual medium level. For example, the sequence in equation (4.2) can be detected through the sequence dddddddddddddddddddddddddddddd, composed of 30 bits, given a value of lstk_d = 7 and errstuck←rx(bus) = 2, the latter being an acceptable tolerance margin [11]. The CAN bus levels can be mapped into binary levels: recessive (r) is equivalent to a logical ’1’; dominant (d) is equivalent to a logical ’0’. Therefore, the previous sequence mapping yields the binary sequence 000000000000000000000000000000.

⁴ Under a slightly different perspective, these sequences can be seen as strings, formed from an alphabet Σ = {0, 1}. This would allow the description of the sequences as regular expressions.

Although the detection of sequences is fundamental, it is not sufficient on its own. Most of the monitoring functions, for either channel or medium, require actions to be performed upon sequence detection, e.g. signal latching, pulsing, or negation upon the assertion of another signal.

Hence, additional machinery is necessary to satisfy these requirements. The concept joining these two elements is illustrated in Figure 4.10.

Figure 4.10: Signal assertion machinery (diagram labels: Bit stream, Sequence Detector, SequenceOk, Complementary Logic, Additional Signals, Signal Assertion)

Each sequence detection block is composed of two fundamental elements: Sequence detector, which asserts the presence of the sequence of interest in the network; Complementary logic, which handles the integration of the sequence detection mechanism with other signals, in order to provide more complex actions. These actions include: signal pulsing, e.g. signals only active for one Tbit; signal latching; signal composition for signals triggering further detections or actions.
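One of the complementary actions listed above, signal latching, can be sketched as follows. This Python model is an illustration under assumptions: the class name, the `clear` input and the priority of clearing over setting are hypothetical choices, not taken from the CANELy specification.

```python
class LatchedSignal:
    """Holds a detection: set by a one-bit-time pulse, released by a clear input."""

    def __init__(self):
        self.value = False

    def clock(self, sequence_ok, clear=False):
        """Advance one bit time and return the latched value."""
        if clear:
            self.value = False     # assumed priority: clear wins over set
        elif sequence_ok:
            self.value = True      # pulse from the sequence detector is captured
        return self.value
```

The latch turns the one-Tbit pulse of the sequence detector into a level that remains asserted until another signal, here `clear`, negates it.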

The mapping of the sequence detection function into VHDL yields the description in Fig-

ure 4.11. This description allows a flexible approach, with the sequence representing the monitor-

ing function being specified through VHDL parametrised constructs upon component instantiation,

thus using the same building block for most of the monitoring functions.

The sequence detection machinery of Figure 4.11 is synchronised with the system-wide clock signal, sys_clk, with the actions pertaining to bit comparison synchronised with CAN bit timing through can_clk_en, which provides the equivalent to Tbit. Hence, the sequence matching is made “on-line”, i.e. on a bit-by-bit basis and synchronised with the CAN network.

The sequence to be detected is mapped into a Read-Only Memory (ROM), provided by

sequence_rom, which is parametrised at design time with the desired sequence. The size of

the ROM for a sequence having a length of n bit is n × 1 bit, i.e. it stores the n bits composing the

sequence, the output being 1 bit wide.

The sequence detection is done through comparison between the bit being output by the ROM, sequence_rom(cnt), and the bit from the incoming bit stream, data. If the two bits are equal, the value addressing the ROM is incremented, in order to test the next bit. Once this value equals the length of the sequence, the detection has successfully ended, and the Sequence_Ok signal (Figure 4.11) is asserted for one Tbit, being deasserted afterwards. Upon a failed bit matching, the value addressing the ROM is reset, restarting the sequence detection. The assertion of Sequence_Ok can be used to assert the signal pertaining to the detected sequence, possibly based on additional signals (see Figure 4.10).

```vhdl
-- ROM Addressing register
-- Stores the value addressing the ROM holding the sequence to be detected
pSDetection : process (sys_clk)        -- FPGA System Clock
begin
  if rising_edge(sys_clk) then
    if rst_N = '0' then                -- Synchronous reset, for safety purposes
      cnt <= 0;
    else
      if can_clk_en = '1' then         -- CAN clock enable, for network synch
        cnt <= cnt_aux;                -- Store the ROM address value
      end if;
    end if;
  end if;
end process pSDetection;

-- Decision logic
-- Progress with the ROM addressing, while input matches sequence
-- Reset count upon either failed match or when reaching end of sequence

cnt_aux <= (cnt + 1) when data = sequence_rom(cnt) and cnt /= sequence'length
           else 0;

-- Output logic
-- Output logical '1' upon successful sequence detection, logical '0' otherwise

Sequence_Ok <= '1' when cnt = sequence'length
               else '0';
```

Figure 4.11: Sequence detector description in VHDL
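The behaviour of the detector of Figure 4.11 can be reproduced with a short software model, which is convenient for checking sequences before synthesis. The Python sketch below is an assumption-based illustration: the class name is hypothetical, one call to `clock` stands for one bit time with can_clk_en asserted, and the end-of-sequence test is evaluated before the ROM is indexed so that the model never reads past the stored sequence.

```python
class RomSequenceDetector:
    """Behavioural model of the counter-plus-ROM detector, one call per CAN bit time."""

    def __init__(self, sequence):
        self.rom = list(sequence)   # plays the role of sequence_rom
        self.cnt = 0                # ROM address register

    @property
    def sequence_ok(self):
        # Output logic: asserted while the address equals the sequence length
        return self.cnt == len(self.rom)

    def clock(self, data):
        """Advance one bit time with incoming bit `data`; return Sequence_Ok."""
        if self.cnt != len(self.rom) and data == self.rom[self.cnt]:
            cnt_aux = self.cnt + 1  # match: address the next ROM bit
        else:
            cnt_aux = 0             # mismatch or end of sequence: restart
        self.cnt = cnt_aux
        return self.sequence_ok
```

Feeding the model one symbol per call shows the output asserted for exactly one bit time after the last bit of the sequence, and deasserted on the following bit.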

The objective of the sequence detector VHDL description is two-fold: it intends to be generic, avoiding the description of dedicated Finite State Machines (FSMs) for each monitoring function; it intends to be resource-effective, consuming as few resources as possible from an FPGA device, since they are finite. Hence, the utilisation of a ROM, the most abundant resource in an FPGA⁵, for storing the bit sequence is crucial, thus saving other scarcer memory elements such as flip-flops.

Lastly, the description of Figure 4.11 has a shortcoming: it can fail to detect a sequence when a (starting) sub-sequence of that sequence immediately precedes it in the bit stream. For example, if we wished to detect the sequence rdrrr inside the larger sequence rdrdrrr using this strategy, it would not be possible with such machinery. This limitation stems from the simple sequence detection restart machinery, which does not account (on purpose) for these sub-sequences, thus making the machinery implementation area smaller.

⁵ In fact, the most abundant resource in an FPGA is Random-Access Memory (RAM), under the name of Look-up Table (LUT). The ROM, however, is implemented with resort to LUT elements, but with content alteration inhibited.


This type of sequence, however, occurs only once in the CANELy monitoring functions and, being short, can be implemented with resort to a method similar to the one in Figure 4.9, without a great impact on the occupied area. This different implementation can be made transparent w.r.t. component instantiation, through the abstraction constructs of Very High Speed Integrated Circuit Hardware Description Language (VHDL), which allow the same entity (component) to have multiple implementations (architectures). Despite this shortcoming, the ROM-based approach to the sequence detection problem is still the most effective for the CANELy monitoring functions, due to the possibly long sequences to be detected and the absence of sub-sequences, save for one.
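To make the sub-sequence issue concrete, the following is a minimal sketch of a dedicated state machine for the rdrrr example, assuming recessive maps to '1' and dominant to '0' (as in the sequence constants of Figure 4.16). It is not the method of Figure 4.9, which is not reproduced here: the entity name rdrrr_detector and the signal matched are hypothetical, and the ports merely mirror those used when instantiating the sequence detector in Figure 4.12. On a mismatch the machine falls back to the longest already-matched prefix instead of restarting from zero, which is what the ROM-based restart logic deliberately omits.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stand-alone detector for the sequence r d r r r ("10111").
entity rdrrr_detector is
  port (
    sys_clk     : in  std_ulogic;
    rst_N       : in  std_ulogic;   -- synchronous reset, active low
    can_clk_en  : in  std_ulogic;   -- assumed: one enable pulse per bit time
    data        : in  std_ulogic;   -- '1' = recessive, '0' = dominant
    Sequence_Ok : out std_ulogic);
end entity rdrrr_detector;

architecture FSM of rdrrr_detector is
  -- Number of leading pattern bits matched so far
  signal matched : natural range 0 to 5 := 0;
begin
  pDetect : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        matched <= 0;
      elsif can_clk_en = '1' then
        case matched is
          when 0 =>  -- nothing matched yet
            if data = '1' then matched <= 1; else matched <= 0; end if;
          when 1 =>  -- "1": a further '1' is still a valid start
            if data = '0' then matched <= 2; else matched <= 1; end if;
          when 2 =>  -- "10"
            if data = '1' then matched <= 3; else matched <= 0; end if;
          when 3 =>  -- "101": on '0' the suffix "10" is kept
            if data = '1' then matched <= 4; else matched <= 2; end if;
          when 4 =>  -- "1011": on '0' the suffix "10" is kept
            if data = '1' then matched <= 5; else matched <= 2; end if;
          when 5 =>  -- detected; the last '1' may start a new match
            if data = '1' then matched <= 1; else matched <= 2; end if;
        end case;
      end if;
    end if;
  end process pDetect;

  -- Asserted for one bit time after the fifth matching bit
  Sequence_Ok <= '1' when matched = 5 else '0';
end architecture FSM;
```

Feeding the stream rdrdrrr into this sketch takes matched through 1, 2, 3, 2, 3, 4, 5, so the sequence is detected, at the cost of one explicit transition per state and input value.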

4.4 Channel Monitoring

The detection of errors requires constant monitoring of the bus media and channel. This

monitoring serves the purpose of assessing the communication state, through the detection of

certain sequences.

Basic Channel Monitoring

The most basic set of monitoring signals concerning CAN network operation stems from the assumptions N4, N5 and N6. Through the assertion of correct channel behaviour we can implicitly assess the correct CAN operation. Furthermore, from this information more complex monitoring functions can be constructed.

The extensive monitoring mechanisms defined by CANELy depend on a basic set of channel

status signals, pertaining to basic operating mechanisms of the network. These signals are:

End-of-Transmission, meaning the successful transmission of a CAN frame, and the bus being

available for another transmission; Frame Correct, meaning the correct reception of a CAN frame;

and Error, meaning that an error flag has been detected on the channel.

The End-of-Transmission (EOT) signal definition embodies assumption N4, being asserted when a frame has been successfully transmitted, and the bus is available for another frame transmission, i.e. the minimum intermission period time has elapsed. Its formal definition is

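\[
Ch_{EOT} \mapsto
\begin{cases}
\text{true}  & \text{if } T(Ch_{Rx} = r) \geq T_L \\
\text{false} & \text{if } T(Ch_{Rx} = r) < T_L \,\lor\, Ch_{Rx} = d
\end{cases}
\tag{4.5}
\]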

where: ChEOT represents the EOT signal; T(ChRx = r) represents the normalised time elapsed since the channel receive signal, ChRx, is in a recessive state; and TL is the minimum normalised time for the bus being idle before starting a new transmission. This time takes into account that a transmission may start at the last bit of intermission. The mapping of this signal into VHDL is illustrated in Figure 4.12.


-----------------------------------------------------------------------------
-- EOT Sequence Detection - Sequence detection
-----------------------------------------------------------------------------
instSequenceDetectorEOT : entity work.sequence_detector(ROM_MEM)
  generic map (
    sequence => seq_eot)
  port map (
    sys_clk     => sys_clk,
    rst_N       => rst_N,
    can_clk_en  => can_clk_en,
    data        => ChRx,
    Sequence_Ok => eot_sequence);

-----------------------------------------------------------------------------
-- EOT flag assertion - Complementary logic
-----------------------------------------------------------------------------
-- purpose: Assert / de-assert the EOT flag
-- inputs : sys_clk, rst_N, can_clk_en, eot_sequence, ChRx
-- outputs: ChEOT
pEOT : process (sys_clk) is
begin  -- process pEOT
  if rising_edge(sys_clk) then          -- rising clock edge
    if rst_N = '0' then                 -- synchronous reset (active low)
      eot_int <= '0';
    else
      if can_clk_en = '1' then
        -- Signal not asserted yet, so assert it
        if eot_sequence = '1' and ChRx /= '0' then
          eot_int <= '1';
        -- Signal asserted and dominant bit of SOF detected at ChRx, de-assert
        elsif eot_int = '1' and ChRx = '0' then
          eot_int <= '0';
        end if;
      end if;
    end if;
  end if;
end process pEOT;

ChEOT <= eot_int;  -- Output the ChEOT signal

Figure 4.12: ChEOT signal description in VHDL

The first part of the listing in Figure 4.12 is the instantiation of the Sequence Detector component with a constant, seq_eot, which holds the sequence to be detected, EOT in this case. The second part is the Complementary Logic (see Figure 4.10). After the detection of the sequence, the Complementary Logic block asserts the eot_int internal signal, which remains asserted until the detection of a dominant bit in the channel, ChRx. The ChEOT output is driven from eot_int, which is used purely due to VHDL restrictions regarding reading the value of output signals.

The mapping of the remaining monitoring functions involving sequence detection follows this

design philosophy, thus allowing a clean, reusable and maintainable strategy for the mapping of

these monitoring functions from the analytic models into hardware.

Another signal needing to be generated is ChFok, which asserts that the received frame has not been disturbed by errors, i.e. there have been no errors detected up to the last bit of the EOF delimiter⁶. Its formal definition is

\[
Ch_{Fok} \mapsto
\begin{cases}
\text{true}  & \text{if } Ch_{Rx} = rdrrrrrrr \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{4.6}
\]

where: ChFok is the Channel Frame Correct signal, meaning the correct reception of the sequence that correctly terminates a CAN data or remote frame.

⁶ This comes from the CAN standard [6, 5], which states that nodes do not take into account the last bit of a frame when evaluating its correctness.

The third fundamental signal w.r.t. channel monitoring functions is the Channel Error (ChErr),

related to assumption N6. It is asserted upon the detection of an active error flag, i.e. a dominant

value is put in the network for a time longer than the maximum allowed by the bit-stuffing coding:

\[
Ch_{Err} \mapsto
\begin{cases}
\text{true}  & \text{if } T(Ch_{Rx} = d) \geq (l_{stuff} + 1) \cdot T_{bit} \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{4.7}
\]

where: T(ChRx = d) is the normalised time elapsed since the channel is in a dominant state; lstuff is the bit-stuffing coding length and Tbit is the normalised bit-time. Given lstuff = 5, mapping this sequence into binary digits yields 000000. The operation of these mechanisms w.r.t. a CAN message is represented in Figure 4.13.
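Equations (4.6) and (4.7) share the same shape: a flag is asserted by a sequence detection and negated when ChEOT is asserted. A minimal sketch of such Complementary Logic, following the style of Figure 4.12, could look as follows. The entity monitor_flag and its port names are hypothetical, and giving the negation priority over the assertion is an assumption rather than something stated in the text.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper: holds a flag from a "set" event until a "clear" condition.
entity monitor_flag is
  port (
    sys_clk    : in  std_ulogic;
    rst_N      : in  std_ulogic;   -- synchronous reset, active low
    can_clk_en : in  std_ulogic;   -- assumed: one enable pulse per bit time
    set        : in  std_ulogic;   -- e.g. Sequence_Ok of a sequence detector
    clear      : in  std_ulogic;   -- e.g. ChEOT
    flag       : out std_ulogic);  -- e.g. ChFok or ChErr
end entity monitor_flag;

architecture rtl of monitor_flag is
  -- Internal copy, since an output port cannot be read back
  signal flag_int : std_ulogic := '0';
begin
  pFlag : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        flag_int <= '0';
      elsif can_clk_en = '1' then
        if clear = '1' then        -- negation has priority (assumption)
          flag_int <= '0';
        elsif set = '1' then       -- sequence detected, assert and hold
          flag_int <= '1';
        end if;
      end if;
    end if;
  end process pFlag;

  flag <= flag_int;
end architecture rtl;
```

Under these assumptions, ChFok would be obtained by driving set from a detector parametrised with the rdrrrrrrr sequence and clear from ChEOT, and ChErr by using the dddddd sequence instead.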

[Figure: ChEOT and ChFok signals shown against the CRC Sequence, CRC Del, ACK Slot, ACK Del and EOF Delimiter fields of a CAN frame]

Figure 4.13: CANELy Basic Channel Monitoring

Extended Channel Monitoring

So far we have been concerned with the incoming bit stream, ChRx, and how to perform local

fault confinement in the event of a stuck-at-dominant medium. The outgoing bit stream ChTx,

however, is not immune to faults and must be accounted for also. The faults affecting correct

network operation do not necessarily have to be related to the physical network media. They

can also have their origin in the CAN controller circuitry, e.g. a failed oscillator that leaves the

CAN controller stopped while transmitting a dominant bit. Therefore the CANELy model must also account for a stuck-at-dominant situation during message transmission. The period required for detecting this situation has a normalised duration Tstuck−tx, and is formally defined as:

\[
T_{stuck-tx} = \left[\, 2 \cdot l_{stkd} + (l_{stkd} + 1) \cdot err_{stuck \leftarrow tx} \,\right] \cdot T_{bit}
\tag{4.8}
\]



where: errstuck←tx is the stuck-at-dominant transmit-error-tolerance margin. This expression is similar to equation (4.2), since both express the same type of violation of the CAN protocol, but in different directions of the information flow. There is, however, a small difference between the two mechanisms. The tolerance margin of equation (4.8) must be smaller than the one in (4.2), 0 ≤ errstuck←tx < errstuck←rx(bus) < 10. This condition is necessary to avoid the incoming bit stream stuck-at-dominant mechanism acting first, which would leave the node completely disconnected from the network. The function that monitors the stuck-at-dominant condition is defined as:

\[
Ch_{stk-Tx} \mapsto
\begin{cases}
\text{true}  & \text{if } T(Ch_{Tx} = d) > T_{stuck-tx} \\
\text{false} & \text{when } mgmt.request
\end{cases}
\tag{4.9}
\]

where: T(ChTx = d) is the normalised time elapsed since ChTx is in a dominant state. If this time grows larger than Tstuck−tx, we are in the presence of a stuck-at-dominant outgoing bit stream: the signal Chstk−Tx is asserted, the upper layers are signalled and the outgoing bit stream, ChTx, is disabled. The signal can only be negated by an upper layer request.

Mapping the disabling of the outgoing bit stream into hardware components is straightforward:

we just need to exploit (once again) the Boolean OR function, defining:

\[
M_{Tx}(m) = Ch_{Tx} + M_{dis-Tx}
\tag{4.10}
\]

where: Mdis−Tx is the signal for disabling the outgoing bit stream, on all media. The resulting mechanism description iterates over the outgoing media, absorbing their value if Mdis−Tx is asserted. The description of this mechanism is detailed in Figure 4.14.

-- MediumTX : Vector aggregating the several media, going into the transceivers
-- MdisTx   : Signal for disabling all media through the OR function
-- ChTx     : Signal representing the outgoing bit stream

procMediumTXOR : process (ChTx, MdisTx) is
begin  -- process procMediumTXOR
  for m in 1 to NumberMedia loop       -- Iterate over all media
    MediumTX(m) <= ChTx or MdisTx;     -- Disabling if MdisTx is set
  end loop;  -- m
end process procMediumTXOR;

Figure 4.14: Mdis−tx description in VHDL

Considering the tolerance parameter of equation (4.8) as errstuck←tx = 1 [11], a stuck-at-

dominant condition on the outgoing channel bit stream is detected in 22 bit times. In a network

operating at a 1 Mbps bit rate, such condition will be detected in just 22 µs, and the faulty node

disconnected from the network.
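The 22 bit times follow directly from equation (4.8). As a worked check, assuming l_stkd = 7 (the value is not restated in this Section; it is the only one for which equation (4.8) yields the 22 bit times quoted above with a margin of 1):

```latex
\begin{aligned}
T_{stuck-tx} &= \left[\, 2 \cdot l_{stkd} + (l_{stkd} + 1) \cdot err_{stuck \leftarrow tx} \,\right] \cdot T_{bit} \\
             &= \left[\, 2 \cdot 7 + (7 + 1) \cdot 1 \,\right] \cdot T_{bit} = 22 \cdot T_{bit} \\
\text{at 1 Mbps: } T_{bit} &= 1\,\mu s \;\Rightarrow\; T_{stuck-tx} = 22\,\mu s
\end{aligned}
```

Under the same assumption, a margin of 2 would give 2 · 7 + 8 · 2 = 30 bit times.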

There are (non fault-tolerant) commercially available transceivers detecting and masking stuck-

at-dominant scenarios, such as Maxim’s MAX13050 [41] and Microchip’s MCP2551 [42].


The minimum amount of time these devices take to disconnect the faulty node from the network, however, is rather high: 1 ms for MAX13050; 1.25 ms for MCP2551. Such behaviour is explained by the transceivers being purely PHY-aware, having no knowledge of the bus bit rate. Therefore they assume a worst-case figure, which would be the lowest bit rate possible, and implicitly define the lowest bit rate they can support.

Channel Monitoring Component

Having defined the channel monitoring functions, we can now map them into suitable se-

quences. The signals, their behaviour and respective sequence are summarised in Figure 4.15.

Basic Channel Monitoring

  ChEOT      End Of Transmission
             asserted after detection of minimum bus idle period;
             negated upon the start of a new frame.
             Sequence: ChRx = rrrrrrrrr

  ChFok      Frame Correct
             data or remote frame received without errors;
             negated upon assertion of ChEOT.
             Sequence: ChRx = rdrrrrrrr

  ChErr      Frame Error
             asserted upon violation of CAN bit-stuffing coding rule;
             negated upon assertion of ChEOT.
             Sequence: ChRx = dddddd

Extended Channel Monitoring

  Chstk−Tx   Channel Stuck-at-dominant
             asserted upon violation of active error flag length limit;
             negated upon upper layers request.
             Sequence: ChTx = dddddddddddddddddddddd (errstuck←tx = 1)

Figure 4.15: Channel Monitoring signals

This summary provides all the information needed for: parametrisation of sequence detection

machinery; complementary logic design w.r.t. the assertion, hold and negation of the channel

monitoring signals. The list of mapped sequences is presented in Figure 4.16.

With the information of Figure 4.15 and Figure 4.16, the mapping of these functions to the abstract component of Figure 4.10 becomes trivial: the Sequence Detector block is parametrised with the sequence of interest, and the Complementary Logic is filled with the conditions for assertion and negation. From here on, almost all monitoring functions can be transformed into a sequence detection problem, with complementary signal logic, thus being tackled with resort to this reusable approach.

The Channel Monitoring signals can be encapsulated in a VHDL component, which continu-

ously monitors the channel incoming and outgoing bit stream. A block diagram depiction of such

component is shown in Figure 4.17.


-- ChEOT - End of Transmission
constant seq_eot : std_ulogic_vector := "111111111";

-- ChFok - Frame Correct
constant seq_fok : std_ulogic_vector := "101111111";

-- ChErr - Channel error
constant seq_err : std_ulogic_vector := "000000";

-- Stuck-at-dominant tx, margin = 1
constant seq_stk_tx : std_ulogic_vector := "0000000000000000000000";

-- Stuck-at-dominant rx, margin = 2
constant seq_stk_d : std_ulogic_vector := "000000000000000000000000000000";

-- Stuck-at-recessive medium
constant seq_stuck_rm : std_ulogic_vector := "1111111111111111";

Figure 4.16: Sequences mapped into VHDL

[Figure: Channel Monitoring block and its signals ChRx, ChTx, ChEOT, ChErr, ChFok and Chstk-tx]

Figure 4.17: CANELy Channel Monitoring functions

Given the modular approach being followed, this component can be used as the building

block for other components requiring information regarding the state of the channel, e.g. medium

monitoring.

4.5 Medium Monitoring

The previous Section dealt with channel monitoring mechanisms, which assess the channel state and issue fault-confinement actions, if necessary. Most of the faults, however, have their origin in the media, being propagated to the channel. Hence, the next logical step is to do medium monitoring, in order to assess and confine any faults affecting a medium, so they do not propagate to the channel.

A fault that might affect a medium is the reciprocal of the stuck-at-dominant (see Section 4.2.2),

stuck-at-recessive. Although a bus in a stuck-at-recessive state apparently does not pose a prob-

lem as dangerous as the previous case, it can still be dangerous under a different perspective,

since it can mean network partition(s). Therefore, it is equally important to detect such faults

and inform the upper layers, in order to trigger procedures to assess the location of the affected

network segments, and decide the measures to be taken in order to ensure system safety.

Lastly, there is also the case of a medium exhibiting omission errors. This situation must not

only be detected, but corrected if the medium is exhibiting an amount of omissions larger than a

certain threshold, called the medium’s omission degree.


4.5.1 Medium Status

The partitioning of physical media is a type of fault that must be accounted for. Although its

behaviour is not as disruptive (in the short term) as a stuck-at-dominant type of fault, it may be

critical in the long term, since it hinders the media redundancy strategy. Therefore, it is necessary

to assess network partition faults, and inform the upper layers for high-level fault detection, i.e.

pinpoint the physical location of the partition.

A network partition can be described as a prolonged silence in the affected medium, since it is

kept in the recessive state. This abnormal recessive period affecting medium m can be detected

by the following expression:

\[
M_{arp}(m) \mapsto
\begin{cases}
\text{true}  & \text{if } T(M_{Rx}(m) = r) > T_{stuck-rm} \,\land\, \lnot Ch_{Fok} \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{4.11}
\]

where: Marp is the medium abnormal recessive period signal, asserted when T(MRx(m) = r), the normalised elapsed time since medium m is in a recessive state, exceeds Tstuck−rm. Tstuck−rm is the minimum normalised time to detect a stuck-at-recessive fault, and it has a duration Tstuck−rm = 2 · lstkd bit times.

The upper layers should be notified of abnormal events in the network, even if they do not

disturb the correct network operation. Therefore, the event of a medium being abnormally in a

recessive state should also be notified. This notification, however, should only be done when a

correct frame is received at the channel interface. The signal Midle asserts the situation of correct

frame reception, thus allowing the notification to occur.

\[
M_{idle}(m) \mapsto
\begin{cases}
\text{true}  & \text{if } M_{arp}(m) \,\land\, Ch_{Fok} \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{4.12}
\]

Lastly, stuck-at-recessive failures should be distinguished from medium partitions. The signal

Mds serves this purpose, being asserted whenever a dominant bit value is detected in the medium,

meaning the partition is not permanent, thus being a stuck-at-recessive fault.

\[
M_{ds}(m) \mapsto
\begin{cases}
\text{true}  & \text{if } M_{Rx}(m) = d \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{4.13}
\]

The results of these functions are mapped into the Medium Status Word (see Figure 4.18), which keeps track of the most relevant parameters for each medium. The Medium Status Word is always exposed to the remaining components, which can only have “read” access.
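Equations (4.11) to (4.13) can be sketched for a single medium as three held flags, all negated when ChEOT is asserted. The entity medium_status and the input arp_sequence are hypothetical names: arp_sequence is assumed to be the Sequence_Ok output of a sequence detector fed with MRx(m) and parametrised with the stuck-at-recessive sequence. The priority of the ChEOT negation over the assertions is likewise an assumption.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical per-medium status logic for equations (4.11) to (4.13).
entity medium_status is
  port (
    sys_clk      : in  std_ulogic;
    rst_N        : in  std_ulogic;   -- synchronous reset, active low
    can_clk_en   : in  std_ulogic;   -- assumed: one enable pulse per bit time
    MRx          : in  std_ulogic;   -- bit received on this medium, '0' = dominant
    arp_sequence : in  std_ulogic;   -- recessive run longer than Tstuck-rm detected
    ChFok        : in  std_ulogic;
    ChEOT        : in  std_ulogic;
    Marp         : out std_ulogic;
    Midle        : out std_ulogic;
    Mds          : out std_ulogic);
end entity medium_status;

architecture rtl of medium_status is
  signal arp_int, idle_int, ds_int : std_ulogic := '0';
begin
  pStatus : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' or (can_clk_en = '1' and ChEOT = '1') then
        -- Reset, or negation of all flags when ChEOT is asserted
        arp_int  <= '0';
        idle_int <= '0';
        ds_int   <= '0';
      elsif can_clk_en = '1' then
        if arp_sequence = '1' and ChFok = '0' then  -- equation (4.11)
          arp_int <= '1';
        end if;
        if arp_int = '1' and ChFok = '1' then       -- equation (4.12)
          idle_int <= '1';
        end if;
        if MRx = '0' then                           -- equation (4.13)
          ds_int <= '1';
        end if;
      end if;
    end if;
  end process pStatus;

  Marp  <= arp_int;
  Midle <= idle_int;
  Mds   <= ds_int;
end architecture rtl;
```

Because the negation on ChEOT takes priority, the long recessive period of an idle bus does not leave Marp asserted in this sketch.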

We can now design a component wrapping the functionality of the Media Monitoring. In practical terms, and given the fact that the status assessment must be done for each and every medium in parallel, we will have a component instance of the Media Status Monitor for each medium, with its own Medium Status Word. Such a component is illustrated in Figure 4.19, in block diagram form.

-- Medium Status Word
type medium_status_t is record
  arp   : std_ulogic;  -- Abnormal Recessive Period, maps M_arp(m)
  idle  : std_ulogic;  -- Medium Idle, maps M_idle(m)
  ds    : std_ulogic;  -- Dominant Bit Received, maps M_ds(m)
  stk_d : std_ulogic;  -- Stuck at Dominant, maps M_stk-d(m)
  dis   : std_ulogic;  -- Medium Disable, maps M_dis(m)
  fm    : std_ulogic;  -- Frame Mismatch, maps M_fm(m)
end record medium_status_t;

Figure 4.18: Medium Status Word VHDL data type

[Figure: Medium Status Monitoring block with the signals ChFok, ChEOT, ChRx and MRx(m), and the Medium Status Word]

Figure 4.19: Medium Status Monitoring block diagram

4.5.2 Frame Monitoring

Any medium is susceptible to suffering (or propagating) a fault during normal network operation. The fault’s manifestation, the error, can be propagated further into the system, reaching the channel via the bus redundancy management mechanisms (see Figure 4.4, errors C and D), or may be masked (error B). Either way, it is important to assess whether there was any error affecting the network, and if so, which medium or media exhibited the errors.

The first step to achieve this goal is to perform the monitoring of the received frames, by comparing the received data in each medium, MRx(m), with the received channel data, ChRx. This comparison is made “on-line”, on a bit-by-bit basis. If there is any disagreement between the value present in the channel and the medium, a flag signalling frame mismatch, MFm(m), is asserted for medium m. Formally,

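\[
M_{Fm}(m) \mapsto
\begin{cases}
\text{true}  & \text{if } M_{Rx}(m) \neq Ch_{Rx} \,\land\, Ch_{TiP} \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{4.14}
\]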

where: ChTiP = ¬ChFok ∧ ¬ChErr signals that a frame transfer is in progress. The assertion of MFm(m) is of the utmost importance for omission detection, since it signals whether there was any disturbance in the medium during the last frame transmission. This signal is also mapped into the Medium Status Word (see Figure 4.18).

4.5.3 Omission Degree Control

The medium monitoring functions are essential to assess which medium or media are exhibiting errors. With this information it is now possible to go further and proceed to the confinement of these errors.

As discussed previously, a (perceived) fault affecting a frame generates an omission error: the

frame is destroyed, not being delivered to the upper layers. Such errors should be rare during

normal operation, and normally are caused by transient events such as EMI or local node power

supply transients.

There are occasions, however, where a medium might be affected by a higher than usual amount of errors. These errors can have their origin in defective circuitry, damaged cabling or connectors. In this situation, the redundancy management machinery may worsen system operation, instead of improving it⁷. Therefore, the medium state must not only be monitored, but also be rated in terms of dependability, i.e. how dependable that medium is.

The final objective of the media monitoring functions is to achieve omission degree control for

each medium, through the assessment of the medium’s omission degree (MOd). Any medium

exceeding its omission degree bound, km, must be declared failed, the corresponding upper layer

entities signalled and its contribution to equation (4.4) disabled.

Omission Detection

Before assessing the omission degree of a given medium, we must first proceed with the detection of omissions. One of the most important detection functions is the assessment of whether any medium was affected by errors. Such assessment is made by the MFm−s signal, which is defined as:

\[
M_{Fm-s} = \sum_{m \in M} M_{Fm}(m)
\tag{4.15}
\]

where: ∑ denotes a logical sum. MFm−s is asserted if there is at least one medium with MFm(m) asserted. The omission detection functions are illustrated in Figure 4.20.
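A minimal sketch of equations (4.14) and (4.15) for all media is given next. The entity frame_monitor is a hypothetical name, the generic NumberMedia is borrowed from the loop of Figure 4.14, and it is assumed that the comparison is sampled once per bit time through can_clk_en, as in Figure 4.12.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical frame monitoring for all media: M_Fm(m) and M_Fm-s.
entity frame_monitor is
  generic (
    NumberMedia : positive := 2);
  port (
    sys_clk    : in  std_ulogic;
    rst_N      : in  std_ulogic;   -- synchronous reset, active low
    can_clk_en : in  std_ulogic;   -- assumed: one enable pulse per bit time
    MRx        : in  std_ulogic_vector(1 to NumberMedia);
    ChRx       : in  std_ulogic;
    ChFok      : in  std_ulogic;
    ChErr      : in  std_ulogic;
    ChEOT      : in  std_ulogic;
    MFm        : out std_ulogic_vector(1 to NumberMedia);
    MFm_s      : out std_ulogic);
end entity frame_monitor;

architecture rtl of frame_monitor is
  signal fm_int : std_ulogic_vector(1 to NumberMedia) := (others => '0');
  signal ChTiP  : std_ulogic;
begin
  -- Frame transfer in progress
  ChTiP <= not ChFok and not ChErr;

  pMismatch : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        fm_int <= (others => '0');
      elsif can_clk_en = '1' then
        if ChEOT = '1' then                 -- negated when ChEOT is asserted
          fm_int <= (others => '0');
        else
          for m in fm_int'range loop        -- bit-by-bit comparison per medium
            if MRx(m) /= ChRx and ChTiP = '1' then
              fm_int(m) <= '1';             -- equation (4.14), held until ChEOT
            end if;
          end loop;
        end if;
      end if;
    end if;
  end process pMismatch;

  -- Logical sum over all media, equation (4.15)
  pSum : process (fm_int) is
    variable any_fm : std_ulogic;
  begin
    any_fm := '0';
    for m in fm_int'range loop
      any_fm := any_fm or fm_int(m);
    end loop;
    MFm_s <= any_fm;
  end process pSum;

  MFm <= fm_int;
end architecture rtl;
```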

These functions were designed to provide medium omission error accountability, i.e. knowing, upon error detection, which medium or media exhibited the erroneous behaviour. This step is crucial to evaluate the omission degree, since there are common-mode errors (Figure 4.4, error A) and single-mode errors (Figure 4.4, errors B-D).

⁷ Such a situation can arise from a medium suffering errors such as the ones in Figure 4.4, errors C and D. In these cases the error is propagated. The reciprocal is error B in the same Figure, which is masked.


Medium omission functions (for all media)

  MFm−s   Media Frame Mismatch
          asserted if any medium exhibited a frame mismatch;
          negated if no mismatch or all media affected by a common-mode error.

Medium omission functions (for each medium m)

  MOer    Omission error in medium
          frame received without errors at the channel, but medium had an error.
          Condition: ChFok ∧ MFm(m)

  MOch    Omission error in channel
          frame aborted with errors, but medium had no error.
          Condition: ChErr ∧ MFm−s ∧ ¬MFm

  MUerr   Omission error in undetermined medium or media
          frame aborted with errors, either common-mode error or multiple errors.
          Condition: ChErr ∧ (¬MFm−s ∨ MFm)

Figure 4.20: Medium Omission Detection auxiliary functions

Omission Degree Assessment

As a general rule, a medium exhibiting errors should have its omission degree value incre-

mented, in the same way a medium showing no errors should have it cleared. There are, how-

ever, some cases where the media cannot be held accountable, e.g. common-mode errors. The

omission degree is evaluated upon a successful transmission. It is defined by:

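\[
M_{Od}(m)\big\uparrow_{Ch_{EOT}} =
\begin{cases}
M_{Od}(m) + 1 & \text{if } M_{Oer}(m) \,\lor\, (M_{Och}(m) \,\land\, \lnot Ch_{Fok}) \\
M_{Od}(m)     & \text{if } M_{Uer}(m) \,\land\, \lnot Ch_{Fok} \\
0             & \text{if } Ch_{Fok} \,\land\, \lnot M_{Fm}(m)
\end{cases}
\tag{4.16}
\]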

where the omission degree of a Medium should be:

• Incremented, if a Medium has experienced an omission error, or if the channel experienced

an omission and the frame was not correctly received;

• Maintained, if it was not possible to pinpoint the medium or media that suffered the omis-

sion, and the frame was not correctly received;

• Cleared, if the frame was correctly received and the Medium did not exhibit any error.

The assessment of the MOd is made in parallel for all the media, upon a successful frame

transmission - signalled through the assertion of ChEOT . Any medium whose omission degree

has exceeded the omission degree bound should be declared failed, and the upper layers notified.


[Figure: Omission Degree Control block, comprising Omission Detection and an OD Register for each medium m, with the signals MRx(1) ... MRx(m), MFm, ChFok, ChEOT, ChErr and MOd]

Figure 4.21: Omission Degree Control block diagram

The functionalities provided by the omission detection and control functions can be wrapped in

a component, providing Omission Degree Control. Such component is illustrated in Figure 4.21.

This component encapsulates the omission detection machinery, which supports the evaluation of the omission degree of a given medium. It outputs the current omission degree of all the media, so that it can be accessed through management interfaces and consulted by upper layer entities.

Omission Degree Control mechanisms

The functions described in Figure 4.20 and through equations (4.14) and (4.15) are simple to map onto mechanisms. The simplicity derives from their purely combinatorial nature.

The assessment of the omission degree (see equation (4.16)) can be mapped into a simple state machine, with three states: “INCREMENT”, “KEEP” and “CLEAR”. It may be argued that this design could be optimised. The description, however, is being made in behavioural mode, i.e. describing behaviour as opposed to describing structures, thus leaving the optimisations to the synthesiser.
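As an illustration only, the auxiliary functions of Figure 4.20 and the three-way decision of equation (4.16) could be sketched for one medium as follows. The entity omission_degree, the generic OD_MAX and the saturation of the counter are hypothetical additions; treating any situation not covered by equation (4.16) as “KEEP”, and assuming that ChFok, ChErr and MFm still hold their values at the clock edge where ChEOT is first seen asserted, are also assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical omission degree register for one medium, equation (4.16).
entity omission_degree is
  generic (
    OD_MAX : natural := 15);           -- hypothetical saturation value
  port (
    sys_clk    : in  std_ulogic;
    rst_N      : in  std_ulogic;       -- synchronous reset, active low
    can_clk_en : in  std_ulogic;       -- assumed: one enable pulse per bit time
    ChEOT      : in  std_ulogic;
    ChFok      : in  std_ulogic;
    ChErr      : in  std_ulogic;
    MFm        : in  std_ulogic;       -- frame mismatch of this medium
    MFm_s      : in  std_ulogic;       -- logical sum of mismatches, all media
    MOd        : out natural range 0 to OD_MAX);
end entity omission_degree;

architecture behavioural of omission_degree is
  type action_t is (INCREMENT, KEEP, CLEAR);
  signal action           : action_t;
  signal MOer, MOch, MUer : std_ulogic;
  signal eot_prev         : std_ulogic := '0';
  signal od               : natural range 0 to OD_MAX := 0;
begin
  -- Auxiliary functions of Figure 4.20 (purely combinatorial)
  MOer <= ChFok and MFm;
  MOch <= ChErr and MFm_s and not MFm;
  MUer <= ChErr and (not MFm_s or MFm);

  -- Decision of equation (4.16); KEEP is also used when no case applies
  action <= INCREMENT when (MOer = '1') or (MOch = '1' and ChFok = '0') else
            KEEP      when (MUer = '1' and ChFok = '0') else
            CLEAR     when (ChFok = '1' and MFm = '0') else
            KEEP;

  pOd : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        od       <= 0;
        eot_prev <= '0';
      elsif can_clk_en = '1' then
        eot_prev <= ChEOT;
        if ChEOT = '1' and eot_prev = '0' then   -- assertion of ChEOT
          case action is
            when INCREMENT =>
              if od < OD_MAX then                -- saturate instead of wrapping
                od <= od + 1;
              end if;
            when KEEP  => null;
            when CLEAR => od <= 0;
          end case;
        end if;
      end if;
    end if;
  end process pOd;

  MOd <= od;
end architecture behavioural;
```

The three conditions of equation (4.16) are mutually exclusive, so the order in which they are tested in the conditional assignment does not change the result.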

4.6 Media Selection Unit

So far, we have been discussing the mechanisms for error detection and confinement in a comprehensive manner. The materialisation of these mechanisms, however, must be done by encapsulating them in components, which in turn must be arranged to achieve a unified block. This Section shows how all the monitoring functions and mechanisms integrate in a Media Selection Unit component.

The modular philosophy of the previous blocks allows seamless integration into a single com-

ponent, with well defined interfaces (see Figure 4.22).


[Figure: Media Selection Unit, integrating the Channel Monitoring, Medium Status, Frame Monitoring, Omission Degree Control and Management Interface blocks between the Standard CAN layer (ChTx, ChRx), the media (MTx(1) ... MTx(m), MRx(1) ... MRx(m)) and Management]

Figure 4.22: Media Selection Unit block diagram

4.6.1 System Interface

The Media Selection Unit (MSU) must interface with a computational platform, for: parameter

configuration, such as bit rate; exception notification, such as a medium partition; and reading,

such as the message identifier related to an event. Furthermore, it must connect to the channel

and the several redundant media (see Figure 3.4, page 28). Regarding initialisation, it must

receive a set of parameters before it can start operation. The set of initialisation parameters and

notifications is illustrated in Figure 4.23.

Invocation Primitives

  Description
  Initialise (baud, km)

Notification Primitives

  Description                          Issuing Condition
  Omission degree exceeded (m)         MOd(m) > km
  Stuck-at-dominant Medium (m)         Mstk−d(m)
  Stuck-at-recessive Medium (m, mid)   Midle(m) ∧ ¬Mds(m)
  Medium partition (m, mid)            Midle(m) ∧ Mds(m)
  Stuck-at-dominant Channel            Chstk−Tx

Figure 4.23: CANELy Media Selection Unit management primitives
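The issuing conditions of Figure 4.23 are combinatorial and can be sketched for one medium as shown next. The entity msu_notifications, its output names and the generic OD_MAX are hypothetical, and combining all conditions into a single level-sensitive irq output is an assumption about how a notification might be raised.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical mapping of the issuing conditions of Figure 4.23.
entity msu_notifications is
  generic (
    OD_MAX : natural := 15);                       -- hypothetical counter limit
  port (
    MOd             : in  natural range 0 to OD_MAX;  -- omission degree of medium m
    km              : in  natural range 0 to OD_MAX;  -- omission degree bound
    Mstk_d          : in  std_ulogic;
    Midle           : in  std_ulogic;
    Mds             : in  std_ulogic;
    Chstk_Tx        : in  std_ulogic;
    od_exceeded     : out std_ulogic;
    stuck_dominant  : out std_ulogic;
    stuck_recessive : out std_ulogic;
    partition       : out std_ulogic;
    stuck_channel   : out std_ulogic;
    irq             : out std_ulogic);
end entity msu_notifications;

architecture rtl of msu_notifications is
  -- Internal copies, since output ports cannot be read back
  signal od_i, sd_i, sr_i, pt_i, sc_i : std_ulogic;
begin
  od_i <= '1' when MOd > km else '0';   -- Omission degree exceeded (m)
  sd_i <= Mstk_d;                       -- Stuck-at-dominant Medium (m)
  sr_i <= Midle and not Mds;            -- Stuck-at-recessive Medium (m, mid)
  pt_i <= Midle and Mds;                -- Medium partition (m, mid)
  sc_i <= Chstk_Tx;                     -- Stuck-at-dominant Channel

  od_exceeded     <= od_i;
  stuck_dominant  <= sd_i;
  stuck_recessive <= sr_i;
  partition       <= pt_i;
  stuck_channel   <= sc_i;

  -- Any pending condition requests the attention of the host (assumption)
  irq <= od_i or sd_i or sr_i or pt_i or sc_i;
end architecture rtl;
```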

The MSU must be initialised with the CAN network bit rate and the media omission degree bound. The bit rate is necessary to keep synchronism with the CAN network, and to ensure that the quasi-stationary operation is also extended to the CANELy machinery. The media omission degree bound defines the omission degree threshold whose violation implies medium failure. The operation of mechanisms that might lead to any form of fault confinement, such as bus disconnection, is halted until both parameters have been provided.

Some functions require information regarding the message identifier when an exceptional condition is raised, e.g. medium partition detection. An auxiliary function implements the logic that processes the first 14 bits of a CAN 2.0A message or the first 32 bits of a CAN 2.0B message. The recovered message identifier is put into a VHDL record, along with information regarding the CAN protocol version. The access to this information is dependent on the management interface implementation.

4.6.2 Management Interface

The MSU management interface specification has been left as general as possible, given the amount of bus interfacing technologies available. There are, however, certain restrictions that must be observed. First and foremost is the provision of an Interrupt Request (IRQ) facility. The issuing of IRQs is paramount, allowing the MSU to notify the host system upon an exceptional event (notification), such as the ones listed in Figure 4.23.

From the computing platform point-of-view, the MSU can be seen as a peripheral, capable

of being configured and having accessible information relevant to its operation. Therefore, some

sort of peripheral I/O bus must be used.

The most common I/O method for this type of interface is memory-mapped, where I/O is

achieved through reading and writing data to certain memory addresses. The function of the

management interface is then to multiplex the requested actions, which can be reading or writ-

ing data, e.g. reading the value of a medium omission degree; writing the value of the bit rate

parameter.
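A memory-mapped management interface of this kind can be sketched as a small register block. Everything in the sketch is hypothetical, since the specification is deliberately left open: the entity msu_mgmt_regs, the two-bit address map, the eight-bit data width and the choice of exposing a single medium's omission degree are illustrative assumptions only.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical memory-mapped register block for the MSU management interface.
-- Address map (assumption): "00" bit rate (R/W), "01" km (R/W),
--                           "10" omission degree (R), "11" notifications (R).
entity msu_mgmt_regs is
  port (
    sys_clk : in  std_ulogic;
    rst_N   : in  std_ulogic;                     -- synchronous reset, active low
    addr    : in  std_logic_vector(1 downto 0);
    wr_en   : in  std_ulogic;
    wdata   : in  std_logic_vector(7 downto 0);
    rdata   : out std_logic_vector(7 downto 0);
    MOd     : in  natural range 0 to 255;         -- omission degree of one medium
    irq_src : in  std_logic_vector(4 downto 0);   -- pending notifications
    baud    : out std_logic_vector(7 downto 0);   -- bit rate parameter
    km      : out natural range 0 to 255);        -- omission degree bound
end entity msu_mgmt_regs;

architecture rtl of msu_mgmt_regs is
  signal baud_reg : std_logic_vector(7 downto 0) := (others => '0');
  signal km_reg   : std_logic_vector(7 downto 0) := (others => '0');
begin
  -- Write path: only the two parameter registers are writable
  pWrite : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        baud_reg <= (others => '0');
        km_reg   <= (others => '0');
      elsif wr_en = '1' then
        case addr is
          when "00"   => baud_reg <= wdata;
          when "01"   => km_reg   <= wdata;
          when others => null;                    -- read-only locations
        end case;
      end if;
    end if;
  end process pWrite;

  -- Read path: multiplex the requested location onto the data bus
  with addr select
    rdata <= baud_reg                              when "00",
             km_reg                                when "01",
             std_logic_vector(to_unsigned(MOd, 8)) when "10",
             "000" & irq_src                       when others;

  baud <= baud_reg;
  km   <= to_integer(unsigned(km_reg));
end architecture rtl;
```

The read multiplexer is the part that corresponds most directly to the text: the host selects a location by address and the interface returns either a configuration parameter or a piece of status information.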

4.7 Summary

The standard Controller Area Network (CAN) layer has a restrictive fault model which, although adequate for the applications it was designed for, is not suitable for high dependability domains. Therefore it must be complemented, providing dependable network operation even in the presence of disturbances. The CAN Enhanced Layer (CANELy) architecture aims at solving the impairments of CAN network operation, qualifying the utilisation of CAN for high dependability domains, which require strict operational guarantees.

This Chapter covered the dependability enhancements offered by the CANELy architecture over the standard CAN protocol, namely with respect to network availability. These enhancements were defined by CANELy, based on modelling the operation of the CAN protocol in a comprehensive manner. With the network model defined, it progressed into identifying its weaknesses and proposing solutions for solving or mitigating them.


Based on the analytical results of the CANELy architecture, we were able to discuss a set of

mechanisms, whose objective was mapping the models into (firstly) abstract structures, describing

the model’s behaviour. It was then noticed that most problems modelled by CANELy could be

transformed into sequence detection problems. These problems, however, required machinery

which should be made as generic as possible to promote component reuse, maintainability and

reduce design effort, thus easing the verification of component correctness.

The dependability enhancements were discussed in a bottom-up approach, starting with the

effective design of redundant media management mechanisms, and then passing to the sev-

eral monitoring functions both at the channel and medium level. These monitoring functions rely

heavily on the sequence detection machinery for fault detection, thus permitting their confinement.

Finally, the integration of these components in a single entity, called Media Selection Unit (MSU)

was discussed.


5 Timeliness Enforcement

Defer no time, for delays are dangerous.
Henry VI, William Shakespeare

The concept of correct service has two dimensions in a real-time system. One of the di-

mensions is the value domain, where correct results are required. The other dimension is the

time domain, where results are required on time. Therefore reliable real-time operation demands

correct results on time.

Communication network channels are known for being unreliable, i.e. there is always some probability that messages conveyed by the channel get corrupted, or even lost. While the channel can be made reliable and available, as discussed in Chapter 4, one cannot completely avoid low-level protocol glitches (e.g. error or overload frame transmissions), which manifest themselves as periods of time where the CAN network is unavailable, i.e. inaccessible.

One of the contributions of the CANELy architecture was the study of CAN inaccessibility,

allowing its integration into the timeliness model of CAN bus communication [11, 47]. A set of

mechanisms was devised for the evaluation of inaccessibility events and their duration, together

with an effective method for controlling its effects.

This Chapter discusses the effective mapping of mechanisms supporting Inaccessibility Con-

trol in CANELy, enabling the enforcement of correct behaviour in the time domain. Firstly, we

introduce the inaccessibility concept, how it manifests and the duration of such events in CAN

network operation. Inaccessibility events must be assessed, and have their effects evaluated

and controlled. Such actions are taken by special-purpose mechanisms, which monitor the chan-

nel. Lastly, we discuss the mapping and integration of such mechanisms in a self-contained unit,

suitable for being mapped into an FPGA device.

5.1 Channel Inaccessibility

The concept of inaccessibility was formally defined in [48]: inaccessibility is

a perceived temporary condition, during which a component is unable to provide service. The

duration and rate of inaccessibility events are known and bounded; the violation of such bounds

implies the failure of the component.


In a CAN network, the component that suffers inaccessibility events is the channel. Inaccessibility

results from two different conditions: omissions, which derive from network errors and which

in CAN are automatically transformed into inaccessibility events; overloads, which derive from the

local state of the nodes. The CAN standard defines that an overload frame can be transmitted

during the intermission period, if a node requires extra time to process the received frame, e.g. a

nearly full FIFO serving a slow processing element.

While these events are taking place, the channel cannot be used by any other node for message

diffusion, thus being inaccessible. The impact that these events have on timeliness cannot

be neglected: the time wasted in error frame transmission and lost frame retransmission will

add up towards a possible violation of the bounded transmission delay property, a fundamental

requirement of a real-time communication system (see Section 2.2.3, page 12).

One of the contributions of the CANELy architecture was the study and analytic definition of the

duration and boundedness of inaccessibility in CAN. Some results of this study are presented in

Figure 5.1, pertaining to the inaccessibility duration bounds provided by the CANELy architecture.

Figure 5.1: CAN vs. CANELy normalised inaccessibility duration bounds

The chart in Figure 5.1 shows that CANELy mechanisms may provide a reduction of the in-

accessibility times, compared to standard CAN. This reduction, however, only benefits network

errors lasting longer than a single message transfer, e.g. a failed transmitter (Tx-fail). In

standard CAN, these errors are handled by the fault confinement mechanisms at each node, based

on counters which account for both transmit and receive errors. The inaccessibility period only

ends when (one of) those counters reaches a certain threshold and puts the CAN controller in the

“error-passive” or “bus-off” state. The CANELy architecture minimises those times by exploiting

mechanisms present in standard CAN controllers, which allow the issue of warning signals when

an error counter exceeds a given threshold.


Inaccessibility Impact on System Timeliness

Most communication protocols are based on timers, which in the event of failed communication

allow system progression or recovery actions to be undertaken. An inaccessibility period is usually

beyond these timers, thus not being accounted for. Depending on the network load, the effects

of an inaccessibility period may go uncovered. There are, however, situations where the effects

can propagate into the upper layers, triggering unwanted actions and behaviours, e.g. message

retransmission or even protocol failure.

The effects of inaccessibility can be extremely dangerous in a hard real-time system. Given

the close relation between system task execution and message communication, the communi-

cation delays can propagate into the computational task itself, and ultimately cause a deadline

violation. Therefore, it is important to have knowledge of the parameters that characterise

inaccessibility events, namely rate and duration, so they can be accounted for in the timeliness

model of the real-time system.

5.2 Inaccessibility Evaluation

Generically, confinement actions must be supported by monitoring and evaluation mecha-

nisms. The control of inaccessibility in CANELy is no exception, being supported by inaccessibility

parameter evaluation mechanisms. Such mechanisms provide information both on the rate and

duration of inaccessibility events and their periods, thus making possible the assessment of the

channel status w.r.t. timeliness, and providing valuable information related to the real

inaccessibility parameters to the upper layers, e.g. for timeout-based protocol management purposes.

5.2.1 Assessment of Inaccessibility Events

The rate of inaccessibility events can be assessed by counting the number of events in a given

reference interval. Such evaluation should be done at the end of a network activity period, i.e.

upon the assertion of ChEOT . These counters are defined as:

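\[
Ch_{Ie}\uparrow_{Ch_{EOT}} \;\mapsto\;
\begin{cases}
Ch_{Ie} + 1 & \text{if } Ch_{Err} \\
0 & \text{when mgmt. request}
\end{cases}
\tag{5.1}
\]

\[
Ch_{Oe}\uparrow_{Ch_{EOT}} \;\mapsto\;
\begin{cases}
Ch_{Oe} + 1 & \text{if } Ch_{Err} \wedge \neg Ch_{Fok} \\
0 & \text{when mgmt. request}
\end{cases}
\tag{5.2}
\]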

where: ChIe is the total number of inaccessibility events; ChOe is the number of inaccessibility

events derived from omissions. The difference between the ChIe and ChOe counters provides the

number of inaccessibility events strictly due to overload conditions.
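The counters of equations 5.1 and 5.2 could be mapped into VHDL in the same style as the listings of this Chapter. The following is only an illustrative sketch and not the description used in the prototype: the entity name `inac_evt_count`, the `CNT_WIDTH` generic and the `mgmt_clr` input (standing for the management request) are assumptions, and ChErr and ChFok are assumed to hold their values when ChEOT is asserted.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch only: entity name, CNT_WIDTH and mgmt_clr are assumptions
entity inac_evt_count is
  generic (CNT_WIDTH : positive := 16);        -- assumed counter width
  port (
    sys_clk    : in  std_ulogic;
    rst_N      : in  std_ulogic;               -- synchronous reset, active low
    can_clk_en : in  std_ulogic;               -- one enable pulse per CAN bit time
    ChEOT      : in  std_ulogic;
    ChErr      : in  std_ulogic;
    ChFok      : in  std_ulogic;
    mgmt_clr   : in  std_ulogic;               -- management request (assumed)
    ChIe       : out unsigned(CNT_WIDTH - 1 downto 0);
    ChOe       : out unsigned(CNT_WIDTH - 1 downto 0));
end entity inac_evt_count;

architecture rtl of inac_evt_count is
  signal ie_cnt  : unsigned(CNT_WIDTH - 1 downto 0) := (others => '0');
  signal oe_cnt  : unsigned(CNT_WIDTH - 1 downto 0) := (others => '0');
  signal ChEOT_d : std_ulogic := '0';          -- ChEOT delayed by one bit time
begin
  ChIe <= ie_cnt;
  ChOe <= oe_cnt;

  procEvtCount : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        ie_cnt  <= (others => '0');
        oe_cnt  <= (others => '0');
        ChEOT_d <= '0';
      else
        if can_clk_en = '1' then
          ChEOT_d <= ChEOT;
          if ChEOT_d = '0' and ChEOT = '1' then  -- ChEOT has just been asserted
            if ChErr = '1' then
              ie_cnt <= ie_cnt + 1;              -- equation 5.1
              if ChFok = '0' then
                oe_cnt <= oe_cnt + 1;            -- equation 5.2
              end if;
            end if;
          end if;
        end if;
        if mgmt_clr = '1' then                   -- clearing takes precedence
          ie_cnt <= (others => '0');
          oe_cnt <= (others => '0');
        end if;
      end if;
    end if;
  end process procEvtCount;
end architecture rtl;
```

In this sketch both counters wrap around on overflow; the number of overload-related events can be obtained by the upper layers as ChIe - ChOe.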


The communication channel suffers an omission whenever a message transfer is aborted or

an error is detected by CRC mechanisms, and an error frame is issued. Therefore, it is necessary

to account for channel omissions. As a general rule, a channel exceeding the omission degree

bound, k, should be declared failed. The evaluation of the real number of omissions is defined as:

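\[
Ch_{Od}\uparrow_{Ch_{EOT}} \;\mapsto\;
\begin{cases}
Ch_{Od} + 1 & \text{if } Ch_{Err} \wedge \neg Ch_{Fok} \\
0 & \text{if } Ch_{Fok}
\end{cases}
\tag{5.3}
\]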

Assessing the number of inaccessibility events is a step towards inaccessibility control mech-

anisms and upper layer protocol optimisation. There is still the need, however, to account for the

time these inaccessibility events last.
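A possible mapping of equation 5.3, together with the comparison against the omission degree bound k, is sketched below. This is a hedged illustration under stated assumptions: the entity name, the default value given to `K`, the `OD_WIDTH` generic and the `ChOd_exceeded` output are hypothetical and are not taken from the CANELy prototype.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch only: entity name, K default and OD_WIDTH are assumptions
entity ch_omission_degree is
  generic (
    K        : natural  := 3;                  -- omission degree bound k (assumed value)
    OD_WIDTH : positive := 8);                 -- assumed width, must be able to hold K + 1
  port (
    sys_clk, rst_N, can_clk_en : in  std_ulogic;
    ChEOT, ChErr, ChFok        : in  std_ulogic;
    ChOd                       : out unsigned(OD_WIDTH - 1 downto 0);
    ChOd_exceeded              : out std_ulogic);  -- '1' while ChOd > k
end entity ch_omission_degree;

architecture rtl of ch_omission_degree is
  signal od_cnt  : unsigned(OD_WIDTH - 1 downto 0) := (others => '0');
  signal ChEOT_d : std_ulogic := '0';          -- ChEOT delayed by one bit time
begin
  ChOd          <= od_cnt;
  ChOd_exceeded <= '1' when od_cnt > K else '0';

  procOmDeg : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        od_cnt  <= (others => '0');
        ChEOT_d <= '0';
      elsif can_clk_en = '1' then
        ChEOT_d <= ChEOT;
        if ChEOT_d = '0' and ChEOT = '1' then  -- ChEOT has just been asserted
          if ChFok = '1' then
            od_cnt <= (others => '0');         -- correct frame: omission run ends
          elsif ChErr = '1' and od_cnt <= K then
            od_cnt <= od_cnt + 1;              -- one more omission, stops at K + 1
          end if;
        end if;
      end if;
    end if;
  end process procOmDeg;
end architecture rtl;
```

Stopping the count at K + 1 is a design choice of this sketch: it keeps the exceeded indication asserted until a correct frame is observed, without risking counter wrap-around.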

5.2.2 Extended Channel Monitoring

Assessing the duration of inaccessibility periods affecting the communication channel requires

the assessment of a few more parameters. Important parameters pertaining to accessibility are

those signalling: start of a new frame transfer; successful frame transfer. These functions extend

the set of Channel Monitoring functions needed for network monitoring with regard to availability

and reliability (see Section 4.4).

The first signal of this extended set of channel monitoring signals is the Start-of-Frame signal,

ChSOF , and is defined as:

\[
Ch_{SOF} \;\mapsto\;
\begin{cases}
\text{true} & \text{if } Ch_{EOT} \wedge Ch_{Rx} = d \\
\text{false} & \text{when } Ch_{SOF}
\end{cases}
\tag{5.4}
\]

The assertion of the ChSOF signals the channel’s availability to convey messages, being as-

serted for the duration of only one bit-time. This signal is useful for machinery assessing the

duration of inaccessibility events, since it marks the start of a (possible) inaccessibility period.
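The one bit-time pulse of equation 5.4 can be sketched as a small clocked process. The entity name is hypothetical, and the sketch assumes the bit encoding used in the VHDL listings of this Chapter ('0' dominant, '1' recessive) and that ChEOT is negated by the basic channel monitoring once the new frame has started.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Illustrative sketch only: the entity name is an assumption
entity ch_sof_detect is
  port (
    sys_clk, rst_N, can_clk_en : in  std_ulogic;
    ChEOT : in  std_ulogic;   -- end of transmission, from basic channel monitoring
    ChRx  : in  std_ulogic;   -- channel bit stream: '0' dominant, '1' recessive
    ChSOF : out std_ulogic);
end entity ch_sof_detect;

architecture rtl of ch_sof_detect is
  signal sof_i : std_ulogic := '0';
begin
  ChSOF <= sof_i;

  procSOF : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        sof_i <= '0';
      elsif can_clk_en = '1' then
        if sof_i = '1' then
          sof_i <= '0';                        -- "false when ChSOF": one bit-time pulse
        elsif ChEOT = '1' and ChRx = '0' then
          sof_i <= '1';                        -- dominant bit seen while ChEOT is asserted
        end if;
      end if;
    end if;
  end process procSOF;
end architecture rtl;
```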

Another signal required by inaccessibility monitoring functions is the Transmission Correct,

ChTok, which assesses if there was no violation in the frame transfer format, detected up to the

first bit of intermission. It is defined as:

\[
Ch_{Tok} \;\mapsto\;
\begin{cases}
\text{true} & \text{if } Ch_{Rx} = rdrrrrrrrrr \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{5.5}
\]

This signal is useful to assess not only that the message was received correctly, but also that

no error affected the last but one bit of the message transmission, so that no message

retransmission is needed.


Lastly, if no violation is detected up to the second bit of the intermission, which is the minimum

intermission period before a new data/remote frame transmission can take place, the ChIFS sig-

nal should be asserted. It is defined as:

\[
Ch_{IFS} \;\mapsto\;
\begin{cases}
\text{true} & \text{if } Ch_{Rx} = rdrrrrrrrrrr \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{5.6}
\]

Given that the conditions for the assertion of ChEOT are also met, the ChIFS signal will be

asserted only during one bit-time. A depiction of these mechanisms w.r.t. the end of a CAN frame

is shown in Figure 5.2.

[Timing diagram: ChFok, ChTok and ChIFS signals shown against the CRC Sequence, CRC Del, ACK Slot, ACK Del and EOF Delimiter fields at the end of a CAN frame]

Figure 5.2: Timing of the CAN channel monitoring signals

This set of signals allows the implementation of a simple scheme for the evaluation of CAN

inaccessibility periods. They can be combined into one signal, ChFc, which asserts a correct

frame-level boundary, i.e. start or end of a correct frame. Its definition is given by:

\[
Ch_{Fc} = Ch_{SOF} \vee Ch_{Tok-p} \vee Ch_{IFS}
\tag{5.7}
\]

where: ChTok−p is a pulsed version of ChTok, i.e. it only lasts one bit time after ChTok assertion.

This signal can then be used by inaccessibility evaluation machinery, helping to assess the end

of an inaccessibility event.
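Equation 5.7 maps almost directly into a concurrent assignment, provided a pulsed version of ChTok is derived first. The sketch below illustrates one way of doing so; the entity name and the internal names `ChTok_d` and `ChTok_p` are assumptions, and ChTok is assumed to change synchronously with the CAN bit-time enable.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Illustrative sketch only: entity and internal signal names are assumptions
entity ch_fc_combine is
  port (
    sys_clk, rst_N, can_clk_en : in  std_ulogic;
    ChSOF, ChTok, ChIFS        : in  std_ulogic;
    ChFc                       : out std_ulogic);
end entity ch_fc_combine;

architecture rtl of ch_fc_combine is
  signal ChTok_d : std_ulogic := '0';  -- ChTok delayed by one bit time
  signal ChTok_p : std_ulogic;         -- pulsed ChTok, one bit time wide
begin
  -- High only during the first bit time in which ChTok is asserted
  ChTok_p <= ChTok and not ChTok_d;

  -- Equation 5.7: correct frame-level boundary
  ChFc <= ChSOF or ChTok_p or ChIFS;

  procTokDelay : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        ChTok_d <= '0';
      elsif can_clk_en = '1' then
        ChTok_d <= ChTok;
      end if;
    end if;
  end process procTokDelay;
end architecture rtl;
```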

Another set of extended monitoring signals can be defined, asserting both the presence of a

channel inaccessibility period, ChIna, and the end of such period, ChBidle:

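\[
Ch_{Bidle} \;\mapsto\;
\begin{cases}
\text{true} & \text{if } T(Ch_{Rx} = r) \geq T_{B} \\
\text{false} & \text{if } T(Ch_{Rx} = r) < T_{B} \vee Ch_{Rx} = d
\end{cases}
\tag{5.8}
\]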

where: TB is the normalised duration of the minimum bus idle period that identifies the absence

of any frame transmission, TB = 12 bit times. The assertion of ChBidle means the effects of the

last inaccessibility events have passed, and all the (pending) messages have been transferred,

thus leaving the channel in an idle state.


Lastly, the extended monitoring signal ChIna defines when a period of inaccessibility begins

and for how long its effects last. It is asserted upon the detection of an inaccessibility event and

negated upon the assertion of the ChBidle signal, as specified by:

\[
Ch_{Ina} \;\mapsto\;
\begin{cases}
\text{true} & \text{if } Ch_{Err} \\
\text{false} & \text{when } Ch_{Bidle}
\end{cases}
\tag{5.9}
\]

This signal is of paramount importance to the inaccessibility control mechanisms, for it indi-

cates the real inaccessibility effects. The ChIna signal can be supplied to the upper layers, notify-

ing the start and effective duration of inaccessibility events. Such information is useful for protocol

timeout management, allowing the extension of timers to account for inaccessibility periods.
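The ChBidle and ChIna signals of equations 5.8 and 5.9 can be illustrated by a recessive bit counter and a set/reset register. This is a minimal sketch under assumptions: the entity name and the counter-based idle detection are hypothetical (the prototype may rely on the sequence detection machinery instead), and '1' is taken as the recessive level.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Illustrative sketch only: entity name and counter-based detection are assumptions
entity ch_inaccessibility is
  generic (TB : positive := 12);               -- minimum bus idle period, in bit times
  port (
    sys_clk, rst_N, can_clk_en : in  std_ulogic;
    ChRx    : in  std_ulogic;                  -- '0' dominant, '1' recessive
    ChErr   : in  std_ulogic;
    ChBidle : out std_ulogic;
    ChIna   : out std_ulogic);
end entity ch_inaccessibility;

architecture rtl of ch_inaccessibility is
  signal rec_cnt : natural range 0 to TB := 0; -- consecutive recessive bits seen
  signal bidle_i : std_ulogic;
  signal ina_i   : std_ulogic := '0';
begin
  bidle_i <= '1' when rec_cnt = TB else '0';   -- equation 5.8
  ChBidle <= bidle_i;
  ChIna   <= ina_i;

  procIna : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        rec_cnt <= 0;
        ina_i   <= '0';
      elsif can_clk_en = '1' then
        if ChRx = '0' then
          rec_cnt <= 0;                        -- dominant bit: bus is not idle
        elsif rec_cnt < TB then
          rec_cnt <= rec_cnt + 1;              -- saturates at TB
        end if;
        if ChErr = '1' then
          ina_i <= '1';                        -- equation 5.9: inaccessibility begins
        elsif bidle_i = '1' then
          ina_i <= '0';                        -- effects cleared: bus is idle
        end if;
      end if;
    end if;
  end process procIna;
end architecture rtl;
```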

Another use for the ChIna signal is the signalling of an inaccessibility period, since ChIna is

asserted until the end of inaccessibility effects, i.e. assertion of ChBidle. Therefore we can use

ChIna to account for the number of inaccessibility incidents in this period, ChIi:

\[
Ch_{Ii}\uparrow_{Ch_{EOT}} \;\mapsto\;
\begin{cases}
Ch_{Ii} + 1 & \text{if } Ch_{Ina} \\
0 & \text{when } \neg Ch_{Ina}
\end{cases}
\tag{5.10}
\]

This counter is incremented whenever a correct frame transmission takes place during an in-

accessibility period. Such counting mechanisms can be mapped into VHDL by simple constructs.

The mapping of the ChIi signal is illustrated in Figure 5.3.

```vhdl
-- purpose: Count the number of inacc. events during an inacc. period
-- inputs : sys_clk, rst_N, can_clk_en, ChIna, ChEOT
-- outputs: ChIi
procInacEvtPer : process (sys_clk) is
  variable ChEOT_s : std_ulogic_vector(1 downto 0) := (others => '0');
begin  -- process procInacEvtPer
  if rising_edge(sys_clk) then
    if rst_N = '0' then
      ChIi    <= (others => '0');        -- Clear the event count upon reset
      ChEOT_s := (others => '0');
    else
      if can_clk_en = '1' then           -- Sync with the CAN network
        ChEOT_s := ChEOT_s(0) & ChEOT;   -- Shift values left, input ChEOT
        -- Has ChEOT just been asserted?
        if ChEOT_s(1) = '0' and ChEOT_s(0) = '1' then
          if ChIna = '1' then
            ChIi <= ChIi + 1;            -- Increment, ChIna is asserted
          end if;
        end if;
        if ChIna = '0' then
          ChIi <= (others => '0');       -- Clear the count
        end if;
      end if;
    end if;
  end if;
end process procInacEvtPer;
```

Figure 5.3: Inaccessibility Event Count description in VHDL


Integration of Basic Inaccessibility Control Mechanisms

The set of extended monitoring signals required to assist monitoring inaccessibility in a CAN

network is summarised in Figure 5.4.

Extended Channel Monitoring

| Signal | Function | Behaviour | Condition / Sequence |
|---|---|---|---|
| ChSOF | Start Of Frame | asserted at beginning of frame transmission; one bit-time duration | Condition: ChEOT ∧ ChRx = d |
| ChTok | Transmission Correct | asserted at the 1st bit of intermission; negated upon ChEOT | Sequence: ChRx = rdrrrrrrrrr |
| ChIFS | Frame Termination Correct | asserted at the 2nd bit of intermission; negated upon ChEOT | Sequence: ChRx = rdrrrrrrrrrr |
| ChBidle | Bus idleness | asserted after bus is idle for a certain threshold; negated upon detection of a dominant bit | Sequence: ChRx = rrrrrrrrrrrr |
| ChIna | Channel Inaccessibility Status | asserted upon ChErr; negated upon assertion of ChBidle | Condition: ChErr = true |

Figure 5.4: Extended Channel Monitoring signals

With the information of Figure 5.4 we can easily map these monitoring functions into sequence

detection and assertion machinery. Lastly, in Figure 5.5 the sequences pertaining to the

monitoring functions of Figure 5.4 are shown.

```vhdl
-- ChTok - Transmit Correct
constant seq_tok : std_ulogic_vector := "10111111111";

-- ChIFS - Interframe Spacing
constant seq_ifs : std_ulogic_vector := "101111111111";

-- ChBidle - Bus Idle
constant seq_chbidle : std_ulogic_vector := "111111111111";
```

Figure 5.5: Timeliness-related sequences mapped into VHDL

It can be noticed that the sequences pertaining to the ChTok and ChIFS signals are equal

up to the last bit of ChTok; ChTok is thus a sub-sequence of ChIFS, and therefore the

detection could be optimised. Although it might be tempting to perform this optimisation, it should be

left for the VHDL synthesiser. Most synthesisers recognise resources that may be shared, and

automatically perform the optimisation, which can be confirmed through the synthesis tool report.
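To illustrate the point, the sketch below describes both detections independently over one window of received bits, leaving any sharing of comparison logic to the synthesiser. It is a plain sliding-window description written for this illustration and not the ROM-based detector of Section 4.3; the entity name and the `tok_match` and `ifs_match` outputs are assumptions, and the outputs are one bit-time match pulses rather than the ChTok and ChIFS levels themselves.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Illustrative sketch only: sliding-window detection, names are assumptions
entity seq_detect_window is
  port (
    sys_clk, rst_N, can_clk_en : in  std_ulogic;
    ChRx      : in  std_ulogic;                -- '0' dominant, '1' recessive
    tok_match : out std_ulogic;                -- seq_tok seen in the last 11 bits
    ifs_match : out std_ulogic);               -- seq_ifs seen in the last 12 bits
end entity seq_detect_window;

architecture rtl of seq_detect_window is
  constant seq_tok : std_ulogic_vector(10 downto 0) := "10111111111";
  constant seq_ifs : std_ulogic_vector(11 downto 0) := "101111111111";
  -- Oldest bit at index 11, newest at index 0; starts as all-dominant
  signal window : std_ulogic_vector(11 downto 0) := (others => '0');
begin
  -- Each comparison is written on its own; resource sharing is left to the tool
  tok_match <= '1' when window(10 downto 0) = seq_tok else '0';
  ifs_match <= '1' when window = seq_ifs else '0';

  procShift : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        window <= (others => '0');
      elsif can_clk_en = '1' then
        window <= window(10 downto 0) & ChRx;  -- shift left, newest bit enters at index 0
      end if;
    end if;
  end process procShift;
end architecture rtl;
```

Since the window is registered, each match is visible one bit time after the last bit of the sequence has been sampled.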


5.2.3 Assessment of Inaccessibility Effects

The extension of the set of channel monitoring functions provides the basis for the assess-

ment and characterisation of inaccessibility periods and events. An inaccessibility period can be

composed of several inaccessibility events, e.g. two consecutive message omissions. The

assessment of the number of such events has been discussed previously. Therefore, the remaining

parameter that must be evaluated is the duration of these inaccessibility events and periods.

The mechanisms evaluating the duration of an inaccessibility event are defined by:

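\[
T_{e\_ina} =
\begin{cases}
T_{e\_ina} + T_{bit} & \text{if } \neg Ch_{EOT} \\
T_{e\_ina} & \text{if } Ch_{EOT} \\
0 & \text{if } Ch_{Fc}
\end{cases}
\tag{5.11}
\]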

The principle of operation of this mechanism is simple: time count is started when the ChSOF

signal is asserted, and is reset when: the transmission of a data/remote frame succeeds; the

transmission of a data/remote frame is correctly terminated by a minimum intermission period.

Should an inaccessibility event occur, Te_ina will hold its exact duration upon the assertion of

the ChEOT signal. The duration of a single inaccessibility event, Te_ina, is upper bounded by

Te_ina = 2160 Tbit (Tx. Fail in Figure 5.1). The mapping of this duration evaluation mechanism

into hardware is shown in Figure 5.6.

```vhdl
-- purpose: Assess the duration of the current inacc. event
-- inputs : sys_clk, rst_N, can_clk_en, ChFc, ChEOT
-- outputs: Te_ina
procInacTimeCount : process (sys_clk) is
begin  -- process procInacTimeCount
  if rising_edge(sys_clk) then          -- rising clock edge
    if rst_N = '0' then                 -- synchronous reset (active low)
      Te_ina <= (others => '0');        -- Clear the count upon reset
    else
      if can_clk_en = '1' then          -- Sync with the CAN network, maps Tbit
        if ChFc = '1' then              -- ChFc signal asserted
          Te_ina <= (others => '0');    -- Clear the count when starting a new event
        elsif ChEOT = '0' then          -- ChEOT not asserted
          Te_ina <= Te_ina + 1;         -- Increment by one Tbit
        end if;                         -- Default behaviour: keep the Te_ina value
      end if;
    end if;
  end if;
end process procInacTimeCount;
```

Figure 5.6: Inaccessibility duration evaluation description in VHDL

The VHDL description of Figure 5.6 shows the materialisation of the inaccessibility event

duration evaluation mechanism: the VHDL signal Te_ina is the mapping of the Te_ina quantity of

equation 5.11. The physical dimensioning of the signal's width depends on its upper bound. This

is an important parameter, in order to avoid using more resources than those strictly necessary,

thus enabling a compact design.
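As a worked example of this dimensioning, the bound of 2160 Tbit requires a 12-bit counter, since 2^11 = 2048 is not sufficient and 2^12 = 4096 is. The sketch below derives the width from the bound at elaboration time and flags a bound violation; the entity name, the `bits_for` helper and the `Te_ina_exceeded` output are assumptions introduced for illustration only.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch only: entity name, bits_for and Te_ina_exceeded are assumptions
entity te_ina_bounded is
  generic (TE_INA_MAX : natural := 2160);      -- upper bound, in bit times
  port (
    sys_clk, rst_N, can_clk_en : in  std_ulogic;
    ChFc, ChEOT                : in  std_ulogic;
    Te_ina_exceeded            : out std_ulogic);
end entity te_ina_bounded;

architecture rtl of te_ina_bounded is
  -- Smallest number of bits able to represent max_value
  function bits_for (max_value : natural) return positive is
    variable n : positive := 1;
  begin
    while 2**n <= max_value loop
      n := n + 1;
    end loop;
    return n;
  end function bits_for;

  -- One value above the bound must be representable to flag a violation
  constant TE_INA_WIDTH : positive := bits_for(TE_INA_MAX + 1);  -- 12 for 2160
  signal Te_ina : unsigned(TE_INA_WIDTH - 1 downto 0) := (others => '0');
begin
  Te_ina_exceeded <= '1' when Te_ina > TE_INA_MAX else '0';

  procBounded : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        Te_ina <= (others => '0');
      elsif can_clk_en = '1' then
        if ChFc = '1' then
          Te_ina <= (others => '0');
        elsif ChEOT = '0' and Te_ina <= TE_INA_MAX then
          Te_ina <= Te_ina + 1;                -- stops at TE_INA_MAX + 1
        end if;
      end if;
    end if;
  end process procBounded;
end architecture rtl;
```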


The total amount of time of consecutive inaccessibility periods should be accounted for, in

order to consolidate an overall inaccessibility time, Tina:

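\[
T_{ina}\uparrow_{Ch_{EOT}} =
\begin{cases}
T_{ina} + T_{e\_ina} & \text{if } Ch_{Ina} \\
T_{ina} & \text{if } Ch_{Bidle} \\
0 & \text{when mgmt. request}
\end{cases}
\tag{5.12}
\]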

The actions of equation 5.12 are executed only at the end of a network activity period, i.e.

upon ChEOT assertion. The information provided by this equation can be used by upper layers

for assessing the channel status, with regard to inaccessibility.

An inaccessibility epoch starts with a (possibly) correct frame transfer. The maximum duration

of such an epoch, Tp_ina, is dependent on CAN traffic patterns, i.e. network load. The duration,

however, must also be accounted for:

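\[
T_{p\_ina} =
\begin{cases}
T_{p\_ina} + T_{bit} & \text{if } \neg Ch_{EOT} \vee Ch_{Ina} \\
T_{p\_ina} & \text{if } Ch_{Bidle} \\
0 & \text{if } Ch_{Fc} \wedge \neg Ch_{Ina}
\end{cases}
\tag{5.13}
\]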

An inaccessibility epoch ends after all the effects of inaccessibility have been cleared, i.e. all

the pending frame transfers have succeeded.

The mapping of these mechanisms into hardware follows the same philosophy as the mapping

presented in Figure 5.6: a counter is defined, dimensioned according to the signal's upper bound

value; the actions upon that signal are mapped and defined by conditions.
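Following that philosophy, the accumulation of equation 5.12 could be sketched as shown below. The entity name, the `TE_WIDTH` and `TINA_WIDTH` generics and the `mgmt_clr` input (standing for the management request) are assumptions; the sketch also assumes that TINA_WIDTH is not smaller than TE_WIDTH and that Te_ina is stable when ChEOT is asserted.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch only: entity name, widths and mgmt_clr are assumptions
entity tina_accumulate is
  generic (
    TE_WIDTH   : positive := 12;               -- width of Te_ina
    TINA_WIDTH : positive := 24);              -- assumed width of Tina, >= TE_WIDTH
  port (
    sys_clk, rst_N, can_clk_en : in  std_ulogic;
    ChEOT, ChIna, mgmt_clr     : in  std_ulogic;
    Te_ina : in  unsigned(TE_WIDTH - 1 downto 0);
    Tina   : out unsigned(TINA_WIDTH - 1 downto 0));
end entity tina_accumulate;

architecture rtl of tina_accumulate is
  signal tina_i  : unsigned(TINA_WIDTH - 1 downto 0) := (others => '0');
  signal ChEOT_d : std_ulogic := '0';          -- ChEOT delayed by one bit time
begin
  Tina <= tina_i;

  procTina : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        tina_i  <= (others => '0');
        ChEOT_d <= '0';
      else
        if can_clk_en = '1' then
          ChEOT_d <= ChEOT;
          -- Act only when ChEOT has just been asserted
          if ChEOT_d = '0' and ChEOT = '1' and ChIna = '1' then
            tina_i <= tina_i + resize(Te_ina, TINA_WIDTH);
          end if;                              -- otherwise keep the Tina value
        end if;
        if mgmt_clr = '1' then                 -- management request clears the total
          tina_i <= (others => '0');
        end if;
      end if;
    end if;
  end process procTina;
end architecture rtl;
```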

5.3 Usefulness of Inaccessibility Control Mechanisms

The ultimate goal of inaccessibility control mechanisms is to use the real inaccessibility event

duration for protocol optimisation. This goal can be achieved through the previously defined

machinery, which interfaces with higher layer mechanisms. The information regarding the real duration

and effects of inaccessibility can be used at several layers of protocols. Such knowledge permits

fine-tuning network operation, instead of using conservative values, which may not yield optimum

performance of the global system.

The inaccessibility control method contemplated by the CANELy architecture is called inac-

cessibility flushing. This method avoids the need for an “accessibility test” message transmission,

to assess the availability of the network for conveying messages, i.e. its accessibility status. In-

stead, it uses the ChIna signal for assessing when the (distributed) frame transmission queue has

become empty, after the occurrence of an inaccessibility event [47, 11]. Such a method can be used

for upper layer timer management [49], thus allowing the incorporation of the real inaccessibility

effects as parameters of timeout-based protocols.


However, the specified mechanisms are also useful for clock-less protocols, i.e. protocols that

do not use timers. Figure 5.7 shows an example of such an optimisation, applied to a

diffusion-based protocol.

D-CAN: (optimised) Diffusion-based Protocol

```
Initialization
i01  ndup(mid) := 0; // number of duplicates, kept for each message

Sender
s10  when d-can.req(mid〈type,p,n〉, mess) invoked at p do
s11    if mess = NULL then
s12      can-rtr.req(mid);
s13    else
s14      can-data.req(mid, mess);
s15  od;
s16  when can-rtr.cnf(mid) or can-data.cnf(mid, mess) confirmed do
s17    deliver d-can.cnf (mid,mess);
s18  od;

Recipient
r00  when can-data.ind(mid, mess) received at q
r01    or can-rtr.ind(mid, mess=NULL) received at q do
r02    ndup(mid) := ndup(mid) + 1;
r03    if ndup(mid)= 1 then // new message
r04      d-can.ind (mid, mess);
a00      if ¬ChTok(mid) then
r05        if mess = NULL then
r06          can-rtr.req(mid); // clustered
r07        else
r08          can-data.req(mid, mess);
r09        fi;
a01      fi;
r10    elif ndup(mid) > j or ChTok(mid) then
r11      can-abort.req(mid);
r12    fi;
r13  od;
```

Figure 5.7: Optimised Diffusion-based protocol.

The protocol depicted in Figure 5.7, dubbed D-CAN, is a message diffusion protocol. Its pur-

pose is to avoid inconsistent message omissions, usually caused by errors affecting the last but

one bit of a frame transfer [36]. An initial solution to this problem was offered by the CANELy

architecture through the EDCAN protocol [11], where all correct nodes that received the message would

eagerly diffuse it to the network. Such message diffusion was done through a number of

consecutive messages specified by the inconsistency omission degree bound. Such a strategy, however,

has a non-negligible impact on bandwidth and network load: a message correctly received by all

nodes would still be diffused, up to the inconsistency omission degree bound.

The D-CAN protocol aims at reducing the bandwidth utilisation by assessing the communica-

tion channel state via specialised machinery. The ChTok signal provides the D-CAN protocol with

information for early termination of the message diffusion, since its assertion guarantees that all

correct nodes have received the message and that no message retransmission is due. The use

of the ChTok signal in the protocol is clearly identified in Figure 5.7.


5.4 Inaccessibility Control Unit

The final step is the encapsulation of the previously defined mechanisms and functions into a

single entity, named Inaccessibility Control Unit (ICU). This component is responsible for: monitoring

the CAN communication channel and assessing the existence of inaccessibility events and periods;

notifying the upper layers of the occurrence and duration of such events and periods; assisting

upper layer protocols with signals and inaccessibility time measures for optimum protocol timeout

definition. A block diagram is shown in Figure 5.8.

[Block diagram: Inaccessibility Detection, Event & Time Registers and Management Interface blocks; signals ChRx, ChFok, ChEOT, ChErr and Management]

Figure 5.8: Inaccessibility Control Unit block diagram

The component of Figure 5.8 is partitioned into three main blocks, encompassing the moni-

toring, evaluation and management interface mechanisms. It receives the basic set of channel

monitoring functions defined in Chapter 4, enabling their extension into more advanced monitoring

functions, for inaccessibility monitoring purposes. The management block provides the mecha-

nisms for interfacing the monitoring and evaluation components with the upper layers, conveying

the invocation and notification primitives between the parties.

System and Management Interfaces

The ICU must interface with a computational platform, in order to provide/receive information

necessary for the correct operation of the CANELy mechanisms. The set of initialisation parame-

ters and notifications is illustrated in Figure 5.9.

The set of primitives defined by the ICU system interface includes: parameter configuration,

such as bit rate; exception notification, such as the change in channel inaccessibility status; pa-

rameter extraction, such as the number of inaccessibility events.

The management interface is the component responsible for supporting the System Interface

primitives. It has been left as generic as possible, in order to allow interconnection with several

base technologies (see Section 4.6.2).


Invocation Primitives

| Description |
|---|
| Initialise (baud) |
| Get Channel status (ChIna) |
| Get Channel inaccessibility events (ChIi) |
| Get Channel normalised inaccessibility times (Tina, Tp_ina) |

Notification Primitives

| Description | Issuing Condition |
|---|---|
| Channel Status Change | ChIna |
| Channel Transmission Correct (mid) | ChTok |
| Channel Omission degree exceeded | ChOd > k |

Figure 5.9: CANELy Inaccessibility Control Unit management primitives

The ICU management interface has, however, one extra requirement: an Interrupt Request

(IRQ) facility capable of supporting multiple inputs, i.e. multiple IRQ sources. Such a facility would

provide adequate support for the signals asserted by the ICU upon certain events, which are

output to support timer management services and protocols [49, 47], together with

lower processing latency.

5.5 Summary

The provision of timely service is a property that must be secured for a real-time system. Such

property becomes harder to guarantee, when the system needs to perform Input/Output (I/O)

through a communication channel. This channel may suffer disturbances and be temporarily

inaccessible. Such disturbances introduce errors in the temporal domain, since they affect the

timeliness of the communication. The CAN Enhanced Layer (CANELy) architecture provides a set

of analytic results and mechanisms to address the issue of timeliness, in the realm of Controller

Area Networks.

This Chapter discussed the functions and mechanisms for supporting timely service in a

Controller Area Network (CAN) infrastructure, disturbed by the presence of errors or overload

conditions. These disturbances are mapped into the temporal domain as inaccessibility peri-

ods, during which the communication channel is temporarily unavailable. The CAN Enhanced

Layer (CANELy) architecture provides a set of low-level mechanisms to deal with inaccessibility

and its control.

The discussed mechanisms are partitioned into: monitoring, which assesses the state of the

channel w.r.t. inaccessibility, enabling upper layer support for inaccessibility assessment; evaluation,

which uses the former monitoring mechanisms to perform evaluation actions, both at the rate and

duration levels. These mechanisms and functions are integrated in a single entity, called Inacces-

sibility Control Unit (ICU), interfacing the remaining components of the CANELy architecture.


6 CANELy Mechanism and Prototype Engineering

Strive for perfection in everything you do. Take the best that exists and make it better. When it does not exist, design it.
SIR HENRY ROYCE

Materialising the CANELy low-level mechanisms and functions into hardware is the final step

in this journey. Such materialisation, however, must be performed in the most efficient way

possible, in order to fit a small FPGA device, thus being a cost-effective solution for enabling

highly dependable behaviour in both existing and future real-time CAN-based systems.

An implicit requisite to the materialisation of the mechanisms is the existence of a computing

platform. This computing platform integrates the FPGA along with other elements, such as a pro-

cessing element and memory buffers. The result of such integration is the enabling of a CANELy

node, suitable for implementation from the low-level mechanisms to the upper layers’ protocols.

This Chapter reports the engineering aspects of the mechanisms specified in the previous

Chapters. The simulated operation of the mechanisms is presented and discussed. The mapping

into an FPGA device is then described, along with the associated constraints. The FPGA resource

occupation is analysed, and compared with both freely- and commercially-available standard CAN

controller IP cores. Lastly, we document the specification and construction of a prototype board

having the resources for enabling an implementation of the CANELy architecture, both on hard-

ware and software aspects.

6.1 CANELy Mechanism Verification and Validation

A crucial step in digital system design is simulation. The purpose of simulation is to ensure

that the component hardware description behaves as intended, i.e. it follows the specifications

and provides correct outputs for the tested inputs. This simulation action must be provided with

a sensible set of input values, which will generate a set of output values. These values are then

compared with the expected output, thus performing the verification of the component’s hardware

description correctness.


The designed mechanisms are divided in hierarchical blocks, called components, which can

be composed, e.g. a Basic Channel Monitoring component providing signals to generate the

Extended Channel Monitoring signals for inaccessibility parameter evaluation. Each component

has a companion testbench, which exercises the component and provides information about

operation correctness. This modular approach makes it possible to ensure the correctness of all the

components, even before integrating them into more complex components. The simulations were performed

with Mentor Graphics ModelSim 6.5d for Linux, and the CAN bus data was generated by

a specially crafted programme, being interfaced with the components via a custom-made VHDL

component emulating the several incoming media (see Appendix B).

6.1.1 Media Selection Unit

A simulated fragment of the Media Selection Unit (MSU) operation is shown in Figure 6.1. This

simulation fragment shows the transmission of three CAN frames, with one of them being affected

by an error, signalled by an error frame and corresponding ChErr assertion. The affected frame

is then retransmitted, and the omission degree of each medium is evaluated.

[Simulation waveform annotated with: Frame Mismatch, Medium Od Incremented, Channel Error, Medium Od Maintained]

Figure 6.1: Media Selection Unit simulation fragment

The ChRx signal of Figure 6.1 is the channel’s incoming bit stream, recovered by the AND-

based media selection function from the several incoming media, M_Rx(m). The figure also shows the

operation of basic channel monitoring mechanisms, e.g. ChErr or ChEOT, which are fundamental for

determining the status of the CAN bus.

This fragment also shows the several types of errors and medium omission degree evaluation

mechanisms: a masked medium error, which does not affect the channel, with the corresponding increment

of the affected medium's omission degree, Od(1); an unmasked error affecting both media, for which

the corresponding omission degrees are maintained; another masked error affecting one medium, with

the consequent Od(1) increment.


6.1.2 Inaccessibility Control Unit

The Inaccessibility Control Unit (ICU) component was also simulated. A fragment from the

corresponding simulation is presented in Figure 6.2. This fragment shows the transmission of

frames, disturbed by both an error and overload conditions. Both these events are transformed

into inaccessibility periods.

[Simulation waveform annotated with: Channel Error, Channel Overload, Inaccessibility Period, Total Inac. Event Count Incremented, Period Inac. Event Count Cleared]

Figure 6.2: Inaccessibility Control Unit simulation fragment

This fragment shows the operation of the several inaccessibility parameter evaluation mecha-

nisms, such as inaccessibility duration counters, and also the assessment of the type and amount

of inaccessibility events. This fragment only shows the channel, ChRx, since this is the signal that

may be affected by inaccessibility incidents.

The first channel error is an omission error, leading to the increment of both the ChIe and

ChOe counts, mapped as ChIe and ChOe in Figure 6.2. The second channel error results from

an overload condition, incrementing only the ChIe quantity (see Section 5.2.1, page 63), while

maintaining the ChOe. The total (normalised) inaccessibility time, Tina (Tina in Figure 6.2),

increases monotonically only at the end of an inaccessibility event, being incremented by the amount

representing the duration of the (finished) inaccessibility period.

6.2 FPGA Mechanism Engineering

The implementation of the mechanisms in an FPGA device has several constraints. The

first and foremost stems from finite resource availability, especially register elements (flip-flops).

Therefore the mapping of the machinery into hardware must be made extremely efficient, in order

to occupy the least resources.

The VHDL synthesis tool utilised to map the mechanisms into hardware was Xilinx XST 11.4

for Linux. Default optimisation options were used, save for the main optimisation goal, which was

selected as “Area” over the default “Speed”.


The target FPGA device for place & route actions was Xilinx’s Spartan-3E, which is part of the

Spartan-3 device family. This device family was chosen due to its low cost nature, together with

high longevity and adequate I/O resources. The devices used for comparison were: Spartan-3E

XC3S500E and XC3S100E, having an equivalent capacity of 500k and 100k logic gates respec-

tively; Spartan-3 XC3S50, with an equivalent capacity of 50k gates. These devices are architec-

turally identical, meaning that the end result of synthesising a design for any of the three will show

the same resource usage.

Sequence Detection Machinery

The problem of efficient sequence detection machinery has been discussed previously (see

Section 4.3). In order to assess the efficiency of the proposed mechanism, a comparison must be

made between the proposed ROM-based and the Sliding Window sequence detectors, regarding

FPGA resource utilisation. The comparison results are illustrated in Figure 6.3.

The synthesis process was performed for a set of sensible sequence length values, i.e. varying the tolerance margin errstuck←rx(bus) of the stuck-at-dominant detection machinery between its lower and upper bounds. The metric used for comparison is the slice, which is the resource effectively occupied and which combines sequential (Flip-Flop, FF) and combinatorial (Look-Up Table, LUT) elements.

Figure 6.3: Sequence detection description resource occupation

The results presented in Figure 6.3 show a clear advantage of the proposed sequence detection architecture with regard to resource utilisation. The nearly constant number of used resources is explained by the number of sequential elements needed for ROM addressing purposes, which remains constant for sequences having a length between two consecutive powers of two, i.e. $2^{n} \le l < 2^{n+1}$, where n is the number of flip-flops and l is the length of the sequence.

The resource usage of the sequence detection machinery is critical, since this mechanism is a

cornerstone in the mapping of most monitoring functions into hardware. Furthermore, each repli-

cated medium has their own monitoring mechanisms, which implies multiple (parallel) instances

of the sequence detection machinery.
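The difference in storage between the two detector styles can be illustrated with a small behavioural comparison. The Python sketch below is not the architecture of Section 4.3: it contrasts a generic shift-register (sliding window) detector with a generic counter-addressed detector, and all function names are hypothetical. The restart rule of the counter-based model is a simplification that is adequate for runs of identical bits, such as the six consecutive dominant bits of a CAN error flag, but not for arbitrary patterns.

```python
from collections import deque


def sliding_window_detect(bits, sequence):
    """Shift-register model: stores the last len(sequence) bits, one element per bit."""
    window = deque(maxlen=len(sequence))
    hits = []
    for i, bit in enumerate(bits):
        window.append(bit)
        if tuple(window) == tuple(sequence):
            hits.append(i)
    return hits


def counter_detect(bits, sequence):
    """Counter-addressed model: stores only the position reached in the sequence."""
    pos = 0
    hits = []
    for i, bit in enumerate(bits):
        if bit == sequence[pos]:
            pos += 1
        else:
            # Simplified restart: valid for runs of identical bits, not for any pattern.
            pos = 1 if bit == sequence[0] else 0
        if pos == len(sequence):
            hits.append(i)
            pos = 0  # restart, so overlapping occurrences are not reported
    return hits


def storage_bits_sliding_window(length):
    """One storage element per bit of the window."""
    return length


def storage_bits_counter(length):
    """Bits needed by a position register that takes values 0..length."""
    return length.bit_length()


if __name__ == "__main__":
    error_flag = (0,) * 6                      # six dominant bits
    rx = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
    print(sliding_window_detect(rx, error_flag), counter_detect(rx, error_flag))
    for length in (6, 7, 8, 15, 16):
        print(length, storage_bits_sliding_window(length), storage_bits_counter(length))
```

The storage of the counter-based model changes only when the sequence length crosses a power of two, whereas the window grows by one element per additional bit.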


6.2.1 Media Selection Unit

The Media Selection Unit (MSU) is the core component enabling network dependability func-

tions and mechanisms. It provides the: media redundancy management functions; channel and

media monitoring functions; channel and media error confinement functions. This component can

be parametrised at design time with the several significant parameters, e.g. the number of repli-

cated media. Synthesising this unit using conservative values [11] and for a dual-media network

yields the results shown in Table 6.1.

Table 6.1: Media Selection Unit FPGA resource occupation

Media Selection Unit Mechanisms

Device      Flip-Flops   LUTs   Slices (absolute)   Slices (relative, %)
XC3S500E    121          228    148                  3.2
XC3S100E    121          228    148                 15.4
XC3S50      121          228    148                 19.3

The relative slice metric is presented with the purpose of comparing it with the total amount

of slices in each FPGA device, thus providing an easy method for assessing the resource usage.

The numbers shown in Table 6.1 allow us to conclude that the MSU can be perfectly fitted in even

a small sized FPGA (XC3S50), as intended.

6.2.2 Inaccessibility Control Unit

The Inaccessibility Control Unit (ICU) is the core component of the timeliness related functions

and mechanisms. It provides the upper layers with a set of extended channel monitoring func-

tions, together with the assessment of: channel inaccessibility status; inaccessibility event rate;

inaccessibility event duration. The synthesis of these mechanisms with sensible parameters [11]

yields the results presented in Table 6.2.

Table 6.2: Inaccessibility Control Unit FPGA resource occupation

Inaccessibility Control Unit Mechanisms

Device      Flip-Flops   LUTs   Slices (absolute)   Slices (relative, %)
XC3S500E    81           78     73                  1.6
XC3S100E    81           78     73                  7.6
XC3S50      81           78     73                  9.5

The numbers from Table 6.2 allow us to conclude that these mechanisms can also be comfortably fitted even in a small FPGA. These numbers depend on the design-time parameters that configure the bounds of the evaluation counters, e.g. Tp_ina. Their variation, however, is not significant, since a larger bound translates into only a few more bits, i.e. a few more slices.
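The relative figures in Tables 6.1 and 6.2 can be reproduced with a few lines of arithmetic. The Python sketch below assumes the total slice counts of each device (4656, 960 and 768), which are taken from the Xilinx data sheets and are not stated in the text above. The function names are illustrative only.

```python
# Total slices per device: data sheet figures, assumed here (not given in the tables).
TOTAL_SLICES = {"XC3S500E": 4656, "XC3S100E": 960, "XC3S50": 768}

# Absolute slice usage reported in Tables 6.1 and 6.2.
USED_SLICES = {"MSU": 148, "ICU": 73}


def relative_usage(used, device):
    """Slice usage as a percentage of the device total, rounded to one decimal."""
    return round(100.0 * used / TOTAL_SLICES[device], 1)


def combined_usage(device):
    """Percentage of slices taken by one MSU plus one ICU, without resource sharing."""
    return relative_usage(USED_SLICES["MSU"] + USED_SLICES["ICU"], device)


if __name__ == "__main__":
    for device in TOTAL_SLICES:
        print(device,
              relative_usage(USED_SLICES["MSU"], device),
              relative_usage(USED_SLICES["ICU"], device),
              combined_usage(device))
```

The combined figure is a simple sum of slices and ignores any placement or routing overhead, so it should be read as a rough indication only.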


6.2.3 Resource usage comparison

The resource usage of the CANELy components must also be compared with that of a standard CAN controller. The integrated (MSU and ICU) CANELy core is compared against two standard CAN cores: the free OpenCores CAN [50] and the commercial Xilinx LogiCORE XPS [51]. Both CAN controllers were configured with the same operating parameters: a 64-message-deep FIFO and 3 acceptance filters. The results are shown in Figure 6.4 for the several FPGA resource types.

Figure 6.4: CANELy vs CAN Cores resource usage comparison

A significant part of the MSU flip-flop usage is allocated to recovering and storing the message identifier, mid. Therefore, the integration of CANELy mechanisms with a CAN controller IP core might provide interesting results, both by further lowering resource usage through the sharing of common mechanisms and by providing access to other CAN machinery, such as error counters. This machinery can be used to design extended quarantine mechanisms.

Lastly, we compare the relative resource usage in a design integrating both the CANELy com-

ponents and the CAN controllers. The results are shown in Figure 6.5. This integration does not

involve any optimisation nor resource sharing.

[Chart; values recovered from the figure: 82.2%, 85.1%, 11.9%, 10.0%]

Figure 6.5: CANELy vs CAN Cores relative slice usage


6.3 CANELy Prototype Board

The last issue pertaining to the engineering of CANELy is the integration of all the fundamental hardware components that define a CANELy node. These components must be integrated in a computing platform that complies with the requirements presented in Chapter 3. The components are: a processing element, for the execution of higher layer services and management functions; a CAN controller, providing the standard CAN layer; and an FPGA device, supporting the materialisation of the mechanisms discussed in the previous Chapters.

6.3.1 Architecture

Before implementing a design satisfying the CANELy requirements, we must define an architecture. This architecture is a general view of the system, making explicit the main blocks and their interconnection. Such a view is presented in Figure 6.6.

[Block diagram; labels: uController, Bootloader, FLASH Memory, SRAM, FPGA, Address Bus, Data Bus, UART, RS-232, I/O, CAN, CAN Transceivers, Dual-media CAN bus]

Figure 6.6: CANELy Prototype Board block diagram

The prototype board is composed of: a microcontroller, which provides the computation support for software execution, such as the CANELy protocols and services; an FPGA, for the implementation of the CANELy low-level mechanisms discussed in the previous Chapters; and memory, both volatile (Static Random-Access Memory (SRAM)) and non-volatile (FLASH). Regarding software, the board provides a bootloader, which is used for diagnostic functions and for in-system programming of user code.

Although the CANELy architecture focuses on serving as the network building block of other

(advanced) computing platforms, it does not preclude a self-contained system, executing the ap-

plication directly. Examples of such systems comprise intelligent sensors and actuators, which

usually require: processing element; networking element and I/O capabilities, all in the same

computing platform.


6.3.2 Prototype Implementation

The materialisation of the architecture is the next step, achieved by selecting and interconnect-

ing the several hardware components. The final result is the prototype board shown in Figure 6.7.

[Annotated board photograph; labels: FPGA, Microcontroller, Dual-MAC, Reliable Comm. Protocol Suite, Layer Management, Dual-CAN (optional), Management Interface, CANELy Functions, Control of Inaccessibility, CAN Monitoring, AND-based Media Redundancy, Channel Interface, Media Interfaces, Cable Connectors]

Figure 6.7: CANELy Prototype Board

The current CANELy node prototype is composed of the basic elements necessary for its materialisation: a Maxim/Dallas DS80C390 microcontroller, with an optimised 8051 core and two CAN 2.0B controllers with 15 message centres each; a Xilinx Spartan-3E XC3S500E FPGA, with an equivalent capacity of 500k logic gates; and Maxim MAX13050 standard CAN transceivers. The interface between the low-level mechanisms embedded in the FPGA and the microcontroller is done through memory-mapped I/O, exploiting the parallel data and address buses.

The current FPGA device still has resources for another pair of MSU and ICU components (see Tables 6.1 and 6.2). This enables a dual-CAN channel/quad-media solution, providing even higher dependability and timeliness guarantees due to the fully space-redundant architecture.

6.4 Summary

The engineering of CAN Enhanced Layer (CANELy) low-level mechanisms demands high

resource utilisation efficiency, in order to provide a low cost solution. Furthermore, these mecha-

nisms are only a part of the CANELy architecture, which is materialised by a CANELy node.

This Chapter reported the materialisation of the proposed CANELy mechanisms, showing how

they can be made effective, resource-wise. This opens room for the integration of the CANELy

low-level components in both currently deployed and newly designed Controller Area Network

(CAN)-based real-time applications, through the addition of a low-cost FPGA device.


7 Conclusions and Future Work

Every new beginning comes from some other beginning’s end.
SENECA

The Controller Area Network (CAN) fieldbus is a widely deployed technology, being used in

domains as diverse as automotive, home automation and robotics. The CAN Enhanced Layer

(CANELy) architecture was designed around the standard CAN layer, in order to enhance it and

attain high levels of dependability.

The construction of highly dependable architectures is duly justified by the characteristics that standard CAN already possesses, both w.r.t. bus operation and physical aspects, such as cabling and transceivers. These are the very same characteristics that make the CAN bus desirable to new domains, such as aerospace and deep-sea oil drilling, or even more common applications such as trash collecting trucks. The common denominator in all these applications is the need for dependable service, even in the presence of disturbances.

This work discussed the mapping of the CAN Enhanced Layer (CANELy) architecture’s low-

level mechanisms into hardware. These low-level mechanisms pertain essentially to the depend-

ability of the network, both on the spatial and temporal domains, performing bus monitoring func-

tions for error detection and confinement.

The main contribution of this work is an area-effective description of the supporting foundations of the CANELy architecture, thus opening room for a complete functional CANELy node implementing all the layers envisaged by the architecture. Another useful result stemming from this work is the analysis of FPGA-effective mechanisms for “on-line” processing of bit-serial protocols. The final result was a proposed sequence detection strategy that, for this specific type of sequences, outperformed conventional methods.

The CANELy architecture mechanisms can be effectively mapped onto cost-effective PLD de-

vices, such as FPGAs, and therefore enhance currently deployed CAN applications at a low cost.

Furthermore, the FPGA devices also add a new dimension to the applications, stemming from the

dependability attributes: maintainability. Unlike ASIC components, the use of reconfigurable logic

devices allows the extension of functions, thus providing the necessary support for the inclusion

of new mechanisms.


A medium exhibiting more omission errors than those allowed by its omission degree must be

declared failed. After this procedure, however, one question still remains: how and when should

it be reactivated? The provision of medium quarantine techniques must aim at answering these

questions, through both models and monitoring mechanisms. An essential work basis for such

service is a (stochastic) model characterising the errors affecting a CAN network. The CANELy

architecture would benefit greatly from such a service, which would enrich it with attributes related

with robustness and adaptability. These attributes are in demand by applications where maintenance and repair actions cannot be carried out by a human agent, at least in a timely fashion, e.g. manned and unmanned spaceflight, deep-space probing missions or, to a lesser extent, more common unmanned or remotely operated applications, such as Unmanned Aerial Vehicles (UAVs) and Remotely Operated Vehicles (ROVs).

Another aspect of fault-tolerant communication is a bus guardian service to prevent babbling idiot faults, where a node communicates in an arbitrary fashion, either due to failed circuitry, a drifting clock signal or even a malicious application. The theoretical foundations of such a service in CAN are already laid [52], including an analysis of how this service could be provided. Given the available resources in the CANELy architecture, a (quasi-)independent bus guardian can be provided with partitioned FPGA machinery and an external clock signal source. The separate clock source avoids common-mode failures, such as a failed clock device or an unperceived drifting clock signal, which would mislead the bus guardian because the signals would remain within limits relative to each other. These issues should be investigated further.

Another research direction involves the explicit modification of the CAN standard. This ap-

proach has only been recently considered, due to the availability of low-cost FPGA devices hav-

ing enough resources to implement multiple standard-compliant CAN controllers, and at the same

time the availability of CAN IP cores [53, 50]. The CANELy architecture would benefit from this

integration, since it could reuse machinery already provided by the CAN controller, such as: net-

work bit synchronisation; bit-destuffing and message identifier recovery; and at the same time

access the CAN protocol FSM, thus rendering some monitoring functions more effective.


Bibliography

[1] G. C. Buttazzo, Hard Real-time Computing Systems: Predictable Scheduling Algorithms And

Applications (Real-Time Systems Series). Santa Clara, CA, USA: Springer-Verlag TELOS,

2004.

[2] M. Pignol, “COTS-based applications in space avionics,” in Design, Automation Test in

Europe Conference Exhibition (DATE 2010), 8-12 2010, pp. 1213–1219.

[3] L. D. Alford, Jr., “The problem with aviation COTS,” IEEE Aerospace and Electronic Systems Magazine, vol. 16, no. 2, pp. 33–37, Feb. 2001.

[4] R. Black and M. Fletcher, “Open systems architecture - both boon and bane,” in Proceedings

of the 25th IEEE/AIAA Digital Avionics Systems Conference (DASC’06), Oct. 2006, pp. 1–7.

[5] CAN Specification Version 2.0, Robert Bosch GmbH, Sep. 1991.

[6] International Standard 11898 - Road vehicles - Controller Area Network (CAN) Part 1: Data

link layer and physical signalling, ISO Std., Dec. 2003.

[7] General Standardization of CAN (Controller Area Network) for Airborne Use, Airlines Electronic Engineering Committee (AEEC) Std. ARINC Specification 825-1, May 2010.

[8] “ECSS draft standard ECSS-E-ST-50-15C - recommendations for CAN bus in spacecraft on-board applications,” ECSS Draft, European Cooperation for Space Standardization (ECSS), May 2005.

[9] P. W. Fortescue, J. P. W. Stark, and G. Swinerd, Eds., Spacecraft Systems Engineering,

3rd ed. Wiley, 2003.

[10] Alcatel Alenia Space, “AURORA avionics architecture,” Alcatel Alenia Space, Tech. Rep.,

2005.

[11] J. Rufino, “Computational system for real-time distributed control,” Ph.D. dissertation,

Technical University of Lisbon - Instituto Superior Técnico, Lisboa, Portugal, Jul. 2002.

[Online]. Available: http://dario.di.fc.ul.pt/downloads/PhD-THESIS.pdf


[12] “Spartan-3E FPGA family data sheet,” Xilinx Inc., Aug. 2009. [Online]. Available:

http://www.xilinx.com/support/documentation/data_sheets/ds312.pdf

[13] D. Flynn, “AMBA: enabling reusable on-chip designs,” IEEE Micro, vol. 17, no. 4, pp. 20–27,

Jul./Aug. 1997.

[14] GRLIB IP Library User’s Manual, Aeroflex Gaisler AB. [Online]. Available: http://www.gaisler.com/

[15] J. Rufino, R. Pinto, and C. Almeida, “A FPGA-based solution for enforcing dependability and

timeliness in CAN,” in Proceedings of the 2007 IP Based Electronic System (IP’07), Grenoble,

France, Dec. 2007.

[16] ——, “FPGA-based engineering of bus media redundancy in CAN,” in Proceedings of the

12th International CAN Conference (iCC’08), Barcelona, Spain, Mar. 2008.

[17] R. Pinto, J. Rufino, and C. Almeida, “CANELy prototype board schematic specification,”

FCUL/IST, Tech. Rep. DARIO RT-05-04, Dec. 2005.

[18] ——, “Specification and engineering of the CANELy prototype board,” FCUL/IST, Tech. Rep.

DARIO RT-06-06, Oct. 2006.

[19] J. Rufino, R. Pinto, and C. Almeida, “How to enforce dependability and timeliness in

CANELy?” FCUL/IST, Tech. Rep. DARIO RT-07-02, Jul. 2007.

[20] P. Verissimo and L. Rodrigues, Distributed Systems for System Architects. Norwell, MA,

USA: Kluwer Academic Publishers, 2001.

[21] H. Kopetz, A. Ademaj, P. Grillinger, and K. Steinhammer, “The time-triggered ethernet (TTE) design,” in Proceedings of the 8th IEEE International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC’05). Washington, DC, USA: IEEE Computer Society, 2005, pp. 22–33.

[22] IEEE Standard for Information Technology–Telecommunications and Information Exchange

Between Systems–Local and Metropolitan Area Networks–Specific Requirements Part 3:

Carrier Sense Multiple Access With Collision Detection (CSMA/CD) Access Method and

Physical Layer Specifications - Section One, IEEE Std. 802.3-2008, Dec. 2008.

[23] S. Parkes and P. Armbruster, “SpaceWire: Spacecraft onboard data-handling network,” Acta

Astronautica, vol. 66, no. 1-2, pp. 88–95, 2010.

[24] IEEE Standard for Heterogeneous Interconnect (HIC) (Low-Cost, Low-Latency Scalable

Serial Interconnect for Parallel System Construction), IEEE Std. 1355-1995, Sep. 1995.


[25] Space engineering: SpaceWire — Links, nodes, routers and networks, ECSS Std. ECSS-E-

ST-50-12C, Jul. 2008.

[26] M. D. May, P. W. Thompson, and P. H. Welch, Eds., Networks, Routers and Transputers:

Function, Performance and Applications. Amsterdam, The Netherlands: IOS Press, 1993.

[Online]. Available: http://wotug.ukc.ac.uk/parallel/www/nrat.html

[27] A. Woodroffe and P. Madle, “Application and experience of CAN as a low cost OBDH bus

system,” in Proceedings of the 2004 Data Systems In Aerospace Conference (DASIA’04),

Aug. 2004.

[28] F. Tortosa López, P. Roos, L. Stagnaro, C. Plummer, and B. Storni, “The CAN bus in

spacecraft on board applications,” in Proceedings of the 2004 Data Systems In Aerospace

Conference (DASIA’04), Aug. 2004.

[29] F. Tortosa López, G. Furano, A. J. Winton, M. Montagna, M. Caramia, B. Dean, and

M. Bhana, “CAN bus on ExoMars,” in Proceedings of the 2009 Data Systems In Aerospace

Conference (DASIA’09), Istanbul, Turkey, May 2009.

[30] H. Hilmer, H.-D. Kochs, and E. Dittmar, “A fault-tolerant communication architecture for

real-time control systems,” in Proceedings of the IEEE International Workshop on Factory

Communication Systems (WFCS’97), Oct. 1997, pp. 111–118.

[31] L.-B. Fredriksson, “CAN for critical embedded automotive networks,” IEEE Micro, vol. 22,

no. 4, pp. 28–35, 2002.

[32] H. Sivencrona, T. Olsson, R. Johansson, and J. Torin, “RedCAN: Simulations of two fault

recovery algorithms for CAN,” in Proceedings of the 10th IEEE Pacific Rim International

Symposium on Dependable Computing (PRDC’04). Washington, DC, USA: IEEE Computer

Society, 2004, pp. 302–311.

[33] J. R. Pimentel and J. A. Fonseca, “FlexCAN: A flexible architecture for highly dependable

embedded applications,” in Proceedings of the 3rd International Workshop on Real-Time

Networks (RTN 2004), Catania, Italy, Jul. 2004.

[34] J. R. Pimentel and J. Kaniarz, “A CAN-based application level error detection and fault

containment protocol,” in Proceedings of the 11th IFAC Symposium on Information Control

Problems in Manufacturing (INCOM’04), Salvador, Brazil, Apr. 2004.

[35] D. Powell, D. Seaton, D. Bonn, P. Veríssimo, and F. Waeselynck, “The Delta-4 approach

to dependability in open distributed computing systems,” in Proceedings of the 18th IEEE

International Symposium on Fault-Tolerant Computing (FTCS-18), Jun. 1988, pp. 246–251.


[36] J. Rufino, P. Verissimo, G. Arroz, C. Almeida, and L. Rodrigues, “Fault-tolerant broadcasts

in CAN,” in Digest of Papers of the 28th Annual International Symposium on Fault-Tolerant

Computing (FTCS’98), 23-25 1998, pp. 150–159.

[37] L. Lamport, “Time, clocks, and the ordering of events in a distributed system,” Commun.

ACM, vol. 21, no. 7, pp. 558–565, 1978.

[38] L. Rodrigues, M. Guimarães, and J. Rufino, “Fault-tolerant clock synchronization in CAN,” in

Proceedings of the 19th Real-Time Systems Symposium (RTSS’98). Madrid, Spain: IEEE,

Dec. 1998, pp. 420–429.

[39] “DS80C390 dual-CAN high-speed microprocessor,” Maxim/Dallas Semiconductors, Nov.

2005. [Online]. Available: http://datasheets.maxim-ic.com/en/ds/DS80C390.pdf

[40] “Stellaris LM3S2965 microcontroller,” Texas Instruments Incorporated, Sep. 2010.

[41] “MAX13050 industry-standard high-speed CAN transceiver,” Maxim Integrated Products, Feb. 2005. [Online]. Available: http://datasheets.maxim-ic.com/en/ds/MAX13050-MAX13054.pdf

[42] “MCP2551 high-speed CAN transceiver,” Microchip Technology Inc., 2003. [Online].

Available: http://ww1.microchip.com/downloads/en/DeviceDoc/21667f.pdf

[43] CiA Draft Standard 102 - CAN physical layer specification for industrial applications, CAN in

Automation, Feb. 2010.

[44] CiA Draft Standard 303, Part 1 - Cabling and connector pin assignment, CAN in Automation,

Dec. 2009.

[45] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, “Basic concepts and taxonomy of dependable and secure computing,” IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11–33, Jan.–Mar. 2004.

[46] P. J. Ashenden, The Designer’s Guide to VHDL, 3rd ed. San Francisco, CA, USA: Morgan

Kaufmann Publishers Inc., 2008.

[47] J. Rufino, P. Veríssimo, G. Arroz, and C. Almeida, “Control of inaccessibility in CANELy,”

in Proceedings of the 6th. International Workshop on Factory Communication Systems

(WFCS’06). Torino, Italy: IEEE, Jun. 2006, pp. 35–44.

[48] P. Verissimo and J. Marques, “Reliable broadcast for fault-tolerance on local computer networks,” in Proceedings of the 9th Symposium on Reliable Distributed Systems (SRDS’90), Oct. 1990, pp. 54–63.


[49] J. Rufino, P. Veríssimo, C. Almeida, and G. Arroz, “Integrating inaccessibility control and

timer management in CANELy,” in Proceedings of the 11th IEEE International Conference

on Emerging Technologies and Factory Automation (ETFA 2006). Prague, Czech Republic:

IEEE, Sep. 2006, pp. 348–355.

[50] I. Mohor, CAN Protocol Controller IP core, OpenCores, Nov. 2004. [Online]. Available:

http://opencores.org/project,can

[51] “LogiCORE IP XPS Controller Area Network (CAN),” Xilinx Inc., Jul. 2010. [Online].

Available: http://www.xilinx.com/support/documentation/ip_documentation/xps_can.pdf

[52] I. Broster and A. Burns, “An analysable bus-guardian for event-triggered communication,”

in Proceedings of the 24th IEEE International Real-Time Systems Symposium (RTSS’03),

Cancun, Mexico, Dec. 2003, pp. 410–419.

[53] HurriCANe - Controller Area Network IP core User’s Manual, European Space Agency, Sep. 2007. [Online]. Available: http://microelectronics.esa.int/core/ipdoc/can524_user_manual.pdf


A VHDL Snippets

A.1 Sequence detection machinery and mapped sequences

Listing A.1: Sequence detector VHDL instantiation template

entity sequence_detector is
  generic (
    sequence : std_ulogic_vector := "101111111010101");  -- sequence to be detected

  port (
    sys_clk     : in  std_ulogic;   -- Clock
    can_clk_en  : in  std_ulogic;   -- Bit-sampling enable, to sync. with CAN network
    rst_N       : in  std_ulogic;   -- Reset, active low
    data        : in  std_ulogic;   -- Data input
    Sequence_Ok : out std_ulogic);  -- Sequence Ok
end sequence_detector;

Listing A.2: ChFok signal specification

-----------------------------------------------------------------------------
-- Frame OK Sequence Detection
instSequenceDetectorFOk : entity work.sequence_detector(SHIFT_REGISTER)
  generic map (
    sequence => seq_fok)
  port map (
    sys_clk     => sys_clk,
    rst_N       => rst_N,
    can_clk_en  => can_clk_en,
    data        => rx,
    Sequence_Ok => fok_int);
-----------------------------------------------------------------------------
-- fok signal assertion
-----------------------------------------------------------------------------
-- purpose: Assert the "Frame Ok" signal
-- type   : sequential
-- inputs : sys_clk, rst_N, can_clk_en, fok_int
-- outputs: fok
pFok : process (sys_clk) is
begin  -- process pFok
  if rising_edge(sys_clk) then          -- rising clock edge
    if rst_N = '0' then                 -- synchronous reset (active low)
      fok <= '0';
    else
      if can_clk_en = '1' then          -- Sync with bit-times
        if fok_int = '1' then
          fok <= '1';
        elsif eot_int = '1' then
          fok <= '0';
        end if;
      end if;
    end if;
  end if;
end process pFok;


Listing A.3: ChErr signal specification

-----------------------------------------------------------------------------
-- Error Frame Sequence Detection
instSequenceDetectorErr : entity work.sequence_detector(ROM_MEM)
  generic map (
    sequence => seq_err)
  port map (
    sys_clk     => sys_clk,
    rst_N       => rst_N,
    can_clk_en  => can_clk_en,
    data        => rx,
    Sequence_Ok => err_int);
-----------------------------------------------------------------------------
-- err signal assertion
-----------------------------------------------------------------------------
-- purpose: Assert an error signal detection
-- type   : sequential
-- inputs : sys_clk, rst_N, can_clk_en, seq_err, eot_int
-- outputs: err
pError : process (sys_clk) is
begin  -- process pError
  if rising_edge(sys_clk) then          -- rising clock edge
    if rst_N = '0' then                 -- synchronous reset (active low)
      err <= '0';
    else
      if can_clk_en = '1' then          -- Sync with bit-times
        if eot_int = '1' then
          err <= '0';
        elsif err_seq = '1' then
          err <= '1';
        end if;
      end if;
    end if;
  end if;
end process pError;

A.2 Omission Monitoring and Control

Listing A.4: Medium Omission Fault Detection

-----------------------------------------------------------------------------
-- Mismatch Vector
-----------------------------------------------------------------------------
genMisVector : for m in 1 to NumberMedia generate
  MismatchVector(m) <= MediaStatusRD(m).fm;
end generate genMisVector;

-----------------------------------------------------------------------------
-- MFm-s
-----------------------------------------------------------------------------
mismatch <= '0' when MismatchVector = (MismatchVector'range => '0') else '1';

-----------------------------------------------------------------------------
-- Omission error
-----------------------------------------------------------------------------
-- purpose: a
-- type   : combinational
-- inputs : ChFok, MismatchVector
-- outputs: OmissionVector
procMOerr : process (ChFok, MismatchVector) is
begin  -- process procMOerr
  for m in 1 to NumberMedia loop
    MoErrVector(m) <= ChFok and MismatchVector(m);
  end loop;  -- m
end process procMOerr;

-----------------------------------------------------------------------------
-- Omission at channel
-----------------------------------------------------------------------------
-- purpose: a
-- type   : combinational
-- inputs : ChError, mismatch, MismatchVector
-- outputs: MOChVector
procMOch : process (ChError, mismatch, MismatchVector) is
begin  -- process procMOch
  for m in 1 to NumberMedia loop
    MOChVector(m) <= ChError and mismatch and (not MismatchVector(m));
  end loop;  -- m
end process procMOch;

-----------------------------------------------------------------------------
-- Omission at undetermined
-----------------------------------------------------------------------------
-- purpose: a
-- type   : combinational
-- inputs : ChError, mismatch, MismatchVector
-- outputs: MUerrVector
procMuerr : process (ChError, mismatch, MismatchVector) is
begin  -- process procMuerr
  for m in 1 to NumberMedia loop
    MUerrVector(m) <= ChError and ((not mismatch) or MismatchVector(m));
  end loop;  -- m
end process procMuerr;

Listing A.5: Medium Omission Fault Accounting

-----------------------------------------------------------------------------
-- Omission degree registration
-----------------------------------------------------------------------------
-- Upon the End-of-transmission, we must check which media exhibited omission
-- faults. According to the type of behaviour,
-----------------------------------------------------------------------------
procODRegistration : process (sys_clk) is
begin  -- process procODRegistration
  if rising_edge(sys_clk) then          -- rising clock edge
    if rst_N = '0' then                 -- synchronous reset (active low)
      km_ovf <= (others => '0');
      for m in 1 to NumberMedia loop
        od_reg_int(m) <= 0;
      end loop;  -- m
    else
      if can_clk_en = '1' then          -- Bit-sample time
        -- Detect ChEOT rising edge
        if ChEOT = '1' and EOT_assert = '0' then
          EOT_assert <= '1';
          -- Execute the action
          for m in 1 to NumberMedia loop
            case odActions(m) is
              -- INCREMENT
              when INCREMENT =>
                if od_reg_int(m) = od_parameter then
                  km_ovf(m) <= '1';     -- OD exceed (k_m overflow)
                else
                  od_reg_int(m) <= od_reg_int(m) + 1;
                end if;
              -- RESET
              when RESET =>
                od_reg_int(m) <= 0;
              -- UNKNOWN, REPORT IT! Only valid in simulation
              -- pragma synthesis_off
              when UNKNOWN =>
                report "UNKNOWN condition in Omission Degree processing" severity ERROR;
              -- pragma synthesis_on
              -- MAINTAIN
              when others => null;
            end case;
          end loop;  -- m
        elsif EOT_assert = '1' and ChEOT = '0' then
          EOT_assert <= '0';
        end if;
      end if;
    end if;
  end if;
end process procODRegistration;

A.3 Inaccessibility Monitoring and Evaluation

Extended Channel Monitoring

Listing A.6: Channel Start-of-Frame Detection

-----------------------------------------------------------------------------
-- Start of Frame
-----------------------------------------------------------------------------
-- purpose: Detect the Start Of Frame
-- type   : sequential
-- inputs : sys_clk, rst_N, ChRX, ChEOT
-- outputs: ChSOF
procSOF : process (sys_clk) is
begin  -- process procSOF
  if rising_edge(sys_clk) then          -- rising clock edge
    if rst_N = '0' then                 -- synchronous reset (active low)
      ChSOF <= '0';
    else
      if can_clk_en = '1' then          -- RX Bit sampling
        if ChEOT = '1' and ChRX = '0' then
          ChSOF <= '1';
        end if;
        if ChSOF = '1' then             -- Deassert the signal
          ChSOF <= '0';
        end if;
      end if;
    end if;
  end if;
end process procSOF;

Listing A.7: Channel Frame Correct Boundary

ChFc <= '1' when (ChSOF = '1' or ChTok = '1' or ChIFS = '1') else '0';

Inaccessibility Parameter Evaluation

Listing A.8: Inaccessibility Event Counters

-- purpose: Assess the inaccessibility event counts
-- type   : sequential
-- inputs : clk, rst_N, ChErr_l, ChEOT, ChFok_l, ChIEClr, ChOEClr
-- outputs: ChOE, ChIE, ChOD
procCounter : process (sys_clk) is
  variable ChEOT_s : std_ulogic_vector(1 downto 0) := (others => '0');
begin  -- process procCounter
  if rising_edge(sys_clk) then          -- rising clock edge
    if rst_N = '0' then                 -- synchronous reset (active low)
      ChEOT_s := (others => '0');
      ChIE    <= (others => '0');
      ChOE    <= (others => '0');
      ChOD    <= (others => '0');
    else
      -- CAN Clock
      if can_clk_en = '1' then
        -- ChEOT "sampling"
        ChEOT_s := ChEOT_s(0) & ChEOT;  -- Shift values left, input ChEOT
        -- Has ChEOT just been asserted?
        if ChEOT_s(1) = '0' and ChEOT_s(0) = '1' then
          -- ChErr_l active?
          if ChErr_l = '1' then
            ChIE <= ChIE + 1;           -- Total Inacc. Event Count
            if ChFok_l = '0' then
              ChOE <= ChOE + 1;         -- Omission Error Count
              ChOD <= ChOD + 1;         -- Ch OD increment
            end if;
          end if;
          -- Channel Omission Degree Clear
          if ChFok_l = '1' then
            ChOD <= (others => '0');
          end if;
        end if;
        -- Mgmt Req, clear
        if ChOEClr = '1' then
          ChOE <= (others => '0');
        end if;
        if ChIEClr = '1' then
          ChIE <= (others => '0');
        end if;
      end if;
    end if;
  end if;
end process procCounter;

B Mechanism Design Verification

A fundamental stage in the flow of digital hardware design is verification, i.e. assessing the correctness of the designed component(s). This verification can be done by simulation, where one or more sets of stimuli are applied to the inputs of the component being tested, and the corresponding set of output signals is observed and compared with the expected behaviour, defined by the component's functional specifications.

A requirement for meaningful and successful testing is the sensible definition of the input stimuli, in order to attain adequate test coverage and consequently adequate characterisation of the component's behaviour. In the context of CAN bus operation, a sensible definition of the input stimuli translates into having a representation of the CAN channel.

This Appendix describes the approach taken to simulation, giving particular emphasis to the generation of simulation data, with the purpose of emulating a CAN bus channel. This emulation is achieved through a trace of exchanged data and remote frames with several (different) message identifiers, together with other types of events likely to occur during normal¹ operation, such as CAN error and overload frames, and single- and common-mode errors affecting the (replicated) media. Lastly, a set of simulation fragments is analysed.

B.1 Approach to Component Design Simulation

A possible approach to simulating a digital design involves the specification of a test-bench design, which includes the component to be tested and a set of input stimuli relevant to that component. The simulation task might become iterative, i.e. if the designed component fails in the simulation by not behaving as expected in the presence of the defined input stimuli, it must be corrected and tested again. Therefore the sensible design of the input stimuli is paramount.

As the complexity of the components grows, so does the need for more sophisticated input stimuli. For components processing the CAN protocol, this means that a set of input stimuli is composed of one or more specially formed CAN frames, each several bits long. The manual method for providing input stimuli, however, does not scale. A different approach is therefore needed, one that contemplates the exchange of standard-compliant CAN messages, thus mimicking the operation of a standard CAN network, including the impairments to its operation, i.e. errors. A simulated CAN communication channel must therefore be provided.

¹ The meaning of normal is context dependent: a lightly loaded network with sporadic EMI events vs. a highly loaded network with a high rate of errors due to EMI phenomena.

B.2 CAN Channel Simulation

The test of high-level CANELy components requires a set of properly simulated CAN messages. These must be conveyed by the CAN channel, which in turn can be distributed over a set of replicated media. Such specifications provide the support for the design of tests exercising the CANELy machinery. Such tests are composed of properly crafted CAN messages, along with errors affecting both the channel and any of the replicated media.

In order to satisfy these requirements, an approach was adopted involving both the (semi-)automatic generation of test data and the presentation of that same data to the component being tested. The flow of such process is depicted in Figure B.1.

[Figure B.1 block labels: Message Definition; CAN Bit Stream Generator; Bus Data Gen.; File; Bus Media Component; Unit Under Test; Simulator]

Figure B.1: Bus Media simulation data flow

After being defined at a high level of abstraction, the CAN message exchange is converted into a suitable representation and written to a text file in a defined form. This file is then used by a custom component, which interfaces with the component being tested (the unit-under-test) and provides it with the (simulated) CAN channel data, through one or more media.

The usage of an external description of the input stimuli has several advantages, ranging from not requiring test-bench recompilation if a different set of stimuli is defined, to simulation automation, together with VHDL's reporting facilities. Such advantages allow a faster simulation process and wider coverage, thus leading to a higher quality test.

Simulation Data Generation

In order to generate the simulated CAN bus operation data, a software programme was coded. The Python language was chosen, mainly because it offers an object-oriented paradigm and is an excellent prototyping language, with dynamic data typing and a rich set of modules providing complex functions, e.g. random number generation and data format conversions.

This programme was coded using an object-oriented approach, where the CAN frames are objects that can be manipulated by invoking appropriate methods upon them, e.g. to set the payload content. It supports the generation of both CAN 2.0A and CAN 2.0B frames, data and remote. It can also produce error and overload frames, with the restriction that their generation must be manually defined. An extension of this programme into a fully-fledged discrete event simulator is envisaged.

The frames are configured at the programme level by both their identifier and payload. Since the programme is written in Python, an interpreted language, frames can be generated without any recompilation step. Upon creation, a frame defaults to a remote frame, i.e. zero payload. It is transformed into a data frame by setting its payload to non-null content.
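The frame object described above can be illustrated with a minimal sketch. The class name `CanFrame`, its attributes and the `set_payload` method are hypothetical and chosen for illustration only; they are not taken from the programme itself. The sketch only captures the described behaviour: a frame is configured by identifier and payload, and is a remote frame until a non-null payload is set.

```python
from dataclasses import dataclass


@dataclass
class CanFrame:
    """Hypothetical frame object (illustrative names, not the thesis code)."""
    identifier: int
    extended: bool = False   # False: CAN 2.0A (11-bit), True: CAN 2.0B (29-bit)
    payload: bytes = b""     # empty payload: the frame is a remote frame

    def __post_init__(self):
        id_bits = 29 if self.extended else 11
        if not 0 <= self.identifier < (1 << id_bits):
            raise ValueError(f"identifier does not fit in {id_bits} bits")
        self._check_payload(self.payload)
        self.payload = bytes(self.payload)

    @staticmethod
    def _check_payload(payload):
        # A CAN data field holds at most 8 bytes.
        if len(payload) > 8:
            raise ValueError("a CAN frame carries at most 8 payload bytes")

    @property
    def is_remote(self):
        # Following the description: zero payload means remote frame.
        return len(self.payload) == 0

    def set_payload(self, payload):
        """Turn the frame into a data frame by giving it non-null content."""
        self._check_payload(payload)
        self.payload = bytes(payload)


frame = CanFrame(identifier=10)
print(frame.is_remote)        # remote frame upon creation
frame.set_payload(b"\x12\x34")
print(frame.is_remote)        # now a data frame
```

Range checks on the identifier and payload length are an addition of this sketch; the text does not state whether the programme performs them.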

The (simulated) CAN data is output into a text file. Each message (object) has methods to build its bit representation from the several variables: message identifier, payload, (possible) bit errors. The number of replicated media is configured at the application level. A fragment of the simulated CAN network is shown in Figure B.2, for two replicated media. Each row of the file represents one bit-time, and the time flow is from top to bottom. Each column represents a physical medium, and the media are numbered from left to right (i.e. 1, 2, ..., N). The values in the file are the bus values: '1' is a recessive bit; '0' is a dominant bit.

11 # Text after a '#' is comment
11 # Start of bus.sim file
11 # <- Each row represents one bit time
11 # Each column represents a medium
11 # Number of Media: 2
11
11
11
00 # SOF - MID: 10
00
00
00
00
11 # Bit-stuffing
00
10 # Error @ Medium 1
...

Figure B.2: Simulation text file content
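The row layout of Figure B.2 can be reproduced with a short sketch. The function names `stuff_bits`, `frame_start_rows` and `format_rows` are hypothetical, and only the start of a CAN 2.0A frame (SOF and 11-bit identifier) is generated; the remaining fields, including the CRC, are left out. Bit stuffing follows the CAN rule of inserting a complementary bit after five consecutive equal bits.

```python
def stuff_bits(bits):
    """Insert a complementary bit after five consecutive equal bits.

    Returns (bit, is_stuff_bit) pairs; a stuff bit starts the next run.
    """
    out = []
    run_bit, run_len = None, 0
    for bit in bits:
        out.append((bit, False))
        if bit == run_bit:
            run_len += 1
        else:
            run_bit, run_len = bit, 1
        if run_len == 5:
            stuffed = 1 - bit
            out.append((stuffed, True))
            run_bit, run_len = stuffed, 1
    return out


def frame_start_rows(identifier, n_media=2, idle_bits=3):
    """Rows for bus idle, SOF and an 11-bit identifier (one row per bit time)."""
    rows = [("1" * n_media, "") for _ in range(idle_bits)]
    # SOF is dominant (0); the identifier is sent most significant bit first.
    bits = [0] + [(identifier >> i) & 1 for i in range(10, -1, -1)]
    for index, (bit, is_stuff) in enumerate(stuff_bits(bits)):
        if index == 0:
            comment = f"SOF - MID: {identifier}"
        elif is_stuff:
            comment = "Bit-stuffing"
        else:
            comment = ""
        # Error-free case: every medium carries the same value.
        rows.append((str(bit) * n_media, comment))
    return rows


def format_rows(rows):
    """One text line per row; the comment follows a '#', as in the file."""
    return "\n".join(f"{values} # {comment}" if comment else values
                     for values, comment in rows)


print(format_rows(frame_start_rows(10)))
```

For identifier 10 the SOF bit is followed by four further dominant bits and then a recessive stuff bit, which agrees with the fragment of Figure B.2.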

This fragment shows the start of a new frame (SOF), clearly identified by the accompanying comment. The bit-stuffing bits are identified, for informational purposes only. The simulated data can also (deliberately) suffer errors, i.e. exhibit values different from those intended. Such errors are essential for exercising the CAN protocol processing machinery, providing it with a more realistic view of the network traffic.

Errors can affect only one medium (single-mode), or several or all media (common-mode). These can be inserted in the generated stream, in order to assess the behaviour of the components in the presence of different types of errors, e.g. stuck-at, single or multiple (burst) bit errors.
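The error types listed above can be sketched as operations on the generated stream. The sketch assumes the stream is held as a list of rows, one per bit time, each row being a list with one bit per medium; this representation and the function names are hypothetical. Media are numbered from 1, as in the simulation file.

```python
def inject_single_mode(rows, medium, bit_time):
    """Single-mode error: flip one bit on one medium only."""
    rows[bit_time][medium - 1] ^= 1


def inject_common_mode(rows, bit_time):
    """Common-mode error: flip the same bit time on every medium."""
    rows[bit_time] = [bit ^ 1 for bit in rows[bit_time]]


def inject_burst(rows, medium, start, length):
    """Burst error: flip `length` consecutive bit times on one medium."""
    for bit_time in range(start, min(start + length, len(rows))):
        rows[bit_time][medium - 1] ^= 1


def inject_stuck_at(rows, medium, value, start=0):
    """Stuck-at error: force one medium to a constant value from `start` on."""
    for row in rows[start:]:
        row[medium - 1] = value


# Eight bit times of dominant bits on two media, then one error of each kind.
stream = [[0, 0] for _ in range(8)]
inject_single_mode(stream, medium=1, bit_time=1)
inject_common_mode(stream, bit_time=3)
inject_burst(stream, medium=2, start=5, length=2)
for row in stream:
    print("".join(str(bit) for bit in row))
```

Keeping the injection separate from frame generation means the same error-free trace can be reused with different error patterns.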

Simulated CAN Channel and Media Components

The set of signals represented in the simulation text file must be converted into a form suitable for being interfaced with the media redundancy management mechanisms, or any other component needing only one medium, i.e. the CAN channel itself. This goal is achieved by a custom VHDL component simulating the incoming media, by reading the text file and generating the appropriate signals to be presented to the component being tested. The operation of such component is depicted in Figure B.3. This fragment of simulation shows the incoming bit streams, described in the simulation text file generated by the programme, and the recovered CAN channel bit stream, after the AND-based media management.
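The behaviour of this component can be modelled in a few lines of Python, as a sketch only: the actual component is written in VHDL, and the function names below are hypothetical. The model reads the text file format described earlier, ignores comments, and recovers the channel bit as the AND of all media, so that a dominant '0' on any medium prevails. A mismatch flag is raised whenever the media disagree.

```python
def parse_bus_file(text):
    """Return one list of media bits per bit time, ignoring '#' comments."""
    rows = []
    for line in text.splitlines():
        values = line.split("#", 1)[0].strip()
        if values:
            rows.append([int(ch) for ch in values])
    return rows


def channel_bit(media_bits):
    """AND of all media: recessive (1) only if every medium is recessive."""
    return int(all(media_bits))


def media_mismatch(media_bits):
    """True when the media do not all carry the same value."""
    return len(set(media_bits)) > 1


SAMPLE = """\
11 # bus idle
00 # SOF - MID: 10
00
10 # Error @ Medium 1
01
"""

for media in parse_bus_file(SAMPLE):
    print(media, channel_bit(media), media_mismatch(media))
```

In the row `10`, medium 1 erroneously shows a recessive bit; the AND masks it and the channel still carries the intended dominant bit, while the mismatch flag records the disagreement.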

[Figure B.3 annotations: Channel Error; Media Mismatch; End-of-Frame Sequence]

Figure B.3: Simulated CAN Channel

In the fragment of Figure B.3 there are also two clock signals: the bus_clk, which governs the reading of the simulation data file and affects the bits present in each medium; and the can_clk_en, which governs the sampling of the channel. For simplicity of operation the bits are sampled at the middle of the bit time, instead of being sampled later. This is just a simplification, since there is no need for the bit to propagate all over the network and settle its value, due to the ideal conditions presented by the simulation environment.

This testing flow greatly aids the test designer, since the (low-level) burden is now placed on a piece of software, leaving the designer at the high-level layer. This also avoids the introduction of errors in the input stimuli, since they are now generated in a programmatic fashion.

B.3 Simulation Fragments

This Section presents simulation fragments with the purpose of exemplifying the testing phase of some of CANELy's components and, at the same time, demonstrating the usefulness of the previously described simulation data generation flow, especially how a set of properly defined messages can considerably speed up the test while ensuring that the input stimuli are of good quality, contributing to the overall quality of the test.

Basic Channel Monitoring

One of the first components to be designed was the one providing the basic channel monitoring signals, ChFok, ChEOT and ChErr. These signals are the building blocks for other (more complex) machinery; therefore, they not only had to be designed first, but also had to be proved correct, in order to allow their safe integration into other functions. These signals, in turn, are based on the sequence detection machinery which, given its simplicity and parametrisable nature, could be tested with a simple set of manually defined input stimuli.

These signals, however, are more complex than the sequence detection machinery, for they also contain complementary logic for signal assertion. Furthermore, there are interactions between the signals, e.g. ChErr is negated upon ChEOT. Although they can be tested separately, with a synthetic stimulus representing the desired input signal, they were tested together, which also allowed any ill-behaved interaction to be ruled out. Such a test is depicted in Figure B.4. This fragment shows a set of messages, with one of them suffering an error.

Figure B.4: Simulation of the Basic Channel Monitoring mechanism

In this case, the test is being done on machinery that gets its information from the channel. Therefore the test has been configured with only one medium, and the Medium_Rx signal has only one element, the channel itself. This test confirmed the correct behaviour of the basic channel monitoring mechanisms, which could then be used to develop functions of greater complexity with the confidence that this component would behave as expected.

Message Identifier Recovery

Some notification primitives of the MSU require the indication of the affected message identifier, mid (see Figure 4.23, page 57). To recover the mid, special machinery had to be designed, which basically processes the start of every new CAN data or remote frame until it receives the identifier. This information depends on the version of the frame, CAN 2.0A or CAN 2.0B, which also affects the execution of the machinery, since CAN 2.0A frames have 11-bit identifiers whereas CAN 2.0B frames have 29-bit identifiers.
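The identifier recovery can be sketched in software as follows. This is a behavioural illustration with hypothetical function names, not a description of the VHDL machinery. It relies on the standard CAN frame layout: after the SOF bit come the 11 base identifier bits, then the RTR bit (CAN 2.0A) or SRR bit (CAN 2.0B), then the IDE bit, which is dominant in CAN 2.0A frames and recessive in CAN 2.0B frames, where it is followed by the 18 extended identifier bits. Stuff bits are removed first; stuff errors are not checked.

```python
def destuff(bits):
    """Drop the bit that follows five consecutive equal bits (no error check)."""
    out = []
    run_bit, run_len = None, 0
    skip_next = False
    for bit in bits:
        if skip_next:
            # Stuff bit: discard it, but it starts a new run.
            run_bit, run_len = bit, 1
            skip_next = False
            continue
        out.append(bit)
        if bit == run_bit:
            run_len += 1
        else:
            run_bit, run_len = bit, 1
        if run_len == 5:
            skip_next = True
    return out


def to_bits(value, width):
    """Most significant bit first."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


def from_bits(bits):
    return int("".join(str(bit) for bit in bits), 2)


def recover_mid(bits):
    """bits: destuffed frame bits starting at SOF. Returns (mid, extended)."""
    if len(bits) < 14 or bits[0] != 0:
        raise ValueError("expected a dominant SOF and at least 14 bits")
    base = from_bits(bits[1:12])      # 11-bit base identifier
    ide = bits[13]                    # bits[12] is RTR (2.0A) or SRR (2.0B)
    if ide == 0:
        return base, False            # CAN 2.0A
    if len(bits) < 32:
        raise ValueError("extended frame is too short")
    extension = from_bits(bits[14:32])
    return (base << 18) | extension, True   # CAN 2.0B, 29 bits


# CAN 2.0A remote frame, mid 10, as seen on the bus (with one stuff bit).
on_bus = [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0]
print(recover_mid(destuff(on_bus)))

# CAN 2.0B frame, mid 33253, given here already destuffed.
extended = [0] + to_bits(33253 >> 18, 11) + [1, 1] + to_bits(33253 & 0x3FFFF, 18) + [0]
print(recover_mid(extended))
```

The two identifiers used here, 10 and 33253, are the ones annotated in Figure B.5.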

The test of such machinery calls for an exchange of CAN messages, with frames having different identifiers, both in value and length (protocol version). Such a diversified test allows the assessment of correct operation. A fragment of this test is depicted in Figure B.5, and is composed of the exchange of several CAN messages.

[Figure B.5 annotations: CAN 2.0B Frame, MID: 33253; CAN 2.0A Frame, MID: 10]

Figure B.5: Simulation of the Message Identifier Extraction mechanism

This component uses an auxiliary signal, ChSOF, which indicates the start of a new frame transmission, thus also starting the message processing machinery. The midOK signal indicates that the message identifier has been correctly recovered. Such indication can be useful for signalling the upper layers.

Although the information pertaining to the mid and the CAN frame version is represented by a VHDL record data type, it is converted into a form suitable for use by other components through a conversion function. Lastly, the vector storing (locally) the mid is 29 bits wide and is not initialised, i.e. it is not filled with zeros or ones. This leads to the simulation behaviour observed in Figure B.5, in the first CAN 2.0A frames: since the remaining 18 bits are not initialised, the simulator highlights such condition. The safety of the operation is ensured by the data conversion function which, depending on the frame version, outputs 11 or 29 bits, thus always ensuring that the correct data is output.
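The role of the conversion function can be illustrated with a small model. The sketch below makes two assumptions that the text does not confirm: uninitialised positions are represented by the character 'U' (mirroring the uninitialised value of VHDL's std_ulogic), and the 11-bit identifier is stored right-aligned in the 29-bit vector. The function name `mid_to_int` is hypothetical.

```python
def mid_to_int(mid_vector, extended):
    """Convert the stored identifier, using only the bits valid for the version.

    mid_vector: 29 entries, most significant first, each 0, 1 or 'U'.
    """
    if len(mid_vector) != 29:
        raise ValueError("the identifier vector must hold 29 entries")
    width = 29 if extended else 11
    used = mid_vector[-width:]   # assumption: identifier is right-aligned
    if any(bit == "U" for bit in used):
        raise ValueError("identifier bits were never written")
    return int("".join(str(bit) for bit in used), 2)


# First CAN 2.0A frame after start-up: only 11 positions have been written.
stored = ["U"] * 18 + [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
print(mid_to_int(stored, extended=False))
```

Because only the 11 valid positions are read for a CAN 2.0A frame, the uninitialised positions never reach the output; reading the same vector as a CAN 2.0B identifier is rejected in this model.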
