Mechanisms for Enhanced Dependability and Timeliness in CAN
Ricardo Alexandre Neves Correia Pinto
Dissertation submitted for the degree of Master in
Electrical and Computer Engineering
Jury
President: Dr. Nuno Cavaco Gomes Horta
Supervisor: Dr. José Manuel de Sousa de Matos Rufino
Co-Supervisor: Dr. Carlos Manuel Ribeiro Almeida
Member: Dr. Carlos Jorge Ferreira Silvestre
December 2010
Acknowledgments
Firstly, I would like to express my gratitude to my supervisor, Prof. José Rufino for the effort
put into this work. His constant encouragement, motivation and support were fundamental for the
writing of this document. A special thanks also goes to my co-supervisor, Prof. Carlos Almeida
for the support given.
To the people in the Computers Scientific Area for the support and for providing the technical
means for developing this work, and for the pleasant coffee breaks.
To my friends, especially Pedro Fernandes, for all these years of friendship and
encouragement.
Last, but certainly not least, I would like to thank my family, especially my parents, Carlos and
Isabel, and my sister Andreia for their love and support all these years. This would not have been
possible without you.
Abstract
A cost-effective solution for Distributed Control System (DCS) interconnection is the Controller
Area Network (CAN) fieldbus. Designed to be used in the harsh automotive environment, its
usage has spread to other domains, e.g. home automation, elevators, shop-floor control and even
aerospace applications. However, there is a set of domains where CAN could not be used
without additional mechanisms: mission-critical applications. In fact, despite exhibiting fault-tolerant
behaviour in the presence of errors, CAN fault coverage alone is not high enough to meet the
stringent requirements regarding safety, availability and reliability these domains demand.
The CAN Enhanced Layer (CANELy) architecture is a step towards a CAN-based high
dependability architecture, through the provision of: reliable communication services, network
reliability and availability, and channel timeliness guarantees.
This work discusses the design and implementation of effective mechanisms for network
dependability and timeliness enhancement, in the context of the CANELy architecture. Our working
basis is the extended fault model provided by the CANELy architecture, which contemplates the
utilisation of media redundancy for the communication channel. From this basis we identified
effective mechanisms to detect and isolate faults affecting either the channel or any of the redundant
media conveying the channel.
The final result is a set of mechanisms specified in a Hardware Description Language (HDL),
which can be fitted into a small-sized Field Programmable Gate Array (FPGA), thus providing
CANELy-based applications with means for: effective redundancy management, channel and
media fault detection and confinement, and upper layer signalling for network operation status
assessment, all in a cost-effective manner.
Keywords
Controller Area Network, networked embedded systems, dependability, timeliness,
CAN Enhanced Layer
Resumo
The Controller Area Network (CAN) industrial network is an efficient solution for the interconnection
of Distributed Control Systems. Designed for automotive applications, its use has
spread to other domains, e.g. home automation, production control and even aerospace
applications. There are, however, domains where CAN cannot be used without additional
mechanisms: mission-critical applications. Although CAN has fault-tolerance
characteristics, it lacks a fault coverage broad enough to meet the
stringent requirements of these domains with respect to safety
and availability.
The CAN Enhanced Layer (CANELy) architecture is a step towards high-dependability
solutions based on CAN networks, through the provision of: reliable
communication services, a reliable and available network, and timeliness guarantees for the communication channel.
This work discusses the design and implementation of efficient mechanisms for enhancing
dependability and timeliness, in the context of the CANELy architecture. The working basis is the
extended fault model provided by the CANELy architecture, which contemplates redundancy of the
physical medium for the communication channel, as well as mechanisms to detect and isolate faults
affecting either the channel or any redundant medium supporting the channel.
The final result is a set of mechanisms specified in a Hardware Description
Language and implemented in an FPGA device, providing CANELy-based applications with
the means for: efficient redundancy management, detection and confinement of faults in the
channel and physical medium, and signalling to upper layers for assessment of the network state.
Keywords
Controller Area Network, networked embedded systems, dependability, timeliness,
CAN Enhanced Layer
Contents
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Contributions
1.4 Publications
1.5 Document organisation
2 State of the Art
2.1 Distributed Control Systems
2.2 Distributed Real-time Systems
2.2.1 Real-time Systems
2.2.2 Communication System Operation Models
2.2.3 Real-time Communication Networking Infrastructure
2.3 Embedded Systems Networking (Fieldbus) Technologies
2.3.1 Time-Triggered Ethernet
2.3.2 SpaceWire
2.3.3 Controller Area Network
2.4 CAN in airborne and spaceborne applications
2.5 High-Dependability CAN-based architectures
2.5.1 RedCAN
2.5.2 FlexCAN
2.5.3 CAN Enhanced Layer - CANELy
3 CAN Enhanced Layer
3.1 System Architecture
3.1.1 Reliable Communication and Services
3.1.2 Network Dependability
3.1.3 Hard Real-time Operation
3.2 System Components
3.2.1 CANELy Dependability Engine
3.2.2 Media Selection Unit
3.2.3 Inaccessibility Control Unit
3.3 Engineering Constraints
3.3.1 CANELy Components
3.3.2 CAN Data-link Layer
3.3.3 CAN Physical Layer
3.4 Summary
4 Dependability Enforcement
4.1 Working Model
4.1.1 CAN Physical Layer Fault-Tolerance
4.1.2 CANELy Approach to Network Dependability
4.1.3 Fault classes
4.2 Physical Network Availability and Reliability
4.2.1 Media Redundancy Provision and Management
4.2.2 Stuck-at-dominant Fault Handling
4.3 CAN Bit-Sequence Detection
4.4 Channel Monitoring
4.5 Medium Monitoring
4.5.1 Medium Status
4.5.2 Frame Monitoring
4.5.3 Omission Degree Control
4.6 Media Selection Unit
4.6.1 System Interface
4.6.2 Management Interface
4.7 Summary
5 Timeliness Enforcement
5.1 Channel Inaccessibility
5.2 Inaccessibility Evaluation
5.2.1 Assessment of Inaccessibility Events
5.2.2 Extended Channel Monitoring
5.2.3 Assessment of Inaccessibility Effects
5.3 Usefulness of Inaccessibility Control Mechanisms
5.4 Inaccessibility Control Unit
5.5 Summary
6 CANELy Mechanism and Prototype Engineering
6.1 CANELy Mechanism Verification and Validation
6.1.1 Media Selection Unit
6.1.2 Inaccessibility Control Unit
6.2 FPGA Mechanism Engineering
6.2.1 Media Selection Unit
6.2.2 Inaccessibility Control Unit
6.2.3 Resource usage comparison
6.3 CANELy Prototype Board
6.3.1 Architecture
6.3.2 Prototype Implementation
6.4 Summary
7 Conclusions and Future Work
Bibliography
A VHDL Snippets
A.1 Sequence detection machinery and mapped sequences
A.2 Omission Monitoring and Control
A.3 Inaccessibility Monitoring and Evaluation
B Mechanism Design Verification
B.1 Approach to Component Design Simulation
B.2 CAN Channel Simulation
B.3 Simulation Fragments
List of Figures
1.1 SSTL-900 Satellite Architecture Block Diagram
2.1 Block diagram of a generic control system [1]
2.2 Typical DCS infrastructure
2.3 Typical Time-Triggered Ethernet network
2.4 Typical SpaceWire network
2.5 Typical CAN network
2.6 Ring topology RedCAN network
2.7 Typical FlexCAN network
2.8 Typical CAN Enhanced Layer network
3.1 CANELy System Architecture
3.2 CANELy reliable communication and services block diagram
3.3 CANELy Dependability Engine interfaces
3.4 Channel redundant media management
3.5 Inaccessibility Control Unit
3.6 CANELy engineering model
3.7 CANELy Dependability Engine
3.8 Extended CiA Connector
4.1 CANELy network assumptions
4.2 CAN message termination sequence
4.3 CAN physical layer faults
4.4 Errors affecting a dual-media CAN network
4.5 Media-redundant network physical partition
4.6 Columbus' Egg strategy block diagram
4.7 AND-based Media Selection description in VHDL
4.8 Medium Disable Receive description in VHDL
4.9 Sliding Window sequence detection
4.10 Signal assertion machinery
4.11 Sequence detector description in VHDL
4.12 ChEOT signal description in VHDL
4.13 CANELy Basic Channel Monitoring
4.14 Mdis-tx description in VHDL
4.15 Channel Monitoring signals
4.16 Sequences mapped into VHDL
4.17 CANELy Channel Monitoring functions
4.18 Medium Status Word VHDL data type
4.19 Medium Status Monitoring block diagram
4.20 Medium Omission Detection auxiliary functions
4.21 Omission Degree Control block diagram
4.22 Media Selection Unit block diagram
4.23 CANELy Media Selection Unit management primitives
5.1 CAN vs. CANELy normalised inaccessibility duration bounds
5.2 Timing of the CAN channel monitoring signals
5.3 Inaccessibility Event Count description in VHDL
5.4 Extended Channel Monitoring signals
5.5 Timeliness-related sequences mapped into VHDL
5.6 Inaccessibility duration evaluation description in VHDL
5.7 Optimised Diffusion-based protocol
5.8 Inaccessibility Control Unit block diagram
5.9 CANELy Inaccessibility Control Unit management primitives
6.1 Media Selection Unit simulation fragment
6.2 Inaccessibility Control Unit simulation fragment
6.3 Sequence detection description resource occupation
6.4 CANELy vs CAN Cores resource usage comparison
6.5 CANELy vs CAN Cores relative slice usage
6.6 CANELy Prototype Board block diagram
6.7 CANELy Prototype Board
B.1 Bus Media simulation data flow
B.2 Simulation text file content
B.3 Simulated CAN Channel
B.4 Simulation of the Basic Channel Monitoring mechanism
B.5 Simulation of the Message Identifier Extraction mechanism
List of Tables
2.1 Comparison of TTEthernet, SpW and CAN
6.1 Media Selection Unit FPGA resource occupation
6.2 Inaccessibility Control Unit FPGA resource occupation
List of Acronyms
ABS Anti-lock Braking System
AOCS Attitude and Orbit Control System
ARINC Aeronautical Radio, Inc.
ASIC Application Specific Integrated Circuit
AMBA Advanced Microcontroller Bus Architecture
CAM Content Addressable Memory
CAN Controller Area Network
CANELy CAN Enhanced Layer
CiA CAN in Automation
COTS Commercial Off-The-Shelf
CSMA/DCR Carrier-Sense Multiple Access / Deterministic Collision Resolution
CRC Cyclic Redundancy Check
DCS Distributed Control System
ECSS European Cooperation for Space Standardization
EMI Electromagnetic Interference
EOF End-of-Frame
EOT End-of-Transmission
ESA European Space Agency
FIFO First-In, First-Out
FPGA Field Programmable Gate Array
FSM Finite State Machine
FTU In FlexCAN: Fault Tolerant Unit
HDL Hardware Description Language
ICU In CANELy: Inaccessibility Control Unit
IEEE Institute of Electrical and Electronics Engineers
I/O Input/Output
IP Intellectual Property
IRQ Interrupt Request
LLC Logical Link Control
LUT Look-up Table
MAC Medium Access Control
MSU In CANELy: Media Selection Unit
OBC OnBoard Computer
OOB Out-of-band
OSI Open Systems Interconnection
PHY Physical Layer
PLD Programmable Logic Device
QoS Quality of Service
RAM Random-Access Memory
ROM Read-Only Memory
ROV Remotely Operated Vehicle
RTU Remote Terminal Unit
SCADA Supervisory Control and Data Acquisition
SNR Signal-to-Noise Ratio
SoC System-on-a-Chip
SpW SpaceWire
SRAM Static Random-Access Memory
SSTL Surrey Satellite Technology Ltd
SWaP Size, Weight and Power-consumption
TTEthernet Time-Triggered Ethernet
UAV Unmanned Aerial Vehicle
UTP Unshielded Twisted Pair
VHDL Very High Speed Integrated Circuit Hardware Description Language
VOIP Voice over Internet Protocol
WCET Worst Case Execution Time
1 Introduction
The advances in the integration of electronic circuits during the last decades have allowed the
proliferation of devices with significant processing capabilities into the various aspects of daily life.
Nowadays it is common to see a mere cellular telephone having greater computing power than
the most expensive computers of 20 years ago. This integration led to an explosion of systems
with a reduced size footprint, containing the elements of a complete computer: processing unit,
memory (both volatile and non-volatile) and input/output interfacing capabilities. These embedded
systems account for practically every current electronic system, be it a portable medical device or a
ticket vending machine.
The domain of control and automation is one of the areas that has benefited the most from the
proliferation of embedded systems. While in the past control applications like process control had
to rely on large (central) computers in order to implement the control laws governing the system,
nowadays one can easily build an equivalent control system with small and cheap electronic
components.
The reduction in size and cost has also allowed the physical distribution of the control systems
to the premises of the sub-systems being controlled and the consequent elimination of the
point-to-point control wires. This saved money and space while increasing the reliability, availability
and maintainability of the global control system.
The shift from centralised control systems to distributed systems raised many challenges,
including the need for communication technologies enabling the dependable interconnection of
these systems. A special class of computer networks called fieldbuses was devised, with the
aim of providing support for communication services in the harsh¹ industrial environment, thus
enabling the construction of distributed embedded systems for real-time control tasks.
Today, many of these technologies are all around us due to the ubiquity of distributed control
systems in domains such as automotive or aviation. The trend is set to continue, with mechanical
systems such as a car's engine control system being replaced by more effective systems
based on communicating electronic control units.
¹ Some of the problems are: Electromagnetic Interference (EMI), power supply transients and physical damage to the cabling infrastructure.
1.1 Motivation
High dependability application domains like aerospace, medical or process control have mostly
been implemented using ad hoc solutions, owing to their specialised nature and low production volumes.
However, in the last decade both the industrial and academic worlds have shown a growing
interest in using Commercial Off-The-Shelf (COTS) solutions in application domains having stringent
requirements with respect to dependability and timeliness.
The interest in COTS solutions has mainly been driven by their (higher) performance rather
than by their cost². However, most of these solutions do not possess all the required
characteristics for dependable operation due to their commercial- and performance-oriented nature, and
thus must be complemented in order to behave in conformity with the strict requirements of critical
applications. Furthermore, recent studies advocate the use of COTS, namely those implementing
widely used standards and non-proprietary interfaces [4].
In this context, the Controller Area Network (CAN) fieldbus is an excellent candidate to be used
as a building block for these application domains. Designed to be used in the harsh automotive
environment, it was quickly adopted as a fundamental block for networked embedded control
systems, due to its simplicity of operation, low mass cabling and behaviour in the presence of
errors. Furthermore, its specifications are open [5] and have been standardised [6]. Variants of
these standards for specific application domains, such as aeronautics [7] and space [8] have also
been introduced.
Current applications
Even though CAN was not initially intended for critical applications such as spaceborne
systems, there are spacecraft architectures already deployed using CAN to convey information
between the several subsystems that compose them and the OnBoard Computer (OBC). One
example of this type of architecture is the SSTL-900 from Surrey Satellite Technology Ltd, shown in
Figure 1.1. In this satellite architecture the several sub-systems such as Attitude and Orbit Control
System (AOCS) and payload are connected through CAN, in a dual communication media setup
to achieve media redundancy.
However, this solution does not provide timeliness and extremely high availability
guarantees: swapping between the communication media is performed through remotely operated physical
switches, which takes time and might endanger the operation of a hard real-time system. Whilst in
Earth-orbiting satellites this is acceptable, there are other spaceborne applications where stronger
mission-critical guarantees are needed, e.g. an interplanetary exploration mission, especially
during the phase of planetary probe deployment.
² Contrary to popular belief, COTS components may be more expensive than their military counterparts, mainly due to validation and certification processes [2, 3].
[Figure: block diagram of the On-Board Computer and the Payload, Thermal Management, Attitude & Orbit Control System, Power System, Communications & Data Handling System and Propulsion System, connected through a dual-media CAN bus (A side and B side).]
Figure 1.1: SSTL-900 Satellite Architecture Block Diagram
Future trends
Looking into the future, the European Space Agency (ESA) Aurora Programme aims to push
robotic exploration of the Solar System and development of manned spaceflight missions to Mars.
One of the programme’s initial goals was the study and definition of future mission requirements,
in order to achieve a unified hardware/software architecture. This (modular) architecture shall be
used by all the missions under the programme’s scope, thus lowering development-related costs
and mitigating the eventual technological obsolescence of system components.
Due to the long-term life-cycle of the space missions³, these requirements must have a high
longevity in order to cope with the temporal horizon of mission execution. It is in this scope that
COTS components play a key role, not only opening room for the inclusion of industrial
standardised components such as CAN, but also allowing cost savings w.r.t. special-purpose solutions in
the long run due to component reuse.
According to the final report of the study commissioned by ESA with recommendations for the
avionics architecture of the Aurora programme [10]:
“Concerning the low rate bus which is the most often used for acquisition/command
exchanges, the trade-off has already been performed between CAN, (MIL-STD)1553,
ODH485 and TTP/C (...) . (MIL-STD)1553 and CAN busses have been selected : the
first one is the current standard bus and allows to reuse existing units (AOCS units
in particular) and the second one is the future standard and will allow to connect new
units embedding CAN bus coupler.”
Therefore, CAN is more than just a popular fieldbus within current automotive and control
applications: future systems are being designed based on this technology for sub-system networking.
Moreover, the applications where the CAN bus will be incorporated tend to be critical in nature,
thus denoting a window of opportunity for a CAN-based dependable communication infrastructure.
³ Space missions usually extend over 10 years, from initial planning to mission launch and exploration [9].
1.2 Objectives
The main objective of this work is to enable a design that enforces dependability and timeliness
in CAN, demonstrating the feasibility of using the CAN fieldbus as the network building block for
applications with stringent requirements regarding safety, availability and timeliness, such as hard
real-time distributed control systems.
To achieve this objective we discuss an implementation of the CAN Enhanced Layer (CANELy)
architecture defined in [11], specifically the low-level components dealing with dependable
operation of the communication channel and media, in both the temporal and spatial domains. This
implementation will lead to a specialised unit, managing the incoming bit streams from the
several media and providing: bus media redundancy management, channel/media monitoring, and
channel/media status signalling support for the upper layers.
The specification of these components involves studying the basic analytic models defined
by the CANELy architecture and devising efficient machinery to perform the modelled low-level
operations. Special attention must be paid to the suitability of implementing these mechanisms
in a medium-sized Programmable Logic Device (PLD), e.g. Xilinx's Spartan-3E, a low-cost Field
Programmable Gate Array (FPGA) family [12]. The complexity must be kept low, not only because
PLD fabric is a scarce resource but also to ease verification, increasing
reliability, maintainability and even composability.
The last, long-term objective is the wrapping of the components in an Intellectual Property (IP)
core with an adequate Advanced Microcontroller Bus Architecture (AMBA) [13] interface, which
allows near "Plug-n-play" integration within a System-on-a-Chip (SoC) design with AMBA bus
support, such as the LEON3 spaceborne processor IP core [14].
1.3 Contributions
The contributions provided by this work to the field of highly dependable CAN-based solutions
are as follows:
• Demonstration of the feasibility of effective CAN bus media redundancy and monitoring
mechanisms through a proof-of-concept;
• Demonstration of the (low) complexity of the mechanisms supporting a dependable
architecture based on the CAN fieldbus;
• A (portable) parametrised description of the hardware mechanisms in a hardware description
language, e.g. VHDL, suitable for implementation in an FPGA or ASIC.
1.4 Publications
This work has provided conditions for the elaboration of the following works, presented in
international conferences:
• J. Rufino, R. Pinto, and C. Almeida, “A FPGA-based solution for enforcing dependability
and timeliness in CAN,” in Proceedings of the 2007 IP Based Electronic System (IP’07),
Grenoble, France, Dec. 2007 [15].
• J. Rufino, R. Pinto, and C. Almeida, “FPGA-based engineering of bus media redundancy in CAN,”
in Proceedings of the 12th International CAN Conference (iCC’08), Barcelona, Spain, Mar. 2008 [16].
and of the following Technical Reports:
• R. Pinto, J. Rufino, and C. Almeida, “CANELy prototype board schematic specification,”
FCUL/IST, Tech. Rep. DARIO RT-05-04, Dec. 2005 [17].
• R. Pinto, J. Rufino, and C. Almeida, “Specification and engineering of the CANELy prototype board,”
FCUL/IST, Tech. Rep. DARIO RT-06-06, Oct. 2006 [18].
• J. Rufino, R. Pinto, and C. Almeida, “How to enforce dependability and timeliness in CANELy?”
FCUL/IST, Tech. Rep. DARIO RT-07-02, Jul. 2007 [19].
1.5 Document organisation
The remainder of this document is organised as follows:
Chapter 2 documents the state of the art with respect to real-time operation and dependable
distributed embedded systems. A set of fieldbus technologies is presented, and CAN-based
solutions for dependable applications are documented.
Chapter 3 describes in detail the CANELy architecture, analysing its contributions to building
highly dependable hard real-time systems.
Chapter 4 discusses the problems that affect the dependability of communication bus
operation, and the materialisation of effective mechanisms to ensure proper operation even in
the presence of errors.
Chapter 5 addresses the issues affecting system timeliness and provides solutions to
enforce correctness in the time domain.
Chapter 6 discusses the engineering of the proposed mechanisms, along with the design
of a prototype supporting the CANELy architecture.
Chapter 7 concludes this dissertation, with an analysis of future work.
2 State of the Art
The interconnection of embedded systems in a distributed real-time industrial scenario poses
several problems w.r.t. correct system behaviour. These problems stem from all the system layers:
from the interconnection technology, through the communication protocols, to the requirements
for distributed operation. They are compounded when these systems are used in control and
automation tasks, which add dependability and safety concerns to the set of issues to be
addressed.
This Chapter documents the State of the Art pertaining to: requirements for distributed real-
time control systems, specifically the networking components; fieldbuses for embedded net-
worked system interconnection, including the CAN fieldbus; and CAN-based dependable archi-
tectures.
Firstly, we briefly introduce a generic approach to control systems, focusing then on distributed
real-time control systems. The real-time paradigms governing system and communication be-
haviour are introduced and discussed. A set of fieldbuses - in which CAN is included - is analysed
and compared, w.r.t. the several attributes they offer for (dependable) embedded system network-
ing. Finally we present a set of existing solutions for high dependability architectures using the
CAN fieldbus as the networking block.
2.1 Distributed Control Systems
Control systems are all around us. Whenever one physical quantity must change according
to another, a control system must mediate the interaction.
A generic control system is composed of basic entities, which can be of the following types:
− Sensor, responsible for gathering information from the environment surrounding it;
− Controller, evaluates the information collected by the sensor(s) and according to a defined
control law, generates a piece of information for the entities interacting with the environment;
− Actuator, makes a change in the environment based on the information received from the
controller.
The relations between entities and the environment are depicted in Figure 2.1.
[Figure 2.1 diagram: Sensor System, Controller System and Actuator System blocks interacting with the Environment]
Figure 2.1: Block diagram of a generic control system [1]
A Distributed Control System (DCS) shares the same basic entities and operating principle,
but with a fundamental difference: these elements are physically separated. A typical example
of a DCS is a car’s brake-by-wire facility. In this system, the brake pedal is connected to a
sensing element, usually a variable resistor, which feeds a controller with data. Whenever the
driver presses the pedal, the controller receives the pedal’s data and can request more data from
other sensors¹ before processing it and sending an appropriate command to the brake’s actuating
entities.
A solution to physically interconnect these entities emerged in the form of fieldbuses.
Fieldbuses are a class of computer networks initially designed for large control systems, such
as chemical process control and power plants, as the interconnection technology for the smaller
distributed control entities that replaced the large centralised control computers. An
instance of such a system is shown in Figure 2.2.
[Figure 2.2 diagram: a shared Fieldbus interconnecting Sensor nodes (Temper., V Speed, Voltage), Actuator nodes (Thrust, Angle L, Angle R) and two Controller nodes, whose control laws are labelled “In: V Speed, Out: Thrust” and “In: Temper., Out: Voltage”]
Figure 2.2: Typical DCS infrastructure
The system of Figure 2.2 has several entities interconnected by a shared fieldbus, which allows
the controllers to gather data from the sensors. After using the sensors’ data as an input for the
control laws they implement, they send the generated data to the actuators, which should act
accordingly.
¹ An example is the Anti-lock Braking System (ABS), where the braking order is combined with the wheel’s state.
Although usually the DCSs are small in scope, they can be integrated in larger control sys-
tems, such as Supervisory Control and Data Acquisition (SCADA), exploiting the availability and
properties of these smaller systems. One problem remains, however: how should the entities of
this distributed system interact?
Notwithstanding the fact that control systems’ entities can be spatially distributed, they still
need to interact as if they were in the same physical platform. Since the components are phys-
ically separated and do not share the same computing platform, it is necessary to synchronise
their state with each other. Furthermore, this synchronisation must be reliable, to ensure safe
operation.
To bridge the gap between the centralised and distributed paradigms a set of services is
needed, allowing this distributed system to operate and provide a safe and secure platform for
reliable applications. The foundations of distributed systems are well understood [20], with several
services and attributes identified. However, not all of these services are useful to build
distributed control systems.
The subset of distributed systems’ functions and facilities that are relevant to the design of DCS
are mostly the ones common in fault-tolerant distributed systems: group communication; mem-
bership and failure detection; clock synchronisation; reliable network infrastructure. For example,
usually a DCS does not need naming and addressing facilities, since the available services
(sensors, actuators) and their location (addresses) are defined during system design, and remain
static throughout system operation. On the other hand, an entity of this system might need to
know whether another entity is available on the system and providing correct service. It can do
so by resorting to a membership and failure detection service.
2.2 Distributed Real-time Systems
Most systems interacting with the real world are real-time, in the sense that the system’s
actions are bound by the progression of time. In this class of systems the failure to react to an
event in a timely fashion might yield results ranging from insignificant to catastrophic, e.g. the
previously mentioned car braking system taking too long to respond to a braking order might
endanger human lives.
This Section presents a brief introduction to real-time systems, with particular emphasis on
real-time communication, its operational models and properties.
2.2.1 Real-time Systems
The definition of “real-time systems” under a different perspective might be: real-time systems
are those whose tasks have timeliness requirements, i.e. the time the tasks require to be
completed must be bounded; this bound is usually computed as the Worst Case Execution Time (WCET).
The order in which these tasks are executed, i.e. the task schedule, depends on the type of
scheduling: fixed scheduling, which can be done “off-line” and ensures a cyclic behaviour, with
the tasks being carried out in the same order periodically; or dynamic scheduling, where the order
depends on a priority metric, usually the task’s deadline, i.e. the time by which the task’s
actions must be completed.
Independently of the type of scheduling, the scheduler may preempt tasks. A reason justifying
preemption might be a task whose deadline is nearer than the one of the currently running task.
Another reason for preemption might be a task whose deadline has been missed, i.e. execution
did not end before the deadline.
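The deadline-driven dynamic scheduling and preemption just described can be sketched as follows. This is a minimal illustration only: the `Task` fields, the one-tick time step and the `edf_schedule` function are assumptions made for the example, not part of any scheduler discussed in this work. At every tick the ready task with the nearest deadline runs, so a newly released task with an earlier deadline preempts the running one.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    release: int    # tick at which the task becomes ready (hypothetical unit)
    wcet: int       # worst-case execution time, in ticks
    deadline: int   # absolute deadline, in ticks

def edf_schedule(tasks, horizon):
    """Simulate earliest-deadline-first scheduling for `horizon` ticks.

    Returns (timeline, missed): the task name run at each tick (None when
    idle) and the sorted names of tasks still unfinished at their deadline.
    """
    remaining = {t.name: t.wcet for t in tasks}
    timeline, missed = [], set()
    for now in range(horizon):
        ready = [t for t in tasks if t.release <= now and remaining[t.name] > 0]
        # A task with work left at or after its deadline has missed it.
        missed.update(t.name for t in ready if now >= t.deadline)
        if not ready:
            timeline.append(None)
            continue
        current = min(ready, key=lambda t: t.deadline)  # nearest deadline wins
        remaining[current.name] -= 1
        timeline.append(current.name)
    return timeline, sorted(missed)

# Task B is released while A runs and has a nearer deadline, so it preempts A.
timeline, missed = edf_schedule(
    [Task("A", release=0, wcet=3, deadline=10),
     Task("B", release=1, wcet=2, deadline=4)],
    horizon=6,
)
print(timeline, missed)
```

In this run task A starts, is preempted by B for two ticks, and then resumes; neither task misses its deadline.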
The behaviour of the system in the presence of a missed deadline, and its consequences, depend
on whether it is a soft real-time system, where occasional misses and their consequences are
acceptable, or a hard real-time system, where the consequences of a missed deadline might be
catastrophic.
An example of a soft real-time system is media streaming, such as Voice over Internet Protocol
(VoIP) telephony. This system is real-time, in the sense that the packets carrying the encoded
voice must arrive at their destination within bounded time. If any packet misses its deadline,
however, the consequences will be perceived by the receiver as a glitch or a small period of
silence, which is usually tolerable.
On the other hand, a hard real-time system has stricter operational assumptions. Such a system
is the aforementioned brake-by-wire example, where missing a deadline - such as the one for
computing and outputting data to the braking actuators - can have catastrophic effects.
Given the distribution of these real-time systems, a facility providing real-time communication
services is also needed, ensuring that a reliable, available and timely communication channel ex-
ists to convey the information between system elements, thus ensuring global real-time operation.
2.2.2 Communication System Operation Models
There are two fundamental paradigms defining the timeliness of the interaction between the
real world and the distributed entities, i.e. when the system should react to external events
and communicate with other system entities. These paradigms are: the time-triggered approach,
where actions² take place only at certain points in time, synchronised by a global clock; and the
event-triggered approach, where actions take place as soon as possible after the originating event.
Although these two paradigms can be used to describe the behaviour of complete systems,
right from the application layer down to the communication infrastructure layer, we are mainly
interested in the latter. Both approaches have advantages and disadvantages, which will be
explored next.
² The term “actions” is deliberately generic, since its meaning can range from physical interaction with the environment to network message communication.
Time-triggered Approach
In a time-triggered architecture the interaction between system elements is not done upon
event generation: it only takes place at predefined instants in time. This mode of operation
imposes a synchronous behaviour on the system which has obvious advantages, such as maintaining
information consistency, or fault detection. Time-triggered operation requires static
scheduling of the communication system, which must be done “off-line”. This eases the design
and verification of such a system, and during operation allows fault detection, since all elements
know who should be communicating at any given instant.
This very same static communication scheduling, however, is its weakness: the time-triggered
architecture may not be able to accommodate changes to the system, such as adding new elements,
at least without redesigning the schedule. Another consequence of the static communication
scheduling is that the event generation distribution must be known at design time, in order to
provide the communicating entity with sufficient slots for event production/consumption.
Finally, the cyclic operation yields a higher mean response time, since message transmission
is delayed until the node’s scheduled transmission time. This can be a hindrance
in real-time systems whose behaviour is dominated mainly by aperiodic events, i.e. events that
may occur at any instant in time.
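The delay introduced by waiting for a scheduled slot can be illustrated with a small sketch. It assumes, purely for illustration, a cyclic schedule of `cycle` integer ticks in which the node owns a single slot starting `slot_offset` ticks into each cycle; the names `tt_wait` and `mean_tt_wait` are hypothetical.

```python
def tt_wait(event_time, slot_offset, cycle):
    """Ticks an event generated at `event_time` waits for the node's next slot.

    Assumes one slot per node per cycle, starting `slot_offset` ticks into
    each cycle of `cycle` ticks (illustrative model, not a specific protocol).
    """
    return (slot_offset - event_time) % cycle

def mean_tt_wait(slot_offset, cycle):
    """Average wait for aperiodic events spread evenly over one cycle."""
    return sum(tt_wait(t, slot_offset, cycle) for t in range(cycle)) / cycle

# An event generated just after the node's slot waits almost a full cycle.
print(tt_wait(event_time=3, slot_offset=2, cycle=10))   # 9
print(mean_tt_wait(slot_offset=2, cycle=10))            # 4.5
```

Under these assumptions an aperiodic event waits about half a cycle on average, whereas on an idle event-triggered network it could be sent immediately.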
Event-triggered Approach
In an event-triggered architecture the network messages can be sent at any time, usually right
after their generation. This paradigm is also called reactive, since the processing is done as a
reaction to an event. The brake-by-wire example presented earlier is such a system: braking can
occur at any instant, and it must be dealt with as quickly as possible due to the possibly
fast-changing physical quantities involved, such as speed and distance.
The event-triggered approach is the most suited for applications where all the environment
variables might not be known a priori. This is in part due to the possibility of using dynamic
communication scheduling, which allows it to cope with the uncertainty of the environment. Hence,
this architecture is the most adequate to interact with the “real world”: its dynamic behaviour,
induced by the Input/Output (I/O), makes it more suitable to adapt to new situations.
The responsiveness of an event-triggered architecture is better than that of its time-triggered
counterpart: there is no (artificially induced) delay time, and the event is processed as soon as
possible. However, there may be situations where a burst of events is brought into the system,
which must be dealt with in a timely fashion and, most importantly, in a safe fashion. For
example, if deadline misses occur, the system should accommodate this scenario and have
well-defined semantics regarding the provision of service.
2.2.3 Real-time Communication Networking Infrastructure
As previously discussed, the distribution of systems implies their interconnection through a
physical network. To ensure real-time operation of the interconnected systems one must also
extend the real-time paradigm into the communication network infrastructure, thus introducing
some constraints in the communication model.
The purpose of these constraints is to enforce determinism in the communication channel,
hence making it adequate for supporting real-time operation. Therefore, these constraints pertain
to: ensuring bounded transmission and processing delay; maintaining network connectivity;
controlling partitioning.
Maintaining network connectivity is tightly coupled with fault-tolerance. In order to maintain
connectivity, one must avoid partitioning, which creates subsets of elements that are unable to
communicate with each other. This impairment has two facets: physical partitioning, where the
physical network infrastructure is affected; and virtual partitioning, where the communication
channel is affected. The immediate result of both forms is an unavailable network infrastructure.
One of the most important guarantees that a real-time communication infrastructure must pro-
vide is bounded transmission delay, even in the presence of disturbances, such as other (unre-
lated) network traffic, or transient overload of the network. The knowledge of such delay is needed
to assess the (global) WCET of a task, for system scheduling purposes. This time can also be
used to detect abnormal protocol operation, and trigger fault recovery procedures.
Another concept that must be mapped from the real-time system onto the network is the notion
of priorities. Traffic prioritisation can be achieved by Quality of Service (QoS) measures, such as
different classes of traffic. This would allow the distinction between urgency classes, useful for
providing service to high priority tasks, in a process similar to dynamic task priority assignment.
Finally, to achieve (dependable) real-time communication one must have a reliable network,
which must be available and providing correct service even in the presence of disturbances.
These properties must be provided by any networking technology used in the construction of
real-time distributed systems, thus ensuring correct real-time operation of the (global) system.
2.3 Embedded Systems Networking (Fieldbus) Technologies
There are several fieldbus technologies in the industry, most intending to solve different prob-
lems in different applications. Whilst some aim at high data-rates, others aim at real-time be-
haviour. There is no technology satisfying all the requisites at once, e.g. timeliness, data-rate,
cost, dependability.
In this Section we present and analyse fieldbus technologies in use - or with potential to be
used - in domains requiring the interconnection of embedded systems, such as control or data-
handling. These fieldbus technologies are: Time-Triggered Ethernet (TTEthernet), SpaceWire
(SpW) and Controller Area Network (CAN).
These technologies can be mapped essentially onto only three layers of the Open Systems
Interconnection (OSI) model: Application, Data-Link and Physical. The technical specifications of
these technologies, however, only cover the two lower layers, with the Application layer being
left to the system designer’s discretion.
In order to be able to compare these fieldbus technologies, the analysis will focus on the
following figures of merit: network topology, network operation (including medium access), ad-
vanced services such as time information diffusion and fault-tolerance. Lastly, these results will
be compiled in a table, for a more effective comparison.
2.3.1 Time-Triggered Ethernet
Time-Triggered Ethernet (TTEthernet) [21] is a fieldbus being developed by TTTech Comput-
ertechnik AG as the network block for the construction of time-triggered applications. It uses
and extends standard Ethernet [22] as the physical (PHY) and data-link layers, providing coexis-
tence with standard Ethernet applications. An instance of a mixed-criticality TTEthernet network
is shown in Figure 2.3.
[Figure 2.3 diagram: two TTEthernet switches interconnecting safety-critical TTEthernet hosts, standard TTEthernet hosts and standard Ethernet hosts through Ethernet links]
Figure 2.3: Typical Time-Triggered Ethernet network
The typical network topology is a star where each node connects directly to a TTEthernet-aware
switch through a full-duplex link, allowing simultaneous transmission and reception of data.
The full-duplex property has several benefits: it avoids (local) bus access contention, and
therefore enhances the communication system’s timeliness by providing a deterministic local
message transmission time; and, since there is no need for collision detection, it removes the
cabling constraints that collision detection imposes. The TTEthernet specification, however, does
not preclude direct connection between nodes, provided that the link is full-duplex.
The TTEthernet fieldbus uses standard Ethernet infrastructure, namely the same data-link
layer up to a certain extent. Therefore, the frames have the same structure as Ethernet frames,
whose length goes from 64 bytes up to 1518 bytes. Since the medium used is Ethernet, the data
rates can go from 10 Mbps up to 1 Gbps, and links must always be full-duplex. A TTEthernet network
admits coexistence of both time-triggered and event-triggered (standard Ethernet) traffic. The
distinction, or multiplexing, between the types of traffic (payload) is done through the content
of the frame’s EtherType field, which either indicates the frame’s payload size, if its value is
1500 or smaller, or identifies a user protocol, if its value is 1536 or greater. The utilisation
of this field also excludes the utilisation of a Logical Link Control (LLC) sublayer protocol,
e.g. those defined by Institute of Electrical and Electronics Engineers (IEEE) Standard 802.2.
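The two interpretations of this 16-bit field can be sketched as below. The function name `classify_type_length` is hypothetical; the thresholds are those of standard Ethernet, where values up to 1500 denote a payload length and values from 1536 (0x0600) upwards denote a protocol type. No TTEthernet-specific value is assumed here; the example uses the well-known IPv4 type 0x0800.

```python
def classify_type_length(value):
    """Interpret the 16-bit Length/Type (EtherType) field of an Ethernet frame."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("the field is 16 bits wide")
    if value <= 1500:
        return ("length", value)       # payload size in bytes
    if value >= 1536:                  # 0x0600 and above
        return ("ethertype", value)    # identifies the carried protocol
    return ("undefined", value)        # 1501..1535 is not assigned a meaning

print(classify_type_length(46))        # ('length', 46)
print(classify_type_length(0x0800))    # ('ethertype', 2048), i.e. IPv4
```

A switch or node that multiplexes traffic classes on this field would dispatch on the `"ethertype"` branch; how TTEthernet traffic is actually tagged is left to its specification.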
The switches must be TTEthernet-aware, thus being able to make the distinction between the
traffic flows. With this knowledge, the switch might preempt the frames pertaining to standard
Ethernet upon synchronisation events, giving priority to TTEthernet ones. This modification of
the data-link layer behaviour raises issues for standard Ethernet nodes, which may suddenly stop
receiving a frame, not being prepared to deal (often) with such events. According to the TTEth-
ernet specification, fault confinement is provided by the switch, both in the spatial and temporal
domains. There is also error detection at the data-link layer, provided by standard Ethernet layer,
such as frame Cyclic Redundancy Check (CRC). This mechanism, however, is the only one
provided for error detection, and none is provided for error recovery.
The TTEthernet specification also offers advanced services for the basic construction of (de-
pendable) distributed real-time systems, such as: site membership, failure detection and global
clock synchronisation, as a consequence of the time-triggered paradigm; channel redundancy
and bus guardian functions at the switches are contemplated, although reserved for safety-critical
applications due to the overhead they introduce. The bus guardian function prevents the “babbling
idiot” failure mode, where a component uses - or tries to use - the resources in an untimely fashion.
Depending on the application domain, however, the TTEthernet architecture might raise some
issues. These issues can stem both from its time-triggered nature and from the complexity of the
network and nodes, which may make it less cost-effective.
2.3.2 SpaceWire
SpaceWire (SpW) [23] is a fieldbus being developed by ESA, other space agencies and
academia to provide high data-rate connectivity to spacecraft systems’ components. Its design
was inspired by the IEEE Standard 1355 [24], sharing several similarities, and even showing
compatibility at some layers.
The SpW fieldbus is a point-to-point, full-duplex serial data communication bus, embodied
in an entity called SpW link. The point-to-point characteristic implies that: the network topology
is arbitrary; messages between two nodes that are not directly connected must be routed through
the network; faults affecting a link are self-contained. An instance of a SpW network is depicted
in Figure 2.4.
[Figure 2.4 diagram: OBCs, RTUs and memories (Mem.), each with a SpW interface, interconnected by SpW links and a router with SpW interfaces]
Figure 2.4: Typical SpaceWire network
The network in Figure 2.4 mimics a typical aerospace application, such as a satellite, where
the OnBoard Computer (OBC) is connected to several Remote Terminal Units (RTUs). Each RTU
is responsible for one or more sub-systems, which can be either control or data. This network
is composed of several links and a router interconnecting two (logical) network segments. A
redundant setup is also shown, with two cross-connected SpW links between the replicated com-
ponents. The management of this setup must be done entirely by the application layer, since the
SpW specification [25] does not contemplate any type of redundancy provision and management.
Due to each link being full-duplex there is no contention for bus access between two directly
connected nodes, and therefore there is no need for bus media access arbitration. Whenever
a node needs to send a message it starts transmission immediately unless there is already a
previous transmission in progress.
Since the network topology is arbitrary but composed of point-to-point links, packets³ need
to be routed through the network. The method used is wormhole routing: an incoming packet
is forwarded as soon as it is possible to determine its destination, without waiting for it to be
received completely [26]. This architecture has the advantage of reducing message transmission
delay, being the opposite of store and forward, which is used by Ethernet and TTEthernet.
The basic information element is the character, which can be either control or data. Data
characters are 10 bits wide, having an 8-bit payload and two control bits, one of which is the
parity bit. The packet length in a SpW link is not limited: it can go from 2 characters upwards,
with no upper bound. This design option was taken both due to the limitation of 32 bytes per
packet in IEEE-1355, and in order to allow a simpler hardware implementation of the protocol,
leaving most of the details to be implemented in software.
³ The terminology used is the one defined in the SpW specification [25], where the packet is equivalent to a frame, and routing is equivalent to switching.
Packet routing admits two types of addressing: path addressing, where the (local) router port
addresses are embedded in the packet itself; and logical addressing, where each port in the
network has a unique identifier, which is used by the packet to identify its destination. These
modes have implications for the logic of the routers: while the former needs little logic, just
enough to remove one address from the top of the packet, the latter needs memory to hold a routing
table, usually a Content Addressable Memory (CAM) mapping the logical addresses to the router’s
ports.
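The difference in router logic between the two addressing modes can be sketched as follows. This is an illustrative model only: packets are represented as Python `bytes`, the routing table as a plain dictionary standing in for the CAM, and the function names `route_path` and `route_logical` are hypothetical. The sketch also assumes that a logically addressed packet is forwarded unchanged.

```python
def route_path(packet):
    """Path addressing: the leading character names the output port.

    The router strips that character, so the next router sees its own
    port address at the top of the packet.
    """
    if not packet:
        raise ValueError("empty packet")
    return packet[0], packet[1:]

def route_logical(packet, table):
    """Logical addressing: the leading character is looked up in a table.

    `table` maps logical addresses to output ports (a stand-in for the CAM);
    the packet is forwarded unchanged in this sketch (an assumption).
    """
    if not packet:
        raise ValueError("empty packet")
    return table[packet[0]], packet

# Two hops with path addressing: ports 2 and 5, then the payload remains.
port, rest = route_path(bytes([2, 5, 0xAA, 0xBB]))
print(port, rest.hex())     # 2 05aabb
port, rest = route_path(rest)
print(port, rest.hex())     # 5 aabb
```

The path-addressed router needs no state at all, while `route_logical` cannot work without the table, which is the memory cost mentioned above.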
Error detection can take place through a parity bit embedded in each character. Upon detec-
tion, a special control character is appended, in order to notify the remaining routers that an error
has occurred. Regarding advanced features, SpW provides native means for the synchronisation
of time, through a special packet called “Time-Code”.
Being optimised for raw throughput and small hardware footprint, SpW does not possess na-
tive support for either fault-tolerance or broadcast communication. However, an enhanced speci-
fication named SpaceWire-RT (Reliable-Timely) is being prepared, aiming at securing properties
such as reliability and timeliness, both essential for safe real-time operation.
2.3.3 Controller Area Network
The Controller Area Network (CAN) fieldbus was designed by Robert Bosch GmbH for automotive
applications. It features a bus network topology, operated in a simplex fashion: all nodes share
the same medium to transmit and receive. Figure 2.5 shows an instance of a CAN network,
composed of several communicating nodes.
[Figure 2.5 diagram: several hosts, each with a CAN interface, attached to a shared CAN link with bus termination]
Figure 2.5: Typical CAN network
The network topology is a shared, multi-master communication bus with a Carrier-Sense Multi-
ple Access / Deterministic Collision Resolution (CSMA/DCR) medium access arbitration scheme.
Whenever the bus is in an idle state, any node with queued messages starts transmission by
sending the message’s identifier, which is broadcast to all nodes. Upon collision the node trans-
mitting the message with the lowest identifier goes through and gains bus access, while the other
competing nodes back-off and go into listening mode. At the end of the transmission any node
having pending messages starts the arbitration process again, until all messages are transmitted.
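The deterministic collision resolution described above can be sketched as a bit-by-bit contest on the identifier, most significant bit first, where a dominant bit (0) on the bus overrides a recessive bit (1). The function `arbitrate` and its 11-bit default width are illustrative assumptions; the sketch also assumes that every contending message has a distinct identifier.

```python
def arbitrate(identifiers, width=11):
    """Return the identifier that wins bus access among simultaneous senders."""
    contenders = set(identifiers)
    if not contenders:
        raise ValueError("no node is transmitting")
    if any(not 0 <= ident < (1 << width) for ident in contenders):
        raise ValueError("identifier does not fit in the given width")
    for bit in range(width - 1, -1, -1):            # most significant bit first
        sent = {ident: (ident >> bit) & 1 for ident in contenders}
        bus_level = min(sent.values())              # dominant 0 overrides recessive 1
        # A node that sent recessive but reads dominant backs off and listens.
        contenders = {ident for ident, level in sent.items() if level == bus_level}
    (winner,) = contenders
    return winner

print(hex(arbitrate([0x65, 0x64, 0x123])))          # 0x64
```

Because a node only withdraws when it loses a bit, the winning frame is not corrupted by the collision, and the winner is always the lowest identifier.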
The payload of a CAN data message varies from 0 to 8 bytes, and the data rate can go up to
1 Mbps. The maximum data rate depends on the physical network length, due to the network
operation mode in which all nodes sample the bit present on the network at the same time, apart
from a small amount of jitter derived from the nodes’ local clock drift. Therefore, the bit
must be allowed to propagate to the farthest node in the network before being sampled.
The CAN fieldbus possesses native fault-tolerant mechanisms. During message transmission
all the nodes - including the transmitting node - listen and check for any violation of the protocol.
If detected, an error flag is sent by the detecting node(s), consisting of only dominant bits and
overwriting the current transmission. This ensures the error is perceived by all the nodes. Upon
an error, the transmitting node reschedules the message automatically to be transmitted in the
next available opportunity. Furthermore, each frame has a CRC field, allowing the detection of
errors that may have gone undetected by other means.
During normal operation, CAN signals are transmitted in differential mode through two wires.
If any of the wires gets “stuck-at” a level, transmission falls back to single-wire mode. Hence, CAN
can maintain the communication channel even in the presence of physical faults.
Whenever a message is transmitted, whatever its type, a set of counters internal to each node is
incremented or decremented, depending on the outcome of the transfer. Upon reset, a CAN node is
in a state called error active, allowing it to fully participate in communication. This mechanism
is part of the fault-confinement functions, which may lead to node disconnection from the bus: if
the error counter exceeds a certain threshold, the node is put into the bus-off state.
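The fault-confinement states can be sketched as a function of the two error counters kept by each node. The class below is a deliberately simplified model: the thresholds (127 and 255) and the basic increments follow the standard CAN rules, but the special cases of the protocol are omitted, and the class and method names are hypothetical.

```python
class ErrorCounters:
    """Simplified model of a CAN node's fault-confinement counters."""

    def __init__(self):
        self.tec = 0    # transmit error counter
        self.rec = 0    # receive error counter

    @property
    def state(self):
        if self.tec > 255:
            return "bus off"            # node is disconnected from the bus
        if self.tec > 127 or self.rec > 127:
            return "error passive"
        return "error active"           # state after reset

    def tx_error(self):
        self.tec += 8                   # a failed transmission weighs more

    def rx_error(self):
        self.rec += 1

    def tx_ok(self):
        self.tec = max(0, self.tec - 1)

    def rx_ok(self):
        self.rec = max(0, self.rec - 1)

node = ErrorCounters()
for _ in range(16):
    node.tx_error()
print(node.tec, node.state)             # 128 error passive
```

Since errors add more than successes subtract, a persistently faulty transmitter drifts towards bus-off, while a node hit by occasional disturbances returns to error active.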
The CAN fieldbus is particularly well suited for real-time communication, due to its prioritised
medium access scheme, where the message having the lowest identifier gets through. The mes-
sage identifiers can be mapped to traffic classes, depending on their urgency and thus provide
basic QoS measures where higher priority traffic, e.g. an alarm, can get bus access without
having to contend with lower priority traffic.
Fieldbus Technologies Comparison
To conclude the presentation of the fieldbus technologies, Table 2.1 compares them w.r.t.
several attributes.
A comment regarding the frame efficiency of SpW: the upper figure comes from the fact that
in SpW there is no maximum size for a packet, hence the 99.9(9)% value. Regarding the
minimum value, the path addressing mode for packet routing specifies that the maximum number
of hops (routers) is 32. Therefore, a (highly unlikely) worst-case scenario may arise in which
a single data character is transmitted using path addressing through a network of 32 hops,
i.e. one payload character out of 33, hence the rounded 3.(03)%.
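The two bounds quoted for SpW can be reproduced with a short calculation. The sketch assumes, as the figures above suggest, that efficiency is counted in characters and that the only overhead is one path address character per hop; end-of-packet markers and character-level control bits are ignored. The function name `spw_path_efficiency` is hypothetical.

```python
def spw_path_efficiency(data_chars, hops):
    """Fraction of transmitted characters that carry payload.

    Assumes one path address character per hop and no other overhead
    (an illustrative simplification).
    """
    if data_chars < 1 or hops < 0:
        raise ValueError("need at least one data character and no negative hops")
    return data_chars / (data_chars + hops)

worst = spw_path_efficiency(data_chars=1, hops=32)
print(f"{100 * worst:.2f}%")                        # 3.03%
# With an unbounded packet length the efficiency approaches 100%.
print(f"{100 * spw_path_efficiency(10**6, 32):.3f}%")
```

The first value is 1/33, matching the minimum in the comparison; the second shows how the fixed addressing overhead becomes negligible as the packet grows.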
Parameter                 | TTEthernet               | SpaceWire                 | Standard CAN
Maximum Data Rate         | 1 Gbps                   | 400 Mbps                  | 1 Mbps
Network Operation         | transmission line        | transmission line         | quasi-stationary bus
Media Access Control      | N/A (Full-Duplex)        | N/A (Full-Duplex)         | CSMA/DCR
Frame Efficiency          | 37.5% - 98.8%            | 3.03% - 99.9%             | 45.3% - 59.2%
Error Detection (Domains) | Value and Time           | Value only                | Value only
Fault Confinement         | Provided by the switch   | link active, link inactive | error active, error passive, bus off
Omission Handling         | no detection             | detection                 | detection/recovery, frame retransmission
Media Redundancy          | no                       | no                        | no
Channel Redundancy        | possible                 | no                        | no
Babbling idiot avoidance  | bus guardian (in switch) | not provided              | not provided
Communications            | unicast/broadcast        | unicast/limited broadcast | broadcast
Table 2.1: Comparison of TTEthernet, SpW and CAN
2.4 CAN in airborne and spaceborne applications
The provision of native fault detection and recovery mechanisms, together with low weight and
size cabling have allowed CAN to be quickly incorporated into domains other than automotive.
One of those domains was aviation, where the Size, Weight and Power-consumption (SWaP)
requirements are paramount. Current aircraft such as the Airbus A380 and the Boeing 787 have
CAN-enabled control sub-systems, complying with the Aeronautical Radio, Inc. (ARINC) 825
specification [7].
The CAN fieldbus has also found its way into space applications, being used both by
commercial satellite buses [27] and by space agencies [28]. Currently the European Cooperation for
Space Standardization (ECSS) is coordinating the standardisation of CAN for use in future
space missions. The document is still in the draft phase [8], but CAN is already being deployed in
current ESA missions. One of those missions is the ExoMars rover [29], which will perform robotic
exploration tasks on Mars. This mission has used the draft as the starting point for the on-board
CAN bus design, which also served the purpose of testing and refining the draft standard.
A strong argument for CAN deployment in space is cabling weight, which is at a premium.
The regular Unshielded Twisted Pair (UTP) cabling has a much smaller weight footprint than that
of other space technologies such as SpW and the legacy MIL-STD-1553B. Another argument is
power consumption, a strong one given the solar-cell and battery-based nature of space
applications. CAN communication uses much less power than its specialised counterparts,
making it a suitable solution for space applications.
Although space is usually regarded as a high-dependability domain, not all applications are
designed to capture these requirements, especially the attributes concerning sub-system
fault-tolerant real-time communication. Among those applications are the Surrey Satellite
Technology Ltd (SSTL) satellite buses⁴, which use a dual-media CAN solution for redundant
sub-system communication. However, the detection of and recovery from a failed bus medium can
take up to five minutes [27], which can be unacceptable in applications dealing with short
deadlines from aperiodic tasks, e.g. orbital manoeuvres for spacecraft docking.
⁴ A satellite bus is a generic platform having most sub-systems defined and only differing in the payload. This way the
2.5 High-Dependability CAN-based architectures
Industry and academia started to envisage new application domains for the CAN fieldbus
after its debut in the automotive world. Some of these domains demanded higher operational
guarantees, such as mission-critical ones. The set of these high dependability domains includes
medical, oil drilling, chemical process and power plant control, among others. Standard CAN,
however, could not address all the strict requirements concerning dependable operation that these
domains demand. Therefore, the CAN fieldbus weaknesses w.r.t. dependable operation had to
be overcome.
There were two main approaches to endow CAN with dependable operation attributes: en-
hance or complement the standard CAN layer5, i.e. enhance CAN through additional mecha-
nisms, or complement through the execution of protocols and algorithms at the application-level.
A first analysis and approach to the problem of real-time CAN dependable operation was given
in [30], where a dual CAN channel solution was proposed to provide communication redundancy,
together with full-space redundancy of the host itself. This approach, however, is not
cost-effective: the replication of hardware, from the processing elements to the communication
controller, can be very expensive, both in terms of cost and SWaP properties.
A simpler approach consists in replicating only the communication media, thus providing
redundancy to the channel itself. This approach increases availability, but also poses another
problem: managing the several redundant media, especially how to combine the information received
from them and present it to the single Medium Access Control (MAC) entity.
The following Sections are devoted to the analysis of high-dependability CAN-based
architectures, highlighting their architectural features and their main contributions to CAN
dependable operation, such as maintaining network connectivity.
2.5.1 RedCAN
The RedCAN architecture is a commercial solution for high dependability CAN based sys-
tems [31, 32]. An instance of a RedCAN network is depicted in Figure 2.6.
The network topology is a self-healing ring, achieved by dividing the network into segments,
called sections in the RedCAN context. The transformation of the CAN ring topology into a bus is
made through special-purpose machinery inserted between the physical network cabling and the
costs can be lowered and the mission cycle can be shortened.
5 The option to modify the standard CAN layer has only become possible in the last ∼5-7 years due to the availability of
cost-effective FPGA devices, such as the Xilinx Spartan-3 family, and IP cores. However, such an approach must ensure backward-compatibility with the standard CAN layer, mainly due to the large number of deployed standard CAN devices.
Figure 2.6: Ring topology RedCAN network
CAN transceivers. This machinery is composed of termination resistors and switches controlled
by the node, embodied by the RedCAN interface (see Figure 2.6). The termination of the bus is
made at a specific node possessing a RedCAN interface (such as Node n in Figure 2.6), while
the remaining nodes possessing RedCAN interfaces are in transparent mode, bypassing the bus
media (Node 1 in the same Figure).
In a RedCAN network not all nodes are required to have a RedCAN interface. Each section
needs two RedCAN-enabled nodes: the leftmost and the rightmost, for bus termination purposes.
Moreover, each section supports the connection of nodes with standard CAN interfaces (e.g.
Node 2).
The RedCAN error confinement mechanisms act upon the sections, isolating the faults. These
faults may originate in the sections themselves, e.g. a damaged network cable, or in faulty nodes,
e.g. a node exhibiting “babbling idiot” behaviour.
Through the RedCAN ring reconfiguration mechanisms, nodes equipped with a RedCAN interface
(Nodes 1, 3 and n in Figure 2.6) can maintain connectivity even if a section is disconnected.
Although RedCAN provides a fault-resilient network infrastructure, it may be inadequate for
hard real-time operation. The fault detection and recovery procedures involved in bus reconfigu-
ration - done through mechanical switches - take time, which can be in the order of hundreds of
milliseconds [32]. This time may be enough for deadline violation in applications with tasks having
short periods, and thus system failure.
2.5.2 FlexCAN
The FlexCAN architecture [33] aims at providing an ultra-dependable solution for generic em-
bedded systems requiring dependable operation. It resorts to the concept of SafeWare, a middle-
ware providing the necessary services to achieve high dependability. It also supports operation
with both standard CAN and FlexCAN-aware nodes. An instance of a FlexCAN network is shown
in Figure 2.7.
It offers a layered architecture with several options for fault-tolerance: channel redundancy,
using multiple CAN controllers and buses; and node redundancy, forming a Fault Tolerant Unit
(FTU). In the example of Figure 2.7 one can observe: a full-space redundant node (Node 1), two
regular nodes (Nodes 2 and n), and a non-redundant, standard node (Node 3).
Figure 2.7: Typical FlexCAN network
In the FlexCAN architecture redundancy management is achieved through a specially crafted
protocol, dubbed SafeCAN [34]. This protocol is implemented at the application layer, and man-
ages the information from the several redundant controllers in order to guarantee reliable and
available communication.
Although this architecture is capable of offering highly dependable CAN-based operation, it
does so at a high cost. This cost derives mainly from the provision of a fully space-redundant
architecture both at the communication channel and the supporting computing platform.
2.5.3 CAN Enhanced Layer - CANELy
Another approach to highly dependable solutions based on the CAN bus is the CANELy
architecture [11]. This architecture uses a replicated medium bus network topology to enhance the
network availability. A typical CANELy network is depicted in Figure 2.8, composed of two
redundant media conveying one CAN channel.
Figure 2.8: Typical CAN Enhanced Layer network
The nodes are all connected to the redundant media, having suitable mechanisms for re-
dundancy management. The CANELy architecture provides several services and functions for
building dependable distributed systems. Such services include: reliable communication protocol
suite, clock synchronisation, inaccessibility detection and control, medium and channel monitor-
ing, besides the aforementioned bus media redundancy.
The approach taken w.r.t. replicated network infrastructure is similar to the one in Delta-4 [35],
where only the Physical Layer (PHY) components are replicated, thus keeping one MAC entity
per communication channel. This provides a more cost-effective solution, since the cost - both
in terms of components and redundancy management - is kept low. This approach, however,
does not preclude a full-spatial redundancy scenario, with the replication of the channel. The
provision of such a level of redundancy is reserved for the applications requiring the highest guarantees
of operation, where the cost of having extra machinery is acceptable. The CANELy architecture
is analysed in a more comprehensive manner in the following Chapter.
3 CAN Enhanced Layer
The design of a dependable hard real-time communication infrastructure must take into ac-
count the dynamics of both the computational and communication systems, and at the same time
exploit the operative mechanisms of the underlying fieldbus. Therefore a systemic approach is
in order, obliging the inclusion of communication behaviour in the system’s dynamics, paving the
way for a dependable hard real-time communication architecture using CAN.
In this Chapter the CAN Enhanced Layer (CANELy) architecture [11] is presented in detail,
giving emphasis to the contributions being addressed in the construction of dependable hard real-
time distributed control systems. The system architecture is presented and discussed in a top-
down approach, through the several system services and components. Finally, the engineering
aspects are detailed w.r.t. the CAN controller and physical layer components.
3.1 System Architecture
A dependable communication infrastructure must provide several services to the host system
using it, ranging from reliable message diffusion protocols to (physical) redundant communication
media management. Moreover, it must be organized in a layered structure in order to allow
composition of the several services.
The main objective of the CANELy architecture is to enhance and complement the standard
CAN layer with mechanisms pertaining to dependability and timeliness guarantees, but without
modifying it. This goal is achieved through a non-invasive approach w.r.t. the standard CAN
layer as defined in [6], thus promoting the reuse of currently deployed applications and devices,
while providing them with higher operational guarantees of service. The CANELy architecture is
shown in Figure 3.1.
Figure 3.1 clearly shows the hierarchy of the architecture and its components, and how the
standard CAN layer is accommodated. Based on a modular philosophy,
CANELy comprises both hardware and software components: the hardware-based mechanisms
deal with dependability enhancements such as bus media redundancy and bus failure masking
through media redundancy; the software-based mechanisms deal with reliable communication
protocols and services common to distributed systems, such as group communication, membership
and node failure detection, and clock synchronisation. The following Sections detail CANELy’s
architecture, focusing on the main working areas: Reliable Communication, Network Dependability
and Hard Real-time Operation.
Figure 3.1: CANELy System Architecture (blocks: reliable communication protocol suite; layer management; media/network monitoring; control of inaccessibility; AND-based media selection; CAN Standard Layer; Channel Interface with Ch Tx and Ch Rx; Media Redundant CAN Communication Channel)
3.1.1 Reliable Communication and Services
A basic building block of a communication system is a reliable communication facility, for ef-
fective support of more advanced services common to distributed systems [20], thus providing an
important set of protocols and services for dependable distributed operation.
Message diffusion
Given the shared nature of CAN communication media all messages sent by any node are
broadcast to all nodes. However, even with robust electrical encoding that broadcast is not
immune to (consistency) errors. A scenario is shown in [36] where an error affecting the
message transmission is not perceived correctly by all nodes, leaving the system in an
inconsistent state.
In order to cope with these problems a set of communication protocols was devised [11], pro-
viding the foundations for building complex applications and replication/cooperation management
services based on reliable message diffusion.
Group Communication
A key feature of a distributed system is the concept of group: the set of elements participating
in the (global) system’s actions, usually through the execution of distributed algorithms. Through
group communication the application using CANELy’s services can access more advanced facilities
such as QoS or message filtering, e.g. a set of controllers and actuators requesting a sensor’s
value.
In distributed control applications it is usual to have replicated components, e.g. replicated
actuators for safety-critical operation. Moreover, it is also common for some controllers to get
input from more than one sensor, and a sensor can serve multiple controllers. Therefore, CANELy
must provide such facilities in order to filter messages, delivering to the upper layers only the
messages intended for that node, i.e. to introduce the notion of multicast communication.
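The multicast filtering just described can be illustrated by a minimal Python sketch. It is not part of the CANELy specification: the class name MulticastFilter, the mapping from CAN identifiers to group names and the identifiers used in the example are all hypothetical, chosen only to show a node delivering to its upper layers the frames of the groups it has joined.

```python
from dataclasses import dataclass, field


@dataclass
class MulticastFilter:
    """Delivers to the upper layers only frames addressed to a joined group."""

    # Hypothetical mapping: CAN identifier -> group name (not defined by CANELy).
    group_of_id: dict
    subscriptions: set = field(default_factory=set)

    def join(self, group):
        """Start accepting frames addressed to `group`."""
        self.subscriptions.add(group)

    def leave(self, group):
        """Stop accepting frames addressed to `group`."""
        self.subscriptions.discard(group)

    def deliver(self, can_id, payload):
        """Return the payload if this node belongs to the frame's group, else None."""
        group = self.group_of_id.get(can_id)
        if group in self.subscriptions:
            return payload
        return None


# Example: a controller that only listens to a (hypothetical) temperature group.
node_filter = MulticastFilter({0x120: "temperature", 0x200: "valve_command"})
node_filter.join("temperature")
accepted = node_filter.deliver(0x120, b"\x19")   # delivered upwards
rejected = node_filter.deliver(0x200, b"\x01")   # filtered out
```

Every frame still reaches every node on the bus; the filter only decides what is handed to the upper layers, which is what turns the CAN broadcast into a multicast abstraction.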
Site membership and failure detection
A site membership service provides consistent information regarding the sites (or nodes)
present in the network. This information is usually called view. Such a service might aid and
ease the provision of other services involving interactions between the sites, since it provides in-
formation on which nodes are active. For example, a group communication service can benefit
from the knowledge of exactly what nodes it is transmitting messages to.
The CANELy architecture provides a site membership service, together with a node failure
detection service to detect crashed sites. These two services ensure that there is at all times
a correct view of the elements participating in the system’s actions. A set of low-level
micro-protocols has been devised to handle node join/leave events, detect node failures and enforce
agreement. These protocols must be effective in the utilization of the CAN bandwidth, thus
lowering their impact on normal network operation.
Clock Synchronization
Another key feature in a distributed system is (event) causality and time-stamping, e.g. to
determine the state of a task composed by several processes [37]. This service is extremely useful
in a control system for keeping track of application state, and having the means for coordinating
system’s actions that progress over time.
In order to capture event causality, each node in the system must not only possess a time-
keeping facility but also a service to guarantee a globally coherent timebase, shared by all nodes.
Although this issue has not been discussed specifically in the CANELy architecture definition,
a suitable (distributed) algorithm is described in [38]. This algorithm provides CANELy with the
means for accurate clock synchronization. The integration of this service and all the previously
described services is depicted in Figure 3.2.
Figure 3.2: CANELy reliable communication and services block diagram (blocks: Group Communication; Site Membership & Failure Detection; Clock Synchronization; Broadcast Communication; Standard CAN Layer with Ch Tx and Ch Rx)
3.1.2 Network Dependability
The provision of dependable service must be supported by a physical network infrastructure
providing reliable and available service. Some of the problems affecting the communication
infrastructure are: transient faults, such as bit corruption due to EMI; and permanent faults, such
as physical bus media damage. These disturbances cause virtual and physical network partitioning,
leading to subsets of nodes that cannot communicate with each other.
Bus Media Redundancy
The basis for network dependability enhancement is bus media redundancy, providing redun-
dant communication paths. Although this solution allows the relaxation of the CAN fault model, it
also introduces a new problem: redundancy management.
The problem of redundant media management is ingeniously solved through the exploitation of
both the quasi-stationary bus operation mode and the wired-AND nature of CAN bus access: all
incoming media are AND’ed together into the Channel, through a component called AND-based
media selection, depicted in Figure 3.1. This scheme greatly simplifies the machinery involved,
since there is no need for complicated media bit-synchronization and decision mechanisms.
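The AND-based selection can be illustrated in software by the short Python sketch below. It only models the logic, not the hardware: bits are represented with the usual CAN convention (dominant as 0, recessive as 1), while the function names and the optional enabled mask, which stands for media excluded by fault confinement, are assumptions introduced for this illustration.

```python
RECESSIVE = 1  # CAN recessive level, logical 1
DOMINANT = 0   # CAN dominant level, logical 0


def select_channel_bit(media_bits, enabled=None):
    """Combine the bits sampled on each medium into one channel bit (AND).

    media_bits: sequence of 0/1 values, one per redundant medium.
    enabled: optional sequence of booleans; a disabled medium is ignored.
    """
    if enabled is None:
        enabled = [True] * len(media_bits)
    channel = RECESSIVE
    for bit, is_enabled in zip(media_bits, enabled):
        if is_enabled:
            channel &= bit  # any dominant bit forces the channel to dominant
    return channel


def select_channel_stream(media_streams):
    """Apply the AND selection bit by bit to equally long media bit streams."""
    return [select_channel_bit(bits) for bits in zip(*media_streams)]


# Medium A carries the frame; medium B is idle at the recessive level
# (for instance, a medium on which no signal arrives): B is masked.
channel = select_channel_stream([[1, 0, 1, 1, 0], [1, 1, 1, 1, 1]])
```

Under this rule a medium that stays recessive does not disturb the channel, whereas a medium stuck in the dominant state would force the whole channel to dominant; this is why the selection has to be paired with monitoring that can disable a faulty medium.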
Channel and Bus Medium Fault-Tolerance
The introduction of redundant bus media may have relaxed1 the CAN fault model, but also
introduces new faults into this extended model. The CANELy architecture has clearly defined
the fault model affecting a CAN-based redundant media network, and provides the analytical
foundations for monitoring functions aiming at error detection and fault confinement.
These mechanisms act at all levels of the network: medium and channel. A fault affecting a
medium might propagate into the channel, as it happens with standard CAN. Given the redundant
media, however, we can detect which medium or media are being affected (fault detection) and
disable their participation in the channel (fault confinement).
Lastly, each medium is also monitored w.r.t. its omission degree, i.e. the number of omission
errors affecting that medium, in a reference time interval. A medium exceeding its omission degree
bound should be declared failed, its participation in the channel formation disabled, and the upper
layers signalled to execute any recovery procedures.
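The omission degree monitoring of a medium can be sketched as follows. This is only an illustration under stated assumptions: the reference time interval is modelled as a sliding window over timestamps in arbitrary units, and the class name, the parameters bound and interval and the latching of the failed state are choices made here, not definitions taken from the CANELy architecture.

```python
from collections import deque


class OmissionDegreeMonitor:
    """Counts the omission errors of one medium inside a reference interval."""

    def __init__(self, bound, interval):
        self.bound = bound        # maximum tolerated omissions (assumed parameter)
        self.interval = interval  # reference interval length, arbitrary time units
        self.events = deque()     # timestamps of the recent omission errors
        self.failed = False       # latched once the bound is exceeded

    def record_omission(self, now):
        """Register an omission at time `now`; return True once the medium has failed."""
        self.events.append(now)
        # Drop omissions that fell out of the reference interval.
        while self.events and now - self.events[0] >= self.interval:
            self.events.popleft()
        if len(self.events) > self.bound:
            self.failed = True
        return self.failed


# Three omissions within 100 time units exceed a bound of two.
monitor = OmissionDegreeMonitor(bound=2, interval=100)
results = [monitor.record_omission(t) for t in (0, 10, 20)]
```

Once failed is set, the caller would disable the medium in the channel formation and signal the upper layers, as described above.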
3.1.3 Hard Real-time Operation
Hard real-time operation assumes a greater importance when part of a distributed system
due to a common point of intersection: the communication network. The correctness of real-time
behaviour depends not only on the local resources, be it computational or regular I/O interfaces,
but also on the availability of the networking infrastructure.
Inaccessibility is a subtle form of partitioning, characterized by the channel being temporarily
unavailable for other nodes to communicate, i.e. inaccessible. In a CAN network the occurrence
of omissions is tightly coupled with inaccessibility events. An omission error implies (at least) the
signalling of the error, thus leaving the bus inaccessible for the duration. Therefore, omissions are
implicitly transformed into inaccessibility periods.
It is impossible to avoid inaccessibility - even an ideal and fault-free physical network infrastruc-
ture could suffer from local node circuitry malfunction, such as loss of synchronization. Therefore,
inaccessibility control is needed to: assess the amount and duration of inaccessibility periods,
optimize protocol timeout calculations and mitigate its effects.
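As a rough illustration of how inaccessibility could be accounted for in a protocol timeout, consider the Python sketch below. It is one hypothetical policy, not the method defined by CANELy: the class InaccessibilityLog, the function padded_timeout and the choice of padding a timeout with the longest period observed so far are assumptions made for this example, with durations expressed in arbitrary units such as bit times.

```python
class InaccessibilityLog:
    """Accumulates the inaccessibility periods observed on the channel."""

    def __init__(self):
        self.periods = []  # durations of the recorded periods

    def record(self, start, end):
        """Store one inaccessibility period delimited by `start` and `end`."""
        if end < start:
            raise ValueError("period ends before it starts")
        self.periods.append(end - start)

    def count(self):
        """Number of inaccessibility periods seen so far."""
        return len(self.periods)

    def worst_case(self):
        """Longest inaccessibility period seen so far (0 if none)."""
        return max(self.periods, default=0)


def padded_timeout(base_timeout, log):
    """Extend a protocol timeout by the longest inaccessibility period observed."""
    return base_timeout + log.worst_case()


log = InaccessibilityLog()
log.record(10, 25)    # 15 units of inaccessibility
log.record(100, 104)  # 4 units of inaccessibility
timeout = padded_timeout(50, log)
```

A timeout that ignores inaccessibility may expire while the bus is merely signalling an error, whereas a grossly oversized one delays the detection of real failures; measuring the periods allows the margin to be kept tight.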
3.2 System Components
The services and mechanisms provided by CANELy can be mapped into components, each of
them encompassing several functions and facilities. There are three main components: the CANELy
Dependability Engine, the Media Selection Unit and the Inaccessibility Control Unit.
3.2.1 CANELy Dependability Engine
The CANELy Dependability Engine provides support for the higher layers of the CANELy ar-
chitecture, i.e. reliable communication and services. It supports the execution of protocols and
other computational tasks, such as management functions in order to assess network operation
status, and is illustrated in Figure 3.3.
Figure 3.3: CANELy Dependability Engine interfaces (Host System, Ch Tx, Ch Rx, Management Interface)
This component communicates with the host system through (buffer) channels, and
with other (networked) nodes on the system through a standard CAN channel. It also communicates
with the remaining components of a CANELy node through a management interface,
using the information provided by lower level components to aid in higher-layer protocol operation.
3.2.2 Media Selection Unit
The Media Selection Unit component encapsulates most of the Network Dependability ser-
vices and mechanisms, being depicted in Figure 3.4.
Figure 3.4: Channel redundant media management (Media Selection Unit with standard media interfaces MTx(1) to MTx(m) and MRx(1) to MRx(m), channel interfaces Ch Tx and Ch Rx, and a management interface)
This component receives the several redundant media, and extracts the unique representation
of channel information to be provided to the MAC sublayer, through the bus redundancy
management mechanism. It also performs the inverse function: it replicates the channel through all
the media. This component’s duties also include the provision of monitoring functions w.r.t. the
channel and the several media. This monitoring is done in order to: assess the state of the elements
participating in the communication; perform error detection and confinement, e.g. for a failed medium
permanently in dominant state.
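The detection of a medium permanently in the dominant state can be illustrated by counting consecutive dominant bits on that medium. The sketch below is an assumption-laden illustration: the class name and the threshold max_dominant_run are hypothetical, and the value chosen in a real design would have to come from the analysis of the longest dominant sequence that correct CAN operation can produce, which is not reproduced here.

```python
DOMINANT = 0  # CAN dominant level, logical 0


class StuckDominantDetector:
    """Flags a medium whose dominant run exceeds a configured threshold."""

    def __init__(self, max_dominant_run):
        self.max_dominant_run = max_dominant_run  # assumed, design-specific bound
        self.run = 0        # length of the current run of dominant bits
        self.stuck = False  # latched once the threshold is exceeded

    def sample(self, bit):
        """Feed one sampled bit; return True once the medium is considered stuck."""
        if bit == DOMINANT:
            self.run += 1
            if self.run > self.max_dominant_run:
                self.stuck = True
        else:
            self.run = 0  # a recessive bit ends the current dominant run
        return self.stuck


# With a (purely illustrative) threshold of 3, a fourth consecutive dominant bit trips it.
detector = StuckDominantDetector(max_dominant_run=3)
flags = [detector.sample(b) for b in (0, 0, 0, 1, 0, 0, 0, 0)]
```

Because the detector works at bit level and can be given a threshold matched to the configured bit rate, it can react within a few bit times.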
This unit communicates with the CANELy Dependability Engine with two purposes: receive
network operation configuration parameters, and signal the upper layers upon an exceptional
event, such as a failed medium. It achieves this communication via a special purpose interface,
as shown in Figure 3.4.
3.2.3 Inaccessibility Control Unit
The Inaccessibility Control Unit (ICU) is responsible for the monitoring of the channel w.r.t.
events affecting the communication timeliness. Its interfaces are shown in Figure 3.5.
The ICU continuously monitors the channel in order to detect network inaccessibility events.
When one is detected, it may inform the upper layers of the duration of such event, through the
assertion of a signal. The contribution of this unit to the CANELy architecture is the assessment
of the real duration of network inaccessibility and of how long its effects last in the operation
of a CAN-based infrastructure.
Figure 3.5: Inaccessibility Control Unit (Channel Monitoring block; interfaces: Ch Rx, Ch Ina, Management Interface)
3.3 Engineering Constraints
The CANELy architecture is built on top of theoretical and analytical results, making few or
no assumptions about the supporting devices or technology. Hence, to be materialized it must be
engineered, subject to the constraints that engineering introduces. The constraints are presented
through the use of the collapsed three-layer communication stack (see Section 2.3.3, page 16). In
Figure 3.6 the hardware components’ interaction is illustrated.
Figure 3.6: CANELy engineering model (Host System; Microcontroller; Field Programmable Gate Array; CAN single/dual channel interfaces; management interfaces; CAN PHY interfaces)
The model in Figure 3.6 provides a processing unit for the execution of the higher layers of
CANELy, and a PLD, namely an FPGA, for the execution of the monitoring and confinement
functions. They interact through two different channels: the standard CAN layer and specialized
management interfaces. The Figure makes explicit that the management information is conveyed
Out-of-band (OOB) with respect to the CAN channels, and therefore the management functions have
no influence whatsoever on CAN bus operation.
3.3.1 CANELy Components
CANELy Dependability Engine
Figure 3.7 shows the embodiment of the CANELy Dependability Engine component.
Figure 3.7: CANELy Dependability Engine (Microcontroller with execution environment, EEPROM, RAM and programmable timers; CAN Controller; input and output channels; system interface; physical layer interface)
This component is composed of three main blocks: CAN Controller, providing an implementation
of the standard CAN layer as defined in [6, 5]; Microcontroller, commanding the CAN controller
and providing support for the execution of the communication protocols and advanced services;
Message Input/Output Channels, providing message buffers which must support a priority-based
queuing policy, in order to provide QoS to higher priority messages, e.g. urgent control messages.
While the components need not be integrated in the same integrated circuit, such a feature
is desirable to: lower component count, lower implementation area and increase overall circuit
reliability due to fewer points of failure. This component can then be implemented by resorting to
state-of-the-art microcontrollers having at least one standard CAN controller, such as the
Maxim/Dallas DS80C390 [39] (Dual-CAN, 8051 architecture) or the Texas Instruments LM3S2965 [40]
(Dual-CAN, ARM Cortex-M3 architecture).
CANELy Media Selection and Inaccessibility Control Units
The engineering of the CANELy Media Selection and Inaccessibility Control Units has a strong
hardware design component, since they provide low-level functions. Therefore these must be
mapped into an FPGA device, for cost-effectiveness.
There is no strong restriction pertaining to the mapping of these units into hardware. The only
necessary condition is that the (reconfigurable) hardware providing support for the mechanisms
has: enough resources so they can be fitted; enough I/O interfaces for the media, channel and
management interfaces interconnection; suitable clock managing circuitry to aid the interface and
bit synchronization with the CAN network.
3.3.2 CAN Data-link Layer
In a CAN network the data-link layer is materialized by the CAN controller. While the CANELy
architecture does not enforce any particular controller device, besides fully supporting the stan-
dard CAN layer, it poses some restrictions regarding its features in order to secure hard real-time
operation attributes, and proper operation of the reliable communication protocols.
The CAN controller device must provide a transmitting message queueing buffer with more
than one slot, with the order of transmission being based on the message identifier - as opposed
to a First-In, First-Out (FIFO) policy. If not, a priority inversion scenario may arise, where a higher
priority message gets delayed due to a lower priority message being held in queue - and possibly
leading to timeliness violations.
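The priority inversion scenario can be made concrete with a small Python sketch that contrasts the two queueing policies. The class names and the example identifiers are hypothetical; the only CAN property relied upon is that a numerically lower identifier means a higher priority.

```python
import heapq
from collections import deque


class FifoTxQueue:
    """Transmit buffer that releases frames in arrival order."""

    def __init__(self):
        self._queue = deque()

    def push(self, can_id, data):
        self._queue.append((can_id, data))

    def pop(self):
        return self._queue.popleft()


class PriorityTxQueue:
    """Transmit buffer that releases the lowest identifier (highest priority) first."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker: keeps arrival order among equal identifiers

    def push(self, can_id, data):
        heapq.heappush(self._heap, (can_id, self._seq, data))
        self._seq += 1

    def pop(self):
        can_id, _, data = heapq.heappop(self._heap)
        return can_id, data


# A low-priority frame (0x300) is queued just before an urgent one (0x100).
fifo, prio = FifoTxQueue(), PriorityTxQueue()
for queue in (fifo, prio):
    queue.push(0x300, b"status")
    queue.push(0x100, b"urgent")

first_fifo = fifo.pop()  # the low-priority frame goes out first: priority inversion
first_prio = prio.pop()  # the urgent frame goes out first
```

With the FIFO policy the urgent frame has to wait until the low-priority frame wins arbitration, which may take arbitrarily long on a loaded bus; the identifier-ordered queue always offers the most urgent pending frame to arbitration.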
3.3.3 CAN Physical Layer
Following the top-down approach, the last layer remaining is the CAN physical layer. As the
name suggests, this layer is concerned with the physical interconnection of the systems, from the
electrical representation of the CAN bits to the physical attachment of cables to the nodes.
Transceivers
A fundamental piece in a CAN network - or any other computer network - is the transceiver,
which is the device that interfaces the controller with the physical means to convey the signals.
It is the transceiver’s function to convert the information into an electrical representation, suitable
for each one of the domains.
The CANELy architecture was designed without making any assumption regarding the physi-
cal medium used to propagate the CAN signals - at least beyond the quasi-stationary mode and
the wired-AND bus operation. Although optical interfaces exist, we are most concerned with a
more usual medium: twisted-pair cabling.
When using regular twisted-pair cabling the COTS philosophy is kept through usage of com-
mercially available CAN transceivers such as Maxim’s MAX13050 [41] or Microchip’s MCP2551 [42].
There is no need for special (fault-tolerant) transceivers with stuck-at-dominant fault masking since
these mechanisms are provided by the Media Selection Unit. Furthermore, the transceiver-level mechanisms
might not even provide adequate protection to real-time systems. Since these devices have no
knowledge regarding the network bit rate, they use a “worst-case scenario”, which is the lowest
bit rate possible. Hence, the amount of time these mechanisms require to act is in the order of
hundreds of milliseconds, which may be longer than the timeliness requirements of the host.
Connectors
The usage of communication media redundancy requires physical attachment of the several
cables to the node, which must be made through connectors. For the particular case of two
redundant media, these connectors were made compliant with CAN in Automation (CiA) standards
102 [43] and 303 part 1 [44], which are used by CANopen, thus relying on an already deployed
standard. The set of used signals is shown in Figure 3.8.
Figure 3.8: Extended CiA Connector (9-pin connector; signals: CAN_H and CAN_L (CiA standard), secondary CAN_H and CAN_L, CAN ground, optional CAN ground, optional CAN shield, CAN_V+ (optional power), reserved (error line))
The connector supports not only the standard CAN signals (CAN_L and CAN_H), but also the
possibility of power transmission, thus opening room to applications where no local power supply
is available, e.g. intelligent sensors.
This design also supports the connection of a secondary channel, identified in Figure 3.8
as the secondary set of CAN signals. The provision of such connection enables a solution offer-
ing full-space redundancy, through dual-channel/quad-bus operation, enhancing even further the
properties related to bus operation dependability and timeliness.
3.4 Summary
The Controller Area Network (CAN) bus is a fieldbus widely acknowledged for its attributes:
low cost, robust operation, low complexity and flexibility. Being designed to be used in the auto-
motive industry, it has shortcomings that must be addressed before it can be used to build highly
dependable applications. It is in this context that the CANELy architecture was designed,
providing the foundations for high dependability applications, such as Distributed Control Systems
(DCSs) requiring hard real-time behaviour.
The CANELy architecture achieves the goal of highly dependable CAN-based operation through
a systemic approach to the problem of distributed hard real-time operation. It presents a highly
modular architecture, covering attributes from low-level media redundancy mechanisms to ensure
reliable and available real-time communication, to high-level advanced services common to dis-
tributed systems, such as: group communication, membership and node failure detection, clock
synchronisation.
4 Dependability Enforcement
The price of reliability is the pursuit of the utmost simplicity.
The Emperor’s Old Clothes, C.A.R. Hoare
The systemic approach to the design of dependable distributed systems dictates that full at-
tention must be given also to the network interconnecting the system’s nodes. One of the major
contributions stemming from the CANELy architecture was a formal and analytic model of the
CAN fieldbus operation. This model was a necessary condition for identifying the weaknesses
w.r.t. network dependability, which must be addressed.
The upper layers of CANELy work on the operational assumption that the (physical) network
is reliable to some extent, and mostly free from inconsistent errors, except for errors affecting the
last but one bit of a message. These errors are addressed by the reliable communication protocol
suite.
Although this type of error-free assumption is common in layered architectures due to layer
partitioning, it does not hold true in a standard CAN network. Therefore, the properties pertaining
to network dependability must be secured. This Chapter is then devoted to the problems affect-
ing the CAN fieldbus network dependability, and how to provide effective dependable operation
based on the CANELy architecture formal models and mechanisms, thus providing a dependable
channel for frame transmission and reception.
The concept of dependability is thoroughly discussed in [45]. To achieve high levels of de-
pendability, one must attain also high levels of the attributes encompassed by it. These attributes
are the following:
• availability: readiness for correct service.
• reliability: continuity of correct service.
• safety: absence of catastrophic consequences on the user(s) and the environment.
• integrity: absence of improper system alterations.
• maintainability: ability to undergo modifications and repairs.
The attributes of interest for achieving dependable network operation are availability and re-
liability. To secure reliability and availability in the communication medium one has to resort to
spatial redundancy. As discussed in Chapter 3, one of the most basic strategies of the CANELy
architecture is medium redundancy, i.e. redundancy of the physical medium which conveys the
communication channel used for message diffusion.
This Chapter discusses the reliability and availability of the CAN network, and how they can be
secured based on the models offered by the CANELy architecture. Firstly, we present the
operational assumptions of CAN operation and the classes of faults affecting it. Then we identify the
failures affecting both CAN and CANELy, and their consequences for correct network
operation. Finally, following a bottom-up approach, we discuss the several hindrances to
dependable network operation and the means to overcome them: medium redundancy and its management,
error detection and confinement at the physical medium level, and monitoring functions allowing the
assessment of the omission degree of a medium. Along the discussion of each of these components we
present their analytic foundations, and how such elements can be mapped to innovative structures,
suitable to be engineered effectively in a PLD, such as an FPGA.
4.1 Working Model
Before discussing the models, functions and mechanisms that the CANELy architecture
specifies to enhance network dependability, we must present the assumptions
supporting them. The assumptions listed in Figure 4.1 are based on, and valid for, a network
composed of N nodes interconnected by a channel. Each node n ∈ N connects to the channel by a
channel transmitter (outgoing bit stream) and a channel receiver (incoming bit stream).
N1 channel redundancy is used, through replicated media (physical and medium layers), but
   only one MAC sub-layer.
N2 each medium replica is routed differently.
N3 all media are active, meaning every bit issued from the MAC sub-layer is transmitted
   simultaneously on all media.
N4 there is always a detectable minimum idle period preceding the start of every CAN data or
   remote frame transmission.
N5 there is a detectable and unique fixed form sequence that identifies the correct reception of
   a CAN data or remote frame.
N6 there is a detectable bit sequence that identifies the signalling of errors in the CAN bus.
Figure 4.1: CANELy network assumptions
While the first three assumptions (N1, N2 and N3) are related to the physical aspects of the
network deployment and operation, the other three (N4, N5 and N6) are related to CAN network
operation only, specifically correct operation, having been derived directly from the CAN specifi-
cation [6, 5].
Assumption N4 is guaranteed by the intermission period of the inter-frame spacing, a sequence
corresponding to the channel being in a recessive state for a duration of at least two (normally
three) bits after the end of a frame, of any type [6, 5].
An illustration of assumption N5 is shown in Figure 4.2, w.r.t. the final part of a CAN
message transmission, called End-of-Frame (EOF).
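[Figure: final fields of a CAN frame: CRC Sequence, CRC Delimiter, ACK Slot, ACK Delimiter and EOF.
Bus levels from the CRC Delimiter onwards: r d r r r r r r r r (r = recessive, d = dominant).
Annotations: EFS (End of Frame Sequence); bit-stuffing coding.]
Figure 4.2: CAN message termination sequence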
The unique sequence of assumption N5 is the End of Frame Sequence, which ends a successful
transmission of a data/remote frame. Assumption N6 is guaranteed by the CAN error flag
(also called error frame), which is composed of a sequence violating the bit-stuffing coding of
CAN. The bit-stuffing coding defines that the maximum length of a bit sequence having identical
polarity is five bits, with the exception of the EOF Sequence.
Finally, a useful construct is the normalisation of the CAN network bit rate, yielding the bit time,
Tbit . This unit of measurement is extremely useful, since it allows the analytic expressions to hold
no connection to the bus bit rate, thus allowing a more general analysis.
4.1.1 CAN Physical Layer Fault-Tolerance
The CAN transmission medium is usually a two-wire differential line. The CAN physical layer
specified in [6] allows resilience against some of the transmission medium failures illustrated in
Figure 4.3, by switching from the normal two-wire differential operation to a single-wire mode.
After mode switch-over, bus operation is allowed to proceed, though with a reduced
Signal-to-Noise Ratio (SNR), in the presence of one of the following failures:
• one-wire interruption (A or B failures, in Figure 4.3);
• one-wire short-circuit either to ground (C or D) or power (E or F);
• two-wire short-circuit (G).
There are commercially available transceivers claimed to be fully compliant with the ISO-11898
standard [41, 42], i.e. they switch to single-wire operation upon the detection of any of these
failures, switching back to two-wire differential mode upon recovery.
The CAN standard provides coverage for the shorting failure modes (C to E in Figure 4.3).
Resilience to the failure of one termination (failure H, Figure 4.3) requires extra time
for bus signal stabilisation, which can be obtained by adjusting the (local controller)
propagation time segment parameter [6], thus delaying the bit-sampling of the network.
[Figure: two nodes, each with a CAN interface, attached to the CAN_H and CAN_L bus wires, with
fault locations labelled A to H.]
Figure 4.3: CAN physical layer faults
There is no standardised mechanism for providing resilience against the simultaneous inter-
ruption of both bus wires (A and B failures, in Figure 4.3). Upon such a failure, the network will be
partitioned, with each partition containing a subset of the N nodes.
We are interested in tolerating partitioning faults. Most of them are beyond our control, since
they involve physical damage to the network infrastructure. Nonetheless, they can be tolerated
by resorting to media redundancy, especially if the several media are routed through different
physical paths, in order to avoid physical “common-mode” faults.
4.1.2 CANELy Approach to Network Dependability
The fault model considered by the CANELy architecture complements and extends the one
defined by CAN, as a consequence of both the non-invasive and COTS approach. Therefore, the
CANELy architecture not only supports the CAN fault model - provided by the CAN controller and
transceivers - but widens the scope of the model, thus increasing the fault coverage.
One of the basic strategies in the CANELy architecture is the utilisation of media redundancy.
Although this strategy enhances the network availability and reliability, it also brings new problems
w.r.t. error-free bus operation. A set of errors affecting a media redundant network, e.g. a CANELy
network, is presented in Figure 4.4.
The network of Figure 4.4 is composed of two bus media replicas, P and S. The errors
affecting it can be mapped to two classes: Common-mode (error A), affecting all media; and
Single-medium (errors B, C and D), affecting just one medium. The effects of these medium errors
on the channel depend on the type, and will be addressed in Section 4.5.3.
Since the CANELy enhancements are transparent to the standard CAN layer, they can be
used to provide the node with fault confinement on both outgoing and incoming bit streams. This
allows local node fault confinement in a more effective manner, especially w.r.t. permanent faults
that disrupt the network operation, and without resorting to specially enhanced COTS transceivers
with fault confinement mechanisms.
[Figure: a node with a CANELy interface attached to Medium P and Medium S, with four error cases
showing incorrect values: A - common mode errors; B - single-medium (d ⇝ r);
C - single-medium (r ⇝ d); D - single-medium (both).]
Figure 4.4: Errors affecting a dual-media CAN network
Partitioning
A class of faults originating from the physical medium is partitioning. Physical partitioning
happens whenever a bus cabling interruption occurs, leaving the network with at least two physical
partitions (see Figure 4.5). The aim of media redundancy, and therefore of redundancy management,
is to mask this event, presenting a correct view of the network to the upper layers, even in the
presence of network errors, as specified in assumption N1.
[Figure: three nodes with CAN interfaces attached to the P-Bus and S-Bus; a bus medium
interruption splits one medium into two partitions, and a dominant bit (d) sent by the
transmitting node appears as an incorrect value (r) on the interrupted medium.]
Figure 4.5: Media-redundant network physical partition
This network partitioning phenomenon, however, should be explored further. A network
disturbance might last only one Tbit, due to transient errors such as EMI or a loose network connector,
or might last longer, as with a crushed cable. The latter should be dealt with by the upper layers,
through specific algorithms for the assessment of the affected network segments. The former
might lead to another, subtler form of partitioning that assumes a virtual form, in the sense that
there is no interruption in the physical path of the communication channel. This form of
partitioning, however, still has an impact on the provision of communication, affecting its
timeliness, and will be a central theme in Chapter 5.
4.1.3 Fault classes
Correct CAN bus operation is mainly disturbed by errors originating from two classes of faults:
stuck-at, where the affected components¹ experience the same logic level for an abnormal period
of time w.r.t. CAN protocol operation; and “omissions”, which are experienced by a component that
does not receive service, usually due to errors.
Stuck-at Faults
The class of stuck-at faults pertains to the physical layer, affecting the bus media. This class
of faults has two elements in a CAN network: “stuck-at-dominant”, where the value present in the
bus has the dominant level; “stuck-at-recessive” where the level is recessive. Given the wired-
AND operation of CAN, a stuck-at-recessive fault usually indicates physical disconnection of the
medium, be it a faulty connector or a partitioning failure. Also due to the same operation mode,
the “stuck-at-dominant” faults are extremely disruptive, since they inhibit the communication.
Therefore, the CANELy architecture must tolerate this class of faults, allowing the provision of
service even in the presence of such faults.
Omission Faults
There is another class of faults, which can be triggered by the former class: omissions.
An omission occurs whenever a component fails to receive a message², and can originate from
several factors, such as faulty circuitry, damaged network media, electromagnetic
interference or other transient accidental faults affecting any network communication component.
These faults can also be very harmful for a system, especially a control system. For example,
if the actuator driving the brakes in a car-braking system does not receive the message with the
braking order from the controller, i.e. suffers an omission, it will not brake and the result can be
catastrophic. This omission might be caused by EMI emanating from another component, such as
the car’s alternator.
This class of faults affects the data-link layer, specifically the LLC sub-layer, since it is this
layer that is responsible for ensuring the correct communication of messages. This class of faults
must also be tolerated, and the CANELy architecture provides means to either tolerate them, or
at least perceive them and ultimately declare the involved media failed, thus allowing safe system
operation.
The assessment of whether a medium should be declared failed is based on the bus medium omission
degree, which represents the number of consecutive omissions in a given interval of time. A
medium violating its omission degree bound should be declared failed, its contribution to the
channel disabled and the upper layers notified, so that adequate measures can be undertaken.
¹ A component might be: the physical network medium, the transceiver or the CAN controller, both
on transmission and reception.
² In the CAN context, frame and message are interchangeable.
4.2 Physical Network Availability and Reliability
Given our bottom-up approach, the first class of issues affecting network dependability is
related to the availability of the network. In order to enhance the availability w.r.t. standard CAN,
redundancy must be provided. The CANELy architecture contemplates bus media redundancy,
through a set of replicated media conveying the channel. This strategy, however, does not
preclude a full-space redundant architecture with replicated channels.
The utilisation of bus media redundancy, however, poses several new challenges. Questions
such as “How many redundant paths should be used?” and “How to manage the redundant paths,
recovering a coherent view of the communication channel?” are extremely pertinent, and should
be addressed.
4.2.1 Media Redundancy Provision and Management
The first step w.r.t. redundancy provision has already been taken, in the form of
assumption N1, which states explicitly that media redundancy should be used. Given the several media
conveying the CAN bus signal, there is the need to extract a unique representation of the channel from
all the media. Therefore, some means of redundancy management is needed.
In CANELy the media redundancy management is solved by an ingenious mechanism
exploiting: the wired-AND nature of the CAN PHY layer; the quasi-stationary bus operation ensuring
(almost) simultaneous bus bit sampling by all correct nodes in the network. This mechanism
gathers all the signals received by each medium interface into a single representation through
an AND function, before being interfaced with the MAC sub-layer. Dubbed the Columbus’ egg
strategy due to its simple (in hindsight) nature, it allows a single incoming bit stream
representing the channel, ChRx, to be formed. Its structure and relation with the standard CAN layer
and physical media is depicted in Figure 4.6.
[Figure: block diagram of a node in which the CAN Controller exchanges ChTx and ChRx with a
Channel Interface, which connects through one Medium Interface per medium (MTx(P)/MRx(P) and
MTx(S)/MRx(S)) to the P-Bus and S-Bus.]
Figure 4.6: Columbus’ Egg strategy block diagram
The AND function is used to gather the incoming media, right after the transceivers. The result
is then provided to the standard CAN controller. Although there are only two media depicted, this
approach is valid for any number of media. The integration of this strategy into more complex
models requires that it must be defined formally, yielding the following expression:
\[ Ch_{Rx} = \prod_{m \in \mathcal{M}} M_{Rx}(m) \tag{4.1} \]
where: the symbol ∏ is used to denote the logical AND function; M is the set of medium
interfaces. For example, in the dual-media architecture of Figure 4.6, M = {P, S}.
The materialisation of this mechanism can be effectively described in VHDL, through the
exploitation of the language’s properties, such as vector attributes [46]. Such a description is
illustrated in Figure 4.7.

    -- MediumRX : Vector aggregating the several media
    -- ChRx     : Channel incoming (Rx) bit stream, 1 if all media are logical '1', else '0'.
    -- In CAN: logical '1' = recessive (r)
    --         logical '0' = dominant (d)

    ChRx <= '1' when MediumRX = (MediumRX'range => '1')
            else '0';

Figure 4.7: AND-based Media Selection description in VHDL
For the sake of clarity, from here on all the signals written in typewriter font refer to
implementation (hardware description) signals, e.g. ChRx, while signals written in italic refer to
analytic signals, e.g. ChRx.
The ChRx signal, in Figure 4.7, is the mapping of the receiving component of the
communication channel, ChRx, recovered from the redundant bus media. The description takes
a different approach to the AND function we wish to describe. If we reason about the AND
function, we can describe its behaviour as: the output of an AND function is True iff all the inputs
are True. This is the approach followed in the description, where we want to assess whether all the
inputs, MRx(m), are True, i.e. at the logical level ’1’, and express the evaluation through ChRx.
In order to do so, the inputs, MRx(m), are mapped to a vector (MediumRX in Figure 4.7) and
compared bit-by-bit with a construct that has the same size and bit order (range) as MediumRX,
and has all elements at logical ’1’. The result of this comparison is output through the signal
ChRx, which is logical ’1’ only when all the media are at logical ’1’.
The VHDL description in Figure 4.7 has the advantage of simplicity and clarity over other
descriptions, such as the ones using variables and loop unrolling. Those advantages can be
extended to the implementation in the PLD. This, however, is extremely dependent on the
synthesis tools used.
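For comparison, the variable-and-loop style of description mentioned above might look like the
following minimal sketch. The entity name media_and_loop, the generic NumberMedia with its default
value and the 1 to NumberMedia index range are assumptions made for illustration only.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper entity; names and generic are illustrative only.
entity media_and_loop is
  generic (
    NumberMedia : positive := 2);                      -- e.g. media P and S
  port (
    MediumRX : in  std_logic_vector(1 to NumberMedia); -- one bit per medium
    ChRx     : out std_logic);                         -- merged channel bit
end entity media_and_loop;

architecture loop_based of media_and_loop is
begin
  pAnd : process (MediumRX) is
    variable acc : std_logic;
  begin
    acc := '1';                     -- neutral element of AND (recessive)
    for m in MediumRX'range loop
      acc := acc and MediumRX(m);   -- any dominant ('0') medium wins
    end loop;
    ChRx <= acc;
  end process pAnd;
end architecture loop_based;
```

For inputs restricted to ’0’ and ’1’ both descriptions yield the same ChRx. They may differ for
other std_logic values (e.g. ’X’), which the comparison-based form maps to ’0’.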
Having recovered the incoming bit stream representing the channel, ChRx, we can not only
pass it to the upper layers, namely the standard CAN layer, but also perform monitoring actions
based on its information and enhance the bus media redundancy management, e.g. implement
fault detection and confinement functions for each independent medium, thus increasing the
availability and reliability of the communication network.
4.2.2 Stuck-at-dominant Fault Handling
A failed medium stuck in a dominant state must be dealt with through confinement mechanisms,
due to its disruptive nature w.r.t. correct network operation. Therefore, the failed medium must be
detected and its contribution to the recovery of the channel disabled, i.e. equation (4.1) must be
complemented. This can be achieved easily, through the exploitation of the neutral element of the
AND function - the logic value ’1’.
Based on the CAN standard [6, 5], the CANELy architecture defines the minimum length of
consecutive dominant bits present on a medium before declaring it “stuck-at-dominant”. This
length, formally defined as lstk←dm, can be used to assess whether a given medium is indeed in such a
state. This length can be expressed as an amount of time by:
\[ T_{stuck\leftarrow dm} = \left[ 2 \cdot l_{stk\_d} + (l_{stk\_d} + 1) \cdot err_{stuck\leftarrow rx}(bus) \right] \cdot T_{bit} \tag{4.2} \]
where: Tstuck←dm is the minimum (normalised) time for the channel to be considered
stuck-at-dominant; lstk_d represents the length of the sequence of consecutive dominant bits tolerated
by the CAN fault confinement mechanisms upon the transmission of an active error flag [6, 5];
errstuck←rx(bus) is a parameter allowing a tolerance margin in the violation of the active error
flag tolerance, which must be a positive integer; Tbit represents one bit-time. The sequence
defined by lstk_d is composed of 7 consecutive dominant bits. Regarding the tolerance margin, its
value must obey the following relation: 1 ≤ errstuck←rx(bus) ≤ 10, where the upper bound defines
a Tstuck←dm which is equivalent to that of the native CAN fault confinement mechanisms.
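As a quick check of equation (4.2), substituting lstk_d = 7 and the two bounds of the tolerance
margin gives the range of the detection threshold. The conversion to seconds assumes a bit rate
of 1 Mbit/s (Tbit = 1 µs), chosen here only for illustration.

```latex
\begin{align*}
err_{stuck\leftarrow rx}(bus) = 1:  \quad T_{stuck\leftarrow dm} &= [2 \cdot 7 + (7 + 1) \cdot 1] \cdot T_{bit}  = 22\,T_{bit} \\
err_{stuck\leftarrow rx}(bus) = 10: \quad T_{stuck\leftarrow dm} &= [2 \cdot 7 + (7 + 1) \cdot 10] \cdot T_{bit} = 94\,T_{bit} \\
\text{at } T_{bit} = 1\,\mu s: \quad & 22\,\mu s \le T_{stuck\leftarrow dm} \le 94\,\mu s
\end{align*}
```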
Upon the detection of a stuck-at-dominant condition, an indication of Medium m failure is
provided:
\[ M_{stk-d}(m) \mapsto \begin{cases}
  \text{true}  & \text{if } T(M_{Rx}(m) = d) > T_{stuck\leftarrow dm} \\
  \text{false} & \text{when } T(M_{Rx}(m) = d) \le T_{stuck\leftarrow dm} \lor M_{Rx}(m) = r
\end{cases} \tag{4.3} \]
where: T(MRx(m) = d) represents the normalised time elapsed since Medium m entered the dominant
state. If it exceeds Tstuck←dm, the medium is declared failed. The signal Mstk−d can be used to
command the disabling of the affected medium, thus confining the fault. Equation (4.1) is then
extended to accommodate the medium disabling function:
\[ Ch_{Rx} = \prod_{m \in \mathcal{M}} \left( M_{Rx}(m) + M_{dis}(m) \right) \tag{4.4} \]
where: Mdis(m), the medium disabling signal, can be derived directly from Mstk−d(m), i.e.
Mdis(m) = Mstk−d(m); the symbol “+” denotes the OR function. With this expression we can build a
self-contained block that: merges the channel information received by all media in a single entity,
ChRx; provides error confinement for any medium suffering a stuck-at-dominant fault.
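The effect of the masking term can be seen in a small worked case. Assume, for illustration, the
dual-media set M = {P, S}, with medium P stuck-at-dominant (MRx(P) = 0) and already detected as
such (Mdis(P) = 1), while medium S is healthy (Mdis(S) = 0).

```latex
\begin{align*}
\text{Equation (4.1):} \quad Ch_{Rx} &= M_{Rx}(P) \cdot M_{Rx}(S) = 0 \cdot M_{Rx}(S) = 0 \\
\text{Equation (4.4):} \quad Ch_{Rx} &= (M_{Rx}(P) + M_{dis}(P)) \cdot (M_{Rx}(S) + M_{dis}(S)) \\
                                     &= (0 + 1) \cdot (M_{Rx}(S) + 0) = M_{Rx}(S)
\end{align*}
```

Without confinement the channel is held dominant regardless of S. With the disabling signal
asserted, the failed medium contributes the neutral element of the AND function and the channel
follows the healthy medium.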
Having already described the AND function (see Figure 4.7), is it possible to just add the
masking component provided by the Mdis signal and OR function, without having to redesign it from
scratch? The answer to this question is affirmative, and can be achieved through the exploitation
of VHDL loop constructs, which provide the means for iteration³ over blocks of statements, as
illustrated in Figure 4.8:

    -- MediumRXtr : Vector aggregating the several media, from the transceivers
    -- Mdis       : Vector aggregating the several Medium Disable signals
    -- MediumRX   : Vector aggregating the several media, to be used in AND function

    procMediumRXOR : process (MediumRXtr, Mdis) is
    begin
      for m in 1 to NumberMedia loop               -- Iterate over each media m
        MediumRX(m) <= MediumRXtr(m) or Mdis(m);   -- executing the OR function
      end loop;  -- m
    end process procMediumRXOR;

Figure 4.8: Medium Disable Receive description in VHDL
The strategy followed is to iterate over all the media, resorting to the loop parameter m, and for
each medium execute the OR function between the information received from the physical layer
(MediumRXtr, mapping of MRx(m) in equation (4.4)) and the information generated according to
equation (4.3) (Mdis). The result is a vector with the masked incoming bit stream (MediumRX). The
vector MediumRX can then be used in the VHDL description of Figure 4.7. Problems involving
iteration over vectors are a natural application of loop constructs, which are the most efficient and
generic manner of accomplishing that type of operation.
The generation of the information regarding signal Mdis can be thought of as a watchdog timer,
which expires after some time - given by equation (4.2) - if no recessive bit is detected in the
channel. This idea, however, can be extended even further by considering the quasi-stationary
operation mode of the CAN network: this problem can be expressed as a sequence detection
problem, where these signals can be mapped to a certain sequence, which can occur in the
network and is hence able to be detected.
³ Although this is called iteration, what really happens is parallelisation, since the structures we
are describing are hardware.
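The watchdog view of equation (4.3) can be sketched as a saturating counter of consecutive
dominant bit times. This is only an illustrative sketch, not the CANELy implementation: the
entity name stuck_dominant_monitor, the generic STUCK_BITS (Tstuck←dm expressed in bit times) and
the port names MRx and Mstk_d are assumptions, while sys_clk, rst_N and can_clk_en follow the
naming used in the listings of this Chapter.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: one instance per medium m.
entity stuck_dominant_monitor is
  generic (
    STUCK_BITS : positive := 30);  -- threshold in bit times, e.g. from equation (4.2)
  port (
    sys_clk    : in  std_logic;    -- system clock
    rst_N      : in  std_logic;    -- synchronous reset, active low
    can_clk_en : in  std_logic;    -- one pulse per CAN bit time
    MRx        : in  std_logic;    -- raw medium bit: '1' = r, '0' = d
    Mstk_d     : out std_logic);   -- '1' while the medium is considered stuck
end entity stuck_dominant_monitor;

architecture watchdog of stuck_dominant_monitor is
  -- Counts consecutive dominant bit times; saturates one past the threshold.
  signal cnt : natural range 0 to STUCK_BITS + 1 := 0;
begin
  pCount : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        cnt <= 0;
      elsif can_clk_en = '1' then
        if MRx = '1' then
          cnt <= 0;                 -- a recessive bit restarts the watchdog
        elsif cnt <= STUCK_BITS then
          cnt <= cnt + 1;           -- one more dominant bit time
        end if;
      end if;
    end if;
  end process pCount;

  -- Equation (4.3): true only once the dominant time exceeds the threshold.
  Mstk_d <= '1' when cnt > STUCK_BITS else '0';
end architecture watchdog;
```

In such a sketch MRx would have to be taken before the masking OR (i.e. from the transceiver
side), since the masked signal is forced recessive as soon as the medium is disabled and would
immediately release the watchdog.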
4.3 CAN Bit-Sequence Detection
The bit-serial nature of the CAN protocol operation permits the assessment of correct be-
haviour in a practical manner, through the detection of certain sequences occurring in the network
components, be it the channel or the several media replicas. The quasi-stationary network oper-
ation of CAN can (and should) be further exploited, to allow on-line CAN protocol processing and
evaluation, masking errors and notifying the relevant layers of any abnormal event pertaining to
correct system operation. Therefore, the monitoring functions must be transformed into sequence
detection functions when mapping them into VHDL.
A problem still remains, however: how effective and flexible can and should the mechanisms
providing these sequence detection functions be? A “brute force” approach would either involve
a “sliding-window”-based technique, comparing the entire sequence at once (see Figure 4.9), or
specially and individually crafted structures to handle each and every function requiring sequence
detection. This “ad-hoc” approach has several drawbacks: it implies a lack of flexibility, since each
function would be designed individually, and high maintenance costs in the future, since there
would be no common design basis. This type of design also presents impairments w.r.t. engineering,
since it consumes scarce PLD resources.
[Figure: a bit Sequence compared, all bits at once, against a window sliding over the Bit stream.]
Figure 4.9: Sliding Window sequence detection
To avoid all these drawbacks, these structures should be made as flexible and efficient as
possible. After careful consideration, it was noted that most monitoring functions can be mapped
into fixed length, deterministic sequence detection problems⁴, sharing many similarities among
them. The most relevant characteristics of the sequences are:
• possibly long, with some going up to 96 bits in length;
• unique, although some are sub-sequences of others.
These observations make the sequence detection problem amenable to a generic approach,
thus providing component reuse and even compositionality, w.r.t. sub-sequences.
Under this perspective, we can easily map the analytic expressions defined by the CANELy
architecture into sequences of bits, which must occur in the network, either at the channel level or
individual medium level. For example, the sequence in equation (4.2) can be detected through the
sequence: dddddddddddddddddddddddddddddd, composed of 30 bits, given a value of lstk_d = 7
and errstuck←rx(bus) = 2, the latter being an acceptable tolerance margin [11]. The CAN bus
levels can be mapped into binary levels: recessive (r) is equivalent to a logical ’1’; dominant (d) is
equivalent to a logical ’0’. Therefore, the previous sequence mapping yields the binary sequence
000000000000000000000000000000.
⁴ Under a slightly different perspective, these sequences can be seen as strings, formed from an
alphabet Σ = {0, 1}. This would allow the description of the sequences as regular expressions.
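To make this mapping concrete, the sequence length of equation (4.2) and the resulting
all-dominant pattern could be derived at elaboration time rather than typed by hand. The following
is a minimal sketch under that assumption; the package name stuck_seq_pkg and all constant names
are hypothetical and are not taken from the CANELy sources.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical package: derives the stuck-at-dominant sequence from equation (4.2).
package stuck_seq_pkg is
  constant RECESSIVE : std_logic := '1';  -- r
  constant DOMINANT  : std_logic := '0';  -- d

  constant L_STK_D   : positive := 7;     -- l_stk_d
  constant ERR_STUCK : positive := 2;     -- tolerance margin, 1 to 10

  -- Equation (4.2), normalised to Tbit: 2*7 + (7+1)*2 = 30 bit times
  constant STUCK_LEN : positive := 2 * L_STK_D + (L_STK_D + 1) * ERR_STUCK;

  -- Sequence to be detected: STUCK_LEN consecutive dominant bits
  constant seq_stuck_d : std_logic_vector(0 to STUCK_LEN - 1) := (others => DOMINANT);
end package stuck_seq_pkg;
```

A constant built this way could then parametrise a sequence detection component, so that a change
in the tolerance margin propagates to the detected sequence without manual editing.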
Although the detection of sequences is fundamental, it is not sufficient by itself. Most of the
monitoring functions, either channel or medium, require actions to be performed upon sequence
detection, e.g. signal latching, pulsing, negation upon the assertion of another signal.
Hence, additional machinery is necessary to satisfy these requirements. The concept joining
these two elements is illustrated in Figure 4.10.
[Figure: the Bit stream feeds a Sequence Detector, whose SequenceOk output is combined with
Additional Signals in the Complementary Logic block to produce the Signal Assertion.]
Figure 4.10: Signal assertion machinery
Each sequence detection block is composed of two fundamental elements: Sequence detector,
which asserts the presence of the sequence of interest in the network; Complementary logic,
which handles the integration of the sequence detection mechanism with other signals, in order
to provide more complex actions. These actions include: signal pulsing, e.g. signals active for only
one Tbit; signal latching; signal composition for signals triggering further detections or actions.
The mapping of the sequence detection function into VHDL yields the description in Fig-
ure 4.11. This description allows a flexible approach, with the sequence representing the monitor-
ing function being specified through VHDL parametrised constructs upon component instantiation,
thus using the same building block for most of the monitoring functions.
The sequence detection machinery of Figure 4.11 is synchronised with the system-wide clock
signal, sys_clk, with the actions pertaining to bit comparison synchronised with CAN bit timing
through can_clk_en, which provides the equivalent to Tbit. Hence, the sequence matching is
made “on-line”, i.e. on a bit-by-bit basis and synchronised with the CAN network.
The sequence to be detected is mapped into a Read-Only Memory (ROM), provided by
sequence_rom, which is parametrised at design time with the desired sequence. The size of
the ROM for a sequence having a length of n bits is n × 1 bit, i.e. it stores the n bits composing the
sequence, with a 1-bit-wide output.
The sequence detection is done through comparison between the bit being output by the
ROM, sequence_rom(cnt), and the bit from the incoming bit stream, data. If the two bits are
equal, the value addressing the ROM is incremented, in order to test the next bit. Once this value
equals the length of the sequence, the detection has successfully ended, and the Sequence_Ok
(Figure 4.11) signal is asserted for one Tbit, being deasserted afterwards. Upon a failed bit matching,
the value addressing the ROM is reset, restarting the sequence detection. The assertion of
Sequence_Ok can be used to assert the signal pertaining to the detected sequence, eventually
based on additional signals (see Figure 4.10).

    -- ROM Addressing register
    -- Stores the value addressing the ROM holding the sequence to be detected
    pSDetection : process (sys_clk)         -- FPGA System Clock
    begin
      if rising_edge(sys_clk) then
        if rst_N = '0' then                 -- Synchronous reset, for safety purposes
          cnt <= 0;
        else
          if can_clk_en = '1' then          -- CAN clock enable, for network synch
            cnt <= cnt_aux;                 -- Store the ROM address value
          end if;
        end if;
      end if;
    end process pSDetection;

    -- Decision logic
    -- Progress with the ROM addressing, while input matches sequence
    -- Reset count upon either failed match or when reaching end of sequence

    cnt_aux <= (cnt + 1) when data = sequence_rom(cnt) and cnt /= sequence'length
               else 0;

    -- Output logic
    -- Output logical '1' upon successful sequence detection, logical '0' otherwise

    Sequence_Ok <= '1' when cnt = sequence'length
                   else '0';

Figure 4.11: Sequence detector description in VHDL
The objective of the sequence detector VHDL description is two-fold: it intends to be generic,
avoiding the description of dedicated Finite State Machines (FSMs) for each monitoring function; it
intends to be resource-effective, consuming as few resources as possible from an FPGA
device, since they are finite. Hence, the utilisation of a ROM, the most abundant resource in an
FPGA⁵, for storing the bit sequence is crucial, thus saving other scarcer memory elements such
as flip-flops.
The description of Figure 4.11, however, has a shortcoming: it may fail to detect a sequence
when the bits immediately before it form a (starting) sub-sequence of the sequence to be detected.
For example, if we wished to detect the sequence rdrrr inside the larger sequence rdrdrrr using
this strategy, it would not be possible with such machinery. This limitation stems from the simple
sequence detection restart machinery, which (on purpose) does not account for these
sub-sequences, thus making the machinery implementation area smaller.
⁵ In fact, the most abundant resource in an FPGA is Random-Access Memory (RAM), under the name of
Look-up Table (LUT). The ROM, however, is implemented with resort to LUT elements, but with content
alteration inhibited.
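The missed detection in the rdrrr example can be reproduced with a small simulation-only trace.
The sketch below is a behavioural model written for this illustration (it is not the description
of Figure 4.11 itself): it applies the same decision rule, advance on a match and restart from
zero otherwise, to the stream rdrdrrr, with r = ’1’ and d = ’0’. The names seq_restart_tb, SEQ and
STREAM are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity seq_restart_tb is
end entity seq_restart_tb;

architecture sim of seq_restart_tb is
  constant SEQ    : std_logic_vector(0 to 4) := "10111";    -- rdrrr
  constant STREAM : std_logic_vector(0 to 6) := "1010111";  -- rdrdrrr
begin
  pTrace : process is
    variable cnt  : natural range 0 to SEQ'length := 0;     -- ROM address
    variable seen : boolean := false;                       -- Sequence_Ok ever raised
  begin
    for i in STREAM'range loop                -- one stream bit per Tbit
      if cnt /= SEQ'length and STREAM(i) = SEQ(cnt) then
        cnt := cnt + 1;                       -- match: test the next bit
      else
        cnt := 0;                             -- failed match: plain restart
      end if;
      if cnt = SEQ'length then
        seen := true;
      end if;
      report "bit " & integer'image(i) & " = " & std_logic'image(STREAM(i)) &
             ", cnt = " & integer'image(cnt);
    end loop;
    -- The counter evolves as 1, 2, 3, 0, 1, 0, 1 and never reaches 5.
    assert not seen report "sequence unexpectedly detected" severity failure;
    wait;
  end process pTrace;
end architecture sim;
```

The fourth bit (d) breaks the first partial match after rdr, and the restart discards the fact
that the stream already contains the start of a new occurrence, so the trailing rdrrr goes
unnoticed.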
This type of sequence, however, occurs only once in the CANELy monitoring functions and,
being short, can be implemented by resorting to a method similar to the one in Figure 4.9,
without a great impact on occupied area. This different implementation, however,
can be made transparent w.r.t. component instantiation, through the abstraction constructs of Very
High Speed Integrated Circuit Hardware Description Language (VHDL), which allow the same
entity (component) to have multiple implementations (architectures). Despite this shortcoming, the
ROM-based approach to the sequence detection problem is still the most effective for the CANELy
monitoring functions, due to the possibly long length of the sequences to be detected and the
absence of sub-sequences, save for one.
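A sliding-window alternative of this kind might be sketched as a second architecture of the same
entity. The entity interface below is inferred from the instantiation shown in Figure 4.12; the
std_logic_vector type of the sequence generic, its default value and the architecture name
SHIFT_REG are assumptions made for illustration. The identifier sequence is a reserved word from
VHDL-2008 onwards, so the sketch is meant to be analysed as VHDL-93 or VHDL-2002.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Interface inferred from Figure 4.12; the generic type is an assumption.
entity sequence_detector is
  generic (
    sequence : std_logic_vector := "10111");  -- rdrrr, first bit leftmost
  port (
    sys_clk     : in  std_logic;
    rst_N       : in  std_logic;
    can_clk_en  : in  std_logic;
    data        : in  std_logic;
    Sequence_Ok : out std_logic);
end entity sequence_detector;

-- Hypothetical alternative architecture: window over the last N received bits.
architecture SHIFT_REG of sequence_detector is
  constant N   : positive := sequence'length;              -- assumed N >= 2
  constant SEQ : std_logic_vector(N - 1 downto 0) := sequence;
  signal window : std_logic_vector(N - 1 downto 0) := (others => '0');
  signal filled : natural range 0 to N := 0;               -- bits seen since reset
begin
  pShift : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        filled <= 0;                                       -- window content ignored
      elsif can_clk_en = '1' then
        window <= window(N - 2 downto 0) & data;           -- oldest bit leftmost
        if filled < N then
          filled <= filled + 1;
        end if;
      end if;
    end if;
  end process pShift;

  -- Every alignment is compared, so overlapping partial matches are not lost.
  Sequence_Ok <= '1' when filled = N and window = SEQ else '0';
end architecture SHIFT_REG;
```

The price is one flip-flop per sequence bit instead of a counter and a ROM, which is why such an
approach is only attractive for short sequences.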
4.4 Channel Monitoring
The detection of errors requires constant monitoring of the bus media and channel. This
monitoring serves the purpose of assessing the communication state, through the detection of
certain sequences.
Basic Channel Monitoring
The most basic set of monitoring signals concerning CAN network operation stems from the
assumptions N4, N5 and N6. Through the assertion of correct channel behaviour we can implicitly
assess the correct CAN operation. Furthermore, from this information more complex monitoring
functions can be constructed.
The extensive monitoring mechanisms defined by CANELy depend on a basic set of channel
status signals, pertaining to basic operating mechanisms of the network. These signals are:
End-of-Transmission, meaning the successful transmission of a CAN frame, and the bus being
available for another transmission; Frame Correct, meaning the correct reception of a CAN frame;
and Error, meaning that an error flag has been detected on the channel.
The End-of-Transmission (EOT) signal definition embodies assumption N4, being asserted
when a frame has been successfully transmitted, and the bus is available for another frame trans-
mission, i.e. the minimum intermission period time has been elapsed. Its formal definition is
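\[
Ch_{EOT} \mapsto
\begin{cases}
\text{true}  & \text{if } T(Ch_{Rx} = r) \geq T_{L} \\
\text{false} & \text{if } T(Ch_{Rx} = r) < T_{L} \,\lor\, Ch_{Rx} = d
\end{cases}
\tag{4.5}
\]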
where: ChEOT represents the EOT signal; T (ChRx = r) represents the normalised time elapsed
since the channel receive signal, ChRx, entered the recessive state; and TL is the minimum
normalised time the bus must be idle before a new transmission may start. This time takes into
account that a transmission may start at the last bit of intermission. The mapping of this
signal into VHDL is illustrated in Figure 4.12.
    -----------------------------------------------------------------------------
    -- EOT Sequence Detection - Sequence detection
    -----------------------------------------------------------------------------
    instSequenceDetectorEOT : entity work.sequence_detector(ROM_MEM)
      generic map (
        sequence => seq_eot)
      port map (
        sys_clk     => sys_clk,
        rst_N       => rst_N,
        can_clk_en  => can_clk_en,
        data        => ChRx,
        Sequence_Ok => eot_sequence);

    -----------------------------------------------------------------------------
    -- EOT flag assertion - Complementary logic
    -----------------------------------------------------------------------------
    -- purpose: Assert/de-assert the EOT flag
    -- inputs : sys_clk, rst_N, can_clk_en, eot_sequence, ChRx
    -- outputs: ChEOT (through the internal signal eot_int)
    pEOT: process (sys_clk) is
    begin  -- process pEOT
      if rising_edge(sys_clk) then        -- rising clock edge
        if rst_N = '0' then               -- synchronous reset (active low)
          eot_int <= '0';                 -- reset the internal flag that drives ChEOT
        else
          if can_clk_en = '1' then
            -- Signal not asserted yet, so assert it
            if eot_sequence = '1' and ChRx /= '0' then
              eot_int <= '1';
            -- Signal asserted and dominant bit of SOF detected at ChRx, de-assert
            elsif eot_int = '1' and ChRx = '0' then
              eot_int <= '0';
            end if;
          end if;
        end if;
      end if;
    end process pEOT;

    ChEOT <= eot_int;  -- Output the ChEOT signal
Figure 4.12: ChEOT signal description in VHDL
The first part of the listing in Figure 4.12 is the instantiation of the Sequence Detector
component with a constant, seq_eot, which holds the sequence to be detected, EOT in this case.
The second part is the Complementary Logic (see Figure 4.10). After the detection of the
sequence, the Complementary Logic block asserts the eot_int internal signal, which remains
asserted until a dominant bit is detected in the channel, ChRx. The ChEOT output is driven
from eot_int; the internal signal exists purely because of VHDL restrictions on reading the
value of output signals.
The mapping of the remaining monitoring functions involving sequence detection follows this
design philosophy, thus allowing a clean, reusable and maintainable strategy for the mapping of
these monitoring functions from the analytic models into hardware.
Another signal that needs to be generated is ChFok, which asserts that the received frame has
not been disturbed by errors, i.e. no errors have been detected up to the last bit of the EOF
delimiter (footnote 6). Its formal definition is

\[
Ch_{Fok} \mapsto
\begin{cases}
\text{true}  & \text{if } Ch_{Rx} = rdrrrrrrr \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{4.6}
\]

where: ChFok is the Channel Frame Correct signal, meaning the correct reception of the
sequence that correctly terminates a CAN data or remote frame.

6 This comes from the CAN standard [6, 5], which states that nodes do not take into account
the last bit of a frame when evaluating its correctness.
The third fundamental signal w.r.t. channel monitoring functions is the Channel Error (ChErr),
related to assumption N6. It is asserted upon the detection of an active error flag, i.e. a dominant
value is put in the network for a time longer than the maximum allowed by the bit-stuffing coding:
\[
Ch_{Err} \mapsto
\begin{cases}
\text{true}  & \text{if } T(Ch_{Rx} = d) \geq (l_{stuff} + 1) \cdot T_{bit} \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{4.7}
\]
where: T (ChRx = d) is the normalised time elapsed since the channel entered the dominant
state; lstuff is the bit-stuffing coding length and Tbit is the normalised bit-time. Given
lstuff = 5, mapping this sequence into binary digits yields 000000. The operation of these
mechanisms w.r.t. a CAN message is represented in Figure 4.13.
[Timing diagram: the ChFok and ChEOT signals shown against the CRC Sequence, CRC Delimiter,
ACK Slot, ACK Delimiter and EOF Delimiter fields of a frame]
Figure 4.13: CANELy Basic Channel Monitoring
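The two-part structure used for ChEOT (sequence detector plus complementary logic) can also be applied to ChErr, following equation (4.7). The listing below is a minimal sketch under stated assumptions, not the implementation described in this work: the entity name ch_err_logic, the internal signal err_int and the input err_sequence (assumed to be the Sequence_Ok output of a sequence detector parametrised with six consecutive dominant bits) are hypothetical, as is the choice of giving negation priority over assertion.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch only: complementary logic for ChErr, modelled on the ChEOT listing.
-- All names below except ChErr, ChEOT, sys_clk, rst_N and can_clk_en are hypothetical.
entity ch_err_logic is
  port (
    sys_clk      : in  std_ulogic;
    rst_N        : in  std_ulogic;   -- synchronous reset, active low
    can_clk_en   : in  std_ulogic;   -- clock enable, as in the ChEOT listing
    err_sequence : in  std_ulogic;   -- assumed: detector output for "000000" on ChRx
    ChEOT        : in  std_ulogic;
    ChErr        : out std_ulogic);
end entity ch_err_logic;

architecture behavioural of ch_err_logic is
  signal err_int : std_ulogic := '0';  -- internal copy, since outputs cannot be read
begin
  pErr : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        err_int <= '0';
      elsif can_clk_en = '1' then
        if ChEOT = '1' then            -- "false when ChEOT" (assumed to take priority)
          err_int <= '0';
        elsif err_sequence = '1' then  -- bit-stuffing rule violated: error flag seen
          err_int <= '1';
        end if;
      end if;
    end if;
  end process pErr;

  ChErr <= err_int;
end architecture behavioural;
```

The same skeleton would serve for ChFok by swapping the detected sequence, which is the reuse argument made for the abstract component of Figure 4.10.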
Extended Channel Monitoring
So far we have been concerned with the incoming bit stream, ChRx, and how to perform local
fault confinement in the event of a stuck-at-dominant medium. The outgoing bit stream, ChTx,
however, is not immune to faults and must also be accounted for. The faults affecting correct
network operation do not necessarily have to be related to the physical network media. They
can also have their origin in the CAN controller circuitry, e.g. a failed oscillator that leaves
the CAN controller stopped while transmitting a dominant bit. Therefore the CANELy model must
also account for a stuck-at-dominant situation during message transmission. The period required
for detecting this situation has a normalised duration Tstuck-tx, and is formally defined as:
\[
T_{stuck-tx} = \left[ 2 \cdot l_{stkd} + (l_{stkd} + 1) \cdot err_{stuck \leftarrow tx} \right] \cdot T_{bit}
\tag{4.8}
\]
where: errstuck←tx is the stuck-at-dominant transmit-error-tolerance margin. This expression is
similar to equation (4.2), since both express the same type of violation of the CAN protocol, but
in different directions of the information flow. There is, however, a small difference between the
two mechanisms. The tolerance margin of equation (4.8) must be smaller than the one in (4.2),
i.e. 0 ≤ errstuck←tx < errstuck←rx(bus) < 10. This condition is necessary to prevent the incoming
bit stream stuck-at-dominant mechanism from acting first, which would leave the node completely
disconnected from the network. The function that monitors the stuck-at-dominant condition is
defined as:
\[
Ch_{stk-Tx} \mapsto
\begin{cases}
\text{true}  & \text{if } T(Ch_{Tx} = d) > T_{stuck-tx} \\
\text{false} & \text{when } mgmt.request
\end{cases}
\tag{4.9}
\]
where: T (ChTx = d) is the normalised time elapsed since ChTx entered the dominant state. If
this time grows larger than Tstuck-tx, we are in the presence of a stuck-at-dominant outgoing
bit stream: the signal Chstk-Tx is asserted, the upper layers are signalled and the outgoing bit
stream, ChTx, is disabled. The signal can only be negated by an upper layer request.
Mapping the disabling of the outgoing bit stream into hardware components is straightforward:
we just need to exploit (once again) the Boolean OR function, defining:

\[
M_{Tx}(m) = Ch_{Tx} + M_{dis-Tx}
\tag{4.10}
\]

where: Mdis-Tx is the signal for disabling the outgoing bit stream, on all media. The resulting
mechanism description iterates over the outgoing media, absorbing their value if Mdis-Tx is
asserted. The description of this mechanism is detailed in Figure 4.14.

    -- MediumTX : Vector aggregating the several media, going into the transceivers
    -- MdisTx   : Signal for disabling all media through the OR function
    -- ChTx     : Signal representing the outgoing bit stream

    procMediumTXOR : process (ChTx, MdisTx) is
    begin  -- process procMediumTXOR
      for m in 1 to NumberMedia loop      -- Iterate over all media
        MediumTX(m) <= ChTx or MdisTx;    -- Disabling if MdisTx is set
      end loop;  -- m
    end process procMediumTXOR;

Figure 4.14: Mdis-tx description in VHDL
Considering the tolerance parameter of equation (4.8) as errstuck←tx = 1 [11], a
stuck-at-dominant condition on the outgoing channel bit stream is detected in 22 bit times. In a
network operating at a 1 Mbps bit rate, such a condition is detected in just 22 µs, and the
faulty node is disconnected from the network.
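The 22 bit times quoted above can be checked against equation (4.8). The value l_stkd = 7 used below is an assumption: the parameter is defined together with equation (4.2), outside this Section, and 7 is simply the value that reproduces the 22 bit times stated in the text for a margin of 1.

```latex
\begin{align*}
T_{stuck-tx} &= \left[\, 2 \cdot l_{stkd} + (l_{stkd} + 1) \cdot err_{stuck \leftarrow tx} \,\right] \cdot T_{bit} \\
             &= \left[\, 2 \cdot 7 + (7 + 1) \cdot 1 \,\right] \cdot T_{bit} = 22 \cdot T_{bit} \\
T_{bit}      &= \frac{1}{1\ \text{Mbit/s}} = 1\ \mu\text{s}
               \quad\Longrightarrow\quad T_{stuck-tx} = 22\ \mu\text{s}
\end{align*}
```

Under the same assumption, each unit added to the tolerance margin lengthens the detection time by l_stkd + 1 = 8 bit times.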
There are (non fault-tolerant) commercially available transceivers that detect and mask
stuck-at-dominant scenarios, such as Maxim’s MAX13050 [41] and Microchip’s MCP2551 [42].
The minimum amount of time these devices take to disconnect the faulty node from the network,
however, is rather high: 1 ms for the MAX13050; 1.25 ms for the MCP2551. Such behaviour is
explained by the transceivers being purely PHY-aware, having no knowledge of the bus bit rate.
Therefore they assume a worst-case figure, which would be the lowest bit rate possible, and
thereby implicitly define the lowest bit rate they can support.
Channel Monitoring Component
Having defined the channel monitoring functions, we can now map them into suitable se-
quences. The signals, their behaviour and respective sequence are summarised in Figure 4.15.
Basic Channel Monitoring

  ChEOT      End Of Transmission
             asserted after detection of minimum bus idle period;
             negated upon the start of a new frame.
             Sequence: ChRx = rrrrrrrrr

  ChFok      Frame Correct
             data or remote frame received without errors;
             negated upon assertion of ChEOT.
             Sequence: ChRx = rdrrrrrrr

  ChErr      Frame Error
             asserted upon violation of CAN bit-stuffing coding rule;
             negated upon assertion of ChEOT.
             Sequence: ChRx = dddddd

Extended Channel Monitoring

  Chstk-Tx   Channel Stuck-at-dominant
             asserted upon violation of active error flag length limit;
             negated upon upper layers request.
             Sequence: ChTx = dddddddddddddddddddddd (errstuck←tx = 1)

Figure 4.15: Channel Monitoring signals
This summary provides all the information needed for: parametrisation of the sequence detection
machinery; complementary logic design w.r.t. the assertion, hold and negation of the channel
monitoring signals. The list of mapped sequences is presented in Figure 4.16.
With the information of Figure 4.15 and Figure 4.16, the mapping of these functions to the
abstract component of Figure 4.10 becomes trivial: the Sequence Detector block is parametrised
with the sequence of interest, and the Complementary Logic is filled with the conditions for
assertion and negation. From here on, almost all monitoring functions can be transformed into a
sequence detection problem with complementary signal logic, and thus be tackled with this
reusable approach.
The Channel Monitoring signals can be encapsulated in a VHDL component, which continuously
monitors the channel incoming and outgoing bit streams. A block diagram depiction of such a
component is shown in Figure 4.17.

    -- ChEOT - End of Transmission
    constant seq_eot : std_ulogic_vector := "111111111";

    -- ChFok - Frame Correct
    constant seq_fok : std_ulogic_vector := "101111111";

    -- ChErr - Channel error
    constant seq_err : std_ulogic_vector := "000000";

    -- Stuck-at-dominant tx, margin = 1
    constant seq_stk_tx : std_ulogic_vector := "0000000000000000000000";

    -- Stuck-at-dominant rx, margin = 2
    constant seq_stk_d : std_ulogic_vector := "000000000000000000000000000000";

    -- Stuck-at-recessive medium
    constant seq_stuck_rm : std_ulogic_vector := "1111111111111111";

Figure 4.16: Sequences mapped into VHDL
[Block diagram: Channel Monitoring component, with the signals ChRx, ChTx, ChEOT, ChErr,
ChFok and Chstk-tx]
Figure 4.17: CANELy Channel Monitoring functions
Given the modular approach being followed, this component can be used as the building
block for other components requiring information regarding the state of the channel, e.g. medium
monitoring.
4.5 Medium Monitoring
The previous Section dealt with channel monitoring mechanisms, which assess the channel state
and issue fault-confinement actions, if necessary. Most of the faults, however, have their origin
in the media, being propagated to the channel. Hence, the next logical step is to do medium
monitoring, in order to assess and confine any faults affecting a medium, so that they do not
propagate to the channel.
A fault that might affect a medium is the counterpart of the stuck-at-dominant (see Section
4.2.2): stuck-at-recessive. Although a bus in a stuck-at-recessive state apparently does not pose
a problem as dangerous as the previous case, it can still be dangerous from a different
perspective, since it can mean network partition(s). Therefore, it is equally important to detect
such faults and inform the upper layers, in order to trigger procedures to assess the location of
the affected network segments, and decide the measures to be taken in order to ensure system
safety.
Lastly, there is also the case of a medium exhibiting omission errors. This situation must not
only be detected, but also corrected if the medium exhibits a number of omissions larger than a
certain threshold, the medium’s omission degree bound.
4.5.1 Medium Status
The partitioning of physical media is a type of fault that must be accounted for. Although its
behaviour is not as disruptive (in the short term) as a stuck-at-dominant type of fault, it may be
critical in the long term, since it hinders the media redundancy strategy. Therefore, it is necessary
to assess network partition faults, and inform the upper layers for high-level fault detection, i.e.
pinpoint the physical location of the partition.
A network partition can be described as a prolonged silence in the affected medium, since it is
kept in the recessive state. This abnormal recessive period affecting medium m can be detected
by the following expression:
\[
M_{arp}(m) \mapsto
\begin{cases}
\text{true}  & \text{if } T(M_{Rx}(m) = r) > T_{stuck-rm} \,\land\, \lnot Ch_{Fok} \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{4.11}
\]

where: Marp is the medium abnormal recessive period signal, asserted when T (MRx(m) = r), the
normalised time elapsed since medium m entered the recessive state, exceeds Tstuck-rm.
Tstuck-rm is the minimum normalised time to detect a stuck-at-recessive fault, and it has a
duration of Tstuck-rm = 2 · lstkd bit times.
The upper layers should be notified of abnormal events in the network, even if they do not
disturb the correct network operation. Therefore, the event of a medium being abnormally in a
recessive state should also be notified. This notification, however, should only be done when a
correct frame is received at the channel interface. The signal Midle asserts the situation of correct
frame reception, thus allowing the notification to occur.
\[
M_{idle}(m) \mapsto
\begin{cases}
\text{true}  & \text{if } M_{arp}(m) \,\land\, Ch_{Fok} \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{4.12}
\]
Lastly, stuck-at-recessive failures should be distinguished from medium partitions. The signal
Mds serves this purpose, being asserted whenever a dominant bit value is detected in the medium,
meaning the partition is not permanent, thus being a stuck-at-recessive fault.
\[
M_{ds}(m) \mapsto
\begin{cases}
\text{true}  & \text{if } M_{Rx}(m) = d \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{4.13}
\]
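Equations (4.11) to (4.13) share the same shape: each flag is set by its own condition and negated when ChEOT is asserted. The listing below is a minimal sketch of how the three flags of one medium might be described together. It is an illustration under assumptions, not the implementation of this work: the entity name, the internal signals and the arp_sequence input (assumed to be the output of a sequence detector watching MRx(m) for the stuck-at-recessive sequence) are hypothetical, and can_clk_en is assumed to pulse once per bit time.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch only: per-medium status flags of equations (4.11), (4.12) and (4.13).
-- Entity, port and internal signal names are hypothetical.
entity medium_status_sketch is
  port (
    sys_clk      : in  std_ulogic;
    rst_N        : in  std_ulogic;   -- synchronous reset, active low
    can_clk_en   : in  std_ulogic;   -- assumed: one enable pulse per bit time
    MRx          : in  std_ulogic;   -- bit stream of medium m ('0' = dominant)
    arp_sequence : in  std_ulogic;   -- assumed: detector output, recessive for too long
    ChFok        : in  std_ulogic;
    ChEOT        : in  std_ulogic;
    M_arp        : out std_ulogic;   -- abnormal recessive period, equation (4.11)
    M_idle       : out std_ulogic;   -- medium idle, equation (4.12)
    M_ds         : out std_ulogic);  -- dominant bit seen, equation (4.13)
end entity medium_status_sketch;

architecture behavioural of medium_status_sketch is
  signal arp_int  : std_ulogic := '0';
  signal idle_int : std_ulogic := '0';
  signal ds_int   : std_ulogic := '0';
begin
  pStatus : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' or (can_clk_en = '1' and ChEOT = '1') then
        -- reset, or "false when ChEOT": negate all three flags
        arp_int  <= '0';
        idle_int <= '0';
        ds_int   <= '0';
      elsif can_clk_en = '1' then
        if arp_sequence = '1' and ChFok = '0' then
          arp_int <= '1';            -- prolonged silence on this medium
        end if;
        if arp_int = '1' and ChFok = '1' then
          idle_int <= '1';           -- silence confirmed by a correct channel frame
        end if;
        if MRx = '0' then
          ds_int <= '1';             -- the medium carried at least one dominant bit
        end if;
      end if;
    end if;
  end process pStatus;

  M_arp  <= arp_int;
  M_idle <= idle_int;
  M_ds   <= ds_int;
end architecture behavioural;
```

One instance per medium would be needed, each feeding the corresponding fields of that medium's status word.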
The results of these functions are mapped into the Medium Status Word (see Figure 4.18),
which keeps track of the most relevant parameters for each medium. The Medium Status Word is
always exposed to the remaining components, which can only have “read” access.

    -- Medium Status Word
    type medium_status_t is record
      arp   : std_ulogic;  -- Abnormal Recessive Period, maps M_arp(m)
      idle  : std_ulogic;  -- Medium Idle, maps M_idle(m)
      ds    : std_ulogic;  -- Dominant Bit Received, maps M_ds(m)
      stk_d : std_ulogic;  -- Stuck at Dominant, maps M_stk-d(m)
      dis   : std_ulogic;  -- Medium Disable, maps M_dis(m)
      fm    : std_ulogic;  -- Frame Mismatch, maps M_fm(m)
    end record medium_status_t;

Figure 4.18: Medium Status Word VHDL data type

We can now design a component wrapping the functionality of the Media Monitoring. In practical
terms, and given the fact that the status assessment must be done for each and every medium
in parallel, we will have a component instance of the Medium Status Monitor for each medium,
with its own Medium Status Word. Such a component is illustrated in Figure 4.19, in block
diagram form.
[Block diagram: Medium Status Monitoring component, with the signals ChFok, ChEOT, ChRx and
MRx(m), and the Medium Status Word]
Figure 4.19: Medium Status Monitoring block diagram
4.5.2 Frame Monitoring
Any medium is susceptible to suffering (or propagating) a fault during normal network
operation. The fault’s manifestation, the error, can be propagated further into the system,
reaching the channel via the bus redundancy management mechanisms (see Figure 4.4, errors C
and D), or may be masked (error B). Either way, it is important to assess whether any error
affected the network and, if so, which medium or media exhibited the errors.
The first step to achieve this goal is to monitor the received frames, by comparing the data
received in each medium, MRx(m), with the received channel data, ChRx. This comparison is made
“on-line”, on a bit-by-bit basis. If there is any disagreement between the value present in the
channel and the medium, a flag signalling frame mismatch, MFm(m), is asserted for medium m.
Formally,
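\[
M_{Fm}(m) \mapsto
\begin{cases}
\text{true}  & \text{if } M_{Rx}(m) \neq Ch_{Rx} \,\land\, Ch_{TiP} \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{4.14}
\]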
where: ChTiP = ¬ChFok ∧ ¬ChErr signals that a frame transfer is in progress. The assertion
of MFm(m) is of the utmost importance for omission detection, since it signals whether there was
any disturbance in the medium during the last frame transmission. This signal is also mapped
into the Medium Status Word (see Figure 4.18).
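The bit-by-bit comparison of equation (4.14) can be sketched as a small clocked process that latches a mismatch per medium until ChEOT is asserted. This is an illustrative sketch under assumptions: the entity name, the generic default and the internal signal fm_int are hypothetical, and can_clk_en is assumed to sample once per bit time.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch only: frame mismatch flags M_Fm(m) of equation (4.14), one bit per medium.
-- Entity name, generic default and internal signal names are hypothetical.
entity frame_monitor_sketch is
  generic (
    NumberMedia : positive := 2);
  port (
    sys_clk    : in  std_ulogic;
    rst_N      : in  std_ulogic;   -- synchronous reset, active low
    can_clk_en : in  std_ulogic;   -- assumed: one enable pulse per bit time
    MRx        : in  std_ulogic_vector(1 to NumberMedia);
    ChRx       : in  std_ulogic;
    ChFok      : in  std_ulogic;
    ChErr      : in  std_ulogic;
    ChEOT      : in  std_ulogic;
    M_fm       : out std_ulogic_vector(1 to NumberMedia));
end entity frame_monitor_sketch;

architecture behavioural of frame_monitor_sketch is
  signal fm_int : std_ulogic_vector(1 to NumberMedia) := (others => '0');
  signal ChTiP  : std_ulogic;
begin
  -- Frame transfer in progress: neither Frame Correct nor Error asserted
  ChTiP <= (not ChFok) and (not ChErr);

  pMismatch : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        fm_int <= (others => '0');
      elsif can_clk_en = '1' then
        if ChEOT = '1' then              -- "false when ChEOT"
          fm_int <= (others => '0');
        elsif ChTiP = '1' then
          for m in fm_int'range loop     -- compare every medium with the channel
            if MRx(m) /= ChRx then
              fm_int(m) <= '1';          -- latch the disagreement for medium m
            end if;
          end loop;
        end if;
      end if;
    end if;
  end process pMismatch;

  M_fm <= fm_int;
end architecture behavioural;
```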
4.5.3 Omission Degree Control
The medium monitoring functions are essential to assess which medium or media are exhibiting
errors. With this information it is now possible to go further and proceed to the confinement
of these errors.
As discussed previously, a (perceived) fault affecting a frame generates an omission error: the
frame is destroyed, not being delivered to the upper layers. Such errors should be rare during
normal operation, and normally are caused by transient events such as EMI or local node power
supply transients.
There are occasions, however, where a medium might be affected by a higher than usual
number of errors. These errors can have their origin in defective circuitry, damaged cabling or
connectors. In this situation, the redundancy management machinery may worsen system
operation, instead of improving it (footnote 7). Therefore, the medium state must not only be
monitored, but also be rated in terms of dependability, i.e. how dependable that medium is.
The final objective of the media monitoring functions is to achieve omission degree control for
each medium, through the assessment of the medium’s omission degree (MOd). Any medium
exceeding its omission degree bound, km, must be declared failed, the corresponding upper layer
entities signalled and its contribution to equation (4.4) disabled.
Omission Detection
Before assessing the omission degree of a given medium, we must first proceed with the
detection of omissions. One of the most important detection functions is the assessment of
whether any medium was affected by errors. Such assessment is made by the MFm-s signal, which
is defined as:

\[
M_{Fm-s} = \sum_{m \in M} M_{Fm}(m)
\tag{4.15}
\]

where: ∑ denotes a logical sum. The MFm-s signal is asserted if there is at least one medium
with MFm(m) asserted. The omission detection functions are illustrated in Figure 4.20.
These functions were designed to provide medium omission error accountability, i.e. knowing,
upon error detection, which medium or media exhibited the erroneous behaviour. This step is
crucial to evaluate the omission degree, since there are common-mode errors (Figure 4.4, error
A) and single-mode errors (Figure 4.4, errors B-D).

7 Such a situation can arise from a medium suffering errors such as the ones in Figure 4.4,
errors C and D. In these cases the error is propagated. The counterpart is error B in the same
Figure, which is masked.
Medium omission functions (for all media)

  MFm-s   Media Frame Mismatch
          asserted if any medium exhibited a frame mismatch;
          negated if no mismatch or all media affected by a common-mode error.

Medium omission functions (for each medium m)

  MOer    Omission error in medium
          frame received without errors at the channel, but medium had an error.
          Condition: ChFok ∧ MFm(m)

  MOch    Omission error in channel
          frame aborted with errors, but medium had no error.
          Condition: ChErr ∧ MFm-s ∧ ¬MFm(m)

  MUer    Omission error in undetermined medium or media
          frame aborted with errors, either common-mode error or multiple errors.
          Condition: ChErr ∧ (¬MFm-s ∨ MFm(m))

Figure 4.20: Medium Omission Detection auxiliary functions
Omission Degree Assessment
As a general rule, a medium exhibiting errors should have its omission degree value incre-
mented, in the same way a medium showing no errors should have it cleared. There are, how-
ever, some cases where the media cannot be held accountable, e.g. common-mode errors. The
omission degree is evaluated upon a successful transmission. It is defined by:
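\[
M_{Od}(m)\Big|_{\uparrow Ch_{EOT}} =
\begin{cases}
M_{Od}(m) + 1 & \text{if } M_{Oer}(m) \,\lor\, (M_{Och}(m) \,\land\, \lnot Ch_{Fok}) \\
M_{Od}(m)     & \text{if } M_{Uer}(m) \,\land\, \lnot Ch_{Fok} \\
0             & \text{if } Ch_{Fok} \,\land\, \lnot M_{Fm}(m)
\end{cases}
\tag{4.16}
\]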
where the omission degree of a Medium should be:
• Incremented, if a Medium has experienced an omission error, or if the channel experienced
an omission and the frame was not correctly received;
• Maintained, if it was not possible to pinpoint the medium or media that suffered the omis-
sion, and the frame was not correctly received;
• Cleared, if the frame was correctly received and the Medium did not exhibit any error.
The assessment of the MOd is made in parallel for all the media, upon a successful frame
transmission - signalled through the assertion of ChEOT . Any medium whose omission degree
has exceeded the omission degree bound should be declared failed, and the upper layers notified.
[Block diagram: Omission Degree Control component, comprising the Omission Detection block and
the OD Register, with the signals MRx(1) to MRx(m), MFm, ChFok, ChEOT, ChErr and MOd]
Figure 4.21: Omission Degree Control block diagram
The functionalities provided by the omission detection and control functions can be wrapped in
a component providing Omission Degree Control. Such a component is illustrated in Figure 4.21.
This component encapsulates the omission detection machinery, which supports the evaluation
of the omission degree of a given medium. It outputs the current omission degree of all the
media, so that it can be accessed through management interfaces and thus consulted by upper
layer entities.
Omission Degree Control mechanisms
The functions described in Figure 4.20 and through equations (4.14) and (4.15) are simple to
map onto mechanisms. The simplicity derives from their purely combinatorial nature.
The assessment of the omission degree (see equation (4.16)) can be mapped into a simple state
machine, with three states: “INCREMENT”, “KEEP” and “CLEAR”. It may be argued that this design
could be optimised. The description, however, is made in behavioural mode, i.e. describing
behaviour as opposed to describing structures, thus leaving the optimisations to the
synthesiser.
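One possible behavioural reading of Figure 4.20 and equation (4.16), for a single medium, is sketched below. It is an assumption-laden illustration rather than the design of this work: the entity name, the OdMax saturation bound, the od_action_t type and the rising-edge detection of ChEOT through eot_d are all hypothetical. The three actions are decoded combinationally and applied once per ChEOT assertion, with KEEP taken as the default for any case equation (4.16) does not list. The sketch also assumes that ChFok, ChErr and MFm(m) are registered on the same clock enable, so that their values are still readable in the cycle where the ChEOT edge is seen.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch only: omission detection (Figure 4.20) and omission degree update
-- (equation 4.16) for one medium. Names and the OdMax bound are hypothetical.
entity omission_degree_sketch is
  generic (
    OdMax : positive := 15);            -- assumed saturation value of the counter
  port (
    sys_clk    : in  std_ulogic;
    rst_N      : in  std_ulogic;        -- synchronous reset, active low
    can_clk_en : in  std_ulogic;
    ChEOT      : in  std_ulogic;
    ChFok      : in  std_ulogic;
    ChErr      : in  std_ulogic;
    M_fm       : in  std_ulogic;        -- frame mismatch of this medium, M_Fm(m)
    M_fm_s     : in  std_ulogic;        -- logical sum over all media, M_Fm-s
    M_od       : out natural range 0 to OdMax);
end entity omission_degree_sketch;

architecture behavioural of omission_degree_sketch is
  type od_action_t is (INCREMENT, KEEP, CLEAR);
  signal M_oer, M_och, M_uer : std_ulogic;
  signal action : od_action_t;
  signal od_int : natural range 0 to OdMax := 0;
  signal eot_d  : std_ulogic := '0';    -- ChEOT delayed by one enabled cycle
begin
  -- Combinational omission detection functions (Figure 4.20)
  M_oer <= ChFok and M_fm;
  M_och <= ChErr and M_fm_s and (not M_fm);
  M_uer <= ChErr and ((not M_fm_s) or M_fm);  -- shown for completeness; maps to KEEP

  -- Decode of equation (4.16); KEEP is the assumed default
  action <= INCREMENT when (M_oer = '1' or (M_och = '1' and ChFok = '0')) else
            CLEAR     when (ChFok = '1' and M_fm = '0') else
            KEEP;

  pOd : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        od_int <= 0;
        eot_d  <= '0';
      elsif can_clk_en = '1' then
        eot_d <= ChEOT;
        if ChEOT = '1' and eot_d = '0' then   -- rising edge of ChEOT
          case action is
            when INCREMENT =>
              if od_int < OdMax then          -- saturate instead of wrapping
                od_int <= od_int + 1;
              end if;
            when KEEP =>
              null;
            when CLEAR =>
              od_int <= 0;
          end case;
        end if;
      end if;
    end if;
  end process pOd;

  M_od <= od_int;
end architecture behavioural;
```

Comparing M_od against the bound km, and declaring the medium failed when it is exceeded, would sit outside this block.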
4.6 Media Selection Unit
So far, we have been discussing the mechanisms for error detection and confinement in a
comprehensive manner. The materialisation of these mechanisms, however, must be done by
encapsulating them in components, which in turn must be arranged to achieve a unified block.
This Section shows how all the monitoring functions and mechanisms integrate into a Media
Selection Unit component.
The modular philosophy of the previous blocks allows seamless integration into a single
component, with well defined interfaces (see Figure 4.22).
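[Block diagram: Media Selection Unit, integrating the Channel Monitoring, Medium Status, Frame
Monitoring and Omission Degree Control components with the Management Interface; it connects
to the Standard CAN layer through ChTx and ChRx, and to the media through MTx(1) to MTx(m)
and MRx(1) to MRx(m)]
Figure 4.22: Media Selection Unit block diagram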
4.6.1 System Interface
The Media Selection Unit (MSU) must interface with a computational platform, for: parameter
configuration, such as bit rate; exception notification, such as a medium partition; and reading,
such as the message identifier related to an event. Furthermore, it must connect to the channel
and the several redundant media (see Figure 3.4, page 28). Regarding initialisation, it must
receive a set of parameters before it can start operation. The set of initialisation parameters and
notifications is illustrated in Figure 4.23.
Invocation Primitives

  Description
  Initialise (baud, km)

Notification Primitives

  Description                           Issuing Condition
  Omission degree exceeded (m)          MOd(m) > km
  Stuck-at-dominant Medium (m)          Mstk-d(m)
  Stuck-at-recessive Medium (m, mid)    Midle(m) ∧ ¬Mds(m)
  Medium partition (m, mid)             Midle(m) ∧ Mds(m)
  Stuck-at-dominant Channel             Chstk-Tx

Figure 4.23: CANELy Media Selection Unit management primitives
The MSU must be initialised with: the CAN network bit rate; the media omission degree bound.
The bit rate is necessary to keep synchronism with the CAN network, and to ensure that the
quasi-stationary operation is also extended to the CANELy machinery. The media omission degree
bound defines the omission degree threshold whose violation implies medium failure. The
operation of mechanisms that might lead to any form of fault confinement, such as bus
disconnection, is halted until both parameters have been provided.
Some functions require information regarding the message identifier when an exceptional
condition is raised, e.g. medium partition detection. An auxiliary function implements the logic
which processes the first 14 bits of a CAN 2.0A message or the first 32 bits of a CAN 2.0B
message. The recovered message identifier is put into a VHDL record, along with information
regarding the CAN protocol version. Access to this information depends on the management
interface implementation.
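The two bit counts match the leading fields of the standard and extended frame formats of the CAN 2.0 specification, stuff bits aside. The breakdown below is an interpretation, since the text does not itemise the fields; it suggests why these prefixes suffice to recover both the identifier and the protocol version (through the IDE bit).

```latex
\begin{align*}
\text{CAN 2.0A:}\quad &
  \underbrace{1}_{\text{SOF}} + \underbrace{11}_{\text{identifier}}
  + \underbrace{1}_{\text{RTR}} + \underbrace{1}_{\text{IDE}} = 14 \text{ bits} \\
\text{CAN 2.0B:}\quad &
  \underbrace{1}_{\text{SOF}} + \underbrace{11}_{\text{base identifier}}
  + \underbrace{1}_{\text{SRR}} + \underbrace{1}_{\text{IDE}}
  + \underbrace{18}_{\text{identifier extension}} = 32 \text{ bits}
\end{align*}
```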
4.6.2 Management Interface
The MSU management interface specification has been left as general as possible, given
the number of bus interfacing technologies available. There are, however, certain restrictions
that must be observed. First and foremost is the provision of an Interrupt Request (IRQ)
facility. The issuing of IRQs is paramount, allowing the MSU to notify the host system upon an
exceptional event (notification), such as the ones listed in Figure 4.23.
From the computing platform point-of-view, the MSU can be seen as a peripheral, capable
of being configured and having accessible information relevant to its operation. Therefore, some
sort of peripheral I/O bus must be used.
The most common I/O method for this type of interface is memory-mapped, where I/O is
achieved through reading and writing data to certain memory addresses. The function of the
management interface is then to multiplex the requested actions, which can be reading or writ-
ing data, e.g. reading the value of a medium omission degree; writing the value of the bit rate
parameter.
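The multiplexing role of a memory-mapped management interface can be illustrated with a very small register file. Everything in the sketch below is hypothetical, since the interface is deliberately left unspecified: the entity name, the 2-bit address space, the 8-bit data width, the register addresses and the single notify input standing for any pending notification are assumptions chosen only to show one write path (the omission degree bound) and one read path (a medium's omission degree).

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch only: a hypothetical memory-mapped management interface.
-- Address map (assumed): "00" = omission degree bound k_m (read/write),
--                        "01" = omission degree of one medium (read only).
entity msu_mgmt_sketch is
  port (
    sys_clk : in  std_logic;
    rst_N   : in  std_logic;                     -- synchronous reset, active low
    addr    : in  std_logic_vector(1 downto 0);
    wr_en   : in  std_logic;
    wr_data : in  std_logic_vector(7 downto 0);
    rd_data : out std_logic_vector(7 downto 0);
    M_od    : in  unsigned(7 downto 0);          -- omission degree of one medium
    notify  : in  std_logic;                     -- any pending notification
    k_m     : out unsigned(7 downto 0);          -- omission degree bound
    irq     : out std_logic);
end entity msu_mgmt_sketch;

architecture behavioural of msu_mgmt_sketch is
  signal km_reg : unsigned(7 downto 0) := (others => '0');
begin
  -- Write path: only the bound register is writable
  pWrite : process (sys_clk) is
  begin
    if rising_edge(sys_clk) then
      if rst_N = '0' then
        km_reg <= (others => '0');
      elsif wr_en = '1' and addr = "00" then
        km_reg <= unsigned(wr_data);
      end if;
    end if;
  end process pWrite;

  -- Read path: multiplex the requested register onto the data bus
  with addr select
    rd_data <= std_logic_vector(km_reg) when "00",
               std_logic_vector(M_od)   when "01",
               (others => '0')          when others;

  k_m <= km_reg;
  irq <= notify;   -- interrupt request towards the host system
end architecture behavioural;
```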
4.7 Summary
The standard Controller Area Network (CAN) layer has a restrictive fault model which, although
adequate for the applications it was designed for, is not suitable for high dependability
domains. Therefore it must be complemented, providing dependable network operation even in
the presence of disturbances. The CAN Enhanced Layer (CANELy) architecture aims at solving
the impairments of CAN network operation, qualifying the utilisation of CAN for high
dependability domains, which require strict operational guarantees.
This Chapter covered the dependability enhancements offered by the CANELy architecture
over the standard CAN protocol, namely with respect to network availability. These enhancements
were defined by CANELy, based on modelling the operation of the CAN protocol in a
comprehensive manner. With the network model defined, CANELy progressed into identifying its
weaknesses and proposing solutions to solve or mitigate them.
Based on the analytical results of the CANELy architecture, we were able to discuss a set of
mechanisms, whose objective was mapping the models into (firstly) abstract structures, describing
the model’s behaviour. It was then noticed that most problems modelled by CANELy could be
transformed into sequence detection problems. These problems, however, required machinery
which should be made as generic as possible to promote component reuse, maintainability and
reduce design effort, thus easing the verification of component correctness.
The dependability enhancements were discussed in a bottom-up approach, starting with the
effective design of redundant media management mechanisms, and then passing to the sev-
eral monitoring functions both at the channel and medium level. These monitoring functions rely
heavily on the sequence detection machinery for fault detection, thus permitting their confinement.
Finally, the integration of these components in a single entity, called Media Selection Unit (MSU)
was discussed.
5 Timeliness Enforcement
Defer no time, for delays are dangerous.
Henry VI
WILLIAM SHAKESPEARE
The concept of correct service has two dimensions in a real-time system. One of the di-
mensions is the value domain, where correct results are required. The other dimension is the
time domain, where results are required on time. Therefore reliable real-time operation demands
correct results on time.
Communication network channels are known for being unreliable, i.e. there is always some
probability that messages conveyed by the channel get corrupted, or even lost. While the channel
can be made reliable and available, as discussed in Chapter 4, one cannot completely avoid
low-level protocol glitches (e.g. error or overload frame transmissions) which manifest
themselves as periods of time where the CAN network is unavailable, i.e. inaccessible.
One of the contributions of the CANELy architecture was the study of CAN inaccessibility,
allowing its integration into the timeliness model of CAN bus communication [11, 47]. A set of
mechanisms was devised for the evaluation of inaccessibility events and their duration, together
with an effective method for controlling its effects.
This Chapter discusses the effective mapping of mechanisms supporting Inaccessibility Con-
trol in CANELy, enabling the enforcement of correct behaviour in the time domain. Firstly, we
introduce the inaccessibility concept, how it manifests and the duration of such events in CAN
network operation. Inaccessibility events must be assessed, and have their effects evaluated
and controlled. Such actions are taken by special-purpose mechanisms, which monitor the chan-
nel. Lastly, we discuss the mapping and integration of such mechanisms in a self-contained unit,
suitable for being mapped into an FPGA device.
5.1 Channel Inaccessibility
The concept of inaccessibility was formally defined in [48]: inaccessibility is a perceived
temporary condition, during which a component is unable to provide service. The duration and
rate of these inaccessibility events are known and bounded, and the violation of such bounds
implies the failure of the component.
In a CAN network, the component that suffers inaccessibility events is the channel.
Inaccessibility results from two different conditions: omissions, which derive from network
errors and which in CAN are automatically transformed into inaccessibility events; overloads,
which derive from the local state of the nodes. The CAN standard defines that an overload frame
can be transmitted during the intermission period, if a node requires extra time to process the
received frame, e.g. a nearly full FIFO serving a slow processing element.
While these events are taking place, the channel cannot be used by any other node for mes-
sage diffusion, thus being inaccessible. The impact that these events have in timeliness cannot
be neglected: the time wasted in error frame transmission and lost frame retransmission will
add up towards a possible violation of the bounded transmission delay property, a fundamental
requirement of a real-time communication system (see Section 2.2.3, page 12).
One of the contributions of the CANELy architecture was the study and analytic definition of the
duration and boundedness of inaccessibility in CAN. Some results of this study are presented in
Figure 5.1, pertaining to the inaccessibility duration bounds provided by the CANELy architecture.
Figure 5.1: CAN vs. CANELy normalised inaccessibility duration bounds
The chart in Figure 5.1 shows that the CANELy mechanisms may reduce inaccessibility
times, compared to standard CAN. This reduction, however, only applies to network
errors lasting longer than a single message transfer, e.g. a failed transmitter (Tx-fail). In standard
CAN, these errors are handled by the fault confinement mechanisms at each node, based
on counters that account for both transmit and receive errors. The inaccessibility period only
ends when (one of) those counters reaches a certain threshold and puts the CAN controller in the
“error-passive” or “bus-off” state. The CANELy architecture minimises those times by exploiting
mechanisms present in standard CAN controllers, which allow warning signals to be issued when
an error counter exceeds a given threshold.
Inaccessibility Impact on System Timeliness
Most communication protocols are based on timers which, in the event of failed communication,
allow system progression or recovery actions to be undertaken. An inaccessibility period usually
falls outside what these timers account for. Depending on the network load, the effects
of an inaccessibility period may go unnoticed. There are, however, situations where the effects
propagate into the upper layers, triggering unwanted actions and behaviours, e.g. message
retransmission or even protocol failure.
The effects of inaccessibility can be extremely dangerous in a hard real-time system. Given
the close relation between system task execution and message communication, the communication
delays can propagate into the computational task itself, and ultimately cause a deadline
violation. Therefore, it is important to know the parameters that characterise
inaccessibility events, namely rate and duration, so that they can be accounted for in the timeliness
model of the real-time system.
5.2 Inaccessibility Evaluation
Generically, confinement actions must be supported by monitoring and evaluation mechanisms.
The control of inaccessibility in CANELy is no exception, being supported by inaccessibility
parameter evaluation mechanisms. Such mechanisms provide information on both the rate and
the duration of inaccessibility events and periods, thus making it possible to assess the
channel status w.r.t. timeliness and to provide the upper layers with valuable information on the
real inaccessibility parameters, e.g. for the management of timeout-based protocols.
5.2.1 Assessment of Inaccessibility Events
The rate of inaccessibility events can be assessed by counting the number of events in a given
reference interval. Such evaluation should be done at the end of a network activity period, i.e.
upon the assertion of ChEOT . These counters are defined as:
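\[
Ch_{Ie}\uparrow_{Ch_{EOT}} \;\mapsto\;
\begin{cases}
Ch_{Ie} + 1 & \text{if } Ch_{Err} \\
0 & \text{when mgmt. request}
\end{cases}
\tag{5.1}
\]
\[
Ch_{Oe}\uparrow_{Ch_{EOT}} \;\mapsto\;
\begin{cases}
Ch_{Oe} + 1 & \text{if } Ch_{Err} \land \lnot Ch_{Fok} \\
0 & \text{when mgmt. request}
\end{cases}
\tag{5.2}
\]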
where: ChIe is the total number of inaccessibility events; ChOe is the number of inaccessibility
events derived from omissions. The difference between the ChIe and ChOe counters provides the
number of inaccessibility events strictly due to overload conditions.
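The two counters can be sketched as a small behavioural model. This is a Python illustration rather than the hardware description used in this work; the class name, the method names and the joint reset of both counters on a management request are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class EventCounters:
    """Software model of the ChIe / ChOe counters (equations 5.1 and 5.2)."""
    ch_ie: int = 0  # total number of inaccessibility events
    ch_oe: int = 0  # inaccessibility events derived from omissions

    def on_eot(self, ch_err: bool, ch_fok: bool) -> None:
        """Evaluate once per rising edge of ChEOT."""
        if ch_err:
            self.ch_ie += 1
            if not ch_fok:
                self.ch_oe += 1

    def mgmt_reset(self) -> None:
        """Management request (assumed here to clear both counters)."""
        self.ch_ie = 0
        self.ch_oe = 0

    @property
    def overload_events(self) -> int:
        """Events strictly due to overload: total minus omissions."""
        return self.ch_ie - self.ch_oe


counters = EventCounters()
counters.on_eot(ch_err=True, ch_fok=False)   # omission
counters.on_eot(ch_err=True, ch_fok=True)    # overload
counters.on_eot(ch_err=False, ch_fok=True)   # clean transfer
print(counters.ch_ie, counters.ch_oe, counters.overload_events)  # 2 1 1
```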
The communication channel suffers an omission whenever a message transfer is aborted or
an error is detected by CRC mechanisms, and an error frame is issued. Therefore, it is necessary
to account for channel omissions. As a general rule, a channel exceeding the omission degree
bound, k, should be declared failed. The evaluation of the real number of omissions is defined as:
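\[
Ch_{Od}\uparrow_{Ch_{EOT}} \;\mapsto\;
\begin{cases}
Ch_{Od} + 1 & \text{if } Ch_{Err} \land \lnot Ch_{Fok} \\
0 & \text{if } Ch_{Fok}
\end{cases}
\tag{5.3}
\]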
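As an illustration of the rule above, the following Python sketch models the omission degree counter and the comparison against the bound k. The class and method names are hypothetical, and the value of k used in the example is arbitrary.

```python
class OmissionDegree:
    """Software model of the ChOd counter of equation 5.3."""

    def __init__(self, k: int) -> None:
        self.k = k        # omission degree bound
        self.ch_od = 0    # consecutive channel omissions

    def on_eot(self, ch_err: bool, ch_fok: bool) -> bool:
        """Evaluate once per rising edge of ChEOT.

        Returns True when the bound is exceeded (ChOd > k), i.e. when
        the channel should be declared failed.
        """
        if ch_fok:
            self.ch_od = 0          # a correct frame clears the count
        elif ch_err:
            self.ch_od += 1         # one more consecutive omission
        return self.ch_od > self.k


od = OmissionDegree(k=2)            # k = 2 is an arbitrary example value
results = [od.on_eot(ch_err=True, ch_fok=False) for _ in range(3)]
print(results)                      # [False, False, True]
```

A single correct frame transfer (ChFok asserted) brings the count back to zero, so only consecutive omissions accumulate towards the bound.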
Assessing the number of inaccessibility events is a step towards inaccessibility control mech-
anisms and upper layer protocol optimisation. There is still the need, however, to account for the
time these inaccessibility events last.
5.2.2 Extended Channel Monitoring
Assessing the duration of inaccessibility periods affecting the communication channel requires
the assessment of a few more parameters. Important parameters pertaining to accessibility are
those signalling: start of a new frame transfer; successful frame transfer. These functions extend
the set of Channel Monitoring functions needed for network monitoring with regard to availability
and reliability (see Section 4.4).
The first signal of this extended set of channel monitoring signals is the Start-of-Frame signal,
ChSOF , and is defined as:
\[
Ch_{SOF} \;\mapsto\;
\begin{cases}
\text{true} & \text{if } Ch_{EOT} \land Ch_{Rx} = d \\
\text{false} & \text{when } Ch_{SOF}
\end{cases}
\tag{5.4}
\]
The assertion of the ChSOF signals the channel’s availability to convey messages, being as-
serted for the duration of only one bit-time. This signal is useful for machinery assessing the
duration of inaccessibility events, since it marks the start of a (possible) inaccessibility period.
Another signal required by inaccessibility monitoring functions is the Transmission Correct,
ChTok, which assesses if there was no violation in the frame transfer format, detected up to the
first bit of intermission. It is defined as:
\[
Ch_{Tok} \;\mapsto\;
\begin{cases}
\text{true} & \text{if } Ch_{Rx} = rdrrrrrrrrr \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{5.5}
\]
This signal is useful to assess not only that the message was received correctly, but also that
there was no error affecting the last but one bit of the message transmission, so that no message
retransmission is needed.
Lastly, if no violation is detected up to the second bit of the intermission, which is the minimum
intermission period before a new data/remote frame transmission can take place, the ChIFS sig-
nal should be asserted. It is defined as:
\[
Ch_{IFS} \;\mapsto\;
\begin{cases}
\text{true} & \text{if } Ch_{Rx} = rdrrrrrrrrrr \\
\text{false} & \text{when } Ch_{EOT}
\end{cases}
\tag{5.6}
\]
Given the conditions for the assertion of the ChEOT are also met, the ChIFS signal will be
asserted only during one bit-time. A depiction of these mechanisms w.r.t. the end of a CAN frame
is shown in Figure 5.2.
[Timing diagram. Frame fields shown: CRC Sequence, CRC Del, ACK Slot, ACK Del, EOF, Delimiter. Signals shown: ChFok, ChTok, ChIFS.]
Figure 5.2: Timing of the CAN channel monitoring signals
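The detection of the ChTok and ChIFS sequences can be illustrated with a simple sliding-window model in Python. This is only a sketch of the matching rule of equations 5.5 and 5.6: the function name and the example bit stream are assumptions, and the negation of the signals upon ChEOT is not modelled.

```python
from collections import deque

SEQ_TOK = "rdrrrrrrrrr"    # equation 5.5, 11 symbols
SEQ_IFS = "rdrrrrrrrrrr"   # equation 5.6, 12 symbols


def detect(bits: str, pattern: str) -> list:
    """Return the bit indices at which the most recent bits equal `pattern`."""
    window = deque(maxlen=len(pattern))
    hits = []
    for index, bit in enumerate(bits):
        window.append(bit)
        if "".join(window) == pattern:
            hits.append(index)
    return hits


# Hypothetical end of a correct frame: last CRC bit (dominant here),
# CRC delimiter, ACK slot, ACK delimiter, 7 EOF bits, 3 intermission bits.
tail = "d" + "r" + "d" + "r" + "r" * 7 + "r" * 3

print(detect(tail, SEQ_TOK))  # [11] -> first bit of intermission
print(detect(tail, SEQ_IFS))  # [12] -> second bit of intermission
```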
This set of signals allows the implementation of a simple scheme for the evaluation of CAN
inaccessibility periods. They can be combined into one signal, ChFc, which asserts a correct
frame-level boundary, i.e. start or end of a correct frame. Its definition is given by:
\[
Ch_{Fc} = Ch_{SOF} \lor Ch_{Tok-p} \lor Ch_{IFS}
\tag{5.7}
\]
where: ChTok−p is a pulsed version of ChTok, i.e. it only lasts one bit time after ChTok assertion.
This signal can then be used by inaccessibility evaluation machinery, helping to assess the end
of an inaccessibility event.
Another set of extended monitoring signals can be defined, asserting both the presence of a
channel inaccessibility period, ChIna, and the end of such period, ChBidle:
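\[
Ch_{Bidle} \;\mapsto\;
\begin{cases}
\text{true} & \text{if } T(Ch_{Rx} = r) \geq T_B \\
\text{false} & \text{if } T(Ch_{Rx} = r) < T_B \lor Ch_{Rx} = d
\end{cases}
\tag{5.8}
\]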
where: TB is the normalised duration of the minimum bus idle period that identifies the absence
of any frame transmission, TB = 12 bit times. The assertion of ChBidle means the effects of the
last inaccessibility events have passed, and all the (pending) messages have been transferred,
thus leaving the channel in an idle state.
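A minimal Python sketch of the bus idle rule follows. It counts consecutive recessive bits and asserts the signal once the run reaches TB; the function name and the example bit stream are hypothetical.

```python
T_B = 12  # minimum bus idle period, in bit times (value given in the text)


def bus_idle_trace(bits: str, t_b: int = T_B) -> list:
    """Return the value of ChBidle after each received bit ('r' or 'd')."""
    run = 0          # length of the current run of recessive bits
    trace = []
    for bit in bits:
        run = run + 1 if bit == "r" else 0   # a dominant bit restarts the run
        trace.append(run >= t_b)
    return trace


trace = bus_idle_trace("d" + "r" * 13 + "d")
print(trace.index(True))  # 12 -> asserted at the 12th consecutive recessive bit
print(trace[-1])          # False -> negated by the dominant bit
```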
Lastly, the extended monitoring signal ChIna defines when a period of inaccessibility begins
and for how long its effects last. It is asserted upon the detection of an inaccessibility event and
negated upon the assertion of the ChBidle signal, as specified by:
\[
Ch_{Ina} \;\mapsto\;
\begin{cases}
\text{true} & \text{if } Ch_{Err} \\
\text{false} & \text{when } Ch_{Bidle}
\end{cases}
\tag{5.9}
\]
This signal is of paramount importance to the inaccessibility control mechanisms, for it indi-
cates the real inaccessibility effects. The ChIna signal can be supplied to the upper layers, notify-
ing the start and effective duration of inaccessibility events. Such information is useful for protocol
timeout management, allowing the extension of timers to account for inaccessibility periods.
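The set/clear behaviour of ChIna can be sketched as a latch in Python. The function name is hypothetical, and giving ChErr priority over ChBidle when both are seen in the same bit time is an assumption of this sketch, not something stated in the definition.

```python
def ch_ina_trace(samples) -> list:
    """Model ChIna as a latch: set upon ChErr, cleared upon ChBidle.

    `samples` is a sequence of (ch_err, ch_bidle) pairs, one per bit time.
    """
    ch_ina = False
    trace = []
    for ch_err, ch_bidle in samples:
        if ch_err:            # assumed priority: set wins over clear
            ch_ina = True
        elif ch_bidle:
            ch_ina = False
        trace.append(ch_ina)
    return trace


samples = [
    (False, False),  # normal traffic
    (True, False),   # error detected: inaccessibility starts
    (False, False),  # retransmissions still pending
    (False, False),
    (False, True),   # bus idle: inaccessibility effects are over
    (False, False),
]
trace = ch_ina_trace(samples)
print(trace)       # [False, True, True, True, False, False]
print(sum(trace))  # 3 bit times with ChIna asserted
```

An upper layer could use the number of bit times during which ChIna is asserted as the amount by which a pending timeout is extended.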
Another use for the ChIna signal is the signalling of an inaccessibility period, since ChIna is
asserted until the end of inaccessibility effects, i.e. assertion of ChBidle. Therefore we can use
ChIna to account for the number of inaccessibility incidents in this period, ChIi:
\[
Ch_{Ii}\uparrow_{Ch_{EOT}} \;\mapsto\;
\begin{cases}
Ch_{Ii} + 1 & \text{if } Ch_{Ina} \\
0 & \text{when } \lnot Ch_{Ina}
\end{cases}
\tag{5.10}
\]
This counter is incremented whenever a correct frame transmission takes place during an
inaccessibility period. Such counting mechanisms can be mapped into VHDL by simple constructs.
The mapping of the ChIi signal is illustrated in Figure 5.3.
-- purpose: Count the number of inacc. events during an inacc. period
-- inputs : sys_clk, rst_N, can_clk_en, ChIna, ChEOT
-- outputs: ChIi
procInacEvtPer : process (sys_clk) is
  variable ChEOT_s : std_ulogic_vector(1 downto 0) := (others => '0');
begin  -- process procInacEvtPer
  if rising_edge(sys_clk) then
    if rst_N = '0' then
      ChIi    <= (others => '0');         -- Clear the event count upon reset
      ChEOT_s := (others => '0');
    else
      if can_clk_en = '1' then            -- Sync with the CAN network
        ChEOT_s := ChEOT_s(0) & ChEOT;    -- Shift values left, input ChEOT
        -- Has ChEOT just been asserted?
        if ChEOT_s(1) = '0' and ChEOT_s(0) = '1' then
          if ChIna = '1' then
            ChIi <= ChIi + 1;             -- Increment, ChIna is asserted
          end if;
        end if;
        if ChIna = '0' then
          ChIi <= (others => '0');        -- Clear the count
        end if;
      end if;
    end if;
  end if;
end process procInacEvtPer;
Figure 5.3: Inaccessibility Event Count description in VHDL
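The behaviour of the description in Figure 5.3 can be cross-checked against a small Python model. The sketch below mirrors the rising-edge detection on ChEOT and the clearing of the count when ChIna is negated; the function name and the input samples are illustrative assumptions.

```python
def ch_ii_trace(samples) -> list:
    """Model of the ChIi counter, one (ch_eot, ch_ina) sample per CAN bit time."""
    prev_eot = False
    ch_ii = 0
    trace = []
    for ch_eot, ch_ina in samples:
        rising_eot = ch_eot and not prev_eot   # has ChEOT just been asserted?
        if rising_eot and ch_ina:
            ch_ii += 1
        if not ch_ina:                         # clearing takes precedence
            ch_ii = 0
        prev_eot = ch_eot
        trace.append(ch_ii)
    return trace


samples = [
    (False, True),   # inaccessibility period in progress
    (True, True),    # ChEOT asserted: first incident counted
    (True, True),    # ChEOT still high: no new edge
    (False, True),
    (True, True),    # second incident counted
    (False, False),  # ChIna negated: count cleared
]
print(ch_ii_trace(samples))  # [0, 1, 1, 1, 2, 0]
```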
Integration of Basic Inaccessibility Control Mechanisms
The set of extended monitoring signals required to assist monitoring inaccessibility in a CAN
network is summarised in Figure 5.4.
Extended Channel Monitoring
ChSOF     Start Of Frame
          asserted at beginning of frame transmission; one bit-time duration.
          Condition: ChEOT ∧ ChRx = d
ChTok     Transmission Correct
          asserted at the 1st bit of intermission; negated upon ChEOT.
          Sequence: ChRx = rdrrrrrrrrr
ChIFS     Frame Termination Correct
          asserted at the 2nd bit of intermission; negated upon ChEOT.
          Sequence: ChRx = rdrrrrrrrrrr
ChBidle   Bus idleness
          asserted after bus is idle for a certain threshold; negated upon detection of a dominant bit.
          Sequence: ChRx = rrrrrrrrrrrr
ChIna     Channel Inaccessibility Status
          asserted upon ChErr; negated upon assertion of ChBidle.
          Condition: ChErr = true
Figure 5.4: Extended Channel Monitoring signals
With the information of Figure 5.4 we can easily map these monitoring functions into sequence
detection and assertion machinery. Lastly, in Figure 5.5 the sequences pertaining to the
monitoring functions of Figure 5.4 are shown.
-- ChTok - Transmit Correct
constant seq_tok : std_ulogic_vector := "10111111111";

-- ChIFS - Interframe Spacing
constant seq_ifs : std_ulogic_vector := "101111111111";

-- ChBidle - Bus Idle
constant seq_chbidle : std_ulogic_vector := "111111111111";
Figure 5.5: Timeliness-related sequences mapped into VHDL
It can be noticed that the sequences pertaining to the ChTok and ChIFS signals are equal
up to the last bit of ChTok, i.e. the ChTok sequence is a prefix of the ChIFS sequence, and
therefore the detection could be optimised. Although it might be tempting to perform this
optimisation by hand, it should be left to the VHDL synthesiser. Most synthesisers recognise
resources that may be shared and automatically perform the optimisation, which can be confirmed
through the synthesis tool report.
5.2.3 Assessment of Inaccessibility Effects
The extension of the set of channel monitoring functions provides the basis for the assess-
ment and characterisation of inaccessibility periods and events. An inaccessibility period can be
composed by several inaccessibility events, e.g. two consecutive message omissions. The as-
sessment of the amount of such events has been discussed previously. Therefore, the remaining
parameter that must be evaluated is the duration of these inaccessibility events and periods.
The mechanisms evaluating the duration of an inaccessibility event are defined by:
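\[
T_{e\_ina} =
\begin{cases}
T_{e\_ina} + T_{bit} & \text{if } \lnot Ch_{EOT} \\
T_{e\_ina} & \text{if } Ch_{EOT} \\
0 & \text{if } Ch_{Fc}
\end{cases}
\tag{5.11}
\]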
The principle of operation of this mechanism is simple: time count is started when the ChSOF
signal is asserted, and is reset when: the transmission of a data/remote frame succeeds; the
transmission of a data/remote frame is correctly terminated by a minimum intermission period.
Should an inaccessibility event occur, Te_ina will hold its exact duration upon the assertion of
the ChEOT signal. The duration of a single inaccessibility event, Te_ina, is upper bounded by
2160 Tbit (Tx. Fail in Figure 5.1). The mapping of this duration evaluation mechanism
into hardware is shown in Figure 5.6.
-- purpose: Assess the duration of the current inacc. event
-- inputs : sys_clk, rst_N, can_clk_en, ChFc, ChEOT
-- outputs: Te_ina
procInacTimeCount : process (sys_clk) is
begin  -- process procInacTimeCount
  if rising_edge(sys_clk) then            -- rising clock edge
    if rst_N = '0' then                   -- synchronous reset (active low)
      Te_ina <= (others => '0');          -- Clear the count upon reset
    else
      if can_clk_en = '1' then            -- Sync with the CAN network, maps Tbit
        if ChFc = '1' then                -- ChFc signal asserted
          Te_ina <= (others => '0');      -- Clear the count when starting a new event
        elsif ChEOT = '0' then            -- ChEOT not asserted
          Te_ina <= Te_ina + 1;           -- Increment by one Tbit
        end if;                           -- Default behaviour: keep the Te_ina value
      end if;
    end if;
  end if;
end process procInacTimeCount;
Figure 5.6: Inaccessibility duration evaluation description in VHDL
The VHDL description of Figure 5.6 shows the materialisation of the inaccessibility event
duration evaluation mechanism: the VHDL signal Te_ina is the mapping of the quantity Te_ina of
equation 5.11. The physical dimensioning of the signal's width depends on its upper bound.
This is an important parameter, in order to avoid using more resources than those strictly
necessary, thus enabling a compact design.
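The counting rule and the dimensioning of the counter width can be illustrated with a short Python sketch. The function names are hypothetical; the priority of ChFc over ChEOT follows the VHDL description of Figure 5.6, and the 2160 bit-time bound is the one quoted above.

```python
def te_ina_trace(samples) -> list:
    """Model of the Te_ina counter, one (ch_fc, ch_eot) sample per bit time."""
    te_ina = 0
    trace = []
    for ch_fc, ch_eot in samples:
        if ch_fc:              # correct frame-level boundary: restart the count
            te_ina = 0
        elif not ch_eot:       # network activity in progress: count one Tbit
            te_ina += 1
        trace.append(te_ina)   # otherwise the value is held
    return trace


def counter_width(upper_bound: int) -> int:
    """Number of bits needed to represent every value in 0..upper_bound."""
    return max(1, upper_bound.bit_length())


samples = [(True, False), (False, False), (False, False),
           (False, True), (False, True), (True, False)]
print(te_ina_trace(samples))  # [0, 1, 2, 2, 2, 0]
print(counter_width(2160))    # 12
```

With the 2160 Tbit bound, a 12-bit register suffices, since 2^11 = 2048 ≤ 2160 < 4096 = 2^12.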
The total amount of time of consecutive inaccessibility periods should be accounted for, in
order to consolidate an overall inaccessibility time, Tina:
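\[
T_{ina}\uparrow_{Ch_{EOT}} =
\begin{cases}
T_{ina} + T_{e\_ina} & \text{if } Ch_{Ina} \\
T_{ina} & \text{if } Ch_{Bidle} \\
0 & \text{when mgmt. request}
\end{cases}
\tag{5.12}
\]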
The actions of equation 5.12 are executed only at the end of a network activity period, i.e.
upon ChEOT assertion. The information provided by this equation can be used by upper layers
for assessing the channel status, with regard to inaccessibility.
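As a worked illustration, the accumulation of equation 5.12 and the conversion of a normalised time into seconds can be sketched in Python. The function names, the event durations and the 1 Mbit/s bit rate are assumptions chosen for the example only.

```python
def accumulate_t_ina(event_durations, ina_flags) -> int:
    """Accumulate Te_ina into Tina at each ChEOT assertion while ChIna is set.

    `event_durations` holds Te_ina (in bit times) at each ChEOT assertion;
    `ina_flags` holds the value of ChIna at the same instants.
    """
    t_ina = 0
    for te_ina, ch_ina in zip(event_durations, ina_flags):
        if ch_ina:
            t_ina += te_ina
    return t_ina


def bit_times_to_seconds(bit_times: int, bit_rate_bps: int) -> float:
    """Convert a normalised duration (bit times) into seconds."""
    return bit_times / bit_rate_bps


# Two inaccessibility events of 17 and 23 bit times (illustrative values),
# separated by one network activity period without inaccessibility.
total = accumulate_t_ina([17, 0, 23], [True, False, True])
print(total)  # 40

# Upper bound of a single event (2160 bit times) at an assumed 1 Mbit/s.
worst = bit_times_to_seconds(2160, 1_000_000)
print(worst)  # about 2.16 ms
```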
An inaccessibility epoch starts with a (possibly) correct frame transfer. The maximum duration
of such an epoch, Tp_ina, is dependent on CAN traffic patterns, i.e. network load. The duration,
however, must also be accounted for:
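\[
T_{p\_ina} =
\begin{cases}
T_{p\_ina} + T_{bit} & \text{if } \lnot Ch_{EOT} \lor Ch_{Ina} \\
T_{p\_ina} & \text{if } Ch_{Bidle} \\
0 & \text{if } Ch_{Fc} \land \lnot Ch_{Ina}
\end{cases}
\tag{5.13}
\]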
An inaccessibility epoch ends after all the effects of inaccessibility have been cleared, i.e. all
the pending frame transfers have succeeded.
The mapping of these mechanisms into hardware follows the same philosophy of the mapping
present in Figure 5.6: a counter is defined, based on the signal’s upper bound value; the actions
upon that signal are mapped and defined by conditions.
5.3 Usefulness of Inaccessibility Control Mechanisms
The ultimate goal of inaccessibility control mechanisms is to use the real inaccessibility event
duration for protocol optimisation. This goal can be achieved through the previously defined ma-
chinery, which interfaces higher layer mechanisms. The information regarding the real duration
and effects of inaccessibility can be used at several layers of protocols. Such knowledge permits
fine-tuning network operation, instead of using conservative values, which may not yield optimum
performance of the global system.
The inaccessibility control method contemplated by the CANELy architecture is called inac-
cessibility flushing. This method avoids the need for an “accessibility test” message transmission,
to assess the availability of the network for conveying messages, i.e. its accessibility status. In-
stead, it uses the ChIna signal for assessing when the (distributed) frame transmission queue has
become empty, after the occurrence of an inaccessibility event [47, 11]. Such method can be used
for upper layer timer management [49], thus allowing the incorporation of the real inaccessibility
effects as parameters of timeout-based protocols.
However, the specified mechanisms are also useful for clock-less protocols, i.e. protocols that do
not use timers. Figure 5.7 shows an example of such an optimisation, applied to a diffusion-based
protocol.
D-CAN: (optimised) Diffusion-based Protocol
Initialization
i01 ndup(mid) := 0; // number of duplicates, kept for each message
Sender
s10 when d-can.req(mid〈type,p,n〉, mess) invoked at p do
s11   if mess = NULL then
s12     can-rtr.req(mid);
s13   else
s14     can-data.req(mid, mess);
s15 od;
s16 when can-rtr.cnf(mid) or can-data.cnf(mid, mess) confirmed do
s17   deliver d-can.cnf (mid,mess);
s18 od;
Recipient
r00 when can-data.ind(mid, mess) received at q
r01   or can-rtr.ind(mid, mess=NULL) received at q do
r02   ndup(mid) := ndup(mid) + 1;
r03   if ndup(mid) = 1 then // new message
r04     d-can.ind (mid, mess);
a00     if ¬ChTok(mid) then
r05       if mess = NULL then
r06         can-rtr.req(mid); // clustered
r07       else
r08         can-data.req(mid, mess);
r09       fi;
a01     fi;
r10   elif ndup(mid) > j or ChTok(mid) then
r11     can-abort.req(mid);
r12   fi;
r13 od;
Figure 5.7: Optimised Diffusion-based protocol.
The protocol depicted in Figure 5.7, dubbed D-CAN, is a message diffusion protocol. Its purpose
is to avoid inconsistent message omissions, usually caused by errors affecting the last but
one bit of a frame transfer [36]. An initial solution to this problem was offered by the CANELy
architecture through the EDCAN protocol [11], where all correct nodes that received the message
would eagerly diffuse it to the network. Such message diffusion was done through a number of
consecutive messages specified by the inconsistent omission degree bound. Such a strategy, however,
has a non-negligible impact on bandwidth and network load: a message correctly received by all
nodes would still be diffused, up to the inconsistent omission degree bound.
The D-CAN protocol aims at reducing the bandwidth utilisation by assessing the communica-
tion channel state via specialised machinery. The ChTok signal provides the D-CAN protocol with
information for early termination of the message diffusion, since its assertion guarantees that all
correct nodes have received the message and that no message retransmission is due. The use
of the ChTok signal in the protocol is clearly identified in Figure 5.7.
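The recipient side of Figure 5.7 (lines r00 to r13, with the additions a00 and a01) can be sketched in Python as a function that returns the list of service requests issued for one received frame. The function name, the tuple encoding of the actions and the value of j used in the example are hypothetical; the primitive names are those of the figure.

```python
def recipient_step(ndup: dict, mid, mess, ch_tok: bool, j: int) -> list:
    """Handle one can-data.ind / can-rtr.ind at a recipient.

    `ndup` keeps the number of duplicates per message identifier,
    `mess` is None for a remote frame, `ch_tok` is the ChTok(mid) signal
    and `j` is the duplicate bound used in line r10 of the figure.
    """
    actions = []
    ndup[mid] = ndup.get(mid, 0) + 1
    if ndup[mid] == 1:                        # new message
        actions.append(("d-can.ind", mid, mess))
        if not ch_tok:                        # diffuse only if ChTok is not asserted
            if mess is None:
                actions.append(("can-rtr.req", mid))
            else:
                actions.append(("can-data.req", mid, mess))
    elif ndup[mid] > j or ch_tok:
        actions.append(("can-abort.req", mid))
    return actions


# ChTok asserted: the message is delivered and no diffusion is requested.
print(recipient_step({}, 7, b"x", ch_tok=True, j=2))

# ChTok not asserted: the message is delivered and diffused; a later
# duplicate received with ChTok asserted aborts the pending request.
state = {}
first = recipient_step(state, 7, b"x", ch_tok=False, j=2)
second = recipient_step(state, 7, b"x", ch_tok=True, j=2)
print(first, second)
```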
5.4 Inaccessibility Control Unit
The final step is the encapsulation of the previously defined mechanisms and functions into a
single entity, named Inaccessibility Control Unit (ICU). This component is responsible for: monitoring
the CAN communication channel and assessing the existence of inaccessibility events and periods;
notifying the upper layers of the occurrence and duration of such events and periods; assisting
upper layer protocols with signals and inaccessibility time measures for optimum protocol timeout
definition. A block diagram is shown in Figure 5.8.
[Block diagram. Blocks: Inaccessibility Detection, Event & Time Registers, Management Interface. Signals: ChRx, ChFok, ChEOT, ChErr, Management.]
Figure 5.8: Inaccessibility Control Unit block diagram
The component of Figure 5.8 is partitioned into three main blocks, encompassing the moni-
toring, evaluation and management interface mechanisms. It receives the basic set of channel
monitoring functions defined in Chapter 4, enabling their extension into more advanced monitoring
functions, for inaccessibility monitoring purposes. The management block provides the mecha-
nisms for interfacing the monitoring and evaluation components with the upper layers, conveying
the invocation and notification primitives between the parties.
System and Management Interfaces
The ICU must interface with a computational platform, in order to provide/receive information
necessary for the correct operation of the CANELy mechanisms. The set of initialisation
parameters and notifications is illustrated in Figure 5.9.
The set of primitives defined by the ICU system interface includes: parameter configuration,
such as bit rate; exception notification, such as the change in channel inaccessibility status;
parameter extraction, such as the number of inaccessibility events.
The management interface is the component responsible for supporting the System Interface
primitives. It has been left as generic as possible, in order to allow interconnection with several
base technologies (see Section 4.6.2).
Invocation Primitives
Description
Initialise (baud)
Get Channel status (ChIna)
Get Channel inaccessibility events (ChIi)
Get Channel normalised inaccessibility times (Tina, Tp_ina)

Notification Primitives
Description                          Issuing Condition
Channel Status Change                ChIna
Channel Transmission Correct (mid)   ChTok
Channel Omission degree exceeded     ChOd > k
Figure 5.9: CANELy Inaccessibility Control Unit management primitives
The ICU management interface has, however, one extra requirement: an Interrupt Request
(IRQ) facility capable of supporting multiple inputs, i.e. multiple IRQ sources. Such a facility would
provide adequate support for the signals asserted by the ICU upon certain events, which are
output to support timer management services and protocols [49, 47], together with
lower processing latency.
5.5 Summary
The provision of timely service is a property that must be secured for a real-time system. Such
property becomes harder to guarantee, when the system needs to perform Input/Output (I/O)
through a communication channel. This channel may suffer disturbances and be temporarily
inaccessible. Such disturbances introduce errors in the temporal domain, since they affect the
timeliness of the communication. The CAN Enhanced Layer (CANELy) architecture provides a set
of analytic results and mechanisms to address the issue of timeliness, in the realm of Controller
Area Networks.
This Chapter discussed the functions and mechanisms for supporting timely service in a
Controller Area Network (CAN) infrastructure, disturbed by the presence of errors or overload
conditions. These disturbances are mapped into the temporal domain as inaccessibility peri-
ods, during which the communication channel is temporarily unavailable. The CAN Enhanced
Layer (CANELy) architecture provides a set of low-level mechanisms to deal with inaccessibility
and its control.
The discussed mechanisms are partitioned into: monitoring mechanisms, which assess the state of the
channel w.r.t. inaccessibility, enabling upper layer support for inaccessibility assessment; and
evaluation mechanisms, which use the former to perform evaluation actions, at both the rate and
duration levels. These mechanisms and functions are integrated in a single entity, called the
Inaccessibility Control Unit (ICU), which interfaces with the remaining components of the CANELy architecture.
6. CANELy Mechanism and Prototype Engineering
"Strive for perfection in everything you do. Take the best that exists and make it better. When it does not exist, design it."
SIR HENRY ROYCE
Materialising the CANELy low-level mechanisms and functions into hardware is the final step
in this journey. Such materialisation, however, must be performed in the most efficient way
possible, in order to fit a small FPGA device, thus being a cost-effective solution for enabling
highly dependable behaviour in both existing and future real-time CAN-based systems.
An implicit requisite to the materialisation of the mechanisms is the existence of a computing
platform. This computing platform integrates the FPGA along with other elements, such as a pro-
cessing element and memory buffers. The result of such integration is the enabling of a CANELy
node, suitable for implementation from the low-level mechanisms to the upper layers’ protocols.
This Chapter reports the engineering aspects of the mechanisms specified in the previous
Chapters. The simulated operation of the mechanisms is presented and discussed. The mapping
into an FPGA device is then described, along with the associated constraints. The FPGA resource
occupation is analysed, and compared with both freely- and commercially-available standard CAN
controller IP cores. Lastly, we document the specification and construction of a prototype board
having the resources for enabling an implementation of the CANELy architecture, both on hard-
ware and software aspects.
6.1 CANELy Mechanism Verification and Validation
A crucial step in digital system design is simulation. The purpose of simulation is to ensure
that the component hardware description behaves as intended, i.e. it follows the specifications
and provides correct outputs for the tested inputs. The simulation must be provided with
a sensible set of input values, which will generate a set of output values. These values are then
compared with the expected output, thus verifying the correctness of the component's hardware
description.
The designed mechanisms are divided into hierarchical blocks, called components, which can
be composed, e.g. a Basic Channel Monitoring component providing signals to generate the
Extended Channel Monitoring signals for inaccessibility parameter evaluation. Each component
has a companion testbench, which exercises the component and provides information about the
correctness of its operation. This modular approach makes it possible to ensure the correctness of all
the components, even before integrating them into more complex components. The simulations were
performed with Mentor Graphics ModelSim 6.5d for Linux, and the CAN bus data was generated by
a specially crafted programme, interfaced with the components via a custom-made VHDL
component emulating the several incoming media (see Appendix B).
6.1.1 Media Selection Unit
A simulated fragment of the Media Selection Unit (MSU) operation is shown in Figure 6.1. This
simulation fragment shows the transmission of three CAN frames, with one of them being affected
by an error, signalled by an error frame and corresponding ChErr assertion. The affected frame
is then retransmitted, and each media omission degree is evaluated.
[Simulation waveform. Annotations: Channel Error, Frame Mismatch, Medium Od Incremented, Medium Od Maintained.]
Figure 6.1: Media Selection Unit simulation fragment
The ChRx signal of Figure 6.1 is the channel's incoming bit stream, recovered by the AND-based
media selection function from the several incoming media, M_Rx(m). The figure also shows the
operation of basic channel monitoring mechanisms, e.g. ChErr or ChEOT, which are fundamental for
determining the status of the CAN bus.
This fragment also shows the several types of errors and the medium omission degree evaluation
mechanisms: a masked medium error, which does not affect the channel and increments the
omission degree of the affected medium, Od(1); an unmasked error affecting both media, for which
the corresponding omission degree is maintained; and another masked error affecting one medium, with
the consequent Od(1) increment.
6.1.2 Inaccessibility Control Unit
The Inaccessibility Control Unit (ICU) component was also simulated. A fragment from the
corresponding simulation is presented in Figure 6.2. This fragment shows the transmission of
frames, disturbed by both an error and overload conditions. Both these events are transformed
into inaccessibility periods.
[Simulation waveform. Annotations: Channel Error, Channel Overload, Inaccessibility Period, Total Inac. Event Count Incremented, Period Inac. Event Count Cleared.]
Figure 6.2: Inaccessibility Control Unit simulation fragment
This fragment shows the operation of the several inaccessibility parameter evaluation mecha-
nisms, such as inaccessibility duration counters, and also the assessment of the type and amount
of inaccessibility events. This fragment only shows the channel, ChRx, since this is the signal that
may be affected by inaccessibility incidents.
The first channel error is an omission error, leading to the increment of both the ChIe and
ChOe counts (shown under the same names in Figure 6.2). The second channel error results from
an overload condition, incrementing only the ChIe quantity (see Section 5.2.1, page 63), while
maintaining ChOe. The total (normalised) inaccessibility time, Tina, increases monotonically
and only at the end of an inaccessibility event, being incremented by the amount
representing the duration of the (finished) inaccessibility period.
6.2 FPGA Mechanism Engineering
The implementation of the mechanisms in an FPGA device has several constraints. The
first and foremost stems from finite resource availability, especially register elements (flip-flops).
Therefore the mapping of the machinery into hardware must be made extremely efficient, in order
to occupy the least resources.
The VHDL synthesis tool utilised to map the mechanisms into hardware was Xilinx XST 11.4
for Linux. Default optimisation options were used, save for the main optimisation goal, which was
selected as “Area” over the default “Speed”.
The target FPGA device for place & route actions was Xilinx’s Spartan-3E, which is part of the
Spartan-3 device family. This device family was chosen due to its low cost nature, together with
high longevity and adequate I/O resources. The devices used for comparison were: Spartan-3E
XC3S500E and XC3S100E, having an equivalent capacity of 500k and 100k logic gates respec-
tively; Spartan-3 XC3S50, with an equivalent capacity of 50k gates. These devices are architec-
turally identical, meaning that the end result of synthesising a design for any of the three will show
the same resource usage.
Sequence Detection Machinery
The problem of efficient sequence detection machinery has been discussed previously (see
Section 4.3). In order to assess the efficiency of the proposed mechanism, the proposed ROM-based
sequence detector is compared with the Sliding Window sequence detector with regard to
FPGA resource utilisation. The comparison results are illustrated in Figure 6.3.
The synthesis process was performed for a set of sensible sequence length values, i.e. varying
the tolerance margin errstuck←rx(bus) of the stuck-at-dominant detection machinery between
its lower and upper bounds. The metric used for comparison is the number of slices, the resource
effectively used, each of which combines sequential (Flip-Flop, FF) and
combinatorial (Look-Up Table, LUT) elements.
Figure 6.3: Sequence detection description resource occupation
The results presented in Figure 6.3 show a clear advantage of the proposed sequence detection
architecture, with regard to resource utilisation. The rather constant number of used resources
is easily explained by the number of sequential elements needed for ROM addressing
purposes, which remains constant for sequences having a length between two consecutive powers
of 2, i.e. $2^n \le l < 2^{n+1}$, where n is the amount of flip-flops and l is the length of the sequence.
The resource usage of the sequence detection machinery is critical, since this mechanism is a
cornerstone in the mapping of most monitoring functions into hardware. Furthermore, each
replicated medium has its own monitoring mechanisms, which implies multiple (parallel) instances
of the sequence detection machinery.
6.2.1 Media Selection Unit
The Media Selection Unit (MSU) is the core component enabling network dependability functions
and mechanisms. It provides: media redundancy management functions; channel and
media monitoring functions; channel and media error confinement functions. This component can
be parametrised at design time with the several significant parameters, e.g. the number of repli-
cated media. Synthesising this unit using conservative values [11] and for a dual-media network
yields the results shown in Table 6.1.
Table 6.1: Media Selection Unit FPGA resource occupation
Media Selection Unit Mechanisms
Device      Flip-Flops   LUTs   Slices (absolute)   Slices (relative, %)
XC3S500E    121          228    148                  3.2
XC3S100E    121          228    148                 15.4
XC3S50      121          228    148                 19.3
The relative slice metric relates the slices used to the total number of slices available in
each FPGA device, thus providing an easy method for assessing the resource usage. The numbers
shown in Table 6.1 allow us to conclude that the MSU fits comfortably even in a small FPGA
(XC3S50), as intended.
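As a worked check of the relative figures in Table 6.1, the following Python sketch divides the absolute slice count by the slice total of each device. The totals (4656, 960 and 768 slices) are an assumption taken from the Xilinx Spartan-3 and Spartan-3E data sheets, since the text above does not state them; the names are illustrative.

```python
# Total slices per device. Assumption: values from the Xilinx Spartan-3 /
# Spartan-3E family data sheets, not from the text above.
DEVICE_SLICES = {"XC3S500E": 4656, "XC3S100E": 960, "XC3S50": 768}


def relative_usage(used_slices: int, device: str) -> float:
    """Percentage of the device's slices occupied, rounded to one decimal."""
    return round(100.0 * used_slices / DEVICE_SLICES[device], 1)


# Absolute slice count reported for the MSU in Table 6.1.
MSU_SLICES = 148
msu_relative = {dev: relative_usage(MSU_SLICES, dev) for dev in DEVICE_SLICES}
```

Under these assumptions `msu_relative` matches the relative column of Table 6.1 for the three devices.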
6.2.2 Inaccessibility Control Unit
The Inaccessibility Control Unit (ICU) is the core component of the timeliness related functions
and mechanisms. It provides the upper layers with a set of extended channel monitoring functions,
together with the assessment of: channel inaccessibility status; inaccessibility event rate;
inaccessibility event duration. The synthesis of these mechanisms with sensible parameters [11]
yields the results presented in Table 6.2.
Table 6.2: Inaccessibility Control Unit FPGA resource occupation
Inaccessibility Control Unit Mechanisms
Device      Flip-Flops   LUTs   Slices (absolute)   Slices (relative, %)
XC3S500E    81           78     73                  1.6
XC3S100E    81           78     73                  7.6
XC3S50      81           78     73                  9.5
The numbers from Table 6.2 allow us to conclude that these mechanisms also fit comfortably
even in a small FPGA. These numbers depend on the parameters configuring the bounds of the
evaluation counters, e.g. Tp_ina. Their variation, however, is not significant, since a larger
bound translates into a few more bits, i.e. a few more slices.
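The remark on counter bounds can be made concrete with a small Python sketch. It assumes plain binary counters, so the width is the number of bits needed to represent the upper bound; the function name and the example bounds are hypothetical and are not CANELy parameters.

```python
def counter_bits(upper_bound: int) -> int:
    """Flip-flops needed by a binary counter covering 0..upper_bound."""
    if upper_bound < 0:
        raise ValueError("upper bound must be non-negative")
    # A counter that only ever holds 0 still needs one storage bit.
    return max(1, upper_bound.bit_length())


def extra_bits(old_bound: int, new_bound: int) -> int:
    """Additional flip-flops required when a bound is enlarged."""
    return counter_bits(new_bound) - counter_bits(old_bound)
```

For instance, enlarging a hypothetical bound from 255 to 1023 costs `extra_bits(255, 1023)` additional flip-flops, i.e. two, which illustrates why the variation in slices remains small.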
6. CANELy Mechanism and Prototype Engineering
6.2.3 Resource usage comparison
The resource usage of the CANELy components must also be compared with that of a standard CAN
controller. The integrated (MSU and ICU) CANELy core is compared against two standard CAN
cores: the free OpenCores CAN [50] and the commercial Xilinx LogiCORE XPS [51]. Both CAN
controllers had equal operating parameters: 64 message-deep FIFO; 3 acceptance filters. The
results are shown in Figure 6.4, w.r.t. the several FPGA resource types.
Figure 6.4: CANELy vs CAN Cores resource usage comparison
A significant part of the MSU flip-flop usage is allocated to recovering and storing the message
identifier, mid. Therefore, the integration of CANELy mechanisms with a CAN controller IP core
might provide interesting results, both by further lowering resource usage through the sharing
of common mechanisms and by providing access to other CAN machinery, such as error counters.
This machinery can be used to design extended quarantine mechanisms.
Lastly, we compare the relative resource usage in a design integrating both the CANELy
components and the CAN controllers. The results are shown in Figure 6.5. This integration
involves neither optimisation nor resource sharing.
[Figure 6.5 data labels: 82.2%, 85.1%; 11.9%, 10.0%]
Figure 6.5: CANELy vs CAN Cores relative slice usage
6.3 CANELy Prototype Board
The last issue pertaining to the engineering of CANELy is the integration of all the fundamental
hardware components that define a CANELy node. These components must be integrated in a
computing platform, obeying the requirements presented in Chapter 3. Such components are:
processing element, for the execution of higher layer services and management functions; CAN
controller, providing the standard CAN layer; FPGA device, providing the support for the
materialisation of the mechanisms discussed in the previous Chapters.
6.3.1 Architecture
Before implementing a design satisfying the CANELy requirements, we must define an architecture.
This architecture is a general view of the system, making the main blocks and their
interconnection explicit. Such a view is presented in Figure 6.6.
[Block diagram labels: uController; Bootloader; FLASH Memory; FPGA; Address Bus; Data Bus;
RS-232; UART; CAN; I/O; CAN Transceivers; SRAM; FLASH Memory; Dual-media CAN bus]
Figure 6.6: CANELy Prototype Board block diagram
The prototype board is composed of: microcontroller, which provides the computation support
for software execution, such as the CANELy protocols and services; FPGA, for the implementation
of the CANELy low-level mechanisms discussed in the previous Chapters; memory, both volatile
(Static Random-Access Memory (SRAM)) and non-volatile (FLASH). Regarding software, the board
provides a bootloader, which is used for diagnostic functions and in-system programming of
user code.
Although the CANELy architecture focuses on serving as the network building block of other
(advanced) computing platforms, it does not preclude a self-contained system executing the
application directly. Examples of such systems include intelligent sensors and actuators, which
usually require a processing element, a networking element and I/O capabilities, all in the
same computing platform.
6.3.2 Prototype Implementation
The materialisation of the architecture is the next step, achieved by selecting and
interconnecting the several hardware components. The final result is the prototype board shown
in Figure 6.7.
[Figure labels: FPGA; Microcontroller; Dual-MAC; Reliable Comm. Protocol Suite; Layer
Management; Dual-CAN (optional); Management Interface; CANELy Functions; Control of
Inaccessibility; CAN Monitoring; AND-based Media Redundancy; Channel Interface; Media
Interfaces; Cable Connectors]
Figure 6.7: CANELy Prototype Board
The current CANELy node prototype is composed of the basic elements necessary for its
materialisation: Maxim/Dallas DS80C390 microcontroller, having an optimised 8051 core and two
CAN 2.0B controllers with 15 message centres each; Xilinx Spartan-3E XC3S500E FPGA, with an
equivalent capacity of 500k logic gates; Maxim MAX13050 standard CAN transceivers. The interface
between the low-level mechanisms embedded in the FPGA and the microcontroller is done through
memory-mapped I/O, exploiting the parallel data and address buses.
The current FPGA device still has resources for a second set of MSU and ICU components (see
Tables 6.1 and 6.2), enabling a dual-CAN channel/quad-media solution and thus providing even
higher dependability and timeliness guarantees, due to the fully space-redundant architecture.
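A back-of-the-envelope check of this headroom can be sketched in Python from the slice counts in Tables 6.1 and 6.2. The sketch assumes that slice usage scales linearly with the number of CAN channels, with no sharing and no glue logic, and that the XC3S500E offers 4656 slices (a data-sheet figure not stated in the text). The CAN controller cores are deliberately left out, since their absolute slice counts are not given here.

```python
XC3S500E_SLICES = 4656  # assumption: Xilinx Spartan-3E data sheet figure
MSU_SLICES = 148        # Table 6.1
ICU_SLICES = 73         # Table 6.2


def canely_slices(channels: int) -> int:
    """Slices for `channels` MSU+ICU pairs, assuming purely linear scaling."""
    return channels * (MSU_SLICES + ICU_SLICES)


def occupation_percent(channels: int) -> float:
    """Share of the XC3S500E taken by the CANELy mechanisms alone."""
    return round(100.0 * canely_slices(channels) / XC3S500E_SLICES, 1)
```

Under these assumptions a dual-channel configuration needs `canely_slices(2)` slices, which remains a small fraction of the device.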
6.4 Summary
The engineering of the CAN Enhanced Layer (CANELy) low-level mechanisms demands high resource
utilisation efficiency, in order to provide a low-cost solution. Furthermore, these mechanisms
are only a part of the CANELy architecture, which is materialised by a CANELy node.
This Chapter reported the materialisation of the proposed CANELy mechanisms, showing how they
can be made effective, resource-wise. This opens room for the integration of the CANELy
low-level components in both currently deployed and newly designed Controller Area Network
(CAN)-based real-time applications, through the addition of a low-cost FPGA device.
7. Conclusions and Future Work
Every new beginning comes from some other beginning’s end.
SENECA
The Controller Area Network (CAN) fieldbus is a widely deployed technology, being used in
domains as diverse as automotive, home automation and robotics. The CAN Enhanced Layer
(CANELy) architecture was designed around the standard CAN layer, in order to enhance it and
attain high levels of dependability.
The construction of highly-dependable architectures is duly justified by the characteristics
standard CAN already possesses, both w.r.t. bus operation and physical aspects, such as cabling
and transceivers. These are the very same characteristics that make the CAN bus desirable to
new domains, such as aerospace, deep sea oil-drilling, or even more common applications such
as trash collecting trucks. The common denominator in all these applications is the need for
dependable service, even in the presence of disturbances.
This work discussed the mapping of the CAN Enhanced Layer (CANELy) architecture’s low-level
mechanisms into hardware. These low-level mechanisms pertain essentially to the dependability
of the network, in both the spatial and temporal domains, performing bus monitoring functions
for error detection and confinement.
The main contribution of this work is an area-effective description of the supporting
foundations of the CANELy architecture, thus opening room for a complete functional CANELy
node, implementing all the layers envisaged by the architecture. Another useful result stemming
from this work is the analysis of FPGA-effective mechanisms for “on-line” processing of
bit-serial protocols. The final result was a proposed sequence detection strategy that, for
this specific type of sequences, outperformed conventional methods.
The CANELy architecture mechanisms can be effectively mapped onto cost-effective PLD devices,
such as FPGAs, and therefore enhance currently deployed CAN applications at a low cost.
Furthermore, FPGA devices also add a new dimension to the applications, stemming from the
dependability attributes: maintainability. Unlike ASIC components, reconfigurable logic devices
allow functions to be extended, thus providing the necessary support for the inclusion of new
mechanisms.
A medium exhibiting more omission errors than those allowed by its omission degree must be
declared failed. After this procedure, however, one question still remains: how and when should
it be reactivated? The provision of medium quarantine techniques must aim at answering this
question, through both models and monitoring mechanisms. An essential work basis for such a
service is a (stochastic) model characterising the errors affecting a CAN network. The CANELy
architecture would benefit greatly from such a service, which would enrich it with attributes
related to robustness and adaptability. These attributes are in demand by applications where
maintenance and repair actions cannot be carried out by a human agent, at least in a timely
fashion, e.g. manned and unmanned spaceflight, deep-space probing missions or, to a lesser
extent, more common unmanned or remotely operated applications, such as Unmanned Aerial Vehicles
(UAVs) and Remotely Operated Vehicles (ROVs).
Another aspect of fault-tolerant communication is a bus guardian service to prevent babbling
idiot faults, where a node communicates in an arbitrary fashion, either due to failed circuitry,
a drifting clock signal or even a malicious application. The theoretical foundations of such a
service in CAN are already laid [52], with an analysis of how this service could be provided.
Given the available resources in the CANELy architecture, a (quasi-)independent bus guardian
can be provided with partitioned FPGA machinery and an external clock signal source to avoid
common-mode failures, such as a failed clock device or an unperceived drifting clock signal,
which would mislead the bus guardian because the relative signals remain within limits. These
issues should be investigated further.
Another research direction involves the explicit modification of the CAN standard. This approach
has only recently been considered, due to the availability of low-cost FPGA devices having
enough resources to implement multiple standard-compliant CAN controllers, and at the same time
the availability of CAN IP cores [53, 50]. The CANELy architecture would benefit from this
integration, since it could reuse machinery already provided by the CAN controller, such as
network bit synchronisation, bit-destuffing and message identifier recovery, and at the same
time access the CAN protocol FSM, thus rendering some monitoring functions more effective.
Bibliography
[1] G. C. Buttazzo, Hard Real-time Computing Systems: Predictable Scheduling Algorithms And
Applications (Real-Time Systems Series). Santa Clara, CA, USA: Springer-Verlag TELOS,
2004.
[2] M. Pignol, “COTS-based applications in space avionics,” in Design, Automation Test in
Europe Conference Exhibition (DATE 2010), 8-12 2010, pp. 1213–1219.
[3] J. Alford, L.D., “The problem with aviation COTS,” IEEE Aerospace and Electronic Systems
Magazine, vol. 16, no. 2, pp. 33–37, Feb. 2001.
[4] R. Black and M. Fletcher, “Open systems architecture - both boon and bane,” in Proceedings
of the 25th IEEE/AIAA Digital Avionics Systems Conference (DASC’06), Oct. 2006, pp. 1–7.
[5] CAN Specification Version 2.0, Robert Bosch GmbH, Sep. 1991.
[6] International Standard 11898 - Road vehicles - Controller Area Network (CAN) Part 1: Data
link layer and physical signalling, ISO Std., Dec. 2003.
[7] General Standardization of CAN (Controller Area Network) for Airborne Use, Airlines
Electronic Engineering Committee (AEEC) Std. ARINC Specification 825-1, May 2010.
[8] “ECSS draft standard ECSS-E-ST-50-15C - recommendations for CAN bus in spacecraft
on-board applications,” ECSS Draft, European Cooperation for Space Standardization (ECSS),
May 2005.
[9] P. W. Fortescue, J. P. W. Stark, and G. Swinerd, Eds., Spacecraft Systems Engineering,
3rd ed. Wiley, 2003.
[10] Alcatel Alenia Space, “AURORA avionics architecture,” Alcatel Alenia Space, Tech. Rep.,
2005.
[11] J. Rufino, “Computational system for real-time distributed control,” Ph.D. dissertation,
Technical University of Lisbon - Instituto Superior Técnico, Lisboa, Portugal, Jul. 2002.
[Online]. Available: http://dario.di.fc.ul.pt/downloads/PhD-THESIS.pdf
[12] “Spartan-3E FPGA family data sheet,” Xilinx Inc., Aug. 2009. [Online]. Available:
http://www.xilinx.com/support/documentation/data_sheets/ds312.pdf
[13] D. Flynn, “AMBA: enabling reusable on-chip designs,” IEEE Micro, vol. 17, no. 4, pp. 20–27,
Jul./Aug. 1997.
[14] GRLIB IP Library User’s Manual, Aeroflex Gaisler AB. [Online]. Available:
http://www.gaisler.com/
[15] J. Rufino, R. Pinto, and C. Almeida, “A FPGA-based solution for enforcing dependability and
timeliness in CAN,” in Proceedings of the 2007 IP Based Electronic System (IP’07), Grenoble,
France, Dec. 2007.
[16] ——, “FPGA-based engineering of bus media redundancy in CAN,” in Proceedings of the
12th International CAN Conference (iCC’08), Barcelona, Spain, Mar. 2008.
[17] R. Pinto, J. Rufino, and C. Almeida, “CANELy prototype board schematic specification,”
FCUL/IST, Tech. Rep. DARIO RT-05-04, Dec. 2005.
[18] ——, “Specification and engineering of the CANELy prototype board,” FCUL/IST, Tech. Rep.
DARIO RT-06-06, Oct. 2006.
[19] J. Rufino, R. Pinto, and C. Almeida, “How to enforce dependability and timeliness in
CANELy?” FCUL/IST, Tech. Rep. DARIO RT-07-02, Jul. 2007.
[20] P. Verissimo and L. Rodrigues, Distributed Systems for System Architects. Norwell, MA,
USA: Kluwer Academic Publishers, 2001.
[21] H. Kopetz, A. Ademaj, P. Grillinger, and K. Steinhammer, “The time-triggered ethernet
(TTE) design,” in Proceedings of the 8th IEEE International Symposium on Object-Oriented
Real-Time Distributed Computing (ISORC’05). Washington, DC, USA: IEEE Computer Society,
2005, pp. 22–33.
[22] IEEE Standard for Information Technology–Telecommunications and Information Exchange
Between Systems–Local and Metropolitan Area Networks–Specific Requirements Part 3:
Carrier Sense Multiple Access With Collision Detection (CSMA/CD) Access Method and
Physical Layer Specifications - Section One, IEEE Std. 802.3-2008, Dec. 2008.
[23] S. Parkes and P. Armbruster, “SpaceWire: Spacecraft onboard data-handling network,” Acta
Astronautica, vol. 66, no. 1-2, pp. 88–95, 2010.
[24] IEEE Standard for Heterogeneous Interconnect (HIC) (Low-Cost, Low-Latency Scalable
Serial Interconnect for Parallel System Construction), IEEE Std. 1355-1995, Sep. 1995.
[25] Space engineering: SpaceWire — Links, nodes, routers and networks, ECSS Std.
ECSS-E-ST-50-12C, Jul. 2008.
[26] M. D. May, P. W. Thompson, and P. H. Welch, Eds., Networks, Routers and Transputers:
Function, Performance and Applications. Amsterdam, The Netherlands: IOS Press, 1993.
[Online]. Available: http://wotug.ukc.ac.uk/parallel/www/nrat.html
[27] A. Woodroffe and P. Madle, “Application and experience of CAN as a low cost OBDH bus
system,” in Proceedings of the 2004 Data Systems In Aerospace Conference (DASIA’04),
Aug. 2004.
[28] F. Tortosa López, P. Roos, L. Stagnaro, C. Plummer, and B. Storni, “The CAN bus in
spacecraft on board applications,” in Proceedings of the 2004 Data Systems In Aerospace
Conference (DASIA’04), Aug. 2004.
[29] F. Tortosa López, G. Furano, A. J. Winton, M. Montagna, M. Caramia, B. Dean, and
M. Bhana, “CAN bus on ExoMars,” in Proceedings of the 2009 Data Systems In Aerospace
Conference (DASIA’09), Istanbul, Turkey, May 2009.
[30] H. Hilmer, H.-D. Kochs, and E. Dittmar, “A fault-tolerant communication architecture for
real-time control systems,” in Proceedings of the IEEE International Workshop on Factory
Communication Systems (WFCS’97), Oct. 1997, pp. 111–118.
[31] L.-B. Fredriksson, “CAN for critical embedded automotive networks,” IEEE Micro, vol. 22,
no. 4, pp. 28–35, 2002.
[32] H. Sivencrona, T. Olsson, R. Johansson, and J. Torin, “RedCAN: Simulations of two fault
recovery algorithms for CAN,” in Proceedings of the 10th IEEE Pacific Rim International
Symposium on Dependable Computing (PRDC’04). Washington, DC, USA: IEEE Computer
Society, 2004, pp. 302–311.
[33] J. R. Pimentel and J. A. Fonseca, “FlexCAN: A flexible architecture for highly dependable
embedded applications,” in Proceedings of the 3rd International Workshop on Real-Time
Networks (RTN 2004), Catania, Italy, Jul. 2004.
[34] J. R. Pimentel and J. Kaniarz, “A CAN-based application level error detection and fault
containment protocol,” in Proceedings of the 11th IFAC Symposium on Information Control
Problems in Manufacturing (INCOM’04), Salvador, Brazil, Apr. 2004.
[35] D. Powell, D. Seaton, D. Bonn, P. Veríssimo, and F. Waeselynck, “The Delta-4 approach
to dependability in open distributed computing systems,” in Proceedings of the 18th IEEE
International Symposium on Fault-Tolerant Computing (FTCS-18), Jun. 1988, pp. 246–251.
[36] J. Rufino, P. Verissimo, G. Arroz, C. Almeida, and L. Rodrigues, “Fault-tolerant broadcasts
in CAN,” in Digest of Papers of the 28th Annual International Symposium on Fault-Tolerant
Computing (FTCS’98), 23-25 1998, pp. 150–159.
[37] L. Lamport, “Time, clocks, and the ordering of events in a distributed system,” Commun.
ACM, vol. 21, no. 7, pp. 558–565, 1978.
[38] L. Rodrigues, M. Guimarães, and J. Rufino, “Fault-tolerant clock synchronization in CAN,” in
Proceedings of the 19th Real-Time Systems Symposium (RTSS’98). Madrid, Spain: IEEE,
Dec. 1998, pp. 420–429.
[39] “DS80C390 dual-CAN high-speed microprocessor,” Maxim/Dallas Semiconductors, Nov.
2005. [Online]. Available: http://datasheets.maxim-ic.com/en/ds/DS80C390.pdf
[40] “Stellaris LM3S2965 microcontroller,” Texas Instruments Incorporated, Sep. 2010.
[41] “MAX13050 industry-standard high-speed CAN transceiver,” Maxim Integrated Products,
Feb. 2005. [Online]. Available: http://datasheets.maxim-ic.com/en/ds/MAX13050-MAX13054.pdf
[42] “MCP2551 high-speed CAN transceiver,” Microchip Technology Inc., 2003. [Online].
Available: http://ww1.microchip.com/downloads/en/DeviceDoc/21667f.pdf
[43] CiA Draft Standard 102 - CAN physical layer specification for industrial applications, CAN in
Automation, Feb. 2010.
[44] CiA Draft Standard 303, Part 1 - Cabling and connector pin assignment, CAN in Automation,
Dec. 2009.
[45] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, “Basic concepts and taxonomy of
dependable and secure computing,” IEEE Transactions on Dependable and Secure Computing,
vol. 1, no. 6, pp. 11–33, Jan.-Mar. 2004.
[46] P. J. Ashenden, The Designer’s Guide to VHDL, 3rd ed. San Francisco, CA, USA: Morgan
Kaufmann Publishers Inc., 2008.
[47] J. Rufino, P. Veríssimo, G. Arroz, and C. Almeida, “Control of inaccessibility in CANELy,”
in Proceedings of the 6th. International Workshop on Factory Communication Systems
(WFCS’06). Torino, Italy: IEEE, Jun. 2006, pp. 35–44.
[48] P. Verissimo and J. Marques, “Reliable broadcast for fault-tolerance on local computer
networks,” in Proceedings of the 9th Symposium on Reliable Distributed Systems (SRDS’90),
Oct. 1990, pp. 54–63.
[49] J. Rufino, P. Veríssimo, C. Almeida, and G. Arroz, “Integrating inaccessibility control and
timer management in CANELy,” in Proceedings of the 11th IEEE International Conference
on Emerging Technologies and Factory Automation (ETFA 2006). Prague, Czech Republic:
IEEE, Sep. 2006, pp. 348–355.
[50] I. Mohor, CAN Protocol Controller IP core, OpenCores, Nov. 2004. [Online]. Available:
http://opencores.org/project,can
[51] “LogiCORE IP XPS Controller Area Network (CAN),” Xilinx Inc., Jul. 2010. [Online].
Available: http://www.xilinx.com/support/documentation/ip_documentation/xps_can.pdf
[52] I. Broster and A. Burns, “An analysable bus-guardian for event-triggered communication,”
in Proceedings of the 24th IEEE International Real-Time Systems Symposium (RTSS’03),
Cancun, Mexico, Dec. 2003, pp. 410–419.
[53] HurriCANe - Controller Area Network IP core User’s Manual, European Space Agency, Sep.
2007. [Online]. Available: http://microelectronics.esa.int/core/ipdoc/can524_user_manual.
A. VHDL Snippets
A.1 Sequence detection machinery and mapped sequences
Listing A.1: Sequence detector VHDL instantiation template
entity sequence_detector is
  generic (
    sequence : std_ulogic_vector := "101111111010101");  -- sequence to be detected

  port (
    sys_clk     : in  std_ulogic;   -- Clock
    can_clk_en  : in  std_ulogic;   -- Bit-sampling enable, to sync. with CAN network
    rst_N       : in  std_ulogic;   -- Reset, active low
    data        : in  std_ulogic;   -- Data input
    Sequence_Ok : out std_ulogic);  -- Sequence Ok
end sequence_detector;
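The architectures behind this entity are not listed in this Appendix, so the following Python class is only a behavioural sketch of what a shift-register style detector is expected to do: it keeps the last len(sequence) sampled bits and reports a match when they equal the configured pattern. The class and method names are illustrative, the bit ordering (oldest bit first) is an assumption, and each call to `sample` stands for one enabled bit time (can_clk_en = '1').

```python
from collections import deque


class SequenceDetector:
    """Behavioural model: output is 1 when the last N sampled bits match."""

    def __init__(self, sequence: str = "101111111010101"):
        self.sequence = sequence
        # The window plays the role of an N-bit shift register.
        self.window = deque(maxlen=len(sequence))

    def reset(self) -> None:
        """Model of the (active-low) reset: forget all sampled bits."""
        self.window.clear()

    def sample(self, bit: int) -> int:
        """Shift in one bit and return the modelled Sequence_Ok value."""
        self.window.append("1" if bit else "0")
        return int("".join(self.window) == self.sequence)
```

Feeding the bit string 0010110111 into `SequenceDetector("1011")` raises the output exactly at the two positions where the last four bits read 1011.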
Listing A.2: ChFok signal specification
-----------------------------------------------------------------------------
-- Frame OK Sequence Detection
instSequenceDetectorFOk : entity work.sequence_detector(SHIFT_REGISTER)
  generic map (
    sequence => seq_fok)
  port map (
    sys_clk     => sys_clk,
    rst_N       => rst_N,
    can_clk_en  => can_clk_en,
    data        => rx,
    Sequence_Ok => fok_int);
-----------------------------------------------------------------------------
-- fok signal assertion
-----------------------------------------------------------------------------
-- purpose: Assert the "Frame Ok" signal
-- type   : sequential
-- inputs : sys_clk, rst_N, can_clk_en, fok_int
-- outputs: fok
pFok : process (sys_clk) is
begin  -- process pFok
  if rising_edge(sys_clk) then          -- rising clock edge
    if rst_N = '0' then                 -- synchronous reset (active low)
      fok <= '0';
    else
      if can_clk_en = '1' then          -- Sync with bit-times
        if fok_int = '1' then
          fok <= '1';
        elsif eot_int = '1' then
          fok <= '0';
        end if;
      end if;
    end if;
  end if;
end process pFok;
Listing A.3: ChErr signal specification
-----------------------------------------------------------------------------
-- Error Frame Sequence Detection
instSequenceDetectorErr : entity work.sequence_detector(ROM_MEM)
  generic map (
    sequence => seq_err)
  port map (
    sys_clk     => sys_clk,
    rst_N       => rst_N,
    can_clk_en  => can_clk_en,
    data        => rx,
    Sequence_Ok => err_int);
-----------------------------------------------------------------------------
-- err signal assertion
-----------------------------------------------------------------------------
-- purpose: Assert an error signal detection
-- type   : sequential
-- inputs : sys_clk, rst_N, can_clk_en, seq_err, eot_int
-- outputs: err
pError : process (sys_clk) is
begin  -- process pError
  if rising_edge(sys_clk) then          -- rising clock edge
    if rst_N = '0' then                 -- synchronous reset (active low)
      err <= '0';
    else
      if can_clk_en = '1' then          -- Sync with bit-times
        if eot_int = '1' then
          err <= '0';
        elsif err_seq = '1' then
          err <= '1';
        end if;
      end if;
    end if;
  end if;
end process pError;
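Listings A.2 and A.3 both describe a set/clear latch, but with opposite priorities: in pFok the detector output takes precedence over the end-of-transmission indication, whereas in pError the end-of-transmission indication wins. The Python functions below are a minimal sketch of the next-state logic for one enabled bit time (rst_N = '1', can_clk_en = '1'); the function names are illustrative, and the argument names follow the signals in the listings.

```python
def fok_next(fok: int, fok_int: int, eot_int: int) -> int:
    """Next value of fok: set on fok_int, otherwise cleared on eot_int."""
    if fok_int:
        return 1
    if eot_int:
        return 0
    return fok  # hold


def err_next(err: int, err_seq: int, eot_int: int) -> int:
    """Next value of err: cleared on eot_int, otherwise set on err_seq."""
    if eot_int:
        return 0
    if err_seq:
        return 1
    return err  # hold
```

The difference only shows when both inputs are active in the same bit time: `fok_next(0, 1, 1)` yields 1, while `err_next(0, 1, 1)` yields 0.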
A.2 Omission Monitoring and Control
Listing A.4: Medium Omission Fault Detection
-----------------------------------------------------------------------------
-- Mismatch Vector
-----------------------------------------------------------------------------
genMisVector : for m in 1 to NumberMedia generate
  MismatchVector(m) <= MediaStatusRD(m).fm;
end generate genMisVector;

-----------------------------------------------------------------------------
-- MFm-s
-----------------------------------------------------------------------------
mismatch <= '0' when MismatchVector = (MismatchVector'range => '0') else '1';

-----------------------------------------------------------------------------
-- Omission error
-----------------------------------------------------------------------------
-- purpose: flag each medium that mismatched while the channel frame was OK
-- type   : combinational
-- inputs : ChFok, MismatchVector
-- outputs: MoErrVector
procMOerr : process (ChFok, MismatchVector) is
begin  -- process procMOerr
  for m in 1 to NumberMedia loop
    MoErrVector(m) <= ChFok and MismatchVector(m);
  end loop;  -- m
end process procMOerr;

-----------------------------------------------------------------------------
-- Omission at channel
-----------------------------------------------------------------------------
-- purpose: flag each medium that did not mismatch while the channel reported
--          an error and some other medium mismatched
-- type   : combinational
-- inputs : ChError, mismatch, MismatchVector
-- outputs: MOChVector
procMOch : process (ChError, mismatch, MismatchVector) is
begin  -- process procMOch
  for m in 1 to NumberMedia loop
    MOChVector(m) <= ChError and mismatch and (not MismatchVector(m));
  end loop;  -- m
end process procMOch;

-----------------------------------------------------------------------------
-- Omission at undetermined
-----------------------------------------------------------------------------
-- purpose: flag each medium on a channel error when either no medium
--          mismatched or this medium itself mismatched
-- type   : combinational
-- inputs : ChError, mismatch, MismatchVector
-- outputs: MUerrVector
procMuerr : process (ChError, mismatch, MismatchVector) is
begin  -- process procMuerr
  for m in 1 to NumberMedia loop
    MUerrVector(m) <= ChError and ((not mismatch) or MismatchVector(m));
  end loop;  -- m
end process procMuerr;
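The three combinational equations of Listing A.4 can be exercised on concrete inputs with the following Python sketch. It mirrors the Boolean expressions of the listing one-to-one; the function name `classify` and the use of 0/1 integers in plain lists (index 0 standing for medium 1) are illustrative choices, not part of the VHDL design.

```python
def classify(ch_fok: int, ch_error: int, mismatch_vector: list):
    """Return (MoErrVector, MOChVector, MUerrVector) for 0/1 inputs."""
    # mismatch is 1 when at least one medium reports a frame mismatch.
    mismatch = int(any(mismatch_vector))
    mo_err = [ch_fok & m for m in mismatch_vector]
    mo_ch = [ch_error & mismatch & (1 - m) for m in mismatch_vector]
    mu_err = [ch_error & ((1 - mismatch) | m) for m in mismatch_vector]
    return mo_err, mo_ch, mu_err
```

For a dual-media network, `classify(0, 1, [0, 1])` marks the first medium in the "omission at channel" vector and the second one in the "undetermined" vector, whereas `classify(0, 1, [0, 0])` marks both media as undetermined.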
Listing A.5: Medium Omission Fault Accounting
-----------------------------------------------------------------------------
-- Omission degree registration
-----------------------------------------------------------------------------
-- Upon the End-of-transmission, we must check which media exhibited omission
-- faults. According to the type of behaviour,
-----------------------------------------------------------------------------
procODRegistration : process (sys_clk) is
begin  -- process procODRegistration
  if rising_edge(sys_clk) then          -- rising clock edge
    if rst_N = '0' then                 -- synchronous reset (active low)
      km_ovf <= (others => '0');
      for m in 1 to NumberMedia loop
        od_reg_int(m) <= 0;
      end loop;  -- m
    else
      if can_clk_en = '1' then          -- Bit-sample time
        -- Detect ChEOT rising edge
        if ChEOT = '1' and EOT_assert = '0' then
          EOT_assert <= '1';
          -- Execute the action
          for m in 1 to NumberMedia loop
            case odActions(m) is
              -- INCREMENT
              when INCREMENT =>
                if od_reg_int(m) = od_parameter then
                  km_ovf(m) <= '1';     -- OD exceed (k_m overflow)
                else
                  od_reg_int(m) <= od_reg_int(m) + 1;
                end if;
              -- RESET
              when RESET =>
                od_reg_int(m) <= 0;
              -- UNKNOWN, REPORT IT! Only valid in simulation
              -- pragma synthesis_off
              when UNKNOWN =>
                report "UNKNOWN condition in Omission Degree processing" severity ERROR;
              -- pragma synthesis_on
              -- MAINTAIN
              when others => null;
            end case;
          end loop;  -- m
        elsif EOT_assert = '1' and ChEOT = '0' then
          EOT_assert <= '0';
        end if;
      end if;
    end if;
  end if;
end process procODRegistration;
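The accounting performed at each end-of-transmission in Listing A.5 can be followed step by step with a small behavioural model in Python. This is a sketch under simplifying assumptions: one call to `end_of_transmission` stands for one detected rising edge of ChEOT, the actions are plain strings rather than the VHDL enumeration, and the class name is illustrative.

```python
class OmissionDegreeRegister:
    """Per-medium omission degree counters with overflow flags."""

    def __init__(self, num_media: int, od_parameter: int):
        self.od_parameter = od_parameter
        self.count = [0] * num_media        # models od_reg_int
        self.km_ovf = [False] * num_media   # models km_ovf

    def end_of_transmission(self, actions) -> None:
        """Apply one action per medium, as done on a ChEOT rising edge."""
        for m, action in enumerate(actions):
            if action == "INCREMENT":
                if self.count[m] == self.od_parameter:
                    self.km_ovf[m] = True   # omission degree exceeded
                else:
                    self.count[m] += 1
            elif action == "RESET":
                self.count[m] = 0
            # any other action: maintain the current value
```

As in the listing, a RESET action clears the counter but leaves the overflow flag untouched; the flag is only cleared by the synchronous reset, which this model does not include.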
A.3 Inaccessibility Monitoring and Evaluation
Extended Channel Monitoring
Listing A.6: Channel Start-of-Frame Detection
-----------------------------------------------------------------------------
-- Start of Frame
-----------------------------------------------------------------------------
-- purpose: Detect the Start Of Frame
-- type   : sequential
-- inputs : sys_clk, rst_N, ChRX, ChEOT
-- outputs: ChSOF
procSOF : process (sys_clk) is
begin  -- process procSOF
  if rising_edge(sys_clk) then          -- rising clock edge
    if rst_N = '0' then                 -- synchronous reset (active low)
      ChSOF <= '0';
    else
      if can_clk_en = '1' then          -- RX Bit sampling
        if ChEOT = '1' and ChRX = '0' then
          ChSOF <= '1';
        end if;
        if ChSOF = '1' then             -- Deassert the signal
          ChSOF <= '0';
        end if;
      end if;
    end if;
  end if;
end process procSOF;
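Because the last signal assignment executed in a VHDL process is the one that takes effect, the second if statement in Listing A.6 turns ChSOF into a pulse lasting a single bit time. The Python sketch below models that next-state behaviour for enabled bit times only (rst_N = '1', can_clk_en = '1'); the function names are illustrative.

```python
def sof_next(ch_sof: int, ch_eot: int, ch_rx: int) -> int:
    """Next value of ChSOF for one enabled bit time."""
    if ch_sof:
        return 0            # deassertion overrides any new assertion
    if ch_eot and not ch_rx:
        return 1            # dominant bit seen after end of transmission
    return ch_sof


def simulate_sof(trace):
    """Run sof_next over a list of (ChEOT, ChRX) pairs, starting from 0."""
    sof, history = 0, []
    for ch_eot, ch_rx in trace:
        sof = sof_next(sof, ch_eot, ch_rx)
        history.append(sof)
    return history
```

With the input trace `[(1, 1), (1, 0), (0, 0), (0, 1)]` the modelled ChSOF is asserted only in the second bit time.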
Listing A.7: Channel Frame Correct Boundary
ChFc <= '1' when (ChSOF = '1' or ChTok = '1' or ChIFS = '1') else '0';
Inaccessibility Parameter Evaluation
Listing A.8: Inaccessibility Event Counters
-- purpose: Assess the inaccessibility event counts
-- type   : sequential
-- inputs : clk, rst_N, ChErr_l, ChEOT, ChFok_l, ChIEClr, ChOEClr
-- outputs: ChOE, ChIE, ChOD
procCounter : process (sys_clk) is
  variable ChEOT_s : std_ulogic_vector(1 downto 0) := (others => '0');
begin  -- process procCounter
  if rising_edge(sys_clk) then          -- rising clock edge
    if rst_N = '0' then                 -- synchronous reset (active low)
      ChEOT_s := (others => '0');
      ChIE    <= (others => '0');
      ChOE    <= (others => '0');
      ChOD    <= (others => '0');
    else
      -- CAN Clock
      if can_clk_en = '1' then
        -- ChEOT "sampling"
        ChEOT_s := ChEOT_s(0) & ChEOT;  -- Shift values left, input ChEOT
        -- Has ChEOT just been asserted?
        if ChEOT_s(1) = '0' and ChEOT_s(0) = '1' then
          -- ChErr_l active?
          if ChErr_l = '1' then
            ChIE <= ChIE + 1;           -- Total Inacc. Event Count
            if ChFok_l = '0' then
              ChOE <= ChOE + 1;         -- Omission Error Count
              ChOD <= ChOD + 1;         -- Ch OD increment
            end if;
          end if;
          -- Channel Omission Degree Clear
          if ChFok_l = '1' then
            ChOD <= (others => '0');
          end if;
        end if;
        -- Mgmt Req, clear
        if ChOEClr = '1' then
          ChOE <= (others => '0');
        end if;
        if ChIEClr = '1' then
          ChIE <= (others => '0');
        end if;
      end if;
    end if;
  end if;
end process procCounter;
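The counting rules of Listing A.8 can be traced with the following behavioural model in Python. It is a sketch under stated assumptions: each call to `bit_time` stands for one enabled bit time (can_clk_en = '1'), the counters are unbounded Python integers rather than fixed-width vectors (so wrap-around is not modelled), and the class and argument names are illustrative.

```python
class InaccessibilityCounters:
    """Models ChIE, ChOE and ChOD as updated on each ChEOT rising edge."""

    def __init__(self):
        self.eot_prev = 0
        self.ch_ie = 0   # total inaccessibility events
        self.ch_oe = 0   # omission errors
        self.ch_od = 0   # channel omission degree

    def bit_time(self, ch_eot, ch_err_l, ch_fok_l, ie_clr=0, oe_clr=0):
        rising = self.eot_prev == 0 and ch_eot == 1
        self.eot_prev = ch_eot
        ie, oe, od = self.ch_ie, self.ch_oe, self.ch_od
        if rising:
            if ch_err_l:
                ie += 1
                if not ch_fok_l:
                    oe += 1
                    od += 1
            if ch_fok_l:
                od = 0       # a correct frame clears the omission degree
        # Management clear requests override a same-cycle increment,
        # as the last signal assignment does in the VHDL process.
        if oe_clr:
            oe = 0
        if ie_clr:
            ie = 0
        self.ch_ie, self.ch_oe, self.ch_od = ie, oe, od
```

Two erroneous transmissions followed by a correct one leave ChIE and ChOE at 2 and bring ChOD back to 0.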
B. Mechanism Design Verification
A fundamental stage in the flow of digital hardware design is verification, assessing the
correctness of the designed component(s). This verification can be done by simulation, where one
or more sets of stimuli are applied to the inputs of the component being tested, and the
corresponding set of output signals is observed and compared with the expected behaviour, defined
by the component’s functional specifications.
A requirement for meaningful and successful testing is the sensible definition of the input
stimuli, in order to attain adequate test coverage and consequently adequate characterisation of
the component’s behaviour. In the context of CAN bus operation, a sensible definition of the input
stimuli translates into having a representation of the CAN channel.
This Appendix describes the approach taken w.r.t. simulation, giving particular emphasis to the
generation of simulation data, with the purpose of emulating a CAN bus channel. This emulation
is achieved by having a trace of exchanged data and remote frames with several (different)
message identifiers, together with other types of events that may occur during normal¹
operation, such as: CAN error and overload frames; single- and common-mode errors affecting
the (replicated) media. Lastly, a set of simulation fragments is analysed.
B.1 Approach to Component Design Simulation
A possible approach to simulate a digital design involves the specification of a test-bench
design, which includes the component to be tested and a set of input stimuli relevant to that
component. The simulation task might become iterative, i.e. if the designed component fails in
the simulation by not behaving as expected in the presence of the defined input stimuli, it must be
corrected and tested again. Therefore the sensible design of the input stimuli is paramount.
As the complexity of the components grows, so does the need for more sophisticated input
stimuli. For components processing the CAN protocol, this means that a set of
input stimuli is composed of one or more specially formed CAN frames, each several bits
long. The manual method for providing input stimuli, however, does not scale. Therefore a
¹ The meaning of normal is context dependent: a lightly loaded network with sporadic EMI events vs. highly loaded network with a high rate of errors due to EMI phenomena.
different approach is needed, one that contemplates the exchange of standard-compliant CAN messages,
thus mimicking the operation of a standard CAN network, including the impairments to its operation,
i.e. errors. Therefore, a simulated CAN communication channel must be provided.
B.2 CAN Channel Simulation
The test of high-level CANELy components requires a set of properly simulated CAN
messages. These must be conveyed by the CAN channel, which in turn can be distributed over a set
of replicated media. Such specifications provide the support for the design of tests exercising the
CANELy machinery. Such tests comprise properly crafted CAN messages, along with
errors affecting both the channel and any of the replicated media.
In order to satisfy these requirements, an approach was adopted that involves both the
(semi-)automatic generation of test data and the presentation of that same data to the component
being tested. The flow of this process is depicted in Figure B.1.
[Diagram labels: Message Definition; CAN Bit Stream Generator; Bus Data Gen.; File; Bus Media Component; Unit Under Test; Simulator]
Figure B.1: Bus Media simulation data flow
After being defined at a high level of abstraction, the CAN message exchange is converted
into a suitable representation and written to a text file in a defined format. This file is then used by a
custom component, which interfaces with the component being tested - the unit-under-test - and
provides it with the (simulated) CAN channel data, through one or more media.
The usage of an external description of the input stimuli has several advantages, ranging
from not requiring test-bench recompilation when a different set of stimuli is defined, to simulation
automation, together with VHDL’s reporting facilities. Such advantages allow a faster simulation
process and wider coverage, thus leading to a higher quality test.
Simulation Data Generation
In order to generate the simulated CAN bus operation data, a software programme was coded.
The Python language was chosen, mainly because it offers an object-oriented paradigm and dynamic
data typing, and because it is an excellent prototyping language with a rich set of modules
providing complex functions, e.g. random number generation and data format conversions.
This programme was coded using an object-oriented approach, where the CAN frames are
objects that can be manipulated by appropriate methods invoked upon them, e.g. set payload
content. It supports generating both CAN 2.0A and CAN 2.0B frames, data and remote. It can
also produce error and overload frames, with the restriction of their generation being manually
defined. An extension of this programme into a fully-fledged discrete event simulator is envisaged.
The frames are configured at the programme level by both their identifier and their payload. Since
the programme is written in Python, the frame generation can be changed easily, without
any recompilation step. Upon creation, a frame defaults to a remote frame,
i.e. zero payload. It is transformed into a data frame by setting its payload to
non-null content.
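The frame behaviour described above can be sketched as follows. This is a minimal illustration, not the programme used in this work: the class name `CanFrame`, its fields and the `set_payload` method are hypothetical names chosen for the example, and the remote/data distinction simply follows the convention stated in the text (an empty payload means a remote frame).

```python
from dataclasses import dataclass


@dataclass
class CanFrame:
    """Hypothetical frame object; names are illustrative only."""
    identifier: int
    extended: bool = False   # False: CAN 2.0A (11-bit id); True: CAN 2.0B (29-bit id)
    payload: bytes = b""     # empty payload: the frame is treated as a remote frame

    def __post_init__(self):
        id_bits = 29 if self.extended else 11
        if not 0 <= self.identifier < (1 << id_bits):
            raise ValueError(f"identifier does not fit in {id_bits} bits")

    def set_payload(self, data: bytes) -> None:
        """Setting a non-empty payload turns the frame into a data frame."""
        if len(data) > 8:
            raise ValueError("a CAN data field carries at most 8 bytes")
        self.payload = bytes(data)

    @property
    def is_remote(self) -> bool:
        return len(self.payload) == 0


frame = CanFrame(identifier=10)      # defaults to a remote frame
frame.set_payload(b"\x01\x02")       # now a data frame
```

Keeping the frame as a plain object makes it straightforward to add further methods later, e.g. one that builds the bit representation.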
The (simulated) CAN data is output into a text file. Each message (object) has methods to build
its bit representation from the several variables: message identifier, payload, (possible) bit errors.
The number of replicated media is configured at the application level. A fragment of the simulated
CAN network is shown in Figure B.2, for two replicated media. Each row of the file represents
one bit-time, and the time flow is from top-to-bottom. Each column represents a physical medium,
and the media are numbered from left-to-right (i.e. 1, 2, ..., N ). The values in the file are the bus
values: ’1’ is a recessive bit; ’0’ is a dominant bit.

    11 # Text after a '#' is comment
    11 # Start of bus.sim file
    11 # <- Each row represents one bit time
    11 # Each column represents a medium
    11 # Number of Media: 2
    11
    11
    11
    00 # SOF - MID: 10
    00
    00
    00
    00
    11 # Bit-stuffing
    00
    10 # Error @ Medium 1
    ...
Figure B.2: Simulation text file content
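The row layout of Figure B.2 can be reproduced with a short sketch. It is given only as an illustration under stated assumptions: the function names are hypothetical, only the SOF and an 11-bit identifier are generated (a complete frame would also carry the control, data, CRC and trailing fields), and all media carry the same error-free value.

```python
def stuff_bits(bits):
    """Insert a complementary bit after every run of five equal bits."""
    out = []
    run_bit, run_len = None, 0
    for b in bits:
        out.append(b)
        if b == run_bit:
            run_len += 1
        else:
            run_bit, run_len = b, 1
        if run_len == 5:
            stuff = 1 - b
            out.append(stuff)
            run_bit, run_len = stuff, 1   # the stuff bit starts a new run
    return out


def sof_and_identifier(mid, id_bits=11):
    """SOF (dominant, 0) followed by the identifier, most significant bit first."""
    return [0] + [(mid >> i) & 1 for i in range(id_bits - 1, -1, -1)]


def to_rows(bits, n_media=2):
    """One text row per bit-time, one column per medium (error-free case)."""
    return [str(b) * n_media for b in bits]


rows = to_rows(stuff_bits(sof_and_identifier(10)))
for row in rows:
    print(row)
```

For identifier 10, the SOF and the first four identifier bits form a run of five dominant bits, so a recessive stuff bit follows them, which matches the fragment of Figure B.2.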
This fragment shows the start of a new frame (SOF), clearly identified by the accompanying
comment. The bit-stuffing bits are identified, for informational purposes only. The simulated
data can also (deliberately) suffer errors, i.e. exhibit values different from those intended. Such
errors are essential for exercising the CAN protocol processing machinery, providing it with a more
realistic view of the network traffic.
Errors can affect only one medium (single-mode), or several or all media (common-mode).
These can be inserted in the generated stream, in order to assess the behaviour of the
components in the presence of different types of errors, e.g. stuck-at, single or multiple (burst) bit errors.
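As an illustration of how such errors could be inserted into the generated rows, consider the sketch below. The helper names `inject_error` and `stuck_at` are hypothetical, media are addressed by 0-based column index, and each row is assumed to hold only the bus values (no trailing comment).

```python
def inject_error(rows, bit_time, media):
    """Flip the value seen on the given media (0-based columns) at one bit-time.

    One medium gives a single-mode error; several or all give a common-mode error.
    """
    cols = list(rows[bit_time])
    for m in media:
        cols[m] = "1" if cols[m] == "0" else "0"
    out = list(rows)
    out[bit_time] = "".join(cols)
    return out


def stuck_at(rows, medium, value, start, length):
    """Force one medium to a fixed value ('0' or '1') for a number of bit-times."""
    out = list(rows)
    for t in range(start, min(start + length, len(rows))):
        cols = list(out[t])
        cols[medium] = value
        out[t] = "".join(cols)
    return out


clean = ["00", "00", "11", "11"]
single_mode = inject_error(clean, 1, [0])       # only the first medium is hit
common_mode = inject_error(clean, 1, [0, 1])    # both media are hit
stuck = stuck_at(clean, 1, "0", 2, 2)           # second medium stuck at dominant
```

A burst error can be expressed as repeated calls to `inject_error` over consecutive bit-times.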
Simulated CAN Channel and Media Components
The set of signals represented in the simulation text file must be converted into a form
suitable for being interfaced with the media redundancy management mechanisms, or any other
component needing only one medium, i.e. the CAN channel itself. This goal is achieved by a
custom VHDL component simulating the incoming media, by reading the text file and generating
the appropriate signals to be presented to the component being tested. The operation of this
component is depicted in Figure B.3. This fragment of simulation shows the incoming bit streams,
described in the simulation text file generated by the programme, and the recovered CAN channel
bit stream, after the AND-based media management.
[Waveform annotations: Channel Error; Media Mismatch; Channel Error; End-of-Frame Sequence]
Figure B.3: Simulated CAN Channel
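The recovery of the channel bit stream from the replicated media can be illustrated in software. The component used in this work is written in VHDL; the Python sketch below only mirrors the idea of reading the text file rows and combining the media with an AND function, and its function names are hypothetical.

```python
def parse_row(line):
    """Return the media values of one row as a list of ints.

    Text after '#' is a comment; blank or comment-only rows yield None.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    return [int(c) for c in data]


def channel_bit(media_bits):
    """AND of all media: recessive (1) only if every medium is recessive."""
    return int(all(media_bits))


def channel_stream(lines):
    """Channel value for each bit-time described in the given file lines."""
    rows = (parse_row(line) for line in lines)
    return [channel_bit(r) for r in rows if r is not None]


sim_lines = [
    "11 # bus idle",
    "00 # SOF - MID: 10",
    "10 # Error @ Medium 1",
    "# comment-only line",
    "11",
]
channel = channel_stream(sim_lines)
```

In this example the error on the first medium turns an intended dominant bit into a recessive one; since the other medium still carries the dominant value, the AND yields the intended channel bit.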
In the fragment of Figure B.3 there are also two clock signals: the bus_clk, which governs
the reading of the simulation data file and affects the bits present in each medium; and the can_clk_en,
which governs the sampling of the channel. For simplicity of operation the bits are sampled at the
middle of the bit time, instead of being sampled later. This is just a simplification, since under the
ideal conditions of the simulation environment there is no need for the bit to propagate all over the
network and settle its value.
This testing flow greatly aids the test designer, since the (low-level) burden is now placed on a
piece of software, leaving the designer at the high-level layer. This also avoids the introduction of
errors in the input stimuli, since they are now generated in a programmatic fashion.
B.3 Simulation Fragments
This Section presents simulation fragments with the purpose of exemplifying the testing phase
of some of CANELy’s components and, at the same time, demonstrating the usefulness of the
previously described simulation data generation flow, especially how a set of properly defined
messages can speed up the test considerably while ensuring that the input stimuli
have good quality, contributing to the overall quality of the test.
Basic Channel Monitoring
One of the first components to be designed was the one providing the basic channel monitoring
signals, ChFok, ChEOT and ChErr. These signals are the building blocks for other (complex)
machinery, therefore they not only had to be designed first, but also had to be proved correct, in
order to allow their safe integration into other functions. These signals, in turn, are based on the
sequence detection machinery, which given its simplicity - and parametrisable nature - could
be tested with a simple set of manually defined input stimuli.
These signals, however, are more complex than the sequence detection machinery, for they
also contain complementary logic for signal assertion. Furthermore, there are interactions
between the signals, e.g. ChErr is negated upon ChEOT. Although they can be tested separately,
with a synthetic stimulus representing the desired input signal, they were tested together, which
also made it possible to rule out any ill-behaved interaction. This test is depicted in
Figure B.4. This fragment shows a set of messages, with one of them suffering an error.
Figure B.4: Simulation of the Basic Channel Monitoring mechanism
In this case, the test is being done on machinery that gets its information from the channel.
Therefore the test has been configured with only one medium, and the Medium_Rx signal only has one
element, the channel itself. This test confirmed the correct behaviour of the basic channel
monitoring mechanisms, which could then be used to develop functions of greater complexity with
the confidence that this component would behave as expected.
Message Identifier Recovery
Some notification primitives of the MSU require the indication of the affected message
identifier, mid (see Figure 4.23, page 57). To recover the mid, special machinery had to be designed,
which basically processes the start of every new CAN data or remote frame, until it receives the
identifier. This information depends on the version of the frame, i.e. whether it is CAN 2.0A or CAN 2.0B,
which also affects the execution of the machinery, since CAN 2.0A frames have 11-bit identifiers
vs. 29-bit identifiers in CAN 2.0B.
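The dependence of the identifier recovery on the frame version can be sketched in software. The sketch below is not the VHDL machinery of this work: it assumes a bit list that starts at the SOF and has already had its stuff bits removed, it relies on the standard CAN field order (SOF, 11-bit base identifier, RTR or SRR, IDE, and for CAN 2.0B the 18-bit identifier extension), and the function names are hypothetical.

```python
def int_to_bits(value, width):
    """Most-significant-bit-first list of `width` bits."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


def bits_to_int(bits):
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def recover_mid(frame_bits):
    """Return (mid, is_extended) from destuffed bits starting at SOF."""
    if frame_bits[0] != 0:
        raise ValueError("frame must start with a dominant SOF bit")
    base = bits_to_int(frame_bits[1:12])   # 11-bit base identifier
    ide = frame_bits[13]                   # bit 12 is RTR (2.0A) or SRR (2.0B)
    if ide == 0:
        return base, False                 # CAN 2.0A: 11-bit identifier
    ext = bits_to_int(frame_bits[14:32])   # 18-bit identifier extension
    return (base << 18) | ext, True        # CAN 2.0B: 29-bit identifier


# CAN 2.0A remote frame start, identifier 10: SOF, id, RTR=1, IDE=0
std_bits = [0] + int_to_bits(10, 11) + [1, 0]
# CAN 2.0B frame start, identifier 33253: SOF, base id, SRR=1, IDE=1, extension, RTR=0
ext_bits = ([0] + int_to_bits(33253 >> 18, 11) + [1, 1]
            + int_to_bits(33253 & 0x3FFFF, 18) + [0])

std_mid = recover_mid(std_bits)
ext_mid = recover_mid(ext_bits)
```

The decision point is the IDE bit: until it is received, the machinery cannot know whether the 11 bits already collected are the whole identifier or only its most significant part.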
The test of such machinery calls for a set of CAN message exchanges, with frames having
different identifiers, both in number and length (protocol version). Such a diversified test allows
the assessment of correct operation. A fragment of this test is depicted in Figure B.5, and is
composed of the exchange of several CAN messages.
[Waveform annotations: CAN 2.0B Frame, MID: 33253; CAN 2.0A Frame, MID: 10]
Figure B.5: Simulation of the Message Identifier Extraction mechanism
This component uses an auxiliary signal, ChSOF, which indicates the start of a new frame
transmission, thus also starting the message processing machinery. The midOK signal indicates that
the message identifier has been correctly recovered. Such indication can be useful for signalling
the upper layers.
Although the information pertaining to the mid and the CAN frame version is represented by a
VHDL record data type, it is converted into a form suitable for use by other components
through a conversion function. Lastly, the vector storing (locally) the mid is 29 bits wide, and it
is not initialised, i.e. it is not filled with zeros or ones. This leads to the simulation behaviour observed
in Figure B.5, in the first CAN 2.0A frames: since the remaining 18 bits are not initialised, the
simulator highlights this condition. The safety of the operation is ensured by the data conversion
function, which, depending on the frame version, outputs 11 or 29 bits, thus always ensuring that
the correct data is output.