

Microelectronics Reliability 50 (2010) 1320–1326

Contents lists available at ScienceDirect

Microelectronics Reliability

journal homepage: www.elsevier.com/locate/microrel

How future automotive functional safety requirements will impact microprocessors design

M. Bellotti a,*, R. Mariani b

a Engineering & Design Electrical Electronics Department, Fiat Group Automobiles, Torino, Italy
b YOGITECH SPA, via Lenin 132/p 56017 San Martino Ulmiano, Pisa, Italy

article info

Article history:
Received 7 July 2010
Accepted 13 July 2010
Available online 3 August 2010

0026-2714/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.microrel.2010.07.041

* Corresponding author. Tel.: +39 011 0038863; fax: +39 011 0033609.
E-mail address: [email protected] (M. Bellotti).

abstract

Safety is one of the key issues in future automobile development. Car makers as well as suppliers need to prove that, despite increasing complexity, their electronic systems will deliver the required functionality safely and reliably. Future development and integration of these functionalities will further strengthen the need for safe system development processes and for the ability to provide evidence that all reasonable safety objectives are satisfied. With the trend of increasing complexity, there are increasing risks from systematic failures and random hardware faults that could negatively impact vehicle safety. Safety-relevant systems (such as advanced driving assistance and vehicle dynamic control units) require microcontrollers able to guarantee safety and availability at an acceptable cost. Safety must be achieved with respect to both systematic and random hardware faults, including soft errors and common-cause failures. To provide availability, efficient and fast fault detection mechanisms shall be combined with infrastructures able to collect error events with enough detail to allow reactions by the remaining hardware and the operating system. Costs shall be minimized by introducing as much robustness as needed and no more: this shall be done by avoiding unnecessary redundancies and reducing to a minimum the impact on system performance, thereby maximizing the usage of the available resources. This paper gives a short introduction to the main concepts of functional safety and ISO/DIS 26262, underlining the impact of such requirements on microprocessor and microcontroller design. Some examples are given of current approaches used to answer ISO/DIS 26262 requirements.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Today, many safety-relevant types of equipment require microcontrollers (MCUs). Currently the average number of MCUs per vehicle is about 25, and this number is expected to increase in the following decade. For example, automotive MCUs are expected to see revenues of $3.1 billion for 2009, accounting for 21% of the automotive semiconductor market. By 2014 automotive MCUs are expected to generate revenues of almost $4.8 billion, with revenue growing in line with the overall automotive semiconductor market [1]. System complexity also raises safety questions concerning the impact of these systems on the vehicle and its occupants. As a consequence, safety-relevant systems need to be properly designed, considering the fact that the system and its internal resources are involved in both safety-relevant and non-safety-relevant functionalities.

Many MCU producers (see for example [2–4]) are developing components able to face new safety requirements, specifying, designing and analyzing their components in accordance with functional safety norms like IEC 61508 [5] and ISO/DIS 26262 [6]. These MCUs deliver cost-effective solutions: some are based on a hardware architecture that reduces both the effort required for safety mechanisms and their detection latency; others use a dual-core lock-step implementation that simplifies software development by removing redundant safety system requirements. Such solutions provide automatic hardware diagnostics and safety features to meet the highest requirements of the ISO/DIS 26262 automotive safety integrity level: ASIL D.

2. ISO/DIS 26262 in general

The car manufacturer has the basic obligation to place only safe products on the market (as reported in 85/374 EEC and 2001/95 EC in Europe, and in the US TREAD Act). Currently there is no legal requirement for certification of compliance of automotive E/E systems with ISO/DIS 26262; adherence is voluntary (state of the art). The need to adopt such new safety standards is mainly due to the fact that car makers as well as suppliers need to prove that, despite increasing complexity, their electronic systems will deliver the required functionality safely and reliably. This standard gives rules to perform a Functional Safety Assessment and provides automotive-specific analysis methods to identify the automotive safety integrity level (ASIL, which specifies the item's necessary safety requirements for achieving an acceptable residual risk). Car makers as well as suppliers can also adopt this new standard because it gives clear rules for a “Safety Oriented” design; it assesses hazards and risks of a system during the “concept phase” or during early design phases; it includes requirements on the manufacturer/supplier relation and on the distributed development process for safety-relevant systems; it requires a detailed scheduling of the design activities with well-defined work products for each system lifecycle stage; and finally it states the rules for the reuse of standardized safety elements, laying the foundations for applying the standardization concept.

One of the first phases described by this standard is the hazard analysis and risk assessment, whose objective is to identify and categorize the hazards of the item and to formulate the safety goals related to the prevention or mitigation of these hazards, in order to avoid unreasonable risk.

An ASIL (defined as one of four levels that specify the item's or element's necessary requirements of ISO/DIS 26262 and safety measures for avoiding an unreasonable residual risk, with D representing the most stringent and A the least stringent level) shall be determined for each hazardous event using the estimation parameters severity (S), probability of exposure (E) and controllability (C) in accordance with Table 1. In addition to these four ASILs, the class QM (Quality Management) denotes no requirement in accordance with ISO/DIS 26262.

The ASIL/hazard classification scheme comprises the determination of the severity (S), the exposure (E) and the controllability (C) associated with the considered hazard of the item. For a given hazard, this classification will result in one or more combinations of S, E and C classes. Each combination represents an estimate of potential harm in a particular driving situation, with the severity determined by the potential harm and the exposure determined by the situation. The controllability rates how easy or difficult it is for the driver or another road traffic participant to avoid the considered accident type in the considered situation.

The ASIL determined for the hazardous event shall be assigned to the corresponding safety goal.

To comply with the safety goals, the functional safety concept specifies the basic safety mechanisms and safety measures in the form of functional safety requirements. The functional safety requirements are allocated to elements in the system architecture.

To specify safety mechanisms, the functional safety concept addresses the following:

– Fault detection and failure mitigation.
– Transitioning to a safe state.
– Fault tolerance mechanisms, where a fault does not lead directly to the violation of the safety goals and which maintain the system in a safe state (with or without degradation).
– Fault detection and driver warning in order to reduce the risk exposure time to an acceptable interval (repair request, stop request).
– Arbitration logic to select the most appropriate control request from multiple requests generated simultaneously by different functions.

Table 1
ASIL determination.

            C1    C2    C3
S1   E1     QM    QM    QM
     E2     QM    QM    QM
     E3     QM    QM    A
     E4     QM    A     B
S2   E1     QM    QM    QM
     E2     QM    QM    A
     E3     QM    A     B
     E4     A     B     C
S3   E1     QM    QM    A
     E2     QM    A     B
     E3     A     B     C
     E4     B     C     D
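The ASIL determination of Table 1 can be sketched as a plain lookup. The following Python fragment is only an illustration: the function name `determine_asil` and the integer encoding of the S, E and C classes are our assumptions, not part of ISO/DIS 26262.

```python
# Table 1 as a lookup: ASIL_TABLE[severity][exposure] -> (C1, C2, C3).
# The integer encoding (S1 -> 1, E4 -> 4, C3 -> 3) is an assumption of this sketch.
ASIL_TABLE = {
    1: {1: ("QM", "QM", "QM"), 2: ("QM", "QM", "QM"),
        3: ("QM", "QM", "A"), 4: ("QM", "A", "B")},
    2: {1: ("QM", "QM", "QM"), 2: ("QM", "QM", "A"),
        3: ("QM", "A", "B"), 4: ("A", "B", "C")},
    3: {1: ("QM", "QM", "A"), 2: ("QM", "A", "B"),
        3: ("A", "B", "C"), 4: ("B", "C", "D")},
}


def determine_asil(severity: int, exposure: int, controllability: int) -> str:
    """Return the ASIL (or QM) for one hazardous event, following Table 1."""
    if severity not in (1, 2, 3):
        raise ValueError("severity must be 1..3 (S1..S3)")
    if exposure not in (1, 2, 3, 4):
        raise ValueError("exposure must be 1..4 (E1..E4)")
    if controllability not in (1, 2, 3):
        raise ValueError("controllability must be 1..3 (C1..C3)")
    return ASIL_TABLE[severity][exposure][controllability - 1]


if __name__ == "__main__":
    # Most severe, most frequent, least controllable event.
    print(determine_asil(3, 4, 3))
```

A property of Table 1 that can be checked directly on the entries: the result depends only on the sum S + E + C of the class indices (10 gives D, 9 gives C, 8 gives B, 7 gives A, anything lower gives QM).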

The structure and distribution of the safety requirements within the corresponding parts of ISO/DIS 26262 are illustrated in Fig. 1.

The technical safety requirements are allocated to hardware and software. The requirements that are allocated to both are further partitioned to yield hardware-only safety requirements. The hardware safety requirements are further detailed, considering design constraints and the impact of these constraints on the hardware. Hardware design includes hardware architectural design and hardware detailed design. Hardware architectural design represents all hardware components and their interactions with one another. Hardware detailed design is at the level of electrical schematics representing the interconnections between the hardware parts composing the hardware components. In order to develop a single hardware design, both the hardware safety requirements and all non-safety requirements have to be fulfilled. Hence, safety and non-safety requirements are handled within one development process. Finally, a safety analysis of the hardware architectural and detailed design shall be performed to determine the effects and causes of faults, using inductive or deductive methods such as Failure Mode and Effects Analysis (FMEA) or Fault Tree Analysis (FTA).

3. ISO/DIS 26262 and MCU design

Fig. 2 gives an abstract view of an Electronic Control Unit (ECU), including an MCU plus a set of digital and analogue drivers to control/read actuators/sensors, bus interfaces, relays and clock generation.

The MCU typically includes a “vital” core (including the CPU, the memories, the internal buses and the key registers such as operating system timers and configuration registers) plus the peripherals needed to control the digital and analogue drivers and the bus interface.

3.1. ISO/DIS 26262 MCU failure classes

From the ISO/DIS 26262 point of view, failures of the MCU can be divided into two main classes:

• HW random failure: a failure that may occur unpredictably during the lifetime of a hardware element and that follows a probability distribution. In an MCU, such failures can be caused by permanent faults (e.g. stuck-at faults), transient faults (e.g. single-event upsets) or intermittent faults (e.g. time-dependent variability).
• Systematic failure: a failure of an element or item that is caused in a deterministic way during development, manufacturing or maintenance. Examples: MCU design bugs, unverified corner cases, timing weaknesses, underestimated coupling effects, silicon defects that escaped production test, etc.

Fig. 1. Structure of the safety requirements.

Fig. 2. Abstract view of an MCU.

Dependent failures are an important sub-case: they are failures whose probability of simultaneous or successive occurrence cannot be expressed as the simple product of the unconditional probabilities of each of them. In particular, common-cause failures (CCF, the failure of two or more elements of an item resulting from a single specific event or root cause) are very important for MCUs, because physical faults can easily cause this kind of failure within the same silicon. Dependent failures can be either random (e.g. a change in temperature affecting the whole integrated circuit, or a hot spot) or systematic (e.g. a mistake during timing analysis of the MCU leading to simultaneous failures in the MCU modules).
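In symbols (the notation here is ours, not that of the standard), two failures A and B are dependent when

```latex
\[
P(A \cap B) \neq P(A)\,P(B).
\]
```

As a purely illustrative example with assumed numbers: if two MCU modules each fail with probability $10^{-4}$ over some interval, independence would give $P(A \cap B) = 10^{-8}$. If a single root cause C (for instance a hot spot) occurs with probability $10^{-5}$ over the same interval and makes both modules fail, then

```latex
\[
P(A \cap B) \ge P(C) = 10^{-5},
\]
```

which is three orders of magnitude above the independent estimate. This is why redundancy implemented on the same silicon cannot simply be credited with the product of the individual failure probabilities.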

According to ISO/DIS 26262, hardware random failures can be further divided using the model given in Fig. 3.

A single point fault is a fault in an element which is not covered by a safety mechanism and where the fault leads directly to the violation of a safety goal.

A residual fault is a portion of a fault which by itself leads to the violation of a safety goal, occurring in a hardware element, where that portion of the fault is not covered by existing safety mechanisms.

Fig. 3. Faults model.

A multiple point fault is one of several independent faults that, in combination, lead to a multiple point failure (either perceived, detected or latent). In particular, a latent fault is a multiple point fault whose presence is neither detected by a safety mechanism nor perceived by the driver.

The intention of the identification of multiple point faults is not to require a systematic analysis of every possible combination of two or more hardware faults, but at least to consider combinations that derive from the safety concept (for instance the combination of two faults where one fault affects a safety-related element and another fault affects the corresponding safety mechanism intended to achieve or maintain a safe state).

A safe fault is a fault whose occurrence will not significantly increase the probability of violation of a safety goal.
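The definitions above can be read as a small decision procedure. The sketch below is our own simplified reading of them, not the normative flow of the standard: the `Fault` attributes and the `classify` function are hypothetical names, and a fault that would violate the safety goal directly but is covered by a safety mechanism is treated here as a multiple point fault, on the assumption that the goal is then violated only if the mechanism fails as well.

```python
from dataclasses import dataclass
from enum import Enum


class FaultClass(Enum):
    SAFE = "safe fault"
    SINGLE_POINT = "single point fault"
    RESIDUAL = "residual fault"
    MULTIPLE_POINT_DETECTED = "multiple point fault (detected)"
    MULTIPLE_POINT_PERCEIVED = "multiple point fault (perceived)"
    MULTIPLE_POINT_LATENT = "multiple point fault (latent)"


@dataclass(frozen=True)
class Fault:
    """Hypothetical description of one fault, or one portion of a fault."""
    violates_goal_directly: bool        # leads to a violation by itself
    violates_goal_in_combination: bool  # only together with another fault
    has_safety_mechanism: bool          # the element has a safety mechanism
    covered_by_mechanism: bool          # this portion is controlled by it
    detected: bool = False              # detected by a safety mechanism
    perceived_by_driver: bool = False   # noticed by the driver


def classify(fault: Fault) -> FaultClass:
    """Classify a fault following the definitions given in the text."""
    if fault.violates_goal_directly:
        if not fault.has_safety_mechanism:
            return FaultClass.SINGLE_POINT
        if not fault.covered_by_mechanism:
            return FaultClass.RESIDUAL
        # Covered portion: falls through to the multiple point branch.
    elif not fault.violates_goal_in_combination:
        return FaultClass.SAFE
    if fault.detected:
        return FaultClass.MULTIPLE_POINT_DETECTED
    if fault.perceived_by_driver:
        return FaultClass.MULTIPLE_POINT_PERCEIVED
    return FaultClass.MULTIPLE_POINT_LATENT
```

For example, an undetected fault in a safety mechanism that cannot violate the goal by itself comes out as a latent multiple point fault, which is the case the latent faults metric is meant to bound.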

3.1.1. Requirements on failure classes

In relation to the previously defined failures, ISO/DIS 26262 defines the following requirements:

• HW random failures
  – Requirements for HW architectural metrics (single point faults metric and latent faults metric).
  – Requirements for the probability of violation of the safety goal.
• Dependent failures
  – Qualitative requirements for dependent failures.
• Systematic failures
  – Requirements for avoidance of systematic failures during MCU design, development and production.

3.2. ISO/DIS 26262 HW architectural metrics and probability of violation of the safety goal

The aim of the HW architectural “relative” metrics is to be objectively assessable: the metrics are verifiable, unambiguous, reproducible and precise enough to differentiate between different architectures. They support evaluation of the final design (the precise calculations are done with the detailed hardware design) and in particular they reveal whether or not the coverage of the safety mechanisms used to control hardware faults in the architecture is sufficient. Moreover, being computed as ratios between failure rates, they are robust to uncertainty in hardware failure rates.

The most important HW architectural metric is the “Single Point Fault” metric (SPF metric). This metric reflects the robustness of the item to single point faults, achieved either by coverage from safety mechanisms or by design (primarily safe faults). A high single point faults metric implies that the proportion of single point faults and residual faults in the hardware is low. The definition is given by the following equation:

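\[
\text{SPF metric}
= 1 - \frac{\sum_{\text{safety-related HW elements}} \left(\lambda_{\mathrm{SPF}} + \lambda_{\mathrm{RF}}\right)}{\sum_{\text{safety-related HW elements}} \lambda}
= \frac{\sum_{\text{safety-related HW elements}} \left(\lambda_{\mathrm{MPF}} + \lambda_{\mathrm{S}}\right)}{\sum_{\text{safety-related HW elements}} \lambda}
\tag{1}
\]

where $\lambda_{\mathrm{SPF}}$ is the failure rate associated with hardware element single point faults; $\lambda_{\mathrm{RF}}$ is the failure rate associated with hardware element residual faults; $\lambda_{\mathrm{MPF}}$ is the failure rate associated with hardware element multiple point faults; $\lambda_{\mathrm{S}}$ is the failure rate associated with hardware element safe faults; and $\lambda = \lambda_{\mathrm{SPF}} + \lambda_{\mathrm{RF}} + \lambda_{\mathrm{MPF}} + \lambda_{\mathrm{S}}$.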

For ASIL D, this metric shall be >99%. A similar metric is defined for latent faults, with the aim of emphasizing the robustness of the item to latent faults, achieved either by coverage of faults in safety mechanisms, or by the driver recognizing that the fault exists before the violation of the safety goal, or by design (primarily safe faults). For ASIL D, the latent faults metric shall be >90%.
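As a worked illustration of Eq. (1), the following Python sketch computes the SPF metric in both of its equivalent forms and compares it with the 99% ASIL D target. The class and function names, and the failure rates of the two example elements, are hypothetical values chosen only to make the arithmetic visible.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ElementRates:
    """Failure rates (in FIT) of one safety-related HW element."""
    name: str
    spf: float   # single point faults
    rf: float    # residual faults
    mpf: float   # multiple point faults
    safe: float  # safe faults

    @property
    def total(self) -> float:
        return self.spf + self.rf + self.mpf + self.safe


def spf_metric(elements: Iterable[ElementRates]) -> float:
    """First form of Eq. (1): 1 - sum(spf + rf) / sum(total)."""
    items: List[ElementRates] = list(elements)
    total = sum(e.total for e in items)
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return 1.0 - sum(e.spf + e.rf for e in items) / total


def spf_metric_alt(elements: Iterable[ElementRates]) -> float:
    """Second form of Eq. (1): sum(mpf + safe) / sum(total)."""
    items: List[ElementRates] = list(elements)
    total = sum(e.total for e in items)
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return sum(e.mpf + e.safe for e in items) / total


ASIL_D_SPF_TARGET = 0.99  # "shall be >99%" for ASIL D

# Hypothetical example values, not taken from any real device.
EXAMPLE = [
    ElementRates("cpu", spf=0.0, rf=2.0, mpf=150.0, safe=48.0),
    ElementRates("ram", spf=0.0, rf=1.0, mpf=250.0, safe=49.0),
]

if __name__ == "__main__":
    metric = spf_metric(EXAMPLE)
    print(f"SPF metric = {metric:.4f}, above ASIL D target: {metric > ASIL_D_SPF_TARGET}")
```

With these assumed rates, 3 FIT out of 500 FIT are single point or residual, so the metric is 1 - 3/500 = 0.994. Because only ratios enter the result, scaling every rate by the same factor leaves the metric unchanged, which is the robustness property mentioned above.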

ISO/DIS 26262 also defines an “absolute” metric, the Probabilistic Metric for random Hardware Failures (PMHF). It gives quantitative target values for the maximum probability of violation of each safety goal due to random hardware failures, expressed in terms of average probability per hour over the operational lifetime of the item. For ASIL D, this metric shall be <10⁻⁸ h⁻¹, i.e. less than 10 FIT (Failures In Time; 1 FIT = 10⁻⁹ h⁻¹).

One of the critical points in the computation of the PMHF is how to determine the failure rates affecting a microcontroller. According to ISO/DIS 26262, the failure rates can be determined using hardware part failure rate data from a recognized industry source (e.g. IEC TR 62380, IEC 61709, MIL-HDBK-217F Notice 2, etc.), using statistics based on field returns, or using expert judgement founded on an engineering approach. Table 2 gives some examples of elementary failure rates that can be used as a basis for the computation of the overall failure rate of a microcontroller.

According to ISO/DIS 26262, those requirements are tightly linked to the safety goal and relate to the item, e.g. to the full ECU. However, they can be tailored to the microprocessor level by partitioning the related failure rates down to the MCU (and its sub-parts).
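To give a feeling for the orders of magnitude involved, the sketch below multiplies some of the elementary failure rates of Table 2 (read here as upper bounds of the form 10⁻ⁿ FIT per element) by the element counts of a hypothetical MCU. The element counts, the dictionary keys and the function names are assumptions of this example. It also takes the pessimistic view that every fault could violate the safety goal and that the whole 10 FIT budget is available to the MCU, whereas in practice the budget has to be partitioned over the full item.

```python
FIT_PER_HOUR = 1e-9  # 1 FIT = 1e-9 failures per hour

# Elementary failure rates from Table 2, in FIT per element (upper bounds).
ELEMENTARY_FIT = {
    "transient_ram_cell": 1e-3,
    "transient_register": 1e-4,
    "transient_glue_gate": 1e-7,
    "intermittent_glue_gate": 1e-7,
    "permanent_glue_gate_normal_operation": 1e-8,
}

# Hypothetical MCU inventory (assumed for illustration only).
ELEMENT_COUNT = {
    "transient_ram_cell": 1_048_576,  # 128 KiB of RAM, counted in bits
    "transient_register": 100_000,
    "transient_glue_gate": 2_000_000,
    "intermittent_glue_gate": 2_000_000,
    "permanent_glue_gate_normal_operation": 2_000_000,
}


def raw_failure_rate_fit(rates: dict, counts: dict) -> float:
    """Sum of (rate per element x number of elements), in FIT."""
    return sum(rates[key] * counts[key] for key in rates)


def fit_to_per_hour(fit: float) -> float:
    """Convert a failure rate from FIT to failures per hour."""
    return fit * FIT_PER_HOUR


def required_coverage(raw_fit: float, budget_fit: float) -> float:
    """Minimum fraction of the raw rate that must be covered to meet the budget."""
    if raw_fit <= budget_fit:
        return 0.0
    return 1.0 - budget_fit / raw_fit


if __name__ == "__main__":
    raw = raw_failure_rate_fit(ELEMENTARY_FIT, ELEMENT_COUNT)
    print(f"raw failure rate: {raw:.3f} FIT = {fit_to_per_hour(raw):.3e} per hour")
    print(f"coverage needed for a 10 FIT budget: {required_coverage(raw, 10.0):.4f}")
```

Under these assumptions the raw rate is dominated by transient faults in the RAM and is about two orders of magnitude above the ASIL D target, so the safety mechanisms (or safe faults) must account for more than 99% of it.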

Starting from the experience of IEC 61508, ISO/DIS 26262 includes an annex (Part 5, Annex D) giving hints about which fault models and failure modes should be considered for each MCU sub-part with respect to the target ASIL. This annex also gives hints about which diagnostic coverage can be claimed for each of those fault models and failure modes, depending on the safety mechanism being used.

For example, for the registers and internal RAM of a processing unit (CPU) the following fault models and failure modes shall be considered for ASIL D designs: d.c. fault model (i.e. stuck-at, stuck-on, stuck-open and bridging faults) for data and addresses; dynamic cross-over for memory cells; none, wrong or multiple addressing. As another example, for interrupt handling the following failure modes shall be considered for ASIL D designs: no or continuous interrupts and cross-over of interrupts.

Of course this annex is a “living book”, i.e. it has to be updated as MCU technology changes. For example, the next release of ISO/DIS 26262, currently under discussion in the international working group ISO TC22/SC3/WG16, will include failure modes such as soft errors (i.e. transient faults caused by alpha particles from package decay, neutrons, etc.) and transition faults.

3.3. ISO/DIS 26262 requirements for dependent and systematic failures

Concerning dependent failures, ISO/DIS 26262 requires a deep analysis of all the possible causes that could lead to common-cause failures. The most important ones are environmental factors, e.g. temperature, vibration, pressure, humidity/condensation, pollution, corrosion, contamination, EMC; failures of common external resources, e.g. power supply, input data, intersystem data bus and communication; and stress due to specific situations, e.g. wear, aging.

Table 2
Elementary failure rates.

Failure rate class                                   Value             Comment
Permanent in glue logic (power-up/power-down)        ≤10⁻⁵ FIT/gate    Most permanent faults are activated during power-up or power-down, i.e. during voltage stress of the MCU
Permanent in glue logic (during normal operation)    ≤10⁻⁸ FIT/gate
Permanent in RAM                                     ≤10⁻⁵ FIT/gate    Memories are always very well tested during end-of-line tests, so the probability of weaknesses and residual faults is low
Permanent in FLASH                                   ≤10⁻⁵ FIT/gate    Memories are always very well tested during end-of-line tests, so the probability of weaknesses and residual faults is low. However, as FLASH memories grow bigger and bigger, faults in address decoders become more and more important
Intermittent in glue logic                           ≤10⁻⁷ FIT/gate    Phenomena such as oxide breakdown, tunnelling and aging create faults that appear as transients but have a more permanent nature
Transient in RAM                                     ≤10⁻³ FIT/cell    Most memories use layout techniques to reduce the impact of transient faults. However, multiple-bit upsets (MBU) are increasing in small technologies (45 nm)
Transient in FLASH                                   ≤10⁻⁷ FIT/cell    For FLASH memories, the probability of transient faults is still very low, even if phenomena such as tunnelling are increasing
Transient (registers)                                ≤10⁻⁴ FIT/reg     Density of registers is increasing
Transient (glue logic)                               ≤10⁻⁷ FIT/gate    Capacity and voltage of the logic cells are decreasing (so it is easier to create a single-event upset), and the density of cells is increasing

Measures for resolution of relevant dependent failures shall be specified during the development phase; they shall include assumptions and rationale regarding the dependent failures, their impact, and suggested measures for preventing their root causes, reducing the coupling factors or controlling their effects.

Concerning avoidance of systematic failures, ISO/DIS 26262 does not specify any detailed measures for MCUs, but useful guidelines can be derived from IEC 61508, in which precise rules are given for each phase of an MCU design. For example: functional tests shall be done emulating the system environment; asynchronous circuits shall be avoided or analyzed in detail; Design for Testability (DfT) with coverage >99% is recommended; functional verification coverage must be quantified; soft cores (e.g. CPU) should be already validated or must be validated from scratch (delivering appropriate evidence).

3.4. Design with safety requirements. The “white-box” approach

Other more general requirements for an MCU design can be derived from ISO/DIS 26262, such as how to specify and verify the safety requirements, how to specify HW–SW interfaces (HSI) and how to perform safety analysis of the MCU design to determine effects and causes of faults.

In particular, safety analysis is a key topic to be considered in the design of modern MCUs.

The aim of the safety analysis is the verification of safety concepts and safety requirements. In particular, it covers the identification of conditions and causes, including faults and failures, that could lead to the violation of a safety goal or safety requirement, as well as the identification of additional requirements for detection of those faults or failures. It also covers the determination of the required responses (actions/measures) to the detected faults or failures, and the identification of additional requirements for verifying that the safety goals or safety requirements are satisfied, including safety-related vehicle testing.

Safety analyses shall be performed at the appropriate level of abstraction during the concept and product development phases.

Quantitative analysis methods are used to predict the frequency of failures, while qualitative analysis methods identify failures but do not predict their frequency. Both types of analysis methods depend upon knowledge of fault types and fault models. Examples of those methods are FMEA, FTA and Event Tree Analysis (ETA).

Fig. 5. HW-based redundancy (homogenous). A comparator (COMP) is used to compare the CPU outputs at each cycle. Typically the two CPUs work with a skew of 1.5 or 2 clock cycles to decrease the impact of CCF.

In short, this means that the MCU can no longer be treated as a “black box”: a “white-box” analysis has to be performed. However, those analyses have to be tailored to the MCU level, and this is a big challenge: the number of components inside an integrated circuit is so high that system-oriented FMEA methods are practically inapplicable (a modern System-on-Chip easily reaches 20 million gates and several million transistors). A more automated and deeper FMEA approach is therefore mandatory.

One example of this “white-box” approach is fRMethodology by YOGITECH SpA. fRMethodology is a “white-box” approach for performing functional safety analysis and safety-oriented exploration of integrated circuits (e.g. MCU, ASIC, FPGA) in compliance with IEC 61508 or ISO/DIS 26262. It has been approved by TÜV SÜD, an international certification organization.

In essence, fRMethodology includes a first stage in which an automatic tool extracts information from the integrated circuit description by partitioning it into “sensitive zones”. In a second stage, this information and, optionally, a workload profile are used to prepare a Failure Mode, Effects and Diagnostic Analysis (FMEDA) database. At the end of this step, the failure rates of the sensitive zones of the integrated circuit are computed. Then the sensitive zones are ranked from the least to the most critical, and results are collected from the FMEDA database, including specific indices required by the ISO/DIS 26262 norm such as the previously defined HW architectural metrics and, where applicable, the ASIL. These two stages are called fRFMEA. In a third stage, the FMEDA database is validated using circuit workloads by a mix of fault injection and fault simulation (this mix is called fRFaultInjector, or fRFI). More details can be found in [7].
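The kind of bookkeeping that an FMEDA database performs can be illustrated with a few lines of Python. This is a generic sketch and not the fRMethodology tool flow: the `SensitiveZone` fields, the example zones and all of their numbers are hypothetical, and each zone is simply described by a raw failure rate, a fraction of safe faults and a diagnostic coverage.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SensitiveZone:
    """One zone of the circuit with hypothetical FMEDA figures."""
    name: str
    failure_rate_fit: float     # raw failure rate of the zone, in FIT
    safe_fraction: float        # share of faults that cannot violate the goal
    diagnostic_coverage: float  # share of dangerous faults that are covered

    @property
    def residual_fit(self) -> float:
        """Failure rate left uncovered by the safety mechanisms."""
        dangerous = self.failure_rate_fit * (1.0 - self.safe_fraction)
        return dangerous * (1.0 - self.diagnostic_coverage)


def rank_zones(zones: Sequence[SensitiveZone]) -> List[SensitiveZone]:
    """Rank zones from the least to the most critical (by residual rate)."""
    return sorted(zones, key=lambda zone: zone.residual_fit)


def spf_metric(zones: Sequence[SensitiveZone]) -> float:
    """SPF metric over all zones, in the sense of Eq. (1)."""
    total = sum(zone.failure_rate_fit for zone in zones)
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return 1.0 - sum(zone.residual_fit for zone in zones) / total


# Hypothetical zones, for illustration only.
ZONES = [
    SensitiveZone("alu", 40.0, safe_fraction=0.5, diagnostic_coverage=0.99),
    SensitiveZone("register_file", 100.0, safe_fraction=0.3, diagnostic_coverage=0.999),
    SensitiveZone("bus_interface", 60.0, safe_fraction=0.4, diagnostic_coverage=0.90),
]

if __name__ == "__main__":
    for zone in rank_zones(ZONES):
        print(f"{zone.name:15s} residual = {zone.residual_fit:.3f} FIT")
    print(f"SPF metric = {spf_metric(ZONES):.5f}")
```

In this made-up example the bus interface, with only 90% coverage, ends up as the most critical zone and by itself keeps the SPF metric below the 99% ASIL D target, which is the type of result a ranking of sensitive zones is meant to expose.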

4. Fulfilling safety requirements: examples

The following are some examples of possible approaches to answer ISO/DIS 26262 requirements at MCU level, with particular emphasis on the two most important MCU modules, i.e. the CPU and the memories.

4.1. CPU

Software diversified redundancy (one hardware channel) (see Fig. 4): two different (and diverse) tasks are executed on the same CPU and the results are then compared (e.g. [8]).

Pros: low cost in terms of gate count, good detection of SW systematic faults.

Cons: high cost in terms of memory size, poor soft-error detection, susceptibility to CCF of the two SW processes, poor reusability, high error detection latency, significant impact on CPU performance.

Fig. 4. SW-based redundancy. A Multiple Input Signature Register (MISR) is sometimes used to compare results.
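A minimal sketch of software diversified redundancy follows. The scaling function and its two diverse implementations are invented for this example, and the comparison is a plain equality check rather than a MISR signature; the point is only that two differently coded tasks run on the same CPU and any disagreement is treated as a detected fault.

```python
class SafetyViolation(Exception):
    """Raised when the two diverse results disagree."""


def scale_a(x: int) -> int:
    """First implementation: floor(3x / 4) by multiply and divide."""
    return (x * 3) // 4


def scale_b(x: int) -> int:
    """Diverse implementation: x minus ceil(x / 4), valid for x >= 0."""
    return x - (x + 3) // 4


def run_redundant(x: int, task_a=scale_a, task_b=scale_b) -> int:
    """Run both tasks on the same CPU and compare the results."""
    result_a = task_a(x)
    result_b = task_b(x)
    if result_a != result_b:
        # In a real system this would trigger the transition to a safe state.
        raise SafetyViolation(f"mismatch for input {x}: {result_a} != {result_b}")
    return result_a


if __name__ == "__main__":
    print(run_redundant(100))
    try:
        # Emulate a fault that corrupts the least significant bit of task B.
        run_redundant(100, task_b=lambda x: scale_b(x) ^ 1)
    except SafetyViolation as err:
        print("detected:", err)
```

The sketch also makes the listed drawbacks visible: both tasks occupy memory, the computation is performed twice, and a fault is only noticed once both results are available.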

Typically the ASIL achievable is ASIL B.

HW-based redundancy – homogenous (symmetric) redundancy (see Fig. 5): two identical HW CPUs are used in parallel (lock-step), one as the active “master” and the other as a “passive” checker (e.g. [9]).

Pros: high diagnostic coverage, low cost in terms of memory size.

Cons: high cost in terms of gate count, high power consumption, high susceptibility to CCF between the two HW channels, high error detection latency, high susceptibility to HW systematic failures.

Typically the ASIL achievable is ASIL D (if the comparator is well designed and compares all the signals at each cycle).
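The lock-step principle, including the clock skew between the two CPUs mentioned for Fig. 5, can be mimicked in software. In the toy model below the "CPU" is just a deterministic state update, the checker runs two cycles behind the master, and a comparator checks the delayed master output against the checker output. The state update function, the fault injection and all names are assumptions of this sketch, not a description of any real lock-step device.

```python
from collections import deque
from typing import Optional, Sequence

SKEW = 2  # checker runs this many cycles behind the master


def core_step(state: int, inp: int) -> int:
    """Toy deterministic CPU model: a 16-bit state update."""
    return (state * 5 + inp + 1) & 0xFFFF


def run_lockstep(inputs: Sequence[int],
                 fault_cycle: Optional[int] = None,
                 fault_mask: int = 0x0001) -> Optional[int]:
    """Return the cycle at which the comparator flags a mismatch, or None."""
    master_state = 0
    checker_state = 0
    delayed_inputs = deque()   # inputs waiting to be replayed by the checker
    delayed_outputs = deque()  # master outputs waiting to be compared

    for cycle, inp in enumerate(inputs):
        master_state = core_step(master_state, inp)
        if cycle == fault_cycle:
            master_state ^= fault_mask  # transient fault in the master only
        delayed_inputs.append(inp)
        delayed_outputs.append(master_state)
        if len(delayed_inputs) > SKEW:
            checker_state = core_step(checker_state, delayed_inputs.popleft())
            if checker_state != delayed_outputs.popleft():
                return cycle

    # Let the checker catch up on the last SKEW cycles.
    cycle = len(inputs)
    while delayed_inputs:
        checker_state = core_step(checker_state, delayed_inputs.popleft())
        if checker_state != delayed_outputs.popleft():
            return cycle
        cycle += 1
    return None


if __name__ == "__main__":
    print(run_lockstep(range(10)))
    print(run_lockstep(range(10), fault_cycle=3))
```

In this model a fault injected in the master at cycle n is flagged at cycle n + SKEW. The skew means that a disturbance hitting both CPUs at the same instant affects them at different points of the computation, which is the reason given in the caption of Fig. 5 for using it against CCF.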

HW-based redundancy – optimized tightly coupled (OTC) redun-dancy (see Fig. 6): two diverse HW CPUs are combined in parallel,one as the active ‘‘master” and the other as ‘‘passive” tightly-cou-pled supervisor (e.g. [10,11]).

Pro’s: low cost in terms of gate count, low current consumption, high diagnostic coverage, low cost in terms of memory size, good robustness against CCF, fast error detection latency, low susceptibility to HW systematic failures.

Con’s: a very detailed analysis is required to prove diagnostic coverage, a dedicated CPU interface may be needed.

Typically the ASIL achievable is ASIL D.

HW-based redundancy – “questions and answers” (Q&A) architecture (see Fig. 7): two diverse HW CPUs are used, with two different SW. The secondary CPU sends questions to be answered by the main one (e.g. [12,13]).

Pro’s: good robustness against CCF, low susceptibility to SW and HW systematic failures.

Con’s: high cost in terms of gate count and memory size, high current consumption, poor soft-errors detection, significant impact on CPU performance, poor reusability, slow error detection latency.

Typically the ASIL achievable is ASIL B, and ASIL C/D for very simple safety tasks.
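The question-and-answer exchange can be sketched as a challenge-response check, under the assumption that the secondary CPU holds a diverse implementation of the expected answer. The answer function, the error threshold and the “safe_state” reaction below are hypothetical. The actual protocols of [12,13] are not reproduced here.

```python
def main_cpu_answer(question: int) -> int:
    # Main CPU: answer computed by its own software (multiply and add).
    return (question * 5 + 3) & 0xFF


def expected_answer(question: int) -> int:
    # Secondary CPU: diverse implementation of the same function (shift and add).
    return ((question << 2) + question + 3) & 0xFF


def monitor(questions, answer_fn, max_errors: int = 1) -> str:
    """Secondary CPU loop: send each question, check the answer, count errors."""
    errors = 0
    for q in questions:
        if answer_fn(q) != expected_answer(q):
            errors += 1
            if errors >= max_errors:
                return "safe_state"
    return "ok"


def stuck_main_cpu(question: int) -> int:
    # Models a hung main CPU that always returns the same value.
    return 0x2A


questions = [1, 17, 200, 42]
healthy = monitor(questions, main_cpu_answer)
hung = monitor(questions, stuck_main_cpu)
print(healthy, hung)
```

Since a fault is only noticed when the next question is asked and answered, the detection latency depends on the question period, in line with the slow error detection latency listed above.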

Fig. 6. HW-based redundancy (OTC). A dedicated processing unit is tightly coupled with the main processing units by means of an interface enabling a step-by-step comparison of internal and external CPU results.


Fig. 7. HW-based redundancy (Q&A). The secondary CPU is typically smaller than the main one, e.g. a 16-bit CPU or DSP with respect to the 32-bit main CPU. In many applications, the secondary CPU is on a separate silicon die with respect to the main CPU.

1326 M. Bellotti, R. Mariani / Microelectronics Reliability 50 (2010) 1320–1326

All those redundancies can be combined (e.g. a quad-core using two dual-core lock-step pairs, or a dual OTC with two OTC channels) to provide availability on top of safety. In case of a fault in one of the CPU pairs, the other one can take control of the application (see for example [14]).

Local redundancy: i.e. using special structures for the standard cell gates used to design MCUs. For example, the CPU can be designed with soft-error robust flip-flops, like triple modular redundancy (TMR) registers or other similar methods (e.g. [15]).

The main pro is the capability to correct a single-event upset in a single flip-flop cell.

But there are many con’s. The presence of MBU (multiple-bit upset) is already recognized at 180 nm, as are permanent, transient and intermittent faults in the logic and interconnect: protecting the flip-flop is therefore just one piece of the full story. Moreover, the probability of common-cause failures grows exponentially if local redundancy is used: not only global CCF can affect the circuit but now also local ones, e.g. hot-spots or local clock disturbances. Likewise, especially for circuits having a long operation time, automatic correction without any flag to a supervising unit is not viewed favourably by ISO/DIS 26262, because it leads to a very high risk of latent faults (e.g. TMR voting in the wrong way).

Last but not least, CPU vendors do not appreciate changes to the internal timing mechanisms of a modern CPU (e.g. because of timing violations, scripts not running, contractual aspects).
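The concern about silent correction can be illustrated with a bitwise majority voter that also reports whether the three copies disagreed. This is only a software model of a TMR register under assumed 4-bit values. The “corrected” flag is the kind of information a supervising unit would need. It is not a description of any specific cell library.

```python
def tmr_vote(a: int, b: int, c: int):
    """Bitwise majority vote over three copies, plus a disagreement flag."""
    voted = (a & b) | (a & c) | (b & c)
    corrected = not (a == b == c)
    return voted, corrected


stored = 0b1010

# Fault-free case: no correction, no flag.
clean = tmr_vote(stored, stored, stored)

# One copy upset in bit 1: the voter restores the value and raises the flag.
one_upset = tmr_vote(stored, stored ^ 0b0010, stored)

# Two copies upset in the same bit: the voter "corrects" to the wrong value.
# Without the flag, this wrong vote would pass unnoticed.
two_upsets = tmr_vote(stored, stored ^ 0b0010, stored ^ 0b0010)

print(clean, one_upset, two_upsets)
```

If the flag raised by the first upset is reported and the register is refreshed, the second upset does not accumulate; if it is hidden, the first fault stays latent until the vote goes the wrong way.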

4.2. Memories

Memory key-on tests (such as MARCH tests for RAM, or signature-based tests for FLASH) are a good way to detect permanent faults (see for example [16,17]). However, additional safety mechanisms are required by ISO/DIS 26262 to cover transient and intermittent faults: therefore the use of Error Correction Codes (ECC) is increasingly becoming a strict requirement for memories embedded in modern MCUs (see for example [18] for a survey on error correcting codes).
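As an example of a key-on RAM test, the sketch below runs the March C- sequence on a simulated bit-oriented memory with an optional stuck-at fault. The memory model, its size and the fault location are assumptions for illustration. A production test would operate on word-wide memories with several data backgrounds.

```python
class FaultyRam:
    """Bit-oriented RAM model with optional stuck-at cells (address -> value)."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = stuck_at or {}

    def write(self, addr, bit):
        self.cells[addr] = self.stuck_at.get(addr, bit)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


def march_c_minus(ram, size):
    """March C-: (w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); (r0)."""
    up, down = range(size), range(size - 1, -1, -1)
    elements = [
        (up, None, 0),    # initialise all cells to 0 (any address order)
        (up, 0, 1),
        (up, 1, 0),
        (down, 0, 1),
        (down, 1, 0),
        (up, 0, None),    # final read-back (any address order)
    ]
    failures = set()
    for order, expect, write in elements:
        for addr in order:
            if expect is not None and ram.read(addr) != expect:
                failures.add(addr)
            if write is not None:
                ram.write(addr, write)
    return sorted(failures)


good = march_c_minus(FaultyRam(16), 16)
bad = march_c_minus(FaultyRam(16, stuck_at={5: 1}), 16)
print(good, bad)
```

Such a test runs once at key-on, so a transient fault occurring later during operation is outside its reach, which is why the text calls for additional mechanisms such as ECC.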

ISO/DIS 26262 also requires that the effectiveness of ECC be carefully verified. On one side, accumulation of single-bit errors can degrade ECC efficacy: scrubbing methods (e.g. [19]) are a useful way to counterbalance this negative effect. On the other side, faults in circuits such as the charge pump, test logic or address decoder cause multiple-bit errors that can be mis-corrected by ECC: for that reason, single-error-correction (SEC) flags shall be visible to the operating system and the number of corrections shall be limited. Moreover, methods shall be put in place to increase the diagnostic coverage of those critical failures, such as including addresses in the code bits or special arrangements of the memory cells.
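The mis-correction risk can be made concrete with a Hamming(7,4) single-error-correcting code, used here only because it is small. Embedded memories use wider codes, typically with additional double-error detection. The sketch exposes a “corrected” flag and counts corrections, in the spirit of the requirement above. The monitor class and its limit are hypothetical.

```python
def encode(nibble: int):
    """Hamming(7,4): positions 1..7, parity bits at 1, 2, 4 (index 0 unused)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code


def decode(code):
    """Return (data, corrected). The syndrome is the XOR of set bit positions."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    corrected = syndrome != 0
    if corrected:
        code[syndrome] ^= 1  # flip the bit the syndrome points at
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, corrected


class EccMonitor:
    """Counts SEC events so that software can react when a limit is exceeded."""

    def __init__(self, max_corrections: int):
        self.max_corrections = max_corrections
        self.corrections = 0

    def read(self, code):
        nibble, corrected = decode(code)
        if corrected:
            self.corrections += 1
        return nibble, self.corrections > self.max_corrections


word = encode(0b1011)

single = list(word)
single[5] ^= 1                   # one bit flipped: correctly repaired

double = list(word)
double[3] ^= 1
double[5] ^= 1                   # two bits flipped: silently mis-corrected

print(decode(word), decode(single), decode(double))
```

The double-bit case returns wrong data while reporting an ordinary correction, so the flag alone cannot distinguish it from a genuine single-bit repair; this is why the text asks for limiting the number of corrections and for additional measures such as including the address in the code bits.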

5. Conclusions

Today, a few milliseconds (if not microseconds) separate safety from availability in safety-critical systems such as the coming generation of braking Electronic Control Units. In those fractions of time, any fault causing errors potentially leading to failures shall be detected, analyzed and possibly controlled. With the increasing risk of faults (not only soft-errors), safe states like “reset the MCU” are becoming a very dangerous killer of system availability.

At the same time, for automotive, switching to a fully redundant secondary electronic/mechanical/hydraulic channel in case of failure of the electronic system is less and less feasible from a cost point of view.

The “old” trend: HW is unreliable by definition or, even worse, HW is stupid (or silent): in case of a failure it cannot tell me what’s going on. So there is still the need for massive HW&SW redundancy to guarantee the availability of safety-critical systems.

A “new” trend is needed: Design-for-Uncertainty must become a new paradigm for MCU designers. Processes like Failure Modes and Effects Analysis (FMEA) should become a “de facto” standard and they must be adopted down to the gate level (white-box approach). Moreover, new MCU architectures shall embed methods to detect and control errors as soon as they appear and as close as possible to the faulty location. Therefore, distributing “safety resources” inside the MCU will make it possible to react in a quick and intelligent way. This will make the cost of safety acceptable and sustainable from a system point of view.

References

[1] Automotive microcontrollers market report, electronics. CA Publications; August 2009.

[2] Toshiba, press release; January 2010. <http://www.toshiba-components.com/prpdf/5937E.pdf>.

[3] Texas instruments, news release; November 2008. <http://focus.ti.com/pr/docs/preldetail.tsp?sectionId=594&prelId=sc08145>.

[4] Freescale-ST. Automotive electronic news; October 2009. <http://www.st.com/stonline/stappl/cms/press/news/year2009/t2422.htm>.

[5] IEC 61508 ed. 2.0. International Electrotechnical Commission, IEC; 2009.

[6] ISO/DIS 26262 (all parts), road vehicles – functional safety. International Organization for Standardization.

[7] Mariani R, Boschi G. A systematic approach for failure modes and effects analysis of system-on-chips; IOLTS 2007.

[8] Leaphart E, et al. Survey of software failsafe techniques for safety-critical automotive applications. SAE 2005 [2005-01-0779].

[9] Frueling T. Delphi secured microcontroller architecture. SAE 2000 World Congress, SAE# 2000-01-1052.

[10] Mariani R, Fuhrmann P, Vittorelli B. Cost-effective approach to error detection for an embedded automotive platform. SAE 2006 World congress & exhibition, Detroit, MI, USA; April 2006.

[11] Mariani R, Baumeister M, Fuhrmann P. A single channel, fail-safe microcontroller to simplify SIL3 safety architectures in automotive applications. In: Electronic systems for vehicles VDI conference, Baden–Baden, Germany; October 2007.

[12] Brewerton S, Schneider R, Eberhard D. Implementation of a basic single-microcontroller monitoring concept for safety critical systems on a dual core microcontroller. SAE 2007 world congress & exhibition, 2007-01-1486.

[13] Arbeitskreis EGAS: Standardisiertes E-Gas-Überwachungskonzept für Motorsteuerungen von Otto- und Dieselmotoren. Report Version 2.0. Verband der Automobilindustrie; May 2005.

[14] Baleani M, et al. Fault-tolerant platforms for automotive safety-critical applications. In: International conference on compilers, architecture and synthesis for embedded systems (CASES’03). San Jose; October 2003.

[15] Austin Todd, Blaauw David, Mudge Trevor, Flautner Krisztián. Making typical silicon matter with razor. Computer 2004;37(3):57–65.

[16] Kim V, Chen T. Assessing defect coverage of memory testing algorithms. In: Proc ninth great lakes symposium on VLSI; 1999.

[17] van de Goor AJ. Testing semiconductor devices, theory and practice. A.J. van de Goor/ComTex Publishing; 1999.

[18] Chen CL, Hsiao MY. Error-correcting codes for semiconductor memory applications: a state-of-the-art-review. IBM J Res Develop 1984;28(2):124–34.

[19] Mariani R, Colucci F, Fuhrmann P. Safety integrity of memory sub-systems in automotive microcontrollers, 2007-01-1494 SAE 2007, NUMB 2121. p. 99–110.