A Practical Look at SEU, Effects and Mitigation

Ken Chapman

FPGA Network: Safety, Certification & Security

University of Hertfordshire

19th May 2016

© Copyright 2013-2016 Xilinx.




Premium Bonds

Page 2

Each Bond is £1.

Each Bond takes part in a monthly draw.

Each stays in the system until you cash it in (or die!).

These 5 Bonds are still worth £5 and have taken part in over 570 monthly draws.


ERNIE picks the winning bonds each month

Page 3

Electronic Random Number Indicator Equipment

ERNIE 1

Unveiled in 1957

Generated bond numbers based on signal noise from neon tubes.

Now on display at the Science Museum in London.


Every month ERNIE picks the winning bonds

Page 4

There are ~60 Billion Bonds in the system.

1 in every 30,000 Bonds wins a prize.


Statistics

Page 5

If you have 30,000 Bonds*...

Odds = 1 in 30,000

Does it guarantee that you win a prize every month?

Win nothing = 37%

1 prize = 37%

2 prizes = 18%

3 prizes = 6%

4 prizes = 1%

But over a year you’ll probably win ~12 prizes and over 10 years you’ll win ~120 prizes.

* Maximum permitted holding is 50,000
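The monthly percentages can be checked with a simple binomial model. This is a minimal sketch, assuming each of the 30,000 Bonds wins independently with probability 1 in 30,000 in a given draw; the function name `prize_probability` is illustrative and not from the slides.

```python
from math import comb

def prize_probability(k, bonds=30_000, odds=30_000):
    """Binomial probability of exactly k winning Bonds in one monthly draw."""
    p = 1 / odds  # assumed chance that a single Bond wins this month
    return comb(bonds, k) * p**k * (1 - p) ** (bonds - k)

# Probability of winning 0, 1, 2, 3 or 4 prizes in a single month.
for k in range(5):
    print(f"{k} prizes: {prize_probability(k):.1%}")

# The long-run average is still one prize a month, i.e. ~12 a year.
expected_per_year = 12 * 30_000 * (1 / 30_000)
print(f"expected prizes per year: {expected_per_year:.0f}")
```

The same reasoning carries over to upsets: an average rate says little about any single month (or device), but it becomes a good guide over many draws.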


Which prize will ERNIE give you?

Page 6

Will it be a life changing £1,000,000…

…or average good fortune?

12 × £25 = £300

1% tax free return on investment

[Table: Value against No. of Prizes; the prize bands shown account for 93.3%, 6.4% and 0.3% of prizes]


What Did The Space Program Ever Do For Me?

75 days

9,926yrs

MTBF

Great for space and very special situations but is this practical?

Engineering solutions!

Standard products do benefit from the space program.

Page 7


Only Soft Errors

NO SEL (Single Event Latch-up)

– Proprietary Design Techniques

• >40 Patents.

– Immunity to latch-up confirmed continuously by Xilinx testing.

• Continuous monitoring of devices.

• No reports from customers (significant quantities of devices are monitored 24/7).

• Beam testing at high energy levels.

NO SEFIs (Single Event Functional Interrupts) observed

– Only significant in space (< 0.04 device FIT terrestrially)

NO SETs (Single Event Transients) observed

– Large RCs on logic & DFF nets prevent occurrence.

NO subtle device behaviour changes observed

– No performance or frequency degradation.

– Negligible effects on power consumption.

Upsets only occur in memory cells

– Values flip from 0 → 1 or 1 → 0.

– Soft Errors Only.

Page 8


Over 17 Years of ‘Rosetta’ and Beam Testing


Being Practical Begins and Ends With UG116

Use ‘the known’ to deal with the unknown!

But what does this mean in practical terms?

Always use the latest version

http://www.xilinx.com/support/documentation/user_guides/ug116.pdf

Page 10


Some Xilinx SEU History

Improvements are generally ‘by design’.

We didn’t just get lucky!

Xilinx is the only FPGA vendor that openly publishes SEU and Soft Error Rate measurements (see UG116).

Observations and experiences of devices in the real atmosphere as well as during beam experiments have enabled Xilinx to understand the susceptibility of our devices.

Use known published data to make informed and relevant decisions about today’s devices.

[Timeline figure: 1998 (250nm), 2003, 2012 (Now), 2015 (Now)]


Page 12

7-Series FIT Rate

Failures In Time

Time = 10^9 hours = 114,155 years

SER (Soft Error Rate)

– Frequency of soft error occurrences

81 upsets in 114,155 years for every 1 million bits of configuration memory

135 × 36 × 1024 = 4,976,640 bits are BRAM contents

30,606,304 - 4,976,640 = 25,629,664 fixed configuration bits

81 × 25.6 = 2,074 FIT **

** This is close enough for an estimate

10^9 / 2,074 = 482,160 hours = 20,090 days = 55 years
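The arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch using only the figures quoted above (81 FIT per million configuration bits, 30,606,304 total bits, 135 block RAMs of 36 × 1024 bits); the function names are illustrative. Because it does not round to 25.6 Mb first, it gives about 2,076 FIT rather than 2,074, which is still "close enough for an estimate".

```python
FIT_HOURS = 1e9  # FIT counts failures per 10^9 device-hours

def device_fit(config_bits, fit_per_mbit):
    """Device FIT from its configuration bit count and the per-Mb rate."""
    return fit_per_mbit * config_bits / 1e6

def mean_years_between_upsets(fit):
    """Convert a FIT rate into a mean time between upsets in years."""
    hours = FIT_HOURS / fit
    return hours / 24 / 365

total_bits = 30_606_304
bram_bits = 135 * 36 * 1024          # BRAM contents are not fixed configuration
fixed_bits = total_bits - bram_bits  # 25,629,664 fixed configuration bits

fit = device_fit(fixed_bits, fit_per_mbit=81)
print(f"{fit:.0f} FIT, {mean_years_between_upsets(fit):.0f} years between upsets")
```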


What Do The 7-Series Figures Tell Us?

Operating the following devices at sea level in New York the mean

time between upsets will be…

Artix 7A100T - 55 Years

Artix 7A200T - 22 Years

Kintex 7K70T - 74 Years

Kintex 7K325T - 19 Years

Virtex 7VX690T - 8 Years

Virtex 7V2000T - 4 Years

Now you know why real data collection takes lots of devices and time.

Now you know why Xilinx also does beam testing.

Page 13


Scaling Factors

Real figures should be scaled for the working environment.

http://www.seutest.com/cgi-bin/FluxCalculator.cgi

- Sea Level New York: Relative Flux 1.00

- Xilinx also provide an SEU FIT Rate Calculator

Page 14


Scaling For Ground Based Products

The following devices, operated anywhere normal on the surface of the Earth, will experience upsets less frequently than once every…

Artix 7A100T - 1,181 Days (3 Years)

Artix 7A200T - 470 Days

Kintex 7K70T - 1,583 Days (4 Years)

Kintex 7K325T - 403 Days

Virtex 7VX690T - 172 Days

Virtex 7V2000T - 76 Days

But a ground based product may need to operate 24/7 for many years.

Page 15

Useful Scaling to Remember

17× covers anywhere on the surface of The Earth (includes aircraft operating at lower altitudes)

Reference: Longmont, Colorado, 4,978ft amsl, Flux 3.52×


Altitude 40,000ft Anywhere

Operating the following devices at 40,000 feet the mean time between upsets will be…

Artix 7A100T - 40 Days

Artix 7A200T - 16 Days

Kintex 7K70T - 54 Days

Kintex 7K325T - 14 Days

Virtex 7VX690T - 6 Days

Virtex 7V2000T - 3 Days

A device in a high utilization long haul aircraft could expect to experience a few flights a year in which an upset occurs.

That’s a long time to sit in economy…

Useful Scaling to Remember

500× covers 40,000ft anywhere

Page 16
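The 17× and 500× rules of thumb simply multiply the nominal (sea level New York) FIT rate before converting to a mean time between upsets. This is a minimal sketch, assuming the nominal rates quoted elsewhere in these slides (2,074 FIT for the worked example that gives 55 years, and 6,087 FIT for the XC7K325T); `scaled_days` is an illustrative name.

```python
def scaled_days(nominal_fit, relative_flux):
    """Mean days between upsets after scaling the nominal FIT by a flux factor."""
    hours = 1e9 / (nominal_fit * relative_flux)
    return hours / 24

SURFACE = 17    # covers anywhere on the surface of the Earth
CRUISE = 500    # covers 40,000ft anywhere

for name, fit in (("55-year worked example", 2074), ("XC7K325T", 6087)):
    print(f"{name}: "
          f"{scaled_days(fit, 1) / 365:.0f} years nominal, "
          f"{scaled_days(fit, SURFACE):.0f} days on the ground, "
          f"{scaled_days(fit, CRUISE):.0f} days at 40,000ft")
```

For a specific site, the relative flux from the calculator linked on Page 14 can be passed in place of the worst-case factors.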


SEU Detection

Built-in ‘Readback CRC’ continuously scans the configuration cells.

Can be completely independent of user design.

When CRC is incorrect at end of scan INIT_B pin is driven Low (INIT_B=0, CRCERROR=1).

- Scan time depends on device size and clock frequency (4.6ms to 54.1ms).

- XC7A200T scan time 18.3ms at FMAX

- XC7K325T scan time 23.5ms at FMAX

For a scan time of e.g. 20ms:

What is the longest time between an upset occurring and error being reported? 40ms

What is the shortest time between an upset occurring and error being reported? 0ms

What is the average time between an upset occurring and error being reported? 20ms

Page 17
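The 0ms, 20ms and 40ms answers follow from the CRC only being checked at the end of each scan. This is a minimal Monte Carlo sketch under assumed conditions: scans run back to back, the upset lands at a uniformly random time, and the affected bit sits at a uniformly random position in the scan. The function name and the model itself are illustrative, not taken from the slides.

```python
import random

def crc_report_latency_ms(scan_ms, rng):
    """Time from a random upset until the end-of-scan CRC check reports it."""
    position = rng.uniform(0, scan_ms)  # when in each scan the upset bit is read
    phase = rng.uniform(0, scan_ms)     # when in the current scan the upset occurs
    if phase <= position:
        # Bit not yet read in this scan: reported when this scan completes.
        return scan_ms - phase
    # Bit already read: only the next scan sees it, reported when that completes.
    return 2 * scan_ms - phase

rng = random.Random(1)
samples = [crc_report_latency_ms(20.0, rng) for _ in range(200_000)]
mean_ms = sum(samples) / len(samples)
print(f"min {min(samples):.2f}ms, mean {mean_ms:.2f}ms, max {max(samples):.2f}ms")
```

The minimum approaches zero (upset just before the last bit is read), the maximum approaches two scan times (upset just after the first bit is read), and the mean settles at one scan time.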


Error Correction

7-Series also has error correction built-in.

Automatically corrects all single bit per frame upsets (the most common type).

Each frame (101×32 = 3,232 bits) has an Error Correcting Code (ECC)

- Detects an error as frame containing error is scanned. 50% reduction in average detection time.

- Identifies location of a single bit error within that frame.

- Correction time <1ms.

Readback CRC mechanism still used to scan the device.

- CRC provides redundancy for ECC.

[Timing diagram: 20ms scan; ECCERROR=1 while CRCERROR=0 and INIT_B=1]

Page 18
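The "50% reduction in average detection time" can be seen by letting the error be flagged as soon as the affected frame is read, rather than at the end of the scan. This is a minimal sketch under the same assumed model as a back-to-back 20ms scan with uniformly random upset time and frame position; the names are illustrative.

```python
import random

FRAME_BITS = 101 * 32  # 3,232 bits per frame, each frame carrying its own ECC

def ecc_detect_latency_ms(scan_ms, rng):
    """Time from a random upset until its frame is next read and ECC flags it."""
    position = rng.uniform(0, scan_ms)  # when in each scan the frame is read
    phase = rng.uniform(0, scan_ms)     # when in the current scan the upset occurs
    # Wait until the scan next reaches the frame, wrapping into the next scan.
    return (position - phase) % scan_ms

rng = random.Random(1)
samples = [ecc_detect_latency_ms(20.0, rng) for _ in range(200_000)]
mean_ms = sum(samples) / len(samples)
print(f"mean detection latency {mean_ms:.2f}ms, worst case {max(samples):.2f}ms")
```

The mean comes out at about half a scan time, and the worst case is bounded by one scan time, before the correction time of under 1ms is added.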


When ECC alone is not enough!

Page 19

Single Bit Error (SBE)

Adjacent Frame Double Bit Error = 2 × SBE

Same Frame Double Bit Error (DBE)


What Effect Does An Upset Have On My Design?

Error injection is a VERY Powerful tool (partial reconfiguration).

Not available in ASIC or fixed configuration devices.

(Only pre-defined error injection points are practical within an ASIC design)

“It’s like having a proton beam on my desk but better”

Evaluate SEU susceptibility of a particular design.

- What proportion of upsets affect the design?

- What happens when they do?

- How many upsets are critical to operation?

Page 20

Evaluate and test all your mitigation strategies

- Does your system correctly handle and report errors?

- Does your TMR scheme really see you through (hard and soft errors)?

Where and what is the weakest link?


Page 21

The ‘Proton Beam’ for Your Desk!

XC7K325T on KC705 Board

[Block diagram. Labels: SEM IP with ICAPE2 and FRAME_ECC2; Monitor Interface, Error Injection Interface and Status Interface; status_heartbeat; icap_grant; CRCERROR; INIT_B (dedicated); FIFOs; KCPSM6 with 4K ROM, UART_RX6 and UART_TX6; 24-bit Counter (CE, Q[23:0]) with Q[23:16] to led[7:0]; 8-bit Counter (RST, Q[7:0]); Ok]

400 ‘Break Me’ Modules

Represents a ‘Design’ filling ~90% of device


Page 22

‘Break Me’ Module! (×400)

[Block diagram of one module. Labels: two identical DSP circuits, each with Counter, 256×18 ROM, LFSR and DSP48E1 adder, whose results are compared (=) and ‘Latched’ as DSP Failure; KCPSM6 with 2K Program ROM in a Dual Port BRAM, DEFAULT_JUMP, ‘Latch’ for KCPSM6 Failure; CRC Calculator with ‘Latch’ for ROM Failure]

DSP Circuits - ~57 Slices, 2 DSP48E1

PicoBlaze - 32 Slices

ROM - 1 BRAM (36kb)

CRC - 8 Slices

Other logic - ~13 Slices

Total Size of each Module - 1 BRAM, 2 DSP48E1, ~110 Slices


Page 23

The ‘Proton Beam’ for Your Desk!

Target Device : xc7k325t

Design Summary

--------------

Number of occupied Slices: 44,405 out of 50,950 87%

Number of RAMB36E1/FIFO36E1s: 411 out of 445 92%

Number of RAMB18E1/FIFO18E1s: 4 out of 890 1%

Number of DSP48E1s: 800 out of 840 95%

For an XC7K325T, each simulated SEU (arrow!) is equivalent to:-

19 Years at Sea Level New York

403 Days worst case anywhere on the surface of the Earth

14 Days worst case anywhere at 40,000ft

Today’s target…

DSP Circuits ~52%

PicoBlaze circuits ~40%

ROM CRC circuits ~7%

SEM IP and system controller ~1%


Page 24

Results From My Desk!

500 simulated SEU equivalent to 18 Years of worst case continuous operation at 40,000ft

Each dot represents a frame in which an error was injected. Red dots represent upsets that resulted in disturbance to operation of a ‘break me’ circuit.

Most SEU have no effect

Different circuits have different susceptibility…

Design Feature - Relative Susceptibility

DSP circuits - 59%

PicoBlaze circuits - 17%

ROM CRC calculator - 24%

SEM IP and system controller - 0%

(Relative Susceptibility: percentage of observed disruptions to operation normalised to area occupied by feature)

Simulating SEU in your design helps you to observe the susceptibility of your circuits and focus on the effects to the important ones.


Page 25

‘Break Me’ – Designed To Break AND Report It!

[Block diagram: two identical DSP circuits, each with Counter, 256×18 ROM, LFSR and DSP48E1 adder; the two results are compared (=) and a ‘Latch’ drives DSP Failure]

100% known input data!

Matching 48-bit results every clock cycle

‘Latches’ any difference between two identical circuits.

Just 1-bit for 1-clock cycle is captured and reported.

However, a real DSP algorithm (e.g. FIR filter or FFT)….

- Computes results for sets of data samples which are unknown variables.

- Most calculation errors will be completely indistinguishable from signal noise.

- The upset will be temporary (e.g. <23ms).

- Naturally ‘flushes’ with clean data and results following correction.

Very low probability of any meaningful or observable disturbances.
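The duplicate-and-compare ‘Latch’ idea can be illustrated in software. This is a minimal sketch, not the actual ‘Break Me’ circuit: a running 48-bit sum stands in for the DSP48E1 chain, an upset is modelled as one wrong bit visible for a single clock cycle, and `dsp_results` and `run_pair` are hypothetical names.

```python
MASK48 = (1 << 48) - 1  # results are 48 bits wide

def dsp_results(samples, fault_cycle=None):
    """Yield a running 48-bit sum, optionally corrupted for one clock cycle."""
    total = 0
    for cycle, sample in enumerate(samples):
        total = (total + sample) & MASK48
        # Model an upset as a single wrong bit seen on one cycle only.
        yield total ^ 1 if cycle == fault_cycle else total

def run_pair(samples, fault_cycle=None):
    """Compare two identical circuits every cycle; the latch is sticky."""
    failure = False
    for good, suspect in zip(dsp_results(samples),
                             dsp_results(samples, fault_cycle)):
        failure = failure or (good != suspect)  # once set, never cleared
    return failure

known_inputs = list(range(1, 101))  # 100% known input data
print("no upset:  ", run_pair(known_inputs))
print("1-bit, 1-cycle upset:", run_pair(known_inputs, fault_cycle=5))
```

Because the latch is sticky, even a disturbance lasting a single cycle is reported, which is far stricter than what a real filter or FFT output would reveal.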


Page 26

‘Break Me’ – PicoBlaze Susceptibility or ‘DVF’

PicoBlaze + interfacing logic = ~50 Slices (similar to a typical application)

400 PicoBlaze circuits occupy ~40% of the XC7K325T device

500 simulated SEU resulted in 16 disturbances to PicoBlaze operation.

PicoBlaze circuit susceptibility = (100% / 40%) × (16/500) = 0.08

i.e. Only 1 in 12.5 SEU landing within the area occupied by a PicoBlaze circuit has an effect.

Design Vulnerability Factor (DVF) = 8%

Nominal SEU rate of XC7K325T device is 6,087 FIT (19 Years)

One (1) PicoBlaze circuit occupies ~0.1% of XC7K325T Slices

1 PicoBlaze circuit = 6,087 × 0.1% × 8% = 0.49 FIT (234,424 Years)

Anywhere on Earth (17×): PicoBlaze circuit = 8 FIT (13,789 Years)

40,000ft anywhere (500×): PicoBlaze circuit = 245 FIT (469 Years)
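The DVF calculation and the resulting per-circuit rates can be reproduced directly from the slide's figures. This is a minimal sketch; the function names are illustrative, and the inputs (16 disturbances in 500 injections, ~40% of the device, ~0.1% per circuit, 6,087 FIT) are the values quoted on this slide.

```python
def design_vulnerability_factor(disturbances, injections, area_fraction):
    """Fraction of upsets landing in a circuit's area that disturb it."""
    return (disturbances / injections) / area_fraction

def years_between_events(fit):
    """Convert a FIT rate (events per 10^9 hours) into mean years per event."""
    return 1e9 / fit / 24 / 365

dvf = design_vulnerability_factor(16, 500, area_fraction=0.40)
circuit_fit = 6087 * 0.001 * dvf  # device FIT x area of one circuit x DVF

for label, flux in (("Nominal", 1), ("Anywhere on Earth", 17), ("40,000ft", 500)):
    fit = circuit_fit * flux
    print(f"{label}: {fit:.2f} FIT, {years_between_events(fit):,.0f} years")
```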


Page 27

Categorisation of Events

Observed results for a variety of real applications (normalised for device utilisation)

100% Detection

60-80% Completely miss the design

- These upsets will never impact operation

- But all SEU are detected and reported

10-40% ‘Touch’ the design but either…

- Have no effect on operation at all, or

- No effect could be observed.

<10% will be observed to have any effect.

- E.g. PicoBlaze ~8% (in ‘Break Me’ design)

- 2-5% is a typical observation rate.

<1% Impact product functionality.

(i.e. The ones that actually matter)


Typical Design Operational Disturbance Rates

Page 28

Kintex 7K325T (Continuous operation of >80% utilized device)

                               SEU Detection Rate    Operational Disturbance Rate
Nominal                        19 Years              190 to 950 Years
Anywhere on Earth (17×)        403 Days              10 to 51 Years
Anywhere at 40,000ft (500×)    14 Days               137 Days to 2 Years


Page 29

How Do So Many SEU Miss My Design?

‘Break Me’ design fills ~90% of the device but what does ‘used’ actually mean?

In a typical real design only 20% to 40% of configuration bits are ‘used’.

So that means 60% to 80% of upsets miss the design altogether (false alarms?)


What Happens to ‘X’..... If…

Page 30

[Circuit diagram. Labels: flip-flops (D, Q, CE, R); a LUT (I0, I1, I2, O); signals A, B, C, Enable and Reset; output X]


Nothing Happens to ‘X’ Unless…

Page 31

[Circuit diagram. Labels: flip-flops (D, Q, CE, R); a LUT (I0, I1, I2, O); signals A, B, C, Enable and Reset; output X]

Enable = ‘1’

Reset = ‘0’

B = ‘0’ and C = ‘1’

A changes state

When the upset is present (e.g. a 23ms ‘window’)
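Why so many conditions must line up can be shown with a toy model of logic masking. This is a minimal sketch under stated assumptions: the LUT function `O = A and (not B) and C`, the choice of upset bit, and the flip-flop behaviour are all hypothetical stand-ins and are not taken from the slide's circuit. It counts how many input combinations let one flipped LUT bit reach X.

```python
from itertools import product

def make_lut(func):
    """Build the 8 configuration bits of a 3-input LUT, indexed by (A, B, C)."""
    return [func(a, b, c) for a, b, c in product((0, 1), repeat=3)]

def lut_out(table, a, b, c):
    """Each input combination selects exactly one configuration bit."""
    return table[(a << 2) | (b << 1) | c]

def next_x(x, lut_o, enable, reset):
    """Flip-flop with reset and clock enable feeding output X."""
    if reset:
        return 0
    return lut_o if enable else x

good = make_lut(lambda a, b, c: a & (1 - b) & c)  # hypothetical LUT function
upset = list(good)
upset[0b101] ^= 1  # flip the single bit selected by A=1, B=0, C=1

affected = 0
total = 0
for a, b, c, enable, reset in product((0, 1), repeat=5):
    total += 1
    x_good = next_x(0, lut_out(good, a, b, c), enable, reset)
    x_bad = next_x(0, lut_out(upset, a, b, c), enable, reset)
    affected += x_good != x_bad

print(f"{affected} of {total} input combinations expose the upset at X")
```

In this toy model only the one combination with A=1, B=0, C=1, Enable=1 and Reset=0 exposes the upset, and it must also occur inside the window before the upset is corrected.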


Risk Assessment – Whole Device

Page 32

Let’s take a look at the XC7K325T which is a mid-range Kintex-7 device

326,000 logic cells (i.e. not small!)

75.1Mb of static configuration

16.5Mb of available user RAM

0.5Mb of available user flip-flops

[Area chart; scale marker: 1Mb]


Risk Assessment – Resources Actually Used

Page 33

Every design is different, so it is obviously better to work with actual values.

But let’s accept some typical figures for now…

30Mb of used or ‘Essential Bits’

1-7Mb of ‘Critical Bits’

12Mb of used RAM

0.15Mb of used flip-flops


Risk Assessment – Not all flip-flops are the same!

Page 34

Flip-flops and RAM can easily be 10 to 4,000 times more susceptible…

30Mb of used or ‘Essential Bits’

1-7Mb of ‘Critical Bits’

12Mb of used RAM

0.15Mb of used flip-flops

0.15Mb of flip-flops fabricated using a typical ASIC process

Know what the figures are and what they mean before you make a decision


Being Practical Begins and Ends With UG116

Use ‘the known’ to deal with the unknown!

http://www.xilinx.com/support/documentation/user_guides/ug116.pdf

Always use the latest version

Page 35