© Copyright 2013-2016 Xilinx.
Ken Chapman
A Practical Look at SEU, Effects and Mitigation
FPGA Network: Safety, Certification & Security
University of Hertfordshire
19th May 2016
Premium Bonds
Each Bond is £1.
Each Bond takes part in a monthly draw.
Each stays in the system until you cash it in (or die!).
These 5 Bonds are still worth £5 and have taken part in over 570 monthly draws.
ERNIE picks the winning bonds each month
Electronic Random Number Indicator Equipment
ERNIE 1
Unveiled in 1957
Generated bond numbers based on signal noise from neon tubes.
Now on display at the Science Museum in London.
Every month ERNIE picks the winning bonds
There are ~60 billion Bonds in the system.
Each month 1 in every 30,000 Bonds wins a prize.
Statistics
If you have 30,000 Bonds*...
Does it guarantee that you win a prize every month?
Odds = 1 in 30,000
Win nothing = 37%
1 prize = 37%
2 prizes = 18%
3 prizes = 6%
4 prizes = 1%
But over a year you’ll probably win ~12 prizes
and over 10 years you’ll win ~120 prizes.
* Maximum permitted holding is 50,000
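The monthly percentages on this slide are what a Poisson model gives when the expected number of prizes is 1 (30,000 Bonds at odds of 1 in 30,000 each). A minimal Python sketch, assuming independent draws; the helper name prize_probability is illustrative and not from the presentation:

```python
import math

def prize_probability(k, bonds=30_000, odds=30_000):
    """Poisson probability of winning exactly k prizes in one monthly draw."""
    expected = bonds / odds  # mean prizes per month (1.0 for 30,000 Bonds)
    return math.exp(-expected) * expected**k / math.factorial(k)

for k in range(5):
    print(f"{k} prizes: {prize_probability(k):.1%}")
```

Under this model the first four figures round to 37%, 37%, 18% and 6%, while four prizes comes out nearer 1.5%.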
Which prize will ERNIE give you?
Will it be a life changing £1,000,000…
…or average good fortune?
12 × £25 = £300
1% tax free return on investment
[Chart: Value vs. No. of Prizes; the three bands account for 93.3%, 6.4% and 0.3% of Prizes]
What Did The Space Program Ever Do For Me?
[Figure labels: MTBF 75 days vs. 9,926yrs]
Great for space and very special situations but is this practical?
Engineering solutions!
Standard products do benefit from the space program.
Only Soft Errors
NO SEL (Single Event Latch-up)
– Proprietary Design Techniques
• >40 Patents.
– Immunity to latch-up confirmed continuously by Xilinx testing.
• Continuous monitoring of devices.
• No reports from customers (significant quantities of devices are monitored 24/7).
• Beam testing at high energy levels.
NO SEFIs (Single Event Functional Interrupts) observed
– Only significant in space (< 0.04 device FIT terrestrially)
NO SETs (Single Event Transients) observed
– Large RCs on logic & DFF nets prevent occurrence.
NO subtle device behaviour changes observed
– No performance or frequency degradation.
– Negligible effects on power consumption.
Upsets only occur in memory cells
– Values flip from 0 → 1 or 1 → 0.
– Soft Errors Only.
Over 17 Years of ‘Rosetta’ and Beam Testing
Being Practical Begins and Ends With UG116
Use ‘the known’ to deal with the unknown!
But what does this mean in practical terms?
Always use the latest version
http://www.xilinx.com/support/documentation/user_guides/ug116.pdf
Some Xilinx SEU History
Improvements are generally ‘by design’.
We didn’t just get lucky!
Xilinx is the only FPGA vendor that openly publishes SEU and Soft Error Rate measurements (see UG116).
Observations and experiences of devices in the real atmosphere as well as during beam experiments have enabled Xilinx to understand the susceptibility of our devices.
Use known published data to make informed and relevant decisions about today’s devices.
[Timeline labels: 1998 (250nm), 2003, 2012 (Now), 2015 (Now)]
7-Series FIT Rate
FIT = Failures In Time
Time = 10^9 hours = 114,155 years
SER (Soft Error Rate)
– Frequency of soft error occurrences
81 upsets in 114,155 years for every 1 million bits of configuration memory
135 × 36 × 1024 = 4,976,640 bits are BRAM contents
30,606,304 - 4,976,640 = 25,629,664 fixed configuration bits
81 × 25.6 = 2,074 FIT
** This is close enough for an estimate
10^9 / 2,074 = 482,160 hours = 20,090 days = 55 years
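The arithmetic on this slide carries over to any device once the configuration bit count and the per-megabit rate are known. A minimal Python sketch using the slide's figures; the function names are illustrative, and a megabit is taken as 10^6 bits to match the slide's 25.6:

```python
HOURS_PER_FIT_PERIOD = 1e9  # FIT counts failures per 10^9 device-hours

def device_fit(total_bits, bram_bits, fit_per_mbit):
    """Estimate FIT from the configuration bits that are not BRAM contents."""
    config_mbits = (total_bits - bram_bits) / 1e6
    return fit_per_mbit * config_mbits

def mean_years_between_upsets(fit):
    """Convert a FIT rate into mean time between upsets, in years."""
    hours = HOURS_PER_FIT_PERIOD / fit
    return hours / 24 / 365

bram_bits = 135 * 36 * 1024  # 4,976,640 bits of BRAM contents
fit = device_fit(30_606_304, bram_bits, fit_per_mbit=81)
years = mean_years_between_upsets(fit)
print(round(fit), round(years))
```

With the unrounded bit count this gives roughly 2,076 FIT rather than the slide's rounded 2,074, and still about 55 years.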
What Do The 7-Series Figures Tell Us?
Operating the following devices at sea level in New York, the mean time between upsets will be…
Artix 7A100T - 55 Years
Artix 7A200T - 22 Years
Kintex 7K70T - 74 Years
Kintex 7K325T - 19 Years
Virtex 7VX690T - 8 Years
Virtex 7V2000T - 4 Years
Now you know why real data collection takes lots of devices and time.
Now you know why Xilinx also does beam testing.
Scaling Factors
Real figures should be scaled for the working environment.
http://www.seutest.com/cgi-bin/FluxCalculator.cgi
- Sea Level New York: Relative Flux 1.00
- Xilinx also provide an SEU FIT Rate Calculator
Scaling For Ground Based Products
Devices operating anywhere normal on the surface of the Earth will experience upsets less frequently than…
Artix 7A100T - 1,181 Days (3 Years)
Artix 7A200T - 470 Days
Kintex 7K70T - 1,583 Days (4 Years)
Kintex 7K325T - 403 Days
Virtex 7VX690T - 172 Days
Virtex 7V2000T - 76 Days
But a ground based product may need to operate 24/7 for many years.
Useful Scaling to Remember
17× covers anywhere on the surface of the Earth
(Includes aircraft operating at lower altitudes)
Reference: Longmont, Colorado, 4,978ft amsl, Flux 3.52×
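The scaling factors can be applied directly to a nominal FIT rate. A minimal Python sketch, assuming the 6,087 FIT nominal rate that this deck quotes for the XC7K325T and the 500× factor it gives for 40,000ft; the table and function names are illustrative:

```python
# Relative neutron flux factors quoted in the slides (sea level New York = 1.00)
FLUX_SCALING = {
    "Sea level, New York": 1.0,
    "Longmont, Colorado (4,978ft amsl)": 3.52,
    "Anywhere on the surface of the Earth": 17.0,
    "40,000ft anywhere": 500.0,
}

def days_between_upsets(nominal_fit, flux):
    """Mean days between upsets after scaling a nominal FIT rate by relative flux."""
    return 1e9 / (nominal_fit * flux) / 24

for place, flux in FLUX_SCALING.items():
    print(f"{place}: {days_between_upsets(6087, flux):,.0f} days")
```

For the XC7K325T this reproduces the deck's 403 days (17×) and 14 days (500×) figures.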
Altitude 40,000ft Anywhere
Operating the following devices at 40,000 feet, the mean time between upsets will be…
Artix 7A100T - 40 Days
Artix 7A200T - 16 Days
Kintex 7K70T - 54 Days
Kintex 7K325T - 14 Days
Virtex 7VX690T - 6 Days
Virtex 7V2000T - 3 Days
A device in a high utilization long haul aircraft could expect to experience a few flights a year in which an upset occurs.
That’s a long time to sit in economy…
Useful Scaling to Remember
500× covers 40,000ft anywhere
SEU Detection
Built-in ‘Readback CRC’ continuously scans the configuration cells.
Can be completely independent of user design.
When CRC is incorrect at end of scan INIT_B pin is driven Low (CRCERROR=1, INIT_B=0).
- Scan time depends on device size and clock frequency (4.6ms to 54.1ms).
- XC7A200T scan time 18.3ms at FMAX
- XC7K325T scan time 23.5ms at FMAX
For a scan time of e.g. 20ms:
What is the longest time between an upset occurring and error being reported? 40ms
What is the shortest time between an upset occurring and error being reported? 0ms
What is the average time between an upset occurring and error being reported? 20ms
Error Correction
7-Series also has error correction built-in.
Automatically corrects all single bit per frame upsets (the most common type).
Each frame (101×32 = 3,232 bits) has an Error Correcting Code (ECC)
- Detects an error as frame containing error is scanned.
- 50% reduction in average detection time.
- Identifies location of a single bit error within that frame.
- Correction time <1ms.
Readback CRC mechanism still used to scan the device.
- CRC provides redundancy for ECC.
[Timing diagram: 20ms scan; ECCERROR=1 while CRCERROR=0 and INIT_B=1]
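The detection times quoted for Readback CRC and for ECC follow from a simple timing model. A minimal Python sketch, assuming a 20ms scan that reads the configuration cells at a uniform rate and an upset that strikes a random cell at a random moment; the function names are illustrative and this is not a model of the actual readback hardware:

```python
import random

SCAN_MS = 20.0  # example scan time

def crc_latency(phase, cell):
    """Delay until the end of the first scan that reads the upset cell.

    phase: how far through the current scan the upset occurs (0..1)
    cell:  position of the upset cell in scan order (0..1)
    """
    remaining = (1.0 - phase) * SCAN_MS
    # A cell the scan has already passed is only caught by the next full scan.
    return remaining if cell >= phase else remaining + SCAN_MS

def ecc_latency(phase, cell):
    """Delay until the scan next reaches the frame holding the upset cell."""
    return ((cell - phase) % 1.0) * SCAN_MS

random.seed(1)
samples = [(random.random(), random.random()) for _ in range(200_000)]
crc_times = [crc_latency(p, c) for p, c in samples]
ecc_times = [ecc_latency(p, c) for p, c in samples]
crc_mean = sum(crc_times) / len(crc_times)
ecc_mean = sum(ecc_times) / len(ecc_times)
print(f"CRC mean {crc_mean:.1f}ms, ECC mean {ecc_mean:.1f}ms")
```

Under these assumptions the CRC delay ranges from 0ms to 40ms with a mean near 20ms, while the ECC delay averages near 10ms, which is the 50% reduction.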
When ECC alone is not enough!
[Diagram: three upset patterns]
- Single Bit Error (SBE)
- Adjacent Frame Double Bit Error = 2 × SBE
- Same Frame Double Bit Error (DBE)
What Effect Does An Upset Have On My Design?
Error injection is a VERY powerful tool (partial reconfiguration).
Not available in ASIC or fixed configuration devices.
(Only pre-defined error injection points are practical within an ASIC design)
“It’s like having a proton beam on my desk but better”
Evaluate SEU susceptibility of a particular design.
- What proportion of upsets affect the design?
- What happens when they do?
- How many upsets are critical to operation?
Evaluate and test all your mitigation strategies
- Does your system correctly handle and report errors?
- Does your TMR scheme really see you through (hard and soft errors)?
Where and what is the weakest link?
The ‘Proton Beam’ for Your Desk!
[Block diagram: XC7K325T on KC705 Board. Labelled blocks and signals:]
- SEM IP (ICAPE2, FRAME_ECC2) with Monitor Interface, Error Injection Interface and Status Interface; status_heartbeat, icap_grant, CRCERROR
- KCPSM6 with 4K ROM, UART_RX6 and UART_TX6, FIFOs and ports
- 24-bit Counter (CE, Q[23:0]) with [23:16] and led[7:0]; 8-bit Counter (RST, Q[7:0])
- 400 ‘Break Me’ Modules: represents a ‘Design’ filling ~90% of device
- INIT_B (dedicated)
‘Break Me’ Module!
[Block diagram: one ‘Break Me’ module, replicated ×400. Labelled blocks:]
- DSP Circuits (~57 Slices, 2 DSP48E1): two identical chains of Counter, 256×18 ROM, LFSR and DSP48E1 (×, +); the two results are compared (=) and a mismatch is held in a ‘Latch’ as ‘DSP Failure’. Bus widths shown: 8, 18, 25, 43, 48.
- PicoBlaze (32 Slices): KCPSM6 with 2K Program ROM and DEFAULT_JUMP; ‘KCPSM6 Failure’ held in a ‘Latch’.
- ROM (1 BRAM, 36kb): Dual Port BRAM with CRC Calculator (8 Slices); ‘ROM Failure’ held in a ‘Latch’.
- Other logic: ~13 Slices.
Total Size of each Module: 1 BRAM, 2 DSP48E1, ~110 Slices.
The ‘Proton Beam’ for Your Desk!
Target Device : xc7k325t
Design Summary
--------------
Number of occupied Slices: 44,405 out of 50,950 87%
Number of RAMB36E1/FIFO36E1s: 411 out of 445 92%
Number of RAMB18E1/FIFO18E1s: 4 out of 890 1%
Number of DSP48E1s: 800 out of 840 95%
For an XC7K325T, each simulated SEU (arrow!) is equivalent to:-
19 Years at Sea Level New York
403 Days worst case anywhere on the surface of the Earth
14 Days worst case anywhere at 40,000ft
Today’s target…
DSP Circuits ~52%
PicoBlaze circuits ~40%
ROM CRC circuits ~7%
SEM IP and system controller ~1%
Results From My Desk!
500 simulated SEU equivalent to 18 Years of worst case continuous operation at 40,000ft
Each dot represents a frame in which an error was injected. Red dots represent upsets that resulted in disturbance to operation of a ‘break me’ circuit.
Most SEU have no effect
Different circuits have different susceptibility…
Design Feature                  Relative Susceptibility
DSP circuits                    59%
PicoBlaze circuits              17%
ROM CRC calculator              24%
SEM IP and system controller    0%
(Percentage of observed disruptions to operation normalised to area occupied by feature)
Simulating SEU in your design helps you to observe the susceptibility of your circuits and focus on the effects on the important ones.
‘Break Me’ – Designed To Break AND Report It!
[Diagram: two identical DSP circuits (Counter, 256×18 ROM, LFSR, DSP48E1) whose outputs are compared (=) and held in a ‘Latch’ as ‘DSP Failure’]
100% known input data!
Matching 48-bit results every clock cycle
‘Latches’ any difference between two identical circuits.
Just 1-bit for 1-clock cycle is captured and reported.
However, a real DSP algorithm (e.g. FIR filter or FFT)….
- Computes results for sets of data samples which are unknown variables.
- Most calculation errors will be completely indistinguishable from signal noise.
- The upset will be temporary (e.g. <23ms).
- Naturally ‘flushes’ with clean data and results following correction.
Very low probability of any meaningful or observable disturbances.
‘Break Me’ – PicoBlaze Susceptibility or ‘DVF’
PicoBlaze + interfacing logic = ~50 Slices (similar to a typical application)
400 PicoBlaze circuits occupy ~40% of the XC7K325T device
500 simulated SEU resulted in 16 disturbances to PicoBlaze operation.
PicoBlaze circuit susceptibility = (100% / 40%) × (16/500) = 0.08
i.e. Only 1 in 12.5 SEU landing within the area occupied by a PicoBlaze circuit has an effect.
Design Vulnerability Factor (DVF) = 8%
Nominal SEU rate of XC7K325T device is 6,087 FIT (19 Years)
One (1) PicoBlaze circuit occupies ~0.1% of XC7K325T Slices
1 PicoBlaze circuit = 6,087 × 0.1% × 8% = 0.49 FIT (234,424 Years)
Anywhere on Earth (17×): PicoBlaze circuit = 8 FIT (13,789 Years)
40,000ft anywhere (500×): PicoBlaze circuit = 245 FIT (469 Years)
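The DVF arithmetic on this slide can be checked in a few lines. A minimal Python sketch using the slide's measured values; the function names are illustrative and not part of any Xilinx tool:

```python
def design_vulnerability_factor(disturbances, injected, area_fraction):
    """Fraction of upsets landing inside a circuit's area that disturb it."""
    return (disturbances / injected) / area_fraction

def circuit_fit(device_fit, circuit_area_fraction, dvf, flux=1.0):
    """FIT contribution of one circuit, optionally scaled for environment."""
    return device_fit * circuit_area_fraction * dvf * flux

def years_between_events(fit):
    """Convert FIT (events per 10^9 hours) into mean years between events."""
    return 1e9 / fit / (24 * 365)

# 16 disturbances from 500 injected upsets, circuits covering ~40% of the device
dvf = design_vulnerability_factor(16, 500, 0.40)
for label, flux in [("Nominal", 1), ("Anywhere on Earth", 17), ("40,000ft", 500)]:
    fit = circuit_fit(6087, 0.001, dvf, flux)
    print(f"{label}: {fit:.2f} FIT, {years_between_events(fit):,.0f} years")
```

The scaled rows differ slightly from the slide in the last digits because the slide rounds 0.49 FIT before multiplying.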
Categorisation of Events
Observed results for a variety of real applications
( normalised for device utilisation )
100% Detection
60-80% Completely miss the design
- These upsets will never impact operation
- But all SEU are detected and reported
10-40% ‘Touch’ the design but either…
- Have no effect on operation at all, or
- No effect could be observed.
<10% will be observed to have any effect.
- 2-5% is a typical observation rate.
- E.g. PicoBlaze ~8% (in ‘Break Me’ design)
<1% Impact product functionality.
(i.e. The ones that actually matter)
Typical Design Operational Disturbance Rates
Kintex 7K325T                  SEU Detection Rate   Operational Disturbance Rate
Nominal                        19 Years             190 to 950 Years
Anywhere on Earth (17×)        403 Days             10 to 51 Years
Anywhere at 40,000ft (500×)    14 Days              137 Days to 2 Years
(Continuous operation of >80% utilized device)
How Do So Many SEU Miss My Design?
‘Break Me’ design fills ~90% of the device but what does ‘used’ actually mean?
In a typical real design only 20% to 40% of configuration bits are ‘used’.
So that means 60% to 80% of upsets miss the design altogether (false alarms?)
What Happens to ‘X’..... If…
[Schematic: flip-flops (D, Q, CE, R) and a 3-input LUT (I0, I1, I2, O) with signals Enable, A, B, C and Reset, producing X]
Nothing Happens to ‘X’ Unless…
[Schematic: flip-flops (D, Q, CE, R) and a 3-input LUT (I0, I1, I2, O) with signals Enable, A, B, C and Reset, producing X]
- Enable = ‘1’
- B = ‘0’ and C = ‘1’
- Reset = ‘0’
- A changes state
- When the upset is present (e.g. a 23ms ‘window’)
Risk Assessment – Whole Device
Let’s take a look at the XC7K325T which is a mid-range Kintex-7 device
326,000 logic cells (i.e. not small!)
75.1Mb of static configuration
16.5Mb of available user RAM
0.5Mb of available user flip-flops
[Diagram scale marker: 1Mb]
Risk Assessment – Resources Actually Used
Every design is different so obviously better to work with actual values.
But let’s accept some typical figures for now…
30Mb of used or ‘Essential Bits’
1-7Mb of ‘Critical Bits’
12Mb of used RAM
0.15Mb of used flip-flops
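Setting these typical figures against the 75.1Mb of static configuration quoted for the XC7K325T gives a feel for the proportions. A minimal Python sketch; the names are illustrative and the inputs are the slides' typical values, not measurements of any particular design:

```python
STATIC_CONFIG_MB = 75.1         # XC7K325T static configuration
ESSENTIAL_MB = 30.0             # used or 'Essential Bits'
CRITICAL_MB_RANGE = (1.0, 7.0)  # 'Critical Bits'

def share(bits_mb, total_mb=STATIC_CONFIG_MB):
    """Fraction of the static configuration taken up by a group of bits."""
    return bits_mb / total_mb

essential_share = share(ESSENTIAL_MB)
critical_shares = tuple(share(mb) for mb in CRITICAL_MB_RANGE)
print(f"Essential: {essential_share:.0%}")
print(f"Critical: {critical_shares[0]:.1%} to {critical_shares[1]:.1%}")
```

On these numbers about 40% of the configuration bits are essential and roughly 1% to 9% are critical, in line with the 20% to 40% ‘used’ and <10% observed-effect figures quoted in this deck.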
Risk Assessment – Not all flip-flops are the same!
Flip-flops and RAM can easily be 10 to 4,000 times more susceptible…
30Mb of used or ‘Essential Bits’
1-7Mb of ‘Critical Bits’
12Mb of used RAM
0.15Mb of used flip-flops
0.15Mb of flip-flops fabricated using a typical ASIC process
Know what the figures are and what they mean before you make a decision
Being Practical Begins and Ends With UG116
Use ‘the known’ to deal with the unknown!
http://www.xilinx.com/support/documentation/user_guides/ug116.pdf
Always use the latest version