Dependability Benchmarking of VLSI Circuits
Cristian Constantinescu
Intel Corporation
Outline
  Neutron SER characterization of microprocessors
    SER scaling trends
    Experimental set-up
    Experimental results
  Other sources of errors
    Memory intermittent faults
    Front side bus intermittent faults
  Using environmental tests as dependability benchmarking tools
    Temperature and Voltage Operating Test
    ESD Operating Test
  Summary
  Backup
    Linpack benchmark
    References

Acknowledgement
  Neutron SER characterization: Bruce Takala, Steve Wander (LANSCE), Nelson Tam, Pat Armstrong (Intel Corp.)
  Environmental testing: John Blair, Scott Scheuneman (Intel Corp.)
Neutron SER Characterization of Microprocessors
Single Event Upsets
Single event upsets (SEU) are induced by
  Alpha particles, generated during radioactive decay of the package and interconnect materials
  Neutrons, protons, and pions, generated by cosmic rays penetrating the atmosphere
SEU may induce errors in both storage elements and combinational logic
Frequency of occurrence of the particle-induced errors: soft error rate (SER)
[Figure: ion path through a biased (+V) device. Labels: depletion region, drift, diffusion, +/- charge pairs along the ion path, regions I, II, III]
SER Scaling Trends
SRAM SER per bit and per chip
Latch SER per bit and per chip
Assumption: SRAM/latch count increases ~2x per generation
[Figure, two panels: error rate (AU, log scale) vs. technology generation (0.18 um, 0.13 um, 90 nm, 65 nm). Curves: per bit, per chip]
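The per-bit and per-chip trends are linked by the bit count: per-chip SER is per-bit SER multiplied by the number of bits. A minimal sketch of that relation follows. The per-bit values are hypothetical placeholders in arbitrary units, since the slides give no numbers; only the ~2x per generation growth comes from the slide.

```python
# Hypothetical per-bit SER values in arbitrary units (NOT taken from the slides).
GENERATIONS = ["0.18 um", "0.13 um", "90 nm", "65 nm"]
PER_BIT_SER_AU = [1.0, 0.8, 0.6, 0.5]


def per_chip_ser(per_bit, growth=2.0):
    """Per-chip SER relative to the first generation.

    Each generation is assumed to hold `growth` times more bits than the
    previous one, so the relative bit count of generation g is growth**g.
    """
    return [ser * growth ** gen for gen, ser in enumerate(per_bit)]


if __name__ == "__main__":
    chip = per_chip_ser(PER_BIT_SER_AU)
    for name, bit_au, chip_au in zip(GENERATIONS, PER_BIT_SER_AU, chip):
        print(f"{name:>8}: per bit {bit_au:.2f} AU, per chip {chip_au:.2f} AU")
```

Under this assumption the per-chip rate rises whenever the per-bit rate falls by less than 2x per generation.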
Hadron Cascades
Neutrons represent 94% of the hadrons reaching sea level
For terrestrial applications it makes sense to benchmark the impact of neutron SER

Main constituents of atmospheric hadron cascades
Altitude     Neutrons   Pions   Protons
Sea level    94%        3%      3%
32,000 ft    52%        36%     12%
LANSCE Neutron Beam
Los Alamos Neutron Science Center (LANSCE)
Generates high-energy neutrons by spallation: a linear accelerator generates a pulsed proton beam that strikes a tungsten target
[Figure: Energy dependence of the natural cosmic-ray neutron flux and the LANSCE neutron flux. Axes: neutron energy (MeV), 1 to 1000, vs. flux (n/MeV/cm2/sec), 1 to 1000000. Curves: LANSCE flux; natural flux x 10^6]
Experimental Set-Up
Itanium processor based server
Windows NT 4.0 operating system
Linpack benchmark
  Performs matrix computations
  Derives residues, which can detect silent data corruption (SDC)
Fission ion chamber to determine neutron fluence
Deriving MTTF
MTTF = Tua / U
  Tua: duration of an equivalent experiment taking place in unaccelerated conditions [h]
  U: total number of upsets (failures) over the duration of the experiment
Tua = (Fcp * Nc) / Nf
  Fcp: total number of fission chamber pulses over the duration of the experiment
  Nc: average neutron conversion factor [neutrons/fission pulse/cm2]
  Nf: cosmic-ray induced neutron flux at the desired geographical location and altitude [neutrons/cm2/h]
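The two formulas can be chained in a few lines. This is a minimal sketch; the function names and the numbers in the usage example are hypothetical and are not measurements from the experiment.

```python
def equivalent_unaccelerated_hours(fission_pulses, conversion_factor, natural_flux):
    """Tua = (Fcp * Nc) / Nf, in hours.

    fission_pulses    -- Fcp, fission chamber pulses counted during the test
    conversion_factor -- Nc, neutrons per fission pulse per cm2
    natural_flux      -- Nf, neutrons per cm2 per hour at the target location
    """
    if natural_flux <= 0:
        raise ValueError("natural flux must be positive")
    return fission_pulses * conversion_factor / natural_flux


def mttf_hours(fission_pulses, conversion_factor, natural_flux, upsets):
    """MTTF = Tua / U, in hours of unaccelerated operation."""
    if upsets <= 0:
        raise ValueError("MTTF needs at least one observed upset")
    t_ua = equivalent_unaccelerated_hours(fission_pulses, conversion_factor, natural_flux)
    return t_ua / upsets


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to make the arithmetic easy to follow.
    print(mttf_hours(fission_pulses=2_000_000, conversion_factor=100.0,
                     natural_flux=20.0, upsets=40))
```

Because Nf depends on location and altitude, the same beam data yields a different MTTF for each deployment site.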
Experimental Results
Run Linpack benchmark for square matrices of size 800 and 1000
Completed 40 runs
Duration of one run: 10 s to 5 min
Failure types
  Blue screen
  Hang
  Silent data corruption (SDC)
[Figure: Itanium processor MTTF due to neutrons (a.u., 0 to 100), as a function of number of runs (0 to 40)]
MTTF confidence intervals, MTTF (a.u.)
Confidence level   20 upsets   30 upsets   40 upsets
80%                52, 103     56, 97      58, 91
90%                44, 110     50, 103     53, 96
SDC: one event, insufficient for statistical analysis
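The slides do not say how the intervals were computed. One common approach, assumed here, treats upsets as Poisson events over a fixed exposure and derives an exact two-sided interval for the expected count, which is then converted into MTTF bounds. The sketch below uses only the standard library; all function names are illustrative, and it is not claimed to reproduce the table above.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean)."""
    if k < 0:
        return 0.0
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return min(total, 1.0)


def _solve_decreasing(func, target, lo, hi, iters=200):
    """Bisection for func(x) == target, with func decreasing on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if func(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def expected_upsets_interval(upsets, confidence):
    """Exact two-sided interval for the Poisson mean, given an observed count."""
    alpha = 1.0 - confidence
    bracket = 2.0 * upsets + 50.0  # generous upper search limit
    lower = 0.0
    if upsets > 0:
        # P(X >= upsets) = alpha/2  <=>  P(X <= upsets - 1) = 1 - alpha/2
        lower = _solve_decreasing(lambda m: poisson_cdf(upsets - 1, m),
                                  1.0 - alpha / 2.0, 0.0, bracket)
    # P(X <= upsets) = alpha/2
    upper = _solve_decreasing(lambda m: poisson_cdf(upsets, m),
                              alpha / 2.0, 0.0, bracket)
    return lower, upper


def mttf_interval(mttf_estimate, upsets, confidence):
    """(lower, upper) MTTF bounds around the point estimate MTTF = T / upsets."""
    if upsets <= 0:
        raise ValueError("at least one upset is required")
    lower_mean, upper_mean = expected_upsets_interval(upsets, confidence)
    total_time = mttf_estimate * upsets
    return total_time / upper_mean, total_time / lower_mean
```

For a fixed point estimate the interval narrows as the number of upsets grows, which is the same qualitative behavior the table shows between 20 and 40 upsets.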
Practical Considerations
Error handling techniques differ greatly from one manufacturer to another
  HW error detection and correction, e.g. ECC, is faster
  FW/SW implemented recovery may be overwhelmed by an accelerated test (near coincident faults scenario)
  Acceleration factor is an important variable
Failure prediction and automatic deconfiguration may lead to misleading results
Multiple experiments
  Beam divergence
  Beam attenuation
Other Sources of Errors
Memory Intermittent Faults
Intermittent faults are induced by unstable or marginal hardware
  Intermittent shorts/opens
  Manufacturing residuals
  Timing faults
[Figure: Number of memory single-bit errors reported by 193 systems over 16 months. Axes: number of errors (bins: 0, 1 to 5, 6 to 10, 11 to 50, 51 to 100, 101 to 1000, >1000) vs. number of systems (0 to 90)]
[Figure: Daily number of memory single-bit errors reported by one system over 16 months. Axes: date (ticks from 5/1/99 to 7/1/00) vs. number of errors (0 to 30)]
Front Side Bus Intermittent Faults
Front side bus (FSB) errors
  Bursts of single-bit errors (SBE) on data path
  SBE detected and corrected (data path protected by ECC)
Failure analysis results
  Intermittent contacts at solder joints
Fault injection showed that similar faults experienced by control signals induce SDC

Server 1                      Server 2
P0     P1     P2     P3       P0     P1     P2     P3
3264   15     0      0        108    121    97     101
7104   20     0      0        -      -      -      -
Using Environmental Tests as Dependability Benchmarking Tools
Temperature and Voltage Operating Test
Ten systems were tested
Workload: Linpack benchmark
Profile of the test
[Figure: test profile. Voltage steps: Nominal, -5%, -6%, -10%, +5%, +6%, +10%, Nominal (30 hours), appearing twice in the figure. Temperature levels: 70 °C, 25 °C, -10 °C]
9 systems experienced SDC
  SDC events: 134 (90.5%)
  Detected errors: 14 (9.5%)
SDC preceded detected errors
Distribution of the SDC events
[Figure: SDC events by voltage (Nominal, -5%, -6%, -10%, +5%, +6%, +10%) and temperature (°C, axis from -10 to 70). Data labels: 10, 92, 16, 7]
Failure analysis results
  Memory controller setup and hold-time violations
ESD Operating Test
4 servers from 2 manufacturers
Workload: Linpack benchmark
30 test points per server
20 positive and 20 negative discharges per test point
Air discharge: 4 kV to 15 kV
Contact discharge: 8 kV
One server experienced SDC
  8% of the discharges targeted to the disk bay area (15 kV, air)
First ESD operating test to reveal SDC in a commercially available server
Summary
The need for dependability benchmarking is increasing
  Wider use of COTS components in critical applications
Technology is a double-edged sword
  Higher performance
  Higher rates of occurrence of transient and intermittent faults
SDC is a real threat
  We take the correctness of computer data for granted
  Dependability benchmarks should determine whether the circuits/systems under evaluation experience SDC
Fault injection techniques require in-depth knowledge of the evaluated system
  Appropriate for designers and manufacturers
Accelerated neutron tests and environmental tests are a "black box approach"
  Capable of unveiling SDC
  In-depth knowledge of the system under test is not required
  Linpack benchmark is available for free
  Can be used both by manufacturers and independent evaluators
Backup
Linpack Benchmark
Example of Linpack output: large residues indicate SDC
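The residue check can be illustrated without the benchmark itself. The sketch below solves a small linear system, computes a normalized residual in the style of Linpack reports (the exact formula used by the benchmark build in the experiment is an assumption), and then perturbs one solution element to emulate a silent corruption. The test matrix and the size of the perturbation are hypothetical.

```python
import sys


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented working copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def normalized_residual(a, x, b):
    """max|Ax - b| / (n * max|A| * max|x| * machine epsilon)."""
    n = len(b)
    resid = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
    norm_a = max(abs(v) for row in a for v in row)
    norm_x = max(abs(v) for v in x)
    return resid / (n * norm_a * norm_x * sys.float_info.epsilon)


def demo(n=8):
    """Return the normalized residual before and after an emulated corruption."""
    # Hypothetical, diagonally dominant test matrix; b is chosen so that the
    # exact solution is a vector of ones.
    a = [[(n + 1.0) if i == j else 1.0 / (1 + i + j) for j in range(n)]
         for i in range(n)]
    b = [sum(row) for row in a]
    x = solve(a, b)
    clean = normalized_residual(a, x, b)
    x[n // 2] += 1e-6  # emulate silent corruption of a single result
    corrupted = normalized_residual(a, x, b)
    return clean, corrupted


if __name__ == "__main__":
    clean, corrupted = demo()
    print(f"clean run: {clean:.3g}   corrupted run: {corrupted:.3g}")
```

A correct run leaves a residual of order one in these units, while even a small corruption raises it by many orders of magnitude, which is why the benchmark output can flag SDC without any knowledge of the hardware internals.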
References
“Neutron SER characterization of microprocessors”, Proc. of the International Conference on Dependable Systems and Networks, Yokohama, Japan, June 2005, pp. 754-759.
“Dependability benchmarking using environmental test tools”, Proc. of the Reliability and Maintainability Symposium, Alexandria, VA, USA, January 2005, pp. 567-571.
“Impact of deep submicron technology on dependability of VLSI circuits”, Proc. of the International Conference on Dependable Systems and Networks, Washington, DC, USA, June 2002, pp. 205-209.