Dependability Benchmarking of VLSI Circuits
Cristian Constantinescu
cristian.constantinescu@intel.com
Intel Corporation



Outline

Neutron SER characterization of microprocessors
    SER scaling trends
    Experimental set-up
    Experimental results
Other sources of errors
    Memory intermittent faults
    Front side bus intermittent faults
Using environmental tests as dependability benchmarking tools
    Temperature and Voltage Operating Test
    ESD Operating Test
Summary
Backup
    Linpack benchmark
    References

Acknowledgement
Neutron SER characterization: Bruce Takala, Steve Wander (LANSCE), Nelson Tam, Pat Armstrong (Intel Corp.)
Environmental testing: John Blair, Scott Scheuneman (Intel Corp.)


Neutron SER Characterization of Microprocessors


Single Event Upsets

Single event upsets (SEU) are induced by
    Alpha particles, generated during radioactive decay of the package and interconnect materials
    Neutrons, protons and pions, generated by cosmic rays penetrating the atmosphere
SEU may induce errors both in storage elements and combinational logic
Frequency of occurrence of the particle-induced errors: soft error rate (SER)

[Figure: ion path crossing a junction biased at +V, generating electron-hole pairs. Labels: depletion region, drift, diffusion, ion path, regions I, II, III.]


SER Scaling Trends

[Figure, two panels: "SRAM SER per bit and chip" and "Latch SER per bit and chip". X axis: technology generation (0.18 um, 0.13 um, 90 nm, 65 nm). Y axis: error rate (AU, log scale). Series: per bit, per chip.]

Assumption: SRAM/latch count increases ~2x per generation


Hadron Cascades

Neutrons represent 94% of the hadrons reaching sea level
For terrestrial applications it makes sense to benchmark the impact of neutron SER

Main constituents of atmospheric hadron cascades

Altitude     Neutrons   Pions   Protons
Sea level    94%        3%      3%
32,000 ft    52%        36%     12%


LANSCE Neutron Beam

Los Alamos Neutron Science Center (LANSCE)
    Generates high-energy neutrons by spallation: a linear accelerator generates a pulsed proton beam that strikes a tungsten target

[Figure: Energy dependence of the natural cosmic-ray neutron flux and the LANSCE neutron flux. X axis: neutron energy (MeV), 1 to 1000. Y axis: n/MeV/cm2/sec, 1 to 1,000,000. Series: LANSCE flux; natural flux x 10^6.]


Experimental Set Up

Itanium processor based server
Windows NT 4.0 operating system
Linpack benchmark
    Performs matrix computations
    Derives residues, which can reveal silent data corruption (SDC)
Fission ion chamber to determine neutron fluence


Deriving MTTF

MTTF = Tua / U
    Tua: duration of an equivalent experiment taking place in unaccelerated conditions [h]
    U: total number of upsets (failures) over the duration of the experiment

Tua = (Fcp * Nc) / Nf
    Fcp: total number of fission chamber pulses over the duration of the experiment
    Nc: average neutron conversion factor [neutrons/fission pulse/cm2]
    Nf: cosmic-ray induced neutron flux at the desired geographical location and altitude [neutrons/cm2/h]
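
The two formulas above can be chained in a few lines. The following is a minimal sketch, not the author's tooling: the function and parameter names are made up for illustration, and the input values are placeholders chosen only to make the arithmetic easy to follow. They are not measurements from the experiment.

```python
def unaccelerated_hours(fission_pulses, neutrons_per_pulse_cm2, natural_flux_cm2_h):
    """Tua = (Fcp * Nc) / Nf: equivalent test duration at natural flux, in hours."""
    if natural_flux_cm2_h <= 0:
        raise ValueError("natural neutron flux must be positive")
    return fission_pulses * neutrons_per_pulse_cm2 / natural_flux_cm2_h


def mttf_hours(fission_pulses, neutrons_per_pulse_cm2, natural_flux_cm2_h, upsets):
    """MTTF = Tua / U, in hours."""
    if upsets <= 0:
        raise ValueError("at least one upset is needed to estimate MTTF")
    t_ua = unaccelerated_hours(fission_pulses, neutrons_per_pulse_cm2, natural_flux_cm2_h)
    return t_ua / upsets


# Placeholder inputs (assumptions, not data from the slides).
example_mttf = mttf_hours(
    fission_pulses=1_000_000,       # Fcp
    neutrons_per_pulse_cm2=280.0,   # Nc
    natural_flux_cm2_h=14.0,        # Nf
    upsets=40,                      # U
)
print(f"MTTF = {example_mttf:.0f} h")
```

Because Nf depends on location and altitude, the same beam exposure yields a different MTTF for each deployment site; only Nf needs to change.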


Experimental Results

Run Linpack benchmark for square matrices of size 800 and 1000
Completed 40 runs
Duration of one run: 10 s to 5 min
Failure types
    Blue screen
    Hang
    Silent data corruption (SDC)


Experimental Results

[Figure: Itanium processor MTTF due to neutrons, as a function of number of runs. X axis: number of runs, 0 to 40. Y axis: MTTF (a.u.), 0 to 100.]


Experimental Results

MTTF confidence intervals, MTTF (a.u.)

Confidence level   20 upsets   30 upsets   40 upsets
80%                52, 103     56, 97      58, 91
90%                44, 110     50, 103     53, 96

SDC: one event
    Insufficient for statistical analysis
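
The slides do not say how the MTTF confidence intervals were computed. One common approach, sketched here as an assumption, treats the time between upsets as exponentially distributed and uses chi-square bounds for a test that ends at the last observed failure. To stay within the standard library, the chi-square quantile is approximated with the Wilson-Hilferty formula. The function names are illustrative, and the output is not expected to reproduce the table exactly.

```python
from statistics import NormalDist


def chi2_quantile(p, dof):
    """Approximate chi-square quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


def mttf_interval(mttf_estimate, upsets, confidence):
    """Two-sided MTTF interval, assuming exponential times between upsets."""
    alpha = 1.0 - confidence
    dof = 2 * upsets
    total_time = mttf_estimate * upsets  # accumulated equivalent test time
    lower = 2.0 * total_time / chi2_quantile(1.0 - alpha / 2.0, dof)
    upper = 2.0 * total_time / chi2_quantile(alpha / 2.0, dof)
    return lower, upper


# Intervals relative to a point estimate of 1.0 (arbitrary units).
for n in (20, 30, 40):
    low, high = mttf_interval(1.0, n, 0.80)
    print(f"{n} upsets, 80%: {low:.2f} to {high:.2f}")
```

As in the table, the interval tightens as more upsets are observed and widens when the confidence level is raised.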


Practical Considerations

Error handling techniques differ greatly from one manufacturer to another
    HW error detection and correction, e.g. ECC, is faster
    FW/SW implemented recovery may be overwhelmed by an accelerated test (near coincident faults scenario)
    Acceleration factor is an important variable
Failure prediction and automatic deconfiguration may lead to misleading results
Multiple experiments
Beam divergence
Beam attenuation


Other Sources of Errors


Memory Intermittent Faults

Intermittent faults are induced by unstable or marginal hardware
    Intermittent shorts/opens
    Manufacturing residuals
    Timing faults

[Figure: Number of memory single-bit errors reported by 193 systems over 16 months. X axis: number of errors (0, 1 to 5, 6 to 10, 11 to 50, 51 to 100, 101 to 1000, >1000). Y axis: number of systems, 0 to 90.]

[Figure: Daily number of memory single-bit errors reported by one system over 16 months. X axis: date, 5/1/99 to 7/1/00. Y axis: number of errors, 0 to 30.]


Front Side Bus Intermittent Faults

Front side bus (FSB) errors
    Bursts of single-bit errors (SBE) on data path
    SBE detected and corrected (data path protected by ECC)
Failure analysis results
    Intermittent contacts at solder joints
Fault injection showed that similar faults experienced by control signals induce SDC

Server 1                     Server 2
P0     P1    P2    P3        P0     P1     P2     P3
3264   15    0     0         108    121    97     101
7104   20    0     0         -      -      -      -


Using Environmental Tests as Dependability Benchmarking Tools


Temperature and Voltage Operating Test

Ten systems were tested
Workload: Linpack benchmark

Profile of the test
[Figure: test profile. Temperature levels: 70 °C, 25 °C, -10 °C. Voltage steps: nominal, -5%, -6%, -10%, +5%, +6%, +10%, nominal. The voltage sequence appears twice in the figure, each time labeled 30 hours.]

9 systems experienced SDC
SDC events: 134 (90.5%)
Detected errors: 14 (9.5%)
SDC preceded detected errors


Temperature and Voltage Operating Test

Distribution of the SDC events
[Figure: distribution of the SDC events. X axis: voltage (nominal, -5%, -6%, -10%, +5%, +6%, +10%). Y axis: temperature, -10 to 70 °C. Data labels recovered from the figure: 10, 92, 16, 7.]

Failure analysis results
    Memory controller setup and hold-time violations


ESD Operating Test

4 servers from 2 manufacturers
Workload: Linpack benchmark
30 test points per server
20 positive and 20 negative discharges per test point
Air discharge: 4 kV to 15 kV
Contact discharge: 8 kV

One server experienced SDC
    8% of the discharges targeted to the disk bay area (15 kV, air)

First ESD operating test to reveal SDC in a commercially available server


Summary

The need for dependability benchmarking is increasing
    Wider use of COTS components in critical applications
    Technology is a double-edged sword
        Higher performance
        Higher rates of occurrence of transient and intermittent faults
SDC is a real threat
    We take for granted the correctness of the computer data
    Dependability benchmarks should determine whether the circuits/systems under evaluation experience SDC
Fault injection techniques require in-depth knowledge of the evaluated system
    Appropriate for designers and manufacturers
Accelerated neutron tests and environmental tests are a “black box approach”
    Capable of unveiling SDC
    In-depth knowledge of the system under test is not required
    Linpack benchmark is available for free
    Can be used both by manufacturers and independent evaluators


Backup


Linpack Benchmark

Example of Linpack output: large residues indicate SDC
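
The idea of using residues to expose SDC can be illustrated with a small stand-in for the benchmark. This is a sketch under stated assumptions, not Linpack itself: it solves a 3x3 system in pure Python, uses a plain absolute residual rather than the normalized residual the real benchmark reports, and the tolerance of 1e-9 is an arbitrary illustrative choice. All function names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data stays untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def max_residual(a, x, b):
    """Largest absolute entry of a x - b."""
    return max(
        abs(sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i)
        for row, b_i in zip(a, b)
    )


def looks_corrupted(a, x, b, tolerance=1e-9):
    """Flag a result whose residual exceeds the (assumed) tolerance."""
    return max_residual(a, x, b) > tolerance


a = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
b = [7.0, 7.0, 9.0]
x = solve(a, b)

corrupted = x[:]
corrupted[1] += 1e-3  # simulate an undetected upset in one result element

print(looks_corrupted(a, x, b), looks_corrupted(a, corrupted, b))
```

The check needs no knowledge of the hardware under test: the program completes without any reported error, and only the residual reveals that the data is wrong.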


References

“Neutron SER characterization of microprocessors”, Proc. of the International Conference on Dependable Systems and Networks, Yokohama, Japan, June 2005, pp. 754-759.

“Dependability benchmarking using environmental test tools”, Proc. of the Reliability and Maintainability Symposium, Alexandria, VA, USA, January 2005, pp. 567-571.

“Impact of deep submicron technology on dependability of VLSI circuits”, Proc. of the International Conference on Dependable Systems and Networks, Washington, DC, USA, June 2002, pp. 205-209.