CSE 591: Advances in Reliable Computing
Aviral Shrivastava
CML Web page: aviral.lab.asu.edu
Saving Galileo
1978 – Galileo commissioned for Jupiter exploration
1980 – Design and architecture decided
  Use of AT 2901 for attitude control
1982 – Voyager reaches Jupiter
  Intermittent resets
  Sulfur ions from Jupiter’s volcanic moon were being whipped up to high energy by the Jovian gravity.
After extensive testing of Galileo, the chief engineer decided it was “not worth flying if soft error problem not solved”
Overheads: 5 years, 5 million dollars
  Sandia National Laboratories was subcontracted to custom-make a radiation-hardened AT 2901
Radiation Induced Soft Errors
Typical time constants of the induced current pulse: $1.64 \times 10^{-10}$ sec and $5.10 \times 10^{-11}$ sec
Induced current has a rapid rise time but a more gradual fall time
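
A pulse with a fast rise and a slower fall is often modeled as a double-exponential current source. The sketch below assumes that form, and assumes that the larger value above is the fall time constant and the smaller one the rise time constant. The function names and the 10 fC charge in the usage note are illustrative, not from the slides.

```python
import math

TAU_FALL = 1.64e-10  # seconds; assumed to be the slower (fall) time constant
TAU_RISE = 5.10e-11  # seconds; assumed to be the faster (rise) time constant


def induced_current(t, charge, tau_fall=TAU_FALL, tau_rise=TAU_RISE):
    """Double-exponential current (amperes) at time t (seconds).

    charge is the total collected charge in coulombs; the pulse
    integrates to this value over all t >= 0.
    """
    if t < 0:
        return 0.0
    scale = charge / (tau_fall - tau_rise)
    return scale * (math.exp(-t / tau_fall) - math.exp(-t / tau_rise))


def peak_time(tau_fall=TAU_FALL, tau_rise=TAU_RISE):
    """Time at which the current peaks, found by setting dI/dt = 0."""
    return (tau_fall * tau_rise / (tau_fall - tau_rise)) * math.log(tau_fall / tau_rise)
```

With these constants the peak arrives well before one fall time constant has elapsed, for example `induced_current(peak_time(), 10e-15)` gives the peak current for a hypothetical 10 fC strike, and the tail then decays over several multiples of `TAU_FALL`.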
It started with nuclear tests…
1954-57: Nuclear tests
  Electronic anomalies in monitoring equipment
  Could not be traced to any hardware fault
  Equipment worked properly after restart
1962: Wallmark and Marcus (RCA Labs, Princeton)
  “Minimum Size and Maximum Packing Density of Non-Redundant Semiconductor Devices,” March 1962
  Predicted that cosmic rays would start affecting microelectronics
1962: Telstar, the first communication satellite
  July 9, 1962: the United States tested a high-altitude nuclear device (called Starfish Prime), which super-energized the Earth's Van Allen Belt where Telstar took orbit
  100X increase in radiation
  Rendered the satellite inoperable
  Worked after reboot
Radioactive Contamination
1978: Intel could not deliver chips to AT&T to upgrade its switching system from mechanical relays to ICs
  May and Woods traced the problem to packaging
  Packaging modules were contaminated with uranium from an old uranium mine upstream.
  Also proposed the Q_critical model of soft errors
    The charge accumulated from a particle strike must exceed Q_critical to cause a fault.
1986-87: IBM faced problems of radioactive contamination
  Traced the problem to a distant chemical plant that used a radioactive contaminant to clean bottles that were used to store an acid required in the chip manufacturing process.
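
The Q_critical criterion above amounts to a threshold test. The sketch below assumes the common first-order estimate Q_critical ≈ C_node × V_dd, which is not stated on the slides; the capacitance, voltage, and charge values are hypothetical.

```python
def critical_charge(node_capacitance_f, supply_voltage_v):
    """First-order estimate: Q_critical ~ C_node * V_dd (an assumption, not from the slides)."""
    return node_capacitance_f * supply_voltage_v


def causes_fault(collected_charge_c, q_critical_c):
    """A strike upsets the node only if the collected charge exceeds Q_critical."""
    return collected_charge_c > q_critical_c


# Hypothetical node: 2 fF at 1.0 V gives a Q_critical of about 2 fC.
q_crit = critical_charge(2e-15, 1.0)
print(causes_fault(1e-15, q_crit))  # a 1 fC strike stays below the threshold
print(causes_fault(5e-15, q_crit))  # a 5 fC strike exceeds the threshold
```

Under this estimate, lowering either the node capacitance or the supply voltage lowers the threshold, so smaller strikes become sufficient to cause a fault.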
History of Radiation-induced SERs
1979: Ziegler and Lanford
  Presented solid evidence that the sensitivity of electronics to radiation-induced soft errors could become a nightmare for future technologies.
  Predicted that soft errors due to cosmic radiation would increase with altitude
1995: Baumann et al.
  Soft errors caused by Boron-10 isotopes activated by low-energy atmospheric neutrons.
1996: Normand
  Documented strikes in large servers found in error logs
  Discovered that memory error rates correlated very significantly with the altitude of the computers, and attributed them to soft errors (Z&L)
  High in servers in Los Alamos, and in fighter planes.
  “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.
Here comes the Sun…
11-year solar cycle of sunspots
  Major solar storms this year and next
$10^{9}$ kg/s of material lost by the Sun as ejected solar wind.
  Protons (~70%), electrons, ionized helium, less than 0.5% minor ions.
  $2 \times 10^{10}$ protons/cm$^{2}$
Loss of satellites
Fault, Error and Failure
FAULT (Physical Universe: physical entities making up a system)
  A physical defect that occurs within hw or sw components (HW defect, SW bug)
  Activation, after a fault latency, leads to an error
ERROR (Informational Universe: units of information, e.g. data words)
  A deviation from accuracy or correctness; the manifestation of a fault
  Propagation, after an error latency, leads to a failure
FAILURE (External Universe: the user of a system ultimately sees the effects)
  Nonperformance of some action that is due or expected; a malfunction
[Geffroy, 02] Jean-Claude Geffroy and Gilles Motet, “Design of Dependable Computing Systems”, Kluwer Academic Publishers, 2002, ISBN 1-4020-0437-0
Electrical Masking
Pulse attenuated by electrical resistance in the circuit
Pulse still strong enough to be latched at output
Single Event Latchup
SEL: Single Event Latchup
  Parasitic circuit elements forming a silicon controlled rectifier (SCR)
  Potentially destructive
    The device current may destroy the device if not current limited and removed "in time."
  Removal of power to the device is required in all non-catastrophic SEL conditions in order to recover device operations.
  SEL probability increases with temperature!
Logical Masking
Value unchanged at the gate
Logical Masking
Error propagated to the output
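
The two logical masking cases (value unchanged at the gate versus error propagated to the output) can be illustrated with a small sketch. The helper names and the choice of a two-input AND gate are illustrative assumptions, not from the slides.

```python
def and_gate(a, b):
    """Two-input AND on single bits."""
    return a & b


def is_logically_masked(gate, inputs, flipped_index):
    """Return True if flipping one input bit leaves the gate output unchanged."""
    faulty = list(inputs)
    faulty[flipped_index] ^= 1  # inject a single bit flip on the chosen input
    return gate(*inputs) == gate(*faulty)


# The other input is 0, so it holds the AND output at 0: the flip is masked.
print(is_logically_masked(and_gate, (1, 0), 0))
# The other input is 1, so the flip reaches the output: the error propagates.
print(is_logically_masked(and_gate, (1, 1), 0))
```

Whether a flip is masked depends only on the values of the other inputs at that moment, which is why the same fault can be harmless in one cycle and visible in the next.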
Temporal Masking
Transient Fault → Soft Error
A transient pulse at the latching window:
1) Before t_setup: masked (not latched)
2) After t_setup, before t_hold: race condition
3) At the latching window: not masked (latched)
[Firouzi ROCS 2010]
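
A simplified interpretation of temporal masking is sketched below. It assumes the latching window runs from `t_clk - t_setup` to `t_clk + t_hold`, that a pulse covering the whole window is latched, and that a pulse overlapping only part of it is a race whose outcome is uncertain. The function name and the three labels are illustrative; times are integers in picoseconds.

```python
def classify_pulse(pulse_start, pulse_end, t_clk, t_setup, t_hold):
    """Classify a transient pulse against the latching window around a clock edge."""
    window_open = t_clk - t_setup
    window_close = t_clk + t_hold
    # No overlap with the window at all: the flip-flop never sees the pulse.
    if pulse_end <= window_open or pulse_start >= window_close:
        return "masked"
    # Pulse is stable across the entire window: the wrong value is captured.
    if pulse_start <= window_open and pulse_end >= window_close:
        return "latched"
    # Partial overlap: the input changes inside the window.
    return "race"


# Clock edge at 1000 ps, 50 ps setup, 30 ps hold: window is 950 ps to 1030 ps.
print(classify_pulse(800, 900, 1000, 50, 30))    # ends before the window opens
print(classify_pulse(940, 1040, 1000, 50, 30))   # spans the whole window
print(classify_pulse(1000, 1100, 1000, 50, 30))  # starts inside the window
```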
Soft Error Trends
DRAM: system error rate of DRAMs is fairly constant
SRAM: increasing exponentially
Logic: increasing exponentially
Increasing Soft Error Rates
Reducing feature sizes and lower supply voltage
  Decreasing node capacitance and noise margins
  Q_critical reducing
  Exponentially more low-energy particles than high-energy ones
More transistors per chip
  More functionality is moving on-chip
  Higher probability of error due to more faults.
Increasing clock rates
  The window between setup and hold times is a larger fraction of the clock period, so transients are more likely to be latched
One Failure per Day per Chip
Soft error rates could increase from one error per year to one error per day in a decade!
[Shivakumar et al 2002]
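
As a rough check on the scale of that claim: going from one error per year to one error per day is a 365x increase. The sketch below assumes steady compound growth over the ten years, which the slide does not state, and computes the implied yearly growth factor.

```python
total_increase = 365.0  # one error per year -> one error per day
years = 10              # "in a decade"

# Under compound growth, the yearly factor g satisfies g ** years == total_increase.
annual_factor = total_increase ** (1.0 / years)
print(round(annual_factor, 2))
```

The result is a factor of roughly 1.8 per year, that is, the error rate would nearly double every year under this assumption.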
Processing and Packaging Solutions
Reduce the number of particles that strike
  Reduce upsets
Use of highly purified fabrication materials
  Remove traces of boron and heavy metals
Surround by metallic frame
  Reduce low-energy particles
  But neutrons can pass through > 10 ft of concrete
Process Technology Solutions
  Partially depleted SOI: no help after 250 nm
  Fully depleted SOI: very expensive
Transistor Level Techniques
□ Normally a CMOS inverter is sized with a 2:1 ratio between p- and n-channel devices
□ To compensate for the difference between electron and hole mobilities
□ Changing this ratio can increase the tolerance
Gate-Level Techniques
Some gates are more vulnerable than others
  Radiation hardened designs use NAND gates
  When all inputs are low, drive of p-stack is low, high leakage of n-transistors → rise in the output is slow → functional failure
Gate vulnerability may change by 5X depending on the input state
  NAND gate
    Extremely vulnerable when inputs are 10
    Not vulnerable when inputs are 00
How to synthesize to minimize vulnerability?
Circuit-Level Techniques
Adding resistance introduces additional time constants that filter out the very fast SEU-induced transients
  High temperature coefficients of poly-silicon resistors
  Difficult to control variation of resistance
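
The filtering effect of an added resistance can be sketched with a first-order RC low-pass model. This is a simplification under stated assumptions: the transient is treated as a rectangular voltage pulse, and the 100 kΩ and 10 fF values are hypothetical, chosen only to give a 1 ns time constant.

```python
import math


def filtered_peak(pulse_amplitude_v, pulse_width_s, resistance_ohm, capacitance_f):
    """Peak output of a first-order RC low-pass driven by a rectangular pulse.

    The output charges toward the pulse amplitude with time constant R*C
    and reaches its maximum when the pulse ends.
    """
    tau = resistance_ohm * capacitance_f
    return pulse_amplitude_v * (1.0 - math.exp(-pulse_width_s / tau))


R = 100e3   # ohms (hypothetical added resistance)
C = 10e-15  # farads (hypothetical node capacitance), so tau = 1 ns

short_transient = filtered_peak(1.0, 100e-12, R, C)  # 100 ps SEU-like pulse
normal_signal = filtered_peak(1.0, 10e-9, R, C)      # 10 ns legitimate transition
print(short_transient < 0.1, normal_signal > 0.99)
```

In this model the 100 ps transient is attenuated to under a tenth of its amplitude, while the 10 ns signal passes almost unchanged. The cost is that every legitimate transition is slowed by the same time constant.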
Copyright 2005, M. Tahoori
D-Cache: Flushing
4x reduction in vulnerability
D-Cache: Write Policy
10x reduction in vulnerability
D-Cache: Refresh
3x reduction in vulnerability using write-thru (30x total)
Replica Cache
PPC (Partially Protected Caches)
2 caches at the same level of memory hierarchy
  Main cache, and the protected mini-cache
Mini-cache
  Low power, low latency
  Timing slack to harden it
Compiler maps data to the two caches
  Map Failure-Critical (FC) data to the protected mini-cache
  Map Not Failure-Critical (FNC) data to the unprotected main cache
Intuition is to provide protection to only the FC data
  In multimedia applications, the multimedia data is NOT failure critical
  An error → loss in Quality of Service
How to use PPCs for general applications?
[Figure labels: Processor Pipeline, Unprotected Main Cache, Protected Mini Cache, Memory Controller, Page Mapping, Memory, FC, FNC, HPC, PPC]
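
The mapping of FC and FNC data to the two caches can be sketched at page granularity. Everything named below (`PAGE_SIZE`, `build_page_map`, `select_cache`, the region list) is hypothetical and only illustrates the idea; the slides do not specify how the mapping is represented.

```python
PAGE_SIZE = 4096  # bytes; hypothetical page size


def build_page_map(regions):
    """Mark each page as failure-critical (True) or not (False).

    regions is an iterable of (start_address, size_bytes, is_failure_critical).
    """
    page_map = {}
    for start, size, is_fc in regions:
        first = start // PAGE_SIZE
        last = (start + size - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            # A page holding any FC data is treated as FC (conservative choice).
            page_map[page] = page_map.get(page, False) or is_fc
    return page_map


def select_cache(address, page_map):
    """Route an access: FC pages use the protected mini-cache, all others the main cache."""
    return "mini" if page_map.get(address // PAGE_SIZE, False) else "main"


# Hypothetical layout: an 8 KB media buffer (FNC) followed by 100 bytes of control state (FC).
page_map = build_page_map([(0x0000, 8192, False), (0x2000, 100, True)])
print(select_cache(0x1000, page_map), select_cache(0x2010, page_map))
```

Unmapped pages default to the unprotected main cache in this sketch, which keeps the small mini-cache free for data explicitly marked FC; a more conservative design could default the other way.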
Cache Scrubbing
Periodically read memory and correct all single bit errors
Prevents single-bit errors from accumulating over time into double-bit errors
Standard technique in main memories (DRAMs)
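
A minimal sketch of scrubbing follows, assuming each word is protected by a Hamming(7,4) single-error-correcting code. The slides do not name a code, and real memories typically use wider codes; the function names are illustrative.

```python
def hamming_encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (list of bits for positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 unused so that indices match bit positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    return code[1:]


def hamming_correct(word):
    """Return (corrected_word, error_position); position 0 means no error was detected."""
    code = [0] + list(word)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set-bit positions is 0 for a valid codeword
    if syndrome:
        code[syndrome] ^= 1  # a single-bit error sits at the syndrome position
    return code[1:], syndrome


def hamming_decode(word):
    """Extract the 4 data bits from positions 3, 5, 6, 7."""
    return word[2] | (word[4] << 1) | (word[5] << 2) | (word[6] << 3)


def scrub(memory):
    """Read every word, correct single-bit errors in place, and return how many were fixed."""
    fixed = 0
    for i, word in enumerate(memory):
        corrected, position = hamming_correct(word)
        if position:
            memory[i] = corrected
            fixed += 1
    return fixed
```

If a word takes one bit flip, is scrubbed, and later takes a second flip, each flip is corrected separately. If both flips land before any scrub, this code miscorrects the word, which is the accumulation that periodic scrubbing is meant to prevent.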
Pipeline Protection: Razor
Originally proposed to tolerate process variations
  Shadow latch clocked with a delayed clock
  If the latched values differ, raise an error
How to use it to detect soft errors?
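
One possible reading of that question is sketched below: if a transient is shorter than the shadow-latch delay, the main sample and the delayed sample disagree and the comparison flags it. This is an illustrative model under that assumption, not the Razor design itself; the helper names are hypothetical and times are integers in picoseconds.

```python
def with_transient(stable_value, pulse_start, pulse_end):
    """Hypothetical signal: stable_value, inverted while a transient pulse is active."""
    def signal(t):
        if pulse_start <= t < pulse_end:
            return stable_value ^ 1
        return stable_value
    return signal


def razor_sample(signal, t_clk, shadow_delay):
    """Sample at the clock edge and again after a delay; flag an error on mismatch."""
    main_value = signal(t_clk)
    shadow_value = signal(t_clk + shadow_delay)
    return main_value, shadow_value, main_value != shadow_value


# A 60 ps transient hits the clock edge at 1000 ps but is gone by the 100 ps delayed sample.
print(razor_sample(with_transient(1, 980, 1040), 1000, 100))
# A 200 ps transient covers both samples, so the two values agree and nothing is flagged.
print(razor_sample(with_transient(1, 950, 1150), 1000, 100))
```

The second case shows the limit of this reading: a transient longer than the delay corrupts both samples identically and escapes detection.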