Accelerated Stress Testing and Reliability Conference
Thermal HALT - a tool for discovering Signal Integrity and Software reliability issues
Kirk A. Gray
Accelerated Reliability Solutions, L.L.C.
kirk@acceleratedreliabilitysolutions.com
August 2, 2016
SI operational failures
• The differential voltage, the “eye” in the eye diagram, is shrinking as clock and bus frequencies increase
[Figure: undistorted eye diagram of a band-limited signal, and eye diagram of a signal with amplitude (noise) and phase (timing) errors. From “Analyzing Signals Using the Eye Diagram,” High Frequency Electronics, November 2005.]
SI operational failures
Little data exist on relationship of PWBA hardware variations and effects on signal integrity and software failures at
Effects in data transmission that were 2nd or 3rd order in previous designs begin to dominate as bus speeds increase
– What was not a big deal before becomes a big deal in SI
– Many new variables that are difficult to model correctly
– Decreases in IC metallization and higher frequencies will result in higher sensitivity to fabrication variations
No Fault Found (NFF) Field Returns
Many causes for warranty returns that are NFF
Some intermittent failures are due to low SI margin
Signal integrity operational margin depends on voltage, board impedance, crosstalk, noise, etc.
No Fault Found (NFF) Field Returns
Many companies do not consider it a “failure”
• no root cause investigation
• may be returned to field or as a replacement part for repair depot
• Marginal operation may be less in another system, e.g. graphics cards or DIMMs (DRAM memory)
Signal Integrity (SI) and HALT
SI operational issues may significantly contribute to NFF
Very difficult to observe in the field and on a test bench
– May take hundreds of operational cycles to observe
– Marginality may only occur in the stack-up of one specific whole system's hardware
– NFF when tested on bench or in the “golden” system
Using Thermal Stress for Timing variations
• Thermal stress is very useful for stimulation of timing variations
– Both high and low temperature limits
– High temperature – lower speed
– Low temperature – higher speed
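To make the two bullets above concrete, here is a minimal sketch, assuming a simple first-order model in which propagation delay rises linearly with temperature. Every number in it (nominal delay, temperature coefficient, clock period, setup and hold times) is a hypothetical value chosen for illustration and is not taken from a datasheet or from this presentation.

```python
# Illustrative only: every number below is a hypothetical assumption.

def propagation_delay_ns(temp_c, delay_at_25c_ns=10.0, tempco_ns_per_c=0.03):
    """First-order model: delay increases linearly with temperature."""
    return delay_at_25c_ns + tempco_ns_per_c * (temp_c - 25.0)


def setup_slack_ns(temp_c, clock_period_ns=14.0, setup_time_ns=2.0):
    """Positive slack means data arrives early enough before the clock edge."""
    return clock_period_ns - setup_time_ns - propagation_delay_ns(temp_c)


def hold_slack_ns(temp_c, hold_time_ns=9.0):
    """Positive slack means data stays stable long enough after the edge."""
    return propagation_delay_ns(temp_c) - hold_time_ns


if __name__ == "__main__":
    for temp_c in (-40, 25, 85, 100):
        print(f"{temp_c:>4} C  setup slack {setup_slack_ns(temp_c):+.2f} ns"
              f"  hold slack {hold_slack_ns(temp_c):+.2f} ns")
```

In this toy model the hot end erodes setup slack (slower signals) and the cold end erodes hold slack (faster signals), which is one way to see why both temperature limits are worth exploring.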
Effect of Temperature on Signal Propagation
Measured low-to-high propagation delay versus case temperature in a Fairchild MM74HC244N octal buffer (rated for -40 to 85°C)
Referenced from L. Condra, D. Das, N. Pendse, and M. Pecht, “Junction Temperature Considerations in Evaluating Electronic Parts for Use Outside Manufacturers-Specified Temperature Ranges,” IEEE Transactions on Components and Packaging Technologies, Vol. 21, No. 4, pp. 721-728, Dec. 2001
Variation of Signal Propagation
Thermal stress provides stimulation of signal propagation variation
Lot to Lot Variation of Signal Propagation?
Predicting the future variations
• How much propagation variation die to die, and lot to lot?
• How close to the specified maximum delay?
• How will variation impact operational reliability in each in-circuit application of the component?
Stimulation of Variations
Temperature can skew signal propagation in ICs and conductors
Using Thermal Stress for Timing variations
Timing and quality of SI variations come from:
• Lot to lot manufacturing
• Within lots
• Board impedance variations
• Second and third source components
• Interconnects
• Parametric drift - Aging
Signal Integrity
• Electric and Magnetic fields - noise, crosstalk, reflections
• Every conductor has frequency-dependent inductance, capacitance and resistance, which impact the quality of signal transmission from each node of the non-ideal conductor
• The copper surface affects L and R (it is rough for FR4 adhesion)
[Figure: typical transmission lines in PCB cross section, between ground/power planes, with the electric field and magnetic field shown. Referenced from S. H. Hall and H. L. Heck, “Advanced Signal Integrity for High-Speed Digital Designs,” Wiley and Sons, 2009.]
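The sensitivity of a trace to its L and C can be sketched with the standard lossless transmission line approximations, Z0 = sqrt(L/C) and delay per unit length = sqrt(L*C). This is a simplification: it ignores the resistance and the frequency dependence noted above. The per-unit-length values and the 5% variation are hypothetical assumptions, chosen only so the nominal case lands on a 50 ohm line.

```python
import math

# Hypothetical per-unit-length values for a PCB trace (assumptions).
L_PER_M = 350e-9   # inductance, H/m
C_PER_M = 140e-12  # capacitance, F/m


def characteristic_impedance_ohm(l_per_m, c_per_m):
    """Lossless-line approximation: Z0 = sqrt(L / C)."""
    return math.sqrt(l_per_m / c_per_m)


def delay_ps_per_mm(l_per_m, c_per_m):
    """Lossless-line approximation: delay per unit length = sqrt(L * C)."""
    seconds_per_metre = math.sqrt(l_per_m * c_per_m)
    return seconds_per_metre * 1e12 / 1e3  # s/m -> ps/mm


def with_variation(value, percent):
    """Scale a nominal value by a percentage change."""
    return value * (1.0 + percent / 100.0)


if __name__ == "__main__":
    cases = (("nominal", C_PER_M), ("C +5%", with_variation(C_PER_M, 5.0)))
    for label, c_per_m in cases:
        z0 = characteristic_impedance_ohm(L_PER_M, c_per_m)
        delay = delay_ps_per_mm(L_PER_M, c_per_m)
        print(f"{label:8s} Z0 = {z0:6.2f} ohm, delay = {delay:5.3f} ps/mm")
```

Under these assumptions, a few percent change in capacitance lowers the impedance and lengthens the delay at the same time, so a single material or dimensional variation moves both reflections and timing.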
Thermal Stress to Skew Parametrics
Marginal designs may not be discovered until a sufficient number of units are in the field
The field is a costly place to discover these marginal conditions, or to have high NDF returns if they are not discovered
[Figure: two distributions of number of units versus parametric timing value, each marked with the parameter specification and the lower and upper operating limits. One shows limited samples during development (100 units); the other shows mass production variation (100,000 units), with marginal operation regions near the operating limits.]
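The sample-size argument on this slide can be sketched numerically. The example below assumes, purely for illustration, that the timing parameter is normally distributed and that the operating limits sit 3.5 standard deviations either side of the mean; neither assumption comes from the presentation.

```python
import math


def fraction_outside_limits(limit_sigma):
    """Two-sided tail of a normal distribution beyond +/- limit_sigma."""
    return math.erfc(limit_sigma / math.sqrt(2.0))


def expected_marginal_units(n_units, limit_sigma):
    """Expected number of units whose parameter falls outside the limits."""
    return n_units * fraction_outside_limits(limit_sigma)


def chance_of_seeing_one(n_units, limit_sigma):
    """Probability that at least one unit in the sample is outside the limits."""
    return 1.0 - (1.0 - fraction_outside_limits(limit_sigma)) ** n_units


if __name__ == "__main__":
    LIMIT_SIGMA = 3.5  # hypothetical position of the operating limits
    for n_units in (100, 100_000):
        print(f"{n_units:>7} units: "
              f"expected outside limits = {expected_marginal_units(n_units, LIMIT_SIGMA):.2f}, "
              f"P(at least one) = {chance_of_seeing_one(n_units, LIMIT_SIGMA):.3f}")
```

With these assumed numbers, a development build of 100 units will most likely contain no marginal unit at all, while a production run of 100,000 is almost certain to contain some.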
Parametric Skewing from Thermal Stress
Applying thermal stress stimulates a timing shift
[Figure: distribution of 100 units versus parametric timing value, marked with the parameter specification and the lower and upper operating limits, with cold and hot shifts indicated.]
• Thermal step stress skews the signal propagation speeds in components and assemblies
• Rapid thermal transitions provide higher thermal gradients across components and the PWBA, giving a mix of parametric skewing
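One way to picture why skewing helps a small sample: if thermal stress moves the whole distribution toward an operating limit, units that were far from the limit at room temperature start to cross it. The sketch below assumes a normal distribution, an upper limit 3.5 standard deviations above the mean, and a thermal shift expressed in standard deviations. All three are hypothetical modelling choices, not data from this presentation.

```python
import math


def upper_tail(z):
    """Probability that a standard normal value exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def expected_units_past_upper_limit(n_units, limit_sigma, thermal_shift_sigma=0.0):
    """Expected units beyond the upper limit after shifting the mean toward it."""
    return n_units * upper_tail(limit_sigma - thermal_shift_sigma)


if __name__ == "__main__":
    for shift in (0.0, 1.0, 2.0):
        count = expected_units_past_upper_limit(100, 3.5, shift)
        print(f"shift {shift:.1f} sigma -> expected failing units out of 100: {count:.2f}")
```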
Signal Integrity
• Fiber weave effect - weave of dielectric cannot be assumed homogeneous at Gb/s transmission rates
• RH (relative humidity) has an impact on the electrical performance of the dielectric – dramatic increase of insertion loss from dry Arizona to humid Malaysia
[Figure: typical transmission lines in PCB cross section, between ground/power planes. Referenced from S. H. Hall and H. L. Heck, “Advanced Signal Integrity for High-Speed Digital Designs,” Wiley and Sons, 2009.]
Thermal stimulation for Signal Integrity Margin
Thermal stress expands and contracts material dimensions
[Figure: PCB cross section between ground/power planes under heat. Heat expands materials and dimensions.]
Thermal stimulation for Signal Integrity Margin
Thermal stress can stimulate the potential effects and impact of variation in parametrics, noise, L, C and R resulting from manufacturing and materials variation
[Figure: PCB cross section under cold. Cooling contracts materials and dimensions.]
Thermal stimulation for SI Margin
Thermal cycling adds an additional variation – thermal gradients create differential parametric shift from shifts in dimensions and impedance
Thermal gradients provide differential mechanical and parametric stresses
[Figure: thermal cycling profile with the transitions marked.]
No Fault Found (NFF) Field Returns
Two computers were returned from two different customers with the same reported intermittent failure condition
– After five days of bench testing, OEM Failure Analysis could not duplicate the failure mode
– Units were declared NFF
– The same units placed in thermal cycling (+65 to -10°C) reproduced the same (soft) failure mode as reported by the customer 3 times in a 24-hour period
Combined HALT Stress Interactions
Stress combinations can have significant interactions for multi-dimensional limit or boundary maps
Clock/bus Frequency margining limits
Voltage margining limits
www.ieee-astr.org September 28-30, 2016, Pensacola Beach, Florida
Stress Boundary Maps
Distributions in the boundary identify reliability margin risks
[Figure: stress boundary map with temperature and voltage axes, showing the normal user operating ranges, the derating boundary (5% guard band), the “four corner” test points, and the HALT operational limit.]
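A four-corner test plan can be generated mechanically from the two operating ranges. The sketch below is only an illustration: the slide does not say how the 5% guard band is defined, so it is assumed here to widen each range by 5% of its span, and the temperature and voltage ranges are hypothetical.

```python
from itertools import product


def widen(low, high, guard_band):
    """Extend a range on both ends by guard_band times its span (assumption)."""
    margin = guard_band * (high - low)
    return low - margin, high + margin


def four_corner_points(temp_range_c, voltage_range_v, guard_band=0.05):
    """Return the four (temperature, voltage) corners of the widened ranges."""
    temps = widen(temp_range_c[0], temp_range_c[1], guard_band)
    volts = widen(voltage_range_v[0], voltage_range_v[1], guard_band)
    return list(product(temps, volts))


if __name__ == "__main__":
    # Hypothetical normal user operating ranges.
    for temp_c, volts in four_corner_points((0.0, 40.0), (11.0, 13.0)):
        print(f"test at {temp_c:6.1f} C, {volts:5.2f} V")
```

The four corners only sample the edge of the guard-banded region. The HALT operational limits lie further out, and the distance between the two is the margin the boundary map is meant to show.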
Stress Boundary Maps
Wide distributions in limits – higher risk of stress-strength overlap
[Figure: stress boundary map with temperature and voltage axes, showing the end-use stress conditions, the reliability margins, and the region of stress-strength interference.]
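Stress-strength interference has a closed form when both quantities are modelled as independent normal distributions: the probability that stress exceeds strength is Phi(-(mu_strength - mu_stress) / sqrt(sigma_strength^2 + sigma_stress^2)). The sketch below applies that textbook formula. The normality assumption and all of the numbers (limits and end-use temperatures in degrees C) are hypothetical and do not come from this presentation.

```python
import math


def interference_probability(strength_mean, strength_sd, stress_mean, stress_sd):
    """P(stress > strength) for independent, normally distributed values."""
    margin = strength_mean - stress_mean
    combined_sd = math.sqrt(strength_sd ** 2 + stress_sd ** 2)
    return 0.5 * math.erfc(margin / (combined_sd * math.sqrt(2.0)))


if __name__ == "__main__":
    # Same mean operating limit (100) and end-use stress (60, sd 8);
    # only the spread of the operating limit differs between the two cases.
    narrow = interference_probability(100.0, 5.0, 60.0, 8.0)
    wide = interference_probability(100.0, 20.0, 60.0, 8.0)
    print(f"narrow limit distribution: P(overlap) = {narrow:.2e}")
    print(f"wide limit distribution:   P(overlap) = {wide:.2e}")
```

With the mean margin held fixed, widening the distribution of the limit alone raises the overlap probability by orders of magnitude in this example.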
Allied Telesis | White Paper
Software Fault Isolation using HALT and HASS
First Presented at the IEEE/CPMT 2010 ASTR Workshop
• Donovan Johnson, Senior Hardware & Reliability Test Engineer
• Ken Franks, Hardware & Reliability Test Manager
Background
In 2004, Gregg Hobbs, Ph.D., gave a “Mastering HALT and HASS” class at the New Zealand research and development centre
The term “software fault” is defined at Allied Telesis (formerly Allied Telesyn) as a fault found in:
• The firmware of a product, such as code in a Programmable Logic Device (PLD)
• The boot code of a product, such as EPROM boot code.
• The operating system of a product.
Allied Telesyn HALT
Test at each thermal step during HALT
• External traffic test – use industry standard equipment
• Power Cycling – margin voltage and frequency
• Internal memory test – RAM and NVS testing
• Internal packet generator test – CPU, Encryption engine and RAM test
• Other product specific tests
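The step-and-test procedure above can be sketched as a loop that runs the functional tests at each thermal step and records the last temperature at which they all passed. This is not the Allied Telesyn test software: `run_tests` stands in for the real traffic, power cycling and memory tests, chamber control is omitted, and `simulated_unit` is a made-up unit that passes between -20°C and 70°C.

```python
def find_operating_limit(run_tests, start_c, stop_c, step_c):
    """Step from start_c toward stop_c and return the last passing temperature.

    run_tests is any callable taking a temperature in degrees C and returning
    True when every functional test passed at that step.
    """
    if step_c == 0:
        raise ValueError("step_c must be non-zero")
    direction = 1 if step_c > 0 else -1
    last_pass = None
    temp_c = start_c
    while (temp_c - stop_c) * direction <= 0:
        if not run_tests(temp_c):
            break  # first failing step: the limit is the previous step
        last_pass = temp_c
        temp_c += step_c
    return last_pass


def simulated_unit(temp_c):
    """Hypothetical unit under test: operates from -20 C to 70 C."""
    return -20 <= temp_c <= 70


if __name__ == "__main__":
    upper = find_operating_limit(simulated_unit, 20, 120, 10)
    lower = find_operating_limit(simulated_unit, 20, -80, -10)
    print(f"upper operating limit: {upper} C, lower operating limit: {lower} C")
```

Running the full test set at every step is what ties a failure to a temperature, and that pairing is what allows a firmware or boot code change to be re-tested against the same step.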
Software Fault Isolation
Nearly one-third of the issues found in HALT were software related
Failure Types Found in HALT
Failure percentage
• Software issues: 28%
• Hardware issues: 70%
• To be determined: 2%
Software Issues found using HALT
Abnormal LED Activity
• This anomaly was found during cold step testing at -10°C and was attributed to the reset pulse timing inside the PLD code.
• After a change to the PLD code, the system operated down to -50°C.
Software Issues found using HALT
Switch Tuning
• Change in UOL (upper operating limit) from 70°C to greater than 100°C, and in LOL (lower operating limit) from -20°C to less than -60°C.
• Changes in software increased the operational temperature range from 90°C to 160°C.
Software Issues found using HALT
System Crash
• A product that had been in the field for six months
• In the first HALT, the UOL was 70°C – the failure was a system crash
• A changed register setting inside the boot code allowed operation to 100°C.
• In addition to the software fault, a flaw within the CPU silicon was revealed, which amplified the effects of the software fault.
Software Issues found using HALT
System Silent Reboot
• Rapid Thermal Transitions exposed a flaw in software during temperature ramps even though the initial failure occurred in a moderate climate inside a server room.
• Failure mode was only apparent when running one particular test. A software patch fixed this problem.
Software Issues found using HALT
Silent Reboot
• The same fault took weeks to replicate intermittently using traditional methods.
• The same failure mode was repeatedly replicated in HALT in less than one day of testing.
Summary
Thermal HALT has multiple benefits in electronics systems testing
– Well established for hardware latent defects
– Secondary and less recognized (an opportunity): thermally induced skewing of signal speeds in components and PWB assemblies helps to discover marginal SI that may result in failures of software and firmware.
Thank you – Q & A
The material in this presentation is contained in our new book, Next Generation HALT and HASS: Robust Design of Electronics and Systems, published by John Wiley & Sons, June 2016, co-authored with John J. Paschkewitz.