Radiation-Induced Error Criticality In Modern HPC Parallel Accelerators
Presented by: Christopher Boggs, Clayton Connors on 09/26/2018
Authored by: Daniel Oliveira, Laercio Pilla, Mauricio Hanzich, Vinicius Fratin, Fernando Fernandes, Caio Lunardi, José María Cela, Philippe Navaux, Luigi Carro, Paolo Rech
Outline
● Background
● Motivation
● Radiation-Induced Effects
● Error Criticality
● Procedure
● Results
● SDCs for HPC Applications
● Discussion
High Performance Computing (HPC)
● Parallel processing for advanced application programs
● Performance above one teraflop (10^12 floating-point operations per second)
● Of interest to businesses of all sizes
  ○ Transaction processing
  ○ Data warehouses
  ○ Complex models
  ○ Etc
An Accelerator?
● “Accelerates” a computation through massive parallelization
● Numerous shared resources
● Works best on workloads heavy in algebraic operations
● Examples: Intel Xeon Phi and Nvidia Kepler GPU
Parallel Accelerators Offer:
● Lower cost
● Flexibility
● High efficiency
● High computational power
● Massive amount of resources
● What about reliability?
With Titan
● 18,688 GPUs
● GPU corruption is common
● Uncorrectable errors: MTBF ~44 hours
https://www.kisspng.com/png-top5-cray-xk7-oak-ridge-leadership-computing-facil-6045373/
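To put the Titan numbers in per-device terms, the sketch below converts the system-level MTBF into a rough per-GPU MTBF and FIT rate (FIT = failures per 10^9 device-hours). It assumes, beyond what the slide states, that every uncorrectable error is attributed to a GPU and that all GPUs fail independently at the same constant rate; the function names are illustrative only.

```python
# Back-of-the-envelope conversion between system MTBF and per-device FIT.
# Assumption (not from the slides): every uncorrectable error is attributed
# to a GPU, and all GPUs fail independently at the same constant rate.

HOURS_PER_FIT_UNIT = 1e9      # FIT = failures per 10^9 device-hours
HOURS_PER_YEAR = 365.25 * 24


def per_device_mtbf(system_mtbf_hours: float, n_devices: int) -> float:
    """Failure rates of independent devices add up, so the per-device
    MTBF is n_devices times the system MTBF."""
    return system_mtbf_hours * n_devices


def fit_from_mtbf(mtbf_hours: float) -> float:
    """Convert an MTBF in hours into a FIT rate."""
    return HOURS_PER_FIT_UNIT / mtbf_hours


titan_gpus = 18_688          # from the slide
titan_mtbf_hours = 44.0      # from the slide

gpu_mtbf = per_device_mtbf(titan_mtbf_hours, titan_gpus)
gpu_fit = fit_from_mtbf(gpu_mtbf)

print(f"per-GPU MTBF ~ {gpu_mtbf:,.0f} hours "
      f"(~{gpu_mtbf / HOURS_PER_YEAR:.0f} years)")
print(f"per-GPU FIT  ~ {gpu_fit:,.0f}")
```

Under these assumptions a single GPU looks very reliable on its own; the short system MTBF comes from the device count.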
Radiation-Induced Effects
● Number of high-energy neutrons generated
● Interaction with the device can give Soft Errors
  ○ Bit-flips
  ○ Logic Errors
● Can cause a crash when hitting the instruction cache, bus controller, etc
● Could cause Silent Data Corruption (SDC)
Silent Data Corruption (SDC)
● A Soft Error hits but DOESN’T cause a crash
  ○ Data cache
  ○ Logic gates (ALU)
  ○ Register files
  ○ etc
● Especially harmful in HPC
  ○ Fault on shared resource or scheduler
  ○ Affects several threads, many elements
So What?
● Errors can be small
  ○ Within a certain range, so not seen as errors
  ○ In the xth bit of a float
● Not all errors are critical
  ○ Errors within the accepted range can be tolerated
● Goal: quantify and qualify SDCs on the Intel Xeon Phi and Nvidia K40
http://ena.support.keysight.com/e5061b/manuals/webhelp/eng/programming/remote_control/reading-writing_measurement_data/data_transfer_format.htm
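The point that an error "in the xth bit of a float" may or may not matter can be illustrated with a small sketch that flips one bit of an IEEE 754 double and reports the resulting relative error. The helper names `flip_bit` and `relative_error`, the value 1.5, and the chosen bit positions are illustrative assumptions, not taken from the paper.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 hold the
    exponent, and bit 63 is the sign.
    """
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def relative_error(read: float, expected: float) -> float:
    """Relative error of the observed (read) value against the expected one."""
    return abs(read - expected) / abs(expected)


expected = 1.5
for bit in (0, 30, 51, 52):
    read = flip_bit(expected, bit)
    re = relative_error(read, expected)
    verdict = "critical" if re > 0.02 else "tolerable"
    print(f"bit {bit:2d}: read={read!r:<22} RE={re:.3e} ({verdict})")
```

A flip in a low mantissa bit stays far below a 2% tolerance, while a flip in the top mantissa bit or in the exponent changes the value substantially.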
Parallel Accelerators
https://techgage.com/article/a-look-at-nvidias-kepler-based-tesla-k-series-gpu-accelerators/
https://www.software.intel.com
How Reliable?
● K40
  ○ Error rate rises with input size
  ○ Thread data is shared in the register file
● Xeon Phi
  ○ Error rate stays constant with input size
  ○ Other areas for errors
● The metric must be workload between failures!
Errors
● Relative Error (RE)
  ○ Read = observed value
  ○ Mean of Relative Errors
● Masked Errors
  ○ < 2% RE is tolerable
● Spatial Locality of Errors
  ○ Line, square, etc
  ○ Elements in a pattern share a resource
  ○ Different error types are corrected differently
De Oliveira et al. “Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, HPCA 2017.
“Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, WMC_2017_Rio_Daniel
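The three metrics on this slide can be sketched for a 2D output as follows. The relative error is taken here as |read - expected| / |expected|, which is the usual definition; the slide's own formula did not survive extraction. The locality rules ("single", "line", "square", "random") are a simplified guess at the classification and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
ErrorMap = Dict[Tuple[int, int], float]


def relative_errors(read: Matrix, expected: Matrix) -> ErrorMap:
    """Map (row, col) -> relative error for every element that differs."""
    errors: ErrorMap = {}
    for i, (read_row, exp_row) in enumerate(zip(read, expected)):
        for j, (r, e) in enumerate(zip(read_row, exp_row)):
            if r != e:
                # Hypothetical guard: use the absolute error when e == 0.
                errors[(i, j)] = abs(r - e) / abs(e) if e != 0 else abs(r - e)
    return errors


def apply_tolerance(errors: ErrorMap, threshold: float = 0.02) -> ErrorMap:
    """Keep only errors above the tolerance (2% on the slide)."""
    return {pos: re for pos, re in errors.items() if re > threshold}


def mean_relative_error(errors: ErrorMap) -> float:
    return sum(errors.values()) / len(errors) if errors else 0.0


def locality(errors: ErrorMap) -> str:
    """Simplified spatial classification (an assumption, not the paper's code)."""
    if not errors:
        return "none"
    rows = [i for i, _ in errors]
    cols = [j for _, j in errors]
    if len(errors) == 1:
        return "single"
    if len(set(rows)) == 1 or len(set(cols)) == 1:
        return "line"
    box = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    return "square" if len(errors) == box else "random"


expected = [[10.0] * 4 for _ in range(4)]
read = [row[:] for row in expected]
read[1][0] = 10.1   # ~1% off: below the tolerance
read[1][1] = 12.0   # 20% off
read[1][3] = 15.0   # 50% off

all_errors = relative_errors(read, expected)
critical = apply_tolerance(all_errors)
print(len(all_errors), "corrupted elements,", len(critical), "above 2%")
print("mean RE of critical errors:", mean_relative_error(critical))
print("locality:", locality(critical))
```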
Testing
● Each architecture was tested for 800 hours
● Simulates ~91,000 years of natural radiation
● Algorithms were chosen to
  ○ Simulate different resources
  ○ Represent HPC applications
  ○ Minimize error masking
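The two figures on this slide imply how strongly the beam accelerates exposure relative to natural radiation. The arithmetic below is only a sanity check derived from the slide's numbers; the slides do not state the acceleration factor, and the year length used is an assumption.

```python
# Implied acceleration factor of the beam test (derived, not quoted).
HOURS_PER_YEAR = 365.25 * 24   # assumption: average Julian year

beam_hours = 800               # from the slide
equivalent_years = 91_000      # from the slide

natural_hours = equivalent_years * HOURS_PER_YEAR
acceleration = natural_hours / beam_hours

print(f"equivalent natural exposure: {natural_hours:.3e} hours")
print(f"implied acceleration factor: {acceleration:.3e}")
```

The result is on the order of one million: each beam hour stands in for roughly a century of natural exposure.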
Algorithms
● DGEMM
  ○ Matrix multiplication
● LavaMD
  ○ Calculates interactions of particles
● Hotspot
  ○ Simulates energy dissipation
● CLAMR
  ○ Fluid dynamics application
DGEMM
Mean relative error and number of corrupted elements are both lower for K40
[Figure: K40 and Xeon Phi panels. Source: De Oliveira et al., HPCA 2017.]
DGEMM
>2% filter removes most random errors on K40
ABFT (algorithm-based fault tolerance) corrects single and line errors in linear time
FIT less dependent on input size for Xeon Phi
[Figure: K40 and Xeon Phi panels. Source: De Oliveira et al., HPCA 2017.]
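A minimal sketch of checksum-based ABFT for matrix multiplication is given here to show why single and line errors are cheap to fix. It follows the classic row/column checksum idea and is not the implementation evaluated in the paper; the function names, the tolerance, and the sample matrices are all assumptions. The check touches each element of the result a constant number of times.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain product, standing in for DGEMM."""
    return [[sum(x * y for x, y in zip(a_row, b_col)) for b_col in zip(*b)]
            for a_row in a]


def expected_checksums(a: Matrix, b: Matrix) -> Tuple[List[float], List[float]]:
    """Row and column sums that C = A*B must have, computed from A and B
    alone: C*1 = A*(B*1) and 1^T*C = (1^T*A)*B."""
    b_row_sums = [sum(row) for row in b]
    row_sums = [sum(x * s for x, s in zip(a_row, b_row_sums)) for a_row in a]
    a_col_sums = [sum(col) for col in zip(*a)]
    col_sums = [sum(s * x for s, x in zip(a_col_sums, b_col)) for b_col in zip(*b)]
    return row_sums, col_sums


def check_and_correct(c: Matrix, row_sums: List[float],
                      col_sums: List[float], tol: float = 1e-9) -> str:
    """Detect errors in C and repair single or line errors in place."""
    row_diff = [sum(row) - s for row, s in zip(c, row_sums)]
    col_diff = [sum(col) - s for col, s in zip(zip(*c), col_sums)]
    bad_rows = [i for i, d in enumerate(row_diff) if abs(d) > tol]
    bad_cols = [j for j, d in enumerate(col_diff) if abs(d) > tol]
    if not bad_rows and not bad_cols:
        return "clean"
    if len(bad_rows) == 1 and bad_cols:
        # All errors sit in one row, so each bad column has one wrong element.
        i = bad_rows[0]
        for j in bad_cols:
            c[i][j] -= col_diff[j]
        return "single" if len(bad_cols) == 1 else "line"
    if len(bad_cols) == 1 and bad_rows:
        j = bad_cols[0]
        for i in bad_rows:
            c[i][j] -= row_diff[i]
        return "line"
    # Square/random patterns (or errors that cancel in a checksum) need more.
    return "uncorrectable"


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0]]
golden = matmul(a, b)
rows, cols = expected_checksums(a, b)

single = [row[:] for row in golden]
single[1][2] += 100.0                  # one corrupted element
single_kind = check_and_correct(single, rows, cols)

line = [row[:] for row in golden]
line[0][0] += 5.0                      # two corrupted elements in one row
line[0][2] -= 3.0
line_kind = check_and_correct(line, rows, cols)

print(single_kind, single == golden)
print(line_kind, line == golden)
```

Patterns that span several rows and several columns are reported as uncorrectable by this sketch, which is why the spatial locality of errors matters for choosing a correction strategy.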
DGEMM
● FIT correlates with input size on K40 but not on Xeon Phi
  ○ NVIDIA devices have a dedicated scheduler
  ○ K40 keeps active thread data on device
Source: Rech, Pilla, Navaux, Carro. “Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability,” DSN, Atlanta, USA, 2014.
LavaMD
Number of corrupted elements lower for K40
Mean relative error lower for Xeon Phi
Exponentiation may cause large deviations
[Figure: K40 and Xeon Phi panels. Source: De Oliveira et al., HPCA 2017.]
LavaMD
Xeon Phi: cubic and square errors stem from the larger shared cache
K40: FIT correlates less with input size because local memory use limits the thread count
K40 locality vs. input size: errors are less likely to be “shared” for larger inputs
[Figure: K40 and Xeon Phi panels. Source: De Oliveira et al., HPCA 2017.]
HotSpot
Number of corrupted elements lower for K40
Mean relative error appears lower for K40 (not stated in paper)
Errors “dissipate”
[Figure: K40 and Xeon Phi panels. Source: De Oliveira et al., HPCA 2017.]
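The way errors "dissipate" in a stencil code can be illustrated with a toy example. The sketch below is a 1D three-point averaging stencil, not HotSpot itself, and the grid size, injected error, and helper names are assumptions. It injects one corrupted cell and tracks how many cells differ from the fault-free run and how large the worst relative error is.

```python
from typing import List, Tuple


def stencil_step(grid: List[float]) -> List[float]:
    """One sweep: every interior cell becomes the average of itself and
    its two neighbors; the two boundary cells stay fixed."""
    new = grid[:]
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new


def run(grid: List[float], steps: int) -> List[float]:
    for _ in range(steps):
        grid = stencil_step(grid)
    return grid


def error_profile(steps: int, n: int = 21) -> Tuple[int, float]:
    """Return (number of corrupted cells, largest relative error) after
    `steps` sweeps, for a single fault injected before the first sweep."""
    golden = [100.0] * n
    faulty = golden[:]
    faulty[n // 2] *= 1.5              # inject a 50% error in one cell
    g, f = run(golden, steps), run(faulty, steps)
    rel = [abs(x - y) / abs(y) for x, y in zip(f, g)]
    return sum(1 for r in rel if r > 0), max(rel)


for steps in (0, 1, 5):
    corrupted, worst = error_profile(steps)
    print(f"after {steps} sweeps: {corrupted} corrupted cells, "
          f"worst RE = {worst:.3f}")
```

In this toy setting the number of corrupted cells grows with every sweep while the worst relative error shrinks, so a magnitude filter such as the 2% threshold will eventually hide the fault even though more elements are affected.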
HotSpot
>2% threshold removes most errors on both devices
Runtime error checking can affect performance
[Figure: K40 and Xeon Phi panels. Source: De Oliveira et al., HPCA 2017.]
CLAMR
Only tested on Xeon Phi
All errors were >2%
[Figure: Xeon Phi locality map (for a single execution). Source: De Oliveira et al., HPCA 2017.]
CLAMR
● (Related work) Runtime error checking showed fault coverage of 82%
Source: Atkinson, DeBardeleben, Guan, Robey, Jones. “Fault injection experiments with the CLAMR hydrodynamics mini-app,” ISSREW, 2014.
Conclusion
● DGEMM more resilient on K40
  ○ GPUs have shortened pipelines
● LavaMD more resilient on Xeon Phi
  ○ Transcendental function unit more prone to corruption in K40?
● HotSpot spreads errors
  ○ This behavior may hold for all stencil applications
● CLAMR spreads errors without attenuating them
● Xeon Phi keeps corrupted elements around for longer
Future Work
● Determine sources of most critical errors
Discussion Questions
● Does the provided data allow for anything beyond comparing the two tested devices?
● Would it be tolerable for manufacturers to target “lower relative error” at the expense of having a higher total number of errors?
● Is it fair to irradiate the chips but not the DRAM?