Radiation-Induced Error Criticality In Modern HPC Parallel Accelerators

Presented by: Christopher Boggs, Clayton Connors on 09/26/2018

Authored by: Daniel Oliveira, Laercio Pilla, Mauricio Hanzich, Vinicius Fratin, Fernando Fernandes, Caio Lunardi, José María Cela, Philippe Navaux, Luigi Carro, Paolo Rech

Outline

● Background
● Motivation
● Radiation-Induced Effects
● Error Criticality
● Procedure
● Results
● SDCs for HPC Applications
● Discussion

High Performance Computing (HPC)

● Parallel processing for advanced application programs
● Above one teraflop (10^12 floating-point operations per second)
● Of interest to businesses of all sizes
    ○ Transaction processing
    ○ Data warehouses
    ○ Complex models
    ○ Etc.

An Accelerator?

● “Accelerate” a computation with massive parallelization
● Numerous shared resources
● Work best on workloads with many algebra-heavy operations
● Examples: Intel Xeon Phi and Nvidia Kepler GPU

Parallel Accelerators Offer:

● Lower cost
● Flexibility
● High efficiency
● High computational power
● Massive amount of resources

● What about reliability?

With Titan

● 18,688 GPUs
● GPU corruption is common
● Uncorrectable errors: MTBF ~44 hours

Image source: https://www.kisspng.com/png-top5-cray-xk7-oak-ridge-leadership-computing-facil-6045373/
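To put the Titan figures in perspective, the short sketch below converts the system-level MTBF into an implied per-GPU MTBF. It assumes (this is not stated in the slides) that GPU failures are independent and occur at a constant rate, so the system failure rate is simply the sum of the per-GPU rates. The function name is hypothetical.

```python
# Back-of-the-envelope sketch: per-GPU MTBF implied by the system-level figure.
# Assumption (not from the slides): failures are independent with a constant
# rate, so the system failure rate is the sum of the per-GPU failure rates.

HOURS_PER_YEAR = 24 * 365


def per_device_mtbf_hours(system_mtbf_hours, device_count):
    """With N identical devices, each device fails N times less often than the system."""
    return system_mtbf_hours * device_count


titan_gpus = 18_688       # from the slide
titan_mtbf_hours = 44     # from the slide (uncorrectable errors)

gpu_mtbf_hours = per_device_mtbf_hours(titan_mtbf_hours, titan_gpus)
gpu_mtbf_years = gpu_mtbf_hours / HOURS_PER_YEAR
print(f"{gpu_mtbf_hours} hours, about {gpu_mtbf_years:.0f} years per GPU")
```

Under that assumption, a single GPU looks very reliable on its own; the short system MTBF comes from the sheer number of devices.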

Radiation-Induced Effects

● High-energy neutrons are generated by natural radiation
● Their interaction with a device can cause soft errors
    ○ Bit-flips
    ○ Logic errors
● Can cause a crash when they hit the instruction cache, bus controller, etc.
● Can also cause Silent Data Corruption (SDC)

Silent Data Corruption (SDC)

● A soft error hits but DOESN’T cause a crash
    ○ Data cache
    ○ Logic gates (ALU)
    ○ Register files
    ○ Etc.
● Especially harmful in HPC
    ○ Fault on a shared resource or the scheduler
    ○ Affects several threads and many output elements

So What?

● Errors can be small
    ○ In the x-th bit of a float (the magnitude depends on the bit position)
● Not all errors are critical
    ○ Within a certain range, so not seen as errors
● Goal: quantify and qualify SDCs on the Intel Xeon Phi and Nvidia K40

Image source: http://ena.support.keysight.com/e5061b/manuals/webhelp/eng/programming/remote_control/reading-writing_measurement_data/data_transfer_format.htm
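The point that a bit-flip can be harmless or severe depending on where it lands can be illustrated with a small sketch. It flips one bit of an IEEE 754 double and reports the resulting relative error. The helper names `flip_bit` and `relative_error_percent` are illustrative and not taken from the paper.

```python
import struct


def flip_bit(value, bit):
    """Flip one bit of a 64-bit IEEE 754 double.

    Bit 0 is the least significant mantissa bit, bits 52-62 are the
    exponent, and bit 63 is the sign.
    """
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def relative_error_percent(read, expected):
    """Relative deviation of the read value from the expected one, in percent."""
    return abs(read - expected) / abs(expected) * 100.0


expected = 1.5
for bit in (0, 30, 51, 52):
    read = flip_bit(expected, bit)
    print(f"bit {bit:2d}: read={read!r}, "
          f"relative error={relative_error_percent(read, expected):.3g}%")
```

For the value 1.5, a flip in bit 0 changes the result by a negligible amount, while a flip in bit 51 (the top mantissa bit) turns it into 1.0 and a flip in bit 52 (the lowest exponent bit) turns it into 0.75.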

Parallel Accelerators

Image sources:
https://techgage.com/article/a-look-at-nvidias-kepler-based-tesla-k-series-gpu-accelerators/
https://www.software.intel.com

How Reliable?

● K40
    ○ Error rate rises with input size
    ○ Threads' data shares the register file
● Xeon Phi
    ○ Error rate stays roughly constant with input size
    ○ Other areas are the source of errors
● The metric must be workload between failures!
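Why workload between failures is a fairer metric than raw error rate can be shown with a small sketch. The formula below is one plausible way to express the idea and is an assumption, not the paper's exact definition. All numbers and the function name are hypothetical.

```python
def workload_between_failures(work_per_run, run_time_hours, failures_per_hour):
    """Expected amount of work completed between two failures.

    Assumed definition (hypothetical): work done per run, multiplied by
    the number of runs that fit into the mean time between failures.
    """
    mtbf_hours = 1.0 / failures_per_hour
    runs_between_failures = mtbf_hours / run_time_hours
    return work_per_run * runs_between_failures


# Hypothetical devices: B fails twice as often as A, but runs 4x faster.
device_a = workload_between_failures(work_per_run=1e6, run_time_hours=2.0,
                                     failures_per_hour=1e-3)
device_b = workload_between_failures(work_per_run=1e6, run_time_hours=0.5,
                                     failures_per_hour=2e-3)
print(f"A: {device_a:.3g} work units, B: {device_b:.3g} work units")
```

In this made-up case, device B has the worse error rate yet completes more work between failures, which is why error rate alone can be misleading when comparing accelerators.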

Errors

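● Relative Error
    ○ Read = observed value, compared against the expected value
    ○ Mean of relative errors
● Masked Errors
    ○ Relative error < 2% is treated as tolerable
● Spatial Locality of Errors
    ○ Line, square, etc.
    ○ Corrupted elements share a resource
    ○ Different error types are corrected differently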
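The relative-error metrics above can be sketched in a few lines. The function names and sample values are illustrative, and the sketch assumes the expected values are nonzero. The 2% threshold is the tolerance mentioned on the slide.

```python
def relative_error_percent(read, expected):
    """Relative deviation of the read (observed) value, in percent."""
    return abs(read - expected) / abs(expected) * 100.0


def summarize_errors(read_values, expected_values, threshold_percent=2.0):
    """Summarize one execution: corrupted elements, mean relative error, and
    how many corrupted elements exceed the tolerance threshold."""
    errors = [
        relative_error_percent(read, expected)
        for read, expected in zip(read_values, expected_values)
        if read != expected  # only mismatching elements count as corrupted
    ]
    mean_error = sum(errors) / len(errors) if errors else 0.0
    critical = [e for e in errors if e >= threshold_percent]
    return {
        "corrupted": len(errors),
        "mean_relative_error": mean_error,
        "critical": len(critical),
    }


# Hypothetical output of one run compared with the fault-free (golden) output.
expected = [10.0, 20.0, 40.0, 80.0]
read = [10.0, 20.2, 40.0, 120.0]
summary = summarize_errors(read, expected)
print(summary)
```

In this made-up run two elements are corrupted, but only one of them deviates by more than 2%, so only that one would count as critical.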

De Oliveira et al. “Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, HPCA 2017.
“Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, WMC_2017_Rio_Daniel

Testing

● Each architecture was tested for 800 hours
● Simulates ~91,000 years of natural radiation
● Algorithms were chosen which
    ○ Simulate different resources
    ○ Represent HPC applications
    ○ Minimize error masking
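The two figures above imply how strongly the beam accelerates exposure compared with natural radiation. The sketch below only divides one figure by the other. The beam flux itself is not given in the slides, so nothing else is derived.

```python
HOURS_PER_YEAR = 24 * 365

beam_hours = 800         # test time per architecture, from the slide
natural_years = 91_000   # equivalent natural exposure, from the slide

# Hours of natural exposure represented by each hour under the beam.
acceleration_factor = natural_years * HOURS_PER_YEAR / beam_hours
print(f"acceleration factor: about {acceleration_factor:.3g}")
```

The ratio comes out at roughly one million, meaning each beam hour stands in for on the order of a century of natural exposure.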

Algorithms

● DGEMM
    ○ Matrix multiplication
● LavaMD
    ○ Calculates interactions of particles
● Hotspot
    ○ Simulates energy dissipation
● CLAMR
    ○ Fluid dynamics application

DGEMM

Relative mean error and number of corrupted elements are both lower for K40

[Figure panels: K40, Xeon Phi]

De Oliveira et al. “Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators”, HPCA 2017.

>2% filter removes most random errors on K40

Algorithm-based fault tolerance (ABFT) corrects single and line errors in linear time
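As an illustration of how ABFT can locate and repair an error in a matrix product, here is a minimal sketch of the classic checksum scheme for matrix multiplication. It is not necessarily the variant evaluated in the paper, and it handles only the single-element case. Integer matrices are used so that checksums can be compared exactly; floating-point data would need a tolerance. All function names are illustrative.

```python
def matmul(a, b):
    """Plain triple-loop matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append a row holding the sum of each column of `a`."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of `b` the sum of that row."""
    return [row + [sum(row)] for row in b]


def check_and_correct(c_full):
    """Locate and repair a single corrupted data element in place.

    `c_full` is the (n+1) x (m+1) product of the checksummed inputs: its
    last column holds row sums and its last row holds column sums.
    Returns the (row, col) that was repaired, or None if nothing is wrong.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        others = sum(c_full[i][:m]) - c_full[i][j]
        c_full[i][j] = c_full[i][m] - others  # rebuild from the row checksum
        return (i, j)
    raise ValueError("not a single-element error; outside this sketch")


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
c_full[1][0] += 1024                      # simulate a bit-flip in one result
repaired_at = check_and_correct(c_full)
result = [row[:2] for row in c_full[:2]]  # strip the checksums
print(repaired_at, result)
```

The check reads each output element a constant number of times, so its cost is linear in the size of the result matrix, which is small compared with the multiplication itself.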

FIT (failures in time) is less dependent on input size for Xeon Phi

[Figure panels: K40, Xeon Phi]

● FIT correlates with input size on K40 but not on Xeon Phi
    ○ NVIDIA devices have a dedicated scheduler
    ○ K40 keeps active thread data on device

Source: Rech, Pilla, Navaux, Carro. “Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability,” DSN, Atlanta, USA, 2014.

LavaMD

Number of corrupted elements lower for K40

Relative mean error lower for Xeon Phi

Exponentiation may cause large deviations

[Figure panels: K40, Xeon Phi]

LavaMD

Xeon Phi: cubic and square error patterns, from its larger shared cache

Weaker FIT correlation with input size on K40: local memory use limits thread count

K40 locality vs input size: errors are less likely to be “shared” for larger inputs

[Figure panels: K40, Xeon Phi]
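Spatial locality patterns such as single, line, and square can be assigned automatically from the coordinates of the corrupted elements. The classifier below is a crude heuristic for a 2D output. The exact category definitions and the density threshold are assumptions made for this sketch, not the paper's definitions, and the cubic (3D) case is left out.

```python
def classify_locality(corrupted, density_threshold=0.5):
    """Classify a set of corrupted (row, col) positions in a 2D output.

    Assumed categories (hypothetical definitions):
      single - exactly one corrupted element
      line   - all corrupted elements in one row or one column
      square - elements densely fill their bounding box
      random - anything else
    """
    if not corrupted:
        return "none"
    if len(corrupted) == 1:
        return "single"
    rows = {r for r, _ in corrupted}
    cols = {c for _, c in corrupted}
    if len(rows) == 1 or len(cols) == 1:
        return "line"
    box_area = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    if len(corrupted) / box_area >= density_threshold:
        return "square"
    return "random"


examples = {
    "one element": {(3, 4)},
    "same row": {(2, 0), (2, 5), (2, 9)},
    "2x2 block": {(1, 1), (1, 2), (2, 1), (2, 2)},
    "scattered": {(0, 0), (7, 3), (2, 9)},
}
for name, positions in examples.items():
    print(f"{name}: {classify_locality(positions)}")
```

A classification like this is useful because the pattern hints at which resource was hit and which correction strategy applies.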

HotSpot

Number of corrupted elements lower for K40

Relative mean error appears lower for K40 (not stated in paper)

Errors “dissipate”

[Figure panels: K40, Xeon Phi]
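The way errors “dissipate” in a stencil code can be reproduced with a toy example. The sketch below is not HotSpot itself: it uses a simplified 5-point averaging stencil on a small grid, with an arbitrary injected error, to show that a corrupted value spreads to more cells while its magnitude shrinks.

```python
def stencil_step(grid):
    """One step of a 5-point averaging stencil; boundary cells stay fixed."""
    n, m = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]) / 5.0
    return new


def compare(faulty, golden):
    """Return (number of differing cells, largest absolute deviation)."""
    diffs = [abs(f - g)
             for faulty_row, golden_row in zip(faulty, golden)
             for f, g in zip(faulty_row, golden_row) if f != g]
    return len(diffs), max(diffs, default=0.0)


size = 9
golden = [[100.0] * size for _ in range(size)]
faulty = [row[:] for row in golden]
faulty[4][4] += 50.0  # inject a single error in the centre cell

history = [compare(faulty, golden)]
for _ in range(3):
    golden = stencil_step(golden)
    faulty = stencil_step(faulty)
    history.append(compare(faulty, golden))

for step, (count, worst) in enumerate(history):
    print(f"step {step}: {count} corrupted cells, max deviation {worst:.2f}")
```

The number of corrupted cells grows with every step while the largest deviation falls, so a relative-error threshold filters out more of the corruption the longer the stencil runs.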

HotSpot

>2% threshold removes most errors on both devices

Runtime error checking can affect performance

[Figure panels: K40, Xeon Phi]

CLAMR

Only tested on Xeon Phi

All errors were >2%

[Figure: Xeon Phi locality map for a single execution]

● (Related work) Runtime error checking showed fault coverage of 82%

Source: Atkinson, Debardeleben, Guan, Robey, Jones. “Fault injection experiments with the clamr hydrodynamics mini-app,” ISSREW, 2014.

Conclusion

● DGEMM more resilient on K40
    ○ GPUs have shortened pipelines
● LavaMD more resilient on Xeon Phi
    ○ Transcendental function unit more prone to corruption in K40?
● HotSpot spreads errors
    ○ This behavior may hold for all stencil applications
● CLAMR spreads errors without attenuating them
● Xeon Phi keeps corrupted elements around for longer

Future Work

● Determine sources of most critical errors

Discussion Questions

● Does the provided data allow for anything beyond comparing the two tested devices?

● Would it be tolerable for manufacturers to target “lower relative error” at the expense of having a higher total number of errors?

● Is it fair to irradiate the chips but not the DRAM?
