
• Daniel Ruiz

• 13th November 2018

HPCG @ Arm

Motivation

• HPCG is becoming increasingly important in the HPC world

• Few optimized versions are open-source

  • None are Arm-specific

• Establish a baseline codebase for the community to build upon

• Stop reinventing the wheel

Inspiration

• We have used different sources as inspiration:

• J. Park et al. 2014. Efficient shared-memory implementation of high-performance conjugate gradient benchmark and its application to unstructured matrices. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14). IEEE Press, Piscataway, NJ, USA, 945-955. DOI: https://doi.org/10.1109/SC.2014.82

• E. Phillips and M. Fatica. 2014. A CUDA Implementation of the High Performance Conjugate Gradient Benchmark. In High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, ser. Lecture Notes in Computer Science. Springer, Cham, pp. 68-84. Available: https://link.springer.com/chapter/10.1007/978-3-319-17248-4_4

• K. Kumahata et al. 2016. High-Performance conjugate gradient performance improvement on the K computer. In The International Journal of High Performance Computing Applications, vol. 30, no. 1, pp. 55-70. Available: https://doi.org/10.1177/1094342015607950

• X. Zhang et al. 2014. Optimizing and Scaling HPCG on Tianhe-2: Early Experience. In Algorithms and Architectures for Parallel Processing, ser. Lecture Notes in Computer Science. Springer, Cham, pp. 28-41. Available: https://link.springer.com/chapter/10.1007/978-3-319-11197-1_3

Multi-level task dependency graph

• Nodes in the same level of the graph can be processed in parallel

• How to:

1. Add node 0 to level 0

2. Mark node 0 as visited

3. Close level 0

4. Check the neighbors of the nodes in the previous level to see whether their dependencies are fulfilled

   1. If yes, add the node to the current level and mark it as visited

   2. If no, continue with the next node

5. Close the level, add a new level, and go back to step 4 while there are still nodes to process

Level 0: 0
Level 1: 1
Level 2: 2, 4
Level 3: 3, 5
Level 4: 6, 8
Level 5: 7, 9
Level 6: 10, 12
Level 7: 11, 13
Level 8: 14
Level 9: 15
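The level construction can be sketched in a few lines of Python. This is only an illustration, not the code from the repository. The function names `grid_neighbors` and `build_levels` are hypothetical. The 4x4 grid with a 9-point stencil, and the rule that a node depends on its lower-numbered neighbors, are assumptions chosen for the example. Levels are numbered consecutively from 0.

```python
def grid_neighbors(nx, ny):
    """Neighbor lists for an nx-by-ny grid with a 9-point stencil (assumed example)."""
    adj = {}
    for y in range(ny):
        for x in range(nx):
            adj[y * nx + x] = [
                (y + dy) * nx + (x + dx)
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dx, dy) != (0, 0) and 0 <= x + dx < nx and 0 <= y + dy < ny
            ]
    return adj


def build_levels(adj):
    """Split nodes into levels whose members can be processed in parallel."""
    visited = {0}
    levels = [[0]]  # node 0 alone forms the first level
    while len(visited) < len(adj):
        # Candidates: unvisited neighbors of the nodes in the previous level.
        candidates = sorted(
            {n for prev in levels[-1] for n in adj[prev] if n not in visited}
        )
        # Assumption: a node depends on its lower-numbered neighbors, all of
        # which must already belong to a closed level.
        current = [
            c for c in candidates
            if all(d in visited for d in adj[c] if d < c)
        ]
        if not current:
            raise ValueError("no node is ready; the graph breaks the assumptions")
        visited.update(current)  # close the level
        levels.append(current)
    return levels


levels = build_levels(grid_neighbors(4, 4))
for i, nodes in enumerate(levels):
    print(f"Level {i}: {', '.join(map(str, nodes))}")
```

Under these assumptions the 16 nodes fall into ten levels, starting with node 0 alone and ending with node 15 alone. All nodes printed on the same line could be updated concurrently.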

Block multi-coloring

• Blocks with the same color can be processed in parallel

• How to:

1. Group N consecutive nodes into blocks

2. Color the blocks

3. Reorder the blocks, and the rows they contain, so that blocks sharing the same color are stored together
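The three steps can be sketched as follows. This is a minimal illustration, not the repository's implementation. The name `block_multicolor` is hypothetical, the greedy coloring is an assumed choice (the slides do not say which coloring algorithm is used), and the 8-node chain is a toy input.

```python
def block_multicolor(adj, block_size):
    """Return (color per block, new row order) for an adjacency list `adj`."""
    n = len(adj)
    num_blocks = (n + block_size - 1) // block_size

    # Step 1: group consecutive nodes into blocks and find neighboring blocks.
    block_adj = [set() for _ in range(num_blocks)]
    for node in range(n):
        for nb in adj[node]:
            if nb // block_size != node // block_size:
                block_adj[node // block_size].add(nb // block_size)

    # Step 2: greedy coloring (assumed): smallest color unused by any neighbor.
    color = [-1] * num_blocks
    for b in range(num_blocks):
        used = {color[nb] for nb in block_adj[b] if color[nb] != -1}
        c = 0
        while c in used:
            c += 1
        color[b] = c

    # Step 3: reorder blocks by color, keeping rows inside a block consecutive.
    block_order = sorted(range(num_blocks), key=lambda b: (color[b], b))
    new_rows = [
        row
        for b in block_order
        for row in range(b * block_size, min((b + 1) * block_size, n))
    ]
    return color, new_rows


# Toy input: a chain of 8 nodes, each connected to its left and right neighbor.
chain = [[j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)]
color, new_rows = block_multicolor(chain, block_size=2)
print(color)     # color of each block
print(new_rows)  # rows grouped by block color
```

For the chain, the blocks alternate between two colors, so blocks 0 and 2 can be processed in parallel, followed by blocks 1 and 3. Rows inside a block remain consecutive and are processed in order.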

Merging all together

[Figure with two panels: Coarser levels, Finest level]

Intra-node performance

[Figure: two plots against the number of OpenMP threads (1, 2, 4, 8, 16, 28); series: Strong scaling, Weak scaling, Ideal]

• Experiments were run on a dual-socket state-of-the-art Arm platform

Multi-node performance

[Figure: two plots against the number of nodes (1 to 7); series: Optimized, Vanilla, and Ideal in the second plot]

• Optimized:

  • 8 MPI ranks per node

  • 7 OpenMP threads per MPI rank

  • 256x224x256

• Vanilla:

  • 56 MPI ranks per node

  • OpenMP disabled

  • 128x128x128

Summary

• The code is open-source

  • https://gitlab.com/arm-hpc/benchmarks/hpcg

  • We welcome contributions!

• We can collaboratively improve things

  • Hand-made NEON code

  • Hand-made SVE code

  • Platform-specific optimizations

  • Network communication

• Further information about our code is on the Arm Community blog

  • https://community.arm.com/tools/hpc/b/hpc/posts/parallelizing-hpcg

Thank You
