What is the most important kernel of sparse linear solvers for heterogeneous supercomputers?

  • What is the most important kernel of sparse linear solvers for heterogeneous supercomputers?

    Shengxin Zhu, The University of Oxford
    Prof. Xingping Liu and Prof. Tongxiang Gu
    National Key Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics

    SNSCC'12, shengxin.zhu@maths.ox.ac.uk

  • Outline

    Brief introduction to heterogeneous supercomputers
    Computational kernels of Krylov methods
    Influence of communication
    Case study: GPBiCG(m,l)
    Challenging problems
    Conclusion

  • Introduction to heterogeneous supercomputers: Dawning 5000A
    Nodes:  Bandwidth:  Memory:  *

    Dawning 5000A, TOP500 ranking history:
    11/2008 11th; 06/2009 15th; 11/2009 19th; 06/2010 24th; 11/2010 35th; 06/2011 40th; 11/2011 58th

    TOP500, Nov 2011: 1st K (JP); 2nd NUDT (CN); 3rd Cray (US); 4th Dawning (CN)

  • Computational kernels of Krylov methods
    Vector update: parallel in nature
    Mat-vec: computation intensive; multi-core technology (CUDA/OpenMP)
    Inner product: communication intensive (CPU/MPI)
    (The sketch below marks where each kernel appears in one iteration.)
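
    For reference, here is a minimal serial CG sketch (my illustration, not from the talk); the matrix is a stand-in 1D Laplacian stencil, and the comments mark which of the three kernels each step exercises.

        /* Minimal serial CG on a 1D Laplacian stencil (illustrative). */
        #include <stdio.h>
        #include <stdlib.h>
        #include <math.h>

        #define N 1000

        /* Kernel 1: mat-vec -- computation intensive, the natural target
           for CUDA/OpenMP offload on a heterogeneous node. */
        static void matvec(const double *x, double *y) {
            for (int i = 0; i < N; i++) {
                double left  = (i > 0)     ? x[i - 1] : 0.0;
                double right = (i < N - 1) ? x[i + 1] : 0.0;
                y[i] = 2.0 * x[i] - left - right;
            }
        }

        /* Kernel 2: inner product -- on a distributed machine this sum
           becomes a global all-reduce, i.e. the communication bottleneck. */
        static double dot(const double *a, const double *b) {
            double s = 0.0;
            for (int i = 0; i < N; i++) s += a[i] * b[i];
            return s;
        }

        /* Kernel 3: vector update -- embarrassingly parallel, no communication. */
        static void axpy(double alpha, const double *x, double *y) {
            for (int i = 0; i < N; i++) y[i] += alpha * x[i];
        }

        int main(void) {
            double *x = calloc(N, sizeof *x), *r = malloc(N * sizeof *r);
            double *p = malloc(N * sizeof *p), *Ap = malloc(N * sizeof *Ap);
            for (int i = 0; i < N; i++) r[i] = p[i] = 1.0;   /* b = 1, x0 = 0 */

            double rr = dot(r, r);
            for (int k = 0; k < 5000 && sqrt(rr) > 1e-10; k++) {
                matvec(p, Ap);                               /* mat-vec        */
                double alpha = rr / dot(p, Ap);              /* inner product  */
                axpy(alpha, p, x);                           /* vector update  */
                axpy(-alpha, Ap, r);                         /* vector update  */
                double rr_new = dot(r, r);                   /* inner product  */
                for (int i = 0; i < N; i++)
                    p[i] = r[i] + (rr_new / rr) * p[i];      /* vector update  */
                rr = rr_new;
            }
            printf("residual norm: %.3e\n", sqrt(rr));
            free(x); free(r); free(p); free(Ap);
            return 0;
        }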

  • Influence of communication: a first glance
    S. Zhu, MSc Thesis, CAEP, 2010
    Computation is cheap; communication is expensive.
    Based on Aztec, by Prof. Tuminaro et al. @ Sandia

  • The real reason communications are time-consuming
    An analogy: a small workshop is focused and needs little preparation time, while a conference is diverse and needs much more preparation time. Likewise, local exchanges among a few processors are cheap, while a global communication must synchronize all of them.

  • Strategies for minimizing communications
    Replace dot products by other constructions (e.g. semi-Chebyshev acceleration): hold the workshop only, with no conference if possible. Inner-product-free methods: Gu, Liu, Mo (2002).
    Reorganize the algorithm (reduce the number of conferences, and let each conference accept more talks): residual replacement strategies due to van der Vorst (2000s); CA-KSMs, Demmel et al. (2008).
    Overlap communication with computation (see the MPI sketch after this list).
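
    One way to realize the overlap strategy is MPI-3's non-blocking collectives. The sketch below is a minimal illustration under my own assumptions: the function name and the filler work are hypothetical, not from the talk, and an MPI-3 implementation is assumed.

        /* Overlapping a global reduction with local computation
           via the non-blocking MPI_Iallreduce (MPI-3). */
        #include <mpi.h>

        double overlapped_dot(const double *a, const double *b, int n_local,
                              double *work, int n_work) {
            double local = 0.0, global = 0.0;
            for (int i = 0; i < n_local; i++) local += a[i] * b[i];

            MPI_Request req;
            /* Start the all-reduce, but do not wait for it yet. */
            MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                           MPI_COMM_WORLD, &req);

            /* Meanwhile, do communication-free work, e.g. local vector updates. */
            for (int i = 0; i < n_work; i++) work[i] *= 2.0;

            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* reduction result now valid */
            return global;
        }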

  • A case study: parallelizing GPBiCG(m,l) (S. Fujino, 2002)
    GPBiCG(1,0) = BiCGSTAB
    GPBiCG(0,1) = GPBiCG
    GPBiCG(1,1) = BiCGSTAB2
    The family could be used to design a breakdown-free BiCGSTAB method.

  • GPBiCG(m,l) (S. Fujino, 2002)

  • Algorithm Design of PGPBiCG(m,l) Method

  • PGPBiCG(m,l) method (reducing the number of global communications)
    Algorithm reconstruction: three global communications merged into one
    (A generic sketch of the fusion trick follows.)
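
    The fusion trick itself is generic; the sketch below shows it with placeholder scalars, not the method's actual recurrences. Three separate all-reduces would pay the global latency three times; packing the partial sums into one buffer pays it once.

        #include <mpi.h>

        void fused_reductions(double s1, double s2, double s3, double out[3]) {
            double local[3] = { s1, s2, s3 };   /* pack the three partial sums */
            /* One all-reduce instead of three: one latency hit per iteration. */
            MPI_Allreduce(local, out, 3, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        }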

  • Performance
    Based on Aztec, by Prof. R.S. Tuminaro et al. @ Sandia

  • Convergence analysis
    Residual replacement strategies
    Backward stability analysis

  • Challenging problems
    Accurately computing the dot/inner product; see "Why Mindless?" by Kahan.
    Ogita and Rump et al., Accurate sum and dot product, SIAM J. Sci. Comput., 2005; cited 188 times. (but) ... PLASMA team.
    Backward stability analysis of residual replacement methods.
    Carson and Demmel, A residual replacement strategy for improving the maximum attainable accuracy of communication-avoiding Krylov subspace methods, April 20, 2012.
    A reliable dot computation algorithm (a compensated sketch follows).
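
    For reference, a compensated dot product in the spirit of Ogita and Rump's Dot2 can be sketched as below. Using fma() for the product error term is a simplification of mine rather than the paper's exact formulation; it assumes a correctly rounded fma and round-to-nearest arithmetic.

        #include <math.h>

        double dot2(const double *x, const double *y, int n) {
            double p = 0.0, e = 0.0;               /* running sum and error sum  */
            for (int i = 0; i < n; i++) {
                double h = x[i] * y[i];
                double r = fma(x[i], y[i], -h);    /* exact error of the product */
                double s = p + h;                  /* TwoSum: s + q == p + h     */
                double z = s - p;
                double q = (p - (s - z)) + (h - z);
                p = s;
                e += q + r;                        /* accumulate both error terms */
            }
            return p + e;  /* roughly as accurate as twice the working precision */
        }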

  • Conclusion
    Avoid communication; compute reliably.
    Inner product computation is very likely to be the most challenging kernel for heterogeneous HPC, while the mat-vec is important on both kinds of machines.
    Software abstraction and thread programming are helpful; together with redesigning the algorithms they will do better.
    Math/algorithm <-> CS/performance <-> applications interface: Aztec; pOSKI (Parallel Optimized Sparse Kernel Interface Library, v1.0, May 2, 2012); Hypre; PETSc; Trilinos.

  • Thanks!

  • More than ten thousand processors are connected by a network, so global communication becomes more and more of a bottleneck.
    An initial study of communication complexity

  • Methods in the literature (based on the former two strategies)
    de Sturler and van der Vorst: parallel GMRES(m) and CG methods (1995)
    Bücker and Sauren: parallel QMR method (1997)
    Yang and Brent: improved CGS, BiCG, and BiCGSTAB methods (2002-03)
    Gu and Liu et al.: ICR, IBiCR, IBiCGSTAB(2), and PQMRCGSTAB methods (2004-2010)
    Demmel et al.: CA-KSMs (2008-)
    Gu, Liu, and Mo: MSD-CG, the multiple search direction conjugate gradient method (2004), which replaces the inner product computations by solving small linear systems, eliminating global inner products completely; the idea was generalized to MPCG by Greif and Bridson (2006).

  • Comparison of the computational counts of the two algorithms

  • Mathematical model of the time consumption
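
    A typical per-iteration model of this kind (a reconstruction of mine, not necessarily the slide's exact formula) charges each global reduction a latency term that grows with the processor count P:

        T_{\mathrm{iter}}(P) = \frac{N_{\mathrm{flops}}}{P}\, t_c
          + k_{\mathrm{dot}}\,(t_s + t_w)\,\log_2 P
          + k_{\mathrm{nb}}\,(t_s + m\, t_w)

    Here t_c is the time per flop, t_s the message startup (latency) cost, t_w the per-word transfer cost, m the neighbor-message size, and k_dot, k_nb the numbers of global reductions and nearest-neighbor exchanges per iteration; all symbols are assumed notation.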

  • Scalability analysis

  • The optimal number of processors
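
    Under a simplified cost model of the kind sketched above, say T(P) = a/P + b \log_2 P (an illustrative form of mine, with a the total local work and b the per-reduction cost), the optimal processor count follows from setting the derivative to zero:

        \frac{dT}{dP} = -\frac{a}{P^2} + \frac{b}{P \ln 2} = 0
        \quad\Longrightarrow\quad P_{\mathrm{opt}} = \frac{a \ln 2}{b}

    Beyond P_opt, the growing cost of global reductions outweighs the shrinking local work.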

  • Convergence Analysis

  • Numerical Experiments: timing and improvements

  • Numerical Experiments: Speedup

  • Conclusions
    The PGPBiCG(m,l) method is more scalable and more parallel than GPBiCG(m,l) for solving large sparse nonsymmetric linear systems on distributed parallel architectures.
    Performance and isoefficiency analyses and numerical experiments have been carried out for the PGPBiCG(m,l) and GPBiCG(m,l) methods.
    The parallel communication performance can be improved by a factor larger than 3; the PGPBiCG(m,l) method has better parallel speedup than the GPBiCG(m,l) method.
    For further performance improvements: overlap computation with communication; study numerical stability.
