Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid

DESCRIPTION

Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Jeffrey Bolz, Ian Farmer, Eitan Grinspun, Peter Schröder, Caltech ASCI Center. PowerPoint presentation.

Text of Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid

  • Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid

    Jeffrey Bolz, Ian Farmer, Eitan Grinspun, Peter Schröder

    Caltech ASCI Center

  • Why Use the GPU?

    Semiconductor trends: cost; wires vs. compute; Stanford streaming supercomputer

    Parallelism: many functional units; graphics is prime example

    Harvesting this power: what application is suitable? what abstractions are useful?

    History: massively parallel SIMD machines; media processing

    Chart courtesy Bill Dally

    Imagine stream processor; Bill Dally, Stanford

    Connection Machine CM2; Thinking Machines

    [Chart: series Perf (ps/Inst) and Linear (ps/Inst), by year from 1980 to 2020. Embedded spreadsheet columns: Year, Gate Delay (ps), Gates/Clock, Clock (ps), CPI, Perf (ps/Inst), 1/Grids, Grids, Delay/CPUs.]

  • Contributions and Related Work

    Contributions: numerical algorithms on GPU; unstructured grids (conjugate gradients); regular grids (multigrid); what abstractions are needed?

    Numerical algorithms: Goodnight et al. 2003 (MG); Hall et al. 2003 (cache); Harris et al. 2002 (FD sim.); Hillesland et al. 2003 (optimization); Krueger & Westermann 2003 (NLA); Strzodka (PDEs)

  • Streaming Model

    Abstract model: Purcell et al. 2002; data structures are streams; algorithms are kernels

    Concrete model: render a rectangle; data structures are textures; algorithms are fragment programs

    [Diagram labels: input record stream; output record stream; globals]
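The abstract streaming model above can be illustrated with a minimal Python sketch. It is only an analogy, not GPU code: `run_kernel`, `scale_and_shift`, and the `scale`/`shift` globals are hypothetical names chosen for this example. The point it tries to show is that a kernel sees one input record at a time plus read-only globals, and emits one output record.

```python
from typing import Callable, List, Mapping, Sequence


def run_kernel(
    kernel: Callable[[float, Mapping[str, float]], float],
    stream: Sequence[float],
    globals_: Mapping[str, float],
) -> List[float]:
    """Apply `kernel` independently to every record of the input stream.

    Each output record depends only on its own input record and on the
    read-only globals, so the records could be processed in any order
    (or in parallel, as a fragment pipeline would).
    """
    return [kernel(record, globals_) for record in stream]


def scale_and_shift(record: float, globals_: Mapping[str, float]) -> float:
    """Hypothetical kernel: one multiply-add per record."""
    return globals_["scale"] * record + globals_["shift"]


input_stream = [1.0, 2.0, 3.0]
output_stream = run_kernel(scale_and_shift, input_stream, {"scale": 2.0, "shift": 1.0})
```

In the concrete model from the slide, the stream would live in a texture, the kernel would be a fragment program, and "running" it would mean rendering a rectangle that covers the output texture.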

  • Sparse Matrices: Geometric Flow

    Ubiquitous in numerical computing: discretization of PDEs (animation); finite elements, differences, volumes; optimization, editing, etc.

    Example here: processing of surfaces

    Canonical non-linear problem: mean curvature flow; implicit time discretization; solve sequence of SPD systems

  • Conjugate Gradients

    High-level code, inner loop: matrix-vector multiply; sum-reduction; scalar-vector MAD

    Inner product: fragment-wise multiply followed by sum-reduction; odd dimensions can be handled
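The three inner-loop building blocks named on this slide can be sketched on the CPU in plain Python. This is a textbook conjugate gradient iteration, not the authors' GPU code; the function names, the tolerance, and the small test matrix are assumptions made for illustration. The inner product is written as an element-wise multiply followed by a pairwise sum-reduction that carries the leftover element when the length is odd.

```python
def sum_reduce(values):
    """Pairwise sum-reduction: each pass roughly halves the stream."""
    vals = list(values)
    if not vals:
        return 0.0
    while len(vals) > 1:
        half = len(vals) // 2
        reduced = [vals[2 * i] + vals[2 * i + 1] for i in range(half)]
        if len(vals) % 2 == 1:
            reduced.append(vals[-1])  # odd length: carry the last element
        vals = reduced
    return vals[0]


def dot(a, b):
    """Inner product: element-wise multiply, then sum-reduction."""
    return sum_reduce(x * y for x, y in zip(a, b))


def mad(alpha, x, y):
    """Scalar-vector multiply-add: alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given x -> A x."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr <= tol * tol:
            break
        Ap = matvec(p)                  # matrix-vector multiply
        alpha = rr / dot(p, Ap)         # sum-reduction
        x = mad(alpha, p, x)            # scalar-vector MAD
        r = mad(-alpha, Ap, r)
        rr_new = dot(r, r)
        p = mad(rr_new / rr, p, r)
        rr = rr_new
    return x


# Small SPD example with an odd dimension (hypothetical data).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
solution = conjugate_gradient(lambda v: [dot(row, v) for row in A], b)
```

Passing the matrix as a `matvec` callable keeps the solver independent of how the sparse matrix is stored, which is the part the following slides deal with.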

  • y = Ax

    Aj: off-diagonal matrix elements

    R: pointers to segments

  • Row-Vector Product

    X: vector elements

    R: pointers to segments

    Ai: diagonal matrix elements

    J: pointers to xj

    Aj: off-diagonal matrix elements

    Fragment program
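A CPU sketch of the row-vector product using the arrays named on this slide may help. It assumes, since the slide does not say, that `R` holds one start offset per row plus a final end offset, so that row `i` owns the segment `Aj[R[i]:R[i+1]]`, and that `J[k]` is the column index of `Aj[k]`. The function name and the example matrix are hypothetical.

```python
def sparse_matvec(X, R, Ai, J, Aj):
    """Compute y = A x with the diagonal stored separately from the rest.

    X  : vector elements
    R  : pointers to segments (assumed: row i uses Aj[R[i]:R[i+1]])
    Ai : diagonal matrix elements, one per row
    J  : pointers to x_j (column index of each off-diagonal entry)
    Aj : off-diagonal matrix elements, stored row by row
    """
    y = []
    for i in range(len(Ai)):
        acc = Ai[i] * X[i]                 # diagonal contribution
        for k in range(R[i], R[i + 1]):    # this row's segment
            acc += Aj[k] * X[J[k]]         # indirect fetch of x_j
        y.append(acc)
    return y


# The matrix [[4, 1, 0], [1, 3, 1], [0, 1, 2]] in this layout.
Ai = [4.0, 3.0, 2.0]
Aj = [1.0, 1.0, 1.0, 1.0]
J = [1, 0, 2, 1]
R = [0, 1, 3, 4]
y = sparse_matvec([1.0, 2.0, 3.0], R, Ai, J, Aj)
```

On the GPU the outer loop disappears: each fragment would play the role of one row `i`, and the lookups through `R` and `J` would become dependent texture fetches.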

  • Apply to All Pixels

    Two extremes: one row at a time (setup overhead); all rows at once (limited by worst row)

    Middle ground: organize batches of work

    How to arrange batches? Order rows by non-zero entries; optimal packing is NP-hard

    We choose fixed-size rectangles: fragment pipe is quantized; simple experiments reveal best size; 26 x 18 is 91% efficient; wasted fragments on diagonal

  • Packing (Greedy)

    [Figure: non-zero entries per row]

    All this setup is done only once, at the beginning of time. It depends only on mesh connectivity.
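The trade-off between the two extremes can be made concrete with a small sketch. It uses a simple sort-then-chunk heuristic, which is an assumption and not necessarily the greedy packing used in the talk: rows are ordered by non-zero count and cut into fixed-size batches, and every row in a batch is charged the cost of the batch's worst row. The row counts and the batch size below are made-up numbers.

```python
def batch_rows(nnz_per_row, batch_size):
    """Order rows by non-zero count (largest first) and cut into batches."""
    order = sorted(range(len(nnz_per_row)),
                   key=lambda i: nnz_per_row[i], reverse=True)
    return [order[s:s + batch_size] for s in range(0, len(order), batch_size)]


def efficiency(nnz_per_row, batches, batch_size):
    """Useful work divided by issued work.

    Each batch occupies a full fixed-size rectangle and runs as long as
    its worst row, so shorter rows and empty slots are wasted fragments.
    """
    useful = sum(nnz_per_row)
    issued = sum(batch_size * max(nnz_per_row[i] for i in batch)
                 for batch in batches)
    return useful / issued


nnz = [9, 4, 7, 7, 15, 5, 6, 8]          # hypothetical non-zeros per row
batches = batch_rows(nnz, batch_size=4)
sorted_eff = efficiency(nnz, batches, batch_size=4)

# "All rows at once": a single batch limited by the worst row.
all_at_once_eff = efficiency(nnz, [list(range(len(nnz)))], batch_size=len(nnz))
```

Because the batches depend only on the non-zero counts, they would need to be rebuilt only if the mesh connectivity changed, which matches the note on the slide that the setup is done once.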

  • Recomputing Matrix

    Matrix entries depend on surface: must render into matrix; two additional indirection textures (previous and next)

  • Results (NV30 @ 500 MHz)

    37k elements

    Matrix multiply: 33 instructions, 120 per second; only 13 flops; latency limited

    Reduction: 7 inst/frag/pass, 3400 per second

    CG solve: 20 per second

  • Regular Grids

    Poisson solver as example: multigrid approach; this time variables on pixel grid

    E.g. Navier-Stokes: after discretization, solve Poisson eq. at each time step
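To make "variables on the pixel grid" concrete, here is a small CPU sketch under stated assumptions: it uses the standard 5-point finite-difference Laplacian with grid spacing `h`, keeps boundary values fixed, and uses Jacobi relaxation as the iterative step. Jacobi is chosen only for simplicity; the slides do not say which relaxation the multigrid solver uses, and the function names are hypothetical.

```python
def apply_laplacian(u, h=1.0):
    """Apply the 5-point Laplacian stencil to the interior of grid `u`.

    Boundary entries of the result are left at zero.
    """
    n, m = len(u), len(u[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = (u[i - 1][j] + u[i + 1][j]
                         + u[i][j - 1] + u[i][j + 1]
                         - 4.0 * u[i][j]) / (h * h)
    return out


def jacobi_sweep(u, f, h=1.0):
    """One Jacobi relaxation sweep for Laplacian(u) = f, boundary fixed."""
    n, m = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1]
                                - h * h * f[i][j])
    return new


# u(i, j) = i^2 has a constant discrete Laplacian of 2 in the interior.
u = [[float(i * i)] * 5 for i in range(5)]
f = apply_laplacian(u)
relaxed = jacobi_sweep(u, f)
```

Each output value reads only a fixed neighbourhood of the input grid, which is why this kind of update maps naturally onto one fragment per pixel.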

  • Poisson Equation

    Appears all over the place: easy to discretize on regular grid; matrix multiply is stencil application; FD Laplace stencil: [figure]

    Use iterative matrix solver: just need application of stencil; easy, just like filtering; incorporate