Parallel Cholesky Decomposition in Julia (6.338 Project)




Omar Mysore

December 16, 2011


Project Introduction

Sparse matrices dominate numerical simulation and technical computing. Their applications are practically limitless, ranging from solving partial differential equations to convex optimization to big data. Because of the wide use of sparse matrices, the objective of this project was to implement sparse Cholesky decomposition in parallel using the Julia language. In addition to developing the Cholesky decomposition feature in the Julia language and investigating the effects of parallelization, this project served the purpose of aiding the development of sparse matrix support in Julia.

A working sparse parallel Cholesky decomposition solver was developed. Although the current implementation has limits in terms of speed and capability, it is hoped that this implementation can serve as a means to further development. The remainder of this report will discuss the basics of Cholesky decomposition and the algorithm used, the opportunities and methods of parallelization, the results, and the potential for future work.


Cholesky Decomposition

The Cholesky decomposition of a symmetric positive definite matrix A determines the lower-triangular matrix L, where LLᵀ = A. Although it is limited to symmetric positive definite matrices, these matrices often appear in fields such as convex optimization. The following equations are taken from [2].




    
The basic dense Cholesky decomposition algorithm consists of repeatedly performing the factorization below on a matrix A of size n, where d is a scalar and v is an (n−1)-by-1 column vector:

    A = [ d   vᵀ ]  =  [ √d     0 ] [ 1   0          ] [ √d   vᵀ/√d ]
        [ v   C  ]     [ v/√d   I ] [ 0   C − vvᵀ/d  ] [ 0    I     ]

Once this factorization is completed, the first column of L is determined. Then the same factorization is performed on C − vvᵀ/d, and the process is repeated until all of the columns of L are found. Similarly, the same process can be done for block matrices:
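The column-by-column recursion above can be sketched as follows (shown in Python with NumPy purely for illustration; the function name is ours). Each step emits one column of L and applies the Schur complement update C − vvᵀ/d to the trailing submatrix:

```python
import numpy as np

def cholesky_outer(A):
    """Right-looking sketch of the repeated factorization: peel off the
    top-left scalar d and the rest of the first column v, emit column
    [sqrt(d); v/sqrt(d)] of L, and recurse on C - v v^T / d."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        d = A[j, j]
        v = A[j+1:, j]
        L[j, j] = np.sqrt(d)
        L[j+1:, j] = v / np.sqrt(d)
        A[j+1:, j+1:] -= np.outer(v, v) / d   # Schur complement update
    return L

# Example: for a symmetric positive definite A, L @ L.T reproduces A.
A = np.array([[4., 2.], [2., 3.]])
L = cholesky_outer(A)
```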


Cholesky Decomposition for Sparse Matrices


For sparse matrices, several additional steps are taken in order to take advantage of the substantially fewer nonzero values. First, the fill-ins and the structure of L are determined, without calculating the values. Next, the tree of dependencies is found, and finally the values of L are calculated. For a detailed explanation of the method summarized in this report, see [2].


The first step in sparse Cholesky decomposition is to determine the structure of L without explicitly determining L. All of the following images are taken from John Gilbert’s slides [1]. Suppose A has the structure below:


Then, the goal would be to determine the structure of L, which is shown below:

The red dots are known as the fill-ins, since these values are nonzero in L but zero in A. Although the structure of L is known, none of the specific values are known. In order to determine the structure of L, the graphs of the matrices are used. Below is the graph of the previously shown matrix A:



From this graph, the graph of L can easily be determined by connecting the higher-numbered neighbors of each node/column in the graph. Below is the graph of L with the fill-ins in red:
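This neighbor-connecting rule amounts to a symbolic factorization, which can be sketched as follows (Python for illustration; the function name and the example matrix are ours). Eliminating column j passes its higher-numbered neighbors, except the lowest one, up to the column of that lowest neighbor; entries created this way are the fill-ins:

```python
import numpy as np

def fill_pattern(A):
    """Sketch: compute the below-diagonal nonzero pattern of L, including
    fill-ins, without computing any values."""
    n = A.shape[0]
    # below-diagonal nonzero rows of each column of A
    pattern = [set(np.nonzero(A[:, j])[0]) - set(range(j + 1)) for j in range(n)]
    for j in range(n):
        nbrs = sorted(pattern[j])
        if nbrs:
            p = nbrs[0]                  # lowest higher-numbered neighbor of j
            pattern[p] |= set(nbrs[1:])  # propagated entries may be fill-ins
    return pattern

# Example (1 marks a nonzero). Entries (3,1) and (3,2) become fill-ins.
A = np.array([[1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 1]])
cols = fill_pattern(A)
```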


Once the structure of L is determined, the dependency tree can be determined. In order to calculate the dependency tree, the parent of each node must be determined; the parent of a given node/column is the minimum row index of a nonzero value in the given column of L, not including the diagonal values. For example, for the graph of L previously shown, the dependency tree is shown below:
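The parent rule just stated (minimum row index of a below-diagonal nonzero) translates directly into code. A minimal sketch in Python (function name ours), operating on the nonzero pattern of L:

```python
import numpy as np

def dependency_tree(L):
    """Parent of column j = smallest row index > j where column j of L is
    nonzero; columns with no below-diagonal nonzero are roots (parent -1)."""
    n = L.shape[0]
    parent = [-1] * n
    for j in range(n):
        below = np.nonzero(L[j+1:, j])[0]
        if below.size > 0:
            parent[j] = int(below[0]) + j + 1
    return parent

# Example: a bidiagonal pattern gives the chain 0 -> 1 -> 2.
L = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1]])
tree = dependency_tree(L)
```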



After the dependency tree is determined, the values of L can be calculated. The basic equations for determining each column of L are shown below:


    
For each column j in A:

    Determine the frontal matrix Fj, which assembles the entries of A in column j with the update matrices contributed by the columns in T[j]−{j}. One step of dense factorization on Fj then yields column j of L, along with the update matrix passed to the parent of j.

Here, T[j]−{j} represents all of the children nodes, which must already have been determined in order to calculate Lj.
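In dense terms, the dependence works out to this: column j of L needs only the earlier columns k with L[j,k] ≠ 0, i.e. the columns in T[j]−{j}. A left-looking sketch of the per-column computation (Python/NumPy for illustration; this is not the report's multifrontal code):

```python
import numpy as np

def column_cholesky(A):
    """Left-looking sketch: form each column j of L by subtracting the
    contributions of the already-computed columns, then scaling."""
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        # subtract contributions of columns k < j (only k with L[j,k] != 0 matter)
        s = A[j:, j] - L[j:, :j] @ L[j, :j]
        L[j, j] = np.sqrt(s[0])
        L[j+1:, j] = s[1:] / L[j, j]
    return L

A = np.array([[4., 2., 0.], [2., 5., 1.], [0., 1., 3.]])
L = column_cholesky(A)
```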


This is the basic formulation for determining the sparse Cholesky decomposition. All equations and images used in this and the previous section were obtained from [1] and [2]. For more details, see [2].


Parallelization and Implementation in Julia

For sparse Cholesky decomposition, there are three primary opportunities for parallelization: first, in determining the structure of L; second, in simultaneously computing columns that are independent; and third, in turning the addition step for each column into a parallel reduction.
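For the third opportunity, the updates a column receives from its children are combined with an associative sum, which is what makes a parallel reduction possible. A serial stand-in (Python; the update vectors here are made up for illustration):

```python
from functools import reduce
import numpy as np

# hypothetical update vectors contributed by three child columns
updates = [np.array([1.0, 2.0]),
           np.array([3.0, 4.0]),
           np.array([0.5, 0.5])]

# the merge is associative, so this serial reduce could equally be
# evaluated as a balanced tree of pairwise sums across processors
total = reduce(lambda u, v: u + v, updates)
```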


    
In order to determine the structure of L, each column or block of columns can be sent to a different processor, and the necessary fill-ins can be determined. Once the fill-ins are determined, they can be added to L. The following Julia function demonstrates this:


    function fillinz(L)
        L = tril(L)        # keep only the lower triangle
        L = spones(L)      # keep only the nonzero pattern
        # send each column to a different processor to find its fill-ins
        refs = [remote_call(mod(i, nprocs()) + 1, pfillz, L[:,i], i) | i = 1:size(L)[1]]
        for i = 1:size(L)[1]
            q = fetch(refs[i])             # (row, column) pairs of fill-ins
            for j = 1:length(q)/2
                L[q[2*j-1], q[2*j]] = 1    # record each fill-in in the pattern
            end
        end
        return L
    end
 


The input to this function is the matrix A, for which we would like the Cholesky decomposition. The vector refs sends each column to a different processor, and the for-loop obtains the results and adds the fill-ins.


    

The tree structure of dependencies allows for further parallelization. As previously discussed, for a matrix A, the dependency tree might look like the following figure:


In this case, columns 1, 2, 4, and 6 of L can all be determined without any other columns of L, and they can be determined simultaneously. The following for-loop, which is part of the main sparse Cholesky function, performs this process of going through the levels of the tree and sending all of the columns at the same level to different processors:


    for i = 1:size(kids)[1]
        k = []
        # collect the columns at the current stage of the tree
        for j = 1:size(kids)[1]
            if tree[j] == i-1
                k = vcat(k, [j])
            end
        end
        # compute all columns at this stage on different processors
        refs = [remote_call(mod(i, nprocs()) + 1, spcholhelp, A, L, k[i], kids) | i = 1:length(k)]
        for m = 1:length(refs)
            lcol = fetch(refs[m])    # finished column of L
            L[:, k[m]] = lcol
        end
    end


    
In this for-loop, the index i loops through all of the possible values for the number of children a column can have. The vector refs contains all of the columns of L that are calculated in parallel for a given stage of the tree.
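The grouping of columns into simultaneous stages can also be sketched from the parent array of the dependency tree (Python for illustration; the function name is ours, and this groups by tree depth rather than by the tree array the loop above uses). A column's level is one more than the deepest of its children, and every column within a level is independent:

```python
from collections import defaultdict

def schedule_levels(parent):
    """Group columns of L into levels of the dependency tree; all columns
    in one level can be computed simultaneously."""
    n = len(parent)
    level = [0] * n
    for j in range(n):              # children have smaller indices than parents
        p = parent[j]
        if p != -1:
            level[p] = max(level[p], level[j] + 1)
    groups = defaultdict(list)
    for j in range(n):
        groups[level[j]].append(j)
    return [groups[l] for l in sorted(groups)]

# Example: columns 0 and 1 both feed column 2, so they share a level.
stages = schedule_levels([2, 2, -1])
```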


    
The final level of parallelizati