MILAMIN: MATLAB-based finite element method solver for large problems

M. Dabrowski, M. Krotkiewski, and D. W. Schmid

Physics of Geological Processes, University of Oslo, Pb 1048 Blindern, N-0316 Oslo, Norway ([email protected])

[1] The finite element method (FEM) combined with unstructured meshes forms an elegant and versatile approach capable of dealing with the complexities of problems in Earth science. Practical applications often require high-resolution models that necessitate advanced computational strategies. We therefore developed "Million a Minute" (MILAMIN), an efficient MATLAB implementation of FEM that is capable of setting up, solving, and postprocessing two-dimensional problems with one million unknowns in one minute on a modern desktop computer. MILAMIN allows the user to achieve numerical resolutions that are necessary to resolve the heterogeneous nature of geological materials. In this paper we provide the technical knowledge required to develop such models without the need to buy a commercial FEM package, write compiler-language code, or hire a computer specialist. It has been our special aim that all the components of MILAMIN perform efficiently individually and as a package. While some of the components rely on readily available routines, we develop others from scratch and make sure that all of them work together efficiently. One of the main technical focuses of this paper is the optimization of the global matrix computations. The performance bottlenecks of the standard FEM algorithm are analyzed. An alternative approach is developed that sustains high performance for any system size. Applied optimizations eliminate Basic Linear Algebra Subprograms (BLAS) drawbacks when multiplying small matrices, reduce operation count and memory requirements when dealing with symmetric matrices, and increase data transfer efficiency by maximizing cache reuse. Applying loop interchange allows us to use BLAS on large matrices. In order to avoid unnecessary data transfers between RAM and CPU cache we introduce loop blocking. The optimization techniques are useful in many areas as demonstrated with our MILAMIN applications for thermal and incompressible flow (Stokes) problems. We use these to provide performance comparisons to other open source as well as commercial packages and find that MILAMIN is among the best performing solutions, in terms of both speed and memory usage. The corresponding MATLAB source code for the entire MILAMIN, including input generation, FEM solver, and postprocessing, is available from the authors (http://www.milamin.org) and can be downloaded as auxiliary material.

Components: 11,344 words, 10 figures, 3 tables, 1 animation.

Keywords: numerical models; FEM; earth science; diffusion; incompressible Stokes; MATLAB.

Index Terms: 0545 Computational Geophysics: Modeling (4255); 0560 Computational Geophysics: Numerical solutions (4255); 0850 Education: Geoscience education research.

Received 12 June 2007; Revised 26 October 2007; Accepted 11 December 2007; Published 23 April 2008.

Dabrowski, M., M. Krotkiewski, and D. W. Schmid (2008), MILAMIN: MATLAB-based finite element method solver for large problems, Geochem. Geophys. Geosyst., 9, Q04030, doi:10.1029/2007GC001719.

Geochemistry, Geophysics, Geosystems (G3)

An Electronic Journal of the Earth Sciences

Published by AGU and the Geochemical Society

Technical Brief

Volume 9, Number 4

23 April 2008

Q04030, doi:10.1029/2007GC001719

ISSN: 1525-2027


Copyright 2008 by the American Geophysical Union

http://dx.doi.org/10.1029/2007GC001719

1. Introduction

[2] Geological systems are often formed by multiphysics processes interacting on many temporal and spatial scales. Moreover, they are heterogeneous and exhibit large material property contrasts. In order to understand and decipher these systems numerical models are frequently employed. Appropriate resolution of the behavior of these heterogeneous systems, without the (over)simplifications of a priori applied homogenization techniques, requires numerical models capable of efficiently and accurately dealing with high-resolution, geometry-adapted meshes. These criteria are usually used to justify the need for special purpose software (commercial finite element method (FEM) packages) or special code development in high-performance compiler languages such as C or FORTRAN. General purpose packages like MATLAB are usually considered not efficient enough for this task. This is reflected in the current literature. MATLAB is treated as an educational tool that allows for fast learning when trying to master numerical methods, e.g., the books by Kwon and Bang [2000], Elman et al. [2005], and Pozrikidis [2005]. MATLAB also facilitates very short implementations of numerical methods that give overview and insight, which is impossible to obtain when dealing with closed black-box routines, e.g., finite elements on 50 lines [Alberty et al., 1999], topology optimization on 99 lines [Sigmund, 2001], and mesh generation on one page [Persson and Strang, 2004]. However, while advantageous from an educational standpoint, these implementations are usually rather slow and run at a speed that is a fraction of the peak performance of modern computers. Therefore the usual approach is to use MATLAB for prototyping, development, and testing only. This is followed by an additional step where the code is manually translated to a compiler language to achieve the memory and CPU efficiency required for high-resolution models.

[3] This paper presents the outcome of a project called "MILAMIN - MILlion A MINute" aimed at developing a MATLAB-based FEM package capable of preprocessing, processing, and postprocessing an unstructured mesh problem with one million degrees of freedom in two dimensions within one minute on a commodity personal computer. Choosing a native MATLAB implementation allows simultaneously for educational insight, easy access to computational libraries and visualization tools, rapid prototyping and development, as well as actual two-dimensional production runs. Our standard implementation serves to provide educational insight into subjects such as implementation of the numerical method, efficient use of the computer architecture and computational libraries, code structuring, proper data layout, and solution techniques. We also provide an optimized FEM version that increases the performance of production runs even further, but at the cost of code clarity.

[4] The MATLAB code implementing the different approaches discussed here is available from the authors (http://www.milamin.org) and can be downloaded as auxiliary material (see Software S1 [footnote 1]).

2. Code Overview

[5] A typical finite element code consists of three basic components: preprocessor, processor, postprocessor. The main component is the processor, which is the actual numerical model that implements a discretized version of the governing conservation equations. The preprocessor provides all the input data for the processor and in the present case the main work is to generate an unstructured mesh for a given geometry. The task of the postprocessor is to analyze and visualize the results obtained by the processor. These three components of MILAMIN are documented in the following sections.

3. Preprocessor

[6] Geometrically complex problems promote the use of interface adapted meshes, which accurately resolve the input geometry and are typically created by a mesh generator that automatically produces a quality mesh. The drawback of this approach is that one cannot exploit the advantages of solution strategies for structured meshes, such as operator splitting methods (e.g., ADI [Fletcher, 1997]) or geometric multigrid [Wesseling, 1992] for efficient computation.

[7] A number of mesh generators are freely available. Yet, none of these are written in native MATLAB and fulfill the requirement of automated quality mesh generation for multiple domains. DistMesh by Persson and Strang [2004] is an interesting option as it is simple, elegant, and written entirely in MATLAB. However, lack of speed and proper multidomain support renders it unsuitable for a production code with the outlined goals. The mesh generator chosen is Triangle developed by J. R. Shewchuk (version 1.6, http://www.cs.cmu.edu/~quake/triangle.html; Shewchuk, 2007). Triangle is extremely versatile and stable, and consists of one single file that can be compiled into an executable on all platforms with a standard C compiler. We choose the executable-based file I/O approach, which has the advantage that we can always reuse a saved mesh. The disadvantage is that the ASCII file I/O provided by Triangle is rather slow, which can be overcome by adding binary file I/O as described in the instructions provided in the MILAMIN code repository.

Footnote 1: Auxiliary materials are available in the HTML. doi:10.1029/2007GC001719.

4. Processor

4.1. FEM Outline

[8] In this paper we show two different physical applications of MILAMIN: steady state thermal problems and incompressible Stokes flow (referred to as mechanical problem). This section provides an outline of the governing equations and their corresponding FEM formulation. The numerical implementation and performance discussions follow in subsequent sections.

4.1.1. Thermal Problem

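[9] The strong form of the steady state thermal diffusion in the two-dimensional domain $\Omega$ is

$$\frac{\partial}{\partial x}\left(k\,\frac{\partial T}{\partial x}\right) + \frac{\partial}{\partial y}\left(k\,\frac{\partial T}{\partial y}\right) = 0 \quad \text{in } \Omega \tag{1}$$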

where $T$ is temperature, $k$ is the conductivity, and $x$ and $y$ are Cartesian coordinates. The boundary $\Gamma$ of $\Omega$ is divided into two nonintersecting parts: $\Gamma = \Gamma_N \cup \Gamma_D$. Zero heat-flux is specified on $\Gamma_N$ (Neumann boundary condition) and temperature $\bar{T}$ is prescribed on $\Gamma_D$ (Dirichlet boundary condition).

[10] The FEM is based on the weak (variational) formulation of partial differential equations, taking an integral form. For the purpose of this paper we only introduce the basic concepts of this method that are important from an implementation viewpoint. A detailed derivation of the finite element method and a description of the weak formulation of PDEs can be found in textbooks [e.g., Bathe, 1996; Hughes, 2000; Zienkiewicz and Taylor, 2000].

[11] In FEM, the domain $\Omega$ is partitioned into nonoverlapping element subdomains $\Omega^e$, i.e., $\Omega = \bigcup_{e=1}^{\mathrm{nel}} \Omega^e$, where nel denotes the number of elements. The basic two-dimensional element is a triangle. In the thermal problem discrete temperature values are defined for the nodal points, which can be associated with element vertices, located on its edges, or even reside inside the elements. Introducing shape functions $N_i$ that interpolate temperatures from the nodes $T_i$ to the domains of neighboring elements, an approximation to the temperature field $\tilde{T}$ in $\Omega$ is defined as

$$\tilde{T}(x,y) = \sum_{i=1}^{\mathrm{nnod}} N_i(x,y)\,T_i \tag{2}$$

where nnod is the number of nodes in the discretized domain.
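Equation (2) can be illustrated on a single element. The following is a minimal sketch, assuming a linear 3-node triangle (simpler than the 6- and 7-node elements used later in the paper) and made-up nodal coordinates and temperatures; none of these values come from MILAMIN.

```matlab
% Nodal coordinates of one linear triangle (hypothetical values)
X  = [0 1 0;        % x-coordinates of the three nodes
      0 0 1];       % y-coordinates of the three nodes
Tn = [10; 20; 40];  % nodal temperatures T_i (hypothetical values)

% Point inside the element where the temperature is interpolated
xp = 0.25;  yp = 0.5;

% For a linear triangle the shape functions are the barycentric coordinates.
% They follow from: sum(N) = 1, sum(N.*x) = xp, sum(N.*y) = yp
N = [ones(1,3); X] \ [1; xp; yp];

% Equation (2): the interpolated value is the shape-function-weighted sum
Tp = N' * Tn;
```

Within one element only the shape functions of its own nodes are nonzero, which is why the global sum in equation (2) reduces to three terms here.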

[12] On the basis of the weak formulation that takes the form of an integral over $\Omega$, the problem can now be stated in terms of a system of linear equations. From a computational point of view it is beneficial to evaluate this integral as a sum of integrals over each element $\Omega^e$. A single element contribution, the so-called "element stiffness matrix," to the global system matrix in the Galerkin approach for the thermal problem is given by

$$K^e_{ij} = \iint_{\Omega^e} k^e \left(\frac{\partial N_i}{\partial x}\frac{\partial N_j}{\partial x} + \frac{\partial N_i}{\partial y}\frac{\partial N_j}{\partial y}\right) dx\,dy \tag{3}$$

where $k^e$ is the element specific conductivity. Note that the shape function index in equation (3) corresponds to local numbering of element nodes and must be converted to global node numbers before element matrix $K^e$ is assembled into the global matrix $K$.

4.1.2. Mechanical Problem

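[13] The strong form of the plane strain Stokes flow in $\Omega$ is

$$\begin{aligned}
\frac{\partial}{\partial x}\left[\mu\left(\frac{4}{3}\frac{\partial u_x}{\partial x} - \frac{2}{3}\frac{\partial u_y}{\partial y}\right)\right] + \frac{\partial}{\partial y}\left[\mu\left(\frac{\partial u_x}{\partial y} + \frac{\partial u_y}{\partial x}\right)\right] - \frac{\partial p}{\partial x} &= f_x \\
\frac{\partial}{\partial y}\left[\mu\left(\frac{4}{3}\frac{\partial u_y}{\partial y} - \frac{2}{3}\frac{\partial u_x}{\partial x}\right)\right] + \frac{\partial}{\partial x}\left[\mu\left(\frac{\partial u_x}{\partial y} + \frac{\partial u_y}{\partial x}\right)\right] - \frac{\partial p}{\partial y} &= f_y \quad \text{in } \Omega \\
\frac{\partial u_x}{\partial x} + \frac{\partial u_y}{\partial y} + \frac{p}{\kappa} &= 0
\end{aligned} \tag{4}$$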

where $u_x$ and $u_y$ are components of velocity, $f_x$ and $f_y$ are components of the body force vector field, $p$ is pressure, and $\mu$ denotes viscosity. In our numerical code the incompressibility constraint is achieved by penalizing the bulk deformation with a large bulk modulus $\kappa$. The boundary conditions are given as constrained velocity or vanishing traction components. In equation (4) we use the divergence rather than Laplace form (in the latter different velocity components are only coupled through the incompressibility constraint) as we expect to deal with strongly varying viscosity. It is also worth noting that even for homogeneous models the computationally advantageous Laplace form may lead to serious defects if the boundary terms are not treated adequately [Limache et al., 2007]. Additionally our formulation, equation (4), and its numerical implementation are also applicable to compressible and incompressible elastic problems due to the correspondence principle.

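[14] In analogy to the thermal problem we introduce the discrete spaces to approximate the velocity components and pressure:

$$\begin{aligned}
\tilde{u}_x(x,y) &= \sum_{i=1}^{\mathrm{nnod}} N_i(x,y)\,u_x^i \\
\tilde{u}_y(x,y) &= \sum_{i=1}^{\mathrm{nnod}} N_i(x,y)\,u_y^i \\
\tilde{p}(x,y) &= \sum_{i=1}^{\mathrm{np}} \Pi_i(x,y)\,p^i
\end{aligned} \tag{5}$$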

where np denotes the number of pressure degrees of freedom and $\Pi_i$ are the pressure shape functions, which may not coincide with the velocity ones. To ensure the solvability of the resulting system of equations (inf-sup condition [see Elman et al., 2005]), special care must be taken when constructing the approximation spaces. A wrong choice of the pressure and velocity discretization results in spurious pressure modes that may seriously pollute the numerical solution. Our particular element choice is the seven-node Crouzeix-Raviart triangle with quadratic velocity shape functions enhanced by a cubic bubble function and discontinuous linear interpolation for the pressure field [e.g., Cuvelier et al., 1986]. This element is stable and no additional stabilization techniques are required [Elman et al., 2005]. The fact that in our case the velocity and pressure approximations are autonomous leads to the so-called mixed formulation of the finite element method [Brezzi and Fortin, 1991].

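[15] With the convention that velocity degrees of freedom are followed by pressure ones in the local element numbering, the stiffness matrix for the Stokes problem is given by [e.g., Bathe, 1996]

$$K^e = \begin{pmatrix} A & Q^T \\ Q & -\kappa^{-1} M \end{pmatrix} = \iint_{\Omega^e} \begin{pmatrix} \mu^e B^T D B & -B_{vol}^T \Pi^T \\ -\Pi B_{vol} & -\kappa^{-1} \Pi \Pi^T \end{pmatrix} dx\,dy \tag{6}$$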

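where $B$ is the so-called kinematic matrix transforming velocity into strain rate $\dot{\varepsilon}$ (we use here the engineering convention for the shear strain rate)

$$\begin{pmatrix} \dot{\varepsilon}_{xx}(x,y) \\ \dot{\varepsilon}_{yy}(x,y) \\ \dot{\gamma}_{xy}(x,y) \end{pmatrix} = B(x,y)\,u^e = \begin{pmatrix} \dfrac{\partial N_1}{\partial x}(x,y) & 0 & \dots \\ 0 & \dfrac{\partial N_1}{\partial y}(x,y) & \dots \\ \dfrac{\partial N_1}{\partial y}(x,y) & \dfrac{\partial N_1}{\partial x}(x,y) & \dots \end{pmatrix} \begin{pmatrix} u_x^1 \\ u_y^1 \\ \vdots \end{pmatrix} \tag{7}$$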

The matrix $D$ extracts the deviatoric part of the strain rate, converts from engineering convention to standard shear strain rate, and includes a conventional factor 2. The bulk strain rate is computed according to the equation $\dot{\varepsilon}_{vol} = B_{vol} u^e$ and pressure is the projection of this field onto the pressure approximation space

$$p(x,y) = \kappa\,\Pi^T(x,y)\,M^{-1} Q\,u^e \tag{8}$$
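The roles of the kinematic matrix and of D can be made concrete with a small example. The sketch below assumes a linear 3-node triangle on the unit right triangle (chosen for brevity; the paper's element has seven nodes), interleaved velocity degrees of freedom [ux1 uy1 ux2 uy2 ...] as in equation (7), and a matrix D read off from the coefficients of equation (4). The variable names and test velocity fields are illustrative and not taken from the MILAMIN source.

```matlab
% Shape function derivatives of a linear triangle with nodes (0,0), (1,0), (0,1)
dNdx   = [-1 -1; 1 0; 0 1];          % [nnodel x ndim], constant in the element
nnodel = size(dNdx,1);

% Kinematic matrix B, equation (7)
B = zeros(3, 2*nnodel);
B(1,1:2:end) = dNdx(:,1)';           % exx = dux/dx
B(2,2:2:end) = dNdx(:,2)';           % eyy = duy/dy
B(3,1:2:end) = dNdx(:,2)';           % gxy = dux/dy + duy/dx
B(3,2:2:end) = dNdx(:,1)';

% D as implied by equation (4): deviatoric part, factor 2, engineering shear
D = [ 4/3 -2/3 0;
     -2/3  4/3 0;
       0    0  1];

% Bulk strain rate operator: exx + eyy
B_vol = B(1,:) + B(2,:);

% Pure shear field ux = x, uy = -y sampled at the nodes
u_shear   = [0; 0; 1; 0; 0; -1];
e_shear   = B*u_shear;               % strain rate [exx; eyy; gxy]
s_shear   = D*e_shear;               % deviatoric stress divided by viscosity
div_shear = B_vol*u_shear;           % divergence of the velocity field

% Rigid rotation ux = -y, uy = x must not produce any strain rate
u_rot = [0; 0; 0; 1; -1; 0];
e_rot = B*u_rot;
```

The pure shear field is divergence free and yields a deviatoric stress of twice the strain rate (times viscosity), which is the conventional factor 2 mentioned above.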

With the chosen approximation spaces, the linear pressure shape functions $\Pi$ are spanned by the corner nodal values that are defined independently for neighboring elements. Thus it is possible to invert $M$ on element level (the so-called static condensation) and consequently avoid the pressure unknowns in the global system. Since the pressure part of the right-hand-side vector is set to zero, this results in the following velocity Schur complement:

$$\left(A + \kappa\,Q^T M^{-1} Q\right) u^e = f^e \tag{9}$$
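The step from the block system (6) to the Schur complement (9) can be spelled out in three lines. This is a short sketch in which $p^e$ denotes the vector of element pressure coefficients (a symbol introduced here for illustration) and the element right-hand side is $(f^e, 0)$, as stated above.

```latex
\begin{aligned}
A\,u^e + Q^T p^e &= f^e, \\
Q\,u^e - \kappa^{-1} M\,p^e &= 0
  \;\;\Longrightarrow\;\; p^e = \kappa\,M^{-1} Q\,u^e, \\
A\,u^e + Q^T\!\left(\kappa\,M^{-1} Q\,u^e\right) &= f^e
  \;\;\Longrightarrow\;\; \left(A + \kappa\,Q^T M^{-1} Q\right) u^e = f^e .
\end{aligned}
```

The second line is the coefficient form of equation (8), so the pressure can be recovered from the velocity solution without ever entering the global system.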

Once the solution to the global counterpart of (9) is obtained, the pressure can be restored afterward according to (8). The resulting global system of equations is not only symmetric, but also positive-definite as opposed to the original system (6). Unfortunately, the global matrix becomes ill-conditioned for penalty parameter values corresponding to a satisfactorily low level of the flow divergence. It is possible to circumvent this by introducing Powell and Hestenes iterations [Cuvelier et al., 1986] and keeping the penalty parameter $\kappa$ moderate compared to the viscosity $\mu$:

$$\begin{aligned}
&p^0 = 0 \\
&\textbf{while } \max\left(\Delta p^i\right) > tol \\
&\quad u^i = \left(A + \kappa\,Q^T M^{-1} Q\right)^{-1}\left(f - Q^T p^i\right) \\
&\quad \Delta p^i = \kappa\,M^{-1} Q\,u^i \\
&\quad p^{i+1} = p^i + \Delta p^i \\
&\quad \text{increment } i \\
&\textbf{end}
\end{aligned} \tag{10}$$

In the above iteration scheme the matrices $A$, $Q$, $M$ represent global assembled versions rather than single element contributions.
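The iteration scheme (10) translates almost line by line into MATLAB. The following is a minimal sketch on a tiny made-up system: the matrices A, Q, M, the vector f, the penalty value and the tolerances are all hypothetical stand-ins for the assembled global quantities, and the absolute value in the stopping test and the iteration cap maxit are choices made here rather than details taken from the MILAMIN code.

```matlab
% Small stand-in system (hypothetical values, not a real discretization)
A = [ 4 -1  0  0;
     -1  4 -1  0;
      0 -1  4 -1;
      0  0 -1  4];               % symmetric positive-definite velocity block
Q = [1 1 0 0;
     0 0 1 1];                   % divergence-like constraint block
M = 0.5*eye(2);                  % pressure mass matrix
f = [1; 2; 3; 4];                % right-hand side

kappa = 1e3;                     % moderate penalty parameter
tol   = 1e-10;                   % tolerance on the pressure increment
maxit = 50;                      % safety cap on the number of iterations

S  = A + kappa*(Q'*(M\Q));       % velocity Schur complement, cf. equation (9)
p  = zeros(size(Q,1),1);         % p^0 = 0
dp = Inf;
it = 0;
while max(abs(dp)) > tol && it < maxit
    u  = S \ (f - Q'*p);         % velocity update
    dp = kappa*(M\(Q*u));        % pressure increment
    p  = p + dp;                 % pressure update
    it = it + 1;
end
```

On exit the constraint Q*u is satisfied to within the tolerance even though the penalty parameter is only moderately large, which is the point of the iteration. In a production setting S would be factorized once outside the loop and the factor reused for every solve.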

4.1.3. Isoparametric Elements

[16] To exploit the full flexibility of FEM, we employ isoparametric elements. Each element in physical space is mapped onto the reference element with fixed shape, size, and orientation. This geometrical mapping between local $(\xi, \eta)$ and global $(x, y)$ coordinates of an element is realized using the same shape functions $N_i$ that interpolate physical fields:

$$x(\xi,\eta) = \sum_{i=1}^{\mathrm{nnodel}} N_i(\xi,\eta)\,x_i \tag{11}$$

$$y(\xi,\eta) = \sum_{i=1}^{\mathrm{nnodel}} N_i(\xi,\eta)\,y_i \tag{12}$$

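where nnodel is the number of nodes in the element. The local linear approximation to this mapping is given by the Jacobian matrix $J$:

$$J = \begin{bmatrix} \dfrac{\partial x}{\partial \xi} & \dfrac{\partial x}{\partial \eta} \\ \dfrac{\partial y}{\partial \xi} & \dfrac{\partial y}{\partial \eta} \end{bmatrix} \tag{13}$$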

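The shape function derivatives with respect to global coordinates $(x, y)$ are calculated using the inverse of the Jacobian and the shape function derivatives with respect to local coordinates $(\xi, \eta)$:

$$\begin{pmatrix} \dfrac{\partial N_i}{\partial x} & \dfrac{\partial N_i}{\partial y} \end{pmatrix} = \begin{pmatrix} \dfrac{\partial N_i}{\partial \xi} & \dfrac{\partial N_i}{\partial \eta} \end{pmatrix} \begin{bmatrix} \dfrac{\partial x}{\partial \xi} & \dfrac{\partial x}{\partial \eta} \\ \dfrac{\partial y}{\partial \xi} & \dfrac{\partial y}{\partial \eta} \end{bmatrix}^{-1} \tag{14}$$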

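Thus the element matrix from equation (3) is now given by

$$K^e_{ij} = \iint_{\Omega_{ref}} k^e \left(\frac{\partial N_i}{\partial x}\frac{\partial N_j}{\partial x} + \frac{\partial N_i}{\partial y}\frac{\partial N_j}{\partial y}\right) |J|\;d\xi\,d\eta \tag{15}$$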

where $|J|$ is the determinant of the Jacobian, taking care of the area change introduced by the mapping, and $\Omega_{ref}$ is the domain of the reference element. To avoid symbolic integration, equation (15) can be integrated numerically:

$$K^e_{ij} = \sum_{k=1}^{\mathrm{nip}} W_k\,k^e \left.\left(\frac{\partial N_i}{\partial x}\frac{\partial N_j}{\partial x} + \frac{\partial N_i}{\partial y}\frac{\partial N_j}{\partial y}\right) |J|\;\right|_{(\xi_k,\eta_k)} \tag{16}$$

Here the integral is transformed into a sum over nip integration points located at $(\xi_k, \eta_k)$, where the individual summands are evaluated and weighted by point specific $W_k$. For numerical integration rules for triangular elements, see, e.g., Dunavant [1985]. The numerical integration of the element matrix arising in the mechanical case is analogous.
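Equations (13), (14) and (16) can be followed for one element. The sketch below assumes a linear 3-node triangle with a 3-point integration rule (the paper itself uses 6- and 7-node elements with more integration points) and made-up coordinates and conductivity. Variable names follow those used in the text where available (ECOORD_X, ED, IP_X, IP_w, dNdui, dNdx); everything else is illustrative.

```matlab
% One linear triangle with vertices (0,0), (2,0), (0,1) (hypothetical data)
ECOORD_X = [0 2 0;                 % [ndim x nnodel] nodal coordinates
            0 0 1];
ED       = 3;                      % element conductivity k^e
nnodel   = 3;

% 3-point rule on the reference triangle; weights sum to its area 1/2
IP_X = [1/6 1/6; 2/3 1/6; 1/6 2/3];
IP_w = [1/6 1/6 1/6];
nip  = length(IP_w);

% Derivatives of N1 = 1-xi-eta, N2 = xi, N3 = eta with respect to (xi,eta).
% They are constant for a linear element, so IP_X is not needed to evaluate them.
dNdui = [-1 -1; 1 0; 0 1];         % [nnodel x ndim]

K_elem = zeros(nnodel);
for ip = 1:nip
    J      = ECOORD_X*dNdui;       % Jacobian, equation (13)
    detJ   = det(J);
    dNdx   = dNdui/J;              % same as dNdui*inv(J), equation (14)
    weight = IP_w(ip)*detJ;
    K_elem = K_elem + weight*ED*(dNdx*dNdx');   % equation (16)
end
```

Each row of K_elem sums to zero because a constant temperature field produces no heat flux, which is a convenient sanity check for any element routine of this kind.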

[17] In the following we first show the straightforward implementation of the global matrix computation and investigate its efficiency. It proves to be unsuited for high-performance computing in the MATLAB environment. We then introduce a different approach, which solves the identified problems. Finally, we discuss how to build sparse matrix data structures, apply boundary conditions, solve the system of linear equations and perform the Powell and Hestenes iterations.

4.2. Matrix Computation: Standard Algorithm

4.2.1. Algorithm Description

[18] The algorithm outlined in Code Fragment 1 (see Figure 1) represents the straightforward implementation of section 4.1. We tried to use intuitive variable and index names; they are explained in Table A1. The details of the algorithm are described in the following (Roman numbers correspond to the comments in Code Fragment 1).

[19] i.) The outermost loop of the standard algorithm is the element loop. Before the actual matrix computation, general element-type specific data such as integration points IP_X and weights IP_w are assigned. The derivatives of the shape functions dNdu with respect to the local $(\xi, \eta)$ coordinates are evaluated in the integration points IP_X. All arrays used during the matrix computation procedure are allocated in advance, e.g., K_all.

[20] ii.) Inside the loop over all elements the code begins with reading element-specific information, such as indices of the nodes belonging to the current element, coordinates of the nodes, and element conductivity, viscosity and density.

[21] iii.) For each element the following loop over integration points performs numerical integration of the underlying equations, which results in the element stiffness matrix K_elem[nnodel,nnodel]. In the case of mechanical code additional matrices A_elem[nedof,nedof], Q_elem[nedof,np] and M_elem[np,np] are required. All of the above arrays must be cleared before the integration point loop together with the right-hand-side vector Rhs_elem.

Figure 1. Code Fragment 1 shows the standard matrix computation.

[22] iv.) a) Inside the integration point loop the precomputed shape function derivatives dNdui are extracted for the current integration point. b) In the chosen element type the pressure is interpolated linearly in the global coordinates. Pressure shape functions Pi at an integration point are obtained as a solution of the system P*Pi = Pb, where the first equation enforces that the shape functions Pi sum to unity.

[23] v.) The Jacobian J[ndim,ndim] is calculated for each integration point by multiplying the element's nodal coordinates matrix ECOORD_X[ndim,nnodel] by dNdui[nnodel,ndim]. Furthermore its determinant, detJ, and inverse, invJ[ndim,ndim], are obtained with the corresponding MATLAB functions.

[24] vi.) The derivatives with respect to global coordinates, dNdx[nnodel,ndim], are obtained by dNdx = dNdui*invJ according to equation (14).

[25] vii.) a) The element thermal stiffness matrix contribution is obtained according to equation (16) and implemented as K_elem = K_elem + weight*ED*(dNdx*dNdx'). b) The kinematic matrix B needs to be formed, equation (7), and A_elem, Q_elem and M_elem are computed according to equation (6).

[26] viii.) The pressure degrees of freedom are eliminated at this stage. It is possible to invert M_elem locally because the pressure degrees of freedom are not coupled across elements, thus there is no need to assemble them into the global system of equations. For large viscosity variations it is beneficial to relate the penalty factor PF to the element's viscosity to improve the condition number of the global matrix.

[27] ix.) a) The lower (incl. diagonal) part of the element stiffness matrix is written into the global storage relying on the symmetry of the system. b) Q_elem and invM_elem matrices are stored for each element in order to avoid recomputing them during Powell and Hestenes iterations.

[28] MATLAB provides a framework for scientific computing that is freed from the burden of conventional high-level programming languages, which require detailed variable declarations and do not provide native access to solvers, visualization, file I/O etc. However, the ease of code development in MATLAB comes with a loss of some performance, especially when certain recommended strategies are not followed: http://mathworks.com/support/solutions/data/1-15NM7.html. The more obvious performance considerations have already gone into the above standard implementation and we would like to point these out:

[29] 1. Memory allocation and explicit variable declaration have been performed. Although not formally required, it is advisable to explicitly declare variables, including their size and type. If variables are not declared with their final size, but are instead successively extended (filled in) during loop evaluation, a large penalty has to be paid for the continuous and unnecessary memory management. Hence, all variables that could potentially grow in size during loop execution are preallocated, e.g., K_all. Variables such as ELEM2NODE that only have to store integer numbers should be declared accordingly, int32 in the case of ELEM2NODE instead of MATLAB's default variable type double. This reduces both the amount of memory required to store this large array and the time required to access it since less data must be transferred.

[30] 2. Data layout has been optimized to facilitate memory access by the CPU. For example, the indices of the nodes of each element must be stored in neighboring memory locations, and similarly the x-y-z coordinates of every node. The actual numbering of nodes and elements also has a visible effect on cache reuse inside the element loop, similar to the sparse matrix-vector multiplication problem [Toledo, 1997].

[31] 3. Multiple data transfers and computations have been avoided. Generally, statements should appear in the outermost possible loop to avoid multiple transfer and computation of identical data. This is why the shape function derivatives with respect to local coordinates, evaluated at the integration points, are precomputed outside the element loop (as opposed to inside the integration loop) and the nodal coordinates are extracted before the integration loop.
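The three points above can be condensed into a few lines. This is a minimal sketch with made-up array sizes and placeholder data; only the names K_all, ELEM2NODE, GCOORD and ECOORD_X correspond to variables mentioned in the text, and the helper variables are hypothetical.

```matlab
nel      = 1000;                         % number of elements (hypothetical)
nnodel   = 6;                            % nodes per element
nentries = nnodel*(nnodel+1)/2;          % lower triangle incl. diagonal

% 1. Preallocate with the final size instead of growing inside the loop
K_all = zeros(nentries, nel);
for iel = 1:nel
    K_all(:,iel) = iel;                  % placeholder for computed element data
end

% 1. Store integer data as int32: 4 bytes per entry instead of 8
ELEM2NODE_double = repmat((1:nnodel)', 1, nel);
ELEM2NODE        = int32(ELEM2NODE_double);
info_d    = whos('ELEM2NODE_double');
info_i    = whos('ELEM2NODE');
mem_ratio = info_d.bytes/info_i.bytes;

% 2. One element per column keeps its node indices contiguous in memory,
%    and integer arrays can be used directly as indices
GCOORD = rand(2, nnodel);                % [ndim x nnod] nodal coordinates

% 3. Extract element data once, before any inner (integration point) loop
ECOORD_X = GCOORD(:, ELEM2NODE(:,1));
```

MATLAB stores arrays column by column, so placing the data of one element or node in one column is what makes the accesses inside the element loop contiguous.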

4.2.2. Performance Analysis

[32] In order to analyze the performance of the standard matrix computation algorithm we run corresponding tests on an AMD Opteron system with 64 bit Red Hat Enterprise Linux 4 and MATLAB 2007a using GoTo BLAS (http://www.tacc.utexas.edu/resources/software). This system has a peak performance of 4.4 gigaflops per core, i.e., it is theoretically capable of performing 4.4 billion double precision floating point operations per second (flops). The specific element types used are 6-node triangles (quadratic shape functions) with 6 integration points for the thermal problem and 7-node triangles with 12 integration points for the mechanical problem.

[33] In the thermal problem, results are obtained for an unstructured mesh consisting of approximately 1 million nodes and 0.5 million elements. For this model the previously described matrix computation took 65 s, during which 324 floating point operations per integration point per element were calculated. This corresponds to 15 Megaflops (Mflops) or approximately 0.4% of the peak performance. Analysis of the code with MATLAB's built-in profiler revealed that a significant amount of time was spent on the calculation of the determinant and inverse of the Jacobian. Therefore, in further tests these calls were replaced by explicit calculations of detJ and invJ. The final performance achieved by this algorithm was 30 Mflops, which is still less than one percent of the peak performance and equivalent to a peak CPU performance that was reached by commodity computers more than a decade ago.
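Replacing the library calls by explicit formulas is straightforward for a 2 x 2 Jacobian. The following sketch uses a made-up matrix J and shows one possible way to write it; the exact statements in the MILAMIN source may differ.

```matlab
J = [ 2.0 0.5;
     -0.3 1.5];                          % hypothetical 2x2 Jacobian

% Explicit determinant of a 2x2 matrix
detJ = J(1,1)*J(2,2) - J(1,2)*J(2,1);

% Explicit inverse: swap the diagonal, negate the off-diagonal, divide by detJ
invJ = [ J(2,2) -J(1,2);
        -J(2,1)  J(1,1)] / detJ;
```

For such small matrices the general-purpose det and inv functions spend most of their time on call overhead rather than arithmetic, which is consistent with the profiler observation above.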

[34] Profiling the improved standard algorithm revealed that most of the computational time was spent on matrix multiplications. This means that the efficiency of the analyzed implementation depends mainly on the efficiency of dense matrix by matrix multiplications inside the integration point loop. In order to perform these calculations MATLAB uses hardware-tuned, high-performance BLAS libraries (Basic Linear Algebra Subprograms; see http://www.netlib.org/blas/faq.html and Dongarra et al. [1990]), which reach up to 90% of the CPU peak performance, a value that the analyzed code is far from reaching.

[35] The cause for this bad performance is that the matrix by matrix multiplications inside the integration point loop operate on very small matrices, for which BLAS libraries are known not to work well due to the introduced overhead (e.g., http://math-atlas.sourceforge.net/timing/36v34/OptPerf.html). Therefore, the same observation can be made when writing the standard algorithm in a compiler language such as C and relying on BLAS for the matrix multiplications, although the actual performance in this case is higher than in MATLAB. In C a possible solution is to explicitly write out the small matrix by matrix multiplications, which results in a more efficient code. In MATLAB, however, this is not a practical alternative as explicitly writing out matrix multiplications leads to unreadable code without substantial performance gains. The above performance considerations apply equally to the mechanical code.

[36] In conclusion, the standard algorithm is a viable option when writing compiler code. However, the achievable performance in MATLAB is unsatisfactory so we developed a more efficient approach, which is presented in the following section.

[37] Remark 1: Measuring code performance

[38] Since no flops measure exists in MATLAB, the number of operations must be manually calculated on the basis of code inspection and divided by the computational time. To provide more meaningful performance measures, one could count only the necessary floating point operations; for example, the redundant computations of the upper triangular entries in the standard matrix computation contribute to the flop count and thus artificially increase the measured performance. However, it is not necessarily the case that the algorithm with the lowest operation count is the fastest in terms of execution time. We refrain from adjusting the actual flop counts in this paper.
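The manual flop rate calculation can be reproduced with the figures quoted in the performance analysis above (about 0.5 million elements, 6 integration points, 324 operations per integration point, 65 s run time, 4.4 gigaflops peak). The sketch below simply carries out that arithmetic; the variable names are illustrative, and the element count is the rounded value from the text, so the results are approximate.

```matlab
nel          = 0.5e6;     % approximate number of elements
nip          = 6;         % integration points per element (thermal problem)
flops_per_ip = 324;       % operations per integration point per element
t_run        = 65;        % measured matrix computation time in seconds
peak         = 4.4e9;     % peak performance in flops per core

flops_total = nel*nip*flops_per_ip;          % total operation count
mflops      = flops_total/t_run/1e6;         % achieved rate in Mflops
pct_peak    = 100*flops_total/t_run/peak;    % percentage of peak performance
```

With these rounded inputs the rate comes out just under 15 Mflops, well below one percent of the peak. In an actual measurement t_run would be obtained by wrapping the matrix computation in tic and toc.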

4.3. Matrix Computation: Optimized Algorithm

[39] In this section we explain how to efficiently compute the local stiffness matrices. This optimization strategy is common to both (thermal and mechanical) problems considered. For simplicity, we present it using the example of the thermal problem. Overall performance benchmarks and application examples are provided for both types of problems in subsequent sections.

[40] The small matrix by small matrix multiplications in the integration loop nested inside the loop over elements are the bottleneck of the standard algorithm. Written out in terms of loops, these matrix multiplications represent another three loops, for a total of five. Since the element loop exhibits no data dependency, it can be moved into the innermost three (out of five), effectively becoming part of a small matrix by large matrix multiplication.
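The effect of this loop interchange can be seen on the Jacobian computation. The sketch below assumes linear 3-node triangles and a tiny made-up mesh; it computes the Jacobian determinant once with the element loop outermost and once with the element loop absorbed into a single small-by-large matrix product. The names ECOORD_x, ECOORD_y, Jx and Jy are chosen here for illustration, and blocking of the element range is not shown.

```matlab
% Tiny mesh of linear triangles (hypothetical data)
GCOORD    = [0 1 1 0 2;                    % [ndim x nnod] nodal coordinates
             0 0 1 1 0.5];
ELEM2NODE = int32([1 1 2;                  % [nnodel x nel] connectivity
                   2 3 5;
                   3 4 3]);
nnodel = size(ELEM2NODE,1);
nel    = size(ELEM2NODE,2);
dNdu   = [-1 -1; 1 0; 0 1];                % local derivatives, linear triangle

% (a) Standard ordering: one small matrix product per element
detJ_loop = zeros(nel,1);
for iel = 1:nel
    ECOORD_X = GCOORD(:, ELEM2NODE(:,iel));
    J = ECOORD_X*dNdu;
    detJ_loop(iel) = J(1,1)*J(2,2) - J(1,2)*J(2,1);
end

% (b) Interchanged ordering: the element loop is part of the matrix product
ECOORD_x = reshape(GCOORD(1,ELEM2NODE), nnodel, nel);   % x of all element nodes
ECOORD_y = reshape(GCOORD(2,ELEM2NODE), nnodel, nel);   % y of all element nodes
Jx = ECOORD_x'*dNdu;                       % [nel x ndim]: dx/dxi, dx/deta
Jy = ECOORD_y'*dNdu;                       % [nel x ndim]: dy/dxi, dy/deta
detJ_blk = Jx(:,1).*Jy(:,2) - Jx(:,2).*Jy(:,1);
```

Both variants perform the same arithmetic, but in (b) the library routine sees one matrix with nel rows instead of nel separate 2 x 3 products, so the per-call overhead is paid only once.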

[41] This loop reordering does not change the total amount of operations. However, the number of ...




... increase of the block size leads to a performance decrease toward a stable level of ~120 Mflops due to lack of cache reuse in the integration point loop. Compared to the standard version, the optimized matrix computation achieves a 20-fold speedup in terms of flops performance. Since the optimized algorithm performs fewer operations (computation of only the lower triangular part of the symmetric element matrix), it actually executes more than 30 times faster.

[53] The achieved 350 Mflops efficiency corresponds to only ~8% of the peak CPU performance. Profiling the code revealed that for the test problem approximately half of the time was spent on reading and writing variables from and to RAM (e.g., nodal coordinates and element matrices). This value is constrained by the memory bandwidth of the hardware, which on current computer architectures is often a bigger bottleneck than the CPU performance. Compared to C implementations, the optimized matrix computation performance is better than the straightforward standard algorithm using BLAS, but more than a factor 3 slower than what can be achieved by explicitly writing out the matrix multiplications.

Figure 2. Code Fragment 2 shows the optimized finite element global matrix computation.

[54] In the mechanical code, the peak flops performance is similar. Note that in this case the optimal block size is smaller due to the larger workspace of the method; see Figure 3.

4.4. Matrix Assembly: Triplet to Sparse Format Conversion

[55] The element stiffness matrices stored in K_all must be assembled into the global stiffness matrix K. The values in K_all, together with the row and column indices (K_i and K_j) that specify where the individual entries have to be stored in the global system, form what is commonly known as the triplet sparse matrix format [e.g., Davis, 2006]. Since we only use lower triangular entries, special care must be taken so that the indices referring to the upper triangle are not created; see Code Fragment 3. Note that K_i and K_j hold duplicate entries, and the purpose of the MATLAB sparse function is to sum and eliminate them.

[56] While creation of the triplet format is fast, the call to sparse raises some concerns. MATLAB's sparse implementation requires that K_i and K_j are of type double, which is inefficient in terms of both memory and performance. In addition, sparse itself is rather slow, especially if compared to the time spent on the entire matrix computation. The equivalent function sparse2, provided by T. A. Davis within the CHOLMOD package (http://www.cise.ufl.edu/research/sparse/SuiteSparse), is substantially faster and does not require a conversion of the coefficients to double precision. Code Fragment 3 presents in detail how to create a global system matrix.

[57] Code Fragment 3 shows the global sparse matrix assembly.

    % CREATE TRIPLET FORMAT INDICES
    indx_j = repmat(1:nnodel,nnodel,1);
    indx_i = indx_j';
    indx_i = tril(indx_i);
    indx_i = indx_i(:);
    indx_i = indx_i(indx_i>0);
    indx_j = tril(indx_j);
    indx_j = indx_j(:);
    indx_j = indx_j(indx_j>0);

    K_i = ELEM2NODE(indx_i,:);
    K_j = ELEM2NODE(indx_j,:);

    K_i = K_i(:);
    K_j = K_j(:);

    % SWAP INDICES REFERRING TO UPPER TRIANGLE
    indx = K_i < K_j;
    tmp = K_j(indx);
    K_j(indx) = K_i(indx);
    K_i(indx) = tmp;

    K_all = K_all(:);

    % CONVERT TRIPLET DATA TO SPARSE MATRIX
    K = sparse2(K_i, K_j, K_all);
    clear K_i K_j K_all;

Figure 3. Performance of optimized matrix computation versus block size.

[58] The triplet format is converted into the sparse matrix K with a single call to sparse2. Assembling smaller sparse matrices for blocks of elements and calling sparse consecutively would reduce the workspace for the auxiliary arrays; however, it would also slow down the code. Therefore, as long as the K_i, K_j, and K_all arrays are not the memory bottleneck, it is beneficial to perform the global conversion. Once K is created, the triplet data is cleared in order to free as much memory as possible for the solution stage. In the mechanical code the Q and M^{-1} matrices are stored in sparse format for later reuse in the Powell and Hestenes iterations.
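The assembly of Code Fragment 3 can be checked on a very small problem. The following is a sketch under stated assumptions: a made-up mesh of two linear triangles, one hypothetical symmetric element matrix reused for both elements, and the built-in sparse standing in for sparse2. The result is compared with a naive element-by-element loop.

```matlab
% Sketch: Code Fragment 3 on a two-triangle mesh, checked against a loop.
% Built-in sparse stands in for sparse2; mesh and K_elem are illustrative.
ELEM2NODE = [1 2; 2 4; 3 3];              % [nnodel, nel], two linear triangles
nnodel    = size(ELEM2NODE,1);
nel       = size(ELEM2NODE,2);
nnod      = max(ELEM2NODE(:));
K_elem    = [4 -1 -2; -1 3 -1; -2 -1 5];  % same symmetric matrix for both elements

% Flattened lower triangles of all element matrices, one column per element
indx_l = find(tril(ones(nnodel)));
K_all  = repmat(K_elem(indx_l), 1, nel);

% Local row/column numbers of the lower-triangle entries
indx_j = repmat(1:nnodel,nnodel,1);
indx_i = indx_j';
indx_i = tril(indx_i); indx_i = indx_i(:); indx_i = indx_i(indx_i>0);
indx_j = tril(indx_j); indx_j = indx_j(:); indx_j = indx_j(indx_j>0);

% Global indices, swapped where they would point to the upper triangle
K_i = ELEM2NODE(indx_i,:); K_i = K_i(:);
K_j = ELEM2NODE(indx_j,:); K_j = K_j(:);
indx      = K_i < K_j;
tmp       = K_j(indx);
K_j(indx) = K_i(indx);
K_i(indx) = tmp;

K = sparse(K_i, K_j, K_all(:), nnod, nnod);

% Reference: naive element-by-element assembly of the full matrix
K_ref = zeros(nnod);
for iel = 1:nel
    nodes = ELEM2NODE(:,iel);
    K_ref(nodes,nodes) = K_ref(nodes,nodes) + K_elem;
end
```

The lower triangle of K_ref should agree with K entry by entry, including the entry that had to be swapped from the upper to the lower triangle.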

[59] Remark 2: Symbolic approach to sparse matrix assembly.

[60] In general the auxiliary arrays can be avoided altogether with a symbolic approach to sparse matrices. While the idea of sparse storage is the elimination of zero entries, in a symbolic approach all possible nonzero entries are stored and initialized to zero. During the computation of element stiffness matrices, global locations of their entries can be found at a small computational cost, and the corresponding values are incrementally updated. Also, this symbolic storage pattern can be reused between subsequent time steps, as long as the mesh topology is not changed. Unfortunately, this improvement cannot be implemented in MATLAB as zero entries are automatically deleted.
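The limitation mentioned in Remark 2 is easy to reproduce. The sketch below uses made-up indices and shows that a pattern initialized with zeros is not retained by the built-in sparse function.

```matlab
% Sketch: MATLAB does not keep explicit zeros in a sparse matrix.
% A "symbolic" pattern initialized with zeros therefore stores nothing.
K_i = [1; 2; 2; 3];
K_j = [1; 1; 2; 3];

K_pattern = sparse(K_i, K_j, zeros(4,1), 3, 3);  % intended pattern, zero values
n_pattern = nnz(K_pattern);                      % the pattern is not retained

K_filled  = sparse(K_i, K_j, [2; 0; 2; 2], 3, 3);
n_filled  = nnz(K_filled);                       % the zero at (2,1) is dropped
```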

4.5. Boundary Conditions

[61] The implemented models have two types of boundary conditions: vanishing fluxes and Dirichlet. While the former automatically results from the FEM discretization, the latter must be specified separately, which usually leads to a modification of the global stiffness matrix. These modifications may, depending on the implementation, cause loss of symmetry and changes in the sparsity pattern and row addressing of K, all of which can lead to badly performing code.

[62] An elegant and sufficiently fast approach is to separate the degrees of freedom of the model into Free (indices of unconstrained degrees of freedom) and Bc_ind, where Dirichlet boundary conditions with corresponding values Bc_val are applied. Since the solution values at Bc_ind are known, the corresponding equations can be eliminated from the system of equations by modifying the right-hand side of the remaining degrees of freedom accordingly. This is implemented as shown in Code Fragment 4.

[63] Code Fragment 4 shows the boundary condition implementation for the thermal problem.

    Free = 1:nnod;
    Free(Bc_ind) = [];
    TMP = K(:,Bc_ind) + cs_transpose(K(Bc_ind,:));
    Rhs = Rhs - TMP*Bc_val';
    K = K(Free,Free);
    T = zeros(nnod,1);
    T(Bc_ind) = Bc_val;

[64] Since only the lower part of the global matrix is stored, we need to restore the remaining parts of the columns by transposing the adequate rows.
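The elimination can be tried on a small stand-in problem. The sketch below assumes a one-dimensional chain of five nodes with unit conductivity (not one of the paper's test cases), uses the plain transpose operator in place of cs_transpose, and restores the full symmetric matrix before calling backslash.

```matlab
% Sketch: Dirichlet elimination as in Code Fragment 4 on a 1-D chain.
% The 5-node problem is illustrative; ' replaces cs_transpose.
nnod   = 5;
e      = ones(nnod,1);
K_full = spdiags([-e 2*e -e], -1:1, nnod, nnod);
K_full(1,1) = 1; K_full(nnod,nnod) = 1;   % 1-D stiffness matrix, unit conductivity
K      = tril(K_full);                    % only the lower part is stored
Rhs    = zeros(nnod,1);

Bc_ind = [1 nnod];                        % constrained nodes
Bc_val = [0 1];                           % prescribed temperatures

Free = 1:nnod;
Free(Bc_ind) = [];
TMP = K(:,Bc_ind) + K(Bc_ind,:)';         % full columns restored from lower part
Rhs = Rhs - TMP*Bc_val';
K   = K(Free,Free);
T   = zeros(nnod,1);
T(Bc_ind) = Bc_val;

% Solve for the free nodes; the symmetric matrix is rebuilt for backslash
K_sym   = K + tril(K,-1)';
T(Free) = K_sym \ Rhs(Free);
```

For this setup the temperature should vary linearly between the two prescribed end values.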

4.6. System Solution

[65] We have ensured that the global system of linear equations under consideration is symmetric, positive-definite, and sparse. It has the form

$$K\,T = \mathrm{Rhs} \qquad (17)$$

where K is the stiffness matrix, T the unknown temperature vector, and Rhs the right-hand side. One of the fastest and most memory-efficient direct solvers for this type of system is CHOLMOD, a sparse supernodal Cholesky factorization package developed by T. Davis [Davis and Hager, 2005; Y. Chen et al., Algorithm 8xx: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate, submitted to ACM Transactions on Mathematical Software, 2007; T. A. Davis and W. W. Hager, Dynamic supernodes in sparse Cholesky update/downdate and triangular solves, submitted to ACM Transactions on Mathematical Software, 2007]; see the report by Gould et al. [2007]. Newer versions of MATLAB (2006a and later) use this solver, which is substantially faster than the previous implementation. When symmetric storage is not exploited, CHOLMOD can be invoked through the backslash operator: T = K\Rhs (make sure that the matrix K is numerically symmetric; otherwise MATLAB will invoke a different, slower solver).
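A minimal sketch of the backslash route follows. The matrix is a model two-dimensional Laplacian built with delsq and numgrid, used here only as a stand-in for a stiffness matrix; the symmetry check and the restoration of the full matrix from its lower part are illustrative steps, not part of the MILAMIN code.

```matlab
% Sketch: solve a small symmetric positive-definite system with backslash.
% delsq/numgrid build a model 2-D Laplacian; it is a stand-in for K.
K   = delsq(numgrid('S', 12));        % 100 x 100 sparse SPD matrix
Rhs = ones(size(K,1), 1);

% Backslash can only use a Cholesky solver if K is exactly symmetric.
is_sym = isequal(K, K');

% If only the lower part were stored, the full matrix can be restored first.
K_low  = tril(K);
K_full = K_low + tril(K_low,-1)';

T        = K_full \ Rhs;
residual = norm(K*T - Rhs) / norm(Rhs);
```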




[66] However, it is best to use CHOLMOD and the related parts by installing the entire package from the developer's SuiteSparse Web site (http://www.cise.ufl.edu/research/sparse/SuiteSparse). This provides access to cholmod2, which is capable of dealing with only upper triangular input data and precomputed permutation (reordering) vectors. SuiteSparse also contains lchol, a Cholesky factorization operating only on lower triangular matrices, which is faster and more memory efficient than MATLAB's chol equivalent. Reusing the Cholesky factor L during the Powell and Hestenes iterations in the mechanical problem greatly reduces the computational cost of achieving a divergence-free flow solution.

[67] The mentioned reuse of reordering data is possible as long as the mesh topology remains identical, which even in our large strain flow calculations is the case for many time steps. The reordering step decreases factorization fill-in and consequently improves memory and CPU efficiency [Davis, 2006], but is a rather costly operation compared to the rest of the Cholesky algorithm. Different reordering schemes can be used, and we compare two of them in Figure 4: AMD (Approximate Minimum Degree) and METIS (http://glaros.dtc.umn.edu/gkhome/views/metis). While AMD is faster during the reordering step, it results in slower Cholesky factorization and forward and back substitution. If the reordering can be reused for a large number of steps, it is recommended to rely on METIS, which is accessible in MATLAB through the SuiteSparse package.
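The effect of a fill-reducing ordering can be observed with built-in functions. This is a sketch under stated assumptions: amd and chol stand in for the SuiteSparse routines, METIS is not used, and the matrix is a model two-dimensional Laplacian rather than a MILAMIN stiffness matrix.

```matlab
% Sketch: effect of a fill-reducing ordering on the Cholesky factor.
% Built-in amd and chol stand in for the SuiteSparse routines; the matrix
% is a model 2-D Laplacian, not a MILAMIN stiffness matrix.
K = delsq(numgrid('S', 34));             % 1024 unknowns

L_nat   = chol(K, 'lower');              % natural (unpermuted) ordering
nnz_nat = nnz(L_nat);

perm    = amd(K);                        % approximate minimum degree ordering
L_amd   = chol(K(perm,perm), 'lower');   % factor of the permuted matrix
nnz_amd = nnz(L_amd);

fill_ratio = nnz_nat / nnz_amd;          % how much fill the ordering avoids
```

Because perm depends only on the sparsity pattern, it can be computed once and reused for every matrix with the same pattern, which is the reuse described above.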

4.7. Powell and Hestenes Iterations

[68] In the thermal code, the solution vector is obtained by calling forward and back substitution routines with the Cholesky factor and the adequately permuted right-hand-side vector. During the second substitution phase the upper Cholesky factor is required. However, instead of explicitly forming it through the transposition of the stored lower factor, it is advantageous to call cs_ltsolve, which operates on the lower factor and performs the back substitution directly.
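The permuted solve can be written out with built-in operations as follows. This is only a sketch: backslash on L and L' stands in for cs_lsolve and cs_ltsolve (so the transpose is written explicitly here for brevity), and the matrix and right-hand side are made-up stand-ins.

```matlab
% Sketch: solve K*T = Rhs with a permuted lower Cholesky factor.
% Plain backslash on L and L' stands in for cs_lsolve and cs_ltsolve.
K    = delsq(numgrid('S', 12));          % model SPD matrix (100 unknowns)
Rhs  = (1:size(K,1))';
perm = amd(K);
L    = chol(K(perm,perm), 'lower');      % K(perm,perm) = L*L'

T       = zeros(size(Rhs));
y       = L  \ Rhs(perm);                % forward substitution
T(perm) = L' \ y;                        % back substitution

T_ref = K \ Rhs;                         % reference solution
err   = norm(T - T_ref) / norm(T_ref);
```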

[69] In the MILAMIN flow solver the incompressibility constraint is achieved through an iterative penalty method, i.e., the bulk part of the deformation is suppressed with a large bulk modulus (penalty parameter) k. In a single-step penalty method there is a trade-off between the incompressibility of the flow solution and the condition number of the global equation system. This can be avoided by using a relatively small k, which ensures a good condition number, and then iteratively improving the incompressibility of the flow. Note that for the chosen Crouzeix-Raviart element, pressure is discontinuous between elements and the corresponding degrees of freedom can be eliminated element-wise (no global system solution required). Pressure increments can be computed with the velocity solution vector and the stored Q and M^{-1} matrices. These pressure increments are sent to the right-hand side of the system and accumulated in the total pressure. The code for these so-called Powell and Hestenes iterations is given in Code Fragment 5.

Figure 4. Performance analysis of the different steps of the Cholesky algorithm with different reorderings for our one million degrees of freedom thermal test problem.

[70] Code Fragment 5 shows the Powell and Hestenes iterations.

    while (div_max>div_max_uz && uz_iter<uz_iter_max)
        uz_iter = uz_iter + 1;
        %FORWARD AND BACK SUBSTITUTION
        Vel(Free(perm)) = cs_ltsolve(L,cs_lsolve(L,Rhs(Free(perm))));

        %COMPUTE QUASI-DIVERGENCE
        Div = invM*(Q*Vel);

        %UPDATE RHS
        Rhs = Rhs - PF*(Q'*Div);

        %UPDATE TOTAL PRESSURE
        Pressure = Pressure + PF*Div;

        %CHECK INCOMPRESSIBILITY
        div_max = max(abs(Div(:)));
    end
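The iteration can be exercised on a tiny synthetic problem. Everything in the sketch below is an assumption made for illustration: the matrices A, Q, and invM, the right-hand side, and the penalty factor are made up, no permutation or boundary conditions are applied, and chol with backslash replaces cs_lsolve and cs_ltsolve.

```matlab
% Sketch: Powell and Hestenes iterations on a tiny synthetic problem.
% A, Q, invM, Rhs and PF are made-up stand-ins; chol plus backslash replaces
% cs_lsolve/cs_ltsolve, and no permutation or boundary conditions are used.
n    = 6;
e    = ones(n,1);
A    = spdiags([-e 2*e -e], -1:1, n, n);                  % SPD "velocity" matrix
Q    = sparse([1 1 2 2], [1 2 3 4], [1 -1 1 -1], 2, n);   % "divergence" matrix
invM = speye(2);                                          % inverse pressure mass matrix
PF   = 1e3;                                               % penalty factor
Rhs  = (1:n)';

% Factorize the penalized matrix once and reuse L in every iteration
L = chol(A + PF*(Q'*invM*Q), 'lower');

Pressure    = zeros(2,1);
div_max     = realmax;
div_max_uz  = 1e-10;
uz_iter     = 0;
uz_iter_max = 20;
while (div_max>div_max_uz && uz_iter<uz_iter_max)
    uz_iter  = uz_iter + 1;
    Vel      = L' \ (L \ Rhs);           % forward and back substitution
    Div      = invM*(Q*Vel);             % quasi-divergence
    Rhs      = Rhs - PF*(Q'*Div);        % update right-hand side
    Pressure = Pressure + PF*Div;        % accumulate total pressure
    div_max  = max(abs(Div(:)));         % check incompressibility
end
```

With a penalty factor that is large relative to the entries of A, the quasi-divergence drops by several orders of magnitude per iteration, so only a handful of substitutions with the same factor L are needed.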

5. Postprocessor

[71] The results of a numerical model are only useful if fast and precise analysis and visualization are possible. One of the main aspects to achieve this is to avoid loops. For triangular meshes trisurf is the natural choice for two- and three-dimensional data visualization as it employs the usual FEM structures: connectivity (ELEM2NODE), coordinates (GCOORD), and data (T). This allows for visualization of FEM models with more than one million elements in less than one second.

[72] A problem that often arises is the visualization of discontinuous data, such as pressure in mixed formulations of deformation problems. The remedy is to abandon the nodal connectivity and to create a new arrangement, where physical nodes are listed separately for every element that accesses them. The same can also be done for meshes other than triangular ones by creating the corresponding connectivity (ELEM2NODE) and calling:

[73] Code Fragment 6 shows the postprocessor.

    patch('faces',ELEM2NODE','vertices',GCOORD','facevertexcdata',T);
    shading interp;
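The per-element node duplication described for discontinuous data can be sketched as follows. The two-triangle mesh, the variable names EL2N_disc, GCOORD_disc, and Pressure_disc, and the pressure values are hypothetical and serve only to show the index manipulation.

```matlab
% Sketch: per-element node duplication for plotting discontinuous data.
% The two-triangle mesh and the pressure values are illustrative only.
GCOORD    = [0 1 0 1; 0 0 1 1];          % [ndim, nnod]
ELEM2NODE = [1 2; 2 4; 3 3];             % [nnodel, nel], corner nodes
nnodel    = size(ELEM2NODE,1);
nel       = size(ELEM2NODE,2);

% Every element gets its own private copies of its nodes
EL2N_disc   = reshape(1:nnodel*nel, nnodel, nel);
GCOORD_disc = GCOORD(:, ELEM2NODE(:));   % [ndim, nnodel*nel]

% One value per element corner; shared physical nodes may now differ
Pressure_disc = [1 1 1, 5 5 5]';         % [nnodel*nel, 1]
```

These arrays can then take the place of ELEM2NODE, GCOORD, and T in the patch call of Code Fragment 6, so that each element is colored with its own values.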

6. MILAMIN Performance Analysis

6.1. Overall Performance

[74] The overall performance of MILAMIN versus the number of nodes is analyzed in Figure 5. The goal of MILAMIN to perform a complete FEM analysis for one million unknowns in one minute is reached for the thermal as well as the mechanical problem. All components of MILAMIN scale linearly with the number of nodes; the only exception is the direct solver, which shows super-linear scaling. The performance details are discussed in the following sections.

6.2. Component Performance

[75] Figure 6 shows the total amount of time for the one million degrees of freedom (DOFs) test problems split into the individual components of MILAMIN. The contributions of the boundary conditions and postprocessor are minor. The time taken by the preprocessor is also not relevant, especially if the same (Lagrangian) mesh is used for many time steps. A major achievement of MILAMIN is the performance of the optimized matrix computation, which is 15- to 30-fold faster than the standard algorithm. The matrix assembly done by sparse2 is one of the major contributors to the total time, but cannot be optimized without a major change in the way MATLAB operates on sparse matrices; see Remark 2. Finally, the three components of the Cholesky solver take substantial time.

[76] The time taken by the first part of the Cholesky solver, the reordering, can often be neglected for practical applications. During nonlinear material and time step iterations the mesh topology remains the same as long as no remeshing is performed, and the permutation vector can be reused if the SuiteSparse package is employed.




[77] The second component of the Cholesky solver is the factorization. This step takes most of the total MILAMIN execution time. However, the efficiency achieved by CHOLMOD is close to the optimal CPU performance. For further optimization one could consider other types of solvers such as iterative ones. Yet, preconditioned iterative methods or algebraic multigrid are less robust (especially for large material contrasts as targeted here) and perform better only for large systems; see section 6.4. These methods are the only option in the case of most three-dimensional problems, because the scaling of factorization time and memory requirements for direct solvers is much worse than in two dimensions. However, for two-dimensional problems direct solvers are the best choice for resolutions on the order of one million degrees of freedom, especially for positive definite systems that can be solved with Cholesky factorizations. Moreover, it is in problems of this size where our optimizations greatly reduce the total solution time. Such numerical resolutions are often sufficient in two dimensions to solve challenging problems, and the achieved performance allows for studies with a large number of time steps.
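The two solver families can be placed side by side in a short sketch. It uses only built-in functions (chol, amd, ichol, pcg) on a small model Laplacian; the problem size, tolerance, and iteration limit are assumptions for illustration, and the sketch says nothing about the timings reported in this paper.

```matlab
% Sketch: direct Cholesky solve versus incomplete-Cholesky preconditioned CG.
% Built-in chol, ichol and pcg on a model 2-D Laplacian; tolerances and the
% problem size are illustrative assumptions, not the paper's benchmark setup.
K   = delsq(numgrid('S', 34));           % 1024 unknowns, sparse SPD
Rhs = ones(size(K,1), 1);

% Direct: factorize once, then forward and back substitution
perm           = amd(K);
L              = chol(K(perm,perm), 'lower');
T_direct       = zeros(size(Rhs));
T_direct(perm) = L' \ (L \ Rhs(perm));

% Iterative: conjugate gradients with zero-fill incomplete Cholesky (ICCG)
L_ic = ichol(K);
[T_iter, flag, relres, iter] = pcg(K, Rhs, 1e-10, 500, L_ic, L_ic');

diff_rel = norm(T_iter - T_direct) / norm(T_direct);
```

Both routes should give the same solution to within the iteration tolerance; which one is faster depends on the system size and material contrasts, as discussed above.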

[78] The third part of the Cholesky solver is the forward and backward substitution, which does not contribute substantially in the case of thermal problems. For mechanical problems several Powell and Hestenes iterations are required to enforce incompressibility, each issuing a forward and back substitution call plus other computations. The time spent on the Powell and Hestenes iterations is not negligible, but the strategy chosen to deal with incompressibility is clearly advantageous compared to other strategies that would not allow the use of Cholesky solvers; see, for example, the results for FEMLAB using UMFPACK in section 6.4.

Figure 5. Overall performance results for MILAMIN given for total time spent on problem, and the direct solver contribution.

[79] A final analysis of the overall speedup achieved by MILAMIN is shown in Figure 7, where we depict the ratio of the total time t_standard/t_optimized for the thermal and mechanical code. In this speedup analysis we define the total time as the sum of the time needed to compute and assemble the global matrix, apply boundary conditions, factorize and solve the system of equations, and perform the Powell and Hestenes iterations (incompressible Stokes flow). Thus mesh generation, postprocessing, and reordering, which do not need to be performed for every time step, do not enter this analysis. For our target system sizes the achieved speedups reach approximately 3 and 4 for the mechanical and thermal codes, respectively. Hence the performance gains due to the developed MILAMIN package are substantial. The scaling with respect to system size shows that the speedup decreases with increasing number of nodes. This is due to the super-linear scaling of the direct solver, which starts to dominate the total execution time for very large systems.
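The trend can be summarized in a simple relation, given here as an interpretive sketch rather than a fit to the measurements. The symbols t_m (matrix computation, assembly, and boundary conditions) and t_s (factorization, substitution, and Powell and Hestenes iterations) are introduced only for this illustration, and t_s is assumed to be the same in the standard and the optimized version.

```latex
S(n) \;=\; \frac{t_{\mathrm{standard}}(n)}{t_{\mathrm{optimized}}(n)}
     \;=\; \frac{t_m^{\mathrm{std}}(n) + t_s(n)}{t_m^{\mathrm{opt}}(n) + t_s(n)},
\qquad
S(n) \to 1 \quad \text{as} \quad \frac{t_s(n)}{t_m^{\mathrm{std}}(n)} \to \infty .
```

Under these assumptions, a solver time t_s that grows super-linearly with the number of nodes n, while both t_m terms grow linearly, eventually outweighs the gain in the matrix computation, so the speedup decreases toward 1.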

6.3. Memory Requirements

[80] Besides CPU performance, the available memory (RAM) is the other parameter that determines the problem size that can be solved on a specific machine. The memory requirements of MILAMIN are presented in Figure 8. Within the studied range of system sizes, all data allocated during the matrix computation and assembly requires substantially less memory than the solution stage. Thus the auxiliary arrays such as K_i, K_j, and K_all are not a memory bottleneck, and it is indeed beneficial to perform the conversion to sparse format globally. Note that the amount of memory required during the factorization stage depends strongly on the reordering used. This analysis is only approximate as the workspace of the external routines (lchol, sparse2, etc.) is not taken into account. On computers with 2 GB of RAM we are able to solve systems consisting of 1.65 and 0.65 million nodes for the thermal and mechanical problems, respectively.
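A rough size estimate for the triplet arrays can be obtained with a few lines of arithmetic. The numbers below are assumptions chosen for illustration: a thermal mesh of 5e5 six-node triangles, 4-byte integer index arrays, and 8-byte double values; they are not measurements from Figure 8.

```matlab
% Sketch: rough memory estimate for the triplet arrays of the thermal code.
% nel is an assumed mesh size; int32 indices and double values are assumed.
nel    = 5e5;                              % assumed number of elements
nnodel = 6;                                % 6-node (quadratic) triangle
n_entries = nnodel*(nnodel+1)/2 * nel;     % lower-triangle entries, all elements

bytes_K_i   = 4 * n_entries;               % int32 row indices
bytes_K_j   = 4 * n_entries;               % int32 column indices
bytes_K_all = 8 * n_entries;               % double values
total_MB    = (bytes_K_i + bytes_K_j + bytes_K_all) / 1e6;
```

Under these assumptions the three arrays together occupy well under 10% of a 2 GB machine, which is consistent with the statement that they are not the memory bottleneck.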

6.4. Comparison to Other Software

[81] In this section we compare MILAMIN to different available commercial and free software solving similar test problems. Table 1 presents run times for a thermal problem with ~1 million degrees of freedom. The model setup consists of a box with a circular hole (zero flux) and a circular inclusion of ten times higher conductivity than the matrix. The outer boundaries are set to Dirichlet conditions representing a linearly varying temperature field.

Figure 6. Overall performance of MILAMIN split up into the individual components for thermal and mechanical test problems with one million degrees of freedom. The timing for the matrix computation is given for the standard (S) and the optimized (O) algorithm. Note that the forward and backward (F&B) substitution timing also contains three Powell and Hestenes iterations in the case of the mechanical problem.

Figure 7. Achieved MILAMIN speedup for all operations that need to be performed for every time step; see text for details.

[82] The software that entered the test comprises the commercial finite element packages ABAQUS (SIMULIA, 6.6-1, http://www.simulia.com/products/abaqus_fea.html) and FEMLAB (COMSOL 3.3, http://www.femlab.com); the open source packages written in compiled languages, FEAPpv (O. C. Zienkiewicz and R. L. Taylor, 2.0, http://www.ce.berkeley.edu/~rlt/feappv), OOFEM (B. Patzak, OOFEM 1.7, http://www.oofem.org), and TOCHNOG (D. Roddeman, 11 February 2001, http://sourceforge.net/projects/tochnog); and the MATLAB packages AFEM@matlab (L. Chen and C. Zhang, http://www.mathworks.com/matlabcentral/fileexchange) and IFISS (D. J. Silvester et al., 2.2, http://www.maths.manchester.ac.uk/~djs/ifiss). For the solution stage we used a wide range of direct solvers, including UMFPACK (T. A. Davis, http://www.cise.ufl.edu/research/sparse/umfpack), TAUCS (S. Toledo et al., http://www.tau.ac.il/~stoledo/taucs), PARDISO (O. Schenk and K. Gärtner, http://www.pardiso-project.org), SPOOLES (C. Ashcraft et al., http://www.netlib.org/linalg/spooles/spooles.2.2.html), CHOLMOD (T. A. Davis, http://www.cise.ufl.edu/research/sparse/cholmod), and the MATLAB backslash operator (\). We also compared different implementations of iterative solvers such as Conjugate Gradients preconditioned with Jacobi (PCG), Symmetric Successive Over-Relaxation (SSOR-CG), Incomplete Cholesky (ICCG), and Algebraic Multigrid (AMG-CG), and a Biconjugate Gradients solver preconditioned with Jacobi (BiCG).

[83] A number of other MATLAB-based packages are available, which, however, could not enter our table because they are simply incapable of solving the test problem in a reasonable amount of time and with the amount of RAM available. Of the MATLAB packages that entered the performance comparison, AFEM excels with high performance. However, AFEM is specifically developed to operate with linear triangles solving the Poisson problem. This allows AFEM to employ only one integration point, and the amount of work performed is substantially less than for isoparametric quadratic elements, although the actual number of elements is higher for the test problem with a fixed number of nodes. IFISS is another MATLAB-based package capable of solving Poisson and incompressible Navier-Stokes problems on the basis of linear and quadratic quadrilateral meshes. Despite its aim of being a vectorized code, the performance of IFISS is not optimal. This is partly due to a badly performing boundary condition implementation. The matrix computation and assembly performance of the compiled-language and commercial codes is quite reasonable, with FEAP being the clear leader. However, none of the tested packages is as fast for the matrix computation and assembly as the optimized version of MILAMIN, and even the standard version of MILAMIN performs quite reasonably in comparison.

Table 1. Performance Results for Different Software Packages for the Thermal Problem^a

Software                  Matrix Computation   Solve   Solver Type
                          and Assembly
ABAQUS, T2                80                   260     proprietary
FEMLAB, T2                18                   40      UMFPACK
                                               45      TAUCS
                                               52      PARDISO
                                               58      SPOOLES
                                               240     ICCG
                                               500     AMG-CG
                                               1000    SSOR-CG
                                               2500    PCG
FEAPpv, Fortran, T2       7                    712     PCG
OOFEM, C++, T1            36                   400     ICCG
TOCHNOG, C/C++, T2        15                   1711    BiCG
AFEM@matlab, T1           25                   19      MATLAB \
IFISS, Q2                 999                  57      MATLAB \
IFISS, Q1                 464                  30      MATLAB \
MILAMIN std, T2           65                   24      CHOLMOD2 (AMD)
MILAMIN opt, T2           5                    24      CHOLMOD2 (AMD)

^a T1 and T2 stand for linear and quadratic triangles, and Q1 and Q2 stand for linear and quadratic quadrilateral elements, respectively.

Table 2. Performance Results for Different Software Packages for the Mechanical Problem^a

Software                             Matrix Computation   Solve   Solver Type
                                     and Assembly
IFISS Q2-P1 (5e5 DOFs)               340                  298     MATLAB \
FEMLAB 3.3 T2+P-1 (2e5 DOFs)         7                    66      UMFPACK
                                                          186     ILU-GMRES
MILAMIN (opt) T2+P-1 (1e6 DOFs)      15                   34      CHOLMOD (AMD)

^a Note the different system sizes for this test.

[84] The analysis of the solver times confirms our previous statement that for the studied 2-D problems direct solvers (CHOLMOD, UMFPACK, TAUCS, PARDISO, SPOOLES) are the best choice, with CHOLMOD being the best in the group. Iterative solvers, even if equipped with good preconditioners such as incomplete Cholesky or AMG, are not competitive with the direct solvers for the targeted problem size.

[85] A performance comparison of MILAMIN for a mechanical test problem is given in Table 2. The domain is again a box containing a circular hole (free surface) and a circular inclusion with a ten times higher viscosity than the matrix. The outer boundaries are set to Dirichlet conditions representing pure shear deformation. The number of available packages to solve incompressible Stokes problems with heterogeneous material is greatly reduced compared to the thermal problem. In fact, the IFISS package is not capable of dealing with heterogeneous materials and we used an isoviscous model here. In the case of FEMLAB we had to employ the special MEMS module, which provides an incompressible Stokes application mode. However, even with this specialized module we were unable to fit the test problem into the 2 GB of RAM and therefore the results are provided for a five times smaller problem size. MILAMIN outperforms IFISS as well as FEMLAB both in terms of matrix computation and assembly, and in solution time. The latter demonstrates that the iterative penalty approach chosen in MILAMIN and the resulting possibility to use a Cholesky solver (symmetric and positive definite system) is superior to other approaches.

Figure 8. Memory requirements of the thermal and mechanical versions of MILAMIN.




6.5. Applications

[86] The power of MILAMIN to perform high-resolution calculations for heterogeneous problems is illustrated with a thermal and a mechanical application example. Figure 9 shows the heat flux through a heterogeneous rock requiring approximately one million nodes to resolve it. Figure 10 shows a mechanical application of MILAMIN. Gravity-driven incompressible Stokes flow is used to study the interaction of circular inclusions with different densities leading to a stratification of the material; see Animation S1 (footnote 2).

[87] MILAMIN not only allows us to study the overall response of the system, but also resolves the details of the flow pattern around the heterogeneities. Note that we see none of the pressure oscillation problems that are often caused by the incompressibility constraint [e.g., Pelletier et al., 1989].

[88] The MILAMIN strategies and package are applicable to a much broader class of problems than illustrated here. For example, transient thermal problems require only minor modifications to the thermal solver. As already mentioned, the mechanical solver is devised in a way that compressible and incompressible elastic problems can be easily treated, simply by variable substitution. Coupled thermomechanical problems, arising for example in mantle convection, only require that the developed thermal and mechanical models are combined in the same time loop. This results in an unstructured, Lagrangian mantle convection solver capable of efficiently dealing with hundreds of thousands of nodes [cf. Davies et al., 2007].

7. Conclusions

[89] We have demonstrated that it is possible to write an efficient native MATLAB implementation of the finite element method and achieved the goal to set up, process, and postprocess thermal and mechanical problems with one million degrees of freedom in one minute on a desktop computer.

[90] In our standard implementation we have combined all the state-of-the-art components required in a finite element implementation. These include efficient preprocessing, fast matrix assembly, exploiting matrix symmetry for storage, and employing the best available direct solver and reordering packages. MATLAB-specific optimizations include proper memory management (preallocation of arrays) and data structures, explicit type declaration for integer arrays, and efficient implementation of boundary conditions. In the case of the mechanical application the chosen penalty method together with the particular element type allows us to use the efficient Cholesky factorization to solve the incompressible flow problem. The clear structure of the code serves educational purposes well. The results of our software comparison show that our standard version performs surprisingly efficiently, even compared to packages implemented in compiled languages.

Figure 9. Illustration of a one million node application problem modeled with MILAMIN. Steady state diffusion is solved in a heterogeneous rock with channels of high conductivity. Heat flow is imposed by a horizontal thermal gradient; i.e., T(left boundary) = 0, T(right boundary) = 1. Top and bottom boundary conditions are zero flux. (a) Conductivity distribution. (b) Flux visualized by cones and colored by magnitude. Normalization versus flux in homogeneous medium with conductivity of the channels. Background color represents the conductivity. Triangular grid is the finite element mesh used for computation. Note that this picture only corresponds to a small subdomain of Figure 9a (see square outline).

Footnote 2: Animations are available in HTML.

Figure 10. Mechanical application example. Circular inclusions in box subjected to vertical gravity field. Black (heavy) and white (light) inclusions have the same density contrast with respect to the matrix. They are a hundred times more viscous than the matrix. Figures 10a and 10b show (unsmoothed) pressure perturbations, Figures 10c and 10d show maximum shear strain rate, and Figures 10e and 10f show the magnitude of the velocity field with superposed velocity arrows (random positions). All values are normalized by the corresponding maximum value generated by a single inclusion of the same size centered in the same box. Figures 10a, 10c, and 10e show the entire domain; Figures 10b, 10d, and 10f show a zoom-in with superposed finite element mesh according to the white square.

Table A1. MILAMIN Variables^a

Variable              Size                            Description

Variable size
  ndim                1                               number of dimensions
  nel                 1                               number of elements
  nnod                1                               number of nodes
  nnodel              1                               number of nodes per element
  nedof               1                               number of thermal or velocity degrees of freedom per element
  np                  1                               number of pressure degrees of freedom per element
  nip                 1                               number of integration points per element
  nelblo              1                               number of elements per block
  nblo                1                               number of blocks
  npha                1                               number of material phases
  nbc                 1                               number of constrained degrees of freedom
  nfree               1                               number of unconstrained degrees of freedom

Mesh
  ELEM2NODE           [nnodel, nel]                   connectivity
  Phases              [1, nel]                        phase of elements
  GCOORD              [ndim, nnod]                    global coordinates of nodes

Integration points, shape functions and their derivatives
  IP_X                [ndim, nip]                     local coordinates of integration points
  IP_w                [1, nip]                        weights of integration points
  N                   {nip*[nnodel, 1]}               cell array of nip entries of shape functions Ni evaluated at integration points
  dNdu                {nip*[nnodel, ndim]}            cell array of nip entries of shape function derivatives wrt local coordinates dNdui evaluated at integration points

Geometry
  ECOORD_X            [ndim, nnodel]                  global coordinates of nodes in element
  J                   [ndim, ndim]                    Jacobian in integration point
  invJ                [ndim, ndim]                    inverse of Jacobian
  detJ                1 or [nelblo,1]                 determinant of Jacobian (or aeib)
  dNdX                [nnodel, ndim]                  shape function derivatives wrt global coordinates in integration point
  ECOORD_x, ECOORD_y  [nnodel, nelblo]                global x and y coordinates for nodes (aeib)
  Jx, Jy              [nelblo, ndim]                  first (x) and second (y) row of Jacobian in integration point (aeib)
  invJx, invJy        [nelblo, ndim]                  first (x) and second (y) column of inverse of Jacobian (aeib)
  dNdx, dNdy          [nelblo, nnodel]                shape function derivatives wrt global x and y coordinate (aeib)

Auxiliary arrays
  indx_l              [nedof*(nedof+1)/2,1]           indices extracting lower part of element matrix

Boundary conditions
  Free                [1, nfree]                      unconstrained degrees of freedom
  Bc_ind              [1, nbc]                        constrained degrees of freedom
  Bc_val              [1, nbc]                        constraint boundary values

Solution
  perm                [1,nfree]                       permutation vector reducing factorization fill-in
  L                   [nfree, nfree]                  sparse lower Cholesky factor of global stiffness matrix
  Rhs                 [nfree, 1]                      global right-hand-side vector

THERMAL

Materials
  D                   [npha,1]                        conductivities for different phases
  ED                  1 or [nelblo,1]                 conductivity of element (or aeib)

Matrix calculations
  K_elem              [nnodel, nnodel]                element stiffness matrix
  K_block             [nelblo, nnodel*(nnodel+1)/2]   flattened element stiffness matrices (aeib)

Triplet storage
  K_i                 [nnodel*(nnodel+1)/2, nel]      row indices of triplet sparse format for K_all
  K_j                 same as K_i                     column indices of triplet sparse format for K_all
  K_all               same as K_i                     flattened element stiffness matrices for all elements

Solution stage
  K                   [nfree, nfree]                  sparse global stiffness matrix (only lower part)
  T                   [nnod, 1]                       unknown temperature vector

MECHANICAL

Materials
  Mu, Rho             [npha,1]                        viscosity and density for different phases
  EMu, ERho           1 or [nelblo,1]                 viscosity and density of element (or aeib)

Matrix calculations
  Pi                  [np,1]                          pressure shape functions in integration point
  P                   [np, np]                        auxiliary matrix containing global coordinates of the corner nodes
  Pb                  [np,1]                          auxiliary vector containing global coordinates of integration point
  B                   [nedof, ndim*(ndim+1)/2]        kinematic matrix
  A_elem              [nedof, nedof]                  element stiffness matrix (velocity part)
  Q_elem              [np, nedof]                     element divergence matrix
  M_elem              [np, np]                        element pressure mass matrix
  invM_elem           [np, np]                        inverse of element pressure mass matrix
  Rhs_elem            [ndim, nedof]                   element right-hand-side vector
  PF                  1                               penalty factor
  GIP_x, GIP_y        [1,nelblo]                      global x and y coordinates of integration point (aeib)
  Pi_block            [nelblo, np]                    pressure shape functions in integration point (aeib)
  A_block             [nelblo, nedof*(nedof+1)/2]     flattened element stiffness matrices (aeib)
  Q_block             [nelblo, nedof*np]              flattened element divergence matrices (aeib)
  M_block             [nelblo, np*(np+1)/2]           flattened element pressure mass matrices (aeib)
  invM_block          [nelblo, np*np]                 flattened inverses of element pressure mass matrices (aeib)
  Rhs_block           [nelblo, nedof]                 element right-hand-side vectors (aeib)

Triplet storage
  Rhs_all             [nedof, nel]                    element right-hand-side vectors for all elements
  A_i                 [nedof*(nedof+1)/2, nel]        row indices of triplet sparse format for A_all
  A_j                 same as A_i                     column indices of triplet sparse format for A_all
  A_all               same as A_i                     flattened element stiffness matrices for all elements
  Q_i                 [nedof*np, nel]                 row indices of triplet sparse format for Q_all
  Q_j                 same as Q_i                     column indices of triplet sparse format for Q_all
  Q_all               same as Q_i                     flattened element divergence matrices for all elements
  invM_i              [np*np, nel]                    row indices of triplet sparse format for invM_all
  invM_j              same as invM_i                  column indices of triplet sparse format for invM_all
  invM_all            same as invM_i                  flattened inverses of element pressure mass matrices for all elements

Solution stage
  A                   [nfree, nfree]                  sparse global stiffness matrix (only lower part)
  Q                   [np*nel, ndim*nnod]             sparse divergence matrix
  invM                [np*nel, np*nel]                sparse pressure mass matrix
  Div                 [nel*np, 1]                     quasi-divergence vector
  Vel                 [ndim*nnod, 1]                  unknown velocity vector
  Pressure            [nel*np, 1]                     unknown pressure vector

^a Note: "aeib" stands for "all elements in block."

[91] Furthermore, in our optimized version we have improved the efficiency of the stiffness matrix calculations, which resulted in an overall execution speedup of approximately 4 times with respect to the standard version. This has been done by minimizing the ratio of overhead (BLAS and MATLAB) to computation. Another priority was to avoid unnecessary data transfers and promote cache reuse, as memory speed is a major bottleneck on current computer architectures. Particular optimizations to the matrix computation algorithm include (1) increased performance of the BLAS operations by interchanging loops and operating on large matrices, (2) reducing the total operation count by exploiting the symmetry of the system, and (3) facilitating cache reuse through the introduction of blocking.

[92] Our implementation of the matrix computation achieves a sustained performance of 350 Mflops for any system size. Any further performance improvements to this part of the code are irrelevant, since even for the smallest systems the matrix computation now takes only a fraction of the total solution time, with the solver being the bottleneck.

[93] By paying attention to the strategies outlined in this article, MATLAB-based MILAMIN can be used not only as a development and prototyping tool but also as a production tool for the analysis of two-dimensional problems with millions of unknowns within minutes. The complete MILAMIN source code is available from the authors and can be downloaded as auxiliary material (see Software S1).

Appendix A

[94] Table A1 lists the variables used throughout the paper and in the code to facilitate its understanding. Variable names, their sizes, and short descriptions are given.

Acknowledgments

[95] This work was supported by the Norwegian Research Council through a Centre of Excellence grant to PGP. We would like to thank Tim Davis, the author of the SuiteSparse package, for making this large suite of tools available and giving us helpful comments. We would also like to thank J. R. Shewchuk for making the mesh generator Triangle freely available. We are grateful to Antje Keller for her help regarding code benchmarking. We thank Galen Gisler for improving the English. The manuscript benefited from the reviews by Boris Kaus and Eh. Tan and the editorial work of Peter van Keken. Finally, we would like to thank Yuri Podladchikov for his never-ending enthusiasm and stimulation.

References

Alberty, J., et al. (1999), Remarks around 50 lines of Matlab:Short finite element implementation, Numer. Algorithms, 20,117–137.

Bathe, K.-J. (1996), Finite Element Procedures, vol. XIV,1037 pp., Prentice-Hall, London.

Brezzi, F., and M. Fortin (1991), Mixed and Hybrid FiniteElements Methods, vol. ix, 350 pp., Springer, New York.

Cuvelier, C., et al. (1986), Finite Element Methods and Navier-Stokes Equations, vol. XVI, 483 pp., D. Reidel, Dordrecht,Netherlands.

Davies, D. R., et al. (2007), Investigations into the applicabil-ity of adaptive finite element methods to two-dimensionalinfinite Prandtl number thermal and thermochemical convec-tion, Geochem. Geophys. Geosyst., 8, Q05010, doi:10.1029/2006GC001470.

Davis, T. A. (2006), Direct Methods for Sparse Linear Sys-tems, Soc. for Ind. and Appl. Math., Philadelphia, Pa.

Davis, T. A., and W. W. Hager (2005), Row modifications of asparse Cholesky factorization, SIAM J. Matrix Anal. Appl.,26, 621–639.

Dongarra, J. J., et al. (1990), A set of level 3 basic linearalgebra subprograms, ACM Trans. Math. Software, 16, 1–17.

Dunavant, D. A. (1985), High degree efficient symmetricalGaussian quadrature rules for the triangle, Int. J. Numer.Methods Eng., 21, 1129–1148.

Elman, H. C., et al. (2005), Finite Elements and Fast IterativeSolvers With Applications in Incompressible Fluid Dy-namics, 400 pp., Oxford Univ. Press, New York.

Ferencz, R. M., and T. J. R. Hughes (1998), Implementation ofelement operations, in Handbook of Numerical Analysis,edited by P. G. Ciarlet and J. L. Lions, pp. 39–52, Elsevier,New York.

Fletcher, C. A. J. (1997), Computational Techniques for FluidDynamics, 3rd ed., Springer, Berlin.

Gould, N. I. M., et al. (2007), A numerical evaluation of sparsedirect solvers for the solution of large sparse symmetric lin-ear systems of equations, ACM Trans. Math. Software, 33(2),article 10, doi:10.1145/1206040.1206043.

Hughes, T. J. R. (2000), The Finite Element Method: LinearStatic and Dynamic Finite Element Analysis, vol. XXII, 682pp., Dover, Mineola, N. Y.

Hughes, T. J. R., et al. (1987), Large-scale vectorized implicitcalculations in solid mechanics on a Cray X-MP/48 utilizingEBE preconditioned conjugate gradients, Comput. MethodsAppl. Mech. Eng., 61, 215–248.

Kwon, Y. W., and H. Bang (2000), The Finite Element MethodUsing MATLAB, 2nd ed., 607 pp., CRC Press, Boca Raton,Fla.

Limache, A., et al. (2007), The violation of objectivity in La-place formulations of the Navier-Stokes equations, Int. J.Numer. Methods Fluids, 54, 639–664.

Pelletier, D., et al. (1989), Are FEM solutions of incompres-sible flows really incompressible? (or how simple flows cancause headaches!), Int. J. Numer. Methods Fluids, 9, 99–112.

Persson, P. O., and G. Strang (2004), A simple mesh generatorin MATLAB, SIAM Rev., 46, 329–345.



23 of 24

Pozrikidis, C. (2005), Introduction to Finite and Spectral Ele-ment Methods Using MATLAB, 653 pp., CRC Press, BocaRaton, Fla.

Sigmund, O. (2001), A 99 line topology optimization codewritten in Matlab, Struct. Multidisciplinary Optim., 21,120–127.

Silvester, D. J. (1988), Optimizing finite-element matrix cal-culations using the general technique of element vectoriza-tion, Parallel Comput., 6, 157–164.

Toledo, S. (1997), Improving the memory-system performanceof sparse-matrix vector multiplication, IBM J. Res. Dev., 41,711–725.

Wesseling, P. (1992), An Introduction to Multigrid Methods,284 pp., John Wiley, Chichester, N. Y.

Zienkiewicz, O. C., and R. L. Taylor (2000), The Finite ElementMethod, 5th ed., Butterworth-Heinemann, Oxford, U. K.


