Ivancik Thesis 2012 Online


    Czech Technical University in Prague

Faculty of Electrical Engineering
Department of Cybernetics

Master's Thesis

    The Linear Direct Sparse Solver on GPU for

    Bundle Adjustment Method

Bc. Ondřej Ivančík

Supervisor: Ing. Ivan Šimeček, Ph.D.

    Study Programme: Open Informatics

    Field of Study: Computer Vision and Image Processing

    May 11, 2012


Acknowledgements

I would like to thank my supervisor Ivan Šimeček, who enabled me to deal with a very interesting topic, and prof. Olaf Hellwich and Cornelius Wefelscheid, who allowed me to work on my thesis within an individual project at TU Berlin.


    Declaration

I hereby declare that I have completed this thesis independently and that I have listed all the literature and publications used.

I have no objection to usage of this work in compliance with §60 of Act No. 121/2000 Coll. (Zákon č. 121/2000 Sb., the Copyright Act), and with the rights connected with the copyright act, including the changes in the act.

    Prague, May 11, 2012


    Abstract

The thesis deals with solving sparse linear positive definite systems. It implements the Cholesky decomposition on the CPU, utilizing the CRS format for sparse matrices, a fast AMD ordering, and a symbolic factorization.

It analyses possibilities of parallelizing the Cholesky decomposition for sparse diagonal-based linear systems and for the bundle adjustment problem, where matrices of a specific structure arise. A Cholesky decomposition exploiting the Schur complement is implemented on both the CPU and the GPU.

    Abstrakt

Práce se zabývá řešením řídkých lineárních pozitivně definitních soustav. Implementuje Choleského dekompozici na CPU s využitím CRS formátu řídkých matic, rychlé AMD permutace a symbolické faktorizace.

Analyzuje možnosti paralelizace Choleského dekompozice pro řídké lineární systémy diagonálního tvaru a pro problém vyrovnání svazku, kde vznikají řídké matice specifické struktury. Navrhuje a implementuje výpočet Choleského dekompozice na GPU a CPU pomocí Schurova komplementu.


    Contents

1 Introduction
  1.1 Motivation

2 Solving Linear Systems
  2.1 System of Linear Equations
  2.2 Direct Methods for Solving Linear Systems
    2.2.1 Cramer's Rule
    2.2.2 Forward and Backward Substitution
    2.2.3 Gaussian Elimination
    2.2.4 Gauss-Jordan Elimination
    2.2.5 LU Decomposition
    2.2.6 Cholesky Decomposition
  2.3 Iterative Methods for Solving Linear Systems

3 Sparse Matrices
  3.1 Ordering Methods
    3.1.1 Arrowhead Matrix Example
    3.1.2 Graph Representation
    3.1.3 Bottom-up Ordering Methods
    3.1.4 Top-down Ordering Methods
  3.2 Symbolical Factorization

4 Bundle Adjustment
  4.1 Unconstrained Optimization
    4.1.1 Search Methods
    4.1.2 Levenberg-Marquardt Algorithm

5 Overview of NVIDIA CUDA
  5.1 The CUDA Execution Model
  5.2 GPU Memory

6 Analysis of the Problem
  6.1 Structure of Linear Systems in BA
  6.2 Block Cholesky Decomposition for BA

7 Implementation
  7.1 Used Framework
  7.2 Compressed Row Storage Format
  7.3 Cholesky decomposition on GPU
  7.4 Ordering for CPU solver
  7.5 Block Matrix Format for GPU
  7.6 Block Cholesky decomposition on GPU
  7.7 Ordering for GPU solver

8 Testing
  8.1 Octave solvers
  8.2 CPU solver
  8.3 GPU solver
  8.4 CUSP solvers

9 Conclusion

A List of Abbreviations
B User Manual
  B.1 Requirements
  B.2 Usage
C Contents of the Attached CD


    List of Figures

3.1 The dependence of the reordering of a sparse matrix on the fill-in count
3.2 Ordering example
4.1 Reprojection error
5.1 Block diagram of a GF100 GPU
5.2 Streaming multiprocessor of a GF100 (Fermi) GPU
5.3 Bandwidth of various GPU memory
6.1 An example of a modestly sized Hessian in BA
7.1 Sample of a symmetric positive definite sparse matrix 6 × 6 with 22 nonzero elements
7.2 Performing k-way ordering on diagonal-based matrix Wathen 10 × 10
7.3 Performing k-way ordering on diagonal-based matrix Poisson 30
8.1 Test of Octave solvers
8.2 Test of iterative CUSP solvers. Max. error is the maximal difference with Octave's reference solution


    Chapter 1

    Introduction

Finding a solution of a system of linear algebraic equations (2.1) is the most basic task in linear algebra and lies at the heart of many engineering problems. It has been studied for many years, not only because of its applications in many branches of scientific computing, but also because of its high computational complexity and the wide variety of methods and approaches that help to solve linear systems of different types faster and more accurately.

Finding a solution of a system of nonlinear algebraic equations can be achieved using iterative solvers, whose keystone is solving a linear system in each iteration step in order to approach a sufficiently accurate solution. Therefore, a linear solver forms a crucial part of a nonlinear solver and, at the same time, its bottleneck.

A widely used optimization method in 3D reconstruction algorithms is bundle adjustment. As a nonlinear iterative optimization method, it needs to solve a sparse, often very large linear system of a specific structure many times. Studying a suitable linear solver for bundle adjustment is the main part of my thesis.

    1.1 Motivation

One particular and promising approach for speeding up the process of solving systems of linear equations is parallel computation. In the case of dense direct solvers, the parallelization is more straightforward and yields better performance results than for sparse direct solvers. Iterative methods, mostly used for solving large sparse linear systems, are efficiently parallelizable thanks to the fact that they use only sparse matrix-vector multiplications and vector additions.


In the last decade, there has been growing interest in general-purpose computation on graphics processing units (GPGPU). Several libraries were developed which implement basic linear algebra subroutines or even linear solvers for dense matrices (NVIDIA cuBLAS, MAGMA, CULA Dense) and sparse matrices (NVIDIA cuSPARSE, NVIDIA CUSP, CULA Sparse). At present, no implementation of a direct linear solver for general sparse matrices on a GPU exists. The main cause is the problematic fine-grain parallelization and the thread divergence on a GPU.

Sparse matrices consisting of many small independent full blocks on the diagonal, with some dependent parts on the borders, are formed during the computation of bundle adjustment. It seems possible to eliminate these blocks effectively in a parallel manner even on a GPU. The question is which type of solver is more suitable: direct or iterative? My thesis aims to give the answer.


    Chapter 2

Solving Linear Systems

(For this chapter, material was cited from [20] and [21].)

    2.1 System of Linear Equations

Definition 1. A system of m linear equations in n unknowns consists of a set of algebraic relations of the form

$$\sum_{j=1}^{n} a_{ij} x_j = b_i, \qquad i = 1, \dots, m, \tag{2.1}$$

where $x_j$ are the unknowns, $a_{ij}$ are the coefficients of the system and $b_i$ are the components of the right-hand side. System (2.1) can be more conveniently written in matrix form as

$$Ax = b, \tag{2.2}$$

where $A = (a_{ij}) \in \mathbb{C}^{m \times n}$ denotes the coefficient matrix, $b = (b_i) \in \mathbb{C}^{m}$ the right-hand side vector and $x = (x_i) \in \mathbb{C}^{n}$ the unknown vector, respectively. A solution of (2.2) is any n-tuple of values $x_i$ which satisfies (2.1).

Remark 1. The existence and uniqueness of the solution of (2.2) are ensured if one of the following (equivalent) hypotheses holds:

1. A is invertible,
2. rank(A) = n,
3. the homogeneous system Ax = 0 admits only the null solution.

In the next chapters I will deal with numerical methods for finding the solution of real-valued square systems of order n, that is, systems of the form (2.2) with $A \in \mathbb{R}^{n \times n}$ and $x, b \in \mathbb{R}^{n}$. Such linear systems arise frequently in every branch of science, including bundle adjustment. These numerical methods can generally be divided into two classes. In the absence of roundoff errors, direct methods yield the exact solution in a finite number of steps. Iterative methods require (theoretically) an infinite number of steps to find the exact solution.

2.2 Direct Methods for Solving Systems of Linear Equations

2.2.1 Cramer's Rule

The solution of system (2.2) is formally provided by Cramer's rule

$$x_j = \frac{\det(A_j)}{\det(A)}, \qquad j = 1, \dots, n, \tag{2.3}$$

where $A_j$ is the matrix obtained by substituting the j-th column of A with the right-hand side b. If the determinants are evaluated by the recursive Laplace rule, the method based on Cramer's rule turns out to be unacceptable even for small dimensions of A because of its computational cost of (n + 1)! flops. However, Habgood and Arel [11] have recently shown that Cramer's rule can be implemented in O(n^3) time, which is comparable to more common methods of solving systems of linear equations.

    2.2.2 Forward and Backward Substitution

Definition 2. A square matrix with zero entries above the main diagonal ($a_{ij} = 0$ for $i < j$) is called lower triangular. A square matrix with zero entries below the main diagonal ($a_{ij} = 0$ for $i > j$) is called upper triangular. A lower (upper) triangular matrix is strictly lower (upper) triangular when its entries on the main diagonal are zeros, too.

Example 1. Lower (upper) triangular systems can be easily solved using forward (backward) substitution. For example, the nonsingular 3 × 3 upper triangular system

$$\begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$$

can be solved in sequence as follows:

$$x_3 = b_3 / u_{33},$$
$$x_2 = (b_2 - u_{23} x_3) / u_{22},$$
$$x_1 = (b_1 - u_{12} x_2 - u_{13} x_3) / u_{11}.$$


For a nonsingular upper triangular system of order n (n ≥ 2), the solution can be expressed generally in the form

$$x_n = \frac{b_n}{u_{nn}}, \qquad x_i = \frac{1}{u_{ii}} \left( b_i - \sum_{j=i+1}^{n} u_{ij} x_j \right), \quad i = n-1, \dots, 1. \tag{2.4}$$

Analogously, the solution of a nonsingular lower triangular system of order n (n ≥ 2) has the form

$$x_1 = \frac{b_1}{l_{11}}, \qquad x_i = \frac{1}{l_{ii}} \left( b_i - \sum_{j=1}^{i-1} l_{ij} x_j \right), \quad i = 2, \dots, n. \tag{2.5}$$

The number of multiplications and divisions for forward/backward substitution equals n(n + 1)/2, while the number of additions and subtractions is n(n − 1)/2. The total operation count for (2.4) and (2.5) is thus about n².
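As an illustration, a minimal Octave sketch of backward substitution (2.4); this is my own example, not code from the thesis, and the function name backsubst is hypothetical:

    function x = backsubst(U, b)
      % Solve U*x = b for a nonsingular upper triangular U, following (2.4).
      n = length(b);
      x = zeros(n, 1);
      x(n) = b(n) / U(n,n);
      for i = n-1:-1:1
        x(i) = (b(i) - U(i,i+1:n) * x(i+1:n)) / U(i,i);
      end
    end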

    2.2.3 Gaussian Elimination

Let A be a square nonsingular matrix. A linear system Ax = b can be transformed, using three elementary row operations, into an equivalent (lower or upper) triangular system Tx = b̃ that has the same solution. The solution of the system is invariant to

    1. the multiplication of a row by a nonzero scalar,

    2. the addition of one row to another,

    3. the swapping of two rows.

The basic idea is to multiply the i-th equation by a nonzero constant and subtract the first equation from it so as to zero out the first unknown in the i-th equation. This is done for all equations from 2 to n. Then, the second equation is taken as the reference and the second unknown is zeroed in equations 3 to n, and so on. The procedure ends when the system has the form Tx = b̃, where b̃ is the right-hand side transformed by the same row operations. Finally, the solution is obtained by forward substitution (if T is a lower triangular matrix) or backward substitution (if T is an upper triangular matrix).

To complete the Gaussian elimination, (2/3)(n − 1)n(n + 1) + n(n − 1) flops are required. To solve the linear system, about (2/3)n³ + 2n² flops are needed (with n² flops to backsolve the triangular system). Neglecting the lower-order terms, the Gaussian elimination process has a cost of (2/3)n³ flops.


    2.2.4 Gauss-Jordan Elimination

Gauss-Jordan elimination is slightly different from Gaussian elimination. The transformation of the system using the three elementary row operations is repeated until each equation contains only one of the unknowns, thus giving an immediate solution. The principal deficiencies of this method are that

1. it requires all the right-hand sides to be stored and manipulated at the same time, and

2. it is three times slower than the alternative solvers when the inverse of A is not desired.

    2.2.5 LU Decomposition

Suppose that it is possible to write the matrix A as a product of two matrices, A = LU, where L is lower triangular and U is upper triangular. This decomposition can be used to solve the linear system

    Ax = (LU)x = L(Ux) = b (2.6)

    by first solving (by forward substitution) for the vector y such that

    Ly = b (2.7)

    and then solving (by backward substitution) for the vector x such that

    Ux = y. (2.8)

Theorem 1. Let $A \in \mathbb{R}^{n \times n}$. The LU decomposition of A with $l_{ii} = 1$ for $i = 1, \dots, n$ exists and is unique iff the principal submatrices $A_i$ of A of order $i = 1, \dots, n-1$ are nonsingular.

The LU decomposition is usually performed in place, to avoid copying and wasting memory when storing the triangular matrices L and U separately, as shown in Algorithm 1. At the end (here only for presentational purposes) the result is stored in the L and U matrices.

    2.2.6 Cholesky Decomposition

Theorem 2. Let $A \in \mathbb{R}^{n \times n}$ be a symmetric and positive definite matrix. Then, there exists a unique lower triangular matrix L with positive diagonal entries such that

$$A = L L^{\top}. \tag{2.9}$$


    Algorithm 1 LU Decomposition

Require: A square matrix A.
Ensure: A lower triangular matrix L with ones on the main diagonal and an upper triangular matrix U such that LU = A.

    function [L, U] = lu2(A)
      [n,n] = size(A);
      for k = 1:n
        A(k+1:n,k) = A(k+1:n,k) / A(k,k);
        A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k) * A(k,k+1:n);
      end
      L = tril(A,-1) + eye(n); % ones on the diagonal
      U = triu(A);
    end
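As a usage sketch (my own illustration, assuming the lu2 function of Algorithm 1 is available), the system (2.6) is solved by one forward and one backward substitution:

    A = [4 3; 6 3];
    b = [10; 12];
    [L, U] = lu2(A);   % Algorithm 1
    y = L \ b;         % forward substitution, equation (2.7)
    x = U \ y;         % backward substitution, equation (2.8)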

The computational cost of the Cholesky decomposition, with respect to the LU decomposition, is halved to about n³/3 flops, because the input matrix A is symmetric. An implementation example of the Cholesky decomposition is given in Algorithm 2.

    Algorithm 2 Cholesky Decomposition

Require: A square symmetric positive definite matrix A.
Ensure: A lower triangular matrix L such that LLᵀ = A.

    function [L] = chol2(A)
      [n,n] = size(A);
      for k = 1:n
        A(k,k) = sqrt(A(k,k));
        A(k,k+1:n) = A(k,k+1:n) / A(k,k);
        for i = k+1:n
          A(i,i:n) = A(i,i:n) - A(k,i:n) * A(k,i);
        end
      end
      L = triu(A)'; % the factor is accumulated in the upper triangle; transpose it to obtain L
    end

2.3 Iterative Methods for Solving Systems of Linear Equations

Iterative methods formally yield the solution x of a linear system after an infinite number of steps. At each step they require the computation of the residual of the system. For full matrices, their computational cost is of the order of n² operations per iteration, to be compared with an overall cost of the order of (2/3)n³ operations needed by direct methods. Iterative methods can therefore become competitive with direct methods when the required number of iterations to converge is either independent of n or scales sublinearly with respect to n.

The basic idea of iterative methods is to construct a sequence of vectors $x^{(k)}$ that enjoy the property of convergence

$$x = \lim_{k \to \infty} x^{(k)},$$

where x is the solution to (2.2). In practice, the iterative process is stopped at the minimum value of n such that

$$\| x^{(n)} - x \| < \varepsilon.$$
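As a minimal illustration of such an iterative scheme (my own sketch, not part of the thesis), a Jacobi iteration for (2.2):

    function x = jacobi(A, b, tol, maxit)
      % Jacobi iteration: x_{k+1} = D^{-1} * (b - (A - D) * x_k).
      n = length(b);
      x = zeros(n, 1);
      D = diag(diag(A));
      R = A - D;
      for k = 1:maxit
        xnew = D \ (b - R * x);
        if norm(xnew - x) < tol
          x = xnew;
          return;
        end
        x = xnew;
      end
    end

Each iteration costs essentially one sparse matrix-vector product, which is what makes such methods attractive for large sparse systems; the scheme converges, for example, for strictly diagonally dominant matrices.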


    Chapter 3

    Sparse Matrices

Many engineering problems have to confront large and sparse matrices. A sparse matrix is a matrix that allows special techniques to take advantage of the large number of zero elements. This definition helps to define how many zeros a matrix needs in order to be sparse. The answer is that it depends on what the structure of the matrix is and what it is being used for. For example, a randomly generated sparse n × n matrix with cn entries scattered randomly throughout the matrix is not sparse in the sense of Wilkinson (for direct methods), since it takes O(n³) time to factorize (with high probability and for large enough c [9]) [3].

Example 2. Using one of the sparse formats to store real sparse matrices can result in significant computational and storage savings. Consider, for instance, a tridiagonal square matrix with 1,000,000 rows. Storing its 3 million nonzero elements in double precision, together with other data such as row and column indices, consumes approximately 40 MB. But storing the same matrix as a full matrix would consume more than 7 TB. Similarly big differences can be expected in execution times.

    3.1 Ordering Methods

An unfavourable fact lies in the process of elimination with sparse matrices. Some zero values of the input matrix become non-zero during the elimination (fill-ins), and their positions must be precomputed in advance. Reordering techniques try to minimize the number of fill-ins by finding a permutation of the rows and columns of the input matrix. But finding such an optimal permutation is an NP-complete problem [26] and could be more time consuming than solving the original linear system; therefore, a heuristic approach that often gives near-optimal results is applied.


    3.1.1 Arrowhead Matrix Example

Example 3. The operation counts required for the solution of two linear systems Ax = b will be examined. The input matrices are shown in figure 3.1. Even though both matrices have the same number of non-zero elements, there is a significant reduction of computation by a simple permutation of rows and columns.

[Figure 3.1: The dependence of the reordering of a sparse matrix on the fill-in count. (a) Left-up arrowhead matrix; (b) left-up arrowhead matrix after LU; (c) right-down arrowhead matrix; (d) right-down arrowhead matrix after LU. Filled markers represent nonzero elements of the input matrix, open markers represent fill-ins, and empty space represents zero elements.]

For the left-up arrowhead matrix 3.1a, the number of multiplications and divisions required by the forward elimination is 40, and by the back substitution 25. The total number of operations is 65, and the input sparse matrix becomes full. For the right-down matrix 3.1c, the number of multiplications and divisions required by the forward elimination is 8, and by the back substitution 13. The total number of operations is 21, and the input sparse matrix remains sparse.
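A small Octave sketch illustrating this effect (my own example; it does not reproduce the exact matrices of figure 3.1): an arrowhead matrix pointing up-left fills in completely during the factorization, while the same matrix permuted to point down-right produces no fill-in at all.

    n = 8;
    A = n * speye(n);             % up-left arrowhead: dense first row and column
    A(1, :) = 1;  A(:, 1) = 1;  A(1, 1) = n;
    p = n:-1:1;                   % reversing the ordering moves the arrow down-right
    B = A(p, p);
    nnz(chol(A))                  % factor of the up-left arrowhead is full
    nnz(chol(B))                  % factor of the down-right arrowhead has no fill-in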

There are many recent works on ordering schemes, because specific problems give rise to specific types of sparse matrices (band-diagonal, block triangular, block tridiagonal, ...) [20, p. 77]. Below, the most used methods are described. They can be divided into two categories, according to how the elimination tree is built. Most state-of-the-art ordering schemes for sparse matrices are a hybrid of a bottom-up method, such as minimum degree, and a top-down scheme, such as George's nested dissection.

    3.1.2 Graph Representation of Sparse Matrices

To explain the ordering methods, it is convenient to introduce a graph representation of sparse matrices. They are then represented as undirected graphs (the sparse matrix has the structure of an adjacency matrix of this graph). All schemes are described for the undirected graph $G = (V, E)$, $E \subseteq V \times V$, associated with the symmetric matrix S. Let v be a vertex of G. The set of vertices that are adjacent to v is denoted by $\mathrm{adj}_G(v)$.

    3.1.3 Bottom-up Ordering Methods

Bottom-up methods build the elimination tree from the leaves up to the root. In each iteration k a greedy heuristic is applied to $G_{k-1}$ to select a vertex for elimination. This section briefly describes two of the most popular bottom-up algorithms, the minimum degree and the minimum deficiency ordering heuristics.

Minimum Degree Ordering: As mentioned above, at each iteration k the minimum degree algorithm eliminates a vertex v that minimizes the number of adjacent vertices, $\deg_{G_{k-1}}(v) = |\mathrm{adj}_{G_{k-1}}(v)|$. The algorithm is a symmetric variant of the Markowitz scheme [15] and was first applied to sparse symmetric factorization by Tinney and Walker [22]. Over the years many enhancements have been proposed to the basic algorithm that have greatly improved its efficiency.

Minimum Deficiency Fill: A less popular bottom-up scheme is the minimum deficiency or minimum local fill heuristic. The exact amount of fill is used to select a vertex for elimination. The minimum deficiency algorithm has received much less attention because of its prohibitive runtime.

    3.1.4 Top-down Ordering Methods

The most popular top-down scheme is George's nested dissection algorithm [7, 8]. The basic idea of this approach is to find a subset of vertices S in G whose removal partitions G into two subgraphs G(B) and G(W) with V = S ∪ B ∪ W and |B|, |W| ≤ α|V| for some 0 < α < 1. Such a partition of G is denoted by (S, B, W). The set S is called a vertex separator of G. If we order the vertices in S after the (black) vertices in B and the (white) vertices in W, no fill-edge can occur between B and W. Typically, the columns corresponding to S constitute a full off-diagonal block in the Cholesky factor. Therefore, S is supposed to be small. Once S has been found, the algorithm is recursively applied to each connected component of G(B) and G(W) until a component consists of a single vertex or a clique. In this way the elimination tree is built from the root down to the leaves.

Graph partitioning heuristics are usually divided into construction and improvement heuristics. A construction heuristic takes the graph as input and computes an initial separator from scratch. An improvement heuristic tries to minimize the size of a separator through a sequence of elementary steps.


As some ordering methods are implemented in MATLAB as standard functions (colperm, symrcm, colamd, symamd, amd, dmperm), I have tested some of them (see figure 3.2).
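A rough Octave analogue of that experiment (my own sketch; the gallery matrix is an arbitrary SPD test case, not necessarily the matrix used for figure 3.2), comparing the fill-in of the Cholesky factor with and without AMD ordering:

    A = gallery('poisson', 30);              % sparse SPD test matrix
    p = amd(A);                              % approximate minimum degree permutation
    fill_natural = nnz(chol(A))      - nnz(triu(A))
    fill_amd     = nnz(chol(A(p,p))) - nnz(triu(A))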

    3.2 Symbolical Factorization

Symbolical factorization is a step executed before the numerical factorization. It precomputes the positions of the fill-ins (see also 3.1) that appear during the factorization process when one row is added to another. It can be seen on the Cholesky or LU factors that they are often much denser than the original matrices (see figure 3.2). The CRS format stores only nonzero elements, and therefore the space needed for fill-ins must be allocated before the numerical factorization. The naive solution is to run a slightly changed numerical factorization and store the new nonzero entries. Since symbolical factorization works only with indices to determine the structure of the Cholesky or LU factors, it can be computed much faster than the full numerical factorization. When implementing my symbolical factorization I have used a great information source [13].
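A naive pattern-only factorization of this kind can be sketched in a few lines of Octave (my own illustration; the thesis implementation follows [13] and works directly on the CRS arrays):

    function P = chol_pattern(A)
      % Simulate the elimination on the nonzero pattern only and return the
      % pattern of the factor, fill-ins included.
      n = size(A, 1);
      P = spones(spones(A) + speye(n));
      for k = 1:n-1
        r = k + find(P(k+1:n, k));                      % rows updated by column k
        P(r, r) = spones(P(r, r) + P(r, k) * P(k, r));  % union with the outer-product pattern
      end
      P = tril(P);                                      % pattern of the lower factor
    end

For a matrix with a nonzero diagonal, the number of fill-ins is then roughly nnz(chol_pattern(A)) - nnz(tril(A)).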


[Figure 3.2: Applying different ordering methods and displaying the LU factors. Nonzeros are in black, fill-ins in gray. Fill-in counts: no ordering 13,309; colperm 30,627; symrcm 13,040; colamd 9,569; symamd 6,681; amd 6,583.]


    Chapter 4

    Bundle Adjustment

Three-dimensional (3D) reconstruction is a problem that appears often in many computer vision tasks. It can be defined as the problem of using 2D measurements arising from a set of images depicting the same scene from different viewpoints, aiming to derive information related to the scene geometry as well as the relative motion and the optical characteristics of the camera(s) employed to acquire these images. Bundle adjustment (BA) is almost invariably used as the last step of every feature-based 3D reconstruction algorithm [14, p. 12].

Bundle adjustment is the problem of refining a visual reconstruction to produce jointly optimal 3D structure and viewing parameter (camera pose and/or calibration) estimates. Optimal means that the parameter estimates are found by minimizing some cost function that quantifies the model fitting error, and jointly that the solution is simultaneously optimal with respect to both structure and camera variations. The name refers to the bundles of light rays leaving each 3D feature and converging on each camera centre, which are adjusted optimally with respect to both feature and camera positions. Equivalently, unlike independent model methods, which merge partial reconstructions without updating their internal structure, all of the structure and camera parameters are adjusted together in one bundle [23].

BA boils down to minimizing the reprojection error (4.1) between the observed and predicted image points, which is expressed as the sum of squares of a large number of nonlinear, real-valued functions. Thus, the minimization is achieved using nonlinear least-squares algorithms [4], of which Levenberg-Marquardt has proven to be one of the most successful due to its ease of implementation and its use of an effective damping strategy that lends it the ability to converge quickly from a wide range of initial guesses [12].


    Figure 4.1: Reprojection error [17]

4.1 Unconstrained Optimization

(This section was cited from [21].)

The aim of unconstrained optimization is to find $x^{*}$ such that

$$x^{*} = \arg\min_{x \in \mathbb{R}^{n}} f(x). \tag{4.1}$$

The point $x^{*}$ is called a global minimizer of f if $f(x^{*}) \le f(x)\ \forall x \in \mathbb{R}^{n}$, while $x^{*}$ is called a local minimizer of f if a neighborhood $\mathcal{N}$ of $x^{*}$ exists such that $f(x^{*}) \le f(x)\ \forall x \in \mathcal{N}$. The vector of first partial derivatives of the function f (which must be continuously differentiable) with respect to the vector x is denoted by

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_n}(x) \right)^{\top}$$

and called the gradient of f at a point x. If d is a non-null vector in $\mathbb{R}^{n}$, then the directional derivative of f with respect to d is

$$\frac{\partial f}{\partial d}(x) = \lim_{\alpha \to 0} \frac{f(x + \alpha d) - f(x)}{\alpha}$$

and satisfies $\partial f(x)/\partial d = [\nabla f(x)]^{\top} d$. Moreover, denoting by $(x, x + d)$ the segment in $\mathbb{R}^{n}$ joining the points x and x + d, Taylor's expansion ensures that there exists $\xi \in (x, x + d)$ such that

$$f(x + d) - f(x) = \nabla f(\xi)^{\top} d.$$


If f is twice continuously differentiable, we denote by H(x) (or $\nabla^{2} f(x)$) the Hessian matrix of f evaluated at a point x, whose entries are

$$h_{ij}(x) = \frac{\partial^{2} f(x)}{\partial x_i \partial x_j}, \qquad i, j = 1, \dots, n.$$

In such a case it can be shown that, if $d \neq 0$, the second-order directional derivative exists and

$$\frac{\partial^{2} f}{\partial d^{2}}(x) = d^{\top} H(x)\, d.$$

For a suitable $\xi \in (x, x + d)$, also

$$f(x + d) - f(x) = \nabla f(x)^{\top} d + \frac{1}{2}\, d^{\top} H(\xi)\, d.$$

Existence and uniqueness of a solution of (4.1) are not guaranteed in $\mathbb{R}^{n}$. Nevertheless, it can be proved that the gradient at a local minimizer $x^{*}$ equals the null vector. This condition is necessary for optimality to hold. However, the condition also becomes sufficient if f is a convex function on $\mathbb{R}^{n}$, i.e., such that $\forall x, y \in \mathbb{R}^{n}$ and for any $\alpha \in [0, 1]$

$$f[\alpha x + (1 - \alpha) y] \le \alpha f(x) + (1 - \alpha) f(y).$$

    4.1.1 Search Methods

Analytical methods can be used only for simple problems (the brachistochrone problem, univariate minimization).

Numerical methods must be used for most engineering optimization problems (too large and complex to solve analytically). Numerical methods can be divided into two classes:

Gradient-based methods are efficient for many variables and for a smooth objective function. The drawback is only local convergence.

Derivative-free methods are suitable for problems where gradients are not available, the objective function is not differentiable, or the global minimizer is desired.

Gradient-based descent methods compute a direction $d^{(k)}$ and a positive parameter (step length) $\alpha^{(k)}$ at each iteration k with the help of the gradient and the Hessian. Algorithm 3 shows the skeleton of this method.

The way of computing the direction $d^{(k)}$ and the step length $\alpha^{(k)}$ defines a specific descent method.


Algorithm 3 Descent method

Require: $\nabla f(x)$, H(x) and a starting point $x^{(0)}$.
Ensure: A local minimizer $x^{*}$.

1: k ← 0
2: while (not converged) do
3:   compute the direction $d^{(k)}$ and the step length $\alpha^{(k)}$
4:   $x^{(k+1)}$ ← $x^{(k)} + \alpha^{(k)} d^{(k)}$
5:   k ← k + 1
6: end while
7: return $x^{(k)}$

Newton's method computes

$$d^{(k)} = -H^{-1}(x^{(k)})\, \nabla f(x^{(k)}),$$

where H is positive definite within a sufficiently large neighborhood of the point $x^{*}$;

the inexact Newton's method computes

$$d^{(k)} = -B^{-1}(x^{(k)})\, \nabla f(x^{(k)}),$$

where $B(x^{(k)})$ is a suitable approximation of $H(x^{(k)})$;

the gradient (steepest descent) method computes

$$d^{(k)} = -\nabla f(x^{(k)});$$

the conjugate gradient method computes

$$d^{(k)} = -\nabla f(x^{(k)}) + \beta^{(k)} d^{(k-1)},$$

where $\beta^{(k)}$ is a scalar to be suitably selected in such a way that the directions $d^{(k)}$ turn out to be mutually orthogonal with respect to a suitable scalar product.
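A minimal Octave sketch of Algorithm 3 with the steepest-descent choice of direction and a fixed step length (my own illustration, not code from the thesis; the quadratic objective is an arbitrary example):

    A = [3 1; 1 2];  b = [1; 1];          % minimize f(x) = x'*A*x/2 - b'*x
    grad = @(x) A*x - b;                  % gradient of f
    x = [0; 0];                           % starting point x(0)
    alpha = 0.2;                          % fixed step length
    for k = 1:500
      d = -grad(x);                       % steepest descent direction
      if norm(d) < 1e-8, break; end       % stop when the gradient (nearly) vanishes
      x = x + alpha * d;                  % step 4 of Algorithm 3
    end
    x                                     % approaches the minimizer A \ b = [0.2; 0.4]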

4.1.2 Levenberg-Marquardt Algorithm

The Levenberg-Marquardt (LM) algorithm, also known as the damped least-squares method, provides a numerical solution to the problem of minimizing a function, generally nonlinear, over a space of parameters of the function. It can be thought of as a combination of the Gauss-Newton and the steepest descent method. When the current solution is far from a local minimum, the algorithm behaves like a steepest descent method: slow, but guaranteed to converge. When the current solution is close to a local minimum, it becomes a Gauss-Newton method and exhibits fast convergence. For these reasons, the LM algorithm is mostly used in bundle adjustment.

Let f be an assumed functional relation which maps a parameter vector $p \in \mathbb{R}^{m}$ to an estimated measurement vector $\hat{x} = f(p)$, $\hat{x} \in \mathbb{R}^{n}$. An initial parameter estimate $p_0$ and a measured vector x are provided, and it is desired to find the vector $p^{+}$ that best satisfies the functional relation f locally, that is, minimizes the squared distance $\varepsilon^{\top}\varepsilon$ with $\varepsilon = x - \hat{x}$, for all p within a sphere having a certain, small radius. The basis of the LM algorithm is an affine approximation to f in the neighborhood of p. For a small $\|\delta_p\|$, f is approximated by (see [5, p. 75])

$$f(p + \delta_p) \approx f(p) + J \delta_p,$$

where J is the Jacobian of f.

At each iteration, it is required to find the step $\delta_p$ that minimizes the quantity $\|x - f(p + \delta_p)\| \approx \|x - f(p) - J\delta_p\| = \|\varepsilon - J\delta_p\|$. The minimum is attained when $J\delta_p - \varepsilon$ is orthogonal to the column space of J. This leads to $J^{\top}(J\delta_p - \varepsilon) = 0$, which yields $\delta_p$ as the solution of the so-called normal equations [10]:

$$J^{\top} J \delta_p = J^{\top} \varepsilon. \tag{4.2}$$

The matrix $J^{\top}J$ in the above equation is the first-order approximation to the Hessian of $\frac{1}{2}\varepsilon^{\top}\varepsilon$ [16], and $\delta_p$ is the Gauss-Newton step. $J^{\top}\varepsilon$ corresponds to the steepest descent direction, since the gradient of $\frac{1}{2}\varepsilon^{\top}\varepsilon$ is $-J^{\top}\varepsilon$. The LM algorithm actually solves a slight variation of Equation (4.2), known as the augmented normal equations

$$N \delta_p = J^{\top} \varepsilon, \qquad \text{with } N \equiv J^{\top} J + \mu I, \ \mu > 0. \tag{4.3}$$

The strategy of altering the diagonal elements of $J^{\top}J$ is called damping and $\mu$ is referred to as the damping term. It is decreased when the updated parameter vector $p + \delta_p$, with $\delta_p$ computed from Equation (4.3), leads to a reduction in the error $\varepsilon$; otherwise it is increased, the augmented normal equations are solved again, and this process iterates until a value of $\delta_p$ that decreases the error is found.
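A minimal Octave sketch of one such damping loop for a toy one-dimensional curve fit (my own illustration; the model y = p1*exp(p2*t) and all names are hypothetical, not part of the thesis):

    t = (0:0.5:3)';                         % sample points
    y = 2.5 * exp(-1.3 * t);                % noise-free measurements x
    f = @(p) p(1) * exp(p(2) * t);          % estimated measurement vector f(p)
    J = @(p) [exp(p(2)*t), p(1)*t.*exp(p(2)*t)];   % Jacobian of f
    p  = [1; -1];                           % initial parameter estimate
    mu = 1e-3;                              % damping term
    for k = 1:50
      e  = y - f(p);                        % error epsilon = x - f(p)
      dp = (J(p)'*J(p) + mu*eye(2)) \ (J(p)'*e);   % augmented normal equations (4.3)
      if norm(y - f(p + dp)) < norm(e)      % the step reduces the error:
        p = p + dp;  mu = mu / 10;          %   accept it and decrease the damping
      else
        mu = mu * 10;                       %   otherwise increase the damping and retry
      end
    end
    p                                       % converges towards [2.5; -1.3]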


    Chapter 5

Overview of NVIDIA CUDA

(For this chapter I have quoted from [6].)

By introducing CUDA (Compute Unified Device Architecture), NVIDIA has given programmers the initial opportunity to capitalize on inexpensive, generally available, massively parallel computing hardware. Teraflop computing is now within the economic reach of most people around the world. The impact of GPGPU (General-Purpose Graphics Processing Units) technology spans all aspects of computation, from cell phones to the largest supercomputers. Programmable GPUs are deployed in areas of scientific computing, cloud computing, computer visualization, simulations, games, and more.

Programming for GPGPU requires a basic knowledge of the GPU architecture, because even small changes in data structures or in the program can make significant differences in performance. Modern GPUs belong in principle to the SIMD class of Flynn's taxonomy. That means that GPUs are capable of performing the same operation on multiple data elements simultaneously. The restriction to one operation at a time reduces the set of problems worth parallelizing on a GPU. On the other hand, well-vectorized problems can achieve an acceleration of two or more orders of magnitude over multi-core processors (the top 100 NVIDIA CUDA application showcase speedups, published May 9, 2011, range from 100x to 2600x with a median of 1350x).

To ensure the best performance of GPGPU computing, the following three rules should be met.

1. Get the data on the GPGPU and keep it there. GPGPUs are separate devices plugged into the PCI Express bus of the host computer, which is very slow compared to the GPGPU memory system (20 to 28 times slower).

2. Give the GPGPU enough work to do. CUDA-enabled GPUs deliver teraflop performance and are fast enough to complete small problems faster than the host processor can start kernels. Each thread should perform as many instructions as possible to hide this latency.

3. Focus on data reuse within the GPGPU to avoid memory bandwidth limitations. All high-performance CUDA applications exploit internal resources on the GPU (registers, shared memory) to bypass global memory bottlenecks.

    5.1 The CUDA Execution Model

The heart of CUDA performance lies in the execution model and the simple partitioning of a computation into fixed-size blocks of threads in the execution configuration. CUDA naturally maps the parallelism within an application to the massive parallelism of the GPGPU hardware. The result is compatibility across older and future generations of GPUs.

GPU hardware parallelism is achieved through replication of a common architectural building block called a streaming multiprocessor (SM). Figure 5.1 illustrates 16 SMs on a GF100 (Fermi) series GPGPU. The software abstraction of a thread block translates into a natural mapping of the kernel onto an arbitrary number of SMs on a GPGPU. Each SM can be scheduled (by the GigaThread global scheduler) to run one or more thread blocks. Therefore, thread blocks are independent and not synchronizable during the kernel execution (atomic operations make an exception: they allow threads of different blocks to communicate, but they should be used judiciously, as they may introduce scalability and performance issues). Thread blocks also act as containers of thread cooperation, as only threads in a thread block can share data. Threads in a thread block can utilize high-speed memory inside the SM, called shared memory, for data sharing.

Figure 5.2b depicts the composition of one of the 16 streaming multiprocessors in a GF100 GPU. SIMD cores require less power and space than non-SIMD cores. As a result, GPGPUs have a high flop-per-watt ratio compared to conventional CPUs [25]. The threads running on a multiprocessor are partitioned into groups in which all threads execute the same instruction simultaneously. On the CUDA architecture, these groups are called warps, each warp has 32 threads, and this execution model is referred to as SIMT (Single Instruction Multiple Threads) [18].

GPGPUs are not true SIMD machines (but SIMT), since the SIMD units are only the streaming multiprocessors, which may be running one or more different instructions. Conditionals (if statements) can decrease performance inside an SM, because each branch of each conditional must be evaluated. This can cause a slowdown of 2^n for n nested conditionals.


    Figure 5.1: Block diagram of a GF100 (Fermi) GPU [2]

    5.2 GPU Memory

For the highest performance of applications developed for a GPU, data inside the SM must be reused. The reason is that the on-board global memory (DRAM in 5.2a) is not fast enough when all SMs want to perform read/write operations. CUDA provides configurable caches for each SM to give the opportunity for data reuse. Awareness of the difference between on-board (GPU) and on-chip (SM) memory is the key to achieving the highest performance that a GPGPU can provide.

The fastest and most scalable is the on-chip SM memory. However, it is limited to a few KB. The on-board global memory is accessible by all the SMs across the GPU and is measured in GB. The significant bandwidth gaps between on-board and on-chip memories can be seen in figure 5.3. Although the bandwidth of shared memory can greatly accelerate applications, it is too slow to achieve peak performance [24].

[Figure 5.2: Streaming multiprocessor of a GF100 (Fermi) GPU. (a) Memory hierarchy [1]; (b) block diagram [1].]

    Register memory   8,000 GB/s
    Shared memory     1,600 GB/s
    Global memory       177 GB/s
    Mapped memory         8 GB/s

Figure 5.3: Bandwidth of various GPU memory [6, p. 111]

Example 4. Computing a simple element-wise vector product

    for( i = 0; i < N; i++ ) c[i] = a[i] * b[i];

on a GPU utilizing only global memory gives limited performance. When 4-byte floating-point values are being used, a 1 Tflop GPU would require 12 TB/s of memory bandwidth. A GPU with 177 GB/s of memory bandwidth can therefore deliver only about 14 Gflop/s (1.4% of the potential 1 Tflop performance).
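The quoted figures follow from a simple bytes-per-flop argument; a back-of-the-envelope check in Octave (my own sketch, not from the thesis):

    bytes_per_flop = 3 * 4;                      % two reads and one write of a 4-byte float per multiply
    bandwidth      = 177e9;                      % global memory bandwidth in B/s
    needed_bw      = 1e12 * bytes_per_flop       % 12 TB/s to sustain 1 Tflop
    achievable     = bandwidth / bytes_per_flop  % roughly 14.75 Gflop/s from 177 GB/s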

When programming for a GPU, it is necessary to reuse data within the SM (to exploit data locality). GPGPUs support two types of data locality: temporal locality (exploited by LRU, Least Recently Used, caching) means that recently accessed data is likely to be used again in the future, and spatial locality means that neighbouring data is cached to be used in the future.

For compute capability 2.0 or higher, the constant and texture memories, traditionally used for effective data broadcasting to all threads, are outperformed by the global memory. This is because compute capability 2.0 devices contain SMs with an L1 cache and a unified L2 cache that speed up access to the global memory.


    Chapter 6

    Analysis of the Problem

As I have mentioned in the Introduction, finding the solution of a linear system is the most compute-demanding part of solving a nonlinear system. At each iteration, a linear system Ax = b must be solved. Bundle adjustment (BA), as a least squares problem, works with sparse linear systems of a special structure (doubly bordered block diagonal). A similar structure can be obtained by applying nested dissection ordering to a diagonal-based matrix A (band-diagonal, block tridiagonal, ...). The implemented GPU solver can be used for BA when the information about the structure of the matrix A is provided by BA, or for a diagonal-based matrix when the information about the structure is provided by the ordering function.

    6.1 Structure of Linear Systems in BA

A system of linear augmented normal equations (4.3) arises in BA and is solved at each iteration of the Levenberg-Marquardt algorithm. Matrix J is the Jacobian and N is the first-order approximation of the Hessian. The structure of N can be exactly determined from the input parameters of the BA problem.

Example 5. [14, p. 9] Consider that we want to optimize the parameters of 3 cameras and 4 3D points visible in all cameras. The measurement vector $X = (x_{11}^{\top}, x_{12}^{\top}, x_{13}^{\top}, x_{21}^{\top}, x_{22}^{\top}, x_{23}^{\top}, x_{31}^{\top}, x_{32}^{\top}, x_{33}^{\top}, x_{41}^{\top}, x_{42}^{\top}, x_{43}^{\top})^{\top}$ is made up of the measured image point coordinates across all cameras. The parameter vector $P = (a_1^{\top}, a_2^{\top}, a_3^{\top}, b_1^{\top}, b_2^{\top}, b_3^{\top}, b_4^{\top})^{\top}$ is defined by all parameters describing the 3 projection matrices and the 4 3D points. Let $A_{ij}$ and $B_{ij}$ denote $\partial x_{ij} / \partial a_j$ and $\partial x_{ij} / \partial b_i$, respectively; note that $\partial x_{ij} / \partial a_k = 0$ for $j \neq k$ and $\partial x_{ij} / \partial b_k = 0$ for $i \neq k$. Employing this notation, the Jacobian can be written as

$$J = \frac{\partial X}{\partial P} = \begin{pmatrix}
A_{11} & 0 & 0 & B_{11} & 0 & 0 & 0 \\
0 & A_{12} & 0 & B_{12} & 0 & 0 & 0 \\
0 & 0 & A_{13} & B_{13} & 0 & 0 & 0 \\
A_{21} & 0 & 0 & 0 & B_{21} & 0 & 0 \\
0 & A_{22} & 0 & 0 & B_{22} & 0 & 0 \\
0 & 0 & A_{23} & 0 & B_{23} & 0 & 0 \\
A_{31} & 0 & 0 & 0 & 0 & B_{31} & 0 \\
0 & A_{32} & 0 & 0 & 0 & B_{32} & 0 \\
0 & 0 & A_{33} & 0 & 0 & B_{33} & 0 \\
A_{41} & 0 & 0 & 0 & 0 & 0 & B_{41} \\
0 & A_{42} & 0 & 0 & 0 & 0 & B_{42} \\
0 & 0 & A_{43} & 0 & 0 & 0 & B_{43}
\end{pmatrix}. \tag{6.1}$$

Then, the approximation of the Hessian (matrix N from Equation (4.3)) has the form

$$\begin{pmatrix}
U_1 & 0 & 0 & W_{11} & W_{21} & W_{31} & W_{41} \\
0 & U_2 & 0 & W_{12} & W_{22} & W_{32} & W_{42} \\
0 & 0 & U_3 & W_{13} & W_{23} & W_{33} & W_{43} \\
W_{11}^{\top} & W_{12}^{\top} & W_{13}^{\top} & V_1 & 0 & 0 & 0 \\
W_{21}^{\top} & W_{22}^{\top} & W_{23}^{\top} & 0 & V_2 & 0 & 0 \\
W_{31}^{\top} & W_{32}^{\top} & W_{33}^{\top} & 0 & 0 & V_3 & 0 \\
W_{41}^{\top} & W_{42}^{\top} & W_{43}^{\top} & 0 & 0 & 0 & V_4
\end{pmatrix}
\begin{pmatrix} \delta_{a_1} \\ \delta_{a_2} \\ \delta_{a_3} \\ \delta_{b_1} \\ \delta_{b_2} \\ \delta_{b_3} \\ \delta_{b_4} \end{pmatrix}
=
\begin{pmatrix} \varepsilon_{a_1} \\ \varepsilon_{a_2} \\ \varepsilon_{a_3} \\ \varepsilon_{b_1} \\ \varepsilon_{b_2} \\ \varepsilon_{b_3} \\ \varepsilon_{b_4} \end{pmatrix}. \tag{6.2}$$

Denoting the upper left, lower right, and upper right parts of the matrix in Equation (6.2) with U, V and W, respectively, allows us to rewrite the augmented normal equations (4.3) compactly as

$$\begin{pmatrix} U^{*} & W \\ W^{\top} & V^{*} \end{pmatrix} \begin{pmatrix} \delta_a \\ \delta_b \end{pmatrix} = \begin{pmatrix} \varepsilon_a \\ \varepsilon_b \end{pmatrix}, \tag{6.3}$$

where * designates the augmentation of the diagonal elements of U and V. Now, let us compare the structure of the Hessian in Equation (6.2) with the Hessian of a bigger BA problem (figure 6.1). The upper left part (U) corresponds to the approximation of the second derivatives with respect to the camera parameters, the lower right part (V) to the second derivatives with respect to the 3D points, and the upper right part (W) to the mixed derivatives coupling cameras and points.

    6.2 Block Cholesky Decomposition for BA

[Figure 6.1: An example of a modestly sized Hessian in BA: the sparsity pattern of a 992 × 992 normal equations matrix (i.e. the approximate Hessian). (a) Original input matrix; (b) rotated by 180 degrees with marked parts (see also figure 7.1 for comparison). Black regions correspond to nonzero elements [14, p. 27].]

Lourakis and Argyros [14] suggest solving the augmented normal equations (6.3) arising in BA in two steps (first for $\delta_a$ and then for $\delta_b$) as follows. Left multiplication of Equation (6.3) by the block matrix

$$\begin{pmatrix} I & -W V^{*-1} \\ 0 & I \end{pmatrix} \tag{6.4}$$

results in

$$\begin{pmatrix} U^{*} - W V^{*-1} W^{\top} & 0 \\ W^{\top} & V^{*} \end{pmatrix} \begin{pmatrix} \delta_a \\ \delta_b \end{pmatrix} = \begin{pmatrix} \varepsilon_a - W V^{*-1} \varepsilon_b \\ \varepsilon_b \end{pmatrix}.$$

Since the top right block of the left-hand matrix is zero, $\delta_a$ can be determined from its top half, which is

$$(U^{*} - W V^{*-1} W^{\top})\, \delta_a = \varepsilon_a - W V^{*-1} \varepsilon_b. \tag{6.5}$$

The matrix $S \equiv U^{*} - W V^{*-1} W^{\top}$ is the Schur complement of $V^{*}$ in the left-hand side matrix of (6.3) and is also positive definite [19]. Linear system (6.5) is solved for $\delta_a$ using the Cholesky decomposition of S. $\delta_b$ is then computed by solving

$$V^{*} \delta_b = \varepsilon_b - W^{\top} \delta_a.$$
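A minimal dense Octave sketch of this two-step solution (my own illustration; U, V, W are plain dense blocks built from a random SPD test matrix, whereas the thesis stores them in sparse and block formats):

    m = 6;  k = 9;                              % sizes of the camera and point parts
    M = rand(m + k);  M = M * M' + (m + k) * eye(m + k);   % random SPD test matrix
    U = M(1:m, 1:m);  W = M(1:m, m+1:end);  V = M(m+1:end, m+1:end);
    ea = rand(m, 1);  eb = rand(k, 1);
    S  = U - W * (V \ W');                      % Schur complement of V, as in (6.5)
    R  = chol(S);                               % Cholesky factor of S
    da = R \ (R' \ (ea - W * (V \ eb)));        % solve (6.5) for delta_a
    db = V \ (eb - W' * da);                    % back-substitute for delta_b
    norm(M * [da; db] - [ea; eb])               % residual check, close to zero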

This approach has a big advantage: an absence of fill-ins during the computation. The approach explained in the next example is slightly different [21, p. 102].


Example 6. Let $A \in \mathbb{R}^{n \times n}$ be a symmetric positive definite matrix that can be divided into 4 submatrices $A_{11}$, $A_{12}$, $A_{21}$ and $A_{22}$. Then, according to Theorem 2, the Cholesky decomposition $A = LL^{\top}$ exists, where L is a lower triangular matrix with strictly positive diagonal entries. If matrix A consists of 4 submatrices, the equation $A = LL^{\top}$ can be rewritten as

$$A = \begin{pmatrix} A_{11} & A_{21}^{\top} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} L_{11}^{\top} & L_{21}^{\top} \\ 0 & L_{22}^{\top} \end{pmatrix}.$$

The aim of the block Cholesky decomposition is to compute the values in the submatrices $L_{11}$, $L_{21}$, $L_{22}$ (or $L_{11}^{\top}$, $L_{21}^{\top}$, $L_{22}^{\top}$, respectively). The whole process can be divided into the following steps:

1. $A_{11} = L_{11} L_{11}^{\top}$ (Cholesky decomposition),

2. $L_{21}^{\top} = L_{11}^{-1} A_{21}^{\top}$ from $A_{21}^{\top} = L_{11} L_{21}^{\top}$, or equivalently $L_{21} = A_{21} L_{11}^{-\top}$ from $A_{21} = L_{21} L_{11}^{\top}$,

3. $A_{22} - L_{21} L_{21}^{\top} = L_{22} L_{22}^{\top}$ (Cholesky decomposition).

During the decomposition process, the first two steps can be done simultaneously. The last step updates the $A_{22}$ submatrix with the matrix $A_{22}^{S}$ that is called the Schur complement of $A_{11}$ in the matrix A and can be expressed as

$$A_{22}^{S} = A_{22} - A_{21} A_{11}^{-1} A_{21}^{\top} = A_{22} - L_{21} L_{11}^{\top} (L_{11} L_{11}^{\top})^{-1} L_{11} L_{21}^{\top} = A_{22} - L_{21} (L_{11}^{\top} L_{11}^{-\top})(L_{11}^{-1} L_{11}) L_{21}^{\top} = A_{22} - L_{21} L_{21}^{\top}. \tag{6.6}$$
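A small Octave sketch verifying these steps on a random symmetric positive definite matrix (my own illustration, not code from the thesis):

    n1 = 3;  n2 = 4;  n = n1 + n2;
    A = rand(n);  A = A * A' + n * eye(n);      % random SPD test matrix
    A11 = A(1:n1, 1:n1);  A21 = A(n1+1:n, 1:n1);  A22 = A(n1+1:n, n1+1:n);
    L11 = chol(A11)';                           % step 1: Cholesky of the leading block
    L21 = A21 / L11';                           % step 2: L21 = A21 * inv(L11')
    L22 = chol(A22 - L21 * L21')';              % step 3: Cholesky of the Schur complement (6.6)
    L   = [L11, zeros(n1, n2); L21, L22];
    norm(L * L' - A)                            % close to zero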

Example 7. This method allows parallel computation when the diagonal blocks are independent, as in the linear system (6.7). Blocks $A_{11}$ and $A_{22}$ do not have any mutually dependent elements ($A_{12}$ and $A_{21}$ are zero matrices):

$$\begin{pmatrix} A_{11} & 0 & A_{13} \\ 0 & A_{22} & A_{23} \\ A_{13}^{\top} & A_{23}^{\top} & A_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}. \tag{6.7}$$

After the first step, the blocks $A_{11}$, $A_{13}$, $A_{22}$, $A_{23}$ and the parts $b_1$ and $b_2$ of the right-hand side are updated in parallel, and the system has the following form:

$$\begin{pmatrix} L_{11}^{\top} & 0 & L_{11}^{-1} A_{13} \\ 0 & L_{22}^{\top} & L_{22}^{-1} A_{23} \\ 0 & 0 & A_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} L_{11}^{\top} & 0 & L_{13} \\ 0 & L_{22}^{\top} & L_{23} \\ 0 & 0 & A_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} L_{11}^{-1} b_1 \\ L_{22}^{-1} b_2 \\ b_3 \end{pmatrix}.$$


The next step is to update the block $A_{33}$ with the Schur complement $A_{33}^{S}$ of the matrix $\begin{pmatrix} A_{11} & 0 \\ 0 & A_{22} \end{pmatrix}$ in the matrix A, that is, according to (6.6),

$$A_{33}^{S} = A_{33} - \begin{pmatrix} L_{13}^{\top} & L_{23}^{\top} \end{pmatrix} \begin{pmatrix} L_{13} \\ L_{23} \end{pmatrix},$$

and to update the vector $b_3$ with $b_3^{S}$, which equals

$$b_3^{S} = b_3 - \begin{pmatrix} L_{13}^{\top} & L_{23}^{\top} \end{pmatrix} \begin{pmatrix} L_{11}^{-1} b_1 \\ L_{22}^{-1} b_2 \end{pmatrix}.$$

Next, the linear system

$$A_{33}^{S} x_3 = b_3^{S}$$

is transformed using Gaussian elimination to

$$L_{33}^{S\top} x_3 = (L_{33}^{S})^{-1} b_3^{S}$$

and solved for $x_3$ using back substitution. Finally, the remaining parts of the vector x ($x_1$ and $x_2$) in the transformed system

$$\begin{pmatrix} L_{11}^{\top} & 0 & L_{13} \\ 0 & L_{22}^{\top} & L_{23} \\ 0 & 0 & L_{33}^{S\top} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} L_{11}^{-1} b_1 \\ L_{22}^{-1} b_2 \\ (L_{33}^{S})^{-1} b_3^{S} \end{pmatrix}$$

are computed using only back substitution.


    Chapter 7

    Implementation

This chapter describes the chosen framework and implementation details, such as the structures, functions and data types used in the practical output of the thesis: a linear direct solver (LDS).

    7.1 Used Framework

The whole application was developed in a Linux environment (Xubuntu 12.04 for 64-bit PC and Debian 6.0 for 32-bit PC). The host code (for the CPU side) was written in ANSI C, the device code (for the GPU side) in CUDA (CUDA Driver 4.0). All object files were linked together into an executable file (ldsexam) using the NVCC compiler; no static or dynamic libraries were created (see my makefile).

    7.2 Compressed Row Storage Format

Many formats for sparse matrices exist. One of the most general is the compressed row storage (CRS) format. It makes no assumptions about the sparsity pattern and stores only indices and nonzero elements. On the other hand, it is not very efficient, because it needs an indirect addressing step for every scalar operation in a matrix-vector product. I have decided on this format for my CPU-side solver because it can be utilized effectively in the Cholesky decomposition.

The CRS format needs three vectors: nozval of floating-point numbers, and rowptr and colind of integers. The nozval vector stores the values of the nonzero elements of the matrix as they are traversed in a row-wise fashion. The colind vector stores the column indices of the elements in the nozval vector.


That is, if nozval(k) = a_ij then colind(k) = j. The rowptr vector stores the locations in the nozval vector that start a row, that is, if nozval(k) = a_ij then rowptr(i) ≤ k < rowptr(i + 1). By convention, rowptr(n + 1) = nnz + 1, where nnz is the number of all nonzeros.

Example 8. Consider the sparse symmetric matrix in figure 7.1.

            0  1  2  3  4  5
        0 [ 7  .  .  .  .  1 ]
        1 [ .  8  1  .  2  . ]
        2 [ .  1  8  .  3  2 ]
        3 [ .  .  .  9  3  2 ]
        4 [ .  2  3  3  9  3 ]
        5 [ 1  .  2  2  3  9 ]

Figure 7.1: Sample of a symmetric positive definite sparse matrix 6 × 6 with 22 nonzero elements

Its CRS representation has the following attributes (0-based indices, as in the C implementation): n = 6, nnz = 22,

    rowptr: 0 2 5 9 12 17 22
    colind: 0 5 | 1 2 4 | 1 2 4 5 | 3 4 5 | 1 2 3 4 5 | 0 2 3 4 5
    nozval: 7 1 | 8 1 2 | 1 8 3 2 | 9 3 2 | 2 3 3 9 3 | 1 2 2 3 9

(the separator | only marks row boundaries for readability).
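A short Octave sketch of these CRS arrays and of a matrix-vector product computed directly from them (my own illustration; 1-based indices are used here, whereas the C implementation is 0-based):

    rowptr = [1 3 6 10 13 18 23];
    colind = [1 6   2 3 5   2 3 5 6   4 5 6   2 3 4 5 6   1 3 4 5 6];
    nozval = [7 1   8 1 2   1 8 3 2   9 3 2   2 3 3 9 3   1 2 2 3 9];
    x = (1:6)';                       % arbitrary input vector
    y = zeros(6, 1);
    for i = 1:6                       % y = A*x using only the three CRS vectors
      for k = rowptr(i):rowptr(i+1)-1
        y(i) = y(i) + nozval(k) * x(colind(k));
      end
    end
    y'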

    7.3 Cholesky decomposition on GPU

The implementation of a sparse Cholesky decomposition (functions CRS_chol and CRS_chol_subs) was quite straightforward. Before these functions are called, a symbolical factorization must be performed, which determines the indices of the fill-ins and allocates space for them. For the purpose of the Cholesky decomposition, it is sufficient to have only the lower or upper triangular matrix. This fact was exploited by skipping all elements from the beginning of each row up to the main diagonal, which is done by CRS_shifted_rows. Another difference in the decomposition of sparse matrices lies in the necessity of altering the beginning of each row during the factorization. For that purpose I have worked with the temporary arrays rowbeg and rowend.


    7.4 Ordering for CPU solver

In my solver, I have utilized the approximate minimum degree (AMD) ordering by Tim Davis, which can also be found in MATLAB's amd function. It minimizes the number of fill-ins very effectively and fast (see figure 3.2). For the BA problem, an even faster ordering (but with more fill-ins) can be used: a simple rotation by 180 degrees.

    7.5 Block Matrix Format for GPU

There are 3 different parts in the matrix: full diagonal blocks, the sparse border and the almost dense tail (light, middle and dark gray in figure 7.1). Analyzing the properties of these parts and of the CUDA architecture, I have designed the following matrix data structure (MXBF).

Blocks: As there are many (from thousands to millions) full but small diagonal blocks (Vi), they can be stored in one array (data) in a row-wise manner. In BA, the blocks have the same size, but when using METIS k-way ordering, the blocks do not have the same size. Because of that, for each block the information about its size must be stored (blksz). When iterating over the blocks, it is efficient to have an index saying where the data start for the i-th block (blkp). Only the upper part of each block is stored, but memory is allocated for the full block to avoid awkward indexing.

Border: This part holds the majority of the nonzero elements. Therefore, it must be stored as a sparse matrix; I have chosen the CRS format. Since the input matrix is symmetric, it is sufficient to store only one side of the border.

Tail: After computing the Schur complement, this part will be almost dense. Consequently, it is stored as a full matrix. Only the upper triangle is stored, but memory is allocated for the full matrix, as in the case of the blocks. The data for this part are stored in the data array as well, and tail points to the location where the data for the tail start.

The MXBF structure of the matrix from figure 7.1 has these attributes: n = 6, tail = 5 (index in the data array where the data for the tail start), tailsz = 3, ndata = 14 (number of elements in the blocks and the tail), brd_nnz = 4 (number of nonzeros in the border),

    blksz: 1 2
    blkp:  0 1
    data:  7 | 8 1 0 8 | 9 3 2 0 9 3 0 0 9

(the separator | only marks block/tail boundaries for readability),


    brd_rowptr: 0 1 2 4
    brd_colind: 2 1 1 2
    brd_nozval: 1 2 3 2

    7.6 Block Cholesky decomposition on GPU

Consider the block matrix

$$\begin{pmatrix} A_{11} & 0 & A_{13} \\ 0 & A_{22} & A_{23} \\ A_{13}^{\top} & A_{23}^{\top} & A_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix},$$

where $A_{11}$ and $A_{22}$ are called blocks, $A_{13}$ and $A_{23}$ borders, and $A_{33}$ is the tail.

    The block Cholesky decomposition consists of four main parts:

1. Eliminating the blocks ($A_{11} \to L_{11}$ and $A_{22} \to L_{22}$), updating the corresponding borders ($A_{13} \to L_{11}^{-1} A_{13}$ and $A_{23} \to L_{22}^{-1} A_{23}$), and updating the corresponding parts of the right-hand side of the linear system ($b_1 \to L_{11}^{-1} b_1$ and $b_2 \to L_{22}^{-1} b_2$). All these steps are done simultaneously (within the elimination loops). Each thread eliminates one block (in the test matrix it is of size 3 × 3) and updates its own part of the border and of the b vector. As the border part is sparse and can have an arbitrary number of nonzero elements, I store and access this data in global memory.

2. Computing the Schur complement
\[
A_{33} \rightarrow A_{33}^{S} = A_{33} -
\begin{pmatrix} L_{11}^{-1} A_{13} \\ L_{22}^{-1} A_{23} \end{pmatrix}^{T}
\begin{pmatrix} L_{11}^{-1} A_{13} \\ L_{22}^{-1} A_{23} \end{pmatrix}.
\]
One problem was that the updated border part $\begin{pmatrix} L_{11}^{-1} A_{13} \\ L_{22}^{-1} A_{23} \end{pmatrix}$ is stored in a row-wise manner and its transpose is not available; therefore, a dot-product formulation of the matrix-matrix multiplication was not possible. Instead, I loop through the rows of the matrix and update the elements of $A_{33}$ at every multiplication. This is only possible with atomic operations (atomicAdd), which, moreover, are available only for single-precision floats (and only on devices of compute capability 2.0 or higher). I am aware of this restriction of the proposed approach. A sketch of such an update kernel is given after this list.

3. Eliminating the tail ($A_{33}^{S} \rightarrow L_{33}^{S}$). This part surely has the biggest potential to exploit the full power of a GPU. Unfortunately, it was postponed due to lack of time; I had planned to call a function from the MAGMA library that is able to factorize a dense linear system. In my solver, this part is performed on the CPU side.


4. Back substitution. Performed on the CPU side, first for the dense part $L_{33}^{S}$ and then for the sparse borders and the full blocks.
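As a rough, hedged sketch of step 2 only (not the thesis kernel itself): each CUDA thread owns one diagonal block, walks the CRS rows of its already-updated border, and subtracts the outer products from the dense tail with atomicAdd. All array names and the blk_first_row index are assumptions made for this illustration; border column indices are assumed to be tail-relative.

__global__ void schur_update(const int   *blk_first_row, /* first border row of each block, length nblocks+1 */
                             int          nblocks,
                             const int   *rowptr,        /* CRS row pointers of the updated border            */
                             const int   *colind,        /* tail-relative column indices                      */
                             const float *val,           /* border values, already L_ii^{-1}-scaled           */
                             float       *S,             /* dense tail (row-major), initialized with A33      */
                             int          tailsz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nblocks) return;

    /* rows of the border that belong to the i-th diagonal block */
    for (int r = blk_first_row[i]; r < blk_first_row[i + 1]; ++r) {
        for (int p = rowptr[r]; p < rowptr[r + 1]; ++p) {
            int   j  = colind[p];
            float vj = val[p];
            /* accumulate only the upper triangle: k >= j */
            for (int q = p; q < rowptr[r + 1]; ++q) {
                int   k  = colind[q];
                float vk = val[q];
                /* S[j][k] -= vj * vk; different blocks may hit the same
                 * tail entry, hence the atomic (single precision only)   */
                atomicAdd(&S[j * tailsz + k], -vj * vk);
            }
        }
    }
}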

    7.7 Ordering for GPU solver

A requirement of my GPU solver is that the input matrix can be partitioned into the structure that appears in the approximation of the Hessian in the BA problem (see the matrix in Equation 6.2). This can be achieved by applying a nested dissection ordering recursively. The METIS k-way ordering was used to partition the input matrix into the independent block structure required by the GPU solver. Figures 7.2 and 7.3 illustrate the structure of matrices from the MATLAB gallery reordered by the k-way ordering. As BA has this structure implicitly, and the size and number of the independent blocks are known from the BA configuration, only a rotation by 180 degrees is needed to obtain the structure of Figure 6.1b, as sketched below.
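Just to make the last remark concrete (a trivial sketch, with perm as an illustrative name): rotating a symmetric matrix by 180 degrees is the reversing permutation applied to rows and columns, which moves the small camera part to the tail and the many point blocks to the front.

/* reversing permutation used as the "180-degree rotation";
 * the reordered matrix is B(i, j) = A(perm[i], perm[j]) */
void rotation_180_perm(int n, int *perm)
{
    for (int i = 0; i < n; ++i)
        perm[i] = n - 1 - i;
}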

Figure 7.2: Performing k-way ordering on the diagonal-based matrix Wathen 10×10 (nz = 4861); (a) the original matrix, (b) reordered into 5 independent blocks.


Figure 7.3: Performing k-way ordering on the diagonal-based matrix Poisson 30 (nz = 4380); (a) the original matrix, (b) reordered using k-way ordering into 10 independent blocks.


    Chapter 8

    Testing

Testing was performed on the following configuration: Intel i7-2600 CPU @ 3.40 GHz, 4 GB RAM, GeForce GT570, Debian 6.0 for 32-bit PC, CUDA driver 4.0. Applications were compiled using GCC (version 4.3.5) and NVCC with the -use_fast_math and -O3 optimization flags.

To check the accuracy of my solvers, I have used Octave to obtain the reference x vector. The solutions from Octave and from my solver were printed into files (x_octave.vec and x_result.vec) and their differences were compared with another Octave function (vec_ck).

The main testing input matrix was the approximation of the Hessian from a BA problem optimizing 3 parameters of 11,049 3D points and 7 parameters of 22 cameras. The matrix is of size 33,301 × 33,301 and has 1,817,521 nonzero elements, saved in the Matrix Market coordinate format (data/jTj_mue.mtx).

    8.1 Octave solvers

In Octave, I have tested the direct solver (the left division operator \), the Preconditioned Conjugate Gradient solver (pcg), and Preconditioned Conjugate Residuals (pcr). The iterative solvers were set to terminate after reaching 200 iterations or a residual norm less than $10^{-6}$. Table 8.1 shows the results. The Preconditioned Conjugate Residuals solver terminated after 45 iterations, but its result was wrong.


Method                   Time      Res. norm   Iterations
Left division operator   695 ms    1.28313     --
Conjugate gradient       1440 ms   4.1285      75
Conjugate residuals      1386 ms   NaN         45

Table 8.1: Test of Octave solvers

    8.2 CPU solver

After executing the CPU solver from the lds directory with the command
./bin/ldscpuexam data/jtj_mueI.mtx data/g.mtx
the following information is printed:

    load matrix: 1070 ms

    load vector: 10 ms

    symamd ord.: 80 ms

    mat. reorder: 390 ms

    symbolic: 500 ms

    CRS_symbolic: 1834461 nnz

    CPU CRS chol: 50 ms

    all: 2120 ms

The number of nonzeros has not increased much (from 1,817,521 to 1,834,461), which means that there are very few fill-ins (less than 1%). It can also be seen that my implemented functions for reordering the matrix and for the symbolic factorization are not very efficient. The reason may be that the reordering is performed by transforming the CRS format into the triplet (COO) format, permuting, sorting, and then transforming back, which requires a lot of data movement. Although finding the ordering takes more time than solving the whole linear system, without it (try commenting it out in ldscpuexam.c) the computation takes more than several minutes. The execution of all functions required to find the solution takes 1 second.
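To illustrate where that data movement comes from, here is a minimal sketch of such a permutation step (assuming a full-pattern CRS; the function and array names are illustrative, and a triangular storage would additionally have to mirror entries that cross the diagonal):

#include <stdlib.h>

typedef struct { int r, c; double v; } coo_t;

static int cmp_coo(const void *a, const void *b)
{
    const coo_t *x = a, *y = b;
    if (x->r != y->r) return x->r - y->r;
    return x->c - y->c;
}

/* Symmetric permutation of a CRS matrix: new index = inv[old index].
 * Every nonzero is copied to a triplet array, relabeled, sorted by
 * (row, col) and copied back -- exactly the extra work discussed above. */
void crs_sym_permute_sketch(int n, int *rowptr, int *colind, double *val,
                            const int *inv)
{
    int nnz = rowptr[n];
    coo_t *t = malloc((size_t)nnz * sizeof *t);

    for (int i = 0; i < n; ++i)                      /* CRS -> COO, relabel */
        for (int p = rowptr[i]; p < rowptr[i + 1]; ++p) {
            t[p].r = inv[i];
            t[p].c = inv[colind[p]];
            t[p].v = val[p];
        }

    qsort(t, (size_t)nnz, sizeof *t, cmp_coo);       /* sort by (row, col)  */

    for (int i = 0; i <= n; ++i) rowptr[i] = 0;      /* COO -> CRS          */
    for (int p = 0; p < nnz; ++p) rowptr[t[p].r + 1]++;
    for (int i = 0; i < n; ++i) rowptr[i + 1] += rowptr[i];
    for (int p = 0; p < nnz; ++p) { colind[p] = t[p].c; val[p] = t[p].v; }

    free(t);
}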

The command
octave -q --eval="vec_ck( x_octave.vec, x_result.vec );"
outputs the residual norm of the difference from the reference Octave solution and finds where the biggest difference is:

    max err: 0.0000000228 at 138th element

    res nrm: 0.0000000000


    8.3 GPU solver

To check the correctness of the GPU solver, I have also implemented the GPU solver's algorithm on the CPU side (to use it, the constant BLOCK_CHOLESKY_CPU must be uncommented and the project recompiled with make). Then ldsgpuexam is performed on the CPU.

Calling
./bin/ldsgpuexam data/jtj_mueI.mtx data/g.mtx
gives these results:

    load matrix: 1060 ms

load vector: 20 ms
kway ord.: 20 ms

    mat. reorder: 370 ms

    symbolic: 500 ms

    CRS_symbolic: 1834083 nnz

    MXBF_from_crs: 11049 blocks

    858522 border nnz

    123157 block and tail data

    block matrix: 10 ms

    elim. blks: 10 ms

    tail update: 30 ms

elim tail: 0 ms
back subs: 0 ms

    CPU block chol: 40 ms

    all: 2020 ms

This solver, which exploits the special structure of BA, runs faster than the general CPU solver (40 vs. 50 ms). Checking the residual norm gives:

    max err: 0.0000221960 at 59th element.

    res nrm: 0.0000000010

    Output of the real GPU solver:

load matrix: 1070 ms
load vector: 10 ms

    kway ord.: 20 ms

    mat. reorder: 380 ms

    symbolic: 500 ms

    CRS_symbolic: 1834083 nnz

    MXBF_from_crs: 11049 blocks

    858522 border nnz

    123157 block and tail data

    block matrix: 10 ms


    elim on GPU:

elim without copy: 15.1688 ms
elim with copy: 20.0004 ms

    elim blocks + tail update: 420 ms

    elim tail: 0 ms

    back subs: 0 ms

    GPU block chol: 430 ms

    all: 2430 ms

    with residual norm:

    max err: 0.0000072417 at 103th element.

    res nrm: 0.0000000003

The GPU solver must be run with single-precision floats because of the atomicAdd operations. "elim without copy" is the time needed for the elimination of the blocks and the tail update (computing the Schur complement).

    8.4 CUSP solvers

CUSP is a C++ template library that implements parallel algorithms for sparse matrix and graph computations. It provides a variety of iterative solvers, such as Conjugate Gradient (CG), Biconjugate Gradient (BiCG), Biconjugate Gradient Stabilized (BiCGstab), Generalized Minimum Residual (GMRES), Multi-mass Conjugate Gradient (CG-M), and Multi-mass Biconjugate Gradient Stabilized (BiCGstab-M). I have tested two of them with the maximum number of iterations set to 200 and the relative tolerance to $10^{-6}$. Table 8.2 shows the results.

Method     Time     Max. error   Iterations
CG         50 ms    3.88         77
BiCGstab   90 ms    2.38         76

Table 8.2: Test of iterative CUSP solvers. Max. error is the maximal difference from Octave's reference solution.
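For reference, a minimal sketch of how such a CUSP run can be set up, assuming the CUSP interface of that era (cusp::default_monitor, cusp::krylov::cg) and with the all-ones right-hand side used only as a stand-in:

#include <cusp/csr_matrix.h>
#include <cusp/array1d.h>
#include <cusp/monitor.h>
#include <cusp/krylov/cg.h>
#include <cusp/io/matrix_market.h>

int main(void)
{
    /* load the SPD system matrix from Matrix Market into device memory */
    cusp::csr_matrix<int, float, cusp::device_memory> A;
    cusp::io::read_matrix_market_file(A, "data/jtj_mueI.mtx");

    /* initial guess and right-hand side */
    cusp::array1d<float, cusp::device_memory> x(A.num_rows, 0.0f);
    cusp::array1d<float, cusp::device_memory> b(A.num_rows, 1.0f);

    /* stop after 200 iterations or at a relative residual of 1e-6 */
    cusp::default_monitor<float> monitor(b, 200, 1e-6f);

    /* run Conjugate Gradient on the GPU */
    cusp::krylov::cg(A, x, b, monitor);
    return 0;
}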


    Chapter 9

    Conclusion

The aim of this thesis was to study linear direct solvers and then implement a linear direct GPU solver for the BA problem. The implementation of the GPU solver was preceded by studying the mathematical background of linear direct solvers. First, the CPU solver had to be implemented, and other important concepts of direct sparse solvers had to be acquired, such as symbolic factorization, working with the CRS matrix format, and applying ordering techniques. I can say that my implemented CPU solver is fast and reliable when solving positive definite linear systems. This was done in the first half of the academic year.

In the second half, I started experimenting with the METIS k-way ordering and how to utilize it for solving general sparse systems in parallel. Although this approach is fully usable, it has drawbacks such as a slow computation of the ordering, a relatively big tail part, and independent blocks of different sizes.

Simultaneously, I was analyzing the BA problem and the structure of its sparse linear systems in the Levenberg-Marquardt algorithm. As the structure produced by BA and by the k-way ordering is the same, I tried to write a solver which could be general (the needed information about the block matrix is given by the k-way ordering) and specific at the same time (in this case, the information about the block matrix is provided by the BA configuration). The general solver on the GPU is not finished (a special symbolic factorization is missing).

The GPU solver specialized for BA was implemented, but it provides only very small speedups in comparison with the CPU solver. The reason is that only the global memory of the GPU was used for all computations.

In the testing phase, I have found out that iterative solvers have a great potential to solve linear systems very fast. The advantage of iterative solvers is their configurable accuracy, which can be sufficient for iterative nonlinear solvers.


Especially when used with a preconditioner, the solution should be found very fast. On the other hand, when using direct solvers, the symbolic factorization is computed only once in the LM algorithm, and direct solvers give in general more accurate results. Based on my experiments, I suggest using a direct solver on the CPU together with a dense GPU solver for factorizing the Schur complement.

I realize that a precise study of the SBA (Sparse Bundle Adjustment) package is missing, as is testing of the practical utilization of the GPU solvers within this package.


    Appendix A

List of Abbreviations

    3D Three-Dimensional

    BA Bundle Adjustment

    CPU Central Processing Unit

    CUDA Compute Unified Device Architecture

    CRS Compressed Row Storage

    GPGPU General-Purpose Computing on Graphics Processing Unit

    GPU Graphics Processing Unit

    LDS Linear Direct Solver (output of this thesis)

    LM Levenberg-Marquardt (algorithm)

LRU Least Recently Used

    SBA Sparse Bundle Adjustment

    SIMD Single Instruction Multiple Data

    SIMT Single Instruction Multiple Threads

    SM Streaming Multiprocessor


    Appendix B

    User Manual

    B.1 Requirements

All code was written in ANSI C and CUDA and tested on 64-bit Linux (Xubuntu distribution) with GCC 4.4.6. For a successful compilation, the package libscotchmetis-dev is required, and the install paths for CUDA and for the METIS include file must be set properly in the makefile. After compilation, the executables ldscpuexam and ldsgpuexam will be created in the bin directory.

    B.2 Usage

ldscpuexam A.mtx b.vec
ldsgpuexam A.mtx b.vec

e.g. bin/ldscpuexam data/jtj_mueI.mtx data/g.mtx or bin/ldsgpuexam data/jtj_mueI.mtx data/g.mtx for the tested matrix.

A.mtx is a symmetric positive definite matrix of size n × n stored in the Matrix Market format; b.vec is the right-hand side n × 1 vector of the equation system, also stored in the Matrix Market format. Some timing information will be printed on the stdout and the solution will be stored in a file named x_result.vec. For testing the correctness of the solution, the Octave function vec_ck can be called from the command line:

octave -q --eval='vec_ck( "x_result.vec", "x_octave.vec" );'


    Appendix C

Contents of the Attached CD

.
+-- lds
|   +-- bin
|   +-- data
|   |   +-- g.mtx
|   |   +-- jtj_mueI.mtx
|   |   +-- test_thesis.mtx
|   +-- makefile
|   +-- obj
|   +-- octave
|   |   +-- matrix_load.m
|   |   +-- octave_solver.m
|   |   +-- spy_print.m
|   +-- README.txt
|   +-- src
|   |   +-- colamd.c
|   |   +-- colamd_global.c
|   |   +-- colamd.h
|   |   +-- crs.c
|   |   +-- crs.h
|   |   +-- etree.c
|   |   +-- etree.h
|   |   +-- ldscpuexam.c
|   |   +-- ldsgpuexam.c
|   |   +-- mxbf.cu
|   |   +-- mxbf.h
|   |   +-- mxbf_chol.cu
|   |   +-- ord.c
|   |   +-- ord.h
|   |   +-- UFconfig.c
|   |   +-- UFconfig.h
|   |   +-- uni.c
|   |   +-- uni.h
|   |   +-- vec.c
|   |   +-- vec.h
|   +-- vec_ck.m