Ivancik Thesis 2012 Online


    Czech Technical University in Prague

Faculty of Electrical Engineering
Department of Cybernetics

Master's Thesis

    The Linear Direct Sparse Solver on GPU for

    Bundle Adjustment Method

Bc. Ondřej Ivančík

Supervisor: Ing. Ivan Šimeček, Ph.D.

    Study Programme: Open Informatics

    Field of Study: Computer Vision and Image Processing

    May 11, 2012


Acknowledgements

I would like to thank my supervisor Ivan Šimeček, who enabled me to deal with a very interesting topic, and prof. Olaf Hellwich and Cornelius Wefelscheid, who allowed me to work on my thesis within an individual project at TU Berlin.


    Declaration

I hereby declare that I have completed this thesis independently and that I have listed all the literature and publications used.

I have no objection to usage of this work in compliance with §60 of Act No. 121/2000 Coll. (Zákon č. 121/2000 Sb., the Copyright Act), and with the rights connected with the copyright act, including the changes in the act.

    Prague, May 11, 2012


    Abstract

The thesis deals with solving sparse linear positive definite systems. It implements the Cholesky decomposition on the CPU, utilizing the CRS format for sparse matrices, a fast AMD ordering, and a symbolic factorization.

It analyses possibilities of parallelizing the Cholesky decomposition for sparse diagonal-based linear systems and for the bundle adjustment problem, where matrices of a specific structure arise. A Cholesky decomposition exploiting the Schur complement is implemented on both the CPU and the GPU.

    Abstrakt

Práce se zabývá řešením řídkých lineárních pozitivně definitních soustav. Implementuje Choleského dekompozici na CPU s využitím CRS formátu řídkých matic, rychlé AMD permutace a symbolické faktorizace.

Analyzuje možnosti paralelizace Choleského dekompozice pro řídké lineární systémy diagonálního tvaru a pro problém vyrovnání svazku, kde vznikají řídké matice specifické struktury. Navrhuje a implementuje výpočet Choleského dekompozice na GPU a CPU pomocí Schurova komplementu.


    Contents

1 Introduction
  1.1 Motivation

2 Solving Linear Systems
  2.1 System of Linear Equations
  2.2 Direct Methods for Solving Linear Systems
    2.2.1 Cramer's Rule
    2.2.2 Forward and Backward Substitution
    2.2.3 Gaussian Elimination
    2.2.4 Gauss-Jordan Elimination
    2.2.5 LU Decomposition
    2.2.6 Cholesky Decomposition
  2.3 Iterative Methods for Solving Linear Systems

3 Sparse Matrices
  3.1 Ordering Methods
    3.1.1 Arrowhead Matrix Example
    3.1.2 Graph Representation
    3.1.3 Bottom-up Ordering Methods
    3.1.4 Top-down Ordering Methods
  3.2 Symbolical Factorization

4 Bundle Adjustment
  4.1 Unconstrained Optimization
    4.1.1 Search Methods
    4.1.2 Levenberg-Marquardt Algorithm

5 Overview of NVIDIA CUDA
  5.1 The CUDA Execution Model
  5.2 GPU Memory

6 Analysis of the Problem
  6.1 Structure of Linear Systems in BA
  6.2 Block Cholesky Decomposition for BA

7 Implementation
  7.1 Used Framework
  7.2 Compressed Row Storage Format
  7.3 Cholesky decomposition on GPU
  7.4 Ordering for CPU solver
  7.5 Block Matrix Format for GPU
  7.6 Block Cholesky decomposition on GPU
  7.7 Ordering for GPU solver

8 Testing
  8.1 Octave solvers
  8.2 CPU solver
  8.3 GPU solver
  8.4 CUSP solvers

9 Conclusion

A List of Abbreviations
B User Manual
  B.1 Requirements
  B.2 Usage
C Contents of the Attached CD


    List of Figures

3.1 The dependence of the reordering of a sparse matrix on the fill-in count
3.2 Ordering example
4.1 Reprojection error
5.1 Block diagram of a GF100 GPU
5.2 Streaming multiprocessor of a GF100 (Fermi) GPU
5.3 Bandwidth of various GPU memory
6.1 An example of a modestly sized Hessian in BA
7.1 Sample of a symmetric positive definite sparse matrix 6 × 6 with 22 nonzero elements
7.2 Performing k-way ordering on diagonal-based matrix Wathen 10 × 10
7.3 Performing k-way ordering on diagonal-based matrix Poisson 30
8.1 Test of Octave solvers
8.2 Test of iterative CUSP solvers. Max. error is the maximal difference with Octave's reference solution


    Chapter 1

    Introduction

Finding a solution of a system of linear algebraic equations (2.1) is the most basic task in linear algebra and lies at the heart of many engineering problems. It has been studied for many years, not only because of its applications in many branches of scientific computing, but also because of its high computational complexity and the wide variety of methods and approaches that help to solve linear systems of different types faster and more accurately.

Finding a solution of a system of nonlinear algebraic equations can be achieved using iterative solvers, whose keystone is solving a linear system in each iteration step in order to approach a sufficiently accurate solution. Therefore, a linear solver forms a crucial part of a nonlinear solver and, at the same time, its bottleneck.

A widely used optimization method in 3D reconstruction algorithms is bundle adjustment. As a nonlinear iterative optimization method, it needs to solve a sparse, often very large linear system of a specific structure many times. Studying a suitable linear solver for bundle adjustment is the main part of my thesis.

    1.1 Motivation

One particular and promising approach for speeding up the process of solving systems of linear equations is parallel computation. In the case of dense direct solvers, the parallelization is more straightforward and yields better performance results than for sparse direct solvers. Iterative methods, mostly used for solving large sparse linear systems, are efficiently parallelizable thanks to the fact that they use only sparse matrix-vector multiplications and vector additions.


In the last decade, there has been growing interest in general-purpose computation on graphics processing units (GPGPU). Several libraries were developed which implement basic linear algebra subroutines or even linear solvers for dense matrices (NVIDIA cuBLAS, MAGMA, CULA Dense) and sparse matrices (NVIDIA cuSPARSE, NVIDIA CUSP, CULA Sparse). At present, no implementation of a direct linear solver for general sparse matrices on a GPU exists. The main cause is the problematic fine-grain parallelization and the thread divergence on a GPU.

Sparse matrices consisting of many small independent full blocks on the diagonal, with some dependent parts on the borders, are formed during the computation of bundle adjustment. It seems possible to eliminate these blocks effectively in a parallel manner even on a GPU. The question is which type of solver is more suitable: direct or iterative? My thesis aims to give the answer.


    Chapter 2

Solving Linear Systems

(For this chapter, material was cited from [20] and [21].)

    2.1 System of Linear Equations

Definition 1. A system of m linear equations in n unknowns consists of a set of algebraic relations of the form

$$\sum_{j=1}^{n} a_{ij} x_j = b_i, \qquad i = 1, \dots, m, \tag{2.1}$$

where $x_j$ are the unknowns, $a_{ij}$ are the coefficients of the system and $b_i$ are the components of the right-hand side. System (2.1) can be more conveniently written in matrix form as

$$Ax = b, \tag{2.2}$$

where $A = (a_{ij}) \in \mathbb{C}^{m \times n}$ denotes the coefficient matrix, $b = (b_i) \in \mathbb{C}^{m}$ the right-hand side vector and $x = (x_i) \in \mathbb{C}^{n}$ the unknown vector, respectively. A solution of (2.2) is any n-tuple of values $x_i$ which satisfies (2.1).

Remark 1. The existence and uniqueness of the solution of (2.2) are ensured if one of the following (equivalent) hypotheses holds:

1. A is invertible,
2. rank(A) = n,
3. the homogeneous system Ax = 0 admits only the null solution.

In the next chapters I will deal with numerical methods for finding the solution of real-valued square systems of order n, that is, systems of the form (2.2) with $A \in \mathbb{R}^{n \times n}$ and $x, b \in \mathbb{R}^{n}$. Such linear systems arise frequently in every branch of science, including bundle adjustment. These numerical methods can generally be divided into two classes. In the absence of roundoff errors, direct methods yield the exact solution in a finite number of steps. Iterative methods require (theoretically) an infinite number of steps to find the exact solution.

2.2 Direct Methods for Solving Systems of Linear Equations

2.2.1 Cramer's Rule

The solution of system (2.2) is formally provided by Cramer's rule

$$x_j = \frac{\det(A_j)}{\det(A)}, \qquad j = 1, \dots, n, \tag{2.3}$$

where $A_j$ is the matrix obtained by substituting the j-th column of A with the right-hand side b. If the determinants are evaluated by the recursive Laplace rule, the method based on Cramer's rule turns out to be unacceptable even for small dimensions of A because of its computational cost of (n + 1)! flops. However, Habgood and Arel [11] have recently shown that Cramer's rule can be implemented in O(n^3) time, which is comparable to more common methods of solving systems of linear equations.

    2.2.2 Forward and Backward Substitution

Definition 2. A square matrix with zero entries above the main diagonal ($a_{ij} = 0$ for $i < j$) is called lower triangular. A square matrix with zero entries below the main diagonal ($a_{ij} = 0$ for $i > j$) is called upper triangular. A lower (upper) triangular matrix is strictly lower (upper) triangular when its entries on the main diagonal are zeros, too.

Example 1. Lower (upper) triangular systems can be easily solved using forward (backward) substitution. For example, the nonsingular 3 × 3 upper triangular system

$$\begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$$

can be solved in sequence as follows:

$$x_3 = b_3 / u_{33},$$
$$x_2 = (b_2 - u_{23} x_3) / u_{22},$$
$$x_1 = (b_1 - u_{12} x_2 - u_{13} x_3) / u_{11}.$$


For a nonsingular upper triangular system of order n (n ≥ 2), the solution can be expressed generally in the form

$$x_n = \frac{b_n}{u_{nn}}, \qquad x_i = \frac{1}{u_{ii}} \left( b_i - \sum_{j=i+1}^{n} u_{ij} x_j \right), \quad i = n-1, \dots, 1. \tag{2.4}$$

Analogously, the solution of a nonsingular lower triangular system of order n (n ≥ 2) has the form

$$x_1 = \frac{b_1}{l_{11}}, \qquad x_i = \frac{1}{l_{ii}} \left( b_i - \sum_{j=1}^{i-1} l_{ij} x_j \right), \quad i = 2, \dots, n. \tag{2.5}$$

The number of multiplications and divisions for forward/backward substitution equals n(n + 1)/2, while the number of additions and subtractions is n(n − 1)/2. The total operation count for (2.4) and (2.5) is thus about n².
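As an illustration, a minimal Octave sketch of backward substitution (2.4); this is my own example, not code from the thesis, and the function name backsubst is hypothetical:

    function x = backsubst(U, b)
      % Solve U*x = b for a nonsingular upper triangular U, following (2.4).
      n = length(b);
      x = zeros(n, 1);
      x(n) = b(n) / U(n,n);
      for i = n-1:-1:1
        x(i) = (b(i) - U(i,i+1:n) * x(i+1:n)) / U(i,i);
      end
    end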

    2.2.3 Gaussian Elimination

Let A be a square nonsingular matrix. A linear system Ax = b can be transformed, using three elementary row operations, into an equivalent (lower or upper) triangular system Tx = b̃ that has the same solution. The solution of the system is invariant to

    1. the multiplication of a row by a nonzero scalar,

    2. the addition of one row to another,

    3. the swapping of two rows.

The basic idea is to multiply the i-th equation by a nonzero constant and subtract the first equation from it so as to zero out the first unknown in the i-th equation. This is done for all equations from 2 to n. Then, the second equation is taken as the reference and the second unknown is zeroed in equations 3 to n, and so on. The procedure ends when the system has the form Tx = b̃, where b̃ is the right-hand side transformed by the same row operations. Finally, the solution is obtained by forward substitution (if T is a lower triangular matrix) or backward substitution (if T is an upper triangular matrix).

To complete the Gaussian elimination, (2/3)(n − 1)n(n + 1) + n(n − 1) flops are required. To solve the linear system, about (2/3)n³ + 2n² flops are needed (with n² flops to backsolve the triangular system). Neglecting the lower-order terms, the Gaussian elimination process has a cost of (2/3)n³ flops.


    2.2.4 Gauss-Jordan Elimination

Gauss-Jordan elimination is slightly different from Gaussian elimination. The transformation of the system using the three elementary row operations is repeated until each equation contains only one of the unknowns, thus giving an immediate solution. The principal deficiencies of this method are that

1. it requires all the right-hand sides to be stored and manipulated at the same time, and

2. it is three times slower than the alternative solvers when the inverse of A is not desired.

    2.2.5 LU Decomposition

Suppose that it is possible to write the matrix A as a product of two matrices, A = LU, where L is lower triangular and U is upper triangular. This decomposition can be used to solve the linear system

    Ax = (LU)x = L(Ux) = b (2.6)

    by first solving (by forward substitution) for the vector y such that

    Ly = b (2.7)

    and then solving (by backward substitution) for the vector x such that

    Ux = y. (2.8)

Theorem 1. Let $A \in \mathbb{R}^{n \times n}$. The LU decomposition of A with $l_{ii} = 1$ for $i = 1, \dots, n$ exists and is unique iff the principal submatrices $A_i$ of A of order $i = 1, \dots, n-1$ are nonsingular.

The LU decomposition is usually performed in place, to avoid copying and wasting memory when storing the triangular matrices L and U separately, as shown in Algorithm 1. At the end (here only for presentational purposes) the result is stored in the L and U matrices.

    2.2.6 Cholesky Decomposition

Theorem 2. Let $A \in \mathbb{R}^{n \times n}$ be a symmetric and positive definite matrix. Then, there exists a unique lower triangular matrix L with positive diagonal entries such that

$$A = L L^{\top}. \tag{2.9}$$


    Algorithm 1 LU Decomposition

Require: A square matrix A.
Ensure: A lower triangular matrix L with ones on the main diagonal and an upper triangular matrix U such that LU = A.

    function [L, U] = lu2(A)
      [n,n] = size(A);
      for k = 1:n
        A(k+1:n,k) = A(k+1:n,k) / A(k,k);
        A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k) * A(k,k+1:n);
      end
      L = tril(A,-1) + eye(n); % ones on the diagonal
      U = triu(A);
    end
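As a usage sketch (my own illustration, assuming the lu2 function of Algorithm 1 is available), the system (2.6) is solved by one forward and one backward substitution:

    A = [4 3; 6 3];
    b = [10; 12];
    [L, U] = lu2(A);   % Algorithm 1
    y = L \ b;         % forward substitution, equation (2.7)
    x = U \ y;         % backward substitution, equation (2.8)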

The computational cost of the Cholesky decomposition, with respect to the LU decomposition, is halved to about n³/3 flops, because the input matrix A is symmetric. An implementation example of the Cholesky decomposition is given in Algorithm 2.

    Algorithm 2 Cholesky Decomposition

Require: A square symmetric positive definite matrix A.
Ensure: A lower triangular matrix L such that LLᵀ = A.

    function [L] = chol2(A)
      [n,n] = size(A);
      for k = 1:n
        A(k,k) = sqrt(A(k,k));
        A(k,k+1:n) = A(k,k+1:n) / A(k,k);
        for i = k+1:n
          A(i,i:n) = A(i,i:n) - A(k,i:n) * A(k,i);
        end
      end
      L = triu(A)'; % the factor is accumulated in the upper triangle; transpose it to obtain L
    end

2.3 Iterative Methods for Solving Systems of Linear Equations

Iterative methods formally yield the solution x of a linear system after an infinite number of steps. At each step they require the computation of the residual of the system. For full matrices, their computational cost is of the order of n² operations per iteration, to be compared with an overall cost of the order of (2/3)n³ operations needed by direct methods. Iterative methods can therefore become competitive with direct methods when the required number of iterations to converge is either independent of n or scales sublinearly with respect to n.

The basic idea of iterative methods is to construct a sequence of vectors $x^{(k)}$ that enjoy the property of convergence

$$x = \lim_{k \to \infty} x^{(k)},$$

where x is the solution to (2.2). In practice, the iterative process is stopped at the minimum value of n such that

$$\| x^{(n)} - x \| < \varepsilon.$$
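As a minimal illustration of such an iterative scheme (my own sketch, not part of the thesis), a Jacobi iteration for (2.2):

    function x = jacobi(A, b, tol, maxit)
      % Jacobi iteration: x_{k+1} = D^{-1} * (b - (A - D) * x_k).
      n = length(b);
      x = zeros(n, 1);
      D = diag(diag(A));
      R = A - D;
      for k = 1:maxit
        xnew = D \ (b - R * x);
        if norm(xnew - x) < tol
          x = xnew;
          return;
        end
        x = xnew;
      end
    end

Each iteration costs essentially one sparse matrix-vector product, which is what makes such methods attractive for large sparse systems; the scheme converges, for example, for strictly diagonally dominant matrices.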


    Chapter 3

    Sparse Matrices

Many engineering problems have to confront large and sparse matrices. A sparse matrix is a matrix that allows special techniques to take advantage of the large number of zero elements. This definition helps to define how many zeros a matrix needs in order to be sparse. The answer is that it depends on what the structure of the matrix is and what it is being used for. For example, a randomly generated sparse n × n matrix with cn entries scattered randomly throughout the matrix is not sparse in the sense of Wilkinson (for direct methods), since it takes O(n³) time to factorize (with high probability and for large enough c [9]) [3].

Example 2. Using one of the sparse formats to store real sparse matrices can result in significant computational and storage savings. Consider, for instance, a tridiagonal square matrix with 1,000,000 rows. Storing its 3 million nonzero elements in double precision, together with other data such as row and column indices, consumes approximately 40 MB. But storing the same matrix as a full matrix would consume more than 7 TB. Similarly big differences can be expected in execution times.

    3.1 Ordering Methods

An unfavourable fact lies in the process of elimination with sparse matrices. Some zero values of the input matrix become non-zero during the elimination (fill-ins), and their positions must be precomputed in advance. Reordering techniques try to minimize the number of fill-ins by finding a permutation of the rows and columns of the input matrix. But finding such an optimal permutation is an NP-complete problem [26] and could be more time consuming than solving the original linear system; therefore, a heuristic approach that often gives near-optimal results is applied.


    3.1.1 Arrowhead Matrix Example

Example 3. The operation counts required for the solution of two linear systems Ax = b will be examined. The input matrices are shown in figure 3.1. Even though both matrices have the same number of non-zero elements, there is a significant reduction of computation by a simple permutation of rows and columns.

[Figure 3.1: The dependence of the reordering of a sparse matrix on the fill-in count. (a) Left-up arrowhead matrix; (b) left-up arrowhead matrix after LU; (c) right-down arrowhead matrix; (d) right-down arrowhead matrix after LU. Filled markers represent nonzero elements of the input matrix, open markers represent fill-ins, and empty space represents zero elements.]

For the left-up arrowhead matrix 3.1a, the number of multiplications and divisions required by the forward elimination is 40, and by the back substitution 25. The total number of operations is 65, and the input sparse matrix becomes full. For the right-down matrix 3.1c, the number of multiplications and divisions required by the forward elimination is 8, and by the back substitution 13. The total number of operations is 21, and the input sparse matrix remains sparse.
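A small Octave sketch illustrating this effect (my own example; it does not reproduce the exact matrices of figure 3.1): an arrowhead matrix pointing up-left fills in completely during the factorization, while the same matrix permuted to point down-right produces no fill-in at all.

    n = 8;
    A = n * speye(n);             % up-left arrowhead: dense first row and column
    A(1, :) = 1;  A(:, 1) = 1;  A(1, 1) = n;
    p = n:-1:1;                   % reversing the ordering moves the arrow down-right
    B = A(p, p);
    nnz(chol(A))                  % factor of the up-left arrowhead is full
    nnz(chol(B))                  % factor of the down-right arrowhead has no fill-in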

There are many recent works on ordering schemes, because specific problems give rise to specific types of sparse matrices (band-diagonal, block triangular, block tridiagonal, ...) [20, p. 77]. Below, the most used methods are described. They can be divided into two categories, according to how the elimination tree is built. Most state-of-the-art ordering schemes for sparse matrices are a hybrid of a bottom-up method, such as minimum degree, and a top-down scheme, such as George's nested dissection.

    3.1.2 Graph Representation of Sparse Matrices

To explain the ordering methods, it is convenient to introduce a graph representation of sparse matrices. They are then represented as undirected graphs (the sparse matrix has the structure of an adjacency matrix of this graph). All schemes are described for the undirected graph $G = (V, E)$, $E \subseteq V \times V$, associated with the symmetric matrix S. Let v be a vertex of G. The set of vertices that are adjacent to v is denoted by $\mathrm{adj}_G(v)$.

    3.1.3 Bottom-up Ordering Methods

Bottom-up methods build the elimination tree from the leaves up to the root. In each iteration k a greedy heuristic is applied to $G_{k-1}$ to select a vertex for elimination. This section briefly describes two of the most popular bottom-up algorithms, the minimum degree and the minimum deficiency ordering heuristics.

Minimum Degree Ordering: As mentioned above, at each iteration k the minimum degree algorithm eliminates a vertex v that minimizes the number of adjacent vertices, $\deg_{G_{k-1}}(v) = |\mathrm{adj}_{G_{k-1}}(v)|$. The algorithm is a symmetric variant of the Markowitz scheme [15] and was first applied to sparse symmetric factorization by Tinney and Walker [22]. Over the years many enhancements have been proposed to the basic algorithm that have greatly improved its efficiency.

Minimum Deficiency Fill: A less popular bottom-up scheme is the minimum deficiency or minimum local fill heuristic. The exact amount of fill is used to select a vertex for elimination. The minimum deficiency algorithm has received much less attention because of its prohibitive runtime.

    3.1.4 Top-down Ordering Methods

The most popular top-down scheme is George's nested dissection algorithm [7, 8]. The basic idea of this approach is to find a subset of vertices S in G whose removal partitions G into two subgraphs G(B) and G(W) with V = S ∪ B ∪ W and |B|, |W| ≤ α|V| for some 0 < α < 1. Such a partition of G is denoted by (S, B, W). The set S is called a vertex separator of G. If we order the vertices in S after the (black) vertices in B and the (white) vertices in W, no fill-edge can occur between B and W. Typically, the columns corresponding to S constitute a full off-diagonal block in the Cholesky factor. Therefore, S is supposed to be small. Once S has been found, the algorithm is recursively applied to each connected component of G(B) and G(W) until a component consists of a single vertex or a clique. In this way the elimination tree is built from the root down to the leaves.

Graph partitioning heuristics are usually divided into construction and improvement heuristics. A construction heuristic takes the graph as input and computes an initial separator from scratch. An improvement heuristic tries to minimize the size of a separator through a sequence of elementary steps.


As some ordering methods are implemented in MATLAB as standard functions (colperm, symrcm, colamd, symamd, amd, dmperm), I have tested some of them (see figure 3.2).
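A rough Octave analogue of that experiment (my own sketch; the gallery matrix is an arbitrary SPD test case, not necessarily the matrix used for figure 3.2), comparing the fill-in of the Cholesky factor with and without AMD ordering:

    A = gallery('poisson', 30);              % sparse SPD test matrix
    p = amd(A);                              % approximate minimum degree permutation
    fill_natural = nnz(chol(A))      - nnz(triu(A))
    fill_amd     = nnz(chol(A(p,p))) - nnz(triu(A))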

    3.2 Symbolical Factorization

Symbolical factorization is a step executed before the numerical factorization. It precomputes the positions of the fill-ins (see also 3.1) that appear during the factorization process when one row is added to another. It can be seen on the Cholesky or LU factors that they are often much denser than the original matrices (see figure 3.2). The CRS format stores only nonzero elements, and therefore the space needed for fill-ins must be allocated before the numerical factorization. The naive solution is to run a slightly changed numerical factorization and store the new nonzero entries. Since symbolical factorization works only with indices to determine the structure of the Cholesky or LU factors, it can be computed much faster than the full numerical factorization. When implementing my symbolical factorization I have used a great information source [13].
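A naive pattern-only factorization of this kind can be sketched in a few lines of Octave (my own illustration; the thesis implementation follows [13] and works directly on the CRS arrays):

    function P = chol_pattern(A)
      % Simulate the elimination on the nonzero pattern only and return the
      % pattern of the factor, fill-ins included.
      n = size(A, 1);
      P = spones(spones(A) + speye(n));
      for k = 1:n-1
        r = k + find(P(k+1:n, k));                      % rows updated by column k
        P(r, r) = spones(P(r, r) + P(r, k) * P(k, r));  % union with the outer-product pattern
      end
      P = tril(P);                                      % pattern of the lower factor
    end

For a matrix with a nonzero diagonal, the number of fill-ins is then roughly nnz(chol_pattern(A)) - nnz(tril(A)).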


[Figure 3.2: Applying different ordering methods and displaying the LU factors. Nonzeros are in black, fill-ins in gray. Fill-in counts: no ordering 13,309; colperm 30,627; symrcm 13,040; colamd 9,569; symamd 6,681; amd 6,583.]


    Chapter 4

    Bundle Adjustment

Three-dimensional (3D) reconstruction is a problem that appears often in many computer vision tasks. It can be defined as the problem of using 2D measurements arising from a set of images depicting the same scene from different viewpoints, aiming to derive information related to the scene geometry as well as the relative motion and the optical characteristics of the camera(s) employed to acquire these images. Bundle adjustment (BA) is almost invariably used as the last step of every feature-based 3D reconstruction algorithm [14, p. 12].

Bundle adjustment is the problem of refining a visual reconstruction to produce jointly optimal 3D structure and viewing parameter (camera pose and/or calibration) estimates. Optimal means that the parameter estimates are found by minimizing some cost function that quantifies the model fitting error, and jointly that the solution is simultaneously optimal with respect to both structure and camera variations. The name refers to the bundles of light rays leaving each 3D feature and converging on each camera centre, which are adjusted optimally with respect to both feature and camera positions. Equivalently, unlike independent model methods, which merge partial reconstructions without updating their internal structure, all of the structure and camera parameters are adjusted together in one bundle [23].

BA boils down to minimizing the reprojection error (4.1) between the observed and predicted image points, which is expressed as the sum of squares of a large number of nonlinear, real-valued functions. Thus, the minimization is achieved using nonlinear least-squares algorithms [4], of which Levenberg-Marquardt has proven to be one of the most successful due to its ease of implementation and its use of an effective damping strategy that lends it the ability to converge quickly from a wide range of initial guesses [12].


    Figure 4.1: Reprojection error [17]

4.1 Unconstrained Optimization

(This section was cited from [21].)

The aim of unconstrained optimization is to find $x^{*}$ such that

$$x^{*} = \arg\min_{x \in \mathbb{R}^{n}} f(x). \tag{4.1}$$

The point $x^{*}$ is called a global minimizer of f if $f(x^{*}) \le f(x)\ \forall x \in \mathbb{R}^{n}$, while $x^{*}$ is called a local minimizer of f if a neighborhood $\mathcal{N}$ of $x^{*}$ exists such that $f(x^{*}) \le f(x)\ \forall x \in \mathcal{N}$. The vector of first partial derivatives of the function f (which must be continuously differentiable) with respect to the vector x is denoted by

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_n}(x) \right)^{\top}$$

and called the gradient of f at a point x. If d is a non-null vector in $\mathbb{R}^{n}$, then the directional derivative of f with respect to d is

$$\frac{\partial f}{\partial d}(x) = \lim_{\alpha \to 0} \frac{f(x + \alpha d) - f(x)}{\alpha}$$

and satisfies $\partial f(x)/\partial d = [\nabla f(x)]^{\top} d$. Moreover, denoting by $(x, x + d)$ the segment in $\mathbb{R}^{n}$ joining the points x and x + d, Taylor's expansion ensures that there exists $\xi \in (x, x + d)$ such that

$$f(x + d) - f(x) = \nabla f(\xi)^{\top} d.$$


If f is twice continuously differentiable, we denote by H(x) (or $\nabla^{2} f(x)$) the Hessian matrix of f evaluated at a point x, whose entries are

$$h_{ij}(x) = \frac{\partial^{2} f(x)}{\partial x_i \partial x_j}, \qquad i, j = 1, \dots, n.$$

In such a case it can be shown that, if $d \neq 0$, the second-order directional derivative exists and

$$\frac{\partial^{2} f}{\partial d^{2}}(x) = d^{\top} H(x)\, d.$$

For a suitable $\xi \in (x, x + d)$, also

$$f(x + d) - f(x) = \nabla f(x)^{\top} d + \frac{1}{2}\, d^{\top} H(\xi)\, d.$$

Existence and uniqueness of a solution of (4.1) are not guaranteed in $\mathbb{R}^{n}$. Nevertheless, it can be proved that the gradient at a local minimizer $x^{*}$ equals the null vector. This condition is necessary for optimality to hold. However, the condition also becomes sufficient if f is a convex function on $\mathbb{R}^{n}$, i.e., such that $\forall x, y \in \mathbb{R}^{n}$ and for any $\alpha \in [0, 1]$

$$f[\alpha x + (1 - \alpha) y] \le \alpha f(x) + (1 - \alpha) f(y).$$

    4.1.1 Search Methods

Analytical methods can be used only for simple problems (the brachistochrone problem, univariate minimization).

Numerical methods must be used for most engineering optimization problems (too large and complex to solve analytically). Numerical methods can be divided into two classes:

Gradient-based methods are efficient for many variables and for a smooth objective function. The drawback is only local convergence.

Derivative-free methods are suitable for problems where gradients are not available, the objective function is not differentiable, or the global minimizer is desired.

Gradient-based descent methods compute a direction $d^{(k)}$ and a positive parameter (step length) $\alpha^{(k)}$ at each iteration k with the help of the gradient and the Hessian. Algorithm 3 shows the skeleton of this method.

The way of computing the direction $d^{(k)}$ and the step length $\alpha^{(k)}$ defines a specific descent method.


Algorithm 3 Descent method

Require: $\nabla f(x)$, H(x) and a starting point $x^{(0)}$.
Ensure: A local minimizer $x^{*}$.

1: k ← 0
2: while (not converged) do
3:   compute the direction $d^{(k)}$ and the step length $\alpha^{(k)}$
4:   $x^{(k+1)}$ ← $x^{(k)} + \alpha^{(k)} d^{(k)}$
5:   k ← k + 1
6: end while
7: return $x^{(k)}$

Newton's method computes

$$d^{(k)} = -H^{-1}(x^{(k)})\, \nabla f(x^{(k)}),$$

where H is positive definite within a sufficiently large neighborhood of the point $x^{*}$;

the inexact Newton's method computes

$$d^{(k)} = -B^{-1}(x^{(k)})\, \nabla f(x^{(k)}),$$

where $B(x^{(k)})$ is a suitable approximation of $H(x^{(k)})$;

the gradient (steepest descent) method computes

$$d^{(k)} = -\nabla f(x^{(k)});$$

the conjugate gradient method computes

$$d^{(k)} = -\nabla f(x^{(k)}) + \beta^{(k)} d^{(k-1)},$$

where $\beta^{(k)}$ is a scalar to be suitably selected in such a way that the directions $d^{(k)}$ turn out to be mutually orthogonal with respect to a suitable scalar product.
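A minimal Octave sketch of Algorithm 3 with the steepest-descent choice of direction and a fixed step length (my own illustration, not code from the thesis; the quadratic objective is an arbitrary example):

    A = [3 1; 1 2];  b = [1; 1];          % minimize f(x) = x'*A*x/2 - b'*x
    grad = @(x) A*x - b;                  % gradient of f
    x = [0; 0];                           % starting point x(0)
    alpha = 0.2;                          % fixed step length
    for k = 1:500
      d = -grad(x);                       % steepest descent direction
      if norm(d) < 1e-8, break; end       % stop when the gradient (nearly) vanishes
      x = x + alpha * d;                  % step 4 of Algorithm 3
    end
    x                                     % approaches the minimizer A \ b = [0.2; 0.4]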

4.1.2 Levenberg-Marquardt Algorithm

The Levenberg-Marquardt (LM) algorithm, also known as the damped least-squares method, provides a numerical solution to the problem of minimizing a function, generally nonlinear, over a space of parameters of the function. It can be thought of as a combination of the Gauss-Newton and the steepest descent method. When the current solution is far from a local minimum, the algorithm behaves like a steepest descent method: slow, but guaranteed to converge. When the current solution is close to a local minimum, it becomes a Gauss-Newton method and exhibits fast convergence. For these reasons, the LM algorithm is mostly used in bundle adjustment.

Let f be an assumed functional relation which maps a parameter vector $p \in \mathbb{R}^{m}$ to an estimated measurement vector $\hat{x} = f(p)$, $\hat{x} \in \mathbb{R}^{n}$. An initial parameter estimate $p_0$ and a measured vector x are provided, and it is desired to find the vector $p^{+}$ that best satisfies the functional relation f locally, that is, minimizes the squared distance $\varepsilon^{\top}\varepsilon$ with $\varepsilon = x - \hat{x}$, for all p within a sphere having a certain, small radius. The basis of the LM algorithm is an affine approximation to f in the neighborhood of p. For a small $\|\delta_p\|$, f is approximated by (see [5, p. 75])

$$f(p + \delta_p) \approx f(p) + J \delta_p,$$

where J is the Jacobian of f.

At each iteration, it is required to find the step $\delta_p$ that minimizes the quantity $\|x - f(p + \delta_p)\| \approx \|x - f(p) - J\delta_p\| = \|\varepsilon - J\delta_p\|$. The minimum is attained when $J\delta_p - \varepsilon$ is orthogonal to the column space of J. This leads to $J^{\top}(J\delta_p - \varepsilon) = 0$, which yields $\delta_p$ as the solution of the so-called normal equations [10]:

$$J^{\top} J \delta_p = J^{\top} \varepsilon. \tag{4.2}$$

The matrix $J^{\top}J$ in the above equation is the first-order approximation to the Hessian of $\frac{1}{2}\varepsilon^{\top}\varepsilon$ [16], and $\delta_p$ is the Gauss-Newton step. $J^{\top}\varepsilon$ corresponds to the steepest descent direction, since the gradient of $\frac{1}{2}\varepsilon^{\top}\varepsilon$ is $-J^{\top}\varepsilon$. The LM algorithm actually solves a slight variation of Equation (4.2), known as the augmented normal equations

$$N \delta_p = J^{\top} \varepsilon, \qquad \text{with } N \equiv J^{\top} J + \mu I, \ \mu > 0. \tag{4.3}$$

The strategy of altering the diagonal elements of $J^{\top}J$ is called damping and $\mu$ is referred to as the damping term. It is decreased when the updated parameter vector $p + \delta_p$, with $\delta_p$ computed from Equation (4.3), leads to a reduction in the error $\varepsilon$; otherwise it is increased, the augmented normal equations are solved again, and this process iterates until a value of $\delta_p$ that decreases the error is found.
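A minimal Octave sketch of one such damping loop for a toy one-dimensional curve fit (my own illustration; the model y = p1*exp(p2*t) and all names are hypothetical, not part of the thesis):

    t = (0:0.5:3)';                         % sample points
    y = 2.5 * exp(-1.3 * t);                % noise-free measurements x
    f = @(p) p(1) * exp(p(2) * t);          % estimated measurement vector f(p)
    J = @(p) [exp(p(2)*t), p(1)*t.*exp(p(2)*t)];   % Jacobian of f
    p  = [1; -1];                           % initial parameter estimate
    mu = 1e-3;                              % damping term
    for k = 1:50
      e  = y - f(p);                        % error epsilon = x - f(p)
      dp = (J(p)'*J(p) + mu*eye(2)) \ (J(p)'*e);   % augmented normal equations (4.3)
      if norm(y - f(p + dp)) < norm(e)      % the step reduces the error:
        p = p + dp;  mu = mu / 10;          %   accept it and decrease the damping
      else
        mu = mu * 10;                       %   otherwise increase the damping and retry
      end
    end
    p                                       % converges towards [2.5; -1.3]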


    Chapter 5

Overview of NVIDIA CUDA

(For this chapter I have quoted from [6].)

By introducing CUDA (Compute Unified Device Architecture), NVIDIA has given programmers the initial opportunity to capitalize on inexpensive, generally available, massively parallel computing hardware. Teraflop computing is now within the economic reach of most people around the world. The impact of GPGPU (General-Purpose Graphics Processing Units) technology spans all aspects of computation, from cell phones to the largest supercomputers. Programmable GPUs are deployed in areas of scientific computing, cloud computing, computer visualization, simulations, games, and more.

Programming for GPGPU requires a basic knowledge of the GPU architecture, because even small changes in data structures or in the program can make significant differences in performance. Modern GPUs belong in principle to the SIMD class of Flynn's taxonomy. That means that GPUs are capable of performing the same operation on multiple data elements simultaneously. The restriction to one operation at a time reduces the set of problems worth parallelizing on a GPU. On the other hand, well-vectorized problems can achieve an acceleration of two or more orders of magnitude over multi-core processors (the top 100 NVIDIA CUDA application showcase speedups, published May 9, 2011, range from 100x to 2600x with a median of 1350x).

To ensure the best performance of GPGPU computing, the following three rules should be met.

1. Get the data on the GPGPU and keep it there. GPGPUs are separate devices plugged into the PCI Express bus of the host computer, which is very slow compared to the GPGPU memory system (20 to 28 times slower).

2. Give the GPGPU enough work to do. CUDA-enabled GPUs deliver teraflop performance and are fast enough to complete small problems faster than the host processor can start kernels. Each thread should perform as many instructions as possible to hide this latency.

3. Focus on data reuse within the GPGPU to avoid memory bandwidth limitations. All high-performance CUDA applications exploit internal resources on the GPU (registers, shared memory) to bypass global memory bottlenecks.

    5.1 The CUDA Execution Model

The heart of CUDA performance lies in the execution model and the simple partitioning of a computation into fixed-size blocks of threads in the execution configuration. CUDA naturally maps the parallelism within an application to the massive parallelism of the GPGPU hardware. The result is compatibility across older and future generations of GPUs.

GPU hardware parallelism is achieved through replication of a common architectural building block called a streaming multiprocessor (SM). Figure 5.1 illustrates 16 SMs on a GF100 (Fermi) series GPGPU. The software abstraction of a thread block translates into a natural mapping of the kernel onto an arbitrary number of SMs on a GPGPU. Each SM can be scheduled (by the GigaThread global scheduler) to run one or more thread blocks. Therefore, thread blocks are independent and not synchronizable during the kernel execution (atomic operations make an exception: they allow threads of different blocks to communicate, but they should be used judiciously, as they may introduce scalability and performance issues). Thread blocks also act as containers of thread cooperation, as only threads in a thread block can share data. Threads in a thread block can utilize high-speed memory inside the SM, called shared memory, for data sharing.

Figure 5.2b depicts the composition of one of the 16 streaming multiprocessors in a GF100 GPU. SIMD cores require less power and space than non-SIMD cores. As a result, GPGPUs have a high flop-per-watt ratio compared to conventional CPUs [25]. The threads running on a multiprocessor are partitioned into groups in which all threads execute the same instruction simultaneously. On the CUDA architecture, these groups are called warps, each warp has 32 threads, and this execution model is referred to as SIMT (Single Instruction Multiple Threads) [18].

GPGPUs are not true SIMD machines (but SIMT), since the SIMD units are only the streaming multiprocessors, which may be running one or more different instructions. Conditionals (if statements) can decrease performance inside an SM, because each branch of each conditional must be evaluated. This can cause a slowdown of 2^n for n nested conditionals.


    Figure 5.1: Block diagram of a GF100 (Fermi) GPU [2]

    5.2 GPU Memory

For the highest performance of applications developed for a GPU, data inside the SM must be reused. The reason is that the on-board global memory (DRAM in 5.2a) is not fast enough when all SMs want to perform read/write operations. CUDA provides configurable caches for each SM to give the opportunity for data reuse. Awareness of the difference between on-board (GPU) and on-chip (SM) memory is the key to achieving the highest performance that a GPGPU can provide.

The fastest and most scalable is the on-chip SM memory. However, it is limited to a few KB. The on-board global memory is accessible by all the SMs across the GPU and is measured in GB. The significant bandwidth gaps between on-board and on-chip memories can be seen in figure 5.3. Although the bandwidth of shared memory can greatly accelerate applications, it is too slow to achieve peak performance [24].

[Figure 5.2: Streaming multiprocessor of a GF100 (Fermi) GPU. (a) Memory hierarchy [1]; (b) block diagram [1].]

    Register memory   8,000 GB/s
    Shared memory     1,600 GB/s
    Global memory       177 GB/s
    Mapped memory         8 GB/s

Figure 5.3: Bandwidth of various GPU memory [6, p. 111]

Example 4. Computing a simple element-wise vector product

    for( i = 0; i < N; i++ ) c[i] = a[i] * b[i];

on a GPU utilizing only global memory gives limited performance. When 4-byte floating-point values are being used, a 1 Tflop GPU would require 12 TB/s of memory bandwidth. A GPU with 177 GB/s of memory bandwidth can therefore deliver only about 14 Gflop/s (1.4% of the potential 1 Tflop performance).
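The quoted figures follow from a simple bytes-per-flop argument; a back-of-the-envelope check in Octave (my own sketch, not from the thesis):

    bytes_per_flop = 3 * 4;                      % two reads and one write of a 4-byte float per multiply
    bandwidth      = 177e9;                      % global memory bandwidth in B/s
    needed_bw      = 1e12 * bytes_per_flop       % 12 TB/s to sustain 1 Tflop
    achievable     = bandwidth / bytes_per_flop  % roughly 14.75 Gflop/s from 177 GB/s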

When programming for a GPU, it is necessary to reuse data within the SM (to exploit data locality). GPGPUs support two types of data locality: temporal locality (exploited by LRU, Least Recently Used, caching) means that recently accessed data is likely to be used again in the future, and spatial locality means that neighbouring data is cached to be used in the future.

For compute capability 2.0 or higher, the constant and texture memories, traditionally used for effective data broadcasting to all threads, are outperformed by the global memory. This is because compute capability 2.0 devices contain SMs with an L1 cache and a unified L2 cache that speed up access to the global memory.


    Chapter 6

    Analysis of the Problem

As I have mentioned in the Introduction, finding the solution of a linear system is the most compute-demanding part of solving a nonlinear system. At each iteration, a linear system Ax = b must be solved. Bundle adjustment (BA), as a least squares problem, works with sparse linear systems of a special structure (doubly bordered block diagonal). A similar structure can be obtained by applying nested dissection ordering to a diagonal-based matrix A (band-diagonal, block tridiagonal, ...). The implemented GPU solver can be used for BA when the information about the structure of the matrix A is provided by BA, or for a diagonal-based matrix when the information about the structure is provided by the ordering function.

    6.1 Structure of Linear Systems in BA

A system of linear augmented normal equations (4.3) arises in BA and is solved at each iteration of the Levenberg-Marquardt algorithm. Matrix J is the Jacobian and N is the first-order approximation of the Hessian. The structure of N can be exactly determined from the input parameters of the BA problem.

Example 5. [14, p. 9] Consider that we want to optimize the parameters of 3 cameras and 4 3D points visible in all cameras. The measurement vector $X = (x_{11}^{\top}, x_{12}^{\top}, x_{13}^{\top}, x_{21}^{\top}, x_{22}^{\top}, x_{23}^{\top}, x_{31}^{\top}, x_{32}^{\top}, x_{33}^{\top}, x_{41}^{\top}, x_{42}^{\top}, x_{43}^{\top})^{\top}$ is made up of the measured image point coordinates across all cameras. The parameter vector $P = (a_1^{\top}, a_2^{\top}, a_3^{\top}, b_1^{\top}, b_2^{\top}, b_3^{\top}, b_4^{\top})^{\top}$ is defined by all parameters describing the 3 projection matrices and the 4 3D points. Let $A_{ij}$ and $B_{ij}$ denote $\partial x_{ij} / \partial a_j$ and $\partial x_{ij} / \partial b_i$, respectively; note that $\partial x_{ij} / \partial a_k = 0$ for $j \neq k$ and $\partial x_{ij} / \partial b_k = 0$ for $i \neq k$. Employing this notation, the Jacobian can be written as

$$J = \frac{\partial X}{\partial P} = \begin{pmatrix}
A_{11} & 0 & 0 & B_{11} & 0 & 0 & 0 \\
0 & A_{12} & 0 & B_{12} & 0 & 0 & 0 \\
0 & 0 & A_{13} & B_{13} & 0 & 0 & 0 \\
A_{21} & 0 & 0 & 0 & B_{21} & 0 & 0 \\
0 & A_{22} & 0 & 0 & B_{22} & 0 & 0 \\
0 & 0 & A_{23} & 0 & B_{23} & 0 & 0 \\
A_{31} & 0 & 0 & 0 & 0 & B_{31} & 0 \\
0 & A_{32} & 0 & 0 & 0 & B_{32} & 0 \\
0 & 0 & A_{33} & 0 & 0 & B_{33} & 0 \\
A_{41} & 0 & 0 & 0 & 0 & 0 & B_{41} \\
0 & A_{42} & 0 & 0 & 0 & 0 & B_{42} \\
0 & 0 & A_{43} & 0 & 0 & 0 & B_{43}
\end{pmatrix}. \tag{6.1}$$

Then, the approximation of the Hessian (matrix N from Equation (4.3)) has the form

$$\begin{pmatrix}
U_1 & 0 & 0 & W_{11} & W_{21} & W_{31} & W_{41} \\
0 & U_2 & 0 & W_{12} & W_{22} & W_{32} & W_{42} \\
0 & 0 & U_3 & W_{13} & W_{23} & W_{33} & W_{43} \\
W_{11}^{\top} & W_{12}^{\top} & W_{13}^{\top} & V_1 & 0 & 0 & 0 \\
W_{21}^{\top} & W_{22}^{\top} & W_{23}^{\top} & 0 & V_2 & 0 & 0 \\
W_{31}^{\top} & W_{32}^{\top} & W_{33}^{\top} & 0 & 0 & V_3 & 0 \\
W_{41}^{\top} & W_{42}^{\top} & W_{43}^{\top} & 0 & 0 & 0 & V_4
\end{pmatrix}
\begin{pmatrix} \delta_{a_1} \\ \delta_{a_2} \\ \delta_{a_3} \\ \delta_{b_1} \\ \delta_{b_2} \\ \delta_{b_3} \\ \delta_{b_4} \end{pmatrix}
=
\begin{pmatrix} \varepsilon_{a_1} \\ \varepsilon_{a_2} \\ \varepsilon_{a_3} \\ \varepsilon_{b_1} \\ \varepsilon_{b_2} \\ \varepsilon_{b_3} \\ \varepsilon_{b_4} \end{pmatrix}. \tag{6.2}$$

Denoting the upper left, lower right, and upper right parts of the matrix in Equation (6.2) with U, V and W, respectively, allows us to rewrite the augmented normal equations (4.3) compactly as

$$\begin{pmatrix} U^{*} & W \\ W^{\top} & V^{*} \end{pmatrix} \begin{pmatrix} \delta_a \\ \delta_b \end{pmatrix} = \begin{pmatrix} \varepsilon_a \\ \varepsilon_b \end{pmatrix}, \tag{6.3}$$

where * designates the augmentation of the diagonal elements of U and V. Now, let us compare the structure of the Hessian in Equation (6.2) with the Hessian of a bigger BA problem (figure 6.1). The upper left part (U) corresponds to the approximation of the second derivatives with respect to the camera parameters, the lower right part (V) to the second derivatives with respect to the 3D points, and the upper right part (W) to the mixed derivatives coupling cameras and points.

    6.2 Block Cholesky Decomposition for BA

[Figure 6.1: An example of a modestly sized Hessian in BA: the sparsity pattern of a 992 × 992 normal equations matrix (i.e. the approximate Hessian). (a) Original input matrix; (b) rotated by 180 degrees with marked parts (see also figure 7.1 for comparison). Black regions correspond to nonzero elements [14, p. 27].]

Lourakis and Argyros [14] suggest solving the augmented normal equations (6.3) arising in BA in two steps (first for $\delta_a$ and then for $\delta_b$) as follows. Left multiplication of Equation (6.3) by the block matrix

$$\begin{pmatrix} I & -W V^{*-1} \\ 0 & I \end{pmatrix} \tag{6.4}$$

results in

$$\begin{pmatrix} U^{*} - W V^{*-1} W^{\top} & 0 \\ W^{\top} & V^{*} \end{pmatrix} \begin{pmatrix} \delta_a \\ \delta_b \end{pmatrix} = \begin{pmatrix} \varepsilon_a - W V^{*-1} \varepsilon_b \\ \varepsilon_b \end{pmatrix}.$$

Since the top right block of the left-hand matrix is zero, $\delta_a$ can be determined from its top half, which is

$$(U^{*} - W V^{*-1} W^{\top})\, \delta_a = \varepsilon_a - W V^{*-1} \varepsilon_b. \tag{6.5}$$

The matrix $S \equiv U^{*} - W V^{*-1} W^{\top}$ is the Schur complement of $V^{*}$ in the left-hand side matrix of (6.3) and is also positive definite [19]. Linear system (6.5) is solved for $\delta_a$ using the Cholesky decomposition of S. $\delta_b$ is then computed by solving

$$V^{*} \delta_b = \varepsilon_b - W^{\top} \delta_a.$$
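A minimal dense Octave sketch of this two-step solution (my own illustration; U, V, W are plain dense blocks built from a random SPD test matrix, whereas the thesis stores them in sparse and block formats):

    m = 6;  k = 9;                              % sizes of the camera and point parts
    M = rand(m + k);  M = M * M' + (m + k) * eye(m + k);   % random SPD test matrix
    U = M(1:m, 1:m);  W = M(1:m, m+1:end);  V = M(m+1:end, m+1:end);
    ea = rand(m, 1);  eb = rand(k, 1);
    S  = U - W * (V \ W');                      % Schur complement of V, as in (6.5)
    R  = chol(S);                               % Cholesky factor of S
    da = R \ (R' \ (ea - W * (V \ eb)));        % solve (6.5) for delta_a
    db = V \ (eb - W' * da);                    % back-substitute for delta_b
    norm(M * [da; db] - [ea; eb])               % residual check, close to zero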

This approach has a big advantage: an absence of fill-ins during the computation. The approach explained in the next example is slightly different [21, p. 102].


Example 6. Let $A \in \mathbb{R}^{n \times n}$ be a symmetric positive definite matrix that can be divided into 4 submatrices $A_{11}$, $A_{12}$, $A_{21}$ and $A_{22}$. Then, according to Theorem 2, the Cholesky decomposition $A = LL^{\top}$ exists, where L is a lower triangular matrix with strictly positive diagonal entries. If matrix A consists of 4 submatrices, the equation $A = LL^{\top}$ can be rewritten as

$$A = \begin{pmatrix} A_{11} & A_{21}^{\top} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} L_{11}^{\top} & L_{21}^{\top} \\ 0 & L_{22}^{\top} \end{pmatrix}.$$

The aim of the block Cholesky decomposition is to compute the values in the submatrices $L_{11}$, $L_{21}$, $L_{22}$ (or $L_{11}^{\top}$, $L_{21}^{\top}$, $L_{22}^{\top}$, respectively). The whole process can be divided into the following steps:

1. $A_{11} = L_{11} L_{11}^{\top}$ (Cholesky decomposition),

2. $L_{21}^{\top} = L_{11}^{-1} A_{21}^{\top}$ from $A_{21}^{\top} = L_{11} L_{21}^{\top}$, or equivalently $L_{21} = A_{21} L_{11}^{-\top}$ from $A_{21} = L_{21} L_{11}^{\top}$,

3. $A_{22} - L_{21} L_{21}^{\top} = L_{22} L_{22}^{\top}$ (Cholesky decomposition).

During the decomposition process, the first two steps can be done simultaneously. The last step updates the $A_{22}$ submatrix with the matrix $A_{22}^{S}$ that is called the Schur complement of $A_{11}$ in the matrix A and can be expressed as

$$A_{22}^{S} = A_{22} - A_{21} A_{11}^{-1} A_{21}^{\top} = A_{22} - L_{21} L_{11}^{\top} (L_{11} L_{11}^{\top})^{-1} L_{11} L_{21}^{\top} = A_{22} - L_{21} (L_{11}^{\top} L_{11}^{-\top})(L_{11}^{-1} L_{11}) L_{21}^{\top} = A_{22} - L_{21} L_{21}^{\top}. \tag{6.6}$$
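A small Octave sketch verifying these steps on a random symmetric positive definite matrix (my own illustration, not code from the thesis):

    n1 = 3;  n2 = 4;  n = n1 + n2;
    A = rand(n);  A = A * A' + n * eye(n);      % random SPD test matrix
    A11 = A(1:n1, 1:n1);  A21 = A(n1+1:n, 1:n1);  A22 = A(n1+1:n, n1+1:n);
    L11 = chol(A11)';                           % step 1: Cholesky of the leading block
    L21 = A21 / L11';                           % step 2: L21 = A21 * inv(L11')
    L22 = chol(A22 - L21 * L21')';              % step 3: Cholesky of the Schur complement (6.6)
    L   = [L11, zeros(n1, n2); L21, L22];
    norm(L * L' - A)                            % close to zero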

Example 7. This method allows parallel computation when the diagonal blocks are independent, as in the linear system (6.7). Blocks $A_{11}$ and $A_{22}$ do not have any mutually dependent elements ($A_{12}$ and $A_{21}$ are zero matrices):

$$\begin{pmatrix} A_{11} & 0 & A_{13} \\ 0 & A_{22} & A_{23} \\ A_{13}^{\top} & A_{23}^{\top} & A_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}. \tag{6.7}$$

After the first step, the blocks $A_{11}$, $A_{13}$, $A_{22}$, $A_{23}$ and the parts $b_1$ and $b_2$ of the right-hand side are updated in parallel, and the system has the following form:

$$\begin{pmatrix} L_{11}^{\top} & 0 & L_{11}^{-1} A_{13} \\ 0 & L_{22}^{\top} & L_{22}^{-1} A_{23} \\ 0 & 0 & A_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} L_{11}^{\top} & 0 & L_{13} \\ 0 & L_{22}^{\top} & L_{23} \\ 0 & 0 & A_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} L_{11}^{-1} b_1 \\ L_{22}^{-1} b_2 \\ b_3 \end{pmatrix}.$$


The next step is to update the block $A_{33}$ with the Schur complement $A_{33}^{S}$ of the matrix $\begin{pmatrix} A_{11} & 0 \\ 0 & A_{22} \end{pmatrix}$ in the matrix A, that is, according to (6.6),

$$A_{33}^{S} = A_{33} - \begin{pmatrix} L_{13}^{\top} & L_{23}^{\top} \end{pmatrix} \begin{pmatrix} L_{13} \\ L_{23} \end{pmatrix},$$

and to update the vector $b_3$ with $b_3^{S}$, which equals

$$b_3^{S} = b_3 - \begin{pmatrix} L_{13}^{\top} & L_{23}^{\top} \end{pmatrix} \begin{pmatrix} L_{11}^{-1} b_1 \\ L_{22}^{-1} b_2 \end{pmatrix}.$$

Next, the linear system

$$A_{33}^{S} x_3 = b_3^{S}$$

is transformed using Gaussian elimination to

$$L_{33}^{S\top} x_3 = (L_{33}^{S})^{-1} b_3^{S}$$

and solved for $x_3$ using back substitution. Finally, the remaining parts of the vector x ($x_1$ and $x_2$) in the transformed system

$$\begin{pmatrix} L_{11}^{\top} & 0 & L_{13} \\ 0 & L_{22}^{\top} & L_{23} \\ 0 & 0 & L_{33}^{S\top} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} L_{11}^{-1} b_1 \\ L_{22}^{-1} b_2 \\ (L_{33}^{S})^{-1} b_3^{S} \end{pmatrix}$$

are computed using only back substitution.


    Chapter 7

    Implementation

This chapter describes the chosen framework and implementation details, such as the structures, functions and data types used in the practical output of the thesis: a linear direct solver (LDS).

    7.1 Used Framework

The whole application was developed in a Linux environment (Xubuntu 12.04 for 64-bit PC and Debian 6.0 for 32-bit PC). The host code (for the CPU side) was written in ANSI C, the device code (for the GPU side) in CUDA (CUDA Driver 4.0). All object files were linked together into an executable file (ldsexam) using the NVCC compiler; no static or dynamic libraries were created (see my makefile).

    7.2 Compressed Row Storage Format

Many formats for sparse matrices exist. One of the most general is the compressed row storage (CRS) format. It makes no assumptions about the sparsity pattern and stores only indices and nonzero elements. On the other hand, it is not very efficient, because it needs an indirect addressing step for every scalar operation in a matrix-vector product. I have decided on this format for my CPU-side solver because it can be utilized effectively in the Cholesky decomposition.

The CRS format needs three vectors: nozval of floating-point numbers, and rowptr and colind of integers. The nozval vector stores the values of the nonzero elements of the matrix as they are traversed in a row-wise fashion. The colind vector stores the column indices of the elements in the nozval vector.


That is, if nozval(k) = a_ij then colind(k) = j. The rowptr vector stores the locations in the nozval vector that start a row, that is, if nozval(k) = a_ij then rowptr(i) ≤ k < rowptr(i + 1). By convention, rowptr(n + 1) = nnz + 1, where nnz is the number of all nonzeros.

Example 8. Consider the sparse symmetric matrix in figure 7.1.

            0  1  2  3  4  5
        0 [ 7  .  .  .  .  1 ]
        1 [ .  8  1  .  2  . ]
        2 [ .  1  8  .  3  2 ]
        3 [ .  .  .  9  3  2 ]
        4 [ .  2  3  3  9  3 ]
        5 [ 1  .  2  2  3  9 ]

Figure 7.1: Sample of a symmetric positive definite sparse matrix 6 × 6 with 22 nonzero elements

Its CRS representation has the following attributes (0-based indices, as in the C implementation): n = 6, nnz = 22,

    rowptr: 0 2 5 9 12 17 22
    colind: 0 5 | 1 2 4 | 1 2 4 5 | 3 4 5 | 1 2 3 4 5 | 0 2 3 4 5
    nozval: 7 1 | 8 1 2 | 1 8 3 2 | 9 3 2 | 2 3 3 9 3 | 1 2 2 3 9

(the separator | only marks row boundaries for readability).
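A short Octave sketch of these CRS arrays and of a matrix-vector product computed directly from them (my own illustration; 1-based indices are used here, whereas the C implementation is 0-based):

    rowptr = [1 3 6 10 13 18 23];
    colind = [1 6   2 3 5   2 3 5 6   4 5 6   2 3 4 5 6   1 3 4 5 6];
    nozval = [7 1   8 1 2   1 8 3 2   9 3 2   2 3 3 9 3   1 2 2 3 9];
    x = (1:6)';                       % arbitrary input vector
    y = zeros(6, 1);
    for i = 1:6                       % y = A*x using only the three CRS vectors
      for k = rowptr(i):rowptr(i+1)-1
        y(i) = y(i) + nozval(k) * x(colind(k));
      end
    end
    y'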

    7.3 Cholesky decomposition on GPU

The implementation of a sparse Cholesky decomposition (functions CRS_chol and CRS_chol_subs) was quite straightforward. Before these functions are called, a symbolical factorization must be performed, which determines the indices of the fill-ins and allocates space for them. For the purpose of the Cholesky decomposition, it is sufficient to have only the lower or upper triangular matrix. This fact was exploited by skipping all elements from the beginning of each row up to the main diagonal, which is done by CRS_shifted_rows. Another difference in the decomposition of sparse matrices lies in the necessity of altering the beginning of each row during the factorization. For that purpose I have worked with the temporary arrays rowbeg and rowend.


    7.4 Ordering for CPU solver

In my solver, I have utilized the approximate minimum degree (AMD) ordering by Tim Davis, which can also be found in MATLAB's amd function. It minimizes the number of fill-ins very effectively and fast (see figure 3.2). For the BA problem, an even faster ordering (but with more fill-ins) can be used: a simple rotation by 180 degrees.

    7.5 Block Matrix Format for GPU

There are 3 different parts in the matrix: full diagonal blocks, the sparse border and the almost dense tail (light, middle and dark gray in figure 7.1). Analyzing the properties of these parts and of the CUDA architecture, I have designed the following matrix data structure (MXBF).

Blocks: As there are many (from thousands to millions) full but small diagonal blocks (Vi), they can be stored in one array (data) in a row-wise manner. In BA, the blocks have the same size, but when using METIS k-way ordering, the blocks do not have the same size. Because of that, for each block the information about its size must be stored (blksz). When iterating over the blocks, it is efficient to have an index saying where the data start for the i-th block (blkp). Only the upper part of each block is stored, but memory is allocated for the full block to avoid awkward indexing.

Border: This part holds the majority of the nonzero elements. Therefore, it must be stored as a sparse matrix; I have chosen the CRS format. Since the input matrix is symmetric, it is sufficient to store only one side of the border.

Tail: After computing the Schur complement, this part will be almost dense. Consequently, it is stored as a full matrix. Only the upper triangle is stored, but memory is allocated for the full matrix, as in the case of the blocks. The data for this part are stored in the data array as well, and tail points to the location where the data for the tail start.

The MXBF structure of the matrix from figure 7.1 has these attributes: n = 6, tail = 5 (index in the data array where the data for the tail start), tailsz = 3, ndata = 14 (number of elements in the blocks and the tail), brd_nnz = 4 (number of nonzeros in the border),

    blksz: 1 2
    blkp:  0 1
    data:  7 | 8 1 0 8 | 9 3 2 0 9 3 0 0 9

(the separator | only marks block/tail boundaries for readability),


    brd_rowptr: 0 1 2 4
    brd_colind: 2 1 1 2
    brd_nozval: 1 2 3 2

    7.6 Block Cholesky decomposition on GPU

Consider the block matrix

$$\begin{pmatrix} A_{11} & 0 & A_{13} \\ 0 & A_{22} & A_{23} \\ A_{13}^{\top} & A_{23}^{\top} & A_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix},$$

where $A_{11}$ and $A_{22}$ are called blocks, $A_{13}$ and $A_{23}$ borders, and $A_{33}$ is the tail.

    The block Cholesky decomposition consists of four main parts:

1. Eliminating the blocks ($A_{11} \to L_{11}$ and $A_{22} \to L_{22}$), updating the corresponding borders ($A_{13} \to L_{11}^{-1} A_{13}$ and $A_{23} \to L_{22}^{-1} A_{23}$), and updating the corresponding parts of the right-hand side of the linear system ($b_1 \to L_{11}^{-1} b_1$ and $b_2 \to L_{22}^{-1} b_2$). All these steps are done simultaneously (within the elimination loops). Each thread eliminates one block (in the test matrix it is of size 3 × 3) and updates its own part of the border and of the b vector. As the border part is sparse and can have an arbitrary number of nonzero elements, I store and access this data in global memory.

2. Computing the Schur complement
\[
A_{33} \rightarrow A_{33}^{S} = A_{33} -
\begin{pmatrix} L_{11}^{-1} A_{13} \\ L_{22}^{-1} A_{23} \end{pmatrix}^{T}
\begin{pmatrix} L_{11}^{-1} A_{13} \\ L_{22}^{-1} A_{23} \end{pmatrix}.
\]
One problem was that the updated border part $\begin{pmatrix} L_{11}^{-1} A_{13} \\ L_{22}^{-1} A_{23} \end{pmatrix}$ is stored in a row-wise manner and its transpose is not available; therefore, a dot-product formulation of the matrix-matrix multiplication was not possible. Instead, I loop through the rows of the matrix and update the elements of $A_{33}$ at every multiplication. This is only possible with atomic operations (atomicAdd), which, moreover, are available only for single-precision floats (and only on devices of compute capability 2.0 or higher). I am aware of this restriction of the proposed approach. A sketch of such an update kernel is given after this list.

3. Eliminating the tail ($A_{33}^{S} \rightarrow L_{33}^{S}$). This part surely has the biggest potential to exploit the full power of a GPU. Unfortunately, it was postponed due to lack of time; I had planned to call a function from the MAGMA library that is able to factorize a dense linear system. In my solver, this part is performed on the CPU side.


4. Back substitution. Performed on the CPU side, first for the dense part $L_{33}^{S}$ and then for the sparse borders and the full blocks.
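As a rough, hedged sketch of step 2 only (not the thesis kernel itself): each CUDA thread owns one diagonal block, walks the CRS rows of its already-updated border, and subtracts the outer products from the dense tail with atomicAdd. All array names and the blk_first_row index are assumptions made for this illustration; border column indices are assumed to be tail-relative.

__global__ void schur_update(const int   *blk_first_row, /* first border row of each block, length nblocks+1 */
                             int          nblocks,
                             const int   *rowptr,        /* CRS row pointers of the updated border            */
                             const int   *colind,        /* tail-relative column indices                      */
                             const float *val,           /* border values, already L_ii^{-1}-scaled           */
                             float       *S,             /* dense tail (row-major), initialized with A33      */
                             int          tailsz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nblocks) return;

    /* rows of the border that belong to the i-th diagonal block */
    for (int r = blk_first_row[i]; r < blk_first_row[i + 1]; ++r) {
        for (int p = rowptr[r]; p < rowptr[r + 1]; ++p) {
            int   j  = colind[p];
            float vj = val[p];
            /* accumulate only the upper triangle: k >= j */
            for (int q = p; q < rowptr[r + 1]; ++q) {
                int   k  = colind[q];
                float vk = val[q];
                /* S[j][k] -= vj * vk; different blocks may hit the same
                 * tail entry, hence the atomic (single precision only)   */
                atomicAdd(&S[j * tailsz + k], -vj * vk);
            }
        }
    }
}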

    7.7 Ordering for GPU solver

A requirement of my GPU solver is that the input matrix can be partitioned into the structure that appears in the approximation of the Hessian in the BA problem (see the matrix in Equation 6.2). This can be achieved by applying a nested dissection ordering recursively. The METIS k-way ordering was used to partition the input matrix into the independent block structure required by the GPU solver. Figures 7.2 and 7.3 illustrate the structure of matrices from the MATLAB gallery reordered by the k-way ordering. As BA has this structure implicitly, and the size and number of the independent blocks are known from the BA configuration, only a rotation by 180 degrees is needed to obtain the structure of Figure 6.1b, as sketched below.
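Just to make the last remark concrete (a trivial sketch, with perm as an illustrative name): rotating a symmetric matrix by 180 degrees is the reversing permutation applied to rows and columns, which moves the small camera part to the tail and the many point blocks to the front.

/* reversing permutation used as the "180-degree rotation";
 * the reordered matrix is B(i, j) = A(perm[i], perm[j]) */
void rotation_180_perm(int n, int *perm)
{
    for (int i = 0; i < n; ++i)
        perm[i] = n - 1 - i;
}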

Figure 7.2: Performing k-way ordering on the diagonal-based matrix Wathen 10×10 (nz = 4861); (a) the original matrix, (b) reordered into 5 independent blocks.


Figure 7.3: Performing k-way ordering on the diagonal-based matrix Poisson 30 (nz = 4380); (a) the original matrix, (b) reordered using k-way ordering into 10 independent blocks.


    Chapter 8

    Testing

Testing was performed on the following configuration: Intel i7-2600 CPU @ 3.40 GHz, 4 GB RAM, GeForce GT570, Debian 6.0 for 32-bit PC, CUDA driver 4.0. Applications were compiled using GCC (version 4.3.5) and NVCC with the -use_fast_math and -O3 optimization flags.

To check the accuracy of my solvers, I have used Octave to obtain the reference x vector. The solutions from Octave and from my solver were printed into files (x_octave.vec and x_result.vec) and their differences were compared with another Octave function (vec_ck).

The main testing input matrix was the approximation of the Hessian from a BA problem optimizing 3 parameters of 11,049 3D points and 7 parameters of 22 cameras. The matrix is of size 33,301 × 33,301 and has 1,817,521 nonzero elements, saved in the Matrix Market coordinate format (data/jTj_mue.mtx).

    8.1 Octave solvers

In Octave, I have tested the direct solver (the left division operator \), the Preconditioned Conjugate Gradient solver (pcg), and Preconditioned Conjugate Residuals (pcr). The iterative solvers were set to terminate after reaching 200 iterations or a residual norm less than $10^{-6}$. Table 8.1 shows the results. The Preconditioned Conjugate Residuals solver terminated after 45 iterations, but its result was wrong.


Method                   Time      Res. norm   Iterations
Left division operator   695 ms    1.28313     --
Conjugate gradient       1440 ms   4.1285      75
Conjugate residuals      1386 ms   NaN         45

Table 8.1: Test of Octave solvers

    8.2 CPU solver

After executing the CPU solver from the lds directory with the command
./bin/ldscpuexam data/jtj_mueI.mtx data/g.mtx
the following information is printed:

    load matrix: 1070 ms

    load vector: 10 ms

    symamd ord.: 80 ms

    mat. reorder: 390 ms

    symbolic: 500 ms

    CRS_symbolic: 1834461 nnz

    CPU CRS chol: 50 ms

    all: 2120 ms

The number of nonzeros has not increased much (from 1,817,521 to 1,834,461), which means that there are very few fill-ins (less than 1%). It can also be seen that my implemented functions for reordering the matrix and for the symbolic factorization are not very efficient. The reason may be that the reordering is performed by transforming the CRS format into the triplet (COO) format, permuting, sorting, and then transforming back, which requires a lot of data movement. Although finding the ordering takes more time than solving the whole linear system, without it (try commenting it out in ldscpuexam.c) the computation takes more than several minutes. The execution of all functions required to find the solution takes 1 second.
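To illustrate where that data movement comes from, here is a minimal sketch of such a permutation step (assuming a full-pattern CRS; the function and array names are illustrative, and a triangular storage would additionally have to mirror entries that cross the diagonal):

#include <stdlib.h>

typedef struct { int r, c; double v; } coo_t;

static int cmp_coo(const void *a, const void *b)
{
    const coo_t *x = a, *y = b;
    if (x->r != y->r) return x->r - y->r;
    return x->c - y->c;
}

/* Symmetric permutation of a CRS matrix: new index = inv[old index].
 * Every nonzero is copied to a triplet array, relabeled, sorted by
 * (row, col) and copied back -- exactly the extra work discussed above. */
void crs_sym_permute_sketch(int n, int *rowptr, int *colind, double *val,
                            const int *inv)
{
    int nnz = rowptr[n];
    coo_t *t = malloc((size_t)nnz * sizeof *t);

    for (int i = 0; i < n; ++i)                      /* CRS -> COO, relabel */
        for (int p = rowptr[i]; p < rowptr[i + 1]; ++p) {
            t[p].r = inv[i];
            t[p].c = inv[colind[p]];
            t[p].v = val[p];
        }

    qsort(t, (size_t)nnz, sizeof *t, cmp_coo);       /* sort by (row, col)  */

    for (int i = 0; i <= n; ++i) rowptr[i] = 0;      /* COO -> CRS          */
    for (int p = 0; p < nnz; ++p) rowptr[t[p].r + 1]++;
    for (int i = 0; i < n; ++i) rowptr[i + 1] += rowptr[i];
    for (int p = 0; p < nnz; ++p) { colind[p] = t[p].c; val[p] = t[p].v; }

    free(t);
}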

The command
octave -q --eval="vec_ck( x_octave.vec, x_result.vec );"
outputs the residual norm of the difference from the reference Octave solution and finds where the biggest difference is:

    max err: 0.0000000228 at 138th element

    res nrm: 0.0000000000


    8.3 GPU solver

To check the correctness of the GPU solver, I have also implemented the GPU solver's algorithm on the CPU side (to use it, the constant BLOCK_CHOLESKY_CPU must be uncommented and the project recompiled with make). Then ldsgpuexam is performed on the CPU.

Calling
./bin/ldsgpuexam data/jtj_mueI.mtx data/g.mtx
gives these results:

    load matrix: 1060 ms

load vector: 20 ms
kway ord.: 20 ms

    mat. reorder: 370 ms

    symbolic: 500 ms

    CRS_symbolic: 1834083 nnz

    MXBF_from_crs: 11049 blocks

    858522 border nnz

    123157 block and tail data

    block matrix: 10 ms

    elim. blks: 10 ms

    tail update: 30 ms

elim tail: 0 ms
back subs: 0 ms

    CPU block chol: 40 ms

    all: 2020 ms

This solver, which exploits the special structure of BA, runs faster than the general CPU solver (40 vs. 50 ms). Checking the residual norm gives:

    max err: 0.0000221960 at 59th element.

    res nrm: 0.0000000010

    Output of the real GPU solver:

load matrix: 1070 ms
load vector: 10 ms

    kway ord.: 20 ms

    mat. reorder: 380 ms

    symbolic: 500 ms

    CRS_symbolic: 1834083 nnz

    MXBF_from_crs: 11049 blocks

    858522 border nnz

    123157 block and tail data

    block matrix: 10 ms


    elim on GPU:

elim without copy: 15.1688 ms
elim with copy: 20.0004 ms

    elim blocks + tail update: 420 ms

    elim tail: 0 ms

    back subs: 0 ms

    GPU block chol: 430 ms

    all: 2430 ms

    with residual norm:

    max err: 0.0000072417 at 103th element.

    res nrm: 0.0000000003

The GPU solver must be run with single-precision floats because of the atomicAdd operations. "elim without copy" is the time needed for the elimination of the blocks and the tail update (computing the Schur complement).

    8.4 CUSP solvers

CUSP is a C++ template library that implements parallel algorithms for sparse matrix and graph computations. It provides a variety of iterative solvers, such as Conjugate Gradient (CG), Biconjugate Gradient (BiCG), Biconjugate Gradient Stabilized (BiCGstab), Generalized Minimum Residual (GMRES), Multi-mass Conjugate Gradient (CG-M), and Multi-mass Biconjugate Gradient Stabilized (BiCGstab-M). I have tested two of them with the maximum number of iterations set to 200 and the relative tolerance to $10^{-6}$. Table 8.2 shows the results.

Method     Time     Max. error   Iterations
CG         50 ms    3.88         77
BiCGstab   90 ms    2.38         76

Table 8.2: Test of iterative CUSP solvers. Max. error is the maximal difference from Octave's reference solution.
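For reference, a minimal sketch of how such a CUSP run can be set up, assuming the CUSP interface of that era (cusp::default_monitor, cusp::krylov::cg) and with the all-ones right-hand side used only as a stand-in:

#include <cusp/csr_matrix.h>
#include <cusp/array1d.h>
#include <cusp/monitor.h>
#include <cusp/krylov/cg.h>
#include <cusp/io/matrix_market.h>

int main(void)
{
    /* load the SPD system matrix from Matrix Market into device memory */
    cusp::csr_matrix<int, float, cusp::device_memory> A;
    cusp::io::read_matrix_market_file(A, "data/jtj_mueI.mtx");

    /* initial guess and right-hand side */
    cusp::array1d<float, cusp::device_memory> x(A.num_rows, 0.0f);
    cusp::array1d<float, cusp::device_memory> b(A.num_rows, 1.0f);

    /* stop after 200 iterations or at a relative residual of 1e-6 */
    cusp::default_monitor<float> monitor(b, 200, 1e-6f);

    /* run Conjugate Gradient on the GPU */
    cusp::krylov::cg(A, x, b, monitor);
    return 0;
}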


    Chapter 9

    Conclusion

The aim of this thesis was to study linear direct solvers and then implement a linear direct GPU solver for the BA problem. The implementation of the GPU solver was preceded by studying the mathematical background of linear direct solvers. First, the CPU solver had to be implemented, and other important concepts of direct sparse solvers had to be acquired, such as symbolic factorization, working with the CRS matrix format, and applying ordering techniques. I can say that my implemented CPU solver is fast and reliable when solving positive definite linear systems. This was done in the first half of the academic year.

In the second half, I started experimenting with the METIS k-way ordering and how to utilize it for solving general sparse systems in parallel. Although this approach is fully usable, it has drawbacks such as a slow computation of the ordering, a relatively big tail part, and independent blocks of different sizes.

Simultaneously, I was analyzing the BA problem and the structure of its sparse linear systems in the Levenberg-Marquardt algorithm. As the structure produced by BA and by the k-way ordering is the same, I tried to write a solver which could be general (the needed information about the block matrix is given by the k-way ordering) and specific at the same time (in this case, the information about the block matrix is provided by the BA configuration). The general solver on the GPU is not finished (a special symbolic factorization is missing).

The GPU solver specialized for BA was implemented, but it provides only very small speedups in comparison with the CPU solver. The reason is that only the global memory of the GPU was used for all computations.

In the testing phase, I have found out that iterative solvers have a great potential to solve linear systems very fast. The advantage of iterative solvers is their configurable accuracy, which can be sufficient for iterative nonlinear solvers.


Especially when used with a preconditioner, the solution should be found very fast. On the other hand, when using direct solvers, the symbolic factorization is computed only once in the LM algorithm, and direct solvers give in general more accurate results. Based on my experiments, I suggest using a direct solver on the CPU together with a dense GPU solver for factorizing the Schur complement.

I realize that a precise study of the SBA (Sparse Bundle Adjustment) package is missing, as is testing of the practical utilization of the GPU solvers within this package.


    Appendix A

List of Abbreviations

    3D Three-Dimensional

    BA Bundle Adjustment

    CPU Central Processing Unit

    CUDA Compute Unified Device Architecture

    CRS Compressed Row Storage

    GPGPU General-Purpose Computing on Graphics Processing Unit

    GPU Graphics Processing Unit

    LDS Linear Direct Solver (output of this thesis)

    LM Levenberg-Marquardt (algorithm)

LRU Least Recently Used

    SBA Sparse Bundle Adjustment

    SIMD Single Instruction Multiple Data

    SIMT Single Instruction Multiple Threads

    SM Streaming Multiprocessor


    Appendix B

    User Manual

    B.1 Requirements

All code was written in ANSI C and CUDA and tested on 64-bit Linux (Xubuntu distribution) with GCC 4.4.6. For a successful compilation, the package libscotchmetis-dev is required, and the install paths for CUDA and for the METIS include file must be set properly in the makefile. After compilation, the executables ldscpuexam and ldsgpuexam will be created in the bin directory.

    B.2 Usage

ldscpuexam A.mtx b.vec
ldsgpuexam A.mtx b.vec

e.g. bin/ldscpuexam data/jtj_mueI.mtx data/g.mtx or bin/ldsgpuexam data/jtj_mueI.mtx data/g.mtx for the tested matrix.

A.mtx is a symmetric positive definite matrix of size n × n stored in the Matrix Market format; b.vec is the right-hand side n × 1 vector of the equation system, also stored in the Matrix Market format. Some timing information will be printed on the stdout and the solution will be stored in a file named x_result.vec. For testing the correctness of the solution, the Octave function vec_ck can be called from the command line:

octave -q --eval='vec_ck( "x_result.vec", "x_octave.vec" );'


    Appendix C

Contents of the Attached CD

.
+-- lds
|   +-- bin
|   +-- data
|   |   +-- g.mtx
|   |   +-- jtj_mueI.mtx
|   |   +-- test_thesis.mtx
|   +-- makefile
|   +-- obj
|   +-- octave
|   |   +-- matrix_load.m
|   |   +-- octave_solver.m
|   |   +-- spy_print.m
|   +-- README.txt
|   +-- src
|   |   +-- colamd.c
|   |   +-- colamd_global.c
|   |   +-- colamd.h
|   |   +-- crs.c
|   |   +-- crs.h
|   |   +-- etree.c
|   |   +-- etree.h
|   |   +-- ldscpuexam.c
|   |   +-- ldsgpuexam.c
|   |   +-- mxbf.cu
|   |   +-- mxbf.h
|   |   +-- mxbf_chol.cu
|   |   +-- ord.c
|   |   +-- ord.h
|   |   +-- UFconfig.c
|   |   +-- UFconfig.h
|   |   +-- uni.c
|   |   +-- uni.h
|   |   +-- vec.c
|   |   +-- vec.h
|   +-- vec_ck.m