Solvers
Version Y-2006.06, June 2006


Copyright Notice and Proprietary Information

Copyright © 2006 Synopsys, Inc. All rights reserved. This software and documentation contain confidential and proprietary information that is the property of Synopsys, Inc. The software and documentation are furnished under a license agreement and may be used or copied only in accordance with the terms of the license agreement. No part of the software and documentation may be reproduced, transmitted, or translated, in any form or by any means, electronic, mechanical, manual, optical, or otherwise, without prior written permission of Synopsys, Inc., or as expressly provided by the license agreement.

Right to Copy Documentation

The license agreement with Synopsys permits licensee to make copies of the documentation for its internal use only. Each copy shall include all copyrights, trademarks, service marks, and proprietary rights notices, if any. Licensee must assign sequential numbers to all copies. These copies shall contain the following legend on the cover page:

“This document is duplicated with the permission of Synopsys, Inc., for the exclusive use of __________________________________________ and its employees. This is copy number __________.”

Destination Control Statement

All technical data contained in this publication is subject to the export control laws of the United States of America. Disclosure to nationals of other countries contrary to United States law is prohibited. It is the reader’s responsibility to determine the applicable regulations and to comply with them.

Disclaimer

SYNOPSYS, INC., AND ITS LICENSORS MAKE NO WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, WITH REGARD TO THIS MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

Registered Trademarks (®)

Synopsys, AMPS, Arcadia, C Level Design, C2HDL, C2V, C2VHDL, Cadabra, Calaveras Algorithm, CATS, CRITIC, CSim, Design Compiler, DesignPower, DesignWare, EPIC, Formality, HSIM, HSPICE, Hypermodel, iN-Phase, in-Sync, Leda, MAST, Meta, Meta-Software, ModelTools, NanoSim, OpenVera, PathMill, Photolynx, Physical Compiler, PowerMill, PrimeTime, RailMill, RapidScript, Saber, SiVL, SNUG, SolvNet, Superlog, System Compiler, TetraMAX, TimeMill, TMA, VCS, Vera, and Virtual Stepper are registered trademarks of Synopsys, Inc.

Trademarks (™)

Active Parasitics, AFGen, Apollo, Apollo II, Apollo-DPII, Apollo-GA, ApolloGAII, Astro, Astro-Rail, Astro-Xtalk, Aurora, AvanTestchip, AvanWaves, BCView, Behavioral Compiler, BOA, BRT, Cedar, ChipPlanner, Circuit Analysis, Columbia, Columbia-CE, Comet 3D, Cosmos, CosmosEnterprise, CosmosLE, CosmosScope, CosmosSE, Cyclelink, Davinci, DC Expert, DC Professional, DC Ultra, DC Ultra Plus, Design Advisor, Design Analyzer, Design Vision, DesignerHDL, DesignTime, DFM-Workbench, Direct RTL, Direct Silicon Access, Discovery, DW8051, DWPCI, Dynamic-Macromodeling, Dynamic Model Switcher, ECL Compiler, ECO Compiler, EDAnavigator, Encore, Encore PQ, Evaccess, ExpressModel, Floorplan Manager, Formal Model Checker, FoundryModel, FPGA Compiler II, FPGA Express, Frame Compiler, Galaxy, Gatran, HANEX, HDL Advisor, HDL Compiler, Hercules, Hercules-Explorer, Hercules-II, Hierarchical Optimization Technology, High Performance Option, HotPlace, HSIMplus, HSPICE-Link, iN-Tandem, Integrator, Interactive Waveform Viewer, i-Virtual Stepper, Jupiter, Jupiter-DP, JupiterXT, JupiterXT-ASIC, JVXtreme, Liberty, Libra-Passport, Library Compiler, Libra-Visa, Magellan, Mars, Mars-Rail, Mars-Xtalk, Medici, Metacapture, Metacircuit, Metamanager, Metamixsim, Milkyway, ModelSource, Module Compiler, MS-3200, MS-3400, Nova Product Family, Nova-ExploreRTL, Nova-Trans, Nova-VeriLint, Nova-VHDLlint, Optimum Silicon, Orion_ec, Parasitic View, Passport, Planet, Planet-PL, Planet-RTL, Polaris, Polaris-CBS, Polaris-MT, Power Compiler, PowerCODE, PowerGate, ProFPGA, ProGen, Prospector, Protocol Compiler, PSMGen, Raphael, Raphael-NES, RoadRunner, RTL Analyzer, Saturn, ScanBand, Schematic Compiler, Scirocco, Scirocco-i, Shadow Debugger, Silicon Blueprint, Silicon Early Access, SinglePass-SoC, Smart Extraction, SmartLicense, SmartModel Library, Softwire, Source-Level Design, Star, Star-DC, Star-MS, Star-MTB, Star-Power, Star-Rail, Star-RC, Star-RCXT, Star-Sim, Star-SimXT, Star-Time, Star-XP, SWIFT,
Taurus, TimeSlice, TimeTracker, Timing Annotator, TopoPlace, TopoRoute, Trace-On-Demand, True-Hspice, TSUPREM-4, TymeWare, VCS Express, VCSi, Venus, Verification Portal, VFormal, VHDL Compiler, VHDL System Simulator, VirSim, and VMC are trademarks of Synopsys, Inc.

Service Marks (SM)

MAP-in, SVP Café, and TAP-in are service marks of Synopsys, Inc.

SystemC is a trademark of the Open SystemC Initiative and is used under license. ARM and AMBA are registered trademarks of ARM Limited. All other product or company names may be trademarks of their respective owners.

Solvers, Y-2006.06

Page 3: Solvers - jmbussat/Physics290E/Fall-2006/TCAD_documentation/...that are available as part of Synopsys TCAD software. This manual is organized into the following parts:

Contents

About this manual  v
    Audience  v
    Related publications  v
    Typographic conventions  vi
    Customer support  vi
    Acknowledgments  vii

Part I  PARDISO  1
Chapter 1  Using PARDISO  3
    Overview  3
    Algorithms  3
        Parallel solution on shared-memory multiprocessors  4
Chapter 2  Performance of PARDISO  7
    Uniprocessor performance  7
    Performance on shared-memory multiprocessors  8
    Performance of SUPER and PARDISO  9

Part II  UMFPACK  11
Chapter 3  Using UMFPACK  13
    Overview  13
    Algorithm  13
Chapter 4  Customizing UMFPACK parameters  15
    Overview  15

Part III  SUPER  19
Chapter 5  Using SUPER  21
Chapter 6  Customizing SUPER  23
    The .superrc file  23
Chapter 7  Implementing SUPER  25
    How multiple minimum degree (MMD) works  27
        Example: MMD execution  29
    Sparse supernodal factorization algorithms  31
        Column supernode algorithms  33
        Block supernode algorithms  38
Chapter 8  Performance of SUPER  43
    Performance measurement  44
    Performance on workstations versus supercomputers  45

Part IV  ILS  47
Chapter 9  Using ILS  49
    Overview  49
    Selecting ILS in Sentaurus Device  49
    Selecting ILS in Sentaurus Process  50
    Parallelization  50
Chapter 10  Customizing ILS  51
    Configuration file  51
    Nonsymmetric ordering  52
    Symmetric ordering  52
    Preconditioners  53
        Incomplete LU factorizations  53
        Sparse approximate inverses  54
        Other preconditioners  54
    Iterative methods  55
    Options  56
    General remarks  56
Chapter 11  Default configuration file of ILS  57

Part V  SLIP90  59
Chapter 12  Using SLIP90  61
Chapter 13  Customizing SLIP90  63
    Overview  63
    Iterative methods  63
    Preconditioner  64
Chapter 14  Performance of SLIP90  65
Chapter 15  Default configuration file of SLIP90  67

Bibliography  69


About this manual

This manual contains information about the three direct linear solvers and two iterative linear solvers that are available as part of Synopsys TCAD software.

This manual is organized into the following parts:

Part I PARDISO

Part II UMFPACK

Part III SUPER

Part IV ILS

Part V SLIP90

Audience

This manual is intended for users of the Sentaurus Device and Sentaurus Process software packages.

Related publications

For additional information about Solvers, see:

The Sentaurus Device release notes and the Sentaurus Process release notes, available on SolvNet (see Accessing SolvNet on page vi).

Documentation on the Web, which is available through SolvNet at https://solvnet.synopsys.com/DocsOnWeb.

Synopsys Online Documentation (SOLD), which is included with the software for CD users or is available to download through the Synopsys Electronic Software Transfer (EST) system.


Typographic conventions

Convention        Explanation
( )               Parentheses
Blue text         Identifies a cross-reference (only on the screen).
Bold text         Identifies a selectable icon, button, menu, or tab. It also indicates the name of a field, window, dialog box, or panel.
Courier font      Identifies text that is displayed on the screen or that the user must type. It identifies the names of files, directories, paths, parameters, keywords, and variables.
Italicized text   Used for emphasis, the titles of books and journals, and non-English words. It also identifies components of an equation or a formula, a placeholder, or an identifier.
NOTE              Identifies important information.

Customer support

Customer support is available through SolvNet online customer support and by contacting the Synopsys Technical Support Center.

Accessing SolvNet

SolvNet includes an electronic knowledge base of technical articles and answers to frequently asked questions about Synopsys tools. SolvNet also gives you access to a wide range of Synopsys online services including software downloads, documentation on the Web, and “Enter a Call to the Support Center.”

To access SolvNet:

1. Go to the SolvNet Web page at http://solvnet.synopsys.com.

2. If prompted, enter your user name and password. (If you do not have a Synopsys user name and password, follow the instructions to register with SolvNet.)

If you need help using SolvNet, click HELP in the top-right menu bar or in the footer.


Contacting the Synopsys Technical Support Center

If you have problems, questions, or suggestions, you can contact the Synopsys Technical Support Center in the following ways:

Open a call to your local support center from the Web by going to http://solvnet.synopsys.com (Synopsys user name and password required), then clicking “Enter a Call to the Support Center.”

Send an e-mail message to your local support center:

• E-mail [email protected] from within North America.

• Find other local support center e-mail addresses at http://www.synopsys.com/support/support_ctr.

Telephone your local support center:

• Call (800) 245-8005 from within the continental United States.

• Call (650) 584-4200 from Canada.

• Find other local support center telephone numbers at http://www.synopsys.com/support/support_ctr.

Contacting your local TCAD Support Team directly

Send an e-mail message to:

[email protected] from within North America and South America.

[email protected] from within Europe.

[email protected] from within Asia Pacific (China, Taiwan, Singapore, Malaysia, India, Australia).

[email protected] from Korea.

[email protected] from Japan.

Acknowledgments

ILS was codeveloped by the Integrated Systems Laboratory of ETH Zurich in the joint research project NUMERIK II, with financial support from the Swiss funding agency CTI.

METIS is a software package for unstructured graph partitioning and sparse matrix orderings. It was developed by G. Karypis and V. Kumar, Department of Computer Science, University of Minnesota, karypis,[email protected], and is copyrighted by the Regents of the University of Minnesota (http://glaros.dtc.umn.edu/gkhome/views/metis).

UMFPACK was developed by Timothy A. Davis from the University of Florida, Gainesville, USA.


Part I PARDISO

This part contains chapters regarding the direct linear solver PARDISO and is intended for users of Sentaurus Device and Sentaurus Process:

Chapter 1, Using PARDISO, on page 3 provides background information on PARDISO.

Chapter 2, Performance of PARDISO, on page 7 describes the uniprocessor and shared-memory multiprocessor performance of PARDISO.


CHAPTER 1 Using PARDISO

PARDISO [1][2] is a high-performance, robust, and easy-to-use software package for solving large sparse symmetric or structurally symmetric systems of linear equations in parallel.

Overview

The rapid and widespread acceptance of shared-memory multiprocessors has created a demand for parallel semiconductor device and process simulation on such systems.

PARDISO can be used as a serial package, or in a shared-memory multiprocessor environment as anefficient, scalable, parallel, direct solver.

Algorithms

The process of obtaining a direct solution of a sparse system of linear equations of the form Ax = b consists of four important phases [3][4]:

Nonsymmetric matrix permutation and scaling – This places large matrix entries on the diagonal.

Ordering – This determines the permutation of the coefficient matrix such that the factorization incurs low fill-in.

Numeric factorization – This is the actual factorization step that performs arithmetic operations on the coefficient matrix A to produce the factors L and U such that A = LU. Complete block diagonal supernode pivoting allows for dynamic interchanges of columns and rows.

Solution of triangular systems – This produces the solution x by performing forward and backward eliminations.
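On a dense matrix, the last two phases can be sketched in a few lines of Python. This is an illustrative analogue only: PARDISO operates on sparse matrices with supernode blocking, and the function names below are invented for the example.

```python
def lu_factor(A):
    """Numeric factorization with partial pivoting: PA = LU.
    L (unit lower triangular) and U are packed into one matrix."""
    n = len(A)
    LU = [row[:] for row in A]   # work on a copy
    piv = list(range(n))         # row permutation P
    for k in range(n):
        # Bring the largest entry in column k onto the diagonal.
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        if p != k:
            LU[k], LU[p] = LU[p], LU[k]
            piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]              # multiplier, stored in L
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU, piv

def lu_solve(LU, piv, b):
    """Solution of triangular systems: forward elimination Ly = Pb,
    then backward substitution Ux = y."""
    n = len(LU)
    y = [b[p] for p in piv]
    for i in range(n):
        y[i] -= sum(LU[i][j] * y[j] for j in range(i))
    x = y
    for i in reversed(range(n)):
        x[i] = (x[i] - sum(LU[i][j] * x[j] for j in range(i + 1, n))) / LU[i][i]
    return x
```

For example, for A = [[2, 1, 0], [1, 3, 1], [0, 1, 2]] and b = [3, 5, 3], lu_solve(*lu_factor(A), b) returns the solution [1, 1, 1].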

The nonsymmetric matrix permutation and scaling aims to maximize the elements on the diagonal of the matrix A. This step greatly enhances the reliability and accuracy of the numeric factorization process. More details can be found in the literature [11][12][13].

The reordering strategy of PARDISO features state-of-the-art techniques, for example, multilevel recursive bisection from METIS [5] or minimum degree–based approaches [6][49] for fill-in reduction. The nested dissection approach integrated in PARDISO is substantially better than the multiple minimum degree algorithm for large problem sizes, especially for three-dimensional problems.
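The impact of ordering on fill-in can be demonstrated on a small "arrow" matrix: eliminating the dense row and column first fills the factors completely, while eliminating them last produces no fill at all. This is a pure-Python sketch; the MMD and nested dissection heuristics themselves are, of course, far more elaborate.

```python
def lu_nnz(A):
    """LU factorization without pivoting (dense storage);
    returns the number of nonzeros in the factors L + U."""
    n = len(A)
    A = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            if A[i][k] != 0.0:
                A[i][k] /= A[k][k]
                for j in range(k + 1, n):
                    A[i][j] -= A[i][k] * A[k][j]
    return sum(1 for row in A for x in row if abs(x) > 1e-12)

def arrow(n, point_first):
    """Arrow matrix: constant diagonal plus one dense row and column,
    placed either first (bad ordering) or last (good ordering)."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 4.0
    d = 0 if point_first else n - 1
    for j in range(n):
        if j != d:
            A[d][j] = A[j][d] = 1.0
    return A
```

For n = 8, the bad ordering fills all 64 entries of the factors, whereas the good ordering keeps exactly the original 22 nonzeros.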

PARDISO exploits the memory hierarchy of the architecture by using the clique structure of the elimination graph in supernode algorithms, thus improving memory locality [42]. The numeric factorization algorithm of the package utilizes the supernode structure of the numeric factors L and U to reduce the number of memory references with Level 3 BLAS [7][31]. The result is a greatly increased sequential factorization performance.

Furthermore, PARDISO uses an integrated, scalable, left-right-looking supernode algorithm [8][9] for the parallel sparse numeric factorization on shared-memory multiprocessors. This left-right-looking supernode algorithm significantly reduces the communication rate for pipelining parallelism.

The combination of block techniques, parallel processing, and global fill-in reduction methods for three-dimensional semiconductor devices results in a significant improvement in computational performance.

Parallel solution on shared-memory multiprocessors

The main programming languages are Fortran 77 and C. The use of vendor-optimized BLAS and LAPACK subroutines ensures high computational performance on a wide range of computer architectures. The parallelization technique is based on OpenMP [10], an industrywide standard for directive-based parallel programming of SMP systems. Most SMP vendors are committed to OpenMP, making OpenMP programs portable across an increasing range of SMP platforms.

PARDISO is tuned for general use in Sentaurus Device and Sentaurus Process, so user intervention is not necessary. PARDISO is activated in Sentaurus Device by specifying in the input or command file:

Math {
   ...
   Method = pardiso
   WallClock
   ...
}

The keyword WallClock can be used to print the wallclock times of the Newton solver. This is useful when investigating the performance of the parallel execution.

The parallel execution of PARDISO is controlled by the UNIX environment variable OMP_NUM_THREADS. This environment variable specifies the number of threads to use during the execution of PARDISO. If the variable is undefined, PARDISO defaults to a serial solution of the linear system.

For example, to obtain parallel execution with two threads, the environment variable OMP_NUM_THREADS must be set as follows (in a C shell):

setenv OMP_NUM_THREADS 2

In a Bourne shell, the equivalent commands are:

OMP_NUM_THREADS=2
export OMP_NUM_THREADS


In Sentaurus Process, the PARDISO solver is the default for 1D and 2D simulations, and it can also be used in some 3D simulations by specifying:

math diffuse dim=3 pardiso

or:

math flow dim=3 pardiso

for diffusion simulations or mechanics simulations, respectively.

In Sentaurus Process, the number of threads must also be specified in the math command, for example:

math maxNumThreads=2 mathNumThreadsPardiso=2

Table 1 lists the platforms on which a parallel version of PARDISO is available.

The same environment variables and parameter settings must be used to enable hyperthreading for PARDISO on machines that support hyperthreading.

A sufficient process stack size is required for the proper execution of PARDISO. To check the UNIX stack size limit, in csh, enter the command:

limit

or, in bash or sh, enter the command:

ulimit -a

The stack size limit can be increased, in csh, by entering the command:

limit stacksize unlimited

or, in bash or sh, by entering the command:

ulimit -s unlimited

Table 1 Availability of PARDISO parallelization

Platform                    Parallel
AMD 64-bit                  Yes
HP-UX 11                    Yes
Red Hat Enterprise Linux    Yes
Sun Solaris 8 (1)           Yes

1. Both environment variables OMP_NUM_THREADS and PARALLEL must be defined.


CHAPTER 2 Performance of PARDISO

The performance of PARDISO is demonstrated in this chapter.

Uniprocessor performance

Table 2 lists four sparse matrices from Sentaurus Device and their characteristics. Test problem t1 stems from a two-dimensional semiconductor device simulation, and test cases t2 to t4 stem from three-dimensional simulations. The two-dimensional matrix t1 is ordered using the multiple minimum degree ordering heuristic [6], and nested dissection [5] is used to order the three-dimensional matrices.

NOTE All performance data presented must not be regarded as an evaluation of the various platforms.

Table 2 Serial execution times [s] of numeric factorization with PARDISO on different platforms for four linear test problems of increasing dimension

Test problem                  t1       t2       t3        t4
Dimension of A                24123    12002    25170     59648
Nonzeros in A                 504765   630002   1236312   3225912

Architecture                  Serial execution time [s]
SGI Octane, 195 MHz           2.76     14.83    56.77     426.62
DEC EV5.6, 600 MHz            1.66     7.06     25.03     139.00
Sun UltraSPARC 10, 333 MHz    3.55     19.32    71.95     452.48
IBM SP2, 135 MHz              3.22     13.52    49.74     278.43


Performance on shared-memory multiprocessors

The speedup results for the numeric factorization obtained on the vector supercomputer Cray J90 and on the high-end workstation servers from DEC, SGI, and Sun are shown in the following figures.

Figure 1 Speedup on Cray J90 (left) and DEC AlphaServer 4100 (right)

Figure 2 Speedup on Sun Enterprise (left) and SGI Origin 2000 (right)

Both figures plot the speedup of the numeric factorization against the number of processors for the test problems t1 to t4.


Performance of SUPER and PARDISO

In Table 3, the wallclock time for the numeric factorization on an eight-processor DEC AlphaServer is compared for four linear test problems of increasing dimension.

Table 3 Performance of SUPER and PARDISO on an 8-CPU DEC AlphaServer 4100

Test problem      t1       t2       t3        t4
Dimension of A    24123    12002    25170     59648
Nonzeros in A     504765   630002   1236312   3225912

Sparse direct solver   NPROC   Execution time [s]
SUPER                  1       3.18     69.96    413.70    4014.28
PARDISO                1       1.66     7.06     25.03     139.00
                       2       0.92     3.73     13.33     70.36
                       3       0.69     2.72     9.35      48.73
                       4       0.53     2.21     7.35      39.39
                       5       0.48     2.00     6.47      33.22
                       6       0.47     1.73     5.39      26.63
                       7       0.48     1.46     4.89      25.58
                       8       0.53     1.56     4.39      22.90
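The speedup and parallel efficiency implied by Table 3 are T(1)/T(p) and T(1)/(p·T(p)). A small sketch for the largest problem, t4, using the PARDISO times from the table:

```python
# PARDISO execution times [s] for test problem t4, taken from Table 3.
t4_times = {1: 139.00, 2: 70.36, 4: 39.39, 8: 22.90}

def speedup(times, p):
    """S(p) = T(1) / T(p)."""
    return times[1] / times[p]

def efficiency(times, p):
    """E(p) = S(p) / p."""
    return speedup(times, p) / p

for p in (2, 4, 8):
    print(f"NPROC={p}: speedup {speedup(t4_times, p):.2f}, "
          f"efficiency {efficiency(t4_times, p):.2f}")
# NPROC=2: speedup 1.98, efficiency 0.99
# NPROC=4: speedup 3.53, efficiency 0.88
# NPROC=8: speedup 6.07, efficiency 0.76
```

The declining efficiency at higher thread counts reflects the pipelining and synchronization overhead discussed in Chapter 1.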


Part II UMFPACK

This part contains chapters regarding the direct linear solver UMFPACK and is intended for users of Sentaurus Device and Sentaurus Process:

Chapter 3, Using UMFPACK, on page 13 provides background information on UMFPACK.

Chapter 4, Customizing UMFPACK parameters, on page 15 describes the parameters recognized by UMFPACK.


CHAPTER 3 Using UMFPACK

UMFPACK is a direct solver for sparse linear systems. It employs pivoting for improved numeric stability.

Overview

UMFPACK is available in both Sentaurus Device and Sentaurus Process. How to select the UMFPACK linear solver, and how to set the parameters described in Chapter 4 on page 15, are discussed in the respective manuals. Further information about UMFPACK can be found in the literature [14]–[19].

UMFPACK is a code for the direct solution of systems of linear equations:

Ax = b    (1)

using the Unsymmetric MultiFrontal method. A is assumed to be a sparse, unsymmetric, n-by-n matrix. It is decomposed into:

PAQ = LU    (2)

where L and U are sparse lower and upper triangular matrices, respectively. The column ordering Q is chosen to give a good a priori upper bound on fill-in and is refined during the numeric factorization. The row ordering P is determined during the numeric factorization to maintain numeric stability and to preserve sparsity.

AlgorithmThe solution of the linear system Eq. 1 involves the following phases:

Column pre-ordering. The initial ordering is determined to reduce fill-in without regard tonumeric values.

Symbolic factorization. This phase determines upper bounds on the memory usage, the floating-point operation count, and the number of nonzeros in the factors.

Numeric factorization. The column reordering is refined to reduce fill-in, and the rowordering is computed based on sparsity-preserving criteria as well as numeric considerations(relaxed threshold partial pivoting).

Solution of linear system. Given the factors and the right-hand side , the linear system issolved by forward and backward substitution. Iterative refinement is performed optionally.

Further details can be found in the literature [14].

Ax b=

A n n

PAQ LU=

L U Q

P

Q

LU

QP

LU b
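The last two phases can be illustrated with a dense toy version. This is not UMFPACK's implementation (UMFPACK is sparse and multifrontal, with a column pre-ordering Q and relaxed threshold pivoting); it is only a sketch of LU factorization with row pivoting followed by forward and backward substitution:

```python
import numpy as np

def lu_solve(A, b):
    """Dense LU with partial pivoting (P A = L U), then forward and
    backward substitution. UMFPACK performs analogous steps on sparse
    matrices, with a column pre-ordering Q and relaxed (threshold)
    rather than strict partial pivoting."""
    n = len(A)
    U = A.astype(float).copy()
    L = np.eye(n)
    perm = np.arange(n)                          # encodes the row ordering P
    for k in range(n):
        p = k + int(np.argmax(np.abs(U[k:, k]))) # choose the pivot row
        if p != k:                               # swap rows of U, L, and perm
            U[[k, p]] = U[[p, k]]
            L[[k, p], :k] = L[[p, k], :k]
            perm[[k, p]] = perm[[p, k]]
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    y = np.zeros(n)                              # forward substitution: L y = P b
    for i in range(n):
        y[i] = b[perm[i]] - L[i, :i] @ y[:i]
    x = np.zeros(n)                              # backward substitution: U x = y
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

A = np.array([[2., 1.],
              [1., 3.]])
b = np.array([3., 5.])
x = lu_solve(A, b)
```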


CHAPTER 4 Customizing UMFPACK parameters

This chapter lists the parameters that are recognized by UMFPACK.

Overview

Table 4 lists the parameters that UMFPACK recognizes.

Table 4 UMFPACK parameters

Parameter Description

PrintLevel Printing level. A value of 0 or less produces no output, even when an error occurs. If PrintLevel is equal to 1, only error messages are printed. Larger values up to 6 lead to increasingly more output. Default: 1.

The following parameters apply to the pre-ordering and symbolic factorization steps.

DenseRow Rows with more than max(16, 16 · DenseRow · √ncol) entries are treated differently in the column approximate minimum degree (COLAMD) pre-ordering and during the subsequent numeric factorization. Default: 0.2.

DenseColumn If COLAMD is used, columns with more than max(16, 16 · DenseColumn · √nrow) entries are placed last in the column pre-ordering. Default: 0.2.

BlockSize The block size to use for Level 3 BLAS in the subsequent numeric factorization. A value less than 1 is treated as 1. Modifying this parameter affects when updates are applied to the working frontal matrix and can indirectly affect fill-in and operation count. As long as the block size is large enough (approximately 8), this parameter has a modest effect on performance. Default: 32.


Strategy This is the most important control parameter. It determines the type of ordering and pivoting strategy that UMFPACK should use. The four options are:

0 (automatic): This is the default. The input matrix is analyzed to determine how symmetric the nonzero pattern is, and how many entries are on the diagonal. Then, one of the following strategies is selected.

1 (unsymmetric): Use the unsymmetric strategy. COLAMD pre-ordering is used to order the columns of A, followed by a postorder of the column elimination tree. No attempt is made to perform diagonal pivoting. The column ordering is refined during factorization. In the numeric factorization, the SymPivotTolerance parameter is ignored. A pivot is selected if its magnitude is > PivotTolerance (default 0.1) multiplied by the largest entry in its column.

2 (2-by-2): A row permutation P2 is found that places large entries on the diagonal. The permuted matrix P2·A is then factorized using the symmetric strategy.

3 (symmetric): Use the symmetric strategy. In this method, the approximate minimum degree (AMD) pre-ordering is applied to A + A′, followed by a postorder of the elimination tree of A + A′. UMFPACK attempts to perform diagonal pivoting during numeric factorization. No refinement of the column pre-ordering is performed during factorization. In the numeric factorization, a nonzero entry on the diagonal is selected as the pivot if its magnitude is > SymPivotTolerance (default 0.001) multiplied by the largest entry in its column. If this is not acceptable, an off-diagonal pivot is selected with a magnitude > PivotTolerance (default 0.1) multiplied by the largest entry in its column.

AMDDense Rows or columns in A + A′ with more than max(16, AMDDense · √n) entries are ignored in the AMD pre-ordering. Default: 10.

Tolerance_2by2 A diagonal entry S(k, k) is considered 'small' if |S(k, k)| < tol · max|S(:, k)|, where S is a submatrix of the scaled input matrix, with the removal of pivots of zero Markowitz cost. Default: 0.01.

FixQ If FixQ > 0, the pre-ordering Q is not modified during numeric factorization. If FixQ < 0, Q can be modified. If FixQ is zero, this is controlled automatically (the unsymmetric strategy modifies Q, the others do not). Default: 0.

Aggressive If nonzero, aggressive absorption is used in COLAMD and AMD pre-ordering. Default: 1.

The following parameters apply to the numeric factorization step.

PivotTolerance Relative pivot tolerance for threshold partial pivoting with row interchanges. In any given column, an entry is numerically acceptable if its absolute value is greater than or equal to PivotTolerance multiplied by the largest absolute value in the column. A value of 1.0 gives true partial pivoting. If less than or equal to zero, any nonzero entry is numerically acceptable as a pivot. Smaller values tend to lead to sparser LU factors, but the solution to the linear system can be inaccurate. Larger values can lead to a more accurate solution (but not always) and, usually, an increase in the total work. For complex matrices, an economical approximation of the absolute value is used for the threshold partial-pivoting test (|a_real| + |a_imag| instead of the more expensive-to-compute exact absolute value √(a_real² + a_imag²)). Default: 0.1 (0.2 in Sentaurus Device).
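The acceptance test described here can be sketched as a small helper (`acceptable_pivot` is an illustrative name, not part of UMFPACK's API):

```python
def acceptable_pivot(column, i, pivot_tolerance=0.1):
    """True if column[i] passes the threshold partial-pivoting test:
    |column[i]| >= pivot_tolerance * max|column|. A tolerance <= 0
    accepts any nonzero entry; 1.0 is true partial pivoting."""
    if pivot_tolerance <= 0:
        return column[i] != 0
    return abs(column[i]) >= pivot_tolerance * max(abs(v) for v in column)

col = [0.05, -1.0, 0.3]
ok = acceptable_pivot(col, 2)   # 0.3 >= 0.1 * 1.0, so acceptable
```

With pivot_tolerance = 1.0 only the largest entry of the column passes, reproducing true partial pivoting.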


SymPivotTolerance If diagonal pivoting is attempted (the symmetric or symmetric-2-by-2 strategies are used), this parameter is used to control when the diagonal entry is selected in a given pivot column. The absolute value of the entry must be > SymPivotTolerance multiplied by the largest absolute value in the column. A value of zero ensures that no off-diagonal pivoting is performed, except that zero diagonal entries are not selected if there are any off-diagonal nonzero entries. If an off-diagonal pivot is selected, an attempt is made to restore symmetry later. Assume that A(i, j) is selected, where i ≠ j. If column i has not yet been selected as a pivot column, the entry A(j, i) is redefined as a 'diagonal' entry, except that the tighter tolerance PivotTolerance is applied. This strategy has an effect similar to 2-by-2 pivoting for symmetric indefinite matrices. If a 2-by-2 block pivot with nonzero structure:

      i  j
  i [ 0  x ]
  j [ x  0 ]   (3)

is selected in a symmetric indefinite factorization method, the 2-by-2 block is inverted and a rank-2 update is applied. In UMFPACK, this 2-by-2 block would be reordered as:

      i  j
  i [ x  0 ]
  j [ 0  x ]   (4)

In both cases, the symmetry of the Schur complement is preserved. Default: 0.001.

Scale The following values are recognized:

0: No scaling is performed.

1: Each row of the input matrix A is divided by the sum of the absolute values of the entries in that row. The scaled matrix has an infinity norm of 1. This is the default.

2: Each row of the input matrix A is divided by the maximum of the absolute values of the entries in that row. In the scaled matrix, the largest entry in each row has a magnitude equal to 1.

For complex matrices, an economical approximate absolute value is used, |a_real| + |a_imag|, instead of the exact absolute value √(a_real² + a_imag²). Scaling is very important for the 'symmetric' strategy when diagonal pivoting is attempted. It also improves the performance of the 'unsymmetric' strategy.
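Both scaling modes and the economical complex magnitude can be sketched as follows (illustration only; UMFPACK applies these internally):

```python
import numpy as np

def cheap_abs(z):
    """Economical complex magnitude |Re z| + |Im z|, which avoids the
    square root of the exact magnitude."""
    return abs(z.real) + abs(z.imag)

A = np.array([[2.0, -6.0, 2.0],
              [1.0, 0.5, -0.5],
              [4.0, 1.0, 5.0]])
sum_scaled = A / np.abs(A).sum(axis=1, keepdims=True)  # Scale = 1 (default)
max_scaled = A / np.abs(A).max(axis=1, keepdims=True)  # Scale = 2
```

After Scale = 1 every row's absolute sum is 1, so the infinity norm of the scaled matrix is 1; after Scale = 2 the largest entry of each row has magnitude 1.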


AllocInit Initial allocation size. When numeric factorization starts, it first allocates the necessary workspace, part of which is of fixed size (approximately n double-precision numbers plus 12n integers). The remainder is of variable size, which grows to hold the LU factors and the frontal matrices created during factorization. An estimate of the upper bound is computed during the symbolic factorization. The numeric factorization initially allocates space for the variable-sized part equal to this estimate multiplied by AllocInit. Typically, numeric factorization needs only approximately half the estimated memory space, so a setting of 0.5 or 0.6 often provides enough memory to factorize the matrix with no subsequent increases in the size of this block. A value less than zero is treated as zero (in which case, only the minimum amount of memory needed to start the factorization is initially allocated). If the variable-size part of the workspace is found to be too small sometime after numeric factorization starts, the memory increases in size by a factor of 1.2. If this fails, the request is reduced by a factor of 0.95 until it succeeds, or until it determines that no increase in size is possible. Automatic dynamic memory management (garbage collection) then occurs. These two factors (1.2 and 0.95) are fixed control parameters and cannot be changed at run-time. Therefore, with the default control settings, the upper bound is reached after two reallocations (0.7 · 1.2 · 1.2 = 1.008). Changing this parameter has no effect on fill-in or operation count. It has a small impact on run-time (the extra time required to perform the garbage collection and memory reallocation). Default: 0.7.

FrontAllocInit When UMFPACK starts the factorization of each ‘chain’ of frontal matrices, it allocates a working array to hold the frontal matrices as they are factorized. The symbolic factorization computes the size of the largest possible frontal matrix that could occur during the factorization of each chain.If FrontAllocInit is > 0, the following strategy is used. If the AMD ordering was used, this nonnegative parameter is ignored. Otherwise, a front of size FrontAllocInit multiplied by the largest front possible for this chain is allocated.If FrontAllocInit is negative, a front of size -FrontAllocInit is allocated (where the size is in terms of the number of numeric entries). This is performed regardless of the ordering method or strategy used. Default: 0.5.

The following parameter applies to the solve step.

IrStep The maximum number of iterative refinement steps to attempt. A value less than zero is treated as zero. Default: 2.


Part III SUPER

This part contains chapters regarding the direct linear solver SUPER and is intended for users of Sentaurus Device:

CHAPTER 5 USING SUPER ON PAGE 21 provides background information on SUPER.

CHAPTER 6 CUSTOMIZING SUPER ON PAGE 23 describes the .superrc file, which is used to customize SUPER.

CHAPTER 7 IMPLEMENTING SUPER ON PAGE 25 discusses the algorithms used in SUPER.

CHAPTER 8 PERFORMANCE OF SUPER ON PAGE 43 documents the performance of SUPER on different workstations and supercomputers.


CHAPTER 5 Using SUPER

SUPER is a library that contains a set of block-oriented and nonblock-oriented, supernodal, factorization algorithms for the direct solution of sparse structurally symmetric linear systems. It is a fast direct solver for the multidimensional semiconductor device simulator Sentaurus Device, where the solution of structurally symmetric sparse linear systems of equations (typically written in the form Ax = b) is the main task consuming most of the processor time.

Advances in sparse matrix technology have resulted in supernodal linear solvers. The key idea behind this technique is based on the concept of a supernode [21]. In the course of the factorization of the coefficient matrix, supernodes are identified as a set of consecutive columns in the factor L of the LU decomposition with the following structural properties.

Assume {k, k+1, …, k+r} is a set of consecutive columns and η(k) denotes the number of nonzero entries in column k of the factor L. If all columns k+i, i = 0…r, share the same sparsity structure below row k+r and η(k+i) = η(k+r) + r − i, i = 0…r, the set {k, k+1, …, k+r} forms a supernode [36].

In other words, a supernode formed by s adjacent columns consists of two blocks: a dense diagonal block of size s × s and a block of width s below the diagonal block where all columns share the same sparsity pattern. Due to structural symmetry, the term 'supernode' can also apply to the rows of the factor U. For simplification, this manual restricts its considerations mainly to the columns of factor L. Figure 3 illustrates a supernode.

Figure 3 Example of a supernode

Supernodes offer a significant advantage for numeric factorization: a column j being computed is modified by either all or none of the columns of a supernode S, which updates column j [44]. Additionally, if column j has an identical sparsity structure compared to the columns of supernode S below row j, updating column j is a dense operation, meaning that no index list is needed to reference the various elements. This is also true for column updates within the same supernode. The fact that dense linear algebra operations can be performed in those cases reduces memory traffic and increases the computational efficiency. This is documented in a number of papers [20][21][38].
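As a sketch of this definition, consecutive columns of a 0/1 sparsity pattern can be grouped into supernodes with a simplified pairwise test (hypothetical helper, not SUPER's implementation):

```python
import numpy as np

def find_supernodes(pattern):
    """Group consecutive columns of a 0/1 lower-triangular sparsity
    pattern into supernodes: column j joins its predecessor if both
    share the same structure below row j and the subdiagonal entry of
    the previous column is nonzero (dense diagonal block)."""
    n = pattern.shape[1]
    supernodes, start = [], 0
    for j in range(1, n + 1):
        if j < n:
            same_below = np.array_equal(pattern[j + 1:, j] != 0,
                                        pattern[j + 1:, j - 1] != 0)
            dense_diag = pattern[j, j - 1] != 0
            if same_below and dense_diag:
                continue                       # column j extends the supernode
        supernodes.append(list(range(start, j)))
        start = j
    return supernodes

pattern = np.array([[1, 0, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 0],
                    [1, 1, 0, 1],
                    [1, 1, 1, 1]])
groups = find_supernodes(pattern)   # columns 0 and 1 form one supernode
```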


SUPER incorporates the advances in supernodal sparse matrix technology towards the most efficient solution of a given linear system. SUPER contains a set of nine supernodal factorization methods that provide excellent performance on both RISC and vector machines (see Chapter 8 on page 43).

Users can fine-tune SUPER, although this is not necessary, since all tunable parameters have built-in default values or are automatically set during execution. Some parameters relate to measured times during execution; therefore, they influence the computational behavior on different hardware platforms.


CHAPTER 6 Customizing SUPER

This chapter discusses the customization that is possible for SUPER.

The .superrc file

Users can tailor SUPER behavior to their own preferences by modifying the parameters specific to SUPER in the .superrc file. The software uses the following procedure to search for this configuration file. First, SUPER checks whether the environment variable SUPERRC is set. This environment variable must contain the absolute path name of the directory that contains the .superrc file. SUPER checks whether the .superrc file exists; if so, the configuration file is used. If the environment variable SUPERRC is not set or the directory specified does not contain a .superrc file, the home directory of the user is searched. Finally, if neither location contains a .superrc file, the configuration file is sought in the current directory. This hierarchical concept allows:

A group of users to share a common .superrc file by specifying its location in the SUPERRC environment variable.

Individual users to have their own personal global .superrc file found in their home directory.

Individual configuration files to be used when placed in the current working directory.

SUPER uses default settings if no configuration file is found.

In this section, the grammar of the input language is presented. Terminal symbols are presented in Courier font and nonterminal symbols are uppercase and italicized:

STATEMENTS ← STATEMENT
           | STATEMENTS , STATEMENT

STATEMENT ← factorization_type = FACTORIZATION_METHOD
          | write { INTEGER_LIST }
          | write ( FORMAT )
          | write ( FORMAT ) { INTEGER_LIST }
          | write

FACTORIZATION_METHOD ← column_supernode_0
                     | column_supernode_1
                     | column_supernode_2
                     | column_supernode_3
                     | block_supernode_0
                     | block_supernode_1
                     | block_supernode_2
                     | block_supernode_3
                     | block_supernode_4

FORMAT ← blsmp
       | matlab

INTEGER_LIST ← INTEGER
             | INTEGER_LIST : INTEGER

The value of factorization_type specifies the factorization to be used. The factorization within SUPER is performed using supernodal techniques. Generally, two types of supernodal approaches are available: column supernode and block supernode (see Sparse supernodal factorization algorithms on page 31).

SUPER contains four versions of the column supernode approach and five versions of the block supernode approach. In terms of memory consumption, column supernode methods are preferred over block supernode algorithms. The algorithm column_supernode_2 uses minimal space and the algorithm block_supernode_1 requires maximal space. Conversely, if speed is an important consideration, block supernode approaches should be considered as they reduce memory traffic and support data locality. Chapter 8 on page 43 has a detailed comparison. By default, SUPER uses column_supernode_1.

The write statement is used to write linear systems in ASCII representation to files. The parameter INTEGER_LIST must contain nonnegative numbers separated by colons. It determines at which invocations of SUPER the output file(s) are to be generated. The list does not have to be in increasing order. If INTEGER_LIST is missing, the first ten invocations of SUPER generate the file output.

The parameter FORMAT determines the format of the output (blsmp or matlab). If the blsmp format is selected, the matrix, the right-hand side, and the solution of the linear system are written to the file nsuper_blsmp_real_index.txt or nsuper_blsmp_complex_index.txt. If the matlab format is selected, the output is sent to the file nsuper_matlab_real_index.m or nsuper_matlab_complex_index.m. By default, no output is generated.

In many cases, users can completely ignore setting up a special .superrc file and can rely on the defaults. However, there is no way to change the default settings without modifying the corresponding parameter in the .superrc file. In addition, the .superrc file is read only once, at the initial invocation of SUPER.

An example of a .superrc file is:

factorization_type = block_supernode_4,
write (blsmp) {5:9}

These settings in the .superrc file instruct SUPER to use the factorization algorithm block_supernode_4. The write statement instructs SUPER to generate ASCII files, in blsmp format, of the fifth and ninth linear systems solved.


CHAPTER 7 Implementing SUPER

Typically, users want to solve a linear system of the form:

Ax = b   (5)

where A is the structurally symmetric coefficient matrix of the system, b denotes the right-hand side, and x is the vector of all unknowns, commonly referred to as the solution. A permutation matrix P is used to apply row and column permutations to the coefficient matrix A. Now, the linear system Eq. 5 becomes:

PAPᵀ x̃ = b̃   (6)

where x̃ = Px and b̃ = Pb. The permuted coefficient matrix PAPᵀ is decomposed into two triangular factors L and U, for example:

PAPᵀ = LU   (7)

Eventually, the linear system Eq. 6 is solved by forward and backward substitution:

Ly = Pb,  Ux̃ = y   (8)

Finally, the solution x of the original linear system Eq. 5 is obtained by left-multiplying x̃, the solution of Eq. 6, with Pᵀ [26].
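A small dense walk-through of Eq. 5 to Eq. 8 (the permutation here is arbitrary; SUPER chooses P to reduce fill-in and works on sparse matrices):

```python
import numpy as np

A = np.array([[4., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
b = np.array([1., 2., 3.])
P = np.eye(3)[[2, 0, 1]]            # an example permutation matrix P
B = P @ A @ P.T                     # permuted coefficient matrix (Eq. 6)
L, U = np.eye(3), B.copy()          # factor B = L U (Eq. 7); no pivoting needed here
for k in range(3):
    for i in range(k + 1, 3):
        L[i, k] = U[i, k] / U[k, k]
        U[i, k:] -= L[i, k] * U[k, k:]
y = np.linalg.solve(L, P @ b)       # Eq. 8: L y = P b   (forward substitution)
xt = np.linalg.solve(U, y)          #        U x~ = y    (backward substitution)
x = P.T @ xt                        # solution of Eq. 5: x = P^T x~
```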

Technically, the solution process of SUPER has six distinct phases, leading to a modular code that is easier to maintain and optimize. This approach has been used in other solver packages such as SPARSPAK [4] and YSMP [27]. The phases are:

Structure input

Reordering

Symbolic factorization

Numeric value input

Numeric factorization

Numeric solution

During structure input, the solver reads the nonzero structure of the lower triangle of the coefficient matrix A and generates a full adjacency structure of A, which is passed to the reordering phase.

Reordering is a very important phase in the solution process. The goal of applying row and column permutations to the coefficient matrix A is to minimize the size of its factors L and U. Any additional


nonzero entry in the decomposition is called a fill-in entry. In terms of computational cost (that is, memory consumption and execution time), the user may want to retain the nonzero structure of the coefficient matrix in its factors or at least reduce growth to a minimum. Although there is no minimum fill-in reordering scheme [48], a number of heuristics, mainly using graph theoretical approaches, produce near-to-optimal reorderings. Among these approaches, the minimum degree reordering heuristic has proven to be most effective [29]. In this solver, an enhanced minimum degree algorithm called the multiple minimum degree (MMD) algorithm is used [6][33]. Its motivation is based on the observation that, in the course of elimination, expensive degree updates can be saved if nodes of the same degree are eliminated simultaneously, thereby producing supernodes as a side effect [38]. How multiple minimum degree (MMD) works on page 27 presents the MMD algorithm in detail.

When the coefficient matrix is reordered, it is desirable to predetermine the structure of its factors L and U. This process is referred to as symbolic factorization [37]. Knowing the factor structure, users can preallocate the necessary memory space for the remainder of the solution process.

So far, only preliminary steps toward the numeric solution of the linear system have been performed. The numeric value input phase is now the preparation step for numeric computation; the numeric values of the coefficient matrix A are read into their memory locations, simultaneously applying the row and column permutations found in the reordering phase.

The numeric factorization is the most time-consuming part of the solution process. Extensive research to find optimal performance in terms of speed and memory requirements has led to supernodal techniques [41]. Column supernode and block supernode (also referred to as supernode–node and supernode–supernode, respectively) algorithms are implemented. Both methods are schematically depicted in Figure 4.

Figure 4 Illustration of supernode column (a) and block supernode updating (b)

Column supernode updating describes a technique where only one column of the factor L is computed at a time. Consider Figure 4 (a): column j is updated by supernode S. Computing this update is mathematically expressed in the term:

j = j − M(Dv)   (9)

also known as a DGEMV operation in BLAS terminology [24]. Computing M(Dv) is a dense operation that requires no indirect addressing.


When the result of this matrix–vector product is subtracted from vector j, the elements of the resulting vector need to be scattered into their corresponding positions only.

Block supernode factorization operates on groups of columns or a complete supernode at the same time instead of merely focusing on a single column. It must compute:

J = J − ML(D MU)   (10)

representing a DGEMM operation [7]. Block supernode methods mainly involve dense matrix–matrix multiplications, thereby reducing memory traffic. Analogous to column supernode methods, indirect addressing is necessary when the results of the dense matrix–matrix multiplication are scattered into the updated supernode. Since DGEMV and DGEMM operations are highly efficient computational kernel routines, their use during numeric factorization significantly speeds up the decomposition. Sparse supernodal factorization algorithms on page 31 describes all supernodal algorithms implemented in SUPER.
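The block supernode update can be sketched with NumPy: the update itself is a dense DGEMM-like product, and indirect addressing appears only in the final scatter (the shapes and index lists below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
ML = rng.standard_normal((6, 2))       # subdiagonal block of the source supernode (L part)
MU = rng.standard_normal((2, 3))       # corresponding rows of U
D = np.diag(rng.standard_normal(2))    # diagonal block of the pivots
update = ML @ (D @ MU)                 # dense DGEMM-like product, no indirect addressing

rows = np.array([4, 5, 7, 8, 10, 11])  # global row indices of ML's rows (scatter list)
cols = np.array([4, 5, 7])             # global column indices of the target supernode
J = np.zeros((12, 12))
J[np.ix_(rows, cols)] -= update        # scatter: the only indirect addressing
```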

The final step in the solution process is the numeric solution phase. The solution is found using forward and backward substitution to exploit the supernodal partitioning of the factors. Detailed discussions of this are documented in the literature [21][34][38][39].

How multiple minimum degree (MMD) works

Before going into detail, a few preliminary terms must be defined for subsequent use.

Let G = (V, E) be a graph.

Def.: adjacency set

Let v ∈ V; adj(v) = {w ∈ V | (v, w) ∈ E}. (The adjacency set adj(v) for any v ∈ V consists of all nodes w ∈ V, which are directly connected with v through an edge from set E.)

Def.: indistinguishable¹

Let v, w ∈ V; v is indistinguishable from w :⇔ adj(v) ∪ {v} = adj(w) ∪ {w}. (Two nodes v, w ∈ V are said to be indistinguishable if and only if v and w have identical adjacency sets and each node is contained in the other's adjacency set².)

As previously mentioned, MMD is a variant of the minimum degree (MD) ordering algorithm. Its concept is based on the observation that, during elimination, expensive degree updates can be saved if nodes of the same minimum degree are eliminated simultaneously. For indistinguishable nodes, it can be shown that they are eliminated consecutively when MD is used.

1. The concept of indistinguishable nodes is covered extensively in the literature [4].
2. Practically, this defines the term clique where all nodes are connected to each other.


Algorithm 1 lists the MMD algorithm. Initially, S is set equal to the empty set and the degrees of all nodes in V are computed. Next, a set T is determined, which contains all nodes from V \ S that have minimum degree. Mass elimination is performed over all elements of T. On entry, all elements (nodes) are unflagged (unmarked). Next, a node y ∈ T must be selected. The criteria that set out how to select elements from T are called tie-breaking strategies.

Effective tie-breaking is known to improve numeric factorization since the fill-in of the factor L can be reduced significantly [29]. SUPER does not implement any of the commonly used tie-breaking strategies used in other well-known solver packages¹. Instead, SUPER uses random tie-breaking, which is the selection of elements without intelligence, mostly implied by the underlying data structure.

After an element y ∈ T is chosen, the algorithm determines the set Y that contains all elements of T indistinguishable from y². When Y is computed, all elements of Y and the adjacency set of Y, adj(Y), are flagged. There are two reasons for this. First, flagging the nodes of set Y prevents double-accessing indistinguishable nodes; that is, nodes found to be indistinguishable from y, the current node, do not have to be looked at while mass elimination proceeds, because they are eliminated with y. Second, nodes that lie in adj(Y) must be marked for a degree update, because some of their neighbors, namely some or all elements of Y, are eliminated. This means their current degree was modified.

Finally, set Y is unified with set S, and mass elimination starts over with another element y ∈ T until no unflagged element remains. Then, the graph representation of the remaining nodes from V \ S is computed. Simultaneously, all flagged nodes in V \ S undergo a degree update. Finally, the non-eliminated nodes are unmarked and the algorithm continues until S = V.

Algorithm 1 Multiple minimum degree (MMD) algorithm

1. MA27 (Harwell Laboratories), SPARSPAK (University of Waterloo), and YSMP (Yale University).
2. Element y is trivially indistinguishable from itself.


S = ∅
for x ∈ V do
    δ(x) = |adj(x)|
end for
while S ≠ V do
    set T = {y ∈ V \ S | δ(y) = min over x ∈ V \ S of δ(x)}
    for y ∈ T do
        if y is not marked do
            set Y = {x ∈ T | x indistinguishable from y}
            for all nodes x ∈ Y do
                order x next
            end for
            mark all nodes in adj(Y) and Y
            S = S ∪ Y
        end if
    end for
    eliminate all marked nodes in S from the graph
    for all marked nodes x ∈ V \ S do
        δ(x) = |adj(x)|
    end for
    unmark all nodes
end while
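The mass-elimination loop above can be sketched as follows; this is a simplified model, with explicit clique (fill-in) updates instead of the quotient-graph machinery of real MMD codes, and sorted instead of SUPER's random tie-breaking:

```python
def mmd_order(adj):
    """Eliminate nodes by multiple minimum degree with mass elimination.
    `adj` maps each node to its set of neighbors (undirected graph) and
    the function returns the elimination order."""
    g = {v: set(ns) for v, ns in adj.items()}
    order, remaining = [], set(g)
    while remaining:
        dmin = min(len(g[v]) for v in remaining)
        T = sorted(v for v in remaining if len(g[v]) == dmin)
        marked = set()
        for y in T:
            if y in marked:
                continue
            # Y: nodes of T indistinguishable from y (identical closed
            # adjacency sets); they are eliminated together
            Y = {x for x in T
                 if x not in marked and g[x] | {x} == g[y] | {y}}
            nbrs = set().union(*(g[v] for v in Y)) - Y
            for a in nbrs:              # neighbors gain fill-in and lose Y
                g[a] |= nbrs - {a}
                g[a] -= Y
            for v in Y:
                del g[v]
            order.extend(sorted(Y))
            marked |= Y | nbrs          # flag eliminated nodes and their neighbors
            remaining -= Y
    return order

order = mmd_order({1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}})
```

On the path graph 1–2–3–4, both endpoints (degree 1) are mass-eliminated in the first pass, then the indistinguishable pair {2, 3} goes together.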


Example: MMD execution

Figure 5 provides the symmetric pattern of the matrix A where '•' denotes a nonzero entry.

Figure 5 Sample sparse matrix A

Figure 6 illustrates the graph representation of A.

Figure 6 Graph representation of sample matrix A

The numbering in the graph is equal to the line numbering of the matrix. The initial minimum degree of the graph is 3¹. Therefore, the ordering algorithm starts with:

S = ∅,  T = {10, 9, 8, 6, 3, 2, 1}   (11)

Now, y = 10 is chosen from T. The only node indistinguishable from y = 10 is the node with the number 6, yielding Y = {(10, 6)}². The adjacency set adj(Y) contains the nodes 2 and 8 that, therefore, are flagged (indicated by '+'). S becomes S = {(10, 6)}. After the first loop through the mass elimination step:

S = {(10, 6)},  T = {10+, 9, 8+, 6+, 3, 2+, 1}   (12)

1. Self-loops are neglected.
2. Parentheses are only used to identify groups of indistinguishable nodes.


The second loop finds y = 9 and Y = {(9, 3)}, since node 3 is indistinguishable from node 9. Nodes 7 and 5 are marked because they are adjacent to Y. By the end of the loop:

(13)

The node is the only unflagged node left in . has no indistinguishable nodes besidesitself. Therefore, only is eliminated, leaving adjacent node 4 flagged. All elements of are nowflagged and the algorithm proceeds to the degree update step. Figure 7 shows the graph representationof the remaining nodes all of which had their degree updated because they were all flagged.

Figure 7 Elimination graph after first loop through multiple mass elimination

The new minimum degree is 2, which yields:

(14)

The algorithm finds nodes 7 and 5 as well as nodes 8 and 2 to be indistinguishable, respectively. Theyare eliminated leaving only node 4. The reordering sequence or permutation is now computed to be:

(15)

Applying this permutation to the matrix results in the structure shown in Figure 8.

Figure 8 Sample matrix A reordered with MMD

y 9= Y 9 3,( ){ }=Y

S 10 6,( ) 9 3,( ),{ }= T 10+ 9+ 8+ 6+ 3+ 2+ 1, , , , , ,{ }=

y 1= T y 1=y 1= T

S 10 6,( ) 9 3,( ) 1, ,{ }= T 7 8 5 2, , ,{ }=

P 10 6 9 3 1 7 5 8 2 4, , , , , , , , ,( )=

A

PAPT =
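The symmetric reordering in Eq. (15) can be reproduced mechanically. The sketch below is illustrative only (the dense list-of-lists representation and function name are choices made here): it applies the 1-based permutation P to both rows and columns, which is exactly the PAPᵀ operation that produces Figure 8.

```python
# Apply a 1-based permutation P symmetrically: B = P A P^T, i.e.
# B[r][c] = A[P[r]-1][P[c]-1]. Rows and columns are reordered alike,
# so a symmetric sparsity pattern stays symmetric.
P = [10, 6, 9, 3, 1, 7, 5, 8, 2, 4]   # the MMD permutation from Eq. (15)

def permute_symmetric(A, P):
    idx = [p - 1 for p in P]          # convert to 0-based indices
    return [[A[i][j] for j in idx] for i in idx]

# Demo on a 10x10 matrix whose entry encodes its original (row, column):
A = [[10 * i + j for j in range(10)] for i in range(10)]
B = permute_symmetric(A, P)
assert B[0][0] == A[9][9]   # row/column 10 of A becomes row/column 1 of B
```

Since the actual pattern of the sample matrix A (Figure 5) is not reproduced here, the demo matrix is a stand-in that only tracks where entries move.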


Performing symbolic factorization on this matrix reveals the sparsity pattern of the factor L, which is depicted in Figure 9, where the columns have been renumbered.

Figure 9 Sparsity structure of factor L of A

NOTE The sparsity structures for PAPᵀ and L are similar; L has two additional nonzero fill-in entries (indicated by 'o'). In addition, L consists of groups of columns that share the same sparsity pattern, such as columns 10 and 6, or 9 and 3 (indicated by the dashed rectangles).

These groups of columns correspond to the sets of indistinguishable nodes Y as they are found in the course of the mass elimination step. These groups form supernodes [38]. Supernodes play an important role in improving the performance of the numeric factorization. SUPER is focused entirely on the supernodal update scheme. It takes advantage of the fact that a column j update depends on all previous columns of the same supernode and on all nodes of other supernodes that update this column.

Using BLAS terminology [7][24][30], the first type of update mentioned involves dense SAXPY operations, whereas the second type performs so-called indexed SAXPY or SAXPYI operations [21][32]. Additionally, updating a column by a supernode requires one gather and one scatter operation, whereas node–node updates require as many such operations as there are updating nodes [21]. Therefore, memory traffic is reduced and numeric factorization is accelerated, especially on machines with hardware-supported gather and scatter operations.
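The two update flavors can be sketched in a few lines. In the hypothetical Python below (written for illustration, not taken from SUPER or any BLAS library), `saxpy` is the dense kernel operating on contiguous vectors, while `saxpyi` scatters through an index vector; the extra indirection per element is exactly the memory traffic that the supernodal scheme amortizes over one gather/scatter pair.

```python
def saxpy(a, x, y):
    """Dense SAXPY: y <- y + a*x over contiguous vectors."""
    return [yi + a * xi for xi, yi in zip(x, y)]

def saxpyi(a, x, ind, y):
    """Indexed SAXPY (SAXPYI): y[ind[i]] <- y[ind[i]] + a*x[i].
    Every element costs one extra index load (the scatter)."""
    y = list(y)
    for xi, i in zip(x, ind):
        y[i] += a * xi
    return y

assert saxpy(2.0, [1.0, 2.0], [10.0, 20.0]) == [12.0, 24.0]
assert saxpyi(2.0, [1.0, 2.0], [0, 3], [0.0] * 4) == [2.0, 0.0, 0.0, 4.0]
```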

Sparse supernodal factorization algorithms

Generally, matrix reordering and numeric factorization are the parts of a direct solver package where most of the execution time is spent. Depending on the algorithm and its implementation, the time necessary to reorder the input matrix can vary significantly and can even dominate the factorization time. Nevertheless, these are rare cases, since the reordering algorithm does not have to deal with any fill-in that occurs during LU decomposition. This leaves numeric factorization as the part to focus on for performance improvements.

Factorization algorithms based on supernodal techniques have proven to be superior to former general approaches [21][38][43][45].


The following subsections describe several supernodal factorization algorithms implemented in SUPER. These algorithms fall into two different classes: column and block supernode update schemes. Table 5 lists the symbols used here.

Table 5 List of symbols

Symbol              Description
J, K                Supernodes of the LDU decomposition
j, k                Nodes, that is, columns or rows of a supernode
NS                  Number of supernodes
tL, tU              Temporary work vectors
TL, TU              Temporary blocks of workspace
A(*,j), A(j,*)      A column or row of the coefficient matrix A
A(*,J), A(J,*)      A column or row block of the coefficient matrix A
L(*,j) (L(*,J))     A (block) column of the factor L
U(j,*) (U(J,*))     A (block) row of the factor U
c(i) (r(i))         i-th element of column (row) vector c (r)
d(j)                j-th diagonal element of the matrix D of the LDU decomposition
im, ri              Index vectors
[L(*,j)]_im         Scattering into column L(*,j) is performed using index map im
n                   Number of equations of the linear system
ne                  Number of off-diagonal nonzero entries in the lower half or upper half of A
|L|                 Number of nonzero entries in the factor L
|S|                 Number of supernodes
maxcol              Maximum number of nonzero entries in a column of L
maxsup              Maximum number of columns in a supernode


Column supernode algorithms

Supernode–node updating describes a technique where only one column or row of the factors L and U is computed at a time, although the corresponding supernode may consist of several columns or rows. Algorithm 2 lists the first algorithm implementing this technique.

Algorithm 2 column_supernode_0

tL ← 0; tU ← 0
for J = 1 to NS do                                   (c0.1)
    for j ∈ J (in order) do                          (c0.2)
        [tL]_ind ← A(*,j)
        [tU]_ind ← A(j,*)
        for all K updating j do                      (c0.3)
            if (K and J have same sparsity pattern)  (c0.4)
                collect dense updates
            else
                for k ∈ K do                         (c0.5)
                    CRmod_i(tL, tU, ind, j, k)
                end for
            end if
        end for
        [L(*,j)]_ind ← tL; tL ← 0
        [U(j,*)]_ind ← tU; tU ← 0
        for all dense updates k do                   (c0.6)
            CRmod_d(L(*,j), U(j,*), j, k)
        end for
        CRdiv(j)
    end for
end for

Initially, the algorithm reveals the general form of supernode–node updating algorithms: a triple-nested for-loop (indicated with indices c0.1 to c0.3). The outermost loop runs over all supernodes that were generated in the reordering and symbolic factorization steps. The next for-loop (c0.2) proceeds one level deeper and scans over all nodes of the current supernode, starting with the smallest index.

NOTE The product of the loop lengths for loops c0.1 and c0.2 is always equal to the dimension of the matrix A.

Finally, the innermost loop (c0.3) handles the contribution of all updating supernodes K to the current node j. Furthermore, three computationally intensive kernels, CRmod_{i,d} and CRdiv (see Algorithm 3 and Algorithm 4 on page 34, and Algorithm 5 on page 35), are typical for LU decomposition methods [28][38].

CRmod_i and CRmod_d describe the necessary operations to calculate the update of column L(*,k) and row U(k,*) on the current column j using indexed SAXPY [21][23] and dense SAXPY [30] operations, respectively. The contribution of these two vectors is then accumulated into the column vector c and the row vector r. CRdiv describes the scaling procedure after column or row j has been updated. All of these kernel routines can be vectorized, thereby running very efficiently on machines with vector capabilities.


A third task, which is also common to all algorithms implemented in SUPER, is the determination of the row structure of the factor L (or, identically, the determination of the column structure of U). This row structure is required to find all supernodes updating the current column j (see loop c0.3 in Algorithm 2 on page 33). As described in [38], it is not necessary to calculate the row structure of L beforehand, since it can be efficiently generated during factorization.

Specific to this algorithm is the use of the temporary vectors tL and tU, and, as a result, the implementation of CRmod_{i,d} and CRdiv. Vectors tL and tU contain intermediate results for the factors L and U, respectively. Both vectors are of length n, where n is the dimension of the matrix A of the linear system. Initially, tL and tU are set to zero. Then, for every column or row j to be computed (loop c0.2), column A(*,j) is loaded into tL and row A(j,*) is loaded into tU.

This is performed by expanding (scattering) the densely stored column or row elements of A into their corresponding positions in tL and tU. Hereby, it is possible to accumulate all indexed updates to column j without repeatedly storing the contents of the temporary vectors tL and tU into factor storage and simultaneously zeroing out both vectors. Additionally, the index vector ind (loop c0.5) simply holds the row structure of the current column j, which does not have to be computed, since it is provided by the symbolic factorization. Doing this significantly reduces memory traffic at the cost of comparably little storage overhead.¹ In addition to saving memory transfers, algorithm column_supernode_0 increases computational efficiency by collecting all dense updates (collected in statement c0.4) and executing them in one block in loop c0.6. This requires additional storage to keep track of all nodes that share the same sparsity pattern as column/row j, but provides for a compact dense update procedure. After column j has been computed, it must be scaled by its diagonal d(j). This is performed in the kernel routine CRdiv.

NOTE The computation of the scaling diagonal d(j) is performed along with the column/row L(*,j)/U(j,*) instead of calculating its value separately. The data structures used were dimensioned to have extra space for the diagonal element, thus exploiting vectorization capabilities on the different hardware platforms.

1. Compared to the fill-in size.

Algorithm 3 CRmod_d kernel

for i = j to n do
    c(i) = c(i) − l(i,k) d(k) u(k,j)
    r(i) = r(i) − u(k,i) d(k) l(j,k)
end for

Algorithm 4 CRmod_i kernel

for i = j to n do
    l = ind(i)
    c(l) = c(l) − l(i,k) d(k) u(k,j)
    r(l) = r(l) − u(k,i) d(k) l(j,k)
end for
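The CRmod_d and CRdiv operations can be exercised on a tiny dense example. The sketch below is a toy dense LDU factorization written for this note (real SUPER code works on sparse compact storage with one supernode covering several columns); it applies the CRmod_d update formula once per earlier column and the CRdiv scaling per column.

```python
def ldu(A):
    """Toy dense LDU factorization built from the CRmod_d / CRdiv formulas:
    A = L * diag(d) * U with unit diagonals on L and U."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for j in range(n):
        c = [A[i][j] for i in range(n)]   # current column, playing the role of tL
        r = [A[j][i] for i in range(n)]   # current row, playing the role of tU
        for k in range(j):                # CRmod_d: updates by all columns k < j
            for i in range(j, n):
                c[i] -= L[i][k] * d[k] * U[k][j]
                r[i] -= U[k][i] * d[k] * L[j][k]
        d[j] = c[j]                       # CRdiv: scale by the diagonal d(j)
        L[j][j] = U[j][j] = 1.0
        for i in range(j + 1, n):
            L[i][j] = c[i] / d[j]
            U[j][i] = r[i] / d[j]
    return L, d, U

L, d, U = ldu([[4.0, 2.0], [6.0, 5.0]])
assert d == [4.0, 2.0] and L[1][0] == 1.5 and U[0][1] == 0.5
```

Multiplying the three factors back together reproduces A, which is a quick way to sanity-check the kernel formulas.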


Algorithm 5 CRdiv kernel

d(j) = l(j,j)
for i = j+1 to n do
    l(i,j) = l(i,j) / d(j)
    u(j,i) = u(j,i) / d(j)
end for

Algorithm 6 Setup of vector im

i = 0
for all row indices k of column j do
    im(k) = i
    i = i + 1
end for

Algorithm 7 is an enhanced version of the previous algorithm. In this case, it was feasible to reduce the storage overhead introduced by the temporary vectors tL and tU.

Algorithm 7 column_supernode_1

tL ← 0; tU ← 0; im ← 0
for J = 1 to NS do
    set up vector im                                 (c1.1)
    for j ∈ J (in order) do
        [tL]_im ← A(*,j)                             (c1.2)
        [tU]_im ← A(j,*)
        for all K updating j do
            if (K and J have same sparsity pattern)
                collect dense updates
            else
                for k ∈ K do                         (c1.3)
                    CRmod_i(tL, tU, im, j, k)
                end for
            end if
        end for
        L(*,j) ← tL; tL ← 0                          (c1.4)
        U(j,*) ← tU; tU ← 0
        for all dense updates k do
            CRmod_d(L(*,j), U(j,*), j, k)
        end for
        CRdiv(j)
    end for
end for

Instead of occupying space for 2·n real numbers, algorithm column_supernode_1 needs only¹ 2·(MAXCOL + 1), where MAXCOL denotes the maximal number of nonzero entries in a column of L excluding the diagonal element. In 2D and 3D device simulation, where n is typically greater than 5000, MAXCOL² is much smaller than n [40].

1. MAXCOL + 1 is needed here to hold the diagonal element of the current column.
2. Experimental results revealed MAXCOL to be less than 10% of n in 2D device simulation.


In turn, we use a technique called relative indexing [39][46] so that algorithm column_supernode_1 can use smaller temporary vectors. Relative indexing introduces an additional vector im¹ of length n (c1.1). Nevertheless, the total amount of overhead storage required for algorithm column_supernode_1 is approximately 60% of that used in column_supernode_0.

Algorithm 6 on page 35 shows the vector setup. Basically, the row index vector for the first column of supernode J is scanned, and the corresponding position in vector im is set to the value of the integer variable i, which is incremented by one after each assignment, starting with zero. Thereby, referencing im for a row index k returns the relative position of the corresponding column element c(k) within tL.

NOTE The row index vector is stored in decreasing order (looking at the column from the bottom) by the symbolic factorization phase of the solver.

Vector im is then used to copy the nonzero elements of column A(*,j) or row A(j,*) into tL and tU (c1.2) and to perform the indexed updates in loop c1.3. Both operations take advantage of the fact that the sets of row indices for A(*,j) and the updating supernodes K up to row j form a subset of column j's set of row indices in the factor L [35].

This is also the reason why im does not have to be reset to zero when all nodes j of supernode J have been computed; this reduces memory traffic. Finally, storing the contents of tL and tU into factor storage (c1.4) does not require indirect addressing and can be performed one by one, because tL/tU and L(*,j)/U(j,*) share the same sparsity pattern.

Next, algorithm column_supernode_2 (see Algorithm 8 on page 37) is introduced, which implements a major change compared to algorithm column_supernode_1 dealt with previously. Instead of loading column A(*,j) or row A(j,*) of the coefficient matrix A into a temporary work space, the contents are directly transferred into the appropriate places of L(*,j) and U(j,*), respectively (see c2.1).

In this case, since the temporary work vectors tL and tU are not required, it is possible to further reduce memory consumption. Since all computation is performed within factor space, additional data transfers, and scatter and add operations caused by intermediate results, can also be saved (see c1.4 in Algorithm 7 on page 35). Consequently, algorithm column_supernode_2 uses the least amount of memory of all algorithms considered in this section.

1. Acronym for index map.
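The index-map idea behind Algorithm 6 fits in a few lines. The Python below is a minimal sketch under assumed conventions (names `setup_im` and `gather`, and the (row, value) pair representation, are invented here for illustration): `im` translates global row indices into positions inside a short temporary vector of length maxcol.

```python
def setup_im(row_index):
    """Algorithm 6 sketch: im maps a global row index k to its relative
    position within the compact column storage of the current supernode."""
    return {k: i for i, k in enumerate(row_index)}

def gather(pairs, im, length):
    """Gather sparse (row, value) pairs of a column A(*,j) into a short
    temporary vector tL of size maxcol via the index map."""
    tL = [0.0] * length
    for k, v in pairs:
        tL[im[k]] = v
    return tL

im = setup_im([9, 7, 4, 2])     # row structure, stored from the bottom up
assert im[4] == 2
assert gather([(7, 3.0), (2, 5.0)], im, 4) == [0.0, 3.0, 0.0, 5.0]
```

Because updating supernodes touch only rows that also appear in column j's structure, the same map serves both the initial gather (c1.2) and every indexed update (c1.3).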


Algorithm 8 column_supernode_2

im ← 0
for J = 1 to NS do
    set up vector im
    for j ∈ J (in order) do
        [L(*,j)]_im ← A(*,j)                         (c2.1)
        [U(j,*)]_im ← A(j,*)
        for all K updating j do
            if (K and J have same sparsity pattern)
                collect dense updates
            else
                for k ∈ K do                         (c2.2)
                    CRmod_i(L(*,j), U(j,*), im, j, k)
                end for
            end if
        end for
        for all dense updates k do
            CRmod_d(L(*,j), U(j,*), j, k)
        end for
        CRdiv(j)
    end for
end for

Algorithm 9 on page 38 shows another variant of column supernode factorization. This algorithm requires the same amount of storage overhead as algorithm column_supernode_1, but implements two significant changes in computing supernode K's update on column j (see c3.2 and c3.3).

First, like algorithm column_supernode_2, column A(*,j) or row A(j,*) of the coefficient matrix A is not loaded into temporary work space but into its appropriate place in L(*,j) and U(j,*), respectively (see c3.1). This is not necessarily advantageous concerning memory traffic, since the algorithm still uses temporary work vectors (tL and tU), which have to be merged into factor storage. The advantage over the other algorithms is assumed to unfold in the fact that supernode K's contribution updating column j can be computed as a dense SAXPY operation (see c3.2), therefore revealing the second major difference mentioned above.

Unfortunately, after tL and tU have been computed, their contents must be scattered and added to column L(*,j) or row U(j,*) using the index map im of supernode J. This is the price for being able to use dense SAXPY operations to calculate tL and tU. Experiments with real device simulation test cases have shown that the computational efficiency suffers from the resulting memory transfers. In addition, tL and tU must be reset to zero for the next supernode to update column j (see c3.3). The remainder of algorithm column_supernode_3 is identical to the algorithms previously discussed.


Algorithm 9 column_supernode_3

tL ← 0; tU ← 0; im ← 0
for J = 1 to NS do
    set up vector im
    for j ∈ J (in order) do
        [L(*,j)]_im ← A(*,j)                         (c3.1)
        [U(j,*)]_im ← A(j,*)
        for all K updating j do
            if (K and J have same sparsity pattern)
                collect dense updates
            else
                for k ∈ K do                         (c3.2)
                    CRmod_d(tL, tU, j, k)
                end for
                [L(*,j)]_im ← tL; tL ← 0             (c3.3)
                [U(j,*)]_im ← tU; tU ← 0
            end if
        end for
        for all dense updates k do
            CRmod_d(L(*,j), U(j,*), j, k)
        end for
        CRdiv(j)
    end for
end for

Looking at all the supernode–node updating algorithms previously discussed reveals that, in all cases, dense updates and column/row scaling are treated equally. Thus, we conclude that the data structures involved, as well as the execution time necessary for the two operations, do not differ (at least not significantly) in all four cases. This leaves the indexed updates and the memory references through gather and scatter operations for the temporary vectors tL and tU as the critical points for measuring how efficiently the algorithms run on different machines.

In terms of storage overhead and memory transfers, algorithm column_supernode_2 clearly is the first choice. However, if execution time is important, most machines seem to prefer column_supernode_1 over the others. In the next section, we reduce the number of scatter/gather operations by working on blocks of columns of the same supernode simultaneously.

Block supernode algorithms

Block supernode factorization operates on groups of columns or rows, or an entire supernode, at the same time instead of merely focusing on a single node. This does not reduce the number of references to memory by any means, but by grouping them together, memory fetch and store can be made more efficient, that is, using the same index map only once throughout a loop cycle. In addition, in terms of vectorization, supernode–supernode updating does not make the vectorizable loops longer (thus increasing the average vector length), but it nests the vectorizable loops one level deeper, which collapses vector work and avoids vector startup overhead.
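The payoff of block updating is that one supernode–supernode update is a small dense triple product, T ← T − L(:,K)·diag(d_K)·U(K,J), applied through a single index map instead of once per column. The Python below is an illustration written for this note, not SUPER's implementation; the workspace layout and names are assumptions.

```python
def block_update(T, LK, dK, UKJ):
    """Dense kernel of supernode-supernode updating:
    T <- T - LK * diag(dK) * UKJ, where LK holds supernode K's columns,
    dK its diagonals, and UKJ the rows of U restricted to J's columns.
    One gather/scatter through the index map then suffices for the block."""
    for i in range(len(T)):
        for j in range(len(T[0])):
            for k in range(len(dK)):
                T[i][j] -= LK[i][k] * dK[k] * UKJ[k][j]
    return T

# One workspace row, a two-column updating supernode K:
T = block_update([[0.0, 0.0]], [[1.0, 2.0]], [3.0, 4.0],
                 [[1.0, 0.0], [0.0, 1.0]])
assert T == [[-3.0, -8.0]]
```

The triple loop is exactly the shape that dense BLAS routines (and vector hardware) execute efficiently, which is why block methods pay off when supernodes are large.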


On the other hand, supernode–supernode factorization increases storage overhead considerably, since the intermediate results for more than one column or row must be retained and, in order to support this technique, other data structures must be added. Furthermore, the time necessary to perform the setup and administration of these data structures cannot be neglected.

Algorithm 10 (block_supernode_0) shows a first approach implementing this block supernodal factorization technique. Obviously, the algorithms in this section consist of a double-nested loop construct compared to the three-level nesting of supernode–node algorithms. The third level of nesting has not vanished, but is hidden in the kernels CRmod_d and CRmod_i.

Algorithm 10 block_supernode_0

TL ← 0; TU ← 0; im ← 0
for J = 1 to NS do
    set up vector im
    [TL]_im ← A(*,J)                                 (b0.1)
    [TU]_im ← A(J,*)
    for all K updating J do
        determine all j ∈ J being updated by K       (b0.2)
        CRmod_d(TL, TU, J, K)                        (b0.3)
        CRmod_i(TL, TU, im, J, K)
    end for
    for j ∈ J (in order) do                          (b0.4)
        CRmod_d(TL, TU, j, J)
        L(*,j) ← L(*,j) + TL(j); TL(j) ← 0
        U(j,*) ← U(j,*) + TU(j); TU(j) ← 0
        CRdiv(j)
    end for
end for

These kernels now consist of a double-nested loop where the inner loop remains the same as in Algorithm 3 and Algorithm 4 on page 34; the outer loop usually runs over all nodes j being updated by a supernode K.¹ The temporary vectors tL and tU had to be enlarged to hold a complete supernode. Their counterparts in this section are denoted by TL and TU; both are of length (MAXCOL + 1)·MAXSUP, where MAXSUP holds the size of the largest system supernode. For each supernode J being updated, TL and TU are loaded with the corresponding values from the coefficient matrix A (denoted A(*,J)/A(J,*)) using the index vector im.

When this is finished, block_supernode_0 determines the set of nodes j of supernode J, which are updated by supernode K (see b0.2). This set is formed by reverse scanning all column indices of supernode K and adding the corresponding node j of supernode J to the set. At the same time, the algorithm marks those nodes j, which can be computed using dense operations. Then, the dense and indexed updates are performed where the order of execution is merely implied by the underlying data structures (see b0.3).

1. This set of nodes is sometimes split into nodes that can be updated densely and nodes that require indexed updating.


After all supernodes K updating supernode J have been processed, supernode J needs to update itself (see b0.4). This is a dense operation involving each node of J. Loop b0.4 shows all operations necessary to complete the factorization of supernode J. Unfortunately, these operations cannot be applied to all nodes of J at the same time.

In Algorithm 11 (block_supernode_1), an attempt was made to increase computational efficiency by collecting the dense updates from all updating supernodes K and processing them in one separate loop (see b1.1 and b1.2). It is clear that this approach costs more in terms of both storage and computation to implement. As a result, this algorithm is only efficient if the amount of dense updates is (much) greater than the indexed ones, to trade off the additional storage and computing overhead.

Algorithm 11 block_supernode_1

TL ← 0; TU ← 0; im ← 0
for J = 1 to NS do
    set up vector im
    [TL]_im ← A(*,J)
    [TU]_im ← A(J,*)
    for all K updating J do
        determine all j ∈ J being updated by K       (b1.1)
        and collect dense updates
        CRmod_i(TL, TU, im, J, K)
    end for
    for all dense updates K do                       (b1.2)
        CRmod_d(TL, TU, J, K)
    end for
    for j ∈ J (in order) do
        CRmod_d(TL, TU, j, J)
        L(*,j) ← L(*,j) + TL(j); TL(j) ← 0
        U(j,*) ← U(j,*) + TU(j); TU(j) ← 0
        CRdiv(j)
    end for
end for

Algorithm 12 on page 41 (block_supernode_2) is designed so that it does not need to perform any indexed updates. Primarily, the matrix elements of supernode J are stored into factor storage using the index map im (see b2.1). In the next loop over all updating supernodes K, first, another index vector ri is set up. Vector ri comprises the relative indices of supernode K's column structure in relation to supernode J's column structure. ri provides an offset from the bottom of a node of J, which maps the k-th element of a node of K to the corresponding position within j. The index vector ri can, therefore, be regarded as a compact form of im applied to some supernode K updating J (see b2.2).

After ri is set up, the contribution of supernode K to the factorization of supernode J is accumulated in the temporary work arrays TL and TU as a dense operation. The result is then scattered and added into factor storage using ri (see b2.3 and b2.4).¹

1. Internally, the algorithm is more sophisticated at this point, since it knows which K shares the same sparsity pattern as J and then adds the contents of TL and TU with stride one.


Finally, the factorization of supernode J is completed by dense computations in factor storage (see b2.5). The algorithm is most efficient when there are only a few large supernodes updating another supernode. Otherwise, memory access penalties will decrease performance.

Algorithm 12 block_supernode_2

TL ← 0; TU ← 0; im ← 0
for J = 1 to NS do
    set up vector im
    [L(*,J)]_im ← A(*,J)                             (b2.1)
    [U(J,*)]_im ← A(J,*)
    for all K updating J do
        determine all j ∈ J being updated by K       (b2.2)
        simultaneously setting up vector ri
        CRmod_d(TL, TU, J, K)                        (b2.3)
        [L(*,J)]_ri ← [L(*,J)]_ri + TL; TL ← 0       (b2.4)
        [U(J,*)]_ri ← [U(J,*)]_ri + TU; TU ← 0
    end for
    for j ∈ J (in order) do
        CRmod_d(L(*,J), U(J,*), j, J)                (b2.5)
        CRdiv(j)
    end for
end for

Algorithm 13 on page 42 (block_supernode_3) is a variant of block_supernode_2. In this case, the second index map ri is omitted and indirect addressing is used explicitly (see b3.1). Furthermore, a modified version of the CRmod_{d,i} kernels is used. In the algorithms previously presented, the products d(k)·u(k,j) and d(k)·l(j,k) are precomputed immediately after setting up the index map im, and their results are stored in a temporary work space for later use. This has been changed for algorithms block_supernode_3 and block_supernode_4 (see Algorithm 14 on page 42). Both algorithms use the kernels CRmod_d and CRmod_i as they are depicted in Algorithm 3 on page 34. This leads to reduced memory requirements. Consequently, algorithms block_supernode_3 and block_supernode_4 use less space than the previously presented block supernode algorithms.


Algorithm 13 block_supernode_3

TL ← 0; TU ← 0; im ← 0
for J = 1 to NS do
    set up vector im
    [L(*,J)]_im ← A(*,J)
    [U(J,*)]_im ← A(J,*)
    for all K updating J do
        determine all j ∈ J being updated by K
        simultaneously set up vector ri
        CRmod_d(TL, TU, J, K)
        [L(*,J)]_im(ind) ← [L(*,J)]_im(ind) + TL; TL ← 0   (b3.1)
        [U(J,*)]_im(ind) ← [U(J,*)]_im(ind) + TU; TU ← 0
    end for
    for j ∈ J (in order) do
        CRmod_d(L(*,J), U(J,*), j, J)
        CRdiv(j)
    end for
end for

Algorithm 14 block_supernode_4

TL ← 0; TU ← 0; im ← 0
for J = 1 to NS do
    set up vector im
    [TL]_im ← A(*,J)
    [TU]_im ← A(J,*)
    for all K updating J do
        determine all j ∈ J being updated by K
        CRmod_i(TL, TU, im, J, K)                    (b4.1)
    end for
    for j ∈ J (in order) do
        CRmod_d(TL, TU, j, J)
        L(*,j) ← L(*,j) + TL(j); TL(j) ← 0
        U(j,*) ← U(j,*) + TU(j); TU(j) ← 0
        CRdiv(j)
    end for
end for


CHAPTER 8 Performance of SUPER

This chapter documents the performance of SUPER on several modern workstations and supercomputers. Time is given in CPU seconds using the Fortran library function etime.

NOTE All performance data presented must not be regarded as an evaluation of the variousplatforms.

Table 6 and Table 7 list the various test problems and their characteristics. Test problems t1 to t5 stem from two-dimensional semiconductor device simulations. With these five test matrices, we have attempted to cover the spectrum of typical two-dimensional simulation problems, which roughly range between 3000 and 24000 unknowns for the coupled device equations.

Table 6 List of test problems

Name Dimension Description

t1 2D MCT standard cell 20 μm design

t2 2D p-n-p lateral-magneto-transistor

t3 2D Bipolar transistor

t4 2D MOSFET

t5 2D MOSFET

Table 7 Test problem statistics

Name n ne |L| |S| maxcol maxsup

t1 3195 28737 81432 742 119 90

t2 4713 45609 224610 936 170 129

t3 6342 62142 265119 1276 155 111

t4 12024 119358 761193 2242 215 141

t5 24123 24 321 1620876 4392 233 108


Performance measurement

The performance measurement was restricted to the numeric factorization step, since this phase dominates the execution time.

NOTE Block factorization involves computing the update of a supernode on all columns of the current supernode before proceeding to the next updating supernode. This means that both the updating and updated supernode can reside in cache memory until the update process is complete. Subsequently, the columns of a new updating supernode must be read from memory; the previous updating supernode is no longer required in the course of computing the current supernode.

The scenario is different for supernode–node factorization. When the update of a supernode on the current column of an updated supernode is finished, the next updating supernode must be read from memory, until all updates have been computed. Then, the factorization algorithm continues with the next column of the current supernode. In such a situation, it may be necessary to re-read a supernode, which has already been in cache during the computation of a previous column of the current supernode, into cache memory. In this case, supernode–node factorization suffers from cache miss penalties.

Table 8 shows the timing results for the best-performing numeric factorization algorithm for all test problems. An asterisk (*) indicates block factorization.

Table 8 Timing results obtained on workstations of best-performing factorization algorithm [CPU seconds] (ordered according to t5)

Machine Test case

t1 t2 t3 t4 t5

SGI Power challenge 0.14 0.44 0.51 1.98 4.41

IBM RS/6000 591 0.11 0.43 0.47 2.22 4.85

DEC 8400 5/300* 0.12 0.52 0.54 2.35 5.23

DEC 600 5/266* 0.14 0.52 0.58 2.50 5.68

HP 9000/J210* 0.18 0.71 0.82 3.26 6.99

HP 9000/735* 0.20 0.80 0.90 3.51 7.42

Sun UltraSPARC* 0.25 1.05 1.12 5.14 11.55

Sun S20-71* 0.39 1.71 1.91 8.11 18.51

SGI Indigo2* 0.38 1.71 2.03 8.31 18.84


NOTE The SGI Power Challenge and IBM RS/6000 systems ran fastest on column supernode algorithms, although the fastest block supernode algorithm was only slightly slower (< 5%). For the Power Challenge, we found that the compiler ran out of registers optimizing the kernel loops in the block supernode methods. On the RS/6000, one reason for this is the comparably large on-chip caches of the POWER processor built into these systems: 256 KB for data and 32 KB for instructions (591 model). In particular, the size of the data cache on the IBM 591 is eight times larger than the on-chip data caches of the other processors. Furthermore, the RS/6000 processor has a special set of floating-point multiply-and-add instructions, which greatly support BLAS performance [25].

Performance on workstations versus supercomputers

Table 9 displays the timing results for the numeric factorization obtained on the supercomputers. It is not surprising that the supernodal linear solver performs well on those platforms, since, in the beginning, it was targeted for vector architectures. In contrast to the scalar microprocessors, the vector computers prefer column supernode algorithms for smaller problems. In this case, memory hierarchies generally do not exist; all data is read from one central memory.

On the other hand, for block supernode methods to be successful, supernodes must be reasonably large so that updating supernodes work on a larger set of columns. This compensates for the additional scalar work involved. Unfortunately, this cannot be guaranteed for our test problems [34]. Consequently, machines with relatively good scalar capabilities and performance on small vectors, such as the Fujitsu VP 2200, tend to prefer block supernode methods. Table 10 on page 46 shows which algorithms performed best on the different platforms running test case t5.

Table 9 Timing results obtained on supercomputers of best-performing factorization algorithm [CPU seconds] (ordered according to t5)

Machine            t1      t2      t3      t4      t5

Cray C90*          0.09    0.22    0.27    0.76    1.61
NEC SX-3           0.16    0.34    0.56    1.05    2.64
Fujitsu VP 2200*   0.19    0.45    0.49    1.51    2.78
Convex 3820        0.87    1.46    1.79    5.60    13.08


Figure 10 shows the relative performance of all machines, supercomputers, and workstations mentioned in the benchmark solving test problem t5. The timing results from the best-performing factorization algorithm were used. Clearly, the high end and low end are still dominated by vector and scalar machines, respectively. Between these extremes, there are microprocessors that can compete with so-called mini-supercomputers such as the Convex 3820.

Figure 10 Comparing execution times of best-performing factorization algorithm for test problem t5

Table 10 Best-performing algorithm for test problem t5 on various machines

Algorithm            Machine

column_supernode_0   Convex 3820, IBM RS/6000 591, SGI Power Challenge
column_supernode_3   NEC SX-3
block_supernode_0    Cray C90, Fujitsu VP 2200, HP 9000/J210, Sun UltraSPARC
block_supernode_4    DEC 8400 5/300, DEC 600 5/266, HP 9000/735, SGI Indigo2, Sun S20-71


Part IV ILS

This part contains chapters regarding the iterative linear solver ILS and is intended for users of Sentaurus Device and Sentaurus Process:

CHAPTER 9 USING ILS ON PAGE 49 describes how to select ILS in Sentaurus Device and Sentaurus Process, and how to control the parallel execution.

CHAPTER 10 CUSTOMIZING ILS ON PAGE 51 describes the parameters of ILS.

CHAPTER 11 DEFAULT CONFIGURATION FILE OF ILS ON PAGE 57 presents the default .ilsrc file for Sentaurus Device.


CHAPTER 9 Using ILS

The package ILS (iterative linear solver) is a library to solve sparse linear systems iteratively.

Overview

ILS contains several iterative methods and different kinds of preconditioners. Recent techniques to reorder and scale the linear systems are used in the package to achieve good convergence results and high performance.

On shared-memory architectures, the iterative solver can be run in parallel. Techniques similar to those in direct methods are used to achieve good accelerations. The parallelization of ILS is performed with OpenMP [10], which is an industry standard for parallel programming on shared-memory multiprocessor (SMP) systems. Most vendors of shared-memory architectures support this standard. This makes it possible to write programs that are portable across several SMP platforms.

Selecting ILS in Sentaurus Device

ILS is enabled in Sentaurus Device by specifying:

Math {
   ...
   Method = ILS
   WallClock
   ...
}

The default configuration of ILS is adapted for matrices originating from semiconductor device simulations. Usually, no manual configuration by the user is necessary. Nevertheless, for demanding problems, the behavior of ILS can be fine-tuned with a configuration file (see Chapter 10 on page 51).

The keyword WallClock can be used to print the wall-clock times of the Newton solver. This is useful when investigating the performance of parallel execution.


Selecting ILS in Sentaurus Process

ILS is enabled in Sentaurus Process by specifying:

math diffuse dim=3 ils

or:

math flow dim=3 ils

for diffusion simulations or mechanics simulations, respectively (use dim=3 for 3D simulations or dim=2 for 2D simulations).

It is recommended to set the parameters of the ILS solver using the pdbSet commands. By default, the configuration .ilsrc file is no longer needed and is ignored. For details, see Sentaurus Process, Setting parameters of iterative solver ILS on page 43.

Parallelization

Whether ILS runs in parallel mode or sequential mode depends on the OpenMP environment variable OMP_NUM_THREADS. This variable defines the number of processors that are used for parallel execution. In a C shell, this variable must be set as follows for two processors:

setenv OMP_NUM_THREADS 2

The equivalent commands in a Bourne shell are:

OMP_NUM_THREADS=2
export OMP_NUM_THREADS

In Sentaurus Process, the number of threads must also be specified in the math command:

math maxNumThreads=2 maxNumThreadsILS=2

ILS is parallelized for all Synopsys platforms except HP-UX 11 and Windows. Table 11 lists the platforms on which a parallel version of ILS is available.

On machines that support hyperthreading, the same environment variable and parameter settings must be used to enable hyperthreading for ILS.

Table 11 Availability of ILS parallelization

Platform                  Parallel
AMD 64-bit                Yes
HP-UX 11                  No
Red Hat Enterprise Linux  Yes
Sun Solaris 8 (1)         Yes

1. Both environment variables OMP_NUM_THREADS and PARALLEL must be defined.


CHAPTER 10 Customizing ILS

This chapter discusses the customization that is possible for ILS.

Configuration file

The behavior of ILS can be controlled with a configuration file, which can reside in different places. The first location in which a configuration file is searched for is determined by the environment variable ILSRC. If the environment variable is unspecified or does not point to a file, or if the file cannot be opened, the file .ilsrc in the current working directory becomes the configuration file.

Again, if this file does not exist or is not readable, the next possibility is chosen. In this case, ILS looks for a file with the same name in the home directory of the user.
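The three-stage lookup can be sketched in a few lines of plain Python (illustrative only; the name find_ilsrc is not part of ILS, whose lookup is internal):

```python
import os
from pathlib import Path

def find_ilsrc():
    """Return the first readable configuration file, mimicking the search
    order described above: $ILSRC, then ./.ilsrc, then ~/.ilsrc.
    Didactic sketch only, not the actual ILS implementation."""
    candidates = []
    env = os.environ.get("ILSRC")
    if env:
        candidates.append(Path(env))          # 1. file named by ILSRC
    candidates.append(Path.cwd() / ".ilsrc")   # 2. current working directory
    candidates.append(Path.home() / ".ilsrc")  # 3. home directory of the user
    for path in candidates:
        if path.is_file() and os.access(path, os.R_OK):
            return path
    return None  # no configuration file found; defaults apply
```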

In ILS, the solution of a linear system consists of four steps:

1. Computation of a nonsymmetric ordering to improve the condition of the matrix.

2. Determination of a symmetric ordering to reduce the fill-in in the preconditioner.

3. Creation of a preconditioner to accelerate the convergence in the iterative method.

4. Calling an iterative method.

For each step, there are several options, which are described in the following sections. ILS allows users to define sets of parameters. A configuration file defines one or more sets. Each set is identified by a number. In Sentaurus Device, users can select a set with the following line in the command file:

Method = ILS (set = <integer>)

If the set is omitted, the number one (1) is taken as the default. The syntax of a set specification is:

set( <integer> ) {
   [ parent( <integer> ); ]
   [ iterative block ]
   [ preconditioning block ]
   [ ordering block ]
   [ options block ]

};

where <...> represents a subspecification, [...] is an optional block, and ‘|’ defines a choice. The meaning of parent(i) is that all of the parameters of set i are copied into the current set. This instruction can be used if two similar sets are specified, with only minor changes between them.
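For example, a hypothetical configuration with two sets, where set 2 inherits everything from set 1 and only tightens the ILUT drop tolerance, could look like this (the parameter values are illustrative, not recommendations):

```
set(1) {
   iterative( gmres(100), tolrel=1e-8 );
   preconditioning( ilut(0.001,-1) );
};

set(2) {
   parent(1);
   preconditioning( ilut(0.0001,-1) );
};
```

In Sentaurus Device, set 2 would then be selected with Method = ILS (set = 2).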


NOTE The source set must be defined beforehand and parent should be the first statement of a set.

A description of the four other blocks is given in the following sections.

Nonsymmetric ordering

The first step in the solution process of a linear system is the computation of a nonsymmetric ordering and scaling [12][50][51], such that the reordered and scaled system is better conditioned. There are three options for this step: the default version is column oriented (MPSILST), the second version is row oriented (MPSILS), and the third possibility is to omit the nonsymmetric ordering by specifying none. The syntax to select the nonsymmetric ordering is given in Symmetric ordering.

Symmetric ordering

As in direct methods, the linear systems are reordered before the preconditioner is computed. The purpose of the symmetric ordering is twofold. On the one hand, the quality of the preconditioner depends on the ordering. On the other hand, the ordering also influences the amount of fill-in in the preconditioner and, therefore, the time for the application of the preconditioner in the iterative method. The following orderings are available in ILS:

Reverse Cuthill–McKee (RCM) [52]

Multiple minimum degree (MMD) [29]

Multilevel nested dissection (ND) [53]

A combination of ND and RCM (NDRCM)

The ordering to be used depends on the preconditioner. The best choice for an ILU(0) factorization is generally the RCM ordering [54][55]. For an incomplete LU factorization where the dropping is entirely based on the numeric values (ILUT), the ordering does not have a large influence. The approximate inverse preconditioners are independent of a symmetric ordering and, therefore, this step can be omitted for these preconditioners.

In parallel mode, it is mandatory to use either ND (default) or NDRCM, since these orderings allow for the parallel computation and application of incomplete LU factorizations. It is also possible to use MMD for the parallel solver, but the performance is better with the other orderings.

The syntax is:

ordering( [ symmetric = < none | rcm | mmd | nd | ndrcm > ]
          [, nonsymmetric = < none | mpsils | mpsilst > ] );


Example

ordering( symmetric=nd, nonsymmetric=mpsilst );

Preconditioners

Iterative methods are usually combined with preconditioners to improve convergence rates. Especially for ill-conditioned matrices, iterative methods fail without the application of a preconditioner. Several preconditioners exist in ILS, ranging from simple techniques such as a diagonal preconditioner, through different incomplete LU factorizations, to sparse approximate inverse preconditioners.

An overview of the syntax to select a preconditioner is presented below, and the various possibilities are described.

The syntax is:

preconditioning( < none | diagonal | ilu0 |
                   ilut( <double>, <integer> ) |
                   ilu_slip( <double>, <integer> ) |
                   spai0 | spai1 |
                   spai( <double>, <integer>, <integer>, <integer> ) >
                 [, < left | right > ] );

If none is specified, the linear system is solved without a preconditioner. If a preconditioner is used, it can be applied from either the left (default) or the right by specifying the corresponding option. In the former case, the unpreconditioned residuals and the preconditioned residuals do not correspond, but the error is the same for both the preconditioned and unpreconditioned linear system. In the latter case, the situation is reversed. For example:

preconditioning( ilut(0.001,-1) );
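The distinction between the two residuals can be illustrated with a small numeric sketch (plain Python, not ILS; a diagonal preconditioner stands in for the real ILUT factors):

```python
# For a left-preconditioned system M^{-1} A x = M^{-1} b, the solver
# monitors the preconditioned residual M^{-1} r, whose norm generally
# differs from that of the unpreconditioned residual r = b - A x.
A = [[10.0, 1.0],
     [ 1.0, 0.1]]
b = [1.0, 1.0]
x = [0.0, 0.0]                      # an arbitrary iterate, for illustration

r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
Minv_r = [r[i] / A[i][i] for i in range(2)]   # M = diag(A), so apply M^{-1}

norm = lambda v: sum(t * t for t in v) ** 0.5
print(norm(r), norm(Minv_r))        # the two monitored norms differ
```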

Incomplete LU factorizations

Direct solvers for linear systems decompose a given matrix A into triangular factors L and U, whose product is equal to the original matrix, that is, A = LU. One of the main concerns of direct methods is the high demand of memory to perform the factorization. As the factors L and U are not computed exactly, but some elements are disregarded, it is more economical to work with them.

Several strategies have been proposed in the literature to determine which elements should be dropped or kept. In ILS, three different incomplete LU factorizations are implemented: ILU(0), ILUT(ε, q), and ILUPT(p, ε). They are described in Table 12 on page 54. Parallel versions of the first two incomplete factorizations exist. ILUPT is currently not parallelized.
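The idea behind the simplest variant, ILU(0), can be sketched in plain Python on a dense array (a didactic toy assuming a nonzero diagonal and no pivoting; ILS itself works on sparse storage):

```python
def ilu0(A):
    """Incomplete LU with zero fill: run Gaussian elimination, but keep
    only entries at positions that are nonzero in the original matrix.
    Didactic dense sketch; assumes nonzero diagonal and no pivoting."""
    n = len(A)
    pattern = [[A[i][j] != 0 for j in range(n)] for i in range(n)]
    F = [[float(v) for v in row] for row in A]   # L (strict lower) and U in place
    for k in range(n - 1):
        for i in range(k + 1, n):
            if not pattern[i][k]:
                continue
            F[i][k] /= F[k][k]                   # multiplier for row i
            for j in range(k + 1, n):
                if pattern[i][j]:                # drop fill outside the pattern
                    F[i][j] -= F[i][k] * F[k][j]
    L = [[F[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    U = [[F[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return L, U
```

For a tridiagonal matrix no fill-in arises, so ILU(0) reproduces the exact LU factors; on matrices with more fill, the product of L and U only approximates A, which is exactly what makes the factorization cheap enough to serve as a preconditioner.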


Sparse approximate inverses

These preconditioners directly approximate the inverse of the given linear system. Three different versions exist in ILS: SPAI(0), SPAI(1), and SPAI(ε) [56]. The difference between these preconditioners is their structure. The first consists solely of a diagonal, the second has the same structure as the given linear system, and the structure of the third one is computed dynamically during the computation of the approximation.

The implementation of SPAI(ε) requires four arguments: spai(epsilon, bs, ns, mn). They are described in Table 13. These preconditioners have the advantage that they can be computed and applied in parallel. However, their quality is not good enough to use them for semiconductor device simulations. For this reason, they are currently not available as parallel versions.

Other preconditioners

A simple diagonal preconditioner is also available in ILS. The preconditioner is equal to the inverse of the diagonal of the given matrix.

Table 12 Incomplete LU factorizations

Factorization  Description

ILU(0)         The simplest incomplete LU factorization, where all elements but the entries from the linear system are dropped.

ILUT(ε, q)     Incomplete LU factorization, where the dropping of elements is based on the values. Elements smaller than ε are dropped during the elimination. The second parameter q is intended to limit the number of elements in a row in the triangular factors, but currently this value is ignored. The smaller ε is, the more accurate the preconditioner becomes. However, the computation, memory requirements, and application of the preconditioner increase in this case.

ILUPT(p, ε)    A combination of ILU(p) (a generalization of ILU(0)) and ILUT. This preconditioner is equal to the incomplete LU factorization used in SLIP90. Therefore, its internal name is ILU_SLIP. Increasing p or lowering ε improves the accuracy of the preconditioner, but the same consequences as for lowering ε in ILUT(ε, q) hold. Note that the parameters p and ε are reversed in the configuration file.

Table 13 Parameters for SPAI(ε)

Parameter  Description

bs         Block size to use.
epsilon    Threshold to limit fill-in.
mn         Maximum number of columns to add during one improvement step.
ns         Number of improvement steps.


Iterative methods

Unsymmetric sparse linear systems can be solved with different Krylov subspace methods.

The most famous methods are the biconjugate gradient stabilized (BICGSTAB) method [57] and the generalized minimal residual (GMRES(m)) method [58], which are both implemented in ILS. Usually, they give the best results in terms of the number of iterations and the time to compute the solution. In semiconductor device simulations, GMRES demonstrates better reliability.

NOTE In Sentaurus Device, the default iterative solver is GMRES(100). In Sentaurus Process, the default iterative solver is BICGSTAB.

Three additional general iterative methods, CGS [47], BiCGxMR2 [59], and FGMRES(m) (Flexible GMRES), are available (use the keyword bicgxmr2 to select the second one). Additionally, for Sentaurus Process, two special iterative methods, STCG2 and STCG3, for solving 2D and 3D stress problems, respectively, are available. The parameter m, which is the number of backvectors in GMRES(m), is required to limit the memory demands of the method. After m iterations, GMRES is restarted. The default value of m is 100. Larger values of m usually help GMRES to converge, but at the expense of higher memory consumption and execution time.

If there are convergence problems, it is recommended to decrease the threshold parameter <eps>, to increase the number of backvectors m, or both. Conversely, in the case of huge simulations, m can be decreased to fit the available memory of the computer.

The syntax is:

iterative ( < bicgstab | bicgxmr2 | cgs | fgmres(<integer>) | gmres(<integer>) | stcg2 | stcg3 >
            [, tolrel = <double> ]
            [, tolabs = <double> ]
            [, tolunprec = <double> ]
            [, maxit = <integer> ] );

Different stopping criteria are available for the iterative methods. If one of them is satisfied, the iterative method stops. The first possibility is to specify the relative tolerance of the norm of the preconditioned residual, that is, the iteration stops if the norm of the preconditioned residual is reduced by tolrel. The second criterion checks if the preconditioned residual becomes smaller than tolabs. With the option tolunprec, the reduction of the unpreconditioned residual can be monitored (the left-preconditioned GMRES controls only a preconditioned residual). This option makes sense only if the preconditioner is applied from the left. Otherwise, the unpreconditioned and preconditioned residuals are the same and, therefore, this option corresponds to the first one. A limit on the number of iterations can be specified with maxit.
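The interaction of these parameters can be sketched with a simple stationary iteration (plain Python; a Jacobi sweep stands in for the actual Krylov methods in ILS, and the argument names mirror the tolrel/tolabs/maxit options above):

```python
def jacobi_with_stopping(A, b, tolrel=1e-8, tolabs=0.0, maxit=200):
    """Solve A x = b with Jacobi sweeps, stopping as soon as the residual
    norm drops below tolabs, or below tolrel times the initial residual
    norm, or after maxit iterations. Didactic sketch only, not ILS."""
    n = len(b)
    x = [0.0] * n
    def res_norm(y):
        r = [b[i] - sum(A[i][j] * y[j] for j in range(n)) for i in range(n)]
        return sum(t * t for t in r) ** 0.5
    r0 = res_norm(x)
    for it in range(1, maxit + 1):
        # one Jacobi sweep: solve each equation for its diagonal unknown
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
        if res_norm(x) <= max(tolabs, tolrel * r0):
            return x, it
    return x, maxit

# Usage on a small diagonally dominant system
x, its = jacobi_with_stopping([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```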


Table 14 lists the default values for the different stopping criteria.

Example

iterative( gmres(100), tolrel=1e-8, tolunprec=1e-4, maxit=200 );

Options

There are additional options that users can specify for the linear solver. One of these should be used if the linear system to be solved contains many entries that are numerically zero. Especially for simulations with Sentaurus Device, this option should be switched on. The default of this option is compact=yes.

The verbosity of ILS is controlled with the option verbose. With a value of 0, all output is suppressed. If a value of 1 is specified, the accumulated numbers of calls, iterations, and execution times are printed to standard output. The most basic information is printed with verbose=2, and this should be sufficient for the needs of most users. Higher values print additional information about the solution and preconditioners.

The syntax is:

options( [ compact = < no | yes > [, verbose = <integer> ] ] );

Example

options( compact=yes, verbose=1 );

General remarks

The parser of the configuration file is case insensitive. Comments can be made in the configuration file as in a C++ or C source file, that is, text that follows // up to the end of the line is ignored. Text between /* and */ is disregarded.

Table 14 Default values for stopping criteria for iterative methods

Option     Value
maxit      200
tolrel     1e-8
tolabs     0
tolunprec  1e-4


CHAPTER 11 Default configuration file of ILS

The following configuration file shows the default parameters of ILS. If some values are omitted, the following settings are taken:

// Default configuration file for the iterative solver ILS
//
// Text after '//' are comments
//
// Define the parameters of set '1'
//
set(1) {

   // we choose gmres(100) as the default iterative method
   // the iteration is stopped, if either
   // a) the preconditioned residual is reduced by 1e-8
   // b) the unpreconditioned residual is reduced by 1e-4
   // c) the preconditioned residual becomes 0
   // d) 200 iterations are carried out
   iterative( gmres(100), tolrel=1e-8, tolunprec=1e-4,
              tolabs=0, maxit=200 );

   // choose ILUT factorization as preconditioner
   preconditioning( ilut(0.001,-1), left );

   // we use the symmetric ordering ND and
   // the nonsymmetric MPSILST
   ordering( symmetric=nd, nonsymmetric=mpsilst );

   // remove the numerical zeros in the matrix
   // suppress every output
   options( compact=yes, verbose=0 );
};


Part V SLIP90

This part contains chapters regarding the iterative linear solver SLIP90 and is intended for users of Sentaurus Device:

CHAPTER 12 USING SLIP90 ON PAGE 61 provides background information on SLIP90.

CHAPTER 13 CUSTOMIZING SLIP90 ON PAGE 63 describes the input structure of the .sliprc file, how to specify the iterative methods, and the preconditioner.

CHAPTER 14 PERFORMANCE OF SLIP90 ON PAGE 65 presents results of the performance of SLIP90 on different platforms.

CHAPTER 15 DEFAULT CONFIGURATION FILE OF SLIP90 ON PAGE 67 presents the default configuration file of SLIP90.


CHAPTER 12 Using SLIP90

SLIP90 is a linear algebra library containing state-of-the-art iterative methods for solving linear systems of equations in Sentaurus Device. It provides implementations of five iterative methods and a powerful preconditioner for accelerating the convergence. All methods are available for all solve methods in Sentaurus Device, for example, coupled and AC coupled.

The main programming language is Fortran90, which provides powerful array features and memory handling. This, together with the use of BLAS and LAPACK [60] subroutines, ensures high performance on a large range of computer architectures.

To activate SLIP90 in Sentaurus Device, use the following syntax in the command file:

Math {
   ...
   Method = Slip
   ...
}

SLIP90 is tuned for general use in Sentaurus Device. This means that user intervention is usually not required. However, for special cases or demanding problems, it is possible to tailor the default behavior by using a .sliprc file.


CHAPTER 13 Customizing SLIP90

This chapter discusses the customization that is possible for SLIP90.

Overview

The .sliprc file is used to customize the default behavior of SLIP90. The location of this file is determined by the contents of the user-defined environment variable SLIPRC. If this environment variable is undefined or the file specified by its contents is not found, the current directory is probed for a .sliprc file. If this again is unsuccessful, the home directory of the user is scanned.

If a .sliprc file is found, it is parsed at start-up, and the default configuration of SLIP90 changes accordingly. The basic input specification of the .sliprc file is a type section such as:

type ( <integer> ) {
   [ <method block> ; ]
   [ <preconditioner block> ; ]
   [ print ( <on | off> ) ; ]
   [ next = <integer> ; ]
};

where ‘<.>’ represents a subspecification, ‘[.]’ specifies an optional subblock, and ‘|’ indicates a choice.

Six type blocks (0–5) can be specified. Type 0 is reserved for symmetric positive definite linear systems.Types 1 to 5 are reserved for general asymmetric systems that are increasingly difficult to solve.

First, SLIP90 tries to solve a given linear system according to the specifications in the type 0 or 1 block. In case of convergence failure, SLIP90 switches to the next type (specified by next = <integer>) and tries again. This behavior is repeated. The last type in this chain should specify next = -1 to indicate a total failure, after which SLIP90 quits. SLIP90 keeps a record of the last successful type number and uses it for subsequent solves.
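As a hypothetical illustration of such a chain, a minimal .sliprc with two escalation stages might read as follows (the parameter values are illustrative, not recommendations; see Chapter 15 for the shipped defaults):

```
type(1) {
   method ( bicgstab(2), mxmv=200, tolerance=1e-4 );
   preconditioner ( rcm, fill-level=5, fill-tolerance=1e-2 );
   print ( off );
   next=2;         # on failure, escalate to type(2)
};
type(2) {
   method ( bicgstab(4), mxmv=200, tolerance=1e-4 );
   preconditioner ( mmd, fill-level=15, fill-tolerance=1e-4 );
   print ( on );   # inspect convergence of the fallback stage
   next=-1;        # total failure: SLIP90 gives up after this type
};
```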

Iterative methods

The SLIP90 library includes the following iterative methods:

For solving symmetric positive definite linear systems:

• Conjugate gradient (CG) [22][64]

For solving general asymmetric linear systems:

• Conjugate gradient squared (CGS) [47]


• Enhanced conjugate gradients squared (CGS2) [61]

• Bi-conjugate gradients stabilized (BiCGstab(L)) [57][62][66][67]

• Generalized minimal residual algorithm (GMRES(m)) [58]

More detailed descriptions of these methods are found in the literature [60][63][68].

The iterative method can be specified as a method section such as:

method ( < cg | bicgstab( <integer> ) | cgs | cgs2 | gmres( <integer> ) >
         [ , mxmv = <integer> ]
         [ , tolerance = <float> ] ) ;

SLIP90 is configured to use bicgstab(2|4) by default. mxmv is optional and specifies the maximum number of matrix multiplications that are allowed during one particular solve. By default, mxmv = 200.

tolerance is also optional and specifies the required stopping tolerance. By default, tolerance = 1e-4.
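For example, a hypothetical method line requesting GMRES with 30 backvectors, a larger multiplication budget, and a tighter tolerance would be:

```
method ( gmres(30), mxmv=400, tolerance=1e-5 ) ;
```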

Preconditioner

As a preconditioner, SLIP90 provides a fast and powerful incomplete LU decomposition (ILU) [65]. The fill-in is controlled through a combination of level sets and a dropping strategy. Furthermore, to minimize fill-in, these reordering techniques are provided:

Nested dissection (ND)

Reverse Cuthill–McKee (RCM)

Multiple minimum degree (MMD)

For more details on these reorderings, refer to the chapters of SUPER and their cited references.

The preconditioner can be specified in a preconditioner section, for example:

preconditioner ( < nd | rcm | mmd >
                 [ , fill-level = <integer> ]
                 [ , fill-tolerance = <float> ] ) ;

or it can be switched off by using:

preconditioner ( off );

fill-level specifies the maximum level that a fill-in entry in the ILU decomposition can have. Fill-in entries with a higher level are discarded.

fill-tolerance specifies the threshold value for dropping a fill-in entry in the ILU decomposition.

The higher the level and the smaller the tolerance, the better the preconditioner and the faster the convergence. However, the compromise is that the time required to construct the preconditioner increases.


CHAPTER 14 Performance of SLIP90

The solution times for four linear test problems of increasing dimension on different platforms are presented here. A default configuration was used (see Chapter 15 on page 67).

NOTE The performance data presented here must not be regarded as an evaluation of the various platforms.

Table 15 Execution times [s] of SLIP90 for four test problems

Test problem                   mos4    mos5    mos6    mos7
Dimension                      3519    6612    12024   24123
Nonzeros                       72261   188048  250740  504765

Operating system/architecture  Execution time [s]
aix3.2/rs6k                    1.80    2.83    6.01    21.22
aix4.1/rs6k                    1.79    2.88    6.01    21.33
osf1/dec-alpha                 1.41    3.96    9.41    36.51
hpux9/hppa                     2.01    5.35    12.08   44.79
linux/i586                     4.09    10.83   23.57   88.51
sunos4/sparc                   4.00    9.77    21.23   75.67
solaris/sparc                  3.96    10.26   22.07   77.82


Figure 11 Execution times of SLIP90 for four test problems


CHAPTER 15 Default configuration file of SLIP90

The following is the default configuration file of SLIP90:

# This is the default configuration of Slip90
#
# o There are 6 types of methods (0-5).
# o Type 0 is reserved for SPD systems in Sentaurus Device.
# o Available methods are:
#   cg, cgs, cgs2, bicgstab(.) and gmres(.)
# o Available preconditioning orderings are:
#   rcm, nd, mmd
# o "Next" specifies the type to take when the current
#   type fails.
#
# o "print (on)" prints details of the convergence
#   behavior
#
# NB. - All text after "#" is ignored.
#     - The parser ignores blanks and is case
#       insensitive.

# --------------------------------------------------
# type(0) -- for symmetric positive definite systems
# --------------------------------------------------
type(0) {
   method ( cg, mxmv=200, tolerance=1e-4 );
   preconditioning ( rcm, max-fill=2, fill-level=5,
                     fill-tolerance=1e-2 );
   next=1;
   print ( off );
};

# --------------------------------------------------
# type(1-5) -- for increasingly difficult systems
# --------------------------------------------------
type(1) {
   method ( bicgstab(2), mxmv=200, tolerance=1e-4 );
   preconditioning ( rcm, fill-level=5,
                     fill-tolerance=1e-2 );
   print ( off );
   next=2;
};

type(2) {
   method ( bicgstab(2), mxmv=200, tolerance=1e-4 );
   preconditioning ( rcm, fill-level=10,
                     fill-tolerance=5e-3 );
   print ( off );
   next=3;
};

type(3) {
   method ( bicgstab(2), mxmv=200, tolerance=1e-4 );
   preconditioning ( rcm, fill-level=10,
                     fill-tolerance=1e-3 );
   print ( off );
   next=4;
};

type(4) {
   method ( bicgstab(4), mxmv=200, tolerance=1e-4 );
   preconditioning ( mmd, fill-level=15,
                     fill-tolerance=1e-4 );
   print ( off );
   next=5;
};

type(5) {
   method ( bicgstab(4), mxmv=200, tolerance=1e-4 );
   preconditioning ( mmd, fill-level=20,
                     fill-tolerance=1e-5 );
   next=-1;
};


Bibliography

[1] O. Schenk, Scalable Parallel Sparse LU Factorization Methods on Shared Memory Multiprocessors, Series in Microelectronics, vol. 89, Konstanz, Germany: Hartung-Gorre, 2000.

[2] O. Schenk, K. Gärtner, and W. Fichtner, “Efficient Sparse LU Factorization with Left-Right Looking Strategy on Shared Memory Multiprocessors,” BIT, vol. 40, no. 1, pp. 158–176, 2000.

[3] P. Matstoms, “Parallel sparse QR factorization on shared memory architectures,” Parallel Computing, vol. 21, no. 3, pp. 473–486, 1995.

[4] A. George and J. W.-H. Liu, Computer Solution of Large Sparse Positive Definite Systems, Englewood Cliffs, New Jersey: Prentice-Hall, 1981.

[5] G. Karypis and V. Kumar, “Analysis of Multilevel Graph Partitioning,” Technical Report 95-037, University of Minnesota, Department of Computer Science/Army HPC Research Center, Minneapolis, USA, 1995.

[6] J. W. H. Liu, “Modification of the Minimum-Degree Algorithm by Multiple Elimination,” ACM Transactions on Mathematical Software, vol. 11, no. 2, pp. 141–153, 1985.

[7] J. J. Dongarra et al., “A Set of Level 3 Basic Linear Algebra Subprograms,” ACM Transactions on Mathematical Software, vol. 16, no. 1, pp. 1–17, 1990.

[8] O. Schenk, K. Gärtner, and W. Fichtner, “Scalable Parallel Sparse Factorization with Left-Right Looking Strategy on Shared Memory Multiprocessors,” in High-Performance Computing and Networking, 7th International Conference, HPCN Europe, Amsterdam, The Netherlands, pp. 221–230, April 1999.

[9] O. Schenk, K. Gärtner, and W. Fichtner, “Application of Parallel Sparse Direct Methods in Semiconductor Device and Process Simulation,” Technical Report 99/7, Integrated Systems Laboratory, ETH, Zurich, Switzerland, 1999.

[10] L. Dagum and R. Menon, “OpenMP: An Industry-Standard API for Shared-Memory Programming,” IEEE Computational Science & Engineering, vol. 5, no. 1, pp. 46–55, 1998.

[11] O. Schenk, M. Hagemann, and S. Röllin, “Recent advances in sparse linear solver technology for semiconductor device simulation matrices,” in International Conference on Simulation of Semiconductor Processes and Devices (SISPAD), Boston, MA, USA, pp. 103–108, September 2003.

[12] O. Schenk, S. Röllin, and A. Gupta, “The Effects of Unsymmetric Matrix Permutations and Scalings in Semiconductor Device and Circuit Simulation,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 23, no. 3, pp. 400–411, 2004.

[13] O. Schenk and K. Gärtner, “Solving unsymmetric sparse systems of linear equations with PARDISO,” Future Generation Computer Systems, vol. 20, no. 3, pp. 475–487, 2004.

[14] T. A. Davis, “A Column Pre-Ordering Strategy for the Unsymmetric-Pattern Multifrontal Method,” ACM Transactions on Mathematical Software, vol. 30, no. 2, pp. 165–195, 2004.


[15] T. A. Davis, “Algorithm 832: UMFPACK V4.3—An Unsymmetric-Pattern Multifrontal Method,” ACM Transactions on Mathematical Software, vol. 30, no. 2, pp. 196–199, 2004.

[16] T. A. Davis and I. S. Duff, “A Combined Unifrontal/Multifrontal Method for Unsymmetric Sparse Matrices,” ACM Transactions on Mathematical Software, vol. 25, no. 1, pp. 1–20, 1999.

[17] T. A. Davis et al., “A Column Approximate Minimum Degree Ordering Algorithm,” ACM Transactions on Mathematical Software, vol. 30, no. 3, pp. 353–376, 2004.

[18] T. A. Davis et al., “Algorithm 836: COLAMD, A Column Approximate Minimum Degree Ordering Algorithm,” ACM Transactions on Mathematical Software, vol. 30, no. 3, pp. 377–380, 2004.

[19] P. R. Amestoy, T. A. Davis, and I. S. Duff, “Algorithm 837: AMD, An Approximate Minimum Degree Ordering Algorithm,” ACM Transactions on Mathematical Software, vol. 30, no. 3, pp. 381–388, 2004.

[20] P. Arbenz and W. Gander, “A Survey of Direct Parallel Algorithms for Banded Linear Systems,” Technical Report 221, Institute of Scientific Computing, ETH, Zurich, Switzerland, October 1994.

[21] C. C. Ashcraft et al., “Progress in Sparse Matrix Methods for Large Linear Systems on Vector Supercomputers,” The International Journal of Supercomputer Applications, vol. 1, no. 4, pp. 10–30, 1987.

[22] R. Barrett et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, Philadelphia, PA: SIAM, 1994.

[23] D. S. Dodson, R. G. Grimes, and J. G. Lewis, “Sparse Extensions to the FORTRAN Basic Linear Algebra Subprograms,” ACM Transactions on Mathematical Software, vol. 17, no. 2, pp. 253–263, 1991.

[24] J. J. Dongarra et al., “An Extended Set of FORTRAN Basic Linear Algebra Subprograms,” ACM Transactions on Mathematical Software, vol. 14, no. 1, pp. 1–17, 1988.

[25] J. J. Dongarra, P. Mayes, and G. Radicati di Brozolo, “The IBM RISC System/6000 and linear algebra operations,” Supercomputer, vol. 8, no. 4, pp. 15–30, 1991.

[26] I. S. Duff, A. M. Erisman, and J. K. Reid, Direct Methods for Sparse Matrices, Oxford: Clarendon Press, 1986.

[27] S. C. Eisenstat et al., “The (New) Yale Sparse Matrix Package,” in Elliptic Problem Solvers II (Proceedings of the Elliptic Problem Solvers Conference), Monterey, CA, USA, pp. 45–52, January 1983.

[28] A. George, M. T. Heath, and J. Liu, “Parallel Cholesky Factorization on a Shared-Memory Multiprocessor,” Linear Algebra and Its Applications, vol. 77, pp. 165–187, 1986.

[29] A. George and J. W. H. Liu, “The Evolution of the Minimum Degree Ordering Algorithm,” SIAM Review, vol. 31, no. 1, pp. 1–19, 1989.

[30] A. George, J. W. H. Liu, and E. Ng, “Communication results for parallel sparse Cholesky factorization on a hypercube,” Parallel Computing, vol. 10, no. 1, pp. 287–298, 1989.

[31] C. L. Lawson et al., “Basic Linear Algebra Subprograms for Fortran Usage,” ACM Transactions on Mathematical Software, vol. 5, no. 3, pp. 308–323, 1979.

[32] J. G. Lewis and H. D. Simon, “The Impact of Hardware Gather/Scatter on Sparse Gaussian Elimination,” SIAM Journal on Scientific and Statistical Computing, vol. 9, no. 2, pp. 304–311, 1988.


[33] A. Liegmann, “The Application of Supernodal Techniques on the Solution of Structurally Symmetric Systems,” Technical Report 92/5, Integrated Systems Laboratory, ETH, Zurich, Switzerland, 1992.

[34] A. Liegmann and W. Fichtner, “The Application of Supernodal Factorization Algorithms for Structurally Symmetric Linear Systems in Semiconductor Device Simulation,” Technical Report 92/17, Integrated Systems Laboratory, ETH, Zurich, Switzerland, 1992.

[35] J. W. H. Liu, “The Role of Elimination Trees in Sparse Factorization,” SIAM Journal on Matrix Analysis and Applications, vol. 11, no. 1, pp. 134–172, 1990.

[36] J. W. H. Liu, E. Ng, and B. W. Peyton, “On Finding Supernodes for Sparse Matrix Computations,” Technical Report ORNL/TM-11563, Oak Ridge National Laboratory, Oak Ridge, TN, USA, June 1990.

[37] E. Ng, “Supernodal Symbolic Cholesky Factorization on a Local-Memory Multiprocessor,” Technical Report ORNL/TM-11836, Oak Ridge National Laboratory, Oak Ridge, TN, USA, June 1991.

[38] E. G. Ng and B. W. Peyton, “A Supernodal Cholesky Factorization Algorithm for Shared-Memory Multiprocessors,” Technical Report ORNL/TM-11814, Oak Ridge National Laboratory, Oak Ridge, TN, USA, April 1991.

[39] E. G. Ng and B. W. Peyton, “Block Sparse Cholesky Algorithms on Advanced Uniprocessor Computers,” Technical Report ORNL/TM-11960, Oak Ridge National Laboratory, Oak Ridge, TN, USA, December 1991.

[40] C. Pommerell, Solution of Large Unsymmetric Systems of Linear Equations, Ph.D. thesis, ETH, Zurich, Switzerland, 1992.

[41] J. M. Ortega, Introduction to Parallel and Vector Solution of Linear Systems, New York: Plenum Press, 1988.

[42] E. Rothberg, Exploiting the Memory Hierarchy in Sequential and Parallel Sparse Cholesky Factorization, Ph.D. thesis, Stanford University, Stanford, CA, USA, 1992.

[43] E. Rothberg and A. Gupta, “Techniques for Improving the Performance of Sparse Matrix Factorization on Multiprocessor Workstations,” in Proceedings of Supercomputing ’90, New York, NY, USA, pp. 232–241, November 1990.

[44] E. Rothberg and A. Gupta, “An Evaluation of Left-Looking, Right-Looking and Multifrontal Approaches to Sparse Cholesky Factorization on Hierarchical-Memory Machines,” Technical Report STAN-CS-91-1377, Department of Computer Science, Stanford University, Stanford, CA, USA, August 1991.

[45] E. Rothberg and A. Gupta, “Efficient Sparse Matrix Factorization on High-Performance Workstations—Exploiting the Memory Hierarchy,” ACM Transactions on Mathematical Software, vol. 17, no. 3, pp. 313–334, 1991.

[46] R. Schreiber, “A New Implementation of Sparse Gaussian Elimination,” ACM Transactions on Mathematical Software, vol. 8, no. 3, pp. 256–276, 1982.

[47] P. Sonneveld, “CGS, A Fast Lanczos-type Solver for Nonsymmetric Linear Systems,” SIAM Journal on Scientific and Statistical Computing, vol. 10, no. 1, pp. 36–52, 1989.

[48] H. A. van der Vorst, Lecture notes on iterative methods, Utrecht, The Netherlands: University Utrecht, 1993.


[49] M. Yannakakis, “Computing the Minimum Fill-in Is NP-Complete,” SIAM Journal on Algebraic and Discrete Methods, vol. 2, no. 1, pp. 77–79, 1981.

[50] M. Benzi, J. C. Haws, and M. Tuma, “Preconditioning Highly Indefinite and Nonsymmetric Matrices,” SIAM Journal on Scientific Computing, vol. 22, no. 4, pp. 1333–1353, 2000.

[51] I. S. Duff and J. Koster, “On Algorithms for Permuting Large Entries to the Diagonal of a Sparse Matrix,” SIAM Journal on Matrix Analysis and Applications, vol. 22, no. 4, pp. 973–996, 2001.

[52] E. Cuthill and J. McKee, “Reducing the bandwidth of sparse symmetric matrices,” in Proceedings of the 24th National Conference, ACM, New York, USA, pp. 157–172, August 1969.

[53] G. Karypis and V. Kumar, “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs,” SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359–392, 1998.

[54] M. Benzi, W. Joubert, and G. Mateescu, “Numerical Experiments with Parallel Orderings for ILU Preconditioners,” Electronic Transactions on Numerical Analysis, vol. 8, pp. 88–114, 1999.

[55] M. Benzi, D. B. Szyld, and A. van Duin, “Orderings for Incomplete Factorization Preconditioning of Nonsymmetric Problems,” SIAM Journal on Scientific Computing, vol. 20, no. 5, pp. 1652–1670, 1999.

[56] M. J. Grote and T. Huckle, “Parallel Preconditioning with Sparse Approximate Inverses,” SIAM Journal on Scientific Computing, vol. 18, no. 3, pp. 838–853, 1997.

[57] H. A. van der Vorst, “BI-CGSTAB: A Fast and Smoothly Converging Variant of BI-CG for the Solution of Nonsymmetric Linear Systems,” SIAM Journal on Scientific and Statistical Computing, vol. 13, no. 2, pp. 631–644, 1992.

[58] Y. Saad and M. H. Schultz, “GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems,” SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 3, pp. 856–869, 1986.

[59] S. Röllin and M. H. Gutknecht, “Variations of Zhang’s Lanczos-type product method,” Applied Numerical Mathematics, vol. 41, pp. 119–133, 2002.

[60] E. Anderson et al., LAPACK Users’ Guide, Philadelphia: SIAM, 1992.

[61] D. R. Fokkema, G. L. G. Sleijpen, and H. A. Van der Vorst, “Generalized conjugate gradient squared,” Journal of Computational and Applied Mathematics, vol. 71, no. 1, pp. 125–146, 1996.

[62] D. R. Fokkema, Subspace Methods for Linear, Nonlinear, and Eigen Problems, Utrecht, The Netherlands, 1996.

[63] G. H. Golub and C. F. Van Loan, Matrix Computations, Baltimore: The Johns Hopkins University Press, 2nd ed., 1989.

[64] M. R. Hestenes and E. Stiefel, “Methods of Conjugate Gradients for Solving Linear Systems,” Journal of Research of the National Bureau of Standards, vol. 49, no. 6, pp. 409–436, 1952.

[65] J. A. Meijerink and H. A. van der Vorst, “An Iterative Solution Method for Linear Systems of Which the Coefficient Matrix is a Symmetric M-Matrix,” Mathematics of Computation, vol. 31, no. 137, pp. 148–162, 1977.

[66] G. L. G. Sleijpen and D. R. Fokkema, “BiCGSTAB(L) for Linear Equations Involving Matrices with Complex Spectrum,” Electronic Transactions on Numerical Analysis, vol. 1, pp. 11–32, 1993.


[67] G. L. G. Sleijpen, H. A. van der Vorst, and D. R. Fokkema, “BiCGstab(l) and other hybrid Bi-CG methods,” Numerical Algorithms, vol. 7, no. 1, pp. 75–109, 1994.

[68] D. R. Fokkema, G. L. G. Sleijpen, and H. A. van der Vorst, “Accelerated Inexact Newton Schemes for Large Systems of Nonlinear Equations,” SIAM Journal on Scientific Computing, vol. 19, no. 2, pp. 657–674, 1998.
