Toward an Automatically Tuned Dense Symmetric Eigensolver for Shared Memory Machines Yusaku Yamamoto Dept. of Computational Science & Engineering Nagoya

Toward an Automatically Tuned Dense Symmetric Eigensolver

for Shared Memory Machines

Yusaku YamamotoDept. of Computational Science & Engineering

Nagoya University

Outline of the talk

1. Introduction

2. High performance tridiagonalization algorithms

3. Performance comparison of two tridiagonalization algorithms

4. Optimal choice of algorithm and its parameters

5. Conclusion

Introduction

• Target problem– Standard eigenvalue problem 　 Ax = λx

• A ： n by n real symmetric matrix• We assume that A is dense.

• Applications– Molecular orbital methods

• Solution of dense eigenproblems of order more than 100,000 is required to compute the electronic structure of large protein molecules.

– Principal component analysis

Target machines

• Symmetric Multi-Processors (SMP’s)

• Background– PC servers with 4 to 8 processors are quite common now.– The number of CPU cores integrated into one chip is rapidly incre

asing.

Cell processor(1+8 cores)

dual-core Xeon

Flow chart of a typical eigensolverfor dense symmetric matrices

Q*AQ = T(Q: orthogonal)

Tyi =i yi

xi = Qyi

Tridiagonalization

Eigensolution of thetridiagonal matrix

Back transformation

Tridiagonal matrix T

Eigenvalues of T: {i } ，Eigenvectors of T: {yi }

Eigenvectors of A: {xi }

Real symmetric A Computation Algorithm

Householder method

QR methodD & C methodBisection & IIMMR3 algorithm

Back transformation

Objective of this study

• Study the performance of tridiagonalization and back transformation algorithms on various SMP machines.

• The optimal choice of algorithm and its parameters will differ greatly depending on the computational environment and the problem size.

• The obtained data will be useful in designing an automatically tuned eigensolver for SMP machines.

Algorithms for tridiagonalization and back transformation

• Tridiagonalization by the Householder method– Reduction by Householder transformation H = I – uut

• H is a symmetric orthogonal matrix.• H eliminates all but the first element of a vector.

• Computation at the k-th step

• Back transformation– Apply the Householder transformations in reverse order.

k

0

0

0

0

0

0elements to be eliminatedmodified elements

nonzero elements

multiplyH from

left

multiply H from

right

multiplyH from

left

vector

Problems with the conventional Householder method

• Characteristics of the algorithm– Almost all the computational work are done in matrix vector

multiplication and rank-2 update, both of which are level-2 BLAS. • Total computational work： (4/3)N3

• Matrix-vector multiplication： (2/3)N3

• rank-2 update： (2/3)N3

• Problems– Poor data reuse due to the level-2 BLAS

• The effect of cache misses and memory access contentions among the processors is large.

– Cannot attain high performance on modern SMP machines.

High performance tridiagonalization algorithms (I)

• Dongarra’s algorithm （ Dongarra et al., 1992 ）– Defer the application of the Householder transformations until sev

eral (M) transformations are ready.– Rank-2 update can be replaced by rank-2M update, enabling a mor

e efficient use of cache memory.– Adopted by LAPACK ， ScaLAPACK and other libraries.

• Back transformation– Can be blocked in the same manner to use the level-3 BLAS.

0A(K*M)

0

A((K–1)*M)= ×–

U(K*M) Q(K*M)t

×–

Q(K*M) U(K*M)t

rank-2M update (using level-3 BLAS)

Properties of Dongarra’s algorithm

• Computational work– Tridiagonalization: (4/3)n3

– Back transformation: 2n2m (m: number of eigenvectors)

• Parameters to be determined– MT (block size for tridiagonalization)

– MB (block size for back transformation)

• Level-3 aspects– In the tridiagonalization phase, only half of the total work can be d

one with the level-3 BLAS.– The other half is level-2 BLAS, thereby limiting the performance o

n SMP machines.

High performance tridiagonalization algorithms (II)

• Two-step tridiagonalization (Bishof et al., 1993, 1994)– First, transform A into a band matrix B (of half bandwidth L ).

• This can be done almost entirely with the level-3 BLAS.– Next, transform B into a tridiagonal matrix T.

• The work is O(n2L) and is much smaller than that in the first step.

• Transformation into a band matrix

0

0 00

L

A B T

(4/3)n3 O(n2L)

Murata’salgorithm

0

0

0

0

0

0elements to be eliminated

modified elements

nonzero elements

multiplyH from

left

multiply H from

right

High performance tridiagonalization algorithms (II)

• Wu’s improvement (Wu et al., 1996)– Defer the application of the block Householder transformations until

several (MT) transformations are ready.– Rank-2L update can be replaced by rank-2LMT update, enabling a

more efficient use of cache memory.

• Back transformation– Two-step back transformation is necessary.

• 1st step: eigenvectors of T -> eigenvectors of B• 2nd step: eigenvectors of B -> eigenvectors of A

– Both steps can be reorganized to use the level-3 BLAS.– In the 2nd step, several (MB) block Householder transformations can

be aggregated to enhance cache utilization.

Properties of Bischof’s algorithm

• Computational work– Reduction to a band matrix: (4/3)n3

– Murata’s algorithm: 6n2L– Back transformation

• 1st step: 2n2m• 2nd step: 2n2m

• Parameters to be determined– L (half bandwidth of the intermediate band matrix)– MT (block size for tridiagonalization)– MB (block size for back transformation)

• Level-3 aspects– All the computations that require O(n3) work can be done with the l

evel-3 BLAS.

Performance comparison of two tridiagonalization algorithms

• We compare the performance of Dongarra’s and Bischof’s algorithm under the following conditions:

– Computational environments• Xeon (2.8GHz) Opteron (1.8GHz)• Fujitsu PrimePower HPC2500 IBM pSeries (Power5, 1.9GHz)

– Number of processors• 1 to 32 (depending on the target machines)

– Problem size• n=1500, 3000, 6000, 12000 and 24000• All eigenvalues & eigenvectors, or eigenvalues only

– Parameters in the algorithm• Choose nearly optimal values for each environment and matrix size.

Performance on the Xeon (2.8GHz) processor

02468

1012

1 2 4

BischofDongarra

01020304050607080

1 2 40

100200300400500600

1 2 4

Executiontime (sec.)

Number ofprocessors

n = 1500 n = 3000 n = 6000

Time for computing all the eigenvalues and eigenvectors

00.51

1.52

2.53

1 2 4

BischofDongarra

02468

101214161820

1 2 40

20406080

100120140160180

1 2 4

Time for computing all the eigenvalues only

n = 1500 n = 3000 n = 6000

Bischof’s algorithm is preferable when only the eigenvalues are needed.


Number ofprocessors

00.51

1.52

2.53

3.5

1 2 4

BischofDongarra

05

1015202530

1 2 40

50

100

150

200

250

1 2 4

02468

1012

1 2 4

BischofDongarra

01020304050607080

1 2 40

100200300400500600

1 2 4

Performance on the Opteron (1.8GHz) processorExecutiontime (sec.)

Number ofprocessors

n = 1500 n = 3000 n = 6000



n = 1500 n = 3000 n = 6000

Bischof’s algorithm is preferable for the 4 processor case even whenall the eigenvalues and eigenvectors are needed.


Number ofprocessors

0

50

100

150

200

250

1 2 4 8 16 32

BischofDongarra

0200400600800

1000120014001600

1 2 4 8 16 320

2000400060008000

100001200014000

1 2 4 8 16 32

050

100150200250300350400450

1 2 4 8 16 32

BischofDongarra

0500

100015002000250030003500

1 2 4 8 16 320

5000

10000

15000

20000

25000

30000

1 2 4 8 16 32

Performance on the Fujitsu PrimePower HPC2500Executiontime (sec.)

Number ofprocessors

n = 6000 n = 12000 n = 24000



Bischof’s algorithm is preferable when the number of processors is greaterthan 2 even for computing all the eigenvectors.

n = 6000 n = 12000 n = 24000Executiontime (sec.)

Number ofprocessors

020406080

100120140160

1 2 4 8 16

BischofDongarra

0

200

400

600

800

1000

1200

1 2 4 8 160

100020003000400050006000700080009000

1 2 4 8 16

BischofDongarra

050

100150200250300350

1 2 4 8 16

BischofDongarra

0

500

1000

1500

2000

2500

1 2 4 8 160

2000400060008000

100001200014000160001800020000

1 2 4 8 16

Performance on the IBM pSeries (Power5, 1.9GHz)Executiontime (sec.)

Number ofprocessors

n = 6000 n = 12000 n = 24000



n = 6000 n = 12000 n = 24000

Bischof’s algorithm is preferable when only the eigenvalues are needed.


Number ofprocessors

Optimal algorithm as a function of the number of processors and necessary eigenvectors

• Xeon (2.8GHz)

• Opteron (1.8GHz)

0%20%40%60%80%

100%

1 2 40%

20%40%60%80%

100%

1 2 40%

20%40%60%80%

100%

1 2 4

Bischof

Dongarra

Fraction ofeigenvectorsneeded

Number ofprocessors

n = 1500 n = 3000 n = 6000

0%20%40%60%80%

100%

1 2 40%

20%40%60%80%

100%

1 2 40%

20%40%60%80%

100%

1 2 4

n = 1500 n = 3000 n = 6000Fraction ofeigenvectorsneeded

Number ofprocessors

Bischof

Dongarra

0%20%40%60%80%

100%

1 2 4 8 160%

20%40%60%80%

100%

1 2 4 8 160%

20%40%60%80%

100%

1 2 4 8 16

0%20%40%60%80%

100%

1 2 4 8 16 320%

20%40%60%80%

100%

1 2 4 8 16 320%

20%40%60%80%

100%

1 2 4 8 16 32

Optimal algorithm as a function of the number of processors and necessary eigenvectors (Cont’d)

• PrimePower HPC2500

• pSeries

Bischof

DongarraFractionofEigen-vectorsneeded

Number ofprocessors

n = 6000 n = 12000 n = 24000

Bischof

DongarraFractionofEigen-vectorsneeded

Number ofprocessors

n = 6000 n = 12000 n = 24000

Optimal choice of algorithm and its parameters

• To construct an automatically tuned eigensolver, we need to select the optimal algorithm and its parameters according to the computational environment and the problem size.

• To do this efficiently, we need an accurate and inexpensive performance model that predicts the performance of an algorithm given the computational environment, problem size and the parameter values.

• A promising way to achieve this goal seems to be the hierarchical modeling approach (Cuenca et al., 2004):– Model the execution time of BLAS primitives accurately based on the mea

sured data.– Predict the execution time of the entire algorithm by summing up the exec

ution times of the BLAS primitives.

Performance prediction of Bischof’s algorithm

• Predicted and measured execution time of reduction to a band matrix (Opteron, n=1920)

• The model reproduces the variation of the execution time as a function of L and MT fairly well.

• Prediction errors: less than 10%• Prediction time: 0.5s (for all the values of L and MT )

14

1664

L=3 L=6 L=12 L=24 L=48 L=96

02468

1012141618

L=3L=6L=12L=24L=48L=96

14

1664

L=3 L=6 L=12 L=24 L=48 L=96

02468

1012141618

L=3L=6L=12L=24L=48L=96

Time （ s） Time （ s）

Predicted time Measured timeL

MT

L

MT

Automatic optimization of parameters L and MT

• Performance for the optimized values of (L, MT)

• The values of (L ， MT) determined from the model were actually optimal for almost all cases.

• Parameter optimization in other phases and selection of the optimal algorithm is in principle possible using the same methodology.

0

500

1000

1500

2000

2500

3000

3500

960 1920 2880 3840 4800 5760 6720 7680 8640 9600

(12,8)(24,4)(48,4)auto- tunedoptimal

Performance（MFLOPS）

Matrix size n

Conclusion

• We studied the performance of Dongarra’s and Bischof’s tridiagonalization algorithms on various SMP machines for various problem sizes.

• The results show that the optimal algorithm strongly depends on the target machine, the number of processors and the number of necessary eigenvectors. On the Opteron and HPC2500, Bischof’s algorithm is always preferable when the number of processors exceeds 2.

• By using the hierarchical modeling approach, it seems possible to predict the optimal algorithm and its parameters prior to execution. This will open a way to an automatically tuned eigensolver.

Documents

Toward an Automatically Tuned Dense Symmetric Eigensolver for Shared Memory Machines Yusaku Yamamoto Dept. of Computational Science & Engineering Nagoya