PDCS 2007 November 20, 2007 Accelerating the Complex Hessenberg QR Algorithm with the CSX600 Floating-Point Coprocessor Yusaku Yamamoto 1 Takafumi Miyata

PDCS 2007

November 20, 2007

Accelerating the Complex Hessenberg QR Algorithm with the CSX600 Floating-Point Coprocessor

Yusaku Yamamoto (1), Takafumi Miyata (1), Yoshimasa Nakamura (2)

(1) Nagoya University, (2) Kyoto University

Introduction

• Background
– Advent of many-core floating-point accelerators as a means to speed up scientific computations

• Objective of our study
– Apply these accelerators to the eigenvalue problem for nonsymmetric matrices.
– Identify potential problems.
– Modify existing algorithms or develop new algorithms if necessary.

Outline of the talk

• Introduction

• Many-core floating-point accelerators and their performance characteristics

• The nonsymmetric eigenvalue problem

• Proposed algorithm
– Modification of the small-bulge multishift QR algorithm for floating-point accelerators

• Performance evaluation

• Conclusion

Many-core Floating-point accelerators

• ClearSpeed CSX600
– 1 + 96 processor cores
– 48 GFLOPS (double precision)

• Intel Larrabee (under development)
– 80 processor cores
– 1 TFLOPS (single precision)

• GRAPE-DR (Tokyo Univ.)
– 512 processor cores
– 512 GFLOPS (single precision)
– 256 GFLOPS (double precision)

• These accelerators integrate hundreds of floating-point cores.
• Very high GFLOPS value (peak performance)

Architecture of the CSX600 accelerator

• The CSX600 chip
– 1 main processor
– 96 floating-point processors
  • 64-bit
  • 2 flops/cycle
  • 128-byte register file
  • 6 KB SRAM
– Operates at 250 MHz
– Peak performance: 48 GFLOPS

• ClearSpeed Advance board
– Two CSX600 processors
– 1 GB DRAM
– Connected to the PC via the PCI-X bus
– Peak performance: 96 GFLOPS

Problem with the data transfer speed

• Peak floating-point performance: very high
– 48 GFLOPS / chip
– 96 GFLOPS / board

• Data transfer speed: relatively low
– 3.2 GB/s between the chip and on-board memory
– 1.066 GB/s between the board and main memory

• Byte/flop
– 0.066 byte/flop between the chip and on-board memory
– 0.011 byte/flop between the board and main memory

[Figure: block diagram. The PC (CPU and DRAM) is connected over PCI-X at 1.066 GB/s to the ClearSpeed Advance board, on which the two CSX600 chips access the on-board DRAM at 3.2 GB/s.]
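The peak and byte/flop figures above follow from simple arithmetic on the hardware numbers quoted on the slides. The sketch below only redoes that arithmetic; the constant and variable names are illustrative and not part of any ClearSpeed software.

```python
# Hardware figures quoted on the slides for the CSX600 and the Advance board.
CORES_PER_CHIP = 96          # floating-point processors per chip
FLOPS_PER_CYCLE = 2          # per processor
CLOCK_HZ = 250e6             # 250 MHz
CHIPS_PER_BOARD = 2

CHIP_TO_DRAM_BW = 3.2e9      # bytes/s, chip <-> on-board memory
BOARD_TO_HOST_BW = 1.066e9   # bytes/s, board <-> main memory (PCI-X)

# Peak floating-point rates in flop/s.
chip_peak = CORES_PER_CHIP * FLOPS_PER_CYCLE * CLOCK_HZ
board_peak = CHIPS_PER_BOARD * chip_peak

# Bytes that can be moved per floating-point operation at peak speed.
chip_balance = CHIP_TO_DRAM_BW / chip_peak
board_balance = BOARD_TO_HOST_BW / board_peak

print(f"chip peak    : {chip_peak / 1e9:.0f} GFLOPS")
print(f"board peak   : {board_peak / 1e9:.0f} GFLOPS")
print(f"chip balance : {chip_balance:.4f} byte/flop")
print(f"board balance: {board_balance:.4f} byte/flop")
```

An operation whose own byte/flop ratio exceeds these values is limited by data transfer rather than by arithmetic.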

Byte/flop of typical linear algebraic operations

Function                         Operation        Data transfer (8-byte words)  Flop count  Byte/flop
Dot product                      α := x^T y       2n                            2n          8
AXPY                             x := x + αy      3n                            2n          12
Matrix-vector multiplication     y := Ax          n^2 + 2n                      2n^2        4
Rank-1 update                    A := A + xy^T    2n^2 + 2n                     2n^2        8
Matrix multiplication (MatMult)  C := C + AB      4n^2                          2n^3        16/n

• Operations other than matrix multiplication cannot exploit the performance of the CSX600 because of the limited data transfer speed.
• Matrix multiplication (MatMult) can be executed efficiently, but only if the size is very large (n > 1500).
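The n > 1500 threshold can be checked roughly against the board's 0.011 byte/flop budget. The following is a bandwidth-only estimate under the assumption that the computation itself runs at peak speed and that transfer latency is ignored; the dictionary and function names are made up for this illustration.

```python
BYTES_PER_WORD = 8  # double precision

# (words transferred, flop count) as functions of n, taken from the table.
OPERATIONS = {
    "dot product": (lambda n: 2 * n, lambda n: 2 * n),
    "AXPY": (lambda n: 3 * n, lambda n: 2 * n),
    "matrix-vector": (lambda n: n * n + 2 * n, lambda n: 2 * n * n),
    "rank-1 update": (lambda n: 2 * n * n + 2 * n, lambda n: 2 * n * n),
    "MatMult": (lambda n: 4 * n * n, lambda n: 2 * n ** 3),
}

def byte_per_flop(name, n):
    """Bytes moved per floating-point operation for the named kernel."""
    words, flops = OPERATIONS[name]
    return BYTES_PER_WORD * words(n) / flops(n)

# Byte/flop the board can sustain over PCI-X: 1.066 GB/s against 96 GFLOPS.
BOARD_BALANCE = 1.066e9 / 96e9

# MatMult needs 16/n byte/flop, so it fits the budget once n >= 16 / balance.
n_min = 16 / BOARD_BALANCE

for name in OPERATIONS:
    print(f"{name:14s} n=2000: {byte_per_flop(name, 2000):.4f} byte/flop")
print(f"MatMult fits the PCI-X budget for n of roughly {n_min:.0f} or more")
```

Only MatMult has a ratio that shrinks with n, which is why it is the only kernel in the table that can approach the board's peak, and the estimate lands close to the measured threshold of about 1500.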

Performance of MatMult on the ClearSpeed board

[Figure: two plots of MatMult performance in GFLOPS (vertical axis, 0 to 30) against the varied matrix dimension (horizontal axis, 1000 to 6000). One plot varies N for M = K = 500, 1000, 1500 and 2000; the other varies M for N = K = 500, 1000, 1500 and 2000.]

• The library transfers the input data from the main memory to the board, performs the computation and returns the result to the main memory.

• M, N and K must be larger than 1500 to get a substantial performance gain.

Problems to be solved

• Problems
– Is it possible to reorganize the algorithm so that most of the computations are done with matrix multiplications?
– What is the overhead of using very large matrix multiplications?
– How can we reduce the overhead?

We consider these problems for the nonsymmetric eigenvalue problem.

The nonsymmetric eigenvalue problem

• The problem
– Eigenvalue problem: $Ax = \lambda x$
  • A: dense complex nonsymmetric matrix
  • Compute all the eigenvalues / eigenvectors

• Applications
– Magnetohydrodynamics
– Structural dynamics
– Quantum chemistry
– Fluid dynamics

• Cf. Z. Bai and J. Demmel: A test matrix collection for non-Hermitian eigenvalue problems.

Algorithm for the nonsymmetric eigenproblem

• The standard algorithm
– Similarity transformation to an upper triangular matrix
– We focus on speeding up the QR algorithm.

[Figure: two-stage reduction. Dense matrix → Hessenberg matrix by the Householder method (finite number of steps, work: (10/3)n^3) → upper triangular matrix by the QR algorithm (iterative method, work: 10n^3 empirically; the target of speedup). The diagonal elements of the triangular matrix are the eigenvalues.]
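The work estimates quoted on this slide already suggest why the QR stage is the target. A small sketch, using only those two estimates (the function names are illustrative):

```python
def hessenberg_reduction_work(n):
    """Work of the Householder reduction, (10/3) n^3, as quoted on the slide."""
    return (10.0 / 3.0) * n ** 3

def qr_iteration_work(n):
    """Empirical work of the QR algorithm, 10 n^3, as quoted on the slide."""
    return 10.0 * n ** 3

n = 6000
total = hessenberg_reduction_work(n) + qr_iteration_work(n)
qr_share = qr_iteration_work(n) / total

print(f"QR algorithm share of the total work: {qr_share:.0%}")
```

Under these estimates the share is independent of n, so the QR stage dominates for every problem size.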

The small-bulge multishift QR algorithm

• Algorithm
– Shifts s_1, …, s_m: eigenvalues of the trailing m × m submatrix of A_l
– Perform m steps of the QR algorithm at once.
– Matrices: A is Hessenberg, Q is unitary, R is upper triangular.

• Computational procedure for one iteration
– Introduce (m / 2) bulges.
– Transform the matrix to Hessenberg form again by chasing the (m / 2) bulges.

[Figure: bulge chasing in a Hessenberg matrix for the case of m = 4 shifts]
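The slide's own formulas did not survive extraction. For reference, the usual textbook statement of one multishift QR iteration, which the bullets above appear to summarize, is sketched below; the exact notation (the subscripts on $Q_l$ and $R_l$, and the superscript $H$ for the conjugate transpose) is an assumption rather than the authors' own.

```latex
\begin{align*}
(A_l - s_m I)(A_l - s_{m-1} I)\cdots(A_l - s_1 I) &= Q_l R_l, \\
A_{l+1} &= Q_l^{H} A_l Q_l .
\end{align*}
```

In practice the product on the left is never formed explicitly: introducing the bulges and chasing them down the matrix produces $A_{l+1}$ implicitly.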


Use of the level-3 BLAS

• Division of the updating operations
– Chase the bulges by only k rows at a time.
– Divide the update operations into two parts:
  • First, update the diagonal block sequentially.
    – Accumulate the Householder transformations used in the update as a unitary matrix.
  • Next, update the off-diagonal blocks.
    – Multiply the off-diagonal blocks by the unitary matrix.

[Figure: blocking of bulge-chasing operations. The 3 × 3 bulges are chased by k rows; the diagonal update is sequential and the off-diagonal update is done with MatMult (level-3 BLAS).]
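The idea of accumulating the transformations and applying them once can be illustrated in a few lines. This is a simplified sketch, not the authors' Fortran code: it uses real, dense reflectors of full block size instead of the small complex reflectors that act on the bulges, and the pure-Python `matmul` merely stands in for a level-3 BLAS matrix multiplication. All names and sizes are illustrative.

```python
import random

def matmul(A, B):
    """Plain triple-loop product; stands in for a level-3 BLAS GEMM call."""
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def householder(v):
    """Return the reflector H = I - 2 v v^T / (v^T v) as a dense matrix."""
    n = len(v)
    vv = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / vv
             for j in range(n)] for i in range(n)]

def max_abs_diff(A, B):
    """Largest entrywise difference between two matrices of equal shape."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

random.seed(0)
k, ncols, nreflectors = 6, 4, 5
reflectors = [householder([random.uniform(-1, 1) for _ in range(k)])
              for _ in range(nreflectors)]
B = [[random.uniform(0, 1) for _ in range(ncols)] for _ in range(k)]

# (a) Unblocked: apply every reflector to the off-diagonal block in turn.
B_seq = B
for H in reflectors:
    B_seq = matmul(H, B_seq)

# (b) Blocked: accumulate the reflectors into one matrix U first ...
U = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
for H in reflectors:
    U = matmul(H, U)
# ... then update the off-diagonal block with a single matrix product.
B_blk = matmul(U, B)

diff = max_abs_diff(B_seq, B_blk)
print(f"largest difference between the two updates: {diff:.2e}")
```

Both routes give the same block up to rounding, but in (b) the work on the large off-diagonal part is a single matrix multiplication, which is the operation the accelerator handles well.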

Performance on the CSX600

• Random matrix (n = 6000)
• Compute all the eigenvalues / eigenvectors with the small-bulge multishift QR algorithm.

• Computational environment
– Xeon 3.2 GHz, memory 8 GB
– ClearSpeed Advance board
  • CSX600 x 2

• As the number of shifts increases …
– the time for the MatMult part decreases;
– the time for the other parts increases (bottleneck).

[Figure: stacked bar chart of execution time (sec, 0 to 12000), split into MatMult and others, for the Xeon alone and for Xeon + CSX600. Number of shifts: 100, 120, 160, 200, 240; corresponding MatMult size: 600, 720, 960, 1200, 1440.]

Parts other than MatMult need to be sped up!

Modification of the algorithm (1)

• Reformulation as a recursive algorithm
– Chasing of (m / 2) / q bulges by k / q rows (e.g., recursion level d = 1)

[Figure: a diagonal block of size k subdivided into blocks of size k / q. The off-diagonal updates at both levels are performed with MatMult.]

Modification of the algorithm (2)

• Deflation
– A trailing submatrix is isolated.

• Eigensolution of the isolated submatrix
– Apply the double-shift QR algorithm.
– The size of the submatrix increases with m.

• Division of the update operations
– Update the diagonal block (until convergence).
  • Accumulate the Householder transformations used in the update as a unitary matrix.
– Update the off-diagonal block (only once, at the end).
  • Multiply the off-diagonal blocks by the unitary matrix.

[Figure: the isolated trailing submatrix, with the sequential diagonal update and the MatMult off-diagonal update marked. Annotations: "Bottleneck"; "Reduce the computational work".]

Numerical experiments

• Test problem
– Random matrices with elements in [0, 1]
  • Reduced to Hessenberg form by Householder's method
– Compute all the eigenvalues / eigenvectors

• Computational environment
– Xeon 3.2 GHz, memory 8 GB
– Fortran 77, double precision
– ClearSpeed Advance board
  • CSX600 x 2
– Matrix multiplication
  • ClearSpeed's library (for large MatMult)
  • Intel Math Kernel Library (for small MatMult)

Numerical experiments

• Comparison
– Existing algorithm (small-bulge multishift QR method)
  • MatMult part
    – Off-diagonal update
– Our algorithm (multishift QR + recursion)
  • MatMult parts
    – Off-diagonal update
    – Diagonal update
    – Eigensolution of isolated submatrix
– Parameter values
  • Number of shifts: m is chosen optimally for each case.
  • Row length of bulge chasing: k = (3/2)m
  • Level of recursion: d = 1
  • Number of subdivisions: q = m / 40
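The parameter rules above fix the block sizes once m is chosen. The sketch below works them out for the three shift counts that the slides report for runs with the CSX600 (m = 160, 200 and 240); the function name and dictionary keys are illustrative.

```python
def chasing_parameters(m, q):
    """Derive block sizes from the number of shifts m and the subdivision count q."""
    k = 3 * m // 2                    # rows chased per outer step, k = (3/2) m
    bulges = m // 2                   # bulges chased together at the outer level
    return {
        "k": k,
        "outer_bulges": bulges,
        "inner_rows": k // q,         # k / q rows per inner (recursive) step
        "inner_bulges": bulges // q,  # (m / 2) / q bulges per inner step
    }

for m in (160, 200, 240):
    q = m // 40                       # number of subdivisions, q = m / 40
    print(m, q, chasing_parameters(m, q))
```

With q = m / 40 the inner step always chases 20 bulges by 60 rows, so the sequential part keeps a fixed small size while the outer MatMult size grows with m.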

Effect of our modifications

• Our algorithm is 1.4 times faster.
– Diagonal update: 1.5 times faster
– Eigensolution of isolated submatrix: 10 times faster

[Figure: stacked bar charts of execution time (sec, 0 to 5000) for the original algorithm and ours, broken down into off-diagonal update, diagonal update, isolated eigenproblem and others. Two cases: n = 3000, m = 160, q = 4 and n = 6000, m = 200, q = 5. The CSX600 is used in all cases.]

Effect of using the CSX600

• By combining the CSX600 with our algorithm,
– 3.5 times speedup when n = 6000
– 3.8 times speedup when n = 12000

[Figure: stacked bar chart of execution time (sec, 0 to 100000) without and with the CSX600, broken down into off-diagonal update, diagonal update, isolated eigenproblem and others. n = 6000: Xeon (m = 100, q = 5) versus Xeon + CSX600 (m = 200, q = 5). n = 12000: Xeon (m = 100, q = 5) versus Xeon + CSX600 (m = 240, q = 6).]

Conclusion

• We proposed an approach to accelerate the solution of nonsymmetric eigenproblems using a floating-point accelerator.

• We used the small-bulge multishift QR algorithm, which can use matrix multiplications efficiently, as a basis.

• By reformulating part of the algorithm as a recursive algorithm, we succeeded in reducing the computational time spent in the non-blocked part. This enables us to use a large block size (number of shifts) with small overhead and to exploit the performance of the floating-point accelerator.

• When solving an eigenproblem of order 12,000, our algorithm is 1.4 times faster than the original small-bulge multishift QR algorithm. We obtained a 3.8 times speedup with the CSX600 over the 3.2 GHz Xeon.
