
A framework for practical fast matrix multiplication


Slides from my talk at the ICME Colloquium.


Page 1: A framework for practical fast matrix multiplication


A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION

Austin Benson ([email protected]), ICME, Stanford

Grey Ballard, Sandia National Laboratories

ICME Colloquium, October 20, 2014

github.com/arbenson/fast-matmul

arXiv: 1409.2908

Page 2: A framework for practical fast matrix multiplication


Fast matrix multiplication: bridging theory and practice

• A number of Strassen-like algorithms for matrix multiplication have been “discovered” only recently. [Smirnov13], [Benson&Ballard14]

• We show that they can achieve higher performance than the Intel Math Kernel Library (MKL).

• We use code generation for extensive prototyping.

[Figure: the exponent of matrix multiplication over time: 3 (classical), 2.81 [Strassen69], ..., 2.37 [Williams12]]

Page 3: A framework for practical fast matrix multiplication


Strassen’s algorithm

Page 4: A framework for practical fast matrix multiplication


Key ingredients of Strassen’s algorithm

1. Block partitioning of the matrices (<2, 2, 2>)
2. Seven linear combinations of sub-blocks of A
3. Seven linear combinations of sub-blocks of B
4. Seven matrix multiplies to form the Mr (recursive)
5. Linear combinations of the Mr to form the Cij

(A one-level NumPy sketch of these five steps follows.)
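As an illustration (NumPy for brevity; the framework itself generates C++), here is a minimal recursive sketch of Strassen's algorithm; the cutoff `base` is a hypothetical tuning knob, not the framework's:

```python
import numpy as np

def strassen(A, B, base=128):
    """Strassen's algorithm for square matrices; a minimal sketch.
    Falls back to classical multiplication below `base` or on odd sizes."""
    n = A.shape[0]
    if n <= base or n % 2:
        return A @ B

    h = n // 2  # <2, 2, 2> block partitioning
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Seven multiplies of linear combinations of sub-blocks (recursive)
    M1 = strassen(A11 + A22, B11 + B22, base)
    M2 = strassen(A21 + A22, B11, base)
    M3 = strassen(A11, B12 - B22, base)
    M4 = strassen(A22, B21 - B11, base)
    M5 = strassen(A11 + A12, B22, base)
    M6 = strassen(A21 - A11, B11 + B12, base)
    M7 = strassen(A12 - A22, B21 + B22, base)

    # Linear combinations of the M_r form the blocks of C
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```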

Page 5: A framework for practical fast matrix multiplication


Key ingredients of fast matmul algorithms

1. Block partitioning of the matrices (<M, K, N>)
2. R linear combinations of sub-blocks of A
3. R linear combinations of sub-blocks of B
4. R matrix multiplies to form the Mr (recursive)
5. Linear combinations of the Mr to form the Cij

R < MKN ⇒ faster than classical

Key idea: trade O(N^3) computation (matmul) for more O(N^2) computation (adding matrices together). The asymptotic payoff can be read off directly from <M, K, N> and R, as the short calculation below shows.
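A back-of-envelope check (mine, not from the slides): recursing on an <M, K, N> partition with R multiplies gives an arithmetic cost of O(n^w) with w = 3 log(R) / log(MKN) for square n x n problems.

```python
from math import log

def exponent(M, K, N, R):
    """Asymptotic exponent of an <M, K, N> algorithm with R multiplies:
    recursive application costs O(n^w), w = 3*log(R)/log(M*K*N)."""
    return 3 * log(R) / log(M * K * N)

for part, R in [((2, 2, 2), 7), ((4, 2, 4), 26), ((3, 2, 3), 15)]:
    print(part, "R =", R, "-> w = %.3f" % exponent(*part, R))
# (2, 2, 2) R = 7  -> w = 2.807   (Strassen)
# (4, 2, 4) R = 26 -> w = 2.820
# (3, 2, 3) R = 15 -> w = 2.811
```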

Page 6: A framework for practical fast matrix multiplication


“Outer product” fast algorithm

• <4, 2, 4> partitioning
• R = 26 multiplies (< 4 * 2 * 4 = 32)
• 23% speedup per recursive step, since 32/26 ≈ 1.23 (if everything else were free)
• Linear combinations of the Aij to form the Sr: 68 terms
• Linear combinations of the Bij to form the Tr: 52 terms
• Linear combinations of the Mr to form the Cij: 69 terms

Page 7: A framework for practical fast matrix multiplication


Discovering fast algorithms is a numerical challenge

• Low-rank tensor decompositions lead to fast algorithms
• The tensors are small, but we need exact decompositions, and finding those is NP-hard
• Use alternating least squares (ALS) with regularization and rounding tricks [Smirnov13], [Benson&Ballard14] (a minimal ALS sketch follows the list)

• We have around 10 fast algorithms for <M, K, N> decompositions. We also have permutations like <K, M, N>.
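As an illustration of the ALS approach, here is a minimal NumPy sketch that fits a rank-R CP decomposition of the <M, K, N> matrix multiplication tensor with plain ridge regularization. The names (`matmul_tensor`, `als`) and the simple regularizer are my own simplifications; the actual searches in [Smirnov13] and [Benson&Ballard14] add rounding and other tricks to reach exact decompositions.

```python
import numpy as np

def matmul_tensor(M, K, N):
    """The <M, K, N> matrix multiplication tensor: T[(i,p), (p,j), (i,j)] = 1."""
    T = np.zeros((M * K, K * N, M * N))
    for i in range(M):
        for p in range(K):
            for j in range(N):
                T[i * K + p, p * N + j, i * N + j] = 1.0
    return T

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x R) and B (J x R) -> (I*J x R)."""
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def als(T, R, iters=5000, reg=1e-4, seed=0):
    """Rank-R CP decomposition T[i,j,l] ~ sum_r U[i,r] V[j,r] W[l,r] by
    regularized alternating least squares. An exact rank-7 fit of
    matmul_tensor(2, 2, 2) corresponds to a Strassen-like algorithm."""
    rng = np.random.default_rng(seed)
    d = T.shape
    F = [rng.standard_normal((d[m], R)) for m in range(3)]
    unf = [np.moveaxis(T, m, 0).reshape(d[m], -1) for m in range(3)]
    for _ in range(iters):
        for m in range(3):
            others = [F[k] for k in range(3) if k != m]
            KR = khatri_rao(others[0], others[1])  # matches unfolding order
            G = KR.T @ KR + reg * np.eye(R)        # ridge-regularized normal equations
            F[m] = np.linalg.solve(G, KR.T @ unf[m].T).T
    return F

U, V, W = als(matmul_tensor(2, 2, 2), R=7)
```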

Page 8: A framework for practical fast matrix multiplication


[Figure: two example fast algorithms, from [Smirnov13] and [Strassen69]]

Page 9: A framework for practical fast matrix multiplication


Code generation lets us prototype algorithms quickly

• We have a compact representation of many fast algorithms:
  1. dimensions of the block partitioning (<M, K, N>)
  2. linear combinations of sub-blocks (Sr, Tr)
  3. linear combinations of the Mr to form the Cij

• We use code generation to rapidly prototype fast algorithms; the sketch below shows the kind of data that drives it.
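To make the compact representation concrete, here is a hedged sketch of how three coefficient matrices (here called U, V, W, filled in with Strassen's <2, 2, 2> entries) can drive a generic one-level fast multiply. The interpreter below is my own illustration; the actual framework generates specialized C++ from this data rather than interpreting it.

```python
import numpy as np

# Row r of U (resp. V) gives the linear combination of the blocks of A
# (resp. B) used in multiply r; row b of W gives how the M_r combine into
# block b of C. Blocks are ordered row-major: A11, A12, A21, A22, etc.
U = np.array([[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
              [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]])
V = np.array([[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
              [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]])
W = np.array([[1,  0, 0, 1, -1, 0, 1],
              [0,  0, 1, 0,  1, 0, 0],
              [0,  1, 0, 1,  0, 0, 0],
              [1, -1, 1, 0,  0, 1, 0]])

def blocks(X, rows, cols):
    """Partition X into a flat, row-major list of rows x cols sub-blocks."""
    return [b for r in np.vsplit(X, rows) for b in np.hsplit(r, cols)]

def one_level(A, B, U, V, W, part):
    """One step of an <M, K, N> fast algorithm, given its coefficients."""
    M_, K_, N_ = part
    Ab, Bb = blocks(A, M_, K_), blocks(B, K_, N_)
    Ms = [sum(u * Ablk for u, Ablk in zip(U[r], Ab)) @
          sum(v * Bblk for v, Bblk in zip(V[r], Bb))
          for r in range(U.shape[0])]
    Cb = [sum(w * Mr for w, Mr in zip(W[b], Ms)) for b in range(W.shape[0])]
    return np.block([Cb[i * N_:(i + 1) * N_] for i in range(M_)])

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
assert np.allclose(one_level(A, B, U, V, W, (2, 2, 2)), A @ B)
```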

Page 10: A framework for practical fast matrix multiplication


Sequential performance

Effective GFLOPS for an M x K x N multiply = 1e-9 * 2 * M * K * N / (time in seconds)

[Plot: sequential effective GFLOPS; a horizontal line marks the machine's true peak]
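A minimal sketch of how this metric is computed (the timing harness here is my own, not the framework's). Note that the metric always charges 2*M*K*N flops, so a fast algorithm, which does fewer actual flops, can exceed the machine's true peak:

```python
import time
import numpy as np

def effective_gflops(mult, M, K, N):
    """Effective GFLOPS = 1e-9 * 2*M*K*N / seconds, regardless of how many
    flops `mult` actually performs."""
    A, B = np.random.rand(M, K), np.random.rand(K, N)
    t0 = time.perf_counter()
    mult(A, B)
    return 1e-9 * 2 * M * K * N / (time.perf_counter() - t0)

print(effective_gflops(np.matmul, 4000, 4000, 4000))
```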

Page 11: A framework for practical fast matrix multiplication


Sequential performance

• All algorithms beat MKL on large problems
• Strassen's algorithm is hard to beat


Page 12: A framework for practical fast matrix multiplication


Sequential performance

• Almost all algorithms beat MKL
• <4, 2, 4> and <3, 2, 3> tend to perform the best

Page 13: A framework for practical fast matrix multiplication


Sequential performance

• Almost all algorithms beat MKL
• <4, 3, 3> and <4, 2, 3> tend to perform the best

Page 14: A framework for practical fast matrix multiplication


Practical issues

• When to stop the recursion?
• How should we implement the linear combinations?
• Can we eliminate redundant linear combinations? Is that useful?
• How should we parallelize? How should we model parallel performance?

Page 15: A framework for practical fast matrix multiplication


When to stop recursion? Look at the gemm curve.

Basic idea: take another recursive step only if the sub-problems will still operate at high performance. A sketch of such a stopping rule follows.

[Plot: dgemm performance curve for <M, K, N> = <4, 2, 3>]
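A hedged sketch of that rule (the cutoff value and names are illustrative, not the framework's): recurse only while every sub-problem dimension stays above a machine-dependent size read off the dgemm curve.

```python
def should_recurse(m, k, n, part=(4, 2, 3), min_dim=1000):
    """Take another recursive step only if the sub-problems are still
    large enough to run dgemm near its peak rate; min_dim is a
    machine-dependent knob read off the gemm performance curve."""
    M_, K_, N_ = part
    return min(m // M_, k // K_, n // N_) >= min_dim
```

A recursive driver would call this before each split and fall back to dgemm otherwise.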

Page 16: A framework for practical fast matrix multiplication


Understanding parallel performance

• 6 cores: similar performance to sequential
• 24 cores: MKL is best for large problems

Page 17: A framework for practical fast matrix multiplication


Bandwidth problems

• We rely on matrix multiplications being much more expensive than matrix additions
• Parallel dgemm on 24 cores: easily gets 50-75% of peak
• STREAM benchmark: < 6x speedup in read/write performance on 24 cores

[Diagram: the blocks of C are assembled from linear combinations of M1, M2, ..., M7; these additions are bandwidth-bound]
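A rough way to see the imbalance on a given machine (my own micro-benchmark sketch, not the authors' experiment): compare the time of an n x n multiply, which does O(n^3) flops, with an n x n addition, which does O(n^2) flops and is bandwidth-bound.

```python
import time
import numpy as np

def best_time(op, *args, reps=5):
    """Best-of-reps wall-clock time of op(*args), in seconds."""
    best = float('inf')
    for _ in range(reps):
        t0 = time.perf_counter()
        op(*args)
        best = min(best, time.perf_counter() - t0)
    return best

n = 4000
A, B = np.random.rand(n, n), np.random.rand(n, n)
t_mul = best_time(np.matmul, A, B)
t_add = best_time(np.add, A, B)
# Each recursive step of a fast algorithm trades some O(n^3) multiply
# work for more O(n^2) addition work, so this ratio limits the benefit.
print("multiply/add time ratio: %.1f" % (t_mul / t_add))
```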

Page 18: A framework for practical fast matrix multiplication


High-level conclusions

• For square matrix multiplication, Strassen's algorithm is best
• For rectangular matrix multiplication, use a fast algorithm that “matches the shape”
• Bandwidth limits the performance of shared-memory parallel fast matrix multiplication; this should be less of an issue in distributed memory

Future work:
• Numerical stability
• Using fast matmul as a kernel for other algorithms in numerical linear algebra

Page 19: A framework for practical fast matrix multiplication


A FRAMEWORK FOR PRACTICAL FAST MATRIX MULTIPLICATION

Austin Benson ([email protected]), ICME, Stanford

Grey Ballard, Sandia National Laboratories

ICME Colloquium, October 20, 2014

github.com/arbenson/fast-matmul

arXiv: 1409.2908