
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Page 1: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Gian Marco Iodice, SW Engineer – ARM

May 3, 2016

Using SGEMM and FFTs to Accelerate Deep Learning

Page 2: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Contents

• About ARM

• Convolutional Neural Networks (CNNs)

• Architecture and building blocks

• Convolutional Layer

• SGEMM-based convolution

• FFT-based convolution

• SGEMM vs FFT

• Limited Numerical Precision for CNNs

• Lessons Learned

Page 3: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


ARM Ltd

• ARM Holdings plc is a British multinational semiconductor and software design company (www.arm.com)

• Headquarters in Cambridge, England

Page 4: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Architecture and Building Blocks of a CNN

• Convolutional layer (core block of CNN)

• Number of convolution kernels (filters bank)

• Filter shape (width, height and depth)

• Pooling layer (typical size 2x2)

• Non-linear gating (ReLU)

• Classifier: Fully Connected Neural Network

[Diagram: a CNN as a stack of learned, non-linear, trainable feature extraction stages]

Page 5: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Why Are We Going to Study the Convolutional Layer?

Compute Load for AlexNet Inference*

conv1   16.9%
relu     0.7%
pool     1.0%
conv2   21.9%
pool2    0.7%
norm2    0.5%
conv3   17.8%
relu3    0.2%
conv4   17.8%
conv5   17.7%
fc6      1.8%
fc7      0.8%

The convolutional layers alone account for roughly 92% of the total compute.

* Learning Semantic Image Representations at a Large Scale, Yangqing Jia

Page 6: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


From 2D Convolution to 3D Batched Convolution

• For the convolutional layers, most of the time we have:

• Multiple input images

• Multiple convolution kernels (various dimensions and shapes)

• Multiple channels per image/kernel (not necessarily 3!)

[Diagram: an input image convolved with a bank of kernels to produce multiple output images]

Why don’t we use a sliding-window approach?

Page 7: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


SGEMM-based Convolution

C = α∙AB + β∙C (SGEMM: Single-precision GEneral Matrix Multiply)

Page 8: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Im2col

• im2col stores in each row the pixels needed for one kernel application

• This costs in terms of memory requirements!

• pixel duplication

• col2im restores the output image structure

[Diagram: im2col unrolls the input image (with stride in x) into one matrix operand, the kernels form the other, and the SGEMM result C is reshaped back (col2im) into the output images]
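As a concrete illustration, here is a minimal, single-channel im2col sketch in plain C. The function name, argument list and row-major layout are illustrative assumptions, not the presenter's actual code; with this layout, an SGEMM whose other operand holds one flattened kernel per row computes all kernel applications at once.

#include <stddef.h>

/* Hypothetical single-channel im2col: one output row per kernel position.
 * src: H x W input image, row-major.
 * dst: (out_h * out_w) rows of kh * kw pixels each. */
static void im2col(const float *src, int H, int W,
                   int kh, int kw, int stride, float *dst)
{
    const int out_h = (H - kh) / stride + 1;
    const int out_w = (W - kw) / stride + 1;

    for (int oy = 0; oy < out_h; ++oy) {
        for (int ox = 0; ox < out_w; ++ox) {
            /* Copy the kh x kw patch under this kernel position into one
             * row; overlapping patches duplicate pixels, which is exactly
             * the memory cost mentioned above. */
            float *row = dst + (size_t)(oy * out_w + ox) * (kh * kw);
            for (int ky = 0; ky < kh; ++ky)
                for (int kx = 0; kx < kw; ++kx)
                    row[ky * kw + kx] =
                        src[(oy * stride + ky) * W + (ox * stride + kx)];
        }
    }
}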

Page 9: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


SGEMM: Naïve implementation

• Each thread computes a single element of the output matrix

Not cache friendly!

/* First accumulation */
ai = load(addr_a);
bi = load(addr_b);
c0 += ai * bi;

/* Second accumulation: B is walked column-wise, jumping N elements */
ai = load(addr_a + 1);
bi = load(addr_b + 1 * N);
c0 += ai * bi;

...

store(c0, addr_c);

[Diagram: access pattern over Matrix A, Matrix B and Matrix C]
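A complete naive kernel along these lines could look as follows in OpenCL C. This is a sketch assuming row-major storage, α = 1 and β = 0; the kernel name and arguments are illustrative, not the presenter's code.

// One work-item per element of C. The column-wise walk over B
// (stride of N floats) is the cache-unfriendly access noted above.
__kernel void sgemm_naive(__global const float *A,
                          __global const float *B,
                          __global float *C,
                          const int K, const int N)
{
    const int col = get_global_id(0); // column of C
    const int row = get_global_id(1); // row of C

    float c0 = 0.0f;
    for (int k = 0; k < K; ++k)
        c0 += A[row * K + k] * B[k * N + col];

    C[row * N + col] = c0;
}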

Page 10: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Transpose Matrix B


/* First accumulation */
ai = load(addr_a);
bi = load(addr_b);
c00 += ai * bi;

/* Second accumulation: B is now walked row-wise, with unit stride */
ai = load(addr_a + 1);
bi = load(addr_b + 1);
c00 += ai * bi;

...

store(c00, addr_c);

[Diagram: access pattern over Matrix A, transposed Matrix B and Matrix C]

Speed-up achievable? 1.1x…

Page 11: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Transpose Matrix B in Chunks of 1x4 (I)

• Each thread computes 1x4 elements of the output matrix

Not cache friendly!

float4 out = 0.0f;

/* First accumulation */
ai = load(addr_a);
bi = vload4(addr_b);
out += (float4)ai * bi;

/* Second accumulation: still jumps a full row of N elements */
ai = load(addr_a + 1);
bi = vload4(addr_b + 1 * N);
out += (float4)ai * bi;

...

store4(out, addr_c);

[Diagram: each thread computes a 1x4 chunk of Matrix C from Matrix A and Matrix B]

Page 12: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Transpose Matrix B in Chunks of 1x4 (II)

float4 out = 0.0f;

/* First accumulation */
ai = load(addr_a);
bi = vload4(addr_b);
out += (float4)ai * bi;

/* Second accumulation: contiguous after the 1x4 repacking of B */
ai = load(addr_a + 1);
bi = vload4(addr_b + 4);
out += (float4)ai * bi;

...

store4(out, addr_c);

[Diagram: Matrix B repacked into Matrix BT1x4 so that each 1x4 chunk is stored contiguously]

[Chart: SGEMM speed-up over the naïve kernel for N = 512, 1024, 2048, 4096 (A, B, C all NxN). Speed-up achievable: 3.5x]
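Putting the pieces together, a full 1x4 kernel over the repacked B could be sketched as below, under the same row-major assumptions as before. The BT1x4 layout assumed here stores, for each group of 4 output columns, the K corresponding 1x4 chunks back to back.

// One work-item computes 4 adjacent elements of C using float4 math.
// Bt holds B repacked in 1x4 chunks: for column group g, chunk k lives
// at Bt[(g * K + k) * 4], so consecutive k are contiguous in memory.
__kernel void sgemm_1x4(__global const float *A,
                        __global const float *Bt,
                        __global float *C,
                        const int K, const int N)
{
    const int g   = get_global_id(0); // group of 4 columns of C
    const int row = get_global_id(1);

    float4 out = (float4)(0.0f);
    for (int k = 0; k < K; ++k) {
        const float  ai = A[row * K + k];
        const float4 bi = vload4(0, &Bt[(g * K + k) * 4]);
        out += (float4)(ai) * bi;
    }
    vstore4(out, 0, &C[row * N + g * 4]);
}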

Page 13: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Reshaping Matrix A (I)

• We can do more…we can compute a block of 4x4 elements per

thread in order to re-use the values loaded from Matrix A

[Diagram: each thread computes a 4x4 block of Matrix C from Matrix A and Matrix BT1x4]

Page 14: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Reshaping Matrix A (II)

[Diagram: Matrix A (8x8) reshaped into Matrix AI (2x32); each chunk (Chunk 0, Chunk 1, …) is a block of 4 rows stored contiguously]

[Chart: SGEMM speed-up over the naïve kernel for N = 512, 1024, 2048, 4096 (A, B, C all NxN). Speed-up achievable: > 8.0x]
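For reference, a 4x4 blocking kernel over both reshaped operands might be sketched as follows. The AI layout assumed here interleaves 4 rows of A per chunk (floats 4k..4k+3 of chunk r hold A[4r+0..3][k]); names and layout details are illustrative assumptions, not the presenter's code.

// One work-item computes a 4x4 tile of C, so every float4 loaded from the
// reshaped A is reused across 4 columns, and every chunk of Bt across 4 rows.
__kernel void sgemm_4x4(__global const float *Ai, // reshaped A (4-row chunks)
                        __global const float *Bt, // B repacked in 1x4 chunks
                        __global float *C,
                        const int K, const int N)
{
    const int g = get_global_id(0); // group of 4 columns of C
    const int r = get_global_id(1); // group of 4 rows of C

    float4 c0 = (float4)(0.0f), c1 = c0, c2 = c0, c3 = c0;
    for (int k = 0; k < K; ++k) {
        const float4 a = vload4(0, &Ai[(r * K + k) * 4]); // 4 rows of A
        const float4 b = vload4(0, &Bt[(g * K + k) * 4]); // 4 cols of B
        c0 += (float4)(a.s0) * b;
        c1 += (float4)(a.s1) * b;
        c2 += (float4)(a.s2) * b;
        c3 += (float4)(a.s3) * b;
    }
    const int col = g * 4;
    vstore4(c0, 0, &C[(r * 4 + 0) * N + col]);
    vstore4(c1, 0, &C[(r * 4 + 1) * N + col]);
    vstore4(c2, 0, &C[(r * 4 + 2) * N + col]);
    vstore4(c3, 0, &C[(r * 4 + 3) * N + col]);
}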

Page 15: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


FFT-based Convolution

• Convolution in the spatial domain is equivalent to an element-wise (scalar) multiplication in the frequency domain
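In symbols, this is the convolution theorem:

$$f * g \;=\; \mathcal{F}^{-1}\big\{\, \mathcal{F}\{f\} \cdot \mathcal{F}\{g\} \,\big\}$$

so a convolution becomes a forward FFT of image and kernel, a point-wise complex multiplication, and an inverse FFT.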

Page 16: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


From Radix-2 to Mixed-Radix

• The most famous FFT is the radix-2 Cooley–Tukey algorithm (only for N a power of 2: N = 2 x 2 x 2 …)

• In general, any factorization of N is possible (N = N1 x N2 x N3 x …)

• Mixed-Radix is the generalization of the basic radix-2 FFT

Over 1.5x better performance than Radix-2
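For reference (standard textbook material, not from the slides): factoring N = N1·N2 and re-indexing the DFT with n = N2·n1 + n2 and k = k1 + N1·k2 gives

$$X_{k_1 + N_1 k_2} = \sum_{n_2=0}^{N_2-1} \left[ e^{-2\pi i\, n_2 k_1 / N} \sum_{n_1=0}^{N_1-1} x_{N_2 n_1 + n_2}\, e^{-2\pi i\, n_1 k_1 / N_1} \right] e^{-2\pi i\, n_2 k_2 / N_2}$$

i.e. N2 inner DFTs of size N1, a multiplication by twiddle factors, and N1 outer DFTs of size N2, applied recursively over any factorization of N.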

Page 17: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


FFT Implementation

• Recursive FFT in-place computation*

• Each thread computes a single radix-N (floating point computation)

• Block-wise 2x2 in-place transposition

• ~2x better performance than 2x2 out-of-place transposition

• Out-of-place batched convolution

• High memory requirements as we have to keep the frequency representation for:

1. Input image

2. Convolution kernels

3. Result of convolutions

* https://community.arm.com/groups/arm-mali-graphics/blog/2016/02/01/speeding-up-fast-fourier-transform-mixed-radix-on-mobile-arm-mali-gpu-by-means-of-opencl-part-2

Page 18: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


SGEMM vs FFT (I)

SGEMM-based convolution

• High memory requirements due to im2col when:

• stride < kernel dimension

• large convolution kernels

• large input images

FFT-based convolution

• No efficient way to handle stride != 1

• High memory requirements for batched convolutions

• It can require considerable effort to optimize well

Page 19: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


SGEMM vs FFT (II)

• Study limited to the inference problem

• Stride x = 1 and stride y = 1

• Number of channels = 1

• Pre-computed FFT for the convolution kernels

Case 1: 1 input image, 64/128/256 convolution kernels

[Chart: fastest method (SGEMM vs FFT) as a function of image size and kernel size / number of convolutions]

Case 2: 64 input images, 32 convolution kernels

[Chart: fastest method (SGEMM vs FFT) as a function of image size and kernel size]

And what happens using stride x = 2?

Page 20: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Limited Numerical Precision for CNN (I)

• Some papers ([1], [2]) have demonstrated the feasibility of using limited numerical precision for CNNs

• This opens an interesting computational scenario if, for instance, the HW has accelerators for 16-bit half-precision floating point:

• Performance boost

• Reduced memory traffic to/from external memory

• Possible to dispatch fewer threads

• Energy saving

• Essentially due to the reduced memory traffic to/from the external memory

[1] Training Deep Neural Networks with Low Precision Multiplications, Matthieu Courbariaux, Jean-Pierre David

[2] Deep Learning with Limited Numerical Precision, Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan
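As a sketch of the storage side of this idea (an illustrative variant of the earlier 1x4 kernel, not the presenter's code): OpenCL's built-in vload_half/vstore_half functions convert between 16-bit halves in memory and float arithmetic in registers, halving external memory traffic even without half-precision ALUs.

// Half-precision storage: arithmetic stays in float, only the loads and
// stores move 16-bit data. With the cl_khr_fp16 extension the arithmetic
// itself could also run in half on supporting hardware.
__kernel void sgemm_1x4_f16(__global const half *A,
                            __global const half *Bt,
                            __global half *C,
                            const int K, const int N)
{
    const int g   = get_global_id(0);
    const int row = get_global_id(1);

    float4 out = (float4)(0.0f);
    for (int k = 0; k < K; ++k) {
        const float  ai = vload_half(row * K + k, A);
        const float4 bi = vload_half4(0, &Bt[(g * K + k) * 4]);
        out += (float4)(ai) * bi;
    }
    vstore_half4(out, 0, &C[row * N + g * 4]);
}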

Page 21: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Limited Numerical Precision for CNN (II)

[Charts: FP16 speed-up over FP32 for N = 512, 1024, 2048, 4096]

SGEMM speed-up: > 2.0x, since it is possible to dispatch fewer threads (e.g. 8x4 elements per thread); N: A=NxN, B=NxN, C=NxN

FFT speed-up: > 1.5x; we cannot dispatch fewer threads, as each thread computes a single radix-N

Page 22: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Lessons Learned

1. A cache-efficient data layout has a huge impact on the performance of our algorithms, also for GPU computing

2. Simple changes in data layout make it possible to:

• dispatch fewer threads

• better exploit vector instructions

3. Limited numerical precision plays a crucial role IF it is HW accelerated

4. Convolution is an embarrassingly parallel task which can be easily and efficiently accelerated on mobile GPUs by means of OpenCL

Page 23: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Question Time

Page 24: "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM


Thank you!

The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

Copyright © 2016 ARM Limited
