

High-Throughput and High-Accuracy Classification with Convolutional Ternary Neural Networks

Frédéric Pétrot, Adrien Prost-Boucle, Alban Bourge

International Workshop on Highly Efficient Neural Processing, October 4th 2018


CNN Models : Accuracy, Operations and Parameters

A. Canziani, E. Culurciello, A. Paszke, “An Analysis of Deep Neural Network Models for Practical Applications”, 2017

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 1 / 16



CNN Models : Accuracy, Operations and Parameters

Challenges in Embedded Neural Networks
– Limit number of parameters (weight values)
– Limit number of bits of weights and activations
– Integrate many memory cuts with processing elements
– Integrate computation into the memory itself

In other words, ...

K. Usher, “The Dwarf in the Dirt”, Bones, 2009

Let’s Use Ternary {−1, 0, 1} weights and activations on FPGA
– FPGA : Great digital PIM (processing in memory)
– Hardwiring an ANN is too risky ⇒ new and better ANNs appear every other day

Ternarization Classification Error Rates on NN-64 (%)

          CIFAR-10   SVHN   GTSRB
Float     13.11      2.40   0.96
Ternary   13.29      2.40   1.05

H. Alemdar et al, “Ternary neural networks for resource-efficient AI applications”, IJCNN’17

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 2 / 16



Why Ternary Convolutional Neural Networks?

Objectives
– Energy-efficient inference for AI tasks
– Without sacrificing “too much” accuracy
– (Valid at a point in time : learning methods improve continuously)

Solution
– Ternarize {−1, 0, 1} weights and activations a
– Sweet spot between resource usage and accuracy

a. “Perhaps the prettiest number system of all is the balanced ternary notation.” Donald Knuth, The Art of Computer Programming, Volume 2 : Seminumerical Algorithms.

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 3 / 16


Training Ternary Neural Networks : Teacher-Student Approach

                      NN    Teacher                              Student
parameters            ℝ     ℝ                                    {−1, 0, 1}
neuron input          ℝ     ℝ                                    ℤ
activation function   any   any, with (−1, 1) stochastic firing  2-threshold step
neuron output         ℝ     {−1, 0, 1}                           {−1, 0, 1}

Teacher : ρ = tanh(yᵢ)

    nᵢ = −1 with probability −ρ, if ρ < 0
          1 with probability ρ, if ρ > 0
          0 otherwise

Student :

    nᵢ = −1 if yᵢ < b_lo
          1 if yᵢ > b_hi
          0 otherwise
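As a minimal illustration, the two firing rules above can be written as follows (a sketch in C with our own naming, not the authors' training code):

```c
#include <math.h>
#include <stdlib.h>

/* Teacher : stochastic ternary firing driven by rho = tanh(y_i).
   Fires -1 with probability -rho when rho < 0, +1 with probability
   rho when rho > 0, and 0 otherwise. */
int teacher_fire(double y)
{
    double rho = tanh(y);
    double u = (double)rand() / RAND_MAX;      /* uniform draw in [0, 1] */
    if (rho < 0.0) return (u < -rho) ? -1 : 0;
    if (rho > 0.0) return (u <  rho) ?  1 : 0;
    return 0;
}

/* Student : deterministic 2-threshold step. y_i is an integer
   (a sum of ternary products), b_lo and b_hi are learned thresholds. */
int student_fire(int y, int b_lo, int b_hi)
{
    if (y < b_lo) return -1;
    if (y > b_hi) return  1;
    return 0;
}
```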

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 4 / 16


Ternary Neural Networks : Teacher-Student Individual Training

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 5 / 16


Experiments with Ternary Networks

Multiple networks
– VGG-like networks
– two geometries : NN-64 and NN-128
– multiple “acceleration factors” inside the network (ranging from 1 to 256)
– trade-off area/throughput/energy

Automation (kind of ...)
– handmade generic hardware building blocks (VHDL)
– automatically generated networks
– customizable home-made tools (old-school C and Tcl)

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 6 / 16


Overview

– 12 theoretical layers
– 29 physical layers (+30 “glue” FIFOs)
– NN-64 : 1930 neurons, 3.5 M parameters
– NN-128 : 3850 neurons, 14 M parameters

Goal of ternary : have parameters fit in FPGA distributed memories

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 7 / 16


Overview

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 8 / 16


FPGA Design for Ternary Convolutional Neural Networks

Neurons and Parallelism

Max NN-64 “acceleration factor” that fits on a VC709 : 128

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 9 / 16


Acceleration factors

– Base implementation : at most 1 activation transferred between layers in 1 cycle
  But some layers take more time than others to compute ⇒ stalls
– Use parallelism at the layer level :
  Transferring “several” activations in 1 cycle in/out of bottleneck layers

NN-64, parallelism per layer (in/out) :

Acc. factor   NL1     NL2      MPL1    NL3     NL4      MPL2    NL5     NL6     MPL3   NL7   NL8   NL9
  1           -       -        -       -       -        -       -       -       -      -     -     -
  2           -       2/1      -       -       -        -       -       -       -      -     -     -
  4           -       4/1      -       -       2/1      -       -       -       -      -     -     -
  8           -       8/1      -       2/1     4/1      -       -       2/1     -      -     -     -
 16           1/2     16/2     2/1     4/1     8/1      -       2/1     4/1     -      -     -     -
 32           2/4     32/4     4/1     8/2     16/2     2/1     4/1     8/1     -      -     -     -
 64           3/8     64/8     8/2     16/4    32/4     4/1     8/2     16/2    2/1    -     -     -
128           9/16    192/16   16/4    32/8    64/8     8/2     16/4    32/4    4/1    -     -     -
256           27/32   576/32   32/8    64/16   128/16   16/4    32/8    64/8    8/2    2/1   -     -

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 10 / 16


Squeezing High-Efficiency TCNN in FPGA : Adder Trees

Ternary Adders
– Sum of trits ≠ sum of bits
  With (x, y) ∈ {−1, 0, 1}², x + y ∈ {−2, −1, 0, 1, 2}

LUT savings with optimized ternary adder tree :

Number of inputs                         4      8      16     32     64     128    192    256
Generic 2-bit radix-2 adder tree (LUT)   6      21     44     90     181    380    568    757
Optimized ternary adder (LUT)            4      9      21     44     90     184    274    371
Savings                                  33.3%  57.1%  52.3%  51.1%  50.3%  51.6%  51.8%  51.0%

Overall LUT savings when using optimized ternary adder trees :

Acc. factor          4      8      16     32     64     128    256
Savings for NN-64    0.18%  1.38%  4.61%  10.9%  17.3%  24.1%  32.0%
Savings for NN-128   0.22%  1.71%  5.63%  12.9%  19.6%  25.7%  -
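To make the "sum of trits ≠ sum of bits" point concrete, here is a common software analogue of ternary accumulation (a sketch only, not the presented VHDL adder tree; `__builtin_popcountll` is a GCC/Clang builtin): each ternary vector is held as a +1 bit-plane and a −1 bit-plane, so a 64-lane ternary dot product reduces to a few ANDs/ORs and two population counts.

```c
#include <stdint.h>

/* 64-lane ternary dot product. Each operand is encoded as two disjoint
   bit-planes : pos has a 1 where the lane holds +1, neg where it holds -1. */
int ternary_dot64(uint64_t a_pos, uint64_t a_neg,
                  uint64_t w_pos, uint64_t w_neg)
{
    uint64_t p = (a_pos & w_pos) | (a_neg & w_neg);  /* lanes contributing +1 */
    uint64_t n = (a_pos & w_neg) | (a_neg & w_pos);  /* lanes contributing -1 */
    return __builtin_popcountll(p) - __builtin_popcountll(n);
}
```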

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 11 / 16


Squeezing High-Efficiency TCNN in FPGA : Weight Compression

Trits encoding
– Naïve encoding : 2 bits to encode 1 trit ⇒ suboptimal
  Ex. : 3 trits encoded on 6 bits while 3³ = 27 combinations ⇒ encodable on 5 bits
– Optimal number of bits for T trits : b = ⌈log₂(3^T)⌉ = ⌈T × log₂(3)⌉
– Minimal number of bits per trit : b/T ≈ log₂(3) ≈ 1.585 bits
  ⇒ Maximum saving ≈ 20.75 % (Shannon limit)
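A few lines of C reproduce the formula above and the savings it implies (an illustrative sketch; compile with -lm):

```c
#include <math.h>
#include <stdio.h>

/* For T trits packed together : b = ceil(log2(3^T)) = ceil(T * log2(3)),
   and the saving relative to the naive 2 bits/trit encoding. */
int main(void)
{
    for (int T = 1; T <= 8; T++) {
        int b = (int)ceil(T * log2(3.0));
        printf("T = %d : b = %2d bits, %.3f bits/trit, saving = %4.1f %%\n",
               T, b, (double)b / T, 100.0 * (1.0 - (double)b / (2.0 * T)));
    }
    return 0;
}
```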

Interesting cases

– 3 trits / 5 bits ⇒ 16 % saving
– 5 trits / 8 bits ⇒ 20 % saving
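A minimal C sketch of the 5-trits/8-bits case, assuming straightforward base-3 packing (function names are ours; on the FPGA the decoder is pure logic, not software):

```c
#include <stdint.h>

/* Pack 5 trits (each in {-1, 0, 1}) into one byte : 3^5 = 243 <= 256,
   so 8 bits suffice instead of the naive 10. */
uint8_t pack5(const int8_t t[5])
{
    uint8_t v = 0;
    for (int i = 0; i < 5; i++)
        v = v * 3 + (uint8_t)(t[i] + 1);   /* map {-1,0,1} to {0,1,2} */
    return v;
}

/* Inverse : recover the 5 trits from the packed byte. */
void unpack5(uint8_t v, int8_t t[5])
{
    for (int i = 4; i >= 0; i--) {
        t[i] = (int8_t)(v % 3) - 1;        /* map {0,1,2} back to {-1,0,1} */
        v /= 3;
    }
}
```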

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 12 / 16


Squeezing High-Efficiency TCNN in FPGA : Weight Compression

Compression of weights

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 13 / 16


Squeezing High-Efficiency TCNN in FPGA : Weight Compression

Trading off BRAM vs logic : Resource Breakdown and Power Analysis

[Charts : resource and power change for NN-64 with compression 3t5b (left) and 5t8b (right); y-axis : % of change wrt same “degree of parallelism” without compression]

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 14 / 16


Measured Results for NN-64 on Xilinx XC7VX690T@250 MHz (VC709)

Resource usage :

Acc. factor   LUT (logic)       LUTRAM            BRAM 18k        FF
128           170555 (39.4%)    37402 (21.47%)    1410 (48.0%)    321352 (37.1%)
256*          302748 (69.9%)    106168 (60.9%)    2920 (96.7%)    640849 (74.0%)

NN-64 with parallelism degree 128
– Uses half of the FPGA resources, reaches max throughput of 60.2k fps (32 × 32 images)
– (LUT+B)RAM throughput of 18.7 Tb/s (290 Gb/s for FC layers only a)
– End-to-end latency including PCIe + RIFFA : 135 µs
– Max performance : 18.7 T(T)OP/s (9.33 T(T)MAC/s)
– Power @ max performance : 11.5 W (idle FPGA ≈ 2 W)
– Peak efficiency of 5226 fps per watt ⇒ 1.62 T(T)OP/s/W or 810 G(T)MAC/s/W

a. VC709 DRAM throughput ≈ 204 Gb/s

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 15 / 16


Take away

– FPGAs are a very good fit for ANNs if weights fit in internal memory
  Extreme quantization needed ⇒ huge weight-access throughput possible
– FPGAs are reconfigurable !
  Who would be mad enough to hardwire a given ANN architecture anyway?
  ⇒ ASICs follow a very different (but equally useful) architectural path
– Creative low-level optimizations help squeeze in high-efficiency networks (NN-64)
  ⇒ High throughput : up to 60.2k fps
  ⇒ Low latency : 135 µs @ 60.2k fps
  ⇒ High power efficiency : 1.62 T(T)OP/s/W or 810 G(T)MAC/s/W @ 60.2k fps

Pétrot, Prost-Boucle, Bourge HENP’18 October 4th 2018 16 / 16