Upload
others
View
9
Download
0
Embed Size (px)
Citation preview
Riptide: Techniques for Fast End-to-End
Binarized NetworksJosh Fromm
Binary Networks2 0 1 3
0 0 1 10 0 1 1
1 0 0 1
Original Datauint32
Bitplanesuint1
3
93Bitpacked Data
uint4
Bitserial Dot Product
0 0 1 1
0011
10010011= =
1× popcount(3&3) + 2× popcount(3&9) = 4
Binary Networks2 0 1 3
0 0 1 10 0 1 1
1 0 0 1
Original Datauint32
Bitplanesuint1
3
93Bitpacked Data
uint4
Bitserial Dot Product
0 0 1 1
0011
10010011= =
1× popcount(3&3) + 2× popcount(3&9) = 4
• Replaces 32-bit values with 1 or 2 bits
Binary Networks2 0 1 3
0 0 1 10 0 1 1
1 0 0 1
Original Datauint32
Bitplanesuint1
3
93Bitpacked Data
uint4
Bitserial Dot Product
0 0 1 1
0011
10010011= =
1× popcount(3&3) + 2× popcount(3&9) = 4
• Replaces 32-bit values with 1 or 2 bits• Up to 32X compression
Binary Networks2 0 1 3
0 0 1 10 0 1 1
1 0 0 1
Original Datauint32
Bitplanesuint1
3
93Bitpacked Data
uint4
Bitserial Dot Product
0 0 1 1
0011
10010011= =
1× popcount(3&3) + 2× popcount(3&9) = 4
• Replaces 32-bit values with 1 or 2 bits• Up to 32X compression• Up to 48X (hypothetical) speedup
Fused Glue
Fully Binarized (ours) Traditional
Cum
ulat
ive
Num
ber o
f Ope
ratio
ns
Quantized Layer
Conversion Layer
Floating point Layer
Input/Output
Bitp
ack
Qua
ntize
Bina
ry C
onv
Dequ
antiz
e
Scal
e
Pool
Batc
h N
orm
Activ
atio
n
Bina
ry C
onv
Bitp
ack
Shift
Sca
le
Clip Po
ol
Bitpack Fusion
𝑥 𝑥 𝑥2 𝑥 𝑥 𝑥 𝑥 𝑥 𝑥 𝑥 𝑥. . . Bytes
BinaryConv
𝑐 𝑐2𝑐 𝑐 𝑐. 2𝑁𝐻𝑊𝐶 Bytes
Int-N Bit Packed Activations
Int-16 Popcount Accumulation
Fused Shift/Scale
Int-16
𝑞 𝑞2𝑞 𝑞 .Int-16 Quantized Prepacked Bits𝑞 2𝑁𝐻𝑊𝐶 Bytes
Bit Pack
𝑦 𝑦 𝑦2 𝑦 𝑦 𝑦 𝑦 𝑦 𝑦 𝑦 𝑦. . .Int-N Bit Packed Outputs
Int-8
Bytes
Int-16 Unused
Int-8
Int-N
Bitpack Fusion
SqueezeNet Speedups
SqueezeNet Speedups
• Up to 10X speedup
SqueezeNet Speedups
• Up to 10X speedup• Memory savings from fused schedule
yield 1.3X speedup
SqueezeNet Speedups
• Up to 10X speedup• Memory savings from fused schedule
yield 1.3X speedup
• All optimizations have significant impact
SqueezeNet Speedups
• Up to 10X speedup• Memory savings from fused schedule
yield 1.3X speedup
• All optimizations have significant impact
• Fused glue offers a 2X speedup
All Speedups
Accuracy / Runtime
Accuracy / Runtime
Available Open Source Soon!