Copyright © 2007 Intel Corporation.
16bit 3D Convolution Implementation: SSE + OpenMP
Benchmarking on Penryn
Dr. Zvi Danovich, Senior Application Engineer
January 2008
Copyright © 2008 Intel Corporation. 2
Agenda
– Mathematics of 3D convolution
– Main idea of SSE implementation of 1D convolution
– Basic routine of algorithm: 2D convolution – 1 line
– Main routine of algorithm: 3D convolution – line by line
– Adding OpenMP, benchmarking, conclusions
3D convolution – what is it?
3D convolution (with a 3x3x3 kernel K) is computed for each pixel P as
$$P^{3D}_{l_0,m_0,n_0} = \sum_{l=l_0-1}^{l_0+1} \; \sum_{m=m_0-1}^{m_0+1} \; \sum_{n=n_0-1}^{n_0+1} K_{l-l_0+1,\,m-m_0+1,\,n-n_0+1} \; p_{l,m,n}$$
where p are the source pixels and K are the convolution kernel values.
In other words, each new pixel is the sum of 27 products of source pixel values with the corresponding kernel values inside the kernel cube:
[Figure: the 3x3x3 kernel cube drawn as three 3x3 planes of products K·p; P is the sum of all 27 products.]
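The definition above can be written as a scalar reference routine. This is a minimal sketch, not the presented implementation: the function name `conv3d_point`, the slice-by-slice, line-by-line memory layout and the flat 27-element kernel array are assumptions made for illustration, and only interior pixels are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: one output pixel as the sum of 27 products.
 * Assumed layout: p[(l * lines + m) * width + n], K[(dl * 3 + dm) * 3 + dn]. */
int32_t conv3d_point(const int16_t *p, size_t slices, size_t lines, size_t width,
                     const int16_t K[27], size_t l0, size_t m0, size_t n0)
{
    /* Interior pixels only: every neighbour must exist. */
    assert(l0 >= 1 && l0 + 1 < slices);
    assert(m0 >= 1 && m0 + 1 < lines);
    assert(n0 >= 1 && n0 + 1 < width);

    int32_t sum = 0;
    for (size_t dl = 0; dl < 3; dl++)
        for (size_t dm = 0; dm < 3; dm++)
            for (size_t dn = 0; dn < 3; dn++) {
                size_t idx = ((l0 + dl - 1) * lines + (m0 + dm - 1)) * width
                             + (n0 + dn - 1);
                sum += (int32_t)K[(dl * 3 + dm) * 3 + dn] * p[idx];
            }
    return sum;
}
```

A routine like this is mainly useful as a correctness baseline against which a vectorized version can be compared.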
Recombination from 1D convolutions
If the 1D convolution is defined as
$$P^{1D}_{n_0} = \sum_{n=n_0-1}^{n_0+1} K_{n-n_0+1} \, p_n = K_0 \, p_{n_0-1} + K_1 \, p_{n_0} + K_2 \, p_{n_0+1}$$
then the final line of the 3D convolution is
$$P^{3D}_{l_0,m_0} = \sum_{l=l_0-1}^{l_0+1} \; \sum_{m=m_0-1}^{m_0+1} P^{1D}_{l,m}$$
where $P^{1D}_{l,m}$ is the 1D convolution of line $m$ in plane $l$, computed with the kernel row $K_{l-l_0+1,\,m-m_0+1,\,\cdot}$;
i.e. the 3D convolution can be presented as a double sum of 9 1D convolutions: 3 planes with 3 lines in each plane.
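The recombination can be illustrated in scalar form: nine 3-tap 1D convolutions, one per (plane, line) pair, are summed to give one 3D result. This is a sketch under assumed names and layout (`conv1d_point`, `conv3d_from_1d`, volume stored slice by slice, kernel stored as 27 contiguous values); it is not taken from the presentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1D convolution of one line at position n0 with one 3-value kernel row. */
int32_t conv1d_point(const int16_t *line, const int16_t k[3], size_t n0)
{
    assert(n0 >= 1);
    return (int32_t)k[0] * line[n0 - 1]
         + (int32_t)k[1] * line[n0]
         + (int32_t)k[2] * line[n0 + 1];
}

/* 3D result as a double sum over 3 planes x 3 lines of 1D convolutions.
 * Assumed layout: p[(l * lines + m) * width + n], K[(dl * 3 + dm) * 3 + dn]. */
int32_t conv3d_from_1d(const int16_t *p, size_t lines, size_t width,
                       const int16_t K[27], size_t l0, size_t m0, size_t n0)
{
    assert(l0 >= 1 && m0 >= 1 && m0 + 1 < lines && n0 >= 1 && n0 + 1 < width);

    int32_t sum = 0;
    for (size_t dl = 0; dl < 3; dl++)
        for (size_t dm = 0; dm < 3; dm++) {
            const int16_t *line = p + ((l0 + dl - 1) * lines + (m0 + dm - 1)) * width;
            sum += conv1d_point(line, K + (dl * 3 + dm) * 3, n0);
        }
    return sum;
}
```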
Main part of algorithm: 1D convolution – idea of implementation
Let us start from 3 sequential QUADs of a source line and multiply all three by different K (kernel) values (denoted as $k_-$, $k_c$, $k_+$).
[Figure: source pixels p with indices -4..7 (three QUADs); multiplication gives three rows of products, $k_- p$, $k_c p$ and $k_+ p$, each for indices -4..7.]
Using PALIGNR, select the QUAD shifted left for the products with $k_-$ and the QUAD shifted right for the products with $k_+$. Sum them up with the unshifted QUAD of products with $k_c$:
[Figure: selection by PALIGNR picks $k_- p_{-1..2}$ and $k_+ p_{1..4}$; the unshifted QUAD is $k_c p_{0..3}$.]
$$P_0 = k_- p_{-1} + k_c p_0 + k_+ p_1$$
$$P_1 = k_- p_0 + k_c p_1 + k_+ p_2$$
$$P_2 = k_- p_1 + k_c p_2 + k_+ p_3$$
$$P_3 = k_- p_2 + k_c p_3 + k_+ p_4$$
The resulting sums are the convolution expressions for the central QUAD!
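The multiply-then-select scheme can be modelled in portable scalar C. This is only a sketch: the `quad` struct stands in for one 128-bit register holding four 32bit lanes, and `quad_align` emulates what PALIGNR (the `_mm_alignr_epi8` intrinsic) does at 32bit-lane granularity. All names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int32_t v[4]; } quad; /* stand-in for a 4 x 32bit register */

/* Lane-wise multiply by one kernel value (the role of mullo_epi32). */
static quad quad_scale(quad a, int32_t k)
{
    quad r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[i] * k;
    return r;
}

static quad quad_add(quad a, quad b)
{
    quad r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[i] + b.v[i];
    return r;
}

/* PALIGNR model: view lo as lanes 0..3 and hi as lanes 4..7 of one
 * 8-lane value, and return the 4 lanes starting at lane `shift`. */
static quad quad_align(quad hi, quad lo, int shift)
{
    assert(shift >= 0 && shift <= 4);
    quad r;
    for (int i = 0; i < 4; i++) {
        int j = i + shift;
        r.v[i] = (j < 4) ? lo.v[j] : hi.v[j - 4];
    }
    return r;
}

/* prev = p[-4..-1], cur = p[0..3], next = p[4..7]; returns P[0..3]. */
quad conv1d_quad(quad prev, quad cur, quad next,
                 int32_t k_minus, int32_t k_c, int32_t k_plus)
{
    /* Multiply first, as on the slide, then select the shifted QUADs. */
    quad m_prev = quad_scale(prev, k_minus);
    quad m_cur  = quad_scale(cur,  k_minus);
    quad c_cur  = quad_scale(cur,  k_c);
    quad p_cur  = quad_scale(cur,  k_plus);
    quad p_next = quad_scale(next, k_plus);

    quad left  = quad_align(m_cur,  m_prev, 3); /* k- * p[-1..2] */
    quad right = quad_align(p_next, p_cur,  1); /* k+ * p[1..4]  */
    return quad_add(quad_add(left, c_cur), right);
}
```

With real SSE registers the two `quad_align` calls would correspond to byte shifts of 12 and 4, since each lane is 4 bytes wide.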
Basic routine of algorithm: 2D convolution – 1 line
The main loop treats sequential EIGHTs of 16bit pixels for 3 adjacent lines (unrolled inside 1 step). The 1D convolution (in 32bit form) is computed for the 2 QUADs of each EIGHT, and the results for the 3 lines are summed up, forming the 2D convolution results.
To avoid "if"s in the main loop, the very first step is separated into a prolog part, which is simpler than the general step.
Below is the description of the computations for 1 line (of the 3 lines) in a general main loop step.
It starts by loading an EIGHT of 16bit source pixels and unpacking it into 2 32bit QUADs:
[Figure: an EIGHT of 16bit source pixels p0..p7 is loaded; two shuffles produce registers equivalent to the first unpacked 32bit QUAD (p0 p1 p2 p3) and the second unpacked 32bit QUAD (p4 p5 p6 p7).]
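One way to widen an EIGHT into two 32bit QUADs is sketched below. It assumes signed 16bit pixels and does not reproduce the shuffle shown on the slide: it swaps in the common SSE2 idiom of interleaving each pixel with its sign word, and falls back to a scalar loop where SSE2 is not available. The function name `unpack_eight` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Widen 8 signed 16bit pixels into two QUADs of 32bit values. */
void unpack_eight(const int16_t src[8], int32_t quad0[4], int32_t quad1[4])
{
    assert(src && quad0 && quad1);
#if defined(__SSE2__)
    __m128i eight = _mm_loadu_si128((const __m128i *)src);
    /* 0x0000 for non-negative pixels, 0xFFFF for negative ones. */
    __m128i sign = _mm_srai_epi16(eight, 15);
    /* Interleaving pixel and sign word yields sign-extended 32bit lanes. */
    _mm_storeu_si128((__m128i *)quad0, _mm_unpacklo_epi16(eight, sign));
    _mm_storeu_si128((__m128i *)quad1, _mm_unpackhi_epi16(eight, sign));
#else
    for (int i = 0; i < 4; i++) {
        quad0[i] = src[i];
        quad1[i] = src[i + 4];
    }
#endif
}
```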
Basic routine of algorithm: 2D convolution – 1 line
Multiply the 2 QUADs (from the unpacking stage) by three different K values (denoted as $k_-$, $k_c$, $k_+$), resulting in 6 product QUADs. Treat them together with 2 similar product QUADs saved at the previous loop step.
[Figure: source pixels 0..7; multiplication with SSE4 mullo_epi32 gives the product rows $k_- p$ and $k_+ p$ for indices -4..7 and $k_c p$ for indices 0..7, where indices -4..-1 are the saved product QUADs from the previous step. Frames labelled "1", "2" and "Prev" mark the QUADs selected for each sum.]
Using PALIGNR, select the appropriate QUADs and start/continue forming 3 sum QUADs:
– (1) RED frame: 2D convolution of the 1st source QUAD; it will be finalized and stored at the end of the current step,
– (2) GREEN frame: 2D convolution of the 2nd source QUAD; it will be finalized and stored at the end of the next step/epilog,
– (Prev) YELLOW frame: 2D convolution of the previous 2nd source QUAD; it will be finalized and stored at the end of the current step.
Therefore, at the end of the current step, 2 resulting 2D convolution QUADs, the PREVIOUS 2nd and the CURRENT 1st, will be stored.
Basic routine of algorithm: 2D convolution – 1 line: finalizing
As already mentioned, each step treats and sums up data from 3 adjacent lines: it performs the computations from the previous slides for the 2 other lines as well, with the corresponding sets of kernel components.
The prolog step doesn't include the PREVIOUS sum computation and, naturally, doesn't store it.
The epilog step includes the computation and store of the very last 2D convolution QUAD, which is fully similar to the PREVIOUS computation in a regular step.
Finally, the above routine builds ONE 32bit line of 2D convolution resulting points.
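The prolog / main loop / epilog structure can be shown on a deliberately reduced scalar model: a single line, processed one QUAD per step instead of one EIGHT across 3 lines. Each QUAD is finalized one step late, once its right neighbour has been loaded, which is the role of the PREVIOUS sum. Zero padding outside the line is an assumption (the slides do not state the border policy), and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static void load_quad(const int16_t *src, int32_t q[4])
{
    for (int i = 0; i < 4; i++) q[i] = src[i];
}

/* Finish one QUAD given the pixel just left and just right of it. */
static void finish_quad(int32_t *dst, int32_t left, const int32_t q[4],
                        int32_t right, const int32_t k[3])
{
    dst[0] = k[0] * left + k[1] * q[0] + k[2] * q[1];
    dst[1] = k[0] * q[0] + k[1] * q[1] + k[2] * q[2];
    dst[2] = k[0] * q[1] + k[1] * q[2] + k[2] * q[3];
    dst[3] = k[0] * q[2] + k[1] * q[3] + k[2] * right;
}

/* 3-tap 1D convolution of a whole line; k = {k-, kc, k+}. */
void conv1d_line(const int16_t *src, int32_t *dst, size_t width,
                 const int32_t k[3])
{
    assert(width >= 4 && width % 4 == 0);
    int32_t prev[4], cur[4];
    int32_t left = 0; /* assumed zero padding at the left border */

    /* Prolog: load the first QUAD; there is no PREVIOUS QUAD to store. */
    load_quad(src, prev);

    /* Main loop, free of border "if"s: finalize and store the PREVIOUS QUAD. */
    for (size_t q = 4; q < width; q += 4) {
        load_quad(src + q, cur);
        finish_quad(dst + q - 4, left, prev, cur[0], k);
        left = prev[3];
        for (int i = 0; i < 4; i++) prev[i] = cur[i];
    }

    /* Epilog: the very last QUAD, with assumed zero padding on the right. */
    finish_quad(dst + width - 4, left, prev, 0, k);
}
```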
Main routine of algorithm: 3D convolution – line by line
To build the full 3D convolution stack, this routine runs on the lines (inner loop) of all slices (external loop).
For each source line, it computes 3 32bit 2D convolution lines, based on the previous, current and next slices, using the "2D convolution – 1 line" routine described above.
[Figure: slice -1 (previous), slice 0 (current) and slice 1 (next); in each slice, lines -1, 0 and 1 feed a 2D convolution, and the resulting 2D convolution lines are summed up.]
The resulting 3D convolution line is built by summing up these 3 lines, normalizing by arithmetic shift and converting the result to 16 bit as follows:
[Figure: QUADs 0..3 and 4..7 of the "Line -1", "Line 0" and "Line +1" 2D convolutions are summed up into a 32bit 3D convolution EIGHT 0..7; after the shift the values actually fit in 16 bits; packs_epi32 converts them into the final 16bit 3D convolution EIGHT, which is then stored.]
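The final stage (sum 3 32bit lines, arithmetic shift, pack to 16 bit, store) might look as follows with SSE2 intrinsics. This is a sketch under assumptions: the function name `finalize_line`, the unaligned loads and the run-time `shift` parameter are illustrative choices, and a scalar fallback is provided for targets without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>

/* Sum of three QUADs loaded from three 32bit 2D convolution lines. */
static __m128i sum3(const int32_t *a, const int32_t *b, const int32_t *c)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_loadu_si128((const __m128i *)c);
    return _mm_add_epi32(va, _mm_add_epi32(vb, vc));
}
#endif

/* dst[n] = saturate16((a[n] + b[n] + c[n]) >> shift), one EIGHT per step. */
void finalize_line(const int32_t *a, const int32_t *b, const int32_t *c,
                   int16_t *dst, size_t width, int shift)
{
    assert(width % 8 == 0 && shift >= 0 && shift < 32);
#if defined(__SSE2__)
    const __m128i count = _mm_cvtsi32_si128(shift);
    for (size_t n = 0; n < width; n += 8) {
        __m128i lo = _mm_sra_epi32(sum3(a + n, b + n, c + n), count);
        __m128i hi = _mm_sra_epi32(sum3(a + n + 4, b + n + 4, c + n + 4), count);
        /* packs_epi32 converts 2 x 4 32bit lanes to 8 16bit lanes with
         * signed saturation, keeping the lane order lo then hi. */
        _mm_storeu_si128((__m128i *)(dst + n), _mm_packs_epi32(lo, hi));
    }
#else
    for (size_t n = 0; n < width; n++) {
        /* Relies on >> being arithmetic for negative values, as it is
         * on mainstream compilers. */
        int32_t s = (a[n] + b[n] + c[n]) >> shift;
        if (s > INT16_MAX) s = INT16_MAX;
        if (s < INT16_MIN) s = INT16_MIN;
        dst[n] = (int16_t)s;
    }
#endif
}
```

If the shift is chosen so that results already fit in 16 bits, as the slide indicates, the saturation in the pack step never triggers and only acts as a safety net.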
Parallelizing by OpenMP and benchmarking
To parallelize the above algorithm with OpenMP over the external (slices) loop, 3 32bit working lines are allocated for each thread.
See below the benchmarks with and without OpenMP on a 2-way HPTN machine (8 cores).
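A possible shape for the parallel slices loop is sketched below. Each thread allocates its own block of 3 32bit working lines inside the parallel region, so no buffer is shared between threads. The callback type `slice_fn`, the driver `process_slices` and the worker `demo_worker` are hypothetical placeholders, not the routines from the presentation; without OpenMP enabled the pragmas are ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-slice worker: `work` holds 3 lines of `width` values. */
typedef void (*slice_fn)(size_t slice, int32_t *work, size_t width, void *ctx);

/* Runs fn for every slice; returns 0 on success, -1 if an allocation failed. */
int process_slices(size_t slices, size_t width, slice_fn fn, void *ctx)
{
    int failed = 0;
    assert(fn != NULL && width > 0);

    #pragma omp parallel reduction(|:failed)
    {
        /* Private working lines: 3 x width 32bit values per thread. */
        int32_t *work = malloc(3 * width * sizeof *work);
        if (work == NULL)
            failed = 1;

        #pragma omp for schedule(static)
        for (long s = 0; s < (long)slices; s++) {
            if (work != NULL)
                fn((size_t)s, work, width, ctx);
        }
        free(work);
    }
    return failed ? -1 : 0;
}

/* Placeholder worker: fills the working lines and records their sum. */
void demo_worker(size_t slice, int32_t *work, size_t width, void *ctx)
{
    long *out = ctx;
    long sum = 0;
    for (size_t n = 0; n < 3 * width; n++) {
        work[n] = (int32_t)slice;
        sum += work[n];
    }
    out[slice] = sum; /* each slice writes only its own entry */
}
```

Allocating once per thread rather than once per slice keeps allocation out of the loop body, which matters when the per-slice work is short.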
3 runs (equivalent of a 3D gradient computation), SSE only vs. SSE+OpenMP:
Serial/SSE = ~3, SSE/(SSE+OpenMP) = ~5.5, Serial/(SSE+OpenMP) = ~16.3
10 runs, SSE only vs. SSE+OpenMP:
Serial/SSE = ~3, SSE/(SSE+OpenMP) = ~6.3, Serial/(SSE+OpenMP) = ~18.6
The SSE speed-up (3x) is close to the theoretical limit for 4x32bit vector operations!
The additional OpenMP speed-up (5.5x-6.3x) brings the overall speed-up to 16.3x-18.6x!