Lifting Scheme Cores for Wavelet Transform
David Barina (supervised by Pavel Zemcik)
1 / 24
DWT in image processing
the DWT can be found in many image-processing tasks:
- analysis (edge detection, feature extraction, multiscale representation),
- compression (JPEG 2000, Dirac),
- watermarking, edge sharpening, contrast enhancement, tone mapping, denoising, fusion, etc.
2 / 24
Filter bank
S. Mallat, A theory for multiresolution signal decomposition: The wavelet representation (1989)
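[Figure: two-channel filter bank. Analysis: the input is filtered by $H(z^{-1})$ and $G(z^{-1})$, each followed by downsampling by 2, giving a and d. Synthesis: a and d are upsampled by 2, filtered by $H(z)$ and $G(z)$, and summed.]
decomposition: two complementary filters, high number of operations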
3 / 24
Lifting scheme
I. Daubechies, W. Sweldens, Factoring wavelet transforms into lifting steps (1998)
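[Figure: lifting scheme. The input is split and transformed by $P(z^{-1})^T$ into a and d, which are then transformed by $P(z)$ and merged.]
$$P(z) = \prod_{i=0}^{I-1} \left\{ \begin{bmatrix} 1 & S_i(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ T_i(z) & 1 \end{bmatrix} \right\} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}$$
decomposition: sequence of simple filtering steps, reduces the number of operations; split: even, odd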
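The split, lifting steps, and merge can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes a single predict/update pair with 5/3-style coefficients (1/2 and 1/4), periodic borders, and an even-length signal. The function names are hypothetical.

```python
def split(signal):
    """Split a signal into its even and odd samples."""
    return list(signal[0::2]), list(signal[1::2])


def merge(even, odd):
    """Interleave even and odd samples back into one signal."""
    out = [0.0] * (len(even) + len(odd))
    out[0::2] = even
    out[1::2] = odd
    return out


def forward(signal):
    """One decomposition level: a predict step, then an update step."""
    s, d = split(signal)
    n = len(s)
    # predict: odd samples minus the mean of their even neighbours
    d = [d[i] - 0.5 * (s[i] + s[(i + 1) % n]) for i in range(n)]
    # update: even samples corrected by the neighbouring details
    s = [s[i] + 0.25 * (d[(i - 1) % n] + d[i]) for i in range(n)]
    return s, d


def inverse(s, d):
    """Undo the lifting steps in reverse order with flipped signs."""
    n = len(s)
    s = [s[i] - 0.25 * (d[(i - 1) % n] + d[i]) for i in range(n)]
    d = [d[i] + 0.5 * (s[i] + s[(i + 1) % n]) for i in range(n)]
    return merge(s, d)
```

Because every step only adds a filtered copy of one channel to the other, the inverse is obtained by subtracting the same quantity, so `inverse(*forward(x))` returns `x` for any even-length input.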
4 / 24
CDF 9/7 wavelet
I. Daubechies, W. Sweldens, Factoring wavelet transforms into lifting steps (1998)
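[Figure: CDF 9/7 lifting data flow from input to output through the lifting steps, drawn over the even and odd samples.]
$$P(z) = \begin{bmatrix} 1 & \alpha(1 + z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \beta(1 + z) & 1 \end{bmatrix} \begin{bmatrix} 1 & \gamma(1 + z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \delta(1 + z) & 1 \end{bmatrix} \begin{bmatrix} \zeta & 0 \\ 0 & 1/\zeta \end{bmatrix}$$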
four two-tap symmetric filters
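The four two-tap steps followed by scaling might look like this in code. The sketch assumes the coefficient values commonly quoted from the cited Daubechies and Sweldens factorization (named `ALPHA` to `ZETA` here), periodic borders, and an even-length signal. The function names are hypothetical, and which channel each step modifies follows one common convention rather than anything stated on the slide.

```python
# Lifting coefficients of the CDF 9/7 wavelet (rounded values)
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
ZETA = 1.149604398


def cdf97_forward(x):
    """One level of CDF 9/7 lifting; returns (approximation, detail)."""
    s, d = list(x[0::2]), list(x[1::2])
    n = len(s)
    # four lifting steps, each a two-tap symmetric filter
    d = [d[i] + ALPHA * (s[i] + s[(i + 1) % n]) for i in range(n)]
    s = [s[i] + BETA * (d[i] + d[i - 1]) for i in range(n)]
    d = [d[i] + GAMMA * (s[i] + s[(i + 1) % n]) for i in range(n)]
    s = [s[i] + DELTA * (d[i] + d[i - 1]) for i in range(n)]
    # final scaling
    return [ZETA * v for v in s], [v / ZETA for v in d]


def cdf97_inverse(s, d):
    """Inverse transform: undo the scaling, then the steps in reverse."""
    n = len(s)
    s = [v / ZETA for v in s]
    d = [v * ZETA for v in d]
    s = [s[i] - DELTA * (d[i] + d[i - 1]) for i in range(n)]
    d = [d[i] - GAMMA * (s[i] + s[(i + 1) % n]) for i in range(n)]
    s = [s[i] - BETA * (d[i] + d[i - 1]) for i in range(n)]
    d = [d[i] - ALPHA * (s[i] + s[(i + 1) % n]) for i in range(n)]
    x = [0.0] * (2 * n)
    x[0::2] = s
    x[1::2] = d
    return x
```

Each step costs two additions and one multiplication per output sample, which is where the saving over a direct 9-tap and 7-tap filter pair comes from.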
5 / 24
2-D decomposition
S. Mallat, A theory for multiresolution signal decomposition: The wavelet representation (1989)
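[Figure: 2-D decomposition. A horizontal pass followed by a vertical pass splits the image into four subbands (a, h, v, d); the decomposition is then repeated on a.]
image: 2-D signal, transformed by a series of 1-D transforms; four subbands; multi-scale decomposition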
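A separable 2-D decomposition built from 1-D transforms can be sketched as below. To keep the example short, it swaps in a Haar lifting step for the 1-D transform (an assumption made for brevity, not the wavelet used in the slides) and expects even image dimensions. The subband naming assumes the a/h/v/d layout with h holding the horizontal details; all function names are hypothetical.

```python
def haar_1d(x):
    """1-D Haar lifting: detail = odd - even, approximation = pair mean."""
    even, odd = x[0::2], x[1::2]
    d = [o - e for e, o in zip(even, odd)]
    s = [e + di / 2 for e, di in zip(even, d)]
    return s, d


def transpose(m):
    """Transpose a list of rows."""
    return [list(r) for r in zip(*m)]


def columns_pass(block):
    """Apply the 1-D transform to every column of a block."""
    s_cols, d_cols = [], []
    for col in transpose(block):
        s, d = haar_1d(col)
        s_cols.append(s)
        d_cols.append(d)
    return transpose(s_cols), transpose(d_cols)


def dwt2(image):
    """One decomposition level: horizontal pass, then vertical pass."""
    low, high = [], []
    for row in image:
        s, d = haar_1d(row)
        low.append(s)
        high.append(d)
    a, v = columns_pass(low)    # low horizontally: approximation, vertical details
    h, d = columns_pass(high)   # high horizontally: horizontal, diagonal details
    return a, h, v, d
```

A multi-scale decomposition would call `dwt2` again on the returned `a` subband. Note that this row-column form touches every sample twice, which is the memory-access cost the later slides aim to avoid.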
6 / 24
Lenna
how to calculate this as efficiently as possible
7 / 24
Strategies and issues
R. Kutil, A single-loop approach to SIMD parallelization of 2-D wavelet lifting (2006)
[Figure: horizontal and vertical passes producing the subbands a, h, v, d.]
strategies: row-column, block-based, and line-based
cache issues: cache line, limited size, set associativity, prefetching
techniques: padding, aggregation, memory layouts, loop interleaving, parallelization
the approaches have to visit the samples repeatedly, and memory access is expensive
CPU cache, limitations, existing techniques, single-loop approach
8 / 24
Unsolved issues
[Figure: prolog, core, and epilog phases of the processing in both directions.]
- complicated border treatment (prolog/epilog phases)
- suspend/resume processing
- arbitrary processing order (scan order)
- interleaving the transform with subsequent processing
- multi-scale decomposition
- reorganization of the underlying scheme
9 / 24
Objectives of the thesis
Aims: improve image transform performance and resource consumption
Objectives: eliminate the shortcomings of existing methods (previous slide)
Evaluation: prove experimentally (performance, memory requirements)
10 / 24
Lifting core
D. Barina, P. Zemcik, Vectorization and parallelization of 2-D wavelet lifting (in press)
solution: a processing unit
- which continuously consumes an input and produces an output
- which visits every image sample only once (cache friendly)
- which is aware of image coordinates (can handle the borders)
- whose configuration (state) can be saved/restored
- which can be run in any direction
- which can be SIMD vectorized
- which can run in parallel (on independent parts of the image)
$y = C\,x$, where $x \stackrel{\text{def}}{=} I_n B$ and $y \stackrel{\text{def}}{=} O_n B$
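A 1-D analogue of such a processing unit can be sketched as follows. The sketch is only illustrative: it assumes the shorter CDF 5/3 wavelet instead of CDF 9/7 to keep the state small, symmetric border extension, and an even-length signal. The class `Core53` and the helper `transform` are hypothetical names, not the authors' code.

```python
class Core53:
    """Streaming CDF 5/3 lifting core with a lag of one sample pair."""

    def __init__(self):
        self.even = None  # previous even sample x[2k-2]
        self.odd = None   # previous odd sample x[2k-1]
        self.d = None     # previous detail coefficient d[k-2]

    def step(self, even, odd):
        """Consume (x[2k], x[2k+1]); return (s[k-1], d[k-1]) or None."""
        out = None
        if self.even is not None:
            d = self.odd - 0.5 * (self.even + even)
            # left border: mirror the first detail coefficient
            d_left = self.d if self.d is not None else d
            s = self.even + 0.25 * (d_left + d)
            out = (s, d)
            self.d = d
        self.even, self.odd = even, odd
        return out

    def save(self):
        """Return the state needed to resume later."""
        return (self.even, self.odd, self.d)

    def restore(self, state):
        """Resume from a previously saved state."""
        self.even, self.odd, self.d = state


def transform(signal):
    """Feed a whole even-length signal through one core in a single pass."""
    core = Core53()
    pairs = list(zip(signal[0::2], signal[1::2]))
    # right border: mirror the last even sample; the odd value is unused
    pairs.append((signal[-2], 0.0))
    out = []
    for even, odd in pairs:
        y = core.step(even, odd)
        if y is not None:
            out.append(y)
    return out
```

Each input sample is read exactly once, and the three stored values are the entire state, so processing can be suspended with `save()` and resumed elsewhere with `restore()`.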
11 / 24
Core examples
D. Barina, P. Zemcik, Vectorization and parallelization of 2-D wavelet lifting (in press)
[Figure: example cores of size m × n (panels 1 to 4), showing the core inputs and outputs.]
12 / 24
Processing orders
D. Barina, P. Zemcik, Vectorization and parallelization of 2-D wavelet lifting (in press)
[Figure: processing orders. Panels: horizontal, horizontal strips, horizontal blocks, vertical, vertical strips, vertical blocks.]
13 / 24
Borders treatment
D. Barina, P. Zemcik, Vectorization and parallelization of 2-D wavelet lifting (in press)
[Figure: rows of alternating d and a coefficients along positions n, from 0 to N, including the border regions; the core applied at position n is $y = C_n x$.]
the cores gracefully treat the boundaries
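One way a coordinate-aware core could treat the boundaries is to remap out-of-range sample positions instead of running separate prolog and epilog code. The sketch below assumes whole-sample symmetric extension (the kind used in JPEG 2000); the slide does not name the extension, and `mirror` is a hypothetical helper.

```python
def mirror(i, n):
    """Map any integer index i onto [0, n-1] by symmetric extension."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period  # Python's % yields a non-negative result here
    return i if i < n else period - i
```

For a signal of length 4, positions -2, -1, 4, and 5 map to 2, 1, 2, and 1, so a lifting step near a border can read `x[mirror(i, n)]` with the same code it uses in the interior.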
14 / 24
Parallel cores and reorganization
M. Kula, D. Barina, et al., Block-based Approach to 2-D Wavelet Transform on GPUs (2016)
[Figure: steps per scheme. Sweldens1995: steps 1 to 4; Iwahashi2007: steps 1 to 3; proposed: steps 1 to 2.]
15 / 24
3-D core
D. Barina, P. Zemcik, Real-Time 3-D Wavelet Lifting (2015)
[Figure: 3-D core along the x, y, and z axes with buffer x, buffer y, and buffer z.]
extended into more dimensions, buffers on the sides
16 / 24
CPU implementation
D. Barina, P. Zemcik, Vectorization and parallelization of 2-D wavelet lifting (in press)
[Plot: time per pixel (0 to 50 ns) versus image size (1 k to 100 M pixels); series: separable approach, core approach.]
an evaluation of the approaches: the separable, single-loop, and core approaches were implemented
17 / 24
3-D CPU implementation
D. Barina, P. Zemcik, Real-Time 3-D Wavelet Lifting (2015)
[Figure: 3-D core along the x, y, and z axes with buffer x, buffer y, and buffer z.]
[Plot: time per voxel (0 to 160 ns) versus volume size (0 to 250 M voxels); series: naive horizontal, naive vertical, core 4^2, core 2^3, core 4^3.]
performance of 3-D transform: separable, 2-D core, 3-D core
18 / 24
GPU implementation
M. Kula, D. Barina, et al., Block-based Approach to 2-D Wavelet Transform on GPUs (2016)
[Plot: throughput in GB/s (80 to 260) versus image size (0 to 70 M pixels); series: Kucis2014, Separable Block, Non-Separable Block.]
[Plot: throughput in GB/s (0 to 60) versus image size (100 kpel to 100 Mpel); series: Sweldens, Iwahashi*, Explosive*, Monolithic*, Polyphase*.]
[Figure: monolithic scheme.]
left: SotA is in red, block methods in blue/green, reorganization
right: block methods, separable in black, ours in blue/green
19 / 24
FPGA implementation
D. Barina, et al., Single-Loop Approach to 2-D Wavelet Lifting with JPEG 2000 Compatibility (2015)
[Figure: architecture with Input and Transform blocks, H and V stages, and BRAM.]

core         FF               LUT             BRAM
latency 4    441 (0.1 %)      399 (0.18 %)    6 (1.1 %)
latency 2    391 (< 0.1 %)    592 (0.27 %)    6 (1.1 %)

architecture    device                  BRAM [bits]    clocks/pel    time [ms]
Dillen2003      VirtexE1000-8           50K            0.50          1.20
Descampe2004    Virtex-II XC2V6000      N/A            0.60          1.75
Seo2007         Altera Stratix          128K           2.64          6.02
Zhang2012       Virtex-II Pro XC2VP30   6 × 18K        0.50          0.97
the cores       Zynq XC7Z045            1 × 36K        0.26          0.27
20 / 24
JPEG 2000 implementation
D. Barina, O. Klima, P. Zemcik, Single-Loop Architecture for JPEG 2000 (2016)
[Figure: the core and the codeblocks; labels: c_n, c_m, a_j, a_{j+1}, h, v, d.]
[Plot: time in ns (0 to 140) versus resolution (100 k to 1 G pel); series: proposed, OpenJPEG, JasPer, FFmpeg.]
21 / 24
Contributions of the thesis
Aims: improved image transform performance and resource consumption
Objectives: eliminated the shortcomings of existing methods
Evaluation: assessed experimentally (performance, memory requirements)
evaluation performed: 2-D on CPU, 3-D on CPU, 2-D on GPU, 2-D on FPGA, JPEG 2000 on CPU
22 / 24
Selected papers
- Barina, D.; Klima, O.; Zemcik, P.: Single-Loop Software Architecture for JPEG 2000. In Data Compression Conference (DCC), 2016
- Barina, D.; Musil, M.; Musil, P.; et al.: Single-Loop Approach to 2-D Wavelet Lifting with JPEG 2000 Compatibility. In Workshop on Applications for Multi-Core Architectures (WAMCA), 2015
- Barina, D.; Zemcik, P.: Minimum Memory Vectorisation of Wavelet Lifting. In Advanced Concepts for Intelligent Vision Systems (ACIVS), 2013
- Barina, D.; Zemcik, P.: Wavelet Lifting on Application Specific Vector Processor. In GraphiCon, 2013
- Barina, D.; Zemcik, P.: Diagonal Vectorisation of 2-D Wavelet Lifting. In IEEE International Conference on Image Processing (ICIP), 2014
- Barina, D.; Zemcik, P.: Real-Time 3-D Wavelet Lifting. In International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), 2015
- Barina, D.; Zemcik, P.: Vectorization and parallelization of 2-D wavelet lifting. Journal of Real-Time Image Processing (JRTIP), in press
- Barina, D.; Klima, O.; Zemcik, P.: Single-Loop Architecture for JPEG 2000. In Image and Signal Processing (ICISP), 2016
- Kula, M.; Barina, D.; Zemcik, P.: Block-based Approach to 2-D Wavelet Transform on GPUs. In International Conference on Information Technology New Generations (ITNG), 2016
- Kucis, M.; Barina, D.; Kula, M.; et al.: 2-D Discrete Wavelet Transform Using GPU. In Workshop on Applications for Multi-Core Architectures (WAMCA), 2014
23 / 24
Summary
the core
- is a computing unit which processes the data in a single pass,
- can suspend/resume execution,
- can process the data in many different orders,
- can handle signal boundaries (is aware of coordinates),