
Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


ICPP 2001: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions
D. Chaver, C. Tenllado, L. Piñuel, M. Prieto, F. Tirado
UCM

I am from the Complutense University of Madrid, and I am going to present our work on 2-D wavelet-transform enhancement on general-purpose microprocessors: the vectorization of the 2D wavelet lifting transform using SIMD extensions.
Index
Motivation, experimental environment, lifting transform, memory hierarchy exploitation, SIMD optimization, conclusions, future work.

The outline of the presentation is the following: first, I would like to talk about the motivation of our work and give a brief introduction to the wavelet transform. Then I will explain the related work, before introducing the experimental environment we have used. After this, I will talk about the cache analysis we have performed, and about how we have used the SIMD extensions available on modern microprocessors. Finally, I will give some conclusions and hints about our future research.
Motivation
JPEG-2000 MPEG-4
Study based on a modern general purpose microprocessor
Pentium 4
Use of the SIMD ISA extensions
Over the last few years, there has been an important development in applications based on the Wavelet-Transform, like JPEG 2000 and MPEG 4
For this kind of program, the computational complexity of the wavelet transform represents a high percentage of the execution time. Consequently, its performance analysis and tuning are of great interest, and they represent the main goal of our research.
In particular, our study is focused on general-purpose-microprocessors, specifically Pentium 4,
and the objectives are:
The efficient exploitation of the memory hierarchy,
and the use of the S I M D ISA extensions
We should emphasize that, in order to keep the code portable and avoid long development times, we have only considered source level optimizations.
Experimental Environment
Our performance analysis has been carried out on a Pentium 3 at 866 MHz and on a Pentium 4 at 1.5 GHz, the main features of which are summarized in this table.
The memory performance results have been obtained using the hardware counters available on the P6 processor family, which have been accessed through the PAPI tool.
More details can be found in the technical report referenced in the paper.
In our work we have applied the most common approach to the discrete wavelet transform, known as the square decomposition, which alternates between operations on rows and columns. Following this scheme, one stage of the 1D discrete wavelet transform is applied, first to the rows and then to the columns, obtaining the first-level coefficients. The same scheme is then applied recursively to the coarse-scale approximation.
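The square decomposition can be sketched in C as follows. This is only an illustration: a simplified Haar-style predict/update stands in for the paper's lifting filter, and the function names (`lift_1d`, `dwt2d_level`) are ours.

```c
#include <stddef.h>

/* One lifting stage (simplified Haar-style predict/update) applied in
 * place to n elements with a given stride; stride 1 walks a row,
 * stride == cols walks a column. Illustrative, not the paper's filter. */
static void lift_1d(float *x, size_t n, size_t stride) {
    for (size_t i = 0; i + 1 < n; i += 2)   /* predict: detail coefficient */
        x[(i + 1) * stride] -= x[i * stride];
    for (size_t i = 0; i + 1 < n; i += 2)   /* update: coarse coefficient */
        x[i * stride] += 0.5f * x[(i + 1) * stride];
}

/* One level of the square decomposition: filter all rows, then all
 * columns. Further levels would recurse on the coarse approximation
 * (the even-indexed rows and columns in this in-place layout). */
void dwt2d_level(float *img, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++)
        lift_1d(img + r * cols, cols, 1);
    for (size_t c = 0; c < cols; c++)
        lift_1d(img + c, rows, cols);
}
```

Note that the same 1D routine serves both directions only because the stride parameter hides the access pattern; as the next slides show, the column (vertical) pass is exactly where the memory hierarchy suffers.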
Memory Hierarchy Exploitation

Discrepancies between the memory access patterns of the horizontal and vertical filtering cause one of these components to exhibit poor data locality in straightforward implementations of the algorithm. For example, assuming a row-major layout for the image, as used by the C language, the main bottleneck of the 2D discrete wavelet transform is the processing of the image columns, since consecutive elements to be read are separated by a row-size distance. Other research groups have proposed alternatives to overcome this problem, such as aggregation (loop tiling) and the use of non-linear layouts.
The aggregation technique, instead of processing the image columns all the way down in one step, which would produce very low locality, divides the image into blocks; inside each block, the vertical filter is swept row by row, thus improving spatial locality.
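A minimal sketch of aggregation in C, assuming a simple difference filter for the vertical predict step; the block width `B` is illustrative (in practice it is tuned to the cache size):

```c
#include <stddef.h>

#define B 64  /* illustrative block width; tuned to cache size in practice */

/* Vertical predict step with aggregation: rather than walking each
 * full column (every access jumps `cols` floats, touching a new cache
 * line each time), process a strip of up to B columns and sweep it
 * row by row, so accesses within a row are contiguous. */
void vertical_predict_tiled(float *img, size_t rows, size_t cols) {
    for (size_t c0 = 0; c0 < cols; c0 += B) {
        size_t c1 = (c0 + B < cols) ? c0 + B : cols;
        for (size_t r = 0; r + 1 < rows; r += 2)
            for (size_t c = c0; c < c1; c++)
                img[(r + 1) * cols + c] -= img[r * cols + c];
    }
}
```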
Different studied schemes: MALLAT, INPLACE, and INPLACE-MALLAT. The inplace scheme only requires memory for the original matrix, but for most applications it needs a post-processing step.

Execution time breakdown for several sizes, comparing both compilers. I, IM and M denote the inplace, inplace-mallat and mallat strategies respectively. Each bar shows the execution time of each level and the post-processing step.
The Mallat and Inplace-Mallat approaches outperform the Inplace approach for levels 2 and above
These two approaches show a noticeable slowdown for the first level, due to their larger working set.
The Inplace-Mallat version achieves the best execution time
ICC compiler outperforms GCC for Mallat and Inplace-Mallat, but not for the Inplace approach
SIMD Optimization

Different strategies: semi-automatic vectorization and hand-coded vectorization. Only the horizontal filtering of the transform can be semi-automatically vectorized (when using a column-major layout).
Once the memory hierarchy behaves as well as possible, it is time to extract the data parallelism available in the wavelet transform. In particular, we have explored two approaches: the automatic vectorization of one of the filterings, and the vectorization of the whole transform using a block transposition. For the sake of simplicity, we have only considered the 4D pipelined computation.
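The block-transposition approach relies on transposing small tiles so that the column filtering can reuse the vector-friendly row code. A sketch of the building block with SSE (the function name and tile handling are ours, not the paper's listing):

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _MM_TRANSPOSE4_PS */

/* Transpose one 4x4 tile of floats entirely in SSE registers. By
 * transposing tiles on the fly, the vertical filtering becomes a
 * horizontal one over the transposed tile, and the overhead of the
 * in-register shuffles stays small compared with the filtering work. */
void transpose4x4(const float *src, size_t src_stride,
                  float *dst, size_t dst_stride) {
    __m128 r0 = _mm_loadu_ps(src + 0 * src_stride);
    __m128 r1 = _mm_loadu_ps(src + 1 * src_stride);
    __m128 r2 = _mm_loadu_ps(src + 2 * src_stride);
    __m128 r3 = _mm_loadu_ps(src + 3 * src_stride);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  /* 4x4 shuffle network */
    _mm_storeu_ps(dst + 0 * dst_stride, r0);
    _mm_storeu_ps(dst + 1 * dst_stride, r1);
    _mm_storeu_ps(dst + 2 * dst_stride, r2);
    _mm_storeu_ps(dst + 3 * dst_stride, r3);
}
```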
Inner loops
Global variables avoided
Pointer disambiguation if pointers are employed
Only inner loops with simple array index manipulation and without global variables can be automatically vectorized using the Intel C compiler. In addition, if pointers are employed inside the loop, pointer disambiguation is mandatory and must be done by hand using compiler directives.
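These conditions can be illustrated with a loop that an auto-vectorizer such as ICC's can handle; here the C99 `restrict` qualifier supplies the pointer disambiguation (ICC also accepts directives such as `#pragma ivdep`). The filter coefficients are illustrative, not the paper's:

```c
#include <stddef.h>

/* A predict step written to be auto-vectorizable: a single inner loop,
 * plain array indexing, no global variables, and restrict-qualified
 * pointers telling the compiler the arrays do not overlap.
 * `even` must hold n + 1 elements. */
void predict_row(float *restrict odd, const float *restrict even, size_t n) {
    for (size_t i = 0; i < n; i++)
        odd[i] -= 0.5f * (even[i] + even[i + 1]);
}
```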
(The slide lists the intrinsic-based code for the lifting steps: the second, third and fourth operations and the last step.)

Intrinsics allow more flexibility.
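A minimal sketch of the intrinsic-based style, not the paper's exact listing: it assumes `n` is a multiple of 4 and uses unaligned loads for simplicity, and the filter coefficients are again illustrative.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hand-vectorized predict step processing four float coefficients per
 * iteration. Intrinsics give direct control over instruction selection
 * instead of relying on the auto-vectorizer. */
void predict_row_sse(float *odd, const float *even,
                     const float *even_next, size_t n) {
    const __m128 half = _mm_set1_ps(0.5f);
    for (size_t i = 0; i < n; i += 4) {
        __m128 e  = _mm_loadu_ps(even + i);
        __m128 en = _mm_loadu_ps(even_next + i);
        __m128 o  = _mm_loadu_ps(odd + i);
        o = _mm_sub_ps(o, _mm_mul_ps(half, _mm_add_ps(e, en)));
        _mm_storeu_ps(odd + i, o);
    }
}
```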
Execution time breakdown of the horizontal filtering (1024² pixels image).
I, IM and M denote inplace, inplace-mallat and mallat approaches.
S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.
Speedup between 4 and 6 depending on the strategy. Such a high improvement is due not only to the vector computations, but also to a considerable reduction in memory accesses.
The speedups achieved by the strategies with recursive layouts (i.e. inplace-mallat and mallat) are higher than the inplace version counterparts, since the computation on the latter can only be vectorized in the first level.
For ICC, both vectorization approaches (i.e. automatic and hand-tuned) produce similar speedups, which highlights the quality of the ICC vectorizer.
Execution time breakdown of the whole transform (1024² pixels image).
I, IM and M denote inplace, inplace-mallat and mallat approaches.
S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.
Speedup between 1.5 and 2 depending on the strategy.
For ICC the shortest execution time is reached by the mallat version.
When using GCC both recursive-layout strategies obtain similar results.
Speedup achieved by the different vectorial codes over the inplace-mallat and inplace scalar versions. We show the hand-coded ICC, the automatic ICC, and the hand-coded GCC variants.
The speedup grows with the image size.
On average, the speedup is about 1.8 over the inplace-mallat scheme, growing to about 2 when considering it over the inplace strategy.
Focusing on the compilers, ICC clearly outperforms GCC by a significant 20-25% for all the image sizes
Conclusions
Scalar version: We have introduced a new scheme called Inplace-Mallat, which outperforms both the Inplace implementation and the Mallat scheme.
SIMD exploitation: Code modifications for the vectorial processing of the lifting algorithm. Two different methodologies with ICC compiler: semi-automatic and intrinsic-based vectorizations. Both provide similar results.
Speedup: Horizontal filtering about 4-6 (vectorization also reduces the pressure on the memory system).
Whole transform around 2.
The vectorial Mallat approach outperforms the other schemes and exhibits a better scalability.
Most of our insights are compiler independent.
The main conclusions of our work are the following:
Focusing on the memory hierarchy exploitation, the 4D pipelined computation significantly reduces the L2 cache misses due to its better temporal locality, and this improvement translates into significant performance gains.
We have introduced a novel approach to structuring the computation of the wavelet coefficients that allows automatic vectorization and is independent of the filter size and the computing platform.
We have also introduced the vectorization of the whole transform using block transposition, and we have shown that the performance gain more than compensates for the transposition overhead.
Summing up our research, the overall speedup obtained with respect to the previous versions of the transform is around 2 for the P3 and around 2.5 for the P4.
Future Work
Measurements using other platforms.
Similar optimizations of the lifting scheme.
Performance analysis and tuning on other platforms.
Parallelization using OpenMP.
(Chart data: execution-time breakdowns, in seconds, for 256², 1024², 2048² and 8192² images under ICC and GCC, and speedup curves ranging from roughly 1 to 3 as a function of image size.)
