Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions

D. Chaver, C. Tenllado, L. Piñuel, M. Prieto, F. Tirado

UCM

Index

1. Motivation

2. Experimental environment

3. Lifting Transform

4. Memory hierarchy exploitation

5. SIMD optimization

6. Conclusions

7. Future work

Motivation

Applications based on the Wavelet Transform:

o JPEG-2000

o MPEG-4

Usage of the lifting scheme

Study based on a modern general purpose microprocessor

o Pentium 4

Objectives:

o Efficient exploitation of Memory Hierarchy

o Use of the SIMD ISA extensions

Experimental Environment

Platform: Intel Pentium 4 (2.4 GHz)

Motherboard: DFI WT70-EC

Cache:

o IL1: NA

o DL1: 8 KB, 64 Byte/Line, Write-Through

o L2: 512 KB, 128 Byte/Line

Memory: 1 GB RDRAM (PC800)

Operating System: RedHat Distribution 7.2 (Enigma)

Compiler: Intel ICC compiler, GCC compiler

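Lifting Transform

[Figure: lifting scheme. Legend: original element, 1st step, 2nd step. In each step, the sum of the two neighbouring elements is multiplied by a coefficient (β and δ are shown) and added to the element between them; the outputs are the approximation (A) and detail (D) coefficients.]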
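The two steps in the diagram can be sketched in scalar C as follows. This is only an illustration under stated assumptions: the function names `lifting_step` and `lifting_1d`, the interleaved sample order (odd positions become details, even positions approximations) and the mirrored borders are hypothetical choices, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* One lifting step over an interleaved signal x[0..n-1]: every sample at
 * index start, start+2, ... becomes x[i] + coeff * (x[i-1] + x[i+1]).
 * Out-of-range neighbours are mirrored (an assumed boundary rule). */
static void lifting_step(float *x, size_t n, size_t start, float coeff)
{
    assert(n >= 2 && start < 2);
    for (size_t i = start; i < n; i += 2) {
        float left  = (i == 0)     ? x[1]     : x[i - 1];
        float right = (i + 1 == n) ? x[n - 2] : x[i + 1];
        x[i] += coeff * (left + right);
    }
}

/* Two-step lifting: the 1st step updates the odd samples (details, D),
 * the 2nd step updates the even samples (approximations, A). */
static void lifting_1d(float *x, size_t n, float c1, float c2)
{
    lifting_step(x, n, 1, c1);  /* 1st step */
    lifting_step(x, n, 0, c2);  /* 2nd step */
}
```

Calling `lifting_1d(x, 4, -0.5f, 0.25f)` on `{1, 2, 3, 4}` leaves `{1, 0, 3.25, 1}` in place; the coefficient values here are arbitrary examples.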

[Figure: the transform for 1 Level and for N Levels. One level consists of Horizontal Filtering (1D Lifting Transform) and Vertical Filtering (1D Lifting Transform). Legend: original element, approximation.]
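One level of the 2D transform might be sketched as two passes of a 1D lifting routine, first along image rows and then along image columns. The code assumes a column-major image, a two-step lifting with mirrored borders, and hypothetical names (`lift_strided`, `lifting_2d_level`); it is not the authors' implementation.

```c
#include <assert.h>
#include <stddef.h>

/* 1D two-step lifting over n samples spaced `stride` floats apart.
 * Odd samples are updated first (details), then even samples
 * (approximations); borders are mirrored (assumed rule). */
static void lift_strided(float *x, size_t n, size_t stride,
                         float c1, float c2)
{
    assert(n >= 2);
    for (size_t pass = 0; pass < 2; pass++) {
        size_t start = (pass == 0) ? 1 : 0;
        float  c     = (pass == 0) ? c1 : c2;
        for (size_t i = start; i < n; i += 2) {
            size_t l = (i == 0)     ? 1     : i - 1;
            size_t r = (i + 1 == n) ? n - 2 : i + 1;
            x[i * stride] += c * (x[l * stride] + x[r * stride]);
        }
    }
}

/* One level on a rows x cols image stored column-major:
 * img[c * rows + r] is the element at row r, column c. */
static void lifting_2d_level(float *img, size_t rows, size_t cols,
                             float c1, float c2)
{
    /* Horizontal filtering: one image row, elements `rows` apart. */
    for (size_t r = 0; r < rows; r++)
        lift_strided(img + r, cols, rows, c1, c2);
    /* Vertical filtering: one image column, contiguous elements. */
    for (size_t c = 0; c < cols; c++)
        lift_strided(img + c * rows, rows, 1, c1, c2);
}
```

With this layout the horizontal pass walks memory with a stride of `rows` floats, while the vertical pass is contiguous.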

[Figure: Horizontal Filtering and Vertical Filtering panels, each annotated with the labels 1 and 2.]

Memory Hierarchy Exploitation

Poor data locality of one component (canonical layouts)

E.g.: column-major layout processing image rows (Horizontal Filtering)

o Aggregation (loop tiling)

Poor data locality of the whole transform

o Other layouts

[Figure: Vertical Filtering panel annotated with the labels 1 and 2.]

Aggregation

[Figure: aggregation applied to the IMAGE, annotated with the labels 1 and 2.]
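A rough sketch of aggregation for the horizontal filtering on a column-major image: rows are grouped into tiles, and all lifting steps run on one tile before moving to the next, so the inner loop is unit-stride and the tile can stay in cache between steps. The tile height, the two-step lifting, the mirrored borders and the name `horizontal_lifting_tiled` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile height: 16 floats fill one 64-byte DL1 line. */
static const size_t TILE_ROWS = 16;

/* Two-step horizontal lifting on a column-major image
 * (img[j * rows + i] is row i, column j), one tile of rows at a time.
 * Borders are mirrored (assumed rule). */
static void horizontal_lifting_tiled(float *img, size_t rows, size_t cols,
                                     float c1, float c2)
{
    assert(cols >= 2);
    for (size_t r0 = 0; r0 < rows; r0 += TILE_ROWS) {
        size_t r1 = (rows - r0 < TILE_ROWS) ? rows : r0 + TILE_ROWS;
        for (size_t pass = 0; pass < 2; pass++) {
            float c = (pass == 0) ? c1 : c2;
            /* pass 0 updates odd columns, pass 1 even columns */
            for (size_t j = 1 - pass; j < cols; j += 2) {
                size_t jl = (j == 0)        ? 1        : j - 1;
                size_t jr = (j + 1 == cols) ? cols - 2 : j + 1;
                float       *cur   = img + j  * rows;
                const float *left  = img + jl * rows;
                const float *right = img + jr * rows;
                for (size_t i = r0; i < r1; i++)  /* unit stride */
                    cur[i] += c * (left[i] + right[i]);
            }
        }
    }
}
```

Because the rows are independent in the horizontal filtering, the tiled loop produces the same values as filtering one row at a time.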


Different studied schemes

INPLACE

o Common implementation of the transform

o Memory: only requires the original matrix

o For most applications, needs post-processing

MALLAT

o Memory: requires 2 matrices

o Stores the image in the expected order

INPLACE-MALLAT

o Memory: requires 2 matrices

o Stores the image in the expected order
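The difference between the interleaved order left by an in-place transform and the sub-band order that applications expect can be pictured in one dimension. The sketch below assumes that approximations sit at even positions and details at odd positions; `inplace_to_mallat_1d` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Reorder one filtered line from the interleaved (inplace) order
 *   A0 D0 A1 D1 ...
 * to the sub-band (Mallat) order
 *   A0 A1 ... D0 D1 ...
 * `src` and `dst` must not overlap; n must be even. */
static void inplace_to_mallat_1d(const float *src, float *dst, size_t n)
{
    assert(n % 2 == 0);
    size_t half = n / 2;
    for (size_t k = 0; k < half; k++) {
        dst[k]        = src[2 * k];      /* approximation */
        dst[half + k] = src[2 * k + 1];  /* detail */
    }
}
```

The second buffer is what makes this copy simple, which is consistent with the 2-matrix memory cost listed for the schemes that store the image in the expected order.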

INPLACE

[Figure: MATRIX 1 holds the original elements (O); filtering turns them into L and H elements, and Vertical Filtering into LL, LH, HL and HH coefficients (blocks 1 to 4), all within the same matrix.]

Transformed image, logical view: LL1 LL2 LL3 LL4 LH1 LH2 LH3 LH4 HL1 ...

Transformed image, physical view: LL1 LH1 LL2 LH2 HL1 HH1 HL2 HH2 LL3 ...

MALLAT

[Figure: original elements (O), the L and H halves after Horizontal Filtering, and the LL, LH, HL and HH sub-bands (blocks 1 to 4) after Vertical Filtering, using MATRIX 1 and MATRIX 2.]

Transformed image (logical view and physical view): LL1 LL2 LL3 LL4 LH1 LH2 LH3 LH4 HL1 ...

INPLACE-MALLAT

[Figure, logical view: original elements (O), the L and H halves after Horizontal Filtering, and the LL, LH, HL and HH sub-bands (blocks 1 to 4) after Vertical Filtering, using MATRIX 1 and MATRIX 2.]

Physical view:

Transformed image (Matrix 1): LL1 LL2 LL3 LL4 ...

Transformed image (Matrix 2): LH1 LH2 LH3 LH4 HL1 ...

[Figure: four bar charts, one per image size (256², 1024², 2048² and 8192² pixels). X axis: I-ICC, I-GCC, IM-ICC, IM-GCC, M-ICC, M-GCC. Y axis: Time (s). Series: Level 1, Level 2, Level 3, Level 4, Post.]

Execution time breakdown for several sizes comparing both compilers.

I, IM and M denote inplace, inplace-mallat, and mallat strategies respectively.

Each bar shows the execution time of each level and the post-processing step.

CONCLUSIONS

The Mallat and Inplace-Mallat approaches outperform the Inplace approach for levels 2 and above.

These 2 approaches have a noticeable slowdown for the 1st level:

• Larger working set

• More complex access pattern

The Inplace-Mallat version achieves the best execution time.

ICC compiler outperforms GCC for Mallat and Inplace-Mallat, but not for the Inplace approach.

SIMD Optimization

Objective: Extract the parallelism available in the Lifting Transform

Different strategies:

o Semi-automatic vectorization

o Hand-coded vectorization

Only the horizontal filtering of the transform can be semi-automatically vectorized (when using a column-major layout)

Automatic Vectorization (Intel C/C++ Compiler)

Inner loops

Simple array index manipulation

Iterate over contiguous memory locations

Global variables avoided

Pointer disambiguation if pointers are employed
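A loop shaped to meet the conditions listed above might look like the sketch below. The function name `lifting_column` and its signature are hypothetical, and whether a particular compiler actually vectorizes it depends on its version and flags; C99 `restrict` is used here as one way to disambiguate pointers.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = cur[i] + coeff * (left[i] + right[i]) for i in [0, n).
 * - a single innermost loop with a plain index i
 * - unit-stride accesses to contiguous memory
 * - only parameters and locals, no global variables
 * - `restrict` promises that dst does not overlap the inputs */
static void lifting_column(float *restrict dst,
                           const float *restrict cur,
                           const float *restrict left,
                           const float *restrict right,
                           float coeff, size_t n)
{
    assert(dst && cur && left && right);
    for (size_t i = 0; i < n; i++)
        dst[i] = cur[i] + coeff * (left[i] + right[i]);
}
```

In a column-major image, `cur`, `left` and `right` would be three neighbouring columns, which is why the horizontal filtering maps onto this pattern.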

[Figure: lifting scheme diagram (original element, 1st step, 2nd step; coefficients β and δ; approximation A and detail D).]

Column-major layout

[Figure: Horizontal filtering and Vectorial Horizontal filtering, each drawn as the add, multiply, add lifting step.]

[Figure: Vertical filtering and Vectorial Vertical filtering, each drawn as the add, multiply, add lifting step.]

Horizontal Vectorial Filtering (semi-automatic)

    /* n_rows, n_columns: image dimensions.
       col_m1, col0 ... col4: six neighbouring image columns, indexed by row i. */
    for (j = 2, k = 1; j < (n_columns - 4); j += 2, k++) {
        #pragma vector aligned
        for (i = 0; i < n_rows; i++) {
            /* 1st operation */
            col3[i] = col3[i] + alfa * (col4[i] + col2[i]);
            /* 2nd operation */
            col2[i] = col2[i] + beta * (col3[i] + col1[i]);
            /* 3rd operation */
            col1[i] = col1[i] + gama * (col2[i] + col0[i]);
            /* 4th operation */
            col0[i] = col0[i] + delt * (col1[i] + col_m1[i]);
            /* Last step */
            detail[i] = col1[i] * phi_inv;
            aprox[i]  = col0[i] * phi;
        }
    }

Hand-coded Vectorization

SIMD parallelism has to be explicitly expressed

Intrinsics allow more flexibility

Possibility to also vectorize the vertical filtering

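Horizontal Vectorial Filtering (hand)

    /* 1st operation: col3 = col3 + alfa * (col4 + col2), four floats at a time */
    t2 = _mm_load_ps(col2);
    t3 = _mm_load_ps(col3);
    t4 = _mm_load_ps(col4);
    coeff = _mm_set_ps1(alfa);      /* alfa in all four lanes */
    t4 = _mm_add_ps(t2, t4);        /* col2 + col4 */
    t4 = _mm_mul_ps(t4, coeff);     /* alfa * (col2 + col4) */
    t3 = _mm_add_ps(t4, t3);        /* col3 + alfa * (col2 + col4) */
    _mm_store_ps(col3, t3);

    /* 2nd operation */

    /* 3rd operation */

    /* 4th operation */

    /* Last step */
    _mm_store_ps(detail, t1);
    _mm_store_ps(aprox, t0);

[Figure: registers t2, t3 and t4 combined by the add, multiply, add sequence of the 1st operation.]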
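For reference, the 1st operation can be wrapped into a complete function along these lines. This is a sketch under assumptions: the function name `lifting_op_sse`, the loop over `n` elements, and the requirements that `n` be a multiple of 4 and that all pointers be 16-byte aligned are illustrative, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* col3[i] += alfa * (col4[i] + col2[i]) for i in [0, n), four floats
 * per iteration. All pointers must be 16-byte aligned because
 * _mm_load_ps and _mm_store_ps are the aligned variants. */
static void lifting_op_sse(float *col3, const float *col2,
                           const float *col4, float alfa, size_t n)
{
    assert(n % 4 == 0);
    const __m128 coeff = _mm_set_ps1(alfa);  /* alfa in all four lanes */
    for (size_t i = 0; i < n; i += 4) {
        __m128 t2 = _mm_load_ps(col2 + i);
        __m128 t3 = _mm_load_ps(col3 + i);
        __m128 t4 = _mm_load_ps(col4 + i);
        t4 = _mm_add_ps(t2, t4);             /* col2 + col4 */
        t4 = _mm_mul_ps(t4, coeff);          /* alfa * (col2 + col4) */
        t3 = _mm_add_ps(t4, t3);             /* col3 + alfa * (...) */
        _mm_store_ps(col3 + i, t3);
    }
}
```

If alignment cannot be guaranteed, `_mm_loadu_ps` and `_mm_storeu_ps` are the unaligned counterparts, typically at some cost in speed on older processors.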

[Figure: two bar charts, ICC and GCC. X axis (ICC): I-S, IM-S, M-S, I-A, IM-A, M-A, I-H, IM-H, M-H. X axis (GCC): I-S, IM-S, M-S, I-H, IM-H, M-H. Y axis: Time (s). Series: Level 1, Level 2, Level 3, Level 4.]

Execution time breakdown of the horizontal filtering (1024² pixels image).

I, IM and M denote inplace, inplace-mallat and mallat approaches.

S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.

CONCLUSIONS

Speedup between 4 and 6 depending on the strategy. Such a high improvement is due not only to the vectorial computations, but also to a considerable reduction in memory accesses.

The speedups achieved by the strategies with recursive layouts (i.e. inplace-mallat and mallat) are higher than those of their inplace counterparts, since the computation on the latter can only be vectorized in the first level.

For ICC, both vectorization approaches (i.e. automatic and hand-tuned) produce similar speedups, which highlights the quality of the ICC vectorizer.

[Figure: two bar charts, ICC and GCC. X axis (ICC): I-S, IM-S, M-S, I-A, IM-A, M-A, I-H, IM-H, M-H. X axis (GCC): I-S, IM-S, M-S, I-H, IM-H, M-H. Y axis: Time (s). Series: Level 1, Level 2, Level 3, Level 4, Post.]

Execution time breakdown of the whole transform (1024² pixels image).

I, IM and M denote inplace, inplace-mallat and mallat approaches.

S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.

CONCLUSIONS

Speedup between 1.5 and 2 depending on the strategy.

For ICC, the shortest execution time is reached by the mallat version.

When using GCC, both recursive-layout strategies obtain similar results.

[Figure: two line charts. X axis: Image Size (log2), from 7 to 14. Y axis: Speedup. Series: Hand-Coded ICC, Automatic ICC, Hand-Coded GCC.]

Speedup achieved by the different vectorial codes over the inplace-mallat and inplace strategies.

We show the hand-coded ICC, the automatic ICC, and the hand-coded GCC.

CONCLUSIONS

The speedup grows with the image size.

On average, the speedup is about 1.8 over the inplace-mallat scheme, growing to about 2 when considering it over the inplace strategy.

Focusing on the compilers, ICC clearly outperforms GCC by a significant 20-25% for all the image sizes.

Conclusions

Scalar version: We have introduced a new scheme called Inplace-Mallat that outperforms both the Inplace implementation and the Mallat scheme.

SIMD exploitation: Code modifications for the vectorial processing of the lifting algorithm. Two different methodologies with the ICC compiler: semi-automatic and intrinsic-based vectorization. Both provide similar results.

Speedup:

o Horizontal filtering: about 4-6 (vectorization also reduces the pressure on the memory system).

o Whole transform: around 2.

The vectorial Mallat approach outperforms the other schemes and exhibits better scalability.

Most of our insights are compiler independent.

Future work

4D layout for a lifting-based scheme

Measurements using other platforms

• Intel Itanium

• Intel Pentium-4 with Hyper-Threading

Parallelization using OpenMP (SMT)

For additional information:

http://www.dacya.ucm.es/dchaver