Transcript
Page 1: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions

Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions

D. Chaver, C. Tenllado, L. Piñuel, M. Prieto, F. Tirado

UCM

Page 2: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Index

1. Motivation

2. Experimental environment

3. Lifting Transform

4. Memory hierarchy exploitation

5. SIMD optimization

6. Conclusions

7. Future work

Page 3: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Motivation

Page 4: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Motivation

Applications based on the Wavelet Transform:

o JPEG-2000

o MPEG-4

Usage of the lifting scheme

Study based on a modern general purpose microprocessor

o Pentium 4

Objectives:

o Efficient exploitation of Memory Hierarchy

o Use of the SIMD ISA extensions

Page 5: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Experimental Environment

Page 6: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Experimental Environment

Platform: Intel Pentium 4 (2.4 GHz)

Motherboard: DFI WT70-EC

Cache:

o IL1: NA

o DL1: 8 KB, 64 Byte/Line, Write-Through

o L2: 512 KB, 128 Byte/Line

Memory: 1 GB RDRAM (PC800)

Operating System: RedHat Distribution 7.2 (Enigma)

Compiler: Intel ICC compiler, GCC compiler

Page 7: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Lifting Transform

Page 8: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


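Lifting Transform

[Figure: data flow of the 1D lifting steps. Legend: original element, 1st step, 2nd step. Each step adds two neighboring elements, multiplies the sum by a coefficient (β and δ are labeled) and adds the product to the element being updated. Outputs are labeled A and D.]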
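The lifting steps in the diagram can be sketched in scalar C. This is only a minimal sketch: it assumes the four-operations-plus-scaling structure that appears in the horizontal filtering code later in the deck. The names `lift_coeffs`, `mirror`, `lift_pass` and `lifting_1d`, the interleaved even/odd storage, and the symmetric boundary handling are illustrative assumptions, not the authors' implementation. Coefficient values are left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical coefficient bundle; field names follow the slide code. */
typedef struct {
    float alfa, beta, gama, delt; /* lifting step coefficients        */
    float phi;                    /* scaling factor (approximation)   */
} lift_coeffs;

/* Mirror an index into [0, n-1] (assumed symmetric boundary extension). */
static size_t mirror(long i, size_t n) {
    if (i < 0) return (size_t)(-i);
    if ((size_t)i >= n) return 2 * (n - 1) - (size_t)i;
    return (size_t)i;
}

/* One pass of x[i] += c * (x[i-1] + x[i+1]) over every second sample. */
static void lift_pass(float *x, size_t n, size_t start, float c) {
    for (size_t i = start; i < n; i += 2)
        x[i] += c * (x[mirror((long)i - 1, n)] + x[mirror((long)i + 1, n)]);
}

/* In-place 1D lifting: even slots end up holding approximation
   coefficients, odd slots detail coefficients. */
void lifting_1d(float *x, size_t n, const lift_coeffs *k) {
    assert(n >= 2 && n % 2 == 0);
    lift_pass(x, n, 1, k->alfa); /* 1st operation: update odd samples  */
    lift_pass(x, n, 0, k->beta); /* 2nd operation: update even samples */
    lift_pass(x, n, 1, k->gama); /* 3rd operation: update odd samples  */
    lift_pass(x, n, 0, k->delt); /* 4th operation: update even samples */
    for (size_t i = 0; i < n; i += 2) {
        x[i]     *= k->phi;        /* approximation */
        x[i + 1] *= 1.0f / k->phi; /* detail        */
    }
}
```

Within each pass only every second sample is written and only its untouched neighbors are read, which is what allows the transform to run in place.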

Page 9: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Lifting Transform

[Figure: 2D transform. 1 level consists of Horizontal Filtering (1D Lifting Transform) and Vertical Filtering (1D Lifting Transform); N levels repeat this. Legend: original element, approximation.]

Page 10: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Lifting Transform

[Figure: Horizontal Filtering and Vertical Filtering panels, each with elements labeled 1 and 2.]

Page 11: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Memory Hierarchy Exploitation

Page 12: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Memory Hierarchy Exploitation

Poor data locality of one component (canonical layouts)

E.g.: column-major layout processing image rows (Horizontal Filtering)

o Aggregation (loop tiling)

Poor data locality of the whole transform

o Other layouts
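Aggregation can be illustrated with a small C sketch. With a column-major layout, walking along one image row jumps a whole column's worth of memory at every step. Processing a tile of consecutive rows together lets the innermost loop run over contiguous memory instead. The function name `horizontal_step_tiled`, the `AT` macro and the tile height `TILE_ROWS` are assumptions made for illustration. Only one lifting step is shown, and the last odd column of an even-width image is left untouched for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define TILE_ROWS 16 /* assumed tile height; tune to the cache line size */

/* Column-major indexing: element (r, c) of an image with `rows` rows. */
#define AT(a, rows, r, c) ((a)[(size_t)(c) * (rows) + (r)])

/* One horizontal lifting step, odd column += coef * (left + right),
   applied to TILE_ROWS image rows at a time (aggregation / loop tiling). */
void horizontal_step_tiled(float *img, size_t rows, size_t cols, float coef) {
    assert(img != NULL && cols >= 3);
    for (size_t r0 = 0; r0 < rows; r0 += TILE_ROWS) {
        size_t r1 = (r0 + TILE_ROWS < rows) ? r0 + TILE_ROWS : rows;
        for (size_t c = 1; c + 1 < cols; c += 2) {
            /* The inner loop walks contiguous memory inside each column. */
            for (size_t r = r0; r < r1; r++)
                AT(img, rows, r, c) +=
                    coef * (AT(img, rows, r, c - 1) + AT(img, rows, r, c + 1));
        }
    }
}
```

Without the tiling, the row loop would be outermost and every access in the column loop would touch a different cache line.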

Page 13: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Memory Hierarchy Exploitation

[Figure: Horizontal Filtering and Vertical Filtering panels, each with elements labeled 1 and 2.]

Page 14: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Memory Hierarchy Exploitation

Aggregation

[Figure: Horizontal Filtering with aggregation applied over the IMAGE.]

Page 15: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Memory Hierarchy Exploitation

Different studied schemes

INPLACE

o Common implementation of the transform

o Memory: only requires the original matrix

o Needs post-processing for most applications

MALLAT

o Memory: requires 2 matrices

o Stores the image in the expected order

INPLACE-MALLAT

o Memory: requires 2 matrices

o Stores the image in the expected order

Page 16: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Memory Hierarchy Exploitation

INPLACE

[Figure: MATRIX 1 holds the original elements (O). Horizontal Filtering turns them into L and H coefficients; Vertical Filtering then produces the LL, LH, HL and HH coefficients of the transformed image.]

Transformed image, logical view: LL1 LL2 LL3 LL4 LH1 LH2 LH3 LH4 HL1 ...

Transformed image, physical view: LL1 LH1 LL2 LH2 HL1 HH1 HL2 HH2 LL3 ...
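The gap between the two views is what a post-processing step has to close for the inplace scheme. The sketch below shows one way to reorder a single transform level from the interleaved order into contiguous subbands. It is an illustrative assumption, not the authors' post-processing code: the function name `inplace_to_subbands`, the subband order LL, LH, HL, HH, and the naming convention (first letter for the horizontal filter, second for the vertical one) are inferred from the sequences shown on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Reorder one level of an interleaved (inplace) result into four
   contiguous subband blocks LL, LH, HL, HH. Both buffers hold
   rows*cols floats; src is column-major. */
void inplace_to_subbands(const float *src, float *dst,
                         size_t rows, size_t cols) {
    assert(rows % 2 == 0 && cols % 2 == 0);
    size_t hr = rows / 2, hc = cols / 2, band = hr * hc;
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            size_t h = c % 2;          /* 1 if the column holds H coefficients */
            size_t v = r % 2;          /* 1 if the row holds H coefficients    */
            size_t which = 2 * h + v;  /* 0 = LL, 1 = LH, 2 = HL, 3 = HH       */
            size_t pos = (c / 2) * hr + (r / 2); /* column-major inside band  */
            dst[which * band + pos] = src[c * rows + r];
        }
    }
}
```

For the 4 x 4 example on the slide, this maps the physical order LL1 LH1 LL2 LH2 HL1 HH1 HL2 HH2 LL3 ... to LL1 LL2 LL3 LL4 LH1 LH2 LH3 LH4 HL1 ...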

Page 17: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Memory Hierarchy Exploitation

MALLAT

[Figure: the MALLAT scheme uses MATRIX 1 and MATRIX 2. Horizontal Filtering turns the original elements (O) into L and H coefficients; Vertical Filtering produces the LL, LH, HL and HH coefficients, grouped by subband.]

Transformed image, logical and physical view: LL1 LL2 LL3 LL4 LH1 LH2 LH3 LH4 HL1 ...

Page 18: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Memory Hierarchy Exploitation

INPLACE-MALLAT

[Figure: the INPLACE-MALLAT scheme uses MATRIX 1 and MATRIX 2, shown in logical and physical view. Horizontal Filtering turns the original elements (O) into L and H coefficients; Vertical Filtering produces the LL, LH, HL and HH coefficients.]

Transformed image (Matrix 1): LL1 LL2 LL3 LL4 ...

Transformed image (Matrix 2): LH1 LH2 LH3 LH4 HL1 ...

Page 19: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Memory Hierarchy Exploitation

[Figure: four bar charts, one per image size (256 x 256, 1024 x 1024, 2048 x 2048 and 8192 x 8192 pixels). Vertical axis: Time (s). Bars: I-ICC, I-GCC, IM-ICC, IM-GCC, M-ICC, M-GCC. Series: Level 1, Level 2, Level 3, Level 4, Post.]

Execution time breakdown for several sizes comparing both compilers.

I, IM and M denote inplace, inplace-mallat, and mallat strategies respectively.

Each bar shows the execution time of each level and the post-processing step.

Page 20: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Memory Hierarchy Exploitation

CONCLUSIONS

The Mallat and Inplace-Mallat approaches outperform the Inplace approach for levels 2 and above.

These 2 approaches have a noticeable slowdown for the 1st level:

• Larger working set

• More complex access pattern

The Inplace-Mallat version achieves the best execution time.

The ICC compiler outperforms GCC for Mallat and Inplace-Mallat, but not for the Inplace approach.

Page 21: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


SIMD Optimization

Page 22: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


SIMD Optimization

Objective: Extract the parallelism available in the Lifting Transform

Different strategies:

o Semi-automatic vectorization

o Hand-coded vectorization

Only the horizontal filtering of the transform can be semi-automatically vectorized (when using a column-major layout)

Page 23: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


SIMD Optimization

Automatic Vectorization (Intel C/C++ Compiler)

Inner loops

Simple array index manipulation

Iterate over contiguous memory locations

Global variables avoided

Pointer disambiguation if pointers are employed
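A loop shaped along these lines might look like the following C sketch. It is an illustration of the listed conditions, not code from the study: the function name `lift_columns` and its parameters are hypothetical, and the C99 `restrict` qualifier is used here as one standard way to express pointer disambiguation. Whether a given compiler actually vectorizes the loop depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* One lifting step written so that an auto-vectorizer has an easy job:
   - the loop is an innermost loop with a simple index (i),
   - it iterates over contiguous memory locations,
   - it uses only parameters and locals (no global variables),
   - `restrict` tells the compiler that the pointers do not alias. */
void lift_columns(float *restrict dst,
                  const float *restrict left,
                  const float *restrict right,
                  float coef, size_t rows) {
    assert(dst != NULL && left != NULL && right != NULL);
    for (size_t i = 0; i < rows; i++)
        dst[i] = dst[i] + coef * (left[i] + right[i]);
}
```

With a column-major layout, `dst`, `left` and `right` would point to three neighboring image columns, so `i` runs down the rows.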

Page 24: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


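SIMD Optimization

[Figure: data flow of the 1D lifting steps. Legend: original element, 1st step, 2nd step. Each step adds two neighboring elements, multiplies the sum by a coefficient (β and δ are labeled) and adds the product to the element being updated. Outputs are labeled A and D.]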

Page 25: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


SIMD Optimization

Column-major layout

[Figure: Horizontal filtering compared with vectorial Horizontal filtering; both show the add, multiply, add pattern of a lifting step.]

Page 26: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


SIMD Optimization

Column-major layout

[Figure: Vertical filtering compared with vectorial Vertical filtering; both show the add, multiply, add pattern of a lifting step.]

Page 27: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


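SIMD Optimization

Horizontal Vectorial Filtering (semi-automatic)

/* Slide pseudocode: n_columns and n_rows stand for the slide's #columns and
   #rows, and col_m1 for col-1. Each colN presumably denotes element i of a
   neighboring image column; indexing and pointer setup are not shown. */
for (j = 2, k = 1; j < (n_columns - 4); j += 2, k++) {
    #pragma vector aligned
    for (i = 0; i < n_rows; i++) {
        /* 1st operation */
        col3 = col3 + alfa * (col4 + col2);
        /* 2nd operation */
        col2 = col2 + beta * (col3 + col1);
        /* 3rd operation */
        col1 = col1 + gama * (col2 + col0);
        /* 4th operation */
        col0 = col0 + delt * (col1 + col_m1);
        /* Last step */
        detail = col1 * phi_inv;
        aprox = col0 * phi;
    }
}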

Page 28: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


SIMD Optimization

Hand-coded Vectorization

SIMD parallelism has to be explicitly expressed

Intrinsics allow more flexibility

Possibility to also vectorize the vertical filtering

Page 29: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


SIMD Optimization

Horizontal Vectorial Filtering (hand)

/* 1st operation */
t2 = _mm_load_ps(col2);
t4 = _mm_load_ps(col4);
t3 = _mm_load_ps(col3);
coeff = _mm_set_ps1(alfa);
t4 = _mm_add_ps(t2,t4);
t4 = _mm_mul_ps(t4,coeff);
t3 = _mm_add_ps(t4,t3);
_mm_store_ps(col3,t3);

/* 2nd operation */

/* 3rd operation */

/* 4th operation */

/* Last step */
_mm_store_ps(detail,t1);
_mm_store_ps(aprox,t0);

[Figure: t2 and t4 are added, the sum is multiplied by the coefficient, and the product is added to t3.]
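The first operation on the slide can be wrapped in a complete function as a rough sketch. The function name `first_operation`, the loop over groups of four rows, and the scalar fallback are assumptions added for illustration. Unlike the slide, the sketch uses the unaligned `_mm_loadu_ps` and `_mm_storeu_ps` so that it works on any buffer; the aligned `_mm_load_ps` and `_mm_store_ps` used on the slide require 16-byte aligned addresses.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* col3[i] += alfa * (col2[i] + col4[i]) for i in [0, rows).
   Assumes rows is a multiple of 4 (one SSE register holds 4 floats). */
void first_operation(float *col3, const float *col2, const float *col4,
                     float alfa, size_t rows) {
    assert(rows % 4 == 0);
#if defined(__SSE__)
    const __m128 coeff = _mm_set1_ps(alfa);     /* alfa in all 4 lanes   */
    for (size_t i = 0; i < rows; i += 4) {
        __m128 t2 = _mm_loadu_ps(col2 + i);     /* 4 rows of column 2    */
        __m128 t4 = _mm_loadu_ps(col4 + i);     /* 4 rows of column 4    */
        __m128 t3 = _mm_loadu_ps(col3 + i);     /* 4 rows of column 3    */
        t4 = _mm_add_ps(t2, t4);                /* col2 + col4           */
        t4 = _mm_mul_ps(t4, coeff);             /* alfa * (col2 + col4)  */
        t3 = _mm_add_ps(t4, t3);                /* col3 + alfa * (...)   */
        _mm_storeu_ps(col3 + i, t3);
    }
#else
    for (size_t i = 0; i < rows; i++)           /* scalar fallback       */
        col3[i] += alfa * (col2[i] + col4[i]);
#endif
}
```

Each iteration updates four consecutive rows of one column, which is why the column-major layout suits the horizontal filtering.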

Page 30: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


SIMD Optimization

[Figure: two bar charts, ICC and GCC. Vertical axis: Time (s), from 0 to 0.06. ICC bars: I-S, IM-S, M-S, I-A, IM-A, M-A, I-H, IM-H, M-H. GCC bars: I-S, IM-S, M-S, I-H, IM-H, M-H. Series: Level 1, Level 2, Level 3, Level 4.]

Execution time breakdown of the horizontal filtering (1024 x 1024 pixel image).

I, IM and M denote inplace, inplace-mallat and mallat approaches.

S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.

Page 31: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


SIMD Optimization

CONCLUSIONS

Horizontal filtering: speedup between 4 and 6, depending on the strategy. Such a high improvement is due not only to the vectorial computations but also to a considerable reduction in memory accesses.

The speedups achieved by the strategies with recursive layouts (i.e. inplace-mallat and mallat) are higher than those of their inplace counterparts, since the computation in the latter can only be vectorized in the first level.

For ICC, both vectorization approaches (i.e. automatic and hand-tuned) produce similar speedups, which highlights the quality of the ICC vectorizer.

Page 32: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


SIMD Optimization

[Figure: two bar charts, ICC and GCC. Vertical axis: Time (s), from 0 to 0.08. ICC bars: I-S, IM-S, M-S, I-A, IM-A, M-A, I-H, IM-H, M-H. GCC bars: I-S, IM-S, M-S, I-H, IM-H, M-H. Series: Level 1, Level 2, Level 3, Level 4, Post.]

Execution time breakdown of the whole transform (1024 x 1024 pixel image).

I, IM and M denote inplace, inplace-mallat and mallat approaches.

S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.

Page 33: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


SIMD Optimization

CONCLUSIONS

Whole transform: speedup between 1.5 and 2, depending on the strategy.

For ICC, the shortest execution time is reached by the mallat version.

When using GCC, both recursive-layout strategies obtain similar results.

Page 34: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


SIMD Optimization

[Figure: two line charts of Speedup versus Image Size (log2), for sizes 7 to 14. Series: Hand-Coded ICC, Automatic ICC, Hand-Coded GCC.]

Speedup achieved by the different vectorial codes over the inplace-mallat and inplace schemes.

We show the hand-coded ICC, the automatic ICC, and the hand-coded GCC.

Page 35: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


SIMD Optimization

CONCLUSIONS

The speedup grows with the image size.

On average, the speedup is about 1.8 over the inplace-mallat scheme, growing to about 2 when measured over the inplace strategy.

Focusing on the compilers, ICC clearly outperforms GCC by a significant 20-25% for all the image sizes.

Page 36: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Conclusions

Page 37: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Conclusions

Scalar version: We have introduced a new scheme called Inplace-Mallat, which outperforms both the Inplace implementation and the Mallat scheme.

SIMD exploitation: Code modifications for the vectorial processing of the lifting algorithm. Two different methodologies with the ICC compiler: semi-automatic and intrinsic-based vectorization. Both provide similar results.

Speedup:

o Horizontal filtering: about 4-6 (vectorization also reduces the pressure on the memory system).

o Whole transform: around 2.

The vectorial Mallat approach outperforms the other schemes and exhibits better scalability.

Most of our insights are compiler independent.

Page 38: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Future work

Page 39: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions


Future work

4D layout for a lifting-based scheme

Measurements using other platforms

• Intel Itanium

• Intel Pentium 4 with Hyper-Threading

Parallelization using OpenMP (SMT)

For additional information:

http://www.dacya.ucm.es/dchaver

