Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Page 1: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Seunghwa Kang David A. Bader

Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Page 2: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Key Contributions

• We design an efficient data decomposition scheme to achieve high performance with affordable programming complexity

• We introduce multiple Cell/B.E.-specific and DWT-specific optimization issues and their solutions

• Our implementation achieves 34 and 56 times speedup over the performance of one PPE, and 4.7 and 3.7 times speedup over a cutting-edge multicore processor (AMD Barcelona), for lossless and lossy DWT, respectively.

Page 3: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Presentation Outline

• Discrete Wavelet Transform
• Cell Broadband Engine architecture
  - Comparison with the traditional multicore processor
  - Impact on performance and programmability
• Optimization Strategies
  - Previous work
  - Data decomposition scheme
  - Real number representation
  - Loop interleaving
  - Fine-grain data transfer control
• Performance Evaluation
  - Comparison with the AMD Barcelona
• Conclusions

Page 5: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Discrete Wavelet Transform (in JPEG2000)

• Decompose an image in both the vertical and horizontal directions into sub-bands representing the coarse and detail parts while preserving spatial information

[Figure: the four sub-bands LL, LH, HL, and HH]

Page 6: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Discrete Wavelet Transform (in JPEG2000)

• Vertical filtering followed by horizontal filtering
• Highly parallel but bandwidth intensive
• Distinct memory access patterns become a problem
• Adopt Jasper [Adams2005] as the baseline code
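To make the filtering step concrete, here is a minimal sketch of one-dimensional reversible 5/3 lifting, the integer filter commonly used for lossless JPEG2000. It is not taken from the slides or from Jasper: the function names `fwd53` and `inv53`, the restriction to even-length signals, and the boundary handling by symmetric extension are assumptions made for illustration.

```c
#include <assert.h>

/* Floor division by a positive divisor, well defined for negative values. */
static int floor_div(int v, int d) {
    int q = v / d;
    return (v % d != 0 && v < 0) ? q - 1 : q;
}

/* Forward reversible 5/3 lifting on an even-length signal x[0..n-1].
   low[] and high[] each receive n/2 coefficients (hypothetical interface). */
void fwd53(const int *x, int n, int *low, int *high) {
    assert(n >= 2 && n % 2 == 0);
    int half = n / 2;
    /* Predict: each odd sample minus the floor-average of its even neighbours. */
    for (int i = 0; i < half; i++) {
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[n - 2]; /* symmetric extension */
        high[i] = x[2 * i + 1] - floor_div(x[2 * i] + right, 2);
    }
    /* Update: each even sample plus a rounded quarter of the neighbouring details. */
    for (int i = 0; i < half; i++) {
        int prev = (i > 0) ? high[i - 1] : high[0]; /* symmetric extension */
        low[i] = x[2 * i] + floor_div(prev + high[i] + 2, 4);
    }
}

/* Inverse transform: undo the update step, then undo the predict step. */
void inv53(const int *low, const int *high, int n, int *x) {
    assert(n >= 2 && n % 2 == 0);
    int half = n / 2;
    for (int i = 0; i < half; i++) {
        int prev = (i > 0) ? high[i - 1] : high[0];
        x[2 * i] = low[i] - floor_div(prev + high[i] + 2, 4);
    }
    for (int i = 0; i < half; i++) {
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[n - 2];
        x[2 * i + 1] = high[i] + floor_div(x[2 * i] + right, 2);
    }
}
```

Applying `fwd53` to every column and then to every row gives one decomposition level. The column pass walks memory with a stride of one image row, which is one way to read the "distinct memory access pattern" noted above.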

Page 8: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Cell/B.E. vs Traditional Multi-core Processor

SPE
• In-order
• No dynamic branch prediction
• SIMD only
=> Small and simple core

Traditional Multi-core Processor
• Out-of-order
• Dynamic branch prediction
• Scalar + SIMD
=> Large and complex core

Page 9: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Cell/B.E. vs Traditional Multi-core Processor

SPE
[Figure: execution pipeline, LS, main memory]
• Isolated, constant-latency LS access
• Software-controlled DMA data transfer between LS and main memory

Traditional Multi-core Processor
[Figure: execution pipeline, I1 and D1 caches, L2, L3, main memory]
• Every memory access is cache coherent
• Hardware-controlled data transfer

Page 10: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Cell/B.E. Architecture - Performance

• More cores within the power and transistor budget
• Invest a larger fraction of the die area in actual computation
• Highly scalable memory architecture
• Enables fine-grain data transfer control
• Efficient vectorization is even more important (no scalar unit)

Page 11: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Cell/B.E. Architecture - Programmability

• Software-controlled (to date, mostly programmer-controlled) data transfer
• Limited LS size
• Manual vectorization
• Manual branch hints, loop unrolling, etc.
• Efficient DMA data transfer requires cache line alignment, and the transfer size needs to be a multiple of the cache line size.
• Vectorization (SIMD) requires 16 byte alignment, and the vector size needs to be 16 bytes.

=> Challenging to deal with misaligned data!

Page 12: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Cell/B.E. Architecture - Programmability

Scalar loop:

    for( i = 0 ; i < n ; i++ ) {
        a[i] = b[i] + c[i];
    }

Vectorized loop when the alignment and size requirements are satisfied:

    /* n_c: a constant multiple of 4 */
    /* v_add stands for the platform's vector add intrinsic (e.g. spu_add on the SPE) */
    v_a = ( vector int* )a;
    v_b = ( vector int* )b;
    v_c = ( vector int* )c;
    for( i = 0 ; i < n_c / 4 ; i++ ) {
        v_a[i] = v_add( v_b[i], v_c[i] );
    }

Vectorized loop with no guarantee on alignment and size:

    /* Head: scalar iterations up to the first 16 byte boundary of a. */
    /* This assumes b and c share the alignment of a. */
    n_head = ( 16 - ( ( unsigned int )a % 16 ) ) / 4;
    n_head = n_head % 4;
    n_body = ( n - n_head ) / 4;
    n_tail = ( n - n_head ) % 4;
    for( i = 0 ; i < n_head ; i++ ) {
        a[i] = b[i] + c[i];
    }
    /* Body: aligned vector iterations. */
    v_a = ( vector int* )( a + n_head );
    v_b = ( vector int* )( b + n_head );
    v_c = ( vector int* )( c + n_head );
    for( i = 0 ; i < n_body ; i++ ) {
        v_a[i] = v_add( v_b[i], v_c[i] );
    }
    /* Tail: remaining scalar iterations. */
    a = ( int* )( v_a + n_body );
    b = ( int* )( v_b + n_body );
    c = ( int* )( v_c + n_body );
    for( i = 0 ; i < n_tail ; i++ ) {
        a[i] = b[i] + c[i];
    }

=> Even more complex if a, b, and c do not share the same alignment!

Page 14: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Previous work

• Column grouping [Chaver2002] to enhance cache behavior in vertical filtering

• Muta et al. [Muta2007] optimized convolution-based DWT for the Cell/B.E. (the convolution-based approach requires up to 2 times more operations than the lifting-based approach)
  - High single-SPE performance
  - Does not scale above 1 SPE

Page 15: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Data Decomposition Scheme

[Figure: a 2-D array (width × height) with row padding, so that every row is cache line aligned. The array width is split into a part that is a multiple of the cache line size, which is distributed to the SPEs, and a remainder, which is processed by the PPE. Labels mark a unit of data transfer and computation, and a unit of data distribution to the processing elements whose width is a multiple of the cache line size.]

Page 16: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Data Decomposition Scheme

• Satisfies the alignment and size requirements for efficient DMA data transfer and vectorization.

• Fixed LS space requirements regardless of the input image size

• Constant loop count

[Figure: a unit of data transfer and computation, with constant width]
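The index arithmetic behind such a decomposition can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the names `decomposition`, `decompose`, and `spe_chunk_range` are hypothetical, the unit of transfer is assumed to be exactly one 128 byte cache line wide, and the chunks are assumed to be handed to the SPEs in contiguous blocks.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 128 /* Cell/B.E. cache line size */

/* Hypothetical description of how one image row is carved up. */
typedef struct {
    size_t stride_elems; /* padded row length in elements (multiple of a line) */
    size_t chunk_cols;   /* columns per unit of transfer and computation */
    size_t num_chunks;   /* number of such units across the width */
    size_t spe_cols;     /* leading columns distributed to the SPEs */
    size_t ppe_cols;     /* remainder columns left to the PPE */
} decomposition;

/* Split a row of `width` elements of `elem_bytes` bytes each.
   Rows are cache line aligned only if the array base address is too. */
decomposition decompose(size_t width, size_t elem_bytes) {
    assert(elem_bytes > 0 && CACHE_LINE_BYTES % elem_bytes == 0);
    size_t per_line = CACHE_LINE_BYTES / elem_bytes;
    decomposition d;
    d.chunk_cols = per_line;
    d.num_chunks = width / per_line;
    d.spe_cols = d.num_chunks * per_line;
    d.ppe_cols = width - d.spe_cols;
    /* Row padding: round the width up to a whole number of cache lines. */
    d.stride_elems = (width + per_line - 1) / per_line * per_line;
    return d;
}

/* Contiguous block of chunks [*first, *first + *count) for SPE `id`. */
void spe_chunk_range(const decomposition *d, size_t id, size_t num_spes,
                     size_t *first, size_t *count) {
    assert(num_spes > 0 && id < num_spes);
    size_t base = d->num_chunks / num_spes;
    size_t extra = d->num_chunks % num_spes;
    *first = id * base + (id < extra ? id : extra);
    *count = base + (id < extra ? 1 : 0);
}
```

For a 3800 element wide row of 4 byte samples this gives 118 chunks of 32 columns for the SPEs, 24 remainder columns for the PPE, and a padded stride of 3808 elements. Because every chunk has the same width, the LS buffer size and the inner loop count do not depend on the image size.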

Page 17: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Vectorization - Real number representation

• Jasper adopts a fixed-point representation
  - Replaces floating-point arithmetic with fixed-point arithmetic
  - Not a good choice for the Cell/B.E.

Instruction latency (SPE):

  Inst.   Latency
  mpyh    7 cycles
  mpyu    7 cycles
  a       2 cycles
  fm      6 cycles

Fixed-point (32-bit integer) multiply, five instructions:

  mpyh $5, $3, $4
  mpyh $2, $4, $3
  mpyu $4, $3, $4
  a    $3, $5, $2
  a    $3, $3, $4

Floating-point multiply, one instruction:

  fm   $3, $3, $4
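The five-instruction sequence exists because the SPE multiplies 16-bit halves, so a 32-bit product has to be assembled from partial products. The sketch below emulates that sequence in portable C to show the arithmetic. The helper names mirror the instruction mnemonics, but these are plain C functions written for illustration, not SPU intrinsics, and the function name `mul32_via_16` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates mpyh: (upper 16 bits of a) * (lower 16 bits of b), shifted left by 16. */
static uint32_t mpyh(uint32_t a, uint32_t b) {
    return (uint32_t)((a >> 16) * (b & 0xFFFFu)) << 16;
}

/* Emulates mpyu: unsigned product of the lower 16 bits of each operand. */
static uint32_t mpyu(uint32_t a, uint32_t b) {
    return (a & 0xFFFFu) * (b & 0xFFFFu);
}

/* Low 32 bits of a * b, built the same way as the instruction sequence:
   two cross terms, one low term, and two adds. */
uint32_t mul32_via_16(uint32_t a, uint32_t b) {
    uint32_t cross_ab = mpyh(a, b); /* mpyh $5, $3, $4 */
    uint32_t cross_ba = mpyh(b, a); /* mpyh $2, $4, $3 */
    uint32_t low = mpyu(a, b);      /* mpyu $4, $3, $4 */
    uint32_t sum = cross_ab + cross_ba; /* a $3, $5, $2 */
    assert(a != 0 || sum + low == 0);   /* sanity check: 0 * b must be 0 */
    return sum + low;                   /* a $3, $3, $4 */
}
```

The term formed from the two upper halves is omitted because it only affects bits above the low 32. A single-precision multiply needs one `fm` instead, which is why switching from fixed-point to floating-point arithmetic pays off on the SPE.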

Page 18: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Loop Interleaving

• In a naïve approach, a single vertical filtering pass transfers the data 3 or 6 times

• Bandwidth becomes a bottleneck

• Interleave the splitting, lifting, and optional scaling steps

[Figure label: the data does not fit into the LS]

Page 19: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Loop Interleaving

[Figure: rows stored interleaved (low0, high0, low1, high1, low2, high2, low3, high3) are split into a lower half (low0 to low3) and an upper half (high0 to high3); interleaved lifting then produces low0* to low3* and high0* to high3*. One row is marked "overwritten before read".]

• First, interleave multiple lifting steps
• Then, merge the splitting step with the interleaved lifting step
• Use a temporary main memory buffer for the upper half
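The idea of interleaving the steps can be sketched on the vertical pass of the reversible 5/3 filter. This is a simplified illustration, not the authors' implementation: it assumes an even height, writes to a separate output array instead of working in place (so the temporary buffer for the upper half is not needed here), and the function names are hypothetical. The multi-pass version sweeps the data three times; the fused version produces the same result in one sweep.

```c
#include <assert.h>
#include <string.h>

/* Floor division by a positive divisor, well defined for negative values. */
static int floor_div(int v, int d) {
    int q = v / d;
    return (v % d != 0 && v < 0) ? q - 1 : q;
}

/* Reference: splitting, predict, and update as three separate sweeps.
   in and out are h x w row-major arrays that must not overlap; h is even. */
void vertical_53_multipass(const int *in, int *out, int h, int w) {
    assert(h >= 2 && h % 2 == 0 && w > 0);
    int half = h / 2;
    for (int i = 0; i < half; i++) { /* sweep 1: even rows up, odd rows down */
        memcpy(out + i * w, in + (2 * i) * w, (size_t)w * sizeof(int));
        memcpy(out + (half + i) * w, in + (2 * i + 1) * w, (size_t)w * sizeof(int));
    }
    for (int i = 0; i < half; i++) { /* sweep 2: predict the high rows */
        const int *e0 = out + i * w;
        const int *e1 = out + ((i + 1 < half) ? i + 1 : i) * w;
        int *hi = out + (half + i) * w;
        for (int c = 0; c < w; c++) hi[c] -= floor_div(e0[c] + e1[c], 2);
    }
    for (int i = 0; i < half; i++) { /* sweep 3: update the low rows */
        const int *h0 = out + (half + ((i > 0) ? i - 1 : 0)) * w;
        const int *h1 = out + (half + i) * w;
        int *lo = out + i * w;
        for (int c = 0; c < w; c++) lo[c] += floor_div(h0[c] + h1[c] + 2, 4);
    }
}

/* Fused: each iteration reads input rows 2i, 2i+1, 2i+2 and the previous
   high row, and writes one low row and one high row directly into place. */
void vertical_53_fused(const int *in, int *out, int h, int w) {
    assert(h >= 2 && h % 2 == 0 && w > 0);
    int half = h / 2;
    for (int i = 0; i < half; i++) {
        const int *e0 = in + (2 * i) * w;
        const int *e1 = in + ((2 * i + 2 < h) ? 2 * i + 2 : h - 2) * w;
        const int *od = in + (2 * i + 1) * w;
        const int *hprev = out + (half + ((i > 0) ? i - 1 : 0)) * w;
        int *hi = out + (half + i) * w;
        int *lo = out + i * w;
        for (int c = 0; c < w; c++) {
            hi[c] = od[c] - floor_div(e0[c] + e1[c], 2);
            lo[c] = e0[c] + floor_div(hprev[c] + hi[c] + 2, 4);
        }
    }
}
```

In the fused loop the working set is a handful of rows regardless of image height, which is what lets a column group fit in a small local store while touching main memory far less often.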

Page 20: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Fine-grain Data Transfer Control

[Figure: the same diagram as on the previous slide. Interleaved rows (low0, high0, ..., low3, high3) are split into a lower half (low0 to low3) and an upper half (high0 to high3), and interleaved lifting produces low0* to low3* and high0* to high3*.]

• Initially, we copy data from the buffer after the interleaved loop is finished
• However, the copy can start as soon as low2 and high2 have been read
• The Cell/B.E.'s software-controlled DMA data transfer enables this

Page 22: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Performance Evaluation

* 3800 × 2600 color image, 5 resolution levels
* Execution time and scalability up to 2 Cell/B.E. chips (IBM QS20)

Page 23: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Performance Evaluation – Comparison with x86 Architecture

* Optimization for the Barcelona

  Parallelization:            OpenMP-based parallelization
  Vectorization:              Auto-vectorization with compiler directives
  Real number representation: Identical to the Cell/B.E. case
  Loop interleaving:          Identical to the Cell/B.E. case
  Run-time profile feedback:  Compile with run-time profile feedback

[Figure: bar chart of execution time (ms), axis from 0 to 5000, for DWT - Lossless and DWT - Lossy. The bars are annotated with the speedup relative to the Cell/B.E. base case:]

                           Lossless   Lossy
  Cell/B.E. (Base)         1.0        1.0
  Cell/B.E. (Optimized)    34         56
  Barcelona (Base)         2.4        1.9
  Barcelona (Optimized)    7.3        15

- One 3.2 GHz Cell/B.E. chip (IBM QS20)
- One 2.0 GHz AMD Barcelona chip (AMD Quad-core Opteron 8350)
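As a consistency check, the headline comparison with the Barcelona can be recovered from the chart annotations. This assumes that 34 and 56 (optimized Cell/B.E.) and 7.3 and 15 (optimized Barcelona) are all speedups relative to the Cell/B.E. base case, which is a reading of the chart rather than something the slide states outright.

```latex
\[
\frac{34}{7.3} \approx 4.7 \quad \text{(lossless)},
\qquad
\frac{56}{15} \approx 3.7 \quad \text{(lossy)}
\]
```

Under that reading, the ratios match the 4.7 and 3.7 times speedup over the AMD Barcelona quoted in the key contributions and conclusions.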

Page 25: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Conclusions

• Cell/B.E. has great potential to speed up parallel workloads but requires judicious implementation

• We design an efficient data decomposition scheme to achieve high performance with affordable programming complexity

• With one Cell/B.E. chip, our implementation demonstrates 34 and 56 times speedup over one PPE, and 4.7 and 3.7 times speedup over the AMD Barcelona processor, for lossless and lossy DWT, respectively

• Cell/B.E. can also be used as an accelerator in combination with a traditional microprocessor

Page 26: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

Acknowledgment of Support


Page 27: Optimizing Discrete Wavelet Transform on the Cell Broadband Engine

References

[Adams2005] M.D. Adams. The JPEG-2000 Still Image Compression Standard, Dec. 2005.

[Chaver2002] D. Chaver, M. Prieto, L. Pinuel, and F. Tirado. Parallel wavelet transform for large scale image processing, Int’l Parallel and Distributed Processing Symp., Apr. 2002.

[Muta2007] H. Muta, M. Doi, H. Nakano, and Y. Mori. Multilevel parallelization on the Cell/B.E. for a Motion JPEG 2000 encoding server, ACM Multimedia Conf., Sep. 2007.