RaVioli: A Parallel Video Processing Library with Auto Resolution Adjustability
Hiroko SAKURAI†, Masaomi OHNO†, Shintaro OKADA‡, Tomoaki TSUMURA†, Hiroshi MATSUO†
† Nagoya Institute of Technology, Japan
‡ Toyota Motor Corp., Japan
IADIS International Conference APPLIED COMPUTING 2009
November 19–21, 2009, Rome, Italy




Background (1/2): Portability of Video Applications
• Real-time video processing applications
  – should run on a great variety of platforms
    • Cell phones
    • Cars
    • PCs
  – have a different principal goal on each platform
    • Long battery life
    • High throughput
    • Good accuracy

We must rewrite a video processing program when porting it to another platform.


Background (2/2): The Many-Core Era is Coming
• Multi/many-core processors have come into wide use
• Video processing applications
  – have various kinds of parallelism
    • Pixels in a video frame have data parallelism
    • Multiple frames can be processed in parallel by pipelining
  – promise good performance on such parallel systems

Parallelizing programs is not so simple, so improving compilers and libraries becomes much more important.


A Video Processing Library: RaVioli
• RaVioli provides:
  – Easy writeability of
    • pseudo real-time video processing
  – Interfaces for parallelization
    • Detecting data dependencies and formulating reductions
    • Balancing loads of pipeline stages


Outline
• Concept of RaVioli
  – RaVioli hides resolutions from programmers
  – Easy writeability of video processing applications
• Pseudo real-time processing by adjusting loads
• Semi-automatic parallelization functions
  – Automatic block decomposition
  – Pipelining interface with automatic load balance mechanism
• Evaluation results


Traditional Image Processing Program
• Image processing program (grayscale conversion) written in traditional C

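/* Grayscale conversion in traditional C: the 200x180 resolution is hard-coded */
typedef struct { unsigned char R, G, B; } Pixel;
Pixel InImg[180][200], OutImg[180][200];

int main(void)
{
    int luma;
    /* ... read the input image into InImg ... */
    for (int y = 0; y < 180; y++) {
        for (int x = 0; x < 200; x++) {
            luma = (int)(InImg[y][x].R * 0.299
                       + InImg[y][x].G * 0.587
                       + InImg[y][x].B * 0.114);
            OutImg[y][x].R = luma;
            OutImg[y][x].G = luma;
            OutImg[y][x].B = luma;
        }
    }
    return 0;
}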

[Figure: color input image InImg and grayscale output image OutImg]


Image Processing Program with RaVioli
• Grayscale program using RaVioli

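RV_Pixel GrayScale(RV_Pixel Pix){      // component function
    int luma;
    luma = (int)(Pix.R()*0.299
               + Pix.G()*0.587
               + Pix.B()*0.114);
    return Pix.setRGB(luma, luma, luma);
}
int main(){
    RV_Image InImg, OutImg;
    // ... read the input image into InImg ...
    OutImg = InImg.procPix(GrayScale);  // procPix: higher-order method
    return 0;
}

• GrayScale is a component function that handles a single pixel (RV_Pixel).
• procPix is a higher-order method of RV_Image: it applies the component function to every pixel of InImg and returns the result as OutImg, so no resolution appears in the program.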
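The higher-order style can be illustrated without RaVioli itself. The sketch below uses hypothetical stand-ins (MiniPixel, MiniImage and its procPix are assumptions, not RaVioli's classes or implementation): the image object owns its width and height, and procPix runs the loops and calls the per-pixel component function, which therefore never sees the resolution.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for RV_Pixel (not RaVioli's real class).
struct MiniPixel {
    std::uint8_t r = 0, g = 0, b = 0;
    int R() const { return r; }
    int G() const { return g; }
    int B() const { return b; }
    MiniPixel setRGB(int nr, int ng, int nb) const {
        MiniPixel p;
        p.r = static_cast<std::uint8_t>(nr);
        p.g = static_cast<std::uint8_t>(ng);
        p.b = static_cast<std::uint8_t>(nb);
        return p;
    }
};

// Hypothetical stand-in for RV_Image: the image, not the user, knows its resolution.
class MiniImage {
public:
    MiniImage(int width, int height) : width_(width), height_(height) {
        assert(width > 0 && height > 0);
        pixels_.resize(static_cast<std::size_t>(width) * height);
    }
    MiniPixel& at(int x, int y) { return pixels_[index(x, y)]; }
    const MiniPixel& at(int x, int y) const { return pixels_[index(x, y)]; }

    // Higher-order method: the loops over width and height live here,
    // so the component function only ever sees one pixel.
    template <typename ComponentFn>
    MiniImage procPix(ComponentFn fn) const {
        MiniImage out(width_, height_);
        for (int y = 0; y < height_; ++y)
            for (int x = 0; x < width_; ++x)
                out.at(x, y) = fn(at(x, y));
        return out;
    }

private:
    std::size_t index(int x, int y) const {
        assert(x >= 0 && x < width_ && y >= 0 && y < height_);
        return static_cast<std::size_t>(y) * width_ + x;
    }
    int width_, height_;
    std::vector<MiniPixel> pixels_;
};

// The slide's component function, written against the stand-in classes.
MiniPixel GrayScale(MiniPixel pix) {
    int luma = static_cast<int>(pix.R() * 0.299 + pix.G() * 0.587 + pix.B() * 0.114);
    return pix.setRGB(luma, luma, luma);
}
```

Usage mirrors the slide: `MiniImage out = in.procPix(GrayScale);`. Changing the image size changes nothing in GrayScale, which is the point of hiding the resolution inside the higher-order method.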


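Video Processing Program with RaVioli
• A video is an RV_Video object, and each of its frames is an RV_Image object
• A higher-order method of RV_Video applies a frame-level component function to every frame; that function in turn applies a pixel-level component function to every pixel through a higher-order method of RV_Image

RV_Image GrayScale(RV_Image img){ ... }   // frame-level component function
RV_Pixel GrayScale(RV_Pixel p){ ... }     // pixel-level component function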



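Auto-Adjustment of Computation Load
• Spatial resolution (pixel rate)
  – Ss: spatial stride
• Temporal resolution (frame rate)
  – St: temporal stride

[Figure: going from Ss=1 to Ss=2 reduces the number of processed pixels to 1/4; going from St=1 to St=2 reduces the number of processed frames to 1/2]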
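The 1/4 and 1/2 factors follow from counting visited samples: with spatial stride Ss a frame is sampled every Ss pixels in both x and y, and with temporal stride St only every St-th frame is processed. A small sketch of that arithmetic (the function names are hypothetical, and the frame size and clip length in the usage note are only example values):

```cpp
#include <cassert>

// Samples visited along one axis of length n with stride s: ceil(n / s).
long long samplesAlong(long long n, long long s) {
    assert(n >= 0 && s >= 1);
    return (n + s - 1) / s;
}

// Pixels processed per frame when sampling every Ss-th pixel in x and y.
long long pixelsPerFrame(long long width, long long height, long long Ss) {
    return samplesAlong(width, Ss) * samplesAlong(height, Ss);
}

// Frames processed out of `frames` captured frames with temporal stride St.
long long framesProcessed(long long frames, long long St) {
    return samplesAlong(frames, St);
}

// Total pixel visits, i.e. the computation load, for a clip.
long long totalLoad(long long width, long long height, long long frames,
                    long long Ss, long long St) {
    return pixelsPerFrame(width, height, Ss) * framesProcessed(frames, St);
}
```

For a 200x180 clip of 30 frames, `totalLoad(200, 180, 30, 2, 1)` is exactly a quarter of `totalLoad(200, 180, 30, 1, 1)`, and `totalLoad(200, 180, 30, 1, 2)` is exactly half. When a dimension is not a multiple of the stride, the ceiling makes the reduction slightly smaller than 1/Ss² or 1/St.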


Priority Set
• Which stride should be increased?
  – Moving object detection: temporal resolution has priority
  – Pattern recognition: spatial resolution has priority
• Priority set = (spatial resolution, temporal resolution)
  – (7,3): keep the spatial stride and the temporal stride in the ratio 3:7
  – (1,0): keep the spatial stride at 1

We can specify resolution priorities with a priority set.
[Figure: spatial stride Ss=1 and Ss=2; temporal stride St=1 and St=2]
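One possible rule that honors a priority set is sketched below. It is an assumption, not necessarily RaVioli's rule: for a priority set (Ps, Pt), the strides are kept near the ratio Ss : St = Pt : Ps by raising whichever stride is currently behind (Ss·Ps < St·Pt raises Ss, otherwise St is raised). The type and function names are hypothetical.

```cpp
#include <cassert>

struct Strides {
    int Ss = 1;  // spatial stride
    int St = 1;  // temporal stride
};

// Priority set (spatial priority Ps, temporal priority Pt), as on the slide.
struct PrioritySet {
    int Ps;
    int Pt;
};

// Hypothetical rule, applied once each time an overload is detected:
// raise one stride so that Ss : St tracks Pt : Ps.
Strides increaseStride(Strides s, PrioritySet p) {
    assert(p.Ps >= 0 && p.Pt >= 0 && p.Ps + p.Pt > 0);
    if (s.Ss * p.Ps < s.St * p.Pt)
        ++s.Ss;   // spatial stride lags behind its share: coarsen space
    else
        ++s.St;   // otherwise skip more frames
    return s;
}
```

With (1,0) the test Ss·1 < St·0 never holds, so only St grows and the spatial stride stays at 1. With (7,3), eight successive overloads starting from Ss = St = 1 give Ss = 3 and St = 7, the 3:7 ratio from the slide.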


Detecting Overload

[Figure: the RV_Video class keeps incoming frames as RV_Image instances in a ring buffer and passes them to the image processing program through a higher-order method; when the processing time of a frame exceeds the frame interval, the program is overloaded]
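The overload test in the figure compares two durations. A minimal sketch, assuming the input frame rate is known and the processing time of each frame is measured with std::chrono::steady_clock; the function names are hypothetical.

```cpp
#include <cassert>
#include <chrono>

using Micros = std::chrono::microseconds;

// Frame interval of an input stream captured at `fps` frames per second.
Micros frameInterval(double fps) {
    assert(fps > 0.0);
    return Micros(static_cast<long long>(1000000.0 / fps));
}

// Overloaded when processing one frame takes longer than the frame interval:
// frames then arrive faster than the program can consume them.
bool isOverloaded(Micros processingTime, double fps) {
    return processingTime > frameInterval(fps);
}

// Measures how long one call of a frame-processing function takes.
template <typename Fn>
Micros timeFrame(Fn processFrame) {
    auto start = std::chrono::steady_clock::now();
    processFrame();
    return std::chrono::duration_cast<Micros>(std::chrono::steady_clock::now() - start);
}
```

At 30 fps the frame interval is 33,333 µs, so a frame that takes 40 ms to process signals an overload, after which a stride would be increased according to the priority set.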



Parallelization: Block Decomposition
• Image processing with C/C++ vs. image processing with RaVioli

Image processing with RaVioli:

RV_Pix GrayScale(RV_Pix Pix){
    int Y;
    Y = (int)(Pix.R()*0.299 + Pix.G()*0.587 + Pix.B()*0.114);
    return Pix.setRGB(Y, Y, Y);
}
int main(){
    RV_Img InImg, OutImg;
    OutImg = InImg.procPix(GrayScale);
    return 0;
}

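Image processing with C/C++:

typedef unsigned char byte;
typedef struct { byte R, G, B; } RGB;
int main(){
    RGB  InImg[180][200];   /* ... read the input image ... */
    byte OutImg[180][200];
    for( int y=0; y<180; y++ ){
        for( int x=0; x<200; x++ ){
            OutImg[y][x] = (byte)( InImg[y][x].R*0.299
                                 + InImg[y][x].G*0.587
                                 + InImg[y][x].B*0.114 );
        }
    }
    return 0;
}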


Parallelization: Block Decomposition
• Image processing with RaVioli: the program stays the same except for the procPix call, whose second argument gives the number of threads

  OutImg = InImg.procPix(GrayScale, 4);

[Figure: InImg is divided into four blocks, processed by thread1 to thread4]
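Block decomposition can be sketched with std::thread: the image is cut into horizontal bands and each thread applies the same per-pixel component function to its own band. The flat RGB buffer, the function names, and the band-wise split are assumptions for illustration; the slides do not say how RaVioli shapes its blocks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

struct RGB { std::uint8_t r, g, b; };

// Component function: maps one pixel to its luma.
std::uint8_t grayOf(RGB p) {
    return static_cast<std::uint8_t>(p.r * 0.299 + p.g * 0.587 + p.b * 0.114);
}

// Applies grayOf to every pixel, splitting the rows into numThreads bands.
std::vector<std::uint8_t> procPixParallel(const std::vector<RGB>& in,
                                          int width, int height, int numThreads) {
    assert(width > 0 && height > 0 && numThreads > 0);
    assert(in.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> out(in.size());
    std::vector<std::thread> workers;
    int rowsPerBand = (height + numThreads - 1) / numThreads;  // ceil
    for (int t = 0; t < numThreads; ++t) {
        int y0 = t * rowsPerBand;
        int y1 = std::min(height, y0 + rowsPerBand);
        if (y0 >= y1) break;  // more threads than rows
        workers.emplace_back([&in, &out, width, y0, y1] {
            // Bands are disjoint, so no two threads write the same element.
            for (int y = y0; y < y1; ++y)
                for (int x = 0; x < width; ++x) {
                    std::size_t i = static_cast<std::size_t>(y) * width + x;
                    out[i] = grayOf(in[i]);
                }
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

Because each output pixel depends only on the corresponding input pixel, the parallel result is identical to the single-threaded one; a component function that updates shared state would instead need a reduction.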


Translator for Block Decomposition

• A translator parallelizes the sequential RaVioli program
• Reduction operations may be required (for example, when a component function updates a shared variable)

Before translation:
  OutImg = InImg.procPix(GrayScale);
After translation (parallelized; the component function GrayScale is unchanged):
  OutImg = InImg.procPix(GrayScale, 4);


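For Reference: Example Code with OpenMP
• OpenMP
  – A standardized model of parallel programming for C/C++ and Fortran

#define NUM_THREADS 4
int i; int sum = 0;
#pragma omp parallel for num_threads(NUM_THREADS)
for (i = 1; i <= 256; i++)
    sum += i;   /* data race: every thread updates the shared sum */

• Reduction pragma: reduction(+:sum)

#pragma omp parallel for num_threads(NUM_THREADS) reduction(+:sum)

  Each thread accumulates a private partial sum (sum1 to sum4 for processes 1 to 4), and the partial sums are combined into sum at the end of the loop.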
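The reduction shown in the figure can also be written by hand. The sketch below uses std::thread instead of OpenMP: each thread keeps its own partial sum (sum1 to sum4 on the slide) over a contiguous chunk of the range, and the partial sums are combined once all threads have finished. The chunking scheme is an assumption; an OpenMP runtime may divide iterations differently.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Sums first..last (inclusive) with numThreads threads and a manual reduction.
long long parallelSum(int first, int last, int numThreads) {
    assert(numThreads > 0 && first <= last);
    std::vector<long long> partial(numThreads, 0);  // one private slot per thread
    std::vector<std::thread> workers;
    int count = last - first + 1;
    int chunk = (count + numThreads - 1) / numThreads;
    for (int t = 0; t < numThreads; ++t) {
        int lo = first + t * chunk;
        int hi = std::min(last, lo + chunk - 1);  // lo > hi means an empty chunk
        workers.emplace_back([&partial, t, lo, hi] {
            for (int i = lo; i <= hi; ++i) partial[t] += i;  // no shared writes
        });
    }
    for (auto& w : workers) w.join();
    long long sum = 0;
    for (long long p : partial) sum += p;  // combine step: sum1 + sum2 + ...
    return sum;
}
```

`parallelSum(1, 256, 4)` returns 32896 (= 256·257/2), the same value as the sequential loop. Splitting and recombining like this is only valid because integer addition is associative and commutative.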


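Reduction Operations Can Be Added Automatically
• Original program (counts pixels in a shared variable):

int sum = 0;
void pixSum(RV_Pixel p){   // component function
    sum += 1;
}
int main(){
    RV_Image InputImg;
    // read image data into "InputImg"
    InputImg.procPix(pixSum);
}

• The translator checks whether the update "sum += 1" satisfies the associative and commutative laws. Both hold, so it is treated as a reduction operation: each thread accumulates into a thread-local copy (_localsum += 1), and the partial results are added to sum afterwards (sum += _localsum).
• Translated program:

int sum = 0;
__thread int _localsum = 0;
void pixSum(RV_Pixel p){   // component function
    _localsum += 1;
}
void __pixSum(int threadNum){
    mutex_lock(&Mutex);
    sum += _localsum;
    mutex_unlock(&Mutex);
}
int main(){
    RV_Image InputImg;
    // read image data into "InputImg"
    InputImg.procPix(pixSum, 4);
    InputImg.reduction(__pixSum);
}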
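The translated code follows a common pattern: inside the parallel loop a thread-local accumulator replaces the shared variable, and each thread adds its partial result to the shared variable once, under a lock. A portable C++ sketch of that pattern, with thread_local and std::mutex standing in for __thread and mutex_lock; the countPixels driver is hypothetical and stands in for procPix(pixSum, 4) followed by reduction(__pixSum).

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

int sum = 0;                    // shared reduction variable, as in the original program
std::mutex sumMutex;            // protects sum during the combine step
thread_local int localsum = 0;  // per-thread accumulator (_localsum on the slide)

// Component function after translation: touches only the thread-local copy.
void pixSum() { localsum += 1; }

// Combine step (__pixSum on the slide): runs once per thread.
void combine() {
    std::lock_guard<std::mutex> lock(sumMutex);
    sum += localsum;
}

// Hypothetical driver: numPixels pixels interleaved across numThreads threads.
void countPixels(int numPixels, int numThreads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([t, numPixels, numThreads] {
            localsum = 0;  // reduction variable initialization
            for (int i = t; i < numPixels; i += numThreads) pixSum();
            combine();
        });
    }
    for (auto& w : workers) w.join();
}
```

The lock is taken once per thread rather than once per pixel, which is why this transformation pays off. The per-thread initialization and the final locked additions are the two kinds of reduction overhead.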



Assisting Pipeline Implementation
• To build a pipeline:
  – the whole process is split into several stages
  – several threads are created and assigned to the stages
  – FIFOs must be implemented and managed to transfer data between stages

[Figure: the stages binarize, edgedetect, and houghtrans run on thread1, thread2, and thread3, connected by FIFO1, FIFO2, and FIFO3]

Creating threads and FIFOs
• is not the essence of video processing
• is troublesome for programmers
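To show why this is troublesome, here is roughly what a programmer writes by hand for just two stages: a thread-safe FIFO built from std::mutex and std::condition_variable, one thread per stage, and an end-of-stream marker. The Fifo class, the use of std::nullopt as the end marker, and the integer "frames" are simplifications assumed for this sketch (C++17).

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Unbounded blocking FIFO; std::nullopt serves as the end-of-stream marker.
template <typename T>
class Fifo {
public:
    void push(std::optional<T> item) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(item));
        }
        cv_.notify_one();
    }
    std::optional<T> pop() {  // blocks until an item is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        std::optional<T> item = std::move(q_.front());
        q_.pop();
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::optional<T>> q_;
};

// Two hand-built stages: stage 1 doubles each "frame", stage 2 adds one.
std::vector<int> runTwoStages(const std::vector<int>& frames) {
    Fifo<int> fifo1, fifo2;  // input -> stage1 -> stage2
    std::vector<int> results;
    std::thread stage1([&] {
        while (auto f = fifo1.pop()) fifo2.push(*f * 2);
        fifo2.push(std::nullopt);  // forward end of stream
    });
    std::thread stage2([&] {
        while (auto f = fifo2.pop()) results.push_back(*f + 1);
    });
    for (int f : frames) fifo1.push(f);
    fifo1.push(std::nullopt);
    stage1.join();
    stage2.join();
    return results;
}
```

Only the two one-line stage bodies concern video processing. Everything else (queue, locking, termination, thread management) is plumbing that has to be repeated for every pipeline.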


Interface for Pipelining

RV_Pipedata* GrayScale(RV_Pipedata* data){
    // grayscale processing for a frame
    return data;
}
RV_Pipedata* Laplacian(RV_Pipedata* data){
    // Laplacian filter processing for a frame
    return data;
}
int main(){
    RV_Pipeline pipe;
    pipe.push(GrayScale);   // register stage 1
    pipe.push(Laplacian);   // register stage 2
    pipe.run();             // start the pipeline
    return 0;
}

[Figure: push registers GrayScale and Laplacian as stages of RV_Pipeline pipe; run executes them on thread1 and thread2, connected by FIFO1 and FIFO2]
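A library can hide that plumbing behind push and run. The sketch below is a hypothetical MiniPipeline, not RV_Pipeline's implementation: push records a stage function, and run starts one thread per stage, links consecutive stages with blocking FIFOs, feeds the input frames in order, and collects the output. Frames are modeled as int for brevity, and the code assumes C++17.

```cpp
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class MiniPipeline {
public:
    using Stage = std::function<int(int)>;  // one frame in, one frame out

    void push(Stage stage) { stages_.push_back(std::move(stage)); }

    // Runs every frame through all stages, one thread per stage.
    std::vector<int> run(const std::vector<int>& frames) {
        std::size_t n = stages_.size();
        std::vector<std::unique_ptr<Channel>> fifos;  // fifos[i] feeds stage i
        for (std::size_t i = 0; i <= n; ++i) fifos.push_back(std::make_unique<Channel>());
        std::vector<std::thread> threads;
        for (std::size_t i = 0; i < n; ++i) {
            threads.emplace_back([this, i, &fifos] {
                while (auto f = fifos[i]->pop()) fifos[i + 1]->push(stages_[i](*f));
                fifos[i + 1]->push(std::nullopt);  // propagate end of stream
            });
        }
        for (int f : frames) fifos[0]->push(f);
        fifos[0]->push(std::nullopt);
        std::vector<int> out;
        while (auto f = fifos[n]->pop()) out.push_back(*f);
        for (auto& t : threads) t.join();
        return out;
    }

private:
    struct Channel {  // minimal blocking FIFO; nullopt marks end of stream
        std::mutex m;
        std::condition_variable cv;
        std::queue<std::optional<int>> q;
        void push(std::optional<int> v) {
            { std::lock_guard<std::mutex> l(m); q.push(v); }
            cv.notify_one();
        }
        std::optional<int> pop() {
            std::unique_lock<std::mutex> l(m);
            cv.wait(l, [this] { return !q.empty(); });
            std::optional<int> v = q.front();
            q.pop();
            return v;
        }
    };
    std::vector<Stage> stages_;
};
```

Usage follows the slide's shape: `pipe.push(stageA); pipe.push(stageB); pipe.run(frames);`. The user code contains no threads or FIFOs, so the library is also free to reassign stages to threads later, which is what the load-balancing mechanism does.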


Load Imbalance between Stages

[Figure: stages A, B, and C run on thread1, thread2, and thread3 for frames 1, 2, 3, ...; stage C takes longer than A and B, so the pipeline stalls]
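A simple steady-state model explains the stall: when each stage has exactly one thread, finished frames can leave the pipeline no faster than the slowest stage allows, so the interval between output frames equals the largest stage time. With the relative stage times A:1, B:1, C:4 (the values the manager measures in the backup slides), the pipeline delivers one frame per 4 time units, and the threads running A and B are busy only a quarter of that time. The model is an assumption that ignores FIFO and scheduling overheads.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Steady-state interval between finished frames when each stage has one thread:
// the slowest stage sets the pace of the whole pipeline.
double pipelineInterval(const std::vector<double>& stageTimes) {
    assert(!stageTimes.empty());
    return *std::max_element(stageTimes.begin(), stageTimes.end());
}

// Fraction of each interval during which a stage's thread is busy.
double utilization(double stageTime, double interval) {
    assert(interval > 0.0);
    return stageTime / interval;
}
```

For `{1.0, 1.0, 4.0}` the interval is 4.0, and `utilization(1.0, 4.0)` is 0.25, which is the idle capacity the automatic load balancing tries to reclaim.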


Automatic Load Balancing

[Figure: with load balancing, thread1 runs stages A and B, while thread2 and thread3 both run stage C]


Automatic Load Balancing (cont.)

[Figure: thread1 processes A and B for frames 1, 2, 3, ...; stage C of consecutive frames alternates between thread2 and thread3]



Evaluation: Resolution Adjustment

[Figure: frame rate (fps) and number of pixels over time (0-35 sec), one plot per priority set (spatial resolution : temporal resolution) = 0:1, 1:0, and 3:7]


Evaluation: Parallelization Functions

Evaluation environment
OS                                  Solaris 10
CPU                                 UltraSPARC T1
Frequency                           1.0 GHz
Number of cores                     8
Number of active threads per core   4
Memory                              16 GB
Compiler                            Sun Studio 12 (Sun C++ 5.9)
Compiler options                    -fast -m64 -xchip=ultraT1
Thread library                      pthreads


Evaluation: Auto Block Decomposition

[Figure: speedup ratio (0-20) versus number of threads (0-32) for hough, pixAverage, laplacian, and voronoi]


Evaluation: Hough transform

[Figure: reduction overhead of hough with 2, 4, 8, 16, and 32 threads (normalized, 0-1.00), broken down into reduction variable initialization and reduction operations]


Evaluation: Automatic load balancing

                             w/o load balancing   w/ load balancing
Spatial resolution           51x51                170x170
Spatial resolution stride    11                   4
Temporal resolution stride   1                    1

[Figure: pipeline status of stages A, B, and C and the resulting output image, without and with load balancing]


Conclusion
• RaVioli
  – hides resolutions from programmers
    • pseudo real-time processing
  – has semi-automatic parallelization functions
    • semi-automatic block decomposition
    • load balancing mechanism between pipeline stages
• Our future work
  – implementing an automatic power-saving function in RaVioli
  – making RaVioli adaptive to various platforms such as the Cell Broadband Engine
  – designing an easy-to-write language that cooperates with RaVioli


Automatic Load Balancing

[Figure: a manager monitors stages A, B, and C running on thread1, thread2, and thread3 for frames 1, 2, 3, ...]


Automatic Load Balancing (cont.)

[Figure: the manager measures the relative stage times A:1, B:1, C:4; for the following frames (4, 5, ...), thread1 runs stages A and B, and thread2 and thread3 both run stage C]
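The reassignment in these backup slides can be reproduced with a small cost model. The sketch below is a hypothetical manager, not necessarily RaVioli's algorithm. It tries every split of the stage sequence into groups of consecutive stages that share threads. Each group gets one thread, and each spare thread goes to the current bottleneck; a group with r threads handles every r-th frame on each thread, so its effective time per frame is cost / r. The plan with the shortest interval wins, and ties go to the plan with more groups (this tie-break is also an assumption).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Plan {
    std::vector<std::vector<std::size_t>> groups;  // stage indices sharing thread(s)
    std::vector<int> replicas;                     // threads given to each group
    double interval = 0.0;                         // steady-state time per frame
};

// One thread per group, then each spare thread goes to the current bottleneck.
static void assignThreads(Plan& plan, const std::vector<double>& groupCost, int threads) {
    plan.replicas.assign(groupCost.size(), 1);
    for (int spare = threads - static_cast<int>(groupCost.size()); spare > 0; --spare) {
        std::size_t worst = 0;
        for (std::size_t k = 1; k < groupCost.size(); ++k)
            if (groupCost[k] / plan.replicas[k] > groupCost[worst] / plan.replicas[worst])
                worst = k;
        ++plan.replicas[worst];
    }
    plan.interval = 0.0;
    for (std::size_t k = 0; k < groupCost.size(); ++k)
        plan.interval = std::max(plan.interval, groupCost[k] / plan.replicas[k]);
}

// Hypothetical manager: tries every split into consecutive groups (2^(n-1) splits).
Plan balance(const std::vector<double>& stageCost, int threads) {
    std::size_t n = stageCost.size();
    assert(n > 0 && n < 20 && threads > 0);
    Plan best;
    bool found = false;
    for (unsigned long cuts = 0; cuts < (1ul << (n - 1)); ++cuts) {
        Plan plan;
        std::vector<double> groupCost;
        plan.groups.emplace_back();
        groupCost.push_back(0.0);
        for (std::size_t s = 0; s < n; ++s) {
            if (s > 0 && ((cuts >> (s - 1)) & 1ul)) {  // start a new group at stage s
                plan.groups.emplace_back();
                groupCost.push_back(0.0);
            }
            plan.groups.back().push_back(s);
            groupCost.back() += stageCost[s];
        }
        if (static_cast<int>(groupCost.size()) > threads) continue;  // too few threads
        assignThreads(plan, groupCost, threads);
        if (!found || plan.interval < best.interval ||
            (plan.interval == best.interval && plan.groups.size() > best.groups.size())) {
            best = plan;
            found = true;
        }
    }
    return best;
}
```

For stage costs A:1, B:1, C:4 on three threads, `balance({1.0, 1.0, 4.0}, 3)` puts A and B in one group on one thread and gives C two threads. The resulting interval is 2 time units per frame instead of 4, which matches the assignment in the figure. Running all three stages in one group on three threads would also reach 2, and the tie-break prefers the two-group plan.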