ELEC692 VLSI Signal Processing Architecture Lecture 7


Lecture 7: VLSI Architecture for Block Matching Algorithm for Video Compression

* Part of the notes is taken from the course notes of Prof. Bing Zeng’s ELEC 533

Reference

• P. Pirsch, N. Demassieux, W. Gehrke, “VLSI architecture for Video compression – A survey”, in Proceedings of the IEEE, vol. 83, no. 2, pp. 220-246, Feb. 1995

• T. Komarek, P. Pirsch, “Array Architecture for Block Matching Algorithm”, in IEEE Transactions on Circuits and Systems, vol. 36, no. 10, pp. 1301-1310, Oct. 1989

Interframe Coding/Motion Estimation of Video Sequence

Interframe Transform/Predictive Coding


• Prediction is based on a previously processed frame

• Prediction is accomplished by motion estimation (ME)

• Motion estimation is done in the spatial domain

• The 2-D DCT has to be inside the coding loop, and a 2-D IDCT is needed to convert the frequency-domain information back to the spatial domain

Motion Compensated Prediction

Block Matching Method

Search window

Block matching Criterion

• Mean Square Error (MSE)

$$\mathrm{MSE}(m,n) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \bigl( x_t(i,j) - x_{t-1}(i+m,\, j+n) \bigr)^2$$

• Mean Absolute Difference (MAD)

$$\mathrm{MAD}(m,n) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \bigl| x_t(i,j) - x_{t-1}(i+m,\, j+n) \bigr|$$
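The two matching criteria can be sketched in a few lines of Python. This is only an illustration under stated assumptions: frames are nested lists of integers, `cur` is the N x N block of the current frame, `ref` is a region of the previous frame, and the hypothetical offsets `m`, `n` (not named on the slide) select the candidate block inside `ref`.

```python
def mse(cur, ref, m=0, n=0):
    """Mean square error between block `cur` and the block of `ref` at offset (m, n)."""
    N = len(cur)
    total = 0
    for i in range(N):
        for j in range(N):
            diff = cur[i][j] - ref[i + m][j + n]
            total += diff * diff
    return total / (N * N)


def mad(cur, ref, m=0, n=0):
    """Mean absolute difference between block `cur` and the block of `ref` at offset (m, n)."""
    N = len(cur)
    total = 0
    for i in range(N):
        for j in range(N):
            total += abs(cur[i][j] - ref[i + m][j + n])
    return total / (N * N)
```

MAD avoids the multiplication needed for each squared term in MSE, which is one reason it is attractive for hardware.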

Important factors for BM Motion Estimation

• Block size – 8X8, 16X16, variable

• Size of the search window

– Depends on frame differences, speed of moving objects, resolution, etc.

• Matching criterion

– Accuracy vs complexity, use of truncated pixels

• Search strategy

– Full search, hierarchical search, subsampling of the motion field

• Hardware consideration

Real time processing for BMA

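• With a block size of 16*16, a window size of 32*32, and CIF frames at 30 frames/s, we need

$$256\,\frac{\text{ops}}{\text{search}} \times 289\,\frac{\text{searches}}{\text{block}} \times 396\,\frac{\text{blocks}}{\text{frame}} \times 30\,\frac{\text{frames}}{\text{sec}} \approx 879\ \text{Mops/sec}$$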

For CCIR 601 or HDTV, it will require several GOPS or even tens of GOPS, so full search has to be implemented in dedicated hardware.
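The arithmetic behind the CIF figure can be checked with a short script. It is a sketch that assumes the standard CIF resolution of 352 x 288 pixels, counts one operation per pixel comparison, and takes the number of candidate positions as (window - block + 1)^2. The function and parameter names are illustrative.

```python
def full_search_ops_per_second(block=16, window=32, width=352, height=288, fps=30):
    """Operation rate of full-search block matching for one video format."""
    ops_per_search = block * block                    # 256 pixel comparisons per candidate
    searches_per_block = (window - block + 1) ** 2    # 17 * 17 = 289 candidate positions
    blocks_per_frame = (width // block) * (height // block)  # 22 * 18 = 396 blocks in CIF
    return ops_per_search * searches_per_block * blocks_per_frame * fps


if __name__ == "__main__":
    rate = full_search_ops_per_second()
    print(f"{rate / 1e6:.0f} Mops/s")
```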

Exhaustive Search Block Matching

• A block of size N X N of the current image (reference block, denoted by X)

• Is matched with all the blocks located within a search window (candidate blocks, denoted by Y)

• Maximum displacement – w

• Computing the mean absolute difference (MAD) between the blocks

• Matching distance D is given by

$$D(m,n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \bigl| x(i,j) - y(i+m,\, j+n) \bigr|, \qquad v = (m,n)\big|_{D_{\min}}$$

v is the motion vector

Number of candidate blocks to be considered: $(2w+1)^2$

Algorithm to find the motion vector

Dmin = MAXVALUE
Vmin = (0,0)
for m = -w to +w
    for n = -w to +w
        D(m,n) = 0
        for i = 1 to N
            for j = 1 to N
                D(m,n) = D(m,n) + |x(i,j) - y(i+m,j+n)|
            endfor
        endfor
        if D(m,n) < Dmin then
            Dmin = D(m,n)
            Vmin = (m,n)
        endif
    endfor
endfor
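A runnable version of this full-search loop might look as follows. It is a minimal sketch, assuming the reference block `x` is an N x N nested list and the search window `y` is an (N+2w) x (N+2w) nested list centered on the block, so that displacement (m, n) maps to the array offset (m + w, n + w). The storage layout is an assumption made for illustration.

```python
def full_search(x, y, w):
    """Return (Dmin, (m, n)) for reference block x inside search window y."""
    N = len(x)
    d_min = float("inf")   # plays the role of MAXVALUE
    v_min = (0, 0)
    for m in range(-w, w + 1):
        for n in range(-w, w + 1):
            d = 0
            for i in range(N):
                for j in range(N):
                    # index offset of w: array index 0 corresponds to displacement -w
                    d += abs(x[i][j] - y[i + m + w][j + n + w])
            if d < d_min:
                d_min = d
                v_min = (m, n)
    return d_min, v_min
```

The four nested loops make the 4-dimensional index space (i, j, m, n) of the algorithm explicit.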

Dependency graph

[Figure: dependency graph for calculating the MAD. Panels: calculating s_i(m,n) and s(m,n); calculating Dmin and v]

Dependency graph

• The BM algorithm can be described by several different dependency graphs

• Example 1

AD = absolute difference and addition

M = minimum value computation

Dependency graph

• Example 2

Data input

• Line scan and block scan

• Line scan

– TV lines run through as a whole, from the upper to the lower side of the frame

• Block scan

– Quadratic blocks of n X n pixels are run through in a block-line manner

– Well suited if the data are supplied by a memory with block scan output

– Pixels within a block are traversed column by column

– E.g. (3X3)-pixel block

$$\begin{bmatrix} x(1,1) & x(1,2) & x(1,3) \\ x(2,1) & x(2,2) & x(2,3) \\ x(3,1) & x(3,2) & x(3,3) \end{bmatrix}$$

Data are read in the order x(1,1), x(2,1), x(3,1), x(1,2), x(2,2), x(3,2), x(1,3), x(2,3), x(3,3).
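The block-scan order can be generated programmatically. The sketch below uses 1-based (row, column) indices to match the slide. The helper `frame_block_scan` additionally assumes that blocks within a block-line are visited left to right and that the frame dimensions are multiples of n; neither is stated on the slide.

```python
def block_scan_order(n):
    """Indices (i, j) of an n x n block, traversed column by column."""
    return [(i, j) for j in range(1, n + 1) for i in range(1, n + 1)]


def frame_block_scan(rows, cols, n):
    """Pixel order for a whole frame scanned block by block (assumed left to right)."""
    order = []
    for br in range(0, rows, n):          # block-lines, top to bottom
        for bc in range(0, cols, n):      # blocks inside one block-line
            order += [(br + i, bc + j) for (i, j) in block_scan_order(n)]
    return order
```

Calling `block_scan_order(3)` reproduces the read order listed for the (3X3)-pixel block.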

Mapping BMA onto Systolic Arrays

• Decompose the algorithm into its basic operations and convert it into a form where each result is assigned to a unique variable

• Formulate it as an n-dimensional dependence graph (DG) of computation nodes and data dependence arcs.

• One straightforward mapping is to implement a PE for each node of the DG and a communication link for each edge of the DG.

• A more efficient design with higher processor utilization results if each PE executes the operations of multiple computation nodes.

• This needs a time schedule and an assignment of multiple nodes to a single PE by projection. The PEs need to be programmable to some extent.

Mapping BMA onto Systolic Arrays

• The BMA is defined over a 4-dimensional index space (i,j,m,n)

• The BMA can be decomposed into two parts which are defined over two-dimensional index spaces.

– The 1st one is spanned by the indices i, j: finding the sum D(m,n)

– The 2nd one is defined over m and n: the minimum search and the selection of the displacement vector

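$$D_i(m,n) = \sum_{j=1}^{N} \bigl| x(i,j) - y(i+m,\, j+n) \bigr|, \qquad D(m,n) = \sum_{i=1}^{N} D_i(m,n)$$

$$D_{\min} = \min_{m,n} \{ D(m,n) \}, \qquad v = (m,n)\big|_{D_{\min}}$$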
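The two-part decomposition can be sketched in Python: one part works over (i, j) and produces D(m,n) from row sums, the other works over (m, n) and performs the minimum search. As an assumption for illustration, the search window `y` is stored as an (N+2w) x (N+2w) nested list with displacement (m, n) at array offset (m + w, n + w); the function names are hypothetical.

```python
def row_distortion(x, y, i, m, n, w):
    """D_i(m, n): absolute differences summed over j for one block row i."""
    N = len(x)
    return sum(abs(x[i][j] - y[i + m + w][j + n + w]) for j in range(N))


def distortion(x, y, m, n, w):
    """D(m, n): sum of the row distortions over all rows i (first index space)."""
    return sum(row_distortion(x, y, i, m, n, w) for i in range(len(x)))


def min_search(x, y, w):
    """Minimum search over (m, n) (second index space); returns (Dmin, v)."""
    table = {(m, n): distortion(x, y, m, n, w)
             for m in range(-w, w + 1)
             for n in range(-w, w + 1)}
    v = min(table, key=table.get)
    return table[v], v
```

Separating the code this way mirrors the choice of which index pair is mapped onto processing elements and which is mapped onto time.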

Transform it into a 2-D array

• D(m,n) is mapped onto a 2-D array of PEs

• V(X,Y) is mapped onto time

Realistic implementation of 2-D array

• Reduction of the cycle time

– Pipelining of the computation of D(m,n).

• I/O management

– Each of the AD-PEs receives a new value of y(m+i,n+j) at each clock cycle.

• Transmitting the $N^2$ values from an external memory is not feasible. We can take advantage of the fact that these values belong to the search window.

• A portion of the search window of size N·(2w+N) is stored in the circuit in a 2-D bank of shift registers, able to shift in the up, down, and right directions.

• Each AD-PE has one of these registers and can obtain, at each cycle, the value of y(m+i,n+j) that it needs.

• To update this register bank, a new column of 2w+N pixels of the search area is serially entered into the circuit and inserted into the bank of registers.

• To load a new reference block with a low I/O overhead, double buffering of x(i,j) is required, with the pixels x’(i,j) of the new reference block serially loaded during the computation of the current reference block.

implementation of the 2-D array

2-D array

• An alternative projection of the DG onto the (i,j)-plane gives the architecture AB2

• The current-frame data x(i,j) remain fixed in the AD PEs, so they have to be loaded into the array beforehand.

Time required = (2w+1)*(2w+1)

Mapping to a 1-D array

• More efficient design with a higher processor utilization if each PE executes the operations of multiple computation nodes

• Mapped to a 1D array of PE, which is able to compute in parallel the partial distortion along one row.

• Compute D(m,n) in N cycles

1-D array

• Project the DG along the i-axis onto a one-dimensional signal flow graph.

• Called AB1 array, it has the size of a block

Consecutive computation of all $(2w+1)^2$ candidate blocks, one displacement vector at a time, requires $N(2w+1)^2$ time instances

Another way of mapping: search-area based

• The dependency graph for computing v(X,Y) is mapped onto a 2-D array of $(2w+1)^2$ PEs, while the dependency graph for computing D(m,n) is mapped onto time

• Each PE, working in parallel, keeps track of one particular distortion computation and sequentially explores the reference block.

• At each cycle, each PE receives a different value of y(m+i,n+j), and all the PEs receive the value of one pixel of the reference block, which is broadcast to the array.

• After $N^2$ cycles, each of the $(2w+1)^2$ PEs holds one value of D(m,n) corresponding to a particular displacement (m,n)

• To find the minimum distortion value, find the minimum of each column by down-shifting the D(m,n) in the PEs, and find the final minimum value by left-shifting the resulting D(m,n) in the M-PEs.
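A behavioural simulation can make the search-area-based mapping concrete. The sketch below is not a hardware model: the two inner loops stand in for the $(2w+1)^2$ PEs that would work in parallel, the window `y` is assumed to be an (N+2w) x (N+2w) nested list, and treating a fixed n as one "column" of the array is an assumption made for illustration.

```python
def search_area_array(x, y, w):
    """Simulate (2w+1)^2 accumulating PEs fed by a broadcast reference pixel."""
    N = len(x)
    size = 2 * w + 1
    # acc[m + w][n + w] is the accumulator of the PE assigned to displacement (m, n)
    acc = [[0] * size for _ in range(size)]
    for i in range(N):                 # N * N cycles in total
        for j in range(N):
            pixel = x[i][j]            # one reference pixel broadcast per cycle
            for m in range(-w, w + 1):         # these two loops model parallel PEs
                for n in range(-w, w + 1):
                    acc[m + w][n + w] += abs(pixel - y[i + m + w][j + n + w])
    # Minimum search: reduce each column first, then reduce the column minima
    best = None
    for n in range(-w, w + 1):
        col_min = min((acc[m + w][n + w], (m, n)) for m in range(-w, w + 1))
        if best is None or col_min < best:
            best = col_min
    return acc, best
```

Compared with the block-based mapping, the loop nest is interchanged: the (i, j) loops run over time and the (m, n) loops run over processing elements.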

2-D search area based architecture

A part of the search area, of size w·(2w+N), needs to be stored in order to reduce I/O.

1-D search area based architecture

• An array of (2w+1) processing elements executes, in $N^2$ cycles, the computation of the distortions D(m,n) corresponding to one line (resp. column) of possible motion vectors.

• This process is repeated sequentially 2w+1 times to compute all the distortions.
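The trade-off between array size and computation time for the four mappings discussed so far can be tabulated with a small script. The cycle counts follow the slides; the PE counts for the two block-based arrays (N*N for AB2, N for AB1) are this sketch's reading of the slides, and pipeline latency and data loading are ignored.

```python
def architecture_costs(N, w):
    """Return {architecture: (number of PEs, cycles per reference block)}."""
    c = (2 * w + 1) ** 2   # number of candidate displacements
    return {
        "2-D block-based (AB2)": (N * N, c),
        "1-D block-based (AB1)": (N, N * c),
        "2-D search-area-based": (c, N * N),
        "1-D search-area-based": (2 * w + 1, N * N * (2 * w + 1)),
    }


if __name__ == "__main__":
    for name, (pes, cycles) in architecture_costs(16, 8).items():
        print(f"{name:24s} PEs={pes:4d} cycles={cycles:5d} PEs*cycles={pes * cycles}")
```

Under these assumptions the product of PEs and cycles is the same for every mapping, $N^2(2w+1)^2$, which is the total number of absolute-difference operations per reference block.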

Another architecture

• Requires only a sequential data input.

• Dummy data, denoted by dots, are inserted into the stream of reference data to guarantee a regular data flow without any data permutation within the array

Time required = (2w+1)*(2w+1)*N
