2DINVERSE

7/25/2019 2DINVERSE


Several fast algorithms are available to calculate the 8-point 1D IDCT. The algorithm implemented in AltiVec is a scaled version of Chen [2], developed by IBM's Haifa Research Laboratory [3] and shown in Figures 1 and 2. It was selected because it minimizes the number of adds and multiplies required to arrive at the result.

To optimize Chen's algorithm for scalability and the AltiVec instruction set, the following identity

was used:

Figure 2. Prescale matrix (rows 0, 4, and 5 shown)

     0         1         2         3         4         5         6         7
0    c4*c4/4   c4*c1/4   c4*c2/4   c4*c3/4   c4*c4/4   c4*c3/4   c4*c2/4   c4*c1/4
...
4    c4*c4/4   c4*c1/4   c4*c2/4   c4*c3/4   c4*c4/4   c4*c3/4   c4*c2/4   c4*c1/4
5    c3*c4/4   c3*c1/4   c3*c2/4   c3*c3/4   c3*c4/4   c3*c3/4   c3*c2/4   c3*c1/4
...



2D Inverse Discrete Cosine Transform


2. AltiVec Implementation

The implementation of the above algorithm requires the following steps:

Transpose the matrix.

Load the data and multiplication constants.

Load the prescaling matrix (Figure 2) and prescale the input.

Perform the IDCT on the eight columns according to the stages shown in Figure 1.

Transpose the matrix.

Perform the IDCT on the eight rows according to the stages shown in Figure 1.

Because the step that precedes the IDCT in H.263 is zig-zag positioning of the coefficients, the transposition can be folded into that step by reordering how the positioning is performed, so it is available at no additional cost.
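As an illustration of folding the transpose into the zig-zag step, here is a minimal scalar sketch. It assumes the conventional 8x8 zig-zag scan order and raster (row-major) storage. The function names dezigzag and dezigzag_transposed are hypothetical and are not taken from the original source code.

```c
#include <stdint.h>

/* Conventional 8x8 zig-zag scan: zigzag[k] is the raster position
   (row * 8 + column) of the k-th coefficient in scan order. */
static const uint8_t zigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* Place coefficients that arrive in scan order into raster order. */
void dezigzag(const int16_t in[64], int16_t block[64])
{
    for (int k = 0; k < 64; k++)
        block[zigzag[k]] = in[k];
}

/* Same work per coefficient, but each one is written to its transposed
   position (row and column swapped), so no separate transpose is needed. */
void dezigzag_transposed(const int16_t in[64], int16_t block[64])
{
    for (int k = 0; k < 64; k++) {
        int p = zigzag[k];
        block[(p & 7) * 8 + (p >> 3)] = in[k];
    }
}
```

In practice the transposed positions would be precomputed into a second table, so the per-coefficient cost is the same as for the plain scan.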

Several large constants must be set up, and there are multiple ways to do this. The following two methods were considered:

1) Load a single vector that holds the different constants, set up at compile time, and then vec_splat each constant into its own vector.

vector signed short SpecialConstants =
(vector signed short) (23170, 13573, 6518, 21895, -23170, -21895, 0, 0);
c4 = vec_splat(SpecialConstants, 0); /* c4 = cos(4*pi/16) */
a0 = vec_splat(SpecialConstants, 1); /* a0 = c6/c2 */
a1 = vec_splat(SpecialConstants, 2); /* a1 = c7/c1 */
...

2) Explicitly load each vector with the appropriate constant.

c4 = (vector signed short)(23170); /* c4 = cos(4*pi/16) */
a0 = (vector signed short)(13573); /* a0 = c6/c2 */
a1 = (vector signed short)(6518); /* a1 = c7/c1 */
...

The choice between these methods depends on the other vector operations in use and on which method best distributes work across the vector functional units. The first method moves the work to the permute unit after the initial load. The second method requires a load for every constant and places more demand on the load/store unit.
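The constant values can be checked with a few lines of scalar C. The sketch below assumes that the constants are stored with 15 fractional bits (value times 32768, rounded to nearest) and that cN stands for cos(N*pi/16). Under those assumptions it gives 23170 for c4, 13573 for c6/c2, 6518 for c7/c1, and 21895 for c5/c3. The helper names to_q15 and idct_constants are hypothetical.

```c
#include <stdint.h>

/* Round a value in [0, 1) to 16-bit fixed point with 15 fractional bits. */
static int16_t to_q15(double x)
{
    return (int16_t)(x * 32768.0 + 0.5);
}

/* Hypothetical helper: fills k[] with c4, c6/c2, c7/c1, c5/c3 in that order. */
void idct_constants(int16_t k[4])
{
    /* cos(n*pi/16) for n = 1..7, written out so no math library is needed */
    const double c1 = 0.98078528040323044;
    const double c2 = 0.92387953251128674;
    const double c3 = 0.83146961230254524;
    const double c4 = 0.70710678118654752;
    const double c5 = 0.55557023301960222;
    const double c6 = 0.38268343236508977;
    const double c7 = 0.19509032201612827;

    k[0] = to_q15(c4);      /* c4 */
    k[1] = to_q15(c6 / c2); /* a0 */
    k[2] = to_q15(c7 / c1); /* a1 */
    k[3] = to_q15(c5 / c3); /* a2 (assumed name for the fourth constant) */
}
```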

Prescaling is done by using vec_mradds() to obtain the high-order 16 bits of the 32-bit intermediate multiplication result formed from the data and the PreScale matrix. The PreScale matrix is obtained by multiplying the results of the equations in Figure 2 by 2^16.
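For readers without AltiVec hardware, the behavior of vec_mradds() on a single 16-bit element can be modeled in scalar C. The sketch below follows the multiply-high, round, and add with saturation semantics of the instruction: the 32-bit product is rounded, shifted right by 15 bits, added to the third operand, and saturated to 16 bits. The name mradds_lane is hypothetical, and the real intrinsic applies this to all eight elements of a vector at once.

```c
#include <stdint.h>

/* Clamp a 32-bit value to the signed 16-bit range. */
static int16_t sat16(int32_t x)
{
    if (x > 32767) return 32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
}

/* One-element model of vec_mradds(a, b, c):
   saturate(((a * b + 0x4000) >> 15) + c).
   The shift is written as a floor division so that it does not rely on
   how the compiler shifts negative numbers. */
int16_t mradds_lane(int16_t a, int16_t b, int16_t c)
{
    int32_t p = (int32_t)a * b + 0x4000;
    int32_t q = (p >= 0) ? (p >> 15) : -((-p + 32767) >> 15);
    return sat16(q + c);
}
```

With c set to zero this is a plain fixed-point multiply, which is how the prescale step and the first multiply in each butterfly use it. For example, multiplying 1000 by the constant 23170 (c4) gives 707.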

The one-dimensional IDCT is implemented in stages as labeled in Figure 1. The code that implements these is grouped to correspond to the various stages. For clarity, input values are labeled i0-i7, stage 1 values are labeled r0-r7, stage 2 values are labeled s0-s7, stage 3 values are labeled t0-t7, and output values are labeled o0-o7.



/* ma2 and mc4 hold the negated constants -a2 and -c4 */

/* stage 1 */
r0 = i0;
r1 = i4;
r2 = i2;
r3 = i6;
temp = vec_mradds(a1, i1, (vector signed short)(0));
r4 = vec_subs(temp, i7);
r5 = vec_mradds(ma2, i3, i5);
r6 = vec_mradds(a2, i5, i3);
r7 = vec_mradds(a1, i7, i1);

/* stage 2 */
s0 = vec_adds(r0, r1);
s1 = vec_subs(r0, r1);
temp = vec_mradds(a0, r2, (vector signed short)(0));
s2 = vec_subs(temp, r3);
s3 = vec_mradds(a0, r3, r2);
s4 = vec_adds(r4, r5);
s5 = vec_subs(r4, r5);
s6 = vec_subs(r7, r6);
s7 = vec_adds(r7, r6);

/* stage 3 */
t0 = vec_adds(s0, s3);
t1 = vec_adds(s1, s2);
t2 = vec_subs(s1, s2);
t3 = vec_subs(s0, s3);
t4 = s4;
t5 = vec_subs(s6, s5);
t6 = vec_adds(s5, s6);
t7 = s7;

/* Output */
o0 = vec_adds(t0, t7);
o1 = vec_mradds(c4, t6, t1);
o2 = vec_mradds(c4, t5, t2);
o3 = vec_adds(t3, t4);
o4 = vec_subs(t3, t4);
o5 = vec_mradds(mc4, t5, t2);
o6 = vec_mradds(mc4, t6, t1);
o7 = vec_subs(t0, t7);
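The stages can be tried out on any machine with a scalar model in which each vector is replaced by a single 16-bit value. The sketch below makes several assumptions: vec_adds(), vec_subs(), and vec_mradds() are modeled by the hypothetical helpers adds, subs, and mradds; the constants are c4 = 23170, a0 = 13573, a1 = 6518, a2 = 21895, with ma2 and mc4 as their negations; and the last output stage is read as producing o6 from mc4. Prescaling is left out, so the result is the scaled transform rather than a complete IDCT.

```c
#include <stdint.h>

static int16_t sat16(int32_t x)
{
    return x > 32767 ? 32767 : (x < -32768 ? -32768 : (int16_t)x);
}
static int16_t adds(int16_t a, int16_t b) { return sat16((int32_t)a + b); }
static int16_t subs(int16_t a, int16_t b) { return sat16((int32_t)a - b); }

/* One-element model of vec_mradds: saturate(((a*b + 0x4000) >> 15) + c). */
static int16_t mradds(int16_t a, int16_t b, int16_t c)
{
    int32_t p = (int32_t)a * b + 0x4000;
    int32_t q = (p >= 0) ? (p >> 15) : -((-p + 32767) >> 15);
    return sat16(q + c);
}

/* Hypothetical scalar version of the staged 1D transform (no prescale). */
void idct8_stages(const int16_t i[8], int16_t o[8])
{
    const int16_t c4 = 23170, a0 = 13573, a1 = 6518, a2 = 21895;
    const int16_t ma2 = -21895, mc4 = -23170;

    /* stage 1 */
    int16_t r0 = i[0], r1 = i[4], r2 = i[2], r3 = i[6];
    int16_t r4 = subs(mradds(a1, i[1], 0), i[7]);
    int16_t r5 = mradds(ma2, i[3], i[5]);
    int16_t r6 = mradds(a2, i[5], i[3]);
    int16_t r7 = mradds(a1, i[7], i[1]);

    /* stage 2 */
    int16_t s0 = adds(r0, r1), s1 = subs(r0, r1);
    int16_t s2 = subs(mradds(a0, r2, 0), r3);
    int16_t s3 = mradds(a0, r3, r2);
    int16_t s4 = adds(r4, r5), s5 = subs(r4, r5);
    int16_t s6 = subs(r7, r6), s7 = adds(r7, r6);

    /* stage 3 */
    int16_t t0 = adds(s0, s3), t1 = adds(s1, s2);
    int16_t t2 = subs(s1, s2), t3 = subs(s0, s3);
    int16_t t4 = s4, t5 = subs(s6, s5), t6 = adds(s5, s6), t7 = s7;

    /* output */
    o[0] = adds(t0, t7);
    o[1] = mradds(c4, t6, t1);
    o[2] = mradds(c4, t5, t2);
    o[3] = adds(t3, t4);
    o[4] = subs(t3, t4);
    o[5] = mradds(mc4, t5, t2);
    o[6] = mradds(mc4, t6, t1);
    o[7] = subs(t0, t7);
}
```

A quick sanity check is a DC-only input: with i0 set to 1000 and all other inputs zero, every output equals 1000. An input with only i4 set produces the sign pattern +, -, -, +, +, -, -, + of the corresponding cosine basis function.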



The above example can be optimized by removing direct assignments and reusing variables; all stages have been fully enumerated here for clarity. The C source code shows it in a more efficient form. This code was made into a macro; however, it could also have been made into an INLINE function, like the matrix transpose. Using macros minimizes the penalty for the call overhead, subject to compiler limitations.

The matrix transpose uses vec_mergeh() and vec_mergel() for the transposition.

C source is "2DDCTMTCSCOD" on the web page under Design Tools/Software.
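To show how merge operations alone can transpose an 8x8 block of 16-bit values, here is a scalar sketch. It models vec_mergeh() and vec_mergel() as interleaving the first or last four elements of two rows, and applies the same merge pattern three times. This is one common arrangement and not necessarily the one in the referenced C source. The names mergeh, mergel, merge_stage, and transpose8x8 are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for vec_mergeh on eight 16-bit elements:
   interleave the first four elements of x and y. */
static void mergeh(const int16_t x[8], const int16_t y[8], int16_t out[8])
{
    for (int i = 0; i < 4; i++) {
        out[2 * i] = x[i];
        out[2 * i + 1] = y[i];
    }
}

/* Scalar stand-in for vec_mergel: interleave the last four elements. */
static void mergel(const int16_t x[8], const int16_t y[8], int16_t out[8])
{
    for (int i = 0; i < 4; i++) {
        out[2 * i] = x[i + 4];
        out[2 * i + 1] = y[i + 4];
    }
}

/* One merge stage: rows k and k+4 are merged into rows 2k and 2k+1. */
static void merge_stage(int16_t in[8][8], int16_t out[8][8])
{
    for (int k = 0; k < 4; k++) {
        mergeh(in[k], in[k + 4], out[2 * k]);
        mergel(in[k], in[k + 4], out[2 * k + 1]);
    }
}

/* Three identical merge stages transpose the 8x8 block in place. */
void transpose8x8(int16_t m[8][8])
{
    int16_t t[8][8];
    merge_stage(m, t);
    merge_stage(t, m);
    merge_stage(m, t);
    memcpy(m, t, sizeof t);
}
```

Each stage uses eight merges, so the whole transpose takes 24 merge operations in this arrangement, all of them handled by the permute unit on AltiVec.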

3. Code Samples

C source for the IDCT function is "2IDCTCCOD" on the web page under Design Tools/Software.

Assembly output of the compiler is "2IDCTSCOD" on the web page under Design Tools/Software.

4. Performance

Performance results are given in clock cycles for a typical AltiVec implementation. Performance for a

specific AltiVec implementation may vary. Also, performance can vary depending on the C compiler

used and can improve as the quality of a compiler improves from release to release. It is assumed that

all required instructions and data are present in the L1 cache.

The call to the function IDCT takes 103 cycles to calculate a 2D IDCT on an 8x8 block of

coefficients. This time includes the function call overhead.

5. References

[1] ITU-T Recommendation H.263, "Video Coding for Low Bit-Rate Communications," Telecommunication Standardization Sector of ITU (03/96).

[2] W. C. Chen, C. H. Smith, and S. C. Fralick, "A Fast Computational Algorithm for the Discrete Cosine Transform," IEEE Transactions on Communications, Vol. COM-25, No. 9, pp. 1004-1009, Sept. 1977.

[3] S. Urieli and Z. Sivan, "2D Inverse Discrete Cosine Transform," program note from IBM Haifa Research Laboratory, 1997.

