2DINVERSE

Embed Size (px)

Citation preview

  • 7/25/2019 2DINVERSE

    1/5

  • 7/25/2019 2DINVERSE

    2/5

    Several fast algorithms are available to calculate the 8 point 1D IDCT. The algorithm implemented in

    AltiVec is a scaled version of Chen [2]developed by the IBM's Haifa Research Laboratory [3]shown

    in Figures 1 and 2. It was selected due to its minimization of the number of adds and multipliesrequired to arrive at the result.

    To optimize Chen's algorithm for scalability and the AltiVec instruction set, the following identity

    was used:

    0 1 2 3 4 5 6 7

    0 c4*c4/4 c4*c1/4 c4*c2/4 c4*c3/4 c4*c4/4 c4*c3/4 c4*c2/4 c4*c1/4

    1 c1*c4/4 c1*c1/4 c1*c2/4 c1*c3/4 c1*c4/4 c1*c3/4 c1*c2/4 c1*c1/4

    2 c2*c4/4 c2*c1/4 c2*c2/4 c2*c3/4 c2*c4/4 c2*c3/4 c2*c2/4 c2*c1/4

    3 c3*c4/4 c3*c1/4 c3*c2/4 c3*c3/4 c3*c4/4 c3*c3/4 c3*c2/4 c3*c1/4

    4 c4*c4/4 c4*c1/4 c4*c3/4 c4*c3/4 c4*c4/4 c4*c3/4 c4*c2/4 c4*c1/45 c3*c4/4 c3*c1/4 c3*c2/4 c3*c3/4 c3*c4/4 c3*c3/4 c3*c2/4 c3*c1/4

    6 c2*c4/4 c2*c1/4 c2*c2/4 c2*c3/4 c2*c4/4 c2*c3/4 c2*c2/4 c2*c1/4

    7 c1*c4/4 c1*c1/4 c1*c2/4 c1*c3/4 c1*c4/4 c1*c3/4 c1*c2/4 c1*c1/4

    2D Inverse Discrete Cosine Transform

    Page 2 o

  • 7/25/2019 2DINVERSE

    3/5

    2. AltiVec Implementation

    The implementation of the above algorithm requires the following steps:

    Transpose the matrix.

    Load the data and multiplication constants.

    Load the pre scaling matrix ( Figure 2 ) and prescale the input.

    Perform the IDCT on the eight columns according to the stages shown in Figure 1.Transpose the matrix.

    Perform the IDCT on the eight rows according to the stages shown in Figure 1.

    Because the previous step in H.263 revolves around zig-zag positioning, by reordering how the

    position is performed, the transposition is available at no additional cost.

    Because there are several large constants to be setup, there are multiple ways this can be

    accomplished. The following two methods were considered:

    1) Set up compile-time load of a vector of the different constants and then vec_splat them into the

    individual vectors.

    vector signed short SpecialConstants =

    (vector signed short) (23170, 13753, 6518, 21895, -23170, -21895, 0,0);

    c4 = vec_splat(SpecialConstants, 0); /* c4 = cos(4*pi/16) */

    a0 = vec_splat(SpecialConstants, 1); /* a0 = c6/c4 */

    a1 = vec_splat(SpecialConstants, 2); /* a1 = c7/c1 */

    ...

    2) Explicitly load each of the vector with the appropriate constant.

    c4 = (vector signed short)(23170); /* c4 = cos(4*pi/16) */

    a0 = (vector signed short)(13753); /* a0 = c6/c4 */

    a1 = (vector signed short)(6518); /* a1 = c7/c1 */

    ...

    Choosing the methods for your application depends on the other vector operations being employed

    and determining which best distributes work to the various vector functional units. The first method

    distributes the work to the permutation unit after the original load. The second method requires a load

    for every constant and places more of a demand on the load/store unit.

    Prescaling is done by using to obtain the high order 16 bits of the 32 bit intermediate

    multiplication result using the output results and the PreScale matrix. The PreScale matrix is obtained

    by multiplying 2 times the results of the equations contained in Figure 2.

    vec_mradds()

    16

    The one dimensional IDCT is implemented in stages as labeled in Figure 1. The code that implements

    these is grouped to correspond to the various stages. For clarity, input values are labeled i0-i7 , stage 1

    values are labeled r0-r7, stage 2 values are labeled s0-s7, stage 3 values are labeled t0-t7, and output

    values are labeled o0-o7.

    2D Inverse Discrete Cosine Transform

    Page 3 o

  • 7/25/2019 2DINVERSE

    4/5

    r0 = i0;r1 = i4;r2 = i2;r3 = i6;

    temp = vec_mradds(a1, i1,(vector signed short)(0));r4 = vec_subs(temp, i7);r5 = vec_mradds(ma2, i3, i5);r6 = vec_mradds(a2, i5, i3);r7 = vec_mradds(a1, i7, i1);

    s0 = vec_adds(r0, r1);s1 = vec_subs(r0, r1);temp = vec_mradds(a0, r2, (vector signed short)(0));s2 = vec_subs(temp, r3);s3 = vec_mradds(a0, r3, r2);

    s4 = vec_adds(r4, r5);s5 = vec_subs(r4, r5);s6 = vec_subs(r7, r6);s7 = vec_adds(r7, r6);

    t0 = vec_adds(s0, s3);t1 = vec_adds(s1, s2);t2 = vec_subs(s1, s2);t3 = vec_subs(s0, s3);

    t4 = s4;t5 = vec_subs(s6, s5);t6 = vec_adds(s5, s6);t7 = s7;

    o0 = vec_adds(t0, t7);o1 = vec_mradds(c4, t6, t1);o2 = vec_mradds(c4, t5, t2);o3 = vec_adds(t3, t4);

    o4 = vec_subs(t3, t4);o5 = vec_mradds(mc4, t5, t2);c6 = vec_mradds(mcr, t6, t1);o7 = vec_subs(t0, t7);

    /* stage 1 */

    /* stage 2 */

    /* stage 3 */

    /* Output */

    2D Inverse Discrete Cosine Transform

    Page 4 o

  • 7/25/2019 2DINVERSE

    5/5

    The above example can be optimized by removing direct assignments and reusing variables. All stages

    have been fully enumerated for clarity. It is shown in a more efficient form in the C source code. This

    code was made into a macro, however it could also have been made into an INLINE function, like the

    matrix transpose. Using macros minimizes the penalty for the call overhead, subject to compiler

    limitations.

    The matrix transpose uses and for the transposition.vec_mergeh() vec_mergel()

    C sourceis "2DDCTMTCSCOD" on web page under Design Tools/Software.

    3. Code Samples

    C source for the IDCT functionis "2IDCTCCOD " on web page under Design Tools/Software .

    Assembly output of the compileris "2IDCTSCOD" on web page under Design Tools/Software.

    4. Performance

    Performance results are given in clock cycles for a typical AltiVec implementation. Performance for a

    specific AltiVec implementation may vary. Also, performance can vary depending on the C compiler

    used and can improve as the quality of a compiler improves from release to release. It is assumed that

    all required instructions and data are present in the L1 cache.

    The call to the function IDCT takes 103 cycles to calculate a 2D IDCT on an 8x8 block of

    coefficients. This time includes the function call overhead.

    5. References

    ITU-T Recommendation H.263, "Video Coding for Low Bit-Rate Communications,"

    Telecommunication Standardization Sector of ITU (03/96).

    [1]

    W. C. Chen, C. H.Smith and S. C.Fralick, "A Fast Computational Algorithm for the Discrete

    Cosine Transform," IEEE Transactions on Communications, Vol. COM-25, No. 9, pp.11004-1009,

    Sept. 1997.

    [2]

    S. Urieli, and Z. Sivan, "2D Inverse Discrete Cosine Transform," program note from IBM HaifaResearch Laboratory, 1997.[3]

    2D Inverse Discrete Cosine Transform

    Page 5 o