
Mapping of Application Software to the Multimedia Instructions of General-Purpose Microprocessors

Ruby Lee and Larry McMahan

Hewlett-Packard [email protected], [email protected]

Abstract: This paper describes how media processing programs may be accelerated by using the multimedia instruction extensions that have been added to general-purpose microprocessors. As a concrete example, it describes MAX2, a minimalist, second-generation set of multimedia instructions included in the PA-RISC 2.0 processor architecture. MAX2 implements subword parallel instructions, which utilize the microprocessor’s 64-bit wide datapaths to process multiple pieces of lower-precision data in parallel. It also includes innovative, new instructions like Mix, which are very useful for matrix transpose and other common data rearrangements. The paper examines some typical multimedia kernels, like Block Match, Matrix Transpose, Box Filter and the IDCT, coded with and without the MAX2 instructions, to illustrate programming techniques for exploiting subword parallelism and superscalar instruction parallelism. The kernels using MAX2 show significant speedups in execution time, and more efficient utilization of the processor’s resources.

Keywords: multimedia extensions, subword parallelism, MAX2, PA-RISC, media processing, code optimizations, SIMD, packed arithmetic.

1. Introduction

Media processing, or the processing of digital multimedia data such as images, video, audio, and graphics, requires significant computation power. For example, merely reading (or viewing) a video object requires performing video decompression in real-time (e.g. 30 frames per second), and storing video requires performing video compression. As the number of media streams, the frame size and the desired fidelity of the multimedia objects increase, the compute and access bandwidth requirements also increase. General-purpose digital processors have a set of instructions that can be programmed to perform any algorithm on binary bits of data. However, many media processing algorithms can be significantly accelerated with the addition of a few new instructions. These new instructions exploit the fact that there is a great deal of parallelism in media processing algorithms, and that the data being worked on (e.g., 8-bit pixels) have lower precision than the word size of modern microprocessors, which is currently either 32 or 64 bits. Subword parallelism [1,2] is a technique proposed for performing parallel operations on lower-precision data packed into word-oriented datapaths. For example, the 64 bits in a processor’s register can be assumed to represent four 16-bit quantities. By very minor changes in the 64-bit adder, it can be used to perform either one 64-bit add or four parallel 16-bit adds (Figure 1). These subword parallel instructions, like Parallel Add, have been called “multimedia instructions” because they were first introduced into the instruction sets of general-purpose processors for performing multimedia functions with software rather than hardware solutions [3,4]. In fact, they can be used for any programs that repeat a set of operations on different sets of lower precision data. They are often called SIMD instructions (Single Instruction Multiple Data [18]), since they perform the same operation on multiple subwords.
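As an illustration of the partitioned adder described above, the following is a minimal software sketch in Python, not the hardware design. The helper names `pack4`, `unpack4` and `parallel_add` are hypothetical and chosen only for this example. It models a 64-bit word as four 16-bit subwords and blocks the carry at each subword boundary, so one operation yields four independent modulo sums.

```python
MASK16 = 0xFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF
TOP_BITS = 0x8000800080008000  # most significant bit of every 16-bit subword


def pack4(lanes):
    """Pack four 16-bit values into one 64-bit word, leftmost subword first."""
    word = 0
    for value in lanes:
        word = (word << 16) | (value & MASK16)
    return word


def unpack4(word):
    """Split a 64-bit word into four 16-bit subwords, leftmost first."""
    return [(word >> shift) & MASK16 for shift in (48, 32, 16, 0)]


def parallel_add(x, y):
    """Four independent 16-bit modulo adds within one 64-bit word.

    The low 15 bits of each subword are added first, which cannot carry
    into the neighbouring subword; the top bit of each subword is then
    fixed up with an exclusive-or.
    """
    low = (x & ~TOP_BITS) + (y & ~TOP_BITS)
    return (low ^ ((x ^ y) & TOP_BITS)) & MASK64


x = pack4([1, 0xFFFF, 0x7FFF, 100])
y = pack4([2, 1, 1, 200])
print(unpack4(parallel_add(x, y)))   # four separate 16-bit sums
print(unpack4((x + y) & MASK64))     # an ordinary 64-bit add lets carries leak
```

The second print line shows why the carry chain has to be cut: with an ordinary 64-bit add, the overflow of the second subword spills into the first.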

The purpose of this paper is to illustrate how media processing programs may use multimedia extensions in general-purpose processors. We do this by showing a few key techniques in some typical multimedia kernels, commonly found in image, video, audio and graphics programs. In section 2, we describe the multimedia instructions that have been added to microprocessor architectures, in particular, the MAX2 multimedia extensions. MAX2 is the second generation of multimedia instructions for the 64-bit PA-RISC 2.0 instruction set architecture. In section 3, we describe four important media processing kernels, optimized at the instruction-set level (assembly language) both with and without the MAX2 instructions. In section 4, we summarize the programming techniques used in these examples. In section 5, we compare the performance of equally optimized code, with and without the MAX2 instruction extensions. Section 6 concludes the paper.

Proceedings of Multimedia Hardware Architectures 1997, IS&T/SPIE Symposium on Electronic Imaging: Science and Technology, February 10-14, 1997, San Jose, California, pp.122-133.

Figure 1a: Superscalar Processor with 2 ALUs (two standard ALUs fed from the general registers, 2 ops/cycle)

Figure 1b: Subword Parallelism in Superscalar Processor (two partitionable 64-bit ALUs, 8 ops/cycle)

2. Multimedia Instructions

Multimedia extensions for general-purpose processors were first introduced in a product in January 1994 with the MAX1 (Multimedia Acceleration eXtensions) instructions for 32-bit PA-RISC processors [2-4]. Later, Sun introduced VIS (Visual Instruction Set) for UltraSparc processors [6], HP introduced MAX2 for 64-bit PA-RISC processors [5,1], and Intel introduced MMX (Multi Media eXtensions) for x86 processors [7]. Recently, MIPS has announced the MDMX multimedia extensions for some future MIPS processors [8], and DEC has announced a small number of instructions to support MPEG for Alpha processors [8]. To provide real code examples, we have to choose one of these sets of multimedia extensions. We have chosen MAX2, since it represents perhaps the simplest set of general-purpose multimedia acceleration primitives, with key characteristics shared in common by the other multimedia extensions, as well as some uniquely versatile yet simple features.

2.1 MAX2 instructions in PA-RISC 2.0

Parallel Subword Instruction             Description
Parallel add, hadd                       Add 4 pairs of 16-bit operands, with modulo arithmetic
  hadd,ss                                Add 4 pairs of 16-bit operands, with signed saturation
  hadd,us                                Add 4 pairs of 16-bit operands, with unsigned saturation
Parallel subtract, hsub                  Subtract 4 pairs of 16-bit operands, with modulo arithmetic
  hsub,ss                                Subtract 4 pairs of 16-bit operands, with signed saturation
  hsub,us                                Subtract 4 pairs of 16-bit operands, with unsigned saturation
Parallel shift left & add, hshladd       Multiply 4 first operands by 2, 4 or 8 and add corresponding second operands
Parallel shift right & add, hshradd      Multiply 4 first operands by 1/2, 1/4 or 1/8 and add corresponding second operands
Parallel average, havg                   Arithmetic mean of 4 pairs of operands
Parallel shift right signed, hshr        Shift right by 0 to 15 bits, with sign extension on the left
Parallel shift right unsigned, hshr,u    Shift right by 0 to 15 bits, with zero extension on the left
Parallel shift left, hshl                Shift left by 0 to 15 bits, with zeros shifted in on the right
Mix, mixh,L mixh,R mixw,L mixw,R         Interleave alternate 16-bit [h] or 32-bit [w] subwords from two source registers, starting from Leftmost [L] subword, or ending with Rightmost [R] subword
Permute, permh                           Rearrange subwords from one source register, with or without repetition

Table 1: MAX2 Instructions in PA-RISC 2.0

MAX is a minimalistic set of parallel subword instructions for PA-RISC processors. Although MAX2 is the second generation of multimedia instructions for PA-RISC processors, it is still a much smaller set than that now proposed for other processors [5-10]. It uses the existing microprocessor registers and functional units, like the Arithmetic Logical Unit (ALU) and the Shift-Merge Unit (SMU). MAX2 features are added only if they have potential general-purpose usage, in addition to providing significant speedup for media processing. Table 1 shows the instructions in MAX2. The instructions in MAX1 are a proper subset, including only the parallel subword arithmetic instructions (through havg).

2.2 Parallel Subword Compute Instructions

These instructions perform the basic arithmetic functions of add and subtract, and a few common varieties of multiply and divide. The Parallel Add, and Parallel Subtract instructions each have three variants, which differ only in the way they treat overflow. The default action is modulo arithmetic, where any overflow is discarded. If signed saturation is specified in the instruction, an overflow causes the result to be clipped to the largest or smallest signed integer representable in the result range, depending on the direction of the overflow. Similarly, if unsigned saturation is specified, an overflow causes the result to be clipped to the largest or smallest unsigned integer in the result range [2].
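The three overflow treatments can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the architectural definition: subwords are held as a list of four 16-bit bit patterns, the function names `hadd` and `hsub` merely echo the mnemonics in Table 1, and for the `us` variant both operands are assumed to be unsigned, which may differ from the exact operand interpretation in PA-RISC 2.0.

```python
def _signed(value):
    """Interpret a 16-bit bit pattern as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def _lane(a, b, sign, mode):
    """Compute a + sign*b for one subword with the chosen overflow rule."""
    if mode == "modulo":                      # overflow is simply discarded
        return (a + sign * b) & 0xFFFF
    if mode == "us":                          # clip to 0 .. 0xFFFF (assumes unsigned operands)
        return min(max(a + sign * b, 0), 0xFFFF)
    if mode == "ss":                          # clip to -0x8000 .. 0x7FFF
        result = _signed(a) + sign * _signed(b)
        return min(max(result, -0x8000), 0x7FFF) & 0xFFFF
    raise ValueError("unknown mode: " + mode)


def hadd(x, y, mode="modulo"):
    """Model of Parallel Add on four 16-bit subwords."""
    return [_lane(a, b, +1, mode) for a, b in zip(x, y)]


def hsub(x, y, mode="modulo"):
    """Model of Parallel Subtract on four 16-bit subwords."""
    return [_lane(a, b, -1, mode) for a, b in zip(x, y)]


x, y = [0xFFFF, 0x7FFF, 1, 2], [1, 1, 1, 2]
for mode in ("modulo", "us", "ss"):
    print(mode, [hex(v) for v in hadd(x, y, mode)])
```

Only the first two subwords differ between the three modes, since only they overflow; the other two show that saturation leaves in-range results untouched.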

One difference between the multimedia extensions in different microprocessor architectures is the support provided for multiplication. A fast multiply circuit typically occupies two to three times the space of an adder, and takes several execution cycles. Furthermore, the product is twice as long as the operands, assuming these have the same number of bits. In addition, the audio and 3-D graphics transformations that require the most multiplications usually also need multiply-accumulate, and more than 16 bits of precision for intermediate results. Hence, in MAX2, we decided on two approaches for multiplication, depending on the media stream being processed. For audio and 3-D graphics transformations, the full power and versatility of the floating-point multiply-accumulate functional units is used. This gives two single-precision (32-bit) or two double precision (64-bit) multiply-accumulate instructions per cycle in PA-RISC processors, or the equivalent of four operations per cycle. For video, images, and graphics rendering, where the data are 8-bit pixels (or 12-bit pixels for medical images), multiplications by constants are done by a series of shift and add instructions, while multiplications by variables use the standard 64-bit integer multiply instruction. Our data indicates that many of these multiplications required are indeed by constants. MAX2 provides two multiply primitives: Parallel Shift Left and Add, and Parallel Shift Right and Add instructions. These instructions can shift the operand left or right by 1, 2 or 3 bits, before adding the second operand. They are very effective in implementing multiplication by integer or fractional constants, respectively. They require just a minor modification to the existing preshifter to the integer ALU, rather than new subword-parallel integer multiplier circuits.

Division circuitry is even more expensive, and integer division circuitry is not usually provided by microprocessors. In MAX2, audio and 3-D graphics transforms use the floating-point registers and functional units, and so have access to full floating-point division (and reciprocal) circuitry. For the pixel-oriented media types, parallel integer division is simulated by a series of right shifts, which are divisions by a power of two. The Parallel Shift Right (Signed or Unsigned) instructions may be used for division of signed and unsigned subwords, respectively. They use the existing 64-bit shifter, but block any bits shifted out from one subword from being shifted into the adjoining subword. The Parallel Shift Right and Add instruction may also be used for division by 2, 4 or 8 (that is, multiplication by 1/2, 1/4 or 1/8). Division by any constant can be simulated with a combination of these instructions.

The Parallel Average instruction adds the two operands, then performs a divide by two. This is an add followed by a right shift of one bit. In the process, the overflow bit is shifted in as the most significant bit of the result, so the instruction has the added advantage that no overflow can occur. In addition, rounding is done on the least significant bit, to conserve precision in cascaded average operations. This instruction is very useful for interpolation, sub-pixel resolution, as well as division by two with rounding.

2.3 Data Alignment and Data Rearrangement Instructions

Data alignment is often needed to maintain the desirable significant bits in the intermediate results. This is achieved with the Parallel Shift Right or Left instructions.

Data rearrangement of the packed subwords in a register is often needed in order that subsequent parallel subword operations can proceed at full parallelism. The design challenge is to find a small set of data rearrangement primitives that are most powerful for frequent inner-loop cases. MAX2 defines only two data rearrangement primitives, Mix and Permute, based on their versatility of use, and ease of implementation. Mix rearranges subwords from two source registers, while Permute provides a comprehensive set of rearrangements of subwords in a single source register.

The Mix instruction takes subwords from two registers, and interleaves alternate subwords from each register in the result register as shown in Figure 2. The subword sizes are indicated by the suffix “h” for halfword (16 bits), and “w” for word (32 bits). The second suffix, “L” or “R”, indicates Mix Left or Mix Right: Mix Left collects the odd subwords in the result register, whereas Mix Right collects the even numbered subwords. (Because even and odd numberings change depending on numbering from 0 or 1, or from left or right, the names Mix Left and Mix Right are used rather than Mix Odd and Mix Even. Mix Left starts from the leftmost subword in each of the two source registers, while Mix Right ends with the rightmost subwords from each source register.) In Figure 2, the definitions of four Mix instruction variants are given, where the contents of a register are given as four 16-bit elements. (Note that in PA-RISC instructions, the first two operands, Ra and Rb, are source registers, and Rc is the result register.) In section 3, the use of Mix is illustrated by a matrix transpose example, and in the IDCT. Mix also implements unpacking operands, using R0 as one of the source registers, and subsequent packing of operands.

;Ra = a1 a2 a3 a4, Rb = b1 b2 b3 b4 are the contents of the source registers
mixh,L Ra,Rb,Rc    ;Rc = a1 b1 a3 b3
mixh,R Ra,Rb,Rc    ;Rc = a2 b2 a4 b4
mixw,L Ra,Rb,Rc    ;Rc = a1 a2 b1 b2
mixw,R Ra,Rb,Rc    ;Rc = a3 a4 b3 b4

Figure 2: Definition of Mix Instruction Variants
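The four Mix variants defined in Figure 2 can be modelled directly on lists of four 16-bit subwords. The Python function names below are hypothetical stand-ins for the mnemonics; the sketch only restates the figure. The last line illustrates the unpacking use mentioned in the text, assuming R0 reads as zero.

```python
def mixh_l(a, b):
    """mixh,L: alternate 16-bit subwords, starting from the leftmost."""
    return [a[0], b[0], a[2], b[2]]


def mixh_r(a, b):
    """mixh,R: alternate 16-bit subwords, ending with the rightmost."""
    return [a[1], b[1], a[3], b[3]]


def mixw_l(a, b):
    """mixw,L: left 32-bit word (two subwords) of each source."""
    return [a[0], a[1], b[0], b[1]]


def mixw_r(a, b):
    """mixw,R: right 32-bit word (two subwords) of each source."""
    return [a[2], a[3], b[2], b[3]]


ra = ["a1", "a2", "a3", "a4"]
rb = ["b1", "b2", "b3", "b4"]
print(mixh_l(ra, rb), mixh_r(ra, rb))
print(mixw_l(ra, rb), mixw_r(ra, rb))

# Unpacking sketch: mixing with an all-zero register (R0) widens the odd
# 16-bit subwords of a register to 32-bit fields with zero upper halves.
r0 = [0, 0, 0, 0]
print(mixh_l(r0, [7, 8, 9, 10]))
```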

The Permute instruction takes one source register, and produces a permutation of the subwords in that register. With 16-bit subwords, this instruction allows all possible permutations, with and without repetitions, of the four subwords in the source register. Figure 3 shows some possible permutations. A Permute index in the instruction, comprising four 2-bit indices, identifies which subword in the source register is to be placed in each subword of the destination register. Subwords in the source register, Ra, are numbered from left to right starting from zero. Permute allows, for example, the replication of a subword scalar value to all the subwords in a register, in a single cycle.

;Ra = a b c d are the contents of the source register
permh,0000 Ra,Rc    ;Rc = a a a a   replicate scalar across vector
permh,3210 Ra,Rc    ;Rc = d c b a   reverse order of subwords
permh,1003 Ra,Rc    ;Rc = b a a d   arbitrary permutation with repetition
permh,0312 Ra,Ra    ;Ra = a d b c   arbitrary permutation without repetition

Figure 3: Permute Instruction Examples
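The index semantics of Permute can be sketched in one line of Python. The function name `permh` and the use of a four-character string for the index are conveniences of this example only; in the instruction the index is four 2-bit fields.

```python
def permh(index, source):
    """Model of Permute: index is four digits, each 0..3, leftmost first.

    Digit k of the index names the source subword (numbered from the left,
    starting at zero) that is copied into subword k of the result.
    """
    return [source[int(digit)] for digit in index]


ra = ["a", "b", "c", "d"]
for index in ("0000", "3210", "1003", "0312"):
    print(index, permh(index, ra))
```

With four subwords and a 2-bit selector per position, there are 4^4 = 256 possible index values, which is why every permutation with or without repetition is reachable.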

2.4 Other Useful PA-RISC features

In addition to the MAX2 instructions described above, other existing features in the PA-RISC architecture are also very useful for media processing [12-14, 5]. Table 2 lists some of the more useful ones. The Shift Right Pair instruction allows two source registers to be concatenated and shifted together, with the resulting rightmost 64 bits placed in the destination register. This instruction facilitates use of arbitrarily aligned 64-bit quantities. The Extract instruction allows one to extract a sequence of contiguous bits from a source register, and place it right-aligned in the destination register. The Deposit instruction does the reverse: place a right-aligned field of bits from the source register anywhere in the destination register. All the existing logical functions are also available, and needed, for media processing as well [1,5,17].
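The behaviour of Shift Right Pair, Extract and Deposit described above can be approximated on Python integers as follows. This is a sketch under assumptions: the function names are hypothetical, only unsigned extraction is modelled, and the bit position argument follows the convention implied by the `extrd r8,56,24,r2` example later in the paper (bit 0 is the leftmost bit, and the position names the rightmost bit of the field).

```python
MASK64 = (1 << 64) - 1


def shrpd(high, low, shift):
    """Concatenate high:low into 128 bits, shift right, keep the rightmost 64 bits."""
    return (((high << 64) | low) >> shift) & MASK64


def extract(source, pos, length):
    """Return the `length`-bit field ending at bit `pos`, right-aligned (unsigned)."""
    return (source >> (63 - pos)) & ((1 << length) - 1)


def deposit(field, target, pos, length):
    """Place the low `length` bits of `field` into `target`, ending at bit `pos`."""
    mask = ((1 << length) - 1) << (63 - pos)
    return (target & ~mask & MASK64) | ((field << (63 - pos)) & mask)


high = 0x0001000200030004   # subwords 1 2 3 4
low = 0x0005000600070008    # subwords 5 6 7 8
print(hex(shrpd(high, low, 48)))   # window starting one subword to the right
print(hex(shrpd(high, low, 32)))   # window starting two subwords to the right
print(extract(18100, 56, 24))      # same as 18100 >> 7
print(hex(deposit(0xAB, 0, 55, 8)))
```

The two `shrpd` calls show how a register pair yields a 64-bit window at any 16-bit offset, which is the alignment pattern used in the Box Filter example of section 3.2.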

The Floating-point Multiply Accumulate instructions (FMAC) provide high-performance multiply-accumulate, with full IEEE floating-point precision compliance, for audio and 3-D graphics media datatypes. The multiple floating-point condition bits allow simultaneous testing of conditions, while eliminating costly conditional branches. For example, very fast graphics accept and reject tests for determining whether an object falls within a bounding box, or not, are supported by PA-RISC processors, using these condition bits. The low-overhead cache prefetch instructions can take advantage of the highly predictable, streaming nature of the memory accesses of many media processing programs by prefetching data into the cache before it is actually used, thus hiding memory latencies from cache misses. Load and store instructions may use a cache hint to indicate that the data has spatial locality (but no temporal locality), and may be fetched into a look-aside buffer, to prevent replacing useful cache lines for data that is used only once.

Another PA-RISC feature, especially useful for code without MAX2 instructions, is the arithmetic nullify feature. Every arithmetic, logical and field manipulation instruction generates a condition which can be used to nullify the execution of the next instruction. This enables in-line conditional execution, while avoiding the pipeline penalties associated with branch instructions.

Feature                                                  Description
Shift Right Pair of Registers: shrpd                     Concatenate and shift two 64-bit regs. into target register
Extract a field: extrd, extrw                            Select a bit-field from source reg. and place right-aligned in target register
Deposit a field into a reg.: depd, depdi, depw, depwi    Select a right-aligned field from source register or immediate, and place anywhere in the target register
Logical operations: and, andcm, or, xor                  Logical operations: and, and complement, or, exclusive or
fmac                                                     Floating-Point Multiply Accumulate instruction
Multiple FP condition bits                               Enable concurrent floating-point comparisons and tests
ldd r0; Prefetch Cache line for Read,                    Fetch data into cache before it is used, to reduce cache miss penalty (no action on TLB miss).
ldw r0; Prefetch Cache line for Write
Cache hint: Spatial Locality                             Hint to prevent cache pollution when data has no reuse
Arithmetic Nullification                                 Conditional execution of next instruction based on condition generated by current arithmetic, logical, or field instruction

Table 2: Other Supporting PA-RISC Features

3. Code Examples

Table 3 shows the four multimedia kernels chosen to illustrate the programming techniques used to map multimedia algorithms to the MAX2 instructions, and to exploit modern superscalar microprocessor operation. These kernels are often performance-critical loops in multimedia applications. For each of the kernels described, an algorithm is coded both with and without the MAX2 extensions. General techniques for loop optimization are applied to both code versions, while special techniques, such as saturating arithmetic and data rearrangement, are used to optimize the code with the MAX2 multimedia instructions. The resulting code is scheduled using the superscalar scheduling rules of the PA-8000 (a 64-bit PA-RISC 2.0 processor [11]). The figures in this section show fragments of code to illustrate the coding techniques, and do not show the entire programs which are rather lengthy.

Multimedia Kernel        Description
16 x 16 Block Match      Sum of absolute magnitude of the differences of corresponding pixels in two 16 x 16 blocks
3 x 3 Box Filter         Compute the ‘smoothed’ value of each pixel in an image using a 3x3 filter
Matrix Transpose         Transpose an 8 x 8 matrix of 16-bit values contained in the processor’s general registers.
8 x 8 2-D IDCT           Perform a two-dimensional Inverse Discrete Cosine Transform [15] on an 8x8 block

Table 3: Multimedia Kernels Chosen

3.1 Block Match

In this example, the inner loop accumulates the absolute magnitude of the difference of two corresponding values from two 16x16 blocks of data. This is often used in motion estimation for MPEG-1 and MPEG-2. The example illustrates the use of saturation arithmetic for in-line conditional execution, eliminating the need for conditional branches. The absolute value of Xij-Yij is obtained as follows: First Xij-Yij is calculated using unsigned saturation, then the operands are reversed and Yij-Xij is calculated. Without saturation arithmetic, one of these terms will be positive and the other, negative. With unsigned saturation, the smallest unsigned number representable is zero, so the negative term saturates to zero. The results for each pair of subtractions are accumulated in parallel in two registers (r4 and r5 in Figure 4a). Two separate registers are used to accumulate the results, to allow full superscalar bandwidth without dependency stalls on the accumulation. The 6 instructions in bold print in the code fragment shown in Figure 4a form the core function performed on every four pairs of elements in the two 16x16 blocks.

In the code without MAX2 instructions (Figure 4b), the arithmetic nullify feature is used instead. Xij-Yij is calculated, and if the result is less than zero, the result is subtracted from zero. This places the positive value of the difference in the result register.

In the complete code, either with or without MAX2 instructions, the inner loop is unrolled completely to accumulate 16 pairs of absolute differences, to eliminate loop counter overhead and the loop counter register. Figure 4a shows how 2 unrolled iterations may be interleaved to optimize superscalar instruction scheduling. A limited amount of software pipelining is used where the initial loads of the loop are moved to the previous iteration to avoid load latency stalls.

        ldd,ma   8(r6),r8       ;load first Xij value outside loop
        ldd,ma   8(r7),r9       ;load first Yij value outside loop
        copy     r0,r4          ;zero accumulator 1
        copy     r0,r5          ;zero accumulator 2

loop                            ;loop is unrolled to minimize memory latency
        ldd,ma   8(r6),r10      ;load second Xij value
        ldd,ma   8(r7),r11      ;load second Yij value
        hsub,us  r8,r9,r13      ;subtract first Xij - Yij
        hsub,us  r9,r8,r14      ;subtract first Yij - Xij
        hsub,us  r10,r11,r15    ;subtract second Xij - Yij
        hsub,us  r11,r10,r16    ;subtract second Yij - Xij
        hadd     r4,r13,r4      ;accumulate first Xij - Yij
        hadd     r5,r14,r5      ;accumulate first Yij - Xij
        hadd     r4,r15,r4      ;accumulate second Xij - Yij
        hadd     r5,r16,r5      ;accumulate second Yij - Xij

Figure 4a: Block Match Code with MAX2 Instructions

        ldh,ma   2(r6),r8       ;load first Xij value outside loop
        ldh,ma   2(r7),r9       ;load first Yij value outside loop
        copy     r0,r4          ;zero accumulator
        copy     r0,r15         ;zero difference for ‘previous iteration’
        ldh,ma   2(r6),r10      ;load second Xij value outside loop
        ldh,ma   2(r7),r11      ;load second Yij value outside loop

loop                            ;loop is unrolled to minimize memory latency
        add      r15,r4,r4      ;last add absolute difference from previous iteration of loop
        sub,>=   r8,r9,r13      ;subtract first Xij - Yij values, skip next instruction if >= 0
        subi     0,r13,r13      ;subtract negative difference from zero
        ldh,ma   2(r6),r8       ;load third Xij value now to avoid stall
        ldh,ma   2(r7),r9       ;load third Yij value
        add      r4,r13,r4      ;accumulate first |Xij - Yij| value
        sub,>=   r10,r11,r15    ;subtract second Xij - Yij, skip next instruction if >= 0
        subi     0,r15,r15      ;subtract negative difference from zero
        add      r4,r15,r4      ;accumulate second |Xij - Yij| value

Figure 4b: Block Match Code without MAX2 Instructions
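The saturation trick of Figure 4a can be checked with a small Python model. It is a sketch, not the authors' code: the helper names are hypothetical, pixels are assumed to be 8-bit values already unpacked into 16-bit subwords, and the final reduction of the subword accumulators, which the fragment in Figure 4a does not show, is done here with Python's `sum`.

```python
def hsub_us(x, y):
    """Parallel Subtract with unsigned saturation: negative results clip to zero."""
    return [max(a - b, 0) for a, b in zip(x, y)]


def hadd(x, y):
    """Parallel Add with modulo arithmetic on 16-bit subwords."""
    return [(a + b) & 0xFFFF for a, b in zip(x, y)]


def block_match(xs, ys):
    """Sum of absolute differences of two equally long pixel lists.

    Four pixel pairs are processed per step. One of X-Y and Y-X saturates
    to zero, so adding both gives |X-Y| without any conditional branch.
    """
    acc1 = [0, 0, 0, 0]   # plays the role of r4
    acc2 = [0, 0, 0, 0]   # plays the role of r5
    for i in range(0, len(xs), 4):
        x, y = xs[i:i + 4], ys[i:i + 4]
        acc1 = hadd(acc1, hsub_us(x, y))
        acc2 = hadd(acc2, hsub_us(y, x))
    return sum(acc1) + sum(acc2)


# Two synthetic 16x16 blocks of 8-bit pixels (made-up test data).
block_x = [(37 * i) % 256 for i in range(256)]
block_y = [(91 * i + 5) % 256 for i in range(256)]
print(block_match(block_x, block_y))
print(sum(abs(a - b) for a, b in zip(block_x, block_y)))  # scalar reference
```

With 8-bit pixels and 64 steps per 16x16 block, each 16-bit subword accumulator stays below 64 x 255, so the modulo adds in this sketch cannot wrap.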

3.2 Box Filter

In this algorithm the smoothed values of the pixels are computed for the middle 14x14 section of a 16x16 block of pixels. This example illustrates programming techniques for reducing the number of load and copy instructions, for parallel result accumulation, and for performing constant multiplications with shift and add instructions. The constant multipliers used in the 3 x 3 Box Filter are shown in Figure 5.

3x3 Box Filter
¼ ½ ¼
½ 1 ½
¼ ½ ¼

Figure 5: 3 x 3 Box Filter Values Used

Figure 6: Parallel Accumulation with 2 Loads/Iteration (diagram labels: pixel matrix rows x0 to x5, y0 to y5 and z0 to z5; row i; registers r1 to r4)

Each pixel in the image requires its eight nearest neighbors and itself, in order to perform the smoothed function of the 3x3 box filter. This involves 8 multiplications and 9 adds. By moving down the columns of the image, 6 of these 9 pixel values are reused, for the next smoothed pixel. Hence, only three new elements need to be loaded for each smoothed pixel. For the MAX2 version of the code, four pixels from the same row are now packed in one register, which allows the number of load instructions to be further reduced from three to two, for each set of 4 smoothed pixels. Figure 6 shows the register layout of input data to simultaneously accumulate 4 pixel results, p1, p2, p3 and p4, with two loads. In addition, register copy instructions may be reduced by unrolling the inner loop three times and reusing the same registers for subsequent iterations in a round-robin fashion. This technique is known to compiler writers as recurrent scalar replacement (section 4).



In the code with MAX2, a single Parallel Shift Right and Add instruction is used to multiply four pixels by a fractional constant of the box filter, as well as accumulate these results. In the code without MAX2, the multiplication of each pixel is done with an Extract instruction (equivalent to a right shift operation), and the accumulation must be done with a separate add instruction. This shows the power of a single Parallel Shift and Add instruction: performing four parallel multiplications and four parallel accumulations in a single cycle. Figure 7 shows a code fragment using MAX2 instructions. Eight Parallel Shift and Add, one Parallel Add, six Shift Pair and two Load instructions produce four smoothed pixel results simultaneously.

colloop                                 ;this is the outer loop across columns
        ldd,ma   8(r2),r3               ;load left four pixels in row i-1
        ldd,ma   Rowoffset-8(r2),r4     ;load right four pixels in row i-1 (will only use two left hand pixels)
        ldd,ma   8(r2),r5               ;load left four pixels in row i
        ldd,ma   Rowoffset-8(r2),r6     ;load right four pixels in row i (will only use two left hand pixels)
        addi     8,r12,r12              ;increment column counter by number of bytes
        ldi      Numrows-2,r13          ;put number of rows - 2 in loop counter

rowloop                                 ;this is the inner loop down rows
        ldd,ma   8(r2),r7               ;r7 contains row i+1 left pixels
        ldd,ma   Rowoffset-8(r2),r8     ;r8 contains row i+1 right pixels
                                        ;r2 now points to row i+2 left pixels
        hshradd  r3,2,r0,r9             ;r9 is accumulator for filter - start by adding ¼ left pixels from row i-1
        shrpd    r3,r4,48,r10           ;r10 contains middle shifted pixels from row i-1
        hshradd  r10,1,r9,r9            ;add in ½ middle values from row i-1
        shrpd    r3,r4,32,r10           ;shift right pixels from row i-1
        hshradd  r10,2,r9,r9            ;add in ¼ right values from row i-1
        hshradd  r5,1,r9,r9             ;add in ½ left values from row i
        shrpd    r5,r6,48,r10           ;shift middle pixels from row i
        hadd     r10,r9,r9              ;add in middle pixels from row i
        shrpd    r5,r6,32,r10           ;shift right pixels from row i
        hshradd  r10,1,r9,r9            ;add in ½ right pixels from row i
        hshradd  r7,2,r9,r9             ;add in ¼ left pixels from row i+1
        shrpd    r7,r8,48,r10           ;shift middle pixels from row i+1
        hshradd  r10,1,r9,r9            ;add in ½ middle pixels from row i+1
        shrpd    r7,r8,32,r10           ;shift right pixels from row i+1
        hshradd  r10,2,r9,r9            ;add in ¼ right pixels from row i+1
        addib,<= -1,r13,endloop         ;branch to endloop if last row in column
        std,ma   r9,Rowoffset(r11)      ;store calculated filter value in branch delay slot

Figure 7: Box Filter with MAX2 Instructions
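The accumulation pattern of Figure 7 can be mirrored in Python to see how nine taps collapse into nine subword operations for four output pixels. This is a sketch under assumptions: `hshradd` and `box_filter4` are hypothetical helper names, pixel values are treated as small non-negative integers so that saturation never triggers, list slicing stands in for the Shift Right Pair alignment, and a shift count of zero stands in for the plain Parallel Add used on the centre tap.

```python
# Right-shift count per filter tap: 2 means 1/4, 1 means 1/2, 0 means 1.
TAP_SHIFTS = ((2, 1, 2),
              (1, 0, 1),
              (2, 1, 2))


def hshradd(a, shift, b):
    """Model of Parallel Shift Right and Add: (a >> shift) + b per subword."""
    return [((x >> shift) + y) & 0xFFFF for x, y in zip(a, b)]


def box_filter4(row_pairs):
    """Filter four adjacent pixels at once.

    row_pairs holds three (left4, right4) register pairs for rows i-1, i
    and i+1. Only the two leftmost pixels of each right register are used.
    """
    acc = [0, 0, 0, 0]
    for (left, right), shifts in zip(row_pairs, TAP_SHIFTS):
        window = left + right                     # eight consecutive pixels
        for offset, shift in enumerate(shifts):   # left, middle, right tap
            acc = hshradd(window[offset:offset + 4], shift, acc)
    return acc


rows = [([4, 8, 12, 16], [20, 24, 0, 0]),
        ([40, 44, 48, 52], [56, 60, 0, 0]),
        ([8, 16, 24, 32], [40, 48, 0, 0])]
print(box_filter4(rows))
```

Because each tap is shifted before it is added, low-order bits are truncated tap by tap; the sample pixels are multiples of four so that this sketch matches the exact weighted sum.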

3.3 Matrix Transpose

An 8x8 matrix of 16-bit values can be held in sixteen 64-bit registers. Since each 64-bit register holds 4 subword values, the 8 x 8 matrix transpose consists of four 4 x 4 transposes, one for each of the four quadrants. The code for a 4x4 matrix transpose using the Mix instructions in MAX2 is shown in Figure 8. This requires only 4 cycles, since two Mix instructions can be executed in a single cycle, in a PA-RISC processor like the PA-8000 [11,5]. Only two temporary registers, r17 and r18, are needed. This is true even for the 8x8 matrix transpose. For the code without MAX2, each 4 x 4 section must be calculated using a sequence of Extract and Deposit instructions, which have more register dependencies than the Mix instructions. Therefore two separate 4x4 matrix transposes must be interleaved to allow full superscalar bandwidth in the version without MAX2.

;4x4 matrix laid out as r1= a11,a12,a13,a14; r3=a21,a22,a23,a24; r5=a31,a32,a33,a34; r7=a41,a42,a43,a44

mixh,l r5,r7,r17     ; r17= a31,a41,a33,a43   get second set of odd 16-bit subwords
mixh,r r5,r7,r18     ; r18= a32,a42,a34,a44   get second set of even 16-bit subwords
mixh,l r1,r3,r5      ; r5 = a11,a21,a13,a23   get first set of odd 16-bit subwords, re-defining r5
mixh,r r1,r3,r7      ; r7 = a12,a22,a14,a24   get first set of even 16-bit subwords, re-defining r7
mixw,l r5,r17,r1     ; r1 = a11,a21,a31,a41   get first transposed row
mixw,l r7,r18,r3     ; r3 = a12,a22,a32,a42   get second transposed row
mixw,r r5,r17,r5     ; r5 = a13,a23,a33,a43   get third transposed row
mixw,r r7,r18,r7     ; r7 = a14,a24,a34,a44   get fourth transposed row

Figure 8: Matrix Transpose with MAX2 Instructions
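The eight-instruction sequence of Figure 8 can be replayed in Python to confirm that it yields the transpose, and to show one way the 8x8 case decomposes into quadrants. The function names are hypothetical, registers are modelled as lists of four subwords, and the quadrant bookkeeping in `transpose8x8` is an assumption of this sketch rather than the register allocation used in the paper.

```python
def mixh_l(a, b): return [a[0], b[0], a[2], b[2]]
def mixh_r(a, b): return [a[1], b[1], a[3], b[3]]
def mixw_l(a, b): return [a[0], a[1], b[0], b[1]]
def mixw_r(a, b): return [a[2], a[3], b[2], b[3]]


def transpose4x4(r1, r3, r5, r7):
    """Replay of the Figure 8 sequence; r17 and r18 are the two temporaries."""
    r17 = mixh_l(r5, r7)
    r18 = mixh_r(r5, r7)
    r5 = mixh_l(r1, r3)
    r7 = mixh_r(r1, r3)
    r1 = mixw_l(r5, r17)
    r3 = mixw_l(r7, r18)
    r5 = mixw_r(r5, r17)
    r7 = mixw_r(r7, r18)
    return [r1, r3, r5, r7]


def transpose8x8(rows):
    """Transpose eight rows of eight values using four 4x4 transposes."""
    def quadrant(first_row, first_col):
        regs = [rows[first_row + i][first_col:first_col + 4] for i in range(4)]
        return transpose4x4(*regs)

    top_left, top_right = quadrant(0, 0), quadrant(0, 4)
    bottom_left, bottom_right = quadrant(4, 0), quadrant(4, 4)
    # The two off-diagonal quadrants exchange places in the result.
    upper = [top_left[i] + bottom_left[i] for i in range(4)]
    lower = [top_right[i] + bottom_right[i] for i in range(4)]
    return upper + lower


matrix = [[8 * r + c for c in range(8)] for r in range(8)]
for row in transpose8x8(matrix):
    print(row)
```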

3.4 Two-Dimensional IDCT

The Discrete Cosine Transform (DCT) and its inverse (IDCT) are core functions in image and video compression standards such as JPEG, MPEG and H.261. The algorithm we have implemented [15] is one of many proposed for IDCT. We first perform the eight 1-Dimensional IDCTs on the columns. Then, the 8x8 matrix contained in 16 registers undergoes a matrix transpose, similar to that described in the previous example. This allows the same 1-Dimensional IDCT code to be used for the eight rows. Finally, the results are stored.

Both the code with and without MAX2 instructions make extensive use of the heuristics for performing fixed point multiplies using shift and add instructions. Figure 9a shows an example of how the Parallel Shift Right and Add instructions in MAX2 allow constant multiplications to be performed in parallel on four 16-bit values. The four 16-bit values in register r10 are multiplied by the constant 1.41421356 (square root of 2). The constant can be approximated by (1+1/4+1/8+1/32+1/128), which is 1.4140625. The sequence of parallel subword shift right and add instructions that do this is shown in Figure 9a, and explained by the comments. Without MAX2, PA-RISC processors only have full-word Shift Left and Add instructions. These can also be used for constant multiplication (see Figure 9b). However, only a single data value is multiplied at one time, and if the constant is fractional, an extra instruction is needed to perform a division at the end.

In this IDCT example with MAX2, an average of four instructions is required for each constant multiply in the algorithm. Two sets of multiplications can be done simultaneously to make use of the superscalar bandwidth of two Parallel Shift Right and Add instructions per cycle. Since four subwords are operated on by each instruction, the average rate is four cycles per eight multiplications, or half a cycle per multiplication. This is often better than a hardware multiplier, but at a much lower hardware cost, since only the integer adders (ALUs) are used.

hshradd r10,2,r10,r8    ; r8 = (x1,x2,x3,x4) * 1/4 + (x1,x2,x3,x4) = (x1,x2,x3,x4) * (1+1/4)
hshradd r8,3,r10,r8     ; r8 = r8 * 1/8 + r10 = (x1,x2,x3,x4) * (1+1/8+1/32)
hshradd r10,1,r8,r8     ; r8 = (x1,x2,x3,x4) * 1/2 + r8 = (x1,x2,x3,x4) * (1+1/2+1/8+1/32)
hshradd r8,2,r10,r2     ; r2 = r8 * 1/4 + (x1,x2,x3,x4) = (x1,x2,x3,x4) * (1+1/4+1/8+1/32+1/128)

Figure 9a: Multiplication by 1.414 Using Parallel Shift Right and Add Instructions (code with MAX2)

shladd r1,2,r1,r8 ; r8 = x * 4 + xshladd r8,1,r1,r8 ; r8 = r8 * 2 + x = x * (8+2+1)shladd r8,2,r1,r8 ; r8 = r8 * 4 + x = x * (32+8+4+1)shladd r8,2,r1,r8 ; r8 = r8 * 4 + x = x * (128+32+16+4+1)extrd r8,56,24,r2 ; r2 = r8 / 128 = x * (1+1/4+1/8+1/32+1/128)

Figure 9b: Multiplication by 1.414 Using Shift Left and Add Instructions (code without MAX2)

The code with MAX2 also uses saturating arithmetic for overflow handling, obviating the need for clamping values prior to storing, which must be done for the code without MAX2. Loop vectorization also allows four iterations of each 1-Dimensional IDCT to be calculated in each pass of the code with MAX2.
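The two instruction sequences of Figures 9a and 9b can be checked with a small Python model. This is only a sketch under stated assumptions: `hshradd`, `shladd` and the pack/unpack helpers are hypothetical Python functions written for this illustration, the parallel shift is assumed to be an arithmetic right shift on signed 16-bit subwords, and the parallel add is assumed to saturate to the signed 16-bit range.

```python
def to_signed16(v):
    """Interpret the low 16 bits of v as a signed 16-bit value."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

def unpack(word):
    """Split a 64-bit word into four signed 16-bit subwords, leftmost first."""
    return [to_signed16(word >> shift) for shift in (48, 32, 16, 0)]

def pack(lanes):
    """Pack four signed 16-bit subwords into one 64-bit word."""
    word = 0
    for v in lanes:
        word = (word << 16) | (v & 0xFFFF)
    return word

def saturate16(v):
    """Clamp v to the signed 16-bit range (assumed overflow behaviour)."""
    return max(-32768, min(32767, v))

def hshradd(r1, sa, r2):
    """Model of Parallel Shift Right and Add: per lane, (r1 >> sa) + r2."""
    return pack([saturate16((a >> sa) + b)
                 for a, b in zip(unpack(r1), unpack(r2))])

def mul_sqrt2_max2(r10):
    """Figure 9a sequence: four lanes times 1.4140625."""
    r8 = hshradd(r10, 2, r10)    # x * (1 + 1/4)
    r8 = hshradd(r8, 3, r10)     # x * (1 + 1/8 + 1/32)
    r8 = hshradd(r10, 1, r8)     # x * (1 + 1/2 + 1/8 + 1/32)
    return hshradd(r8, 2, r10)   # x * (1 + 1/4 + 1/8 + 1/32 + 1/128)

def shladd(r1, sa, r2):
    """Model of full-word Shift Left and Add: (r1 << sa) + r2."""
    return (r1 << sa) + r2

def mul_sqrt2_scalar(x):
    """Figure 9b sequence: one value times 181/128."""
    r8 = shladd(x, 2, x)     # 5x
    r8 = shladd(r8, 1, x)    # 11x
    r8 = shladd(r8, 2, x)    # 45x
    r8 = shladd(r8, 2, x)    # 181x
    return r8 >> 7           # divide by 128, as the final extract does
```

For inputs that are multiples of 128 no bits are lost in the intermediate shifts, so both sequences give the same result. For other inputs the MAX2 sequence truncates at each shift, so the two versions may differ in the low-order bits. In this model a lane whose product exceeds the 16-bit range is clamped rather than wrapped.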

4. Programming Techniques

The software optimization techniques used in the code examples are summarized below in two categories: those that are specific to the multimedia extensions (e.g., MAX2), and those that are general loop optimization techniques. The former techniques are not “multimedia specific” as much as they are “subword parallel” techniques, especially for MAX2, which contains only general-purpose parallel subword instructions. The latter techniques are used to optimize both code versions, with and without MAX2 instructions, wherever possible.


4.1 Subword Parallelism Techniques

Loop vectorization allows multiple iterations of a loop to be performed in parallel. This is the key technique for mapping algorithms with inherent parallelism onto the parallel subword instructions found in the multimedia extensions of microprocessors. The data operated on by each parallel loop iteration must be in the same subword track. Since a parallel subword instruction operates on four 16-bit data items in parallel, a loop of N iterations is reduced to N/4 iterations. Loop vectorization also supports memory coalescing, since data is loaded and stored one 64-bit word at a time. Loop index calculation overhead is usually also reduced by a factor of four. This technique is also referred to as strip mining.
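The reduction from N to N/4 iterations can be illustrated with a short Python sketch. It is not the authors' code: `hadd`, `pack4` and `unpack4` are hypothetical helpers, a Python integer stands in for a 64-bit register, and the array length is assumed to be a multiple of four. The masks keep carries from crossing subword boundaries, which is the software analogue of the small adder change that a parallel add requires.

```python
LOW15 = 0x7FFF7FFF7FFF7FFF   # low 15 bits of every 16-bit lane
TOP1 = 0x8000800080008000    # top bit of every 16-bit lane

def pack4(vals):
    """Pack four 16-bit values into one 64-bit word, first value leftmost."""
    word = 0
    for v in vals:
        word = (word << 16) | (v & 0xFFFF)
    return word

def unpack4(word):
    """Split a 64-bit word into four unsigned 16-bit values."""
    return [(word >> s) & 0xFFFF for s in (48, 32, 16, 0)]

def hadd(a, b):
    """Four modular 16-bit adds in one 64-bit operation.

    The low 15 bits of each lane are added normally; the top bit is
    fixed up with XOR so that no carry leaks into the next lane.
    """
    return ((a & LOW15) + (b & LOW15)) ^ ((a ^ b) & TOP1)

def add_scalar(xs, ys):
    """One 16-bit add per iteration: N iterations."""
    return [(x + y) & 0xFFFF for x, y in zip(xs, ys)]

def add_vectorized(xs, ys):
    """Four 16-bit adds per iteration: N/4 iterations."""
    out = []
    for i in range(0, len(xs), 4):
        a = pack4(xs[i:i + 4])            # stands in for one 64-bit load
        b = pack4(ys[i:i + 4])            # stands in for one 64-bit load
        out.extend(unpack4(hadd(a, b)))   # stands in for one 64-bit store
    return out
```

Each pass of the vectorized loop replaces four passes of the scalar loop, and the loop index is updated once instead of four times.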

Parallel accumulation is a technique for using multiple accumulators (in separate registers) to perform partial accumulations, utilizing the available superscalar bandwidth, rather than serializing the add instructions to a single accumulator. Parallel accumulation in subword accumulators in a single register is also done, in conjunction with loop vectorization.
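A minimal sketch of both forms of parallel accumulation follows. It assumes that a register can be modelled as a tuple of four 16-bit lanes, that the input length is a multiple of eight, and that the partial sums fit in 16 bits; `padd` and `sum_parallel` are hypothetical names used only for this illustration.

```python
def padd(a, b):
    """Lane-wise modular 16-bit add on two four-lane 'registers'."""
    return tuple((x + y) & 0xFFFF for x, y in zip(a, b))

def sum_parallel(values):
    """Sum 16-bit values using two registers of four subword accumulators."""
    if len(values) % 8 != 0:
        raise ValueError("sketch assumes a multiple of eight values")
    acc0 = (0, 0, 0, 0)   # first accumulator register
    acc1 = (0, 0, 0, 0)   # second, independent accumulator register
    for i in range(0, len(values), 8):
        # The two adds below do not depend on each other, so a superscalar
        # processor could issue them in the same cycle.
        acc0 = padd(acc0, tuple(values[i:i + 4]))
        acc1 = padd(acc1, tuple(values[i + 4:i + 8]))
    # Final reduction: combine the eight partial sums at full precision.
    return sum(acc0) + sum(acc1)
```

With a single accumulator every add would wait for the previous one. Here the dependency chain is split across two registers and, within each register, across four subword accumulators.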

Saturation arithmetic automatically handles positive (or negative) overflows by clamping a result that is too large (or too small) to the largest (or smallest) number representable by the result bits. Parallel subword arithmetic instructions have been designed for the parallel computation of subword data. However, the possible overflows of results would be a problem if they had to be handled individually, as this would negate the performance gain of parallel subword arithmetic. The MAX2 instructions with signed or unsigned saturation are used to handle overflows without resorting to special case code for these conditions.
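The difference between modular and saturating results can be seen in a small sketch. The lane-wise helpers below are hypothetical names, with registers again modelled as tuples of four 16-bit values; the Python `min` and `max` calls stand in for clamping that the hardware would perform with no extra instructions.

```python
def to_signed16(v):
    """Interpret the low 16 bits of v as a signed 16-bit value."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

def add_modular(a, b):
    """Signed 16-bit add that wraps around on overflow."""
    return tuple(to_signed16(x + y) for x, y in zip(a, b))

def add_signed_sat(a, b):
    """Signed 16-bit add that clamps to [-32768, 32767]."""
    return tuple(max(-32768, min(32767, x + y)) for x, y in zip(a, b))

def add_unsigned_sat(a, b):
    """Unsigned 16-bit add that clamps to [0, 65535]."""
    return tuple(min(65535, x + y) for x, y in zip(a, b))
```

With modular arithmetic, 32767 + 1 wraps to -32768, which in image data would turn a bright pixel dark. With signed saturation the same add yields 32767, so no per-element overflow test is needed after the parallel operation.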

Saturation arithmetic is also useful for in-line conditional execution. This allows conditional operation to occur without the need for conditional branch instructions. For example, saturation arithmetic allows the conditional selection of one of two operands, based on their relative values. An example is finding the absolute differences of pairs of data, as in the Block Match example. Other examples are finding the larger (maximum) or smaller (minimum) of pairs of operands [1]. In the original PA-RISC instruction set, the arithmetic nullify feature found in arithmetic, logical and field manipulation instructions also enables in-line conditional execution, as described in section 2.4 and Figure 4b.
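One common way to obtain these branch-free selections is through unsigned saturating subtraction, which yields zero whenever the subtrahend is the larger operand. The sketch below illustrates the idea and is not necessarily the exact sequence used in the Block Match code; `psub_us`, `padd`, `psub` and the derived functions are hypothetical names, and registers are modelled as tuples of four unsigned 16-bit values.

```python
def psub_us(a, b):
    """Lane-wise unsigned saturating subtract: a - b, clamped at 0."""
    return tuple(max(x - y, 0) for x, y in zip(a, b))

def padd(a, b):
    """Lane-wise modular 16-bit add."""
    return tuple((x + y) & 0xFFFF for x, y in zip(a, b))

def psub(a, b):
    """Lane-wise modular 16-bit subtract."""
    return tuple((x - y) & 0xFFFF for x, y in zip(a, b))

def abs_diff(a, b):
    """|a - b| per lane: one of the two saturated differences is always 0."""
    return padd(psub_us(a, b), psub_us(b, a))

def pmax(a, b):
    """max(a, b) per lane: (a - b clamped at 0) + b."""
    return padd(psub_us(a, b), b)

def pmin(a, b):
    """min(a, b) per lane: a - (a - b clamped at 0)."""
    return psub(a, psub_us(a, b))
```

Each result takes two or three parallel instructions for four pairs of operands, with no compare or branch. The `max` call inside `psub_us` only models the clamp that a saturating subtract performs in hardware.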

Data rearrangement instructions rearrange the order of the subwords packed into registers. In MAX2, they are the Mix and Permute instructions. Examples of the use of these instructions are given in Figure 3, and in the Matrix Transpose and IDCT examples.
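As an illustration of data rearrangement, the sketch below transposes a 4x4 matrix of 16-bit elements with eight Mix operations. It rests on an assumption about the instruction semantics: Mix Left is taken to interleave the even-numbered subwords of its two sources and Mix Right the odd-numbered ones, at either halfword or word granularity. The function names are hypothetical, and registers are modelled as tuples of four halfwords.

```python
def mixh_l(r1, r2):
    """Mix halfwords, left: even-numbered halfwords of r1 and r2, interleaved."""
    return (r1[0], r2[0], r1[2], r2[2])

def mixh_r(r1, r2):
    """Mix halfwords, right: odd-numbered halfwords of r1 and r2, interleaved."""
    return (r1[1], r2[1], r1[3], r2[3])

def mixw_l(r1, r2):
    """Mix words, left: left 32-bit word of r1, then left word of r2."""
    return (r1[0], r1[1], r2[0], r2[1])

def mixw_r(r1, r2):
    """Mix words, right: right 32-bit word of r1, then right word of r2."""
    return (r1[2], r1[3], r2[2], r2[3])

def transpose4x4(rows):
    """Transpose four registers of four halfwords using eight Mix operations."""
    r0, r1, r2, r3 = rows
    # Stage 1: interleave halfwords of row pairs.
    t0 = mixh_l(r0, r1)   # (a0, b0, a2, b2)
    t1 = mixh_r(r0, r1)   # (a1, b1, a3, b3)
    t2 = mixh_l(r2, r3)   # (c0, d0, c2, d2)
    t3 = mixh_r(r2, r3)   # (c1, d1, c3, d3)
    # Stage 2: combine 32-bit halves to form the columns.
    return [mixw_l(t0, t2), mixw_l(t1, t3), mixw_r(t0, t2), mixw_r(t1, t3)]
```

No loads, stores or masks are needed. Every operation reads two registers and writes one, so the four operations in each stage are independent of one another.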

Algorithms for multiplication by constants have been developed for the original PA-RISC Shift Left and Add instructions, and included with high-level language compilers. MAX2 adds Parallel Subword Shift Right (or Left) and Add instructions. While the heuristics for Parallel Subword Shift Left and Add instructions are the same as those for (Word) Shift Left and Add instructions, new heuristics are developed to exploit the Parallel Subword Shift Right and Add instructions. These allow multiplication by fractional constants to be implemented more accurately with fewer instructions.

4.2 Loop Optimization Techniques

Reordering of nested loops interchanges the inner and outer loops. This facilitates reuse of data values already in registers by successive loop iterations, thus reducing the number of load and store instructions required. Register pressure may also be reduced by shortening the time a value has to remain in a register between its first load and its last use. Reordering of loop iterations changes the order in which the iterations of a loop are evaluated, for the same reasons.

Recurrent scalar replacement is the software equivalent of hardware register renaming. Some processors provide hardware to rename registers, to eliminate register copies in loops that reuse data in registers across loop iterations. For example, assume an application in which three scalar input values are being used, with one new input value being read each iteration. On each loop iteration, the actions are: copy next-oldest value to oldest value, copy newest value to next-oldest value, read another newest value. The copies are eliminated by unrolling the loop by a factor of three and using the same physical register for the life of each variable across three iterations, in round-robin fashion.
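The three-value example can be sketched in Python, with local variables standing in for registers. The function names and the choice of a three-element sum as the loop body are assumptions made for illustration, and the input is assumed to hold at least two values.

```python
def window_sums_with_copies(xs):
    """Straightforward loop: two register copies per iteration."""
    out = []
    oldest, middle = xs[0], xs[1]
    for newest in xs[2:]:
        out.append(oldest + middle + newest)
        oldest = middle    # copy next-oldest value to oldest
        middle = newest    # copy newest value to next-oldest
    return out

def window_sums_unrolled(xs):
    """Unrolled by three: a, b, c rotate roles, so no copies are needed."""
    out = []
    a, b = xs[0], xs[1]
    i, n = 2, len(xs)
    while i + 2 < n:
        c = xs[i]          # roles: a oldest, b middle, c newest
        out.append(a + b + c)
        a = xs[i + 1]      # roles: b oldest, c middle, a newest
        out.append(b + c + a)
        b = xs[i + 2]      # roles: c oldest, a middle, b newest
        out.append(c + a + b)
        i += 3             # roles are back to: a oldest, b middle
    # Leftover iterations (fewer than three) fall back to the copying form.
    oldest, middle = a, b
    for newest in xs[i:]:
        out.append(oldest + middle + newest)
        oldest, middle = middle, newest
    return out
```

In the unrolled body the only register writes are the three loads, one per logical iteration. Each variable keeps its value for three logical iterations and is then overwritten by the next load.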

Loop unrolling is a technique of repeating two or more logical iterations of a loop within a single physical iteration. This is done to facilitate recurrent scalar replacement, loop vectorization, and to reduce loop iteration overhead. Loop interleaving is a special case of loop unrolling where two or more iterations of a loop are unrolled, and their instructions interleaved. This is often done to make full use of the superscalar instruction parallelism, even in code with a strong dependency chain. Software pipelining is the movement of instructions between different iterations of a loop, to hide instruction latencies coupled with strong dependency chains. Loop interleaving and software pipelining employ loop unrolling for increasing instruction-level parallelism by reducing dependencies.

The out-of-order execution capabilities of the PA-8000 processor [11] greatly reduce the need for software pipelining and loop interleaving. By merely unrolling the loop by several iterations, the instructions can be dynamically reordered by the hardware, achieving benefits consistent with software pipelining or loop interleaving. These techniques are useful on other microprocessors without out-of-order execution, or if the chain of dependencies is longer than the number of instructions the hardware can reorder.

Techniques such as loop unrolling, software pipelining, and multiply tables using Shift Left and Add instructions are available in today’s PA-RISC compilers. While recurrent scalar replacement and loop reordering are not common, there are preprocessors that currently apply these optimizations. The loop vectorization, saturation arithmetic, and data rearrangement techniques may require new data types or annotations to assist with grouping and alignment of the subword data elements. While some of this represents new requirements, it is all within the reach of current compiler technology.

5. Performance Results

There are three key elements to speeding up a software program: reducing the number of instructions executed, reducing the number of cycles needed to execute the instructions, or reducing the cycle time of the processor. The use of the parallel subword instructions in MAX2 attacks pathlength and execution cycles, without increasing processor cycle time. Figure 10 shows the reduction of pathlength, i.e., the number of instructions executed, for each of the four programs. The number of instructions for each program without MAX2 instructions is normalized to 1. In each case, the number of instructions executed is significantly reduced. The number of compute instructions is reduced due to subword parallel instructions. The number of load and store instructions is also reduced since four packed 16-bit data are loaded with a single Load instruction (Figure 11).

[Figure 10 (bar chart): instruction pathlength with MAX2, normalized to 1 for the code without MAX2. Block Match 0.32, Matrix Transpose 0.38, Box Filter 0.21, IDCT 0.24.]

Figure 10: Instruction Pathlength Decreases with MAX2

[Figure 11 (bar chart): load/store and compute instruction counts, without and with MAX2, for Block Match, Box Filter and IDCT; vertical axis from 0 to 5,000.]

Figure 11: Reduction in Memory and Compute Instructions

Another metric commonly used is cycles per element (see Figure 12). For the Block Match algorithm, which compares 256 pairs of pixels via the sum of absolute differences measure, we achieve 0.63 cycles/pixel. This is even better than the figure reported for other architectures [16] which have added a special-purpose instruction for performing the sum of absolute differences in parallel on multiple pairs of 8-bit pixels. Other cycles/element measures also show significant improvements when MAX2 instructions are used.

The speedup due to MAX2 is defined as the execution time (cycles) for the code without MAX2 instructions, divided by the execution time for the code using MAX2 instructions. This speedup in execution time is between 2.6 and 4.2 for our examples. The speedup would have been even greater, except that the code without MAX2 instructions was able to take advantage of PA-RISC Extract and Deposit instructions and the arithmetic nullify feature to shorten the instruction pathlength [12-14]. For most RISC architectures using shifts and ands rather than Extracts and Deposits, and without arithmetic nullification, the pathlength reduction for Block Match and Matrix Transpose would have been greater than 4, rather than just 2.6 (Figure 13).


[Figure 12 (bar chart): cycles per element without MAX2 / with MAX2. Block Match 1.66 / 0.63, Matrix Transpose 0.66 / 0.25, Box Filter 11.86 / 2.8, IDCT 11.18 / 2.7.]

Figure 12: Cycles/Element without and with MAX2

[Figure 13 (bar chart): execution time speedup with MAX2. Block Match 2.66, Matrix Transpose 2.63, Box Filter 4.24, IDCT 4.14.]

Figure 13: Execution Time Speedup with MAX2

Metrics vs. Programs   16x16 Block Match   8x8 Matrix Transpose   3x3 Box Filter   8x8 IDCT
Instructions           420 (1307)          32 (84)                1107 (5320)      380 (1574)
Cycles                 160 (426)           16 (42)                548 (2324)       173 (716)
Registers              14 (12)             18 (22)                15 (18)          17 (20)
Cycles/Element         0.63 (1.66)         0.25 (0.66)            2.80 (11.86)     2.70 (11.18)
Instructions/Cycle     2.63 (3.07)         2.00 (2.00)            2.02 (2.29)      2.20 (2.20)
Speedup                2.66                2.63                   4.24             4.14

Table 4: Metrics for Multimedia Kernels with (and without) MAX2 instructions
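The derived rows of Table 4 follow directly from the instruction and cycle counts. The sketch below recomputes them; the dictionary layout and the function name are illustrative choices, while the numbers are taken from the table.

```python
# (instructions with MAX2, instructions without, cycles with, cycles without)
KERNELS = {
    "16x16 Block Match": (420, 1307, 160, 426),
    "8x8 Matrix Transpose": (32, 84, 16, 42),
    "3x3 Box Filter": (1107, 5320, 548, 2324),
    "8x8 IDCT": (380, 1574, 173, 716),
}

def metrics(instr_with, instr_without, cycles_with, cycles_without):
    """Derive the ratio metrics of Table 4 from raw counts."""
    return {
        # Fraction of the original pathlength that remains with MAX2.
        "pathlength_ratio": instr_with / instr_without,
        # Instructions per cycle, with and without MAX2.
        "ipc_with": instr_with / cycles_with,
        "ipc_without": instr_without / cycles_without,
        # Speedup as defined in the text: cycles without / cycles with.
        "speedup": cycles_without / cycles_with,
    }

if __name__ == "__main__":
    for name, counts in KERNELS.items():
        m = metrics(*counts)
        print(f"{name}: speedup {m['speedup']:.2f}, "
              f"pathlength ratio {m['pathlength_ratio']:.2f}, "
              f"IPC {m['ipc_with']:.2f} ({m['ipc_without']:.2f})")
```

The computed values agree with the table to within rounding in the second decimal place.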

Code using MAX2 also has a side benefit that the number of registers needed is usually decreased (Table 4). In each example except Block Match, three to four fewer registers are needed when MAX2 instructions are used. This is primarily because the data rearrangement instructions require fewer temporary registers, and saturation arithmetic obviates the need for registers holding constants for clamping. For Block Match, the PA-RISC arithmetic nullify feature eliminates the need for two extra registers to get the absolute value, for the code without MAX2 instructions.

Both the code with and without MAX2 instructions do well in terms of utilizing the superscalar parallelism efficiently, as shown by the instructions/cycle metric in Table 4. The PA-RISC microprocessor used, the PA-8000 [11,5], is 4-way superscalar, meaning that it can issue four instructions per cycle. Of these four instructions, two are load or store instructions, and two are compute instructions. The code is optimized so that the two compute instruction slots are always utilized, which gives an instructions/cycle count of at least 2. Code examples with a higher percentage of load and store instructions have a higher instructions/cycle count. In Block Match, 31% of the instructions executed by the code with MAX2 are load and store instructions (Table 5), and the instructions/cycle count is 2.63. In the Block Match code without MAX2 instructions, 39% of the instructions executed are load and store instructions. This code needs more load and store instructions because it loads 16-bit data individually, whereas with MAX2 one 64-bit load instruction brings in four 16-bit data elements. The extra loads and stores raise the instructions/cycle count of the code without MAX2 to 3.07. For the other code examples, the instructions/cycle count is about the same whether MAX2 instructions are used or not. This gives a further indication that both sets of code are equally optimized for superscalar execution.

Percent Instructions vs. Program   Load/Store   Compute       Branches   ALU         SMU
Block Match                        31% (39%)    65% (59%)     4% (1%)    64% (59%)   1% (0%)
Matrix Transpose                   0% (0%)      100% (100%)   0% (0%)    0% (0%)     100% (100%)
Box Filter                         17% (19%)    78% (77%)     5% (5%)    48% (43%)   30% (34%)
IDCT                               17% (16%)    82% (83%)     1% (1%)    62% (66%)   20% (16%)

Table 5: Percent Instructions for Code Examples with (and without) MAX2 Instructions

6. Conclusions

We described the MAX2 multimedia instructions in the PA-RISC processor architecture. MAX2 represents a minimalistic and efficient set of parallel subword instructions for performing parallel arithmetic, data alignment and data rearrangement operations on subwords packed into the word-oriented registers and datapaths of a general-purpose microprocessor. We illustrated how some important media processing kernels may be programmed utilizing these MAX2 instructions. In order to assess the advantages provided by MAX2, we also optimized the same code examples for the PA-RISC processor, without using the MAX2 instructions.

It turns out that the PA-RISC instruction set already has many instructions that enable efficient coding for these examples. However, the addition of MAX2 instructions improves the performance significantly further. The instruction pathlengths required with MAX2 instructions are only 21% to 38% of the pathlengths required without MAX2. The speedup in execution time is between 2.6 and 4.2 when MAX2 instructions are employed, compared to equally optimized code not using MAX2. The cycles/element metric is likewise better with MAX2 instructions than without. The resource utilization is also improved, with MAX2 code usually needing fewer registers. The cost of adding the MAX2 instructions was insignificant, amounting to less than 0.1% of the chip area of the PA-8000 microprocessor. In summary, multimedia instructions enable more operations to be performed on subword data in registers with fewer instructions, fewer cycles, and less traffic between the memory and the processor, resulting in significant performance speedup at minimal additional cost.

The next challenge is to make these high performance instructions available to applications written in high level languages. Media processing programs tend to have a few performance-critical loops, which can be structured to take advantage of the subword parallel instructions, as illustrated in this paper. There are several options for making these parallel subword instructions available. The PA-RISC C compiler allows programs to directly call assembler macros to execute a single parallel subword instruction. This was used to produce a full speed, software MPEG player for entry-level workstations using the PA-7100LC microprocessor [3,4]. For media processing kernels such as the IDCT or Box Filter, larger macros or library routines callable from the application programs can be used. Finally, the parallel subword instructions can be generated within the compilers themselves. While the code examples have been hand-optimized for this paper, many of the optimization techniques have been described using commonly known compiler optimization techniques, although some of the multimedia specific optimizations will require new capabilities in the compiler. In summary, we can exploit these multimedia instructions today through the use of macros and libraries. We have tried to demonstrate the software optimization techniques to encourage more rapid development of compilers with these capabilities.

7. References

1. Lee R., “Subword Parallelism with MAX2”, IEEE Micro, vol. 16 no. 4, August 1996, pp. 51-59.
2. Lee R.B., “Accelerating Multimedia with Enhanced Microprocessors”, IEEE Micro, vol. 15 no. 2, April 1995, pp. 22-32.
3. Gwennap L., “New PA-RISC Processor Decodes MPEG Video”, Microprocessor Report, vol. 8 no. 1, January 24, 1994, pp. 16-17.
4. Lee R., “Real-time MPEG Video via Software Decompression on a PA-RISC Processor”, Proceedings of IEEE Compcon, March 5-9, 1995, San Francisco, California, pp. 186-192.
5. Lee R., Huck J., “64-bit and Multimedia Extensions for the PA-RISC 2.0 Architecture”, Proceedings of IEEE Compcon, February 25-28, 1996, Santa Clara, California.
6. Kohn L., et al, “The Visual Instruction Set (VIS) in UltraSPARC”, IEEE Compcon, March 5-9, 1995, pp. 462-469.
7. Peleg A. and Weiser U., “MMX Technology Extension to the Intel Architecture”, IEEE Micro, vol. 16 no. 4, August 1996, pp. 42-50.
8. Ninth Annual Microprocessor Forum, October 21-24, 1996, San Jose, California.
9. Golston J., “Single-Chip H.324 Videoconferencing”, IEEE Micro, vol. 16 no. 4, August 1996, pp. 21-33.
10. Hansen C., “MicroUnity’s MediaProcessor Architecture”, IEEE Micro, vol. 16 no. 4, August 1996, pp. 34-41.
11. Hunt D., “Advanced Performance Features of the 64-bit PA-8000”, Proceedings of IEEE Compcon, March 5-9, 1995, San Francisco, California.
12. Lee R., “Precision Architecture”, IEEE Computer, vol. 22 no. 1, January 1989, pp. 78-91.
13. Lee R., Mahon M. and Morris D., “Pathlength Reduction Features in the PA-RISC Architecture”, Proceedings of IEEE Compcon, February 24-28, 1992, San Francisco, California, pp. 129-135.
14. McMahan L. and Lee R., “Pathlengths of SPEC Benchmarks for PA-RISC, MIPS and SPARC”, Proceedings of IEEE Compcon, February 22-26, 1993, San Francisco, California, pp. 481-490.
15. Arai Y., Agui T. and Nakajima M., “A Fast DCT-SQ Scheme for Images”, Transactions of the IEICE, vol. E 71 no. 11, November 1988.
16. Tremblay M., O’Connor M., Narayanan V., He L., “VIS Speeds New Media Processing”, IEEE Micro, vol. 16 no. 4, August 1996, pp. 10-20.
17. Kane G., PA-RISC 2.0 Architecture, Prentice Hall, ISBN: 0-13-182734-0, 1996.
18. Flynn M., “Very High-Speed Computing Systems”, Proceedings of IEEE, vol. 54 no. 12, December 1966.

