
Performance improvement techniques for the M88000 RISC architecture



Drawing on experience gained in the development of the M88000 RISC architecture, Steve Heath examines the principal methods of enhancing processor performance

The Motorola MC88000 RISC architecture has been designed to take advantage of the three fundamental techniques of improving processor performance: increasing clock speeds, improving the instruction sets and executing multiple instructions per clock cycle. The basic techniques are described with the associated problems encountered in their implementation, and the paper shows how the architecture overcomes them.

microprocessors, scalable architectures, RISC, pipelining, multiple execution machines

The term 'scalable architecture' has been usurped to describe many different ways of improving processor performance. It is often used when referring to the new generations of RISC processors which have appeared in recent years. In most cases, it refers to the scaling of processor clocks and implies that this attribute is only available to processors that have been designed as scalable. In others, it refers to a processor's ability to function within a multiprocessor or parallel configuration. In reality, neither of these definitions is unique to RISC architectures; they can equally be applied to CISC machines.

Given the task of improving a processor's performance, three fundamental methods are available: increasing processor and system clock rates, optimizing and improving the power of the instruction set and, finally, executing multiple instructions per cycle. The first method, although primarily associated with RISC designs, will be shown to be applicable to CISC processors as well. The second method, more commonly associated with CISC designs, can benefit RISC architectures. The third is more suited, although not exclusively, to RISC processors rather than

Motorola Semiconductors, 69 Buckingham St, Aylesbury, HP20 2NF, UK. Paper received: 24 January 1990. Revised: 19 April 1990

0141-9331/90/06377-08 © 1990

current ClSC devices. These techniques are described with particular reference to the Motorola M88000 RISC architecture and its implementation.

INCREASING CLOCK RATES

Ever since the appearance of commercially available RISCs, there has been debate concerning which RISC or CISC architecture provides the best performance. Argument centres around the ability of RISC architectures to execute their instructions within a single cycle, through the use of pipelines, the reduction of instructions to a simple operation, and the synthesis of complex operations with compiler-generated code sequences. These sequences require less time than is taken by a CISC processor to execute its single instruction, thus deriving a performance advantage. Recent developments have centred around the application of RISC architectures to more esoteric silicon processes and to gallium arsenide to increase the processor clock speed and performance. This is only possible, it is argued by RISC advocates, with the relatively simple processor execution units used by RISC machines. The effect of increasing the processor clock speed on the rest of the system, and in particular memory, is forgotten.

Figure 1 shows the effect of wait states on the performance of an MC68000 CISC processor. The CPU performance data is from MacGregor and Rubenstein [1] and has been derived from measurements of bus activity of various systems. Figure 2 shows the performance degradation for a RISC processor or any other architecture that executes single cycle instructions. Both curves are based on the same equation, i.e. if an MC68000 could execute instructions in a single cycle, its curve would resemble Figure 2. MacGregor and Rubenstein's analysis showed that, on average, the MC68000 took 12.576 clocks per instruction (CPI), including an average 2.698 accesses with no wait states. RISC architectures simplify

Butterworth-Heinemann Ltd

Vol 14 No 6 July/August 1990 377

Figure 1. The effect of wait states on an MC68000 CISC processor (curves for 8, 10, 12 and 16 MHz clocks)

Figure 2. The effect of wait states on RISC performance (curves for 8, 10, 12 and 16 MHz clocks)

the instruction set so that an instruction has to be fetched on every cycle to maintain the data flow into the processor. This effectively means that at 20 MHz memory is accessed every 50 ns. At the higher clock speeds, such as 40 MHz, this increases to every 25 ns, placing tremendous design constraints on the external memory systems. The insertion of a single wait state halves system performance. Two wait states reduce it to one-third and so on, as shown in Figure 2.

Both curves show that performance degrades as wait states are added, but the degradation for a RISC processor is far higher, as shown by its steeper slope. Examination of Figure 1 shows that the insertion of an extra wait state resulting from increasing the processor clock speed by 2 MHz offers approximately the same performance as a slower processor with the wait state removed. The number of cycles per instruction for CISC processors like the MC68000 family is not fixed: it can range from 4 CPI

for a simple register operation to over 150 CPI for a division. Each instruction comprises a number of bus cycles and computation time. As the computation time starts to increase in relation to bus traffic, the curve starts to change. As the proportion of time spent in computation increases, the increased clock speeds negate wait state penalties and give improved performance. This improvement is software-dependent: a system heavily involved in arithmetic computation will gain performance, while another involved in control, with a large amount of bus traffic, will not.
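The shape of both curves follows from a simple clocks-per-instruction model. The sketch below is a modern illustration inferred from the figures quoted above (12.576 CPI and 2.698 accesses per instruction for the MC68000; 1 CPI and one access per instruction for a single-cycle machine); the exact equation is not reproduced from the paper.

```python
def throughput_mips(clock_mhz, cpi, accesses_per_instr, wait_states):
    """Instructions per microsecond when every memory access incurs
    the given number of wait states."""
    effective_cpi = cpi + accesses_per_instr * wait_states
    return clock_mhz / effective_cpi

# Single-cycle (RISC-style) machine at 20 MHz: one access per instruction.
risc_0 = throughput_mips(20, 1, 1, 0)   # 20.0 MIPS
risc_1 = throughput_mips(20, 1, 1, 1)   # 10.0 MIPS: one wait state halves it
risc_2 = throughput_mips(20, 1, 1, 2)   # ~6.7 MIPS: two reduce it to a third

# MC68000-style machine: 12.576 CPI, 2.698 accesses per instruction.
cisc_0 = throughput_mips(20, 12.576, 2.698, 0)
cisc_1 = throughput_mips(20, 12.576, 2.698, 1)
print(risc_1 / risc_0)   # 0.5  -- RISC loses half its performance
print(cisc_1 / cisc_0)   # ~0.82 -- the CISC machine loses far less
```

The ratio makes the steeper slope of Figure 2 concrete: the smaller the base CPI, the larger the fraction of each instruction spent waiting on memory.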

For RISC and other single cycle instruction machines, there is no leeway; the addition of wait states dramatically reduces performance. If a RISC processor is fabricated in other semiconductor technologies to give higher clock speeds, the corresponding memory subsystems must also be upgraded. This requires fast memory or the adoption of complex memory caches and cache coherency schemes. The result is a sophisticated design which is frequently difficult to manufacture or upgrade.

Comparing transistor counts for similar performance CISC and RISC processors reveals a similar overall count. The MC68040 CISC processor has 1.2 M transistors while its MC88100 RISC counterpart with two MC88200 cache memory management units has slightly more at 1.86 M, albeit distributed over three chips. Both systems have integer and floating-point execution units, dual memory management and caches with cache coherency. In this respect, RISC does not offer any advantage over a similar performance CISC design; RISC systems may have simpler processors but their memory systems are more complex.

Fortunately, technology trends indicate that increased speeds are accompanied by higher densities. Silicon technology is therefore applicable to both architectures. A processor does not have to be a RISC machine to take advantage of technology improvements.

IMPROVING THE INSTRUCTION SET

The RISC camp has recently been further challenged by the latest generation of CISC processors, like the MC68040, which executes complex instructions in a single clock cycle. As CISC processors approach the 1 CPI figure, their more powerful instruction sets can actually deliver more work than RISC counterparts, given similar clock rates and memory bandwidth.

When RISC machines first appeared, CISC processors were performing at about 6-10 CPI. This allowed RISC machines the time to execute a sequence of simpler instructions at 1 CPI and offer better performance. However, as CISC processors have decreased their average CPI figures, pressure has been placed on RISC instruction sets.

The philosophy behind the MC88100 [2] instruction set was simple: if a complex instruction could be executed in a single cycle and was shown to be beneficial to system performance, it would be included. This has resulted in an extremely rich set with many addressing modes, bit field data support, floating-point operations, etc. The instruction set is better described as optimized than reduced.

Addressing and fetching data

378 Microprocessors and Microsystems

Unlike many RISC architectures, which have only one address mode and require addresses to be calculated prior to every external access, the M88000 instruction set supports five. Probably the most useful is the register indirect with an index, which can be scaled. This allows a pointer to be assigned to the head of a table or set of tables and allows other registers to index into it. This reduces the number of address calculations and instructions required. Simply changing the main pointer register to that of another table allows rapid searching of tables and linked lists. This reduces the number of instructions needed per function and increases throughput.
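The effective-address arithmetic behind register indirect with scaled index can be sketched as follows; the register names and table addresses are illustrative, not taken from the paper.

```python
def scaled_index_address(base, index, scale):
    """Effective address for a register-indirect-with-scaled-index mode:
    base register plus index register multiplied by the element size."""
    return base + index * scale

# Walk a table of 4-byte entries without a separate address calculation
# per element: the hardware forms each address from two registers.
TABLE_HEAD = 0x1000
addrs = [scaled_index_address(TABLE_HEAD, i, 4) for i in range(4)]
print([hex(a) for a in addrs])   # 0x1000, 0x1004, 0x1008, 0x100c

# Searching a different table only means changing the base pointer:
OTHER_TABLE = 0x2000
addr2 = scaled_index_address(OTHER_TABLE, 3, 4)   # 0x200c
```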

Further consideration during the development of the M88000 family was given to how much data can be fetched and how this is performed. Most compiler-generated data, such as C variables char, short and int, are 8- or 16-bit in length, yet current RISC architectures feature 32-bit data size. RISC architectures like the AMD29000 [3] can deal with the corresponding byte and halfword data sizes internally, but they have to fetch such data using a multiple instruction sequence: the first instruction calculates the address and loads it into the address register, a 32-bit-wide word of data is fetched into a register and, finally, a third instruction is needed to extract the relevant byte or halfword from that register and store it in yet another register. The MC88100 has four data byte enable signals which allow it to fetch individual bytes and halfwords directly in a single cycle. This again reduces the code expansion, the number of cycles needed to perform the task and, ultimately, improves the system performance.
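The multi-step sequence of a word-only machine can be modelled in a few lines. This is a sketch, not AMD29000 code: memory is modelled as a list of 32-bit words and big-endian byte ordering is assumed.

```python
def fetch_halfword_word_machine(mem_words, addr):
    """The three-step fetch of a word-only RISC: calculate the word
    address, load the full 32-bit word, then extract the 16-bit field.
    Byte-enable hardware collapses all of this into one load."""
    word_addr = addr & ~3                 # 1. address calculation
    word = mem_words[word_addr // 4]      # 2. 32-bit load
    shift = (2 - (addr & 3)) * 8          # 3. extract (big-endian assumed)
    return (word >> shift) & 0xFFFF

mem = [0x11223344, 0xAABBCCDD]
print(hex(fetch_halfword_word_machine(mem, 6)))   # 0xccdd
```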

The combination of these two factors can reduce the number of instructions needed to access a byte of data from a table by a factor of three, compared with the AMD29000. An example is shown in Figure 3. The MC88100 can access data directly by fetching a byte from a table pointed to by a register and indexed by a second. The AMD29000 requires that an address is calculated and

Figure 3. Accessing data within memory tables; a, an inefficient RISC architecture; b, the MC88100

a) With r1 holding the table head and r2 the offset, the instruction sequence to access a 16-bit variable is: 1. add r1 and r2 and store the result in r3; 2. fetch a 32-bit word using r3 as a pointer and load it into r4; 3. extract the 16-bit value from r4 and store it in r3. This uses three instructions and four registers.

b) With r1 holding the table head and r2 the offset, a single load fetches the 16-bit value from the location pointed to by r1, offset by r2, into register r3. This uses one instruction and three registers.

loaded into a register, a 32-bit value is then loaded and, finally, the byte is extracted: three instructions in total. Since this operation is repeated many times by an operating system, savings in this area are essential to maximize system performance. The SPARC processor can perform this operation in two instructions. Its ability to perform direct byte or 16-bit word accesses from memory removes the extract instruction, but its single register indirect addressing mode still requires an address calculation.

The advantages of having a powerful instruction set are many: the code expansion experienced when moving from a complex to a reduced instruction set is reduced and, by reducing the number of instructions, the amount of work that can be performed is increased. This may appear to embrace the CISC philosophy of making instructions more complex to do more work, but should not be interpreted as such. Two criteria used to justify expanding CISC addressing modes and adding more specialized instructions were hardware support for compilers and the reduction of the code space and memory. With improvements in compiler technology and the advent of low-cost memory, these criteria are no longer valid. The MC68020 [4] CALLM and RTM instructions are good examples. These instructions were added to the MC68000 [5] instruction set supported by the MC68020 to perform parameter passing when calling and returning from procedures using a single instruction. This operation is frequently performed within software and was a prime target for improvement. However, executing CALLM and RTM instructions took several hundred cycles in some cases and needed specific support from an MC68851 memory management unit. As a result, better performance could be obtained through a software implementation and the instructions were ignored by virtually every MC68020 system. The combination of the considerable architectural support required for implementation and their lack of use precipitated their removal from successive generations.

Upgrades must now be carefully considered in terms of frequency of use, advantages over synthesizing the instruction and the architectural implications in terms of hardware and software compatibility. These criteria are applicable to both CISC and RISC architectures.

Instruction set upgrades

Upgrading RISC-based instruction sets may not be as simple as with CISC processors like the M68000 family. The difficulty derives from the fixed op-code length needed by RISC machines to allow single cycle instruction fetches and to simplify the instruction decode. The MC68020 could supplement the M68000 instruction set by the use of operands to extend the effective op-code size from 2 bytes to a maximum of 10. With this size, there is little difficulty in allocating bits to encode new operations. This variable length is a potential obstacle to performance improvements in that it is difficult to predict how many memory fetches would be needed to fetch the complete or multiple op-codes in a single cycle without assuming the maximum length. The hardware difficulties of fetching eight 10-byte instructions are also considerable.

The triadic format adopted by most RISC processors, including the MC88100, provides two sources and a destination, which allows calculations of the type A + B = C or A + B = B to be performed. Dyadic instructions used by the M68000 family use one of the sources as a destination, restricting calculations to the A + B = B format. To preserve all data, a copy of the destroyed source must be made. The move from dyadic to triadic encoding increases the number of encoding bits needed in the op-code.

With a 32-bit fixed op-code (the size favoured by nearly all today's RISC processors) the majority of coding bits are taken by the triadic register operands. Each instruction usually has two sources and a destination and, with a register file size of 32, this takes 15 bits, leaving 17 bits for the operation and addressing modes to be encoded. As the register file grows, each extra bit of register address costs three bits in the op-code, one per operand. With a file of 256 registers, there are only 8 bits left to encode the instruction and any addressing mode. In these cases, the number of instructions and addressing modes that can be supported is greatly reduced. Without destroying binary compatibility or lengthening the op-code size, it is difficult to see how these instruction sets can be improved.
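The bit budget above can be checked with a one-line calculation (an illustrative sketch of the arithmetic, not a description of any particular encoding):

```python
from math import ceil, log2

def bits_left_for_opcode(op_size, register_file_size, operands=3):
    """Bits remaining for the operation and addressing modes after a
    triadic (two sources, one destination) register specification."""
    return op_size - operands * ceil(log2(register_file_size))

print(bits_left_for_opcode(32, 32))    # 17, as in the text
print(bits_left_for_opcode(32, 256))   # 8
```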

The MC88100 already has seven groups of 256 instructions reserved for future special function units. If these instructions are executed without the relevant hardware support, the processor takes an exception to allow their simulation in software. This is similar to the F-line and A-line exceptions of the MC68000, where software simulation was replaced by integrated hardware support in later processor generations.

Any processor architecture can be upgraded by the addition of more powerful instructions or addressing modes. Careful upgrading of RISC instruction sets, where single cycle execution and the architectural integrity can be maintained, will undoubtedly improve performance.

EXECUTING MULTIPLE INSTRUCTIONS PER CLOCK

Increasing the power of an instruction set is difficult once it has been defined and its use is therefore of limited scope in comparison to a reduction in the time taken to execute an instruction. This approach is the one taken by Motorola and Dolphin Technology in the development of a 1 k MIPS single processor based on the M88000 RISC architecture. The processor, fabricated in emitter coupled logic and running at 125 MHz, has eight execution units and at peak can execute eight M88000 instructions per clock cycle. This breaking of the single cycle barrier will form the main emphasis for RISC designs in the future, yet many current implementations have aspects within their design which will handicap such developments. It is worthwhile examining some of the problems associated with implementing multiple execution machines, identifying some of the architectural prerequisites for such designs and showing how the M88000 architecture fulfils them.

General principles

Multiple execution machines (MEMs) should not be confused with multiprocessor or parallel processing designs, although they are similar in their capability to execute multiple instructions every clock cycle. The main difference is the level at which the allocation of resources is performed. With most parallel or multiple processor computers, the execution units act virtually independently and can be likened to several computer systems residing in a single chassis with a single controlling program. The system software controls the resource allocation, which may or may not be transparent to the user, and applications may have to be adapted accordingly. The MEM design performs this role at the VLSI hardware level so that the resource allocation is transparent to the user, thus enabling binary compatibility across a range of machines, irrespective of the number of instructions that can be executed per clock.

The general principles behind a multiple execution unit processor are relatively simple (Figure 4). The execution unit is replicated, with each unit capable of executing an instruction per clock cycle. These units are fed via a wide bus: an eight-unit design with a 32-bit op-code would need a 256-bit bus to allow a very large word, a multiple instruction, to be presented. Each individual op-code within it is allocated to a particular execution unit on every clock cycle. For efficiency, this requires a fixed op-code length with no additional operands, so that single-cycle memory accesses can be maintained and no hardware is needed to detect when an op-code would overspill into the next allocated slot. Most RISC designs, including the M88000, use a fixed size op-code.
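The fixed op-code length is what makes the separation stage trivial: the wide word slices at constant bit offsets. A minimal sketch (a 128-bit word feeding four units, matching Figure 4; the bit layout is an assumption for illustration):

```python
def split_multiple_instruction(vliw, width_bits=128, op_bits=32):
    """Slice a very large instruction word into fixed-size op-codes,
    one per execution unit; slot 0 occupies the high-order bits."""
    ops = []
    for slot in range(width_bits // op_bits):
        shift = width_bits - op_bits * (slot + 1)
        ops.append((vliw >> shift) & ((1 << op_bits) - 1))
    return ops

word = 0xAAAAAAAA_BBBBBBBB_CCCCCCCC_DDDDDDDD
print([hex(op) for op in split_multiple_instruction(word)])
# instructions A..D, each dispatched to its own execution unit
```

Because every slice lands on a fixed boundary, no run-time length decoding is needed; with a variable-length encoding the slice points would depend on the previous op-codes.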

The next problem concerns the validity of the results. Although eight instructions may have been executed, not all the results may be valid. If a branch instruction is encountered and taken, the instructions executed with it but located after it would not normally have been executed, and their results should be discarded. Similarly, instructions which use results from previous instructions may not be valid because the data was not ready. It is this type of problem that must be overcome before multiple execution machines can realize their potential performance.
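The branch case can be stated very compactly: only the results of slots at or before a taken branch are architecturally valid. A toy model of that rule (illustrative only; real hardware inhibits or discards these results internally):

```python
def valid_results(slot_results, taken_branch_slot):
    """Keep only the results from instruction slots at or before a
    taken branch; later slots were executed speculatively and their
    results must be discarded."""
    if taken_branch_slot is None:       # no branch taken: all valid
        return slot_results
    return slot_results[:taken_branch_slot + 1]

# Eight units executed in parallel, but the branch in slot 2 was taken:
print(valid_results(list("ABCDEFGH"), 2))   # ['A', 'B', 'C']
```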

Figure 4. A four execution-unit design. A 128-bit multiple instruction, fetched from external memory or a complete cache line, is separated into four individual 32-bit op-codes (instructions A to D at bit offsets 0, 32, 64 and 96); a decode and allocation stage dispatches each op-code to its own execution unit


Removing data-dependency

To achieve maximum throughput from the model shown in Figure 4, the individual instructions within the multiple instruction presented to the execution units must have no data-dependency, i.e. instruction 3 must not need the result of instruction 1. As the number of execution units increases, this problem becomes more acute.

Data-dependency refers not only to data or registers encoded within the op-code, but to any unique resource within the programming model, such as condition codes or status bits. If a programming model has a single condition code register, this immediately places restrictions on which instructions can be executed in parallel. Instruction execution will be limited to instructions that do not use the register and to a single instruction which modifies it, even though there may be more execution units available. In addition, there can be no automatic updating of the condition code bits, as in the M68000 family, due to the processor overhead involved and the data-dependency problems it can cause (e.g. which instruction is used for the updating?).

To allow multiple instruction execution, condition code resources must be sufficient to restrict data dependency to the instruction that provides the condition code and the instruction(s) that use the data. The M88000 family does not have a specific condition code register but allows any general purpose register to be temporarily used as a condition code register. To compare two values held in two registers, the MC88100 compare instruction can be used. This tests for all conditions and sets appropriate bits within a third specified register. The next instruction would be a branch on bit set or clear, which redirects program flow as necessary. With further bit masking, etc. it is possible to test for extremely complex conditions. Testing for five particular conditions would only require six instructions: a compare followed by five branch on bit condition instructions.

The compiler can allocate specific registers to the compare and associated bit test instructions, removing data-dependency between compare instructions that would exist if a common resource was shared, while still maintaining dependency between the associated compare and bit test instructions.

However, the MC88100 has two more efficient methods of branching on condition. The previous example compared two values and needed a two-instruction sequence. There is a branch on condition instruction which compares, tests and branches in a single instruction and, with the delayed slot version, executes effectively in a single cycle. This instruction compares a single value with zero and does not set or use any condition code registers.

The delayed branch instruction is another optimizing technique provided by the MC88100 instruction set and is used primarily to remove pipeline stalls due to branching. Any flow change will cause the next op-code after the branch or flow change instruction to be fetched and inserted into the pipeline. If the branch is taken, this instruction must be flushed from the pipe and this operation causes delays. The delayed slot mechanism puts the instruction preceding the branch after the branch and adds a '.n' suffix to the branch or jump instruction. The suffix causes a bit change in the instruction which is recognized by the processor, and instructs it to process the next instruction rather than flush it. This technique can be used in hand coding and is often utilized by optimizing compilers.
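A toy cycle count shows why the filled delay slot matters. The figures here (three useful instructions per iteration, one flushed fetch per taken branch) are assumptions chosen for illustration, not MC88100 measurements:

```python
def loop_cycles(iterations, use_delay_slot):
    """Cycles for a loop whose closing branch is taken every iteration.
    Without the '.n' delayed form, the instruction fetched after the
    branch is flushed, costing one cycle; with it, that slot does
    useful work."""
    useful_per_iter = 3
    flushed = 0 if use_delay_slot else 1
    return iterations * (useful_per_iter + flushed)

print(loop_cycles(100, False), loop_cycles(100, True))   # 400 300
```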

Synchronizing pipelines

The problem of determining which instructions within the multiple instruction yield valid results has to be solved by either software or hardware. This problem is not restricted to MEMs but also appears within many current RISC implementations. Architectures like the Motorola M88000 and the MIPS R2000/3000 [6] issue a single instruction per clock, yet these may go to either a floating-point or integer execution unit. These units are frequently pipelined and this can cause a potential synchronization problem, as shown in Figure 5.

This instruction sequence performs a floating-point calculation and copies the result in r1 to r4. Register r0 is hardwired on the M88000 to zero and the 'adding zero to a register and saving the result' technique is used to perform the move. One value may be passed to a procedure while the other is available for local use. The first instruction initiates a floating-point operation where the contents of registers r2 and r3 are added to give a result in r1. This takes five clocks to complete. The next instruction then performs an add operation using r1 and r0 to give a result in r4. This takes three clocks. In this example, r1 is needed by the add instruction two clocks before the preceding floating-point instruction has completed. The system is now out of synchronization and corrupt. All machines with multiple pipelines can suffer from this problem and there are several different solutions to it.

The first relies on the compiler to sort out the problem. The compiler has to recognize all the potential code sequences that could result in such corruption and insert sufficient no operation (NOP) op-codes to delay the second instruction so that it will synchronize (Figure 6). This solves the problem but inserts instructions which do nothing. To improve efficiency, these 'delay slots' or 'bubbles' are filled by taking instructions from beyond the add instruction and moving them up within the execution sequence to replace the NOPs, while still maintaining the correct semantics. If there are insufficient instructions available for infilling, the original op-codes are left in place (Figure 7).

Figure 5. Pipeline synchronization problem: fadd r1,r2,r3 is followed by add r4,r0,r1, which needs r1 before it is ready

Figure 6. Software solution using NOPs: three NOPs inserted between fadd r1,r2,r3 and add r4,r0,r1 delay the add until r1 is ready
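The arithmetic the compiler performs when inserting NOPs can be sketched as follows. The pipeline stage at which an operand is read is an assumption of this toy model, chosen so the fadd/add example works out as in Figure 6:

```python
def nops_needed(producer_latency, read_stage=1):
    """NOPs a compiler must insert so that a dependent instruction,
    issued on the cycle after its producer, reads the result only once
    it is ready. The read_stage value is a modelling assumption."""
    result_ready = producer_latency        # e.g. fadd: ready after 5 clocks
    operand_read = 1 + read_stage          # issued next cycle, reads early
    return max(0, result_ready - operand_read)

print(nops_needed(5))   # 3 NOPs between the fadd and the dependent add
print(nops_needed(2))   # 0: a short producer needs no padding
```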

This technique is used by the MIPS R2000/3000 family and, although needing sophisticated compilers, it requires no additional hardware to implement. It does, however, have some serious shortcomings. Firstly, the compiler requires extensive knowledge of the processor, its execution units and internal mechanisms to be able to reorganize the code sequences correctly. Any change to the hardware can invalidate these changes. Figure 8 shows a typical change: the floating-point pipeline has been lengthened to cope with 128-bit precision, and the code sequence used for the five-stage pipeline no longer works. This effectively rules out binary compatibility, as any application would need recompiling and porting to ensure its integrity. It also makes assembler routines difficult to write and integrate. Moreover, the whole hardware integrity of the system is now dependent on the software sequences running through the machine (i.e. some sequences cause the machine to misbehave without the error being recognized). The compiler has to identify every potentially corrupting sequence from the millions of combinations and reorganize appropriately.

Scoreboarding

An alternative approach is to use hardware to delay any conflicting instructions internally to assure system integrity, irrespective of the code sequence. A scoreboard is used to track register usage and is the mechanism adopted by the M88100 (Figure 9).

This mechanism first appeared over 25 years ago in Seymour Cray's CDC 6600 design for Control Data Corporation. The principle is elegant: each register within the register file has an associated scoreboard bit which is clear when the register content is current and set if it is stale. The first stage of the execution pipeline decodes the instruction and sets the appropriate scoreboard bits. The next instruction is then decoded and checks the scoreboard to see if one of its source registers has been scoreboarded. This instruction cannot proceed any further until the data is ready from the other pipeline. As soon as this data is ready, it is fed forward from one pipeline stage to the other, updates the register file and, finally, clears the scoreboard bit (see Figure 10).

Figure 7. Filling the delay slots to optimize performance

Figure 8. Processor corruption: with a lengthened floating-point pipeline, the reorganized sequence reads r1 before it is ready

Figure 9. Hardware synchronization: the dependent add r4,r5,r1 is delayed by hardware until r1 is both needed and ready

This arrangement maintains binary compatibility, guarantees system integrity and provides control for an MEM processor. The optimizing techniques that implement the software solution can still be used to make use of the delay slots; however, the optimizing routines can err on the side of performance rather than integrity, knowing that the scoreboard mechanism will ensure correct synchronization in every eventuality.
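The bit-per-register mechanism can be modelled in a few lines. This is a behavioural sketch under simplifying assumptions (one bit per register, no write-after-write handling, which is discussed later for MEM); it is not the MC88100's actual logic.

```python
class Scoreboard:
    """Behavioural sketch of an MC88100-style register scoreboard:
    one bit per register, set while a result is still outstanding."""

    def __init__(self, nregs=32):
        self.stale = [False] * nregs   # set = value still being produced

    def issue(self, dest, srcs):
        """Return True and mark the destination if the instruction can
        proceed; return False (stall) if any source is scoreboarded."""
        if any(self.stale[s] for s in srcs):
            return False
        self.stale[dest] = True
        return True

    def writeback(self, dest):
        """Result fed forward and written; the register is current again."""
        self.stale[dest] = False
```

Replaying Figure 10: fadd r1,r2,r3 issues and sets r1's bit; add r4,r5,r1 stalls at decode until the writeback clears it.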

Figure 10. Hardware scoreboarding (stage 1: fadd r1,r2,r3 decoded, scoreboard bit set; stage 2: add r4,r5,r1 decoded and checked, then delayed; stage 3: scoreboard cleared and register file updated)

382 Microprocessors and Microsystems

Adapting to MEM

The real difficulty in controlling MEM is removing or inhibiting invalid data produced at the end of every clock cycle. There are two basic approaches: prevent the execution of the offending instruction so that erroneous data is not produced, or remove it after the execution. To do so after the event is extremely complicated and requires many buffers to act as temporary stores while the execution units progress the instructions. Each unit would need buffers to hold register contents, a copy of the instruction and temporary variables. Their contents would then be inspected, decoded and the register files updated. With pipelines of equal length, the execution units can present their buffered data at the same time and thus provide an accurate snapshot of the processor at every clock cycle. Processing is simply an extra common stage within the pipeline (see Figure 11).

This job is even more complicated if the pipelines being considered are of differing length, as may be the case in a design with separate floating-point and integer units. For pipelines of unequal length, the temporary data cannot be made available within the same clock cycle and is skewed, as shown in Figure 12. The snapshot has to be delayed until unit A has stored its data. To preserve the temporary data from units B and C, their pipelines must either be stalled or the temporary data stored in a three-stage FIFO buffer so that execution can continue. In this case, the pipelines have effectively been lengthened to that of unit A. This can cause further inefficiencies with branching, etc.
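The FIFO depth follows directly from the difference in pipeline depths. A minimal simulation, with illustrative depths of six stages for unit A and three for unit B, shows the shorter unit's results queuing in a three-stage FIFO until they can join a common snapshot:

```python
from collections import deque

DEPTH_A, DEPTH_B = 6, 3      # illustrative pipeline depths, not real ones
SKEW = DEPTH_A - DEPTH_B     # cycles by which unit B finishes early

def run(cycles):
    """Simulate completions cycle by cycle; unit B's early results queue
    in a FIFO so each snapshot pairs results for the same issue cycle."""
    fifo, snapshots = deque(), []
    for t in range(cycles):
        if t >= DEPTH_B:                         # B completes issue t - DEPTH_B
            fifo.append(t - DEPTH_B)
        if t >= DEPTH_A:                         # A completes issue t - DEPTH_A;
            snapshots.append((t - DEPTH_A, fifo.popleft()))  # B's partner waited
    return snapshots, len(fifo)
```

Every snapshot pairs matching issue cycles and the FIFO settles at the three-entry skew; in effect unit B's pipeline has been lengthened to unit A's, exactly as the text observes.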

Figure 11. A post-event hardware mechanism for equal-length pipelines (execution units A to D share common stages: fetch instruction, decode instruction, execute instruction, store temporary results, determine validity, update register file and/or memory)

Figure 12. A post-event hardware mechanism for different-length pipelines (results from the shorter units B and C are skewed relative to unit A)

While such schemes are possible, the hardware would be complicated and a better approach is to prevent production of invalid data.

Software control

The software techniques described above to synchronize pipelines can be adapted to cope with multiple execution machines, as depicted in Figure 13. The machine shown has four execution units which receive four instructions per clock. The first and second multiple instructions each contain a pair of data-dependent instructions. This dependency can be removed by reorganizing as described previously: the last dependent instruction of the first pair and the first dependent instruction of the second pair are interchanged, effectively delaying the former and advancing the latter. However, this does not take into account the potential problems concerned with the execution unit pipelines. Assuming that they are three stages deep, the pairs of dependent instructions need to be separated by at least one other instruction, and so NOPs or other instructions are inserted accordingly.
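The dependency the compiler must find first can be sketched as a scan over each four-instruction issue packet; the tuple encoding (destination, source1, source2) is assumed for illustration only.

```python
def packet_conflicts(packet):
    """Return (i, j) index pairs where instruction j reads a register
    written by an earlier instruction i in the same issue packet; these
    are the pairs the reorganizer must separate. Each instruction is a
    (dest, src1, src2) tuple, an encoding assumed for this sketch."""
    pairs = []
    for i, (dest_i, _, _) in enumerate(packet):
        for j in range(i + 1, len(packet)):
            _, src1, src2 = packet[j]
            if dest_i in (src1, src2):
                pairs.append((i, j))
    return pairs
```

For a packet such as add r1,r2,r3; add r4,r5,r1; ... the scan reports the (0, 1) pair; the reorganizer then interchanges one of the pair with an independent instruction from a neighbouring packet and, for three-stage pipelines, also keeps the pair at least one packet apart, as Figure 13 shows.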

An alternative software system uses bits within the op-code to indicate to the processor that it can execute the paired instructions simultaneously without worrying about data validity or conflict. A similar technique is frequently used in digital signal processors like the Motorola DSP56000⁷ to allow simultaneous multiply accumulate operations with data transfers. The spare bits within the multiply accumulate instruction are used to encode two data moves. The calculation and data moves are then executed in parallel. A variation is used on the Intel i860 processor, where paired integer and floating-point instructions can be executed simultaneously by setting dual instruction bits. The compiler, as with the reorganizing techniques, has to recognize when such execution is valid and take responsibility for maintaining the synchronization of the pipelined execution units. Compiled code is usually arranged with interleaved integer and floating-point instructions. While altering the op-code bits simplifies the internal hardware, it does pose considerable problems for compiler technology. In practice, hand-coded assembler is often used in preference to the compiler output to achieve optimum performance. This defeats the reasons for using a high-level language, limits its use and can cause further problems when migrating to another technology with higher numbers of execution units.

Figure 13. Software reorganization to remove data dependency (shaded boxes indicate data-dependent instructions; the panels show the initial code sequence, removal of the parallel conflict and removal of the pipeline conflict over three clocks)


Adapting scoreboarding

The most promising technique appears to be a variation on the MC88100 scoreboard technique with extra bits to control the pipelines as necessary. Each instruction decode would set these bits accordingly and delay the progress of any instructions so that correct operation is maintained. An essential requirement is the scoreboarding of both the source and destination registers. This is required when multiple instructions use the same destination register and one of them is delayed due to a destination data-dependency. If such a dependency occurs, any succeeding instruction that modifies the register must be delayed until the preceding instruction has taken the data. An additional scoreboard bit can be used for this control. Figure 14 shows an example of this type of control.

The processor can execute four instructions per clock, and instructions 1-4 are fetched and start their execution on clock 1. Within the four, instructions 1 and 2 have a data-dependency, and the second instruction is delayed by the scoreboarding mechanism. This is a destination-source relationship and is similar to that described above. On the second clock, the next four instructions (5-8) are started. The delayed instruction clears the first stage within the pipeline, allowing the next instruction to proceed. Within this set, there is now a source-destination data-dependency between instructions 2 and 8. Instruction 2 is delayed to obtain the contents of register 1 from the execution of instruction 1. Instruction 8 needs to use register 1 to store the result of its calculation, but this data cannot be written to the register until instruction 2 has successfully obtained the value. Instruction 8 is therefore delayed until this is complete. If the instruction sequence were executed in a linear way on a single execution unit machine, this synchronization would be unnecessary.
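The source and destination tracking can be modelled with two counters per register. The class shape, method names and counter scheme below are illustrative assumptions, not the actual M88000 design, and a real pipeline would update them atomically with issue.

```python
class MemScoreboard:
    """Toy model of the extended scoreboard for a multiple execution
    machine: counts of outstanding writers and of stalled readers that
    have still to take a value (the source-destination case of Figure 14)."""

    def __init__(self, nregs=32):
        self.writer = [0] * nregs    # outstanding results per register
        self.waiting = [0] * nregs   # stalled readers yet to take a value

    def issue(self, dest, srcs):
        """Try to issue; a stalled instruction is recorded as a waiting
        reader of its not-yet-ready sources."""
        pending = [s for s in srcs if self.writer[s]]
        if pending:
            for s in pending:
                self.waiting[s] += 1     # this read is still owed
            return "wait-source"
        if self.writer[dest] or self.waiting[dest]:
            return "wait-dest"           # would overwrite an untaken value
        self.writer[dest] += 1
        return "issued"

    def result_ready(self, dest):
        """The producing instruction writes back its result."""
        self.writer[dest] -= 1

    def value_taken(self, src):
        """A previously stalled reader has obtained the value."""
        self.waiting[src] -= 1
```

Replaying Figure 14's sequence: instruction 1 (add r1,r2,r3) issues; instruction 2 stalls on r1; instruction 8 (add r1,...,r18) is held off r1 first by the outstanding write and then by the owed read, and may write only after instruction 2 has taken the value.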

Figure 14. A scoreboarded multiple execution machine (instructions 1, add r1,r2,r3, to 8, add r1,...,r18, issue four per clock over five clocks; instruction 2 is delayed by a destination-source dependency on instruction 1, and instruction 8 by a source-destination dependency on instruction 2)

Other more sophisticated techniques can be included. In the case of multiple instructions modifying the same destination, the register scoreboard marker may be replaced by a number indicating which instruction will actually modify the contents. The scoreboard marker will indicate which result is to be used.

Neither of these scoreboard techniques requires any software intervention to work, although, as before, software optimization can increase the performance by infilling the delay slots. The scoreboard approach gives a user-transparent interface and allows binary compatibility to be maintained.

CONCLUSIONS

Increasing clock speeds, improving instruction sets and reducing the number of clocks per instruction are three fundamental techniques that are applicable to CISC and RISC processors alike. In this respect, and in others such as system size and the applicability of faster technologies, there is little difference between the two approaches. The MEM is one development that the M88000 family will be able to exploit with greater ease than existing CISC devices. It is this development that will mark the differentiation between RISC and CISC in the future.

REFERENCES

1 MacGregor, D and Rubenstein, J 'A performance analysis of MC68020-based systems' IEEE Micro (December 1985)

2 MC88100 User's Manual (second edition) MC88100UM/rev 1, Motorola Semiconductors, Austin, TX, USA

3 AMD29000 Reference Manual, Advanced Micro Devices, Sunnyvale, CA, USA (1987)

4 MC68020 User's Manual MC68020UM/rev 3, Motorola Semiconductors, Austin, TX, USA (1989)

5 M68000 Data Sheet/D, Motorola Semiconductors, Austin, TX, USA (1985)

6 Kane, G MIPS RISC Architecture, Prentice Hall, Englewood Cliffs, NJ, USA (1988)

7 DSP56000 User's Manual DSP56000UM/rev 1, Motorola Semiconductors, Austin, TX, USA

8 i80860 Programmer's Reference Manual, Intel Corporation, Santa Clara, CA, USA (1989)

Steve Heath began his career in electronics in 1976, performing failure analysis on digital and linear ICs for Mullard (now Philips Components). In 1980, he joined Thorn EMI Defence Electronics as a reliability engineer, assessing the safety and reliability of electronic designs used in military and oil industry applications. Two years later he joined Crellon Microsystems as an applications engineer for Motorola microprocessors and, in 1984, he

moved to Motorola's field application group. He is currently specializing in microprocessor architectures, their use and development. His first book, The VMEbus User's Handbook, was published in August 1989, and a second, on RISC, CISC and DSP processors, is in preparation.
