ECE813 Design Project 4

16-Bit RISC CPU

By: Kelly Davidson Guangda Shi

Xin Zhao Junwei Zhou

Submitted to Dr. A. Mason April 22, 2002

I. INTRODUCTION

This report summarizes the results of our work in building a 16-bit RISC CPU. Our

original goal was to create an 8-bit ALU, but as the design progressed, it was decided to produce

a 16-bit ALU, which is more useful in computation. This larger design included the datapath for

a CPU containing Program Counter (PC), Memory Address Register (MAR), Memory Data

Register (MDR), Instruction Register (IR), Register File (RF) and other various components to

tie them all together. The controller of the CPU was implemented using Verilog.

The system described above was simulated in two ways. Analog simulation was used to

verify functionality on a smaller scale, and to determine the delay times of various portions of

the circuit. After determining the system characteristics such as timing information, the entire

system was verified using Verilog simulation.

The base design of this system was done using static CMOS logic. It includes layout of

basic standard cells as well as our 16-bit ALU. Our 16-bit ALU consisted of sixteen 1-bit

ALUs strung together. The ALU performed 2’s complement arithmetic in one execution step by

making use of a Controlled Adder/Subtractor (CAS) unit described later. Alternative designs for

this project focused on the ALU portion of the system. The goal of these alternative designs was

to improve speed and power consumption as set out in our proposal. The first alternative design

was to make use of a Carry Look Ahead circuit to improve the speed of the ALU and thus the

overall system. It reduced the worst-case addition delay from about 24 ns to 9 ns, an improvement of over 60%.

The most important part of our alternative design involved the adaptation of a newly

published technology. To accomplish this goal, a dynamic differential logic family called Swing

Limited Logic (SLL) was used based on a paper by Amr M. Fahim [1]. This work was published

in January of 2002 in the IEEE Journal of Solid-State Circuits. Even though it was originally

designed for a 0.35 µm feature size and 3.3 V system, it was successfully adapted and made to

work on our 0.60 µm CMOS process at 3.0 V.

Section II discusses the methodology used to design our 16-bit ALU and the incorporation

of the ALU into the 16-bit RISC CPU. In Section III, the details of our ALU and RISC CPU

designs and their simulation results are presented. The layout of the base case ALU and its post

layout simulation results are also discussed in Section III.

II. METHODOLOGY

This design posed several challenges. The first was that none of us in the group had

previous experience in the design of ALUs. This required us to do some basic research to obtain

information on the design of ALUs, the instruction sets often used, accumulators, and control

circuitry. The instruction set for the ALU ended up being determined by taking common

instructions from various 2-bit and 4-bit ALU datasheets.

The architecture of the ALU led to another design decision. There were several ways to

approach creating the ALU. Our group wanted to make use of the CAS unit so that 2’s

complement arithmetic could be performed in one operation rather than several executions

through the ALU. This led to a design in which multiple functions are executed

simultaneously and the desired output is chosen using a mux network.

The basic architecture we ended up adopting for the 16-bit RISC CPU was based on a

DLXS processor from a paper by Martin Gumm [2]. This architecture would take care of the

timing and control aspects of the design but required the complete circuit simulation to be done

digitally. As a result, digital simulation and functional coding became another challenge for our

design. There were also concerns with the idea of producing an entirely digital model of the

system since that is somewhat outside the scope of this class, and wouldn’t really verify the

analog aspects of our circuit.

This led to the creation of a smaller system that is small enough to simulate using analog

simulation, and still verify the correct operation of the ALU, latches, accumulators, time delays,

and voltage levels of the various signals. The control circuitry for this analog simulation was

simulated using stimulus files. Because the architecture of the DLXS processor required a 16-bit

ALU, registers, and various other logic units, the smaller circuit still had more than 11,000

transistors. This made the analog simulation of the circuit somewhat time consuming.

One of the biggest challenges was to make use of swing limited logic [1]. Even though

this technology has the advantage of low power consumption, it does not scale well to other supply

voltages or to older process technologies. After a long struggle with the swing limited logic, we were able

to make it work with our 0.6 µm technology, but the power gain from the limited swing was

diminished by the large transistors we had to employ to make the technology work.

III. DESIGN AND RESULTS

The 16-bit RISC CPU design is broken down into several sections. Each section will

discuss the design and results of that aspect of our system. The three different areas of our

design are the ALU, base case ALU Layout and system architecture.

ALU Design

The Arithmetic Logic Unit (ALU) designed in this project performs eight different

operations (shown in Table 1) on two 16-bit inputs. This design team investigated three

alternative designs. The base case design used a regular ripple carry

method for the addition operation. The more advanced design used the carry-look-ahead

method for carry generation in order to speed up the ALU. For the third

alternative design, swing limited differential logic was used with the goal of reducing the power

consumption of the circuit and its power-delay product.

Table 1: Operations performed by ALU

Select:
S2   S1   S0    Operation
0    0    0     A plus B
0    0    1     A minus B
0    1    0     B minus A
0    1    1     A and B
1    0    0     A or B
1    0    1     A xor B
1    1    0     Pass A
1    1    1     Pass B

Base Case Design – Ripple Carry ALU

At the heart of our ALU design is the controlled adder/subtractor (CAS) unit. This unit

can perform A+B, A-B or B-A, depending on the control instruction. The CAS is

constructed based on the principle of two’s complement method of subtraction (i.e., A minus B is

the same as A plus the complement of B then plus “1”). The two’s complement method of

subtraction is illustrated below:

Regular subtraction:

      A:  10010101
    - B:  10001010
          --------
          00001011

Two’s complement subtraction:

      A:  10010101
   + B’:  01110101
          --------
          00001010
   +  1
          --------
          00001011

As one can see from the CAS unit shown in Figure 1(a), the complement of input A or B

is done using two xor gates.

Figure 1: (a) controlled adder/subtractor. (b) 1bit ALU.

This design enables the control signals S1 and S0 to control which input bit is

complemented, and the subsequent operation is just like regular addition. Of course the extra

“1” needed for the two’s complement method must also be generated for this operation to be

successful. The extra “1” is generated by ORing the select lines S1 and S0 and using

the result as Ci (the carry in) of the first bit of the 16-bit implementation. This way, if

subtraction is desired, either S1 or S0 must be “1” according to Table 1 and Ci will become

“1” as required by the operation. If addition is required, the result of the or gate will

be “0”. After completing the controlled adder/subtractor unit, the 1-bit ALU was constructed as

shown in Figure 1(b). The 1-bit inputs A and B are fed to all 8 operation units and the

final results are fed to the multiplexer (MUX) for the selection of the correct result.
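
The CAS and the 1-bit ALU slice described above can be sketched in Verilog roughly as follows. This is an illustrative model, not the schematic from Figure 1: the module and port names (`cas1`, `alu1`, `ax`, `bx`) are hypothetical, and it assumes that S0 complements B and S1 complements A, which is consistent with the select codes in Table 1 but is not stated explicitly in the text.

```verilog
// 1-bit controlled adder/subtractor (CAS): a full adder whose operands
// pass through XOR gates so that either one can be complemented.
// Assumption: S0 complements B (A minus B), S1 complements A (B minus A).
module cas1 (
    input  wire a, b, ci, s1, s0,
    output wire sum, co
);
    wire ax = a ^ s1;          // A, or A' when S1 = 1
    wire bx = b ^ s0;          // B, or B' when S0 = 1
    assign sum = ax ^ bx ^ ci;
    assign co  = (ax & bx) | (ci & (ax ^ bx));
endmodule

// 1-bit ALU slice: all functions are computed in parallel and a
// multiplexer selects the result named by {S2, S1, S0} (Table 1).
module alu1 (
    input  wire a, b, ci,
    input  wire s2, s1, s0,
    output reg  f,
    output wire co
);
    wire sum;
    cas1 u_cas (.a(a), .b(b), .ci(ci), .s1(s1), .s0(s0),
                .sum(sum), .co(co));

    always @(*) begin
        case ({s2, s1, s0})
            3'b000, 3'b001, 3'b010: f = sum;   // A+B, A-B, B-A
            3'b011:  f = a & b;                // A and B
            3'b100:  f = a | b;                // A or B
            3'b101:  f = a ^ b;                // A xor B
            3'b110:  f = a;                    // Pass A
            default: f = b;                    // 3'b111: Pass B
        endcase
    end
endmodule
```

In a 16-bit arrangement, the `ci` input of bit 0 would be driven by `s1 | s0`, which supplies the extra “1” for the subtraction cases and “0” for addition.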

After completing the design of a 1-bit ALU unit, the 16-bit ALU is designed simply by

connecting 16 of the 1-bit ALUs in series and wiring all the proper inputs (See Figure 2). In

order to establish a baseline on the performance of the 16-Bit ALU circuit, the “Ripple Carry”

technique was used for the construction of the ALU. Even though this design method is easy to

understand, it is very slow compared to more advanced techniques such as the carry-look-ahead

method for the adder/subtractor.

Figure 2: Complete block level schematic of the 16-Bit ALU.

As discussed in the controlled adder/subtractor section, the or gate used in the 16-Bit

ALU simply generates the additional “1” required for the 2’s complement subtraction.

Simulation Results

The performance of the ripple carry ALU is shown in Figure 3. The outputs “F<15:0>”,

carry out “Co” and control signals “S2, S1, S0” are shown. All eight operations of the ALU

were verified and the timing of the ALU was determined from this simulation. According to the

simulation result, the addition/subtraction operation took the longest time, about 24 ns. The

simulation was done using the following two 16-bit inputs:

A = 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1

B = 1 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1

As indicated by the dotted line on Figure 3, the time delay for the addition operation grows

roughly linearly with bit position because the carry needs to ripple through all 16 stages before reaching the final

answer. The maximum time delay, at the most significant output bit F<15>, is about 24 ns

for the entire operation (the select line S0 has a period of 60 ns). An example of the A and B

operation (S2S1S0=011) is also shown in Figure 3 for illustration.

Figure 3: Analog simulation result for 16-Bit ALU (Ripple Carry).
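
To make the ripple behaviour concrete, the following is a small behavioural sketch of the 16-bit adder/subtractor path together with a testbench that applies the two input words used in the simulation. The module names (`ripple_cas16`, `ripple_cas16_tb`) are hypothetical, the model carries no timing information, and it assumes S0 complements B and S1 complements A as suggested by Table 1.

```verilog
// Behavioural model of the 16-bit ripple-carry adder/subtractor path.
// Assumption: S0 complements B and S1 complements A (per Table 1).
module ripple_cas16 (
    input  wire [15:0] a, b,
    input  wire        s1, s0,
    output reg  [15:0] f,
    output reg         co
);
    integer i;
    reg c, ax, bx;
    always @(*) begin
        c = s1 | s0;                  // extra "1" for two's complement
        for (i = 0; i < 16; i = i + 1) begin
            ax   = a[i] ^ s1;
            bx   = b[i] ^ s0;
            f[i] = ax ^ bx ^ c;
            c    = (ax & bx) | (c & (ax ^ bx));  // carry into next bit
        end
        co = c;
    end
endmodule

module ripple_cas16_tb;
    reg  [15:0] a, b;
    reg         s1, s0;
    wire [15:0] f;
    wire        co;

    ripple_cas16 dut (.a(a), .b(b), .s1(s1), .s0(s0), .f(f), .co(co));

    initial begin
        #1;                                // let the always block start waiting
        a  = 16'b0110_1111_0111_0011;      // A from the simulation
        b  = 16'b1001_0000_1000_1101;      // B from the simulation
        s1 = 1'b0; s0 = 1'b0;              // A plus B
        #1 $display("A+B: co=%b f=%h", co, f);
        s0 = 1'b1;                         // A minus B
        #1 $display("A-B: co=%b f=%h", co, f);
        $finish;
    end
endmodule
```

With these inputs the model prints `A+B: co=1 f=0000` and `A-B: co=0 f=dee6`. The two words sum to exactly 2^16: bit 0 generates a carry and bits 1 through 15 each propagate it, so the addition exercises the full 16-stage carry chain.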

Alternate Design – Carry-Look-Ahead ALU

To improve the performance of our ALU design, a 4-bit carry-look-ahead circuit was

added to the ALU. This addition reduces the time delay of the carry and eliminates the series

connection between the individual 1-bit ALU units within each 4-bit group.

Figure 4: (a) 4-bit carry-look-ahead unit. (b) 4-bit carry-look-ahead ALU.

In order to keep the function of the controlled adder/subtractor unit, the CLA unit was

designed after the xor operation on the inputs (see Figure 4(a)). After the CLA unit was

designed, the CLA and four 1-bit ALUs were connected together to form a 4-bit

carry-look-ahead ALU (shown in Figure 4(b)). The 16-bit ALU was then constructed by connecting

four 4-bit carry-look-ahead ALUs in series (see Figure 5).

Figure 5: Schematic of a 16-Bit Carry-Look-Ahead ALU.
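
The carry-look-ahead idea can be illustrated with the standard generate/propagate equations for one 4-bit group. The sketch below is a textbook formulation rather than a transcription of Figure 4(a); the module name `cla4` and the port names are hypothetical, and `ax`/`bx` stand for the operand bits after the XOR stage of the controlled adder/subtractor.

```verilog
// 4-bit carry-look-ahead unit (textbook generate/propagate form).
// ax, bx: operand bits taken after the CAS XOR stage (illustrative names).
module cla4 (
    input  wire [3:0] ax, bx,
    input  wire       c0,     // carry into the group
    output wire [3:0] c,      // carry into each bit position
    output wire       c4      // carry out of the group
);
    wire [3:0] g = ax & bx;   // generate: this bit creates a carry
    wire [3:0] p = ax ^ bx;   // propagate: this bit passes a carry on

    assign c[0] = c0;
    assign c[1] = g[0] | (p[0] & c0);
    assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0);
    assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                       | (p[2] & p[1] & p[0] & c0);
    assign c4   = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                       | (p[3] & p[2] & p[1] & g[0])
                       | (p[3] & p[2] & p[1] & p[0] & c0);
endmodule
```

Each carry is a two-level function of the group inputs, so no carry waits on a neighbouring bit inside the group. Chaining four such groups (`c4` of one feeding `c0` of the next) still ripples between groups, but through four stages instead of sixteen.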

Simulation Results

The stimulus file used for the simulation is the same as in the base case, with a slight

modification to the select-line timing. The outputs are identical to our base case simulation, but

with a much shorter computation delay of 9 ns (see Figure 6).

Figure 6: Analog simulation result for 16-Bit Carry-Look-Ahead ALU.
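
As a quick check of the improvement, the two measured worst-case delays (24 ns for ripple carry, 9 ns for carry-look-ahead) can be compared directly. The symbols $t_{\mathrm{ripple}}$ and $t_{\mathrm{CLA}}$ are introduced here only for this calculation.

```latex
\[
\frac{t_{\mathrm{ripple}} - t_{\mathrm{CLA}}}{t_{\mathrm{ripple}}}
  = \frac{24\,\mathrm{ns} - 9\,\mathrm{ns}}{24\,\mathrm{ns}}
  = 0.625,
\qquad
\frac{t_{\mathrm{ripple}}}{t_{\mathrm{CLA}}}
  = \frac{24}{9} \approx 2.7
\]
```

That is, the delay is reduced by 62.5%, or equivalently the carry-look-ahead ALU is roughly 2.7 times faster on this operation.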

Alternate Design – Swing Limited Logic ALU

The final alternative design of the 16-bit ALU was to implement swing limited logic. A

sample SLL nor gate is shown in Figure 7. Its basic operation is explained as follows. On the

precharge cycle the clock is high, the outputs are tri-stated, and the nMOS transistor M10

pulls the Comp node to zero. This will turn on the two pMOS transistors P1 and P0 to pull up

PreY and PreYbar to Vdd. The bulks of the pMOS transistors P0, P1, P3 and P4 are tied to Vcc, which

for our design was set at 5 V. This raises their threshold voltage so that Comp will shut off

the precharge transistors at a lower voltage.

As the clock ends the precharge cycle and starts to fall low, transistors M5 & M10 in the

inverter are both on for a small period of time. Either one of the nMOS ‘0’ or ‘1’ trees will be

on, allowing that side to discharge. Due to the presence of M5, the low voltage level is only

around 1.8V. This voltage depends greatly on the W/L ratio of the transistors in the short-circuit

current path (path to ground). At the same time the tri-state output gates also start to turn on

allowing Y and Ybar to be transmitted to the succeeding gates. Comp will only rise to

Vdd-Vtn as the clock is inverted through the bottom inverter, but with the higher threshold voltage for

the precharge transistors, it is enough to turn them off, so they quit providing a pull up current to

nodes PreY and PreYbar. This design also allows the clock swing to be reduced to the range 0 to

Vdd-Vtn, which is another energy saving feature. The node Comp can also be used as a clock signal

to the next gate. Unfortunately, the power consumption of this gate, averaged over a 30 ns

period, was 37.51 µW versus 11.24 µW for a static CMOS nor gate. However, it does provide

complementary outputs.

Figure 7: SLL nor/or gate.

The paper [1] notes that, due to the stacked pMOS-nMOS-pMOS structure, this

logic family does not scale well with supply voltage. However, after much effort the swing

limited logic was successfully implemented in our design. Basic gates that were needed for the

ALU were constructed, and then they were put together to form a CAS (Figure 8(a)) and

eventually the 1-bit ALU (Figure 8(b)).

Figure 8: (a) SLL CAS. (b) SLL 1-bit ALU.

For this ALU, instead of a typical 2-to-1 mux, a 4-to-2 mux was developed that

consists only of an inverter and pMOS pass gates (Figure 9(a)). The nMOS was not necessary since

the voltage is in the ~2.0 – 2.8 V range. Finally, a level converter was constructed for the output

of the ALU to convert the SLL voltage level to the static CMOS voltage level. This basically

consisted of a sense circuit that would cause the output voltage to be pulled all the way high or

low. This level converter was also sized for speed, giving a delay of 1.5 ns. It is

shown in Figure 9(b).

Figure 9: (a) SLL mux 4:2. (b) SLL level converter.

The 1-bit ALUs were then connected together to form a 16-bit ALU. The 1-bit ALU

was tested for all possible inputs, to verify its operation using analog simulation. The 16-bit

ALU was also tested to verify proper results for every operation.

Due to the nature of dynamic differential logic, where a succeeding gate needs to wait for

the evaluation cycle of the previous gate before it can switch to evaluate, there is a delay of several

gates to get the result for 1 bit. In a 16-bit ripple carry adder this amounts to about 120 ns for the

result to be fully computed in the worst case with a carry rippling at every bit. While it is a little

slow, it does use a lower voltage swing for the clock and the input/output signals. This logic could be

much faster if a 16-bit carry-look-ahead unit were implemented.

Simulation

Simulations of the 1-bit ALU are shown below in Figure 10. They include the output after

going through the SLL to CMOS level converter. The figure is annotated to show the state of the

inputs going in, the operation being performed, and the states of the outputs at various points. It

takes almost 6 ns for the signal to propagate through on an addition or subtraction operation. In

the figure, the Ci bit was set to 1. A 16-bit version of the ALU was simulated as well with

correct results.

Figure 10: (a) SLL 1-bit ALU simulation results (A+B, A-B, B-A, A and B), showing the Y and Ybar outputs.

Figure 10: (b) SLL 1-bit ALU simulation results (A or B, A xor B, Pass A, Pass B), showing the Y and Ybar outputs.

Base Case ALU Layout

Layout is an important part of the project. We reused some basic gates (such as nor,

nand, buffer9) to reduce the design work. In addition, an xor2 gate, an xor3 gate and a

3-input nor gate were completed and tested. Minimum-size transistors were used to

reduce the area of the logic gates.

In order to verify the design from the perspective of the layout, a fully functional 1-bit

ALU layout was completed and simulated. The 16-bit ALU layout uses 16 instances of

the 1-bit ALU cascaded together. Additional buffers were used to improve the driving ability of

the select signals. The 1-bit ALUs are laid out in a square shape,

as is the 16-bit ALU (see Figure 11). All the input and output ports are connected to the

edge of the instance for the convenience of cascading and connecting to other parts. Completing

the 1-bit ALU and the 16-bit ALU was challenging since the objective was to

minimize the area while maintaining the performance. The following table shows the dimensions of some

layouts.

Table 2: Sizes of various layouts

Parts          Height (µm)    Width (µm)
3-input NOR    15.60          0.35
2-input XOR    15.60          24.75
3-input XOR    15.45          49.05
1-bit ALU      110.1          99.150
16-bit ALU     606.3          497.85

Figure 11: 16-Bit ALU with ripple carry logic.

The simulation for the 1-bit ALU and the 16-bit ALU was also performed. The

simulation results (see Figure 12) verified the design, but the computation delay is longer

than in the schematic simulation (40 ns for addition versus 24 ns). This is possibly due to the larger extracted parasitic

capacitance and resistance, since there are many wires and transistors in the large layout.

Figure 12: Simulation result from layout of 16-bit ALU.

System Architecture

In this project, the 16-bit RISC CPU was based on the architecture of a DLXS processor.

It was chosen because of its simple instruction set and its easily understandable architecture.

DLXS consists of the controller and the datapath. The controller generates the signals to control

the data flow and the datapath executes all operations on the given data set.

Figure 13: DLXS Architecture. (Block diagram labels: DLXS, Controller, Datapath, Reset, Phi1, Phi2, RW, address, memory.)

Datapath

The schematic of the datapath is shown in Figure 14. The datapath contains all registers, the

ALU and the internal data buses.

Figure 14: Datapath Schematic.

The fundamental operation of the datapath is reading operands from the register file,

operating on them in the ALU, and then writing the result back to the register file or to the

various control registers. Three internal buses are used in the datapath: source bus 1 (S1), source

bus 2 (S2) and the destination bus (Dest). The controller selects the registers from which the data is

loaded and to which it is written. It should be noted that the ALU is the only path between the source buses

and the destination bus. The pass operation within the ALU moves the data from the source

bus to the destination bus without any modification.

Controller

The key component of the controller is the finite state machine, which is used to generate

a sequence of control signals necessary for the data flow in the datapath. Figure 15 below

illustrates some of the outputs from the controller. Sbus_ctrl controls the read enable signals

of the registers connected to the source bus. Dbus_ctrl controls the write enable signals of the

registers on the destination bus. Dbus_ctrl is gated by phi2 while the other control signals are

synchronized with phi1. RS1, RS2 and RD are the specific registers in the register file. Taking

the add operation as an example, the sum of the data in RS1 and RS2 is computed by the ALU and is

stored in RD.

Figure 15: Controller structure. (Block diagram labels: Instruction register[15:0], Phi1, Phi2, FSM, AND logic, Sbus_ctrl[3:0], Dbus_ctrl[3:0], Alu_ctrl[2:0], Memory_ctrl, RS1 address[2:0], RS2 address[2:0], RD_address[2:0].)

In this project, the controller is implemented in Verilog. We used the Cadence Logic

Verification Tool to verify the logic of the controller and the datapath. To combine the

Verilog modules and the schematic modules, a Verilog file was created for each standard

cell used in the registers and ALU. Taking the Mux21 cell as an example, its Verilog model is:

    module mux21 (Y, A, B, S);
        input A, B, S;
        output Y;
        reg Y;
        // Combinational 2-to-1 mux: Y follows A when S is 0, otherwise B.
        always @(A or B or S)
            if (S == 1'b0) Y = A;
            else Y = B;
    endmodule

Apart from the controller, Verilog is also used to simulate the behavior of the memory

where the instructions and data are located. After the controller is reset, it will control the

datapath to fetch an instruction from the memory and point the program counter to the next

instruction. For simplicity, only a few of the DLXS instructions are implemented, as follows:

Table 3: Sample of DLXS instructions

Opcode   Instruction   Operands                       Operation
0000     add           rs1[3:0], rs2[3:0], rd[3:0]    rd = rs1 + rs2
0001     sub           rs1[3:0], rs2[3:0], rd[3:0]    rd = rs1 - rs2
0010     and           rs1[3:0], rs2[3:0], rd[3:0]    rd = rs1 and rs2
0011     or            rs1[3:0], rs2[3:0], rd[3:0]    rd = rs1 or rs2
0100     xor           rs1[3:0], rs2[3:0], rd[3:0]    rd = rs1 xor rs2
1000     load          rd[3:0], #addr[7:0]            rd = memory[addr]
1001     write         rs, #addr[7:0]                 memory[addr] = rs1
1010     goto          #addr[11:0]                    pc = #addr
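
The instruction formats in Table 3 suggest a simple field decoder. The sketch below is only one possible reading: the report does not give the bit positions, so it assumes the fields are packed most-significant first in the order the table lists them (opcode, then registers or address). The module name `ir_decode` and all port names are hypothetical.

```verilog
// Field decoder for the instruction subset in Table 3.
// Assumption: fields are packed MSB-first in the order listed in the table,
// i.e. {opcode, rs1, rs2, rd}, {opcode, reg, addr[7:0]} or {opcode, addr[11:0]}.
module ir_decode (
    input  wire [15:0] ir,
    output wire [3:0]  opcode,
    output wire [3:0]  rs1, rs2, rd,   // register fields of ALU instructions
    output wire [3:0]  ls_reg,         // rd of load / source register of write
    output wire [7:0]  mem_addr,       // #addr[7:0] of load and write
    output wire [11:0] jump_addr,      // #addr[11:0] of goto
    output wire        is_alu, is_load, is_write, is_goto
);
    assign opcode    = ir[15:12];
    assign rs1       = ir[11:8];
    assign rs2       = ir[7:4];
    assign rd        = ir[3:0];
    assign ls_reg    = ir[11:8];
    assign mem_addr  = ir[7:0];
    assign jump_addr = ir[11:0];

    // Opcodes 0000..0100 are add, sub, and, or, xor.
    assign is_alu   = (opcode <= 4'b0100);
    assign is_load  = (opcode == 4'b1000);
    assign is_write = (opcode == 4'b1001);
    assign is_goto  = (opcode == 4'b1010);
endmodule
```

Each format fills exactly 16 bits (4+4+4+4, 4+4+8, or 4+12), which is what makes this fixed slicing possible regardless of the exact field order the authors chose.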

State transition diagram

The instructions of the DLXS can be broken into five basic steps: fetch, decode, execute,

memory access and write result. No pipeline was implemented in this structure and each step

may take several clock cycles. The state transitions of the controller FSM are shown below.

States and actions shown in the diagram:

State: Fetch. Action: read memory.

State: decode. Action: save the instruction in IR; decode the instruction; PC <= PC + 2.

State: execute (add, subtract, AND, or, XOR). Action: latch the data on source bus to ALU.

State: write_back. Action: save the data to register C.

State: load_addr (load). Action: MAR = IR[11:0].

State: load_data. Action: pass the data in MDR to the destination bus.

State: write_addr (write). Action: MAR = IR[11:0].

State: write_data. Action: write enable = ‘1’.

State: loadPC (goto). Action: PC = IR[11:0].

Transition labels: mem ready, mem not ready, load, write, goto.

Figure 16: State transition diagram of the controller FSM.
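
A state machine with these states might be coded along the following lines. This is a sketch only: the state names follow Figure 16, but the exact arcs are not recoverable from the text, so the transitions below (in particular which states wait on the memory and where each path returns to fetch) are assumptions. The module name `ctrl_fsm`, the state encoding and the port names are hypothetical, and the control outputs are omitted.

```verilog
// Next-state logic for a controller FSM with the states of Figure 16.
// Assumed arcs: fetch, load_data and write_data wait for mem_ready;
// every instruction path eventually returns to fetch.
module ctrl_fsm (
    input  wire       phi1, reset, mem_ready,
    input  wire [3:0] opcode,
    output reg  [3:0] state
);
    localparam FETCH      = 4'd0, DECODE     = 4'd1, EXECUTE    = 4'd2,
               WRITE_BACK = 4'd3, LOAD_ADDR  = 4'd4, LOAD_DATA  = 4'd5,
               WRITE_ADDR = 4'd6, WRITE_DATA = 4'd7, LOAD_PC    = 4'd8;

    always @(posedge phi1 or posedge reset) begin
        if (reset)
            state <= FETCH;
        else case (state)
            FETCH:      state <= mem_ready ? DECODE : FETCH;
            DECODE:     case (opcode)
                            4'b1000: state <= LOAD_ADDR;   // load
                            4'b1001: state <= WRITE_ADDR;  // write
                            4'b1010: state <= LOAD_PC;     // goto
                            default: state <= EXECUTE;     // add, sub, and, or, xor
                        endcase
            EXECUTE:    state <= WRITE_BACK;
            WRITE_BACK: state <= FETCH;
            LOAD_ADDR:  state <= LOAD_DATA;
            LOAD_DATA:  state <= mem_ready ? WRITE_BACK : LOAD_DATA;
            WRITE_ADDR: state <= WRITE_DATA;
            WRITE_DATA: state <= mem_ready ? FETCH : WRITE_DATA;
            LOAD_PC:    state <= FETCH;
            default:    state <= FETCH;
        endcase
    end
endmodule
```

Because there is no pipeline, one instruction occupies the machine from fetch until it returns to fetch, and the wait states stretch that interval for as long as the memory is not ready.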

The dataflow of the load instruction after the processor was reset is explained below.

When the reset signal goes high, the program counter (PC) and other registers are set to 0.

In the next clock cycle, the controller goes to the fetch state and PC is connected to the memory

address bus by mux1. The controller will wait until the memory is ready and the instruction data

is available on the data bus. Then the load instruction is written to the instruction register (IR)

by setting the IR write enable signal. In the next step, the offset in the load instruction is passed

to the memory address register (MAR). Because the ALU is the only path between the source bus

and the destination bus, we need to set the ALU operation to the pass function. Then the

controller goes to the load_data state. MAR instead of PC will be selected to connect to the

memory address bus. When the memory is ready, the data loaded from the memory will be

stored in the memory data register (MDR). In the next step, the data in MDR will be passed

through the ALU and saved in register C.

To test the logic of the controller and the datapath, we calculate a sequence of numbers

defined by the function:

$$f_{n+2} = f_{n+1} + f_n \quad (n \ge 1), \qquad f_1 = 1, \quad f_2 = 2$$

The sequence of numbers should be (1, 2, 3, 5, 8, 13, 21, 34, ...). The program and

the data are stored in the memory described by Verilog.
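
For checking the simulated CPU against this sequence, a small reference generator can print the expected 16-bit values in hexadecimal. This is a hypothetical helper (`fib_reference` is not part of the reported design); it simply evaluates the recurrence with f1 = 1 and f2 = 2.

```verilog
// Reference generator for the expected test sequence:
// each term is the sum of the previous two, starting from 1 and 2.
module fib_reference;
    reg [15:0] f_prev, f_curr, f_next;
    integer n;

    initial begin
        f_prev = 16'd1;                 // f1
        f_curr = 16'd2;                 // f2
        $display("%h", f_prev);
        $display("%h", f_curr);
        for (n = 3; n <= 8; n = n + 1) begin
            f_next = f_prev + f_curr;   // next term of the sequence
            $display("%h", f_next);
            f_prev = f_curr;
            f_curr = f_next;
        end
    end
endmodule
```

Run in a simulator, this prints 0001, 0002, 0003, 0005, 0008, 000d, 0015 and 0022, one value per line, which are the first eight terms (1 through 34) in the same 16-bit hexadecimal form as the values written to register C.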

Figure 17: (a) control signal for the program.

Figure 17: (b) data bus of the datapath.

The control signal and the data bus waveforms are shown in Figure 17. The IR

contains the instruction fetched from the memory. Cout in the data bus waveform is the data

written to register C in the datapath. As shown in the figure, the data sequence of Cout is

0x0001, 0x0002, 0x0003, 0x0005, 0x0008, 0x000D, ..., which is the same as the sequence above.

The schematic layout of the RISC CPU with simulated memory block is shown in Figure 18.

Figure 18: 16-Bit RISC CPU with memory block (blocks shown: Controller, Memory, Datapath).

Analog Simulation of the Datapath

To verify that the analog circuit was correct, a “small” analog model was constructed that

consisted of the base 16-bit ALU, the register file, and the registers in between. This “smaller”

system still consisted of over 11,000 transistors. The system had five different styles of registers

composed of D Flip Flops, some with output enables as well as latch enables, and some with

dual outputs. The simulation verifies the correct performance of the ALU and its interaction

with the registers. The result of the simulation is shown in Figure 19 below. It takes 40 ns for the

data to go completely around the loop. The data starts out from some constants at A & B busses.

We are looking at an addition function and specifically at bit 3 in this simulation. BusL1 and

BusL2 hold the data that enters the ALU, and C has the data as it comes out.

Figure 19: Analog Simulation of the data busses of the datapath.

CONCLUSIONS

In this project, our team has completed three alternative designs of a 16-bit ALU for use

in a RISC CPU. The implementation of carry-look-ahead logic in our second alternative ALU

design produced over 60% improvement in computation speed over the base case. All three

designs of our ALU performed 2’s complement subtraction in one pass through the ALU. The

most significant accomplishment of the design project is the adaptation of the swing limited logic

in our ALU design. This team was able to make the logic work for the technology file given

even though this logic was designed for a much more advanced technology. We were even able to

shorten the clock cycle for precharge and evaluate to 3 ns, which is very close to the reference

paper. However, this design didn’t produce the power savings promised in the reference paper,

probably due to the larger transistors that we had to use in building the basic gates. The 16-bit

CPU was then constructed from all of the various parts built in this project including the ALU,

registers, buffers, controller, memory and other various pieces. The 16-Bit RISC CPU was simulated

and verified to be functional as designed using the Verilog simulator in Cadence.

REFERENCES

[1] Amr M. Fahim, “Low-Power High-Performance Arithmetic Circuits and Architectures”, IEEE Journal of Solid-State Circuits, Vol. 37, No. 1, pp. 90-94, January 2002.

[2] Martin Gumm, “VHDL-Modeling And Synthesis of The DLXS RISC processor”, University of Stuttgart, 1995.
