An Introduction to VLSI Processor Architecture for GaAs


MICROPROCESSORS

DARPA EYES 100-MIPS GaAs CHIP FOR STAR WARS

PALO ALTO

For its Star Wars program, the Department of Defense intends to push well beyond the current limits of technology. And along with lasers and particle beams, one piece of hardware it has in mind is a microprocessor chip having as much computing power as 100 of Digital Equipment Corp.’s VAX-11/780 superminicomputers.

One candidate for the role of basic computing engine for the program, officially called the Strategic Defense Initiative [ElectronicsWeek, May 13, 1985, p. 28], is a gallium arsenide version of the Mips reduced-instruction-set computer (RISC) developed at Stanford University. Three teams are now working on the processor. And this month, the Defense Advanced Research Projects Agency closed the request-for-proposal (RFP) process for a 1.25-µm silicon version of the chip.

Last October, Darpa awarded three contracts for a 32-bit GaAs microprocessor and a floating-point coprocessor. One went to McDonnell Douglas Corp., another to a team formed by Texas Instruments Inc. and Control Data Corp., and the third to a team from RCA Corp. and Tektronix Inc. The three are now working on processes to get useful yields. After a year, the program will be reduced to one or two teams. Darpa’s target is to have a 10,000-gate GaAs chip by the beginning of 1988.

If it is as fast as Darpa expects, the chip will be the basic engine for the Advanced Onboard Signal Processor, one of the baseline machines for the SDI. “We went after RISC because we needed something small enough to put on GaAs,” says Sheldon Karp, principal scientist for strategic technology at Darpa. The agency had been working with the Motorola Inc. 68000 microprocessor, but Motorola wouldn’t even consider trying to put the complex 68000 onto GaAs, Karp says.

A natural. The Mips chip, which was originally funded by Darpa, was a natural for GaAs. “We have only 10,000 gates to work with,” Karp notes. “And the Mips people had taken every possible step to reduce hardware requirements. There are no hardware interlocks, and only 32 instructions.”

Even 10,000 gates is big for GaAs; the first phase of the work is intended to make sure that the RISC architecture can be squeezed into that size at respectable yields, Karp says.

Mips was designed by a group under John Hennessy at Stanford. Hennessy, who has worked as a consultant with Darpa on the SDI project, recently took the chip into the private sector by forming Mips Computer Systems of Mountain View, Calif. [ElectronicsWeek, April 29, 1985, p. 36]. Computer-aided-design software came from the Mayo Clinic in Rochester, Minn.

The silicon Mips chip will come from a two-year effort using the 1.25-µm design rules developed for the Very High Speed Integrated Circuit program. (The Darpa chip was not made part of VHSIC in order to open the RFP to contractors outside that program.)

Both the silicon and GaAs microprocessors will be full 32-bit engines sharing 90% of a common instruction core. Pascal and Air Force 1750A compilers will be targeted for the core instruction set, so that all software will be interchangeable.

The GaAs requirement specifies a clock frequency of 200 MHz and a computation rate of 100 million instructions per second. The silicon chip will be clocked at 40 MHz. Eventually, the silicon chip must be made radiation-hard; the GaAs chip will be intrinsically rad-hard.

Darpa will not release figures on the size of its RISC effort. The silicon version is being funded through the Air Force’s Air Development Center in Rome, N.Y.

–Clifford Barney

The GaAs chip will be clocked at 200 MHz, the silicon at 40 MHz

Reprinted with permission ElectronicsWeek/May 20, 1985

Figure 1.1.a. A brochure about RCA’s 32-bit and 8-bit versions of the GaAs RISC/MIPS processor, realized as a part of the “MIPS for Star Wars” project.


Phases of a Well-Structured VLSI Design

1. Generation of candidate architectures with approximately the same VLSI area.
2. Comparison of candidate architectures, from the point of view of the compiled HLL code speed.
3. Selection of one candidate architecture, and finalization of its schematics.
4. Design of the VLSI chip:
   a. Schematic capture
   b. Logic and timing testing
   c. Placement and routing
5. Generation of the mask.
6. Chip fabrication, etc...


Typical Development Phases for One 32-bit Microprocessor on a VLSI Chip

(or about the development of DARPA's 32-bit RISC MIPS processors in GaAs and silicon)

1. Announcement of project requirements (on 1.1.1984.)
   a. Type of the architecture (SU-MIPS)
   b. Maximal on-chip transistor count (30K)
   c. Detailed specification of the assembly language (Core-MIPS)
   d. A set of benchmark programs typical of the end-user application (13)

   Three competitors selected by 12.13.1984.
   a. McDonnell Douglas
   b. CDC + TI
   c. RCA (Purdue + TriQuint)


2. In-house research by the three competitors (till 12.31.1985.)

a. Generation of several candidate architectures under 30K transistors.

b. Design of an ENDOT (isp') simulator of all candidate architectures (why isp'?).

c. All candidate architectures are ranked according to the above mentioned benchmark programs.

d. Reasons for high/low ranking of specific candidate architectures are analysed, and the best candidate architectures are modified to become better. The final architecture is determined and "frozen" after several iterations.

Detailed RTL design is completed, and it is proven that the total transistor count is below 30K.


3. Decision-making at the sponsor side (by 1.1.1986.)

a. Final architectures of all competitors are ranked (using the isp' simulators and the initially provided benchmarks).

b. A subset of competitors is selected for further financing; the others are offered the option of staying in the competition with their own financing.

c. All those that stay in competition are shown all reports generated (by others) till that point.


4. In-house development by the three competitors (till 12.31.1986.)

a. Improvements are added, after the solutions of the competition are reviewed, and their impact is verified with isp’ simulation.

b. The architecture is frozen, forever.

c. The RTL design is redone and frozen.

d. The appropriate semi-custom standard-cell family is selected, and the gate level design is completed. The standard-cell family choices, in the project which is the subject of this presentation:

   The 1 micron E/D-MESFET GaAs
   The 1.25 micron SOS-CMOS Si

e. The completed gate level (GTL) design contains only the elements of the cells from the selected family (which includes the input, output, and input/output pads).


f. The gate level design is entered into a computer, using one of the following methods:

   Graphic entry
   HDL based entry
   Logic equation entry
   State machine entry
   Direct entry of the net-list, using a text editor

   Except in the last case, the net-list (needed for further work) is obtained using the appropriate translator.

g. The net-list is tested (logic and timing), using an appropriate testing program (LOGSIM). If errors are found, the work iterates back, as needed.

h. The net-list is treated by an appropriate placement and routing program (MP2D). No timing errors (guaranteed) after the chip is fabricated! Logic errors are possible after the chip is fabricated. The two major output files:

   Artwork file for visual analysis (for printer or plotter)
   Fab file (for shipment to a chip foundry, by regular mail or email)

   At the chip foundry, the fab file is analysed, and each standard cell is substituted with its full-custom equivalent (details are typically confidential).


5. Further narrowing down of the sponsored competition, and widening up of the support technology (by 1.1.1987.)

a. Only a subset of the sponsored competition is given further support for fabrication of a prototype at a lower-than-nominal speed.

b. More funding made available for R&D in both, semiconductor and packaging technologies.

c. More funding made available for the Core-MIPS translators (for the MC680x0 and the 1750A assembly languages) and compilers (for ADA and C).


6. Prototype fabrication (by 12.31.1987.)
7. Zero series at a still-lower-than-nominal speed (by 12.31.1988.)
8. Commercial series at the nominal speed (by 12.31.1989.)
9. The US epilogue!
10. The rest-of-the-world epilogue!


The ENDOT Package by TDT

1. First, the appropriate files are formed. In the most general case:
   a. One or more .isp (isp') files (different names; same extensions)
   b. One .t (topology) file (trivial if one .isp file; complex if many .isp files)
   c. One .m (meta-micro) file (one jumbo case statement)
   d. One .i file (information related to linking and loading)
   e. One or more .b (benchmark) files (any extension allowed)

   Only this, and nothing more! [Poe66]

2. Second, the formed files are treated with appropriate tools:
   a. Hardware tools
   b. Software tools
   c. Postprocessing and utility tools
   Finally, the simulator is completed.

3. Third, the simulator is run, and the statistics about the analyzed architecture(s) are collected.

4. Fourth, if needed, a silicon compiler is run, etc...


ENDOT

(1) Hardware Tools
    (1.1) ISP' Language
    (1.2) ISP' Compiler - ic
    (1.3) Topology Language
    (1.4) Ecologist - ec
    (1.5) Simulation Command Language
    (1.6) Simulator - n2
(2) Software Tools
    (2.1) Meta-assembler - micro
    (2.2) Meta-loader - the linker/loader
        (2.2.1) Interpreter - inter
        (2.2.2) Allocator - cater
    (2.3) Minor programs
        (2.3.1) mdump
        (2.3.2) merge
        (2.3.3) mas = micro + cater
        (2.3.4) mkmem
(3) Postprocessing & Utility Tools
    (3.1) Statements counter - coverage
    (3.2) General purpose post-processor - gpp
    (3.3) N.2 help utility - nhelp
    (3.4) Build utility - build
    (3.5) VHDL translator - icv


THE N.2 DESIGN PROCESS

Step 1: Idea!!!
Step 2: Hardware (and Software) design
Step 3: Simulation
Step 4: Analysis
Step 5: IF design <> ok THEN GOTO Step 2
Step 6: End

With N.2 your design iterations become painless!!!


HARDWARE TOOLS

ISP' language

Purpose: DESCRIPTION OF THE HARDWARE SYSTEMS

ISP' program:

(1) Declaration section
(2) Behavior section


Declaration section:
- CONTAINS STRUCTURE DECLARATIONS.
- STRUCTURES: ALL ISP' NAMED OBJECTS.
- STRUCTURE TYPES:
  (1) MACRO
  (2) PORT
  (3) STATE
  (4) MEMORY
  (5) FORMAT
  (6) QUEUE

MACRO subsection: names which are used to give convenient, easily remembered names to objects.
PORT subsection: names which are used for communication with the outside world.
STATE subsection: internal names of the ISP' model that can store information.
MEMORY subsection: same as a state, except that memory can be initialized.
FORMAT subsection: convenient names for inconvenient names; typically subranges of states.
QUEUE subsection: names which are used for synchronization with the outside world.


Behavior section:
- CONTAINS ONE OR MORE PROCESSES.
- PROCESS:
  (1) PROCESS DECLARATION
  (2) PROCESS BODY
- PROCESS BODY: SET OF ISP' STATEMENTS.
- ISP' STATEMENTS: PROCESS EXECUTES ALL ITS INDEPENDENT STATEMENTS CONCURRENTLY.
- next AND delay STATEMENTS: CAN BE USED TO FORCE SEQUENTIAL EXECUTION WITHIN A PROCESS.
- main: OPERATES IN A CONTINUOUS LOOP.
- when: WAITS FOR AN EVENT.
- procedure: SAME AS A SUBROUTINE IN A HLL; main process INVOKES a procedure.
- function: SAME AS A FUNCTION IN A HLL.


Example: “wave.isp”

port
    CK 'output;

main CYCLE := (
    CK = 0;
    delay(50);
    CK = 1;
    delay(50);
)

Figure 3.1. File wave.isp with the description of a clock generator in the ISP’ language.


File “cntr.isp”

port
    CK 'input,
    Q<4> 'output;

state
    COUNT<4>;

when EDGE(CK:lead) := (
    Q = COUNT + 1;
    COUNT = COUNT + 1;
)

Figure 3.2. File cntr.isp with the description of a clocked counter in the ISP’ language.
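To make the behavior of wave.isp and cntr.isp concrete, the following Python sketch mimics the two descriptions at a very coarse level. It is not ENDOT code and uses no N.2 interface; the function name `simulate` and its parameters are hypothetical. It assumes that COUNT starts at zero, that the clock is driven low at time 0 and toggles every 50 time units, and that both assignments in cntr.isp read the old value of COUNT, because independent ISP' statements within a process execute concurrently. The topology-level time delay is ignored.

```python
def simulate(total_time, half_period=50, width=4):
    """Return (time, CK, Q) samples, one per clock transition.

    Illustrative model only: names and defaults are assumptions,
    not part of the ENDOT package.
    """
    mask = (1 << width) - 1   # Q<4> and COUNT<4> wrap around at 16
    count = 0                 # state COUNT<4>, assumed to start at 0
    q = 0                     # port Q<4>
    ck = 0                    # port CK, driven low first as in wave.isp
    trace = []
    for t in range(0, total_time, half_period):
        if t > 0:
            ck ^= 1           # CK toggles after every delay(50)
            if ck == 1:       # leading edge: the "when EDGE(CK:lead)" process fires
                old = count   # both statements see the old COUNT
                q = (old + 1) & mask
                count = (old + 1) & mask
        trace.append((t, ck, q))
    return trace


if __name__ == "__main__":
    for sample in simulate(400):
        print(sample)
```

With these assumptions the output Q increments on every leading clock edge and wraps from 15 back to 0, which is the behavior one would expect a 4-bit counter model to show in the simulator.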


ic - The ISP' Compiler

Purpose: COMPILES ".isp" SOURCE FILES INTO ".sim" OBJECT FILES

- input: ".isp" file

- output: ".sim" file

wave.isp ---> ic ---> wave.sim

cntr.isp ---> ic ---> cntr.sim


Topology Language

Purpose: DESCRIBES LINKS BETWEEN THE ".sim" FILES

Topology program:

(1) SIGNAL SECTION
(2) PROCESSOR SECTION
(3) MACRO SECTION
(4) COMPOSITE SECTION
(5) INCLUDE SECTION

- SIGNAL SECTION: IF IT EXISTS, CONTAINS A SET OF SIGNAL DECLARATIONS.

- SIGNAL DECLARATIONS: signal_name [<width>][,signal declarations]


- PROCESSOR SECTION: CONTAINS A PROCESSOR DECLARATION.
- PROCESSOR DECLARATION:
    processor_name = "filename.sim"
    [time delay = integer;]
    [connections signal_connections;]
    [initial memory_name = l.out;]
- MACRO SECTION: USER'S CONVENIENT NAMES FOR TOPOLOGY OBJECTS.
- COMPOSITE SECTION: THIS SECTION MAY CONTAIN A SET OF TOPOLOGY LANGUAGE DECLARATIONS IN THE FOLLOWING FORMAT:
    begin
        declaration
        {declaration}
    end
- INCLUDE SECTION: SIMPLE INCLUDING OF A FILE WHICH CONTAINS TOPOLOGY LANGUAGE DECLARATIONS.


File “clcnt.t”

signal
    CLOCK,
    BUS<4>;

processor CLK = "wave.sim";
    time delay = 10;
    connections
        CK = CLOCK;

processor CNT = "cntr.sim";
    connections
        CK = CLOCK,
        Q = BUS;

Figure 3.3. File clcnt.t with the topology language description of the connection between the clock generator and the clocked counter, described in the wave.isp and cntr.isp files, respectively.


ec - The Ecologist

Purpose: COMPILES ".t" SOURCE FILES INTO ".e00" FILES

- explicit input: ".t" file

- implicit input: ".sim" file(s)

- optional implicit input: "l.out" file (derived by the software tools)

- output: ".e00" file (object file)

clcnt.t   -------->
wave.sim  -------->  ec -----> clcnt.e00
cntr.sim  -------->
[l.out    -------->]


n2 - The Simulator

Purpose: SIMULATION OF THE DESCRIBED HARDWARE SYSTEM.

- input: ".sim" & ".e00" files

- optional input: "l.out" file (derived by the software tools)

- output: if exists, ".txt" file

wave.sim  -------->
cntr.sim  -------->  n2 [ -----> clcnt.txt]
clcnt.e00 -------->
[l.out    -------->]


Simulation Command Language

Purpose: CONTROLLING THE FLOW OF SIMULATION

Some basic simulator commands:

- run: STARTS OR RESUMES THE SIMULATION.

- quit: EXIT THE SIMULATOR.

- time: QUERIES THE SIMULATION "CLOCK" TO OBTAIN THE ELAPSED UNITS OF SIMULATION TIME.

- examine structures: QUERIES THE CONTENTS OF THE STRUCTURES.

- help keyword: PROVIDES AN ON-LINE REFERENCE.

- deposite value structure: SETS THE CONTENTS OF THE STRUCTURE TO THE VALUE FIELD.

- monitor structures & alert structures: PROVIDE A VARIETY OF CAPABILITIES FOR GETTING INFORMATION DURING SIMULATION.


Installation of ENDOT package on systems running SCO UNIX

1. Login as root
2. cd /usr
3. tar xv n2.tar.Z (extract)
4. uncompress -v n2.tar.Z
5. tar xvf n2.tar (extract)
6. rm n2.tar
7. cd n2
8. tar xvf nmpc.uof
9. cp nmpc.uof /usr/USERNAME

Sequence of operations for simulation of the clocked counter

1. vi wave.isp
2. vi cntr.isp
3. ic wave.isp
4. ic cntr.isp
5. vi clcnt.t
6. ec -h clcnt.t
7. n2 -s clcnt.txt clcnt.e00


SOFTWARE TOOLS

metaMicro

Purpose: ASSEMBLING AN ASSEMBLER PROGRAM.

- input: METAMICRO ASSEMBLER SOURCE FILE AND ASSEMBLER PROGRAM

- output: ".n" FILE

arch.m    ----->|
                |---> micro ---> arch.n
program.m ----->|

- arch.m: CONTAINS DEFINITION OF THE ASSEMBLER INSTRUCTIONS AND Begin-end Section:
    begin
        include program.m$
    end

- program.m: CONTAINS ASSEMBLER PROGRAM

- arch.n: OBJECT FILE.


inter - The Interpreter

Purpose: DESCRIPTION OF THE INSTRUCTION WORD; ADDRESS RESOLUTION AND RELOCATION.

- input: LINKER/LOADER SOURCE FILE

- output: ".a" FILE

arch.i -----> inter ------> arch.a

- arch.i: CONTAINS DEFINITIONS OF THE INSTRUCTION WORD AND INFORMATION FOR THE ADDRESS RESOLUTION AND RELOCATION.

- arch.a: OBJECT FILE.


cater - The Allocator

Purpose: LINKING THE ".n" AND ".a" FILES; RESOLVING ADDRESS & ALLOCATION.

- input: ".n" & ".a" files
- output: "l.out" file
- l.out: MEMORY IMAGE FILE

arch.n --->|
           |---> cater ---> l.out
arch.a --->|


Postprocessing & Utility Tools

coverage - ANALYZES PROCESSOR STATEMENTS BY USAGE, HIGHLIGHTING THE UNEXECUTED STATEMENTS.

gpp - ANALYZES PROCESSOR STRUCTURES BY VALUE, PROVIDING STATISTICAL, GRAPHICAL, OR COMPARATIVE PRESENTATION OF RESULTS.

nhelp - ON-LINE HELP.

build - MANAGING OF THE SOURCE FILES.

icv - TRANSLATING ISP' MODELS INTO VHDL


The Fura RISC CPU

Word length: 32 bits
Registers: sixteen 32-bit
Execution model: register-to-register
    dp = register_read -> ALU_operation -> register_write
Memory access: load & store
Pipelining: delayed branching!!! delayed loading!
Instruction classes:
    (1) ALU class
    (2) branch class
    (3) data memory class
    (4) system class


Instruction cycles:
    (1) INSTRUCTION FETCH (IF)
    (2) INSTRUCTION DECODING AND EXECUTION (IDX)
    (3) DATA LOAD (LD)

A D

i-1:  IF  IDX  LD
i:        IF   IDX  LD
i+1:           IF   IDX  LD

Possible isp' coding window positioning (i+1 is the current instruction) main := ( main:= ( IF(i+1); IF(i+1); IDX(i); delay(1); LD(i-1); LD(i); ) IDX(i+1); ) main := ( main := ( IF(i+1); delay(1); IDX(i+1); delay(1); LD(i+1); ) )


Instruction format:

Bits:    31:24   23:20   19:16   15:12   11:0
Field:   OP      DST     SRC#1   SRC#2   X

Bits:    31:24   23:20   19:16   15:5    4:0
Field:   OP      DST     SRC#1   X       SIMM

Bits:    31:24   23:20   19:16   15:0
Field:   OP      DST     SRC#1   LIMM
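The three formats share the OP, DST, and SRC#1 positions, so a decoder can extract every field unconditionally and let the opcode decide which ones are meaningful. The Python sketch below illustrates this with plain shifts and masks. The helper names `decode`, `encode_rrr`, and `sign_extend` are hypothetical, and the 5-bit and 16-bit immediates are labeled `simm` and `limm` after the field names above.

```python
def decode(word):
    """Split a 32-bit instruction word into the fields of the three formats."""
    def bits(hi, lo):
        # Extract bits hi..lo (inclusive) of the instruction word.
        return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

    return {
        "op": bits(31, 24),
        "dst": bits(23, 20),
        "src1": bits(19, 16),
        "src2": bits(15, 12),   # meaningful in the register-register format
        "simm": bits(4, 0),     # short immediate (5 bits)
        "limm": bits(15, 0),    # long immediate (16 bits)
    }


def sign_extend(value, width):
    """Interpret a width-bit field as a two's complement number."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def encode_rrr(op, dst, src1, src2):
    """Build a register-register instruction word (X field left at zero)."""
    return (op << 24) | (dst << 20) | (src1 << 16) | (src2 << 12)


if __name__ == "__main__":
    word = encode_rrr(0, 3, 1, 2)   # opcode 0 with Rd=3, Rs1=1, Rs2=2
    print(hex(word), decode(word))
```

Extracting all fields up front mirrors the way a hardware decoder taps fixed bit positions; only the interpretation depends on the opcode.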


ALU Class:

Add
    (a) ADD Rd, Rs1, Rs2
    (b) ADD Rd, Rs1, imm16
    (c) ADD Rd, PC, imm16

Subtract
    (a) SUB Rd, Rs1, Rs2
    (b) SUB Rd, Rs1, imm16
    (c) SUB Rd, PC, imm16

Move
    (a) MOV Rd, Rs1
    (b) MOV Rd, imm16
    (c) MOV Rd, PC

Negate
    (a) NEG Rd, Rs1

Logical Not
    (a) LNOT Rd, Rs1

Logical And
    (a) LAND Rd, Rs1, Rs2
    (b) LAND Rd, Rs1, imm16

Logical Or
    (a) LOR Rd, Rs1, Rs2
    (b) LOR Rd, Rs1, imm16

Arithmetic Shift Left
    (a) SLA Rd, Rs1, imm5

Arithmetic Shift Right
    (a) SRA Rd, Rs1, imm5

Set if Equal
    (a) SEQ Rd, Rs1, Rs2

Set if Greater Than
    (a) SGT Rd, Rs1, Rs2


Branch Class:

Branch on True
    (a) BT Rd, Rs1

Branch Always
    (a) BA Rd

Data Memory Class:
- load & store instructions

load:
    (1) three cycles: IF, IDX & LD
    (2) IDX: register_read - ALU_operation - output_latch_write (address)
    (3) LD

Load
    (a) LD Rd, Rs2

store:
    (1) two cycles: IF & IDX
    (2) IDX: register_read - ALU_operation - output_latch_write (data & data address)

Store
    (a) ST Rd, Rs2


System instructions:

Noophalt
    (a) NOOPHALT
    idle state of the machine; this instruction may be used for filling slot(s) behind branches and/or loads, or for real-time isp' programming, or to support modular isp' programming.


Branching in pipelined machines: Interlock mechanism: hw (cisc-mostly) versus sw (risc-mostly)

i

i+1

i+75

Scoreboard branch: hw interlock (clock slow-down)

ALU (arithmetic-logic-unit) suspend RWB (register-write-unit) suspend


Delayed branch: sw interlock

source code:
    i-1   ADD  R7, imm32
    i     JUMP R1, R2>R3
    i+1   MOVE R3, R4
    i+2   SUB  R5, R6

after code generation:
    i-1   ADD  R7, imm32
    i     JUMP R1+1, R2>R3
    i+1   NOOP
    i+2   MOVE R3, R4
    i+3   SUB  R5, R6

after code optimization:
    i-1
    i     JUMP R1+1, R2>R3
    i+1   ADD  R7, imm32
    i+2   MOVE R3, R4
    i+3   SUB  R5, R6


condition: THE MOVED INSTRUCTION
    (a) MUST BE EXECUTED (no matter if the branch is taken or not), AND
    (b) HAS NO EFFECT ON THE BRANCH CONDITION AND/OR THE JUMP TARGET ADDRESS.

parameters:
    (a) PIPELINE FILL-IN DEPTH (which is not the pipeline depth minus one!)
    (b) BRANCHING-RELATED STATISTICS (branches executed versus branches taken)
    (c) BRANCH FILL-IN FUNCTION (local versus global code optimization)
    (d) CLOCK SLOW DOWN FUNCTION (in-the-critical-path versus off-the-critical-path)
    (e) TECHNOLOGY-RELATED STATISTICS (on-chip versus off-chip delays)
    (f) CACHE IMPACT (hit versus miss penalty)

NUMERICAL EXAMPLE: What is the equation for the condition that hw and sw interlock have the same benchmark execution time (not clock-count)?
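One way to read the condition on the moved instruction is as a register dependence check: an instruction taken from before the branch is executed on both paths anyway, so it can fill the delay slot as long as it writes none of the registers that the branch reads for its condition or its target. The Python sketch below encodes only that check. The function name `can_fill_branch_slot` is hypothetical, and the register sets are an assumed reading of the delayed-branch example (JUMP R1, R2>R3 reads R1, R2, and R3; ADD R7, imm32 writes R7).

```python
def can_fill_branch_slot(candidate_writes, branch_reads):
    """True if the candidate instruction writes none of the registers
    that the branch reads (condition operands and jump target)."""
    return set(candidate_writes).isdisjoint(branch_reads)


# Assumed register usage for the example: JUMP R1, R2>R3
branch_reads = {"R1", "R2", "R3"}

# ADD R7, imm32 writes only R7, so it may move into the slot.
add_ok = can_fill_branch_slot({"R7"}, branch_reads)

# A hypothetical instruction that writes R3 would change the condition.
r3_writer_ok = can_fill_branch_slot({"R3"}, branch_reads)

if __name__ == "__main__":
    print(add_ok, r3_writer_ok)
```

A real code optimizer would also have to respect dependences between the moved instruction and everything it is moved across; this sketch covers only the branch itself.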


Loading in pipelined machines:

Interlock mechanism: hw versus sw

    i:    IF  IDX  LD
    i+1:      IF   IDX

Scoreboard LOAD:

Suspend
Bypass


Delayed LOAD: sw interlock

source code:
    i-1   MOVE R3, R4
    i     LOAD R7, memory
    i+1   ADD  R2, R1, R7

after code generation:
    i-1   MOVE R3, R4
    i     LOAD R7, memory
    i+1   NOOP
    i+2   ADD  R2, R1, R7

after code optimization:
    i-1
    i     LOAD R7, memory
    i+1   MOVE R3, R4
    i+2   ADD  R2, R1, R7

condition: mutual independence

parameters: technology related, design + organization + architecture related, system software related, and application related.
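The two software steps of the delayed load (code generation, then code optimization) can be sketched in Python for straight-line code. This is an illustration under stated assumptions, not the compiler used in the project: the `Instr` tuple and the functions `generate`, `independent`, and `optimize` are hypothetical, the first operand of each instruction is assumed to be its destination, there is exactly one load delay slot, and branches are not handled.

```python
from collections import namedtuple

# op: mnemonic, dst: register written (or None), srcs: registers read
Instr = namedtuple("Instr", "op dst srcs")
NOOP = Instr("NOOP", None, ())


def generate(program):
    """Code generation: put a NOOP after each LOAD whose result
    is read by the very next instruction."""
    out = []
    for k, ins in enumerate(program):
        out.append(ins)
        nxt = program[k + 1] if k + 1 < len(program) else None
        if ins.op == "LOAD" and nxt is not None and ins.dst in nxt.srcs:
            out.append(NOOP)
    return out


def independent(a, b):
    """Mutual independence: neither instruction reads or overwrites
    what the other one writes."""
    if a.dst is not None and (a.dst in b.srcs or a.dst == b.dst):
        return False
    if b.dst is not None and b.dst in a.srcs:
        return False
    return True


def optimize(program):
    """Code optimization: turn X; LOAD; NOOP into LOAD; X when X and
    the LOAD are mutually independent."""
    out = list(program)
    k = 1
    while k + 1 < len(out):
        prev, ins, nxt = out[k - 1], out[k], out[k + 1]
        # Be conservative: do not move an instruction that itself sits
        # in the delay slot of an earlier LOAD.
        prev_in_slot = k >= 2 and out[k - 2].op == "LOAD"
        if (ins.op == "LOAD" and nxt.op == "NOOP"
                and prev.op not in ("LOAD", "NOOP")
                and not prev_in_slot
                and independent(prev, ins)):
            out[k - 1:k + 2] = [ins, prev]
        k += 1
    return out


if __name__ == "__main__":
    source = [
        Instr("MOVE", "R3", ("R4",)),
        Instr("LOAD", "R7", ()),
        Instr("ADD", "R2", ("R1", "R7")),
    ]
    for ins in optimize(generate(source)):
        print(ins.op, ins.dst, ins.srcs)
```

Applied to the three-instruction example above, the sketch reproduces the sequence shown after code optimization: the LOAD comes first, the MOVE fills the slot, and the ADD follows.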


CURRENT WINDOW

IF IDX LDIF IDX LD

IF IDX LD

MAIN DELAY(1) END

IR=MEMRY[PASTPC] PASTPC=PC PC=PC+1 PASTOP=OP

PC=REG[DST]

i-1: leaves PASTPC, PASTOP (part of PASTIR)

i: leaves PC, OP (part of IR)

i+1: after IF, puts PC+1 into PC; after IDX (when branch), puts REG[dst] into PC;


The ".isp" file:

- Macro section

macro
    WORD = 32&,
    BYTE = 8&,
    NIBBLE = 4& ;

- State section

state
    reg[0:15]<WORD>,
    pc<WORD>,
    pastpc<WORD>,
    ir<WORD>,
    pastop<WORD>, !
    pastdst<NIBBLE>, pastval<WORD>, hist[0:23]<WORD> !
    ;

- Memory section

memory
    memry[0:0xfff]<WORD> ;

- Format section

format
    op = ir<31:24>,
    dst = ir<23:20>,
    src1 = ir<19:16>,
    src2 = ir<15:12>,
    imm16 = ir<15:0>,
    imm5 = ir<4:0>


- Main Program

main := (
    pastop = op;
    pastpc = pc;
    pc = pc + 1;
    ir = memry[pastpc];
    hist[pastop] = hist[pastop] + 1;
    delay(1);

    if pastop eql 21
        reg[pastdst] = pastval;

    case op
        0: reg[dst] = reg[src1] + reg[src2]

        instructions 1 to 20

        21: ( pastdst = dst;
              pastval = memry[reg[src2]] )

        22: memry[reg[src2]] = reg[dst]
        23:
    esac;
)


The complete "case":

! Instruction decode and execution is done here. The "case" statement performs
! the decode - note that the opcode bits are tested as one would expect.
! For each legal opcode, a unique action is specified.
! Only one action is performed, then the bottom of the "main" process is reached,
! and we return to the top of the process.

case op
     0: reg[dst] = reg[src1] + reg[src2]                ! add (reg-reg)
     1: reg[dst] = reg[src1] + imm16 sxt 32             ! add (reg-imm)
     2: reg[dst] = pc + imm16 sxt 32                    ! add (pc-imm) !!
     3: reg[dst] = reg[src1] - reg[src2]                ! sub (reg-reg)
     4: reg[dst] = reg[src1] - imm16 sxt 32             ! sub (reg-imm)
     5: reg[dst] = pc - imm16 sxt 32                    ! sub (pc-imm)
     6: reg[dst] = reg[src1]                            ! mov (reg-reg)
     7: reg[dst] = imm16 sxt 32                         ! mov (reg-imm)
     8: reg[dst] = pc                                   ! mov (pc-imm)
     9: reg[dst] = - reg[src1]                          ! negate
    10: reg[dst] = reg[src1] and reg[src2]              ! and (reg-reg)
    11: reg[dst] = reg[src1] and imm16 sxt 32           ! and (reg-imm)
    12: reg[dst] = reg[src1] or reg[src2]               ! or (reg-reg)
    13: reg[dst] = reg[src1] or imm16 sxt 32            ! or (reg-imm)
    14: reg[dst] = not reg[src1]                        ! not
    15: reg[dst] = reg[src1] *:arith (imm5 ext 32)      ! shift left !!
    16: reg[dst] = reg[src1] /:arith (imm5 ext 32)      ! shift right !!
    17: if reg[src1] eql reg[src2]                      ! set if equal
            reg[dst] = - 1
        else
            reg[dst] = 0
    18: if reg[src1] gtr reg[src2]                      ! set if greater
            reg[dst] = - 1
        else
            reg[dst] = 0
    19: if reg[src1] eql -1                             ! branch on true
            pc = reg[dst]
    20: pc = reg[dst]                                   ! branch always
    21: ( pastdst = dst;                                ! load
          pastval = memry[reg[src2]] )
    22: memry[reg[src2]] = reg[dst]                     ! store


The ".m" file:

- Instr Section

instr
    I<32>$

- Format Section

format
    op = I<31:24>,
    dst = I<23:20>,
    src1 = I<19:16>,
    src2 = I<15:12>,
    imm16 = I<15:0>,
    imm5 = I<4:0>$

- Macro section

macro
    r0 = 0&,
    r1 = 1&,
    ...
    r15 = 15&,
    addr(d,s1,s2) = op=0; dst=d; src1=s1; src2=s2$&,
    instructions 1 to 22
    noophalt = op=23$&$

- Begin-end section

begin
    include ee666.test$
end


The ".i" file:

- Instr Section

instr
    I<32>$

- Format Section

format
    op = I<31:24>,
    dst = I<23:20>,
    src1 = I<19:16>,
    src2 = I<15:12>,
    imm16 = I<15:0>,
    imm5 = I<4:0>$

- Space section

space
    <0:4095>$

- Transfer section

transfer
    {new}

- Mode section

mode
    case
        op eql 7
            imm16~address$
            break$
    esac,
    default:
        imm16~imm16$


The ".t" file

processor cpu = "ee666.sim";

time delay = 100ns;

initial memry = l.out;


The ".b" file:

Sample assembler language program that uses the instructions for the RISC-like processor of the ee666 (Advanced Computer Systems), Purdue University, Spring Semester 1987.

Filename: ee666.test

    movi(r0,100)
    subri(r1,10,100)
    movr(r2,r1)
    seq(r3,r1,r2)
    movi(r4,11)
    movi(r5,12)
    movi(r6,13)
    bt(r4,r3)
    ba(r5)
    movi(r1,10)

11: addri(r1,r1,1)
    addri(r1,r1,1)

12: sgt(r7,r2,r1)
    bt(r6,r7)
    addr(r8,r0,r2)
    subri(r9,r1,10)
    st(r9,r8)
    ba(r5)
    addri(r2,r2,2)

13: subri(r8,r8,2)
    ld(r8,r8)
    movr(r10,r8)
    addrr(r10,r10,r8)
    sla(r10,r10,2)
    halt


Sample Fura RISC VMS Session:

1. set def [.N2]
2. copy VL$A:[N2.E666]*.* *.*
3. @VL$A:[N2]login
4. n2 -s script.txt ee666.e00

If you want to test your own CPU:

1. @VL$A:[N2]login
2. edit cpuname.isp
3. ic cpuname.isp
4. edit cpuname.m
5. edit program.m
6. micro cpuname.m
7. edit cpuname.i
8. inter cpuname.i
9. cater cpuname.a cpuname.n
10. edit cpuname.t
11. ec -b cpuname.t
12. n2 -s script.txt cpuname.e00
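When the same sequence has to be repeated for several candidate CPU models, it can help to generate the non-editing commands from the model name. The Python sketch below only builds the command strings exactly as they appear in the session above and does not run anything. The function name `endot_commands` and its parameters are hypothetical, and the options are copied from the session rather than from any tool documentation.

```python
def endot_commands(cpu, script="script.txt"):
    """Return the ENDOT command lines of the session, in order,
    for a CPU model whose files share the base name `cpu`."""
    return [
        f"ic {cpu}.isp",                 # ISP' compiler: .isp -> .sim
        f"micro {cpu}.m",                # meta-assembler: .m -> .n
        f"inter {cpu}.i",                # interpreter: .i -> .a
        f"cater {cpu}.a {cpu}.n",        # allocator: .a + .n -> l.out
        f"ec -b {cpu}.t",                # ecologist: .t -> .e00
        f"n2 -s {script} {cpu}.e00",     # simulator run
    ]


if __name__ == "__main__":
    for line in endot_commands("cpuname"):
        print(line)
```

The order matters: the ecologist needs the .sim file and, optionally, the l.out memory image, so the compiler and the software tools run first.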


Papers from the Open Literature:

1) Rose, C. W., Ordy, G. M., Drongowski, P. J., "N.mpc: A Study in University-Industry Technology Transfer," IEEE Design & Test of Computers, February 1984, pp. 44-56.
2) Rose, C. W., "System Design Tools - A Paradigm Shift," Endot Corporation Internal Report, 1986.
3) Gay, F., "Functional Simulation Fuels System Design," VLSI Design Technology
4) Kong, S., Wood, D., Gibson, G., Katz, R., Patterson, D., "Design Methodology of a VLSI Multiprocessor Workstation," VLSI Systems, February 1987.
5) Bozanic, D., Fura, D., Milutinovic, V., "Simulation of a Simple RISC Processor," Application Note, No. D#001/VM, TD Technologies, Cleveland Heights, Ohio, U.S.A., 1993.
6) Petkovic, Z., Milutinovic, V., "Simulation of the Intel i860 RISC Processor," Application Note, No. D#003/VM, TD Technologies, Cleveland Heights, Ohio, U.S.A., 1994.
7) Milicev, D., Petkovic, Z., Milutinovic, V., "Simulation Study of Uniprocessor Cache Memories," Application Note, No. D#004/VM, TD Technologies, Cleveland Heights, Ohio, U.S.A., 1994.
8) Tomasevic, M., Milutinovic, V., "Using N.2 in a Simulation Study of Snoopy Cache Coherence Protocols for Shared Memory Multiprocessor System," Application Note, No. D#002/VM, TD Technologies, Cleveland Heights, Ohio, U.S.A., 1993.


WORKLOAD CHARACTERIZATION

Important Reference: Ferrari, D., Computer Systems Performance Evaluation, Prentice-Hall, Englewood Cliffs, New Jersey, U.S.A., 1978.

Introduction: The workload of a computer system has been defined as the set of all inputs (programs, data, commands, etc...) that the system receives from its environment. In measurement experiments, the system is driven by a model of the workload which is just a sample of the real production workload. The major question is how representative this sample is. Other important characteristics of a workload are:

a) simplicity of construction, b) usage cost, c) reproducibility, d) compactness, and e) system independence.

Types of Workload Models:

1. Natural workload model: A sample job stream taken from a production workload, and used to drive the system at the very time it was produced.

2. Artificial workload model: All other cases.

2a. Non executable: Defined via statistical distributions of relevant parameters.
    Usage: In analytical studies.
    Typical forms: Probabilities of various instructions (instruction mixes), memory accesses, procedure nesting depths, etc...
    Relevant issues: Mean values, variances, correlations, autocorrelations, etc...
    Standard instruction mixes: Flynn (MLL), Knuth (HLL), etc...
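An instruction mix is simply the relative frequency of each instruction class in a trace, and it can be combined with per-class cycle counts to estimate an average cost per instruction. The Python sketch below shows the arithmetic. The function names, the class names, and the cycle counts in the example are made up for illustration and are not measurements of any processor discussed here.

```python
from collections import Counter


def instruction_mix(trace):
    """Return the relative frequency of each instruction class in a trace."""
    if not trace:
        raise ValueError("trace must contain at least one instruction")
    counts = Counter(trace)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


def mean_cycles(mix, cycles):
    """Average cycles per instruction for a given mix and per-class cost."""
    return sum(p * cycles[op] for op, p in mix.items())


if __name__ == "__main__":
    # Hypothetical trace: 6 ALU operations, 2 loads, 2 branches.
    trace = ["alu"] * 6 + ["load"] * 2 + ["branch"] * 2
    mix = instruction_mix(trace)
    # Hypothetical costs: one cycle per ALU operation, two per load or branch.
    print(mix, mean_cycles(mix, {"alu": 1, "load": 2, "branch": 2}))
```

The same counting is what a per-opcode histogram inside a simulator model would provide; the mix then serves as a non-executable workload model for analytical studies.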


2b. Executable: Defined via one or more programs.
    Usage: In empirical studies.
    Typical forms: Synthetic jobs (parametric programs) and benchmarks (semantic programs).
    Relevant issues: application orientation, etc...
    Standard ones: See the PC magazines, etc...

Synthetic job approaches:
    Buchholz (fixed flowchart with variable parameters)
    Kernighan + Hamilton (similar but more sophisticated)
    Archibald + Baer (the most widely cited computer architecture paper in the 80's)

Benchmark types:
    Extracted
    Created
    Standard (application dependent)


The DARPA/Stanford benchmarks:

The DARPA/Stanford Benchmark Package consists of thirteen PASCAL programs:

1) ackp.p
2) bubblesortp.p
3) fftp.p
4) fibp.p
5) intmmp.p
6) permp.p
7) puzzlep.p
8) eightqueenp.p
9) quickp.p
10) realmmp.p
11) sievep.p
12) towresp.p
13) treep.p

These programs are located on ed machine, and the full path name of their directory is: /a/mips/bench


An Introduction to VLSI Processor Architecture for GaAs

This research has been sponsored by RCA and conducted in collaboration with the RCA Advanced Technology Laboratories, Moorestown, New Jersey.


Advantages:

• For the same power consumption, at least half an order of magnitude faster than Silicon.

• Efficient integration of electronics and optics.

• Tolerant of temperature variations. Operating range: [-200°C, +200°C].

• Radiation hard. Several orders of magnitude more than Silicon: [>100 million RADs].


Disadvantages:

• High density of wafer dislocations -> low yield -> small chip size -> low transistor count.

• Noise margin not as good as in Silicon. Area has to be traded in for higher reliability.

• At least two orders of magnitude more expensive than Silicon.

• Currently having problems with high-speed test equipment.


Basic Differences of Relevance for Microprocessor Architecture

• Small area and low transistor count
  (* in general, implications of this fact are dependent on the speed of the technology *)

• High ratio of off-chip and on-chip delays
  (* consequently, off-chip memory access is much longer than on-chip memory access *)

• Limited fan-in and fan-out (?)
  (* temporary differences *)

• High demand on efficient fault-tolerance (?)
  (* to improve the yield for bigger chips *)


A Brief Look Into the GaAs IC Design

• Bipolar (TI + CDC)

• JFET (McDAC)

• GaAs MESFET Logic Families (TriQuint + RCA)
    D-MESFET (* Depletion Mode *)
    E-MESFET (* Enhancement Mode *)


                                           Speed         Dissipation   Complexity
                                           (ns)          (W)           (K transistors)
Arithmetic
  32-bit adder (BFL D-MESFET)              2.9 total     1.2           2.5
  16×16-bit multiplier (DCFL E/D MESFET)   10.5 total    1.0           10.0
Control
  1K gate array (STL HBT)                  0.4/gate      1.0           6.0
  2K gate array (DCFL E/D MESFET)          0.08/gate     0.4           8.2
Memory
  4Kbit SRAM (DCFL E/D MODFET)             2.0 total     1.6           26.9
  16K SRAM (DCFL E/D MESFET)               4.1 total     2.5           102.3

Figure 7.1. Typical (conservative) data for speed, dissipation, and complexity of digital GaAs chips.


Figure 7.2. Comparison (conservative) of GaAs and silicon, in terms of complexity and speed of the chips (assuming equal dissipation). Symbols T and R refer to the transistors and the resistors, respectively. Data on silicon ECL technology complexity includes the transistor count increased for the resistor count.

                                   GaAs          Silicon      Silicon       Silicon      Silicon
                                   (1 µm         (2 µm        (2 µm         (1.25 µm     (2 µm
                                   E/D-MESFET)   NMOS)        CMOS)         NMOS)        ECL)
Complexity
  On-chip transistor count         40K           200K         200K          400K         40K (T or R)
Speed
  Gate delay
  (minimal fan-out)                50-150 ps     1-3 ns       800-1000 ps   500-700 ps   150-200 ps
  On-chip memory access
  (32×32 bit capacity)             0.5-2.0 ns    20-40 ns     10-20 ns      5-10 ns      2-3 ns
  Off-chip, on package memory
  access (256×32 bits)             4-8 ns        40-80 ns     30-40 ns      20-30 ns     6-10 ns
  Off-package memory access
  (1k×32 bits)                     10-50 ns      100-200 ns   60-100 ns     40-80 ns     20-80 ns


Figure 7.3. Comparison of GaAs and silicon, in the case of actual 32-bit microprocessor implementations (courtesy of RCA). The impossibility of implementing “phantom” logic (wired-OR) is a consequence of the low noise immunity of GaAs circuits (200 mV).

                                GaAs E/D-DCFL     Silicon SOS-CMOS
Minimal geometry                1 µm              1.25 µm
Levels of metal                 2                 2
Gate delay                      250 ps            1.25 ns
Maximum fan-in                  5 NOR, 2 AND      4 NOR, 4 NAND
Maximum fan-out                 4                 20
Noise immunity level            220 mV            1.5 V
Average gate transistor count   4.5               7
On-chip transistor count        25 000            100 000-150 000

Page Number: 64/101

Figure 7.4. Processor organization based on the BS (bit-slice) components. The meaning of symbols is as follows: IN—input, BUFF—buffer, MUX—multiplexer, DEC—decoder, L—latch, OUT—output. The remaining symbols are standard.

Page Number: 65/101

Figure 7.5. Processor organization based on the FS (function slice) components: IM—instruction memory, I_D_U—instruction decode unit, DM_I/O_U—data memory input/output unit, DM—data memory.

Page Number: 66/101

Only a single-chip reduced architecture makes sense!

In a silicon environment, we can argue “RISC” or “CISC”.

In a GaAs environment, there is only one choice: “RISC”.

However, the RISC concept has to be significantly modified for efficient GaAs utilization.

Implication of the High Off/On Ratio on the Choice of Processor Design Philosophy

Page Number: 67/101

Assume a 10:1 advantage in on-chip switching speed, but only a 3:1 advantage in off-chip/off-package memory access.

Will the microprocessor be 10 times faster?

Or only 3 times faster?
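The question above can be explored with a small back-of-the-envelope model. The sketch below is not from the original slides: it assumes that a share of the silicon execution time (the hypothetical parameter `off_chip_fraction`) is spent on off-chip/off-package memory access and that the rest is on-chip work, and it applies the 10:1 and 3:1 advantages to the two parts separately.

```python
def overall_speedup(off_chip_fraction, on_chip_gain=10.0, off_chip_gain=3.0):
    """Estimate GaAs-over-silicon speed-up for a given workload split.

    off_chip_fraction is the assumed share of silicon execution time spent
    on off-chip/off-package memory access; the remainder is on-chip work.
    """
    if not 0.0 <= off_chip_fraction <= 1.0:
        raise ValueError("off_chip_fraction must be between 0 and 1")
    on_chip_fraction = 1.0 - off_chip_fraction
    # Each part of the original time shrinks by its own advantage factor.
    new_time = on_chip_fraction / on_chip_gain + off_chip_fraction / off_chip_gain
    return 1.0 / new_time


for share in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"off-chip share {share:.2f}: speed-up {overall_speedup(share):.2f}x")
```

Under these assumptions the answer lies between 3 and 10 and moves toward 3 as the off-chip share grows, which is why reducing off-chip traffic matters so much in GaAs.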

Why the Information Bandwidth Problem?

The Reduced Philosophy: Large register file; most or all on-chip memory is used for the register file. On-chip instruction cache is out of the question.

Instruction fetch must be from an off-chip environment

The Information Bandwidth Problem of GaAs

Page Number: 68/101

• General purpose processing in defense and aerospace, and execution of compiled HLL code.
• General purpose processing and substitution of current CISC microprocessors.*
• Dedicated special-purpose applications in digital control and signal processing.*
• Multiprocessing of the SIMD/MIMD type, for numeric and symbolic applications.

Applications for GaAs Microprocessor

Page Number: 69/101

On-chip issues:
•Register file
•ALU
•Pipeline organization
•Instruction set

Off-chip issues:
•Cache
•Virtual memory management
•Coprocessing
•Multiprocessing

System software issues:
•Compilation
•Code optimization

Which Design Issues Are Affected?

Page Number: 70/101

Figure 7.6. Comparison of GaAs and silicon. Symbols CL and RC refer to the basic adder types (carry look-ahead and ripple carry). Symbol B refers to the word size.
a) Complexity comparison. Symbol C[tc] refers to complexity, expressed in transistor count.
b) Speed comparison. Symbol D[ns] refers to propagation delay through the adder, expressed in nanoseconds. In the case of silicon technology, the CL adder is faster when the word size exceeds four bits (or a somewhat lower number, depending on the diagram in question). In the case of GaAs technology, the RC adder is faster for word sizes up to n bits (the actual value of n depends on the actual GaAs technology used).
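To make the two adder types concrete, here is a minimal functional sketch of both (the function names are illustrative and do not come from the slides). It models logic only, not delay or transistor count. In the look-ahead version the widest AND-OR term grows with the bit position, so it needs larger fan-in and fan-out than the ripple version; this may be one reason the trade-off shifts in a technology with tight fan-in and fan-out limits.

```python
def ripple_carry_add(a, b, width):
    """Add two unsigned integers; the carry ripples through every bit stage."""
    carry = 0
    result = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))  # next stage waits for this carry
    return result, carry


def carry_lookahead_add(a, b, width):
    """Add two unsigned integers; every carry is a flat AND-OR of g and p terms."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]  # generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]  # propagate
    carries = [0]  # carry into bit 0 is assumed to be zero
    for i in range(width):
        # c[i+1] = g[i] | p[i]g[i-1] | p[i]p[i-1]g[i-2] | ...
        c = 0
        for j in range(i, -1, -1):
            term = g[j]
            for k in range(j + 1, i + 1):
                term &= p[k]
            c |= term
        carries.append(c)
    result = 0
    for i in range(width):
        result |= (p[i] ^ carries[i]) << i
    return result, carries[width]


print(ripple_carry_add(11, 6, 4))
print(carry_lookahead_add(11, 6, 4))
```

Both functions return the sum modulo 2^width together with the carry out, so they can be checked against each other for any operand pair.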

Adder Design

Page Number: 71/101

Figure 7.7. Comparison of GaAs and silicon technologies: an example of the bit-serial adder. All symbols have their standard meanings.

Page Number: 72/101

Figure 7.8. Comparison of GaAs and silicon technologies: design of the register cell: (a) an example of the register cell frequently used in the silicon technology; (b) an example of the register cell frequently used in the GaAs microprocessors. Symbol BL refers to the unique bit line in the four-transistor cell. Symbols A BUS and B BUS refer to the double bit lines in the seven-transistor cell. Symbol F refers to the refresh input. All other symbols have their standard meanings.

Register File Design


Page Number: 73/101

Pipeline design

Figure 7.9. Comparison of GaAs and silicon technologies: pipeline design—a possible design error: (a) two-stage pipeline typical of some silicon microprocessors; (b) the same two-stage pipeline when the off-chip delays are three times longer than on-chip delays (the off-chip delays are the same as in the silicon version). Symbols IF and DP refer to the instruction fetch and the ALU cycle (datapath). Symbol T refers to time.

Page Number: 74/101

Figure 7.10. Comparison of GaAs and silicon technologies: pipeline design, possible solutions: (a1) timing diagrams of a pipeline based on the IM (interleaved memory) or the MP (memory pipelining); (a2) a system based on the IM approach; (a3) a system based on the MP approach; (b) timing diagram of the pipeline based on the IP (instruction packing) approach. Symbols P, M, and MM refer to the processor, the memory, and the memory module. The other symbols were defined earlier.
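The pipeline problem of Figure 7.9 and the remedies of Figure 7.10 can be summarized in a few lines of arithmetic. The sketch below is an idealized steady-state model, not taken from the slides: the function name and the `overlap` parameter are assumptions, and branches, conflicts, and start-up effects are ignored.

```python
def time_per_instruction(fetch_delay, datapath_delay, overlap=1):
    """Steady-state time per instruction of a two-stage IF/DP pipeline.

    overlap is the number of fetches in flight at once (interleaved memory
    modules or memory pipeline stages), or the number of instructions
    delivered by one fetch (instruction packing).
    """
    if overlap < 1:
        raise ValueError("overlap must be at least 1")
    # The slower of the two stages sets the pace of the whole pipeline.
    return max(datapath_delay, fetch_delay / overlap)


dp = 1.0          # on-chip datapath cycle, arbitrary time units
fetch = 3.0 * dp  # off-chip fetch assumed three times longer, as in Figure 7.9(b)

print(time_per_instruction(fetch, dp))             # naive two-stage pipeline
print(time_per_instruction(fetch, dp, overlap=3))  # IM, MP, or IP with a factor of three
```

In the naive case the datapath idles two thirds of the time; with a factor-of-three overlap the instruction rate is again limited by the on-chip datapath.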


Page Number: 75/101

32-bit GaAs MICROPROCESSORS

Goals and project requirements:

•200 MHz clock rate
•32-bit parallel data path
•16 general purpose registers
•Reduced Instruction Set Computer (RISC) architecture
•24-bit word addressing
•Virtual memory addressing
•Up to four coprocessors connected to the CPU (coprocessors can be of any type and all different)

References:

1. Milutinović, V. (editor), “Special Issue on GaAs Microprocessor Technology,” IEEE Computer, October 1986.
2. Helbig, W., Milutinović, V., “The RCA DCFL E/D-MESFET GaAs Experimental RISC Machine,” IEEE Transactions on Computers, December 1988.

Page Number: 76/101

3. The outputs of two circuits cannot be tied together:
a. One cannot utilize phantom logic on the chip to implement functions like WIRED-OR (all outputs active). Circuits have a low “operating noise margin”.
b. One cannot use three-state logic on the chip to implement functions like MULTIPLE-SOURCE-BUS (only one output active). Circuits have no “off-state”.
c. Actually, if one insists on having a MULTIPLE-SOURCE-BUS on the chip, one can have it at the cost of only one active load and the need to precharge (both mean “constraints” and “slowdown” on the architecture level).
d. Fortunately, the logic function AND-OR is exactly what is needed to create a multiplexer, a perfect replacement for a bus.
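The AND-OR replacement for a bus can be sketched as follows. This is an illustration only: the function names, the 32-bit width, and the polarity of the control input are assumptions, not details taken from the slides.

```python
WORD_MASK = 0xFFFFFFFF  # 32-bit data path assumed


def mux2(a, b, c):
    """Two-to-one one-bit multiplexer as AND-OR logic: a when c == 0, b when c == 1."""
    return (a & (1 - c)) | (b & c)


def and_or_bus(sources, selects):
    """Stand-in for a multiple-source bus: AND each source with its select, OR the results.

    Exactly one select line is expected to be 1, so no two outputs are
    ever tied together.
    """
    if len(sources) != len(selects) or sum(selects) != 1:
        raise ValueError("need one select per source, with exactly one set")
    out = 0
    for data, sel in zip(sources, selects):
        out |= data & (WORD_MASK if sel else 0)
    return out


print(mux2(0, 1, 1))
print(hex(and_or_bus([0x1111, 0x2222, 0x3333], [0, 1, 0])))
```

Every source drives only its own AND gate, so the scheme needs neither wired-OR nor three-state outputs.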


Page Number: 77/101

MUX

Page Number: 78/101

Figure 7.11. The technological problems that arise from the usage of GaAs technology: (a) an example of the fan-out tree, which provides a fan-out of four, using logic elements with the fan-out of two; (b) an example of the logic element that performs a two-to-one one-bit multiplexing. Symbols a and b refer to data inputs. Symbol c refers to the control input. Symbol o refers to data output.
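The cost of a fan-out tree such as the one in part (a) can be estimated with a short helper. The sketch below is an assumption-laden illustration (the function name is hypothetical): it supposes that the source and every buffer can each drive at most `max_fanout` inputs, and it counts the buffer levels, each of which adds a gate delay, and the buffers needed.

```python
def fanout_tree(loads, max_fanout):
    """Return (buffer_levels, buffer_count) needed to drive `loads` inputs."""
    if loads < 1:
        raise ValueError("loads must be at least 1")
    if max_fanout < 2:
        raise ValueError("max_fanout must be at least 2")
    levels = 0
    buffers = 0
    width = loads  # inputs that still have to be driven at the current level
    while width > max_fanout:
        width = -(-width // max_fanout)  # ceiling division: buffers for this level
        buffers += width
        levels += 1
    return levels, buffers


print(fanout_tree(4, 2))   # the case of Figure 7.11(a)
print(fanout_tree(20, 4))  # a fan-out of 20 built from elements limited to 4
```

For the case in the figure, a fan-out of four from elements with a fan-out of two, the model gives one extra buffer level and two buffers.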


Page Number: 79/101

Figure 7.12. Some possible techniques for realization of PCBs (printed circuit boards): (a) The MS technique (microstrip); (b) The SL technique (stripline). Symbols $D_0$ and $Z_0$ refer to the signal delay and the characteristic impedance, respectively. The meaning of other symbols is defined in former figures, or they have standard meanings.

(a) Microstrip:

$$Z_0 = \frac{87}{\sqrt{\varepsilon_r + 1.41}} \, \ln\frac{5.98\,H}{0.8\,W + T}\ \Omega, \qquad D_0 = 1.016\,\sqrt{0.475\,\varepsilon_r + 0.67}\ \text{ns/ft}$$

(b) Stripline:

$$Z_0 = \frac{60}{\sqrt{\varepsilon_r}} \, \ln\frac{4\,B}{0.67\,\pi\,(0.8\,W + T)}\ \Omega, \qquad D_0 = 1.016\,\sqrt{\varepsilon_r}\ \text{ns/ft}$$
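The figure's formulas appear to be the usual closed-form approximations for microstrip and stripline traces, so a small calculator is sketched below on that assumption. The function names and the sample dimensions are hypothetical. H is taken as the dielectric thickness under the trace, B as the spacing between the two reference planes, W and T as the trace width and thickness (all in the same length unit), and `er` as the relative dielectric constant.

```python
import math


def microstrip(er, h, w, t):
    """Return (Z0 in ohms, delay in ns/ft) for a microstrip trace."""
    z0 = 87.0 / math.sqrt(er + 1.41) * math.log(5.98 * h / (0.8 * w + t))
    delay = 1.016 * math.sqrt(0.475 * er + 0.67)
    return z0, delay


def stripline(er, b, w, t):
    """Return (Z0 in ohms, delay in ns/ft) for a stripline trace."""
    z0 = 60.0 / math.sqrt(er) * math.log(4.0 * b / (0.67 * math.pi * (0.8 * w + t)))
    delay = 1.016 * math.sqrt(er)
    return z0, delay


# Hypothetical board: er = 4.7, dimensions in inches.
ms_z0, ms_delay = microstrip(4.7, h=0.030, w=0.050, t=0.0015)
sl_z0, sl_delay = stripline(4.7, b=0.060, w=0.050, t=0.0015)
print(f"microstrip: Z0 = {ms_z0:.1f} ohm, delay = {ms_delay:.2f} ns/ft")
print(f"stripline:  Z0 = {sl_z0:.1f} ohm, delay = {sl_delay:.2f} ns/ft")
```

For the same dielectric, the microstrip delay comes out lower than the stripline delay whenever `er` is well above 1, because part of the microstrip field travels through air.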

Page Number: 80/101

1. Deep Memory Pipelining: Optimal memory pipelining depends on the ratio of off-chip and on-chip delays, plus many other factors. Therefore, precise input from DP and CD people was crucial. Unfortunately, these data were not fully known at design time, and some solutions (e.g., PC-stack) had to work for various pipeline depths.

2. Latency Stages: One group of latency stages (WAIT) was associated with instruction fetch; the other group was associated with operand load.

3. Four Basic Opcode Classes:
•ALU
•LOAD/STORE
•BRANCH
•COPROCESSOR

4. Register zero is hardwired to zero.
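A register file with a hardwired zero can be modeled in a few lines. The sketch below is illustrative only: the class name is hypothetical, the 16-register size follows the project requirements listed earlier, and the 32-bit width is an assumption based on the 32-bit data path.

```python
class RegisterFile:
    """General-purpose register file in which register 0 always reads as zero."""

    def __init__(self, count=16):
        self._regs = [0] * count

    def read(self, index):
        # Register 0 is hardwired: its stored value is never consulted.
        return 0 if index == 0 else self._regs[index]

    def write(self, index, value):
        # Writes to register 0 are silently discarded.
        if index != 0:
            self._regs[index] = value & 0xFFFFFFFF


grf = RegisterFile()
grf.write(0, 123)   # has no effect
grf.write(5, 42)
# A register-to-register move can reuse the adder: r6 = r5 + r0.
grf.write(6, grf.read(5) + grf.read(0))
print(grf.read(0), grf.read(5), grf.read(6))
```

A constant zero lets operations such as move, clear, and compare-with-zero be expressed with ordinary ALU instructions, which is a common way to keep a reduced instruction set small.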

The CPU Architecture

Page Number: 81/101

[Figure: CPU and memory (M) in silicon versus GaAs; labels IR, GRF, CPU, M, M3, M6, M9.]

Page Number: 82/101

ALU CLASS

Page Number: 83/101

CATALYTIC MIGRATION from the RISC ENVIRONMENT POINT-OF-VIEW

This research was sponsored by NCR

Page Number: 84/101

DEFINITION: DIRECT MIGRATION Migration of an entire hardware resource into the system software.

EXAMPLES:

Pipeline interlock.
Branch delay control.

ESSENCE: Examples that result in code* speed-up are very difficult to invent.

Page Number: 85/101

DELAYED CONTROL TRANSFER

Delayed Branch Scheme

I1 fetch

I2 fetch

I1 execution; branch address calculation; branch target calculation

I3 fetch

I2 execution

time
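The effect of the delayed branch scheme on the executed instruction stream can be shown with a toy interpreter. This is a minimal sketch under simplifying assumptions: one delay slot, unconditional branches only, and a made-up program encoding in which each instruction is a tuple.

```python
def run_delayed_branch(program, max_steps=100):
    """Return the indices of executed instructions, in order.

    program is a list of ("op", label) or ("branch", target_index) tuples.
    The instruction right after a branch (the delay slot) always executes
    before control reaches the branch target.
    """
    trace = []
    pc = 0
    pending_target = None
    while pc < len(program) and len(trace) < max_steps:
        kind, arg = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if pending_target is not None:
            # This instruction was the delay slot; now the branch takes effect.
            next_pc = pending_target
            pending_target = None
        if kind == "branch":
            pending_target = arg
        pc = next_pc
    return trace


program = [
    ("op", "I0"),
    ("branch", 4),           # I1: branch to index 4
    ("op", "I2 delay slot"), # fetched while I1 executes, so it still runs
    ("op", "skipped"),
    ("op", "target"),
]
print(run_delayed_branch(program))
```

The trace shows the slot instruction at index 2 executing between the branch and its target, while index 3 is skipped. Filling that slot with useful work instead of a nop is left to the compiler.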

Page Number: 86/101

DEFINITION: Catalytic Migration

Migration based on the utilization of a catalyst. MIGRANT vs CATALYST

Figure 7.13. The catalytic migration concept. Symbols M, C, and P refer to the migrant, the catalyst, and the processor, respectively. The speed-up is obtained by extracting a migrant of relatively large VLSI area and adding a catalyst of significantly smaller VLSI area.

ESSENCE:

Examples that result in code speed-up are much easier to invent.

Page Number: 87/101

METHODOLOGY:

Area estimation: Migrant
Area estimation: Catalyst
Real estate to invest: Difference
Investment strategy: R

Compile time algorithms
Analytical analysis
Simulation analysis
Implementational analysis

NOTE: Before the reinvestment, the migration may result in slow-down.

Page Number: 88/101

(N-2)*W vs DMA

Figure 7.16. An example of the DW (double windows) type of catalytic migration, (a) before the migration; (b) after the migration.

Symbol M refers to the main store. The symbol L-bit DMA refers to the direct memory access which transfers L bits in one clock cycle. Symbol NW refers to the register file with N partially overlapping windows (as in the UCB-RISC processor), while the symbol DW refers to the register file of the same type, only this time with two partially overlapping windows. The addition of the L-bit DMA mechanism, in parallel to the execution using one window, enables the simultaneous transfer between the main store and the window which is currently not in use. This enables one to keep the contents of the nonexistent N – 2 windows in the main store, which not only keeps the resulting code from slowing down, but actually speeds it up, because the transistors released through the omission of N – 2 windows can be reinvested more appropriately.

Migrant: (N-2)*W
Catalyst: L-bit DMA
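The area bookkeeping behind this example can be sketched as below. All numbers are hypothetical placeholders, not data from the slides: N = 8 windows, W = 16 registers per window, 32-bit registers, seven transistors per register cell, and a 5000-transistor DMA catalyst. Window overlap is ignored for simplicity.

```python
def window_area(windows, registers_per_window, bits=32, transistors_per_cell=7):
    """Rough transistor count of a set of register windows (overlap ignored)."""
    return windows * registers_per_window * bits * transistors_per_cell


def reinvestment_budget(migrant_area, catalyst_area):
    """Transistors freed by removing the migrant and adding the catalyst."""
    return migrant_area - catalyst_area


N, W = 8, 16                   # hypothetical window count and window size
migrant = window_area(N - 2, W)  # the (N-2)*W registers moved to main store
catalyst = 5000                # hypothetical cost of the L-bit DMA mechanism
print(migrant, reinvestment_budget(migrant, catalyst))
```

A positive budget is the real estate left to reinvest. As the methodology notes, the migration alone may slow the code down until that budget is spent on something useful.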

Page Number: 89/101

i: load r1, MA{MEM – 6}
i + 1: load r2, MA{MEM – 3}

Figure 7.14. An example of catalytic migration: Type HW (hand walking): (a) before the migration; (b) after the migration. Symbols P and GRF refer to the processor and the general-purpose register file, respectively. Symbols RA and MA refer to the register address and the memory address in the load instruction. Symbol MEM – n refers to the main store which is n clocks away from the processor. Addition of another bus for the register address eliminates a relatively large number of nop instructions (which have to separate the interfering load instructions).
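One way to see why the interfering loads need nops is to count them under a simple model. The sketch below rests on an assumption that goes beyond the caption: without a register-address bus, load data must return to the processor in issue order, so a later load from a nearer memory has to be delayed. The function name is hypothetical.

```python
def nops_without_ra_bus(latencies):
    """Count the nops needed so that load data returns in issue order.

    latencies[k] is the distance, in clocks, of the memory addressed by the
    k-th load (the n of MEM - n). One instruction issues per clock.
    """
    if not latencies:
        return 0
    nops = 0
    issue = 0
    last_return = issue + latencies[0]
    for latency in latencies[1:]:
        earliest = issue + 1
        # The data must arrive strictly after the previous load's data.
        needed = last_return + 1 - latency
        next_issue = max(earliest, needed)
        nops += next_issue - earliest
        issue = next_issue
        last_return = issue + latency
    return nops


print(nops_without_ra_bus([6, 3]))  # the load pair shown above
print(nops_without_ra_bus([3, 6]))  # the same loads in the opposite order
```

With the register address travelling alongside the memory address, returning data can be written to the right register in any order, so under this model none of these nops would be needed.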

Page Number: 90/101

Figure 7.15. An example of catalytic migration: type II (ignore instruction): (a) before the migration; (b) after the migration. Symbol t refers to time, and symbol UI refers to the useful instruction. This figure shows the case in which the code optimizer has successfully eliminated only two nop instructions, and has inserted the ignore instruction, immediately after the last useful instruction. The addition of the ignore instruction and the accompanying decoder logic eliminates a relatively large number of nop instructions, and speeds up the code, through a better utilization of the instruction cache.

Page Number: 91/101

CODE INTERLEAVING

Figure 7.17. An example of the CI (code interleaving) catalytic migration: (a) before the migration; (b) after the migration. Symbols A and B refer to the parts of the code in two different routines that share no data dependencies. Symbols GRF and SGRF refer to the general purpose register file (GRF), and the subset of the GRF (SGRF). The sequential code of routine A is used to fill in the slots in routine B, and vice versa. This is enabled by adding new registers (SGRF) and some additional control logic which is quite. The speed-up is achieved through the elimination of nop instructions, and the increased efficiency of the instruction cache (a consequence of the reduced code size).

Page Number: 92/101

CLASSIFICATION:

CM

ICM ACM

C-+ C++ -+ ++

EXAMPLES:

(N-2)*W vs DMA
RDEST BUS vs CFF
IGNORE
CODE INTERLEAVING

Page Number: 93/101

for i := 1 to N do:

1. MAE
2. CAE
3. DFR
4. RSD
5. CTA
6. AAP
7. AAC
8. SAP
9. SAC
10. SLL

end do

Figure 7.18. A methodological review of catalytic migration (intended for a detailed study of a new catalytic migration example). Symbols S and R refer to the speed-up and the initial register count. Symbol N refers to the number of generated ideas. The meaning of other symbols is as follows: MAE (migrant area estimate), CAE (catalyst area estimate), DFR (difference for reinvestment), RSD (reinvestment strategy developed), CTA (compile-time algorithm), AAC (analytical analysis of the complexity), AAP (analytical analysis of the performance), SAC (simulation analysis of the complexity), SAP (simulation analysis of the performance), SLL (summary of lessons learned).

Page Number: 94/101

RISCs FOR NN: Core + Accelerators

Figure 8.1. RISC architecture with on-chip accelerators. Accelerators are labeled ACC#1, ACC#2, …, and they are placed in parallel with the ALU. The rest of the diagram is the common RISC core. All symbols have standard meanings.

Page Number: 95/101

Figure 8.2. Basic problems encountered during the realization of a neural computer: (a) an electronic neuron; (b) an interconnection network for a neural network. Symbol D stands for the dendrites (inputs), symbol S stands for the synapses (resistors), symbol N stands for the neuron body (amplifier), and symbol A stands for the axon (output). The symbols , , , and stand for the input connections, and

the symbols , , , and stand for the output connections.
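The electronic neuron of part (a) can be modeled functionally as a weighted sum followed by an amplifier. The sketch below is a generic illustration, not the circuit in the figure: the function name is hypothetical, the synapse weights stand in for resistor conductances, and `tanh` is assumed as the amplifier's saturating characteristic.

```python
import math


def neuron_output(inputs, weights, activation=math.tanh):
    """Axon output: the amplifier applied to the synapse-weighted sum of dendrite inputs."""
    if len(inputs) != len(weights):
        raise ValueError("one weight (synapse) is needed per input (dendrite)")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)


print(neuron_output([1.0, -1.0], [0.5, 0.5]))      # inputs cancel out
print(neuron_output([1.0, 1.0, 1.0], [0.2, 0.3, 0.5]))
```

The per-neuron arithmetic is simple. The interconnection network of part (b) is the harder part, since every output may have to reach many inputs.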

Page Number: 96/101

Figure 8.3. A system architecture with N-RISC processors as nodes. Symbol PE (processing element) represents one N-RISC, and refers to “hardware neuron.” Symbol PU (processing unit) represents the software routine for one neuron, and refers to “software neuron.” Symbol H refers to the host processor, symbol L refers to the 16-bit link, and symbol R refers to the routing algorithm based on the MP (message passing) method.

Page Number: 97/101

Figure 8.4. The architecture of an N-RISC processor. This figure shows two neighboring N-RISC processors, on the same ring. Symbols A, D, and M refer to the addresses, data, and memory, respectively. Symbols PLA (comm) and PLA (proc) refer to the PLA logic for the communication and processor subsystems, respectively. Symbol NLR refers to the register which defines the address of the neuron (name/layer register). Symbol refers to the only register in the N‑RISC processor. Other symbols are standard.

Page Number: 98/101

Figure 8.5. Example of an accelerator for neural RISC: (a) a three-layer neural network; (b) its implementation based on the reference [Distante91]. The squares in Figure 8.5.a stand for input data sources, and the circles stand for the network nodes. Symbols W in Figure 8.5.b stand for weights, and symbols F stand for the firing triggers. Symbols PE refer to the processing elements. Symbols W have two indices associated with them, to define the connections of the element (for example, and so on). The exact values of the indices are left to the reader to determine, as an exercise. Likewise, the PE symbols have one index associated with them, to determine the node they belong to. The exact values of these indices were also left out, so the reader should determine them, too.

Page Number: 99/101

Figure 8.6. VLSI layout for the complete architecture of Figure 8.5. Symbol T refers to the delay unit, while symbols IN and OUT refer to the inputs and the outputs, respectively.

Page Number: 100/101

Figure 8.7. Timing for the complete architecture of Figure 8.5. Symbol t refers to time, symbol F refers to the moments of triggering, and symbol P refers to the ordinal number of the processing element.

Page Number: 101/101

http://galeb.etf.bg.ac.yu/~vm/

e-mail: [email protected]
