
1-1 Chapter 1—The General Purpose Machine

Computer Systems Design and Architecture by V. Heuring and H. Jordan © 1997 V. Heuring and H. Jordan

Chapter 1: The General Purpose Machine

Topics
1.1 The General Purpose Machine
1.2 The User’s View
1.3 The Machine/Assembly Language Programmer’s View
1.4 The Computer Architect’s View
1.5 The Computer System Logic Designer’s View
1.6 Historical Perspective
1.7 Trends and Research
1.8 Approach of the Text




Looking Ahead: Chapter 2
Explores the nature of machines and machine languages

• Relationship of machines and languages
• Generic 32-bit Simple RISC Computer (SRC)
• Register transfer notation (RTN)
  • The main function of the CPU is the register transfer
  • RTN provides a formal specification of machine structure and function
  • Maps directly to hardware
• RTN and SRC will be used for examples in subsequent chapters
• Provides a general discussion of addressing modes
• Presents a view of logic design aimed at implementing registers and register transfers




Looking Ahead: Chapter 3

• Treats 2 real machines of different types, CISC and RISC, in some depth
• Discusses general machine characteristics and performance
• Differences in design philosophies of
  • CISC (Complex Instruction Set Computer) and
  • RISC (Reduced Instruction Set Computer) architectures
• CISC machine: Motorola MC68000
• RISC machine: SPARC
• Applies RTN to the description of real machines




Looking Ahead: Chapter 4
This keystone chapter describes processor design at the logic gate level

• Describes the connection between the instruction set and the hardware
• Develops alternative 1-, 2-, and 3-bus designs of SRC at the gate level
• RTN provides description of structure and function at low and high levels
• Shows how to design the control unit that makes it all run
• Describes two additional machine features:
  • implementation of exceptions (interrupts)
  • machine reset capability




Looking Ahead: Chapter 5
Important advanced topics in CPU design

• General discussion of pipelining: having more than one instruction executing simultaneously
  • requirements on the instruction set
  • how instruction classes influence design
  • pipeline hazards: detection & management
• Design of a pipelined version of SRC
• Instruction-level parallelism: issuing more than one instruction simultaneously
  • Superscalar and VLIW designs
• Microcoding as a way to implement control




Looking Ahead: Chapter 6
The arithmetic and logic unit: ALU

• Impact on system performance
• Digital number systems and arithmetic in an arbitrary radix
  • number systems and radix conversion
  • integer add, subtract, multiply, and divide
• Time/space trade-offs: fast parallel arithmetic
• Floating point representations and operations
• Branching and the ALU
• Logic operations
• ALU hardware design




Looking Ahead: Chapter 7
The memory subsystem of the computer

• Structure of 1-bit RAM and ROM cells
• RAM chips, boards, and modules
• Concept of a memory hierarchy
  • nature of different levels
  • interaction of adjacent levels
• Virtual memory
• Cache design: matching cache & main memory
• Memory as a complete system




Looking Ahead: Chapter 8
Computer input and output: I/O

• Kinds of system buses, signals and timing
• Serial and parallel interfaces
• Interrupts and the I/O system
• Direct memory access (DMA)
• DMA, interrupts, and the I/O system
• The hardware/software interface: device drivers




Looking Ahead: Chapter 9
Structure, function, and performance of peripheral devices

• Disk drives
  • Organization
  • Static and dynamic properties
• Video display terminals
  • Memory-mapped video
• Printers
• Mouse and keyboard
• Interfacing to the analog world




Looking Ahead: Chapter 10
Computer communications, networking, and the Internet

• Communications protocols; layered networks
  • The OSI layer model
• Point to point communication: RS-232 and ASCII
• Local area networks (LANs)
  • Example: Ethernet
• Internetworking and the Internet
  • TCP/IP protocol stack
  • Packet routing and routers
  • IP addresses: assignment and use
  • Nets and subnets: subnet masks
• Internet applications and futures




Chapter 1—A Perspective

• Alan Turing showed that an abstract computer, a Turing machine, can compute any function that is computable by any means

• A general purpose computer with enough memory is equivalent to a Turing machine

• Over 50 years, computers have evolved
  • from memory size of 1 kiloword (1024 words) and clock periods of 1 millisecond (0.001 s)
  • to memory size of a terabyte (2^40 bytes) and clock periods of 1 ns (10^-9 s)

• More speed and capacity are needed for many applications, such as real-time 3D animation




Scales, Units, and Conventions

Term         Normal usage   As a power of 2
K (kilo-)    10^3           2^10 = 1,024
M (mega-)    10^6           2^20 = 1,048,576
G (giga-)    10^9           2^30 = 1,073,741,824
T (tera-)    10^12          2^40 = 1,099,511,627,776

Term         Usage
m (milli-)   10^-3
µ (micro-)   10^-6
n (nano-)    10^-9
p (pico-)    10^-12

Units: Bit (b), Byte (B), Nibble, Word (w), Double Word, Long Word, Second (s), Hertz (Hz)

Note the differences between usages. You should commit the powers of 2 and 10 to memory.
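The gap between the two usages can be checked with a few lines of Python. This is only an illustrative sketch; the table name PREFIXES and the function prefix_values are hypothetical names chosen for this example, not something the text defines.

```python
# Decimal (power of 10) versus power-of-2 readings of the same prefix.
# PREFIXES maps a prefix symbol to (decimal exponent, binary exponent).
PREFIXES = {"K": (3, 10), "M": (6, 20), "G": (9, 30), "T": (12, 40)}


def prefix_values(symbol):
    """Return (power-of-10 value, power-of-2 value) for a prefix symbol."""
    dec_exp, bin_exp = PREFIXES[symbol]
    return 10 ** dec_exp, 2 ** bin_exp


for symbol in PREFIXES:
    decimal, binary = prefix_values(symbol)
    excess = (binary - decimal) / decimal * 100
    print(f"{symbol}: {decimal:,} versus {binary:,} (+{excess:.1f}%)")
```

The relative difference grows with each prefix, which is one reason the two usages should not be mixed silently.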




Fig 1.1 The User’s View of a Computer

The user sees software, speed, storage capacity, and peripheral device functionality.

1.10 Looking Ahead

The intellectual synthesis that comes from viewing a computer system from each of the three perspectives leads to an efficient, effective computer design. It is when you understand how a machine functions at the gate, ISA, and the system architecture level that you fully understand the machine. Whether your career objective is in Computer Science, Computer Engineering, or some other aspect of computers, it is our sincerest hope that this book will serve you by providing that understanding.




Machine/Assembly Language Programmer’s View

• Machine language:
  • Set of fundamental instructions the machine can execute
  • Expressed as a pattern of 1’s and 0’s
• Assembly language:
  • Alphanumeric equivalent of machine language
  • Mnemonics more human-oriented than 1’s and 0’s
• Assembler:
  • Computer program that transliterates (one-to-one mapping) assembly to machine language
  • Computer’s native language is machine/assembly language
  • “Programmer,” as used in this course, means machine/assembly language programmer




Machine and Assembly Language

• The assembler converts assembly language to machine language. You must also know how to do this.

Tbl 1.2 Two Motorola MC68000 Instructions

MC68000 Assembly Language   Machine Language
MOVE.W D4, D5               0011 101 000 000 100
ADDI.W #9, D2               0000 000 010 111 100
                            0000 0000 0000 1001

Fields of the MOVE.W encoding: op code (0011), data reg. #5 (101), data reg. #4 (100)
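Doing the assembler's job by hand means packing and unpacking bit fields. The sketch below splits the MOVE.W D4, D5 word from the table into its fields. The function name decode_move_w is hypothetical, and the two "mode" field names follow the general MC68000 MOVE layout rather than anything labeled in the table.

```python
def decode_move_w(word):
    """Split a 16-bit MC68000 MOVE word into its bit fields."""
    return {
        "opcode": (word >> 12) & 0xF,    # 0011 for a word-sized move
        "dst_reg": (word >> 9) & 0x7,    # destination register number
        "dst_mode": (word >> 6) & 0x7,   # 000 = data register direct
        "src_mode": (word >> 3) & 0x7,   # 000 = data register direct
        "src_reg": word & 0x7,           # source register number
    }


# MOVE.W D4, D5 written as the bit pattern from the table.
word = int("0011" "101" "000" "000" "100", 2)
fields = decode_move_w(word)
print(hex(word), fields)
```

The destination register (5) sits to the left of the source register (4) in the word, the reverse of the assembly language operand order.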




The Stored Program Concept

• It is the basic operating principle for every computer.
• It is so common that it is taken for granted.
• Without it, every instruction would have to be initiated manually.

The stored program concept says that the program is stored with data in the computer’s memory. The computer is able to manipulate it as data: for example, to load it from disk, move it in memory, and store it back on disk.




Fig 1.2 The Fetch-Execute Process

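[Figure: the MC68000 CPU and main memory. The 32-bit program counter (PC, bits 31..0) holds 4000. Main memory, addressed from 0 to 2^31 – 1 and 16 bits (15..0) wide, holds the instruction word 0011 101 000 000 100 at address 4000. That word is fetched into the 16-bit instruction register (IR, bits 15..0). The control unit issues control signals, and the CPU also contains various CPU registers.]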




Programmer’s Model:Instruction Set Architecture (ISA)

• Instruction set: the collection of all machine operations.
• Programmer sees set of instructions, along with the machine resources manipulated by them.
• ISA includes
  • Instruction set,
  • Memory, and
  • Programmer-accessible registers of the system.
• There may be temporary or scratch-pad memory used to implement some function; it is not part of the ISA.
  • It is not programmer accessible.




Fig 1.3 Programmer’s Models of 4 Commercial Machines

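[Figure: programmer's models of four machines, summarized below.]

• M6800 (introduced 1975): registers A and B (bits 7..0); IX, SP, and PC (bits 15..0); 2^16 bytes of main memory capacity (addresses 0 to 2^16 – 1); fewer than 100 instructions
• I8086 (introduced 1979): data registers AX, BX, CX, DX (bits 15..8 and 7..0); address and count registers SP, BP, SI, DI; memory segment registers CS, DS, SS, ES; IP and Status; memory addresses 0 to 2^20 – 1
• VAX11 (introduced 1981): 12 general purpose registers R0 to R11 (bits 31..0), plus AP, FP, SP, PC, and the PSW (Status); memory addresses 0 to 2^32 – 1; more than 300 instructions
• PPC601 (introduced 1993): 32 64-bit floating point registers (bits 0..63); 32 32-bit general purpose registers (bits 0..31); more than 50 32-bit special purpose registers; memory addresses 0 to 2^52 – 1

The figure also carries the label "6 special purpose registers."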




Machine, Processor, and Memory State

• The Machine State: contents of all registers in system, accessible to programmer or not

• The Processor State: registers internal to the CPU
• The Memory State: contents of registers in the memory system
• “State” is used in the formal finite state machine sense
• Maintaining or restoring the machine and processor state is important to many operations, especially procedure calls and interrupts




Data Type: HLL Versus Machine Language

• HLLs provide type checking
  • Verifies proper use of variables at compile time
  • Allows compiler to determine memory requirements
  • Helps detect bad programming practices
• Most machines have no type checking
  • The machine sees only strings of bits
  • Instructions interpret the strings as a type: usually limited to signed or unsigned integers and FP numbers
  • A given 32-bit word might be an instruction, an integer, an FP number, or 4 ASCII characters
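The last point can be made concrete by reading one 32-bit pattern several ways. The sketch below uses Python's standard struct module. The byte values and the big-endian byte order are arbitrary assumptions made for illustration only.

```python
import struct

# One arbitrary 32-bit pattern, stored as four bytes (big-endian assumed).
word = bytes([0x41, 0x42, 0x43, 0x44])

as_unsigned = struct.unpack(">I", word)[0]   # unsigned 32-bit integer
as_signed = struct.unpack(">i", word)[0]     # 2's complement 32-bit integer
as_float = struct.unpack(">f", word)[0]      # IEEE single-precision FP number
as_ascii = word.decode("ascii")              # four ASCII characters

print(as_unsigned, as_signed, as_float, as_ascii)
```

Nothing in the bits says which reading is intended; the instruction that consumes the word decides.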




Tbl 1.3 Instruction Classes

Instruction Class   C             VAX Assembly Language
Data movement       a = b         MOV b, a
Arithmetic/logic    b = c + d*e   MPY d, e, b
                                  ADD c, b, b
Control flow        goto LBL      BR LBL

• This compiler:
  • Maps C integers to 32-bit VAX integers
  • Maps C assign, *, and + to VAX MOV, MPY, and ADD
  • Maps C goto to VAX BR instruction
• The compiler writer must develop this mapping for each language-machine pair




Tools of the Assembly Language Programmer’s Trade

• The assembler
• The linker
• The debugger or monitor
• The development system




Who Uses Assembly Language

• The machine designer
  • Must implement and trade off instruction functionality
• The compiler writer
  • Must generate machine language from a HLL
• The writer of time or space critical code
  • Performance goals may force program-specific optimizations of the assembly language
• Special purpose or embedded processor programmers
  • Special functions and heavy dependence on unique I/O devices can make HLLs useless




The Computer Architect’s View

• Architect is concerned with design & performance
• Designs the ISA for optimum programming utility and optimum performance of implementation
• Designs the hardware for best implementation of the instructions
• Uses performance measurement tools, such as benchmark programs, to see that goals are met
• Balances performance of building blocks such as CPU, memory, I/O devices, and interconnections
• Meets performance goals at lowest cost




Buses as Multiplexers

• Interconnections are very important to a computer
• Most connections are shared
• A bus is a time-shared connection or multiplexer
• A bus provides a data path and control
• Buses may be serial, parallel, or a combination
  • Serial buses transmit one bit at a time
  • Parallel buses transmit many bits simultaneously on many wires




Fig 1.4 Simple One- andTwo-Bus Architectures

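[Figure: (a) One bus: the CPU, memory, and input/output subsystem (with its input/output devices) share a single n-bit system bus. (b) Two buses: the CPU connects to memory over a memory bus and to the input/output subsystem (with its input/output devices) over a separate I/O bus.]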




Fig 1.5 The Apple Quadra 950Bus System (Simplified)

[Figure: the CPU and memory sit on the system bus, which connects through the following interfaces to the other buses.]

• LocalTalk interface, LocalTalk bus: printers, other computers
• ADB transceiver, ADB bus: keyboard, mouse, bit pads
• SCSI interface, SCSI bus: disk drives, CD ROM drives
• NuBus interface, NuBus: video and special purpose cards
• Ethernet transceiver, Ethernet: other computers




Fig 1.6 The Memory Hierarchy

• Modern computers have a hierarchy of memories
  • Allows tradeoffs of speed/cost/volatility/size, etc.
• CPU sees common view of levels of the hierarchy.

CPU ↔ Cache Memory ↔ Main Memory ↔ Disk Memory ↔ Tape Memory




Tools of the Architect’s Trade

• Software models, simulators and emulators
• Performance benchmark programs
• Specialized measurement programs
• Data flow and bottleneck analysis
• Subsystem balance analysis
• Parts, manufacturing, and testing cost analysis




Logic Designer’s View

• Designs the machine at the logic gate level
• The design determines whether the architect meets cost and performance goals
• Architect and logic designer may be a single person or team




Implementation Domains

An implementation domain is the collection of devices, logic levels, etc. which the designer uses.

Possible implementation domains:
• VLSI on silicon
• TTL or ECL chips
• Gallium arsenide chips
• PLAs or sea-of-gates arrays
• Fluidic logic or optical switches




Fig 1.7 Three Implementation Domains for the 2-1 Multiplexer

• 2-1 multiplexer in three different implementation domains
  • Generic logic gates (abstract domain)
  • National Semiconductor FAST Advanced Schottky TTL (VLSI on Si)
  • Fiber optic directional coupler switch (optical signals in LiNbO3)

[Figure, three panels, labeled with data inputs I0 and I1, select input S, and output O:
(a) Abstract view of Boolean logic
(b) TTL implementation domain: a 74F257N package (U6) with inputs /G, /A/B, 1A, 1B, 2A, 2B, 3A, 3B, 4A, 4B and outputs 1Y, 2Y, 3Y, 4Y
(c) Optical switch implementation]
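A fourth "implementation domain" for the same function is software. The sketch below models the abstract 2-1 multiplexer on single bits. It assumes that S = 0 selects I0 and S = 1 selects I1, which the figure residue does not state, and the function name mux2to1 is hypothetical.

```python
def mux2to1(s, i0, i1):
    """2-1 multiplexer on single bits: O = (NOT S AND I0) OR (S AND I1)."""
    return ((1 - s) & i0) | (s & i1)


# Print the full truth table.
for s in (0, 1):
    for i1 in (0, 1):
        for i0 in (0, 1):
            print(f"S={s} I1={i1} I0={i0} -> O={mux2to1(s, i0, i1)}")
```

The Boolean function is the same in every domain; only the devices and signal levels that realize it change.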




The Distinction Between Classical Logic Design and Computer Logic Design

• The entire computer is too complex for traditional FSM design techniques
  • FSM techniques can be used “in the small”
• There is a natural separation between data and control
  • Data path: storage cells, arithmetic, and their connections
  • Control path: logic that manages data path information flow
• Well defined logic blocks are used repeatedly
  • Multiplexers, decoders, adders, etc.




Two Views of the CPU PC Register

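Programmer: PC is a 32-bit register, bits 31..0.

Logic Designer (Fig 1.8): PC is a 32-bit-wide set of D flip-flops (inputs D, outputs Q) clocked by CK, connected to the A Bus and B Bus, and controlled by the signals PCin and PCout.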




Tools of the Logic Designer’s Trade

• Computer-aided design tools
  • Logic design and simulation packages
  • Printed circuit layout tools
  • IC (integrated circuit) design and layout tools
• Logic analyzers and oscilloscopes
• Hardware development system




Historical Generations

• 1st Generation: 1946–59, vacuum tubes, relays, mercury delay lines

• 2nd generation: 1959–64, discrete transistors and magnetic cores

• 3rd generation: 1964–75, small- and medium-scale integrated circuits

• 4th generation: 1975–present, single-chip microcomputer

• Integration scale: components per chip
  • Small: 10–100
  • Medium: 100–1,000
  • Large: 1,000–10,000
  • Very large: greater than 10,000




Chapter 1 Summary

• Three different views of machine structure and function
• Machine/assembly language view: registers, memory cells, instructions
  • PC, IR
  • Fetch-execute cycle
  • Programs can be manipulated as data
  • No, or almost no, data typing at machine level
• Architect views the entire system
  • Concerned with price/performance, system balance
• Logic designer sees system as collection of functional logic blocks
  • Must consider implementation domain
  • Tradeoffs: speed, power, gate fan-in, fan-out


2-1 Chapter 2—Machines, Machine Languages, and Digital Logic


Chapter 2: Machines, Machine Languages, and Digital Logic

Topics

2.1 Classification of Computers and Their Instructions
2.2 Computer Instruction Sets
2.3 Informal Description of the Simple RISC Computer, SRC
2.4 Formal Description of SRC Using Register Transfer Notation, RTN
2.5 Describing Addressing Modes with RTN
2.6 Register Transfers and Logic Circuits: From Behavior to Hardware




What Are the Components of an ISA?

• Sometimes known as The Programmer’s Model of the machine
• Storage cells
  • General and special purpose registers in the CPU
  • Many general purpose cells of same size in memory
  • Storage associated with I/O devices
• The machine instruction set
  • The instruction set is the entire repertoire of machine operations
  • Makes use of storage cells, formats, and results of the fetch/execute cycle
  • i.e., register transfers
• The instruction format
  • Size and meaning of fields within the instruction
• The nature of the fetch-execute cycle
  • Things that are done before the operation code is known




Fig. 2.1 Programmer’s Models of Various Machines

We saw in Chap. 1 a variation in number and type of storage cells


[Figure: the same four programmer's models shown in Fig 1.3: M6800 (introduced 1975), I8086 (introduced 1979), VAX11 (introduced 1981), and PPC601 (introduced 1993).]




What Must an Instruction Specify?

• Which operation to perform: add r0, r1, r3
  • Ans: Op code: add, load, branch, etc.
• Where to find the operand or operands: add r0, r1, r3
  • In CPU registers, memory cells, I/O locations, or part of instruction
• Place to store result: add r0, r1, r3
  • Again CPU register or memory cell
• Location of next instruction: add r0, r1, r3 followed by br endloop
  • Almost always memory cell pointed to by program counter (PC)
• Sometimes there is no operand, or no result, or no next instruction. Can you think of examples?

[Slide label: Data Flow]




Instructions Can Be Divided into 3 Classes

• Data movement instructions
  • Move data from a memory location or register to another memory location or register without changing its form
  • Load: source is memory and destination is register
  • Store: source is register and destination is memory
• Arithmetic and logic (ALU) instructions
  • Change the form of one or more operands to produce a result stored in another location
  • Add, Sub, Shift, etc.
• Branch instructions (control flow instructions)
  • Alter the normal flow of control from executing the next instruction in sequence
  • Br Loc, Brz Loc2: unconditional or conditional branches




Tbl 2.1 Examples of Data Movement Instructions

• Lots of variation, even with one instruction type

Instruction       Machine         Meaning
MOV A, B          VAX11           Move 16 bits from memory location A to location B
LDA A, Addr       M6800           Load accumulator A with the byte at memory location Addr
lwz R3, A         PPC601          Move 32-bit data from memory location A to register R3
li $3, 455        MIPS R3000      Load the 32-bit integer 455 into register $3
mov R4, dout      DEC PDP11       Move 16-bit data from R4 to output port dout
IN AL, KBD        Intel Pentium   Load a byte from in port KBD to accumulator
LEA.L (A0), A2    M68000          Load the address pointed to by A0 into A2




Tbl 2.2 Examples of ALU Instructions

Instruction       Machine       Meaning
MULF A, B, C      VAX11         Multiply the 32-bit floating point values at mem loc’ns. A and B, store at C
nabs r3, r1       PPC601        Store negative abs value of r1 in r3
ori $2, $1, 255   MIPS R3000    Store logical OR of reg $1 with 255 into reg $2
DEC R2            DEC PDP11     Decrement the 16-bit value stored in reg R2
SHL AX, 4         Intel 8086    Shift the 16-bit value in reg AX left by 4 bit pos’ns.

• Notice again the complete dissimilarity of both syntax and semantics.




Tbl 2.3 Examples of Branch Instructions

Instruction       Machine       Meaning
BLBS A, Tgt       VAX11         Branch to address Tgt if the least significant bit of mem loc’n. A is set (i.e., = 1)
bun r2            PPC601        Branch to location in R2 if result of previous floating point computation was Not a Number (NaN)
beq $2, $1, 32    MIPS R3000    Branch to location (PC + 4 + 32) if contents of $1 and $2 are equal
SOB R4, Loop      DEC PDP11     Decrement R4 and branch to Loop if R4 ≠ 0
JCXZ Addr         Intel 8086    Jump to Addr if contents of register CX = 0




CPU Registers Associated with Flow of Control—Branch Instructions

• Program counter usually locates next instruction
• Condition codes may control branch
• Branch targets may be separate registers

[Figure: processor state associated with flow of control: the program counter, the condition codes (C, N, V, Z), and the branch target registers.]




HLL Conditionals Implemented by Control Flow Change

• Conditions are computed by arithmetic instructions
• Program counter is changed to execute only instructions associated with true conditions

C language:
    if (NUM == 5) SET = 7;

Assembly language:
        CMP.W  #5, NUM    ;the comparison
        BNE    L1         ;conditional branch
        MOV.W  #7, SET    ;action if true
    L1  ...               ;action if false




CPU Registers May Have a “Personality”

• Architecture classes are often based on where the operands and result are located and how they are specified by the instruction.

• They can be in CPU registers or main memory:

[Figure: three register organizations. Stack machine: a stack (Top, Second, ...) accessed by Push and Pop. Accumulator machine: arithmetic registers and address registers. General register machine: a set of general purpose registers.]




3-, 2-, 1-, & 0-Address ISAs

• The classification is based on arithmetic instructions that have two operands and one result
• The key issue is “how many of these are specified by memory addresses, as opposed to being specified implicitly”
• A 3-address instruction specifies memory addresses for both operands and the result: R ← Op1 op Op2
• A 2-address instruction overwrites one operand in memory with the result: Op2 ← Op1 op Op2
• A 1-address instruction has a processor register, called the accumulator, to hold one operand & the result (no addr. needed): Acc ← Acc op Op1
• A 0-address instruction uses a CPU register stack to hold both operands and the result: TOS ← TOS op SOS (where TOS is Top Of Stack, SOS is Second On Stack)
• The 4-address instruction, hardly ever seen, also allows the address of the next instruction to be specified explicitly




Fig 2.2 The 4-Address Machine and Instruction Format

• Explicit addresses for operands, result, & next instruction
• Example assumes 24-bit addresses
  • Discuss: size of instruction in bytes

add, Res, Op1, Op2, Nexti (Res ← Op1 + Op2)

Instruction format:

Field:    add               ResAddr               Op1Addr   Op2Addr        NextiAddr
Bits:     8                 24                    24        24             24
Purpose:  Which operation   Where to put result   Where to find operands   Where to find next instruction

[Figure: the CPU executes the instruction against memory, which holds Op1 at Op1Addr, Op2 at Op2Addr, Res at ResAddr, and Nexti at NextiAddr.]





Fig 2.3 The 3-Address Machine and Instruction Format

• Address of next instruction kept in processor state register, the PC (except for explicit branches/jumps)
• Rest of addresses in instruction
  • Discuss: savings in instruction word size

add, Res, Op1, Op2 (Res ← Op2 + Op1)

Instruction format:

Field:    add               ResAddr               Op1Addr   Op2Addr
Bits:     8                 24                    24        24
Purpose:  Which operation   Where to put result   Where to find operands

[Figure: the CPU holds the program counter; memory holds Op1 at Op1Addr, Op2 at Op2Addr, Res at ResAddr, and Nexti at NextiAddr.]





Fig 2.4 The 2-Address Machine and Instruction Format

• Result overwrites Operand 2
• Needs only 2 addresses in instruction but less choice in placing data

add Op2, Op1 (Op2 ← Op2 + Op1)

Instruction format:

Field:    add               Op2Addr   Op1Addr
Bits:     8                 24        24
Purpose:  Which operation   Where to find operands (Op2Addr is also where to put result)

[Figure: the CPU holds the program counter; memory holds Op1 at Op1Addr, Op2 and then Res at Op2Addr, and Nexti at NextiAddr.]




Fig 2.5 1-Address Machine and Instruction Format

• Special CPU register, the accumulator, supplies 1 operand and stores result
• One memory address used for other operand

Need instructions to load and store operands:
    LDA OpAddr
    STA OpAddr

add Op1 (Acc ← Acc + Op1)

Instruction format:

Field:    add               Op1Addr
Bits:     8                 24
Purpose:  Which operation   Where to find operand1

Where to find operand2, and where to put result: the accumulator.

[Figure: the CPU holds the accumulator and the program counter; memory holds Op1 at Op1Addr and Nexti at NextiAddr.]




Fig 2.6 The 0-Address, or Stack, Machine and Instruction Format

• Uses a push-down stack in CPU
• Arithmetic uses stack for both operands and the result
• Computer must have a 1-address instruction to push and pop operands to and from the stack

Instruction formats:

push Op1 (TOS ← Op1)
Field:    push        Op1Addr
Bits:     8           24
Purpose:  Operation; the result goes to the top of the stack

add (TOS ← TOS + SOS)
Field:    add
Bits:     8
Purpose:  Which operation

Where to find operands, and where to put result: on the stack.

[Figure: the CPU holds the stack (TOS, SOS, etc.) and the program counter; memory holds Op1 at Op1Addr and Nexti at NextiAddr.]
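The behavior of such a machine can be sketched in a few lines of Python. This is an illustrative model only: run_stack_program, the (operation, address) tuple encoding, and the sample memory contents are all hypothetical, and the operand order for sub (second on stack minus top of stack) is an assumption chosen so that "push x, push y, sub" computes x - y.

```python
def run_stack_program(program, memory):
    """Execute (operation, address) pairs on a push-down stack."""
    stack = []
    for op, addr in program:
        if op == "push":
            stack.append(memory[addr])       # TOS <- M[addr]
        elif op == "pop":
            memory[addr] = stack.pop()       # M[addr] <- TOS
        else:
            top = stack.pop()
            second = stack.pop()
            if op == "add":
                stack.append(second + top)
            elif op == "mpy":
                stack.append(second * top)
            elif op == "sub":
                stack.append(second - top)   # assumed operand order
            else:
                raise ValueError(f"unknown operation: {op}")
    return memory


# Sample data; the program computes a = (b + c) * d - e.
memory = {"a": 0, "b": 2, "c": 3, "d": 4, "e": 5}
program = [("push", "b"), ("push", "c"), ("add", None), ("push", "d"),
           ("mpy", None), ("push", "e"), ("sub", None), ("pop", "a")]
run_stack_program(program, memory)
print(memory["a"])
```

Only push and pop carry an address; the arithmetic operations name no operands at all, which is what makes this a 0-address machine.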




Example 2.1 Expression Evaluation for 3-, 2-, 1-, and 0-Address Machines

• Number of instructions & number of addresses both vary
• Discuss as examples: size of code in each case

Evaluate a = (b+c)*d - e

3-address       2-address     1-address   Stack
add a, b, c     load a, b     load b      push b
mpy a, a, d     add a, c      add c       push c
sub a, a, e     mpy a, d      mpy d       add
                sub a, e      sub e       push d
                              store a     mpy
                                          push e
                                          sub
                                          pop a
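The "size of code" discussion can be worked through numerically. The sketch below assumes the 8-bit op code and 24-bit addresses used in the instruction format figures, and counts the addresses carried by each instruction in the four programs of Example 2.1. The names instruction_bytes, programs, and code_size are hypothetical.

```python
OPCODE_BITS = 8
ADDRESS_BITS = 24


def instruction_bytes(num_addresses):
    """Bytes in an instruction with an 8-bit op code and n 24-bit addresses."""
    return (OPCODE_BITS + num_addresses * ADDRESS_BITS) // 8


# Number of memory addresses in each instruction of each program.
programs = {
    "3-address": [3, 3, 3],
    "2-address": [2, 2, 2, 2],
    "1-address": [1, 1, 1, 1, 1],
    "stack": [1, 1, 0, 1, 0, 1, 0, 1],   # push/pop carry 1 address, ALU ops 0
}

code_size = {name: sum(instruction_bytes(n) for n in counts)
             for name, counts in programs.items()}

for name, size in code_size.items():
    print(f"{name}: {len(programs[name])} instructions, {size} bytes")
```

Under these assumptions, fewer addresses per instruction means more instructions but not necessarily more bytes of code.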




Fig 2.7 General Register Machine and Instruction Formats

• It is the most common choice in today’s general-purpose computers
• Which register is specified by small “address” (3 to 6 bits for 8 to 64 registers)
• Load and store have one long & one short address: 1-1/2 addresses
• Arithmetic instruction has 3 “half” addresses

Instruction formats:

load R8, Op1 (R8 ← Op1)          fields: load | R8 | Op1Addr
add R2, R4, R6 (R2 ← R4 + R6)    fields: add | R2 | R4 | R6

[Figure: the CPU holds the registers (R2, R4, R6, R8 shown) and the program counter; memory holds Op1 at Op1Addr and Nexti. The load moves Op1 from memory into R8.]




Real Machines Are Not So Simple

• Most real machines have a mixture of 3, 2, 1, 0, and 1-1/2 address instructions

• A distinction can be made on whether arithmetic instructions use data from memory

• If ALU instructions only use registers for operands and result, machine type is load-store

  • Only load and store instructions reference memory
• Other machines have a mix of register-memory and memory-memory instructions




Addressing Modes

• An addressing mode is hardware support for a useful way of determining a memory address

• Different addressing modes solve different HLL problems
  • Some addresses may be known at compile time, e.g., global variables
  • Others may not be known until run time, e.g., pointers
  • Addresses may have to be computed. Examples include:
    • Record (struct) components: variable base (full address) + constant (small)
    • Array components: constant base (full address) + index variable (small)
• Possible to store constant values without using another memory cell by storing them with or adjacent to the instruction itself




HLL Examples of Structured Addresses

• C language: rec->count
  • rec is a pointer to a record: full address variable
  • count is a field name: fixed byte offset, say 24
• C language: v[i]
  • v is fixed base address of array: full address constant
  • i is name of variable index: no larger than array size
• Variables must be contained in registers or memory cells
• Small constants can be contained in the instruction
• Result: need for “address arithmetic.”
  • E.g., address of rec->count is the address held in rec + offset of count.

[Figure: rec points to the start of a record containing the field count; v points to the start of an array containing the element v[i].]
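The two kinds of address arithmetic can be written out directly. In the sketch below, the function names, the base addresses, and the 4-byte element size are hypothetical values chosen for illustration. Scaling the index by the element size is an added assumption; the slide writes simply base + index.

```python
def field_address(record_base, field_offset):
    """Record component: variable base (full address) + small constant."""
    return record_base + field_offset


def element_address(array_base, index, element_size):
    """Array component: constant base (full address) + scaled index."""
    return array_base + index * element_size


rec = 0x2000          # hypothetical run-time value of the pointer rec
COUNT_OFFSET = 24     # byte offset of the field count, as on the slide
V_BASE = 0x4000       # hypothetical fixed base address of array v

print(hex(field_address(rec, COUNT_OFFSET)))    # address of rec->count
print(hex(element_address(V_BASE, 3, 4)))       # address of v[3], 4-byte elements
```

In the first case the large value varies and the small one is fixed; in the second it is the other way around. Both fit a "register + constant" addressing mode.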




Fig 2.8 Common Addressing Modes

a) Immediate addressing (instruction contains the operand)
   Example: LOAD #3, ...
b) Direct addressing (instruction contains address of operand)
   Example: LOAD A, ...
c) Indirect addressing (instruction contains address of address of operand)
   Example: LOAD (A), ...
d) Register indirect addressing (register contains address of operand)
   Example: LOAD [R2], ...
e) Displacement (based) (indexed) addressing (address of operand = register + constant)
   Example: LOAD 4[R2], ...
f) Relative addressing (address of operand = PC + constant)
   Example: LOADRel 4[PC], ...
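The six modes differ only in how the operand is located, which a small lookup function makes explicit. This is a sketch under simplifying assumptions: memory and registers are modeled as Python dictionaries, and the function name operand, the mode strings, and the sample contents are all hypothetical.

```python
def operand(mode, value, memory, registers, pc):
    """Return the operand selected by one of the common addressing modes."""
    if mode == "immediate":            # instruction contains the operand
        return value
    if mode == "direct":               # instruction contains address of operand
        return memory[value]
    if mode == "indirect":             # instruction contains address of address
        return memory[memory[value]]
    if mode == "register_indirect":    # register contains address of operand
        return memory[registers[value]]
    if mode == "displacement":         # address = register + constant
        reg, const = value
        return memory[registers[reg] + const]
    if mode == "relative":             # address = PC + constant
        return memory[pc + value]
    raise ValueError(f"unknown mode: {mode}")


memory = {100: 200, 200: 7, 204: 9, 1004: 11}
registers = {"R2": 200}
pc = 1000

print(operand("immediate", 3, memory, registers, pc))
print(operand("indirect", 100, memory, registers, pc))
print(operand("displacement", ("R2", 4), memory, registers, pc))
```

Each added level of indirection costs one more memory access, which is part of the trade-off between the modes.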




Example Computer, SRC: Simple RISC Computer

• 32 general purpose registers of 32 bits
• 32-bit program counter, PC, and instruction register, IR
• 2^32 bytes of memory address space

[Figure: the SRC CPU holds registers R0 to R31, PC, and IR, each with bits 31..0. Main memory is 2^32 bytes, 8 bits (7..0) wide, addressed from 0 to 2^32 – 1.]

R[7] means contents of register 7
M[32] means contents of memory location 32
memory




SRC Characteristics

• Load-store design: only way to access memory is through load and store instructions

• Only a few addressing modes are supported
• ALU instructions are 3-register type
• Branch instructions can branch unconditionally or conditionally on whether the value in a specified register is = 0, <> 0, >= 0, or < 0
• Branch and link instructions are similar, but leave the value of current PC in any register, useful for subroutine return

• All instructions are 32 bits (1 word) long




SRC Basic Instruction Formats

• There are three basic instruction format types
• The number of register specifier fields and length of the constant field vary
• Other formats result from unused fields or parts
• Details of formats on next slide

Type 1:  op ⟨31..27⟩ | ra ⟨26..22⟩ | c1 ⟨21..0⟩
Type 2:  op ⟨31..27⟩ | ra ⟨26..22⟩ | rb ⟨21..17⟩ | c2 ⟨16..0⟩
Type 3:  op ⟨31..27⟩ | ra ⟨26..22⟩ | rb ⟨21..17⟩ | rc ⟨16..12⟩ | c3 ⟨11..0⟩
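Packing and unpacking these fields is plain shifting and masking. The sketch below encodes a Type 2 word and extracts every field position. The function names encode_type2 and decode_fields are hypothetical, and the example values (op = 1 for ld, as in ld r22, 24(r4)) are taken from the load/store table that appears later in this chapter.

```python
def encode_type2(op, ra, rb, c2):
    """Pack op, ra, rb, and the 17-bit constant c2 into a 32-bit word."""
    return (op << 27) | (ra << 22) | (rb << 17) | (c2 & 0x1FFFF)


def decode_fields(word):
    """Extract every field position; which ones are meaningful depends on op."""
    return {
        "op": (word >> 27) & 0x1F,   # bits 31..27
        "ra": (word >> 22) & 0x1F,   # bits 26..22
        "rb": (word >> 17) & 0x1F,   # bits 21..17
        "rc": (word >> 12) & 0x1F,   # bits 16..12
        "c1": word & 0x3FFFFF,       # bits 21..0
        "c2": word & 0x1FFFF,        # bits 16..0
        "c3": word & 0xFFF,          # bits 11..0
    }


word = encode_type2(op=1, ra=22, rb=4, c2=24)   # ld r22, 24(r4)
fields = decode_fields(word)
print(f"{word:#010x}", fields["op"], fields["ra"], fields["rb"], fields["c2"])
```

The fields overlap on purpose: c1, c2, and c3 are different-length views of the low-order bits, and the op code determines which view applies.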




Fig 2.9 (Partial)

Total of 7 Detailed Formats

Instruction formats and examples (bit positions in angle brackets):

1. ld, st, la, addi, andi, ori
   op ⟨31..27⟩ | ra ⟨26..22⟩ | rb ⟨21..17⟩ | c2 ⟨16..0⟩
   ld r3, A           (R[3] = M[A])
   ld r3, 4(r5)       (R[3] = M[R[5] + 4])
   addi r2, r4, #1    (R[2] = R[4] + 1)

2. ldr, str, lar
   op ⟨31..27⟩ | ra ⟨26..22⟩ | c1 ⟨21..0⟩
   ldr r5, 8          (R[5] = M[PC + 8])
   lar r6, 45         (R[6] = PC + 45)

3. neg, not
   op ⟨31..27⟩ | ra ⟨26..22⟩ | unused ⟨21..17⟩ | rc ⟨16..12⟩ | unused ⟨11..0⟩
   neg r7, r9         (R[7] = –R[9])

4. br
   op ⟨31..27⟩ | unused ⟨26..22⟩ | rb ⟨21..17⟩ | rc ⟨16..12⟩ | (c3) unused ⟨11..3⟩ | Cond ⟨2..0⟩
   brzr r4, r0        (branch to R[4] if R[0] == 0)

5. brl
   op ⟨31..27⟩ | ra ⟨26..22⟩ | rb ⟨21..17⟩ | rc ⟨16..12⟩ | (c3) unused ⟨11..3⟩ | Cond ⟨2..0⟩
   brlnz r6, r4, r0   (R[6] = PC; branch to R[4] if R[0] ≠ 0)

6. add, sub, and, or
   op ⟨31..27⟩ | ra ⟨26..22⟩ | rb ⟨21..17⟩ | rc ⟨16..12⟩ | unused ⟨11..0⟩
   add r0, r2, r4     (R[0] = R[2] + R[4])

7. shr, shra, shl, shc
   7a: op ⟨31..27⟩ | ra ⟨26..22⟩ | rb ⟨21..17⟩ | (c3) unused | Count ⟨4..0⟩
       shr r0, r1, #4   (R[0] = R[1] shifted right by 4 bits)
   7b: op ⟨31..27⟩ | ra ⟨26..22⟩ | rb ⟨21..17⟩ | rc ⟨16..12⟩ | (c3) unused | 00000 ⟨4..0⟩
       shl r2, r4, r6   (R[2] = R[4] shifted left by count in R[6])

8. nop, stop
   op ⟨31..27⟩ | unused ⟨26..0⟩
   stop




Tbl 2.4 Example SRC Load and Store Instructions

• Address can be constant, constant + register, or constant + PC
• Memory contents or address itself can be loaded (note use of la to load a constant)

Instruction      op  ra  rb  c1   Meaning               Addressing Mode
ld r1, 32        1   1   0   32   R[1] ← M[32]          Direct
ld r22, 24(r4)   1   22  4   24   R[22] ← M[24+R[4]]    Displacement
st r4, 0(r9)     3   4   9   0    M[R[9]] ← R[4]        Register indirect
la r7, 32        5   7   0   32   R[7] ← 32             Immediate
ldr r12, -48     2   12  -   -48  R[12] ← M[PC -48]     Relative
lar r3, 0        6   3   -   0    R[3] ← PC             Register (!)




Assembly Language Forms of Arithmetic and Logic Instructions

• Immediate subtract not needed since constant in addi may be negative

Format            Example           Meaning
neg ra, rc        neg r1, r2        ;Negate (r1 = -r2)
not ra, rc        not r2, r3        ;Not (r2 = r3´)
add ra, rb, rc    add r2, r3, r4    ;2’s complement addition
sub ra, rb, rc                      ;2’s complement subtraction
and ra, rb, rc                      ;Logical and
or ra, rb, rc                       ;Logical or
addi ra, rb, c2   addi r1, r3, #1   ;Immediate 2’s complement add
andi ra, rb, c2                     ;Immediate logical and
ori ra, rb, c2                      ;Immediate logical or




Branch Instruction Format

There are actually only two branch instructions:

    br rb, rc, c3<2..0>        ; branch to R[rb] if R[rc] meets
                               ; the condition defined by c3<2..0>
    brl ra, rb, rc, c3<2..0>   ; R[ra] ← PC; branch as above

• It is c3<2..0>, the 3 lsbs of c3, that governs what the branch condition is:

lsbs   condition      Assy language form   Example
000    never          brlnv                brlnv r6
001    always         br, brl              br r5, brl r5
010    if R[rc] = 0   brzr, brlzr          brzr r2, r4, r5
011    if R[rc] ≠ 0   brnz, brlnz
100    if R[rc] ≥ 0   brpl, brlpl
101    if R[rc] < 0   brmi, brlmi

• Note that branch target address is always in register R[rb].
  • It must be placed there explicitly by a previous instruction.




Tbl 2.6 Forms and Formats of the br and brl Instructions

Ass’y lang.  Example instr.    Meaning                              op  ra  rb  rc  c3⟨2..0⟩  Branch Cond’n.
brlnv        brlnv r6          R[6] ← PC                            9   6   -   -   000       never
br           br r4             PC ← R[4]                            8   -   4   -   001       always
brl          brl r6,r4         R[6] ← PC; PC ← R[4]                 9   6   4   -   001       always
brzr         brzr r5,r1        if (R[1]=0) PC ← R[5]                8   -   5   1   010       zero
brlzr        brlzr r7,r5,r1    R[7] ← PC; if (R[1]=0) PC ← R[5]     9   7   5   1   010       zero
brnz         brnz r1, r0       if (R[0]≠0) PC ← R[1]                8   -   1   0   011       nonzero
brlnz        brlnz r2,r1,r0    R[2] ← PC; if (R[0]≠0) PC ← R[1]     9   2   1   0   011       nonzero
brpl         brpl r3, r2       if (R[2]≥0) PC ← R[3]                8   -   3   2   100       plus
brlpl        brlpl r4,r3,r2    R[4] ← PC; if (R[2]≥0) PC ← R[3]     9   4   3   2   100       plus
brmi         brmi r0, r1       if (R[1]<0) PC ← R[0]                8   -   0   1   101       minus
brlmi        brlmi r3,r0,r1    R[3] ← PC; if (R[1]<0) PC ← R[0]     9   3   0   1   101       minus
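The condition test shared by br and brl can be sketched as a single function of c3⟨2..0⟩ and the value in R[rc]. The names to_signed32 and branch_taken are hypothetical. Treating the register value as a 32-bit 2's complement integer is an assumption consistent with the plus/minus conditions in the table.

```python
def to_signed32(value):
    """Interpret a 32-bit pattern as a 2's complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def branch_taken(cond, rc_value):
    """Evaluate the branch condition selected by c3<2..0> against R[rc]."""
    v = to_signed32(rc_value)
    if cond == 0b000:
        return False       # never
    if cond == 0b001:
        return True        # always
    if cond == 0b010:
        return v == 0      # zero
    if cond == 0b011:
        return v != 0      # nonzero
    if cond == 0b100:
        return v >= 0      # plus
    if cond == 0b101:
        return v < 0       # minus
    raise ValueError("condition code not defined in the table")


print(branch_taken(0b100, 0xFFFFFFFF))   # the pattern for -1 is not "plus"
print(branch_taken(0b101, 0xFFFFFFFF))   # but it is "minus"
```

Because the test is the same for both op codes, brl differs from br only in the extra transfer R[ra] ← PC.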




Branch Instructions—Example

C:    goto Label3

SRC:  lar r0, Label3    ; put branch target address into tgt reg.
      br r0             ; and branch
      • • •
Label3  • • •




Example of Conditional Branch

in C:   #define Cost 125
        if (X<0) X = -X;

in SRC:
Cost    .equ 125        ;define symbolic constant
        .org 1000       ;next word will be loaded at address 1000 (decimal)
X:      .dw 1           ;reserve 1 word for variable X
        .org 5000       ;program will be loaded at location 5000 (decimal)
        lar r0, Over    ;load address of “false” jump location
        ld r1, X        ;load value of X into r1
        brpl r0, r1     ;branch to Over if r1≥0
        neg r1, r1      ;negate value
Over:   • • •           ;continue




RTN (Register Transfer Notation)

• Provides a formal means of describing machine structure and function
• Is at the “just right” level for machine descriptions
• Does not replace hardware description languages
• Can be used to describe what a machine does (an abstract RTN) without describing how the machine does it
• Can also be used to describe a particular hardware implementation (a concrete RTN)




RTN (cont’d.)

• At first you may find this “meta description” confusing, because it is a language that is used to describe a language

• You will find that developing a familiarity with RTN will aid greatly in your understanding of new machine design concepts

• We will describe RTN by using it to describe SRC




Some RTN Features: Using RTN to Describe a Machine’s Static Properties

Static Properties
• Specifying registers
  • IR⟨31..0⟩ specifies a register named “IR” having 32 bits numbered 31 to 0
• “Naming” using the := naming operator:
  • op⟨4..0⟩ := IR⟨31..27⟩ specifies that the 5 msbs of IR be called op, with bits 4..0
  • Notice that this does not create a new register, it just generates another name, or “alias,” for an already existing register or part of a register




Using RTN to Describe Dynamic Properties

Dynamic Properties
• Conditional expressions:

(op=12) → R[ra] ← R[rb] + R[rc]:   ; defines the add instruction

  • (op=12) is the “if” condition
  • → means “then”
  • ← is the RTN assignment operator

This fragment of RTN describes the SRC add instruction. It says, “when the op field of IR = 12, then store in the register specified by the ra field, the result of adding the register specified by the rb field to the register specified by the rc field.”




Using RTN to Describe the SRC (Static) Processor State

Processor state
PC⟨31..0⟩:          program counter (memory addr. of next inst.)
IR⟨31..0⟩:          instruction register
Run:                one bit run/halt indicator
Strt:               start signal
R[0..31]⟨31..0⟩:    general purpose registers




RTN Register Declarations

• The general register specification shows some features of the notation
• Describes a set of 32 32-bit registers with names R[0] to R[31]

R[0..31]⟨31..0⟩:

• R is the name of the registers
• The register # goes in square brackets
• .. specifies a range of indices
• The bit # goes in angle brackets, msb # first and lsb # last
• The colon separates statements with no ordering




Memory Declaration:RTN Naming Operator

• Defining names with formal parameters is a powerful formatting tool
• Used here to define word memory (big-endian)

Main memory state
Mem[0..2^32 - 1]⟨7..0⟩:   2^32 addressable bytes of memory
M[x]⟨31..0⟩ := Mem[x]#Mem[x+1]#Mem[x+2]#Mem[x+3]:

• x is a dummy parameter
• := is the naming operator
• # is the concatenation operator
• All bits in a register are meant if no bit index is given
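The big-endian word definition M[x] can be tried out with a small sketch. This is only an illustration under stated assumptions: the dictionary `mem` and the helper names `store_word` and `load_word` are hypothetical and not part of SRC or RTN.

```python
# Byte-addressable memory Mem[...]; a dict keeps the sketch small (assumption).
mem = {}

def store_word(x, value):
    """Write a 32-bit value so that Mem[x] receives the most significant byte."""
    for i in range(4):
        mem[x + i] = (value >> (8 * (3 - i))) & 0xFF

def load_word(x):
    """M[x] := Mem[x]#Mem[x+1]#Mem[x+2]#Mem[x+3] (big-endian concatenation)."""
    word = 0
    for i in range(4):
        word = (word << 8) | mem.get(x + i, 0)   # unwritten bytes read as 0
    return word

store_word(32, 0x12345678)
print(hex(mem[32]), hex(mem[35]), hex(load_word(32)))
```

The lowest address holds the "big end" of the word, which is what the concatenation order in the RTN definition expresses.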




RTN Instruction Formatting Uses Renaming of IR Bits

Instruction formats
op⟨4..0⟩ := IR⟨31..27⟩:    operation code field
ra⟨4..0⟩ := IR⟨26..22⟩:    target register field
rb⟨4..0⟩ := IR⟨21..17⟩:    operand, address index, or branch target register
rc⟨4..0⟩ := IR⟨16..12⟩:    second operand, conditional test, or shift count register
c1⟨21..0⟩ := IR⟨21..0⟩:    long displacement field
c2⟨16..0⟩ := IR⟨16..0⟩:    short displacement or immediate field
c3⟨11..0⟩ := IR⟨11..0⟩:    count or modifier field




Specifying Dynamic Properties of SRC: RTN Gives Specifics of Address

Calculation

• Renaming defines displacement and relative addresses
• New RTN notation is used
  • condition → expression means if condition then expression
  • modifiers in { } describe type of arithmetic or how short numbers are extended to longer ones
  • arithmetic operators (+ - * / etc.) can be used in expressions
• Register R[0] cannot be added to a displacement

Effective address calculations (occur at runtime):

disp⟨31..0⟩ := ((rb=0) → c2⟨16..0⟩ {sign extend}:                       displacement
               (rb≠0) → R[rb] + c2⟨16..0⟩ {sign extend, 2’s comp.} ):   address

rel⟨31..0⟩ := PC⟨31..0⟩ + c1⟨21..0⟩ {sign extend, 2’s comp.}:            relative address




Detailed Questions Answered by the RTN for Addresses

• What set of memory cells can be addressed by direct addressing (displacement with rb=0)?
  • If c2⟨16⟩=0 (positive displacement) absolute addresses range from 00000000H to 0000FFFFH
  • If c2⟨16⟩=1 (negative displacement) absolute addresses range from FFFF0000H to FFFFFFFFH
• What range of memory addresses can be specified by a relative address?
  • The largest positive value of c1⟨21..0⟩ is 2^21 - 1 and its most negative value is -2^21, so addresses up to 2^21 - 1 forward and 2^21 backward from the current PC value can be specified

• Note the difference between rb and R[rb]
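The address ranges quoted above can be checked numerically. The following is a minimal sketch, assuming 32-bit wraparound arithmetic; the function names `sign_extend`, `disp`, and `rel` are illustrative choices that mirror the RTN names, not an existing library.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a 2's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def disp(rb, c2, R):
    """disp := (rb=0) -> c2 {sign extend}: (rb!=0) -> R[rb] + c2 {sign extend}."""
    base = 0 if rb == 0 else R[rb]      # note: rb is a field, R[rb] a register
    return (base + sign_extend(c2, 17)) & MASK32

def rel(pc, c1):
    """rel := PC + c1 {sign extend, 2's comp.}."""
    return (pc + sign_extend(c1, 22)) & MASK32

R = [0] * 32
R[4] = 1000
print(hex(disp(0, 0x0FFFF, R)))          # largest positive direct address
print(hex(disp(0, 0x10000, R)))          # most negative 17-bit displacement
print(disp(4, 24, R))                    # 24(r4)
print(rel(5000, -48 & 0x3FFFFF))         # PC-relative, 48 bytes back
```

Because rb=0 selects "no base register", R[0] never contributes to a displacement address, which is the distinction between rb and R[rb] noted above.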




Instruction Interpretation: RTN Description of Fetch-Execute

• Need to describe actions (not just declarations)
• Some new notation

instruction_interpretation := (
    ¬Run∧Strt → Run ← 1:
    Run → (IR ← M[PC]: PC ← PC + 4; instruction_execution) );

• ¬ is logical NOT
• ∧ is logical AND
• ← is a register transfer
• ; separates statements that occur in sequence




RTN Sequence and Clocking

• In general, RTN statements separated by : take place during the same clock pulse

• Statements separated by ; take place on successive clock pulses

• This is not entirely accurate since some things written with one RTN statement can take several clocks to perform

• More precise difference between : and ;
  • The order of execution of statements separated by : does not matter
  • If statements are separated by ; the one on the left must be complete before the one on the right starts




More About Instruction Interpretation RTN

• In the expression IR ← M[PC]: PC ← PC + 4; which value of PC applies to M[PC] ?

• The rule in RTN is that all right hand sides of “:” - separated RTs are evaluated before any LHS is changed

• In logic design, this corresponds to “master-slave” operation of flip-flops

• We see what happens when Run is true and when Run is false but Strt is true. What about the case of Run and Strt both false?

• Since no action is specified for this case, the RTN implicitly says that no action occurs in this case
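The rule that all right-hand sides are evaluated before any left-hand side changes can be mimicked in software. This is a small sketch, assuming a dictionary for the processor state and a word-indexed `memory` dictionary; both are hypothetical stand-ins for the hardware.

```python
def fetch_step(state, memory):
    """IR <- M[PC]: PC <- PC + 4, with both right-hand sides read from the OLD state."""
    old_pc = state["PC"]
    new_ir = memory[old_pc]        # evaluate every right-hand side first ...
    new_pc = old_pc + 4
    state["IR"] = new_ir           # ... then update every left-hand side
    state["PC"] = new_pc
    return state

memory = {0: 0xAAAA0000, 4: 0xBBBB0000}
state = fetch_step({"PC": 0, "IR": 0}, memory)
print(hex(state["IR"]), state["PC"])
```

The instruction fetched is the one at the old PC value, matching the master-slave behavior described above.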




Individual Instructions

• instruction_interpretation contained a forward reference to instruction_execution

• instruction_execution is a long list of conditional operations

• The condition is that the op code specifies a given instruction

• The operation describes what that instruction does
• Note that the operations of the instruction are done after (;) the instruction is put into IR and the PC has been advanced to the next instruction




RTN Instruction Execution for Load and Store Instructions

• The in-line definition (:= op=1) saves writing a separate definition ld := op=1 for the ld mnemonic

• The previous definitions of disp and rel are needed to understand all the details

instruction_execution := (
    ld  (:= op= 1) → R[ra] ← M[disp]:    load register
    ldr (:= op= 2) → R[ra] ← M[rel]:     load register relative
    st  (:= op= 3) → M[disp] ← R[ra]:    store register
    str (:= op= 4) → M[rel] ← R[ra]:     store register relative
    la  (:= op= 5) → R[ra] ← disp:       load displacement address
    lar (:= op= 6) → R[ra] ← rel:        load relative address




SRC RTN—The Main Loop

ii := instruction_interpretation:
ie := instruction_execution:

ii := ( ¬Run∧Strt → Run ← 1:
        Run → (IR ← M[PC]: PC ← PC + 4; ie) );

ie := ( ld  (:= op= 1) → R[ra] ← M[disp]:     Big switch
        ldr (:= op= 2) → R[ra] ← M[rel]:      statement
        . . .                                 on the opcode
        stop (:= op= 31) → Run ← 0:
      ); ii

Thus ii and ie invoke each other, as coroutines.
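The interplay of ii and ie can be sketched as a tiny interpreter. This is a rough illustration, not the text's design: it implements only la, add, addi, nop, and stop, uses a word-addressed dictionary `M` instead of byte memory, and the helper `encode` is a hypothetical convenience for building instruction words.

```python
def sign_extend(value, bits):
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def encode(op, ra=0, rb=0, rc=0, c2=0):
    """Pack fields into a 32-bit SRC word; give either rc or c2, not both."""
    return (op << 27) | (ra << 22) | (rb << 17) | (rc << 12) | (c2 & 0x1FFFF)

def run(M, pc=0):
    R = [0] * 32
    state = {"PC": pc, "IR": 0, "Run": 0, "Strt": 1}
    if not state["Run"] and state["Strt"]:     # ii: not Run and Strt -> Run <- 1
        state["Run"] = 1
    while state["Run"]:                        # ii: Run -> (fetch; ie), then ii again
        state["IR"], state["PC"] = M[state["PC"]], state["PC"] + 4
        ir = state["IR"]
        op, ra = ir >> 27, (ir >> 22) & 31
        rb, rc = (ir >> 17) & 31, (ir >> 12) & 31
        c2 = sign_extend(ir, 17)
        if op == 5:                            # la: R[ra] <- disp
            R[ra] = (c2 if rb == 0 else R[rb] + c2) & 0xFFFFFFFF
        elif op == 12:                         # add
            R[ra] = (R[rb] + R[rc]) & 0xFFFFFFFF
        elif op == 13:                         # addi
            R[ra] = (R[rb] + c2) & 0xFFFFFFFF
        elif op == 31:                         # stop: Run <- 0
            state["Run"] = 0
        # op == 0 (nop) and opcodes not modeled here take no action
    return R, state

program = {
    0: encode(5, ra=1, c2=10),            # la r1, 10
    4: encode(5, ra=2, c2=32),            # la r2, 32
    8: encode(12, ra=3, rb=1, rc=2),      # add r3, r1, r2
    12: encode(13, ra=3, rb=3, c2=-2),    # addi r3, r3, -2
    16: encode(31),                       # stop
}
R, state = run(program)
print(R[3], state["PC"])
```

The `while` loop plays the role of ie handing control back to ii; the loop ends only when stop clears Run.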




Use of RTN Definitions:Text Substitution Semantics

• An example:
  • If IR = 00001 00101 00011 00000000000001011
  • then ld → R[5] ← M[ R[3] + 11 ]:

ld (:= op= 1) → R[ra] ← M[disp]:

disp⟨31..0⟩ := ((rb=0) → c2⟨16..0⟩ {sign extend}:
               (rb≠0) → R[rb] + c2⟨16..0⟩ {sign extend, 2’s comp.} ):

ld (:= op= 1) → R[ra] ← M[
               ((rb=0) → c2⟨16..0⟩ {sign extend}:
               (rb≠0) → R[rb] + c2⟨16..0⟩ {sign extend, 2’s comp.} ): ]:
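The substitution in this example can be reproduced mechanically by extracting the named fields from IR. The sketch below assumes the field positions given in the instruction format definitions; `fields` and `expand_ld` are hypothetical helper names.

```python
def fields(ir):
    """Apply the RTN renamings of IR bits to a 32-bit instruction word."""
    return {
        "op": (ir >> 27) & 0x1F,   # IR<31..27>
        "ra": (ir >> 22) & 0x1F,   # IR<26..22>
        "rb": (ir >> 17) & 0x1F,   # IR<21..17>
        "rc": (ir >> 12) & 0x1F,   # IR<16..12>
        "c1": ir & 0x3FFFFF,       # IR<21..0>
        "c2": ir & 0x1FFFF,        # IR<16..0>
        "c3": ir & 0xFFF,          # IR<11..0>
    }

def expand_ld(ir):
    """Return the register transfer a ld instruction word stands for, as text."""
    f = fields(ir)
    if f["op"] != 1:
        raise ValueError("not a ld instruction")
    c2 = f["c2"] - (1 << 17) if f["c2"] & (1 << 16) else f["c2"]   # sign extend
    address = str(c2) if f["rb"] == 0 else f"R[{f['rb']}] + {c2}"
    return f"R[{f['ra']}] <- M[ {address} ]"

ir = int("00001 00101 00011 00000000000001011".replace(" ", ""), 2)
print(fields(ir))
print(expand_ld(ir))
```

Note that c1, c2, and c3 overlap rb and rc; they are alternative names for the same IR bits, not separate storage.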




RTN Descriptions of SRC Branch Instructions

• Branch condition determined by 3 lsbs of instruction
• Link register (R[ra]) set to point to next instruction

cond := ( c3⟨2..0⟩=0 → 0:                never
          c3⟨2..0⟩=1 → 1:                always
          c3⟨2..0⟩=2 → R[rc]=0:          if register is zero
          c3⟨2..0⟩=3 → R[rc]≠0:          if register is nonzero
          c3⟨2..0⟩=4 → R[rc]⟨31⟩=0:      if positive or zero
          c3⟨2..0⟩=5 → R[rc]⟨31⟩=1 ):    if negative

br  (:= op= 8) → (cond → PC ← R[rb]):                  conditional branch
brl (:= op= 9) → (R[ra] ← PC: cond → (PC ← R[rb]) ):   branch and link
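A short sketch may make the condition logic concrete. It assumes registers hold unsigned 32-bit values and treats the undefined codes 6 and 7 as "never", which is an assumption of this example rather than something the RTN states.

```python
def cond(c3, r_rc):
    """Evaluate the branch condition selected by c3<2..0> against R[rc]."""
    code = c3 & 0b111
    sign = (r_rc >> 31) & 1              # R[rc]<31>
    if code == 1:
        return True                      # always
    if code == 2:
        return r_rc == 0                 # zero
    if code == 3:
        return r_rc != 0                 # nonzero
    if code == 4:
        return sign == 0                 # positive or zero
    if code == 5:
        return sign == 1                 # negative
    return False                         # code 0 is never; 6 and 7 assumed never

def branch(op, pc, R, ra, rb, rc, c3):
    """Return the new PC for br (op 8) or brl (op 9); R is updated for the link."""
    target, taken = R[rb], cond(c3, R[rc])   # read right-hand sides first
    if op == 9:
        R[ra] = pc                           # link: R[ra] <- PC
    return target if taken else pc

R = [0] * 32
R[5] = 2000
print(branch(8, 104, R, 0, 5, 1, 2))     # brzr r5, r1 with R[1] = 0
print(branch(9, 104, R, 6, 0, 0, 0))     # brlnv r6: link only
print(R[6])
```

Reading `R[rb]` and the condition before writing the link register follows the ":" semantics, so brl behaves sensibly even if ra and rb name the same register.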




RTN for Arithmetic and Logic

• Logical operators: and ∧ or ∨ and not ¬

add  (:= op=12) → R[ra] ← R[rb] + R[rc]:
addi (:= op=13) → R[ra] ← R[rb] + c2⟨16..0⟩ {2's comp. sign ext.}:
sub  (:= op=14) → R[ra] ← R[rb] - R[rc]:
neg  (:= op=15) → R[ra] ← -R[rc]:
and  (:= op=20) → R[ra] ← R[rb] ∧ R[rc]:
andi (:= op=21) → R[ra] ← R[rb] ∧ c2⟨16..0⟩ {sign extend}:
or   (:= op=22) → R[ra] ← R[rb] ∨ R[rc]:
ori  (:= op=23) → R[ra] ← R[rb] ∨ c2⟨16..0⟩ {sign extend}:
not  (:= op=24) → R[ra] ← ¬R[rc]:




RTN for Shift Instructions

• Count may be 5 lsbs of a register or the instruction
• Notation: @ - replication, # - concatenation

n := ( (c3⟨4..0⟩=0) → R[rc]⟨4..0⟩:
       (c3⟨4..0⟩≠0) → c3⟨4..0⟩ ):

shr  (:= op=26) → R[ra]⟨31..0⟩ ← (n @ 0) # R[rb]⟨31..n⟩:
shra (:= op=27) → R[ra]⟨31..0⟩ ← (n @ R[rb]⟨31⟩) # R[rb]⟨31..n⟩:
shl  (:= op=28) → R[ra]⟨31..0⟩ ← R[rb]⟨31-n..0⟩ # (n @ 0):
shc  (:= op=29) → R[ra]⟨31..0⟩ ← R[rb]⟨31-n..0⟩ # R[rb]⟨31..32-n⟩:




Example of Replication and Concatenation in Shift

• Arithmetic shift right by 13 concatenates 13 copies of the sign bit with the upper 19 bits of the operand

shra r1, r2, 13

R[2] = 1001 0111 1110 1010 1110 1100 0001 0110

       13@R[2]⟨31⟩       #  R[2]⟨31..13⟩
R[1] = 1111 1111 1111 1  #  100 1011 1111 0101 0111
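The four shift definitions and the shra example can be checked with ordinary integer operations. This is a sketch assuming 32-bit unsigned register values; the function names simply reuse the SRC mnemonics.

```python
MASK32 = 0xFFFFFFFF

def shift_count(c3, r_rc):
    """n := (c3<4..0>=0) -> R[rc]<4..0>: (c3<4..0>!=0) -> c3<4..0>."""
    return (r_rc & 0x1F) if (c3 & 0x1F) == 0 else (c3 & 0x1F)

def shr(x, n):
    """(n @ 0) # R[rb]<31..n>: logical shift right."""
    return (x & MASK32) >> n

def shra(x, n):
    """(n @ R[rb]<31>) # R[rb]<31..n>: replicate the sign bit n times."""
    x &= MASK32
    fill = (MASK32 << (32 - n)) & MASK32 if (x >> 31) and n else 0
    return (x >> n) | fill

def shl(x, n):
    """R[rb]<31-n..0> # (n @ 0): shift left, zero fill."""
    return (x << n) & MASK32

def shc(x, n):
    """R[rb]<31-n..0> # R[rb]<31..32-n>: circular shift left."""
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32

r2 = 0b1001_0111_1110_1010_1110_1100_0001_0110
print(f"{shra(r2, 13):032b}")
```

For the slide's operand the result is thirteen 1 bits followed by the upper 19 bits of R[2], matching the concatenation shown above.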




Assembly Language for Shift

• Form of assembly language instruction tells whether to set c3=0

shr ra, rb, rc        ;Shift rb right into ra by 5 lsbs of rc
shr ra, rb, count     ;Shift rb right into ra by 5 lsbs of inst
shra ra, rb, rc       ;AShift rb right into ra by 5 lsbs of rc
shra ra, rb, count    ;AShift rb right into ra by 5 lsbs of inst
shl ra, rb, rc        ;Shift rb left into ra by 5 lsbs of rc
shl ra, rb, count     ;Shift rb left into ra by 5 lsbs of inst
shc ra, rb, rc        ;Shift rb circ. into ra by 5 lsbs of rc
shc ra, rb, count     ;Shift rb circ. into ra by 5 lsbs of inst




End of RTN Definition of instruction_execution

• We will find special use for nop in pipelining
• The machine waits for Strt after executing stop
• The long conditional statement defining instruction_execution ends with a direction to go repeat instruction_interpretation, which will fetch and execute the next instruction (if Run still =1)

nop  (:= op= 0) → :            No operation
stop (:= op= 31) → Run ← 0:    Stop instruction
);                             End of instruction_execution
instruction_interpretation.




Confused about RTN and SRC?

• SRC is a Machine Language
  • It can be interpreted by either hardware or a software simulator.
• RTN is a Specification Language
  • Specification languages are languages that are used to specify other languages or systems (a metalanguage).
  • Other examples: LEX, YACC, VHDL, Verilog

Figure 2.10 may help clear this up...




Fig 2.10 The Relationship of RTN to SRC

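[Figure: an SRC specification written in RTN is input to an RTN compiler, which produces a generated processor (an SRC interpreter or simulator). The generated processor takes an SRC program and data as input and produces data output.]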




A Note About Specification Languages

• They allow the description of what without having to specify how.
• They allow precise and unambiguous specifications, unlike natural language.
• They reduce errors:
  • Errors due to misinterpretation of imprecise specifications written in natural language.
  • Errors due to confusion in design and implementation (“human error”).
• Now the designer must debug the specification!
• Specifications can be automatically checked and processed by tools.
  • An RTN specification could be input to a simulator generator that would produce a simulator for the specified machine.
  • An RTN specification could be input to a compiler generator that would generate a compiler for the language, whose output could be run on the simulator.




Addressing Modes Described in RTN (Not SRC)

Mode name                         Assembler Syntax   RTN meaning                         Use
Register                          Ra                 R[t] ← R[a]                         Tmp. Var.
Register indirect                 (Ra)               R[t] ← M[R[a]]                      Pointer
Immediate                         #X                 R[t] ← X                            Constant
Direct, absolute                  X                  R[t] ← M[X]                         Global Var.
Indirect                          (X)                R[t] ← M[ M[X] ]                    Pointer Var.
Indexed, based, or displacement   X(Ra)              R[t] ← M[X + R[a]]                  Arrays, structs
Relative                          X(PC)              R[t] ← M[X + PC]                    Vals stored w pgm
Autoincrement                     (Ra)+              R[t] ← M[R[a]]; R[a] ← R[a] + 1     Sequential access
Autodecrement                     - (Ra)             R[a] ← R[a] - 1; R[t] ← M[R[a]]     Sequential access

R[t] is the target register.




Fig 2.11 Register Transfers Hardware and Timing for a Single-Bit Register

Transfer: A ← B
• Implementing the RTN statement A ← B

[Figure: (a) Hardware: D flip-flops B and A with a Strobe input. (b) Timing: Strobe, B, and A waveforms between 0 and 1.]




Fig 2.12 Multiple Bit Register Transfer: A⟨m..1⟩ ← B⟨m..1⟩

[Figure: (a) Individual flip-flops: D flip-flops for bits 1, 2, ..., m of B and of A, sharing one Strobe. (b) Abbreviated notation: registers B⟨m..1⟩ and A⟨m..1⟩ connected by an m-bit path, with one Strobe.]




Fig 2.13 Data Transmission View of Logic Gates

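• Logic gates can be used to control the transmission of data:

[Figure: three gate uses.
  Data gate: inputs data and gate; output is data when gate is 1, otherwise 0.
  Controlled complement: inputs data and control; output is data or its complement, selected by control.
  Data merge: inputs data 1 and data 2; output is data1(2), provided data2(1) is zero.]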




Fig 2.14 Two-Way Gated Merge, or Multiplexer

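• Data from multiple sources can be selected for transmission

[Figure: m-bit inputs x and y, gated by Gx and Gy, are merged onto one m-bit output; a timing diagram shows x and y appearing on the output at different times.]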




Fig 2.15 Basic Multiplexer and Symbol Abbreviation

• Multiplexer gate signals Gi may be produced by a binary to one-out-of-n decoder

[Figure: (a) Multiplexer in terms of gates: an n-way gated merge of m-bit inputs D0, D1, ..., Dn–1 with gate signals G0, G1, ..., Gn–1. (b) Symbol abbreviation: an n-way multiplexer with decoder, with m-bit inputs D0, D1, ..., Dn–1, a k-bit Select input, and an m-bit output.]




Fig 2.16 Separating Merged Data

• Merged data can be separated by gating at the right time
• It can also be strobed into a flip-flop when valid

[Figure: an m-bit merged signal carries x and then y over time; gating it with Gx passes x and outputs 0 otherwise.]




Fig 2.17 Multiplexed Register Transfers Using Gates and Strobes

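• Selected gate and strobe determine which RT
• A←C and B←C can occur together, but not A←C and B←D

[Figure: m-bit registers C and D drive a shared m-bit path through gates GC and GD; registers A and B load from that path under strobes SA and SB. The timing diagram shows GC, SA, and SB, with propagation time and hold time marked.]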




Fig 2.18 Open-Collector NAND Gate Output Circuit

(a) Open-collector NAND truth table

Inputs      Output
0v   0v     Open    (Out = +V)
0v   +V     Open    (Out = +V)
+V   0v     Open    (Out = +V)
+V   +V     Closed  (Out = 0v)

[Figure: (b) Open-collector NAND circuit with the output pulled up to +V. (c) Symbol, marked o.c.]




Fig 2.19 Wired AND Connection of Open-Collector Gates

[Figure: (a) Wired AND connection: switches a and b share one output line pulled up to +V. (b) With symbols: two o.c. gates with outputs tied together.]

(c) Truth table

Switch a     Switch b     Wired AND output
Closed(0)    Closed(0)    0v (0)
Closed(0)    Open (1)     0v (0)
Open (1)     Closed(0)    0v (0)
Open (1)     Open (1)     +V (1)




Fig 2.20 Open-Collector Wired OR Bus

• By DeMorgan’s law, OR is formed as the NOT of the AND of NOTs
• Pull-up resistor removed from each gate (open collector)
• One pull-up resistor for whole bus
• Forms an OR distributed over the connection

[Figure: open-collector (o.c.) gates with inputs D0/G0, D1/G1, ..., Dn–1/Gn–1 share one bus line pulled up to +V.]




Fig 2.21 Tri-State Gate Internal Structure and Symbol

[Figure: (a) Tri-state gate structure with inputs Data and Enable, supply +V, and output Out. (b) Tri-state gate symbol.]

(c) Tri-state gate truth table

Enable   Data   Output
0        0      Hi-Z
0        1      Hi-Z
1        0      0
1        1      1




Fig 2.22 Registers Connected by a Tri-State Bus

• Can make any register transfer R[i]←R[j]
• Can’t have Gi = Gj = 1 for i≠j
• Violating this constraint gives low resistance path from power supply to ground, with predictable results!

[Figure: m-bit registers R[0], R[1], ..., R[n – 1], each loaded by strobe Si and driving the bus through a tri-state gate Gi, all connected to one m-bit tri-state bus.]




Fig 2.23 Registers and Arithmetic Units Connected by One Bus

Example:

Abstract RTN:
    R[3] ← R[1]+R[2];
Concrete RTN:
    Y ← R[2];
    Z ← R[1]+Y;
    R[3] ← Z;
Control Sequence:
    R[2]out, Yin;
    R[1]out, Zin;
    Zout, R[3]in;

Notice that what could be described in one step in the abstract RTN took three steps on this particular hardware

[Figure: m-bit registers R[0], R[1], ..., R[n – 1] with control signals R[i]in and R[i]out, plus registers Y, Z, and W (signals Yin, Zin, Zout, Win, Wout), all on one bus. The adder and incrementer are combinational logic with no memory.]




RTs Possible with the One-Bus Structure

• R[i] or Y can get the contents of anything but Y
• Since the result is different from the operand, it cannot go on the bus that is carrying the operand
  • Arithmetic units thus have result registers
• Only one of two operands can be on the bus at a time, so the adder has a register for one operand
• R[i] ← R[j] + R[k] is performed in 3 steps: Y←R[k]; Z←R[j] + Y; R[i]←Z;
  • R[i] ← R[j] + R[k] is high level RTN description
  • Y←R[k]; Z←R[j] + Y; R[i]←Z; is concrete RTN
  • Map to control sequence is: R[2]out, Yin; R[1]out, Zin; Zout, R[3]in;
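The three-step control sequence can be traced with a small simulation. This is a sketch under simplifying assumptions: each step gates exactly one register onto the bus, Zin always latches the adder output (bus + Y), and the register names and the function `run_control_sequence` are hypothetical.

```python
def run_control_sequence(regs, steps):
    """Each step is (register gated out, register strobed in); one bus driver per step."""
    for out_reg, in_reg in steps:
        bus = regs[out_reg]                            # e.g. R[2]out
        if in_reg == "Z":
            regs["Z"] = (bus + regs["Y"]) & 0xFFFFFFFF  # Zin latches adder result
        else:
            regs[in_reg] = bus                          # e.g. Yin or R[3]in
    return regs

regs = {"R1": 5, "R2": 7, "R3": 0, "Y": 0, "Z": 0}
steps = [
    ("R2", "Y"),    # R[2]out, Yin
    ("R1", "Z"),    # R[1]out, Zin
    ("Z", "R3"),    # Zout, R[3]in
]
print(run_control_sequence(regs, steps))
```

Because the bus can carry only one value per step, the single abstract transfer R[3] ← R[1]+R[2] needs Y to hold one operand and Z to hold the result.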




From Abstract RTN to Concrete RTN to Control Sequences

• The ability to begin with an abstract description, then describe a hardware design and resulting concrete RTN and control sequence is powerful.

• We shall use this method in Chapter 4 to develop various hardware designs for SRC.




Chapter 2 Summary

• Classes of computer ISAs
• Memory addressing modes
• SRC: a complete example ISA
• RTN as a description method for ISAs
• RTN description of addressing modes
• Implementation of RTN operations with digital logic circuits
• Gates, strobes, and multiplexers


3-1 Chapter 3—Some Real Machines


Chapter 3: Some Real Machines

Topics

3.1 Machine Characteristics and Performance
3.2 RISC versus CISC
3.3 A CISC Microprocessor: The Motorola MC68000
3.4 A RISC Architecture: The SPARC




Practical Aspects of Machine Cost-Effectiveness

• Cost for useful work is fundamental issue
  • The cost of mounting, case, keyboard, etc. now dominates the cost of the integrated circuits
• Upward compatibility preserves software investment
  • Binary compatibility
  • Source compatibility
  • Emulation compatibility
• Performance: strong function of application




Performance Measures

• MIPS: Millions of Instructions Per Second
  • Same job may take more instructions on one machine than on another
• MFLOPS: Million Floating Point OPs Per Second
  • Other instructions counted as overhead for the floating point
• Whetstones: Synthetic benchmark
  • A program made up to test specific performance features
• Dhrystones: Synthetic competitor for Whetstone
  • Made up to “correct” Whetstone’s emphasis on floating point
• SPEC: Selection of “real” programs
  • Taken from the C/Unix world




CISC Versus RISC Designs

• CISC: Complex Instruction Set Computer
  • Many complex instructions and addressing modes
  • Some instructions take many steps to execute
  • Not always easy to find best instruction for a task
• RISC: Reduced Instruction Set Computer
  • Few, simple instructions, addressing modes
  • Usually one word per instruction
  • May take several instructions to accomplish what CISC can do in one
  • Complex address calculations may take several instructions
  • Usually has load-store, general register ISA




Design Characteristics of RISCs

• Simple instructions can be done in few clocks
  • Simplicity may even allow a shorter clock period
• A pipelined design can allow an instruction to complete in every clock period
• Fixed length instructions simplify fetch and decode
• The rules may allow starting next instruction without necessary results of the previous
  • Unconditionally executing the instruction after a branch
  • Starting next instruction before register load is complete




Other RISC Characteristics

• Prefetching of instructions. (Similar to I8086.)
• Pipelining: beginning execution of an instruction before the previous instruction(s) have completed. (Will cover in detail in Chapter 5.)

• Superscalar operation—issuing more than one instruction simultaneously. (Instruction-level parallelism. Also covered in Chapter 5.)

• Delayed loads, stores, and branches. Operands may not be available when an instruction attempts to access them.

• Register windows—ability to switch to a different set of CPU registers with a single command. Alleviates procedure call/return overhead. Discussed with SPARC in this chapter.




Tbl 3.1 Order of Presenting or Developing a Computer ISA

• Memories: structure of data storage in the computer
  • Processor-state registers
  • Main memory organization
• Formats and their interpretation: meanings of register fields
  • Data types
  • Instruction format
  • Instruction address interpretation
• Instruction interpretation: things done for all instructions
  • The fetch-execute cycle
  • Exception handling (sometimes deferred)
• Instruction execution: behavior of individual instructions
  • Grouping of instructions into classes
  • Actions performed by individual instructions




CISC: The Motorola MC68000

• Introduced in 1979
• One of first 32-bit microprocessors
  • Means that most operations are on 32-bit internal data
  • Some operations may use different number of bits
  • External data paths may not all be 32 bits wide
    • MC68000 had a 24-bit address bus
• Complex Instruction Set Computer (CISC)
  • Large instruction set
  • 14 addressing modes




Fig 3.1 The MC68000 Processor State

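[Figure: MC68000 processor state.
  • 8 general purpose data registers D0..D7, 32 bits each, with bit positions 31, 16, 15, 8, 7, and 0 marked
  • 8 address registers A0..A7, 32 bits each; A7 is SP/USP, and there is a separate A7'/SSP
  • PC and a 16-bit IR (bits 15..0)
  • Status register (bits 15..0): the system byte holds T (trace mode, bit 15), S (supervisor state, bit 13), and the interrupt mask I2 I1 I0 (bits 10..8); the user byte (CC) holds X, N, Z, V, C (bits 4..0): Extend, Negative, Zero, Overflow, Carry
  • Main memory: 2^24 bytes, or 2^23 16-bit words, or 2^22 longwords]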




Features of the 68000 Processor State

• Distinction between 32-bit data registers and 32-bit address registers
• 16-bit instruction register
  • Variable length instructions handled 16 bits at a time
• Stack pointer registers
  • User stack pointer is one of the address registers
  • System stack pointer is a separate single register
  • Discuss: Why a separate system stack
• Condition code register: System and user bytes
  • Arithmetic status (N, Z, V, C, X) is in user status byte
  • System status has supervisor and trace mode flags, as well as the interrupt mask




RTN Processor State for the MC68000

D[0..7]⟨31..0⟩:                        General purpose data registers
A[0..7]⟨31..0⟩:                        Address registers
A7´⟨31..0⟩:                            System stack pointer
PC⟨31..0⟩:                             Program counter
IR⟨15..0⟩:                             Instruction register
Status⟨15..0⟩:                         System status byte and user status byte
SP := A[7]:                            User stack pointer, also called USP
SSP := A7´:                            System stack pointer
C := Status⟨0⟩: V := Status⟨1⟩:        Carry and Overflow flags
Z := Status⟨2⟩: N := Status⟨3⟩:        Zero and Negative flags
X := Status⟨4⟩:                        Extend flag
INT⟨2..0⟩ := Status⟨10..8⟩:            Interrupt mask in system status byte
S := Status⟨13⟩: T := Status⟨15⟩:      Supervisor state and Trace mode flags




Main Memory in the MC68000

Main memory:
Mb[0..2^24-1]⟨7..0⟩:                  Memory as bytes
Mw[ad]⟨15..0⟩ := Mb[ad]#Mb[ad+1]:     Memory as words
Ml[ad]⟨31..0⟩ := Mw[ad]#Mw[ad+2]:     Memory as long words

• The word and longword forms are “big-endian”
  • The lowest numbered byte contains the most significant bit (big end) of the word
• Words and longwords have “hard” alignment constraints not described in the above RTN
  • Word addresses must end in one binary 0
  • Longword addresses must end in two binary zeros




MC68000 Supports Several Operand Types

• Like many CISC machines, the 68000 allows one instruction to operate on several types

• MOVE.B for bytes, MOVE.W for words, and MOVE.L for longwords; also ADD.B, ADD.W, ADD.L, etc.

• Operand length is coded as bits of the instruction word

• Bits coding operand type vary with instruction
• For use with RTN descriptions, we assume a function d := datalen(IR) that returns 1, 2, or 4 for operand length




Fig 3.2 Some MC68000 Instruction Formats

[Figure: four instruction layouts, each word shown with bits 15..0.
  (a) A 1-word move instruction: IR with fields op, rg2, md2, md1, rg1.
  (b) A 2-word instruction: IR (including md1 and rg1) plus one extra word holding a 16-bit constant.
  (c) A 3-word instruction: IR plus two extra words, each holding a 16-bit constant.
  (d) Instruction with indexed address: IR with md1 = 110 and rg1, plus one extra word with fields d/a, Index reg, w/l, 000, disp8.]




General Form of Addressing Modes in the MC68000

• A general address of an operand or result is specified by a 6-bit field with mode and register numbers

• Not all operands and results can be specified by a general address: some must be in registers

• Not all modes are legal in all parts of an instruction

Field layout: bits 5..3 = mode, bits 2..0 = reg. This 6-bit field provides access paths to operands.




Tbl 3.2 MC68000 Addressing Modes

Mode/reg field layout: bits 5..3 = mode, bits 2..0 = reg

Name                  Mode  Reg.  Assembler        Extra Words  Brief description
Data reg. direct      0     0-7   Dn               0            Dn
Addr. reg. direct     1     0-7   An               0            An
Addr. reg. indirect   2     0-7   (An)             0            M[An]
Autoincrement         3     0-7   (An)+            0            M[An];An←An+d
Autodecrement         4     0-7   -(An)            0            An←An-d;M[An]
Based                 5     0-7   disp16(An)       1            M[An+disp16]
Based indexed short   6     0-7   disp8(An,XnLo)   1            M[An+XnLo+disp8]
Based indexed long    6     0-7   disp8(An,Xn)     1            M[An+Xn+disp8]
Absolute short        7     0     addr16           1            M[addr16]
Absolute long         7     1     addr32           2            M[addr32]
Relative              7     2     disp16(PC)       1            M[PC+disp16]
Rel. indexed short    7     3     disp8(PC,XnLo)   1            M[PC+XnLo+disp8]
Rel. indexed long     7     3     disp8(PC,Xn)     1            M[PC+Xn+disp8]
Immediate             7     4     #data            1-2          data




RTN Description of MC68000 Addressing

• The addressing modes interpret many items
  • The instruction: in the IR register
  • The following 16-bit word: described as Mw[PC]
  • The D and A registers in the CPU
• Many addressing modes calculate an effective memory address
• Some modes designate a register
• Some modes result in a constant operand
• There are restrictions on the use of some modes

Field layout: bits 5..3 = mode, bits 2..0 = reg




RTN Formatting for Effective Address Calculation

• Either an A or a D register can be used as an index
  • A 4-bit field in the 2nd instruction word specifies the index register
  • Low order 8 bits of 2nd word are used as offset
  • Either 16 or 32 bits of index register may be used

XR[0..15]⟨31..0⟩ := D[0..7]⟨31..0⟩ # A[0..7]⟨31..0⟩:    Index register can be D or A;
xr⟨3..0⟩ := Mw[PC]⟨15..12⟩:                             Index specifier for index mode;
wl := Mw[PC]⟨11⟩:                                       Short or long index flag;
dsp8⟨7..0⟩ := Mw[PC]⟨7..0⟩:                             Displacement for index mode;
index := ( (wl=0) → XR[xr]⟨15..0⟩:                      Short or
           (wl=1) → XR[xr]⟨31..0⟩ ):                    long index value;

Layout of the second instruction word:

Bits    15    14..12      11    10..8    7..0
Field   d/a   Index reg   w/l   0 0 0    disp8 (= ldisp)

• d/a: 0 = index is in data register, 1 = index is in address register
• w/l: 0 = 16 bit index, 1 = 32 bit index
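The index definitions can be exercised on a sample extension word. The sketch below follows the RTN literally, so it does not sign-extend the short index or dsp8 (the RTN shown gives no modifiers for them); the function name `index_value` and the sample register contents are assumptions.

```python
def index_value(ext_word, D, A):
    """Decode the second instruction word Mw[PC]; return (index, dsp8)."""
    xr = (ext_word >> 12) & 0xF      # xr<3..0> := Mw[PC]<15..12>
    wl = (ext_word >> 11) & 1        # wl := Mw[PC]<11>
    dsp8 = ext_word & 0xFF           # dsp8<7..0> := Mw[PC]<7..0>
    XR = D + A                       # XR[0..15] := D[0..7] # A[0..7]
    reg = XR[xr] & 0xFFFFFFFF
    index = (reg & 0xFFFF) if wl == 0 else reg
    return index, dsp8

D = [0] * 8
A = [0] * 8
D[4] = 0x12345678
A[2] = 0x00001000
print([hex(v) for v in index_value(0x4010, D, A)])   # D4, 16-bit index
print([hex(v) for v in index_value(0x4810, D, A)])   # D4, 32-bit index
print([hex(v) for v in index_value(0xA804, D, A)])   # A2, 32-bit index
```

Concatenating D before A in XR is what lets the top bit of the 4-bit specifier act as the d/a flag.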




Modes That Calculate a Memory Address Using a Register

• md and rg are the 3-bit mode and register fields
• ea stands for effective address

ea(md, rg) := (
  (md = 2) → A[rg⟨2..0⟩]:                                      Mode 2 is A register indirect;
  (md = 3) → (A[rg⟨2..0⟩]; A[rg⟨2..0⟩] ← A[rg⟨2..0⟩] + d):     Mode 3 is autoincrement;
  (md = 4) → (A[rg⟨2..0⟩] ← A[rg⟨2..0⟩] - d; A[rg⟨2..0⟩]):     Mode 4 is autodecrement;
  (md = 5) → (A[rg⟨2..0⟩] + Mw[PC]; PC ← PC + 2):              Mode 5 is based or offset addressing;
  (md = 6) → (A[rg⟨2..0⟩] + index + dsp8; PC ← PC + 2):        Mode 6 is based indexed addressing;

Field layout: bits 5..3 = mode (010 - 110), bits 2..0 = reg (000 - 111)




Mode 7 Uses the Register Field to Expand the Number of Modes

• These modes still calculate a memory address

ea (md, rg) := . . .
  (md = 7 ∧ rg = 0) → (Mw[PC]{sign extend to 32 bits}; PC ← PC + 2):        Mode 7, register 0 is short absolute;
  (md = 7 ∧ rg = 1) → (Ml[PC]; PC ← PC + 4):                                Mode 7, register 1 is long absolute;
  (md = 7 ∧ rg = 2) → (PC + Mw[PC]{sign extend to 32 bits}; PC ← PC + 2):   Mode 7, register 2 is program counter relative addressing;
  (md = 7 ∧ rg = 3) → (PC + index + dsp8; PC ← PC + 2) ):                   Mode 7, register 3 is relative indexed.

Field layout: bits 5..3 = 1 1 1, bits 2..0 = reg




Fig 3.3 Address Register Indirect Addressing

• Same picture for autoincrement or decrement
  • Address register incremented after address obtained in autoincrement
  • Address register decremented before address obtained in autodecrement

[Figure: address register indirect. The mode/reg field (bits 5..0) = 0 1 0 reg selects one of the 68000 registers A0..A7; that register holds the address of the operand in main memory. Ex: MOVE (A6), ...]




Fig 3.4 Mode 6: Based Indexed Addressing

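• Three things are added to get the address

[Figure: Mode 6, based indexed addressing. The mode/reg field (bits 5..0) = 1 1 0 reg selects a base address in A0..A7. The second instruction word (bits 15..0) names an index register from D0-D7 or A0-A7 (0: index is in data reg., 1: index is in address reg.), a 16 or 32 bit index, and an 8-bit displacement in bits 7..0. Base address, index, and displacement are added to give the address of the operand in main memory. Ex: MOVE.W LDISP (A6, D4), ...]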




Mode 7-0,1: Absolute Addressing

• Absolute addresses can be 16 or 32 bits

[Figure: mode/reg field (bits 5..0) = 1 1 1 followed by 000 (16-bit) or 001 (32-bit).
  Absolute short addressing (111 000): one following word addr16 (bits 15..0), sign extended to 32 bits, gives the address of the operand in main memory. Ex: MOVE.B PRINTERPORT.W, ...
  Absolute long addressing (111 001): two following words, addr32Hi and addr32Lo, are concatenated to give the address. Ex: MOVE.W INTVECT.L, ...]




Mode 7, Reg 3: Relative Indexed Addressing

• Same as indexed mode but uses PC instead of A register as base

[Figure: relative indexed addressing. The mode/reg field (bits 5..0) = 1 1 1 0 1 1. The program counter is added to an index from D0-D7 or A0-A7 (0: index is in data reg., 1: index is in address reg.; 16 or 32 bit index) and to the 8-bit displacement in bits 7..0 of the second word, giving the address of the operand in main memory. Ex: MOVE.W LDISP (PC, D4), ...]




Operands in Registers or Memory Can Have Different Lengths

memval(md, rg) := (                                       A memory address is
  (md⟨2..1⟩ = 1) ∨ (md⟨2..1⟩ = 2) ∨ (md⟨2..0⟩ = 6) ∨       used with these
  ((md⟨2..0⟩ = 7) ∧ (rg⟨2⟩ =0)) ):                         modes only.
opnd(md, rg) := (                                         The operand length in
  (d=1) → opndb(md, rg): (d=2) → opndw(md, rg):           the instruction tells
  (d=4) → opndl(md, rg) ):                                which to use.
opndl(md, rg)⟨31..0⟩ := (                                 A long operand can be
  . . . ):                                                . . .
opndw(md, rg)⟨15..0⟩ := (                                 A word operand is
  memval(md, rg) → Mw[ea(md, rg)]⟨15..0⟩:                 similar but needs only
  md = 0 → D[rg]⟨15..0⟩:                                  a 16-bit immediate
  md = 1 → A[rg]⟨15..0⟩:                                  following the
  (md = 7 ∧ rg = 4) → (Mw[PC]⟨15..0⟩: PC ← PC+2) ):       instruction word.
opndb(md, rg)⟨7..0⟩ := (                                  Byte operands
  . . .                                                   . . .
  (md = 7 ∧ rg = 4) → (Mw[PC]⟨7..0⟩: PC ← PC+2) ):        instruction word.




Modes 0 and 1: Register Direct Addressing

• The register itself provides a place to store a result or a place to get an operand

• There is no memory address with this mode

[Figure: mode/reg field (bits 5..0) = 0 0 0 reg (D) or 0 0 1 reg (A).
  Data register direct (000 Reg): the operand is in one of the data registers D0..D7. Ex: MOVE D6, ...
  Address register direct (001 Reg): the operand is in one of the address registers A0..A7. Ex: MOVE A6, ...]




Fig 3.5 Mode 7, Reg 4: Immediate Addressing

• Operands are stored in the instruction
• Data length is specified by the opcode field, not the Mode/Reg field

[Figure: instruction word with mode/reg field (bits 5..0) = 1 1 1 1 0 0, and 1 or 2 following words.
  Byte: one following word, bits 15..8 = 00000000, bits 7..0 = value8. Ex: MOVE.B #12, ...
  Word: one following word, bits 15..0 = value16. Ex: MOVE.W #1234, ...
  Longword: two following words, value16Hi then value16Lo. Ex: MOVE.L #12348678, ...]




Not Every Addressing Mode Can Be Used for Results

• The MC68000 disallows relative addressing for results
• This is captured in RTN by defining a function that is true (= 1) if the memory address specified by the mode is legal for results
• Register direct is also legal for results, but will be handled separately

rsltadr(md, rg) := memval(md, rg) ∧ ¬(md=7 ∧ (rg=2 ∨ rg=3)):
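The two predicates are easy to tabulate. The sketch below transcribes memval and rsltadr directly as Boolean functions of the 3-bit mode and register fields; enumerating all 64 combinations is just one way to inspect them.

```python
def memval(md, rg):
    """True when the mode/reg pair names a memory address."""
    hi = (md >> 1) & 0b11                      # md<2..1>
    return (hi == 1 or hi == 2 or md == 6
            or (md == 7 and ((rg >> 2) & 1) == 0))

def rsltadr(md, rg):
    """True when the memory address is legal for results (no PC-relative modes)."""
    return memval(md, rg) and not (md == 7 and rg in (2, 3))

legal = [(md, rg) for md in range(8) for rg in range(8) if rsltadr(md, rg)]
print(len(legal))
print([pair for pair in legal if pair[0] == 7])
```

Only the short and long absolute forms of mode 7 remain legal as result addresses; the register direct modes 0 and 1 are not memory addresses and are handled by separate clauses.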




Result Modes Must Have a Place to Write Data: Memory or Register

rsltl(md, rg)⟨31..0⟩ := (                          32-bit result
  rsltadr(md, rg) → Ml[ea(md, rg)]⟨31..0⟩:
  md = 0 → D[rg]⟨31..0⟩:
  md = 1 → A[rg]⟨31..0⟩ ):
rsltw(md, rg)⟨15..0⟩ := (                          16-bit result
  rsltadr(md, rg) → Mw[ea(md, rg)]⟨15..0⟩:
  md = 0 → D[rg]⟨15..0⟩:
  md = 1 → A[rg]⟨15..0⟩ ):
rsltb(md, rg)⟨7..0⟩ := (                           8-bit result
  rsltadr(md, rg) → Mb[ea(md, rg)]⟨7..0⟩:
  md = 0 → D[rg]⟨7..0⟩:
  md = 1 → A[rg]⟨7..0⟩ ):
rslt(md, rg) := (                                  The result length in the
  (d=1) → rsltb(md, rg): (d=2) → rsltw(md, rg):    instruction tells
  (d=4) → rsltl(md, rg) ):                         which to use




MC68000 Instruction Interpretation

• Instruction interpretation is simple when exceptions are ignored

• Instructions are fetched 16 bits at a time

• PC is advanced by 2 as each 16-bit word is fetched

• Addressing mode may advance it a total of 2 or 4 or more words, under command from the control unit

Instruction_interpretation := (
    Run → ( (IR⟨15..0⟩ ← Mw[PC]⟨15..0⟩: PC ← PC + 2); instruction_execution ); ):




Tbl 3.3 MC68000 Data Movement Instructions

• The op code location and size depend on the instruction (compare to SRC)

Inst.     Operands  1st word          XNZVC      Operation  Size
MOVE.B    EAs, EAd  0001ddddddssssss  - x x 0 0  dst ← src  byte
MOVE.W    EAs, EAd  0011ddddddssssss  - x x 0 0  dst ← src  word
MOVE.L    EAs, EAd  0010ddddddssssss  - x x 0 0  dst ← src  long
MOVEA.W   EAs, An   0011rrr001ssssss  - - - - -  An ← src   word
MOVEA.L   EAs, An   0010rrr001ssssss  - - - - -  An ← src   long
LEA.L     EAc, An   0100aaa111ssssss  - - - - -  An ← EA    addr.
EXG       Dx, Dy    1100xxx1mmmmmyyy  - - - - -  Dx ↔ Dy    long




RTN for a Typical MC68000 Move Instruction

• The temporary register tmp is used because every invocation of opnd() causes another fetch

tmp⟨31..0⟩:
move (:= op⟨3..2⟩ = 0) → (
    tmp ← opnd(md1, rg1);
    ( Z ← (tmp=0): N ← (tmp<0): V ← 0: C ← 0 ):
    rslt(md2, rg2) ← tmp ):

• The instruction format for Move includes mode and register for source and destination addresses

op⟨3..0⟩ := IR⟨15..12⟩: rg1⟨2..0⟩ := IR⟨2..0⟩: md1⟨2..0⟩ := IR⟨5..3⟩: rg2⟨2..0⟩ := IR⟨11..9⟩: md2⟨2..0⟩ := IR⟨8..6⟩:




Tbl 3.4 MC68000 Integer Arithmetic and Logic Instructions

Op. Operands Inst. word XNZVC Operation Sizes

ADD EA,Dn 1101rrrmmmaaaaaa x x x x x dst ← dst + src b, w, l

SUB EA,Dn 1001rrrmmmaaaaaa x x x x x dst ← dst - src b, w, l

CMP EA,Dn 1011rrrmmmaaaaaa - x x x x dst-src b, w, l

CMPI #dat,EA 00001100wwaaaaaa - x x x x dst-immed.data b, w, l

MULS EA, Dn 1100rrr111aaaaaa - x x 0 0 Dn←Dn*src l←w*w

DIVS EA,Dn 1000rrr111aaaaaa - x x x 0 Dn←Dn/src l←l/w

AND EA,Dn 1100rrrmmmaaaaaa - x x 0 0 dst←dst∧ src b, w, l

OR EA,Dn 1000rrrmmmaaaaaa - x x 0 0 dst←dst∨ src b, w, l

EOR EA,Dn 1011rrrmmmaaaaaa - x x 0 0 dst←dst⊕ src b, w, l

CLR EAs 01000010wwaaaaaa - 0 1 0 0 dst ← 0 b, w, l

NEG EAs 01000100wwaaaaaa - x x x x dst←0 - dst b, w, l

TST EAs 01001010wwaaaaaa - x x 0 0 dst−0 b, w, l

NOT EAs 01000110wwaaaaaa - x x 0 0 dst← ¬dst b, w, l




Notes on MC68000 Arithmetic and Logic Instructions

• Only one operand uses EA

• The other operand is always accessed by Data register direct

• The 3-bit mmm field specifies whether D is the source or destination, and whether it is B, W, or L

Byte  Word  Long  Destination
000   001   010   Dn
100   101   110   EA

Ex: SUB EA, Dn: 1001 rrr mmm aaaaaa

Note: There are several exceptions to the rule above. See text and mfr. data sheet.

All 2-operand ALU instructions are either D → EA or EA → D. Which is it? The mmm field decides: the instruction word is laid out as op | Dn | mmm (table above) | EA.




RTN Description of a Typical MC68000 Arithmetic Instruction

• This definition does not handle the condition codes

• Subtract is a typical arithmetic instruction

• Need a temporary register to hold an address

tmp⟨31..0⟩:    temporary register for address

sub (:= op=9) → (
    (md2⟨2⟩ = 0) → D[rg2] ← D[rg2] - opnd(md1, rg1):
    (md2⟨2⟩ = 1) → (
        memval(md1, rg1) → ( tmp ← ea(md1, rg1); M[tmp] ← M[tmp] - D[rg2] ):
        ¬memval(md1, rg1) → rslt(md1, rg1) ← rslt(md1, rg1) - D[rg2] )
    ):




MC68000 Arithmetic Shifts and Single Word Rotates

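• d is L or R for left or right shift, respectively

• EA form has shift count of 1

(Figure: ASL shifts Dn left, filling with 0, and ASR shifts Dn right; in both, the bit shifted out goes to C and X. ROL and ROR rotate Dn left or right, and the bit rotated around is also copied to C.)

Op.   Operands  Inst. word        X  V
ASd   EA        1110000d11aaaaaa  x  x
ASd   #cnt,Dn   1110cccdww000rrr  x  x
ASd   Dm,Dn     1110RRRdww100rrr  x  x
ROd   EA        1110011d11aaaaaa  -  0
ROd   #cnt,Dn   1110cccdww011rrr  -  0
ROd   Dm,Dn     1110RRRdww111rrr  -  0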




MC68000 Logical Shifts and Extended Rotates

• Field ww specifies byte, word, or longword

• N and Z set according to result, C = last bit shifted out

(Figure: LSL shifts Dn left and LSR shifts Dn right, filling with 0; the bit shifted out goes to C and X. ROXL and ROXR rotate Dn left or right through X, and the bit rotated out goes to both X and C.)

Op.    Operands  Inst. word        X  V
LSd    EA        1110001d11aaaaaa  x  0
LSd    #cnt,Dn   1110cccdww001rrr  x  0
LSd    Dm,Dn     1110RRRdww101rrr  x  0
ROXd   EA        1110010d11aaaaaa  x  0
ROXd   #cnt,Dn   1110cccdww010rrr  x  0
ROXd   Dm,Dn     1110RRRdww110rrr  x  0




MC68000 Conditional Branch and Test Instructions

• DBcc is used for counted loops with an optional end condition

• Scc sets a byte to the outcome of a test

Op.   Operands  Inst. word        Operation
Bcc   disp      0110ccccdddddddd  if (cond) then PC ← PC + disp
                DDDDDDDDDDDDDDDD
DBcc  Dn,disp   0101cccc11001rrr  if ¬(cond) then (Dn ← Dn-1;
                                      if (Dn ≠ -1) then PC ← PC+disp)
                                  else PC ← PC + 2
Scc   EA        0101cccc11aaaaaa  if (cond) then (EA) ← FFH
                                  else (EA) ← 00H




Conditions That Can Be Evaluated for Branch, Etc.

Code  Meaning           Name  Flag expression
0000  true              T     1
0001  false             F     0
0100  carry clear       CC    ¬C
0101  carry set         CS    C
0111  equal             EQ    Z
0110  not equal         NE    ¬Z
1011  minus             MI    N
1010  plus              PL    ¬N
0011  low or same       LS    C+Z
1101  less than         LT    N·¬V + ¬N·V
1100  greater or equal  GE    N·V + ¬N·¬V
1110  greater than      GT    N·V·¬Z + ¬N·¬V·¬Z
1111  less or equal     LE    N·¬V + ¬N·V + Z
0010  high              HI    ¬C·¬Z
1000  overflow clear    VC    ¬V
1001  overflow set      VS    V
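The flag expressions in the table can be checked mechanically. The following is a minimal sketch in Python, not part of the original slides; the function name `cond_true` and the dictionary layout are illustrative assumptions. It uses the identities N·V + ¬N·¬V = (N = V) and N·¬V + ¬N·V = (N ≠ V).

```python
def cond_true(code: int, n: int, z: int, v: int, c: int) -> bool:
    """Evaluate a 4-bit MC68000 condition code against flags N, Z, V, C.

    Hypothetical helper for illustration only.
    """
    n, z, v, c = bool(n), bool(z), bool(v), bool(c)
    table = {
        0b0000: True,                # T   true
        0b0001: False,               # F   false
        0b0010: not c and not z,     # HI  high
        0b0011: c or z,              # LS  low or same
        0b0100: not c,               # CC  carry clear
        0b0101: c,                   # CS  carry set
        0b0110: not z,               # NE  not equal
        0b0111: z,                   # EQ  equal
        0b1000: not v,               # VC  overflow clear
        0b1001: v,                   # VS  overflow set
        0b1010: not n,               # PL  plus
        0b1011: n,                   # MI  minus
        0b1100: n == v,              # GE  N·V + ¬N·¬V
        0b1101: n != v,              # LT  N·¬V + ¬N·V
        0b1110: n == v and not z,    # GT  (N = V) and not Z
        0b1111: n != v or z,         # LE  (N ≠ V) or Z
    }
    return table[code & 0xF]
```

For example, after a compare that leaves N = 1 and V = 1, `cond_true(0b1100, 1, 0, 1, 0)` holds (GE), while `cond_true(0b1101, 1, 0, 1, 0)` does not (LT).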




Conditional Branches First Set Condition Codes, Then Branch

• EQ tests the right condition codes for = 0, as above, or A = B following a compare, CMP A, B

if ( X = 0 ) goto LOC

        TST     X               ;ands X with itself and sets N and Z
        BEQ     LOC             ;branch to LOC if X = 0
        ...
LOC:




MC68000 Unconditional Control Transfers

• Subroutine links push the return address onto the stack pointed to by A7 = SP

Op.  Operands  Inst. word        Operation
BRA  disp      01100000dddddddd  PC ← PC + disp
               DDDDDDDDDDDDDDDD
BSR  disp      01100001dddddddd  -(SP) ← PC; PC ← PC + disp
               DDDDDDDDDDDDDDDD
JMP  EA        0100111011aaaaaa  PC ← EA
JSR  EA        0100111010aaaaaa  -(SP) ← PC; PC ← EA




MC68000 Subroutine Return Instructions

• Subroutine linkage uses stack for return address

• LINK and UNLK allocate and de-allocate multiple word stack frames

Op.   Operands  Inst. word        Operation
RTR             0100111001110111  CC ← (SP)+; PC ← (SP)+
RTS             0100111001110101  PC ← (SP)+
LINK  An,disp   0100111001010rrr  -(SP) ← An; An ← SP;
                DDDDDDDDDDDDDDDD  SP ← SP + disp
UNLK  An        0100111001011rrr  SP ← An; An ← (SP)+




MC68000 Assembly Code Example: Search an Array

• Program searches an array of bytes to find the first carriage return, ASCII code 13

CR      EQU     13              ;Define return character.
LEN     EQU     132             ;Define line length.
        ORG     $1000           ;Locate LINE at 1000H.
LINE    DS.B    LEN             ;Reserve LEN bytes of storage.
        MOVE.W  #LEN-1,D0       ;Initialize D0 to count-1 (DBcc counts in the low word).
        MOVEA.L #LINE,A0        ;A0 gets start address of array.
LOOP    CMPI.B  #CR,(A0)+       ;Make the comparison.
        DBEQ    D0,LOOP         ;Double test: if LINE[131-D0]≠13
        <next instruction>      ; then decr. D0; if D0≠-1 branch
                                ; to LOOP, else to next inst.
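The double test performed by DBEQ can be traced with a small model. This is a sketch in Python under stated assumptions, not the textbook's code: `find_cr` is a hypothetical name, a Python index stands in for A0, and D0 is modeled as a plain signed integer.

```python
def find_cr(line: bytes, cr: int = 13):
    """Model the CMPI.B / DBEQ search loop.

    Returns (d0, a0): the final counter value and the post-incremented
    index. If d0 is not -1, the match is at index a0 - 1.
    """
    d0 = len(line) - 1          # MOVE.W #LEN-1,D0
    a0 = 0                      # MOVEA.L #LINE,A0 (index instead of address)
    while True:
        equal = line[a0] == cr  # CMPI.B #CR,(A0)+ sets Z on a match
        a0 += 1                 # autoincrement by 1 for a byte access
        if equal:               # DBEQ: condition true, fall through
            break
        d0 -= 1                 # condition false, decrement the counter
        if d0 == -1:            # counter exhausted, fall through
            break
    return d0, a0
```

With `line = b"ab\rcd"` the loop exits with `d0 == 2`, and the carriage return sits at index `len(line) - 1 - d0 == 2`, matching the LINE[131-D0] comment for a 132-byte line.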




Pseudo-Operations in the MC68000 Assembler

• A pseudo-operation is one that is performed by the assembler at assembly time, not by the CPU at run time

• EQU defines a symbol to be equal to a constant. Substitution is made at assemble time

Pi EQU 3.14

• DS.B (.W or .L) defines a block of storage

• Any label is associated with the first word of the block

    Line DS.B 132

• The program loader (part of the operating system) accomplishes this




Pseudo Operations in the MC68000 Assembler (cont’d.)

• # symbol indicates the value of the symbol instead of a location addressed by the symbol

    MOVE.L #1000, D0    ;moves 1000 to D0
    MOVE.L 1000, D0     ;moves value at addr. 1000 to D0

• The assembler detects the difference and assembles the appropriate instruction

• ORG specifies a memory address as the origin where the following code will be stored

    Start ORG $4000     ;next instruction/data will be loaded at
                        ;address 4000H.

• The Motorola assembler uses $ in front of a number to indicate hexadecimal

• Character constants are in single quotes: ‘X’




Review of Assembly, Link, Load, and Run Times

• At assemble time, assembly language text is converted to (binary) machine language

• Binary words may be generated by translating instructions, hexadecimal or decimal numbers, characters, etc.

• Addresses are translated by way of a symbol table

• Addresses are adjusted to allow for blocks of memory reserved for arrays, etc.

• At link time, separately assembled modules are combined and absolute addresses assigned

• At load time, the binary words are loaded into memory

• At run time, the PC is set to the starting address of the loaded module (usually the o.s. makes a jump or procedure call to that address)




MC68000 Assembly Language Example: Clear a Block

• Subroutine expects block base in A0, count in D0

• Linkage uses the stack pointer, so A7 cannot be used for anything else

MAIN    …
        MOVE.L  #ARRAY, A0      ;Base of array
        MOVE.W  #COUNT, D0      ;Number of words to clear
        JSR     CLEARW          ;Make the call
        …

CLEARW  BRA     LOOPE           ;Branch for init. decr.
LOOPS   CLR.W   (A0)+           ;Autoincrement by 2.
LOOPE   DBF     D0, LOOPS       ;Dec. D0, fall through if -1
        RTS                     ;Finished.




Exceptions: Changes to Sequential Instruction Execution

• Exceptions, also called interrupts, cause next instruction fetch from other than PC location

• Address supplying next instruction called exception vector

• Exceptions can arise from instruction execution, hardware faults, and external conditions

• Externally generated exceptions usually called interrupts

• Arithmetic overflow, power failure, I/O operation completion, and out of range memory access are some causes

• A trace bit = 1 causes an exception after every instruction
    • Used for debugging purposes




Steps in Handling MC68000 Exceptions

• (1) Status change
    • Temporary copy of status register is made
    • Supervisor mode bit S is set, trace bit T is reset

• (2) Exception vector address is obtained
    • Small address made by shifting 8-bit vector number left 2
    • Contents of the longword at this vector address is the address of the next instruction to be executed
    • The exception handler or interrupt service routine starts there

• (3) Old PC and status register are pushed onto supervisor stack, addressed by A7' = SSP

• (4) PC is loaded from exception vector address

• Return from handler is done by RTE
    • Like RTR except restores status register instead of CCs
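Step (2) is plain arithmetic: the 8-bit vector number is shifted left by 2 to give the byte address of a longword, and that longword becomes the new PC. A minimal Python sketch follows; the names `vector_address` and `handler_entry`, and the dictionary used as longword memory, are illustrative assumptions.

```python
def vector_address(vector_number: int) -> int:
    """Byte address of the exception vector: the 8-bit number shifted left 2."""
    if not 0 <= vector_number <= 0xFF:
        raise ValueError("vector number must fit in 8 bits")
    return vector_number << 2


def handler_entry(longword_memory: dict, vector_number: int) -> int:
    """Address of the first handler instruction, read from the vector slot.

    longword_memory is a hypothetical map from byte address to 32-bit value.
    """
    return longword_memory[vector_address(vector_number)]
```

Because the vector number has 8 bits, the vector table occupies addresses 0 through 3FFH, that is 256 longwords.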




Exception Priorities

• When several exceptions occur at once, which exception vector is used?

• Exceptions have priorities, and highest priority exception supplies the vector

• MC68000 allows 7 levels of priority

• Status register contains current priority

• Exceptions with priority ≤ current are ignored




Exceptions and Reset Both Affect Instruction Interpretation

• More processor state needed to describe reset and exception processing

Reset:    Reset input
exc_req:    Single bit exception request
exc_lev⟨2..0⟩:    Exception level
vect⟨7..0⟩:    Vector address for this exception
exc := exc_req ∧ (exc_lev⟨2..0⟩ > INT⟨2..0⟩):    There is a request, and the request level is > current mask in status reg.

• exc_lev is the highest priority of any pending exception




Exceptions Are Sensed Before Fetching Next Instruction

• Reset starts the computer with a stack pointer from location 0 at the address from location 4

Instruction_interpretation := (
    Run ∧ ¬(Reset ∨ exc) → (IR ← Mw[PC]: PC ← PC + 2);    Normal execution state
    Reset → (INT⟨2..0⟩ ← 7: S ← 1: T ← 0:    Machine reset
        SSP ← Ml[0]: PC ← Ml[4]: Reset ← 0: Run ← 1 );
    Run ∧ ¬Reset ∧ exc → (SSP ← SSP - 4; Ml[SSP] ← PC;    Exception handling
        SSP ← SSP - 2; Mw[SSP] ← Status;
        S ← 1: T ← 0: INT⟨2..0⟩ ← exc_lev⟨2..0⟩:
        PC ← Ml[vect⟨7..0⟩#00₂] );
    Instruction_execution ).




Memory-Mapped I/O

• No separate I/O space. Part of cpu memory space is devoted/reserved for I/O instead of RAM or ROM.

• Example: MC68000 has a total 24-bit address space. Suppose the top 32K is reserved for I/O:

FFFFFFH
  . . .      } I/O Space
FF8000H
FF7FFFH
  . . .      } Memory Space
000000H

Notice that top 32K can be addressed by a negative 16-bit value.
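The remark about negative 16-bit values can be verified with a short sign-extension sketch in Python. The function name `sign_extend_16_to_24` is an illustrative assumption; the 24-bit width comes from the address space stated above.

```python
def sign_extend_16_to_24(value16: int) -> int:
    """Sign-extend a 16-bit address constant to a 24-bit address."""
    value16 &= 0xFFFF
    if value16 & 0x8000:            # negative as a 16-bit 2's complement value
        return value16 | 0xFF0000   # fill the upper 8 address bits with ones
    return value16                  # positive values extend with zeros
```

Every 16-bit constant from 8000H to FFFFH therefore lands in FF8000H..FFFFFFH, exactly the top 32K reserved for I/O in this example.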




Memory-Mapped I/O in the MC68000

• Memory-mapped I/O allows µprocessor chip to have one bus for both memory and I/O

• Multiple wires for both address and data

• I/O uses address space that could otherwise contain memory
    • Not popular with machines having limited address bits

• Sizes of I/O and memory “spaces” independent
    • Many or few I/O devices may be installed
    • Much or little memory may be installed

• Spaces are separated by putting I/O at top end of the address space




Fig 3.8 A Memory-Mapped Keyboard Interface

MC68000 has a 24-bit address bus.

Address space runs from 000000H up to FFFFFFH.

A 16-bit address constant can be positive, and sign extend to an address running from 000000H up to the maximum positive value, or negative, and sign extend to an address running from FFFFFFH down to the last negative 16-bit value.

I/O addresses in the latter range can be accessed by a 16-bit constant.

(Figure: the CPU, memory occupying 000000H through FF7FFFH, and a keyboard interface share an n-bit system bus. The keyboard interface holds KBSTATUS at FF8006H, whose "character available" bit is shown as 1, and KBDATA at FF8008H, shown holding 00001101.)




The SPARC (Scalable Processor ARChitecture) as a RISC Microprocessor

Architecture

• The SPARC is a general register, load-store architecture

• It has only two addressing modes. Address =
    • (Reg + Reg) or (Reg + 13-bit constant)

• Instructions are all 32 bits in length

• SPARC has 69 basic instructions

• Separate floating-point register set

• First implementation had a 4-stage pipeline

• Some important features not inherently RISC

• Register windows: Separate but overlapping register sets available to calling and called routines

• 32-bit address, big-endian organization of memory




Fig 3.9 Simplified SPARC Processor State

(Figure: simplified SPARC processor state. Integer registers r0..r31, each 32 bits: r0..r7 are global registers, r8..r15 out parameters, r16..r23 local registers, and r24..r31 in parameters. Floating-point registers f0..f31, each 32 bits. Other registers: the processor-status register, which contains the condition codes n z v c; IR, instruction register; WIM, window-invalid mask; PC, program counter; TBR, trap base register; nPC, next program counter; Y, multiply step register.)




Fig 3.10 SPARC Register Windows Mechanism

(Figure: overlapping register windows. Each window has in parameters r24..r31, local registers r16..r23, and out parameters r8..r15, and the out parameters of one window are the in parameters of the next. save moves from the window at CWP = N to the window at CWP = N – 1; restore moves back. Global registers r0..r7 are shared by all windows.)




SPARC Memory

RTN for the SPARC memory:

Mb[0..2^32 - 1]⟨7..0⟩:    Byte memory
Mh[a]⟨15..0⟩ := Mb[a]⟨7..0⟩#Mb[a + 1]⟨7..0⟩:    Halfword memory
M[a]⟨31..0⟩ := Mh[a]⟨15..0⟩#Mh[a + 2]⟨15..0⟩:    Word memory




Register Windows Format the General Registers

• 32 general integer and address registers are accessible at any one time

• Global registers G0..G7 are not in any window

• G0 is always zero: writes to G0 are ignored, reads return 0

• The other 24 are in a movable window from a total set of 120

• On subroutine call, the starting point changes so that 8–15 before call become 24–31 after

• Registers 24–31 are used for incoming parameters

• Registers 8–15 are for outgoing parameters

• Current Window Pointer CWP locates register 8

• Overflow of register space causes trap




save, restore, and the Current Window Pointer

• CWP points to the register currently called G8

• save moves the window so that the old G8..G15 become the new G24..G31

• If parameters are placed in G8..G15 (the out registers) by the caller, the callee can get them from G24..G31 (its in registers)

• When all windows are used, save traps to a routine that saves registers to memory

• Windows wrap around in the available registers

• Window overflow “spills” the first window and reuses its space
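The overlap between windows can be modeled by mapping a (CWP, register number) pair to a physical register index. The sketch below is in Python and rests on assumptions: `NWINDOWS = 7` is chosen so that 8 globals plus 7 × 16 windowed registers give the total of 120 mentioned above, and the names `physical_index`, `save`, and `restore` are illustrative. It follows the labeling of Fig 3.9 (r8..r15 out, r24..r31 in) and Fig 3.10 (save moves from CWP = N to N – 1).

```python
NWINDOWS = 7                      # assumed window count: 8 + 7 * 16 = 120 registers
WINDOWED = NWINDOWS * 16          # physical registers that take part in windowing


def physical_index(cwp: int, r: int) -> int:
    """Map architectural register r (0..31) in window cwp to a physical index."""
    if not 0 <= r <= 31:
        raise ValueError("register number must be 0..31")
    if r < 8:
        return r                                  # globals are outside every window
    return 8 + (cwp * 16 + (r - 8)) % WINDOWED    # windows wrap around


def save(cwp: int) -> int:
    """New CWP after save (one window toward lower numbers, with wraparound)."""
    return (cwp - 1) % NWINDOWS


def restore(cwp: int) -> int:
    """New CWP after restore (undoes save)."""
    return (cwp + 1) % NWINDOWS
```

With this mapping, `physical_index(cwp, 8 + k)` in the caller equals `physical_index(save(cwp), 24 + k)` in the callee for k = 0..7, which is how outgoing parameters become incoming parameters without being copied. Trap handling on overflow is not modeled.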




SPARC Operand Addressing

• One mode computes address as sum of 2 registers; G0 gives zero if used

• The other mode adds sign-extended 13-bit constant to a register

• These can serve several purposes
    • Indexed: base in one register, index in another
    • Register indirect: G0 + Gn
    • Displacement: Gn + const, n ≠ 0
    • Absolute: G0 + constant

• Absolute addressing can only reach the bottom or top 4K bytes of memory




RTN for SPARC Instruction Format

op⟨1..0⟩ := IR⟨31..30⟩:    Instruction class, op code for format 1;
disp30⟨29..0⟩ := IR⟨29..0⟩:    Word displacement for call, format 1;
a := IR⟨29⟩:    Annul bit for branches, format 2a;
cond⟨3..0⟩ := IR⟨28..25⟩:    Branch condition select, format 2a;
rd⟨4..0⟩ := IR⟨29..25⟩:    Destination register for formats 2b & 3;
op2⟨2..0⟩ := IR⟨24..22⟩:    Op code for format 2;
disp22⟨21..0⟩ := IR⟨21..0⟩:    Constant for branch displacement or sethi;
op3⟨5..0⟩ := IR⟨24..19⟩:    Op code for format 3;
rs1⟨4..0⟩ := IR⟨18..14⟩:    Source register 1 for format 3;
opf⟨8..0⟩ := IR⟨13..5⟩:    Sub-op code for floating point, format 3a;
i := IR⟨13⟩:    Immediate operand indicator, formats 3b & c;
simm13⟨12..0⟩ := IR⟨12..0⟩:    Signed immediate operand for format 3c;
rs2⟨4..0⟩ := IR⟨4..0⟩:    Source register 2 for format 3b.
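Each RTN field definition above is a bit-slice of IR, so the whole list translates directly into shifts and masks. The following Python sketch is illustrative; the helper names `bits` and `decode_fields` are assumptions, and all fields are extracted regardless of which format the word actually uses.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Return word⟨hi..lo⟩ as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_fields(ir: int) -> dict:
    """Extract every SPARC instruction field defined in the RTN above."""
    return {
        "op": bits(ir, 31, 30),
        "disp30": bits(ir, 29, 0),
        "a": bits(ir, 29, 29),
        "cond": bits(ir, 28, 25),
        "rd": bits(ir, 29, 25),
        "op2": bits(ir, 24, 22),
        "disp22": bits(ir, 21, 0),
        "op3": bits(ir, 24, 19),
        "rs1": bits(ir, 18, 14),
        "opf": bits(ir, 13, 5),
        "i": bits(ir, 13, 13),
        "simm13": bits(ir, 12, 0),
        "rs2": bits(ir, 4, 0),
    }
```

As a usage example, a format 3 word built with op = 10, rd = 3, op3 = 00 0000, rs1 = 1, i = 0, and rs2 = 2 is 86004002H, and `decode_fields(0x86004002)` recovers those same field values.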




Fig 3.11 SPARC Instruction Formats

• Three basic formats with variations

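SPARC instruction formats (bit positions in ⟨ ⟩)

1. Call:              op = 01 ⟨31..30⟩ | disp30 ⟨29..0⟩
2a. Branches:         op = 00 ⟨31..30⟩ | a ⟨29⟩ | cond ⟨28..25⟩ | op2 ⟨24..22⟩ | disp22 ⟨21..0⟩
2b. sethi:            op = 00 ⟨31..30⟩ | rd ⟨29..25⟩ | op2 ⟨24..22⟩ | disp22 ⟨21..0⟩
3a. Floating point:   op ⟨31..30⟩ | rd ⟨29..25⟩ | op3 ⟨24..19⟩ | rs1 ⟨18..14⟩ | opf ⟨13..5⟩ | rs2 ⟨4..0⟩
3b. Data movement:    op ⟨31..30⟩ | rd ⟨29..25⟩ | op3 ⟨24..19⟩ | rs1 ⟨18..14⟩ | i = 0 ⟨13⟩ | asi ⟨12..5⟩ | rs2 ⟨4..0⟩
3c. ALU:              op ⟨31..30⟩ | rd ⟨29..25⟩ | op3 ⟨24..19⟩ | rs1 ⟨18..14⟩ | i = 1 ⟨13⟩ | simm13 ⟨12..0⟩

The i bit selects register (0) or immediate (1).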




RTN For SPARC Addressing Modes

adr⟨31..0⟩ := (i=0 → r[rs1] + r[rs2]:    Address for load, store, and jump
    i=1 → r[rs1] + simm13⟨12..0⟩ {sign ext.}):

calladr⟨31..0⟩ := PC⟨31..0⟩ + disp30⟨29..0⟩#00₂:    Call relative address

bradr⟨31..0⟩ := PC⟨31..0⟩ + disp22⟨21..0⟩#00₂ {sign ext.}:    Branch address
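The three address computations can be sketched in Python as follows. This is an illustration under assumptions, not an implementation from the text: the register file is a plain list `r`, results are wrapped to 32 bits, and the helper names are hypothetical. Appending 00₂ to a displacement is modeled as a left shift by 2.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value: int, width: int) -> int:
    """Interpret the low `width` bits of value as a 2's complement number."""
    value &= (1 << width) - 1
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def adr(r: list, rs1: int, rs2: int, i: int, simm13: int) -> int:
    """Address for load, store, and jump."""
    if i == 0:
        return (r[rs1] + r[rs2]) & MASK32
    return (r[rs1] + sign_extend(simm13, 13)) & MASK32


def calladr(pc: int, disp30: int) -> int:
    """Call relative address: PC + disp30#00."""
    return (pc + (disp30 << 2)) & MASK32


def bradr(pc: int, disp22: int) -> int:
    """Branch address: PC + sign-extended disp22#00."""
    return (pc + (sign_extend(disp22, 22) << 2)) & MASK32
```

With r[0] = 0, `adr(r, 0, 0, 1, simm13)` yields only addresses 0..FFFH or FFFFF000H..FFFFFFFFH, which is why absolute addressing reaches just the bottom or top 4K bytes.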




RTN For SPARC Instruction Interpretation

instruction_interpretation := (IR ← M[PC]; instruction_execution;update_PC_and_nPC; instruction_interpretation):




Tbl 3.8 SPARC Data Movement Instructions

Inst.  Op.  OPCODE   Meaning
ldsb   11   00 1001  Load signed byte
ldsh   11   00 1010  Load signed halfword
ldsw   11   00 1000  Load signed word
ldub   11   00 0001  Load unsigned byte
lduh   11   00 0010  Load unsigned halfword
ldd    11   00 0011  Load doubleword
stb    11   00 0101  Store byte
sth    11   00 0110  Store halfword
stw    11   00 0100  Store word
std    11   00 0111  Store doubleword
swap   11   00 1111  Swap register with memory
or     10   00 0010  r[rd] ← r[rs1] OR (r[rs2] or immediate)
sethi  00   Op2=100  High order 22 bits of Rdst ← disp22




Register and Immediate Moves in the SPARC

• OR is used with a G0 operand to do register-to-register moves

• To load a register with a 32-bit constant, a 2-instruction sequence is used

    SETHI R17, #upper22
    OR    R17, R17, #lower10

• Doublewords are loaded into an even register and the next higher odd one

• Floating-point instructions are not covered, but the 32 FP registers can hold single-length numbers, or 16 64-bit FP, or 8 128-bit FP numbers
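The SETHI/OR sequence above splits a 32-bit constant into its upper 22 and lower 10 bits. A minimal Python sketch of that split and its recombination follows; `split_constant` and `sethi_or` are hypothetical names, and the model assumes sethi clears the low 10 bits of the destination.

```python
def split_constant(value: int):
    """Split a 32-bit constant into (upper22, lower10)."""
    value &= 0xFFFFFFFF
    return value >> 10, value & 0x3FF


def sethi_or(upper22: int, lower10: int) -> int:
    """Model SETHI followed by OR on one register."""
    reg = (upper22 << 10) & 0xFFFFFFFF   # SETHI: high 22 bits set, low 10 bits zero
    return reg | (lower10 & 0x3FF)       # OR: fill in the low 10 bits
```

For the constant 12345678H the lower 10 bits are 278H, and `sethi_or(*split_constant(0x12345678))` reproduces the original value.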




Tbl 3.9 SPARC Arithmetic Instructions

• All are format 3, Op = 10

• CCs are set if S = 1 and not if S = 0

• Both register and immediate forms are available

• Multiply is done by software using MULSCC or using floating-point instructions

• Multiply is hard to do in one clock but multiply step is not

Inst.   Op.  OPCODE   Meaning
add     10   0S 0000  Add or add and set condition codes
addx    10   0S 1000  Add with carry: set CCs or not
sub     10   0S 0100  Subtract: subtract and set CCs or not
subx    10   0S 1100  Subtract with borrow: set CCs or not
mulscc  10   10 1100  Do one step of multiply




Tbl 3.10 SPARC Logical and Shift Instructions

• All instructions use format 3 with op = 10

• Both register and immediate forms are available

• Condition codes set if S = 1 and undisturbed if S = 0

Inst.  Op.  OPCODE   Meaning
AND    10   0S 0001  AND, set CCs if S=1 or not if S=0
ANDN   10   0S 0101  AND with complemented second operand, set CCs or not
OR     10   0S 0010  OR, set CCs or not
ORN    10   0S 0110  OR with complemented second operand, set CCs or not
XOR    10   0S 0011  XOR, set CCs or not
SLL    10   10 0101  Shift left logical, count in RSRC2 or imm13
SRL    10   10 0110  Shift right logical, count in RSRC2 or imm13
SRA    10   10 0111  Shift right arithmetic, count as above




Tbl 3.11 SPARC Branch and Control Transfer Instructions

Inst.    Format  Op  Op2 or Op3  Meaning
ba       2       00  010         Unconditional branch
bcc      2       00  010         Conditional branch
call     1       01              Call & save PC in R15
jmpl     3       10  11 1000     Jmp to EA, save PC in Rdst
save     3       10  11 1100     New register window, & ADD
restore  3       10  11 1101     Restore reg. window, & ADD

Some condition fields:

Inst.  COND   Inst.  COND   Inst.  COND   Inst.  COND
ba     1000   bne    1001   be     0001   ble    0010
bcc    1101   bcs    0101   bneg   0110   bvc    1111
bvs    0111




Fig 3.12 Example SPARC Assembly Program

        .begin
        .org
prog:   ld    [x], %r1          ! Load a word from M[x] into register %r1.
        ld    [y], %r2          ! Load a word from M[y] into register %r2.
        addcc %r1, %r2, %r3     ! %r3 ← %r1 + %r2 ; set CCs.
        st    %r3, [z]          ! Store sum into M[z].
        jmpl  %r15, +8, %r0     ! Return to caller.
        nop                     ! Branch delay slot.
x:      15                      ! Reserve storage for x, y, and z.
y:      9
z:      0
        .end

Note different syntax for SPARC. Note r15 contains the return address, placed there by the OS in this case.




Fig 3.13 Example of Subroutine Linkage in the SPARC

        .begin
        .org
prog:   ld    [x], %o0          !Pass parameters in
        ld    [y], %o1          ! first 3 output registers.
        call  add3              !Call subroutine to put result in %o0.
        mov   -17, %o2          !Set last parameter in delay slot
        st    %o0, [z]          !Store returned result.
        ...
x:      15
y:      9
z:      0
add3:   save  %sp,-(16*4),%sp   !Get new window and adjust stack pointer.
        add   %i0, %i1, %l0     !Add parameters that now appear in
        add   %l0, %i2, %l0     ! input registers using a local.
        ret                     !Return. Short for jmp %i7+8.
        restore %l0, 0, %o0     !Result moved to caller’s %o0.
        .end




Pipelining of the SPARC Architecture

• Many aspects of the SPARC design are in support of a pipelined implementation

• Simple addressing modes, simple instructions, delayed branches, load-store architecture

• Simplest form of pipelining is fetch-execute overlap—fetching next instruction while executing current instruction

• Pipelining breaks instruction processing into steps

• A step of one instruction overlaps different steps for others

• A new instruction is started (issued) before previously issued instructions are complete

• Instructions guaranteed to complete in order




Fig 3.14 The SPARC MB86900 Pipeline

• 4 pipeline stages are Fetch, Decode, Execute, and Write• Results are written to registers in Write stage

Clock Cycle  1      2      3      4      5      6      7
Instr. 1     Fetch  Dec.   Exec.  Write
Instr. 2            Fetch  Dec.   Exec.  Write
Instr. 3                   Fetch  Dec.   Exec.  Write
Instr. 4                          Fetch  Dec.   Exec.  Write
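The diagram shows 4 instructions finishing in 7 clock cycles. In an ideal pipeline with no stalls, n instructions through k stages take n + k - 1 cycles. A small Python sketch of that timing follows; the names `pipeline_schedule` and `total_cycles` are illustrative assumptions, and hazards are ignored.

```python
STAGES = ("Fetch", "Dec.", "Exec.", "Write")


def pipeline_schedule(n_instr: int, stages=STAGES) -> list:
    """For each instruction, map clock cycle (from 1) to the stage it occupies."""
    return [
        {cycle: stage for cycle, stage in enumerate(stages, start=i + 1)}
        for i in range(n_instr)
    ]


def total_cycles(n_instr: int, n_stages: int = len(STAGES)) -> int:
    """Cycles to finish n_instr instructions in an ideal, stall-free pipeline."""
    return n_instr + n_stages - 1 if n_instr > 0 else 0
```

For large n the cycle count approaches n, which is the sense in which a pipelined machine finishes nearly one instruction per clock.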




Pipeline Hazards

• Will be discussed later, but main issue is:

• Branch or jump change the PC as late as Exec or Write, but next instruction has already been fetched

• One solution is delayed branch

• One (maybe 2) instruction following branch is always executed, regardless of whether branch is taken

• SPARC has a delayed branch with one delay slot, but also allows the delay slot instruction to be annulled (have no effect on the machine state) if the branch is not taken

• Registers to be written by one instruction may be needed by another already in the pipeline, before the update has happened (data hazard)




CISC versus RISC: Recap

• CISCs supply powerful instructions tailored to commonly used operations, stack operations, subroutine linkage, etc.

• RISCs require more instructions to do the same job

• CISC instructions take varying lengths of time

• RISC instructions can all be executed in the same few-cycle pipeline

• RISCs should be able to finish (nearly) one instruction per clock cycle




Key Concepts: RISC versus CISC

• While a RISC machine may possibly have fewer instructions than a CISC, the instructions are always simpler. Multistep arithmetic operations are confined to special units.

• Like all RISCs, the SPARC is a load-store machine. Arithmetic operates only on values in registers.

• A few regular instruction formats and limited addressing modes make instruction decode and operand determination fast.

• Branch delays are quite typical of RISC machines and arise from the way a pipeline processes branch instructions.

• The SPARC does not have a load delay, which some RISCs do, and does have register windows, which many RISCs do not.




Chapter 3 Summary

• Machine price/performance are the driving forces.

• Performance can be measured in many ways: MIPS, execution time, Whetstone, Dhrystone, SPEC benchmarks.

• CISC machines need fewer instructions for a task because each instruction does more.
    • Instruction word length may vary widely
    • Addressing modes encourage memory traffic
    • CISC instructions are hard to map onto modern architectures

• RISC machines usually have
    • One word per instruction
    • Load/store memory access
    • Simple instructions and addressing modes
    • Result in allowing higher clock rates, prefetching, etc.


4-1 Chapter 4—Processor Design


Chapter 4: Processor Design

Topics

4.1 The Design Process
4.2 A 1-Bus Microarchitecture for the SRC
4.3 Data Path Implementation
4.4 Logic Design for the 1-Bus SRC
4.5 The Control Unit
4.6 The 2- and 3-Bus Processor Designs
4.7 The Machine Reset
4.8 Machine Exceptions




Abstract and Concrete Register Transfer Descriptions

• The abstract RTN for SRC in Chapter 2 defines “what,” not “how”

• A concrete RTN uses a specific set of real registers and buses to accomplish the effect of an abstract RTN statement

• Several concrete RTNs could implement the same ISA




A Note on the Design Process

• This chapter presents several SRC designs

• We started in Chapter 2 with an informal description

• In this chapter we will propose several block diagram architectures to support the abstract RTN, then we will:

• Write concrete RTN steps consistent with the architecture

• Keep track of demands made by concrete RTN on the hardware

• Design data path hardware and identify needed control signals

• Design a control unit to generate control signals




Fig 4.1 Block Diagram of 1-Bus SRC

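(Figure: the CPU consists of a control unit (Figure 4.11) and a data path (Figures 4.2, 4.3). Control signals such as Cin, Gra, and Wait go out from the control unit to the data path, and control unit inputs come back from it. The data path holds R0..R31, PC, IR, MA, MD, and the ALU with A and C on one 32-bit bus, and connects over the memory bus to main memory and input/output.)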



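Fig 4.2 High-Level View of the 1-Bus SRC Design

(Figure: a single 32-bit bus connects the 32 general purpose registers R0..R31, PC, IR, MA, MD, and the ALU with its A input register and C output register. The bus feeds the ALU's B input directly. MA and MD connect to the memory subsystem.)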





Constraints Imposed by the Microarchitecture

• One bus connecting most registers allows many different RTs, but only one at a time

• Memory address must be copied into MA by CPU

• Memory data written from or read into MD

• First ALU operand always in A, result goes to C

• Second ALU operand always comes from bus

• Information only goes into IR and MA from bus

• A decoder (not shown) interprets contents of IR

• MA supplies address to memory, not to CPU bus





Abstract and Concrete RTN for SRC add Instruction

Abstract RTN:

(IR ← M[PC]: PC ← PC + 4; instruction_execution);
instruction_execution := ( • • •
    add (:= op= 12) → R[ra] ← R[rb] + R[rc]:

Tbl 4.1 Concrete RTN for the add Instruction

Step  RTN
T0    MA ← PC: C ← PC + 4;
T1    MD ← M[MA]: PC ← C;
T2    IR ← MD;
T3    A ← R[rb];
T4    C ← A + R[rc];
T5    R[ra] ← C;

• Parts of 2 RTs (IR ← M[PC]: PC ← PC + 4;) done in T0

• Single add RT takes 3 concrete RTs (T3, T4, T5)
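The six concrete steps can be walked through in software to confirm that they produce the effect of the abstract RTN. The Python sketch below is an illustration under assumptions: the state is a dictionary with hypothetical keys, memory is a dictionary from byte address to 32-bit word, and the field positions ra = IR⟨26..22⟩, rb = IR⟨21..17⟩, rc = IR⟨16..12⟩ are those of the SRC instruction format.

```python
MASK32 = 0xFFFFFFFF


def run_add(state: dict, memory: dict) -> int:
    """Execute one SRC add instruction step by step; returns the step count."""
    s = state
    # T0: MA ← PC: C ← PC + 4   (both transfers read the old PC)
    s["MA"], s["C"] = s["PC"], (s["PC"] + 4) & MASK32
    # T1: MD ← M[MA]: PC ← C
    s["MD"], s["PC"] = memory[s["MA"]], s["C"]
    # T2: IR ← MD
    s["IR"] = s["MD"]
    ra = (s["IR"] >> 22) & 0x1F
    rb = (s["IR"] >> 17) & 0x1F
    rc = (s["IR"] >> 12) & 0x1F
    # T3: A ← R[rb]
    s["A"] = s["R"][rb]
    # T4: C ← A + R[rc]
    s["C"] = (s["A"] + s["R"][rc]) & MASK32
    # T5: R[ra] ← C
    s["R"][ra] = s["C"]
    return 6
```

Running it on the word for add with ra = 1, rb = 2, rc = 3 (op = 12) leaves R[1] = R[2] + R[3] and PC advanced by 4, after 6 steps.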





Concrete RTN Gives Information About Sub-units

• The ALU must be able to add two 32-bit values

• ALU must also be able to increment B input by 4

• Memory read must use address from MA and return data to MD

• Two RTs separated by : in the concrete RTN, as in T0 and T1, are operations at the same clock

• Steps T0, T1, and T2 constitute instruction fetch, and will be the same for all instructions

• With this implementation, fetch and execute of the add instruction takes 6 clock cycles




Concrete RTN for Arithmetic Instructions: addi

• Differs from add only in step T4

• Establishes requirement for sign extend hardware

Abstract RTN:

addi (:= op= 13) → R[ra] ← R[rb] + c2⟨16..0⟩ {2's complement sign extend}:

Concrete RTN for addi:

Step  RTN
T0.   MA ← PC: C ← PC + 4;                 Instr Fetch
T1.   MD ← M[MA]: PC ← C;
T2.   IR ← MD;
T3.   A ← R[rb];                           Instr Execn.
T4.   C ← A + c2⟨16..0⟩ {sign ext.};
T5.   R[ra] ← C;





Fig 4.3 More Complete View of Registers and Buses in the 1-Bus SRC Design, Including Some Control Signals

• Concrete RTN lets us add detail to the data path
    – Instruction register logic and new paths
    – Condition bit flip-flop
    – Shift count register

Keep this slide in mind as we discuss concrete RTN of instructions.

(Figure: the 1-bus data path with added detail. Register select logic driven by IR fields chooses among R0..R31 (Figure 4.4). IR select logic gates c1⟨31..0⟩ and c2⟨31..0⟩ onto the bus and sends Op to the control unit (Figure 4.5). MA and MD connect to the memory subsystem (Figure 4.6). The ALU works with registers A and C (Figure 4.7). Condition logic takes c3⟨2..0⟩ and the bus and sets the CON flip-flop, and a 5-bit shift count register n is loaded from bus bits ⟨4..0⟩, with decrement and n = 0 detection (Figures 4.8 and 4.9).)



Abstract and Concrete RTN for Load and Store

ld (:= op= 1) → R[ra] ← M[disp]:
st (:= op= 3) → M[disp] ← R[ra]:

where

disp⟨31..0⟩ := ((rb=0) → c2⟨16..0⟩ {sign ext.}:
    (rb≠0) → R[rb] + c2⟨16..0⟩ {sign extend, 2's comp.} ):

Tbl 4.3 The ld and st (load/store register from memory) Instructions

Step   RTN for ld                            RTN for st
T0–T2  Instruction fetch
T3     A ← (rb = 0 → 0: rb ≠ 0 → R[rb]);
T4     C ← A + (16@IR⟨16⟩#IR⟨15..0⟩);
T5     MA ← C;
T6     MD ← M[MA];                           MD ← R[ra];
T7     R[ra] ← MD;                           M[MA] ← MD;




Notes for Load and Store RTN

• Steps T0 through T2 are the same as for add and addi, and for all instructions

• In addition, steps T3 through T5 are the same for ld and st, because they calculate disp

• A way is needed to use 0 for R[rb] when rb = 0

• 15-bit sign extension is needed for IR⟨16..0⟩

• Memory read into MD occurs at T6 of ld

• Write of MD into memory occurs at T7 of st
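Steps T3 and T4 of ld and st compute disp. A short Python sketch of that computation follows; `sign_extend_c2` and `disp` are hypothetical names, the register file is a plain list, and rb is taken from IR⟨21..17⟩. The sign extension replicates IR⟨16⟩ into the 15 upper bits, which is the 15-bit extension noted above.

```python
MASK32 = 0xFFFFFFFF


def sign_extend_c2(ir: int) -> int:
    """c2⟨16..0⟩ = IR⟨16..0⟩ interpreted as a 17-bit 2's complement value."""
    c2 = ir & 0x1FFFF
    return c2 - (1 << 17) if c2 & (1 << 16) else c2


def disp(ir: int, R: list) -> int:
    """Effective address for ld and st, as a 32-bit value."""
    rb = (ir >> 17) & 0x1F
    base = 0 if rb == 0 else R[rb]     # T3: 0 is used in place of R[0] when rb = 0
    return (base + sign_extend_c2(ir)) & MASK32   # T4
```

The rb = 0 case is what the BAout signal provides in hardware: the value 0 is gated onto the bus regardless of the contents of R[0].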




Concrete RTN for Conditional Branch

br (:= op= 8) → (cond → PC ← R[rb]):

cond := (
    c3⟨2..0⟩=0 → 0:               never
    c3⟨2..0⟩=1 → 1:               always
    c3⟨2..0⟩=2 → R[rc]=0:         if register is zero
    c3⟨2..0⟩=3 → R[rc]≠0:         if register is nonzero
    c3⟨2..0⟩=4 → R[rc]⟨31⟩=0:     if positive or zero
    c3⟨2..0⟩=5 → R[rc]⟨31⟩=1 ):   if negative

Tbl 4.4 The Branch Instruction, br

Step   RTN
T0–T2  Instruction fetch
T3     CON ← cond(R[rc]);
T4     CON → PC ← R[rb];




Notes on Conditional Branch RTN

• c3⟨2..0⟩ are just the low-order 3 bits of IR

• cond() is evaluated by a combinational logic circuit having inputs from R[rc] and c3⟨2..0⟩

• The one bit register CON is not accessible to the programmer and only holds the output of the combinational logic for the condition

• If the branch succeeds, the program counter is replaced by the contents of a general register
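The condition logic is small enough to state as a function. The following Python sketch mirrors the cond definition and steps T3 and T4; the names `cond` and `branch` are illustrative, and codes 6 and 7, which the RTN leaves undefined, raise an error here by assumption.

```python
MASK32 = 0xFFFFFFFF


def cond(c3: int, r_rc: int) -> bool:
    """Branch condition from c3⟨2..0⟩ and the 32-bit value of R[rc]."""
    sel = c3 & 0x7
    r_rc &= MASK32
    if sel == 0:
        return False                 # never
    if sel == 1:
        return True                  # always
    if sel == 2:
        return r_rc == 0             # if register is zero
    if sel == 3:
        return r_rc != 0             # if register is nonzero
    if sel == 4:
        return (r_rc >> 31) == 0     # if positive or zero
    if sel == 5:
        return (r_rc >> 31) == 1     # if negative
    raise ValueError("c3 codes 6 and 7 are not defined in the RTN")


def branch(pc: int, r_rb: int, c3: int, r_rc: int) -> int:
    """T3: CON ← cond(R[rc]); T4: CON → PC ← R[rb]. Returns the new PC."""
    con = cond(c3, r_rc)
    return r_rb if con else pc
```

Note that pc here is the already incremented PC from instruction fetch, so an untaken branch simply continues with the next instruction.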




Abstract and Concrete RTN for SRC Shift Right

shr (:= op = 26) → R[ra]⟨31..0⟩ ← (n @ 0) # R[rb]⟨31..n⟩:

n := ( (c3⟨4..0⟩ = 0) → R[rc]⟨4..0⟩:    Shift count in register
       (c3⟨4..0⟩ ≠ 0) → c3⟨4..0⟩ ):      or constant field of instruction

Tbl 4.5 The shr Instruction

Step   Concrete RTN
T0–T2  Instruction fetch
T3     n ← IR⟨4..0⟩;
T4     (n = 0) → (n ← R[rc]⟨4..0⟩);
T5     C ← R[rb];
T6     Shr (:= (n ≠ 0) → (C⟨31..0⟩ ← 0#C⟨31..1⟩: n ← n - 1; Shr) );
T7     R[ra] ← C;

Step T6 is repeated n times




Notes on SRC Shift RTN

• In the abstract RTN, n is defined with :=

• In the concrete RTN, it is a physical register

• n not only holds the shift count but is used as a counter in step T6

• Step T6 is repeated n times as shown by the recursion in the RTN

• The control for such repeated steps will be treated later
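The repeated step T6 corresponds to a loop that shifts by one bit and decrements n until n reaches 0. A Python sketch of steps T3 through T7 follows; the function name `shr` and its return of the step count are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF


def shr(r_rb: int, c3: int, r_rc: int):
    """Model the concrete RTN for shr; returns (result, repetitions of T6)."""
    n = c3 & 0x1F                 # T3: n ← IR⟨4..0⟩
    if n == 0:
        n = r_rc & 0x1F           # T4: (n = 0) → n ← R[rc]⟨4..0⟩
    c = r_rb & MASK32             # T5: C ← R[rb]
    repeats = 0
    while n != 0:                 # T6: repeated while n ≠ 0
        c >>= 1                   # C⟨31..0⟩ ← 0#C⟨31..1⟩
        n -= 1                    # n ← n - 1
        repeats += 1
    return c, repeats             # T7: R[ra] ← C
```

The result agrees with the abstract RTN, (n @ 0) # R[rb]⟨31..n⟩, while the step count shows that execution time depends on the shift count in this 1-bus design.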




Data Path/Control Unit Separation

• Interface between data path and control consists of gate and strobe signals

• A gate selects one of several values to apply to a common point, say a bus

• A strobe changes the values of the flip-flops in a register to match new inputs

• The type of flip-flop used in registers has much influence on control and some on data path

• Latch: simpler hardware, but more complex timing

• Edge triggering: simpler timing, but about twice the hardware




Reminder on Latch- and Edge-Triggered Operation

• Latch output follows input while strobe is high

• Edge-triggering samples input at edge time

(Figure: timing diagrams of D, C, and Q for a latch and for an edge-triggered flip-flop.)




Fig 4.4 The Register File and Its Control Signals

• Rout gates selected register onto bus

• Rin strobes selected register from bus

• BAout differs from Rout by gating 0 when R[0] is selected

BA = Base Address

(Figure: IR holds the fields Op ⟨31..27⟩, ra ⟨26..22⟩, rb ⟨21..17⟩, and rc ⟨16..12⟩. Gra, Grb, and Grc gate one of the 5-bit fields ra, rb, or rc into a 5-to-32 decoder, which selects one of the 32-bit registers R0..R31. Rin, Rout, and BAout control the transfer between the selected register and the bus, b⟨31..0⟩.)




• IR⟨21⟩ is the sign bit of c1 that must be extended

• IR⟨16⟩ is the sign bit of c2 that must be extended

• Sign bits are fanned out from one to several bits and gated to bus
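
The field extraction and sign extension described here can be mimicked in a few lines. This is a sketch under the field boundaries given in the figure (Op in IR⟨31..27⟩, c1 in IR⟨21..0⟩, c2 in IR⟨16..0⟩); the function names are hypothetical.

```python
def op(ir):
    """Opcode field IR<31..27>, sent to the control unit."""
    return (ir >> 27) & 0x1F

def c1(ir):
    """c1<31..0>: IR<21..0> with sign bit IR<21> fanned out to bits 31..22."""
    field = ir & 0x3FFFFF
    sign = (ir >> 21) & 1
    return field | (0xFFC00000 if sign else 0)

def c2(ir):
    """c2<31..0>: IR<16..0> with sign bit IR<16> fanned out to bits 31..17."""
    field = ir & 0x1FFFF
    sign = (ir >> 16) & 1
    return field | (0xFFFE0000 if sign else 0)
```

The masks 0xFFC00000 and 0xFFFE0000 correspond to the 10 and 15 copies of the sign bit that the hardware gates onto the upper bus lines.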

Fig 4.5 Extracting c1, c2, and OP from the Instruction Register, IR<31...0>

[Figure: instruction register IR⟨31..0⟩ strobed by IRin; IR⟨31..27⟩ goes to the control unit as Op; c1out gates IR⟨20..0⟩ with IR⟨21⟩ fanned out to bus bits ⟨31..22⟩; c2out gates IR⟨15..0⟩ with IR⟨16⟩ fanned out to bus bits ⟨31..17⟩; IR⟨26..12⟩ feeds the register select logic]




• MD is loaded from memory or from CPU bus

• MD can drive CPU bus or memory bus

Fig 4.6 The CPU–Memory Interface: Memory Address and Memory Data Registers, MA<31...0> and MD<31...0>

[Figure: MA⟨31..0⟩ is strobed from the CPU bus by MAin and drives addr⟨31..0⟩ to the memory subsystem; MD⟨31..0⟩ is loaded from the CPU bus (MDbus) or the memory bus (MDrd, Strobe) and drives the CPU bus (MDout) or data⟨31..0⟩ on the memory bus (MDwr); memory handshake signals Read, Write, Done]




Fig 4.7 The ALU and Its Associated Registers

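[Figure: 32-bit ALU with inputs A (from register A, strobed by Ain) and B (from the bus), and output C (to register C, strobed by Cin and gated to the bus by Cout); ALU control signals ADD, SUB, AND, NOT, C = B, INC4]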




From Concrete RTN to Control Signals: The Control Sequence

• The register transfers are the concrete RTN
• The control signals that cause the register transfers make up the control sequence
• Wait, asserted in T1, prevents the control from advancing to step T2 until the memory asserts Done

Step  Concrete RTN               Control Sequence
T0    MA ← PC: C ← PC + 4;       PCout, MAin, INC4, Cin
T1    MD ← M[MA]: PC ← C;        Read, Cout, PCin, Wait
T2    IR ← MD;                   MDout, IRin
T3    Instruction_execution

Tbl 4.6 The Instruction Fetch




Control Steps, Control Signals, and Timing

• Within a given time step, the order in which control signals are written is irrelevant

• In step T0, Cin, Inc4, MAin, PCout == PCout, MAin, INC4, Cin

• The only timing distinction within a step is between gates and strobes

• The memory read should be started as early as possible to reduce the wait

• MA must have the right value before being used for the read
• Depending on memory timing, Read could be in T0




Control Sequence for the SRC add Instruction

• Note the use of Gra, Grb, and Grc to gate the correct 5-bit register select code to the registers

• End signals the control to start over at step T0

add (:= op = 12) → R[ra] ← R[rb] + R[rc]:

Step  Concrete RTN               Control Sequence
T0    MA ← PC: C ← PC + 4;       PCout, MAin, INC4, Cin, Read
T1    MD ← M[MA]: PC ← C;        Cout, PCin, Wait
T2    IR ← MD;                   MDout, IRin
T3    A ← R[rb];                 Grb, Rout, Ain
T4    C ← A + R[rc];             Grc, Rout, ADD, Cin
T5    R[ra] ← C;                 Cout, Gra, Rin, End

Tbl 4.7 The add Instruction




Control Sequence for the SRC addi Instruction

• The c2out signal sign extends IR⟨16..0⟩ and gates it to the bus

addi (:= op = 13) → R[ra] ← R[rb] + c2⟨16..0⟩ {2’s comp., sign ext.} :

Step  Concrete RTN                        Control Sequence
T0.   MA ← PC: C ← PC + 4;                PCout, MAin, Inc4, Cin, Read
T1.   MD ← M[MA]: PC ← C;                 Cout, PCin, Wait
T2.   IR ← MD;                            MDout, IRin
T3.   A ← R[rb];                          Grb, Rout, Ain
T4.   C ← A + c2⟨16..0⟩ {sign ext.};      c2out, ADD, Cin
T5.   R[ra] ← C;                          Cout, Gra, Rin, End

Tbl 4.8 The addi Instruction




Control Sequence for the SRC st Instruction

• Note BAout in T3 compared to Rout in T3 of addi

st (:= op = 3) → M[disp] ← R[ra] :
disp⟨31..0⟩ := ((rb=0) → c2⟨16..0⟩ {sign extend} :
               (rb≠0) → R[rb] + c2⟨16..0⟩ {sign extend, 2’s complement} ) :

Step    Concrete RTN                          Control Sequence
T0–T2   Instruction fetch                     Instruction fetch
T3      A ← (rb=0) → 0: rb ≠ 0 → R[rb];       Grb, BAout, Ain
T4      C ← A + c2⟨16..0⟩ {sign-extend};      c2out, ADD, Cin
T5      MA ← C;                               Cout, MAin
T6      MD ← R[ra];                           Gra, Rout, MDin, Write
T7      M[MA] ← MD;                           Wait, End

Tbl 4.9 The st Instruction
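
The effect of BAout in step T3 can be seen in a short sketch of the displacement address calculation. The names `sign_extend`, `st_effective_address`, and `regs` are hypothetical; the logic follows the disp definition above, where register number 0 as a base means "no base" rather than the contents of R[0].

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a 2's complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value

def st_effective_address(rb, c2_field, regs):
    """Address computed in T3-T5 of st: (rb = 0 -> 0 : R[rb]) + sign-extended c2<16..0>."""
    base = 0 if rb == 0 else regs[rb]          # T3: BAout gates 0 when rb = 0
    return (base + sign_extend(c2_field, 17)) & MASK32   # T4: ADD; T5: MA <- C
```

With Rout instead of BAout, the first case would use whatever value R[0] happens to hold, which is why st differs from addi in T3.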




Fig 4.8 The Shift Counter

• The concrete RTN for shr relies upon a 5-bit register to hold the shift count

• It must load, decrement, and have an = 0 test

From Figure 4.3

[Figure: n, a 5-bit down counter (n = Q4..Q0) loaded from bus bits ⟨4..0⟩ by Ld and decremented by Decr; an output signals n = 0 to the control unit]




Tbl 4.10 Control Sequence for the SRC shr Instruction—Looping

• Conditional control signals and repeating a control step are new concepts

Step    Concrete RTN                               Control Sequence
T0–T2   Instruction fetch                          Instruction fetch
T3      n ← IR⟨4..0⟩;                              c1out, Ld
T4      (n=0) → (n ← R[rc]⟨4..0⟩);                 n=0 → (Grc, Rout, Ld)
T5      C ← R[rb];                                 Grb, Rout, C=B, Cin
T6      Shr (:= (n≠0) → (C⟨31..0⟩ ← 0#C⟨31..1⟩:    n≠0 → (Cout, SHR, Cin, Decr, Goto6)
             n ← n-1; Shr) );
T7      R[ra] ← C;                                 Cout, Gra, Rin, End




Branching

cond := ( c3⟨2..0⟩ = 0 → 0:
          c3⟨2..0⟩ = 1 → 1:
          c3⟨2..0⟩ = 2 → R[rc] = 0:
          c3⟨2..0⟩ = 3 → R[rc] ≠ 0:
          c3⟨2..0⟩ = 4 → R[rc]⟨31⟩ = 0:
          c3⟨2..0⟩ = 5 → R[rc]⟨31⟩ = 1 ):

• This is equivalent to the logic expression

cond = (c3⟨2..0⟩ = 1)
     ∨ (c3⟨2..0⟩ = 2) ∧ (R[rc] = 0)
     ∨ (c3⟨2..0⟩ = 3) ∧ ¬(R[rc] = 0)
     ∨ (c3⟨2..0⟩ = 4) ∧ ¬R[rc]⟨31⟩
     ∨ (c3⟨2..0⟩ = 5) ∧ R[rc]⟨31⟩
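
The equivalence of the two forms can be checked by writing both in software. This is a sketch only: the function names are hypothetical, and treating the undefined codes 6 and 7 as "never branch" is an assumption, not something the RTN specifies.

```python
def cond_table(c3, r_rc):
    """cond as the case-by-case RTN definition (codes 6 and 7 assumed false)."""
    code = c3 & 0x7
    r = r_rc & 0xFFFFFFFF
    negative = bool((r >> 31) & 1)      # R[rc]<31>
    if code == 1:
        return True
    if code == 2:
        return r == 0
    if code == 3:
        return r != 0
    if code == 4:
        return not negative
    if code == 5:
        return negative
    return False                        # code 0, and assumed for 6 and 7

def cond_logic(c3, r_rc):
    """cond as the single OR-of-ANDs logic expression."""
    code = c3 & 0x7
    r = r_rc & 0xFFFFFFFF
    zero = r == 0
    sign = bool((r >> 31) & 1)
    return ((code == 1)
            or (code == 2 and zero)
            or (code == 3 and not zero)
            or (code == 4 and not sign)
            or (code == 5 and sign))
```

Looping over all eight codes and a few register values (zero, positive, negative) shows the two functions agree.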




Fig 4.9 Computation of the Conditional Value CON

• NOR gate does = 0 test of R[rc] on bus

From Figure 4.3

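[Figure: c3⟨2..0⟩ = IR⟨2..0⟩ feeds a decoder with outputs 0 to 5; bus bits ⟨31..0⟩ feed a NOR gate (= 0 test) and bit ⟨31⟩ gives the sign; the cond logic combines the = 0, ≠ 0, ≥ 0, and < 0 tests with the decoder outputs, and the result is strobed into the 1-bit CON flip-flop by CONin]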




Tbl 4.11 Control Sequence for SRC Branch Instruction, br

• Condition logic is always connected to CON, so R[rc] only needs to be put on bus in T3

• Only PCin is conditional in T4 since gating R[rb] to bus makes no difference if it is not used

br (:= op = 8) → (cond → PC ← R[rb]):

Step    Concrete RTN              Control Sequence
T0–T2   Instruction fetch         Instruction fetch
T3      CON ← cond(R[rc]);        Grc, Rout, CONin
T4      CON → PC ← R[rb];         Grb, Rout, CON → PCin, End




Summary of the Design Process

Informal description ⇒ formal RTN description ⇒ block diagram architecture ⇒ concrete RTN steps ⇒ hardware design of blocks ⇒ control sequences ⇒ control unit and timing

• At each level, more decisions must be made
  • These decisions refine the design
  • Also place requirements on hardware still to be designed
• The nice one-way process above has circularity
  • Decisions at later stages cause changes in earlier ones
  • Happens less in a text than in reality because
    • Can be fixed on re-reading
    • Confusing to first-time student




Fig 4.10 Clocking the Data Path: Register Transfer Timing

• tR2valid is the period from begin of gate signal till inputs to R2 are valid

• tcomb is delay through combinational logic, such as ALU or cond logic

[Figure: source register R1 (gated by Rout) drives a bus gate, the n-bit bus, and a combinational logic block into destination register R2 (strobed by Rin). The timing diagram of the gate and strobe signals marks gate propagation time tg, bus propagation delay tbp, ALU etc. delay tcomb, latch setup time tsu, minimum pulse width tw, latch hold time th, latch propagation delay tl, tR2valid, and minimum clock period tmin]




Signal Timing on the Data Path

• Several delays occur in getting data from R1 to R2
  • Gate delay through the 3-state bus driver: tg
  • Worst case propagation delay on bus: tbp
  • Delay through any logic, such as ALU: tcomb
  • Set up time for data to affect state of R2: tsu
• Data can be strobed into R2 after this time
  tR2valid = tg + tbp + tcomb + tsu

• Diagram shows strobe signal in the form for a latch. It must be high for a minimum time—tw

• There is a hold time, th, for data after strobe ends




Effect of Signal Timing on MinimumClock Cycle

• A total latch propagation delay is the sum
  tl = tsu + tw + th
  • All above times are specified for latch
  • th may be very small or zero
• The minimum clock period is determined by finding longest path from ff output to ff input
  • This is usually a path through the ALU
  • Conditional signals add a little gate delay
• Using this path, the minimum clock period is
  tmin = tg + tbp + tcomb + tl
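
The timing sums can be worked through numerically. The sketch below simply encodes the formulas for tR2valid, tl, and tmin; the function names are hypothetical and the delay values in the usage note are illustrative numbers, not data for any real circuit.

```python
def t_r2_valid(tg, tbp, tcomb, tsu):
    """Time from the start of the gate signal until R2 may be strobed."""
    return tg + tbp + tcomb + tsu

def latch_delay(tsu, tw, th):
    """Total latch propagation delay tl = tsu + tw + th."""
    return tsu + tw + th

def t_min(tg, tbp, tcomb, tsu, tw, th):
    """Minimum clock period along the path: tmin = tg + tbp + tcomb + tl."""
    return tg + tbp + tcomb + latch_delay(tsu, tw, th)
```

For example, with the made-up values tg = 2, tbp = 3, tcomb = 10, tsu = 1, tw = 2, th = 0 (all in the same time unit), `t_r2_valid(2, 3, 10, 1)` gives 16 and `t_min(2, 3, 10, 1, 2, 0)` gives 18. The difference is the strobe pulse width plus hold time.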




Latches Versus Edge-Triggered or Master-Slave Flip-Flops

• During the high part of a strobe a latch changes its output

• If this output can affect its input, an error can occur
• This can influence even the kind of concrete RTs that can be written for a data path
• If the C register is implemented with latches, then C ← C + MD; is not legal
• If the C register is implemented with master-slave or edge-triggered flip-flops, it is OK




The Control Unit

• The control unit’s job is to generate the control signals in the proper sequence

• Things the control signals depend on
  • The time step Ti
  • The instruction opcode (for steps other than T0, T1, T2)
  • Some few data path signals like CON, n = 0, etc.
  • Some external signals: reset, interrupt, etc. (to be covered)

• The components of the control unit are: a time state generator, instruction decoder, and combinational logic to generate control signals




Fig 4.11 Control Unit Detail with Inputs and Outputs

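[Figure: the clocking logic takes the master clock, Strt, Wait, and Done and produces Enable for the step generator; the step generator is a 4-bit counter (inputs CountIn, Load, Reset) followed by a control step decoder with outputs T0, T1, T2, ... Tn−1; an decoder turns the IR opcode into instruction signals (ld, add, br, shc, ...); the control signal encoder combines step signals, instruction signals, data path signals (CON, n = 0), and interrupts and other external signals into the generated control signals (PCin, PCout, Gra, Rout, ADD, Wait, ...)]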




Synthesizing Control Signal Encoder Logic

Design process:
• Comb through the entire set of control sequences.
• Find all occurrences of each control signal.
• Write an equation describing that signal.
Example: Gra = T5·(add + addi) + T6·st + T7·shr + ...

Step  Control Sequence
T0.   PCout, MAin, Inc4, Cin, Read
T1.   Cout, PCin, Wait
T2.   MDout, IRin

add
Step  Control Sequence
T3.   Grb, Rout, Ain
T4.   Grc, Rout, ADD, Cin
T5.   Cout, Gra, Rin, End

addi
Step  Control Sequence
T3.   Grb, Rout, Ain
T4.   c2out, ADD, Cin
T5.   Cout, Gra, Rin, End

st
Step  Control Sequence
T3.   Grb, BAout, Ain
T4.   c2out, ADD, Cin
T5.   Cout, MAin
T6.   Gra, Rout, MDin, Write
T7.   Wait, End

shr
Step  Control Sequence
T3.   c1out, Ld
T4.   n=0 → (Grc, Rout, Ld)
T5.   Grb, Rout, C=B, Cin
T6.   n≠0 → (Cout, SHR, Cin, Decr, Goto6)
T7.   Cout, Gra, Rin, End




Use of Data Path Conditions in Control Signal Logic

Example: Grc = T4·add + T4·(n=0)·shr + ...

Step  Control Sequence
T0.   PCout, MAin, Inc4, Cin, Read
T1.   Cout, PCin, Wait
T2.   MDout, IRin

add
Step  Control Sequence
T3.   Grb, Rout, Ain
T4.   Grc, Rout, ADD, Cin

addi
Step  Control Sequence
T3.   Grb, Rout, Ain
T4.   c2out, ADD, Cin

st
Step  Control Sequence
T3.   Grb, BAout, Ain
T4.   c2out, ADD, Cin
T5.   Cout, MAin
T6.   Gra, Rout, MDin, Write
T7.   Wait, End

shr
Step  Control Sequence
T3.   c1out, Ld
T4.   n=0 → (Grc, Rout, Ld)
T5.   Grb, Rout, C=B, Cin
T6.   n≠0 → (Cout, SHR, Cin, Decr, Goto6)
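
The "comb through the control sequences" procedure is mechanical enough to sketch in code. The following is an illustration only, restricted to the unconditional sequences of add, addi, and st so that no data path conditions are needed; the dictionary layout and the function names are hypothetical.

```python
from collections import defaultdict

# Execution steps T3 onward, copied from the control sequences for three instructions.
SEQUENCES = {
    "add":  {3: ["Grb", "Rout", "Ain"],
             4: ["Grc", "Rout", "ADD", "Cin"],
             5: ["Cout", "Gra", "Rin", "End"]},
    "addi": {3: ["Grb", "Rout", "Ain"],
             4: ["c2out", "ADD", "Cin"],
             5: ["Cout", "Gra", "Rin", "End"]},
    "st":   {3: ["Grb", "BAout", "Ain"],
             4: ["c2out", "ADD", "Cin"],
             5: ["Cout", "MAin"],
             6: ["Gra", "Rout", "MDin", "Write"],
             7: ["Wait", "End"]},
}

def signal_terms(sequences):
    """Find every (step, instruction) pair in which each control signal occurs."""
    terms = defaultdict(list)
    for instr, steps in sequences.items():
        for step, signals in steps.items():
            for signal in signals:
                terms[signal].append((step, instr))
    return terms

def equation(signal, terms):
    """Write the sum-of-products equation for one control signal."""
    by_step = defaultdict(list)
    for step, instr in sorted(terms[signal]):
        by_step[step].append(instr)
    parts = []
    for step in sorted(by_step):
        instrs = by_step[step]
        rhs = instrs[0] if len(instrs) == 1 else "(" + " + ".join(instrs) + ")"
        parts.append(f"T{step}·{rhs}")
    return f"{signal} = " + " + ".join(parts)
```

Calling `equation("Gra", signal_terms(SEQUENCES))` reproduces the first terms of the Gra example. A conditional entry such as the shr step T4 would add a data path condition as an extra factor in its product term, which this sketch does not model.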





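Fig 4.12 Generation of the logic for PCin and Gra

[Figure: AND-OR gate networks that form control signals from time-step signals (T1, T5, T7, ...) and decoded instruction signals (add, addi, ld, ...)]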




Fig 4.13 Branching in the Control Unit

• 3-state gates allow 6 to be applied to counter input

• Reset will synchronously reset counter to step T0

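[Figure: step generator with Mck and Enable inputs; the 4-bit counter has CountIn, Load, and Reset inputs and feeds the control step decoder; the Goto6 signal enables 3-state gates that place 0110 on the counter input and asserts Load]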




Fig 4.14 The Clocking Logic: Start, Stop, and Memory Synchronization

• Mck is master clock oscillator

[Figure: JK and D flip-flops combine Strt (E), Stop (C), Wait (C), Done (E), Read (C), Write (C), and Mck (I) to produce Run (G), SDone (G), Enable (G), and the R (G) and W (G) signals to the memory system. Legend: E = External, G = Generated, C = Control signal, I = Internal]




The Complete 1-Bus Design of SRC

• High-level architecture block diagram
• Concrete RTN steps
• Hardware design of registers and data path logic
• Revision of concrete RTN steps where needed
• Control sequences
• Register clocking decisions
• Logic equations for control signals
• Time step generator design
• Clock run, stop, and synchronization logic




Other Architectural Designs Will Require a Different RTN

• More data paths allow more things to be done in one step

• Consider a two bus design
• By separating input and output of ALU on different buses, the C register is eliminated
• Steps can be saved by strobing ALU results directly into their destinations




Fig 4.15 The 2-Bus SRC Microarchitecture

• Bus A carries data going into registers

• Bus B carries data being gated out of registers

• ALU function C = B is used for all simple register transfers

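[Figure: registers R0 to R31, IR, PC, MA, MD, and A are loaded from the 32-bit A bus ("In bus") and gated onto the 32-bit B bus ("Out bus"); the ALU takes inputs A and B and drives its output C onto the A bus; MA and MD connect to the memory bus]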




Tbl 4.13 The 2-Bus add Instruction

• Note the appearance of Grc to gate the output of the register rc onto the B bus and Sra to select ra to receive data strobed from the A bus

• Two register select decoders will be needed
• Transparent latches will be required at step T2

Step  Concrete RTN                  Control Sequence
T0    MA ← PC;                      PCout, C = B, MAin, Read
T1    PC ← PC + 4: MD ← M[MA];      PCout, INC4, PCin, Wait
T2    IR ← MD;                      MDout, C = B, IRin
T3    A ← R[rb];                    Grb, Rout, C = B, Ain
T4    R[ra] ← A + R[rc];            Grc, Rout, ADD, Sra, Rin, End




Performance and Design

\%\,Speedup = \frac{T_{1\text{-bus}} - T_{2\text{-bus}}}{T_{2\text{-bus}}} \times 100

Where

T = \text{Execution Time} = IC \times CPI \times \tau




Speedup By Going to 2 Buses

• Assume for now that IC and τ don’t change in going from 1 bus to 2 buses
• Naively assume that CPI goes from 8 to 7 clocks

\%\,Speedup = \frac{T_{1\text{-bus}} - T_{2\text{-bus}}}{T_{2\text{-bus}}} \times 100 = \frac{IC \times 8 \times \tau - IC \times 7 \times \tau}{IC \times 7 \times \tau} \times 100 = \frac{8 - 7}{7} \times 100 = 14\%

Class Problem: How will this speedup change if clock period of 2-bus machine is increased by 10%?




3-Bus Architecture Shortens Sequences Even More

• A 3-bus architecture allows both operand inputs and the output of the ALU to be connected to buses

• Both the C output register and the A input register are eliminated

• Careful connection of register inputs and outputs can allow multiple RTs in a step




Fig 4.16 The 3-Bus SRC Design

• A-bus is ALU operand 1, B-bus is ALU operand 2, and C-bus is ALU output

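• Note MA input connected to the B-bus

[Figure: registers R0 to R31, IR, PC, MA, and MD connect to three 32-bit buses; the A bus and B bus feed ALU inputs A and B, and the ALU output C drives the C bus; MA and MD connect to the memory bus]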




Tbl 4.15 The 3-Bus add Instruction

• Note the use of 3 register selection signals in step T2: GArc, GBrb, and Sra

• In step T0, PC moves to MA over bus B and goes through the ALU INC4 operation to reach PC again by way of bus C

• PC must be edge-triggered or master-slave

• Once more MA must be a transparent latch

Step  Concrete RTN                 Control Sequence
T0    MA ← PC: MD ← M[MA];         PCout, MAin, INC4, PCin,
      PC ← PC + 4:                 Read, Wait
T1    IR ← MD;                     MDout, C = B, IRin
T2    R[ra] ← R[rb] + R[rc];       GArc, RAout, GBrb, RBout, ADD, Sra, Rin, End




Performance and Design

• How does going to three buses affect performance?
• Assume average CPI goes from 8 to 4, while τ increases by 10%:

\%\,Speedup = \frac{IC \times 8 \times \tau - IC \times 4 \times 1.1\tau}{IC \times 4 \times 1.1\tau} \times 100 = \frac{8 - 4.4}{4.4} \times 100 = 82\%
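
The speedup arithmetic can be reproduced with two small helper functions. This is a sketch with hypothetical names; IC is set to 1 and τ to 1.0 in the usage note because both cancel out of the ratio under the stated assumptions.

```python
def exec_time(ic, cpi, tau):
    """T = IC x CPI x tau."""
    return ic * cpi * tau

def percent_speedup(t_old, t_new):
    """%Speedup = (T_old - T_new) / T_new x 100."""
    return (t_old - t_new) / t_new * 100
```

With `t1 = exec_time(1, 8, 1.0)`, the 2-bus case `percent_speedup(t1, exec_time(1, 7, 1.0))` evaluates to about 14.3, and the 3-bus case `percent_speedup(t1, exec_time(1, 4, 1.1))` to about 81.8, matching the rounded 14% and 82% figures.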




Processor Reset Function

• Reset sets program counter to a fixed value
  • May be a hardwired value, or
  • contents of a memory cell whose address is hardwired
• The control step counter is reset
• Pending exceptions are prevented, so initialization code is not interrupted
• It may set condition codes (if any) to known state
• It may clear some processor state registers
• A “soft” reset makes minimal changes: PC, T (trace)
• A “hard” reset initializes more processor state




SRC Reset Capability

• We specify both a hard and soft reset for SRC
• The Strt signal will do a hard reset
  • It is effective only when machine is stopped
  • It resets the PC to zero
  • It resets all 32 general registers to zero
• The Soft Reset signal is effective when the machine is running
  • It sets PC to zero
  • It restarts instruction fetch
  • It clears the Reset signal

• Actions are described in instruction_interpretation




Abstract RTN for SRC Reset and Start

Processor State
Strt:  Start signal
Rst:   External reset signal

instruction_interpretation := (
  ¬Run∧Strt → (Run ← 1: PC, R[0..31] ← 0);
  Run∧¬Rst → (IR ← M[PC]: PC ← PC + 4; instruction_execution):
  Run∧Rst → (Rst ← 0: PC ← 0); instruction_interpretation):




Resetting in the Middle of Instruction Execution

• The abstract RTN implies that reset takes effect after the current instruction is done

• To describe reset during an instruction, we must go from abstract to concrete RTN

• Questions for discussion:
  • Why might we want to reset in the middle of an instruction?
  • How would we reset in the middle of an instruction?




Tbl 4.17 The add Instruction with Reset Processing

Step  Concrete RTN
T0    ¬Rst → (MA ← PC: C ← PC + 4):   Rst → (Rst ← 0: PC ← 0: T ← 0):
T1    ¬Rst → (MD ← M[MA]: PC ← C):    Rst → (Rst ← 0: PC ← 0: T ← 0):
T2    ¬Rst → (IR ← MD):               Rst → (Rst ← 0: PC ← 0: T ← 0):
T3    ¬Rst → (A ← R[rb]):             Rst → (Rst ← 0: PC ← 0: T ← 0):
T4    ¬Rst → (C ← A + R[rc]):         Rst → (Rst ← 0: PC ← 0: T ← 0):
T5    ¬Rst → (R[ra] ← C):             Rst → (Rst ← 0: PC ← 0: T ← 0):

• See text for the corresponding control signals




Control Sequences Including the Reset Function

• ClrPC clears the program counter to all zeros, and ClrR clears the 1-bit Reset flip-flop

• Because the same reset actions are in every step of every instruction, their control signals are independent of time step or opcode

Step  Control Sequence
T0    ¬Reset → (PCout, MAin, Inc4, Cin, Read): Reset → (ClrPC, ClrR, Goto0):
T1    ¬Reset → (Cout, PCin, Wait): Reset → (ClrPC, ClrR, Goto0):
      • • •




General Comments on Exceptions

• An exception is an event that causes a change in the program specified flow of control

• Because normal program execution is interrupted, they are often called interrupts

• We will use exception for the general term and use interrupt for an exception caused by an external event, such as an I/O device condition

• The usage is not standard. Other books use these words with other distinctions, or none




Combined Hardware/Software Response to an Exception

• The system must control the type of exceptions it will process at any given time

• The state of the running program is saved when an allowed exception occurs

• Control is transferred to the correct software routine, or “handler,” for this exception

• This exception, and others of less or equal importance, are disallowed during the handler

• The state of the interrupted program is restored at the end of execution of the handler




Hardware Required to Support Exceptions

• To determine relative importance, a priority number is associated with every exception

• Hardware must save and change the PC, since without it no program execution is possible

• Hardware must disable the current exception lest it interrupt the handler before it can start

• Address of the handler is called the exception vector and is a hardware function of the exception type

• Exceptions must access a save area for PC and other hardware saved items

• Choices are special registers or a hardware stack




New Instructions Needed to Support Exceptions

• An instruction executed at the end of the handler must reverse the state changes done by hardware when the exception occurred

• There must be instructions to control what exceptions are allowed

• The simplest of these enable or disable all exceptions

• If processor state is stored in special registers on an exception, instructions are needed to save and restore these registers




Kinds of Exceptions

• System reset
• Exceptions associated with memory access
  • Machine check exceptions
  • Data access exceptions
  • Instruction access exceptions
  • Alignment exceptions
• Program exceptions
• Miscellaneous hardware exceptions
• Trace and debugging exceptions
• Nonmaskable exceptions
• External exceptions (interrupts)




An Interrupt Facility for SRC

• The exception mechanism for SRC handles external interrupts

• There are no priorities, but only a simple enable and disable mechanism

• The PC and information about the source of the interrupt are stored in special registers

• Any other state saving is done by software

• The interrupt source supplies 8 bits that are used to generate the interrupt vector

• It also supplies a 16-bit code carrying information about the cause of the interrupt




SRC Processor State Associated with Interrupts

Processor interrupt mechanism
ireq:               Interrupt request signal                  (from device)
iack:               Interrupt acknowledge signal              (to device)
IE:                 1-bit interrupt enable flag               (internal to CPU)
IPC⟨31..0⟩:         Storage for PC saved upon interrupt       (internal to CPU)
II⟨31..0⟩:          Information on source of last interrupt   (internal to CPU)
Isrc_info⟨15..0⟩:   Information from interrupt source         (from device)
Isrc_vect⟨7..0⟩:    Type code from interrupt source           (from device)
Ivect⟨31..0⟩ := 20@0#Isrc_vect⟨7..0⟩#4@0:                     (internal)

Ivect⟨31..0⟩ layout: bits 31..12 = 0, bits 11..4 = Isrc_vect⟨7..0⟩, bits 3..0 = 0
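
The vector formation amounts to placing the 8-bit type code in bits 11..4 of an otherwise zero word. A one-function sketch (the name `ivect` is hypothetical):

```python
def ivect(isrc_vect):
    """Ivect<31..0> := 20@0 # Isrc_vect<7..0> # 4@0."""
    return (isrc_vect & 0xFF) << 4
```

Each type code therefore selects a handler entry point 16 bytes after the previous one, and all vectors fall in the lowest 4096 bytes of memory.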




SRC Instruction Interpretation Modified for Interrupts

instruction_interpretation :=
  (¬Run∧Strt → Run ← 1:
   Run∧¬(ireq∧IE) → (IR ← M[PC]: PC ← PC + 4; instruction_execution):
   Run∧(ireq∧IE) → (IPC ← PC⟨31..0⟩: II⟨15..0⟩ ← Isrc_info⟨15..0⟩: iack ← 1:
                    IE ← 0: PC ← Ivect⟨31..0⟩; iack ← 0); instruction_interpretation);

• If interrupts are enabled, PC and interrupt information are stored in IPC and II, respectively

• With multiple requests, external priority circuit (discussed in later chapter) determines which vector and information are returned

• Interrupts are disabled
• The acknowledge signal is pulsed




SRC Instructions to Support Interrupts

Return from interrupt instruction
rfi (:= op = 29) → (PC ← IPC: IE ← 1):

Save and restore interrupt state
svi (:= op = 16) → (R[ra]⟨15..0⟩ ← II⟨15..0⟩: R[rb] ← IPC⟨31..0⟩):
ri (:= op = 17) → (II⟨15..0⟩ ← R[ra]⟨15..0⟩: IPC⟨31..0⟩ ← R[rb]):

Enable and disable interrupt system
een (:= op = 10) → (IE ← 1):
edi (:= op = 11) → (IE ← 0):

• The 2 rfi actions are indivisible, can’t een and branch




Concrete RTN for SRC Instruction Fetch with Interrupts

• PC could be transferred to IPC over the bus
• II and IPC probably have separate inputs for the externally supplied values
• iack is pulsed, described as ←1; ←0, which is easier as a control signal than in RTN

Step  Concrete RTN
T0    ¬(ireq∧IE) → (MA ← PC: C ← PC + 4):
      (ireq∧IE) → (IPC ← PC: II ← Isrc_info: IE ← 0:
                   PC ← 20@0#Isrc_vect⟨7..0⟩#4@0: iack ← 1; iack ← 0: End);
T1    MD ← M[MA]: PC ← C;
T2    IR ← MD;




Exceptions During Instruction Execution

• Some exceptions occur in the middle of instructions
  • Some CISCs have very long instructions, like string move
  • Some exception conditions prevent instruction completion, like uninstalled memory

• To handle this sort of exception, the CPU must make special provision for restarting

• Partially completed actions must be reversed so the instruction can be re-executed after exception handling

• Information about the internal CPU state must be saved so that the instruction can resume where it left off

• We will see that this problem is acute with pipeline designs—always in middle of instructions




Recap of the Design Process: the Main Topic of Chapter 4

Informal description

Formal RTN description

Block diagram architecture

Concrete RTN steps

Hardware design of blocks

Control sequences

Control unit and timing

Chapter 2

Chapter 4

SRC




Chapter 4 Summary

• Chapter 4 has done a nonpipelined data path and a hardwired controller design for SRC

• The concepts of data path block diagrams, concrete RTN, control sequences, control logic equations, step counter control, and clocking have been introduced

• The effect of different data path architectures on the concrete RTN was briefly explored

• We have begun to make simple, quantitative estimates of the impact of hardware design on performance

• Hard and soft resets were designed
• A simple exception mechanism was supplied for SRC


5-1 Chapter 5—Processor Design—Advanced Topics


Chapter 5: Processor Design—Advanced Topics

Topics

5.1 Pipelining
  • A pipelined design of SRC
  • Pipeline hazards

5.2 Instruction-Level Parallelism
  • Superscalar processors
  • Very Long Instruction Word (VLIW) machines

5.3 Microprogramming
  • Control store and microbranching
  • Horizontal and vertical microprogramming




Fig 5.1 Executing Machine Instructions versus Manufacturing Small Parts

[Figure: (a) without pipelining/assembly line, a single instruction (add r4, r3, r2) or a single part (end plate) occupies the whole sequence of steps; (b) with pipelining/assembly line, five instructions (shr r3, r3, 2; sub r2, r5, 1; add r4, r3, r2; st r4, addr1; ld r2, addr2) or five parts (bottom, top, end, cover, and center plates) occupy the five steps at once. Instruction steps: fetch instruction, fetch operands, ALU operation, memory access, register write. Manufacturing steps: select part, drill part, cut part, polish part, package part]




The Pipeline Stages

• 5 pipeline stages are shown
  1. Fetch instruction
  2. Fetch operands
  3. ALU operation
  4. Memory access
  5. Register write
• 5 instructions are executing
  • shr r3, r3, #2   ;Storing result into r3
  • sub r2, r5, #1   ;Idle, no memory access needed
  • add r4, r3, r2   ;Performing addition in ALU
  • st r4, addr1     ;Accessing r4 and addr1
  • ld r2, addr2     ;Fetching instruction




Notes on Pipelining Instruction Processing

• Pipeline stages are shown top to bottom in order traversed by one instruction

• Instructions listed in order they are fetched
• Order of instructions in pipeline is reverse of listed
• If each stage takes 1 clock:
  • every instruction takes 5 clocks to complete
  • some instruction completes every clock tick

• Two performance issues: instruction latency and instruction bandwidth
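
The difference between latency and bandwidth shows up in a simple clock count. The sketch below assumes an ideal pipeline with one clock per stage and no stalls, which is a simplification; the function names are hypothetical.

```python
def pipelined_clocks(n_instructions, stages=5):
    """Clocks to finish n instructions: the first takes `stages`, each later one adds 1."""
    if n_instructions <= 0:
        return 0
    return stages + n_instructions - 1

def unpipelined_clocks(n_instructions, stages=5):
    """Clocks when each instruction must finish all steps before the next starts."""
    return stages * max(n_instructions, 0)
```

For the five instructions in the figure, `pipelined_clocks(5)` is 9 against `unpipelined_clocks(5)` of 25. Each instruction still has a latency of 5 clocks, but for long sequences the completion rate approaches one instruction per clock.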




Dependence Among Instructions

• Execution of some instructions can depend on the completion of others in the pipeline

• One solution is to “stall” the pipeline
  • early stages stop while later ones complete processing

• Dependences involving registers can be detected and data “forwarded” to instruction needing it, without waiting for register write

• Dependence involving memory is harder and is sometimes addressed by restricting the way the instruction set is used

  • “Branch delay slot” is example of such a restriction
  • “Load delay” is another example




Branch and Load Delay Examples

Branch Delay

brz r2, r3
add r6, r7, r8     This instruction always executed
st r6, addr1       Only done if r2 ≠ 0

Load Delay

ld r2, addr
add r5, r1, r2     This instruction gets “old” value of r2
shr r1, r1, #4
sub r6, r8, r2     This instruction gets r2 value loaded from addr

• Working of instructions is not changed, but way they work together is




Characteristics of Pipelined Processor Design

• Main memory must operate in one cycle
  • This can be accomplished by expensive memory, but
  • It is usually done with cache, to be discussed in Chap. 7
• Instruction and data memory must appear separate
  • Harvard architecture has separate instruction and data memories
  • Again, this is usually done with separate caches
• Few buses are used
  • Most connections are point to point
  • Some few-way multiplexers are used

• Data is latched (stored in temporary registers) at each pipeline stage—called “pipeline registers”

• ALU operations take only 1 clock (esp. shift)




Adapting Instructions to Pipelined Execution

• All instructions must fit into a common pipeline stage structure

• We use a 5-stage pipeline for the SRC
  (1) Instruction fetch
  (2) Decode and operand access
  (3) ALU operations
  (4) Data memory access
  (5) Register write
• We must fit load/store, ALU, and branch instructions into this pattern




Fig 5.2 ALU Instructions

• Instructions fit into 5 stages

• Second ALU operand comes either from a register or instruction register c2 field

• Opcode must be available in stage 3 to tell ALU what to do

• Result register, ra, is written in stage 5

• No memory operation

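[Figure: data path for ALU operations including shifts across the five stages (1. instruction fetch, 2. decode and operand read, 3. ALU operation, 4. memory access, 5. ra write): instruction memory, PC, Inc4, IR2, decode, register file ports R[rb], R[rc], R[ra], multiplexer Mp4 selecting R[rc] or c2⟨4..0⟩, pipeline registers X3, Y3, Z4, the ALU, op and ra passed down the pipe, and regwrite]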




Logic Expressions Defining Pipeline Stage Activity

branch := br ∨ brl :
cond := (IR2⟨2..0⟩ = 1) ∨ ((IR2⟨2..1⟩ = 1) ∧ (IR2⟨0⟩ ⊕ R[rc] = 0)) ∨ ((IR2⟨2..1⟩ = 2) ∧ (IR2⟨0⟩ ⊕ R[rc]⟨31⟩)) :
sh := shr ∨ shra ∨ shl ∨ shc :
alu := add ∨ addi ∨ sub ∨ neg ∨ and ∨ andi ∨ or ∨ ori ∨ not ∨ sh :
imm := addi ∨ andi ∨ ori ∨ (sh ∧ (IR2⟨4..0⟩ ≠ 0) ) :
load := ld ∨ ldr :
ladr := la ∨ lar :
store := st ∨ str :
l-s := load ∨ ladr ∨ store :
regwrite := load ∨ ladr ∨ brl ∨ alu :     Instructions that write to the register file
dsp := ld ∨ st ∨ la :                     Instructions that use disp addressing
rl := ldr ∨ str ∨ lar :                   Instructions that use rel addressing




Notes on the Equations and Different Stages

• The logic equations are based on the instruction in the stage where they are used

• When necessary, we append a digit to a logic signal name to specify it is computed from values in that stage

• Thus regwrite5 is true when the opcode in stage 5 is load5 ∨ ladr5 ∨ brl5 ∨ alu5, all of which are determined from op5




Fig 5.4 The Memory Access Instructions: ld, ldr, st, and str

• ALU computes effective addresses

• Stage 4 does read or write

• Result register written only on load

[Figure: two pipeline data paths, one for ld, ldr, la, and lar and one for st and str. Each shows the five stages (1. instruction fetch, 2. decode and operand read, 3. ALU operation, 4. memory access, 5. ra write) with instruction memory, PC, Inc4, PC2, IR2, decode, multiplexers Mp3 and Mp4 selecting the ALU inputs X3 and Y3 (c1 or c2 for Y3), the ALU performing add, Z4, and data memory. The load path continues through Mp5 and Z5 to regwrite; the store path carries the data in MD3]




Fig 5.5 The Branch Instructions

• The new program counter value is known in stage 2—but not in stage 1

• Only branch and link does a register write in stage 5

• There is no ALU or memory operation

[Figure: data path for br and brl across the five stages: instruction memory, PC, Inc4, PC2, IR2, branch logic driven by c2⟨2..0⟩ producing cond, and multiplexer Mp1 selecting the next PC; for brl only, op and ra are carried down the pipe so that ra is written in stage 5]




Fig 5.6 The SRC Pipeline Registers and RTN Specification

• The pipeline registers pass information from stage to stage

• RTN specifies output register values in terms of input register values for stage

• Discuss RTN at each stage on blackboard

Pipeline registers: IR2, PC2 (into stage 2); IR3, X3, Y3, MD3 (into stage 3); IR4, Z4, MD4 (into stage 4); IR5, Z5 (into stage 5)

1. Instruction fetch
   IR2 ← M[PC] :
   PC2 ← PC + 4 ;

2. Decode and operand read
   X3 ← l-s2 → (rel2 → PC2 : disp2 → R[rb]) : brl2 → PC2 : alu2 → R[rb] :
   Y3 ← l-s2 → (rel2 → c1 : disp2 → c2) : alu2 → (imm2 → c2 : ¬imm2 → R[rc]) :
   MD3 ← store2 → R[ra] : IR3 ← IR2 : stop2 → Run ← 0 :
   PC ← ¬branch2 → PC + 4 : branch2 → (cond(IR2, R[rc]) → R[rb] :
                                        ¬cond(IR2, R[rc]) → PC + 4) ;

3. ALU operation
   Z4 ← (l-s3 → X3 + Y3 : brl3 → X3 : alu3 → X3 op Y3) :
   MD4 ← MD3 : IR4 ← IR3 ;

4. Memory access
   Z5 ← (load4 → M[Z4] : ladr4 ∨ branch4 ∨ alu4 → Z4) :
   store4 → (M[Z4] ← MD4) : IR5 ← IR4 ;

5. Register write
   regwrite5 → (R[ra] ← Z5) ;




Global State of the Pipelined SRC

• PC, the general registers, instruction memory, and data memory represent the global machine state

• PC is accessed in stage 1 (and stage 2 on branch)
• Instruction memory is accessed in stage 1
• General registers are read in stage 2 and written in stage 5
• Data memory is only accessed in stage 4




Restrictions on Access to Global State by Pipeline

• We see why separate instruction and data memories (or caches) are needed

• When a load or store accesses data memory in stage 4, stage 1 is accessing an instruction

• Thus two memory accesses occur simultaneously

• Two operands may be needed from registers in stage 2 while another instruction is writing a result register in stage 5

• Thus as far as the registers are concerned, 2 reads and a write happen simultaneously

• Increment of PC in stage 1 must be overridden by a successful branch in stage 2




Fig 5.7 The Pipeline Data Path with Selected Control Signals

• Most control signals shown and given values

• Multiplexer control is stressed in this figure

[Figure: the complete five-stage data path (1. instruction fetch, 2. decode and operand read, 3. ALU operation, 4. memory access, 5. register write) with instruction memory, PC, Inc4, PC2, IR2 through IR5, register file ports a1/R1, a2/R2, a3/R3 with control signals G1, GA1, G2, W3, branch logic, X3, Y3, MD3, ALU, Z4, MD4, data memory, Z5, and multiplexers Mp1 to Mp5]

Multiplexer control:
Mp1 ← (¬(branch2 ∧ cond) → Inc4) : ((branch2 ∧ cond) → PC2) :
Mp2 ← (¬store → rc) : (store → ra) :
Mp3 ← (rl ∨ branch → PC2) : (dsp ∨ alu → R1) :
Mp4 ← (rl → c1) : (dsp ∨ imm → c2) : (alu ∧ ¬imm → R2) :
Mp5 ← (¬load → Z4) : (load → mem data) :

Register write (W3) is asserted when the stage 5 opcode decodes as load ∨ ladr ∨ brl ∨ alu




Example of Propagation of Instructions Through Pipe

• It is assumed that R[11] contains 512 when the brl instruction is executed

• R[6] = 4 and R[8] = 5 are the add operands

• R[5] = 16 for the ld and R[12] = 23 for the str

100: add r4, r6, r8    ; R[4] ← R[6] + R[8]
104: ld r7, 128(r5)    ; R[7] ← M[R[5]+128]
108: brl r9, r11, 001  ; PC ← R[11]: R[9] ← PC
112: str r12, 32       ; M[PC+32] ← R[12]
 . . .
512: sub ...           ; next instr.
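The propagation of this example through the pipe can be sketched in a few lines of Python. This is only an illustrative model, not the SRC design itself: it tracks instruction addresses (not register values), and the names `PROGRAM`, `BRANCH_TARGETS`, and `trace` are hypothetical. It assumes, as stated above, that the brl at 108 succeeds with R[11] = 512 and that the branch overrides the PC increment while it is in stage 2.

```python
# Illustrative sketch: addresses of the instructions occupying stages 1..5
# in each clock cycle of the example program.
PROGRAM = {
    100: "add r4, r6, r8",
    104: "ld r7, 128(r5)",
    108: "brl r9, r11, 001",
    112: "str r12, 32",
    512: "sub ...",
}
BRANCH_TARGETS = {108: 512}  # assumption: brl at 108 succeeds, R[11] = 512


def trace(cycles, start_pc=100):
    """Return, per clock cycle, the addresses held in stages 1..5."""
    pc = start_pc
    stages = [None] * 5
    rows = []
    for _ in range(cycles):
        # Every instruction advances one stage; stage 1 fetches at PC.
        stages = [pc] + stages[:4]
        rows.append(list(stages))
        # PC is normally incremented, but a successful branch that is
        # now in stage 2 overrides the increment.
        pc += 4
        if stages[1] in BRANCH_TARGETS:
            pc = BRANCH_TARGETS[stages[1]]
    return rows


for cycle, row in enumerate(trace(5), start=1):
    print(cycle, [PROGRAM.get(a, "-") if a is not None else "-" for a in row])
```

In cycle 4 the str at 112 is fetched while brl sits in stage 2, so the instruction after the branch still enters the pipeline; in cycle 5 the sub at 512 is fetched while add reaches stage 5.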




Fig 5.8 First Clock Cycle: add Enters Stage 1 of Pipeline

• Program counter is incremented to 104

[Figure: pipeline data path. Stage 1 fetches 100: add r4, r6, r8 from instruction memory; PC goes from 100 to 104.]




Fig 5.9 Second Clock Cycle: add Enters Stage 2, While ld is Being Fetched at Stage 1

• add operands are fetched in stage 2

[Figure: pipeline data path. Stage 1 fetches 104: ld r7, r5, 128 and PC goes from 104 to 108; PC2 = 104. Stage 2 holds add r4, r6, r8 and reads r6 = 4 and r8 = 5 from the register file.]




Fig 5.10 Third Clock Cycle: brl Enters the Pipeline

• add performs its arithmetic in stage 3

[Figure: pipeline data path. Stage 1 fetches 108: brl r9, r11, 001 and PC goes from 108 to 112; PC2 = 108. Stage 2 holds ld r7, r5, 128 and reads r5 = 16 with constant 128. Stage 3 holds add r4 with X3 = 4 and Y3 = 5, producing 9.]




Fig 5.11 Fourth Clock Cycle: str Enters the Pipeline

• add is idle in stage 4

• Success of brl changes program counter to 512

[Figure: pipeline data path. Stage 1 fetches 112: str r12, 32; PC2 = 112. Stage 2 holds brl r9, r11, 001 with c2⟨2..0⟩ = 001 and reads r11 = 512, which replaces the incremented value 116 as the next PC. Stage 3 holds ld r7 with X3 = 16 and Y3 = 128, producing 144. Stage 4 holds add r4 with Z4 = 9.]




Fig 5.12 Fifth Clock Cycle: add Completes, sub Enters the Pipeline

• add completes in stage 5

• sub is fetched from location 512 after successful brl

[Figure: pipeline data path. Stage 1 fetches 512: sub ... and PC goes from 512 to 516; PC2 = 116. Stage 2 holds str r12, 32 and reads r12 = 23 with constant 32. Stage 3 holds brl r9 with X3 = 112, passed through the ALU as Z = X. Stage 4 holds ld r7 with Z4 = 144. Stage 5 holds add r4 and writes Z5 = 9 into r4.]




Functions of the Pipeline Registers in SRC

• Registers between stages 1 and 2:
  • IR2 holds full instruction including any register fields and constant
  • PC2 holds the incremented PC from instruction fetch

• Registers between stages 2 and 3:
  • IR3 holds opcode and ra (needed in stage 5)
  • X3 holds PC or a register value (for link or 1st ALU operand)
  • Y3 holds c1 or c2 or a register value as 2nd ALU operand
  • MD3 is used for a register value to be stored in memory




Functions of the Pipeline Registers in SRC (cont’d)

• Registers between stages 3 and 4:
  • IR4 has opcode and ra
  • Z4 has memory address or result register value
  • MD4 has value to be stored in data memory

• Registers between stages 4 and 5:
  • IR5 has opcode and destination register number, ra
  • Z5 has value to be stored in destination register: from ALU result, PC link value, or fetched data




Functions of the SRC Pipeline Stages

• Stage 1: fetches instruction
  • PC incremented or replaced by successful branch in stage 2

• Stage 2: decodes instruction and gets operands
  • Load or store gets operands for address computation
  • Store gets register value to be stored as 3rd operand
  • ALU operation gets 2 registers or register and constant

• Stage 3: performs ALU operation
  • Calculates effective address or does arithmetic/logic
  • May pass through link PC or value to be stored in memory




Functions of the SRC Pipeline Stages (cont’d)

• Stage 4: accesses data memory
  • Passes Z4 to Z5 unchanged for nonmemory instructions
  • Load fills Z5 from memory
  • Store uses address from Z4 and data from MD4 (no longer needed)

• Stage 5: writes result register
  • Z5 contains value to be written, which can be ALU result, effective address, PC link value, or fetched data
  • ra field always specifies result register in SRC




Dependence Between Instructions in Pipe: Hazards

• Instructions that occupy the pipeline together are being executed in parallel

• This leads to the problem of instruction dependence, well known in parallel processing

• The basic problem is that an instruction depends on the result of a previously issued instruction that is not yet complete

• Two categories of hazards
  • Data hazards: incorrect use of old and new data
  • Branch hazards: fetch of wrong instruction on a change in PC




Classification of Data Hazards

• A read after write hazard (RAW) arises from a flow dependence, where an instruction uses data produced by a previous one

• A write after read hazard (WAR) comes from an anti-dependence, where an instruction writes a new value over one that is still needed by a previous instruction

• A write after write hazard (WAW) comes from an output dependence, where two parallel instructions write the same register and must do it in the order in which they were issued
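The three classes can be illustrated by comparing the register sets of an instruction pair. The sketch below is a generic illustration, not SRC hardware: the dictionary layout with `reads` and `writes` sets and the function name `classify` are assumptions made for the example.

```python
# Illustrative sketch: classify the dependences between two instructions
# from the registers each one reads and writes.
def classify(first, second):
    """first is issued before second; each has 'reads' and 'writes' sets."""
    hazards = set()
    if first["writes"] & second["reads"]:
        hazards.add("RAW")  # flow dependence
    if first["reads"] & second["writes"]:
        hazards.add("WAR")  # anti-dependence
    if first["writes"] & second["writes"]:
        hazards.add("WAW")  # output dependence
    return hazards


# add r2, r3, r4 followed by add r1, r2, r3: the second reads r2,
# which the first writes.
producer = {"reads": {"r3", "r4"}, "writes": {"r2"}}
consumer = {"reads": {"r2", "r3"}, "writes": {"r1"}}
print(classify(producer, consumer))
```

Whether a detected dependence is an actual hazard depends on the pipeline: it matters only if the two accesses can occur out of issue order.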




Data Hazards in SRC

• Since all data memory access occurs in stage 4, memory writes and reads are sequential and give rise to no hazards

• Since all registers are written in the last stage, WAW and WAR hazards do not occur

• Two writes always occur in the order issued, and a write always follows a previously issued read

• SRC hazards on register data are limited to RAW hazards coming from flow dependence

• Values are written into registers at the end of stage 5 but may be needed by a following instruction at the beginning of stage 2




Possible Solutions to the Register Data Hazard Problem

• Detection:
  • The machine manual could list rules specifying that a dependent instruction cannot be issued less than a given number of steps after the one on which it depends
    • This is usually too restrictive
  • Since the operation and operands are known at each stage, dependence on a following stage can be detected

• Correction:
  • The dependent instruction can be “stalled” and those ahead of it in the pipeline allowed to complete
  • Result can be “forwarded” to a following instruction in a previous stage without waiting to be written into its register

• The preferred SRC design will use detection and forwarding, with stalling only when unavoidable




Detecting Hazards and Dependence Distance

• To detect hazards, pairs of instructions must be considered

• Data is normally available after being written to register
  • Can be made available for forwarding as early as the stage where it is produced
  • Stage 3 output for ALU results, stage 4 for memory fetch

• Operands normally needed in stage 2
  • Can be received from forwarding as late as the stage in which they are used
  • Stage 3 for ALU operands and address modifiers, stage 4 for stored register, stage 2 for branch target




Instruction Pair Hazard Interaction

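Columns: class of the instruction producing the result, with the stage at which the result is Normally/Earliest available (N/E). Normal availability (6) follows the write to the register file.
Rows: class of the instruction needing the value, with the stage at which the value is Normally/Latest needed (N/L). The normal need (2) is the read from the register file.
Entries: instruction separation to eliminate hazard, Normal/Forwarded.

                        Result from:
Value needed by   N/L   alu (6/4)   load (6/5)   ladr (6/4)   brl (6/2)
alu               2/3   4/1         4/2          4/1          4/1
load              2/3   4/1         4/2          4/1          4/1
ladr              2/3   4/1         4/2          4/1          4/1
store             2/3   4/1         4/2          4/1          4/1
branch            2/2   4/2         4/3          4/2          4/1

• Latest needed stage 3 for store is based on address modifier register. The stored value is not needed until stage 4

• Store also needs an operand from ra. See Text Tbl 5.1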
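The separations in the pair table appear to follow from the stage numbers: the normal separation is the stage where the result is normally available minus the stage where the value is normally needed, and the forwarded separation uses the earliest and latest stages instead, with a minimum of one because instructions cannot be closer than adjacent. The Python sketch below checks that reading against the table; the formula is an inference from the numbers, and the names `RESULT`, `NEED`, and `separation` are hypothetical.

```python
# Illustrative sketch: derive the Normal/Forwarded separations of the
# instruction pair table from the N/E and N/L stage numbers.
RESULT = {"alu": (6, 4), "load": (6, 5), "ladr": (6, 4), "brl": (6, 2)}  # N/E
NEED = {"alu": (2, 3), "load": (2, 3), "ladr": (2, 3),
        "store": (2, 3), "branch": (2, 2)}                                # N/L


def separation(producer, consumer):
    """Return (normal, forwarded) instruction separation for a pair."""
    normal_avail, earliest_avail = RESULT[producer]
    normal_need, latest_need = NEED[consumer]
    # Assumption: adjacent instructions (separation 1) is the minimum.
    normal = max(1, normal_avail - normal_need)
    forwarded = max(1, earliest_avail - latest_need)
    return normal, forwarded


for consumer in NEED:
    print(consumer, [separation(p, consumer) for p in RESULT])
```

For example, a load feeding a branch gives 6 - 2 = 4 without forwarding and 5 - 2 = 3 with it, matching the 4/3 entry.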




Delays Unavoidable by Forwarding

• In the Table 5.1 “Load” column, we see the value loaded cannot be available to the next instruction, even with forwarding

• Can restrict compiler not to put a dependent instruction in the next position after a load (next 2 positions if the dependent instruction is a branch)

• Target register cannot be forwarded to branch from the immediately preceding instruction

• Code is restricted so that branch target must not be changed by instruction preceding branch (previous 2 instructions if loaded from memory)

• Do not confuse this with the branch delay slot, which is a dependence of instruction fetch on branch, not a dependence of branch on something else




Stalling the Pipeline on Hazard Detection

• Assuming hazard detection, the pipeline can be stalled by inhibiting earlier stage operation and allowing later stages to proceed

• A simple way to inhibit a stage is a pause signal that turns off the clock to that stage so none of its output registers are changed

• If stages 1 and 2, say, are paused, then something must be delivered to stage 3 so the rest of the pipeline can be cleared

• Insertion of nop into the pipeline is an obvious choice




Example of Detecting ALU Hazards and Stalling Pipeline

• The following expression detects hazards between ALU instructions in stages 2 and 3 and stalls the pipeline

( alu3 ∧ alu2 ∧ ((ra3 = rb2) ∨ (ra3 = rc2) ∧ ¬imm2) ) → ( pause2: pause1: op3 ← 0 ):

• After such a stall, the hazard will be between stages 2 and 4, detected by

( alu4 ∧ alu2 ∧ ((ra4 = rb2) ∨ (ra4 = rc2) ∧ ¬imm2) ) → ( pause2: pause1: op3 ← 0 ):

• Hazards between stages 2 and 5 require

( alu5 ∧ alu2 ∧ ((ra5 = rb2) ∨ (ra5 = rc2) ∧ ¬imm2) ) → ( pause2: pause1: op3 ← 0 ):

Fig 5.13 Pipeline Clocking Signals

[Figure: the clock Ck is combined with pause1 to produce the clock to stage 1 and with pause2 to produce the clock to stage 2.]
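The detection expressions above can be mirrored in software. The following Python sketch is an illustration only: the `Instr` record and the function names are hypothetical, and it assumes that ∧ binds more tightly than ∨ in the expressions, so the rc comparison matters only when the stage 2 instruction does not use an immediate.

```python
# Illustrative sketch of ALU-to-ALU hazard detection between the
# instruction in stage 2 and the instructions in stages 3, 4, and 5.
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Instr:
    alu: bool   # True for an ALU instruction
    ra: int     # result register field
    rb: int     # first operand register field
    rc: int     # second operand register field (ignored if imm)
    imm: bool   # True if the second operand is a constant


def alu_hazard(older: Instr, stage2: Instr) -> bool:
    """alu_k ∧ alu2 ∧ ((ra_k = rb2) ∨ (ra_k = rc2) ∧ ¬imm2)."""
    return (older.alu and stage2.alu
            and (older.ra == stage2.rb
                 or (older.ra == stage2.rc and not stage2.imm)))


def must_stall(stage2: Instr, later: Sequence[Optional[Instr]]) -> bool:
    """Pause stages 1 and 2 if any of stages 3..5 holds a producer."""
    return any(s is not None and alu_hazard(s, stage2) for s in later)


add_r2 = Instr(alu=True, ra=2, rb=3, rc=4, imm=False)  # add r2, r3, r4
add_r1 = Instr(alu=True, ra=1, rb=2, rc=3, imm=False)  # add r1, r2, r3
print(must_stall(add_r1, [add_r2, None, None]))
```

When `must_stall` is true, the hardware described above pauses stages 1 and 2 and sends a nop (op3 ← 0) into stage 3.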




Fig 5.14 Stall Due to a Data Dependence Between Two ALU Instructions

[Figure: contents of the five stages (fetch instruction, fetch operands, ALU operation, memory access, register write) over clock cycles 1 to 5. add r1, r2, r3 depends on the preceding add r2, r3, r4. While add r2, r3, r4 moves through the ALU, memory access, and register write stages, add r1, r2, r3 and the following ld r8, addr2 are stalled in the first two stages and nops are inserted into the ALU stage. In clock cycle 5 add r1, r2, r3 reaches the ALU stage, ld r8, addr2 fetches operands, and the new instruction add r5, r8, r6 is fetched.]




Data Forwarding: from ALU Instruction to ALU Instruction

• The pair table for data dependencies says that if forwarding is done, dependent ALU instructions can be adjacent, not 4 apart

• For this to work, dependences must be detected and data sent from where it is available directly to X or Y input of ALU

• For a dependence of an ALU instruction in stage 3 on an ALU instruction in stage 5 the equation is

alu5 ∧ alu3 → ((ra5 = rb3) → X ← Z5: (ra5 = rc3) ∧¬ imm3 → Y ← Z5 ):




Data Forwarding:ALU to ALU Instruction (cont’d)

• For an ALU instruction in stage 3 depending on one in stage 4, the equation is

alu4 ∧ alu3 → ((ra4 = rb3) → X ← Z4: (ra4 = rc3) ∧ ¬imm3 → Y ← Z4 ):

• We can see that the rb and rc fields must be available in stage 3 for hazard detection

• Multiplexers must be put on the X and Y inputs to the ALU so that Z4 or Z5 can replace either X3 or Y3 as inputs
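The two forwarding equations amount to selecting each ALU input from X3/Y3, Z4, or Z5. The Python sketch below illustrates that selection; the dictionary fields and the function name `alu_inputs` are hypothetical. One point is an assumption not stated in the equations: when both stage 4 and stage 5 hold a matching result register, the stage 4 value is used, since it comes from the more recently issued instruction.

```python
# Illustrative sketch of ALU-to-ALU forwarding into the stage 3 ALU inputs.
def alu_inputs(stage3, x3, y3, stage4=None, z4=None, stage5=None, z5=None):
    """Return the (X, Y) values presented to the ALU in stage 3."""
    x, y = x3, y3
    # Apply stage 5 first so that a stage 4 match overrides it
    # (assumption: the newer result takes priority).
    for older, z in ((stage5, z5), (stage4, z4)):
        if older is None or not (older["alu"] and stage3["alu"]):
            continue
        if older["ra"] == stage3["rb"]:
            x = z  # X ← Z4 or Z5
        if older["ra"] == stage3["rc"] and not stage3["imm"]:
            y = z  # Y ← Z4 or Z5
    return x, y


# add r2, r3, r4 (result 9 now in Z4) followed directly by add r1, r2, r3
consumer = {"alu": True, "ra": 1, "rb": 2, "rc": 3, "imm": False}
producer = {"alu": True, "ra": 2, "rb": 3, "rc": 4, "imm": False}
print(alu_inputs(consumer, x3=0, y3=7, stage4=producer, z4=9))
```

Here X is taken from Z4 instead of the stale register value in X3, while Y still comes from Y3.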




Fig 5.15 Hazard Detection and Forwarding

©1996 Vincent P. Heuring and Harry F. Jordan

• Can be from either Z4 or Z5 to either X or Y input to ALU

• rb and rc needed in stage 3 for detection

[Figure: the five-stage pipeline data path with a hazard detection and forward unit added. The unit receives op and ra from IR4 and IR5 and op, rb, and rc from IR3, and controls multiplexers Mp6 and Mp7 on the X and Y inputs of the ALU, which select among X3 or Y3, Z4, and Z5.]




Restrictions Left If Forwarding Done Wherever Possible

(1) Branch delay slot
  • The instruction after a branch is always executed, whether the branch succeeds or not.

      br r4
      add . . .
      • • •

(2) Load delay slot
  • A register loaded from memory cannot be used as an operand in the next instruction.

      ld r4, 4(r5)
      nop
      neg r6, r4

  • A register loaded from memory cannot be used as a branch target for the next two instructions.

      ld r0, 1000
      nop
      nop
      br r0

(3) Branch target
  • Result register of ALU or ladr instruction cannot be used as branch target by the next instruction.

      not r0, r1
      nop
      br r0




Questions for Discussion

• How and when would you debug this design?
• How do RTN and similar hardware description languages fit into testing and debugging?
• What tools would you use, and at which stage?
• What kind of software test routines would you use?
• How would you correct errors at each stage in the design?




Instruction-Level Parallelism

• A pipeline that is full of useful instructions completes at most one every clock cycle

• Sometimes called the Flynn limit

• If there are multiple function units and multiple instructions have been fetched, then it is possible to start several at once

• Two approaches are:
  • Superscalar: dynamically issue as many prefetched instructions to idle function units as possible
  • Very Long Instruction Word (VLIW): statically compile long instruction words with many operations in a word, each for a different function unit




Character of the Function Units in Multiple Issue Machines

• There may be different types of function units
  • Floating-point
  • Integer
  • Branch

• There can be more than one of the same type
• Each function unit is itself pipelined
• Branches become more of a problem
  • There are fewer clock cycles between branches
  • Branch units try to predict branch direction
  • Instructions at branch target may be prefetched, and even executed speculatively, in hopes the branch goes that way




Microprogramming: Basic Idea

• Control unit job is to generate the sequence of control signals

• How about building a computer to do this?

• Recall control sequence for 1-bus SRC

Step   Concrete RTN             Control Sequence
T0     MA ← PC: C ← PC + 4;     PCout, MAin, INC4, Cin, Read
T1     MD ← M[MA]: PC ← C;      Cout, PCin, Wait
T2     IR ← MD;                 MDout, IRin
T3     A ← R[rb];               Grb, Rout, Ain
T4     C ← A + R[rc];           Grc, Rout, ADD, Cin
T5     R[ra] ← C;               Cout, Gra, Rin, End




The Microcode Engine

• A computer to generate control signals is much simpler than an ordinary computer

• At the simplest, it just reads the control signals in order from a read-only memory

• The memory is called the control store

• A control store word, or microinstruction, contains a bit pattern telling which control signals are true in a specific step

• The major issue is determining the order in which microinstructions are read
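The simplest case, reading control words in order from a read-only memory, can be sketched in Python. This is only an illustration: the control store below holds the six steps of the 1-bus SRC add sequence recalled earlier as lists of signal names rather than bit patterns, and the function name `run_microprogram` is hypothetical.

```python
# Illustrative sketch: a microcode engine that reads control store words
# in order until a word asserts End.
CONTROL_STORE = [
    ["PCout", "MAin", "INC4", "Cin", "Read"],  # T0: MA ← PC: C ← PC + 4
    ["Cout", "PCin", "Wait"],                  # T1: MD ← M[MA]: PC ← C
    ["MDout", "IRin"],                         # T2: IR ← MD
    ["Grb", "Rout", "Ain"],                    # T3: A ← R[rb]
    ["Grc", "Rout", "ADD", "Cin"],             # T4: C ← A + R[rc]
    ["Cout", "Gra", "Rin", "End"],             # T5: R[ra] ← C
]


def run_microprogram(store, start=0):
    """Issue the control signals of successive words, stopping at End."""
    upc = start          # microprogram counter
    issued = []
    while True:
        signals = store[upc]
        issued.append(signals)
        if "End" in signals:
            return issued
        upc += 1         # purely sequential: no microbranching here


for step, signals in enumerate(run_microprogram(CONTROL_STORE)):
    print(f"T{step}: {', '.join(signals)}")
```

A real control unit cannot be purely sequential, since the words after instruction fetch depend on the opcode; that sequencing problem is the major issue named above.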




Fig 5.16 Block Diagram of Microcoded Control Unit

• Microinstruction has branch control, branch address, and control signal fields

• Microprogram counter can be set from several sources to do the required sequencing

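[Figure: the sequencer, driven by the clock Ck, condition codes (CCs), and other inputs, controls a 4–1 multiplexer that loads the n-bit µPC from the incrementer, a PLA that computes the start address from the IR opcode, an external source, or the branch address. The µPC addresses the control store, whose output word is held in the µIR with µbranch control (k bits), branch address (n bits), and control signal fields (PCout, etc.).]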




Parts of the Microprogrammed Control Unit

• Since the control signals are just read from memory, the main function is sequencing

• This is reflected in the several ways the µPC can be loaded

  • Output of incrementer: µPC + 1
  • PLA output: start address for a macroinstruction
  • Branch address from µinstruction
  • External source, say for exception or reset

• Micro conditional branches can depend on condition codes, data path state, external signals, etc.




Contents of a Microinstruction

• Main component is list of 1/0 control signal values
• There is a branch address in the control store
• There are branch control bits to determine when to use the branch address and when to use µPC + 1

Microinstruction format:

Branch control | Control signals (PCout, MAin, PCin, Cout, Ain, ..., End) | Branch address




Fig 5.17 The Control Store

• Common instruction fetch sequence

• Separate sequences for each (macro) instruction

• Wide words

[Figure: control store with microaddresses 0 to 2^n - 1, containing the µcode for instruction fetch followed by the µcode for add, br, and shr at start addresses a1, a2, and a3. Each word is m bits wide: k µbranch control bits, n branch address bits, and c control signals.]




Tbl 5.2 Control Signals for the add Instruction

• Addresses 101–103 are the instruction fetch
• Addresses 200–202 do the add
• Change of µcontrol from 103 to 200 uses a kind of µbranch

1 0 11 0 21 0 32 0 02 0 12 0 2

• • •• • •• • •• • •• • •• • •

1 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 00 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 00 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0000

0 0 1 1 10 0 0 0 0 0 0 0 0 0 0 01 1 1 10 0 0 0 0 0 0 00 0 0 0 0

1 0 0 0 0 0 0 0 1 0 0 0 0 0 01 1




Uses for µbranching in the Microprogrammed Control Unit

(1) Branch to start of µcode for a specific instruction
(2) Conditional control signals, e.g. CON → PCin
(3) Looping on conditions, e.g. n ≠ 0 → ... Goto6

• Conditions will control µbranches instead of being ANDed with control signals
• Microbranches are frequent and control store addresses are short, so it is reasonable to have a µbranch address field in every µinstruction




Illustration of µbranching Control Logic

• We illustrate a µbranching control scheme by a machine having condition code bits N and Z

• Branch control has 2 parts:
  (1) selecting the input applied to the µPC and
  (2) specifying whether this input or µPC + 1 is used

• We allow 4 possible inputs to µPC
  • The incremented value µPC + 1
  • The PLA lookup table for the start of a macroinstruction
  • An externally supplied address
  • The branch address field in the µinstruction word




Fig 5.18 Branching Controls in the Microcoded Control Unit

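• 5 branch conditions
  • NotN
  • N
  • NotZ
  • Z
  • Unconditional

• To 1 of 4 places
  • Next µinstruction
  • PLA
  • External address
  • Branch address

Mux Ctl   Select
00        Increment µPC
01        PLA
10        External address
11        Branch address

[Figure: the sequencer combines the condition codes Z and N with the branch control bits BrUn, BrNotZ, BrZ, BrNotN, and BrN of the microinstruction. The 2-bit mux control field selects among the incrementer, PLA, external address, and branch address inputs of the 4–1 multiplexer that loads the µPC, which addresses the control store.]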




Some Possible µbranches Using the Illustrated Logic (Refer to Tbl 5.3)

• If the control signals are all zero, the µinstruction only does a test

• Otherwise test is combined with data path activity

Mux Ctl   BrUn  BrNotZ  BrZ  BrNotN  BrN   Control Signals   Branch Address   Branching action
00        0     0       0    0       0     • • •             XXX              None (next instruction)
01        1     0       0    0       0     • • •             XXX              Branch to output of PLA
10        0     0       1    0       0     • • •             XXX              Br if Z to Extern. Addr.
11        0     0       0    0       1     • • •             300              Br if N to 300 (else next)
11        0     0       0    0       1     0 • • • 0         206              Br if N to 206 (else next)
11        1     0       0    0       0     • • •             204              Br to 204
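The branching actions listed above can be reproduced by a small next-address function. The Python sketch below is an illustration of the illustrated logic, not the actual hardware: the function name `next_upc`, its parameters, and the dictionary of branch control bits are hypothetical.

```python
# Illustrative sketch: next µPC value from the mux control field, the
# branch control bits, and the condition codes N and Z.
MUX_INCREMENT, MUX_PLA, MUX_EXTERNAL, MUX_BRANCH = 0b00, 0b01, 0b10, 0b11


def next_upc(upc, mux_ctl, br, n, z, pla=0, external=0, branch_addr=0):
    """br maps branch control bit names (BrUn, BrN, ...) to 0 or 1."""
    taken = bool(
        br.get("BrUn", 0)
        or (br.get("BrN", 0) and n)
        or (br.get("BrNotN", 0) and not n)
        or (br.get("BrZ", 0) and z)
        or (br.get("BrNotZ", 0) and not z)
    )
    if not taken:
        return upc + 1  # fall through to the next µinstruction
    sources = {
        MUX_INCREMENT: upc + 1,
        MUX_PLA: pla,
        MUX_EXTERNAL: external,
        MUX_BRANCH: branch_addr,
    }
    return sources[mux_ctl]


# "Br if N to 300 (else next)" at µaddress 10
print(next_upc(10, MUX_BRANCH, {"BrN": 1}, n=True, z=False, branch_addr=300))
print(next_upc(10, MUX_BRANCH, {"BrN": 1}, n=False, z=False, branch_addr=300))
```

With all branch control bits zero the mux selection is irrelevant and the µPC simply increments, which is the first case listed above.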




Horizontal versus Vertical Microcode Schemes

• In horizontal microcode, each control signal is represented by a bit in the µinstruction

• In vertical microcode, a set of true control signals is represented by a shorter code

• The name horizontal implies fewer control store words of more bits per word

• Vertical µcode only allows RTs in a step for which there is a vertical µinstruction code

• Thus vertical µcode may take more control store words of fewer bits




Fig 5.19 A Somewhat Vertical Encoding

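[Figure: two fields of the µIR, labeled F5 and F8. A 4-bit ALU ops field drives a 4–16 decoder that produces 16 ALU control signals, and a 3-bit register-out field drives a 3–8 decoder that produces 7 Regout control signals.]

• Scheme would save (16 + 7) - (4 + 3) = 16 bits/word in the case illustrated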




Fig 5.20 Completely Horizontal and Vertical Microcoding

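[Figure: two control units side by side. In the completely horizontal scheme, the µPC addresses a horizontal control store whose output bits are the control signals PCout, MAin, Inc4, Cin, and so on. In the completely vertical scheme, the µPC addresses a vertical control store whose n-bit output feeds an n to 2^n decoder that produces the same control signals for the data path.]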




Saving Control Store Bits with Horizontal Microcode

• Some control signals cannot possibly be true at the same time

  • One and only one ALU function can be selected
  • Only one register out gate can be true with a single bus
  • Memory read and write cannot be true at the same step

• A set of m such signals can be encoded using $\log_2 m$ bits ($\log_2(m + 1)$ to allow for no signal true)

• The raw control signals can then be generated by a k to $2^k$ decoder, where $2^k \ge m$ (or $2^k \ge m + 1$)

• This is a compromise between horizontal and vertical encoding
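The field widths can be checked with a short calculation. The Python sketch below is an illustration; the function name `encoded_bits` is hypothetical, and it rounds up to a whole number of bits, which the bullets above leave implicit. The example numbers are the 16 ALU signals and 7 register-out signals of Fig 5.19.

```python
# Illustrative sketch: bits needed to encode m mutually exclusive
# control signals, optionally reserving one code for "no signal true".
def encoded_bits(m, allow_none=False):
    """Smallest k with 2**k >= m (or 2**k >= m + 1 if allow_none)."""
    codes = m + 1 if allow_none else m
    return (codes - 1).bit_length()


alu_bits = encoded_bits(16)                    # 4-16 decoder
regout_bits = encoded_bits(7, allow_none=True)  # 3-8 decoder, one code unused for "none"
saved = (16 + 7) - (alu_bits + regout_bits)
print(alu_bits, regout_bits, saved)
```

The result reproduces the saving of 16 bits per word quoted for the somewhat vertical encoding.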




A Microprogrammed Control Unit for the 1-Bus SRC

• Using the 1-bus SRC data path design gives a specific set of control signals

• There are no condition codes, but data path signals CON and n = 0 will need to be tested

• We will use µbranches BrCON, Brn = 0, and Brn ≠ 0
• We adopt the clocking logic of Fig. 4.14
• Logic for exception and reset signals is added to the microcode sequencer logic
• Exception and reset are assumed to have been synchronized to the clock




Tbl 5.4 The add Instruction

• Microbranching to the output of the PLA is shown at 102
• Microbranch to 100 at 202 starts next fetch

Addr.   Mux Ctl   Other Control Signals   Br Addr.   Actions
100     00        • • •                   XXX        MA ← PC: C ← PC + 4;
101     00        • • •                   XXX        MD ← M[MA]: PC ← C;
102     01        • • •                   XXX        IR ← MD; µPC ← PLA;
200     00        • • •                   XXX        A ← R[rb];
201     00        • • •                   XXX        C ← A + R[rc];
202     11        • • •                   100        R[ra] ← C: µPC ← 100;




Getting the PLA Output in Time for the Microbranch

• For the input to the PLA to be correct for the µbranch in 102, it has to come from MD, not IR

• An alternative is to use see-through latches for IR so the opcode can pass through IR to PLA before the end of the clock cycle




See-Through Latch Hardware for IR So µPC Can Load Immediately

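[Figure: IR⟨31..27⟩ is built from see-through D latches controlled by strobe S. The 5 opcode bits pass from the Bus (point P) through the latch (point R) to the PLA, whose 10-bit output is loaded into µPC⟨9..0⟩. The timing diagram shows, within one clock cycle, valid data on the Bus after the bus delay, valid data at P and then at R after the latch delay, and the PLA output strobed into µPC after the PLA delay.]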

• Data must have time to get from MD across Bus, through IR, through the PLA, and satisfy µPC set up time before trailing edge of S




Fig 5.21 SRC Microcode Sequencer

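[Figure: the sequencer takes the data path signals CON and n = 0, the Exception and Reset signals, and the microinstruction fields Mux control, BrUn, BrCON, BrN ≠ 0, BrN = 0, and End. A 4–1 multiplexer selects among the incremented µPC, the PLA output, the external address, and the branch address; a further 2–1 multiplexer and the fixed addresses 000 and 400 are used in loading the 10-bit µPC.]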




Tbl 5.6 Somewhat Vertical Encoding of the SRC Microinstruction

Field   Name             Bits   Encoding
F1      Mux Ctl          2      00, 01, 10, 11
F2      Branch control   3      000 BrUn, 001 Br¬CON, 010 BrCON, 011 Br n=0, 100 Br n≠0, 101 None
F3      End              1      0 Cont., 1 End
F4      Out signals      3      000 PCout, 001 Cout, 010 MDout, 011 Rout, 100 BAout, 101 c1out, 110 c2out, 111 None
F5      In signals       3      000 MAin, 001 PCin, 010 IRin, 011 Ain, 100 Rin, 101 MDin, 110 None
F6      Misc.            3      000 Read, 001 Wait, 010 Ld, 011 Decr, 100 CONin, 101 Cin, 110 Stop, 111 None
F7      Gate regs.       2      00 Gra, 01 Grb, 10 Grc, 11 None
F8      ALU              4      0000 ADD, 0001 C=B, 0010 SHR, 0011 Inc4, • • •, 1111 NOT
F9      Branch address   10




Other Microprogramming Issues

• Multiway branches: often an instruction can have 4–8 cases, say address modes
  • Could take 2–3 successive µbranches, i.e. clock pulses
  • The bits selecting the case can be ORed into the branch address of the µinstruction to get a several way branch
    • Say if 2 bits were ORed into the 3rd and 4th bits from the low end, 4 possible addresses ending in 0000, 0100, 1000, and 1100 would be generated as branch targets
  • Advantage is a multiway branch in one clock

• A hardware push-down stack for the µPC can turn repeated µsequences into µsubroutines

• Vertical µcode can be implemented using a horizontal µengine, sometimes called nanocode
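The multiway branch described above, where case bits are ORed into the branch address, can be sketched in Python. This is only an illustration: the function name `multiway_target` and the example base address are hypothetical, and the branch address is assumed to have zeros in the bit positions receiving the case bits.

```python
# Illustrative sketch: OR two case-select bits into the 3rd and 4th bits
# from the low end of a µbranch address (bit positions 2 and 3).
def multiway_target(branch_addr, case_bits, shift=2, width=2):
    """Return the µbranch target for the given case-select value."""
    mask = (1 << width) - 1
    return branch_addr | ((case_bits & mask) << shift)


base = 0b0101000000  # hypothetical 10-bit branch address ending in 0000
for case in range(4):
    print(format(multiway_target(base, case), "010b"))
```

The four targets end in 0000, 0100, 1000, and 1100, so each case gets up to four consecutive control store words before it has to branch elsewhere.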




Chapter 5 Summary

• This chapter has dealt with some alternative ways of designing a computer

• A pipelined design is aimed at making the computer fast—target of one instruction per clock

• Forwarding, branch delay slot, and load delay slot are steps in approaching this goal

• More than one issue per clock is possible, but beyond the scope of this text

• Microprogramming is a design method with a target of easing the design task and allowing for easy design change or multiple compatible implementations of the same instruction set




Chapter 6: Computer Arithmetic and the Arithmetic Unit

Topics

6.1 Number Systems and Radix Conversion
6.2 Fixed-Point Arithmetic
6.3 Seminumeric Aspects of ALU Design
6.4 Floating-Point Arithmetic




Digital Number Systems

• Digital number systems have a base or radix b
• Using positional notation, an m-digit base b number is written

\[ x = x_{m-1} x_{m-2} \ldots x_1 x_0, \qquad 0 \le x_i \le b-1, \quad 0 \le i < m \]

• The value of this unsigned integer is

\[ \mathrm{value}(x) = \sum_{i=0}^{m-1} x_i \cdot b^i \qquad \text{(Eq. 6.1)} \]




Range of Unsigned m Digit Base b Numbers

• The largest number has all of its digits equal to b-1, the largest possible base b digit

• Its value can be calculated in closed form

\[ x_{max} = \sum_{i=0}^{m-1} (b-1) \cdot b^i = (b-1) \cdot \sum_{i=0}^{m-1} b^i = b^m - 1 \qquad \text{(Eq. 6.2)} \]

• An important summation: geometric series

\[ \sum_{i=0}^{m-1} b^i = \frac{b^m - 1}{b - 1} \qquad \text{(Eq. 6.3)} \]




Radix Conversion: General Matters

• Converting from one number system to another involves computation

• We call the base in which calculation is done c and the other base b

• Calculation is based on the division algorithm
  • For integers a and b, there exist integers q and r such that $a = q \cdot b + r$, with $0 \le r \le b-1$
  • Notation: $q = \lfloor a/b \rfloor$, $r = a \bmod b$




Digit Symbol Correspondence Between Bases

• Each base has b (or c) different symbols to represent the digits
• If b < c, there is a table of b + 1 entries giving base c symbols for each base b symbol and b
  • If the same symbol is used for the first b base c digits as for the base b digits, the table is implicit

• If c < b, there is a table of b + 1 entries giving a base c number for each base b symbol and b
  • For base b digits ≥ c, the base c numbers have more than one digit

Base 12:  0  1  2  3   4   5   6   7   8   9    A    B    10
Base 3:   0  1  2  10  11  12  20  21  22  100  101  102  110




Convert Base b Integer to Calculator’s Base, c

1) Start with base b $x = x_{m-1} x_{m-2} \ldots x_1 x_0$
2) Set x = 0 in base c
3) Left to right, get next symbol $x_i$
4) Look up base c number $D_i$ for symbol $x_i$
5) Calculate in base c: $x = x \cdot b + D_i$
6) If there are more digits, repeat from step 3

• Example: convert $3AF_{16}$ to base 10
  x = 0
  x = 16·0 + 3 = 3
  x = 16·3 + 10 (= A) = 58
  x = 16·58 + 15 (= F) = 943
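The six steps above are Horner's rule with the machine's own integers playing the role of the calculator's base c. The Python sketch below illustrates them; the digit table `DIGITS` and the function name `to_integer` are hypothetical choices for the example.

```python
# Illustrative sketch: convert a base b digit string to an integer,
# processing symbols left to right as in steps 1 to 6.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # symbol lookup table


def to_integer(symbols, b):
    """Return the value of the base b number written as `symbols`."""
    x = 0                                 # step 2
    for sym in symbols:                   # step 3: left to right
        d = DIGITS.index(sym.upper())     # step 4: look up D_i
        if d >= b:
            raise ValueError(f"{sym!r} is not a base {b} digit")
        x = x * b + d                     # step 5: x = x*b + D_i
    return x


print(to_integer("3AF", 16))
```

The call reproduces the worked example, passing through the intermediate values 3 and 58 before reaching the final result.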




Convert Calculator’s Base Integer to Base b

1) Let x be the base c integer
2) Initialize i = 0 and v = x, and get digits right to left
3) Set $D_i = v \bmod b$ and $v = \lfloor v/b \rfloor$. Look up $D_i$ to get $x_i$
4) i = i + 1; if v ≠ 0, repeat from step 3

• Example: convert $3587_{10}$ to base 12
  3587 ÷ 12 = 298 (rem = 11) ⇒ $x_0$ = B
  298 ÷ 12 = 24 (rem = 10) ⇒ $x_1$ = A
  24 ÷ 12 = 2 (rem = 0) ⇒ $x_2$ = 0
  2 ÷ 12 = 0 (rem = 2) ⇒ $x_3$ = 2
  Thus $3587_{10} = 20AB_{12}$
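The repeated-division steps above can be sketched in Python, again letting the machine's integers stand in for the calculator's base. The digit table `DIGITS` and the function name `to_base` are hypothetical.

```python
# Illustrative sketch: convert a nonnegative integer to a base b digit
# string by repeated division, producing digits right to left.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # symbol lookup table


def to_base(x, b):
    """Return the base b representation of the nonnegative integer x."""
    v = x                      # step 2
    symbols = []
    while True:
        v, d = divmod(v, b)    # step 3: D_i = v mod b, v = floor(v / b)
        symbols.append(DIGITS[d])
        if v == 0:             # step 4: stop when nothing is left
            break
    return "".join(reversed(symbols))


print(to_base(3587, 12))
```

Because step 3 is always executed at least once, zero converts to the single digit 0.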




Fractions and Fixed-Point Numbers

• The value of the base b fraction $.f_{-1} f_{-2} \ldots f_{-m}$ is the value of the integer $f_{-1} f_{-2} \ldots f_{-m}$ divided by $b^m$

• The value of a mixed fixed point number $x_{n-1} x_{n-2} \ldots x_1 x_0 . x_{-1} x_{-2} \ldots x_{-m}$ is the value of the n+m digit integer $x_{n-1} x_{n-2} \ldots x_1 x_0 x_{-1} x_{-2} \ldots x_{-m}$ divided by $b^m$

• Moving radix point one place left divides by b
  • For fixed radix point position in word, this is a right shift of word

• Moving radix point one place right multiplies by b
  • For fixed radix point position in word, this is a left shift of word




Converting Fraction to Calculator’s Base

• Can use integer conversion and divide result by $b^m$

• Alternative algorithm
  1) Let base b number be $.f_{-1} f_{-2} \ldots f_{-m}$
  2) Initialize f = 0.0 and i = -m
  3) Find base c equivalent D of $f_i$
  4) f = (f + D)/b; i = i + 1
  5) If i = 0, the result is f. Otherwise repeat from 3

• Example: convert $.413_8$ to base 10
  f = (0 + 3)/8 = 0.375
  f = (0.375 + 1)/8 = 0.171875
  f = (0.171875 + 4)/8 = 0.521484375
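The alternative algorithm processes the fraction digits from right to left, dividing by b at each step. A Python sketch follows; the function name `fraction_value` is hypothetical, and binary floating point is used as the calculator's arithmetic, so results are exact only when the fraction terminates in base 2.

```python
# Illustrative sketch: value of the base b fraction .f-1 f-2 ... f-m,
# accumulated from the least significant digit.
def fraction_value(symbols, b):
    """Return the value of the fraction digits in `symbols` (base b)."""
    f = 0.0                          # step 2
    for sym in reversed(symbols):    # i runs from -m up to -1
        d = int(sym, 36)             # step 3: digit value of the symbol
        if d >= b:
            raise ValueError(f"{sym!r} is not a base {b} digit")
        f = (f + d) / b              # step 4
    return f                         # step 5


print(fraction_value("413", 8))
```

The intermediate values are 0.375 and 0.171875, as in the worked example.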




Nonterminating Fractions

• The division in the algorithm may give a nonterminating fraction in the calculator’s base

• This is a general problem: a fraction of m digits in one base may have any number of digits in another base

• The calculator will normally keep only a fixed number of digits

• Number should make base c accuracy about that of base b

• This problem appears in generating base b digits of a base c fraction

• The algorithm can continue to generate digits unless terminated




Convert Fraction from Calculator’s Base to Base b

1) Start with exact fraction f in base c
2) Initialize i = 1 and v = f
3) $D_{-i} = \lfloor b \cdot v \rfloor$; $v = b \cdot v - D_{-i}$; get base b digit $f_{-i}$ for $D_{-i}$
4) i = i + 1; repeat from 3 unless v = 0 or enough base b digits have been generated

• Example: convert $0.31_{10}$ to base 8
  0.31 × 8 = 2.48 ⇒ $f_{-1}$ = 2
  0.48 × 8 = 3.84 ⇒ $f_{-2}$ = 3
  0.84 × 8 = 6.72 ⇒ $f_{-3}$ = 6
• Since $8^3 > 10^2$, $0.236_8$ has more accuracy than $0.31_{10}$
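The digit-generation steps above can be sketched in Python. To keep the starting fraction exact, as step 1 requires, this illustration uses the standard `fractions.Fraction` type rather than floating point; the function name `fraction_digits` is hypothetical.

```python
# Illustrative sketch: generate base b digits of an exact fraction by
# repeated multiplication, stopping at v = 0 or after max_digits digits.
from fractions import Fraction


def fraction_digits(f, b, max_digits):
    """Return the first base b digit values of the fraction 0 <= f < 1."""
    v = Fraction(f)                  # step 2
    digits = []
    while v != 0 and len(digits) < max_digits:
        v *= b
        d = int(v)                   # step 3: D = integer part of b*v
        digits.append(d)
        v -= d                       # keep only the fraction part
    return digits


print(fraction_digits(Fraction(31, 100), 8, 3))
```

The `max_digits` limit matters because the expansion of 0.31 in base 8 does not terminate.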




Conversion Between Related Bases by Digit Grouping

• Let base $b = c^k$; for example $b = c^2$

• Then base b number $x_1 x_0$ is base c number $y_3 y_2 y_1 y_0$, where $x_1$ base b = $y_3 y_2$ base c and $x_0$ base b = $y_1 y_0$ base c

• Examples:
  $102130_4 = 10\ 21\ 30_4 = 49C_{16}$
  $49C_{16} = 0100\ 1001\ 1100_2$
  $102130_4 = 01\ 00\ 10\ 01\ 11\ 00_2$
  $010010011100_2 = 010\ 010\ 011\ 100_2 = 2234_8$
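Digit grouping needs no arithmetic on the whole number, only per-digit table lookups. The Python sketch below illustrates this for bases that are powers of 2, going through a bit string; the names `DIGITS`, `to_bits`, and `from_bits` are hypothetical.

```python
# Illustrative sketch: convert between bases 2**k by expanding each digit
# to k bits and regrouping the bits.
DIGITS = "0123456789ABCDEF"


def to_bits(symbols, k):
    """Expand each base 2**k digit into exactly k bits."""
    return "".join(format(DIGITS.index(s), f"0{k}b") for s in symbols)


def from_bits(bits, k):
    """Group bits k at a time from the right into base 2**k digits."""
    bits = "0" * (-len(bits) % k) + bits  # pad on the left to a multiple of k
    return "".join(DIGITS[int(bits[i:i + k], 2)]
                   for i in range(0, len(bits), k))


bits = to_bits("102130", 2)          # base 4 digits are 2 bits each
print(bits, from_bits(bits, 4), from_bits(bits, 3))
```

The same bit string yields the base 16 and base 8 forms shown in the examples, depending only on the group size.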




Negative Numbers, Complements, and Complement Representations

We will:
• Define two complement operations
• Define two complement number systems
  • Systems represent both positive and negative numbers
• Give a relation between complement and negate in a complement number system
• Show how to compute the complements
• Explain the relation between shifting and scaling a number by a power of the base
• Lead up to the use of complement number systems in signed addition hardware




Complement Operations for m-Digit Base b Numbers

• Radix complement of m-digit base b number x:

\[ x^c = (b^m - x) \bmod b^m \]

• Diminished radix complement of x:

\[ x^{\hat{c}} = b^m - 1 - x \]

• The complement of a number in the range $0 \le x \le b^m - 1$ is in the same range

• The mod $b^m$ in the radix complement definition makes this true for x = 0; it has no effect for any other value of x

• Specifically, the radix complement of 0 is 0
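Both complement operations are one-line computations on unsigned values. The Python sketch below illustrates them and checks the range property for 8-bit binary numbers; the function names are hypothetical.

```python
# Illustrative sketch: the two complement operations on an unsigned
# m-digit base b value x, with 0 <= x <= b**m - 1.
def radix_complement(x, b, m):
    """(b**m - x) mod b**m, for example the 2's complement when b = 2."""
    return (b**m - x) % b**m


def diminished_radix_complement(x, b, m):
    """b**m - 1 - x, for example the 1's complement when b = 2."""
    return b**m - 1 - x


print(radix_complement(1, 2, 8), diminished_radix_complement(1, 2, 8))
print(radix_complement(0, 2, 8), diminished_radix_complement(0, 2, 8))
```

For x = 0 the mod keeps the radix complement at 0 instead of 256, which would not fit in 8 bits, while the diminished radix complement of 0 is the all-ones value 255.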




Complement Number Systems

• Complement number systems use unsigned numbers to represent both positive and negative numbers

• Recall that the range of an m digit base b unsigned number is $0 \le x \le b^m - 1$

• The first half of the range is used for positive, and the second half for negative, numbers

• Positive numbers are simply represented by the unsigned number corresponding to their absolute value




Use of Complements to Represent Negative Numbers

• The complement of a number in the range from 0 to $b^m/2$ is in the range from $b^m/2$ to $b^m - 1$

• A negative number is represented by the complement of its absolute value

• There are an equal number (±1) of positive and negative number representations

• The ±1 depends on whether b is odd or even and whether radix complement or diminished radix complement is used

• We will assume the most useful case of even b
  • Then radix complement system has one more negative representation
  • Diminished radix complement system has equal numbers of positive and negative representations




Reasons to Use Complement Systems for Negative Numbers

• The usual sign-magnitude system introduces extra symbols + and - in addition to the digits

• In binary, it is easy to map 0 ⇒ + and 1 ⇒ −

• In base b > 2, using a whole digit for the two values, + and −, is wasteful

• Most important, however, it is easy to do signed addition and subtraction in complement number systems




Tbl 6.1 Complement Representations of Negative Numbers

• For even b, radix complement system represents one more negative than positive value

• While diminished radix complement system has 2 zeros but represents same number of positive and negative values

Radix Complement                                   Diminished Radix Complement

Number               Representation                Number               Representation

0                    0                             0                    0 or $b^m - 1$

$0 < x < b^m/2$      x                             $0 < x < b^m/2$      x

$-b^m/2 \le x < 0$   $|x|^c = b^m - |x|$           $-b^m/2 < x < 0$     $|x|^{\hat{c}} = b^m - 1 - |x|$




Tbl 6.2 Base 2 Complement Representations

• In 1’s complement, $255 = 11111111_2$ is often called −0

• In 2’s complement, $-128 = 10000000_2$ is a legal value, but trying to negate it gives overflow

8 Bit 2’s Complement                  8 Bit 1’s Complement

Number         Representation         Number         Representation

0              0                      0              0 or 255

0 < x < 128    x                      0 < x < 128    x

−128 ≤ x < 0   256 − |x|              −127 ≤ x < 0   255 − |x|




Negation in Complement Number Systems

• Except for $-b^m/2$ in the b’s comp. system, the negative of any m digit value is also m digits

• The negative of any number x, positive or negative, in the b’s or b-1’s complement system is obtained by applying the b’s or b-1’s complement operation to x, respectively

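• The 2 complement operations are related by $x^c = (x^{\hat{c}} + 1) \bmod b^m$, where $x^c$ is the radix complement and $x^{\hat{c}}$ the diminished radix complement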

• Thus an easy way to compute one of them will give an easy way to compute both




Digitwise Computation of the Diminished Radix Complement

• Using the geometric series formula, the b-1’s complement of x can be written

$$x^{\hat{c}} = b^m - 1 - x = \sum_{i=0}^{m-1} (b-1) \cdot b^i - \sum_{i=0}^{m-1} x_i \cdot b^i = \sum_{i=0}^{m-1} (b - 1 - x_i) \cdot b^i \qquad \text{(Eq. 6.9)}$$

• If $0 \le x_i \le b-1$, then $0 \le (b - 1 - x_i) \le b-1$, so the last formula is just an m-digit base b number with each digit obtained from the corresponding digit of x




Table-Driven Calculation of Complements in Base 5

• 4’s complement of $201341_5$ is $243103_5$

• 5’s complement of $201341_5$ is $243103_5 + 1 = 243104_5$

• 5’s complement of $44444_5$ is $00000_5 + 1 = 00001_5$

• 5’s complement of $00000_5$ is $(44444_5 + 1) \bmod 5^5 = 00000_5$

Base 5 Digit    4’s Comp.

0               4

1               3

2               2

3               1

4               0
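The table lookup amounts to replacing each digit d by b − 1 − d. The sketch below reproduces the base 5 examples; the helper names are illustrative, and digits are stored most significant first.

```python
def digitwise_diminished(digits, b):
    """(b-1)'s complement: complement every digit independently."""
    return [b - 1 - d for d in digits]


def digits_to_int(digits, b):
    """Value of a digit list, most significant digit first."""
    value = 0
    for d in digits:
        value = value * b + d
    return value


def int_to_digits(value, b, m):
    """m-digit base-b digit list of value, most significant digit first."""
    return [(value // b**i) % b for i in reversed(range(m))]


def radix_from_digits(digits, b):
    """b's complement: digitwise (b-1)'s complement, plus 1, mod b^m."""
    m = len(digits)
    dim = digits_to_int(digitwise_diminished(digits, b), b)
    return int_to_digits((dim + 1) % b**m, b, m)


if __name__ == "__main__":
    print(digitwise_diminished([2, 0, 1, 3, 4, 1], 5))  # [2, 4, 3, 1, 0, 3]
    print(radix_from_digits([2, 0, 1, 3, 4, 1], 5))     # [2, 4, 3, 1, 0, 4]
    print(radix_from_digits([0, 0, 0, 0, 0], 5))        # [0, 0, 0, 0, 0]
```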




Complement Fractions

• Since an m digit fraction is the same as an m digit integer divided by $b^m$, the $b^m$ in complement definitions corresponds to 1 for fractions

• Thus radix complement of $x = .x_{-1}x_{-2} \ldots x_{-m}$ is $(1 - x) \bmod 1$, where mod 1 means discard integer

• The range of fractions is roughly −1/2 to +1/2

• This can be inconvenient for a base other than 2

• The b’s comp. of a mixed number $x = x_{m-1}x_{m-2} \ldots x_1x_0.x_{-1}x_{-2} \ldots x_{-n}$ is $b^m - x$, where both integer and fraction digits are subtracted




Scaling Complement Numbers by Powers of the Base

• Roughly, multiplying by b corresponds to moving radix point one place right or shifting number one place left

• Dividing by b roughly corresponds to a right shift of the number or a radix point move to the left one place

• There are 2 new issues for complement numbers: 1) What is new left digit on right shift? 2) When does a left shift overflow?




Right Shifting a Complement Number to Divide by b

• For positive $x_{m-1}x_{m-2} \ldots x_1x_0$, dividing by b corresponds to right shift with zero fill

$0\,x_{m-1}x_{m-2} \ldots x_1$

• For negative $x_{m-1}x_{m-2} \ldots x_1x_0$, dividing by b corresponds to right shift with b−1 fill

$(b-1)\,x_{m-1}x_{m-2} \ldots x_1$

• This holds for both b’s and b-1’s comp. systems

• For even b, the rule is: fill with 0 if $x_{m-1} < b/2$ and fill with (b−1) if $x_{m-1} \ge b/2$




Complement Number Overflow on Left Shift to Multiply by b

• For positive numbers, overflow occurs if any digit other than 0 shifts off left end

• Positive numbers also overflow if the digit shifted into left position makes number look negative, i.e. digit ≥ b/2 for even b

• For negative numbers, overflow occurs if any digit other than b-1 shifts off left end

• Negative numbers also overflow if new left digit makes number look positive, i.e. digit<b/2 for even b




Left Shift Examples with Radix Complement Numbers

• Non-overflow cases:

Left shift of $762_8$ gives $620_8$; $-14_{10}$ becomes $-112_{10}$

Left shift of $031_8$ gives $310_8$; $25_{10}$ becomes $200_{10}$

• Overflow cases:

Left shift of $241_8$ gives $410_8$; shifts 2 ≠ 0 off left

Left shift of $041_8$ gives $410_8$; changes from + to −

Left shift of $713_8$ gives $130_8$; changes from − to +

Left shift of $662_8$ gives $620_8$; shifts 6 ≠ 7 off left




Fixed-Point Addition and Subtraction

• If the radix point is in the same position in both operands, addition or subtraction act as if the numbers were integers

• Addition of signed numbers in radix complement system needs only an unsigned adder

• So we only need to concentrate on the structure of an m-digit base b unsigned adder

• To see this let x be a signed integer and rep(x) be its 2’s complement representation

• The following theorem summarizes the result




Theorem on Signed Addition in a Radix Complement System

• Theorem: Let s be the unsigned sum of rep(x) and rep(y). Then s = rep(x+y), except for overflow

• Proof sketch: Case 1, signs differ, x ≥ 0, y < 0. Then x+y = x−|y| and $s = (x + b^m - |y|) \bmod b^m$. If x−|y| ≥ 0, the mod discards $b^m$, giving the result; if x−|y| < 0, then $\mathrm{rep}(x+y) = (b^m - |x - |y||) \bmod b^m$, which equals s. Case 3, x < 0, y < 0. $s = (2b^m - |x| - |y|) \bmod b^m$, which reduces to $s = (b^m - |x+y|) \bmod b^m$. This is rep(x+y) provided the result is in range of an m digit b’s comp. representation. If it is not, the unsigned $s < b^m/2$ appears positive.
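The theorem can be checked by brute force for one small case. The sketch below assumes b = 2 and m = 8 (choices made here, not in the text) and confirms that the unsigned sum equals rep(x+y) whenever x+y is in range, and has the wrong sign otherwise.

```python
M = 8                 # word length in bits (an assumption for this check)
MOD = 1 << M          # b^m with b = 2


def rep(x: int) -> int:
    """2's complement representation of x, valid for -2^(M-1) <= x < 2^(M-1)."""
    return x % MOD


def value(r: int) -> int:
    """Signed value of the M-bit pattern r."""
    return r - MOD if r >= MOD // 2 else r


def theorem_holds() -> bool:
    lo, hi = -(MOD // 2), MOD // 2 - 1
    for x in range(lo, hi + 1):
        for y in range(lo, hi + 1):
            s = (rep(x) + rep(y)) % MOD      # what an unsigned adder produces
            if lo <= x + y <= hi:
                if s != rep(x + y):
                    return False
            else:
                # Overflow: operands share a sign, result shows the other sign
                if (value(s) < 0) == (x < 0):
                    return False
    return True


if __name__ == "__main__":
    print(theorem_holds())   # True
```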




Fig 6.1 Hardware Structure of a Base b Unsigned Adder

• Typical cell produces $s_j = (x_j + y_j + c_j) \bmod b$ and $c_{j+1} = \lfloor (x_j + y_j + c_j)/b \rfloor$

• Since $x_j, y_j \le b-1$, $c_j \le 1$ implies $c_{j+1} \le 1$, and since $c_0 \le 1$, all carries are ≤ 1, regardless of b

[Figure: an m-digit base b unsigned adder built as a chain of base b digit adders. Cell j takes $x_j$, $y_j$, and a carry-in $0 \le c_j \le 1$, and produces $0 \le s_j < b$ and a carry-out $0 \le c_{j+1} \le 1$; the chain runs from $c_0$ at digit 0 to $c_m$ out of digit m−1.]
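The digit cell and the carry chain can be sketched in a few lines. Digit lists are stored least significant digit first, and the function names are illustrative.

```python
def digit_adder(xj, yj, cj, b):
    """One base-b digit cell: returns (s_j, c_{j+1})."""
    total = xj + yj + cj
    return total % b, total // b


def add_unsigned(x, y, b, c0=0):
    """Ripple-carry add of two equal-length digit lists (least significant first).

    Returns (sum digits, carry out of the leftmost digit).
    """
    carry = c0
    s = []
    for xj, yj in zip(x, y):
        sj, carry = digit_adder(xj, yj, carry, b)
        s.append(sj)
    return s, carry


if __name__ == "__main__":
    # 12.03 + 13.21 in base 4, digits listed least significant first
    print(add_unsigned([3, 0, 2, 1], [1, 2, 3, 1], 4))   # ([0, 3, 1, 3], 0)
    # .9A2C + .7BE2 in base 16: the carry out signals overflow of a 16-bit word
    print(add_unsigned([0xC, 0x2, 0xA, 0x9], [0x2, 0xE, 0xB, 0x7], 16))
```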




Unsigned Addition Examples

• If result can only have a fixed number of bits, overflow occurs on carry from leftmost digit

• Carries are either 0 or 1 in all cases

• A table of sum and carry for each of the $b^2$ digit pairs, and one for carry-in = 1, define the addition

         Base 4                               Base 16

         $12.03_4 = 6.1875_{10}$              $.9A2C_{16}$

         $13.21_4 = 7.5625_{10}$              $.7BE2_{16}$

Carry    01 01                                1 11 0        (overflow for 16-bit word)

Sum      $31.30_4 = 13.75_{10}$               $1.160E_{16}$

Base 4

+ 0 1 2 3

0 00 01 02 03

1 01 02 03 10

2 02 03 10 11

3 03 10 11 12




Implementation Alternatives for Unsigned Adders

• If $b = 2^k$, then each base b digit is equivalent to k bits

• A base b digit adder can be viewed as a logic circuit with 2k+1 inputs and k+1 outputs

[Figure: base $b = 2^k$ digit adder with k-bit inputs x and y, carry-in $c_0$, a k-bit sum output, and carry-out $c_1$.]

• This combinational logic circuit can be designed with as few as 2 levels of logic

• PLA, ROM, and multi-level logic are also alternatives

• If 2 level logic is used, the maximum gate delay for an m-digit base b unsigned adder is 2m




Two-Level Logic Design of a Base 4 Digit Adder

• The base 4 digit x is represented by the 2 bits $x_b x_a$, y by $y_b y_a$, and s by $s_b s_a$

• $s_a$ is independent of $x_b$ and $y_b$; $c_1$ is given by $y_b y_a c_0 + x_a y_b c_0 + x_b x_a c_0 + x_b y_a c_0 + x_b x_a y_a + x_a y_b y_a + x_b y_b$, while $s_b$ is a 12 input OR of 4 input ANDs

xb 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

xa 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

yb 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1

ya 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1

c0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

c1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1

sb 0 0 0 1 1 1 1 0 0 1 1 1 1 0 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0 1 1 1

sa 0 1 1 0 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 0 1 1 0 1 0 0 1 1 0 0 1
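The sum-of-products expression for the carry can be checked against plain arithmetic over all 32 input combinations. This is a small verification sketch; the function names are illustrative.

```python
from itertools import product


def c1_two_level(xb, xa, yb, ya, c0):
    """Carry-out of the base 4 digit adder as the 7-term OR of ANDs."""
    return ((yb & ya & c0) | (xa & yb & c0) | (xb & xa & c0) | (xb & ya & c0)
            | (xb & xa & ya) | (xa & yb & ya) | (xb & yb))


def c1_arithmetic(xb, xa, yb, ya, c0):
    """Carry-out computed as floor((x + y + c0) / 4)."""
    return (2 * xb + xa + 2 * yb + ya + c0) // 4


def sa_arithmetic(xb, xa, yb, ya, c0):
    """Low sum bit computed from the arithmetic sum."""
    return (2 * xb + xa + 2 * yb + ya + c0) % 2


mismatches = [bits for bits in product((0, 1), repeat=5)
              if c1_two_level(*bits) != c1_arithmetic(*bits)]

# s_a depends only on x_a, y_a, and c_0
sa_independent = all(sa_arithmetic(xb, xa, yb, ya, c0) == xa ^ ya ^ c0
                     for xb, xa, yb, ya, c0 in product((0, 1), repeat=5))

if __name__ == "__main__":
    print(mismatches)       # []
    print(sa_independent)   # True
```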




Fig 6.2 Base b Radix Complement Subtracter

• To do subtraction in the radix complement system, it is only necessary to negate (radix complement) the 2nd operand

• It is easy to take the diminished radix complement, and the adder has a carry-in for the +1

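[Figure: y passes through a (b − 1)’s complement block into a base b adder whose other operand is x; the adder’s carry-in supplies the +1, and the output is x − y.]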




Overflow Detection in Complement Add and Subtract

• We saw that all cases of overflow in complement addition came when adding numbers of like signs, and the result seemed to have the opposite sign

• For even b, the sign can be determined from the left digit of the representation

• Thus an overflow detector only needs $x_{m-1}$, $y_{m-1}$, $s_{m-1}$, and an add/subtract control

• It is particularly simple in base 2




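Fig 6.3 2’s Complement Adder/Subtracter

• A multiplexer to select y or its complement becomes an exclusive OR gate

[Figure: a chain of m full adders (FA). Cell j adds $x_j$ and $q_j$ with carry-in $c_j$ to give $s_j$ and carry-out $c_{j+1}$; $q_j$ is the exclusive OR of $y_j$ with the subtract control r, and r also drives the carry-in $c_0$.]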
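A bit-level sketch of the adder/subtracter follows. It assumes, as in the usual design, that the subtract control r feeds both the XOR gates and the carry-in $c_0$; the overflow test uses only the sign bits of the two adder inputs and of the sum. Names are illustrative.

```python
def full_adder(x, y, c):
    """One binary full adder cell: returns (sum bit, carry out)."""
    return x ^ y ^ c, (x & y) | (x & c) | (y & c)


def add_sub(x, y, r, m=8):
    """Add (r = 0) or subtract (r = 1) the m-bit patterns x and y.

    Returns (s, c_m, overflow), where s is the m-bit result pattern.
    """
    c = r                                  # subtract control is also carry-in c0
    s = 0
    for j in range(m):
        xj = (x >> j) & 1
        qj = ((y >> j) & 1) ^ r            # XOR gate: y_j to add, NOT y_j to subtract
        sj, c = full_adder(xj, qj, c)
        s |= sj << j
    xs = (x >> (m - 1)) & 1                # sign of first adder input
    qs = ((y >> (m - 1)) & 1) ^ r          # sign of second adder input
    ss = (s >> (m - 1)) & 1                # sign of the sum
    overflow = int(xs == qs and ss != xs)  # like signs in, opposite sign out
    return s, c, overflow


def signed(v, m=8):
    """Signed value of the m-bit pattern v."""
    return v - (1 << m) if (v >> (m - 1)) & 1 else v


if __name__ == "__main__":
    s, c, v = add_sub(0b00000101, 0b00001000, 1)   # 5 - 8
    print(signed(s), v)                             # -3 0
    s, c, v = add_sub(0b01111111, 0b00000001, 0)   # 127 + 1 overflows
    print(signed(s), v)                             # -128 1
```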




Speeding Up Addition with Carry Lookahead

• Speed of digital addition depends on carries

• A base $b = 2^k$ divides length of carry chain by k

• Two level logic for base b digit becomes complex quickly as k increases

• If we could compute the carries quickly, the full adders compute result with 2 more gate delays

• Carry lookahead computes carries quickly

• It is based on two ideas:

• a digit position generates a carry

• a position propagates a carry-in to the carry-out




Binary Propagate and Generate Signals

• In binary, the generate for digit j is $G_j = x_j \cdot y_j$

• Propagate for digit j is $P_j = x_j + y_j$

• Of course $x_j + y_j$ covers $x_j \cdot y_j$ but it still corresponds to a carry out for a carry in

• Carries can then be written: $c_1 = G_0 + P_0 \cdot c_0$

• $c_2 = G_1 + P_1 \cdot G_0 + P_1 \cdot P_0 \cdot c_0$

• $c_3 = G_2 + P_2 \cdot G_1 + P_2 \cdot P_1 \cdot G_0 + P_2 \cdot P_1 \cdot P_0 \cdot c_0$

• $c_4 = G_3 + P_3 \cdot G_2 + P_3 \cdot P_2 \cdot G_1 + P_3 \cdot P_2 \cdot P_1 \cdot G_0 + P_3 \cdot P_2 \cdot P_1 \cdot P_0 \cdot c_0$

• In words, the $c_2$ logic is: $c_2$ is one if digit 1 generates a carry, or if digit 0 generates one and digit 1 propagates it, or if digits 0 and 1 both propagate a carry-in
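The four carry equations can be compared with an ordinary ripple carry chain. The sketch below does this for every 4-bit input; bit lists are least significant first and the names are illustrative.

```python
from itertools import product


def lookahead_carries(x, y, c0):
    """Carries c1..c4 of a 4-bit add from generate and propagate signals."""
    G = [a & b for a, b in zip(x, y)]     # G_j = x_j AND y_j
    P = [a | b for a, b in zip(x, y)]     # P_j = x_j OR y_j
    c1 = G[0] | (P[0] & c0)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
    c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & c0))
    return [c1, c2, c3, c4]


def ripple_carries(x, y, c0):
    """The same carries computed one position at a time."""
    carries, c = [], c0
    for a, b in zip(x, y):
        c = (a & b) | (a & c) | (b & c)
        carries.append(c)
    return carries


def lookahead_matches_ripple():
    for bits in product((0, 1), repeat=9):
        x, y, c0 = list(bits[0:4]), list(bits[4:8]), bits[8]
        if lookahead_carries(x, y, c0) != ripple_carries(x, y, c0):
            return False
    return True


if __name__ == "__main__":
    print(lookahead_matches_ripple())   # True
```

Each lookahead carry is a two-level expression in G, P, and $c_0$, so none of them waits for the carry before it.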




Speed Gains with Carry Lookahead

• It takes one gate to produce a G or P, two levels of gates for any carry, and 2 more for full adders

• The number of OR gate inputs (terms) and AND gate inputs (literals in a term) grows as the number of carries generated by lookahead

• The real power of this technique comes from applying it recursively

• For a group of, say, 4 digits an overall generate is $G^1_0 = G_3 + P_3 \cdot G_2 + P_3 \cdot P_2 \cdot G_1 + P_3 \cdot P_2 \cdot P_1 \cdot G_0$

• An overall propagate is $P^1_0 = P_3 \cdot P_2 \cdot P_1 \cdot P_0$




Recursive Carry Lookahead Scheme

• If level 1 generates $G^1_j$ and propagates $P^1_j$ are defined for all groups j, then we can also define level 2 signals $G^2_j$ and $P^2_j$ over groups of groups

• If k things are grouped together at each level, there will be $\log_k m$ levels, where m is the number of bits in the original addition

• Each extra level introduces 2 more gate delays into the worst case carry calculation

• k is chosen to trade off reduced delay against the complexity of the G and P logic

• It is typically 4 or more, but the structure is easier to see for k=2




Fig 6.4 Carry Lookahead Adder for Group Size k = 2

[Figure: 8-bit carry lookahead adder with group size k = 2. Rows: adders (FA cells with inputs $x_j$, $y_j$ and outputs $s_j$, j = 0 to 7), compute generate and propagate ($G_j$, $P_j$), lookahead level 1, lookahead level 2, lookahead level 3. Group signals $G^l_j$, $P^l_j$ at level l combine to produce the carries $c_1$ through $c_7$ from $c_0$.]




Fig 6.5 Digital Multiplication Schema

[Figure: multiplication of the 4-digit multiplicand $x_3x_2x_1x_0$ by the 4-digit multiplier $y_3y_2y_1y_0$. Partial product $pp_j$ has the five digits $(xy_j)_4 \ldots (xy_j)_0$ and is offset j digit positions to the left; the partial products $pp_0$ to $pp_3$ sum to the 8-digit product $p_7 \ldots p_0$.]

p: product, pp: partial product




Serial by Digit of Multiplier, Then by Digit of Multiplicand

• If c ≤ b−1 on the RHS of 9, then c ≤ b−1 on the LHS of 9 because $0 \le p_{j+i}, x_i, y_j \le b-1$

• Lines 8 and 9 both use the values of $p_{j+i}$ and c from before line 8

1. for i := 0 step 1 until 2m-1

2. $p_i$ := 0;

3. for j := 0 step 1 until m-1

4. begin

5. c := 0;

6. for i := 0 step 1 until m-1

7. begin

8. $p_{j+i}$ := ($p_{j+i}$ + $x_i y_j$ + c) mod b;

9. c := ⌊($p_{j+i}$ + $x_i y_j$ + c)/b⌋;

10. end;

11. $p_{j+m}$ := c;

12. end;
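The digit-serial algorithm translates almost line for line into Python. In this sketch the inner loop runs over every multiplicand digit starting at i = 0, and the digit sum is held in a temporary so that the new digit and the new carry come from the same value. Digit lists are least significant first; names are illustrative.

```python
def multiply_digits(x, y, b):
    """Multiply two m-digit base-b numbers given as digit lists (LSD first).

    Returns the 2m-digit product as a digit list (LSD first).
    """
    m = len(x)
    p = [0] * (2 * m)                         # lines 1-2: clear the product
    for j in range(m):                        # one pass per multiplier digit
        c = 0
        for i in range(m):                    # one step per multiplicand digit
            total = p[j + i] + x[i] * y[j] + c
            p[j + i] = total % b              # line 8
            c = total // b                    # line 9, from the same total
        p[j + m] = c                          # line 11
    return p


def digits_value(digits, b):
    """Value of a digit list stored least significant digit first."""
    return sum(d * b**i for i, d in enumerate(digits))


if __name__ == "__main__":
    # 123 * 456 in base 10
    product_digits = multiply_digits([3, 2, 1], [6, 5, 4], 10)
    print(digits_value(product_digits, 10))   # 56088
```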




Fig 6.6 Parallel Array Multiplier for Unsigned Base b Numbers

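[Figure: array of digit cells. Each cell has inputs x, y, $p_k$(in), and $c_{in}$ and outputs $p_k$(out) and $c_{out}$. The array takes multiplicand digits $x_0, x_1, x_2, \ldots$, multiplier digits $y_0, y_1, y_2, \ldots$, and 0 inputs at its edges, and produces the product digits $p_0, p_1, p_2, \ldots, p_{2m-1}$.]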




Operation of the Parallel Multiplier Array

• Each box in the array does the base b digit calculations $p_k(\text{out}) := (p_k(\text{in}) + x y + c(\text{in})) \bmod b$ and $c(\text{out}) := \lfloor (p_k(\text{in}) + x y + c(\text{in}))/b \rfloor$

• Inputs and outputs of boxes are single base b digits, including the carries

• The worst case path from an input to an output is about 6m gates if each box is a 2 level circuit

• In base 2, the digit boxes are just full adders with an extra AND gate to compute xy




Series Parallel Multiplication Algorithm

• Hardware multiplies the full multiplicand by one multiplier digit and adds it to a running product

• The operation needed is $p := p + x y_j b^j$

• Multiplication by $b^j$ is done by scaling $x y_j$, shifting it left, or shifting p right, by j digits

• Except in base 2, the generation of the partial product $x y_j$ is more difficult than the shifted add

• In base 2, the partial product is either x or 0




Fig 6.7 Unsigned Series Parallel Multiplication Hardware

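[Figure: a partial product generator takes the multiplicand digits $x_{m-1}, x_{m-2}, \ldots, x_2, x_1, x_0$ and the multiplier digit $y_j$; its output goes to an (m + 1)-digit adder, which works with a 2m-digit right shift register p that has 0 as its shift input.]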




Steps for Using the Unsigned Series Parallel Multiplier

1) Clear product shift register p.

2) Initialize multiplier digit number j = 0.

3) Form the partial product $x y_j$.

4) Add partial product to upper half of p.

5) Increment j = j + 1, and if j = m go to step 8.

6) Shift p right one digit.

7) Repeat from step 3.

8) The 2m digit product is in the p register.




Multiply with Fixed Length Words: Integer and Fraction Multiply

• If words can store only m digits, and the radix point is in a fixed position in the word, 2 positions make sense: integer, right end, and fraction, left end

• In integer multiply, overflow occurs if any of the upper m digits of the 2m-digit product ≠ 0

• In fraction multiply, the upper m digits are the most significant, and the lower m digits are discarded or rounded to give an m-digit fraction




Signed Multiplication

• The sign of the product can be computed immediately from the signs of the operands

• For complement numbers, negative operands can be complemented, their magnitudes multiplied, and the product recomplemented if necessary

• A complement representation multiplicand can be handled by a b’s complement adder for partial products and sign extension for the shifts

• A 2’s complement multiplier is handled by the formula for a 2’s complement value: add all PP’s except last, subtract it.

$$\mathrm{value}(x) = -x_{m-1} 2^{m-1} + \sum_{i=0}^{m-2} x_i 2^i \qquad \text{(Eq. 6.25)}$$




Fig 6.8 2’s Complement Multiplier Hardware

[Figure: labeled parts are an m-bit multiplicand register, an m-bit multiplier shift register, an (m + 1)-bit 2’s complement adder with subtract control and carry in, a 2m-bit accumulator shift register, and sign extension on the shift.]




Steps for Using the 2’s Complement Multiplier Hardware

1) Clear the bit counter and partial product accumulator register.

2) Add the product (AND) of the multiplicand and rightmost multiplier bit.

3) Shift accumulator and multiplier registers right one bit.

4) Count the multiplier bit and repeat from 2 if count less than m−1.

5) Subtract the product of the multiplicand and bit m−1 of the multiplier.

Note: multiplier bits are used at the same rate that product bits are produced
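The rule "add all partial products except the last, subtract it" can be sketched with ordinary integers. This is not a register-level model of the hardware: the shifts are applied to the multiplicand rather than to an accumulator, and the function name is illustrative.

```python
def twos_complement_multiply(x, y, m):
    """Product of signed integers x and y, using the m-bit 2's complement bits of y.

    Partial products for bits 0..m-2 are added; the one for bit m-1 is subtracted.
    """
    y_bits = y % (1 << m)                 # m-bit 2's complement pattern of y
    acc = 0
    for i in range(m - 1):
        if (y_bits >> i) & 1:
            acc += x << i                 # add x * 2^i
    if (y_bits >> (m - 1)) & 1:
        acc -= x << (m - 1)               # sign bit has weight -2^(m-1)
    return acc


if __name__ == "__main__":
    # -5/8 times 6/8, held as integers scaled by 8: product is -30/64
    print(twos_complement_multiply(-5, 6, 4))   # -30
    print(twos_complement_multiply(6, -5, 4))   # -30
```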




Examples of 2’s Complement Multiplication

Negative multiplicand:

  −5/8 =    1.011

×  6/8 =  × 0.110

pp0      00.000

acc.     00.0000      (add and shift)

pp1      11.011

acc.     11.10110     (add and shift)

pp2      11.011

acc.     11.100010    (add and shift)

pp3      00.000

res.     11.100010    (add)

Negative multiplier:

   6/8 =    0.110

× −5/8 =  × 1.011

pp0      00.110

acc.     00.0110      (add and shift)

pp1      00.110

acc.     00.10010     (add and shift)

pp2      00.000

acc.     00.010010    (add and shift)

pp3      11.010

res.     11.100010    (add)




Booth Recoding and Similar Methods

• Forms the basis for a number of signed multiplication algorithms

• Based upon recoding the multiplier, y, to a recoded value, z.

• The multiplicand remains unchanged.

• Uses signed digit (SD) encoding:

• Each digit can assume three values instead of just 2: +1, 0, and −1, encoded as 1, 0, and $\bar{1}$. This is known as signed digit (SD) notation.




A 2’s Complement Integer’s Value Can Be Represented as:

$$\mathrm{value}(y) = -y_{m-1} 2^{m-1} + \sum_{i=0}^{m-2} y_i 2^i \qquad \text{(Eq 6.26)}$$

This means that the value can be computed by adding the weighted values of all the digits except the most significant, and subtracting that digit.




Example: Represent -5 in SD Notation

$-5 = 1011$ in 2’s complement notation

$1011 = \bar{1}011 = -8 + 0 + 2 + 1 = -5$ in SD notation




The Booth Algorithm (Sometimes Known as “Skipping Over 1’s.”)

Consider −1 = 1111. In SD notation this can be represented as $1000\bar{1} = 2^4 - 1$; the $2^4$ digit lies outside a 4-bit word, which leaves $000\bar{1} = -1$.

The Booth method is:

1. Working from lsb to msb, replace each 0 digit of the original number with 0 in the recoded number until a 1 is encountered.

2. When a 1 is encountered, insert a $\bar{1}$ in that position in the recoded number, and skip over any succeeding 1’s until a 0 is encountered.

3. Replace that 0 with a 1. If you encounter the msb without encountering a 0, stop and do nothing.




Example of Booth Recoding

$0011\ 1101\ 1001 = 512 + 256 + 128 + 64 + 16 + 8 + 1 = 985$

↓

$0100\ 0\bar{1}10\ \bar{1}01\bar{1} = 1024 - 64 + 32 - 8 + 2 - 1 = 985$




Tbl 6.4 Booth Recoding Table

$y_i$    $y_{i-1}$    $z_i$        Value    Situation

0        0            0            0        String of 0’s

0        1            1            +1       End of string of 1’s

1        0            $\bar{1}$    −1       Begin string of 1’s

1        1            0            0        String of 1’s

Consider pairs of numbers, $y_i$, $y_{i-1}$. Recoded value is $z_i$.

Algorithm can be done in parallel.

Examine the example of multiplication 6.11 in text.
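Every row of the recoding table satisfies $z_i = y_{i-1} - y_i$, with $y_{-1}$ taken as 0. That observation gives the short sketch below, which uses −1 for the digit $\bar{1}$; bit lists are least significant first and the names are illustrative.

```python
def booth_recode(y_bits):
    """Booth-recode a 2's complement bit list (LSB first) into digits -1, 0, +1."""
    z = []
    previous = 0                      # y_{-1} = 0
    for yi in y_bits:
        z.append(previous - yi)       # table lookup on the pair (y_i, y_{i-1})
        previous = yi
    return z


def sd_value(z):
    """Value of a signed-digit list stored least significant digit first."""
    return sum(d << i for i, d in enumerate(z))


if __name__ == "__main__":
    bits = [(985 >> i) & 1 for i in range(12)]    # 0011 1101 1001
    z = booth_recode(bits)
    print(z[::-1])        # [0, 1, 0, 0, 0, -1, 1, 0, -1, 0, 1, -1]
    print(sd_value(z))    # 985
```

Each $z_i$ depends only on two adjacent multiplier bits, which is why the recoding can be done in parallel.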




Recoding Using Bit-Pair Recoding

• Booth method may actually increase number of multiplies.

• Consider pairs of digits, and recode each pair into 1 digit.

• Derive Table 6.5, pg. 279, on the blackboard to show how bit-pair recoding works.

• Demonstrate Example 6.13 on the blackboard as an example of multiplication using bit-pair recoding.

• There are many variants on this approach.




Digital Division: Terminology and Number Sizes

• A dividend is divided by a divisor to get a quotient and a remainder

• A 2m digit dividend divided by an m digit divisor does not necessarily give an m digit quotient and remainder

• If the divisor is 1, for example, an integer quotient is the same size as the dividend

• If a fraction D is divided by a fraction d, the quotient is only a fraction if D<d

• If D≥d, a condition called divide overflow occurs in fraction division




Fig 6.9 Unsigned Binary Divide Hardware

• 2m-bit dividend register

• m-bit divisor

• m-bit quotient

• Divisor can be subtracted from dividend or not

[Figure: labeled parts are a dividend left shift register, a divisor register, a subtractor whose positive result signal controls the load of the dividend register, and a quotient left shift register.]




Use of Division Hardware for Integer Division

1) Put dividend in lower half of register and clear upper half. Put divisor in divisor register. Initialize quotient bit counter to zero.

2) Shift dividend register left one bit.

3) If difference positive, shift 1 into quotient and replace upper half of dividend by difference. If negative, shift 0 into quotient.

4) If fewer than m quotient bits, repeat from 2.

5) m bit quotient is an integer, and an m bit integer remainder is in upper half of dividend register.




Use of Division Hardware for Fraction Division

1) Put dividend in upper half of dividend register and clear lower half. Put divisor in divisor register. Initialize quotient bit counter to zero.

2) If difference positive, report divide overflow.

3) Shift dividend register left one bit.

4) If difference positive, shift 1 into quotient and replace upper part of dividend by difference. If negative, shift 0 into the quotient.

5) If fewer than m quotient bits, repeat from 3.

6) m bit quotient has binary point at the left, and remainder is in upper part of dividend register.




Integer Binary Division Example: D = 45, d = 6, q = 7, r = 3

Init.   D 000000 101101               d 000110

        D 000001 01101-               d 000110    diff(−)

        D 000010 1101--    q 0        d 000110    diff(−)

        D 000101 101---    q 00       d 000110    diff(−)

        D 001011 01----    q 000      d 000110    diff(+)

        D 001010 1-----    q 0001     d 000110    diff(+)

        D 001001 ------    q 00011    d 000110    diff(+)

        rem. 000011        q 000111
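The shift-and-subtract steps for integer division can be simulated on Python integers. The sketch assumes the dividend fits in the lower m bits and the divisor is nonzero, and it treats a zero difference as positive; the function name is illustrative.

```python
def divide_unsigned(dividend, divisor, m):
    """Shift-and-subtract division: returns (quotient, remainder).

    The dividend starts in the lower half of a 2m-bit register D.
    """
    low_mask = (1 << m) - 1
    D = dividend & low_mask               # upper half cleared
    q = 0
    for _ in range(m):
        D <<= 1                           # shift dividend register left one bit
        diff = (D >> m) - divisor         # upper half minus divisor
        if diff >= 0:
            q = (q << 1) | 1              # shift 1 into the quotient
            D = (diff << m) | (D & low_mask)   # replace upper half by difference
        else:
            q = q << 1                    # shift 0 into the quotient
    return q, D >> m                      # remainder is in the upper half


if __name__ == "__main__":
    print(divide_unsigned(45, 6, 6))      # (7, 3)
```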




Fig 6.10 Parallel Array Divider

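[Figure: array of divider cells. Each cell has inputs D, d, borrow-in $b_i$, and control c, and outputs R and borrow-out $b_o$, with $R := (c \rightarrow D : \neg c \rightarrow (D - d - b_i) \bmod 2)$; the borrow is always computed. The array takes divisor bits $d_1, d_2, \ldots, d_m$ and dividend bits $D_1, D_2, \ldots, D_m, D_{m+1}, \ldots, D_{2m}$, with 0 inputs at its edges, and produces quotient bits $q_1, q_2, \ldots, q_m$ and remainder bits $r_1, r_2, \ldots, r_m$.]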




Branching on Arithmetic Conditions

• An ALU with two m-bit operands produces more than just an m-bit result

• The carry from the left bit and the true/false value of 2’s complement overflow are useful

• There are 3 common ways of using outcome of compare (subtract) for a branch condition

1) Do the compare in the branch instruction

2) Set special condition code bits and test them in the branch

3) Set a general register to a comparison outcome and branch on this logical value




Drawbacks of Condition Codes

• Condition codes are extra processor state; set and overwritten by many instructions

• Setting and use of CCs also introduces hazards in a pipelined design

• CCs are a scarce resource; they must be used before being set again

• The PowerPC has 8 sets of CC bits

• CCs are processor state that must be saved and restored during exception handling




Drawbacks of Comparison in Branch and Set General Register

• Branch instruction length: it must specify 2 operands to be compared, branch target, and branch condition (possibly place for link)

• Amount of work before branch decision: it must use the ALU and test its output—this means more branch delay slots in pipeline

• Setting a general register to a particular outcome of a compare, say ≤ unsigned, uses a register of 32 or more bits for a true/false value




Use of Condition Codes: MC68000

• The HLL statement: if (A > B) then C = D

translates to the MC68000 code:

For 2’s comp. A and B:

        MOVE.W  A, D0

        CMP.W   B, D0

        BLE     Over

        MOVE.W  D, C

Over:   . . .

For unsigned A and B:

        MOVE.W  A, D0

        CMP.W   B, D0

        BLS     Over

        MOVE.W  D, C

Over:   . . .




Standard Condition Codes: NZVC

• Assume compare does the subtraction s = x − y

• N: negative result, $s_{m-1} = 1$

• Z: zero result, s = 0

• V: 2’s complement overflow, $x_{m-1}\bar{y}_{m-1}\bar{s}_{m-1} + \bar{x}_{m-1}y_{m-1}s_{m-1}$

• C: carry from leftmost bit position, $s_m = 1$

• Information in N, Z, V, and C determines several signed & unsigned relations of x and y




Correspondence of Conditions and NZVC Bits

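Condition    Unsigned Integers            Signed Integers

carry out    C                            C

overflow     C                            V

negative     n.a.                         N

>            $\bar{C} \cdot \bar{Z}$      $(N \cdot V + \bar{N} \cdot \bar{V}) \cdot \bar{Z}$

≥            $\bar{C}$                    $N \cdot V + \bar{N} \cdot \bar{V}$

=            Z                            Z

≠            $\bar{Z}$                    $\bar{Z}$

≤            C + Z                        $(N \cdot \bar{V} + \bar{N} \cdot V) + Z$

<            C                            $N \cdot \bar{V} + \bar{N} \cdot V$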
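The correspondence can be checked exhaustively for a short word. The sketch below assumes the subtraction s = x − y is carried out to m + 1 bits, so that C = $s_m$ = 1 means a borrow (x < y as unsigned numbers), the convention the MC68000 branches BLS and BLE rely on. The function names are illustrative.

```python
def compare_flags(x, y, m):
    """N, Z, V, C for s = x - y on m-bit patterns x and y."""
    wide = (x - y) % (1 << (m + 1))       # (m+1)-bit difference
    s = wide & ((1 << m) - 1)
    xs, ys, ss = (x >> (m - 1)) & 1, (y >> (m - 1)) & 1, (s >> (m - 1)) & 1
    N = ss
    Z = int(s == 0)
    V = (xs & (1 - ys) & (1 - ss)) | ((1 - xs) & ys & ss)
    C = (wide >> m) & 1                   # s_m: set when a borrow occurred
    return N, Z, V, C


def signed(v, m):
    """Signed value of the m-bit pattern v."""
    return v - (1 << m) if (v >> (m - 1)) & 1 else v


def table_is_consistent(m=4):
    for x in range(1 << m):
        for y in range(1 << m):
            N, Z, V, C = compare_flags(x, y, m)
            less_signed = N ^ V           # N and V differ
            checks = [
                (x > y) == bool((1 - C) & (1 - Z)),
                (x >= y) == bool(1 - C),
                (x <= y) == bool(C | Z),
                (x < y) == bool(C),
                (signed(x, m) > signed(y, m)) == bool((1 - less_signed) & (1 - Z)),
                (signed(x, m) >= signed(y, m)) == bool(1 - less_signed),
                (signed(x, m) <= signed(y, m)) == bool(less_signed | Z),
                (signed(x, m) < signed(y, m)) == bool(less_signed),
                (x == y) == bool(Z),
            ]
            if not all(checks):
                return False
    return True


if __name__ == "__main__":
    print(table_is_consistent())   # True
```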




Branches That Do Not Use Condition Codes

• SRC compares a single number to zero

• The simple comparison can be completed in pipeline stage 2

• The MIPS R2000 compares 2 numbers using a branch of the form: bgtu R1, R2, Lbl

• Different branch instructions are needed for each signed or unsigned condition

• The MIPS R2000 also allows setting a general register to 1 or 0 on a compare outcome: sgtu R3, R1, R2




ALU Logical, Shift, and Rotate Instructions

• Shifts are often combined with logic to extract bit fields from, or insert them into, full words

• A MC68000 example extracts bits 30..23 of a 32-bit word (exponent of a floating-point number)

MOVE.L D0, D1 ;Get # into D1

ROL.L #9, D1 ;exponent to bits 7..0

ANDI.L #FFH, D1 ;clear bits 31..8

• MC68000 shifts take 8 + 2n clocks, where n = shift count, so ROL.L #9 is better than SHR.L #23 in the above example




Types and Speed of Shift Instructions

• Rotate right is equivalent to rotate left with a different shift count

• Rotates can include the carry or not

• Two right shifts, one with sign extend, are needed to scale unsigned and signed numbers

• Only a zero fill left shift is needed for scaling

• Shifts whose execution time depends on the shift count use a single-bit ALU shift repeatedly, as we did for SRC in Chap. 4

• Fast shifts, important for pipelined designs, can be done with a barrel shifter




Fig 6.11 A N × N Bit Crossbar Design for Barrel Rotator

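[Figure: crossbar with input lines $x_0 \ldots x_5$ (x: input) and output lines $y_0 \ldots y_5$ (y: output); a decoder driven by the shift count selects which crosspoints connect inputs to outputs.]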




Properties of the Crossbar Barrel Shifter

• There is a 2 gate delay for any length shift

• Each output line is effectively an n way multiplexer for shifts of up to n bits

• There are $n^2$ 3-state drivers for an n bit shifter

• For n = 32, this means 1024 3-state drivers

• For 32 bits, the decoder is 5 bits to 1 out of 32

• The minimum delay but large number of gates in the crossbar prompts a compromise: the logarithmic barrel shifter




Fig 6.12 Barrel Shifter with a Logarithmic Number of Stages

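[Figure: 32-bit barrel shifter with a logarithmic number of stages. The input word $x_0 \ldots x_{31}$ passes through rows of shift/bypass cells to the output word $y_0 \ldots y_{31}$; the shift count bits $s_4 s_3 s_2 s_1 s_0$ each select bypass or shift for one row (bypass/shift 1 bit right, bypass/shift 2 bits right, and so on).]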
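The logarithmic shifter can be modeled as a loop over stages. The sketch assumes that stage k is controlled by shift count bit $s_k$ and either bypasses the word or shifts it right by $2^k$ bits with zero fill; the function name is illustrative.

```python
def log_barrel_shift_right(word, count, n=32):
    """Logical right shift of an n-bit word using log2(n) shift/bypass stages."""
    mask = (1 << n) - 1
    word &= mask
    stages = n.bit_length() - 1           # 5 stages for n = 32
    for k in range(stages):
        if (count >> k) & 1:              # count bit s_k selects shift or bypass
            word >>= 1 << k               # this stage shifts by 2^k bits
    return word


if __name__ == "__main__":
    print(hex(log_barrel_shift_right(0xDEADBEEF, 12)))   # 0xdeadb
```

Five stages of 32 cells replace the 1024 drivers of the crossbar, at the cost of a delay that grows with the number of stages.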




Elements of a Complete ALU

• In addition to the arithmetic hardware, there must be a controller for multistep operations, such as series parallel multiply

• The shifter is usually a separate unit, and may have lots of gates if it is to be fast

• Logic operations are usually simple

• The arithmetic unit may need to produce condition codes as well as a result number

• Multiplexers select the result and condition codes from the correct subunit




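Fig 6.13 A Possible Design for an ALU

[Figure: n-bit inputs x and y feed arithmetic, logic, and shifter subunits; a control block driven by the opcode, together with the shift count, steers multiplexers that select the n-bit result z and the condition codes.]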




Floating-Point Preliminaries: Scaled Arithmetic

• Software can use arithmetic with a fixed binary point position, say left end, and keep a separate scale factor e for a number $f \times 2^e$

• Add or subtract on numbers with same scale is simple, since $f \times 2^e + g \times 2^e = (f + g) \times 2^e$

• Even with same scale for operands, scale of result is different for multiply and divide

$(f \times 2^e) \cdot (g \times 2^e) = (f \cdot g) \times 2^{2e}$; $(f \times 2^e) \div (g \times 2^e) = f \div g$

• Since scale factors change, general expressions lead to a different scale factor for each number: floating-point representation




Fig 6.14 Floating-Point Number Format

• s is sign, e is exponent, and f is significand

• We will assume a fraction significand, but some representations have used integers

[Figure: an m-bit word with three fields, left to right: sign s (1 bit), exponent e ($m_e$ bits), fraction f ($m_f$ bits).]

$1 + m_e + m_f = m$, $\mathrm{Value}(s, e, f) = (-1)^s \times f \times 2^e$




Signs in Floating-Point Numbers

• Both significand and exponent have signs

• A complement representation could be used for f, but sign magnitude is most common now

• The sign is placed at the left instead of with f so test for negative always looks at left bit

• The exponent could be 2’s complement, but it is better to use a biased exponent

• If $-e_{min} \le e \le e_{max}$, where $e_{min}, e_{max} > 0$, then $\hat{e} = e_{min} + e$ is always positive, so e is replaced by $\hat{e}$

• We will see that a sign at the left, and a positive exponent left of the significand helps compare




Exponent Base and Floating Point Number Range

• In a floating point format using 24 out of 32 bits for significand, 7 would be left for exponent

• A number x would have a magnitude $2^{-64} \le x \le 2^{63}$, or about $10^{-19} \le x \le 10^{19}$

• For more exponent range, bits of significand would have to be given up with loss of accuracy

• An alternative is an exponent base > 2

• IBM used exponent base 16 in the 360/370 series for a magnitude range about $10^{-75} \le x \le 10^{75}$

• Then 1 unit change in e corresponds to a binary point shift of 4 bits




Normalized Floating-Point Numbers

• There are multiple representations for a floating-point number

• If $f_1$ and $f_2 = 2^d f_1$ are both fractions and $e_2 = e_1 - d$, then $(s, f_1, e_1)$ and $(s, f_2, e_2)$ have same value

• Scientific notation example: $0.819 \times 10^3 = 0.0819 \times 10^4$

• A normalized floating-point number has a leftmost digit nonzero (exponent small as possible)

• With exponent base b, this is a base-b digit: for the IBM format the leftmost 4 bits (base 16) are ≠ 0

• Zero cannot fit this rule; usually written as all 0s

• In normal base 2, left bit = 1, so it can be left out

• So-called hidden bit




Comparison of Normalized Floating Point Numbers

• If normalized numbers are viewed as integers, a biased exponent field to the left means an exponent unit is more than a significand unit

• The largest magnitude number with a given exponent is followed by the smallest one with the next higher exponent

• Thus normalized FP numbers can be compared for <, ≤, >, ≥, =, ≠ as if they were integers

• This is the reason for the s,e,f ordering of the fields and the use of a biased exponent, and one reason for normalized numbers




Fig 6.15 IEEE Single-Precision Floating Point Format

• Exponent bias is 127 for normalized #s

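$\hat{e}$    e       Value                                              Type

255          none    none                                               Infinity or NaN

254          127     $(-1)^s \times (1.f_1f_2\ldots) \times 2^{127}$    Normalized

...          ...     ...                                                ...

2            −125    $(-1)^s \times (1.f_1f_2\ldots) \times 2^{-125}$   Normalized

1            −126    $(-1)^s \times (1.f_1f_2\ldots) \times 2^{-126}$   Normalized

0            −126    $(-1)^s \times (0.f_1f_2\ldots) \times 2^{-126}$   Denormalized

[Figure: 32-bit word with sign s in bit 0, exponent $\hat{e}$ in bits 1 to 8, and fraction $f_1f_2 \ldots f_{23}$ in bits 9 to 31, numbered from the left.]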




Special Numbers in IEEE Floating Point

• An all-zero number is a normalized 0

• Other numbers with biased exponent e = 0 are called denormalized

• Denorm numbers have a hidden bit of 0 and an exponent of −126; they may have leading 0s

• Numbers with biased exponent of 255 are used for ±∞ and other special values, called NaN (not a number)

• For example, one NaN represents 0/0
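The single-precision rules can be applied by hand to a 32-bit pattern and compared with the machine's own interpretation through Python's `struct` module. Note that the shifts below use the common numbering with the sign as bit 31, while the figure numbers bits from the left; `decode_single` is an illustrative name.

```python
import struct


def decode_single(bits):
    """Value of a 32-bit IEEE single-precision pattern, decoded field by field."""
    s = (bits >> 31) & 1
    e_hat = (bits >> 23) & 0xFF           # biased exponent
    f = bits & 0x7FFFFF                   # 23 fraction bits
    sign = -1.0 if s else 1.0
    if e_hat == 255:                      # infinity or NaN
        return sign * float("inf") if f == 0 else float("nan")
    if e_hat == 0:                        # zero or denormalized: hidden bit is 0
        return sign * (f / 2**23) * 2.0**-126
    return sign * (1 + f / 2**23) * 2.0**(e_hat - 127)   # normalized: hidden bit 1


def machine_single(bits):
    """The same pattern interpreted by struct as a big-endian float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


if __name__ == "__main__":
    print(decode_single(0x3F800000))   # 1.0
    print(decode_single(0xC0000000))   # -2.0
    print(decode_single(0x00000001) == 2.0**-149)   # True, smallest denormalized
```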




Fig 6.16 IEEE Standard, Double-Precision, Binary Floating Point Format

• Exponent bias for normalized numbers is 1023

• The denorm biased exponent of 0 corresponds to an unbiased exponent of −1022

• Infinity and NaNs have a biased exponent of 2047

• Range increases from about $10^{-38} \le |x| \le 10^{38}$ to about $10^{-308} \le |x| \le 10^{308}$

[Figure: 64-bit word with sign s in bit 0, exponent $\hat{e}$ in bits 1 to 11, and fraction $f_1f_2 \ldots f_{52}$ in bits 12 to 63, numbered from the left.]




Decimal Floating-Point Add and Subtract Examples

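Operands                  Alignment                   Normalize & round

$6.144 \times 10^2$       $0.06144 \times 10^4$       $1.003644 \times 10^5$

$+9.975 \times 10^4$      $+9.975 \times 10^4$        $+\ .0005 \times 10^5$

                          $10.03644 \times 10^4$      $1.004 \times 10^5$

Operands                  Alignment                   Normalize & round

$1.076 \times 10^{-7}$    $1.076 \times 10^{-7}$      $7.7300 \times 10^{-9}$

$-9.987 \times 10^{-8}$   $-0.9987 \times 10^{-7}$    $+\ .0005 \times 10^{-9}$

                          $0.0773 \times 10^{-7}$     $7.730 \times 10^{-9}$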




Floating Add, FA, and Floating Subtract, FS, Procedure

Add or subtract $(s_1, e_1, f_1)$ and $(s_2, e_2, f_2)$

1) Unpack (s, e, f); handle special operands

2) Shift fraction of number with smaller exponent right by $|e_1 - e_2|$ bits

3) Set result exponent $e_r = \max(e_1, e_2)$

4) For FA and $s_1 = s_2$ or FS and $s_1 \ne s_2$, add significands, otherwise subtract them

5) Count lead zeros, z; carry can make z = −1; shift left z bits or right 1 bit if z = −1

6) Round result, shift right, and adjust z if rounding overflow occurs

7) $e_r \leftarrow e_r - z$; check over- or underflow; bias and pack




Fig 6.17 Hardware Structure for Floating-Point Add and Subtract

• Adders for exponents and significands

• Shifters for alignment and normalize

• Multiplexers for exponent and swap of significands

• Lead zeros counter

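[Figure: inputs are $s_1$, $e_1$, $f_1$, $s_2$, $e_2$, $f_2$ and the FA/FS control; outputs are $s_r$, $e_r$, $f_r$. Labeled blocks: exponent subtractor (producing the sign and $|e_1 - e_2|$), swap, alignment shifter, significand adder/subtractor, lead zeros counter, normalize and round, select, subtract and bias, and sign computation. Exponent paths are $m_e$ bits wide, significand paths $m_f$ bits (plus rounding bits inside the datapath), and the lead zeros count $m_z$ bits.]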




Decimal Floating-Point Examples for Multiply and Divide

• Multiply fractions and add exponents

Sign, fraction & exponent              Normalize & round

$(-0.1403 \times 10^{-3})$             $-0.4238463 \times 10^2$

$\times (+0.3021 \times 10^6)$         $-0.00005 \times 10^2$

$-0.04238463 \times 10^{-3+6}$         $-0.4238 \times 10^2$

• Divide fractions and subtract exponents

Sign, fraction & exponent              Normalize & round

$(-0.9325 \times 10^2)$                $+0.9306387 \times 10^9$

$\div (-0.1002 \times 10^{-6})$        $+0.00005 \times 10^9$

$+9.306387 \times 10^{2-(-6)}$         $+0.9306 \times 10^9$




Floating-Point Multiply of Normalized Numbers

Multiply $(s_r, e_r, f_r) = (s_1, e_1, f_1) \times (s_2, e_2, f_2)$

1) Unpack (s, e, f); handle special operands

2) Compute $s_r = s_1 \oplus s_2$; $e_r = e_1 + e_2$; $f_r = f_1 \times f_2$

3) If necessary, normalize by 1 left shift and subtract 1 from $e_r$; round and shift right if rounding overflow occurs

4) Handle overflow for exponent too positive and underflow for exponent too negative

5) Pack result, encoding or reporting exceptions




Floating-Point Divide of Normalized Numbers

Divide $(s_r, e_r, f_r) = (s_1, e_1, f_1) \div (s_2, e_2, f_2)$

1) Unpack (s, e, f); handle special operands

2) Compute $s_r = s_1 \oplus s_2$; $e_r = e_1 - e_2$; $f_r = f_1 \div f_2$

3) If necessary, normalize by 1 right shift and add 1 to $e_r$; round and shift right if rounding overflow occurs

4) Handle overflow for exponent too positive and underflow for exponent too negative

5) Pack result, encoding or reporting exceptions




Chapter 6 Summary

• Digital number representations and algebraic tools for the study of arithmetic
• Complement representation for addition of signed numbers
• Fast addition by large base and carry lookahead
• Fixed point multiply and divide overview
• Nonnumeric aspects of ALU design
• Floating-point number representations
• Procedures and hardware for floating-point addition and subtraction
• Floating-point multiply and divide procedures


7-1 Chapter 7—Memory System Design


Chapter 7: Memory System Design

Topics

7.1 Introduction: The Components of the Memory System

7.2 RAM Structure: The Logic Designer’s Perspective

7.3 Memory Boards and Modules

7.4 Two-Level Memory Hierarchy

7.5 The Cache

7.6 Virtual Memory

7.7 The Memory Subsystem in the Computer




Introduction

So far, we’ve treated memory as an array of words limited in size only by the number of address bits. Life is seldom so easy...

Real world issues arise:
• cost
• speed
• size
• power consumption
• volatility...

What other issues can you think of that will influence memory design?




In this chapter we will cover:
• Memory components:
    • RAM memory cells and cell arrays
    • Static RAM: more expensive, but less complex
    • Tree and matrix decoders: needed for large RAM chips
    • Dynamic RAM: less expensive, but needs “refreshing”
        • Chip organization
        • Timing
    • ROM: Read-only memory
• Memory boards
    • Arrays of chips give more addresses and/or wider words
    • 2-D and 3-D chip arrays
• Memory modules
    • Large systems can benefit by partitioning memory for
        • separate access by system components
        • fast access to multiple words

–more–




In this chapter we will also cover:

• The memory hierarchy: from fast and expensive to slow and cheap
    • Example: Registers → Cache → Main Memory → Disk
    • At first, consider just two adjacent levels in the hierarchy
    • The cache: High speed and expensive
        • Kinds: Direct mapped, associative, set associative
    • Virtual memory: makes the hierarchy transparent
        • Translate the address from CPU’s logical address to the physical address where the information is actually stored
        • Memory management: how to move information back and forth
        • Multiprogramming: what to do while we wait
        • The “TLB” helps in speeding the address translation process
• Overall consideration of the memory as a subsystem




Fig 7.1 The CPU–Memory Interface

Sequence of events:

Read:
1. CPU loads MAR, issues Read, and REQUEST
2. Main memory transmits words to MDR
3. Main memory asserts COMPLETE

Write:
1. CPU loads MAR and MDR, asserts Write, and REQUEST
2. Value in MDR is written into address in MAR
3. Main memory asserts COMPLETE

–more–

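[Figure 7.1: the CPU (register file, m-bit MAR, w-bit MDR, control signals) connects to main memory (s-bit locations at addresses 0 through 2^m – 1) over the address bus A0 – Am–1, the b-bit data bus D0 – Db–1, and the R/W, REQUEST, and COMPLETE lines.]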




Fig 7.1 The CPU–Memory Interface (cont’d.)

Additional points:
• If b < w, main memory must make w/b b-bit transfers
• Some CPUs allow reading and writing of word sizes < w
  Example: Intel 8088: m = 20, w = 16, s = b = 8. 8- and 16-bit values can be read and written
• If memory is sufficiently fast, or if its response is predictable, then COMPLETE may be omitted
• Some systems use separate R and W lines, and omit REQUEST





Tbl 7.1 Some Memory Properties

Symbol   Definition                            Intel 8088      Intel 8086      PowerPC 601
w        CPU word size                         16 bits         16 bits         64 bits
m        Bits in a logical memory address      20 bits         20 bits         32 bits
s        Bits in smallest addressable unit     8 bits          8 bits          8 bits
b        Data bus size                         8 bits          16 bits         64 bits
2^m      Memory word capacity, s-sized words   2^20 words      2^20 words      2^32 words
2^m × s  Memory bit capacity                   2^20 × 8 bits   2^20 × 8 bits   2^32 × 8 bits




Big-Endian and Little-Endian Storage

When data types having a word size larger than the smallest addressable unit are stored in memory, the question arises:

“Is the least significant part of the word stored at the lowest address (little-Endian, little end first), or is the most significant part of the word stored at the lowest address (big-Endian, big end first)?”

Example: The hexadecimal 16-bit number ABCDH (msb AB, lsb CD), stored at address 0:

Address   Little-Endian   Big-Endian
0         CD              AB
1         AB              CD
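The two byte orders for ABCDH can be checked with Python's built-in integer conversions. This is a small illustration only; the variable names are arbitrary, and index 0 of each byte string plays the role of address 0.

```python
value = 0xABCD

# Index 0 of each bytes object corresponds to the lowest address.
little = value.to_bytes(2, byteorder="little")   # least significant byte first
big = value.to_bytes(2, byteorder="big")         # most significant byte first

# Reading the same two bytes back with the wrong convention swaps the halves.
misread = int.from_bytes(little, byteorder="big")
```

With these definitions, `little[0]` is CDH and `big[0]` is ABH, and `misread` is CDABH.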




Tbl 7.2 Memory Performance Parameters

Symbol            Definition          Units        Meaning
ta                Access time         time         Time to access a memory word
tc                Cycle time          time         Time from start of access to start of next access
k                 Block size          words        Number of words per block
ω                 Bandwidth           words/time   Word transmission rate
tl                Latency             time         Time to access first word of a sequence of words
tbl = tl + k/ω    Block access time   time         Time to access an entire block of words

(Information is often stored and moved in blocks at the cache and disk level.)
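The block access time formula from the table, tbl = tl + k/ω, can be written out directly. The numbers in the example are hypothetical, chosen only to show the units working out; they are not taken from the text.

```python
def block_access_time(latency, block_size, bandwidth):
    """tbl = tl + k/ω: latency plus the time to stream k words at ω words per unit time."""
    return latency + block_size / bandwidth


# Hypothetical disk-like device: 10 ms latency, 1024-word blocks, 1,000,000 words/s.
t_block = block_access_time(latency=0.010, block_size=1024, bandwidth=1_000_000)
```

For these assumed values the transfer term (about 1 ms) is small next to the latency (10 ms), which is one reason information is moved in blocks rather than word by word.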




Tbl 7.3 The Memory Hierarchy, Cost, and Performance

Some Typical Values:

Component         CPU                 Cache           Main Memory     Disk Memory     Tape Memory
Access type       Random access       Random access   Random access   Direct access   Sequential access
Capacity, bytes   64–1024             8–512 KB        8–64 MB         1–10 GB         1 TB
Latency           1–10 ns             20 ns           50 ns           10 ms           10 ms–10 s
Block size        1 word              16 words        16 words        4 KB            4 KB
Bandwidth         System clock rate   8 MB/s          1 MB/s          1 MB/s          1 MB/s
Cost/MB           High                $500            $30             $0.25           $0.02




Fig 7.3 Conceptual Structure of a Memory Cell

[Figure 7.3: a memory cell with Select, DataIn, DataOut, and R/W connections.]

Regardless of the technology, all RAM memory cells must provide these four functions: Select, DataIn, DataOut, and R/W.

This “static” RAM cell is unrealistic in practice, but it is functionally correct. We will discuss more practical designs later.




Fig 7.4 An 8-Bit Register as a 1-D RAM Array

[Figure 7.4: eight D cells with data lines d0–d7, all sharing one Select line and one R/W line.]

Data bus is bidirectional and buffered. (Why?)

The entire register is selected with one select line, and uses one R/W line




Fig 7.5 A 4 x 8 2-D Memory Cell Array

R/W is common to all

2-bit address

Bidirectional 8-bit buffered data bus

2-4 line decoder selects one of the four 8-bit arrays

[Figure 7.5: four rows of eight D cells; a 2–4 decoder driven by A1 and A0 selects one row; R/W is shared; data lines d0–d7.]




Fig 7.6 A 64 K x 1 Static RAM Chip

~square array fits IC design paradigm

Selecting rows separately from columns means only 256 x 2 = 512 circuit elements instead of 65536 circuit elements!

CS, Chip Select, allows chips in arrays to be selected individually

This chip requires 21 pins including power and ground, and so will fit in a 22-pin package.

[Figure 7.6: row address A0–A7 drives an 8–256 row decoder into a 256 × 256 cell array; column address A8–A15 drives one 256–1 mux and one 1–256 demux on the single data line; R/W and CS control the chip.]
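The row/column split of the 64 K x 1 chip can be sketched as follows. The field assignment (A0–A7 for the row, A8–A15 for the column) comes from the figure labels; the function names are hypothetical, and the code models only the address arithmetic, not the circuit.

```python
ROW_BITS = 8   # A0–A7
COL_BITS = 8   # A8–A15


def split_chip_address(addr):
    """Return (row, column) for a 16-bit cell address."""
    row = addr & ((1 << ROW_BITS) - 1)
    col = (addr >> ROW_BITS) & ((1 << COL_BITS) - 1)
    return row, col


def select_line_counts():
    """Compare select outputs for row + column decoding against one flat decoder."""
    two_dimensional = (1 << ROW_BITS) + (1 << COL_BITS)   # 256 + 256
    flat = 1 << (ROW_BITS + COL_BITS)                     # one line per cell
    return two_dimensional, flat
```

`select_line_counts()` gives 512 against 65536, the comparison made on the slide.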




Fig 7.7 A 16 K x 4 SRAM Chip

There is little difference between this chip and the previous one, except that there are 4 64-1 multiplexers instead of 1 256-1 multiplexer.

This chip requires 24 pins including power and ground, and so will require a 24-pin package. Package size and pin count can dominate chip cost.

[Figure 7.7: row address A0–A7 drives an 8–256 row decoder into 4 64 × 256 cell arrays; column address A8–A13 drives 4 64–1 muxes and 4 1–64 demuxes, one pair per data line; R/W and CS control the chip.]




Fig 7.8 Matrix and Tree Decoders

• 2-level decoders are limited in size because of gate fan-in. Most technologies limit fan-in to ~8.
• When decoders must be built with fan-in >8, then additional levels of gates are required.
• Tree and matrix decoders are two ways to design decoders with large fan-in:

[Figure 7.8: a 3-to-8 line tree decoder constructed from 2-input gates (a 2–4 decoder on x0 and x1 followed by gates on x2, outputs m0–m7), and a 4-to-16 line matrix decoder constructed from 2-input gates (two 2–4 decoders, on x0, x1 and on x2, x3, outputs m0–m15).]




Fig 7.9 Six-Transistor Static RAM Cell

This is a more practical design than the 8-gate design shown earlier.

A value is read by precharging the bit lines to a value 1/2 way between a 0 and a 1, while asserting the word line. This allows the latch to drive the bit lines to the value stored in the latch.

[Figure 7.9: a storage cell (latch with active loads tied to +5) is connected through switches, controlled by word line wi, to dual rail data lines bi and its complement, which are used for reading and writing. Additional cells share the same bit lines. Sense/write amplifiers sense and amplify data on Read and drive the bit lines on Write; they are controlled by R/W, CS, and the column select from the column address decoder, and connect to data line di.]




Fig 7.10 Static RAM Read Operation

Access time from Address—the time required of the RAM array to decode the address and provide value to the data bus.

[Figure 7.10: timing diagram showing Memory address, Read/write, CS, and Data, with the interval tAA marked.]




Fig 7.11 Static RAM Write Operations

Write time—the time the data must be held valid in order to decode address and store value in memory cells.

[Figure 7.11: timing diagram showing Memory address, Read/write, CS, and Data, with the interval tw marked.]




Fig 7.12 Dynamic RAM Cell Organization

Write: place value on bit line and assert word line.
Read: precharge bit line, assert word line, sense value on bit line with sense/amp.

Capacitor will discharge in 4–15 ms.

Refresh capacitor by reading (sensing) value on bit line, amplifying it, and placing it back on bit line where it recharges capacitor.

This need to refresh the storage cells of dynamic RAM chips complicates DRAM system design.

[Figure 7.12: a capacitor stores charge for a 1 and no charge for a 0; a switch controlled by word line wj connects it to a single bit line bi. Additional cells share the bit line. Sense/write amplifiers sense and amplify data on Read and drive the bit line on Write; they are controlled by R/W, CS, and the column select from the column address decoder, and connect to data line di.]




Fig 7.13 Dynamic RAM Chip Organization

• Addresses are time-multiplexed on address bus using RAS and CAS as strobes of rows and columns.
• CAS is normally used as the CS function.

Notice pin counts:
• Without address multiplexing: 27 pins including power and ground.
• With address multiplexing: 17 pins including power and ground.

[Figure 7.13: address lines A0–A9 feed the row latches and decoder and the 10 column address latches; a 1024 × 1024 cell array sits above 1024 sense/write amplifiers and column latches and the 1–1024 muxes and demuxes; control logic is driven by RAS, CAS, and R/W; data in di and data out do.]




Figs 7.14, 7.15 DRAM Read and Write Cycles

[Figure 7.14, typical DRAM Read operation: timing of Memory Address (Row Addr, then Col Addr), RAS, CAS, R/W, and Data, marking tRAS, tPrechg, access time tA, and cycle time tC.]

[Figure 7.15, typical DRAM Write operation: timing of Memory Address (Row Addr, then Col Addr), RAS, CAS, W, and Data, marking tRAS, Prechg, tC, and tDHR (data hold from RAS).]

Notice that it is the bit line precharge operation that causes the difference between access time and cycle time.




DRAM Refresh and Row Access

• Refresh is usually accomplished by a “RAS-only” cycle. The row address is placed on the address lines and RAS asserted. This refreshes the entire row. CAS is not asserted. The absence of a CAS phase signals the chip that a row refresh is requested, and thus no data is placed on the external data lines.

• Many chips use “CAS before RAS” to signal a refresh. The chip has an internal counter, and whenever CAS is asserted before RAS, it is a signal to refresh the row pointed to by the counter, and to increment the counter.

• Most DRAM vendors also supply one-chip DRAM controllers that encapsulate the refresh and other functions.

• Page mode, nibble mode, and static column mode allow rapid access to the entire row that has been read into the column latches.

• Video RAMs (VRAMs) clock an entire row into a shift register where it can be rapidly read out, bit by bit, for display.




Fig 7.16 A 2-D CMOS ROM Chip

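[Figure 7.16: a row decoder with Address and CS inputs drives the rows of a CMOS ROM cell array whose columns are tied to +V; the example output shown is 1 0 1 0.]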




Tbl 7.4 ROM Types

ROM Type              Cost               Programmability     Time to Program   Time to Erase
Mask-programmed ROM   Very inexpensive   At factory only     Weeks             N/A
PROM                  Inexpensive        Once, by end user   Seconds           N/A
EPROM                 Moderate           Many times          Seconds           20 minutes
Flash EPROM           Expensive          Many times          100 µs            1 s, large block
EEPROM                Very expensive     Many times          100 µs            10 ms, byte




Memory Boards and Modules

• There is a need for memories that are larger and wider than a single chip
• Chips can be organized into “boards.”
    • Boards may not be actual, physical boards, but may consist of structured chip arrays present on the motherboard.
• A board or collection of boards make up a memory module.
• Memory modules:
    • Satisfy the processor–main memory interface requirements
    • May have DRAM refresh capability
    • May expand the total main memory capacity
    • May be interleaved to provide faster access to blocks of words




Fig 7.17 General Structure of a Memory Chip

This is a slightly different view of the memory chip than previous.

Bidirectional data bus.

Multiple chip selects ease the assembly of chips into chip arrays. Usually provided by an external AND gate.

[Figure 7.17: an m-bit Address feeds the address decoder, which drives the memory cell array; an I/O multiplexer connects the array to the s-bit Data lines; R/W and the chip selects (CS) control the chip.]




Fig 7.18 Word Assembly from Narrow Chips

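p chips expand word size from s bits to p x s bits.

All chips have common CS, R/W, and Address lines.

[Figure 7.18: p chips side by side, each with CS, R/W, Address, and an s-bit Data connection; Select, Address, and R/W are wired to every chip, and the s-bit data outputs together form the p × s-bit word.]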




Fig 7.19 Increasing the Number of Words by a Factor of 2^k

The additional k address bits are used to select one of 2^k chips, each one of which has 2^m words:

Word size remains at s bits.

[Figure 7.19: of the (m+k)-bit address, m bits go to the Address inputs of every chip and k bits go to a k to 2^k decoder whose outputs drive the CS input of each chip; R/W is common, and all chips share the s-bit data lines.]




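Fig 7.20 Chip Matrix Using Two Chip Selects

Multiple chip select lines are used to replace the last level of gates in this matrix decoder scheme.

This scheme simplifies the decoding from use of a (q+k)-bit decoder to using one q-bit and one k-bit decoder.

[Figure 7.20: of the (m + q + k)-bit address, m bits go to every chip, and the q-bit and k-bit fields drive the horizontal and vertical decoders; the decoder outputs feed the CS1 and CS2 inputs of the chips in the matrix, selecting one of 2^(m+q+k) s-bit words. R/W, Address, and Data are common.]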




Fig 7.21 Three-Dimensional Dynamic RAM Array

• CAS is used to enable top decoder in decoder tree.
• Use one 2-D array for each bit. Each 2-D array on separate board.

[Figure 7.21: w 2-D arrays of DRAM chips, one per data bit, each chip with RAS, CAS, R/W, Address, and Data pins. The high address (kc + kr bits) is split between a 2^kr decoder that distributes RAS and a 2^kc decoder, enabled by CAS, that distributes CAS; the multiplexed address (m/2 bits) and R/W go to all chips.]




Fig 7.22 A Memory Module and Its Interface

Must provide:
• Read and Write signals.
• Ready: memory is ready to accept commands.
• Address: to be sent with Read/Write command.
• Data: sent with Write or available upon Read when Ready is asserted.
• Module select: needed when there is more than one module.

Control signal generator: for SRAM, just strobes data on Read, provides Ready on Read/Write.

For DRAM: also provides CAS, RAS, R/W, multiplexes address, generates refresh signals, and provides Ready.

[Figure 7.22: the bus interface signals (k+m-bit Address, Module select, Read, Write, Ready, w-bit Data) connect to an address register, chip/board selection (k bits), a control signal generator, and a data register, which in turn drive the memory boards and/or chips (m-bit address, w-bit data).]




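Fig 7.23 Dynamic RAM Module with Refresh Control

[Figure 7.23: the module interface (k+m-bit Address, Module select, Read, Write, Ready, w-bit Data) feeds an address register, chip/board selection (k bits), a data register, and a memory timing generator. An address multiplexer selects among the m/2-bit halves of the address and the m/2-bit refresh counter output to drive the address lines of the dynamic RAM array. The refresh clock and control unit exchanges Request, Grant, and Refresh signals with the memory timing generator, which produces RAS, CAS, and R/W. Board and chip selects and the data lines also go to the array.]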




Fig 7.24 Two Kinds of Memory Module Organization

Memory modules are used to allow access to more than one word simultaneously.

(a) Consecutive words in consecutive modules (interleaving): in the j + k = m-bit address, the k least significant bits are the module select and the j most significant bits are the address within the module. Modules are numbered 0 through 2^k – 1.

(b) Consecutive words in the same module: in the k + j = m-bit address, the k most significant bits are the module select and the j least significant bits are the address within the module. Modules are numbered 0 through 2^k – 1.
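The two organizations differ only in which end of the address supplies the module number. A minimal sketch of that split, with hypothetical function names and a small example of j = 4 and k = 2 (four modules of 16 words) chosen for illustration:

```python
def interleaved(address, k):
    """Organization (a): module number from the k least significant bits."""
    module = address & ((1 << k) - 1)
    offset = address >> k
    return module, offset


def same_module(address, j, k):
    """Organization (b): module number from the k most significant bits."""
    module = (address >> j) & ((1 << k) - 1)
    offset = address & ((1 << j) - 1)
    return module, offset


# Modules touched by four consecutive word addresses, 0 through 3.
modules_a = [interleaved(a, k=2)[0] for a in range(4)]
modules_b = [same_module(a, j=4, k=2)[0] for a in range(4)]
```

Under (a) the four consecutive words land in four different modules, so they can be accessed in overlapped fashion; under (b) they all land in module 0.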




Fig 7.25 Timing of Multiple Modules on a Bus

If time to transmit information over bus, tb, is < module cycle time, tc, it is possible to time multiplex information transmission to several modules.
Example: store one word of each cache line in a separate module.

Main memory address: | Word | Module No. |

This provides successive words in successive modules.

[Figure 7.25: bus timing. A read of module 0 places the address on the bus (tb), module 0 performs its read (tc), and the data returns over the bus (tb). While module 0 is busy, a write to module 3 places address and data on the bus and module 3 performs its write.]

With interleaving of 2^k modules, and tb < tc/2^k, it is possible to get a 2^k-fold increase in memory bandwidth, provided memory requests are pipelined. DMA satisfies this requirement.




Memory System Performance

Breaking the memory access process into steps:

For all accesses:
• transmission of address to memory
• transmission of control information to memory (R/W, Request, etc.)
• decoding of address by memory

For a Read:
• return of data from memory
• transmission of completion signal

For a Write:
• transmission of data to memory (usually simultaneous with address)
• storage of data into memory cells
• transmission of completion signal

The next slide shows the access process in more detail.




Fig 7.26 Sequence of Steps in Accessing Memory

[Figure 7.26: timelines of the steps in a memory access, with ta and tc marked. (a) Static RAM behavior: address to memory and command to memory (Read or Write), address decode, then return data on a Read or write data to memory on a Write, then Complete. (b) Dynamic RAM behavior: pending refresh, row address & RAS, column address & CAS, R/W, then return data on a Read or write data to memory on a Write, then Complete, followed by Precharge and Refresh.]

“Hidden refresh” cycle. A normal cycle would exclude the pending refresh step.




Example SRAM Timings

Approximate values for static RAM Read timing:

• Address bus drivers turn-on time: 40 ns.
• Bus propagation and bus skew: 10 ns.
• Board select decode time: 20 ns.
• Time to propagate select to another board: 30 ns.
• Chip select: 20 ns.

PROPAGATION TIME FOR ADDRESS AND COMMAND TO REACH CHIP: 120 ns.

• On-chip memory read access time: 80 ns.
• Delay from chip to memory board data bus: 30 ns.
• Bus driver and propagation delay (as before): 50 ns.

TOTAL MEMORY READ ACCESS TIME: 280 ns.

Moral: 70 ns chips do not necessarily provide 70 ns access time!
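The two totals on this slide are plain sums of the listed delays, which can be checked as below. The dictionary keys are shortened labels introduced here; the values are the ones listed above.

```python
# Delays, in ns, for the address and command to reach the chip.
to_chip_ns = {
    "address bus drivers turn-on": 40,
    "bus propagation and skew": 10,
    "board select decode": 20,
    "propagate select to another board": 30,
    "chip select": 20,
}

# Delays, in ns, from the chip access through the return of data.
from_chip_ns = {
    "on-chip read access": 80,
    "chip to board data bus": 30,
    "bus driver and propagation": 50,
}

propagation_to_chip = sum(to_chip_ns.values())
total_read_access = propagation_to_chip + sum(from_chip_ns.values())
chip_share = from_chip_ns["on-chip read access"] / total_read_access
```

The sums reproduce the 120 ns and 280 ns figures, and `chip_share` shows the chip's own access time to be well under half of the total.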




Considering Any Two Adjacent Levels of the Memory Hierarchy

Some definitions:

Temporal locality: the property of most programs that if a given memory location is referenced, it is likely to be referenced again, “soon.”

Spatial locality: if a given memory location is referenced, those locations near it numerically are likely to be referenced “soon.”

Working set: The set of memory locations referenced over a fixed period of time, or in a time window.

Notice that temporal and spatial locality both work to assure that the contents of the working set change only slowly over execution time.

Defining the primary and secondary levels:

[Figure: CPU • • • primary level (faster, smaller), secondary level (slower, larger) • • •; these are two adjacent levels in the hierarchy.]




Primary and Secondary Levels of the Memory Hierarchy

• The item of commerce between any two levels is the block.

• Blocks may/will differ in size at different levels in the hierarchy.
  Example: Cache block size: ~ 16–64 bytes.
  Disk block size: ~ 1–4 Kbytes.

• As working set changes, blocks are moved back/forth through the hierarchy to satisfy memory access requests.

• A complication: Addresses will differ depending on the level.
  Primary address: the address of a value in the primary level.
  Secondary address: the address of a value in the secondary level.

Speed between levels defined by latency: time to access first word, and bandwidth, the number of words per second transmitted between levels.

Typical latencies:
  Cache latency: a few clocks
  Disk latency: 100,000 clocks




Primary and Secondary Address Examples

• Main memory address: unsigned integer

• Disk address: track number, sector number, offset of word in sector.




Fig 7.28 Addressing and Accessing a Two-Level Hierarchy

The computer system, HW or SW, must perform any address translation that is required.

Two ways of forming the address: Segmentation and Paging. Paging is more common. Sometimes the two are used together, one “on top of” the other. More about address translation and paging later...

[Figure 7.28: the memory management unit (MMU) applies a translation function (mapping tables, permissions, etc.) to the system address. On a hit it produces the address in primary memory, and a word is accessed in the primary level. On a miss it produces the address in secondary memory, and a block moves between the secondary level and the primary level.]




Fig 7.29 Primary Address Formation

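[Figure 7.29: (a) Paging: the system address has Block and Word fields; the Block field indexes a lookup table that supplies the Block field of the primary address, and the Word field is copied unchanged. (b) Segmentation: the Block field indexes a lookup table that supplies a base address, which is added (+) to the Word field to form the primary address.]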




Hits and Misses; Paging; Block Placement

Hit: the word was found at the level from which it was requested.

Miss: the word was not found at the level from which it was requested. (A miss will result in a request for the block containing the word from the next higher level in the hierarchy.)

Hit ratio (or hit rate) = h = (number of hits) / (total number of references)

Miss ratio: 1 - hit ratio

tp = primary memory access time. ts = secondary memory access time.

Access time, ta = h • tp + (1 - h) • ts.

Page: commonly, a disk block. Page fault: synonymous with a miss.

Demand paging: pages are moved from disk to main memory only when a word in the page is requested by the processor.

Block placement and replacement decisions must be made each time a block is moved.
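The hit ratio and access time definitions on this slide translate directly into code. The example numbers are hypothetical: a primary access of 10 clocks is assumed, and the secondary access of 100,000 clocks borrows the disk latency quoted earlier in the chapter.

```python
def hit_ratio(hits, references):
    """h = number of hits / total number of references."""
    return hits / references


def average_access_time(h, tp, ts):
    """ta = h * tp + (1 - h) * ts, in whatever time unit tp and ts share."""
    return h * tp + (1 - h) * ts


# Hypothetical run: 999 hits in 1000 references, tp = 10 clocks, ts = 100,000 clocks.
h = hit_ratio(999, 1000)
ta = average_access_time(h, tp=10, ts=100_000)
```

Even with only one miss in a thousand references, `ta` comes out near 110 clocks, about eleven times the assumed primary access time, because the secondary level is so much slower.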




Virtual Memory

A virtual memory is a memory hierarchy, usually consisting of at least main memory and disk, in which the processor issues all memory references as effective addresses in a flat address space. All translations to primary and secondary addresses are handled transparently to the process making the address reference, thus providing the illusion of a flat address space.

Recall that disk accesses may require 100,000 clock cycles to complete, due to the slow access time of the disk subsystem. Once the processor has, through mediation of the operating system, made the proper request to the disk subsystem, it is available for other tasks.

Multiprogramming shares the processor among independent programs that are resident in main memory and thus available for execution.




Decisions in Designing a 2-Level Hierarchy

• Translation procedure to translate from system address to primary address.

• Block size—block transfer efficiency and miss ratio will be affected.

• Processor dispatch on miss—processor wait or processor multiprogrammed.

• Primary-level placement—direct, associative, or a combination. Discussed later.

• Replacement policy—which block is to be replaced upon a miss.

• Direct access to secondary level: in the cache regime, can the processor directly access main memory upon a cache miss?

• Write through: can the processor write directly to main memory upon a cache miss?

• Read through: can the processor read directly from main memory upon a cache miss as the cache is being updated?

• Read or write bypass—can certain infrequent read or write misses be satisfied by a direct access of main memory without any block movement?




Fig 7.30 The Cache Mapping Function

The cache mapping function is responsible for all cache operations:
• Placement strategy: where to place an incoming block in the cache
• Replacement strategy: which block to replace upon a miss
• Read and write policy: how to handle reads and writes upon cache misses

Mapping function must be implemented in hardware. (Why?)

Three different types of mapping functions:
• Associative
• Direct mapped
• Block-set associative

[Figure 7.30: the CPU exchanges words with the cache, and the cache exchanges blocks with main memory; the mapping function operates on the address. Example sizes: 256 KB cache, 16-word blocks, 32 MB main memory.]




Memory Fields and Address Translation

Example of processor-issued 32-bit virtual address:

    bit 31                          bit 0
    |            32 bits              |

That same 32-bit address partitioned into two fields, a block field and a word field. The word field represents the offset into the block specified in the block field:

    | Block number (26 bits) | Word (6 bits) |

2^26 blocks of 64 words each.

Example of a specific memory reference: Block 9, word 11.

    00 ••• 001001 | 001011
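The block/word partition is a shift and a mask. The sketch below uses the 26-bit and 6-bit field widths from the slide; the function names are hypothetical.

```python
WORD_BITS = 6     # 64 words per block
BLOCK_BITS = 26   # 2^26 blocks


def split_block_word(address):
    """Split a 32-bit address into (block number, word offset)."""
    word = address & ((1 << WORD_BITS) - 1)
    block = (address >> WORD_BITS) & ((1 << BLOCK_BITS) - 1)
    return block, word


def join_block_word(block, word):
    """Rebuild the address from its two fields."""
    return (block << WORD_BITS) | word


address = join_block_word(9, 11)   # the slide's example: block 9, word 11
```

`address` has the binary form 001001 001011 in its low 12 bits, matching the bit pattern shown on the slide, and `split_block_word(address)` returns the pair (9, 11).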




Fig 7.31 Associative Cache

Associative mapped cache model: any block from main memory can be put anywhere in the cache. Assume a 16-bit main memory.*

*16 bits, while unrealistically small, simplifies the examples

Main memory address: | Tag (13 bits) | Byte (3 bits) |

[Figure 7.31: tag memory (13-bit tag field plus 1 valid bit per entry) beside a cache memory of 256 lines (cache blocks 0–255), one cache line being 8 bytes; main memory holds MM blocks 0–8191. In the example, cache block 0 holds MM block 421, cache block 2 holds MM block 119, and cache block 255 holds MM block 2, each with its valid bit set; cache block 1 is marked invalid, with contents shown as “?”.]




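Fig 7.32 Associative Cache Mechanism

Because any block can reside anywhere in the cache, an associative (content addressable) memory is used. All locations are searched simultaneously.

[Figure 7.32: the 13-bit tag of the main memory address is loaded into the argument register and compared with every entry of the associative tag memory. A match bit, together with the valid bit, selects the matching 64-bit cache line (cache blocks 0–255), and the 3-bit byte field drives a selector that passes 8 bits to the CPU. The numerals 1–6 in the figure mark the order of these steps.]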



Advantages and Disadvantages of the Associative Mapped Cache

Advantage

• Most flexible of all: any MM block can go anywhere in the cache.

Disadvantages

• Large tag memory.

• Need to search entire tag memory simultaneously means lots of hardware.

Replacement Policy is an issue when the cache is full. –more later–

Q.: How is an associative search conducted at the logic gate level?

Direct-mapped caches simplify the hardware by allowing each MM block to go into only one place in the cache:




Fig 7.33 Direct-Mapped Cache

Key Idea: all the MM blocks from a given group can go into only one location in the cache, corresponding to the group number.

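Now the cache needs only examine the single group that its reference specifies.

Cache address: | Tag (5 bits) | Group (8 bits) | Byte (3 bits) |

[Figure 7.33: main memory blocks 0–8191 laid out by group number (rows 0–255) and tag number (columns 0–31): tag 0 holds blocks 0–255, tag 1 holds 256–511, tag 2 holds 512–767, tag 9 contains block 2305, tag 30 begins at block 7680, and tag 31 begins at block 7936 and ends at 8191. The cache memory has 256 lines of 8 bytes, each with a 5-bit tag field and a valid bit. In the example, the line for group 0 carries tag 30 and the line for group 1 carries tag 9.]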




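Fig 7.34 Direct-Mapped Cache Operation

1. Decode the group number of the incoming MM address to select the group
2. If Match AND Valid
3. Then gate out the tag field
4. Compare cache tag with incoming tag
5. If a hit, then gate out the cache line
6. and use the word field to select the desired word.

[Figure 7.34: the main memory address is split into Tag (5 bits), Group (8 bits), and Byte (3 bits). An 8–256 decoder selects one of 256 entries in the tag memory (5-bit tag field plus valid bit) and the cache memory. A 5-bit comparator compares the stored tag with the incoming tag, signaling cache hit (=) or cache miss (≠). On a hit, the 64-bit cache line passes to a selector that uses the 3-bit byte field to deliver 8 bits. The numerals 1–6 mark the steps listed above.]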
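The lookup steps above can be modeled in software. This is a behavioral sketch, not the hardware: the 5/8/3-bit field widths come from the figure, while the class and method names are hypothetical, and replacement on a miss is left to the caller via `fill`.

```python
TAG_BITS, GROUP_BITS, BYTE_BITS = 5, 8, 3   # field widths from the figure
LINE_BYTES = 1 << BYTE_BITS                 # 8 bytes per cache line


def split_mm_address(addr):
    """Split a 16-bit main memory address into (tag, group, byte)."""
    byte = addr & (LINE_BYTES - 1)
    group = (addr >> BYTE_BITS) & ((1 << GROUP_BITS) - 1)
    tag = (addr >> (BYTE_BITS + GROUP_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, group, byte


class DirectMappedCache:
    def __init__(self):
        groups = 1 << GROUP_BITS
        self.valid = [False] * groups
        self.tags = [0] * groups
        self.lines = [bytes(LINE_BYTES) for _ in range(groups)]

    def fill(self, addr, line):
        """Place an 8-byte line in the only slot its group number allows."""
        tag, group, _ = split_mm_address(addr)
        self.valid[group] = True
        self.tags[group] = tag
        self.lines[group] = bytes(line)

    def read(self, addr):
        """Return the addressed byte on a hit, or None on a miss."""
        tag, group, byte = split_mm_address(addr)        # select the group
        if self.valid[group] and self.tags[group] == tag:  # valid and tags match
            return self.lines[group][byte]               # select byte in line
        return None


cache = DirectMappedCache()
cache.fill(2305 * LINE_BYTES, range(8))   # MM block 2305: tag 9, group 1
```

MM block 2305 and MM block 1 both map to group 1, so filling one evicts the other. That conflict is what limits placement in a direct-mapped cache.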




Direct-Mapped Caches

• The direct mapped cache uses less hardware, but is much more restrictive in block placement.

• If two blocks from the same group are frequently referenced, then the cache will “thrash.” That is, repeatedly bring the two competing blocks into and out of the cache. This will cause a performance degradation.

• Block replacement strategy is trivial.

• Compromise: allow several cache blocks in each group. This is the Block-Set-Associative Cache:




Fig 7.35 2-Way Set-Associative Cache

Example shows 256 groups, a set of two per group. Sometimes referred to as a 2-way set-associative cache.

Cache group address: | Tag (5 bits) | Set (8 bits) | Byte (3 bits) |

[Figure 7.35: cache memory with two 8-byte lines per group for groups 0–255, each line with its own 5-bit tag field in the tag memory; main memory blocks 0–8191 are laid out by group number (rows 0–255) and tag number (columns 0–31) as in Fig 7.33. The example shows two blocks from the same group resident in the cache at once, such as MM blocks 7680 and 2304 in group 0.]




Getting Specific: The Intel Pentium Cache

• The Pentium actually has two separate caches: one for instructions and one for data. Pentium issues 32-bit MM addresses.
• Each cache is 2-way set-associative
• Each cache is 8 K = 2^13 bytes in size
• 32 = 2^5 bytes per line.
• Thus there are 64 or 2^6 bytes per set, and therefore 2^13/2^6 = 2^7 = 128 groups
• This leaves 32 - 5 - 7 = 20 bits for the tag field:

    bit 31                                       bit 0
    | Tag (20) | Set (group) (7) | Word (5) |

This “cache arithmetic” is important, and deserves your mastery.
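The same arithmetic can be packaged as a small helper to check field widths for other cache shapes. The function name `cache_fields` is hypothetical, and the sketch assumes every size is a power of two.

```python
def cache_fields(cache_bytes, line_bytes, ways, address_bits):
    """Return (tag_bits, set_bits, offset_bits) for a set-associative cache.

    Assumes cache_bytes, line_bytes, and ways are powers of two.
    """
    offset_bits = (line_bytes - 1).bit_length()     # log2(bytes per line)
    sets = cache_bytes // (line_bytes * ways)       # number of groups
    set_bits = (sets - 1).bit_length()              # log2(number of groups)
    tag_bits = address_bits - set_bits - offset_bits
    return tag_bits, set_bits, offset_bits


pentium = cache_fields(cache_bytes=8 * 1024, line_bytes=32, ways=2, address_bits=32)
direct_mapped = cache_fields(cache_bytes=256 * 8, line_bytes=8, ways=1, address_bits=16)
```

`pentium` gives the 20/7/5 split derived above, and `direct_mapped` gives the 5/8/3 split of the 16-bit direct-mapped example (256 lines of 8 bytes).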




Cache Read and Write Policies

• Read and Write cache hit policies
    • Writethrough: updates both cache and MM upon each write.
    • Write back: updates only cache. Updates MM only upon block removal.
        • “Dirty bit” is set upon first write to indicate block must be written back.

• Read and Write cache miss policies
    • Read miss: bring block in from MM
        • Either forward desired word as it is brought in, or
        • Wait until entire line is filled, then repeat the cache request.
    • Write miss
        • Write-allocate: bring block into cache, then update
        • Write–no-allocate: write word to MM without bringing block into cache.




Block Replacement Strategies

• Not needed with direct-mapped cache

• Least Recently Used (LRU)
    • Track usage with a counter. Each time a block is accessed:
        • Clear counter of accessed block
        • Increment counters with values less than the one accessed
        • All others remain unchanged
    • When set is full, remove line with highest count

• Random replacement: replace block at random
    • Even random replacement is a fairly effective strategy
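The LRU counter scheme can be sketched for a single set. Two details are assumptions not stated on the slide: the counters start at distinct values 0 through ways-1 so that empty lines are the first to be filled, and the increment uses the accessed line's count from before it is cleared. The class name `LRUSet` is hypothetical.

```python
class LRUSet:
    """One cache set whose lines carry LRU age counters (0 = most recent)."""

    def __init__(self, ways):
        self.blocks = [None] * ways
        self.counters = list(range(ways))   # assumed distinct starting ages

    def access(self, block):
        """Reference a block; return (hit, evicted_block)."""
        hit = block in self.blocks
        evicted = None
        if hit:
            way = self.blocks.index(block)
        else:
            # Replace the line with the highest count (least recently used).
            way = max(range(len(self.blocks)), key=lambda w: self.counters[w])
            evicted = self.blocks[way]
            self.blocks[way] = block
        old = self.counters[way]
        for w in range(len(self.counters)):
            if self.counters[w] < old:      # only lines younger than the accessed one age
                self.counters[w] += 1
        self.counters[way] = 0              # clear counter of accessed block
        return hit, evicted


lru = LRUSet(ways=4)
for name in "ABCD":
    lru.access(name)
rehit = lru.access("A")     # A becomes most recent, so B is now the oldest
victim = lru.access("E")    # miss in a full set: the oldest line is replaced
```

Because only counters below the accessed line's old value are incremented, the counters always remain a permutation of 0 through ways-1, so a line with the highest count always exists and is unique.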




Cache Performance

Recall Access time, ta = h • tp + (1 - h) • ts for primary and secondary levels.

For tp = cache and ts = MM,

ta = h • tC + (1 - h) • tM

We define S, the speedup, as S = Twithout/Twith for a given process, where Twithout is the time taken without the improvement, cache in this case, and Twith is the time the process takes with the improvement.

Having a model for cache and MM access times and cache line fill time, the speedup can be calculated once the hit ratio is known.
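A minimal sketch of the speedup calculation, under a simplifying assumption that is not from the text: every access without the cache costs tM, every access with it costs ta = h • tC + (1 - h) • tM, and line fill time is ignored. The access times in the example are hypothetical.

```python
def cache_speedup(h, t_cache, t_mm):
    """S = Twithout / Twith under the simple per-access model described above."""
    t_with = h * t_cache + (1 - h) * t_mm   # ta with the cache present
    t_without = t_mm                        # every access goes to main memory
    return t_without / t_with


# Hypothetical times: 10 ns cache, 100 ns main memory, 90% hit ratio.
s = cache_speedup(h=0.9, t_cache=10, t_mm=100)
```

For these assumed values `s` is roughly 5.3. The speedup approaches tM/tC (10 here) only as h approaches 1.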




Fig 7.36 The PowerPC 601 Cache Structure

• The PPC 601 has a unified cache, that is, a single cache for both instructions and data.

• It is 32 KB in size, organized as 64 x 8 block-set associative, with blocks being 8 8-byte words organized as 2 independent 4-word sectors for convenience in the updating process

• A cache line can be updated in two single-cycle operations of 4 words each.

• Normal operation is write back, but write through can be selected on a per line basis via software. The cache can also be disabled via software.

Physical address: | Tag (20 bits) | Line (set) # (6 bits) | Word # (6 bits) |

[Figure 7.36: tag memory and cache memory organized as 64 sets (line 0 through line 63), each a set of 8 lines; each entry holds a 20-bit address tag and a 64-byte line divided into Sector 0 and Sector 1.]




Virtual Memory

The memory management unit, MMU, is responsible for mapping logical addresses issued by the CPU to physical addresses that are presented to the cache and main memory.

[Figure: on the CPU chip, the CPU issues a logical address to the MMU, which uses mapping tables to form the virtual address and then the physical address presented to the cache and main memory; disk backs main memory.]

A word about addresses:

• Effective address: an address computed by the processor while executing a program. Synonymous with logical address.

    • The term effective address is often used when referring to activity inside the CPU. Logical address is most often used when referring to addresses when viewed from outside the CPU.

• Virtual address: the address generated from the logical address by the memory management unit, MMU.

• Physical address: the address presented to the memory unit.

(Note: Every address reference must be translated.)




Virtual Addresses: Why

The logical address provided by the CPU is translated to a virtual address by the MMU. Often the virtual address space is larger than the logical address, allowing program units to be mapped to a much larger virtual address space.

Getting Specific: The PowerPC 601
• The PowerPC 601 CPU generates 32-bit logical addresses.
• The MMU translates these to 52-bit virtual addresses before the final translation to physical addresses.
• Thus while each process is limited to 32 bits, the main memory can contain many of these processes.
• Other members of the PPC family will have different logical and virtual address spaces, to fit the needs of various members of the processor family.




Virtual Addressing—Advantages

• Simplified addressing. Each program unit can be compiled into its own memory space, beginning at address 0 and potentially extending far beyond the amount of physical memory present in the system.

• No address relocation required at load time.

• No need to fragment the program to accommodate memory limitations.

• Cost effective use of physical memory.

• Less expensive secondary (disk) storage can replace primary storage. (The MMU will bring portions of the program into physical memory as required)

• Access control. As each memory reference is translated, it can be simultaneously checked for read, write, and execute privileges.

• This allows access/security control at the most fundamental levels.

• Can be used to prevent buggy programs and intruders from causing damage to other users or the system.

This is the origin of those “bus error” and “segmentation fault” messages.




Fig 7.38 Memory Management by Segmentation

• Notice that each segment’s virtual address starts at 0, different from its physical address.

• Repeated movement of segments into and out of physical memory will result in gaps between segments. This is called external fragmentation.

• Compaction routines must be occasionally run to remove these fragments.

Figure labels: main memory holds Segments 1, 5, 6, 9, and 3, with gaps between some of them. Physical memory addresses begin at 0000, while the virtual memory addresses of every segment begin at 0.




Fig 7.39 Segmentation Mechanism

• The computation of physical address from virtual address requires an integer addition for each memory reference, and a comparison if segment limits are checked.

• Q: How does the MMU switch references from one segment to another?
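The base-plus-offset computation with a limit check can be sketched as follows. This is a minimal model, assuming the limit register holds the largest legal offset (so the hardware test is offset ≤ limit); the function and exception names are hypothetical.

```python
class BoundsError(Exception):
    """Raised when the offset lies beyond the segment limit."""


def translate_segment(offset: int, base: int, limit: int) -> int:
    """Map an offset within a segment to a physical address.

    base  -- contents of the segment base register
    limit -- contents of the segment limit register (assumed: largest legal offset)
    """
    if offset > limit:            # the comparison made on each reference
        raise BoundsError(f"offset {offset:#x} exceeds limit {limit:#x}")
    return base + offset          # the integer addition made on each reference


if __name__ == "__main__":
    print(hex(translate_segment(0x0FF, base=0x4000, limit=0xFFF)))
```

Switching to a different segment in this model means loading a different base and limit pair before the next reference.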

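Figure: the virtual memory address from the CPU is the offset in the segment. It is added (+) to the segment base register to form the main memory address, and compared (≤) with the segment limit register; if the comparison fails (No), a bounds error results. Main memory again holds Segments 1, 5, 6, 9, and 3 with gaps.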




Fig 7.40 The Intel 8086 Segmentation Scheme

The first popular 16-bit processor, the Intel 8086, had a primitive segmentation scheme to “stretch” its 16-bit logical address to a 20-bit physical address:

The CPU allows 4 simultaneously active segments: CODE, DATA, STACK, and EXTRA. There are four 16-bit segment base registers.
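The 8086 address computation can be sketched in a few lines: the 16-bit segment register is shifted left by 4 bits (the appended 0000) and added to the 16-bit logical address. The function name is hypothetical, and the wrap to 20 bits models the 20 address lines of the 8086.

```python
def physical_address_8086(segment: int, offset: int) -> int:
    """Form a 20-bit physical address from a 16-bit segment and 16-bit offset."""
    segment &= 0xFFFF
    offset &= 0xFFFF
    # Segment register with 0000 appended, plus the logical address,
    # truncated to the 20 address bits.
    return ((segment << 4) + offset) & 0xFFFFF


if __name__ == "__main__":
    print(hex(physical_address_8086(0x1234, 0x0010)))
```

Because segments start on 16-byte boundaries and overlap, many segment and offset pairs name the same physical byte.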

Figure: the 16-bit segment register, extended with 0000 on the right, is added to the 16-bit logical address, extended with 0000 on the left, to give the 20-bit physical address.




• This figure shows the mapping between virtual memory pages, physical memory pages, and pages in secondary memory. Page n - 1 is not present in physical memory, but only in secondary memory.

• The MMU manages this mapping.

Fig 7.41 Memory Management by Paging

Figure labels: a program unit occupies virtual memory pages 0 through n - 1, each of which maps to a page in physical memory or in secondary memory.




Fig 7.42 Virtual Address Translation in a Paged MMU

A page fault will result in 100,000 or more cycles passing before the page has been brought from secondary storage to MM.

• 1 table per user per program unit

• One translation per memory access

• Potentially large page table
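A minimal sketch of the page table lookup follows. It assumes 4 KB pages (12 offset bits) and a page table held as a Python list whose length plays the role of the page table limit register; all names are hypothetical and the access-control bits are reduced to a single presence bit.

```python
from dataclasses import dataclass
from typing import List

PAGE_BITS = 12  # assumed page size of 4 KB


class PageFault(Exception):
    """Raised when the page is in secondary memory, not primary memory."""


@dataclass
class PageTableEntry:
    present: bool   # presence bit
    frame: int      # physical page number (or disk pointer when not present)


def translate_paged(vaddr: int, page_table: List[PageTableEntry]) -> int:
    page = vaddr >> PAGE_BITS                   # page number
    offset = vaddr & ((1 << PAGE_BITS) - 1)     # offset in page
    if page >= len(page_table):                 # page table limit check
        raise IndexError("bounds error: page number beyond page table limit")
    entry = page_table[page]                    # base + offset in page table
    if not entry.present:
        raise PageFault(f"page {page} is in secondary memory")
    return (entry.frame << PAGE_BITS) | offset  # physical page + offset


if __name__ == "__main__":
    table = [PageTableEntry(True, 7), PageTableEntry(False, 42)]
    print(hex(translate_paged(0x0ABC, table)))
```

Every program reference pays for this table access unless recent translations are cached.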

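Figure: the virtual address from the CPU is split into a page number and an offset in the page. The page number is used as the offset in the page table: it is checked (≤) against the page table limit register (No gives a bounds error) and added (+) to the page table base register to select an entry. Each entry holds access-control bits (presence bit, dirty bit, usage bits) and a physical page number or a pointer to secondary storage. On a hit the page is in primary memory, and the physical page number joined with the word offset forms the physical address of the desired word in main memory. On a miss (page fault) the page is in secondary memory, and the entry is translated to a disk address.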




Page Placement and Replacement

Page tables are direct mapped, since the physical page is computed directly from the virtual page number.

But physical pages can reside anywhere in physical memory.

Page tables such as those on the previous slide result in large page tables, since there must be a page table entry for every page in the program unit.

Some implementations resort to hash tables instead, which need have entries only for those pages actually present in physical memory.

Replacement strategies are generally LRU, or at least employ a “use bit” to guide replacement.




Fast Address Translation: Regaining Lost Ground

• The concept of virtual memory is very attractive, but leads to considerable overhead:
  • There must be a translation for every memory reference.
  • There must be two memory references for every program reference:
    • One to retrieve the page table entry,
    • one to retrieve the value.

• Most caches are addressed by physical address, so there must be a virtual to physical translation before the cache can be accessed.

The answer: a small cache in the processor that retains the last few virtual to physical translations: a Translation Lookaside Buffer, TLB.

The TLB contains not only the virtual to physical translations, but also the valid, dirty, and protection bits, so a TLB hit allows the processor to access physical memory directly.
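The effect of a TLB can be sketched with a small software model. The code below is an assumption-laden illustration, not a description of any real MMU: it models a fully associative TLB with LRU replacement, 4 KB pages, and a page table held as a dictionary; the class and function names are hypothetical.

```python
from collections import OrderedDict
from typing import Dict, Optional, Tuple

PAGE_BITS = 12  # assumed page size of 4 KB


class TLB:
    """Tiny fully associative TLB with LRU replacement (illustrative only)."""

    def __init__(self, capacity: int = 4) -> None:
        self.capacity = capacity
        self.entries: "OrderedDict[int, int]" = OrderedDict()

    def lookup(self, vpn: int) -> Optional[int]:
        if vpn not in self.entries:
            return None
        self.entries.move_to_end(vpn)          # mark as most recently used
        return self.entries[vpn]

    def insert(self, vpn: int, ppn: int) -> None:
        if vpn not in self.entries and len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        self.entries[vpn] = ppn
        self.entries.move_to_end(vpn)


def translate(vaddr: int, tlb: TLB, page_table: Dict[int, int]) -> Tuple[int, bool]:
    """Return (physical address, tlb_hit)."""
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    ppn = tlb.lookup(vpn)
    hit = ppn is not None
    if not hit:
        ppn = page_table[vpn]   # extra memory reference; KeyError would model a page fault
        tlb.insert(vpn, ppn)
    return (ppn << PAGE_BITS) | offset, hit


if __name__ == "__main__":
    tlb, table = TLB(capacity=2), {0: 5, 1: 9, 2: 3}
    print(translate(0x0123, tlb, table))   # first reference misses
    print(translate(0x0123, tlb, table))   # repeat reference hits
```

Only the miss path touches the page table, which is how the TLB removes the second memory reference from most accesses.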

The TLB is usually implemented as a fully associative cache:




Fig 7.43 Translation Lookaside Buffer Structure and Operation

Figure: the virtual address from the CPU is split into a page number and a word offset. The virtual page number is looked up associatively in the TLB, whose entries hold the virtual page number, access-control bits (presence bit, dirty bit, valid bit, usage bits), and the physical page number. On a TLB hit (Y) the page is in primary memory, and the physical page number joined with the word offset forms the physical address of the desired word in main memory or cache. On a TLB miss (N) the physical page is looked for in the page table.




Fig 7.44 Operation of the Memory Hierarchy

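Figure: flow of a reference from the CPU through the cache, main memory, and secondary memory.

• The CPU issues a virtual address and the TLB is searched.
• TLB hit: generate the physical address and search the cache.
• TLB miss: search the page table.
  • Page table hit: generate the physical address and update the TLB.
  • Page table miss: page fault. Get the page from secondary memory, then update MM, cache, and page table.
• Cache hit: return the value from the cache.
• Cache miss: update the cache from MM.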




Fig 7.45 PowerPC 601 MMU Operation

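“Segments” are actually more akin to large (256 MB) blocks.

Figure: the 32-bit logical address from the CPU is divided into a 4-bit segment number, a 16-bit virtual page number, and a 12-bit word offset. The segment number selects one of 16 segment registers (0 to 15), each holding a 24-bit virtual segment ID (VSID) plus access control and miscellaneous bits. The VSID and the virtual page number form the 40-bit virtual page, which is compared against set 0 and set 1 of the UTLB (entries 0 to 127, selected by 7 bits). A UTLB hit passes the 20-bit physical page through a 2–1 mux to be used with the 12-bit offset in the cache access; a UTLB miss goes to the page table search. A cache hit returns d0–d31 to the CPU, and a cache miss causes a cache load.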




Fig 7.46 I/O Connection to a Memory with a Cache

• The memory system is quite complex, and affords many possible tradeoffs.

• The only realistic way to choose among these alternatives is to study a typical workload, using either simulations or prototype systems.

• Instruction and data accesses usually have different patterns.

• It is possible to employ a cache at the disk level, using the disk hardware.

• Traffic between MM and disk is I/O, and direct memory access, DMA, can be used to speed the transfers:

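Figure: the CPU connects to the cache and main memory. Paging DMA moves pages between main memory and disk, and I/O DMA moves other I/O data to and from main memory.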




Chapter 7 Summary

• Most memory systems are multileveled: cache, main memory, and disk.

• Static and dynamic RAM are the fastest components, and their speed has the strongest effect on system performance.

• Chips are organized into boards and modules.

• Larger, slower memory is attached to faster memory in a hierarchical structure.

• The cache to main memory interface requires hardware address translation.

• Virtual memory—the main memory–disk interface—can employ software for address translation because of the slower speeds involved.

• The hierarchy must be carefully designed to ensure optimum price-performance.


8-1 Chapter 8—Input and Output


Chapter 8: Input and Output

Topics

8.1 The I/O Subsystem
• I/O buses and addresses

8.2 Programmed I/O
• I/O operations initiated by program instructions

8.3 I/O Interrupts
• Requests to processor for service from an I/O device

8.4 Direct Memory Access (DMA)
• Moving data in and out without processor intervention

8.5 I/O Data Format Change and Error Control
• Error detection and correction coding of I/O data




Three Requirements of I/O Data Transmission

(1) Data location
• Correct device must be selected
• Data must be addressed within that device

(2) Data transfer
• Amount of data varies with device and may need to be specified
• Transmission rate varies greatly with device
• Data may be output, input, or either with a given device

(3) Synchronization
• For an output device, data must be sent only when the device is ready to receive it
• For an input device, the processor can read data only when it is available from the device




Location of I/O Data

• Data location may be trivial once the device is determined
  • Character from a keyboard
  • Character out to a serial printer

• Location may involve searching
  • Record number on a tape drive
  • Track seek and rotation to sector on a disk

• Location may not be simple binary number
  • Drive, platter, track, sector, word on a disk cluster




Fig 8.1 Disk Data Transfer Timing

• Keyboard delivers one character about every 1/10 second at the fastest

• Rate may also vary, as in disk rotation delay followed by block transfer

Figure: after the start of a disk transfer, about 10⁶ processor cycles pass before the first byte arrives; the bytes of the block then follow one another less than 10 cycles apart.




Synchronization—I/O Devices Are Not Timed by Master Clock

• Not only can I/O rates differ greatly from processor speed, but I/O is asynchronous

• Processor will interrogate state of device and transfer information at clock ticks

• I/O status and information must be stable at the clock tick when it is accessed

• Processor must know when output device can accept new data

• Processor must know when input device is ready to supply new data




Reducing Location and Synchronization to Data Transfer

• Since the structure of device data location is device dependent, device should interpret it
  • The device must be selected by the processor, but
  • Location within the device is just information passed to the device

• Synchronization can be done by the processor reading device status bits
  • Data available signal from input device
  • Ready to accept output data from output device

• Speed requirements will require us to use other forms of synchronization: discussed later
  • Interrupts and DMA are examples




Fig 8.2 Independent and Shared Memory and I/O Buses

• Allows tailoring bus to its purpose, but

• Requires many connections to CPU (pins)

• Least expensive option

• Speed penalty

• Memory and I/O access can be distinguished

• Timing and synchronization can be different for each

Figure panels, each connecting the CPU to the memory and to the I/O system:

(a) Separate memory and I/O buses (isolated I/O): a memory bus and an I/O bus, each with its own control, address, and data lines

(b) Shared address and data lines

(c) Shared address, data, and control lines (memory-mapped I/O)




Memory-Mapped I/O

• Combine memory control and I/O control lines to make one unified bus for memory and I/O

• This makes addresses of I/O device registers appear to the processor as memory addresses

• Reduces the number of connections to the processor chip

• Increased generality may require a few more control signals

• Standardizes data transfer to and from the processor
  • Asynchronous operation is optional with memory, but demanded by I/O devices




Fig 8.3 Address Space of a Computer Using Memory-Mapped I/O

Address space

I/O registers: distributed among many I/O devices, not all used

Memory cells: all in one main memory subunit




Programmed I/O

• Requirements for a device using programmed I/O
  • Device operations take many instruction times
  • One word data transfers, with no burst data transmission

• Program instructions have time to test device status bits, write control bits, and read or write data at the required device speed

• Example status bits:
  • Input data ready
  • Output device busy or off-line

• Example control bits:
  • Reset device
  • Start read or start write




Fig 8.4 Programmed I/O Device Interface Structure

• Focus on the interface between the unified I/O and memory bus and an arbitrary device. Several device registers (memory addresses) share address decode and control logic.

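Figure: the memory bus from the CPU (data, address, and control lines) connects to separate I/O interfaces for a printer, a keyboard, and a disk drive. Each I/O interface contains address decoders and device interface registers (In, Out, Command, Status) that connect to the I/O device.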




Fig 8.5 SRC I/O Register Address Decoder

• Assumes SRC addresses above FFFFF000₁₆ are reserved for I/O registers

• Allows for 1024 registers of 32 bits

• Is in range FFFF0000₁₆ to FFFFFFFF₁₆ addressable by negative displacement
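The decoding rule in these bullets can be sketched as a pair of helper functions. This is only an illustration of the address arithmetic: the names are hypothetical, and the real decoder is combinational logic rather than software.

```python
IO_BASE = 0xFFFFF000   # start of the reserved I/O register space


def is_io_address(addr: int) -> bool:
    """True when the upper 20 address bits are all ones (I/O space selected)."""
    return (addr >> 12) & 0xFFFFF == 0xFFFFF


def io_register_index(addr: int) -> int:
    """Index (0..1023) of the 32-bit register; the low 2 bits are don't care."""
    return (addr & 0xFFF) >> 2


if __name__ == "__main__":
    for a in (0xFFFFF110, 0xFFFFF114, 0xFFFF0000):
        print(hex(a), is_io_address(a), io_register_index(a))
```

The 12 low address bits leave room for 4096 bytes, which is the 1024 registers of 32 bits mentioned above.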

Figure: of the 32 address bits addr⟨31..0⟩, the 20 bits addr⟨31..12⟩ select the I/O space. Within the 12 bits addr⟨11..0⟩, bits addr⟨11⟩ through addr⟨3⟩ are matched against jumper settings to select the specific device, addr⟨2⟩ chooses between the status register and the data register, and addr⟨1..0⟩ are don't care.




Fig 8.6 Interface Design for SRC Character Output

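Figure: an address decoder on address⟨31..0⟩ selects the status register or the data register. Read (status) gates the Ready flip-flop onto data⟨31⟩ through a tri-state (TS) driver to the CPU. Write (data) loads the 8-bit Char register from data⟨7..0⟩, clears Ready, and sends Start to the printer along with the 8 character lines. The Done signal from the printer sets Ready again. Complete is returned to the CPU. The numbers 1 to 4 in the figure mark the order of these steps.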




Fig 8.7 Synchronous, Semi-synchronous, and Asynchronous Data Input

Signals marked (M) are driven by the master and signals marked (S) by the slave.

(a) Synchronous input
• Used for register to register inside CPU
• Signals: Read (M), Data (S) valid, Strobe data (M), all within one cycle time of the master

(b) Semisynchronous input
• Used for memory to CPU read with few cycle memory
• Signals: Read (M), Complete, Data (S) valid, Strobe data (M), timed in cycle times of the master

(c) Asynchronous input
• Used for I/O over longer distances (feet)
• Signals: Ready (M), Acknowledge (S), Data (S) valid, Strobe data (M)




Fig 8.7c Asynchronous Data Input

(c) Asynchronous input

Signals: Ready, Acknowledge, Data valid, Strobe data

The four steps of the handshake are annotated in order:
• “May I?”
• “Yes, you may.”
• “Thanks.”
• “You’re welcome.”




Example: Programmed I/O Device Driver for Character Output

• Device requirements:
  • 8 data lines set to bits of an ASCII character
  • Start signal to begin operation
  • Data bits held until device returns Done signal

• Design decisions matching bus to device
  • Use low-order 8 bits of word for character
  • Make loading of character register signal Start
  • Clear Ready status bit on Start and set it on Done
  • Return Ready as sign of status register for easy testing

Output Register: bits 31..8 unused, bits 7..0 Character

Status Register: bit 31 Ready, bits 30..0 unused




Fig 8.8 Program Fragment for Character Output

• For readability: I/O registers are all caps., program locations have initial cap., and instruction mnemonics are lower case

• A 10 MIPS SRC would execute 10,000 instructions while waiting for each character on a 1,000 character/sec printer

Status register COSTAT = FFFFF110H: bit 31 Ready, bits 30..0 unused

Output register COUT = FFFFF114H: bits 31..8 unused, bits 7..0 char

        lar   r3, Wait     ;Set branch target for wait.
        ldr   r2, Char     ;Get character for output.
Wait:   ld    r1, COSTAT   ;Read device status register,
        brpl  r3, r1       ;Branch to Wait if not ready.
        st    r2, COUT     ;Output character and start device.




Fig 8.9 Program Fragment to Print 80-Character Line

Status Register LSTAT = FFFFF130H: bit 31 Ready, bits 30..0 unused

Output Register LOUT = FFFFF134H: bits 31..8 unused, bits 7..0 char

Command Register LCMD = FFFFF138H: bits 31..1 unused, bit 0 Print line

        lar   r1, Buff     ;Set pointer to character buffer.
        la    r2, 80       ;Initialize character counter and
        lar   r3, Wait     ; branch target.
Wait:   ld    r0, LSTAT    ;Read Ready bit,
        brpl  r3, r0       ; test, and repeat if not ready.
        ld    r0, 0(r1)    ;Get next character from buffer,
        st    r0, LOUT     ; and send to printer.
        addi  r1, r1, 4    ;Advance character pointer, and
        addi  r2, r2, -1   ; count character.
        brnz  r3, r2       ;If not last, go wait on ready.
        la    r0, 1        ;Get a print line command,
        st    r0, LCMD     ; and send it to the printer.




Multiple Input Device Driver Software

• 32 low-speed input devices
  • Say, keyboards at ≈10 characters/sec
  • Max aggregate rate of one character every ≈3 ms (32 devices × 10 characters/sec)

• Each device has a control/status register
  • Only Ready status bit, bit 31, is used
  • Driver works by polling (repeatedly testing) Ready bits

• Each device has an 8-bit input data register
  • Bits 7..0 of 32-bit input word hold the character

• Software controlled by pointer and Done flag
  • Pointer to next available location in input buffer
  • Device’s Done is set when CR received from device
  • Device is idle until other program (not shown) clears Done




Driver Program Using Polling for 32 Character Input Devices

CICTL   .equ  FFFFF300H    ;First input control register.
CIN     .equ  FFFFF304H    ;First input data register.
CR      .equ  13           ;ASCII carriage return.
Bufp:   .dcw  1            ;Loc. for first buffer pointer.
Done:   .dcw  63           ;Done flags and rest of pointers.
Driver: lar   r4, Next     ;Branch targets to advance to next
        lar   r5, Check    ; character, check device active,
        lar   r6, Start    ; and start a new polling pass.

• 32 pairs of control/status and input data registers

Register use: r0 - working reg, r1 - input char., r2 - device index, r3 - none active

Address     Register
FFFFF300    Dev 0 CTL
FFFFF304    Dev 0 IN
FFFFF308    Dev 1 CTL
FFFFF30C    Dev 1 IN
FFFFF310    Dev 2 CTL




Driver Program Using Polling for 32 Character Input Devices (cont’d)

Start:  la    r2, 0          ;Point to first device, and
        la    r3, 1          ; set all inactive flag.
Check:  ld    r0,Done(r2)    ;See if device still active, and
        brmi  r4, r0         ; if not, go advance to next device.
        la    r3, 0          ;Clear the all inactive flag.
        ld    r0,CICTL(r2)   ;Get device ready flag, and
        brpl  r4, r0         ; go advance to next if not ready.
        ld    r0,CIN(r2)     ;Get character and
        ld    r1,Bufp(r2)    ; correct buffer pointer, and
        st    r0, 0(r1)      ; store character in buffer.
        addi  r1,r1,4        ;Advance character pointer,
        st    r1,Bufp(r2)    ; and return it to memory.
        addi  r0,r0,-CR      ;Check for carriage return, and
        brnz  r4, r0         ; if not, go advance to next device.
        la    r0, -1         ;Set done flag to -1 on
        st    r0,Done(r2)    ; detecting carriage return.
Next:   addi  r2,r2,8        ;Advance device pointer, and
        addi  r0,r2,-256     ; if not last device,
        brnz  r5, r0         ; go check next one.
        brzr  r6, r3         ;If a device is active, make a new pass.




Characteristics of the Polling Device Driver

• If all devices are active and always have a character ready,
  • Then 32 bytes are input in 547 instructions
  • This is a data rate of 585 KB/s in a 10 MIPS CPU

• But, if the CPU just misses the setting of Ready, 538 instructions are executed before testing it again
  • This 53.8 µsec delay means that a single device must run at less than 18.6 Kchars/s to avoid risk of losing data
  • Keyboards are thus slow enough
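The figures on this slide follow directly from the instruction counts. The sketch below reproduces the arithmetic; the function and parameter names are hypothetical, and the instruction counts (547 and 538) and the 10 MIPS rate are taken from the bullets above.

```python
def polling_rates(instr_per_pass: int = 547,
                  instr_after_miss: int = 538,
                  devices: int = 32,
                  instr_per_sec: float = 10e6) -> dict:
    """Throughput and worst-case latency of the polling driver."""
    pass_time = instr_per_pass / instr_per_sec        # seconds per full pass
    aggregate_rate = devices / pass_time              # bytes per second, all devices
    worst_delay = instr_after_miss / instr_per_sec    # seconds until Ready is retested
    max_device_rate = 1.0 / worst_delay               # chars per second, one device
    return {
        "aggregate_KB_per_s": aggregate_rate / 1e3,
        "worst_delay_us": worst_delay * 1e6,
        "max_device_Kchars_per_s": max_device_rate / 1e3,
    }


if __name__ == "__main__":
    for name, value in polling_rates().items():
        print(f"{name}: {value:.1f}")
```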




Tbl 8.1 Signal Names and Functions for the Centronics Printer Interface

Interface Signal   Direction   Description
STROBE             Out         Data out strobe
D0                 Out         Least significant data bit
D1                 Out         Data bit
...                ...         ...
D7                 Out         Most significant data bit
ACKNLG             In          Pulse on done with last character
BUSY               In          Not ready
PE                 In          No paper when high
SLCT               In          Pulled high
AUTOFEEDXT         Out         Auto line feed
INIT               Out         Initialize printer
ERROR              In          Can’t print when low
SLCTIN             Out         Deselect protocol




Fig 8.11 Centronics Printer Data Transfer Timing

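• Minimum times specified for output signals

• Nominal times specified for input signals

Figure: timing of D0–D7 valid data, STROBE, BUSY, and ACKNLG. The three intervals on the output signals (data and STROBE) are each ≥ 0.5 µs; the intervals on the input signals (BUSY and ACKNLG) are about 5 µs and about 7 µs.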




I/O Interrupts

• Key idea: instead of processor executing wait loop, device requests interrupt when ready

• In SRC the interrupting device must return the vector address and interrupt information bits

• Processor must tell device when to send this information—done by acknowledge signal

• Request and acknowledge form a communication handshake pair

• It should be possible to disable interrupts from individual devices




Fig 8.12 Simplified Interrupt Circuit for an I/O Interface

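• Request and enable flags per device

• Returns vector and interrupt information on bus when acknowledged

Figure: an interrupt request flip-flop and an interrupt enable flip-flop together drive the open-collector (o.c.) ireq line. When iack is received, vect⟨7..0⟩ is placed on data⟨23..16⟩ and info⟨15..0⟩ on data⟨15..0⟩.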




Fig 8.13 Daisy-Chained Interrupt Acknowledge Signal

• How does acknowledge signal select one and only one device to return interrupt information?

• One way is to use a priority chain with acknowledge passed from device to device

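Figure: I/O device 0, I/O device 1, ..., I/O device j all connect to the data and ireq lines of the I/O bus, while iack is passed from device to device along the chain.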




Fig 8.14 Interrupt Logic in an I/O Interface

• Request set by Ready, cleared by acknowledge

• iack only sent out if this device not requesting

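Figure: the control register, selected by the control register address with Read, Write, and Complete, holds Ready (bit 31), Interrupt request (bit 30), and Interrupt enable (bit 0), connected to data⟨31⟩, data⟨30⟩, and data⟨0⟩ through tri-state (TS) drivers; the remaining bits are unused. Ready sets the Request flip-flop, and Request with Enable drives the open-collector (o.c.) ireq line. When iack (in) arrives and this device is requesting, vect⟨7..0⟩ is gated onto data⟨23..16⟩ and info⟨15..0⟩ onto data⟨15..0⟩; otherwise iack (out) is passed to the next device.

Data returned on iack: bits 31..24 unused, bits 23..16 vect, bits 15..0 info.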




Getline Subroutine for Interrupt-Driven Character I/O

CICTL register: bit 31 Ready, bit 30 Int. req., bits 11..4 vect, bit 0 Int. enb.; the remaining bits are unused.

;Getline is called with return address in R31 and a pointer to a
;character buffer in R1. It will input characters up to a carriage
;return under interrupt control, setting Done to -1 when complete.
CR      .equ  13           ;ASCII code for carriage return.
CIvec   .equ  01F0H        ;Character input interrupt vector address.
Bufp:   .dw   1            ;Pointer to next character location.
Save:   .dw   2            ;Save area for registers on interrupt.
Done:   .dw   1            ;Flag location is -1 if input complete.
Getln:  st    r1, Bufp     ;Record pointer to next character.
        edi                ;Disable interrupts while changing mask.
        la    r2, 1F1H     ;Get vector address and device enable bit
        st    r2, CICTL    ; and put into control register of device.
        la    r3, 0        ;Clear the
        st    r3, Done     ; line input done flag.
        een                ;Enable Interrupts
        br    r31          ; and return to caller.




Interrupt Handler for SRC Character Input

        .org  CIvec          ;Start handler at vector address.
        str   r0, Save       ;Save the registers that
        str   r1, Save+4     ; will be used by the interrupt handler.
        ldr   r1, Bufp       ;Get pointer to next character position.
        ld    r0, CIN        ;Get the character and enable next input.
        st    r0, 0(r1)      ;Store character in line buffer.
        addi  r1, r1, 4      ;Advance pointer and
        str   r1, Bufp       ; store for next interrupt.
        lar   r1, Exit       ;Set branch target.
        addi  r0,r0, -CR     ;Carriage return? addi with minus CR.
        brnz  r1, r0         ;Exit if not CR, else complete line.
        la    r0, 0          ;Turn off input device by
        st    r0, CICTL      ; disabling its interrupts.
        la    r0, -1         ;Get a -1 indicator, and
        str   r0, Done       ; report line input complete.
Exit:   ldr   r0, Save       ;Restore registers
        ldr   r1, Save+4     ; of interrupted program.
        rfi                  ;Return to interrupted program.




General Functions of an Interrupt Handler

(1) Save the state of the interrupted program

(2) Do programmed I/O operations to satisfy the interrupt request

(3) Restart or turn off the interrupting device

(4) Restore the state and return to the interrupted program




Interrupt Response Time

• Response to another interrupt is delayed until interrupts reenabled by rfi

• Character input handler disables interrupts for a maximum of 17 instructions

• If the CPU clock is 20 MHz, it takes 10 cycles to acknowledge an interrupt, and the average execution rate is 8 CPI,

then a 2nd interrupt could be delayed by (10 + 17 × 8) / 20 = 7.3 µsec
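The same estimate can be written as a small function, which makes it easy to try other clock rates or handler lengths. The function and parameter names are hypothetical; the default values are the ones used on this slide.

```python
def interrupt_delay_us(ack_cycles: int = 10,
                       handler_instructions: int = 17,
                       cycles_per_instruction: int = 8,
                       clock_mhz: float = 20.0) -> float:
    """Worst-case delay, in microseconds, before a second interrupt is seen."""
    total_cycles = ack_cycles + handler_instructions * cycles_per_instruction
    return total_cycles / clock_mhz   # cycles / (cycles per microsecond)


if __name__ == "__main__":
    print(f"{interrupt_delay_us():.1f} microseconds")
```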




Nested Interrupts—Interrupting an Interrupt Handler

• Some high-speed devices have a deadline for interrupt response
  • Longer response times may miss data on a moving medium
  • A real-time control system might fail to meet specifications

• To meet a short deadline, it may be necessary to interrupt the handler for a slow device

• The higher priority interrupt will be completely processed before returning to the interrupted handler

• Hence the designation nested interrupts

• Interrupting devices are priority ordered by shortness of their deadlines




Steps in the Response of a Nested Interrupt Handler

(1) Save the state changed by interrupt (IPC and II);

(2) Disable lower priority interrupts;

(3) Reenable exception processing;

(4) Service interrupting device;

(5) Disable exception processing;

(6) Reenable lower priority interrupts;

(7) Restore saved interrupt state (IPC and II);

(8) Return to interrupted program and reenable exceptions.




Fig 8.16 Interrupt Masks for Executing Device j Handler

• Conceptually, a priority interrupt scheme could be managed using device-enable bits

• Order the bits from left to right in order of increasing priority to form an interrupt mask

• Value of the mask when executing device j interrupt handler is

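Figure: the mask bits run from low priority on the left to high priority on the right. The position of the device j enable bit marks the boundary between the 0 bits (disabled) on the low-priority side and the 1 bits (enabled) on the high-priority side.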




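Fig 8.17 Priority Interrupt System with m = 2ᵏ Levels

Figure: request lines req0, req1, ..., reqm–1 feed a priority encoder that produces a k-bit level. A comparator compares this level (A) with the k-bit current level register (B) and raises req when B < A. The current level register is loaded with the new level, and a decoder driven by ack asserts the matching line among ack0, ack1, ..., ackm–1.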




Direct Memory Access (DMA)

• Allows external devices to access memory without processor intervention

• Requires a DMA interface device

• Must be “set up” or programmed and transfer initiated




Steps a DMA Device Interface Must Take to Transfer a Block of Data

1. Become bus master

2. Send memory address and R/W signal

3. Synchronized sending and receiving of data using Complete signal

4. Release bus as needed (perhaps after each transfer)

5. Advance memory address to point to next data item

6. Count number of items transferred and check for end of data block

7. Repeat if more data to be transferred




Fig 8.18 I/O Interface Architecture for a DMA Device

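Figure: on the bus side, the interface connects to the data, address, and control lines, including Bus request, Bus grant, Read, Write, and Complete. Inside the interface are the I/O address decode, bus control, Memory address register, Count register, Control register, and the DMA sequencer. A packing, unpacking, and buffering block sits between the bus data lines and the device data lines.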




Fig 8.19 Multiplexer and Selector DMA Channels

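Figure: the processor, memory, a DMA multiplexer, and a DMA selector share the bus. The DMA multiplexer keeps a separate Count and Address pair for each of its devices (Count 1 and Address 1, Count 2 and Address 2, for Devices M1 and M2). The DMA selector has a single Count and Address, used by whichever of Devices S1 and S2 is selected.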




Error Detection and Correction

• Bit-error rate, BER, is the probability that, when read, a given bit will be in error

• BER is a statistical property
  • Especially important in I/O, where noise and signal integrity cannot be so easily controlled
  • 10⁻¹⁸ inside processor
  • 10⁻⁸ to 10⁻¹² or worse in outside world

• Many techniques
  • Parity check
  • SECDED encoding
  • CRC




Parity Checking

• Add a parity bit to the word
  • Even parity: add a bit if needed to make the number of 1 bits even
  • Odd parity: add a bit if needed to make the number of 1 bits odd

• Example: the word 10011010 has four 1 bits, so the odd parity bit is 1, giving 100110101
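A parity generator is short enough to show in full. The sketch below appends the parity bit at the right-hand end, as in the example above; the function name and the use of a bit string are assumptions made for illustration.

```python
def add_parity(word: str, odd: bool = True) -> str:
    """Append a parity bit to a string of '0' and '1' characters."""
    ones = word.count("1")
    if odd:
        parity = 1 - ones % 2    # make the total number of 1 bits odd
    else:
        parity = ones % 2        # make the total number of 1 bits even
    return word + str(parity)


def parity_ok(word_with_parity: str, odd: bool = True) -> bool:
    """Check a received word; a single flipped bit makes this False."""
    return word_with_parity.count("1") % 2 == (1 if odd else 0)


if __name__ == "__main__":
    print(add_parity("10011010", odd=True))
```

A single parity bit detects any odd number of bit errors but cannot locate them, which motivates the Hamming codes that follow.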




Hamming Codes

• Hamming codes are a class of codes that use combinations of parity checks to both detect and correct errors.

• They add a group of parity check bits to the data bits.

• For ease of visualization, intersperse the parity bits within the data bits; reserve bit locations whose bit numbers are powers of 2 for the parity bits. Number the bits from left to right, starting at 1.

• A given parity bit is computed from data bits whose bit numbers contain a 1 at the parity bit number.




Fig 8.20 Multiple Parity Checks Making Up a Hamming Code

• Add parity bits, Pi, to data bits, Di.

• Reserve bit numbers that are a power of 2 for parity bits.
  • Example: P1 = 001, P2 = 010, P4 = 100, etc.

• Each parity bit, Pi, is computed over those data bits that have a "1" at the bit number of the parity bit.
  • Example: P2 (010) is computed from D3 (011), D6 (110), D7 (111), ...

• Thus each bit takes part in a different combination of parity checks.

• When the word is checked, if only one bit is in error, all the parity bits that use it in their computation will be incorrect.

Bit position          1    2    3    4    5    6    7
Parity or data bit    P1   P2   D3   P4   D5   D6   D7
Check 0               x         x         x         x
Check 1                    x    x              x    x
Check 2                              x    x    x    x




Example 8.1 Encode 1011 Using the Hamming Code and Odd Parity

• Insert the data bits: P1 P2 1 P4 0 1 1

• P1 is computed from P1 ⊕ D3 ⊕ D5 ⊕ D7 = 1, so P1 = 1.

• P2 is computed from P2 ⊕ D3 ⊕ D6 ⊕ D7 = 1, so P2 = 0.

• P4 is computed from P4 ⊕ D5 ⊕ D6 ⊕ D7 = 1, so P4 = 1.

• The final encoded number is 1 0 1 1 0 1 1.

• Note that the Hamming encoding scheme assumes that at most one bit is in error.
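The encoding procedure of Example 8.1 can be sketched as follows. Bits are numbered from 1 on the left, parity bits occupy the power-of-2 positions, and each parity bit covers the positions whose number has the matching bit set. The function name and the bit-string interface are assumptions made for illustration.

```python
def hamming_encode(data: str, odd: bool = True) -> str:
    """Interleave parity bits with the data bits (positions numbered from 1)."""
    m = len(data)
    r = 0
    while (1 << r) < m + r + 1:      # number of parity bits needed
        r += 1
    n = m + r
    code = [0] * (n + 1)             # index 0 unused so indices match positions
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):          # not a power of 2: a data position
            code[pos] = int(next(bits))
    for k in range(r):
        p = 1 << k                   # parity position 1, 2, 4, ...
        ones = sum(code[pos] for pos in range(1, n + 1) if pos & p)
        # Choose the parity bit so the whole check group has the wanted parity.
        code[p] = (ones % 2) ^ (1 if odd else 0)
    return "".join(str(b) for b in code[1:])


if __name__ == "__main__":
    print(hamming_encode("1011"))    # Example 8.1
```

On a received word, recomputing each check and collecting the failures gives the position of a single erroneous bit.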




SECDED (Single Error Correct, Double Error Detect)

• Add another parity bit, at position 0, which is computed to make the parity over all bits, data and parity, even or odd.

• If one bit is in error, a unique set of Hamming checks will fail, and the overall parity will also be wrong.

• Let ci be true if check i fails, otherwise false.

• In the case of a 1-bit error, the string ck-1, . . ., c1, c0 will be the binary index of the erroneous bit.

• For example, if the ci string is 0110, then the bit at position 6 is in error.

• If two bits are in error, one or more Hamming checks will fail, but the overall parity will be correct.

• Thus the failure of one or more Hamming checks, coupled with correct overall parity, means that 2 bits are in error.

• This assumes that the probability of 3 or more bits being in error is negligible.




Example 8.2 Compute the Odd Parity SECDED Encoding of the 8-bit value 01101011

The 8 data bits 01101011 would have 5 parity bits added to them to make the 13-bit value P0 P1 P2 0 P4 1 1 0 P8 1 0 1 1.

Now P1 = 0, P2 = 1, P4 = 0, and P8 = 0, and we can compute that P0, overall parity, = 1, giving the encoded value:

1 0 1 0 0 1 1 0 0 1 0 1 1




Example 8.3 Extract the Correct Data Value from the String 0110101101101, Assuming Odd Parity

• The string shows even parity, so there must be a single bit in error.

• Checks c2 and c4 fail, giving the binary index of the erroneous bit as 0110 = 6, so D6 is in error.

• It should be 0 instead of 1.
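The SECDED decision rule can be sketched in code and checked against Examples 8.2 and 8.3. The word is given as a bit string whose first character is P0 (position 0). The function name and return convention are hypothetical, and the sketch assumes at most two bits are in error.

```python
from typing import Optional, Tuple


def secded_check(word: str, odd: bool = True) -> Tuple[str, Optional[str]]:
    """Return (status, corrected word or None) for a SECDED code word."""
    bits = [int(b) for b in word]          # bits[i] is the bit at position i
    n = len(bits) - 1
    target = 1 if odd else 0               # parity every check should show
    syndrome = 0
    p = 1
    while p <= n:                          # Hamming checks at positions 1, 2, 4, ...
        ones = sum(bits[pos] for pos in range(1, n + 1) if pos & p)
        if ones % 2 != target:
            syndrome |= p                  # this check failed
        p <<= 1
    overall_ok = sum(bits) % 2 == target   # overall parity, including P0
    if overall_ok and syndrome == 0:
        return "ok", word
    if not overall_ok:                     # single error; syndrome 0 means P0 itself
        if syndrome > n:
            return "uncorrectable", None
        bits[syndrome] ^= 1
        return f"corrected bit {syndrome}", "".join(str(b) for b in bits)
    return "double error", None            # checks fail but overall parity is correct


if __name__ == "__main__":
    print(secded_check("0110101101101"))   # Example 8.3
```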




Cyclic Redundancy Check, CRC

• When data is transmitted serially over communications lines, the pattern of errors usually results in several or many bits in error, due to the nature of line noise.

• The "crackling" of telephone lines is this kind of noise.

• Parity checks are not as useful in these cases.

• Instead CRC checks are used.

• The CRC can be generated serially.

• It usually consists of XOR gates.




Fig 8.21 CRC Generator Based on the Polynomial x¹⁶ + x¹² + x⁵ + 1

• The number and position of XOR gates is determined by the polynomial.

• CRC does not support error correction but the CRC bits generated can be used to detect multibit errors.

• The CRC results in extra CRC bits, which are appended to the data word and sent along.

• The receiving entity can check for errors by recomputing the CRC and comparing it with the one that was transmitted.
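A bit-serial CRC for the polynomial x¹⁶ + x¹² + x⁵ + 1 can be sketched in software. The sketch assumes most-significant-bit-first processing and an initial register value of zero; the bit ordering of the hardware in Fig 8.21 may differ, and the function name is hypothetical.

```python
POLY = 0x1021   # x^16 + x^12 + x^5 + 1, with the x^16 term implicit


def crc16(data: bytes, init: int = 0x0000) -> int:
    """Compute a 16-bit CRC one bit at a time, as a shift register would."""
    crc = init
    for byte in data:
        crc ^= byte << 8                  # bring the next 8 data bits in at the top
        for _ in range(8):
            if crc & 0x8000:              # bit shifted out is 1: apply the XOR taps
                crc = ((crc << 1) ^ POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


if __name__ == "__main__":
    message = b"123456789"
    check = crc16(message)
    sent = message + check.to_bytes(2, "big")   # CRC bits appended to the data
    print(hex(check), crc16(sent) == 0)
```

With these settings the receiver can either recompute the CRC over the data and compare, or run the CRC over data plus appended CRC and test for a zero result.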

Figure: a 16-bit shift register, b0 through b15, fed by the serial Data input, with XOR gates placed according to the polynomial.




Fig 8.22 Serial Data Transmission with Appended CRC Code

Figure: words are read from memory and sent as serial data bits, one word per word time. The output transmission consists of the words followed by the appended CRC.




Chapter 8 Summary

• I/O subsystem has characteristics that make it different from main memory
  • Speed variations
  • Latency
  • Bandwidth

• This leads to 3 different kinds of I/O:
  • Programmed I/O handled completely by software, from initiation until completion
  • Interrupt-driven I/O combines hardware for initiation and software for completion
  • DMA allows an all-hardware approach to I/O activities

• External connections to devices may require data format changes, and error detection and possibly correction


9-1 Chapter 9—Peripheral Devices


Chapter 9: Peripheral Devices

Topics

9.1 Magnetic Disk Drives
• Ubiquitous and complex
• Other moving media devices: tape and CD ROM

9.2 Display Devices
• Video monitors: analog characteristics
• Video terminals
• Memory-mapped video displays
• Flat-panel displays

9.3 Printers
• Dot matrix, laser, inkjet

9.4 Input Devices
• Manual input: keyboards and mice

9.5 Interfacing to the Analog World




Tbl 9.1 Some Common Peripheral Interface Standards

Bus Standard     Data Rate            Bus Width
Centronics       ~50 KB/s             8-bit parallel
EIA RS232/422    30 B/s to 20 KB/s    Bit-serial
SCSI             Few MB/s             16-bit parallel
Ethernet         10–100 Mb/s          Bit-serial




Disk Drives—Moving Media Magnetic Recording

• High density and nonvolatile
  • Densities approaching semiconductor RAM on an inexpensive medium
  • No power required to retain stored information

• Motion of medium supplies power for sensing

• More random access than tape: direct access
  • Different platters selected electronically
  • Track on platter selected by head movement
  • Cyclic sequential access to data on a track

• Structured address of data on disk
  • Drive: Platter: Track: Sector: Byte




Fig 9.3 Cutaway View of a Multiplatter Hard Disk Drive

Figure labels: edge connector, read/write heads, drive electronics, disk platters.




Fig 9.4 Simplified View of Disk Track and Sector Organization

• An integral number of sectors are recorded around a track

• A sector is the unit of data transfer to or from the disk

Track 0

Track 1,023

Sector 63

Sector 0

Sector 1

Sector 2




Track and Sector Characteristics

• Inside tracks are shorter and thus have higher densities or fewer words
  • All sectors contain the same number of bytes
  • Inner portions of a platter may have fewer sectors per track

• Small areas of the disk are magnetized in different directions

• Change in magnetization direction is what is detected on read

[Fig 9.5: disk motion past the read/write head; magnetized regions on the surface are labeled L, L, L, R.]




Fig 9.6 Typical Hard Disk Sector Organization

• Serial bit stream has header, data, and error code
• Header synchronizes sector read and records sector address
• Data length is usually a power of 2 bytes
• Error detection/correction code needed at end

One sector:

Field                                                Length
Location and synchronization information (header)    10 bytes
Data                                                 512 bytes
ECC information                                      12 bytes




Disk Formatting

• Disks are preformatted with track and sector address written in headers

• Disk surface defects may cause some sectors to be marked unusable for the software




Fig 9.7 The PC AT Block Address for Disk Access

• Diagram is for the PC AT
• Head number determines platter surface
• Cylinder is track number for all heads
• Count sectors, up to a full track, can be accessed in one operation

Logical block address fields, with widths in bits:

Drive #, etc.    Head #    Cylinder #    Sector #    Sector count
4                4         16            8           8




The Disk Access Process

1. OS communicates LBA to the disk interface and issues a READ command.

2. Drive seeks to the correct track by moving heads to the correct position and enabling the appropriate head.

3. Sector headers pass under the head until the correct sector is found.

4. Sector data and ECC stream into a buffer. ECC is done "on the fly."

5. Drive communicates "data ready" to the OS.

6. OS reads data byte by byte or by using DMA.




Static Disk Characteristics

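• Areal density of bits on surface:

$$\text{density} = \frac{1}{\text{bit spacing} \times \text{track spacing}}$$

• Maximum density: density on innermost track
• Unformatted capacity: includes header and error control bits
• Formatted capacity:

$$\text{capacity} = \frac{\text{bytes}}{\text{sector}} \times \frac{\text{sectors}}{\text{track}} \times \frac{\text{tracks}}{\text{surface}} \times \text{\# of surfaces}$$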
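The formatted-capacity product can be checked with a few lines of Python. This is a minimal sketch: the function name is illustrative, and the sample drive (512-byte sectors, 64 sectors per track, 1,024 tracks per surface, 8 surfaces) is a hypothetical combination of figures, not a specific product.

```python
def formatted_capacity(bytes_per_sector, sectors_per_track,
                       tracks_per_surface, surfaces):
    """Formatted capacity in bytes: the product of the four factors."""
    return bytes_per_sector * sectors_per_track * tracks_per_surface * surfaces


# Hypothetical drive used only for illustration.
capacity = formatted_capacity(512, 64, 1024, 8)
print(capacity, "bytes =", capacity // 2**20, "MiB")
```

Header and error-control bytes are deliberately left out, since they count toward unformatted capacity only.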




Dynamic Disk Characteristics

• Seek time: time to move heads to cylinder
• Track-to-track access: time to adjacent track
• Rotational latency: time for correct sector to come under read/write head
• Average access time: seek time + rotational latency
• Burst rate (maximum transfer bandwidth):

$$\text{burst rate} = \frac{\text{revs}}{\text{sec}} \times \frac{\text{sectors}}{\text{rev}} \times \frac{\text{bytes}}{\text{sector}}$$
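As a rough worked example of these dynamic quantities, the sketch below computes burst rate and average access time. It assumes the average rotational latency is half a revolution, which the slide does not state, and the drive parameters (3,600 rpm, 64 sectors per track, 512-byte sectors, 10 ms average seek) are hypothetical.

```python
def average_rotational_latency_s(rpm):
    """Assumed average latency: half of one revolution, in seconds."""
    return 0.5 * 60.0 / rpm


def average_access_time_s(avg_seek_s, rpm):
    """Average access time = seek time + rotational latency."""
    return avg_seek_s + average_rotational_latency_s(rpm)


def burst_rate(rpm, sectors_per_track, bytes_per_sector):
    """Burst rate in bytes/s = revs/sec x sectors/rev x bytes/sector."""
    return (rpm / 60.0) * sectors_per_track * bytes_per_sector


# Hypothetical drive parameters, for illustration only.
print(burst_rate(3600, 64, 512), "bytes/s")
print(average_access_time_s(0.010, 3600) * 1000, "ms")
```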




Video Monitors

• Color or black and white
• Image is traced on screen a line at a time in a raster format
• Screen dots, or pixels, are sent serially to the scanning electron beam
• Beam is deflected horizontally and vertically to form the raster
• About 60 full frames are displayed per second
• Vertical resolution is number of lines: ≈500
• Horizontal resolution is dots per line: ≈700
• Dots per sec: ≈ 60 × 500 × 700 ≈ 21M




Fig 9.8 Schematic View of a Black-and-White Video Monitor

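[Figure: video circuitry separates the video input into H sync, V sync, and intensity. The horizontal and vertical deflection signals drive the deflection yoke of the cathode ray tube (CRT: cathode, glass envelope, phosphor-coated screen, ~20,000 V), and the intensity information (pixel stream) controls the beam. Timing sketches show one line and one frame. Note: signal timings are not to scale.]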




Two Video Display Types: Terminal and Memory-Mapped

• Video monitor can be packaged with display memory and keyboard to form a terminal

• Video monitor can be driven from display memory that is memory-mapped

• Video display terminals are usually character-oriented devices

• Low bandwidth connection to the computer

• Memory-mapped displays can show pictures and motion

• High bandwidth connection to memory bus allows fast changes




Fig 9.9 (a) The Video Display Terminal

(Character-oriented)

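[Figure: in the computer, the CPU, memory, and a serial interface share the system bus. RS-232 serial ASCII characters pass between that interface and the serial interface of the video display terminal, which contains local display memory, video drive circuitry sending video out to the monitor, and a keyboard supplying parallel ASCII characters.]

(a) Video display terminal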




Fig 9.9 (b) Memory-Mapped Video Display

(Pixel-oriented)

[Figure: in the computer, the CPU, memory, and display memory share the system bus; video drive circuitry reads display memory and sends video out to the monitor (keyboard not shown).]

(b) Memory-mapped video




Memory Representations of Displayed Information

• Bit-mapped displays
  • Each pixel represented by a memory datum
  • Black and white displays can use a bit per pixel
  • Gray scale or color needs several bits per pixel

• Character-oriented (alphanumeric) displays
  • Only character codes stored in memory
  • Character code converted to pixels by a character ROM
  • A character generates several successive pixels on several successive lines




Fig 9.10 Dot Matrix Characters and Character Generator

• Bits of a line are read out serially
• Accessed 9 times at same horizontal position and successive vertical positions

[Figure: (a) character matrix for "A"; (b) character ROM addressed by a 7-bit character code and a 4-bit line number (example: line 0001), producing a 7-bit dot pattern (example: 0011100).]




Fig 9.11 Video Controller for an Alphanumeric Display

• Counters count:
  • the 7 dots in a character,
  • the 80 characters across a screen,
  • the 9 lines in a character, and
  • the 64 rows of characters from top to bottom

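[Figure: four cascaded counters, each incremented by the overflow (OF) of the one before it:
  • Character dot counter: 0–6
  • Character column counter: 0–79 (7 bits)
  • Character line counter: 0–8 (4 bits)
  • Character row counter: 0–63 (6 bits)
The row and column counts form the 13-bit display memory address. Display memory, loaded from the serial interface, supplies a 7-bit ASCII character to the character generator, which combines it with the 4-bit character line to produce a 7-bit dot pattern. A shift register serializes the dot pattern into video (pixels). The counters also produce horizontal (line) sync and vertical (frame) sync.]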
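The counter cascade in Fig 9.11 can be mimicked in software as nested loops, with the dot counter changing fastest and the row counter slowest. This is an illustrative sketch only: it uses the counter ranges shown in the figure (0–6, 0–79, 0–8, 0–63) and assumes the 13-bit display memory address is the 6-bit row count concatenated with the 7-bit column count.

```python
DOTS_PER_CHAR = 7     # character dot counter: 0-6
COLUMNS = 80          # character column counter: 0-79
LINES_PER_CHAR = 9    # character line counter: 0-8
ROWS = 64             # character row counter: 0-63


def scan_order():
    """Yield (row, line, column, dot) in the order the raster visits them."""
    for row in range(ROWS):
        for line in range(LINES_PER_CHAR):
            for column in range(COLUMNS):
                for dot in range(DOTS_PER_CHAR):
                    yield row, line, column, dot


def display_address(row, column):
    """Assumed address layout: 6-bit row in the high bits, 7-bit column low."""
    return (row << 7) | column


dots_per_frame = sum(1 for _ in scan_order())
print(dots_per_frame, display_address(63, 79))
```

Each character position in display memory is read once per character line, so the same address recurs 9 times while the line counter steps from 0 to 8.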




Fig 9.12 Memory-Mapped Video Controller for a 24-Bit Color Display

• Memory must store 24 bits per pixel for 256-level resolution

• At 20M dots per second the memory bandwidth is very high

• Place for video RAM

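[Figure: display memory, connected to the n-bit processor bus and addressed by the display memory address from the timing and sync generator, feeds 8 bits to each of three 8-bit video DACs, which produce the red, green, and blue video signals. The timing and sync generator, driven by the dot clock, also produces horizontal (line) sync and vertical (frame) sync.]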
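A quick calculation shows why the bandwidth is described as very high. The sketch below uses the 24 bits per pixel and 20M dots per second quoted above; the 700 × 500 frame size is taken from the approximate monitor resolution given earlier and is only an estimate.

```python
def display_bandwidth_bytes_per_s(dots_per_s, bits_per_pixel):
    """Bytes per second that display memory must deliver to the DACs."""
    return dots_per_s * bits_per_pixel // 8


def frame_buffer_bytes(dots_per_line, lines, bits_per_pixel):
    """Bytes of display memory needed to hold one full frame."""
    return dots_per_line * lines * bits_per_pixel // 8


print(display_bandwidth_bytes_per_s(20_000_000, 24), "bytes/s")
print(frame_buffer_bytes(700, 500, 24), "bytes per frame")
```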




Flat-Panel Displays

• Allow electrical control over the transparency of a liquid crystal material sandwiched between glass plates, dot by dot

• 3 dots per pixel for color, one for black and white
• Dots are scanned in a raster format, so controller similar to that for video monitor
• Passive matrix has X and Y drive transistors at edges
• Active matrix has one (or 3) transistor(s) per dot




Printers—Ways of Getting Ink on Paper

• Dot matrix printer:
  • Row of solenoid actuated pins, could be height of character matrix
  • Inked ribbon struck by pin to mark paper
  • Low resolution

• Laser printer:
  • Positively charged drum scanned by laser to discharge individual pixels
  • Ink adheres to remaining positive surface portions
  • 300 to 1200 dots per inch resolution

• Ink-jet printers:
  • Ultrasonic transducer squirts very small jet of ink at correct pixels as head moves across paper
  • Intermediate between the 2 in price and resolution




Fig 9.13 Character Generation in Dot Matrix Printers

• Can print a column at a time from a character ROM

• ROM is read out parallel by column instead of serial by row, as in alphanumeric video displays

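[Figure: character matrix for "A" and a character ROM addressed by a 7-bit character code and a 3-bit column number (example: column 1), producing a 9-bit column pattern (example: 0 0 1 1 1 1 1 1 0) for the print head, which strikes the ribbon against the paper.]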




Manual Input Devices—Keyboards and Mice

• Very slow input rates
  • 10 characters of 8 bits per second on keyboard
  • Mouse tracking somewhat faster: few X and Y position change bits per millisecond
  • Mouse click: bit per 1/10 second

• Main thrust in manual input design is to reduce number of moving parts




Fig 9.14 ADC and DAC Interfaces

• Begin and Done synchronize A to D conversion, which can take several cycles

• D to A conversion is usually fast in comparison

[Figure: the analog-to-digital converter has inputs Clear, Begin, Clock, and an unknown voltage (0–10 V), and outputs an n-bit Count and Done. The digital-to-analog converter takes an n-bit input word and produces an output voltage.]




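Fig 9.15 R-2R Ladder DAC
Voltage Out Proportional to Binary Number x

$$V_o = \left( x_{n-1} + \frac{1}{2}x_{n-2} + \frac{1}{4}x_{n-3} + \dots + \frac{1}{2^{n-1}}x_0 \right) k V_R$$

[Figure labels: reference V_R; bits x_{n−1}, x_{n−2}, …, x_1, x_0; ladder resistors R and 2R; nodes 0, 1, …, N − 1; op. amp. with R1, R', R_D, V+, and output V_o.]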
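The output formula can be evaluated directly. The sketch below is only a numerical check of the weighted sum; the function name is illustrative, and the gain constant k depends on the op-amp resistors, so it defaults to 1.0 here purely as an assumption.

```python
def r2r_dac_output(bits, v_ref, k=1.0):
    """Evaluate V_o for bits given MSB first: [x_{n-1}, x_{n-2}, ..., x_0].

    Bit x_{n-1-i} carries weight 1/2**i, as in the formula above.
    k is the circuit gain constant (assumed 1.0 for illustration).
    """
    weighted_sum = sum(x / 2**i for i, x in enumerate(bits))
    return weighted_sum * k * v_ref


print(r2r_dac_output([1, 0, 0], v_ref=1.0))
print(r2r_dac_output([1, 1, 1], v_ref=1.0))
```

Doubling the binary input doubles the sum, which is the proportionality the figure title refers to.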




Fig 9.16 Counting Analog-to-Digital Converter

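• Counter increments until DAC output becomes just greater than unknown input

• Conversion time ∝ 2^n for an n-bit converter

[Figure: Begin sets an S-R flip-flop that gates Clock into the Increment input of an n-bit counter (reset by Clear). The n-bit Count drives an n-bit DAC, and an analog comparator compares the DAC output with the 0–10 V input; the comparator output resets the flip-flop and signals Done.]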
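A counting converter is easy to simulate, which makes the 2^n worst-case conversion time visible. In this sketch the DAC is idealized as count × V_full_scale / 2^n, which is an assumption about scaling rather than something the figure specifies, and the function names are illustrative.

```python
def ideal_dac(count, n_bits, v_full_scale):
    """Idealized DAC: output rises by one equal step per count."""
    return count * v_full_scale / 2**n_bits


def counting_adc(v_in, n_bits, v_full_scale=10.0):
    """Count up until the DAC output is just greater than v_in.

    Returns (count, clocks_used). The count saturates at 2**n_bits - 1.
    """
    count = 0
    clocks = 0
    max_count = 2**n_bits - 1
    while ideal_dac(count, n_bits, v_full_scale) <= v_in and count < max_count:
        count += 1
        clocks += 1
    return count, clocks


print(counting_adc(3.0, n_bits=3))
print(counting_adc(9.9, n_bits=3))
```

An input near full scale needs almost 2^n clock periods, while a small input finishes quickly, so conversion time depends on the input value.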





Fig 9.17 Successive-Approximation ADC

• Successive approximation logic uses binary chopping method to get n-bit result in n steps

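[Figure: successive approximation logic (inputs Clear, Clock, Begin; outputs n-bit Count and Done) drives an n-bit DAC; an analog comparator compares the DAC output with the 0–10 V input and feeds the result back to the logic.]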





Fig 9.18 Successive Approximation Search Tree

• Each trial determines one bit of result

• Trial also determines next comparison level

• For specific input, one path from root to leaf in binary tree is traced

• Conversion time ∝ n for an n-bit converter

Try 100
  >: Try 110
    >: Try 111
      >: Value = 111
      <: Value = 110
    <: Try 101
      >: Value = 101
      <: Value = 100
  <: Try 010
    >: Try 011
      >: Value = 011
      <: Value = 010
    <: Try 001
      >: Value = 001
      <: Value = 000
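The search tree corresponds to a short loop that tries one bit per step, from the most significant bit down. The following is a sketch with an idealized DAC (trial × V_full_scale / 2^n, an assumed scaling); it returns the sequence of trial codes so the path through the tree can be compared with Fig 9.18.

```python
def sar_adc(v_in, n_bits, v_full_scale=10.0):
    """Successive approximation: decide one result bit per trial.

    Returns (result, trials), where trials lists the codes tried in order.
    """
    result = 0
    trials = []
    for bit in reversed(range(n_bits)):
        trial = result | (1 << bit)           # tentatively set this bit
        trials.append(trial)
        if v_in >= trial * v_full_scale / 2**n_bits:
            result = trial                    # input is above: keep the bit
    return result, trials


value, path = sar_adc(3.0, n_bits=3)
print(format(value, "03b"), [format(t, "03b") for t in path])
```

For a 3.0 V input on a 10 V scale, the trials are 100, 010, 011 and the result is 010, which traces the "<, >, <" path from the root of the tree. The loop always runs exactly n times.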




Errors in ADC and DAC

• Full scale error: voltage produced by all 1’s input in DAC or voltage producing all 1’s in ADC
• Offset error: DAC output voltage with all 0’s input
• Missing codes: digital values that are never produced by an ADC (skips over as voltage increased)
• Lack of monotonicity: DAC monotonicity means voltage always increases as value increases
• Quantization error: always present in DAC or ADC as a theoretical result of conversion process




Fig 9.19 Signal Quantization and Quantization Error by an ADC

• Ideal output of the ADC for a linearly increasing input

• Error signal corresponding to the ideal ADC output

[Figure: upper plot, ideal output word (000 through 111) versus input voltage from 0 to V_f in steps of V_f/8; lower plot, quantization error (+/−) versus the same input voltage.]

Quantization error = ±V_f/2^n




Chapter 9 Summary

• Structure and characteristics of moving magnetic media storage, especially disks

• Display devices:
  • Analog monitor characteristics
  • Video display terminals
  • Memory-mapped video displays

• Printers: dot matrix, laser, and ink jet
• Manual input devices: keyboards and mice
• Digital-to-analog and analog-to-digital conversion


10-1 Chapter 10—Communications, Networking, and the Internet


Chapter 10: Communications, Networking, and the Internet

Topics

10.1 Computer-to-Computer Data Communications
  • Network communications mechanisms and signal encoding
  • Tasks, roles, and levels in the communications system
  • Communications layer models—The OSI layer model

10.2 Serial Data Communications Protocols

10.3 Local Area Networks
  • LANs; the Ethernet LAN

10.4 The Internet
  • TCP/IP protocols, packet routing, and IP addresses




Communications Protocols

• Modern computer communications spans the range from simple 1-1 communications to the Internet.

• Whenever there is communications, there must be a communications protocol—an agreement or contract.

• Most communications protocols are decided by committee.




Network Structures and Channels

• Three kinds of mechanisms
  • Simplex. One-way communications. Example: remote data logging.
  • Half-duplex. Two-way communications, but only one may talk at once. Example: the police radio.
  • Full duplex. Both may talk at once. Example: the telephone.

• Time domain multiplexing (TDM)
  • Divides the channel into “time slots.”
  • Referred to as baseband systems.

• Frequency domain multiplexing (FDM)
  • Has several “frequency bands.” Example: the TV cable.
  • Referred to as broadband systems.

• Can have combinations: several TDM signals transmitted on each band of a FDM system.




Fig 10.1 Baseband Bit Encoding Schemes

• RZ: pulse encoding
• NRZ: level encoding
• NRZI: 1: transition; 0: no transition
• Manchester: 0: hi-lo; 1: lo-hi

(There are many other schemes as well)

[Figure: RZ, NRZ, NRZI, and Manchester waveforms for a sample bit sequence.]
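The four encodings can be expressed as small functions that turn a bit list into a list of line levels. This is a sketch under stated assumptions: levels are represented as 0 (low) and 1 (high); RZ and Manchester emit two half-bit levels per bit; the RZ convention (a pulse in the first half of the bit time for a 1) and the NRZ polarity (1 = high) are assumptions, since the slide gives only "pulse" and "level" encoding. The NRZI and Manchester rules follow the slide.

```python
def rz(bits):
    """Pulse encoding (assumed): 1 -> high then back to low; 0 -> stays low."""
    out = []
    for b in bits:
        out.extend((b, 0))
    return out


def nrz(bits):
    """Level encoding (assumed polarity): 1 -> high, 0 -> low."""
    return list(bits)


def nrzi(bits, initial_level=0):
    """1 -> transition at the start of the bit time; 0 -> no transition."""
    level = initial_level
    out = []
    for b in bits:
        if b == 1:
            level ^= 1
        out.append(level)
    return out


def manchester(bits):
    """0 -> hi-lo, 1 -> lo-hi (two half-bit levels per bit)."""
    out = []
    for b in bits:
        out.extend((0, 1) if b == 1 else (1, 0))
    return out


sample = [1, 0, 1, 1, 0]
print(rz(sample), nrz(sample), nrzi(sample), manchester(sample))
```

Manchester guarantees a transition in the middle of every bit, which is what lets a receiver recover the clock from the data.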




Packet versus Circuit Switching

• Packet Switching
  • Uses time slots to send a packet of information
  • Packets may be from a few bytes to many KBytes
  • Packet-switched networks route each packet independently
  • Examples: Ethernet, Token ring, Appletalk, Novell, Internet

• Circuit Switching
  • Establishes a circuit (route) ahead of time
  • Circuit may be a “virtual” circuit
  • Guarantees a certain bandwidth to the user
  • Examples: the telephone system, ATM (Asynchronous Transfer Mode)




Fig 10.2 Three Network Topologies

(a) Bus (b) Star

Hub

(c) Ring




Three Topologies Compared

• Bus:
  • No central controller, so continues to operate if station fails
  • Possibility of contention and collision

• Star:
  • May use hub as switch to connect any two stations (phone system)
  • May use hub as broadcaster (star-bus)
  • May use hub as star-ring

• Ring: Token Ring Protocol (IBM)
  • Passes a data packet, the token, around the ring
  • Receiving station removes the data, passes on an “empty” token
  • If a station receives an empty token, it may attach data to the token, destined for another station in the ring
  • Collision-free, but more susceptible to hardware failure if a node fails




Fig 10.3 Several LANs Interconnected with Bridges and Routers

Bridge: passes on only nonlocal traffic
Router: capable of routing nonlocal traffic

LAN 1

Bridge

Router

Bridge

LAN 2 LAN 3 LAN 4




Tasks Required of all Communications Systems

• Provide a high-level interface to the application
• Establish, maintain, and terminate the session gracefully
• Provide addressing and routing
• Provide synchronization and flow control
• Provide message formatting
• Assure error-free reception (error detection and error correction)
• Signal generation and detection




Fig 10.4 The OSI Layer Model

Application layer
  • Originator, final receiver of transmitted data

Presentation layer
  • Encryption, format conversion (often not present)

Session layer
  • Establish, maintain, terminate session

Transport layer
  • Packetizes data, assures all packets are received, in order of transmission
  • Requests retransmission of lost data

[Figure: two seven-layer stacks (Application, Presentation, Session, Transport, Network, Data link, Physical), each attached to a LAN, with the LANs connected through the Internet.]




The OSI Layer Model (Cont’d)

Network layer
  • Formats packets for the LAN
  • Removes LAN info at destination

Data link layer
  • Final preparation for transmission
  • Low level synchronization and flow control

Physical layer
  • Wiring, transmitting, and receiving
  • Signaling





Fig 10.5 Telecommunications by modem—RS-232 Comm. Protocol

• Physical layer: DB-25 connector
• Most use asynchronous communications
  • No common clock—clock must be inferred from arriving data

[Figure: at each end, a computer or terminal (data terminal equipment, DTE) with a serial interface exchanges RS-232 signals (Tx, Rx, RTS, CTS, CD, DSR, DTR, RI) with a modem (data communications equipment, DCE). The two modems exchange audio frequency tones through the telephone switching network.]




Fig 10.6 Asynchronous Data Communications Frame

The Data Link Layer

[Figure: the line idles at MARK (1). One frame consists of a start bit (SPACE, 0), 5–8 data bits sent LSB first (b0 through b7), a parity bit, and 1–2 stop bit(s); the start bit of the next frame follows.]

(Physical layer: MARK = -3 to -12 volts, SPACE = +3 to +12 volts.)
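The frame layout can be made concrete by building the bit sequence for one character. This is a sketch, not a driver: the function name and parameters are illustrative, logical 1 stands for MARK and 0 for SPACE, and it assumes the parity bit follows the data bits and that stop bits are sent as MARK.

```python
def async_frame(byte, data_bits=8, parity="even", stop_bits=1):
    """Return the list of line bits (1 = MARK, 0 = SPACE) for one frame.

    parity may be "even", "odd", or None (assumed options).
    """
    if not 5 <= data_bits <= 8:
        raise ValueError("data_bits must be between 5 and 8")
    if stop_bits not in (1, 2):
        raise ValueError("stop_bits must be 1 or 2")

    data = [(byte >> i) & 1 for i in range(data_bits)]   # LSB first
    frame = [0] + data                                   # start bit is SPACE
    if parity in ("even", "odd"):
        parity_bit = sum(data) % 2                       # makes the 1s count even
        if parity == "odd":
            parity_bit ^= 1
        frame.append(parity_bit)
    frame.extend([1] * stop_bits)                        # stop bits are MARK
    return frame


# ASCII "A" (100 0001) as 7 data bits, even parity, one stop bit.
print(async_frame(0x41, data_bits=7, parity="even", stop_bits=1))
```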




Tbl 10.1 EIA RS-232 Signal and Pin Assignments

DB-25 Pin No.   Signal Name   Signal Identity        Signal Direction
2               Tx            Transmitted data       To modem
3               Rx            Received data          To computer
7               Gnd           Signal ground          Common ref.
22              RI            Ring indicator         To computer
8               CD            Carrier detect         To computer
20              DTR           Data terminal ready    To modem
6               DSR           Data set ready         To computer
4               RTS           Request to send        To modem
5               CTS           Clear to send          To computer

EIA RS-232 signal identities and pin assignments

• These signals are used at the data link and session levels. Exact protocols are complex.




MODEMS (MOdulator, DEModulator)

• Convert DC signals for 0 and 1 to audio tones
• Telephone system frequency response is ~50–3500 Hz
• Bit rate, in bits per second (bps), is the number of bits sent per second
• Baud rate (after J-M-E Baudot) is the number of signal changes per second
• Maximum baud rate for the phone system is 2400
• It is possible to send multiple bits per baud, by signaling at different frequencies
  • Example: send one of 4 different signals, 2400 times per second
  • The four signals represent 00, 01, 10, or 11, so can send two bits per baud
  • bps rate = baud rate × log2(n), where n is the number of distinct signals
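The bit-rate relation is simple enough to check numerically. The sketch below uses illustrative function names; `signals_needed` assumes the bit rate is a whole multiple of the baud rate.

```python
import math


def bit_rate(baud, n_signals):
    """bps rate = baud rate x log2(n)."""
    return baud * math.log2(n_signals)


def signals_needed(bps, baud):
    """Distinct signals needed to reach bps at the given baud rate.

    Assumes bps is an exact multiple of baud.
    """
    bits_per_baud = bps // baud
    return 2 ** bits_per_baud


print(bit_rate(2400, 4))             # the 4-signal example above
print(signals_needed(28_800, 2400))  # signals needed for 28,800 bps at 2400 baud
```

At 2400 baud, 28,800 bps works out to 12 bits per signal change, which is why high-rate modems need a large signal set.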




“Smart” Modems

• Sometimes called “Hayes-compatible”
• Computer-controlled:
  • dialing
  • set bit rate
  • program answering, redialing, etc.
  • capable of data compression

• Modems are still 2400 baud maximum
• Highest bit rate available today: 28,800 bps




Tbl 10.2 The ASCII Code
Notice tricks with x, X, and ^X, and with “1” and 1

Rows: least significant bits (3210). Columns: most significant bits (654).

LSBs    000        001        010      011   100   101   110   111
0000    NUL, ^@    DLE, ^P    SPACE    0     @     P     `     p
0001    SOH, ^A    DC1, ^Q    !        1     A     Q     a     q
0010    STX, ^B    DC2, ^R    "        2     B     R     b     r
0011    ETX, ^C    DC3, ^S    #        3     C     S     c     s
0100    EOT, ^D    DC4, ^T    $        4     D     T     d     t
0101    ENQ, ^E    NAK, ^U    %        5     E     U     e     u
0110    ACK, ^F    SYN, ^V    &        6     F     V     f     v
0111    BEL, ^G    ETB, ^W    '        7     G     W     g     w
1000    BS, ^H     CAN, ^X    (        8     H     X     h     x
1001    HT, ^I     EM, ^Y     )        9     I     Y     i     y
1010    LF, ^J     SUB, ^Z    *        :     J     Z     j     z
1011    VT, ^K     ESC, ^[    +        ;     K     [     k     {
1100    FF, ^L     FS, ^\     ,        <     L     \     l     |
1101    CR, ^M     GS, ^]     -        =     M     ]     m     }
1110    SO, ^N     RS, ^^     .        >     N     ^     n     ~
1111    SI, ^O     US, ^_     /        ?     O     _     o     DEL




Tbl 10.2 The ASCII Code (Cont’d)
Control Characters

NUL     Null, Idle
SOH     Start of heading
STX     Start of text
ETX     End of text
EOT     End of transmission
ENQ     Enquiry
ACK     Acknowledge
BEL     Audible bell
BS      Backspace
HT      Horizontal tab
LF      Line feed
VT      Vertical tab
FF      Form feed
CR      Carriage return
SO      Shift out
SI      Shift in
DLE     Data link escape
DC1–4   Device control
NAK     Negative acknowledgment
SYN     Synch character
ETB     End of transmitted block
CAN     Cancel preceding message or block
EM      End of medium (paper or tape)
SUB     Substitute for invalid character
ESC     Escape (give alternate meaning to following)
FS      File separator
GS      Group separator
RS      Record separator
US      Unit separator
DEL     Delete




The ASCII Code “Tricks”

^X:  001 1000
“X”: 101 1000
“x”: 111 1000

“1”: 011 0001
1:   000 0001
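These bit patterns mean that case conversion, control codes, and digit conversion are single mask operations on the 7-bit code. A small sketch (function names are illustrative, and the letter helpers are only meaningful for A–Z and a–z):

```python
def control_code(letter):
    """^X: keep the low 5 bits of the letter's code (clears bits 6 and 5)."""
    return ord(letter) & 0x1F


def to_lower(letter):
    """Set bit 5: 101 1000 ("X") becomes 111 1000 ("x")."""
    return chr(ord(letter) | 0x20)


def to_upper(letter):
    """Clear bit 5: 111 1000 ("x") becomes 101 1000 ("X")."""
    return chr(ord(letter) & 0x5F)


def digit_value(digit_char):
    """Keep the low 4 bits: 011 0001 ("1") becomes 000 0001 (1)."""
    return ord(digit_char) & 0x0F


def digit_char(value):
    """Prefix 011 to a value 0-9 to get its character code."""
    return chr(0x30 | value)


for name, code in (("^X", control_code("X")), ("X", ord("X")), ("x", ord("x"))):
    print(name, format(code, "07b"))
```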




Local Area Networks

• Not precisely defined, but generally taken to be: “a network wholly contained within a building or campus.”
• Defined more by intended use: sharing resources within an organization.
• Span the range from 230 Kbps Apple LocalTalk to 100 Mbps new Ethernet.
• Most LAN protocols are defined only at the data link and physical layers.




The Ethernet LAN

• Developed at Xerox in 1970s, now the most popular LAN
• Physical layer can be coaxial cable or twisted pair (telephone cable)
• Data rates of 10 Mbps
• Data link layer: packets from 64 to 1518 bytes long
• Connectionless protocol
• Broadcast medium: every controller receives and examines every packet. Collisions possible
• Addresses are 48 bits long, organized as 6 8-bit “octets”
• Addresses guaranteed globally unique, formerly assigned by Xerox, now by IEEE




Tbl 10.3 Ethernet Cabling

Cable                                  IEEE Standard   Maximum Cable Length (m)   Maximum Total Length (m)      Topology
RG-8U (thicknet)                       10BASE-5        500                        2,500                         Bus
RG-58U (thinnet)                       10BASE-2        185                        1,000                         Bus
Unshielded twisted pair (phone wire)   10BASE-T        100                        2,500 with thick backbone     Star-bus (requires hub)




Fig 10.7 Ethernet CSMA/CD†

Media Access Control Mechanism

† Carrier sense multiple access/collision detect

Flowchart steps:

1. Line quiet? N: wait and check again. Y: begin transmitting.
2. Collision detected? N: transmission complete.
3. Collision detected? Y: transmit brief jamming signal, wait a random time period, and try again.
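The flowchart translates into a short loop. The sketch below is illustrative only: `ScriptedChannel` is a hypothetical stand-in for a network interface, driven by a scripted list of collision outcomes so the control flow can be exercised; the `max_attempts` limit and the uniform random wait are assumptions, since the flowchart says only "wait a random time period."

```python
import random


class ScriptedChannel:
    """Hypothetical channel whose collisions follow a prepared script."""

    def __init__(self, collisions):
        self.collisions = list(collisions)
        self.jams = 0

    def line_quiet(self):
        return True

    def begin_transmitting(self):
        pass

    def collision_detected(self):
        return self.collisions.pop(0)

    def send_jam(self):
        self.jams += 1


def csma_cd_send(channel, max_attempts=16, rng=random.random, wait=lambda t: None):
    """Return the attempt number that succeeded, or None if all attempts failed."""
    for attempt in range(1, max_attempts + 1):
        while not channel.line_quiet():      # carrier sense: wait for a quiet line
            wait(0)
        channel.begin_transmitting()
        if not channel.collision_detected():
            return attempt                   # transmission complete
        channel.send_jam()                   # brief jamming signal
        wait(rng())                          # wait a random time period, try again
    return None


channel = ScriptedChannel([True, True, False])
print(csma_cd_send(channel), channel.jams)
```

Passing `wait` and `rng` as parameters keeps the sketch deterministic to test; a real controller would do this timing in hardware.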




Fig 10.8 Ethernet Packet Structure

Field                        Bytes       Purpose
Preamble                     8           Bit pattern allows recognition of packet
Destination address (MAC)    6           Media access control—Determines local routing
Source address (MAC)         6           Media access control—Determines local routing
Type                         2
Data                         46–1,500    May include headers of higher-layer protocols
CRC                          4           Error detection and correction
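The field sizes explain the 64 to 1518 byte packet lengths quoted for Ethernet: 6 + 6 + 2 + 46 + 4 = 64 and 6 + 6 + 2 + 1,500 + 4 = 1518, with the preamble not counted. The sketch below assembles such a packet. It is a sketch under assumptions: short data is padded with zero bytes, `zlib.crc32` stands in for the frame's CRC, and the byte order of the appended CRC is assumed.

```python
import struct
import zlib

MIN_DATA = 46
MAX_DATA = 1500


def build_packet(dst_mac, src_mac, type_field, data):
    """Assemble destination, source, type, data, and CRC (preamble omitted)."""
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    if len(data) > MAX_DATA:
        raise ValueError("data field is limited to 1,500 bytes")
    padded = data.ljust(MIN_DATA, b"\x00")               # pad short data (assumed zeros)
    body = dst_mac + src_mac + struct.pack("!H", type_field) + padded
    crc = zlib.crc32(body)                               # 32-bit CRC over the fields
    return body + struct.pack("<I", crc)                 # CRC byte order is an assumption


short = build_packet(b"\x01" * 6, b"\x02" * 6, 0x0800, b"hi")
full = build_packet(b"\x01" * 6, b"\x02" * 6, 0x0800, bytes(1500))
print(len(short), len(full))
```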




The Internet

• Developed by ARPA, Advanced Research Projects Agency, a DOD agency.

• First experiments in 1969–74.
• Presently over 4 million host computers on the net.
• For public computer net communications, it is the only game in town.
• Distributed, connectionless protocol.
• Independent routers pass packets from source to destination.





The Internet

• Does not include the application or presentation layers, the data link layer, or the physical layer.

• Relies on other LAN protocols to transport its packets.

• Can use Ethernet, Token Ring, AppleTalk, dialup phone lines, or any other communications protocol.

• Uses TCP/IP (Transmission Control Protocol/Internet Protocol).

• Distributed protocol.




Fig 10.9 The TCP/IP Protocol Stack Functionality

OSI model: Application, Presentation
TCP/IP model: Mail, ftp, Telnet, DNS
TCP/IP functionality: E-mail, file transfer, remote login, etc. Domain name service.

OSI model: Session, Transport
TCP/IP model: Transmission control protocol (TCP)
TCP/IP functionality: Establish, maintain, terminate session. Create datagrams (packets). Ensure reliable, sequenced packet delivery. Support multiple sessions.

OSI model: Network
TCP/IP model: Internet protocol (IP)
TCP/IP functionality: Connectionless, end-to-end delivery service. MAC addressing and routing to destination. Fragment datagrams as needed.

OSI model: Data link, Physical
TCP/IP model: Ethernet, Token ring, LocalTalk
TCP/IP functionality: Connectionless, one-hop delivery service.




Internet Names and Addresses

Domain name (for humans)      IP address (for machines)
riker.cs.colorado.edu    ↔    128.138.244.9
ucsu.colorado.edu        ↔    128.138.129.83

Originating machine requests look-up of IP address from a domain name server, using domain name service.

IP address = Network # . Host ID
  • Network #: assigned by NIC (Network Info. Ctr, Chantilly, VA)
  • Host ID: assigned by CU




Internet Names and Addresses (cont’d.)

• 128.138.244.9 : “dotted decimal notation”
• In the machine the IP address is a 32-bit integer.
• Each “dot” separates a byte:
• 128.138.244.9 ↔ 808AF409 (hexadecimal)
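The conversion between dotted decimal notation and the 32-bit integer is a matter of shifting each byte into place. A minimal sketch with illustrative function names:

```python
def dotted_to_int(address):
    """Pack four decimal bytes into one 32-bit integer, first byte highest."""
    parts = [int(p) for p in address.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError("expected four values in the range 0-255")
    value = 0
    for p in parts:
        value = (value << 8) | p
    return value


def int_to_dotted(value):
    """Unpack a 32-bit integer into dotted decimal notation."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


print(format(dotted_to_int("128.138.244.9"), "08X"))
print(int_to_dotted(0x808AF409))
```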




The TCP/IP Protocol Revisited

OSI model: Application, Presentation
TCP/IP model: Mail, FTP, Telnet, DNS
TCP/IP functionality: E-mail, file transfer, remote login, etc. Domain name service.

OSI model: Session, Transport
TCP/IP model: TCP (Transmission Control Protocol)
TCP/IP functionality: Establish, maintain, terminate session. Create datagrams (packets). Ensure reliable, sequenced packet delivery. Support multiple sessions.

OSI model: Network
TCP/IP model: IP (Internet Protocol)
TCP/IP functionality: Connectionless end-to-end delivery service. MAC addressing and routing to destination. Fragment datagrams as needed.

OSI model: Data Link, Physical
TCP/IP model: Ethernet, Token Ring, LocalTalk
TCP/IP functionality: Connectionless one-hop delivery service.




Fig 10.10 Data Flow through the Protocol Stack

Each layer adds/subtracts its own information: separation of concerns/principle of abstraction.

[Figure: packet structure at each protocol layer (Application, TCP, IP, Data link (Ethernet)). The application's data to be packetized gains a TCP header, then an IP header including the IP address, then the Ethernet preamble, destination and source addresses, type, and CRC.]




Fig 10.11 The Internet Routing Process (simplified)

Flowchart steps:

• Received packet: extract destination IP address.
• My IP address? Y: send packet to TCP layer. N: use network mask to mask off host ID, leaving network #.
• Network # in routing table? N: default route available? N: discard packet.
• Route found: get MAC addr. of next router; strip off old MAC addr.; add this MAC addr. as src., next router as dst.; send to TCP layer for fragmenting if necessary; compute & append new checksum; send packet to data link layer.




Tbl 10.4 Class A, B, and C IP Addresses

IP Address Class    Network #/Host ID # split    Network #
A (MSB = 0)         N.H.H.H                      N.0.0.0
B (MSBs = 10)       N.N.H.H                      N.N.0.0
C (MSBs = 110)      N.N.N.H                      N.N.N.0
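The class of an address follows from the leading bits of its first byte, and the network number from zeroing the host bytes. The sketch below applies the table directly; the function names are illustrative, and addresses whose leading bits are 111 are outside the three classes listed, so the first function returns None for them.

```python
def address_class(address):
    """Return "A", "B", or "C" from the leading bits of the first byte."""
    first = int(address.split(".")[0])
    if first & 0x80 == 0x00:      # MSB = 0
        return "A"
    if first & 0xC0 == 0x80:      # MSBs = 10
        return "B"
    if first & 0xE0 == 0xC0:      # MSBs = 110
        return "C"
    return None


def network_number(address):
    """Zero the host bytes: N.0.0.0, N.N.0.0, or N.N.N.0."""
    cls = address_class(address)
    if cls is None:
        raise ValueError("not a class A, B, or C address")
    keep = {"A": 1, "B": 2, "C": 3}[cls]
    octets = address.split(".")
    return ".".join(octets[:keep] + ["0"] * (4 - keep))


print(address_class("128.138.244.9"), network_number("128.138.244.9"))
```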




Fig 10.12 Class A, B, and C Network Addresses

Class   Leading bits   Network # bits   Host ID bits   Networks   Addresses   Network addr. range           IP addr. range
A       0              8                24             126        16 M        1.0.0.0 to 126.0.0.0          1.0.0.1 to 126.255.255.254
B       10             16               16             16,384     65,534      128.0.0.0 to 191.255.0.0      128.0.0.1 to 191.255.255.254
C       110            24               8              2M         254         192.0.0.0 to 223.255.255.0    192.0.0.1 to 223.255.255.254




Subnets and Subnetting

• The Internet routers outside colorado.edu care only about the network number, and do not care about the host ID. (It is masked off before routing.)

• The routers inside colorado.edu are free to interpret the “host ID” number in any way they wish.

• CU might wish to have many LANs within its domain to cut down on traffic inside the domain.

• CU is free to program its routers to interpret part of the Host ID as a network number.




Fig 10.13 Example of a Class B Network with a 6-Bit Subnet

Outside view (Class B, leading bits 10): Network # (16 bits), Host ID # (16 bits); 65,534 hosts

Inside view (Class B subnetted): Network # (16 bits), Subnet # (6 bits), Host ID # (10 bits); 62 subnets, 1022 hosts

Subnet mask 255.255.252.0:
  11111111 11111111 11111100 00000000
  255      255      252      0

Machine 128.138.246.9:
  10000000 10001010 11110110 00001001
  128      138      246      9

Network # = 128.138 (stays the same), Subnet # = 61, Host ID = 521
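The subnet and host numbers in this example can be reproduced by splitting the 16-bit host field of a class B address. A minimal sketch with an illustrative function name; it applies only to class B addresses and takes the subnet width in bits as a parameter.

```python
def split_class_b(address, subnet_bits):
    """Return (network #, subnet #, host ID, subnet mask) for a class B address."""
    octets = [int(p) for p in address.split(".")]
    host_field = (octets[2] << 8) | octets[3]        # the 16-bit "host ID" field
    host_bits = 16 - subnet_bits
    subnet = host_field >> host_bits                 # upper bits: subnet #
    host = host_field & ((1 << host_bits) - 1)       # lower bits: host ID
    mask = (0xFFFFFFFF << host_bits) & 0xFFFFFFFF    # ones over network + subnet
    mask_text = ".".join(str((mask >> s) & 0xFF) for s in (24, 16, 8, 0))
    return f"{octets[0]}.{octets[1]}", subnet, host, mask_text


print(split_class_b("128.138.246.9", subnet_bits=6))
```

Changing `subnet_bits` to 8 or 10 gives the masks 255.255.255.0 and 255.255.255.192 respectively.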




Fig 10.14 Subnetting a Class B Network
(a) an 8-Bit Subnet Address
(b) a 10-Bit Subnet Address

a) 8-bit subnet address
  Outside view (Class B): Network # (16 bits), Host ID # (16 bits); 65,536 hosts
  Inside view (Class B subnetted): Network # (16 bits), Subnet # (8 bits), Host ID # (8 bits); 256 subnets, 256 hosts
  Subnet mask 255.255.255.0: 11111111 11111111 11111111 00000000

b) 10-bit subnet address
  Outside view (Class B): Network # (16 bits), Host ID # (16 bits); 65,536 hosts
  Inside view (Class B subnetted): Network # (16 bits), Subnet # (10 bits), Host ID (6 bits); 1024 subnets, 64 hosts
  Subnet mask 255.255.255.192: 11111111 11111111 11111111 11000000




Internet Futures

• Internet gurus estimate they will run out of IP addresses by the year 2000.

• Several proposals to expand IP address space.
  • One uses a 64-bit number: the present 32-bit IP address (for compatibility) with another 32 bits attached.

• Perhaps in the future wall plugs and wrist watches will have IP addresses!

