Processor Architecture
“God created the integers, all else is the work of man.” (Leopold Kronecker)
(He believed in the reduction of all mathematics to arguments involving
only the integers and a finite number of steps.)
Moving forward
• We are studying architecture, not assembly code (ASM)
• But ASM is the way we manipulate and study the system at the architecture level
• So we are learning ASM in order to understand architecture
• We are also not “married” to the Intel x86 (and derivatives)
• But it still rules the desktop/laptop (deslaptop?) world
• And other architectures are similar
• So we will use a fake ASM, similar to but different from (and simpler than) x86, called Y86
• And study the hardware design of the architecture more
• And build one
C Declaration Resolution
• See http://www.unixwiz.net/techtips/reading-cdecl.html
Chapter 4: Processor Architecture
• How does the hardware execute the instructions?
• We’ll see by studying an example system
  • Based on a simple instruction set devised for this purpose
• x86 evolved from the 8086 and is complex, non-intuitive, and tangled
  • Important to ASM programmers and compiler writers, but not useful for architecture studies
• Y86, inspired by x86
  • Fewer data types, instructions, addressing modes
  • Simpler encodings
  • Reasonably complete for integer programs
• We’ll design hardware to implement the Y86 ISA
  • Basic building blocks
  • Sequential implementation
  • Pipelined implementation
Instruction Set Architecture
• Defines the interface between hardware and software
• Software spec is assembly language
  • State: registers, memory
  • Instructions, encodings
• Hardware must execute instructions correctly
  • May use a variety of transparent tricks to make execution fast
  • Results must match sequential execution
• ISA is a layer of abstraction
  • Above: how to program the machine
  • Below: what needs to be built
[Figure: abstraction layers, top to bottom: Application Program; Compiler, OS; ISA; CPU Design; Circuit Design; Chip Layout]
Where Are We Now?
CS142 & 124
IT344
Y86 Processor and System State
• Program Registers (RF)
  • Same 8 as with IA32, each 32 bits: %eax, %ecx, %edx, %ebx, %esi, %edi, %esp, %ebp
• Condition Codes (CC)
  • Single-bit flags as in x86: OF (Overflow), ZF (Zero), SF (Sign; set when the result is negative)
• Program Counter (PC)
  • Indicates address of instruction
• Memory (DMEM)
  • Byte-addressable storage, words in little-endian byte order
• Stat (Program Status)
  • Indicates exceptional outcomes (bad opcode, bad address, halt)
Y86 Instructions
• Format
  • 1 to 6 bytes of information read from memory
    – Can determine instruction length from first byte
  • Not as many instruction types, and simpler encoding, than IA32
• Each accesses and modifies some portion of the CPU and system state
  • Program registers
  • Condition codes
  • Program counter
  • Memory contents
Encoding Registers
• Each register has a 4-bit ID
  • Similar encoding used in IA32
  • But we never deciphered the encoding closely enough to notice!
• Register ID 0xF indicates “no register”
  • Will use this in our hardware design in multiple places
  • Could otherwise encode register # in 3 bits
  • Simplifies decoding of instructions

Register  ID
%eax      0
%ecx      1
%edx      2
%ebx      3
%esp      4
%ebp      5
%esi      6
%edi      7
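The ID table can be written as a small lookup, sketched here in C. The function name y86_reg_name, the macro Y86_RNONE, and the "(none)" string are illustrative assumptions, not part of the Y86 tools; IDs 8 through 0xE are treated as unused.

```c
#include <assert.h>
#include <stddef.h>

#define Y86_RNONE 0xF   /* "no register" ID from the slide */

/* Map a 4-bit register ID to its name; NULL for IDs that are unused. */
const char *y86_reg_name(unsigned id)
{
    static const char *const names[8] = {
        "%eax", "%ecx", "%edx", "%ebx", "%esp", "%ebp", "%esi", "%edi"
    };
    assert(id <= 0xF);              /* IDs are 4 bits wide */
    if (id == Y86_RNONE)
        return "(none)";            /* hypothetical placeholder text */
    return id < 8 ? names[id] : NULL;
}
```

Calling y86_reg_name(6) yields "%esi", matching the table above.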
Instruction Example
• Addition instruction
  • Generic form: addl rA, rB
  • Encoded representation: 6 0 rA rB (one hex digit per field)
• Add value in register rA to that in register rB
  • Store result in register rB
  • Y86 allows addition to be applied to register data only
• Set condition codes based on result
• Two-byte encoding
  • First byte indicates instruction type
  • Second gives source and destination registers
• e.g., addl %eax,%esi has encoding 60 06
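As a rough illustration of the two-byte encoding, the following C sketch packs the four hex digits into two bytes. The helper names pack_nibbles and y86_encode_opl are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 4-bit fields into one byte, high nibble first. */
static uint8_t pack_nibbles(unsigned hi, unsigned lo)
{
    assert(hi <= 0xF && lo <= 0xF);
    return (uint8_t)((hi << 4) | lo);
}

/* Encode "OPl rA, rB": byte 0 = instruction code 6 and function code,
 * byte 1 = rA and rB. fn is 0 for addl. */
void y86_encode_opl(unsigned fn, unsigned rA, unsigned rB, uint8_t out[2])
{
    out[0] = pack_nibbles(0x6, fn);
    out[1] = pack_nibbles(rA, rB);
}
```

With fn = 0, rA = 0 (%eax), and rB = 6 (%esi), the two output bytes are 60 06, the encoding given for addl %eax,%esi.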
Arithmetic and Logical Operations
• Refer to generically as “OPl”
• Encodings differ only by “function code”
  • Low-order 4 bits of the first instruction byte
• All set condition codes as a side effect

Instruction      Encoding     Operation
addl rA, rB      6 0 rA rB    Add
subl rA, rB      6 1 rA rB    Subtract (rA from rB)
andl rA, rB      6 2 rA rB    And
xorl rA, rB      6 3 rA rB    Exclusive-Or

(In each encoding, the first digit is the instruction code and the second is the function code.)
Move Operations
• Similar to the IA32 movl instruction
• Simpler format for memory addresses
• Separated into different instructions to simplify hardware implementation

Instruction          Encoding        Operation
rrmovl rA, rB        2 0 rA rB       Register --> Register
irmovl V, rB         3 0 F rB V      Immediate --> Register
rmmovl rA, D(rB)     4 0 rA rB D     Register --> Memory
mrmovl D(rB), rA     5 0 rA rB D     Memory --> Register
Move Instruction Examples
IA32                          Y86                           Encoding
movl $0xabcd, %edx            irmovl $0xabcd, %edx          30 f2 cd ab 00 00
movl %esp, %ebx               rrmovl %esp, %ebx             20 43
movl -12(%ebp),%ecx           mrmovl -12(%ebp),%ecx         50 15 f4 ff ff ff
movl %esi,0x41c(%esp)         rmmovl %esi,0x41c(%esp)       40 64 1c 04 00 00
movl $0xabcd, (%eax)          (no Y86 equivalent)
movl %eax, 12(%eax,%edx)      (no Y86 equivalent)
movl (%ebp,%eax,4),%ecx       (no Y86 equivalent)
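The six-byte move encodings can be reproduced with a short C sketch. It assumes the layout from the Move Operations table (code byte, register byte, then a 4-byte constant stored little-endian); the names put_le32 and y86_encode_move are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 32-bit value least-significant byte first (little-endian). */
static void put_le32(uint8_t *dst, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        dst[i] = (uint8_t)(v >> (8 * i));
}

/* Encode a six-byte move: icode 3 (irmovl), 4 (rmmovl) or 5 (mrmovl),
 * then the rA/rB byte, then the constant V or displacement D. */
void y86_encode_move(unsigned icode, unsigned rA, unsigned rB,
                     int32_t constant, uint8_t out[6])
{
    assert(icode >= 3 && icode <= 5);
    assert(rA <= 0xF && rB <= 0xF);
    out[0] = (uint8_t)(icode << 4);          /* function code is 0 */
    out[1] = (uint8_t)((rA << 4) | rB);
    put_le32(out + 2, (uint32_t)constant);
}
```

For mrmovl -12(%ebp),%ecx (icode 5, rA = 1, rB = 5, D = -12) this produces 50 15 f4 ff ff ff: the displacement -12 is 0xfffffff4 in two's complement, written low byte first.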
Jump Instructions
• Refer to generically as “jXX”
• Encodings differ only by “function code”
• Based on values of condition codes
• Same as IA32 counterparts
• Encode full destination address
  • Unlike PC-relative addressing in IA32

Instruction   Encoding    Operation
jmp Dest      7 0 Dest    Jump Unconditionally
jle Dest      7 1 Dest    Jump When Less or Equal
jl Dest       7 2 Dest    Jump When Less
je Dest       7 3 Dest    Jump When Equal
jne Dest      7 4 Dest    Jump When Not Equal
jge Dest      7 5 Dest    Jump When Greater or Equal
jg Dest       7 6 Dest    Jump When Greater
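How each jXX depends on the condition codes can be sketched in C. The flag tests below are the usual IA32 signed-comparison rules, which the slide says Y86 shares; the function name y86_jump_taken is an assumption for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Decide whether jXX with function code fn (0..6) is taken, given the
 * condition codes left by the most recent OPl instruction. */
bool y86_jump_taken(unsigned fn, bool zf, bool sf, bool of)
{
    bool less = (sf != of);          /* signed "less than": SF xor OF */
    assert(fn <= 6);
    switch (fn) {
    case 0: return true;             /* jmp */
    case 1: return less || zf;       /* jle */
    case 2: return less;             /* jl  */
    case 3: return zf;               /* je  */
    case 4: return !zf;              /* jne */
    case 5: return !less;            /* jge */
    default: return !less && !zf;    /* jg  */
    }
}
```

Only the function code and the three flags are consulted; the destination address plays no part in the decision.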
Stack Operations
pushl rA    A 0 rA F
• Decrement %esp by 4
• Store word from rA to memory at %esp
• Like IA32
popl rA     B 0 rA F
• Read word from memory at %esp
• Save in rA
• Increment %esp by 4
• Like IA32
Same stack conventions as IA32
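The push and pop steps can be traced on a toy machine model in C. The Machine struct, the 0x200-byte memory size, and the helper names are assumptions made for this sketch; words are stored little-endian as on the processor-state slide.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_SIZE 0x200u
#define REG_ESP  4u

typedef struct {
    uint32_t reg[8];            /* program registers, indexed by ID */
    uint8_t  mem[MEM_SIZE];     /* byte-addressable memory */
} Machine;

static void store32(Machine *m, uint32_t addr, uint32_t v)
{
    assert(addr <= MEM_SIZE - 4u);
    for (int i = 0; i < 4; i++)
        m->mem[addr + i] = (uint8_t)(v >> (8 * i));   /* little-endian */
}

static uint32_t load32(const Machine *m, uint32_t addr)
{
    uint32_t v = 0;
    assert(addr <= MEM_SIZE - 4u);
    for (int i = 0; i < 4; i++)
        v |= (uint32_t)m->mem[addr + i] << (8 * i);
    return v;
}

/* pushl rA: decrement %esp by 4, then store rA at the new %esp. */
void y86_pushl(Machine *m, unsigned rA)
{
    uint32_t value = m->reg[rA];
    m->reg[REG_ESP] -= 4;
    store32(m, m->reg[REG_ESP], value);
}

/* popl rA: read the word at %esp, increment %esp by 4, save in rA. */
void y86_popl(Machine *m, unsigned rA)
{
    uint32_t value = load32(m, m->reg[REG_ESP]);
    m->reg[REG_ESP] += 4;
    m->reg[rA] = value;
}
```

Starting with %esp = 0x100, a push leaves %esp at 0xfc with the word stored there, and the matching pop restores %esp to 0x100.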
Subroutine Call and Return
call Dest    8 0 Dest
• Push address of next instruction onto stack
• Start executing instructions at Dest
• Like IA32
ret          9 0
• Pop value from stack
• Use as address for next instruction
• Like IA32
Miscellaneous Instructions
nop     0 0
• Don’t do anything
halt    1 0
• Stop executing instructions
• IA32 has a comparable instruction, but it can’t be executed in user mode
• We will use this instruction to stop the simulator
Other Useful Instructions
• DWIM: Do What I Mean
• FLI: Flash Lights Impressively
• HCF: Halt and Catch Fire
• BBW: Branch Both Ways
• JTZ: Jump to Twilight Zone
• LAP: Laugh At Programmer
• WSWW: Work in Strange and Wondrous Ways
• KPE: Kill Programmer on Error
Y86 Instruction Set (complete set)
Instruction          Byte 0    Byte 1    Bytes 2-5
nop                  0 0
halt                 1 0
rrmovl rA, rB        2 0       rA rB
irmovl V, rB         3 0       F rB      V
rmmovl rA, D(rB)     4 0       rA rB     D
mrmovl D(rB), rA     5 0       rA rB     D
OPl rA, rB           6 fn      rA rB
jXX Dest             7 fn      Dest (bytes 1-4)
call Dest            8 0       Dest (bytes 1-4)
ret                  9 0
pushl rA             A 0       rA F
popl rA              B 0       rA F

OPl function codes: addl 6 0, subl 6 1, andl 6 2, xorl 6 3
jXX function codes: jmp 7 0, jle 7 1, jl 7 2, je 7 3, jne 7 4, jge 7 5, jg 7 6
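Because every instruction's length follows from its instruction code, a decoder can step through code one first byte at a time. The C sketch below derives the lengths from the table above; both function names are hypothetical, and aborting on an unknown code is a simplification for this example rather than simulator behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Length in bytes of a Y86 instruction, decided by the high nibble
 * (instruction code) of its first byte. Returns 0 for an unknown code. */
int y86_instr_length(uint8_t first_byte)
{
    switch (first_byte >> 4) {
    case 0x0: /* nop  */
    case 0x1: /* halt */
    case 0x9: /* ret  */
        return 1;
    case 0x2: /* rrmovl */
    case 0x6: /* OPl    */
    case 0xA: /* pushl  */
    case 0xB: /* popl   */
        return 2;          /* code byte + register byte */
    case 0x7: /* jXX  */
    case 0x8: /* call */
        return 5;          /* code byte + 4-byte destination */
    case 0x3: /* irmovl */
    case 0x4: /* rmmovl */
    case 0x5: /* mrmovl */
        return 6;          /* code byte + register byte + 4-byte constant */
    default:
        return 0;
    }
}

/* Count the instructions in a buffer by hopping from first byte to first byte. */
int y86_count_instructions(const uint8_t *code, int nbytes)
{
    int pc = 0, count = 0;
    while (pc < nbytes) {
        int len = y86_instr_length(code[pc]);
        assert(len > 0);   /* sketch only: abort on an invalid instruction code */
        pc += len;
        count++;
    }
    return count;
}
```

The lengths 1, 2, 5, and 6 are the only ones that occur, which is consistent with the "1 to 6 bytes" format described earlier.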
Writing Y86 Code
• Best to use C compiler as much as possible
  • Write code in C
  • Compile for IA32 with gcc -S
  • Hand translate into Y86
• Coding example
  • Find number of elements in null-terminated list
    int len1(int a[]);
  • Example: a points to the list 5043, 6125, 7395, 0; the result is 3
Y86 Code Generation Example
• First try
  • Write typical array code
  • Compile with gcc -O2 -S
• Problem
  • Hard to do array indexing on Y86: no scaled addressing modes

/* Find number of elements in
   null-terminated list */
int len1(int a[])
{
    int len;
    for (len = 0; a[len]; len++)
        ;
    return len;
}

x86 code:
L18:
    incl %eax
    cmpl $0,(%edx,%eax,4)
    jne L18
Y86 Code Generation Example #2
• Second try
  • Revise C source to use pointers
  • Compile with gcc -O2 -S
• Result
  • Doesn’t use indexed addressing

/* Find number of elements in
   null-terminated list */
int len2(int a[])
{
    int len = 0;
    while (*a++)
        len++;
    return len;
}

x86 code:
L5:
    movl (%edx),%eax
    incl %ecx
    addl $4,%edx
    testl %eax,%eax
    jne L5
Y86 Code Generation Example #3
• IA32 code: setup
len2:
    pushl %ebp
    xorl %ecx,%ecx
    movl %esp,%ebp
    movl 8(%ebp),%edx
    movl (%edx),%eax
    jmp L7
• Y86 code: setup (hand translation)
len2:
    pushl %ebp              # Save %ebp
    xorl %ecx,%ecx          # len = 0
    rrmovl %esp,%ebp        # Set frame
    mrmovl 8(%ebp),%edx     # Get a (the pointer)
    mrmovl (%edx),%eax      # Get *a
    jmp L7                  # Goto loop entry
Y86 Code Generation Example #4
• IA32 code: loop + finish
L5:
    movl (%edx),%eax
    incl %ecx
L7:
    addl $4,%edx
    testl %eax,%eax
    jne L5
    movl %ebp,%esp
    movl %ecx,%eax
    popl %ebp
    ret
• Y86 code: loop + finish (hand translation)
L5:
    mrmovl (%edx),%eax      # Get *a
    irmovl $1,%esi
    addl %esi,%ecx          # len++
L7:                         # Loop entry
    irmovl $4,%esi
    addl %esi,%edx          # a++
    andl %eax,%eax          # *a == 0?
    jne L5                  # No: loop
    rrmovl %ebp,%esp        # Pop
    rrmovl %ecx,%eax        # Rtn len
    popl %ebp
    ret
Y86 Program Structure
• Programmer must do more work; no compiler, linker, run-time system
• Make program placement explicit
• Stack initialization must be explicit (addr. 0x100)
  • Must ensure code is not overwritten!
• Must initialize data
• Can use symbolic names

    irmovl Stack,%esp    # Set up stack
    rrmovl %esp,%ebp     # Set up frame
    irmovl List,%edx
    pushl %edx           # Push argument
    call len2            # Call Function
    halt                 # Halt
.align 4
List:                    # List of elements
    .long 5043
    .long 6125
    .long 7395
    .long 0
# Function
len2:
    . . .
# Allocate space for stack
.pos 0x100
Stack:
Assembling Y86 Program
• Generates “object code” file eg.yo
  • Actually looks like disassembler output
  • ASCII file to make it easy for you to read
unix> yas eg.ys
0x000: 30f400010000 | irmovl Stack,%esp # Set up stack
0x006: 2045         | rrmovl %esp,%ebp # Set up frame
0x008: 30f218000000 | irmovl List,%edx
0x00e: a02f         | pushl %edx # Push argument
0x010: 8028000000   | call len2 # Call Function
0x015: 10           | halt # Halt
0x018:              | .align 4
0x018:              | List: # List of elements
0x018: b3130000     | .long 5043
0x01c: ed170000     | .long 6125
0x020: e31c0000     | .long 7395
0x024: 00000000     | .long 0
Simulating Y86 Program
• Instruction set simulator
  • Computes effect of each instruction on processor state
  • Prints changes in state from original
unix> yis eg.yo
Stopped in 41 steps at PC = 0x16. Exception 'HLT', CC Z=1 S=0 O=0
Changes to registers:
%eax: 0x00000000 0x00000003
%ecx: 0x00000000 0x00000003
%edx: 0x00000000 0x00000028
%esp: 0x00000000 0x000000fc
%ebp: 0x00000000 0x00000100
%esi: 0x00000000 0x00000004
Changes to memory:
0x00f4: 0x00000000 0x00000100
0x00f8: 0x00000000 0x00000015
0x00fc: 0x00000000 0x00000018
CISC Instruction Sets
• CISC: Complex Instruction Set Computer
  • Dominant style of machines designed prior to ~1980
• Stack-oriented instruction set
  • Use stack to pass arguments, save program counter
  • Explicit push and pop instructions
• Arithmetic instructions can access memory
  • addl %eax, 12(%ebx,%ecx,4)
    – Requires memory read and write + complex address calculation
• Condition codes
  • Set as side effect of arithmetic and logical instructions
• Philosophy
  • Add instructions to perform “typical” programming tasks
RISC Instruction Sets
• RISC: Reduced Instruction Set Computer
  • Early projects at IBM, Stanford (Hennessy), and Berkeley (Patterson)
• Fewer, simpler instructions in ISA (initially)
  • Takes more instructions to perform the same operations (relative to CISC)
  • But each instruction can execute faster on simpler hardware
• Register-oriented instruction set
  • Many more (typically ≥ 32) registers
  • Used for arguments, return value and address, temporaries
• Only load and store instructions can access memory
  • Similar to Y86 mrmovl and rmmovl
• No condition codes
  • Test instructions return 0/1 in a general-purpose register
Example: MIPS Registers
Example: MIPS Instructions
Instruction formats (fields left to right):
  R-R:         Op | Ra | Rb | Rd | 00000 | Fn
  R-I:         Op | Ra | Rb | Immediate
  Load/Store:  Op | Ra | Rb | Offset
  Branch:      Op | Ra | Rb | Offset
Examples:
  addu $3,$2,$1       # Register add: $3 = $2+$1
  addu $3,$2,3145     # Immediate add: $3 = $2+3145
  sll $3,$2,2         # Shift left: $3 = $2 << 2
  lw $3,16($2)        # Load Word: $3 = M[$2+16]
  sw $3,16($2)        # Store Word: M[$2+16] = $3
  beq $3,$2,dest      # Branch when $3 = $2
CISC vs. RISC Debate
• Strong opinions at the time!
• CISC arguments
  • Easy for compiler (bridge semantic gap)
  • Concise object code (memory was expensive)
• RISC arguments
  • Simple is better for optimizing compilers
  • A simple CPU can be made to run very fast
• Current status
  • For desktop processors, choice of ISA is not a technical issue
    – With enough hardware, anything can be made to run fast
    – Code compatibility is more important
  • For embedded processors, RISC makes sense
    – Smaller, cheaper, less power
4.1 Summary
• Y86 instruction set architecture
  • Similar state and instructions to IA32
  • Simpler encodings
  • Small instruction set
• Y86 is somewhere between CISC and RISC
  • Changes from x86 are consistent with RISC principles
4.2: Logic Design: A Brief Review
• Fundamental hardware requirements
  • Communication
    – How to get values from one place to another
  • Computation
  • Storage
• All are simplified by restricting to 0s and 1s
  • Communication
    – Low or high voltage on wire
  • Computation
    – Compute Boolean functions
  • Storage
    – Store bits of information
How many different components needed to make a computer?
• Answer: a zillion NAND (or NOR) gates is all it takes.
• Can create NOT, AND, OR, NOR
  • NAND with inputs tied together is NOT
  • NAND followed by NOT = AND
  • NAND with NOTs on inputs = OR
  • OR followed by NOT = NOR
  • (Exercise: Create an XOR)
• With AND, NAND, OR, NOR, NOT you can build
  • A 1-bit adder → cascade to n-bit adder
  • A 1-bit register → cascade to n-bit register
  • A clock
• Use registers, adders, clocks, and glue logic and you have a computer
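The four constructions listed above can be checked in software. This C sketch models single-bit signals as ints holding 0 or 1 and uses nand as the only primitive; the function names are illustrative, and XOR is deliberately left out since it is the exercise.

```c
#include <assert.h>

/* The only primitive: a 2-input NAND on single bits (0 or 1). */
static int nand(int a, int b)
{
    assert((a == 0 || a == 1) && (b == 0 || b == 1));
    return !(a && b);
}

int not_gate(int a)        { return nand(a, a); }                      /* inputs tied together */
int and_gate(int a, int b) { return not_gate(nand(a, b)); }            /* NAND followed by NOT */
int or_gate(int a, int b)  { return nand(not_gate(a), not_gate(b)); }  /* NOTs on the inputs */
int nor_gate(int a, int b) { return not_gate(or_gate(a, b)); }         /* OR followed by NOT */
```

Running all four input combinations through each function reproduces the usual truth tables, which is a quick way to convince yourself of the De Morgan step behind the OR construction.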
Communication: Digital Signals
• Use voltage thresholds to extract discrete values from continuous signal
• Simplest version: 1-bit signal
  • Either high range (1) or low range (0)
  • With guard range between them
• Not strongly affected by noise or low-quality circuit elements
  • Can make circuits simple, small, and fast
[Figure: voltage vs. time for a signal carrying the bit values 0, 1, 0]
Computation: Logic Gates
• Outputs are Boolean functions of inputs
• Respond continuously to changes in inputs
  • After some small delay
[Figure: voltage vs. time for inputs a and b and output a && b, showing the rising delay and the falling delay]
Combinational Circuits
• Acyclic network of logic gates
  • Continuously responds to changes on primary inputs
  • Primary outputs become (after some delay) Boolean functions of primary inputs
[Figure: acyclic network with primary inputs on one side and primary outputs on the other]
Bit Equality
• Generate 1 if a and b are equal
  HCL expression: bool eq = (a&&b)||(!a&&!b)
• Hardware control language (HCL)
  • Very simple hardware description language
    – Boolean operations have syntax similar to C logical operations
  • We’ll use it to describe control logic for processors
  • Much more convenient than drawing gates
  • Assumes compiler exists to turn HCL into gate equivalent
[Figure: “Bit equal” gate circuit with inputs a and b and output eq]
Word Equality
• 32-bit word size
• HCL representation
  • Equality operation
  • Generates Boolean value
  HCL representation: bool Eq = (A == B)
[Figure: bit-level circuit with 32 “Bit equal” blocks (inputs a31/b31 ... a0/b0, outputs eq31 ... eq0) whose outputs are combined into Eq; the word-level representation is a single “=” block with inputs A and B and output Eq]
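The relationship between the bit-level circuit and the word-level HCL expression can be sketched in C. The function names are illustrative, and ANDing the 32 per-bit results is the natural reading of how eq31 ... eq0 feed Eq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One "Bit equal" block: the HCL expression (a&&b)||(!a&&!b). */
static bool bit_equal(bool a, bool b)
{
    return (a && b) || (!a && !b);
}

/* Word equality built the way the bit-level circuit is drawn: one
 * bit_equal per bit position, all 32 results ANDed together. */
bool word_equal(uint32_t A, uint32_t B)
{
    bool eq = true;
    for (int i = 0; i < 32; i++)
        eq = eq && bit_equal((A >> i) & 1u, (B >> i) & 1u);
    assert(eq == (A == B));   /* agrees with the word-level HCL Eq = (A == B) */
    return eq;
}
```

The internal assert states the point of the slide: the word-level operator is just shorthand for the 32 bit-level comparisons.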
Bit-Level Multiplexer
• Control signal s
• Data signals a and b
• Output a when s=1, b when s=0
  HCL expression: bool out = (s&&a)||(!s&&b)
[Figure: “Bit MUX” gate circuit with inputs s, a, and b and output out]
Word Multiplexer
• Select input word A or B depending on control signal s
• HCL representation
  • Case expression
  • Series of test : value pairs
  • Result value determined by first successful test

HCL representation:
int Out = [
    s : A;
    1 : B;
];

[Figure: bit-level circuit with one bit MUX per position (inputs a31/b31 ... a0/b0, outputs out31 ... out0) sharing the select signal s; the word-level representation is a single MUX block with inputs s, A, and B and output Out]
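A C sketch of the same idea: the word-level MUX is 32 copies of the bit-level MUX driven by one select line. The names bit_mux and word_mux are assumptions for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-level MUX: the HCL expression (s&&a)||(!s&&b). */
static int bit_mux(int s, int a, int b)
{
    assert((s == 0 || s == 1) && (a == 0 || a == 1) && (b == 0 || b == 1));
    return (s && a) || (!s && b);
}

/* Word-level MUX assembled from 32 bit-level MUXes sharing one select. */
uint32_t word_mux(int s, uint32_t A, uint32_t B)
{
    uint32_t out = 0;
    for (int i = 0; i < 32; i++)
        out |= (uint32_t)bit_mux(s, (int)((A >> i) & 1u), (int)((B >> i) & 1u)) << i;
    return out;
}
```

This matches the HCL case expression: the first test, s, selects A; the default test, 1, selects B.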
Arithmetic Logic Unit
• Combinational logic
  • Continuously responding to inputs
• Control signal selects function computed
  • Corresponding to 4 arithmetic/logical operations in Y86
• Also computes values for condition codes
[Figure: four copies of the ALU block with inputs A and B, operands labeled X and Y, and condition-code outputs; control value 0 computes X + Y, 1 computes X - Y, 2 computes X & Y, and 3 computes X ^ Y]
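A software model of the ALU might look like the C sketch below. It uses the control values 0 to 3 from the figure and the condition codes ZF, SF, and OF from the processor-state slide; the overflow tests are the standard two's-complement rules, and the type and function names are assumptions for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool zf, sf, of; } CondCodes;

/* Sketch of the ALU: fn selects the operation (0 add, 1 subtract,
 * 2 and, 3 xor); returns the result and fills in the condition codes. */
uint32_t y86_alu(unsigned fn, uint32_t x, uint32_t y, CondCodes *cc)
{
    uint32_t r;
    bool overflow = false;
    assert(fn <= 3);
    switch (fn) {
    case 0:
        r = x + y;
        /* signed overflow: operands share a sign that the result lacks */
        overflow = ((~(x ^ y) & (x ^ r)) >> 31) != 0;
        break;
    case 1:
        r = x - y;
        /* signed overflow: operands differ in sign and the result's sign differs from x */
        overflow = (((x ^ y) & (x ^ r)) >> 31) != 0;
        break;
    case 2:  r = x & y; break;
    default: r = x ^ y; break;
    }
    cc->zf = (r == 0);
    cc->sf = (r >> 31) != 0;
    cc->of = overflow;
    return r;
}
```

Adding 1 to 0x7fffffff, for example, yields 0x80000000 with SF and OF set, while subtracting a value from itself sets only ZF.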
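Edge-Triggered Latch (Flip Flop)
• Only in latching mode for a brief period
  • On rising clock edge
• Value latched depends on data as clock rises
• Output remains stable at all other times
[Figure: edge-triggered latch circuit with Data (D) and Clock (C) inputs, a Trigger signal T, internal R and S signals, and outputs Q+ and Q–; timing diagram of C, D, T, and Q+ over time]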
Storage: Registers
• Each stores a word of data (one byte in the register pictured)
  • Different from program registers (e.g., %eax)
• Collection of edge-triggered latches
• Loads input on rising edge of clock
[Figure: register structure: eight edge-triggered latches (inputs D and C, output Q+) sharing one Clock, with inputs i7 ... i0 and outputs o7 ... o0; also drawn as a single block with input I, output O, and Clock]
Register Operation
• Stores data bits
• For most of the time acts as a barrier between input and output
• As clock rises, loads input
[Figure: before the rising clock edge, State = x, Output = x, Input = y; after it, State = y, Output = y]
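State Machine Example
• Accumulator circuit
• Load or accumulate on each cycle
[Figure: combinational logic (an ALU with control 0 feeding a MUX selected by Load) drives a clocked register whose output Out feeds back into the ALU; inputs In, Load, and Clock]

Timing example (one column per clock cycle; Load is high for x0 and x3, where Out takes the input, and low otherwise, where Out accumulates):
In     x0   x1       x2          x3   x4       x5
Out    x0   x0+x1    x0+x1+x2    x3   x3+x4    x3+x4+x5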
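The accumulator's behavior over several clock cycles can be simulated with a few lines of C. The struct and function names are illustrative; the split into a combinational "next value" function and a clocked update mirrors the figure.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t state; } Accumulator;   /* the clocked register */

/* Combinational logic: the value presented to the register input. */
static uint32_t acc_next(const Accumulator *acc, int load, uint32_t in)
{
    assert(load == 0 || load == 1);
    uint32_t sum = acc->state + in;     /* ALU, function 0 (add) */
    return load ? in : sum;             /* MUX selected by Load */
}

/* One rising clock edge: the register captures its input. */
uint32_t acc_clock(Accumulator *acc, int load, uint32_t in)
{
    acc->state = acc_next(acc, load, in);
    return acc->state;                  /* Out follows the stored state */
}
```

Feeding in 5 with Load = 1, then 7 and 1 with Load = 0, gives outputs 5, 12, 13; asserting Load again restarts the running sum from the new input.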
Storage: Random-Access Memory
• Stores multiple words of memory
  • Address input specifies which word to read or write
• Register file
  • Holds values of program registers
    – %eax, %esp, etc.
  • Register identifier serves as address
    – ID 0xF implies no read or write performed
• Multiple ports
  • Can read and/or write multiple words simultaneously
    – Each has separate address and data input/output
[Figure: register file with read ports A (address srcA, data valA) and B (address srcB, data valB), write port W (address dstW, data valW), and a Clock input]
Register File Timing
• Reading
  • Like combinational logic
  • Output data generated based on input address
    – After some delay
• Writing
  • Like register (a few slides ago)
  • Update only as clock rises
[Figure: reading: with a read address of 2 and register 2 holding x, the read data is x. Writing: with dstW = 2 and valW = y, register 2 holds x until the rising clock edge and y after it]
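The difference between the two kinds of port can be modeled in C as follows. The RegFile struct and function names are assumptions for this sketch, as is returning 0 when a read uses the "no register" ID 0xF.

```c
#include <assert.h>
#include <stdint.h>

#define RNONE 0xFu    /* register ID meaning "no read or write" */

typedef struct {
    uint32_t reg[8];                 /* stored state */
    unsigned dstW; uint32_t valW;    /* write-port inputs, sampled at the clock edge */
} RegFile;

/* Read port: behaves like combinational logic, output follows the address. */
uint32_t rf_read(const RegFile *rf, unsigned src)
{
    assert(src < 8 || src == RNONE);
    return src == RNONE ? 0 : rf->reg[src];   /* 0 for RNONE is an assumption */
}

/* Drive the write port; nothing is stored until the clock rises. */
void rf_set_write(RegFile *rf, unsigned dst, uint32_t val)
{
    assert(dst < 8 || dst == RNONE);
    rf->dstW = dst;
    rf->valW = val;
}

/* Rising clock edge: commit the pending write, if any. */
void rf_clock(RegFile *rf)
{
    if (rf->dstW != RNONE)
        rf->reg[rf->dstW] = rf->valW;
}
```

A read of register 2 keeps returning the old value after rf_set_write and changes only once rf_clock has been called, which is the timing behavior the slide describes.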
4.2 Summary
• Computation
  • Performed by combinational logic
  • Computes Boolean functions
  • Continuously reacts to input changes
• Storage
  • Registers
    – Hold single words
    – Loaded as clock rises
  • Random-access memories
    – Hold multiple words
    – Multiple read and write ports possible
    – Read word anytime address input changes
    – Write word only on rising clock edge