CS2810, Spring 2007. Dan Watson, dan.watson@usu.edu. Course syllabus, calendar, and assignments found at http://www.cs.usu.edu/~watson/cs2810. These overheads are based on presentations courtesy of Professor Mary Jane Irwin, Penn State University, and Professor Tod Amon, Southern Utah University. © 2004 Morgan Kaufmann Publishers.


Page 1:

CS2810, Spring 2007

Dan Watson, dan.watson@usu.edu

Course syllabus, calendar, and assignments found at

http://www.cs.usu.edu/~watson/cs2810

These overheads are based on presentations courtesy of

Professor Mary Jane Irwin, Penn State University

and

Professor Tod Amon, Southern Utah University

Page 2:

Chapter 1

Page 3:

Introduction

• This course is all about how computers work

• But what do we mean by a computer?

– Different types: desktop, servers, embedded devices

– Different uses: automobiles, graphics, finance, genomics…

– Different manufacturers: Intel, Apple, IBM, Microsoft, Sun…

– Different underlying technologies and different costs!

• Analogy: Consider a course on “automotive vehicles”

– Many similarities from vehicle to vehicle (e.g., wheels)

– Huge differences from vehicle to vehicle (e.g., gas vs. electric)

• Best way to learn:

– Focus on a specific instance and learn how it works

– While learning general principles and historical perspectives

Page 4:

Why learn this stuff?

• You want to call yourself a “computer scientist”

• You want to build software people use (need performance)

• You need to make a purchasing decision or offer “expert” advice

• Both Hardware and Software affect performance:

– Algorithm determines number of source-level statements

– Language/Compiler/Architecture determine machine instructions(Chapter 2 and 3)

– Processor/Memory determine how fast instructions are executed(Chapter 5, 6, and 7)

• Assessing and Understanding Performance in Chapter 4

Page 5:

What is a computer?

• Components:

– input (mouse, keyboard)

– output (display, printer)

– memory (disk drives, DRAM, SRAM, CD)

– network

• Our primary focus: the processor (datapath and control)

– implemented using millions of transistors

– Impossible to understand by looking at each transistor

– We need...

Page 6:

Where is the Market?

[Bar chart: millions of computers sold per year, 1998–2002, broken down by market segment: Embedded, Desktop, Servers.]

Page 7:

By the architecture of a system, I mean the complete and detailed specification of the user interface. … As Blaauw has said, “Where architecture tells what happens, implementation tells how it is made to happen.”

The Mythical Man-Month, Brooks, pg 45

Page 8:

Instruction Set Architecture (ISA)

• ISA: An abstract interface between the hardware and the lowest level software of a machine that encompasses all the information necessary to write a machine language program that will run correctly, including instructions, registers, memory access, I/O, and so on.

“... the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.”

– Amdahl, Blaauw, and Brooks, 1964

– Enables implementations of varying cost and performance to run identical software

• ABI (application binary interface): The user portion of the instruction set plus the operating system interfaces used by application programmers. Defines a standard for binary portability across computers.

Page 9:

ISA Type Sales

[Bar chart: millions of processors sold per year, 1998–2002, by ISA: Other, SPARC, Hitachi SH, PowerPC, Motorola 68K, MIPS, IA-32, ARM. PowerPoint “comic” bar chart with approximate values (see text for correct values).]

Page 10:

Moore’s Law

• In 1965, Gordon Moore predicted that the number of transistors that can be integrated on a die would double every 18 to 24 months (i.e., grow exponentially with time).

• Amazingly visionary – the million transistor/chip barrier was crossed in the 1980s.

– 2300 transistors, 1 MHz clock (Intel 4004) – 1971

– 16 million transistors (UltraSPARC III)

– 42 million transistors, 2 GHz clock (Intel Xeon) – 2001

– 55 million transistors, 3 GHz, 130 nm technology, 250 mm² die (Intel Pentium 4) – 2004

– 140 million transistors (HP PA-8500)

Page 11:

Historical Perspective

• ENIAC, built during World War II, was the first general-purpose computer
– Used for computing artillery firing tables
– 80 feet long by 8.5 feet high and several feet wide
– Each of the twenty 10-digit registers was 2 feet long
– Used 18,000 vacuum tubes
– Performed 1,900 additions per second

• Since then:

Moore’s Law:

transistor capacity doubles every 18-24 months

Page 12:

Processor Performance Increase

[Chart: processor performance (SPECint) vs. year, 1987–2003, log scale from 1 to 10,000. Machines plotted include SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, 5/300, 5/500, 21264/600, 21264A/667, Intel Xeon/2000, and Intel Pentium 4/3000.]

Page 13:

DRAM Capacity Growth

[Chart: DRAM Kbit capacity vs. year of introduction, 1976–2002, log scale from 10 to 1,000,000. Generations: 16K, 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M, 512M.]

Page 14:

Impacts of Advancing Technology

• Processor

– logic capacity: increases about 30% per year

– performance: 2x every 1.5 years

• Memory

– DRAM capacity: 4x every 3 years, now 2x every 2 years

– memory speed: 1.5x every 10 years

– cost per bit: decreases about 25% per year

• Disk

– capacity: increases about 60% per year

ClockCycle = 1/ClockRate

500 MHz ClockRate = 2 nsec ClockCycle

1 GHz ClockRate = 1 nsec ClockCycle

4 GHz ClockRate = 250 psec ClockCycle

Page 15:

Example Machine Organization

• Workstation design target

– 25% of cost on processor

– 25% of cost on memory (minimum memory size)

– Rest on I/O devices, power supplies, box

[Diagram: Computer = CPU (Control + Datapath) + Memory + Devices (Input, Output).]

Page 16:

PC Motherboard Closeup

Page 17:

Inside the Pentium 4 Processor Chip

Page 18:

Instruction Set Architecture

• A very important abstraction

– interface between hardware and low-level software

– standardizes instructions, machine language bit patterns, etc.

– advantage: different implementations of the same architecture

– disadvantage: sometimes prevents using new innovations

True or False: Binary compatibility is extraordinarily important.

• Modern instruction set architectures:

– IA-32, PowerPC, MIPS, SPARC, ARM, and others

Page 19:

Abstraction

• Delving into the depths reveals more information

• An abstraction omits unneeded detail, helps us cope with complexity

What are some of the details that appear in these familiar abstractions?

Page 20:

MIPS R3000 Instruction Set Architecture

• Instruction Categories

– Load/Store

– Computational

– Jump and Branch

– Floating Point (coprocessor)

– Memory Management

– Special

• Registers: R0–R31, PC, HI, LO

• 3 Instruction Formats, all 32 bits wide:

– R: OP | rs | rt | rd | sa | funct
– I: OP | rs | rt | immediate
– J: OP | jump target

Q: How many already familiar with MIPS ISA?

Page 21:

How do computers work?

• Need to understand abstractions such as:
– Applications software
– Systems software
– Assembly language
– Machine language
– Architectural issues: i.e., caches, virtual memory, pipelining
– Sequential logic, finite state machines
– Combinational logic, arithmetic circuits
– Boolean logic, 1s and 0s
– Transistors used to build logic gates (CMOS)
– Semiconductors/silicon used to build transistors
– Properties of atoms, electrons, and quantum dynamics

• So much to learn!

Page 22:

Chapter 2

Page 23:

Instructions:

• Language of the Machine

• We’ll be working with the MIPS instruction set architecture

– similar to other architectures developed since the 1980's

– Almost 100 million MIPS processors manufactured in 2002

– used by NEC, Nintendo, Cisco, Silicon Graphics, Sony, …

[Bar chart: millions of processors sold per year, 1998–2002, by ISA: Other, SPARC, Hitachi SH, PowerPC, Motorola 68K, MIPS, IA-32, ARM.]

Page 24:

MIPS arithmetic

• All instructions have 3 operands

• Operand order is fixed (destination first)

Example:

C code: a = b + c

MIPS ‘code’: add a, b, c

(we’ll talk about registers in a bit)

“The natural number of operands for an operation like addition is three…requiring every instruction to have exactly three operands, no more and no less, conforms to the philosophy of keeping the hardware simple”

Page 25:

MIPS arithmetic

• Design Principle: simplicity favors regularity.

• Of course this complicates some things...

C code: a = b + c + d;

MIPS code: add a, b, c
           add a, a, d

• Operands must be registers, only 32 registers provided

• Each register contains 32 bits

• Design Principle: smaller is faster. Why?

Page 26:

Registers vs. Memory

[Diagram: Processor (Control + Datapath), Memory, I/O (Input, Output).]

• Arithmetic instructions operands must be registers, — only 32 registers provided

• Compiler associates variables with registers

• What about programs with lots of variables?

Page 27:

Memory Organization

• Viewed as a large, single-dimension array, with an address.

• A memory address is an index into the array

• "Byte addressing" means that the index points to a byte of memory.

[Diagram: memory as an array of bytes; each address 0, 1, 2, 3, 4, 5, 6, ... indexes 8 bits of data.]

Page 28:

Memory Organization

• Bytes are nice, but most data items use larger "words"

• For MIPS, a word is 32 bits or 4 bytes.

• 2³² bytes with byte addresses from 0 to 2³² − 1

• 2³⁰ words with byte addresses 0, 4, 8, ..., 2³² − 4

• Words are aligned, i.e., what are the 2 least significant bits of a word address?

[Diagram: memory as an array of 32-bit words at byte addresses 0, 4, 8, 12, ... Registers hold 32 bits of data.]

Page 29:

Instructions

• Load and store instructions
• Example:

C code: A[12] = h + A[8];

MIPS code: lw  $t0, 32($s3)
           add $t0, $s2, $t0
           sw  $t0, 48($s3)

• Can refer to registers by name (e.g., $s2, $t2) instead of number
• Store word has destination last
• Remember arithmetic operands are registers, not memory!

Can’t write: add 48($s3), $s2, 32($s3)

Page 30:

Our First Example

• Can we figure out the code?

C code:

swap(int v[], int k)
{
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}

MIPS code:

swap: muli $2, $5, 4
      add  $2, $4, $2
      lw   $15, 0($2)
      lw   $16, 4($2)
      sw   $16, 0($2)
      sw   $15, 4($2)
      jr   $31

Page 31:

So far we’ve learned:

• MIPS
— loading words but addressing bytes
— arithmetic on registers only

• Instruction Meaning

add $s1, $s2, $s3   $s1 = $s2 + $s3
sub $s1, $s2, $s3   $s1 = $s2 - $s3
lw  $s1, 100($s2)   $s1 = Memory[$s2+100]
sw  $s1, 100($s2)   Memory[$s2+100] = $s1

Page 32:

• Instructions, like registers and words of data, are also 32 bits long

– Example: add $t0, $s1, $s2
– registers have numbers: $t0=8, $s1=17, $s2=18

• Instruction Format:

000000 10001 10010 01000 00000 100000

op rs rt rd shamt funct

• Can you guess what the field names stand for?

Machine Language

Page 33:

• Consider the load-word and store-word instructions,

– What would the regularity principle have us do?

– New principle: Good design demands a compromise

• Introduce a new type of instruction format

– I-type for data transfer instructions

– other format was R-type for register

• Example: lw $t0, 32($s2)

35 18 9 32

op rs rt 16 bit number

• Where's the compromise?

Machine Language

Page 34:

• Instructions are bits

• Programs are stored in memory — to be read or written just like data

• Fetch & Execute Cycle

– Instructions are fetched and put into a special register

– Bits in the register "control" the subsequent actions

– Fetch the “next” instruction and continue

Processor Memory

memory for data, programs, compilers, editors, etc.

Stored Program Concept

Page 35:

• Decision making instructions

– alter the control flow,

– i.e., change the "next" instruction to be executed

• MIPS conditional branch instructions:

bne $t0, $t1, Label
beq $t0, $t1, Label

• Example: if (i==j) h = i + j;

bne $s0, $s1, Label
add $s3, $s0, $s1

Label: ....

Control

Page 36:

• MIPS unconditional branch instruction: j Label

• Example:

C code:
if (i != j)
    h = i + j;
else
    h = i - j;

MIPS code:
      beq $s4, $s5, Lab1
      add $s3, $s4, $s5
      j   Lab2
Lab1: sub $s3, $s4, $s5
Lab2: ...

• Can you build a simple for loop?

Control

Page 37:

So far:

• Instruction Meaning

add $s1,$s2,$s3    $s1 = $s2 + $s3
sub $s1,$s2,$s3    $s1 = $s2 – $s3
lw  $s1,100($s2)   $s1 = Memory[$s2+100]
sw  $s1,100($s2)   Memory[$s2+100] = $s1
bne $s4,$s5,L      Next instr. is at Label if $s4 ≠ $s5
beq $s4,$s5,L      Next instr. is at Label if $s4 = $s5
j   Label          Next instr. is at Label

• Formats:

R: op | rs | rt | rd | shamt | funct
I: op | rs | rt | 16-bit address
J: op | 26-bit address

Page 38:

• We have: beq, bne, what about Branch-if-less-than?

• New instruction: slt $t0, $s1, $s2

if $s1 < $s2 then $t0 = 1 else $t0 = 0

• Can use this instruction to build "blt $s1, $s2, Label" — can now build general control structures

• Note that the assembler needs a register to do this,— there are policy of use conventions for registers

Control Flow

Page 39:

Policy of Use Conventions

Name      Register number   Usage
$zero     0                 the constant value 0
$v0-$v1   2-3               values for results and expression evaluation
$a0-$a3   4-7               arguments
$t0-$t7   8-15              temporaries
$s0-$s7   16-23             saved
$t8-$t9   24-25             more temporaries
$gp       28                global pointer
$sp       29                stack pointer
$fp       30                frame pointer
$ra       31                return address

Register 1 ($at) reserved for assembler, 26-27 for operating system

Page 40:

• Small constants are used quite frequently (50% of operands) e.g., A = A + 5;

B = B + 1;C = C - 18;

• Solutions? Why not?
– put 'typical constants' in memory and load them
– create hard-wired registers (like $zero) for constants like one

• MIPS Instructions:

addi $29, $29, 4
slti $8, $18, 10
andi $29, $29, 6
ori  $29, $29, 4

• Design Principle: Make the common case fast. Which format?

Constants

Page 41:

• We'd like to be able to load a 32-bit constant into a register

• Must use two instructions; new "load upper immediate" instruction:

lui $t0, 1010101010101010

$t0 now holds: 1010101010101010 0000000000000000 (lower half filled with zeros)

• Then must get the lower order bits right, i.e.:

ori $t0, $t0, 1010101010101010

   1010101010101010 0000000000000000
or 0000000000000000 1010101010101010
 = 1010101010101010 1010101010101010

How about larger constants?

Page 42:

• Assembly provides convenient symbolic representation

– much easier than writing down numbers

– e.g., destination first

• Machine language is the underlying reality

– e.g., destination is no longer first

• Assembly can provide 'pseudoinstructions'

– e.g., “move $t0, $t1” exists only in Assembly

– would be implemented using “add $t0,$t1,$zero”

• When considering performance you should count real instructions

Assembly Language vs. Machine Language

Page 43:

• Discussed in your assembly language programming lab:
– support for procedures
– linkers, loaders, memory layout
– stacks, frames, recursion
– manipulating strings and pointers
– interrupts and exceptions
– system calls and conventions

• Some of these we'll talk more about later

• We’ll talk about compiler optimizations when we hit chapter 4.

Other Issues

Page 44:

• simple instructions all 32 bits wide

• very structured, no unnecessary baggage

• only three instruction formats

• rely on compiler to achieve performance— what are the compiler's goals?

• help compiler where we can

R: op | rs | rt | rd | shamt | funct
I: op | rs | rt | 16-bit address
J: op | 26-bit address

Overview of MIPS

Page 45:

• Instructions:

bne $t4,$t5,Label Next instruction is at Label if $t4 ≠ $t5

beq $t4,$t5,Label Next instruction is at Label if $t4 = $t5

j Label Next instruction is at Label

• Formats:

• Addresses are not 32 bits — How do we handle this with load and store instructions?

I: op | rs | rt | 16-bit address
J: op | 26-bit address

Addresses in Branches and Jumps

Page 46:

• Instructions:

bne $t4,$t5,Label   Next instruction is at Label if $t4 ≠ $t5
beq $t4,$t5,Label   Next instruction is at Label if $t4 = $t5

• Formats:

• Could specify a register (like lw and sw) and add it to address

– use Instruction Address Register (PC = program counter)

– most branches are local (principle of locality)

• Jump instructions just use high order bits of PC

– address boundaries of 256 MB

I: op | rs | rt | 16-bit address

Addresses in Branches

Page 47:

To summarize: MIPS operands

32 registers ($s0-$s7, $t0-$t9, $zero, $a0-$a3, $v0-$v1, $gp, $fp, $sp, $ra, $at): Fast locations for data. In MIPS, data must be in registers to perform arithmetic. MIPS register $zero always equals 0. Register $at is reserved for the assembler to handle large constants.

2³⁰ memory words (Memory[0], Memory[4], ..., Memory[4294967292]): Accessed only by data transfer instructions. MIPS uses byte addresses, so sequential words differ by 4. Memory holds data structures, such as arrays, and spilled registers, such as those saved on procedure calls.

MIPS assembly language

Arithmetic:
add  $s1, $s2, $s3    $s1 = $s2 + $s3                        Three operands; data in registers
sub  $s1, $s2, $s3    $s1 = $s2 - $s3                        Three operands; data in registers
addi $s1, $s2, 100    $s1 = $s2 + 100                        Used to add constants

Data transfer:
lw  $s1, 100($s2)     $s1 = Memory[$s2 + 100]                Word from memory to register
sw  $s1, 100($s2)     Memory[$s2 + 100] = $s1                Word from register to memory
lb  $s1, 100($s2)     $s1 = Memory[$s2 + 100]                Byte from memory to register
sb  $s1, 100($s2)     Memory[$s2 + 100] = $s1                Byte from register to memory
lui $s1, 100          $s1 = 100 * 2¹⁶                        Loads constant in upper 16 bits

Conditional branch:
beq  $s1, $s2, 25     if ($s1 == $s2) go to PC + 4 + 100     Equal test; PC-relative branch
bne  $s1, $s2, 25     if ($s1 != $s2) go to PC + 4 + 100     Not equal test; PC-relative branch
slt  $s1, $s2, $s3    if ($s2 < $s3) $s1 = 1; else $s1 = 0   Compare less than; for beq, bne
slti $s1, $s2, 100    if ($s2 < 100) $s1 = 1; else $s1 = 0   Compare less than constant

Unconditional jump:
j   2500              go to 10000                            Jump to target address
jr  $ra               go to $ra                              For switch, procedure return
jal 2500              $ra = PC + 4; go to 10000              For procedure call

Page 48:

[Diagram: the five MIPS addressing modes.
1. Immediate addressing: operand is a constant within the instruction (op | rs | rt | Immediate).
2. Register addressing: operand is a register (op | rs | rt | rd | ... | funct selects a Register).
3. Base addressing: operand is at Memory[Register + Address]; a byte, halfword, or word (op | rs | rt | Address, added to a Register).
4. PC-relative addressing: branch target is PC + Address (op | rs | rt | Address, added to the PC).
5. Pseudodirect addressing: jump target is the 26-bit Address concatenated with the upper bits of the PC.]

Page 49:

CSE 431 Computer Architecture

Fall 2005

Lecture 02: MIPS ISA Review

Mary Jane Irwin ( www.cse.psu.edu/~mji )

www.cse.psu.edu/~cg431

[Adapted from Computer Organization and Design,

Patterson & Hennessy, © 2005, UCB]

Page 50:

(von Neumann) Processor Organization

• Control needs to

1. input instructions from Memory

2. issue signals to control the information flow between the Datapath components and to control what operations they perform

3. control instruction sequencing (Fetch, Decode, Exec)

[Diagram: CPU (Control + Datapath), Memory, Devices (Input, Output).]

• Datapath needs to have the

– components – the functional units and storage (e.g., register file) needed to execute instructions

– interconnects - components connected so that the instructions can be accomplished and so that data can be loaded from and stored to Memory

Page 51:

RISC - Reduced Instruction Set Computer

• RISC philosophy

– fixed instruction lengths

– load-store instruction sets

– limited addressing modes

– limited operations

• MIPS, Sun SPARC, HP PA-RISC, IBM PowerPC, Compaq (DEC) Alpha, …

• Instruction sets are measured by how well compilers use them as opposed to how well assembly language programmers use them

Design goals: speed, cost (design, fabrication, test, packaging), size, power consumption, reliability, memory space (embedded systems)

Page 52:

MIPS R3000 Instruction Set Architecture (ISA)

• Instruction Categories

– Computational

– Load/Store

– Jump and Branch

– Floating Point (coprocessor)

– Memory Management

– Special

• Registers: R0–R31, PC, HI, LO

• 3 Instruction Formats, all 32 bits wide:

– R format: OP | rs | rt | rd | sa | funct
– I format: OP | rs | rt | immediate
– J format: OP | jump target

Page 53:

Review: Unsigned Binary Representation

Hex Binary Decimal

0x00000000 0…0000 0

0x00000001 0…0001 1

0x00000002 0…0010 2

0x00000003 0…0011 3

0x00000004 0…0100 4

0x00000005 0…0101 5

0x00000006 0…0110 6

0x00000007 0…0111 7

0x00000008 0…1000 8

0x00000009 0…1001 9

0xFFFFFFFC 1…1100 2³² - 4

0xFFFFFFFD 1…1101 2³² - 3

0xFFFFFFFE 1…1110 2³² - 2

0xFFFFFFFF 1…1111 2³² - 1

bit position: 31  30  29  . . .  3  2  1  0
bit weight:   2³¹ 2³⁰ 2²⁹ . . .  2³ 2² 2¹ 2⁰

1 1 1 . . . 1 1 1 1 = 2³² - 1 (adding 1 gives 1 0 0 0 . . . 0 0 0 0)

Page 54:

Aside: Beyond Numbers

• American Std Code for Info Interchange (ASCII): 8-bit bytes representing characters

ASCII Char ASCII Char ASCII Char ASCII Char ASCII Char ASCII Char

0 Null 32 space 48 0 64 @ 96 ` 112 p

1 33 ! 49 1 65 A 97 a 113 q

2 34 “ 50 2 66 B 98 b 114 r

3 35 # 51 3 67 C 99 c 115 s

4 EOT 36 $ 52 4 68 D 100 d 116 t

5 37 % 53 5 69 E 101 e 117 u

6 ACK 38 & 54 6 70 F 102 f 118 v

7 39 ‘ 55 7 71 G 103 g 119 w

8 bksp 40 ( 56 8 72 H 104 h 120 x

9 tab 41 ) 57 9 73 I 105 i 121 y

10 LF 42 * 58 : 74 J 106 j 122 z

11 43 + 59 ; 75 K 107 k 123 {

12 FF 44 , 60 < 76 L 108 l 124 |

15 47 / 63 ? 79 O 111 o 127 DEL

Page 55:

MIPS Arithmetic Instructions

• MIPS assembly language arithmetic statement

add $t0, $s1, $s2

sub $t0, $s1, $s2

• Each arithmetic instruction performs only one operation

• Each arithmetic instruction fits in 32 bits and specifies exactly three operands

destination source1 op source2


• Operand order is fixed (destination first)

• Those operands are all contained in the datapath’s register file ($t0,$s1,$s2) – indicated by $


Aside: MIPS Register Convention

Name Register Number

Usage Preserve on call?

$zero 0 constant 0 (hardware) n.a.

$at 1 reserved for assembler n.a.

$v0 - $v1 2-3 returned values no

$a0 - $a3 4-7 arguments yes

$t0 - $t7 8-15 temporaries no

$s0 - $s7 16-23 saved values yes

$t8 - $t9 24-25 temporaries no

$gp 28 global pointer yes

$sp 29 stack pointer yes

$fp 30 frame pointer yes

$ra 31 return addr (hardware) yes


MIPS Register File

[Diagram: register file with 32 locations of 32 bits each; two 5-bit source addresses (src1 addr, src2 addr) select the two 32-bit read outputs (src1 data, src2 data); a 5-bit dst addr and a 32-bit write data port, gated by write control, select the location to write]

• Holds thirty-two 32-bit registers

– Two read ports and

– One write port

• Registers are

– Faster than main memory

• But register files with more locations are slower (e.g., a 64 word file could be as much as 50% slower than a 32 word file)

• Read/write port increase impacts speed quadratically

– Easier for a compiler to use

• e.g., (A*B) – (C*D) – (E*F) can do multiplies in any order vs. stack

– Can hold variables so that

• code density improves (since registers are named with fewer bits than a memory location)



• Instructions, like registers and words of data, are 32 bits long

• Arithmetic Instruction Format (R format):

add $t0, $s1, $s2

Machine Language - Add Instruction

op rs rt rd shamt funct

op 6-bit opcode that specifies the operation

rs 5-bit register file address of the first source operand

rt 5-bit register file address of the second source operand

rd 5-bit register file address of the result’s destination

shamt 5-bit shift amount (for shift instructions)

funct 6-bit function code augmenting the opcode
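The field layout above can be checked with a short Python sketch (the `encode_r` helper is ours, not part of the slides; register numbers come from the register convention table):

```python
# Pack R-format fields into a 32-bit MIPS machine word.
# Field widths follow the slide: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6).
def encode_r(op, rs, rt, rd, shamt, funct):
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $s1, $s2 -> op=0, rs=$s1(17), rt=$s2(18), rd=$t0(8), shamt=0, funct=32
word = encode_r(0, 17, 18, 8, 0, 32)
print(hex(word))  # 0x2324020
```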


MIPS Memory Access Instructions

• MIPS has two basic data transfer instructions for accessing memory

lw $t0, 4($s3) #load word from memory

sw $t0, 8($s3) #store word to memory

• The data is loaded into (lw) or stored from (sw) a register in the register file – a 5-bit address

• The memory address – a 32 bit address – is formed by adding the contents of the base address register to the offset value

– A 16-bit field meaning access is limited to memory locations within a region of 2^13 or 8,192 words (2^15 or 32,768 bytes) of the address in the base register

– Note that the offset can be positive or negative


• Load/Store Instruction Format (I format):

lw $t0, 24($s2)

Machine Language - Load Instruction

op rs rt 16 bit offset

Memory

data word address (hex): 0x00000000, 0x00000004, 0x00000008, 0x0000000c, … , 0xffffffff

$s2 = 0x12004094

24 (decimal) + $s2 = . . . 0001 1000 + . . . 1001 0100 = . . . 1010 1100 = 0x120040ac

so the word at memory address 0x120040ac is loaded into $t0
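The address arithmetic in this example can be reproduced in a few lines of Python (`sign_extend16` and `effective_address` are hypothetical helper names, used only for illustration):

```python
def sign_extend16(x):
    # interpret the low 16 bits as a signed two's complement value
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def effective_address(base, offset16):
    # base register contents plus sign-extended 16-bit offset, mod 2^32
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF

print(hex(effective_address(0x12004094, 24)))      # 0x120040ac
print(hex(effective_address(0x12004094, 0xFFFC)))  # offset -4 -> 0x12004090
```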


Byte Addresses

• Since 8-bit bytes are so useful, most architectures address individual bytes in memory

– The memory address of a word must be a multiple of 4 (alignment restriction)

• Big Endian: leftmost byte is word address

IBM 360/370, Motorola 68k, MIPS, Sparc, HP PA

• Little Endian: rightmost byte is word address

Intel 80x86, DEC Vax, DEC Alpha (Windows NT)

msb … lsb

little endian: bytes numbered 3 2 1 0 (byte 0 is the rightmost, least significant byte)

big endian: bytes numbered 0 1 2 3 (byte 0 is the leftmost, most significant byte)
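Python's `int.to_bytes` makes the two byte orders easy to see side by side (a sketch, not part of the slides):

```python
# The same 32-bit word laid out under each convention:
word = 0x0A0B0C0D
print(word.to_bytes(4, "big").hex())     # 0a0b0c0d  (big endian: MSB at byte 0)
print(word.to_bytes(4, "little").hex())  # 0d0c0b0a  (little endian: LSB at byte 0)
```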


Aside: Loading and Storing Bytes

• MIPS provides special instructions to move bytes

lb $t0, 1($s3) #load byte from memory

sb $t0, 6($s3) #store byte to memory

op rs rt 16 bit offset

• What 8 bits get loaded and stored?

– load byte places the byte from memory in the rightmost 8 bits of the destination register

• what happens to the other bits in the register?

– store byte takes the byte from the rightmost 8 bits of a register and writes it to a byte in memory

• what happens to the other bits in the memory word?


• MIPS conditional branch instructions:

bne $s0, $s1, Lbl #go to Lbl if $s0 != $s1
beq $s0, $s1, Lbl #go to Lbl if $s0 = $s1

– Ex: if (i==j) h = i + j;

bne $s0, $s1, Lbl1
add $s3, $s0, $s1
Lbl1: ...

MIPS Control Flow Instructions

• Instruction Format (I format):

op rs rt 16 bit offset

• How is the branch destination address specified?


Specifying Branch Destinations

• Use a register (like in lw and sw) added to the 16-bit offset

– which register? Instruction Address Register (the PC)

• its use is automatically implied by the instruction

• PC gets updated (PC+4) during the fetch cycle so that it holds the address of the next instruction

– limits the branch distance to -2^15 to +2^15-1 instructions from the (instruction after the) branch instruction, but most branches are local anyway

[Diagram: branch destination address = PC + 4 + (offset << 2); the 16-bit offset from the low order 16 bits of the branch instruction is sign-extended to 32 bits, shifted left 2 bits (two 0s appended), and added to PC+4]
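The target computation sketched above, in Python (`branch_target` is our name; the example addresses are made up for illustration):

```python
def branch_target(branch_pc, offset16):
    # target = (branch_pc + 4) + sign-extended 16-bit offset shifted left 2 bits
    off = offset16 & 0xFFFF
    if off & 0x8000:
        off -= 0x10000
    return (branch_pc + 4 + (off << 2)) & 0xFFFFFFFF

print(hex(branch_target(0x00400000, 3)))   # 0x400010
print(hex(branch_target(0x00400000, -1)))  # 0x400000 (offset -1 targets the branch itself)
```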


• We have beq, bne, but what about other kinds of branches (e.g., branch-if-less-than)? For this, we need yet another instruction, slt

• Set on less than instruction:

slt $t0, $s0, $s1 # if $s0 < $s1 then $t0 = 1 else $t0 = 0

• Instruction format (R format):


More Branch Instructions

op rs rt rd funct


More Branch Instructions, Con’t

• Can use slt, beq, bne, and the fixed value of 0 in register $zero to create other conditions

– less than blt $s1, $s2, Label

– less than or equal to ble $s1, $s2, Label

– greater than bgt $s1, $s2, Label

– greater than or equal to bge $s1, $s2, Label

slt $at, $s1, $s2 # $at set to 1 if $s1 < $s2
bne $at, $zero, Label # branch if $at != 0

• Such branches are included in the instruction set as pseudo instructions - recognized (and expanded) by the assembler

– It’s why the assembler needs a reserved register ($at)
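The expansion described above can be sketched in Python (`expand_blt` is our name, not a real assembler API; it just mirrors what the assembler emits for the blt pseudo instruction):

```python
# A sketch of the expansion the assembler performs for blt rs, rt, label:
def expand_blt(rs, rt, label):
    return [f"slt $at, {rs}, {rt}",
            f"bne $at, $zero, {label}"]

for line in expand_blt("$s1", "$s2", "Label"):
    print(line)
```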


• MIPS also has an unconditional branch instruction or jump instruction:

j label #go to label

Other Control Flow Instructions

• Instruction Format (J Format):

op 26-bit address

[Diagram: jump destination address = upper 4 bits of PC+4 concatenated with the 26-bit address from the low order 26 bits of the jump instruction, shifted left 2 bits (two 0s appended)]
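A Python sketch of the pseudo-direct target computation (`jump_target` is our name); note it reproduces the "j 2500 → go to 10000" entry from the ISA summary table:

```python
def jump_target(pc, addr26):
    # upper 4 bits of PC+4 concatenated with the 26-bit field shifted left 2 bits
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x3FFFFFF) << 2)

print(jump_target(0x00400000, 2500))  # 10000
```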


Aside: Branching Far Away

• What if the branch destination is further away than can be captured in 16 bits?

The assembler comes to the rescue – it inserts an unconditional jump to the branch target and inverts the condition

beq $s0, $s1, L1

becomes

bne $s0, $s1, L2j L1

L2:


Instructions for Accessing Procedures

• MIPS procedure call instruction:

jal ProcedureAddress #jump and link

• Saves PC+4 in register $ra to have a link to the next instruction for the procedure return

• Machine format (J format):

op 26 bit address

• Then can do procedure return with a

jr $ra #return

• Instruction format (R format):

op rs funct


Aside: Spilling Registers

• What if the callee needs more registers? What if the procedure is recursive?

– uses a stack – a last-in-first-out (LIFO) structure – in memory for passing additional values or saving (recursive) return address(es)

One of the general registers, $sp, is used to address the stack (which “grows” from high address to low address)

• push (add data onto the stack): $sp = $sp – 4; then store the data at the new $sp

• pop (remove data from the stack): load the data at $sp; then $sp = $sp + 4

[Diagram: memory from low addr (top) to high addr (bottom); the stack “grows” from high addresses toward low addresses, and $sp points to the top of the stack]


addi $sp, $sp, 4 #$sp = $sp + 4

slti $t0, $s2, 15 #$t0 = 1 if $s2<15

• Machine format (I format):

MIPS Immediate Instructions

op rs rt 16 bit immediate I format

• Small constants are used often in typical code

• Possible approaches?

– put “typical constants” in memory and load them

– create hard-wired registers (like $zero) for constants like 1

– have special instructions that contain constants !

• The constant is kept inside the instruction itself!

– Immediate format limits values to the range -2^15 to +2^15-1


• We'd also like to be able to load a 32-bit constant into a register; for this we must use two instructions

• a new "load upper immediate" instruction

lui $t0, 1010101010101010

• Then must get the lower order bits right, use

ori $t0, $t0, 1010101010101010

Aside: How About Larger Constants?

[machine encoding of the lui: op=15 | rs=0 | rt=8 ($t0) | 1010101010101010]

after lui: $t0 = 1010101010101010 0000000000000000

after ori: $t0 = 1010101010101010 1010101010101010
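The lui/ori pair can be modeled in Python (`load_constant32` is a hypothetical name for the two-instruction sequence):

```python
def load_constant32(upper16, lower16):
    reg = (upper16 & 0xFFFF) << 16   # lui: immediate into the upper half, lower half zeroed
    reg |= (lower16 & 0xFFFF)        # ori: OR the immediate into the lower half
    return reg

print(hex(load_constant32(0b1010101010101010, 0b1010101010101010)))  # 0xaaaaaaaa
```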


MIPS Organization So Far

[Diagram: Processor and Memory. Memory: 2^30 words of 32 bits, byte addressed (big endian), word addresses 0…0000, 0…0100, 0…1000, 0…1100, …, 1…1100, with read/write addr, read data, and write data ports. Processor: register file (32 registers, $zero - $ra, 32 bits each; two 5-bit source addresses and one 5-bit destination address, two 32-bit read data ports, one 32-bit write data port), a 32-bit ALU, and the PC. Control loop: Fetch (PC = PC+4), Decode, Exec; one adder computes PC+4 and a second adds the branch offset to form branch targets]


MIPS ISA So Far

Category Instr Op Code Example Meaning

Arithmetic

(R & I format)

add 0 and 32 add $s1, $s2, $s3 $s1 = $s2 + $s3

subtract 0 and 34 sub $s1, $s2, $s3 $s1 = $s2 - $s3

add immediate 8 addi $s1, $s2, 6 $s1 = $s2 + 6

or immediate 13 ori $s1, $s2, 6 $s1 = $s2 v 6

Data Transfer

(I format)

load word 35 lw $s1, 24($s2) $s1 = Memory($s2+24)

store word 43 sw $s1, 24($s2) Memory($s2+24) = $s1

load byte 32 lb $s1, 25($s2) $s1 = Memory($s2+25)

store byte 40 sb $s1, 25($s2) Memory($s2+25) = $s1

load upper imm 15 lui $s1, 6 $s1 = 6 * 2^16

Cond. Branch (I & R format)

br on equal 4 beq $s1, $s2, L if ($s1==$s2) go to L

br on not equal 5 bne $s1, $s2, L if ($s1 !=$s2) go to L

set on less than 0 and 42 slt $s1, $s2, $s3 if ($s2<$s3) $s1=1 else $s1=0

set on less than immediate

10 slti $s1, $s2, 6 if ($s2<6) $s1=1 else $s1=0

Uncond. Jump (J & R format)

jump 2 j 2500 go to 10000

jump register 0 and 8 jr $t1 go to $t1

jump and link 3 jal 2500 go to 10000; $ra=PC+4


Review of MIPS Operand Addressing Modes

• Register addressing – operand is in a register

• Base (displacement) addressing – operand is at the memory location whose address is the sum of a register and a 16-bit constant contained within the instruction

– Register relative (indirect) with 0($a0)

– Pseudo-direct with addr($zero)

• Immediate addressing – operand is a 16-bit constant contained within the instruction

[Diagrams:

Register addressing: op | rs | rt | rd | funct → a register holds the word operand

Base addressing: op | rs | rt | offset → base register + offset selects a word or byte operand in Memory

Immediate addressing: op | rs | rt | operand → the operand is inside the instruction]


Review of MIPS Instruction Addressing Modes

• PC-relative addressing – instruction address is the sum of the PC and a 16-bit constant contained within the instruction

• Pseudo-direct addressing – instruction address is the 26-bit constant contained within the instruction concatenated with the upper 4 bits of the PC

[Diagrams:

PC-relative: op | rs | rt | offset; Program Counter (PC) + offset selects the branch destination instruction in Memory

Pseudo-direct: op | jump address; upper bits of the Program Counter (PC) || 26-bit jump address selects the jump destination instruction in Memory]


MIPS (RISC) Design Principles

• Simplicity favors regularity

– fixed size instructions – 32-bits

– small number of instruction formats

– opcode always the first 6 bits

• Good design demands good compromises

– three instruction formats

• Smaller is faster

– limited instruction set

– limited number of registers in register file

– limited number of addressing modes

• Make the common case fast

– arithmetic operands from the register file (load-store machine)

– allow instructions to contain immediate operands


Chapter Three


Numbers

• Bits are just bits (no inherent meaning) — conventions define the relationship between bits and numbers

• Binary numbers (base 2): 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 ... decimal: 0 ... 2^n-1

• Of course it gets more complicated:

– numbers are finite (overflow)

– fractions and real numbers

– negative numbers (e.g., there is no MIPS subi instruction; addi can add a negative number)

• How do we represent negative numbers? i.e., which bit patterns will represent which numbers?


Sign Magnitude    One's Complement    Two's Complement

000 = +0          000 = +0            000 = +0
001 = +1          001 = +1            001 = +1
010 = +2          010 = +2            010 = +2
011 = +3          011 = +3            011 = +3
100 = -0          100 = -3            100 = -4
101 = -1          101 = -2            101 = -3
110 = -2          110 = -1            110 = -2
111 = -3          111 = -0            111 = -1

• Issues: balance, number of zeros, ease of operations

• Which one is best? Why?

Possible Representations


• 32 bit signed numbers:

0000 0000 0000 0000 0000 0000 0000 0000two = 0ten

0000 0000 0000 0000 0000 0000 0000 0001two = + 1ten

0000 0000 0000 0000 0000 0000 0000 0010two = + 2ten

...

0111 1111 1111 1111 1111 1111 1111 1110two = + 2,147,483,646ten

0111 1111 1111 1111 1111 1111 1111 1111two = + 2,147,483,647ten

1000 0000 0000 0000 0000 0000 0000 0000two = – 2,147,483,648ten

1000 0000 0000 0000 0000 0000 0000 0001two = – 2,147,483,647ten

1000 0000 0000 0000 0000 0000 0000 0010two = – 2,147,483,646ten

...

1111 1111 1111 1111 1111 1111 1111 1101two = – 3ten

1111 1111 1111 1111 1111 1111 1111 1110two = – 2ten

1111 1111 1111 1111 1111 1111 1111 1111two = – 1ten

maxint

minint

MIPS


• 32-bit signed numbers (2’s complement):

0000 0000 0000 0000 0000 0000 0000 0000two = 0ten

0000 0000 0000 0000 0000 0000 0000 0001two = + 1ten...

0111 1111 1111 1111 1111 1111 1111 1110two = + 2,147,483,646ten

0111 1111 1111 1111 1111 1111 1111 1111two = + 2,147,483,647ten

1000 0000 0000 0000 0000 0000 0000 0000two = – 2,147,483,648ten

1000 0000 0000 0000 0000 0000 0000 0001two = – 2,147,483,647ten...

1111 1111 1111 1111 1111 1111 1111 1110two = – 2ten

1111 1111 1111 1111 1111 1111 1111 1111two = – 1ten

MIPS Number Representations

maxint

minint

• Converting <32-bit values into 32-bit values

– copy the most significant bit (the sign bit) into the “empty” bits

0010 -> 0000 0010
1010 -> 1111 1010

– sign extend versus zero extend (lb vs. lbu)
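Sign extension versus zero extension, as a small Python sketch (function names are ours; the behavior matches what lb and lbu do to the loaded byte):

```python
def sign_extend(value, bits):
    # replicate the sign bit into the upper bits (what lb does)
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def zero_extend(value, bits):
    # fill the upper bits with zeros (what lbu does)
    return value & ((1 << bits) - 1)

print(sign_extend(0b1010, 4))  # -6  (1010 -> ...1111 1010)
print(zero_extend(0b1010, 4))  # 10  (1010 -> ...0000 1010)
```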


MIPS Arithmetic Logic Unit (ALU)

• Must support the Arithmetic/Logic operations of the ISA

add, addi, addiu, addu
sub, subu, neg
mult, multu, div, divu
sqrt
and, andi, nor, or, ori, xor, xori
beq, bne, slt, slti, sltiu, sltu

[Diagram: 32-bit ALU with 32-bit inputs A and B, a 4-bit operation select m, a 32-bit result, and 1-bit zero and ovf (overflow) outputs]

• With special handling for

– sign extend – addi, addiu, slti, sltiu

– zero extend – andi, ori, xori, lbu

– no overflow detected – addu, addiu, subu, multu, divu, sltiu, sltu


• Negating a two's complement number: invert all bits and add 1

– remember: “negate” and “invert” are quite different!

• Converting n bit numbers into numbers with more than n bits:

– MIPS 16 bit immediate gets converted to 32 bits for arithmetic

– copy the most significant bit (the sign bit) into the other bits

0010 -> 0000 0010

1010 -> 1111 1010

– "sign extension" (lbu vs. lb)

Two's Complement Operations


Review: 2’s Complement Binary Representation

2’sc binary decimal

1000 -8

1001 -7

1010 -6

1011 -5

1100 -4

1101 -3

1110 -2

1111 -1

0000 0

0001 1

0010 2

0011 3

0100 4

0101 5

0110 6

0111 7

2^3 - 1 = 7 (most positive); -(2^3 - 1) = -7; -2^3 = -8 (most negative)

• Negate (e.g., negating 0101 = +5): complement all the bits (1010) and add a 1 (1011 = -5)

• Note: negate and invert are different!


Review: A Full Adder

[Diagram: 1-bit Full Adder with inputs A, B, carry_in and outputs S, carry_out]

S = A xor B xor carry_in (odd parity function)

carry_out = A&B | A&carry_in | B&carry_in (majority function)

How can we use it to build a 32-bit adder?

How can we modify it easily to build an adder/subtractor?

A B carry_in carry_out S

0 0 0 0 0

0 0 1 0 1

0 1 0 0 1

0 1 1 1 0

1 0 0 0 1

1 0 1 1 0

1 1 0 1 0

1 1 1 1 1
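The two equations can be checked directly against the truth table with a few lines of Python (a sketch, not part of the slides):

```python
def full_adder(a, b, carry_in):
    s = a ^ b ^ carry_in                                   # odd parity function
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)  # majority function
    return s, carry_out

# reproduce the truth table above
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            print(a, b, cin, cout, s)
```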


• Just like in grade school (carry/borrow 1s):

   0111     0111     0110
 + 0110   - 0110   - 0101

• Two's complement operations easy

– subtraction using addition of negative numbers:

   0111
 + 1010

• Overflow (result too large for finite computer word):

– e.g., adding two n-bit numbers does not yield an n-bit number:

   0111
 + 0001
   1000

note that the term “overflow” is somewhat misleading; it does not mean a carry “overflowed”

Addition & Subtraction


A 32-bit Ripple Carry Adder/Subtractor

• Remember: 2’s complement is just complement all the bits and add a 1 in the least significant bit

   A   0111          0111
   B - 0110   →    + 1001   (B complemented)
                  +     1   (carry_in = 1)
              =      0001

[Diagram: 32 cascaded 1-bit full adders producing S0…S31; c0 = carry_in is the add/sub control (0 = add, 1 = sub), c32 = carry_out. Each Bi passes through an XOR with the control, so the adder sees Bi if control = 0 and !Bi if control = 1]
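The whole adder/subtractor can be simulated bit by bit in Python (`ripple_add_sub` is our name; the XOR-with-control trick is exactly the one in the diagram):

```python
def ripple_add_sub(a, b, control, bits=32):
    # control = 0 for add, 1 for subtract: each B bit is XORed with the
    # control, and the control also feeds carry_in (the "+1" of 2's complement)
    carry = control
    result = 0
    for i in range(bits):
        ai = (a >> i) & 1
        bi = ((b >> i) & 1) ^ control
        result |= (ai ^ bi ^ carry) << i                   # sum bit (odd parity)
        carry = (ai & bi) | (ai & carry) | (bi & carry)    # carry (majority)
    return result

print(ripple_add_sub(0b0111, 0b0110, 1, bits=4))  # 1  (7 - 6)
print(ripple_add_sub(0b0111, 0b0110, 0, bits=4))  # 13 (7 + 6, which overflows 4-bit signed)
```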


• No overflow when adding a positive and a negative number

• No overflow when signs are the same for subtraction

• Overflow occurs when the value affects the sign:

– overflow when adding two positives yields a negative

– or, adding two negatives gives a positive

– or, subtract a negative from a positive and get a negative

– or, subtract a positive from a negative and get a positive

• Consider the operations A + B, and A – B

– Can overflow occur if B is 0 ?

– Can overflow occur if A is 0 ?

Detecting Overflow


Overflow Detection

• Overflow: the result is too large to represent in 32 bits

• Overflow occurs when

– adding two positives yields a negative

– or, adding two negatives gives a positive

– or, subtract a negative from a positive gives a negative

– or, subtract a positive from a negative gives a positive

• On your own: Prove you can detect overflow by:

– Carry into MSB xor Carry out of MSB; e.g., for 4-bit signed numbers:

   carries: 0 1 1 1              carries: 1 0 0 0
      0 1 1 1   ( 7)                1 1 0 0   (-4)
    + 0 0 1 1   ( 3)              + 1 0 1 1   (-5)
      1 0 1 0   (-6 → overflow)     0 1 1 1   ( 7 → overflow)
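The carry-in-xor-carry-out rule can be verified in Python on the two examples above (`add_overflows` is a hypothetical helper, written as a sketch of the proof you are asked to do on your own):

```python
def add_overflows(a, b, bits=4):
    # overflow = carry into the MSB XOR carry out of the MSB
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    low_mask = (1 << (bits - 1)) - 1
    carry_in_msb = ((a & low_mask) + (b & low_mask)) >> (bits - 1)
    carry_out_msb = ((a + b) >> bits) & 1
    return carry_in_msb ^ carry_out_msb

print(add_overflows(0b0111, 0b0011))  # 1: 7 + 3 does not fit in 4-bit signed
print(add_overflows(0b1100, 0b1011))  # 1: -4 + -5 does not fit
print(add_overflows(0b0011, 0b0010))  # 0: 3 + 2 fits
```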


• Need to support the logic operations (and, nor, or, xor)

– Bit wise operations (no carry operation involved)

– Need a logic gate for each function, mux to choose the output

• Need to support the set-on-less-than instruction (slt)

– Use subtraction to determine if (a – b) < 0 (implies a < b)

– Copy the sign bit into the low order bit of the result, set remaining result bits to 0

• Need to support test for equality (bne, beq)

– Again use subtraction: (a - b) = 0 implies a = b

– Additional logic to “nor” all result bits together

• Immediates are sign extended outside the ALU with wiring (i.e., no logic needed)

Tailoring the ALU to the MIPS ISA


Shift Operations

• Also need operations to pack and unpack 8-bit characters into 32-bit words

• Shifts move all the bits in a word left or right

sll $t2, $s0, 8 #$t2 = $s0 << 8 bits

srl $t2, $s0, 8 #$t2 = $s0 >> 8 bits

op rs rt rd shamt funct

• Notice that a 5-bit shamt field is enough to shift a 32-bit value 2^5 – 1 or 31 bit positions

• Such shifts are logical because they fill with zeros


Shift Operations, con’t

• An arithmetic shift (sra) maintains the arithmetic correctness of the shifted value (i.e., a number shifted right one bit should be ½ of its original value; a number shifted left should be 2 times its original value)

– so sra uses the most significant bit (sign bit) as the bit shifted in

– note that there is no need for a sla when using two’s complement number representation

sra $t2, $s0, 8 #$t2 = $s0 >> 8 bits

• The shift operation is implemented by hardware separate from the ALU

– using a barrel shifter (which would take lots of gates in discrete logic, but is pretty easy to implement in VLSI)


Multiply

• Binary multiplication is just a bunch of right shifts and adds

[Diagram: an n-bit multiplicand times an n-bit multiplier forms a partial product array that sums to a 2n-bit double precision product]

• The partial products can be formed in parallel and added in parallel for faster multiplication


• Multiply produces a double precision product

mult $s0, $s1 # hi||lo = $s0 * $s1

– Low-order word of the product is left in processor register lo and the high-order word is left in register hi

– Instructions mfhi rd and mflo rd are provided to move the product to (user accessible) registers in the register file

MIPS Multiply Instruction

op rs rt rd shamt funct

• Multiplies are done by fast, dedicated hardware and are much more complex (and slower) than adders

• Hardware dividers are even more complex and even slower; ditto for hardware square root


• An exception (interrupt) occurs

– Control jumps to predefined address for exception

– Interrupted address is saved for possible resumption

• Details based on software system / language

– example: flight control vs. homework assignment

• Don't always want to detect overflow— new MIPS instructions: addu, addiu, subu

note: addiu still sign-extends!

note: sltu, sltiu for unsigned comparisons

Effects of Overflow


• More complicated than addition

– accomplished via shifting and addition

• More time and more area

• Let's look at 3 versions based on a grade-school algorithm:

    0010  (multiplicand)
  x 1011  (multiplier)

• Negative numbers: convert and multiply

– there are better techniques, we won’t look at them

Multiplication


Multiplication: Implementation

[Diagram: Datapath and Control – 64-bit Multiplicand register (shift left), 64-bit ALU, 64-bit Product register (write), 32-bit Multiplier register (shift right), control test]

[Flowchart: Start → 1. Test Multiplier0. If Multiplier0 = 1: 1a. Add multiplicand to product and place the result in the Product register. 2. Shift the Multiplicand register left 1 bit. 3. Shift the Multiplier register right 1 bit. 32nd repetition? No: < 32 repetitions → repeat; Yes: 32 repetitions → Done]


Final Version

[Diagram: 32-bit Multiplicand register, 32-bit ALU, 64-bit Product register (shift right, write), control test]

• Multiplier starts in right half of product

[Flowchart: Start → 1. Test Product0. If Product0 = 1: (What goes here?) 3. Shift the Product register right 1 bit. 32nd repetition? No: < 32 repetitions → repeat; Yes: 32 repetitions → Done]
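The final-version algorithm is easy to simulate in Python (a sketch, with `shift_add_multiply` as our name; the missing flowchart box is the conditional add into the left half of the product register):

```python
def shift_add_multiply(multiplicand, multiplier, bits=32):
    # The 64-bit Product register starts with the multiplier in its right
    # half; test Product0, conditionally add the multiplicand into the
    # left half, then shift the whole register right 1 bit.
    mask = (1 << bits) - 1
    product = multiplier & mask
    for _ in range(bits):
        if product & 1:
            product += (multiplicand & mask) << bits
        product >>= 1
    return product  # double precision (2n-bit) result, i.e. hi||lo

print(shift_add_multiply(0b0010, 0b1011, bits=4))  # 22 (2 * 11)
```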


Floating Point (a brief look)

• We need a way to represent

– numbers with fractions, e.g., 3.1416

– very small numbers, e.g., .000000001

– very large numbers, e.g., 3.15576 x 10^9

• Representation:

– sign, exponent, significand: (–1)^sign x significand x 2^exponent

– more bits for significand gives more accuracy

– more bits for exponent increases range

• IEEE 754 floating point standard:

– single precision: 8 bit exponent, 23 bit significand

– double precision: 11 bit exponent, 52 bit significand


Representing Big (and Small) Numbers

• What if we want to encode the approx. age of the earth?

4,600,000,000 or 4.6 x 10^9

or the weight in kg of one a.m.u. (atomic mass unit)

0.00000000000000000000000000166 or 1.66 x 10^-27

There is no way we can encode either of the above in a 32-bit integer.

• Floating point representation: (-1)^sign x F x 2^E

– Still have to fit everything in 32 bits (single precision)

s | E (exponent) | F (fraction)
1 bit | 8 bits | 23 bits

– The base (2, not 10) is hardwired in the design of the FPALU

– More bits in the fraction (F) or the exponent (E) is a trade-off between precision (accuracy of the number) and range (size of the number)


IEEE 754 floating-point standard

• Leading “1” bit of significand is implicit

• Exponent is “biased” to make sorting easier

– all 0s is smallest exponent; all 1s is largest

– bias of 127 for single precision and 1023 for double precision

– summary: (–1)^sign x significand x 2^(exponent – bias)

• Example:

– decimal: -.75 = - ( ½ + ¼ )

– binary: -.11 = -1.1 x 2^-1

– floating point: exponent = 126 = 01111110

– IEEE single precision: 10111111010000000000000000000000
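Python's struct module can confirm this encoding (a sketch, not part of the slides):

```python
import struct

# Pack -0.75 as an IEEE 754 single and inspect the 32 bits:
bits = struct.unpack(">I", struct.pack(">f", -0.75))[0]
print(f"{bits:032b}")  # 10111111010000000000000000000000
print(hex(bits))       # 0xbf400000
```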


IEEE 754 FP Standard Encoding

• Most (all?) computers these days conform to the IEEE 754 floating point standard: (−1)^sign × (1 + F) × 2^(E − bias)

– Formats for both single and double precision

– F is stored in normalized form where the msb in the fraction is 1 (so there is no need to store it!) – called the hidden bit

– To simplify sorting FP numbers, E comes before F in the word and E is represented in excess (biased) notation

Single Precision      Double Precision      Object Represented
E (8)    F (23)       E (11)   F (52)
0        0            0        0            true zero (0)
0        nonzero      0        nonzero      ± denormalized number
1-254    anything     1-2046   anything     ± floating point number
255      0            2047     0            ± infinity
255      nonzero      2047     nonzero      not a number (NaN)


Floating Point Addition

• Addition (and subtraction)

(F1 × 2^E1) + (F2 × 2^E2) = F3 × 2^E3

– Step 0: Restore the hidden bit in F1 and in F2

– Step 1: Align fractions by right shifting F2 by E1 − E2 positions (assuming E1 ≥ E2), keeping track of (three of) the bits shifted out in a guard bit, a round bit, and a sticky bit

– Step 2: Add the resulting F2 to F1 to form F3

– Step 3: Normalize F3 (so it is in the form 1.XXXXX…)

• If F1 and F2 have the same sign, then F3 ∈ [1, 4), so at most a 1-bit right shift of F3 (with an increment of E3) is needed

• If F1 and F2 have different signs, F3 may require many left shifts, each time decrementing E3

– Step 4: Round F3 and possibly normalize F3 again

– Step 5: Rehide the most significant bit of F3 before storing the result


Floating point addition

[flowchart] Start → 1. Compare the exponents of the two numbers; shift the smaller number to the right until its exponent would match the larger exponent → 2. Add the significands → 3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent; on overflow or underflow, raise an exception → 4. Round the significand to the appropriate number of bits; if the result is no longer normalized, return to step 3, otherwise done.

[hardware figure: a small ALU computes the exponent difference, which controls a right shifter that aligns the smaller fraction; a big ALU adds the fractions; shift left/right plus increment/decrement logic normalizes the result; rounding hardware produces the final sign, exponent, and fraction]


MIPS Floating Point Instructions

• MIPS has a separate floating-point register file ($f0, $f1, …, $f31), whose registers are used in pairs for double-precision values, with special instructions to load to and store from them

lwc1 $f1,54($s2) # $f1 = Memory[$s2+54]

swc1 $f1,58($s4) # Memory[$s4+58] = $f1

• And it supports IEEE 754 single precision

add.s $f2,$f4,$f6 # $f2 = $f4 + $f6

and double precision operations

add.d $f2,$f4,$f6 # $f2||$f3 = $f4||$f5 + $f6||$f7

similarly for sub.s, sub.d, mul.s, mul.d, div.s, div.d


MIPS Floating Point Instructions, Con’t

• And floating point single precision comparison operations

c.x.s $f2,$f4 # if($f2 < $f4) cond=1; else cond=0

where x may be eq, neq, lt, le, gt, ge

and branch operations

bc1t 25 # if(cond==1) go to PC+4+100

bc1f 25 # if(cond==0) go to PC+4+100

• And double precision comparison operations

c.x.d $f2,$f4 # if($f2||$f3 < $f4||$f5) cond=1; else cond=0


Floating Point Complexities

• Operations are somewhat more complicated (see text)

• In addition to overflow we can have “underflow”

• Accuracy can be a big problem

– IEEE 754 keeps two extra bits, guard and round

– four rounding modes

– positive divided by zero yields “infinity”

– zero divide by zero yields “not a number”

– other complexities

• Implementing the standard can be tricky

• Not using the standard can be even worse

– see text for description of 80x86 and Pentium bug!


Chapter Three Summary

• Computer arithmetic is constrained by limited precision

• Bit patterns have no inherent meaning but standards do exist

– two’s complement

– IEEE 754 floating point

• Computer instructions determine “meaning” of the bit patterns

• Performance and accuracy are important so there are many complexities in real machines

• Algorithm choice is important and may lead to hardware optimizations for both space and time (e.g., multiplication)

• You may want to look back (Section 3.10 is great reading!)


Chapter 4


• Measure, Report, and Summarize

• Make intelligent choices

• See through the marketing hype

• Key to understanding underlying organizational motivation

Why is some hardware better than others for different programs?

What factors of system performance are hardware related?(e.g., Do we need a new machine, or a new operating system?)

How does the machine's instruction set affect performance?

Performance


Which of these airplanes has the best performance?

Airplane Passengers Range (mi) Speed (mph)

Boeing 737-100: 101 passengers, 630 mi, 598 mph
Boeing 747: 470 passengers, 4150 mi, 610 mph
BAC/Sud Concorde: 132 passengers, 4000 mi, 1350 mph
Douglas DC-8-50: 146 passengers, 8720 mi, 544 mph

•How much faster is the Concorde compared to the 747?

•How much bigger is the 747 than the Douglas DC-8?


• Response Time (latency)

— How long does it take for my job to run?

— How long does it take to execute a job?

— How long must I wait for the database query?

• Throughput

— How many jobs can the machine run at once?

— What is the average execution rate?

— How much work is getting done?

• If we upgrade a machine with a new processor what do we increase?

• If we add a new machine to the lab what do we increase?

Computer Performance: TIME, TIME, TIME


• Elapsed Time

– counts everything (disk and memory accesses, I/O , etc.)

– a useful number, but often not good for comparison purposes

• CPU time

– doesn't count I/O or time spent running other programs

– can be broken up into system time, and user time

• Our focus: user CPU time

– time spent executing the lines of code that are "in" our program

Execution Time


• For some program running on machine X,

PerformanceX = 1 / Execution timeX

• "X is n times faster than Y"

PerformanceX / PerformanceY = n

• Problem:

– machine A runs a program in 20 seconds

– machine B runs the same program in 25 seconds

Book's Definition of Performance


Clock Cycles

• Instead of reporting execution time in seconds, we often use cycles

• Clock “ticks” indicate when to start activities (one abstraction):

• cycle time = time between ticks = seconds per cycle

• clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec)

execution time: seconds/program = cycles/program × seconds/cycle

A 4 GHz clock has a cycle time of 1/(4 × 10^9) sec = 250 × 10^−12 sec = 250 picoseconds (ps)


So, to improve performance (everything else being equal) you can either

(increase or decrease?)

________ the # of required cycles for a program, or

________ the clock cycle time or, said another way,

________ the clock rate.

How to Improve Performance

seconds/program = cycles/program × seconds/cycle


• Could assume that number of cycles equals number of instructions

This assumption is incorrect: different instructions take different amounts of time on different machines.

Why? hint: remember that these are machine instructions, not lines of C code

[figure: a time axis divided into equal clock cycles, with the 1st, 2nd, 3rd, 4th, 5th, 6th, … instructions laid out one per cycle]

How many cycles are required for a program?


• Multiplication takes more time than addition

• Floating point operations take longer than integer ones

• Accessing memory takes more time than accessing registers

• Important point: changing the cycle time often changes the number of cycles required for various instructions (more later)


Different numbers of cycles for different instructions


• Our favorite program runs in 10 seconds on computer A, which has a 4 GHz clock. We are trying to help a computer designer build a new machine B that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target?

• Don't panic; we can easily work this out from basic principles

Example


• A given program will require

– some number of instructions (machine instructions)

– some number of cycles

– some number of seconds

• We have a vocabulary that relates these quantities:

– cycle time (seconds per cycle)

– clock rate (cycles per second)

– CPI (cycles per instruction)

a floating point intensive application might have a higher CPI

– MIPS (millions of instructions per second)

this would be higher for a program using simple instructions

Now that we understand cycles


Performance

• Performance is determined by execution time

• Do any of the other variables equal performance?

– # of cycles to execute program?

– # of instructions in program?

– # of cycles per second?

– average # of cycles per instruction?

– average # of instructions per second?

• Common pitfall: thinking one of the variables is indicative of performance when it really isn’t.


• Suppose we have two implementations of the same instruction set architecture (ISA).

For some program,

Machine A has a clock cycle time of 250 ps and a CPI of 2.0
Machine B has a clock cycle time of 500 ps and a CPI of 1.2

What machine is faster for this program, and by how much?

• If two machines have the same ISA which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical?

CPI Example


• A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively).

The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.
The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.

Which sequence will be faster? By how much? What is the CPI for each sequence?

# of Instructions Example


• Two different compilers are being tested for a 4 GHz. machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software.

The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.

The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.

• Which sequence will be faster according to MIPS?

• Which sequence will be faster according to execution time?

MIPS example


• Performance best determined by running a real application

– Use programs typical of expected workload

– Or, typical of expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.

• Small benchmarks

– nice for architects and designers

– easy to standardize

– can be abused

• SPEC (System Performance Evaluation Cooperative)

– companies have agreed on a set of real programs and inputs

– valuable indicator of performance (and compiler technology)

– can still be abused

Benchmarks


Benchmark Games

• An embarrassed Intel Corp. acknowledged Friday that a bug in a software program known as a compiler had led the company to overstate the speed of its microprocessor chips on an industry benchmark by 10 percent. However, industry analysts said the coding error…was a sad commentary on a common industry practice of “cheating” on standardized performance tests…The error was pointed out to Intel two days ago by a competitor, Motorola…came in a test known as SPECint92…Intel acknowledged that it had “optimized” its compiler to improve its test scores. The company had also said that it did not like the practice but felt compelled to make the optimizations because its competitors were doing the same thing…At the heart of Intel’s problem is the practice of “tuning” compiler programs to recognize certain computing problems in the test and then substituting special handwritten pieces of code…

Saturday, January 6, 1996 New York Times


SPEC ‘89

• Compiler “enhancements” and performance

[bar chart: SPEC performance ratio (0 to 800) for the benchmarks gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, and tomcatv, comparing each benchmark's ratio with the base compiler against an enhanced compiler]


SPEC CPU2000


SPEC 2000

Does doubling the clock rate double the performance?

Can a machine with a slower clock rate have better performance?

[chart: SPECINT2000 and SPECFP2000 ratings (0 to 1400) versus clock rate (500 to 3500 MHz) for the Pentium III and Pentium 4]

[chart: relative performance (0.0 to 1.6) on SPECINT2000 and SPECFP2000 for the Pentium M @ 1.6/0.6 GHz, Pentium 4-M @ 2.4/1.2 GHz, and Pentium III-M @ 1.2/0.8 GHz under three power modes: always on/maximum clock, laptop mode/adaptive clock, and minimum power/minimum clock]


Experiment

• Phone a major computer retailer and tell them you are having trouble deciding between two different computers; specifically, you are confused about the processors' strengths and weaknesses

(e.g., Pentium 4 at 2 GHz vs. Celeron M at 1.4 GHz)

• What kind of response are you likely to get?

• What kind of response could you give a friend with the same question?


Execution Time After Improvement =

Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)

• Example:

"Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"

How about making it 5 times faster?

• Principle: Make the common case fast

Amdahl's Law


• Suppose we enhance a machine making all floating-point instructions run five times faster. If the execution time of some benchmark before the floating-point enhancement is 10 seconds, what will the speedup be if half of the 10 seconds is spent executing floating-point instructions?

• We are looking for a benchmark to show off the new floating-point unit described above, and want the overall benchmark to show a speedup of 3. One benchmark we are considering runs for 100 seconds with the old floating-point hardware. How much of the execution time would floating-point instructions have to account for in this program in order to yield our desired speedup on this benchmark?

Example


• Performance is specific to a particular program(s)

– Total execution time is a consistent summary of performance

• For a given architecture performance increases come from:

– increases in clock rate (without adverse CPI effects)

– improvements in processor organization that lower CPI

– compiler enhancements that lower CPI and/or instruction count

– algorithm/language choices that affect instruction count

• Pitfall: expecting improvement in one aspect of a machine's performance to affect the total performance

Remember


Performance Metrics

• Purchasing perspective

– given a collection of machines, which has the

• best performance?
• least cost?
• best cost/performance?

• Design perspective

– faced with design options, which has the

• best performance improvement?
• least cost?
• best cost/performance?

• Both require

– basis for comparison
– metric for evaluation

• Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors


Defining (Speed) Performance

• Normally interested in reducing

– Response time (aka execution time) – the time between the start and the completion of a task

• Important to individual users

– Thus, to maximize performance, need to minimize execution time

– Throughput – the total amount of work done in a given time

• Important to data center managers

– Decreasing response time almost always improves throughput

performance_X = 1 / execution_time_X

If X is n times faster than Y, then

performance_X / performance_Y = execution_time_Y / execution_time_X = n


Performance Factors

• Want to distinguish elapsed time and the time spent on our task

• CPU execution time (CPU time) – time the CPU spends working on a task

– Does not include time waiting for I/O or running other programs

CPU execution time for a program = # CPU clock cycles for a program × clock cycle time

or

CPU execution time for a program = # CPU clock cycles for a program / clock rate

• Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program


Review: Machine Clock Rate

• Clock rate (MHz, GHz) is inverse of clock cycle time (clock period)

CC = 1 / CR


10 nsec clock cycle => 100 MHz clock rate

5 nsec clock cycle => 200 MHz clock rate

2 nsec clock cycle => 500 MHz clock rate

1 nsec clock cycle => 1 GHz clock rate

500 psec clock cycle => 2 GHz clock rate

250 psec clock cycle => 4 GHz clock rate

200 psec clock cycle => 5 GHz clock rate


Clock Cycles per Instruction

• Not all instructions take the same amount of time to execute

– One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction

• Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute

– A way to compare two different implementations of the same ISA

# CPU clock cycles for a program = # Instructions for a program × average clock cycles per instruction

Instruction class:    A   B   C
CPI for this class:   1   2   3


Effective CPI

• Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging

Overall effective CPI = Σ (CPI_i × IC_i), summed for i = 1 to n

– where IC_i is the count (percentage) of the number of instructions of class i executed

– CPI_i is the (average) number of clock cycles per instruction for that instruction class

– n is the number of instruction classes

• The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs


THE Performance Equation

• Our basic performance equation is then

CPU time = Instruction_count × CPI × clock_cycle

or

CPU time = Instruction_count × CPI / clock_rate

• These equations separate the three key factors that affect performance

– Can measure the CPU execution time by running the program

– The clock rate is usually given

– Can measure overall instruction count by using profilers/ simulators without knowing all of the implementation details

– CPI varies by instruction type and ISA implementation for which we must know the implementation details


Determinants of CPU Performance

CPU time = Instruction_count × CPI × clock_cycle

                         Instruction_count   CPI   clock_cycle
Algorithm                X                   X
Programming language     X                   X
Compiler                 X                   X
ISA                      X                   X     X
Processor organization                       X     X
Technology                                         X


A Simple Example

Op       Freq   CPI_i   Freq × CPI_i
ALU      50%    1       .5
Load     20%    5       1.0
Store    10%    3       .3
Branch   20%    2       .4
                        Σ = 2.2

• How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?

Load becomes 20% × 2 = .4, so CPI = 1.6; CPU time new = 1.6 × IC × CC, and 2.2/1.6 means 37.5% faster

• How does this compare with using branch prediction to shave a cycle off the branch time?

Branch becomes 20% × 1 = .2, so CPI = 2.0; CPU time new = 2.0 × IC × CC, and 2.2/2.0 means 10% faster

• What if two ALU instructions could be executed at once?

ALU becomes 50% × 0.5 = .25, so CPI = 1.95; CPU time new = 1.95 × IC × CC, and 2.2/1.95 means 12.8% faster


Comparing and Summarizing Performance

• Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))

• How do we summarize the performance for benchmark set with a single number?

– The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)

AM = (1/n) × Σ Time_i, summed for i = 1 to n

– where Time_i is the execution time for the ith program of a total of n programs in the workload

– A smaller mean indicates a smaller average execution time and thus improved performance


SPEC Benchmarks www.spec.org

Integer benchmarks FP benchmarks

gzip compression wupwise Quantum chromodynamics

vpr FPGA place & route swim Shallow water model

gcc GNU C compiler mgrid Multigrid solver in 3D fields

mcf Combinatorial optimization applu Parabolic/elliptic pde

crafty Chess program mesa 3D graphics library

parser Word processing program galgel Computational fluid dynamics

eon Computer visualization art Image recognition (NN)

perlbmk perl application equake Seismic wave propagation simulation

gap Group theory interpreter facerec Facial image recognition

vortex Object oriented database ammp Computational chemistry

bzip2 compression lucas Primality testing

twolf Circuit place & route fma3d Crash simulation fem

sixtrack Nuclear physics accelerator design

apsi Pollutant distribution


Example SPEC Ratings


Other Performance Metrics

• Power consumption – especially important in the embedded market, where battery life matters (and cooling is passive)

– For power-limited applications, the most important metric is energy efficiency


Summary: Evaluating ISAs

• Design-time metrics:

– Can it be implemented, in how long, at what cost?
– Can it be programmed? Ease of compilation?

• Static metrics:

– How many bytes does the program occupy in memory?

• Dynamic metrics:

– How many instructions are executed? How many bytes does the processor fetch to execute the program?
– How many clocks are required per instruction?
– How "lean" a clock is practical?

• Best metric: time to execute the program! Time = Inst. Count × CPI × Cycle Time, which depends on the instruction set, the processor organization, and compilation techniques.


Chapter Five


Let's Build a Processor

• Almost ready to move into chapter 5 and start building a processor

• First, let's review Boolean logic and build the ALU we'll need (material from Appendix B)

[figure: ALU block with 32-bit inputs a and b, an operation control input, and a 32-bit result output]


• Problem: Consider a logic function with three inputs: A, B, and C.

Output D is true if at least one input is true
Output E is true if exactly two inputs are true
Output F is true only if all three inputs are true

• Show the truth table for these three functions.

• Show the Boolean equations for these three functions.

• Show an implementation consisting of inverters, AND, and OR gates.

Review: Boolean Algebra & Gates
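One way to check a hand-built truth table for D, E, and F is to enumerate all eight input combinations (this checks the *table*, not the gate implementation the exercise asks for):

```python
from itertools import product

# Enumerate all 8 input combinations and tabulate the three outputs.
rows = []
for a, b, c in product([0, 1], repeat=3):
    n = a + b + c                # number of true inputs
    d = int(n >= 1)              # D: at least one input true
    e = int(n == 2)              # E: exactly two inputs true
    f = int(n == 3)              # F: all three inputs true
    rows.append((a, b, c, d, e, f))
    print(a, b, c, d, e, f)
```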

Slide 159

An ALU (arithmetic logic unit)

• Let's build an ALU to support the andi and ori instructions

  – we'll just build a 1-bit ALU, and use 32 of them

• Possible implementation (sum-of-products):

[Figure: a 1-bit ALU with inputs a and b, an operation select, and a result output, alongside its op/a/b/result truth table]

Slide 160

Review: The Multiplexor

• Selects one of the inputs to be the output, based on a control input

• Let's build our ALU using a MUX:

[Figure: a 2-input multiplexor — data inputs A and B, select input S, output C]

note: we call this a 2-input mux even though it has 3 inputs!

Slide 161

Different Implementations

• Not easy to decide the "best" way to build something

  – Don't want too many inputs to a single gate
  – Don't want to have to go through too many gates
  – for our purposes, ease of comprehension is important

• Let's look at a 1-bit ALU for addition (a full adder):

  cout = a·b + a·cin + b·cin
  sum  = a xor b xor cin

• How could we build a 1-bit ALU for add, and, and or?

• How could we build a 32-bit ALU?

[Figure: a 1-bit full adder with inputs a, b, CarryIn and outputs Sum, CarryOut]
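The two equations above can be checked directly in Python (`full_adder` is my own name for this sketch):

```python
def full_adder(a, b, cin):
    """One-bit full adder, straight from the slide's equations:
    sum = a xor b xor cin;  cout = a*b + a*cin + b*cin."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout
```

For example, 1 + 1 with no carry-in gives sum 0 and carry-out 1.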

Slide 162

Building a 32 bit ALU

[Figure: one 1-bit ALU cell (inputs a, b, CarryIn, Operation; outputs Result, CarryOut) and a 32-bit ripple ALU built by chaining 32 cells ALU0..ALU31, each cell's CarryOut feeding the next cell's CarryIn]
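A behavioral sketch of the ripple structure — a 1-bit cell replicated 32 times, with the carry threaded from cell to cell (function names and the operation encoding 0 = AND, 1 = OR, 2 = ADD are my own, not the book's):

```python
def alu_1bit(a, b, cin, op):
    """One ALU cell: op 0 = AND, 1 = OR, 2 = ADD. Returns (result, carry_out)."""
    if op == 0:
        return a & b, 0
    if op == 1:
        return a | b, 0
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)   # full-adder equations
    return s, cout

def alu_32bit(a, b, op, width=32):
    """Chain `width` 1-bit cells; each CarryOut ripples into the next CarryIn."""
    result, carry = 0, 0
    for i in range(width):
        bit, carry = alu_1bit((a >> i) & 1, (b >> i) & 1, carry, op)
        result |= bit << i
    return result
```

The sequential loop mirrors the hardware's weakness: bit i cannot finish until bit i−1's carry is known.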

Slide 163

What about subtraction (a – b) ?

• Two's complement approach: just negate b and add.

• How do we negate?

• A very clever solution:

[Figure: the 1-bit ALU cell extended with a Binvert control — a mux selects either b or its complement before the adder; setting Binvert = 1 and CarryIn = 1 computes a + ~b + 1 = a – b]
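The "clever solution" in one line of Python — invert b bitwise and add with a carry-in of 1 (masking to 32 bits stands in for the fixed register width):

```python
MASK = 0xFFFFFFFF  # 32-bit register width

def subtract(a, b):
    """a - b in two's complement: a + (bitwise NOT of b) + 1."""
    return (a + (~b & MASK) + 1) & MASK
```

Note that 5 − 7 comes out as 0xFFFFFFFE, the 32-bit two's-complement encoding of −2.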

Slide 164

Adding a NOR function

• Can also choose to invert a. How do we get "a NOR b" ?

[Figure: the ALU cell with both Ainvert and Binvert muxes in front of the AND/OR/adder logic — by De Morgan's law, (NOT a) AND (NOT b) = a NOR b]

Slide 165

Tailoring the ALU to the MIPS

• Need to support the set-on-less-than instruction (slt)

  – remember: slt is an arithmetic instruction
  – produces a 1 if rs < rt and 0 otherwise
  – use subtraction: (a – b) < 0 implies a < b

• Need to support test for equality (beq $t5, $t6, Label)

  – use subtraction: (a – b) = 0 implies a = b

Slide 166

Supporting slt

• Can we figure out the idea?

[Figure: two versions of the 1-bit ALU cell, each with an extra Less input routed to mux position 3 of the Result mux; the most-significant-bit cell also has overflow detection and a Set output taken from the adder's sum]

Slide 167

Supporting slt

[Figure: the 32-bit ALU with the sign bit's Set output (from ALU31) wired back to ALU0's Less input; the Less inputs of ALU1..ALU31 are tied to 0, and ALU31 also produces the Overflow signal]

Slide 168

Test for equality

• Notice control lines:

  0000 = and
  0001 = or
  0010 = add
  0110 = subtract
  0111 = slt
  1100 = NOR

• Note: Zero is a 1 when the result is zero!

[Figure: the 32-bit ALU with Ainvert, Bnegate, and Operation control lines, the Overflow and Set signals, and a Zero output computed by NORing all 32 result bits]
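Putting the control lines together: the 4-bit code decomposes as bit 3 = Ainvert, bit 2 = Bnegate, bits 1:0 = operation. A behavior-level sketch (my own function name; not the book's gate-level design — in particular slt here ignores overflow):

```python
def alu(a, b, control, width=32):
    """ALU following the slide's 4-bit control encoding:
    bit 3 = Ainvert, bit 2 = Bnegate, bits 1:0 = op (00 AND, 01 OR, 10 add, 11 slt)."""
    mask = (1 << width) - 1
    ainv, bneg, op = (control >> 3) & 1, (control >> 2) & 1, control & 0b11
    x = (~a & mask) if ainv else (a & mask)
    y = (~b & mask) if bneg else (b & mask)
    if op == 0b00:
        result = x & y                       # 0000 and, 1100 NOR (both inverted)
    elif op == 0b01:
        result = x | y
    else:
        s = (x + y + bneg) & mask            # Bnegate also supplies carry-in = 1
        sign = (s >> (width - 1)) & 1
        result = sign if op == 0b11 else s   # slt: the sign bit of a - b
    return result, int(result == 0)          # Zero = NOR of all result bits
```

Check the encodings against the list above: subtract is 0110 (Bnegate + add), slt is 0111, NOR is 1100 (invert both inputs, then AND).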

Slide 169

Conclusion

• We can build an ALU to support the MIPS instruction set

– key idea: use multiplexor to select the output we want

– we can efficiently perform subtraction using two’s complement

– we can replicate a 1-bit ALU to produce a 32-bit ALU

• Important points about hardware

– all of the gates are always working

– the speed of a gate is affected by the number of inputs to the gate

– the speed of a circuit is affected by the number of gates in series(on the “critical path” or the “deepest level of logic”)

• Our primary focus: comprehension; however,

  – Clever changes to organization can improve performance (similar to using better algorithms in software)
  – We saw this in multiplication; let's look at addition now

Slide 170

Problem: ripple carry adder is slow

• Is a 32-bit ALU as fast as a 1-bit ALU?

• Is there more than one way to do addition?

  – two extremes: ripple carry and sum-of-products

Can you see the ripple? How could you get rid of it?

  c1 = b0c0 + a0c0 + a0b0
  c2 = b1c1 + a1c1 + a1b1    c2 = ?
  c3 = b2c2 + a2c2 + a2b2    c3 = ?
  c4 = b3c3 + a3c3 + a3b3    c4 = ?

Not feasible! Why?

Slide 171

Carry-lookahead adder

• An approach in-between our two extremes

• Motivation:

  – If we didn't know the value of carry-in, what could we do?
  – When would we always generate a carry?   gi = ai · bi
  – When would we propagate the carry?       pi = ai + bi

• Did we get rid of the ripple?

  c1 = g0 + p0c0
  c2 = g1 + p1c1    c2 = g1 + p1g0 + p1p0c0
  c3 = g2 + p2c2    c3 = g2 + p2g1 + p2p1g0 + p2p1p0c0
  c4 = g3 + p3c3    c4 = g3 + p3g2 + p3p2g1 + p3p2p1g0 + p3p2p1p0c0

Feasible! Why?
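The recurrence ci+1 = gi + pi·ci can be checked against ordinary addition. This sketch evaluates it sequentially for clarity; in hardware, each expanded equation above is flat two-level logic, so no carry ripples through an adder cell:

```python
def cla_carries(a, b, c0=0):
    """4-bit carry lookahead: gi = ai AND bi (generate), pi = ai OR bi
    (propagate), and c[i+1] = gi + pi*ci. Returns [c0, c1, c2, c3, c4]."""
    c = [c0]
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g, p = ai & bi, ai | bi
        c.append(g | (p & c[i]))
    return c
```

For 1011 + 0110, the carries into bits 1..4 are 0, 1, 1, 1 — and c4 matches the carry-out of the real sum (11 + 6 = 17 = 10001).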

Slide 172

Use principle to build bigger adders

• Can't build a 16 bit adder this way... (too big)

• Could use ripple carry of 4-bit CLA adders

• Better: use the CLA principle again!

[Figure: a 16-bit adder made of four 4-bit CLA ALUs (ALU0..ALU3), each exporting block Propagate (P0..P3) and Generate (G0..G3) signals to a second-level carry-lookahead unit that produces C1..C4 and the CarryOut]

Slide 173

ALU Summary

• We can build an ALU to support MIPS addition

• Our focus is on comprehension, not performance

• Real processors use more sophisticated techniques for arithmetic

• Where performance is not critical, hardware description languages allow designers to completely automate the creation of hardware!

Slide 174

Chapter Five

Slide 175

The Processor: Datapath & Control

• We're ready to look at an implementation of the MIPS

• Simplified to contain only:

  – memory-reference instructions: lw, sw
  – arithmetic-logical instructions: add, sub, and, or, slt
  – control flow instructions: beq, j

• Generic implementation:

  – use the program counter (PC) to supply the instruction address
  – get the instruction from memory
  – read registers
  – use the instruction to decide exactly what to do

• All instructions use the ALU after reading the registers

  Why? memory-reference? arithmetic? control flow?

Slide 176

More Implementation Details

• Abstract / simplified view:

[Figure: high-level datapath — the PC feeds the instruction memory's address; the fetched instruction supplies register numbers to the Registers block, which feeds the ALU; the ALU output addresses data memory; two adders update the PC (PC + 4 and the branch target)]

• Two types of functional units:

  – elements that operate on data values (combinational)
  – elements that contain state (sequential)

Slide 177

State Elements

• Unclocked vs. clocked

• Clocks used in synchronous logic

  – when should an element that contains state be updated?

[Figure: a clock waveform annotated with the clock period, rising edge, and falling edge]

Slide 178

An unclocked state element

• The set-reset latch

  – output depends on present inputs and also on past inputs

[Figure: an S-R latch built from two cross-coupled NOR gates, with inputs R and S and complementary outputs Q and Q̄]

Slide 179

Latches and Flip-flops

• Output is equal to the stored value inside the element (don't need to ask for permission to look at the value)

• Change of state (value) is based on the clock

• Latches: whenever the inputs change, and the clock is asserted

• Flip-flop: state changes only on a clock edge (edge-triggered methodology)

An "asserted" signal means logically true — which could mean electrically low or high, depending on convention.

A clocking methodology defines when signals can be read and written — wouldn't want to read a signal at the same time it was being written.

Slide 180

D-latch

• Two inputs:

  – the data value to be stored (D)
  – the clock signal (C) indicating when to read & store D

• Two outputs:

  – the value of the internal state (Q) and its complement

[Figure: the D-latch gate-level circuit and a timing diagram showing Q following D while C is asserted]

Slide 181

D flip-flop

• Output changes only on the clock edge

[Figure: a D flip-flop built from two D-latches in series, clocked on opposite phases, so Q updates only at one clock edge]
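A behavioral sketch of both elements (class names are my own). The flip-flop is a master-slave pair of latches on opposite clock phases; with this wiring the output updates on the falling edge:

```python
class DLatch:
    """Level-sensitive: while c (clock) is asserted, q follows d; else q holds."""
    def __init__(self):
        self.q = 0
    def step(self, d, c):
        if c:
            self.q = d
        return self.q

class DFlipFlop:
    """Master-slave pair on opposite clock phases: q changes only on a clock edge."""
    def __init__(self):
        self.master, self.slave = DLatch(), DLatch()
    def step(self, d, c):
        self.master.step(d, c)                        # master open while clock asserted
        return self.slave.step(self.master.q, 1 - c)  # slave open while deasserted
```

Driving D high while the clock is high does not change Q; Q picks up the new value only when the clock falls.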

Slide 182

Our Implementation

• An edge-triggered methodology

• Typical execution:

  – read contents of some state elements,
  – send values through some combinational logic
  – write results to one or more state elements

[Figure: state element 1 → combinational logic → state element 2, all updated once per clock cycle]

Slide 183

Register File

• Built using D flip-flops

[Figure: the register file symbol (two read-register numbers, write register, write data, Write enable, two read-data outputs) and its read-port implementation — registers 0..n–1 all feed two wide multiplexors, one per read port, selected by the read-register numbers]

Do you understand? What is the "Mux" above?

Slide 184

Abstraction

• Make sure you understand the abstractions!

• Sometimes it is easy to think you do, when you don't

[Figure: a single 32-bit 2-to-1 mux symbol (inputs A and B, output C, one Select line) and its expansion into 32 one-bit muxes, one per bit position, all sharing the same Select]

Slide 185

Register File

• Note: we still use the real clock to determine when to write

[Figure: the write port — an n-to-2^n decoder turns the write-register number into one-hot enables; each register's clock input is gated with the Write signal, and all registers share the write data]
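The two register-file slides can be summarized behaviorally: each read port is an n-to-1 mux, and the write port uses a decoder plus a write enable (class and method names are mine, not the book's):

```python
class RegisterFile:
    """Behavioral register file: two read ports, one clocked write port."""
    def __init__(self, n=32):
        self.regs = [0] * n
    def read(self, rnum1, rnum2):
        # Each read port is an n-to-1 mux over all register outputs.
        return self.regs[rnum1], self.regs[rnum2]
    def write(self, wnum, data, write_enable):
        # In hardware a decoder routes the gated clock to exactly one register.
        if write_enable:
            self.regs[wnum] = data
```

Writing with the enable deasserted leaves the register unchanged — that is the "we still use the real clock to determine when to write" point.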

Slide 186

Simple Implementation

• Include the functional units we need for each instruction

  Why do we need this stuff?

[Figure: the building blocks — (a) instruction memory, (b) program counter, (c) adder; the register file (two read ports, one write port, RegWrite) beside the ALU (4-bit ALU operation input, Zero output); the data memory unit (MemRead, MemWrite) and the 16-to-32-bit sign-extension unit]

Slide 187

Building the Datapath

• Use multiplexors to stitch them together

[Figure: the complete single-cycle datapath — PC, instruction memory, register file, sign extend, ALU, data memory, and the branch adder (shift left 2), with control signals RegWrite, ALUSrc, MemRead, MemWrite, MemtoReg, and PCSrc selecting the mux inputs]

Slide 188

Control

• Selecting the operations to perform (ALU, read/write, etc.)

• Controlling the flow of data (multiplexor inputs)

• Information comes from the 32 bits of the instruction

• Example:

  add $8, $17, $18

  Instruction format:

  000000 10001 10010 01000 00000 100000
    op     rs    rt    rd  shamt funct

• ALU's operation based on instruction type and function code
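Extracting those six fields is just shifting and masking. A small sketch (`decode_rformat` is my own name), using the add instruction from the slide:

```python
def decode_rformat(instr):
    """Slice the six R-format fields out of a 32-bit instruction word."""
    return {
        "op":    (instr >> 26) & 0x3F,   # bits 31:26
        "rs":    (instr >> 21) & 0x1F,   # bits 25:21
        "rt":    (instr >> 16) & 0x1F,   # bits 20:16
        "rd":    (instr >> 11) & 0x1F,   # bits 15:11
        "shamt": (instr >> 6)  & 0x1F,   # bits 10:6
        "funct": instr & 0x3F,           # bits 5:0
    }

# add $8, $17, $18 from the slide:
word = 0b000000_10001_10010_01000_00000_100000
print(decode_rformat(word))  # rs=17, rt=18, rd=8, funct=0b100000
```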

Slide 189

Control

• e.g., what should the ALU do with this instruction?

• Example: lw $1, 100($2)

  35    2    1       100
  op   rs   rt   16 bit offset

• ALU control input:

  0000  AND
  0001  OR
  0010  add
  0110  subtract
  0111  set-on-less-than
  1100  NOR

• Why is the code for subtract 0110 and not 0011?

Slide 190

Control

• Must describe hardware to compute the 4-bit ALU control input, given:

  – instruction type (ALUOp): 00 = lw/sw, 01 = beq, 10 = arithmetic
  – function code (for arithmetic instructions)

• Describe it using a truth table (can turn into gates):

  ALUOp is computed from the instruction type
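The same mapping as a lookup sketch (the funct encodings are the standard MIPS ones; the function name is mine):

```python
def alu_control(alu_op, funct):
    """4-bit ALU control from ALUOp and the funct field.
    ALUOp: 0b00 = lw/sw (add), 0b01 = beq (subtract), 0b10 = R-format."""
    if alu_op == 0b00:
        return 0b0010                      # add: address calculation
    if alu_op == 0b01:
        return 0b0110                      # subtract: compare for beq
    return {                               # R-format: decode the funct field
        0b100000: 0b0010,  # add
        0b100010: 0b0110,  # sub
        0b100100: 0b0000,  # and
        0b100101: 0b0001,  # or
        0b101010: 0b0111,  # slt
    }[funct & 0x3F]
```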

Slide 191

Control signal settings per instruction class:

  Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
  R-format    |   1    |   0    |    0     |    1     |    0    |    0     |   0    |   1    |   0
  lw          |   0    |   1    |    1     |    1     |    1    |    0     |   0    |   0    |   0
  sw          |   X    |   1    |    X     |    0     |    0    |    1     |   0    |   0    |   0
  beq         |   X    |   0    |    X     |    0     |    0    |    0     |   1    |   0    |   1

[Figure: the single-cycle datapath with the main control unit — Instruction[31–26] drives Control, which produces RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; Instruction[5–0] drives the ALU control block]
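The main-control truth table is small enough to write out directly (a behavioral sketch; `None` stands in for the X don't-cares, and the opcode values are the standard MIPS encodings):

```python
# One row of signal settings per instruction class, keyed by opcode:
# (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp)
CONTROL = {
    0b000000: (1, 0, 0, 1, 0, 0, 0, 0b10),          # R-format
    0b100011: (0, 1, 1, 1, 1, 0, 0, 0b00),          # lw
    0b101011: (None, 1, None, 0, 0, 1, 0, 0b00),    # sw  (None = don't care)
    0b000100: (None, 0, None, 0, 0, 0, 1, 0b01),    # beq
}

def main_control(opcode):
    return CONTROL[opcode]
```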

Slide 192

Control

• Simple combinational logic (truth tables)

[Figures: the ALU control block — outputs Operation2:0 computed from ALUOp1:0 and F3:0 of the funct field F(5–0) — and the main control truth table, with inputs Op5:0 decoded into R-format, lw, sw, and beq columns and outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]

Slide 193

Our Simple Control Structure

• All of the logic is combinational

• We wait for everything to settle down, and the right thing to be done

  – ALU might not produce the "right answer" right away
  – we use write signals along with the clock to determine when to write

• Cycle time determined by length of the longest path

  We are ignoring some details like setup and hold times

[Figure: state element 1 → combinational logic → state element 2, one clock cycle]

Slide 194

Single Cycle Implementation

• Calculate cycle time assuming negligible delays except:

  – memory (200ps), ALU and adders (100ps), register file access (50ps)

[Figure: the complete single-cycle datapath again, for tracing each instruction's critical path]
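Tracing the critical path per instruction class under the slide's delay assumptions gives the standard result: lw is the longest path, so it sets the single-cycle clock period (the path breakdowns below follow the usual textbook analysis):

```python
MEM, ALU, REG = 200, 100, 50   # delays in ps, from the slide

paths = {
    # fetch + reg read + ALU (+ data memory) (+ reg write)
    "R-format": MEM + REG + ALU + REG,        # 400 ps
    "lw":       MEM + REG + ALU + MEM + REG,  # 600 ps
    "sw":       MEM + REG + ALU + MEM,        # 550 ps
    "beq":      MEM + REG + ALU,              # 350 ps
}
cycle_time = max(paths.values())  # 600 ps: every instruction pays lw's cost
```

This is exactly the argument for the multicycle design on the next slides: in a single-cycle machine, the fastest instruction is held hostage by the slowest.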

Slide 195

Where we are headed

• Single cycle problems:

  – what if we had a more complicated instruction like floating point?
  – wasteful of area

• One solution:

  – use a "smaller" cycle time
  – have different instructions take different numbers of cycles
  – a "multicycle" datapath:

[Figure: the multicycle datapath — one shared memory for instructions and data, one ALU, and added internal registers: Instruction register, Memory data register, A, B, and ALUOut]

Slide 196

• We will be reusing functional units

– ALU used to compute address and to increment PC

– Memory used for instruction and data

• Our control signals will not be determined directly by instruction

– e.g., what should the ALU do for a “subtract” instruction?

• We’ll use a finite state machine for control

Multicycle Approach

Slide 197

Multicycle Approach

• Break up the instructions into steps, each step takes a cycle

  – balance the amount of work to be done
  – restrict each cycle to use only one major functional unit

• At the end of a cycle

  – store values for use in later cycles (easiest thing to do)
  – introduce additional "internal" registers

[Figure: the multicycle datapath in detail — PC, shared Memory, Instruction register, Memory data register, register file, sign extend, shift left 2, and the internal registers A, B, and ALUOut, with muxes selecting the ALU's operands each cycle]

Slide 198

Instructions from ISA perspective

• Consider each instruction from the perspective of the ISA.

• Example:

  – The add instruction changes a register.
  – Register specified by bits 15:11 of the instruction.
  – Instruction specified by the PC.
  – New value is the sum ("op") of two registers.
  – Registers specified by bits 25:21 and 20:16 of the instruction:

    Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]

  – In order to accomplish this we must break up the instruction. (kind of like introducing variables when programming)

Slide 199

Breaking down an instruction

• ISA definition of arithmetic:

  Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]

• Could break down to:

  – IR <= Memory[PC]
  – A <= Reg[IR[25:21]]
  – B <= Reg[IR[20:16]]
  – ALUOut <= A op B
  – Reg[IR[15:11]] <= ALUOut

• We forgot an important part of the definition of arithmetic!

  – PC <= PC + 4

Slide 200

Idea behind multicycle approach

• We define each instruction from the ISA perspective (do this!)

• Break it down into steps following our rule that data flows through at most one major functional unit (e.g., balance work across steps)

• Introduce new registers as needed (e.g., A, B, ALUOut, MDR, etc.)

• Finally, try to pack as much work into each step as possible (avoid unnecessary cycles) while also trying to share steps where possible (minimizes control, helps to simplify the solution)

• Result: our book's multicycle implementation!

Slide 201

Five Execution Steps

• Instruction Fetch

• Instruction Decode and Register Fetch

• Execution, Memory Address Computation, or Branch Completion

• Memory Access or R-type instruction completion

• Write-back step

INSTRUCTIONS TAKE FROM 3 – 5 CYCLES!

Slide 202

Step 1: Instruction Fetch

• Use PC to get instruction and put it in the Instruction Register.

• Increment the PC by 4 and put the result back in the PC.

• Can be described succinctly using RTL ("Register-Transfer Language"):

  IR <= Memory[PC];
  PC <= PC + 4;

Can we figure out the values of the control signals?

What is the advantage of updating the PC now?

Slide 203

Step 2: Instruction Decode and Register Fetch

• Read registers rs and rt in case we need them

• Compute the branch address in case the instruction is a branch

• RTL:

  A <= Reg[IR[25:21]];
  B <= Reg[IR[20:16]];
  ALUOut <= PC + (sign-extend(IR[15:0]) << 2);

• We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)

Slide 204

Step 3 (instruction dependent)

• ALU is performing one of three functions, based on instruction type

• Memory reference:

  ALUOut <= A + sign-extend(IR[15:0]);

• R-type:

  ALUOut <= A op B;

• Branch:

  if (A == B) PC <= ALUOut;

Slide 205

Step 4 (R-type or memory-access)

• Loads and stores access memory

  MDR <= Memory[ALUOut];
    or
  Memory[ALUOut] <= B;

• R-type instructions finish

  Reg[IR[15:11]] <= ALUOut;

  The write actually takes place at the end of the cycle, on the edge

Slide 206

Write-back step

• Reg[IR[20:16]] <= MDR;

  Which instruction needs this?
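The five steps can be strung together. A minimal sketch of lw flowing through them — plain Python lists stand in for memory and registers, and word-addressed memory (`mem[addr // 4]`) is an assumption of the sketch, not part of the slides:

```python
def run_lw(pc, mem, reg):
    """Trace one lw through the five multicycle steps (RTL in the comments)."""
    ir = mem[pc // 4]                     # Step 1: IR <= Memory[PC]
    pc = pc + 4                           #         PC <= PC + 4
    rs, rt = (ir >> 21) & 0x1F, (ir >> 16) & 0x1F
    imm = ir & 0xFFFF
    if imm & 0x8000:                      # sign-extend the 16-bit offset
        imm -= 0x10000
    a = reg[rs]                           # Step 2: A <= Reg[IR[25:21]]
    alu_out = a + imm                     # Step 3: ALUOut <= A + sign-extend(...)
    mdr = mem[alu_out // 4]               # Step 4: MDR <= Memory[ALUOut]
    reg[rt] = mdr                         # Step 5: Reg[IR[20:16]] <= MDR
    return pc
```

Note the write-back step is exactly the line above: only lw uses it, which answers the question on the slide.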

Slide 207

Summary:

[Table omitted: the step-by-step summary of actions — one column per instruction class, one row per execution step]

Slide 208

Simple Questions

• How many cycles will it take to execute this code?

         lw   $t2, 0($t3)
         lw   $t3, 4($t3)
         beq  $t2, $t3, Label   # assume not taken
         add  $t5, $t2, $t3
         sw   $t5, 8($t3)
  Label: ...

• What is going on during the 8th cycle of execution?

• In what cycle does the actual addition of $t2 and $t3 take place?
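The first question reduces to summing per-class cycle counts from the five-step breakdown (lw = 5, sw = 4, R-type = 4, beq = 3):

```python
# Cycles per instruction class in the multicycle design.
CYCLES = {"lw": 5, "sw": 4, "add": 4, "beq": 3}

program = ["lw", "lw", "beq", "add", "sw"]   # the code fragment above
total = sum(CYCLES[op] for op in program)    # 5 + 5 + 3 + 4 + 4 = 21 cycles
print(total)
```

A running cycle tally like this also helps with the other two questions: just note which instruction, and which of its steps, a given cycle number falls in.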

Slide 209

[Figure: the complete multicycle datapath with control — the Control unit, driven by Op[5–0], produces PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, and RegDst; a jump address [31–0] is formed from PC[31–28] concatenated with Instruction[25–0] shifted left 2]

Slide 210

Review: finite state machines

• Finite state machines:

  – a set of states and
  – next state function (determined by current state and the input)
  – output function (determined by current state and possibly input)
  – We'll use a Moore machine (output based only on current state)

[Figure: FSM block diagram — inputs and current state feed the next-state function; the output function reads only the current state; a clocked state register holds the state]

Slide 211

Review: finite state machines

• Example (exercise B.37):

  A friend would like you to build an "electronic eye" for use as a fake security device. The device consists of three lights lined up in a row, controlled by the outputs Left, Middle, and Right, which, if asserted, indicate that a light should be on. Only one light is on at a time, and the light "moves" from left to right and then from right to left, thus scaring away thieves who believe that the device is monitoring their activity. Draw the graphical representation for the finite state machine used to specify the electronic eye. Note that the rate of the eye's movement will be controlled by the clock speed (which should not be too great) and that there are essentially no inputs.
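One possible answer, sketched as a Moore machine in Python (the four state names are my own labels — note the middle light needs two distinct states, one for each direction of travel, since the output depends only on the state):

```python
# Four states, no inputs; the output (which light is on) is a function of
# the current state only - the defining property of a Moore machine.
NEXT = {"L": "M1", "M1": "R", "R": "M2", "M2": "L"}
LIGHT = {"L": "Left", "M1": "Middle", "R": "Right", "M2": "Middle"}

state = "L"
seq = []
for _ in range(6):            # six clock ticks
    seq.append(LIGHT[state])  # Moore output: read before the state updates
    state = NEXT[state]
print(seq)  # Left, Middle, Right, Middle, Left, Middle
```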

Slide 212

• Value of control signals is dependent upon:

– what instruction is being executed

– which step is being performed

• Use the information we’ve accumulated to specify a finite state machine

– specify the finite state machine graphically, or

– use microprogramming

• Implementation can be derived from specification

Implementing the Control

Slide 213

Graphical Specification of FSM

• Note:

  – don't care if not mentioned
  – asserted if name only
  – otherwise exact value

• How many state bits will we need?

[Figure: the 10-state control FSM — state 0, instruction fetch (MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00); state 1, instruction decode/register fetch (ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00); then per-instruction paths: memory address computation (state 2) leading to memory access (states 3 and 5) and a memory read completion step (state 4); R-type execution (state 6) and completion (state 7); branch completion (state 8); jump completion (state 9)]

Slide 214

Finite State Machine for Control

• Implementation:

[Figure: the control logic block — inputs are the instruction register's opcode bits Op5–Op0 and the current state bits S3–S0; outputs are the datapath control signals (PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst) plus the next-state bits NS3–NS0, which feed a clocked state register]

Slide 215

PLA Implementation

• If I picked a horizontal or vertical line could you explain it?

[Figure: the control PLA — an AND plane over inputs Op5–Op0 and S3–S0 forming product terms, and an OR plane producing each control output (IorD, IRWrite, MemRead, MemWrite, PCWrite, PCWriteCond, MemtoReg, PCSource1:0, ALUOp1:0, ALUSrcB1:0, ALUSrcA, RegWrite, RegDst) and the next-state bits NS3–NS0]

Slide 216

ROM Implementation

• ROM = "Read Only Memory"

  – values of memory locations are fixed ahead of time

• A ROM can be used to implement a truth table

  – if the address is m bits, we can address 2^m entries in the ROM
  – our outputs are the bits of data that the address points to

  m is the "height", and n is the "width"

  Example (m = 3, n = 4):

    address  data
      000    0011
      001    1100
      010    1100
      011    1000
      100    0000
      101    0001
      110    0110
      111    0111
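A ROM implementing a truth table is literally a lookup: the address indexes the stored word. The m = 3, n = 4 example above in Python:

```python
# The 2^3 stored 4-bit words, indexed by the 3-bit address.
ROM = [0b0011, 0b1100, 0b1100, 0b1000, 0b0000, 0b0001, 0b0110, 0b0111]

def rom_read(address):
    """Return the 4-bit word at a 3-bit address."""
    return ROM[address & 0b111]

print(format(rom_read(0b110), "04b"))  # 0110
```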

Page 210

• How many inputs are there?
6 bits for opcode + 4 bits for state = 10 address lines
(i.e., 2^10 = 1024 different addresses)

• How many outputs are there?
16 datapath-control outputs + 4 state bits = 20 outputs

• ROM is 2^10 x 20 = 20K bits (and a rather unusual size)

• Rather wasteful, since for lots of the entries, the outputs are the same

— i.e., opcode is often ignored

ROM Implementation

Page 211

• Break up the table into two parts

— 4 state bits tell you the 16 outputs: 2^4 x 16 bits of ROM

— 10 bits tell you the 4 next-state bits: 2^10 x 4 bits of ROM

— Total: 4.3K bits of ROM

• PLA is much smaller

— can share product terms

— only need entries that produce an active output

— can take into account don't cares

• Size is (#inputs x #product-terms) + (#outputs x #product-terms)

For this example = (10x17)+(20x17) = 510 PLA cells

• PLA cells usually about the size of a ROM cell (slightly bigger)

ROM vs PLA
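The size comparison is simple arithmetic; a quick check using the slide's numbers:

```python
# Compare control-store sizes for the multicycle control unit.
# Inputs: 6 opcode bits + 4 state bits; outputs: 16 control + 4 next-state.
inputs, outputs = 10, 20

# Single ROM: every input combination stores a full output word.
rom_bits = (2 ** inputs) * outputs
print(rom_bits)                 # 20480 bits, i.e. 20K

# Split ROM: 4 state bits -> 16 outputs, all 10 bits -> 4 next-state bits.
split_rom_bits = (2 ** 4) * 16 + (2 ** inputs) * 4
print(split_rom_bits)           # 4352 bits, i.e. about 4.3K

# PLA: size grows with product terms, not with 2^inputs.
product_terms = 17
pla_cells = (inputs + outputs) * product_terms
print(pla_cells)                # 510 cells
```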

Page 212

• Complex instructions: the "next state" is often current state + 1

Another Implementation Style

[Figure: control unit built from a PLA or ROM plus address select logic — the state register's next value is either the incremented current state (via an adder), a value dispatched from the opcode field Op[5–0] of the instruction register, or state 0; the AddrCtl output chooses among them, and the PLA/ROM outputs drive the datapath control signals (PCWrite, PCWriteCond, IorD, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, IRWrite, MemRead, MemWrite)]

Page 213

Details

Dispatch ROM 1                        Dispatch ROM 2
Op      Opcode name  Value            Op      Opcode name  Value
000000  R-format     0110             100011  lw           0011
000010  jmp          1001             101011  sw           0101
000100  beq          1000
100011  lw           0010
101011  sw           0010

State number  Address-control action     Value of AddrCtl
0             Use incremented state      3
1             Use dispatch ROM 1         1
2             Use dispatch ROM 2         2
3             Use incremented state      3
4             Replace state number by 0  0
5             Replace state number by 0  0
6             Use incremented state      3
7             Replace state number by 0  0
8             Replace state number by 0  0
9             Replace state number by 0  0

[Figure: the address select logic — a 4-way mux driven by AddrCtl chooses the next state from 0 (fetch), dispatch ROM 1, dispatch ROM 2, or the incremented current state; the dispatch ROMs are indexed by the opcode field of the instruction register]
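Under this scheme, next-state selection reduces to a mux; a sketch using the dispatch values from the tables above (opcode mnemonics stand in for the 6-bit fields):

```python
# Address select logic for the sequencer: AddrCtl picks the next state.
DISPATCH_ROM_1 = {"rformat": 6, "jmp": 9, "beq": 8, "lw": 2, "sw": 2}
DISPATCH_ROM_2 = {"lw": 3, "sw": 5}
ADDR_CTL = {0: 3, 1: 1, 2: 2, 3: 3, 4: 0, 5: 0, 6: 3, 7: 0, 8: 0, 9: 0}

def select_next_state(state, opcode):
    """Mux selected by AddrCtl: 0 = state 0, 1/2 = dispatch ROMs, 3 = state+1."""
    ctl = ADDR_CTL[state]
    if ctl == 0:
        return 0
    if ctl == 1:
        return DISPATCH_ROM_1[opcode]
    if ctl == 2:
        return DISPATCH_ROM_2[opcode]
    return state + 1

# A sw walks 0 -> 1 -> 2 -> 5 before returning to fetch.
trace = [0]
for _ in range(3):
    trace.append(select_next_state(trace[-1], "sw"))
print(trace)   # [0, 1, 2, 5]
```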

Page 214

Microprogramming

• What are the “microinstructions” ?

[Figure: microprogrammed control unit — a microcode memory holds the control words; a microprogram counter, incremented by an adder and steered by the address select logic (AddrCtl) and the instruction register's opcode field, selects the next microinstruction, whose outputs (PCWrite, PCWriteCond, IorD, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, IRWrite, MemRead, MemWrite) drive the datapath]

Page 215

• A specification methodology
– appropriate if hundreds of opcodes, modes, cycles, etc.
– signals specified symbolically using microinstructions

• Will two implementations of the same architecture have the same microcode?
• What would a microassembler do?

Microprogramming

Label     ALU control  SRC1  SRC2     Register control  Memory     PCWrite control  Sequencing
Fetch     Add          PC    4                          Read PC    ALU              Seq
          Add          PC    Extshft  Read                                          Dispatch 1
Mem1      Add          A     Extend                                                 Dispatch 2
LW2                                                     Read ALU                    Seq
                                      Write MDR                                     Fetch
SW2                                                     Write ALU                   Fetch
Rformat1  Func code    A     B                                                      Seq
                                      Write ALU                                     Fetch
BEQ1      Subt         A     B                                     ALUOut-cond      Fetch
JUMP1                                                              Jump address     Fetch

Page 216

Microinstruction format

Field name       Value         Signals active            Comment
ALU control      Add           ALUOp = 00                Cause the ALU to add.
                 Subt          ALUOp = 01                Cause the ALU to subtract; this implements the compare for branches.
                 Func code     ALUOp = 10                Use the instruction's function code to determine ALU control.
SRC1             PC            ALUSrcA = 0               Use the PC as the first ALU input.
                 A             ALUSrcA = 1               Register A is the first ALU input.
SRC2             B             ALUSrcB = 00              Register B is the second ALU input.
                 4             ALUSrcB = 01              Use 4 as the second ALU input.
                 Extend        ALUSrcB = 10              Use output of the sign extension unit as the second ALU input.
                 Extshft       ALUSrcB = 11              Use the output of the shift-by-two unit as the second ALU input.
Register         Read                                    Read two registers using the rs and rt fields of the IR as the register numbers, putting the data into registers A and B.
control          Write ALU     RegWrite, RegDst = 1,     Write a register using the rd field of the IR as the register number and
                               MemtoReg = 0              the contents of ALUOut as the data.
                 Write MDR     RegWrite, RegDst = 0,     Write a register using the rt field of the IR as the register number and
                               MemtoReg = 1              the contents of the MDR as the data.
Memory           Read PC       MemRead, IorD = 0         Read memory using the PC as address; write result into IR (and the MDR).
                 Read ALU      MemRead, IorD = 1         Read memory using ALUOut as address; write result into MDR.
                 Write ALU     MemWrite, IorD = 1        Write memory using ALUOut as address, contents of B as the data.
PC write         ALU           PCSource = 00, PCWrite    Write the output of the ALU into the PC.
control          ALUOut-cond   PCSource = 01,            If the Zero output of the ALU is active, write the PC with the contents
                               PCWriteCond               of the register ALUOut.
                 Jump address  PCSource = 10, PCWrite    Write the PC with the jump address from the instruction.
Sequencing       Seq           AddrCtl = 11              Choose the next microinstruction sequentially.
                 Fetch         AddrCtl = 00              Go to the first microinstruction to begin a new instruction.
                 Dispatch 1    AddrCtl = 01              Dispatch using ROM 1.
                 Dispatch 2    AddrCtl = 10              Dispatch using ROM 2.

Page 217

• No encoding:

– 1 bit for each datapath operation

– faster, requires more memory (logic)

– used for Vax 780 — an astonishing 400K of memory!

• Lots of encoding:

– send the microinstructions through logic to get control signals

– uses less memory, slower

• Historical context of CISC:

– Too much logic to put on a single chip with everything else

– Use a ROM (or even RAM) to hold the microcode

– It’s easy to add new instructions

Maximally vs. Minimally Encoded

Page 218

Microcode: Trade-offs

• Distinction between specification and implementation is sometimes blurred

• Specification Advantages:

– Easy to design and write

– Design architecture and microcode in parallel

• Implementation (off-chip ROM) Advantages

– Easy to change since values are in memory

– Can emulate other architectures

– Can make use of internal registers

• Implementation disadvantages (slower, now that):

– Control is implemented on same chip as processor

– ROM is no longer faster than RAM

– No need to go back and make changes

Page 219

Historical Perspective

• In the ‘60s and ‘70s microprogramming was very important for implementing machines

• This led to more sophisticated ISAs and the VAX

• In the ‘80s RISC processors based on pipelining became popular

• Pipelining the microinstructions is also possible!

• Implementations of IA-32 architecture processors since the 486 use:

– “hardwired control” for simpler instructions (few cycles, FSM control implemented using PLA or random logic)

– “microcoded control” for more complex instructions (large numbers of cycles, central control store)

• The IA-64 architecture uses a RISC-style ISA and can be implemented without a large central control store

Page 220

Pentium 4

• Pipelining is important (last IA-32 without it was 80386 in 1985)

• Pipelining is used for the simple instructions favored by compilers

“Simply put, a high performance implementation needs to ensure that the simple instructions execute quickly, and that the burden of the complexities of the instruction set penalize the complex, less frequently used, instructions”

[Figure: Pentium 4 die photo annotated with control blocks, enhanced floating point and multimedia, I/O interface, instruction cache, integer datapath, data cache, secondary cache and memory interface, and advanced pipelining/hyperthreading support; pipelining is covered in Chapter 6 and caches in Chapter 7]

Page 221

Pentium 4

• Somewhere in all that “control” we must handle complex instructions

• Processor executes simple microinstructions, 70 bits wide (hardwired)

• 120 control lines for integer datapath (400 for floating point)

• If an instruction requires more than 4 microinstructions to implement, control from microcode ROM (8000 microinstructions)

• It’s complicated!


Page 222

Chapter 5 Summary

• If we understand the instructions…

We can build a simple processor!

• If instructions take different amounts of time, multi-cycle is better

• Datapath implemented using:

– Combinational logic for arithmetic

– State holding elements to remember bits

• Control implemented using:

– Combinational logic for single-cycle implementation

– Finite state machine for multi-cycle implementation

Page 223

Chapter Six

Page 224

Pipelining

• Improve performance by increasing instruction throughput

Ideal speedup is number of stages in the pipeline. Do we achieve this?

[Figure: timing for lw $1, 100($0) / lw $2, 200($0) / lw $3, 300($0) — without pipelining, each instruction (instruction fetch, register read, ALU, data access, register write) takes 800 ps and a new instruction starts every 800 ps; with a 5-stage pipeline, each stage takes 200 ps and a new instruction starts every 200 ps]

Note: timing assumptions changed for this example
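The throughput gain in the figures above can be checked with a little arithmetic (a sketch using the slide's 800 ps vs 200 ps timings):

```python
# Time to run n instructions, unpipelined vs. a 5-stage pipeline.
def unpipelined_time(n, instr_time=800):
    return n * instr_time                        # each instruction runs alone

def pipelined_time(n, stages=5, stage_time=200):
    # The first instruction takes stages*stage_time; each later one adds
    # one stage time, since a new instruction finishes every cycle.
    return (stages + (n - 1)) * stage_time

for n in (3, 1_000_000):
    print(n, unpipelined_time(n) / pipelined_time(n))

# For 3 instructions: 2400/1400, about 1.7x. For a million instructions the
# speedup approaches 800/200 = 4x -- the ideal "number of stages" speedup is
# reached only asymptotically (and here the stage time is a quarter, not a
# fifth, of the unpipelined instruction time).
```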

Page 225

Pipelining

• What makes it easy?
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores

• What makes it hard?
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction

• We’ll build a simple pipeline and look at these issues

• We’ll talk about modern processors and what really makes it hard:
– exception handling
– trying to improve performance with out-of-order execution, etc.

Page 226

Basic Idea

• What do we need to add to actually split the datapath into stages?

IF: Instruction fetch
ID: Instruction decode/register file read
EX: Execute/address calculation
MEM: Memory access
WB: Write back

[Figure: the single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory, and branch adder) divided into the five pipeline stages]

Page 227

Pipelined Datapath

Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

[Figure: the pipelined datapath, with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers inserted between the stages]

Page 228

Corrected Datapath

[Figure: the corrected pipelined datapath — the write-register number is carried forward through the pipeline registers to the WB stage, rather than taken from the instruction currently being decoded]

Page 229

Graphically Representing Pipelines

• Can help with answering questions like:

– how many cycles does it take to execute this code?

– what is the ALU doing during cycle 4?

– use this representation to help understand datapaths

[Figure: multiple-clock-cycle pipeline diagram for lw $1, 100($0) / lw $2, 200($0) / lw $3, 300($0) — each instruction occupies IM, Reg, ALU, DM, and Reg in successive clock cycles CC 1 through CC 7]

Page 230

Pipeline Control

[Figure: the pipelined datapath annotated with its control signals — PCSrc, RegDst, ALUOp, ALUSrc, ALU control, Branch, MemRead, MemWrite, MemtoReg, and RegWrite — each shown at the stage where it is used]

Page 231

• We have 5 stages. What needs to be controlled in each stage?

– Instruction Fetch and PC Increment

– Instruction Decode / Register Fetch

– Execution

– Memory Stage

– Write Back

• How would control be handled in an automobile plant?

– a fancy control center telling everyone what to do?

– should we use a finite state machine?

Pipeline control

Page 232

• Pass control signals along just like the data

Pipeline Control

             Execution/address calculation         Memory access             Write-back
             stage control lines                   stage control lines       stage control lines

Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc        Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format     1       1       0       0             0       0        0         1         0
lw           0       0       0       1             0       1        0         1         1
sw           X       0       0       1             0       0        1         0         X
beq          X       0       1       0             1       0        0         0         X

[Figure: control values generated in ID travel with the instruction through the ID/EX, EX/MEM, and MEM/WB pipeline registers; each stage consumes its own EX, M, and WB control fields]
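Passing control along with the data can be sketched as a per-instruction signal bundle, sliced by the stage that consumes each group (values from the table above):

```python
# Control signals generated once in ID and carried through the pipeline.
# Tuple order: (RegDst, ALUOp1, ALUOp0, ALUSrc, Branch,
#               MemRead, MemWrite, RegWrite, MemtoReg); None = don't care.
CONTROL = {
    "R-format": (1, 1, 0, 0, 0, 0, 0, 1, 0),
    "lw":       (0, 0, 0, 1, 0, 1, 0, 1, 1),
    "sw":       (None, 0, 0, 1, 0, 0, 1, 0, None),
    "beq":      (None, 0, 1, 0, 1, 0, 0, 0, None),
}

def split_control(op):
    """Slice the bundle into the EX, MEM, and WB groups each stage consumes."""
    sig = CONTROL[op]
    ex = sig[0:4]        # RegDst, ALUOp1, ALUOp0, ALUSrc
    mem = sig[4:7]       # Branch, MemRead, MemWrite
    wb = sig[7:9]        # RegWrite, MemtoReg
    return ex, mem, wb

ex, mem, wb = split_control("lw")
print(mem)   # (0, 1, 0): no branch, read memory, no write
print(wb)    # (1, 1): write the register, value comes from memory
```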

Page 233

Datapath with Control

[Figure: the complete pipelined datapath with control — the main control unit writes the EX, M, and WB control fields into ID/EX, and the fields advance through EX/MEM and MEM/WB alongside the data]

Page 234

• Problem with starting next instruction before first is finished

– dependencies that “go backward in time” are data hazards

Dependencies

[Figure: sub $2, $1, $3 followed by and $12, $2, $5 / or $13, $6, $2 / add $14, $2, $2 / sw $15, 100($2) — register $2 holds 10 until sub writes –20 in CC 5, so the and and or would read the stale value; their dependence arrows “go backward in time”]

Page 235

• Have compiler guarantee no hazards

• Where do we insert the “nops” ?

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

• Problem: this really slows us down!

Software Solution

Page 236

• Use temporary results, don’t wait for them to be written

– register file forwarding to handle read/write to same register

– ALU forwarding

Forwarding

what if this $2 was $13?

[Figure: the same code sequence with forwarding — EX/MEM holds –20 in CC 4 and MEM/WB holds –20 in CC 5, so the dependent and/or instructions receive $2 from the pipeline registers instead of the register file]

Page 237

Forwarding

• The main idea (some details not shown)

[Figure: forwarding hardware — muxes in front of both ALU inputs select among the register file outputs, the EX/MEM result, and the MEM/WB result; a forwarding unit compares EX/MEM.RegisterRd and MEM/WB.RegisterRd against the Rs and Rt of the instruction in EX to set ForwardA and ForwardB]
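The forwarding unit's decision is a pair of comparisons; a sketch of the classic conditions for one ALU input (the EX hazard is checked before the MEM hazard so the most recent result wins, and register $0 is never forwarded):

```python
def forward_a(ex_mem, mem_wb, id_ex_rs):
    """Select the first ALU operand: 0 = register file,
    2 = forward from EX/MEM, 1 = forward from MEM/WB."""
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex_rs:
        return 2                          # EX hazard: newest result wins
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex_rs:
        return 1                          # MEM hazard
    return 0                              # no hazard

# sub $2,... now in EX/MEM; and $12, $2, $5 in EX: forward from EX/MEM.
ex_mem = {"RegWrite": True, "Rd": 2}
mem_wb = {"RegWrite": False, "Rd": 0}
print(forward_a(ex_mem, mem_wb, id_ex_rs=2))   # 2
```

The ForwardB logic is identical with Rt in place of Rs.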

Page 238

• Load word can still cause a hazard:
– an instruction tries to read a register following a load instruction that writes to the same register.

• Thus, we need a hazard detection unit to “stall” the instruction that follows the load

Can't always forward

[Figure: lw $2, 20($1) followed by and $4, $2, $5 / or $8, $2, $6 / add $9, $4, $2 / slt $1, $6, $7 — the load's data is not available until the end of CC 4, but the and needs it at the start of CC 4, so forwarding alone cannot fix this load-use dependence]

Page 239

Stalling

• We can stall the pipeline by keeping an instruction in the same stage

[Figure: stalling for the load-use hazard — lw $2, 20($1) is followed by add $4, $2, $5; the dependent instruction is held for one cycle while a bubble (“and becomes nop”) flows down the pipeline, after which forwarding supplies $2; every later instruction is delayed one cycle]

Page 240

Hazard Detection Unit

• Stall by letting an instruction that won’t write anything go forward

[Figure: pipeline with the hazard detection unit — it checks ID/EX.MemRead and compares ID/EX.RegisterRt with IF/ID.RegisterRs and IF/ID.RegisterRt; on a hazard it holds the PC and the IF/ID register and zeroes the control signals entering ID/EX, so a bubble moves forward instead of the stalled instruction]
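The stall condition itself is one boolean test; a sketch using the figure's ID/EX and IF/ID field names:

```python
def load_use_hazard(id_ex, if_id):
    """True if the instruction in EX is a load whose destination (Rt)
    is a source register of the instruction currently in ID."""
    return (id_ex["MemRead"] and
            id_ex["Rt"] in (if_id["Rs"], if_id["Rt"]))

# lw $2, 20($1) in EX, and $4, $2, $5 in ID: must stall one cycle.
print(load_use_hazard({"MemRead": True, "Rt": 2}, {"Rs": 2, "Rt": 5}))  # True
# slt $1, $6, $7 in ID instead: no stall needed.
print(load_use_hazard({"MemRead": True, "Rt": 2}, {"Rs": 6, "Rt": 7}))  # False
```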

Page 241

• When we decide to branch, other instructions are in the pipeline!

• We are predicting “branch not taken”

– need to add hardware for flushing instructions if we are wrong

Branch Hazards

[Figure: 40 beq $1, $3, 28 followed by 44 and $12, $2, $5 / 48 or $13, $6, $2 / 52 add $14, $2, $2 — by the time the branch decision is made, three instructions are already in the pipeline; if the branch is taken they must be flushed, and execution continues at 72 lw $4, 50($7)]

Page 242

Flushing Instructions

[Figure: datapath for flushing — an IF.Flush signal zeroes the instruction in IF/ID when a branch is taken; a dedicated register comparator and branch-target adder in the ID stage, together with the hazard detection and forwarding units, support the early branch decision]

Note: we’ve also moved branch decision to ID stage

Page 243

Branches

• If the branch is taken, we have a penalty of one cycle
• For our simple design, this is reasonable
• With deeper pipelines, the penalty increases and static branch prediction drastically hurts performance
• Solution: dynamic branch prediction

[Figure: four states — two “predict taken” and two “predict not taken”, with Taken/Not taken transitions arranged so a prediction must be wrong twice in a row before it flips]

A 2-bit prediction scheme
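The 2-bit scheme above is a saturating counter kept per branch; a minimal sketch:

```python
# 2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken.
# A prediction must be wrong twice in a row before it flips.
class TwoBitPredictor:
    def __init__(self, state=3):          # start strongly "taken"
        self.state = state

    def predict(self):
        return self.state >= 2            # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True]      # e.g. a loop branch that exits once
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits)   # 3: only the single not-taken outcome is mispredicted
```

A single wrong outcome does not flip the prediction, which is exactly why this beats a 1-bit scheme on loop branches.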

Page 244

Branch Prediction

• Sophisticated Techniques:

– A “branch target buffer” to help us look up the destination

– Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)

– Tournament predictors that use different types of prediction strategies and keep track of which one is performing best.

– A “branch delay slot” which the compiler tries to fill with a useful instruction (make the one cycle delay part of the ISA)

• Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!

• Modern processors predict correctly 95% of the time!

Page 245

Improving Performance

• Try and avoid stalls! E.g., reorder these instructions:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)

• Dynamic Pipeline Scheduling

– Hardware chooses which instructions to execute next

– Will execute instructions out of order (e.g., doesn’t wait for a dependency to be resolved, but rather keeps going!)

– Speculates on branches and keeps the pipeline full (may need to rollback if prediction incorrect)

• Trying to exploit instruction-level parallelism

Page 246

Advanced Pipelining

• Increase the depth of the pipeline

• Start more than one instruction each cycle (multiple issue)

• Loop unrolling to expose more ILP (better scheduling)

• “Superscalar” processors

– DEC Alpha 21264: 9 stage pipeline, 6 instruction issue

• All modern processors are superscalar and issue multiple instructions usually with some limitations (e.g., different “pipes”)

• VLIW: very long instruction word, static multiple issue (relies more on compiler technology)

• This class has given you the background you need to learn more!

Page 247

Chapter 6 Summary

• Pipelining does not improve latency, but does improve throughput

[Figure: design alternatives ranked on two axes — instructions per clock (IPC = 1/CPI), from slower to faster: multicycle (Section 5.5), single-cycle (Section 5.4), pipelined, deeply pipelined, multiple-issue pipelined (Section 6.9), multiple issue with deep pipeline (Section 6.10); and use latency in instructions, from 1 to several, ordering the same alternatives]

Page 248

Chapter Seven

Page 249

• SRAM:

– value is stored on a pair of inverting gates

– very fast but takes up more space than DRAM (4 to 6 transistors)

• DRAM:

– value is stored as a charge on capacitor (must be refreshed)

– very small but slower than SRAM (factor of 5 to 10)

Memories: Review

[Figure: an SRAM cell (a pair of cross-coupled inverting gates) and a DRAM cell (word line, pass transistor, capacitor, bit line)]

Page 250

• Users want large and fast memories!

– SRAM access times are 0.5–5 ns, at cost of $4000 to $10,000 per GB
– DRAM access times are 50–70 ns, at cost of $100 to $200 per GB
– Disk access times are 5 to 20 million ns, at cost of $0.50 to $2 per GB

• Try and give it to them anyway

– build a memory hierarchy

Exploiting Memory Hierarchy

[Figure: the memory hierarchy (2004 technology) — the CPU at the top, then Level 1, Level 2, down to Level n; access time increases and memory size grows with distance from the CPU]

Page 251

Locality

• A principle that makes having a memory hierarchy a good idea

• If an item is referenced,
– temporal locality: it will tend to be referenced again soon
– spatial locality: nearby items will tend to be referenced soon

Why does code have locality?

• Our initial focus: two levels (upper, lower)

– block: minimum unit of data

– hit: data requested is in the upper level

– miss: data requested is not in the upper level

Page 252

• Two issues:

– How do we know if a data item is in the cache?

– If it is, how do we find it?

• Our first example:

– block size is one word of data

– "direct mapped"

For each item of data at the lower level, there is exactly one location in the cache where it might be.

e.g., lots of items at the lower level share locations in the upper level

Cache

Page 253

• Mapping: address is modulo the number of blocks in the cache

Direct Mapped Cache

[Figure: a direct-mapped cache with 8 blocks (indices 000 through 111) — memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 each map to the cache block given by their low-order 3 bits, so many memory locations share blocks 001 and 101]
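The mapping rule is just modulo arithmetic; a sketch matching the 8-block cache in the figure:

```python
# Direct-mapped: each memory block can live in exactly one cache block.
NUM_BLOCKS = 8

def cache_index(block_address):
    """Cache block = memory block address modulo the number of cache blocks."""
    return block_address % NUM_BLOCKS

# The figure's addresses (binary) all end in 001 or 101:
for addr in (0b00001, 0b01101, 0b10101, 0b11001):
    print(f"{addr:05b} -> block {cache_index(addr):03b}")
# 00001 and 11001 share block 001; 01101 and 10101 share block 101.
```

Because the block count is a power of two, the modulo is simply the low-order address bits, which is why no division hardware is needed.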

Page 254

• For MIPS:

What kind of locality are we taking advantage of?

Direct Mapped Cache

[Figure: a MIPS direct-mapped cache with 1024 one-word blocks — the 32-bit address (bits 31–0) splits into a 2-bit byte offset, a 10-bit index selecting one of entries 0–1023, and a 20-bit tag; a hit requires the valid bit set and the stored tag equal to the address tag]

Page 255

• Taking advantage of spatial locality:

Direct Mapped Cache

[Figure: a direct-mapped cache with multi-word blocks — 256 entries, each holding four 32-bit words (512 data bits plus an 18-bit tag); the address splits into a byte offset, a 2-bit block offset selecting the word through a mux, an 8-bit index, and an 18-bit tag]

Page 256

• Read hits

– this is what we want!

• Read misses

– stall the CPU, fetch block from memory, deliver to cache, restart

• Write hits:

– can replace data in cache and memory (write-through)

– write the data only into the cache (write-back the cache later)

• Write misses:

– read the entire block into the cache, then write the word

Hits vs. Misses

Page 257

• Make reading multiple words easier by using banks of memory

• It can get a lot more complicated...

Hardware Issues

[Figure: three memory organizations — a. one-word-wide memory; b. wide memory with a multiplexor between the cache and memory; c. interleaved memory with four banks (bank 0 through bank 3) sharing one bus]

Page 258

• Increasing the block size tends to decrease miss rate:

• Use split caches because there is more spatial locality in code:

Performance

[Figure: miss rate (0–40%) versus block size (4 to 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB — larger blocks reduce the miss rate until a block becomes too large a fraction of a small cache]

Program  Block size in words  Instruction miss rate  Data miss rate  Effective combined miss rate
gcc      1                    6.1%                   2.1%            5.4%
gcc      4                    2.0%                   1.7%            1.9%
spice    1                    1.2%                   1.3%            1.2%
spice    4                    0.3%                   0.6%            0.4%

Page 259

Performance

• Simplified model:

execution time = (execution cycles + stall cycles) cycle time

stall cycles = # of instructions miss ratio miss penalty

• Two ways of improving performance:

– decreasing the miss ratio

– decreasing the miss penalty

What happens if we increase block size?
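Plugging illustrative numbers into the simplified model above (the instruction count, CPI, miss ratio, penalty, and clock are all assumptions made up for the arithmetic, and the model charges one memory reference per instruction):

```python
instructions = 1_000_000_000
base_cpi     = 1.0
miss_ratio   = 0.05      # assumed misses per instruction
miss_penalty = 40        # assumed cycles per miss
cycle_time   = 0.5e-9    # seconds per cycle (a 2 GHz clock)

execution_cycles = instructions * base_cpi
stall_cycles     = instructions * miss_ratio * miss_penalty
execution_time   = (execution_cycles + stall_cycles) * cycle_time
print(execution_time)    # 1.5 seconds: stalls double the run time here
```

Even a 5% miss ratio dominates when the penalty is tens of cycles, which is why both levers on the slide matter.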

Page 260:

Decreasing miss ratio with associativity

Compared to direct mapped, give a series of references that:

  – results in a lower miss ratio using a 2-way set associative cache

  – results in a higher miss ratio using a 2-way set associative cache

assuming we use the "least recently used" replacement strategy

[Figure: configurations of an eight-block cache: one-way set associative (direct mapped, blocks 0 to 7); two-way set associative (sets 0 to 3); four-way set associative (sets 0 and 1); eight-way set associative (fully associative); each way holds a tag and data]
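A small simulator answers the exercise above (a sketch; the four-block cache size is an assumption). Block addresses 0, 4, 0, 4 miss less in a 2-way set-associative cache, while the cyclic pattern 0, 2, 4, 0, 2, 4, ... misses more under LRU:

```python
def misses_direct_mapped(refs, nblocks=4):
    cache = [None] * nblocks
    misses = 0
    for b in refs:
        i = b % nblocks              # block address modulo number of blocks
        if cache[i] != b:
            misses += 1
            cache[i] = b
    return misses

def misses_two_way_lru(refs, nblocks=4):
    sets = [[] for _ in range(nblocks // 2)]  # per set: LRU order, recent last
    misses = 0
    for b in refs:
        s = sets[b % len(sets)]
        if b in s:
            s.remove(b)              # hit: move to most-recently-used position
        else:
            misses += 1
            if len(s) == 2:
                s.pop(0)             # evict the least recently used block
        s.append(b)
    return misses

print(misses_direct_mapped([0, 4, 0, 4]), misses_two_way_lru([0, 4, 0, 4]))    # 4 2
print(misses_direct_mapped([0, 2, 4] * 3), misses_two_way_lru([0, 2, 4] * 3))  # 7 9
```

In the second sequence three blocks cycle through one two-way set, so LRU always evicts the block needed next; the direct-mapped cache keeps block 2 resident.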

Page 261:

An implementation

[Figure: a four-way set-associative cache implementation: the address supplies a 22-bit tag and an 8-bit index selecting one of 256 sets (0 to 255); four valid/tag/data ways are compared in parallel, and a 4-to-1 multiplexor selects the 32-bit hit data]

Page 262:

Performance

[Figure: miss rate (0% to 15%) vs. associativity (one-way, two-way, four-way, eight-way) for cache sizes of 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, and 128 KB]

Page 263:

Decreasing miss penalty with multilevel caches

• Add a second level cache:
  – often primary cache is on the same chip as the processor
  – use SRAMs to add another cache above primary memory (DRAM)
  – miss penalty goes down if data is in the 2nd level cache

• Example:
  – CPI of 1.0 on a 5 GHz machine with a 5% miss rate, 100 ns DRAM access
  – Adding a 2nd level cache with 5 ns access time decreases the miss rate to .5%

• Using multilevel caches:
  – try to optimize the hit time on the 1st level cache
  – try to optimize the miss rate on the 2nd level cache
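Working the example above through (assuming, for simplicity, that both miss rates are misses per instruction): at 5 GHz the cycle time is 0.2 ns, so 100 ns of DRAM access is a 500-cycle penalty and the 5 ns second-level cache a 25-cycle penalty.

```python
cycle = 1 / 5e9                       # 0.2 ns per cycle at 5 GHz
main_penalty = round(100e-9 / cycle)  # 500 cycles to main memory
l2_penalty   = round(5e-9 / cycle)    # 25 cycles to the 2nd level cache

cpi_no_l2   = 1.0 + 0.05 * main_penalty                       # 26.0
cpi_with_l2 = 1.0 + 0.05 * l2_penalty + 0.005 * main_penalty  # 4.75
print(cpi_no_l2, cpi_with_l2, cpi_no_l2 / cpi_with_l2)
```

The second-level cache cuts effective CPI from 26 to 4.75, a speedup of roughly 5.5, even though half a percent of references still go all the way to DRAM.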

Page 264:

Cache Complexities

• Not always easy to understand implications of caches:

[Figure: two plots of Radix sort vs. Quicksort as the size grows from 4K to 4096K items to sort: the theoretical behavior of Radix sort vs. Quicksort, and the observed behavior of Radix sort vs. Quicksort]

Page 265:

Cache Complexities

• Here is why:

[Figure: cache misses per item for Radix sort vs. Quicksort, for sizes from 4K to 4096K items to sort]

• Memory system performance is often the critical factor
  – multilevel caches and pipelined processors make it harder to predict outcomes
  – Compiler optimizations to increase locality sometimes hurt ILP

• Difficult to predict best algorithm: need experimental data

Page 266:

Virtual Memory

• Main memory can act as a cache for the secondary storage (disk)

• Advantages:
  – illusion of having more physical memory
  – program relocation
  – protection

[Figure: virtual addresses mapped by address translation to physical addresses, or to disk addresses]

Page 267:

Pages: virtual memory blocks

• Page faults: the data is not in memory, retrieve it from disk
  – huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
  – reducing page faults is important (LRU is worth the price)
  – can handle the faults in software instead of hardware
  – using write-through is too expensive, so we use write-back

[Figure: address translation: the virtual address splits into a virtual page number (bits 31 to 12) and a page offset (bits 11 to 0); translation replaces the virtual page number with a physical page number (bits 29 to 12), leaving the page offset unchanged]
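With 4 KB pages, the low 12 bits pass through untouched while the virtual page number is looked up. A toy sketch of the split (the page-table contents here are made up for illustration):

```python
PAGE_BITS = 12                        # 4 KB pages
PAGE_MASK = (1 << PAGE_BITS) - 1

page_table = {0x0: 0x12, 0x1: 0x2F}   # hypothetical VPN -> physical page number

def translate(va):
    vpn    = va >> PAGE_BITS          # virtual page number (high bits)
    offset = va & PAGE_MASK           # page offset (low 12 bits, unchanged)
    return (page_table[vpn] << PAGE_BITS) | offset

print(hex(translate(0x1ABC)))         # 0x2fabc: page 0x1 maps to 0x2F, offset 0xABC
```

A real MMU would also check a valid bit and fault to the operating system when the page is on disk; this sketch assumes every lookup hits a resident page.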

Page 268:

Page Tables

[Figure: a page table mapping virtual page numbers to physical pages in memory or to disk addresses; a valid bit of 1 means the page is in physical memory, 0 means it is on disk]

Page 269:

Page Tables

[Figure: the page table register points to the page table; the 20-bit virtual page number indexes an entry holding a valid bit and a physical page number (if the valid bit is 0, the page is not present in memory); the 18-bit physical page number is concatenated with the 12-bit page offset to form the physical address]

Page 270:

Making Address Translation Fast

• A cache for address translations: translation lookaside buffer

[Figure: a TLB caching recently used translations, each entry holding valid, dirty, and reference bits, a tag, and a physical page address; the full page table in memory backs the TLB, with disk storage behind it]

Typical values: 16–512 entries; miss rate: .01%–1%; miss penalty: 10–100 cycles
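The TLB is just a tiny cache of page-table entries. A sketch with a two-entry, FIFO-replacement TLB (the capacity, replacement policy, and page-table contents are all assumptions for illustration):

```python
page_table = {0: 5, 1: 9, 2: 7, 3: 2}   # hypothetical VPN -> PPN mapping
tlb, CAP = {}, 2                        # two-entry TLB, FIFO replacement
hits = misses = 0

def tlb_lookup(vpn):
    global hits, misses
    if vpn in tlb:
        hits += 1                       # fast path: translation already cached
        return tlb[vpn]
    misses += 1                         # slow path: walk the page table
    if len(tlb) >= CAP:
        tlb.pop(next(iter(tlb)))        # evict the oldest entry (FIFO)
    tlb[vpn] = page_table[vpn]
    return tlb[vpn]

for vpn in [0, 0, 1, 0, 2, 0]:
    tlb_lookup(vpn)
print(hits, misses)                     # 2 hits, 4 misses
```

Because most references stay within a few pages, even a tiny TLB turns nearly every translation into the fast path; the low miss rates quoted above reflect that locality.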

Page 271:

TLBs and caches

[Flowchart: a virtual address first accesses the TLB; a TLB miss raises a TLB miss exception; on a TLB hit, the physical address is used to access the cache. For a read, try to read the data from the cache: on a hit, deliver the data to the CPU; on a miss, stall while the block is read. For a write, first check the write access bit (if it is off, raise a write protection exception); then try to write the data to the cache: on a hit, write the data into the cache, update the dirty bit, and put the data and the address into the write buffer; on a miss, stall while the block is read]

Page 272:

TLBs and Caches

[Figure: combined TLB and cache datapath: the virtual address splits into a 20-bit virtual page number and a 12-bit page offset; the TLB compares the virtual page number against its tags (each entry with valid and dirty bits) to produce the physical page number, which joins the page offset to form the physical address; that address splits into a physical address tag, cache index, block offset, and byte offset for the cache access, yielding TLB hit and cache hit signals and the data]

Page 273:

Modern Systems

Page 274:

Modern Systems

• Things are getting complicated!

Page 275:

Some Issues

• Processor speeds continue to increase very fast
  — much faster than either DRAM or disk access times

• Design challenge: dealing with this growing disparity
  – Prefetching? 3rd level caches and more? Memory design?

[Figure: CPU vs. memory performance by year, on a logarithmic scale from 1 to 100,000; the CPU curve rises far faster than the memory curve]

Page 276:

Chapters 8 & 9

(partial coverage)

Page 277:

Interfacing Processors and Peripherals

• I/O Design affected by many factors (expandability, resilience)

• Performance:
  — access latency
  — throughput
  — connection between devices and the system
  — the memory hierarchy
  — the operating system

• A variety of different users (e.g., banks, supercomputers, engineers)

[Figure: processor and cache attached to a memory-I/O bus along with main memory and I/O controllers for disks, graphics output, and a network; interrupts connect back to the processor]

Page 278:

I/O

• Important but neglected

“The difficulties in assessing and designing I/O systems haveoften relegated I/O to second class status”

“courses in every aspect of computing, from programming tocomputer architecture often ignore I/O or give it scanty coverage”

“textbooks leave the subject to near the end, making it easierfor students and instructors to skip it!”

• GUILTY!

— we won’t be looking at I/O in much detail

— be sure to read Chapter 8 in its entirety.

— you should probably take a networking class!

Page 279:

I/O Devices

• Very diverse devices
  — behavior (i.e., input vs. output)
  — partner (who is at the other end?)
  — data rate

Page 280:

I/O Example: Disk Drives

• To access data:
  — seek: position head over the proper track (3 to 14 ms avg.)
  — rotational latency: wait for desired sector (.5 / RPM)
  — transfer: grab the data (one or more sectors) at 30 to 80 MB/sec

[Figure: a disk with multiple platters; each platter surface holds concentric tracks, and each track is divided into sectors]
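The three components above simply add. A quick sketch with assumed mid-range numbers (an 8 ms average seek, 10,000 RPM, and one 512-byte sector transferred at 50 MB/sec; all values are illustrative):

```python
seek       = 8e-3              # assumed 8 ms average seek
rpm        = 10_000
rotational = 0.5 / (rpm / 60)  # half a rotation on average: 3 ms
transfer   = 512 / 50e6        # one 512-byte sector at 50 MB/sec: ~0.01 ms

total = seek + rotational + transfer
print(round(total * 1e3, 2))   # ~11.01 ms, dominated by seek and rotation
```

The transfer itself is three orders of magnitude cheaper than the mechanical delays, which is why disk scheduling and large sequential transfers matter so much.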

Page 281:

I/O Example: Buses

• Shared communication link (one or more wires)

• Difficult design:
  — may be a bottleneck
  — length of the bus
  — number of devices
  — tradeoffs (buffers for higher bandwidth increase latency)
  — support for many different devices
  — cost

• Types of buses:
  — processor–memory (short, high speed, custom design)
  — backplane (high speed, often standardized, e.g., PCI)
  — I/O (lengthy, different devices, e.g., USB, Firewire)

• Synchronous vs. Asynchronous
  — use a clock and a synchronous protocol: fast and small,
    but every device must operate at the same rate, and
    clock skew requires the bus to be short
  — don't use a clock and instead use handshaking

Page 282:

I/O Bus Standards

• Today we have two dominant bus standards:

Page 283:

Other important issues

• Bus Arbitration:

— daisy chain arbitration (not very fair)

— centralized arbitration (requires an arbiter), e.g., PCI

— collision detection, e.g., Ethernet

• Operating system:

— polling

— interrupts

— direct memory access (DMA)

• Performance Analysis techniques:

— queuing theory

— simulation

— analysis, i.e., find the weakest link (see “I/O System Design”)

• Many new developments

Page 284:

Pentium 4

• I/O Options

[Figure: Pentium 4 chipset I/O organization: the processor connects over the system bus (800 MHz, 6.4 GB/sec) to the 82875P memory controller hub (north bridge), which links main memory DIMMs via two DDR 400 channels (3.2 GB/sec each), graphics output via AGP 8X (2.1 GB/sec), and 1 Gbit Ethernet via CSA (0.266 GB/sec); a 266 MB/sec link connects to the 82801EB I/O controller hub (south bridge), which provides Serial ATA disks (150 MB/sec each), Parallel ATA devices such as CD/DVD and tape (100 MB/sec), AC/97 stereo surround-sound audio (1 MB/sec), USB 2.0 (60 MB/sec), 10/100 Mbit Ethernet, and a PCI bus (132 MB/sec)]

Page 285:

Fallacies and Pitfalls

• Fallacy: the rated mean time to failure of disks is 1,200,000 hours,

so disks practically never fail.

• Fallacy: magnetic disk storage is on its last legs, will be replaced.

• Fallacy: A 100 MB/sec bus can transfer 100 MB/sec.

• Pitfall: Moving functions from the CPU to the I/O processor,

expecting to improve performance without analysis.
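The bus fallacy above is easy to quantify: per-transaction overhead (arbitration, addressing, handshaking) eats into the rated figure. A sketch with assumed numbers (32-byte transactions and 200 ns of overhead each; both values are made up for illustration):

```python
peak     = 100e6                  # rated 100 MB/sec
block    = 32                     # bytes moved per bus transaction
overhead = 200e-9                 # assumed arbitration/handshake time per transaction

data_time = block / peak          # 320 ns of pure data transfer
effective = block / (data_time + overhead)
print(round(effective / 1e6, 1))  # ~61.5 MB/sec, well under the rated 100
```

Larger transfers amortize the overhead, which is one reason block-oriented devices like disks come closer to a bus's rated bandwidth than small random transactions do.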

Page 286:

Multiprocessors

• Idea: create powerful computers by connecting many smaller ones

  good news: works for timesharing (better than a supercomputer)
  bad news: it's really hard to write good concurrent programs; many commercial failures

[Figure: two organizations: a single-bus multiprocessor (processors with caches sharing one bus, memory, and network/I/O) and a distributed-memory multiprocessor (each processor with its own cache and memory, connected by a network)]

Page 287:

Questions

• How do parallel processors share data?
  — single address space (SMP vs. NUMA)
  — message passing

• How do parallel processors coordinate?
  — synchronization (locks, semaphores)
  — built into send / receive primitives
  — operating system protocols

• How are they implemented?
  — connected by a single bus
  — connected by a network

Page 288:

Supercomputers

Plot of top 500 supercomputer sites over a decade:

[Figure: number of systems (0 to 500) from 1993 to 2000, by category: single instruction multiple data (SIMD), clusters (networks of workstations), clusters (networks of SMPs), massively parallel processors (MPPs), shared-memory multiprocessors (SMPs), and uniprocessors]

Page 289:

Using multiple processors an old idea

• Some SIMD designs:

• Costs for the Illiac IV escalated from $8 million in 1966 to $32 million in 1972, despite completion of only ¼ of the machine. It took three more years before it was operational!

“For better or worse, computer architects are not easily discouraged”

Lots of interesting designs and ideas, lots of failures, few successes

Page 290:

Topologies

[Figure: a. crossbar and b. Omega network, each connecting processors P0 to P7; a. 2-D grid or mesh of 16 nodes and b. n-cube tree of 8 nodes (8 = 2³, so n = 3)]

Page 291:

Clusters

• Constructed from whole computers

• Independent, scalable networks

• Strengths:

– Many applications amenable to loosely coupled machines

– Exploit local area networks

– Cost effective / Easy to expand

• Weaknesses:

– Administration costs not necessarily lower

– Connected using I/O bus

• Highly available due to separation of memories

• In theory, we should be able to do better

Page 292:

Google

• Serve an average of 1000 queries per second

• Google uses 6,000 processors and 12,000 disks

• Two sites in silicon valley, two in Virginia

• Each site connected to internet using OC48 (2488 Mbit/sec)

• Reliability:

– On an average day, 20 machines need to be rebooted (software error)

– 2% of the machines replaced each year

In some sense, simple ideas well executed. Better (and cheaper) than other approaches involving increased complexity

Page 293:

Concluding Remarks

• Evolution vs. Revolution

“More often the expense of innovation comes from being too disruptive to computer users”

“Acceptance of hardware ideas requires acceptance by software people; therefore hardware people should learn about software. And if software people want good machines, they must learn more about hardware to be able to communicate with and thereby influence hardware engineers.”