Computer Architecture, Paul Mellies. Lecture 2 : Princeton, Harvard and dataflow machines


Page 1

Computer Architecture
Paul Mellies

Lecture 2 : Princeton, Harvard and dataflow machines

[Dataflow graph on the title slide: IN and OUT ports connected through copy, dec, branch, > 0 and * nodes over int and bool wires; the same graph reappears as an exercise at the end of the lecture]

Page 2

Why have computers become more complex? We can think of several reasons.

Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began with the transition from the 701 to the 709. The 701 CPU was about ten times as fast as the core main memory ; this made any primitives that were implemented as subroutines much slower than primitives that were instructions. Thus the floating point subroutines became part of the 709 architecture with dramatic gains. Making the 709 more complex resulted in an advance that made it more cost-effective than the 701. Since then, many « higher-level » instructions have been added to machines in an attempt to improve performance.

David Patterson and David Ditzel
The Case for the Reduced Instruction Set Computer
ACM Computer Architecture News, 1980

Page 3

Diverging processor and memory performance

[Plot: processor performance diverging from memory performance, 1980 – 2010, on a log scale from 1 to 100,000]

Page 4

The memory hierarchy

Where the data is located      Time to fetch data
register                       1 cycle
L1 cache                       ~4 cycles
L2 cache                       ~10 cycles
L3 cache                       40 - 75 cycles
Local DRAM Memory              ~60 ns
Remote DRAM Memory             ~100 ns

For more information, please have a look at INTEL's performance analysis guide for the Core i7 and Xeon 5500:
https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

A good programmer should be aware of these memory latencies and do their best to maximize the amount of data available in the cache.

A good idea is to keep the manipulated data as local as possible ( e.g. use arrays instead of linked lists ).
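The contrast between the two layouts can be sketched as follows (an illustrative example, not from the lecture; note that Python lists are themselves arrays of pointers, so this only mirrors the C-level picture):

```python
# The same reduction over a contiguous array and over a linked
# list. Both return the same value, but the array traversal
# touches consecutive memory locations (cache-friendly), while
# the linked list chases pointers scattered over the heap.

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def sum_array(xs):
    total = 0
    for x in xs:              # consecutive accesses
        total += x
    return total

def sum_list(node):
    total = 0
    while node is not None:   # pointer chasing
        total += node.value
        node = node.next
    return total

xs = list(range(1000))
head = None
for x in reversed(xs):        # build the equivalent linked list
    head = Node(x, head)

print(sum_array(xs), sum_list(head))  # 499500 499500
```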

Page 5

Exercise

Compute the number of cycles performed in 60 ns by an INTEL Core i7 processor working at 4 GHz.
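A worked answer, using only the figures stated in the exercise: a 4 GHz clock performs 4 × 10⁹ cycles per second, so the cycle count is just the product of the latency and the clock rate.

```python
clock_rate_hz = 4.0e9      # Intel Core i7 at 4 GHz
latency_s = 60.0e-9        # 60 ns (local DRAM access, previous slide)
cycles = round(latency_s * clock_rate_hz)
print(cycles)  # 240
```

So a single trip to local DRAM costs the processor on the order of 240 clock cycles.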

Page 6

Upward Compatibility

Upward compatibility means that the primary way to improve a design is to add new, and usually more complex, features. Seldom are instructions or addressing modes removed from an architecture, resulting in a gradual increase in both the number and complexity of instructions over a series of computers. New architectures tend to have a habit of including all instructions found in the machines of successful competitors, perhaps because architects and customers have no real grasp over what defines a « good » instruction set.

David Patterson and David Ditzel
The Case for the Reduced Instruction Set Computer
ACM Computer Architecture News, 1980

Page 7


2.20 Concluding Remarks

The two principles of the stored-program computer are the use of instructions that are indistinguishable from numbers and the use of alterable memory for programs. These principles allow a single machine to aid environmental scientists, financial advisers, and novelists in their specialties. The selection of a set of instructions that the machine can understand demands a delicate balance among the number of instructions needed to execute a program, the number of clock cycles needed by an instruction, and the speed of the clock. As illustrated in this chapter, three design principles guide the authors of instruction sets in making that delicate balance:

1. Simplicity favors regularity. Regularity motivates many features of the MIPS instruction set: keeping all instructions a single size, always requiring three register operands in arithmetic instructions, and keeping the register fields in the same place in each instruction format.

2. Smaller is faster. The desire for speed is the reason that MIPS has 32 registers rather than many more.

3. Good design demands good compromises. One MIPS example was the compromise between providing for larger addresses and constants in instructions and keeping all instructions the same length.

Less is more.
Robert Browning, Andrea del Sarto, 1855

[Plot: number of x86 instructions, from 0 to 1000, by year, 1978 – 2012]

FIGURE 2.43 Growth of x86 instruction set over time. While there is clear technical value to some of these extensions, this rapid change also increases the difficulty for other companies to try to build compatible processors.

Inflation of the x86 instruction set over time

The price to pay ( among other things ) for backward compatibility...

Page 8

Page 9

Page 10

Illustration : Intel Haswell i7 core ( 2013 )

die size ≈ 177 mm²
clock rate ≈ 3 GHz

22 nm FinFET technology

number of transistors per die ≈ 1 400 000 000

All Haswell models are designed to support MMX, SSE, SSE2, SSSE3, SSE4.1, SSE4.2, F16C, BMI1 + BMI2, EIST, Intel 64, XD bit, Intel VT-x and Smart Cache

Page 11

How much of a CISC is used ?

One of the interesting results of rising software costs is the increasing reliance on high-level languages. One consequence is that the compiler writer is replacing the assembly language programmer in deciding which instructions the machine will execute. Compilers are often unable to utilize complex instructions, nor do they use the insidious tricks in which assembly language programmers delight. [...]

For example, measurements of a particular IBM 360 compiler found that 10 instructions accounted for 80% of all instructions executed, 16 for 90%, 21 for 95%, and 30 for 99%.

David Patterson and David Ditzel
The Case for the Reduced Instruction Set Computer
ACM Computer Architecture News, 1980

But are you really convinced by the argument ?

Page 12

Growth in clock rates

Page 13

The Power Wall

Figure 1.16 in Patterson & Hennessy

Clock rate and Power for Intel x86 microprocessors over eight generations and 25 years

Page 14

A simple formula for computing the dynamic energy and the dynamic power of a CMOS transistor

Dynamic energy ≈ Capacitive load of the transistor × Voltage²
for a full logic transition 0 → 1 → 0

Dynamic energy ≈ 1/2 × Capacitive load of the transistor × Voltage²
for a single transition 0 → 1 or 1 → 0

Dynamic power ≈ 1/2 × Capacitive load of the transistor × Voltage² × Frequency

Note that there is also static energy consumption in CMOS technology, because of the leakage current that flows even when the transistor is off.

Page 15

Exercise [ from Patterson & Hennessy 1.8 ]

The Pentium 4 Prescott processor, released in 2004, had a clock rate of 3.6 GHz and a voltage of 1.25 V. Assume that, on average, it consumed 10 W of static power and 90 W of dynamic power. The Core i5 Ivy Bridge, released in 2012, had a clock rate of 3.4 GHz and a voltage of 0.9 V. Assume that, on average, it consumed 30 W of static power and 40 W of dynamic power.

a. For each processor, find the average capacitive load.

b. Find the percentage of the total dissipated power comprised by static power, and the ratio of static power to dynamic power for each technology.

c. If the total dissipated power is to be reduced by 10%, how much should the voltage be reduced to maintain the same leakage current? Note: power is defined as the product of voltage and current.
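A sketch of parts (a) and (b), using the dynamic-power formula from the previous page, P_dyn ≈ 1/2 × C × V² × f, inverted as C = 2 × P_dyn / (V² × f); all numbers are the ones stated in the exercise.

```python
def capacitive_load(p_dyn, voltage, freq):
    # from P_dyn = 1/2 * C * V^2 * f
    return 2 * p_dyn / (voltage ** 2 * freq)

# (a) Pentium 4 Prescott: 90 W dynamic, 1.25 V, 3.6 GHz
c_prescott = capacitive_load(90, 1.25, 3.6e9)
#     Core i5 Ivy Bridge: 40 W dynamic, 0.9 V, 3.4 GHz
c_ivy = capacitive_load(40, 0.9, 3.4e9)
print(c_prescott)   # ~3.2e-8 F, i.e. 32 nF
print(c_ivy)        # ~2.9e-8 F, i.e. 29 nF

# (b) static share of the total power, and static/dynamic ratio
static_share_prescott = 10 / (10 + 90)   # 10 %
ratio_prescott = 10 / 90                 # ~0.11
static_share_ivy = 30 / (30 + 40)        # ~43 %
ratio_ivy = 30 / 40                      # 0.75
```

Note how the static share grows dramatically between the two generations: leakage has become a first-order design constraint.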

Page 16

The industry turning to multicore architectures

The computer industry is undergoing, if not another revolution, certainly a vigorous shaking-up. The major chip manufacturers have, for the time being at least, given up trying to make processors run faster. Moore's law has not been repealed: each year, more and more transistors fit into the same space, but their clock speed cannot be increased without overheating. Instead, manufacturers are turning to « multicore » architectures, in which multiple processors ( cores ) communicate directly through shared hardware caches. Multiprocessor chips make computing more effective by exploiting parallelism : harnessing multiple processors to work on a single task.

Maurice Herlihy and Nir Shavit
The Art of Multiprocessor Programming
Morgan Kaufmann Publishers, 2008.

Page 17

The Princeton architecture
also known as the Von Neumann architecture

Program and data are stored in just the same place

• Clever unification of the notion of « memory »

• A price to pay: the interpretation of a value stored in memory now depends on a control signal

A serious risk of mistaking data for code... But at the same time, great for self-modifying code !!!

Purely sequential instruction processing

• Exactly one instruction is processed at a time

• To that purpose, a special register called the program counter pc identifies the position in memory of the current instruction

• The program counter pc is advanced sequentially by every instruction, except in the case of control transfer instructions like goto's or beq's
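The bullets above can be sketched as a toy fetch-execute loop. This is an illustrative mini-ISA invented for the example (not MIPS): the pc advances sequentially unless a control-transfer instruction overwrites it.

```python
def run(memory, pc=0):
    regs = {}
    while True:
        op, *args = memory[pc]          # fetch + decode
        pc += 1                         # sequential advance...
        if op == "li":                  # load immediate
            regs[args[0]] = args[1]
        elif op == "add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "beq":               # ...except on control transfer
            if regs[args[0]] == regs[args[1]]:
                pc = args[2]
        elif op == "goto":
            pc = args[0]
        elif op == "halt":
            return regs

program = [
    ("li", "r1", 2),
    ("li", "r2", 3),
    ("add", "r3", "r1", "r2"),
    ("halt",),
]
print(run(program)["r3"])  # 5
```

Note that program and data live in the same Python process here, but the `memory` list only holds instructions; a faithful Princeton machine would let the program read and overwrite its own encoding.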

Page 18

The Princeton architecture
also known as the Von Neumann architecture

The program counter pc is a special register which contains the address in memory of the current instruction.

[Figure: memory depicted as a column of binary words, with the pc register pointing at the current instruction]

Page 19 – Page 21

[The same slide, repeated with the program counter advanced to each successive instruction]

Page 22

In the case of the MIPS instruction set ...

Add Instruction ( R-format )

opcode 000000 | rs | rt | rd | shamt 00000 | funct 100000

Register [ rd ] = Register [ rs ] + Register [ rt ]

Page 23

In the case of the MIPS instruction set ...

Slt Instruction ( Set on less than, signed ; R-format )

opcode 000000 | rs | rt | rd | shamt 00000 | funct 101010

If $rs is strictly less than $rt, then $rd is set to one. $rd is set to zero otherwise.

Page 24

In the case of the MIPS instruction set ...

Add Immediate Instruction ( I-format )

opcode 001000 | rs | rt | 16-bit immediate

Register [ rt ] = Register [ rs ] + Immediate

Page 25

In the case of the MIPS instruction set ...

BNE Instruction ( Branch on Not Equal ; I-format )

opcode 000101 | rs | rt | 16-bit immediate

Branches if the two registers are not equal and carries on otherwise.

Page 26

In the case of the MIPS instruction set ...

Jump Instruction ( J-format )

opcode 000010 | 26-bit immediate

Jump to the address of memory obtained by keeping the top 4 bits of the pc, followed by the 26-bit immediate, followed by 00.
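This address computation can be sketched in a few lines (a hypothetical helper for illustration, not official MIPS code): the 26-bit immediate is shifted left by 2 because instructions are word-aligned, and the top 4 bits come from the current pc.

```python
def jump_target(pc, imm26):
    # new pc = pc[31:28] | imm26 | 00
    return (pc & 0xF0000000) | (imm26 << 2)

print(hex(jump_target(0x00400008, 0x100000)))  # 0x400000
```

A consequence of this encoding is that a single jump instruction can only reach addresses within the 256 MB region selected by the top 4 bits of the pc.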

Page 27

The instruction cycle of the Princeton architecture

fetch instruction → instruction decode → fetch operands → execute → write back

Key principle : each step of the execution cycle starts only after the previous step has been completed, in a purely sequential order. In particular, one needs to decode the instruction before getting its operands.

Page 28

The Princeton architecture

[Diagram: the Central Processing Unit ( CPU ), made of the Control Unit, the ALU and the Registers, connected by a single Bus to the Main Memory, the Secondary Memory ( Storage ), the Input Devices ( Keyboard, Mouse ) and the Output Devices ( Display, Printer )]

Page 29

The Harvard architecture

Separation of memory into « code memory » and « data memory »

[Diagram: same components as the Princeton architecture, except that the CPU reaches the Code Memory ( typically ROM ) through a dedicated Code Bus, and the Data Memory through a separate Data Bus]

Page 30

The instruction cycle of the Harvard architecture

fetch instruction → instruction decode → fetch operands → execute → write back

Key idea : thanks to the separation between the code bus and the data bus, the instruction and its operands may be fetched from memory at the same step of the instruction cycle !

Disturbing fact : there is no definite « state » of the machine.

Page 31

One step further in parallelism : the data-flow machines

Simple data-flow program extracted from Dennis and Misunas' 1974 paper « A preliminary architecture for a basic data-flow processor », with the corresponding sequential program :

L1 ← a
L2 ← b
L3 ← L1 + L2
L4 ← L3 * L1
L6 ← L4 + L2
L5 ← L3 / L6
x ← L6
y ← L5

Page 32

Data-Flow machines

• A program is not defined as a sequence of instructions but as a graph of instructions -- also called data-flow nodes. The execution is data-driven rather than control-driven.

• Each instruction or data-flow node « waits » for its operands and « fires » as soon as all of them are available. Instruction processing is intrinsically parallel.

• In particular, there is no need for a code pointer pc. There is no precise execution state.
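The firing rule can be illustrated with a toy scheduler (a sketch for intuition, not the Dennis-Misunas design): a node fires as soon as all of its input tokens are present, and since the loop keeps retrying until nothing fires, the order in which the nodes are listed does not matter.

```python
def run_dataflow(nodes, tokens):
    # nodes: name -> (function, list of input wires, output wire)
    # tokens: wire -> value currently present on that wire
    fired = set()
    progress = True
    while progress:
        progress = False
        for name, (fn, inputs, output) in nodes.items():
            if name not in fired and all(i in tokens for i in inputs):
                tokens[output] = fn(*(tokens[i] for i in inputs))
                fired.add(name)
                progress = True
    return tokens

graph = {
    # deliberately listed before the node producing its input
    "mul": (lambda s, a: s * a, ["sum", "a"], "prod"),
    "add": (lambda a, b: a + b, ["a", "b"], "sum"),
}
result = run_dataflow(graph, {"a": 2, "b": 3})
print(result["prod"])  # 10
```

A real data-flow machine would fire ready nodes in parallel rather than scanning them sequentially, but the enabling condition is the same.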

Page 33

Data Flow Nodes

[Diagram: the COPY node duplicates an incoming token ( int or bool ) onto two output wires; the BRANCH node routes an int token to its True or False output according to an incoming bool token; the BARRIER SYNCH node waits for all of its input tokens before producing its output]

Page 34

[The same diagram, with example tokens in flight: ints 5, 7, 4 and the pair 3, True]

Page 35

[The same diagram, one firing step later: the branch has routed the token 3 along its True output]

Page 36

Data Flow Nodes

[Diagram: the RELATION node < consumes two int tokens and produces the bool token comparing them]

Page 37

[The same < node, with input tokens 3 and 5 ready to fire]

Page 38

[After firing: the < node has produced the bool token True]

Page 39

Exercise :

Find out what function this data flow program computes !

[Dataflow graph: an int token enters at IN and circulates through copy, dec, > 0, branch ( True / False ), * nodes and the constant 1, before the resulting int leaves at OUT]

Page 40

Open discussion

Should there be a code pointer in any reasonable Instruction Set Architecture ?

Is it possible/intuitive/safe to program and/or compile in a data-flow architecture ?

And what about debugging a data-flow program ?

This is a very serious and interesting debate !!!

Current trade-off between data-driven and control-driven execution : today, all the major ISAs are based on the Princeton architecture :

• x86 • ARM • MIPS • SPARC • POWER

In contrast, their optimized microarchitectures take full advantage of parallelism :

• pipelined instruction execution • multiple instructions at a time • out-of-order execution

This can be summarized in a slogan : Electricity is parallel but the Programmer's Mind is sequential.

But do you believe yourself in that accepted view ?

Page 41

The underworld of microarchitecture

Architecture : the sequential instruction set, which is what the User/Programmer can see.

Microarchitecture : the parallel implementation, generally not exposed to the User/Programmer.

Page 42

Recall the origins of the word « architecture »

The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logical design, and the physical implementation.

Amdahl, Blaauw, Brooks
Architecture of the IBM System / 360
IBM Journal of Research and Development, April 1964

organization of the data flow = microarchitecture
logical design = digital logic
physical implementation = circuit

Page 43

One apparent difficulty : the notion of internal state of the system is lost

First of all, it may be difficult to guess what state the microarchitecture is in at a given point of the execution of the machine code.

More conceptually, there is the temptation to reason about the result of the execution of machine code independently of the microarchitecture.

This is the direction taken by the so-called « memory models » like

• the Java memory model developed in 1995
• more recently, the C11 memory model.

These memory models are typically defined using partial orders expressing that an instruction « happened before » another one during the execution.

Page 44

Out-of-order execution in a multiprocessor scenario

Consider these two threads and run them in parallel on an x86 or a Power multiprocessor :

Thread 1          Thread 2
x ← 1             y ← 1
r1 ← y            r2 ← x

Suppose also that x and y have value 0 before execution.

Question : how many results of the executions are possible ?

Page 45


Somewhat surprisingly, the correct answer is 4. In particular, the outcome r1 = 0 and r2 = 0 is also possible.

Can you explain what happened?
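One way to see why 4 is surprising: under sequential consistency, only the interleavings that preserve each thread's program order are allowed, and enumerating them (a sketch) yields just three outcomes. The fourth outcome, r1 = 0 and r2 = 0, requires the hardware to reorder each store past the following load (store buffering), which no interleaving can produce.

```python
from itertools import permutations

# instruction = (destination, source); an int source is a constant,
# a str source is a read from shared memory
T1 = [("x", 1), ("r1", "y")]   # Thread 1
T2 = [("y", 1), ("r2", "x")]   # Thread 2

def sc_outcomes():
    prog = T1 + T2             # indices 0,1 belong to T1; 2,3 to T2
    results = set()
    for order in permutations(range(4)):
        pos = {i: order.index(i) for i in range(4)}
        if pos[0] < pos[1] and pos[2] < pos[3]:   # program order kept
            mem = {"x": 0, "y": 0}
            for i in order:                       # execute in sequence
                dst, src = prog[i]
                mem[dst] = mem[src] if isinstance(src, str) else src
            results.add((mem["r1"], mem["r2"]))
    return results

print(sorted(sc_outcomes()))   # [(0, 1), (1, 0), (1, 1)] -- never (0, 0)
```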

Page 46

Any question ?

Thank you !