Computer Architecture, Paul Mellies. Lecture 2 : Princeton, Harvard and dataflow machines


Page 1

Computer Architecture
Paul Mellies

Lecture 2 : Princeton, Harvard and dataflow machines

[Dataflow graph on the title slide: IN and OUT ports connected through copy, dec, branch, > 0 and * nodes over int and bool wires; the same graph reappears as an exercise at the end of the lecture]

Page 2

Why have computers become more complex? We can think of several reasons.

Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began with the transition from the 701 to the 709. The 701 CPU was about ten times as fast as the core main memory ; this made any primitives that were implemented as subroutines much slower than primitives that were instructions. Thus the floating point subroutines became part of the 709 architecture with dramatic gains. Making the 709 more complex resulted in an advance that made it more cost-effective than the 701. Since then, many « higher-level » instructions have been added to machines in an attempt to improve performance.

David Patterson and David Ditzel
The Case for the Reduced Instruction Set Computer
ACM Computer Architecture News, 1980

Page 3

Diverging processor and memory performance

[Plot: processor performance diverging from memory performance, 1980 – 2010, on a log scale from 1 to 100,000]

Page 4

The memory hierarchy

Where the data is located      Time to fetch data
register                       1 cycle
L1 cache                       ~4 cycles
L2 cache                       ~10 cycles
L3 cache                       40 - 75 cycles
Local DRAM Memory              ~60 ns
Remote DRAM Memory             ~100 ns

For more information, please have a look at INTEL's performance analysis guide for the Core i7 and Xeon 5500:
https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

A good programmer should be aware of these memory latencies and do their best to maximize the amount of data available in the cache.

A good idea is to keep the manipulated data as local as possible ( e.g. use arrays instead of linked lists ).
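The contrast between the two layouts can be sketched as follows (an illustrative example, not from the lecture; note that Python lists are themselves arrays of pointers, so this only mirrors the C-level picture):

```python
# The same reduction over a contiguous array and over a linked
# list. Both return the same value, but the array traversal
# touches consecutive memory locations (cache-friendly), while
# the linked list chases pointers scattered over the heap.

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def sum_array(xs):
    total = 0
    for x in xs:              # consecutive accesses
        total += x
    return total

def sum_list(node):
    total = 0
    while node is not None:   # pointer chasing
        total += node.value
        node = node.next
    return total

xs = list(range(1000))
head = None
for x in reversed(xs):        # build the equivalent linked list
    head = Node(x, head)

print(sum_array(xs), sum_list(head))  # 499500 499500
```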

Page 5

Exercise

Compute the number of cycles performed in 60 ns by an INTEL Core i7 processor working at 4 GHz.
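A worked answer, using only the figures stated in the exercise: a 4 GHz clock performs 4 × 10⁹ cycles per second, so the cycle count is just the product of the latency and the clock rate.

```python
clock_rate_hz = 4.0e9      # Intel Core i7 at 4 GHz
latency_s = 60.0e-9        # 60 ns (local DRAM access, previous slide)
cycles = round(latency_s * clock_rate_hz)
print(cycles)  # 240
```

So a single trip to local DRAM costs the processor on the order of 240 clock cycles.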

Page 6

Upward Compatibility

Upward compatibility means that the primary way to improve a design is to add new, and usually more complex, features. Seldom are instructions or addressing modes removed from an architecture, resulting in a gradual increase in both the number and complexity of instructions over a series of computers. New architectures tend to have a habit of including all instructions found in the machines of successful competitors, perhaps because architects and customers have no real grasp over what defines a « good » instruction set.

David Patterson and David Ditzel
The Case for the Reduced Instruction Set Computer
ACM Computer Architecture News, 1980

Page 7


2.20 Concluding Remarks

The two principles of the stored-program computer are the use of instructions that are indistinguishable from numbers and the use of alterable memory for programs. These principles allow a single machine to aid environmental scientists, financial advisers, and novelists in their specialties. The selection of a set of instructions that the machine can understand demands a delicate balance among the number of instructions needed to execute a program, the number of clock cycles needed by an instruction, and the speed of the clock. As illustrated in this chapter, three design principles guide the authors of instruction sets in making that delicate balance:

1. Simplicity favors regularity. Regularity motivates many features of the MIPS instruction set: keeping all instructions a single size, always requiring three register operands in arithmetic instructions, and keeping the register fields in the same place in each instruction format.

2. Smaller is faster. The desire for speed is the reason that MIPS has 32 registers rather than many more.

3. Good design demands good compromises. One MIPS example was the compromise between providing for larger addresses and constants in instructions and keeping all instructions the same length.

Less is more.
Robert Browning, Andrea del Sarto, 1855

[Plot: number of x86 instructions, from 0 to 1000, by year, 1978 – 2012]

FIGURE 2.43 Growth of x86 instruction set over time. While there is clear technical value to some of these extensions, this rapid change also increases the difficulty for other companies to try to build compatible processors.

Inflation of the x86 instruction set over time

The price to pay ( among other things ) for backward compatibility...

Page 8

Page 9

Page 10

Illustration : Intel Haswell i7 core ( 2013 )

die size ≈ 177 mm²
clock rate ≈ 3 GHz

22 nm FinFET technology

number of transistors per die ≈ 1 400 000 000

All Haswell models are designed to support MMX, SSE, SSE2, SSSE3, SSE4.1, SSE4.2, F16C, BMI1 + BMI2, EIST, Intel 64, XD bit, Intel VT-x and Smart Cache

Page 11

How much of a CISC is used ?

One of the interesting results of rising software costs is the increasing reliance on high-level languages. One consequence is that the compiler writer is replacing the assembly language programmer in deciding which instructions the machine will execute. Compilers are often unable to utilize complex instructions, nor do they use the insidious tricks in which assembly language programmers delight. [...]

For example, measurements of a particular IBM 360 compiler found that 10 instructions accounted for 80% of all instructions executed, 16 for 90%, 21 for 95%, and 30 for 99%.

David Patterson and David Ditzel
The Case for the Reduced Instruction Set Computer
ACM Computer Architecture News, 1980

But are you really convinced by the argument ?

Page 12

Growth in clock rates

Page 13

The Power Wall

Figure 1.16 in Patterson & Hennessy

Clock rate and Power for Intel x86 microprocessors over eight generations and 25 years

Page 14

A simple formula for computing the dynamic energy and the dynamic power of a CMOS transistor

Dynamic energy ≈ Capacitive load of the transistor × Voltage²
for a full logic transition 0 → 1 → 0

Dynamic energy ≈ 1/2 × Capacitive load of the transistor × Voltage²
for a single transition 0 → 1 or 1 → 0

Dynamic power ≈ 1/2 × Capacitive load of the transistor × Voltage² × Frequency

Note that there is also static energy consumption in CMOS technology, because of the leakage current that flows even when the transistor is off.

Page 15

Exercise [ from Patterson & Hennessy 1.8 ]

The Pentium 4 Prescott processor, released in 2004, had a clock rate of 3.6 GHz and a voltage of 1.25 V. Assume that, on average, it consumed 10 W of static power and 90 W of dynamic power. The Core i5 Ivy Bridge, released in 2012, had a clock rate of 3.4 GHz and a voltage of 0.9 V. Assume that, on average, it consumed 30 W of static power and 40 W of dynamic power.

a. For each processor, find the average capacitive load.

b. Find the percentage of the total dissipated power comprised by static power, and the ratio of static power to dynamic power for each technology.

c. If the total dissipated power is to be reduced by 10%, how much should the voltage be reduced to maintain the same leakage current? Note: power is defined as the product of voltage and current.
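A sketch of parts (a) and (b), using the dynamic-power formula from the previous page, P_dyn ≈ 1/2 × C × V² × f, inverted as C = 2 × P_dyn / (V² × f); all numbers are the ones stated in the exercise.

```python
def capacitive_load(p_dyn, voltage, freq):
    # from P_dyn = 1/2 * C * V^2 * f
    return 2 * p_dyn / (voltage ** 2 * freq)

# (a) Pentium 4 Prescott: 90 W dynamic, 1.25 V, 3.6 GHz
c_prescott = capacitive_load(90, 1.25, 3.6e9)
#     Core i5 Ivy Bridge: 40 W dynamic, 0.9 V, 3.4 GHz
c_ivy = capacitive_load(40, 0.9, 3.4e9)
print(c_prescott)   # ~3.2e-8 F, i.e. 32 nF
print(c_ivy)        # ~2.9e-8 F, i.e. 29 nF

# (b) static share of the total power, and static/dynamic ratio
static_share_prescott = 10 / (10 + 90)   # 10 %
ratio_prescott = 10 / 90                 # ~0.11
static_share_ivy = 30 / (30 + 40)        # ~43 %
ratio_ivy = 30 / 40                      # 0.75
```

Note how the static share grows dramatically between the two generations: leakage has become a first-order design constraint.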

Page 16

The industry turning to multicore architectures

The computer industry is undergoing, if not another revolution, certainly a vigorous shaking-up. The major chip manufacturers have, for the time being at least, given up trying to make processors run faster. Moore's law has not been repealed: each year, more and more transistors fit into the same space, but their clock speed cannot be increased without overheating. Instead, manufacturers are turning to « multicore » architectures, in which multiple processors ( cores ) communicate directly through shared hardware caches. Multiprocessor chips make computing more effective by exploiting parallelism : harnessing multiple processors to work on a single task.

Maurice Herlihy and Nir Shavit
The Art of Multiprocessor Programming
Morgan Kaufmann Publishers, 2008.

Page 17

The Princeton architecture
also known as the Von Neumann architecture

Program and data are stored in just the same place

• Clever unification of the notion of « memory »

• A price to pay: the interpretation of a value stored in memory now depends on a control signal

A serious risk of mistaking data for code... But at the same time, great for self-modifying code !!!

Purely sequential instruction processing

• Exactly one instruction is processed at a time

• To that purpose, a special register called the program counter pc identifies the position in memory of the current instruction

• The program counter pc is advanced sequentially by every instruction, except in the case of control transfer instructions like goto's or beq's
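The bullets above can be sketched as a toy fetch-execute loop. This is an illustrative mini-ISA invented for the example (not MIPS): the pc advances sequentially unless a control-transfer instruction overwrites it.

```python
def run(memory, pc=0):
    regs = {}
    while True:
        op, *args = memory[pc]          # fetch + decode
        pc += 1                         # sequential advance...
        if op == "li":                  # load immediate
            regs[args[0]] = args[1]
        elif op == "add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "beq":               # ...except on control transfer
            if regs[args[0]] == regs[args[1]]:
                pc = args[2]
        elif op == "goto":
            pc = args[0]
        elif op == "halt":
            return regs

program = [
    ("li", "r1", 2),
    ("li", "r2", 3),
    ("add", "r3", "r1", "r2"),
    ("halt",),
]
print(run(program)["r3"])  # 5
```

Note that program and data live in the same Python process here, but the `memory` list only holds instructions; a faithful Princeton machine would let the program read and overwrite its own encoding.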

Page 18

The Princeton architecture
also known as the Von Neumann architecture

The program counter pc is a special register which contains the address in memory of the current instruction.

[Figure: memory depicted as a column of binary words, with the pc register pointing at the current instruction]

Page 19 – Page 21

[The same slide, repeated with the program counter advanced to each successive instruction]

Page 22

In the case of the MIPS instruction set ...

Add Instruction ( R-format )

opcode 000000 | rs | rt | rd | shamt 00000 | funct 100000

Register [ rd ] = Register [ rs ] + Register [ rt ]

Page 23

In the case of the MIPS instruction set ...

Slt Instruction ( Set on less than, signed ; R-format )

opcode 000000 | rs | rt | rd | shamt 00000 | funct 101010

If $rs is strictly less than $rt, then $rd is set to one. $rd is set to zero otherwise.

Page 24

In the case of the MIPS instruction set ...

Add Immediate Instruction ( I-format )

opcode 001000 | rs | rt | 16-bit immediate

Register [ rt ] = Register [ rs ] + Immediate

Page 25

In the case of the MIPS instruction set ...

BNE Instruction ( Branch on Not Equal ; I-format )

opcode 000101 | rs | rt | 16-bit immediate

Branches if the two registers are not equal and carries on otherwise.

Page 26

In the case of the MIPS instruction set ...

Jump Instruction ( J-format )

opcode 000010 | 26-bit immediate

Jump to the address of memory obtained by keeping the top 4 bits of the pc, followed by the 26-bit immediate, followed by 00.
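This address computation can be sketched in a few lines (a hypothetical helper for illustration, not official MIPS code): the 26-bit immediate is shifted left by 2 because instructions are word-aligned, and the top 4 bits come from the current pc.

```python
def jump_target(pc, imm26):
    # new pc = pc[31:28] | imm26 | 00
    return (pc & 0xF0000000) | (imm26 << 2)

print(hex(jump_target(0x00400008, 0x100000)))  # 0x400000
```

A consequence of this encoding is that a single jump instruction can only reach addresses within the 256 MB region selected by the top 4 bits of the pc.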

Page 27

The instruction cycle of the Princeton architecture

fetch instruction → instruction decode → fetch operands → execute → write back

Key principle : each step of the execution cycle starts only after the previous step has been completed, in a purely sequential order. In particular, one needs to decode the instruction before getting its operands.

Page 28

The Princeton architecture

[Diagram: the Central Processing Unit ( CPU ), made of the Control Unit, the ALU and the Registers, connected by a single Bus to the Main Memory, the Secondary Memory ( Storage ), the Input Devices ( Keyboard, Mouse ) and the Output Devices ( Display, Printer )]

Page 29

The Harvard architecture

Separation of memory into « code memory » and « data memory »

[Diagram: same components as the Princeton architecture, except that the CPU reaches the Code Memory ( typically ROM ) through a dedicated Code Bus, and the Data Memory through a separate Data Bus]

Page 30

The instruction cycle of the Harvard architecture

fetch instruction → instruction decode → fetch operands → execute → write back

Key idea : thanks to the separation between the code bus and the data bus, the instruction and its operands may be fetched from memory at the same step of the instruction cycle !

Disturbing fact : there is no definite « state » of the machine.

Page 31

One step further in parallelism : the data-flow machines

Simple data-flow program extracted from Dennis and Misunas' 1974 paper « A preliminary architecture for a basic data-flow processor », with the corresponding sequential program :

L1 ← a
L2 ← b
L3 ← L1 + L2
L4 ← L3 * L1
L6 ← L4 + L2
L5 ← L3 / L6
x ← L6
y ← L5

Page 32

Data-Flow machines

• A program is not defined as a sequence of instructions but as a graph of instructions -- also called data-flow nodes. The execution is data-driven rather than control-driven.

• Each instruction or data-flow node « waits » for its operands and « fires » as soon as all of them are available. Instruction processing is intrinsically parallel.

• In particular, there is no need for a code pointer pc. There is no precise execution state.
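The firing rule can be illustrated with a toy scheduler (a sketch for intuition, not the Dennis-Misunas design): a node fires as soon as all of its input tokens are present, and since the loop keeps retrying until nothing fires, the order in which the nodes are listed does not matter.

```python
def run_dataflow(nodes, tokens):
    # nodes: name -> (function, list of input wires, output wire)
    # tokens: wire -> value currently present on that wire
    fired = set()
    progress = True
    while progress:
        progress = False
        for name, (fn, inputs, output) in nodes.items():
            if name not in fired and all(i in tokens for i in inputs):
                tokens[output] = fn(*(tokens[i] for i in inputs))
                fired.add(name)
                progress = True
    return tokens

graph = {
    # deliberately listed before the node producing its input
    "mul": (lambda s, a: s * a, ["sum", "a"], "prod"),
    "add": (lambda a, b: a + b, ["a", "b"], "sum"),
}
result = run_dataflow(graph, {"a": 2, "b": 3})
print(result["prod"])  # 10
```

A real data-flow machine would fire ready nodes in parallel rather than scanning them sequentially, but the enabling condition is the same.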

Page 33

Data Flow Nodes

[Diagram: the COPY node duplicates an incoming token ( int or bool ) onto two output wires; the BRANCH node routes an int token to its True or False output according to an incoming bool token; the BARRIER SYNCH node waits for all of its input tokens before producing its output]

Page 34

[The same diagram, with example tokens in flight: ints 5, 7, 4 and the pair 3, True]

Page 35

[The same diagram, one firing step later: the branch has routed the token 3 along its True output]

Page 36

Data Flow Nodes

[Diagram: the RELATION node < consumes two int tokens and produces the bool token comparing them]

Page 37

[The same < node, with input tokens 3 and 5 ready to fire]

Page 38

[After firing: the < node has produced the bool token True]

Page 39

Exercise :

Find out what function this data flow program computes !

[Dataflow graph: an int token enters at IN and circulates through copy, dec, > 0, branch ( True / False ), * nodes and the constant 1, before the resulting int leaves at OUT]

Page 40

Open discussion

Should there be a code pointer in any reasonable Instruction Set Architecture ?

Is it possible/intuitive/safe to program and/or compile in a data-flow architecture ?

And what about debugging a data-flow program ?

This is a very serious and interesting debate !!!

Current trade-off between data-driven and control-driven execution : today, all the major ISAs are based on the Princeton architecture :

• x86 • ARM • MIPS • SPARC • POWER

In contrast, their optimized microarchitectures take full advantage of parallelism :

• pipelined instruction execution • multiple instructions at a time • out-of-order execution

This can be summarized in a slogan : Electricity is parallel but the Programmer's Mind is sequential.

But do you believe yourself in that accepted view ?

Page 41

The underworld of microarchitecture

Architecture : the sequential instruction set, which is what the User/Programmer can see.

Microarchitecture : the parallel implementation, generally not exposed to the User/Programmer.

Page 42

Recall the origins of the word « architecture »

The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logical design, and the physical implementation.

Amdahl, Blaauw, Brooks
Architecture of the IBM System / 360
IBM Journal of Research and Development, April 1964

organization of the data flow = microarchitecture
logical design = digital logic
physical implementation = circuit

Page 43

One apparent difficulty : the notion of internal state of the system is lost

First of all, it may be difficult to guess what state the microarchitecture is in at a given point of the execution of the machine code.

More conceptually, there is the temptation to reason about the result of the execution of machine code independently of the microarchitecture.

This is the direction taken by the so-called « memory models » like

• the Java memory model developed in 1995
• more recently, the C11 memory model.

These memory models are typically defined using partial orders expressing that an instruction « happened before » another one during the execution.

Page 44

Out-of-order execution in a multiprocessor scenario

Consider these two threads and run them in parallel on an x86 or a Power multiprocessor :

Thread 1          Thread 2
x ← 1             y ← 1
r1 ← y            r2 ← x

Suppose also that x and y have value 0 before execution.

Question : how many results of the executions are possible ?

Page 45


Somewhat surprisingly, the correct answer is 4. In particular, the outcome r1 = 0 and r2 = 0 is also possible.

Can you explain what happened?
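One way to see why 4 is surprising: under sequential consistency, only the interleavings that preserve each thread's program order are allowed, and enumerating them (a sketch) yields just three outcomes. The fourth outcome, r1 = 0 and r2 = 0, requires the hardware to reorder each store past the following load (store buffering), which no interleaving can produce.

```python
from itertools import permutations

# instruction = (destination, source); an int source is a constant,
# a str source is a read from shared memory
T1 = [("x", 1), ("r1", "y")]   # Thread 1
T2 = [("y", 1), ("r2", "x")]   # Thread 2

def sc_outcomes():
    prog = T1 + T2             # indices 0,1 belong to T1; 2,3 to T2
    results = set()
    for order in permutations(range(4)):
        pos = {i: order.index(i) for i in range(4)}
        if pos[0] < pos[1] and pos[2] < pos[3]:   # program order kept
            mem = {"x": 0, "y": 0}
            for i in order:                       # execute in sequence
                dst, src = prog[i]
                mem[dst] = mem[src] if isinstance(src, str) else src
            results.add((mem["r1"], mem["r2"]))
    return results

print(sorted(sc_outcomes()))   # [(0, 1), (1, 0), (1, 1)] -- never (0, 0)
```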

Page 46

Any question ?

Thank you !