BL Eloadas1 2prez

8/13/2019 BL Eloadas1 2prez


1

Performance, cost, and energy-efficiency analysis of embedded processor architectures

2

Architecture topics

Instruction Set Architecture

Pipelining, hazard handling, superscalar execution, scheduling, prediction, speculation,

Vectorization, VLIW, DSP, reconfiguration

Addressing, protection mechanisms, exception handling

L1 Cache

L2 Cache

DRAM

Disks, WORM, tape

Coherence, bandwidth, latency

Emerging technologies, interleaving

Bus protocols

RAID

VLSI

Input/output and storage

Memory hierarchy

Pipelining and instruction-level parallelism



3

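Architecture topics

[Figure: processors (P) and memories (M) connected through an interconnection network with a switch (S)]

Interconnection network

Topologies, routing,

Bandwidth, latency, reliability

Network interfaces

Shared memory, message passing, data parallelism

Processor-Memory-Switch

Multiprocessors

Networks and interconnections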

4

The secret of successful architecture design: measurement and evaluation

Design

Analysis

Architecture design is an iterative process: a search through the space of possible designs, with analysis at every level of the embedded system

Creativity

Good ideas

Mediocre ideas

Bad ideas

Cost/performance analysis



5

Design methodology

Simulation of new designs

Technology trends

Identifying bottlenecks in existing systems

Benchmarks

Workloads

Implementing next-generation systems

Implementation complexity analysis

Design

Implementation

6

Measurement tools

Hardware: cost, delay, resources, performance estimation

Benchmarks, traces (execution tracing)

Simulation (at many levels): ISA, RTL, gate, circuit

Queuing theory

Rules of thumb

Fundamental laws/principles



7

Performance, cost, energy

8

Metric 1: Performance

Time to run the task: execution time, response time, latency

Tasks per day, hour, week, sec, ns: throughput, bandwidth

Plane        DC to Paris   Speed      Passengers   Throughput (passenger-mile/hour)
Boeing 747   6.5 hours     610 mph    470          286,700
Concorde     3 hours       1350 mph   132          178,200
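The throughput column can be reproduced as passengers multiplied by speed. The short sketch below only restates the table's numbers; the dictionary layout and the function name `throughput` are illustrative choices, not part of the slides.

```python
# Throughput in passenger-miles per hour = passengers * cruising speed (mph).
# The figures are the ones from the table; the data layout is an assumption.
planes = {
    "Boeing 747": {"speed_mph": 610, "passengers": 470},
    "Concorde": {"speed_mph": 1350, "passengers": 132},
}


def throughput(plane):
    """Return passenger-miles per hour for one plane entry."""
    return plane["passengers"] * plane["speed_mph"]


for name, plane in planes.items():
    print(f"{name}: {throughput(plane):,} passenger-miles/hour")
```

The Concorde wins on latency (3 hours against 6.5), while the 747 wins on throughput, which is why the two metrics have to be kept apart.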



11

Example: Calculating CPI

Base Machine (Reg / Reg), Typical Mix

Op       Freq   CPIi   CPIi*Fi   (% Time)
ALU      50%    1      0.5       (33%)
Load     20%    2      0.4       (27%)
Store    10%    2      0.2       (13%)
Branch   20%    2      0.4       (27%)
Total                  1.5
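The table is a weighted sum: the average CPI is the sum of CPIi*Fi over the instruction classes, and each class's share of execution time is its term divided by that sum. A minimal sketch using the table's mix (the names `mix`, `average_cpi`, and `time_share` are illustrative):

```python
# Instruction mix from the table: (operation, frequency, cycles per instruction).
mix = [
    ("ALU", 0.50, 1),
    ("Load", 0.20, 2),
    ("Store", 0.10, 2),
    ("Branch", 0.20, 2),
]


def average_cpi(mix):
    """Average CPI = sum over classes of frequency * CPI."""
    return sum(freq * cpi for _, freq, cpi in mix)


def time_share(mix):
    """Fraction of execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cpi / total for op, freq, cpi in mix}


print(f"CPI = {average_cpi(mix):.2f}")
for op, share in time_share(mix).items():
    print(f"{op}: {share:.0%} of time")
```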

12

How to Summarize Performance

Arithmetic mean (or weighted arithmetic mean) tracks execution time: $\frac{1}{n}\sum_i T_i$ or $\sum_i W_i T_i$

Harmonic mean (or weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: $n / \sum_i (1/R_i)$ or $1 / \sum_i (W_i / R_i)$, with weights $W_i$ summing to 1

Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10)

The arithmetic mean of normalized times is affected by the choice of reference machine

Use the geometric mean for comparison: $\left(\prod_i T_i\right)^{1/n}$

Independent of the chosen reference machine

but not a good metric for total execution time
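The three means can be compared on a small made-up data set. In the sketch below the execution times for machines A and B are invented purely for illustration; it shows that the arithmetic mean of normalized times changes its verdict with the reference machine, while the geometric mean does not.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def harmonic_mean(rates):
    return len(rates) / sum(1.0 / r for r in rates)


def geometric_mean(values):
    # exp of the mean log avoids overflow in the raw product
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical execution times (seconds) of two programs on two machines.
times_a = [1.0, 1000.0]
times_b = [10.0, 100.0]

b_relative_to_a = [b / a for a, b in zip(times_a, times_b)]  # [10, 0.1]
a_relative_to_b = [a / b for a, b in zip(times_a, times_b)]  # [0.1, 10]

# Arithmetic mean: each machine looks about 5x slower than the other.
print(arithmetic_mean(b_relative_to_a), arithmetic_mean(a_relative_to_b))
# Geometric mean: the same answer regardless of the reference machine.
print(geometric_mean(b_relative_to_a), geometric_mean(a_relative_to_b))
```

The total execution times differ (1001 s against 110 s), which illustrates why the geometric mean does not track total execution time.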



17

Instruction Set Architecture (ISA)

instruction set

software

hardware

18

Evolution of Instruction Sets

Major advances in computer architecture are typically associated with landmark instruction set designs
  Ex: Stack vs GPR (System 360)

Design decisions must take into account:
  technology
  machine organization
  programming languages
  compiler technology
  operating systems
  applications

And they in turn influence these



19

A "Typical" RISC

32-bit fixed-format instructions (3 formats: I, R, J)

32 32-bit GPRs (R0 contains zero; double precision takes a register pair)

3-address, reg-reg arithmetic instructions

Single address mode for load/store: base + displacement, no indirection

Simple branch conditions (based on register values)

Delayed branch

See: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

20

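Example: MIPS (≈ DLX)

Instruction formats (32 bits, bit 31 is the most significant):

Register-Register:    Op [31-26] | Rs1 [25-21] | Rs2 [20-16] | Rd [15-11] | (unlabeled) [10-6] | Opx [5-0]
Register-Immediate:   Op [31-26] | Rs1 [25-21] | Rd [20-16] | immediate [15-0]
Branch:               Op [31-26] | Rs1 [25-21] | Rs2/Opx [20-16] | immediate [15-0]
Jump / Call:          Op [31-26] | target [25-0]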
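Fixed field positions make decoding a matter of shifts and masks. The sketch below assumes the register-immediate and jump layouts read off the figure (Op in bits 31-26, Rs1 in 25-21, Rd in 20-16, immediate in 15-0, target in 25-0); the opcode value in the example word is hypothetical.

```python
def field(word, hi, lo):
    """Return bits hi..lo (inclusive, bit 0 = least significant) of a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_register_immediate(word):
    return {
        "op": field(word, 31, 26),
        "rs1": field(word, 25, 21),
        "rd": field(word, 20, 16),
        "immediate": field(word, 15, 0),
    }


def decode_jump(word):
    return {"op": field(word, 31, 26), "target": field(word, 25, 0)}


# Hypothetical encoding: opcode 0x23, Rs1 = 2, Rd = 1, immediate = 8.
example = (0x23 << 26) | (2 << 21) | (1 << 16) | 8
print(decode_register_immediate(example))
```

Because every format keeps Op in the same six bits, the decoder can pick the format before it looks at any other field.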



21

Pipelining Lessons

Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload

Pipeline rate is limited by the slowest pipeline stage

Multiple tasks operate simultaneously

Potential speedup = number of pipe stages

Unbalanced lengths of pipe stages reduce speedup

Time to fill the pipeline and time to drain it reduce speedup

[Figure: pipeline timing diagram. Tasks A, B, C, D (task order) against time from 6 PM to 9 PM, with intervals of 30, 40, 40, 40, 40, and 20 minutes]
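The figure's intervals (30, then four times 40, then 20 minutes) are consistent with four identical tasks flowing through three stages of 30, 40, and 20 minutes. That stage breakdown is an inference from the figure, not stated on the slide. Under that assumption, the lessons above can be checked numerically:

```python
def sequential_time(stage_minutes, tasks):
    """Each task runs all stages before the next task starts."""
    return tasks * sum(stage_minutes)


def pipelined_time(stage_minutes, tasks):
    """Identical tasks overlapped; the slowest stage sets the rate.

    The first task takes the full latency; every further task finishes
    one slowest-stage interval after its predecessor.
    """
    return sum(stage_minutes) + (tasks - 1) * max(stage_minutes)


stages = [30, 40, 20]  # assumed stage durations in minutes
tasks = 4              # tasks A, B, C, D

seq = sequential_time(stages, tasks)
pipe = pipelined_time(stages, tasks)
print(seq, pipe, round(seq / pipe, 2))
```

The speedup stays below the stage count of 3 because the stages are unbalanced and because filling and draining the pipeline takes time.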

22

5 Steps of DLX Datapath

Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back

[Figure: DLX datapath. Labeled elements: Next PC, Next SEQ PC, Address, instruction Memory, Adder (+4), Inst, RS1, RS2, RD, Imm, Reg File, Sign Extend, Zero?, MUXes, ALU, Data Memory, LMD, WB Data]



31

Data Hazard Even with Forwarding

Instruction order:

lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9

[Figure: pipeline diagram over time (clock cycles) with stages Ifetch, Reg, ALU, DMem, Reg; a bubble appears in each of the three instructions that follow the load]

32

Software Scheduling to Avoid Load Hazards

Try producing fast code for

a = b + c;
d = e - f;

assuming a, b, c, d, e, and f are in memory.

Slow code:

LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd

Fast code:

LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
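The benefit of the reordering can be counted. The sketch below assumes a pipeline with forwarding in which an instruction that uses a loaded value immediately after the load costs one stall cycle. The tuple representation of instructions is an illustrative choice.

```python
def load_use_stalls(program):
    """Count instructions that read a register loaded by the instruction just before.

    Each instruction is (opcode, destination register or None, source registers).
    Assumes forwarding, so only a back-to-back load/use pair stalls (one cycle).
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, dest, _ = previous
        if opcode == "LW" and dest in current[2]:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

print(load_use_stalls(slow), load_use_stalls(fast))
```

Both sequences contain the same eight instructions; only the order differs, so the compiler removes the stalls at no cost in code size.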



33

Control Hazard on Branches: Three Stage Stall

10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11

[Figure: pipeline diagram; each instruction passes through Ifetch, Reg, ALU, DMem, Reg]

34

Branch Stall Impact

If CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles => new CPI = 1.9!

Two-part solution:
  Determine whether the branch is taken or not sooner, AND
  Compute the taken-branch address earlier

DLX branch tests if register = 0 or ≠ 0

DLX solution:
  Move the zero test to the ID/RF stage
  Add an adder to calculate the new PC in the ID/RF stage
  1 clock cycle penalty for a branch versus 3
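The arithmetic behind the 1.9 figure, and the effect of cutting the penalty to one cycle, can be written out as a short sketch (the function name is illustrative):

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles):
    """Effective CPI when every branch adds a fixed number of stall cycles."""
    return base_cpi + branch_fraction * stall_cycles


three_cycle = cpi_with_branch_stalls(1.0, 0.30, 3)  # stall until branch resolves
one_cycle = cpi_with_branch_stalls(1.0, 0.30, 1)    # zero test moved to ID/RF

print(f"3-cycle penalty: CPI = {three_cycle:.1f}")
print(f"1-cycle penalty: CPI = {one_cycle:.1f}")
```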



37

Delayed Branch

Where to get instructions to fill the branch delay slot?
  Before the branch instruction
  From the target address: only valuable when the branch is taken
  From fall-through: only valuable when the branch is not taken
  Cancelling branches allow more slots to be filled

Compiler effectiveness for a single branch delay slot:
  Fills about 60% of branch delay slots
  About 80% of instructions executed in branch delay slots are useful in computation
  About 50% (60% x 80%) of slots usefully filled

Delayed branch downside: 7-8 stage pipelines and multiple instructions issued per clock (superscalar) leave more than one slot to fill

38

Evaluating Branch Alternatives

Scheduling scheme   Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline      3                1.42   3.5                      1.0
Predict taken       1                1.14   4.4                      1.26
Predict not taken   1                1.09   4.5                      1.29
Delayed branch      0.5              1.07   4.6                      1.31

Conditional & unconditional branches = 14% of instructions; 65% of them change the PC

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$
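The CPI column follows from the formula with a branch frequency of 14%. The sketch below assumes a pipeline depth of 5 (matching the five DLX stages) and assumes that "predict not taken" pays its one-cycle penalty only on the 65% of branches that change the PC. Both assumptions are inferred from the table rather than stated on the slide.

```python
PIPELINE_DEPTH = 5        # assumption: 5-stage DLX pipeline
BRANCH_FREQUENCY = 0.14   # conditional + unconditional branches
TAKEN_FRACTION = 0.65     # share of branches that change the PC

# Average penalty in cycles per branch for each scheme.
penalties = {
    "Stall pipeline": 3.0,
    "Predict taken": 1.0,
    "Predict not taken": 1.0 * TAKEN_FRACTION,  # assumed: pay only when taken
    "Delayed branch": 0.5,
}


def branch_cpi(penalty):
    return 1.0 + BRANCH_FREQUENCY * penalty


def pipeline_speedup(penalty):
    return PIPELINE_DEPTH / branch_cpi(penalty)


for scheme, penalty in penalties.items():
    print(f"{scheme}: CPI {branch_cpi(penalty):.2f}, "
          f"speedup {pipeline_speedup(penalty):.2f}")
```

The computed CPI values reproduce the table's CPI column; the computed speedups land close to the table's figures, which appear to have been rounded differently.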



39

Summary 2

Just overlap tasks; easy if tasks are independent

Speedup ≤ Pipeline depth; if ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

Hazards limit performance on computers:
  Structural: need more HW resources
  Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  Control: delayed branch, prediction

40

PowerPC Architecture



41

Introduction

o PowerPC (Performance Optimization With Enhanced RISC - Performance Computing) is a RISC architecture created by the AIM (Apple-IBM-Motorola) alliance in 1991.

o The original idea for the PowerPC architecture came from IBM's POWER architecture (introduced in the RISC System/6000), and PowerPC retains a high level of compatibility with it.

o The intention was to build a high-performance, superscalar, low-cost processor.

42

History

o The history of the PowerPC began with IBM's 801 prototype chip, built on John Cocke's (IBM Watson Research Lab) RISC ideas in the late 1970s (with further refinements developed by David Patterson).

o 801-based cores were used in a number of IBM embedded products, eventually becoming the 16-register ROMP processor used in the IBM RT, a computer workstation by IBM. (ROMP, the Research Office Products Division Micro Processor, was a 10 MHz RISC microprocessor designed by IBM in the early 1980s.)

o The RT had disappointing performance, and IBM started a project to build the fastest processor on the market. The result was the POWER architecture, introduced with the RISC System/6000 in early 1990.



43

History: POWER architecture

The POWER architecture incorporated many RISC characteristics:
  fixed-length instructions
  register-to-register architecture
  simple addressing modes
  large general register file
  three-operand instruction format

Additionally, it has other features more characteristic of more complex ISAs.

44

POWER Architecture

o Designed to be superscalar: instructions are dispatched across three independent units (branch, fixed-point arithmetic, and floating-point). This allows out-of-order execution.

o Compound instructions: a load or store can update the base register with the newly calculated effective address, eliminating the extra add instructions otherwise required to increment the index for array traversals.

o Does not implement delayed branches: instead, the POWER architecture uses a branch target buffer and the now well-known branch folding technique.

o Branching technique: the POWER architecture has eight condition registers that are set by compare instructions. One additional bit in the opcode of each instruction signaled that the instruction should be executed only under certain conditions, a form of predicated execution.



45

Shortfalls

o The original POWER microprocessor, one of the first superscalar RISC implementations, was a high-performance, multi-chip design.

o IBM soon realized that they would need a single-chip microprocessor to scale their RS/6000 line from low-end to high-end machines.

o Work began on a single-chip POWER microprocessor, called the RSC (RISC Single Chip). In early 1991 IBM realized that their design could potentially become a high-volume microprocessor used across the industry.

46

PowerPC Architecture

o In order to maintain RS/6000 software compatibility, the PowerPC adapted the POWER architecture, and many enhancements were added to provide a low-cost, single-chip, superscalar, multiprocessor-capable, 64-bit processor.
    Several bit/field instructions that use three source operands were eliminated to avoid the need for extra register ports.
    Complex string instructions were left out, consistent with the RISC philosophy.
    Instructions whose operation was dependent on the value of a source operand were eliminated.
    Precision shifts, integer multiplies, and divide-with-remainder instructions were omitted.
    Support for operation in both big-endian and little-endian modes
    Single- and double-precision floating-point arithmetic
    64-bit architecture, backward compatible with 32-bit



47

PowerPC family

o PowerPC 601: medium-sized, medium-performance processor
    Includes a more sophisticated branch unit
    Capable of dispatching three out-of-order instructions per cycle
    Up to 8 instructions per cycle can be fetched directly into an eight-entry instruction queue (IQ), where they are decoded before being dispatched to the execution core.

    Branch folding:
    The instruction queue is used for detecting and dealing with branches. The branch unit scans the bottom four entries of the queue, identifying branch instructions and determining what type they are (conditional, unconditional).
    When the branch unit has enough information to resolve the branch immediately (an unconditional branch, or a conditional branch whose condition depends on information that is already in the condition register), the branch instruction is simply deleted from the instruction queue and replaced with the instruction located at the branch target.

o PowerPC 603: smaller die size than the 601
    Smaller cache
    Capable of dispatching three out-of-order instructions per cycle

48

Current Status

PowerPC e200: 32-bit Power Architecture microprocessor, speeds ranging up to 600 MHz, ideal for embedded applications.

PowerPC e300: similar to the e200, with speeds increased up to 667 MHz.

PowerPC e600: speeds up to 2 GHz, ideal for high-performance routing and telecommunications applications.

POWER5: IBM dual-core microprocessor.

POWER6: IBM dual-core microprocessor. A notable difference from POWER5 is that the POWER6 executes instructions in-order instead of out-of-order.

PowerPC G3: used in Apple Macintosh computers such as the PowerBook G3, the multicolored iMacs, iBooks, and several desktops, including both the Beige and the Blue and White Power Macintosh G3s.

PowerPC G4: a designation used by Apple Computer to describe a fourth generation of 32-bit PowerPC microprocessors.

PowerPC G5: 64-bit Power Architecture processors.

Xenon: based on IBM's PowerPC ISA; used in the Xbox 360 game console.

Broadway: based on IBM's PowerPC ISA; used in the Nintendo Wii gaming console.

Blue Gene/L: dual-core PowerPC 440, 700 MHz, 2004

Blue Gene/P: quad-core PowerPC 450, 850 MHz, 2007



51

PowerPC Registers

PowerPC's application-level registers are broken into three categories: general-purpose, floating-point, and special-purpose registers.

o General-purpose registers (GPRs), r0 to r31: a flat scheme of 32 general-purpose registers.
    Source and destination for all integer operations
    Address source for all load/store operations
    They also provide access to SPRs.
    All GPRs are available for use with one exception: in certain instructions, GPR0 simply means the value 0, and no lookup is done for GPR0's contents.

o Some of these registers have special tasks assigned to them:
    r0        Volatile register which may be modified during function linkage
    r1        Stack frame pointer, always valid
    r2        System-reserved register
    r3-r4     Volatile registers used for parameter passing and return values
    r5-r10    Volatile registers used for parameter passing
    r11-r12   Volatile registers which may be modified during function linkage
    r13       Small data area pointer register
    r14-r30   Registers used for local variables
    r31       Used for local variables or "environment pointers"

52

Floating-point registers

o Floating-point registers (FPRs), fr0 to fr31:
    32 floating-point registers with 64-bit precision
    Source and destination operands of all floating-point operations
    Can contain 32-bit and 64-bit signed and unsigned integer values, as well as single-precision and double-precision floating-point values.
    FPRs also provide access to the FPSCR (Floating-Point Status and Control Register).
    The FPSCR captures status and exceptions resulting from floating-point operations, and also provides control bits for enabling specific exception types.
    Instructions that load and store double-precision floating-point numbers transfer 64 bits of data without conversion.
    Instructions that load single-precision floating-point numbers from memory convert them to double-precision format before storing them in the register.

    f0        Volatile register
    f1        Volatile register used for parameter passing and return values
    f2-f8     Volatile registers used for parameter passing
    f9-f13    Volatile registers
    f14-f31   Registers used for local variables



53

Special-purpose registers (SPRs)

The Fixed-Point Exception Register (XER): used for indicating conditions for integer operations, such as carries and overflows.

The Floating-Point Status and Control Register (FPSCR): 32-bit register used to store the status and control of floating-point operations.

The Count Register (CTR): used to hold a loop count that can be decremented during the execution of branch instructions.

The Condition Register (CR): 32-bit register grouped into eight fields, where each field is 4 bits that signify the result of an instruction's operation: Equal (EQ), Greater Than (GT), Less Than (LT), and Summary Overflow (SO).

The Link Register (LR): contains the address to return to at the end of a function call.
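The eight 4-bit CR fields can be pulled out with a shift and a mask. The sketch below assumes field 0 occupies the most significant four bits (consistent with PowerPC numbering, where bit 0 is the most significant bit) and that the bits within a field are ordered LT, GT, EQ, SO from most to least significant. The slide lists the four flags but not their order, so that ordering is stated here as an assumption.

```python
CR_FLAG_NAMES = ("LT", "GT", "EQ", "SO")  # assumed order, most significant first


def cr_field(cr, n):
    """Return the 4-bit value of CR field n (0..7), field 0 being the top nibble."""
    return (cr >> (4 * (7 - n))) & 0xF


def cr_flags(cr, n):
    """Decode CR field n into named boolean flags."""
    value = cr_field(cr, n)
    return {name: bool(value & (0x8 >> i)) for i, name in enumerate(CR_FLAG_NAMES)}


cr = 0x20000004  # illustrative value: field 0 = 0b0010, field 7 = 0b0100
print(cr_flags(cr, 0))  # EQ set in field 0
print(cr_flags(cr, 7))  # GT set in field 7
```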

54

Data Types

Either little-endian or big-endian byte order can be used.

Fixed-point data types include:
o Unsigned byte: 8 bits
o Unsigned halfword: 16 bits
o Signed halfword: 16 bits
o Unsigned word: 32 bits
o Signed word: 32 bits
o Unsigned doubleword: 64 bits
o Byte strings: from 0 to 128 bytes in length

2's complement is used for negative values

Floating-point data formats:
  single-precision, 32 bits long (23 + 8 + 1)
  double-precision, 64 bits long (52 + 11 + 1)

Characters are stored using 8-bit ASCII codes
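The (23 + 8 + 1) and (52 + 11 + 1) splits are the fraction, exponent, and sign widths of the two floating-point formats. The sketch below assumes the standard IEEE 754 layout (sign in the top bit, then exponent, then fraction) and uses Python's `struct` module to expose the fields.

```python
import struct


def single_fields(x):
    """Split a single-precision value into (sign, exponent, fraction): 1 + 8 + 23 bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def double_fields(x):
    """Split a double-precision value into (sign, exponent, fraction): 1 + 11 + 52 bits."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


print(single_fields(-1.5))  # sign 1, biased exponent 127, top fraction bit set
print(double_fields(1.0))   # sign 0, biased exponent 1023, fraction 0
```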



55

Instruction types

56

Instruction Format

All instruction encodings are 32 bits in length.

Bit numbering for PowerPC is the opposite of most other definitions: bit 0 is the most significant bit, and bit 31 is the least significant bit.

Instructions are first decoded by the upper 6 bits, a field called the primary opcode. The remaining 26 bits contain fields for operand specifiers, immediate operands, and extended opcodes, and these may be reserved bits or fields.

Common instruction formats:

Format    0-5    6-10      11-15     16-20   21-25   26-29     30   31
D-form    opcd   tgt/src   src/tgt   immediate
X-form    opcd   tgt/src   src/tgt   src     extended opcd
A-form    opcd   tgt/src   src/tgt   src     src     extended opcd  Rc
BD-form   opcd   BO        BI        BD                        AA   LK
I-form    opcd   LI                                            AA   LK
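Because bit 0 is the most significant bit, extracting a field means counting from the left. The sketch below decodes the D-form and I-form rows of the table; it assumes the D-form immediate spans bits 16-31 and the I-form LI field spans bits 6-29, which is how the table columns read. The helper names are illustrative.

```python
def ppc_bits(word, first, last):
    """Return bits first..last (inclusive) using PowerPC numbering, bit 0 = MSB."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)


def decode_d_form(word):
    return {
        "opcd": ppc_bits(word, 0, 5),
        "tgt_src": ppc_bits(word, 6, 10),
        "src_tgt": ppc_bits(word, 11, 15),
        "immediate": ppc_bits(word, 16, 31),  # assumed span
    }


def decode_i_form(word):
    return {
        "opcd": ppc_bits(word, 0, 5),
        "LI": ppc_bits(word, 6, 29),  # assumed span
        "AA": ppc_bits(word, 30, 30),
        "LK": ppc_bits(word, 31, 31),
    }


# 0x38640005 encodes addi r3,r4,5 (primary opcode 14).
print(decode_d_form(0x38640005))
# 0x48000101: primary opcode 18, LI = 0x40, relative branch that sets the link register.
print(decode_i_form(0x48000101))
```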



57

Instruction format

D-form: provides up to two registers as source operands, one immediate source, and up to two registers as target operands. Some variations of this instruction format use portions of the target and source register operand specifiers as immediate fields or as extended opcodes.

X-form: provides up to two registers as source operands and up to two target operands. Some variations of this instruction format use portions of the target and source operand specifiers as immediate fields or as extended opcodes.

A-form: provides up to three registers as source operands, and one target operand. Some variations of this instruction format use portions of the target and source operand specifiers as immediate fields or as extended opcodes.

BD-form: conditional branch instruction. The BO field specifies the type of condition; the BI field specifies which CR bit is used as the condition; the BD field is used as the branch displacement. The AA bit specifies whether the branch is an absolute or relative branch. The LK bit specifies whether the address of the next sequential instruction is saved in the Link Register as a return address for a subroutine call.

I-form: used by the unconditional branch instruction. Being unconditional, the BO and BI fields of the BD format are exchanged for additional branch displacement to form the LI instruction field. This instruction format also supports the AA and LK bits in the same fashion as the BD format.

Simplified PowerPC instruction set: http://pds.twi.tudelft.nl/vakken/in1200/labcourse/instruction-set/


58

PowerPC Addressing Modes

Load/store architecture
  Indirect
    Instruction includes a 16-bit displacement to be added to the base register (may be a GP register)
    Can replace the base register content with the new address
  Indirect indexed
    Instruction references a base register and an index register (both may be GP registers)
    EA is the sum of their contents

Branch address (target address calculation)
  Absolute: TA = actual address
  Relative: TA = current instruction address + displacement (25 bits, signed)
  Indirect
    Link Register: TA = (LR)
    Count Register: TA = (CTR)

Arithmetic
  Operands in registers or part of the instruction
  Floating point is register only
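The load/store and branch address calculations above can be sketched in a few lines. The register file is modeled as a plain dictionary, addresses are assumed to be 32 bits wide, and all function names are illustrative. The branch helper takes an already sign-extended displacement.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit effective addresses


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def ea_indirect(regs, base, disp16, update=False):
    """EA = base register + signed 16-bit displacement; optionally write EA back."""
    ea = (regs[base] + sign_extend(disp16, 16)) & MASK32
    if update:
        regs[base] = ea  # the "replace base register content" variant
    return ea


def ea_indexed(regs, base, index):
    """EA = base register + index register."""
    return (regs[base] + regs[index]) & MASK32


def branch_target(current_address, displacement, absolute):
    """Absolute: TA is the given address. Relative: TA = current address + displacement."""
    if absolute:
        return displacement & MASK32
    return (current_address + displacement) & MASK32


regs = {1: 0x1000, 2: 0x20}
print(hex(ea_indexed(regs, 1, 2)))
print(hex(ea_indirect(regs, 1, 0xFFFC, update=True)), hex(regs[1]))
print(hex(branch_target(0x2000, -8, absolute=False)))
```

The update variant is what lets a loop walk an array without a separate add instruction for the pointer.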



61

PowerPC G4e Pipeline Stages

Stages 1 and 2 - Instruction Fetch:

These two stages are both dedicated primarily to grabbing an instruction from the L1 cache.

The G4e can fetch four instructions per clock cycle from the L1 cache and send them on to the next stage.

Stage 3 - Decode/Dispatch:

Once an instruction has been fetched, it goes into a 12-entry instruction queue to be decoded.

The G4e's decoder can dispatch up to three instructions per clock cycle to the next stage.

62


Stage 4 - Issue:

Dispatched instructions wait in one of three issue queues.

The first queue is the Floating-Point Issue Queue (FIQ), which holds floating-point (FP) instructions that are waiting to be executed.

The second is the Vector Issue Queue (VIQ), which holds vector operations.

The third queue is the General Instruction Queue (GIQ), which holds everything else.

Once an instruction leaves its issue queue, it goes to the execution engine to be executed.



63


Stage 5 - Execute:

Instructions can pass out-of-order from their issue queues into their respective functional units and be executed.

Stages 6 and 7 - Complete and Write-Back:

In these two stages, the instructions are put back into the order in which they came into the processor, and their results are written back to memory.

64

Design principles

Simplicity favors regularity
  Standard 32-bit instruction format for all instructions
  Fixed-length instructions
  Register-to-register architecture
  Three-operand instruction format

Smaller is faster
  Three categories of registers, each handling specific instructions, so presumably faster access time

Make the common case fast
  Integer and floating-point instructions

Good design demands good compromises
  To align with RISC principles, many instructions that required three source operands were eliminated
  Many complex instructions were curtailed to conform to RISC principles, but this is compensated by a large number of mnemonics that increase the number of instructions



65

Pros and Cons

Instruction set
  200 machine instructions
  More complex than most RISC machines
    e.g. floating-point multiply-and-add instructions that take three input operands
    e.g. load and store instructions may automatically update the index register to contain the just-computed target address

Pipelined execution
  More sophisticated than SPARC

Input and output: two different modes
  Direct-store segment: maps the virtual address space to an external address space
  Normal virtual memory access

Permits a range of implementations, from low-cost controllers through high-performance processors.
