ing-kovacs-levente-kalman
8/13/2019 BL Eloadas1 2prez
1
Performance, Cost, and Energy-Efficiency Analysis of Embedded Processor Architectures
2
Architecture topics
Instruction Set Architecture: addressing, protection mechanisms, exception handling
Pipelining and instruction-level parallelism: pipelining, hazard handling, superscalar execution, rescheduling, prediction, speculation, vectorization, VLIW, DSP, reconfiguration
Memory hierarchy: L1 cache, L2 cache, DRAM; coherence, bandwidth, latency; emerging technologies, interleaving
Inputs/outputs and storage: disks, WORM, tape; bus protocols; RAID
VLSI
3
Architecture topics
Multiprocessors, networks, and interconnections
[Figure: processor (P) and memory (M) nodes attached through switches (S) to an interconnection network]
Processor-memory-switch organization
Shared memory, message passing, data parallelism
Network interfaces
Topologies, routing, bandwidth, latency, reliability
4
The secret of successful architecture design: measurement and evaluation
Architecture design is an iterative process:
  Searching the space of possible designs
  Analysis at every level of the embedded system
[Figure: design/analysis loop. Creativity feeds the design step; cost/performance analysis sorts the results into good ideas, average ideas, and bad ideas]
5
Design methodology
[Figure: design cycle with the stages analysis, design, and implementation]
Identify bottlenecks in existing systems (inputs: benchmarks, workloads)
Simulate new designs (input: technology trends)
Implement next-generation systems (input: implementation complexity)
6
Measurement tools
Hardware: cost, delay, resources, performance estimation
Benchmarks, traces (execution tracing)
Simulation at many levels: ISA, RTL, gate, circuit
Queuing theory
Rules of thumb
Fundamental laws and principles
7
Performance, cost, energy
8
Metric 1: Performance
Time to run the task: execution time, response time, latency
Tasks per day, hour, week, sec, ns: throughput, bandwidth

Plane        Speed      DC to Paris   Passengers   Throughput (passenger-mile/hour)
Boeing 747   610 mph    6.5 hours     470          286,700
Concorde     1350 mph   3 hours       132          178,200
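The throughput column of the plane comparison can be reproduced as passengers times speed. The following is a minimal sketch; the function and variable names are illustrative and not part of the slides.

```python
def throughput(passengers: int, speed_mph: float) -> float:
    """Passenger-miles per hour: how much 'work' the plane moves per unit time."""
    return passengers * speed_mph

# (passengers, speed in mph) taken from the comparison table
planes = {"Boeing 747": (470, 610), "Concorde": (132, 1350)}

results = {name: throughput(p, s) for name, (p, s) in planes.items()}
for name, value in results.items():
    print(f"{name}: {value:,.0f} passenger-mile/hour")
```

The Concorde wins on latency (3 hours versus 6.5 hours), while the 747 wins on throughput, which is the distinction the slide is drawing.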
11
Example: Calculating CPI
Typical mix on the base machine (Reg / Reg)

Op       Freq   CPIi   CPIi*Fi   (% Time)
ALU      50%    1      0.5       (33%)
Load     20%    2      0.4       (27%)
Store    10%    2      0.2       (13%)
Branch   20%    2      0.4       (27%)
Total                  1.5
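The CPI example can be checked with a few lines of arithmetic: the average CPI is the frequency-weighted sum of the per-class CPIs, and each class's share of time is its contribution divided by that sum. This is a sketch with illustrative names, using the mix from the table.

```python
# Instruction mix from the example: class -> (frequency, cycles per instruction)
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

def average_cpi(mix):
    """Weighted sum of CPI_i * F_i over all instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())

def time_share(mix):
    """Fraction of execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cpi / total for op, (freq, cpi) in mix.items()}

print(f"CPI = {average_cpi(mix):.2f}")
for op, share in time_share(mix).items():
    print(f"{op}: {share:.0%} of time")
```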
12
How to Summarize Performance
Arithmetic mean (or weighted arithmetic mean) tracks execution time: $\frac{1}{n}\sum_i T_i$ or $\sum_i W_i T_i$
Harmonic mean (or weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: $n / \sum_i (1/R_i)$ or $1 / \sum_i (W_i/R_i)$, with the weights $W_i$ summing to 1
Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10)
  Arithmetic mean is affected by the choice of reference machine
Use the geometric mean for comparison: $\left(\prod_i T_i\right)^{1/n}$
  Independent of the chosen reference machine
  but not a good metric for total execution time
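The difference between the means is easier to see with numbers. The sketch below uses made-up execution times for two hypothetical machines A and B (they are not from the slides) to show that the arithmetic mean of normalized times depends on the reference machine, while the ratio of geometric means does not.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def harmonic_mean(rates):
    return len(rates) / sum(1.0 / r for r in rates)

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def normalized(times, reference):
    """Execution times divided by the reference machine's times."""
    return [t / r for t, r in zip(times, reference)]

# Hypothetical execution times (seconds) for two programs on two machines.
times_a = [1.0, 1000.0]
times_b = [10.0, 100.0]

# Arithmetic mean of normalized times: each machine looks slower
# when the other one is chosen as the reference.
am_b_vs_a = arithmetic_mean(normalized(times_b, times_a))
am_a_vs_b = arithmetic_mean(normalized(times_a, times_b))

# Geometric mean: the B/A ratio is the same for either reference.
gm_ratio_ref_a = (geometric_mean(normalized(times_b, times_a))
                  / geometric_mean(normalized(times_a, times_a)))
gm_ratio_ref_b = (geometric_mean(normalized(times_b, times_b))
                  / geometric_mean(normalized(times_a, times_b)))
```

Note that machine B needs 110 s in total against 1001 s for machine A, yet the geometric means are equal. This is the sense in which the geometric mean is not a good metric for total execution time.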
17
Instruction Set Architecture (ISA)
[Figure: the instruction set as the interface layer between software (above) and hardware (below)]
18
Evolution of Instruction Sets
Major advances in computer architecture are typically associated with landmark instruction set designs
  Ex: Stack vs GPR (System 360)
Design decisions must take into account:
  technology
  machine organization
  programming languages
  compiler technology
  operating systems
  applications
And they in turn influence these
19
A "Typical" RISC
32-bit fixed-format instructions (3 formats: I, R, J)
32 32-bit GPRs (R0 contains zero; double-precision values take a register pair)
3-address, reg-reg arithmetic instructions
Single address mode for load/store: base + displacement, no indirection
Simple branch conditions (based on register values)
Delayed branch
see: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
20
Example: MIPS (≈ DLX)

Format               31-26   25-21   20-16     15-11   10-0
Register-Register    Op      Rs1     Rs2       Rd      Opx
Register-Immediate   Op      Rs1     Rd        immediate (15-0)
Branch               Op      Rs1     Rs2/Opx   immediate (15-0)
Jump / Call          Op      target (25-0)
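Because every format is 32 bits wide with the opcode in bits 31-26, field extraction reduces to shifts and masks. The following sketch decodes the four formats listed above; the format tags ("R", "I", "B", "J"), the dictionary keys, and the opcode value in the usage line are illustrative assumptions, not DLX definitions.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Return bits hi..lo of a 32-bit word (bit 0 is the least significant)."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode(word: int, fmt: str) -> dict:
    """Split an instruction word into the fields of the given format."""
    fields = {"op": bits(word, 31, 26)}
    if fmt == "R":    # register-register
        fields.update(rs1=bits(word, 25, 21), rs2=bits(word, 20, 16),
                      rd=bits(word, 15, 11), opx=bits(word, 10, 0))
    elif fmt == "I":  # register-immediate
        fields.update(rs1=bits(word, 25, 21), rd=bits(word, 20, 16),
                      imm=bits(word, 15, 0))
    elif fmt == "B":  # branch
        fields.update(rs1=bits(word, 25, 21), rs2_opx=bits(word, 20, 16),
                      imm=bits(word, 15, 0))
    elif fmt == "J":  # jump / call
        fields.update(target=bits(word, 25, 0))
    else:
        raise ValueError(f"unknown format: {fmt}")
    return fields

# Hypothetical register-register word: op=5, rs1=1, rs2=2, rd=3, opx=7
word = (5 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | 7
print(decode(word, "R"))
```

The fixed field positions are what make the decoder this simple: the register specifiers can be read before the opcode has been fully interpreted.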
21
Pipelining Lessons
Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to fill the pipeline and time to drain it reduce speedup
[Figure: tasks A-D in task order plotted against time, 6 PM to 9 PM; segment durations 30, 40, 40, 40, 40, 20]
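The lessons above can be checked with a small timing model. The sketch below assumes three stages of 30, 40, and 20 time units and four tasks, an assumption inferred from the segment durations in the figure (30, 40, 40, 40, 40, 20); the function names are illustrative.

```python
def sequential_time(stage_times, n_tasks):
    """Total time when each task finishes all stages before the next starts."""
    return n_tasks * sum(stage_times)

def pipelined_time(stage_times, n_tasks):
    """Total time when a task enters a stage as soon as both the task
    and the stage are free."""
    stage_free_at = [0] * len(stage_times)
    for _ in range(n_tasks):
        task_ready_at = 0
        for s, duration in enumerate(stage_times):
            start = max(task_ready_at, stage_free_at[s])
            stage_free_at[s] = start + duration
            task_ready_at = stage_free_at[s]
    return stage_free_at[-1]

stages = [30, 40, 20]   # assumed stage durations
print(sequential_time(stages, 4))   # 360
print(pipelined_time(stages, 4))    # 210
```

With these numbers the speedup is 360/210, about 1.7, rather than the potential 3: the 40-unit stage sets the rate, and filling and draining the pipeline add the leading 30 and trailing 20 units.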
22
5 Steps of DLX Datapath
  1. Instruction Fetch
  2. Instr. Decode / Reg. Fetch
  3. Execute / Addr. Calc
  4. Memory Access
  5. Write Back
[Figure: DLX datapath with instruction memory, register file (RS1, RS2, RD), sign extend (Imm), ALU with Zero? test, data memory, LMD, multiplexers, an adder producing Next SEQ PC (PC + 4), Next PC selection, and WB Data]
31
Data Hazard Even with Forwarding
Instruction order:
  lw  r1, 0(r2)
  sub r4, r1, r6
  and r6, r1, r7
  or  r8, r1, r9
[Figure: pipeline diagram over time (clock cycles) with stages Ifetch, Reg, ALU, DMem, Reg; a bubble is inserted in sub, and, and or after the load]
32
Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd

Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd
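The benefit of the reordered sequence can be counted mechanically. The sketch below assumes a pipeline with forwarding in which a load followed immediately by an instruction that reads the loaded register costs one stall cycle; the tuple encoding of the instructions is illustrative.

```python
def load_use_stalls(program):
    """Count stalls caused by an instruction that reads a register
    loaded by the instruction directly before it.
    Each instruction is (opcode, destination register, source registers)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        opcode, dest, _ = prev
        if opcode == "LW" and dest in cur[2]:
            stalls += 1
    return stalls

slow = [("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
        ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
        ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",))]

fast = [("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
        ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
        ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",))]

print(load_use_stalls(slow), load_use_stalls(fast))
```

Under this model the slow sequence stalls twice (before ADD and before SUB), while the fast sequence places an independent instruction after each critical load and does not stall.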
33
Control Hazard on Branches: Three Stage Stall
  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11
[Figure: pipeline diagram with stages Ifetch, Reg, ALU, DMem, Reg for each of the five instructions]
34
Branch Stall Impact
If CPI = 1, 30% branch, stall 3 cycles => new CPI = 1.9!
Two-part solution:
  Determine branch taken or not sooner, AND
  Compute taken branch address earlier
DLX branch tests if register = 0 or ≠ 0
DLX solution:
  Move zero test to ID/RF stage
  Adder to calculate new PC in ID/RF stage
  1 clock cycle penalty for branch versus 3
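The CPI figure on this slide follows from adding the average stall cycles per instruction to the base CPI. A minimal sketch, with an illustrative function name:

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, penalty_cycles):
    """Base CPI plus the stall cycles contributed by branches."""
    return base_cpi + branch_fraction * penalty_cycles

three_cycle = cpi_with_branch_stalls(1.0, 0.30, 3)  # resolve branch late
one_cycle = cpi_with_branch_stalls(1.0, 0.30, 1)    # resolve branch in ID/RF
print(f"{three_cycle:.1f} vs {one_cycle:.1f}")
```

Moving the zero test and the target adder into the ID/RF stage cuts the penalty from 3 cycles to 1, which in this example lowers the CPI from 1.9 to 1.3.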
37
Delayed Branch
Where to get instructions to fill the branch delay slot?
  Before the branch instruction
  From the target address: only valuable when branch taken
  From fall through: only valuable when branch not taken
  Cancelling branches allow more slots to be filled
Compiler effectiveness for a single branch delay slot:
  Fills about 60% of branch delay slots
  About 80% of instructions executed in branch delay slots are useful in computation
  About 50% (60% x 80%) of slots usefully filled
Delayed branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
38
Evaluating Branch Alternatives
Scheduling scheme   Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline      3                1.42   3.5                      1.0
Predict taken       1                1.14   4.4                      1.26
Predict not taken   1                1.09   4.5                      1.29
Delayed branch      0.5              1.07   4.6                      1.31

Conditional & unconditional branches = 14% of instructions; 65% change the PC

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$
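The CPI column can be recomputed from the branch frequency and the per-scheme penalty. The sketch below assumes a pipeline depth of 5 (consistent with the five-step DLX datapath, but not stated on this slide) and assumes that predict-not-taken pays its penalty only on the 65% of branches that change the PC. The last digit of some computed speedups may differ from the table, which appears to derive its entries from rounded or truncated values.

```python
PIPELINE_DEPTH = 5       # assumption: five-stage pipeline
BRANCH_FREQ = 0.14       # conditional and unconditional branches
TAKEN_FRACTION = 0.65    # fraction of branches that change the PC

# Average penalty cycles per branch for each scheme.
penalty_per_branch = {
    "Stall pipeline": 3.0,
    "Predict taken": 1.0,
    "Predict not taken": 1.0 * TAKEN_FRACTION,  # only taken branches pay
    "Delayed branch": 0.5,
}

def branch_cpi(penalty):
    return 1.0 + BRANCH_FREQ * penalty

def pipeline_speedup(penalty):
    return PIPELINE_DEPTH / branch_cpi(penalty)

cpi = {name: branch_cpi(p) for name, p in penalty_per_branch.items()}
speedup = {name: pipeline_speedup(p) for name, p in penalty_per_branch.items()}

for name in penalty_per_branch:
    print(f"{name:18s} CPI {cpi[name]:.2f}  speedup {speedup[name]:.2f}")
```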
39
Summary 2
Just overlap tasks; easy if tasks are independent
Speedup ≤ pipeline depth; if ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

Hazards limit performance on computers:
  Structural: need more HW resources
  Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  Control: delayed branch, prediction
40
PowerPC Architecture
41
Introduction
o PowerPC (Performance Optimization With Enhanced RISC Performance Computing) is a RISC architecture created by the Apple-IBM-Motorola (AIM) alliance in 1991.
o The original idea for the PowerPC architecture came from IBM's POWER architecture (introduced in the RISC/6000), and PowerPC retains a high level of compatibility with it.
o The intention was to build a high-performance, superscalar, low-cost processor.
42
History
o The history of the PowerPC began with IBM's 801 prototype chip, built on the RISC ideas of John Cocke (IBM Watson Research Lab) in the late 1970s, with further refinements developed by David Patterson.
o 801-based cores were used in a number of IBM embedded products, eventually becoming the 16-register ROMP processor (Research Office Products Division Micro Processor, a 10 MHz RISC microprocessor designed by IBM in the early 1980s) used in the IBM RT, a computer workstation by IBM.
o The RT had disappointing performance, and IBM started a project to build the fastest processor on the market. The result was the POWER architecture, introduced with the RISC System/6000 in early 1990.
43
History: POWER architecture
The POWER architecture incorporated many RISC characteristics:
  fixed-length instructions
  register-to-register architecture
  simple addressing modes
  large general register file
  three-operand instruction format
Additionally, it has other features more characteristic of more complex ISAs.
44
Power Architecture
o Designed to be superscalar: instructions are dispatched across three independent units (branch, fixed-point arithmetic, and floating-point). This allows out-of-order execution.
o Compound instructions: a load or store can update the base register with the newly calculated effective address, eliminating the extra add instructions otherwise required to increment the index for array traversals.
o Does not implement delayed branches: instead, the POWER architecture uses a branch target buffer and the now well-known branch folding technique.
o Branching technique: the POWER architecture has eight condition registers that are set by compare instructions. One additional bit in the opcode of each instruction signaled that the instruction should be executed only under certain conditions, a form of predicated execution.
45
Shortfalls
o The original POWER microprocessor, one of the first superscalar RISC implementations, was a high-performance, multi-chip design.
o IBM soon realized that they would need a single-chip microprocessor to scale their RS/6000 line from lower-end to high-end machines.
o Work began on a single-chip POWER microprocessor, called the RSC (RISC Single Chip). In early 1991 IBM realized that their design could potentially become a high-volume microprocessor used across the industry.
46
PowerPC Architecture
o To maintain RS/6000 software compatibility, the PowerPC adapted the POWER architecture, and many enhancements were added to provide a low-cost, single-chip, superscalar, multiprocessor-capable, 64-bit processor.
  Several bit/field instructions that use three source operands were eliminated to avoid the need for extra register ports.
  Complex string instructions were left out, consistent with the RISC philosophy.
  Instructions whose operation depended on the value of a source operand were eliminated.
  Precision shifts, integer multiplies, and divide-with-remainder instructions were omitted.
  Support for operation in both big-endian and little-endian modes
  Single- and double-precision floating-point arithmetic
  64-bit architecture, backward compatible with 32-bit
47
PowerPC family
o PowerPC 601: medium-sized, medium-performance processor
  includes a more sophisticated branch unit
  can dispatch three instructions per cycle, out of order
  up to 8 instructions per cycle can be fetched directly into an eight-entry instruction queue (IQ), where they are decoded before being dispatched to the execution core
  Branch folding:
    The instruction queue is used for detecting and dealing with branches. The branch unit scans the bottom four entries of the queue, identifying branch instructions and determining what type they are (conditional, unconditional).
    In cases where the branch unit has enough information to resolve the branch right then and there (an unconditional branch, or a conditional branch whose condition depends on information that is already in the condition register), the branch instruction is simply deleted from the instruction queue and replaced with the instruction located at the branch target.
o PowerPC 603: smaller die size than the 601
  smaller cache
  can dispatch three instructions per cycle, out of order
48
Current Status
  PowerPC e200 - 32-bit Power Architecture microprocessor, speeds ranging up to 600 MHz, ideal for embedded applications.
  PowerPC e300 - similar to the e200, with speeds up to 667 MHz.
  PowerPC e600 - speeds up to 2 GHz, ideal for high-performance routing and telecommunications applications.
  POWER5 - IBM dual-core microprocessor.
  POWER6 - IBM dual-core microprocessor. A notable difference from POWER5 is that the POWER6 executes instructions in order instead of out of order.
  PowerPC G3 - used in Apple Macintosh computers such as the PowerBook G3, the multicolored iMacs, iBooks, and several desktops, including both the Beige and the Blue and White Power Macintosh G3s.
  PowerPC G4 - a designation used by Apple Computer to describe a fourth generation of 32-bit PowerPC microprocessors.
  PowerPC G5 - 64-bit Power Architecture processors.
  Xenon - based on IBM's PowerPC ISA; Xbox 360 game console.
  Broadway - based on IBM's PowerPC ISA; Nintendo Wii gaming console.
  Blue Gene/L - dual-core PowerPC 440, 700 MHz, 2004
  Blue Gene/P - quad-core PowerPC 450, 850 MHz, 2007
51
PowerPC Registers
PowerPC's application-level registers are broken into three categories: general-purpose, floating-point, and special-purpose registers.
o General-purpose registers (GPRs) - r0 to r31
  flat scheme of 32 general-purpose registers
  source and destination for all integer operations
  address source for all load/store operations
  They also provide access to SPRs.
  All GPRs are available for use with one exception: in certain instructions, GPR0 simply means the value 0, and no lookup is done for GPR0's contents.
o Some of these registers have special tasks assigned to them:
  r0        Volatile register which may be modified during function linkage
  r1        Stack frame pointer, always valid
  r2        System-reserved register
  r3-r4     Volatile registers used for parameter passing and return values
  r5-r10    Volatile registers used for parameter passing
  r11-r12   Volatile registers which may be modified during function linkage
  r13       Small data area pointer register
  r14-r30   Registers used for local variables
  r31       Used for local variables or "environment pointers"
52
Floating point registers
o Floating-point registers (FPRs) - fr0 to fr31
  32 floating-point registers with 64-bit precision
  source and destination operands of all floating-point operations
  can contain 32-bit and 64-bit signed and unsigned integer values, as well as single-precision and double-precision floating-point values
  FPRs also provide access to the FPSCR (Floating-Point Status and Control Register)
    The FPSCR captures status and exceptions resulting from floating-point operations, and also provides control bits for enabling specific exception types.
  Instructions that load and store double-precision floating-point numbers transfer 64 bits of data without conversion.
  Instructions that load single-precision floating-point numbers from memory convert them to double-precision format before storing them in the register.
  f0        Volatile register
  f1        Volatile register used for parameter passing and return values
  f2-f8     Volatile registers used for parameter passing
  f9-f13    Volatile registers
  f14-f31   Registers used for local variables
53
Special-purpose registers (SPRs)
The Fixed-Point Exception Register (XER) - used for indicating conditions for integer operations, such as carries and overflows.
The Floating-Point Status and Control Register (FPSCR) - 32-bit register used to store the status and control of the floating-point operations.
The Count Register (CTR) - used to hold a loop count that can be decremented during the execution of branch instructions.
The Condition Register (CR) - 32-bit register grouped into eight fields, where each field is 4 bits that signify the result of an instruction's operation: Equal (EQ), Greater Than (GT), Less Than (LT), and Summary Overflow (SO).
The Link Register (LR) - contains the address to return to at the end of a function call.
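The eight 4-bit fields of the Condition Register can be pulled apart with shifts and masks. The sketch below assumes the usual PowerPC layout, in which field 0 occupies the most significant four bits and the bits within a field are ordered LT, GT, EQ, SO from most to least significant; the function name is illustrative.

```python
# Bit order inside one CR field, most significant bit first (assumed layout).
FLAGS = ("LT", "GT", "EQ", "SO")

def cr_field(cr: int, n: int) -> dict:
    """Return the four flags of CR field n (0..7) from a 32-bit CR value."""
    if not 0 <= n <= 7:
        raise ValueError("CR field index must be 0..7")
    nibble = (cr >> (4 * (7 - n))) & 0xF   # field 0 is the top nibble
    return {name: bool(nibble & (8 >> i)) for i, name in enumerate(FLAGS)}

# Example: only the EQ bit of field 0 is set.
print(cr_field(0x20000000, 0))
```

A conditional branch selects a single one of these 32 bits through its BI field, so a compare result stored in one field can be tested later without recomputing it.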
54
Data Types
It can use either little-endian or big-endian style.
Fixed-point data types include:
  o Unsigned byte: 8 bits
  o Unsigned halfword: 16 bits
  o Signed halfword: 16 bits
  o Unsigned word: 32 bits
  o Signed word: 32 bits
  o Unsigned doubleword: 64 bits
  o Byte strings: from 0 to 128 bytes in length
2's complement is used for negative values
Floating-point data formats:
  single-precision, 32 bits long (23 + 8 + 1)
  double-precision, 64 bits long (52 + 11 + 1)
Characters are stored using 8-bit ASCII codes
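The bit counts given for the floating-point formats (23 + 8 + 1 and 52 + 11 + 1) are fraction, exponent, and sign bits. The sketch below splits Python floats into those fields using the standard struct module, assuming the IEEE 754 encodings; the function names are illustrative.

```python
import struct

def single_fields(x: float):
    """Return (sign, exponent, fraction) of x encoded as a 32-bit float."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return word >> 31, (word >> 23) & 0xFF, word & 0x7FFFFF

def double_fields(x: float):
    """Return (sign, exponent, fraction) of x encoded as a 64-bit float."""
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    return word >> 63, (word >> 52) & 0x7FF, word & ((1 << 52) - 1)

print(single_fields(1.0))    # biased exponent 127, zero fraction
print(double_fields(-2.0))   # sign bit set, biased exponent 1024
```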
55
Instruction types
56
Instruction Format
All instruction encodings are 32 bits in length.
Bit numbering for PowerPC is the opposite of most other definitions: bit 0 is the most significant bit, and bit 31 is the least significant bit.
Instructions are first decoded by the upper 6 bits, a field called the primary opcode. The remaining 26 bits contain fields for operand specifiers, immediate operands, and extended opcodes, and these may be reserved bits or fields.
Common instruction formats:

Format    0-5    6-10      11-15     16-20   21-25   26-29   30   31
D-form    opcd   tgt/src   src/tgt   immediate (16-31)
X-form    opcd   tgt/src   src/tgt   src     extended opcd
A-form    opcd   tgt/src   src/tgt   src     src     extended opcd    Rc
BD-form   opcd   BO        BI        BD (16-29)              AA   LK
I-form    opcd   LI (6-29)                                   AA   LK
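Since PowerPC numbers bits from the most significant end, extracting a field means converting the big-endian bit positions into a right shift. The sketch below decodes the D-form and I-form layouts from the table; the helper names and dictionary keys are illustrative, and the two sample words in the usage lines are assumed to be the encodings of `addi r3,r1,8` and of a `bl` with a byte displacement of 8.

```python
def field(word: int, first: int, last: int) -> int:
    """Extract bits first..last of a 32-bit word, where bit 0 is the
    most significant bit (PowerPC numbering)."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)

def decode_d_form(word: int) -> dict:
    return {"opcd": field(word, 0, 5), "rt": field(word, 6, 10),
            "ra": field(word, 11, 15), "imm": field(word, 16, 31)}

def decode_i_form(word: int) -> dict:
    return {"opcd": field(word, 0, 5), "li": field(word, 6, 29),
            "aa": field(word, 30, 30), "lk": field(word, 31, 31)}

print(decode_d_form(0x38610008))   # assumed encoding of: addi r3,r1,8
print(decode_i_form(0x48000009))   # assumed encoding of: bl with displacement 8
```

The primary opcode always sits in bits 0-5, so a decoder can pick the format from those six bits before looking at anything else.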
57
Instruction format
D-form - provides up to two registers as source operands, one immediate source, and up to two registers as target operands. Some variations of this instruction format use portions of the target and source register operand specifiers as immediate fields or as extended opcodes.
X-form - provides up to two registers as source operands and up to two target operands. Some variations of this instruction format use portions of the target and source operand specifiers as immediate fields or as extended opcodes.
A-form - provides up to three registers as source operands, and one target operand. Some variations of this instruction format use portions of the target and source operand specifiers as immediate fields or as extended opcodes.
BD-form - conditional branch instruction. The BO field specifies the type of condition; the BI field specifies which CR bit is used as the condition; the BD field is used as the branch displacement. The AA bit specifies whether the branch is an absolute or relative branch. The LK bit specifies whether the address of the next sequential instruction is saved in the Link Register as a return address for a subroutine call.
I-form - used by the unconditional branch instruction. Being unconditional, the BO and BI fields of the BD format are exchanged for additional branch displacement to form the LI instruction field. This instruction format also supports the AA and LK bits in the same fashion as the BD format.
Simplified PowerPC instruction set: http://pds.twi.tudelft.nl/vakken/in1200/labcourse/instruction-set/
58
PowerPC Addressing Modes
Load/store architecture
  Indirect
    Instruction includes a 16-bit displacement to be added to the base register (may be a GP register)
    Can replace the base register content with the new address
  Indirect indexed
    Instruction references a base register and an index register (both may be GP registers)
    EA is the sum of their contents
Branch address (target address, TA, calculation)
  Absolute: TA = actual address
  Relative: TA = current instruction address + displacement (signed; the 24-bit LI field of the I-form with two zero bits appended)
  Indirect
    Link Register: TA = (LR)
    Count Register: TA = (CTR)
Arithmetic
  Operands in registers or part of the instruction
  Floating point is register only
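The address calculations listed above can be sketched in a few lines. This is a simplified model under stated assumptions: registers are held in a plain dictionary, addresses are 32 bits wide, the special treatment of r0 as a base register is ignored, and the relative branch uses the I-form's 24-bit LI field shifted left by two bits and sign-extended. All names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def ea_indirect(regs, base, disp16, update=False):
    """EA = (base register) + signed 16-bit displacement.
    With update=True the base register receives the new address."""
    ea = (regs[base] + sign_extend(disp16, 16)) & MASK32
    if update:
        regs[base] = ea
    return ea

def ea_indexed(regs, base, index):
    """EA = (base register) + (index register)."""
    return (regs[base] + regs[index]) & MASK32

def branch_target(current_address, li, absolute=False):
    """Target of an I-form branch from its 24-bit LI field."""
    disp = sign_extend(li << 2, 26)
    return disp & MASK32 if absolute else (current_address + disp) & MASK32

regs = {1: 0x1000, 2: 0x20}
print(hex(ea_indirect(regs, 1, 0xFFFC)))   # displacement of -4
print(hex(ea_indexed(regs, 1, 2)))
print(hex(branch_target(0x100, 2)))
```

The update form is what lets a loop walk an array with a single load per element: the load both fetches the data and advances the base register.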
61
PowerPC G4e Pipeline Stages
Stages 1 and 2 - Instruction Fetch:
  These two stages are both dedicated primarily to grabbing an instruction from the L1 cache.
  The G4e can fetch four instructions per clock cycle from the L1 cache and send them on to the next stage.
Stage 3 - Decode/Dispatch:
  Once an instruction has been fetched, it goes into a 12-entry instruction queue to be decoded.
  The G4e's decoder can dispatch up to three instructions per clock cycle to the next stage.
62
PowerPC G4e Pipeline Stages
Stage 4 - Issue:
  The first queue is the Floating-Point Issue Queue (FIQ), which holds floating-point (FP) instructions that are waiting to be executed.
  The second is the Vector Issue Queue (VIQ), which holds vector operations.
  The third queue is the General Instruction Queue (GIQ), which holds everything else.
  Once the instruction leaves its issue queue, it goes to the execution engine to be executed.
63
PowerPC G4e Pipeline Stages
Stage 5 - Execute:
  The instructions can pass out of order from their issue queues into their respective functional units and be executed.
Stages 6 and 7 - Complete and Write-Back:
  In these two stages, the instructions are put back into the order in which they came into the processor, and their results are written back to memory.
64
Design principles
Simplicity favors regularity
  Standard 32-bit instruction format for all instructions
  fixed-length instructions
  register-to-register architecture
  three-operand instruction format
Smaller is faster
  Three categories of registers, but each handles specific instructions, so presumably faster access time
Make the common case fast
  Integer and floating-point instructions
Good design demands good compromises
  To align with RISC principles, many instructions that required three source operands were eliminated
  Many complex instructions were curtailed to conform to RISC principles, but this is compensated by a large number of mnemonics that increase the number of instructions.
65
Pros and Cons
Instruction set
  200 machine instructions
  More complex than most RISC machines
    e.g. floating-point multiply and add instructions that take three input operands
    e.g. load and store instructions may automatically update the index register to contain the just-computed target address
Pipelined execution
  More sophisticated than SPARC
Input and output
  Two different modes
    Direct-store segment: map virtual address space to an external address space
    Normal virtual memory access
Permits a range of implementations, from low-cost controllers through high-performance processors.