ECE8833 Polymorphous and Many-Core Computer Architecture
Prof. Hsien-Hsin S. Lee
School of Electrical and Computer Engineering
Lecture 1 Early ILP Processors and Performance Bound Model
ECE8833 H.-H. S. Lee 2009
Decoupled Access/Execute Computer Architectures
James E. Smith, ACM TOCS, 1984
(an earlier version was published at ISCA 1982)
Background of DAE, circa 1982
• Written at a time when vector machines were dominant
LV v1, mem[a1]
MULV v3, v2, v1
ADDV v5, v4, v3
[Figure: time line of LV, MULV, and ADDV, shown with vector chaining (Cray-1); a vector register holds 64-bit elements indexed 0 to 63, 4096 bits in total]
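The benefit of chaining on the three-instruction sequence can be sketched with a toy timing model. This is a minimal sketch, not the Cray-1 timing: it assumes each functional unit delivers one element per cycle after a startup latency, and the startup latencies below are illustrative assumptions rather than figures from the slide.

```python
def unchained_cycles(startup_latencies, vector_length):
    """Each instruction waits until the previous one has written its whole result vector."""
    return sum(latency + vector_length for latency in startup_latencies)


def chained_cycles(startup_latencies, vector_length):
    """Each instruction starts as soon as the first element of its operand is produced."""
    return sum(startup_latencies) + vector_length


# Assumed startup latencies in cycles for LV, MULV, ADDV (illustrative only).
latencies = [11, 7, 6]
n = 64  # elements per vector register

print(unchained_cycles(latencies, n))  # 216
print(chained_cycles(latencies, n))    # 88
```

With chaining, the three dependent instructions overlap almost completely, so the sequence costs roughly one vector length plus the startup latencies instead of three vector lengths.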
[Figure: chained datapath: Memory feeds v1; MUL takes v2 and v1 and produces v3; ADD takes v4 and v3 and produces v5]
What about modern SIMD ISAs?
Today’s State of the Art?
• Intel AVX
• Intel Larrabee NI
DAE, circa 1982
• Fine-grained parallelism: Vector vs. Superscalar
• What about scalar performance?
– Remember what Flynn’s bottleneck is?
Page 290
Flynn’s Bottleneck
• ILP of 1.86
– Programs on IBM 7090
– Basically, he sort of said one cannot execute more than one instruction per cycle
– ILP exploited within basic blocks
• [Riseman & Foster ’72]
– Breaking control dependency
– A perfect machine model
– Benchmarks include numerical programs, an assembler, and a compiler
Jumps passed   0      1      2      8      32     128    ∞
Average ILP    1.72   2.72   3.62   7.21   14.8   24.2   51.2
[Figure: control-flow graph of basic blocks BB0 to BB4]
DAE, circa 1982, 1984
• Issues in CDC6600 & IBM 360/91
– Overlapping instructions out of order (OoO) needs complex control; the resulting slower clock offsets the benefit
– Complex issue methods were abandoned by their manufacturers
• Less determinism
• Problems in HW debugging
• Errors may not be reproducible
– Complexity can be shifted to system software
Decoupled Access/Execute Architecture
• An architecture with two instruction streams to break Flynn’s bottleneck
– Access processor
– eXecute processor
– Hey, this was the 1980s
• Separate RFs (A0, A1, A2, ..., An-1 & X0, X1, X2, ..., Xm-1), which can be totally incompatible
– Synchronization issue?
DAE
Data Movement
[Figure: Data In and Data Out paths; the queues are paired]
XLQ and XSQ are specified as registers at the ISA level
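The data movement through the queues can be sketched functionally for a loop c[i] = a[i] + b[i]. This is a minimal sketch under stated assumptions: the two streams are run one after the other rather than concurrently, the arrays do not overlap, and the store address queue name `saq` and the function `run_dae` are hypothetical names introduced for illustration.

```python
from collections import deque


def run_dae(mem, a_base, b_base, c_base, n):
    """Functional model of c[i] = a[i] + b[i] split into access and execute streams."""
    xlq = deque()  # load queue: data flowing from memory to the X-processor
    xsq = deque()  # store queue: results produced by the X-processor
    saq = deque()  # store addresses from the A-processor (hypothetical name)

    # Access stream: address arithmetic and memory requests only.
    for i in range(n):
        xlq.append(mem[a_base + i])
        xlq.append(mem[b_base + i])
        saq.append(c_base + i)

    # Execute stream: reading the queue "register" pops one operand.
    for _ in range(n):
        x = xlq.popleft()
        y = xlq.popleft()
        xsq.append(x + y)

    # Data Out: each store address is paired with the next store datum.
    while saq and xsq:
        mem[saq.popleft()] = xsq.popleft()
    return mem


memory = [1, 2, 3, 10, 20, 30, 0, 0, 0]
print(run_dae(memory, 0, 3, 6, 3)[6:])  # [11, 22, 33]
```

Because the queues are FIFO, neither stream names the other's registers; ordering alone carries the synchronization.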
Register-to-Register Synch
Xi Aj
Branch Synch-up
• One runs ahead
• One executes an unconditional jump (BFQ instruction)
Branch outcomes in XBQ can be used to reduce I-fetch from X-Processor.
DAE Code Example
Modern Issue Consideration
• Despite being an ’82/’84 paper, it considers
Precise Exception
• Simple approach: force the instructions to complete in order
• In DAE, applied to each of the streams separately
• Example of imprecise exception issues
• Requires care when coding A and E programs
Requirement for Precise Exception
Why (and How) It Works?
• Avg. speedup = 1.58 for LFK (Livermore Fortran Kernels)
• Execution on the 2 processors is somewhat balanced
• Why?
– Works nicely as shown in LFK
– X-processor’s computation is not as fast
• 6-cycle FP add
• 7-cycle FP multiply
– A-processor takes care of
• Memory (11-cycle load)
• Branch resolution
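How the A-processor hides the 11-cycle load behind the X-processor's 6-cycle FP add can be sketched with a toy cycle count. This is a minimal sketch and does not reproduce the 1.58 figure: it assumes one load and one dependent FP add per iteration, a pipelined memory accepting one load per cycle, an unpipelined adder, and unbounded queues.

```python
LOAD_LATENCY = 11   # cycles, from the slide
FP_ADD_LATENCY = 6  # cycles, from the slide


def coupled_cycles(iterations):
    """In-order single stream: every load is followed by its dependent FP add."""
    return iterations * (LOAD_LATENCY + FP_ADD_LATENCY)


def decoupled_cycles(iterations):
    """A-processor issues one load per cycle; X-processor adds as operands arrive."""
    finish = 0
    for i in range(iterations):
        arrival = i + LOAD_LATENCY  # load i was issued at cycle i
        # Assumed unpipelined adder: it starts once the operand has arrived
        # and the previous add has finished.
        finish = max(arrival, finish) + FP_ADD_LATENCY
    return finish


print(coupled_cycles(4))    # 68
print(decoupled_cycles(4))  # 35
```

After the first load arrives, the X-processor is never starved in this model: the memory latency is paid once, and the loop then runs at the speed of the slower FP unit.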
Disadvantages of DAE Architecture
1. Writing 2 separate programs
• What high-level language?
• Who should do it?
2. Certain duplication in hardware
• Instruction memory/cache
• Instruction fetch unit
• Decoder
Interleaving Instruction Streams
• Use a bit to tag streams
• No split branch instruction
(1) X7 is XLQ or XSQ
(2) Once loaded, it is used once
(3) It must be stored after the X-processor writes to it
(A)X
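Tagging a single interleaved program with one bit per instruction can be sketched as a splitter that routes each instruction to its stream while keeping program order within the stream. This is a minimal sketch: the tag encoding (0 for access, 1 for execute) and the instruction strings are assumptions made for illustration, not the encoding from the paper.

```python
from collections import deque

A_STREAM, X_STREAM = 0, 1  # assumed encoding of the one-bit stream tag


def split_streams(program):
    """Route each (tag, instruction) pair to its processor's queue, preserving order."""
    a_queue, x_queue = deque(), deque()
    for tag, instruction in program:
        if tag == X_STREAM:
            x_queue.append(instruction)
        else:
            a_queue.append(instruction)
    return list(a_queue), list(x_queue)


# Hypothetical interleaved program for c[i] = a[i] + b[i].
program = [
    (A_STREAM, "load a[i] -> XLQ"),
    (A_STREAM, "load b[i] -> XLQ"),
    (X_STREAM, "X2 = X7 + X7"),
    (X_STREAM, "X7 = X2"),
    (A_STREAM, "store address of c[i]"),
]

a_program, x_program = split_streams(program)
print(a_program)
print(x_program)
```

A single tagged program removes the need to write and fetch two separate programs, which addresses the first disadvantage listed earlier in the deck.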
Summary of DAE Architecture
• 2-wide issue per cycle
• Allows a constrained type of OoO
– Data accesses could be done well in advance (i.e., “slip” ahead)
– Enables a certain level of data prefetching
• Was novel in 1982!
The ZS-1 Central Processor
James E. Smith, et al. in ASPLOS-II, 1987
Astronautics ZS-1 Central Processor
• A realization of DAE (by the same author)
• Decouples the instruction stream into
– Fixed point/memory
– Floating-point operations
• Communicates via architectural queues
• Is extensively pipelined
• 22.5 MFLOPS, 45 MIPS
ZS-1 Central Processor
Communicate with memory
31 A (and X) registers + 1 queue entry = 5-bit encoded operands
Hold 24 insts
Hold 4 insts
ZS-1 Central Processor
+ An instruction cannot be issued until its dependencies are resolved.
+ A load may bypass independent stores
+ Maintains load-load and store-store order
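The rule that a load may bypass only independent stores can be sketched as an address comparison against the stores still waiting to be performed. This is a minimal sketch, assuming that "independent" means "no pending store to the same address"; the function name is hypothetical and the ZS-1 hardware mechanism is not described on the slide.

```python
def load_may_bypass(load_addr, pending_store_addrs):
    """A load may go ahead of queued stores only if none of them targets its address."""
    return all(load_addr != store_addr for store_addr in pending_store_addrs)


pending = [0x100, 0x108]
print(load_may_bypass(0x200, pending))  # True: no address conflict
print(load_may_bypass(0x108, pending))  # False: must wait for the store to 0x108
```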
Can Load Bypass Load?
• Why not?
Core 1: (1) Load R1, (A)   (2) Load R2, (A)
Core 2: (3) Store (A), R3
Initially: (A)=100, R3=25
• What’s wrong with (2)(3)(1)?
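The problem with the order (2)(3)(1) can be checked by replaying the three operations on a tiny memory model. This is a minimal sketch, assuming the two loads belong to Core 1 in program order (1) then (2) and the store belongs to Core 2; the function `run` is a hypothetical helper.

```python
def run(order):
    """Perform the operations in the given global order and return (R1, R2)."""
    mem = {"A": 100}
    regs = {"R1": None, "R2": None, "R3": 25}
    for op in order:
        if op == 1:      # (1) Load R1, (A)
            regs["R1"] = mem["A"]
        elif op == 2:    # (2) Load R2, (A)
            regs["R2"] = mem["A"]
        elif op == 3:    # (3) Store (A), R3
            mem["A"] = regs["R3"]
    return regs["R1"], regs["R2"]


# Orders that keep Core 1's loads in program order.
in_order = {run(o) for o in [(1, 2, 3), (1, 3, 2), (3, 1, 2)]}
print(sorted(in_order))  # [(25, 25), (100, 25), (100, 100)]

# The younger load (2) bypasses the older load (1).
print(run((2, 3, 1)))    # (25, 100)
```

With (2)(3)(1), the older load sees the new value 25 while the younger load sees the old value 100, an outcome that no in-order interleaving produces. This is why load-load order is maintained.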
ZS-1: Processing of Two Iterations
S: splitter
B: inst buffer read
D: decoded
I: issued
E: execution
IBM RS/6000 and POWER
• Evolved from IBM ACS and 801
• Foundation of POWER architecture (Performance Optimization With Enhanced RISC)
– 10 discrete chips in the early POWER1 system
– Single-chip solution in RSC and in a subsequent POWER2 version called P2SC
POWER2 Processor Node
• 8 discrete chips on MCM
• 66.7 MHz, 6-issue (2 reserved for br/comp)
• 2 FXUs
– Memory, INT, Logical
– 2 per cycle
• 3 dual-pipe FPUs can perform
– 2 DP fma
– 2 FP loads
– 2 FP stores
[Figure: POWER2 block diagram. Instruction Cache Unit: I-Cache (32KB), dual branch processors, dispatch. Fixed-Point Unit (FXU): instruction buffer, execution unit w/o mult/div, execution unit w/ mult/div. Floating-Point Unit (FPU): instruction buffer, arithmetic execution unit, store execution unit, load execution unit. Sync between FXU and FPU. Data Cache Unit (DCU): 4 separate chips (32KB each). Storage Control Unit. Memory Unit (64MB – 512MB). Optional secondary cache (1 or 2MB).]
MACS Performance Bound Model
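[Figure: hierarchy of bounds, from lowest to highest: M Bound, MA Bound, MAC Bound, MACS Bound, Actual Run Time (physically measured). The successive differences are GAP A, GAP C, GAP S, and GAP P.]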
• To analyze achievable performance (mostly FP) in scientific applications
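The gaps can be read as differences between successive bounds. This is a minimal sketch, assuming Gap A lies between the M and MA bounds, Gap C between MA and MAC, Gap S between MAC and MACS, and Gap P between MACS and the measured run time; the numbers are made up for illustration and are not measurements.

```python
def macs_gaps(m, ma, mac, macs, measured):
    """Return the four gaps between successive bounds (same unit as the inputs)."""
    return {
        "A": ma - m,          # application workload vs. machine peak
        "C": mac - ma,        # compiler-generated workload vs. essential workload
        "S": macs - mac,      # actual schedule vs. ideal schedule
        "P": measured - macs, # measured run time vs. scheduled code
    }


# Hypothetical values in cycles per flop (CPF).
gaps = macs_gaps(m=0.25, ma=0.75, mac=1.0, macs=1.25, measured=1.5)
print(gaps)  # {'A': 0.5, 'C': 0.25, 'S': 0.25, 'P': 0.25}
```

The largest gap indicates where tuning effort is likely to pay off first: the algorithm, the compiler, the scheduler, or the memory system.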
MACS Performance Bound Model
• GAP A (keeps you from attaining peak performance)
– Excessive loads/stores (more than essential ones, e.g., a[i] = b[i])
– Loop bookkeeping
• GAP C (reason we may want to have 432?)
– Hardware restriction (architectural registers)
– Redundant instructions
– Load/store overhead in function calls
• GAP S
– Weak scheduling algorithm
– Resource conflicts preventing tighter schedule
– Sol: Modulo scheduling to compact the code
• GAP P
– Cache misses, inter-core communication, system effects (e.g., context switches)
– Sol: prefetch, loop blocking, loop fusion, loop interchange, etc.
POWER2 M Bound (Ideal, Ideal)
M Bound: Peak = 1 fma issued to each of the 2 FPU pipelines per cycle = 4 flops per cycle = 0.25 CPF (cycles per flop)
[Figure: POWER2 Floating-Point Unit (FPU): dispatch, instruction buffer, arithmetic execution unit, store execution unit, load execution unit]
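The 0.25 CPF figure follows from simple arithmetic. This is a minimal sketch, assuming CPF means cycles per floating-point operation and that an fma counts as 2 flops; the peak MFLOPS line reuses the 66.7 MHz clock quoted for the POWER2 node.

```python
FLOPS_PER_FMA = 2   # one multiply plus one add
FPU_PIPELINES = 2   # each pipeline can start one fma per cycle
CLOCK_MHZ = 66.7    # POWER2 node clock from the deck

peak_flops_per_cycle = FLOPS_PER_FMA * FPU_PIPELINES
m_bound_cpf = 1 / peak_flops_per_cycle
peak_mflops = CLOCK_MHZ * peak_flops_per_cycle

print(m_bound_cpf)  # 0.25
print(peak_mflops)
```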
POWER2 MA Bound (Ideal compiler and rest)
MA Bound
1. Given the visible workload of the high-level application
2. Calculate the essential operations that must be performed
$$t_{MA} = \frac{\mathrm{MAX}(t_{fl},\ t_{fx},\ t_{m},\ t_{i},\ t_{d})}{f_{a} + f_{m} + 2 f_{ma} + 4 f_{div} + 4 f_{sqrt}}$$
• Time bound for all FP operations
• The $f$ counts are the essential, minimum FP operations to complete the computation
• A factor of 4 for div and sqrt is a common choice to reflect their relative weight to other computations
POWER2 MA Bound (Ideal compiler and rest)
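$$t_{fl} = \mathrm{MAX}\left(\frac{f_{a} + f_{m} + f_{ma} + 17 f_{div} + 27 f_{sqrt}}{2},\ \frac{l_{fl}}{2},\ \frac{s_{fl}}{2}\right)$$
$$t_{i} = \frac{f_{a} + f_{m} + f_{ma} + f_{div} + f_{sqrt} + l_{fl} + s_{fl}}{4}$$
$$t_{fx} = \frac{l_{fl} + s_{fl}}{2}$$
$$t_{m} = \mathrm{MAX}(l_{fl} + s_{fl},\ l_{fx} + s_{fx})$$
$$t_{d} = \mathrm{MAX}_{\text{recurrence cycles } r}\left(\frac{L_{r}}{I_{r}}\right)$$
$L_{r}$: total latency of the loop-carried dependency
$I_{r}$: # of iterations in recurrence $r$
• $t_{fl}$: 2 pipelines; non-pipelined FP ops (17 cycles for div, 27 for sqrt)
• $t_{i}$: max 4 dispatches to FPU and FXU
• $t_{fx}$: other fixed-point considered irrelevant
• $t_{m}$: simplified memory model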
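A worked example may help show how the component bounds combine. This is a minimal sketch under one reading of the slide's formulas: t_fl = MAX((f_a + f_m + f_ma + 17 f_div + 27 f_sqrt)/2, l_fl/2, s_fl/2), t_fx = (l_fl + s_fl)/2, t_i = (all FP ops + l_fl + s_fl)/4, and the final bound divided by f_a + f_m + 2 f_ma + 4 f_div + 4 f_sqrt. The memory and dependency terms t_m and t_d are passed in by the caller rather than derived, and the function name is hypothetical.

```python
def ma_bound_cpf(f_a=0, f_m=0, f_ma=0, f_div=0, f_sqrt=0,
                 l_fl=0, s_fl=0, t_m=0.0, t_d=0.0):
    """MA bound in cycles per flop for one loop iteration (assumed formulas)."""
    # FPU time: 2 pipelines; div and sqrt are not pipelined.
    t_fl = max((f_a + f_m + f_ma + 17 * f_div + 27 * f_sqrt) / 2,
               l_fl / 2, s_fl / 2)
    # FXU time: only the FP loads and stores are counted.
    t_fx = (l_fl + s_fl) / 2
    # Dispatch time: at most 4 instructions per cycle to FPU and FXU.
    t_i = (f_a + f_m + f_ma + f_div + f_sqrt + l_fl + s_fl) / 4
    cycles = max(t_fl, t_fx, t_m, t_i, t_d)
    # Weighted essential flops: fma counts 2, div and sqrt count 4.
    flops = f_a + f_m + 2 * f_ma + 4 * f_div + 4 * f_sqrt
    return cycles / flops


# y[i] = a * x[i] + y[i]: one fma, two FP loads, one FP store per iteration.
print(ma_bound_cpf(f_ma=1, l_fl=2, s_fl=1))  # 0.75
```

For this loop the fixed-point units (1.5 cycles per iteration for three memory operations) are the limit, so the bound is 0.75 CPF rather than the 0.25 CPF machine peak. The function assumes at least one floating-point operation per iteration.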
POWER2 MAC Bound
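MAC Bound: similar to computing the MA Bound, but using the actual, generated instruction counts
$$t_{MAC} = \frac{\mathrm{MAX}(t'_{fl},\ t'_{fx},\ t'_{m},\ t'_{b},\ t'_{i},\ t'_{d})}{f_{a} + f_{m} + 2 f_{ma} + 4 f_{div} + 4 f_{sqrt}}$$
$$t'_{fl} = \mathrm{MAX}\left(\frac{n_{FPU}}{2},\ \frac{l'_{fl}}{2},\ \frac{s'_{fl}}{2}\right) \quad \text{where } n_{FPU} = f'_{a} + f'_{m} + f'_{ma} + 17 f'_{div} + 27 f'_{sqrt} + \text{others}$$
$$t'_{fx} = \mathrm{MAX}\left(\frac{n_{FXU}}{2},\ n_{FXMD}\right) \quad \text{where } n_{FXU} = l'_{fx} + s'_{fx} + \text{other}$$
$n_{FXMD}$: number of FXU mul and div
$$t'_{b} = \frac{n_{BC}}{2} \quad \text{where } n_{BC} = \text{\# of branch and compare}$$
$$t'_{i} = \frac{\text{compiled code length} - n_{BC}}{4}$$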
POWER2 MACS Bound
MACS Bound: similar to computing the MAC Bound, but the numerator is taken from the actual compiler-scheduled code
IBM SP2 Performance Bound
• Later expanded to include an inter-processor communication bound