CPU structure, what did we have to deal with:
- double clock generation- double-port instruction cache- double-port instruction fetch (bubble handling)- decode stage (instr handling, scoreboard
implemented)- execute stage (doubled execution unit, forwarding,
branch resolving, write-back ports)- load-store stage (memory access handling, doubled
write-back signal)
Top level model
• Global 50MHz clock connected do DLL component which performs clock frequency doubling
• Doubled clock needed to implement 4-port Block RAM
performance counter
CPU
chipset
DLL
CLK
IO interface
CLK0
CLK2x
Instruction cache
• Block RAM extension to two-port implementation
• Cache miss and hit tests for two ports
• One memory port• FSM responsible for
memory access is switched between two requests from instruction fetch
first port second port
Block RAM
FSM Memory Access
Instruction fetch
• Fetching two instruction from cache
• bubble insertion for each instruction stream
• instructions passed to the output in order
two instruction cache ports
Instruction Fetch
two decode stage portsbranch request
bubble1 bubble2
Decode stage
• Decoding two instructions• Quad-port Block RAM
inferred• Taking advantage from
doubled clock – double write-back handling
• Scoreboard implemented – set of conditions for checking data dependencies
• Bubble generation• Instruction stream
prepared for load-store stage
two instruction fetch ports
two execute stage ports
Scoreboard
Block RAM
Write-back
Instruction decoding
Write-back
Previous Instr.
Scoreboard
• Simplification of full scoreboard unit• Introduced as a set of conditions implemented in decode
stage• Used for bubble insertion of both types (concurrent and
consecutive instructions) and separating memory access instructions
• Presented by abtract instruction table consisted of two lines
Nr Instruction Idx_d Idx_a Idx_b Executability
In practice corresponds to Outputs of instructions fetch
1
2
MUL
ST
0 12
21 -
1
0
And few examples:
Firstly, normal operation without any bubble insertion, two instructions are fully independent
Write-backWrite-back
two instruction fetch ports
two execute stage ports
Block RAM
Instruction decoding
Scoreboard
Previous Instr.
Bubble insertion caused by data dependencies between concurrent instructions
two instruction fetch ports
two execute stage ports
Block RAM
Instruction decoding
Write-backWrite-back
Scoreboard
Previous Instr.
Bubble insertion caused by data dependencies between load instruction and consecutive arbitrary instructions
two execute stage ports
Block RAM
Instruction decoding
Write-backWrite-back
Instr Instr $1,$0LD $0 Instr
Scoreboard
Previous Instr.
Bubble insertion introduced to split two memory-access instructions
two execute stage ports
Block RAM
Instruction decoding
Write-backWrite-back
LD STST Instr
Scoreboard
Previous Instr.
Execute stage
• Doubled ALU • Resolving of branch
priority• Forwarding from
both instruction streams
• Write-back generation
two decode stage ports
two load store stage ports
Data forwarding
ALU ALU
Register
branch request
Load-store stage
• It is ensured that only one memory access instruction is passed to load store unit
• Memory access process is switched to the right instruction
• write back signals are generated
write back signals
write back from execute
memory access
write back multiplexing
memory ports
Performance (1) – blinking leds
• Additional parameters:• Number of simulated cycles
: 124988• Execution Frequency of
Memory Access Instructions compared with number of all instructions:- Super Sc : 0,29- SIMD : 0,24
• ALU Instructions :- Super Sc : 0,14- SIMD : 0,13
Instruction/cycle
SIMDSuper scalar SIMD
0,5
0,42
Performance (2) - apfel• Additional parameters:• Execution Frequency of Memory
Access Instructions:- for both : 0,2
• ALU Instructions :- both : 0,4
• Measurement Results of Instruction Execution Frequency are surprising, probably because of many memory access instructions executed at the beginning of program(the longer the simulation time is, the better results we should get)
Instruction/cycle
SIMDSuper scalar SIMD
0,56
0,45
Synthesis• last version seen working on XCV300 was 2-way SIMD
(MUCH faster than HaPra CPU!)• 4-way SIMD and Super Scalar versions are too big for
XCV300...• ...and for unknown reasons don't work in XCV800• probably severe timing issues - running on 25MHz instead
of 50MHs doesn't help• (but 4-way SIMD
should work anyway!)
• all we've got is fully working simulation