Pipeline Optimization

Pipeline with data forwarding and accelerated branch
Loop Unrolling
Dual Pipeline

C-code

        k = len;
        do {
            k--;
            A[k] = A[k] + x;
        } while (k > 0);

Register Usage

        $s0   len
        $s1   base address of A
        $s2   x
        $t1   address of A[k]
        $t0   value of A[k] (old and new)
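Basic loop using pointer hopping

        sll   $t1, $s0, 2          # $t1 = len * 4 (size of A in bytes)
        addu  $t1, $t1, $s1        # $t1 = address just past the end of A
loop:   addi  $t1, $t1, -4         # step the pointer down to A[k]
        lw    $t0, 0($t1)          # $t0 = A[k]
        add   $t0, $t0, $s2        # $t0 = A[k] + x
        sw    $t0, 0($t1)          # A[k] = $t0
        bne   $t1, $s1, loop       # repeat until the pointer reaches the base of A
        xxx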
Time for 1000 Iterations for Single- and Multi-cycle

• Single Cycle
  – Every instruction takes 800 picoseconds (ps)
  – Time = 5x800x1000 + 2x800 = 4,001,600 ps = 4,001.6 nanoseconds (ns)
• Multicycle
  – 200 ps per cycle, variable number of CPI
  – Cycles = (1x3 + 3x4 + 1x5)x1000 + 2x4 = 20,008
  – Time = 20,008 x 200 ps = 4,001.6 ns
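The two totals above can be reproduced with a short calculation. This is only an illustrative sketch: the function names are made up here, and the per-instruction cycle counts (3 for the branch, 4 for addi/add/sw, 5 for lw) are read off the slide's formula rather than stated separately in the slides.

```python
# Hypothetical helpers; the names are not from the slides.
ITERATIONS = 1000

def single_cycle_ps(loop_instrs=5, setup_instrs=2, ps_per_instr=800):
    """Every instruction costs one long 800 ps cycle."""
    return (loop_instrs * ITERATIONS + setup_instrs) * ps_per_instr

def multicycle_ps(ps_per_cycle=200):
    """Cycle counts per instruction class, as implied by the slide's formula."""
    loop_cycles = 1 * 3 + 3 * 4 + 1 * 5   # bne; addi, add, sw; lw
    setup_cycles = 2 * 4                  # sll, addu
    return (loop_cycles * ITERATIONS + setup_cycles) * ps_per_cycle

print(single_cycle_ps() / 1000, "ns")
print(multicycle_ps() / 1000, "ns")
```

Both designs land on the same total here because the average multicycle instruction in this loop takes four 200 ps cycles, which equals one 800 ps single cycle.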
Pipeline: Filling Stall/Delay slots

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -4
        lw    $t0, 0($t1)
        nop
        add   $t0, $t0, $s2
        sw    $t0, 0($t1)
        bne   $t1, $s1, loop
        nop
        xxx
Time for simple pipeline
• 200 ps per cycle, 1 CPI (including nops)
• Time = 7x200x1000 + 2x200 ps = 1,400.4 ns
Reordering Code to fill branch delay slot

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -4
        lw    $t0, 0($t1)
        nop
        add   $t0, $t0, $s2
        bne   $t1, $s1, loop
        sw    $t0, 0($t1)
        xxx
Time for pipeline with reordered code
• 200 ps per cycle, 1 CPI (including nops)
• Time = 6x200x1000 + 2x200 ps = 1,200.4 ns
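Both pipeline totals follow from counting issue slots per iteration, nops included. The sketch below is a hedged illustration with a hypothetical helper name; like the slide formulas, it ignores the few cycles needed to fill the pipeline.

```python
def pipeline_time_ps(loop_body, setup, iterations=1000, ps_per_cycle=200):
    """One cycle per issued instruction (nops count); fill time is ignored."""
    return (len(loop_body) * iterations + len(setup)) * ps_per_cycle

setup = ["sll", "addu"]

# Loop body with a load-delay nop and an empty branch delay slot.
with_nops = ["addi", "lw", "nop", "add", "sw", "bne", "nop"]

# Same loop with sw moved into the branch delay slot.
reordered = ["addi", "lw", "nop", "add", "bne", "sw"]

print(pipeline_time_ps(with_nops, setup) / 1000, "ns")
print(pipeline_time_ps(reordered, setup) / 1000, "ns")
```

Moving the store into the delay slot removes one of the seven slots, which is where the saving of 200 ps per iteration comes from.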
Loop Unrolling step 1 (4 iterations)

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -4
        lw    $t0, 0($t1)
        nop
        add   $t0, $t0, $s2
        beq   $t1, $s1, loopend
        sw    $t0, 0($t1)

        addi  $t1, $t1, -4
        lw    $t0, 0($t1)
        nop
        add   $t0, $t0, $s2
        beq   $t1, $s1, loopend
        sw    $t0, 0($t1)

        addi  $t1, $t1, -4
        lw    $t0, 0($t1)
        nop
        add   $t0, $t0, $s2
        beq   $t1, $s1, loopend
        sw    $t0, 0($t1)

        addi  $t1, $t1, -4
        lw    $t0, 0($t1)
        nop
        add   $t0, $t0, $s2
        bne   $t1, $s1, loop
        sw    $t0, 0($t1)
loopend:
        xxx
Loop Unrolling step 2: One pointer with offsets

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -16
        lw    $t0, 12($t1)
        nop
        add   $t0, $t0, $s2
        sw    $t0, 12($t1)

        lw    $t0, 8($t1)
        nop
        add   $t0, $t0, $s2
        sw    $t0, 8($t1)

        lw    $t0, 4($t1)
        nop
        add   $t0, $t0, $s2
        sw    $t0, 4($t1)

        lw    $t0, 0($t1)
        nop
        add   $t0, $t0, $s2
        bne   $t1, $s1, loop
        sw    $t0, 0($t1)
        xxx
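A small model can confirm that the unrolled loop with offsets computes the same result as the original loop. The sketch below uses list indices instead of byte addresses, and its function names are hypothetical. It also makes explicit an assumption the assembly relies on: with the intermediate exit tests removed, len must be a positive multiple of 4.

```python
def add_x_simple(A, x):
    """Model of the C loop: k = len; do { k--; A[k] += x; } while (k > 0)."""
    A = list(A)
    k = len(A)
    while True:
        k -= 1
        A[k] += x
        if k <= 0:
            break
    return A

def add_x_unrolled(A, x):
    """Model of the 4x unrolled loop using one pointer plus offsets."""
    A = list(A)
    if len(A) == 0 or len(A) % 4 != 0:
        raise ValueError("this unrolled form assumes len is a positive multiple of 4")
    k = len(A)             # plays the role of $t1, as an index
    while True:
        k -= 4             # addi $t1, $t1, -16
        A[k + 3] += x      # offset 12
        A[k + 2] += x      # offset 8
        A[k + 1] += x      # offset 4
        A[k] += x          # offset 0
        if k == 0:         # bne $t1, $s1, loop falls through
            break
    return A

data = [1, 2, 3, 4, 5, 6, 7, 8]
print(add_x_unrolled(data, 10) == add_x_simple(data, 10))
```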
Loop Unrolling step 3: Filling data hazard slots

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -16
        lw    $t0, 12($t1)
        lw    $t3, 8($t1)
        add   $t0, $t0, $s2
        sw    $t0, 12($t1)

        lw    $t0, 4($t1)
        add   $t3, $t3, $s2
        sw    $t3, 8($t1)

        lw    $t3, 0($t1)
        add   $t0, $t0, $s2
        sw    $t0, 4($t1)

        add   $t3, $t3, $s2
        bne   $t1, $s1, loop
        sw    $t3, 0($t1)
        xxx
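One way to check that the nops were safe to remove is to scan the schedule for load-use hazards. The checker below is a minimal sketch with hypothetical names. It assumes a one-slot load delay, meaning no instruction may read a register loaded by the lw immediately before it, and it only understands the handful of opcodes used in these listings.

```python
import re

def parse(instr):
    """Split an instruction into (opcode, destination, source registers)."""
    parts = instr.split(None, 1)
    op = parts[0]
    regs = re.findall(r"\$\w+", parts[1]) if len(parts) > 1 else []
    if op in ("sw", "bne", "beq"):
        return op, None, regs            # these only read registers
    return op, (regs[0] if regs else None), regs[1:]

def load_use_hazards(program):
    """Indices of instructions that read the result of the preceding lw."""
    hazards = []
    for i in range(1, len(program)):
        prev_op, prev_dest, _ = parse(program[i - 1])
        _, _, srcs = parse(program[i])
        if prev_op == "lw" and prev_dest in srcs:
            hazards.append(i)
    return hazards

step3_loop = [
    "addi $t1, $t1, -16",
    "lw $t0, 12($t1)", "lw $t3, 8($t1)", "add $t0, $t0, $s2", "sw $t0, 12($t1)",
    "lw $t0, 4($t1)", "add $t3, $t3, $s2", "sw $t3, 8($t1)",
    "lw $t3, 0($t1)", "add $t0, $t0, $s2", "sw $t0, 4($t1)",
    "add $t3, $t3, $s2", "bne $t1, $s1, loop", "sw $t3, 0($t1)",
]

# Dropping the nop without reordering would leave a hazard at index 1.
naive = ["lw $t0, 12($t1)", "add $t0, $t0, $s2", "sw $t0, 12($t1)"]

print(load_use_hazards(step3_loop), load_use_hazards(naive))
```

Alternating between $t0 and $t3 is what lets each load sit at least two slots ahead of the add that consumes it.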
Time for pipeline with loop unrolling
• 200 ps per cycle, 1 CPI (including nops)
• 4 iterations per loop means 250 times in loop
• Time = 14x200x250 + 2x200 ps = 700.4 ns
Dual Pipeline

• Two instruction pipes
  – one for arithmetic or branch
  – one for load or store
• Instructions can be issued at same time
  – if no data dependencies
  – following instructions follow same delay rules
• Loop unrolling for more overlap
• Register renaming to avoid data dependency
Dual Pipeline Code: pairing instructions

        arithmetic or branch       load or store
        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -4
                                   lw    $t0, 0($t1)
        nop
        add   $t0, $t0, $s2
        bne   $t1, $s1, loop       sw    $t0, 0($t1)
        nop
        xxx
Dual Pipeline Code: fill branch delay slot

        arithmetic or branch       load or store
        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
        addi  $t1, $t1, -4
loop:                              lw    $t0, 0($t1)
        nop
        add   $t0, $t0, $s2
        bne   $t1, $s1, loop       sw    $t0, 0($t1)
        addi  $t1, $t1, -4
        xxx
Time for dual pipeline (no loop unrolling)
• 200 ps per cycle, 1 or 2 instr per cycle
• Time = 5x200x1000 + 3x200 ps = 1,000.6 ns
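On the dual pipeline the unit of time is the issue packet (one cycle, holding up to two instructions), not the instruction. The sketch below writes the delay-slot-filled loop as (arithmetic-or-branch, load-or-store) pairs and counts packets. The helper name and the packet representation are assumptions made for illustration.

```python
ALU_OR_BRANCH = {"sll", "addu", "addi", "add", "bne", "beq", "nop"}
LOAD_OR_STORE = {"lw", "sw"}

def count_packets(packets):
    """Check that each slot holds a permitted opcode; return the cycle count."""
    for alu, mem in packets:
        if alu is not None and alu.split()[0] not in ALU_OR_BRANCH:
            raise ValueError(f"not an arithmetic/branch instruction: {alu}")
        if mem is not None and mem.split()[0] not in LOAD_OR_STORE:
            raise ValueError(f"not a load/store instruction: {mem}")
    return len(packets)

setup = [
    ("sll $t1, $s0, 2", None),
    ("addu $t1, $t1, $s1", None),
    ("addi $t1, $t1, -4", None),
]
loop = [
    (None, "lw $t0, 0($t1)"),
    ("nop", None),
    ("add $t0, $t0, $s2", None),
    ("bne $t1, $s1, loop", "sw $t0, 0($t1)"),
    ("addi $t1, $t1, -4", None),          # branch delay slot
]

total_ps = (count_packets(loop) * 1000 + count_packets(setup)) * 200
print(total_ps / 1000, "ns")
```

Only one of the five packets actually carries two instructions, which is why this version gains little until the loop is unrolled.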
Dual Pipe Optimization with loop unrolling

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -16
        lw    $t0, 12($t1)
        lw    $t3, 8($t1)
        add   $t0, $t0, $s2
        sw    $t0, 12($t1)

        lw    $t0, 4($t1)
        add   $t3, $t3, $s2
        sw    $t3, 8($t1)

        lw    $t3, 0($t1)
        add   $t0, $t0, $s2
        sw    $t0, 4($t1)

        add   $t3, $t3, $s2
        bne   $t1, $s1, loop
        sw    $t3, 0($t1)
loopend:
        xxx
Unrolled and reordered loop
step 1, use more registers (register renaming)

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -16
        lw    $t0, 12($t1)
        lw    $t3, 8($t1)
        add   $t0, $t0, $s2
        sw    $t0, 12($t1)

        lw    $t5, 4($t1)
        add   $t3, $t3, $s2
        sw    $t3, 8($t1)

        lw    $t7, 0($t1)
        add   $t5, $t5, $s2
        sw    $t5, 4($t1)

        add   $t7, $t7, $s2
        bne   $t1, $s1, loop
        sw    $t7, 0($t1)
loopend:
        xxx
step 2, reorder/pair instructions

        arithmetic or branch       load or store
        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -16
                                   lw    $t0, 12($t1)
                                   lw    $t3, 8($t1)
        add   $t0, $t0, $s2        lw    $t5, 4($t1)
        add   $t3, $t3, $s2        sw    $t0, 12($t1)
        add   $t5, $t5, $s2        lw    $t7, 0($t1)
                                   sw    $t3, 8($t1)
        add   $t7, $t7, $s2        sw    $t5, 4($t1)
        bne   $t1, $s1, loop       sw    $t7, 0($t1)
        nop
        xxx
step 3, fill branch delay

        arithmetic or branch       load or store
        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
        addi  $t1, $t1, -16
                                   lw    $t0, 12($t1)
loop:                              lw    $t3, 8($t1)
        add   $t0, $t0, $s2        lw    $t5, 4($t1)
        add   $t3, $t3, $s2        sw    $t0, 12($t1)
        add   $t5, $t5, $s2        lw    $t7, 0($t1)
                                   sw    $t3, 8($t1)
        add   $t7, $t7, $s2        sw    $t5, 4($t1)
        bne   $t1, $s1, loop       sw    $t7, 0($t1)
        addi  $t1, $t1, -16        lw    $t0, -4($t1)
        xxx
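The delay-slot pair is the subtle part of this schedule: the lw issues alongside the addi, so it still sees the old pointer, and offset -4 from the old pointer is the same word as offset 12 from the new one. The index-based model below is a hedged sketch with a hypothetical function name. It assumes both instructions in a packet read their registers before either writes, and that len is a positive multiple of 4.

```python
def add_x_rotated(A, x):
    """Model of the final dual-issue loop, one Python line per issue packet."""
    n = len(A)
    if n == 0 or n % 4 != 0:
        raise ValueError("assumes len is a positive multiple of 4")
    mem = list(A)

    def load(i):
        # The last delay slot reads one word below A; model that as 0
        # instead of letting a negative Python index wrap around.
        return mem[i] if i >= 0 else 0

    k = n - 4                         # addi $t1, $t1, -16      (setup)
    t0 = load(k + 3)                  # lw   $t0, 12($t1)       (setup)
    while True:
        t3 = load(k + 2)              #              | lw $t3, 8($t1)
        t0 += x; t5 = load(k + 1)     # add $t0      | lw $t5, 4($t1)
        t3 += x; mem[k + 3] = t0      # add $t3      | sw $t0, 12($t1)
        t5 += x; t7 = load(k)         # add $t5      | lw $t7, 0($t1)
        mem[k + 2] = t3               #              | sw $t3, 8($t1)
        t7 += x; mem[k + 1] = t5      # add $t7      | sw $t5, 4($t1)
        done = (k == 0)               # bne $t1, $s1, loop
        mem[k] = t7                   #              | sw $t7, 0($t1)
        t0 = load(k - 1)              # delay slot: lw $t0, -4($t1), old pointer
        k -= 4                        # delay slot: addi $t1, $t1, -16
        if done:
            break
    return mem

print(add_x_rotated([1, 2, 3, 4, 5, 6, 7, 8], 10))
```

As the model shows, the final pass through the delay slot performs one extra load from just below the array. Its value is never used.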
Time for dual pipeline

• 200 ps per cycle, 1 or 2 instr per cycle
• 4 iterations per loop, 250 times through loop
• Time = 8x200x250 + 4x200 ps = 400.8 ns
• 10 times faster than single cycle or multi-cycle
• 3 ½ times faster than simple pipeline
• 1 ¾ times faster than pipeline with loop unrolled
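The rounded speedups quoted above can be checked against the totals from the earlier timing slides. The snippet is a small sketch; the dictionary keys are labels chosen here, not names from the slides.

```python
# Totals in picoseconds, taken from the timing slides.
times_ps = {
    "single cycle": 4_001_600,
    "multicycle": 4_001_600,
    "simple pipeline": 1_400_400,
    "pipeline, reordered": 1_200_400,
    "pipeline, unrolled": 700_400,
    "dual pipeline": 1_000_600,
    "dual pipeline, unrolled": 400_800,
}

def speedup(baseline, improved="dual pipeline, unrolled"):
    """How many times faster `improved` is than `baseline`."""
    return times_ps[baseline] / times_ps[improved]

for name in ("single cycle", "simple pipeline", "pipeline, unrolled"):
    print(f"{name}: {speedup(name):.2f}x")
```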
More Parallelism?

• Suppose loop has more operations
  – Multiplication (takes longer than adds)
  – Floating point (takes much longer than integer)
• More parallel pipelines for different operations
  – all the above techniques could result in better performance
• Static (as above) vs dynamic reordering
• Speculation
• Out-of-order execution