Upload
vanig
View
71
Download
5
Embed Size (px)
DESCRIPTION
Elastic Systems. Jordi Cortadella Universitat Politècnica de Catalunya Marc Galceran-Oms Universitat Politècnica de Catalunya Mike Kishinevsky Intel Corp. Elasticity. Leonardo da Vinci’s catapult. Asynchronous elastic pipeline. ReqIn. ReqOut. C. C. C. C. - PowerPoint PPT Presentation
Citation preview
Elastic Systems
Jordi Cortadella Universitat Politècnica de Catalunya
Marc Galceran-Oms Universitat Politècnica de Catalunya
Mike Kishinevsky Intel Corp.
Elasticity
Leonardo da Vinci’s catapult
Asynchronous elastic pipeline
C
ReqIn ReqOut
AckIn AckOut
C C C
David Muller’s pipeline (late 50’s)Sutherland’s Micropipelines (1989)
The specification of a complex system is usuallyasynchronous (functional units, messages, queues, …),
… however the clock appears when we move downto the implementation levels
(Bill Grundmann, 2004)
Asynchronous elasticity
req
ack
CLK
Synchronous elasticity
valid
stop
CLKLatency-insensitive systems (Carloni et al., 1999)Synchronous handshake circuits (Peeters et al, 2001)Synchronous elastic systems (Cortadella et al., 2006)Latency-Insensitive Bounded Dataflow Networks (Vijayaraghavan et al., 2009)Synchronous emulation of asynchronous circuits (O’Leary, 1997)
Many systems are already elastic
AMBA AXI bus protocol
Handshake signals
Time uncertainty in chip design
How manycycles ?
Why elastic circuits now ?
Need to live with time uncertainty
Need to formalize time uncertainty– For synthesis– For verification
Need for modularity
Behavioral equivalence in Elastic Circuits
+…
147
+e
… 348201…
48…147…201… 3
Behavioral equivalence in Elastic Circuits
+…
147
+e
… 348201…
348…147…201…
tokenbubble
Traces a preserved after hiding bubbles(stream-based equivalence)
Unpipelined system
Pipelined system
Write Buffer
Communication channelreceiversender
Data Data
Long wires: slow transmission
Pipelined communicationsender receiver
DataData
sender receiver
DataData
Pipelined communication
The Valid bitsender receiver
Data Data
Valid Valid
The Stop bit
00000
sender
Data
Valid
Stop
receiver
Data
Valid
Stop
The Stop bit
11000
sender
Data
Valid
Stop
receiver
Data
Valid
Stop
The Stop bit
11100
sender
Data
Valid
Stop
receiver
Data
Valid
Stop
The Stop bit
11111
sender
Data
Valid
Stop
receiver
Data
Valid
Stop
Back-pressure
The Stop bit
01111
sender
Data
Valid
Stop
receiver
Data
Valid
Stop
The Stop bit
00000
sender
Data
Valid
Stop
receiver
Data
Valid
Stop
The Stop bit
00000
sender
Data
Valid
Stop
receiver
Data
Valid
Stop
The Stop bit
00000
sender
Data
Valid
Stop
receiver
Data
Valid
Stop
The Stop bit
10000
sender
Data
Valid
Stop
receiver
Data
Valid
Stop
Long combinational path
Carloni’s relay stations (double storage)
shell
pearl
sender
shell
pearl
receiver
main main main
aux aux aux
Carloni’s relay stations (double storage)
main main main
aux aux aux
shell
pearl
receiver
shell
pearl
sender
Carloni’s relay stations (double storage)
main main main
aux aux aux
shell
pearl
receiver
shell
pearl
sender
Carloni’s relay stations (double storage)
main main main
aux aux aux
shell
pearl
receiver
shell
pearl
sender
Carloni’s relay stations (double storage)
main main main
aux aux aux
shell
pearl
sender
shell
pearl
receiver
Carloni’s relay stations (double storage)
main main main
aux aux aux
shell
pearl
sender
shell
pearl
receiver
Carloni’s relay stations (double storage)
main main main
aux aux aux
shell
pearl
sender
shell
pearl
receiver
Carloni’s relay stations (double storage)
main main main
aux aux aux
shell
pearl
sender
shell
pearl
receiver
Carloni’s relay stations (double storage)
main main main
aux aux aux
shell
pearl
receiver
shell
pearl
sender
• Handshakes with short wires• Double storage required
Flip-flops vs. latches
sender receiver
1 cycle
FF FF
Flip-flops vs. latches
sender receiver
1 cycle
H L H L
Flip-flops vs. latches
sender receiver
1 cycle
H L H L
Flip-flops vs. latches
sender receiver
1 cycle
H L H L
Flip-flops vs. latches
sender receiver
1 cycle
H L H L
Flip-flops vs. latches
sender receiver
1 cycle
H L H L
Flip-flops already have adouble storage capability, but …
Flip-flops vs. latches
sender receiver
1 cycle
H L H L
Not allowed in conventionalFF-based design !
Flip-flops vs. latches
sender receiver
1 cycle
H L LH
Let’s make the master/slave latches independent
Flip-flops vs. latches
sender receiverH L H L
½ cycle ½ cycle
Let’s make the master/slave latches independent
Only half of the latches (H or L) can move tokens
Latch-based elasticitysender receiver
V V V V
S S S S
En En En En
1 1
Data
Valid
Stop
Data
Valid
Stop
1 1
Elastic netlists
ForkJoin
Join / Fork
EB
EBEB
EB
Enable signalto data latches
Basic VS block
Si
Eni
Vi
Si-1
Vi-1
VSSi
Eni
Vi
Si-1
Vi-1
Join
VS
+
V1
V2
S1
S2
V
S
VS
VS
(Lazy) Fork
V1
V2
S1
S2
V
S
Eager Fork
V1
V2
S1
S2
^
^
V
S
Variable Latency Units
[0 - k] cycles
V/S V/S
donego clear
Generalization: FIFOs(Bounded Dataflow Networks)
InOut
B1 B3
B2
Elastic Buffers
=
=
Elastic Buffer with Token
Elastic Buffer with Bubble
m =Skid-Buffer (zero latency)
m
-kAnti-token injector
Let’s do transformationsGoal:– Transform the system to improve performance, either
preserving or not preserving time (but preserving behavior)
Few transformations have been re-invented from asynchronous design and dataflow computation:
– Adding bubbles preserves behavior– Early evaluation and anti-tokens– Buffer resizing and slack matching to balance fork/join
structures
Performance is abouttokens and bubbles
How many bubbles do we need?
How many bubbles do we need?
How many bubbles do we need?
How many bubbles do we need?
How many bubbles do we need?
How many bubbles do we need?
How many bubbles do we need?
O(n2) cycles
How many bubbles do we need?
O(n2) cycles
How many bubbles do we need?
O(n2) cycles
How many bubbles do we need?
O(n2) cycles
How many bubbles do we need?
O(n2) cycles
How many bubbles do we need?
O(n2) cycles
How many bubbles do we need?
O(n2) cycles
How many bubbles do we need?
O(n2) cycles
O(n) cycles
How many bubbles do we need?
At least one bubble and one token per cycle, otherwise neither tokens nor bubbles can move (deadlock)
n/2 bubbles for optimum performance(in a balanced cycle)
Performance of an N-stage ring
tokens
0 NN / 2
Throughput
Deadlock
Adding bubbles (retiming & recycling)
9 8 6
4 4
39
10
4
10 - Combinational block with delay 10
- Initialized register (dot)
Cycle time is
Throughput is 1
Effective cycle time is 21
211916
Retiming can not do better!
Retiming and Recycling (R&R) can
Effective cycle time is 19
Effective cycle time is 16
Throughput is 4/5
Effective cycle time is 12 * 5/4 = 15
12
Find a minimal effective cycle time of the circuit represented as retiming graph (RG)!
The longest combinational path delayThe number of valid data/clock cyclecycle time/throughput
Retiming graph
5 registers, 4 tokens
Any integer solution for r:
Final token assignment
Initial register assignment
Retiming vectors
R' contain tokens and anti-tokens
Retiming
Any integer solution for R:
Retiming subset
R – max(R' ,0) = num of bubbles
Retiming & Recycling
-1
Retiming & Recycling configuration
Throughput upper bound of R (Júlvez et al 2006)
Cycle time of R is at most (Bufistov et al 2007)
Effective cycle time(delay/throughput)
Non-convex quadratic optimization problem
Can be transformed into a Mixed IntegerLinear Programming model
Mixed Integer Program for R&R
PC+4
Branch targetaddress
Example: next-PC calculation
Take branch
• Only wait for required inputs• Late arriving tokens are cancelled by anti-
tokens
No branch
Early Evaluation
How to implement anti-tokens ?
Valid+ Valid+
Valid–
Valid+
Stop+ Stop+
Valid–
Stop–Stop–
+
-
Dual elastic controllers
S+
V+
V-
S-
S+
V+
V-
S-
En En
Fork/join
Dual fork/join Join with early evaluation
Re-designing for average performance
FE
arly evaluation
slow / fast
Fslow
Ffast
How can elasticity be used for design optimization?
In “regular” designs: – Take advantage of Don’t Cares = behaviors that never occur
In elastic designs: – Take advantage of “Little Cares” (LCs) = behaviors that rarely occur and “Critical Cores” (CC) = behaviors that occur often
– Can use variable latency: LCs can be made slower
Exploiting “Little cares”
F100 G
100
Goal: minimize time per token (operation) executionMeasured by Effective Cycle Time, ECT
ECT = Clock Period / Throughput = 100 / 1 = 100
Design executes one token per 100 time units
Exploiting “Little cares”
F100 G
100
CC1100 CC2
100
LC1100 LC2
100
p
1-p
CC150
CC250
Exploiting “Little cares”
F100 G
100
LC1’’50 LC2’
50
p
1-p
CC150
CC250
LC2’’50
LC1’50
0
20
40
60
80
100
120
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Performance as function of “critical core” probability
0
0.2
0.4
0.6
0.8
1
1.2
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Throughput
Effective CT
Probability
Probability
original design
new design @ p = 0.9
1.67x performance improvement
H.264 CABAC decoder
Gotmanov, Kishinevsky and Galceran-OmsEvaluation of flexible latencies: designing synchronous elastic H.264 CABAC decoderProc. Problems in design of micro- and nano-electronic systemsMoscow, Oct. 2010 (in Russian)
Profiling
H.264 CABAC decoder
0,90
1,00
1,10
1,20
1,30
1,40
1,50
1,60
1,70
0,50 0,70 0,90 1,10 1,30 1,50 1,70 1,90
Original Optimized
Area
Effective Cycle Time
Elastic Transforms
=
=
-i -j -(i+j)
-1=
-k
=
...
k
Bubble Insertion :(recycling)
Anti-token Insertion :
Add Capacity:
-1=
= k
-1 -1-1
Anti-token Grouping :
Anti-token retiming:
Anti-token Insertion:Multiple
kernel
derivative
Retiming
F FA
B
CA
B
C
FA
B
CFA
B
C
2
2
Capacity sizing may be needed in case of sharing
Register File Bypass
rdwd
=rdwd
=
ra
0
1
Read data rd is forwarded from previous write wd’iff read address ra is the same as previous write address wa’.
wd’
wa’RE
AD
WRI
TEwa
RE
AD
WR
ITE
rawa
Pipelining using elasticity
rdwd
ra
READ
WRI
TE
wa
A
B2B1
Kam, et al. Correct-by-construction Microarchitectural Pipelining, ICCAD 08
Sequentialexecution: R B1 B2R A R B1 B2 R A
Pipelining using elasticity
rdwd
ra
READ
WRI
TE
wa
A
B2B1
Pipelinedexecution: B2B1R
R A
B2B1RR A
B2B1R
Bypass if data dependency
???
B2B1R
Pipelining using elasticity
rdwd
ra
READ
WRI
TE
wa
A
B2B1
Pipelinedexecution: B2B1R
R A
B2B1RR A
B2B1RB2B1R
Pipelining using elasticity
rdwd
ra
READ
WRI
TE
wa
A
B2B1
Pipelining using elasticity
A
B2B1
2 bypasses
rd
wd
=
wd’
wa’
RE
AD
WR
ITE
rawa
wd’’
wa’’
Pipelining using elasticity
A
B2B1
Forwarding
rd
=wa’
RE
AD
WR
ITE
rawa
wa’’
Pipelining using elasticity
A
B2B1
Retiming
rd
=wa’
RE
AD
WR
ITE
rawa
wa’’
Pipelining using elasticity
A
B2B1
Retiming with anti-tokens
rd
=wa’
RE
AD
WR
ITE
rawa
wa’’
-1
Anti-token insertion allows retiming combinations that are not possible in a conventional synchronous circuit
Pipelining using elasticity
A
B2B1rd
=wa’
RE
AD
WR
ITE
rawa
wa’’
-10
System only stalls in case of RAWdependencies with B1-B2
Latency=2, Tokens=1
Initial Graph
Add bypasses to one or more
memory elements
Improve?
Set of near-
optimal design points
YesNo
Since throughput analysis methods are not exact for early evaluation, the best design points found during exploration must be simulated in a second phase of the algorithm to determine the best one.
Exploration Algorithm
R&R MILP method
Simulate near-optimal design points to obtain
actual performance
Write Buffer
Micro-architectural exploration
ConclusionsRigid systems preserve timing equivalence(data always valid at every cycle)
Elastic systems waive timing equivalence to enablemore concurrency
(bubbles decrease throughput, but reduce cycle time)
A new avenue of performance optimizations can emerge to build correct-by-construction pipelines
Θ Θ
Backup slides
Retiming and Recycling
-2
MILP-based approach
Bufistov, et al. Retiming and Recycling for Elastic Systems withEarly Evaluation. DAC 09
- delays at nodes- elastic buffers and anti-tokens at edges
R&R finds a set of pareto-point designs with different cycle time / throughput trade-offs
0 1 1
0 1 1
1
1
-1
3/1=3
2/0.66=3
1/0.5=2
Coarse grain elasticity
114
Deadlocks
Optimal throughput
Notation for elastic systemsElastic buffer with one token of information
Empty elastic buffer (bubble)
Channel with an injector of k negative tokens
-k
Empty elastic buffer with bypass (Skid-buffer with no tokens of information)
m
Latches=2Capacity=2Tokens=1
Latches=2Capacity=2Tokens=0
Latches=0Capacity=0Tokens= -k
Latches=0Capacity=mTokens=0
Marked Graph model
Marked Graph model
Marked Graph model
Dual Marked Graph model
Enabled !
How to implement anti-tokens ?
Positive tokens
Negative tokens
How to implement anti-tokens ?
Positive tokens
Negative tokens
Elastic controllers
V
S
V
S
Data
H
H
L
L
L
H
V
S
V
S
En En
Complex systems need to be elastic
Intel IXP422 Network Processor
Example: DLX Pipeline
Block mux2 EB ID nextPC ALU F RF W RF R
Delay 1.5 3.15 6.0 3.75 13.0 80.0 6.0 11.0
Area 1.5 4.5 72 24 1600 8000 6000
• Memory read latency = 10
• P(ALU)=0.35• P(F) = 0.2• P(MLOAD)=0.25• P(MSTORE)=0.075• P(BR) = 0.125• Mem dependency = 0.5• RF dependency = 0.2• Depth(F) = [1,…,8]
Pipelined DLX
Preserving behavioral equivalence
Combinational logic synthesisCombinational equivalence checking
Preserving behavioral equivalence
Sequential logic synthesisSequential equivalence checking
Different flavors of Elastic Buffers
Flip-flop = Master & Slave
main
aux(Carloni, 1999)
Jacobson 2002Synchronous Interlocked Pipelines
PC+4
Branch targetaddress
Example: next-PC calculation
Take branch
• Only wait for required inputs• Late arriving tokens are cancelled by anti-
tokens
No branch
Early Evaluation
F
Critical cycle(late arrival)
F
F Shorter cycle(but F duplicated)
Shannon decomposition
Early evaluation is useless!!!
0
1
0
1
Speculation with Shared Units
F
Shared unit
Scheduler
1. Speculate which channel the mux will choose next cycle and exec F on it
2. Stop the other elastic channel
3. If next cycle we realize a mistake has been made, exec F on the other channel
0
1
Shared unit
Scheduler
00
p=0.95
p=0.05
1
Cycle : 11
Sequence of correct predictions
F
0
1
Shared unit
00
1
1
Cycle : 2
Scheduler
F
1
2
2
Sequence of correct predictions
0
1
Shared unit
10
Cycle : 3
0
2
Scheduler
F
1
Shared unit
3
2
Misprediction
0
1
Shared unit
Cycle : 4
Scheduler
F
Shared unit
4 23
110 Correction
23
Correction
0
1
2
Shared unit
Cycle : 5
Scheduler
F
Shared unit
23
1
45
34 2
Misprediction stalled 2 cyclesCorrection
0
1
Error Correction using Speculation
wd rd
F1 F2ECC
REGFILE
REA
D
WR
ITE
Error Correction using Speculation
wd rd
F2ECC
REGFILE
REA
D
WR
ITE
F1
shared
Pipelining using elasticity
A
B2B1
Retiming
=wa’
RE
AD
WR
ITE
rawa
wa’’
Pipelining using elasticity
A
B2B1
=wa’
RE
AD
WR
ITE
rawa
wa’’
Pipelining using elasticity
A
B2B1
=wa’
RE
AD
WR
ITE
rawa
wa’’
-1
Pipelining using elasticity
A
B2B1
2 bypasses
=wa’
RE
AD
WR
ITE
rawa
wa’’
-1
System only stalls in case of RAWdependencies with B1-B2