Asynchronous Wave Pipelines for Giga-Hertz VLSI
Oliver Hauck, Atul Katoch
Integrated Circuits and Systems Lab, Departments of CS & EE, Darmstadt University of Technology
Department of Microelectronics, Indian Institute of Technology Bombay
Outline
Pipelines: synchronous, asynchronous, wave pipelined, and asynchronous wave pipelined (AWP)
Comparison: AWPs vs. sync, async, and sync wave pipes
AWP Circuit Design
Application Example: EC Public Key Crypto Processor:
Cryptography background
Chip architecture and implementation
Conclusion
Pipelining
Pipelining is the premier technique to better exploit hardware and boost the performance of VLSI chips
Clocking overhead presents a serious threat for deeply pipelined systems built upon sub-micron CMOS processes running at GHz frequencies
General Framework for Pipelines
[Figure: logic block between input and output latches/registers; Data path and clock Clk, with labels i and o at the input and output registers]
Some Notations...
$t_{hold}$: hold time of register
$t_{setup}$: set-up time of register
$t_d$: propagation delay of register
$t_{skew}$: uncontrolled clock skew at register
$\Delta = \Delta_o - \Delta_i$: delay between input and output clock
$\Delta_i, \Delta_o$: intentional skew at input and output registers
$T_{clk}$: clock period or cycle time
$t_{stable}(i),\ i \in G$: minimum time internal node $i$ has to be stable
$t_{min}(i), t_{max}(i)$: minimum and maximum logic delay from input to internal node $i \in G$
$t_{min}, t_{max}$: minimum and maximum logic delay
$G$: set of all gate output nodes in logic
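General Relations
Data is latched by output clock at time
$$t = k\,T_{clk} + \Delta_o \qquad (1)$$
$k$ is called the "global clock latency"; it equals the number of clocks at the output before data
Lower bound:
$$t \ge \Delta_i + t_d + t_{max} + t_{setup} + t_{skew} \qquad (2)$$
Upper bound:
$$t \le T_{clk} + \Delta_i + t_d + t_{min} - t_{hold} - t_{skew} \qquad (3)$$
Combining (2) and (3), with $\Delta = \Delta_o - \Delta_i$:
$$t_{max} + t_{setup} + t_{skew} \le k\,T_{clk} + \Delta - t_d \le T_{clk} + t_{min} - t_{hold} - t_{skew} \qquad (4)$$
By transitivity, (4) implies:
$$T_{clk} \ge (t_{max} - t_{min}) + t_{setup} + t_{hold} + 2\,t_{skew} \qquad (5)$$
I.e., cycle time is bounded by delay variation, register overhead and clock skew
Similarly, minimum pulse width has to be respected for all $i \in G$:
$$T_{clk} \ge (t_{max}(i) - t_{min}(i)) + t_{stable}(i) + t_{skew} \qquad (6)$$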
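The latching window (4) and the cycle-time bound (5) can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names and the sample delays (in ps) are hypothetical, and non-strict inequalities are assumed.

```python
def min_cycle_time(t_max, t_min, t_setup, t_hold, t_skew):
    """Relation (5): delay variation + register overhead + twice the skew."""
    return (t_max - t_min) + t_setup + t_hold + 2 * t_skew


def latching_ok(k, t_clk, delta, t_d, t_max, t_min, t_setup, t_hold, t_skew):
    """Relation (4): the latching instant must fall inside the window."""
    latch = k * t_clk + delta - t_d
    lower = t_max + t_setup + t_skew          # data has settled
    upper = t_clk + t_min - t_hold - t_skew   # next wave has not arrived yet
    return lower <= latch <= upper


# Hypothetical delays in ps (assumptions for illustration only).
delays = dict(t_d=50, t_max=900, t_min=700, t_setup=40, t_hold=30, t_skew=20)

# Conventional pipeline: k = 1, no intentional skew.
conventional = latching_ok(k=1, t_clk=1010, delta=0, **delays)
# Wave pipeline: k = 3 clocks elapse before the data is latched.
wave = latching_ok(k=3, t_clk=340, delta=0, **delays)
```

With these sample numbers the wave-pipelined case runs at roughly a third of the conventional cycle time while still respecting (5).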
Synchronous Pipeline
[Figure: logic blocks separated by latches/registers; Data path and global clock Clk]
$$k = 1,\ \Delta = 0:\quad T_{clk} \ge t_d + t_{max} + t_{setup} + t_{skew}$$
Throughput determined by longest logic path + clock/register overhead
Fine-grain pipelining allows high throughput at the cost of increased clock/register overhead
Negative side-effects of gate-level pipelining:
Increased latency, clock load/skew, power, area, design time
More area for clocking and registers than for logic
Implementation options:
Register- vs. latch-based, explicit latches vs. latchless
TSPC vs. local clocks derived from global clock
Static vs. dynamic, single-ended vs. dual-rail
Asynchronous Pipeline
[Figure: logic blocks separated by handshake stages; Data path with req_in/ack_in at the input and req_out/ack_out at the output]
Micropipeline (Sutherland 1989)
Synchronous clock replaced by asynchronous handshaking
Elastic operation: input and output rate may differ
momentarily, and pipeline will buffer
Plug & Play composability
Load on req and ack lines distributed
Used by Furber's group at Manchester U for AMULET1/2/3
Operation is data dependent, saves power during idle
As with fine-grain sync pipelines, throughput can be high;
handshake causes high latency and backward stall
Implementation options:
4-phase (level) vs. 2-phase (event) protocol
Bundled data (matched delay) vs. completion detection
Synchronous Wave Pipeline
[Figure: wave logic between latches/registers; Data path and clock Clk, with labels 1 and 2 on the clock path]
Several data waves simultaneously active in the logic
Logic has to minimize delay variations over P,T,V corners
Global clock used with constructive skew to adjust phases
Wave pipelining potentially gives higher throughput than conventional pipelines at decreased latency and reduced clock load, area and power
However, tuning the logic and the delay elements is difficult
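$$k > 1,\ \Delta \ge 0:\quad \frac{t_d + t_{max} + t_{setup} + t_{skew} - \Delta}{k} \le T_{clk} \le \frac{t_d + t_{min} - t_{hold} - t_{skew} - \Delta}{k - 1}$$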
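The double-sided constraint on the clock period of a synchronous wave pipeline can be made concrete with a few numbers. This is a sketch under stated assumptions: the function name and the delay values (in ps) are hypothetical, and the bounds are the ones obtained by solving the general latching-window relation for the clock period with k > 1.

```python
def clock_window(k, delta, t_d, t_max, t_min, t_setup, t_hold, t_skew):
    """Return (T_lo, T_hi), the admissible clock periods for k waves, or None."""
    if k <= 1:
        raise ValueError("the double-sided window only exists for k > 1")
    t_lo = (t_d + t_max + t_setup + t_skew - delta) / k
    t_hi = (t_d + t_min - t_hold - t_skew - delta) / (k - 1)
    return (t_lo, t_hi) if t_lo <= t_hi else None


# Hypothetical delays in ps (assumptions for illustration only).
delays = dict(t_d=50, t_max=900, t_min=700, t_setup=40, t_hold=30, t_skew=20)

windows = {k: clock_window(k, 0, **delays) for k in (2, 3, 4)}
```

For these values the window narrows as k grows and closes entirely at k = 4, which illustrates why tuning a synchronous wave pipeline is difficult: the clock must be neither too slow nor too fast.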
Wave Pipelining: A Short Outline
Wave pipelining occurs when combinational logic is clocked faster than latency would allow
Several data waves are then active in the logic without being separated by storage elements
Latency remains constant and throughput is determined by delay differences rather than absolute delay
Requirement for delay balanced logic and complicated timing are the main hurdles
Wave Pipelining: A Little History
Technique stems from the 60s and has had a reputation for being exotic since
Wave pipelining was long dead before being revived by W. Burleson (U. Mass.), M. Flynn (Stanford U., PhDs by Wong, Klass, and Nowka) and C. Gray at NCSU
Some working academic chips exist, mainly datapath
Some commercial memory is wave pipelined (e.g. ULTRA-III cache), but no logic, as far as we know
Asynchronous Wave Pipeline (AWP)
[Figure: wave logic between wave latches; Data path; request line from req_in to req_out through matched delay elements]
Data words associated with events on request line
Several data waves and protocol events simultaneously
active in the logic and the matched delay element, respectively
AWP is special case of the sync wave pipeline with the
constructive skew set to worst-case logic delay
It is crucial that the delay element accurately tracks the delay
behaviour of the logic over P, T, V corners
$$\Delta \ge t_d + t_{max} + t_{setup} + t_{skew}$$
$$k = 0:\quad T_{clk} \ge \Delta - t_d - t_{min} + t_{hold} + t_{skew}$$
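The AWP condition follows from the general latching-window relation (4) by setting k = 0. The short derivation below is a sketch that assumes non-strict inequalities and takes the matched delay exactly equal to the worst case.

```latex
% Relation (4), with \Delta = \Delta_o - \Delta_i:
%   t_max + t_setup + t_skew <= k T_clk + \Delta - t_d <= T_clk + t_min - t_hold - t_skew
\begin{align*}
k = 0:\quad
  & \Delta \ge t_d + t_{max} + t_{setup} + t_{skew}
  && \text{(left inequality of (4))} \\
  & T_{clk} \ge \Delta - t_d - t_{min} + t_{hold} + t_{skew}
  && \text{(right inequality of (4))} \\
\Delta = t_d + t_{max} + t_{setup} + t_{skew}:\quad
  & T_{clk} \ge (t_{max} - t_{min}) + t_{setup} + t_{hold} + 2\,t_{skew}
  && \text{(same as (5))}
\end{align*}
```

Only a lower bound on the cycle time remains, so the upper bound that limits a synchronous wave pipeline disappears.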
AWPs vs. Synchronous Pipelines
No global clock, instead a local clock (request) that is fed through the pipeline and obeys a simple asynchronous protocol, i.e. data is associated with event on request
Many pipeline registers removed, thus requirements on the clock (request) relaxed
Synchronous pipelines can reach the throughput of AWPs only with excessive cost in area, power and latency
AWPs vs. Asynchronous Pipelines
AWPs deliberately sacrifice the ack and keep only the req to avoid protocol overhead
AWPs not elastic: data at output has to be consumed
AWPs eliminate hazards as side-effect of delay balancing
AWPs have in common with other async methodologies:
data dependent operation (avoids redundant transitions),
composability (though inelastic),
no global clock
AWPs vs. Synchronous Wave Pipelines
AWPs tackle two main difficulties in sync wave pipes:
Replacing the constructive skew by worst-case delay removes the double-sided timing constraint, i.e., in contrast to sync wave pipes, AWPs operate at any rate below the maximum
Using dynamic self-resetting logic controls delay variation and doesn't impact latency much
Wave Pipelining Combinational Logic
Overall goal: keep data wave coherent under all
possible conditions (data, PTV)
Desirable architecture features:
most logic paths have same depth
fanin/fanout the same everywhere
First step: pad all short paths to maximum length
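The padding step can be sketched as a levelization pass over the netlist. This is an illustrative sketch only: the netlist representation (a dict from gate to its fan-in nodes), the function name and the example circuit are assumptions, and real delay balancing also has to equalize gate delays, not just logic depth.

```python
from functools import lru_cache


def padding_buffers(fanin, outputs):
    """Count buffers needed per edge so that every path has the same depth.

    fanin maps each gate to the list of nodes driving it; nodes that do not
    appear as keys are primary inputs (depth 0).
    """
    @lru_cache(maxsize=None)
    def depth(node):
        preds = fanin.get(node, [])
        return 0 if not preds else 1 + max(depth(p) for p in preds)

    pads = {}
    for gate, preds in fanin.items():
        for p in preds:
            slack = depth(gate) - 1 - depth(p)  # levels this input arrives early
            if slack:
                pads[(p, gate)] = slack
    full = max(depth(o) for o in outputs)
    for o in outputs:
        if depth(o) < full:                     # pad short outputs to full depth
            pads[(o, "out")] = full - depth(o)
    return pads


# Hypothetical three-gate example with primary inputs a, b, c.
example = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g2", "a"]}
pads = padding_buffers(example, outputs=["g3", "g1"])
```

Each entry of `pads` gives the number of buffers to insert on one connection; after insertion every input-to-output path in the example has depth 3.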
Example: 64-b Brent-Kung Parallel Adder
[Figure: adder stages in columns 0 to 4, labelled pg, PG, PG, G and xor]
Buffers provide for same depth on every logic path
All gates in the same column must have the same delay
Circuits
Logic style used has to minimize delay variation
Earlier work focused on bipolar logic (ECL, CML), but CMOS is mainstream
Static CMOS is not well suited for wave pipelining; fixing the problem results in more power and slower speed
Pass transistor logic gives sloppy edges, thereby introducing delay variation
Dynamic logic is attractive as only the output high transition is data-dependent; output pulldown is done by precharge
Circuits (cont.)
Using dynamic logic as in Burleson's Wave Domino jeopardizes the concept as it needs fine-grain precharge
What is needed is a dynamic logic family without precharge overhead: SRCMOS
Work done at IBM: classic paper by Chappell et al., "A 2-ns Cycle, 3.8-ns Access 512-kb CMOS ECL SRAM with a Fully Pipelined Architecture," JSSC (26), 11, 1991; or, more recently, Hwang et al., "Implementation of a Self-Resetting CMOS 64-Bit Parallel Adder with Enhanced Testability," JSSC (34), 8, 1999.
SRCMOS
Distinguishing property of our SRCMOS circuits: precharge feedback is fully local, and NMOS trees are delay balanced
[Figure: SRCMOS gate with inputs driving NMOS tree N and output]
Operation of a 2-AND
Delay Balancing at Transistor Level
NMOS tree is designed so that the precharge node is pulled down by a constant number of series devices
Short paths are padded with dummy devices
Delay variation is minimal when exactly one path is on, i.e. wide fanin ORs are hard to use
Every output has to see the same load
Lightly loaded outputs are given dummy cap
Example: Carry tree in a 64-bit adder
$$G_{im} = G_{lm} + P_{lm}\,(G_{kl} + P_{kl}\,(G_{jk} + P_{jk}\,G_{ij}))$$
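The nested group-generate expression expands into four product terms, one per pull-down path. The sketch below is illustrative only (function names are assumptions): it checks by exhaustive enumeration that at most one of these paths conducts as long as the propagate and generate signal of a group are never both 1.

```python
from itertools import product


def gim_terms(g_ij, g_jk, p_jk, g_kl, p_kl, g_lm, p_lm):
    """The four product terms of G_im = G_lm + P_lm(G_kl + P_kl(G_jk + P_jk G_ij))."""
    return [
        g_lm,
        p_lm & g_kl,
        p_lm & p_kl & g_jk,
        p_lm & p_kl & p_jk & g_ij,
    ]


def gim(*signals):
    """Group generate G_im as the OR of the four terms."""
    return int(any(gim_terms(*signals)))


def at_most_one_path():
    """True if no allowed input combination turns on more than one term."""
    for g_ij, g_jk, p_jk, g_kl, p_kl, g_lm, p_lm in product((0, 1), repeat=7):
        if (g_jk and p_jk) or (g_kl and p_kl) or (g_lm and p_lm):
            continue  # excluded: P and G of the same group are never both 1
        if sum(gim_terms(g_ij, g_jk, p_jk, g_kl, p_kl, g_lm, p_lm)) > 1:
            return False
    return True
```

Having a single conducting path for every input that produces a 1 is the condition under which the delay variation of the NMOS tree is minimal.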
Gim Layout
Simulation of Gim cell
Pulses of 4 possible input situations giving '1' at the output are tightly matched
Note: in this case Pxy and Gxy are never both 1
First Pulse Problem
Miller Effect
64-bit Adder Output Waveforms
[Figure: output waveforms with the latching window marked]
Transistor Sizing
[Figure: SRCMOS gate with inputs, NMOS tree N and output, annotated with Wpd, Wkeeper, Wprecharge, Cdrive, Cload and Cfeedback]
Wpd / Cdrive = const
Cdrive / (Cload + Cfeedback + Wkeeper) = const
Cfeedback / Wprecharge = const
Wprecharge / Cdrive = const
LINEAR SIZING
Interconnect: Resistive Effects
0.9µm x 900µm MET2 parasitics: C=116fF, R=70 Ohms
[Figure: waveforms for the wire models C only; RC only; R/2, R/2; R/3, R/3, R/3]
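To get a feeling for the magnitude of the resistive effect, the quoted parasitics can be plugged into a first-order Elmore delay estimate. This is a rough sketch: the uniform RC ladder used here is an assumed model and is not necessarily the exact topology of the wire models simulated on the slide.

```python
def elmore_delay(r_total, c_total, segments):
    """Elmore delay at the far end of a uniform RC ladder.

    Assumed model: the wire is cut into `segments` equal pieces, each a series
    resistor followed by a capacitor to ground, with no load at the far end.
    """
    r, c = r_total / segments, c_total / segments
    # Capacitor i is charged through the first i resistors.
    return sum(i * r * c for i in range(1, segments + 1))


R_WIRE = 70.0        # Ohms, from the slide
C_WIRE = 116e-15     # Farads, from the slide

lumped = elmore_delay(R_WIRE, C_WIRE, 1)     # single RC section
three_seg = elmore_delay(R_WIRE, C_WIRE, 3)  # three-section ladder
```

The single-section estimate is R times C, about 8.1 ps, and the three-section ladder gives two thirds of that; finer segmentation approaches RC/2.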
Interconnect: Coupling Effects
2 adjacent MET2 lines coupled by C=54fF
PTV Variations
SRCMOS provides some robustness by generating fresh pulses at every gate output
Pulsed operation reduces data dependency, coupling
PTV noise is not critical when drift is in the same direction across die
Critical are: temperature gradient, supply drop, and local variations
What is needed: Rule of thumb like "For process X, to be on the safe side, keep area between two latches < Y sqmm"
Cryptography Background
Cryptography - science of keeping communication private
Symmetric schemes - Private key (DES)
Asymmetric schemes - Public key (RSA & ECC)
Private key schemes are quite fast; public key schemes are safer
Security
For comparison: ECC using 261 bits is regarded as safer than RSA using 2048 bits
For secure data transmission one combines both public and private key schemes: data is encrypted using a private key scheme, and the key using a public key scheme
The frequency with which the key can be changed depends upon the speed of the public key cryptosystem
CISCO Data Encryption Service Adapter
[Cisco Systems]
Key Exchange Using Public Key Cryptosystem
For better security it pays to improve both schemes
If ECC scheme is fast then DES session keys can be changed more frequently
[Figure: Source and Sink connected via DES, with keys exchanged via ECC]
DES Key Exchange using Public-Key Cryptosystem based on Elliptic Curves
Public: $E(a,b)$, $P_0 \in E$
Alice: choose random $k_A$ (private key)
Bob: choose random $k_B$ (private key)
Alice: compute $P_A = k_A P_0$ (public key)
Bob: compute $P_B = k_B P_0$ (public key)
Alice: compute $P = k_A P_B = k_A k_B P_0$
Bob: compute $P = k_B P_A = k_A k_B P_0$
Alice and Bob: compute session key via hash function: $D = h(P)$
Alice and Bob now have the same secret DES-Key $D$
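The exchange can be traced end to end on a toy curve. This is a sketch under loud assumptions: it uses a tiny prime field GF(97) with the curve y^2 = x^3 + 2x + 3 instead of the binary field with m = 261, fixed example private keys instead of random ones, and SHA-256 as the hash h; none of these choices come from the slides, and the parameters offer no security.

```python
import hashlib

# Toy parameters (assumptions): y^2 = x^3 + 2x + 3 over GF(97), base point (3, 6).
P_MOD, A, B = 97, 2, 3
P0 = (3, 6)


def on_curve(p):
    """True for the point at infinity (None) or a point satisfying the equation."""
    return p is None or (p[1] ** 2 - (p[0] ** 3 + A * p[0] + B)) % P_MOD == 0


def ec_add(p, q):
    """Add two affine points; None stands for the point at infinity."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                   # p + (-p)
    if p == q:
        s = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD   # tangent slope
    else:
        s = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD          # chord slope
    x3 = (s * s - x1 - x2) % P_MOD
    return x3, (s * (x1 - x3) - y1) % P_MOD


def ec_mul(k, p):
    """Scalar multiplication k*p by left-to-right double-and-add."""
    acc = None
    for bit in bin(k)[2:]:
        acc = ec_add(acc, acc)        # EC double: always
        if bit == "1":
            acc = ec_add(acc, p)      # EC add: only if the key bit is 1
    return acc


def session_key(point):
    """D = h(P), with SHA-256 standing in for the hash function h."""
    return hashlib.sha256(repr(point).encode()).hexdigest()


k_a, k_b = 13, 29                                  # private keys (random in practice)
pub_a, pub_b = ec_mul(k_a, P0), ec_mul(k_b, P0)    # exchanged public keys
shared_a = ec_mul(k_a, pub_b)                      # Alice computes k_A * P_B
shared_b = ec_mul(k_b, pub_a)                      # Bob computes k_B * P_A
key_a, key_b = session_key(shared_a), session_key(shared_b)
```

Both sides arrive at the same point and therefore the same session key, without ever transmitting a private key. The modular inverse via `pow(x, -1, m)` needs Python 3.8 or later.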
Why is this secure?
Security based upon DLP: in a finite Abelian group $G$ we can easily compute $p = k\,p_0$ given $k \in \mathbb{N}$, $p_0 \in G$
However, $k$ is hard to compute out of $p$ and $p_0$
DLP extraordinarily hard for point group of elliptic curve:
$$y^2 + xy = x^3 + ax^2 + b$$
Set of solutions of cubic equation over any field is an abelian group
Elliptic Curve Mathematics and Algorithm
Two types - supersingular and non-supersingular
Non-supersingular have the highest security
EC equation - $y^2 + xy = x^3 + ax^2 + b$
Choice of the Field
The field of the type $F_{2^m}$
Having 2 as characteristic of a field helps in hardware implementation
Our choice m=261:
Existence of Optimal Normal Basis
Determines the data path width and security
Adding Two Points Over Elliptic Curves
Switching to Projective Coordinates
The inversions are quite costly in terms of multiplications
Projective coordinates have no inversions
For m=261:
                Normal    Projective Coordinates
Double + Add    29        20
Projective Coordinates
Optimal Normal Basis
Multiplication over ONBs
The Final Formula
Architecture of Multiplier
[Figure: abx cells 1 to 261 feeding a tree of 3-input XOR gates (3_Xor) with wave latches; width labels 783, 87, 29, 27, 9, 1; circuit styles Pseudo NMOS and SRCMOS; request line with delay elements]
Circuit Style Followed
Dual-rail cross-coupled SRCMOS circuit
NMOS trees are designed such that there is only one conducting path to ground
[Figure: two NMOS trees N, each with its own output Out]
Pulses after First StagePulses after First Stage
Cycle time=666.7ps
Signals after first stage (Data path width = 87)
50
Delay Variations at various stagesDelay Variations at various stages
outp uts after first stage
inputs to final stage
final output
51
The Total LatencyThe Total Latency
latency = 1.9156ns
Input
Final output
Output afte r first stage
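The two simulation figures can be put in relation with a quick calculation. This is a sketch that assumes the 666.7 ps cycle time and the 1.9156 ns latency refer to the same multiplier simulation; the variable names are illustrative.

```python
CYCLE_PS = 666.7      # cycle time from the slide, in ps
LATENCY_PS = 1915.6   # total latency from the slide (1.9156 ns), in ps

throughput_ghz = 1000.0 / CYCLE_PS        # one result per cycle
waves_in_flight = LATENCY_PS / CYCLE_PS   # cycles elapsing during one latency
```

A ratio between 2 and 3 means that a new data wave is launched well before the previous result has left the logic, which is the defining property of wave pipelining.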
Architecture of the Cryptochip
[Figure: block diagram with inputs k, a, b, X0, Y0; registers A, B, D, X, Y, Z; operands opA and opB; AWP MUL block; oscillator, counter and controller generating req; 1-bit serial in and serial out with delay line; output OUT]
Hierarchy of Control
Double-and-Add: key generation rate R
k (bits 260 ... 0) is shifted left, x is the current bit; Hamming weight = 40
EC double (7 MUL): always; EC add (13 MUL): if x=1
EC arithmetic: R * (261*7 + 40*13) = R * 2347 MUL/s
Operations: ADD, MUL, LOAD/STORE (labels 1, 261, 1)
Finite field arithmetic: R * 2347 * 261 = R * 612567 bit/s
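The operation counts on this slide can be reproduced with a few lines. This is a sketch under stated assumptions: the function name is illustrative, a doubling is taken to cost 7 field multiplications and an addition 13 (read off the factors in the slide's formula), and each field multiplication is assumed to process 261 bits.

```python
def field_muls_per_key(k, m=261, mul_per_double=7, mul_per_add=13):
    """Field multiplications for one double-and-add pass over an m-bit key k."""
    adds = bin(k).count("1")          # one EC add per 1-bit (Hamming weight)
    return m * mul_per_double + adds * mul_per_add


# Any key with Hamming weight 40 gives the figure used on the slide.
example_key = (1 << 40) - 1
muls = field_muls_per_key(example_key)    # 261*7 + 40*13
bits = muls * 261                         # bit operations per key
```

Multiplying these per-key counts by the key generation rate R gives the MUL/s and bit/s figures quoted above.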
Control Unit Architecture
Request signals trigger the state transitions. Autonomous state transitions are triggered by signal X
[Figure: control unit with AWP logic between registers (REG); signals IN1, IN2, OUT, req1 ... reqn, Req_out, reset and X; one path labelled "For static operation"]
Highest Level Control
Level-based control
[Figure: state diagram with states 1 to 8 and guards X=0 / X=1. Transition labels: Start/LoadX, ResetZ; LoadY; Shift K; If K=0; If K=1; ShiftK, Double; K=0, DoubleDone; K=1, DoubleDone/Add; AddDone; If Stop=1/KP_Done]
The Request Signal Generation
Middle Level Control : Double Algo
Pulse-based control
[Figure: state diagram with states 0 to 5 and 58 to 63, guards X=0 / X=1; action labels include Start, OPAX, OPBZ, MULT, MD, OPAA, Shift, OPBA]
Request Signal Generation
Various States in a Pulse based Control
Architecture and Implementation
[Figure: same block diagram as on the slide "Architecture of the Cryptochip"]
Conclusion
AWPs presented as alternative approach to high-speed design, showing potential for GHz throughput without clocks
AWPs avoid some problems of conventional wave pipes and (a)synchronous systems
64b adder + test circuit and EC crypto layout in the making
Feasibility of having totally asynchronous control
To do: support transistor sizing, quantify PTV impact