
Asynchronous Wave Pipelines for Giga-Hertz VLSI

Oliver Hauck, Integrated Circuits and Systems Lab, Departments of CS & EE, Darmstadt University of Technology

Atul Katoch, Department of Microelectronics, Indian Institute of Technology Bombay

Outline

Pipelines: synchronous, asynchronous, wave pipelined, and asynchronous wave pipelined (AWP)

Comparison: AWPs vs. sync, async, and sync wave pipes

AWP Circuit Design

Application Example: EC Public Key Crypto Processor (cryptography background; chip architecture and implementation)

Conclusion

Pipelining

Pipelining is used as the premier technique to better exploit hardware and boost the performance of VLSI chips

Clocking overhead presents a serious threat for deeply pipelined systems built upon sub-micron CMOS processes running at GHz frequencies

General Framework for Pipelines

[Figure: Data passes through an input Latch/Reg, Logic, and an output Latch/Reg; Clk drives both registers, with skew labels i (input) and o (output)]

Some Notations...

$t_{hold}$ : hold time of register

$t_{setup}$ : set-up time of register

$t_d$ : propagation delay of register

$t_{skew}$ : uncontrolled clock skew at register

$\Delta = \Delta_o - \Delta_i$ : delay between input and output clock

$\Delta_i, \Delta_o$ : intentional skew at input and output registers

$T_{clk}$ : clock period or cycle time

$t_{stable}(i),\ i \in G$ : minimum time internal node $i$ has to be stable

$t_{min}(i), t_{max}(i),\ i \in G$ : minimum and maximum logic delay from input to internal node $i$

$t_{min}, t_{max}$ : minimum and maximum logic delay

$G$ : set of all gate output nodes in logic

General Relations

Data is latched by output clock at time $t = k\,T_{clk} + \Delta_o$ (1)

$k$ is called "global clock latency", equals # clocks at output before data

Lower bound: $t \ge \Delta_i + t_d + t_{max} + t_{setup} + t_{skew}$ (2)

Upper bound: $t \le T_{clk} + \Delta_i + t_d + t_{min} - t_{hold} - t_{skew}$ (3)

Combining (2) and (3), with $\Delta = \Delta_o - \Delta_i$: $t_{max} + t_{setup} + t_{skew} \le k\,T_{clk} + \Delta - t_d \le T_{clk} + t_{min} - t_{hold} - t_{skew}$ (4)

By transitivity, (4) implies: $T_{clk} \ge (t_{max} - t_{min}) + t_{setup} + t_{hold} + 2\,t_{skew}$ (5)

I.e., cycle time is bounded by delay variation, register overhead and clock skew

Similarly, minimum pulse width has to be respected for all $i \in G$: $T_{clk} \ge (t_{max}(i) - t_{min}(i)) + t_{stable}(i) + t_{skew}$ (6)
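The relations on this slide lend themselves to a quick numeric check. The following is a minimal sketch, not taken from the slides: the `Timing` container, the function names, and the example delays (in ps) are all assumptions made for illustration. It evaluates the cycle-time bound (5) and tests the two-sided condition (4) for a chosen k, clock period, and clock delay.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    """Hypothetical delay set, all values in picoseconds."""
    t_min: int    # minimum logic delay
    t_max: int    # maximum logic delay
    t_d: int      # register propagation delay
    t_setup: int  # register set-up time
    t_hold: int   # register hold time
    t_skew: int   # uncontrolled clock skew


def min_cycle_time(p: Timing) -> int:
    """Bound (5): T_clk >= (t_max - t_min) + t_setup + t_hold + 2*t_skew."""
    return (p.t_max - p.t_min) + p.t_setup + p.t_hold + 2 * p.t_skew


def satisfies_relation_4(p: Timing, k: int, t_clk: int, delta: int) -> bool:
    """Relation (4):
    t_max + t_setup + t_skew <= k*T_clk + delta - t_d
                             <= T_clk + t_min - t_hold - t_skew
    """
    lower = p.t_max + p.t_setup + p.t_skew
    middle = k * t_clk + delta - p.t_d
    upper = t_clk + p.t_min - p.t_hold - p.t_skew
    return lower <= middle <= upper


# Assumed example values, not measured data.
example = Timing(t_min=800, t_max=1000, t_d=100, t_setup=50, t_hold=50, t_skew=25)
print(min_cycle_time(example))                      # 350
print(satisfies_relation_4(example, 1, 1175, 0))    # True
print(satisfies_relation_4(example, 1, 1174, 0))    # False
```

With these assumed numbers, a conventional pipeline (k = 1, zero clock delay) needs 1175 ps per cycle, while bound (5) shows that delay variation and overhead alone would permit 350 ps.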

Synchronous Pipeline

[Figure: Data passes through Latch/Reg, Logic, Latch/Reg; Clk drives both registers]

$k = 1,\ \Delta = 0$: $T_{clk} \ge t_d + t_{max} + t_{setup} + t_{skew}$

Throughput is determined by the longest logic path plus clock/register overhead

Fine-grain pipelining allows high throughput at the cost of increased clock/register overhead

Negative side-effects of gate-level pipelining: increased latency, clock load/skew, power, area, and design time; more area for clocking and registers than for logic

Implementation options: register- vs. latch-based; explicit latches vs. latchless; TSPC vs. local clocks derived from global clock; static vs. dynamic; single-ended vs. dual-rail

Asynchronous Pipeline

[Figure: Data passes through Logic between Handshake stages; control signals req_in, ack_in, req_out, ack_out]

Micropipeline (Sutherland 1989)

Synchronous clock replaced by asynchronous handshaking

Elastic operation: input and output rate may differ momentarily, and the pipeline will buffer

Plug & Play composability

Load on req and ack lines is distributed

Used by Furber's group at Manchester U for AMULET1/2/3

Operation is data dependent, which saves power during idle

As with fine-grain sync pipelines, throughput can be high; the handshake causes high latency and backward stall

Implementation options: 4-phase (level) vs. 2-phase (event) protocol; bundled data (matched delay) vs. completion detection

Synchronous Wave Pipeline

[Figure: Data passes through Latch/Reg, Wave Logic, Latch/Reg; Clk reaches the registers via elements labeled 1 and 2]

Several data waves are simultaneously active in the logic

Logic has to minimize delay variations over P, T, V corners

Global clock is used with constructive skew to adjust phases

Wave pipelining potentially gives higher throughput than conventional pipelines at decreased latency and reduced clock load, area and power

However, tuning the logic and the delay elements is difficult

$k > 1,\ \Delta \ge 0$: $\dfrac{t_d + t_{max} + t_{setup} + t_{skew} - \Delta}{k} \le T_{clk} \le \dfrac{t_d + t_{min} - t_{hold} - t_{skew} - \Delta}{k-1}$
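The two-sided constraint of the synchronous wave pipeline can be made concrete with a small calculation. The sketch below assumes the bounds obtained by solving general relation (4) for the clock period; the function name and all delay values (in ps) are hypothetical and chosen only for illustration.

```python
def clock_window(k, delta, t_d, t_min, t_max, t_setup, t_hold, t_skew):
    """Return (T_low, T_high) for k >= 2, or None if no clock period fits.

    Assumed bounds, derived from relation (4):
      T_clk >= (t_d + t_max + t_setup + t_skew - delta) / k
      T_clk <= (t_d + t_min - t_hold - t_skew - delta) / (k - 1)
    """
    if k < 2:
        raise ValueError("the upper bound only exists for k >= 2")
    t_low = (t_d + t_max + t_setup + t_skew - delta) / k
    t_high = (t_d + t_min - t_hold - t_skew - delta) / (k - 1)
    return (t_low, t_high) if t_low <= t_high else None


# Assumed example delays in picoseconds (not from the slides).
DELAYS = dict(t_d=100, t_min=800, t_max=1000, t_setup=50, t_hold=50, t_skew=25)

for k in (2, 3, 4):
    print(k, clock_window(k, 0, **DELAYS))
```

For these assumed values the window is 587.5 to 825 ps at k = 2, narrows at k = 3, and is empty at k = 4, which illustrates why only certain clock rates are usable for a given number of waves.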

Wave Pipelining: A Short Outline

Wave pipelining occurs when combinational logic is clocked faster than its latency would allow

Several data waves are then active in the logic without being separated by storage elements

Latency remains constant and throughput is determined by delay differences rather than absolute delay

The requirement for delay-balanced logic and the complicated timing are the main hurdles

Wave Pipelining: A Little History

The technique stems from the 60s and has had a reputation for being exotic ever since

Wave pipelining was long dead before being revived by W. Burleson (U. Mass.), M. Flynn (Stanford U., PhDs by Wong, Klass, and Nowka), and C. Gray at NCSU

Some working academic chips exist, mainly datapaths

Some commercial memory is wave pipelined (e.g. ULTRA-III cache), but no logic, as far as we know

Asynchronous Wave Pipeline (AWP)

[Figure: Data passes through Wave Latch, Wave Logic, Wave Latch; req_in passes through a matched delay to req_out]

Data words are associated with events on the request line

Several data waves and protocol events are simultaneously active in the logic and the matched delay element, respectively

AWP is a special case of the sync wave pipeline with the constructive skew set to the worst-case logic delay

It is crucial that the delay element accurately tracks the delay behaviour of the logic over P, T, V corners

$k = 0$: $T_{clk} \ge \Delta - t_d - t_{min} + t_{hold} + t_{skew}$

$\Delta \ge t_d + t_{max} + t_{setup} + t_{skew}$
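To see how the matched delay sets the AWP cycle time, the k = 0 case can be evaluated numerically. This is a rough sketch under assumptions: the function names, the optional `margin` parameter, and the delay values (in ps) are illustrative and do not come from the slides.

```python
def awp_matched_delay(t_d, t_max, t_setup, t_skew, margin=0):
    """Smallest matched delay covering the worst-case path:
    delta >= t_d + t_max + t_setup + t_skew (plus an assumed safety margin)."""
    return t_d + t_max + t_setup + t_skew + margin


def awp_min_cycle(delta, t_d, t_min, t_hold, t_skew):
    """One-sided cycle bound for k = 0:
    T_clk >= delta - t_d - t_min + t_hold + t_skew."""
    return delta - t_d - t_min + t_hold + t_skew


# Assumed example delays in picoseconds.
delta = awp_matched_delay(t_d=100, t_max=1000, t_setup=50, t_skew=25)
print(delta)                                                        # 1175
print(awp_min_cycle(delta, t_d=100, t_min=800, t_hold=50, t_skew=25))  # 350

# Extra margin in the delay element costs cycle time one-for-one.
padded = awp_matched_delay(t_d=100, t_max=1000, t_setup=50, t_skew=25, margin=50)
print(awp_min_cycle(padded, t_d=100, t_min=800, t_hold=50, t_skew=25))  # 400
```

With the tightest matched delay the bound coincides with the general delay-variation limit, and there is no upper bound on the cycle time, so any slower request rate also works.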

AWPs vs. Synchronous Pipelines

No global clock; instead a local clock (request) is fed through the pipeline and obeys a simple asynchronous protocol, i.e. data is associated with an event on request

Many pipeline registers are removed, thus requirements on the clock (request) are relaxed

Synchronous pipelines can reach the throughput of AWPs only at excessive cost in area, power and latency

AWPs vs. Asynchronous Pipelines

AWPs deliberately sacrifice the ack and keep only the req to avoid protocol overhead

AWPs are not elastic: data at the output has to be consumed

AWPs eliminate hazards as a side-effect of delay balancing

AWPs have in common with other async methodologies: data-dependent operation (avoids redundant transitions), composability (though inelastic), and no global clock

AWPs vs. Synchronous Wave Pipelines

AWPs tackle two main difficulties in sync wave pipes:

Replacing the constructive skew by the worst-case delay removes the double-sided timing constraint, i.e. in contrast to sync wave pipes, AWPs operate at any rate up to the maximum

Using dynamic self-resetting logic controls delay variation and doesn't impact latency much

Wave Pipelining Combinational Logic

Overall goal: keep the data wave coherent under all possible conditions (data, PTV)

Desirable architecture features: most logic paths have the same depth; fanin/fanout is the same everywhere

First step: pad all short paths to maximum length
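The padding step can be sketched as a small graph computation. The code below is an illustration under assumptions: the netlist format (a dictionary from gate name to its driver names), the function name `padding_plan`, and the example netlist are all hypothetical. It counts how many buffers each connection needs so that every input-to-output path has the same gate depth.

```python
from functools import lru_cache


def padding_plan(fanin, outputs):
    """Return {(driver, sink): buffer_count} that equalizes all path depths.

    fanin maps each gate to a non-empty list of its drivers; names that are
    not keys of fanin are treated as primary inputs (depth 0).
    """
    @lru_cache(maxsize=None)
    def depth(node):
        if node not in fanin:
            return 0
        return 1 + max(depth(src) for src in fanin[node])

    plan = {}
    for gate, drivers in fanin.items():
        for src in drivers:
            # Buffers needed so this driver arrives one level before the gate.
            gap = depth(gate) - 1 - depth(src)
            if gap:
                plan[(src, gate)] = gap
    deepest = max(depth(out) for out in outputs)
    for out in outputs:
        gap = deepest - depth(out)
        if gap:
            plan[(out, "OUT")] = gap  # pad shallow outputs to the full depth
    return plan


# Hypothetical four-gate netlist with primary inputs a, b, c.
netlist = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g2", "a"], "g4": ["b", "c"]}
print(padding_plan(netlist, outputs=["g3", "g4"]))
```

For this example, one buffer is needed on c to g2, two on a to g3, and two after g4, which shows how quickly padding cost grows when path depths differ.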

Example: 64-b Brent-Kung Parallel Adder

[Figure: adder structure with columns 0 to 4 labeled pg, PG, PG, G, xor]

Buffers provide for the same depth on every logic path

All gates in the same column must have the same delay

Circuits

The logic style used has to minimize delay variation

Earlier work focused on bipolar logic (ECL, CML), but CMOS is mainstream

Static CMOS is not well suited for wave pipelining; fixing the problem results in more power and slower speed

Pass transistor logic gives sloppy edges, thereby introducing delay variation

Dynamic logic is attractive as only the output high transition is data-dependent; the output pulldown is done by precharge

Circuits (cont.)

Using dynamic logic as in Burleson's Wave Domino jeopardizes the concept as it needs fine-grain precharge

What is needed is a dynamic logic family without precharge overhead: SRCMOS

Work done at IBM: classic paper by Chappell et al., "A 2-ns Cycle, 3.8-ns Access 512-kb CMOS ECL SRAM with a Fully Pipelined Architecture," JSSC (26), 11, 1991; or, more recently, Hwang et al., "Implementation of a Self-Resetting CMOS 64-Bit Parallel Adder with Enhanced Testability," JSSC (34), 8, 1999

SRCMOS

Distinguishing property of our SRCMOS circuits: precharge feedback is fully local, and NMOS trees are delay balanced

[Figure: SRCMOS gate with inputs driving an NMOS tree (N) and an output]

Operation of a 2-AND

Delay Balancing at Transistor Level

NMOS tree is designed so that the precharge node is pulled down by a constant number of series devices

Short paths are padded with dummy devices

Delay variation is minimal when exactly one path is on, i.e. wide fanin ORs are hard to use

Every output has to see the same load

Lightly loaded outputs are given dummy cap

Example: Carry tree in a 64-bit adder

$G_{im} = G_{lm} + P_{lm}\,(G_{kl} + P_{kl}\,(G_{jk} + P_{jk}\,G_{ij}))$
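The carry-merge function G_im = G_lm + P_lm(G_kl + P_kl(G_jk + P_jk G_ij)) expands into four product terms, each corresponding to one series path in the pulldown tree. The sketch below is an illustrative check, with hypothetical function names, that at most one of these paths conducts whenever the propagate and generate signals of a group are never both 1.

```python
from itertools import product


def pulldown_paths(g_ij, p_jk, g_jk, p_kl, g_kl, p_lm, g_lm):
    """The four product terms of G_im, one per series path to ground."""
    return [
        g_lm,
        p_lm & g_kl,
        p_lm & p_kl & g_jk,
        p_lm & p_kl & p_jk & g_ij,
    ]


def g_im(*signals):
    """Group generate output: 1 if any path conducts."""
    return int(any(pulldown_paths(*signals)))


# Assumption: P and G of the same group are mutually exclusive.
LEGAL_PG = [(0, 0), (1, 0), (0, 1)]


def max_paths_on():
    """Largest number of simultaneously conducting paths over legal inputs."""
    worst = 0
    for g_ij in (0, 1):
        for (p_jk, g_jk), (p_kl, g_kl), (p_lm, g_lm) in product(LEGAL_PG, repeat=3):
            paths = pulldown_paths(g_ij, p_jk, g_jk, p_kl, g_kl, p_lm, g_lm)
            worst = max(worst, sum(paths))
    return worst


print(max_paths_on())  # 1
```

The paths have one to four series terms, so in a delay-balanced tree the shorter ones would still need dummy devices to reach a constant series depth.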

Gim Layout

Simulation of Gim cell

Pulses of the 4 possible input situations giving '1' at the output are tightly matched

Note: in this case Pxy = Gxy = 1 never occurs

First Pulse Problem

Miller Effect

64-bit Adder Output Waveforms

[Figure: adder output waveforms with the latching window marked]

Transistor Sizing

[Figure: SRCMOS gate with inputs, NMOS tree (N), and output, annotated with Wpd, Wkeeper, Wprecharge, Cdrive, Cload, Cfeedback]

Wpd / Cdrive = const

Cdrive / (Cload + Cfeedback + Wkeeper) = const

Cfeedback / Wprecharge = const

Wprecharge / Cdrive = const

LINEAR SIZING

Interconnect: Resistive Effects

0.9µm x 900µm MET2 parasitics: C=116fF, R=70 Ohms

[Figure: simulated waveforms for four wire models: C only; RC only; R/2, R/2; R/3, R/3, R/3]
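One common way to estimate how the segmentation of a wire model changes its delay is the Elmore delay of an RC ladder. The sketch below is a first-order illustration only: it assumes each section is a series resistor followed by a shunt capacitor, which may differ from the exact topologies simulated on the slide, and the function name is hypothetical. Only the total R and C values are taken from the slide.

```python
def elmore_delay(r_total, c_total, sections):
    """Elmore delay at the far end of a ladder of identical RC sections."""
    r = r_total / sections
    c = c_total / sections
    # The j-th capacitor is charged through the resistance of j sections.
    return sum(j * r * c for j in range(1, sections + 1))


R_WIRE = 70.0       # Ohms, from the slide
C_WIRE = 116e-15    # Farads, from the slide

for n in (1, 2, 3):
    print(n, elmore_delay(R_WIRE, C_WIRE, n) * 1e12, "ps")
```

Under this assumption the estimate is R*C*(n+1)/(2n), so it falls from about 8.1 ps for a single lumped section toward R*C/2 (about 4.1 ps) as the number of sections grows.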

Interconnect: Coupling Effects

2 adjacent MET2 lines coupled by C=54fF

PTV Variations

SRCMOS provides some robustness by generating fresh pulses at every gate output

Pulsed operation reduces data dependency and coupling

PTV noise is not critical when drift is in the same direction across the die

Critical are: temperature gradient, supply drop, and local variations

What is needed: a rule of thumb like "For process X, to be on the safe side, keep area between two latches < Y sqmm"

Cryptography Background

Cryptography - science of keeping communication private

Symmetric schemes - Private key (DES)

Asymmetric schemes - Public key (RSA & ECC)

Private key schemes are quite fast; public key schemes are safer

Security

For comparison: ECC using 261 bits is regarded as safer than RSA using 2048 bits

For secure data transmission one combines public and private key schemes: data is encrypted using the private key scheme, and the key is encrypted using the public key scheme

The frequency with which the key can be changed depends upon the speed of the public key cryptosystem

CISCO Data Encryption Service Adapter

[Cisco Systems]

Key Exchange Using Public Key Cryptosystem

For better security it pays to improve both schemes

If the ECC scheme is fast, then DES session keys can be changed more frequently

[Figure: Source and Sink exchange data encrypted with DES; the DES keys are exchanged via ECC]

DES Key Exchange using Public-Key Cryptosystem based on Elliptic Curves

Public: elliptic curve $E(a,b)$ and point $P_0 \in E$

Alice: choose random $k_A$ (private key); Bob: choose random $k_B$ (private key)

Alice: compute $k_A P_0$ (public key); Bob: compute $k_B P_0$ (public key)

Alice: compute $P = k_A (k_B P_0)$; Bob: compute $P = k_B (k_A P_0)$

Both: compute session key via hash function: $D = h(P)$

Alice and Bob now have the same secret DES key $D$
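The exchange on this slide can be tried out on a toy curve. The sketch below deliberately swaps in a tiny prime-field curve (y^2 = x^3 + 2x + 2 over F_17 with base point (5, 1) of order 19, a standard textbook example) instead of the binary field with m = 261 used for the chip, and it uses SHA-256 as a stand-in for the unspecified hash function h. All names and key values are assumptions for illustration and offer no security.

```python
import hashlib

P_MOD, A = 17, 2   # toy curve y^2 = x^3 + 2x + 2 over F_17 (not the chip's field)
P0 = (5, 1)        # public base point, order 19


def ec_add(p, q):
    """Add two points; None represents the point at infinity."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p == q:
        s = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        s = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (s * s - x1 - x2) % P_MOD
    return (x3, (s * (x1 - x3) - y1) % P_MOD)


def ec_mul(k, point):
    """Left-to-right double-and-add scalar multiplication."""
    result = None
    for bit in bin(k)[2:]:
        result = ec_add(result, result)       # always double
        if bit == "1":
            result = ec_add(result, point)    # add only for 1 bits
    return result


def session_key(point):
    """Hypothetical hash h(P): first 8 bytes of SHA-256 over the coordinates."""
    return hashlib.sha256(repr(point).encode()).hexdigest()[:16]


k_a, k_b = 3, 7                               # toy private keys
pub_a, pub_b = ec_mul(k_a, P0), ec_mul(k_b, P0)
shared_a, shared_b = ec_mul(k_a, pub_b), ec_mul(k_b, pub_a)
print(shared_a == shared_b, session_key(shared_a) == session_key(shared_b))
```

Both sides arrive at the same point because scalar multiplication commutes, so the derived session keys match. The modular inverse via `pow(x, -1, m)` requires Python 3.8 or later.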

Why is this secure?

Security is based upon the DLP: in a finite Abelian group $G$ we can easily compute $p = k\,p_0$ given $p_0 \in G$ and $k \in \mathbb{N}$

However, $k$ is hard to compute out of $p$ and $p_0$

The DLP is extraordinarily hard for the point group of an elliptic curve: $y^2 + xy = x^3 + ax^2 + b$

The set of solutions of a cubic equation over any field is an abelian group

Elliptic Curve Mathematics and Algorithm

Two types - supersingular and non-supersingular

Non-supersingular have the highest security

EC equation - $y^2 + xy = x^3 + ax^2 + b$

Choice of the Field

The field is of the type $F_{2^m}$

Having 2 as the characteristic of the field helps in hardware implementation

Our choice m=261: existence of an Optimal Normal Basis; determines the data path width and security

Adding Two Points Over Elliptic Curves

Switching to Projective Coordinates

The inversions are quite costly in terms of multiplications

Projective coordinates have no inversions

For m=261, multiplications per Double + Add: Normal coordinates 29, Projective coordinates 20

Projective Coordinates

Optimal Normal Basis

Multiplication over ONBs

The Final Formula

Architecture of Multiplier

[Figure: multiplier architecture: request line with delay elements; abx cells 1 to 261; 3_Xor stages (signals numbered 1 to 783, stage widths 87, 29, 27, 9, 1) separated by wave latches; circuit styles Pseudo NMOS and SRCMOS]

Circuit Style Followed

Dual-rail cross-coupled SRCMOS circuit

NMOS trees are designed such that there is only one conducting path to ground

[Figure: dual-rail gate with two NMOS trees (N) and complementary outputs (Out)]

Pulses after First Stage

Cycle time = 666.7 ps

Signals after first stage (data path width = 87)

Delay Variations at various stages

[Figure: waveforms of outputs after first stage, inputs to final stage, and final output]

The Total Latency

latency = 1.9156 ns

[Figure: waveforms of input, output after first stage, and final output]
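The reported cycle time of 666.7 ps and latency of 1.9156 ns can be related with simple arithmetic. The sketch below, with hypothetical helper names, converts the cycle time to a throughput figure and estimates how many data waves are in flight at once.

```python
CYCLE_PS = 666.7      # cycle time reported on the "Pulses after First Stage" slide
LATENCY_PS = 1915.6   # total latency reported on this slide


def throughput_ghz(cycle_ps):
    """Results per nanosecond, i.e. throughput in GHz."""
    return 1000.0 / cycle_ps


def waves_in_flight(latency_ps, cycle_ps):
    """Average number of data waves travelling through the logic at once."""
    return latency_ps / cycle_ps


print(round(throughput_ghz(CYCLE_PS), 2))               # 1.5
print(round(waves_in_flight(LATENCY_PS, CYCLE_PS), 2))  # 2.87
```

So at roughly 1.5 GHz, close to three waves share the datapath without intermediate storage between them.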

Architecture of the Cryptochip

[Figure: chip block diagram: inputs k, a, b, X0, Y0; registers A, B, D, X, Y, Z; operand buses opA and opB; ADD unit and AWP MUL unit with OUT; Controller with Oscillator and Counter; 1 bit serial in and serial out; req through delay line]

Hierarchy of Control

Double-and-Add: key generation rate R; the key k (bits 260 to 0, Hamming weight = 40) is left shifted into x; EC double is executed always, EC add only if x=1

EC arithmetic: R * (261*7 + 40*13) = R * 2347 MUL/s (EC double and EC add annotated 7 and 13)

Finite field arithmetic: R * 2347 * 261 = R * 612567 bit/s (ADD, MUL, LOAD/STORE annotated 1, 261, 1)
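The workload figures on this slide follow from counting operations in the double-and-add loop. The sketch below reproduces that arithmetic and shows the operation schedule for a short key; the function names are hypothetical, and the per-operation multiplication counts (7 per EC double, 13 per EC add) are read from the slide's expression 261*7 + 40*13.

```python
def mul_count(key_bits, hamming_weight, mul_per_double=7, mul_per_add=13):
    """Field multiplications per key: one double per bit, one add per 1 bit."""
    return key_bits * mul_per_double + hamming_weight * mul_per_add


def schedule(k, width):
    """EC operations issued while scanning k from the most significant bit."""
    ops = []
    for i in reversed(range(width)):
        ops.append("double")          # executed always
        if (k >> i) & 1:
            ops.append("add")         # executed only if the shifted-out bit is 1
    return ops


muls = mul_count(key_bits=261, hamming_weight=40)
print(muls)          # 2347
print(muls * 261)    # 612567
print(schedule(0b101, 3))
```

A key with fewer 1 bits needs fewer EC adds, which is why the Hamming weight of 40 enters the workload estimate.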

Control Unit Architecture

Request signals trigger the state transitions. Autonomous state transitions are triggered by signal X

[Figure: control unit with AWP logic between registers (REG); inputs IN1, IN2, req1 to reqn, reset, and X; outputs OUT and Req_out; annotation: for static operation]

Highest Level Control

Level-based control

[Figure: state diagram with states 1 to 8; transition and action labels include Start/LoadX, ResetZ, LoadY, Shift K, Double, DoubleDone, Add, AddDone, K=0, K=1, X=0, X=1, and Stop=1/KP_Done]

The Request Signal Generation

Middle Level Control: Double Algo

Pulse-based control

[Figure: state diagram with states 0 to 5 and 58 to 63, transitions on X=0 and X=1; action labels include Start, OPAX, OPBZ, MULT, MD, OPAA, Shift, OPBA]

Request Signal Generation

Various States in a Pulse-Based Control

Architecture and Implementation

[Figure: chip block diagram, repeated from the "Architecture of the Cryptochip" slide]

Conclusion

AWPs presented as an alternative approach to high-speed design; they show potential for GHz throughput without clocks

AWPs avoid some problems of conventional wave pipes and (a)synchronous systems

64b adder + test circuit and EC crypto layout in the making

Feasibility of having totally asynchronous control

To do: support transistor sizing, quantify PTV impact
