High Performance Asynchronous ASIC Back-End Design Flow Using Single-Track Full-Buffer Standard Cells
Marcos Ferretti, Recep O. Ozdag, Peter A. Beerel
Department of Electrical Engineering Systems, University of Southern California
USC Asynchronous CAD/VLSI Group

Key to High-Speed Async Design
Completion detection demands 2-D pipelining.
Figure: a bundled-data pipeline (control logic plus latches in the datapath) compared with a 2-D pipeline (pipeline stages connected by async. channels)
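
Asynchronous Channels
• 1-of-N channel: the sender drives 1-of-N data and the receiver answers on a separate Acknowledge wire; each token takes four transitions (data up, acknowledge up, data down, acknowledge down).
• 1-of-N single-track channel: the sender drives 1-of-N data and the receiver acknowledges by resetting the same wire; each token takes two transitions.
• GasP bundled-data channel: a single-track control channel (Req and Ack share one wire, two transitions) accompanies single-rail data held in latches; the data must be stable while the control channel signals.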
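The transition counts above can be checked with a small behavioural model. This is an illustrative sketch only: the class names are hypothetical, timing is ignored, and it does not model the real single-track requirement that the sender stop driving the wire before the receiver resets it.

```python
class FourPhaseChannel:
    """1-of-N data rails plus a separate acknowledge wire (four-phase handshake)."""

    def __init__(self, n):
        self.rails = [0] * n
        self.ack = 0
        self.transitions = 0  # wire transitions observed on the channel

    def transfer(self, value):
        assert not any(self.rails) and self.ack == 0, "channel must start neutral"
        self.rails[value] = 1        # 1. sender raises one data rail
        received = self.rails.index(1)
        self.ack = 1                 # 2. receiver raises acknowledge
        self.rails[value] = 0        # 3. sender returns the rail to neutral
        self.ack = 0                 # 4. receiver lowers acknowledge
        self.transitions += 4
        return received


class SingleTrackChannel:
    """1-of-N rails only: the receiver acknowledges by pulling the active rail low."""

    def __init__(self, n):
        self.rails = [0] * n
        self.transitions = 0

    def transfer(self, value):
        assert not any(self.rails), "sender may drive only an empty channel"
        self.rails[value] = 1        # 1. sender drives one rail high
        received = self.rails.index(1)
        self.rails[received] = 0     # 2. receiver resets the same rail
        self.transitions += 2
        return received


data = [2, 0, 3, 1]
four_phase, single_track = FourPhaseChannel(4), SingleTrackChannel(4)
assert [four_phase.transfer(v) for v in data] == data
assert [single_track.transfer(v) for v in data] == data
print(four_phase.transitions, single_track.transitions)  # 16 8
```

Both channels deliver the same values; the single-track channel needs half the wire transitions per token because the acknowledge travels back on the data wire itself.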

GasP (Sutherland et al. ’01)
Bundled-data pipeline using single-track control
Figure: GasP control (self-resetting NAND with staticizer; state nodes A and B; channel wires L and R) sends pulses to the data latches in the datapath
Forward latency fw = 4 transitions; cycle time = 6 transitions (includes latch setup time and delay)
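
Precharge Half-Buffer (Lines ’98)
2-D pipeline using 1-of-N delay-insensitive channels and QDI cells
Figure: Precharge Half-Buffer template: one NMOS transistor stack per output rail, precharged (Pc) and evaluated (Eval) under control of the input (LCD) and output (RCD) completion detectors combined by a C-element; input channel L with enable Le, output channel R with enable Re
Forward latency fw = 2 transitions; cycle time = 14+ transitions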

Single-Track Asynchronous Pulsed Logic (Nyström ’01)
STAPL uses pulse generators to control driver activation timing.
Figure: STAPL template and schematic for a dual-rail output: NMOS transistor stack on inputs L01, L11 … L0n, L1n; pulse generators; reset; state nodes S0, S1; outputs R0, R1; output completion detection (RCD)
Forward latency fw = 2 transitions; cycle time = 10 transitions

Single-Track Full-Buffer (Ferretti ’02)
Figure: block diagram, schematic for a dual-rail output (NMOS transistor stack on inputs L01, L11 … L0n, L1n; state nodes S0, S1; outputs R0, R1; completion detectors SCD and RCD with nodes A and B; reset; C-element) and timing diagram (L, S, A, B, R)
Forward latency fw = 2 transitions; cycle time = 6 transitions: small and fast

STFB: Trade-off Between Speed and Robustness
Features of STFB:
• 3x faster than QDI and about half the size
• Smaller and faster than STAPL
• Smaller forward latency and fewer timing assumptions than GasP
Figure: performance vs. robustness chart placing GasP (Sutherland, Sun), STAPL (Nyström, Caltech), STFB (Ferretti, USC) and QDI (Lines, Caltech)
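The fw and cycle-time figures on the template slides can be compared directly. A rough sketch, assuming every transition costs the same delay (the 120 ps value below is purely illustrative, not from the slides): cycle time then scales with transition count, so counting alone puts STFB about 2.3x ahead of PCHB and level with GasP, while STFB keeps GasP's cycle with half its forward latency.

```python
# Forward latency fw and cycle time, in transitions, as given on the template slides.
TEMPLATES = {
    "GasP": {"fw": 4, "cycle": 6},
    "PCHB": {"fw": 2, "cycle": 14},  # slide says "14+", so this is a lower bound
    "STAPL": {"fw": 2, "cycle": 10},
    "STFB": {"fw": 2, "cycle": 6},
}


def cycle_time_ns(template, delay_ps):
    """Cycle time if every transition costs `delay_ps` (a hypothetical uniform delay)."""
    return TEMPLATES[template]["cycle"] * delay_ps / 1000.0


def throughput_ratio(fast, slow):
    """How many times more tokens per second `fast` sustains than `slow`."""
    return TEMPLATES[slow]["cycle"] / TEMPLATES[fast]["cycle"]


for name in TEMPLATES:
    print(f"{name:6s} cycle = {cycle_time_ns(name, 120):.2f} ns at 120 ps/transition")
print(f"STFB vs PCHB (transition count only): {throughput_ratio('STFB', 'PCHB'):.2f}x")
```

Real cells do not have uniform transition delays, so this only shows how the transition counts translate into relative throughput.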

Motivation and Goals
• Develop a methodology to design STFB-based asynchronous circuits using conventional CAD tools
  – Create an STFB standard-cell library
  – Make the library publicly available
• Design and fabricate a demonstration test chip
• Evaluate the results
Ultimate goal: full-custom performance with ASIC design times

Outline
STFB standard-cell design
Backend design flow
Demonstration test chip
Conclusions

STFB Standard-Cell Design: Transistor sizing
• STFB channels are point to point (no forked wires)
• One size per cell in the library is adequate

STFB Standard-Cell Design: Transistor sizing
• 2x minimum-size N-stack strength
• 1:4-5 drive ratio
• Drives up to a 1 mm long wire
Figure: STFB stage (L, Sx, Rx, B, RCD, SCD, A, NMOS transistor stack, C) driving a wire of ≤ 1 mm; annotated sizes 2x and 8x, widths 2.8, 10 and Wn
TSMC 0.25 µm; widths in µm, all lengths 0.24 µm

STFB Standard-Cell Design: Balanced response
• Balanced SCD/RCD gates give data-independent timing assumptions
Figure: SCD balanced NAND (2x) with inputs S0, S1 and output A (widths 2.8 and 1.2); RCD balanced NOR (1x) with inputs R0, R1 and output B (widths 1.4 and 1.2)
TSMC 0.25 µm; widths in µm, all lengths 0.24 µm

STFB Standard-Cell Design: STFB_POUT sub-cell
• Yields less load on B and faster S reset
Figure: STFB_POUT sub-cell schematic (nodes S, R, B, NR; widths 0.6, 2.8, 10, 1.2, 1.4/0.6 and 0.3) and layout, annotated: staticizer; fights charge sharing; fast S reset; fights leakage current
TSMC 0.25 µm; widths in µm, all lengths 0.24 µm

STFB Standard-Cell Design: Reset transistors
• 2-input NAND → less load on S
• Initial idea: a 3-input NAND (A, S0, S1, with /Reset)
• 1-of-2 cell: 2-input NAND + inverter
• 1-of-3 cell: two 2-input NANDs
Figure: reset transistors, reset inverter and NAND layout (from the STFB_XOR2 cell)
TSMC 0.25 µm; widths in µm, all lengths 0.24 µm

STFB Standard-Cell Design: Direct-path current analysis
• Average direct-path current is similar to an inverter
Figure: direct-path current Idp versus time for an inverter (M1, M2; input swinging between Vtn and VDD - Vtp; single peak Ipeak) and for the STFB driver (node A driving Sx; peaks Ipeak1 and Ipeak2); plot of peak current (0 to 800 µA) versus Vdiff = VA - VSx (-2.5 to 2.5 V) showing Ipeak1, Ipeak2 and their average (Ipeak1 + Ipeak2)/2

Standard-Cell Library Development (Ozdag ’04)
Same tools and flow as synchronous design:
• Template specifications → standard cell specifications
• Symbol, schematic and functional views (Virtuoso, Emacs)
• Simulation (Verilog, Hspice)
• Layout (Virtuoso)
• LVS/DRC (Dracula/Diva)
• Cell abstract (Envisia)
Result: asynchronous cell library (symbol, schematic, functional and abstract views)

Asynchronous ASIC Design Flow (Ozdag ’04)
Same tools and flow as synchronous design:
• Design specifications
• Schematic (Virtuoso), using the symbol, schematic and functional views of the asynchronous cell library
• Simulation (Verilog, Nanosim)
• Place & Route (Silicon Ensemble), using the cell abstracts
• Layout chip assembly (Virtuoso)
• LVS/DRC (Dracula/Diva)
• Chip fabrication

Cell Layout Example: STFB2_XOR2
• Each cell comprises an entire STFB pipeline stage
Figure: STFB2_XOR2 layout with dual-rail inputs A0, A1, B0, B1; reset logic (/Reset, S0, S1); SCD and RCD; two STFB_POUT sub-cells driving outputs R0 and R1; NMOS evaluation stacks for the XOR terms (a0/b0, a0/b1, a1/b0, a1/b1)

Prefix Adder (Goldovsky ’99)
Figure: 8-bit example with inputs c-1, a0 b0 … a7 b7 and outputs s0 … s7, c7
Depth: 3 + log2 n stages; width: 2n + 1 (one input per a bit, b bit and the carry in)
Cells used:
• STFB2_FORK (fork stage)
• STFB2_BUFFER (buffer stage)
• STFB2_XOR2 (2-input XOR stage)
• STFB3_AB_KPG and STFB3_AB_KPG2
• STFB3_KPG2_KPG and STFB3_KPG2_KPG2
• STFB3_KPGC_C and STFB3_KPGC_C2
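The KPG cell names point to a kill/propagate/generate prefix computation. A behavioural sketch of that arithmetic, assuming a Kogge-Stone-style log-depth tree purely for illustration: the chip follows Goldovsky ’99, whose exact topology is not reproduced here, and the sketch ignores the dual-rail/single-track encoding entirely. `array_size` only evaluates the slide's two labels; all function names are hypothetical.

```python
import math

K, P, G = "K", "P", "G"  # kill, propagate, generate


def kpg(a_bit, b_bit):
    """Classify one bit position."""
    if a_bit and b_bit:
        return G
    if not a_bit and not b_bit:
        return K
    return P


def combine(hi, lo):
    """Associative KPG operator: the more significant group wins unless it propagates."""
    return lo if hi == P else hi


def prefix_add(a, b, cin, n):
    """Add two n-bit integers with a log-depth (Kogge-Stone style) KPG prefix tree."""
    # Position 0 models the carry in (c-1) as generate or kill; position i + 1 is bit i.
    x = [G if cin else K] + [kpg((a >> i) & 1, (b >> i) & 1) for i in range(n)]
    span = 1
    while span < len(x):
        # Each level merges every group with the group `span` positions below it.
        x = [combine(x[i], x[i - span]) if i >= span else x[i] for i in range(len(x))]
        span *= 2
    # x[i] now summarizes positions i..0, so it equals the carry into bit i.
    carry_in = [1 if s == G else 0 for s in x]
    total = 0
    for i in range(n):
        propagate = ((a >> i) ^ (b >> i)) & 1
        total |= (propagate ^ carry_in[i]) << i   # sum bit s_i
    return total | (carry_in[n] << n)             # carry out c_(n-1) on top


def array_size(n):
    """Depth and width labels from the slide: 3 + log2 n stages, 2n + 1 inputs."""
    return 3 + int(math.log2(n)), 2 * n + 1


print(array_size(64))  # (9, 129)
```

For n = 64 the labels give 9 stages and 129 inputs, which matches the INPUTGEN129BY9 block name and the 129-row floor plan.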

64-bit Adder Block: Silicon Ensemble P&R
Schematic (Virtuoso) → Place & Route (Silicon Ensemble):
• Floor plan: 129 rows, 70% area utilization
• Plan power: M4 and M5 power grid
• Pins and cell placement: input pins on the left (A64, B64 and C), output pins on the right (S64 and C)
• Filler cells
• Routing

Input Generator Block
Flexible and fast input generation
• 12x STFB2_SRST single-rail to single-track converters for the address (a0…a3) and data (d0…d7) inputs
• 4 levels of STFB2_SPLIT stages steer each 8-bit data word to the destination selected by the address
• 64x 9-stage rings for operand A, 64x 9-stage rings for operand B, and one 9-stage ring for the carry in (Cin)
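A behavioural sketch of how such a block could fill and replay its rings, under loudly stated assumptions: the 4-bit address selects one of 16 byte destinations through the four split levels, destinations 0 to 7 feed operand A and 8 to 15 feed operand B, the carry ring is loaded directly, and a `deque` stands in for each 9-stage ring without modeling its token capacity. The class, its methods and the address mapping are hypothetical, not taken from the library.

```python
from collections import deque

N_BITS = 64
LEVELS = 4  # four levels of 1:2 splits reach 2**4 = 16 byte-wide destinations


class InputGenerator:
    """Hypothetical behavioural model of the input generator block."""

    def __init__(self):
        # Assumed mapping: rings 0..63 hold operand A bits, rings 64..127 operand B bits.
        self.rings = [deque() for _ in range(2 * N_BITS)]
        self.carry_ring = deque()

    def route(self, address):
        """Walk the split tree: each level consumes one address bit, MSB first."""
        lo, hi = 0, 2 ** LEVELS
        for level in reversed(range(LEVELS)):
            mid = (lo + hi) // 2
            if (address >> level) & 1:
                lo = mid   # take the '1' branch of this split
            else:
                hi = mid   # take the '0' branch
        return lo          # destination byte index, 0..15

    def load_byte(self, address, data):
        dest = self.route(address)
        for k in range(8):
            self.rings[8 * dest + k].append((data >> k) & 1)

    def load_operands(self, a, b, cin):
        """Load one A/B/carry vector; addresses 0..7 target A, 8..15 target B (assumed)."""
        for byte in range(N_BITS // 8):
            self.load_byte(byte, (a >> (8 * byte)) & 0xFF)
            self.load_byte(8 + byte, (b >> (8 * byte)) & 0xFF)
        self.carry_ring.append(cin)

    def step(self):
        """After "Go!": every ring emits its head token and recirculates it."""
        bits = []
        for ring in self.rings:
            ring.rotate(-1)          # head moves to the tail...
            bits.append(ring[-1])    # ...and is emitted
        self.carry_ring.rotate(-1)
        cin = self.carry_ring[-1]
        a = sum(bit << i for i, bit in enumerate(bits[:N_BITS]))
        b = sum(bit << i for i, bit in enumerate(bits[N_BITS:]))
        return a, b, cin


gen = InputGenerator()
vectors = [(1, 2, 0), (2**64 - 1, 1, 1), (12345, 67890, 0)]
for a, b, cin in vectors:   # three vectors, as in the "3x A64, 3x B64" loading step
    gen.load_operands(a, b, cin)
print([gen.step() for _ in range(6)] == vectors * 2)  # True: the rings replay the vectors
```

The point of the ring structure is that a short, slow load phase through the pins is followed by full-speed replay, so the adder can be exercised at GHz rates without GHz I/O.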

Output Sampler Block
Flexible and fast output sampler
• Input: 64-bit sum + Cout (65 bits)
• Three cascaded sampling stages, each built from a 30-stage ring, 65x STFB2_SPLIT and 65x STFB2_BUCKET
• Sampling ratios 1:10, 1:100 and 1:1000
• Example ring patterns and the sampled result indices:
  1:10: 1000000000 = 1, 10, …; 0010000000 = 3, 13, …
  1:100: 1000000000 = 1, 100, …; 0000100000 = 43, 143, …
  1:1000: 1000000000 = 1, 1000, …; 0000000100 = 843, 1843, …

Simulation Results: Loading (Nanosim)
Loading sequence: carry in; sampler 10 × 4 × 4 = 160; 3x B64 and 3x A64; Go!

Simulation Results: Running (Nanosim)
After Go!, the sum and carry-out tokens appear at the outputs; 160 results take 112.9 ns:
112.9 ns / 160 = 0.706 ns per result, i.e. 1/0.706 ns ≈ 1.4 GHz
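The throughput arithmetic above, spelled out. The last line is an added estimate, not a slide result: it assumes the 6-transition STFB cycle is what sets the rate, and divides the measured cycle accordingly.

```python
interval_ns = 112.9   # measured interval from the Nanosim run
results = 160         # results in that interval (10 x 4 x 4)

cycle_ns = interval_ns / results
throughput_ghz = 1.0 / cycle_ns
print(f"{cycle_ns:.3f} ns per result -> {throughput_ghz:.2f} GHz")

# Rough per-transition delay, assuming the 6-transition STFB cycle limits throughput.
print(f"~{cycle_ns / 6 * 1000:.0f} ps per transition")
```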

Simulation Results
Conditions (corner, temp., supply voltages)   Iav     Latency   Throughput
TT, 25 °C, 2.5 V, 3.3 V                       2.9 A   2.1 ns    1.4 GHz
SS, 120 °C, 2.2 V, 3.0 V                      1.6 A   3.3 ns    890 MHz
FF, 0 °C, 2.7 V, 3.6 V                        4.2 A   1.6 ns    1.9 GHz
SF, 25 °C, 2.5 V, 3.3 V                       2.9 A   2.2 ns    1.4 GHz
FS, 25 °C, 2.5 V, 3.3 V                       2.9 A   2.2 ns    1.4 GHz

Demonstration Chip: Top Layout
All three blocks are 1700 µm tall; total width 1963 µm.
• INPUTGEN129BY9: 801 µm wide, 1.36 mm², 105k transistors, 1.3 A @ 1.4 GHz
• ADDER64: 663 µm wide, 1.13 mm², 89k transistors, 1.3 A @ 1.4 GHz
• SAMPLER65BY1000: 499 µm wide, 0.85 mm², 62k transistors, 0.3 A @ 1.4 GHz
• Total: 3.3 mm², 257k transistors, 2.9 A @ 1.4 GHz
TSMC 0.25 µm, MOSIS, Mar/22/04
Full die: 3733 µm × 5483 µm, 20.5 mm², 132 pins, holding the STFB 64-bit adder and a QDI sequential decoder (Session VI, 10:30am, Thu, Apr/22)
Design effort: ~6 man-months for the library, ~6 man-months for the design
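A quick cross-check of the block figures, written as a small illustrative script: multiplying each block width by the common 1700 µm height reproduces the per-block areas, and summing the blocks reproduces the slide totals to rounding (about 3.34 mm² and 256k transistors against the quoted 3.3 mm² and 257k).

```python
HEIGHT_UM = 1700
blocks = {  # name: (width in um, transistors in thousands, current in A at 1.4 GHz)
    "INPUTGEN129BY9": (801, 105, 1.3),
    "ADDER64": (663, 89, 1.3),
    "SAMPLER65BY1000": (499, 62, 0.3),
}

total_area = 0.0
for name, (width_um, _, _) in blocks.items():
    area_mm2 = width_um * HEIGHT_UM / 1e6   # um^2 -> mm^2
    total_area += area_mm2
    print(f"{name:16s} {area_mm2:.2f} mm^2")

total_width = sum(w for w, _, _ in blocks.values())
total_transistors_k = sum(t for _, t, _ in blocks.values())
total_current_a = sum(i for _, _, i in blocks.values())
print(f"total width {total_width} um, area {total_area:.2f} mm^2")
print(f"transistors {total_transistors_k}k, current {total_current_a:.1f} A")
```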

Summary and Conclusions
• Performance: STFB 2-D pipelining yields ultra-high performance
• Design time: the back-end flow achieves ASIC design times
• Availability: the cell library has been made freely available
• Future work: characterize and extend the library; static timing analysis and sign-off

Efharisto! (Thank you!)

STFB Standard-Cell Design: Dynamic worst-case direct-path current analysis (STFB buffer pipeline at 2 GHz)
• Non-overlapping drive gives less direct-path current than an inverter
Figure: four-stage STFB buffer pipeline (each stage: L, Sx, R, RCD, A) with 1 mm wires between stages
TSMC 0.25 µm; widths in µm, all lengths 0.24 µm

Input Generator Block: 9-stage ring
Cells: STFB2_BITGEN (bit generator, BG), STFB2_MERGENC (non-conditional merge stage), STFB2_FORK (fork stage), STFB2_BUFFER (buffer stage), STFB2_XOR2 (2-input XOR stage)
Figure: 9-stage ring with bit generator (BGout, ingo) and its output sequence 1,0,0,1,0,0…

Eτ² Comparison: STFB vs. WCHB
Template  Stages  Tokens  Library  Vdd (V)  Temp. (°C)  Throughput (GHz)  Stage cycle time (ns)  Avg. current (mA)  Current per token (mA)  Eτ² metric
WCHB      10      2       TT       2.5      25          1.00              1.00                   2.7                1.35                    3.4
STFB      9       3       TT       2.5      25          2.00              0.50                   13.0               4.32                    1.3
The STFB buffer is ~3x more efficient than the WCHB buffer.
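How the Eτ² column can be reproduced. This sketch assumes the metric takes E = Vdd × (current per token) × (stage cycle time) as the energy per token per cycle, then multiplies by the cycle time squared; that assumption matches the table's 3.4 and 1.3 to rounding. On these numbers the WCHB/STFB ratio comes out at about 2.5.

```python
def e_tau2(vdd_v, current_per_token_ma, cycle_ns):
    """E*tau^2 in pJ*ns^2, taking E = Vdd * I_token * tau as the energy per token per cycle."""
    energy_pj = vdd_v * current_per_token_ma * cycle_ns  # V * mA * ns = pJ
    return energy_pj * cycle_ns ** 2


# Table values: average current divided by the number of tokens in the ring.
wchb = e_tau2(2.5, 2.7 / 2, 1.00)
stfb = e_tau2(2.5, 13.0 / 3, 0.50)
print(f"WCHB {wchb:.2f}, STFB {stfb:.2f}, ratio {wchb / stfb:.1f}")
```

Because the cycle time enters cubed, halving it outweighs STFB's roughly 3x higher current per token.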

Demonstration Chip: Top Layout (pins)
• 7 Vdd and 7 Gnd pins
• 12 In/Out, 8 Input and 3 pad supply pins
• 7 Vdd and 7 Gnd pins
Total: 51 pins