High-Performance Arithmetic Challenges: From Architectures to Circuits
Ram K. Krishnamurthy, Sanu K. Mathew, Shekhar Borkar
Microprocessor Research, Intel Labs, Intel Corporation, Hillsboro, OR, USA
Prof. Vojin Oklobdzija
ACSEL Lab, Dept. of ECE, University of California, Davis, CA, [email protected]
16th IEEE International Computer Arithmetic Symposium, Santiago, June 18th 2003
Outline
• Motivation
• Design choices for high-performance circuits
• SOI vs. bulk devices: ALU design test-case
  – 64-bit ALUs in PD-SOI and bulk CMOS
• Energy-efficient high-performance AGU/ALUs
  – 4GHz sparse-tree AGU design
  – 6.5-10GHz integer ALU design
• Summary
High-performance trends
• Frequency doubles every generation
• Performance-critical units
  – ALUs & AGUs
  – Register files, L0 caches
• Single-cycle latency & throughput
[Figure: clock frequency (MHz, log scale from 0.1 to 100000) vs. year (1970 to 2020) for the 8080, 8086, 386, Pentium® and Pentium® 4 processors, with a 15-30 GHz annotation]
64-bit ALUs in 0.18µm PD-SOI/Bulk CMOS: Design & Scaling Trends
[S. Mathew et al, ISSCC 2001]
[S. Mathew et al, JSSC, Nov 2001]
Design choices
• High-performance devices:
  – Partially depleted Silicon-on-Insulator (PD-SOI)
  – Pros & cons vs. bulk CMOS
  – Scaling trends
• High-performance circuit design:
  – Sparse-tree semi-dynamic AGU
  – Single-rail dynamic ALU
PD-SOI Devices
• Body of devices not tied to Vcc/Vss
• Body is isolated by buried oxide: floating body!
[Figure: PD-SOI cross-section with n+ and p+ diffusions over a P-type body and an N-type body, separated by STI, on buried oxide above the P-substrate]
History Effect in PD-SOI
• Delay = function of switching history
  – Capacitive coupling from S/G/D
  – Impact ionization, diode conduction
  – Transient Vbs differs from DC Vbs
• Complicates timing analysis
[Figure: PD-SOI NFET (source, gate, drain over buried oxide and backgate) showing the body potential and the coupling capacitances Cgb, Csb, Cdb and Cbox]
64-bit ALU architecture
• Ideal test-bed for evaluating process technologies
[Figure: 64-bit ALU block diagram. Labels: external operands, 9:1 muxes (mux control, sign control), 5:1 mux (shift control), 3:1 mux (mux control), single-rail adder core, 2:1 mux, Sum, 1200µm loopback bus, 0.5pF]
High-performance Adders: Kogge-Stone
• Generate all carries: full-blown binary tree, energy-inefficient
• # Carry-merge stages = log2(N)
• Carry-merge: $GG = G_i + P_i G_{i-1}$, $GP = P_i P_{i-1}$
[Figure: odd and even input bits passing through seven stages (PG gen., CM1 to CM5, XOR) to produce Sum_odd and Sum_even]
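The carry-merge recurrence above can be sketched as a small bit-level model. This is a minimal illustration, not the circuit from the slides: the function name `kogge_stone_add`, the zero carry-in, and the list-of-bits representation are assumptions made for the example.

```python
def kogge_stone_add(a: int, b: int, n: int = 64):
    """Add two n-bit integers with a Kogge-Stone prefix tree (carry-in = 0).

    Returns (sum mod 2**n, carry_out, number_of_carry_merge_stages).
    Hypothetical helper for illustration; n is assumed to be a power of two.
    """
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    # PG generation: per-bit generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    half_sum = p[:]  # the bit-level XOR is reused by the final sum XOR
    stages = 0
    dist = 1
    while dist < n:
        new_g, new_p = g[:], p[:]
        for i in range(dist, n):
            # Carry-merge: GG = G_i + P_i*G_{i-1}, GP = P_i*P_{i-1},
            # where "i-1" is the group ending dist bits below.
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
        g, p = new_g, new_p
        dist *= 2
        stages += 1
    # g[i] is now the carry out of bit i; sum bit i uses the carry out of bit i-1.
    total = 0
    for i in range(n):
        carry_in = g[i - 1] if i > 0 else 0
        total |= (half_sum[i] ^ carry_in) << i
    return total, g[n - 1], stages
```

With `n=64` the loop runs six carry-merge stages and with `n=32` it runs five, matching the log2(N) stage count; every bit position carries its own merge chain, which is what makes the full tree costly in energy.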
64-bit Han-Carlson adder core
• Carry-merge done on even bitslices
• 50% fewer carry-merge gates vs. Kogge-Stone
• Extra logic stage generates odd carries
[Figure: Han-Carlson tree over bitslices b63 to b0: PG generator, carry-merge stages 0 to 5 (CM0, CM1, ...), odd carry generator and sum XOR, shown for an odd bit and an even bit; stage gate labels 3N, 2P, 2N]
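A Han-Carlson-style tree can be modeled in the same way as a full prefix tree. The sketch below is an assumption-laden illustration: `han_carlson_add` is a hypothetical name, bits are indexed from 0 so the tree runs on odd-indexed positions (the carries into even bitslices), and the merge counts follow this model's own counting convention rather than the authors' gate-level design.

```python
def han_carlson_add(a: int, b: int, n: int = 64):
    """n-bit add (carry-in = 0) using a Han-Carlson-style carry-merge tree.

    Returns (sum mod 2**n, carry_out, tree_merges, extra_stage_merges).
    Hypothetical helper; n is assumed to be a power of two, n >= 4.
    """
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    half_sum = p[:]
    tree_merges = 0
    # Carry-merge 0: merge each odd bit with its even neighbour.
    for i in range(1, n, 2):
        g[i] = g[i] | (p[i] & g[i - 1])
        p[i] = p[i] & p[i - 1]
        tree_merges += 1
    # Remaining tree stages: Kogge-Stone merging on odd positions only.
    dist = 2
    while dist < n:
        new_g, new_p = g[:], p[:]
        for i in range(dist + 1, n, 2):
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
            tree_merges += 1
        g, p = new_g, new_p
        dist *= 2
    # Extra logic stage: the remaining carries come from the neighbouring tree carry.
    extra_merges = 0
    for i in range(2, n, 2):
        g[i] = g[i] | (p[i] & g[i - 1])
        extra_merges += 1
    total = 0
    for i in range(n):
        carry_in = g[i - 1] if i > 0 else 0
        total |= (half_sum[i] ^ carry_in) << i
    return total, g[n - 1], tree_merges, extra_merges
```

For 64 bits this model uses 161 merges in the tree plus 31 in the extra stage. A full 64-bit Kogge-Stone tree counted the same way needs 321 merges, so the tree alone is roughly half the size, in line with the 50% figure if the extra stage is counted separately (that counting convention is an assumption).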
Energy-efficient adder core
• 43% less energy/transition at iso-performance
Adder architecture    Energy/transition
Kogge-Stone           120pJ
Han-Carlson           68pJ
Han-Carlson carry-merge tree
• Single-rail adder core
• CSG circuit generates dual-rail carry
[Figure: even and odd inputs pass through PG gen. and carry-merge stages CM0 to CM6 (gate types 3N, 2P, 2N); the single-rail carry-merge tree and odd carry generator produce Ceven and Codd, which the complementary signal generator (CSG) converts to dual rail]
Complementary signal gen.
• Domino-compatible Carry/Carry#
• Permits a single-rail carry-merge tree design
• Not time-borrowable
  – Penalty absorbed by placing gate at Φ2 boundary
[Figure: CSG circuit with true and complementary pull-down paths driven by Cin_i, two keepers, and outputs Carry_i and Carry#_i]
Partial sum generator
• Generates domino-compatible partial sum
• Placing the gate at Φ1 boundary mitigates output noise-glitches
[Figure: partial sum generator circuit clocked by Φ1, with inputs Ai and Bi, signals Pi and Gi, a keeper, and output Psum_i]
ALU performance in bulk CMOS
• 64b Han-Carlson ALU simulation results
  – ALU delay: 482ps
  – Adder core: 310ps
• 0.18µm bulk CMOS, Vcc=1.5V
[Figure: ALU critical path with Φ1 and Φ2 boundaries. Labels: input, 9:1 mux, adder core stages (3N, 2P, 2N, ..., XOR, 2P sum), 5:1 mux, 3:1 mux, bus driver, 1200µm bus]
Porting from bulk to PD-SOI
• Direct port: bulk design to SOI design
  – Design issues: noise tolerance due to lowered Vt; min-delay timing analysis
• SOI-favored redesign: bulk design to SOI-optimal design
  – Motivation for redesign: reduced SOI stack penalty; deeper stack design; stage reduction
  – Design choices: architecture should favor deep stack design; avoid increase in fanouts
0.18µm Bulk & PD-SOI technologies
• Equal IOFF at DC Vbs
• SOI IDSAT is 1-2% lower
Device       Ioff (nA/µm)   Idsat (µA/µm)
NMOS-Bulk    3.3            1070
NMOS-SOI     3.3            1050
PMOS-Bulk    0.7            445
PMOS-SOI     0.7            441
History effect measurements in 0.18µm PD-SOI
• These gates are used in the ALU design
• 5-11% delay variation
• Measurements agree with simulation results
[Figure: normalized delay (0.8 to 1) vs. pulse width (10ns to 100µs) for a transmission gate chain, a 3-NFET stack chain and an inverter chain, annotated with history effect variations of 11%, 7% and 5%]
Direct port of Han-Carlson ALU to PD-SOI
• Adder core speedup = 14%
  – [Stasiak et al., ISSCC 2000]: 21% speedup
64b Han-Carlson ALU delay simulations    Delay    % Delay improvement over bulk
Bulk                                     482ps    -
Direct-port to SOI                       403ps    16%
• 0.18µm technology, Vcc=1.5V
Speedup analysis
Stage type       Speedup over bulk from direct port to 0.18µm PD-SOI
Static gates     12-15%
Dynamic gates    2-9%
3:1 TG Mux       20%
5:1 TG Mux       23%
9:1 TG Mux       35%
• Diffusion-dominated muxes: max. speedup
• Load-dominated gates: speedup decreases
Motivation for PD-SOI-optimal redesign
• Reduced stack penalty in SOI
• Deeper stack design
• Stage reduction
• ALU is amenable to such a redesign
  – Not true for all CPU critical paths
• SOI-optimal ALU architecture
  – Increasing stack depth must not increase fanouts
• A novel deep-stack sparse-tree ALU was developed
Sparse-tree adder core
• 50% reduced fanouts compared to Han-Carlson
• 7 gate stages (two fewer than Han-Carlson)
[Figure: 64-bit sparse-tree adder. Bitslices b63 to b0 feed the PG generator. A fast carry-merge tree combines 2-bit groups (1:0, 3:2, ..., 63:62) into 8-bit groups (7:0, 15:8, ..., 47:40), 16-bit groups (15:0, 31:16, 47:32) and the carries 31:0 and 47:0. Four intermediate carry generators and sixteen sum generators follow, with output muxes. Stage gate labels: 2N, 2P, 4N, 2P, 3N, Mux]
Intermediate Carry Generator
• Generates 1 in 4 carries (C3, C7, C11, ..., C59)
• Non-critical path (ripple carry-select scheme)
• Fast carry selects between the conditional carries
[Figure: carry-merge (CM) chains over the group signals P3:0/G3:0, P7:4/G7:4 and P11:8/G11:8 compute conditional carries for carry-in 0 and 1; 2:1 muxes controlled by the carry from the fast CM chain select C3, C7 and C11]
Non-critical Sum Generator
• Non-critical path: ripple carry chain
• Reduced area, energy consumption, leakage
• Generate conditional sums for each bit
• 1 in 4 carry selects appropriate sum
[Figure: 4-bit sum generator. Inputs Pi, Pi+1/Gi+1, Pi+2/Gi+2 and Pi+3/Gi+3 ripple through carry-merge (CM) gates for carry-in 0 and 1; XOR gates form the conditional sums Sum_i,0 and Sum_i,1, and 2:1 muxes controlled by the carry select Sum_i to Sum_i+3]
Sparse-tree adder critical path
• Fast carry-merge path: critical path
• Non-critical side-paths: ripple-carry
[Figure: from input to Sumout, the fast carry-merge path (2N, 2P, 4N, 2P, 3N) runs alongside the intermediate carry generator and sum generator side-paths (Inv, 3N, 2P, 2N stages)]
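The split between a fast carry path and non-critical conditional-sum side-paths can be sketched functionally. The model below is an illustration under stated assumptions: the names `group_pg`, `ripple4` and `sparse_tree_add` are hypothetical, the word width is a multiple of 16, and the Python loops stand in for hardware that evaluates in parallel, so it models the logic function only, not the gate stages.

```python
def group_pg(g, p):
    """Ripple-merge (g, p) signals, LSB first, into one group (G, P)."""
    G, P = 0, 1
    for gi, pi in zip(g, p):
        G = gi | (pi & G)
        P = pi & P
    return G, P


def ripple4(g, p, cin):
    """Conditional sum of one 4-bit group for an assumed carry-in."""
    s, c = 0, cin
    for k in range(4):
        s |= (p[k] ^ c) << k
        c = g[k] | (p[k] & c)
    return s


def sparse_tree_add(a: int, b: int, n: int = 64):
    """n-bit add (carry-in = 0); returns (sum mod 2**n, carry_out)."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    # Generate/propagate of every 4-bit group.
    grp = [group_pg(g[i:i + 4], p[i:i + 4]) for i in range(0, n, 4)]
    # Fast path: carry into each 16-bit block (the 1-in-16 carries).
    block_cin = [0]
    for blk in range(0, len(grp), 4):
        G, P = group_pg(*zip(*grp[blk:blk + 4]))
        block_cin.append(G | (P & block_cin[-1]))
    total = 0
    for blk in range(len(grp) // 4):
        # Intermediate carry generator: conditional 1-in-4 carries for both
        # possible block carry-ins; the fast carry selects between them.
        c0, c1 = 0, 1
        for j in range(4):
            gi = blk * 4 + j
            cin = c1 if block_cin[blk] else c0
            lo = gi * 4
            # Sum generator: conditional sums for carry-in 0 and 1, then a 2:1 mux.
            s0 = ripple4(g[lo:lo + 4], p[lo:lo + 4], 0)
            s1 = ripple4(g[lo:lo + 4], p[lo:lo + 4], 1)
            total |= (s1 if cin else s0) << lo
            G, P = grp[gi]
            c0 = G | (P & c0)
            c1 = G | (P & c1)
    return total, block_cin[-1]
```

Only the block carries sit on the long path; the conditional carries and sums depend on local bits and can settle while the fast carries are still being computed, which is why those side-paths tolerate a slow ripple structure.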
PD-SOI optimal redesign in 0.18µm
• Deeper stack redesign: additional 5% speedup
64b ALU delay simulations    Delay    Speedup over bulk
Bulk                         482ps    -
Direct-port SOI              403ps    16%
SOI-optimal redesign         380ps    21%
• 0.18µm technology, Vcc=1.5V
Margining for reverse-body bias in PD-SOI
• 400mV reverse bias increases rise-delay by 10%
• Difficult to detect for large circuits
• 10% margin required for all max-delay paths
• Overall PD-SOI speedup reduces to 11%
Reducing reverse-bias penalty in dynamic SOI gates
• Point solution for dynamic designs
• Pre-charging stack node decreases penalty to 2%
• Max-delay margin reduced to 2%
• Cost: 5% increase in clock energy
[Figure: two-input dynamic gate (inputs A and B, output Out) showing Body-A, Body-B, the stack node, and devices M1 and P0]
0.18µm ALU performance after margining
• Maximum PD-SOI speedup reduces to 19%
64b ALU delay simulations    Delay    Speedup over bulk    Speedup after margining
Bulk                         482ps    -                    -
Direct-port SOI              403ps    16%                  14%
SOI-optimal redesign         380ps    21%                  19%
• 0.18µm technology, Vcc=1.5V
Scaling to 0.13µm technologies
• Equal SOI & bulk IOFF-DC
• MOSFET & impact ionization data obtained from 0.13µm bulk measurements
• SOI parasitic BJT/diode characteristics unchanged from 0.18µm fitting
Scaling ALU designs to 0.13µm technology
• Maximum PD-SOI speedup reduces to 16%
64b ALU delay simulations    Delay    Speedup over bulk    Speedup after margining
Bulk                         351ps    -                    -
Direct-port SOI              312ps    11%                  9%
SOI-optimal redesign         286ps    18%                  16%
• 0.13µm technology, Vcc=1.2V
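The pre-margining "speedup over bulk" figures quoted for the 0.18µm and 0.13µm designs appear consistent with (bulk delay - SOI delay) / bulk delay, truncated to a whole percent. That reading is an assumption, and the helper names below are hypothetical; the delays are the simulated values quoted on the slides.

```python
def delay_improvement_pct(bulk_ps: float, soi_ps: float) -> float:
    """Percent delay improvement of an SOI design relative to bulk."""
    return 100.0 * (bulk_ps - soi_ps) / bulk_ps


# Simulated 64b ALU delays quoted on the slides, in picoseconds.
DELAYS_PS = {
    "0.18um": {"bulk": 482, "direct_port": 403, "soi_optimal": 380},
    "0.13um": {"bulk": 351, "direct_port": 312, "soi_optimal": 286},
}


def speedup_table():
    """Whole-percent speedups over bulk (truncated, an assumed convention)."""
    rows = {}
    for node, d in DELAYS_PS.items():
        rows[node] = {
            name: int(delay_improvement_pct(d["bulk"], delay))
            for name, delay in d.items()
            if name != "bulk"
        }
    return rows
```

Under this convention the direct port gains 16% at 0.18µm and 11% at 0.13µm, and the SOI-optimal redesign gains 21% and 18%, which reproduces the unmargined columns of both tables.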
SOI vs. bulk Summary
• 482ps energy-efficient dynamic 64b ALU in 0.18µm bulk
  – 310ps adder core
• Direct port to 0.18µm SOI: 14% speedup
• SOI-optimal redesign: 19% speedup
• Floating body can get reverse-biased
  – Preconditioning reduces margin from 10% to 2%
• Scaling to 0.13µm decreases PD-SOI speedup
• Maximum PD-SOI speedup in 0.13µm falls to 16%
High-Performance Low Power Datapath design
• Goal: shift the E-D curve
[Figure: energy vs. delay curve]
A 4GHz 130nm Address Generation Unit with 32-bit Sparse-tree Adder Core
[S. Mathew et al, VLSI Symp. 2002]
[S. Mathew et al, JSSC May 2003]
Motivation
• AGUs: performance and peak-current limiters
• High activity: thermal hotspot
• Goal: high-performance energy-efficient design
[Figure: processor thermal map (temperature in °C) showing the execution core, AGU and cache, with a 120°C label]
AGU Architecture
• Single-cycle latency and throughput
• Effective Address = Base + Index*Scale + (Segment + Displacement)
• 2-phase address computation
[Figure: AGU datapath with 32-bit buses. Index passes through a 3b shift; Segment and Displacement feed a "+" block; a 3:2 compressor is followed by a 32b adder that produces the Effective Address; clocks clk, clk2 and clk3]
AGU Operation: Phase 1
• Index pre-scaled via 3-bit barrel shifter
• 3:2 compressor renders partial address: carry-save format
• Adder in pre-charge state
[Figure: AGU datapath with the 3b shift and the 3:2 compressor (carry-save format output) highlighted]
AGU Operation: Phase 2
• Carry-save to binary format conversion: 2's complement parallel 32-bit adder
[Figure: AGU datapath with the 32b adder, clk2/clk3 and the Effective Address output highlighted]
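The two-phase address computation can be sketched at word level. This is a minimal model under stated assumptions: the names `compress_3_2` and `effective_address` are hypothetical, the scale is taken to be a left shift of 0 to 3 bits, and Segment + Displacement is assumed to be summed ahead of the compressor, as the "+" block in the datapath figure suggests.

```python
MASK32 = 0xFFFFFFFF


def compress_3_2(x: int, y: int, z: int):
    """Bitwise 3:2 compressor: three 32-bit words in, (sum, carry) words out."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1  # carries move one bit to the left
    return s & MASK32, c & MASK32


def effective_address(base: int, index: int, scale_shift: int,
                      segment: int, displacement: int) -> int:
    """32-bit effective address, modulo 2**32."""
    if not 0 <= scale_shift <= 3:
        raise ValueError("scale_shift is assumed to be 0..3")
    seg_disp = (segment + displacement) & MASK32  # assumed to be pre-added
    # Phase 1: pre-scale the index, then reduce three operands to carry-save form.
    scaled_index = (index << scale_shift) & MASK32
    s, c = compress_3_2(base & MASK32, scaled_index, seg_disp)
    # Phase 2: convert carry-save to binary with one 32-bit addition.
    return (s + c) & MASK32
```

The compressor has no carry propagation between bit positions, so phase 1 is short and only the single addition in phase 2 needs a fast carry tree.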
Kogge-Stone Adder
• Critical path = PG + 5 + XOR = 7 gate stages
• Generate, Propagate fanout of 2, 3
• Maximum interconnect spans 16b
• Energy inefficient
[Figure: 32-bit Kogge-Stone tree over bits 0 to 31: PG stage, carry-merge gates, XOR stage]
Sparse-tree Adder Architecture
• Generate every 4th carry in parallel
• Side-path: 4-bit conditional sum generator
• 73% fewer carry-merge gates: energy-efficient
[Figure: sparse carry tree over bits 31 to 0 producing C3, C7, C11, C15, C19, C23 and C27]
Non-critical Sum Generator
• Non-critical path: ripple carry chain
• Reduced area, energy consumption, leakage
• Generate conditional sums for each bit
• Sparse-tree carry selects appropriate sum
[Figure: 4-bit sum generator. Inputs Pi, Pi+1/Gi+1, Pi+2/Gi+2 and Pi+3/Gi+3 ripple through carry-merge (CM) gates for carry-in 0 and 1; XOR gates form the conditional sums Sum_i,0 and Sum_i,1, and 2:1 muxes controlled by the carry select Sum_i to Sum_i+3]
Optimized First-level Carry-merge: Conditional Carry for Cin=0
• Carry-merge stage reduces to inverter
• Conditional carry_0 = Gi#
[Figure: first-level carry-merge gate CM0 with Cin=0 and inputs Pi, Gi reduces to an inverter on Gi with output C#_0]
Optimized First-level Carry-merge: Conditional Carry for Cin=1
• Pi & Gi correlated
• Conditional carry_1 = Pi#
Ai  Bi  Pi  Gi  C#_1
0   0   0   0   1
0   1   1   0   0
1   0   1   0   0
1   1   1   1   0
[Figure: first-level carry-merge gate CM1 with Cin=1 and inputs Pi, Gi reduces to an inverter on Pi with output C#_1]
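The first-level simplification can be checked exhaustively over the four input combinations. The sketch assumes the OR form of propagate (Pi = Ai OR Bi), which is what the truth table shows, and the usual carry-merge expression C = Gi + Pi·Cin; the function name is hypothetical.

```python
def first_level_conditional_carries():
    """Return rows (Ai, Bi, Pi, Gi, C#_1) and check both simplifications."""
    rows = []
    for a in (0, 1):
        for b in (0, 1):
            p = a | b  # propagate, OR form as in the truth table
            g = a & b  # generate
            carry_0 = g | (p & 0)  # general carry-merge with Cin = 0
            carry_1 = g | (p & 1)  # general carry-merge with Cin = 1
            assert carry_0 == g  # Cin = 0: the stage is just Gi
            assert carry_1 == p  # Cin = 1: Gi implies Pi, so the stage is just Pi
            rows.append((a, b, p, g, 1 - carry_1))  # complemented output C#_1
    return rows
```

Because Gi = 1 forces Pi = 1, the term Gi is absorbed into Pi when the carry-in is 1, so each conditional carry needs only an inverter to produce the complemented output.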
Optimized Sum Generator
• Optimized 1st-level carry-merge
• Optimized non-critical path: 4 stages
[Figure: 4-bit sum generator with the first-level carry-merge gates removed. Inputs Pi, Pi+1/Gi+1, Pi+2/Gi+2 and Pi+3/Gi+3; remaining CM gates, XOR gates forming Sum_i,0 and Sum_i,1, and 2:1 muxes controlled by the carry producing Sum_i to Sum_i+3]
Adder Core Critical Path
• Critical path: 7 gate stages, same as KS
• Sparse-tree: single-rail dynamic
• Exploit non-criticality of sum generator
  – Convert to static logic
• Semi-dynamic design
[Figure: from the adder inputs to Sum31. The single-rail dynamic sparse-tree path (PG, GG1, GG3, GG7, GG15, GG27, C27) runs in parallel with the static sum generator (CM0 latch, CM1, XOR) producing Sum31_0 and Sum31_1; clocks clk, clk2 and clk3]
1st-level Carry-merge: Static Latch
• Holds state in pre-charge phase
• Prevents pre-charging of static stages
[Figure: clocked first-level carry-merge latch with inputs Pi, Gi, Gi-1 and clk, output C#i]
Domino-Static Interface
• Sum = Sum0 during pre-charge
• Mux output resolves during evaluation
[Figure: domino-static interface shown for clk=0 and clk=1. Dynamic 2P/2N carry-merge gates (inputs Pi, Gi, Gi-1, G#i-1, P#i, G#i) drive Carry#i, which selects between Sum0i and Sum1i to produce Sumi]
Sparse-tree Architecture
• Performance impact: 20% speedup
  – 33-50% reduced G/P fanouts
  – 80% reduced wiring complexity
  – 30% reduction in maximum interconnect
• Power impact: 56% reduction
  – 73% fewer carry-merge gates
  – 50% reduction in average transistor size
Energy-delay Space
• 20% speedup over Kogge-Stone
• 56% worst-case energy reduction
  – Scales with activity factor
• 130nm CMOS, 1.2V, 110°C simulation
[Figure: worst-case energy (pJ, 0 to 100) vs. delay (ps, 140 to 280) for the dynamic Kogge-Stone and semi-dynamic sparse-tree adders, with the 4GHz design point and the 20% and 56% gaps marked]
Semi-dynamic Design
• Static sum generators: low switching activity
• 71% lower average energy at 10% activity
[Figure: average energy (pJ, 0 to 40) vs. activity factor (0 to 0.5) for the dynamic Kogge-Stone and semi-dynamic sparse-tree adders, with the 71% gap marked]
Dual-Vt Allocation
• Exploit non-criticality of sidepaths: use high-Vt devices
• 0% performance penalty
• 56% reduction in active leakage energy
                    Low-Vt    Dual-Vt
Delay               152ps     152ps
Switching energy    36pJ      34pJ (-6%)
Leakage energy      0.9pJ     0.4pJ (-56%)
• 130nm CMOS, 1.2V, 110°C simulation
Scaling Performance
• Average transistor size = 3.5µm: reduces impact of increasing leakage
• 80% reduction in wiring complexity: reduces impact of wire resistance
• 33% delay scaling, 50% energy reduction
                    130nm     100nm
Delay               152ps     102ps (-33%)
Switching energy    36pJ      18pJ (-50%)
Leakage energy      0.9pJ     0.7pJ (-23%)
A 6.5GHz, 130nm Single-ended Dynamic ALU
[M. Anders et al, ISSCC 2002]
[S. Vangal et al, JSSC November 2002]
32-bit ALU/Scheduler Loop
• Performance-critical execution core loop
[Figure: ALU 0 and ALU 1, each with 5:1 input muxes fed from the RF and FIFO and an x8 scheduler. The 32-bit dual-rail sums (sum0/sum0#, sum1/sum1#) and 8-bit scheduler signals (sched0/sched0#, sched1/sched1#) loop back to the muxes and go to the RF and FIFO]
Han-Carlson ALU Organization
• Single-rail dynamic 9-stage low-Vt design
• Stages, for bits 31 to 0:
  – 5:1 mux (RF operand, FIFO operands, 5:1 mux control)
  – Propagate/Generate/Partial Sum (dynamic)
  – Carry merge 0 (static)
  – Carry merge 1 (dynamic)
  – Carry merge 2 (static)
  – Carry merge 3 (dynamic)
  – Carry merge 4 (static)
  – Carry merge 5 (CSG) / Sum
• Sum and Sum# return on an 84µm loopback bus
Odd-bits CSG Sum Generation
• Final carry-merge CSG (dual-rail carry output) → pass-transistor sum XOR
[Figure: odd-bit CSG carry merge with inputs gi#, gi-1# and pi# producing Carry_i and Carry#_i; sum generation combines them with Psum_i to produce Sum_i and Sum#_i]
Even-bits CSG Sum Generation
• Domino-compatible sum
• Dual-rail sum from single-ended g inputs
[Figure: even-bit CSG carry merge with input gi# producing Carry_i and Carry#_i; sum generation combines them with Psum_i to produce Sum_i and Sum#_i]
Die Micro-photograph
• 130nm 6-metal dual-Vt CMOS
• Scheduler: 210μm x 210μm
• ALU: 84μm x 336μm
[Figure: die micro-photograph with the scheduler and ALU outlined]
Delay and Power Measurements
• 6.5GHz at 1.1V, 25ºC
• Power: 120mW total, 15mW leakage
• Scalable to 10GHz at 1.7V, 25ºC
[Figure: measured at 25ºC over a 0.8V to 1.5V supply range: Fmax (GHz, 0 to 9) vs. supply voltage, and power and leakage power (mW, 0 to 450) vs. supply voltage, with the design target marked]
Improvements Over Dual-rail Domino
Metric                 Improvement
Area                   50%
Performance (delay)    10%
Active leakage         40%
Robustness             equal
• Leakage reduced by eliminating dual-rail logic
• Robustness not compromised
• CSG improves both area and performance
Summary
• 4GHz AGU in 1.2V, 130nm technology
  – Sparse-tree adder architecture described: 20% speedup and 56% energy reduction
  – Semi-dynamic design: energy scales with switching activity
  – Dual-Vt non-critical paths: low active leakage energy
• 6.5GHz ALU and scheduler loop at 1.1V, 25ºC
  – Scalable to 10GHz at 1.7V, 25ºC