Asynchronous Architectures for Energy Efficient
Computing & Communication (AEC2)
Alain J. Martin
Asynchronous VLSI Group
Department of Computer Science
California Institute of Technology
12 Jun 2002
Program Concepts and Goals
Concepts
  – Asynchronous approach to energy efficiency
  – High-level synthesis
Goals
  – Design and fabrication of the world's most energy-efficient microprocessor/microcontroller
  – Methods, tools, and circuits
  – Energy complexity of computation
Asynchronous Architectures for Energy Efficient Computing & Communication
Caltech
Microprocessor -- Results

MIPS        Energy   Cycle time
async-0.6   33 nJ    6 ns
sync-0.6    70 nJ    21 ns

Microcontroller -- Estimation

8051               Energy per instr   Cycle time
sync-0.5           10.00 nJ (1X)      20 ns (1X)
async-0.5          1.67 nJ (6X)       10 ns (2X)
[email protected]   0.56 nJ (18X)      5 ns (4X)
[email protected]   (72X)              (2X)
More than 100X Et^2 improvement over any other 8051
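The "more than 100X" figure can be checked from the estimation numbers on this slide. The following is a minimal sketch, assuming the metric is energy per instruction multiplied by the square of the cycle time; the function names are illustrative and not part of the project's tools.

```python
def et2(energy_nj, cycle_time_ns):
    """Et^2 metric: energy per instruction times cycle time squared."""
    return energy_nj * cycle_time_ns ** 2


def et2_improvement(energy_factor, speed_factor):
    """Et^2 gain implied by an energy gain and a cycle-time gain (both as 'NX' factors)."""
    return energy_factor * speed_factor ** 2


# Absolute numbers from the slide: sync-0.5 versus async-0.5.
baseline = et2(10.00, 20)
print(baseline / et2(1.67, 10))   # about 24X

# Improvement factors from the slide for the two remaining design points.
print(et2_improvement(18, 4))     # 18X energy, 4X cycle time
print(et2_improvement(72, 2))     # 72X energy, 2X cycle time
```

Both of the last two design points give the same Et^2 gain, which is what one would expect if they are the same circuit run at two supply voltages and Et^2 is roughly independent of voltage.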
Energy Breakdown

icache      31%
regfile     27%
fetch       11%
bus         12%
decode       4%
execunits    8%
writeback    7%

[Block diagram: fetch, icache, decode, regfile (bypass), execunits (adder, shifter, fblock, mem, mult/div), writeback]
Energy Complexity Theory
Optimization metric: Et^2
Et^2-optimal pipeline is shorter (MiniMIPS was overpipelined)
Transistor sizing is not minimal: C = 2P
Optimal Energy: E = 3 E0
Optimal Delay: t = (3/2) t∞
Sequential computation of A & B is optimal when Power(A) = Power(B)
Most energy is in communication (only 10% in computation)
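The sizing result can be illustrated numerically. The sketch below assumes a first-order model that is not stated on the slide: with x the ratio of transistor gate capacitance to parasitic capacitance, energy scales as E0 (1 + x) and delay as t_inf (1 + 1/x). Under that assumption, minimizing E t^2 gives x = 2, which corresponds to three times the minimal energy. All names here are hypothetical.

```python
def et_n(x, n=2):
    """Normalized E * t^n under the assumed model E = E0 (1 + x), t = t_inf (1 + 1/x)."""
    energy = 1.0 + x        # E / E0
    delay = 1.0 + 1.0 / x   # t / t_inf
    return energy * delay ** n


def optimal_ratio(n=2, lo=1e-3, hi=1e3, iters=200):
    """Ternary search for the gate-to-parasitic capacitance ratio that minimizes E * t^n."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if et_n(m1, n) < et_n(m2, n):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


x = optimal_ratio(n=2)
print(round(x, 3))            # gate capacitance relative to parasitic capacitance
print(round(1 + x, 3))        # energy relative to E0
print(round(1 + 1 / x, 3))    # delay relative to t_inf
```

In this model the optimum for a general exponent n sits at x = n, so a metric that weights delay more heavily calls for larger transistors.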
Consequences for Asynchronous Design Methodology
Different transistor sizing
Less communication (Ex: LAX protocol)
Less pipelining
Different buffers (tree buffers)
Simpler ALU
Different cache design (memory cell bank size)
Shorter busses (Huffman-tree encoding of busses based on instruction group frequency)
Design Flow – Stages

CHP: Communicating Hardware Processes
  – High-level language (selections, loops, etc.)
  – Decompose a large sequential CHP process into a system of smaller, concurrent, communicating CHP processes
HSE: Handshaking Expansion
  – Everything in boolean notation
  – 4-phase handshakes (set Data, wait for Ack, reset Data, wait for reset Ack)
  – Reshuffle the non-data-dependent portions of the 4-phase communication to improve speed & size
PRS: Production Rule Set
  – No explicit sequencing: concurrent set of rules
  – Each rule is an abstraction for pull-up (PUP) & pull-down (PDN) networks

Flow: Sequential CHP -> Concurrent CHP -> HSE -> PRS -> PRS for CMOS -> Sized PRS -> Physical Design
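The 4-phase handshake and the "concurrent set of rules" idea can be sketched together. The example below is an illustration only: it models one sender and one receiver sharing a `data` wire and an `ack` wire, with each transition written as a guarded rule in the spirit of a production rule. The rule encoding and the `remaining` token counter are assumptions made for this sketch, not the group's PRS notation.

```python
def four_phase_transfer(tokens):
    """Simulate `tokens` four-phase handshakes and return the ordered transition log."""
    state = {"data": False, "ack": False, "remaining": tokens}

    # Each rule is (guard, wire, new value). The guards are mutually exclusive here,
    # so at most one rule is enabled in any state.
    rules = [
        (lambda s: not s["data"] and not s["ack"] and s["remaining"] > 0, "data", True),   # set Data
        (lambda s: s["data"] and not s["ack"], "ack", True),                               # Ack
        (lambda s: s["data"] and s["ack"], "data", False),                                 # reset Data
        (lambda s: not s["data"] and s["ack"], "ack", False),                              # reset Ack
    ]

    log = []
    while True:
        for guard, wire, value in rules:
            if guard(state):
                state[wire] = value
                if wire == "data" and value:
                    state["remaining"] -= 1   # one token consumed per data+ transition
                log.append(wire + ("+" if value else "-"))
                break
        else:
            return log   # no rule enabled: the system is quiescent


print(four_phase_transfer(1))
```

Each token costs four transitions, two of which only return the wires to their idle state. Those return-to-zero phases are the part of the protocol that reshuffling can overlap with other work.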
New Design Tools
[Tool-flow diagram. Design representations: Sequential Program; Concurrent system of small processes; Low-energy system that is slack-matched; PRS Transistor Netlist; Physical Layout. Tools: DDD, m3-3 (High-Level Simulator), ROMantic, edgar, esim (Energy Simulator), PL2, klay. Reported metrics: Energy, Throughput, Et^2.]
m3-3
Programming language, built on Modula-3
  – Hence includes compiler, runtime, and debugging
  – Very expressive: any Modula-3 subroutine allowed
Allows simulation and performance analysis of an asynchronous system
  – Does not require the system to be already expressed in CMOS circuits
m3-3 Performance Analysis
Energy analysis
  – Channel usage statistics
  – Measures total energy in number of bits sent
Delay analysis
  – Forward-Backward-Internal (FBI) model
  – Allows identification of token-limited, bubble-limited, and throughput-limited critical paths
  – Each communication is marked with a timestamp, and a “reason”, which is some subset of {F,B,I}
  – Measures total latency in logic transitions
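The bookkeeping described above can be sketched in a few lines. This is not m3-3 code (m3-3 is built on Modula-3); it is a hypothetical Python model in which each channel has a fixed width, each communication is logged with a timestamp in logic transitions and a reason drawn from {F, B, I}, energy is counted as bits sent, and latency is taken to be the latest timestamp. The class and function names are assumptions.

```python
from dataclasses import dataclass, field

REASONS = frozenset("FBI")  # Forward, Backward, Internal


@dataclass
class Channel:
    name: str
    width_bits: int
    events: list = field(default_factory=list)

    def communicate(self, timestamp, reason):
        """Record one communication, e.g. communicate(7, "FB")."""
        reason = frozenset(reason)
        if not reason <= REASONS:
            raise ValueError(f"reason must be a subset of F, B, I: {sorted(reason)}")
        self.events.append((timestamp, reason))

    @property
    def bits_sent(self):
        return self.width_bits * len(self.events)


def total_energy_bits(channels):
    """Energy estimate: total number of bits sent over all channels."""
    return sum(c.bits_sent for c in channels)


def total_latency(channels):
    """Latency estimate: the latest timestamp, in logic transitions (an assumption)."""
    return max((t for c in channels for t, _ in c.events), default=0)


fetch = Channel("fetch", 8)
fetch.communicate(3, "F")
fetch.communicate(7, "FB")
bus = Channel("bus", 16)
bus.communicate(5, "I")
print(total_energy_bits([fetch, bus]), total_latency([fetch, bus]))
```

Because the counts depend only on channel traffic, an analysis of this kind can run before any circuit exists.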
Accomplishments and Milestones 1
Et^2 theory: done (see the book!)
Circuit family: done
Redesign of the MIPS: “fetch loop” done, design postponed
Asynchronous pulse logic and SPAM processor: theory done, prototype postponed
Accomplishments and Milestones 2: Tools
m3-3 high-level simulator: done
esim energy simulator: done
Automatic design decomposition: in progress
PL2 circuit synthesizer: in progress
klay layout synthesizer: in progress
Asynchronous 8051 – the Lutonium
The 8051 is the most common microcontroller today
Overview
  – Microcontroller Architecture
  – Design Style
  – Advantages
  – Performance Estimates
  – Relation to Tools
  – Project Status & Future Work
8051 ISA
Direct address space, 256 bytes
  – 128 general-purpose registers (RegFile)
      Direct or indirect addressing (0..127)
  – Up to 128 special registers (SFRs)
      Direct addressing only (128..255)
      A, B, PSW, SP, DPL, DPH, IE, IP
      Ports (external I/O and timers)
Separate program space, up to 64K, read-only
Separate external address space, 64K
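The split of the direct address space can be written as a small decoder. The sketch below is illustrative only: the function name and return format are assumptions, while the SFR addresses are the standard 8051 assignments for the registers named on the slide.

```python
# Standard 8051 addresses of the special function registers listed above.
SFR_ADDRESSES = {
    0x81: "SP",
    0x82: "DPL",
    0x83: "DPH",
    0xA8: "IE",
    0xB8: "IP",
    0xD0: "PSW",
    0xE0: "A",
    0xF0: "B",
}


def decode_direct(addr):
    """Classify an 8-bit direct address as a RegFile location or an SFR."""
    if not 0 <= addr <= 0xFF:
        raise ValueError("direct addresses are 8 bits wide")
    if addr < 0x80:
        return ("RegFile", addr)             # 0..127: general-purpose registers
    # 128..255: SFR space; None for SFRs not named in the table (ports, timers, ...)
    return ("SFR", SFR_ADDRESSES.get(addr))


print(decode_direct(0x30))
print(decode_direct(0xE0))
```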
Complex Instructions
Read-modify-write
Rn registers
  – Must read the PSW to compute their actual address
  – Indirect addressing (@Ri)
Some instructions use 16-bit data
  – CALL; RET; INC DPTR; MOVX A,@DPTR
The average execution time will be very different from the maximum execution time
  – Asynchronous performance might far exceed synchronous performance
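The dependence of Rn on the PSW follows from the standard 8051 register banks: PSW bits 4 and 3 (RS1, RS0) select one of four 8-register banks at the bottom of the register file. A minimal sketch, with hypothetical function names and the register file modelled as a plain list:

```python
def rn_address(psw, n):
    """Register-file address of Rn: the bank selected by PSW bits 4:3, times 8, plus n."""
    if not 0 <= n <= 7:
        raise ValueError("Rn exists only for n in 0..7")
    bank = (psw >> 3) & 0b11
    return bank * 8 + n


def read_indirect(regfile, psw, i):
    """@Ri: first read Ri (i is 0 or 1), then use its value as the address to read."""
    if i not in (0, 1):
        raise ValueError("only R0 and R1 can be used for @Ri")
    pointer = regfile[rn_address(psw, i)]
    return regfile[pointer]


regfile = [0] * 128
regfile[8] = 0x40        # R0 of bank 1 holds a pointer
regfile[0x40] = 0xAB     # the value it points to
print(hex(read_indirect(regfile, psw=0x08, i=0)))
```

An @Ri access therefore needs a PSW read followed by two dependent register-file reads, which is one reason the instruction times on this slide vary so widely.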
Lutonium Design
Example: Fetch/IMem Design
Instructions have variable length (1-3 bytes)
Always fetches 2 bytes from memory
Handles MOVC instructions for code reads and code writes
Only reads interrupt registers when there is the possibility of an interrupt
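The interaction between fixed 2-byte fetches and 1-3 byte instructions can be sketched as follows. This is a hypothetical model, not the Lutonium fetch unit: it assumes straight-line code (no jumps taken), sequential 2-byte fetches, and a small buffer that holds leftover bytes. The opcode lengths shown are standard 8051 values for a handful of opcodes.

```python
from collections import deque

# Lengths in bytes of a few standard 8051 opcodes (subset, for illustration only).
OPCODE_LENGTH = {
    0x00: 1,  # NOP
    0x04: 1,  # INC A
    0x74: 2,  # MOV A,#data
    0x80: 2,  # SJMP rel
    0x02: 3,  # LJMP addr16
    0x90: 3,  # MOV DPTR,#data16
}


def split_instructions(code):
    """Split straight-line code into instructions; return (instructions, fetch count)."""
    buffer = deque()
    pc = 0
    fetches = 0
    instructions = []

    def fetch_pair():
        nonlocal pc, fetches
        buffer.extend(code[pc:pc + 2])   # always request 2 bytes
        pc += 2
        fetches += 1

    while pc < len(code) or buffer:
        if not buffer:
            fetch_pair()
        length = OPCODE_LENGTH[buffer[0]]
        while len(buffer) < length:
            if pc >= len(code):
                raise ValueError("truncated instruction at end of code")
            fetch_pair()
        instructions.append(bytes(buffer.popleft() for _ in range(length)))
    return instructions, fetches


program = bytes([0x00, 0x74, 0x12, 0x02, 0x34, 0x56, 0x04])
print(split_instructions(program))
```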
Fetch/Imem: Decomposition
Fetch/Imem: Ready for Layout
8051-specific Lutonium Advantages
Voltage adaptation is easy
Sleep sequence without race condition
  – Modeled after wait/signal with condition variables
Instant wake-up from deep sleep
Pipelined but not speculative
Enhanced off-chip interface: no static power
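The wait/signal pattern referred to above is the usual condition-variable idiom: the sleeper checks for a pending wake-up event while holding the lock, so an event that arrives just before the sleep is not lost. The sketch below shows that software idiom only; it says nothing about how the Lutonium implements it in circuits, and the class name is hypothetical.

```python
import threading


class SleepController:
    """Race-free sleep/wake-up using a condition variable and a pending-event count."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = 0   # wake-up events not yet consumed

    def signal(self):
        """Post a wake-up event (for example, an interrupt)."""
        with self._cond:
            self._pending += 1
            self._cond.notify()

    def sleep(self, timeout=None):
        """Block until a wake-up event is pending; return False on timeout."""
        with self._cond:
            woke = self._cond.wait_for(lambda: self._pending > 0, timeout)
            if woke:
                self._pending -= 1
            return woke


ctrl = SleepController()
ctrl.signal()                    # event arrives before the sleep starts
print(ctrl.sleep(timeout=1.0))   # returns immediately: the event was not lost
```

The race the slide refers to is the window between deciding to sleep and actually sleeping. Here the predicate check and the wait are atomic with respect to `signal`, which closes that window.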
Lutonium Performance
Lutonium-50 (0.5 micron):
  – Est. 100 MIPS, 600 MIPS/W (@3.3V)
  – Philips Sync.: 4.0 MIPS, 100 MIPS/W
  – Philips Async.: 4.0 MIPS, 444 MIPS/W
  – Dallas DS89C420 “ultra high speed”: 50 MIPS, 100 MIPS/W (0.5 micron)
Lutonium-18 (0.18 micron):
  – Est. 200 MIPS, 1800 MIPS/W (@1.8V)
  – Est. 66 MIPS, 7200 MIPS/W (@0.9V)
Lutonium-18 Prototype
TSMC SCN018 through MOSIS
  – 0.18 µm CMOS
  – 1.8V nominal
  – |Vt| = 0.4V to 0.5V
Expected area: 5 mm^2 (including 8kB SRAM)
Performance from low-level simulation (conservative!)

Supply   Speed      Power      Energy         Efficiency
1.8 V    200 MIPS   100.0 mW   500 pJ/inst     1800 MIPS/W
1.1 V    100 MIPS    20.7 mW   207 pJ/inst     4830 MIPS/W
0.9 V     66 MIPS     9.2 mW   139 pJ/inst     7200 MIPS/W
0.8 V     48 MIPS     4.4 mW    92 pJ/inst    10900 MIPS/W
0.5 V      4 MIPS     170 µW    43 pJ/inst    23000 MIPS/W

High Vt process (0.5V)
We could do better with a low Vt process
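The energy and efficiency figures follow from speed and power: energy per instruction is power divided by instruction rate, and MIPS/W is its reciprocal. A minimal sketch of that arithmetic, with hypothetical function names, applied to two of the operating points above:

```python
def energy_per_instruction_pj(mips, power_mw):
    """Energy per instruction in pJ, from speed in MIPS and power in mW."""
    joules = (power_mw * 1e-3) / (mips * 1e6)
    return joules * 1e12


def mips_per_watt(mips, power_mw):
    """Energy efficiency in MIPS/W."""
    return mips / (power_mw * 1e-3)


for volts, mips, power_mw in [(1.1, 100, 20.7), (0.9, 66, 9.2)]:
    pj = energy_per_instruction_pj(mips, power_mw)
    eff = mips_per_watt(mips, power_mw)
    print(f"{volts} V: {pj:.0f} pJ/inst, {eff:.0f} MIPS/W")
```

Going from 1.1 V to 0.9 V costs about a third of the speed and cuts the energy per instruction by about a third, the trade-off that supply-voltage adaptation exploits.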
Lutonium – Project Status
Entirely designed at component level
23K lines of m3-3
  – Timing simulation
  – Energy simulation
“Fetch-loop” designed at the transistor level
Lutonium – Future Work
Production-rule generation for execution units, register file and busses
Power-saving mechanisms (supply-voltage adaptation, threshold-voltage control)
Layout