Asynchronous Architectures for Energy Efficient Computing & Communication (AEC2)

Alain J. Martin

Asynchronous VLSI Group

Department of Computer Science

California Institute of Technology

12 Jun 2002

Program Concepts and Goals

Concepts
– Asynchronous approach to energy efficiency
– High-level synthesis

Goals
– Design and fabrication of the world’s most energy-efficient microprocessor/microcontroller
– Methods, tools, and circuits
– Energy complexity of computation

Asynchronous Architectures for Energy Efficient Computing & Communication

Caltech

Microprocessor -- Results

MIPS          Energy    Cycle time
async-0.6     33 nJ     6 ns
sync-0.6      70 nJ     21 ns

Microcontroller -- Estimation

8051             Energy per instr    Cycle time
sync-0.5         10.00 nJ (1X)       20 ns (1X)
async-0.5        1.67 nJ (6X)        10 ns (2X)
[label lost]     0.56 nJ (18X)       5 ns (4X)
[label lost]     [value lost] (72X)  [value lost] (2X)

More than 100X Et^2 improvement over any other 8051
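
The Et^2 comparison behind this claim can be reproduced from the energy and cycle-time figures quoted above. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, and pairing the 0.56 nJ entry with the 5 ns entry (both third in their respective charts) is an assumption, since the label of that entry did not survive.

```python
def et2(energy_nj, cycle_ns):
    """Return the E * t^2 figure of merit in nJ * ns^2 (lower is better)."""
    return energy_nj * cycle_ns ** 2


def improvement(baseline, design):
    """Ratio of baseline Et^2 to design Et^2; values above 1 favour the design."""
    return et2(*baseline) / et2(*design)


# (energy per instruction in nJ, cycle time in ns), as quoted on the slide
MIPS_SYNC_06 = (70.0, 21.0)
MIPS_ASYNC_06 = (33.0, 6.0)
I8051_SYNC_05 = (10.0, 20.0)
I8051_ASYNC_05 = (1.67, 10.0)
# Assumption: the third energy entry and the third cycle-time entry
# describe the same design point.
I8051_THIRD_ENTRY = (0.56, 5.0)

if __name__ == "__main__":
    print("MIPS async vs sync:", improvement(MIPS_SYNC_06, MIPS_ASYNC_06))
    print("8051 async-0.5 vs sync-0.5:", improvement(I8051_SYNC_05, I8051_ASYNC_05))
    print("8051 third entry vs sync-0.5:", improvement(I8051_SYNC_05, I8051_THIRD_ENTRY))
```

Under that pairing assumption, the third entry is the one that clears the 100X mark relative to the synchronous 0.5 micron baseline.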

Energy Breakdown

icache      31%
regfile     27%
fetch       11%
bus         12%
decode       4%
execunits    8%
writeback    7%

[Figure: processor block diagram with icache, fetch, decode, regfile (bypass), execunits (adder, shifter, fblock, mem, mult/div), and writeback]

Energy Complexity Theory

Optimization metric: Et^2
Et^2-optimal pipeline is shorter (MiniMIPS was overpipelined)
Transistor sizing is not minimal: C [symbol lost] 2P
Optimal energy: E [symbol lost] 3E0
Optimal delay: t [symbol lost] t [expression garbled in source; stray digits 2 and 3]
Sequential computation of A & B is optimal when Power(A) = Power(B)
Most energy is in communication (only 10% in computation)
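
The statement that a sequential computation of A and B is optimal when Power(A) = Power(B) can be checked numerically. The sketch below is not from the slides. It assumes that each stage can trade energy for delay along a curve E_i * t_i^2 = k_i (for instance by adjusting its supply voltage), and it minimizes the Et^2 of the sequence (E_A + E_B) * (t_A + t_B)^2. The constants k_a and k_b and all function names are hypothetical.

```python
def total_et2(k_a, k_b, t_a, t_b):
    """Et^2 of running A then B, when each stage obeys E_i * t_i^2 = k_i."""
    energy = k_a / t_a ** 2 + k_b / t_b ** 2
    return energy * (t_a + t_b) ** 2


def best_split(k_a, k_b, lo=1e-3, hi=1e3, iters=200):
    """Ternary search for the t_a that minimizes total Et^2 with t_b fixed at 1.

    The metric only depends on the ratio t_a / t_b, so fixing t_b loses nothing.
    """
    t_b = 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if total_et2(k_a, k_b, m1, t_b) < total_et2(k_a, k_b, m2, t_b):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0, t_b


def power(k, t):
    """Average power of a stage: E / t = (k / t^2) / t."""
    return k / t ** 3


if __name__ == "__main__":
    k_a, k_b = 8.0, 1.0  # hypothetical stage constants
    t_a, t_b = best_split(k_a, k_b)
    print("t_a / t_b at the optimum:", t_a / t_b)
    print("Power(A):", power(k_a, t_a), "Power(B):", power(k_b, t_b))
```

Under this model the minimum falls where k_a / t_a^3 = k_b / t_b^3, which is exactly the equal-power condition.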

Consequences for Asynchronous Design Methodology

Different transistor sizing
Less communication (Ex: LAX protocol)
Less pipelining
Different buffers (tree buffers)
Simpler ALU
Different cache design (memory cell bank size)
Shorter busses (Huffman-tree encoding of busses based on instruction group frequency)

Design Flow – Stages

CHP: Communicating Hardware Processes
– High-level language (selections, loops, etc.)
– Decompose a large sequential CHP process into a system of smaller, concurrent, communicating CHP processes

HSE: Handshaking Expansion
– Everything in boolean notation
– 4-phase handshakes (set Data, wait for Ack, reset Data, wait for reset Ack)
– Reshuffle the non-data-dependent portions of the 4-phase communication to improve speed & size

PRS: Production Rule Set
– No explicit sequencing: concurrent set of rules
– Each rule is an abstraction for the pull-up (PUP) and pull-down (PDN) networks
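
The 4-phase handshake listed under HSE (set Data, wait for Ack, reset Data, wait for reset Ack) can be illustrated with two cooperating Python generators that share a pair of wires. This is only a behavioural sketch under simplifying assumptions: a single `data` wire holding a value (or `None` when reset), a single `ack` wire, and a round-robin scheduler. None of the names come from the slides or from the Caltech tools.

```python
def sender(wires, values, log):
    """Active side: drives data, waits on ack."""
    for v in values:
        wires["data"] = v                  # set Data
        log.append(f"data={v}")
        while not wires["ack"]:            # wait for Ack
            yield
        wires["data"] = None               # reset Data
        log.append("data reset")
        while wires["ack"]:                # wait for reset Ack
            yield


def receiver(wires, received, log):
    """Passive side: latches data, drives ack."""
    while True:
        while wires["data"] is None:       # wait for Data
            yield
        received.append(wires["data"])
        wires["ack"] = True                # set Ack
        log.append("ack up")
        while wires["data"] is not None:   # wait for Data reset
            yield
        wires["ack"] = False               # reset Ack
        log.append("ack down")


def run(values):
    """Alternate the two processes until the sender has sent everything."""
    wires = {"data": None, "ack": False}
    received, log = [], []
    s = sender(wires, values, log)
    r = receiver(wires, received, log)
    while True:
        try:
            next(s)
        except StopIteration:
            break
        next(r)
    return received, log


if __name__ == "__main__":
    received, log = run([1, 0])
    print(received)
    print(log)
```

Each transfer costs four wire events, two of which (the resets) carry no data; those are the portions that reshuffling moves around.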

Sequential CHP → Concurrent CHP → HSE → PRS → PRS for CMOS → Sized PRS → Physical Design
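
To make the PRS stage of this flow concrete, a production rule can be read as "guard -> node goes up" or "guard -> node goes down", with all rules active concurrently. The sketch below evaluates such a rule set for a Muller C-element, a standard asynchronous gate chosen here as an illustration (it is not taken from the slides). The rule encoding and the `settle` helper are hypothetical.

```python
# Each rule: (guard over the current state, target node, value driven).
# A rule driving True stands for a pull-up network, one driving False
# for a pull-down network.
C_ELEMENT_RULES = [
    (lambda s: s["a"] and s["b"], "c", True),            # a & b -> c up
    (lambda s: not s["a"] and not s["b"], "c", False),   # ~a & ~b -> c down
]


def settle(state, rules):
    """Fire enabled rules until no rule changes the state; return the new state."""
    state = dict(state)
    changed = True
    while changed:
        changed = False
        for guard, node, value in rules:
            if guard(state) and state[node] != value:
                state[node] = value
                changed = True
    return state


if __name__ == "__main__":
    s = settle({"a": True, "b": True, "c": False}, C_ELEMENT_RULES)
    print(s["c"])   # both inputs high: output rises
    s = settle({"a": True, "b": False, "c": s["c"]}, C_ELEMENT_RULES)
    print(s["c"])   # inputs disagree: no rule is enabled, output holds
```

When neither guard holds, no rule fires and the node keeps its value, which is the state-holding behaviour the rule set describes without any explicit sequencing.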

New Design Tools

[Figure: tool-flow diagram. Design representations: sequential program; concurrent system of small processes; low-energy system that is slack-matched; PRS; transistor netlist; physical layout. Tools: DDD, m3-3 (high-level simulator), esim (energy simulator), ROMantic, edgar, PL2, klay. Metrics: energy, throughput, Et^2.]

m3-3

Programming language, built on Modula-3
– Hence includes compiler, runtime, and debugging
– Very expressive: any Modula-3 subroutine allowed

Allows simulation and performance analysis of an asynchronous system
– Does not require the system to be already expressed in CMOS circuits

m3-3 Performance Analysis

Energy analysis
– Channel usage statistics
– Measures total energy in number of bits sent

Delay analysis
– Forward-Backward-Internal (FBI) model
– Allows identification of token-limited, bubble-limited, and throughput-limited critical paths
– Each communication is marked with a timestamp, and a “reason”, which is some subset of {F,B,I}
– Measures total latency in logic transitions
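
The bookkeeping described here (bits sent per channel as the energy measure, and a timestamp plus an {F,B,I} reason on every communication) can be sketched as a small recorder. This is an illustration only and does not reflect the actual m3-3 implementation or API. The class, its methods, and the channel names are hypothetical, and taking the latest timestamp as the total latency is a simplifying assumption.

```python
from collections import defaultdict
from dataclasses import dataclass, field

VALID_REASONS = frozenset({"F", "B", "I"})  # Forward, Backward, Internal


@dataclass
class ChannelStats:
    bits_sent: dict = field(default_factory=lambda: defaultdict(int))
    events: list = field(default_factory=list)

    def record(self, channel, width_bits, timestamp, reason):
        """Log one communication; timestamp is counted in logic transitions."""
        reason = frozenset(reason)
        if not reason <= VALID_REASONS:
            raise ValueError(f"reason must be a subset of {{F,B,I}}: {sorted(reason)}")
        self.bits_sent[channel] += width_bits
        self.events.append((timestamp, channel, reason))

    def total_energy_bits(self):
        """Energy estimate: total number of bits sent over all channels."""
        return sum(self.bits_sent.values())

    def latency_transitions(self):
        """Assumed latency measure: the latest timestamp seen so far."""
        return max((t for t, _, _ in self.events), default=0)


if __name__ == "__main__":
    stats = ChannelStats()
    stats.record("fetch->decode", width_bits=16, timestamp=12, reason="F")
    stats.record("decode->exec", width_bits=8, timestamp=30, reason="FB")
    print(stats.total_energy_bits(), stats.latency_transitions())
```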

Accomplishments and Milestones 1

Et^2 theory: done (see the book!)

Circuit family: done

Redesign of the MIPS: “fetch loop” done, design postponed

Asynchronous pulse logic and SPAM processor: theory done, prototype postponed

Accomplishments and Milestones 2: Tools

m3-3 high-level simulator: done

esim energy simulator: done

Automatic design decomposition: in progress

PL2 circuit synthesizer: in progress

klay layout synthesizer: in progress

Asynchronous 8051 – the Lutonium

The 8051 is the most common microcontroller today

Overview
– Microcontroller Architecture
– Design Style
– Advantages
– Performance Estimates
– Relation to Tools
– Project Status & Future Work

8051 ISA

Direct address space, 256 bytes
– 128 general-purpose registers (RegFile)
   Direct or indirect addressing (0..127)
– Up to 128 special registers (SFRs)
   Direct addressing only (128..255)
   A, B, PSW, SP, DPL, DPH, IE, IP
   Ports (external I/O and timers)
Separate program space, up to 64K, read-only
Separate external address space, 64K
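
The split of the 256-byte direct address space described above can be written as a small decoder. This is a sketch of the memory map exactly as the slide states it (0..127 to the register file, 128..255 to the SFRs, direct addressing only), not a model of any particular 8051 derivative. The function name and return strings are hypothetical.

```python
def decode_internal_address(addr, indirect=False):
    """Return which block serves an internal address, per the slide's memory map."""
    if not 0 <= addr <= 255:
        raise ValueError("internal addresses are 8 bits wide")
    if addr <= 127:
        return "RegFile"          # direct or indirect addressing
    if indirect:
        # The slide lists 128..255 as reachable by direct addressing only.
        raise ValueError("addresses 128..255 are direct-only in this map")
    return "SFR"


if __name__ == "__main__":
    print(decode_internal_address(0x30))                  # general-purpose register
    print(decode_internal_address(0x30, indirect=True))   # also allowed
    print(decode_internal_address(0xD0))                  # an SFR address
```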

Complex Instructions

Read-modify-write
Rn registers
– Must read the PSW to compute their actual address
– Indirect addressing (@Ri)
Some instructions use 16-bit data
– CALL; RET; INC DPTR; MOVX A,@DPTR
The average execution time will be very different from the maximum execution time
– Asynchronous performance might far exceed synchronous performance
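
The PSW dependence of the Rn registers comes from the standard 8051 register banks: PSW bits 4 and 3 (RS1, RS0) select one of four 8-byte banks at the bottom of internal RAM, so the physical address of Rn is bank * 8 + n. A minimal sketch of that computation follows. The bit positions are those of the standard 8051 architecture rather than something stated on the slide, and the function name is hypothetical.

```python
RS_SHIFT = 3      # RS0 is PSW bit 3, RS1 is PSW bit 4
RS_MASK = 0b11


def rn_address(psw, n):
    """Physical internal-RAM address of register Rn for a given PSW value."""
    if not 0 <= n <= 7:
        raise ValueError("Rn exists only for n in 0..7")
    bank = (psw >> RS_SHIFT) & RS_MASK
    return bank * 8 + n


if __name__ == "__main__":
    print(hex(rn_address(0x00, 0)))   # bank 0, R0
    print(hex(rn_address(0x18, 7)))   # bank 3, R7
```

An access such as `@Ri` therefore needs two dependent reads: first the PSW to locate Ri, then Ri itself to obtain the target address.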

Lutonium Design

Example: Fetch/IMem Design

Instructions have variable length (1-3 bytes)
Always fetches 2 bytes from memory
Handles MOVC instructions for code reads and code writes
Only reads interrupt registers when there is the possibility of an interrupt

Fetch/Imem: Decomposition

Fetch/Imem: Ready for Layout

8051-specific Lutonium Advantages

Voltage adaptation is easy
Sleep sequence without race condition
– Modeled after wait/signal with condition variables
Instant wake-up from deep sleep
Pipelined but not speculative
Enhanced off-chip interface: no static power
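
The wait/signal pattern that the sleep sequence is modeled after can be shown in software with a condition variable. The sketch below uses Python's `threading.Condition` and is an analogy only: it says nothing about how the Lutonium circuit is actually built. The class and the `_wake_pending` flag are hypothetical. The key point is that the wake-up request is stored in a flag that is tested under the lock, so a wake-up arriving just before the sleeper waits is not lost.

```python
import threading


class SleepController:
    def __init__(self):
        self._cond = threading.Condition()
        self._wake_pending = False
        self.wakeups = 0

    def sleep(self, timeout=None):
        """Block until a wake-up is pending; return False if the timeout expires."""
        with self._cond:
            # wait_for tests the predicate while holding the lock, before and
            # after each wait, which closes the check-then-sleep race.
            woke = self._cond.wait_for(lambda: self._wake_pending, timeout)
            if woke:
                self._wake_pending = False
                self.wakeups += 1
            return woke

    def wake(self):
        """Signal a wake-up (the software analogue of an interrupt)."""
        with self._cond:
            self._wake_pending = True
            self._cond.notify()


if __name__ == "__main__":
    ctrl = SleepController()
    ctrl.wake()                      # wake-up arrives before the sleeper waits
    print(ctrl.sleep(timeout=1.0))   # the pending flag is still seen
```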

Lutonium Performance

Lutonium-50 (0.5 micron):
– Est. 100 MIPS, 600 MIPS/W (@3.3V)
– Philips Sync.: 4.0 MIPS, 100 MIPS/W
– Philips Async.: 4.0 MIPS, 444 MIPS/W
– Dallas DS89C420 “ultra high speed”: 50 MIPS, 100 MIPS/W (0.5 micron)

Lutonium-18 (0.18 micron):
– Est. 200 MIPS, 1800 MIPS/W (@1.8V)
– Est. 66 MIPS, 7200 MIPS/W (@0.9V)
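
The MIPS/W figures above convert directly into energy per instruction and average power: 1 MIPS/W is one million instructions per joule, so energy per instruction in nJ is 1000 / (MIPS/W), and power in mW is 1000 * MIPS / (MIPS/W). The following is a small sketch of that arithmetic; the function names are hypothetical.

```python
def nj_per_instruction(mips_per_watt):
    """Energy per instruction in nJ for a given efficiency in MIPS/W."""
    return 1000.0 / mips_per_watt


def power_mw(mips, mips_per_watt):
    """Average power in mW at a given speed (MIPS) and efficiency (MIPS/W)."""
    return 1000.0 * mips / mips_per_watt


if __name__ == "__main__":
    # (name, MIPS, MIPS/W) as quoted on the slide
    designs = [
        ("Lutonium-50 @3.3V", 100, 600),
        ("Lutonium-18 @1.8V", 200, 1800),
        ("Lutonium-18 @0.9V", 66, 7200),
        ("Dallas DS89C420", 50, 100),
    ]
    for name, mips, eff in designs:
        print(name, nj_per_instruction(eff), "nJ/inst,", power_mw(mips, eff), "mW")
```

For 600 and 1800 MIPS/W this gives about 1.67 nJ and 0.56 nJ per instruction, matching the energy-per-instruction figures in the 8051 estimation earlier in the deck.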

Lutonium-18 Prototype

TSMC SCN018 through MOSIS
– 0.18 µm CMOS
– 1.8V nominal
– |Vt| = 0.4V to 0.5V

Expected area: 5 mm^2 (including 8kB SRAM)
Performance from low-level simulation (conservative!)

Supply   Speed      Power      Energy/inst   Efficiency
1.8 V    200 MIPS   100.0 mW   500 pJ/inst   1800 MIPS/W
0.5 V    4 MIPS     170 µW     43 pJ/inst    23000 MIPS/W

High Vt process (0.5V)
We could do better with a low Vt process

Lutonium – Project Status

Entirely designed at component level
23K lines of m3-3
– Timing simulation
– Energy simulation
“Fetch-loop” designed at the transistor level

Lutonium – Future Work

Production-rule generation for execution units, register file and busses

Power-saving mechanisms (supply-voltage adaptation, threshold-voltage control)

Layout