Zhang, Prabhu, George, Wan, Benes, Abnous, Rabaey

A 1V Heterogeneous Reconfigurable Processor IC for Baseband Wireless Applications

Hui Zhang, Vandana Prabhu, Varghese George, Marlene Wan, Martin Benes, Arthur Abnous¹, and Jan M. Rabaey

EECS Dept., University of California at Berkeley

¹Broadcom Corp., Irvine, CA

Berkeley Wireless Research Center Tel: (510) 666 3111

2108 Allston Way, Suite 200 Fax: (510) 883 0270

Berkeley, CA 94704 E-mail: [email protected]

Abstract

Heterogeneous reconfiguration enables the flexible implementation of baseband wireless functions at energy efficiencies between 50 and 100 MIPS/mW, a factor of 8 better than traditional DSP processors. A 5.2×6.7 mm² prototype processor, targeted at voice compression, is implemented in a 0.25 µm 6-metal CMOS process and consumes 1.8 mW at an average operation rate of 40 MHz. It combines an embedded microprocessor with an array of computational units of different granularities, connected by a hierarchical configurable interconnect network.

ISSCC Subject Area: Signal Processing


Introduction

The advent of the third generation of wireless applications creates a need for processing modules that simultaneously display high computational performance, ultra-low energy consumption, and a high degree of flexibility and adaptability. Flexibility and adaptability are a necessity in the presence of multiple and evolving standards, and help to increase quality of service under dynamically changing conditions. (Re)configurable processors combine flexibility with low energy by providing a direct spatial mapping from algorithm to architecture, which reduces the control overhead typically associated with instruction-set processors.

General Concept

The Pleiades processor approach [1] combines an on-chip microprocessor with an array of heterogeneous programmable computational units of different granularities (called satellite processors), connected by a reconfigurable interconnect network (Figure 1). The microprocessor supports the control-intensive components of the applications as well as the reconfiguration, while repetitive and regular data-intensive loops (henceforth referred to as kernels) are mapped directly onto the array of satellites by configuring the satellite parameters and the interconnections between them (Figure 2). Synchronization between the satellite processors is accomplished by a data-driven communication protocol, in accordance with the data-flow nature of the computations performed in the kernels. A generalized interface wrapper is placed around each satellite processor to comply with the communication protocol. By exploiting the locality of the computations and the correlations within data streams, and by distributing the control, this spatial programming approach achieves energy efficiencies of 50-100 MIPS/mW, at least an order of magnitude better than what can be accomplished in comparable DSP processors.
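As a rough illustration of spatial mapping, the sketch below wires a dot-product kernel out of three kinds of satellite, in the spirit of Figure 2. It is a software analogy only: the function names and the wiring style are assumptions made for this example, not the chip's programming interface, and Python generators are pull-based, so they only approximate the data-driven firing of the hardware.

```python
from typing import Iterable, Iterator, List


def address_stream(start: int, count: int, stride: int = 1) -> Iterator[int]:
    """Hypothetical address-generator satellite: emits a linear address stream."""
    for n in range(count):
        yield start + n * stride


def memory(contents: List[int], addresses: Iterable[int]) -> Iterator[int]:
    """Hypothetical memory satellite: emits one data word per incoming address."""
    for address in addresses:
        yield contents[address]


def mac(xs: Iterable[int], ys: Iterable[int]) -> int:
    """Hypothetical MAC satellite: consumes one operand pair per step and accumulates."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc


def dot_product_kernel(a: List[int], b: List[int]) -> int:
    """'Configure' the kernel by connecting satellite outputs to satellite inputs."""
    n = min(len(a), len(b))
    stream_a = memory(a, address_stream(0, n))
    stream_b = memory(b, address_stream(0, n))
    return mac(stream_a, stream_b)
```

Once the streams are connected, no central instruction fetch is involved in the loop body: each stage simply reacts to the data that reaches it, which is the property the text credits for the reduced control overhead.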

Processor Architecture

A prototype processor has been implemented targeting voice processing (and related applications) in wireless systems. The Maia processor (Figure 3) combines an ARM8 core with 21 satellite processors: two MACs, two ALUs, eight address generators, eight embedded memories (four 512×16-bit and four 1K×16-bit), and an embedded low-energy FPGA array [3]. Through an interface control unit, the ARM8 configures the memory-mapped satellites over a configuration bus, and exchanges data with the satellites through two pairs of I/O interface ports and direct memory reads and writes. Connections between satellite modules are made through a 2-level hierarchical mesh-structured reconfigurable interconnect network. The 210-pin chip contains 1.2 million transistors and measures 5.2×6.7 mm² in 0.25 µm 6-metal CMOS technology (Figure 4).

The embedded ARM8 core is optimized for low-energy operation and can operate under variable supply voltages [2]. Both the dual-stage pipelined MAC (including shift/round/saturate functions) and the ALU can be configured to handle a range of operations. The address generators and embedded memories are distributed so that they can supply multiple parallel data streams to the computational elements. Each address generator features a small local instruction memory, and can be programmed to support various addressing patterns and nested loops using loop counters and stride counters. It acts as the local controller of a data-flow kernel by initiating the data-flow threads and by signaling their completion to the ARM8. The embedded FPGA provides a 4×8 array of 5-input, 3-output CLBs, optimized for arithmetic operations and data-flow control functions. It contains 3 levels of interconnect hierarchy, superimposing nearest-neighbor, mesh, and tree architectures. Its energy efficiency has been measured to be 70 times higher than that of equivalent industrial solutions [3]. The interface control unit coordinates synchronization and communication between the synchronous ARM8 core and the asynchronous reconfigurable data paths. Most importantly, it lets the core reconfigure the satellites by mapping all the configuration memories into the ARM8 memory space.
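The nested-loop addressing described above can be sketched as follows. This is a minimal model that assumes each loop level is described by a (count, stride) pair; the actual instruction format of the address generator is not given in the text, so the function name and parameters here are hypothetical.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple


def nested_addresses(base: int, loops: Sequence[Tuple[int, int]]) -> Iterator[int]:
    """Yield the address sequence of a loop nest, outermost loop first.

    Each entry of `loops` is an assumed (count, stride) pair: the loop
    counter runs from 0 to count - 1 and contributes counter * stride
    to the address.
    """
    counters = [range(count) for count, _ in loops]
    for index in product(*counters):
        offset = sum(i * stride for i, (_, stride) in zip(index, loops))
        yield base + offset
```

For example, `nested_addresses(0, [(2, 8), (3, 1)])` walks three consecutive words in each of two rows that are eight words apart, while `nested_addresses(0, [(3, 1), (2, 1)])` produces the overlapping sliding window that an FIR filter reads. Exhaustion of the iterator plays the role of the end-of-thread signal sent to the ARM8.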

Communication Network

The data-driven synchronization between the processing elements employs a 2-phase self-timed handshaking scheme with REQUEST and ACKNOWLEDGE signals (Figure 5a), realized in a globally-asynchronous locally-synchronous fashion. This approach not only reduces power consumption by ensuring that a module is activated only when data is ready, but also allows modules to operate at different and dynamically varying rates. Each module includes a network interface controller to coordinate communication and synchronization. Data links combine 16-bit fixed-width data words with 2-bit control tokens that tag the different data structures (scalar, vector, or matrix) supported by the network (Figure 5b).
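A 2-phase handshake is usually understood as transition signaling: every edge on REQUEST announces a new word and every edge on ACKNOWLEDGE releases the link, with no return-to-zero phase. The sketch below models that convention in software. It assumes this standard interpretation; the class and method names are hypothetical and the signal-level details of the chip's implementation are not taken from the text.

```python
from typing import List, Optional, Tuple


class TwoPhaseLink:
    """Behavioral model of a transition-signaled (2-phase) handshake link."""

    def __init__(self) -> None:
        self.req = 0                       # toggled by the sender
        self.ack = 0                       # toggled by the receiver
        self.data: Optional[int] = None

    def can_send(self) -> bool:
        # The link is idle when both wires are at the same level.
        return self.req == self.ack

    def has_data(self) -> bool:
        # A pending word is indicated by REQUEST differing from ACKNOWLEDGE.
        return self.req != self.ack

    def send(self, word: int) -> None:
        if not self.can_send():
            raise RuntimeError("previous word not yet acknowledged")
        self.data = word & 0xFFFF          # 16-bit data word, as on the chip
        self.req ^= 1                      # one transition announces the word

    def receive(self) -> int:
        if not self.has_data():
            raise RuntimeError("no word pending")
        word = self.data
        self.ack ^= 1                      # one transition releases the link
        return word


def transfer(words: List[int]) -> Tuple[List[int], int]:
    """Move words across one link and count control-wire transitions."""
    link = TwoPhaseLink()
    received, transitions = [], 0
    for word in words:
        link.send(word)
        transitions += 1
        received.append(link.receive())
        transitions += 1
    return received, transitions
```

In this model each word costs two control-wire transitions, compared with four for a return-to-zero (4-phase) protocol, and a receiver that has no pending word simply stays idle.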

Keeping the energy of the reconfigurable communication network as low as possible is crucial to the success of the approach. This is achieved by a combination of architecture and circuit optimizations. The network itself is implemented as a 2-level hierarchical mesh. Modules that communicate heavily with each other are grouped into tightly connected clusters. Each cluster has a local mesh with 2 buses per channel and a universal switchbox at every intersection point (Figure 6a). Global interconnections are supported by a second-level, larger-granularity mesh (implemented on the higher metal layers) with 2 buses per channel and hierarchical switchboxes located at the key connection points. The hierarchical switchbox (Figure 6b) contains a universal switchbox for each mesh level, as well as a number of cross-level interconnect switches. This hierarchical network architecture requires only a limited number of buses to achieve sufficient connection flexibility for our target applications, and cuts the interconnect energy cost by a factor of 7 compared to a straightforward crossbar network implementation.

Communication energy is further reduced by employing a low-swing (0.4 V) pseudo-differential signaling scheme (Figure 7a). The capacitive loads are also reduced by simplifying the switch network to NMOS-only switches. The circuit uses a single wire for each data bit while retaining most advantages of differential signaling, such as high common-mode noise rejection, low input offset, and good sensitivity. It employs an NMOS-only push-pull driver with a very low supply voltage. The receiver is a clocked sense amplifier followed by a static flip-flop. It contains two pairs of input transistors, with the gates of P1 and P3 connected to d, and the gates of P4 and P2 biased at GND and REF, respectively. Figure 7b shows the signaling waveforms. Initially, A and B are discharged to GND, and n1 and n2 are equalized. The receiver is enabled by a negative pulse, which is generated from the handshaking signals. If d is low, the current drive of P3 is the same as that of P4, while the current drive of P1 is larger than that of P2. Consequently, B and A are pulled high and low, respectively, by the cross-coupled inverter pair. The opposite transition is triggered if d is high. The static flip-flop that follows retains the data value even after the sense amplifier is reinitialized. The low-swing signaling reduces the interconnect energy by a factor of 3.4 compared to a full-swing CMOS implementation.
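A first-order estimate helps put the factor of 3.4 in context. The sketch below uses the textbook model in which charging a wire of capacitance C through a swing V_swing from a supply V_supply draws C·V_swing·V_supply from that supply. It assumes that the driver supply equals the 0.4 V swing and that the wire capacitance is the same in both cases; neither assumption is stated in the text, and the 1 pF value is a placeholder that cancels out of the ratio.

```python
def wire_energy(c_farads: float, v_swing: float, v_supply: float) -> float:
    """Energy drawn from the supply to charge the wire once (first-order model)."""
    return c_farads * v_swing * v_supply


def ideal_low_swing_saving(v_full: float = 1.0, v_low: float = 0.4) -> float:
    """Ratio of full-swing to low-swing wire energy under the assumptions above."""
    c = 1e-12  # placeholder capacitance; it cancels out of the ratio
    full = wire_energy(c, v_full, v_full)
    low = wire_energy(c, v_low, v_low)
    return full / low
```

Under these assumptions the wire alone would save a factor of (1.0/0.4)² = 6.25. The reported factor of 3.4 is smaller, which is what one would expect if part of the saving is spent elsewhere, for instance in the clocked sense-amplifier receiver, although the text does not give that breakdown.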

Results and Data Measurements

The overall chip characteristics are summarized in Table 1. Table 2 shows the performance of the individual chip components (based on a per-block analysis). The energy dissipation of the processor when programmed for a VCELP voice coder (with 1.8 mW total power consumption) is presented in Table 3, including a breakdown of the energy over the major functions. Dominant kernels are mapped directly onto hardware satellites, and their run-time reconfiguration is performed by the ARM8 core. The kernel energies presented in the table therefore include contributions from both the satellites and the ARM8 configuration. The program control part of the algorithm is mapped entirely to software. The total measured energy efficiency is a factor of 8 better than the best reported in the literature [4].
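The Table 3 figures can be cross-checked with a few lines of arithmetic. The values below are copied from Table 3; only the variable and function names are introduced for this example. Because the energies are quoted for one second of speech, a total in mJ is numerically equal to the average power in mW.

```python
# Energy in mJ for 1 s of VCELP speech processing (values from Table 3).
KERNEL_ENERGY_MJ = {
    "Dot product": 0.738,
    "FIR filter": 0.131,
    "IIR filter": 0.021,
    "Vector sum with scalar multiply": 0.042,
    "Compute code": 0.011,
    "Covariance matrix compute": 0.006,
}
PROGRAM_CONTROL_MJ = 0.838


def vcelp_breakdown():
    """Return (kernel energy, total energy, kernel share of the total)."""
    kernels = sum(KERNEL_ENERGY_MJ.values())
    total = kernels + PROGRAM_CONTROL_MJ
    return kernels, total, kernels / total
```

The kernels sum to 0.949 mJ and the total to 1.787 mJ, which matches the table and the quoted 1.8 mW. Kernels mapped onto the satellites therefore account for about 53% of the energy, and program control running in software for the remaining 47%.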

Acknowledgments

The research was funded by DARPA ACS and the California MICRO program. The support from Philips, Atmel, and Conexant is greatly appreciated. The authors also wish to thank SGS-Thomson for fabricating the integrated circuits.

References

[1] Arthur Abnous and Jan Rabaey, “Ultra-Low-Power Domain-Specific Multimedia Processors”, IEEE VLSI Signal Processing Workshop, October 1996.

[2] Tom Burd et al., “A Dynamic Voltage Scaled Microprocessor System”, submitted to ISSCC 2000.

[3] Varghese George et al., “The Design of a Low-Energy FPGA”, Proceedings of ISLPED '99, Aug. 1999.

[4] Wai Lee et al., “A 1V DSP for Wireless Communication”, Digest of Technical Papers, ISSCC '97.


Technology                   0.25 µm 6-level metal CMOS
Main Supply Voltage          1 V
Additional Voltages          0.4 V, 1.5 V
Die Size                     5.2 mm × 6.7 mm
Transistor Count             1.2 million transistors
Average Cycle Speed          40 MHz
Average Power Dissipation    1.5 - 2 mW

Table 1: Chip Characteristics

Hardware module         Pipeline speed (ns)   Energy per operation (pJ)   Area (mm²)
MAC                     24                    21                          0.25
ALU                     20                    8                           0.09
Memory (1K × 16)        14                    8                           0.32
Memory (512 × 16)       11                    7                           0.16
Address generator       20                    6                           0.12
Interconnect network    10                    1*                          NA
FPGA                    25                    18**                        2.76

Table 2: Performance of hardware modules
*This number is the average energy consumption per connection.
**This number is the average energy consumption across various arithmetic functions.

Functionality                          Energy consumption (mJ) for 1 sec of VCELP speech processing
Kernels
  Dot product                          0.738
  FIR filter                           0.131
  IIR filter                           0.021
  Vector sum with scalar multiply      0.042
  Compute code                         0.011
  Covariance matrix compute            0.006
Program control                        0.838
Total                                  1.787

Table 3: VCELP energy consumption breakdown among dominant kernels and program control
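The per-operation energies in Table 2 can be related to the MIPS/mW figures quoted in the text with a simple unit conversion: at E pJ per operation, 1 mW sustains 10⁻³ / (E × 10⁻¹²) operations per second, or 1000/E million operations per second. The sketch below applies that conversion. Treating one block-level operation as one "instruction" is an assumption made here for illustration, and a real kernel exercises several blocks per operation, so these are per-block upper bounds rather than system figures.

```python
# Energy per operation in pJ (values from Table 2).
ENERGY_PJ = {
    "MAC": 21,
    "ALU": 8,
    "Memory (1K x 16)": 8,
    "Memory (512 x 16)": 7,
    "Address generator": 6,
    "FPGA": 18,
}


def mops_per_mw(energy_pj: float) -> float:
    """Million operations per second sustained by 1 mW at energy_pj per operation."""
    return 1000.0 / energy_pj


def efficiency_table():
    """Map each hardware module to its per-block efficiency in MOPS/mW."""
    return {name: mops_per_mw(e) for name, e in ENERGY_PJ.items()}
```

For example, the MAC at 21 pJ per operation corresponds to about 48 MOPS/mW and the ALU at 8 pJ to 125 MOPS/mW, which is the same order as the 50-100 MIPS/mW range reported for the processor as a whole.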


Figure 1: Heterogeneous Reconfigurable Processor Architecture. (Blocks shown: microprocessor, configuration bus, reconfigurable interconnect, and the satellite processors: configurable logic, embedded memory, address generator, and two arithmetic co-processors.)

Figure 2: Mapping a computational kernel on an array of satellite processors. The kernel shown is:

    for (i = 1; i <= length; i++) {
        for (k = i; k <= length; k++) {
            phi[i][k] = phi[i-1][k-1] + in[NP-i]*in[NP-k] - in[NA-1-i]*in[NA-1-k];
        }
    }

(Blocks shown in the mapping: two address generators (AddrGen), memories MEM:i and MEM:phi, two multipliers (MPY), an adder/subtractor (+/-), and execution control.)


Figure 3: Floorplan of Prototype Processor. (Blocks shown: ARM interface, FPGA, MAC, ALU, address generators (AG), 1K-word and 512-word memories (Mem1K, Mem512), and I/O ports, together with the Level-1 and Level-2 meshes and the universal and hierarchical switchboxes.)

Figure 4: Heterogeneous Reconfigurable Processor Chip Microphotograph. (Labeled regions: ARM8 core, interface, FPGA, ALU, MAC, MEM, AGU, and interconnect network.)

Figure 5: Data-driven globally-asynchronous locally-synchronous inter-processor communication. (a) Globally asynchronous, locally synchronous signaling. (b) Control tokens differentiate and delineate data streams and data structures (scalar, vector, matrix).

Figure 6: Hierarchical Mesh Network and Switch Matrices. (a) Level 1 mesh with universal switchbox. (b) Level 2 mesh with hierarchical switchbox (only cross-mesh connections are shown).

Figure 7: Pseudo-differential low-swing interconnect circuitry. (a) Circuit diagram. (b) Circuit waveforms.
