Zhang, Prabhu, George, Wan, Benes, Abnous, Rabaey
A 1V Heterogeneous Reconfigurable Processor IC for Baseband Wireless Applications
Hui Zhang, Vandana Prabhu, Varghese George, Marlene Wan, Martin Benes, Arthur Abnous1,
and Jan M. Rabaey
EECS Dept., University of California at Berkeley
1Broadcom Corp., Irvine, CA
Berkeley Wireless Research Center Tel: (510) 666 3111
2108 Allston Way, Suite 200 Fax: (510) 883 0270
Berkeley, CA 94704 E-mail: [email protected]
Abstract
Heterogeneous reconfiguration enables the flexible implementation of baseband wireless
functions at energy efficiencies between 50 and 100 MIPS/mW, 8 times better than traditional
DSP processors. A 5.2×6.7 mm² prototype processor, targeted at voice compression, is
implemented in a 0.25 µm 6-metal CMOS process and consumes 1.8 mW at an average
operation rate of 40 MHz. It combines an embedded microprocessor with an array of
computational units of different granularities, connected by a hierarchical configurable
interconnect network.
ISSCC Subject Area: Signal Processing
Introduction
The advent of the third generation of wireless applications creates a need for processing
modules that simultaneously display high computational performance, ultra low-energy
consumption, and a high degree of flexibility and adaptability. Flexibility and
adaptability are a necessity in the presence of multiple and evolving standards, and help to
increase quality-of-service in the presence of dynamically evolving conditions.
(Re)configurable processors offer the advantage of combining flexibility and low-energy
by providing a direct spatial mapping from algorithm to architecture, hence reducing the
control overhead typically associated with instruction-set processors.
General Concept
The Pleiades processor approach [1] combines an on-chip microprocessor with an array
of heterogeneous programmable computational units of different granularities (called
satellite processors) connected by a reconfigurable interconnect network (Figure 1). The
microprocessor supports the control-intensive components of the applications as well as
the reconfiguration, while repetitive and regular data-intensive loops (henceforth referred
to as kernels) are directly mapped onto the array of satellites by configuring the satellite
parameters and the interconnections between them (Figure 2). Synchronization between
the satellite processors is accomplished by a data-driven communication protocol in
accordance with the data-flow nature of the computations performed in the kernels. A
generalized interface wrapper is placed around each satellite processor to comply with the
communication protocol. By exploiting the locality of the computations and the
correlations within data streams, and by distributing the control, this spatial
programming approach reaches energy efficiencies of 50-100 MIPS/mW, at least an order
of magnitude better than what can be accomplished in comparable DSP processors.
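
To relate the MIPS/mW figures to energy per operation, a short unit conversion may help. This is only a sketch of the arithmetic (1 MIPS/mW equals 10^9 operations per joule); the function name is illustrative and does not come from the paper.

```python
def energy_per_op_pj(mips_per_mw: float) -> float:
    """Convert an efficiency in MIPS/mW to energy per operation in picojoules."""
    ops_per_joule = mips_per_mw * 1e6 / 1e-3  # (operations per second) per watt
    return 1e12 / ops_per_joule


for efficiency in (50, 100):
    print(f"{efficiency} MIPS/mW -> {energy_per_op_pj(efficiency):.0f} pJ per operation")
```

On this scale, 50 to 100 MIPS/mW corresponds to roughly 20 to 10 pJ per operation.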
Processor Architecture
A prototype processor has been implemented targeting the domain of voice processing
(and related applications) for wireless applications. The Maia processor (Figure 3)
combines an ARM8 core with 21 satellite processors: two MACs, two ALUs, eight
address generators, eight embedded memories (four 512×16-bit, four 1K×16-bit), and an
embedded low-energy FPGA array [3]. Through an interface control unit, ARM8
configures the memory-mapped satellites using a configuration bus, and communicates
data with satellites using 2 pairs of IO interface ports and direct memory reads/writes.
Connections between satellite modules are accomplished through a 2-level hierarchical
mesh-structured reconfigurable interconnect network. The 210-pin chip contains 1.2
million transistors and measures 5.2×6.7 mm² in 0.25 µm 6-metal CMOS technology
(Figure 4).
The embedded ARM8 core is optimized for low-energy operation, and can operate under
variable supply voltages [2]. Both the dual-stage pipelined MAC (including
shift/round/saturate functions) and the ALU can be configured to handle a range of
operations. The address generators and embedded memories are distributed to supply
multiple parallel data streams to the computational elements. The address generator
features a small local instruction memory, and can be programmed to support various
types of addressing patterns and nested loops with loop counters and stride counters. It
behaves as the local controller of data-flow kernels by initiating the data-flow threads,
and by signaling the end of the data-flow threads to the ARM8. The embedded FPGA
supports a 4×8 array of 5-input 3-output CLBs, optimized for arithmetic operations and
data-flow control functions. It contains 3 levels of interconnect hierarchy, superimposing
nearest-neighbor, mesh and tree architectures. Its energy-efficiency has been measured to
be 70 times higher than equivalent industrial solutions [3]. The interface control unit
coordinates synchronization and communication between the synchronous ARM8 core
and the asynchronous reconfigurable data-paths, most importantly helping the core
perform the reconfiguration of satellites by mapping all the configuration memories to the
ARM8 memory space.
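
The nested-loop addressing supported by the address generators can be pictured with a small behavioral sketch. It assumes each loop level is described by a (count, stride) pair; that encoding, the function name, and the example base address are illustrative assumptions and not the actual instruction format of the Maia address generator.

```python
from typing import Iterator, Sequence, Tuple


def address_stream(base: int, loops: Sequence[Tuple[int, int]]) -> Iterator[int]:
    """Yield addresses for nested loops given as (count, stride), outermost first."""
    if not loops:
        yield base
        return
    (count, stride), inner = loops[0], loops[1:]
    for i in range(count):
        # Each iteration of this level offsets the base seen by the inner levels.
        yield from address_stream(base + i * stride, inner)


# Hypothetical pattern: 3 rows of 4 words, with a row pitch of 16 words.
addrs = list(address_stream(base=0x100, loops=[(3, 16), (4, 1)]))
print([hex(a) for a in addrs])
```

In hardware the same pattern would be produced by loop counters and stride counters rather than recursion; the sketch only shows which addresses such a configuration would generate.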
Communication Network
The data-driven synchronization between the processing elements employs a 2-phase
self-timed handshaking scheme with REQUEST and ACKNOWLEDGE signals (Figure
5a), realized in a globally-asynchronous locally-synchronous fashion.
This approach not only reduces power consumption by ensuring that a module is only
activated when data is ready, but also allows various modules to operate at different and
dynamically varying rates. Each module includes a network interface controller to
coordinate communication and synchronization. Data links combine 16-bit fixed-width
data words with 2-bit control tokens that serve as tags of the different data structures
(scalar, vector, or matrix) that are supported by the network (Figure 5b).
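
A behavioral sketch of the 2-phase handshake with tagged data words is given below. In 2-phase (transition) signaling, every toggle of REQUEST or ACKNOWLEDGE is an event, with no return-to-zero phase. The token encodings, the class name, and the single-word link model are assumptions made for illustration; the text above specifies only the 16-bit data width and the 2-bit token width.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical token values; only the 2-bit width comes from the text.
TOKEN_DATA, TOKEN_END_OF_VECTOR = 0b00, 0b01


@dataclass
class TwoPhaseLink:
    """One-word link in which each REQ or ACK transition is an event."""
    req: int = 0
    ack: int = 0
    word: int = 0  # 18 bits: 2-bit token above a 16-bit data word

    def can_send(self) -> bool:
        return self.req == self.ack  # previous word has been acknowledged

    def send(self, data: int, token: int) -> None:
        if not self.can_send():
            raise RuntimeError("previous word not yet acknowledged")
        self.word = ((token & 0x3) << 16) | (data & 0xFFFF)
        self.req ^= 1  # toggle REQUEST: new data is valid

    def can_receive(self) -> bool:
        return self.req != self.ack

    def receive(self) -> Tuple[int, int]:
        if not self.can_receive():
            raise RuntimeError("no pending word")
        data, token = self.word & 0xFFFF, self.word >> 16
        self.ack ^= 1  # toggle ACKNOWLEDGE: data has been consumed
        return data, token


def transfer(vector: List[int]) -> List[List[int]]:
    """Send one vector over the link; the receiver regroups words by token."""
    link, received, current = TwoPhaseLink(), [], []
    for i, value in enumerate(vector):
        last = i == len(vector) - 1
        link.send(value, TOKEN_END_OF_VECTOR if last else TOKEN_DATA)
        data, token = link.receive()  # the receiver acts only on a REQUEST event
        current.append(data)
        if token == TOKEN_END_OF_VECTOR:
            received.append(current)
            current = []
    return received
```

Because the receiver does work only when a REQUEST transition is pending, an idle module in this model performs no activity, which mirrors the power argument made above.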
Keeping the energy of the reconfigurable communication network as low as possible is
crucial to the success of the approach. This is realized by a combination of architecture
and circuit optimizations. The network itself is implemented as a 2-level hierarchical
mesh. Several clusters of tightly connected modules are formed according to the
communication locality. Each cluster has a local mesh with 2 buses-per-channel, and a
universal switchbox at every intersection point (Figure 6a). Global interconnections are
supported by a second-level, larger-granularity mesh (implemented on the higher metal layers)
with 2 buses-per-channel and hierarchical switchboxes, located at the key connection
points. The hierarchical switchbox (Figure 6b) contains a universal switchbox for each
mesh-level, as well as a number of cross-level interconnect switches. This hierarchical
network architecture requires only a limited number of buses to achieve sufficient
connection flexibility for our target applications, and cuts the interconnect energy cost by
a factor of 7 compared to a straightforward crossbar network implementation.
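
The role of communication locality in the 2-level mesh can be illustrated with a toy model. It assumes that a connection inside one cluster stays on the level-1 mesh and that a connection between clusters goes local, global, local; the module names, the cluster assignment, and that path shape are hypothetical and do not describe the actual Maia floorplan or routing.

```python
from typing import Dict, List


def mesh_levels(src: str, dst: str, cluster_of: Dict[str, str]) -> List[str]:
    """Return the mesh levels a connection traverses, in order."""
    if cluster_of[src] == cluster_of[dst]:
        return ["level-1"]  # stays on the local mesh of the cluster
    # Assumed path: leave the cluster, cross the global mesh, enter the other cluster.
    return ["level-1", "level-2", "level-1"]


# Hypothetical clustering of a few satellites.
cluster_of = {"MAC0": "A", "AG0": "A", "MEM0": "A", "ALU1": "B", "AG4": "B"}

print(mesh_levels("AG0", "MAC0", cluster_of))   # intra-cluster connection
print(mesh_levels("MEM0", "ALU1", cluster_of))  # inter-cluster connection
```

Grouping modules that communicate frequently into the same cluster keeps most connections on the short level-1 buses in this model.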
Communication energy is further reduced by employing a low-swing (0.4V) pseudo-
differential signaling scheme (Figure 7a). The capacitance loads are also reduced by
simplifying the switch network with NMOS-only switches. The circuit uses a single wire
for each data bit while still retaining most advantages of differential signaling such as
high common-mode noise rejection, low input-offset, and good sensitivity. It employs an
NMOS-only push-pull driver with a very low voltage supply. The receiver is a clocked
sense amplifier followed by a static flip-flop. It contains two pairs of input transistors,
with the gates of P1 and P3 connected to d, while the gates of P4 and P2 are biased at GND
and REF, respectively. Figure 7b shows the signaling waveforms. Initially, A and B are
discharged to GND, and n1 and n2 are equalized. The receiver is enabled by a negative
pulse, which is generated from the handshaking signals. If d is low, the current drive of
P3 is the same as that of P4, while the current drive of P1 is larger than that of P2.
Consequently, B and A are pulled high and low, respectively, by the cross-coupled
inverter pair. The opposite transition is triggered if d is high. The following static flip-flop
retains the data value even after the sense amplifier is reinitialized. The low-swing
signaling reduces the interconnect energy by a factor of 3.4 compared to a full-swing
CMOS implementation.
Results and Data Measurements
The overall chip characteristics are summarized in Table 1. Table 2 shows the
performance of the different chip components (based on a per-block analysis). The energy
dissipation of the processor when programmed for a VCELP voice coder (with 1.8 mW
total power consumption) is presented in Table 3, including a breakdown of the energy
over the major functions. Dominant kernels are directly mapped onto hardware satellites,
and their run-time reconfiguration is performed by the ARM8 core. Therefore, the kernel
energy presented in the table incorporates contributions from both satellite and ARM8
configuration. The program-control part of the algorithm is mapped completely to
software. The total measured energy efficiency is a factor of 8 better than the best
reported in the literature [4].
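
The figures in Table 3 can be cross-checked with a few lines of arithmetic. The sketch below copies the per-function energies from Table 3 and the 40 MHz average rate from Table 1; the variable names are illustrative, and the per-cycle figure is a derived average rather than a measured value.

```python
# Energy for 1 s of VCELP speech processing, in mJ, copied from Table 3.
kernels_mj = {
    "dot product": 0.738,
    "FIR filter": 0.131,
    "IIR filter": 0.021,
    "vector sum with scalar multiply": 0.042,
    "compute code": 0.011,
    "covariance matrix compute": 0.006,
}
program_control_mj = 0.838

kernel_total_mj = sum(kernels_mj.values())
total_mj = kernel_total_mj + program_control_mj
average_power_mw = total_mj / 1.0  # mJ spent over 1 s of speech equals mW
energy_per_cycle_pj = average_power_mw * 1e-3 / 40e6 * 1e12  # at 40 MHz

print(f"kernels {kernel_total_mj:.3f} mJ, total {total_mj:.3f} mJ")
print(f"average power {average_power_mw:.3f} mW")
print(f"about {energy_per_cycle_pj:.0f} pJ per cycle at 40 MHz")
```

The total of 1.787 mJ per second of speech matches the quoted 1.8 mW, with program control accounting for roughly 47% of it.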
Acknowledgments
The research was funded by DARPA ACS and the California MICRO program. The
support from Philips, Atmel, and Conexant is greatly appreciated. The authors also wish
to thank SGS-Thomson for providing fabrication facilities for the integrated circuits.
References
[1] Arthur Abnous and Jan Rabaey, “Ultra-Low-Power Domain-Specific Multimedia Processors”, IEEE
VLSI Signal Processing Workshop, October 1996.
[2] Tom Burd et al., “A Dynamic Voltage Scaled Microprocessor System”, submitted to ISSCC 2000.
[3] Varghese George et al., “The Design of a Low-Energy FPGA”, Proceedings of ISLPED99, Aug. 1999.
[4] Wai Lee et al., “A 1V DSP for Wireless Communication”, Digest of Technical Papers of ISSCC 97.
Technology                   0.25 µm 6-level metal CMOS
Main Supply Voltage          1 V
Additional Voltages          0.4 V, 1.5 V
Die Size                     5.2 mm x 6.7 mm
Transistor Count             1.2 million transistors
Average Cycle Speed          40 MHz
Average Power Dissipation    1.5 - 2 mW
Table 1: Chip Characteristics

Hardware module         Pipeline speed (ns)   Energy per operation (pJ)   Area (mm²)
MAC                     24                    21                          0.25
ALU                     20                    8                           0.09
Memory (1K x 16)        14                    8                           0.32
Memory (512 x 16)       11                    7                           0.16
Address generator       20                    6                           0.12
Interconnect network    10                    1*                          NA
FPGA                    25                    18**                        2.76
Table 2: Performance of hardware modules
*This number is the average energy consumption per connection
**This number is the average energy consumption across various arithmetic functions

Functionality                                 Energy consumption (mJ) for 1 sec
                                              of VCELP speech processing
Kernels
  Dot product                                 0.738
  FIR filter                                  0.131
  IIR filter                                  0.021
  Vector sum with scalar multiply             0.042
  Compute code                                0.011
  Covariance matrix compute                   0.006
Program control                               0.838
Total                                         1.787
Table 3: VCELP energy consumption breakdown among dominant kernels and program control
Figure 1: Heterogeneous Reconfigurable Processor Architecture
[Blocks shown: microprocessor, configuration bus, reconfigurable interconnect, and satellite processors (configurable logic, embedded memory, address generator, two arithmetic co-processors).]

Figure 2: Mapping a computational kernel on an array of satellite processors.
[Kernel shown:
    for (i = 1; i <= length; i++) {
        for (k = i; k <= length; k++) {
            phi[i][k] = phi[i-1][k-1] + in[NP-i]*in[NP-k] - in[NA-1-i]*in[NA-1-k];
        }
    }
mapped onto two multipliers (MPY), an add/subtract unit (+/-), two address generators (AddrGen) with memories MEM:i and MEM:phi, and execution control.]
Figure 3: Floorplan of Prototype Processor
[Blocks shown: ARM interface, FPGA, MACs, ALUs, address generators (AG), memories (Mem1K, Mem512), and IO ports, connected by the level-1 and level-2 meshes through universal and hierarchical switchboxes.]

Figure 5: Data-driven globally-asynchronous locally-synchronous inter-processor communication.
(a) Globally asynchronous - locally synchronous signaling
(b) Control tokens differentiate and delineate data streams and data structures (scalar, vector, matrix)
[Signals shown in (a): In, Out, Req in, Req out, Clk, Enable, Done, and a delay element, between a processor module and the reconfigurable network. Legend in (b): regular data; data associated with an end-of-vector token.]
Figure 4: Heterogeneous Reconfigurable Processor Chip Microphotograph
[Labeled regions: ARM8 core, interface, FPGA, ALU, MAC, MEM, AGU, and interconnect network.]
Figure 6: Hierarchical Mesh Network and Switch Matrices
(a) Level 1 Mesh, with universal switchbox
(b) Level 2 Mesh, with hierarchical switchbox (only cross-mesh connections are shown)
[Both panels show the meshes connecting clusters.]

Figure 7: Pseudo-differential low-swing interconnect circuitry
(a) Circuit diagram
(b) Circuit waveforms
[Labels in (a): transistors P1-P7 and N1-N4; nodes A, B, n1, n2; signals in, d, out, clk, REF; supplies VDD and GND. Panel (b) shows in, d, out, clk, A, and B, with 0.4 V and 1 V signal levels.]