8/10/2019 Credes Report Fan
IHP Im Technologiepark 25 15236 Frankfurt (Oder) Germany www.ihp-microelectronics.com 2010 - All rights reserved
Pausible Clocking Based GALS Design:
Analysis, Optimization and Applications
Xin FAN
fan@ihp-microelectronics.com
Outline
Overview of GALS design methodology
Performance analysis of pausible clocking based GALS data link
System optimization for area/power/noise efficient GALS design
Moonrake chip: SYNC/GALS OFDM TX in IFX 40nm technology
Conclusions
Overview of GALS design methodology
What is GALS design?
Globally-asynchronous locally-synchronous design for large-scale
digital system integration.
Processing is performed by synchronous functional modules;
Communication is accomplished by asynchronous interfaces.
[Figure: two synchronous cores (Sync Core Logic), one driven by Clock A and the other by Clock B. Each core is wrapped by asynchronous interfaces (Async IF), which exchange data over req/ack handshake channels.]
Overview of GALS design methodology (Cont)
Why do we need GALS design?
Relaxing the timing constraints at the system level
GALS design requires no global clock reference, only local clocks.
Each locally-timed compact GALS block could be optimized much more efficiently and aggressively, leading to lower power and area overheads with better timing performance.
The simplified clock trees also contribute to the power/area saving at the system level.
Reducing simultaneous switching noise of digital circuits
The switching activity in GALS design is naturally randomized and spread over time, resulting in a lower switching noise.
Facilitating the system integration based on modular design
GALS design presents an infrastructure for dynamic voltage and frequency scaling (DVFS) and SoC/NoC integration.
Overview of GALS design methodology (Cont)
How to implement a GALS system?
The main issue is the design of robust interface circuits with low overhead.
Robust means resolving metastability at an acceptable mean time between failures (MTBF).
Two aspects of GALS overhead:
A. Hardware overhead: power and area;
B. Performance overhead: arbitration latency and throughput drop.
Three asynchronous communication schemes:
A. Synchronizer;
B. Dual-clock FIFO;
C. Pausible clocking.
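As a point of reference for what an "acceptable MTBF" involves, the commonly used first-order synchronizer model is reproduced here. The slides do not state it, and the symbols t_res, tau, T_W, f_clk and f_data are introduced only for illustration:

```latex
\mathrm{MTBF} = \frac{e^{\,t_{res}/\tau}}{T_W \, f_{clk} \, f_{data}}
```

In this model t_res is the time granted for metastability to resolve, tau is the regeneration time constant of the sampling flip-flop, T_W is its metastability window, and f_clk and f_data are the sampling-clock and data-transition rates. The exponential dependence on t_res is the reason a synchronizer trades latency for robustness.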
Overview of GALS design methodology (Cont)
Boundary synchronizer
Cascaded double-DFF:
One extra clock cycle is reserved for resolving metastability.
Simple but slow:
4-phase protocol: 6 TX cycles plus 6 RX cycles for each data transfer.
[Figure: boundary synchronizer between the tx_clock domain and the rx_clock domain. tx_data is passed to rx_data through enabled registers, while req and ack cross the clock boundary through cascaded D flip-flops under FSM control; vld_in and vld_out are the valid flags.]
Overview of GALS design methodology (Cont)
Dual-clock FIFO
Data is written into the FIFO at the TX clock and read out from the
FIFO at the RX clock.
Write and read pointers, instead of data, need to be synchronized
through the clock boundary.
The FIFO has to be sufficiently large to avoid the throughput drop
caused by write/read pointer synchronization.
[Figure: dual-clock FIFO between the tx_clock domain (tx_data, binary write pointer bWrPtr, full logic) and the rx_clock domain (rx_data, binary read pointer bRdPtr, empty logic). Each pointer crosses the clock boundary through a binary-to-Gray converter (B2G), a synchronizer (Sync) and a Gray-to-binary converter (G2B).]
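The B2G and G2B stages around the pointer synchronizers can be illustrated with a minimal Python sketch. It assumes the standard reflected Gray code; the function names and the 4-bit pointer width are hypothetical and not taken from the slides.

```python
def bin_to_gray(b: int) -> int:
    """Binary-to-Gray conversion (B2G): adjacent values differ in exactly one bit."""
    return b ^ (b >> 1)


def gray_to_bin(g: int) -> int:
    """Gray-to-binary conversion (G2B): XOR of all right-shifted copies of g."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    pointer_bits = 4  # assumed pointer width, for illustration only
    size = 1 << pointer_bits
    for ptr in range(size):
        nxt = (ptr + 1) % size
        # Only one bit changes per increment, including the wrap-around.
        assert hamming_distance(bin_to_gray(ptr), bin_to_gray(nxt)) == 1
        assert gray_to_bin(bin_to_gray(ptr)) == ptr
    print("all", size, "pointer values pass the single-bit-change check")
```

Because only one bit changes per increment, a pointer sampled in the other clock domain during a transition resolves to either the old or the new value, never to an unrelated one.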
Overview of GALS design methodology (Cont)
Pausible clocking scheme
The local clock can be re-scheduled (paused and stretched), when
necessary, to avoid metastability at data sampling;
The data transfer is initiated by the synchronous TX/RX cores through output/input flow control logic.
The communication between TX and RX is performed by asynchronous handshaking channels.
[Figure: pausible clocking data link. TX side: TX core, output flow control (OUT_FLOW_CNTR), output register (OUT_REG) and output port controller (OPC), which exchanges op_ri/op_ai with the TX pausible clock generator (tx_clk). RX side: input port controller (IPC), which exchanges ip_ri/ip_ai with the RX pausible clock generator (rx_clk), input flow control (IN_FLOW_CNTR), input register (IN_REG) and RX core. OPC and IPC are connected by the asynchronous handshake signals.]
Overview of GALS design methodology (Cont)
Pausible clock generator
The clock is generated based on a programmable ring oscillator;
A C-element is inserted to gate the incoming clock rising edge;
An array of MUTEX is used as arbiter of concurrent requests.
[Figure: pausible clock generator. A programmable delay line and a C-element (C-ELE, inputs A and B, output Y) form the ring oscillator that produces LClk; MUTEX 0 and MUTEX 1 arbitrate the requests Req0 and Req1 against RClk and return Ack0 and Ack1.]
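The role of the C-element in gating the clock edge can be seen from a small behavioural model. This is a sketch of the generic two-input Muller C-element, not of the specific cell used in the design; the class and method names are hypothetical.

```python
class CElement:
    """Behavioural model of a two-input Muller C-element.

    The output follows the inputs when they agree and holds its
    previous value when they differ.
    """

    def __init__(self, initial: int = 0) -> None:
        self.y = initial

    def update(self, a: int, b: int) -> int:
        if a == b:
            self.y = a  # both inputs agree: output switches to their value
        return self.y   # inputs disagree: output keeps its state


if __name__ == "__main__":
    c = CElement()
    # Input a stands for the delayed clock, input b for a grant that allows the edge.
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(f"a={a} b={b} -> y={c.update(a, b)}")
```

With one input held low the rising edge on the other input cannot propagate, which is the mechanism that lets the local clock be paused and stretched.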
Overview of GALS design methodology (Cont)
Asynchronous FSM
All the state transitions are triggered by events on the inputs and the fed-back outputs.
Behavior is described using the signal transition graph (STG).
Simple and fast.
Sensitive to glitches.
Needs a dedicated synthesis tool: Petrify.
[Figure: asynchronous I/O port controllers. Signal transition graph (STG) of the input port controller over the signals ip_req, ip_ack, ip_ri, ip_ai, ip_te and ip_ta, and its gate-level implementation with the ports ip_rp, ip_ap, ip_ri, ip_ai, ip_te and ip_ta.]
Performance analysis of GALS data link
Data synchronization latency
[Figure: timing of lClk and rClk with clock period T, acknowledge window w and request arrival time t; the MUTEX arbitrates between the request ri and rClk and issues ai and gi.]
Acknowledge window w
If the request arrives in the off-phase of rClk, it can be acknowledged immediately by the MUTEX, and the data will be sampled at the current rising edge of the clock.
If the request arrives in the on-phase of rClk, it cannot be granted by the MUTEX until rClk goes low, and the data will be sampled at the next rising edge of the clock.
Performance analysis of GALS data link (Cont)
Synchronization latency function L (t, w)
\[
L(t, w) =
\begin{cases}
T - t, & t \in [0, w); \\
3T/2 - w, & t = w; \\
2T - t, & t \in (w, T).
\end{cases}
\]
\[
T - w < L(t, w) < 2T - w
\]
[Figure: L/T versus t over one clock period for w = T/4, w = T/2 and w = 3T/4, and timing of lClk, rClk, ri and the data, which is sampled T - t or 2T - t after the request.]
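The piecewise latency function can be written down directly as a short Python sketch. The function name and the normalized sample points are hypothetical; the three branches follow the definition of L(t, w) on this slide, with t = w treated as the single boundary point.

```python
def sync_latency(t: float, w: float, T: float) -> float:
    """Synchronization latency L(t, w) for a request arriving at time t in [0, T)."""
    if not 0 < w < T:
        raise ValueError("acknowledge window must satisfy 0 < w < T")
    if not 0 <= t < T:
        raise ValueError("arrival time must satisfy 0 <= t < T")
    if t < w:
        return T - t        # request falls inside the acknowledge window
    if t == w:
        return 1.5 * T - w  # boundary point between the two branches
    return 2 * T - t        # request waits for the following clock edge


if __name__ == "__main__":
    T = 1.0  # normalized clock period
    for w in (0.25, 0.5, 0.75):
        samples = [sync_latency(k / 8, w, T) for k in range(8)]
        print(f"w = {w:.2f} T:", [f"{v:.3f}" for v in samples])
```

The printed rows correspond to the three L/T panels of the slide: the latency falls linearly inside the window and jumps by one clock period at t = w.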
Performance analysis of GALS data link (Cont)
Average synchronization latency $L_{AVG}$
We are not interested in the synchronization latency of a particular incoming data item, but in the average synchronization latency, $L_{AVG}$, over a large number of requests.
The value of $L_{AVG}$ is determined by: (1) the synchronization latency function L(t, w) for any t, and (2) the distribution of t in a data link.
For example, assuming a uniform distribution of t, the average latency due to data synchronization can be derived as $L_{AVG} = 3T/2 - w$.
For a uniform distribution of t, the average latency of data synchronization is therefore determined by the relative width of the acknowledge window to the clock period, w/T.
If w/T = 1/2, then $L_{AVG} = T$; if w/T > 1/2, then $L_{AVG} < T$.
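The average for a uniformly distributed t can be worked out by integrating the piecewise latency function L(t, w) over one clock period. The steps below are a reconstruction from that definition rather than a copy of the original slide; the single boundary point t = w does not contribute to the integral.

```latex
\begin{aligned}
L_{AVG} &= \frac{1}{T}\int_0^T L(t, w)\,dt
         = \frac{1}{T}\left[\int_0^w (T - t)\,dt + \int_w^T (2T - t)\,dt\right] \\
        &= \frac{1}{T}\left[\left(Tw - \frac{w^2}{2}\right)
           + \left(2T(T - w) - \frac{T^2 - w^2}{2}\right)\right] \\
        &= \frac{3}{2}T - w
\end{aligned}
```

Setting w = T/2 gives an average of exactly one clock period, and any wider acknowledge window brings the average below one period.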
Performance analysis of GALS data link (Cont)
Data throughput
Another important issue is the maximum data throughput which
could be achieved by an asynchronous handshaking data link.
In particular, if TX and RX both support burst-mode data transfer (one data item per cycle), what is the throughput of the data link?
Previous studies reported that the data throughput of a pausible clocking based GALS data link could reach at most 0.5 (one data item every other RX cycle).
Why?
Only experimental results were given, without any analysis.
Performance analysis of GALS data link (Cont)
The throughput is determined not by $T_{RX}$ or $T_{TX}$, but by the period of the handshaking loop, $T_{Loop}$, of the asynchronous link.
In a tightly coupled data link, the transition on ap is always triggered by rx_clk+.
This ap is then sampled by tx_clk+ with a synchronization latency of L(t, w).
After synchronization, the ap will further trigger the next transition on rp.
The arrival time t of the next rp is exactly the synchronization latency of TX, which satisfies $T_{TX} - w_{TX} + d_w < t < 2T_{TX} - w_{TX} + d_w$.
Case I. $2T_{TX} - w_{TX} + d_w < w_{RX}$:
$T_{Loop} = T_{RX}$, and $Th = T_{RX}/T_{Loop} = 1$.
[Figure: Case I timing of rx_clk, ip_ap and the earliest and latest ip_rp. Both arrival bounds, $T_{TX} - w_{TX} + d_w$ and $2T_{TX} - w_{TX} + d_w$, fall inside the acknowledge window $w_{RX}$, so $T_{Loop} = T_{RX}$.]
Performance analysis of GALS data link (Cont)
Case II. $T_{TX} - w_{TX} + d_w < w_{RX} < 2T_{TX} - w_{TX} + d_w < T_{RX} + w_{RX}$:
$T_{Loop} = T_{RX}$ when 0 < t
Performance analysis of GALS data link (Cont)
Case III. $w_{RX} < T_{TX} - w_{TX} + d_w < 2T_{TX} - w_{TX} + d_w < T_{RX} + w_{RX}$:
$T_{Loop} = 2T_{RX}$ and $Th = 0.5$.
[Figure: Case III timing of rx_clk, ip_ap and the earliest and latest ip_rp. Both arrival bounds, $T_{TX} - w_{TX} + d_w$ and $2T_{TX} - w_{TX} + d_w$, fall after the acknowledge window $w_{RX}$, so $T_{Loop} = 2T_{RX}$.]
$T_{Loop}$ is not a linearly increasing function of the clock ratio $R = T_{TX}/T_{RX}$. There are critical thresholds of R, which depend on $w_{RX}$, $w_{TX}$ and $d_w$.
To improve the throughput of the tightly coupled asynchronous data link, $w_{RX}$ and $w_{TX}$ need to be maximized and $d_w$ should be minimized.
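The case analysis for the tightly coupled link can be sketched as a small classifier. The function name and return format are hypothetical. Two assumptions are made and marked in the code: in Case II the loop period is taken to be either one or two RX periods, so only throughput bounds are returned, and in Case III the latest rp arrival is taken to be bounded by the end of the next acknowledge window, by analogy with Case II.

```python
def tight_throughput_bounds(T_tx, T_rx, w_tx, w_rx, d_w):
    """Classify a tightly coupled link and return (case, Th_min, Th_max)."""
    t_min = T_tx - w_tx + d_w      # earliest arrival of the next rp
    t_max = 2 * T_tx - w_tx + d_w  # latest arrival of the next rp
    next_window_end = T_rx + w_rx  # assumption: end of the following acknowledge window

    if t_max < w_rx:
        return ("I", 1.0, 1.0)     # every rp hits the current window
    if t_min < w_rx < t_max < next_window_end:
        # Assumption: the loop takes T_rx or 2*T_rx depending on t,
        # so the throughput lies between 0.5 and 1.
        return ("II", 0.5, 1.0)
    if w_rx < t_min and t_max < next_window_end:
        return ("III", 0.5, 0.5)   # every rp waits for the next window
    return ("outside", None, None) # not covered by the three cases (includes equalities)


if __name__ == "__main__":
    T_rx, w_rx = 10.0, 5.0  # illustrative values in ns, not taken from the slides
    for T_tx, w_tx, d_w in [(2.0, 1.0, 0.5), (4.0, 2.0, 0.5), (8.0, 4.0, 2.0)]:
        print(T_tx, tight_throughput_bounds(T_tx, T_rx, w_tx, w_rx, d_w))
```

Sweeping T_tx with fixed windows shows the step-like behaviour noted on the slide: the throughput changes at thresholds of the clock ratio rather than linearly.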
Performance analysis of GALS data link (Cont)
Throughput of tightly coupled data link
Max error < 15%, average error < 4%.
[Figure: throughput comparison between simulation and analysis. x-axis: clock ratio (Ttx/Trx), from 1.49 down to 0.27; y-axis: data transfers per cycle, 0.3 to 1; series: simulated and estimated at Dw = 2.0 ns, simulated and estimated at Dw = 0.5 ns.]
Performance analysis of GALS data link (Cont)
Improving throughput by extending acknowledge window
[Figure: estimated throughput at different acknowledge windows. x-axis: clock ratio (Ttx/Trx), from 1.17 down to 0.30; y-axis: data transfers per cycle, 0.5 to 1; series: Wrx = Trx/2 with Wtx = Ttx/2, Wrx = 3Trx/4 with Wtx = Ttx/2, Wrx = 3Trx/4 with Wtx = 3Ttx/4; annotations: 1/3, 1/2, 3/5, max throughput < 0.7.]
Performance analysis of GALS data link (Cont)
Loosely coupled asynchronous data link
By introducing concurrency in the IPC, the handshaking loop of data
link is partially decoupled.
Here, ap is asserted by the IPC once rp gets acknowledged by the MUTEX.
Therefore, the transitions in the OPC are partially concurrent with those in the IPC.
By this means, a reduction in the period of the handshaking loop can be achieved.
[Figure: signal transition graphs (STGs) of the OPC (op_te, op_req, op_ack, op_ri, op_ai, op_ta) and of the IPC (ip_te, ip_req, ip_ack, ip_ri, ip_ai, ip_ta) for the loosely coupled data link.]
Performance analysis of GALS data link (Cont)
Now, the transition of ap is not triggered by rx_clk+, but is randomly distributed within $(0, w_{RX})$.
Each time it receives an ap transition, the OPC will trigger the next rp in one TX clock cycle.
Therefore, the maximum arrival time of the next rp is $w_{RX} + T_{TX}$.
[Figure: timing of rx_clk with period $T_{RX}$ and acknowledge window $w_{RX}$, showing the earliest and latest ap transitions and the latest rp arrival, one $T_{TX}$ after the latest ap.]
Condition for $Th = 1$: $w_{RX} + T_{TX} < w_{RX} + T_{RX} \Leftrightarrow T_{TX} < T_{RX}$.
Otherwise, $T_{Loop} = T_{TX}$ and $Th = T_{RX}/T_{TX}$.
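The throughput rule for the loosely coupled link reduces to a one-line function. The sketch below uses a hypothetical function name and illustrative clock periods; it implements only the condition stated on the slide (Th = 1 when the TX period is shorter than the RX period, otherwise the ratio of the periods).

```python
def loose_throughput(T_tx: float, T_rx: float) -> float:
    """Throughput (data items per RX cycle) of the loosely coupled link."""
    if T_tx <= 0 or T_rx <= 0:
        raise ValueError("clock periods must be positive")
    if T_tx < T_rx:
        return 1.0       # TX keeps up: one data item per RX cycle
    return T_rx / T_tx   # loop period equals the TX period


if __name__ == "__main__":
    T_rx = 10.0  # illustrative RX period in ns
    for ratio in (1.46, 1.24, 1.02, 0.88, 0.73):
        T_tx = ratio * T_rx
        print(f"Ttx/Trx = {ratio:.2f} -> Th = {loose_throughput(T_tx, T_rx):.3f}")
```

Unlike the tightly coupled case, the result depends only on the clock ratio and no longer on the acknowledge windows or on d_w.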
Performance analysis of GALS data link (Cont)
Improving throughput by loosely coupled data link
[Figure: throughput comparison of loosely coupled data link. x-axis: clock ratio (Ttx/Trx), from 1.46 down to 0.73; y-axis: data transfers per cycle, 0.6 to 1; series: Dw = 0 ns with Wrx = Trx/2, Dw = 2 ns with Wrx = Trx/2, Dw = 2 ns with Wrx = 3Trx/4.]
Performance analysis of GALS data link
Design of loosely coupled asynchronous data link
Compared to the tightly coupled data link design, two stages of D-latches are used on the RX side to lock the input data, since TX could overwrite the output data before it is finally sampled into RX. This is the penalty of decoupling the handshaking loop.
[Figure: schematic of the loosely coupled data link. TX side: output port controller (OPC) with the TX pausible clock generator (tx_clk, op_ri, op_ai, op_gi) and the signals tx_data, tx_te, tx_ta and tx_te_pending. RX side: input port controller (IPC) with the RX pausible clock generator (rx_clk, ip_ri, ip_ai, ip_gi), the latched data rx_data_l ahead of rx_data, and the signals rx_te, rx_ta and rx_te_pending. The two sides are connected through op_rp/ip_rp and op_ap/ip_ap.]
System optimization by GALS design
GALS design for power saving
Simplify the on-chip clock tree distribution by GALS partitioning with
averaged area occupation and clock fanout load.
Some evaluations on ASIC designs were reported with 70% reduction
in the power dissipation of clock networks.
Modeling on GALS processor shows marginal system power saving.
GALS design for EMI noise suppression
Partition the system according to the average power dissipation.
Introduce clock phase/frequency modulation for efficiently spreading
the switching activity of different GALS blocks over time/spectrum.
System optimization by GALS design (Cont)
GalsEmilator: modeling EMI in digital systems at a high level
A software tool in MATLAB to investigate EMI in digital systems with different structures and topologies
Programmable in:
Switching current profile
Clock jitter percentage
System topologies
Partitioning granularity
System optimization by GALS design (Cont)
Supply current profile:
The supply current profile could be modeled as triangular shape
or as a superposition of different triangular shapes.
It is possible to describe up to five different supply current profiles and to specify the probability of their appearance in the system.
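The triangular current model can be illustrated with a few lines of Python. This is a sketch of the idea only, not code from GalsEmilator (which is a MATLAB tool); all function names, the tuple layouts and the numeric values are hypothetical.

```python
import random


def triangular_pulse(t, t_rise, t_fall, peak):
    """Current of one triangular pulse that starts at t = 0 and peaks at t_rise."""
    if t < 0 or t > t_rise + t_fall:
        return 0.0
    if t <= t_rise:
        return peak * t / t_rise
    return peak * (t_rise + t_fall - t) / t_fall


def supply_current(t, triangles):
    """Superposition of triangles, each given as (t_start, t_rise, t_fall, peak)."""
    return sum(triangular_pulse(t - s, r, f, p) for (s, r, f, p) in triangles)


def pick_profile(candidates, rng):
    """Draw one profile from a list of (probability, profile) pairs."""
    weights = [p for p, _ in candidates]
    profiles = [prof for _, prof in candidates]
    return rng.choices(profiles, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(1)
    # Two assumed profiles: a single triangle and a superposition of two triangles.
    single = [(0.0, 1.0, 2.0, 4.0)]
    double = [(0.0, 1.0, 2.0, 4.0), (0.5, 1.0, 1.0, 2.0)]
    profile = pick_profile([(0.7, single), (0.3, double)], rng)
    print([round(supply_current(k * 0.5, profile), 2) for k in range(8)])
```

Drawing a profile per clock cycle according to its probability and summing the contributions of all blocks would give a supply current waveform whose spectrum can then be inspected.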
System optimization by GALS design (Cont)
Evaluated topologies of digital systems
(a) Pipelined (b) Star (c) Mesh
[Figure: the three evaluated topologies, each built from four modules (Module 1 to Module 4).]
System optimization of GALS design (Cont)
EMI features of the synchronous systems
Clock jitter + clock phase shift
System optimization by GALS design (Cont)
EMI features of the GALS systems
With different GALS granularity and frequency distribution
System optimization by GALS design (Cont)
EMI comparison between the synchronous and GALS designs
Low-EMI synchronous design: theoretically possible, practically difficult.
System optimization by GALS design (Cont)
Example: a low-EMI 64-point pipelined FFT processor
[Figure: block diagram of the 64-point pipelined FFT processor, partitioned into four GALS blocks driven by Pausible Clock Gen 1 to 4. The butterfly stages BF1 (32), BF2 (16), BF3 (8), BF4 (4), BF5 (2) and BF6 (1), together with the CMULT and ROM blocks, are distributed over these GALS blocks, which are connected through PIP and DOP ports.]
System optimization by GALS design (Cont)
Measurements of the core VDD spectrum in synchronous mode (a) and in low-EMI GALS mode (b)
Moonrake chip: design and test
Top-level block diagram of the Moonrake chip
A synchronous OFDM baseband TX and the GALS counterpart were
implemented on the same die, allowing for an objective performance
comparison in a homogeneous setting: identical both in the function
and in the process.
All the data pads were shared by the two TX cores to save the area.
[Figure: top-level block diagram with the blocks SYNC OFDM TX, GALS OFDM TX, JTAG, PRNG, PLL, CLK MUX, INPUT CNTR, MISR and OUTPUT CNTR.]
Moonrake chip: design and test (Cont)
Datapath structure of the synchronous TX
The starting point of our work was the synchronous baseline TX. It was highly pipelined and parallelized in the datapath to reach gigabit throughput: 12 symbol coding channels, 6 interleavers and 4 64-point IFFTs.
[Figure: datapath of the synchronous TX: input FIFO, input control, universal scrambler, symbol mapping, middle control, FEC encoders 1 to 12, interleaver interface, interleavers 1 to 6, pilot inserter, subcarrier mappers 1 to 4, 64-point IFFTs 1 to 4, 4-point IFFT and output stage.]
Moonrake chip: design and test (Cont)
Power/area estimation and GALS partitioning

GALS Block 1           Power     Area
Input controller       0.1%      0.1%
Symbol mapping         0.5%      1.0%
Universal scrambler    0.0%      0.0%
Middle controller      7.0%      12.8%
FEC encoder [12:1]     0.09%     0.06%
Output interface       0.1%      0.1%
Pilot insertion        3.1%      5.1%
Mapping [4:1]          0.08%     0.14%
Total                  10.97%    19.3%

GALS Block 2           Power     Area
Interleave 1           8.7%      8.9%
Interleave 2           8.7%      8.9%
Total                  17.4%     17.8%

GALS Block 3           Power     Area
Interleave 3           8.7%      8.9%
Interleave 4           8.7%      8.9%
Total                  17.4%     17.8%

GALS Block 4           Power     Area
Interleave 5           8.7%      8.9%
Interleave 6           8.7%      8.9%
Total                  17.4%     17.8%

GALS Block 5           Power     Area
FFT_64P 1              4.9%      2.7%
FFT_64P 2              4.3%      2.4%
FFT_64P 3              4.3%      2.4%
FFT_64P 4              4.3%      2.4%
Total                  17.8%     9.9%

GALS Block 6           Power     Area
FFT_4P                 11.3%     10.3%
Out Stage              7.2%      6.7%
Total                  18.5%     17%

Post-synth OFDM TX     240 mW    2.2 mm²
Moonrake chip: design and test (Cont)
GALS TX top-level block diagram
6 GALS blocks, 16 data links, 32 asynchronous I/O port controllers.
[Figure: GALS TX top level. GALS Block 1 (input data FIFO, input control, universal scrambler, symbol mapping, middle control, FEC encoder [12:1], interleaver interface, pilot inserter, mapper [4:1]), GALS Blocks 2 to 4 (interleavers [2:1], [4:3] and [6:5]), GALS Block 5 (64-point IFFT [4:1]) and GALS Block 6 (4-point IFFT, output stage). Each block is driven by its own pausible clock generator (GEN 1 to GEN 6), and the blocks are connected through P-IN and D-OUT ports.]
Moonrake chip: design and test (Cont)
16M equivalent gates, 30% core logic;
218 memories: 8 FIFOs (64Kb), 86 SROMs (192Kb), 134 SRAMs (400Kb);
219 pads: 136 TX/shared pads, 20 NoC dedicated pads, 63 power pads.
IFX 40-nm CMOS process;
4000 µm × 2250 µm = 9 mm²;
LBGA-345 package;
Bondlib 55 µm pitch.
Moonrake chip: design and test (Cont)
Complexity of clock trees after layout
[Figure: bar chart of the number of clock tree levels (axis 0 to 30) for CLK_PLLO and GALS_CLK1 to GALS_CLK6.]

Clock          No. of levels   Max local skew
SYNC CLK       27              10 ps
GALS CLK1      10              3 ps
GALS CLK2      6               < 2 ps
GALS CLK3      7               < 2 ps
GALS CLK4      5               < 2 ps
GALS CLK5      9               < 2 ps
GALS CLK6      8               3 ps

1st pro of GALS design: simplified clock trees with better timing balance.
Moonrake chip: design and test (Cont)
Cell area occupation after layout

Item                                   Cell area           Share
OFDM TX, GALS: core                    2220080
OFDM TX, GALS: clock gen & IO ports    included in core
OFDM TX, GALS: total                   2220080             41%
OFDM TX, SYNC: core                    2234712
OFDM TX, SYNC: PLL                     100000
OFDM TX, SYNC: total                   2334712             43.2%
OFDM TX: others                        91916               1.7%
OFDM TX: total                         4643900             85.9%
NOC                                    227374              4.2%
Pads                                   537075              9.9%
Total                                  5406853             100%

[Figure: pie charts of the cell area distribution, with the labels 74.2%, 12.2%, 9.9% and 41%, 43.2%, 9.9%.]
2nd pro of GALS design: smaller area by more aggressive optimization.
Moonrake chip: design and test (Cont)
Power consumption after layout

            IO        Memory    Clock     Logic     Total
SYNC TX     0.0489    0.1731    0.0419    0.0255    0.2894
            16.89%    59.81%    14.49%    8.81%     100%
GALS TX     0.0488    0.1693    0.0316    0.0280    0.2777
            17.56%    60.98%    11.37%    10.09%    100%

[Figure: bar chart of the power distribution over the GALS clock domains LCLK 1 to LCLK 6 (axis 0 to 50), with the bar labels 25.80, 35.60, 35.60, 35.60, 44.10 and 48.30.]
3rd pro of GALS design: > 20% saving in the clock tree dissipation; 6% saving in the system power dissipation.
Moonrake chip: design and test (Cont)
EMI measurements
Spectrum of core VDD
[Figure: Moonrake adapter board with the measurement points A (VDD_AE22) and B (VDD_BOARD); amplitude of the on-chip core VDD from the SYNC TX and from the GALS TX.]
At the fundamental frequency:
A. 26 dB attenuation on chip;
B. 19 dB attenuation on board.
4th pro of GALS design: attenuation of EMI noise on the on-chip core VDD.
Moonrake chip: design and test (Cont)
Synchronous/GALS TX comparison
Area, power dissipation, and EMI noise

              Area (1)    Power dissipation (2)   Spectral amplitude of core VDD (3) (dBm)
              (µm²)       (mW)                    1st peak   2nd peak   3rd peak
SYNC TX       2325823     252                     -15        -32        -23
GALS TX       2220080     237                     -41        -48        -53
Difference    -5.0%       -6.0%                   -26 dB     -16 dB     -30 dB

Notes:
1. The area is estimated based on the layout netlist;
2. The power is measured when the chip is working at 160 MHz in both SYNC and GALS modes;
3. The spectrum is measured on the SMA socket which is connected to the on-chip power pad VDD_AE22.
Conclusions
The pausible clocking scheme presents an alternative for area- and power-efficient GALS design;
The hardware overhead of introducing the pausible clocking scheme is negligible;
Balanced GALS partitioning results in a group of compact locally-timed blocks, which can be optimized much more efficiently and aggressively.
Therefore, the marginal hardware overhead due to the pausible clocking based GALS infrastructure can be fully compensated at the system level.
Also, with careful design optimization, the performance overhead due to the asynchronous communication can be minimized;
Sub-cycle data synchronization latency can be achieved;
Decoupling of the handshaking loop contributes to high data throughput.
Behavioral modeling and silicon measurement both demonstrate the efficiency of GALS design for EMI-noise suppression.
Thank you!
For more information about IHP: www.ihp-microelectronics.com .
For more details about pausible clocking: www.galaxy-project.org .