Credes Report Fan

8/10/2019 Credes Report Fan


IHP, Im Technologiepark 25, 15236 Frankfurt (Oder), Germany, www.ihp-microelectronics.com, 2010 - All rights reserved

Pausible Clocking Based GALS Design:

Analysis, Optimization and Applications

Xin FAN

fan@ihp-microelectronics.com




Outline

Overview of GALS design methodology

Performance analysis of pausible clocking based GALS data link

System optimization for area/power/noise efficient GALS design

Moonrake chip: SYNC/GALS OFDM TX in IFX 40nm technology

Conclusions




Overview of GALS design methodology

What is GALS design?

Globally-asynchronous locally-synchronous (GALS) design for large-scale digital system integration.

Processing is performed by synchronous functional modules;

Communication is accomplished by asynchronous interfaces.

[Figure: two synchronous cores (logic clocked by Clock A and Clock B), each wrapped by asynchronous interfaces (AsyncIF) that exchange req/ack handshakes and data.]




Overview of GALS design methodology (Cont)

Why do we need GALS design?

Relaxing the timing constraints at the system level

GALS design requires no global clock reference, only local clocks.

Each locally-timed compact GALS block can be optimized much more efficiently and aggressively, leading to lower power and area overheads with better timing performance.

The simplified clock trees also contribute to power/area savings at the system level.

Reducing simultaneous switching noise of digital circuits

The switching activity in a GALS design is naturally randomized and spread over time, resulting in lower switching noise.

Facilitating system integration based on modular design

GALS design provides an infrastructure for dynamic voltage and frequency scaling (DVFS) and SoC/NoC integration.





How to implement a GALS system?

The main issue is the design of robust interface circuits with low overhead.

Robust means resolving metastability with an acceptable mean time between failures (MTBF).

Two aspects of GALS overhead:

A. Hardware overhead: power and area;

B. Performance overhead: arbitration latency and throughput drop.

Three asynchronous communication schemes:

A. Synchronizer;

B. Dual-clock FIFO;

C. Pausible clocking.





Boundary synchronizer

Cascaded double-DFF:

One extra clock cycle is reserved for resolving metastability.

Simple but slow:

4-phase protocol: 6 TX cycles plus 6 RX cycles for each data transfer.

[Figure: boundary synchronizer between the tx_clock domain and the rx_clock domain: tx_data/rx_data registers with enables (EN), an FSM, cascaded D flip-flops on the req and ack lines, and the vld_in/vld_out signals.]
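The reason for reserving a full clock cycle for metastability resolution can be illustrated with the widely used exponential MTBF model, MTBF = exp(t_r / tau) / (T_w * f_clk * f_data). The sketch below is only an illustration under that assumed model; the function name and every numeric value (tau, T_w, clock and data rates) are hypothetical and are not taken from the slides.

```python
import math


def synchronizer_mtbf(t_resolve, tau, t_window, f_clk, f_data):
    """Estimate the MTBF (in seconds) of a flip-flop synchronizer.

    Assumed model: MTBF = exp(t_resolve / tau) / (t_window * f_clk * f_data)
      t_resolve: time allowed for metastability to resolve (s)
      tau:       regeneration time constant of the flip-flop (s)
      t_window:  metastability capture window (s)
      f_clk:     sampling clock frequency (Hz)
      f_data:    rate of asynchronous data transitions (Hz)
    """
    return math.exp(t_resolve / tau) / (t_window * f_clk * f_data)


# Hypothetical technology and traffic numbers, for illustration only.
F_CLK = 200e6        # 200 MHz receiver clock
F_DATA = 50e6        # 50 M transitions per second on the req line
TAU = 20e-12         # 20 ps regeneration time constant
T_WINDOW = 30e-12    # 30 ps capture window

one_cycle = 1.0 / F_CLK
mtbf_one_cycle = synchronizer_mtbf(one_cycle, TAU, T_WINDOW, F_CLK, F_DATA)
mtbf_half_cycle = synchronizer_mtbf(one_cycle / 2, TAU, T_WINDOW, F_CLK, F_DATA)

print(f"MTBF with one full cycle reserved: {mtbf_one_cycle:.3e} s")
print(f"MTBF with half a cycle reserved:   {mtbf_half_cycle:.3e} s")
```

Under this model the MTBF grows exponentially with the reserved resolution time, which is why the double-DFF synchronizer trades latency for robustness.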





Dual-clock FIFO

Data is written into the FIFO at the TX clock and read out from the FIFO at the RX clock.

Write and read pointers, instead of data, need to be synchronized across the clock boundary.

The FIFO has to be sufficiently large to avoid the throughput drop caused by write/read pointer synchronization.

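[Figure: dual-clock FIFO between the tx_clock domain and the rx_clock domain. The binary write pointer (bWrPtr) and read pointer (bRdPtr) each pass through a binary-to-Gray converter (B2G), a synchronizer (Sync) and a Gray-to-binary converter (G2B) before reaching the Empty logic and Full logic, which produce the empty and full flags.]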
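The B2G and G2B blocks in the pointer path convert between binary and Gray code, so that only one bit changes per pointer increment and a synchronizer never samples a multi-bit transition. A minimal sketch of the two conversions follows; the function names are illustrative and not taken from the slides.

```python
def bin_to_gray(value: int) -> int:
    """Convert a non-negative binary integer to its Gray-code equivalent."""
    return value ^ (value >> 1)


def gray_to_bin(gray: int) -> int:
    """Convert a Gray-code integer back to plain binary."""
    value = 0
    while gray:
        value ^= gray   # accumulate the XOR of all right-shifted copies
        gray >>= 1
    return value


DEPTH = 16  # hypothetical FIFO depth (power of two so wrap-around is also one bit)

for ptr in range(DEPTH):
    nxt = (ptr + 1) % DEPTH
    changed_bits = bin(bin_to_gray(ptr) ^ bin_to_gray(nxt)).count("1")
    print(f"{ptr:2d} -> gray {bin_to_gray(ptr):04b}, bits changed to next: {changed_bits}")
```

With a power-of-two depth, the wrap-around from the last pointer value to zero also changes a single bit, which is the property the synchronizer relies on.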





Pausible clocking scheme

The local clock can be re-scheduled (paused and stretched), when necessary, to avoid metastability at data sampling;

The data transfer is initiated by the synchronous TX/RX cores through output/input flow control logic.

The communication between TX and RX is performed over asynchronous handshaking channels;

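[Figure: pausible clocking data link. TX side: TX CORE, OUT_FLOW_CNTR, OUT_REG, output port controller (OPC) and TX PAUSIBLE CLOCK, with signals op_te, op_ta, op_req, op_ack, op_ri, op_ai, op_gi and tx_clk. RX side: input port controller (IPC), RX PAUSIBLE CLOCK, IN_FLOW_CNTR, IN_REG, SYNC_REG and RX CORE, with signals ip_req, ip_ack, ip_ri, ip_ai, ip_gi, ip_te, ip_ta and rx_clk. The two sides are connected by handshake signals.]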





Pausible clock generator

The clock is generated based on a programmable ring oscillator;

A C-element is inserted to gate the incoming clock rising edge;

An array of MUTEX is used as arbiter of concurrent requests.

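[Figure: pausible clock generator: a programmable delay line in a ring with a C-element (inputs A and B, output Y), and MUTEX 0 and MUTEX 1 arbitrating Req0/Ack0 and Req1/Ack1 against the clock (RClk, LClk); insets show the MUTEX and the C-ELEMENT.]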
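The gating behaviour of the C-element can be sketched with a small behavioural model: the output follows the inputs only when they agree and otherwise holds its previous value. This is the standard Muller C-element behaviour; the class name and the example input sequence are illustrative assumptions, not the circuit from the slides.

```python
class CElement:
    """Behavioural model of a two-input Muller C-element."""

    def __init__(self, initial: int = 0):
        self.y = initial  # current output value

    def update(self, a: int, b: int) -> int:
        """Apply inputs a and b and return the resulting output."""
        if a == b:
            self.y = a    # both inputs agree: output follows them
        return self.y     # inputs differ: output holds its state


# Hypothetical sequence: input a is the ring-oscillator edge, input b is the
# "no request pending" signal from the arbiters.
gate = CElement()
for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
    print(f"a={a} b={b} -> y={gate.update(a, b)}")
```

In the sequence above, the rising edge on a is held back until b also rises, which is how a pending request can stretch the clock.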





Asynchronous FSM

All state transitions are triggered by events on the inputs and on the fed-back outputs.

Behavior is described using a signal transition graph (STG).

Simple and fast.

Sensitive to glitches.

Requires a dedicated synthesis tool: Petrify.

[Figure: asynchronous I/O port controllers, showing the signal transition graph of the input port controller over the signals ip_req, ip_ack, ip_te, ip_ta, ip_ri, ip_ai, ip_rp and ip_ap.]




Performance analysis of GALS data link

Data synchronization latency

[Figure: waveforms of lClk, rClk, the request ri, the acknowledge ai and the grant gi at the MUTEX, marking the request arrival time t, the acknowledge window w and the clock period T.]

Acknowledge window w

If the request arrives during the off-phase of rClk, it can be acknowledged immediately by the MUTEX, and the data will be sampled at the current rising edge of the clock;

If the request arrives during the on-phase of rClk, it cannot be granted by the MUTEX until rClk goes low, and the data will be sampled at the next rising edge of the clock.




Performance analysis of GALS data link (Cont)

Synchronization latency function L(t, w)

$$
L(t, w) =
\begin{cases}
T - t, & t \in [0, w); \\
3T/2 - w, & t \in [w, w]; \\
2T - t, & t \in (w, T).
\end{cases}
$$

$$
T - w < L(t, w) < 2T - w
$$

[Figure: L/T plotted against t over (0, T) for w = T/4, w = T/2 and w = 3T/4, together with lClk, rClk, ri and data waveforms marking the latencies T - t and 2T - t.]





Average synchronization latency L_AVG

We are not interested in the synchronization latency of one particular incoming data item, but in the average synchronization latency, L_AVG, over a large number of requests.

The value of L_AVG is determined by: (1) the synchronization latency function L(t, w) for any t, and (2) the distribution of t in a data link.

For example, assuming a uniform distribution of t over one clock period, the average latency due to data synchronization follows by integrating L(t, w) over [0, T):

$$
L_{AVG} = \frac{1}{T}\left[\int_0^w (T - t)\,dt + \int_w^T (2T - t)\,dt\right] = \frac{3T}{2} - w
$$

For a uniform distribution of t, the average latency of data synchronization is determined by the relative width of the acknowledge window to the clock period, w/T.

If w/T = 1/2, then L_AVG = T; if w/T > 1/2, then L_AVG < T.
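The uniform-distribution case can be checked numerically. The sketch below encodes the piecewise latency function L(t, w) and averages it over evenly spaced arrival times; the function names, the midpoint sampling and the example period are assumptions made for illustration. Integrating the piecewise function by hand gives 3T/2 - w, which the code uses as a reference value.

```python
def sync_latency(t: float, w: float, T: float) -> float:
    """Piecewise synchronization latency L(t, w) for arrival time t in [0, T)."""
    if not 0 <= t < T:
        raise ValueError("t must lie in [0, T)")
    if t < w:
        return T - t            # request falls inside the acknowledge window
    if t == w:
        return 1.5 * T - w      # boundary case between the two branches
    return 2 * T - t            # request waits for the next window


def average_latency(w: float, T: float, samples: int = 100_000) -> float:
    """Average L(t, w) over arrival times uniformly spread across one period."""
    total = 0.0
    for k in range(samples):
        t = (k + 0.5) * T / samples   # midpoint of the k-th sub-interval
        total += sync_latency(t, w, T)
    return total / samples


T = 10.0  # hypothetical clock period in ns
for ratio in (0.25, 0.5, 0.75):
    w = ratio * T
    print(f"w/T={ratio}: numeric={average_latency(w, T):.4f}, closed form={1.5 * T - w:.4f}")
```

For w/T = 1/2 the numeric average comes out at one clock period, and it drops below one period as the acknowledge window widens.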





Data throughput

Another important issue is the maximum data throughput that can be achieved by an asynchronous handshaking data link.

In particular, if TX and RX both support burst-mode data transfer (one data item per cycle), what is the throughput of the data link?

Previous studies reported that the data throughput of a pausible clocking based GALS data link reaches at most 0.5 (one data item every other RX cycle).

Why?

Only experimental results were given, without any analysis.





The throughput is determined not by T_RX or T_TX, but by the period of the handshaking loop, T_Loop, of the asynchronous link.

In a tightly coupled data link, the transition on ap is always triggered by rx_clk+.

This ap is then sampled by tx_clk+ with a synchronization latency of L(t, w).

After synchronization, ap triggers the next transition on rp.

The arrival time t of the next rp is exactly the synchronization latency of TX, which satisfies T_TX - w_TX + d_w < t < 2T_TX - w_TX + d_w.

Case I. 2T_TX - w_TX + d_w < w_RX:

T_Loop = T_RX, and Th = T_RX/T_Loop = 1.

[Figure: Case I timing on rx_clk, marking ip_ap, ip_rpmin, ip_rpmax, w_RX, T_TX - w_TX + d_w, 2T_TX - w_TX + d_w, T_RX and T_Loop = T_RX.]





Case II. T_TX - w_TX + d_w < w_RX < 2T_TX - w_TX + d_w < T_RX + w_RX:

T_Loop = T_RX when 0 < t





Case III. w_RX < T_TX - w_TX + d_w < 2T_TX - w_TX + d_w < T_TX + w_TX:

T_Loop = 2T_RX and Th = 0.5.

[Figure: Case III timing on rx_clk, marking ip_ap, ip_rpmin, ip_rpmax, w_RX, T_TX - w_TX + d_w, 2T_TX - w_TX + d_w, T_RX and T_Loop = 2T_RX.]

T_Loop is not a linearly increasing function of the clock ratio R = T_TX/T_RX. There are critical thresholds of R, which depend on w_RX, w_TX and d_w.

To improve the throughput of the tightly coupled asynchronous data link, w_RX and w_TX need to be maximized and d_w should be minimized.





Throughput of tightly coupled data link

Max error < 15%, average error < 4%.

[Figure: throughput comparison between simulation and analysis. Data transfers per cycle (0.3 to 1) versus clock ratio T_TX/T_RX (1.49 down to 0.27); curves: simulated and estimated at d_w = 2.0 ns, simulated and estimated at d_w = 0.5 ns.]





Improving throughput by extending acknowledge window

[Figure: estimated throughput at different acknowledge windows. Data transfers per cycle (0.5 to 1) versus clock ratio (1.17 down to 0.30); curves: w_RX = T_RX/2 with w_TX = T_TX/2, w_RX = 3T_RX/4 with w_TX = T_TX/2, and w_RX = 3T_RX/4 with w_TX = 3T_TX/4; annotations at 1/3, 1/2 and 3/5; max throughput < 0.7.]





Loosely coupled asynchronous data link

By introducing concurrency in the IPC, the handshaking loop of the data link is partially decoupled.

Here, ap is asserted by the IPC as soon as rp is acknowledged by the MUTEX.

Therefore, the transitions in the OPC are partially concurrent with those in the IPC.

In this way, the period of the handshaking loop is reduced.

[Figure: signal transition graphs of the output port controller (op_te, op_req, op_ack, op_ri, op_ai, op_ta) and the input port controller (ip_ri, ip_ai, ip_ta, ip_req, ip_te, ip_ack) for the loosely coupled link.]





Now, the transition on ap is not triggered by rx_clk+, but is randomly distributed within (0, w_RX).

Each time it receives an ap transition, the OPC triggers the next rp within one TX clock cycle.

Therefore, the maximum arrival time of the next rp is (w_RX + T_TX).

[Figure: rx_clk timing showing w_RX, T_RX and T_TX, with markers labelled ip_opmin, ip_opmax and ip_rpmax.]

Condition for Th = 1: w_RX + T_TX < w_RX + T_RX, i.e. T_TX < T_RX.

Otherwise, T_Loop = T_TX and Th = T_RX/T_TX.
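The throughput rule for the loosely coupled link can be written as a short function. This is a minimal sketch of the two stated conditions only; the function name and the example clock periods are illustrative, and effects such as d_w or the acknowledge window width are deliberately left out.

```python
def loose_link_throughput(t_tx: float, t_rx: float) -> float:
    """Throughput (data items per RX cycle) of the loosely coupled link.

    Th = 1 when T_TX < T_RX; otherwise T_Loop = T_TX and Th = T_RX / T_TX.
    """
    if t_tx <= 0 or t_rx <= 0:
        raise ValueError("clock periods must be positive")
    if t_tx < t_rx:
        return 1.0
    return t_rx / t_tx


T_RX = 10.0  # hypothetical RX clock period in ns
for t_tx in (7.3, 10.0, 12.5, 14.6):
    th = loose_link_throughput(t_tx, T_RX)
    print(f"T_TX/T_RX = {t_tx / T_RX:.2f} -> Th = {th:.2f}")
```

In this model the throughput stays at one item per RX cycle whenever the transmitter is the faster side, and degrades in proportion to the clock ratio otherwise.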





Improving throughput by loosely coupled data link

[Figure: throughput comparison of loosely coupled data link. Data transfers per cycle (0.6 to 1) versus clock ratio (1.46 down to 0.73); curves: d_w = 0 ns with w_RX = T_RX/2, d_w = 2 ns with w_RX = T_RX/2, and d_w = 2 ns with w_RX = 3T_RX/4.]




Performance analysis of GALS data link

Design of loosely coupled asynchronous data link

Compared to the tightly coupled data link design, two stages of D-latches are used on the RX side to lock the input data, since TX could overwrite the output data before it is finally sampled into RX. This is the penalty of decoupling the handshaking loop.

[Figure: schematic of the loosely coupled asynchronous data link. TX side: tx_clk, tx_data, tx_data_latch, tx_te, tx_ta, tx_te_pending, the OPC and the TX PAUSIBLE CLOCK GENERATOR (op_ri, op_ai, op_gi). Link wires: op_rp/ip_rp and op_ap/ip_ap. RX side: the IPC, the RX PAUSIBLE CLOCK GENERATOR (ip_ri, ip_ai, ip_gi), latches producing rx_data_l, and rx_clk, rx_data, rx_te, rx_ta, rx_te_pending.]




System optimization by GALS design

GALS design for power saving

Simplify the on-chip clock tree distribution by GALS partitioning with averaged area occupation and clock fanout load.

Some evaluations on ASIC designs reported a 70% reduction in the power dissipation of clock networks.

Modeling of a GALS processor shows only marginal system power saving.

GALS design for EMI noise suppression

Partition the system according to the average power dissipation.

Introduce clock phase/frequency modulation to spread the switching activity of different GALS blocks efficiently over time/spectrum.




System optimization by GALS design (Cont)

GalsEmilator: modeling EMI in digital systems at a high level

A software tool in MATLAB to investigate EMI in digital systems with different structures and topologies

Programmable in:

Switching current profile

Clock jitter percentage

System topologies

Partitioning granularity





Supply current profile:

The supply current profile can be modeled as a triangular shape or as a superposition of different triangular shapes.

It is possible to describe up to five different supply current profiles and specify the probability of their appearance in the system.
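The triangular model can be sketched in a few lines. The code below is written in Python rather than MATLAB and is not the GalsEmilator implementation; the function names, the pulse parameters and the time units are assumptions chosen only to show how one triangle and a superposition of triangles would be evaluated.

```python
def triangle(t, start, peak_time, end, peak_current):
    """Current of one triangular pulse at time t (zero outside [start, end])."""
    if t <= start or t >= end:
        return 0.0
    if t <= peak_time:
        # rising edge from 0 at `start` to `peak_current` at `peak_time`
        return peak_current * (t - start) / (peak_time - start)
    # falling edge from `peak_current` at `peak_time` back to 0 at `end`
    return peak_current * (end - t) / (end - peak_time)


def supply_current(t, pulses):
    """Superposition of several triangular pulses at time t."""
    return sum(triangle(t, *pulse) for pulse in pulses)


# Hypothetical profile: (start, peak_time, end, peak_current), times in ns, current in mA.
profile = [
    (0.0, 1.0, 3.0, 2.0),   # main switching peak right after the clock edge
    (0.5, 1.5, 2.5, 1.0),   # smaller, delayed peak from a deeper logic level
]

for step in range(7):
    t = step * 0.5
    print(f"t={t:.1f} ns -> i={supply_current(t, profile):.3f} mA")
```

Several such profiles, each with an appearance probability, could then be drawn per clock cycle to build a current waveform for spectral analysis.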





Evaluated topologies of digital systems

(a) Pipelined, (b) Star, (c) Mesh

[Figure: the three evaluated topologies, each drawn with Module 1 to Module 4.]




System optimization of GALS design (Cont)

EMI features of the synchronous systems

clock jitter + clock phase shift





EMI features of the GALS systems

with different GALS granularity and frequency distribution





EMI comparison between the synchronous and GALS designs

Low-EMI synchronous: theoretically possible, practically difficult.





Example: a low-EMI 64-point pipelined FFT processor

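[Figure: block diagram of the 64-point pipelined FFT: butterfly stages BF1 (32), BF2 (16), BF3 (8), BF4 (4), BF5 (2) and BF6 (1), a complex multiplier (CMULT) with ROM, PIP and DOP port blocks between stages, and a pausible clock generator (Pausible Clock Gen 4).]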




Measurements of the core VDD spectrum in synchronous mode (a) and in low-EMI GALS mode (b)





Moonrake chip design and test

Top-level block diagram of the Moonrake chip

A synchronous OFDM baseband TX and its GALS counterpart were implemented on the same die, allowing an objective performance comparison in a homogeneous setting: identical in both function and process.

All the data pads were shared by the two TX cores to save area.

[Figure: top-level blocks: SYNC OFDM TX, GALS OFDM TX, JTAG, PRNG, PLL, CLK MUX, INPUT CNTR, OUTPUT CNTR and MISR.]




Moonrake chip design and test (Cont)

Datapath structure of the synchronous TX

The starting point of our work was the synchronous baseline TX. Its datapath was highly pipelined and parallelized to reach gigabit throughput: 12 symbol coding channels, 6 interleavers and four 64-point IFFTs.

[Figure: datapath of the synchronous TX: input FIFO, input control, universal scrambler, symbol mapping, middle control, FEC encoders 1 to 12, interleaver interface, interleavers 1 to 6, pilot inserter, subcarrier mappers 1 to 4, 64-point IFFTs 1 to 4, 4-point IFFT and output stage.]




Moonrake chip design and test (Cont)

Power/area estimation and GALS partitioning

GALS Block 1

| | Input controller | Symbol mapping | Universal scrambler | Middle controller | FEC encoder [12:1] | Output interface | Pilot insertion | Mapping [4:1] | Total |
| Power | 0.1% | 0.5% | 0.0% | 7.0% | 0.09% | 0.1% | 3.1% | 0.08% | 10.97% |
| Area | 0.1% | 1.0% | 0.0% | 12.8% | 0.06% | 0.1% | 5.1% | 0.14% | 19.3% |

GALS Blocks 2 to 4

| | Block 2: Interleave 1 | Interleave 2 | Total | Block 3: Interleave 3 | Interleave 4 | Total | Block 4: Interleave 5 | Interleave 6 | Total |
| Power | 8.7% | 8.7% | 17.4% | 8.7% | 8.7% | 17.4% | 8.7% | 8.7% | 17.4% |
| Area | 8.9% | 8.9% | 17.8% | 8.9% | 8.9% | 17.8% | 8.9% | 8.9% | 17.8% |

GALS Blocks 5 and 6, and post-synthesis OFDM TX

| | Block 5: FFT_64P 1 | FFT_64P 2 | FFT_64P 3 | FFT_64P 4 | Total | Block 6: FFT_4P | Out Stage | Total | Post-synth OFDM TX |
| Power | 4.9% | 4.3% | 4.3% | 4.3% | 17.8% | 11.3% | 7.2% | 18.5% | 240 mW |
| Area | 2.7% | 2.4% | 2.4% | 2.4% | 9.9% | 10.3% | 6.7% | 17% | 2.2 mm² |




Moonrake chip design and test (Cont)

GALS TX top-level block diagram

6 GALS blocks, 16 data links, 32 asynchronous I/O port controllers.

[Figure: GALS TX top level. GALS Block 1 (input data FIFO, input control, universal scrambler, symbol mapping, middle control, universal FEC encoder [12:1], interleaver interface, pilot inserter, mapper [4:1]), GALS Blocks 2 to 4 (interleavers [2:1], [4:3] and [6:5]), GALS Block 5 (64-point IFFT [4:1]) and GALS Block 6 (4-point IFFT, output stage), with pausible clock generators and P-IN/D-OUT ports on the data links.]




16M equivalent gates, 30% core logic;

218 memories: 8 FIFOs (64 Kb), 86 SROMs (192 Kb), 134 SRAMs (400 Kb);

219 pads: 136 TX/shared pads, 20 NoC dedicated pads, 63 power pads.

IFX 40-nm CMOS process;

4000 µm × 2250 µm = 9 mm²;

LBGA-345 package;

Bondlib 55 µm pitch.





Complexity of clock trees after layout

[Figure: bar chart of the number of clock tree levels (axis 0 to 30) for CLK_PLLO and GALS_CLK1 to GALS_CLK6.]

| | SYNC CLK | GALS CLK1 | GALS CLK2 | GALS CLK3 | GALS CLK4 | GALS CLK5 | GALS CLK6 |
| No. of levels | 27 | 10 | 6 | 7 | 5 | 9 | 8 |
| Max local skew | 10 ps | 3 ps | < 2 ps | < 2 ps | < 2 ps | < 2 ps | 3 ps |

1st pro of GALS design: simplified clock trees with better timing balance.





Cell area occupation after layout

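| Item | Cell area | Share of total |
| OFDM TX, GALS: core | 2220080 | |
| OFDM TX, GALS: clock gen & IO ports | included in core | |
| OFDM TX, GALS: total | 2220080 | 41% |
| OFDM TX, SYNC: core | 2234712 | |
| OFDM TX, SYNC: PLL | 100000 | |
| OFDM TX, SYNC: total | 2334712 | 43.2% |
| OFDM TX, others | 91916 | 1.7% |
| OFDM TX, total | 4643900 | 85.9% |
| NOC | 227374 | 4.2% |
| Pads | 537075 | 9.9% |
| Total | 5406853 | 100% |

[Figure: pie charts of the area breakdown, labelled 74.2%, 12.2%, 9.9% and 41%, 43.2%, 9.9%.]

2nd pro of GALS design: smaller area by more aggressive optimization.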





Power consumption after layout

| | IO | Memory | Clock | Logic | Total |
| SYNC TX | 0.0489 | 0.1731 | 0.0419 | 0.0255 | 0.2894 |
| SYNC TX share | 16.89% | 59.81% | 14.49% | 8.81% | 100% |
| GALS TX | 0.0488 | 0.1693 | 0.0316 | 0.0280 | 0.2777 |
| GALS TX share | 17.56% | 60.98% | 11.37% | 10.09% | 100% |

[Figure: power distribution over GALS clock domains LCLK 1 to LCLK 6 (axis 0 to 50), with bar labels 25.80, 35.60, 35.60, 35.60, 44.10 and 48.30.]

3rd pro of GALS design: > 20% saving in the clock tree dissipation; 6% saving in the system power dissipation.





EMI measurements

Spectrum of core VDD

Measurement points on the Moonrake Adapter Board: A. VDD_AE22; B. VDD_BOARD.

At fundamental frequency:

A. 26 dB attenuation on chip;

B. 19 dB attenuation on board.

[Figure: amplitude of the on-chip core VDD from the SYNC TX and from the GALS TX.]

4th pro of GALS design: attenuation of EMI noise on the on-chip core VDD.





Synchronous/GALS TX comparison

Area, power dissipation, and EMI noise

| | Area (1) (µm²) | Power dissipation (2) (mW) | Core VDD spectral amplitude (3) (dBm): 1st peak | 2nd peak | 3rd peak |
| SYNC TX | 2325823 | 252 | -15 | -32 | -23 |
| GALS TX | 2220080 | 237 | -41 | -48 | -53 |
| Difference | -5.0% | -6.0% | -26 dB | -16 dB | -30 dB |

Notes:

1. The area is estimated based on the layout netlist;

2. The power is measured when the chip is working at 160 MHz in both SYNC and GALS modes;

3. The spectrum is measured on the SMA socket which is connected to the on-chip power pad VDD_AE22.
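The power and spectral rows of the comparison can be reproduced with simple arithmetic. The sketch below recomputes the relative power difference and the per-peak dB differences from the tabulated values, and converts each dB difference into a linear power ratio using the standard relation ratio = 10^(dB/10); the variable names are illustrative.

```python
# Values copied from the comparison table.
sync_power_mw = 252
gals_power_mw = 237
sync_peaks_dbm = (-15, -32, -23)
gals_peaks_dbm = (-41, -48, -53)

# Relative change in measured power dissipation, in percent.
power_change_pct = (gals_power_mw - sync_power_mw) / sync_power_mw * 100

# A difference of two dBm readings is a ratio expressed in dB.
peak_diffs_db = [g - s for g, s in zip(gals_peaks_dbm, sync_peaks_dbm)]

# Linear power ratio corresponding to each attenuation.
power_ratios = [10 ** (-d / 10) for d in peak_diffs_db]

print(f"Power change: {power_change_pct:.1f}%")
for i, (d, r) in enumerate(zip(peak_diffs_db, power_ratios), start=1):
    print(f"Peak {i}: {d} dB, spectral power reduced by a factor of about {r:.0f}")
```

Read this way, the 26 dB attenuation at the first peak corresponds to roughly a 400-fold reduction in spectral power at that frequency.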





Conclusions

The pausible clocking scheme offers an alternative for area- and power-efficient GALS design;

The hardware overhead of introducing the pausible clocking scheme is negligible;

Balanced GALS partitioning results in a group of compact locally-timed blocks, which can be optimized much more efficiently and aggressively.

Therefore, the marginal hardware overhead due to the pausible clocking based GALS infrastructure can be fully compensated at the system level.

Also, with careful design optimization, the performance overhead due to asynchronous communication can be minimized;

Sub-cycle data synchronization latency can be achieved;

Decoupling of the handshaking loop contributes to high data throughput.

Behavioral modeling and silicon measurement both demonstrate the efficiency of GALS design for EMI-noise suppression.




Thank you!

For more information about IHP: www.ihp-microelectronics.com .

For more details about pausible clocking: www.galaxy-project.org .
