31
Deadlock Recovery in Asynchronous Networks on Chip in the Presence of Transient Faults Guangda Zhang, Jim Garside, Wei Song, Javier Navaridas, Zhiying Wang APT Group, School of Computer Science University of Manchester, UK 1 ASYNC 2015 – Mountain View, Silicon Valley, CA, USA, May 46, 2015

Deadlock Recovery in Asynchronous Networks on Chip …ee.usc.edu/.../2015/03/S6_P3_Deadlock...Zhang_PPT.pdf · Deadlock Recovery in Asynchronous Networks on Chip in the ... For deadlock

Embed Size (px)

Citation preview

Deadlock Recovery in Asynchronous Networks on Chip in the Presence of Transient Faults

Guangda Zhang, Jim Garside, Wei Song,Javier Navaridas, Zhiying Wang

APT Group, School of Computer ScienceUniversity of Manchester, UK

1ASYNC 2015 – Mountain View, Silicon Valley, CA, USA, May 4‐6, 2015

Asynchronous Networks on Chip (NoCs)

2

Router Router

Router Router

Asyn. domainIndependent Syn. domains

link

Syn. IP0 Syn. IP1

Syn. IP2 Syn. IP3

NI

NI

NI

NI

Network Interface

SpiNNaker MPSoC with 18 ARM cores

1. Multi-core era --- Network-on-chip (NoC) 2. Asynchronous Networks on Chip --- GALS system

A GALS system constructed by an asynchronous NoC

Deadlock in synchronous NoCs

3

1. Traditional deadlock: cyclic dependence

2. Deadlock avoidance: turn models, virtual channels3. Deadlock recovery: deadlock or congestion?

(0,0) (0,1)

(1,0) (1,1)

A B

CD

Deadlock in asynchronous NoCs

Deadlocked QDI pipeline caused by permanent faults:

data

ackdo

doa

di

dia

data

ackdo

doa

di

dia

Pre-fault Post-fault

1. Traditional deadlock2. Permanent faults*

*G. Zhang, etc., “An asynchronous SDM network-on-chip tolerating permanent faults,” ASYNC 2014

Transient faults?

4

Transient faults

5

Sources of Transient faults Positive/negative

positive fault

negative fault

0‐‘1’‐0

1‐‘0’‐1transient faults

crosstalk

power supply noise

 electromagnetic interference 

electrostatic discharge

 alpha particles

cosmic rays

transient errors (soft errors)

radiation

Single Event Transient (SET)

Single Event Upset (SEU)

Impact of transient faults on QDI pipelines(4-phase 1-of-4 pipeline)

6

1. Symbol corruption*

2. Symbol insertion*

3. Deadlock ?

1001 0000 0100

0001 01000000 0100 0000

*G. Zhang, etc, “Protecting QDI interconnects from transient faults using delay-insensitive redundant check codes,” Microprocessors and Microsystems, 2014

Why this work is important?

7

Deadlock ?• Usual network deadlock (network/data link layer)• Congestion• Deadlock due to permanent faults (physical layer)• Deadlock due to transient faults (physical layer)

1. Differentiate all deadlock types2. All deadlock detection could use a common time-out

mechanism3. A fine-grained recovery mechanism

(system reboot? NO!!!)

Modelling QDI Asynchronous NoCs

(0,0) (0,1) (0,2) (0,3)

(1,0)

(2,0)

(3,0)

y

x

(x,y)

RT RTlink

data

ack Stagedi

dia

output buf. input buf.

Stage

Stage

Stage...

...

...

... Stage

Stage do

doa

8

Deadlocked QDI pipeline

9

Deadlock: In a deadlocked QDI pipeline, no transitions could be fired any more and the pipeline gets stuck at a “stable” state.

Deadlock state: ( A==1, B==0) or (A==0, B==1)

A

B

Q

0

1

0/1

Stage i

i0

in‐1

Stage i+1acki acki+1

Ai Ai+1

CD

Single-word 4-phase 1-of-n pipeline

10

Theorem 1: A transient fault can cause symbol corruption and insertion, but NOT deadlock.

1-of-4 word:

0 0 1 0

Active wire (A)

Victim region

Deadlock of a multi-word pipeline

11

i1,0

i1,n‐1

Stage i+1

ackiacki+1

iN,0

iN,n‐1

... ...

... ...

... ...

......

......

...

Stage i

victim region

Ai,1 A1+1,1

Ai,N Ai+1,N

slice1

slice N

Deadlock of a multi-word pipeline (1/7)

12

i1,0

i1,n‐1

Stage i+1

ackiacki+1

iN,0

iN,n‐1

... ...

... ...

... ...

......

......

...

Stage i

victim region

0000

0001 0001

0000

00 0

0000

0001

… … …

Delayed data

Deadlock of a multi-word pipeline (2/7)

13

i1,0

i1,n‐1

Stage i+1

ackiacki+1

iN,0

iN,n‐1

... ...

... ...

... ...

......

......

...

Stage i

victim region

0000

0001 0001

0100

0 00

01

0000

0001

… … …

Deadlock of a multi-word pipeline (3/7)

14

i1,0

i1,n‐1

Stage i+1

ackiacki+1

iN,0

iN,n‐1

... ...

... ...

... ...

......

......

...

Stage i

victim region

0000

0001 0001

0100

0 00

01

0001

0100

… … …

Deadlock of a multi-word pipeline (4/7)

15

i1,0

i1,n‐1

Stage i+1

ackiacki+1

iN,0

iN,n‐1

... ...

... ...

... ...

......

......

...

Stage i

victim region

0000

0001 0001

0000

0 00

010

0001

0100

… … …

Deadlock of a multi-word pipeline (5/7)

16

i1,0

i1,n‐1

Stage i+1

ackiacki+1

iN,0

iN,n‐1

... ...

... ...

... ...

......

......

...

Stage i

victim region

0000

0001 0001

1 10

0100

0001

0000

010

… … …

Deadlock of a multi-word pipeline (6/7)

17

i1,0

i1,n‐1

Stage i+1

ackiacki+1

iN,0

iN,n‐1

... ...

... ...

... ...

......

......

...

Stage i

victim region

0000

0001 0001

1 10

0000

0001

0000

010

… … …

Deadlock of a multi-word pipeline (7/7)

i1,0

i1,n‐1

Stage i+1

ackiacki+1

iN,0

iN,n‐1

... ...

... ...

... ...

......

......

...

Stage i

victim region

0100

0001 0001

1 10

0000

0001

0000

010

… … …Ai,1 A1+1,1

Ai,N Ai+1,N

18

Deadlock pattern comparison

({Ai,1,…, Ai,N} , acki ,{Ai+1,1,…, Ai+1,N}, acki+1 ) = ({1,…,1}, 0, {0,1,…,1}, 1)almost fullcomplete

Positive transient fault: Post-fault

i1,0

i1,n‐1

Stage i+1

ackiacki+1

iN,0

iN,n‐1

... ...

... ...

...

...

......

......

...Stage i

victim region

0100

0001 0001

110

0000

0001

0000

010

… … …

Ai,1 A1+1,1

Ai,N Ai+1,N

Pre-fault

19

Deadlock pattern comparison (transient or permanent)

& ack‐ & ack+

& ack+ & ack‐Transient

Permanent

spacer

almost_full

almost_empty

complete data

G. Zhang, etc., “An asynchronous SDM network-on-chip tolerating permanent faults,” ASYNC 2014

*

*

ft_type ={(almost_full & !ack) | (almost_empty & ack);

((almost_full & ack) | almost_empty & !ack)}

00: default; 01:transient; 10: permanent; 11: invalid

20

A general deadlock detection architecture

21

Stage

Stage

Stage

Stage

Stage

Deadlockdetection

Deadlockdetection

Deadlockdetection

Pipeline piece Pipeline piece

SYNC.

ASYNC.

Deadlock detection

22

• Usual network deadlock/congestion

• Deadlock due to permanent/transient faults

1. No transitions (a time-out)

2. All pipeline stages downstream of the fault have the

same ack while the ack signals in stages upstream

of fault are alternately valued (deadlock by faults)

3. ft_type ={(almost_full & !ack) | (almost_empty & ack);

((almost_full & ack) | almost_empty & !ack)}

Network configuration

23

SouthWestNorthEastLocal

......

Crossbar SouthWestNorthEastLocal

Switch Allocator

BufferBuffer

data

eop

addr.datadata ...

headbodybodytail tail

...ack

...

...

• 2D mesh QDI NoC • 4-phase 1-of-4• Wormhole & XY-DOR

Deadlock detection

24

data

ack Stagedi

dia

output buf. input buf.

Stage

Stage

Stage...

...

...

... Stage

Stage do

doa

Deadlock detection

Patternchecker

Protected region

clock

almost_empty

cd1  cd2 cd3 cd4 cd5 cd6 cd7 cd8

almost_full

cd1  cd2 cd3 cd4 cd5 cd6 cd7 cd8 Idle

Start

timeout

Enquiry

TF_Confirm(transient)

timeout

timeouttimeout

DK_Confirm(permenent)

Permanent fault?

No

Yes

timeout

Deadlock detection

25

Idle

Start

timeout

Enquiry

TF_Confirm(transient)

timeout

timeouttimeout

DK_Confirm(permenent)

Permanent fault?

No

Yes

timeoutIdle

start

Enquiry

Deadlock detection

26

Idle

Start

timeout

Enquiry

TF_Confirm(transient)

timeout

timeouttimeout

DK_Confirm(permenent)

Permanent fault?

No

Yes

timeout

DK_Confirm(permanent)

Deadlock confirmed!!

Deadlock recovery

27

1. Permanent fault?Spatial division Multiplexing (SDM) to divide each link into physically separated sub-links

• Block the defective sub-link• Reconfigure the switch allocator• Drain the flits polluted by the faults from the network

*G. Zhang, etc., “An asynchronous SDM network-on-chip tolerating permanent faults,” ASYNC 2014

Deadlock recovery

28

Idle

Start

timeout

Enquiry

TF_Confirm(transient)

timeout

timeouttimeout

DK_Confirm(permenent)

Permanent fault?

No

Yes

timeoutTF_Confirm(transient)

Resume the blocked sub-link to use

Experimental Results

UMC 130nm standard cell library

Synchronous IP cores (SystemC) + post-synthesis routers

4×4 mesh 4-phase 1-of-4 SDM NoC

packet size: 64 bytes; Local clock: 100MHz; Time-out: 1.5MHz

29

Original Protected OverheadArea (um2) 63446 72394 14.1%

Throughput (Mbyte/s/node) 693 648 ‐6.5%

Energy (pJ/Byte) 3.7 4.3 16%

Conclusion

If the time difference between two slowest sub‐pipelines are longer than the loop latency, a transient fault at the slowest sub‐pipeline could cause deadlock.

The patterns of the deadlock caused by transient faults, congestion and the usual deadlock are different, which can be used to detect the deadlock and tell its kind.

For deadlock caused by transient faults, a fine‐grained recovery mechanism is proposed to recovery the network to avoid expensive system reboot.

30

Thanks for your listening

Questions?

Guangda ZhangUniversity of Manchester

31