A FAST AND ACCURATE MULTI-CYCLE SOFT ERROR RATE ESTIMATION APPROACH TO RESILIENT EMBEDDED SYSTEMS DESIGN

Mahdi Fazeli, Seyed Ghassem Miremadi, Hossein Asadi, Seyed Nematollah Ahmadian

Department of Computer Engineering, Sharif University of Technology, Tehran, IRAN

Presenter: Saman Aliari, University of Illinois at Urbana-Champaign

Page 1: Department of Computer Engineering Sharif University of Technology Tehran, IRAN

A FAST AND ACCURATE MULTI-CYCLE SOFT ERROR RATE ESTIMATION APPROACH TO RESILIENT EMBEDDED SYSTEMS DESIGN

Mahdi Fazeli, Seyed Ghassem Miremadi, Hossein Asadi, Seyed Nematollah Ahmadian

Department of Computer Engineering, Sharif University of Technology, Tehran, IRAN

Presenter: Saman Aliari, University of Illinois at Urbana-Champaign

Page 2

SPEECH OUTLINE

Soft Errors

SER Modeling in Multi-Cycle Operation

SER Modeling in Single Cycle Operation

Proposed SER Modeling in Multi Cycle Operation

Tool Overview

Experimental Results and Discussions

Conclusions

Page 3

WHAT IS A SOFT ERROR?

Transient faults due to radiation events

Bit flips: 1 → 0 or 0 → 1

Caused by alpha particles or neutrons

Affect memory, flip-flops, and combinational logic

[Figure: an energetic particle striking a microprocessor containing a Cache, Arithmetic & Logic Unit, RegFile, and Control Unit]

Page 4

EVIDENCE OF PARTICLE STRIKES

2000 [Forbes Magazine’00]: SUN Enterprise servers crash due to a cache problem

2001 [ITRS’01]: Soft errors as a major issue in chip design

2003 [EE Times’04]: Cisco routers fail due to soft errors

2004 [Xilinx.com]: Xilinx FPGAs highly sensitive to soft errors

2005 [Selse.org]: Soft error workshop (70% industry attendees)

2011 [ZeroSoft’06]: Expected 70% of chips to fail in a year

Page 5

MULTI-CYCLE SOFT ERROR PROPAGATION

[Figure: the same sequential circuit (nodes A, B, C, D, E, F, H, I, J, a D flip-flop, and primary output PO) drawn twice with logic values on each line, once per clock cycle; the first drawing carries the annotation "Erroneous Value is Captured"]

First Cycle: The SET does not propagate to the Primary Output (PO)

Second Cycle: The error propagates to the Primary Output (PO)

Page 6

SER MODELING IN SINGLE CYCLE

SER = Nominal FIT × Logic Derating × Timing Derating × Electrical Derating

Nominal FIT:

Occurrence rate of cosmic rays at the error site

Computed once for library characterization

[Figure: example circuit (signals A, B, C, D, E feeding a flip-flop FF clocked by clk) illustrating logical derating, timing derating, and electrical derating]

Page 7

LOGICAL DERATING MODELING

The main idea:

Traverse structural paths from the SEU site to POs and FFs

Use Signal Probabilities (SP) for off-path signals

SP_A: probability of gate “A” having logic value “1”

Effective techniques are available for SP computation

EPP: Error Propagation Probability

Example with off-path signal probabilities SP_B = 0.2 and SP_C = 0.4:

EPP(A→D) = SP_B = 0.2

EPP(A→E) = EPP(A→D) × (1 − SP_C) = 0.2 × 0.6 = 0.12

[Figure: error site A propagating along the on-path signals D and E to a flip-flop FF; B and C are off-path signals; waveform labels w, t, w', t']
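The worked EPP example on this slide can be reproduced with a short sketch. It assumes the error reaches each gate on exactly one input (no reconvergence) and that off-path inputs are independent. The gate types are an assumption inferred from the formulas: a factor of SP_B suggests an AND-type gate at D, and a factor of (1 − SP_C) suggests an OR-type gate at E. The function name `epp_through_gate` is hypothetical.

```python
def epp_through_gate(epp_in, gate_type, off_path_sps):
    """Error propagation probability at a gate output.

    epp_in       -- EPP at the single on-path input of the gate
    gate_type    -- "AND" or "OR" (assumed gate types, see lead-in)
    off_path_sps -- signal probabilities (P[value == 1]) of the off-path inputs
    """
    factor = 1.0
    for sp in off_path_sps:
        if gate_type == "AND":
            # An AND gate passes the error only if every off-path input is 1.
            factor *= sp
        elif gate_type == "OR":
            # An OR gate passes the error only if every off-path input is 0.
            factor *= 1.0 - sp
        else:
            raise ValueError(f"unsupported gate type: {gate_type}")
    return epp_in * factor


# Error site A: the error is present with probability 1.
epp_a_to_d = epp_through_gate(1.0, "AND", [0.2])        # SP_B = 0.2
epp_a_to_e = epp_through_gate(epp_a_to_d, "OR", [0.4])  # SP_C = 0.4
print(epp_a_to_d, epp_a_to_e)
```

This single-path product stops being valid when the error reaches two inputs of the same gate, which is what the four-valued rules on the following slides address.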

Page 8

PROPAGATION RULES: ON-PATH GATES

Reconvergent paths

Error propagated to two or more inputs of a gate

Polarity of the propagated error matters!

Four logic values are needed to represent the state of each line:

0, 1: no error propagation (error masked)

a: error propagation with the same polarity as the error site

ā: error propagation with the opposite polarity to the error site

Probabilities of the four values on line Ui: Pa(Ui), Pā(Ui), P1(Ui), P0(Ui)

Error Propagation Probability (EPP) rules were developed for all logic gates

Page 9

PROPAGATION RULES

On-path gates: Pa(Ui) + Pā(Ui) + P1(Ui) + P0(Ui) = 1

Off-path gates: P1(Ui) + P0(Ui) = 1

GATE RULES: AND

$$P_1(out) = \prod_{i=1}^{n} P_1(X_i)$$

$$P_a(out) = \prod_{i=1}^{n} \left[ P_1(X_i) + P_a(X_i) \right] - P_1(out)$$

$$P_{\bar{a}}(out) = \prod_{i=1}^{n} \left[ P_1(X_i) + P_{\bar{a}}(X_i) \right] - P_1(out)$$

$$P_0(out) = 1 - \left[ P_1(out) + P_a(out) + P_{\bar{a}}(out) \right]$$
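A minimal sketch of the four-valued AND-gate rule, assuming independent inputs. It takes the rule to be: P1(out) is the product of the inputs' P1; Pa(out) is the product of [P1 + Pa] over the inputs minus P1(out); Pā(out) is the same with Pā in place of Pa; and P0(out) is the remainder. The dictionary keys and the function name `and_gate` are hypothetical choices for illustration.

```python
from math import prod


def and_gate(inputs):
    """Four-valued probability propagation through an n-input AND gate.

    Each input is a dict with keys "a", "abar", "1", "0" whose values sum to 1.
    Off-path inputs simply have "a" == "abar" == 0.
    """
    p1 = prod(x["1"] for x in inputs)
    pa = prod(x["1"] + x["a"] for x in inputs) - p1
    pabar = prod(x["1"] + x["abar"] for x in inputs) - p1
    p0 = 1.0 - (p1 + pa + pabar)
    return {"a": pa, "abar": pabar, "1": p1, "0": p0}


# On-path input carrying the error, off-path input with SP = 0.2.
on_path = {"a": 1.0, "abar": 0.0, "1": 0.0, "0": 0.0}
off_path = {"a": 0.0, "abar": 0.0, "1": 0.2, "0": 0.8}
single = and_gate([on_path, off_path])

# Reconvergence: the same error arrives on both inputs with opposite polarity.
inverted = {"a": 0.0, "abar": 1.0, "1": 0.0, "0": 0.0}
reconv = and_gate([on_path, inverted])
print(single, reconv)
```

The first case gives Pa(out) = 0.2, matching the single-path EPP product for an AND gate. The second case gives P0(out) = 1: a AND ā is always 0, so the error is masked, which illustrates why polarity has to be tracked.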

Page 10

TIMING DERATING MODELING

Find all possible propagated waveforms

Enhanced static timing analysis

Record all possible transitions at each reachable gate due to a glitch at the error site

How?

Create a glitch of width w, represented by two events: (a, t), (ā, t+w)

For both positive and negative glitches:

Inject two events (a, t), (ā, t+w) at the error site

Find all events at the outputs of all on-path gates

Calculate the error propagation probabilities Pa, Pā for each event

Propagation continues until a PO or FF is reached

Error propagation probabilities for all possible waveforms are computed

For each waveform, the Latching Probability (LP) is computed as follows:

$$LP = \frac{S + H + W}{T}$$

S: Setup Time, H: Hold Time, W: Glitch Width, T: Clock Period

[Figure: glitch waveform with event a at t and event ā at t+w]
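A small sketch of the latching-probability step, under the assumption that the formula has the form LP = (S + H + W) / T, capped at 1. The function name and the numeric values below are illustrative assumptions, not taken from the slides; they are chosen only so that the results match the LP values 0.5 and 0.7 that appear in the case study.

```python
def latching_probability(setup, hold, width, period):
    """Probability that a glitch of the given width overlaps the latching
    window of a flip-flop, assuming LP = (S + H + W) / T (capped at 1)
    and a glitch arrival time uniformly distributed over the clock period."""
    if period <= 0:
        raise ValueError("clock period must be positive")
    return min(1.0, (setup + hold + width) / period)


# Illustrative values (assumed): S = 2, H = 2, T = 10 time units.
lp_narrow = latching_probability(setup=2, hold=2, width=1, period=10)
lp_wide = latching_probability(setup=2, hold=2, width=3, period=10)
print(lp_narrow, lp_wide)
```

Wider glitches are more likely to be captured, so waveforms that merge into a wider pulse at a flip-flop input get a higher latching probability.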

Page 11

TIMING LOGIC DERATING

Different glitches may propagate to the POs or FFs due to re-convergent fan-out

Page 12

ELECTRICAL DERATING MODELING

Algorithm: Computing electrical masking while propagating events

Vomin(Gj, inputk): minimum voltage of input k of Gj

Vomax(Gj, inputk): maximum voltage of input k of Gj

Vomin(Gj): minimum voltage of Gj output

Vomax(Gj): maximum voltage of Gj output

PWo: output pulse width

For each gate Gj in List(Gi) do
    For each valid waveform (Wl) in Event List(Gj) do
        Vomin(inputs) = Max(Vomin of gate inputs on waveform Wl);
        Vomax(inputs) = Min(Vomax of gate inputs on waveform Wl);
        Compute Vomin(Gj)
        Compute Vomax(Gj)
        Compute PWo using computed Vomin(Gj) and Vomax(Gj)
    end
end
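The loop structure of this algorithm can be sketched as follows. The slide does not say how Vomin(Gj), Vomax(Gj), and PWo are computed from the input voltages, so the sketch takes that step as a caller-supplied function `gate_model`, which is a hypothetical placeholder for a characterized cell model. The data layout (waveforms as lists of (Vmin, Vmax) pairs per gate input) is also an assumption.

```python
def propagate_electrical(cone, event_list, gate_model):
    """Walk the gates of a forward cone and attenuate each valid waveform.

    cone       -- gates of List(Gi), already in topological order
    event_list -- dict: gate -> list of waveforms; each waveform is a list of
                  (vmin, vmax) pairs, one per gate input the waveform arrives on
    gate_model -- hypothetical callable (gate, vin_min, vin_max) ->
                  (vout_min, vout_max, pulse_width)
    """
    results = {}
    for gate in cone:
        for index, waveform in enumerate(event_list.get(gate, [])):
            # Worst-case swing seen at the inputs: the highest minimum and the
            # lowest maximum among the inputs carrying this waveform.
            vin_min = max(vmin for vmin, _ in waveform)
            vin_max = min(vmax for _, vmax in waveform)
            results[(gate, index)] = gate_model(gate, vin_min, vin_max)
    return results


def toy_model(gate, vin_min, vin_max):
    # Toy stand-in: output swing equals input swing, pulse width ~ swing.
    return vin_min, vin_max, max(0.0, vin_max - vin_min)


demo = propagate_electrical(
    cone=["G1"],
    event_list={"G1": [[(0.0, 1.0), (0.1, 0.9)]]},
    gate_model=toy_model,
)
print(demo)
```

Passing the gate model in as a function keeps the traversal independent of any particular library characterization.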

Page 13

A CASE STUDY: ERROR PROPAGATION FOR TWO CLOCK CYCLES

[Figure: the example circuit (nodes A, B, C, D, E, F, H, I, J, flip-flops FF1 and FF2, primary output PO) drawn twice, labeled Clock=1 and Clock=2, with an error injected at node B. The slide is animated; the diagram repeats in every frame, and the values of the final frame are listed below.]

Signal probabilities (SP) labeled on the first copy of the circuit: 0.7, 0.3, 0.2, 0.4, 0.5, 0.2, 0.5, 0.1, 0.5

Signal probabilities (SP) labeled on the second copy of the circuit: 0.7, 0.3, 0.2, 0.4, 0.5, 0.2, 0.5, 0.3, 0.5

T=0: P(B) = 1(a); T=1: P(B) = 1(a)

T=3: P(F) = 1(a); T=4: P(F) = 1(a)

T=5: P(E) = 0.7(a) + 0.3(0); T=6: P(E) = 0.7(a) + 0.3(0)

T=8: P(H) = 0.2(a) + 0.8(0); T=9: P(H) = 0.2(a) + 0.8(0)

T=10: P(I) = 0.42(a) + 0.4(1) + 0.18(0); T=11: P(I) = 0.42(a) + 0.4(1) + 0.18(0)

ELPP(a) = 0.42 × 0.42 = 0.176

ELPP(a) = 0

LP = 0.5

SFP1mcycle(B) = 0.176 × 0.5 = 0.088

T=10: P(J) = 0.1(1) + 0.63(a) + 0.27(0); T=11: P(J) = 0.1(1) + 0.63(a) + 0.27(0)

T=13: P(J) = 0.2(1) + 0.16(a) + 0.64(0); T=14: P(J) = 0.2(1) + 0.16(a) + 0.64(0)

ELPP1(a) = 0.63 × 0.63 × 0.84 × 0.84 = 0.28

ELPP1(a) = 0.37 × 0.37 × 0.16 × 0.16 = 0.003

ELPP2(a) = 0.63 × 0.37 × 0.16 × 0.84 = 0.031

ELPP2(a) = 0.37 × 0.63 × 0.84 × 0.16 = 0.031

LP1 = 0.5, LP2 = 0.7

P(a) = 0.28 × 0.5 + 0.031 × 0.7 = 0.161

P(a) = 0.003 × 0.5 + 0.031 × 0.7 = 0.023

P(a) = 0.7 × 0.161 = 0.112; P(a) = 0.7 × 0.023 = 0.016

P(a) = 0.8 × 0.112 = 0.089; P(a) = 0.8 × 0.016 = 0.012

SFP2mcycle(B) = 1 − (1 − 0.088) × (1 − 0.101) = 0.18

Annotations on the diagram: "Only logical derating may occur" and "All three deratings may occur"
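The final step of the case study combines per-cycle failure probabilities as the complement of "no failure in any cycle". A minimal sketch of that combination, assuming the per-cycle events are independent, is shown here with the rounded values from the slide. Reading 0.101 as the sum 0.089 + 0.012 is an inference from the numbers, and the function name is hypothetical.

```python
def combine_failure_probabilities(per_cycle):
    """Probability that a failure is observed in at least one cycle,
    treating the per-cycle failure events as independent."""
    no_failure = 1.0
    for p in per_cycle:
        no_failure *= 1.0 - p
    return 1.0 - no_failure


sfp_first_cycle = 0.088           # slide value: 0.176 * 0.5
p_second_cycle = 0.089 + 0.012    # slide values; their sum is 0.101
sfp_two_cycles = combine_failure_probabilities([sfp_first_cycle, p_second_cycle])
print(round(sfp_two_cycles, 2))
```

The same function extends to more than two cycles by appending further per-cycle probabilities to the list.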

Page 14

THE TOOL: MLET (MULTI-CYCLE LOGICAL-ELECTRICAL-TIMING DERATING)

[Flowchart of the MLET tool; its boxes are listed below in flow order]

START

Read design; read technology library cells

Characterize library cells; calculate injected pulse width for library cells

Extract netlist adjacency list (Gate_List)

Extract SPs using MC simulation

Start traversing Gate_List

Extract forward cone of gate Gi (List_Gi)

Sort List_Gi using topological sort algorithm

Start traversing List_Gi

Decision: CLK = 1?

Yes: Compute ELPPs, Vo_min, Vo_max, latching probability for each DFF, and output pulse width of gate output (Gj); propagate computed values to gate fanout signals

No: Compute Error Propagation Probabilities (logic derating only); propagate computed values to gate fanout signals

Decision: End of List_Gi?

Compute failure probabilities for all FFs, increment the clock (CLK)

Compute SFPCi(Gi)

Decision: |SFPCi(Gi) − SFPCi−1(Gi)| < e?

Decision: End of Gate_list?

Compute overall design SER

End
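Two steps of this flow lend themselves to a short sketch: extracting and topologically sorting the forward cone of a gate, and iterating over clock cycles until the system failure probability changes by less than e. The sketch below assumes the combinational netlist is given as a fanout adjacency dictionary and is acyclic. The per-cycle analysis itself is left as a caller-supplied function, and all names are hypothetical.

```python
from graphlib import TopologicalSorter


def sorted_forward_cone(fanout, start):
    """Gates reachable from `start` (inclusive), in topological order."""
    cone, stack = {start}, [start]
    while stack:
        gate = stack.pop()
        for nxt in fanout.get(gate, ()):
            if nxt not in cone:
                cone.add(nxt)
                stack.append(nxt)
    # TopologicalSorter expects a mapping node -> set of predecessors.
    preds = {gate: set() for gate in cone}
    for gate in cone:
        for nxt in fanout.get(gate, ()):
            preds[nxt].add(gate)
    return list(TopologicalSorter(preds).static_order())


def iterate_until_converged(sfp_for_cycle, eps=1e-3, max_cycles=100):
    """Increase the cycle count until |SFP_Ci - SFP_Ci-1| < eps.

    sfp_for_cycle -- hypothetical callable: cycle number -> cumulative SFP
    Returns (sfp, cycles_used).
    """
    previous = 0.0
    for clk in range(1, max_cycles + 1):
        current = sfp_for_cycle(clk)
        if clk > 1 and abs(current - previous) < eps:
            return current, clk
        previous = current
    return previous, max_cycles


# Toy netlist (assumed): B fans out to E and F, which reconverge at H.
fanout = {"A": ["E"], "B": ["E", "F"], "E": ["H"], "F": ["H"], "H": ["I"]}
order = sorted_forward_cone(fanout, "B")

# Toy per-cycle model (assumed): SFP approaches 1 geometrically.
sfp, cycles = iterate_until_converged(lambda clk: 1 - 0.5 ** clk)
print(order, sfp, cycles)
```

Processing the cone in topological order guarantees that every input of a gate has been evaluated before the gate itself, which the reconvergent-path rules rely on.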

Page 15

EXPERIMENTAL RESULTS: RUN TIME

Execution times for the MC simulation approach, SP computation, and the MLET approach

• On average, MLET is 4 orders of magnitude faster than MC-based simulation

• The time required to compute SPs is also 5 orders of magnitude less than that of MC-based simulation

Page 16

EXPERIMENTAL RESULTS: ACCURACY

Difference of derating factors obtained by MLET using various SP variances compared to MC simulations (for an injected pulse width of 50 ps)

• MLET has an accuracy of about 97% compared to the MC fault injection approach

Page 17

MULTI-CYCLE SERS

Multi-cycle SER estimation of s820 and s832 ISCAS’89 circuits using MLET

Page 18

CONCLUSIONS & FUTURE WORK

SER estimation is very challenging as it requires dynamic analysis of transients.

The existing SER estimation approaches investigate error propagation probabilities for only a single cycle, resulting in an inaccurate system failure rate.

We have proposed a very fast and accurate analytical approach, called MLET, which has four main features:

1. It runs very fast.

2. All three masking factors are considered.

3. The effects of error propagation in re-convergent fan-outs are modeled.

4. The effect of multi-cycle error propagation on overall circuit SER is considered.

Page 19

CONCLUSIONS & FUTURE WORK CONT’D

Experimental results for some ISCAS’89 benchmark circuits show that MLET:

Is 4 orders of magnitude faster than the MC simulation based fault injection method

Has an accuracy of about 97%

Future work: estimating the SER of a circuit in the presence of Multiple Event Transients (METs), a reliability concern in ultra-deep sub-micron technologies

Page 20

THANK YOU FOR YOUR ATTENTION

Page 21

RELATED WORK: SER MODELING

Circuit/Logic-Level Approach

Fault injection

SERA by Zhang et al. [ICCAD’04]

SEAT-LA by Rajaraman et al. [VLSID’06]

Mohanram et al. [ITC’03]

Maheshwari et al. [DFT’03]

Asadi et al. [DSN’03] [PRDC’04]

Seifert et al. [TDMR’04]

Probabilistic Transfer Matrices (PTM): Krishnaswamy et al. [DATE’05]

Binary Decision Diagram (BDD): FASER by Zhang et al. [ISQED’06] [SELSE’05]