Online and Operand-Aware Detection of Failures Utilizing False Alarm Vectors

Amir Yazdanbakhsh, David Palframan, Azadeh Davoodi, Mikko Lipasti, Nam Sung Kim

Department of Electrical and Computer Engineering



Introduction

• Technology scaling beyond 32nm degrades the manufacturing yield
  o Can be addressed by imposing restrictive design rules or just using regular fabrics [T. Jhaveri, SPIE’06]
  o Can be addressed by using configurable logic blocks to make post-silicon corrections [Y. Ran, TVLSI’06]
  o Redundancy-based techniques can also be used
    o Exploiting existing redundancy in high-performance processors [P. Shivakumar, ICCD’12] [S. Shyam, ASPLOS’06] [J. Srinivasan, ISCA’05]
    o Incorporate redundancy at the granularity of a bit slice [K. Namba, PRDC’05]


Motivation

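[Figure: prefix adder over bit positions 0 to 7 in which a defective prefix node impacts carries C4 and C7; the vectors 0000…11010…0110 are reduced to minimized vectors and fed to the Checker, which drives the Recover signal]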

• The checker detects a match with the faulty vectors and a small number of False Alarm vectors at runtime.


Contributions

• Checker Unit Module
  – Flexible option for online and operand-level fault detection
  – Update faulty vectors over time
  – TCAM-based implementation which can store cubes with don’t cares
  – No extra logic on the critical paths

• False Alarm Vectors
  – Efficient use of false alarm vectors to reduce the number of vectors to be checked, thus reducing the TCAM area
  – Integrate the false alarm insertion into the ESPRESSO 2-level logic minimization tool
  – The recovery flag is not falsely activated too frequently


Checker Unit: Comparison with a Redundancy-based Alternative

• Checker Unit
  [Figure: Operand Checker built around a TCAM holding entries such as 0xxx11x0 and driving the Recover signal]
  – Does not affect the critical path
  – Flexible checker unit (can update faulty vectors)
  – Online and operand-aware detection of failures

• Redundancy-based
  ⤫ Affects the critical path (large muxes)
  ⤫ Fixed design approach (cannot be updated)
  ⤫ Two out of three adders should always be fault-free


Overview of TCAM

• TCAM can store test cubes which have don’t-care bits
• Conventional TCAM needs to support random access to a specific entry to update the key value at runtime
  – Requires a log N-to-N decoder for a TCAM with N entries
  – The checker unit does not need such a decoder
    • Each entry must be updated only once, every time the chip is turned on
    • Supporting sequential access to write the test cubes to the TCAM is sufficient
• In our framework the size of the TCAM can get impractically large if all the faulty vectors are individually stored
  – We propose to add a few false alarm vectors to reduce the number of entries in the TCAM and therefore reduce the TCAM size
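
The matching behavior described on this slide can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the talk: the class name `SequentialTcam`, its methods, and the value/care-mask encoding are hypothetical, and the sample cube `0xxx11x0` is used purely as an illustration. Entries are appended in order, mirroring the point that sequential writes are sufficient and no address decoder is needed.

```python
class SequentialTcam:
    """Toy ternary CAM model: entries are appended in order, never addressed randomly."""

    def __init__(self, capacity, width):
        self.capacity = capacity  # maximum number of stored cubes (hypothetical parameter)
        self.width = width        # number of bits per cube
        self.entries = []         # list of (value, care_mask) integer pairs

    def write_next(self, cube):
        """Append one test cube such as '0xxx11x0', where 'x' is a don't-care bit."""
        if len(cube) != self.width:
            raise ValueError("cube width mismatch")
        if len(self.entries) >= self.capacity:
            raise OverflowError("TCAM is full")
        value = int(cube.replace("x", "0"), 2)
        care = int("".join("0" if c == "x" else "1" for c in cube), 2)
        self.entries.append((value, care))

    def recover(self, operand_bits):
        """Return True if the operand matches any stored cube on all cared-for bits."""
        operand = int(operand_bits, 2)
        return any((operand ^ value) & care == 0 for value, care in self.entries)


tcam = SequentialTcam(capacity=2, width=8)
tcam.write_next("0xxx11x0")
print(tcam.recover("01011100"))  # True: every specified bit agrees
print(tcam.recover("11011100"))  # False: the leading bit differs
```

Because every cube occupies one entry, the `capacity` limit in this model is what makes the number of test cubes the quantity to minimize.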


False Alarm Insertion to Minimize the TCAM Size: Example

[Figure: circuit diagram with signals labeled A, B, C, D, E]

Identify cubes which excite fault:

      A B C D E
V1    x 1 0 x x
V2    0 1 1 0 x
V3    1 1 1 0 0
V4    1 1 1 0 1
V5    x 0 0 x 1
V6    x 0 1 0 1


False Alarm Insertion to Minimize the TCAM Size: Example

Identify cubes which excite fault:

      A B C D E
V1    x 1 0 x x
V2    0 1 1 0 x
V3    1 1 1 0 0
V4    1 1 1 0 1
V5    x 0 0 x 1
V6    x 0 1 0 1

Test cube minimization:

      A B C D E
V1    x x 0 x 1
V2    x 1 0 x x
V3    x x x 0 1
V4    x 1 x 0 x


False Alarm Insertion to Minimize the TCAM Size: Example

Steps shown: Identify cubes which excite fault; Test cube minimization; Further minimization with False Alarm Insertion

Test cube minimization:

      A B C D E
V1    x x 0 x 1
V2    x 1 0 x x
V3    x x x 0 1
V4    x 1 x 0 x

Further minimization with False Alarm Insertion:

      A B C D E
V1    x x 0 x 1
V2    x 1 0 x x
V3    x x x 0 x

• We reduce the number of test cubes from 6 to 3
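
The three stages of this example can be checked by brute force. The sketch below is only an illustration (the helper names `minterms` and `cover` are hypothetical, and it is not how ESPRESSO works internally): it expands each cube into its minterms, confirms that the 4-cube set covers exactly the same input vectors as the original 6 cubes, and lists the extra vectors that the 3-cube set flags, which are the false alarms.

```python
from itertools import product


def minterms(cube):
    """Expand a cube string over {'0', '1', 'x'} into its set of minterm strings."""
    choices = [("0", "1") if c == "x" else (c,) for c in cube]
    return {"".join(bits) for bits in product(*choices)}


def cover(cubes):
    """Union of the minterms of all cubes in the list."""
    covered = set()
    for cube in cubes:
        covered |= minterms(cube)
    return covered


# Cubes over the variables A B C D E, copied from the slides.
ORIGINAL = ["x10xx", "0110x", "11100", "11101", "x00x1", "x0101"]
MINIMIZED = ["xx0x1", "x10xx", "xxx01", "x1x0x"]
WITH_FALSE_ALARMS = ["xx0x1", "x10xx", "xxx0x"]

faulty = cover(ORIGINAL)
assert cover(MINIMIZED) == faulty   # exact minimization keeps the same cover
stored = cover(WITH_FALSE_ALARMS)
assert faulty <= stored             # every faulty vector is still detected
false_alarms = sorted(stored - faulty)

print(len(faulty), len(stored), false_alarms)
# 18 22 ['00000', '00100', '10000', '10100']
```

In this example, dropping one TCAM entry costs four false alarm vectors out of 32 possible inputs.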


False Alarm Insertion

Problem Definition
• Reduce the number of cubes below the given threshold by adding as few false alarm vectors as possible

Why do we need False Alarms?
• Due to the area budget, the number of entries in the TCAM is limited
• The number of test cubes translates to the number of entries in the TCAM


Using Two-Level Logic Minimization

• Two-level logic minimization can be used to minimize the number of test cubes

• We extend the ESPRESSO* tool by inserting false alarm vectors to achieve higher minimization

*ESPRESSO. http://embedded.eecs.berkeley.edu/pubs/downloads/espresso/.


False Alarm Insertion by Extending ESPRESSO

Overview of the main loop of ESPRESSO
[Flowchart: Test cubes enter the loop; decision "Stop minimization?"; output Minimized test cubes. Loop steps as listed on the slide:]
  F = IRREDUNDANT (FON, FDC)
  F = REDUCE (FON, FDC)
  F = EXPAND (FON, FOFF)

Extension with False Alarm insertion
[Flowchart: Minimized cubes enter the loop; decision "# vectors < threshold"; output Minimized cubes with false alarm. Loop steps as listed on the slide:]
  F = EXPAND-FA (FON, FOFF)
  F = IRREDUNDANT (FON, FDC)
  F = REDUCE (FON, FDC)
  F = EXPAND (FON, FOFF)


False Alarm Insertion Example

[Figure: worked example with four panels labeled EXPAND-FA, IRREDUNDANT, REDUCE, EXPAND]


False Alarm Insertion Procedure

• Each call to the EXPAND-FA function expands multiple test cubes
  – How do we sequentially go through the on-set? (See the paper)
  – Which cube is selected to be expanded? (Same section of the paper)
  – Stopping criterion: the target number of cubes is reached


False Alarm Insertion for One Cube

Cube being expanded:

        A0 A1 A2 A3
ON1      0  0  0  x

Offset Matrix:

        A0 A1 A2 A3
OFF1     x  x  1  x
OFF2     x  1  x  x
OFF3     1  x  x  1

False Alarm Matrix:

        A0 A1 A2 A3
B1       0  0  2  -
B2       0  2  0  -
B3       1  0  0  -
total    1  2  2  -

[Figure: 4x4 map over axes A0A1 and A2A3 with regions labeled OFF1, OFF2, OFF3, ON1, ON2, ON’2; cell values by row: 1 1 0 0 / 0 0 0 0 / 0 0 0 0 / 1 0 0 0]

• False Alarm Matrix (i, j)
  – Entry (i, j) indicates the false alarms between off-set cube i and the expanded cube when literal j is dropped
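
The matrix entries on this slide can be reproduced with a short script. This is a sketch of one reading of the definition, not the authors' implementation: entry (i, j) is taken to be the number of minterms of off-set cube i that become covered when literal j of the on-set cube is dropped, assuming the on-set cube does not already overlap the off-set. The function names are hypothetical.

```python
def overlap_count(a, b):
    """Number of minterms shared by cubes a and b (strings over '0', '1', 'x')."""
    free = 0
    for p, q in zip(a, b):
        if p == "x" and q == "x":
            free += 1          # position unconstrained in both cubes
        elif "x" not in (p, q) and p != q:
            return 0           # the cubes disagree on a specified literal
    return 2 ** free


def false_alarm_matrix(on_cube, off_cubes):
    """Rows follow off_cubes, columns follow literals; None marks a don't-care column."""
    rows = []
    for off in off_cubes:
        row = []
        for j, literal in enumerate(on_cube):
            if literal == "x":
                row.append(None)  # nothing to drop at this position
                continue
            flipped = "1" if literal == "0" else "0"
            # Minterms gained by dropping literal j are those with literal j complemented.
            gained = on_cube[:j] + flipped + on_cube[j + 1:]
            row.append(overlap_count(gained, off))
        rows.append(row)
    return rows


matrix = false_alarm_matrix("000x", ["xx1x", "x1xx", "1xx1"])
totals = [None if col[0] is None else sum(col) for col in zip(*matrix)]
print(matrix)  # [[0, 0, 2, None], [0, 2, 0, None], [1, 0, 0, None]]
print(totals)  # [1, 2, 2, None]
```

Under this reading, dropping A0 from ON1 is the cheapest expansion in the example because it introduces a single false alarm minterm.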


Simulation Configuration

• Single-failure scenarios in various nodes of a 32-bit Brent-Kung adder (prefix adder)
• Generate all the test vectors for two failing cases, modeled as stuck-at-0 and stuck-at-1, using the ATALANTA* ATPG toolset
• Using the SPEC2006 suite for the workload-dependent case
  • Record the input arguments to the adder by running each benchmark on an x86 simulator
• Analyzing area overhead in 2-issue and 4-issue microprocessors

*H.K. Lee and D.S. Ha. Atalanta: an efficient ATPG for combinational circuits. Technical Report 93-12, Department of Electrical Engineering, Virginia Polytechnic Institute and State University, 1993.


Comparison of Probability of Detection

• Probability of detection (PoD): percentage of times that the checker unit activates the recovery signal (could be false alarm or true positive)

• Average PoD degrades with decrease in the number of test cubes
• Average PoD after inserting false alarms does not degrade significantly in FA-128 or FA-64 or FA-32 compared to W/O FA
• This behavior is true for both workload-dependent and random cases


Comparison of False Alarm Insertion Algorithms

• FA-Ag Algorithm
  – At each iteration, all the cubes are expanded using the expand-FA procedure.

• Each entry indicates the fraction of false alarms out of the total number of detections, FA / (FA + TP)
  – FA denotes the number of false alarm minterms and TP the number of true positives, when a fault is truly happening.

• On average, FA-Ag results in more overhead than FA as the number of test cubes increases
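
The two metrics used in the evaluation can be computed from a list of operand vectors as sketched below. This is a hedged illustration rather than the authors' evaluation code: the function names are hypothetical, the false alarm fraction is read as FA / (FA + TP), and the 5-variable cubes from the earlier example stand in for real adder operands, with every possible input used once as the trace.

```python
from itertools import product


def matches(cube, vector):
    """True if the vector agrees with the cube on every specified bit."""
    return all(c == "x" or c == v for c, v in zip(cube, vector))


def detection_metrics(trace, faulty_cubes, stored_cubes):
    """Return (PoD, false alarm fraction, TP, FA) over a non-empty trace of bit strings."""
    tp = fa = 0
    for vector in trace:
        if not any(matches(c, vector) for c in stored_cubes):
            continue  # the checker stays silent for this operand
        if any(matches(c, vector) for c in faulty_cubes):
            tp += 1   # the recovery signal fires and the fault is really excited
        else:
            fa += 1   # the recovery signal fires only because of an inserted false alarm
    detections = tp + fa
    pod = detections / len(trace)
    fa_fraction = fa / detections if detections else 0.0
    return pod, fa_fraction, tp, fa


faulty_cubes = ["xx0x1", "x10xx", "xxx01", "x1x0x"]   # exact cover of the faulty vectors
stored_cubes = ["xx0x1", "x10xx", "xxx0x"]            # cover after false alarm insertion
trace = ["".join(bits) for bits in product("01", repeat=5)]

pod, fa_fraction, tp, fa = detection_metrics(trace, faulty_cubes, stored_cubes)
print(pod, tp, fa)  # 0.6875 18 4
```

With a workload trace in place of the exhaustive one, the same function would weight each vector by how often the benchmark actually produces it.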


Area Overhead

• Implemented approaches
  – Baseline k+1 (for k = 2, 4)
    • k-issue processor with 1 redundant component
  – k+TCAM
    • k-issue processor with checker implemented as TCAM
  – k+FPGA
    • k-issue processor with checker implemented as FPGA

• 2+TCAM has lower area than 2+1 for 32 and 48 cubes
• 2+FPGA always has more area than the baseline
• Similar behavior for 4+TCAM and 4+FPGA


Conclusion

• A new framework for online detection of failures at operand-level granularity
• Design a flexible TCAM-based checker unit
• Propose a false alarm insertion algorithm to reduce the number of vectors below the given threshold
• Incorporate the false alarm insertion algorithm into the ESPRESSO 2-level logic minimization tool
• Future work:
  – Use the checker unit for other existing modules inside the processor
  – Utilize the online and operand-aware detection for other types of faults, such as path delay faults


Questions?