20
F ORMAL DIAGNOSIS OF HARDWARE TRANSIENT ERRORS IN PROGRAMS Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT THE UNIVERSITY OF BRITISH COLUMBIA

Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

FORMAL DIAGNOSIS OF HARDWARE

TRANSIENT ERRORS IN PROGRAMS

Layali Rashid, Karthik Pattabiraman and

Sathish Gopalakrishnan

THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

THE UNIVERSITY OF BRITISH COLUMBIA

Page 2: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

Contributions

• Software-driven diagnosis of hardware transient errors

– Diagnosis: “isolate the first affected instruction”

• Program-level analysis

– Guarantees on the diagnosis

• Completeness

• Accuracy

2THE UNIVERSITY OF BRITISH COLUMBIA

Page 3: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

Why Software-Driven Diagnosis?

• No expensive hardware modifications.

• Minimal software instrumentation.

• Diagnose faults which manifest at the program-level only.

• Direct access to the affected device is not required.

3THE UNIVERSITY OF BRITISH COLUMBIA

Page 4: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

Diagnosis Approach

4THE UNIVERSITY OF BRITISH COLUMBIA

Detector Triggered

Dump File(e.g. failing detector, register file)

Error Diagnosis

Transient Error Faulty inst

Page 5: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

Diagnosis Approach

Detector Triggered

Dump File(e.g. failing detector, register file)

Model Checking

Transient Error Faulty inst

5THE UNIVERSITY OF BRITISH COLUMBIA

Page 6: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

Model Checking Using SymPLFIED

• Formal model for analyzing programs[DSN’08]

– Evaluate the effect of transient hardware errors on programs.

• Symbolic error propagation technique

– Represent errors using a single symbol (err) to avoid state space explosion.

6THE UNIVERSITY OF BRITISH COLUMBIA

Page 7: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

Example: Factorial Program1 movi $2, #1

2 read $1

3 mov $3, $1

4 movi $4, #1

5 loop: setgt $5, $3, $4

6 beq $5, #0, exit

7 mult $2, $2, $3

8 subi $3, $3, #1

9 assert($3 < $1 + 1)

10 beq $0, #0, loop

11 exit: prints "Factorial = "

12 print $2

Result variable

User input

Loops while $3 < $4

Error detector

7THE UNIVERSITY OF BRITISH COLUMBIA

Page 8: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

1 movi $2, #1

2 read $1

3 mov $3, $1

4 movi $4, #1

5 loop: setgt $5, $3, $4

6 beq $5, #0, exit

7 mult $2, $2, $3

8 subi $3, $3, #1

9 assert($3 < $1 + 1)

10 beq $0, #0, loop

11 exit: prints "Factorial = "

12 print $2

A transient fault, $3 = 13

8THE UNIVERSITY OF BRITISH COLUMBIA

Example: Error Propagation

$1 = 5

Detector is triggered

Page 9: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

1 movi $2, #1

2 read $1

3 mov $3, $1

4 movi $4, #1

5 loop: setgt $5, $3, $4

6 beq $5, #0, exit

7 mult $2, $2, $3

8 subi $3, $3, #1

9 assert($3 < $1 + 1)

10 beq $0, #0, loop

11 exit: prints "Factorial = "

12 print $2

A transient fault, $3 = 13

9THE UNIVERSITY OF BRITISH COLUMBIA

Example: Error Propagation

$1 = 5

Detector is triggered

Dump file: Detector triggered$1 = 5$2 = 13$3 = 12$4 = 1$5 = 1

Page 10: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

1 movi $2, #1

2 read $1

3 mov $3, $1

4 movi $4, #1

5 loop: setgt $5, $3, $4

6 beq $5, #0, exit

7 mult $2, $2, $3

8 subi $3, $3, #1

9 assert($3 < $1 + 1)

10 beq $0, #0, loop

11 exit: prints "Factorial = "

12 print $2

10THE UNIVERSITY OF BRITISH COLUMBIA

Example: Error Diagnosis

A transient fault, $3 = err

False Line 7

True Exit

True Line 10

False Detector triggered

$2 = err

Page 11: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

1 movi $2, #1

2 read $1

3 mov $3, $1

4 movi $4, #1

5 loop: setgt $5, $3, $4

6 beq $5, #0, exit

7 mult $2, $2, $3

8 subi $3, $3, #1

9 assert($3 < $1 + 1)

10 beq $0, #0, loop

11 exit: prints "Factorial = "

12 print $2

11THE UNIVERSITY OF BRITISH COLUMBIA

Example: Error Diagnosis

A transient fault, $3 = err

False Line 7

True Exit

True Line 10

False Detector triggered

$2 = err

SymPLFIED’s SolutionInstruction 3 InjectedDetector triggered$1 = 5$2 = err$3 = err$4 = 1$5 = 1

Dump file: Detector triggered$1 = 5$2 = 13$3 = 12$4 = 1$5 = 1

Page 12: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

1 movi $2, #1

2 read $1

3 mov $3, $1

4 movi $4, #1

5 loop: setgt $5, $3, $4

6 beq $5, #0, exit

7 mult $2, $2, $3

8 subi $3, $3, #1

9 assert($3 < $1 + 1)

10 beq $0, #0, loop

11 exit: prints "Factorial = "

12 print $2

12THE UNIVERSITY OF BRITISH COLUMBIA

Example: Error Diagnosis

A transient fault, $3 = err

False Line 7

True Exit

True Line 10

False Detector triggered

$2 = err

SymPLFIED’s SolutionInstruction 3 InjectedDetector triggered$1 = 5$2 = err$3 = err$4 = 1$5 = 1

Dump file: Detector triggered$1 = 5$2 = 13$3 = 12$4 = 1$5 = 1

The crash dump file can be used to identify the faulty instruction.

Page 13: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

Instructions that trigger a detector

Inject at a random bit in SimpleScalar

Y

YCreate a dump

fileError diagnosisDone

More inst?

NDetector

triggered?

Experimental Methodology

13THE UNIVERSITY OF BRITISH COLUMBIA

• Enhance SymPLFIED to diagnose errors.

• Modify SimpleScalar simulator to inject faults.

• Evaluate for Matrix Multiply and Insertion Sort.

Page 14: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

Results for Matrix Multiply Number of detectors 1 4 6

Number of faults injected in SS 167 275 286

Number of faults detected in SS 74 135 150

Diagnosed faults (%) 100 77 80

Undiagnosed fault (%) 0 23 20

14THE UNIVERSITY OF BRITISH COLUMBIA

Page 15: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

Number of detectors 1 4 6

Number of faults injected in SS 167 275 286

Number of faults detected in SS 74 135 150

Diagnosed faults (%) 100 77 80

Undiagnosed fault (%) 0 23 20

Results for Matrix Multiply (1)

• The proposed technique diagnoses 77%-100% of the detected errors for the matrix multiply program.

• The undiagnosed errors are implementation artifacts of the SymPLFIED tool.

15THE UNIVERSITY OF BRITISH COLUMBIA

Page 16: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

Number of detectors 1 4 6

Number of faults injected in SS 167 275 286

Number of faults detected in SS 74 135 150

Diagnosed faults (%) 100 77 80

Undiagnosed fault (%) 0 23 20

Results for Matrix Multiply (2)

• The number of faults injected in SimpleScalar is proportional to the number of detectors.

• Adding more detectors increases the diagnosis accuracy.

16THE UNIVERSITY OF BRITISH COLUMBIA

Page 17: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

Conclusions and Future Work

• Software diagnosis of hardware faults is possible and can be automated using formal techniques.

– Our diagnosis method is able to diagnose significant number of errors using a few detectors.

• Future Work

– Investigate improvements with limited hardware support.

– Improve scalability using heuristics.

– Extend to intermittent & permanent faults.

17THE UNIVERSITY OF BRITISH COLUMBIA

Page 18: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

Backup Slides

THE UNIVERSITY OF BRITISH COLUMBIA 18

Page 19: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

Related Work

Hardware Fault Diagnosis

Hardware- BasedTechniques

ProbabilisticTechniques

Formal MethodsPeriodic-Testing

Techniques

19THE UNIVERSITY OF BRITISH COLUMBIA

Page 20: Sathish Gopalakrishnan T E C E D T U B Cblogs.ubc.ca/karthik/files/2010/03/SELSE10_talk.pdf · 2011. 12. 30. · Sathish Gopalakrishnan THE ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT

Results for Insertion Sort

THE UNIVERSITY OF BRITISH COLUMBIA 20

Number of detectors 1 4 7

Number of faults injected in SS 11 165 198

Number of faults detected in SS 8 64 83

Diagnosed faults (%) 100 87 89

Undiagnosed fault (%) 0 13 11