Systematic Architecture Level Fault Diagnosis Using Statistical Techniques
Bachelor Thesis by Fabian Keller
Estimated Costs 2012
as reported by Britton et al. [2013]
11.11.2014 STARDUST - Fabian Keller 2
Agenda
1. Automated Fault Diagnosis
2. State of the Art
3. Case Study: AspectJ
4. Evaluation
5. Conclusions
Fault Diagnosis
what is the current practice?
Goal: pinpoint a single fault or multiple faults
Commonly used techniques:
• System.out.println()
• Symbolic Debugging
• Static Slicing / Dynamic Slicing
There is room for improvement!
Automated Fault Diagnosis
is it possible?
B1 B2 B3 B4 B5 Error
Test1 1 0 0 0 0 0
Test2 1 1 0 0 0 0
Test3 1 1 1 1 1 0
Test4 1 1 1 1 1 0
Test5 1 1 1 1 1 1
Test6 1 1 1 0 1 0
By intuition, a block is more suspicious if:
• it is involved in failing test cases
• it is not involved in passing test cases
Ranking Metrics
… it is possible
Tarantula:
S_T = (#IF / (#IF + #NF)) / (#IF / (#IF + #NF) + #IP / (#IP + #NP))

Jaccard:
S_J = #IF / (#IF + #NF + #IP)

Ochiai:
S_O = #IF / sqrt((#IF + #NF) · (#IF + #IP))

I/N = involved / not involved, F/P = failing / passing
(e.g. #IF = number of failing test cases the block is involved in)
B1 B2 B3 B4 B5 Error
Test1 1 0 0 0 0 0
Test2 1 1 0 0 0 0
Test3 1 1 1 1 1 0
Test4 1 1 1 1 1 0
Test5 1 1 1 1 1 1
Test6 1 1 1 0 1 0
S_T    0.50  0.56  0.63  0.71  0.63
S_J    0.17  0.20  0.25  0.33  0.25
S_O    0.41  0.45  0.50  0.58  0.50

Ranking: 1. B4   2. B3, B5   3. B2   4. B1
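The scores in the table can be reproduced with a short script. A minimal sketch of the computation (the coverage matrix and the three metric formulas are taken from the slides; the ranking is derived from the Ochiai scores):

```python
# Spectrum-based fault localization on the example coverage matrix.
# Each row: coverage of blocks B1..B5 by one test, plus whether the test failed.
import math

coverage = [
    ([1, 0, 0, 0, 0], False),  # Test1
    ([1, 1, 0, 0, 0], False),  # Test2
    ([1, 1, 1, 1, 1], False),  # Test3
    ([1, 1, 1, 1, 1], False),  # Test4
    ([1, 1, 1, 1, 1], True),   # Test5 (failing)
    ([1, 1, 1, 0, 1], False),  # Test6
]

def counts(block):
    """#IF, #NF, #IP, #NP for one block (involved/not involved x failing/passing)."""
    IF = sum(1 for row, failed in coverage if failed and row[block])
    NF = sum(1 for row, failed in coverage if failed and not row[block])
    IP = sum(1 for row, failed in coverage if not failed and row[block])
    NP = sum(1 for row, failed in coverage if not failed and not row[block])
    return IF, NF, IP, NP

def tarantula(IF, NF, IP, NP):
    fail_ratio = IF / (IF + NF)
    pass_ratio = IP / (IP + NP) if IP + NP else 0.0
    return fail_ratio / (fail_ratio + pass_ratio)

def jaccard(IF, NF, IP, NP):
    return IF / (IF + NF + IP)

def ochiai(IF, NF, IP, NP):
    return IF / math.sqrt((IF + NF) * (IF + IP))

scores = {f"B{i + 1}": ochiai(*counts(i)) for i in range(5)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # B4 ranks first: in the failing test, but in the fewest passing ones
```

For B4 this yields S_T ≈ 0.71, S_J ≈ 0.33, S_O ≈ 0.58, matching the table row.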
Commonly Used Data
and its limiting factors
Software-artifact Infrastructure Repository
• Siemens set
• space program
Program         Faulty versions   LOC    Test cases   Description
print_tokens    7                 478    4130         Lexical analyzer
print_tokens2   10                399    4115         Lexical analyzer
replace         32                512    5542         Pattern recognition
schedule        9                 292    2650         Priority scheduler
schedule2       10                301    2710         Priority scheduler
tcas            41                141    1608         Altitude separation
tot_info        23                440    1052         Information measure
space           38                6218   13585        Array definition language
Performance Metrics
how can fault localization performance be evaluated?
• Wasted Effort (WE): the number of non-faulty elements inspected before the fault is reached
  Example ranking: L4, L3, L2, L7, L6, L1, L5, L9, L10, L8
  Wasted Effort (prominent bug): 2 (or 20% of the ranking)
• Proportion of Bugs Localized (PBL): the percentage of bugs localized with WE < p%
• Hit@X: the number of bugs localized after inspecting X ranked elements
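As a minimal sketch of these metrics (the fault position "L2" is an assumption chosen only to reproduce the wasted effort of 2, i.e. 20%, stated above):

```python
# Sketch of the SBFL performance metrics: Wasted Effort, PBL, Hit@X.
# The faulty element "L2" is hypothetical, picked to match the example numbers.

ranking = ["L4", "L3", "L2", "L7", "L6", "L1", "L5", "L9", "L10", "L8"]

def wasted_effort(ranking, faulty):
    """Number of non-faulty elements inspected before the fault is reached."""
    return ranking.index(faulty)

def proportion_localized(results, p):
    """PBL: fraction of bugs whose wasted effort is below p percent of the ranking size.

    results: list of (wasted_effort, ranking_size) pairs, one per bug.
    """
    return sum(1 for we, size in results if 100 * we / size < p) / len(results)

def hit_at_x(results, x):
    """Hit@X: number of bugs whose fault appears within the first x elements."""
    return sum(1 for we, _ in results if we < x)

we = wasted_effort(ranking, "L2")
print(we, 100 * we / len(ranking))  # 2 elements wasted, i.e. 20% of the ranking
```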
AspectJ – Lines of Code
nearly doubled in the examined time span
AspectJ – Commits
active development, typically 50+ commits per month
AspectJ – Bugs
nearly 2500 bugs reported in the examined time span
AspectJ – Data
less than 40% of the investigated bugs are applicable for SBFL
                  AspectJ   AJDT   Sum
All bugs          1544      886    2430
Bugs in iBugs     285       65     350
Classified Bugs   99        11     110
Applicable Bugs   41        1      42
Involved Bugs     20        1      21
What happened?
Bug 36234
workarounds cannot be used as an evaluation oracle
Bug report: "Getting an out of memory error when compiling with Ajc 1.1 RC1 […]"
[Figure: pre-fix vs. post-fix code]
Bug 61411
platform-specific bugs are mostly not present in test suites
Bug report: "[…] highlights a problem that I've seen using ajdoc.bat on Windows […]"
[Figure: pre-fix vs. post-fix code]
Bug 151182
synchronization bugs are mostly not present in test suites
Bug report: "[…] recompiled the aspect using 1.5.2 and tried to run it […], but it fails with a NullPointerException. […]"
[Figure: pre-fix vs. post-fix code]
Research Questions
• RQ1: How does the program size influence fault localization performance?
• RQ2: How many bugs can be found when examining a fixed amount of ranked elements?
• RQ3: How does the program size influence suspiciousness scores produced by different ranking metrics?
• RQ4: Are the fault localization performance metrics currently used by the research community valid?
RQ1: Program Size vs. SBFL Performance?
multiple ranked elements are mapped to the same suspiciousness
RQ4: Are the Performance Metrics Valid?
on average, no bugs can be found in the first 100 lines
RQ4: Are the Performance Metrics Valid?
with luck, 33% of all bugs can be found in the first 1000 lines
Conclusions
there is still some work to be done
• Bugs need more context to be fully understood
• Current metrics cannot be applied to large projects
• SBFL is not feasible for large projects
• New metrics are a starting point for future work
Thank you for your attention!
Questions?
RQ2: examining a fixed amount of ranked elements
inspect more than 100 files to find 50% of all bugs
RQ3: Program Size vs. Suspiciousness
mean suspiciousness drops for larger programs
WAUC: Weighted Area Under Curve