A Comparative Evaluation of Static Analysis Actionable Alert Identification Techniques
Sarah Heckman and Laurie Williams
Department of Computer Science
North Carolina State University

Motivation
• Automated static analysis can find a large number of alerts
  – Empirically observed alert density of 40 alerts/KLOC [HW08]
• Alert inspection required to determine if developer should (and could) fix
  – Developer may only fix 9% [HW08] to 65% [KAY04] of alerts
  – Suppose 1000 alerts at a 5 minute inspection per alert: 10.4 work days to inspect all alerts
  – Potential savings of 3.6-9.5 days by only inspecting alerts the developer will fix
• Fixing 3-4 alerts that could lead to field failures justifies the cost of static analysis [WDA08]
PROMISE 2013 (c) Sarah Heckman 2
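The inspection-cost figures on the Motivation slide can be reproduced with a short calculation. This is a sketch that assumes an 8-hour work day, which the slide does not state; the function names are illustrative only.

```python
MINUTES_PER_ALERT = 5     # from the slide
HOURS_PER_WORK_DAY = 8    # assumption: the slide does not give the length of a work day


def inspection_days(alert_count):
    """Work days needed to inspect alert_count alerts."""
    return alert_count * MINUTES_PER_ALERT / 60 / HOURS_PER_WORK_DAY


def savings_days(total_alerts, fixed_fraction):
    """Days saved by inspecting only the alerts the developer will fix."""
    return inspection_days(total_alerts) - inspection_days(total_alerts * fixed_fraction)


if __name__ == "__main__":
    print(f"all alerts: {inspection_days(1000):.1f} days")
    for fraction in (0.09, 0.65):
        print(f"fix rate {fraction:.0%}: save {savings_days(1000, fraction):.1f} days")
```

Under that assumption the sketch yields 10.4 days for all 1000 alerts, a saving of 9.5 days at a 9% fix rate, and 3.6 days at a 65% fix rate, matching the slide.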

Coding Problem?
• Actionable: alerts the developer wants to fix
  – Faults in the code
  – Conformance to coding standards
  – Developer action: fix the alert in the source code
• Unactionable: alerts the developer does not want to fix
  – Static analysis false positive
  – Developer knowledge that alert is not a problem
  – Inconsequential coding problems (style)
  – Fixing the alert may not be worth effort
  – Developer action: suppress the alert

Actionable Alert Identification Techniques
• Supplement automated static analysis
  – Classification: predict actionability
  – Prioritization: order by predicted actionability
• AAIT utilize additional information about the alert, code, and other artifacts
  – Artifact Characteristics
• Can we determine a “best” AAIT?

Research Objective
• To inform the selection of an actionable alert identification technique for ranking the output of automated static analysis through a comparative evaluation of six actionable alert identification techniques.

Related Work
• Comparative evaluation of AAIT [AAH12]
  – Languages: Java and Smalltalk
  – ASA: PMD, FindBugs, SmallLint
  – Benchmark: FAULTBENCH
  – Evaluation Metrics
    • Effort: “average number of alerts one must inspect to find an actionable one”
    • Fault Detection Rate Curve: number of faults detected against number of alerts inspected
  – Selected AAIT: APM, FeedbackRank, LRM, ZRanking, ATL-D, EFindBugs
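The two evaluation metrics above can be illustrated on a ranked list of alerts. This is a sketch of one plausible reading of the quoted definitions, not the exact formulas of [AAH12]: each alert is a boolean (True for actionable), the list is in inspection order, and the function names are hypothetical.

```python
from itertools import accumulate


def fault_detection_curve(ranked_labels):
    """Cumulative actionable alerts found after inspecting the first 1..n alerts."""
    return list(accumulate(int(label) for label in ranked_labels))


def effort(ranked_labels):
    """Average number of alerts inspected per actionable alert found.

    Assumed reading: count inspections up to the last actionable alert in the
    ranking and divide by the number of actionable alerts. Returns None when
    the ranking contains no actionable alert.
    """
    positions = [i for i, label in enumerate(ranked_labels, start=1) if label]
    if not positions:
        return None
    return positions[-1] / len(positions)


if __name__ == "__main__":
    ranking = [True, False, True, False, False, True, False]
    print(fault_detection_curve(ranking))
    print(effort(ranking))
```

Under this reading a lower effort is better, and a ranking that places actionable alerts first makes the curve rise fastest.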

Comparative Evaluation
• Considered AAIT in literature [HW11][SFZ11]
• Selection Criteria
  – AAIT classify or prioritize alerts generated by automated static analysis for the Java programming language
  – An implementation of the AAIT is described allowing for replication
  – The AAIT is fully automated and does not require manual intervention or inspection of alerts as part of the process

Selected AAIT (1)
• Actionable Prioritization Models (APM) [HW08]
  – ACs: code location, alert type
• Alert Type Lifetime (ATL) [KE07a]
  – AC: alert type lifetime
  – ATL-D: measures the lifetime in days
  – ATL-R: measures the lifetime in revisions
• Check ‘n’ Crash (CnC) [CSX08]
  – AC: test failures
  – Generates tests that try to cause RuntimeExceptions

Selected AAIT (2)
• History-Based Warning Prioritization (HWP) [KE07b]
  – ACs: commit messages that identify fault/non-fault fixes
• Logistic Regression Models (LRM) [RPM08]
  – ACs: 33 including two proprietary/internal ACs
• Systematic Actionable Alert Identification (SAAI) [HW09]
  – ACs: 42
  – Machine learning

FAULTBENCH v0.3
• 3 Subject Programs: jdom, runtime, logging
• Procedure
  1. Gather Alert and Artifact Characteristic Data Sources
  2. Artifact Characteristic and Alert Oracle Generation
  3. Training and Test Sets
  4. Model Building
  5. Model Evaluation

Gather Data
• Download from repo
• Compile
• ASA: FindBugs & Check ‘n’ Crash (ESC/Java)
• Source Metrics: JavaNCSS
• Repository History: CVS & SVN
• Difficulties
  – Libraries changed over time
  – Not every revision would build (especially early ones)

Artifact Characteristics
Independent Variables
• Alert Identifier and History
  – Alert information (type, location)
  – Number of alert modifications
• Source Code Metrics
  – Size and complexity metrics
• Source Code History
  – Developers
  – File creation, deletion, and modification revisions
• Source Code Churn
  – Added and deleted lines of code
• Aggregate Characteristics
  – Alert lifetime, alert counts, staleness
Dependent Variable
• Alert Classification
[Figure labels: Alert Info, Surrounding Code, Alert, Actionable Alert, Unactionable Alert]

Alert Oracle Generation
• Iterate through all revisions, starting with the earliest, and compare alerts between revisions
• Closed: actionable
• Filtered: unactionable
• Deleted
• Open
  – Inspection
  – All unactionable
[Figure labels: Open, Deleted, Closed, Filtered]
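The oracle procedure can be sketched as a pass over consecutive revisions. Everything here is a hypothetical simplification, not the FAULTBENCH implementation: each revision is a dict from alert id to file path, an alert that disappears is "closed" unless its file is in a known set of deleted files, and open alerts are labeled unactionable (the "All unactionable" option on the slide).

```python
def classify_alerts(revisions, deleted_files=frozenset(), filtered=frozenset()):
    """Assign each alert a state: closed, deleted, filtered, or open.

    revisions: list of {alert_id: file_path} dicts, earliest revision first
    (a hypothetical input format chosen for this sketch).
    """
    states = {}
    for current, following in zip(revisions, revisions[1:]):
        for alert_id, path in current.items():
            if alert_id not in following:
                # The alert vanished between two revisions.
                states[alert_id] = "deleted" if path in deleted_files else "closed"
    for alert_id in revisions[-1]:
        # Alerts still present at the last revision.
        states[alert_id] = "filtered" if alert_id in filtered else "open"
    return states


# Oracle labels per state; deleted alerts get no label and are left out.
ORACLE = {"closed": True, "filtered": False, "open": False}


def oracle_labels(states):
    """Map alert ids to True (actionable) or False (unactionable)."""
    return {a: ORACLE[s] for a, s in states.items() if s in ORACLE}
```

A real oracle would track file deletions per revision and read filtered alerts from the project's filter files; the flat sets here only keep the sketch short.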

Training and Test Sets
• Simulate how AAIT would be used in practice
• Training set: first X% of revisions to train the models
  – 70%, 80%, and 90%
• Test set: use remaining (100-X)% of revisions to test the models
• Overlapping alerts
  – Alerts open at the cutoff revision
• Deleted alerts
  – A deleted alert is not considered, unless its deletion falls outside the training set. In that case the alert is used in model building.
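The chronological split above can be sketched as follows. The function names and the revision-number representation are assumptions made for illustration; the slides do not show how FAULTBENCH implements the split.

```python
def split_revisions(revisions, train_percent):
    """Split an ordered revision list into the first X% (training) and the rest (test)."""
    cutoff = len(revisions) * train_percent // 100
    return revisions[:cutoff], revisions[cutoff:]


def is_overlapping(opened_at, closed_at, cutoff_revision):
    """True if an alert is open at the cutoff revision.

    opened_at and closed_at are revision numbers; closed_at is None for an
    alert that is never closed.
    """
    if opened_at > cutoff_revision:
        return False
    return closed_at is None or closed_at > cutoff_revision


if __name__ == "__main__":
    revisions = list(range(1, 11))
    for percent in (70, 80, 90):
        train, test = split_revisions(revisions, percent)
        print(percent, train, test)
```

Splitting by revision order rather than at random keeps the evaluation close to practice: a model is only ever trained on history that precedes the alerts it predicts.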

Model Building & Model Evaluation
• Classification Statistics:
  – Precision = TP / (TP + FP)
  – Recall = TP / (TP + FN)
  – Accuracy = (TP + TN) / (TP + TN + FP + FN)

                      Predicted      Actual
True Positive (TP)    Actionable     Actionable
False Positive (FP)   Actionable     Unactionable
False Negative (FN)   Unactionable   Actionable
True Negative (TN)    Unactionable   Unactionable

• All AAIT are built using the training data and evaluated by predicting the actionability of the test data
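The classification statistics can be computed directly from predicted and actual labels. This is a minimal sketch with hypothetical function names, where True stands for actionable. Returning 0 when a denominator is zero is an assumption; the slides do not say how undefined precision or recall was reported.

```python
def confusion_counts(predicted, actual):
    """Count TP, FP, FN, TN for boolean labels (True = actionable)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    return tp, fp, fn, tn


def classification_stats(tp, fp, fn, tn):
    """Precision, recall, and accuracy as defined on the slide."""
    def ratio(numerator, denominator):
        # Assumption: report 0.0 rather than fail when the denominator is zero.
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }
```

A model that predicts every alert unactionable on a test set with few actionable alerts scores high accuracy with zero precision and recall, which is why the three statistics are best read together.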

Results - jdom
Column headings 70, 80, and 90 are the training set sizes (% of revisions).

            Accuracy (%)    Precision (%)       Recall (%)
AAIT        70   80   90     70   80   90     70   80   90
APM         80   83   87     46   42    0      9   10    0
ATL-D       72   83   88     26   20   20     22    2    3
ATL-R       77   81   86     32   24   24     11    8   13
CnC         73   80   95    100  100    0      6    9    0
HWP         31   35   32     19   15    9     73   67   57
LRM         72   76   83     37   35   32     64   55   59
SAAI        83   86   90     92  100   67     16   13    7

Results - runtime

            Accuracy (%)    Precision (%)       Recall (%)
AAIT        70   80   90     70   80   90     70   80   90
APM         36   23   50     88   70   47     32   17   57
ATL-D       18   17   55     92   82  100      8    4    3
ATL-R       34   43   59     93   94   55     27   36   60
HWP         68   66   46     88   85   45     74   73   83
LRM         88   87   53     88   87   49    100  100  100
SAAI        49   65   83     90   91  100     48   66   63

Results - logging

            Accuracy (%)    Precision (%)       Recall (%)
AAIT        70   80   90     70   80   90     70   80   90
APM         85   89   92      0    0    0      0    0    0
ATL-D       92   97  100      0    0    0      0    0    0
ATL-R       92   97  100      0    0    0      0    0    0
CnC         67  100  100      0    0    0      0    0    0
HWP         32   35   33      8    4    0    100  100    0
LRM         77   84   83     25   14    0    100  100    0
SAAI        90   97  100      0    0    0      0    0    0

Threats to Validity
• Internal Validity
  – Automation of data generation, collection, and artifact characteristic generation
  – Alert oracle: uninspected alerts are considered unactionable
  – Alert closure is not an explicit action by the developer
  – Alert continuity not perfect
    • Close and open a new alert if both the line number and source hash of the alert change
  – Number of revisions
• External Validity
  – Generalizability of results
  – Limitations of the AAIT in comparative evaluation
• Construct Validity
  – Calculations for artifact characteristics
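The alert continuity rule under Internal Validity can be sketched as a matching predicate between two revisions. The `Alert` fields and the requirement that type and file also match are assumptions for illustration; the slide only states the rule about line number and source hash.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; field names are illustrative only."""
    alert_type: str
    path: str
    line: int
    source_hash: str


def same_alert(old, new):
    """Treat two alerts as the same unless both line number and source hash changed."""
    if (old.alert_type, old.path) != (new.alert_type, new.path):
        return False  # assumption: type and file must match for continuity
    return old.line == new.line or old.source_hash == new.source_hash
```

When `same_alert` returns False for an alert seen in the previous revision, the old alert would be closed and a new one opened. This is the source of the imperfection named on the slide: an edit that both moves and rewrites the flagged line breaks continuity even if the underlying problem is unchanged.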

Future Work
• Incorporate additional projects into FAULTBENCH
  – Emphasis on adding projects that actively use ASA and include filter files
  – Allow for evaluation of AAIT with different goals
• Identification of most predictive artifact characteristics
• Evaluate different windows for generating test data
  – A full project history may not be as predictive as the most recent history

Conclusions
• SAAI found to be the best overall model when considering accuracy
  – Highest accuracy, or tie, for 6 of 9 treatments
• ATL-D, ATL-R, and LRM were also predictive when considering accuracy
  – CnC also performed well, but only considered alerts from one ASA
• LRM and HWP had the highest recall
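The "6 of 9 treatments" tally can be checked against the accuracy columns of the three results slides. In the sketch below, SAAI reaches 6 only when CnC is left out of the comparison; with CnC included the count is 4. Leaving CnC out is an assumption, suggested by the remark that CnC only considered alerts from one ASA; the slides do not state how the tally was made.

```python
# Accuracy (%) at 70/80/90% training sets, copied from the results slides.
ACCURACY = {
    "jdom": {
        "APM": (80, 83, 87), "ATL-D": (72, 83, 88), "ATL-R": (77, 81, 86),
        "CnC": (73, 80, 95), "HWP": (31, 35, 32), "LRM": (72, 76, 83),
        "SAAI": (83, 86, 90),
    },
    "runtime": {
        "APM": (36, 23, 50), "ATL-D": (18, 17, 55), "ATL-R": (34, 43, 59),
        "HWP": (68, 66, 46), "LRM": (88, 87, 53), "SAAI": (49, 65, 83),
    },
    "logging": {
        "APM": (85, 89, 92), "ATL-D": (92, 97, 100), "ATL-R": (92, 97, 100),
        "CnC": (67, 100, 100), "HWP": (32, 35, 33), "LRM": (77, 84, 83),
        "SAAI": (90, 97, 100),
    },
}


def best_or_tied(technique, exclude=()):
    """Count treatments where technique has the highest accuracy or ties for it."""
    wins = 0
    for scores in ACCURACY.values():
        candidates = {n: v for n, v in scores.items() if n not in exclude}
        for i in range(3):
            if candidates[technique][i] == max(v[i] for v in candidates.values()):
                wins += 1
    return wins


if __name__ == "__main__":
    print(best_or_tied("SAAI"))
    print(best_or_tied("SAAI", exclude=("CnC",)))
```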

References
[AAH12] S. Allier, N. Anquetil, A. Hora, and S. Ducasse, "A Framework to Compare Alert Ranking Algorithms," 2012 19th Working Conference on Reverse Engineering, Kingston, Ontario, Canada, October 15-18, 2012, pp. 277-285.
[CSX08] C. Csallner, Y. Smaragdakis, and T. Xie, "DSD-Crasher: A Hybrid Analysis Tool for Bug Finding," ACM Transactions on Software Engineering and Methodology, vol. 17, no. 2, pp. 1-36, April 2008.
[HW08] S. Heckman and L. Williams, "On Establishing a Benchmark for Evaluating Static Analysis Alert Prioritization and Classification Techniques," Proceedings of the 2nd International Symposium on Empirical Software Engineering and Measurement, Kaiserslautern, Germany, October 9-10, 2008, pp. 41-50.
[HW09] S. Heckman and L. Williams, "A Model Building Process for Identifying Actionable Static Analysis Alerts," Proceedings of the 2nd IEEE International Conference on Software Testing, Verification and Validation, Denver, CO, USA, 2009, pp. 161-170.
[HW11] S. Heckman and L. Williams, "A Systematic Literature Review of Actionable Alert Identification Techniques for Automated Static Code Analysis," Information and Software Technology, vol. 53, no. 4, April 2011, pp. 363-387.
[KE07a] S. Kim and M. D. Ernst, "Prioritizing Warning Categories by Analyzing Software History," Proceedings of the International Workshop on Mining Software Repositories, Minneapolis, MN, USA, May 19-20, 2007, p. 27.
[KE07b] S. Kim and M. D. Ernst, "Which Warnings Should I Fix First?," Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Dubrovnik, Croatia, September 3-7, 2007, pp. 45-54.
[KAY04] T. Kremenek, K. Ashcraft, J. Yang, and D. Engler, "Correlation Exploitation in Error Ranking," Proceedings of the 12th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Newport Beach, CA, USA, 2004, pp. 83-93.
[RPM08] J. R. Ruthruff, J. Penix, J. D. Morgenthaler, S. Elbaum, and G. Rothermel, "Predicting Accurate and Actionable Static Analysis Warnings: An Experimental Approach," Proceedings of the 30th International Conference on Software Engineering, Leipzig, Germany, May 10-18, 2008, pp. 341-350.
[SFZ11] H. Shen, J. Fang, and J. Zhao, "EFindBugs: Effective Error Ranking for FindBugs," 2011 IEEE 4th International Conference on Software Testing, Verification and Validation, Berlin, Germany, March 21-25, 2011, pp. 299-308.
[WDA08] S. Wagner, F. Deissenboeck, M. Aichner, J. Wimmer, and M. Schwalb, "An Evaluation of Two Bug Pattern Tools for Java," Proceedings of the 1st International Conference on Software Testing, Verification, and Validation, …