A Comparative Evaluation of Static Analysis Actionable Alert Identification Techniques

Sarah Heckman and Laurie Williams
Department of Computer Science
North Carolina State University

Motivation
• Automated static analysis can find a large number of alerts
– Empirically observed alert density of 40 alerts/KLOC [HW08]
• Alert inspection required to determine if developer should (and could) fix
– Developer may only fix 9% [HW08] to 65% [KAY04] of alerts
– Suppose 1000 alerts at 5 minutes of inspection per alert: 10.4 work days to inspect all alerts
– Potential savings of 3.6-9.5 days by only inspecting alerts the developer will fix
• Fixing 3-4 alerts that could lead to field failures justifies the cost of static analysis [WDA08]
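The inspection-cost figures on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes an 8-hour work day, which the slide does not state but which matches the 10.4-day figure; the function names are illustrative only.

```python
MINUTES_PER_ALERT = 5      # from the slide
HOURS_PER_WORK_DAY = 8     # assumption: not stated on the slide


def inspection_days(n_alerts):
    """Work days needed to inspect n_alerts at 5 minutes each."""
    return n_alerts * MINUTES_PER_ALERT / 60 / HOURS_PER_WORK_DAY


def savings_days(n_alerts, fix_rate):
    """Days saved by inspecting only the alerts the developer will fix."""
    return inspection_days(n_alerts) - inspection_days(n_alerts * fix_rate)


print(round(inspection_days(1000), 1))     # 10.4
print(round(savings_days(1000, 0.65), 1))  # 3.6 (developer fixes 65% of alerts)
print(round(savings_days(1000, 0.09), 1))  # 9.5 (developer fixes 9% of alerts)
```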

PROMISE 2013 (c) Sarah Heckman 2

Coding Problem?
• Actionable: alerts the developer wants to fix
– Faults in the code
– Conformance to coding standards
– Developer action: fix the alert in the source code
• Unactionable: alerts the developer does not want to fix
– Static analysis false positive
– Developer knowledge that alert is not a problem
– Inconsequential coding problems (style)
– Fixing the alert may not be worth effort
– Developer action: suppress the alert

Actionable Alert Identification Techniques
• Supplement automated static analysis
– Classification: predict actionability
– Prioritization: order by predicted actionability
• AAIT utilize additional information about the alert, code, and other artifacts
– Artifact Characteristics
• Can we determine a “best” AAIT?

Research Objective
• To inform the selection of an actionable alert identification technique for ranking the output of automated static analysis through a comparative evaluation of six actionable alert identification techniques.

Related Work
• Comparative evaluation of AAIT [AAH12]
– Languages: Java and Smalltalk
– ASA: PMD, FindBugs, SmallLint
– Benchmark: FAULTBENCH
– Evaluation Metrics
   • Effort – “average number of alerts one must inspect to find an actionable one”
   • Fault Detection Rate Curve – number of faults detected against number of alerts inspected
– Selected AAIT: APM, FeedbackRank, LRM, ZRanking, ATL-D, EFindBugs
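The two evaluation metrics from [AAH12] can be illustrated on a ranked list of alerts. This is a minimal sketch of one possible reading of the quoted definitions, not the implementation used in [AAH12]: effort is taken as the mean number of alerts inspected per actionable alert found when walking down the ranking, and the curve is the running count of actionable alerts. The function names and the example ranking are hypothetical.

```python
from itertools import accumulate


def fault_detection_curve(ranked):
    """Cumulative number of actionable alerts found after each inspection.

    ranked: list of booleans in inspection order, True = actionable.
    """
    return list(accumulate(int(is_actionable) for is_actionable in ranked))


def effort(ranked):
    """Mean number of alerts inspected per actionable alert found.

    Assumption: inspection stops at the last actionable alert in the ranking.
    """
    positions = [i for i, is_actionable in enumerate(ranked, start=1) if is_actionable]
    if not positions:
        return float("inf")  # nothing actionable is ever found
    return positions[-1] / len(positions)


ranking = [True, False, False, True, False, True]
print(fault_detection_curve(ranking))  # [1, 1, 1, 2, 2, 3]
print(effort(ranking))                 # 2.0
```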

Comparative Evaluation
• Considered AAIT in literature [HW11][SFZ11]
• Selection Criteria
– AAIT classify or prioritize alerts generated by automated static analysis for the Java programming language
– An implementation of the AAIT is described allowing for replication
– The AAIT is fully automated and does not require manual intervention or inspection of alerts as part of the process

Selected AAIT (1)
• Actionable Prioritization Models (APM) [HW08]
– ACs: code location, alert type
• Alert Type Lifetime (ATL) [KE07a]
– AC: alert type lifetime
– ATL-D: measures the lifetime in days
– ATL-R: measures the lifetime in revisions
• Check ‘n’ Crash (CnC) [CSX08]
– AC: test failures
– Generates tests that try to cause RuntimeExceptions

Selected AAIT (2)
• History-Based Warning Prioritization (HWP) [KE07b]
– ACs: commit messages that identify fault/non-fault fixes
• Logistic Regression Models (LRM) [RPM08]
– ACs: 33 including two proprietary/internal ACs
• Systematic Actionable Alert Identification (SAAI) [HW09]
– ACs: 42
– Machine learning

FAULTBENCH v0.3
• 3 Subject Programs: jdom, runtime, logging
• Procedure
1. Gather Alert and Artifact Characteristic Data Sources
2. Artifact Characteristic and Alert Oracle Generation
3. Training and Test Sets
4. Model Building
5. Model Evaluation

Gather Data
• Download from repo
• Compile
• ASA – FindBugs & Check ‘n’ Crash (ESC/Java)
• Source Metrics – JavaNCSS
• Repository History – CVS & SVN
• Difficulties
– Libraries – changed over time
– Not every revision would build (especially early ones)

Artifact Characteristics
Independent Variables
Alert Identifier and History
• Alert information (type, location)
• Number of alert modifications
Source Code Metrics
• Size and complexity metrics
Source Code History
• Developers
• File creation, deletion, and modification revisions
Source Code Churn
• Added and deleted lines of code
Aggregate Characteristics
• Alert lifetime, alert counts, staleness
Dependent Variable – Alert Classification
[Figure labels: Alert Info, Surrounding Code, Actionable Alert, Unactionable Alert]

Alert Oracle Generation
• Iterate through all revisions, starting with the earliest, and compare alerts between revisions
• Closed: actionable
• Filtered: unactionable
• Deleted
• Open
– Inspection
– All unactionable
[Figure labels: Deleted, Closed, Filtered]
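The revision-by-revision comparison can be sketched as follows. This is an illustrative sketch, not the FAULTBENCH implementation: the `Revision` structure, the `alert_file` mapping, and the rules used to tell deleted, filtered, and closed alerts apart are assumptions made for the example. Open alerts are labeled unactionable, following the "all unactionable" option on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    alerts: frozenset    # alert ids reported at this revision (after filtering)
    files: frozenset     # source files present at this revision
    filtered: frozenset  # alert ids suppressed by a filter file (assumed available)


# Assumed mapping from alert history to oracle class; deleted alerts get no class.
ORACLE_CLASS = {"closed": "actionable", "filtered": "unactionable",
                "open": "unactionable", "deleted": None}


def build_oracle(revisions, alert_file):
    """Label every alert by how it leaves (or stays in) the alert set.

    revisions: list of Revision, earliest first.
    alert_file: dict mapping alert id -> file the alert was reported in.
    """
    labels = {}
    for prev, curr in zip(revisions, revisions[1:]):
        for alert in prev.alerts - curr.alerts:       # alert disappeared
            if alert_file[alert] not in curr.files:
                labels[alert] = "deleted"             # its file was removed
            elif alert in curr.filtered:
                labels[alert] = "filtered"            # developer suppressed it
            else:
                labels[alert] = "closed"              # gone while the file remains
    for alert in revisions[-1].alerts:
        labels[alert] = "open"                        # still reported at the end
    return labels


history = [
    Revision(frozenset("abcd"), frozenset({"F1", "F2"}), frozenset()),
    Revision(frozenset("bcd"), frozenset({"F1", "F2"}), frozenset()),
    Revision(frozenset("d"), frozenset({"F1"}), frozenset("c")),
]
files = {"a": "F1", "b": "F2", "c": "F1", "d": "F1"}
labels = build_oracle(history, files)
print({alert: (label, ORACLE_CLASS[label]) for alert, label in sorted(labels.items())})
```

The sketch lets the most recent event win if an alert disappears and later reappears, which is a simplification.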

Training and Test Sets
• Simulate how AAIT would be used in practice
• Training set: first X% of revisions to train the models
– 70%, 80%, and 90%
• Test set: use remaining (100-X)% of revisions to test the models
• Overlapping alerts
– Alerts open at the cutoff revision
• Deleted alerts
– A deleted alert is not considered unless its deletion falls outside the training set; in that case, the alert is used in model building
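A revision-based split along these lines might look like the sketch below. The `Alert` fields, the integer cutoff, and the choice to return overlapping alerts as a separate list are assumptions for illustration; the slide does not say how overlapping alerts are assigned to the test set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    alert_id: str
    opened: int            # index of the revision where the alert first appears
    ended: Optional[int]   # index of the revision where it disappears, None if still open
    deleted: bool = False  # True if it disappeared because its file was deleted


def split_alerts(alerts, n_revisions, train_pct):
    """Split alerts by revision: the first train_pct% of revisions are training."""
    cutoff = n_revisions * train_pct // 100   # revisions [0, cutoff) form the training window
    train, test, overlapping = [], [], []
    for alert in alerts:
        ended_in_training = alert.ended is not None and alert.ended < cutoff
        if alert.opened < cutoff:
            if alert.deleted and ended_in_training:
                continue                      # deleted inside the training window: not considered
            train.append(alert)               # includes alerts deleted only after the cutoff
            if not ended_in_training:
                overlapping.append(alert)     # still open at the cutoff revision
        elif not alert.deleted:
            test.append(alert)
    return train, test, overlapping


alerts = [
    Alert("a", 0, 3),
    Alert("b", 1, 4, deleted=True),
    Alert("c", 2, 8, deleted=True),
    Alert("d", 5, None),
    Alert("e", 8, 9),
    Alert("f", 8, 9, deleted=True),
]
train, test, overlapping = split_alerts(alerts, n_revisions=10, train_pct=70)
print([a.alert_id for a in train], [a.alert_id for a in test],
      [a.alert_id for a in overlapping])
```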

Model Building & Model Evaluation

• Classification Statistics:
– Precision = TP / (TP + FP)
– Recall = TP / (TP + FN)
– Accuracy = (TP + TN) / (TP + TN + FP + FN)

                      Predicted      Actual
True Positive (TP)    Actionable     Actionable
False Positive (FP)   Actionable     Unactionable
False Negative (FN)   Unactionable   Actionable
True Negative (TN)    Unactionable   Unactionable

• All AAIT are built using the training data and evaluated by predicting the actionability of the test data
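The classification statistics can be computed directly from predicted and actual labels. A minimal sketch, where `True` stands for actionable; returning 0 when a denominator is zero (for example, when no alert is predicted actionable) is an assumption of this example, not something the slides specify.

```python
def classification_stats(predicted, actual):
    """Precision, recall, and accuracy from parallel lists of booleans.

    True means "actionable"; predicted[i] and actual[i] describe the same alert.
    """
    tp = fp = fn = tn = 0
    for pred, act in zip(predicted, actual):
        if pred and act:
            tp += 1
        elif pred and not act:
            fp += 1
        elif not pred and act:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Assumption: report 0.0 instead of failing on an empty denominator.
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }


stats = classification_stats(
    predicted=[True, True, False, False, False],
    actual=[True, False, True, False, False],
)
print(stats)  # {'precision': 0.5, 'recall': 0.5, 'accuracy': 0.6}
```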

Results - jdom

         Accuracy (%)    Precision (%)       Recall (%)
Rev.     70   80   90     70   80   90     70   80   90
APM      80   83   87     46   42    0      9   10    0
ATL-D    72   83   88     26   20   20     22    2    3
ATL-R    77   81   86     32   24   24     11    8   13
CnC      73   80   95    100  100    0      6    9    0
HWP      31   35   32     19   15    9     73   67   57
LRM      72   76   83     37   35   32     64   55   59
SAAI     83   86   90     92  100   67     16   13    7

Results - runtime

         Accuracy (%)    Precision (%)       Recall (%)
AAIT     70   80   90     70   80   90     70   80   90
APM      36   23   50     88   70   47     32   17   57
ATL-D    18   17   55     92   82  100      8    4    3
ATL-R    34   43   59     93   94   55     27   36   60
HWP      68   66   46     88   85   45     74   73   83
LRM      88   87   53     88   87   49    100  100  100
SAAI     49   65   83     90   91  100     48   66   63

Results - logging

         Accuracy (%)    Precision (%)       Recall (%)
AAIT     70   80   90     70   80   90     70   80   90
APM      85   89   92      0    0    0      0    0    0
ATL-D    92   97  100      0    0    0      0    0    0
ATL-R    92   97  100      0    0    0      0    0    0
CnC      67  100  100      0    0    0      0    0    0
HWP      32   35   33      8    4    0    100  100    0
LRM      77   84   83     25   14    0    100  100    0
SAAI     90   97  100      0    0    0      0    0    0

Threats to Validity
• Internal Validity
– Automation of data generation, collection, and artifact characteristic generation
– Alert oracle – uninspected alerts are considered unactionable
– Alert closure is not an explicit action by the developer
– Alert continuity not perfect
   • Close and open a new alert if both the line number and source hash of the alert change
– Number of revisions
• External Validity
– Generalizability of results
– Limitations of the AAIT in comparative evaluation
• Construct Validity
– Calculations for artifact characteristics
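The alert continuity rule listed under internal validity (close and open a new alert only if both the line number and the source hash change) can be sketched as a matching predicate. The `AlertInstance` fields, the use of SHA-1 over the stripped source line, and the requirement that alert type and file match are assumptions for illustration, not details given on the slides.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertInstance:
    alert_type: str
    file: str
    line: int
    source_hash: str


def hash_source(line_text):
    """Hash of the flagged source line, ignoring surrounding whitespace."""
    return hashlib.sha1(line_text.strip().encode("utf-8")).hexdigest()


def same_alert(old, new):
    """True if `new` continues `old` in the next revision.

    The alert is treated as new only when BOTH line number and source hash changed.
    """
    if (old.alert_type, old.file) != (new.alert_type, new.file):
        return False
    return old.line == new.line or old.source_hash == new.source_hash


before = AlertInstance("NP_NULL_DEREF", "Foo.java", 42, hash_source("x.close();"))
moved = AlertInstance("NP_NULL_DEREF", "Foo.java", 57, hash_source("  x.close();"))
rewritten = AlertInstance("NP_NULL_DEREF", "Foo.java", 57, hash_source("y.close();"))

print(same_alert(before, moved))      # True: line changed, source text did not
print(same_alert(before, rewritten))  # False: both changed, so a new alert is opened
```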

Future Work
• Incorporate additional projects into FAULTBENCH
– Emphasis on adding projects that actively use ASA and include filter files
– Allow for evaluation of AAIT with different goals
• Identification of most predictive artifact characteristics
• Evaluate different windows for generating test data
– A full project history may not be as predictive as the most recent history

Conclusions
• SAAI found to be the best overall model when considering accuracy
– Highest accuracy, or tie, for 6 of 9 treatments
• ATL-D, ATL-R, and LRM were also predictive when considering accuracy
– CnC also performed well, but only considered alerts from one ASA
• LRM and HWP had the highest recall
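The "6 of 9 treatments" count can be checked against the accuracy columns of the three results tables. The sketch below copies those accuracy values and counts the treatments (subject program × training percentage) where an AAIT has the highest or tied accuracy. Leaving CnC out of the comparison is an assumption made here, since CnC only considered alerts from one ASA; with that assumption the count for SAAI is 6, and with CnC included it is 4.

```python
# Accuracy (%) at 70/80/90% training revisions, copied from the results tables.
ACCURACY = {
    "jdom": {"APM": (80, 83, 87), "ATL-D": (72, 83, 88), "ATL-R": (77, 81, 86),
             "CnC": (73, 80, 95), "HWP": (31, 35, 32), "LRM": (72, 76, 83),
             "SAAI": (83, 86, 90)},
    "runtime": {"APM": (36, 23, 50), "ATL-D": (18, 17, 55), "ATL-R": (34, 43, 59),
                "HWP": (68, 66, 46), "LRM": (88, 87, 53), "SAAI": (49, 65, 83)},
    "logging": {"APM": (85, 89, 92), "ATL-D": (92, 97, 100), "ATL-R": (92, 97, 100),
                "CnC": (67, 100, 100), "HWP": (32, 35, 33), "LRM": (77, 84, 83),
                "SAAI": (90, 97, 100)},
}


def best_or_tied(aait, exclude=()):
    """Number of treatments where `aait` has the highest (or tied) accuracy."""
    wins = 0
    for rows in ACCURACY.values():
        for i in range(3):  # 70%, 80%, 90% training sets
            best = max(acc[i] for name, acc in rows.items() if name not in exclude)
            if rows[aait][i] == best:
                wins += 1
    return wins


print(best_or_tied("SAAI", exclude=("CnC",)))  # 6
print(best_or_tied("SAAI"))                    # 4
```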

References
[AAH12] S. Allier, N. Anquetil, A. Hora, and S. Ducasse, "A Framework to Compare Alert Ranking Algorithms," 2012 19th Working Conference on Reverse Engineering, Kingston, Ontario, Canada, October 15-18, 2012, pp. 277-285.
[CSX08] C. Csallner, Y. Smaragdakis, and T. Xie, "DSD-Crasher: A Hybrid Analysis Tool for Bug Finding," ACM Transactions on Software Engineering and Methodology, vol. 17, no. 2, pp. 1-36, April 2008.
[HW08] S. Heckman and L. Williams, "On Establishing a Benchmark for Evaluating Static Analysis Alert Prioritization and Classification Techniques," Proceedings of the 2nd International Symposium on Empirical Software Engineering and Measurement, Kaiserslautern, Germany, October 9-10, 2008, pp. 41-50.
[HW09] S. Heckman and L. Williams, "A Model Building Process for Identifying Actionable Static Analysis Alerts," Proceedings of the 2nd IEEE International Conference on Software Testing, Verification and Validation, Denver, CO, USA, 2009, pp. 161-170.
[HW11] S. Heckman and L. Williams, "A Systematic Literature Review of Actionable Alert Identification Techniques for Automated Static Code Analysis," Information and Software Technology, vol. 53, no. 4, April 2011, pp. 363-387.
[KE07a] S. Kim and M. D. Ernst, "Prioritizing Warning Categories by Analyzing Software History," Proceedings of the International Workshop on Mining Software Repositories, Minneapolis, MN, USA, May 19-20, 2007, p. 27.
[KE07b] S. Kim and M. D. Ernst, "Which Warnings Should I Fix First?," Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Dubrovnik, Croatia, September 3-7, 2007, pp. 45-54.
[KAY04] T. Kremenek, K. Ashcraft, J. Yang, and D. Engler, "Correlation Exploitation in Error Ranking," Proceedings of the 12th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Newport Beach, CA, USA, 2004, pp. 83-93.
[RPM08] J. R. Ruthruff, J. Penix, J. D. Morgenthaler, S. Elbaum, and G. Rothermel, "Predicting Accurate and Actionable Static Analysis Warnings: An Experimental Approach," Proceedings of the 30th International Conference on Software Engineering, Leipzig, Germany, May 10-18, 2008, pp. 341-350.
[SFZ11] H. Shen, J. Fang, and J. Zhao, "EFindBugs: Effective Error Ranking for FindBugs," 2011 IEEE 4th International Conference on Software Testing, Verification and Validation, Berlin, Germany, March 21-25, 2011, pp. 299-308.
[WDA08] S. Wagner, F. Deissenboeck, M. Aichner, J. Wimmer, and M. Schwalb, "An Evaluation of Two Bug Pattern Tools for Java," Proceedings of the 1st International Conference on Software Testing, Verification, and Validation, …
