
Using Cluster Analysis for Characteristics Detection in Software Defect Reports


Anna Gromova, Exactpro

Open Access Quality Assurance & Related Software Development for Financial Markets

Tel: +7 495 640 2460, +1 415 830 38 49

www.exactpro.com


Defect Management

Areas of research in defect management:

• automatic defect fixing

• automatic defect detection

• metrics and predictions of defect reports

• quality of defect reports

• triaging defect reports


Examples of Testing Metrics

• time to fix / time to resolve

• which defects get reopened

• which defects get fixed

• which defects get rejected


Defect clustering helps:

● Understand the nature of defects

● Understand software weaknesses

● Improve the testing strategy


Contribution

1. Expanding the scope of defect report attributes conventionally taken into account by researchers in the field, and including implicit data to gain a wider perspective on bug examination.

2. Calculating the Silhouette and the Davies-Bouldin indices to find the proper number of clusters for the k-means clustering algorithm.

3. Providing a description and interpretation of the resulting clusters.


Clustering of defect reports: related work

Clustering helps solve the following tasks:

● finding bug duplicates;

● automating testing;

● predicting the testing workload;

● improving defect management practices, etc.


Defect dataset

D = {d1, d2, ..., dn},

where dj is a defect and n is the number of defects in the project.

dj = {Priority, Status, Resolution, Time to resolve, Count of attachments, Count of comments, Area 1, ..., Area k},

where k is the number of defined areas of testing.
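As a rough illustration of this record structure, the sketch below models one defect dj as a Python dataclass. The field names, the `make_report` helper, the sample values and the time unit are assumptions made for the example; the slides do not say how the data is stored.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DefectReport:
    """One element d_j of the dataset D (field names are illustrative)."""
    priority: str
    status: str
    resolution: str
    time_to_resolve: float    # unit is an assumption, e.g. days
    attachment_count: int
    comment_count: int
    areas: Tuple[int, ...]    # k flags; areas[i] == 1 if the defect belongs to Area i+1


def make_report(priority, status, resolution, time_to_resolve,
                attachment_count, comment_count, area_indices, k):
    """Build a report, turning a set of 1-based area indices into k 0/1 flags."""
    flags = tuple(1 if i in area_indices else 0 for i in range(1, k + 1))
    return DefectReport(priority, status, resolution, time_to_resolve,
                        attachment_count, comment_count, flags)


# Hypothetical defect that touches Area 4 and Area 8 in a project with k = 8 areas.
example = make_report("priority2", "status1", "resolution2", 3.5, 1, 4, {4, 8}, k=8)
print(example.areas)
```

Keeping the areas as a fixed-length tuple of flags mirrors the "Area 1, ..., Area k" part of the definition, so a defect can belong to several areas at once.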


Area of Testing: Component/s and Summary


The Attributes For Cluster Analysis

Attribute              Data type     Values
Priority [17]          Categorical   priority1, priority2, priority3, priority4
Status                 Categorical   status1, status2, status3, status4, status5, status6
Resolution             Categorical   resolution1, resolution2, resolution3, resolution4, resolution5, resolution6, resolution7, resolution8, resolution9, resolution10
Time to resolve        Numeric
Count of comments      Numeric
Count of attachments   Numeric
Area i                 Boolean       {0;1}


Distribution of Defect Reports (Project 1)

2,795 defects


Distribution of Defect Reports (Project 2)

5,788 defects


Objects of Classification According to the Area of Testing

DF = {df1, df2, ..., dfn},

where dfj is a defect and n is the number of defects.

dfj = {component, area}

component = concatenate(Component/s, Summary)

area ∈ {0,1}^M, where M is the number of areas of testing

Project 1: n = 2,795; M = 8

Project 2: n = 5,788; M = 10


Preprocessing: Text Fields

● Natural language processing:

❖ Tokenization

❖ Removal of stop-words

❖ Stemming

● Bag of words (TF-IDF)
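The preprocessing steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the stop-word list is a tiny sample, the `stem` function is a crude suffix stripper standing in for a real stemmer (such as Porter's), the TF-IDF variant is term frequency times log(N/df), and the example summaries are invented. The slides do not say which tools or TF-IDF weighting were used.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "of", "to", "and", "for", "not"}


def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document (tf * log(N / df))."""
    docs = [preprocess(d) for d in documents]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append({t: (c / total) * math.log(n / df[t])
                        for t, c in counts.items()})
    return vectors


# Hypothetical "Component/s + Summary" strings.
summaries = [
    "Gateway rejects orders after restart",
    "Order rejected in the gateway",
    "Market data feed is delayed",
]
vectors = tf_idf(summaries)
print(preprocess(summaries[1]))
```

Terms that occur in many reports (here "gateway", "order") get a lower weight than terms specific to one report, which is the point of using TF-IDF rather than raw counts.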


Techniques: Feature Selection

● Information gain

● Consistency-based method

● Correlation-based method

● Simplified Silhouette Filter
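Of the techniques listed above, information gain is the simplest to show in a few lines. The sketch below computes it for one discrete feature against a binary label; the data (whether a term occurs in a report, whether the report belongs to an area) is invented, and the slides do not describe how the measure was applied.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus their expected entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical data: 8 reports, label = "belongs to a given area".
in_area = [1, 1, 1, 0, 0, 0, 0, 0]
has_term = [1, 1, 1, 0, 0, 0, 0, 0]   # term that separates the area perfectly
noise = [1, 0, 1, 0, 1, 0, 1, 0]      # term that is nearly unrelated to the area

print(information_gain(has_term, in_area), information_gain(noise, in_area))
```

A feature selection step based on this measure would keep the terms with the highest gain for each area and drop the rest.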


Techniques: Classifiers

● Logistic regression

● SVM

● Decision tree

● Random forest

● Naive Bayes

● Bayes Net


Results: Metrics

F-measure values for Project 1

           Area 1   Area 2   Area 3   Area 4   Area 5   Area 6   Area 7   Area 8
RF+Cons    0.942    0.837    0.92     0.95     0.951    0.975    0.991    0.975
SVM+Cons   0.946    0.844    0.914    0.954    0.954    0.976    0.991    0.965

F-measure values for Project 2

           Area 1   Area 2   Area 3   Area 4   Area 5   Area 6   Area 7   Area 8   Area 9   Area 10
RF+Cons    0.814    0.885    0.928    0.909    0.912    0.924    0.973    0.936    0.967    0.929
SVM+Cons   0.824    0.888    0.931    0.91     0.908    0.928    0.973    0.926    0.971    0.93
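For reference, the F-measure for a single area can be computed from true and predicted 0/1 labels as below. This is a generic sketch: it assumes the balanced F1 variant (beta = 1) on the positive class, which the slides do not state explicitly, and the labels are invented.

```python
def f_measure(y_true, y_pred, beta=1.0):
    """F-beta score for the positive class (1); beta=1 gives the usual F1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical labels for one area: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
score = f_measure(y_true, y_pred)
print(score)
```

Because each area is a separate 0/1 target, one such score is obtained per area and per classifier, which matches the layout of the tables above.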


Preprocessing: Numeric Fields

● Standardization:

z=(x-μ)/σ

● The Pearson correlation coefficient
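Both steps are short enough to write out directly. The sketch below standardizes a numeric attribute with z = (x - μ) / σ and computes the Pearson correlation coefficient between two attributes. Using the population standard deviation (dividing by n) is an assumption, as are the sample values.

```python
import math


def standardize(values):
    """Return z-scores z = (x - mu) / sigma, with sigma the population std."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    if sigma == 0:
        return [0.0] * n  # constant attribute: nothing to scale
    return [(x - mu) / sigma for x in values]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical counts for four defect reports.
comments = [2, 4, 6, 8]
attachments = [1, 2, 3, 4]

z = standardize(comments)
r = pearson(comments, attachments)
print(z, r)
```

Standardizing puts "Time to resolve" and the two count attributes on a comparable scale before distances are computed, and a correlation close to 1 or -1 between two attributes would suggest that they carry largely the same information.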


Clustering

C = {c1, c2, ..., ck, ..., cg}

ck = {dj, dq | dj ∈ D, dq ∈ D, distance(dj, dq) < s},

where s is a value that defines the proximity of objects to be included in one cluster.
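The presentation names k-means as the clustering algorithm. A minimal sketch of the standard k-means iteration (Lloyd's algorithm) on numeric vectors is given below. It is not the authors' implementation: it assumes squared Euclidean distance, takes the first k points as initial centroids purely to keep the example deterministic (random or k-means++ initialization is more usual), and ignores categorical attributes, which would need encoding or a mode-based update.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100):
    """Plain Lloyd's algorithm; returns (assignment, centroids)."""
    centroids = list(points[:k])  # deterministic init, chosen for the example only
    assignment = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical standardized feature vectors forming two obvious groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
assignment, centroids = k_means(points, k=2)
print(assignment, centroids)
```

The final centroids play the same role as the "Final Centroids" tables later in the deck: each cluster is summarized by the typical attribute values of its members.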


The Results of the Validity Indices

Index                  Project   Count of clusters
                                 2        3        4        5            6            7        8        9
Silhouette Index       1         0.9993   0.9996   0.9999   1            1            0.6489   0.6335   0.6375
Davies-Bouldin Index   1         0.2733   0.2445   0.1098   0.0488       4.5635e-04   0.2367   0.3278   0.5492
Silhouette Index       2         0.9997   0.9997   0.9999   1            0.57         0.6284   0.6381   0.6002
Davies-Bouldin Index   2         0.2364   0.3939   0.1413   2.9406e-04   0.3364       0.4304   0.5329   0.6636
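To make the two indices concrete, the sketch below implements them in plain Python with Euclidean distance and then picks the number of clusters with the lowest Davies-Bouldin value from the figures in the table above. The implementations follow the textbook definitions and are not taken from the authors' tooling; the toy points and labels are invented, and at least two clusters are assumed.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group(points, labels):
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    return clusters


def silhouette_index(points, labels):
    """Mean silhouette over all points; higher is better (max 1)."""
    clusters = group(points, labels)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        b = min(sum(euclidean(p, q) for q in members) / len(members)
                for other, members in clusters.items() if other != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def davies_bouldin_index(points, labels):
    """Mean worst-case (scatter_i + scatter_j) / centroid distance; lower is better."""
    clusters = group(points, labels)
    cent = {k: tuple(sum(col) / len(m) for col in zip(*m)) for k, m in clusters.items()}
    scatter = {k: sum(euclidean(p, cent[k]) for p in m) / len(m)
               for k, m in clusters.items()}
    worst = [max((scatter[i] + scatter[j]) / euclidean(cent[i], cent[j])
                 for j in clusters if j != i) for i in clusters]
    return sum(worst) / len(worst)


# Invented toy data: a sensible labelling versus a poor one.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
good, bad = [0, 0, 0, 1, 1, 1], [0, 1, 0, 1, 0, 1]
print(silhouette_index(points, good), davies_bouldin_index(points, good))

# Davies-Bouldin values reported in the table above, keyed by cluster count.
db_project1 = {2: 0.2733, 3: 0.2445, 4: 0.1098, 5: 0.0488,
               6: 4.5635e-04, 7: 0.2367, 8: 0.3278, 9: 0.5492}
db_project2 = {2: 0.2364, 3: 0.3939, 4: 0.1413, 5: 2.9406e-04,
               6: 0.3364, 7: 0.4304, 8: 0.5329, 9: 0.6636}
best_k_p1 = min(db_project1, key=db_project1.get)
best_k_p2 = min(db_project2, key=db_project2.get)
print(best_k_p1, best_k_p2)
```

Taking the minimum of the reported Davies-Bouldin values gives 6 clusters for Project 1 and 5 for Project 2, which agrees with the number of clusters in the final centroid tables; the Silhouette index also reaches its maximum of 1 at those counts.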


Approach: Clustering


Final Centroids of the First Project

Attribute              Cluster0      Cluster1      Cluster2      Cluster3      Cluster4      Cluster5
Defects in cluster     653           175           909           426           208           424
Priority               Priority3     Priority3     Priority3     Priority2     Priority2     Priority2
Status                 Status1       Status1       Status1       Status1       Status1       Status1
Resolution             Resolution2   Resolution2   Resolution2   Resolution2   Resolution2   Resolution2
Time to resolve        -0.112        0.2846        0.075         -0.2365       -0.2172       0.2385
Count of comments      -0.0398       -0.3405       -0.0426       0.1529        -0.2418       0.2581
Count of attachments   -0.1479       -0.1257       0.1182        0.1012        -0.1763       0.0111
Area 1                 0             0             0             0             0             0
Area 2                 0             0             0             0             1             0
Area 3                 0             0             0             0             0             0
Area 4                 0             0             1             0             0             0
Area 5                 0             0             0             0             0             1
Area 6                 0             0             0             1             0             0
Area 7                 0             1             0             0             0             0
Area 8                 1             0             0             0             0             0


Final Centroids of the Second Project

Attribute              Cluster0      Cluster1      Cluster2      Cluster3      Cluster4
Defects in cluster     578           1855          2051          659           645
Priority               Priority3     Priority2     Priority2     Priority3     Priority3
Status                 Status3       Status1       Status1       Status1       Status1
Resolution             Resolution1   Resolution2   Resolution2   Resolution2   Resolution2
Time to resolve        -0.4452       -0.0537       -0.2282       0.8606        0.3999
Count of comments      0.5361        -0.1157       -0.1576       -0.1688       0.526
Count of attachments   0.0263        0.1243        -0.181        0.0025        0.1921
Area 1                 0             0             0             0             0
Area 2                 0             0             0             0             0
Area 3                 0             0             0             0             0
Area 4                 0             0             0             0             0
Area 5                 0             0             0             0             0
Area 6                 0             0             0             0             1
Area 7                 0             0             0             0             0
Area 8                 0             0             0             1             0
Area 9                 0             1             0             0             0
Area 10                0             0             0             0             0


How Clustering Results Affect the Testing Strategy


Conclusions

1. Using an unconventional set of attributes for the cluster analysis.

2. Using the k-means algorithm, with the number of clusters set by calculating the Silhouette and the Davies-Bouldin indices.

3. Giving a description and interpretation of the resulting clusters.


Future work

● Building an automated recommendation system for Project Managers and QA Team Leads;

● Improving the existing processes of developing testing strategies and plans;

● Analyzing threats to validity.


Thank you!


Related work

1. Bhattacharya P., Neamtiu I. Bug-fix time prediction models: Can we do better? / In Proc. 8th Working Conf. Mining Software Repositories, P. 207–210. New York, NY, USA: ACM, 2011.

2. Chaddock R.E. Principles and methods of statistics, First Edition. Cambridge: Houghton Mifflin Company, The Riverside Press, 1952.

3. Gromova A.O. Defect Report Classification in Accordance with Areas of Testing / Proceedings of TMPA 2017 Conference. To be published in Springer CCIS series in 2017.

4. Fry Z.P., Weimer W. Clustering static analysis defect reports to reduce maintenance costs / In Proc. Working Conference on Reverse Engineering (WCRE), P. 282–291, 2013.

5. Guo P.J., Zimmermann T., Nagappan N., Murphy B. Characterizing and predicting which bugs get fixed: An empirical study of Microsoft Windows / In Proc. 32nd ACM/IEEE Int. Conf. Software Eng., vol. 1, ser. ICSE ’10, P. 495–504. New York, NY, USA: ACM, 2010.

6. Hooimeijer P., Weimer W. Modeling bug report quality / In ASE ’07: Proceedings of the twenty-second IEEE/ACM International Conference on Automated Software Engineering, P. 34–43, 2007.

7. Lamkanfi A., Demeyer S., Soetens Q., Verdonck T. Comparing mining algorithms for predicting the severity of a reported bug / In Proc. 15th Eur. Conf. Software Maintenance Reengineering (CSMR), P. 249–258, 2011.

8. Limsettho N., Hata H., Monden A., Matsumoto K. Automatic Unsupervised Bug Report Categorization / In 2014 6th International Workshop on Empirical Software Engineering in Practice, P. 7–12, 2014.

9. Minh P.N. An Approach to Detecting Duplicate Bug Reports using N-gram Features and Cluster Chrinkage Technique // International Journal of Scientific and Research Publications (IJSRP), Volume 4 (5), 2014.

10. Rus V., Nan X., Shiva S., Chen Y. Clustering of Defect Reports Using Graph Partitioning Algorithms / In Proc. of the 21st International Conference on Software Engineering and Knowledge Engineering, P. 442–445, 2009.

11. Nagwani N.K., Bhansali A. A data mining model to predict software bug complexity using bug estimation and clustering / In Proc. 2010 Int. Conf. Recent Trends Inf., Telecommun., Comput., ser. ITC ’10, P. 13–17. Washington, DC, USA: IEEE Computer Society, 2010.

12. Strate J.D., Laplante P.A. A literature review of research in software defect reporting // IEEE Transactions on Reliability, vol. 62, 2013, P. 444–454.

13. Weiss C., Premraj R., Zimmermann T., Zeller A. How long will it take to fix this bug? / In Proc. 4th Int. Workshop Mining Software Repositories, ser. MSR ’07. Washington, DC, USA: IEEE Computer Society, 1, 2007.

14. Zhou Y., Tong Y., Gu R., Gall H.C. Combining Text Mining and Data Mining for Bug Report Classification / In Proc. of 30th International Conference on Software Maintenance and Evolution (ICSM/ICSME), IEEE, P. 311–320, 2014.


The Area of Testing: Example

[Diagram; the labels recovered from it are listed below.]

CR: T1: Property1 = true; T2: Property1 = true

Market Structure Document: Ti: Property1 = false

Current situation: Market Structure: T1: Property1 = true; Gateway: T1: Property1 = NULL