International Journal of Mechanical Engineering and Technology (IJMET) Volume 8, Issue 11, November 2017, pp. 786–796, Article ID: IJMET_08_11_081 Available online at http://www.iaeme.com/IJMET/issues.asp?JType=IJMET&VType=8&IType=11 ISSN Print: 0976-6340 and ISSN Online: 0976-6359 © IAEME Publication Scopus Indexed

TEST CASE SELECTION USING LOGISTIC REGRESSION PREDICTION MODEL

Kiran Jammalamadaka, V Ramakrishna

Department of Computer Science and Engineering, KL University, Vaddeswaram, Vijayawada, AP, India.

ABSTRACT

Test case selection plays a major role in reducing the regression cycle time. This is an important aspect of the Agile development methodology, where timely software delivery with assured quality is expected. Agile's iterative nature of software delivery often leads to re-testing the same program or code lines repeatedly. Many researchers have focused on this area and proposed several models for prioritizing and selecting test cases with different techniques. However, the focus has been on paring down the superset of test cases to derive a subset of it. This may not work in Agile, where software delivery is very dynamic and ever changing, so test cases need to change in every iteration. Verifying and validating any changed program involves understanding the change made in the program and identifying the related or impacted test cases, which again is tedious and time consuming.

In this paper, we present a methodology based on the logistic regression prediction model from data mining to predict the lines of code covered by each test case without executing the program again and again, thus making the identification of test cases for a modified program much easier and more systematic.

Keywords: Testing, executed code traces, data mining, binomial, response based prediction, confusion matrix, test case prioritization, logistic regression model, R-language.

Cite this Article: Kiran Jammalamadaka and V Ramakrishna, Test Case Selection Using Logistic Regression Prediction Model, International Journal of Mechanical Engineering and Technology 8(11), 2017, pp. 786–796. http://www.iaeme.com/IJMET/issues.asp?JType=IJMET&VType=8&IType=11

1. INTRODUCTION

Software testing [3] cannot be separated from the software development life cycle; it gives stakeholders confidence that the software works as expected. Software testing normally consumes around 50% of the development time [2], although less is usually planned for it.

In a development methodology like Agile, where software is delivered iteratively, more testing of the delivered functionality is involved, because there is a high probability that delivered code will change to implement new functionality. This often involves testing for regressions [18]. Executing the test cases for every sprint is a major time-consuming task, and it keeps growing sprint after sprint as the test suite accumulates.

Automated test case generation [4] caters to this need by generating hundreds of test cases. Even a simple program can yield several test cases, and several research papers have been published in this regard [5][6][7][8][9]. Rothermel et al. [13] report a product of about 20,000 lines of code whose entire test suite requires seven weeks to run.

Selection and prioritization of test cases are the two major approaches to cost-effective test case execution [19]. Gregg Rothermel and Mary Jean Harrold [20] proposed a strategy based on code coverage.

Our approach addresses this concern by identifying, without executing all the test cases, the test cases that hit exactly the modified lines of code. We blend a data mining technique, the logistic regression model, with code coverage traces to form a prediction methodology that identifies the test cases which will hit a modified line. With a few test seeds, this methodology lets us generate the impacted lines of code for the whole program and reuse them whenever the program is modified.

2. RELATED WORK

A lot of work has been done on test suite reduction and prioritization [12] and on test case selection, using techniques such as genetic algorithms, heuristic algorithms [11], data mining classifiers [14], and control flow graphs. Many researchers focus on reducing the number of test cases by using data mining techniques such as clustering blended with software coverage, as described in [15], [16].

All these proposals are promising in serving the purpose. However, the major challenge arises while executing the test cases: the number of test cases may be very high, and most of the executed test cases may not hit the impacted area. Sometimes this significantly increases the time allocated for testing [10].

3. PROPOSED METHODOLOGY

Input: Seed test cases

Output: Regression Models

In our proposed methodology, we first generate all the test cases using automated test case algorithms. Once we have the test cases in place, we select a few test seeds, a random subset of the generated test cases, and collect code traces for the target program by executing them. We feed this data into our model and predict the line numbers for the rest of the test cases.

What makes our methodology unique is that, by executing only a few test seeds, we predict the probable code traces for the rest of the test cases without executing them. Once the model is built from the test cases and the corresponding code lines of the program, it allows us to quickly identify the impacted test cases for modified code lines.

The output of the program is a set of models based on the number of inputs. Using these models, the most impacted test cases can be identified without executing them. The models can be stored for reuse when the program is modified again, which is a common occurrence.
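As a minimal sketch of how stored predictions could be used for selection, assume the predicted coverage is kept as a 0/1 table with one row per test case and one column per source line. The table contents, the test case names and the helper `select_tests` are hypothetical and only illustrate the idea.

```r
# Hypothetical predicted coverage: rows are test cases, columns are source lines
# (1 = the test case is predicted to execute the line, 0 = it is not).
predicted <- data.frame(
  o266 = c(1, 1, 1),
  o280 = c(1, 0, 1),
  o288 = c(1, 0, 0),
  row.names = c("tc1", "tc2", "tc3")
)

# Return the names of the test cases predicted to hit at least one modified line.
select_tests <- function(coverage, modified_lines) {
  hits <- coverage[, modified_lines, drop = FALSE] == 1
  rownames(coverage)[rowSums(hits) > 0]
}

select_tests(predicted, "o288")             # "tc1"
select_tests(predicted, c("o280", "o288"))  # "tc1" "tc3"
```

Only the selected test cases would then need to be executed after a change to the listed lines.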

$$\text{Output} = a\,i_1 + b\,i_2 + c\,i_3 + \dots + n\,i_n + K$$

The above equation represents our regression model. K is the constant (intercept) term we derive for reliability, and a, b, c, d, e, f are the coefficients by which the input values are multiplied.

Below are the steps involved in deriving the model.

Figure 1 Steps to derive the model

Since we rely on code coverage, each line has only two possible states:

1. Covered by the test case: 1

2. Not covered by the test case: 0

For this requirement we chose the logistic regression model, because its output is binary in nature.

3.1. Logistic Regression Model

Let us look at data mining and its relation to the logistic regression model.

Data mining uses data analysis to predict the future; it combines machine learning, statistics, and artificial intelligence. Many fields, such as engineering and medicine, leverage data mining for their advancement.

Data mining predicts the future by creating models.

Predicting the future means predicting an outcome. If the outcome is numerical, the task is called regression; if it is categorical, it is called classification.

In classification, there are 4 major groups of algorithms:

1. Frequency table

2. Covariance matrix

3. Similarity function

4. Others

The logistic regression model comes under the covariance matrix group.

Logistic regression is suitable for binary outcomes, where only two values are possible. In our case, the criterion is whether or not the test case executed the line. Linear regression is not suited to such binary data.

Logistic regression generates a curve bounded between 0 and 1. The picture below depicts the difference between linear and logistic regression.



Figure 2 Logistic and linear regression curves
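The contrast between the two curves can also be written out. The following is the standard textbook form of both models, using the symbols of the regression equation above and assuming six inputs $i_1, \dots, i_6$ with coefficients $a, \dots, f$ and constant $K$. It is not a formula taken from the paper's figures.

```latex
\begin{align*}
% Linear regression: the prediction is an unbounded linear combination.
y &= K + a\,i_1 + b\,i_2 + \dots + f\,i_6 \\
% Logistic regression: the same linear combination z is passed through the
% logistic (sigmoid) function, so the result lies strictly between 0 and 1.
z &= K + a\,i_1 + b\,i_2 + \dots + f\,i_6 \\
p &= P(\text{line covered} \mid i_1, \dots, i_6) = \frac{1}{1 + e^{-z}} \\
% Equivalently, z is the log-odds of the line being covered.
\ln\frac{p}{1 - p} &= z
\end{align*}
```

Because $p$ is a probability, it can be thresholded to give the covered (1) or not covered (0) decision.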

4. RESULTS AND DISCUSSIONS

We validated our methodology against a simple program written in C#, which takes 6 inputs at once and checks whether each number provided is prime.

We generated 500 unique test cases using a program and fed them to the target program to get the traces for each test case.

4.1. Collect Traces for the target program

The first step is to get the code traces for each test case. For this, we instrumented the program by inserting a call to the GetFileLineNumber() method after every line to obtain the executed line numbers for each test case. We collected the line numbers in an array list and saved them in an external file.

All the code traces we collected were arranged into a proper format, as shown below.

4.2. Data mine and Convert to proper format

Once collected, the data is arranged in the following format for data analysis.

Table 1 Test cases and corresponding code traces



Table 2 Formatted data

The above table is the result of the data formatting. The columns o266, o271, etc. represent lines of the program, and a value of 1 or 0 indicates whether or not the line is executed by the test case. To illustrate, testcase1, with inputs -6, -33, -11, 61, -33, -40, covers the lines o266, o271, o275, o280, o288, o295. Similarly, the test case with inputs -28, 90, 71, -29, 48, -12 covers all the lines covered by testcase1 except o288, which is represented by placing 0 in the o288 column.
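The formatting step can be sketched in R as follows. The two test cases and their covered lines come from the illustration above. The names `tc1`, `tc2`, the input column names `i1` to `i6` and the variable names are assumptions made for this sketch.

```r
# Raw traces: one vector of executed line numbers per test case.
traces <- list(
  tc1 = c(266, 271, 275, 280, 288, 295),
  tc2 = c(266, 271, 275, 280, 295)
)

# The six inputs of each test case (column names i1..i6 are assumed).
inputs <- rbind(
  tc1 = c(-6, -33, -11, 61, -33, -40),
  tc2 = c(-28, 90, 71, -29, 48, -12)
)
colnames(inputs) <- paste0("i", 1:6)

lines_of_interest <- c(266, 271, 275, 280, 288, 295)

# One 0/1 column per line: 1 if the line appears in the trace, 0 otherwise.
coverage <- t(sapply(traces, function(tr) as.integer(lines_of_interest %in% tr)))
colnames(coverage) <- paste0("o", lines_of_interest)

formatted <- data.frame(inputs, coverage)
print(formatted)
```

Each row of `formatted` then holds the inputs of one test case followed by its coverage indicators, which is the shape a regression model needs.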

4.3. Implementation

The implementation was done in the R language [23] using RStudio.

R is widely used by statisticians and data miners for developing software models and for data analysis. R is open source and available under the GNU General Public License.

4.3.1. Load Data

library(readr)  # provides read_csv

Prime_Data <- read_csv(<file_path>)

summary(Prime_Data)



4.3.2. Separate train set and test set randomly

We used caret tools [24] to split the data set.

The complete data set is split into two parts: the first is used to train the model, and the second is used to validate it against a known set of data in order to understand the reliability of the model.

We also added a new index column for plotting purposes.
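A random split of this kind can be sketched with base R alone. The data below is a synthetic stand-in for the formatted data set. The 80/20 ratio, the seed and the column names are assumptions, and base `sample` is used here instead of caret so that the sketch has no package dependencies.

```r
set.seed(42)  # fixed seed so the random split is reproducible

# Synthetic stand-in: 500 test cases, six integer inputs (i1..i6) and one
# 0/1 coverage column (o280). The real data is not reproduced here.
n <- 500
Prime_Data <- data.frame(matrix(sample(-100:100, n * 6, replace = TRUE), ncol = 6))
names(Prime_Data) <- paste0("i", 1:6)
Prime_Data$o280 <- rbinom(n, 1, plogis(0.03 * Prime_Data$i1))
Prime_Data$index <- seq_len(n)  # index column, used later for plotting

# Hold out 20% of the rows for validation (the ratio is an assumption).
test_rows <- sample(n, size = round(0.2 * n))
train <- Prime_Data[-test_rows, ]
test  <- Prime_Data[test_rows, ]

nrow(train)  # 400
nrow(test)   # 100
```

Keeping the held-out rows completely separate from the training rows is what makes the later accuracy figures meaningful.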

4.3.3. Build logistic regression model on train set

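summary(model280) gives a description of the model, from which the equation can be written with actual values.

y = f(x)

$$o_{280} = 6.724350 \pm 0.022674\,i_1 \pm 0.022783\,i_2 \pm \cdots$$

(The signs of the individual terms are not recoverable from the source.)

Number of Fisher Scoring iterations: 9 (R took 9 iterations to arrive at the best model.)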
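The model-building call itself is not shown in the text. A plausible form, assuming base R's `glm` with a binomial family and input columns named `i1` to `i6`, is sketched below on synthetic training data. The name `model280` follows the text. The data and the coefficients used to generate it are made up.

```r
set.seed(1)

# Synthetic training data standing in for the train set (assumption).
n <- 400
train <- data.frame(matrix(sample(-100:100, n * 6, replace = TRUE), ncol = 6))
names(train) <- paste0("i", 1:6)
train$o280 <- rbinom(n, 1, plogis(0.5 + 0.02 * train$i1 - 0.02 * train$i2))

# family = binomial selects logistic regression; one model is built per line.
model280 <- glm(o280 ~ i1 + i2 + i3 + i4 + i5 + i6,
                data = train, family = binomial)

summary(model280)  # coefficients, deviance, Fisher scoring iterations
coef(model280)     # intercept (K) followed by the six input coefficients
```

In a logistic model, the linear expression built from these coefficients is the log-odds of the line being covered, not the 0/1 value itself.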



4.3.4. Predict with the model on the test set

4.3.5. A sample output of the prediction for o280

The predicted values are floating-point numbers, since they come from an equation. They need to be rounded to be interpreted: 0.9 is read as 1 and 0.12 as 0.
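The prediction and rounding step can be sketched as follows, again on synthetic data with only two inputs for brevity. `predictTest280` follows the naming used in the text. The 0.5 cut-off matches the threshold used for o275 and o280 in the confusion matrices. Everything else is an assumption.

```r
set.seed(2)

# Synthetic data: two inputs and a 0/1 coverage column (assumption).
n <- 500
d <- data.frame(i1 = sample(-100:100, n, replace = TRUE),
                i2 = sample(-100:100, n, replace = TRUE))
d$o280 <- rbinom(n, 1, plogis(0.02 * d$i1 - 0.02 * d$i2))
train <- d[1:400, ]
test  <- d[401:500, ]

model280 <- glm(o280 ~ i1 + i2, data = train, family = binomial)

# type = "response" returns probabilities in [0, 1] rather than log-odds.
predictTest280 <- predict(model280, newdata = test, type = "response")

# Cut at 0.5 to obtain a 0/1 prediction, e.g. 0.9 becomes 1 and 0.12 becomes 0.
predicted280 <- as.integer(predictTest280 >= 0.5)
head(data.frame(probability = predictTest280, predicted = predicted280))
```

Without `type = "response"`, `predict` returns values on the log-odds scale, which can fall outside the 0 to 1 range.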

4.3.6. Confusion Matrix [22]

A confusion matrix is often used to describe the performance of a classifier. Here there are two possible outcomes, the line was executed or it was not, and each prediction can be either right or wrong.

Suppose the model predicts that a line has been executed; in reality the line may or may not have been executed. The confusion matrix helps us understand the reliability of the model.

There are 4 possible combinations of predicted and actual values:

True positives (TP): Model predicted that lines were executed and those lines were actually executed.

True negatives (TN): Model predicted that lines were NOT executed and those lines were NOT actually executed.

False positives (FP): Model predicted that lines were executed and those lines were NOT actually executed.

False negatives (FN): Model predicted that lines were NOT executed and those lines were actually executed.

                        Predicted: not executed    Predicted: executed
Actual: not executed    TN                         FP
Actual: executed        FN                         TP

Table 3 Confusion matrix

The TP and TN cells are the correct predictions; reliability is calculated as the number of correct predictions over the total number of predictions.

Next, we evaluated each model against actual and predicted values to deduce its accuracy using the confusion matrix.

Accuracy = (TP+TN)/(FN+FP+TN+TP)
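The calculation can be sketched in R with the usual convention that TP and TN are the correct predictions. The eight actual values and probabilities below are made up for illustration.

```r
# Hypothetical actual coverage and predicted probabilities for eight test cases.
actual      <- c(1, 1, 1, 0, 0, 1, 0, 1)
probability <- c(0.92, 0.81, 0.40, 0.10, 0.65, 0.77, 0.30, 0.95)

# Rows are actual values, columns are thresholded predictions. Fixing the
# factor levels keeps both rows and both columns even when one is empty.
cm <- table(actual    = factor(actual, levels = c(0, 1)),
            predicted = factor(probability >= 0.5, levels = c(FALSE, TRUE)))
print(cm)

TN <- cm["0", "FALSE"]; FP <- cm["0", "TRUE"]
FN <- cm["1", "FALSE"]; TP <- cm["1", "TRUE"]

accuracy <- (TP + TN) / sum(cm)
accuracy  # 0.75
```

Here 4 true positives and 2 true negatives out of 8 predictions give an accuracy of 6/8.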

The data below shows the confusion matrix for each model.

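Model Name: o266
R equation: table(test$o266, predictTest266 >= mean(predictTest266))
Confusion Matrix:
      TRUE
  1     98
Accuracy%: 100

Model Name: o271
R equation: (not shown)
Confusion Matrix:
      TRUE
  1     98
Accuracy%: 100

Model Name: o275
R equation: table(test$o275, predictTest275 >= 0.5)
Confusion Matrix:
      FALSE  TRUE
  0       1     0
  1       1    96
Accuracy%: 98.98

Model Name: o280
R equation: table(test$o280, predictTest280 >= 0.5)
Confusion Matrix:
      FALSE  TRUE
  0       1     0
  1       1    96
Accuracy%: 98.98

Model Name: o288
R equation: (not shown)
Confusion Matrix:
      FALSE  TRUE
  0      36     3
  1      16    16
Accuracy%: 53.06

Model Name: o295
R equation: (not shown)
Confusion Matrix:
      TRUE
  1     98
Accuracy%: 100

Overall Accuracy (Avg.): 91.84%

Table 4 Output models with corresponding R equation, Confusion matrix and Accuracy%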
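As a worked check of the figures in the table above, the accuracy of one model and the overall average can be recomputed from the reported values. Only numbers stated in the table are used.

```latex
\begin{align*}
% o275: the matrix cells are 1, 0, 1 and 96, of which 1 + 96 are correct.
\text{Accuracy}_{o275} &= \frac{1 + 96}{1 + 0 + 1 + 96} = \frac{97}{98} \approx 98.98\% \\
% Overall accuracy is the plain average of the six per-model accuracies.
\text{Overall} &= \frac{100 + 100 + 98.98 + 98.98 + 53.06 + 100}{6}
               = \frac{551.02}{6} \approx 91.84\%
\end{align*}
```

The average is pulled down almost entirely by the o288 model, whose accuracy is far below that of the other five.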



4.3.7. Plots

Below are the plots generated for the test cases, comparing actual and predicted values.

Figure 3 o266

Figure 4 o275

[Plots for o266, o275, o271, o295, o280 and o288; each plot shows actual values against predicted values.]

5. FUTURE WORK

The proposed model currently gives around 90% prediction confidence; however, there is scope to improve it. We have considered only the logistic regression model; other models, such as regression trees and random forests, could be considered and compared to find a better one.

So far we have dealt only with modified lines of code and have not considered the addition or deletion of code lines in an existing program. This can be handled by introducing an offset based on the program change: a positive offset for added lines, a negative offset for deleted lines, and a single offset derived from the overall modification of the program.
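One possible reading of the offset idea is sketched below. It is an illustration only, not the authors' design: the edit list, its column names and the helper `shift_lines` are hypothetical, and the sketch ignores what happens to the deleted lines themselves.

```r
# Hypothetical edit list: 'at' is the old line number where a change happens,
# 'delta' is +k for k added lines and -k for k deleted lines.
edits <- data.frame(at = c(270, 285), delta = c(2, -1))

# Map line numbers of the old program to the new program by adding the
# offsets of all edits located at or before each line.
shift_lines <- function(old_lines, edits) {
  sapply(old_lines, function(l) l + sum(edits$delta[edits$at <= l]))
}

shift_lines(c(266, 271, 280, 288, 295), edits)  # 266 273 282 289 296
```

With such a mapping, the stored per-line models could be re-associated with the shifted line numbers instead of being rebuilt.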

6. CONCLUSION

The blend of data mining with software engineering is a promising area to explore for better solutions. Our model leverages a few data mining tools to extract meaningful information from the available data, so that the test case selection process is narrowed down in a more systematic and scientific manner.

Our model is easy to use and reliable for test case selection. It comes in handy when teams are under pressure to release quickly with higher quality. The proposed model is worth considering for the selection of test cases.

7. ACKNOWLEDGMENTS

I am grateful to M Rajendra Prasad for sharing his wisdom during this work, and would like to thank G Bhargav for his helpful insights.

REFERENCES

[1] Somerville, I., Software engineering, 7th Ed. Addison-Wesley,

[2] Aditya P Mathur, Foundation of Software Testing, 1st edition Pearson Education 2008.

[3] David Alex Lamb, Software Engineering, planning for change, Prentice Hall, Englewood Cliffs, NJ 07632, pp. 109–112, 1988.

[4] Ajitha Ranjan. Automated Requirements-Based test case Generation. Communications of ACM,2006

[5] Nashat Mansour, Miran Salame, Data Generation for Path Testing, Software Quality Journal, 12,121–136, 2004,Kluwer Academic Publishers.

[6] Praveen Ranjan Srivastava et al, Generation of test data using Meta heuristic approach IEEE TENCON (19-21 NOV 2008), India available in IEEEXPLORE.

[7] Wegener, J., Baresel, A., and Sthamer, H, Suitability of Evolutionary Algorithms for Evolutionary Testing, In Proceedings of the 26th Annual International Computer Software and Applications Conference, Oxford, England, August 26-29, 2002.

[8] Berndt, D.J. and Watkins A, Investigating the Performance of Genetic Algorithm-Based Software Test Case Generation, In Proceedings of the Eighth IEEE International Symposium on High Assurance Systems Engineering (HASE'04), pp. 261-262, University of South Florida, March 25-26, 2004.

[9] B. Korel. Automated software test data generation. IEEE Transactions on Software Engineering, 16(8), August 1990.

[10] M. J. Harrold, R. Gupta, and M. L. Soffa. A methodology for controlling the size of a test suite. ACM Trans. on Soft. Eng. and Meth., 2(3):270-285, 1993.

[11] Praveen Ranjan Srivastava et al, Generation of test data using Meta heuristic approach IEEE TENCON (19-21 NOV 2008), India available in IEEEXPLORE



[12] G. Rothermel, R. Untch, C. Chu and M. J. Harrold, Prioritization: A family of empirical studies, IEEE Transactions on Software Engineering 28, no. 2, pp. 159-182, Feb 2002.

[13] Rothermel, G., Untch, R. H., Chu, C., & Harrold, M. J. (2001). Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering, 27(10), 929-948.

[14] Ahmad A. Saifan, Test Case Reduction Using Data Mining Classifier Techniques. Journal of software, volume 11,656-663, 2016.

[15] Muthyala, K., & Naidu, R. (2011). A novel approach to test suite reduction using data mining. Indian Journal of Computer Science and Engineering, 2(3), 500-505.

[16] Offutt, A. J., Jin, Z., & Pan, J. (1999). The dynamic domain reduction procedure for test data generation. Software-Practice and Experience, 29(2), 167-193.

[17] Lee, E. M., & Chan, K. C. (2005). A data mining approach to discover temporal relationship among performance metrics from automated testing. Proceedings of the Ninth IASTED International Conference on Software Engineering and Applications

[18] https://en.wikipedia.org/wiki/Regression_testing

[19] M. Suppriya, A. K. Ilavarasi (2015). Test Case Selection and Prioritization Using Multiple Criteria. International Journal of Advanced Research in Computer Science and Software Engineering.

[20] Gregg Rothermel and Mary Jean Harrold, A Safe, Efficient Regression Test Selection Technique, ACM Transactions on Softw. Eng. And Methodology, 6(2): 173-210, 1997.

[21] http://www.saedsayad.com/logistic_regression.htm

[22] http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/

[23] https://en.wikipedia.org/wiki/R_(programming_language)

[24] https://en.wikipedia.org/wiki/Carrot2

[25] Songyu Chen, Zhenyu Chen, Zhihong Zhao, BaowenXu, Yang Feng, Using Semi-Supervised Clustering to Improve Regression Test Selection Techniques, Fourth IEEE International Conference on Software Testing, Verification and Validation, March 2011

[26] Siddhartha Rokade, Rakesh Kumar, Varsha Rokade, Shakti Dubey and Vaibhav Vijayawargiya Assessment of Effectiveness of Vertical Deflection Type Traffic Calming Measures and Development of Speed Prediction Models in Urban Perspective. International Journal of Civil Engineering and Technology, 8(5), 2017, pp. 1135–1146.

[27] Ciro Caliendo, Crash Prediction Models for Roads Including Rainfall and Hazardous Points. International Journal of Civil Engineering and Technology, 8(9), 2017, pp. 477–485.

[28] Dr .S. Ravichandran, Design and Development of Software Fault Prediction Model to Enhance the Software Quality level, International Journal of Information Technology and Management Information Systems (IJITMIS) Volume 1, Issue 2007, Jan–Dec 2007, pp. 01–06

[29] V.Sujatha, K.Sriraman, K. Ganapathi Babu and B.V.R.R.Nagrajuna, Testing and Test Case Generation by Using Fuzzy Logic and N.L.P Techniques, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, May-June (2013), pp. 531-538
