Slice-Based Cohesion Metrics and Software Intervention

Timothy M. Meyers and David Binkley
Loyola College in Maryland

Baltimore, Maryland 21210-2699, USA
{tmeyers, binkley}@cs.loyola.edu

Abstract

Software reconstruction is a costly endeavor, due in part to the ambiguity of where to focus reengineering effort. Cohesion metrics, and particularly quantitative cohesion metrics, have the potential to aid in this identification and to measure progress. The most extensive work on such metrics is with slice-based cohesion metrics. While their use of semantic dependence information should make them an excellent choice for cohesion measurement, their widespread use has been impeded by a lack of empirical study.

Recent advances in software tools make, for the first time, a large-scale empirical study of slice-based cohesion metrics possible. Three results from such a study are presented. First, baseline values for slice-based metrics are provided. These values act as targets for reengineering efforts, with modules having values outside the expected range being the most in need of attention. Second, two longitudinal studies show that slice-based metrics quantify the deterioration of a program as it ages. This serves to validate the metrics: the metrics quantify the degradation that occurs during development; turning this around, the metrics can be used to measure the progress of a reengineering effort. Finally, "head-to-head" qualitative and quantitative comparisons of the metrics identify which metrics provide similar views of a program and which provide unique views of a program.

1. Introduction

In principle, good cohesion metrics help guide software intervention by revealing modules within a program that are in need of reconstruction. Two important prerequisites to this identification are (1) knowing that the metrics provide a sufficient quantitative measure of cohesion (and ideally module quality), and (2) knowing which metric values indicate normality and which indicate a problem.

Most prior work on cohesion metrics has studied qualitative cohesion metrics [5]. Limited work on quantitative cohesion metrics has been undertaken. For example, Bieman et al. developed and formalized several quantitative cohesion metrics [2, 15]. To date, however, little empirical investigation of these metrics has been performed, primarily due to insufficient tool support. Empirical data provides valuable evidence (hopefully) tying the metrics to code cohesion and quality.

Assuming the existence of such a relationship, one could sort the modules of a program based on their metric values and consider restructuring those modules below some arbitrary cutoff (e.g., those in the bottom 10% of the sorted list). However, this inevitably leads to spending too much effort on "good" programs and too little effort on "bad" programs. A better approach would be to know which metric values are "normal" and which indicate the need for intervention. Then only those modules whose metric values were outside the normal range would be considered for restructuring.

Previous reengineering studies have observed the need for both prerequisites. For example, Lakhotia and Deprez, in a study on restructuring functions with low cohesion, sought to automate reengineering efforts, but affirmed that the present quantitative cohesion metrics were not adequate [12].

The most extensive research on quantitative cohesion measurement is the work of Bieman and Ott et al. [2, 15]. Their cohesion metrics are based on program slicing: a program simplification technology that removes from a program components (e.g., statements) that do not affect a computation of interest [18, 3]. The resulting (smaller) program, called a slice, captures a projection of the semantics of the original program [17, 4].
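For illustration, consider the following minimal, hypothetical C fragment (not drawn from the paper's subject programs); the comments mark which statements a backward slice taken with respect to the final value of sum would retain, following the informal definition above.

#include <stdio.h>

/* Hypothetical fragment: a backward slice on the final value of "sum"
 * keeps only the statements that can affect it.  Lines marked [out]
 * play no role in that computation and would be removed.             */
int main(void) {
    int n = 10;                  /* in slice: controls the loop       */
    int sum = 0;                 /* in slice                          */
    int prod = 1;                /* [out]: never affects sum          */
    for (int i = 1; i <= n; i++) {
        sum += i;                /* in slice                          */
        prod *= i;               /* [out]                             */
    }
    printf("%d\n", sum);         /* the slicing criterion             */
    printf("%d\n", prod);        /* [out]                             */
    return 0;
}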


Slice-based metrics, introduced more than a decade ago, exploit the semantic relationships that underlie slicing. To date, however, they have lacked intensive empirical investigation due to a lack of mature tools. For example, while many previous studies make use of slice-based metrics [18, 15, 14, 2, 8, 11, 1], they all have significant limitations. The most serious of these are the small size and limited number of the programs studied. Other limitations include slicers that do not correctly handle complex language features such as pointer variables, and slicers that ignore certain statement types, for example, declarations, gotos, breaks, etc.

Slicing technology is now mature enough to allow an extensive empirical investigation of slice-based metrics. For example, the deep-structure analysis tool CodeSurfer [7], used in this study, handles the complete C (as well as C++) language. This study has none of the limitations found in previous studies. This paper presents an empirical study of five slice-based cohesion metrics.

The study reports metric values for 22,651 procedures from 63 programs that range in size from about 500 to over 150,000 lines of code. In all, over 1.1 million lines of code were analyzed, with a total of 2,067,016 slices being taken. The results from this study form the paper's three main contributions:

- First, the most important result of this study indicates that slice-based cohesion metrics quantify overall code quality. Following a suggestion by Ott and Thuss [14], the effect of software evolution on slice-based metrics is measured. The results provide evidence that the metrics successfully quantify code quality; thus, they can be used to guide and measure software intervention efforts.

- Second, the study provides baseline values for the slice-based cohesion metrics: Tightness, MinCoverage, Coverage, MaxCoverage, and Overlap [18, 15]. Baseline values are useful in the identification of degraded modules.

- Finally, the study compares the different metrics "head-to-head." Both qualitative and quantitative comparisons show which of the metrics are strongly correlated and which provide distinct views of a program.

The remainder of this paper is organized as follows: Section 2 provides background information. The main results are presented in Section 3. These results are related to prior work in Section 4, and finally a summary is provided in Section 5.

2. Background

This section provides background material on program slicing, the metrics considered, and the statistical tests employed.

2.1. System Dependence Graph

A slice can be computed as the solution to a graph reachability problem over a program's System Dependence Graph (SDG) [10]. An SDG is a collection of Procedure Dependence Graphs (PDGs) connected by interprocedural control- and flow-dependence edges. Each PDG contains vertices that represent the components (statements and predicates) of the procedure and edges that represent the control and data-flow dependences between components. In addition, at call sites there are transitive dependence edges that summarize transitive dependences induced by called procedures. These dependences capture the effects of the initial values of parameters and global variables on their values after the call. The dependences in a PDG are safe approximations to the semantic dependences found in the program [16], which are in general not computable and therefore must be approximated.

A slice of SDG G, taken with respect to a set of vertices S, contains those vertices of G whose components potentially affect the components represented in S [10]. To compute metric values for each procedure, an intraprocedural slice, which includes only vertices from a single procedure, is used. This slice can be computed as a simple transitive closure over the dependence edges in a PDG. However, as the slice makes use of the SDG's summary edges, it correctly accounts for the dependence effects of transitively called procedures.
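The following is a minimal sketch, in C, of that transitive-closure computation: a worklist walk backwards over dependence edges. The fixed-size adjacency-list representation and all names here are simplifying assumptions for illustration, not CodeSurfer's actual data structures.

#include <stdbool.h>

#define MAX_VERTICES 1024   /* assumed capacity for this sketch        */
#define MAX_PREDS      16   /* assumed max dependence predecessors     */

/* A vertex's predecessors are the vertices it directly depends on via
 * control, data-flow, or (at call sites) summary edges.               */
typedef struct {
    int n;                               /* number of PDG vertices     */
    int deg[MAX_VERTICES];               /* predecessor counts         */
    int pred[MAX_VERTICES][MAX_PREDS];   /* predecessor lists          */
} PDG;

/* Mark every vertex that can reach a vertex of the criterion set S by
 * following dependence edges backwards; in_slice[] holds the result.  */
void slice(const PDG *g, const int *S, int nS, bool in_slice[]) {
    int work[MAX_VERTICES];
    int top = 0;
    for (int v = 0; v < g->n; v++)
        in_slice[v] = false;
    for (int i = 0; i < nS; i++) {       /* seed with the criterion    */
        in_slice[S[i]] = true;
        work[top++] = S[i];
    }
    while (top > 0) {
        int v = work[--top];
        for (int i = 0; i < g->deg[v]; i++) {
            int u = g->pred[v][i];       /* u directly affects v       */
            if (!in_slice[u]) {
                in_slice[u] = true;
                work[top++] = u;
            }
        }
    }
}

Because summary edges appear as ordinary predecessors in this representation, the walk never descends into called procedures yet still reflects their dependence effects.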

2.2. Slice-Based Metrics

In his original work on program slicing, Mark Weiser proposed using metrics based on program slicing to measure program cohesion and continuity [18]. He informally presented five slice-based metrics: Tightness, Coverage, Overlap, Parallelism, and Clustering [18]. Of these, Ott and Thuss formalized all but Clustering, observing that it was too vague to properly formalize [15]. Additionally, Parallelism has not proven useful in software reconstruction. One reason for this is that it is considerably more complex and therefore less intuitive than the other metrics. While in theory higher values of Parallelism indicate that a module contains multiple unrelated or only slightly related threads (suggesting it be targeted for reconstruction), other slice-based metrics provide similar information and are significantly easier to comprehend; thus, Parallelism is not considered herein.

Tightness: $Tightness(M) = \frac{|SL_{int}|}{length(M)}$
  Tightness measures the number of statements in every slice.

MinCoverage: $MinCoverage(M) = \frac{1}{length(M)} \min_i |SL_i|$
  MinCoverage is the ratio of the smallest slice in a module to the module's length.

Coverage: $Coverage(M) = \frac{1}{|V_O|} \sum_{i=1}^{|V_O|} \frac{|SL_i|}{length(M)}$
  Coverage compares the length of slices to the length of the entire module.

MaxCoverage: $MaxCoverage(M) = \frac{1}{length(M)} \max_i |SL_i|$
  MaxCoverage is the ratio of the largest slice in a module to the module's length.

Overlap: $Overlap(M) = \frac{1}{|V_O|} \sum_{i=1}^{|V_O|} \frac{|SL_{int}|}{|SL_i|}$
  Overlap is a measure of how many statements in a slice are found only in that slice.

Figure 1. The Metrics

The formalization of the three remaining metrics is done in terms of the variables $V_M$, the set of variables used by module $M$ (here a module refers to the unit of code being considered, e.g., a procedure); $V_O$, the subset of $V_M$ that are output variables (e.g., a function's return value or those globals modified by the function); $SL_i$, the slice obtained for $v_i \in V_O$; and $SL_{int}$, the intersection of $SL_i$ over all $v_i \in V_O$.

Figure 1 shows Weiser's metrics (except Clustering and Parallelism), and MinCoverage and MaxCoverage, two additional cohesion metrics proposed by Ott and Thuss [15]. The figure also provides an informal description of each metric.

EXAMPLE. Figure 2 illustrates the computation of these metrics using the example program shown in Figure 3. For the purpose of this example, $V_O$ contains smallest and largest, the two variables output by printf statements.
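To make the arithmetic concrete, the following C sketch computes the five metrics from the slice sizes of Figure 3 (module length 19, |SL_smallest| = 14, |SL_largest| = 16, |SL_int| = 11). It is an illustration of the definitions in Figure 1, not the study's tooling; running it reproduces the values in Figure 2.

#include <stdio.h>

int main(void) {
    double length = 19.0;             /* length(M)                   */
    double sl[]   = { 14.0, 16.0 };   /* |SL_smallest|, |SL_largest| */
    int    nO     = 2;                /* |V_O|                       */
    double sl_int = 11.0;             /* |SL_int|                    */

    double min = sl[0], max = sl[0], sum = 0.0, overlap = 0.0;
    for (int i = 0; i < nO; i++) {
        if (sl[i] < min) min = sl[i];
        if (sl[i] > max) max = sl[i];
        sum     += sl[i];
        overlap += sl_int / sl[i];
    }
    printf("Tightness   = %.2f\n", sl_int / length);     /* 0.58 */
    printf("MinCoverage = %.2f\n", min / length);        /* 0.74 */
    printf("Coverage    = %.2f\n", (sum / nO) / length); /* 0.79 */
    printf("MaxCoverage = %.2f\n", max / length);        /* 0.84 */
    printf("Overlap     = %.2f\n", overlap / nO);        /* 0.74 */
    return 0;
}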

2.3. Statistical Tests

Pearson's linear correlation is used to quantify the relation between metrics. Such correlations measure linear associations between variables. The output is a correlation coefficient, reported as the value R, and the coefficients of a linear model. The statistical significance of R can be summarized as: 0.75 to 1.0, strong linear association; 0.5 to 0.75, moderate linear association; and 0.0 to 0.5, weak or no linear association. A negative value indicates an inverse correlation. For the effect of X, Y, and Z on A, the resulting model coefficients, $m_i$, belong to the linear equation $A = m_1 X + m_2 Y + m_3 Z + b$.
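As a concrete companion to this description, here is a minimal C sketch of Pearson's R; the function and the sample data are illustrative assumptions, not the study's actual analysis code.

#include <math.h>
#include <stdio.h>

/* Pearson's linear correlation coefficient R for two metric-value
 * samples x and y of equal length n.                                 */
static double pearson_r(const double *x, const double *y, int n) {
    double mx = 0.0, my = 0.0;
    for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= n;
    my /= n;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (int i = 0; i < n; i++) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    return sxy / sqrt(sxx * syy);  /* undefined if either variance is 0 */
}

int main(void) {
    /* Hypothetical per-module Tightness and MinCoverage values. */
    double tightness[]   = { 0.10, 0.25, 0.40, 0.55, 0.80 };
    double mincoverage[] = { 0.15, 0.30, 0.42, 0.60, 0.85 };
    printf("R = %.3f\n", pearson_r(tightness, mincoverage, 5));
    return 0;
}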

3. The Study

This section reports three results from an empirical study of slice-based metrics. First, baseline metric values for the five metrics are presented. These are followed by the two longitudinal studies. Finally, this section considers quantitative and qualitative comparisons between the slice-based metrics.

3.1. Baseline Metric Values

In order to identify aberrant functional behavior, one must first understand what is "normal". Thus, the first goal of this study: to generate baseline values for the metrics shown in Figure 1. These values provide empirical evidence for the slice-based cohesion metrics studied by Ott and Thuss [15]. The metrics Tightness, MinCoverage, Coverage, and MaxCoverage are ratios of a particular slice size (or set of slices) to the length of the module.

Metric        Computation                   Value
Tightness     11/19                         0.58
MinCoverage   14/19                         0.74
Coverage      1/2 (14/19 + 16/19) = 15/19   0.79
MaxCoverage   16/19                         0.84
Overlap       1/2 (11/14 + 11/16)           0.74

Figure 2. Example Metrics Computations

Program                        SL_smallest  SL_largest  SL_int
main() {
  int i;                            |            |         |
  int smallest;                     |            |         |
  int largest;                                   |
  int A[10];                        |            |         |
  for (i = 0; i < 10; i++) {        |            |         |
    int num;                        |            |         |
    scanf("%d", &num);              |            |         |
    A[i] = num;                     |            |         |
  }
  smallest = A[0];                  |            |         |
  largest = smallest;                            |
  i = 1;                            |            |         |
  while (i < 10) {                  |            |         |
    if (smallest > A[i])            |
      smallest = A[i];              |
    if (largest < A[i])                          |
      largest = A[i];                            |
    i = i + 1;                      |            |         |
  }
  printf("%d \n", smallest);        |
  printf("%d \n", largest);                      |
}
length: 19                         14           16        11

Figure 3. An example program. Vertical bars in the last three columns indicate membership in the slice taken with respect to the output of smallest, the output of largest, and their intersection, respectively.

Based on their definitions, the value of Tightness is always the smallest, followed by MinCoverage, Coverage, and finally MaxCoverage. Thus, these four metrics are always presented in this order.

Overlap is an average percentage of common vertices within a module. It is not mathematically bounded between any of the other metrics.
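The ordering claim follows directly from the definitions in Figure 1: $SL_{int} \subseteq SL_i$ for every $v_i \in V_O$, and a minimum never exceeds an average, which never exceeds a maximum, so

$$\frac{|SL_{int}|}{length(M)} \;\le\; \frac{\min_i |SL_i|}{length(M)} \;\le\; \frac{1}{|V_O|} \sum_{i=1}^{|V_O|} \frac{|SL_i|}{length(M)} \;\le\; \frac{\max_i |SL_i|}{length(M)},$$

that is, $Tightness(M) \le MinCoverage(M) \le Coverage(M) \le MaxCoverage(M)$.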

Summary statistics based on the data for all 22,651 modules (without regard for source program) are given in Figure 4. The averages given provide expected values for the slice-based metrics. The standard deviations illustrate the dispersion of the data. For example, in the case of Overlap, the standard deviation of 0.3454 says that most of the Overlap points fall within a range of size 0.6908 centered about the mean, or roughly 70% of all possible values. This wide dispersion indicates that the Overlap metric is very sensitive. Conversely, MaxCoverage is the least sensitive metric. Its standard deviation of 0.1547 indicates that MaxCoverage's values are dispersed over approximately 30% of the range [0.0, 1.0]. One use of this observation is its implication that outlier values of MaxCoverage are more significant than those of Overlap.

The confidence intervals, calculated at the 95% confidence level, demonstrate that the sample averages shown in Figure 4 are good representatives for the population of all programs. For example, it can be said with 95% confidence that, for the population of all programs, the average value of Tightness falls between 0.2973 and 0.3039. The other metric averages exhibit similarly small confidence intervals, and thus the values from Figure 4 provide good baseline values for the five slice-based metrics studied.
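The paper does not spell out the interval construction, but the reported half-widths are consistent with the standard large-sample normal approximation (a sketch, assuming $n = 22{,}651$ modules and the standard deviations of Figure 4):

$$\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}} \quad\Longrightarrow\quad 0.3006 \pm 1.96 \cdot \frac{0.2556}{\sqrt{22651}} \approx 0.3006 \pm 0.0033,$$

which reproduces the Tightness interval [0.2973, 0.3039]; with $\sigma = 0.3454$, the same formula yields Overlap's half-width of 0.0045.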

Figure 5 graphically shows average metric values on a per-program basis. The bold line represents Overlap, while the remaining four lines represent, in order, Tightness, MinCoverage, Coverage, and MaxCoverage. Recall that by definition, these four lines will never cross. In Figure 5, the programs are sorted by increasing Tightness. Sorting on Tightness helps to visually illustrate several characteristics; the same observations can be made without sorting or by sorting by another metric. First, observe the wider range seen in Tightness, MinCoverage, and Overlap. These metrics appear to provide better discrimination ability between programs.

                     SDG       number              Min-               Max-
                     vertices  of slices  |SL_int| Tightness Coverage Coverage Coverage Overlap
average                897.35     84.65     276.17    0.3006   0.3387   0.5402   0.6453  0.5436
standard deviation   2,398.45    139.91   1,100.46    0.2556   0.2570   0.1693   0.1547  0.3454
confidence interval     31.23      1.82      14.33    0.0033   0.0033   0.0022   0.0020  0.0045

Figure 4. Metric averages over all 22,651 modules.

Figure 5. Metric averages for each subject program, sorted on Tightness. The bold line represents Overlap, while the remaining lines represent, from bottom to top, Tightness, MinCoverage, Coverage, and MaxCoverage.

Second, as a general trend, all of the metrics increase with the increasing value of Tightness. Third, the sensitivity of MinCoverage to changes in Tightness appears low, as it experiences only a few small deviations (see programs bc, gnugo, and wpst) from its tracking of Tightness. In contrast, Coverage, MaxCoverage, and Overlap are all far more sensitive (each displaying rather bumpy progressions).

Finally, Overlap does not always follow the same trend as the other metrics. In places it shows an inverse movement when compared to the other metrics (especially Coverage and MaxCoverage). For example, this occurs towards the left of the graph, near ntpd. The following can be said about this (and the other) inversions. First, ntpd and its neighbor empire include a wide range of module sizes; thus, their similar value of Tightness implies that |SL_int| remains relatively constant. Looking at its definition, a relatively high value of Overlap indicates a comparatively smaller average slice size or a larger intersection. Together these observations imply that ntpd has a smaller average slice size.

Figure 6. Longitudinal study of barcode. (a) Growth. (b) Metric Values.

This observation is confirmed by ntpd's comparatively low value of Coverage, which also indicates a smaller average slice size. Finally, ntpd's comparatively low value of MaxCoverage indicates the absence of large slice-size outliers; thus, all slices have a similar size. It is believed that smaller intraprocedural slice size indicates a more modular design and thus higher code quality; a formal investigation of this conjecture is left as future work.

One final anticipated result is evidenced in Figure 5. Two "families" of programs were included among the 63 programs studied. These are the programs prefixed with cook- and snns-. As the programs from each family share a significant common code base, it is not surprising that they have similar metric values.

3.2. Longitudinal Studies

Ott and Thuss [14] suggest an investigation into the sensitivity of slice-based metrics to subsequent releases of a program.

Figure 7. Longitudinal study of gnugo. (a) Growth. (b) Metric Values. Major releases are shaded.

This section presents two such longitudinal studies: the first zooms in on a sequence of minor releases for barcode and the second spans several major releases of gnugo.

The results are summarized in Figures 6 and 7. To begin with, Figure 6a shows growth trends for barcode. These include the significant growth in module size and the less dramatic growth in slices per module and |SL_int|.

To understand how to interpret trends in these graphs, consider an idealized target program in which each module contains a single "thought." Progress towards such a target would appear as a convergence between module size and |SL_int|. Thus, the divergence in Figure 6a suggests deterioration of program cohesion and continuity. In other words, software evolution and maintenance appear to be deteriorating the quality of barcode.

Figure 6b shows average metric values for the five metrics. The bold line is Overlap, which can run contrary to the other metrics. Tightness, MinCoverage, Coverage, and MaxCoverage bound one another from lowest to highest, respectively. In this graph there is a visible negative slope among all the metrics. With the exception of Overlap, all of the metrics indicate the maximum cohesion within the first or second release and continue to fall thereafter. The deterioration is not very pronounced as only minor releases are considered. Even so, all five of the slice-based metrics provide evidence of the degeneration.

Like barcode, gnugo's average module size (Figure 7a) exhibits general growth over the span of releases considered. These characteristics are more pronounced with gnugo as the span of releases includes two major releases. It is interesting to note that the only significant decline in average module size accompanies the major release gnugo 3.0. The only other decline is in the very next revision (perhaps the completion of the reengineering effort). Immediately following this, gnugo experiences its largest growth over the considered releases.

Gnugo 3.0 is also interesting because it is the only instance of convergence between module size and |SL_int|. This supports the notion that more thought and effort goes into a major release as opposed to seemingly atrophic minor releases.

The graph of metric values for gnugo (Figure 7b) shows a clear negative trend among all the metrics. MinCoverage, Coverage, and MaxCoverage are all in a state of constant decay with the exception of gnugo 3.0. Similarly, Tightness only experiences improvement for the major releases gnugo 2.0 and gnugo 3.0. With the exception of Overlap, all of the metrics experience their maximum value within the first or second release and continue to fall thereafter.

The above data suggests that slice-based metrics have the potential to quantify the deterioration in cohesion that a program often undergoes as it ages. In other words, it is possible to validate the metrics' use in future projects by examining their quantitative behavior in specified situations and determining if the metrics preserve the desired ordering. Inspection of the source code for the different versions of gnugo reveals that this, in fact, occurs. The inspection uncovered the expected patching taking place between the minor releases. In contrast, the major releases clearly were given more thought, especially gnugo 3.0, where the source is given a significant overhaul. These findings are encouraging as all the metrics, except Overlap, reflect this dramatic improvement. Why Overlap runs contrary to this trend is under investigation.

x             y             R      R²
MinCoverage   Tightness     0.977  0.955
Overlap       Tightness     0.916  0.839
Overlap       MinCoverage   0.896  0.803
MaxCoverage   Coverage      0.793  0.628
Coverage      Tightness     0.625  0.391
Coverage      MinCoverage   0.602  0.362
Tightness     MaxCoverage   0.379  0.144
MinCoverage   MaxCoverage   0.367  0.135
Coverage      Overlap       0.356  0.126
MaxCoverage   Overlap       0.154  0.024

Figure 8. Pearson R and R² values for head-to-head metric comparisons.

3.3. Metric Comparisons

In this section, the slice-based metrics are compared to each other both quantitatively and qualitatively. The quantitative investigation applies Pearson's test to the values produced for each pair of metrics. Pearson's correlation coefficient, R, shows the level of linear association between the two metrics, with higher R values depicting stronger correlations.

Figure 8 displays the results ordered by descending R value. It also displays R² values because they tend to be easier to understand. An R² value is typically taken to indicate the percentage of dependence. For example, MinCoverage and Tightness have an R² value of 0.955; one could say that the value of MinCoverage is 95.5% dependent on the value of Tightness. Three interesting relationships highlight Figure 8. First, the top of the table is dominated by the pair-wise comparisons between MinCoverage, Tightness, and Overlap, which show the strongest correlations. Second, of all the metrics, MaxCoverage has the least correlation with any of the others. Finally, Overlap is the most varied, having a strong linear relationship with Tightness and MinCoverage, but no linear relationship with Coverage and MaxCoverage.

Figures 9, 10, and 11 plot example comparisons from the three ranges. For example, in Figure 9, the high correlation between MinCoverage and Tightness is evident in the left graph. (In these plots, a perfect correlation with R = 1.0 would appear as a solid line having the equation y = x.)

It is interesting to note that the two graphs showing weak linear association (Figure 10) each appear to include two distinct populations. Consider the comparison of Tightness and Coverage. One population shows a strong linear correlation. It appears as a downward bulge from the line y = x centered at about (0.60, 0.60).

Figure 9. Example comparisons showing strong linear correlation. Trend lines appear in grey.

Figure 10. Comparisons showing weak linear correlation. Trend lines appear in grey.

The other population shows no correlation. It appears as an upward bulge from the line y = 0 centered at about x = 0.40. While several explanations for the bi-modal nature of the data were considered, none have, thus far, proved satisfactory.

Finally, Figure 11 shows two of the four graphs for comparisons showing no linear correlation, which is visually obvious. Because highly correlated metrics are effectively redundant, the uncorrelated metrics could potentially be important in practice. For example, it is clear from Figure 11 that Overlap and MaxCoverage provide two distinct viewpoints of a program and each could potentially be useful in its own way.

4. Related Work

This section considers research related to the empirical investigation of slice-based cohesion metrics. Broader motivation for the (empirical) study of metrics in general, and cohesion metrics in particular, has recently been argued by Mens and Demeyer in their classification of approaches that use metrics [13]. For example, Demeyer et al. describe a reverse engineering approach that combines metrics and visualization [6]. They purposefully chose to work with a collection of simple metrics (parameter count, number of (static) invocations, etc.). They report that the combination shows promise. However, they do question whether "a simple measurement is a [sic] sufficient to assess such [a] complex thing as software quality." It would be interesting to repeat their experiment using the more sophisticated dependence-based metrics studied herein.

The remainder of this section describes work on slice-based metrics in roughly chronological order. It begins with Weiser's original work, followed by the considerations of metric slices, control and data metrics, "glue" tokens, and the granularity of metrics. Finally, a study by Karstu will be considered.

Weiser applied the metrics from Figure 1 to several student-written load-and-go compilers as a proof of concept [18]. Their further study by Ott and Thuss led to the development of the metric slice [15]. A metric slice on v is the union of the backward slice on v taken from the end of the module and the forward slice computed from every statement that defines v within the backward slice. A collection of carefully chosen examples was used to motivate the use of metric slices over traditional slices. Unfortunately, a metric slice is likely to be larger than a traditional slice, with a corresponding drop in usefulness.

Figure 11. Example comparisons showing no linear correlation. Trend lines are not statistically significant.

Neither Weiser nor Ott and Thuss undertake a large-scale empirical study of these metrics, as it was infeasible at the time.

Bieman and Ott recast slice-based metrics in terms of two kinds of "glue" tokens [2]: glue and super-glue tokens, and then used them to define and study three cohesion metrics. Unfortunately, in their work, Bieman and Ott treat only a few small code segments.

Harman et al. expand the work of Ott et al. by considering the elements that each metric is based on [14]. The result is a collection of "fine-grain" metrics obtained by replacing the "number of statements" with, for example, the number of distinct variables in an expression. They conclude that certain fine-grain metrics produce the intuitively expected result for a carefully chosen example. However motivational, this example is only eight lines of code.

Finally, in his master's thesis, Karstu describes empirical work on slice-based metrics [11]. Like the research proposed herein, Karstu's goals include investigating the usefulness of slice-based cohesion measures for real software. (At the time of its writing, all previous studies had been based on non-production software. For example, Thuss' study was based on student Pascal programs and ignored key language features such as pointers.) Karstu's slicer had its limitations. For example, pointers were treated as regular variables. In contrast, the slicer used in the study presented herein performs extensive points-to analysis and handles the complete C language; thus, it does not have these limitations.

In comparison with slice-based cohesion metrics, slice-based coupling metrics have not been well studied. Harman et al. considered slice-based coupling in a short position paper [9]. The coupling between functions f and g is defined as the ratio of the number of elements from f included in slices with respect to the elements of g to the number of elements in f. Harman et al. lay out a framework for applying slicing to coupling and consider two small examples.
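Written out, the ratio just described can be formalized as follows (a hedged reconstruction; the notation $Slice(e)$, for the slice taken with respect to element $e$, is introduced here for illustration):

$$Coupling(f, g) \;=\; \frac{\bigl|\{\, s \in f : s \in \bigcup_{e \in g} Slice(e) \,\}\bigr|}{|f|}$$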

5. Summary and Future Work

This paper presents a large-scale empirical investigation of slice-based metrics and provides three key contributions to reverse engineering:

- Given the large code base used, the values reported herein provide good estimates of expected metric values. These values can be used at the module level to focus the attention of reverse engineers on particularly abject modules.

- The second result comes from the two longitudinal studies, which show that slice-based metrics can be used to quantify the deterioration that accompanies software evolution. Thus, the studied metrics have the capacity to be used at the program level to guide reverse engineering's attempts to "improve" code.

- Third, the paper provides a head-to-head comparison of the metrics, providing a better understanding of their relationships and indicating which metrics provide a similar view of a program and which provide complementary views of the program.

Finally, conducting this study has suggested several areas for future work. On the theoretical side, the results suggest the need for an improved or replacement definition of Parallelism. In addition, three future empirical investigations have been suggested.

The first would investigate the metric slice, while the second would consider the effectiveness of the glue-token based metrics. The third would be an in-depth study of slice-based coupling. The final, and perhaps most important, future study addresses the pragmatic question: does a software engineer with access to slice-based metric values for the modules of a program do a better job of restructuring the program? While the longitudinal studies provide encouraging initial evidence, with only two programs no statistically significant results can be derived. The data does suggest that a more comprehensive study is warranted. Such a study is planned.

6. Acknowledgments

GrammaTech provided CodeSurfer™, upon which the implementation is based. Alberto Pasquini provided the programs oracolo2, prepro, and copia. Special thanks to Mark Harman for guidance throughout the writing of this paper. This work is supported by National Science Foundation grant CCR-0305330.

References

[1] J. Bieman and B. Kang. Measuring design-level cohesion. IEEE Transactions on Software Engineering, 24(2):111-124, February 1998.

[2] J. Bieman and L. Ott. Measuring functional cohesion. IEEE Transactions on Software Engineering, 20(8):644-657, Aug. 1994.

[3] D. Binkley and K. Gallagher. Program slicing. In M. Zelkowitz, editor, Advances in Computers, Volume 43, pages 1-50. Academic Press, 1996.

[4] D. Binkley, S. Horwitz, and T. Reps. Program integration for languages with procedure calls. ACM Transactions on Software Engineering and Methodology, 4(1):3-35, 1995.

[5] L. L. Constantine and E. Yourdon. Structured Design. Prentice Hall, 1979.

[6] S. Demeyer, S. Ducasse, and M. Lanza. A hybrid reverse engineering approach combining metrics and program visualization. In Working Conference on Reverse Engineering, pages 175-186, 1999.

[7] GrammaTech Inc. The CodeSurfer slicing system, 2002.

[8] M. Harman, S. Danicic, B. Sivagurunathan, B. Jones, and Y. Sivagurunathan. Cohesion metrics. In Proceedings of the 8th International Software Quality Week (San Francisco, CA), pages 4-T-4, 30 May - 2 June 1995.

[9] M. Harman, M. Okunlawon, B. Sivagurunathan, and S. Danicic. Slice-based measurement of coupling. In Proceedings of the IEEE/ACM ICSE Workshop on Process Modelling and Empirical Studies of Software Evolution (Boston, Massachusetts), pages 28-32, 17-23 May 1997.

[10] S. Horwitz, T. Reps, and D. Binkley. Interprocedural slicing using dependence graphs. ACM Transactions on Programming Languages and Systems, 12(1):26-61, 1990.

[11] S. Karstu. An examination of the behavior of slice based cohesion measures. Master's thesis, August 1994.

[12] A. Lakhotia and J. Deprez. Restructuring functions with low cohesion. In IEEE Working Conference on Reverse Engineering (WCRE 1999), pages 36-46. IEEE Computer Society Press, Los Alamitos, California, USA, Oct. 1999.

[13] T. Mens and S. Demeyer. Future trends in software evolution metrics. In Proceedings of the 4th International Workshop on Principles of Software Evolution, pages 83-86. ACM Press, 2002.

[14] L. Ott and J. Thuss. The relationship between slices and module cohesion. In Proceedings of the Eleventh International Conference on Software Engineering, pages 198-204, 1989.

[15] L. Ott and J. Thuss. Slice based metrics for estimating cohesion. In Proceedings of the First International Software Metrics Symposium, pages 71-81, 1993.

[16] A. Podgurski and L. Clarke. A formal model of program dependences and its implications for software testing, debugging, and maintenance. IEEE Transactions on Software Engineering, 16(8), 1990.

[17] T. Reps and W. Yang. The semantics of program slicing. Technical Report 777, University of Wisconsin, 1988.

[18] M. Weiser. Program slicing. In 5th International Conference on Software Engineering, pages 439-449, San Diego, CA, Mar. 1981.