
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-7, NO. 4, JULY 1981

A Software Maintainability Evaluation Methodology

DAVID E. PEERCY

Abstract-This paper describes a conceptual framework of software maintainability and an implemented procedure for evaluating a program's documentation and source code for maintainability characteristics. The evaluation procedure includes use of closed-form questionnaires completed by a group of evaluators. Statistical analysis techniques for validating the evaluation procedure are described. Some preliminary results from the use of this methodology by the Air Force Test and Evaluation Center are presented. Areas of future research are discussed.

Index Terms-Evaluation by questionnaire, evaluation reliability, quality metrics, software engineering, software maintainability evaluation, software quality assurance.

I. INTRODUCTION

THE Air Force Test and Evaluation Center (AFTEC) has been developing a methodology for evaluating the quality of delivered software systems as part of its directed activity of operational test and evaluation (OT&E). Thayer [3] has reported the initial approach for a software maintainability evaluation methodology. The BDM Corporation has completed a technical directive for AFTEC to review this methodology, analyze the results of 18 different program evaluations which used the methodology, and recommend appropriate changes to the methodology. This paper summarizes the revised methodology from this effort.

Because of the number of software systems to be evaluated, the variability (language, computer, functions) of the software to be evaluated, and the limited state of the art in practical automated evaluation tools, AFTEC's software evaluation procedure has been based on the completion of closed-form questionnaires. The methodology defines a conceptual framework for the software characteristics from the user-oriented level to the software product level and an evaluation procedure whereby the identified product characteristics can be measured. The measures, or software metrics, are then normalized through evaluation-specific weights to provide the necessary evaluation maintainability measures.

The primary objective of the software maintainability evaluation is to collect enough specific information to identify for which parts of the software and for what reasons maintainability may be a problem. A secondary objective is to assess the effectiveness of the evaluation process itself. A future goal is to validate maintainability scores against an actual field maintenance level of effort.

Manuscript received February 4, 1980; revised January 21, 1981. This work was supported by AFTEC Technical Directive 120 of Contract F29601-77-C-0082.
The author is with the BDM Corporation, Albuquerque, NM 87106.

The major evaluation assumptions are as follows:
* maintainability considerations remain essentially the same from program to program,
* evaluators must be knowledgeable in software procedures, techniques, and maintenance, but need not have a detailed knowledge of the functional area of the program,
* a minimum of five independent evaluators will be used to provide acceptable confidence that the metrics (evaluator average scores) are a sound measuring tool,
* the random sampling of the program modules for evaluation provides conclusions which hold for the general population of all program modules.

The main features of the software maintainability evaluation are the following.
* The maintainability model is primarily based on the models in Thayer [3], Boehm [1], and Walters [4]. The evaluation process consists of a set of evaluators completing closed-form questionnaires on maintainability characteristics of program documentation and program source listings followed by automated processing of the evaluator responses and a careful manual analysis of all detected program and evaluation anomalies.

* The evaluation can be used at appropriate phases in the software development life cycle in addition to the operational maintenance phase.
* The evaluation is independent of any particular source language.
* The maintainability characteristics can be used as a quality assurance checklist.

The major results of this research effort have been to:
* provide a definitive evaluation methodology which is practical and immediately useful to AFTEC,
* reduce subjectivity and increase reliability of the evaluation,
* provide a conceptual framework which can be expanded both within maintainability and to other quality factors,
* provide a computer program for the automated processing and analysis of the evaluation data.

II. CONCEPTUAL FRAMEWORK

The software maintainability evaluation methodology has the conceptual framework shown in Fig. 1. The associated definitions are in Table I.

A. Quality Factors

The work of Boehm [1] and Walters [4], among others, has established a set of user-oriented terms representing desired qualities of software. These terms, or quality factors, include maintainability, usability, correctness, human engineering, portability, reliability, and others. Although the quality factors may be the same syntactically among researchers, semantically they tend to have different interpretations. The definition in Table I of software maintainability reflects AFTEC's concern for acquiring software which is understandable (required locations for modifications can be easily established), modifiable (enhancements or corrections can be made), and testable (the software is properly instrumented for testing once modifications have been made).



Fig. 1. Elements of software maintainability. [Figure: a block diagram relating the quality factor (software maintainability) and its quality subfactors (understandability, modifiability, testability) to the product categories (product documentation, software source listings, computer support resources) and to the test factors modularity (MO), descriptiveness (DS), consistency (CS), simplicity (SI), expandability (EX), and instrumentation (IN).]

TABLE I
DEFINITIONS

SOFTWARE: Software consists of the programs and documentation which result from a software development process.

SOFTWARE MAINTAINABILITY: Software maintainability is a quality of software which reflects the effort required to perform the following actions:
(1) Removal/correction of latent errors
(2) Addition of new features/capabilities
(3) Deletion of unused/undesirable features
(4) Modification of software to be compatible with hardware changes
Implicit in the above definition are the concepts that the software should be understandable, modifiable, and testable in order to have effective maintainability.

UNDERSTANDABILITY: Software possesses the characteristics of understandability to the extent its purpose and organization are clear to the inspector.

MODIFIABILITY: Software possesses the characteristics of modifiability to the extent that it facilitates the incorporation of changes once the nature of the desired change has been identified.

TESTABILITY: Software possesses the characteristics of testability to the extent that it facilitates the establishment of verification criteria and supports evaluation of its performance.

TEST FACTORS: Software maintainability test factors are user-oriented general attributes of software which affect maintainability. The set of test factors includes: modularity, descriptiveness, consistency, simplicity, expandability, and instrumentation.

MODULARITY: Software possesses the characteristics of modularity to the extent that a logical partitioning of software into parts, components, and modules has occurred.

DESCRIPTIVENESS: Software possesses the characteristics of descriptiveness to the extent that it contains information regarding its objectives, assumptions, inputs, processing, outputs, components, revision status, etc.

CONSISTENCY: Software possesses the characteristics of consistency to the extent the software products correlate and contain uniform notation, terminology, and symbology.

SIMPLICITY: Software possesses the characteristics of simplicity to the extent that it lacks complexity in organization, language, and implementation techniques and reflects the use of singularity concepts and fundamental structures.

EXPANDABILITY: Software possesses the characteristics of expandability to the extent that a physical change to information, computational functions, data storage, or execution time can be easily accomplished.

INSTRUMENTATION: Software possesses the characteristics of instrumentation to the extent it contains aids which enhance testing.

SOFTWARE DOCUMENTATION: Software documentation is the set of requirements, design specifications, guidelines, operational procedures, test information, problem reports, etc. which in total form the written description of the program(s) output from a software development process.

SOFTWARE SOURCE LISTINGS: Software source listings are the implemented representation (listing), described through a source computer language, of the program(s) output from a software development process.

COMPUTER SUPPORT RESOURCES: Computer support resources include all the resources (software, computer equipment, facilities, etc.) which support the software being evaluated.

PROGRAM: A program is a set of hierarchically related modules which can be separately compiled, linked, loaded, and executed.

MODULE: A module is a set of "contiguous" computer language statements which has a name by which it can be separately invoked.

B. Software Product/Categories

Each software program (product) is separately evaluated and consists of a set of components called modules. A module may, in general, be at any conceptual level of the program. For each program there are three categories which are evaluated for characteristics which affect maintainability: software documentation, software source listings, and the computer support resources. Only program deliverables are considered in an evaluation.

1) Software Documentation: The primary documentation used in this evaluation consists of the documents containing program design specifications, program test plan information and procedures, and program maintenance information. These documents may have a variety of physical organizations depending upon the particular application, although software standards attempt to reduce the variability [22]-[28]. The documents are evaluated both for content and for general physical structure (format).


2) Software Source Listings: The source listings represent the program as implemented, in contrast to the documentation which represents the program design or implementation plan. Source listings are also a form of program documentation, but for this maintainability evaluation a distinction is made.

The source listing evaluation consists of a separate evaluation of each specified module's source listing and the consistency between the module's source listing and the related written module documentation. The separate module evaluations are accumulated into an overall evaluation of the software source listings.

3) Computer Support Resources: Attributes and procedures for the evaluation of computer support resources are being developed and will be detailed in a separate report.

C. Software Maintainability Test Factors

The maintainability of software documentation and source listings is a function of six attributes or test factors: modularity, descriptiveness, consistency, simplicity, expandability, and instrumentation. These test factors are defined in Table I. Discussions of their application in the evaluation of the documentation and source listings are given in the following paragraphs.

1) Modularity: It has been observed that software has been easiest to understand and change when composed of "independent" parts (sections, modules). Documentation and source listings are evaluated in relation to the extent their logical parts show only a few, simple links to other parts (low coupling [9]) and contain only a few easily recognizable subparts which are closely related (high strength [9]). Parnas [10], [11] has described these concepts in different, but relatively equivalent, terms.

2) Descriptiveness: It is important that the documentation contain useful explanations of the software program design. The objectives, assumptions, inputs, and outputs are desirable in varying degrees of detail in both documentation and source listings. The intrinsic descriptiveness of the source language syntax and the judicious use of source commentary greatly aid efforts to understand the program operation.

3) Consistency: The use of standards and conventions in documentation, flowchart construction, I/O processing, error processing, module interfacing, and naming of modules/variables is a typical reflection of consistency. Consistency allows one to easily generalize understanding. For example, programs using consistent conventions might require that the format of modules be similar. Thus, by learning the format of one module (preface block, declaration format, error checks, etc.) the format of all modules is learned.

4) Simplicity: The aspects of software complexity (or lack of simplicity) that are emphasized in the evaluation relate primarily to the concepts of size and primitives. The use of a high order language as opposed to assembly language tends to make a program simpler to understand because there are fewer discriminations which have to be made. There are certain programming considerations, such as dynamic allocation of resources and recursive/reentrant coding, which can greatly complicate the data and control flow. Real-time programs, because of the requirement for timing constraints and efficiency, tend to have more control complexity. The sheer bulk of counts (number of operators, operands, nested control structures, executable statements, statement labels, decision parameters) will determine to a great extent how simple or complex the source code is [15]-[18].

5) Expandability: Software may be reasonably understandable but not easily expandable. If the design of the program has not allowed for a flexible timing scheme or a reasonable storage margin, then even minor changes may be extremely difficult to implement. Parameterization of constants and basic data structure sizes usually improves expandability. It is also very important that the documentation include explanations of how to make increases/decreases in data structure sizes or changes to the timing scheme. The limitations of such program expandability should be clear. The numbering schemes for documentation narrative and graphic materials must be carefully considered so that physical modifications to the documentation can be easily accomplished when necessary.

6) Instrumentation: For the most part, the documentation is evaluated by how well the program has been designed to include test aids (instruments), while the source listings are evaluated by how well the code seems to be implemented to allow for testing through the use of such test aids. The software should be designed and implemented so that instrumentation is imbedded within the program, can be easily inserted into the program, is available through a support software system, or is available through a combination of these capabilities.

D. Software Characteristics

Each test factor has a set of software-level characteristics which serve to define the test factor within the context of the software product category being evaluated. Characteristics were identified and grouped so as to minimize the overlap among the test factors and balance the number of characteristics across the test factors. Characteristics for the documentation and source listings were identified primarily from Thayer [3], Boehm [1], Walters [4], Kernighan and Plauger [7], Myers [9], Parnas [10], [11], Miller [19], Halstead [15], [16], McCabe [18], Yeh [12], and various documentation standards [22]-[28].

A fixed scale for all evaluation responses was chosen and closed-form questionnaires for documentation and source listings were designed based on the identified characteristics. In order to minimize subjectivity, increase the evaluation reliability, and provide for a more efficient evaluation process, a detailed evaluation guidelines handbook [33] was developed. The handbook contains background methodology, the evaluation questionnaires, and a set of guidelines for interpreting the terminology and potential responses for each question. A computer program was developed to aid the analysis of the evaluator responses.

III. EVALUATION PROCEDURE

The software evaluation procedure involves four distinct phases as shown in Fig. 2: planning, calibration, assessment, and analysis.


Test Planning
  Software Test Manager (STM)/Software Assessment Team (SAT) Chairman:
  - Establishes Evaluation Structure
  - Selects Modules for Evaluation
  - Determines Test Factor Weights
  - Establishes Time Frame for Evaluation
  STM:
  - Assigns Identification Information
  - Completes Evaluator Briefing

Calibration Test
  Each Evaluator:
  - Completes One Documentation Questionnaire
  - Completes Specified Module Questionnaire
  STM:
  - Reviews Completed Questionnaires
  - Resolves Misunderstandings
  - Debriefs Evaluators

Assessment
  Each Evaluator:
  - Updates Calibration Questionnaires
  - Completes Remaining Questionnaires

Analysis and Reporting
  STM:
  - Accomplishes Automated Questionnaire Data Entry
  - Produces Automated Preliminary Analysis
  - Reviews Automated Analysis Results
  SAT:
  - Reviews Preliminary Analysis
  - Performs Detailed Evaluation
  - Prepares Evaluation Report

Fig. 2. Maintainability evaluation procedure.

During the planning phase, the AFTEC Software Test Manager (STM) and the selected Software Assessment Team (SAT) Chairman establish evaluator teams, each consisting of at least five evaluators knowledgeable in software maintenance. The SAT chairman may or may not be one of the evaluators. The evaluators are preferably persons who will be responsible for maintaining some part of the software being evaluated. The program/module hierarchy is established and a set of representative modules is selected for each program to be evaluated. At least 10 percent of the modules in a program are randomly selected for evaluation. Specific test factor (attribute) weights are also determined at this time and the schedule for the evaluation is established. The software test manager briefs the evaluator teams on the procedures and assigns the necessary identification information for this specific evaluation.
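
The module-selection step can be sketched in a few lines. The helper below is illustrative only (the function name and the 40-module program are hypothetical assumptions, not part of the published procedure); it simply enforces the 10 percent random-sampling floor described above.

```python
import math
import random

def select_modules(modules, fraction=0.10, seed=None):
    """Randomly select at least `fraction` (10 percent) of a program's modules for evaluation."""
    count = max(1, math.ceil(fraction * len(modules)))
    return sorted(random.Random(seed).sample(modules, count))

# Hypothetical program/module hierarchy: 40 modules named MOD001..MOD040.
program_modules = [f"MOD{i:03d}" for i in range(1, 41)]
print(select_modules(program_modules, seed=1))  # 4 of the 40 modules
```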

The function of the calibration test is to ensure a reliable evaluation through a clear understanding of the questions and their specific response guidelines on each questionnaire. Each evaluator completes a documentation and module source listing questionnaire. The completed questionnaires are reviewed to detect areas of misunderstanding and the evaluation teams are debriefed on the problem areas.

In the assessment phase, the evaluation teams update their calibration test questionnaires based on the results of the calibration debriefing. The teams then complete the remainder of their assigned documentation and module source questionnaires. It is estimated that each evaluator will take 4-6 hours to complete the documentation questionnaire and 1-3 hours to complete each module questionnaire.

TABLE II
TEST FACTOR QUESTION DISTRIBUTION

TABLE III
EXAMPLE QUESTIONS

(Documentation)

Format Modularity
Note: The following questions relate to how the documentation has been physically formatted into functional parts.
1. Program documentation includes a separate part for the description of program external interfaces.
2. Program documentation includes a separate part for the description of each major program function.
3. Program documentation includes a separate part for the description of the program global data base.

Processing Modularity
Note: The following questions relate to how the program control and data flow has been designed for functional use.
8. The program control flow is organized in a top down hierarchical tree pattern.
9. Program initialization processing is done by one (set of) module(s) designed exclusively for that purpose.

(Source Listings)

Size Simplicity
Note: The following questions relate to various "counts" which reflect the amount of information which must be assimilated to understand a module.
62. The number of expressions used to control branching in this module is manageable.
63. The number of unique operators in this module is manageable.
64. The number of unique operands in this module is manageable.
65. The number of executable statements in this module is manageable.

General Questions
83. Modularity as reflected in this module's source listing contributes to the maintainability of this module.

In the analysis phase, the software test manager accomplishes the conversion and initial data processing of the questionnaire data. This preliminary analysis is then reviewed and corrected, if necessary. The statistical summaries are then returned to the SAT for detailed evaluation and preparation of the final report.

A. Example Questions

Each evaluator is supplied with a documentation questionnaire, source listing questionnaire, evaluation response forms, and an evaluator guidelines handbook. The number of questions for each of the questionnaires and each of the test factors is summarized in Table II. Some of the questions are illustrated in Table III. Note the "general" question 83. Each test factor has such an associated general question. In future analysis of the methodology, test factor characteristics (scores) will be regressed against the general question (score) across all program modules and all programs. The guidelines for one of the sample questions are illustrated in Table IV.

B. Response Form

The form on which an evaluator records responses to questions is processed through an optical scanner. There are three "blocks" on this form: descriptive identification block, numerical identification block, and evaluator response block. The descriptive identification block contains information which identifies the particular questionnaire type, system, subsystem, program, module, evaluator, date, and time to complete. This block is only used for a visual identification check and is not processed by the optical scanner. The numerical identification block contains numeric codes for the same information contained in the descriptive identification block. The evaluator response block contains a set of 10 responses (A-J) for each question, for up to 90 questions.

C. Response Scale

The following response scale is used to answer each question:
a) completely agree,
b) strongly agree,
c) generally agree,
d) generally disagree,
e) strongly disagree,
f) completely disagree.

One of these responses must be given for each question. In addition, one or more of the following standardized comment responses can be selected:
i) I had difficulty answering this question,
j) a written comment has been submitted.

The responses g and h are not used. The responses a-f (equivalent numeric metric is 6 to 1) indicate the extent to which the evaluator agrees/disagrees with the question statement.

D. Analysis Techniques

The maintainability metrics are the average scores across evaluators, test factors, product categories, and programs. Test factors, product categories, and programs can be given weights at the discretion of the AFTEC Test Manager and SAT Chairman, but raw scores will also be retained.
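
The paper does not spell out the exact aggregation formula used by the SMAP program, but a simple weighted average is consistent with the published tables: applying the Table VI documentation weights for program P1 to the Table VII documentation scores reproduces the 3.44 test factor composite reported in Table VIII. A minimal sketch under that assumption (the helper name is illustrative):

```python
def weighted_composite(scores, weights):
    """Weighted average of test factor scores; the weights are assumed to sum to 1.0."""
    assert abs(sum(weights.values()) - 1.0) < 1e-6, "weights should sum to 1.0"
    return sum(weights[factor] * scores[factor] for factor in weights)

# Program P1, documentation category: weights from Table VI, average scores from Table VII.
p1_doc_weights = {"MO": 0.10, "DS": 0.35, "CS": 0.15, "SI": 0.20, "EX": 0.10, "IN": 0.10}
p1_doc_scores = {"MO": 4.33, "DS": 3.24, "CS": 4.07, "SI": 3.58, "EX": 3.93, "IN": 1.57}
print(round(weighted_composite(p1_doc_scores, p1_doc_weights), 2))  # 3.44, as in Table VIII
```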

Assessment of the evaluation process itself is partially based on six measures: agreement, outliers, response distribution, standard deviation, regression, and question reliability. Agreement on a question is calculated using the following formula:

    AG = (1/NE) * SUM[i=0..NS] (NR_i / 2^i)

where AG is the agreement factor, NS is the number of unit steps in the scoring scale, NR_i is the number of responses that are i steps from the mode, and NE is the number of evaluators (responses).

If there is no mode, then the scale value closest to the mean and with at least as many responses as any other scale value is used as the "mode." As an example, with five responses of B, C, C, C, E, the mode is C and AG = (3/2^0 + 1/2^1 + 1/2^2)/5 = 0.75.
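
A short sketch of the agreement computation, using the A-F to 6-1 scale from Section III-C (the function name is illustrative). It reproduces the worked example above and the 0.8 agreement goal example cited later (responses B, C, C, C, D).

```python
from collections import Counter

SCALE = {"A": 6, "B": 5, "C": 4, "D": 3, "E": 2, "F": 1}

def agreement(responses):
    """AG = (1/NE) * sum of NR_i / 2**i, where i is each response's distance (in scale
    steps) from the mode; with no unique mode, the max-count value closest to the mean
    is used as the "mode"."""
    values = [SCALE[r] for r in responses]
    counts = Counter(values)
    top = max(counts.values())
    mean = sum(values) / len(values)
    mode = min((v for v, c in counts.items() if c == top), key=lambda v: abs(v - mean))
    # Summing 1/2**distance over individual responses is equivalent to grouping by distance.
    return sum(1 / 2 ** abs(v - mode) for v in values) / len(values)

print(agreement(["B", "C", "C", "C", "E"]))  # 0.75, as in the worked example
print(agreement(["B", "C", "C", "C", "D"]))  # 0.80, the evaluator agreement goal
```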

Outliers are determined in a somewhat subjective (but logical) manner since neither the agreement factor nor the standard deviation provides an acceptably consistent measure. An outlier is any extreme response with a distance (DE) from the next closest response such that DE/DT > 0.5, where DT = maximum distance between any two responses.
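
A sketch of the outlier rule on numeric responses (the helper is hypothetical and assumes at least two responses): flag an extreme response when its gap to the next closest response is more than half the total spread.

```python
def outliers(values):
    """Return extreme responses whose distance DE to the next closest response
    satisfies DE/DT > 0.5, where DT is the maximum distance between any two responses."""
    vs = sorted(values)
    dt = vs[-1] - vs[0]
    if dt == 0:
        return []
    flagged = []
    if (vs[1] - vs[0]) / dt > 0.5:    # low extreme
        flagged.append(vs[0])
    if (vs[-1] - vs[-2]) / dt > 0.5:  # high extreme
        flagged.append(vs[-1])
    return flagged

print(outliers([5, 4, 4, 4, 1]))  # [1]: DE = 3, DT = 4, DE/DT = 0.75
print(outliers([5, 4, 4, 3, 2]))  # []: no gap exceeds half the spread
```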

Response distribution is studied across all evaluations on a question-by-question basis to determine the validity of the general assumption of a normal response distribution. Such analysis can also be used to determine an experimental question weight. On an individual evaluation basis, the combination of agreement, outlier, and standard deviation analysis is used to pinpoint particular questions which have an unacceptable response distribution.

Regression analysis is used across all evaluations to study the validity of the test factor question groupings and to study the regression of test factor characteristic responses against the associated general test factor question response. Itzfeldt [2] presented some interesting related results using regression and factor analysis.

Reliability is a measure of consistency from one set of measurements to another. Reliability can be defined through error: the more (less) error, the lower (higher) the reliability. Since we can measure total variance, if we can estimate the error variance of a measure, we can also estimate reliability.

The statistical method for identifying error variance is Analysis of Variance (ANOVA). ANOVA allows the analyst to isolate the sources of variance within total variance. In the evaluation of module questionnaires, for example, the sources of variance are differences between the evaluators due to their differing backgrounds and expectations, differences in the characteristics of the modules, and unattributable differences due to error. Two-way analysis of variance allows a determination of all three variance sources. Mean-squares for raters, modules, and error are determined as measures of variance. Reliability is then calculated as 1.00 minus the proportion of mean-square error to mean-square modules.


TABLE IV
EXAMPLE OF QUESTION GUIDELINES

QUESTION NUMBER: S-62

QUESTION: The number of expressions used to control branching in this module is manageable.

CHARACTERISTIC: Simplicity (size simplicity).

EXPLANATION: The count of control expressions is closely related to the number of independent cycles in a module. The more control expressions there are, the more complex the control logic tends to be.

EXAMPLES: The following examples indicate how to count the control expressions:

  CONTROL STRUCTURE   STATEMENT                              CONTROL EXPRESSION   COUNT
  Decision            IF (A.OR.B) GO TO 10                   A;B                  2
                      IF (A.AND.B) GO TO 10                  A;B                  2
                      IF (C.GT.D) GO TO 10                   C.GT.D               1
                      IF ((A.AND.B).OR.(C.GT.D)) GO TO 10    A;B;C.GT.D           3
  Case                CASE (I) OF                            I=1;I=2;I=3          2
                        1: A                                 (alternatives)       (number of
                        2: B                                                      alternatives
                        3: C                                                      less one)
                      END CASE
  Iteration           DO 10 I=1,10                           I.LT.1; I.GT.10      2
                        ...
                      10 CONTINUE

GLOSSARY: Control expression: IF, CASE, or other decision control expression; DO, DO-WHILE, or other iterative control expression.

SPECIAL RESPONSE INSTRUCTIONS: The following guidelines will anchor A and F responses, but are fairly subjective (especially the F anchor). The guideline for the A response is suggested from other independent research. Remember to count all repetitions of the same control expression also.

  Answer A if count < 10.
  Answer F if count > 50.
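
As a rough automation of the counting convention in Table IV, the sketch below counts the control expressions in a FORTRAN-style logical IF by splitting its condition on .AND./.OR. operators. It is a simplification (CASE and DO constructs, and repetitions across a whole module, are not handled) and is not part of the published methodology.

```python
import re

def count_if_control_expressions(statement):
    """Count control expressions in a FORTRAN-style 'IF (condition) ...' statement
    by counting the operands joined with .AND. / .OR. inside the condition."""
    match = re.search(r"IF\s*\((.*)\)", statement, re.IGNORECASE)
    if not match:
        return 0
    condition = match.group(1)
    # Each .AND./.OR. joins one more elementary expression.
    return len(re.split(r"\.(?:AND|OR)\.", condition, flags=re.IGNORECASE))

for stmt in ["IF (A.OR.B) GO TO 10",
             "IF (C.GT.D) GO TO 10",
             "IF ((A.AND.B).OR.(C.GT.D)) GO TO 10"]:
    print(stmt, "->", count_if_control_expressions(stmt))  # 2, 1, 3 as in Table IV
```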


TABLE V
EXAMPLE RELIABILITY COMPUTATIONS

                      EVALUATOR
  MODULE          1     2     3     ROW SUM   ROW SUM OF SQUARES
    1             C     C     C       12            48
    2             C     A     C       14            68
    3             D     A     C       13            61
    4             C     A     B       15            77
  COLUMN SUM     15    22    17       54
  COLUMN SUM
  OF SQUARES     57   124    73                     254

NOTE: A = 6, B = 5, C = 4, D = 3, E = 2, F = 1

V1 = (54)^2/((4)(3)) = 243
V2 = 254
V3 = (15^2 + 22^2 + 17^2)/4 = 249.5
V4 = (12^2 + 14^2 + 13^2 + 15^2)/3 = 244.67
SSE = (V2 - V4) + (V1 - V3) = 2.83
SSR = V3 - V1 = 6.5
MSE = SSE/((4 - 1)(3 - 1)) = .47
MSR = SSR/(3 - 1) = 3.25
R = 1 - MSE/MSR = .86

If the reliability coefficient R is squared (R^2), it becomes a coefficient of determination. It gives us the proportion of the variance shared by the "true" score and the observed score. R^2 is interpreted as the proportion of observed variance which can be attributed to a true measurement. The expression 1 - R^2 provides the proportion of total variance which can be attributed to error. If evaluators are focusing on different aspects of a question, then the resulting evaluator responses will have an associated error variance which is not explainable. Since the reliability squared indicates how much variance is explainable, the higher this value, the lower the possible unexplained error variance and hence the less possibility the evaluators were misinterpreting the question. So, the higher the reliability, the more probable the question is a "good" question, at least from the viewpoint of not misinterpreting the question statement.

Table V illustrates the calculation of the reliability for a sample source listing question with 3 evaluators and 4 modules. Reliability is not calculated for the documentation questions since only one questionnaire is completed. If the same evaluators were to evaluate several programs, then by replacing "module" by "program" the reliability of each documentation question could be similarly computed. It is unlikely that precisely the same set of evaluators will be used to evaluate very many programs. Box [30] and Kerlinger [31] have more detailed discussions of the ANOVA and reliability statistics involved in this type of analysis. The BMDP-77 [29] computer statistical package is a practical source for automated analysis.
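
The Table V arithmetic can be reproduced with a short script; the V1-V4 intermediate quantities follow the table's layout, and the function name is illustrative.

```python
def reliability(scores):
    """R = 1 - MSE/MSR from a modules-by-evaluators score matrix, following Table V."""
    m, e = len(scores), len(scores[0])           # modules, evaluators
    total = sum(sum(row) for row in scores)
    v1 = total ** 2 / (m * e)                    # correction term
    v2 = sum(x ** 2 for row in scores for x in row)
    v3 = sum(sum(row[j] for row in scores) ** 2 for j in range(e)) / m   # evaluator (column) sums
    v4 = sum(sum(row) ** 2 for row in scores) / e                        # module (row) sums
    sse = (v2 - v4) + (v1 - v3)
    ssr = v3 - v1
    mse = sse / ((m - 1) * (e - 1))
    msr = ssr / (e - 1)
    return 1 - mse / msr

SCALE = {"A": 6, "B": 5, "C": 4, "D": 3, "E": 2, "F": 1}
table_v = [["C", "C", "C"], ["C", "A", "C"], ["D", "A", "C"], ["C", "A", "B"]]
print(round(reliability([[SCALE[x] for x in row] for row in table_v]), 3))  # 0.855 (.86 in Table V)
```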

IV. PRELIMINARY RESULTS

AFTEC has conducted several software evaluations using the methodology and procedures outlined in this paper. Results for one evaluation involving eight separate programs are summarized in Tables VI, VII, and VIII.

TABLE VI
SOFTWARE MAINTAINABILITY EVALUATION WEIGHTS

                            OPERATIONAL SOFTWARE      SUPPORT SOFTWARE
                            P1    P2    P6    P7      P3    P4    P5    NON-CPCI

DOCUMENTATION CATEGORY     .40   .55   .40   .40     .50   .60   .60   .60
  MODULARITY               .10   .15   .15   .14     .15   .15   .15   .14
  DESCRIPTIVENESS          .35   .35   .25   .26     .25   .30   .30   .26
  CONSISTENCY              .15   .11   .15   .15     .15   .15   .12   .18
  SIMPLICITY               .20   .10   .18   .18     .18   .20   .20   .20
  EXPANDABILITY            .10   .09   .12   .12     .14   .12   .15   .12
  INSTRUMENTATION          .10   .20   .15   .15     .13   .08   .08   .10

SOURCE LISTING CATEGORY    .60   .45   .60   .60     .50   .40   .40   .40
  MODULARITY               .20   .10   .12   .12     .15   .20   .20   .15
  DESCRIPTIVENESS          .15   .30   .20   .39     .22   .18   .18   .25
  CONSISTENCY              .15   .09   .18   .14     .13   .25   .25   .15
  SIMPLICITY               .20   .12   .12   .12     .12   .17   .17   .15
  EXPANDABILITY            .20   .21   .18   .23     .20   .12   .12   .20
  INSTRUMENTATION          .10   .19   .20   .00     .18   .08   .08   .10

CPCI WEIGHT                .15   .38   .37   .10     .35   .20   .35   .10

OPERATIONAL/SUPPORT WEIGHT: OPERATIONAL .60, SUPPORT .40

A report, Program Maintainability Scores, produced by the SMAP [34] computer program, is illustrated in Fig. 3.

The average reliability from Table VIII is below the desired level of 0.9 in several of the evaluations. A reliability of 0.95 or greater is usually required by national testing services. The significance of the reliability (R) is in determining how well evaluators understood the questions (one source of error). Recall that 1 - R^2 gives the unknown variance (error). Hence, if R = 1, there would be no error and the question would have been completely understood by the evaluators. From Table VIII, with R = 0.92, slightly more than 15 percent of the variance is due to error. When R = 0.79, nearly 40 percent of the variance is due to error. After a significant number of evaluations, those questions (26 from Table VIII) with reliability less than 0.80 will be carefully analyzed for better wording or perhaps elimination. As a comparison of reliability improvement over three stages of the methodology evolution, see Fig. 4. For example, the percentage of questions with reliability less than 0.75 has gone from 73 percent to 65 percent to 27 percent.

The evaluator agreement goal is 0.8 (e.g., five responses of B, C, C, C, D). The average agreement from Table VIII is somewhat under the goal, but there were only three evaluators available for most programs and there were some outlier problems, as evidenced by the rather large standard deviations summarized in Table VIII. The desirable standard deviation of 0.5 may be difficult to reach, which simply means more careful analysis of the outliers and agreement is required to reach conclusions as to the validity of the evaluation scores.

One area of concern for AFTEC has been providing enough evaluators so that a larger number of evaluators would not significantly alter the evaluation results. That is, scores would remain within 0.5 units. The standard deviation helps to determine this "sample size" (of evaluators) for a given "confidence" level in the evaluation results.


TABLE VII
SOFTWARE MAINTAINABILITY SCORES

                              OPERATIONAL                                           SUPPORT
                 OPERATING    TACTICAL      RADAR        SIGNAL        SIMULATION   SOFTWARE       DATA         NON-CPCI
                 SYSTEM       APPLICATIONS  CONTROL      PROCESSING                 SUPPORT TOOLS  REDUCTION    SOFTWARE
                 P1           P2            P7           P6            P3           P4             P5           P8
                 DOC   SRC    DOC   SRC     DOC   SRC    DOC   SRC     DOC   SRC    DOC   SRC      DOC   SRC    DOC   SRC

MODULARITY       4.33  5.15   4.00  4.54    3.78  4.57   4.69  5.77    3.36  4.51   3.97  4.84     3.92  5.21   5.28  4.86
DESCRIPTIVENESS  3.24  3.85   2.52* 2.59*   2.54* 2.86*  3.39  4.64    2.38* 2.16*  3.90  2.92*    3.81  3.73   4.14  3.43
CONSISTENCY      4.07  3.93   3.75  3.36    4.15  3.64   5.11  5.24    3.33  3.35   4.52  4.28     4.78  4.68   5.04  4.48
SIMPLICITY       3.58  4.68   3.35  4.07    3.53  4.35   3.89  4.50    3.14  3.91   4.53  4.77     4.33  4.84   4.53  4.38
EXPANDABILITY    3.93  4.33   2.78* 3.58    3.15  3.86   4.9   4.55    2.67* 3.47   4.52  4.81     4.37  4.83   4.59  4.33
INSTRUMENTATION  1.57* 2.79*  2.65* 2.71*   2.17* 2.61*  4.20  N/A     2.40* 2.58*  2.30* 3.54     2.67* 3.66   4.77  3.71

TEST FACTOR
COMPOSITES       3.44  4.28   3.01  3.25    3.16  3.51   4.12  4.82    2.85* 3.22   4.08  4.24     4.04  4.58   4.66  4.08

CPCI COMPOSITES     3.94         3.12          3.37         4.54          3.03         4.14           4.25         4.23

OPS/SUPPORT
COMPOSITES                  3.48 (OPERATIONAL)                                     3.82 (SUPPORT)

SYSTEM COMPOSITE                                         3.62

* Below Threshold

LEGEND: Goal 5.08   Standard 4.15   Threshold 3.00
P1 - 3 Ev, 22 Mod    P2 - 4 Ev, 15 Mod    P3 - 3 Ev, 4 Mod    P4 - 3 Ev, 10 Mod
P5 - 3 Ev, 4 Mod     P6 - 3 Ev, 19 Mod    P7 - 3 Ev, 4 Mod    P8 - 3 Ev, 1 Mod

TABLE VIII
SOFTWARE MAINTAINABILITY EVALUATION ASSESSMENT MEASURES

                          AVE        AVE            RELI-     (1) GENERAL  (2) TEST FACTOR     (3) TEST FACTOR
               PROGRAM    AGREEMENT  STANDARD DEV   ABILITY   SCORES       COMPOSITE SCORES    RAW SCORES

DOCUMENTATION  P1         .76        .92            NA        3.43         3.44                3.42
               P2         .64        1.45           NA        2.93         3.01                3.08
               P3         .68        1.28           NA        2.95         2.85                2.80
               P4         .79        .82            NA        3.76         4.08                3.95
               P5         .79        .88            NA        3.62         4.04                3.94
               P6         .71        1.38           NA        3.14         3.16                3.11
               P7         .75        1.05           NA        3.95         4.12                4.06

SOURCE         P1         .77        .90            .79       4.06         4.28                4.20
LISTING        P2         .70        1.11           .85       3.09         3.25                3.46
               P3         .76        .92            .80       3.07         3.22                3.29
               P4         .81        .71            .85       3.57         4.24                4.11
               P5         .77        .77            .79       3.92         4.58                4.48
               P6         .70        1.20           .90       3.21         3.51                3.66
               P7         .77        1.06           .92       4.50         4.82                4.64

NUMBER OF SOURCE LISTING QUESTIONS WITH AVE RELIABILITY IN RANGE:
  0-.49: 0    .50-.59: 1    .60-.69: 5    .70-.79: 20    .80-.89: 32    .90-1.0: 31

NOTE: No data available for program P8.
CORRELATION (1,2) = .95
CORRELATION (1,3) = .93

Briefly, a sample size is desired which ensures that the imaginary population of all possible software raters is accurately represented by a randomly selected sample. There are two possible errors to guard against. The software is evaluated to be below (above) a criterion when a much larger sample would find that the software was above (below) the criterion. These are called Type I and Type II errors, respectively. Consequently, it is necessary to establish probabilities which are acceptable for each of the two possible error types. The probability of a Type I (Type II) error is termed alpha (beta). With alpha and beta defined, sample size n is given by

    n = ((Z_alpha + Z_beta) * sigma / theta)^2

where
Z_alpha = normal deviate at alpha,
Z_beta = normal deviate at beta,
sigma = standard deviation of the population scores,
theta = not-to-exceed distance of the rater sample results from the hypothetical rater population results.

Values of Z_alpha and Z_beta are found in Box [30] and Kerlinger [31], as well as in most standard statistical texts. Using an alpha of 0.1 (Z_alpha = 1.28), a beta of 0.1 (Z_beta = 1.28), theta of 0.5, and the goal standard deviation of 0.5, we get n = 6.55, or approximately seven evaluators required. Since AFTEC has a general maximum of five evaluators available, the confidence level would have to be lowered from the above in order to use the goal standard deviation of 0.5.
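
The sample-size arithmetic can be checked directly; the 1.28 normal deviates are the values quoted in the text for alpha = beta = 0.1, and the function name is illustrative.

```python
def evaluator_sample_size(z_alpha, z_beta, sigma, theta):
    """n = ((Z_alpha + Z_beta) * sigma / theta)**2: evaluators needed so the sample results
    stay within theta of the hypothetical rater population results at the chosen error rates."""
    return ((z_alpha + z_beta) * sigma / theta) ** 2

n = evaluator_sample_size(z_alpha=1.28, z_beta=1.28, sigma=0.5, theta=0.5)
print(round(n, 2))  # 6.55 -> approximately seven evaluators
```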


Fig. 3. SMAP report: Program maintainability scores. [Figure: a sample page of SMAP output for CPCI-2 Version 1 (program maintainability score 3.12), listing the raw score (RS), weight (WT), and weighted score (WS) for each test factor and its characteristic groupings under the documentation and source listing categories.]

Fig. 4. Comparative questionnaire reliabilities. [Figure: cumulative percentage of questions plotted against reliability (.50 to 1.0) for three stages: prior to process or questionnaire changes, after process changes, and after process changes and redesign of the questionnaire.]


The correlation between the test factor scores and the general question scores is significant. It indicates that the evaluators did seem to agree that the respective test factors as defined were represented by the test factor characteristics (the questions).

V. CONCLUSIONS AND FUTURE RESEARCH

The main conclusions of the research summarized in this paper are as follows.
* AFTEC has a viable software maintainability evaluation methodology which is cost-effective and reasonably implementable.
* The revisions to AFTEC's initial methodology should provide a substantial increase in the confidence of future evaluation results.
* The software maintainability questionnaires should be valuable as software quality assurance checklists for software contractors during initial development phases.
* Due to the nature of the software state of the art, the AFTEC methodology should be carefully and continually reassessed.
* The results of any metric measurement of software characteristics should be used as a guide to possible anomalies and not as absolute dictums.

Future research includes the need to
* correlate maintainability raw scores to actual maintenance level of effort,
* investigate even better response scales, such as the Likert graphic response scale,
* correlate a database of evaluator background data to evaluator response data,
* study the cumulative database of responses to determine true distributions and the meaning of performance measures such as AFTEC's threshold, standard, and goal.

ACKNOWLEDGMENT


Several AFTEC personnel were very supportive of this methodology research and have allowed the inclusion in this paper of some updated descriptions of the methodology and its use since completion of Technical Directive 120 in December 1978. In particular, Lt. Col. H. Arner, Capt. T. Brock, and P. Thayer have been very helpful. Within BDM, F. Ragland has contributed considerable research into the statistical foundation of this work. The referees also contributed several helpful suggestions which improved the original paper.

REFERENCES

[1] B. Boehm et al., Characteristics of Software Quality. New York: North-Holland, 1977.
[2] W. Itzfeldt et al., "User-perceived quality of interactive systems," in Proc. 3rd Int. Conf. on Software Eng., May 1978, pp. 188-195.
[3] P. Thayer, "Software maintainability evaluation methodology," Air Force Test and Evaluation Cen. Rep., June 1978.
[4] G. Walters et al., "Factors in software quality," RADC-TR-77-369, vol. I, II, III, Nov. 1977.
[5] ACM Comput. Surveys, "Special issue: Programming," vol. 6, Dec. 1974.
[6] E. Dijkstra, "Notes on structured programming," in Structured Programming, Dahl, Dijkstra, and Hoare, Eds. New York: Academic, 1972.
[7] B. Kernighan and P. Plauger, The Elements of Programming Style. New York: McGraw-Hill, 1974.
[8] H. D. Mills, "Mathematical foundations for structured programming," IBM Rep. FSC72-6012, Feb. 1972.
[9] G. Myers, Reliable Software Through Composite Design, 1st ed. New York: Petrocelli, 1975.
[10] D. Parnas, "Designing software for ease of extension and contraction," in COMPSAC 1978 Tutorial, Software Methodology, Nov. 1978, pp. 184-196.
[11] -, "On the criteria to be used in decomposing systems into modules," Commun. Ass. Comput. Mach., pp. 1053-1058, Dec. 1972.
[12] R. Yeh, Ed., Current Trends in Programming Methodology: Vol. I. Software Specification and Design. Englewood Cliffs, NJ: Prentice-Hall, 1977.
[13] E. Yourdon, "Modular programming," in Techniques of Program Structure and Design. Englewood Cliffs, NJ: Prentice-Hall, 1975, pp. 93-136.
[14] T. Gilb, Software Metrics. Cambridge, MA: Winthrop, 1976.
[15] M. Halstead, Elements of Software Science. New York: Elsevier, 1977.
[16] -, "Using the methodology of natural science to understand software," Purdue Univ., CDRTR 190, May 1976.
[17] T. Love et al., "Measuring the psychological complexity of software maintenance tasks with the Halstead and McCabe metrics," IEEE Trans. Software Eng., vol. SE-5, Mar. 1979.
[18] T. McCabe, "A complexity measure," IEEE Trans. Software Eng., vol. SE-2, pp. 308-320, Dec. 1976.
[19] E. Miller, "Program testing techniques," presented at COMPSAC 1977 Tutorial, Nov. 1977.
[20] J. Bowen, "A survey of standards and proposed metrics for software quality testing," Computer, vol. 12, pp. 37-42, Aug. 1979.
[21] F. Buckley, "A standard for software quality assurance plans," Computer, vol. 12, pp. 43-50, Aug. 1979.
[22] Documentation Standards, "Structured programming series vol. VII and addendum," RADC-TR-300, Sept. 1974 and Apr. 1975.
[23] DOD Manual for DOD Automated Data Systems Documentation Standards, DOD Manual 7935.1S, Sept. 1977.
[24] MIL-STD-483 (USAF), "Configuration management practices for systems, equipments, munitions, and computer programs," Dec. 1970.
[25] MIL-STD-490, "Specification practices," Oct. 1968.
[26] MIL-STD-1521A (USAF), "Technical reviews and audits for systems, equipments, and computer programs," June 1976.
[27] MIL-STD-1679 (Navy), "Weapon system software development," Dec. 1978.
[28] MIL-S-52779 (Army), "Software quality assurance program requirements," Apr. 1974.
[29] BMDP-77. Berkeley, CA: UCLA Press, 1977.
[30] G. Box, W. Hunter, and J. Hunter, Statistics for Experimenters. New York: Wiley, 1978.
[31] F. Kerlinger, Foundations of Behavioral Research, 2nd ed. New York: Holt, Rinehart and Winston, 1973.
[32] F. Ragland and D. Peercy, "Analysis of software maintainability evaluation process," BDM/TAC-78-698-TR, Dec. 1978.
[33] D. Peercy, "Software maintainability evaluation guidelines handbook," BDM/TAC-78-687-TR, Dec. 1978.
[34] D. Peercy and T. Paschich, "Software maintainability analysis program user's manual," BDM/TAC-78-697-TR, Dec. 1978.

David E. Peercy received the B.S. degree in applied mathematics from the University of Colorado, Boulder, in 1966, and the M.S. and Ph.D. degrees in mathematics from New Mexico State University, Las Cruces, in 1967 and 1971, respectively.

From 1970 to 1972 he was a Post-Doctoral Fellow in the Department of Mathematics at West Virginia University, Morgantown, and from 1973 to 1977 he was a member of the Technical Staff of Texas Instruments performing research in software engineering methodology and development of real-time systems. In 1977 he joined the BDM Corporation, where he is currently a Senior Computer Scientist, serving as a Technical Consultant for the Computer Science Department. His research interests include software development methodology and software quality assessment.

Dr. Peercy is a member of the ANSI X3J9 Committee considering the standardization of the Pascal programming language.
