
Two Controlled Experiments Assessing the Usefulness of Design Pattern Documentation in Program Maintenance

Lutz Prechelt, Member, IEEE Computer Society, Barbara Unger-Lamprecht, Member, IEEE, Michael Philippsen, Member, IEEE Computer Society, and Walter F. Tichy, Member, IEEE Computer Society

Abstract: Using design patterns is claimed to improve programmer productivity and software quality. Such improvements may manifest both at construction time (in faster and better program design) and at maintenance time (in faster and more accurate program comprehension). This paper focuses on the maintenance context and reports on experimental tests of the following question: Does it help the maintainer if the design patterns in the program code are documented explicitly (using source code comments) compared to a well-commented program without explicit reference to design patterns? Subjects performed maintenance tasks on two programs ranging from 360 to 560 LOC including comments. Both programs contained design patterns. The controlled variable was whether the use of design patterns was documented explicitly or not. The experiments thus tested whether pattern comment lines (PCL) help during maintenance if patterns are relevant and sufficient program comments are already present. It turns out that this question is a challenge for the experimental methodology: A setup leading to relevant results is quite difficult to find. We discuss these issues in detail and suggest a general approach to such situations. The experiment was performed with Java by 74 German graduate students and then repeated with C++ by 22 American undergraduate students. A conservative analysis of the results supports the hypothesis that pattern-relevant maintenance tasks were completed faster or with fewer errors if redundant design pattern information was provided. Redundant means that the information carried in pattern comments is also available in different form in other comments. The contribution of this article is twofold: It provides the first controlled experiment results on design pattern usage and it presents a solution approach to an important class of experiment design problems for experiments regarding documentation.

Index Terms: Controlled experiment, design pattern, comments, documentation, maintenance.


1 INTRODUCTION

A software design pattern describes a proven solution to a software design problem with the goal of making the solution reusable. Design patterns provide proven solutions to known problems, encouraging reuse and relieving programmers of reinvention.

The idea of design patterns has quickly caught the attention of practitioners and researchers and the pattern literature is burgeoning. The first systematic collection of design patterns was published by Gamma et al. [9] (nicknamed the "Gang of Four Book"). Shortly thereafter, additional patterns were reported by Buschmann et al. [4]. The book by Shaw and Garlan [19] also provides a wealth of patterns for software architecture. Annual workshops are being held [6], [25], [14] to promote pattern mining and a consistent style of reporting patterns. Pattern papers are published in other software conferences as well, reporting on new patterns, pattern taxonomies, and pattern tools. Formalizations of patterns are sought and tools are being built for pattern mining, identifying known patterns in existing software, and programming with patterns.

The main advantages claimed for design patterns, according to the pattern literature, are as follows:

1. Using patterns improves programmer productivity and program quality.

2. Novices can increase their design skills significantly by studying and applying design patterns.

3. Patterns encourage best practices, even for experienced designers.

4. Design patterns improve communication, both among developers and from developers to maintainers.

As yet, there exists no coherent theory that would explain these advantages specifically in the context of design patterns.

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 28, NO. 6, JUNE 2002 595

• L. Prechelt is with abaXX Technology, Forststrasse 7, D-70174 Stuttgart, Germany. E-mail: [email protected].

• B. Unger-Lamprecht is with sd&m, Herrnstrasse 57, 63065 Offenbach am Main, Germany. E-mail: [email protected].

• M. Philippsen and W.F. Tichy are with the Fakultät für Informatik, Universität Karlsruhe, D-76128 Karlsruhe, Germany. E-mail: {phlipp, tichy}@ira.uka.de.

Manuscript received 24 Mar. 2000; revised 27 Feb. 2001; accepted 10 Aug. 2001. Recommended for acceptance by L. Briand. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 111835.

0098-5589/02/$17.00 © 2002 IEEE

1.1 Our Experiments

The experiments reported here represent the first attempts at formally testing some aspects of the above claim 4. Our experiments are set in a maintenance context. Assume a maintainer knows what design patterns are and how they are used. Furthermore, assume that a program was designed and implemented using patterns. Now, the question is:

Does it help the maintainer if the design patterns in the program code are documented explicitly (using source code comments) compared to a well-commented program without explicit reference to design patterns?

We investigate this question in the following manner: Several subjects receive the same program source code and the same change requests for that program; the subjects then provide appropriate changes sketched on paper (first experiment) or as operational program code (second experiment). The change requests concern those aspects of the program that are implemented using design patterns. The program is commented in detail, but the subjects in the control group receive no explicit information about design patterns in the program; they may derive pattern information from other program comments or not at all. The experiment group, by contrast, receives the same program with the same comments plus a few additional lines of comment (called pattern comment lines or PCL) that describe pattern usage where applicable. Note that the experiment group thus has slightly more comments in their program. This might threaten the validity of the experiments. We will discuss and resolve this important methodological issue in Section 3. Subjects are assigned randomly to the groups. We investigate whether and how the performance of the two groups differs by measuring completion time, grading answers, and counting correct solutions.

The experiments were performed with a total of 96 student subjects, each working in a single session of two to four hours. The tasks were based on two different programs about 6 to 10 printed pages in length.

1.2 Experiment Rationale

Note that the experiments do not test the presence or absence of patterns, but merely the documentation of patterns. This question is much cleaner because it is unclear what alternative program design should be used for comparison when testing the presence or absence of patterns.

Why do we think that adding PCL is a useful proposition? Most theories of program comprehension state that the comprehension process is driven by hypotheses [3], [13] formed and validated during the understanding process. (A nice overview of this view of program comprehension can be found in [26]. Further recent research papers can be found in [28], [29], [30], [31].) Once a program is understood, its meaning is represented as a hierarchy of hypotheses. Without prior knowledge, systematic program understanding must work bottom-up [15], inducing the meaning and purpose of program parts step-by-step, starting from individual declarations and statements. However, a completely bottom-up approach is unrealistic for larger programs. Therefore, if a programmer finds hints (so-called beacons) to familiar kinds of structures within the code, he or she will switch to a top-down approach of understanding that allows for generating and validating hypotheses much more quickly. Beacons can be structural (such as code idioms), but more often they are simply names of program entities [10].

If this view is correct, PCL is probably a very powerful aid to program understanding because each description of a design pattern usage directly leads to a large-grain hypothesis that spans multiple classes and that is relatively easy to verify top-down. In this case, it would be a highly efficient engineering practice to always document the use of design patterns by PCL because writing PCL requires only little effort. Slightly different views of program comprehension discussed in the next section provide further support for the idea that adding PCL may be highly worthwhile.

1.3 Other Related Work

The program comprehension theory presented by Brooks [2] suggests that PCL may be a highly effective sort of comment because it specifically aims at bridging different "knowledge domains."

A different view is provided by the notion of delocalized plans [21] introduced by Soloway et al. Their experiments showed that delocalized plans account for much of the effort during program understanding. Wilde et al. [27] argue that object-oriented programs are full of delocalized plans due to inheritance and delegation. Design pattern instances are delocalized plans as well because a single design idea is spread over multiple classes and methods. From this perspective, PCL may be helpful because it links the parts of important delocalized plans.

The empirical work of Shneiderman and Mayer [20] confirms several basic assumptions about comments: Mnemonic names and comments are both useful for program understanding and higher-level comments are more useful than comments on a level of abstraction close to individual program statements. We do not know of any experiments comparing different forms of program comments in the context of object-oriented programs.

None of the work mentioned so far is concerned with design patterns per se; in fact, all of it was published long before the notion of software design patterns became popular. Published evidence about the effectiveness of design patterns is still scarce. Case study reports and anecdotal evidence of positive effects can be found, for instance, in [1], [9]. Part of the program maintenance literature is loosely relevant to pattern effectiveness, but we have found no empirical work that specifically addresses design patterns as an aid to maintenance. Likewise, as far as we know, the design pattern community itself has not yet undertaken controlled experiments to test design pattern claims.

We are not aware of any reports on the usage density of PCL in software containing design pattern uses. However, it appears that, in the thoroughly designed APIs of libraries and frameworks, at least a surrogate to PCL becomes increasingly common: naming conventions that indicate the usage of some of the more frequent design patterns; for instance, class name components such as listener, event, composite, container, observer in Sun's Java Development Kit (JDK 1.3), etc.

1.4 Structure of the Article

The next section describes the design and implementation of the experiments, including a statement of the hypotheses, a description of the subjects' background, a description of the program (including program comment examples) and tasks, and a discussion of possible threats to the internal and external validity of the experiments. Section 3 discusses the methodological problems caused by adding comments and how to handle them in similar situations. Section 4 presents the results and Section 5 summarizes and raises questions for future research.

We provide only a short description of the programs and tasks. Detailed information, including all original experiment materials such as task descriptions and source program listings, is available in [8], [17], [18].

2 DESCRIPTION OF THE EXPERIMENTS

The first experiment was performed in January 1997 at the University of Karlsruhe (UKA), the second in May 1997 at Washington University St. Louis (WUSTL). Although the experiments were similar, there were some variations. Also, the strengths and weaknesses of their data are not in the same spots so that the experiments complement one another quite nicely. We therefore describe the experiments separately and refer to them as UKA and WUSTL, respectively. Where appropriate, we provide information for UKA first and append the corresponding information for WUSTL in angle brackets ⟨like this⟩.

2.1 Hypotheses

First, we need to define the concept of pattern-relevance. A maintenance task on a program is pattern-relevant if 1) the program contains one or more software design patterns and 2) grasping the patterns in the program is expected to simplify the maintenance task.

The experiments aimed at testing the following hypotheses, which we straightforwardly derived from claims found in the design pattern literature [1], [9]:

Hypothesis H1. By adding PCL, pattern-relevant maintenance tasks are completed faster.

Hypothesis H2. By adding PCL, fewer errors are committed in pattern-relevant maintenance tasks.

The speed of task completion is measured in minutes. The number of errors is quantified by assigning points (see Section 2.6) and by counting correct solutions.

2.2 Subjects and Environments

The 74 ⟨22⟩ subjects of the UKA ⟨WUSTL⟩ experiment were 64 ⟨0⟩ graduate and 10 ⟨22⟩ undergraduate computer science students.

They had taken a 6-week ⟨12-week⟩ intensive ⟨standard⟩ lecture and lab course on Java ⟨C++⟩ and design patterns before the experiment. The course included practical implementations of programs involving the following design patterns: Composite, Visitor, Observer, Template Method, Strategy ⟨Wrapper, Adapter, Template Method, Bridge, Singleton, Facade, Strategy, Factory Method, Visitor, Interpreter, Builder⟩. On average, the subjects' previous programming experience was 7.5 years ⟨5 years⟩ using 4.6 ⟨4.0⟩ different languages with a largest program of 3,510 LOC ⟨2,557 LOC⟩. Before the course, 69 percent ⟨76 percent⟩ of the subjects had some previous experience with object-oriented programming, 58 percent ⟨50 percent⟩ with programming GUIs.

The subjects were knowledgeable about design patterns, as indicated by a pattern knowledge test conducted at the start of each experiment [17], [18]. For those patterns that were relevant in the experiments, the UKA subjects' pattern knowledge was better than that of the WUSTL subjects because the UKA course had directly been targeted at the experiment but the WUSTL course had not. For some of the relevant patterns, the WUSTL subjects had no practical experience, in contrast to the UKA subjects.

Each of the experiments was performed in a single session of two to four hours. The UKA subjects had to write their solutions on paper. The WUSTL subjects implemented their solutions on Unix workstations.

2.3 Programs Used

Each subject worked on two different programs. Both programs were written in Java ⟨C++⟩ using design patterns and were thoroughly commented.

Program And/Or-tree is a library for handling And/Or-trees of strings and a simple application of it. It has 362 ⟨498⟩ LOC in 7 ⟨6⟩ classes; 133 ⟨178⟩ of these LOC contain only comments and an additional 18 ⟨22⟩ lines of PCL were added in the version with PCL. (Section 3 discusses why adding PCL is an acceptable and meaningful manipulation.) And/Or-tree uses the Composite and the Visitor design pattern [9].

Program Phonebook is a GUI program for reading tuples (name, first name, phone number) entered by the user and showing them in different views on the screen; see the screenshot in Fig. 1. Because the WUSTL subjects had not learned a GUI library in the course, the C++ version of Phonebook is stream-I/O-based: It reads all of its inputs from the keyboard and completely redisplays all views to standard output after each change. Phonebook has 565 ⟨448⟩ LOC in 11 ⟨6⟩ classes; 197 ⟨145⟩ of these LOC contain only comments and an additional 14 ⟨10⟩ lines of PCL were added in the version with PCL. Phonebook uses the Observer and the Template Method design pattern [9]. See Fig. 2 for examples of PCL sections and other program comments.

References [8], [17], [18] contain the full commented source code of the programs.
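To put the line counts above into proportion, the following minimal sketch computes the share of comment-only lines and the relative growth of the comment volume caused by the PCL. The numbers are those quoted in this section; the dictionary layout and the function names are assumptions made only for this illustration.

```python
# Line counts quoted in Section 2.3 for the Java (UKA) and C++ (WUSTL) versions.
# Tuple layout (an assumption of this sketch):
# (total LOC, comment-only LOC, PCL lines added in the PCL version)
PROGRAMS = {
    ("And/Or-tree", "UKA"): (362, 133, 18),
    ("And/Or-tree", "WUSTL"): (498, 178, 22),
    ("Phonebook", "UKA"): (565, 197, 14),
    ("Phonebook", "WUSTL"): (448, 145, 10),
}


def comment_share(total_loc, comment_loc):
    """Fraction of all lines that contain only comments."""
    return comment_loc / total_loc


def pcl_growth(comment_loc, pcl_lines):
    """Relative growth of the comment volume when the PCL lines are added."""
    return pcl_lines / comment_loc


for (program, site), (total, comments, pcl) in PROGRAMS.items():
    print(
        f"{program:12s} {site:6s} "
        f"comment share {comment_share(total, comments):6.1%}, "
        f"PCL adds {pcl_growth(comments, pcl):6.1%} to the comment lines"
    )
```

For UKA Phonebook the comment share computed this way agrees with the 35 percent stated in the caption of Fig. 2.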


Fig. 1. Screenshot of UKA Phonebook program.

2.4 Experiment Controls and Group Sizes

The independent variable in both experiments was the presence or absence of design pattern comment lines (PCL) in the comments of the source programs.

We used a counterbalanced experiment design [5] with random subject assignment; see Table 1. The first variable is the order in which a subject receives the two programs. One of those programs was supplied with PCL, the other without. This design results in a second variable, i.e., the order of PCL and non-PCL: first with, then without PCL, and vice versa. The combination of the variables results in four groups. The subjects did not know in advance whether a program would contain PCL or not; they did not even know that the existence of PCL would be a treatment variable.
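The counterbalanced design can be sketched in a few lines. This is only an illustration under stated assumptions: the paper says that subjects were assigned randomly to four groups, but it does not describe the procedure, so the shuffle followed by round-robin dealing, the function name `assign_groups`, and the seed are hypothetical. The group labels follow the notation of Table 1.

```python
import random
from collections import Counter

# The four cells of the counterbalanced design: program order crossed with
# which program carries the PCL. "A+P-" means: And/Or-tree with PCL first,
# then Phonebook without PCL.
GROUPS = ["A+P-", "A-P+", "P+A-", "P-A+"]


def assign_groups(subject_ids, seed=None):
    """Randomly assign subjects to the four groups with near-equal sizes.

    Assumption of this sketch: shuffle once, then deal round-robin.
    """
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    return {subject: GROUPS[i % len(GROUPS)] for i, subject in enumerate(shuffled)}


# 74 subjects as in UKA; the seed is arbitrary and only makes the run repeatable.
assignment = assign_groups(range(1, 75), seed=1997)
print(Counter(assignment.values()))
```

Every subject works on both programs, once with and once without PCL, so differences in ability between subjects affect both conditions.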

2.5 Tasks

For And/Or-tree, each subject received a task consisting of the following four subtasks:

1. Find the right spot for a particular output format change,

2. Give an expression to compute the number of variants represented by a tree,

3. Create an additional Visitor class that computes the number of variants faster (similar to an already existing class computing depth information), and

4. Instantiate such a Visitor and print its result.

Subtasks 3 and 4 are pattern-relevant. Subtask 4 was not explicitly required in WUSTL.

For Phonebook, each UKA subject received a task consisting of the following five subtasks:

1. Find two spots for small program changes (output format change, window size change),

2. (same as 1),

3. Create an additional Observer class using a Template Method,

4. Instantiate and register such an Observer,

5. Create an additional Observer class similar to an already existing one not using a Template Method.

Subtasks 3 to 5 are pattern-relevant.

Of course, none of the task descriptions mentioned any of the design patterns at all.

There are two important differences between UKA and WUSTL regarding Phonebook. First, in UKA subtask 3, a similar class was already present in the program. Subtask 3 could thus be solved by imitation; this was not true for WUSTL. Second, subtask 2 was not required in WUSTL and subtask 4 was stated as a part of 3 rather than as a separate subtask.

For the class creation subtasks, only the interface of the class needed to be written; the actual implementation was not required, although the WUSTL participants were asked to provide a complete solution if they could.

Fig. 2. PCL example: Three (out of four) comments containing PCL sections from the UKA Phonebook program, with original line numbers. Added PCL sections are lines 36-42 (in a 33-line global program comment not shown here), 409-411, 484-485, and 525. Note that, since the PCL header line does not contain semantic information, it was not considered part of the PCL although its existence may well be quite useful during program understanding. The program contains a total of 197 lines (or 35 percent of all lines) of normal comments. Note that the original program was in German, with one more pattern comment line due to different line breaks.
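To make the And/Or-tree subtasks more concrete, here is a minimal sketch of a Composite tree with two Visitors and a few pattern comment lines. It is not the experiment program: all class names, the PCL wording, and the assumed semantics of a "variant" (an And node combines one variant of every child, an Or node selects exactly one child) are hypothetical and chosen only for illustration.

```python
from math import prod

# *** DESIGN PATTERN: Composite *** (illustrative PCL, wording is hypothetical)
#   Node is the Component, StringLeaf the Leaf, AndNode and OrNode the Composites.
# *** DESIGN PATTERN: Visitor ***
#   Node.accept() is the Element interface; DepthVisitor and
#   VariantCountVisitor are concrete Visitors.


class Node:
    def accept(self, visitor):
        raise NotImplementedError


class StringLeaf(Node):
    def __init__(self, text):
        self.text = text

    def accept(self, visitor):
        return visitor.visit_leaf(self)


class AndNode(Node):
    def __init__(self, *children):
        self.children = children  # assumed to be non-empty

    def accept(self, visitor):
        return visitor.visit_and(self)


class OrNode(Node):
    def __init__(self, *children):
        self.children = children  # assumed to be non-empty

    def accept(self, visitor):
        return visitor.visit_or(self)


class DepthVisitor:
    """Plays the role of the already existing Visitor that computes depth."""

    def visit_leaf(self, leaf):
        return 1

    def visit_and(self, node):
        return 1 + max(child.accept(self) for child in node.children)

    def visit_or(self, node):
        return 1 + max(child.accept(self) for child in node.children)


class VariantCountVisitor:
    """New Visitor in the spirit of subtask 3, written by imitating DepthVisitor."""

    def visit_leaf(self, leaf):
        return 1

    def visit_and(self, node):
        # Assumed semantics: one variant of each child is combined.
        return prod(child.accept(self) for child in node.children)

    def visit_or(self, node):
        # Assumed semantics: exactly one child is chosen.
        return sum(child.accept(self) for child in node.children)


tree = AndNode(
    StringLeaf("a"),
    OrNode(StringLeaf("b"), StringLeaf("c")),
    OrNode(StringLeaf("d"), AndNode(StringLeaf("e"), StringLeaf("f")), StringLeaf("g")),
)
# Subtask 4 in miniature: instantiate the Visitor and print its result.
print(tree.accept(VariantCountVisitor()))
```

The comment block at the top names which class plays which role, which is the kind of information PCL is meant to carry; a maintainer who reads it can go directly to the existing Visitor and copy its structure.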

2.6 Measurements

For each task (but not for each subtask) of each subject, we measured the time between handing out and collecting the experiment materials. It is unclear how the time spent for general program understanding could be distributed among the subtasks, so no subtask time information was collected. For each subtask, the same rater graded all answers according to the degree of requirements fulfillment they provided. The grades were expressed in points. There was a total of 2 + 2 + 8 + 3 = 15 points ⟨2 + 2 + 8 = 12⟩ for the UKA ⟨WUSTL⟩ subtasks of And/Or-tree and 2 + 3 + 8 + 4 + 6 = 23 points ⟨2 + 8 + 8 = 18 points⟩ for those of Phonebook; see [17], [18] for the rating scales.
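The point totals above can be checked mechanically. The sketch below stores the maximum points per graded subtask as listed in this section; the helper `relative_score`, which expresses one subject's grades as a share of the maximum, is a hypothetical convenience and not a measure used in the paper.

```python
# Maximum points per graded subtask, copied from Section 2.6.
MAX_POINTS = {
    ("And/Or-tree", "UKA"): [2, 2, 8, 3],
    ("And/Or-tree", "WUSTL"): [2, 2, 8],
    ("Phonebook", "UKA"): [2, 3, 8, 4, 6],
    ("Phonebook", "WUSTL"): [2, 8, 8],
}


def max_total(program, site):
    """Maximum number of points obtainable for one task."""
    return sum(MAX_POINTS[(program, site)])


def relative_score(awarded, program, site):
    """Hypothetical helper: share of the maximum reached by one subject."""
    limits = MAX_POINTS[(program, site)]
    if len(awarded) != len(limits):
        raise ValueError("expected exactly one grade per subtask")
    if any(a < 0 or a > limit for a, limit in zip(awarded, limits)):
        raise ValueError("grade outside the rating scale")
    return sum(awarded) / sum(limits)


for program, site in MAX_POINTS:
    print(program, site, max_total(program, site))
```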

2.7 Threats to Internal Validity

All relevant external variables (subjects' programming experience, aptitude, motivation, environmental conditions, etc.) were equalized between the groups by randomized group assignment. Should bad luck nevertheless have produced groups with unbalanced subject ability, the counter-balanced experiment design has put the more capable subjects once into the experiment group and once into the control group.

The dominant control problem is mortality: Some WUSTL students gave up on a task when they thought it would be too difficult for them or take too long. The tendency to give up was relatively strong because the experiment was the very last event of the semester for these students and they told us they had to catch their busses and planes, etc. Four students gave up on both tasks. It would have been nice had mortality been lower, but the results are still meaningful because mortality occurred almost exactly as often in the groups with PCL as in those without PCL. When ignoring incomplete tasks entirely, the mortality therefore does not bias the results. See Table 1 for the resulting group sizes; there was no mortality in UKA.

The total working time of the subjects exhibits an almost perfectly symmetric distribution. Hence, there is no evidence that slower subjects hurried because of others having finished before them, although all subjects worked in the same room at the same time.

Our experiment setup lacked an acceptance test. As a consequence, the working time and solution quality can only be judged together since short working times may reflect unrepaired defects in a solution. This is not a threat to validity, but complicates the discussion of the results in Section 4.

Finally, although, at least for UKA, the number of subjects is comparatively large, the resulting data is still often too scarce for undebatable conclusions.

2.8 Threats to External Validity

There are several sources of differences between the experimental and real software maintenance situations that limit the generalizability (external validity) of the experiments: In real situations, there are subjects with more experience, often working in teams, and there are programs and maintenance tasks of different size or structure.

Experience. The most frequent concern with experiments using student subjects is that the results cannot be generalized to professionals because the latter are more experienced. In the present case, professional programmers may either have less need for PCL or they may be able to exploit PCL more profitably than our student subjects.

Team work. Realistic programs are usually team work. Individual tasks during maintenance may also often be performed by more than one programmer. Such cooperation requires additional communication about the program. In this case, PCL may have further advantages, not visible in the experiments, because one of the major (purported) advantages of design patterns is a common design terminology [23].

Program size and complexity. Compared to typical industrial size programs, the experiment programs are rather small and simple, neatly designed, and well-commented. This does not necessarily invalidate the results of the experiments, though. If a positive effect is found, increasing program complexity may magnify the effect because having PCL provides program slicing information. For pattern-relevant tasks, PCL information points out which parts of a program are relevant and enables one to ignore the rest; such information may become more useful as program size increases because more code can be ignored.

TABLE 1. The Four Experiment Groups and Their Sizes

The number of data points is one per subject, except for those subjects that did not complete the respective task but dropped out of the experiment instead. For UKA, there was no such mortality. See also Section 2.7. (A+P− stands for "first perform And/Or-tree with PCL, then perform Phonebook without PCL" and so on.)

Program and task representativeness. It is unknown whether the programs and tasks used in our experiments are (or are not) representative of realistic maintenance situations. We have but one indication that our programs are at least not totally different from other programs constructed using design patterns: The ratio of the total number of classes in the program to the number of design pattern instances found in our programs ranges between 3.0 and 5.5. These values are comparable to those found for Java AWT (3.8) and NextStep (3.1) [11]. Our article does not claim anything about maintenance tasks that are not pattern-relevant. See the conclusion section for more discussion of pattern-relevant tasks in realistic programs.
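The stated range can be reproduced from the class counts in Section 2.3 under one assumption that the text does not spell out: each program contains exactly two pattern instances, one for each of the two patterns named for it. With that assumption, a short calculation gives the following ratios.

```python
# Number of classes per program version, from Section 2.3.
CLASSES = {
    ("And/Or-tree", "UKA"): 7,
    ("And/Or-tree", "WUSTL"): 6,
    ("Phonebook", "UKA"): 11,
    ("Phonebook", "WUSTL"): 6,
}

# Assumption of this sketch: one instance of each of the two patterns
# named for a program (Composite and Visitor, or Observer and Template Method).
PATTERN_INSTANCES = 2

ratios = {key: count / PATTERN_INSTANCES for key, count in CLASSES.items()}

for (program, site), ratio in ratios.items():
    print(f"{program:12s} {site:6s} classes per pattern instance: {ratio:.1f}")

print("range:", min(ratios.values()), "to", max(ratios.values()))
```

Under this assumption the smallest and largest ratios are 3.0 and 5.5, matching the range given above, which brackets the values cited for Java AWT (3.8) and NextStep (3.1).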

Maintenance situations. Realistic maintenance situations will often be rather different from those found by our subjects. In particular, much larger and more complex programs and tasks may require making changes based on a much lower degree of overall program understanding than could be obtained for the small programs in the experiments. It is hard to say whether or when this will make PCL more useful or less useful than in the experiments. Furthermore, if programmers have to master a large design pattern repertoire, their understanding of individual patterns may be reduced and PCL may become less helpful.

Only repetition of similar experiments with professionals on real programs and real maintenance tasks can evaluate these threats. See the conclusion section for why we hope that our experiments estimate conservatively.

3 ON THE METHODOLOGY OF OUR EXPERIMENT DESIGN

The main independent variable in these experiments is the presence or absence of pattern comment lines (PCL) in addition to the normal program comments. Some readers may argue that this makes the experiments uninteresting and irrelevant (although still valid) because design and hypotheses are impure: The observations cannot discriminate between effects from having a specific type of comments (PCL) and effects from simply having more comments.

3.1 The Form/Content Conflict

Note that this situation will occur generically in many experiments that attempt evaluation of a specific form of an information source.

Each specific information source used during program development (such as PCL, UML diagrams, or any other) has both a certain form (syntax, appearance) and a certain content (semantics, meaning). In any experiment for evaluating some form F of an information source, we ideally need an experiment condition for the control group that is identical to the experiment group condition in all respects (in particular the content) except for the form of the information source. However, an alternative form A will rarely lend itself to expressing the exact same content as F in a natural form. For example, one cannot express in a realistic manner exactly each detail communicated by a UML diagram and nothing else in a different notation. Therefore, we must either force the content into A (and, thus, end up with an unrealistic instance of A) or accept some difference in content in addition to the difference in form in our experiment conditions (and, thus, mix two effects).

We call this problem the form/content conflict in experiment design.

If it arises, it is usually impossible to eliminate the conflict, but it can be reduced so that the relevance of the experiment is no longer seriously threatened. The following three sections argue why, in our specific experiment design, the conflict does not have a serious impact on the validity or utility of the results. Section 3.5 then reformulates the train of thought of this argumentation as a general rule for the design of such experiments.

We now argue, for our design, first, why purer hypotheses appear impossible to evaluate; second, why the experiments are useful as they are; and third, why not much of the observed effect is due to an increase in comment information content.

3.2 Why Purer Hypotheses Appear Impossible to Evaluate

A hypothetical perfect experiment would only test whether the specific form (or type) of comments represented by PCL has an advantage over whatever one would have written instead before the invention of design pattern terminology. It would not change any other aspect of the program. In order to do this, the following aspects (among others) would have to be kept constant:

• the absolute information content of the comments (because this is what makes comments useful),

• the length of the comments in lines or words (because reading comments consumes time and concentration),

• the amount of repetition of information in the comments (because repetition may aid understanding or finding information or may make the reader inattentive),

• the understandability of the comments,

• the comments' potential for creating confusion (because comments can sometimes make a program harder to understand),

• the placement of the comments with respect to the program entities they refer to (because that determines how easily information can be found or related to the program).

Balancing all of these aspects in a program change is impossible. We have the following choices for the experimental design:

1. Compensate adding PCL by removing other comments in the PCL program version.

   But, which other comments should we remove? We could easily balance the comment length, but there is no reproducible way to determine which comments are superfluous, useful, or crucial. We could not guarantee that we held the information content of the comments constant. Furthermore, we cannot keep both comment length and the amount of repetition constant, as will be quantified in Section 3.4.

2. Compensate adding PCL by adding normal comments in the non-PCL program version.

   But, what should these other comments say? Again, we could keep the comment length constant, but there is no reproducible way to give the exact same information content as in the PCL, other than the PCL itself; we cannot be sure not to confuse or annoy the user, etc.

3. Add PCL without any compensation. This will increase the comment length (small increase, see Section 2.3 and Fig. 2), information content (small increase, see Section 3.4), and repetition of information (large increase, see Section 3.4).

We pick the third solution because it is more realistic and well-defined and leads to more reproducible results.

3.3 Why the Experiments Are Useful

A possible effect of simply having more comment is consciously included in our hypotheses. If we find a positive effect from adding PCL, the experiments suggest the engineering advice "Add PCL to your programs in addition to the commenting you usually do." This advice is useful even if it should be possible to obtain similarly positive results with other types of additional comments because who could characterize, in program-independent terms, what these other comments should look like? For PCL, the characterization is straightforward: "Mention which classes, methods, and objects play which roles in a pattern"; see the example in Fig. 2.

3.4 Why Most of the Effect Is Due to Comment Type, Not Information Content

Are we merely (or mostly) measuring the effect of providing more information in the comments? As we show now, almost all of the information contained in the PCL is already present in the rest of the comments and is just repeated in a different form by PCL. Note that repetition as such can either improve program understanding (by useful redundancy) or slow it down (because less new information is found per line) [3]. See the conclusion section for our opinion on the present case.

We compiled a list of the design information units conveyed by the PCL and counted where and how often they were mentioned in the PCL and in the other comments.

For UKA And/Or-tree, this information is shown on the lefthand side of Table 2. As we see in the histograms, almost all of the information present in the PCL is already present in the rest of the program comments, in particular, in those units (labeled A and L) that point to the existence of design patterns in the program.

Similar results, although somewhat weaker, hold for theUKA Phonebook program; see the righthand side of Table 2.

So, the PCL information is almost entirely redundant and, plausibly, it is the specific type of the documentation that creates the effects we see: The same information might become clearer when expressed in PCL form.

Hence, we expect our results to mainly show effects from specifically adding PCL and not effects from adding any documentation.
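The counting described in this section can be sketched in code. The paper does not say how the counting was carried out, so the keyword matching below is only a crude, assumed stand-in for a manual analysis; the unit labels, keywords, and comment texts are invented for illustration and are not taken from Table 2.

```python
from collections import Counter


def count_units(comments, unit_keywords):
    """For each design information unit, count the comments that mention it."""
    counts = Counter({unit: 0 for unit in unit_keywords})
    for comment in comments:
        text = comment.lower()
        for unit, keywords in unit_keywords.items():
            if any(keyword in text for keyword in keywords):
                counts[unit] += 1
    return counts


def redundant_fraction(pcl_counts, other_counts):
    """Share of the units mentioned in the PCL that also occur elsewhere."""
    mentioned = [unit for unit, n in pcl_counts.items() if n > 0]
    if not mentioned:
        return 0.0
    covered = [unit for unit in mentioned if other_counts[unit] > 0]
    return len(covered) / len(mentioned)


# Hypothetical units and comments, for illustration only.
unit_keywords = {
    "A": ["observer pattern"],  # existence of the pattern
    "B": ["notif"],             # change notification
    "C": ["register"],          # registration of views
}
pcl_comments = [
    "Observer pattern: this class is the subject",
    "Observer pattern: views register here",
    "Observer pattern: notifies all observers",
]
other_comments = [
    "Informs every view after the data has changed (notification).",
    "Views must register before they receive data.",
]

pcl_counts = count_units(pcl_comments, unit_keywords)
other_counts = count_units(other_comments, unit_keywords)
fraction = redundant_fraction(pcl_counts, other_counts)
```

The two `Counter` objects correspond to the two histograms of Table 2 (occurrences in the PCL and in the rest of the comments), and `fraction` is one possible way to quantify how much of the PCL content is repeated elsewhere.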

3.5 General Methodological Rule

We can now formulate the above three-step process for handling form/content conflicts as a general methodological recipe.

1. Analyze all conceivable designs and choose the one that results in the least differences in content without requiring unrealistic conditions for either group. This ensures there is no better design for this research question.

2. Demonstrate that the experiment result expected by your hypotheses would be interesting. Ideally, it leads to a practical software engineering rule. In this case, formulate this rule explicitly. This ensures the experiment is useful despite the content difference.

3. Argue why, in your specific experiment setup, at most a small part of the observed effect, if any, will be due to different content, rather than different form. This ensures the form/content conflict is unimportant for the chosen experiment design.

4 RESULTS AND DISCUSSION

This section discusses the results of both experiments as summarized in Tables 3 and 4. For WUSTL, the results of subjects that did not deliver a particular task were ignored for that task. For UKA, all subjects delivered both tasks.

4.1 Results for And/Or-tree

The results are summarized in Table 3 and Fig. 3. The first line of Table 3 indicates that the group with PCL averaged slightly more points in the pattern-relevant subtasks ("relevant points"). However, the p-value in the rightmost column indicates that the difference has a 20 percent chance of being accidental.

The time required with PCL appears to be larger than the time without PCL (line 3). However, this observation is misleading because the number of correct solutions is much lower for the non-PCL group (Fisher exact p = 0.077), as shown in line 2. In real software maintenance, incorrect solutions would be detected and corrected, taking additional time not observed in the experiment. In contrast, the experiment's noncomputerized working environment made it difficult for a subject to check whether a solution was correct. Obviously, the time spent on incorrect work cannot be sensibly compared to the time spent on correct work. Therefore, instead of comparing the time required by all subjects, we should compare only the time required by subjects with correct solutions. As a result, the confidence interval for the work time difference ranges from far positive to far negative. Hence, we should consider the work times to be essentially the same; see also Fig. 3. The group with PCL is still a little slower because it is a much larger fraction of the whole and, hence, presumably contained less capable subjects on average.
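For readers unfamiliar with the Fisher exact test mentioned above, the following sketch computes a one-sided p-value for a 2x2 table of correct versus incorrect solutions per group. The text does not state whether the reported value is one-sided or two-sided, so the one-sided form is an assumption, and the counts in the usage line are the classic textbook table, not the counts behind Table 3.

```python
from math import comb


def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups, columns are correct/incorrect solutions.
    Returns the probability, with all margins held fixed, of seeing at
    least `a` correct solutions in the first group if group membership
    and correctness were unrelated.
    """
    row1 = a + b           # size of the first group
    col1 = a + c           # total number of correct solutions
    n = a + b + c + d      # total number of subjects
    upper = min(row1, col1)
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(n, col1)


# Textbook example table, NOT the experiment's data.
p_value = fisher_exact_one_sided(3, 1, 1, 3)
```

The test conditions on the row and column totals, which is why it remains usable for the small group sizes that occur in experiments such as these.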


In terms of correctness alone, the much larger fraction of correct solutions in the PCL group suggests that PCL allowed for good solutions by less capable programmers than was possible without PCL.

Furthermore, Fig. 4 tells us that, without PCL, the slower (less capable?) subjects produce solutions of much lower quality, whereas, with PCL, the quality is largely independent of the time required. This suggests that, if a programmer is not able to solve a task, PCL may help in recognizing that fact; the solution will be bad, but the time is not longer.

The WUSTL results for And/Or-tree indicate that the PCL group is faster, as shown in the lower part of Table 3. We find no clear difference in the number of points (line 5) nor in the number of completely correct solutions (line 6), presumably because this group could not only design but also implement and test their solutions. We do, however, find a large advantage in the time required for the group with PCL (line 7). With a confidence of 90 percent, having PCL saves between zero and 43 percent of the maintenance time for this task. Due to the small number of subjects, we cannot statistically evaluate the results of just the completely correct solutions alone, but the trend remains the same.

TABLE 2
List of the Design Information Units Provided by the PCL of the UKA And/Or-Tree Program (Lefthand Side) and the UKA Phonebook Program (Righthand Side)

The histograms indicate the number of occurrences of each design information unit (with the important ones emphasized). Top histogram: occurrences in the PCL. Bottom histogram: occurrences in the rest of the comments.

As for the learning effect, the UKA subjects were, on average, significantly faster (but did not obtain more points) in their second task. The size of the effect is independent of the presence or absence of PCL. This data is not shown in the table. Differences due to learning are compensated by the counterbalanced experiment design and are thus not relevant for the interpretation of the results. The learning effect could not be assessed for WUSTL because the group sizes were too small. The shorter time for the second task could also be explained by pressure to finish that builds as soon as the first few subjects have finished and left the room. However, such an effect would also result in reduced quality of the solutions, which we have not found, and is therefore unlikely.


TABLE 3
Results for the And/Or-Tree Task

Columns are (from left to right): line number; name of variable; arithmetic average $D^{+}$ of subjects provided with design pattern information (PCL); ditto without PCL ($D^{-}$); 90 percent confidence interval $I$ for difference $D^{+} - D^{-}$ (measured in percent of $D^{-}$); significance $p$ of the difference (one-sided). $I$ and $p$ were computed using Bootstrap resampling (a simulation-based statistical estimation technique [7]) with 10,000 trials because many distributions were distinctly nonnormal. "Relevant points" are the points for the pattern-relevant subtasks only.
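The bootstrap computation named in the caption can be sketched as follows. This is a minimal percentile bootstrap under stated assumptions, not the authors' actual procedure: the caption gives only the number of trials, so the resampling scheme, the percentile interval, and the way the one-sided p-value is obtained are assumptions, and the work times in the usage lines are invented.

```python
import random
from statistics import mean


def bootstrap_percent_diff(with_pcl, without_pcl, trials=10_000, seed=0):
    """Percentile bootstrap for the difference of group means.

    The difference is expressed in percent of the mean of the group
    without PCL. Returns (low, high, p): a 90 percent interval and,
    for a time variable where smaller is better, the share of resamples
    in which the PCL group was NOT faster (a one-sided p-value).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        # Resample each group with replacement, keeping its size.
        plus = mean(rng.choices(with_pcl, k=len(with_pcl)))
        minus = mean(rng.choices(without_pcl, k=len(without_pcl)))
        diffs.append(100.0 * (plus - minus) / minus)
    diffs.sort()
    low = diffs[trials * 5 // 100]
    high = diffs[trials * 95 // 100 - 1]
    p = sum(d >= 0 for d in diffs) / trials
    return low, high, p


# Invented work times in minutes, NOT the experiment's data.
times_with_pcl = [38, 45, 52, 41, 60, 47, 55]
times_without_pcl = [50, 62, 48, 71, 58, 66, 53]
low, high, p = bootstrap_percent_diff(times_with_pcl, times_without_pcl)
print(f"90% interval: {low:.1f}% to {high:.1f}%, one-sided p = {p:.3f}")
```

Because the interval is read off the sorted resampled differences, no normality assumption is needed, which matches the reason the caption gives for using the bootstrap.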

TABLE 4
Results for the Phonebook Task

The WUSTL results had to be discarded because the subjects were not sufficiently qualified for this task; only one subject per group came up with a correct solution.

Fig. 3. Graphical display of the time entries (in minutes) from the indicated lines of Tables 3 and 4: The left plot of each pair represents the group with PCL, the right one the group without PCL. The dot marks the mean of the task completion time, the strip indicates a 90 percent confidence interval for the mean.

Fig. 4. Work time versus solution quality for the pattern-relevant part of the UKA And/Or-tree task. The trend line is a Loess local linear regression [24] with spanwidth 1 and its 90 percent confidence interval.

Summing up, the And/Or-tree results suggest that for nontrivial pattern-relevant maintenance tasks the presence of PCL may save time and/or help avoid mistakes.

4.2 Results for Phonebook

The results are summarized in Table 4 and Fig. 3. The Phonebook results of UKA show essentially no difference in the total number of points for the pattern-relevant subtasks (line 8) or the number of solutions that were completely correct (line 9). Although almost half of the participants made at least one mistake, the rather high relevant point values (over 16 out of a possible 18) indicate that the task was simple for these subjects.

Despite the simplicity, however, the group with PCL managed to solve the task faster than the group without PCL, as shown in line 10. The advantage can also be quantified: It has an expected size somewhere between zero and 22 percent (with 90 percent confidence); see also Fig. 5. To ensure that the time difference is not some weird artifact related to different behavior of the subjects making serious mistakes, we may also compare only the times of those subjects that made no or almost no mistake in their solution. We find that the time difference still points in the same direction (not shown in the table). The learning effect was similar to that described for And/Or-tree above.

Unfortunately, the WUSTL results for this task are meaningless since there is only a single correct solution in each group. Thus, it makes little sense to compare the results. Our postmortem analysis found that, for the given subjects, the task was too difficult for two reasons. First, the non-GUI presentation style of the WUSTL Phonebook program made use of the Observer pattern rather unintuitive and obscure. Second, these subjects had never actually implemented an Observer and there was no example class in the program that they could imitate (in contrast to the UKA Phonebook program). Our results suggest that, under such circumstances, PCL might be worthless.

Summing up, the Phonebook results suggest that, for simple pattern-relevant maintenance tasks, the presence of PCL may save time.

5 INTERPRETATION AND CONCLUSIONS

In summary, we find that our results support both of the hypotheses introduced in Section 2.1. Each of our results either supports Hypothesis H1 (pattern-relevant maintenance tasks will be completed faster if PCL is added) or Hypothesis H2 (fewer errors will be committed in pattern-relevant maintenance tasks if PCL is added). For the UKA Phonebook and for the WUSTL And/Or-tree, the PCL group was faster than the non-PCL group. We found the size of the effect (0 to 40 percent speedup) to be considerable, although this may not generalize to other cases. For the UKA And/Or-tree, the PCL group produced solutions with fewer mistakes than the non-PCL group. In particular, twice as many subjects arrived at a completely correct solution. None of the three results supports both hypotheses at once, but there is no evidence for the opposite of the hypotheses either.

Note that the design of these experiments was rather conservative; our design decisions biased them toward not showing any effects from adding PCL, in particular:

1. The programs were rather small, so, even without PCL, the subjects could achieve good program understanding within a reasonable time. In real software projects, PCL might be more helpful if the fraction of program understanding effort that PCL can save grows with the size of the program.

2. The programs were thoroughly commented, not only at the statement level, but also at the method, class, and program levels. Thus, the subjects had sufficient documentation available for program understanding, even without PCL; see Section 3.4. In contrast, many programs in the real world lack design documentation. PCL is probably an efficient form of design documentation as it is rather compact.

Given these circumstances, performance advantages through PCL may often be even more pronounced in real situations than in our experiments. In any case, even a reduction in maintenance effort of a few percent outweighs the additional development cost for inserting PCL into source programs, at least if PCL is inserted right during development.

We conclude that, depending on the particular program, maintenance task, pattern knowledge, and personnel, PCL in a program may considerably reduce the time required for a program change or may help improve the quality of the change. We therefore recommend that design patterns always be documented explicitly in program source code.

5.1 Why Is PCL Beneficial?

The analysis of design information units (see Section 3.4) suggests two answers to this question.

First, we often found the information provided by PCL much clearer than that in the other comments, which was sometimes vague, ambiguous, or spread out (delocalized). This was to be expected because a clear and compact terminology is considered one of the primary advantages of patterns.

Second, PCL repeats some design information units more often than normal comments do, in particular with respect to mentioning the presence of a pattern; see Table 2. This type of repetition has two advantages: It reduces the concentration required to capture the relation between the pattern parts and it makes readers aware of the pattern even if they do not pick an appropriate exploration path through the program text.

Fig. 5. Work time distribution for UKA Phonebook task. The fat dot is the median, the box indicates the 25 percent and 75 percent quantiles, the whiskers indicate the 10 percent and 90 percent quantiles, and the M and dashed line indicate the mean plus/minus one standard error of the mean.

This property makes PCL similar to the rather voluminous documentation that Soloway et al. [21] suggest for structured programs in order to overcome problems from delocalized plans, that is, design decisions whose consequences are spread out over multiple parts of a program. Wilde et al. [27] argue that delocalized plans are frequent in object-oriented programs due to inheritance, large numbers of small methods, and dynamic binding. It seems that design patterns are a good means for understanding many such delocalized plans during maintenance, provided that PCL supplies very powerful beacons that point out where patterns were used, each explaining a significant fraction of the overall program structure.

Likewise, the effectiveness of PCL can be explained using the theory of Brooks [2]: PCL can be considered to provide a bridge between the knowledge domains "design idea" and "understanding a single method or class" because design ideas are explained by mentioning a design pattern (which PCL does) and PCL is attached to individual class or method heads.

5.2 Further Work

We should perform related experiments in different settings. The following questions appear most important: First, how do the effects change for larger programs? Second, how do they change for more difficult tasks or when correct results are enforced by an acceptance test? Third, how do the effects change when much larger numbers of different patterns come into play (often with overlap between their instances)? Fourth, how do they change when multiple programmers have to cooperate (and hence communicate) in order to make a change? Fifth, what are the effects if programs are largely uncommented and how, in general, does PCL interact with other documentation? Sixth, what fraction of the effect is explained simply by the fact that PCL is a form of structured (rather than arbitrary style) comments? To what degree is it relevant how PCL is phrased or formatted? Finally, is PCL also helpful during code inspections? Moreover, empirical studies of existing software should determine what fraction of the change tasks (or change effort) is pattern-relevant. The question of design stability also needs to be addressed: We have found initial evidence that PCL may slow down architectural erosion and drift (as described by [16]), i.e., delay the decay of the original software design structure. Finally, and perhaps most interestingly, how does maintenance compare for software with patterns versus equivalent software without?

ACKNOWLEDGMENTS

The authors would like to thank Douglas Schmidt for the opportunity to carry out the experiment at Washington University. They also thank all their experimental subjects for making this project possible and the several critical reviewers of previous versions of this paper, in particular, for pointing out that a methodological contribution was waiting to be made.

REFERENCES

[1] K. Beck, J.O. Coplien, R. Crocker, L. Dominick, G. Meszaros, F. Paulisch, and J. Vlissides, "Industrial Experience with Design Patterns," Proc. 18th Int'l Conf. Software Eng., pp. 103-114, Mar. 1996.

[2] R. Brooks, "Using a Behavioral Theory of Program Comprehension in Software Engineering," Proc. Third Int'l Conf. Software Eng., pp. 196-201, 1978.

[3] R. Brooks, "Towards a Theory of the Comprehension of Computer Programs," Int'l J. Man-Machine Studies, vol. 18, no. 6, pp. 543-554, June 1983.

[4] F. Buschmann, R. Meunier, H. Rohnert, P. Sommerlad, and M. Stal, Pattern-Oriented Software Architecture: A System of Patterns. Chichester, U.K.: John Wiley & Sons, 1996.

[5] L.B. Christensen, Experimental Methodology, sixth ed. Needham Heights, Mass.: Allyn and Bacon, 1994.

[6] Pattern Languages of Program Design, J.O. Coplien and D.C. Schmidt, eds. Monticello, Ill.: Addison-Wesley, 1995.

[7] B. Efron and R. Tibshirani, An Introduction to the Bootstrap. New York: Chapman and Hall, 1993.

[8] EIR, Web pages of the Karlsruhe Empirical Informatics Research Group. http://wwwipd.ira.uka.de/EIR/, 2002.

[9] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software. Reading, Mass.: Addison-Wesley, 1995.

[10] E.M. Gellenbeck and C.R. Cook, "An Investigation of Procedure and Variable Names as Beacons During Program Comprehension," Empirical Studies of Programmers: Proc. Fourth Workshop, pp. 65-81, 1991.

[11] O. Gramberg, "Counting the Use of Software Design Patterns in Java AWT and NeXTstep," Technical Report 19/1997, Fakultät für Informatik, Univ. Karlsruhe, Germany, Dec. 1997, ftp.ira.uka.de.

[12] Empirical Studies of Programmers: Proc. Fourth Workshop, J. Koenemann-Belliveau, T.G. Mohrer, and S.P. Robertson, eds., Dec. 1991.

[13] S. Letovsky, "Cognitive Processes in Program Comprehension," Empirical Studies of Programmers: Proc. First Workshop Empirical Studies of Programmers, pp. 58-79, 1986.

[14] Pattern Languages of Program Design 3, R. Martin, D. Riehle, and F. Buschmann, eds. Monticello, Ill.: Addison-Wesley, 1997.

[15] N. Pennington, "Stimulus Structures and Mental Representations in Expert Comprehension of Computer Programs," Cognitive Psychology, vol. 19, pp. 295-341, 1987.

[16] D.E. Perry and A.L. Wolf, "Foundations for the Study of Software Architecture," ACM SIGSOFT Software Eng. Notes, vol. 17, no. 4, pp. 40-52, Oct. 1992.

[17] L. Prechelt, "An Experiment on the Usefulness of Design Patterns: Detailed Description and Evaluation," Technical Report 9/1997, Fakultät für Informatik, Univ. Karlsruhe, Germany, June 1997, ftp.ira.uka.de.

[18] L. Prechelt, B. Unger, and D. Schmidt, "Replication of the First Controlled Experiment on the Usefulness of Design Patterns: Detailed Description and Evaluation," Technical Report wucs-97-34, Washington Univ., Dept. of Computer Science, St. Louis, Dec. 1997. http://www.cs.wustl.edu/cs/cs/publications.html.

[19] M. Shaw and D. Garlan, Software Architecture: Perspectives on an Emerging Discipline. Upper Saddle River, N.J.: Prentice Hall, 1996.

[20] B. Shneiderman and R. Mayer, "Syntactic/Semantic Interactions in Programmer Behavior: A Model and Experimental Results," Int'l J. Computer and Information Sciences, vol. 8, no. 3, pp. 219-238, 1979.

[21] E. Soloway, J. Pinto, S. Letovsky, D. Littman, and R. Lampert, "Designing Documentation to Compensate for Delocalized Plans," Comm. ACM, vol. 31, no. 11, pp. 1259-1267, Nov. 1988.

[22] Empirical Studies of Programmers: Proc. First Workshop, E. Soloway and S. Iyengar, eds., June 1986.

[23] B. Unger and W.F. Tichy, "Do Design Patterns Improve Communication? An Experiment with Pair Design," Proc. Int'l Workshop Empirical Studies of Software Maintenance, George Stark, ed., http://members.aol.com/GEShome/wess2000/unger-tichy.pdf, Oct. 2000.

[24] W.N. Venables and B.D. Ripley, Modern Applied Statistics with S-PLUS, second ed. Springer-Verlag, 1997.

[25] Pattern Languages of Program Design 2, J.M. Vlissides, J.O. Coplien, and N.L. Kerth, eds. Monticello, Ill.: Addison-Wesley, 1996.

[26] A. von Mayrhauser and S. Lang, "A Coding Scheme to Support Systematic Analysis of Software Comprehension," IEEE Trans. Software Eng., vol. 25, no. 4, pp. 526-540, July/Aug. 1999.

[27] N. Wilde, P. Matthews, and R. Huitt, "Maintaining Object-Oriented Software," IEEE Software, vol. 10, no. 1, pp. 75-80, Jan. 1993.

[28] Proc. Fifth Int'l Workshop Program Comprehension, Mar. 1997.

[29] Proc. Sixth Int'l Workshop Program Comprehension, June 1998.

[30] Proc. Seventh Int'l Workshop Program Comprehension, May 1999.

[31] Proc. Eighth Int'l Workshop Program Comprehension, June 2000.

Lutz Prechelt received the diploma (1990) and the PhD (1995) degrees in informatics from the School of Informatics, University of Karlsruhe, where he was also a senior researcher. His research interests include software engineering (in particular, using an empirical research approach, about which he also authored a book), compiler construction for parallel machines, measurement and benchmarking issues, and research methodology. He works at abaXX Technology, Stuttgart, as head of Training, Technical Documentation, and Process Management. He is a member of the IEEE Computer Society, ACM, and GI.

Barbara Unger-Lamprecht graduated with the diploma degree in 1995 and received the PhD degree in 2000 from the University of Karlsruhe. Her favorite research interests are in empirical software engineering with the main focus on design patterns and team communication. In May 2001, she joined sd&m Frankfurt. She is a member of the IEEE and the IEEE Computer Society.

Michael Philippsen received the Diploma (1989) and PhD (1993) degrees in computer science from the University of Karlsruhe, Germany. After a postdoctoral position at the International Computer Science Institute in Berkeley (1994-1995), he returned to Karlsruhe, where he currently works as a senior researcher in the Computer Science Department. In addition, he manages the Software Engineering Group of the Forschungszentrum Informatik (FZI) in Karlsruhe. His primary research interests are parallel systems, programming languages, and software engineering. Dr. Philippsen is a member of the ACM, the IEEE Computer Society, and GI.

Walter F. Tichy earned the MS and PhD degrees in computer science from Carnegie Mellon University in 1976 and 1980, respectively. He has been a professor of computer science at the University Karlsruhe, Germany, since 1986. Previously, he was a senior scientist at Carnegie Group, Inc., in Pittsburgh, Pennsylvania, and served six years on the Faculty of Computer Science at Purdue University in West Lafayette, Indiana. His primary research interests are software engineering and parallelism. He is currently directing research on a variety of topics, including empirical software engineering, ubiquitous computing, software configuration management, cluster computing, optimizing compilers for parallel computers, and optoelectronic interconnects. He has consulted widely for industry. He is director at the Forschungszentrum Informatik, a technology transfer institute. He is a cofounder of ParTec AG, a company specializing in software for computer clusters. Dr. Tichy is a member of the ACM, GI, and the IEEE Computer Society.
