
Two Controlled Experiments Assessing the Usefulness of Design Pattern Information During Program Maintenance


Submission to 'IEEE Trans. on Software Engineering', March 24, 2000

Two Controlled Experiments Assessing the Usefulness of Design Pattern Information in Program Maintenance

Lutz Prechelt, Barbara Unger, Michael Philippsen, Walter Tichy
Fakultät für Informatik, Universität Karlsruhe
D-76128 Karlsruhe, Germany
Phone: +49/721/608-3934, Fax: +49/721/608-7343
Email: prechelt,unger,phlipp,[email protected]
WWW: http://wwwipd.ira.uka.de/EIR/

Abstract

Using design patterns is claimed to improve programmer productivity and software quality. This paper reports on a first experimental test of this claim.

Subjects performed maintenance tasks on two programs ranging from 360 to 560 LOC including comments. Both programs contained design patterns. The controlled variable was whether the use of design patterns was documented explicitly or not. The experiments thus tested whether pattern comment lines (PCL) help during maintenance if patterns are relevant and sufficient program comments are already present. It turns out that this question is a challenge for the experimental methodology: a setup leading to relevant results is quite difficult to find. We discuss these issues in detail.

The experiment was performed with Java by 74 German graduate students and then repeated with C++ by 22 American undergraduate students. A conservative analysis of the results supports the hypothesis that pattern-relevant maintenance tasks are completed faster or with fewer errors if redundant design pattern information is provided. Redundant means that the information carried in pattern comments is also available in different form in other comments.

The contribution of this article is twofold: it provides the first controlled experiment results on design pattern usage, and it presents a solution approach to an important class of experiment design problems for experiments regarding documentation.

Keywords: controlled experiment, design pattern, comments, documentation, maintenance

1 Introduction

A software design pattern describes a proven solution to a software design problem with the goal of making the solution reusable. Design patterns are to programming-in-the-large what algorithms are to programming-in-the-small: both provide proven solutions to known problems, encouraging reuse and relieving programmers of reinvention.

The idea of design patterns has quickly caught the attention of practitioners and researchers, and the pattern literature is burgeoning. The first systematic collection of design patterns was published by Gamma, Helm, Johnson, and Vlissides [9] (nicknamed the "Gang of Four" book). Shortly thereafter, additional patterns were reported by Buschmann et al. [4]. The book by Shaw and Garlan [15] also provides a wealth of patterns for software architecture. Annual workshops are being held [6, 17, 11] to promote pattern mining and a consistent style of reporting patterns. Pattern papers are published in other software conferences as well, reporting on new patterns, pattern taxonomies, and pattern tools. Formalizations of patterns are sought, and tools are being built for pattern mining, identifying known patterns in existing software, and programming with patterns.

The main advantages claimed for design patterns, according to the pattern literature, are as follows:

1. Using patterns improves programmer productivity and program quality;

2. Novices can increase their design skills significantly by studying and applying design patterns;

3. Patterns encourage best practices even for experienced designers;

4. Design patterns improve communication, both among developers and from developers to maintainers.

As yet, there exists no coherent theory that would explain these advantages specifically in the context of design patterns. Therefore, this paper simply attempts to test the claims themselves.

1.1 Our experiments

The experiments reported here represent the first attempts at formally testing some aspects of claims 1 and 4 above. The experiments are set in a maintenance context. Assume a maintainer knows what design patterns are and how they are used. Furthermore, assume that a program was designed and implemented using patterns. Now the question is:

Does it help the maintainer if the design patterns in the program code are documented explicitly (using source code comments), compared to a well-commented program without explicit reference to design patterns?

We investigate this question in the following manner: several subjects receive the same program source code and the same change requests for that program; the subjects then provide appropriate changes sketched on paper (first experiment) or as operational program code (second experiment). The change requests concern those aspects of the program that are implemented using design patterns. The program is commented in detail, but the subjects in the control group receive no explicit information about design patterns in the program; they may derive pattern information from other program comments or not at all. The experiment group, by contrast, receives the same program with the same comments plus a few additional lines of comment (called pattern comment lines or PCL) that describe pattern usage where applicable. Note that the experiment group thus has slightly more comment text in their program. This might threaten the validity of the experiment. We will discuss and resolve this important methodological issue in Section 3. Subjects are assigned randomly to the groups.
We investigate whether and how the performance of the two groups differs by measuring completion time, grading answers, and counting correct solutions.

Note that this experiment does not test the presence or absence of patterns, but merely the documentation of patterns. This question is much cleaner,

because it is unclear what alternative program design should be used for comparison when testing the presence or absence of patterns.

The experiments were performed with a total of 96 student subjects, each working in a single session of 2 to 4 hours. The tasks were based on two different programs about 6 to 10 printed pages in length.

1.2 Related work

Design patterns are a recent idea, so it is not surprising that evidence about their effectiveness is scarce. Case study reports and anecdotal evidence of positive effects can be found for instance in [1, 9]. Part of the program maintenance literature is loosely relevant to pattern effectiveness, but we have found no empirical work that specifically addresses design patterns as an aid to maintenance. Likewise, as far as we know, the design pattern community itself has not yet undertaken controlled experiments to test design pattern claims.

The program comprehension theory presented by Brooks [2] suggests that PCL may be a highly effective sort of comment, because it specifically aims at bridging different "knowledge domains".

A different view is provided by the notion of delocalized plans [16] introduced by Soloway et al. Their experiments showed that delocalized plans account for much of the effort during program understanding. Wilde et al. [18] argue that object-oriented programs are full of delocalized plans due to inheritance and delegation. Design pattern instances are delocalized plans as well. From this perspective, pattern documentation may be helpful because it links the parts of important delocalized plans.

1.3 Structure of the article

The next section describes the design and implementation of the experiments, including a statement of the hypotheses, a description of the subjects' background, a description of the programs (including program comment examples) and tasks, and a discussion of possible threats to the internal and external validity of the experiments.
Section 3 discusses the methodological problems caused by adding comments and how to handle them in similar situations. Section 4 presents the results; Section 5 summarizes and raises questions for future research.

We provide only a short description of the programs and tasks. Detailed information, including all original experiment materials such as task descriptions and source program listings, is available in [8, 13, 14].

2 Description of the experiments

The first experiment was performed in January 1997 at the University of Karlsruhe (Uka), the second in May 1997 at Washington University St. Louis (Wustl). Although the experiments were similar, there were some variations. We therefore describe the experiments separately and refer to them as Uka and Wustl, respectively. Where appropriate, we provide information for Uka first and append the corresponding information for Wustl in angle brackets ⟨like this⟩.

2.1 Hypotheses

First we need to define the concept of pattern relevance. A maintenance task on a program is pattern-relevant if (1) the program contains one or more software design patterns and (2) a grasp of the patterns in the program is expected to simplify the maintenance task.

The experiments aimed at testing the following hypotheses, which we straightforwardly derived from claims found in the design pattern literature:

Hypothesis H1: By adding PCL, pattern-relevant maintenance tasks are completed faster.

Hypothesis H2: By adding PCL, fewer errors are committed in pattern-relevant maintenance tasks.

Speed of task completion is measured in minutes. The number of errors is quantified by assigning points (see Section 2.6) and by counting correct solutions.

2.2 Subjects and environments

The 74 ⟨22⟩ subjects of the Uka ⟨Wustl⟩ experiment were 64 ⟨0⟩ graduate and 10 ⟨22⟩ undergraduate computer science students.

They had taken a 6-week ⟨12-week⟩ intensive ⟨standard⟩ lecture and lab course on Java ⟨C++⟩ and design patterns before the experiment. On average, their previous programming experience was 7.5 years ⟨5 years⟩ using 4.6 ⟨4.0⟩ different languages, with a largest program of 3510 LOC ⟨2557 LOC⟩. Before the course, 69% ⟨76%⟩ of the subjects had some previous experience with object-oriented programming, 58% ⟨50%⟩ with programming GUIs.

The subjects had sufficient theoretical knowledge of design patterns, as indicated by a pattern knowledge test conducted at the start of each experiment.

For those patterns that were relevant in the experiment, the Uka subjects' pattern knowledge was better than that of the Wustl subjects, because the Uka course had been targeted directly at the experiment but the Wustl course had not. For some of the relevant patterns, the Wustl subjects had no practical experience, in contrast to the Uka subjects.

Each of the experiments was performed in a single session of 2 to 4 hours. The Uka subjects had to write their solutions on paper. The Wustl subjects implemented their solutions on Unix workstations.

2.3 Programs used

Each subject worked on two different programs. Both programs were written in Java ⟨C++⟩ using design patterns and were thoroughly commented.

Program And/Or-tree is a library for handling And/Or-trees of strings and a simple application of it. It has 362 ⟨498⟩ LOC in 7 ⟨6⟩ classes; 133 ⟨178⟩ of these LOC contain only comments, and an additional 18 ⟨22⟩ lines of PCL were added in the version with PCL. And/Or-tree uses the Composite and the Visitor design pattern [9].
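To make the pattern structure of And/Or-tree concrete, the following self-contained Java sketch shows a Composite with a Visitor, documented with a PCL-style comment of the kind the treatment group received. The class names loosely follow the vocabulary used later in Table 2 (Element, AndElement, OrElement, StringElement, ElementAction), but the code bodies, the accept()/handle() method names, and the variant-counting visitor are illustrative reconstructions, not the original experiment source:

```java
/**
 * *** DESIGN PATTERN: ***
 * Element is the root of a **Composite**: AndElement and OrElement are
 * the container classes, StringElement is the leaf class.
 * ElementAction is a **Visitor** over this structure.
 */
abstract class Element {
    abstract void accept(ElementAction a);   // Visitor dispatch method
}

class StringElement extends Element {
    final String text;
    StringElement(String text) { this.text = text; }
    void accept(ElementAction a) { a.handle(this); }
}

abstract class ContainerElement extends Element {
    final java.util.List<Element> children = new java.util.ArrayList<>();
    ContainerElement add(Element e) { children.add(e); return this; }
}

class AndElement extends ContainerElement {
    void accept(ElementAction a) { a.handle(this); }
}

class OrElement extends ContainerElement {
    void accept(ElementAction a) { a.handle(this); }
}

/** Visitor superclass; concrete visitors fill in the handle() methods. */
abstract class ElementAction {
    abstract void handle(StringElement e);
    abstract void handle(AndElement e);
    abstract void handle(OrElement e);
}

/** Example visitor: counts the variants represented by a tree
 *  (product over And-children, sum over Or-children). */
class VariantCount extends ElementAction {
    long result;
    void handle(StringElement e) { result = 1; }
    void handle(AndElement e) {
        long product = 1;
        for (Element c : e.children) { c.accept(this); product *= result; }
        result = product;
    }
    void handle(OrElement e) {
        long sum = 0;
        for (Element c : e.children) { c.accept(this); sum += result; }
        result = sum;
    }
}

public class AndOrDemo {
    public static void main(String[] args) {
        // ("a" | "b") & ("x" | "y" | "z")  ->  2 * 3 = 6 variants
        Element tree = new AndElement()
            .add(new OrElement().add(new StringElement("a"))
                                .add(new StringElement("b")))
            .add(new OrElement().add(new StringElement("x"))
                                .add(new StringElement("y"))
                                .add(new StringElement("z")));
        VariantCount v = new VariantCount();
        tree.accept(v);
        System.out.println(v.result); // prints 6
    }
}
```

A subject in the PCL group would see the *** DESIGN PATTERN: *** block above in addition to the normal comments; the control group would see the same program without it.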

Figure 1: Screenshot of Uka Phonebook program

Program Phonebook is a GUI program for reading tuples (name, first name, phone number) entered by the user and showing them in different views on the screen; see the screenshot in Figure 1. Because the Wustl subjects had not learned a GUI library in the course, the C++ version of Phonebook is stream-I/O based: it reads all of its inputs from the keyboard and completely redisplays all views to standard output after each change. Phonebook has 565 ⟨448⟩ LOC in 11 ⟨6⟩ classes; 197 ⟨145⟩ of these LOC contain only comments, and an additional 14 ⟨10⟩ lines of PCL were added in the version with PCL. Phonebook uses the Observer and the Template Method design pattern [9]. See Figure 2 for examples of PCL sections and other program comments.

365 abstract class TupleDispA extends TupleDisplay {
...
403   /** implements adding a new Tuple.
404       First select() is used to test whether the Tuple should be added
405       at all, then mergeIn() moves it to the right place in the
406       presentation using compare() and format() converts it into a String
407       of the desired display format.
408       *** DESIGN PATTERN: ***
409       newTuple() together with its auxiliary method mergeIn() forms a
410       **Template Method**. The empty spots that are filled in subclasses
411       are the methods select(), format(), and compare().
412   */
413   synchronized void newTuple(Tuple t) {...
419   }
...
473 }
...
477 /**
478    NTTupleDisp2 displays NTTuple, where
479    1. Tuples with an empty telephone number are left out and
480    2. Tuples are sorted by (last) name.
481    Using Tuple objects of other Tuple types results in
482    ClassCastException.
483    *** DESIGN PATTERN: ***
484    NTTupleDisp2 completes the **Template Method** newTuple()
485    of TupleDispA
486 */
487 final class NTTupleDisp2 extends TupleDispA {...
513 }
...
517 /**
518    Main program. Generates a main window with two buttons, one
519    Tupleset and two TupleDisplays. One of the buttons creates an
520    NTTuple and adds it to the Tupleset.
521    There is no static type safety between the actual Tuple type
522    stored in the Tupleset and the Tuple type that is expected by
523    the TupleDisplays.
524    *** DESIGN PATTERN: ***
525    The two TupleDisplays are registered as observers at the Tupleset.
526 */
527 public final class TupleMain extends Frame {...
573 }

Figure 2: PCL example: three (out of four) comments containing PCL sections from the Uka Phonebook program, with original line numbers. Added PCL sections are lines 36-42 (in a 33-line global program comment not shown here), 409-411, 484-485, and 525. The program contains a total of 197 lines (or 35% of all lines) of normal comments.
Note that the original program was in German, with one more pattern comment line due to different line breaks.

See Section 3 for a discussion of why adding PCL is an acceptable and meaningful manipulation. See [13, 14, 8] for the full commented source code of the programs.

2.4 Experiment controls, group sizes

The independent variable in both experiments was the presence or absence of design pattern comment lines (PCL) in the comments of the source programs.

We used a counterbalanced experiment design [5], see Table 1: the first variable is the order in which a subject receives the two programs. One of those programs was supplied with PCL, the other without. This design results in a second variable, i.e., the order of PCL and non-PCL: first with, then without PCL, and vice versa. The combination of the variables results in four groups. The subjects

did not know in advance whether a program would contain PCL or not; they did not even know that the existence of PCL would be a treatment variable.

2.5 Tasks

For And/Or-tree, each subject received the following 4 subtasks: (1) find the right spot for a particular output format change, (2) give an expression to compute the number of variants represented by a tree, (3) create an additional Visitor class (of course, the task description did not explicitly mention that it was a Visitor that should be added, etc.) that computes the number of variants faster, similar to an already existing class computing depth information, and (4) instantiate such a Visitor and print its result. Subtasks (3) and (4) are pattern-relevant. Subtask (4) was not explicitly required in Wustl.

For Phonebook, each Uka subject received the following 5 subtasks: (1, 2) find two spots for small program changes (output format change, window

size change), (3) create an additional Observer class using a Template Method, (4) instantiate and register such an Observer, (5) create an additional Observer class similar to an already existing one not using a Template Method. Subtasks (3) to (5) are pattern-relevant.

Table 1: The four experiment groups and their sizes. The number of data points is one per subject, except for those subjects that did not complete the respective task but dropped out of the experiment instead. For Uka there was no such mortality. See also Section 2.7. (A+P- stands for "first perform And/Or-tree with PCL, then perform Phonebook without PCL", and so on.)

                                               first with PCL,  first w/o PCL,
                                               then w/o PCL     then with PCL
groups: first And/Or-tree, then Phonebook          A+P-             A-P+
  Uka initial no. of subjects                       19               18
  Uka no. of data points, both tasks                19               18
  Wustl initial no. of subjects                      6                5
  Wustl no. of data points, Phonebook                4                3
  Wustl no. of data points, And/Or-tree              4                4
groups: first Phonebook, then And/Or-tree          P+A-             P-A+
  Uka initial no. of subjects                       18               19
  Uka no. of data points, both tasks                18               19
  Wustl initial no. of subjects                      6                5
  Wustl no. of data points, Phonebook                3                3
  Wustl no. of data points, And/Or-tree              4                4

There are two important differences between Uka and Wustl regarding Phonebook. First, in Uka subtask (3) a similar class was already present in the program; subtask (3) could thus be solved by imitation, which was not true for Wustl. Second, subtask (2) was not required in Wustl, and (4) was included in (3).

For the class creation subtasks, only the interface of the class needed to be written; the actual implementation was not required, although the Wustl participants were asked to provide a complete solution if they could.

2.6 Measurements

For each task (but not for each subtask) of each subject we measured the time between handing out and collecting the experiment materials.
It is unclear how the time spent on general program understanding could be distributed among the subtasks, so no subtask time information was collected. For each subtask, we graded the answers according to the degree of requirements fulfillment they provided. The grades were expressed in points. There was a total of 2+2+8+3 = 15 points ⟨2+2+8 = 12⟩ for the Uka ⟨Wustl⟩ subtasks of And/Or-tree and 2+3+8+4+6 = 23 points ⟨2+8+8 = 18 points⟩

for those of Phonebook; see [13, 14] for the rating scales. Since graded point scales are somewhat subjective, we also recorded the number of completely correct solutions.

2.7 Threats to internal validity

All relevant external variables were appropriately controlled in this experiment. Furthermore, the counterbalanced experiment design compensates for possible unbalanced subject ability between the groups.

The dominant control problem is mortality: some Wustl students gave up on a task when they thought it would be too difficult for them or take too long. Four students gave up on both tasks. Fortunately, mortality occurred almost exactly as often in the groups with PCL as in those without PCL. When ignoring incomplete tasks entirely, it is therefore safe to assume that the mortality does not bias the results. See Table 1 for the resulting group sizes; there was no mortality in Uka.

2.8 Threats to external validity

There are several sources of differences between the experimental and real software maintenance situations that limit the generalizability (external validity) of the experiments: in real situations there are subjects with more experience, often working in teams, and there are programs and maintenance tasks of different size or structure.

Experience: The most frequent concern with experiments using student subjects is that the results cannot be generalized to professionals, because the latter are more experienced. In the present case, professional programmers may either have less need for PCL, or they may be able to exploit PCL more profitably than our student subjects. However, in professional contexts it is often not the most capable programmers who are doing the maintenance anyway.

Team work: Realistic programs are usually team work. Individual tasks during maintenance may also often be performed by more than one programmer. Such cooperation requires additional communication about the program. In this case, PCL may have further advantages, not visible in the experiment, because one of the major (purported) advantages of design patterns is a common design terminology.

Program size and structure: Compared to typical industrial-size programs, the experiment programs are rather small and simple, neatly designed, and well commented. This does not necessarily invalidate the results of the experiments, though. If a positive effect is found, increasing program complexity may magnify the effect, because having PCL provides program slicing information. For pattern-relevant tasks, PCL information points out which parts of a program are relevant and enables one to ignore the rest; such information may become more useful as program size increases, because more code can be ignored.

Maintenance tasks and pattern relevance: Depending on the application domain, the benefits from PCL may be smaller in reality than in our experiment for two reasons. First, if programmers have to master a large design pattern repertoire, their understanding of individual patterns may be reduced and PCL may become less helpful. Second, for many maintenance tasks, design patterns may not be relevant at all.[2]

Only repetition of similar experiments with professionals on real programs and real maintenance tasks can evaluate these threats.
On the other hand, the experiment was biased towards showing smaller-than-reality benefits from PCL; see the discussion in the conclusion.

[2] An important question in this context is the pattern density in a program: what fraction of the program is part of a design pattern instance? An investigation of the Java Abstract Windowing Toolkit (AWT 1.0) and NEXTstep v2 found that 74% and 66% of all classes, respectively, participated in some design pattern instance [10].

3 On the methodology of our experiment design

The main independent variable in these experiments is the presence or absence of pattern comment lines (PCL) in addition to the normal program comments. Some readers may argue that this makes the experiments uninteresting and irrelevant (although still valid), because design and hypotheses are impure: the observations cannot discriminate between effects from having a specific type of comments (PCL) and effects from simply having more comments.

3.1 The form/content conflict

Note that this situation will occur generically in many experiments that attempt evaluation of a specific form of an information source.

Each specific information source used during program development (such as PCL, UML diagrams, or any other) has both a certain form (syntax, appearance) and a certain content (semantics, meaning). In any experiment for evaluating some form F of an information source, we ideally need an experiment condition for the control group that is identical to the experiment group condition in all respects (in particular the content) except for the form of the information source. However, an alternative form A will rarely lend itself to expressing the exact same content as F in a natural form. For example, one cannot express in a realistic manner exactly each detail communicated by a UML diagram, and nothing else, in a different notation. Therefore, we must either force the content into A (and thus end up with an unrealistic instance of A) or accept some difference in content in addition to the difference in form in our experiment conditions (and thus mix two effects).

We call this problem the form/content conflict in experiment design.

If it arises, it is usually impossible to eliminate the conflict, but it can be reduced so that the relevance of the experiment is no longer seriously threatened. The following three subsections argue why the conflict is unimportant in our specific experiment design described above.
Subsection 3.5 then reformulates the train of thought of this argumentation as a general rule for the design of such experiments.

We now argue for our design: first, why purer hypotheses appear impossible to evaluate; second, why the experiments are useful as they are; and third, why not much of the observed effect is due to an increase in comment information content.

3.2 Why purer hypotheses appear impossible to evaluate

A hypothetical perfect experiment would only test whether the specific form (or type) of comments represented by PCL has an advantage over whatever one would have written instead before the invention of design pattern terminology. It would not change any other aspect of the program. In order to do this, the following aspects (among others) would have to be kept constant:

- the absolute information content of the comments (because this is what makes comments useful),
- the length of the comments in lines or words (because reading comments consumes time and concentration),
- the amount of repetition of information in the comments (because repetition may aid understanding or finding information, or may make the reader unattentive),
- the understandability of the comments,
- the comments' potential for creating confusion (because comments can sometimes make a program harder to understand),
- the placement of the comments with respect to the program entities they refer to (because that determines how easily information can be found or related to the program).

Balancing all of these aspects in a program change is impossible. We have the following choices for the experimental design:

1. Balance PCL by removing other comments in the PCL program version.
   But which other comments should we remove? We could easily balance the comment length, but there is no reproducible way to determine which comments are superfluous, useful, or crucial. We could not guarantee that we held the information content of the comments constant. Furthermore, we cannot keep both comment length and the amount of repetition constant, as will be quantified in Section 3.4.
2. Balance PCL by adding normal comments in the non-PCL program version.
   But what should these other comments say? Again, we could keep the comment length constant, but there is no reproducible way to give the exact same information content as in the PCL, other than the PCL itself; we cannot be sure not to confuse or annoy the user, etc.

3. Add PCL without any compensation. This will increase the comment length (small increase,

see Section 2.3 and Figure 2), information content (small increase, see Section 3.4), and repetition of information (large increase, see Section 3.4).

We pick the third solution, because it is more realistic and well-defined and leads to more meaningful and reproducible results.

3.3 Why the experiments are useful

A possible effect of simply having more comment text is consciously included in our hypotheses. If we find a positive effect from adding PCL, the experiment suggests the engineering advice "Add PCL to your programs in addition to the commenting you usually do." This advice is useful even if it should be possible to obtain similarly positive results with other types of additional comments, because who could characterize, in program-independent terms, what these other comments should look like? For PCL, the characterization is straightforward: "Mention which classes, methods, and objects play which roles in a pattern"; see the example in Figure 2.

3.4 Why most of the effect is due to comment type, not information content

Are we merely (or mostly) measuring the effect of providing more information in the comments? As we show now, almost all of the information contained in the PCL is already present in the rest of the comments and is merely repeated in a different form by the PCL. Note that repetition as such can either improve program understanding (by useful redundancy) or slow it down (because less new information is found per line) [3]. It is left to the reader to judge the present case; see the conclusion section for our opinion.

We compiled a list of the design information units conveyed by the PCL and counted where and how often they were mentioned in the PCL and in the other comments.

For Uka And/Or-tree, this information is shown in the left half of Table 2.
As we see in the histograms, almost all of the information present in the PCL is already present in the rest of the program comments, in particular those units (labeled A and L) that point to the existence of design patterns in the program.

Similar results, although somewhat weaker, hold for the Uka Phonebook program; see the right-hand side of Table 2.

Table 2: List of the design information units provided by the PCL of the Uka And/Or-tree program (left-hand side) and the Uka Phonebook program (right-hand side). The histograms indicate the number of occurrences of each design information unit (with the important ones emphasized). Top histogram: occurrences in the PCL. Bottom histogram: occurrences in the rest of the comments.

id  design information unit (Uka And/Or-tree)
A   There is an element/container structure
B   Element is the superclass of the element/container structure
C   Element is abstract
D   AndElement is a part of the element/container structure
E   OrElement is a part of the element/container structure
F   StringElement is a part of the element/container structure
G   There are multiple container classes
H   AndElement is a container class
I   OrElement is a container class
J   There is only one element class
K   StringElement is an element class
L*  There is an iterator structure
M*  The iterator structure iterates over the element/container structure
N*  ElementAction is the superclass of the iterator structure
O*  ElementAction is abstract
P   Depth is a part of the iterator structure
Q†  perform() is a part of the iterator structure
R   perform() is a dispatch method

A, B, L, N are the most important information units for design understanding.

In design pattern terminology [9], element/container structure is the Composite pattern, container class is the Composite class, element class is the Leaf class, iterator structure is the Visitor pattern, and dispatch method is the Accept method. However, none of these design pattern terms is used in the non-PCL comments.

[Histogram, 52 data points: occurrences of units A-R in the PCL; scale 0-10]
[Histogram, 23 data points: occurrences of units A-R in the rest of the comments; scale 0-10]

id  design information unit (Uka Phonebook)
A   There is a model/view structure
B*  Tupleset is the model class
C*  TupleDisplay is the superclass of the views
D†  TupleDisplay is abstract
E   There is a skeleton algorithm
F   The skeleton algorithm varies selection, ordering, and formatting of tuples
G   TupleDispA contains a method with a skeleton algorithm
H*  TupleDispA is a view class
I   TupleDispA is abstract
J   newTuple() is the skeleton algorithm
K   mergeIn() is an auxiliary method of newTuple()
L   select(), compare(), and format() are the primitive operations missing in the skeleton algorithm
M   NTTupleDisp2 completes the skeleton algorithm
N†  There are two view class instances
O   There is one model class instance
P   The view class instances are registered with the model class instance

A, B, E, F, G, M are the most important information units for design understanding.

In the design pattern terminology of [9], model/view structure is the Observer pattern, model class is the Subject class, view class is the Observer class, and skeleton algorithm is a Template Method. However, none of these design pattern terms is used in the non-PCL comments.

[Histogram, 21 data points: occurrences of units A-P in the PCL; scale 0-10]
[Histogram, 12 data points: occurrences of units A-P in the rest of the comments; scale 0-10]

* This information is rather vague or ambiguous in the non-PCL comments.
† This information is distributed over several non-PCL comments.

Table 3: Results for the And/Or-tree task. Columns are (from left to right): line number; name of variable; arithmetic average D+ of subjects provided with design pattern information (PCL); ditto without PCL (D-); 90% confidence interval I for the difference D+ - D- (measured in percent of D-); significance p of the difference (one-sided). I and p were computed using Bootstrap resampling [7] with 10000 trials, because many distributions were distinctly non-normal. "Relevant points" are the points for the pattern-relevant subtasks only.

   #  Variable               mean with PCL   mean w/o PCL   difference (90% confid.)   significance
                             D+              D-             I                          p
   Uka, program And/Or-tree:
   1  relevant points        8.5             7.8            -7.7% ... +23%             0.20
   2  #corr. solutions       15 of 38        7 of 36                                   0.077
   3  time (minutes)         58.0            52.2           -3.0% ... +24%             0.094
   4  time, best 7 only      38.6            45.4           -37% ... +6.3%             0.13
   Wustl, program And/Or-tree:
   5  relevant points        6.7             6.5            -12% ... +19%              0.28
   6  #corr. solutions       4 of 8          3 of 8                                    1
   7  time (minutes)         52.1            67.5           -43% ... -0.5%             0.046

So the PCL information is almost entirely redundant, and plausibly it is the specific type of the documentation that creates the effects we see: the same information might become clearer when expressed in PCL form. Hence, we expect our results to mainly show effects from specifically adding PCL, not effects from adding just any documentation.

3.5 General methodological rule

We can now formulate the above three-step process for handling form/content conflicts as a general methodological recipe.

1. Analyze all conceivable designs and choose the one that results in the least differences in content without requiring unrealistic conditions for either group. This ensures there is no better design for this research question.

2. Demonstrate that the experiment result expected by your hypotheses would be interesting. Ideally, it leads to a practical software engineering rule; in this case, formulate the rule explicitly. This ensures the experiment is useful despite the content difference.

3. Argue why, in your specific experiment setup, at most a small part of the observed effect (if any) will be due to different content rather than different form. This ensures the form/content conflict is unimportant for the chosen experiment design.
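The confidence intervals and p-values reported in Tables 3 and 4 were obtained with bootstrap resampling [7] over 10000 trials. The following sketch illustrates a percentile bootstrap for the difference of two group means; the class name, the sample data, and the fixed random seed are invented for illustration and are not the authors' actual analysis code.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of a percentile bootstrap for the difference of two group means,
// the kind of estimate behind the intervals in Tables 3 and 4.
// Illustrative only: data, seed, and class name are invented.
public class BootstrapCI {

    // Draw one resample (with replacement) from x and return its mean.
    static double resampleMean(double[] x, Random rnd) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += x[rnd.nextInt(x.length)];
        return sum / x.length;
    }

    // Returns {lower, upper} of the 90% percentile interval for
    // mean(withPcl) - mean(withoutPcl), based on `trials` resamples.
    static double[] diffCI(double[] withPcl, double[] withoutPcl, int trials) {
        Random rnd = new Random(42);  // fixed seed for reproducibility
        double[] diffs = new double[trials];
        for (int t = 0; t < trials; t++)
            diffs[t] = resampleMean(withPcl, rnd) - resampleMean(withoutPcl, rnd);
        Arrays.sort(diffs);
        // 5th and 95th percentiles bound the two-sided 90% interval.
        return new double[] { diffs[(int) (0.05 * trials)],
                              diffs[(int) (0.95 * trials)] };
    }

    public static void main(String[] args) {
        // Hypothetical completion times in minutes for two small groups.
        double[] pcl   = { 45, 50, 52, 58, 60, 63, 70 };
        double[] noPcl = { 55, 60, 62, 66, 70, 75, 80 };
        double[] ci = diffCI(pcl, noPcl, 10000);
        System.out.printf("90%% CI for mean difference: [%.1f, %.1f]%n", ci[0], ci[1]);
    }
}
```

With small, skewed samples such as these, the percentile interval makes no normality assumption, which is why the authors prefer it over a t-based interval.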

4 Results and discussion

This section discusses the results of both experiments as summarized in Tables 3 and 4. For Wustl, the results of subjects that did not deliver a particular task were ignored for that task. For Uka, all subjects delivered both tasks.

4.1 Results for And/Or-tree

The results are summarized in Table 3 and Figure 3. The first line indicates that the group with PCL averaged slightly more points in the pattern-relevant subtasks ("relevant points"). However, the p-value in the rightmost column indicates that the difference is not significant; it has a 20 percent chance of being accidental. Surprisingly, the time required by the group with PCL appears to be significantly larger than the time for the group without PCL (line 3). However, this observation is misleading, because the number of correct solutions is much lower for the non-PCL group (Fisher exact p = 0.077), as shown in line 2. In real software maintenance, incorrect solutions would be detected and corrected, taking additional time not observed in the experiment. In contrast, the experiment's non-computerized working environment made it difficult for a subject to check whether a solution was correct; obviously, incorrect work can be produced quickly. Therefore, instead of comparing the time required by all subjects, we should compare only the time required by subjects with correct solutions. To avoid bias, we right-trim both complete groups uniformly by 80% with respect to a hierarchical correctness-then-time ordering, so that only the best 7 participants from each group remain.³ In this case, the time difference reverses and the PCL group tends to be faster, as shown in line 4. See also Figure 3.

Table 4: Results for the Phonebook task. The Wustl results had to be discarded, because the subjects were not sufficiently qualified for this task; only one subject per group came up with a correct solution.

   #   Variable              mean with PCL   mean w/o PCL   difference (90% confid.)   significance
                             D+              D-             I                          p
   Uka, program Phonebook:
   8   relevant points       16.1            16.3           -8.0% ... +4.0%            0.35
   9   #corr. solutions      17 of 36        15 of 38                                  0.64
   10  time (minutes)        51.5            57.9           -22% ... +0.3%             0.055
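The hierarchical correctness-then-time ordering used above for trimming ranks subjects by points first, and breaks ties by shorter completion time. This can be sketched as a comparator; the class and field names below are invented for illustration and are not the authors' evaluation code.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// One subject's outcome on a task (hypothetical field names).
class SubjectResult {
    final String id;
    final int points;       // correctness score for the task
    final double minutes;   // completion time
    SubjectResult(String id, int points, double minutes) {
        this.id = id; this.points = points; this.minutes = minutes;
    }
}

class Trimming {
    // Keep the best n subjects under the hierarchical ordering:
    // more points first; among equal points, shorter time first.
    static List<SubjectResult> best(List<SubjectResult> group, int n) {
        return group.stream()
                .sorted(Comparator.comparingInt((SubjectResult s) -> s.points).reversed()
                        .thenComparingDouble(s -> s.minutes))
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<SubjectResult> group = Arrays.asList(
                new SubjectResult("s1", 9, 61.0),
                new SubjectResult("s2", 7, 45.0),
                new SubjectResult("s3", 9, 48.5));
        // s3 ranks before s1 (same points, less time); s2 is trimmed.
        for (SubjectResult s : best(group, 2))
            System.out.println(s.id + ": " + s.points + " points, " + s.minutes + " min");
    }
}
```

Applying the same cut-off to both groups (rather than keeping all correct solutions) is what avoids comparing differently sized fractions of the two groups.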


Figure 3: Graphical display of the time entries (in minutes) from Tables 3 and 4: the left plot of each pair represents the group with PCL, the right one the group without PCL. The dot marks the mean of the task completion time; the strip indicates a 90% confidence interval for the mean. "Line n" indicates the corresponding line in Table 3 or 4.

³ Without this trimming, we would compare the best 39 percent (15 of 38) of the participants with PCL to only the best 19 percent (7 of 36) without PCL, which would distort the comparison.

But is this view of the data really appropriate? Figure 4 tells us that indeed it is. We can see that without PCL the slower (less capable?) subjects tend to produce solutions of much lower quality, whereas with PCL the quality is largely independent of the time required. Overall, it appears safe to say that PCL allowed for good solutions by less capable programmers than was possible without PCL.

The Wustl results for And/Or-tree indicate that the PCL group is faster, as shown in the lower part of Table 3. Since this group not only designed their solution but could also test it, we find no significant difference in the number of points (line 5) nor in the number of completely correct solutions (line 6),

but we find a large advantage in the time required for the group with PCL (line 7). With a confidence of 90 percent, having PCL saves between zero and 43 percent of the maintenance time for this task.

As for the learning effect, the Uka subjects were on average significantly faster (but did not obtain more points) in their second task. The size of the effect is independent of the presence or absence of PCL; this data is not shown in the table. Differences due to learning are compensated by the counterbalanced experiment design and are thus not relevant for the interpretation of the results. The learning effect could not be assessed for Wustl, because the group sizes were too small.

Summing up, the And/Or-tree results suggest that for non-trivial pattern-relevant maintenance tasks the presence of PCL may save time and/or help avoid mistakes.

4.2 Results for Phonebook

The results are summarized in Table 4 and Figure 3. The Phonebook results of Uka show essentially no difference in the total number of points

Figure 4: Work time versus solution quality for the pattern-relevant part of the UKA And/Or-tree task. The trend line is a Loess local linear regression with span width 1, shown with its 90% confidence interval.

for the pattern-relevant subtasks (line 8), nor in the number of solutions that were completely correct (line 9). Although almost half of the participants made at least one mistake, the rather high relevant-point values (over 16 out of a possible 18) indicate that the task was simple for these subjects.

Despite the simplicity, however, the group with PCL managed to solve the task significantly faster than the group without PCL, as shown in line 10. The advantage can also be quantified: it has an expected size somewhere between zero and 22 percent (with 90% confidence); see also Figure 5. The learning effect was similar to that described for And/Or-tree above.

Figure 5: Work time distribution (in minutes) for the UKA Phonebook task. The fat dot is the median; the box indicates the 25% and 75% quantiles; the whiskers indicate the 10% and 90% quantiles; the M and the dashed line indicate the mean plus/minus one standard error of the mean.

Unfortunately, the Wustl results for this task are meaningless, since there is only a single correct solution in each group. Thus, it makes little sense to compare the results. Our postmortem analysis found that for the given subjects the task was too difficult, for two reasons. First, the non-GUI presentation style of the Wustl Phonebook program made the use of the Observer pattern rather unintuitive and obscure. Second, these subjects had never actually implemented an Observer, and there was no example class in the program that they could imitate (in contrast to the Uka Phonebook program). Our results suggest that under such circumstances, PCL might be worthless.

Summing up, the Phonebook results suggest that for simple pattern-relevant maintenance tasks the presence of PCL may save time.

5 Interpretation and conclusions

We will now argue why the results from our experiments suggest realistic positive effects of PCL in pattern-relevant maintenance tasks.

The design of these experiments was extremely conservative; our design decisions biased them towards not showing any effects from adding PCL, in particular:

1. The programs were rather small, so even without PCL the subjects could achieve good program understanding within a reasonable time. In real software projects, PCL might be more helpful, if the fraction of program understanding effort that PCL can save grows with the size of the program.

2. The programs were thoroughly commented, not only at the statement level, but also at the method, class, and program levels. Thus, the subjects had sufficient documentation available for program understanding even without PCL; see Section 3.4. In contrast, many programs in the real world lack design documentation. PCL is probably an efficient form of design documentation, as it is rather compact.

Given these circumstances, we expect performance advantages through PCL to be similar or more pronounced in real situations than in our experiments. In any case, even a few percent reduction in maintenance effort outweighs the negligible additional development cost of inserting PCL into source programs.

In summary, we find that our results support both of the hypotheses introduced in Section 2.1. Each of our results supports either hypothesis H1 (pattern-relevant maintenance tasks will be completed faster if PCL is added) or hypothesis H2 (fewer errors will be committed in pattern-relevant maintenance tasks if PCL is added). For Uka Phonebook and for Wustl And/Or-tree, the PCL group was faster than the non-PCL group. We found the size of the effect (0 to 40 percent speedup) to be considerable, although this may not generalize to other cases. For Uka And/Or-tree, the PCL group produced solutions with fewer mistakes than the non-PCL group. In particular, twice as many subjects arrived at a completely correct solution.
None of the three results supports both hypotheses at once, but there is no evidence for the opposite of the hypotheses either.

We conclude that, depending on the particular program, maintenance task, and personnel, PCL in a program may considerably reduce the time required for a program change and/or may help improve the quality of the change. Given that we used subjects from two different continents and two different programming languages, the results are not specific to only one type of culture or educational background, or to only one programming language.

Why is PCL beneficial? The analysis of design information units (see Section 3.4) suggests two answers to this question. First, we often found the information provided by PCL much clearer than that in the other comments, which was sometimes vague, ambiguous, or spread out (delocalized). This was to be expected, because a clear and compact terminology is considered one of the primary advantages of patterns. Second, PCL repeats some design information units more often than normal comments do, in particular with respect to mentioning the presence of a pattern; see Table 2. This type of repetition has two advantages: it reduces the concentration required to capture the relation between the pattern parts, and it makes readers aware of the pattern even if they do not pick an appropriate exploration path through the program text. This property makes PCL similar to the rather voluminous documentation that Soloway et al. [16] suggest for structured programs in order to overcome problems from delocalized plans, that is, design decisions whose consequences are spread out over multiple parts of a program. Wilde et al. [18] argue that delocalized plans are frequent in object-oriented programs due to inheritance, large numbers of small methods, and dynamic binding. It seems that design patterns are a good means for understanding many such delocalized plans during maintenance, provided that PCL points out where patterns were used.

Further work should perform related experiments in different settings. The following questions appear most important. First, how do the effects change for larger programs? Second, how do they change for more difficult tasks, or when correct results are enforced by an acceptance test? Third, how do the effects change when much larger numbers of different patterns come into play, often with overlap between their instances? Fourth, how do they change when multiple programmers have to cooperate (and hence communicate) in order to make a change? Fifth, what are the effects if programs are largely uncommented, and how, in general, does PCL interact with other documentation?
Sixth, is PCL also helpful during code inspections?

Moreover, empirical studies of existing software should determine what fraction of the change tasks (or change effort) is pattern-relevant. The question of design stability also needs to be addressed: we have found initial evidence that PCL may slow down architectural erosion and drift (as described by Perry and Wolf [12]), i.e., delay the decay of the original software design structure. Finally, and perhaps most interestingly, how does maintenance compare for software with patterns versus equivalent software without?
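To make the notion of pattern comment lines concrete: a PCL names the pattern and the role a program element plays in it, directly next to that element. The following Java fragment is a hypothetical illustration in that spirit; all identifiers are invented and are not taken from the experiment's Phonebook program.

```java
import java.util.ArrayList;
import java.util.List;

// PCL: Observer pattern -- Observer interface (declares the Update operation).
interface PhonebookObserver {
    void update(Phonebook subject);
}

// PCL: Observer pattern -- ConcreteSubject role.
// Notifies all registered observers whenever an entry is added.
class Phonebook {
    // PCL: Observer pattern -- list of attached Observer objects.
    private final List<PhonebookObserver> observers = new ArrayList<>();
    private final List<String> entries = new ArrayList<>();

    // PCL: Observer pattern -- Attach operation.
    void addObserver(PhonebookObserver o) { observers.add(o); }

    void addEntry(String entry) {
        entries.add(entry);
        // PCL: Observer pattern -- Notify operation.
        for (PhonebookObserver o : observers) o.update(this);
    }

    int size() { return entries.size(); }

    public static void main(String[] args) {
        Phonebook pb = new Phonebook();
        // PCL: Observer pattern -- attaching a ConcreteObserver.
        pb.addObserver(s -> System.out.println("notified; size is now " + s.size()));
        pb.addEntry("Alice");  // prints: notified; size is now 1
    }
}
```

In the experiments, the controlled variable was exactly the presence or absence of such role-naming comments; all other comments were identical in both program variants.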

Acknowledgements

We thank Douglas Schmidt for the opportunity to repeat the experiment at Washington University. We also thank all our experimental subjects for making this project possible. Thanks also to several critical reviewers of previous versions of this paper, in particular for pointing out that a methodological contribution was waiting to be made.

References

[1] K. Beck, J.O. Coplien, R. Crocker, L. Dominick, G. Meszaros, F. Paulisch, and J. Vlissides. Industrial experience with design patterns. In 18th Intl. Conf. on Software Engineering, pages 103-114, Berlin, March 1996. IEEE CS Press.

[2] Ruven Brooks. Using a behavioral theory of program comprehension in software engineering. In Proc. 3rd Intl. Conf. on Software Engineering, pages 196-201, 1978. IEEE CS Press.

[3] Ruven Brooks. Towards a theory of the comprehension of computer programs. Intl. J. of Man-Machine Studies, 18(6):543-554, June 1983.

[4] Frank Buschmann, Regine Meunier, Hans Rohnert, Peter Sommerlad, and Michael Stal. Pattern-Oriented Software Architecture: A System of Patterns. John Wiley and Sons, Chichester, UK, 1996.

[5] Larry B. Christensen. Experimental Methodology. Allyn and Bacon, Needham Heights, MA, 6th edition, 1994.

[6] James O. Coplien and Douglas C. Schmidt, editors. Pattern Languages of Program Design. Addison-Wesley, Monticello, IL, 1995.

[7] Bradley Efron and Robert Tibshirani. An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability 57. Chapman and Hall, New York, 1993.

[8] EIR. Web pages of the Karlsruhe Empirical Informatics Research group. http://wwwipd.ira.uka.de/EIR/.

[9] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Reading, MA, 1995.

[10] Oliver Gramberg. Counting the use of software design patterns in Java AWT and NeXTstep. Technical Report 19/1997, Fakultät für Informatik, Universität Karlsruhe, Germany, December 1997. ftp.ira.uka.de.

[11] Robert Martin, Dirk Riehle, and Frank Buschmann, editors. Pattern Languages of Program Design 3. Addison-Wesley, Monticello, IL, 1997.

[12] Dewayne E. Perry and Alexander L. Wolf. Foundations for the study of software architecture. ACM SIGSOFT Software Engineering Notes, 1992.

[13] Lutz Prechelt. An experiment on the usefulness of design patterns: Detailed description and evaluation. Technical Report 9/1997, Fakultät für Informatik, Universität Karlsruhe, Germany, June 1997. ftp.ira.uka.de.

[14] Lutz Prechelt, Barbara Unger, and Douglas Schmidt. Replication of the first controlled experiment on the usefulness of design patterns: Detailed description and evaluation. Technical Report WUCS-97-34, Washington University, Dept. of CS, St. Louis, December 1997. http://www.cs.wustl.edu/cs/cs/publications.html.

[15] Mary Shaw and David Garlan. Software Architecture: Perspectives on an Emerging Discipline. Prentice Hall, 1996.

[16] Eliot Soloway, Jeannine Pinto, Stan Letovsky, David Littman, and Robin Lampert. Designing documentation to compensate for delocalized plans. Communications of the ACM, 31(11):1259-1267, November 1988.

[17] John M. Vlissides, James O. Coplien, and Norman L. Kerth, editors. Pattern Languages of Program Design 2. Addison-Wesley, Monticello, IL, 1996.

[18] Norman Wilde, Paul Matthews, and Ross Huitt. Maintaining object-oriented software. IEEE Software, 10(1):75-80, January 1993.
