
Incorporating Performance Testing in Test-Driven Development

Michael J. Johnson and E. Michael Maximilien, IBM
Chih-Wei Ho and Laurie Williams, North Carolina State University

Performance testing can go hand-in-hand with the tight feedback cycles of test-driven development and ensure the performance of the system under development.

Retail customers have used IBM's point-of-sale system for more than a decade. The system software includes drivers for POS devices, such as printers and bar code scanners. The device drivers were originally implemented in C/C++. Later, developers added a Java wrapper around the C/C++ drivers to let Java programs interact with the POS devices. In 2002, IBM Retail Store Solutions reimplemented the device drivers primarily in Java to allow for cross-platform and cross-bus connectivity (that is, USB, RS232, and RS485 bus). A previous case study on the use of test-driven development in this reimplementation focused on overall quality.1,2 It showed that TDD reduced the number of defects in the system as it entered functional verification testing.

Besides improving overall quality in this project, TDD led to the automation of performance testing.3 Software development organizations typically don't begin performance testing until a product is complete. However, at this project's beginning, management was concerned about the potential performance implications of using Java for the device drivers. So, performance was a focal point throughout the development life cycle. The development team was compelled to devise well-specified performance requirements and to demonstrate the feasibility of using Java. We incorporated performance testing with the team's TDD approach in a technique we call test-first performance (TFP).

Test-first performance

We used the JUnit testing framework (www.junit.org) for both unit and performance testing. To get effective performance information from the test results, we adjusted some TDD practices.

In classic TDD, binary pass/fail testing results show an implementation's correctness. However, software performance is an emergent property of the overall system. You can't guarantee good system performance even if the individual subsystems perform satisfactorily.

Although we knew the overall system performance requirements, we couldn't specify the subsystems' performance before making some performance measurements. So, instead of using JUnit assert methods, we specified performance scenarios in the test cases and used them to generate performance log files. Developers and the performance architect then examined the logs to identify performance problems.


We generated the logs with measurement points injected in the test cases. Whenever the execution hit a measurement point, the test case created a time stamp in the performance log file. Subsequently, we could use these time stamps to generate performance statistics, such as the average and standard deviation of the throughput. We could enable or disable all of the measurement points via a single runtime configuration switch. We found the logs provided more information than the binary results and facilitated our investigation of performance issues.
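The article doesn't reproduce the instrumentation code, so the following is a minimal sketch of how such measurement points might look in Java. The PerfLog class name, the performance.log file name, and the perf.trace system property are assumptions for illustration, not the IBM implementation.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Minimal sketch of a measurement-point logger (hypothetical; not the IBM code).
public class PerfLog {
    // Single runtime switch that enables or disables all measurement points.
    private static final boolean ENABLED = Boolean.getBoolean("perf.trace");
    private static PrintWriter out;

    // Record a time stamp for a named measurement point.
    public static synchronized void mark(String point) {
        if (!ENABLED) {
            return;
        }
        try {
            if (out == null) {
                out = new PrintWriter(new FileWriter("performance.log", true), true);
            }
            out.println(point + "\t" + System.currentTimeMillis());
        } catch (IOException e) {
            // Logging must never break a test run; swallow and continue.
        }
    }

    // Simple post-processing helpers: mean and standard deviation of durations.
    public static double mean(long[] durations) {
        double sum = 0;
        for (long d : durations) {
            sum += d;
        }
        return sum / durations.length;
    }

    public static double stdDev(long[] durations) {
        double m = mean(durations);
        double sq = 0;
        for (long d : durations) {
            sq += (d - m) * (d - m);
        }
        return Math.sqrt(sq / durations.length);
    }
}

Under these assumptions, a developer would enable every measurement point for a run with -Dperf.trace=true on the command line and leave them disabled otherwise, which matches the single-switch behavior described above.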

Another difference between classic TDD and TFP is the test designer. In classic TDD, the developers implementing the software design the test cases. In our project, a performance architect who wasn't involved in coding but had close interactions with the developers specified the performance test cases. The performance architect was an expert in the domain with deep knowledge of the end users' expectations of the subsystems. The performance architect also had greater understanding of the software's performance attributes than the developers and so could specify test cases that might expose more performance problems.

For a TDD project to be successful, the test cases must provide fast feedback. However, one of our objectives was to understand the system's behavior under heavy workloads,4 a characteristic that seemingly contradicts TDD. To incorporate performance testing into a TDD process, we designed two sets of performance test cases. One set could finish quickly and provide early warning of performance problems. The developers executed this set with other unit test cases. The other set needed to be executed in a performance testing lab that simulated a real-world situation. The performance architect ran this set to identify performance problems under heavy, simultaneous workloads or with multiple devices.

The development team consisted of nine full-time engineers, five in the US and four in Mexico. A performance architect in the US was also allocated to the team. No one had prior experience with TDD, and three were somewhat unfamiliar with Java. All but two of the nine full-time developers were new to the targeted devices. The developers' domain knowledge had to be built during the design and development phases. Figure 1 summarizes the project context.

Applying test-first performance

For ease of discussion, we divide the TFP approach into three subprocesses: test design, critical test suite execution, and master test suite execution.

Test design

This process involves identifying important performance scenarios and specifying performance objectives. It requires knowing the system's performance requirements and software architecture. In our case, the performance architect, with assistance from other domain experts, performed this process.

Our test design process has three steps: identify performance areas, specify performance objectives, and specify test cases.

Identify performance areas. You can specify software performance on many types of resources, with different measurement units. For example, elapsed time, transaction throughput, and transaction response time are among the most common ways to specify software performance. In our project, the important performance areas were

■ the response time for typical input and output operations,

■ the maximum sustainable throughput and response time for output devices, and

■ the time spent in different software layers.

Specify performance objectives. We obtained the performance objectives from three sources.

The first was previous versions of similar device driver software. A domain expert identified the performance-critical device drivers. The objective for new performance-critical drivers was to exceed the current performance in the field. The objective for non-performance-critical drivers was to roughly meet the current performance in the field.

Figure 1. Project context.

Team
  Size: Nine engineers
  Experience (domain and programming language): Some team members inexperienced
  Collocation: Distributed
  Technical leadership: Dedicated coach

Code
  Size: 73.6 KLOC (9.0 base and 64.6 new)
  Language: Java
  Unit testing: Test-driven development

The second source was the limitations of the underlying hardware and software components. These components' performance posed limiting factors beyond which performance improvement had little relevance. For example, figure 2 shows an execution time profile for a POS line display device. In this example, we could impact the performance for only the JavaPOS (www.javapos.com), the javax.usb API (http://javax-usb.org), and the Java Native Interface (JNI) layers. The combined time spent in the underlying operating system kernel and in the USB hardware was responsible for more than half the latency from command to physical device reaction. As long as the device operation was physically possible and acceptable, we considered these latencies as forming a bound on the drivers' performance.

Finally, marketing representatives distilled some of the performance objectives from direct customer feedback. This input tended to be numerically inexact but did highlight specific areas where consumers most desired performance-related improvement from the previous release.

Specify test cases. In this final step, we designed performance test cases based on the specified objectives. This step included defining the measurement points and writing JUnit test cases to automatically execute the performance scenarios. We put the measurement points at the entry and exit points for each software layer. We put another measurement point at the lowest driver layer so that we could capture the time stamp of the event signal for the input from the Java virtual machine or native operating system. Performance test cases asserted additional start and stop measurement points.
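As an illustration only (the article doesn't reproduce its test code), a JUnit 3-style performance scenario bracketed by start and stop measurement points might look like the sketch below. The LineDisplay interface and its stub are hypothetical stand-ins for the real JavaPOS driver classes, and PerfLog is the hypothetical logger sketched earlier.

import junit.framework.TestCase;

// Hypothetical stand-in for a device driver interface; the real tests drove
// the actual JavaPOS line display driver.
interface LineDisplay {
    void open() throws Exception;
    void displayText(String text) throws Exception;
    void close() throws Exception;
}

public class LineDisplayPerfTest extends TestCase {

    // A do-nothing stub keeps this sketch self-contained; the real suite
    // would use the production driver implementation here.
    private final LineDisplay display = new LineDisplay() {
        public void open() {}
        public void displayText(String text) {}
        public void close() {}
    };

    public void testDisplayTextScenario() throws Exception {
        display.open();

        // Start and stop measurement points bracket the scenario; further
        // points inside the driver would mark the entry and exit of each
        // software layer (JavaPOS, javax.usb, JNI).
        PerfLog.mark("LineDisplay.displayText.start");
        display.displayText("TOTAL    $12.34");
        PerfLog.mark("LineDisplay.displayText.stop");

        display.close();
    }
}

Note that, in keeping with the approach described above, the test records time stamps rather than asserting a pass/fail performance bound.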

We specified two performance test suites. The critical test suite consisted of a small subset of the performance test cases used to test individual performance-critical device drivers. The master test suite comprised the full set of all performance test cases.
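The article doesn't describe how the suites were assembled; one plausible arrangement, sketched here with JUnit 3's TestSuite and the hypothetical test class from the previous sketch, is the following.

import junit.framework.Test;
import junit.framework.TestSuite;

// Sketch only: how the critical and master suites might be composed.
public class PerformanceSuites {

    // Small, fast subset covering the performance-critical drivers; run by
    // developers along with the unit tests before check-in.
    public static Test criticalSuite() {
        TestSuite suite = new TestSuite("Critical performance suite");
        suite.addTestSuite(LineDisplayPerfTest.class);
        // ...further tests for performance-critical drivers would go here.
        return suite;
    }

    // Full set of performance tests; run weekly in the performance lab.
    public static Test masterSuite() {
        TestSuite suite = new TestSuite("Master performance suite");
        suite.addTest(criticalSuite());
        // ...heavy-workload, multiple-device, and multiple-platform tests
        // would go here.
        return suite;
    }
}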

Critical test suite execution

Developers ran the critical test suite on their own machines before committing their code to the repository. They executed the suite using Jacl (http://tcljava.sourceforge.net), an interactive scripting language that allows manual interactive exercising of the classes by creating objects and calling their methods. Figure 3a shows the critical-test-suite execution process.

Performance testing for a feature can't reasonably start until the feature passes the TDD unit and functional tests (the gray boxes in figure 3a). After the TDD test cases passed, the developers ran the critical test suite to generate performance logs. This suite provides preliminary feedback on potential performance issues associated with an individual device. To provide timely feedback, test case responsiveness is important in TDD. However, obtaining proper performance measurements usually requires running the software multiple times in different operating environments, sometimes with heavy workloads. Consequently, performance test cases tend to be slower than functional test cases. To get quick feedback, the critical suite focused only on quick testing of the performance-critical drivers.

In classical TDD, developers rely on the "green bar" to indicate that an implementation has passed the tests. When a test case fails, the development team can usually find the problem in the newly implemented feature. This delta debugging5 is an effective defect-removal technique for TDD. However, in performance testing, a binary pass/fail isn't always best. We wanted developers to have an early indication of any performance degradation caused by a code change beyond what a binary pass/fail for a specific performance objective could provide. Additionally, hard performance limits on the paths being measured weren't always available in advance. So, the developers had to examine the performance log files to discover performance degradation and other potential performance issues. The performance problem identification and removal process thus relied heavily on the developers' expertise. After addressing the performance issues, the developers checked in the code and began implementing another new feature.

Figure 2. Some limiting factors imposed by the hardware and operating system layers. (Line display component times: JavaPOS 5%, javax.usb 20%, Java Native Interface 20%, kernel + internal hardware 47.5%, wire + external hardware 7.5%.)

Master suite execution

The master suite execution included these performance test cases:

■ Multiple runs. Many factors, some that aren't controllable, can contribute to software performance. So, performance measurements can differ each time the test cases are run. We designed some test cases to run the software multiple times to generate performance statistics.

■ Heavy workloads. Software might behave differently under heavy workloads. We designed some test cases with heavy workloads to measure the software's responsiveness and throughput under such circumstances (see the sketch after this list).

■ Multiple devices. The interaction among two or more hardware devices might cause unforeseen performance degradations. We designed some test cases to exercise multiple-device scenarios.

■ Multiple platforms. We duplicated most test cases to measure the software's performance on different platforms and in different environments.
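To make the multiple-runs and heavy-workload ideas concrete, here is a sketch of what such a master-suite test might look like. The run count, the receipt length, and the simulatePrintLine placeholder are illustrative assumptions, and PerfLog is the hypothetical logger sketched earlier.

import junit.framework.TestCase;

// Sketch of a master-suite test that repeats a heavy scenario many times to
// gather throughput statistics (not the IBM implementation).
public class HeavyWorkloadPrinterPerfTest extends TestCase {

    private static final int RUNS = 50;                 // repeated runs for statistics
    private static final int LINES_PER_RECEIPT = 500;   // long supermarket-style receipt

    public void testSustainedAsynchronousPrinting() throws Exception {
        long[] durations = new long[RUNS];
        for (int run = 0; run < RUNS; run++) {
            long start = System.currentTimeMillis();
            PerfLog.mark("Printer.heavyLoad.start");
            for (int line = 0; line < LINES_PER_RECEIPT; line++) {
                // The real test would drive the printer driver here; the
                // workload is simulated so the sketch stays self-contained.
                simulatePrintLine("ITEM " + line);
            }
            PerfLog.mark("Printer.heavyLoad.stop");
            durations[run] = System.currentTimeMillis() - start;
        }
        System.out.println("mean ms: " + PerfLog.mean(durations)
                + "  std dev: " + PerfLog.stdDev(durations));
    }

    private void simulatePrintLine(String text) {
        // Placeholder for a call into the printer driver.
    }
}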

Running these test cases could take significant time, and including them in the critical suites wasn't practical. Additionally, it was more difficult to identify the causes of the performance problems found with these test cases. We therefore included these test cases in the master test suite only.

The performance architect ran the master suite weekly in a controlled testing lab. Figure 3b shows this process.

The performance architect started running the master suite and collecting performance measurements early in the development life cycle, as soon as enough functionality had been implemented. After collecting the measurements, the performance architect analyzed the performance log file. If the performance architect found significant performance degradation or other performance problems, he would perform further analysis to identify the root cause or performance bottleneck. The architect then highlighted the problem to developers and suggested improvements when possible. The testing results were kept in the performance record, which showed the performance results' progress.
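The article doesn't show how the logs were turned into the performance record. The following sketch, which assumes the tab-separated format of the hypothetical PerfLog above, illustrates the kind of per-measurement-point summary the performance architect could derive from a log file.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: pair ".start"/".stop" entries from performance.log and report the
// mean duration per measurement point. Assumes non-overlapping start/stop
// pairs per point name; format follows the earlier PerfLog sketch.
public class PerfLogReport {

    public static void main(String[] args) throws Exception {
        Map<String, Long> open = new HashMap<String, Long>();
        Map<String, List<Long>> durations = new HashMap<String, List<Long>>();

        BufferedReader in = new BufferedReader(new FileReader("performance.log"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                continue;   // skip malformed lines
            }
            String point = parts[0];
            long stamp = Long.parseLong(parts[1]);
            if (point.endsWith(".start")) {
                open.put(point.substring(0, point.length() - ".start".length()), stamp);
            } else if (point.endsWith(".stop")) {
                String name = point.substring(0, point.length() - ".stop".length());
                Long start = open.remove(name);
                if (start != null) {
                    List<Long> list = durations.get(name);
                    if (list == null) {
                        list = new ArrayList<Long>();
                        durations.put(name, list);
                    }
                    list.add(stamp - start);
                }
            }
        }
        in.close();

        for (Map.Entry<String, List<Long>> e : durations.entrySet()) {
            long sum = 0;
            for (long d : e.getValue()) {
                sum += d;
            }
            double mean = (double) sum / e.getValue().size();
            System.out.println(e.getKey() + ": mean " + mean + " ms over "
                    + e.getValue().size() + " runs");
        }
    }
}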

A key element of this process was close interaction between the development team and the performance architect. The performance architect participated in weekly development team meetings. In these meetings, the development team openly discussed the key changes in all aspects of the software. Additionally, because team members were geographically distributed, the meetings let everyone freely discuss any dependencies they had with other members from the remote sites and plan one-on-one follow-up. The performance architect could use the knowledge gained from the meetings to determine when a subsystem had undergone significant changes and thus pay more attention to the related performance logs.


Figure 3. The process for (a) critical suite execution and (b) master suite execution.


Results

By the project's end, we had created about 2,900 JUnit tests, including more than 100 performance tests.3 Performance-testing results showed improvement in most intermediate snapshots. We sometimes expected the opposite trend, because checking for and handling an increasing number of exception conditions made the code more robust. In practice, however, we observed that the throughput of successive code snapshots generally did increase.

Figure 4 shows an example of receipt printer throughput in various operating modes and retail environments. Three sets of results show throughput with asynchronous-mode printing:

■ a short receipt typical of a department store environment (ADept),

■ a long receipt typical of a supermarket (ASmkt), and

■ a very long printout (MaxS).

A fourth set (SDept) shows throughput with synchronous-mode printing in a department store environment.

A code redesign to accommodate the human-interface device specification resulted in a significant throughput drop between snapshots S1 and S2. However, the development team received early notice of the performance degradation via the ongoing performance measurements, and by snapshot S3 most of the degradation had been reversed. For the long receipts where performance is most critical, performance generally trended upward with new releases.

We believe that the upward performance trend was due largely to the continuous feedback provided by the TFP process. Contrary to the information provided by postdevelopment optimization or tuning, the information provided by the in-process performance testing enabled a feedback-induced tendency to design and code for better performance.

Another indication of our early measurement approach's success is its improvement over previous implementations. We achieved the Java support for an earlier implementation simply by wrapping the existing C/C++ drivers with the JNI. This early implementation was a tactical interim solution for releasing Java-based drivers. However, customers used this tactical solution, and we could have kept it as the mainstream offering if the new drivers didn't match its performance levels. The tactical drivers' performance was a secondary consideration and was never aggressively optimized, although the underlying C/C++ drivers were well tuned. The existence of the JNI-wrapped drivers let us run the same JUnit-based performance test cases against both the JNI-wrapped drivers and the Java drivers for comparison. The tactical drivers' performance provided a minimum acceptable threshold for performance-critical devices.

Figure 5a shows results for the set of throughput tests in figure 4, this time comparing snapshot S3 to the previous implementation. The overall results were better with the new code. Figure 5b shows the improvement in operation times of two less performance-critical device drivers. The line display was an intermediate case in that it didn't receive the intense performance design scrutiny afforded the printer, although bottom-line operation time was important. We considered the cash drawer operation time noncritical.

Lessons learned

Applying a test-first philosophy to software performance has produced some rewarding results.


Figure 4. Performance tracking using snapshots of receipt printer throughput tests. Three sets of results show throughput with asynchronous-mode printing: a short receipt typical of a department store environment (ADept), a long receipt typical of a supermarket (ASmkt), and a very long printout (MaxS). The fourth set (SDept) shows throughput with synchronous-mode printing in a department store environment. (Relative throughput per operating mode, taller is better; snapshots S1 through S4.)


First, when we knew performance requirements in advance, specifying and designing performance test cases before coding appears to have been advantageous. This practice let developers know, from the beginning, those operations and code paths considered performance critical. It also let them track performance progress from early in the development life cycle. Running the performance test cases regularly increased developers' awareness of potential performance issues. Continuous observation made the developers more aware of the system's performance and could explain the performance increase.6

We also found that having a domain expert specify the performance requirements (where known) and measurement points increased overall productivity by focusing the team on areas where a performance increase would matter. This directed focus helped the team avoid premature optimization.

The performance architect could develop and specify performance requirements and measurement points much more quickly than the software developers. Also, limiting measurement points and measured paths to what was performance sensitive in the end product let the team avoid overinstrumentation. Having a team member focused primarily on performance also kept the most performance-critical operations in front of the developers. This technique also occasionally helped avoid performance-averse practices, such as redundant creation of objects for logging events.

A third lesson is that periodic in-system measurements and tracking performance measurements' progress increased performance over time.

This practice provided timely notification of performance escapes (that is, when some other code change degraded overall performance) and helped isolate the cause. It also let developers quickly visualize their performance progress and provided a consistent way to prioritize limited development resources onto those functions having more serious performance issues. When a critical subsystem's performance was below the specified bound, the team focused on that subsystem in the middle of development. The extra focus and design led to the creation of an improved queuing algorithm that nearly matched the targeted device's theoretical limit.

Our performance-testing approach required manually inspecting the performance logs. During the project's development, JUnit-based performance testing tools, such as JUnitPerf, weren't available. Such tools provide better visibility of performance problems than manual inspection of performance logs. Although we believe manual inspection of performance trends is necessary, specifying the bottom-line performance in assert-based test cases can complement the use of performance log files, making the TFP testing results more visible to the developers. We're investigating the design of assert-based performance testing to improve the TFP process.
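As a simple illustration of that assert-based direction (not part of the original project), a bottom-line bound could be checked directly with a standard JUnit assertion; the 200 ms bound and the scenario here are assumptions.

import junit.framework.TestCase;

// Sketch of an assert-based performance test: the test fails (red bar) when
// the measured time exceeds a bottom-line bound, complementing the log-based
// inspection described above.
public class LineDisplayResponseTimeTest extends TestCase {

    private static final long MAX_RESPONSE_MILLIS = 200;   // assumed bound

    public void testDisplayTextMeetsBottomLineBound() throws Exception {
        long start = System.currentTimeMillis();
        runDisplayTextScenario();
        long elapsed = System.currentTimeMillis() - start;
        assertTrue("displayText took " + elapsed + " ms, bound is "
                + MAX_RESPONSE_MILLIS + " ms", elapsed <= MAX_RESPONSE_MILLIS);
    }

    private void runDisplayTextScenario() {
        // Placeholder: the real test would exercise the driver here, as in
        // the earlier performance scenario sketch.
    }
}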


Figure 5. A comparison to the previous implementation of (a) performance-critical printer tasks and (b) less critical and noncritical device tasks. ((a) Printer throughput tests for ADept, ASmkt, MaxS, and SDept; (b) device throughput for the line display and cash drawer; relative throughput, taller is better.)


Another direction of future work is automatic performance test generation. In this project, we relied on the performance architect's experience to identify the execution paths and measurement points for performance testing. We can derive this crucial information for performance testing from the performance requirements and system design. We plan to develop guidelines for specifying performance requirements and system design to make this automation possible.

Acknowledgments

We thank the IBM device driver development team members, who implemented the performance tests described in this article. We also thank the RealSearch reading group at North Carolina State University for their comments. IBM is a trademark of International Business Machines Corporation in the United States, other countries, or both.

References

1. E.M. Maximilien and L. Williams, "Assessing Test-Driven Development at IBM," Proc. 25th Int'l Conf. Software Eng. (ICSE 03), IEEE CS Press, 2003, pp. 564–569.
2. L.E. Williams, M. Maximilien, and M.A. Vouk, "Test-Driven Development as a Defect Reduction Practice," Proc. 14th Int'l Symp. Software Reliability Eng., IEEE CS Press, 2003, pp. 34–35.
3. C.-W. Ho et al., "On Agile Performance Requirements Specification and Testing," Proc. Agile 2006 Int'l Conf., IEEE Press, 2006, pp. 47–52.
4. E.J. Weyuker and F.I. Vokolos, "Experience with Performance Testing of Software Systems: Issues, an Approach, and Case Study," IEEE Trans. Software Eng., vol. 26, no. 12, Dec. 2000, pp. 1147–1156.
5. A. Zeller and R. Hildebrandt, "Simplifying and Isolating Failure-Inducing Input," IEEE Trans. Software Eng., vol. 28, no. 2, 2002, pp. 183–200.
6. G.M. Weinberg, The Psychology of Computer Programming, Dorset House, 1998, p. 31.



About the Authors

Michael J. Johnson is a senior software development engineer with IBM in Research Triangle Park, North Carolina. His recent areas of focus include performance analysis, performance of embedded systems, and analysis tools for processors. He received his PhD in mathematics from Duke University. He's a member of the IEEE and the Mathematical Association of America. Contact him at IBM Corp., Dept. YM5A, 3039 Cornwallis Rd., Research Triangle Park, NC 27709; [email protected].

Chih-Wei Ho is a PhD candidate at North Carolina State University. His research interests are performance requirements specification and performance testing. He received his MS in computer science from NCSU. Contact him at 2335 Trellis Green, Cary, NC 27518; [email protected].

E. Michael Maximilien is a research staff member in the Almaden Services Research group at the IBM Almaden Research Center. His research interests are distributed systems and software engineering, with contributions to service-oriented architecture, Web services, Web 2.0, service mashups, and agile methods and practices. He received his PhD in computer science from North Carolina State University. He's a member of the ACM and the IEEE. Contact him at the IBM Almaden Research Center, 650 Harry Rd., San Jose, CA 95120; [email protected].

Laurie Williams is an associate professor at North Carolina State University. Her research interests include software testing and reliability, software engineering for security, empirical software engineering, and software process, particularly agile software development. She received her PhD in computer science from the University of Utah. She's a member of the ACM and IEEE. Contact her at North Carolina State Univ., Dept. of Computer Science, Campus Box 8206, 890 Oval Dr., Rm. 3272, Raleigh, NC 27695; [email protected].
