High-Quality Software Engineering: Lessons from the Six-Nines World
David Drysdale ([email protected])



High-Quality Software Engineering
Lessons from the Six-Nines World

David Drysdale ([email protected])


Copyright © 2005-2007 David Drysdale


Table of Contents

1 Introduction ... 1
  1.1 Intended Audience ... 2
    1.1.1 New Software Engineers ... 2
    1.1.2 Software Team Leaders ... 3
  1.2 Common Themes ... 4
    1.2.1 Maintainability ... 4
    1.2.2 Knowing Reasons Why ... 4
    1.2.3 Developing the Developers ... 5
  1.3 Book Structure ... 5
2 Requirements ... 7
  2.1 Waterfall versus Agile ... 7
  2.2 Use Cases ... 8
  2.3 Implicit Requirements ... 9
3 Design ... 10
  3.1 Interfaces and Implementations ... 11
    3.1.1 Good Interface Design ... 11
    3.1.2 Black Box Principle ... 12
    3.1.3 Physical Architecture ... 12
  3.2 Designing for the Future ... 13
    3.2.1 Component Responsibility ... 13
    3.2.2 Minimizing Special Cases ... 13
    3.2.3 Scalability ... 14
    3.2.4 Diagnostics ... 15
    3.2.5 Avoiding the Cutting Edge ... 16
  3.3 Scaling Up ... 17
    3.3.1 Good Design ... 17
    3.3.2 Asynchronicity ... 17
    3.3.3 Fault Tolerance ... 20
    3.3.4 Distribution ... 22
    3.3.5 Dynamic Software Upgrade ... 24
  3.4 Communicating the Design ... 26
    3.4.1 Why and Who ... 26
    3.4.2 Diagrams ... 26
    3.4.3 Documentation ... 27
4 Code ... 28
  4.1 Portability ... 28
  4.2 Internationalization ... 29
  4.3 Multithreading ... 30
  4.4 Coding Standards ... 30
  4.5 Tracing ... 31
  4.6 Managing the Codebase ... 34
    4.6.1 Revision Control ... 34
    4.6.2 Other Tracking Systems ... 35
    4.6.3 Enforcing Standards ... 37
    4.6.4 Building the Code ... 37
    4.6.5 Delivering the Code ... 38
5 Code Review ... 39
  5.1 Rationale: Why do code reviews? ... 39
    5.1.1 Correctness ... 39
    5.1.2 Maintainability ... 39
    5.1.3 Education ... 40
  5.2 Searching for Bugs ... 40
    5.2.1 Local Analysis ... 40
    5.2.2 Data Structures ... 40
    5.2.3 Scenario Walkthroughs ... 41
  5.3 Maintainability ... 41
    5.3.1 Modularity ... 41
    5.3.2 Communication ... 42
  5.4 The Code Review Process ... 43
    5.4.1 Logistics ... 43
    5.4.2 Dialogue ... 44
  5.5 Common Mistakes ... 45
6 Test ... 46
  6.1 How ... 46
  6.2 Who and When ... 47
  6.3 Psychology of Testing ... 48
  6.4 Regressibility ... 49
  6.5 Design for Testability ... 50
  6.6 Debugging ... 51
7 Support ... 52
  7.1 Customers ... 52
  7.2 Learning a Codebase ... 53
  7.3 Imperfect Code ... 53
  7.4 Support Metrics ... 54
8 Planning a Project ... 56
  8.1 Estimation ... 56
    8.1.1 Task Estimation ... 56
    8.1.2 Project Estimation ... 57
  8.2 Task Division ... 60
  8.3 Dependencies ... 64
  8.4 Replanning ... 66
9 Running a Project ... 67
  9.1 Tracking ... 67
    9.1.1 Metrics ... 67
    9.1.2 Feedback ... 70
  9.2 Time Management ... 71
  9.3 Running a Team ... 72
    9.3.1 Explaining Decisions ... 72
    9.3.2 Make Everything Into Code ... 72
    9.3.3 Tuning Development Processes ... 73
  9.4 Personnel Development ... 74
    9.4.1 Technical Fundamentals ... 74
    9.4.2 Pragmatism ... 75
Index ... 77


1 Introduction

Software is notorious for its poor quality. Buggy code, inconvenient interfaces and missing features are almost expected by the users of most modern software.

Software development is also notorious for its unreliability. The industry abounds with tales of missed deadlines, death march projects and huge cost overruns.

This book is about how to avoid these failures. It’s about the whole process of software engineering, not just the details of designing and writing code: how to build a team, how to plan a project, how to run a support organization.

There are plenty of other books out there on these subjects, and indeed many of the ideas in this one are similar to those propounded elsewhere.

However, this book is written from the background of a development sector where software quality really matters. Networking software and device-level software often need to run on machines that are unattended for months or years at a time. The prototypical examples of these kinds of devices are the big phone or network switches that sit quietly in a back room somewhere and just work. These devices often have very high reliability requirements: the “six-nines” of the subtitle.

“Six-nines” is the common way of referring to a system that must have 99.9999% availability. Pausing to do some sums, this means that the system can only be out of action for a total of 32 seconds in a year. Five-nines (99.999% availability) is another common reliability level; a five-nines system can only be down for around 5 minutes in a year.

When you stop to think about it, that’s an incredibly high level of availability. The average light bulb probably doesn’t reach five-nines reliability (depending on how long it takes you to get out a stepladder), and that’s just a simple piece of wire. Most people are lucky if their car reaches two-nines reliability (three and a half days off the road per year). Telephone switching software is pretty complex stuff, and yet it manages to hit these quality levels (when did your regular old telephone last fail to work?).

Reliability Level   Uptime Percentage   Downtime per year
Two-nines           99%                 3.5 days
Three-nines         99.9%               9 hours
Four-nines          99.99%              53 minutes
Five-nines          99.999%             5 minutes
Six-nines           99.9999%            31 seconds

Table 1.1: Allowed downtimes for different reliability levels
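The figures in Table 1.1 are easy to verify: the allowed downtime is just the unavailable fraction of a year. A minimal sketch (Python; a 365-day year is assumed, and the function name is mine):

```python
# Allowed downtime per year implied by an availability percentage.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a non-leap year

def downtime_per_year(availability_percent):
    """Seconds per year a system at this availability may be out of action."""
    unavailable = 1.0 - availability_percent / 100.0
    return SECONDS_PER_YEAR * unavailable

for name, pct in [("Two-nines", 99.0), ("Three-nines", 99.9),
                  ("Four-nines", 99.99), ("Five-nines", 99.999),
                  ("Six-nines", 99.9999)]:
    print(f"{name:<12} {pct:>8}%  {downtime_per_year(pct):>12,.1f} s/year")
```

Six-nines comes out at roughly 31.5 seconds per year, which rounds to the 31 seconds in the table (and to the 32 seconds in the text, rounding the other way).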

Writing software that meets these kinds of reliability requirements is a tough challenge, and one that the software industry in general would be very hard pressed to meet.

This book is about meeting this challenge; it’s about the techniques and trade-offs that are worthwhile when hitting these levels of software quality. This goes far beyond just getting the programmers to write better code; it involves planning, it involves testing, it involves teamwork and management—it involves the whole process of software development.

All of the steps involved in building higher quality software have a cost, whether it’s in development time or in higher salaries to attract and keep more skilled developers. This extra cost is the stake in a bet: a bet that the software is going to be successful and maintainable in the long run, and that the extra cost will be amortized over a longer lifetime for the software.

Lots of these techniques and recommendations also apply outside of this particular development sector. Many of them are not as onerous or inefficient as you might think (particularly when all of the long-term development and support costs are properly factored in), and many of them are the same pieces of advice that show up in lots of software engineering books (except that here we really mean it).

There are some things that are different about this development sector, though. The main thing is that software quality is taken seriously by everyone involved:

• The customers are willing to pay more for it, and are willing to put in the effort to ensure that their requirements and specification are clear, coherent and complete.

• The sales force emphasize it and can use it to win sales even against substantially lower-priced competitors.

• Management are willing to pay for the added development costs and slower time-to-market.

• Management are willing to invest in long-term development of their staff so that they become skilled enough to achieve true software quality.

• The programmers can take the time to do things right.

• The support phase is taken seriously, rather than being left as a chore for less-skilled staff to take on.

Taken together, this means that it’s worthwhile to invest in the architecture and infrastructure for quality software development, and to put the advice from software engineering books into practice—all the time, every time.

It’s also possible to do this in a much more predictable and repeatable way than with many areas of software development—largely because of the emphasis on accurate specifications (see Section 2.1 [Waterfall versus Agile], page 7).

Of course, there are some aspects of software development that this book doesn’t cover. The obvious example is user interfaces—the kind of software that runs for a year without crashing is also the kind of software that rarely has to deal with unpredictable humans (and unreliable UI libraries). However, there are plenty of other places to pick up tips on these topics that I skip.[1]

1.1 Intended Audience

1.1.1 New Software Engineers

One of the aims of this book is to cover the things I wish that I’d known when I first started work as a professional software engineer working on networking software. It’s the distillation of a lot of good advice that I received along the way, together with lessons learnt through bitter experience. It’s the set of things that I’ve found myself explaining over the years to both junior software developers and to developers who were new to carrier-class networking software—so a large fraction of the intended audience is exactly those software developers.

There’s a distinction here between the low-level details of programming, and software engineering: the whole process of building and shipping software. The details of programming—algorithms, data structures, debugging hints, modelling systems, language gotchas—are often very specific to the particular project and programming environment (hence the huge range of the O’Reilly library). Bright young programmers fresh from college often have significant programming skills, and easily pick up many more specifics of the systems that they work on. Similarly, developers moving into this development sector obviously bring significant skill sets with them.

The wider process of software development is equally important, but it’s much less common for fresh new developers to have any understanding of it. Planning, estimating, architecting, designing, tracking, testing, delivering, and supporting software are not typically covered in a computer science degree. In the long term, these things are as important as stellar coding skills for producing high quality software.

[1] For example, on UI design I’d recommend reading Joel Spolsky’s “User Interface Design for Programmers”.


This book aims to help with this understanding, in a way that’s mostly agnostic about the particular software technologies and development methodologies in use. Many of the same ideas show up in discussions of functional programming, object-oriented programming, agile development, extreme programming etc., and the important thing is to understand

• why they’re good ideas

• when they’re appropriate and when they’re not appropriate (which involves understanding why they’re good ideas)

• how to put them into practice.

It’s not completely agnostic, though, because our general emphasis on high-quality software engineering means that some things are unlikely to be appropriate. An overall system is only as reliable as its least reliable part, and so there’s no point in trying to write six-nines Visual Basic for Excel—the teetering pyramid of Excel, Windows and even PC hardware is unlikely to support the concept.
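The “least reliable part” observation can be made quantitative: for a stack of components that must all be working at once, the availabilities multiply, so the overall figure always falls below the weakest layer’s. A sketch in Python (the layer figures are invented for illustration):

```python
from math import prod

def stack_availability(layers):
    """Overall availability of components that must all be up simultaneously."""
    return prod(layers)

# Hypothetical stack: a six-nines application on far less reliable foundations.
layers = [0.999999, 0.9999, 0.999]  # application, runtime, platform
overall = stack_availability(layers)
print(f"{overall:.6f}")  # below even the weakest single layer
```

However good the application layer is, the product is dragged down by the platform underneath it—which is exactly why six-nines application code on a two- or three-nines foundation is wasted effort.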

1.1.2 Software Team Leaders

Of course, if someone had given me all of this advice about quality software development, all at once, when I started as a software developer, I wouldn’t have listened to it.

Instead, all of this advice trickled in over my first few years as a software engineer, reiterated by a succession of team leaders and reinforced by experience. In time I moved into the position of being the team leader myself—and it was now my turn to trickle these pieces of advice into the teams I worked with.

A recurring theme (see Section 1.2.3 [Developing the Developers], page 5) of this book is that the quality of the software depends heavily on the calibre of the software team, and so improving this calibre is an important part of the job of a team leader.

The software team leader has a lot of things to worry about as part of the software development process. The team members can mostly concentrate on the code itself (designing it, writing it, testing it, fixing it), but the team leader needs to deal with many other things too: adjusting the plan and the schedule so the release date gets hit, mollifying irate or unreasonable customers, persuading the internal systems folk to install a better spam filter, gently soothing the ego of a prima-donna programmer and so on.

Thus, the second intended audience for this book is software team leaders, particularly those who are either new to team leading or new to higher quality software projects. Hopefully, anyone in this situation will find that a lot of the lower-level advice in this book is just a codification of what they already know—but a codification that is a useful reference when working with their team (particularly the members of their team who are fresh from college). There are also a number of sections that are specific to the whole business of running a software project, which should be useful to these new team leaders.

At this point it’s worth stopping to clarify exactly what I mean by a software team leader—different companies and environments use different job names, and divide the responsibilities for roles differently. Here, a team leader is someone who has many of the following responsibilities (but not necessarily all of them).

• Designer of the overall software system, dividing the problem into the relevant subcomponents.

• Assigner of tasks to programmers.

• Generator and tracker of technical issues and worries that might affect the success of the project.

• Main answerer of technical questions (the team “guru”), including knowing who best to redirect the question to when there’s no immediate answer.


• Tracker of project status, including reporting on that status to higher levels of management.

• Trainer or mentor of less-experienced developers, including assessment of their skills, capabilities and progress.

1.2 Common Themes

A number of common themes recur throughout this book. These themes are aspects of software development that are very important for producing the kind of high-quality software that hits six-nines reliability. They’re very useful ideas for other kinds of software too, but are rarely emphasized in the software engineering literature.

1.2.1 Maintainability

Maintainability is all about making software that is easy to modify later, and is an aspect of software development that is rarely considered. This is absolutely vital for top quality software, and is valuable elsewhere too—there are very few pieces of software that don’t get modified after version 1.0 ships, and so planning for this later modification makes sense over the long term.

This book tries to bring out as many aspects of maintainability as possible. Tracking the numbers involved—development time spent improving maintainability, number of bugs reported, time taken to fix bugs, number of follow-on bugs induced by rushed bug fixes—can quickly show the tangible benefits of concentrating on this area.

Fortunately, the most significant factor in maintainability is good design—and few people would dispute that software needs good design. If the design clearly and coherently divides up the responsibilities of the different parts of the project, the chances of a later modification to the code being straightforward are much higher. On the other hand, it’s always hard to tease a version 2.0 out of a design where the components are tangled together with complicated interactions and interacting complications.

The other main theme for maintainability is communication. The code itself only encodes what the machine needs to know; the programmers need to know much more—what code owns what resources, what the balance of pros and cons was between different approaches, in what circumstances the code is expected or not expected to scale well.

All of this information needs to be communicated from the people who understand it in the first place—the original designers and coders—to the people who have to understand it later—the programmers who are developing, supporting, fixing and extending the code.

Some of this communication appears as documentation, in the form of specifications, design documents, scalability analyses, stored email discussions on design decisions etc. A lot of this communication forms a part of the source code, in the form of comments, identifier names and even directory and file structures.

1.2.2 Knowing Reasons Why

Building a large, complex software system involves a lot of people making a lot of decisions along the way. For each of those decisions, there can be a range of possibilities together with a range of factors for and against each possibility. To make good decisions, it’s important to know as much as possible about these factors.

It’s also important to realize that this balance of factors is time-dependent: the most significant or relevant factors for the next three months are often radically different from the important factors for the next three years. For high-quality software development, this is particularly common—the long-term quality of the codebase often forces painful or awkward decisions in the short and medium term.

Scott Meyers has written a series of books about C++, which are very highly regarded and which deservedly sell like hot cakes. A key factor in the success of these books is that he builds a collection of rules of thumb for C++ coding, but he makes sure that the reader understands the reasons for the rule. That way, they will be able to make an informed decision if they’re in the rare scenario where other considerations overrule the reasons for a particular rule.

In such a scenario, the software developer can then come up with other ways to avoid the problems that led to the original rule. To take a concrete example, global variables are often discouraged because they:

a. pollute the global namespace

b. have no access control (and so can be modified by any code)

c. implicitly couple together widely separated areas of the code

d. aren’t inherently thread safe.

Replacing a global variable with a straightforward Singleton design pattern[2] counteracts a) and b). Knowing the full list allows d) to be dealt with separately by adding locking, and leaves c) as a remaining issue to watch out for.
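As a sketch of that replacement, here is a minimal thread-safe Singleton in Python (the `Config` class and its method names are invented for illustration; the book’s argument is language-neutral):

```python
import threading

class Config:
    """Singleton replacing a global mutable variable.

    A single named access point addresses a) namespace pollution, routing all
    access through methods addresses b) access control, and the internal lock
    addresses d) thread safety. Hidden coupling, c), remains to watch out for.
    """
    _instance = None
    _instance_lock = threading.Lock()

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    @classmethod
    def instance(cls):
        # Lazily create the one shared instance, safely across threads.
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls()
        return cls._instance

    def set(self, key, value):
        with self._lock:  # serialize writes from any thread
            self._values[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._values.get(key, default)
```

Every call to `Config.instance()` returns the same object, so callers share state without any bare global being exposed for arbitrary modification.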

1.2.3 Developing the Developers

It’s generally well known[3], but bears repeating: the productivity of different software developers varies vastly. The best developers can be better than the average developer by an order of magnitude, and infinitely better than the worst developer.

This makes it obvious that the single most important factor in producing good software is the calibre of the development team producing it.

A lot of this is innate talent, and this means that recruitment and retention is incredibly important. The other part is to ensure that the developers you’ve already got are fulfilling all of their potential.

Programmers are normally pretty bright people, and bright people like to learn. Giving them the opportunity to learn is good for everyone: the programmers are happier (and so more likely to stick around), the software becomes higher quality, and the programmers are more effective on future projects because they’re more skilled.

It’s important to inculcate good habits into new programmers; they should be encouraged to develop into being the highest quality engineers they can be, building the highest quality software they can. This book aims to help this process by distilling some of the key ideas involved in developing top quality software.

All of the skills needed to write great code are not enough; the most effective software engineers also develop skills outside the purely technical arena. Communicating with customers, writing coherent documentation, understanding the constraints of the overall project, accurately estimating and tracking tasks—all of these skills improve the chances that the software will be successful and of top quality.

1.3 Book Structure

The first part of this book covers the different phases of software development, in roughly the order that they normally occur:

• Requirements: What to build.

• Design: How to build it.

• Code: Building it.

• Test: Checking it got built right.

• Support: Coping when the customers discover parts that (they think) weren’t built right.

2 See Gamma, Helm, Johnson & Vlissides “Design Patterns”.
3 See http://www.joelonsoftware.com/articles/HighNotes.html for an example.


Depending on the size of the project and the particular software development methodology (e.g. waterfall vs. iterative), this cycle can range in size from hours or days to years, and the order of steps can sometimes vary (e.g. writing tests before the code) but the same principles apply regardless. As it happens, high quality software usually gets developed using the apparently old-fashioned waterfall approach—more on the reasons for this in the Requirements chapter.

Within this set of steps of a development cycle, I’ve also included a chapter specifically on code reviews (see Chapter 5 [Code Review], page 39). Although there are plenty of books and articles that help programmers improve their coding skills, and even though the idea of code reviews is often recommended, there is a dearth of information on why and how to go about a code review.

The second part of the book takes a step back to cover more about the whole process of running a high quality software development project. This covers the mechanics of planning and tracking a project (in particular, how to work towards more accurate estimation) to help ensure that the software development process is as high quality as the software itself.


2 Requirements

What is it that the customer actually wants? What is the software supposed to do? This is what requirements are all about—figuring out what it is you’re supposed to build.

It often comes as a surprise to new developers fresh out of college to discover that the whole business of requirements is one of the most common reasons for software projects to turn into disastrous failures.

So why should it be so hard? First up, there are plenty of situations where the software team really doesn’t know what the customer wants. It might be some new software idea that’s not been done before, so it’s not clear what a potential customer will focus on. It might be that the customer contact who generates the requirements doesn’t really understand the low-level details of what the software will be used for. It might just be that until the user starts using the code in anger, they don’t realize how awkward some parts of the software are.

Even if the customer does know what they want, they can’t necessarily communicate this to the software development team. For some reason, geeks and non-geeks seem to lack a common language when talking about software. This is where the distinction between the requirements and the specification comes in: the requirements describe what the customer wants to achieve, and the specification details how the software is supposed to help them achieve it—customers who are clear on the former can’t necessarily convert it into the latter.

Bug-tracking systems normally include a separate category for ignored bug reports where the code is “working as specified”—WAS. Much of the time, this really means NWAS—“not working, as specified”: the code conforms to the specification, but the code’s behaviour is obviously wrong—the specification didn’t reflect what a real user needs.

For fresh-faced new developers, this can all come as a bit of a shock. Programming assignments on computer science courses all have extremely well-defined requirements; open-source projects (which are the other kind of software that they’re likely to have been exposed to) normally have programmers who are also customers for the software, and so the requirements are implicitly clear.

This chapter is all about this requirements problem, starting with a discussion of why it is (or should be) less of an issue for true six-nines software systems.

2.1 Waterfall versus Agile

Get your head out of your inbred development sector and look around1.

Here’s a little-known secret: most six-nines reliability software projects are developed using a waterfall methodology. For example, the telephone system and the Internet are both fundamentally grounded on software developed using a waterfall methodology. This comes as a bit of a surprise to many software engineers, particularly those who are convinced of the effectiveness of more modern software development methods (such as agile programming).

Before we explore why this is the case, let’s step back for a moment and consider exactly what’s meant by “waterfall” and “agile” methodologies.

The waterfall methodology for developing software involves a well-defined sequence of steps, all of which are planned in advance and executed in order. Gather requirements, specify the behaviour, design the software, generate the code, test it and then ship it (and then support it after it’s been shipped).

An agile methodology tries to emphasize an adaptive rather than a predictive approach, with shorter (weeks not months or years), iterated development cycles that can respond to feedback on the previous iterations. This helps with the core problem of requirements discussed at the beginning of this chapter: when the customer sees an early iteration of the code, they can physically point to the things that aren’t right, and the next iteration of the project can correct them.

1 Comment by Dave W. Smith on http://c2.com/cgi/wiki?IsWaterFallDiscredited

The Extreme Programming (XP) variant of this methodology involves a more continuous approach to this problem of miscommunication of requirements: having the customer on-site with the development team (let’s call this the customer avatar). In other words: if interaction between the developers and the customer is a Good Thing, let’s take it to its Extreme.

So why do most high-quality software projects stick to the comparatively old-fashioned and somewhat derided waterfall approach?

The key factor that triggers this is that six-nines projects don’t normally fall into the requirements trap. These kinds of projects typically involve chunks of code that run in a back room—behind a curtain, as it were. The external interfaces to the code are binary interfaces to other pieces of code, not graphical interfaces to fallible human beings. The customers requesting the project are likely to include other software engineers. But most of all, the requirements document is likely to be a reference to a fixed, well-defined specification.

For example, the specification for a network router is likely to be a collection of IETF RFC2 documents that describe the protocols that the router should implement; between them, these RFCs will specify the vast majority of the behaviour of the code. Similarly, a telephone switch implements a large collection of standard specifications for telephony protocols, and the fibre-optic world has its own collection of large standards documents. A customer requirement for, say, OSPF routing functionality translates into a specification that consists of RFC2328, RFC1850 and RFC2370.

So, a well-defined, stable specification allows the rest of the waterfall approach to proceed smoothly, as long as the team implementing the software is capable of properly planning the project (see Chapter 8 [Planning a Project], page 56), designing the system (see Chapter 3 [Design], page 10), generating the code (see Chapter 4 [Code], page 28) and testing it properly (see Chapter 6 [Test], page 46).

2.2 Use Cases

Most people find it easier to deal with concrete scenarios than with the kinds of abstract descriptions that make their way into requirements documents. This means that it’s important to build some use cases to clarify and confirm what’s actually needed.

A use case is a description of what the software and its user does in a particular scenario. The most important use cases are the ones that correspond to the operations that will be most common in the finished software—for example, adding, accessing, modifying and deleting entries in a data-driven application (sometimes known as CRUD: create, read, update, delete), or setting up and tearing down connections in a networking stack.
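To make the CRUD example concrete, the four use cases map one-for-one onto the operations of an interface. The following address-book class is a hypothetical illustration, not from the text:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch: an in-memory address book whose interface is
// exactly the four CRUD use cases. Each operation returns false when
// the scenario's precondition fails (e.g. reading a missing entry).
class AddressBook {
public:
    bool create(const std::string& name, const std::string& number) {
        return entries_.insert(std::make_pair(name, number)).second;
    }
    bool read(const std::string& name, std::string& number) const {
        std::map<std::string, std::string>::const_iterator it =
            entries_.find(name);
        if (it == entries_.end()) return false;
        number = it->second;
        return true;
    }
    bool update(const std::string& name, const std::string& number) {
        std::map<std::string, std::string>::iterator it = entries_.find(name);
        if (it == entries_.end()) return false;
        it->second = number;
        return true;
    }
    bool destroy(const std::string& name) {  // "delete" is a C++ keyword
        return entries_.erase(name) == 1;
    }

private:
    std::map<std::string, std::string> entries_;
};
```

Walking through each use case ("what happens when the user adds a duplicate name?") immediately surfaces the kind of behaviour questions a requirements discussion needs to settle.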

It’s also important to include use cases that describe the behaviour in important error paths. What happens to the credit card transaction if the user’s Internet connection dies halfway through? What happens if the disk is full when the user wants to save the document they’ve worked on for the last four hours? Asking the customer these kinds of “What If?” questions can reveal a lot about their implicit assumptions for the behaviour of the code.

As an aside, it’s worth noting that these sample scenarios become very useful when it comes time to test the software (see Chapter 6 [Test], page 46). The scenarios describe what the customer/user expects to happen; the testing can then confirm that this is indeed what does happen.

2 The Internet Engineering Task Force (IETF) produces documents known as “Requests For Comment” (RFC) that describe how the protocols that make up the Internet should work.


2.3 Implicit Requirements

The use cases described in the previous section are important, but for larger systems there are normally also a number of requirements that are harder to distil down into individual scenarios.

These requirements relate to the average behaviour of the system over a (large) number of different iterations of various scenarios:

• Speed: How fast does the software react? How does this change as more data is included?

• Size: How much disk space and memory is needed for the software and its data?

• Resilience: How well does the system cope with network delays, hardware problems, operating system errors?

• Reliability: How many bugs are expected to show up in the system as it gets used? How well should the system cope with bugs in itself?

• Support: How fast do the bugs need to get fixed? What downtime is required when fixes get rolled out?

For six-nines software, these kinds of factors are often explicitly included in the requirements—after all, the phrase “six-nines” is itself a resilience and reliability requirement. Even so, sometimes the customer has some implicit assumptions about them that don’t make it as far as the spec (particularly for the later items on the list above). The implicit assumptions are driven by the type of software being built—an e-commerce web server farm will just be assumed to be more resilient than a web page applet.

Even when these factors are included in the requirements (typically as a Service Level Agreement or SLA), it can be difficult to put accurate or realistic numbers against each factor. This may itself induce another implicit requirement for the system—to build in a way of generating the quantitative data that is needed for tracking. An example might be to include code to measure and track response times, or to include code to make debugging problems easier and swifter (see Section 3.2.4 [Diagnostics], page 15). In general, SLA factors are much easier to deal with in situations where there is existing software and tracking systems to compare the new software with.
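A minimal sketch of the response-time tracking idea might look like the following; the class and its percentile reporting are hypothetical, and a production system would use a cheaper streaming data structure rather than sorting all samples on demand.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: accumulate per-request response times so that
// SLA figures (e.g. "95% of requests complete within N ms") can be
// reported from live measurements rather than guessed at.
class ResponseTimes {
public:
    void record(double milliseconds) { samples_.push_back(milliseconds); }

    // Returns the response time that the given fraction of requests
    // beat, e.g. percentile(0.95). Precondition: at least one sample.
    double percentile(double fraction) {
        std::sort(samples_.begin(), samples_.end());
        std::size_t index =
            static_cast<std::size_t>(fraction * (samples_.size() - 1));
        return samples_[index];
    }

    std::size_t count() const { return samples_.size(); }

private:
    std::vector<double> samples_;
};
```

The point is less the statistics than the design consequence: if the SLA needs numbers, the hooks to gather those numbers become a requirement on the 1.0 design.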


3 Design

The process of making a software product is sometimes compared to the process of making a building. This comparison is sometimes made to illustrate how amateurish and unreliable software engineering is in comparison to civil engineering, with the aim of improving the former by learning lessons from the latter.

However, more alert commentators point out that the business of putting up buildings is only reliable and predictable when the buildings are the same as ones that have been done before. Look a little deeper, into building projects that are the first of their kind, and the industry’s reputation for cost overruns and schedule misses starts to look comparable with that of the software industry.

A new piece of software is almost always doing something new, that hasn’t been done in exactly that way before. After all, if it were exactly the same as an existing piece of software, we could just reuse that software—unlike buildings, it’s easy to copy the contents of a hard drive full of code.

So designing large, complex systems from scratch is hard—whether in software engineering or civil engineering. For software, there is only one technique that can tame this difficulty: divide and conquer. The main problem gets broken down into a collection of smaller problems, together with their interactions. These smaller problems get subdivided in turn, and the process repeats until the individual sub-sub-problems are tractable on their own (and some of these may even be problems that have been solved before, whose solutions can be re-used).

Tractable for the humans building the system, that is. This process of subdivision is much more about making the system comprehensible for its designers and builders, than about making the compiler and the microprocessor able to deal with the code. The underlying hardware can cope with any amount of spaghetti code; it’s just the programmers that can’t cope with trying to build a stable, solid system out of spaghetti. Once a system reaches a certain critical mass, there’s no way that the developers can hold all of the system in their heads at once without some of the spaghetti sliding out of their ears1.

Software design is all about this process of dividing a problem into the appropriate smaller chunks, with well-defined, understandable interfaces between the chunks. Sometimes these chunks will be distinct executable files, running on distinct machines and communicating over a network. Sometimes these chunks will be objects that are distinct instances of different classes, communicating via method calls. In every case, the chunks are small enough that a developer can hold all of a chunk in their head at once—or can hold the interface to the chunk in their head so they don’t need to understand the internals of it.

Outside of the six-nines world of servers running in a back room, it’s often important to remember a chunk which significantly affects the design, but which the design has less ability to affect: the User. The design doesn’t describe the internals of the user (that would be biology, not engineering) but it does need to cover the interface to the user.

This chapter discusses the principles behind good software design—where “good” means a design that has the highest chance of working correctly, being low on bugs, being easy to extend in the future, and implemented in the expected timeframe. It also concentrates on the specific challenges that face the designers of highly resilient, scalable software.

Before moving on to the rest of the chapter, a quick note on terminology. Many specific methodologies for software development have precise meanings for terms like “object” and “component”; in keeping with the methodology-agnostic approach of this book, these terms (and others, like “chunk”) are used imprecisely here. At the level of discussion in this chapter, if the specific difference between an object and a component matters, you’re probably up to no good.

1 Apologies if that’s an image you’re now desperately trying to get out of your head.


3.1 Interfaces and Implementations

The most important aspect of the division of a problem into individual components is the separation between the interface and the implementation of each component.

The interface to a component is built up of the operations that the component can perform, together with the information that other components (and their programmers) need to know in order to successfully use those operations. For the outermost layer of the software, the interface may well be the User Interface, where the same principle applies—the interface is built up of the operations that the user can perform with the keyboard and mouse—clicking buttons, moving sliders and typing command-line options.

As ever, the definition of the “interface” to a component or an object can vary considerably. Sometimes the interface is used to just mean the set of public methods of an object; sometimes it is more comprehensive and includes pre-conditions and post-conditions (as in Design By Contract) or performance guarantees (such as for the C++ STL); sometimes it includes aspects that are only relevant to the programmers, not the compiler (such as naming conventions2). Here, we use the term widely to include all of these variants—including more nebulous aspects, such as comments that hint at optimal use of the interface.

The implementation of a component is of course the chunk of software that fulfils the interface. This typically involves both code and data; as Niklaus Wirth has observed, “Algorithms + Data Structures = Programs”.

The interface and the implementation should be as distinct as possible, as discussed later (see Section 3.1.2 [Black Box Principle], page 12). If they aren’t distinct, then the component isn’t really part of a subdivision of the overall problem into smaller chunks: none of the other chunks in the system can just use the component as a building block, because they have to understand something about how the building block is itself built.

3.1.1 Good Interface Design

So, what makes a good interface for a software component?

The interface to a software component is there to let other chunks of software use the component. As such, a good interface is one that makes this easier, and a bad interface is one that makes this harder. This is particularly important in the exceptional case that the interface is to the user rather than another piece of software.

The interface also provides an abstraction of what the internal implementation of the component does. Again, a good interface makes this internal implementation easier rather than harder; however, this goal is often at odds with the previous goal.

There are several principles that help to make an interface optimal for its clients—where the “client” of an interface includes both the other chunks of code that use the interface, and the programmers that write this code.

• Clear Responsibility. Each chunk of code should have a particular job to do, and this job should boil down to a single clear responsibility (see Section 3.2.1 [Component Responsibility], page 13).

• Completeness. It should be possible to perform any operation that falls under the responsibility of this component. Even if no other component in the design is actually going to use this particular operation, it’s still worth sketching out the interface for it. This provides reassurance that it will be possible to extend the implementation to that operation in the future, and also helps to confirm that the component’s particular responsibilities are clear.

2 For example, the Scheme programming language ends all predicates with a question mark (null?) and all functions that have side effects with an exclamation mark (vector-set!).


• Principle of Least Astonishment. The interface to a component needs to be written relative to the expectations of the clients—who should be assumed to know nothing about the internals of the component. This includes expectations about resource ownership (whose job is it to release resources?), error handling (what exceptions are thrown?), and even cosmetic details such as conformance to standard naming conventions and terminology for the local environment.

To ensure that an interface makes the implementation as easy as possible, it needs to be minimal. The interface should provide all the operations that its clients might need (see Completeness above), but no more. The principle of having clear responsibilities (see above) can help to spot areas of an interface that are actually peripheral to the core responsibility of a component—and so should be hived off to a separate chunk of code.

For example, imagine a function that builds a collection of information describing a person (probably a function like Person::Person in C++ terms), including a date of birth. The chances are (sadly) that there are likely to be many different ways of specifying this date—should the function cope with all of them, in different variants, in order to be as helpful as possible for the clients of the function?

In this case, the answer is no. The core responsibility of the function is to build up data about people, not to do date conversion. A much better approach is to pick one date format for the interface, and separate out all of the date conversion code into a separate chunk of code. This separate chunk has one job to do—converting dates—and is very likely to be useful elsewhere in the overall system.
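The separation described above can be sketched as follows. The `Date` type, the `dateFromISO8601` helper and its single supported format are hypothetical illustrations; the point is that `Person` accepts exactly one date representation, and all conversion lives in a separate, reusable chunk.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// The one date format the Person interface accepts.
struct Date {
    int year, month, day;
};

// Core responsibility: building up data about people. No date parsing here.
class Person {
public:
    Person(const std::string& name, const Date& dateOfBirth)
        : name_(name), dateOfBirth_(dateOfBirth) {}
    const std::string& name() const { return name_; }
    const Date& dateOfBirth() const { return dateOfBirth_; }

private:
    std::string name_;
    Date dateOfBirth_;
};

// Separate chunk whose single responsibility is date conversion; more
// parsers can be added here without touching Person at all.
// Expects "YYYY-MM-DD"; no validation, for brevity.
Date dateFromISO8601(const std::string& text) {
    Date date;
    date.year = std::atoi(text.substr(0, 4).c_str());
    date.month = std::atoi(text.substr(5, 2).c_str());
    date.day = std::atoi(text.substr(8, 2).c_str());
    return date;
}
```

A caller with a date in some other format converts first, then constructs: `Person p("Ada", dateFromISO8601("1815-12-10"));`.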

3.1.2 Black Box Principle

The client of a component shouldn’t have to care how that component is implemented. In fact, it shouldn’t even know how the component is implemented—that way, there’s no temptation to rely on internal implementation details.

This is the black box principle. Clients of a component should treat it as if it were a black box with a bunch of buttons and controls on the outside; the only way to get it to do anything is by frobbing these externally visible knobs. The component itself needs to make enough dials and gauges visible so that its clients can use it effectively, but no more than that.

This principle gives us a good way of testing whether we have come up with a good interface or not: is it possible to come up with a completely different implementation that still satisfies the interface? If so, then there’s a good chance that the component’s interface will be robust to changes in future—after all, if the entire internal gubbins of the component could be replaced without anyone noticing, the chances are that any future enhancement will also have no detrimental effect.
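That test can be made literal in code. The following queue interface and its two implementations are hypothetical examples: because clients only see the abstract interface, the entire internal gubbins (a vector or a deque) can be swapped without any client noticing.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// The black box: clients program against this interface only.
class Queue {
public:
    virtual ~Queue() {}
    virtual void push(int value) = 0;
    virtual int pop() = 0;            // precondition: queue is not empty
    virtual bool empty() const = 0;
};

// One implementation of the interface...
class VectorQueue : public Queue {
public:
    void push(int value) { items_.push_back(value); }
    int pop() {
        int v = items_.front();
        items_.erase(items_.begin());
        return v;
    }
    bool empty() const { return items_.empty(); }

private:
    std::vector<int> items_;
};

// ...and a completely different one that satisfies the same interface.
class DequeQueue : public Queue {
public:
    void push(int value) { items_.push_back(value); }
    int pop() {
        int v = items_.front();
        items_.pop_front();
        return v;
    }
    bool empty() const { return items_.empty(); }

private:
    std::deque<int> items_;
};
```

Any client code that works with one implementation must work, unchanged, with the other; if it can't, the client has leaked a dependency on the internals.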

3.1.3 Physical Architecture

The physical architecture of a software system covers all of the things that the code needs to interface with in order to go from source code to a running system—machines, power supplies, hard disks, networks, compilers, linkers, peripherals.

This is often an area that’s taken for granted; for a desktop application, the physical architecture is just likely to involve decisions about which directories to install code into, how to divide the code up into different shared libraries, and which operating systems to support.

For six-nines software, however, these kinds of physical factors are typically much more important in the design of such systems:

• They often combine hardware and software together as a coherent system, which can involve much lower-level considerations about what code runs where, and when.

• They often have scalability requirements that necessitate running on multiple machines in parallel—which immediately induces problems with keeping data synchronized properly.


• They are often large-scale software systems where the sheer weight of code causes problems for the toolchains. A build cycle that takes a week to run can seriously impact development productivity3.

• The reliability requirements often mean that the code has to transparently cope with software and hardware failures, and even with modifications of the software while it’s running.

For these environments, each software component’s interface to the non-software parts of the system becomes important enough to require detailed design consideration—and the interfaces between software components may have hardware issues to consider too.

To achieve true six-nines reliability, every component of the system has to reach that reliability level. This involves a lot of requirements on the whole hardware platform—the power supply, the air conditioning, the rack chassis, the processor microcode, the operating system, the network cards—which are beyond the scope of this book.

3.2 Designing for the Future

Successful software has a long lifetime; if version 1.0 works well and sells, then there will be a version 2.0 and a version 8.1 and so on. As such, it makes sense to plan for success by ensuring that the software is designed with future enhancements in mind.

This section covers the key principles involved in “designing in the future tense”: putting in effort today so that in years to come all manner of enhancements Should Just Work.

3.2.1 Component Responsibility

The most important factor that eases future modifications to software is clear responsibilities for each of the components of the system (see Section 3.1.1 [Good Interface Design], page 11). What data does it hold; what processing does it perform; most importantly, what concept does it correspond to?

When a new aspect of functionality is needed, the designer can look at the responsibilities of the existing components. If the new function fits under a particular component’s aegis, then that’s where the new code will be implemented. If no component seems relevant, then a new component may well be needed.

For example, the Model-View-Controller architectural design pattern is a very well-known example of this (albeit from outside the world of six-nines development). Roughly:

• The Model is responsible for holding and manipulating data.

• The View is responsible for displaying data to the user.

• The Controller is responsible for converting user input to changes in the underlying data held by the Model, or to changes in how it is displayed by the View.

For any change to the functionality, this division of responsibilities usually makes it very clear where the change should go.
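The division of responsibilities above can be sketched minimally for a single counter value; the class names follow the pattern, but the counter application itself is a hypothetical example.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Model: holds and manipulates the data.
class Model {
public:
    Model() : value_(0) {}
    int value() const { return value_; }
    void setValue(int v) { value_ = v; }

private:
    int value_;
};

// View: displays the data to the user (here, as a string).
class View {
public:
    std::string render(const Model& model) const {
        std::ostringstream out;
        out << "count = " << model.value();
        return out.str();
    }
};

// Controller: converts user input into changes to the Model.
class Controller {
public:
    explicit Controller(Model& model) : model_(model) {}
    void onIncrementClicked() { model_.setValue(model_.value() + 1); }

private:
    Model& model_;
};
```

A change to how the count is displayed goes in View; a new way to change the count goes in Controller; a change to what is stored goes in Model. No ambiguity.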

3.2.2 Minimizing Special Cases

Lou Montulli: “I laughed heartily as I got questions from one of my former employees about FTP code that he was rewriting. It had taken 3 years of tuning to get code that could read the 60 different types of FTP servers, those 5000 lines of code may have looked ugly, but at least they worked.”

A special case in code is an interruption to the clear and logical progression of the code. A prototypical toy example of this might be a function that returns the number of days in a month: what does it return for February? All of a sudden, a straightforward array-lookup from month to length won’t work; the code needs to take the year as an input and include an if (month == February) arm that deals with leap years.

3 John Lakos’ “Large Scale C++ Software Design” is possibly the only book that addresses this issue in depth; if your system has this sort of problem (even if it’s not a C++ system), it’s well worth reading.

An obvious way to spot special cases is by the presence of the words “except” or “unless” in a description of some code: this function does X except when Y.

Special cases are an important part of software development. Each special case in the code reflects something important that has been discovered during the software development process. This could be a particular workaround for a bug in another product, an odd wrinkle in the specification, or an ineluctable feature of the real world (like the leap year example above).

However, special cases are distressing for the design purist. Each special case muddies the responsibilities of components, makes the interface less clear, makes the code less efficient and increases the chances of bugs creeping in between the edge cases.

So, accepting that special cases are unavoidable and important, how can software be designed to minimize their impact?

An important observation is that special cases tend to accumulate as software passes through multiple versions. Version 1.0 might have only had three special cases for interoperating with old web server software, but by version 3.4 there are dozens of similar hacks.

This indicates the most effective mechanism for dealing with special cases: generalize them and encapsulate the generalized version in a single place. For the leap year example, this would be the difference between if (month == February) and if (IsMonthOfVaryingLength(month)) (although this isn’t the best example to use, since there are unlikely to be any changes in month lengths any time soon).
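The two styles contrasted above can be put side by side in code. The following is one possible rendering of the leap year example; the function and enum names are illustrative, with the special case encapsulated behind a single named predicate rather than scattered as literal month tests.

```cpp
#include <cassert>

enum Month { January = 1, February, March, April, May, June,
             July, August, September, October, November, December };

// Gregorian leap year rule.
bool isLeapYear(int year) {
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

// The single place that knows which months vary in length; callers
// write if (isMonthOfVaryingLength(month)), not if (month == February).
bool isMonthOfVaryingLength(Month month) {
    return month == February;
}

int daysInMonth(Month month, int year) {
    static const int lengths[] = { 31, 28, 31, 30, 31, 30,
                                   31, 31, 30, 31, 30, 31 };
    int days = lengths[month - 1];
    if (isMonthOfVaryingLength(month) && isLeapYear(year))
        ++days;
    return days;
}
```

If the set of special months were ever to change, only the predicate would need updating; every caller stays untouched.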

3.2.3 Scalability

Successful software has a long lifetime (see Section 3.2 [Designing for the Future], page 13), as new functionality is added and higher version numbers get shipped out of the door. However, successful software is also popular software, and often the reason for later releases is to cope with the consequences of that popularity.

This is particularly relevant for server-side applications—code that runs on backend web servers, clusters or even mainframes. The first version may be a single-threaded single process running on such a server. As the software becomes more popular, this approach isn’t able to keep up with the demand.

So the next versions of the software may need to become multithreaded (but see Section 4.3 [Multithreading], page 30), or have multiple instances of the program running, to take advantage of a multiprocessor system. This requires synchronization, to ensure that shared data never gets corrupted. In one process or on one machine, this can use the synchronization mechanisms provided by the environment: semaphores and mutexes, file locks and shared memory.
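A minimal sketch of the in-process case, using the standard C++11 threading primitives (a hypothetical counter, not any particular server's code):

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Shared data protected by a mutex, so that concurrent increments from
// multiple threads cannot interleave and lose updates.
class SharedCounter {
public:
    SharedCounter() : count_(0) {}
    void increment() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++count_;
    }
    long value() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return count_;
    }

private:
    mutable std::mutex mutex_;
    long count_;
};

// Run several threads against one counter; without the lock, the final
// total could be anything up to (threads * perThread), but not reliably it.
long runThreads(int threads, int perThread) {
    SharedCounter counter;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.push_back(std::thread([&counter, perThread] {
            for (int i = 0; i < perThread; ++i) counter.increment();
        }));
    }
    for (std::size_t i = 0; i < workers.size(); ++i) workers[i].join();
    return counter.value();
}
```

The same logical structure (acquire, mutate, release) carries over to file locks and shared memory between processes; what disappears in the distributed case is precisely this kind of cheap, environment-provided primitive.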

As the system continues to be successful, even this isn’t good enough—the system needs to be distributed, with the code running on many distinct machines, carefully sharing data between the machines in a synchronized way. This is harder to achieve, because there are far fewer synchronization primitives to rely on.

Let’s suppose the software is more successful still, and customers are now relying on the system 24 hours a day, 7 days a week—so we’re now firmly in the category of four-nines, five-nines or even six-nines software. What happens if a fuse blows in the middle of a transaction, and one of the machines comes down? The system needs to be fault tolerant, failing the transaction smoothly over to another machine.

Worse still, how do you upgrade the system in a 24/7 environment? The running code needs to be swapped for the new code, in a seamless way that also doesn’t corrupt any data or interrupt any running transactions: a dynamic software upgrade. Of course, nothing is perfect—chances are that sometimes the dynamic software upgrade system will also need to cope with a dynamic software downgrade to roll the new version back to the drawing board.

A few of these evolutionary steps can apply to single-user desktop applications too. Once lots of versions of a product have been released, it’s all too easy to have odd interactions between incompatible versions of code installed at the same time (so-called “DLL Hell”). Similarly, power users can stretch the performance of an application and wonder why their dual-CPU desktop isn’t giving them any benefits4.

Having a good design in the first place can make the whole scary scalability evolution described above go much more smoothly than it would otherwise do—and is essential for six-nines systems where these scalability requirements are needed even in the 1.0 version of the product.

Because this kind of scaling-up is so important in six-nines systems, a whole section later in this chapter (see Section 3.3 [Scaling Up], page 17) is devoted to the kinds of techniques and design principles that help to achieve this. For the most part, these are simple slants to a design, which involve minimal adjustments in the early versions of a piece of software but which can reap huge rewards if and when the system begins to scale up massively.

As with software specifications (see Section 2.1 [Waterfall versus Agile], page 7), it’s worth contrasting this approach with the current recommendations in other software development sectors. The Extreme Programming world has a common precept that contradicts this section: You Aren’t Going to Need It (YAGNI). The recommendation here is not to spend much time ensuring that your code will cope with theoretical future enhancements: code as well as possible for today’s problems, and sort out tomorrow’s problems when they arrive—since they will probably be different from what you expected when they do arrive.

For six-nines systems, the difference between the two approaches is again driven by the firmness of the specification. Such projects are generally much more predictable overall, and this includes a much better chance of correctly predicting the ways that the software is likely to evolve. Moreover, for large and complex systems the kinds of change described in this section are extraordinarily difficult to retro-fit to software that has not been designed to allow for them—for example, physically dividing up software between different machines means that the interfaces have to change from being synchronous to asynchronous (see Section 3.3.2 [Asynchronicity], page 17).

So (as ever) it’s a balance of factors: for high-resilience systems, the probability of correctly predicting the future enhancements is higher, and the cost of not planning for the enhancements is much higher, so it makes sense to plan ahead.

3.2.4 Diagnostics

Previous sections have described design influences arising from planning for success; it’s alsoimportant to plan for failure.

All software systems have bugs. It’s well known that it’s much more expensive to fix bugs when they’re out in the field, and so a lot of software engineering practices are aimed at catching bugs as early as possible in the development process. Nevertheless, it is inevitable that bugs will slip out, and it’s important to have a plan in place for how to deal with them.

In addition to bugs in the software, there are any number of other factors that can stop software from working correctly out in the field. The user upgrades their underlying operating system to an incompatible version; the network cable gets pulled out; the hard disk fills up, and so on.

Dealing with these kinds of circumstances is much easier if some thought has been put into the issue as part of the design process. It’s straightforward to build in a logging and tracing facility as part of the original software development; retrofitting it after the fact is much more difficult. This diagnostic functionality can be useful even before the software hits the field—the testing phase (see Chapter 6 [Test], page 46) can also take advantage of it.

4 In 2000, the member of staff with the most highly powered computer at the software development house I worked at was the technical writer—Microsoft Word on a 1500 page document needs a lot of horsepower.

Depending on the environment that the software is expected to run in, there are a number of diagnostic areas to be considered as part of the design.

• Installation verification: checking that all the dependencies for the software are installed, and that all versions of the software itself match up correctly.

• Tracing and logging: providing a system whereby diagnostic information is produced, and can be tuned to generate varying amounts of information depending on what appears to be the problem area. This can be particularly useful for debugging crashed code, since core files/crash dumps are not always available.

• Manageability: providing a mechanism for administrators to reset the system, or to prune particularly problematic subsets of the software’s data, aids swift recovery from problems as they happen.

• Data verification: internal checking code that confirms whether chunks of data remain internally consistent and appear to be uncorrupted.

• Batch data processing: providing a mechanism to reload large amounts of data can ease recovery after a disastrous failure (you did remember to include a way to back up the data from a running system, didn’t you?).
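The tracing-and-logging point might be sketched as follows (illustrative Python using the standard logging module; the subsystem names are hypothetical). Giving each subsystem its own logger lets an engineer turn up verbosity only for the suspected problem area:

```python
import logging

logging.basicConfig(format="%(name)s %(levelname)s: %(message)s")

# One logger per subsystem, so tracing can be tuned independently.
db_log = logging.getLogger("myapp.database")
net_log = logging.getLogger("myapp.network")

def set_trace_level(subsystem, level):
    """Tune the amount of diagnostic output for one problem area."""
    logging.getLogger(subsystem).setLevel(level)

# Field diagnosis: the network looks suspect, so trace it in detail
# while keeping the database subsystem quiet.
set_trace_level("myapp.network", logging.DEBUG)
set_trace_level("myapp.database", logging.WARNING)

net_log.debug("retrying connection to peer")   # emitted
db_log.debug("cache hit for row 42")           # suppressed
```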

3.2.5 Avoiding the Cutting Edge

Programmers are always excited by new toys. Sadly, this extends to software design too: developers often insist on the very latest and shiniest tools, technologies and methodologies.

However, the risks and costs associated with the cutting edge are seldom calculated correctly. New tools are more likely to be incomplete and bug-ridden; their support for varying operating systems and programming languages can be patchy (which leads to extensibility problems in future). For high quality software, the overall reliability is only as good as the weakest link in the whole system, and using the latest and greatest technology hugely increases the chance that the weakest link will be in code that you have no control over.

In the longer term, working with the very latest in technology is also a wager. Many technologies are tried; few survive past their infancy, which causes problems if your software system relies on something that has had support withdrawn.

It’s not just the toolchain that may have problems supporting the latest technology: similar concerns apply to the developers themselves. Even if the original development team fully understand the new technology, the maintainers of version 1.0 and the developers of version 2.0 might not—and outside experts who can be hired in will be scarce (read: expensive).

Even if the technology in question is proven technology, if the team doing the implementation isn’t familiar with it, then some of the risks of cutting edge tools/techniques apply. In this situation, this isn’t so much a reason to avoid the tools as a reason to ensure that the project plan (see Chapter 8 [Planning a Project], page 56) allows sufficient time and training for the team to get to grips with the technology.

So when is new technology appropriate? Obviously, prototypes and small-scale systems allow a lot more opportunity for experimentation with lower risk. The desire of the programming team to expand their skill set and learn something new and different is also a significant factor. Allowing experimentation means that the team will be happier (see Section 9.4 [Personnel Development], page 74) and the next time round, the technology will no longer be untried: the cutting edge will have been blunted.


3.3 Scaling Up

An earlier section discussed the importance of planning ahead for scalability. This section describes the principles behind this kind of design for scalability—principles which are essential to achieve truly resilient software.

3.3.1 Good Design

The most obvious aspect of designing for scalability is to ensure that the core design is sound. Nothing highlights potential weak points in a design better than trying to extend it to a fault-tolerant, distributed system.

The first part of this is clarity of interfaces—since it is likely that these interfaces will span distinct machines, distinct processes or distinct threads as the system scales up.

Another key aspect that helps to ease the transition to distributed or resilient systems is to carefully encapsulate access to data, so that a single set of components has clear responsibility for this. If every chunk of code that reads or writes persistent data does so through a particular well-defined interface (as for the Model part of a Model/View/Controller system), then it’s much easier to convert the underlying data access component from being flat-file based to being database driven, then to being remote-distributed-database driven—without changing any of the other components.

Likewise, if all of the data that describes a particular transaction in progress is held together as a coherent chunk, and all access to it is through known, narrow interfaces, then it is easier to monitor that state from a different machine—which can then take over smoothly in the event of failure.

3.3.2 Asynchronicity

A common factor in most of the scalability issues discussed in this chapter is the use of asynchronous interfaces rather than synchronous ones. Asynchronous interfaces are harder to deal with, and they have their own particular pitfalls that the design of the system needs to cope with.

A synchronous interface is one where the results of the interface come back immediately (or apparently immediately). An interface made up of function calls or object method calls is almost always synchronous—the user of the interface calls the function with a number of parameters, and when the function call returns the operation is done.

Chapter 3: Design 18

Figure 3.1: Synchronous interaction (as a UML sequence diagram)

An asynchronous interface involves a gap between the invocation and the results. The information that the operation on the interface needs is packaged up and delivered, and some time later the operation takes place and any results are delivered back to the code that uses the interface. This is usually done by encapsulating the interface with some kind of message-passing mechanism.

This immediately makes things much more complicated.

• There has to be some method for the delivery of the results. This could be the arrival of a message, or an invocation of a registered callback, but in either case the scheduling of this mechanism has to be set up.

• The results are delivered back to a different part of the code rather than just the next line, destroying locality of reference.

• The code that invokes the asynchronous operation needs to hold on to all of the state information associated with the operation while it’s waiting for the results.

• When the results of an operation arrive, they have to be correlated with the information that triggered that operation. What happens if the same operation occurs multiple times in parallel with different data?

• The code has to cope with a variety of timing windows, such as when the answers to different operations return in a different order, or when state needs to be cleaned up because the provider of the interface goes away while operations are pending.
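The correlation problem above can be sketched as follows (illustrative Python; all class and message names are assumptions, and a simple queue stands in for the message-passing mechanism). Each invocation is tagged with a correlation ID so that replies arriving later, possibly out of order, can be matched back to the state saved when the operation was invoked:

```python
import itertools
import queue

class AsyncClient:
    """Wraps an asynchronous interface behind correlation IDs."""
    def __init__(self, transport):
        self._transport = transport          # message-passing mechanism
        self._next_id = itertools.count(1)
        self._pending = {}                   # correlation ID -> saved callback/state

    def invoke(self, request, on_result):
        corr_id = next(self._next_id)
        self._pending[corr_id] = on_result   # hold state while waiting
        self._transport.put((corr_id, request))
        return corr_id

    def deliver(self, corr_id, result):
        # Replies may arrive in any order; the ID ties each one back
        # to the operation that triggered it.
        on_result = self._pending.pop(corr_id)
        on_result(result)

transport = queue.Queue()
client = AsyncClient(transport)
results = []
client.invoke("read A", lambda r: results.append(("A", r)))
client.invoke("read B", lambda r: results.append(("B", r)))
client.deliver(2, "data-B")   # answers return in a different order...
client.deliver(1, "data-A")   # ...but are still correlated correctly
```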

Chapter 3: Design 19

Asynchronous interfaces are ubiquitous in highly resilient systems. If data is distributed among different processors and machines, then that data needs to make its way to the appropriate machine—which is inherently asynchronous. Likewise a backup system, waiting in the wings to take over should the primary machine fail, can only be kept up to date via an asynchronous mechanism.

Asynchronicity turns up elsewhere too. Graphical UI systems are usually event driven, and so some operations become asynchronous—for example, when a window needs updating the code calls an “invalidate screen area” method which triggers a “redraw screen area” message some time later. Some very low-level operations can also be asynchronous for performance reasons—for example, if the system can do I/O in parallel, it can be worth using asynchronous I/O operations (the calling code triggers a read from disk into an area of memory, but the read itself is done by a separate chunk of silicon, which notifies the main processor when the operation is complete, some time later). Similarly, if device driver code is running in an interrupt context, time-consuming processing has to be asynchronously deferred to other code so that other interrupts can be serviced quickly.

Designs involving message-passing asynchronicity usually include diagrams that show these message flows, together with variants that illustrate potential problems and timing windows (the slightly more formalized version of this is the UML sequence diagram, see Section 3.4.2 [Diagrams], page 26).

Chapter 3: Design 20

Figure 3.2: Asynchronous interaction

3.3.3 Fault Tolerance

The first step along the road to software with six-nines reliability is to cope with failure by redundancy. This allows for fault tolerance—when software or hardware faults occur, there are backup systems available to take up the load. A software fault is likely to be a straightforward bug, although it might be triggered by corrupted data; a hardware fault might be the failure of a particular card in a chassis, or an entire machine in a cluster.

To deal with the consequences of a fault, it must first be detected. For software faults, this might be as simple as continuously monitoring the current set of processes reported by the operating system; for hardware faults, the detection of faults might be either a feature of the hardware, or a result of continuous “aliveness” polling. This is heavily dependent on the physical architecture (see Section 3.1.3 [Physical Architecture], page 12) of the software—knowing exactly where the code is supposed to run, both as processes and as processors.
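“Aliveness” polling might be sketched like this (illustrative Python; the timeout value and worker names are assumptions). The monitor records a timestamp per heartbeat and declares a fault for any process that has been silent for longer than the allowed interval:

```python
import time

class AlivenessMonitor:
    """Tracks heartbeats and reports processes that have gone silent."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, name, now=None):
        # 'now' is injectable for testing; real use takes the clock.
        self.last_seen[name] = now if now is not None else time.monotonic()

    def failed(self, now=None):
        now = now if now is not None else time.monotonic()
        return [name for name, seen in self.last_seen.items()
                if now - seen > self.timeout]

monitor = AlivenessMonitor(timeout=5.0)
monitor.heartbeat("worker-1", now=100.0)
monitor.heartbeat("worker-2", now=103.0)
faults = monitor.failed(now=107.0)   # worker-1 silent for 7s > 5s limit
```

Keeping the monitor this simple is exactly the point made below: the simpler the detection system, the less likely it is to be the thing that fails.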

Of course, the system that is used for detection of faults is itself susceptible to faults—what happens if the monitoring system goes down? To prevent an infinite regress, it’s usually enough to make sure the monitoring system is as simple as possible, and as heavily tested as possible.

Chapter 3: Design 21

The overall strategy for dealing with faults depends on the prevalence of different kinds of faults. Is a hardware fault or a software fault more likely? For a hardware fault, failing over to another instance of the same hardware will normally work fine. For a software fault, there’s a reasonable probability that when the backup version of the code is presented with the same inputs as the originally running primary version of the code, it will produce the same output—namely, another software fault.

In the latter case, it’s entirely possible to reach a situation where a particular corrupted set of input data causes the code to bounce backwards and forwards between primary and backup instances ad infinitum5. If this is a possible or probable situation, then the fault tolerance system needs to detect this kind of situation and cope with it—ideally, by deducing what the problematic input is and removing it; more realistically, by raising an alarm for external support to look into.

Once a fault has been detected, the fault tolerance system needs to deal with the fault, by activating a backup system so that it becomes the primary instance. This might involve starting a new copy of the code, or promoting an already-running backup copy of the code.

The just-failed instance of the code will have state information associated with it; the newly-promoted instance of the code needs to access this state information in order to take up where the failed version left off. As such, the fault tolerance system needs a mechanism for transferring this state information.

The simplest approach for this transfer is to record the state in some external repository—perhaps a hard disk or a database. The newly-promoted code can then read back the state from this repository, and get to work. This simple approach also has the advantage that a single backup system can act as the backup for multiple primary systems—on the assumption that only a single primary is likely to fail at a time (this setup is known as 1:N or N+1 redundancy).
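A sketch of this simplest approach, with a JSON file standing in for the external repository (all class and transaction names are illustrative):

```python
import json
import os
import tempfile

class Primary:
    """Records its state in an external repository after each change."""
    def __init__(self, repo_path):
        self.repo_path = repo_path
        self.state = {"transactions": []}

    def process(self, txn):
        self.state["transactions"].append(txn)
        with open(self.repo_path, "w") as f:
            json.dump(self.state, f)       # state survives our failure

class Backup:
    """On promotion, reads back the saved state and takes up the load."""
    def promote(self, repo_path):
        with open(repo_path) as f:
            self.state = json.load(f)
        return self.state

repo = os.path.join(tempfile.mkdtemp(), "state.json")
primary = Primary(repo)
primary.process("txn-1")
primary.process("txn-2")
# Primary fails here; one backup of this kind can serve many primaries
# (N+1 redundancy), since it only reads state at promotion time.
recovered = Backup().promote(repo)
```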

There are two issues with this simple approach. The first is that the problem of a single point of failure has just been pushed to a different place—what happens if the external data repository gets a hardware fault? In practice, this is less of an issue because hardware redundancy for data stores is easily obtained (RAID disk arrays, distributed databases and so on) and because the software for performing data access is simple enough to be very reliable.

The second, more serious, issue is that the process of loading up the saved state takes time—and the time involved scales up with the amount of data involved, linearly at best. For a six-nines system, this delay as the backup instance comes up to speed is likely to endanger the all-important downtime statistics for the system.

To cope with this issue, the fault tolerance system needs a more continuous and dynamic system for transferring state to the backup instance. In a system like this, the backup instance runs all the time, and state information is continuously synchronized across from the primary to the backup; promotion from backup to primary then happens with just a flick of a switch. This approach obviously involves doubling the number of running instances, with a higher load in both processing and occupancy, but this is part of the cost of achieving high resilience (this setup is known as 1:1 or 1+1 redundancy).

Continuous, incremental updates for state information aren’t enough, however. The whole fault tolerance system is necessary because the primary instance can fail at any time; it’s also possible for the backup instance to fail. To cope with this, the state synchronization mechanism also needs a batch mode: when a new backup instance is brought on-line, it needs to acquire all of the current state from the primary instance to come up to speed.
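The combination of incremental synchronization and a batch mode might be sketched like this (illustrative Python; the call-state data is hypothetical):

```python
class Backup:
    def __init__(self):
        self.state = {}

    def apply_update(self, key, value):
        # Continuous, incremental mode: one change at a time.
        self.state[key] = value

    def load_snapshot(self, snapshot):
        # Batch mode: a fresh backup acquires all current state at once.
        self.state = dict(snapshot)

class Primary:
    def __init__(self, backup):
        self.state = {}
        self.backup = backup

    def update(self, key, value):
        self.state[key] = value
        self.backup.apply_update(key, value)   # synchronize as we go

    def resync(self, new_backup):
        new_backup.load_snapshot(self.state)   # batch transfer
        self.backup = new_backup

backup = Backup()
primary = Primary(backup)
primary.update("call-1", "ringing")
primary.update("call-1", "connected")
# The backup fails; bring a new one on-line and batch-sync it.
fresh = Backup()
primary.resync(fresh)
```

After the resync, the fresh backup holds the same state as the original, so promotion remains a flick of a switch.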

5 At a place I used to work, this occurred on a system but the fault tolerance mechanism worked so seamlessly that the customer’s only complaint was that the system was running a bit slowly.

Chapter 3: Design 22

Figure 3.3: Fault tolerance

The ingredients of a fault tolerance system described so far are reasonably generic. The hardware detection systems and state synchronization mechanisms can all be re-used for different pieces of software. However, each particular piece of software needs to include code that is specific to that particular code’s purpose—the fault tolerance system can provide a transport for state synchronization, but the designers of each particular software product need to decide exactly what state needs to be synchronized, and when.

This set of changes to the core code to decide what to replicate (and when) isn’t just a one-off, up-front cost, either. Any future enhancements and modifications to the code also need to be generated with this in mind, and tested with this in practice. The net effect is an on-going tax on development—code changes to a fault tolerant codebase are (say) 15% bigger than the equivalent changes to a non-fault tolerant codebase. For a six-nines development, this is a tax that’s worth paying, but it’s important to factor it into the project planning.

3.3.4 Distribution

A distributed processing system divides up the processing for a software system across multiple physical locations, so that the same code is running on a number of different processors. This allows the performance of the system to scale up; if the code runs properly on ten machines in parallel, then it can easily be scaled up to run on twenty machines in parallel as the traffic levels rise.

The first step on the way to distributed processing is to run multiple worker threads in parallel in the same process. This approach is very common and although it isn’t technically a distributed system, many of the same design considerations apply.

Multiple threads allow a single process to take better advantage of a multiprocessor machine, and may improve responsiveness (if each individual chunk of work ties up a thread for a long period of time). Each thread is executing the same code, but there also needs to be some controller code that distributes the work among the various worker threads. If the individual worker threads rely on state information that is needed by other threads, then access to that data has to be correctly synchronized (see Section 4.3 [Multithreading], page 30).

The next step towards distribution is to run multiple processes in parallel. Things are a little harder here, in that the memory spaces of the different processes are distinct, and so any information that needs to be shared between the processes has to be communicated between them. This could be done by putting the information into shared memory, which is effectively the same as the threading case (except that the locking primitives are different). A more flexible approach is to use a message-passing mechanism of some sort, both for triggering the individual processes to do work, and to synchronize state information among the processes.

Chapter 3: Design 23

From a message-passing, multiple process model it’s a very short step to a true distributed system. Instead of communicating between processes on the same machine, the message passing mechanism now has to communicate between different machines; instead of detecting when a worker process has terminated, the distribution mechanism now needs to detect when worker processes or processors have disappeared.

In all of these approaches, the work for the system has to be divided up among the various instances of the code. For stateless chunks of processing, this can be done in a straightforward round-robin fashion, or with an algorithm that tries to achieve more balanced processing. For processing that changes state information, it’s often important to make sure that later processing for a particular transaction is performed by the same instance of the code that dealt with earlier processing.

This is easiest to do when there is a unique way of identifying a particular ongoing transaction; in this case, the identifier can be used as the input to a hashing algorithm that distributes the work in a reproducible way. However, this isn’t always possible, either because there is no reliable identifier, or because the individual distributed instances of the code may come and go over time. In this case, the controller code needs more processing to keep track of the distribution of work dynamically.
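Reproducible, hash-based distribution might be sketched as follows (illustrative Python; the transaction identifier format is hypothetical). Because the mapping depends only on the identifier, every message for a given transaction lands on the same worker instance:

```python
import hashlib

def pick_worker(transaction_id, num_workers):
    """Map a stable transaction identifier to a worker, reproducibly."""
    digest = hashlib.sha256(transaction_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_workers

workers = 4
first = pick_worker("call-19937", workers)
later = pick_worker("call-19937", workers)
# first == later: later processing for this transaction goes to the
# same instance that handled the earlier processing.
```

Note the limitation the text goes on to describe: if num_workers changes because instances come and go, the mapping shifts, which is one reason a dynamic controller (or a consistent-hashing scheme) may be needed instead.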

Figure 3.4: Distributed system

As for fault tolerance, note that there is a danger of pushing the original problem back one step. For fault tolerance, where the aim is to cope with single points of failure, the system that implements failover between potential points of failure may itself become a single point of failure. For distribution, where the aim is to avoid performance bottlenecks, the system that implements the distribution may itself become a bottleneck. In practice, however, the processing performed by the distribution controller is usually much less onerous than that performed by the individual instances. If this is not the case, the system may need to become hierarchical.

Chapter 3: Design 24

Figure 3.5: Hierarchical distributed system

3.3.5 Dynamic Software Upgrade

All successful software is eventually upgraded and replaced with a newer version—bugs are fixed and new functionality is included. However, there’s no get-out clause for this process in a six-nines system: any downtime while the system is upgraded still counts against that maximum of thirty seconds for the year. This leads to a requirement for dynamic software upgrade—changing the version of the code while it’s running, without inducing significant downtime.

Obviously, performing dynamic software upgrade requires a mechanism for reliably installing and running the new version of the code on the system. This needs some way of identifying the different versions of the code that are available, and connecting to the right version. This also depends on the physical architecture (see Section 3.1.3 [Physical Architecture], page 12) of the code: how the code is split up into distinct executables, shared libraries and configuration files affects the granularity of the upgrade.

A simple example of this upgrade mechanism is the one used for shared libraries on UNIX systems—there might be several versions of the raw code, stored in versioned files (libc.so.2.2, libc.so.2.3, libc.so.3.0, libc.so.3.1), with symbolic links which indicate the currently active versions (libc.so.2->libc.so.2.2, libc.so.3->libc.so.3.1, libc.so->libc.so.3) and which can be instantly swapped between versions. This mechanism needs to go both ways—sadly, it’s common to have to downgrade from a new version to go back to the drawing board.

However, the difficult part of dynamic software upgrade is not the substitution of the new code for the old code. What is much more difficult is ensuring that the active state for the running old system gets transferred across to the new version.

To a first approximation, this is the same problem as for fault tolerance (see Section 3.3.3 [Fault Tolerance], page 20), and a normal way to implement dynamic software upgrade is to leverage an existing fault tolerant system:

Chapter 3: Design 25

• With a primary instance and backup instance both running version 1.0, bring down the backup instance.

• Upgrade the backup instance to version 1.1.

• Bring the backup instance back online, and wait for state synchronization to complete.

• Explicitly force a failover from the version 1.0 primary instance to the version 1.1 backup instance.

• If the new version 1.1 primary instance is working correctly, upgrade the now-backup version 1.0 instance to also run version 1.1 of the code.

• Bring the second instance of version 1.1 online as the new backup.

As with fault tolerance, the mechanisms for dynamic software upgrade can be shared among different software products, but the internal semantics of the state synchronization process have to be designed as part of each specific software product. This is a more difficult problem than for fault tolerance, though: for each set of state information, there may need to be a translation process between the old and new versions of the information (in both directions, to allow for downgrading as well as upgrading). Ideally, the translation process should also be “stackable”—if a system needs to be upgraded from version 2.3 to version 2.7, downtime is kept to a minimum if this can be done in one step rather than via versions 2.4, 2.5 and 2.6.
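Stackable state translation might be sketched like this (all version numbers and field names here are hypothetical). Each step translates state across one version boundary, new fields receive defaults, and composing the steps takes state from 2.3 to 2.7 in a single pass:

```python
# Each entry maps a version to (next version, translation step).
# New fields get defaults so state from older versions stays meaningful.
UPGRADES = {
    "2.3": ("2.4", lambda s: {**s, "retry_limit": s.get("retry_limit", 3)}),
    "2.4": ("2.5", lambda s: {**s, "region": s.get("region", "default")}),
    "2.5": ("2.6", lambda s: s),                       # no state change
    "2.6": ("2.7", lambda s: {**s, "tls": s.get("tls", False)}),
}

def migrate(state, from_version, to_version):
    """Stack the per-version translations into one migration pass."""
    version = from_version
    while version != to_version:
        version, step = UPGRADES[version]
        state = step(state)
    return state

old_state = {"user": "alice"}
new_state = migrate(old_state, "2.3", "2.7")
```

A matching DOWNGRADES table would be needed for the reverse direction; as the text notes, deciding what to do with state that relies on a new feature when it is downgraded is a design decision in its own right.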

Figure 3.6: Dynamic software upgrade

In practice, this means that any new state information that is added to later versions of the code needs to have a default value and behaviour, to cope with state that’s been dynamically updated from an earlier version of the code. What’s more, the designers of the new feature need to make and implement decisions about what should happen if state that relies on the new feature gets downgraded. In worst-case scenarios, this migration of state may be so difficult that it’s not sensible to implement it—the consequences of the resulting downtime may be less than the costs of writing huge amounts of translation code.

Again (as with fault tolerance), the inclusion of dynamic software upgrade functionality imposes a serious development tax on all future changes to the product’s codebase. Each bug fix and feature enhancement has to be carefully examined and tested in comparison with a matrix of earlier versions of the product, to ensure that state migration can be performed.

3.4 Communicating the Design

3.4.1 Why and Who

It’s not enough to have a design. That design needs to be communicated to all of the peoplewho are involved in the project, both present and future.

The mere fact of attempting to communicate the design to someone else is a great smoketest for spotting holes in the design. If you can’t successfully explain it to a bright colleague,the chances are that you don’t fully understand it yourself.

For a larger system, the design process may well occur in layers. The overall architecturegets refined into a high-level design, whose components are each in turn subject to a low-leveldesign, and then each of the low-level designs gets turned into code. This is the implementationprocess: each abstract layer is implemented in terms of a lower-level, more concrete layer.

At each stage of this process, the people who are responsible for this implementation process need to understand the design. The chances are that there is more than a single implementer, and that the sum of all of the designs in all of the layers is more than will fit in a single implementer’s head.

Therefore the clearer the communication of each layer of the design, the greater chance there is that the next layer down will be implemented well. If one layer is well explained, documented and communicated, the layer below is much more likely to work well and consistently with the rest of the design.

The intended audience for the communication of the design is not just these different layers of implementers of the design. In the future, the code is likely to be supported, maintained and enhanced, and all of the folk who do these tasks can be well assisted by understanding the guiding principles behind the original design.

3.4.2 Diagrams

One of the most powerful tools in discussing and communicating a design is a diagram. Many people think visually; a simple diagram of blocks and lines can help to visualize the design and make it clear. For asynchronous systems (see Section 3.3.2 [Asynchronicity], page 17), explicitly including the flow of time as an axis in a diagram can help to illuminate potential timing problems.

This isn’t news; in fact, it’s so well known that there is an entire standardized system for depicting software systems and their constituents: the Unified Modeling Language (UML). Having a standard system can obviously save time during design discussions, but an arbitrary boxes-and-lines picture is usually clear enough for most purposes.

Colour can also help when communicating a design diagram—to differentiate between different types of component, or to indicate differing options for the design6.

6 Note that I got all the way through this section without repeating the old adage about a picture being worth a thousand words. Until just now, of course.


3.4.3 Documentation

Documenting the design is important because it allows the intent behind the design to be communicated even to developers who aren’t around at the time of the original design. It also allows the higher-level concepts behind the design to be explained without getting bogged down in too many details.

In recent years, many software engineers have begun to include documentation hooks in their source code. A variety of tools (from Doxygen to JavaDoc all the way back to the granddaddy of them all, Donald Knuth’s WEB system) can then mechanically extract interface documentation and so on.

Any documentation is better than nothing, but it’s important not to confuse this sort of thing with real documentation. Book publishers like O’Reilly sell many books that describe software systems for which the source code is freely available—even when that source code includes this kind of documentation hook. These books are organized and structured to explain the code without getting bogged down in details; they cover the concepts involved, they highlight key scenarios, they include pictures, and they’re still useful even when the software has moved on several versions since the book was published. In other words, they are roughly equivalent to a good design document.

Some points to bear in mind when generating design documents:

• Structure: The text should be well-structured, in order to make it easy to read and review—software developers tend to think about things in a structured way.

• Rationale: As well as describing how the code is supposed to work, the design document should also describe why it has been designed that way. In fact, it’s often useful to discuss other potential approaches and explain why they have not been taken—people often learn from and remember (potential) horror stories. This helps future developers understand what kinds of changes to the design are likely to work, and which are likely to cause unintended consequences.

• Satisfaction of the requirements: It should be clear from the documentation of the design that the requirements for the software are actually satisfied by the design. In particular, it’s often worth including a section in the design that revisits any requirements use cases (see Section 2.2 [Use Cases], page 8), showing how the internals of the code would work in those scenarios—many people understand things better as a result of induction (seeing several examples and spotting the patterns involved) than deduction (explaining the code’s behaviour and its direct consequences).

• Avoidance of the passive voice: The design document describes how things get done. The previous sentence uses the passive voice: “things get done” avoids the question of exactly who or what does things. For a design document to be unambiguous, it should describe exactly which chunk of code performs all of the functionality described.


4 Code

This chapter deals with the code itself. This is a shorter chapter than some, because I’ve tried to avoid repeating too much of the standard advice about how to write good code on a line-by-line basis—in fact, many of the topics of this chapter are really aspects of low-level design rather than pure coding1. However, there are still a few things that are rarely explored and which can be particularly relevant when the aim is to achieve the highest levels of quality.

4.1 Portability

Writing portable code encourages good design and good code. To write portable code, the developers have to understand both the problem domain and the solution domain, and what the boundary between the two is. It’s very useful from a business perspective too: portable code can be sold on a variety of different platforms, without being locked into a particular operating system or database vendor. Portability is planning for medium term and long term success; it assumes that the lifetime of the software is going to be longer than the popularity of the current platforms2.

There are different levels of portability, depending on how much functionality is assumed to be provided by the environment. This might be as little as a C compiler and a standard C library—or perhaps not even that, given the limitations of some embedded systems (for example, no floating point support). It might be a set of functionality that corresponds to some defined or de facto standard, such as the various incarnations of UNIX. For the software to actually do something, there has to be some kind of assumption about the functionality provided by the environment—perhaps the availability of a reliable database system, or a standard sockets stack.

It’s sensible to encapsulate the interactions between the product code and the environment. Encapsulation is always a good idea, but in this case it gives a definitive description of the functionality required from the environment. That way, if and when the software is moved to a different environment, it’s easier to adapt—hopefully by mapping the new environment to the encapsulated interface, but sometimes by changing the interface slightly and adapting the product code too.

For six-nines software, the interactions between the product code and the environment are often very low-level. Many of these interfaces are standardized—the C library, a sockets library, the POSIX thread library—but they can also be very environment specific; for example, notification mechanisms for hardware failures or back-plane communication mechanisms. Some of these low-level interfaces can be very unusual for programmers who are used to higher level languages. For example, some network processors have two different types of memory: general-purpose memory for control information, and packet buffers for transporting network data. The packet buffers are optimized for moving a whole set of data through the system quickly, but at the expense of making access to individual bytes in a packet slower and more awkward. Taken together, this means that software that may need to be ported to these systems has to carefully distinguish between the two different types of memory.

All of these interfaces between the software and its environment need to be encapsulated, and in generating these encapsulations it’s worth examining more systems than just the target range of portability. For example, when generating an encapsulated interface to a sockets stack for a UNIX-based system, it’s worth checking how the Windows implementation of sockets would fit under the interface. Likewise, when encapsulating the interface to a database system, it’s worth considering a number of potential underlying databases—Oracle, MySQL, DB2, PostgreSQL, etc.—to see whether a small amount of extra effort in the interface would encapsulate a much wider range of underlying implementations (see Section 3.1.2 [Black Box Principle], page 12). As ever, it’s a balance between an interface that’s a least common denominator that pushes lots of processing up to the product code, and an interface that relies so closely on one vendor’s special features that the product code is effectively locked into that vendor forever.

1 But having “High-Level Design” and “Low-Level Design” chapters isn’t as neat as “Design” and “Code”.
2 Some of the first software I ever worked on (in 1987) is still being sold today, having run on mainframes, minicomputers, dedicated terminals, large UNIX systems and even Windows and Linux systems.

4.2 Internationalization

Internationalization is the process of ensuring that software can run successfully in a variety of different locales. Locales involve more than just the language used in the user interface; different cultures have different conventions for displaying various kinds of data. For example, 5/6/07 would be interpreted as the 5th June 2007 in the UK, but as 6th May 2007 in the US. Ten billion3 would be written 10,000,000,000 in the US, but as 10,00,00,00,000 in India.

Once software has been internationalized, it can then be localized4 by supplying all of the locale-specific settings for a particular target: translated user interface texts, date and time settings, currency settings and so on.

As with portability, writing internationalized code is planning ahead for success (in markets outside of the original country of distribution). More immediately, thinking about internationalization encourages better design and code, because it forces the designer to think carefully about the concepts behind some common types of data. Some examples will help to make this clearer.

What exactly is a string? Is it conceptually a sequence of bytes or a sequence of characters? If you have to iterate through that sequence, this distinction makes a huge difference. To convert from a concrete sequence of bytes to a logical sequence of characters (or vice versa), the code needs to know the encoding: UTF-8 or ISO-Latin-1 or UCS-2 etc. To convert from a logical sequence of characters to human-readable text on screen, the code needs to have a glyph for every character, which depends on the fonts available for display.

What exactly is a date and time? Is it in some local timezone, or in a fixed well-known timezone? If it’s in a local timezone, does the code incorrectly assume that there are 24 hours in a day or that time always increases (neither assumption is true during daylight-saving time shifts)? If it’s stored in a fixed timezone like UTC5, does the code need to convert it to and from a more friendly local time for the end user? If so, where does the information needed for the conversion come from and what happens when it changes6?

What’s in a name? Dividing a person’s name into “first name” and “last name” isn’t appropriate for many Asian countries, where typical usage has a family name first, followed by a given name. Even with the more useful division into “given name” and “family name”, there are still potential wrinkles: many cultures have gender-specific variants of names, so a brother and sister would not have matching family names. Icelandic directories are organized by given name rather than “family name”, because Iceland uses a patronymic system (where the “family name” is the father’s name followed by “-son” or “-dottir”)—brothers share a family name, but father and son do not.

Being aware of the range of different cultural norms for various kinds of data makes it much more likely that the code will accurately and robustly model the core concepts involved.

3 Ten US billions, that is. Traditional UK billions are a thousand times larger than US billions—a million millions rather than a thousand millions.

4 Internationalization and localization are often referred to as I18N and L10N for the hard-of-typing—the embedded number indicates how many letters have been skipped.

5 Coordinated Universal Time, the modern name for Greenwich Mean Time (GMT). Also known as Zulu time (Z).

6 Such as in August 2005, when the US Congress moved the dates for daylight-saving with effect from 2007.


4.3 Multithreading

This section concentrates on one specific area of low-level design and coding that can have significant effects on quality: multithreading.

In multithreaded code, multiple threads of execution share access to the same set of memory areas, and they have to ensure that access to that memory is correctly synchronized to prevent corruption of data. Exactly the same considerations apply when multiple processes use explicitly shared memory, although in practice there are fewer problems because access to the shared area is more explicit and deliberate.

Multithreaded code is much harder to write correctly than single-threaded code, a fact which is not typically appreciated by new programmers. It’s easy to neglect to fully mutex-protect all areas of data, with no apparent ill effects for the first few thousand times the code is run. Even if all the data is properly protected, it is still possible to encounter more subtle deadlock problems induced by inverted locking hierarchies.

Of course, it is possible to design and code around these difficulties successfully. However, the only way to continue being successful as the size of the software scales up is by imposing some serious discipline on coding practices—rigorously checked conventions on mutex protection for data areas, wrapped locking calls to allow checks for locking hierarchy inversions, well-defined conventions for thread entrypoints and so on.

Even with this level of discipline, multithreading still imposes some serious costs to maintainability. The key aspect of this is unpredictability—multithreaded code devolves the responsibility for scheduling different parts of the code to the operating system, and there is no reliable way of determining in what order different parts of the code will run. This makes it very difficult to build reliable regression tests that exercise the code fully; it also means that when bugs do occur, they are much harder to debug (see Section 6.6 [Debugging], page 51).

So when do the benefits of multithreading outweigh the costs? When is it worth using multithreading?

• When there is little or no interaction between different threads. The most common pattern for this is when there are several worker threads that execute the same code path, but that code path includes blocking operations. In this case, the multithreading is essential for the code to perform responsively enough, but the design is effectively single threaded: if the code only ran one thread, it would behave exactly the same (only slower).

• When there are processing-intensive operations that need to be performed. In this case, the most straightforward design may be for the heavy processing to be done in a synchronous way by a dedicated thread, while other threads continue with the other functions of the software and wait for the processing to complete (thus ensuring UI responsiveness, for example).

• When you have to. Sometimes an essential underlying API requires multithreading; this is most common in UI frameworks.

4.4 Coding Standards

Everyone has their own favourite coding convention, and they usually provide explanations of what the benefits of their particular preferences are. These preferences vary widely depending on the programming language and development environment; however, it’s worth stepping back to explore what the benefits are for the whole concept of coding standards.

The first benefit of a common set of conventions across a codebase is that it helps with maintainability. When a developer explores a new area of code, it’s in a format that they’re familiar with, and so it’s ever so slightly easier to read and understand. Standard naming conventions for different types of variables—local, global, object members—can make the intent of a piece of code clearer to the reader.


The next benefit is that it can sometimes help to avoid common errors, by encoding wisdom learnt in previous developments. For example, in C-like languages it’s often recommended that comparison operations should be written as if (12 == var) rather than if (var == 12), to avoid the possibility of mistyping and getting if (var = 12) instead. Another example is that coding standards for portable software often ban the use of C/C++ unions, since their layout in memory is very implementation-dependent. For portable software (see Section 4.1 [Portability], page 28) that has to deal with byte flipping issues, a variable naming convention that indicates the byte ordering can save many bugs7.

With these kinds of recommendations, it’s important that the coding standards include a description of the reasons for the recommendations. Sometimes it becomes necessary to break particular recommendations, and this can only be done in a sensible way if the developers know whether a particular provision is just for tidiness or is to avoid a particular compiler problem (say).

An often-overlooked benefit of a consistent coding convention is that it can make the source code easier to analyse and process. Fully parsing a high-level programming language is non-trivial (especially for something as complicated as C++); however, if the coding standards impose some additional structure, it’s often possible to get useful information with more trivial tools (such as scripts written in awk or Perl). For example, I’ve generated code coverage information for a Linux kernel module by running the source code through an awk script that replaced particular instances of { with a trace statement (something like {{static int hit=0; if (!hit) {TRACE("hit %s:%d", __FILE__, __LINE__); hit=1;}}). The generated trace information was then post-processed by other simple scripts to generate reasonably accurate line coverage data. For this example, the scripts were only tractable because the source code structure was predictable.

Finally, there can occasionally be a trivial benefit if the source code gets delivered to customers: it looks more professional. Obviously, this is more style than substance, but it can help to reinforce the impression that the developers have put care and attention into the source code.

Given a coding convention, it’s important not to waste the time of the developers with the business of policing that coding convention. Most provisions of the coding standard can and should be checked mechanically—when new code is checked into the codebase, and as part of the regular build process (see Section 4.6.3 [Enforcing Standards], page 37).

4.5 Tracing

Tracing is the addition of extra diagnostic statements to the code to emit useful information about the state of the processing being performed. In its simplest form, this can just involve scattering a few choice if (verbose) fprintf(stderr, "reached here") lines throughout the code. In this section, we describe a much more comprehensive approach to the same basic technique.

In application software, developers usually solve problems by stepping through the code in a debugger. Examining the output from internal trace in the code is equivalent to this process, but it has a few advantages compared to the debugger.

• Tracing is available and accessible on systems that don’t have a debugger or where using a debugger is awkward (for example, embedded systems or low-level device drivers).

• Trace output will be the same on different environments, so there’s no need to learn about the foibles of different debuggers.

• Trace output covers a whole sequence of processing, rather than just the processing occurring at a particular point in time. This means it’s possible to step backwards in the code as well as forwards, without having to re-run the scenario (which is particularly helpful for intermittent bugs).

7 For example, if an h or n prefix indicates ordering, then it’s easy to spot that hSeqNum = packet->nLastSeqNum + 1; is probably wrong.

Tracing is not unique to higher quality software development, but it is generally implemented more thoroughly in these kinds of software. Higher reliability correlates with more efficient debugging, which correlates with better internal diagnostics.

In some respects, trace statements in the code are similar to comments—they both consist of text destined for human readers, closely linked with particular chunks of code. However, tracing has a slightly different purpose than comments; trace statements indicate what the code is doing, whereas comments explain why the code is doing it (and of course the code itself is how it’s doing it). As a result, trace statements often indicate trivial things that would be pointless if they were comments instead—processing has reached this branch, this variable has been set to that value, and so on.

Trace statements explain what the code is doing as it does it, in order to allow analysis and debugging of the code. Bearing this in mind, there are a few rules of thumb for the developers which will help to maximize the usefulness of the trace:

• It’s a good idea to have a coding standard that requires a trace statement in every branch of the code. If this is enforced, then the complete path of the flow of processing through the code can be reliably retraced. (This has the side benefit that code coverage information can be derived from the tracing even when the toolchain doesn’t support coverage.)

• Make sure that trace statements contain as much useful information as possible. It’s easy to fall into a habit of inserting trace statements that just say “got to this line of code” (particularly if the previous suggestion is enforced on developers who don’t understand the value of tracing); it’s much more helpful when the values of key variables are included in the trace.

• Given the previous suggestion, the tracing system should cope with the inclusion of arbitrary pieces of information—say by allowing C printf-like formatting, or use of C++ overloaded operator<<.

• To make sure the developers understand the value of tracing in general, and tracing extra parameters in particular, it’s worth insisting that they try debugging a few problems just using tracing and other diagnostics, with no access to a debugger.

There are possible penalties to including tracing in the code, and it’s important that the tracing framework avoids these penalties where possible.

The most obvious penalty is performance—particularly if every possible branch of the code includes a trace statement (as suggested above). However, this is easily dealt with by allowing the trace framework to be compiled out—a release build can be mechanically preprocessed to remove the trace statements.

Another penalty with the inclusion of copious quantities of trace is that it becomes more difficult to see the wood for the trees. Key events get lost in the morass of trace output, disk files fill up, and trace from perfectly-working areas of code obscures the details from the problem areas. Again, it’s not difficult to build a trace framework in a way that alleviates these problems.

• Include enough information in the trace data to allow filtering in or out for particular areas of code. A simple but effective mechanism in C or C++ is to quietly include the __FILE__ and __LINE__ information in the tracing system. This immediately allows trace from particular files to be included or excluded; if the codebase has an appropriate file naming convention, this mechanism can also work on a component by component basis (for example, filter in trace from any source code named xyz_*). The filtering setup can be set from a configuration file, or (ideally) modified while the code is actually running.

• For code that deals with many similar things in parallel, make sure the trace system will include enough context in its output—and again, allow filtering on this information. This context information might be a unique identifier for a key database record, or the address of an essential object or control block.

• Use different levels of tracing, where significant events show up at a higher level than step-by-step trace statements. For example: code branch tracing < function entry/exit tracing < internal interface tracing < external interface tracing. The trace framework can then allow for these different levels of trace to be compiled in and out separately, and filtered in or out separately (in other words, there is a compile-time trace level and a run-time trace level).

• Include mechanisms to trace out the entire contents of activities on external interfaces (so for a message driven system, this would mean saving the whole contents of the message). If the code is deterministic, then this makes it possible to regenerate the behaviour of the code—re-running the external interface trace provides exactly the same inputs, so the code will generate exactly the same outputs.

• Include mechanisms to limit the amount of trace information generated. If tracing goes to file, this can be as simple as alternating between a pair of files with a maximum size for each file. A more sophisticated system might trace to an internal buffer, which is only written out to disk as a result of particular triggers (for example, when the developer triggers it, or when a SEGV signal is caught).

With a reliable and comprehensive trace system in place, it soon becomes possible to use that system for all sorts of additional purposes:

• If tracing is ubiquitous in the code, it’s possible to generate code coverage information even in environments whose toolchain doesn’t support it. No changes to the code are needed—just some scripting and/or tweaks to the trace system to track which trace lines have been hit.

• Similarly, it’s possible to get some quick and dirty identification of the most frequently executed code in the product, by adding counters to the trace system.

• If every function traces entry and exit in a standard way, the tracing system can be used to give call stacks when the operating system doesn’t.

The following example should help to give a flavour of what code that includes tracing could look like.

#include <cpptrace.h>
#include <Fred.h>

void Fred::doSomething(int a)
{
  TRACE_FN("doSomething");
  TRC_DBG("called with "<<a);
  if (a==0) {
    TRC_NRM("zero case, mOtherField="<<mOtherField);
    doOtherThing(&a);
  } else {
    TRC_NRM("non-zero, use mField = "<<mField);
    doOtherThing(&mField);
  }
}

For this particular C++ example:

• The TRACE_FN invocation generates entry and exit trace for the function. It is implemented by including an object on the stack whose constructor generates the entry trace and whose destructor generates the exit trace. As such, the exit trace is generated even when the function is terminated by a C++ exception.

• The trace invocations are all macros that can be compiled out.

• The contents of each trace invocation use C++ operator<<, so arbitrary types can be traced.

The output of this code might look something like the following.

2006-03-21 14:36:05 224:b2a4 Fred.cpp::doSomething 6 { Entry
2006-03-21 14:36:05 224:b2a4 Fred.cpp::doSomething 8 _ called with 0
2006-03-21 14:36:05 224:b2a4 Fred.cpp::doSomething 10 . zero case, mOtherField=/tmp/file
2006-03-21 14:36:05 224:b2a4 Fred.cpp::doSomething 0 } Exit

This trace output includes the following information, in a fixed format which is easy to filter after the fact.

• Date and time the trace was generated.

• Process and thread ID that generated the trace.

• Module containing the code.

• Method name (set up by the TRACE_FN call).

• Line number.

• Trace level indicator. This includes { and } for function entry and exit, which means that the bracket matching features of editors can be used to jump past particular functions.

• The trace message itself, including any additional data traced out.

Although the title of this section is “Tracing”, similar considerations apply to logging as well. The key distinction between tracing and logging is that tracing is purely aimed at the developers of the code, whereas logging involves emitting information that is of use to future administrators of the running product. This distinction does mean that there are a couple of key differences between the two, though.

• Tracing is typically compiled out of the final released product, but logging is not (and so should be checked to confirm that it doesn’t involve a significant performance penalty).

• Logging messages should normally be internationalized so that the product can be localized for different nationalities of administrators, whereas tracing messages can be fixed in the language of the development team.

4.6 Managing the Codebase

During the coding phase of a project, the developers also need to deal with all of the ancillary bits and pieces that are needed to convert the source code into binaries, get those binaries running on the desired systems, and keep track of what binaries are running where.

4.6.1 Revision Control

The first step of this process is to make sure that the source code itself is available and safe from screw-ups. The latter of these equates to making sure that there’s a reliable source code control system, and that it gets backed up frequently.

A source code control system can actually cover a number of different areas, all related to the primary goal of keeping track of versions of source code files:

• For individual developers, it can be useful to keep track of different versions of files as they are under development. For this kind of source control:

• Different revisions will only be hours or minutes apart.

• Old revisions will only ever be re-visited by the developer themselves, and then onlywithin a few days of their creation.

• Revisions need not be complete, coherent or compilable.


• For the project team as a whole, the main revision control system keeps the current tip version of the entire source code. For this:

• Different revisions will typically be days or weeks apart.

• Files will be created and modified by a variety of different team members.

• Each revision is normally expected to be a working version of the file, that compiles cleanly and can be used by other members of the team. This can sometimes be checked mechanically—files that don’t compile or that fail automated checks against coding standards (see Section 4.4 [Coding Standards], page 30) are rejected.

• It’s common to check in a collection of related files at once, and the revision system needs to allow that entire set of changes to be retrieved as a unit (for example, so that a vital fix can be merged and back-applied to a different codebase).

• Changes to the master tip version of the source code are often correlated with other tracking systems—most commonly with a bug tracking system (see Section 4.6.2 [Other Tracking Systems], page 35).

• When the project hits the outside world, a release control system is needed. This correlates the delivered binary files that are run by customers with the source code files that were used to create those binaries. For this:

• New releases will typically be weeks or (more likely) months apart.

• It should be easy to extract an entire codebase which corresponds to a particular released binary (for tracking down bugs in that released binary).

• It must be possible to branch the revision tree. This allows some-but-not-all of the changes in the current tip revision to be back-applied, so that important bug fixes can be included in a new set of released binary files without including other potentially destabilizing changes.

• For convenience, it’s sometimes worth revision-tracking the binary files as well as the source code files.

• For a completely bulletproof system, it can be worth version-tracking the entire toolchain (compiler, linker, system include files and so on) to make sure that a previous version can be regenerated in a way that is byte-for-byte identical to the original release.

There are many version control systems to choose from; the primary deciding factor for that choice is reliability, but almost all systems are completely reliable. That leaves extra features as the next factor on which to base a choice—but in practice, most extra features can be built onto a basic system with some scripting, provided that the system allows scripts to be wrapped around it. The net of this is that it’s often easiest to stick with a known system that the developers have experience of (see Section 3.2.5 [Avoiding the Cutting Edge], page 16).

4.6.2 Other Tracking Systems

As well as tracking changes to the source code, software projects usually need to track a variety of other pieces of information. These other pieces of information may well be more relevant to other stages in the lifecycle of a project—such as testing (see Chapter 6 [Test], page 46) or support (see Chapter 7 [Support], page 52)—but since the source code is the heart of a software project, it makes sense to consider all the related tracking systems together.

The most important information to track is the reasons for changes to the source code. Depending on the phase of the project, there are a variety of potential reasons why the source code might change:

• Development: writing the original version of the code in the first place.

• Changes from reviews: design reviews and code reviews are likely to result in a number of changes to existing code.


• Enhancements: additional features and functionality for the software require code changes to implement them.

• Changes from testing: the various test phases (see Chapter 6 [Test], page 46) will expose problems in the code that need to be fixed. This is a subset of “Bugs” below.

• Bugs: problems in the code that have been exposed by testing or reported by customers and users induce changes in the source code to correct those problems.

Of these categories, bugs are by far the most numerous, so it is common practice to use a bug system to track all changes to the code. With this setup, specific bug categories (“development” or “enhancement”) can be used to track code changes that do not originate from problems in the code.

The bug tracking system needs to be closely tied to the revision control system, correlating code changes against bugs and also correlating bugs against code changes.

• For a change to the code, the revision control system needs to indicate which bug the change was for. If a future bug involves that area of code, the information in the bug tracking system elucidates the reasons why the code is there, and what might be the potential consequences of changing it. In other words, this allows the developer to check whether a new change to the code is re-introducing an old bug.

• For a bug, the bug tracking system needs to indicate the set of source code changes that correspond to its fix⁸. If a bug recurs or is incompletely fixed, this record allows the developers to determine if the fix has been altered or if it should apply to other areas of the code.

For smaller codebases, it can also be helpful to automatically generate a combined summary report that lists (by date) the files in the codebase that have changed, together with the corresponding bug number. This report shows the codebase activity at a glance, so it’s easier to figure out underlying causes when things go awry. When things mysteriously stop working, it’s usually a recent change that’s to blame—even if it’s in an apparently unrelated area.
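As a sketch of how such a summary report might be generated, the Python fragment below extracts bug numbers from a one-line-per-change log. The log format, the file names, and the "bug NNNN" message convention are all assumptions made for illustration, not the output of any particular tool:

```python
import re

# Hypothetical log format: "<date> <file> <message>", one change per line.
# The assumed convention is that messages reference bugs as "bug NNNN".
SAMPLE_LOG = """\
2007-03-01 src/parser.c fix crash on empty input (bug 1041)
2007-03-04 src/parser.c handle UTF-8 BOM (bug 1044)
2007-03-04 src/lexer.c handle UTF-8 BOM (bug 1044)
2007-03-09 doc/manual.txt development: document new options
"""

BUG_RE = re.compile(r"bug\s+(\d+)", re.IGNORECASE)

def summarize(log_text):
    """Return a list of (date, file, bug-number-or-None) tuples."""
    rows = []
    for line in log_text.splitlines():
        date, path, message = line.split(None, 2)
        match = BUG_RE.search(message)
        rows.append((date, path, match.group(1) if match else None))
    return rows

for date, path, bug in summarize(SAMPLE_LOG):
    print(f"{date}  {path:18} bug {bug or '-'}")
```

A report like this makes the "what changed recently, and why?" question answerable in seconds rather than requiring a trawl through the revision control system.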

Similarly, it’s also useful to have a mechanical way of generating a list of all of the changes between particular dates or releases, as a collection of both affected files and affected bugs. This can show the areas where testing should be concentrated for a new release, or the code changes that should be re-examined in case of a regression between releases.

Once software escapes into the wild and is used by real users, it is sadly common for them to lose track of exactly what software they have. The support process is much harder if you don’t know exactly what code you’re supporting, so it’s often helpful to add systems that make identification of code versions more automatic and reliable.

• Embed version numbers for each source code file into the corresponding binary file. Most version control systems include a facility to automatically expand a keyword into the version number (for example $Id: hqse.texi,v 1.197 2011-06-01 03:02:40 dmd Exp $ for RCS). This can then be embedded in a string (static const char rcsid[] = "$Id: hqse.texi,v 1.197 2011-06-01 03:02:40 dmd Exp $";) that is easy to extract from the built binaries (using UNIX ident, for example).

• Embed release identification into the product as a whole. It should be easy for customers to figure out which version of the product they are running; for products with a UI, it is normal for a “Help, About” box to display the overall product identification string.

• Ensure that release numbers are always unique. Sometimes bugs are spotted mere moments after a new version of a product is released; avoid the temptation to nip in a fix and re-use the same version number.

8 If the codebase has a regressible system of test harnesses, the bug record should also indicate which test cases display the bug.


• It can also be helpful to embed build information into the released product, identifying the date and time and build options used. The date and time of the build can be included in the overall product version identification to guarantee the uniqueness described in the previous bullet.
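To make the extraction side of the version-embedding idea concrete, here is a minimal Python sketch of what UNIX ident does: scanning the bytes of a built binary for expanded $Id$ keywords. The fake binary contents and the file name in it are invented for illustration; a real tool would read the binary from disk:

```python
import re

# Expanded RCS keywords look like "$Id: file,v revision date time author state $".
IDENT_RE = re.compile(rb"\$Id: [^$\n]+\$")

def extract_ids(data):
    """Return the decoded $Id$ strings found in a blob of binary data."""
    return [m.group(0).decode("ascii", "replace") for m in IDENT_RE.finditer(data)]

# Simulate the bytes of a compiled binary with one embedded rcsid string.
fake_binary = (b"\x7fELF\x01\x00junk"
               b"$Id: parser.c,v 1.42 2007/03/01 12:00:00 dmd Exp $"
               b"\x00more junk\xff")
for ident in extract_ids(fake_binary):
    print(ident)
```

Because the search is over raw bytes, the same script works unchanged on object files, libraries and final executables.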

4.6.3 Enforcing Standards

A previous section (see Section 4.4 [Coding Standards], page 30) discussed the rationale behind project- or company-wide coding standards. However, just relying on the developers themselves to comply with these standards is rarely enough—they need to be checked mechanically instead. These mechanical checks are typically implemented in some sort of scripting language—perhaps awk, Perl or Python.

There are three common times to run the checks.

• By hand, for individual developers who want to check the current state of compliance for the code that they are working on.

• During the code check-in process, to make sure that code that doesn’t comply with the relevant standards doesn’t make it into the codebase.

• During the regular code build process, to catch any problems that have slipped in.

Sadly, in the real world there are always going to be exceptions to the standards—perhaps an imported set of code files that conform to a different standard, or an OS-specific module that doesn’t need to conform to the same portability requirements. As such, it’s important that the systems for checking standards compliance include a way of allowing agreed exceptions.
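A mechanical checker with an agreed exception list can be very small indeed. The Python sketch below illustrates the shape of such a script; the specific rules, file names and exception patterns are made up for the example rather than taken from any particular project's standards:

```python
import fnmatch

# Hypothetical rules and exception list; a real project would agree its own.
MAX_LINE = 79
EXCEPTIONS = ["third_party/*", "os_specific/win32_*.c"]  # agreed exemptions

def check_file(path, lines):
    """Return a list of (line number, message) violations for one file."""
    if any(fnmatch.fnmatch(path, pattern) for pattern in EXCEPTIONS):
        return []  # imported or platform-specific code, checked under other rules
    problems = []
    for num, line in enumerate(lines, start=1):
        if "\t" in line:
            problems.append((num, "tab character (use spaces)"))
        if len(line.rstrip("\n")) > MAX_LINE:
            problems.append((num, f"line longer than {MAX_LINE} columns"))
        if line.rstrip("\n") != line.rstrip():
            problems.append((num, "trailing whitespace"))
    return problems

source = ["int main(void) {\n", "\treturn 0;   \n", "}\n"]
for num, msg in check_file("src/main.c", source):
    print(f"src/main.c:{num}: {msg}")
```

The same function can be driven by hand, from a check-in hook, or from the build, covering all three of the occasions listed above.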

4.6.4 Building the Code

Customers for a software product normally don’t just get the source code. Instead, they get a set of binary files that have been built from the source code, together with a variety of other files (help files, documentation, localization files, default configuration files etc.). To produce the binary files from the source code, still more files are needed—makefiles, build jobs, install image creation scripts, tools to extract documentation, and so on.

The same considerations that apply to source code files apply to all of these ancillary files, but for some reason this seems to happen much more rarely. Common sections of build jobs should be commonized rather than copied and pasted, so that fixing a problem in the build only has to be done in one place—the same as for code. The auxiliary files should all be stored under revision control, and backed up, so that bad changes can be reverted—the same as for code. Files shouldn’t generate excessive numbers of benign warning messages that might obscure real error messages—the same as for code.

To make it less likely that developers will break the build, it’s useful if they can easily run their own local version of the build to confirm the state of any code they’re about to check in. This means that the build should be easy to run from beginning to end, preferably with a single command, and it should ideally have reliable dependency checking so that only code that has changed gets rebuilt—in a large codebase, anything else is tempting the developers to cut corners.

The build system should also build everything that the customers are going to get, not just the raw binaries—which might mean that the end result of the build is a self-extracting archive that includes help files and localization files and so on.

The build system can also generate things that customers are not going to get—notably the internal test harnesses for the product, and the results of running those test harnesses. This gives a much higher confidence level for the current state of the code than just whether it compiles and links or not.

As the size of the team and the project scales up, it also becomes worth investing the time to automate various build-related activities. This might start with a system to automatically filter out known benign warning messages from the build; with a larger codebase, there might be a system to automatically email build output for particular components to the component owners. For a more established, stable codebase, it’s often worth building a system that emails any new build problems to the developer that most recently touched the codebase or the relevant file.
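A first cut of such a warning filter can be as simple as a list of known-benign patterns, with everything else in the build output surfaced as a potential new problem. The patterns and the sample compiler output below are invented for illustration:

```python
import re

# Hypothetical list of known-benign diagnostics agreed by the team; anything
# not matching these patterns is treated as a new problem worth reporting.
BENIGN = [
    re.compile(r"warning: unused parameter 'reserved'"),
    re.compile(r"warning: #pragma once in main file"),
]

def new_problems(build_output):
    """Return the warning/error lines that are not on the known-benign list."""
    kept = []
    for line in build_output.splitlines():
        if "warning" in line or "error" in line:
            if not any(p.search(line) for p in BENIGN):
                kept.append(line)
    return kept

output = """\
cc -c src/api.c
src/api.c:10: warning: unused parameter 'reserved'
src/api.c:42: warning: comparison between signed and unsigned
"""
for line in new_problems(output):
    print(line)
```

The output of this filter is what gets mailed to component owners or to the most recent committer, so that genuinely new problems are never lost in a sea of familiar noise.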

4.6.5 Delivering the Code

The final stage of the code’s journey is its delivery to the customer. Once the binaries are built, some mechanism is needed to ensure that the code ends up running correctly on the destination machine. This is particularly important for software that will be installed by end-users, but even deliveries to other software engineering groups require some planning and coordination.

Building installers is harder than many developers realize. What seems like a simple matter of moving files between disks is much more complicated when all of the potential factors are considered. To get the code running correctly on the destination machine, the install process needs to consider:

• Is this the right destination machine? Is the operating system a supported version? Are all of the required libraries already installed? Is there enough disk space? Are all the hardware requirements satisfied?

• Is this the right code? Are there earlier versions of the code installed? If there are, does the new version replace or live alongside the existing version? If there aren’t, can a patch install still proceed? Is this the build variant that matches the local configuration?

• Does the code run? Does it rely on being installed into a particular place? Does it rely on being installed with particular permission levels? Does it have necessary access to the network or the Internet?

• Does the code run correctly? Are there self-tests that can be run? Are there diagnostics that can be gathered and reported to help locate problems?

Rather than deferring all of these issues to the end of the software development cycle, it’s often best to build install packages as part of the regular build system. With this approach, the install packages are built and used regularly throughout the development cycle, increasing the chances of spotting problems with the installer.


5 Code Review

Code reviews are a very effective tool for improving the quality of both the software itself and the developers that produce the software. When they’re done badly, code reviews can also be a pointless exercise in box-ticking that wastes time and annoys developers. This chapter discusses the how and why of code reviews in a level of detail that is rarely covered, in order to help them lean towards the former rather than the latter.

5.1 Rationale: Why do code reviews?

So what’s the point of code reviews? Even among folk who agree that they’re a good idea, the full rationale behind code reviews is rarely examined in detail.

5.1.1 Correctness

The most obvious idea behind a code review is to try to find bugs. It’s well known that the earlier a bug is found, the cheaper it is to fix—and a code review is the earliest possible time after the code has been written that this can happen (other than not putting in bugs in the first place).

A second pair of eyes examining the code is a useful sanity check to catch bugs. If the reviewer has more experience with the technologies involved (from programming language to underlying APIs to operating system to programming methodology), then they may well know “gotchas” to watch out for.

Interestingly, this aspect of code reviewing becomes much less important when all of the parties involved are experienced and know their stuff. In this sort of situation, the number of straightforward logic bugs spotted at code review drops to an almost negligible number—at which point the other factors below come into their own. At this stage, thorough and cunning testing is a much more efficient way to eke out the remaining bugs.

However, even in a halcyon situation where everyone involved is experienced and skilled, there is still the opportunity for integration bugs, and this is worth bearing in mind when assigning code reviewers. This is particularly true of new development—if new component A uses new component B, then the coder of component B may well be a good candidate to review the code for component A (and vice versa). Regardless of how thorough and well-written the interface specifications are, there may be ambiguities that slip through.

5.1.2 Maintainability

The most valuable reason for a code review is to ensure the maintainability of the code. Maintainability is all about making it easy for future developers to understand, debug, fix and enhance the code, which in turn relies on them being able to correctly understand the code as it stands today.

The code review is an ideal place for a dry run for this process. The original coder can’t judge whether the code is easy or difficult for someone else to understand, so the code reviewer is in a much better position to judge this. To put it another way, the code reviewer is the first of potentially dozens of people who will have to make sense of the code in the future—and so an effective code review can make a lot of difference.

To understand how much difference this can make, consider the conditions under which bugs from the field get fixed. Particularly if it’s an urgent problem, the developer fixing the problem may be under time pressure, short on sleep, and generally in the worst possible mood for appreciating some incredibly cunning but incomprehensible code. Anything which eases the understanding of the code in this situation hugely increases the chances of getting a correct fix instead of a quick hack that causes several more problems later on. This is difficult to measure with short term metrics (such as lines of code written per developer-day), but does show up in longer term bug rates and support costs—and in the more intangible aspect of developer morale.

Stepping back, the most important aspect of maintainability is good design, which should hopefully already be in place well before the code review takes place. Once that design is in place, maintainability is all about communication—communicating the intent of that design and of the lower-level details of the code—and the code review is an ideal time to test out that communication.

5.1.3 Education

An often overlooked aspect to code reviews is that they help to educate the developers involved—to develop the developers, as it were.

The most obvious side of this is that the reviewer gets educated about the details and layout of the specific code being reviewed. This means there are two people who understand the code in detail, which gives more depth in the team for situations where the original author isn’t available (they’re on holiday, or they’ve left the team, or they’ve been run over by a bus).

This can help to persuade those developers who bristle at the thought of someone else criticising their perfect code: spreading the knowledge around means they won’t have to support the code forever.

Longer term, it’s also a good way for developers to educate each other. The reviewer can see or suggest alternative ways of coding things, whether neat tricks or minor performance tweaks. A more experienced reviewer can use code reviews to train a junior programmer; a junior reviewer can use the review as an opportunity to ask questions about why things were done in particular ways. For example, a junior C++ programmer might ask why all of the loops in the code have ++ii instead of ii++, at which point the more experienced colleague can point them at Item 6¹ of Scott Meyers’ “More Effective C++”.

5.2 Searching for Bugs

So how does a code reviewer go about the business of hunting bugs in the code?

5.2.1 Local Analysis

At the core of all code reviews is a straightforward read of the code—file by file, function by function, line by line. For new code this typically means everything that’s going to get checked in; for changes to existing code, this may just involve examination of the changed code.

The list of small gotchas to watch out for during this process depends enormously on the programming language in use, and to some extent on the type of software being developed. A starter list might include:

• Fencepost (or off-by-one) errors.

• Buffer overruns for fixed-size data structures (C’s string handling library functions are notorious for this).

• Failure to cope with exceptional returns from system calls (which may include malloc returning NULL or new throwing).

• Common typos (such as = rather than == in C-derived languages).
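To make the first of these concrete, here is a small, invented Python illustration of the kind of off-by-one error a reviewer should be primed to spot: rounding down where the logic needs to round up.

```python
def pages_needed_buggy(items, per_page):
    # Fencepost bug: integer division rounds down, so 101 items at 10 per
    # page reports 10 pages and the final item falls off the end.
    return items // per_page

def pages_needed(items, per_page):
    # Correct: round up, so that a partial last page is counted.
    return (items + per_page - 1) // per_page

print(pages_needed_buggy(101, 10))  # 10, losing the last item
print(pages_needed(101, 10))        # 11
```

Both versions agree on exact multiples, which is exactly why this class of bug survives casual testing and is worth hunting for at review time.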

5.2.2 Data Structures

A useful viewpoint that can help to find trouble in code, particularly new code, is to consider the layout and structure of the data that the code manipulates. This data can take many forms—database tables, C structures, XML files—but similar considerations apply regardless.

1 Which explains that prefix increment on user-defined types is more efficient than postfix increment because the latter has to create and destroy a temporary variable.


• Check whether there are any issues with the lifetime of the data:

• When are instances of data created, modified and destroyed?

• Are creation and destruction operations appropriately symmetric?

• Are there edge cases where resources can leak? This is particularly problematic when collections of interrelated data structures are created together (think transactions) or when asynchronous operations are used (what if the async operation never completes?).

• Check interactions between different data structures and between different areas of code that manipulate the same data structure:

• Is the ownership of the data structure coherent between different components? If the same instance of a data structure is used by multiple chunks of code, are there edge cases where component A keeps using the instance even after component B has destroyed it? (Reference counted data structures can help with this.)

• If the data structure is accessed simultaneously by different areas of code (under multithreading), is the access appropriately synchronized (mutexes, synchronized methods etc.)? If multiple locks are held, are they always acquired in a consistent order?

• If the data structure is a keyed collection, is it guaranteed that the key will be unique?

• Check for internal consistency:

• For fixed size containers, does the code respect the size restrictions?

• Are pointers and references to other data structures guaranteed to stay valid?
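Some of these lifetime questions can be designed away rather than merely reviewed for. As a hedged Python sketch (the resource names here are placeholders, not a real API), a context manager makes creation and destruction symmetric even on error paths:

```python
class Transaction:
    """Sketch of symmetric create/destroy for a collection of resources.

    The context-manager protocol guarantees that everything acquired in
    __enter__ is released in __exit__, even when an exception escapes,
    which keeps the lifetime questions above easy to answer at review time.
    """
    def __init__(self):
        self.resources = []
        self.open = False

    def __enter__(self):
        self.open = True
        self.resources.append("lock")    # acquired together...
        self.resources.append("buffer")
        return self

    def __exit__(self, exc_type, exc, tb):
        while self.resources:            # ...released together, in reverse order
            self.resources.pop()
        self.open = False
        return False                     # never swallow exceptions

with Transaction() as txn:
    assert txn.open and txn.resources == ["lock", "buffer"]
# After the block, everything is released, even on error paths.
```

A reviewer seeing this pattern can verify symmetry in one place, rather than chasing every exit path through the calling code.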

5.2.3 Scenario Walkthroughs

Another method for ensuring that new code does what it’s supposed to is to dry-run mentally through some key scenarios for the code, and check that all of the required outputs from the code would be achieved.

• Ideally there will be use cases (see Section 2.2 [Use Cases], page 8) for the code that illustrate the main-line paths that the code will take—these are ideal for scenario walkthroughs.

• Central loops on the most common execution sequences (the “golden path”) can be examined for performance issues.

• Important error cases can also be examined where they diverge from the main-line paths through the code, to ensure that all code and data structures are tidied up correctly.

• On a smaller scale, if there are tricky bits of array manipulation or recursive code, working through a couple of test cases with a pencil and paper is often worthwhile.

5.3 Maintainability

5.3.1 Modularity

A key aspect of maintainability is appropriate modularity of the code. Good modularity reflects good design; each chunk of code is divided up so that its roles and responsibilities are clear—so when the code needs to be modified in future, there is an obvious place to make the change.

This is an excellent area for a code review to make a big difference. The programmer will have worked for days or weeks, evolving the code into its final form. Along the way, there’s a good chance the overall structure of the code will have suffered. The code reviewer sees the code as a whole, and can look for poor choices of division.

The kinds of things that the reviewer can look out for are:

• When reading the code, do you have to jump backwards and forwards between lots of different files? If so, this may mean that related pieces of function are not being kept together.


• On a smaller scale, to follow a particular block of code, do you have to keep paging up to keep track of the enclosing control block (if, while etc.)? Code like this may well benefit from being encapsulated into a separate function (even if that function is only ever called from a single place).

• Are there any chunks of code that feel familiar because they’ve been copied and pasted from elsewhere? It’s almost always better to have a single common function (which may need to be a little more generic); the chances of a future fix being applied to both copies of the code are small.

5.3.2 Communication

Much of the discussion in the rationale (see Section 5.1 [Rationale], page 39) behind code reviews revolved around communication: the code itself forms a means of communication between the person who understands the code now, and those who have to understand the code in future.

For some reason, programmers rarely have any difficulty believing that everyone else in the world is an idiot. This can be helpful for motivating good communication: “I want you to make your code so clear that not even a muppet could mess it up”.

Many of the apparently trivial things that tend to turn up in coding standards are about making this communication just that little bit more efficient:

• Longer, more descriptive names for identifiers make comprehension² of the code easier (provided they are well-chosen). They take longer to type, but successful code gets read many more times than it gets written.

• Consistent naming conventions across a chunk of software make typos much less likely. In the best-case scenario, this could save a fix-compile cycle; the worst-case scenario is, well, worse.

• Consistent indentation and layout makes reading the code slightly faster.

More significantly, there are a number of things that the code reviewer can look out for which may indicate that the code is not as clear as it could be.

• Are there enough comments to enable the code to be read and understood in a single pass? If the code needs to be re-read several times before it becomes clear, more comments or a restructuring of the code may help.

• Is there any code that the reviewer initially thinks is wrong, but eventually realizes is actually correct? If so, other readers of the code are likely to reach the same incorrect conclusion, so a comment explaining the likely misconception will help.

• Do the names of functions and variables accurately reflect their purpose? A function called ClearScreen would be expected to clear the screen or return an error—any other behaviour would be perverse and confusing. Ten or fifteen minutes spent coming up with exactly the right name for a function is time well spent, if it prevents future confusion.

• Are related areas of function obviously related according to their names? For example, ABCAllocateXYZ and ABCFreeXYZ are an obvious pair that would be expected to have symmetric effects.

• Is the same terminology used consistently throughout the code? This can be particularly important to help people who are not native English speakers—using four different synonyms for the same thing is likely to confuse. The most common or standard terminology for a particular concept is usually the best choice.

2 It’s possible to take this to extremes, however. I once reviewed some code that was equivalent to for (ii=0; ii<len; len++), but when I mentioned the infinite loop to the programmer he couldn’t see it—the variable names he’d used in place of ii and len were so long that the key typo was off the side of his editor screen!


• Are there pointless comments cluttering up the code? This often occurs in environments where metrics covering the ratio of comment lines to code lines are tracked (see Section 9.1.1 [Metrics], page 67), and results in comments like:

// Increment loop variable

++ii;

The key theme to point out here is the Principle of Least Astonishment: what’s the most obvious guess for what the code does, and is this what it actually does?

5.4 The Code Review Process

5.4.1 Logistics

The logistics of the code review process are fairly straightforward, but there are a number of small points that help to keep the process as efficient as possible. The net result of the process should be that both the coder and the reviewer are happy for the final version of the code to be checked in and used.

• The coder prepares the code for review. This should make life as easy as possible for the reviewer (if only to keep them in a good mood!).

• Put the code into an accessible, standard place.

• For changed code, provide a copy of the unmodified “before” code for comparison purposes.

• For a second or later iteration of the code review process, also provide the previously submitted versions of the code so that the reviewer can confirm that problems that they’ve pointed out before have been correctly fixed.

• Once the code has been provided for review, don’t change it without checking with the reviewer.

• Make sure the code passes any mechanical checks that are required—for example, that it compiles cleanly, or that its layout complies with any coding standards.

• If there is supplementary information, such as a bug report, a design document or a specification, make sure this is available for the reviewer.

• If a preliminary version of the code is being reviewed, indicate which parts are likely to change and which are considered complete.

• Perform a quick sanity-check dry run of the code review, to confirm that all the code is in place and has the correct version.

• If previous code reviews have raised the same problem several times, ensure that nothing analogous exists in the current code.

• The reviewer generates feedback comments on the code.

• Make it clear what needs to be done about each comment, whether it’s answering a question, adding a comment or fixing the code itself—or even nothing at all.

• Use a consistent method for correlating the comments with the code. This may involve changing the code itself using a specific marker, or providing file and line number information.

• If a particular problem applies in many places in the code, make it clear that all of them need to be changed (the word passim can be used to indicate this).

• If necessary, categorize the comments according to the expected code changes. One possible classification is to describe the type of change (e.g. W=wrong code, O=omitted, S=superfluous, P=performance, M=maintainability) and the severity (e.g. 1=would cause crash, 2=major loss of function, 3=minor loss of function, 4=cosmetic).

Chapter 5: Code Review 44

• The coder deals with the code review comments. For many of the comments, this just involves answering the relevant question or making the required markups. However, there are often code review comments that the coder disagrees with—after all, the coder is still the person who knows the code the best and who is ultimately responsible for the code. For these comments, the coder has to convince the reviewer of their point of view. (A common situation in these cases is for the reviewer to respond: “Ah, I understand now. Please add that explanation as a comment in the code”).

• Repeat as necessary: with less experienced programmers, it’s often necessary to repeat the above steps until both the programmer and the reviewer are happy that the code can be checked in. If this is the case, the reviewer needs to make clear to the programmer which parts of the code need to be re-reviewed.

• If there are any items of contention that the coder and the reviewer can’t reach agreement on, escalate them to the team leader.

• Track as necessary: depending on the management environment, it may be required or useful to keep track of a number of things during the code review process (see Section 9.1.1 [Metrics], page 67).

• Pre- and post- review versions of the code may need to be kept.

• Code review comments may need to be kept for future reference.

• Classification of code review markups by severity and type may be worth tracking, to help indicate components that appear particularly difficult (and so may need more extensive testing) or to monitor the performance of the programmer.

• Time taken in both the review and the markup process may need to be tracked against a project plan.

5.4.2 Dialogue

Software developers can be extremely protective of their code, which means that some of them end up treating code reviews as a personal attack on their competence. It’s important to defuse these kinds of tensions as much as possible; one way to do this is to treat the code review process as a dialogue.

The reviewer has the most important role in keeping this dialogue civilized.

• Remain polite at all times.

• Use questions rather than assertions. “Will this chunk of code cope with . . . ?” is better than “Simply Wrong”3.

• Make it clear as often as possible that comments are subjective. “I found SomeFn() hard to follow because . . . ” is better than “SomeFn() is incomprehensible”.

• Clearly distinguish between stylistic and semantic markups; there’s a big difference between “this is wrong” and “this isn’t how I’d have coded this”. Some school teachers mark in a red pen and a green pen—the red for errors that would lose marks, the green for improvements in clarity or structure.

• Try to imagine and provide theoretical future changes to the code that back up any maintainability concerns: “I’m worried that in future someone might assume that SomeFn() doesn’t lock and use it from within a locked section . . . ”.

• Don’t forget to comment on the positive as well as the negative: “I really liked the hashtable code—I’ll have to borrow that technique for other code.”
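The locking worry quoted above can be made concrete. The sketch below, in C with POSIX threads, uses the hypothetical names from the review comment (SomeFn() and a table_lock mutex are invented for illustration, not real APIs) to show why such a comment is worth raising:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical illustration of the reviewer's locking concern;
 * SomeFn() and table_lock are invented names, not real APIs. */

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static int table_entries = 0;

/* SomeFn() takes table_lock internally -- a fact that is invisible
 * at the call site unless it's documented. */
void SomeFn(void) {
    pthread_mutex_lock(&table_lock);
    table_entries++;
    pthread_mutex_unlock(&table_lock);
}

/* A future maintainer who assumes SomeFn() doesn't lock might call
 * it from a section that already holds table_lock -- and deadlock,
 * since the mutex is not recursive.  Hence the review comment asking
 * for the locking behaviour to be spelled out in the code. */
```

A one-line comment on SomeFn() stating “takes table_lock; do not call with table_lock held” is exactly the kind of markup such a review dialogue tends to produce.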

That said, the coder also has to bear a number of things in mind.

3 A friend of mine once acquired a rubber stamp with the words “SIMPLY WRONG” in red capitals; while amusing, this is hardly likely to engender the right spirit.


• Also remain polite at all times.

• Remember that you’re not capable of judging whether your code is difficult for someone else to understand—you can almost never refuse to add more explanatory comments.

• If a reviewer requests a change in a chunk of code, and you know that the same comment applies to another similar chunk of code, then volunteer to change that too.

5.5 Common Mistakes

The most common mistake that turns up in the code review process is that the reviewer doesn’t review the code thoroughly enough. This is particularly common when code reviews have been mandated by management fiat, without any training on the process of code reviewing and without allowing enough time and resources to do the job properly.

This kind of shallow, “syntax-only” review can occasionally find small bugs, but more often it’s a waste of time that only turns up trivial things (like breaches of coding standards) that should be caught automatically by the build process anyway (see Section 4.6.3 [Enforcing Standards], page 37).

So for a reviewer, the most important question is: did you understand the code?

If not, then the whole code review process is a waste. None of the reasons for doing a code review in the first place (see Section 5.1 [Rationale], page 39) are fulfilled:

• The reviewer can’t judge the correctness of the code and spot the deep, nasty bugs if they don’t understand the code.

• The code either isn’t maintainable (and that’s why the reviewer didn’t understand it) or the reviewer hasn’t taken the time to tell whether it’s maintainable or not.

• The reviewer hasn’t been educated about the form and function of the code, so that knowledge is still locked in the coder’s head.

A useful self-check for a reviewer is to consider an obscure test case (perhaps a timing window, or a resource acquisition failure): will the code cope with it or not? If you can’t tell, then the review probably hasn’t been thorough enough (if you can tell, and it won’t, then obviously that will be one of the review comments!).

If the reviewer didn’t understand the code because it was too difficult to understand—the comments were nonexistent, the documentation was missing, and the code itself was confusing—then there’s a problem with the maintainability of the code. To get decent quality software, that problem needs to be fixed.

If the reviewer didn’t understand the code because they’ve not taken the time to do it properly, then there are a couple of possibilities. Firstly, they may just need more practice in code reviewing—hopefully the advice in this chapter and in the section on ramping up on a new codebase should help with that (see Section 7.2 [Learning a Codebase], page 53).

More likely, though, is that a policy of code reviewing has been imposed without allowing sufficient time for the reviewer to actually make a difference. A thorough code review for a new chunk of code can take 10% of the time it took to write the code, or even as high as 20% if the reviewer is completely new to the area. This is time worth investing, as it increases the quality of the code and doubles the number of people who understand it thoroughly.


6 Test

Once a piece of software has been built, it needs to be tested to check that it does actually do what it’s supposed to do. This has always been an area that six-nines development takes very seriously; the testing phase is the proving ground to determine how close the software is to that magical 99.9999% number. More recently, this heavy emphasis on testing has started to become common in other development sectors—in particular, the Test Driven Development aspect of Extreme Programming.

This chapter is also going to cover a slightly wider area than purely testing itself, as the testing process involves a number of sub-steps:

• The test itself: run the code in some configuration, and look for incorrect behaviour.

• Find problems and narrow down the circumstances involved (in other words, generate a reproduction scenario).

• Debug the problems.

• Fix the problems.

• Re-test to confirm that the problem has been solved.

6.1 How

Testing checks that the software does what it’s supposed to do. This immediately illustrates a direct connection between the requirements for the software (see Chapter 2 [Requirements], page 7) and (some of) the testing phase: the requirements say what the software should do, and the testing shows what it does do. Both of these concern the customer’s view of the system: what were they expecting, what did they get, and do they match?

However, not all testing is done from the point of view of the user (whether that user is an actual human user, or another software component of a larger system). In this chapter we’re going to roughly distinguish between two main styles of testing:

• Black box testing: testing a system purely by using its external interfaces, effectively with no knowledge of the internal details of the implementation (see Section 3.1 [Interfaces and Implementations], page 11).

• White box testing: testing a system with full knowledge of the details of its implementation (sometimes known as crystal box testing, which actually makes more logical sense).

There are also many different phases of testing.

• Unit testing: building test harnesses to check that particular components of the software fulfil their interfaces.

• Coverage testing: generating particular tests aimed at the difficult-to-reach nooks and crannies in the code, with the aim of hitting them in the lab before they get hit in the field.

• Integration testing: testing multiple components integrated together.

• Functional verification (FV) testing: testing the system as a whole to confirm that the functional requirements are satisfied—in other words, confirming that it does what it says on the tin.

• System testing: testing the whole system in more stressful environments that are at least as bad as the real world (large numbers of users, network delays, hardware failures, etc.).

• Interoperability testing: confirming that the software works well with other software that’s supposed to do the same thing1.

1 Even if that other software is doing things that are technically incorrect. Customers don’t actually care if a Cisco router has some detail of a routing protocol wrong—they expect it to just work.


• Performance testing: scaling the system up to large numbers (of users, connections, links or whatever) to confirm that performance holds up. This can involve a considerable amount of investment, in extra hardware or scalability test harnesses or both.

• Free-form testing: randomly trying to drive the system in weird and wacky ways. Particularly relevant for systems with a user interface.

• Install testing: checking that any installer packages correctly detect and cope with a variety of different destination environments (see Section 4.6.5 [Delivering the Code], page 38).

Not all types of software need all of these phases, but higher quality software is likely to use almost all of them.

6.2 Who and When

There is occasionally debate within the software industry as to whether it’s better to have a separate test team, or to have the development team perform the majority of the testing. Each approach has its own advantages.

A separate test team allows development and testing to proceed in parallel, and can involve specialists in testing mechanisms (for example, properly testing new versions of a consumer operating system like Microsoft Windows or Linux involves a vast array of different hardware and software combinations). Dedicated testers are inherently black box testers; they only test based on the interface to the code, because they have no idea what the implementation of the code is.

Testing within the development team means that the people with the best knowledge of the code are doing the testing; they know which areas are likely to be fragile, and where the performance bottlenecks are likely to be—in other words, they are able to do white box testing. This kind of testing can also often be automated—regularly repeating tests as part of the overnight build system (see Section 4.6.4 [Building the Code], page 37) is a good way to improve confidence in the quality level of the code.

In the end, the choice between these two approaches comes down to a balance driven by the particular scenario, and a mixture of both styles of testing is often the best course for a large project.

Test type                 White/Black Box   Automatable
Unit                      White             Yes
Coverage                  White             Mostly
Integration               White             Yes
Functional Verification   Black             Yes
System                    Black             No
Interoperability          Black             Not usually
Performance               Black             Sometimes
Free-form                 Black             No

Table 6.1: Styles and automation possibilities for different types of test

However, if testing is done within the development team, it’s usually worth having someone other than the developer of a component do some testing on that component. This has a couple of advantages: firstly, knowledge about that component gets more widely spread within the team; secondly, a different pair of eyes will have different implicit assumptions and so will come at the tests in a way that may expose incorrect assumptions that went into the code.

As well as debate as to who should test software, there is also some debate in the software industry as to when software should be tested. Writing the tests before the code itself is one of the central tenets of Extreme Programming (XP). As with iterative development cycles (see Section 2.1 [Waterfall versus Agile], page 7), this arises from taking a principle of good software engineering practice to its extreme: testing is a Good Thing, so push it to the forefront of the development activity.

Obviously, test-first is much more suited to black box style tests than white box style testing; it’s also difficult to produce good integration tests in advance of the code being written. However, test-first also involves an important factor which can be both an advantage and a disadvantage: it’s much more difficult to reduce the amount of testing in the face of an impending deadline when the effort has already been spent.

In real world software projects, deferring some functionality or some testing beyond a fixed release date (with the expectation of higher support loads and a follow-on release later) can be an unavoidable response to a difficult situation. Any such potential gains in the schedule are heavily reduced if the corresponding test code has already been written.

Pragmatically, though, the mere fact that proper testing is getting done is much more important than the details of the phasing of that testing.

6.3 Psychology of Testing

I believe that the final bug in TeX was discovered and removed on November 27, 1985. But if, somehow, an error still lurks in the code, I shall gladly pay a finder’s fee of $20.48 to the first person who discovers it. (This is twice the previous amount, and I plan to double it again in a year; you see, I really am confident!)2

For some reason, the vast majority of new programmers come equipped with an excessive over-confidence about their own code. They’re absolutely convinced that they’ve written great code that doesn’t have any bugs in it. Even after the first few bugs are demonstrated to them, they will then believe that those are the only bugs and the rest of the code is perfect.

This hubris even affects those who should know better. When Donald Knuth—one of the world’s most highly respected computer scientists—released his TeX typesetting system he was convinced it was bug-free and offered a bug bounty that would double each year. In the end, the bounty was capped; there have been almost a hundred changes to the code since that pronouncement.

One of the key things that differentiates an experienced software engineer from a junior programmer is a visceral understanding that with code comes bugs. Once this is understood, the testing process starts to change from being a boring drag to being a vital safety net between you and the consequences of your bugs. Part of high-quality software engineering is to fully accept this concept, and plan for it accordingly—with extensive testing and also with code reviews, support structures, built-in diagnostics and so on.

This also gives a special emphasis to coverage testing. All code has bugs, and the only chance we have of reducing those bugs is by testing and fixing the code; to a first approximation it’s simplest to assume that code that’s not been tested is broken. So, if 10% of the code has never been hit during all of the test phases, then that particular 10% is effectively broken—and this will show up when it does get hit, out in the field.

Of course, understanding how important testing is doesn’t alter the fact that it can be very dull. The other factor that can help to motivate programmers towards testing is to make the testing as programmatic as possible. Programmers generally enjoy writing code; if much of their testing involves writing test harness code, then this will be more enjoyable than, say, manually driving a GUI through a pre-determined sequence of test steps. Programmatic test code also helps to make it easy to repeat tests with minimal effort later (see Section 6.4 [Regressibility], page 49). However, remember that test code is still code—and so still contains bugs itself. It’s often worth deliberately and temporarily inserting a bug into the product code to confirm that the test code will indeed raise an alarm appropriately.

2 Donald E. Knuth, Preface to “TeX: The Program”.

In addition to an acknowledgement of fallibility, good testing also involves a particular mindset: vicious and bloody-minded. A good tester is actively looking for things to fail, and trying as hard as possible to generate that failure. Experience of the kinds of bugs that show up in software gives a good toolbox of nasty things to try.

• Range checks: If the code expects to receive values between 1 and 100, does it cope with 0, 1, 99, 100, 101? Does it cope with 1,000,000? The specification for your system may not have included a need for it to scale that high, but it’s good to know in advance what would happen if someone tries—does it fail gracefully or catastrophically?

• Unchecked input: If the code is expecting an integer, what happens if it gets a string or a floating point number? If a text field is a fixed size, what happens if too much text is entered? What happens if the user enters 29th February into a date field? On a non-leap year? What happens on a web system if HTML is included in an entry field that expects plain text? What happens if a malformed packet arrives? What happens if the length fields for the components of a serialized data structure don’t add up?

• Timing windows: What happens when lots of different things happen at once? What if the user deletes the file at the same time as getting the application to perform a time-consuming operation on it?

• Scalability: What happens when just lots of things happen? What happens when the disk is full? What happens when a file gets too big for the filesystem? What happens if you have 2^12, 2^24 or 2^32 pieces of data?
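Many of these nasty cases can be captured as cheap, repeatable test code. A minimal sketch in C, assuming a hypothetical percentage_valid() function whose specification says it accepts values from 1 to 100:

```c
#include <assert.h>

/* Hypothetical validator for a field specified to hold 1..100. */
static int percentage_valid(long v) {
    return v >= 1 && v <= 100;
}

/* Boundary-value tests: values just inside, just outside, and far
 * outside the specified range. */
void test_percentage_range(void) {
    assert(!percentage_valid(0));
    assert(percentage_valid(1));
    assert(percentage_valid(99));
    assert(percentage_valid(100));
    assert(!percentage_valid(101));
    assert(!percentage_valid(1000000));  /* rejected gracefully, not a crash */
}
```

Once a check like this exists, adding the next nasty value a tester thinks of is a one-line change, and the whole set can be re-run on every build.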

With all of this cunning testing, it’s easy to forget the essentials: testing the system in the kinds of scenarios that will happen in the real world. For networking software, this involves testing that the code interoperates with standard networking equipment, such as Cisco and Juniper routers. For web-based software, this means checking that all of the web pages work with the browsers that people use—Internet Explorer, Firefox, Safari—in the configurations they use them (e.g. 1024x768 for laptop users). For systems that have a user interface, it involves checking that common actions are easy to perform and that there are no glitches which would infuriate heavy users. Use cases from the original requirements process can be very helpful for this kind of testing (see Section 2.2 [Use Cases], page 8).

6.4 Regressibility

Once a test has been done, how easy is it to repeat the test? If the test succeeded at some point, is it easy to confirm that later versions of the code still pass the test? If the test case failed and a bug had to be fixed, can we check that the bug hasn’t reappeared?

These questions give us the phrase regression testing—confirming that later code incarnations have not regressed the correct behaviour of earlier code versions. To make answering these questions feasible and efficient, the testing framework for the code should be regressible.

The key motivation behind this desire for regressibility is that bugs should only ever need to be fixed once, because fixing bugs takes valuable time that could have been spent writing more code. A large part of only fixing bugs once resolves down to good coding and design practices: if there is only a single component with a particular responsibility, and no code is ever copied and pasted, then it’s unlikely that the same bug will occur in multiple places in the code.

The rest of what’s needed to make it unlikely that bugs will ever be fixed more than once boils down to the overall framework for software development, especially the testing framework:

• Software identification & versioning: It must be possible to exactly identify which versions of the code are implicated in a particular test case (whether successful or failed)—which means that reliable code control, build and versioning systems are needed (see Section 4.6.1 [Revision Control], page 34).

• Bug tracking: Developers and testers need to be able to report and track bugs in the software in a way that makes it easy to spot duplicate bug reports, and to correlate bug reports with the code versions (see Section 4.6.2 [Other Tracking Systems], page 35).

• Test case tracking: Likewise, testers need to be able to track which test cases correlate to which potential bugs, on what dates.

• Automation: Whenever possible, test cases should only need human interaction when they are initially written and first run; after that, it should be possible to automatically run the test case and programmatically check for success or failure without human intervention.

A large battery of automatically-run, thorough tests that can be run at the click of a button gives a huge boost to the confidence levels about the safety of a new piece of code (particularly for properly paranoid developers, see Section 6.3 [Psychology of Testing], page 48).
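One way such a battery can be wired up is as a table of self-reporting test functions. This is only an illustrative sketch (run_all_tests and the individual test names are invented, not a prescribed framework):

```c
#include <assert.h>
#include <stdio.h>

/* Minimal sketch of a programmatic test runner: each test reports
 * pass/fail without human intervention, so the whole battery can run
 * from an overnight build.  All names here are illustrative. */

typedef int (*test_fn)(void);    /* returns 1 on pass, 0 on fail */

static int test_arithmetic(void) { return 2 + 2 == 4; }
static int test_boundary(void)   { return 100 <= 100; }

/* Runs every registered test; the failure count doubles as an exit
 * code that the build system can check mechanically. */
int run_all_tests(void) {
    struct { const char *name; test_fn fn; } tests[] = {
        { "arithmetic", test_arithmetic },
        { "boundary",   test_boundary },
    };
    int failures = 0;
    for (unsigned i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) {
        if (!tests[i].fn()) {
            printf("FAIL: %s\n", tests[i].name);
            failures++;
        }
    }
    return failures;
}
```

Because the result is a single number rather than a human judgement, the overnight build can fail mechanically whenever any regression appears.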

Of course, this isn’t always possible. GUIs in particular are often difficult to test in a regressible manner, even with input-replay tools designed for this situation. However, as described in the next section, using a design (such as the Model/View/Controller (MVC) design pattern) that makes the UI as trivial and separable as possible will alleviate this problem.

6.5 Design for Testability

The previous section argued that to get high-quality, reliably-tested software, the whole development system needs to allow automated, thorough testing of the code.

However, this isn’t something that can be bolted on after the fact; the code and the development framework need to be designed from the ground up to allow for the regressible test framework.

A trivial example should help to make this clear. Imagine a software system that needs to run on a machine with limited memory—it might be a super-huge database system that exhausts the virtual memory of a mainframe, or it could be a game for a mobile phone with a small fixed amount of flash RAM. This software needs to be able to cope with the memory running out; in C terms, all calls to malloc need to cope with a null pointer being returned.

How do you test whether all of this code really does cope with null? One possibility is to set up a test machine with limited amounts of memory, and try to tune things so that all of the possible malloc calls get hit. However, this probably isn’t going to be a test that’s easy to repeat, which augurs long nights debugging memory allocation problems in the future.

Instead, if the code were originally designed and written to call a wrapper function rather than malloc3, then it would be easy to replace that wrapper with code that generates allocation failures on demand. That way, a test case could (say) set things up so that the fourth call to the malloc wrapper fails, and thus hit whatever arms of code are needed without any of that tedious mucking about with exhausting real memory.

This is a fairly straightforward example, which is a particular instance of a wider approach to aid testability: wrap all system calls, so that fake ones can be substituted when necessary for testing.
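A minimal sketch of the malloc wrapper described above, with on-demand failure injection (the names xmalloc and fail_at_call are invented for illustration):

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of a malloc wrapper with on-demand failure injection; the
 * names xmalloc and fail_at_call are invented for illustration. */

static int alloc_count = 0;
static int fail_at_call = 0;    /* 0 = never fail; N = fail the Nth call */

void *xmalloc(size_t size) {
    alloc_count++;
    if (fail_at_call != 0 && alloc_count == fail_at_call)
        return NULL;            /* simulated allocation failure */
    return malloc(size);
}

/* Product code calls xmalloc() instead of malloc() throughout, so a
 * test case can make, say, the fourth allocation fail deterministically:
 *
 *     fail_at_call = 4;
 *     run_scenario();   // exercises the out-of-memory arm of the code
 */
```

In production builds the wrapper compiles down to a plain malloc call; only the test build ever sets fail_at_call.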

The idea of wrapping system calls is itself a particular instance of a wider strategy for allowing the thorough, regressible testing that’s needed for top-quality software systems: eliminate all sources of non-determinism in the system. Another facet of this strategy is to tightly control multithreaded code (see Section 4.3 [Multithreading], page 30), so that all inter-thread interactions can be intercepted and simulated for testing, if necessary. Similarly, wrapping any timer-related code allows deterministic testing of timing window cases.

3 In C++, the equivalent can be done in a straightforward way by overriding operator new.
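Timer wrapping can follow the same pattern as the malloc wrapper; a sketch, with invented names (current_time and fake_now are illustrative, not real APIs):

```c
#include <assert.h>
#include <time.h>

/* Sketch of wrapping the system clock so that tests can control "now";
 * current_time() and fake_now are invented names for illustration. */

static time_t fake_now = 0;     /* 0 = use the real clock */

time_t current_time(void) {
    return fake_now != 0 ? fake_now : time(NULL);
}

/* Timer-driven code that calls current_time() instead of time() can be
 * stepped deterministically: a test sets fake_now, runs the code under
 * test, then advances fake_now to simulate an expiring timeout. */
```

This turns timing-window tests from a race against the real clock into an ordinary, repeatable sequence of function calls.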


The Model-View-Controller architectural design pattern described elsewhere (see Section 3.2.1 [Component Responsibility], page 13) also illustrates this principle: the individual parts of the pattern can each be tested in a predictable way.

To implement this strategy of controlling non-determinism in the system, it has to be planned for and designed for from the beginning. The level of software quality required should be decided at the start of the project (see Section 2.3 [Implicit Requirements], page 9), and the only way to reach the higher levels of quality is to plan for the testing framework at the design stage.

Another example: planning for coverage testing (that is, tracking exactly which lines of code are hit during the test cases). Code coverage numbers are easily generated on desktop systems and above, but generating them on embedded systems may involve building in statistics by hand.

It is worth pointing out that this correlation between quality level and testing frameworks goes both ways: there are plenty of types and styles of software where a fully-regressible, wrapped-system-call, automated test system is overkill. To return to the malloc example, for many “normal” applications running on a desktop system, code that checks for a failed allocation (and tests that exercise that code) are probably overkill—if the virtual memory system of modern operating systems fails to allocate 100 bytes, then the machine is probably on its way down anyway4.

6.6 Debugging

Debugging goes hand in hand with testing: once a test case uncovers a bug, that bug will (usually) need to be fixed. Debugging always extends beyond any formal test phase; no matter how conscientious and comprehensive the testing, the real world always turns up things that weren’t covered. To put it another way, a horde of real users (or an interoperability test with a real device) is a whole new testing experience: it’s hard to make anything idiot-proof, as idiots are so very ingenious.

The most important things to understand about debugging are the factors that make debugging easier: diagnostics and determinism.

Diagnostics are vitally important to be sure that a problem has really occurred, and to understand the steps that led up to the problem. In an ideal world, a suitable collection of diagnostic facilities will be built into the system for this (see Section 3.2.4 [Diagnostics], page 15). In a less ideal world, one of the steps involved in tracking down a tricky problem will often be the addition of more diagnostics.

Determinism is even more important for debugging than for testing (see Section 6.5 [Design for Testability], page 50), because the reliable link between cause and effect allows the investigation to proceed in logical steps. In testing, determinism ensures that the output of a test is known, and so it can be checked for correctness. In debugging, determinism allows the failure case to be repeated, as the investigation narrows in on the source of the problem.

Without these two key pillars of debugging, finding the root cause of a problem is much more difficult. For example, debugging multithreading problems is notoriously difficult (see Section 4.3 [Multithreading], page 30) because the determinism requirement fails so badly.

4 One unashamedly pragmatic approach to a failed malloc call is for the program to just sleep for a while; that way, the first program with a core dump in the system error logs will be someone else’s.


7 Support

The support phase of a software project happens when the code is out there working in the real world, rather than in the development environment. In the real world it’s exposed to an unpredictable new factor: users. These users will complain of problems in the software, both real and imagined; the support organization for the software has to deal with these complaints.

Depending on the type of the software and the type of customer, the support organization may well be split into a number of levels. For example:

• Level 1 support might deal with answering the support phone line and confirming that the problem is in the supported software.

• Level 2 support might install missing product upgrades and patches, and might spot duplicates of already-known problems.

• Level 3 support might guide a user through the process of running test scenarios and gathering diagnostic information.

• Level 4 support might be the support engineers who actually have access to the source code and are able to make fixes and generate patches.

In this chapter, we’re concentrating on the highest level of support: when a (potentially) genuine bug reaches a team that has the capability of fixing that bug. This situation is a little bit different from software development—and it’s an area that is oddly absent from software engineering literature1.

7.1 Customers

Somewhere on the trail between the users and the software developers are the customers—the people who are paying for the software and as a result expect it to work.

Entire books have been written about customer management, but there are some core essentials that can make the difference between a happy customer and a customer who’s reaching for the nearest lawyer.

The main idea is to build a relationship with the customer, so that they feel that they can trust the support organization. Having a consistent point of contact helps with this, as does providing regular updates on the status of any outstanding support requests.

The best way of getting a customer to feel that they can trust the support team is if they actually can trust the support team—in other words, if the support team never lies to the customer, no matter how tempting it might be. In the long term, the truth is easier to keep track of and there’s no chance of being caught out. Expressing things as probabilities rather than certainties can help with this: “I’m 80% sure we’ve found the problem” is often more accurate than “We’ve fixed that already, it’s on its way”.

It’s also worth being as up-front as possible about bad news. If a problem can’t be fixed, or if the timescales are going to be much worse than the customer needs, it’s better to tell them sooner rather than later—if nothing else, it gives the support team a longer period in which to calm the customer down.

Finally, for many software development sectors it’s worth bearing in mind the problem of requirements (see Chapter 2 [Requirements], page 7). The customer often knows the problem domain much better than the software team does, and so querying the customer for more information on what they’re really trying to do can often illuminate frustrating problems.

¹ Perhaps again reflecting most software engineers’ deeply-held belief that their code has no bugs.

Chapter 7: Support

7.2 Learning a Codebase

Support engineers often find that they need to learn about an unfamiliar codebase in a hurry. Ideally, the first place to look for general information about the software is the documentation (see Section 3.4.3 [Documentation], page 27). However, in practice it’s often necessary to work things out from the source code alone.

The first technique for ramping up on a new codebase from its source code is to follow the flow of the code. Starting from main, and with the assistance of a source code navigation tool, build a picture of the call hierarchy of the program. It’s helpful to write this hierarchy down on paper—that way you’re forced to actually look at the code in detail, so you’re more likely to remember it, and more likely to spot unusual things that bear further examination.
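As a toy illustration of building such a call hierarchy (the function names here are invented, and a real session would use a navigation tool such as cscope, ctags or an IDE rather than regular expressions), here is a sketch that derives an indented hierarchy from C-like source:

```python
import re

def call_hierarchy(source, root="main"):
    """Crude sketch: derive an indented call hierarchy from C-like source.

    It only matches simple one-level function definitions, which is enough
    to illustrate the kind of picture worth writing down on paper."""
    # Map each function name to its body text (naive brace matching).
    defs = {m.group(1): m.group(2)
            for m in re.finditer(r"\w+\s+(\w+)\s*\([^)]*\)\s*\{([^}]*)\}", source)}
    lines = []
    def walk(fn, depth, seen):
        lines.append("  " * depth + fn)
        for callee in re.findall(r"(\w+)\s*\(", defs.get(fn, "")):
            if callee in defs and callee not in seen:
                seen.add(callee)
                walk(callee, depth + 1, seen)
    walk(root, 0, set())
    return lines

toy_source = """
void parse(char *s) { }
void init(void) { parse(0); }
int main(void) { init(); parse(0); return 0; }
"""
hierarchy = call_hierarchy(toy_source)
```

The point is not the tooling but the output: an indented list, one function per line, that can be copied onto paper and annotated.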

The next technique that helps to understand an unfamiliar codebase is to follow the flow of the data. Start with global variables, which typically hold the roots of any collections of data (and which are easy to spot, with nm if nothing else). Follow this with a search for any dynamically created data structures (by searching for new or malloc), to determine how they are organized and what their lifetimes are. It’s usually worth sketching a diagram of the main data structures and how they inter-relate as you go, whether in UML or some less formal notation. Again, doing this by hand rather than by using an automated system forces you to actually look at the code, and to make judgements about which data structures are essential and which are merely auxiliaries.

Finally, it’s always helpful to work through the flow of the code and the data for the key scenarios that the code supports (see Section 2.2 [Use Cases], page 8). This can be done in a debugger, by examining comprehensive trace output (see Section 4.5 [Tracing], page 31), or just by stepping through the code in an editor.

7.3 Imperfect Code

Although this book as a whole is aimed at the business of producing high quality code, the support phase for a product sometimes involves dealing with the consequences of low quality code. In this situation, pragmatic compromises are often necessary—an ideal solution to a slew of bugs might be to re-write a particular component, but that’s not always possible under pressure of deadlines and resources.

So how to make the best of a bad situation?

The first recommendation for improving the quality of the codebase over the longer term is to improve the testability of the code. It’s easy to get into a situation where every new fix to the codebase induces a new bug somewhere else, like trying to press air bubbles out from under linoleum. This is particularly bad because it generates a position where copy and paste becomes the pragmatically correct thing to do: if scenario A shows up a bug in code X, but code X is also used for scenarios B, C, D, E . . . that can’t be tested, then the safest thing to do is to clone X so that just scenario A uses the copied, fixed code X′. The best cure for this is to build a suite of regression tests that can catch these potentially-induced problems; this makes it much safer for support engineers who don’t know the codebase very well to make fixes.

Adding new test cases and extending the test framework will automatically result in the next recommendation, which is to make fixes slowly and thoroughly. For poorer quality code, it’s commensurately more important to run related test scenarios, to add or update comments, to hunt for similar code that might need the same fix, to update what documentation exists and so on.

In particular, refactoring related code as part of a bug fix helps to improve the codebase for the next time around—for example, merging copied and pasted code into a single common function can often be done safely and reliably.

These may not sound like vast improvements, but over time they can make a substantial difference. One (not six-nines!) codebase I worked on had a reputation as a support black hole, needing one or two support engineers on it full time; nine months later, this was down to one or two days a week from a single engineer (an 80% reduction in the support load). The difference was that the new, more skilled, support team were more careful and took longer over each fix, and so rarely introduced new bugs.

7.4 Support Metrics

A key premise of this book is that the up-front investment in higher-quality software pays off in the long term. The support phase of a project supports² this assertion by providing the statistical data to back it up.

Generating these statistics needs raw data; the first place this comes from is the bug tracking system. A bug tracking system makes sure that no bugs get lost; there’s either a fix or an explicit decision not to fix the bug. Consistently including a little more tracking information (some of which can be tracked mechanically) into the system provides a wealth of additional statistical data:

• Bug severity: How severe is the problem? This might range from an entire system crash with lost and corrupted data, to a typographical error in some displayed text.

• Bug urgency: How important is the bug? This may not be correlated with severity—a glaring typo on the welcome screen might be more urgent to fix than a system crash that only happens in extremely obscure and unlikely situations.

• Time invested: How much time is involved in generating the fix? This could be the calendar time elapsed between reporting and fixing the bug, or (more usefully) the amount of developer resource invested in fixing the bug.

• Extent of code change: How many lines of code were changed as part of the fix, and by whom?

• Cause of bug: Was the bug induced by an earlier change, or has it lurked undetected for a while?

All of this data reveals any number of interesting things about the codebase (assuming enough data has been accumulated to make the statistics sound):

• Which components and modules have had the highest number of bugs, per line of code? This may indicate components that are worth reviewing, refactoring or even rewriting.

• Which areas of code have had the largest amount of developer effort invested in fixes for them? Again, this may indicate poor components—or just poorly documented components.

• Which developers have fixed the most bugs? And what proportion of those fixes have induced more bugs later? This may indicate which members of the support team need to slow down and be more thorough.

• Which bugs were the most difficult to fix? This might be indicated by bugs that had a lot of time and resource invested in them, but which only resulted in a small code change. This may indicate a flaw in the underlying design that needs to be re-examined, or maybe that the testing and diagnostics features of the code are insufficient.

These can all help direct refactoring and support efforts to where they’ll be most effective. However, as ever when metrics are measured, remember that people will adjust their behaviour to optimize the metric rather than the (more nebulous) actual quality.
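As a small illustration of what this tracking data makes possible (all bug records and component sizes here are invented), a sketch that computes bugs per thousand lines of code per component, and picks out the hardest fix by the ratio of time invested to code changed:

```python
from collections import Counter

# Hypothetical bug records carrying the tracking fields discussed above.
bugs = [
    {"component": "parser",  "severity": "crash", "days_to_fix": 5, "loc_changed": 12},
    {"component": "parser",  "severity": "typo",  "days_to_fix": 1, "loc_changed": 2},
    {"component": "backend", "severity": "crash", "days_to_fix": 8, "loc_changed": 4},
]
component_loc = {"parser": 4000, "backend": 12000}  # lines of code per component

def bugs_per_kloc(bugs, component_loc):
    """Bug rate per thousand lines of code, per component."""
    counts = Counter(bug["component"] for bug in bugs)
    return {c: 1000.0 * counts[c] / loc for c, loc in component_loc.items()}

rates = bugs_per_kloc(bugs, component_loc)

# A bug with much time invested but little code changed may point at a design flaw.
hardest = max(bugs, key=lambda b: b["days_to_fix"] / max(b["loc_changed"], 1))
```

With real data accumulated over months, the same few lines of analysis are enough to rank components for review or rewrite.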

It’s surprisingly rare for there to be up-front estimates of the cost of support for software projects. When there are such estimates, they’re usually too low—because the estimates are made by the development engineers, who:

• have little or no experience in support

• believe their own code to be bug-free anyway (see Section 6.3 [Psychology of Testing], page 48).

² Pun unintentional.

With comprehensive measurements of the support costs of previous projects in place, it’s much easier to make these estimates more accurate. By correlating across different styles of development for the original development projects, this data can also convince higher levels of management of the long term benefits of higher quality software development processes.

8 Planning a Project

High quality software involves more than just writing better code, or even than just ensuring that the code is tested rigorously. Reliability is not an accident; it is designed for and planned for.

This chapter discusses that process of planning a high-quality software project. As with other places in this book, the general principles are shared with other forms of software development, but high-quality development needs to be more predictable. Planning has to be more accurate since there’s much less scope for dropping features as the ship date looms. The compensating factors that make it possible to plan more accurately are that

• the specification is more likely to be fixed and coherent (see Section 2.1 [Waterfall versus Agile], page 7) than for other development sectors

• the development is more likely to use tried and proven technology (see Section 3.2.5 [Avoiding the Cutting Edge], page 16).

8.1 Estimation

Early on in the career of a software engineer, they are almost certain to get asked to estimate how long a particular piece of code will take.

By default, programmers are inherently dreadful at coming up with estimates. I’ve never encountered a new programmer who was naturally able to come up with accurate estimates, no matter how skilled they were in the actual business of writing code—and their estimates are always too low, never too high.

As they get more experienced, programmers do slowly become more accurate in their estimates, but often in a very ad-hoc way: they take their original estimates and just double (or in some cases triple) them. As the project progresses, if they end up with too much time, they can then find some interesting extra things to code in order to fill up the time.

It’s not just the lower echelons of the software world that have problems with estimation; IT projects in general are legendary for their overruns (in both calendar time and resource time) and their failures to deliver. Getting the plan seriously wrong is anathema for high-quality software development—the first thing to get dropped is always the quality bar for the code.

So what’s the best way to tackle the problems of estimation?

8.1.1 Task Estimation

The first stage is to improve the accuracy of developers’ estimates of individual tasks, and experience is the best way to do this. From an early stage, it’s important that even junior developers should be coming up with sizings for their tasks, and should then be comparing the reality with the estimate once the dust has settled.

Learning quickly involves getting as much information as possible out of the data. To get the most out of experiences with estimation, developers need to preserve their original estimates, track their progress against them as the task progresses, and perform a retrospective analysis after the task is done. The last is key: looking back at why and how the estimates were missed both provides a quantitative analysis of the process and helps to grind down the insane over-optimism that so many young developers have. Having a gung-ho, “can do” attitude is all very well, but when applied to estimation it causes disasters.

For a project manager, one particularly mean-but-educational trick is to get a junior developer to commit to a completion date for a task based on their own hugely optimistic estimates—while secretly expecting and planning that the task will take rather longer. Either the developer will be embarrassed because they miss their predictions massively, or they’ll find themselves having

to work ridiculously hard to hit them. In either case, it helps to bring home the problems of under-estimation in a visceral way—and they’ll be more cautious the next time around.

For a developer to take an estimate seriously, they have to agree that the estimates are reasonable and sensible. Getting the developers to come up with their own estimates is one way to do this; another is to make sure they are able to give feedback when other people’s estimates are imposed on them. This is particularly true when they think the imposed estimates are too low—to do otherwise puts the developer in a position where they can win an argument by doing less work.

Another thing to emphasize is that estimates are supposed to be wrong—that’s why they’re estimates. The only way to get a completely accurate estimate is to give the estimate after the task has been done, which is hardly useful. What’s important, therefore, is that the margin of error isn’t too large, and that there are errors in both directions: sometimes over-estimating, sometimes under-estimating, so that in the long run things balance out.

To reduce the margin of error, it’s often worth coming at the estimate from several different directions, and comparing the results. Get two different people to estimate the same task, or estimate it in terms of lines of code, man-days, percentage size of a related chunk of existing code, etc. (see Section 9.1.1 [Metrics], page 67).

On an over-estimated task, it’s very easy for a developer to pad out the work so that their task comes in exactly on schedule. To avoid this, it’s worth emphasizing the inherent imprecision of estimates—good estimates are expected to come in early sometimes as well as late. That way, if a developer has over-estimated one out of several tasks, they will bring that task in early so that it can balance out other tasks that are coming in late.

To put it another way, consider the distribution of the errors in a collection of estimates. An ideal situation is that these errors should form a bell curve around zero, with as narrow a distribution as possible. The more uncertainties there were in the original estimates—perhaps because the estimates were made before enough details were known, or because the project relies on some new technology (see Section 3.2.5 [Avoiding the Cutting Edge], page 16)—the larger the standard deviation of the errors. If developers are constantly padding out their work on over-estimated tasks, and slipping under-estimated tasks, then the distribution will look very different from this symmetrical bell curve.
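This check is easy to run on recorded estimates; a sketch with invented (estimate, actual) figures in developer-days, where a mean error well away from zero signals systematic bias (almost always under-estimation) and the standard deviation measures how wide the bell curve is:

```python
from statistics import mean, stdev

# Hypothetical (estimate, actual) pairs in developer-days for completed tasks.
tasks = [(5, 6), (10, 9), (3, 4), (8, 8), (4, 3), (6, 9)]

errors = [actual - estimate for estimate, actual in tasks]
bias = mean(errors)     # positive means consistent under-estimation
spread = stdev(errors)  # narrower is better; new technology widens it
```

For this toy data the bias is +0.5 days per task, so on this evidence the team slightly under-estimates overall, even though individual tasks err in both directions.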

8.1.2 Project Estimation

Once developers are capable of making sensible estimates for individual tasks, the more daunting problem is how to put a collection of such estimates together to produce an overall project plan.

Building such a collection of estimates can’t be done without a design. The design describes how the big problem is subdivided into smaller problems, and more detailed designs break those in turn into components, subcomponents, objects, tasks and subtasks. As the design becomes more detailed, so the accuracy of the overall estimates becomes higher.

To get a decently accurate project plan, the design should ideally include a breakdown detailed enough that all of the individual subtasks have estimates in the range of person-days or (low) person-weeks. A task with a higher estimate implies that the relevant function still has some question marks associated with it.

As an example, Figure 8.1 shows a (fictional) project plan for an implementation of a routing protocol. The plan has been put together in a straightforward spreadsheet, rather than a dedicated piece of planning software, because this best illustrates the core of what’s going on.

Figure 8.1: Example initial project plan

There are a few things worth pointing out in this example.

• The major sections in the plan correspond to components in the overall architecture of the product, and the individual task lines correspond to particular subcomponents. From this it’s clear that the project has already had some design thought put into it.

• The individual tasks have been estimated in terms of both resource days (for an average developer) and lines of code. Most of the time there’s a close equivalence between the two (around 10 days to implement 1000 lines of code), but some particular tasks are different.

• Various spreadsheet features have been used to make the plan as convenient as possible to read (analogously to naming conventions in code).

• Indentation is used to indicate components and sub-components.

• Grouping of components and sub-components allows summary information to be easily displayed, by collapsing the appropriate grouping levels.

• Derived numbers such as sub-totals are highlighted (in blue italics here).
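The arithmetic behind such a spreadsheet is simple enough to sketch directly (the plan rows here are invented); the same pass that derives the component sub-totals can flag tasks that stray far from the rough 10-days-per-1000-lines equivalence:

```python
# Hypothetical plan rows: (component, task, estimated_days, estimated_loc)
plan = [
    ("protocol", "message encoding", 10, 1000),
    ("protocol", "state machine",    15, 1500),
    ("ui",       "status display",    8,  300),
]

def subtotals(plan):
    """Per-component (days, loc) sub-totals, as a spreadsheet would derive."""
    totals = {}
    for component, _task, days, loc in plan:
        d, l = totals.get(component, (0, 0))
        totals[component] = (d + days, l + loc)
    return totals

# Flag tasks more than 2x away from the ~10 days per 1000 lines rule of thumb.
outliers = [task for _c, task, days, loc in plan
            if not 0.5 <= (days / (loc / 1000.0)) / 10 <= 2]
```

An outlier isn’t necessarily wrong—some code really is harder per line—but each one is worth a second look before the plan is committed to.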

Of course, a detailed and thorough project plan isn’t always achievable in the real world, and so the project plan has to cope with the uncertainty involved in less granular estimates. This is likely to involve a number of “plan B” scenarios (see Section 8.4 [Replanning], page 66).

• Prototype key areas of uncertainty in the design to confirm that the suggested technical approach is likely to work and to scale appropriately.

• Defer functionality to a later release; as such, it’s worth identifying early on which are core product features and which are “WIBNIs” (“Wouldn’t It Be Nice If . . . ”).

• Defer some testing to a later phase—while planning for the expected consequences of this, such as higher support loads, more patches, reduced reputation etc.

• Move the release dates, again allowing for potential consequences (lost customers, penalty clauses, missed market opportunities).

• Add more resource to the project, provided that this is really going to help, rather than distract and dilute the core project team (which is all too common when manpower is added to a late project).

As described in the previous section, it’s also important to bear in mind the statistical nature of estimates. Across the entire project, if the over-estimations and under-estimations balance out, then the project will be in pretty good shape. On the other hand, if there is a consistent bias in the estimates (and such a bias is bound to be in the direction of under-estimation), then that puts the project in jeopardy.

However, the more common reason for project disasters is forgetting an entire task. Sometimes this is a low-level technical task that gets missed because of a miscommunication between the designers; more often, it’s a task that’s not part of the core of the project but which is still vital to shipping the software out of the door.

Here’s a list of peripheral tasks that can take a stunning amount of resource, and are often forgotten in the grand plan.

• Holidays, illness and staff turnover.

• Personnel matters: reviews, appraisals, staff meetings.

• Setting up and running the build system (see Section 4.6.4 [Building the Code], page 37) and other systems (see Section 4.6.2 [Other Tracking Systems], page 35).

• Ramping up on new technologies.

• Installing, managing and fixing test machines.

• Packaging and delivery (see Section 4.6.5 [Delivering the Code], page 38).

• Documentation and training.

• Initial support, particularly for beta releases.

• Interactions with sales and marketing; for example, setting up and running a decent demo can lose weeks of project team resource.

• Internationalization and localization, including (for example) early UI freezes to allow time for translation (see Section 4.2 [Internationalization], page 29).

In the end, the actual coding phase of a software project forms a minority of the time spent.

The key tool to deal with the risk of a forgotten task is contingency—keeping a pot of time in the project plan that can be used to mop up these unexpected obstacles (or to mop up the effects of a consistent under-estimation). The size of this pot can vary from 10% to 30% or even higher, depending on how predictable the project is.

If the project is a “cookie cutter” project, very similar in objective to a number of previous projects, then the contingency can be kept low. If the project involves a new team, new technologies, new hardware, new tools—all of these factors increase the amount of contingency needed. After the project, how the contingency pot was used provides lots of useful information for the next project that follows in this one’s footsteps.
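One illustrative way to turn this judgement into a number (the 5% increments here are invented for the sketch, not a rule from the text): start at 10% for a cookie-cutter project and grow the pot towards 30% as risk factors accumulate:

```python
# Invented risk factors and weightings, for illustration only.
RISK_FACTORS = ("new_team", "new_technology", "new_hardware", "new_tools")

def contingency_fraction(risks):
    """10% for a cookie-cutter project, +5% per risk factor, capped at 30%."""
    present = sum(1 for factor in RISK_FACTORS if factor in risks)
    return min(0.10 + 0.05 * present, 0.30)

# A 200 developer-day plan with a new team and new technology:
pot = 200 * contingency_fraction({"new_team", "new_technology"})
```

The precise weightings matter less than having an explicit, recorded rationale that the next project can compare against actual contingency usage.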

Returning to our previous example plan, we can now update it to include all of the activities over and above writing the code (Figure 8.2). This includes the various test phases for the product (see Section 6.1 [How], page 46), together with a collection of the extra activities mentioned above.

Figure 8.2: Fully burdened example plan

This example plan includes quite a large pot for “management”, so it’s worth exploring what’s included. This covers all of the time needed for formal reviews or appraisals, plus time taken in teaching and mentoring the team. It includes the time needed for assessing whether the project is running to plan or not, and to gather and analyse the metrics that lead up to that assessment. It also includes the time spent being the ambassador for the project to the rest of the organization—presenting status information, interacting with other teams, dealing with equipment logistics and so on.

With all of these tasks included under the management heading, a useful rule of thumb is that it takes roughly one day per person per week. This means that teams with more than five people tend to have weak points, unless some aspects of this management role are shared out with other members of the team. It is possible to squeeze by with less time spent on management, but in the long term the development of the team and the code will suffer.

8.2 Task Division

fungible /adj./ replaceable by another identical item; mutually interchangeable¹.

Programmers are not fungible. Firstly and most importantly, programmers’ skill levels range hugely—the best can be many times more productive than the average, and the worst can actually incur a net loss². Even among software engineers of equivalent skill, some will have particular areas of specialty.

¹ Concise Oxford English Dictionary, eleventh edition.

These factors induce yet another challenge when planning a project: how to divide up the various tasks of the project among the development team.

To deal with the overall variance of productivity among the various team members, it’s important to be clear about what units the plan is in:

• If task estimates are all quoted in terms of a standard “average developer-day”, then the amount of calendar time needed to complete a task will depend on who’s assigned to the task.

• If task estimates are quoted in terms of how long the specifically-assigned developer will take to complete the task, then calendar time is easier to calculate but changing task allocations is more complicated.

Either approach is feasible, but mixing up the two is a recipe for disaster.

In the short term, dealing with people’s particular skills is much easier—just assign the team members to the tasks that they are good at. This hugely reduces the risks: to the experienced developer, repeating a type of task they’ve done before is a “cookie cutter” experience; to a novice team member, the same task is the cutting edge (see Section 3.2.5 [Avoiding the Cutting Edge], page 16).

In the longer term, however, this is a short-sighted approach. If the same team member always does a particular type of task, then they may get bored and leave—at which point, the team is in trouble because there’s no-one else who can do that task. As such, it’s often a good idea to pair up an expert in a particular area with another team member who can learn from them.

² Their work can be so poor that it has to be abandoned and redone from scratch by someone more competent.

Figure 8.3: Example plan including task allocations

Returning once again to our sample plan (Figure 8.3), the relevant updates are as follows.

• The larger and more significant development tasks have been broken down into more detail, splitting out separate tasks for low-level design, design review, code, code review and unit test. (Only a few of these breakdowns are shown in the diagram for space reasons.)

• An extra column indicates which member of the team has been assigned to each task; in this case, the team involves three developers (JQB, ANO and JHC) and a manager (DMD).

• A summary table at the bottom of the spreadsheet indicates how much activity has been allocated to each member of the team (not shown in the diagram). This is useful as a quick guide to indicate who is potentially over-booked.
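Such a summary table amounts to very little code; a sketch with invented allocations and availability, flagging both the over-booked and the idle:

```python
from collections import defaultdict

# Hypothetical allocations: (task, engineer, estimated_days), plus availability.
allocations = [
    ("message encoding",   "JQB", 10),
    ("state machine",      "ANO", 15),
    ("status display",     "ANO", 8),
    ("project management", "DMD", 12),
]
available_days = {"JQB": 40, "ANO": 20, "JHC": 40, "DMD": 15}

# Total the booked days per engineer, as the spreadsheet summary would.
booked = defaultdict(int)
for _task, engineer, days in allocations:
    booked[engineer] += days

over_booked = [e for e, days in booked.items() if days > available_days[e]]
unallocated = [e for e in available_days if booked[e] == 0]
```

Here the summary immediately shows one engineer booked beyond their availability and one with nothing assigned at all—exactly the problems the summary table exists to surface.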

Once some specific tasks have been assigned to particular team members, all of the remaining tasks need to be assigned times and people—in other words, the project needs to have a rough timeline, in the form of a Gantt chart or equivalent. There are a number of software products that aim to help with this, but I’ve generally found that a simple paper-based approach works well. Starting with sheets of lined paper:

• Decide on a scale—say, one line per day, or two lines for a week.

• Write out a grid with dates down the left hand side of the paper (in accordance with the scale just decided on) and with the names of team members across the top of the paper.

• If there’s a hard deadline, draw a big black line across the page at the relevant place.

• Using more lined paper and a pair of scissors, cut out a little rectangle of paper for each task in the plan. The rectangle should be roughly the same width as the columns drawn on the first piece of paper (the ones labelled with the names of team members), and should be the same height as the estimate you’ve got for the corresponding task. (If the rectangles are more than a few lines high, the scale is wrong or the tasks in the plan aren’t fine-grained enough.)

• Put a small piece of blu-tack on the back of each rectangle, and stick them onto the main piece of paper with no overlaps. The column each rectangle goes in indicates who’s going to do the task; the rows it covers indicate when the task will be done.

• Make sure each person’s column has a few gaps in it, to allow wiggle room for illness, slippage, and other unforeseen events. Leave bigger gaps for less capable team members, who are more likely to overrun on any particular task.

This approach makes any of the common scheduling problems immediately visible:

• If you’ve got any rectangles left over, you’ve not assigned all the tasks.

• If any of the rectangles overlap, then some poor team member is over-assigned.

• If there are large swathes of the underlying piece of paper visible, then some team member is under-assigned.

• If any of the rectangles are below the big black line, then your plan doesn’t hit your deadline.

Simple but effective.

Figure 8.4: Primitive Gantt chart technique
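The same checks can be mechanized once the rectangles are written down as data; a sketch with an invented schedule (per-person start day and duration for each task), reporting overlaps and tasks that fall below the deadline line:

```python
# Hypothetical schedule: task -> (person, start_day, duration_days).
schedule = {
    "encoding": ("JQB", 0, 10),
    "decoding": ("JQB", 8, 5),    # overlaps "encoding" in JQB's column
    "ui":       ("ANO", 0, 12),
    "testing":  ("ANO", 12, 10),  # runs past the day-20 deadline
}

def check_schedule(schedule, deadline):
    """Report overlapping rectangles and tasks below the big black line."""
    problems = []
    tasks = list(schedule.items())
    for i, (name_a, (who_a, start_a, dur_a)) in enumerate(tasks):
        if start_a + dur_a > deadline:
            problems.append(("misses deadline", name_a))
        for name_b, (who_b, start_b, dur_b) in tasks[i + 1:]:
            # Two tasks clash if the same person holds both and the intervals overlap.
            if who_a == who_b and start_a < start_b + dur_b and start_b < start_a + dur_a:
                problems.append(("overlap", name_a, name_b))
    return problems

problems = check_schedule(schedule, deadline=20)
```

Unassigned tasks and under-assigned people can be spotted the same way, by comparing the task list and the per-person totals against this schedule.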

8.3 Dependencies

In any project plan, there are some tasks that need to be completed before others can be started. This is glaringly obvious in civil engineering projects (see Chapter 3 [Design], page 10)—there’s no way to build the roof of a building before the walls are done, and the walls can’t get done until the foundation is built.

Similar considerations apply for software engineering projects, but they are typically less important, and there are a couple of key techniques that help to bypass the ordering requirement that dependencies impose.

The main technique is to rely on the separation of interfaces and implementations (see Section 3.1 [Interfaces and Implementations], page 11): if the interface of component A is provided well in advance of the implementation of A, then a component B which relies on A can still be built in parallel with A, using that interface.

Figure 8.5: Building to an interface

If this is done, it’s important to make sure that the implementers of both components stay in communication during the development process. During the implementation of component A, there’s a chance that the details of its interface might need to change somewhat, and those changes need to be reflected in component B. Conversely, the implementer of component B might find that their job would be much easier if the interface of A were changed just a little bit, and discussion with the implementer of A will reveal whether this is feasible or not.

This technique of interface/implementation separation can also be extended if the system as a whole involves a comprehensive internal test framework (see Section 6.5 [Design for Testability], page 50). With just the interface of component A to hand, the implementer of component B can get as far as compiling their code, but no further. However, if an independent test harness for component B is built, that framework can effectively provide an alternative implementation for the interface to A, allowing testing of B before A is written. The test implementation of A can and should be shared with the developer who is implementing the real component A, to avoid nasty surprises when the code gets integrated together.

Figure 8.6: Test driver
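The stub-implementation idea can be sketched in code. A minimal illustration in Python (all names here are hypothetical; the book's own examples are C/C++-flavoured, but the shape is the same): component B is written and tested against the agreed interface of A, with a test double standing in for the real implementation.

```python
from abc import ABC, abstractmethod

class StorageInterface(ABC):
    """Hypothetical interface for component A, agreed before A is implemented."""
    @abstractmethod
    def fetch(self, key: str) -> str: ...

class FakeStorage(StorageInterface):
    """Test double standing in for component A; shared with A's implementer."""
    def __init__(self, data):
        self._data = dict(data)
    def fetch(self, key: str) -> str:
        return self._data[key]

def greeting(store: StorageInterface, user_id: str) -> str:
    """Component B: written and testable before the real A exists."""
    return "Hello, " + store.fetch(user_id)

print(greeting(FakeStorage({"u1": "Alice"}), "u1"))  # → Hello, Alice
```

When the real component A arrives, it implements the same interface and slots in without changes to B.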

For the dependencies that do exist in a software project, there’s a simple extension to the paper approach to Gantt charts described in the previous section (see Section 8.2 [Task Division], page 60). If task A has to be completed before task B can start, then highlight the bottom edge of the task rectangle for A with a colour, and highlight the top edge of B with the same colour. Any broken dependencies then show up as a top edge appearing above a bottom edge in the same colour.
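The same broken-dependency check can also be automated. A hypothetical sketch: given task start/end days and a dependency map, flag any task that starts before its prerequisite finishes.

```python
# Hypothetical task records: name -> (start_day, end_day);
# deps maps each task to its prerequisite tasks.
tasks = {"A": (0, 5), "B": (3, 8), "C": (5, 9)}
deps = {"B": ["A"], "C": ["A"]}

def broken_dependencies(tasks, deps):
    """Return (prerequisite, task) pairs where a task starts before its prerequisite ends."""
    broken = []
    for task, prereqs in deps.items():
        for pre in prereqs:
            if tasks[task][0] < tasks[pre][1]:
                broken.append((pre, task))
    return broken

print(broken_dependencies(tasks, deps))  # → [('A', 'B')]  (B starts on day 3, A ends on day 5)
```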

8.4 Replanning

With the best will in the world, there comes a point in the lifecycle of many (most?) software development projects where it becomes clear that things are not on track—few plans survive contact with reality completely unscathed.

The first step in dealing with this situation is to figure out the extent of the problem, and to confirm that there isn’t enough contingency to cope. Getting a different pair of eyes to look things over can help a lot with this—explaining the situation to someone else can also expose the self-deceptions you’ve hidden from yourself.

Once it’s clear that there is a problem, it’s vital to face up to it as soon as possible, and to be realistic about what can actually be achieved by when. As with bugs in the software, it’s much easier to deal with problems in the plan the sooner they are raised. Despite this, it’s surprisingly common that a missed deadline is only admitted to a week before it happens, even when the impending doom of the project has been obvious to all concerned for months. Raising the problem sooner rather than later allows everyone involved to put contingency plans in place.

So what possibilities are there for these contingency plans?

• Change the delivery dates, if this is possible.

• Change the deliverables, to defer less important functionality to a later release.

• Reduce the quality (typically by reducing/deferring testing), although this counts as a deliverable of sorts (see Section 2.3 [Implicit Requirements], page 9).

• Add resource to the project. Adding personnel to the core team at a late date is likely to have a negative effect, because of the time taken ramping up and communicating with the new team members. However, it may be possible to reduce any extra commitments of the core team, or to find easily separable subtasks which can be hived off to someone external without that external person needing to ramp up on the whole project.

For all of these possibilities, the replanning exercise has also got to take into account any knock-on effects on subsequent projects or other teams.

9 Running a Project

There’s more to project management than just coming up with a comprehensive project plan. This chapter discusses the ongoing process of running a project, with the aim of turning that plan into reality.

9.1 Tracking

Once the planning for a software project is done, the business of actually implementing the software can kick off in earnest. The job of the project manager is not over, though—the progress of the project in reality has to be tracked against the predictions of the plan. This tracking allows progress to be measured, and makes it possible to determine whether the project is in trouble (see Section 8.4 [Replanning], page 66).

The responsibility for tracking the status of the project doesn’t just rest with the project manager, though. Each of the members of the team has to be able to report their own status, and to analyse what they’re spending their time on. Often, supposedly “absorbable” tasks like running the builds, preparing releases, or answering pre-sales enquiries can take up a lot of time, and it’s important that the project manager knows about these ancillary activities (if only so they can allow more time for them in the next project, see Section 8.1.2 [Project Estimation], page 57).

9.1.1 Metrics

In order to be able to extrapolate the status of the project into the future, we need quantitative measurements—metrics.

The most important metric during software development is to determine how much of each subtask has been done. This might be the number of lines of code written, as compared to the expected number of lines of code for the finished component. More likely, this will involve the number of man-days spent so far and the expected number of man-days left to do on the task. Either way, until there is working code, there are few other things that can be measured.
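The man-days metric can be turned into a simple progress figure. A minimal sketch (the numbers are hypothetical): completion is estimated from effort spent so far and the current best estimate of effort remaining.

```python
def percent_complete(done_days, todo_days):
    """Estimate task completion from man-days spent and the current
    estimate of man-days remaining."""
    return 100.0 * done_days / (done_days + todo_days)

# Hypothetical task: 6 man-days spent, best estimate of 4 man-days remaining.
print(percent_complete(6, 4))  # → 60.0
```

Note that the denominator uses the *current* estimate of remaining effort, not the original one, so the figure stays honest as estimates are revised.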

Once again, we can illustrate this with our example project spreadsheet (Figure 9.1). The key things to observe with this iteration are as follows.

• Time tracking for the team on a week-by-week basis has been added on the right hand side of the spreadsheet, and is summarized in the “Done” column. (A more sophisticated spreadsheet might track team members individually).

• The “To Do” column is regularly revisited so that it holds a current best estimate of how much effort still remains for any particular task.

• The “Gain” column holds a running track of how the progress of the project is going in comparison with the original estimates. Comparing this number with the amount of contingency remaining gives a good indication of whether the overall project schedule is at risk.

• Various subtotals are included to make it obvious when incorrect data has been added (which is easy to do). For example, the amount of time spent each week should always add up to 20, as there are four team members working five days a week.

• A missing task (“Test plan”) has been identified, and added as an additional subtask under the “Contingency” section (reducing the unassigned contingency appropriately).
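Two of these spreadsheet checks are easy to script. A sketch with hypothetical task data: the weekly-hours subtotal check, and a “Gain”-style calculation (original estimate minus effort spent and effort remaining).

```python
# Hypothetical tracking rows: task -> (original_estimate, done, todo), in man-days.
plan = {
    "Parser":  (10, 6, 5),
    "Backend": (15, 8, 7),
}

def check_week(hours_logged, team_size=4, days=5):
    """Flag data-entry errors: weekly totals should add up to team_size * days."""
    return sum(hours_logged) == team_size * days

def gain(plan):
    """Positive gain: ahead of the original estimates; negative: eating contingency."""
    return sum(est - (done + todo) for est, done, todo in plan.values())

print(check_week([5, 5, 5, 5]))  # → True
print(gain(plan))  # → -1  (10-(6+5) + 15-(8+7): one day behind plan)
```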

Figure 9.1: Example spreadsheet with on-going task tracking information

With working code in hand, there are a number of other metrics which can help to assess the status of a project.

• Number of lines of code (or more specifically, number of non-blank, non-comment lines of code).

• Number of functions.

• Number of branch-points in the code (depending on coding standards, this can often be accurately approximated by counting the number of { characters in a code file, for C-like languages at least).

• Bugs found at review, and their classification.

• Bugs found during testing, and their classification.

• Speed and size of the code for key scenarios.

• Number of test cases successfully executed.

• Amount of coverage of the code during testing.

• Amount of time taken to fix bugs (average, maximum).
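The branch-point approximation mentioned above is a one-liner to compute. A sketch for C-like code (it deliberately ignores the complication of braces inside strings and comments):

```python
def branch_points(source: str) -> int:
    """Approximate branch points in C-like code by counting '{' characters,
    as suggested in the text. A sketch: strings and comments are not handled."""
    return source.count("{")

c_code = """
int sign(int x) {
    if (x > 0) { return 1; }
    if (x < 0) { return -1; }
    return 0;
}
"""
print(branch_points(c_code))  # → 3
```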

Figure 9.2: Metric tracking in example spreadsheet

For any metric, it is important to understand that people will change their behaviour to optimize whatever metric is being concentrated on. With this in mind, the list of potential metrics given above has a corresponding list of counter-productive behaviours that could be induced by concentrating too hard on each metric.

• Number of man-days left: components will be coded as fast as possible, without regard for how good the code is.

• Number of lines or branch-points in the code: code will be copied and pasted rather than encapsulated.

• Number of functions: the code will have many layers of shell functions that do no processing.

• Bugs found at review or during testing: code development will slow to a glacial pace, as the developers exhibit excessive caution.

• Speed and size of the code for key scenarios: the code will be incomprehensible because the very last drops of performance have been squeezed out of it, and the performance in other scenarios will drop (this is very common for graphics cards that get tuned for particular high-visibility benchmarks).

• Number of test cases successfully executed, or amount of coverage of the code during testing: the amount of time spent on the more important mainline tests will be lower than it should be (for example, testing odd timing windows often reveals important bugs but doesn’t hit any new areas of the code).

• Amount of time taken to fix bugs: the fastest, least architectural, sticking-plaster fixes will be applied (see below).

In larger companies, organizationally mandated metrics can be applied in a completely inflexible manner, so that any benefit in optimizing for a metric is outweighed by its negative long-term effects.

An example is in order: in one large organization I encountered, the time taken to close out a bug report was closely monitored, and any bug open longer than two weeks was escalated way up the management hierarchy. The noble intent of this policy was to encourage responsiveness and good service for the customers. In practice the policy meant that the most short-sighted, fastest-to-implement fixes were always the ones that made it into the codebase. After a few years of this, the codebase was littered with special cases (see Section 3.2.2 [Minimizing Special Cases], page 13) and copied and pasted code. If a bug report that involved corruption of a database record came in, the developers would restore the record from a backup, and close out the bug report—since investigating the underlying cause as to why a dozen records were getting corrupted each week would have taken far longer than the two-week period that triggered the metric.

The same organization also had astonishingly high support costs, because the internal budgeting for the development of new code was completely separate from the budget for support: developers were positively encouraged to write poor code and ship it as fast as possible, because the support costs didn’t show up in the cost calculations for the new features.

This last example illustrates the most influential metric of all: money—or its near-equivalent, time. In the end, software projects have to make money to pay the salaries of the developers.

On a smaller-scale project, it’s much easier to watch out for these kinds of perverse behaviours creeping in. To avoid the problem in the first place, it’s often useful to monitor different metrics that push people’s behaviour in opposite directions and thus cancel out any psychological effect.

9.1.2 Feedback

The previous section described a number of potential metrics that can be used for tracking the health of a software development project.

The obvious and most important use of these numbers is to determine if the project is running smoothly, in order to extrapolate whether the deadlines and targets are all going to be reached.

However, the lower-level details of the figures can also provide a valuable source of feedback about the performance of the team. This can be monitored as the project progresses (for longer projects), or can be looked at separately in a project post-mortem.

After the fact, every project should be mined for information, to help to move future projects away from the trail-blazing scenario, closer to the “cookie cutter” scenario. Look over the numbers to see how the estimation errors were distributed, and what the contingency pot ended up being used for—so that the next project can have explicit, planned tasks for the work that was missed, together with a better confidence level on the estimates.

To make this information as useful as possible, it helps to distinguish between the changes in the plan that occurred because the original estimates were wrong or missing some tasks, and the changes that happened because of externally imposed changes (such as a change in the specification, or a member of the team leaving unexpectedly). Knowing the difference between when the goalposts have been missed, and when the goalposts have been moved, makes the next plan more accurate—so it’s important to preserve the numbers in the original plan and refer back to them later.
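Separating missed goalposts from moved goalposts can itself be scripted in the post-mortem. A sketch with hypothetical data: each task is tagged with whether its change was externally imposed, and the total slip is split accordingly.

```python
# Hypothetical post-mortem rows: (task, estimated_days, actual_days, externally_imposed)
history = [
    ("Parser",   10, 14, False),  # estimate was wrong: goalposts missed
    ("Backend",  15, 15, False),
    ("New spec",  0,  6, True),   # change of specification: goalposts moved
]

def slip_breakdown(history):
    """Split total overrun into estimation error versus externally imposed change."""
    missed = sum(actual - est for _, est, actual, ext in history if not ext)
    moved = sum(actual - est for _, est, actual, ext in history if ext)
    return missed, moved

print(slip_breakdown(history))  # → (4, 6)
```

Only the first figure says anything about the quality of the original estimates; the second measures scope change.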

Figure 9.3: Post-mortem of example plan; changes in specification indicated by gray background

The performance of individual developers can also be revealed by the underlying numbers—bug counts per line of code, time taken to write each line of code. These concrete numbers can help the project manager to motivate the team members in the directions they need to develop. For example, junior developers often fall into the trap of writing code blazingly quickly, but with a much higher bug count than more experienced developers. In this situation, being able to show the developer their bug rate as compared to the average (together with the fix time associated with that bug rate) can help to convince them to slow down and take more care, in a way that doesn’t involve recrimination and subjective opinion.

Similarly, the overall statistics for the project can give some indication of the performance of the project managers. Occasionally projects get awarded to the manager who comes up with the most aggressive, overly optimistic estimates¹; a long-term calibration of the eventual outcome ensures that these managers can learn the error of their ways.

9.2 Time Management

With the complexities of project management comes an attendant problem for the project manager: time management. It’s particularly common for new project managers to find themselves panicking as more and more tasks and issues pile up in their inbox.

At the core of the problem is a perennial trade-off between important tasks and urgent tasks. It’s easy to get into a “fire-fighting” situation where the only things that get any attention are the ones that are due immediately; other vital tasks get ignored in the frenzy—at least until they too have a looming deadline.

The first step in dealing with the problem is to have a reliable system for tracking tasks and issues. This doesn’t have to be high-tech—a paper list works fine—but there has to be no

1 A friend of mine refers to these types as the “can-do cowboys”.

chance of anything ever getting lost from the system. Tasks and issues stay on the list until they’re done, explicitly confirmed as no longer needed, or passed to someone else to deal with. New things go onto the system immediately, so there’s less chance of an issue being forgotten on the walk back from a conference room to a desk².

The next step is to make sure that the tracking system doesn’t get completely crufted up with important but not yet vital tasks. Revisiting the list on a regular basis—say once a week—to check on the status of the less urgent items helps to keep them fresh in the memory. However, in the end the only way to reduce the size of the task list is to actually do some of the tasks—so reserve some time for a background task or two each week.

One analogy can help to clarify both the problem and the heuristics for solving it, at least for project managers with a software background: scheduling algorithms. Modern pre-emptive operating systems face a very similar challenge in determining how processor time-slices should be assigned to different tasks. As with project management, there are high priority tasks, low priority tasks, interrupts, time deadlines, and a measurable cost for context switching. With that in mind, it’s worth reading about some of the techniques that operating systems use to ensure that everything gets done eventually—UNIX systems and Linux in particular are good subjects for study because of their tradition of documented algorithms and availability of source code (and this is useful knowledge for a software professional anyway, see Section 9.4.1 [Technical Fundamentals], page 74).
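To make the scheduling analogy concrete, here is a toy sketch (not any real operating system's algorithm) of priority aging applied to a manager's task list: a waiting task gains effective priority over time, so important-but-not-urgent work eventually gets picked instead of being starved by fire-fighting.

```python
# A toy "scheduler" for a task list, borrowing the priority-aging idea
# from OS schedulers: effective priority = base priority + weeks waiting.
def next_task(tasks):
    """tasks: list of (name, base_priority, weeks_waiting);
    the highest effective priority wins."""
    return max(tasks, key=lambda t: t[1] + t[2])[0]

todo = [
    ("fire-fight build break", 10, 0),
    ("write project post-mortem", 3, 8),  # low priority, but it has aged
]
print(next_task(todo))  # → write project post-mortem  (3 + 8 > 10 + 0)
```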

9.3 Running a Team

For any software project above a certain size, running the project also involves running the team that is implementing the project. This can be difficult—the software industry certainly has its share of prima-donna programmers, loose cannons and work-shy drones—but the calibre of the development team is a vital factor affecting the quality of the final code.

9.3.1 Explaining Decisions

An important thing to understand about managing a team of software engineers is that things go much more smoothly when you explain the reasons for your decisions.

It might be possible to tell team members just to do what they’re told in other industries, but software engineers are intelligent enough that this doesn’t go down well. Developers like to believe that they are completely rational people; explaining the motivations for your decisions allows you to leverage this (belief in their own) rationality.

The developers on your team might not always agree with you, but if you explore all of the reasons for choosing between two different approaches, the difference in opinion normally boils down to a choice between whether factor X or factor Y is more important. With that understood, it’s much easier to either fashion a compromise or to convince the developer to follow your approach without grumbling. This openness is what prevents management from degenerating into an “us” versus “them” situation.

9.3.2 Make Everything Into Code

Good programmers enjoy writing code. This observation is the key to a useful secret for getting programmers to be enthusiastic about some of the less interesting parts of software development (or at least to be willing to do them)—make those parts look like writing code.

The parts of software development that developers find less interesting are the ones that involve tedious manual processes—collating data, tracking metrics and so on. But tedious manual processes are exactly the things that can be automated by a judicious application of a little bit of code—just what programmers enjoy.

2 Paper lists have the advantage in this kind of situation, being more easily portable.

Allowing time for the team to put together these kinds of small process tools (Perl scripts, Excel macros etc.) improves the morale of the team, and in the long run ensures more accurate and reliable implementations of the various processes involved.

Some examples:

• Testing: A regressible test framework for automated testing involves writing a bunch more code; a set of tests that are run by hand doesn’t. Guess which one programmers enjoy more.

• Planning: Estimating tasks and tracking progress (see Section 9.1.1 [Metrics], page 67) often involves a lot of mechanical processes for checking code size and coverage and so on, which can be scripted.

• Build System: Taking the time to automate the build system so that it copes with all eventualities, and even automatically tracks down the culprits who break the build (by correlating what code has broken with who last changed it) can save lots of the buildmaster’s time and sanity.
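The build-culprit correlation in the last bullet can be sketched in a few lines. The file-to-author map here is hypothetical; in practice it would be scraped from version-control logs.

```python
def likely_culprits(broken_files, last_change):
    """Correlate files that failed to build with whoever last changed them.
    last_change: hypothetical map from file name to the author of its
    most recent change, e.g. extracted from version-control logs."""
    return sorted({last_change[f] for f in broken_files if f in last_change})

last_change = {"parser.c": "alice", "backend.c": "bob", "util.c": "alice"}
print(likely_culprits(["parser.c", "util.c"], last_change))  # → ['alice']
```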

With this approach, it’s worth emphasizing that the normal ideals of what makes good code apply to these ancillary areas too. A good developer wouldn’t copy and paste large sections of the product code, so there’s no excuse for them to do it in test code or build scripts either. Similarly, everything should be version controlled, backed up, commented—and maybe even reviewed by other members of the team.

9.3.3 Tuning Development Processes

Poor logistics have the potential to seriously annoy a software development team. Programmers pride themselves on being intelligent; this is combined with training that emphasizes hunting down and eliminating unnecessary and inefficient code. This mindset stays in play when they have to deal with all of the processes and systems that go along with software development in an organization—bug reporting systems, status tracking systems, official documentation formats, mandatory ticketing systems etc.

As a project manager, it’s easy to forget the realities of life at the coal face, and it’s sometimes difficult to predict the day-to-day consequences of new processes and systems (see Section 4.6.2 [Other Tracking Systems], page 35). The team members are usually pretty quick to point out any shortcomings; as a project manager, it’s your job to listen to your team and to tune the processes as much as possible.

Once developers get into the habit of the development processes, they often find that the processes are not as much of a burden as they originally thought. If there are still concerns, examine the overhead involved in the processes—for example, count all of the steps involved in fixing a single one-line typo in a source code file.

• How many distinct steps are there?

• How many of those steps involve copying and pasting information?

• How many of the steps involve a delay, and for how long?

• How many of the steps involve another person (and so may be delayed until they read their email)?

If the total overhead is excessive, it becomes a big barrier to the developers’ motivation to fix things.

Where it’s not possible to completely automate a process, it is worth explaining to the team what the value of the process is (continuing the earlier theme; see Section 9.3.1 [Explaining Decisions], page 72). If they understand the rationale, they are more motivated to comply with the process, and are more able to suggest modifications to an arduous process that preserve its value while reducing its annoyance level.

9.4 Personnel Development

The most significant factor in software quality is the quality of the person writing the software. Hiring the best developers, keeping the best developers happy once they’re on board, and training new programmers up to become the best developers are the ways to achieve a high-calibre development staff.

Developing the development staff is an ongoing activity that proceeds at a number of levels. Obviously, during the course of a project the members of the development team will naturally acquire knowledge about the particular problem domain of the project, and the tools and techniques used in building the system. This domain-specific knowledge needs to be spread around the team (see Section 5.1.3 [Education], page 40, and Section 6.2 [Who and When], page 47) so that no single developer becomes a bottleneck or overly essential to the project. This is an aspect of team professionalism: the team should be able to continue even if one of its members were to be hit by a bus tomorrow.

Spreading around the knowledge of different aspects of the system also helps to inculcate a wider perspective in the developers—rather than just concentrating on their own little components, they are more likely to understand the system as a whole and how their component affects the overall final system.

9.4.1 Technical Fundamentals

It’s important that new programmers acquire knowledge and experience of technical areas outside of the things that are immediately needed for the current project. There are several reasons for this:

• Future projects may need different skill sets, and may not have enough time for the development team to ramp up fully.

• The more experience a developer has of different technical areas (programming languages, operating systems, development tools, code control systems, bug tracking systems, . . . ) the more they are able to spot the common patterns between these different areas, and to infer the key factors involved.

• The deeper knowledge a developer has of how underlying code and systems are implemented (APIs, libraries, operating systems), the better they get at debugging and understanding the foibles of those systems.

• If a developer is bored and not expanding their skill set and experience, they may begin to worry about being painted into a corner career-wise, and start looking for other jobs that have more prospects.

It’s surprising how few new programmers have a thorough understanding of the underlying systems that their code relies on, even among those arriving with computer science degrees. As such, there are a number of core software ideas and technologies that new programmers should get a working knowledge of during their first couple of years.

My personal list is given below; different development sectors will obviously have different lists. However, it’s always motivational to be able to explain why and when these pieces of knowledge can come in handy (given in parentheses below).

• How a compiler works. (Useful for determining what the problem is when it stops working, and to optimize the code for the best output from the compiler.)

• How the linker and loader work. (Again, useful to know when it stops working and to optimize things.)

• How central features of higher-level languages are implemented—for example, how virtual functions and exceptions work in C++. (This is useful for debugging problems at the very lowest level; imagine debugging a crash in a stripped binary with a bare-bones debugger.)

• How an operating system’s scheduling algorithm works³. (Vital for ensuring the best performance of a software system as a whole.)

• Assembly language. (Being able to read assembler allows you to debug problems when the compiler or debugger has gone awry, and tune the very last drops of performance out of pieces of code on the golden path.)

• Virtual memory, memory models and allocation algorithms—in other words, how malloc works. (Occasionally helpful to tune allocation patterns for performance, but much more commonly useful when debugging memory errors: a double free will often cause problems on a later call to malloc.)

• A scripting language such as the Bourne shell, or awk or Perl or Python or Ruby or . . . . (Having a scripting language at your disposal allows the easy automation of any number of ancillary tasks that show up around a codebase—building it, testing it, analysing it.)

• Regular expressions. (Regular expressions form a core part of most scripting languages, and allow swift manipulation of textual data for all sorts of ancillary tasks.)

• Multithreading, with its pros and cons (see Section 4.3 [Multithreading], page 30) and key issues like deadly embraces and locking hierarchies. (Knowing the issues makes it easier to debug the almost-intractable bugs that can show up with poorly designed multithreaded code. Easier, but not easy, sadly.)

• Networking protocols, IP and TCP especially. (For anything involving networking, the least common denominator in debugging is to fire up the sniffer and look at the packets on the wire. Knowing what you’re looking for helps enormously.)

• Internationalization and localization. (Any software that interacts with users should deal with these issues; otherwise, you’re closing off a huge potential market—and one that’s growing all the time. Also, thinking about internationalization can help clarify underlying design concepts (see Section 4.2 [Internationalization], page 29).)

• Event-based GUI programming. (This is the core of how all of the everyday software for users works. In the unlikely event that you never have to write any user-facing software, that won’t stop your friends and family asking you to help with theirs.)
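As an illustration of the scripting and regular-expression items above, here is a small sketch of a typical ancillary task: pulling bug identifiers out of commit messages to cross-check against a bug tracking system. The “BUG-nnn” message format is hypothetical.

```python
import re

# Hypothetical commit log; the BUG-nnn convention is an assumption.
log = """
Fix crash on empty input (BUG-101)
Refactor parser; no functional change
Close timing window in scheduler (BUG-207)
"""

# Capture the numeric part of each bug reference.
bug_ids = re.findall(r"BUG-(\d+)", log)
print(bug_ids)  # → ['101', '207']
```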

9.4.2 Pragmatism

Once a software engineer has been trained up to be the very finest software engineer, with the highest standards of software quality and professionalism, comes the next step: training them not to be. Or, to be more accurate: training them to understand when and how to drop their standards and let less-than-perfect code out of the door.

Software development is usually about selling software for money, directly or indirectly. Even on open-source projects, there is still an equivalent balance of resources that has to be considered, although it involves people’s time and interest rather than money.

The economic aspect highlights the key judgement call: is the programmer time spent developing a feature or fixing a bug going to be more than reflected in the amount of money that will eventually roll in the door? Even for software with six-nines reliability, it can sometimes be worth letting sleeping bugs lie.

The type of software involved makes a huge qualitative difference to this judgement call. For a Flash animation game on a website, having an obscure bug that occasionally crashes the program isn’t a big deal—after all, most users will probably just blame it on the failings of Internet Explorer. However, for software that controls a kidney dialysis machine, or a nuclear power plant, the cost/reward ratios are a lot more significant.

This is obviously very hard to quantify. The developer time spent now may be easy to measure, but it’s much harder to predict the number of developer hours induced in the future

3 This can even be useful elsewhere (see Section 9.2 [Time Management], page 71).

by taking a shortcut—and harder still to quantify⁴ the reputation damage and consequent longer-term monetary damage associated with delivering less-than-perfect software.

However, just because the problem is quantitatively insoluble doesn’t mean that these kinds of factors shouldn’t be tracked. As programmers become software engineers and then senior software engineers and team leaders, it becomes more and more important that they are able to assess the factors both for and against perfection.

An important point, though: it has to be an explicit, conscious choice to take a particular pragmatic approach—this is not the same as taking hacky shortcuts because the engineers aren’t capable of doing it any other way.

4 Some areas of the six-nines world incur fines for loss of service—making it easier to quantify the costs of poor quality.

Page 82: High-Quality Software Engineering - Lurk, · PDF file8 Planning a Project ... is another common reliability level; a five-nines ... because our general emphasis on high-quality software

Index 77

Index

AAgile development . . . . . . . . . . . . . . . . . . . . . . . . . 7, 15, 47Agnosticism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2, 10Appropriate quality levels . . . . . . . . . . . . . . . . . . . 6, 9, 75Architecture, physical . . . . . . . . . . . . . . . . . . . . . 12, 20, 24Asynchronous interface . . . . . . . . . . . . . . . . 15, 17, 26, 41Automation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50, 73

B
Backup promotion . . . 21, 24
Black box . . . 11, 12, 46
Bottleneck . . . 23, 41, 47, 74
Bug classification . . . 43, 44, 68
Bug rate . . . 39, 71
Bug report . . . 36, 43, 50, 69
Bugs . . . 15, 39, 40
Build system . . . 37, 49, 59, 73

C
C . . . 40, 50
C++ . . . 6, 11, 12, 40, 74
Cisco router . . . 46, 49
Civil engineering . . . 10, 64
Code comments . . . 32, 42
Code coverage . . . 51, 68, 69, 73
Code review . . . 39, 68
Coding standards . . . 30, 37, 42, 43
Communication . . . 4, 5, 11, 26, 40, 42, 44, 65
Compiler . . . 12, 74
Component responsibility . . . 11, 13, 41
Concept . . . 13, 27, 29, 42
Consistency . . . 41
Contingency . . . 59, 70
Cookie cutter project . . . 10, 59, 61, 70
Coordinated universal time (UTC) . . . 29
Copy and paste . . . 37, 42, 49, 53, 69, 73
Core file . . . 16
Cost of quality . . . 1, 15, 21, 75
Coverage . . . 33, 46
Create, read, update, delete (CRUD) . . . 8
Crystal box . . . 46
Cunning code . . . 39, 42
Customer avatar . . . 8
Cutting edge . . . 10, 16, 61

D
Date processing . . . 12, 13, 14, 29
Debugging . . . 9, 16, 50, 51
Dependency . . . 16, 28, 37, 64
Design by contract . . . 11
Design of software . . . 4, 10, 28, 29, 41, 50
Desktop application . . . 12, 51
Determinism . . . 50, 51
Developing the developers . . . 3, 5, 16, 40, 47, 61, 74
Development lifecycle . . . 5, 35
Development methodology . . . 5, 7, 10, 39
Diagnostics . . . 9, 15, 51
Distribution . . . 12, 14, 22
Divide and conquer . . . 10, 57
DLL Hell . . . 15
Documentation . . . 4, 27, 59
Dynamic software downgrade . . . 14
Dynamic software upgrade . . . 13, 14, 24

E
Encapsulation . . . 14, 17, 41
Error paths . . . 8, 41
Estimation . . . 5, 56
Exceptions . . . 11, 40
Extreme programming (XP) . . . 8, 15, 46

F
Failure, disastrous . . . 7, 16, 59, 66
Fault tolerance . . . 14, 20, 24
Feedback . . . 43, 44, 70
Five-nines . . . 1, 14
Fungible programmers . . . 60
Future tense . . . 11, 12, 13

G
Global variable . . . 5
Golden path . . . 41, 46
GUI . . . 48, 50

H
Hack . . . 14, 39, 76
Hard disk . . . 12, 15
Hash algorithm . . . 23

I
Implementation . . . 11
Implicit requirements . . . 9
Infinite loop . . . 20, 42
Interface . . . 11, 17, 39, 65
Interface completeness . . . 11
Interface minimality . . . 12
Interface, asynchronous . . . 15, 17, 26, 41
Interface, synchronous . . . 15, 17
Internationalization . . . 59, 75
Internationalization (I18N) . . . 29, 34
Internet Engineering Task Force (IETF) . . . 8
Iterative development . . . 5, 7, 47

K
Knuth, Donald . . . 27, 48

L
Lakos, John . . . 13
Lifetime of data structures . . . 41
Lines of code (LOC) . . . 39, 54, 57, 67, 68
Locale . . . 29
Localization . . . 59, 75
Localization (L10N) . . . 29
Logging . . . 34

M
Maintainability . . . 4, 16, 39, 41
malloc . . . 40, 50, 75
Manageability . . . 16
Management of software development . . . 2, 5, 6
Metrics . . . 39, 42, 44, 67, 70
Meyers, Scott . . . 4, 40
Model-View-Controller . . . 13, 17, 50
Modularity . . . 41
Money . . . 70, 75
Montulli, Lou . . . 13
Morale . . . 39, 72, 73
Multithreading . . . 14, 22, 30, 41, 50, 51, 75
Muppet . . . 42

N
Naming conventions . . . 11, 30, 32, 42
Network . . . 12, 15, 74, 75
Network switch . . . 1
New programmers . . . 7, 30, 40, 48, 56, 74
Not working as specified (NWAS) . . . 7

O
Object oriented programming (OOP) . . . 10
OSPF . . . 8, 57
Overconfidence . . . 48, 56

P
Paranoia . . . 50
Passive voice . . . 27
Performance . . . 9, 11, 41
Physical architecture . . . 12, 20, 24
Post-mortem . . . 70
Postcondition . . . 11
Pragmatism . . . 48, 51, 75
Precondition . . . 11
Prima-donna programmer . . . 3
Principle of least astonishment . . . 11, 43
Programming in the future tense . . . 11, 12, 13
Project plan . . . 16, 44, 57
Prototype . . . 16, 58
Psychology . . . 50, 69

R
Rationale . . . 4, 27, 31, 39
Real world . . . 14, 48, 56
Redundancy . . . 20
Redundancy, 1:1 . . . 21
Redundancy, 1:N . . . 21
Regressibility . . . 48, 49
Regression test . . . 30, 49
Regular expressions . . . 75
Release tracking . . . 35, 36
Reliability . . . 9, 13
Replanning . . . 66
Reproduction scenario . . . 46
Request For Comment (RFC) . . . 8
Requirement changes . . . 7, 66, 70
Requirements . . . 7, 46
Requirements, implicit . . . 9
Resilience . . . 9, 14
Responsibility . . . 11, 13, 41
Revision control . . . 14, 34, 49
Routing protocol . . . 8, 46, 57

S
Scalability . . . 14, 15, 17, 49
Scripting . . . 33, 35, 37, 73, 75
Self-deception . . . 48, 56, 66
Service Level Agreement (SLA) . . . 9, 75
Shared library . . . 12
Single point of failure . . . 21, 23
Six-nines . . . 1, 7, 9, 12, 14
Software quality . . . 9, 51, 58, 66
Software types . . . 6, 51, 75
Spaghetti . . . 10
Special case . . . 13, 69
Specification . . . 2, 4, 7
Standard Template Library (STL) . . . 11
State information . . . 18, 21, 22, 24, 31
Stupidity . . . 42
Support . . . 16, 39, 40
Synchronous interface . . . 15, 17

T
Task scheduling . . . 63, 72, 75
Team leader roles . . . 3
Test case . . . 68
Test driven development . . . 46
Test framework . . . 50, 53, 65, 73
Testing . . . 39, 46, 58
Threading . . . 14, 22, 30, 41, 50, 51
Trace filtering . . . 32
Tracing . . . 31
Tracking . . . 5, 50, 67, 71
Transaction . . . 17

U
Unified Modeling Language (UML) . . . 17, 19, 26
Unit test . . . 46
UNIX . . . 24, 28, 36, 72
Use case . . . 8, 27, 41, 68
User interface . . . 2, 10, 11
UTC . . . 29

V
Version control . . . 14, 34, 49, 73

W
Waterfall methodology . . . 5, 7
Web engines . . . 9, 14
White box . . . 46
Wirth, Niklaus . . . 11
Working as specified (WAS) . . . 7
Wouldn't it be nice if (WIBNI) . . . 58

Y
You aren't going to need it (YAGNI) . . . 15