
Lecture 7: Engineering

Doing for a dime what others do for a dollar

Software Engineering

• Software engineering is a distinct discipline of study from computer science

• Computer science tries to find out what is possible in principle to do algorithmically, and how quickly answers to instances of certain questions can be computed

• Software engineering tries to find best ways to create working complex software under real world constraints of time, money and manpower

• How to best structure large programs (millions of lines of code) so that they remain manageable and extensible?

Modular Programming

• Large programs consisting of millions of lines of code can't possibly be written in a single source code file

• Instead, the program is divided into modules so that each module is a coherent collection of functions to do some particular thing (not try to be floor wax and dessert topping)

• Each module is a source code file in C, and in your project, exactly one module should have the main function

• A well-designed module can also be turned into an independent general purpose library and used later in other programs that need the same functionality

• The second-lowest level of reuse, after "reuse by copy-paste"

Compiler and Linker

• Suppose that our program consists of three source code files foo.c, bar.c and baz.c

• Each file is compiled in isolation from the others to create the corresponding object code file, for example foo.o

• Object code is basically machine code, but it can be thought of as having "holes" in it in the places where that code calls functions defined in other modules, not available to the compiler right now

• The final executable is created by the linker, a tool that takes these object code files and fills the "holes" with the information it gets from other object code files

• Compilation can still fail at the linking phase if, for example, two modules define a function with the same name

Header Files

• Since each source code file is compiled separately, the compiler cannot possibly perform type checking for calls to functions in other modules and vice versa

• In fact, calling a function defined in another module just results in "unknown identifier" compile time error

• For each module, we write a header file that contains prototypes of those functions defined in that module that are meant to be called by outside modules

• The header file for foo.c is named foo.h

• A function prototype consists of the function signature but without the body

• This is enough information for the compiler to perform type checking for the function calls
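As a sketch (with hypothetical module and function names), the header declares the prototype and the module provides the body; here both halves are shown in one listing for brevity:

```c
/* foo.h (hypothetical module): the public interface, prototypes only */
int add(int a, int b);          /* signature without the body */

/* foo.c: includes its own header and provides the definition */
int add(int a, int b) {
    return a + b;
}

/* any other module that does #include "foo.h" can now call add(),
   and the compiler type checks the call against the prototype */
```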

Preventing Multiple Inclusion

• A header file never contains code, only function prototypes and other declarations meant to be visible to other modules

• Modules that want to use functions defined in foo.c need to include the header file foo.h (never include foo.c)

• Problem: as files include each other, you may accidentally create a circular, infinite chain of includes

• The canonical solution is to use preprocessor directives #define and #ifndef to enforce inclusion only once 

#ifndef FOO_H
#define FOO_H
/* contents of foo.h */
#endif

Makefiles

• In a very large software project, rebuilding the whole project from scratch can be a time-consuming task

• Often done overnight, with the latest version of the system waiting for the programmers with their morning coffee

• After changing some code, recompile only the modules that have changed, and those that depend on them

• A makefile is an automated mechanism for describing which files in a project depend on which other files

• The make utility recompiles only the source files that are newer than the compiled object files, and their dependencies, potentially saving massive compilation time

Version Control

• In a large software project, version control is an automated mechanism to store all project source files over time in a centralized repository

• Simplifies maintenance and prevents multiple, subtly different versions of files existing inside the project

• Programmers can check out the current version of all project files, and check in the files that they create

• Keeps track of ownership of each part of the code, helping maintain responsibility for changes

• Version control also stores all old versions of every file that was checked in, allowing reversions to old code at any time if something gets messed up

Global Variables

• Using global variables is problematic, but once again, once you have chosen to program in C, avoiding global variables is like the proverbial band-aid on cancer

• Problem: how to inform the module foo.c that int x is a global variable defined in module bar.c ?

• Just declaring int x; in module foo.c allows its compilation to succeed, but linking fails with the error message of both files foo.c and bar.c defining the same global variable int x

• In foo.c, you instead say extern int x; to tell the compiler that x is a global variable defined in some other module

• Also, defining a global variable as static in one module makes that variable invisible to all other modules that it is linked together with
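A minimal sketch of the pattern, using a hypothetical global x; the extern declaration would normally live in foo.c (or better, in bar.h):

```c
/* bar.c: the one and only definition; storage is allocated here */
int x = 42;

/* static limits visibility to this module only; other modules
   cannot refer to this variable at link time */
static int internal_counter = 0;

/* foo.c would contain the declaration (no storage allocated):
       extern int x;
   and could then read and write the same x as bar.c */
```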

Go To Considered Harmful

• In a C function, you can actually jump from any point to any other point inside the same function with the notorious statement goto

• Put a label to the point of the function where you want to jump, and use that label in the goto statement

• Subject to big debate in computer science in the seventies, since all functions can be written without goto

• Structured programming using loops and conditions makes programs easier to understand than spaghetti code especially in languages where you can jump wherever you want, even to middle of another function

• There are special situations where a well-placed goto can really make the function both clearer and faster
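One classic example of such a situation is a single cleanup path for a function that acquires several resources; this is a sketch of the idiom, not something prescribed by the lecture:

```c
#include <stdlib.h>

/* returns 0 on success, -1 on failure; every exit path frees
   exactly the resources that were successfully acquired */
int process(void) {
    int status = -1;
    char *buf1 = malloc(100);
    if (buf1 == NULL) goto out;
    char *buf2 = malloc(100);
    if (buf2 == NULL) goto free_buf1;
    /* ... do the actual work with buf1 and buf2 here ... */
    status = 0;
    free(buf2);
free_buf1:
    free(buf1);
out:
    return status;
}
```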

Code Optimization

• Once the code finally works correctly, it may sometimes be necessary to make it execute faster

• For any real program complex enough to be worth optimizing, the question of where it actually spends its time is extremely difficult to answer by intuition alone

• A profiler is a tool that measures how much execution time is spent in each function, even in each statement

• The execution profile for any real program is quite spiky, so that most of the execution time is spent in a small number of hot spots, in accordance with the 80/20 Pareto principle

• These hot spots offer the biggest payoffs for optimization, whereas code that is executed rarely wouldn't pay off even if you could optimize it to take zero time

Micro-optimizations

• Micro-optimizations refer to locally rearranging your code to make it execute faster

• Seldom worth the programmer effort, for three reasons

• First, they make the code longer and more complex, and less flexible for future changes and ideas

• Second, modern compilers can automatically perform most of the named micro-optimization techniques

• Third, many attractive-looking micro-optimizations don't actually speed up anything at all in the end

• "Premature optimization is the root of all evil" (D. E. Knuth)

Algorithmic Optimizations

• Algorithmic optimizations refer to changing the high level structure of a program so that it uses more efficient data structures and algorithms

• Data structures organize the data in some way to make operations on it more efficient than they would be if the data was just kept in one unsorted pile

• Choose the data structure to make those operations fast that you intend to be doing a lot

• One algorithmic optimization can speed up the program execution by orders of magnitude, more than all possible micro-optimizations put together

• Does it achieve much to optimize the motion that Shlemiel uses to dip the brush in the bucket and pivot around?

Bitwise Arithmetic

• Returning to the lowest levels reachable by C, recall that integers consist of 32 (or 64) bits, each 0 or 1

• Often our numbers are not so big that they would require an entire int to store them, so we could pack several values into a single int to save space

• Also, by packing several truth values into the bits of an integer, we may be able to perform multiple conditions in parallel, making the execution faster

• Result of bitwise arithmetic operations a op b for two integers evaluates to the result of applying op on the 32 individual bits in parallel

Bitwise Operators

• Bitwise AND: 1010 & 1100 evaluates to 1000

• Bitwise OR: 1010 | 1100 evaluates to 1110

• Bitwise NOT: ~10 evaluates to 01

• There are 16 possible binary (in the sense of taking two parameters, cf. unary, ternary) bitwise operators, determined by their 4-bit truth table

• All these can be created by combining the basic operators, for example a XOR b = (a | b) & ~(a & b)

• The logic of these operators behaves the same as for the logical connectives && and ||, but without short circuit, and is evaluated for each bit separately in parallel
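The construction of XOR from the basic operators can be checked directly; a small sketch:

```c
/* builds a XOR b out of the basic operators |, & and ~ :
   a bit is set exactly when it is set in a or b, but not in both */
unsigned xor_from_basics(unsigned a, unsigned b) {
    return (a | b) & ~(a & b);
}
/* e.g. xor_from_basics(0xA, 0xC) gives the same result as 0xA ^ 0xC */
```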

Interlude: NAND

• In principle, all logical operators can be expressed using only one operation NAND, the negation of &

• NOR, the negation of |, would work just as well

• Truth table: 1010 NAND 1100 = 0111

• ~a == a NAND a

• a & b == ~(a NAND b)

• a | b == ~(~a & ~b)

• In the physical hardware of the processor, the mechanism for reading and executing the machine code instructions is implemented with logic gates built from electronic components, diodes and transistors

• If some media allows logic gates, it allows computation

Bitwise Shifts

• Bitwise shift operators << (left shift) and >> (right shift) move the bits left and right by the given amount

• Bits that roll out of one end are simply discarded, and at the other end, zeros come in

• A shift of a single step corresponds to integer multiplication or division by two, which makes these operations an efficient way to implement multiplication and division by powers of two

• However, this assumes that the integer is unsigned

• Compilers automatically generate such shifts where appropriate to implement multiplication by a constant
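A sketch of shifts as multiplication and division by powers of two, for unsigned values:

```c
/* x << 3 shifts three steps left: multiply by 2*2*2 = 8 */
unsigned times8(unsigned x) { return x << 3; }

/* x >> 2 shifts two steps right: divide by 4, discarding the remainder */
unsigned div4(unsigned x) { return x >> 2; }
```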

Turning the Given Bit On/Off

• In a 32-bit integer, the bits are numbered 0, ..., 31 with location 0 for the lowest-order bit, 31 for the highest

• Problem: how to turn on the k:th bit in integer x?

• Solution: x = x | (1 << k);

• The expression (1 << k) creates a mask whose k:th bit, the one we want to turn on, is 1, and all other bits are 0

• This mask determines which bits in the result should become 1, while other bits remain as they originally were in x

• Similarly, how to turn off the k:th bit in integer x?

• Solution: x = x & ~(1 << k);

• Now the negated mask ~(1 << k) has all its bits 1, except the k:th bit, which is 0, forcing that bit to be 0 in the result
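The two solutions above, wrapped as helper functions (a sketch; the names are hypothetical):

```c
/* turn on the k:th bit of x */
int set_bit(int x, int k) { return x | (1 << k); }

/* turn off the k:th bit of x */
int clear_bit(int x, int k) { return x & ~(1 << k); }
```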

Bits as Data

• In the previous examples, the mask was applied to only the k:th bit, but if you wanted to set several bits in a single operation, just set several bits of the mask

• How to check if the k:th bit of integer x is 1?

• Solution: (x & (1 << k)) != 0

• So now we can access individual bits of an int, which allows us to pack data in integers without any slack

• Less important these days with plenty of RAM and disk space, but still important in data structures that use a bit vector, essentially a long "array" of bits whose individual bits we access by their location

Exclusive Or (XOR)

• C has one bitwise operator that has no corresponding logical operator, the exclusive or, a.k.a. XOR

• Denoted by ^, which has nothing to do with exponents

• 1010 ^ 1100 = 0110

• Corresponds to the natural language use of "or" in the sense of "Do you want coffee or tea?" (a snarky programmer can still answer "yes" to this question) so that exactly one of the alternatives is true, but not both of them

• The intuitive interpretation is that for XOR, the mask determines which bits are flipped to the opposite of what they were, while the rest of the bits remain as they were

• XOR is reversible, and in fact its own inverse, so that (a^b)^b == a for any two integers a and b

Application: One-Time Pad

• One-time pad is a simple encryption scheme that can be mathematically proven to be unbreakable

• Cannot be used unless the parties have previously been physically together (say, a spy and the spy HQ)

• A pad consisting of bits of random noise is generated, and both the spy and the spy HQ keep a copy of this pad

• To encrypt his plaintext message, the spy computes XOR of the message and the pad, and then destroys the originals

• Even if the ciphertext is intercepted, it is only random noise

• The spy HQ computes the XOR of the received ciphertext and its stored copy of the original pad

• Since (m^p)^p == m, the plaintext message pops out
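A toy sketch of the scheme over bytes (names hypothetical; a real pad must be truly random, as long as the message, and never reused):

```c
#include <stddef.h>

/* XORs the pad into the buffer in place; applying it twice restores
   the original, since (m ^ p) ^ p == m */
void xor_pad(unsigned char *buf, const unsigned char *pad, size_t n) {
    for (size_t i = 0; i < n; i++)
        buf[i] ^= pad[i];
}
```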

Parlour Tricks With Bits

• Check if x is a power of 2: (x & (x-1)) == 0

• Swap two variables a and b without using a temporary third variable: a = a ^ b; b = a ^ b; a = a ^ b;

• Average without overflow: (a & b) + ((a ^ b) >> 1)

• Find the lowest bit of x that is 1: x & (~x + 1)

• The lowest bit that is 0: ~x & (x + 1)

• For many more, see e.g. "Bit Twiddling Hacks": http://graphics.stanford.edu/~seander/bithacks.html

• (In the code examples, just ignore the keyword register, a suggestion for the compiler to keep an often-used variable in one of its registers to make the program faster)
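A few of these tricks as functions with sanity checks (note that the power-of-two test as given also accepts x == 0, so an extra guard is added here):

```c
/* 1 if x is a positive power of two; the x != 0 guard excludes zero,
   which the bare test (x & (x-1)) == 0 would also accept */
int is_power_of_two(unsigned x) { return x != 0 && (x & (x - 1)) == 0; }

/* lowest set bit of x, e.g. 12 (binary 1100) -> 4 (binary 0100) */
unsigned lowest_one(unsigned x) { return x & (~x + 1); }

/* average of a and b without risking overflow of a + b */
unsigned average(unsigned a, unsigned b) { return (a & b) + ((a ^ b) >> 1); }
```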

Pointers to Functions

• Normally, pointers point to data created by your program

• In C, it is possible to create pointers to functions that are hardcoded in your program

• A function can take a pointer to a function as a parameter, and this way decide at runtime which function it should call as part of its own execution

• We could already do that with if-else, but the possible functions to call would now be fixed at compile time

• A library function can be given a pointer to a function that didn't even exist back when the original library function was written and compiled

• For example, a sorting function is given a comparator

Syntax

• The syntax for pointers to functions is a bit funny

• Normally, when you declare a variable (pointer or other), first comes the variable type, then its name

• In a pointer to a function, the name is placed in the middle of the declaration, instead of at the end

• Example: int (*f)(double);

• This declaration makes the variable f a pointer to a function that takes one double parameter and returns an int

• Type checking is still in effect in function calls, even if these calls take place through a pointer

Example

#include <stdio.h>

int square(int x) { return x * x; }
int succ(int x) { return x + 1; }

int apply_twice(int (*f)(int), int x) {
    int result = f(f(x));
    return result;
}

int main() {
    printf("%d\n", apply_twice(square, 2)); /* 16 */
    printf("%d\n", apply_twice(succ, 2)); /* 4 */
    return 0;
}

Typedef

• Especially when using function pointers, type declarations can become difficult to read and write

• typedef is a handy C statement that allows you to define "aliases" for types that you often use

• Improves readability of code

• Example: typedef int (*my_ptr)(int);

• Now you can declare such pointers with my_ptr f;

• Typedefs can also be used to define more descriptive type names even for simple types

• Example: typedef int state;

Motivation for Enums

• Sometimes in our programs, we need our own "small" data type that has a fixed set of values known at compile time

• Example: the seven possible days of a week

• An easy solution is to use int as that type, and define constants for the possible values of that type

• typedef int day;

• day MONDAY = 0; day TUESDAY = 1; etc.

• Enums streamline such declarations

• Still not type safe, in that you can accidentally give a day variable a value out of range, and mix and match between different small types, since they are all int anyway

Enum Example: Days of the Week

enum day {
    MONDAY, TUESDAY, WEDNESDAY, THURSDAY,
    FRIDAY, SATURDAY, SUNDAY
};
typedef enum day day;

/* day is now just as good a type as any other */
day today = MONDAY;
day* p = &today;

Software Testing

• In everyday parlance, the term "testing" can mean very different things (testing the student knowledge with an exam, test driving a used car...)

• In software engineering, testing has a specific technical meaning different from its everyday use

• Try out a suite of predefined test cases to reveal the existence of bugs in the system

• Each test case consists of input and expected correct result, as defined by system specification

• Test case succeeds if it makes the system give a wrong result, thus revealing the existence of a bug

Nature of Bugs

• Each bug in your code has two independent properties of frequency and severity

• Frequency tells how often on average the bug would manifest itself in the typical use of the program, causing erroneous results to reach the user

• A simple number that can be measured / estimated

• Severity tells how severe the consequences are when the bug manifests itself

• A simple binary classification for severity is that a bug is severe if it prevents the software from being used for its intended purpose, and cosmetic otherwise

Nature of Testing

• Testing is not debugging: it reveals the existence of bugs, but not what they are or how to fix them

• The programmer who created the system receives the bug report and then reasons what the bug is

• You can only test an infinitesimal fraction of all inputs

• The tester's responsibility is to design a set of test cases that maximizes the number of bugs revealed

• Since test cases reveal high-frequency bugs first, initial testing quickly finds a large number of bugs

• The marginal cost needed to find the next bug increases until it is no longer feasible to find more bugs

• All real software has low-frequency bugs lurking in it

Test Case Design

• Instead of trying out essentially the same input over and over, try out different inputs (imagine a game of Battleships)

• Black box testing: design the test cases from the system specification (no actual code needed)

• Try out edge cases, edge cases + 1, corner cases

• White box testing: design the test cases based on the actual code to improve your test coverage

• For every if-else condition, design two test cases that cover both branches separately

• For every loop, design separate test cases to run the loop body zero times, once, twice, three times...
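A sketch of branch coverage for a tiny hypothetical function: one test case per branch of the if-else:

```c
/* caps a score at 100 */
int clamp_score(int x) {
    if (x > 100)
        return 100;   /* branch 1: covered by a test case with x = 150 */
    else
        return x;     /* branch 2: covered by a test case with x = 50 */
}
```

The edge case here is x = 100 itself, plus the "edge case + 1" value x = 101, which should go to different branches.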

Some Testing Terminology

• Unit testing means testing an individual subsystem in isolation from the rest of the system

• Alpha testing or integration testing means testing the subsystems together to find bugs in their interaction

• Beta testing is also testing the entire working system, but done by a subset of real users, instead of in-house testers

• Regression testing means running the tests after modifying the program, to reveal bugs in the new code and its interaction with the old code

Testing vs. Programming

• In a large software house, tester and programmer are two separate job titles (even career paths)

• Even in a small software house, the person testing a system should not be the programmer who wrote it

• Blind spots, ego, desire to move on

• Software testing requires a different mindset from programming: push the system to its limits to make it squeal and reveal every flaw and weakness that is hiding deep inside it

• Testers are also cheaper than programmers, so it is more economical to use separate testers

Code Reviews

• Code reviews are supplementary to testing

• Instead of compiling and executing code, the source code is printed on paper and another programmer tries to find faults in it by powers of reasoning

• Think backwards from a hypothetical error, asking what inputs would cause a bug to manifest

• May be followed by a formal review meeting

• Finds more bugs per hour than actual testing

• Especially finds low-frequency bugs, inefficient code, inelegant code, redundant code...

Assertions

• A problem in debugging is that from the time the bug actually corrupts data, it might take a long time for actual erroneous results to manifest to the user

• Tracking the bug backwards will be difficult

• Assertions are written into your code in places where you believe some invariant always holds

• For example, assert(a < b);

• When execution reaches this line, the expression is evaluated, and the program crashes immediately if it is false

• Used to find programming errors, not runtime errors

• You must #include <assert.h>, and assertions can be turned off in the production version with #define NDEBUG
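A minimal sketch of an assertion guarding an invariant (the function is hypothetical):

```c
#include <assert.h>

/* divides a by b; the caller is responsible for never passing b == 0,
   and the assertion catches that programming error immediately */
int safe_div(int a, int b) {
    assert(b != 0);
    return a / b;
}
```

With NDEBUG defined before including assert.h, the assertion compiles to nothing and costs no runtime.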

Waterfall Model

• A customer needs a piece of software to be written, and software developers need to write it for the customer

• How to turn the customer's vague and shoddily founded ideas of what they need into an actual executable?

• Waterfall model is the classic software engineering model that tries to divide this process in series of steps

• Problem: assumes one linear sequence from beginning to end result version 1.0, without backtracking

• More realistic models of software production try to first create prototypes and intermediate versions

Stages of the Waterfall Model

1. Analysis: specify what the customer needs
2. Design: create the general architecture of the program
3. Coding: write the actual code modules
4. Unit testing: test the individual subsystems in isolation
5. Integration: combine the subsystems into a whole
6. Alpha testing: test the entire system
7. Maintenance: prepare the next version of the program

Analysis

• No code is written or designed at this stage

• The purpose of analysis is to find out what the customer needs the program to do, from the user's point of view

• The user can only observe the inputs and outputs, but can't see inside the program at the code and functions

• Try to create a stable foundation to write the program on

• Problem: customers have vague needs and wants that are constantly changing, and hard to prioritize

• Identify the core functionality versus the frills

Design

• In the design phase, define the architecture of the software with the modules and their responsibilities

• Top-down design divides the system into smaller parts, subdividing until each part is small enough to be defined and implemented as a module

• Bottom-up design defines small feasible parts of functionality and then uses them to define larger parts

• Both tend to produce a rigid design vulnerable to changes

• Object-oriented design defines the modules to correspond to the concepts of the problem domain

• From the specification, produce a first rough draft of the design by turning each noun into a module, and each verb into a function

Integration

• In a large software project, the individual modules are generally not completed at the same time

• To get something visible out as soon as possible, check in the first version with each module defined as a stub

• A stub compiles and runs, but doesn't do anything

• As individual programmers complete and check in the first versions of their modules, hopefully soon enough the program "comes alive" in some nightly build

• In such snowball integration, prioritize the work so that the most important modules and functions get written first

• Some less important modules might even turn out to be unnecessary, or need complete redesign

Maintenance

• The great invisible of the software industry

• In the real world, most paid programming work is really maintenance of old code, instead of creating brand new software from scratch

• Fixing bugs is only a small part of maintenance

• Prepare the next version of the software based on the new needs that have emerged since the original version

• Essentially, run down the entire waterfall ladder again

• Specify the new needs, incorporate them into the existing design, code the changes, test the changes, integrate the new version

The Evolving Design

• Design consists of decisions of what to include and what to exclude, what to model and how to simplify it

• Just like there can't possibly exist one map of a terrain that is perfect for every purpose, there can't exist one software design that is perfect for everything in the problem domain

• A design that was feasible for the original needs may not fit the new needs that emerge with later versions

• A design that was originally elegant might break down when it is violently adapted for new needs

• Paradoxically, the more successful some software is, the more unanticipated needs its users will find for it, eventually forcing a complete redesign from scratch

 

"A good scientist is a person with original ideas. A good engineer is a person who makes a design that works with as few original ideas as possible. There are no prima donnas in engineering."

Freeman Dyson

"Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can."

Jamie Zawinski