Programming with R

DESCRIPTION

Introduction to R Programming

  • Programming with R

  • Some General Programming Guidelines

    1. Understand the problem.

    2. Work out a general idea how to solve it.

    3. Translate your idea into a detailed implementation.

    4. Check: Does it work? Is it good enough? If yes, you are done! If no, go back to step 2.

  • Example

    We wish to write a program which will sort a vector of integers into increasing order.

  • Understand the Problem

    Start with a specific case, usually simple, but not too simple.

    Sometimes, you might try to solve the problem on your own,

    without the computer.

    Consider sorting the vector consisting of the elements

    3,5,24,6,2,4,13,1.

  • Understand the Problem

    Our goal is to write a function called bubblesort() for which

    we could do the following:

    x

  • Work out a General Idea

    A first idea might be to find where the smallest value is, and

    record it.

    Repeat, with the remaining values, recording the smallest

    value each time.

    Repeat ...

    This might be time-consuming.
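
    A minimal sketch of this first idea, assuming a hypothetical helper name selection_sort() that does not appear in the slides: repeatedly locate the smallest remaining value, record it, and drop it from the remaining values.

```r
# Hypothetical helper illustrating the "record the smallest value" idea.
selection_sort <- function(x) {
  result <- x[0]                 # empty vector of the same type as x
  while (length(x) > 0) {
    i <- which.min(x)            # position of the smallest remaining value
    result <- c(result, x[i])    # record it
    x <- x[-i]                   # repeat with the remaining values
  }
  result
}

selection_sort(c(3, 5, 24, 6, 2, 4, 13, 1))
```

    Each pass scans all remaining values and copies the vectors, which is one way to see why this approach might be time-consuming.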

  • Work out a General Idea

    An alternative idea: compare successive pairs of values, starting at the beginning of the vector, and running through to the end.

    Swap pairs if they are out of order.

    Try using this idea on 2,1,4,3,0, for example.

    After running through it, you should end up with 1,2,3,0,4.

    This method doesn't give the solution directly.
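
    A single pass of this idea can be sketched as follows. The loop and the temporary variable save are one possible way to write it, not necessarily the form used on the slides.

```r
# One pass over adjacent pairs, swapping any pair that is out of order.
x <- c(2, 1, 4, 3, 0)
for (i in 1:(length(x) - 1)) {
  if (x[i] > x[i + 1]) {
    save <- x[i]          # keep the old value before overwriting it
    x[i] <- x[i + 1]
    x[i + 1] <- save
  }
}
x   # 1 2 3 0 4: not yet sorted, but the largest value is now last
```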

  • Work out a General Idea

    In checking the alternative idea, notice that the largest value always lands at the end of the new vector. (Can you prove to yourself that this should always happen?)

    This means that we can sort the vector by starting at the beginning of the vector and going through all adjacent pairs.

    Then repeat this procedure for all but the last value, and so on.

  • Detailed Implementation

    At this point, we need to address specific coding questions.

    e.g. How do we swap x[i] and x[i+1]?

    Here is a way to swap the value of x[3] with that of x[4]:

    > save <- x[3]
    > x[3] <- x[4]
    > x[4] <- save

  • Detailed Implementation

    Note that you should not over-write the value of x[3] with the

    value of x[4] before its old value has been saved in another

    place; otherwise, you will not be able to assign that value to

    x[4].

  • Detailed Implementation

    We are now ready to write the code:

    bubblesort x[first + 1]) { # swap the pair

    save
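
    The listing on this slide is truncated in this copy. A sketch consistent with the surviving fragment, and with the error message and explanation given on the later Check slides (an outer loop in which last runs from length(x) down to 2), might read as follows. The loop bounds and comments are inferred rather than quoted.

```r
bubblesort <- function(x) {
  # x starts as the input vector and is rearranged into the output
  for (last in length(x):2) {          # the largest value settles at position `last`
    for (first in 1:(last - 1)) {
      if (x[first] > x[first + 1]) {   # swap the pair
        save <- x[first]
        x[first] <- x[first + 1]
        x[first + 1] <- save
      }
    }
  }
  x
}

bubblesort(c(2, 24, 3, 4, 5, 13, 6, 1))
```

    As written, this sketch fails for a vector of length 1, because length(x):2 then counts from 1 up to 2.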

  • Check

    Always begin testing your code on simple examples to identify obvious bugs.

    > bubblesort(c(2, 1))

    [1] 1 2

    > bubblesort(c(2, 24, 3, 4, 5, 13, 6, 1))

    [1] 1 2 3 4 5 6 13 24

  • Check

    Try the code on several other numeric vectors. What is the

    output when the input vector has length 1?

    > bubblesort(1)

    Error in if (x[first] > x[first + 1]) { : missing value where

    TRUE/FALSE needed

  • Check

    The problem is that when length(x) == 1, the value of last will take on the values 1:2, rather than no values at all.

    This doesn't require a redesign of the function; we can fix it by handling this as a special case at the beginning of our function:

  • Check

    bubblesort x[first + 1]) { # swap the pair

    save
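
    This listing is also truncated in this copy. One way the special case might be handled, assuming the same loop structure as in the error message shown earlier, is an early return for vectors with fewer than two elements. The exact form of the guard is an assumption.

```r
bubblesort <- function(x) {
  if (length(x) < 2) return(x)         # special case: nothing to sort
  for (last in length(x):2) {
    for (first in 1:(last - 1)) {
      if (x[first] > x[first + 1]) {   # swap the pair
        save <- x[first]
        x[first] <- x[first + 1]
        x[first + 1] <- save
      }
    }
  }
  x
}

bubblesort(1)
bubblesort(numeric(0))
```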

  • Check

    Test the new version:

    > bubblesort(1)

    [1] 1

  • Top-down design

    Working out the detailed implementation of a program can appear to be a daunting task. The key to making it manageable is to break it down into smaller pieces which you know how to solve.

    One strategy for doing that is known as top-down design. Top-down design is similar to outlining an essay before filling in the details:

    1. Write out the whole program in a small number (1-5) of steps.

    2. Expand each step into a small number of steps.

    3. Keep going until you have a program.

  • Example: Merge Sort

    The sort algorithm just described is known as a bubble sort.

    The bubble sort is easy to program and is efficient when the vector x is short, but when x is longer, more efficient methods are available.

    One of these is known as a merge sort.

    The general idea of a merge sort is to split the vector into two halves, sort each half, and then merge the two halves.

  • Example: Merge Sort

    During the merge, we only need to compare the first elements of each sorted half to decide which is the smallest value overall.

    Remove that value from its half; then the second value becomes the smallest remaining value in this half, and we can proceed to put the two parts together into one sorted vector.
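
    The merge step on its own can be sketched like this. The helper name merge_sorted() is hypothetical and is used only for illustration; it assumes both inputs are already sorted.

```r
# Hypothetical helper: merge two already-sorted vectors y and z.
merge_sorted <- function(y, z) {
  result <- c()
  while (min(length(y), length(z)) > 0) {
    if (y[1] < z[1]) {            # compare only the first elements
      result <- c(result, y[1])
      y <- y[-1]                  # remove the value from its half
    } else {
      result <- c(result, z[1])
      z <- z[-1]
    }
  }
  c(result, y, z)                 # one half is empty; append the rest of the other
}

merge_sorted(c(6, 8), c(4, 7))
```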

  • Example: Merge Sort

    So how do we do the initial sorting of each half?

    We could use a bubble sort, but a more elegant procedure is to use a merge sort on each of them.

    This is an idea called recursion.

    The mergesort() function which we will write below can make calls to itself.

    Because of variable scoping, new copies of all of the local variables will be created each time it is called, and the different calls will not interfere with each other.

  • Understanding the idea

    It is often worthwhile to consider small numerical examples

    in order to ensure that we understand the basic idea of the

    algorithm, before we proceed to designing it in detail.

    For example, suppose x is [8,6,7,4], and we want to construct a sorted result r.

    Then our merge sort would proceed as follows:

  • Understanding the idea

    1. Split x into two parts: y <- [8,6], z <- [7,4]

    2. Sort y and z: y <- [6,8], z <- [4,7]

    3. Merge y and z:

    (a) Compare y1 = 6 and z1 = 4: r1 <- 4; remove z1; z is now [7].

    (b) Compare y1 = 6 and z1 = 7: r2 <- 6; remove y1; y is now [8].

    (c) Compare y1 = 8 and z1 = 7: r3 <- 7; remove z1; z is now empty.

    (d) Append remaining values of y onto r: r4 <- 8

    4. Return r = [4,6,7,8]

  • Translating into code

    It is helpful to think of the translation process as a stepwise process of refining a program until it works.

    We begin with a general statement, and gradually expand each part.

    We will use a double comment marker ## to mark descriptive lines that still need expansion. We will number these comments so that we can refer to them in the slides; in practice, you would probably not find this necessary.

    After expanding, we will change to the usual comment marker to leave our description in place.

  • Initial Steps

    We start with just one aim, which we can use as our first descriptive line:

    ## 1. Use a merge sort to sort a vector

    We will gradually expand upon previous steps, adding in detail as we go.

    An expansion of step 1 follows from recognizing that we need an input vector x which will be processed by a function that we are naming mergesort.

    Somehow, we will sort this vector.

  • Initial Steps

    In the end, we want the output to be returned:

    # 1. Use a merge sort to sort a vector

    mergesort

  • Breaking Down one of the Steps

    We now expand step 2, noting how the merge sort algorithm

    proceeds:

    # 1. Use a merge sort to sort a vector

    mergesort

  • Breaking Down Substeps

    Each substep of the above needs to be expanded. First, we

    expand step 2.1.

    # 2.1: split x in half

    len
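
    The code for this step is truncated in this copy. As a sketch only, one way to split a vector of length at least 2 into two halves uses integer division; the variable half and the use of %/% are assumptions, not the slide's own code.

```r
x <- c(8, 6, 7, 4, 5)
len <- length(x)
half <- len %/% 2          # integer division, so odd lengths split cleanly
y <- x[1:half]             # first half
z <- x[(half + 1):len]     # second half (one element longer when len is odd)
y
z
```

    Checking an odd length, as here, is a quick way to confirm that no element is lost or duplicated by the split.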

  • Caution: check your code

    x

  • Check your code

    x

  • Check your code

    x

  • Caution: Boundary Cases can be Different

    Be careful with edge cases; usually, we expect to sort a vector containing more than one element, but our sort function should be able to handle the simple problem of sorting a single element.

    The code above does not handle len < 2 properly.

    We must try again, fixing step 2.1. The solution is simple: if the length of x is 0 or 1, our function should simply return x.

    Otherwise, we proceed to split x and sort as above. This affects code outside of step 2.1, so we need to correct our outline.

  • Revised Program

    Here is the new outline, including the new step 2.1:

    # 1. Use a merge sort to sort a vector

    mergesort

  • Revised Program

    # 2: sort x into result

    # 2.1: split x in half

    y

  • Further Expansion

    Step 2.2 is very easy to expand, because we can make use of our mergesort() function, even though we haven't written it yet!

    The key idea is to remember that we are not executing the code at this point; we are designing it.

    We should assume our design will eventually be successful, and we will be able to make use of the fruits of our labour.

  • Further Expansion

    So step 2.2 becomes

    # 2.2: sort y and z

    y <- mergesort(y)
    z <- mergesort(z)

  • Further Expansion

    Step 2.3 is more complicated, so let's take it slowly.

    We know that we will need a result vector, but let's describe the rest of the process before we code it.

    We repeat the whole function here, including this expansion and the expansion of step 2.2:

  • Further Expansion

    # 1. Use a merge sort to sort a vector

    mergesort

  • Further Expansion

    # 2: sort x into result

    # 2.1: split x in half

    y

  • Further Expansion

    Steps 2.3.2 and 2.3.3 both depend on the test of which of y[1] and z[1] is smaller.

    > # 1. Use a merge sort to sort a vector> mergesort

  • Further Expansion

    +   while (min(length(y), length(z)) > 0) {
    +     # 2.3.2: put the smallest first element on the end
    +     # 2.3.3: remove it from y or z
    +     if (y[1] < z[1]) {
    +       result
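
    The full listing is truncated in this copy. A sketch that follows the numbered outline from the preceding slides might look like the following. The split using %/%, the wording of comments 2.3.1 and 2.3.4, and the handling of the leftovers are assumptions rather than the original code.

```r
# 1. Use a merge sort to sort a vector
mergesort <- function(x) {
  # Check for a vector that doesn't need sorting
  len <- length(x)
  if (len < 2) return(x)

  # 2: sort x into result
  # 2.1: split x in half
  half <- len %/% 2
  y <- x[1:half]
  z <- x[(half + 1):len]

  # 2.2: sort y and z
  y <- mergesort(y)
  z <- mergesort(z)

  # 2.3: merge y and z into a sorted result
  result <- c()
  # 2.3.1: while some values are left in both halves
  while (min(length(y), length(z)) > 0) {
    # 2.3.2: put the smallest first element on the end
    # 2.3.3: remove it from y or z
    if (y[1] < z[1]) {
      result <- c(result, y[1])
      y <- y[-1]
    } else {
      result <- c(result, z[1])
      z <- z[-1]
    }
  }
  # 2.3.4: put the leftovers onto the end of result
  c(result, y, z)
}

mergesort(c(8, 6, 7, 4))
```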

  • Debugging and Maintenance

    Computer errors are called bugs.

    Removing these errors from a program is called debugging.

    Debugging is difficult, and one of our goals is to write programs that don't have bugs in them, but sometimes we make mistakes.

  • Debugging and Maintenance

    We have found that the following five steps help us to find

    and fix bugs in our own programs:

    1. Recognize that a bug exists.

    2. Make the bug reproducible.

    3. Identify the cause of the bug.

    4. Fix the error and test.

    5. Look for similar errors.

    We will consider each of these in turn.

  • Recognizing that a bug exists

    Sometimes this is easy; if the program doesn't work, there is a bug. However, in other cases the program seems to work, but the output is incorrect, or the program works for some inputs, but not for others.

    A bug causing this kind of error is much more difficult to recognize.

    There are several strategies to make it easier.

  • Recognizing that a bug exists

    First, follow the advice given earlier, and break up your program into simple, self-contained functions.

    Document their inputs and outputs.

    Within the function, test that the inputs obey your assumptions about them, and think of test inputs where you can see at a glance whether the outputs match your expectations.
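
    As a small illustration, a function can check its assumptions with stopifnot() before doing any work. The function safe_mean() below is a hypothetical example, not part of the slides.

```r
# Hypothetical example: check the inputs before computing anything.
safe_mean <- function(x) {
  # x: a non-empty numeric vector with no missing values
  # returns: the arithmetic mean of x
  stopifnot(is.numeric(x), length(x) > 0, !anyNA(x))
  sum(x) / length(x)
}

safe_mean(c(1, 2, 3))   # easy to verify at a glance: 2
```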

  • Recognizing that a bug exists

    In some situations, it may be worthwhile writing two versions of a function: one that may be too slow to use in practice, but which you are sure is right, and another that is faster but harder to be sure about.

    Test that both versions produce the same output in all situations.
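
    A sketch of this strategy, using two hypothetical functions for a sum of squares: an explicit loop that is easy to trust, and a vectorized version that is compared against it on several inputs, including edge cases.

```r
# Slow but easy to trust: an explicit loop.
sum_squares_slow <- function(x) {
  total <- 0
  for (value in x) total <- total + value^2
  total
}

# Faster vectorized version that we would like to rely on.
sum_squares_fast <- function(x) sum(x^2)

set.seed(1)   # make the random test inputs reproducible
for (n in c(0, 1, 2, 10, 1000)) {
  x <- rnorm(n)
  # all.equal() allows for tiny floating-point differences
  stopifnot(isTRUE(all.equal(sum_squares_slow(x), sum_squares_fast(x))))
}
```

    A handful of inputs cannot cover all situations, so this kind of comparison raises confidence rather than proving correctness.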

  • Recognizing that a bug exists

    When errors only occur for certain inputs, our experience shows that those are often what are called edge cases: situations which are right on the boundary between legal and illegal inputs.

    Test those! For example, test what happens when you try a vector of length zero, test very large or very small values, etc.

  • Make the bug reproducible

    Before you can fix a bug, you need to know where things are

    going wrong. This is much easier if you know how to trigger

    the bug.

    Bugs that only appear unpredictably are extremely difficult

    to fix. The good news is that for the most part computers are

    predictable: if you give them the same inputs, they give you

    the same outputs.

    The difficulty is in working out what the necessary inputs are.

  • Make the bug reproducible

    For example, a common mistake in programming is to misspell the name of a variable.

    Normally this results in an immediate error message, but sometimes you accidentally choose a variable that actually does exist.

    Then you'll probably get the wrong answer, and the answer you get may appear to be random, because it depends on the value in some unrelated variable.
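
    A contrived illustration, with hypothetical names: the function below misspells total as totl, and because a variable called totl happens to exist outside the function, R uses it silently instead of reporting an error.

```r
totl <- 100                # leftover variable from earlier work

add_up <- function(values) {
  total <- sum(values)
  totl                     # misspelling of `total`; R finds the outer variable
}

add_up(c(1, 2, 3))         # 100 rather than 6
```

    In a fresh R session without totl, the same call stops with an "object not found" error, which is much easier to diagnose.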

  • Make the bug reproducible

    The key to tracking down this sort of problem is to work hard to make the error reproducible.

    Simplify things as much as possible: start a new empty R session, and see if you can reproduce it.

    Once you can reproduce the error, you will eventually be able to track it down.

    Some programs do random simulations.

    For those, you can make the simulations reproducible by setting the value of the random number seed at the start.
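
    For instance, calling set.seed() with the same value before a simulation gives the same random numbers each time; the seed value 42 here is arbitrary.

```r
set.seed(42)
a <- rnorm(3)

set.seed(42)      # same seed, so the same sequence of random numbers
b <- rnorm(3)

identical(a, b)   # TRUE
```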

  • Identify the cause of the bug

    When you have confirmed that a bug exists, the next step is to identify its cause.

    If your program has stopped with an error, read the error messages.

    Try to understand them as well as you can.

  • Trouble-shooting

    The simplest way to inspect the values of variables as your program runs is to edit your functions to add statements like this:

    cat("In cv, x=", x, "\n")

    This will print the value of x, identifying where the message

    is coming from. The "\n" at the end tells R to go to a new

    line after printing.

  • Trouble-shooting

    You may want to use print() rather than cat() to take advantage of its formatting, but remember that it can only print one thing at a time, so you would likely use it as

    cat("In cv, x=\n")

    print(x)

  • Trouble-shooting

    Another way to understand what is going wrong in a small

    function is to simulate it by hand.

    Act as you think R would act, and write down the values of all

    variables as the function progresses.

  • Fixing errors and testing

    Once you have identified the bug in your program, you need to fix it.

    Try to fix it in such a way that you don't cause a different problem.

    Then test what you've done.

    You should put together tests that include the input you know reproduces the error, as well as edge cases, and anything else you can think of.

  • The debug() Function

    Instead of using cat() or print() for debugging, you can call the function debug(). This will pause execution of your function, and allow you to examine (or change!) local variables, or execute any other R command, inside the evaluation environment of the function.

  • The debug() Function

    Commands to use with debug() are

    n - next; execute the next line of code, single-stepping through the function

    c - continue; let the function continue running

    Q - quit the debugger

    You mark function f for debugging using debug(f), and then the browser will be called when you enter the function. Turn off debugging using undebug(f).
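
    A short sketch of marking and unmarking a function, using a hypothetical function f. The call that would open the browser is left as a comment because it needs an interactive session.

```r
f <- function(x) {
  y <- x^2
  y + 1
}

debug(f)          # mark f: the next call to f() would open the browser
isdebugged(f)     # TRUE
# f(2)            # in an interactive session, step through with n, c or Q
undebug(f)        # turn debugging off again
isdebugged(f)     # FALSE
f(2)              # runs normally and returns 5
```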

  • Example: Constructing and Debugging a Function

    We will write and debug a function which will compute a confidence interval for the true mean of a population, based on a random sample of size n using the formula

    $\bar{x} \pm t_{\alpha/2,\,n-1}\, s/\sqrt{n}$

    where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation, and the t value is the $1 - \alpha/2$ percentile of the t distribution on $n - 1$ degrees of freedom.

  • Writing a Confidence Interval Function

    Our goal is to write a function which will take input like x

    such as some male heights:

    x

  • Writing a Confidence Interval Function

    ci ci(x) # this should print out a 95%

    # confidence interval for the true mean

  • Solving the Problem

    The confidence interval formula requires:

    the sample mean, which we can compute with mean(x)

    the sample standard deviation (sd(x))

    the t percentiles (qt(c(alpha/2, 1-alpha/2), df))

    the square root of n (sqrt(n))

  • Implementing the Solution

    Here is a first attempt at implementing the solution to the

    problem:

    ci

  • Testing the Function

    Here is a first test for our ci function. Use the data vector of

    heights:

    > x ci(x)

    Error in qt(p, df, lower.tail, log.p) :

    Non-numeric argument to mathematical function

    Something is wrong. One of the arguments to qt is incorrect.

  • Looking for the Error

    We can add a print statement immediately before the call to qt:

    ci

  • Looking for the Error

    > ci(x)

    $alpha

    [1] 0.05

    $df

    function (x, df1, df2, ncp, log = FALSE)

    {

    if (missing(ncp))

    .Internal(df(x, df1, df2, log))

    else .Internal(dnf(x, df1, df2, ncp, log))

    }

    Error in qt(p, df, lower.tail, log.p) :

    Non-numeric argument to mathematical function

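    The value printed for df is R's built-in function df(), not a number, which suggests that no local variable of that name has been defined. The df argument to qt should be set to n-1.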

  • Another Attempt

    ci

  • Another Attempt

    ci

  • Checking the Boundary Case

    Although we should not compute confidence intervals for

    sample sizes less than 2, it might happen by accident:

    > ci(3)

    [1] NA NA

    Warning message:

    In qt(p, df, lower.tail, log.p) : NaNs produced

  • Checking the Boundary Case

    Again, we can handle this boundary case with an if statement.

    ci
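
    The listing is truncated in this copy. A sketch of a ci() function built from the pieces listed earlier (mean(), sd(), qt() and sqrt()) might look like this. The default alpha = 0.05 and the choice to return two NA values for samples of size less than 2 are assumptions.

```r
ci <- function(x, alpha = 0.05) {
  n <- length(x)
  if (n < 2) return(c(NA, NA))   # assumed handling of the boundary case
  df <- n - 1                    # degrees of freedom for the t distribution
  mean(x) + qt(c(alpha / 2, 1 - alpha / 2), df) * sd(x) / sqrt(n)
}

x <- c(170, 185, 177, 160)       # the heights shown in the later test slides
round(ci(x), 2)                  # 156.11 189.89
ci(3)                            # NA NA, with no warning
```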

  • Our Function Can Now be Used Elsewhere

    Now that ci() is a function that is known to work on numeric

    vectors of any length, we can call it from other functions.

    For example, the following function uses the ci() function to compute confidence intervals for all vectors in a matrix or list, as well as for a single vector.

    CI
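
    This listing is also truncated. One possible wrapper, consistent with the output shown on the following slides (matrix columns labelled X1, X2, ... and list elements keeping their names), converts a matrix to a data frame and applies ci() to each element with sapply(). Both the wrapper body and the compact ci() repeated here are assumptions made so the example runs on its own.

```r
# Compact stand-in for the ci() function developed above (assumed form).
ci <- function(x, alpha = 0.05) {
  n <- length(x)
  if (n < 2) return(c(NA, NA))
  mean(x) + qt(c(alpha / 2, 1 - alpha / 2), n - 1) * sd(x) / sqrt(n)
}

CI <- function(x, alpha = 0.05) {
  if (is.matrix(x)) x <- data.frame(x)   # columns become X1, X2, ...
  if (is.list(x)) {
    sapply(x, ci, alpha = alpha)         # one column of limits per element
  } else {
    ci(x, alpha)                         # a single vector
  }
}

xy3 <- list(x = c(170, 185, 177, 160),
            y = c(149, 155, 162, 158, 154),
            z = c(170, 185, 177, 160))
CI(xy3)
```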

  • Testing the CI Function

    This function needs to be tested on vectors, lists (including

    data frames) and matrices:

    > x # a vector

    [1] 170 185 177 160

    > CI(x)

    [1] 156.11 189.89

  • Testing with a Matrix

    > xy # a matrix

    [,1] [,2] [,3] [,4] [,5]

    [1,] -0.2799555 0.49433909 -0.76405054 0.30727532 -1.35506713

    [2,] 0.7972663 -0.79788501 -0.31684602 -0.63859843 -1.18923510

    [3,] 0.8847079 -0.03282889 0.21370405 1.01534939 0.29377499

    [4,] -0.2333586 1.46200042 -0.02394421 -0.08885412 -0.08092837

    > CI(xy)

    X1 X2 X3 X4 X5

    [1,] -0.7182852 -1.228931 -0.8927856 -0.9584125 -1.8770214

    [2,] 1.3026153 1.791744 0.4472173 1.2559986 0.7112936

  • Testing with a List

    > xy3 # a list

    $x

    [1] 170 185 177 160

    $y

    [1] 149 155 162 158 154

    $z

    [1] 170 185 177 160

    > CI(xy3)

    x y z

    [1,] 156.11 149.6065 156.11

    [2,] 189.89 161.5935 189.89
