MATH5835 — Statistical Computing

Jochen Voss

September 27, 2011




Copyright © 2011 Jochen Voss [email protected]

This text is work in progress and may still contain typographical and factual mistakes. Reports about problems and suggestions for improvements are most welcome.


Contents

1 Introduction

2 Random Number Generation
  2.1 Pseudo Random Number Generators
    2.1.1 The Linear Congruential Generator
    2.1.2 Quality of Pseudo Random Numbers
    2.1.3 Pseudo Random Number Generators in Practice
  2.2 The Inverse Transform Method
  2.3 Rejection Sampling
    2.3.1 Uniform Distribution on Sets
    2.3.2 General Densities
  2.4 Other Methods
  2.5 Exercises

3 Monte Carlo Methods
  3.1 Basic Monte Carlo Integration
  3.2 Variance Reduction Methods
    3.2.1 Importance Sampling
    3.2.2 Antithetic Variables
    3.2.3 Control Variates
  3.3 Applications to Statistical Inference
  3.4 Exercises

4 Resampling Methods
  4.1 Bootstrap Methods
  4.2 Jackknife Methods
  4.3 Exercises

5 Markov Chain Monte Carlo Methods
  5.1 Markov Chains
  5.2 The Metropolis-Hastings Method
    5.2.1 Description of the Method
    5.2.2 Random Walk Metropolis Sampling
    5.2.3 Independence Sampler
  5.3 Exercises

6 The EM Algorithm

A Probability Reminders
  A.1 Events and Probability
  A.2 Conditional Probabilities
  A.3 Expectation

B Programming in R
  B.1 General advice
  B.2 R as a calculator
    B.2.1 Mathematical Operations
    B.2.2 Variables
    B.2.3 Data Types
  B.3 Programming Principles
    B.3.1 Principle 1: Don't Repeat Yourself
    B.3.2 Principle 2: Divide and Conquer
    B.3.3 Principle 3: Test your Code
  B.4 Random Number Generation
  B.5 Exercises

C Answers to the Exercises

Bibliography

Index


Chapter 1

Introduction

The use of computers in mathematics and statistics has opened up a wide range of techniques for studying otherwise intractable problems and for analysing very large data sets. "Statistical computing" is the branch of mathematics which concerns these techniques for situations which either directly involve randomness, or where randomness is used as part of a mathematical model. This text gives an overview of the foundations of and basic methods in statistical computing.

An area of mathematics closely related to "statistical computing" is "computational statistics". In the literature there is no clear consensus about the differentiation between these two areas; sometimes the two terms are even used interchangeably. Here we take the point of view that "statistical computing" concerns computational methods which involve randomness as part of the method. Such methods are what this text is mainly interested in; typical examples are Monte Carlo methods. In contrast, we take "computational statistics" to be the larger area of computational methods which can be applied in statistics, even if the methods themselves do not involve randomness. Examples of such techniques include methods to find the maximum of a likelihood function in maximum likelihood estimation. Here, we will give such methods only a cursory treatment, and refer to the literature instead (see, for example, Thisted (1988) for an extensive discussion of such methods).

One of the most important ideas in statistical computing is that properties of a stochastic model can often be found experimentally, by using a computer to generate many random instances of the model and then statistically analysing the resulting sample. The resulting methods are called "Monte Carlo methods", and the discussion of such methods forms the main body of this text.

This text is mostly self-contained; only basic knowledge of statistics and probability is required. The text should be accessible to an MSc or PhD student in Mathematics, Statistics, Physics or related areas. For reference, appendix A summarises the most important concepts and results from probability. Since we study computational methods, programming is required to get maximal benefit out of this text. Previous programming experience would of course be useful, but is not required: appendix B gives an introduction to the aspects of programming we require here and, starting from there, we provide many exercises for the reader to train his or her skills. While the main part of the text is independent of any specific programming language, the exercises use the statistical software package R (R Development Core Team, 2011) for the implementation of our algorithms, and the text includes an introduction to programming in R in appendix B.


Notation

For reference, the following table summarises some of the notation used throughout this text.

N           the natural numbers: N = {1, 2, 3, . . .}

R           the real numbers

(an)n∈N     a sequence of (possibly random) numbers: (an)n∈N = (a1, a2, . . .)

[a, b]      an interval of real numbers: [a, b] = { x ∈ R | a ≤ x ≤ b }

{a, b}      the set containing a and b

U[0, 1]     the uniform distribution on the interval [0, 1]

U{−1, 1}    the uniform distribution on the two-element set {−1, 1}

A∁          the complement of a set: A∁ = { x | x ∉ A }

A × B       the Cartesian product of the sets A and B: A × B = { (a, b) | a ∈ A, b ∈ B }

1{X∈A}      indicator function of the event that the random variable X takes a value in the set A: 1{X∈A} = 1 if X ∈ A and 0 else (see page 68)

1A(x)       indicator function of the set A: 1A(x) = 1 if x ∈ A and 0 else (see page 68)

X ∼ µ       indicates that a random variable X is distributed according to a probability distribution µ

|S|         the number of elements in a finite set; in section 2.3 also the volume of a subset S ⊆ Rd

R^S         vectors where the components are indexed by elements of S (see page 45)

R^(S×S)     matrices where the rows and columns are indexed by elements of S (see page 45)


Chapter 2

Random Number Generation

The basis of all stochastic algorithms is formed by methods to generate random samples from a given model, e.g. to generate

• throws of a coin or of a die,

• Gaussian random numbers,

• random functions (e.g. in finance), or

• random graphs (e.g. in epidemiology).

In this chapter we will discuss how to generate random samples from a given distribution on a computer.

Since computer programs are inherently deterministic, some care is needed when trying to generate random numbers on a computer. Usually the task is split into two steps:

a) Generate a sequence X1, X2, X3, . . . of independent, identically distributed (i.i.d.) random variables, uniformly distributed on the interval [0, 1] or on a finite set like {1, 2, . . . , n}. Methods for this will be discussed in section 2.1.

b) Starting with the result of the first step, transform the random variables to construct samples of more complicated distributions. Different methods for such transformations are discussed later in this chapter, starting in section 2.2.

This split allows the process of generating randomness (which has its own set of concerns, discussed in section 2.1) to be separated from the process of transforming the randomness to have the correct distribution (which can often be done exactly, with mathematical rigour).

2.1 Pseudo Random Number Generators

There are two fundamentally different classes of methods to generate random numbers:

a) True random numbers are generated using some physical phenomenon which is random. Generating such numbers requires specialised hardware and can be expensive and slow. Classical examples include tossing a coin or throwing dice. Modern methods utilise quantum effects, thermal noise in electric circuits, the timing of radioactive decay, etc.


b) Pseudo random numbers are generated by computer programs. While these methods are normally fast and resource-effective, a challenge with this approach is that computer programs are inherently deterministic and therefore cannot produce truly random output.

In this text we will only consider pseudo random number generators.

Definition 2.1. A pseudo random number generator (PRNG) is an algorithm which outputs a sequence of numbers that can be used as a replacement for an i.i.d. sequence of true random numbers.

In the definition, and throughout this text, the shorthand i.i.d. is used as an abbreviationfor the term “independent and identically distributed”.

2.1.1 The Linear Congruential Generator

A (simple) example of a pseudo random number generator is given in the following algorithm:

Algorithm LCG (linear congruential generator)

input: m > 1 (the modulus),
       a ∈ {1, 2, . . . , m − 1} (the multiplier),
       c ∈ {0, 1, . . . , m − 1} (the increment),
       X0 ∈ {0, 1, . . . , m − 1} (the seed).

output: a sequence X1, X2, X3, . . . of pseudo-random numbers.

1: for n = 1, 2, 3, . . . do
2:   Xn ← (a·Xn−1 + c) mod m
3:   output Xn
4: end for

The sequence generated by the algorithm LCG consists of integers Xn ∈ {0, 1, 2, . . . , m − 1}. The output depends on the parameters m, a, c and on the seed X0. If m, a and c are carefully chosen, the resulting sequence behaves "similarly" to a sequence of independent, uniformly distributed random variables. This is illustrated by the following example and, more extensively, by exercise 2.2.

Example 2.2. For parameters m = 8, a = 5, c = 1 and seed X0 = 0, algorithm LCG gives the following output:

     n   5Xn−1 + 1   Xn
     1        1       1
     2        6       6
     3       31       7
     4       36       4
     5       21       5
     6       26       2
     7       11       3
     8       16       0
     9        1       1
    10        6       6


The output 1, 6, 7, 4, 5, 2, 3, 0, 1, 6, . . . shows no obvious pattern and could be considered to be a sample of a random sequence.
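The algorithm LCG is short enough to implement directly. The following sketch is in Python rather than R (the language used in this text's exercises); it reproduces the output of example 2.2.

```python
from itertools import islice

def lcg(m, a, c, seed):
    """Linear congruential generator: yields X_n = (a*X_{n-1} + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

# Parameters from example 2.2: m = 8, a = 5, c = 1, seed X0 = 0.
print(list(islice(lcg(8, 5, 1, 0), 10)))  # [1, 6, 7, 4, 5, 2, 3, 0, 1, 6]
```

Note that the generator function keeps the whole state of the LCG in the single variable x, mirroring the fact that each output depends only on its predecessor.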


Since each new value of Xn only depends on Xn−1, the generated series will repeat over and over again once it reaches a value Xn which has been generated before. In example 2.2 this happens for X8 = X0 and we get X9 = X1, X10 = X2 and so on. Since Xn can take only m different values, the output of a linear congruential generator starts repeating itself after at most m steps; the generated sequence is eventually periodic. To work around this problem, typical values of m are on the order of m = 2^32 ≈ 4 · 10^9. The values for a and c are then chosen such that the generator actually achieves the maximal possible period length of m. A criterion for the choice of m, a and c is given in the following theorem (Knuth, 1981, section 3.2.1.2).

Theorem 2.3. The LCG has period m if and only if the following three conditions are satisfied:

a) m and c are relatively prime

b) a− 1 is divisible by every prime factor of m, and

c) a− 1 is a multiple of 4 if m is a multiple of 4.

In the situation of the theorem, the period length does not depend on the seed X0, and usually this parameter is left to be chosen by the user of the pseudo random number generator.

Example 2.4. Let m = 2^32, a = 1103515245 and c = 12345. Since the only prime factor of m is 2 and c is odd, the values m and c are relatively prime and condition (a) of the theorem is satisfied. Similarly, condition (b) is satisfied, since a − 1 is even and thus divisible by 2. Finally, since m is a multiple of 4, we have to check condition (c) but, since a − 1 = 1103515244 = 275878811 · 4, this condition also holds. Therefore the linear congruential generator with these parameters m, a and c has period 2^32 for every seed X0.
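The conditions of theorem 2.3 are easy to check mechanically. The following Python sketch (the function name is our own) verifies them for arbitrary parameters and confirms the claim of example 2.4.

```python
from math import gcd

def lcg_has_full_period(m, a, c):
    """Check the three conditions of theorem 2.3 for period length m."""
    # a) m and c are relatively prime.
    if gcd(m, c) != 1:
        return False
    # b) a - 1 is divisible by every prime factor of m.
    n, p, factors = m, 2, set()
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    if any((a - 1) % q != 0 for q in factors):
        return False
    # c) a - 1 is a multiple of 4 if m is a multiple of 4.
    if m % 4 == 0 and (a - 1) % 4 != 0:
        return False
    return True

print(lcg_has_full_period(2**32, 1103515245, 12345))  # True
```

For comparison, the parameters of example 2.2 also pass the check (that generator has full period 8), while e.g. m = 10, a = 3, c = 1 fails condition (b).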

If the period length is longer than the number of random numbers we use, we can use the output of the linear congruential generator as a replacement for a sequence of i.i.d. random numbers which are uniformly distributed on the set {0, 1, . . . , m − 1}. The applications of random number generators discussed in the remaining parts of this text mostly require a sequence of real-valued i.i.d. random variables, e.g. uniformly distributed on the interval [0, 1]. We can get a sequence (Un)n∈N of pseudo random numbers to replace an i.i.d. sequence of U[0, 1] random variables by setting

    Un = Xn / m

where (Xn)n∈N is the output of the linear congruential generator.


2.1.2 Quality of Pseudo Random Numbers

Pseudo random number generators used in modern software packages like R or Matlab are more sophisticated (and more complicated) than the linear congruential generator presented here, but they still share the following characteristics:

• The sequence of random numbers generated by a pseudo random number generator depends on a seed. One can get different "random" sequences for different runs by choosing the seed based on variable quantities like the current time of day. Conversely, one can get reproducible results by setting the seed to a known, fixed value.

• The property that the output is eventually periodic is shared by all pseudo random number generators implemented in software. The period length is a measure of the quality of a pseudo random number generator.

• Another problem with the output of the LCG algorithm is that the generated random numbers are not independent (since each value is a deterministic function of the previous value). Again, to some extent this problem is shared by all PRNGs. There are methods available to quantify the effect, and there are PRNGs which suffer less from this problem, e.g. the Mersenne Twister algorithm (Matsumoto and Nishimura, 1998).

Finally, since pseudo random numbers are meant to be used as a replacement for i.i.d. sequences of true random numbers, a common way to test pseudo random numbers is to apply statistical tests (e.g. tests for independence) to the generated sequences. Random number generators used in practice pass such tests without problems.


2.1.3 Pseudo Random Number Generators in Practice

This section contains advice on using pseudo random number generators in practice.

First, it is almost always a bad idea to implement your own pseudo random number generator: finding a good algorithm for pseudo random number generation is a difficult problem, and even when an algorithm is available, given the nature of the generated output, it can be a challenge to spot and remove all mistakes in the implementation. Therefore, it is normally advisable to use a well-established method for random number generation, typically the random number generator built into a well-known software package or provided by a well-established library.

A second consideration concerns the role of the seed. While different pseudo random number generators differ greatly in implementation details, they all use a seed (like the value X0 in algorithm LCG) to initialise the state of the random number generator. Often, when non-predictability is required, it is useful to set the seed to some volatile quantity (like the current time) to get a different sequence of random numbers for different runs of the program. At other times it can be more useful to get reproducible results, for example to aid debugging or to ensure repeatability of published results. In these cases, the seed should be set to a known, fixed value.
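The two seeding strategies can be illustrated with the standard library of any language. In R one would call set.seed; the sketch below uses Python's built-in random module for the same idea.

```python
import random
import time

# Reproducible results: fix the seed, e.g. to aid debugging.
random.seed(1234)
first = [random.random() for _ in range(3)]
random.seed(1234)
second = [random.random() for _ in range(3)]
print(first == second)  # True: same seed, same sequence

# Non-predictable runs: seed from a volatile quantity such as the clock.
random.seed(time.time_ns())
```

Re-seeding with the same value resets the generator's state, so the two lists above are identical element by element.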

2.2 The Inverse Transform Method

The inverse transform method is a method for generating random variables with values in the real numbers, using only U[0, 1]-distributed values as input. In this section we will use the concepts of probability densities and of (cumulative) distribution functions; see appendix A for a quick summary of these concepts.

[Figure 2.1: Scatter plots illustrating the correlation between consecutive outputs Xi and Xi+1 of different pseudo random number generators. The random number generators used are the runif function in R (top left), the LCG with m = 81, a = 1 and c = 8 (top right), the LCG with m = 1024, a = 401, c = 101 (bottom left) and finally the LCG with parameters m = 2^32, a = 1664525, c = 1013904223 (bottom right). Clearly the output in the second and third examples does not behave like an i.i.d. sequence of random variables. See exercise 2.2 for details.]

Theorem 2.5. Let F be a distribution function. Define the inverse of F by

    F⁻¹(u) = inf{ x ∈ R | F(x) ≥ u }   for all u ∈ (0, 1),

and let U ∼ U[0, 1]. Define X = F⁻¹(U). Then X has distribution function F.

proof. Using the definitions of X and F⁻¹ we find

    P(X ≤ a) = P(F⁻¹(U) ≤ a) = P( min{ x | F(x) ≥ U } ≤ a ).

Since min{ x | F(x) ≥ U } ≤ a holds if and only if F(a) ≥ U, we can conclude

    P(X ≤ a) = P(F(a) ≥ U) = F(a),

where the final equality comes from the definition of the uniform distribution on [0, 1]. (q.e.d.)

Once F⁻¹ is determined, this method is very simple to apply and the resulting algorithm is trivial. We state the formal algorithm only for completeness:

Algorithm INV (inverse transform method)

input: the inverse F−1 of a CDF F .

randomness used: U ∼ U [0, 1].

output: X ∼ F .

1: output X = F−1(U)

Example 2.6. Let X have density

    f(x) = 3x², for x ∈ [0, 1], and
         = 0    else.

Then

    F(a) = ∫_{−∞}^a f(x) dx = 0,   if a < 0,
                            = a³,  if 0 ≤ a < 1, and
                            = 1,   for 1 ≤ a.

Since F maps (0, 1) into (0, 1) injectively, F⁻¹ is given by the usual inverse function and thus F⁻¹(u) = u^(1/3) for all u ∈ (0, 1). Thus, by theorem 2.5, if U ∼ U[0, 1], the cube root U^(1/3) has the same distribution as X.
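Example 2.6 translates directly into code. The sketch below is in Python (the text's exercises use R); it draws samples via X = U^(1/3) and checks the result against the known mean E[X] = ∫ x·3x² dx = 3/4.

```python
import random

def sample_x(n):
    """Inverse transform sampling for the density f(x) = 3x^2 on [0, 1]:
    F(x) = x^3, so F^{-1}(u) = u^(1/3)."""
    return [random.random() ** (1 / 3) for _ in range(n)]

random.seed(0)
xs = sample_x(100_000)
print(sum(xs) / len(xs))  # close to E[X] = 3/4
```

A histogram of xs would likewise approximate the density 3x² on [0, 1].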

Example 2.7. Let X be discrete with P(X = 0) = 0.6 and P(X = 1) = 0.4. Then

    F(a) = 0,    if a < 0,
         = 0.6,  if 0 ≤ a < 1, and
         = 1,    if 1 ≤ a.


[Figure 2.2: Illustration of the inverse F⁻¹ of a CDF F. At level u the function F is continuous and injective; here F⁻¹ coincides with the usual inverse of a function. The value v falls in the middle of a jump of F and thus has no preimage; F⁻¹(v) is the preimage of the right-hand limit of F and F(F⁻¹(v)) ≠ v. At level w the function F is not injective and several points map to w; the preimage F⁻¹(w) is the left-most of these points and we have, for example, F⁻¹(F(a)) ≠ a.]

Using the definition of F⁻¹ we find

    F⁻¹(u) = 0, if 0 < u ≤ 0.6, and
           = 1  else.

By theorem 2.5 we can construct a random variable X with the correct distribution from U ∼ U[0, 1] by setting

    X = 0, if U ≤ 0.6, and
      = 1  if U > 0.6.

The inverse transform method can always be applied when F⁻¹ can be computed explicitly. For some distributions, like the normal distribution, this is not possible, and the inverse transform method cannot be applied directly. The method can be applied (but may not be very useful) for discrete distributions like in example 2.7.

2.3 Rejection Sampling

Rejection sampling is a method to generate samples of a given target distribution using samples from some related auxiliary distribution. Different from the previous method, rejection sampling may use more than one sample from the auxiliary distribution to generate one sample of the target distribution. The method is not restricted to the generation of random numbers, but works for sampling on arbitrary probability spaces. We formulate the method here for distributions on the Euclidean space Rd.


2.3.1 Uniform Distribution on Sets

In this section we consider a special case of rejection sampling, namely a rejection algorithm to sample from the uniform distribution on a set. This algorithm, while being useful in itself, will be used as the basis for the general rejection sampling method in the following section.

When A ⊆ Rd is a set, we write |A| for the d-dimensional volume of A: the cube Q = [a, b]³ ⊆ R³ has |Q| = (b − a)³, the unit circle C = { x ∈ R² | x₁² + x₂² ≤ 1 } has 2-dimensional 'volume' (area) π, and the line segment [a, b] ⊆ R has 1-dimensional 'volume' (length) b − a.

Definition 2.8. A random variable X with values in Rd is uniformly distributed on a set A ⊆ Rd with |A| < ∞, if

    P(X ∈ B) = |A ∩ B| / |A|   for all B ⊆ Rd.   (2.3)

As for real intervals, we use the notation X ∼ U(A) to indicate that X is uniformly distributed on A.

In general, the set B in (2.3) may not be a subset of A. For the case B ⊆ A we have A ∩ B = B and thus (2.3) simplifies to P(X ∈ B) = |B|/|A|. Similarly, if B ⊆ A, we have

    P(X ∉ B) = P(X ∈ Rd \ B) = |A \ B| / |A| = (|A| − |B|) / |A| = 1 − |B|/|A|.

One of the basic ideas of rejection sampling is described in the following lemma, which turns the uniform distribution on a set into the uniform distribution on a subset.

Lemma 2.9. Let B ⊆ A ⊆ Rd with |A| < ∞ and let (Xk)k∈N be a sequence of independent random variables, uniformly distributed on A. Furthermore let N = min{ k ∈ N | Xk ∈ B }. Then XN ∼ U(B).

proof. The lemma can be proved by verifying that condition (2.3) is satisfied. For C ⊆ Rd we have

    P(XN ∈ C) = Σ_{k=1}^∞ P(Xk ∈ C, N = k)
              = Σ_{k=1}^∞ P(Xk ∈ B ∩ C) · P(Xk−1 ∉ B) · · · P(X1 ∉ B)
              = Σ_{k=1}^∞ (|A ∩ (B ∩ C)| / |A|) · (1 − |B|/|A|)^(k−1),

where we used the independence of the Xk. Using the geometric series formula Σ_{k=0}^∞ q^k = 1/(1 − q) we get

    P(XN ∈ C) = (|B ∩ C| / |A|) · 1 / (1 − (1 − |B|/|A|)) = |B ∩ C| / |B|

for all C ⊆ Rd. This is the required condition for XN to be uniformly distributed on B. (q.e.d.)


The result of lemma 2.9 is easily turned into an algorithm for sampling from U(B): we can generate the Xk one by one, for each generated value check the condition Xk ∈ B, and stop once the condition is true. In the context of the rejection sampling algorithm, the random variables Xk are called proposals and we say that the proposals X1, . . . , XN−1 are rejected and XN is accepted.

2.3.2 General Densities

In this section we present the general rejection sampling method. The following lemma, which makes a connection between distributions with general densities and uniform distributions on sets, forms the basis of this method.

Lemma 2.10. Let f : Rd → [0, ∞) be a probability density and let

    A = { (x, v) ∈ Rd × [0, ∞) | 0 ≤ v < f(x) } ⊆ Rd+1.

Then the following two statements are equivalent:

a) X has probability density f on Rd and V = f(X)·U where U ∼ U[0, 1], and the random variables X and U are independent.

b) (X, V) is uniformly distributed on A.

proof. The volume of the set A can be found by integrating the "height" f(x) over all of Rd. Since f is a probability density, we get

    |A| = ∫_{Rd} f(x) dx = 1.   (2.4)

Assume first that the random variables X with density f and U ∼ U[0, 1] are independent, and let V = f(X)·U. Furthermore let C ⊆ Rd, D ⊆ [0, ∞) and B = C × D. Then we get

    P((X, V) ∈ B) = P(X ∈ C, V ∈ D)
                  = ∫_C P(V ∈ D | X = x) f(x) dx
                  = ∫_C P(f(x)·U ∈ D) f(x) dx
                  = ∫_C (|D ∩ [0, f(x)]| / f(x)) · f(x) dx
                  = ∫_C |D ∩ [0, f(x)]| dx.

On the other hand we have

    |A ∩ B| = ∫_{Rd} ∫_0^{f(x)} 1{(x,v)∈B} dv dx
            = ∫_{Rd} 1{x∈C} ∫_0^{f(x)} 1{v∈D} dv dx
            = ∫_C |D ∩ [0, f(x)]| dx

and thus P((X, V) ∈ B) = |A ∩ B|. This shows that (X, V) is uniformly distributed on A.

For the converse statement assume now that (X, V) is uniformly distributed on A and define U = V/f(X). Since (X, V) ∈ A, we have f(X) > 0 almost surely and thus there is no problem in dividing by f(X). Given sets C ⊆ Rd and D ⊆ R we find

    P(X ∈ C, U ∈ D) = P( (X, V) ∈ { (x, v) | x ∈ C, v/f(x) ∈ D } )
                    = |A ∩ { (x, v) | x ∈ C, v/f(x) ∈ D }|
                    = ∫_{Rd} ∫_0^{f(x)} 1{x∈C} 1{v/f(x)∈D} dv dx.


[Figure 2.3: Illustration of the rejection sampling method where the graph of the target density f is contained in a rectangle R and the proposals (Xk, Yk) are uniformly distributed on R.]

Using the substitution u = v/f(x) in the inner integral we get

    P(X ∈ C, U ∈ D) = ∫_{Rd} ∫_0^1 1{x∈C} 1{u∈D} f(x) du dx
                    = ∫_C f(x) dx · ∫_D 1_[0,1](u) du.

Therefore X and U are independent with densities f and 1_[0,1], respectively. (q.e.d.)

An easy application of the lemma is to use the implication from b) to a) to convert a uniform distribution in R² to a distribution on R with a given density f : [a, b] → R. For simplicity, we assume here that f lives on a bounded interval [a, b]. Furthermore, assume that f satisfies f(x) ≤ M for all x ∈ [a, b]. We can generate samples from the distribution with density f as follows:

a) Let Xk ∼ U[a, b] and Yk ∼ U[0, M] for k ∈ N, independently. Then the (Xk, Yk) are i.i.d., uniformly distributed on the rectangle R = [a, b] × [0, M].

b) Consider the set A = { (x, y) ∈ R | y ≤ f(x) } and let N = min{ k ∈ N | (Xk, Yk) ∈ A }. By lemma 2.9, (XN, YN) is uniformly distributed on A.

c) By lemma 2.10, the value XN is distributed with density f.

This procedure is visualised in figure 2.3.

In the general case, when f is defined on an unbounded set, we cannot use proposals which are uniformly distributed on a rectangle anymore. A solution to this problem is to replace the rectangle R with a different, unbounded area which has finite volume (so that the uniform distribution exists) and to use lemma 2.10 a second time to obtain a uniform distribution on this set. This idea is implemented in the following algorithm.


[Figure 2.4: Illustration of the rejection sampling method from algorithm REJ. The proposal (Xk, Uk·c·g(Xk)) is distributed uniformly on the area under the graph of cg. The proposal is accepted if it falls into the area underneath the graph of f.]

Algorithm REJ (rejection sampling)

input: a probability density f (the target density),
       a probability density g (the proposal density),
       a constant c > 0 such that f(x) ≤ c·g(x) for all x.

randomness used: Xk i.i.d. with density g (the proposals),
                 Uk ∼ U[0, 1] i.i.d.

output: i.i.d. random variables Yj with density f.

1: let j ← 1
2: for k = 1, 2, 3, . . . do
3:   generate Xk with density g
4:   generate Uk ∼ U[0, 1]
5:   if c·g(Xk)·Uk ≤ f(Xk) then
6:     output Yj = Xk
7:     j ← j + 1
8:   end if
9: end for

The assumption in the algorithm is that we can already sample from the distributionwith probability density g, but we would like to generate samples from the distributionwith a density f instead. The theorem below shows that the method works whenever wecan find the required constant c. This condition implies, for example, that the support off cannot be bigger than the support of g, i.e. we need f(x) to be 0 whenever g(x) = 0.

Since f and g are both probability densities, we find

1 = ∫ f(x) dx ≤ ∫ cg(x) dx = c,

i.e. the constant c always satisfies c ≥ 1, with equality only being possible for f = g.

Theorem 2.11. Let (Yj)j∈N be the sequence of random variables generated by algorithm REJ. Then the following statements hold:


a) The Yj are i.i.d. with density f .

b) The number Nj of proposals required to generate Yj is geometrically distributed with parameter p = 1/c. In particular, E(Nj) = c.

proof. Since the Xk have density g, we know from lemma 2.10 that (Xk, g(Xk)Uk) is uniformly distributed on the set {(x, v) | 0 ≤ v < g(x)}. Consequently, Zk = (Xk, cg(Xk)Uk) is uniformly distributed on A = {(x, y) | 0 ≤ y < cg(x)}. By lemma 2.9, the accepted values are then uniformly distributed on the set B = {(x, y) | 0 ≤ y < f(x)} ⊆ A and, applying lemma 2.10 again, we find that the Xk, conditional on being accepted, have density f. This completes the proof of the first statement.

Proposals are accepted if and only if Zi ∈ B. Since |A| = ∫ cg(x) dx = c and |B| = ∫ f(x) dx = 1, the probability of accepting proposal i is

P(Zi ∈ B) = |B| / |A| = 1/c

and the events are independent. Since Nj is the time until the first success of independent trials which succeed with probability p = 1/c, the values Nj are geometrically distributed with parameter p. This completes the proof of the second statement. (q.e.d.)

The average cost of generating one sample is given by the average number of proposals required times the cost of generating each proposal. Therefore the algorithm is efficient if the following two conditions are satisfied:

a) There is an efficient method to generate the proposals Xi.

b) f and g are similar in the sense that the constant c is close to 1.
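As a concrete illustration, the following sketch implements algorithm REJ for a simple hypothetical target not taken from the text: the Beta(2, 2) density f(x) = 6x(1 − x) on [0, 1], with uniform proposals g = 1 and constant c = 1.5 (the maximum of f). The sketch is in Python, whereas the exercises in these notes use R.

```python
import random

def rejection_sample(n, seed=0):
    """Algorithm REJ for the illustrative target f(x) = 6x(1-x) on [0, 1]
    (the Beta(2, 2) density), with uniform proposals g = 1 on [0, 1] and
    the bound f(x) <= c*g(x) for c = 1.5, the maximum of f."""
    rng = random.Random(seed)
    c = 1.5
    f = lambda x: 6 * x * (1 - x)
    samples = []
    while len(samples) < n:
        x = rng.random()          # proposal X_k with density g = 1 on [0, 1]
        u = rng.random()          # U_k ~ U[0, 1]
        if c * 1.0 * u <= f(x):   # accept; overall acceptance probability is 1/c
            samples.append(x)
    return samples

xs = rejection_sample(10000)
mean = sum(xs) / len(xs)          # Beta(2, 2) has mean 1/2
```

In line with theorem 2.11, on average c = 1.5 proposals are needed per accepted sample here.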

2.4 Other Methods

There are many specialised methods to generate samples from specific distributions. These are often faster than the generic methods described in the previous sections, but can typically only be used for a single distribution. These specialised methods (optimised for speed and often quite complex) form the basis of the built-in random number generators in software packages. In contrast, the methods discussed in the previous sections are general purpose methods which can be used for a wide range of distributions when no pre-existing method is available.


Further Reading

A lot of information about linear congruential generators and about testing of random number generators can be found in Knuth (1981). The Mersenne Twister, a popular modern pseudo random number generator, is described in Matsumoto and Nishimura (1998). Rejection sampling and its extensions are described in Robert and Casella (2004, section 2.3). Specialised methods for generating normally distributed random variables can be found in Box and Muller (1958) and Marsaglia and Tsang (2000).


Figure 2.5: The density of the mixture distribution (1/2) N(1, 0.04) + (1/2) N(4, 1), together with a histogram generated from 4000 samples.

2.5 Exercises

Exercise E-2.1. Write an R function to implement the linear congruential generator. The function should have the following form:

LCG <- function(n, m, a, c, X0) {
  ...
  return(X)
}

The return value should be the vector X = (X1, X2, . . . , Xn). Test your function LCG by first manually computing X1, . . . , X6 for m = 5, a = 1, c = 3 and X0 = 0, and then comparing the result to the output of LCG(6,5,1,3,0).
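For reference, the recursion Xk+1 = (aXk + c) mod m behind the generator can be sketched in a few lines; the sketch below is in Python, whereas the exercise itself asks for an R implementation.

```python
def lcg(n, m, a, c, x0):
    """Linear congruential generator: returns [X1, ..., Xn] from the
    recursion X_{k+1} = (a * X_k + c) mod m, starting at the seed x0."""
    xs = []
    x = x0
    for _ in range(n):
        x = (a * x + c) % m
        xs.append(x)
    return xs

# lcg(6, 5, 1, 3, 0) can be compared against the hand computation.
```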

Exercise E-2.2. Given a sequence X = (X1, . . . , Xn) of U[0, 1]-distributed (pseudo) random numbers, we can use a scatter plot of (Xi, Xi+1) for i = 1, . . . , n − 1 in order to try to assess whether the Xi are independent.

a) Try creating such a plot using the built-in random-number generator of R:

X <- runif(1000)
plot(X[1:999], X[2:1000], asp=1)

Can you explain the resulting plot?

b) Create a similar plot, using your function LCG from exercise 2.1:

m <- 81
a <- 1
c <- 8
seed <- 0
X <- LCG(1000, m, a, c, seed)/m
plot(X[1:999], X[2:1000], asp=1)

Discuss the resulting picture.

17

Page 20: MATH5835 Statistical Computinglibvolume7.xyz/.../statisticalpackages/statisticalpackagestutorial2.pdf · 3;:::of independent, identically distributed (i.i.d.) random variables, uniformly

c) Repeat the experiment from part b using the parameters m = 1024, a = 401, c = 101 and m = 2^32, a = 1664525, c = 1013904223. Discuss the results.

Exercise E-2.3. Let X, Y ∼ U[0, 1] be independent. Using definition 2.8 of the uniform distribution on a set, show that (X, Y) is uniformly distributed on the square Q = [0, 1] × [0, 1].

Exercise E-2.4. Let f and g be two probability densities and c ∈ R with f(x) ≤ cg(x) for all x. Show that c ≥ 1 and that c = 1 is only possible for f = g (except possibly on sets with measure 0).

Exercise E-2.5 (rejection sampling for the normal distribution).

a) Work out the optimal constant c for the rejection sampling algorithm when the proposals are Exp(1)-distributed and the target density is given by

f(x) = { 2/√(2π) exp(−x²/2)   if x ≥ 0,
       { 0                     else.

b) Implement the resulting method. Test your program by generating a histogram of the output and by comparing the histogram to the density f.

c) How can the program from part b be modified to generate standard normally distributed random numbers? Hint: Have a look at exercise 2.5.

Exercise E-2.6. In this exercise we will see that the rejection sampling algorithm can still be applied when the density of the target distribution is only known up to a constant. Let f̃ : R → [0, ∞) be a function. Define

f(x) = (1/Z) f̃(x)  ∀x ∈ R,  where Z = ∫_R f̃(x) dx.

Then f is a probability density.

a) Assume that we can generate proposals X1, X2, . . . with density g where f̃(x) ≤ cg(x) for all x ∈ R. State the rejection sampling algorithm for sampling from the density f. Write the algorithm in terms of f̃ instead of f and show that the method can be applied even if the value of Z is unknown.

b) Write a program to generate samples of the probability distribution with density f(x) = (1/Z) exp(cos(x)) on the interval [0, 2π]. How can you test your program?

Exercise E-2.7. Consider n probability distributions P1, . . . , Pn with cumulative distribution functions F1, . . . , Fn. Furthermore let θ1, . . . , θn > 0 such that ∑_{i=1}^n θi = 1. Then the mixture of P1, . . . , Pn with weights θ1, . . . , θn is the distribution Pθ with CDF

Fθ(x) = ∑_{i=1}^n θi Fi(x)  ∀x ∈ R.

a) Convince yourself that Pθ is the distribution obtained by the following two-step procedure: first choose randomly one of the distributions P1, . . . , Pn, where Pi is chosen with probability θi, and then, independently, take a sample of the chosen distribution.


b) Show that, if P1, . . . , Pn have densities f1, . . . , fn, then Pθ also has a density, which is given by

fθ = ∑_{i=1}^n θi fi.

c) Write an R program which generates samples from the mixture of P1 = N(1, 0.01), P2 = N(2, 0.04) and P3 = N(4, 0.01) with weights θ1 = 0.5, θ2 = 0.2 and θ3 = 0.3. Generate a plot, similar to the one in figure 2.5, showing a histogram of 4000 samples of this distribution, together with the density of the mixture.

Hint: relevant R functions include rnorm, dnorm, sample, hist, and curve.


Chapter 3

Monte Carlo Methods

Monte Carlo methods are computational methods where one examines properties of a probability distribution by generating a large sample from the given distribution and then studying the statistical properties of this sample. In this chapter we consider various Monte Carlo methods to numerically approximate an expectation E(f(X)) where f is a real-valued function.

3.1 Basic Monte Carlo Integration

Let X be a random variable and f be a real-valued function. Then there are several different methods to compute the expectation E(f(X)):

a) Sometimes we can find the answer analytically. For example, when the distribution of X has a density ϕ, we can use the relation

E(f(X)) = ∫ f(x)ϕ(x) dx   (3.1)

to obtain the value of the expectation (see appendix A.3). This method only works if we can solve the resulting integral.

b) If the integral in (3.1) cannot be solved analytically, we can try to use numerical integration to get an approximation to the value of the integral. When X takes values in a low-dimensional space, this method often works well, but for higher-dimensional spaces numerical approximation can become very expensive and the resulting method may no longer be efficient. Since numerical integration is outside the topic of statistical computing, we will not follow this approach here.

c) The technique we will study in this chapter is called Monte Carlo integration. This technique is based on the strong law of large numbers (see theorem A.9 in the appendix): If (Xi)i∈N is a sequence of i.i.d. random variables with the same distribution as X, then

lim_{N→∞} (1/N) ∑_{i=1}^N f(Xi) = E(f(X))   (3.2)

almost surely. While exact equality only holds in the limit N → ∞, we can use the approximation

E(f(X)) ≈ (1/N) ∑_{i=1}^N f(Xi)   (3.3)


when N is large. Note that the estimate on the right-hand side is constructed from the random values Xi and is thus a random quantity itself.

The term “Monte Carlo” in the name of the last of these methods is a reference to the Monte Carlo casino (roulette tables are random number generators, in a sense), and the word “integration” refers to the equivalence (3.1) between computing expectations and computing integrals.

The random variables Xi in (3.3) are sometimes called i.i.d. copies of X. Since, formally, X itself is not used to compute the Monte Carlo estimate, sometimes the random variable X is not explicitly introduced and one writes E(f(X1)) instead of E(f(X)). Since X1 has the same distribution as X, these two expressions have the same value. Of course any other Xi could also be used instead of X1. For completeness we reformulate formula (3.3) as an algorithm:

Algorithm MC (Monte Carlo integration)

input: a function f,
       N ∈ N.

randomness used: an i.i.d. sequence (Xi)i∈N of random variables.

output: an estimate for E(f(X1)).

1: s ← 0
2: for i = 1, 2, . . . , N do
3:   s ← s + f(Xi)
4: end for
5: return s/N

Example 3.1. For f(x) = x, the Monte Carlo estimate (3.3) reduces to

E(X) ≈ (1/N) ∑_{i=1}^N Xi = X̄.

This is just the usual estimator for the mean.

Example 3.2. Assume that for some reason we need to compute E(sin(X)) where X ∼ N(µ, σ²). Obtaining this value analytically will be difficult, but we can easily get an approximation using Monte Carlo integration: If we choose N big and generate independent, N(µ, σ²)-distributed random variables X1, X2, . . . , XN, then by the strong law of large numbers we have

E(sin(X)) ≈ (1/N) ∑_{i=1}^N sin(Xi).

The right-hand side of this approximation can be easily evaluated using a computer program, giving an estimate for the required expectation.
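The computation from example 3.2 can be sketched in a few lines of code. The sketch below is in Python; the choice of parameters µ = 1 and σ = 0.5 is arbitrary, purely for illustration.

```python
import math
import random

def mc_sin_normal(mu, sigma, n, seed=0):
    """Monte Carlo estimate (3.3) of E(sin(X)) for X ~ N(mu, sigma^2):
    average sin over n independent normal samples."""
    rng = random.Random(seed)
    return sum(math.sin(rng.gauss(mu, sigma)) for _ in range(n)) / n

est = mc_sin_normal(1.0, 0.5, 20000)
```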

The basic Monte Carlo method above is presented only for the case of computing an expectation. The following arguments show how the method can be applied to compute probabilities or to determine the value of an integral.


• We can compute probabilities using Monte Carlo integration by rewriting them as expectations: If X is a random variable, we have P(X ∈ A) = E(1A(X)) (see appendix A.3). Thus, we can estimate this probability using Monte Carlo integration by

P(X ∈ A) = E(1A(X)) ≈ (1/N) ∑_{i=1}^N 1A(Xi)   (3.4)

where the Xi are i.i.d. copies of X and N is big.

• We can compute integrals like ∫_a^b f(x) dx using Monte Carlo integration by utilising the relation (3.1): Let Ui ∼ U[a, b]. Then the density of Ui is ϕ(x) = 1/(b − a) 1[a,b](x). We get

∫_a^b f(x) dx = (b − a) ∫ f(x)ϕ(x) dx = (b − a) E(f(U)) ≈ (b − a)/N ∑_{i=1}^N f(Ui)   (3.5)

for big N.

Example 3.3. Let X ∼ N(0, 1) and a ∈ R. Then the probability P(X ≤ a) cannot be computed analytically, but

P(X ≤ a) ≈ (1/N) ∑_{i=1}^N 1{Xi ≤ a}   (3.6)

can be used as an approximation.

Example 3.4. The method from (3.5) allows us to derive a different way of computing the probability P(X ≤ a) from example 3.3. Assuming a ≥ 0, we know

P(X ≤ a) = ∫_{−∞}^a 1/√(2π) e^{−x²/2} dx = 1/2 + 1/√(2π) ∫_0^a e^{−x²/2} dx.

Using (3.5), we can approximate this by

P(X ≤ a) ≈ 1/2 + 1/√(2π) · (a/N) ∑_{i=1}^N e^{−Ui²/2}

where the Ui ∼ U[0, a] are i.i.d. For the case a < 0 we can use the relation P(X ≤ a) = 1 − P(X ≤ −a).

With the method discussed so far, an open question is still how large a value of N we should choose. On the one hand, the bigger N is, the more accurate our estimates get; as N → ∞ the Monte Carlo approximation converges to the correct answer. But on the other hand, the bigger N is, the more expensive the method becomes, because we need to generate and process more samples Xi. Key to choosing N are the following two equations:

• The expectation of the Monte Carlo estimate (3.3) for E(f(X)) is given by

E( (1/N) ∑_{i=1}^N f(Xi) ) = (1/N) ∑_{i=1}^N E(f(Xi)) = E(f(X)).

Therefore the estimate is unbiased for every N.


• Since the Xi are independent, the variance of the Monte Carlo estimate is given by

Var( (1/N) ∑_{i=1}^N f(Xi) ) = (1/N²) ∑_{i=1}^N Var(f(Xi)) = (1/N) Var(f(X)).   (3.7)

Therefore the variance, which controls the fluctuations of our estimates around the correct value, converges to 0 as N → ∞.

Since the estimate is unbiased, the magnitude of a typical estimation error is given by the standard deviation of the estimate; therefore we expect the error of a Monte Carlo estimate to decay like 1/√N.

The fact that the error of Monte Carlo integration decays only like 1/√N means that in practice huge numbers of samples can be required. To increase the accuracy by a factor of 10, i.e. to get one more significant digit of the result, one needs to increase the number of samples by a factor of 100.

Example 3.5. Assume that U ∼ U[0, 1] and that we want to estimate E(U²) using Monte Carlo integration. Then we have

Var( (1/N) ∑_{i=1}^N f(Ui) ) = (1/N) Var(U²).

To compute this variance we note

E(U²) = ∫_0^1 u² du = u³/3 |_{u=0}^1 = 1/3

and

E( (U²)² ) = ∫_0^1 u⁴ du = u⁵/5 |_{u=0}^1 = 1/5.

Therefore, Var(U²) = 1/5 − (1/3)² = 4/45 and the variance of the Monte Carlo estimate is

Var( (1/N) ∑_{i=1}^N f(Ui) ) = (4/45) · (1/N) ≈ 0.0889/N.
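The variance computed in example 3.5 can be checked empirically: repeating the Monte Carlo estimate many times and taking the sample variance of the resulting estimates should recover a value near (4/45)/N. A small Python sketch (the numbers of repetitions and samples are arbitrary choices):

```python
import random

def mc_estimate_u2(n, rng):
    """One Monte Carlo estimate of E(U^2) with U ~ U[0, 1], using n samples."""
    return sum(rng.random() ** 2 for _ in range(n)) / n

rng = random.Random(0)
n = 100          # samples per estimate
reps = 2000      # number of independent estimates
estimates = [mc_estimate_u2(n, rng) for _ in range(reps)]
mean = sum(estimates) / reps                       # should be near 1/3
var = sum((e - mean) ** 2 for e in estimates) / reps
# var should be near (4/45)/n ≈ 0.000889, as computed in example 3.5
```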

We can use relation (3.7) to determine which value of N is required to achieve a given precision. We can either directly prescribe the standard deviation of the estimate, or alternatively we can use estimates like the following: By Chebyshev's inequality (see theorem A.8) we have

P( | (1/N) ∑_{i=1}^N f(Xi) − E(f(X)) | ≥ ε ) ≤ Var(f(X)) / (N ε²)

and thus

P( | (1/N) ∑_{i=1}^N f(Xi) − E(f(X)) | ≤ ε ) ≥ 1 − Var(f(X)) / (N ε²).

Consequently, by choosing N ≥ Var(f(X)) / (α ε²) we can achieve

P( | (1/N) ∑_{i=1}^N f(Xi) − E(f(X)) | ≤ ε ) ≥ 1 − α.


Example 3.6. Assume Var(f(X)) = 1. In order to estimate E(f(X)) so that the error is at most ε = 0.01 with probability at least 1 − α = 95%, we can use a Monte Carlo estimate with

N ≥ Var(f(X)) / (α ε²) = 1 / (0.05 · 0.01²) = 200000

samples.

Example 3.7 (estimating probabilities I). Let X be a real-valued random variable and A ⊆ R. Then we can use (3.4) to estimate P(X ∈ A). The error of this method is determined by the variance

Var(1A(X)) = E(1A(X)²) − E(1A(X))² = P(X ∈ A) − P(X ∈ A)².

In the preceding estimates we assumed that the value Var(f(X)) is known, whereas in practice this is normally not the case (since we need to use Monte Carlo integration to estimate the expectation of f(X), it seems unlikely that the variance would be known explicitly). There are various ways to work around this problem:

• Sometimes an upper bound for Var(f(X)) is known, which can be used in place of the true variance.

• One can use a two-step procedure where first Monte Carlo integration with a fixed N (say N = 1000) is used to estimate Var(f(X)). Then, in a second step, one can use estimates as above to determine the value of N required for estimating the expectation itself.

• Sequential methods can be used where one generates samples Xi one by one and estimates the standard deviation of the generated f(Xi) from time to time. The procedure is continued until the estimated standard deviation falls below a prescribed limit.
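A sequential scheme of this kind can be sketched as follows. This is a minimal Python illustration; the batch size, the stopping rule and the safety cap max_n are arbitrary choices, not prescribed by the text.

```python
import math
import random

def mc_until_precision(f, sample, target_sd, batch=1000, max_n=10**6, seed=0):
    """Sequential Monte Carlo: keep drawing samples in batches until the
    estimated standard deviation of the running mean falls below target_sd
    (or until max_n samples have been used)."""
    rng = random.Random(seed)
    s = 0.0    # running sum of f(X_i)
    s2 = 0.0   # running sum of f(X_i)^2
    n = 0
    while n < max_n:
        for _ in range(batch):
            y = f(sample(rng))
            s += y
            s2 += y * y
        n += batch
        mean = s / n
        var = s2 / n - mean * mean            # estimate of Var(f(X))
        if math.sqrt(max(var, 0.0) / n) < target_sd:
            break
    return mean, n

# Example: estimate E(U^2), stopping once the standard error is below 0.005.
est, used = mc_until_precision(lambda x: x * x, lambda rng: rng.random(), 0.005)
```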

In all the error estimates above, we used the fact that f(X) has finite variance. As Var(f(X)) gets bigger, convergence to the correct result gets slower and the number of samples required to obtain a given error increases. Since the strong law of large numbers does not require the random variables to have finite variance, equation (3.2) still holds for Var(f(X)) = ∞, but convergence in (3.3) will be extremely slow and the resulting method will not be very useful in this case.

3.2 Variance Reduction Methods

As we have seen, the efficiency of Monte Carlo integration is determined by the variance of the estimate: the higher the variance, the more samples are required to obtain a given accuracy. This section describes methods to improve efficiency by considering modified Monte Carlo methods which, for a given sample size N, achieve a lower variance of the Monte Carlo estimate.


3.2.1 Importance Sampling

The importance sampling method is based on the following argument. Assume that X is a random variable with density ϕ, that f is a real-valued function and that ψ is another probability density with ψ(x) > 0 whenever f(x)ϕ(x) > 0. Then we have

E(f(X)) = ∫ f(x)ϕ(x) dx = ∫ ( f(x)ϕ(x) / ψ(x) ) ψ(x) dx,

where we define the fraction to be 0 whenever the denominator (and thus the numerator) equals 0. Since ψ is a probability density, the integral on the right can be written as an expectation again: if Y has density ψ, we have

E(f(X)) = E( f(Y)ϕ(Y) / ψ(Y) ).

Using basic Monte Carlo integration, we can estimate this expectation as

E(f(X)) ≈ (1/N) ∑_{i=1}^N f(Yi)ϕ(Yi) / ψ(Yi)   (3.8)

where the Yi are i.i.d. copies of Y. This approximation can be used as an alternative to the basic Monte Carlo approximation (3.3). We can write the resulting method as an algorithm as follows:

Algorithm IMP (importance sampling)

input: a function f ,the density ϕ of X,an auxiliary density ψ,N ∈ N.

randomness used: an i.i.d. sequence (Yi)i∈N with density ψ.

output: an estimate for E(f(X)).

1: s ← 0
2: for i = 1, 2, . . . , N do
3:   s ← s + f(Yi)ϕ(Yi)/ψ(Yi)
4: end for
5: return s/N

This method is a generalisation of the basic Monte Carlo method: if we choose ψ = ϕ, the two densities in (3.8) cancel, the Yi are i.i.d. copies of X and, for this case, the method is identical to basic Monte Carlo integration.

The usefulness of importance sampling lies in the fact that we can choose the density ψ (and thus the Yi) in order to maximise efficiency. As we have seen, the error of the Monte Carlo estimate is determined by the variance

Var( (1/N) ∑_{i=1}^N f(Yi)ϕ(Yi) / ψ(Yi) ) = (1/N) Var( f(Y)ϕ(Y) / ψ(Y) ).

Therefore, the method is efficient if both of the following criteria are satisfied:

a) The Yi can be generated efficiently.

b) The variance Var( f(Y)ϕ(Y) / ψ(Y) ) is small.

To find out how ψ can be chosen to maximise efficiency, we first consider the extreme case where fϕ/ψ is constant: in this case, the variance of the Monte Carlo estimate is 0 and therefore there is no error at all! In this case we have

ψ(x) = (1/c) f(x)ϕ(x)   (3.9)

for all x, where c is the constant value of fϕ/ψ. We can find the value of c by using the fact that ψ is a probability density:

c = c ∫ ψ(x) dx = ∫ f(x)ϕ(x) dx = E(f(X)).

Therefore, in order to choose ψ as in (3.9) we would already have to have solved the problem of computing the expectation E(f(X)), and thus we don't get a useful method for this case. Still, this boundary case offers some guidance: since we get optimal efficiency if ψ is chosen proportional to fϕ, we can expect good efficiency if we choose ψ approximately proportional to the function fϕ.
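To make (3.8) concrete, the following Python sketch estimates a hypothetical tail probability P(X > 3) for X ∼ N(0, 1), using proposals from the shifted density ψ = N(3, 1) so that most proposals land in the region of interest. The simplified weight exp(4.5 − 3y) is the ratio ϕ(y)/ψ(y) worked out by hand for this particular pair of normal densities.

```python
import math
import random

def importance_sampling(n, seed=0):
    """Importance sampling estimate (3.8) of P(X > 3) for X ~ N(0, 1).
    Proposals Y ~ N(3, 1); for this pair of densities the weight
    phi(y)/psi(y) simplifies to exp(4.5 - 3*y)."""
    rng = random.Random(seed)
    s = 0.0
    for _ in range(n):
        y = rng.gauss(3.0, 1.0)
        if y > 3.0:                        # f is the indicator of {X > 3}
            s += math.exp(4.5 - 3.0 * y)   # weight phi(y) / psi(y)
    return s / n

p_hat = importance_sampling(50000)
# The true value is P(X > 3) ≈ 0.00135; basic Monte Carlo would need far
# more samples to estimate such a small probability with similar accuracy.
```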

Example 3.8 (estimating probabilities II). Let X be a real-valued random variable and A ⊆ R. Then the importance sampling method for estimating the probability P(X ∈ A) is given by

P(X ∈ A) = E(1A(X)) ≈ (1/N) ∑_{i=1}^N 1A(Yi)ϕ(Yi) / ψ(Yi)

where ψ is a probability density with ψ(x) > 0 for all x ∈ A with ϕ(x) > 0, and (Yi)i∈N is a sequence of i.i.d. random variables with density ψ. The error of the method is determined by the variance

Var( 1A(Y)ϕ(Y) / ψ(Y) ) = E( 1A(Y)² ϕ(Y)² / ψ(Y)² ) − E( 1A(Y)ϕ(Y) / ψ(Y) )²

= ∫_A ( ϕ(x)² / ψ(x)² ) ψ(x) dx − ( ∫_A ( ϕ(x) / ψ(x) ) ψ(x) dx )²

= ∫_A ( ϕ(x) / ψ(x) ) ϕ(x) dx − ( ∫_A ϕ(x) dx )²

= E( 1A(X)ϕ(X) / ψ(X) ) − P(X ∈ A)².   (3.10)

In example 3.7 we found that the error for the basic Monte Carlo method is determined by

Var(1A(X)) = E(1A(X)) − P(X ∈ A)².

Comparing this to (3.10), we see that the importance sampling method will have a smaller variance than basic Monte Carlo integration if we can choose ψ such that ψ > ϕ on the set A.

3.2.2 Antithetic Variables

The antithetic variables method (also called antithetic variates method) reduces the vari-ance and thus the error of Monte Carlo estimates by using pairwise dependent samples Xi

instead of independent samples.


For illustration, we first consider the case N = 2: Assume that X1 and X2 are identically distributed random variables which are not independent. As in the independent case we have

E( (f(X1) + f(X2)) / 2 ) = ( E(f(X1)) + E(f(X2)) ) / 2 = E(f(X)),

but for the variance we get

Var( (f(X1) + f(X2)) / 2 ) = ( Var(f(X1)) + 2 Cov(f(X1), f(X2)) + Var(f(X2)) ) / 4
  = (1/2) Var(f(X)) + (1/2) Cov(f(X1), f(X2)).

Compared to the independent case, an additional covariance term (1/2) Cov(f(X1), f(X2)) is present. The idea of the antithetic variables method is to construct X1 and X2 such that Cov(f(X1), f(X2)) is negative, thereby reducing the total variance.

If we construct the Xi using the inverse transform method, we can proceed as follows: Let F be the distribution function of X and let U ∼ U[0, 1]. Then 1 − U ∼ U[0, 1] and we can use X1 = F−1(U) and X2 = F−1(1 − U) to generate X1, X2 ∼ F. Since F−1 is monotonically increasing, X1 increases as U increases while X2 decreases as U increases. Therefore one would expect X1 and X2 to be negatively correlated (the following lemma shows that this is indeed the case) and consequently the variance of (X1 + X2)/2 will be smaller than one would expect for independent random variables.

Lemma 3.9. Let g : R → R be monotonically increasing and U ∼ U[0, 1]. Then

Cov( g(U), g(1 − U) ) ≤ 0.

proof. Let V ∼ U[0, 1] be independent of U. We distinguish two cases: If U ≤ V we have g(U) ≤ g(V) and g(1 − U) ≥ g(1 − V). Otherwise, if U > V, we have g(U) ≥ g(V) and g(1 − U) ≤ g(1 − V). Thus, in both cases we have

( g(U) − g(V) )( g(1 − U) − g(1 − V) ) ≤ 0

and consequently

Cov( g(U), g(1 − U) ) = E( g(U)g(1 − U) ) − E( g(U) ) E( g(1 − U) )
  = (1/2) E( g(U)g(1 − U) + g(V)g(1 − V) − g(U)g(1 − V) − g(V)g(1 − U) )
  = (1/2) E( ( g(U) − g(V) )( g(1 − U) − g(1 − V) ) ) ≤ 0.

This completes the proof. (q.e.d.)

If f is monotonically increasing, we can apply this lemma with g(u) = f(F−1(u)). In this case we have

f(X1) = f(F−1(U)) = g(U)  and  f(X2) = f(F−1(1 − U)) = g(1 − U),

and the lemma allows us to conclude Cov(f(X1), f(X2)) ≤ 0.


For N > 2, assuming N is even, we group the samples Xi into pairs and construct every pair as above: Let Uj ∼ U[0, 1] be i.i.d. and for j ∈ N define X2j = F−1(Uj) and X2j+1 = F−1(1 − Uj). Then Xi ∼ F for all i ∈ N and we can use the approximation

E(f(X)) ≈ (1/N) ∑_{i=1}^N f(Xi).

We can write the resulting method as an algorithm as follows.

Algorithm ANT (antithetic variables)

input: a function f ,The inverse F−1 of the distribution function of X,N ∈ N even.

randomness used: a sequence (U1, U2, . . . , UN/2) of i.i.d. U [0, 1] random variables.

output: an estimate for E(f(X)).

1: s ← 0
2: for j = 1, 2, . . . , N/2 do
3:   s ← s + f(F−1(Uj)) + f(F−1(1 − Uj))
4: end for
5: return s/N

While lemma 3.9 only guarantees that the variance for the antithetic variables method is smaller than or equal to the variance of standard Monte Carlo integration, without giving any quantitative bounds, in practice a significant reduction of variance is often observed.

Example 3.10. Assume that X ∼ U[0, 1] and that we want to estimate E(X²) using the antithetic variables method. Since the distribution function of X is F(x) = x, the antithetic samples we get with the method above are X2j = Uj and X2j+1 = 1 − Uj, so that f(X2j) = Uj² and f(X2j+1) = (1 − Uj)². Thus we have

Var( (1/N) ∑_{i=1}^N f(Xi) ) = (1/N²) · (N/2) Var( U² + (1 − U)² )
  = (1/(2N)) Var( U² + 1 − 2U + U² ) = (2/N) Var( U² − U )

where U ∼ U[0, 1]. To compute this variance we note

E(U² − U) = ∫_0^1 (u² − u) du = ( u³/3 − u²/2 )|_{u=0}^1 = −1/6

and

E( (U² − U)² ) = ∫_0^1 (u⁴ − 2u³ + u²) du = ( u⁵/5 − u⁴/2 + u³/3 )|_{u=0}^1 = (6 − 15 + 10)/30 = 1/30.

Therefore, Var(U² − U) = 1/30 − (−1/6)² = 1/180 and the variance of the antithetic variables estimate is

Var( (1/N) ∑_{i=1}^N f(Xi) ) = (2/N) Var(U² − U) = (1/90) · (1/N) ≈ 0.0111/N.


From example 3.5 we know that the basic Monte Carlo estimate has variance 4/(45N), i.e. for this example, using antithetic variables reduces the variance by a factor of 8.
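The 8-fold variance reduction of example 3.10 can be observed in a simulation. The Python sketch below repeats the antithetic estimate of E(U²) many times and computes the sample variance of the resulting estimates; the numbers of repetitions and samples are arbitrary illustrative choices.

```python
import random

def antithetic_estimate(n, rng):
    """Antithetic variables estimate of E(U^2) for U ~ U[0, 1] (algorithm ANT
    with F^{-1}(u) = u): average f(U_j) and f(1 - U_j) over n/2 pairs."""
    s = 0.0
    for _ in range(n // 2):
        u = rng.random()
        s += u ** 2 + (1 - u) ** 2
    return s / n

rng = random.Random(0)
reps = 2000
ests = [antithetic_estimate(100, rng) for _ in range(reps)]
mean = sum(ests) / reps                        # should be near 1/3
var = sum((e - mean) ** 2 for e in ests) / reps
# var should be near 1/(90 * 100) ≈ 0.000111, an 8-fold reduction
# compared to the basic Monte Carlo variance 4/(45 * 100) ≈ 0.000889
```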

There are many variations of the antithetic variables method. For example, if the distribution of X is symmetric, one can use X and −X as a pair of antithetic variables instead of F−1(U) and F−1(1 − U). This idea even works for vector-valued random variables X.

3.2.3 Control Variates

Assume that we have an unbiased estimator Y for some unknown quantity of interest. Furthermore, assume that we have a random variable Ỹ which is correlated with Y and has known expectation. The idea of the control variates method is that, thanks to the correlation between Y and Ỹ, we can use our knowledge of E(Ỹ) to assist with the estimation of E(Y). In this context, the random variable Ỹ is called a control variate for Y.

Lemma 3.11 (control variates). Let Y, Ỹ be two random variables with Corr(Y, Ỹ) ≠ 0. Define a new random variable Z by

Z = Y − ( Cov(Y, Ỹ) / Var(Ỹ) ) ( Ỹ − E(Ỹ) ).

Then E(Z) = E(Y) and

Var(Z) = ( 1 − Corr(Y, Ỹ)² ) Var(Y) < Var(Y).

proof. We consider the random variables

Zc = Y − c( Ỹ − E(Ỹ) )

for c ∈ R. The random variable Zc has expectation

E(Zc) = E(Y) − c( E(Ỹ) − E(Ỹ) ) = E(Y),

i.e. it is unbiased for all c. The variance of Zc is given by

Var(Zc) = Var(Y) − 2c Cov(Y, Ỹ) + c² Var(Ỹ).   (3.11)

The choice of Z in the statement corresponds to c = Cov(Y, Ỹ)/Var(Ỹ) and thus we get

Var(Zc) = Var(Y) − 2 ( Cov(Y, Ỹ) / Var(Ỹ) ) Cov(Y, Ỹ) + ( Cov(Y, Ỹ)² / Var(Ỹ)² ) Var(Ỹ)
  = Var(Y) − Cov(Y, Ỹ)² / Var(Ỹ)
  = ( 1 − Corr(Y, Ỹ)² ) Var(Y)
  < Var(Y).

This completes the proof of the lemma. (q.e.d.)

It is easy to check that the value of c we used in (3.11) is optimal: the c which minimises the variance satisfies the condition

0 = d/dc Var(Zc) = −2 Cov(Y, Ỹ) + 2c Var(Ỹ)

and thus the optimal value of c is given by

c* = Cov(Y, Ỹ) / Var(Ỹ).

In practice, the exact covariance between Y and Ỹ is often unknown. From equation (3.11) we know that it suffices to have

2c Cov(Y, Ỹ) > c² Var(Ỹ)

for variance reduction to occur. It is easy to check that this condition is satisfied for all values of c between 0 and 2 Cov(Y, Ỹ)/Var(Ỹ). Therefore we can replace the estimator Z from lemma 3.11 by any

Zc = Y − c( Ỹ − E(Ỹ) )

where c is an approximation to Cov(Y, Ỹ)/Var(Ỹ).

The control variates method can be applied to the problem of estimating E(f(X)) if we can find a function g ≈ f for which we can compute E(g(X)): using Y = f(X) and Ỹ = g(X), the estimator Z is given by

Z = f(X) − c( g(X) − E(g(X)) ) = f(X) − cg(X) + c E(g(X)).

We have E(f(X)) = E(Z) and, using Monte Carlo integration, this expectation can be estimated as

E(f(X)) ≈ (1/N) ∑_{i=1}^N ( f(Xi) − cg(Xi) + c E(g(X)) ) = (1/N) ∑_{i=1}^N ( f(Xi) − cg(Xi) ) + c E(g(X)).

The variance of this estimate is Var(Z)/N; we have to compare this value to the variance Var(f(X))/N of the basic Monte Carlo estimate (3.3). From lemma 3.11 we know that Var(Z) < Var(f(X)) as long as

c ≈ Cov( f(X), g(X) ) / Var( g(X) ).

Since we assume g ≈ f, we have Cov(f(X), g(X)) ≈ Cov(g(X), g(X)) = Var(g(X)) and thus c = 1 is a possible choice. This leads to the following algorithm.

Algorithm CTV (control variates)

input: a function f ,an auxiliary function g ≈ f such that E

(g(X)

)is known,

N ∈ N.

randomness used: a sequence (Xi)i∈N of i.i.d. copies of X.

output: an estimate for E(f(X)

).

1: s← 02: for i = 1, 2, . . . , N do3: s← s+ f(Xi)− g(Xi)4: end for5: return s/N + E

(g(X)

)31


3.3 Applications to Statistical Inference

In statistical inference we typically have a sample X1, X2, . . . , Xn which is assumed to be generated according to some statistical model. Our aim is to gain information about parameters of the model from this data.

Let T = ϕ(X1, X2, . . . , Xn) be a statistic, i.e. a (measurable) function of the data. An important example of a statistic is the case where T is used to estimate a parameter; in this case T is called an estimator. Often, in particular for small n, the distribution of T is not known explicitly, but we can study this distribution using Monte Carlo methods. For example, the expectation µ = E(T) can be estimated as follows:

a) Generate independent samples (X_1^{(i)}, . . . , X_n^{(i)}) from the model for i = 1, 2, . . . , N.

b) Compute T^{(i)} = ϕ(X_1^{(i)}, . . . , X_n^{(i)}) for i = 1, 2, . . . , N. Then the T^{(i)} are independent and have the distribution of the statistic.

c) Approximate the expectation of T by

µ ≈ µ̂ = (1/N) ∑_{i=1}^N T^{(i)}    (3.12)

where N is large.

Similarly, the cumulative distribution function F(a) = P(T ≤ a) can be estimated by

F(a) ≈ F̂(a) = (1/N) ∑_{i=1}^N 1{T^{(i)} ≤ a}.
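The two estimates above can be sketched as follows; the choice of statistic (the sample median of n = 10 standard normal values) is a hypothetical example, not taken from the notes:

```python
import random, statistics

def simulate_statistic(phi, n, N, draw, seed=0):
    """Return N independent realisations T^(i) = phi(X_1, ..., X_n)."""
    rng = random.Random(seed)
    return [phi([draw(rng) for _ in range(n)]) for _ in range(N)]

# Hypothetical example: T = median of n = 10 samples from N(0, 1).
T = simulate_statistic(statistics.median, n=10, N=5_000,
                       draw=lambda rng: rng.gauss(0, 1))
mu_hat = sum(T) / len(T)                   # estimate of E(T), eq. (3.12)
F_hat = sum(t <= 0.5 for t in T) / len(T)  # estimate of F(0.5) = P(T <= 0.5)
print(mu_hat, F_hat)
```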

While this procedure looks like a simple application of Monte Carlo methods, some care is needed not to confuse the two levels: the size n of the samples and the number N of samples used in the Monte Carlo estimate. Typically, T itself is an estimator for some parameter, and then µ̂ is an estimate of the expectation of an estimator. While T can be computed from a sample of size n, we need to generate N samples, i.e. n · N random values, to compute the estimate µ̂.

To illustrate the method, we show here how the bias and mean square error of an estimator can be estimated:

• The bias of an estimator θ̂ = θ̂(X_1, . . . , X_n) is given by

bias(θ̂) = E_θ(θ̂ − θ)

where the subscript θ on the expectation indicates that the sample (X_1, . . . , X_n) comes from the distribution with parameter θ. For a given value of θ we can estimate bias(θ̂) as

bias(θ̂) ≈ (1/N) ∑_{i=1}^N θ̂^{(i)} − θ    (3.13)

where θ̂^{(i)} = θ̂(X_1^{(i)}, . . . , X_n^{(i)}), the samples (X_1^{(i)}, . . . , X_n^{(i)}) are i.i.d. with parameter θ, and N is big. The true value of θ is normally not known, but we can systematically compute the estimated bias for a range of different θ to get, for example, an approximate upper bound on the bias of the estimator.


• The mean square error (MSE) of an estimator θ̂ is given by

MSE(θ̂) = E_θ((θ̂ − θ)²).

For given θ, we can estimate MSE(θ̂) as

MSE(θ̂) ≈ (1/N) ∑_{i=1}^N (θ̂^{(i)} − θ)².    (3.14)
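A sketch of the estimates (3.13) and (3.14) in Python; the estimator used here (the biased variance estimator, whose exact bias is −σ²/n) is a hypothetical choice for illustration:

```python
import random

def bias_mse(theta_hat, theta, sample, n, N, seed=0):
    """Monte Carlo estimates (3.13) and (3.14) of the bias and MSE of theta_hat."""
    rng = random.Random(seed)
    est = [theta_hat([sample(rng) for _ in range(n)]) for _ in range(N)]
    bias = sum(est) / N - theta
    mse = sum((e - theta) ** 2 for e in est) / N
    return bias, mse

# Hypothetical example: the variance estimator (1/n) sum (X_i - Xbar)^2 for
# X_i ~ N(0, 1); its exact bias is -sigma^2/n = -0.1 for n = 10.
def var_biased(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

bias, mse = bias_mse(var_biased, 1.0, lambda rng: rng.gauss(0, 1), n=10, N=20_000)
print(bias, mse)
```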

A slightly more complex application of Monte Carlo methods is to assess confidence intervals: a random interval [U, V], where U = U(X_1, . . . , X_n) and V = V(X_1, . . . , X_n) are statistics, is a confidence interval with confidence level 1 − α for a parameter θ, if

P_θ(θ ∈ [U, V]) ≥ 1 − α    (3.15)

for all values of θ. The subscript θ on P indicates that (X_1, . . . , X_n) are distributed according to the distribution with parameter θ.

If the data is normally distributed, or if n is big enough that the central limit theorem applies, confidence intervals can often be constructed explicitly. Typical examples of such results include the following:

• The mean of normally distributed values: let X_1, . . . , X_n ∼ N(µ, σ²) be independent. Then

P(µ ∈ [X̄ − p_{1−α/2} √(S²/n), X̄ + p_{1−α/2} √(S²/n)]) = 1 − α,

where X̄ = (1/n) ∑_{i=1}^n X_i is the sample mean, S² = (1/(n−1)) ∑_{i=1}^n (X_i − X̄)² is the sample variance, and p_{1−α/2} is the (1 − α/2)-quantile of Student's t-distribution with n − 1 degrees of freedom.

• The variance of normally distributed values: let X_1, . . . , X_n ∼ N(µ, σ²) be independent. Then

P(σ² ∈ [(n − 1)S²/q_{1−α/2}, (n − 1)S²/q_{α/2}]) = 1 − α,

where S² is defined as before and q_β, for β = α/2, 1 − α/2, is the β-quantile of the χ² distribution with n − 1 degrees of freedom.

If the Xi are not normally distributed, theoretical analysis becomes difficult and the aboveconfidence intervals will only be approximate.

We can verify the confidence level of a given confidence interval using Monte Carlo integration: for a fixed value of θ we perform the following steps.

a) Generate samples X_1^{(j)}, . . . , X_n^{(j)} for j = 1, 2, . . . , N.

b) Compute U^{(j)} = U(X_1^{(j)}, . . . , X_n^{(j)}) and V^{(j)} = V(X_1^{(j)}, . . . , X_n^{(j)}).

c) Check which percentage of the intervals [U^{(j)}, V^{(j)}] contains the parameter θ:

P_θ(θ ∈ [U, V]) ≈ (1/N) ∑_{j=1}^N 1{U^{(j)} ≤ θ ≤ V^{(j)}}.    (3.16)
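A possible implementation of this check for the t-interval above, with n = 10 and α = 5%; the data distribution N(7, 4) is a hypothetical choice, and 2.262 is the 0.975-quantile of the t-distribution with 9 degrees of freedom:

```python
import random, math

def coverage(theta, interval, sample, n, N, seed=0):
    """Estimate P_theta(theta in [U, V]) by Monte Carlo, eq. (3.16)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(N):
        xs = [sample(rng) for _ in range(n)]
        u, v = interval(xs)
        hits += (u <= theta <= v)
    return hits / N

def t_interval(xs, c=2.262):   # 0.975-quantile of t with 9 degrees of freedom
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    return xbar - c * s / math.sqrt(n), xbar + c * s / math.sqrt(n)

# For X_i ~ N(7, 4), i.e. sigma = 2, the estimated level should be close to 95%.
level = coverage(7.0, t_interval, lambda rng: rng.gauss(7, 2), n=10, N=10_000)
print(level)
```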

33

Page 36: MATH5835 Statistical Computinglibvolume7.xyz/.../statisticalpackages/statisticalpackagestutorial2.pdf · 3;:::of independent, identically distributed (i.i.d.) random variables, uniformly

When the true value of θ is not known, this procedure can be repeated systematically for a range of possible values of θ. The confidence level is then approximately equal to the minimum of the estimates from (3.16).

Another application of Monte Carlo methods in statistics is to estimate the error probabilities of statistical tests: in a statistical test, we consider a test statistic T = T(X_1, . . . , X_n) and, depending on the value of T, have to decide whether we can reject the null hypothesis H_0. Such a test is described by an acceptance set A, where we reject H_0 if T ∉ A. The test has significance level α if the acceptance set A satisfies

P(T ∈ A) ≥ 1 − α    (3.17)

whenever H_0 holds. In many cases the relation (3.17) can only be proven (or only holds) for big values of n; for small values of n the real significance level of the test will be unknown.

Example 3.12. The skewness

γ = E(((X − µ)/σ)³) = E((X − µ)³)/σ³

of a random variable with mean µ and standard deviation σ can be estimated by

γ̂_n = ((1/n) ∑_{i=1}^n (X_i − X̄)³) / ((1/n) ∑_{i=1}^n (X_i − X̄)²)^{3/2}

where X_1, X_2, . . . , X_n are i.i.d. copies of X and X̄ is the average of the X_i. If X ∼ N(µ, σ²), then γ = 0 and one can show that

(√n/σ) γ̂_n → N(0, 1)    (3.18)

as n → ∞.

Assume that we would like to construct a test for the null hypothesis H_0 that X is normally distributed with variance σ². As a consequence of (3.18), for big n, we can use the test statistic T = (√n/σ) γ̂_n and the acceptance region

A = [−1.96, 1.96]

to construct a test with significance level α = 5%. We reject H_0 if T ∉ A, i.e. if |γ̂_n| ≥ 1.96 σ/√n.

One problem with the test constructed in the preceding example is that the convergence of the distribution of (√n/σ) γ̂_n to N(0, 1) is very slow. For small or moderate n the probability of wrongly rejecting H_0 (type I error) may be bigger than α.

We can use Monte Carlo integration to estimate the probability of type I errors of statistical tests. For simple hypotheses (i.e. if H_0 specifies the distribution of the test statistic completely), this can be done as follows.

a) For j = 1, 2, . . . , N, generate samples (X_1^{(j)}, . . . , X_n^{(j)}) according to the distribution given by the null hypothesis.

b) Compute T^{(j)} = T(X_1^{(j)}, . . . , X_n^{(j)}) for j = 1, 2, . . . , N.

c) Check for which percentage of samples H_0 is (wrongly) rejected:

P(T ∉ A) ≈ (1/N) ∑_{j=1}^N 1{T^{(j)} ∉ A}.
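For the test of example 3.12 this procedure can be sketched as follows (σ = 1 and the sample size n = 20 are hypothetical choices); for small n the estimated rate can differ noticeably from the nominal 5%:

```python
import random, math

def skewness_stat(xs):
    """T = sqrt(n) * gamma_hat_n for sigma = 1, as in example 3.12."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return math.sqrt(n) * m3 / m2 ** 1.5

def type_one_error(n, N, seed=0):
    """Fraction of N(0,1) samples of size n with |T| > 1.96, i.e. H0 wrongly rejected."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(N):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        rejected += abs(skewness_stat(xs)) > 1.96
    return rejected / N

rate = type_one_error(n=20, N=5_000)
print(rate)
```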

34

Page 37: MATH5835 Statistical Computinglibvolume7.xyz/.../statisticalpackages/statisticalpackagestutorial2.pdf · 3;:::of independent, identically distributed (i.i.d.) random variables, uniformly

Further Reading

Some information about Monte Carlo methods and variance reduction can be found in Ripley (1987). A more extensive discussion of Monte Carlo methods and variance reduction is contained in chapters 3 to 5 of Robert and Casella (2004). The beginnings of the Monte Carlo method are described in Metropolis (1987).

3.4 Exercises

Exercise E-3.1. Let X ∼ N(0, 1) and A = [3, 4]. Implement the Monte Carlo method from (3.4) to obtain an estimate p̂_N for the probability P(X ∈ A). As we have seen, the estimate p̂_N will be random, so different runs of the program will give different answers. For N = 1 000 and N = 10 000, run your code in a loop to generate many estimates p̂_N and plot a histogram of the resulting values.

Exercise E-3.2. In this exercise we will study how importance sampling can be used to estimate the probability that a standard normally distributed random variable X takes very big values: for this exercise we choose X ∼ N(0, 1) and A = [3, 4], and we try to estimate P(X ∈ A).

a) Explain how the probability P(X ∈ A) can be estimated, using importance sampling, by taking samples Y_i from a distribution with density g: R → R (instead of the X_i). Experiment with the following distributions for Y_i:

• Yi ∼ N (1, 1),

• Yi ∼ N (2, 1),

• Yi ∼ N (3.5, 1), and

• Yi ∼ Exp(1) + 3.

For each of these four distributions generate histograms, as in exercise 3.6, and get the variance of your estimate of P(X ∈ A) for N = 1000 and N = 10000 respectively. Which of these four distributions gives the best results?

b) For the proposal distributions in part a), work out approximately how many samples Y_i would be required to get the error in your estimate of P(X ∈ A) down to 1%.

Exercise E-3.3. Consider the estimator

T_{10} = ((1/10) ∑_{i=1}^{10} (X_i − X̄)³) / ((1/10) ∑_{i=1}^{10} (X_i − X̄)²)^{3/2}  with  X̄ = (1/10) ∑_{i=1}^{10} X_i

for the skewness of X.

a) Write R functions which estimate the bias and mean squared error of this estimator for X ∼ N(0, σ²) for a given value of σ. (Hint: N(0, σ²) has skewness 0.)

b) Call these functions in a loop to create plots of the approximate bias and mean squared error as a function of σ ∈ [0, 10].


Exercise E-3.4. Let X_i be a sequence of i.i.d. random variables. Define the sample mean X̄ = (1/n) ∑_{i=1}^n X_i and the sample variance S² = (1/(n−1)) ∑_{i=1}^n (X_i − X̄)². Then, for c > 0, the interval

[X̄ − cS/√n, X̄ + cS/√n]

is a confidence interval for the mean µ = E(X_i).

a) For n = 10 and X_i ∼ N(7, 4), use Monte Carlo integration to estimate the confidence level of this confidence interval for c = 1.833, c = 2.262 and c = 3.250.

b) Similarly, estimate the corresponding confidence levels for X_i ∼ Exp(1). What do you observe? Comment on your result.


Chapter 4

Resampling Methods

In this chapter we present two methods which can be used instead of Monte Carlo methods if no statistical model is available.

4.1 Bootstrap Methods

Definition 4.1. Given a sequence of numbers x_1, x_2, . . . , x_M, the uniform distribution on the set {x_1, x_2, . . . , x_M} (counting duplicates as separate) is called the empirical distribution of the x_i. In this chapter we denote this distribution by P*_M.

Example 4.2. Let the data X = (1, 2, 1, 4) be given. Then P*_M({1}) = 1/2 and P*_M({2}) = P*_M({4}) = 1/4.

Let data X_1, X_2, . . . , X_M be given. Assume that this data is an i.i.d. sample from some distribution P and let X* ∼ P*_M. Then, for big M, the law of large numbers guarantees

P(X* ∈ A) = (1/M) ∑_{i=1}^M 1{X_i ∈ A} ≈ P(X ∈ A)

for every set A and

E(f(X*)) = ∑_{i=1}^M P(X* = X_i) f(X_i) = ∑_{i=1}^M (1/M) f(X_i) ≈ E(f(X))

for all functions f. Therefore, the distribution of the random variable X* can be used as an approximation to the distribution of X, i.e. as an approximation to P.

If the distribution P is not explicitly known, and if therefore we cannot easily generate samples from P, one idea is to use P*_M as an approximation to P and then apply Monte Carlo methods to this approximation: let X*_1, . . . , X*_N ∼ P*_M i.i.d. The samples X*_i are called bootstrap samples. Then

(1/N) ∑_{i=1}^N f(X*_i) ≈ E(f(X*)) ≈ E(f(X)).    (4.1)

Methods based on this idea are called bootstrap methods.

The error in the first of the two approximations in (4.1) goes to 0 as N increases. Since we control the generation of the samples X*_i, we can make this error arbitrarily small at the expense of additional computation time. The error in the second approximation goes to 0 as M → ∞. Often the initial data set X_1, X_2, . . . , X_M is fixed and cannot be easily extended; in these cases there is no way to reduce the error in the second approximation.

In typical applications of the bootstrap method, the function f depends on n independent values, i.e. we want to compute E(f(X_1, . . . , X_n)) where X_1, . . . , X_n are i.i.d., and the common distribution of the X_i is unknown to us but, as before, we have a big pool of values X_1, X_2, . . . , X_M sampled from this distribution at our disposal. The bootstrap method can then be applied as follows:

a) Generate independent samples X_i^{*(j)} ∼ P*_M for i = 1, 2, . . . , n and j = 1, 2, . . . , N.

b) Compute the approximation

E(f(X_1, . . . , X_n)) ≈ (1/N) ∑_{j=1}^N f(X_1^{*(j)}, . . . , X_n^{*(j)}).

For completeness, we rewrite this method as an algorithm:

Algorithm BOOT (bootstrap estimate)
input: a sample X_1, X_2, . . . , X_M ∈ A with values in some set A,
    f : A^n → R,
    N ∈ N.
randomness used: a sequence (k_i)_{i∈N} with k_i ∼ U{1, 2, . . . , M} i.i.d.
output: an estimate for E(f(X_1, . . . , X_n)).
1: s ← 0
2: for j = 1, 2, . . . , N do
3:   let X_i^{*(j)} ← X_{k_{n(j−1)+i}} for i = 1, 2, . . . , n
     (i.e. independently and uniformly draw n values from the given sample)
4:   s ← s + f(X_1^{*(j)}, . . . , X_n^{*(j)})
5: end for
6: return s/N
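Algorithm BOOT can be sketched in a few lines of Python; the data set and the choice of statistic (the maximum of n = 5 values) are hypothetical:

```python
import random

def boot_estimate(data, f, n, N, seed=0):
    """Algorithm BOOT: average f over N bootstrap samples of size n."""
    rng = random.Random(seed)
    s = 0.0
    for _ in range(N):
        xs = rng.choices(data, k=n)  # step 3: draw n values uniformly, with replacement
        s += f(xs)                   # step 4
    return s / N                     # step 6

# Hypothetical data; f = max gives a bootstrap estimate of E(max(X_1, ..., X_5)).
data = [1.2, 0.7, 2.4, 1.9, 0.3, 1.1, 2.0, 0.9, 1.5, 1.8]
est = boot_estimate(data, max, n=5, N=10_000)
print(est)
```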

Example 4.3. Let θ̂_n = θ̂_n(X_1, . . . , X_n) be a parameter estimate. Given a sample X_1, X_2, . . . , X_M, we can estimate the probability F_n(a) = P(θ̂_n ≤ a) for a ∈ R as follows:

a) Generate independent samples X_i^{*(j)} ∼ P*_M for i = 1, 2, . . . , n and j = 1, 2, . . . , N.

b) Compute θ̂_n^{*(j)} = θ̂_n(X_1^{*(j)}, . . . , X_n^{*(j)}) for j = 1, 2, . . . , N.

c) Compute F_n^*(a) = (1/N) ∑_{j=1}^N 1{θ̂_n^{*(j)} ≤ a}.

By the argument presented above, we then have F_n^*(a) ≈ F_n(a).

Example 4.4 (bootstrap estimate of the bias). Continuing from example 4.3, the bias of θ̂_n is sometimes estimated as

bias*(θ̂_n) = θ̄_n^* − θ̂_M(X_1, X_2, . . . , X_M)    (4.2)

where

θ̄_n^* = (1/N) ∑_{j=1}^N θ̂_n^{*(j)}

and the θ̂_n^{*(j)} are constructed as in the previous example.

Example 4.5 (bootstrap estimate of the standard error). The standard error of an estimator θ̂_n is the standard deviation of θ̂_n(X_1, . . . , X_n). The standard error of θ̂_n can be estimated as

se*(θ̂_n) = √( (1/(N − 1)) ∑_{j=1}^N (θ̂_n^{*(j)} − θ̄_n^*)² )

where θ̄_n^* and the θ̂_n^{*(j)} are constructed as above.
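The estimate of example 4.5 can be sketched as follows; the data set and the choice θ̂ = sample mean are hypothetical:

```python
import math, random

def boot_se(data, theta_hat, n, N, seed=0):
    """Bootstrap estimate se* of the standard error of theta_hat (example 4.5)."""
    rng = random.Random(seed)
    est = [theta_hat(rng.choices(data, k=n)) for _ in range(N)]
    mean = sum(est) / N
    return math.sqrt(sum((e - mean) ** 2 for e in est) / (N - 1))

# Hypothetical data; theta_hat is the sample mean of n = 10 resampled values.
data = [3.1, 2.7, 4.4, 3.9, 2.3, 3.1, 4.0, 2.9, 3.5, 3.8]
se = boot_se(data, lambda xs: sum(xs) / len(xs), n=10, N=5_000)
print(se)
```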

Using the bootstrap method, we can answer every question which can be answered using Monte Carlo methods, by replacing the samples X_1, . . . , X_N in the Monte Carlo algorithm with bootstrap samples X^{*(1)}, . . . , X^{*(N)}. The resulting estimates will usually be less accurate than the corresponding Monte Carlo estimates, because they are affected by two different kinds of sampling error, as described in equation (4.1). The advantage of bootstrap methods is that they still apply when a model for the data is not available.

4.2 Jackknife Methods

The jackknife method can be used to estimate the bias and standard error of an estimator θ̂_n = θ̂_n(X_1, . . . , X_n). The basic idea of the jackknife method is to construct N = n samples of length n − 1 by systematically leaving out one of the input values from the sample.

Let X_{(j)} be the data X_1, . . . , X_n with X_j left out. The samples X_{(1)}, . . . , X_{(n)} are called jackknife samples. From the jackknife samples we compute

θ̂_{(j)} = θ̂_{n−1}(X_{(j)}).

Definition 4.6. The jackknife estimate for the bias of θ̂_n is

bias_jack(θ̂_n) = (n − 1)(θ̄_{(·)} − θ̂_n)    (4.3)

where θ̄_{(·)} is the average of the θ̂_{(j)}.

The estimate bias_jack(θ̂_n) for the bias in (4.3) looks very different from the bootstrap estimate given in (4.2). The justification for the non-obvious form of the jackknife estimate is given in the following lemma.

Lemma 4.7. Let X_1, . . . , X_n be i.i.d. and assume that for all n ∈ N we have

bias(θ̂_n) = a/n + b/n² + O(1/n³)    (4.4)

where a ≠ 0. Here, O(1/n³) stands for terms which decay at least as fast as 1/n³ when n → ∞. Then

E(bias_jack(θ̂_n)) = bias(θ̂_n) + O(1/n²)    (4.5)

where O(1/n²) stands for terms which decay at least as fast as 1/n² as n → ∞.

Condition (4.4) on the bias looks artificial at first sight, but example 4.8 below shows that the condition is satisfied in common situations. Also, from (4.3), it is clear that the estimate bias_jack(θ̂_n) cannot be used to detect a constant (i.e. n-independent) component in the bias: if θ̃_n = θ̂_n + c, then bias_jack(θ̃_n) = bias_jack(θ̂_n). Therefore, for the estimate to be useful, the estimator must be asymptotically consistent. Finally, since we assume the bias itself to decay like 1/n, the lemma is only useful if the error term in (4.5) decays faster than 1/n. Here, this is a consequence of the presence of the b/n²-term in (4.4).

proof. Let θ be the true value of the estimated parameter. Then we can write

E(bias_jack(θ̂_n)) = (n − 1) E(θ̄_{(·)} − θ̂_n) = (n − 1) E(θ̄_{(·)} − θ) − (n − 1) E(θ̂_n − θ).    (4.6)

For the first term on the right-hand side we find

E(θ̄_{(·)} − θ) = E((1/n) ∑_{i=1}^n θ̂_{n−1}(X_{(i)}) − θ) = (1/n) ∑_{i=1}^n E(θ̂_{n−1}(X_{(i)}) − θ) = bias(θ̂_{n−1}(X_{(1)})).

In the last expression, the choice of X_{(1)} is arbitrary; any of the X_{(i)} could have been used instead. Since the X_i are i.i.d., X_{(1)} = (X_2, X_3, . . . , X_n) is an i.i.d. sample of length n − 1. For the second term on the right-hand side of (4.6) we have

In the last expression, the choice of X(1) is arbitrary, any of the X(i) could have been usedinstead. Since the Xi are i.i.d., X(1) = (X2, X3, . . . , Xn) is an i.i.d. sample of length n−1.For the second term on the right-hand side of (4.6) we have

E(θn − θ

)= bias

(θn)

and thus

E(biasjack(θn)

)= (n− 1)

(bias

(θn−1(X(1))

)− bias

(θn))

= (n− 1)( a

n− 1+

b

(n− 1)2− a

n− b

n2+O

( 1

n3

))=a

n+

(2n+ 1)b

(n− 1)n2+O

( 1

n2

)= bias

(θn)

+O( 1

n2

).

(q.e.d.)

Example 4.8. We know that the estimator

σ̂² = (1/(n−1)) ∑_{i=1}^n (X_i − X̄)²

is unbiased, whereas

σ̃² = (1/n) ∑_{i=1}^n (X_i − X̄)²

has bias

bias(σ̃²) = E(σ̃² − σ²) = ((n−1)/n) E(σ̂²) − σ² = ((n−1)/n − 1) σ² = −σ²/n.


Therefore the estimator σ̃² satisfies the condition of lemma 4.7 (with a = −σ² and b = 0), and the jackknife estimate (4.3) could be used to estimate the bias of σ̃².

Definition 4.9. The jackknife estimate for the standard error of θ̂_n is

se_jack = √( ((n − 1)/n) ∑_{j=1}^n (θ̂_{(j)} − θ̄_{(·)})² ).

The motivation for this formula is given in exercise 4.7.
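Definitions 4.6 and 4.9 can be sketched together; the data set is hypothetical, and the estimator is the biased variance estimator from example 4.8, so the estimated bias should come out negative:

```python
import math

def jackknife(data, theta_hat):
    """Jackknife estimates of the bias (def. 4.6) and standard error (def. 4.9)."""
    n = len(data)
    theta_n = theta_hat(data)
    loo = [theta_hat(data[:j] + data[j + 1:]) for j in range(n)]  # theta_(j)
    theta_dot = sum(loo) / n                                      # average theta_(.)
    bias = (n - 1) * (theta_dot - theta_n)
    se = math.sqrt((n - 1) / n * sum((t - theta_dot) ** 2 for t in loo))
    return bias, se

# The biased variance estimator of example 4.8, applied to hypothetical data.
def var_biased(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

data = [2.0, 3.5, 1.2, 4.1, 2.8, 3.3, 2.2, 3.9]
bias, se = jackknife(data, var_biased)
print(bias, se)
```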

Further Reading

A wealth of information about bootstrap methods can be found in the monograph of Davison and Hinkley (1997).

4.3 Exercises

Exercise E-4.1. For given data X_1, X_2, . . . , X_n consider the estimator θ̂_n = X̄ for the mean, where X̄ is the average of the X_i. Then we have se(θ̂) = √(Var(X_i)/n).

a) Show that θ̂_{(j)} = (nX̄ − X_j)/(n − 1).

b) Show that θ̄_{(·)} = X̄.

c) Show that se_jack = √(S_n²/n) where S_n² = (1/(n−1)) ∑_{j=1}^n (X_j − X̄)².


Chapter 5

Markov Chain Monte Carlo Methods

Monte Carlo methods use a sequence of i.i.d. samples X_i as their input. Thus these methods depend on our ability to generate the samples X_i efficiently. It transpires that sometimes, when generating the X_i is difficult, it can be easier to replace the i.i.d. sequence (X_i)_{i∈N} by a Markov chain instead. The resulting methods are called Markov chain Monte Carlo methods.

Definition 5.1. A Markov chain Monte Carlo (MCMC) method for estimating an expectation E(f(X)) is a method based on the approximation

E(f(X)) ≈ (1/N) ∑_{n=1}^N f(X_n)

where (X_n)_{n∈N} is a Markov chain with the distribution of X as its stationary distribution.

Before we consider the technical details of MCMC methods, we consider two examples of situations where MCMC methods can be useful.

Example 5.2 (Bayesian parameter estimates). In a Bayesian model, one assumes that the parameter θ of a statistical model is itself random. The distribution of θ is called the prior distribution. Here we assume that the prior distribution is given by a density p(θ). The distribution of the data X in the model depends on θ; we assume that it is given by a density p(x | θ). Our task is to gather as much information about θ as possible from the data X.

Since θ is assumed to be random and since, typically, the data X does not uniquely determine the value of θ, the solution of this parameter estimation problem will be to determine the conditional distribution of θ, given the data X. This distribution is called the posterior distribution; it depends on the data X and on the prior distribution for θ: by Bayes' rule (see appendix A.2), we find the density of the posterior distribution as

p(θ | x) = p(x | θ) p(θ) / ∫ p(x | θ) p(θ) dθ.    (5.1)

To study the posterior distribution it would be useful to be able to generate samples from the distribution with density p(θ | x), but this is made difficult by the fact that the value of the integral ∫ p(x | θ) p(θ) dθ on the right-hand side of (5.1) is often unknown. We will see that MCMC methods are one way to solve this problem.

Example 5.3 (Non-intersecting random walks). Non-intersecting paths on a grid are sometimes used as a simple model for polymers. To be able to approximate quantities like E(|X_n − X_0|) it is useful to be able to sample paths from the uniform distribution on all such paths of length n. Generating such samples turns out to be surprisingly difficult; for example, the total number of non-intersecting paths of length n is unknown for all but very small values of n. MCMC methods can be used to solve this problem.

5.1 Markov Chains

In this section we will give a very brief introduction to Markov chains. We restrict ourselves to the case of discrete time and omit some of the technical details.

Definition 5.4. A stochastic process X = (X_n)_{n∈N_0} with values in a set S is called a Markov chain, if

P(X_n ∈ A_n | X_{n−1} ∈ A_{n−1}, X_{n−2} ∈ A_{n−2}, . . . , X_0 ∈ A_0) = P(X_n ∈ A_n | X_{n−1} ∈ A_{n−1})    (5.2)

for all A_0, A_1, . . . , A_n ⊆ S and all n ∈ N, i.e. if the distribution of X_n depends on X_0, . . . , X_{n−2} only through X_{n−1}. The set S is called the state space of X. The distribution of X_0 is called the initial distribution of X. Often X_0 is deterministic, i.e. P(X_0 = x) = 1 for some x ∈ S; in this case x is called the initial value or starting point of X. The index n is typically interpreted as time.

Example 5.5. Let (ξ_i)_{i∈N} be an i.i.d. sequence of random variables. The process X given by X_0 = 0 and X_n = X_{n−1} + ξ_n for all n ∈ N is a Markov chain. We can write X_n as

X_n = ∑_{i=1}^n ξ_i.

A Markov chain of this type is called a random walk. Important special cases are ξ_i ∼ U({−1, 1}) (the symmetric, simple random walk on Z) and ξ_i ∼ N(0, 1).

Example 5.6. Let (ξ_i)_{i∈N} be an i.i.d. sequence of random variables with Var(ξ_i) = 1. Then the process X given by X_0 = X_1 = 0 and

X_n = (X_{n−1} + X_{n−2})/2 + ξ_n

for all n = 2, 3, . . . is not a Markov chain.

If the state space S is finite, e.g. S = {1, 2, . . . , N}, then the transition probabilities P(X_n ∈ A_n | X_{n−1} ∈ A_{n−1}) in (5.2) can be described by giving the probabilities of the transitions between all pairs of elements of S. The resulting matrix P = (p_ij)_{i,j∈S} with

p_ij = P(X_n = j | X_{n−1} = i)

is called the transition matrix of the Markov chain. If the transition matrix does not depend on n, the Markov chain is called time-homogeneous.


When considering transition matrices, it is often convenient to label the rows and columns of the matrix P using elements of S instead of the usual indices {1, 2, . . . , N}. Thus, if S is the alphabet S = {A, B, . . . , Z} we write p_AZ instead of p_{1,26} to denote the transition probability from A to Z. We write R^{S×S} for the set of all matrices where the columns and rows are indexed by elements of S. Similarly, for vectors consisting of probability weights for the elements of S, e.g. the initial distribution of a Markov chain, it is convenient to label the components of the vector by elements of S. We write R^S for the set of all such vectors.

Definition 5.7. A vector π ∈ R^S is called a probability vector, if π_i ≥ 0 for all i ∈ S and ∑_{i∈S} π_i = 1.

Example 5.8. If ξ_i ∼ U({1, 2}) i.i.d., then the process X defined by

X_n = (∑_{i=1}^n ξ_i) mod 4

is a time-homogeneous Markov chain with state space S = {0, 1, 2, 3} and transition matrix

P = (  0   1/2  1/2   0
       0    0   1/2  1/2
      1/2   0    0   1/2
      1/2  1/2   0    0 ) ∈ R^{S×S}.

Row i of the matrix, for i = 0, 1, 2, 3, consists of the elements p_i0, p_i1, p_i2 and p_i3, giving the probabilities for going from state i to the states 0, 1, 2 and 3 respectively.

Lemma 5.9. Let P ∈ R^{S×S} be the transition matrix of a Markov chain with state space S. Then P = (p_ij)_{i,j∈S} has the following properties:

a) p_ij ≥ 0 for all i, j ∈ S.

b) ∑_{j∈S} p_ij = 1 for all i ∈ S.

proof. The claim follows directly from the definition of P. (q.e.d.)

Definition 5.10. A matrix which satisfies the two conditions from lemma 5.9 is called a stochastic matrix or a transition matrix.

Lemma 5.11. Let X be a time-homogeneous Markov chain with finite state space and transition matrix P. Then

P(X_{n+k} = j | X_n = i) = (P^k)_{ij}

for all n, k ∈ N_0 and i, j ∈ S, where P^k = P · P · · · P is the k-th power of the transition matrix P.

proof. For k = 0, the matrix P⁰ is by definition the identity matrix and the statement holds. Also, for k = 1 we have

P(X_{n+1} = j | X_n = i) = p_ij

by the definition of the transition matrix.

Now let k > 1 and assume that the statement holds for k − 1. Then we have

P(X_{n+k} = j | X_n = i) = P(X_{n+k} = j, X_n = i) / P(X_n = i)
  = (1/P(X_n = i)) ∑_{l∈S} P(X_{n+k} = j, X_{n+k−1} = l, X_n = i)
  = (1/P(X_n = i)) ∑_{l∈S} P(X_{n+k−1} = l, X_n = i) P(X_{n+k} = j | X_{n+k−1} = l, X_n = i).

Using the Markov property (5.2) we get

P(X_{n+k} = j | X_n = i) = ∑_{l∈S} (P(X_{n+k−1} = l, X_n = i)/P(X_n = i)) P(X_{n+k} = j | X_{n+k−1} = l)
  = ∑_{l∈S} P(X_{n+k−1} = l | X_n = i) p_lj
  = ∑_{l∈S} (P^{k−1})_{il} p_lj.

The last expression can be read as a matrix-matrix multiplication of P^{k−1} with P and thus we get

P(X_{n+k} = j | X_n = i) = (P^{k−1} · P)_{ij} = (P^k)_{ij}

as required. (q.e.d.)

From lemma 5.11 we know that the matrix P^k describes the k-step transitions of a Markov chain. As a consequence of this, we can see that a time-homogeneous Markov chain with finite state space S is completely described by the initial distribution π and the transition matrix P: at time 0 we have

P(X_0 = i) = π_i

for all i ∈ S. For time k > 0 we can use the law of total probability to find

P(X_k = j) = ∑_{i∈S} P(X_k = j, X_0 = i) = ∑_{i∈S} P(X_0 = i) P(X_k = j | X_0 = i) = ∑_{i∈S} π_i (P^k)_{ij}

for all j ∈ S. If we consider the transposed vector π^⊤ as a matrix with one row and |S| columns, we can write the last expression as a matrix-matrix multiplication and find

P(X_k = j) = (π^⊤ P^k)_j.    (5.3)

Thus, the vector π^⊤ P^k gives the distribution of X_k.

An important special case of relation (5.3) is the case where π^⊤ P = π^⊤ (and then π^⊤ P^k = π^⊤ for all k ∈ N). In this case, if we start the Markov chain with initial distribution π, we have X_k ∼ π for all k ∈ N: the distribution of X_k is the same for all k ∈ N_0.

Definition 5.12. Let X be a time-homogeneous Markov chain with transition matrix P. A probability vector π is called a stationary distribution of X, if π^⊤ P = π^⊤, i.e. if

∑_{i∈S} π_i p_ij = π_j    (5.5)

for all j ∈ S.

In general, a Markov chain may have more than one stationary distribution, but for thecases which will be of interest in this chapter, there will only be one stationary distribution.

Example 5.13. On the state space S = {1, 2, 3}, consider the Markov chain with transition matrix

P = ( 1/2  1/2   0
       0   1/2  1/2
      1/5   0   4/5 )

(see figure 5.1 for an illustration) and initial distribution α = (1, 0, 0).

We can use equation (5.3) to get the distribution after one step: P(X_1 = i) = (α^⊤ P)_i where

α^⊤ P = (1/2, 1/2, 0).

Similarly, for X_2 we find P(X_2 = i) = (α^⊤ P²)_i where

α^⊤ P² = (1/4, 1/2, 1/4).

Continuing to X_3, X_4, . . . we get

α^⊤ P³ = (0.175, 0.375, 0.450)
α^⊤ P⁴ ≈ (0.178, 0.275, 0.548)
...
α^⊤ P¹⁰ ≈ (0.223, 0.222, 0.555).

Experimenting shows that the value of α^⊤ P^k does not change significantly when k is increased further, so we can guess that this value is close to the stationary distribution of the Markov chain. Indeed, using definition 5.12 we can verify that π = (2/9, 2/9, 5/9) is the stationary distribution:

π^⊤ P = (2/9 · 1/2 + 5/9 · 1/5, 2/9 · 1/2 + 2/9 · 1/2, 2/9 · 1/2 + 5/9 · 4/5) = (2/9, 2/9, 5/9) = π^⊤.

Thus, we have seen that for the Markov chain considered in this example, as n → ∞, the probabilities P(X_n = i) converge to the stationary probabilities π_i. Convergence results like this often, but not always, hold.


Figure 5.1: Graphical representation of the transition matrix from example 5.13. The numbers on the arrows give the transition probabilities between states.

The condition for π being a stationary distribution can be rewritten by taking the transpose of the equation π^⊤ P = π^⊤: a probability vector π is a stationary distribution for P if and only if

P^⊤ π = π,    (5.6)

i.e. if π is an eigenvector of P^⊤ with eigenvalue 1. Since computing eigenvectors is a well-studied problem, and many software packages provide built-in functions for this purpose, this property can be used to find a stationary distribution π for a given transition matrix P.
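For a small chain one can also find the stationary distribution numerically by simply iterating α^⊤ P^k, as in the experiment of example 5.13; a sketch in Python:

```python
def step(pi, P):
    """One step of the distribution: return the row-vector product pi^T P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

# Transition matrix from example 5.13.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.2, 0.0, 0.8]]

alpha = [1.0, 0.0, 0.0]   # initial distribution
for _ in range(100):      # iterate alpha^T P^k
    alpha = step(alpha, P)
print(alpha)              # close to the stationary distribution (2/9, 2/9, 5/9)
```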

Definition 5.14. A time-homogeneous Markov chain X with state space S and transition matrix P = (p_ij)_{i,j∈S} satisfies the detailed balance condition, if there is a probability vector π ∈ R^S with

π_i p_ij = π_j p_ji

for all i, j ∈ S. In this case we say that the Markov chain X is π-reversible.

Lemma 5.15. Let X be a π-reversible Markov chain. Then π is a stationary distribution of X.

proof. If X satisfies the detailed balance condition for a probability vector π, we have

∑_{i∈S} π_i p_ij = ∑_{i∈S} π_j p_ji = π_j ∑_{i∈S} p_ji = π_j

for all j ∈ S. Thus, π satisfies the condition from equation (5.5) and consequently is a stationary distribution. (q.e.d.)

In this section we have so far mostly considered Markov chains with finite state space S, but everything said so far still holds when S is allowed to be countably infinite. Markov chains with finite or countably infinite state space are called Markov chains with discrete state space. In these cases, the transition matrix P is an “infinite matrix”

P = ( p_11  p_12  p_13  · · ·
      p_21  p_22  p_23  · · ·
       ⋮     ⋮     ⋮    ⋱  ),

and the properties of P and the definition of a stationary distribution are the same as for the case of finite state space.

Example 5.16. The symmetric simple random walk X, given by

X_n = ∑_{k=1}^n ξ_k

for all n ∈ N_0 with ξ_i ∼ U({−1, +1}) i.i.d., is a Markov chain with state space S = Z.

Everything discussed so far also applies to Markov chains with continuous state space, e.g. to the cases S = R or S = R^d. In this case, probability vectors have to be replaced by probability densities and sums over the state space S have to be replaced by integrals, e.g. the transition matrix P is replaced by a transition density p(x, y) where, for fixed x ∈ S, the function y ↦ p(x, y) gives the probability density of X_n on S conditioned on X_{n−1} = x.

Example 5.17. On S = R, the process given by X_0 = 0 and

X_n = (1/2) X_{n−1} + ξ_n

for all n ∈ N, where ξ_n ∼ N(0, 1) i.i.d., is a Markov chain with state space S = R. Given X_{n−1} = x, we have X_n = x/2 + ξ_n ∼ N(x/2, 1), i.e. the transition density p is given by

p(x, y) = (1/√(2π)) exp(−(y − x/2)²/2)

for all x, y ∈ R.

In analogy to (5.6), a probability density π : R → [0, ∞) is a stationary density for this Markov chain, if it satisfies

∫_S π(x) p(x, y) dx = π(y)

for all y ∈ S.
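For example 5.17 this condition can be checked numerically: assuming a stationary density of the form N(0, s), the recursion X_n = X_{n−1}/2 + ξ_n forces s = s/4 + 1, i.e. s = 4/3. The Python sketch below verifies ∫ π(x) p(x, y) dx = π(y) for this candidate density by a simple Riemann sum.

```python
import math

def phi(x, mean, var):
    """Density of the N(mean, var) distribution."""
    return math.exp(-(x - mean)**2 / (2*var)) / math.sqrt(2*math.pi*var)

def p(x, y):
    """Transition density of example 5.17: given X_{n-1} = x, X_n ~ N(x/2, 1)."""
    return phi(y, x/2, 1.0)

def pi(x):
    """Candidate stationary density N(0, 4/3), from solving s = s/4 + 1."""
    return phi(x, 0.0, 4.0/3.0)

def lhs(y, a=-12.0, b=12.0, n=20000):
    """Midpoint-rule approximation of the integral of pi(x) p(x, y) over x."""
    h = (b - a) / n
    return sum(pi(a + (i + 0.5)*h) * p(a + (i + 0.5)*h, y) for i in range(n)) * h

for y in (-1.0, 0.0, 2.0):
    print(y, lhs(y), pi(y))   # the two columns agree closely
```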

5.2 The Metropolis-Hastings Method

In order to apply MCMC methods as in definition 5.1, we need to be able to find Markov chains with a prescribed stationary distribution. The Metropolis-Hastings method, described in this section, is a popular method to solve this problem. The resulting algorithm is similar to the rejection sampling algorithm: starting from a different Markov chain on the same state space, the Metropolis-Hastings algorithm generates a new Markov chain with the required stationary distribution by "rejecting" some of the state transitions.

Since the Metropolis-Hastings method is a method to generate random samples (i.e. paths from a Markov chain), it is technically a sampling method and could have been discussed in chapter 2. The reason we discuss the algorithm here instead is that, in practice, it is almost exclusively used as a component in MCMC methods.


5.2.1 Description of the Method

We first formulate the algorithm for the case of discrete state space.

Algorithm MH1 (Metropolis-Hastings method for discrete state space S)

input: a probability vector π (the target distribution),
a transition matrix P = (p_{xy})_{x,y∈S},
X_0 ∈ S with π(X_0) > 0.

randomness used: independent samples Y_n from the transition matrix P (the proposals),
U_n ∼ U[0, 1] i.i.d.

output: a sample of a Markov chain X with stationary distribution π.

As an abbreviation we define a function α : S × S → [0, 1] by

α(x, y) = min( (π_y p_{yx}) / (π_x p_{xy}), 1 )

for all x, y ∈ S with π_x p_{xy} > 0.

1: for n = 1, 2, 3, . . . do
2:   generate Y_n with P(Y_n = j) = p_{X_{n−1}, j} for all j ∈ S
3:   generate U_n ∼ U[0, 1]
4:   if U_n ≤ α(X_{n−1}, Y_n) then
5:     X_n ← Y_n
6:   else
7:     X_n ← X_{n−1}
8:   end if
9: end for

The acceptance probability α(x, y) is only defined for x, y ∈ S with π_x p_{xy} > 0. At first glance it seems that the algorithm could fail by hitting a proposal Y_n such that α(X_{n−1}, Y_n) is undefined, but it transpires that this cannot happen: assume that we have already found X_0, X_1, . . . , X_{n−1}. Then we have π(X_{n−1}) > 0 (else, it would not have been accepted in a previous step) and, given X_{n−1}, the proposal Y_n satisfies p_{X_{n−1}, Y_n} > 0 with probability 1 (else, it would not have been proposed). Consequently, α(X_{n−1}, Y_n) is defined and we can compute the next value X_n.

The same algorithm also works for the case of continuous state space S, if all probability vectors are replaced by probability densities. For completeness we state the resulting algorithm explicitly.

Algorithm MH2 (Metropolis-Hastings method for continuous state space S)

input: a probability density π (the target density),
a transition density p : S × S → [0, ∞),
X_0 ∈ S.

randomness used: independent samples Y_n from the transition density p (the proposals),
U_n ∼ U[0, 1] i.i.d.

output: a sample of a Markov chain X with stationary density π.


As an abbreviation we define a function α : S × S → [0, 1] by

α(x, y) = min( (π(y) p(y, x)) / (π(x) p(x, y)), 1 )

for all x, y ∈ S with π(x) p(x, y) > 0.

1: for n = 1, 2, 3, . . . do
2:   generate Y_n with density p(X_{n−1}, · )
3:   generate U_n ∼ U[0, 1]
4:   if U_n ≤ α(X_{n−1}, Y_n) then
5:     X_n ← Y_n
6:   else
7:     X_n ← X_{n−1}
8:   end if
9: end for

Theorem 5.18. The process (X_n)_{n∈N} constructed in the Metropolis-Hastings algorithm is a π-reversible Markov chain. In particular, X has stationary distribution π.

proof. For the proof we restrict ourselves to the discrete case from algorithm MH1. The proof for continuous state space can be obtained by replacing all probability weights by the corresponding densities.

Clearly, since X_n in the algorithm depends only on X_{n−1} and on the additional, independent randomness from Y_n and U_n, the process (X_n)_{n∈N} is a Markov chain. Let Q = (q_{xy})_{x,y∈S} be the transition matrix of this Markov chain, i.e.

q_{xy} = P(X_n = y | X_{n−1} = x)

for all x, y ∈ S. Then we have to show that Q satisfies the detailed balance condition

π_x q_{xy} = π_y q_{yx} (5.7)

for all x, y ∈ S.

In the case x = y, the relation (5.7) is trivially true; thus we can assume x ≠ y. For the process X to jump from x to y, the proposal Y_n must equal y and then the proposal must be accepted. Since these two events are independent, we find

q_{xy} = p_{xy} · α(x, y)

and thus

π_x q_{xy} = π_x p_{xy} min( (π_y p_{yx}) / (π_x p_{xy}), 1 ) = min( π_y p_{yx}, π_x p_{xy} ) = π_y p_{yx} min( 1, (π_x p_{xy}) / (π_y p_{yx}) ) = π_y q_{yx}

for all x, y ∈ S with π_x p_{xy} > 0 and π_y p_{yx} > 0. There are various cases with π_x p_{xy} = 0 or π_y p_{yx} = 0; using the fact that p_{xy} = 0 implies q_{xy} = 0 (since transitions from x to y are never proposed) it is easy to check that in all of these cases π_x q_{xy} = 0 = π_y q_{yx}


holds. Thus, equation (5.7) is satisfied for all x, y ∈ S, the process X is π-reversible and, by lemma 5.15, π is a stationary distribution of X. (q.e.d.)

Example 5.19. Let π_x = 2^{−|x|}/3 for all x ∈ Z. Then

∑_{x∈Z} π_x = (1/3) ( ··· + 1/4 + 1/2 + 1 + 1/2 + 1/4 + ··· ) = (1/3) ( 1 + 2 ∑_{x=1}^∞ 2^{−x} ) = 1,

i.e. π is a probability vector on S = Z. Using the algorithm MH1, we can easily find Markov chains which have π as a stationary distribution: for the algorithm, we can choose the transition matrix P for the proposals. For this example we consider

P(Y_n = x + 1 | X_{n−1} = x) = P(Y_n = x − 1 | X_{n−1} = x) = 1/2

for all x ∈ S and all n ∈ N. This corresponds to p_{xy} = 1/2 if y = x + 1 or y = x − 1 and p_{xy} = 0 else. From this we can compute the acceptance probabilities α(x, y):

α(x, y) = min( (π_y p_{yx}) / (π_x p_{xy}), 1 ) = min( (2^{−|y|}/3 · p_{yx}) / (2^{−|x|}/3 · p_{xy}), 1 ) = min( 2^{|x|−|y|} p_{yx}/p_{xy}, 1 ).

In the Metropolis-Hastings algorithm, the function α(x, y) is only evaluated for x = y + 1 and x = y − 1. In either of these cases we have p_{xy} = 1/2 and thus

α(x, y) = 2^{|x|−|y|}, if |y| > |x|, and α(x, y) = 1 else.

Finally, substituting this transition matrix P and the corresponding function α into algorithm MH1 we get the following:

1: for n = 1, 2, 3, . . . do
2:   let Y_n ← X_{n−1} + ξ_n where ξ_n ∼ U{−1, 1}
3:   generate U_n ∼ U[0, 1]
4:   if U_n ≤ 2^{|X_{n−1}|−|Y_n|} then
5:     X_n ← Y_n
6:   else
7:     X_n ← X_{n−1}
8:   end if
9: end for

By theorem 5.18, the resulting process X is a Markov chain with stationary distribution π.
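This chain is easy to implement; the sketch below (in Python rather than R, with a fixed seed) is a direct transcription of the pseudocode and checks that the empirical frequencies approach π_x = 2^{−|x|}/3:

```python
import random
from collections import Counter

random.seed(0)

def mh_chain(n_steps, x0=0):
    """Metropolis-Hastings chain of example 5.19 on Z: proposals X +/- 1,
    acceptance probability alpha(x, y) = 2^{|x|-|y|} if |y| > |x|, else 1."""
    x, path = x0, []
    for _ in range(n_steps):
        y = x + random.choice((-1, 1))
        alpha = 2.0**(abs(x) - abs(y)) if abs(y) > abs(x) else 1.0
        if random.random() <= alpha:
            x = y
        path.append(x)
    return path

path = mh_chain(200_000)
counts = Counter(path)
for x in (0, 1, 2):
    print(x, counts[x] / len(path), 2.0**(-abs(x)) / 3)   # empirical vs pi_x
```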

Following definition 5.1, the Metropolis-Hastings method for estimating an expectation E(f(X)) uses the approximation

E(f(X)) ≈ (1/N) ∑_{n=1}^N f(X_n) (5.8)

where X_n is the output of algorithm MH1 or MH2 above, applied with the distribution of X as the target distribution π. For the method to work, the Markov chain must be


"ergodic", i.e. the distribution of X_n must converge to the stationary distribution π, and the convergence must be "fast enough" such that the right-hand side of (5.8) converges to E(f(X)). Finding sufficient conditions for this convergence to hold can be difficult.

Here we restrict ourselves to considering a necessary condition for the algorithm to work: the Markov chain given by P must be able to move between any x, y ∈ S with π(x), π(y) > 0 (possibly taking several steps to get from x to y). If this condition does not hold, the Markov chain X will not be able to explore all of S using just one path (X_1, X_2, . . .) and thus the right-hand side of (5.8) will not correspond to an average over all of S as would be required for computing E(f(X)).

The choice of the transition matrix P determines the efficiency of the resulting method. The method is efficient if

a) The Markov chain with transition matrix P moves through S quickly.

b) The acceptance probabilities are big on average, so that many of the proposed transitions are accepted.

Often it is good to choose P so that it has a stationary distribution which is close to the target distribution π.

The final result of an MCMC estimate is affected by two different kinds of error: first, since we are using the approximation (5.8) given by the law of large numbers, we need N to be big for the estimate to be accurate. This is the same effect as for basic Monte Carlo estimates. And secondly, as discussed in the preceding paragraphs, the distribution of the X_n will not be exactly equal to the target distribution π, but will only converge to π as n → ∞. This introduces an additional error, and a typical approach to mitigate this is to not include the first samples in the Monte Carlo estimate, in order to give the Markov chain time to get close to stationarity. Thus, in practice one often uses

E(f(X)) ≈ (1/(N − M)) ∑_{n=M+1}^N f(X_n)

where 1 ≪ M ≪ N. The samples X_1, X_2, . . . , X_M still need to be computed (to get the following values); this is commonly referred to as the burn-in period.
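As a minimal sketch of this burn-in estimator (in Python; the list xs of samples, the function f, and the burn-in length M are placeholder names, not from the text):

```python
def mcmc_estimate(xs, f, M):
    """Approximate E(f(X)) by averaging f over the samples after a burn-in
    period of length M, as in the formula above."""
    tail = xs[M:]
    return sum(f(x) for x in tail) / len(tail)

# Toy "chain" started far from stationarity; it settles near 1 after a while.
xs = [10.0, 5.0, 2.0, 1.1, 0.9, 1.0, 1.05, 0.95]
print(mcmc_estimate(xs, lambda x: x, M=3))   # average of the last five values
```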

One big advantage of the Metropolis-Hastings algorithm is that, similar to the rejection sampling algorithm, the target density π only needs to be known up to a constant: the only place where π occurs in the algorithm is in the acceptance probability α; if we only have access to c·π where the value of c is not known, we can still evaluate the function α, since

α(x, y) = min( (π_y p_{yx}) / (π_x p_{xy}), 1 ) = min( (cπ_y p_{yx}) / (cπ_x p_{xy}), 1 ).

This property of the Metropolis-Hastings algorithm can, for example, be used in situations like the one from example 5.2 where we need to sample from the density

p(θ | x) = p(x | θ) p(θ) / ∫ p(x | θ) p(θ) dθ

with an unknown normalisation constant ∫ p(x | θ) p(θ) dθ.


5.2.2 Random Walk Metropolis Sampling

The Metropolis-Hastings algorithm for the case p(x, y) = p(y, x) is sometimes called the Metropolis algorithm. In this case, the expression for the acceptance probability α simplifies to

α(x, y) = min( (π(y) p(y, x)) / (π(x) p(x, y)), 1 ) = min( π(y)/π(x), 1 )

for all x, y ∈ S with π(x) > 0 (or π_y/π_x for discrete state space). The condition p(x, y) = p(y, x) is, for example, satisfied when the proposals Y_n are constructed as

Yn = Xn−1 + ξn

where the ξ_n are i.i.d. with a symmetric distribution (i.e. ξ_n has the same distribution as −ξ_n). We only formulate the version of the resulting algorithm for continuous state space S; the discrete version is found by using a probability vector instead of a density for the target distribution.

Algorithm RWM (Random Walk Metropolis sampling)

input: a probability density π ∈ R^S (the target density),
X_0 ∈ S.

randomness used: an i.i.d. sequence (ξ_n)_{n∈N} with a symmetric distribution (the proposals),
U_n ∼ U[0, 1] i.i.d.

output: a sample of a Markov chain X with stationary density π.

As an abbreviation we define a function α : S × S → [0, 1] by

α(x, y) = min( π(y)/π(x), 1 )

for all x, y ∈ S with π(x) > 0.

1: for n = 1, 2, 3, . . . do
2:   let Y_n ← X_{n−1} + ξ_n
3:   generate U_n ∼ U[0, 1]
4:   if U_n ≤ α(X_{n−1}, Y_n) then
5:     X_n ← Y_n
6:   else
7:     X_n ← X_{n−1}
8:   end if
9: end for

For S = R, the most common choice for the sequence ξ_n is ξ_n ∼ N(0, σ²). The proposal variance σ² can be chosen to maximise efficiency of the method: if σ² is small, we have Y_n ≈ X_{n−1}, i.e. π(Y_n) ≈ π(X_{n−1}) and consequently α(X_{n−1}, Y_n) ≈ 1. In this case, almost all proposals are accepted, but the algorithm moves slowly since the increments in each step are small. On the other hand, if σ² is big, π(Y_n) can be small and consequently many proposals are rejected: the process X will then not move very often, but when it does, the increment is typically big. The optimal choice of σ² will be between these two extremes. This is illustrated in example 5.9 (see also figure 5.3).
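This trade-off can be seen in a few lines of Python, using the standard normal as a stand-in target (for illustration only; it is not the bimodal density from the exercise): the observed acceptance rate drops as the proposal variance grows.

```python
import math, random

random.seed(1)

def rwm(n_steps, sigma, log_pi, x0=0.0):
    """Random walk Metropolis with increments xi_n ~ N(0, sigma^2); returns
    the path and the observed fraction of accepted proposals."""
    x, accepted, path = x0, 0, []
    for _ in range(n_steps):
        y = x + random.gauss(0.0, sigma)
        # alpha = min(pi(y)/pi(x), 1), computed on the log scale
        if random.random() <= math.exp(min(0.0, log_pi(y) - log_pi(x))):
            x, accepted = y, accepted + 1
        path.append(x)
    return path, accepted / n_steps

log_pi = lambda x: -x*x/2          # N(0, 1) target, up to an additive constant

rates = {}
for sigma in (0.1, 1.0, 10.0, 100.0):
    _, rates[sigma] = rwm(20_000, sigma, log_pi)
    print(sigma, rates[sigma])
```

Note that only the unnormalised log-density enters the acceptance step, in line with the discussion above.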



[Plot omitted: the non-normalised density π(x) plotted against x ∈ [−4, 4].]

Figure 5.2: The (non-normalised) target density for exercise 5.9. The task is to estimate the expectation of the corresponding distribution, using an MCMC method. The problem is complicated by the bi-modality of the distribution.

[Plot omitted: average acceptance probability (between 0 and 1) plotted against σ² on a logarithmic scale from 10⁻⁶ to 10⁶.]

Figure 5.3: The average acceptance probability E(α(X_{n−1}, Y_n)) of the random walk Metropolis sampler from exercise 5.9, as a function of σ².


5.2.3 Independence Sampler

Another special case is obtained by choosing the proposals Y_n independently of X_{n−1}, i.e. by using a transition density of the form p(x, y) = p(y).

Algorithm IND (Independence Sampler)

input: a probability density π ∈ R^S (the target density),
X_0 ∈ S.

randomness used: an i.i.d. sequence (Y_n)_{n∈N} with density p (the proposals),
U_n ∼ U[0, 1] i.i.d.

output: a sample of a Markov chain X with stationary density π.

As an abbreviation we define a function α : S × S → [0, 1] by

α(x, y) = min( (π(y) p(x)) / (π(x) p(y)), 1 )

for all x, y ∈ S with π(x) > 0.

1: for n = 1, 2, 3, . . . do
2:   generate Y_n with density p
3:   generate U_n ∼ U[0, 1]
4:   if U_n ≤ α(X_{n−1}, Y_n) then
5:     X_n ← Y_n
6:   else
7:     X_n ← X_{n−1}
8:   end if
9: end for

While the proposals Y_n are independent in this algorithm, the X_n are still dependent, since the acceptance probability α(X_{n−1}, Y_n) for X_n depends on the value of X_{n−1}.
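A sketch of the independence sampler in Python, with an assumed target N(0, 1) and assumed proposal density p = N(0, 4) (both choices are made up for illustration; as above, constants in the densities cancel in α):

```python
import math, random

random.seed(2)

def independence_sampler(n_steps, x0=0.0):
    """Independence sampler: proposals Y_n ~ N(0, 4), drawn independently of
    the current state; target pi = N(0, 1).  Densities enter only through
    alpha(x, y) = min(pi(y)p(x) / (pi(x)p(y)), 1)."""
    log_pi = lambda x: -x*x/2          # target N(0, 1), up to constants
    log_p  = lambda x: -x*x/8          # proposal N(0, 4), up to constants
    x, path = x0, []
    for _ in range(n_steps):
        y = random.gauss(0.0, 2.0)
        log_alpha = (log_pi(y) + log_p(x)) - (log_pi(x) + log_p(y))
        if random.random() <= math.exp(min(0.0, log_alpha)):
            x = y
        path.append(x)
    return path

path = independence_sampler(50_000)
mean = sum(path) / len(path)
var = sum(v*v for v in path) / len(path) - mean**2
print(mean, var)   # should be close to 0 and 1, the moments of the target
```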

Further Reading

Markov chains and the Metropolis-Hastings method are discussed in chapters 6 and 7 of Robert and Casella (2004). Applications and extensions of the methods can be found in Gilks et al. (1996).

5.3 Exercises

Exercise E-5.1. Let Z_n ∼ N(0, 1) i.i.d. for n ∈ N. Which of the following stochastic processes are Markov chains?

a) X_n = ∑_{i=1}^n Z_i

b) X_n = (1/n) ∑_{i=1}^n Z_i

c) X_n = ∑_{i=1}^n (1/i) Z_i

d) X_1 = Z_1 + X_0 and X_n = Z_n + X_{n−1} + X_{n−2} for all n ≥ 2.

56

Page 59: MATH5835 Statistical Computinglibvolume7.xyz/.../statisticalpackages/statisticalpackagestutorial2.pdf · 3;:::of independent, identically distributed (i.i.d.) random variables, uniformly

Justify your answers.

Exercise E-5.2. From lemma 5.9 we know that the sum of elements in each row of a transition matrix equals 1. Find a Markov chain such that the column sums of its transition matrix are all different from 1, i.e. such that

∑_{i∈S} p_{ij} ≠ 1

for all j ∈ S.

Exercise E-5.3. Let X be the Markov chain with transition matrix

P = ( 2/3  1/3  0    0    )
    ( 1/10 9/10 0    0    )
    ( 1/10 0    9/10 0    )
    ( 1/10 0    0    9/10 )

and initial distribution µ = (1/4, 1/4, 1/4, 1/4).

a) Write an R program which generates a random path X_0, X_1, X_2, . . . , X_n from this Markov chain. (Hint: the R command sample may be useful here.)

b) Use the program from part a and Monte Carlo integration to numerically determine the distribution of X_10 (i.e. you have to estimate the probabilities P(X_10 = k) for k = 1, 2, 3, 4).

c) Analytically compute the distribution of X_10. (Hint: it may be useful to use R as a pocket calculator here. The function matrix and the operator %*% for matrix multiplication may be useful here.)

Exercise E-5.4.

• Let P be a stochastic matrix. Show that the vector x = (1, 1, . . . , 1) is an eigenvector of P and determine the corresponding eigenvalue.

• Let x be an eigenvector of a matrix A with eigenvalue λ. Show that for α ≠ 0 the vector αx is also an eigenvector of A.

• Find a stationary distribution for the stochastic matrix P from exercise 5.4 by computing the eigenvectors of P^T. You can use the R functions t and eigen for this exercise. Some care is needed because the entries of the computed eigenvector may not be positive and may not sum up to 1 (see part b for an explanation). How can we solve this problem?

Exercise E-5.5. Our aim in this exercise is to estimate the expectation µ of the distribution with density

π(x) = c ( exp( −((x − 2)⁴ − 2x)/2 ) + 5 exp( −(x + 2)⁴/2 ) )

where c is the normalising constant which makes π a probability density. The non-normalised version of this density is shown in figure 5.2.


a) Write a program which implements the Metropolis-Hastings algorithm. Use the normal distribution N(X_{n−1}, σ²) as the proposal distribution for X_n.

b) Explore, by experimenting with your program, how the algorithm behaves for different values of σ². Produce two plots showing X_n as a function of n, for two different values of σ², which show different behaviour of the process. Which of the two choices of σ² is more feasible for use in an MCMC method?

c) Find out experimentally how big n needs to be to get stable estimates for the expectation µ. Produce a plot of

(1/n) ∑_{i=1}^n X_i

as a function of n and use this to get an estimate for µ.

d) The acceptance probability α depends on n. Modify your program to compute the average acceptance probability over the whole run. Generate a plot which shows how this average acceptance probability depends on the parameter σ².


Chapter 6

The EM Algorithm

The expectation maximisation (EM) algorithm allows one to compute maximum likelihood estimates in the presence of unobserved variables. We consider the following setup:

• the vector X describes the unobserved variables

• the vector Y describes the observed variables

• θ is a parameter for the joint distribution of X and Y

• The parameter θ has likelihood function L(θ; x, y), i.e. the pair (X, Y) has density f(x, y | θ) = L(θ; x, y).

Our aim is to estimate θ, given the observation Y = y, by maximising the function θ ↦ f(y | θ) where

f(y | θ) = ∫ f(x, y | θ) dx = ∫ L(θ; x, y) dx

is the conditional density of Y given θ. As usual, instead of f(y | θ) we can also maximise

L(θ) = log f(y | θ).

The EM algorithm for solving this parameter estimation problem is given in the following algorithm.

Algorithm EM (expectation maximisation)

input: an estimate θ(0) for θ

output: a sequence θ(1), θ(2), . . . of improved estimates for θ

1: for n = 1, 2, 3, . . . do
2:   let Q(θ | θ^(n−1)) = E_{θ^(n−1)}( log L(θ; X, y) | Y = y )
3:   let θ^(n) = arg max_θ Q(θ | θ^(n−1))
4: end for

The expression arg max_θ Q(θ | θ^(n−1)) in algorithm EM stands for the value of θ which maximises Q(θ | θ^(n−1)); if there is no unique maximum, an arbitrary maximum position


can be chosen. The first step inside the loop in the algorithm is often called the expectation step (or E-step), the second step is called the maximisation step (or M-step) of the algorithm.

Theorem 6.1. The algorithm satisfies L(θ(n)) ≥ L(θ(n−1)) for all n ∈ N.

proof. First, consider two arbitrary parameter values θ and θ′. The conditional density of X given Y = y satisfies

f_θ(x|y) = f_θ(x, y) / ∫ f_θ(x, y) dx = f_θ(x, y) / f_θ(y)

and thus

L(θ) = log f_θ(y) = log f_θ(x, y) − log f_θ(x|y) = log L(θ; x, y) − log f_θ(x|y)

for all x. Consequently we find

L(θ) − L(θ′) = log L(θ; x, y) − log f_θ(x|y) − log L(θ′; x, y) + log f_θ′(x|y)
  = log L(θ; x, y) − log L(θ′; x, y) − log( f_θ(x|y) / f_θ′(x|y) )

for all x. Since the left-hand side of this equation, and thus also the right-hand side, does not depend on the value of x, we can substitute a random value X for x and then take expectations, without changing the value. Taking the expectation with parameter θ′ and conditioned on Y = y we get

L(θ) − L(θ′) = E_θ′( log L(θ; X, y) − log L(θ′; X, y) − log( f_θ(X|y) / f_θ′(X|y) ) | Y = y )
  = Q(θ | θ′) − Q(θ′ | θ′) + E_θ′( −log( f_θ(X|y) / f_θ′(X|y) ) | Y = y )

Since −log is a convex function, we can use Jensen's inequality (theorem A.7) to conclude that

E_θ′( −log( f_θ(X|y) / f_θ′(X|y) ) | Y = y ) ≥ −log E_θ′( f_θ(X|y) / f_θ′(X|y) | Y = y )
  = −log ∫ ( f_θ(x|y) / f_θ′(x|y) ) f_θ′(x|y) dx = −log ∫ f_θ(x|y) dx = −log 1 = 0.

Substituting this inequality into the previous equation we find

L(θ) − L(θ′) ≥ Q(θ | θ′) − Q(θ′ | θ′)

for all values of θ and θ′. Finally, by definition of θ^(n), we have

Q(θ^(n) | θ^(n−1)) = max_θ Q(θ | θ^(n−1)) ≥ Q(θ^(n−1) | θ^(n−1))

and thus

L(θ^(n)) − L(θ^(n−1)) ≥ Q(θ^(n) | θ^(n−1)) − Q(θ^(n−1) | θ^(n−1)) ≥ 0.


[Scatter plot omitted: points (X_{n1}, X_{n2}) forming two clusters.]

Figure 6.1: Illustration of the data in example 6.2, a mixture of two normal distributions. The task is to estimate the mean values of the two components of the mixture.

This completes the proof. (q.e.d.)

While we have seen that L(θ^(n)) in the EM algorithm is increasing, it is still possible that the algorithm does not find the maximum of L, if a "bad" initial guess θ^(0) is used.

Example 6.2 (Mixture of Normal distributions). Let µ_1, µ_2 ∈ R^d be the unknown parameters and for i = 1, 2, . . . , N let X_i ∼ U{1, 2} i.i.d. and Y_i ∼ N(µ_{X_i}, 1) independently of each other. We want to estimate µ_1 and µ_2 from observations Y_1, Y_2, . . . , Y_N, but where we don't know the values of X_1, X_2, . . . , X_N. This situation is illustrated in figure 6.1.

We first identify the log-likelihood function L. Let θ = (µ_1, µ_2). Then

f(x_i, y_i|θ) = P(X_i = x_i) · ϕ(y_i; µ_{x_i}) = (1/2) ϕ(y_i; µ_{x_i})

for all x_i ∈ {1, 2}, y_i ∈ R^d where

ϕ(y; µ) = (1/√(2π)^d) exp(−(y − µ)²/2) (6.1)

is the density of the N(µ, 1)-distribution in R^d. Since the samples for i = 1, 2, . . . , N are independent, we have

L(θ; x, y) = ∏_{i=1}^N ϕ(y_i; µ_{x_i}) / 2 = ∏_{i=1}^N (1/(2√(2π)^d)) exp(−(y_i − µ_{x_i})²/2)


and thus

log L(θ; x, y) = N log( 1/(2√(2π)^d) ) − ∑_{i=1}^N (y_i − µ_{x_i})²/2.

For the E-step of the EM-algorithm we need to compute

Q(θ|θ^(n−1)) = E_{θ^(n−1)}( log L(θ; X, y) | Y = y ).

Since we have

P_θ(X_i = x_i | Y_i = y_i) = f(x_i, y_i|θ) / f(y_i|θ) = (1/2)ϕ(y_i; µ_{x_i}) / ( (1/2)ϕ(y_i; µ_1) + (1/2)ϕ(y_i; µ_2) ),

we find

Q(θ|θ^(n−1)) = N log( 1/(2√(2π)^d) ) − ∑_{i=1}^N E_{θ^(n−1)}( (y_i − µ_{X_i})²/2 | Y_i = y_i )

  = N log( 1/(2√(2π)^d) ) − ∑_{i=1}^N ∑_{x_i=1}^2 ((y_i − µ_{x_i})²/2) · ϕ(y_i; µ^(n−1)_{x_i}) / ( ϕ(y_i; µ^(n−1)_1) + ϕ(y_i; µ^(n−1)_2) )

  = N log( 1/(2√(2π)^d) ) − ∑_{i=1}^N ∑_{x_i=1}^2 ((y_i − µ_{x_i})²/2) ψ^(n−1)_{x_i}(y_i)

where we write

ψ^(n−1)_j(y) = ϕ(y; µ^(n−1)_j) / ( ϕ(y; µ^(n−1)_1) + ϕ(y; µ^(n−1)_2) ) (6.2)

as an abbreviation.

For the M-step of the EM-algorithm we need to maximise the function

θ ↦ Q(θ | θ^(n−1))

where θ = (µ_1, µ_2). To find the maximum we set the derivatives of this function to 0: we get

0 = ∂/∂µ_j Q(θ | θ^(n−1)) = ∑_{i=1}^N (y_i − µ_j) ψ^(n−1)_j(y_i)

and thus

∑_{i=1}^N y_i ψ^(n−1)_j(y_i) = µ_j ∑_{i=1}^N ψ^(n−1)_j(y_i) (6.3)

for j = 1, 2.

The parameters µ^(n)_1 and µ^(n)_2 are the solutions of the system (6.3) of equations. Solving this system for µ_j we find

µ^(n)_j = ∑_{i=1}^N y_i ψ^(n−1)_j(y_i) / ∑_{i=1}^N ψ^(n−1)_j(y_i)

for j = 1, 2, where ψ^(n−1)_j(y) is given by (6.2). To implement the EM-algorithm we just need to evaluate this relation in a loop:

1: for n = 1, 2, 3, . . . do
2:   for j = 1, 2 do
3:     let µ^(n)_j = ∑_{i=1}^N y_i ψ^(n−1)_j(y_i) / ∑_{i=1}^N ψ^(n−1)_j(y_i)
4:   end for
5: end for
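In Python (the course otherwise uses R) this loop becomes, for d = 1 and synthetic data with assumed true means −2 and 3:

```python
import math, random

random.seed(3)

def phi(y, mu):
    """Density of N(mu, 1) on R, i.e. (6.1) with d = 1."""
    return math.exp(-(y - mu)**2 / 2) / math.sqrt(2*math.pi)

# Synthetic data as in example 6.2: X_i ~ U{1,2}, Y_i ~ N(mu_{X_i}, 1).
N = 1000
true_means = (-2.0, 3.0)
ys = [random.gauss(true_means[random.randrange(2)], 1.0) for _ in range(N)]

mu = [0.0, 1.0]                       # initial guess theta^(0)
for n in range(50):
    # E-step: responsibilities psi_j(y_i) from (6.2)
    psi = [[phi(y, mu[j]) / (phi(y, mu[0]) + phi(y, mu[1])) for y in ys]
           for j in (0, 1)]
    # M-step: weighted-mean update for mu_j^(n)
    mu = [sum(p*y for p, y in zip(psi[j], ys)) / sum(psi[j]) for j in (0, 1)]

print(sorted(mu))   # close to the true means, up to relabelling
```

Since the two components are well separated here, the iteration recovers the means up to sampling error; with a worse initial guess it could converge to a local maximum instead, as noted above.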

The same approach used in example 6.2 can be used for a more general mixture of the form p N(µ_1, Σ_1) + (1 − p) N(µ_2, Σ_2) where p ∈ (0, 1) gives the weights of the mixture and Σ_j ∈ R^{d×d} are the covariance matrices of the components.

Other classes of problems where the EM algorithm can be used include problems with missing data or where only aggregate values of cells are observed.

Further Reading

The EM algorithm was originally described in Dempster et al. (1977).


Appendix A

Probability Reminders

In this text we assume that the reader is familiar with the basic results of probability and statistics. For reference, and to fix notation, this chapter summarises some important concepts and results.

A.1 Events and Probability

The basic objects in probability theory are events and random variables. Typically events are denoted by capital letters A, B, C, . . . and we write P(A) for the probability that an event A occurs. We say that A occurs almost surely, if P(A) = 1. Random variables are typically denoted by upper case letters X, Y, Z, . . . and they can take values either in the real numbers R, in the Euclidean space R^d or even in more general spaces. Random variables are often used to construct events; for example {X < 3} denotes the event that X takes a value which is smaller than 3 and we write P(X < 3) for the probability of this event. If X is a random variable taking values in some set, and if f is a (measurable) function on this set, then f(X) is again a random variable.

Each random variable has a distribution or probability distribution, which completely describes its probabilistic behaviour. Special probability distributions are often designated by calligraphic uppercase letters, sometimes with parameters given in brackets, e.g. N(µ, σ²) for the normal distribution with mean µ and variance σ² or U[0, 1] for the uniform distribution on the interval [0, 1]. General distributions are often designated by P or µ. We write

X ∼ P

to state that a random variable X has distribution P .

In general, the distribution of a random variable is given by a probability measure, but to avoid technical complications we do not expand on the general case here. For real-valued random variables, the distribution can always be completely described by a distribution function:

Definition A.1. The cumulative distribution function (CDF) of a random variable X on R is given by

F(a) = P(X ≤ a) for all a ∈ R.

Often the CDF of a random variable is simply referred to as the "distribution function", omitting the term "cumulative" for brevity. Distribution functions are normally denoted by capital letters like F or F_X (where the subscript denotes which random variable this


is the CDF of), occasionally Greek letters like Φ are used. We sometimes write X ∼ F (slightly abusing notation) to indicate that the distribution of X has distribution function F.

Definition A.2. A random variable X has probability density f, if

P(X ∈ A) = ∫_A f(x) dx

for every (measurable) set A.

Often the probability density of a random variable X is referred to just as the "density" of X. While every random variable on R has a distribution function F, a density f may or may not exist. If f exists, then it can be found as the derivative of the CDF, i.e. f = F′. Probability densities are typically denoted by Roman or Greek letters like f, g, p, ϕ, ψ.

Example A.3. The Gaussian distribution N(0, 1) has density

f(x) = (1/√(2π)) e^{−x²/2}

and distribution function

F(x) = (1/√(2π)) ∫_{−∞}^x e^{−y²/2} dy.

The integral in the CDF cannot be evaluated explicitly, but many programming languages provide functions to evaluate F numerically.
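For instance, Python's standard library exposes the error function, through which F can be written and evaluated:

```python
import math

def norm_cdf(x):
    """CDF of the N(0, 1) distribution, via F(x) = (1 + erf(x/sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(norm_cdf(0.0))    # 0.5, by symmetry of the density
print(norm_cdf(1.96))   # roughly 0.975
```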

Example A.4. The Exp(λ) distribution has density

f(x) = λ e^{−λx} if x ≥ 0, and f(x) = 0 if x < 0.

The distribution function is

F(x) = 1 − e^{−λx} if x ≥ 0, and F(x) = 0 if x < 0.

Example A.5. The value X ∈ {1, 2, 3, 4, 5, 6} of a single dice throw has no density. Its distribution function is

F(x) = 0,   for x < 1,
       1/6, for 1 ≤ x < 2,
       2/6, for 2 ≤ x < 3,
       3/6, for 3 ≤ x < 4,
       4/6, for 4 ≤ x < 5,
       5/6, for 5 ≤ x < 6, and
       1,   for 6 ≤ x.

The most important properties of densities are given by the following characterisation: a function f is a probability density, if and only if it satisfies the two properties


a) f ≥ 0, and

b) f is integrable with ∫ f(x) dx = 1.

Sometimes a density f is only known up to a constant Z, i.e. we know the function g(x) = Z f(x) for all x, but we don't know f and the value of Z. In these cases, the second property in the list above can be used to find Z:

Z = Z ∫ f(x) dx = ∫ Z f(x) dx = ∫ g(x) dx.
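When this integral has no closed form it can be approximated numerically. The sketch below uses g(x) = exp(−x²/2) as an example, whose normalising constant is known to be √(2π), so the numerical recipe can be checked against the exact value.

```python
import math

def g(x):
    """Unnormalised density g(x) = Z f(x); here exp(-x^2/2)."""
    return math.exp(-x*x/2)

def integrate(func, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return sum(func(a + (i + 0.5)*h) for i in range(n)) * h

# The tails of g beyond [-10, 10] are negligible, so this recovers Z.
Z = integrate(g, -10.0, 10.0)
print(Z, math.sqrt(2*math.pi))   # the two values agree closely
```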

A.2 Conditional Probabilities

The conditional probability of an event A, given another event B with P(B) > 0, is defined as

P(A | B) = P(A ∩ B) / P(B) (A.1)

where P(A ∩ B) is the probability that both A and B occur simultaneously. The same relation multiplied by P(B) is called Bayes' rule:

P(A ∩ B) = P(A | B) P(B). (A.2)

Often, the event and condition in a conditional probability concern the distribution of a random variable X: if P(X ∈ B) > 0 we can consider P(X ∈ A | X ∈ B). For fixed B, the conditional distribution P_{X|X∈B} of X given X ∈ B, defined by

P_{X|X∈B}(A) = P(X ∈ A | X ∈ B),

is itself a probability distribution. The conditional distribution corresponds to the remaining randomness in X when we already know that X ∈ B occurred. This is illustrated by the following lemma: if we start with an i.i.d. sequence (X_n)_{n∈N} and from this sequence remove all X_n with X_n ∉ B, then the remaining elements form an i.i.d. sequence with distribution P_{X|X∈B}.

Lemma A.6. Let X be a random variable and B be a set with P(X ∈ B) > 0. Let (X_n)_{n∈N} be a sequence of i.i.d. copies of X. Define N_0 = 0 and

N_k = min{ n ∈ N | n > N_{k−1}, X_n ∈ B }

for all k ∈ N. Then (X_{N_k})_{k∈N} is an i.i.d. sequence of P_{X|X∈B}-distributed random variables, i.e. the X_{N_k} are independent and

P(X_{N_k} ∈ A) = P(X ∈ A | X ∈ B)

for all k ∈ N.

The random variable X here can take values on an arbitrary space. By taking X to be a vector, X = (X_1, X_2) ∈ R² say, and by choosing A = A_1 × R and B = R × B_2, we get

P(X ∈ A | X ∈ B) = P(X_1 ∈ A_1 | X_2 ∈ B_2).

Using this argument, results like lemma A.6 can also be applied to the distribution of arandom variable X, conditioned on the values of a different random variable Y .


If the pair (X, Y) ∈ R^m × R^n has a joint density f(x, y), it is also possible to consider X conditioned on the event Y = y. Since the event Y = y will typically have probability 0, definition (A.1) can no longer be used; instead, one defines the conditional density of X given Y = y as

    f_{X|Y}(x|y) = f(x, y) / f_Y(y)   if f_Y(y) > 0, and
    f_{X|Y}(x|y) = π(x)               else,

where π(x) is an arbitrary probability density and

    f_Y(y) = ∫ f(x, y) dx

is the density of Y. The choice of π in the definition does not matter, since it is used only for the case f_Y(y) = 0, i.e. when conditioning on cases which never occur.

The function f_{X|Y}(x|y) defined in this way satisfies

    ∫ f_{X|Y}(x|y) dx = 1

for all y, and

    ∫_A ∫_B f_{X|Y}(x|y) f_Y(y) dy dx = ∫_A ∫_B f(x, y) dy dx = P(X ∈ A, Y ∈ B).

The latter relation is the analogue of Bayes' rule (A.2) for conditional densities.
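As a short worked example (our own, not taken from the text), consider the joint density f(x, y) = x + y on the unit square [0, 1]². The marginal and conditional densities follow directly from the definitions above:

```latex
% joint density f(x,y) = x + y on [0,1]^2
f_Y(y) = \int_0^1 (x + y)\,dx = \tfrac{1}{2} + y,
\qquad
f_{X|Y}(x \mid y) = \frac{f(x,y)}{f_Y(y)} = \frac{x + y}{\tfrac{1}{2} + y},
\quad x \in [0,1].
```

One checks that ∫₀¹ f_{X|Y}(x|y) dx = (1/2 + y)/(1/2 + y) = 1 for every y ∈ [0, 1], as required.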

A.3 Expectation

Real-valued random variables X often (but not always) have an expectation, which we denote by E(X). If X can only take finitely many values, say x_1, ..., x_n, then the expectation of X is given by

    E(X) = Σ_{i=1}^n x_i P(X = x_i).

If the distribution of X has a density ϕ, the expectation of X can be computed as

    E(X) = ∫ x ϕ(x) dx.

Similarly, if f is a function, the expectation of the random variable f(X) can be computed as

    E(f(X)) = ∫ f(x) ϕ(x) dx.    (A.3)

In cases where the integral on the right-hand side can be solved explicitly, this formula allows us to compute expectations analytically.
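When the integral in (A.3) cannot be solved explicitly, it can often be evaluated numerically. A minimal sketch in R (the choices f(x) = x² and ϕ = density of N(0,1) are ours, for illustration; then E(f(X)) = Var(X) = 1):

```r
# E(f(X)) = integral of f(x) * phi(x) dx, formula (A.3)
phi <- function(x) dnorm(x)   # density of N(0,1)
f   <- function(x) x^2

# numerical evaluation of the integral; the exact answer is 1
integrate(function(x) f(x) * phi(x), lower = -Inf, upper = Inf)$value
```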

Probabilities of events involving X can be rewritten as expectations using the indicator function of the event: the indicator function of the event {X ∈ A}, where A is a set of possible values of X, is the random variable 1_{X∈A} given by

    1_{X∈A} = 1 if X ∈ A, and 0 else.


Using this indicator function we can write

    P(X ∈ A) = E(1_{X∈A}).

To evaluate this expectation using (A.3), we define a function 1_A by

    1_A(x) = 1 if x ∈ A, and 0 else.

Using this notation we have 1_{X∈A} = 1_A(X) and we get

    P(X ∈ A) = E(1_A(X)) = ∫ 1_A(x) ϕ(x) dx = ∫_A ϕ(x) dx,

i.e. we can compute the probability of X taking values in the set A by integrating the density of X over A.
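The identity P(X ∈ A) = E(1_A(X)) is the basis of Monte Carlo probability estimates: the probability is approximated by the sample mean of the indicator values. A small sketch in R (the distribution N(0,1) and the set A = (1, ∞) are our own illustrative choices):

```r
set.seed(2)
x <- rnorm(100000)   # sample from the density phi of N(0,1)

# P(X > 1) as the expectation of the indicator 1_{X > 1},
# estimated by the sample mean of the indicator values
mean(x > 1)
1 - pnorm(1)         # exact probability, for comparison
```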

We conclude by listing a few important results about expectations which are used in the main text. The first of these, Jensen's inequality, is a generalisation of the well-known relation E(X²) ≥ E(X)².

Theorem A.7 (Jensen's inequality). Let X be a random variable and ϕ be a real-valued, convex function. Then E(ϕ(X)) ≥ ϕ(E(X)).

Theorem A.8 (Chebyshev's inequality). Let X be a random variable with expectation µ and variance σ² < ∞. Then

    P(|X − µ| ≥ ε) ≤ σ² / ε²

holds for every ε > 0.
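Chebyshev's inequality can be checked empirically by simulation. In the following sketch (the distribution N(0, 4) and the value ε = 3 are our own choices for illustration), the empirical tail probability is compared with the bound σ²/ε²:

```r
set.seed(3)
x <- rnorm(100000, mean = 0, sd = 2)   # mu = 0, sigma^2 = 4
eps <- 3

# empirical P(|X - mu| >= eps) versus the Chebyshev bound sigma^2 / eps^2
mean(abs(x) >= eps)   # around 0.13 for this distribution
4 / eps^2             # the bound 0.444..., comfortably larger
```

Note that the bound holds for every distribution with variance σ²; for a specific distribution such as the normal it is usually far from tight.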

Finally, the strong law of large numbers forms the basis of the Monte Carlo methods discussed in chapter 3.

Theorem A.9 (strong law of large numbers). Let (X_n)_{n∈N} be a sequence of independent, identically distributed random variables with expectation µ. Then

    lim_{N→∞} (1/N) Σ_{i=1}^N X_i = µ

almost surely.
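The convergence of the averages can be observed in a simulation. A minimal R sketch (the exponential distribution with rate 2, so µ = 1/2, is our own illustrative choice):

```r
set.seed(4)
x <- rexp(100000, rate = 2)   # i.i.d. sample with expectation mu = 1/2

# running averages (1/N) * sum of the first N values
running <- cumsum(x) / seq_along(x)
running[c(100, 10000, 100000)]   # values approaching 0.5 as N grows
```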

Further Reading

A concise and rigorous exposition of basic probability is given in Jacod and Protter (2000).


Appendix B

Programming in R

This appendix contains a short introduction to programming with the R programming language (R Development Core Team, 2011). R can be freely downloaded from the R homepage at http://www.r-project.org/ and there are many online tutorials available which describe how to install and use the R system. Therefore, we assume that the reader has access to a working R installation and already knows how to start the program and how to use the built-in help system. The presentation here will focus on aspects of the language which are important for scientific computing.

B.1 General advice

One of the most important points to understand when learning to program is the difference between learning to program and learning a programming language. Learning to program involves understanding how to approach and structure a problem in order to turn it into an algorithm which a computer can execute. This process often requires that the problem is rephrased or broken down into smaller sub-problems. Becoming a proficient programmer is a slow process which requires a lot of practice and which can take a long time. In contrast, when you already know how to program, learning a new programming language is relatively easy. Typically this just requires learning a small number (normally much less than 100) of commands and rules. This chapter gives an introduction to both the basics of programming and the programming language R at the same time.

While it may be possible to learn the mathematical contents of this book just by reading and understanding the text, programming can only be learned by practising a lot (just like learning to play football or learning to play a musical instrument). For this reason, the text includes numerous exercises which allow you to practise your programming skills. Another way to train your programming skills is to read and understand other people's programs, but only after you have written the corresponding program yourself! For this purpose, the text includes answers for most of the programming exercises. But, again, it is not possible to learn to program just by reading other people's programs (just as it is impossible to become a good violinist only by watching people play the violin), so it is important that you try the exercises yourself before looking at the answers.

Finally, and as already remarked, learning to program takes a long time, so make sure that you commit enough time to learning and practising!


[Figure B.1 appears here: a plot of the points (1,5), (2,2), (3,1) and (4,6), drawn as circles connected by lines.]

Figure B.1: The graph generated by the command in listing (B.1). The circles correspond to the x and y coordinates provided in the command; the axis labels are as specified.

B.2 R as a calculator

In this section we discuss basic use of R and use this to introduce some fundamental concepts. Many of the topics touched on here will be covered in more detail in the following section.

You interact with the R system by entering textual commands, and the system then acts on these commands. A simple example of a command is

3 + 4

This command asks R to add the numbers 3 and 4; the result, 7, is printed to the screen. An example of a more complicated command is

(B.1)
plot(c(1,2,3,4), c(5,2,1,6), xlab="index", ylab="y", type="o")

Try this example yourself! (If you are reading this text on a computer, you can use 'cut and paste' to copy the command into R.) If everything worked, a graph like the one in figure B.1 will appear on the screen.

R processes commands one at a time. Commands can be split over more than one line; for example a plot command could look as follows:

plot(c(1,2,3,4), c(5,2,1,6), type="o",
     xlab="really long label for the x-axis", ylab="y")

When you enter this command into R, nothing will happen when you enter the first line (since R notices that the first bracket is not yet closed, so the command cannot be complete yet). Only after the command is completed by the second line does the plot appear. For programming, we need to split the solution to a problem into steps which are small enough that each step can be performed in a single command.


operation           example      R code
addition            8 + 3        8 + 3
subtraction         8 − 3        8 - 3
multiplication      8 · 3        8 * 3
division            8 / 3        8 / 3
power               8^3          8 ^ 3
modulus             8 mod 3      8 %% 3
absolute value      |x|          abs(x)
square root         √x           sqrt(x)
exponential         e^x          exp(x)
natural logarithm   log(x)       log(x)
sine                sin(2πx)     sin(2*pi*x)
cosine              cos(2πx)     cos(2*pi*x)

Table B.1: List of some commonly used operations in R.

B.2.1 Mathematical Operations

Some of the simplest commands available are the ones which correspond directly to mathematical operations. In the first example of this section we could just type 3 + 4 to compute the corresponding sum. Table B.1 lists the R equivalent of the most important mathematical operations.

B.2.2 Variables

In R you can use variables to store intermediate results of computations. The following transcript illustrates the use of variables:

> a <- 1
> b <- 7
> c <- 2
> root1 <- (-b + sqrt(b^2 - 4*a*c)) / (2*a)
> root2 <- (-b - sqrt(b^2 - 4*a*c)) / (2*a)
> root1
[1] -0.2984379
> root2
[1] -6.701562
> root1*root2
[1] 2

You can freely choose names for your variables, consisting of letters, digits and the dot character (but starting with a letter), and you can assign values to variables using the assignment operator <-. After a value is assigned to a variable, the name of the variable can be used as a shorthand for the assigned value.

There is a subtle difference between the use of variables in mathematics and in programming: while, in mathematics, expressions like x = x + 1 are not very useful, the corresponding expression x <- x+1 in R has a useful meaning:

> x <- 6
> x <- x + 1
> x
[1] 7


What happens here is that in the assignment x <- x+1, the right-hand side x + 1 is evaluated first: by the rules for the use of variables, x is replaced by its value 6, and then 6 + 1 is evaluated to 7. Once the value to be assigned is determined, this value (the 7) is assigned to x. Consequently, the effect of the command x <- x+1 is to increase the value stored in the variable x by 1.

B.2.3 Data Types

Every object in R represents one of a handful of possible "data types". In the examples above we have already seen numbers and strings (short for "character strings").

Strings

Strings in R are enclosed in quotation marks. Normally the distinction is easy: strings are for texts (e.g. for labels in graphs) while numbers are for numerical values. But some care is needed: "12" is the text consisting of the digits one and two, which is different from the number 12:

> 12+1
[1] 13
> "12"+"1"
Error in "12" + "1" : non-numeric argument to binary operator

R complains that it cannot add "12" and "1", because they are not numbers.

Vectors

Vector objects in R are useful to represent mathematical vectors in a program; they can also be used as a way to store data for later processing. An easy way to create vector objects is the function c (short for "concatenation") which collects all its arguments into a vector. The vector

    x = (1, 2, 3)

can be represented in R as follows:

> c(1,2,3)
[1] 1 2 3

The elements of a vector can be accessed by using square brackets: if x is a vector, x[1] is the first element, x[2] the second element and so on:

> x <- c(7, 6, 5, 4)
> x[1]
[1] 7
> x[1] + x[2]
[1] 13
> x[1] <- 99
> x
[1] 99  6  5  4

The function c can also be used to append elements to the end of an existing vector:

> x <- c(1, 2, 3)
> x <- c(x, 4)
> x
[1] 1 2 3 4


In addition to the function c, there are several ways of constructing vectors: one can start with an empty vector and add elements one-by-one:

> x <- c()
> x[1] <- 1
> x[2] <- 1
> x[3] <- x[2] + x[1]
> x[4] <- x[3] + x[2]
> x
[1] 1 1 2 3

Vectors consisting of consecutive, increasing numbers can be created using the colon (:) operator:

> 1:15
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
> 10:20
 [1] 10 11 12 13 14 15 16 17 18 19 20

More complicated sequences can be generated using the function seq:

> seq(from=1, to=15, by=2)
[1]  1  3  5  7  9 11 13 15
> seq(from=15, to=1, by=-2)
[1] 15 13 11  9  7  5  3  1

Mathematical operations on vectors work as expected:

> c(1,2,3) * 2
[1] 2 4 6
> c(1,2,3) + c(3,2,1)
[1] 4 4 4

Other useful functions on vectors include sum (to compute the sum of the vector elements), mean (the average), var (the sample variance), sd (the sample standard deviation), and length (the length of the vector):

> x <- c(1,2,3)
> (x[1] + x[2] + x[3]) / 3
[1] 2
> sum(x) / length(x)
[1] 2
> mean(x)
[1] 2

The functions from table B.1, when applied to a vector, operate on the individualelements:

> x <- c(-1, 0, 1, 2, 3)> abs(x)[1] 1 0 1 2 3> x^2[1] 1 0 1 4 9

This allows to efficiently operate on a whole data set with a single instruction.

Matrices

Matrices in R can be constructed using the function matrix. To represent the matrix

    A = ( 1 2 3 )
        ( 4 5 6 )

in R, the following command can be used:


> A <- matrix(c(1, 2, 3,
+               4, 5, 6),
+             nrow=2, ncol=3, byrow=TRUE)
> A
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

The first argument to matrix is a vector giving the numbers to be stored in the matrix. The following two arguments set the number of rows/columns of the matrix, and the last argument states that we gave the entries row-by-row. The whole command can be given on one line; the line breaks are only inserted to increase readability. The n×n identity matrix can be generated by diag(n) and an m×n zero-matrix by matrix(0, nrow=m, ncol=n).

Individual elements of a matrix can be accessed using square brackets, just like for vectors: if A is a matrix, A[1,1] denotes the top-left element of the matrix, A[1,] denotes the first row of the matrix (as a vector), and A[,1] denotes the first column of A:

> A = matrix(c(1,2,3,4), nrow=2, ncol=2, byrow=TRUE)
> A[1,1] <- 9
> A
     [,1] [,2]
[1,]    9    2
[2,]    3    4
> A[1,]
[1] 9 2
> A[,1]
[1] 9 3

The sum of matrices and the product of a matrix with a number can be computed using + and *; the matrix-matrix and matrix-vector products from linear algebra are given by %*%. (Careful: A*A is not the matrix product, but the element-wise product instead!)

> A <- matrix(c(1, 2,
+               2, 3),
+             nrow=2, ncol=2, byrow=TRUE)
> A
     [,1] [,2]
[1,]    1    2
[2,]    2    3
> A %*% A
     [,1] [,2]
[1,]    5    8
[2,]    8   13
> x <- c(0, 1)
> A %*% x
     [,1]
[1,]    2
[2,]    3
> x %*% A %*% x
     [,1]
[1,]    3

The last command shows that vectors are automatically interpreted as row vectors or column vectors as needed: in R it is not required to transpose the vector x when evaluating expressions like xᵀAx.

Many matrix operations are available: for example the transpose of a matrix A can be computed using t(A) and the inverse by solve(A); the functions rowSums and colSums return the row and column sums as vectors, and rowMeans and colMeans return the row and column averages, respectively. The solution x to a system Ax = b of linear equations can be computed using the command solve(A, b).
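A short illustration of solving a linear system with solve (the matrix A and right-hand side b below are made up for this example):

```r
A <- matrix(c(2, 1,
              1, 3), nrow=2, ncol=2, byrow=TRUE)
b <- c(5, 10)

x <- solve(A, b)   # solve the linear system A x = b
x                  # the solution vector, here c(1, 3)
A %*% x            # multiplying back recovers b
```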

B.3 Programming Principles

In the previous section we have seen many examples of R commands. An R program is a sequence of commands, designed to solve a specific problem when executed in order. As an example, consider the Fibonacci sequence defined by x_1 = x_2 = 1 and x_k = x_{k−1} + x_{k−2} for all k > 2. The following commands form an R program to compute x_6:

(B.2)
x <- c()
x[1] <- 1
x[2] <- 1
x[3] <- x[2] + x[1]
x[4] <- x[3] + x[2]
x[5] <- x[4] + x[3]
x[6] <- x[5] + x[4]
cat("the 6th Fibonacci number is", x[6], "\n")

When these commands are executed in R, one after another, the last command prints "the 6th Fibonacci number is 8" to the screen (the \n starts a new line in the output). While this program works, it still has a number of shortcomings which we will address in the following sections.

B.3.1 Principle 1: Don’t Repeat Yourself

A fundamental principle in programming is to avoid duplication wherever possible. This principle applies on many different levels and in many different situations. It is sometimes called the don't repeat yourself (DRY) principle. In this section we will discuss some aspects of the DRY principle, using the program (B.2) as an example.

Loops and Conditional Statements

The method we used to compute x[6] in the program (B.2) involves a lot of repetition: to represent the one formula x_n = x_{n−1} + x_{n−2}, we used four different lines in our program! Because of this, it would be impractical to use a similar program to compute x[100]. A second problem is that it can be difficult to spot mistakes caused by this repetition.

(B.3) Another complication appears if we want to modify the program to use the rule x_n = x_{n−1} − x_{n−2} instead: a single change to the mathematical formula requires multiple changes to our program.

The problem of this specific repetition can be solved using a "loop" in the R program. Such a loop instructs R to execute a command repeatedly. We can use such a loop to write the command for the relation x_k = x_{k−1} + x_{k−2} only once, and then let R repeat this command as needed:

(B.4)
n <- 10
x <- c()
x[1] <- 1
x[2] <- 1
for (k in 3:n) {
  x[k] <- x[k-1] + x[k-2]
}
cat("the ", n, "th Fibonacci number is ", x[n], "\n", sep="")


comparison         example   R code
strictly smaller   x < y     x < y
smaller or equal   x ≤ y     x <= y
strictly bigger    x > y     x > y
bigger or equal    x ≥ y     x >= y
equal              x = y     x == y
not equal          x ≠ y     x != y

Table B.2: List of comparison operators in R.

The loop is implemented by the for statement: for is followed by a group of one or more commands, enclosed in curly brackets { and }, which are executed repeatedly. The number of repetitions is determined by the vector 3:n; the commands are executed once for each element of this vector, i.e. n − 2 times in total. The variable k is set to the corresponding element of the vector before each iteration starts: the first time the loop is executed, k is set to 3. The executed statement is then x[3] <- x[3-1] + x[3-2]. Before the next iteration, k is set to 4 and, after substituting k, the executed statement is x[4] <- x[4-1] + x[4-2]. This is repeated until, in the last iteration of the loop, k is set to the value of n. Once the loop is completed, the program continues with the first instruction after the loop and the program outputs "the 10th Fibonacci number is 55".

The choice of the name k for the loop variable in the example above was arbitrary (it was chosen to match the index in the mathematical formula); any other variable name could have been used instead. Similarly, the vector 3:n can be replaced by any other vector; the elements are not required to be adjacent or increasing.

A further improvement of this program, compared to the first version, is the introduction of the variable n: with the new program, we can compute a different Fibonacci number by changing only a single line. Similarly, if we want to implement a different recursion relation, we can just replace the command x[k] <- x[k-1] + x[k-2] with a different one. By avoiding repetition in the program text, mistakes like the one in exercise B.3, where the required change to the output string was forgotten, are no longer possible.

There is a second kind of loop available, the while loop, which can be used when the required number of iterations is not known in advance. For example, the following program determines the first Fibonacci number which is bigger than or equal to 1000:

x <- c()
x[1] <- 1
x[2] <- 1
n <- 2
while (x[n] < 1000) {
  n <- n + 1
  x[n] <- x[n-1] + x[n-2]
}
cat("the ", n, "th Fibonacci number is ", x[n], "\n", sep="")

The while loop repeats its commands while the condition in the round brackets is satisfied. In the condition, all the usual comparison operators can be used (see table B.2). For this type of loop, no automatic assignment to a loop variable takes place, so we have to increment n ourselves using the command n <- n + 1. As soon as the condition is false, the loop ends and the program continues with the first command after the loop.


(B.5) Inside loops it is often useful to be able to execute different commands in different iterations of the loop. For example, to print all numbers n = 1, 2, ..., 100 satisfying sin(n) > 1/2 to the screen, we can use the following program:

for (n in 1:100) {
  if (sin(n) > 0.5) {
    cat(n, "\n")
  }
}

The if statement is followed by a block of one or more commands, enclosed in curly brackets { and }. These commands are only executed if the condition given inside the round brackets is true; otherwise the whole if block has no effect. As with the while statement, all comparison operators from table B.2 can be used in the condition. There is a second form of the if statement:

for (n in 1:10) {
  if (n %% 2 == 1) {
    cat(n, "is odd\n")
  } else {
    cat(n, "is even\n")
  }
}

Here, the commands in the first block of curly brackets are executed if the condition is true, and otherwise the commands in the second block are executed. Since n mod 2 equals 0 for even numbers and 1 for odd numbers, this program gives the correct output.

Command Scripts

The DRY principle also applies to repetition between different programs: if you have solved a problem once, it is better to re-use the old program instead of writing a new one. Re-use of old programs not only avoids unnecessary work, it also reduces the risk of mistakes being introduced in the code and allows the program to be improved over time. To facilitate this, the following guidelines are useful:

• Save all your R programs in files (file names for such files typically end in .R), store these files somewhere safe, and keep the programs organised in a way which allows you to find them again when needed at a later time.

• When writing programs, take care to write them in a way which makes the code re-usable. As we have seen, the DRY principle can help with this. Also, it is useful to write the program as clearly as possible, to use descriptive names for your variables, and to use indentation to make the structure of loops and if-statements easy to follow.

• If the program uses any non-obvious constructions or clever tricks, it is often helpful to add comments with explanations to the program. Such comments can be very helpful when trying to re-use programs written more than a few weeks ago. Since R ignores every line of input which starts with the character #, such comments can be stored directly in the program. For example, when reading the (slightly cryptic) program


# compute Fibonacci numbers x[n]
# y = (x[n], x[n-1])
y <- c(1,1)
for (n in 3:17) {
  y <- c(sum(y), y[1])
}
print(y[1])

the comments make it a lot easier to figure out what the program does.

B.3.2 Principle 2: Divide and Conquer

In this section we will discuss a second fundamental programming principle, sometimes called the "divide and conquer paradigm". This principle is to break down a problem into smaller sub-problems which can be solved individually. After solving the individual sub-problems, the individual solutions can be combined to obtain a solution to the full problem. Like the DRY principle discussed above, the divide and conquer principle applies on many different levels and in many different situations. In this text we will focus on the most basic aspects of this principle.

The basic tool for isolating individual building blocks of a program is a "function", which allows us to use a simple name as an abbreviation for a list of commands. We will illustrate the concept of functions in R with the help of examples.

Example B.1. The program from listing (B.4) can be turned into an R function as follows:

(B.6)
fibonacci <- function(n) {
  x <- c()
  x[1] <- 1
  x[2] <- 1
  for (k in 3:n) {
    x[k] <- x[k-1] + x[k-2]
  }
  return(x[n])
}

When we execute the above lines in R, nothing seems to happen. The commands only define the name fibonacci as an abbreviation for the commands on the following lines; the commands are not yet executed. After the function is defined, we can execute the commands in the function by using the assigned name fibonacci:

(B.7)
> fibonacci(5)
[1] 5
> res <- fibonacci(10)
> res
[1] 55

This step is referred to as "calling the function". The second call to fibonacci shows that we can assign the result of a function call to a variable and thus store it for later use.

The first line of listing (B.6) does not only give the name of the function but also, in the brackets after the keyword function, it indicates that the function has exactly one argument, called n. Every time the function is called, we have to specify a value for this argument n: for the first call of fibonacci in (B.7) we write fibonacci(5) to indicate that the value 5 should be used for n; for the second call n equals 10. The next six lines in listing (B.6), copied from listing (B.4), then compute the first n elements of the Fibonacci sequence, and the return statement indicates the result of the computation and ends the function.


Example B.2. The R equivalent of a mathematical function is often straightforward to implement. For example, to define an R function for computing

    f(x) = e^(−x²),

we can use the following R code:

f <- function(x) {
  return(exp(-x^2))
}

(B.8) The idea of a function like fibonacci in listing (B.6) is that, following the DRY principle, we write the function only once; from then on we can use the function by calling it, without having to think about the details of how Fibonacci numbers are computed. The function can be used as one of the building blocks for a bigger program, alongside the built-in R functions like sqrt and plot.

To allow for easy re-use of existing functions, some of the effect of commands inside the function is hidden from the caller: while the variables n and x are used and modified inside the function, R takes care not to disturb any variables which may be used by the caller:

> n <- 3
> fibonacci(10)
[1] 55
> n
[1] 3
> x
Error: object 'x' not found

As we can see, the assignment n=10 used inside fibonacci does not affect the value 3 we stored in n before the call and, similarly, the variable x used inside the function is not visible to the caller. This isolation of the variables inside the function is the reason that we need to use return to pass the result of a function to the caller.

Sometimes it is required to pass several argument values into a function. For example, the built-in function rnorm to generate normally distributed random variables could be defined as follows:

rnorm <- function(n, mean, sd) {
  # code for generating random numbers
  ...
}

This declares three arguments: n specifies how many random numbers should be generated, mean specifies the mean, and sd specifies the standard deviation. We can use the command rnorm(10, 0, 1) to get 10 standard normally distributed random variables. Since the special case of mean 0 and standard deviation 1 is very common, these values are used as "default values". The true declaration of rnorm is

rnorm <- function(n, mean=0, sd=1) {
  ...
}

The mean=0 tells R to use 0 for the mean if no other value is given, and similarly for the standard deviation. Thus, rnorm(10, 5) generates 10 random values with mean 5 (since we specified the mean in the second argument) and standard deviation 1 (the default value is used because we did not give a third argument).


When calling a function which uses default values for all or some of its arguments, and if we want to specify only some of the arguments, we can give values for individual function arguments as in the following example:

X <- rnorm(10, sd=3)

This will generate 10 values from a normal distribution with mean 0 (the default value is used because we did not specify this argument) and standard deviation 3 (as specified by sd=3).

Example B.3. Consider the following function:

test <- function(a=1, b=2, c=3, d=4) {
  cat("a=", a, ", b=", b, ", c=", c, ", d=", d, "\n", sep="")
}

When experimenting with this function, we get the following output:

> test()
a=1, b=2, c=3, d=4
> test(5, 6)
a=5, b=6, c=3, d=4
> test(a=5, d=6)
a=5, b=2, c=3, d=6
> n <- 3
> test(a=n, b=n+1, c=n+2, d=n+3)
a=3, b=4, c=5, d=6

As we have seen, functions can be used to encapsulate parts of your program as a single unit. Functions help to follow both the "divide and conquer" principle (because they make it easy to break a program down into smaller building blocks) and the DRY principle (because they are easily re-used). We conclude this section with some general guidelines for writing functions:

• Sometimes there is a choice about which parts of a program could be split out into functions. A good idea in such situations is to aim for functions whose purpose is easily explained: "compute the nth Fibonacci number" is a task which will make a good function. In contrast, it seems less clear whether implementing formula (B.11) in a function is a good idea: the description of this function will likely be as long as the function itself, and re-use of the resulting function seems not very likely.

• Using descriptive names for functions is a good idea. Examples of well-chosen function names include mean, sin and plot; in all three cases it is easy to guess from the name what the function will do.

• The "divide and conquer" principle becomes most powerful when functions make use of previously defined functions which implement more fundamental building blocks. The extreme case of this is when the definition of a function includes calls to itself (for example when sorting a list of length n is reduced to the problem of sorting lists of length n − 1); this technique is called "recursion".
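As a small illustration of recursion (our own example, not from the text), the factorial n! can be computed by a function which calls itself:

```r
fact <- function(n) {
  if (n <= 1) {
    return(1)            # base case: 0! = 1! = 1
  }
  return(n * fact(n-1))  # recursive case: n! = n * (n-1)!
}

fact(5)   # [1] 120
```

Each call reduces the problem (computing n!) to a strictly smaller instance ((n−1)!), until the base case is reached.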

B.3.3 Principle 3: Test your Code

The final programming principle we will discuss in this text is the importance of testing programs: even for experienced programmers it is near-impossible to get a non-trivial program right at the first attempt. Thus, the usual procedure is to complete a program bit by bit (following the approach advertised in the previous section) and, every time a part of the program is completed, to systematically test for the presence of errors.

There are different kinds of errors which can occur in computer programs:

a) “Syntax errors” are errors caused by not following the restrictions and requirements of the programming language. In this case, R does not understand the program at all and complains immediately with an error message. Examples include excess commas and brackets:

> mean(c(1,2,,3))
Error in c(1, 2, , 3) : argument 3 is empty
> mean(c(1,2,3)))
Error: unexpected ')' in "mean(c(1,2,3)))"

These errors can be spotted immediately, and are usually easy to resolve by following the hints given in the error message.

b) “Run-time errors” occur when the program is syntactically correct, but R encounters an operation which cannot be performed during the execution of the program. Examples include programs which try to compute the sample variance of an empty data set. In this case, the program is stopped immediately and an error message is printed:

> f <- function(x) {
+   cat("the variance is", var(x), "\n")
+   cat("some more information here\n")
+ }
> f(c(1,2,3))
the variance is 1
some more information here
> f(c())
Error in var(x) : 'x' is NULL

This kind of error is often more difficult to detect and fix, because the error may not occur in every run of the program and because it may be difficult to find the exact location of the error in the program.

“Arithmetic errors” are special cases of run-time errors, where the program tries to evaluate expressions like 0/0 or to compute the square root of a negative number. In these cases, R just sets the result to NaN (short for “not a number”) and sometimes (but not always) emits a warning message:

> 0/0
[1] NaN
> 2/2 + 1/1 + 0/0
[1] NaN
> sqrt(-1)
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced

c) “Semantic errors” are errors caused by a program being a valid program, but one which does not implement the intended algorithm. These are cases where the program does what you told it to do instead of what you meant it to do. Semantic errors can be very difficult to spot and fix. As a simple example, consider the following function for computing the mean of a vector:

# (B.9)
mean <- function(x) {
  n <- length(x)
  s <- sum(x)
  return(x/n)
}

This function does not work as intended (since the return statement erroneously uses x instead of s), but there is no error message. The best way to find this mistake is to test the function by trying to call it.

Errors in programs are sometimes called “bugs”, and the process of locating and fixing errors in a program is called “debugging”. There are various approaches to debugging:

• A good start is to try the code in question for a few cases where the correct result is known. Make sure to try some boundary cases (like very short data sets) and some typical cases. Test each function of a program separately, starting with the most fundamental ones.

• Carefully look out for any error messages or warnings. These messages often contain useful hints about what went wrong.

• To locate the position of an error, it is often helpful to temporarily insert print or cat statements into a program to check whether the program actually executes certain lines of code and to see whether variables at these locations still have the expected values.

• The functions debug and undebug can be used to watch the execution of a function step by step. See the R help text for debug (obtained by typing help(debug)) for usage instructions.

Example B.4. Assume we want to debug the function mean given in listing (B.9). The first step is to try a few values:

> mean(c(1,2,3))
[1] 0.3333333 0.6666667 1.0000000

Since the mean of (1, 2, 3) is 2, we see immediately that something is wrong. Assuming that we don't spot the typo yet, we can try to modify the function by inserting a cat statement as follows:

mean <- function(x) {
  n <- length(x)
  s <- sum(x)
  cat("n =", n, " s =", s, "\n")
  return(x/n)
}

If we re-run the function, we now get the following output:

> mean(c(1,2,3))
n = 3  s = 6
[1] 0.3333333 0.6666667 1.0000000

Since the printed values are what we expect (the list has length n = 3, and the sum of the elements should be s = 1 + 2 + 3 = 6), we know that the mistake must be after the cat statement. Thus, we have narrowed down the location of the problem to a single line (the return statement), and looking at this line it is now easy to spot that x should be replaced with s.


distribution              name in R
binomial distribution     binom
χ² distribution           chisq
exponential distribution  exp
gamma distribution        gamma
normal distribution       norm
Poisson distribution      pois

Table B.3: List of some probability distributions supported by R.

B.4 Random Number Generation

R contains an extensive set of built-in functions for generating (pseudo-)random numbers of many of the standard probability distributions. There are also functions available to compute densities, cumulative distribution functions and quantiles of these distributions. Since these functions will be used extensively for the exercises in the main text, we give a short introduction here.

The names of all these functions are constructed using the following scheme: the first letter is

r for random number generators,
d for densities (weights for the discrete case),
p for the cumulative distribution functions, and
q for quantiles.

The rest of the name determines the distribution; some possible values are given in table B.3. For example, the function to generate normally distributed random numbers is rnorm and the density of the exponential distribution is dexp. More details about how to use these functions and how to set the distribution parameters are available in the R online help. Finally, the function set.seed is available to set the seed of the random number generator (see the discussion in section 2.1.3 for details of why this is useful).
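As a brief illustration of this naming scheme (a sketch using the normal distribution; the numeric values in the comments are standard properties of N(0,1)):

```r
set.seed(42)   # fix the seed so the random numbers below are reproducible

rnorm(3)       # three N(0,1)-distributed random numbers
dnorm(0)       # density of N(0,1) at 0, i.e. 1/sqrt(2*pi)
pnorm(0)       # cumulative distribution function at 0, i.e. 0.5
qnorm(0.5)     # the 50% quantile (median) of N(0,1), i.e. 0
```

Calling set.seed with the same argument again would make the subsequent rnorm calls return the same values.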

# (B.10)

B.5 Exercises

Exercise E-B.1. Use R to compute the following values:

a) 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10

b) 2¹⁶

c) 2¹⁰⁰

d) 2/(1 + √5)

Exercise E-B.2. What are the values of x and y after the following commands are executed in R:


x <- 1
y <- 2
x <- x + y
y <- x + y

Exercise E-B.3. Use R to compute the value of 1 + 2 + · · · + 100.

Exercise E-B.4. Can you spot the mistake in the following program for computing the Fibonacci number x[8]?

x <- c()
x[1] <- 1
x[2] <- 1
x[3] <- x[2] + x[1]
x[4] <- x[3] + x[2]
x[6] <- x[5] + x[4]
x[7] <- x[6] + x[5]
x[8] <- x[7] + x[6]
cat("the 6th Fibonacci number is", x[8], "\n")

Exercise E-B.5. Write an R program which uses a for loop to print the elements of the decreasing sequence 100, 99, . . . , 0 to the screen, one per line.

Exercise E-B.6. Write an R program which uses a while loop to determine the smallest square number bigger than 5000.

Exercise E-B.7. Write an R program which uses a while loop to determine the biggest square number smaller than 5000.

Exercise E-B.8. Given x₀ ∈ ℕ, a sequence of integers can be defined by

    x_n = x_{n−1}/2      if x_{n−1} is even, and
    x_n = 3·x_{n−1} + 1  if x_{n−1} is odd,              (B.11)

for all n ∈ ℕ. Once the sequence reaches 1, it starts to cycle through the values 3·1+1 = 4, 4/2 = 2, and 2/2 = 1, but it is unknown whether this cycle is reached from every starting point x₀. Write an R program which prints the values of x_n, starting with x₀ = 27. The program should stop when the value 1 is reached for the first time.

Exercise E-B.9. What does the following function compute?

f <- function(x) {
  n <- length(x)
  m <- mean(x)
  s <- 0
  for (i in 1:n) {
    s <- s + (x[i] - m)^2
  }
  return(s/(n-1))
}


Hint: x will normally be a vector.

Exercise E-B.10. Continuing example B.3, feed the following commands into R:

test <- function(a=1, b=2, c=3, d=4) {
  cat("a=", a, ", b=", b, ", c=", c, ", d=", d, "\n", sep="")
}
b <- "a"
test(c=b)

Explain the result R prints to the screen.

Exercise E-B.11. Into which category do the two (!) errors from exercise E-B.4 fall?

Exercise E-B.12. Use R to create plots of the densities and cumulative distributionfunctions of the following distributions:

a) N(0, 1) — the standard normal distribution

b) N(3, 4) — the normal distribution with mean µ = 3 and variance σ² = 4

c) Exp(1) — the exponential distribution with rate 1

d) Γ(9, 0.5) — the gamma distribution with shape parameter k = 9 and scale parameter θ = 0.5

Further Reading

The best way to learn programming is to gain a lot of practice. For this reason, I recommend trying to solve as many as possible of the exercises provided in this text. Other interesting sources of programming challenges can be found online, for example on the web page of Project Euler¹.

An extensive introduction to all aspects of R can be found online in Venables et al. (2011). A more in-depth description can be found in Venables and Ripley (2000). R source code for many of the methods discussed here can be found in Rizzo (2008). Information about specific R functions can be found using the built-in help system (accessed using the function help).

¹ online at http://projecteuler.net/


Appendix C

Answers to the Exercises

Solution E-2.1. One possible solution is the following:

LCG <- function(n, m, a, c, X0) {
  X <- c()
  Xn <- X0
  for (i in 1:n) {
    Xn <- (a*Xn + c) %% m
    X[i] <- Xn
  }
  return(X)
}

Proceeding as in example 2.2, we find X₁ = 3, X₂ = 1, X₃ = 4, X₄ = 2, X₅ = 0 and X₆ = 3. This matches the output of our program:

> LCG(6,5,1,3,0)
[1] 3 1 4 2 0 3

Solution E-2.2. Using the function LCG from exercise 2.1, we can generate the graphs as follows:

par(mfrow=c(2,2))

X <- runif(1000)
plot(X[1:999], X[2:1000], asp=1, cex=.5,
     xlab=expression(X[i]), ylab=expression(X[i+1]))

m <- 81
a <- 1
c <- 8
seed <- 0
X <- LCG(1000, m, a, c, seed)/m
plot(X[1:999], X[2:1000], asp=1, cex=.5,
     xlab=expression(X[i]), ylab=expression(X[i+1]))

m <- 1024
a <- 401
c <- 101
seed <- 0
X <- LCG(1000, m, a, c, seed)/m
plot(X[1:999], X[2:1000], asp=1, cex=.5,
     xlab=expression(X[i]), ylab=expression(X[i+1]))

m <- 2^32
a <- 1664525
c <- 1013904223
seed <- 0
X <- LCG(1000, m, a, c, seed)/m
plot(X[1:999], X[2:1000], asp=1, cex=.5,
     xlab=expression(X[i]), ylab=expression(X[i+1]))

The output is shown in figure 2.1.

Solution E-2.4. The inequality c ≥ 1 follows from

    1 = ∫ f(x) dx ≤ ∫ c·g(x) dx = c ∫ g(x) dx = c.

Now assume c = 1. Then we have f ≤ g. For ε > 0 let A_ε = { x | f(x) ≤ g(x) − ε }. Then we have

    1 = ∫ f(x) dx = ∫_{A_ε} f(x) dx + ∫_{A_ε^c} f(x) dx
      ≤ ∫_{A_ε} (g(x) − ε) dx + ∫_{A_ε^c} g(x) dx = ∫ g(x) dx − ε|A_ε| = 1 − ε|A_ε|

and thus ε|A_ε| ≤ 0. Since ε > 0, this is only possible if |A_ε| = 0. Therefore we find

    |{ x | f(x) < g(x) }| = |⋃_{ε>0} A_ε| = lim_{ε↓0} |A_ε| = 0.

This completes the proof.

Solution E-2.5. a) The result is c = √(2e/π).

b) We can implement the rejection algorithm as follows:

f <- function(x) {
  return((x>0)*2*dnorm(x,0,1))
}
g <- function(x) { return(dexp(x,1)) }
c <- sqrt(2*exp(1)/pi)

q5rng <- function(n) {
  res <- numeric(n)
  i <- 0
  while (i<n) {
    U <- runif(1, 0, 1)
    X <- rexp(1, 1)
    if (c*g(X)*U <= f(X)) {
      i <- i+1
      res[i] <- X
    }
  }
  return(res)
}

X <- q5rng(1000)
curve(f, min(X), max(X), n=500,
      ylim=c(0,1), ylab="f")
hist(X, breaks=25, prob=TRUE, add=TRUE)


Figure C.1: A histogram of the output of the algorithm from exercise E-2.5.

The output of this program is shown in figure C.1.

c) Since the density f is the density of a standard-normal distribution conditioned on being positive, and since the normal distribution is symmetric, we can generate standard normally distributed values by using a mixture: we randomly return X or −X, both with probability 1/2.

Solution E-2.7. The mixture distribution P_θ has CDF

    F_θ(x) = ∑_{i=1}^{n} θ_i F_i(x)   for all x ∈ ℝ.

a) If we follow the given procedure to construct a random variable X we have

    P(X ≤ a) = ∑_{i=1}^{n} P(X ≤ a | I = i) P(I = i) = ∑_{i=1}^{n} F_i(a) θ_i = F_θ(a),

where I ∈ {1, 2, . . . , n} is the index of the randomly chosen distribution.

b) Use the fact that the density, if it exists, is the derivative of the CDF.

c) For the program we can follow the procedure from part a:

theta <- c(0.5, 0.2, 0.3)
mu <- c(1, 2, 4)
sigma <- sqrt(c(0.01, 0.04, 0.01))

I <- sample(3, 4000, replace=TRUE, prob=theta)
X <- rnorm(4000)*sigma[I] + mu[I]

curve(theta[1]*dnorm(x,mu[1],sigma[1])
      + theta[2]*dnorm(x,mu[2],sigma[2])
      + theta[3]*dnorm(x,mu[3],sigma[3]),
      min(X), max(X), n=500,
      ylab=expression(f[theta]))
hist(X, breaks=100, prob=TRUE, add=TRUE)

Figure C.2: A histogram generated from 4000 samples of the mixture distribution from question 2.5.

The resulting plot is shown in figure C.2.

Solution E-3.1. To implement the Monte Carlo estimation, we can for example use code like the following:

Z <- function(N) {
  X <- rnorm(N)
  return(sum(X >= 3 & X <= 4) / N)
}

To generate the required histograms and to get the sample mean and variance we can then use the following code:

N <- 1000
ZZ <- c()
for (i in 1:10000) {
  ZZ[i] <- Z(N)
}
hist(ZZ)
mean(ZZ)
var(ZZ)

For N = 1000, I get the sample mean µ = 0.0013123 and the sample variance σ² = 1.307299 · 10⁻⁶. For N = 10000 I get µ = 0.00131834 and σ² = 0.1364673 · 10⁻⁶.


[Figure: histograms of the estimates Z_N for N = 1000 and N = 10000.]

Solution E-3.2. Estimation of P(X ∈ A) where X ∼ N(0, 1) and A = [3, 4].

a) To implement importance sampling we have to evaluate

    Z_N = (1/N) ∑_{i=1}^{N} 1_{{Y_i ∈ A}} φ(Y_i)/ψ(Y_i),

where φ(x) = (1/√(2π)) exp(−x²/2) is the density of X and ψ is the density of the Y_i. This can for example be done using the following R code:

phi <- function(x) dnorm(x, 0, 1)

# Estimate P(3 <= X <= 4) where X~N(0,1),
# using importance sampling with N samples.
Z <- function(N, rpsi, dpsi) {
  Y <- rpsi(N)
  return(sum((Y >= 3 & Y <= 4) * phi(Y) / dpsi(Y)) / N)
}

rpsi <- function(N) rnorm(N, 1, 1)
dpsi <- function(x) dnorm(x, 1, 1)
print(Z(100, rpsi, dpsi))

Results:

N      µ₁        σ₁²       µ₂        σ₂²       µ₃        σ₃²       µ₄        σ₄²
1000   0.001323  8.33e-08  0.001318  1.37e-08  0.001318  6.54e-09  0.001318  1.91e-09
10000  0.001317  8.23e-09  0.001318  1.37e-09  0.001318  6.76e-10  0.001318  1.94e-10

[Figure: histograms of Z_N for the proposal distributions N(1,1), N(2,1), N(3.5,1) and Exp(1)+3.]


[Figure: a second set of histograms of Z_N for the same four proposal distributions.]

b) From the averages of the Z_N above we know that the exact value is approximately µ ≈ 0.001 = 10⁻³. One percent of this is 10⁻⁵. The accuracy is measured by the standard deviation σ, i.e. we want σ ≈ 10⁻⁵ and thus σ² ≈ 10⁻¹⁰.

We know that the variance of the sample decreases like 1/N. For the method from part a we found that for N = 10000 we have σ² ≈ 1.36 · 10⁻⁷. Thus, to get the variance down to 10⁻¹⁰ we will need a sample of size approximately equal to

    N ≈ 10000 · (1.36 · 10⁻⁷) / 10⁻¹⁰ = 1.36 · 10⁷ = 13,600,000.
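The same back-of-the-envelope computation can be done in R (a quick sketch; the helper required_n is ours, and the variances plugged in are the σ² values reported above for N = 10000):

```r
# sample size needed to push the variance down to `target`,
# given the variance var10000 observed for a run with N = 10000
# (the variance scales like 1/N)
required_n <- function(var10000, target=1e-10) {
  round(10000 * var10000 / target)
}

required_n(1.36e-7)  # basic Monte Carlo: 13,600,000
required_n(8.23e-9)  # importance sampling with N(1,1): 823,000
```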

The required sample sizes for the importance sampling method can be computed in the same way. The results are:

     N(1,1)    N(2,1)    N(3.5,1)  Exp(1)+3
N    823,000   136,000   67,600    19,400

This means that importance sampling with Y_i ∼ Exp(1) + 3 will be nearly 1000 times faster than basic MC integration.

Solution E-3.3. The given estimator for the skewness is

    T₁₀ = [ (1/10) ∑_{i=1}^{10} (X_i − X̄)³ ] / [ (1/10) ∑_{i=1}^{10} (X_i − X̄)² ]^{3/2}   with X̄ = (1/10) ∑_{i=1}^{10} X_i.

The estimator can be computed in R using the following function.

T10 <- function(X) {
  Xbar <- mean(X)
  numerator <- mean((X-Xbar)^3)
  denominator <- mean((X-Xbar)^2)^1.5
  return(numerator/denominator)
}

a) For given σ, we can estimate the bias and mean squared error using equations (3.13) and (3.14):

bias_and_MSE <- function(sigma, N=1000000) {
  samples <- matrix(nrow=N, ncol=2)
  for (j in seq(N)) {
    Xj <- rnorm(10, 0, sigma)
    tj <- T10(Xj)
    samples[j,1] <- tj
    samples[j,2] <- tj^2
  }
  return(list(bias=mean(samples[,1]), MSE=mean(samples[,2])))
}


Figure C.3: The estimated bias and mean squared error for the skewness estimator T₁₀ from example 3.15.

b) Using the function from part a, the plots can be created as follows:

k <- 20
sigma <- seq(0.5, 10.0, length.out=k)
bias <- c()
MSE <- c()
for (i in 1:k) {
  cat("sigma =", sigma[i], "\n")
  bm <- bias_and_MSE(sigma[i])
  bias[i] <- bm$bias
  MSE[i] <- bm$MSE
}

par(mfrow=c(2,1))
plot(sigma, bias, type="l",
     xlab=expression(sigma), ylim=c(-0.01,0.01))
plot(sigma, MSE, type="l", xlab=expression(sigma))

Solution E-3.4. We can use the following R code to get the MC estimate of the confidence levels:

# compute the confidence interval
ci <- function(X, c) {
  m <- mean(X)
  S <- sd(X)
  n <- length(X)
  return(c(m-c*S/sqrt(n), m+c*S/sqrt(n)))
}


# use MC to estimate the confidence level
level <- function(rgen, c, exact, N=100000) {
  k <- 0
  for (i in 1:N) {
    I <- ci(rgen(), c)
    if (I[1] < exact && exact < I[2])
      k <- k+1
  }
  return(k/N)
}

# try the normal distribution with the given values of c
print("normal distribution")
rgen <- function() rnorm(10, 7, sqrt(4))
print(level(rgen, 1.833, 7))
print(level(rgen, 2.262, 7))
print(level(rgen, 3.250, 7))

# try the chisq(2) distribution with the given values of c
print("chisq distribution")
rgen <- function() rchisq(10, 2)
print(level(rgen, 1.833, 2))
print(level(rgen, 2.262, 2))
print(level(rgen, 3.250, 2))

a) For n = 10 and X_i ∼ N(7, 4), we get the following estimated confidence levels: 0.90123 (for c = 1.833), 0.94928 (for c = 2.262), and 0.99010 (for c = 3.250). The values for c were chosen (using a table of Student's t-distribution) such that the exact confidence levels are 0.9, 0.95 and 0.99.

b) The exact mean of the χ²(2)-distribution is µ = 2. The program above gives the following values: 0.85347 (for c = 1.833), 0.89986 (for c = 2.262), and 0.95438 (for c = 3.250). One can see that for n = 10 the normal approximation does not yet hold very well: the c-values from the Student distribution lead to significantly lower confidence levels than in part a.

Solution E-4.1. a) We have

    θ_(j) = θ_{n−1}(X_(j)) = (1/(n−1)) ∑_{i=1, i≠j}^{n} X_i = (1/(n−1)) (nX̄ − X_j)

for j = 1, 2, . . . , n.

b) Using part a we find

    θ_(·) = (1/n) ∑_{j=1}^{n} θ_(j) = (1/n) ∑_{j=1}^{n} (1/(n−1)) (nX̄ − X_j)
          = (1/(n(n−1))) (n²X̄ − ∑_{j=1}^{n} X_j) = (1/(n−1)) (nX̄ − X̄) = (1/(n−1)) (n−1) X̄ = X̄.

c) Using the previous results we have

    θ_(j) − θ_(·) = (1/(n−1)) (nX̄ − X_j) − X̄ = (1/(n−1)) (X̄ − X_j).


From this we get

    ∑_{j=1}^{n} (θ_(j) − θ_(·))² = (1/(n−1)²) ∑_{j=1}^{n} (X̄ − X_j)²

and thus

    se_jack = √( ((n−1)/n) ∑_{j=1}^{n} (θ_(j) − θ_(·))² ) = √( (1/(n(n−1))) ∑_{j=1}^{n} (X̄ − X_j)² ).

This completes the proof.
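The final formula says that for the sample mean the jackknife standard error coincides with the usual estimate sd(X)/√n. This is easy to check numerically (a quick sketch, not part of the original solution; the data vector is arbitrary):

```r
X <- c(1.2, 0.7, 3.1, 2.4, 0.9)
n <- length(X)

# jackknife estimates: the sample mean with one observation left out
theta <- sapply(1:n, function(j) mean(X[-j]))
se.jack <- sqrt((n-1)/n * sum((theta - mean(theta))^2))

se.jack          # agrees with ...
sd(X)/sqrt(n)    # ... the usual standard error of the mean
```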

Solution E-5.1. The condition for X to be a Markov chain is

    P(X_{n+1} ∈ A_{n+1} | X_n ∈ A_n, . . . , X_0 ∈ A_0) = P(X_{n+1} ∈ A_{n+1} | X_n ∈ A_n)

for all A_0, . . . , A_{n+1}. We need to show that, given the value of X_n, the values X_0, . . . , X_{n−1} don't contribute “any additional information about the distribution of X_{n+1}”.

a) X_n = ∑_{i=1}^{n} Z_i: We can write X_{n+1} as X_{n+1} = X_n + Z_{n+1}, where Z_{n+1} is independent of X_0, . . . , X_n. Thus X is a Markov chain.

b) X_n = (1/n) ∑_{i=1}^{n} Z_i: We can write X_{n+1} as X_{n+1} = (n/(n+1)) X_n + (1/(n+1)) Z_{n+1}, and (1/(n+1)) Z_{n+1} is independent of X_0, . . . , X_n. Thus X is a Markov chain as above.

c) X_n = ∑_{i=1}^{n} (1/i) Z_i: again, since X_{n+1} = X_n + (1/(n+1)) Z_{n+1}, X is a Markov chain.

d) X_1 = Z_1 + X_0 and X_n = Z_n + X_{n−1} + X_{n−2} for all n ≥ 2: Since X_{n+1} = Z_{n+1} + X_n + X_{n−1} depends explicitly on X_{n−1}, we can guess that X is not a Markov chain. We need to check that the Markov chain condition is violated. Example: If X_n = a and X_{n−1} = b then X_{n+1} ∼ N(a + b, 1). Thus, for small ε > 0,

    P(X_{n+1} ∈ [−3, 3] | X_n ∈ [−ε, +ε], X_{n−1} ∈ [10 − ε, 10 + ε]) ≈ 0
      ≠ 1 ≈ P(X_{n+1} ∈ [−3, 3] | X_n ∈ [−ε, +ε], X_{n−1} ∈ [−ε, +ε]).

If X was a Markov chain, the two probabilities would be equal. Thus X is not a Markov chain.

Solution E-5.2. There are many examples; almost any Markov chain will be a solution. To be definite, consider S = {1, 2} and the Markov chain which starts at X_0 = 1 and which, at each time n ∈ ℕ, has P(X_n = 1) = 1/3 and P(X_n = 2) = 2/3, independent of the state in the previous step. This Markov chain has transition matrix

    P = ( 1/3  2/3 )
        ( 1/3  2/3 ).
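We can verify both claims numerically in R (a quick sketch, not part of the original solution): each row of P sums to one, and (1/3, 2/3) is left unchanged by P, so it is the stationary distribution of this chain.

```r
P <- matrix(c(1/3, 2/3,
              1/3, 2/3), nrow=2, byrow=TRUE)

rowSums(P)         # both rows sum to 1, so P is a stochastic matrix
c(1/3, 2/3) %*% P  # returns (1/3, 2/3) again
```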

Solution E-5.3. a) To sample paths from the Markov chain we can use a program like the following:

MC <- function(n, initial, P) {
  Xn <- sample(length(initial), 1, prob=initial)
  res <- c(Xn)
  for (i in 1:n) {
    p <- P[Xn,]
    Xn <- sample(length(p), 1, prob=p)
    res <- c(res, Xn)
  }
  return(res)
}

# transition matrix
P <- matrix(c(2/3, 1/3, 0, 0,
              .1, .9, 0, 0,
              .1, 0, .9, 0,
              .1, 0, 0, .9),
            nrow = 4, ncol = 4, byrow=TRUE)

# initial distribution
mu <- c(.25, .25, .25, .25)

MC(10, mu, P)

Figure C.4: Empirically determined occupation probabilities for the states 1, 2, 3 and 4 of the Markov chain from exercise 5.4.

b) To numerically estimate the distribution of X10 we can use a histogram:

data <- c()
for (i in 1:10000) {
  X <- MC(10, mu, P)
  data <- c(data, X[11])
}
hist(data, breaks=seq(0.5,4.5), probability=T, main="")

From the histogram in figure C.4 we find estimated probabilities for the states of the Markov chain: we get P(X₁₀ = 1) ≈ 0.22, P(X₁₀ = 2) ≈ 0.6, P(X₁₀ = 3) ≈ 0.09, and P(X₁₀ = 4) ≈ 0.09.

c) We have to compute µᵀP¹⁰:

> mu %*% P %*% P %*% P %*% P %*% P %*% P %*% P %*% P %*% P %*% P
          [,1]      [,2]       [,3]       [,4]
[1,] 0.2308349 0.5948259 0.08716961 0.08716961

The result coincides well with the values we read off the histogram.


Solution E-5.4. a) This follows directly from the properties of a stochastic matrix: for x = (1, . . . , 1) we have

    (Px)_i = ∑_{j∈S} p_{ij} x_j = ∑_{j∈S} p_{ij} · 1 = 1 = 1 · x_i

for all i ∈ S.

b) Assume Ax = λx. Then A(αx) = αAx = α(λx) = λ(αx).

c) We find the eigenvalues of P as follows:

> E <- eigen(t(P))
> E
$values
[1] 1.0000000 0.9000000 0.9000000 0.5666667

$vectors
           [,1]          [,2]          [,3]       [,4]
[1,] -0.2873479 -3.925231e-17 -3.925231e-17 -0.7071068
[2,] -0.9578263 -7.071068e-01 -7.071068e-01  0.7071068
[3,]  0.0000000  7.071068e-01  0.000000e+00  0.0000000
[4,]  0.0000000  0.000000e+00  7.071068e-01  0.0000000

Looking at the first column of this matrix (corresponding to the eigenvalue 1.0000000), we see that x = (-0.2873479, -0.9578263, 0.0000000, 0.0000000) is the (only) eigenvector of Pᵀ with eigenvalue 1. Thus, by normalising this vector to be a probability vector, we find the stationary distribution of P:

> x <- E$vectors[,1]
> x / sum(x)
[1] 0.2307692 0.7692308 0.0000000 0.0000000
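As a sanity check (our addition, not part of the original solution), we can confirm that the normalised vector is indeed stationary for the transition matrix P from solution E-5.3:

```r
P <- matrix(c(2/3, 1/3, 0, 0,
              .1, .9, 0, 0,
              .1, 0, .9, 0,
              .1, 0, 0, .9),
            nrow=4, ncol=4, byrow=TRUE)
x <- c(0.2307692, 0.7692308, 0, 0)  # i.e. (3/13, 10/13, 0, 0)

x %*% P  # returns x again (up to rounding), so x is stationary
```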

Solution E-5.5. a) Using the algorithm from the lectures, we can implement the Metropolis-Hastings method as follows. Since the proposal density is symmetric, the acceptance probability simplifies to α(x, y) = π(y)/π(x). Note that we do not need to know the value of the normalisation constant c to implement the algorithm, since the constant appears both in the numerator and in the denominator of the acceptance probability, and the two constants then cancel.

# the non-normalised target density
pi <- function(x) {
  return(exp(-0.5*((x-2)**4 - 2*x)) + 5 * exp(-(x+2)**4/2))
}

# the acceptance probability
alpha <- function(x, y) {
  return(pi(y)/pi(x))
}

# generate paths from the MH process
rMH <- function(n, x=0, sigma=1) {
  path <- c()
  for (i in 1:n) {
    y <- rnorm(1, x, sigma)
    U <- runif(1)
    if (U < alpha(x, y)) {
      x <- y
    }
    path[i] <- x
  }
  return(path)
}


b) Figure C.5 shows four runs of the algorithm from exercise 5.9, corresponding to σ = 100, σ = 10, σ = 1 and σ = 0.1. One can see that for big values of σ the proposals are often rejected, leading to constant segments in the path (X_n)_{n∈ℕ}. For small values of σ, the process (X_n)_{n∈ℕ} no longer transitions between the two modes seen in figure 5.2. To avoid both problems, intermediate values of σ should be chosen. The optimal value of σ will probably be somewhere between σ = 1 and σ = 10.

c) To estimate the expectation of the target distribution we will use σ = 6 (but any value of σ between 1 and 10 would do). We can use the standard deviation of several runs with n = 1000 to estimate which n we will need:

mm <- c()
for (i in 1:1000) {
  mm[i] <- mean(rMH(1000, sigma=6))
}
stderr <- sd(mm)
print(paste("the stderr for n=1000 is", stderr))

In my run this gives the result s.e. = 0.23176. If we want to compute the mean with an accuracy of about 0.01, and assuming that the standard deviation decays like c/√n, we need to choose n such that c/√n = 0.01, where c is determined by the relation c/√1000 = 0.23176:

n <- round(1000 * stderr^2 / 0.01^2)
print(paste("we will use n =", n))
m <- rMHavg(n, sigma=6)
print(paste("mean =", m))

This then uses n = 537136 to compute the estimate µ = 0.89.

d) The following code can be used to generate the graph shown in figure 5.3:

rMHaccept <- function(n, X=0, sigma=1) {
  asum <- 0
  for (i in 1:n) {
    Y <- rnorm(1, X, sigma)
    a <- min(1, alpha(X, Y))
    U <- runif(1)
    if (U < a) {
      X <- Y
    }
    asum <- asum + a
  }
  return(asum/n)
}

xx <- c()
yy <- c()
for (p in seq(-3, 3, 0.2)) {
  sigma <- 10^p
  cat("sigma =", sigma, "\n")
  xx <- c(xx, sigma^2)
  yy <- c(yy, rMHaccept(100000, 0, sigma))
}

plot(xx, yy, type="l", log="x",
     xlab=expression(sigma^2), ylab="average acceptance probability")

Solution E-B.1. As the following transcript of an R session shows, the first three expressions are straight-forward:


Figure C.5: This figure illustrates paths of the Metropolis-Hastings process for different values of σ. The middle two displays correspond to useful sampling techniques. In the top display, σ is too big (many proposals are rejected); in the bottom display, σ is too small (the proposals never reach the mode around x = −2).


> 1+2+3+4+5+6+7+8+9+10
[1] 55
> 2^16
[1] 65536
> 2^100
[1] 1.267651e+30

The sum of the first ten natural numbers is 55 and 2¹⁶ = 65536. The final answer 1.267651e+30 uses a short-hand notation: the returned result stands for 1.267651 · 10³⁰ (e is short for “exponent”). R chooses this representation because the result is very big and would be difficult to read when written in standard notation.

For the final part of the exercise, computing 2/(1 + √5), we can use the fact that brackets in R can be used exactly as they are used in mathematical expressions: using the function sqrt (see table B.1), we can write

> 2/(1+sqrt(5))
[1] 0.618034

to get the correct (up to 6 decimal digits) result 0.618034.

Solution E-B.2. The value of x is 3, the value of y is 5.

Solution E-B.3. The sum can be computed by first constructing a vector of the numbers 1, 2, . . . , 100, and then using the R function sum to add all elements of the vector:

> x <- 1:100
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
[16] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
[46] 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
[61] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
[76] 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100
> sum(x)
[1] 5050

Thus, the required sum is 5050. A shorter solution is to merge the two steps into one command by just typing sum(1:100). This results in the same answer.

Solution E-B.4. The program contains (at least) two mistakes: the line to assign a value to x[5] is missing, and the output claims wrongly that x[8] is the sixth (not eighth) Fibonacci number.

Solution E-B.5. We first use seq to construct a vector which contains the numbers from 100 to 0, and then use this vector to control the for loop:

for (i in seq(from=100, to=0, by=-1)) {
  cat(i, "\n")
}

The command print(i) can be used instead of cat(i, "\n"), but the output is not quite as tidy (try this yourself!).

Solution E-B.6. We can use the following program.


n <- 1
while (n^2 <= 5000) {
  n <- n + 1
}
print(n)

The result is n = 71.
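For a bounded search range the loop can also be replaced by a vectorised one-liner; the sketch below (our own variant, not part of the exercise) finds the same n:

```r
# index of the first element of 1:100 whose square exceeds 5000
n <- which((1:100)^2 > 5000)[1]
stopifnot(n == 71)
cat(n, "\n")  # 71
```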

Solution E-B.7. One solution is the following program:

n <- 1
while ((n+1)^2 < 5000) {
  n <- n + 1
}
print(n)

Alternatively we can use the program from B.6, and print n-1 instead of n (we also need to check that (n-1)^2 is not exactly equal to 5000; otherwise we would have to use n-2 instead). Using either method, the result is 70.
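Because 5000 is not a perfect square, the answer can also be read off directly as floor(sqrt(5000)); a minimal check of this shortcut:

```r
# largest n with n^2 < 5000 (valid since 5000 is not a perfect square)
n <- floor(sqrt(5000))
stopifnot(n^2 < 5000, (n + 1)^2 > 5000)
cat(n, "\n")  # 70
```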

Solution E-B.8. One solution is given in the following program.

x <- 27
n <- 0
while (x != 1) {
  cat("x[", n, "] = ", x, "\n", sep="")
  n <- n + 1
  if (x %% 2 == 0) {
    x <- x / 2
  } else {
    x <- 3*x + 1
  }
}
cat("x[", n, "] = ", x, "\n", sep="")

This program violates the DRY principle: the cat command is duplicated to output the final value. One way to fix this problem is to stop the loop one iteration later, i.e. when the previous value equals one. This idea is implemented in the following version of the program:

n <- 0
x <- 27
old <- 0
while (old != 1) {
  old <- x
  cat("x[", n, "] = ", x, "\n", sep="")
  n <- n + 1
  if (x %% 2 == 0) {
    x <- x / 2
  } else {
    x <- 3*x + 1
  }
}
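A further way to avoid the duplicated cat is to collect the whole sequence in a vector and print it only once at the end; the following sketch (the function name collatz is our own) follows this approach:

```r
# return the sequence x, f(x), f(f(x)), ..., 1 as a vector
collatz <- function(x) {
  res <- x
  while (x != 1) {
    if (x %% 2 == 0) {
      x <- x / 2
    } else {
      x <- 3 * x + 1
    }
    res <- c(res, x)
  }
  res
}
print(collatz(27))  # 112 values, starting with 27 and ending with 4, 2, 1
```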

Solution E-B.9. The function computes the sample variance of the elements of x.
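The function from the exercise is not reproduced here, but any function matching this description agrees with R's built-in var; a hypothetical sketch of such a function (the name sampleVar is our own choice):

```r
# sample variance: sum of squared deviations from the mean, divided by n-1
sampleVar <- function(x) {
  sum((x - mean(x))^2) / (length(x) - 1)
}
x <- c(1, 4, 2, 8, 5)
stopifnot(abs(sampleVar(x) - var(x)) < 1e-12)
```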

Solution E-B.10. We set the variable b to the string "a". Since variables inside the function are independent of the names used by the caller, the function call test(c=b) is


equivalent to test(c="a"), i.e. inside the function the value of the variable c is the string "a". Finally, the function cat does not include the enclosing quotation marks in the output when processing strings, and thus the output is "a=1, b=2, c=a, d=4".
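The definition of test is given in exercise B.10; a hypothetical reconstruction with the same behaviour (the default values a=1, b=2, c=3, d=4 are our assumption) makes the argument passing visible:

```r
# hypothetical reconstruction of test() from exercise B.10
test <- function(a = 1, b = 2, c = 3, d = 4) {
  cat("a=", a, ", b=", b, ", c=", c, ", d=", d, "\n", sep = "")
}
b <- "a"
test(c = b)  # prints: a=1, b=2, c=a, d=4
```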

Solution E-B.11. The omitted line, where x[5] should have been set, causes a run-time error: the command x[6] <- x[5] + x[4] tries to read the uninitialised value x[5], and consequently x[6] and all the following values are set to NA. The mistake in the message printed via cat is a semantic error.
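The NA propagation can be reproduced in isolation: reading an element beyond the length of a vector yields NA, and arithmetic with NA again yields NA:

```r
x <- c(1, 1, 2, 3)    # suppose x[5] was never assigned
x[6] <- x[5] + x[4]   # x[5] reads as NA, so x[6] becomes NA as well
print(x)              # 1 1 2 3 NA NA
```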

Solution E-B.12. The function dnorm gives the density and pnorm gives the CDF of the normal distribution. Similarly, we can use the functions dexp and pexp for the exponential distribution, and dgamma and pgamma for the gamma distribution. Thus, we can plot the required graphs as follows:

par(mfrow=c(4,2), mar=c(4, 4.5, 1, 0.5), oma=c(0,0,0,0))

curve(dnorm(x,0,1), -3, 10)
curve(pnorm(x,0,1), -3, 10, ylim=c(0,1))
curve(dnorm(x,3,sqrt(4)), -3, 10)
curve(pnorm(x,3,sqrt(4)), -3, 10, ylim=c(0,1))
curve(dexp(x,1), -0.1, 5)
curve(pexp(x,1), -0.1, 5, ylim=c(0,1))
curve(dgamma(x,9,scale=.5), 0, 10)
curve(pgamma(x,9,scale=.5), 0, 10, ylim=c(0,1))

The resulting plot is shown in figure C.6.



Figure C.6: Output of the R script from exercise B.12. The graphs show the densities (left column) and distribution functions (right column) of the distributions N(0, 1), N(3, 4), Exp(1) and Γ(9, 0.5) (top to bottom).




Index

algorithm
  ANT, 29
  BOOT, 38
  CTV, 31
  EM, 59
  IMP, 26
  IND, 56
  INV, 10
  LCG, 6
  MC, 22
  MH1, 50
  MH2, 50
  REJ, 15
  RWM, 54
almost surely, 65
antithetic variables, 27, 29
Bayes’ rule, 67
bias, 32
bootstrap estimate, 38
bootstrap methods, 37
burn-in, 53
CDF, see cumulative distribution function
conditional density, 68
confidence interval, 33
control variates, 30, 31
cumulative distribution function, 65
detailed balance condition, 48
don’t repeat yourself, 77
DRY, see don’t repeat yourself
empirical distribution, 37
estimator, 32
events, 65
expectation maximisation, 59
i.i.d., 6, 22
importance sampling, 26
increment, 6
Independence Sampler, 56
indicator function, 68
initial distribution, 44
inverse transform method, 10
law of large numbers, 69
LCG, see linear congruential generator
linear congruential generator, 6
Markov chain, 44
  reversible, 48
  time-homogeneous, 44
Markov chain Monte Carlo, 43
MCMC, see Markov chain Monte Carlo
mean square error, 33
Metropolis algorithm, 54
Metropolis Hastings method
  continuous state space, 50
  discrete state space, 50
modulus, 6
Monte Carlo integration, 21, 22
multiplier, 6
posterior distribution, 43
prior distribution, 43
PRNG, see pseudo random number generator
probability density, 66
probability distribution, 65
probability vector, 45
proposal density, 15
proposals, 13, 15, 50, 54, 56
pseudo random number generator, 6
random numbers
  pseudo, 6
  true, 5
random variables, 65
random walk, 44
Random Walk Metropolis sampling, 54
rejection sampling, 15


seed, 6
standard error, 39
state space, 44
stationary density, 49
stationary distribution, 47
statistic, 32
statistical test, 34
stochastic matrix, 45
target density, 15, 50, 54, 56
target distribution, 50
transition density, 49
transition matrix, 44, 45
uniform distribution, 4, 12
