Introduction to Probability and Statistics: notes for a short course

Jonathan G. Campbell

Department of Computing,

Letterkenny Institute of Technology,

Co. Donegal, Ireland.

email: jonathan dot campbell (at) gmail.com

URL: http://www.jgcampbell.com/stats/stats.pdf

Report No: jc/09/0004/r

Revision 0.3

18th August 2009

Contents

1 Introduction
  1.1 Purpose and Scope
  1.2 Why use R?
  1.3 Relevant textbooks and web sources
    1.3.1 General Books on Probability and Statistics
    1.3.2 Books on R and Statistics using R
    1.3.3 Bayesian Statistics
    1.3.4 Web Links
  1.4 Outline
2 Simple Data Analysis and Visualisation and Introduction to R
  2.1 Introduction
    2.1.1 Installation of R
    2.1.2 Running R
  2.2 Visualisation and Exploratory Data Analysis
3 Averages
  3.1 Introduction
  3.2 Arithmetic Mean
    3.2.1 Arithmetic Mean using Frequencies
  3.3 Median
  3.4 Mode
  3.5 Other Means
4 Measures of Data Variability
  4.1 Introduction
  4.2 Variance and Standard Deviation
    4.2.1 Equalising the means
    4.2.2 Variability and spread
    4.2.3 Variance and Standard Deviation
  4.3 Standard Scores and Normalising Marks
    4.3.1 Standard Scores
5 Probability and Random Variables
  5.1 Introduction
  5.2 Basic Probability and Random Variables
    5.2.1 Introduction
    5.2.2 Probability and Events
    5.2.3 A Point on Terminology
    5.2.4 Probability of Non-disjoint Events
    5.2.5 Finite Sample Spaces
  5.3 Random Variables
  5.4 Computing probabilities
  5.5 Enumerating more complex events and sample spaces
    5.5.1 Multiplication of outcomes
    5.5.2 Addition of outcomes
    5.5.3 Permutations
    5.5.4 Combinations
  5.6 Conditional Probability
    5.6.1 Venn diagrams
    5.6.2 Probability Trees
    5.6.3 Joint Probability
  5.7 Bayes’ Rule
  5.8 Independent Events
  5.9 Betting and Odds
  5.10 Classical versus Bayesian Interpretations of Probability
6 One Dimensional Random Variables
  6.1 Introduction
    6.1.1 Definition: Random Variable
    6.1.2 Probability associated with a Random Variable
  6.2 Probability Mass Function (pmf) of a Discrete r.v.
  6.3 Some Discrete Random Variables
    6.3.1 Point Mass Distribution
    6.3.2 Discrete Uniform Distribution
    6.3.3 Bernoulli Distribution
    6.3.4 Binomial Distribution
    6.3.5 Geometric Distribution
    6.3.6 Poisson Distribution
  6.4 Some Continuous Random Variables
    6.4.1 Probability Density Function (PDF)
    6.4.2 Cumulative Distribution Function (cdf)
    6.4.3 Uniform Distribution
    6.4.4 Normal (Gaussian) Distribution
    6.4.5 Exponential Distribution
    6.4.6 Gamma Distribution
    6.4.7 Beta Distribution
    6.4.8 Student t Distribution
    6.4.9 Cauchy Distribution
    6.4.10 Chi-squared Distribution
  6.5 Range spaces – terminology
  6.6 Parameters
7 Two- and Multi-Dimensional Random Variables
  7.1 Introduction
  7.2 Probability Function of a Discrete Two-dimensional r.v.
  7.3 PDF of a Continuous Two-dimensional r.v.
  7.4 Marginal Probability Distributions
  7.5 Conditional Probability Distributions
  7.6 Independent Random Variables
  7.7 Two-dimensional (Bivariate) Normal Distribution
8 Characterisations of Random Variables
  8.1 Introduction
  8.2 Expected Value (Mean) of a Random Variable
  8.3 Variance of a Random Variable
  8.4 Expectations in Two-dimensions
    8.4.1 Mean
    8.4.2 Covariance
9 The Normal Distribution
  9.1 Introduction
  9.2 Cumulative Distribution Function (cdf)
  9.3 Normal Cdf
  9.4 Using the Normal Cdf
  9.5 Sum of Independent Normal Random Variables
  9.6 Differences of Normal Random Variables
  9.7 Linear Transformations of Normal Random Variables
  9.8 The Central Limit Theorem
10 Statistical Inference
  10.1 Introduction
11 Statistical Estimation
  11.1 Introduction
  11.2 Populations and Samples
  11.3 Estimating the Mean
  11.4 Estimating the Standard Deviation
  11.5 Sampling Distributions
    11.5.1 Sampling Distribution of the mean
    11.5.2 Sampling Distribution for Estimates of the Standard Deviation
  11.6 Confidence Intervals
12 Hypothesis Testing
  12.1 Introduction
13 Sampling
  13.1 Introduction
14 Classification and Pattern Recognition
  14.1 Introduction
15 Simple Classifier Methods
  15.1 Thresholding for one-dimensional data
  15.2 Linear separating lines/planes for two-dimensions
  15.3 Nearest mean classifier
  15.4 Normal form of the separating line, projections, and linear discriminants
  15.5 Projection and linear discriminant
  15.6 Projections and linear discriminants in p dimensions
  15.7 Template Matching and Discriminants
  15.8 Nearest neighbour methods
16 Statistical Classifier Methods
  16.1 One-dimensional classification revisited
  16.2 Bayes’ Rule for the Inversion of Conditional Probabilities
  16.3 Parametric Methods
  16.4 Discriminants based on Normal Density
  16.5 Bayes-Gauss Classifier – Special Cases
    16.5.1 Equal and Diagonal Covariances
    16.5.2 Equal but General Covariances
  16.6 Least square error trained classifier
  16.7 Generalised linear discriminant function
17 Linear Discriminant Analysis and Principal Components Analysis
  17.1 Principal Components Analysis
  17.2 Fisher’s Linear Discriminant Analysis
18 Neural Network Methods
  18.1 Neurons for Boolean Functions
  18.2 Three-layer neural network for arbitrarily complex decision regions
  18.3 Sigmoid activation functions
19 Unsupervised Classification (Clustering)
20 Regression
  20.1 Linear Regression
A Basic Mathematical Notation
  A.1 Sets
    A.1.1 Set Definition and Membership
    A.1.2 Important Number Sets
    A.1.3 Set Operations
    A.1.4 Venn Diagrams
  A.2 Iterated Summation and Product Notation
  A.3 Iterated Union and Intersection
  A.4 Cartesian Product Sets
B Matrices and Linear Algebra
  B.1 Introduction
  B.2 Linear Simultaneous Equations
  B.3 Vectors and Matrices
  B.4 Basic Matrix Arithmetic
    B.4.1 Matrix Multiplication
    B.4.2 Multiplication by a Scalar
    B.4.3 Addition
  B.5 Special Matrices
    B.5.1 Identity Matrix
    B.5.2 Orthogonal Matrix
    B.5.3 Diagonal
    B.5.4 Transpose of a Matrix
  B.6 Inverse Matrix
  B.7 Multidimensional (Multivariate) Random Variables

Chapter 1

Introduction

1.1 Purpose and Scope

This report is written as the basis for a short course on statistics to be presented for postgraduate

students at Letterkenny Institute of Technology.

The notes have a mixed objective. I started writing a set of notes based on the traditional approach

to probability and statistics, namely: basic probability, up to and including conditional probability,

independence, Bayes’ Law; then some one-dimensional discrete and continuous distributions and

some of their properties; and then on to sampling, parameter estimation, point estimates,

confidence intervals, and hypothesis testing.

However, after discussion with someone who knows potential consumers of the course, I was

persuaded to start with a more gentle introduction. Hence I start off with simple visualisation, then

look at averages (central tendency), then variance, and then return to the main line.

As I say, the notes have a mixed objective. One objective is as notes for a gentle introduction to

statistics; another is to include a set of reference results that one would refer to during a course;

that is, a course presenter might not want to spend time on the details of, for example, the Binomial

distribution, or even full details of the Normal, but it would be useful for students to have access

to some of these details without having to access one or more textbooks.

When I give a course, I may give attendees a printout of all the notes — including an outline of

the objective of the course and the plan of coverage, mentioning the chapters that will be used.

Or, alternatively, I may do a specialised printout that includes only the chapters to be covered.

The notes you see here include everything.

1.2 Why use R?

Let me quote from the R website http://www.r-project.org/:

R is a language and environment for statistical computing and graphics. It is a GNU

project which is similar to the S language and environment which was developed at

Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and

colleagues. R can be considered as a different implementation of S. There are some

important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical

tests, time-series analysis, classification, clustering, ...) and graphical techniques,

and is highly extensible. The S language is often the vehicle of choice for research in

statistical methodology, and R provides an Open Source route to participation in that

activity.

One of R’s strengths is the ease with which well-designed publication-quality plots can

be produced, including mathematical symbols and formulae where needed. Great care

has been taken over the defaults for the minor design choices in graphics, but the user

retains full control.

When I have to choose a software package for teaching or for practical use (I mean generally, it

could be a development system for a programming language, a computer games engine, a statistics

package, . . . ) I look primarily at the following criteria:

• Is it easily available, i.e. is it already installed in our laboratory machines, or is it easy (and

cheap) to acquire?

R does well on this criterion — it is free to download and install, see 2.1.1.

• Is it well supported by textbooks and online documentation?

Again, R does well. In the past ten years, and at an accelerating pace in the last five,

a great many top class books have appeared on R and on particular statistical techniques using R; see 1.3.

I notice that books that used to have just numerical examples now, in recent editions, give

R examples.

There is a top class mailing list supported by volunteers of the highest calibre:

https://stat.ethz.ch/mailman/listinfo/r-help

Via that mailing list, I have received assistance from world-class statisticians.

• Is it widely used? Yes.

1.3 Relevant textbooks and web sources

1.3.1 General Books on Probability and Statistics

These notes are mostly based on (Meyer 1966) (which was used for a college course on statistics

that I attended), (Wasserman 2004), which is a good summary of all the statistics you might

ever need, but is not an introduction, (Griffiths 2009) and (Milton 2009) which are excellent

introductions though very wordy, (Crawley 2005), (Spiegel & Stephens 2008). The latter, (Spiegel

& Stephens 2008), has plenty of examples including some examples on the use of the Excel

spreadsheet.

(Dytham 2009) seems to be a good introduction for biologists and the more advanced (Quinn &

Keough 2002) receives a lot of recommendations.

Hacking’s book (Hacking 2001) is maybe a good introduction to probability and the philosophy

and practice of probabilistic inference.

The bibliography contains books that are in my collection, that I may have used in some small way,

and/or that may be useful to users of these notes.

1.3.2 Books on R and Statistics using R

Crawley may be the best general book (Crawley 2005); for bio-scientists it has the advantage that

Crawley’s research area is bio-science.

Venables and Ripley’s MASS (Venables & Ripley 2002) is top class — note, do not be confused

by the title Modern Applied Statistics with S; R is an open-source version of S (and S-Plus) and

the book covers any differences, which are minimal. Maindonald (Maindonald & Braun 2007) is

good for R graphics; R code for all his diagrams is available online (free).

Matloff’s R for Programmers (Matloff 2008) has the advantage that it is available online.

See also the extensive list at

http://www.r-project.org/doc/bib/R-books.html

1.3.3 Bayesian Statistics

Not that we’ll be emphasising the Bayesian approach.

(Sivia 2006) (best introduction to Bayesian statistics), (MacKay 2002), (Lee 2004).

1.3.4 Web Links

• General: http://www.jgcampbell.com/links/stats.html;

• R: http://www.r-project.org/.

1.4 Outline

Chapter 5 gives an introduction to probability; if you want to understand basic statistics you must

have a basic understanding of probability — however we note that probability is to a great extent

common sense. Before starting you should have a quick run through Appendix A just to familiarise

yourself with basic mathematical notation; we note that the mathematical notation used is no

more than shorthand; it would be difficult to write these notes without employing that shorthand;

in addition, you will encounter similar shorthand in books and research papers.

Chapter 2 gives a very brief introduction to simple statistical techniques and visualisation and to

the statistical package R.

Chapter 3 gives a brief introduction to averages or what statisticians call central tendency.

Chapter 4 introduces methods of describing data variability, most notably variance

and standard deviation.

Chapter 6 introduces random variables and lists the common one-dimensional probability

distributions.

Chapter 7 gives a brief introduction to multivariate random variables and some distributions. Note

that Appendix B gives a gentle introduction to vector and matrix mathematics which are necessary

in multivariate statistics.

Chapter 8 discusses important characteristics of random variables such as mean and variance.

Chapter 9 gives specialised treatment to the normal distribution — in view of its importance in

applications.

Chapter 10 introduces statistical inference, that is, how can we infer properties of a population

from statistics derived from a sample. One aspect of statistical inference is parameter estimation;

Chapter 11 introduces point estimation and confidence interval estimation. Hypothesis testing is

strongly related to estimation; Chapter 12 gives an introduction to hypothesis testing.

Chapter 13 discusses some of the intricacies of sampling.

As of 2009-08-18 this is work in progress and will remain so for the foreseeable future.

Chapter 2

Simple Data Analysis and Visualisation and Introduction to R

2.1 Introduction

The objectives of this chapter are to give a very brief introduction to simple statistical techniques

and visualisation and to the statistical package R.

2.1.1 Installation of R

Click on http://www.r-project.org/ and find the Download link. For Windows users there is

an exe file which does everything. You may need Administrator rights on your machine; contact

Computer Services as necessary.

Linux users are probably best advised to rely on the installer of their particular Linux distribution.

2.1.2 Running R

Start R by clicking on the R desktop icon. R will open up a window with something like the following

in it.

R version 2.7.1 (2008-06-23)

Copyright (C) 2008 The R Foundation for Statistical Computing

ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.

You are welcome to redistribute it under certain conditions.

Type 'license()' or 'licence()' for distribution details.

Type 'demo()' for some demos, 'help()' for on-line help, or

'help.start()' for an HTML browser interface to help.

Type 'q()' to quit R.

>

The > is R asking you to enter something, as on a calculator; R can operate as a simple

calculator, but of course we are interested in its use as a powerful statistical calculator.

> 2 + 3

[1] 5

> sqrt(26)

[1] 5.09902

> 3^4

[1] 81

>

For the remainder of this chapter we’ll look at a significant example involving visualisation and

exploratory data analysis on a data set.

2.2 Visualisation and Exploratory Data Analysis

We’re going to read in some examination result data and analyse them. The file exam.txt contains

data as follows:

exam

65

60

47

... etc. 66 results in total

The name of the column is exam and we tell R to pay attention to that.

In what follows, # is a comment symbol and R ignores anything after the # until the next line.

Anything after > is something that you typed, i.e. a request to R. If something appears without

a >, that is an R response.

> ex <- read.table("exam.txt", header = TRUE)

> attach(ex)

> exam # print 'exam' data on the screen

[1] 65 60 47 43 51 32 62 71 0 56 52 59 15 49 54 67 44 2 47 61 45 95 62 80 46

[26] 52 61 12 62 69 78 62 48 56 56 58 60 0 48 71 50 90 51 53 5 51 63 35 39 10

[51] 57 53 20 54 22 44 53 52 25 60 55 39 30 53 67 50

>
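
To make the file layout and the effect of header concrete, here is a minimal sketch that reads the same kind of one-column file from an inline string instead of from exam.txt. The name ex_demo is made up for illustration, only the first three marks from the listing above are used, and the text argument of read.table stands in for a file on disk.

```r
# Inline stand-in for a file whose first line is the column name 'exam'
# (only the first three marks are used; 'ex_demo' is a hypothetical name).
ex_demo <- read.table(text = "exam
65
60
47", header = TRUE)

names(ex_demo)        # the header line became the column name
ex_demo$exam          # access the column without needing attach()
length(ex_demo$exam)  # number of marks read
```

Writing ex_demo$exam is an alternative to attach(); it may be the safer habit when several data frames with similar column names are loaded at once.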

That printout is quite uninformative; for example, you have no idea what the maximum is, nor the

range, nor have you an even rough idea of what the average mark is, etc.

Let us look at a histogram.

> hist(exam)

And we get Figure 2.1.

Often, like me here, you want to save the diagram to a file so that you can include it in a report.

Here is how to do that; vis1-1.pdf is a filename that I made up.

> pdf("vis1-1.pdf", onefile=FALSE, height=8, width=6, pointsize=8,

+ paper="special")

> hist(exam)

> devoff()

Error: could not find function "devoff" # R complaining ...

> dev.off() # do this to finalise and close the file

# if you don't, it is like forgetting to save in a word processor.

>
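
The open, plot, close sequence above can be tried without touching your working directory. The following is only a sketch: the file name hist-demo.pdf and the variable names are made up, the file goes to R's temporary directory, and the data are just the first ten exam marks from the earlier listing.

```r
# Hypothetical output file in R's per-session temporary directory.
out_file <- file.path(tempdir(), "hist-demo.pdf")

# First ten exam marks from the listing above, just to have something to plot.
marks <- c(65, 60, 47, 43, 51, 32, 62, 71, 0, 56)

pdf(out_file, height = 4, width = 6)  # open the PDF graphics device
hist(marks)                           # the plot is drawn into the file, not on screen
invisible(dev.off())                  # close the device so the file is complete

file.exists(out_file)                 # check that the file was written
```

Until dev.off() is called, the PDF may be incomplete, which is why the close step matters as much as the plotting step.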

Figure 2.1: Histogram of exam marks. [Plot title: Histogram of exam; x-axis: exam; y-axis: Frequency.]

Let us see what the average mark is and the range of marks:

> mean(exam)

[1] 49.07576

> range(exam)

[1] 0 95

>

We could have used:

> length(exam)

[1] 66 # 66 results in 'exam'

> sum(exam)/length(exam)

[1] 49.07576
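
As a small self-contained sketch of the same calculations, the block below uses only the first ten exam marks from the listing above (so the numbers differ from the full data set); the variable names are made up. It also shows summary(), which gathers several of these quantities in one call.

```r
# First ten exam marks from the listing above.
marks <- c(65, 60, 47, 43, 51, 32, 62, 71, 0, 56)

n     <- length(marks)   # number of results
total <- sum(marks)      # sum of the marks
total / n                # mean "by hand"
mean(marks)              # the same value from the built-in function
range(marks)             # smallest and largest mark
summary(marks)           # minimum, quartiles, median, mean and maximum in one call
```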

Let us see the data in sorted order — a good deal more informative than unsorted:

> sort(exam)

[1] 0 0 2 5 10 12 15 20 22 25 30 32 35 39 39 43 44 44 45 46 47 47 48 48 49

[26] 50 50 51 51 51 52 52 52 53 53 53 53 54 54 55 56 56 56 57 58 59 60 60 60 61

[51] 61 62 62 62 62 63 65 67 67 69 71 71 78 80 90 95

>

Now read in the corresponding continuous assessment (CA) marks (coursework); they came from a

spreadsheet, so there is a load of digits after the decimal point, which makes the data even more

incomprehensible, so we use round to round them to the nearest integer. It looks like

the CA marks are more generous than the exam marks, and mean(ca) confirms this, as does the

histogram in Figure 2.2.

> cw <- read.table("ca.txt", header = TRUE)

> attach(cw)

> ca

[1] 91.34390 85.54622 72.65543 63.10473 73.22074 50.99642 85.69151 97.06528

[9] 18.58191 83.30836 78.78221 77.68898 21.07860 76.04457 76.56793 86.90106

[17] 61.70048 16.28892 69.57387 83.08058 74.19594 97.12300 81.58833 98.12345

[25] 60.17263 79.49133 89.35610 27.89478 98.06673 92.34510 96.19500 88.69131

[33] 69.70333 85.23094 86.99767 82.89807 77.35877 15.12655 72.41332 90.07670

[41] 75.20815 97.17500 65.78075 70.29256 14.20315 73.02363 87.38178 52.74194

[49] 60.66164 20.05529 78.16085 73.58862 34.07182 78.03601 39.31353 69.57565

[57] 77.53929 77.20521 52.67979 89.10232 76.78222 54.16873 40.23080 81.09443

[65] 89.12518 67.58763

> car = round(ca)

> car

[1] 91 86 73 63 73 51 86 97 19 83 79 78 21 76 77 87 62 16 70 83 74 97 82 98 60

[26] 79 89 28 98 92 96 89 70 85 87 83 77 15 72 90 75 97 66 70 14 73 87 53 61 20

[51] 78 74 34 78 39 70 78 77 53 89 77 54 40 81 89 68

>

> sort(car)

[1] 14 15 16 19 20 21 28 34 39 40 51 53 53 54 60 61 62 63 66 68 70 70 70 70 72

[26] 73 73 73 74 74 75 76 77 77 77 77 78 78 78 78 79 79 81 82 83 83 83 85 86 86

[51] 87 87 87 89 89 89 89 90 91 92 96 97 97 97 98 98

>

> mean(ca)

[1] 70.10692

>

> hist(ca)

# and save another one to a file

> pdf("vis1-ca.pdf", onefile=FALSE, height=4, width=6, pointsize=8, paper="special")

> hist(ca)

> dev.off()

Figure 2.2: Histogram of CA marks. [Plot title: Histogram of ca; x-axis: ca; y-axis: Frequency.]

Boxplots are another way of examining a data set. Figure 2.3 shows boxplots for the examination

and CA results.

The construction of the boxplot is as follows: (a) the heavy line across the interior of the box

corresponds to the median value (see Chapter 3); (b) the bottom and top of the box correspond

to, respectively, the lower quartile and upper quartile, i.e. 25% of the data are below the lower

quartile and 25% are above the upper quartile (or, if you like, 75% are below it).

The so called whiskers show the smallest and largest values — excluding boxplot’s interpretation

of outliers. The outliers are then shown as single points.

Quartile is a specialisation of the general term quantile, see Chapter 4. In Chapters 9, 11 and

12, we’ll come across, for example, 5% and 95% quantiles. The median is the centre of the data,

i.e. as many of the data are above the median as are below it; see Chapter 3.

To determine what are outliers, boxplot by default takes the height of the box (the interquartile

range) and labels as outliers any data that lie more than 1.5 times that height below the bottom

of the box or above the top of the box.
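
The numbers behind a boxplot can be inspected directly. The sketch below uses a small made-up data set (not the exam data) and R's boxplot.stats function, whose default setting flags points lying more than 1.5 box heights beyond the box; the variable names are illustrative only.

```r
# Small made-up data set with one very low and one very high value.
x <- c(2, 50, 52, 53, 55, 56, 58, 60, 61, 95)

bs <- boxplot.stats(x)   # the numbers that boxplot() draws; default coef = 1.5
bs$stats                 # lower whisker, bottom of box, median, top of box, upper whisker
bs$out                   # points plotted individually as outliers

# The same rule written out by hand.
box    <- fivenum(x)[c(2, 4)]         # bottom and top of the box (the "hinges")
height <- diff(box)                   # height of the box
fences <- c(box[1] - 1.5 * height, box[2] + 1.5 * height)
x[x < fences[1] | x > fences[2]]      # agrees with bs$out
```

Note that the box edges used here are the hinges returned by fivenum, which are close to, but not always identical with, the quartiles returned by quantile().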

Figure 2.3: Boxplot of: left, examination marks; right, CA marks.

How to look at the two data sets together? There must be a way of superimposing one histogram

on another, but I haven’t found that yet.
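
One possibility, offered only as a sketch: hist accepts an add = TRUE argument that draws a second histogram on top of an existing plot. The block below uses the first ten exam marks and the first ten rounded CA marks from the listings above; the variable names, colours and bin edges are my own choices, and semi-transparent colours are used so that overlapping bars remain visible.

```r
# First ten exam and (rounded) CA marks from the listings above.
exam10 <- c(65, 60, 47, 43, 51, 32, 62, 71, 0, 56)
ca10   <- c(91, 86, 73, 63, 73, 51, 86, 97, 19, 83)

# Common bin edges covering both data sets, so the bars line up.
edges <- seq(0, 100, by = 10)

# First histogram sets up the axes; the second is drawn over it.
h_exam <- hist(exam10, breaks = edges, col = rgb(0, 0, 1, 0.4),
               main = "Exam (blue) and CA (red)", xlab = "mark")
h_ca   <- hist(ca10, breaks = edges, col = rgb(1, 0, 0, 0.4), add = TRUE)
```

Because the first call fixes the vertical axis, a taller second histogram could be clipped; in that case a ylim argument on the first call would be needed.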

So let us display a two-dimensional scatter plot of the two data sets, see Figure 2.4.

> library(lattice) # first we must load a library that has 'xyplot' in it

> xyplot(exam ~ ca)

Figure 2.4: Scatter plot of Exam. marks versus CA marks. [x-axis: ca; y-axis: exam.]

Someone says those CA and exam marks look quite correlated; I wonder how accurately we could

have predicted the exam results using the CA? This is regression territory, and given that

Figure 2.4 shows a sort of straight line relationship, we’ll try linear regression, your old friend

y = mx + c, or in this case exam = m × ca + c. It is more usual to use a and b, giving exam = a + b × ca.

Here a is the intercept (where the fitted straight line meets the y-axis at x = 0) and b is the slope.

> fitres = lm(exam ~ ca)
> summary(fitres)

Call:
lm(formula = exam ~ ca)

Residuals:
     Min       1Q   Median       3Q      Max
-10.9697  -3.1181  -0.7405   3.1036  22.8368

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.83639    2.21002  -4.903 6.77e-06 ***
ca            0.85458    0.03002  28.469  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.482 on 64 degrees of freedom
Multiple R-squared: 0.9268, Adjusted R-squared: 0.9257
F-statistic: 810.5 on 1 and 64 DF, p-value: < 2.2e-16
>

R prints a lot of information that we’ll find out about in Chapter 20; for now all we need to know

are a = −10.83639 (intercept) and b = 0.85458 (coefficient multiplying ca), i.e. the fitted line is

exam = −10.83639 + 0.85458× ca. Figure 2.5 shows the results of the straight line fitting.

Figure 2.5: Straight line fitting Exam. marks versus CA marks. [x-axis: ca; y-axis: exam.]
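
The notes do not show how the fitted line was drawn, so the following is only one plausible way to do it, using base graphics and a ten-point subset of the data (the first ten rounded CA marks and exam marks from the listings above). The names ca10, exam10 and fit10 are made up, and the fitted values will differ from those of the full data set.

```r
# First ten (rounded) CA and exam marks from the listings above.
ca10   <- c(91, 86, 73, 63, 73, 51, 86, 97, 19, 83)
exam10 <- c(65, 60, 47, 43, 51, 32, 62, 71, 0, 56)

fit10 <- lm(exam10 ~ ca10)   # least-squares straight line
a <- coef(fit10)[1]          # intercept
b <- coef(fit10)[2]          # slope

# The same slope and intercept from the usual least-squares formulas.
b_hand <- cov(ca10, exam10) / var(ca10)
a_hand <- mean(exam10) - b_hand * mean(ca10)

plot(ca10, exam10, xlab = "ca", ylab = "exam")  # scatter plot of the data
abline(fit10)                                   # draw the fitted line on top

# Predicted exam mark for a hypothetical CA mark of 70.
predict(fit10, newdata = data.frame(ca10 = 70))
```

Comparing a and b with a_hand and b_hand is a quick check that lm is doing nothing more mysterious here than fitting exam = a + b × ca by least squares.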

Finally, we can save all those commands:

> savehistory("20090508-3.txt")

# which we could load again at a later time with

> loadhistory("20090508-3.txt")

# but in any case, when you use q() to quit, R will offer you the

# option of saving and these saved commands will be loaded the

# next time you run R.

> q()

Save workspace image? [y/n/c]: y

That’s enough for an introduction.


Chapter 3

Averages

3.1 Introduction

This chapter gives a brief introduction to “averages”, or what statisticians call measures of central tendency.

These are often, but not always, useful in summarising a set of data, especially when we wish to

compare the data set with another.

There are some pitfalls in using the common-or-garden average and we will note some of these.

3.2 Arithmetic Mean

The most familiar average value is the arithmetic mean, i.e. sum the values and divide by the number of data. Just to get used to some mathematical notation, see A.2, we’ll write this as you’ll see it in textbooks (the data are x_i, i = 1, . . . , n):

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \tag{3.1} \]

R-Example 1.

As before, we’ll read the data and print them. This time they are already sorted, so much easier

to read, even in list form.

> ex2 = read.table("exam2.txt", header = T)
> attach(ex2)
> exam2

[1] 43 43 43 44 46 48 48 50 51 53 53 53 55 56 56 57 57 58 58 59 59 59 60 60 60

[26] 61 62 62 64 69

We can compute the mean by summing and dividing, see below, but not unexpectedly, R has a

function mean that does it for us.

> sum(exam2)
[1] 1647
> length(exam2)
[1] 30
> sum(exam2)/length(exam2)
[1] 54.9
> mean(exam2)
[1] 54.9

In spite of its simplicity, it is possible to compute the arithmetic mean wrongly.

R-Example 2. The following are a set of homework marks, marked out of 10. We read the data in and print them. Then we produce a summarising table, marks versus frequency, which tells us that we have three students with four (4) marks, three with five, six with six, etc.

> df.homew <- read.table("hw.txt", header = T)
> attach(df.homew)
> hw
 [1] 6 8 5 7 6 5 6 4 6 5 8 4 8 8 7 6 6 7 4 7
> table(hw)
hw
4 5 6 7 8    # marks
3 3 6 4 4    # frequencies

If we were not using a computer, we might think that we have a quick way to compute the mean: we have just five distinct marks, namely 4, 5, 6, 7, 8, so we’ll take the average of those; 4 + 5 + 6 + 7 + 8 = 30, so mean = 30/5 = 6. But R thinks differently:

> mean(hw)

[1] 6.15

The method we used works only if the frequencies are the same for each mark; it would be a rare

fluke if this were the case.

But we’ll pursue the matter further, because (a) computing an arithmetic mean using a frequency

table — done properly — can be a (correct) shortcut if you have a lot of numbers and just a

calculator or pencil and paper; (b) using frequencies prepares the ground for topics covered in later

chapters.

3.2.1 Arithmetic Mean using Frequencies

We’ll rewrite the table, now calling the data (marks) x , we’ll label them with i so that we have

xi , i = 1 . . . n, and n = 5.

> table(hw)
hw
i=   1 2 3 4 5
----------------
xi   4 5 6 7 8    # marks
fi   3 3 6 4 4    # frequencies

If we want to use the frequency table, we have to replace eqn. 3.1 with

\[ \bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}. \tag{3.2} \]

Applying eqn. 3.2 to our frequency table above gives

(3×4+3×5+6×6+4×7+4×8)/(3+3+6+4+4) = (12+15+36+28+32)/20 = 123/20 = 6.15.
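The frequency calculation can be sketched in R as follows. The vectors x and f are typed in by hand from the frequency table above (the names are ours, chosen for illustration), and weighted.mean is the base R function that computes the same quantity.

```r
# Distinct marks and their frequencies, typed in from the table above.
x <- c(4, 5, 6, 7, 8)
f <- c(3, 3, 6, 4, 4)

# Eqn. 3.2: sum of f_i * x_i divided by the sum of the f_i.
mean_freq <- sum(f * x) / sum(f)

# The same thing using the built-in weighted mean, with frequencies as weights.
mean_wtd <- weighted.mean(x, w = f)

# The tempting but wrong shortcut: averaging the distinct marks only.
mean_naive <- mean(x)

print(c(mean_freq, mean_wtd, mean_naive))
```

Expanding the table back into the raw data with rep(x, times = f) and taking mean of that gives the frequency-weighted value again, which is a handy check.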

If we look at the sum divided by number calculation in R, we see that the frequency calculation

ends up with not only the same result, but the same division,

> length(hw)
[1] 20
> sum(hw)
[1] 123
> sum(hw)/length(hw)
[1] 6.15

If you look at the sum of fi × xi you will see that it is the same as

4 + 4 + 4 + 5 + . . .+ 8 + 8 + 8 + 8;

the sorted hw marks are below:

> sort(hw)
 [1] 4 4 4 5 5 5 6 6 6 6 6 6 7 7 7 7 8 8 8 8

And the sum of the frequencies is 20, i.e. the number of data.

3.3 Median

Sometimes neither the mean nor the mode gives us what we would expect from a central value. Look at the following speed data (speed of cars at a speed check). Here the mean, 37.1, is well off the centre, and that offset is caused by an outlier, the 75. The offset would be a lot worse if the

outlier was 1000 — not likely in the case of speeds, but outliers of this magnitude are possible

in the case of some measurement systems. A common example is a mineralisation survey taken

across an area of land. For the sake of argument, assume that we are looking for zinc. A sample

that coincides with the dumping of an old bucket will produce a huge outlier. Now if we want to

produce contour plots based on smoothed values (averages over regions), then mean smoothing

will show a (false) hot-spot, while median smoothing will not.

> sp = read.table("cars.txt", header = T)
> attach(sp)
> speed
[1] 25 31 33 31 30 35 75
> mean(speed)

[1] 37.14286

The median gives a truer central value. If we sort the speeds, we see that the central value (the fourth) is 31; the function median gives the same result.

> sort(speed)
[1] 25 30 31 31 33 35 75
> median(speed)
[1] 31
> sort(speed)[4]
[1] 31

In the example above there are seven values, so the central one is the fourth; if we had an even

number of values, we would take the average of the two central values.
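A minimal sketch of the even-numbered case: we take the seven speeds from above and add a hypothetical eighth reading of 40 (invented for illustration), so that there is no single central value.

```r
# The seven speeds from the text plus one hypothetical extra reading (40).
speed_even <- c(25, 31, 33, 31, 30, 35, 75, 40)

s <- sort(speed_even)   # 25 30 31 31 33 35 40 75
n <- length(s)          # 8, an even number

# Average of the two central values (the 4th and 5th of the sorted data).
median_by_hand <- (s[n / 2] + s[n / 2 + 1]) / 2

# The built-in function does the same.
median_builtin <- median(speed_even)

print(c(median_by_hand, median_builtin))
```

Note that the outlier, 75, has no more influence here than it did with seven values.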

It can be said that the median is a measure of central tendency that is robust against outliers.

3.4 Mode

Sometimes the mean does not give us what we would expect from a central value; for example,

in the homework example, the mean (6.15) gives us a value that appears nowhere in the original

data; that’s normally not a big deal, but it suggests the mode as a possible “average value”.

The mode is the most frequent value; it can be read from a frequency table or from a histogram, Figure 3.1. Here the mode is 6, which occurs six times.

> table(hw)
hw
xi 4 5 6 7 8    # marks
fi 3 3 6 4 4    # frequencies
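Base R has no function that returns the statistical mode directly (the function called mode reports the storage type of an object), so here is one possible sketch that picks the most frequent value out of the frequency table. The hw vector is retyped from the listing earlier so that the example is self-contained.

```r
# Homework marks, retyped from the listing earlier in this chapter.
hw <- c(6, 8, 5, 7, 6, 5, 6, 4, 6, 5, 8, 4, 8, 8, 7, 6, 6, 7, 4, 7)

# Frequency table: names are the marks, values are the counts.
freq <- table(hw)

# Keep every mark whose count equals the largest count;
# this returns more than one value if there is a tie.
mode_hw <- as.numeric(names(freq)[freq == max(freq)])

print(mode_hw)
```

Returning all tied values, rather than just the first, is a design choice that becomes useful when we meet multimodal data.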

Figure 3.1: Histogram of hw (x-axis: hw, y-axis: Frequency).

Multimodal Data. Now that we’ve mentioned the mode, we’d better take the opportunity of warning about multi-modal data.

File hw2.txt contains data which has two peaks in its histogram, Figure 3.2.

> df.homew2 <- read.table("hw2.txt", header = T)
> attach(df.homew2)
> sort(hw2)
 [1] 3 4 4 4 4 5 7 8 8 8 8 8 9
> hist(hw2)
> mean(hw2)

[1] 6.153846

Figure 3.2: Histogram of hw2, multimodal (x-axis: hw2, y-axis: Frequency).

We can calculate the mean, but does it convey much about the centre of the data? No, and using the mean as such may be quite misleading. For example, an average of 6.15 may suggest that the homework was, on average, completed satisfactorily; in fact, we had two sets of results, one good and one poor, and the average of 6.15 adequately represents neither.

Multimodality is pretty obvious in that small and one-dimensional data set. In much larger data

sets and especially in multidimensional data, multimodality may be difficult to detect.

Much later, Chapter 19, we’ll look at methods for separating multimodal data into different classes

or clusters.

3.5 Other Means

Read up in (Crawley 2005) on: geometric mean and harmonic mean.


Chapter 4

Measures of Data Variability

4.1 Introduction

This chapter introduces methods of describing data variability, most notably variance and standard

deviation.

4.2 Variance and Standard Deviation

We are now going to work through an example based on two examination results, exam3 and

exam4, see below.

> df.exam3 = read.table("exam3.txt", header = T)
> attach(df.exam3)
> df.exam4 = read.table("exam4.txt", header = T)
> attach(df.exam4)
> exam3
 [1] 68 70 71 72 72 73 73 73 74 75 75 75 75 75 76 76 76 76 76 77 77 78 78 80 82
> exam4
 [1] 43 43 43 44 46 48 48 50 51 53 53 53 55 56 56 57 57 58 58 59 59 59 60 60 60
[26] 61 62 62 64 69 73

We are going to assume that these examinations are from two optional modules that final year

BSc Honours students can take, that is students take one or other of these modules and not both.

Final Honours classifications depend on these results; but we can see already that the students

who took exam3 are at an advantage; except for one, they all achieved first class honours in that

examination. If we assume that the exam3 students are as capable as the exam4 students, then can we correct the imbalance? Before you start to be incredulous, this technique was practised at a well-known university where I worked.

First of all let us look at the histograms, Figure 4.1 and the box-plots, Figure 4.2.

> hist(exam3)
> hist(exam4)

Figure 4.1: Histograms of exam3 and exam4.

> boxplot(exam3)
> boxplot(exam4)

Figure 4.2: Boxplots of exam3 and exam4.

The means confirm the difference.

> mean(exam3)
[1] 74.92
> mean(exam4)
[1] 55.48387
> diff <- mean(exam3) - mean(exam4)
> diff
[1] 19.43613

4.2.1 Equalising the means

Can we shift one of the means so that the two data sets have the same mean?

> diff
[1] 19.43613
> exam4new <- round(exam4 + diff)
> exam4new
 [1] 62 62 62 63 65 67 67 69 70 72 72 72 74 75 75 76 76 77 77 78 78 78 79 79 79
[26] 80 81 81 83 88 92
> fpdfsmall()    # not a base R function; presumably the author's own plot set-up helper
> hist(exam4new)

Figure 4.3: Histograms of exam3 and exam4 shifted by 19.

4.2.2 Variability and spread

That is a bit better, but there remains a greater spread in exam4new (mean shifted). Can we quantify spread? The function range gives us the minimum and maximum, but we would like a single number.

> range(exam3)
[1] 68 82
> range(exam4new)
[1] 62 92

From our experience with the mean, maybe we can take the mean (expected value) of deviations

from the means,

> mean(exam3 - mean(exam3))
[1] -1.705372e-15 # effectively zero
> mean(exam4new - mean(exam4new))
[1] -4.586385e-16

Not much good; from the definition of the mean we should have known in advance that these means (or sums) of deviations would be zero, because the negative deviations cancel the positive. Squaring the deviations before averaging removes the cancellation:

> mean((exam4new - mean(exam4new))^2)
[1] 53.6691
> mean((exam3 - mean(exam3))^2)
[1] 9.0336

We can achieve the same using sum and length,

> sum((exam3 - mean(exam3))^2)/length(exam3)
[1] 9.0336

4.2.3 Variance and Standard Deviation

The variance is the expected value of the squared deviations from the mean, see eqn. 4.1; the built-in R function is var.

\[ \mathrm{Var}(X) = E[(X - \mu)^2] = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2. \tag{4.1} \]

> var(exam3)
[1] 9.41
> var(exam4new)
[1] 55.45806

Immediately, we see that it is not an illusion that the variability of exam4new is much greater than

that of exam3. Note that the variance as calculated by var is slightly different from that calculated

using mean — we’ll return to that below.

The variance values, since they are sums of squares, give us a measure of squared variability; that

can be hard to interpret and use; what we want is the square-root of the variance, or the standard

deviation (sd in R), see eqn. 4.2,

\[ \sigma_X = \mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)}. \tag{4.2} \]

> sqrt(var(exam4new))
[1] 7.447017
> sqrt(var(exam3))
[1] 3.067572
> sd(exam4new)
[1] 7.447017
> sd(exam3)
[1] 3.067572

Variance different from mean of squared deviations? We return to the problem of the variance being different from the mean of squared deviations. The clue is given below,

> sum((exam3 - mean(exam3))^2)/length(exam3)
[1] 9.0336
> sum((exam3 - mean(exam3))^2)/(length(exam3) - 1)
[1] 9.41

In fact, rather than eqn. 4.1, this particular implementation of var computes what is called the sample variance using eqn. 4.3 (with µ replaced by the sample mean),

\[ \mathrm{Var}(X) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \mu)^2. \tag{4.3} \]

This gives an unbiased estimate of the variance.

4.3 Standard Scores and Normalising Marks

We now return to our desire to manipulate (fairly) the two data sets, exam3, exam4, such that

students in each class have roughly the same opportunity; see section 4.2.1 where we equalised

the means, but where we noted that the difference in variability remained a problem.


4.3.1 Standard Scores

The normal way to equalise data sets like these (the proper term is either standardise or normalise)

is to use the standard score as in,

\[ X_{ss} = \frac{X - \mu}{\sigma}. \tag{4.4} \]

Eqn. 4.4 gives a set of scores with mean zero and standard deviation one, µss = 0, σss = 1. Thus,

if we apply eqn. 4.4 to the two sets of marks, using the mean and standard-deviations of each, we

get two sets of marks with the same mean (0) and the same spread (standard-deviation 1).
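As a quick check of that claim, here is a small sketch that applies eqn. 4.4 to the exam3 marks (retyped from the listing above so that the example stands alone) and confirms that the result has mean zero and standard deviation one. The variable names z and z_scale are ours; scale is the base R function that centres and rescales in one step.

```r
# exam3 marks, retyped from the listing in section 4.2.
exam3 <- c(68, 70, 71, 72, 72, 73, 73, 73, 74, 75, 75, 75, 75, 75,
           76, 76, 76, 76, 76, 77, 77, 78, 78, 80, 82)

# Eqn. 4.4 by hand: subtract the mean, divide by the standard deviation.
z <- (exam3 - mean(exam3)) / sd(exam3)

# The same using the built-in scale(), which returns a one-column matrix;
# as.numeric() turns it back into a plain vector.
z_scale <- as.numeric(scale(exam3))

print(c(mean = mean(z), sd = sd(z)))
```

The printed mean will not be exactly zero but a very small number of the order of 1e-16, for the same floating point reasons that we saw with the mean of deviations.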

That is fine for purely comparison purposes, but what if we need marks to publish? What we are going to do is: (i) use eqn. 4.4 to standardise the scores; then (ii) multiply by whatever (new) standard-deviation, call it σ_new, that we require; finally, (iii) add the (new) mean, µ_new, that we require. The whole operation is given in eqn. 4.5,

\[ X_{new} = \frac{X_{old} - \mu_{old}}{\sigma_{old}} \times \sigma_{new} + \mu_{new}. \tag{4.5} \]

We’ll now apply this to exam4, i.e. we want to make exam4 as close as possible to exam3 (in terms

of mean and standard deviation).

> sd3 <- sd(exam3)
> sd3
[1] 3.067572
> m3 <- mean(exam3)
> sd4 <- sd(exam4)
> sd4
[1] 7.447017
> m4 <- mean(exam4)
> m4
[1] 55.48387
> m3
[1] 74.92
> exam4new = round(((exam4 - m4)/sd4)*sd3 + m3)
> exam4new
 [1] 70 70 70 70 71 72 72 73 73 74 74 74 75 75 75 76 76 76 76 76 76 76 77 77 77
[26] 77 78 78 78 80 82
> mean(exam3)
[1] 74.92
> mean(exam4new)
[1] 74.96774 # difference due to rounding
> sd(exam3)
[1] 3.067572
> sd(exam4new)
[1] 2.99426 # difference due to rounding

And let us compare the histograms in Figure 4.4.

Figure 4.4: Histograms of exam3 and exam4new (exam4 equalised with exam3).

Chapter 5

Probability and Random Variables

5.1 Introduction

This chapter gathers together some basic definitions, symbols and terminology to do with probability, random variables, and random processes; the topics are chosen according to their applicability to basic statistics for bio-scientists, as well as pattern recognition, image processing and data compression. We will use some of the notation from Appendix A; you should have a quick look at that first. We emphasise that such notation is merely shorthand for common sense concepts which would otherwise be confusing and long-winded if written in English.

5.2 Basic Probability and Random Variables

5.2.1 Introduction

Let there be a set of outcomes to an experiment, {ω_1, ω_2, . . . , ω_n} = Ω, where, to each ω_i, we associate a probability p_i. The definition of probability includes the following constraints:

\[ 0 \le p_i \le 1, \tag{5.1} \]

\[ \sum_{i=1}^{n} p_i = 1. \tag{5.2} \]

The above simple definition of probability over outcomes is satisfactory for simple applications, but

for many applications we need to extend it to apply to subsets of Ω.

We could call the outcomes above elementary events, i.e. indivisible events, and we could call the

subsets below composite, i.e. they are a composition of one or more outcomes.

Ω is often called the sample space, i.e. as defined above, the set of all possible outcomes of the

experiment. Elements of Ω are called outcomes, sample outcomes, or realisations. One of the

problems of learning probability and statistics is the confusion caused by the multiplicity of terms

for the same concept. In addition, different fields of study, e.g. bio-science, engineering, social

science, . . . have their own terminology.

Example 1 Six sided dice. Ω = {i | i ∈ {1, . . . 6}} = {1, 2, . . . 6}.

Example 2 Toss two six sided dice. Ω = {(i, j) | i, j ∈ {1, . . . 6}} = {(1, 1), (1, 2), . . . (1, 6), (2, 1), . . . (6, 6)}.

Example 3 Two sided coin. Ω = {H, T}. Outcomes need not be numbers.

5.2.2 Probability and Events

Let there be subsets of Ω called events, with a general event a_i; the set of all the a_i is A. We define a probability measure P on A; P(a) is a number and satisfies the following axioms:

\[ P(a) \ge 0, \tag{5.3} \]

\[ P(\Omega) = 1 \quad \text{(certain event, something happens)}. \tag{5.4} \]

If a_1, a_2, . . . are disjoint, i.e. a_i ∩ a_j = ∅, ∀ i, j, i ≠ j, then

\[ P\left( \bigcup_{i=1}^{\infty} a_i \right) = \sum_{i=1}^{\infty} P(a_i). \tag{5.5} \]

Disjoint (subsets) is another term for mutually exclusive, i.e. they cannot possibly happen together. ∩ denotes set intersection, i.e. in eqn. 5.5 we are requiring that there is no overlap between any of the subsets, and ⋃ denotes union. Put simply, eqn. 5.5 says that probabilities add for events that do not overlap. ∅ denotes the empty set.

There is a fourth property, often listed with the axioms although it is a corollary of eqns. 5.4 and 5.5,

\[ P(\emptyset) = 0 \quad \text{(impossible event)}. \tag{5.6} \]

Example 4 Six sided dice. Ω = {1, 2, . . . 6}. Let a be the event score greater than three; i.e. a = {4, 5, 6}.

Example 5 Toss two six sided dice. Ω = {(i, j) | i, j ∈ {1, . . . 6}}. Let a be the event total score less than four. Then a = {(1, 1), (1, 2), (2, 1)}.
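Sample spaces and events like these can be listed by machine. The following is a sketch, assuming that all 36 outcomes of the two dice are equally likely, so that the probability of the event is the number of outcomes in it divided by the number of outcomes in Ω. The names omega, a and p_a are ours.

```r
# The sample space for two dice: all 36 (i, j) pairs.
omega <- expand.grid(i = 1:6, j = 1:6)

# The event "total score less than four": keep the rows that satisfy it.
a <- omega[omega$i + omega$j < 4, ]
print(a)

# Assuming equally likely outcomes, P(a) = (size of a) / (size of omega).
p_a <- nrow(a) / nrow(omega)
print(p_a)
```

The three rows printed are (1, 1), (2, 1) and (1, 2), in the order that expand.grid happens to generate them.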

Partition. When a_1 ∪ a_2 ∪ . . . ∪ a_n = Ω and a_1, a_2, . . . a_n are disjoint, we say that a_1, a_2, . . . a_n form a partition of Ω.

5.2.3 A Point on Terminology

Above we have P (ai) for probability that the outcome is in set ai . “The outcome is in set ai” is

what is called a proposition. A proposition is a sentence which may be true or false — but only

one or the other and not in between.

We should note that in most textbooks and later in these notes the arguments of probability

functions, P (.) will be propositions, e.g. P (A) means the probability that A will occur, or that A

will be true.

Then, when we write P (AB) or P (A,B) (they mean the same), we mean probability of A and B

being both true; logical and.

Not or set complement. We may want to talk about the probability that A will be false, i.e. the probability that the outcome will be in the complement set to A, i.e. any of the outcomes in Ω that are not in A’s set. Not A is denoted Ā.

We now can write a further rule, which follows from eqns. 5.4 and 5.5,

\[ P(\bar{A}) = 1 - P(A). \tag{5.7} \]

Example 6 Six sided dice. Ω = {1, 2, . . . 6}. Let A = {1, 2, 3, 4}, so Ā = {5, 6}.

\[ P(\bar{A}) = 1 - P(A) = 1 - \frac{4}{6} = \frac{2}{6} = \frac{1}{3}. \]

5.2.4 Probability of Non-disjoint Events

We saw in eqn. 5.5 that to compute the probability that one or other of two disjoint events occurs you can add their probabilities. For events A and B that are not necessarily disjoint (there may be overlap), we can write

\[ P(A \cup B) = P(A) + P(B) - P(AB). \tag{5.8} \]

Example 7 Six sided dice. Ω = {1, 2, . . . 6}. Let A = {1, 2, 3, 4} and B = {4, 5}; so A ∪ B = {1, 2, 3, 4, 5} and A ∩ B = {4}.

\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) = \frac{4}{6} + \frac{2}{6} - \frac{1}{6} = \frac{5}{6}, \]

and we can see that, computed directly, P(A ∪ B) = P({1, 2, 3, 4, 5}) = 5/6.

We note that eqn. 5.8 collapses to eqn. 5.5 when AB is false (no overlap, the two cannot be true together), because of eqn. 5.6, i.e. P(∅) = 0, and

\[ P(A \cup B) = P(A) + P(B) - P(\emptyset) = P(A) + P(B) - 0 = P(A) + P(B). \]
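Eqn. 5.8 can be checked numerically with R's set functions union and intersect. This is a sketch for the dice of Example 7, assuming equally likely outcomes; the helper function p is our own and simply counts outcomes.

```r
omega <- 1:6            # sample space for one dice
A <- c(1, 2, 3, 4)
B <- c(4, 5)

# Hypothetical helper: probability of an event as a proportion of omega,
# valid only because the six outcomes are equally likely.
p <- function(event) length(event) / length(omega)

lhs <- p(union(A, B))                        # P(A or B) computed directly
rhs <- p(A) + p(B) - p(intersect(A, B))      # right-hand side of eqn. 5.8

print(c(lhs, rhs))
```

Without the subtraction of p(intersect(A, B)), the outcome 4 would be counted twice, once in A and once in B.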


5.2.5 Finite Sample Spaces

In Example 1 we could identify and list all possible outcomes and we have a finite sample space.

On the other hand, if the outcome was a weight, for example of a precipitate, then we could not

list all possible weights and we would have an infinite sample space.

5.3 Random Variables

If, to every outcome, ω, of an experiment, we assign a number, X(ω), X is called a random variable (r.v.). X is a function over the set Ω = {ω_1, ω_2, . . .} of outcomes; if the range of X is the real numbers or some interval of them, X is a continuous r.v.; if the range of X is some integer set, then X is a discrete r.v. Chapter 6 contains an extensive discussion on random variables and an introduction to probability distributions.

5.4 Computing probabilities

We have already done this in examples, but we need to formalise a bit. The number of elements

in a (finite) set, say a, is called its cardinality and written |a|.

Example 8 Six sided dice. Ω = {1, 2, . . . 6}, |Ω| = 6. Let a = {4, 5, 6}, |a| = 3.

If the outcomes are equally likely (which 1, 2, . . . 6 are), then we can compute the probability of an event a as the ratio:

\[ P(a) = \frac{|a|}{|\Omega|}. \tag{5.9} \]

Example 9 Six sided dice. Ω = {1, 2, . . . 6}, |Ω| = 6. Let a = {4, 5, 6}, |a| = 3, so

\[ P(a) = \frac{|a|}{|\Omega|} = \frac{3}{6} = \frac{1}{2}. \]

5.5 Enumerating more complex events and sample spaces

We see above P(a) = |a|/|Ω|. But |a| or |Ω| may not be simple to enumerate or count.

5.5.1 Multiplication of outcomes

Let an event correspond to the combined outcomes of two experiments performed in sequence.

Let the first have n1 outcomes and the second n2 outcomes.

Any of the n1 outcomes of the first may be followed by any of the n2 outcomes of the second, so

the number of outcomes in the combined experiment is n1 × n2.

Example 10 Toss two six sided dice in sequence (but the result is the same if we throw them together). n_1 = |Ω_1| = 6, n_2 = |Ω_2| = 6, so, for the combined experiment, |Ω| = n_1 × n_2 = 36, which we can also compute by counting the elements in Ω = {(i, j) | i, j ∈ {1, . . . 6}}.

5.5.2 Addition of outcomes

Suppose again that we have two experiments. Let the first have n1 outcomes and the second n2

outcomes. This time we perform the first experiment or the second, but not both and which of

them gets performed is chosen randomly; how many outcomes?

We have n1 outcomes of the first, or the n2 outcomes of the second, so the total number of

outcomes in the combined experiment is n1 + n2.

Example 11 Toss one six sided dice or toss a two sided coin. n_1 = |Ω_1| = 6, n_2 = |Ω_2| = 2, so, for the combined experiment, |Ω| = n_1 + n_2 = 8, which we can also compute by counting the elements in Ω = {1, 2, 3, 4, 5, 6, H, T}.

5.5.3 Permutations

Suppose we have n items and we wish to place them in a sequence — just any sequence, not

ordered according to size or any other attribute. How many ways to do this?

The first position may be filled by any of the n items; the second position may be filled by any of the

remaining n−1 items, and so on, so that the number of possible different sequences (orderings) is

n(n − 1)(n − 2) . . . 1 = n! (n-factorial). (5.10)

Suppose now we have n items and we wish to choose any r of them and place these in a sequence. How many ways to do this? The first position may be filled by any of the n items; the second position may be filled by any of the remaining n − 1 items, and so on until we have r in the sequence. The number of possible different sequences (orderings) is

\[ n(n-1)(n-2) \cdots (n-(r-1)) = n(n-1)(n-2) \cdots (n-r+1) = \frac{n!}{(n-r)!} = {}^{n}P_{r}. \tag{5.11} \]

nPr is the name for the number of permutations of r from n.

5.5.4 Combinations

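Suppose again we have n items and we wish to choose any r of them, but we do not need to place the r in a sequence. How many ways, nCr, to do this? We can appeal to eqns. 5.11 and 5.10. Each unordered choice of r items can be placed in sequence in r! ways (eqn. 5.10), so the number of permutations of r from n is

\[ \frac{n!}{(n-r)!} = {}^{n}C_{r} \times (\text{number of ways of permuting } r) = r!\, {}^{n}C_{r}, \]

which leads to

\[ {}^{n}C_{r} = \frac{n!}{r!\,(n-r)!} = \binom{n}{r}. \tag{5.12} \]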
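R provides factorial and choose, so eqns. 5.10 to 5.12 can be tried out directly. In this sketch the values n = 5 and r = 3 are hypothetical, chosen only to keep the numbers small; combn lists the combinations explicitly so that we can count them.

```r
n <- 5   # hypothetical number of items
r <- 3   # hypothetical number chosen

# Eqn. 5.11: number of permutations of r from n.
n_perm <- factorial(n) / factorial(n - r)

# Eqn. 5.12: number of combinations of r from n.
n_comb <- choose(n, r)

# Each combination can be ordered in r! ways, so nPr = r! * nCr.
stopifnot(n_perm == factorial(r) * n_comb)

# combn() lists every combination as a column; counting the columns
# should agree with choose().
n_listed <- ncol(combn(n, r))

print(c(permutations = n_perm, combinations = n_comb, listed = n_listed))
```

For large n, factorial overflows quickly, whereas choose continues to work, so choose is the safer route to combinations.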

5.6 Conditional Probability

Example 12 Ω = {1, 2, 3, 4, 5, 6}. I throw the dice. What is the probability of getting greater-than-three, P(> 3)? Let A be greater-than-three so that A = {4, 5, 6}, and the cardinality of this set is n_A = |A| = 3, and n_dice = |Ω| = 6, see section 5.4; there are three possibilities greater-than-3, so P(A) = P(> 3) = n_A/n_dice = 3/6 = 1/2.

Now, I have a peek and I tell you that we have an odd number; let us call this event B (odd). What now is the probability of A (> 3)? The probability surely has changed because the only possibilities now are B = odd = {1, 3, 5}. Within this set, 5 is the only (one) possibility that satisfies greater-than-three, so, forgetting about any ideas we had before, we say that the conditional probability of greater-than-three, given that we already know that an odd number has occurred, is 1/3, i.e. the probability has fallen from 1/2 to 1/3 based on the information that an odd has occurred.

We write this P(> 3 | odd), the conditional probability of a > 3 conditional on the fact that we already know that an odd number has occurred.

This is conditional probability; we computed the probability of A conditional on B, P(A|B).

5.6.1 Venn diagrams

Venn diagrams, see section A.1.4, can be used to think about conditional probabilities such as the one in Example 12. Here Ω = {1, 2, 3, 4, 5, 6} corresponds to the universal set (the set of all possibilities).

Once we have been told that the number is odd, we can reduce our sample space to set odd; then odd ∩ (> 3) = {5}.

Example 13 If after hearing first that we have an odd number, then secondly we are told that

greater-than-three has occurred, we are then asked (a) what is the probability of a six?, (b) what

is the probability of a five?

Think about it, once we have the two pieces of information: odd, then greater-than-three, the

possibilities are very greatly reduced. To what?

Figure 5.1: Dice: (a) universal set; (b) sets odd, even; (c) sets (> 3) and (<= 3) superimposed to show that, for example, odd & (> 3) = (set-odd) ∩ (set > 3) = {1, 3, 5} ∩ {4, 5, 6} = {5}.

5.6.2 Probability Trees

Probability trees, see (Griffiths 2009, p. 158), are another way to think graphically about conditional probability. In mathematics, trees can grow sideways or even upside down.

Figure 5.2 shows a probability tree for Example 12.

When we split into branches as in Figure 5.2, any branching must represent all possibilities; in this case we first have odd and even; if we call odd B, we have even = not-odd = B̄. In the diagram we have no bar symbol, so we use B′ = B̄. Next we have (> 3) and (<= 3).

Thus, at any branching the probabilities in the branches must sum to one.

The diagram shows how to compute joint probabilities using conditional probabilities and the

probability of the conditioning event, for example P (> 3 & odd) = P (> 3 | odd)× P (odd).

Figure 5.3 shows a general probability tree.

The following may help us to think about conditional probability and joint probability. Think of the

tree as having probability flowing in its branches.

We start off at the root with all the probability (one, 1); proportions of the probability flow into the first set of branches (the proportions sum to one). Follow one of those branches: at the next branching point, we split the remaining probability into proportions that again sum to one. It is just the proportions that sum to one; if there is, for example, 0.4 flowing into the branching point, and the proportions in a three-way branch are 0.4, 0.4, 0.2, then we will have probability flows of 0.16, 0.16, 0.08. And so on.

root
  odd has occurred (B), P(odd) = 1/2
    > 3:   P(>3 | odd) = 1/3;    P(>3 & odd) = P(>3 | odd) x P(odd) = 1/3 x 1/2 = 1/6
    <= 3:  P(<=3 | odd) = 2/3;   P(<=3 & odd) = P(<=3 | odd) x P(odd) = 2/3 x 1/2 = 2/6 = 1/3
  even has occurred (B'), P(even) = 1/2
    > 3:   P(>3 | even) = 2/3;   P(>3 & even) = P(>3 | even) x P(even) = 2/3 x 1/2 = 2/6 = 1/3  [P(4 or 6)]
    <= 3:  P(<=3 | even) = 1/3;  P(<=3 & even) = P(<=3 | even) x P(even) = 1/3 x 1/2 = 1/6  [P(2)]

Figure 5.2: Probability tree for the dice example. We start off on the left with the root and everything possible. Then we split into branches odd and even. Next we split odd into (> 3) and (<= 3); same for the even branch.

root
  B (we know B has occurred), P(B)
    A:   P(A | B);    A has occurred, i.e. A & B have occurred;  P(A & B) = P(AB) = P(A | B) x P(B)
    A':  P(A' | B);   not A has occurred, i.e. not A & B = A' & B;  P(A'B) = P(A' | B) x P(B)
  B' (B has not occurred, i.e. not B has occurred), P(B')
    A:   P(A | B');   P(AB') = P(A | B') x P(B')
    A':  P(A' | B');  P(A'B') = P(A' | B') x P(B')

Figure 5.3: Probability tree.

Symbolically, and referring to Figure 5.3: suppose we have proportion P(B) in a branch and that it then splits into proportions P(A|B) and P(Ā|B) (these relative proportions again sum to one, but their absolute probabilities sum to whatever flowed into the branching point). Then the P(A|B) branch must carry an absolute amount of probability equal to P(A|B) × P(B), and this is P(AB).

Formula for Conditional Probability. We now give the formula for computing conditional probabilities,

\[ P(A|B) = \frac{P(AB)}{P(B)}, \tag{5.13} \]

provided that P(B) > 0.

Alternatively, as in Figure 5.3,

\[ P(AB) = P(A|B)P(B). \tag{5.14} \]
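Eqn. 5.13 can be applied to the dice of Example 12 in a few lines of R. This is a sketch that assumes equally likely outcomes; the helper function p is our own and simply counts outcomes.

```r
omega <- 1:6
A <- omega[omega > 3]          # greater-than-three: 4 5 6
B <- omega[omega %% 2 == 1]    # odd: 1 3 5

# Hypothetical helper: probability as a proportion of omega,
# valid only because the outcomes are equally likely.
p <- function(event) length(event) / length(omega)

# Eqn. 5.13: P(A|B) = P(AB) / P(B), with AB the intersection of A and B.
p_AB <- p(intersect(A, B))
p_A_given_B <- p_AB / p(B)

print(c(P_A = p(A), P_AB = p_AB, P_A_given_B = p_A_given_B))
```

The result agrees with the counting argument in Example 12: of the three odd outcomes, only the 5 is greater than three.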

5.6.3 Joint Probability

P(AB) is the joint probability of A and B happening together.

Sometimes we write P(AB), sometimes P(A&B), sometimes P(A and B), and sometimes, using set notation, P(A ∩ B).

5.7 Bayes’ Rule

If we reverse the conditionality in eqn. 5.13, noting that P(AB) = P(BA), we have

\[ P(B|A) = \frac{P(AB)}{P(A)}, \tag{5.15} \]

leading to

\[ P(A)P(B|A) = P(AB), \tag{5.16} \]

and eqn. 5.13 gives us

\[ P(B)P(A|B) = P(AB), \tag{5.17} \]

so that

\[ P(A)P(B|A) = P(B)P(A|B), \tag{5.18} \]

leading to Bayes’ rule:

\[ P(A|B) = \frac{P(A)P(B|A)}{P(B)}. \tag{5.19} \]

Eqn. 5.19 allows us to invert or reverse the conditionality.

Example 14 Let A be has disease-X; let B be has swollen ankles. From a sample of former disease-X patients, we can estimate P(B|A); say it is P(B|A) = 0.3. Let us assume that we also know the proportion of the general population that have swollen ankles, P(B) = 0.01. Also we assume that we have the incidence of disease-X in the general population, P(A) = 0.005.

Eqn. 5.19 allows us to compute the probability that the patient has disease-X given that the swollen ankles symptom (B) is present, P(A|B). Of course, in general, P(A|B) ≠ P(B|A).

\[ P(A|B) = P(A)P(B|A)/P(B) = 0.005 \times 0.3 / 0.01 = 0.15. \tag{5.20} \]

Bayes’ rule may be written in a more general manner. First we need a result called the law of total probability.

Let A_1, A_2, . . . , A_n be a partition of Ω (see section 5.2.2 for a definition of partition), then

\[ P(B) = \sum_{i=1}^{n} P(B|A_i)P(A_i). \tag{5.21} \]

We write the more general form of Bayes’ rule as

\[ P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{n} P(B|A_j)P(A_j)}. \tag{5.22} \]

Let us return to Example 14 and apply eqn. 5.22. When we said proportion of the general population that have swollen ankles, P(B) = 0.01, we strictly meant the probability for people with disease-X together with those without disease-X. We can restate the problem with A_1 = has disease-X and A_2 = has not disease-X, so that they form a partition of the general population.

Assume that we now have P(B|A_2) = 0.01 (i.e. we are changing the story slightly to associate this probability with people who do not have disease-X) and, as before, P(B|A_1) = 0.3; we need also P(A_1) = 0.005, as before. What is P(A_2)? It is P(Ā_1) (the probability that a general person does not have disease-X) and this is 1 − P(A_1) = 0.995.

Eqn. 5.21 now gives a revised figure for P(B),

\[ P(B) = \sum_{i=1}^{n} P(B|A_i)P(A_i) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) = 0.3 \times 0.005 + 0.01 \times 0.995 = 0.01145, \]

and we can rework eqn. 5.20 (or use eqn. 5.22),

\[ P(A_1|B) = P(A_1)P(B|A_1)/P(B) = 0.005 \times 0.3 / 0.01145 = 0.131. \]
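The two-step calculation, total probability followed by Bayes' rule, is easy to sketch in R using named vectors. The numbers are those of the example; the names prior, lik and posterior are our own labels for P(A_i), P(B|A_i) and P(A_i|B).

```r
# P(A_i): incidence of disease-X and its complement (a partition).
prior <- c(disease = 0.005, no_disease = 0.995)

# P(B | A_i): probability of swollen ankles within each group.
lik <- c(disease = 0.3, no_disease = 0.01)

# Eqn. 5.21, law of total probability: P(B).
p_B <- sum(lik * prior)

# Eqn. 5.22, Bayes' rule for every A_i at once.
posterior <- lik * prior / p_B

print(p_B)
print(round(posterior, 3))
```

Because the A_i form a partition, the posterior probabilities sum to one, which is a useful sanity check on any calculation of this kind.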


5.8 Independent Events

We have already discussed disjoint events, i.e. events which cannot occur simultaneously; thus, for disjoint events A and B, A ∩ B = ∅. Consequently, we can state that P(A|B) = 0 (if B has occurred, A cannot).

At the opposite extreme, let A ⊂ B, i.e. A is a subset of B and if A has occurred, then so must B, with certainty, so in this case P(B|A) = 1.

Example 15 Ω = {1, 2, 3, 4, 5, 6}. Let B = {2, 4, 6} (even number) and A = {6}. If we know that a 6 has been thrown (A has occurred), what is P(B|A)? The answer is 1; we know that 6 is even, so B is a sure thing, in punter parlance :-).

But there are cases where A and B are totally unrelated; they are independent events.

Example 16 Throw a dice (1) and toss a coin (2). Ω_1 = {1, 2, 3, 4, 5, 6}, Ω_2 = {H, T} and the combined sample space Ω = {(1, H), (1, T), (2, H), . . . , (6, H), (6, T)} and |Ω| = 12. Let A = {4, 6} (on the dice) and B = {H}, so that AB = {(4, H), (6, H)} (two out of 12 equally likely outcomes), so P(AB) = 1/6. Also P(A) = 1/3, P(B) = 1/2.

From eqn. 5.15 we have

\[ P(B|A) = \frac{P(AB)}{P(A)} = \frac{1}{6} \Big/ \frac{1}{3} = \frac{1}{2}. \]

Because the result of the dice throw is unrelated to the result of the coin toss we are not surprised to find that

\[ P(B|A) = P(B) = \frac{1}{2}. \]

This leads us to a more general definition of independent events,

\[ P(B|A) = P(B) = \frac{P(AB)}{P(A)}, \]

so that A and B are independent events if and only if

\[ P(AB) = P(A)P(B). \tag{5.23} \]
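We can check eqn. 5.23 for the dice and coin of Example 16 by listing the combined sample space. This sketch assumes that the 12 combined outcomes are equally likely, so that the mean of a TRUE/FALSE indicator over the rows is the probability of the event; the variable names are ours.

```r
# Combined sample space: every (dice, coin) pair, 12 rows in all.
omega <- expand.grid(die = 1:6, coin = c("H", "T"), stringsAsFactors = FALSE)

# Indicators for the events A = dice shows 4 or 6, B = coin shows H.
in_A <- omega$die %in% c(4, 6)
in_B <- omega$coin == "H"

# With equally likely rows, the proportion of TRUEs is the probability.
p_A  <- mean(in_A)
p_B  <- mean(in_B)
p_AB <- mean(in_A & in_B)

print(c(P_A = p_A, P_B = p_B, P_AB = p_AB, product = p_A * p_B))
```

Repeating the check with two events on the same dice, say in_A and omega$die > 3, would show a joint probability that differs from the product, i.e. events that are not independent.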


5.9 Betting and Odds

In circumstances where the terms have meaning, the probability of A can be computed as the ratio of the number of equal probability events favourable to A, $n_A$, versus the total number of equal probability events, $n_T$,

$$P(A) = n_A/n_T. \tag{5.24}$$

Odds, on the other hand, are computed as the ratio of the number of equal probability events favourable to A, $n_A$, versus the number of equal probability events unfavourable to A, $n_{\bar{A}}$,

$$O(A) = n_A/n_{\bar{A}}. \tag{5.25}$$

Thus, the probability of a 1 on the throw of a dice is 1/6, whilst the odds are 1/5; bookmakers express this as five-to-one against.

The probability for any number less than five (1–4) would be 4/6, whilst the odds are 4/2 = 2/1; bookmakers express this as two-to-one on.

You can calculate probability from odds using

$$P(A) = \frac{O(A)}{1 + O(A)}. \tag{5.26}$$

Thus, for any number less than five (1–4) on a dice throw,

$$P(A) = \frac{O(A)}{1 + O(A)} = \frac{2/1}{1 + 2/1} = \frac{2}{3}.$$

You can calculate odds from probability using

$$O(A) = \frac{P(A)}{1 - P(A)}, \tag{5.27}$$

that is, the ratio of probability-for (favourable) to probability-against (unfavourable).

Thus, for one on a dice throw,

$$O(A) = \frac{1/6}{1 - 1/6} = \frac{1}{5}.$$
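The two conversions are easily wrapped as R functions. The function names below are hypothetical helpers introduced only for this sketch.

```r
# Hypothetical helper names; they implement eqns. 5.26 and 5.27 directly.
odds_from_prob <- function(p) p / (1 - p)   # eqn. 5.27
prob_from_odds <- function(o) o / (1 + o)   # eqn. 5.26

odds_from_prob(1/6)   # a 1 on a dice: odds 1/5, five-to-one against
prob_from_odds(2/1)   # odds 2/1 (two-to-one on): probability 2/3
```

The two functions are inverses of each other, so prob_from_odds(odds_from_prob(p)) returns p for any 0 ≤ p < 1.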


Bookmakers' odds and probabilities Bookmakers' “probabilities” do not add to 1, unlike proper probabilities, which add to one over all possible events, see eqn 5.2.

Let’s say we have four horses, each with an equal probability of winning, P(Ai) = 1/4, for i = 1, 2, 3, 4. We would expect odds of

$$O(A) = \frac{1/4}{1 - 1/4} = \frac{1}{3},$$

or three-to-one against. But the bookmaker has to make a living, and not just provide a mutual

service for his punters. In this case, if four punters bet 10 Euro on each horse (bookie gets 40

Euro), one punter gets paid 30 Euro plus his stake returned = 40 Euro, and the bookie makes

nothing for his work.

The bookie is likely to give odds of something like two-to-one against, O′(A) = 1/2, and, computing probabilities, we find

$$P'(A) = \frac{O'(A)}{1 + O'(A)} = \frac{1/2}{1 + 1/2} = \frac{1}{3},$$

and the sum of “probabilities” is 4/3.

In this amended case, if four punters bet 10 Euro on each horse (bookie gets 40 Euro), one punter

gets paid 20 Euro plus his stake returned = 30 Euro, and the bookie makes 10 Euro.
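The four-horse example can be worked through in R as a rough sketch. The helper names (prob_from_odds, profit) and variable names are assumptions made for illustration; the stake and odds are those of the text.

```r
prob_from_odds <- function(o) o / (1 + o)   # eqn. 5.26 (hypothetical helper name)

n_horses <- 4
stake    <- 10                      # Euro bet on each horse

fair_odds   <- 1/3                  # three-to-one against
bookie_odds <- 1/2                  # two-to-one against

# Sum of implied "probabilities" over the four horses
sum(rep(prob_from_odds(fair_odds),   n_horses))   # 1
sum(rep(prob_from_odds(bookie_odds), n_horses))   # 4/3

# Bookie's profit: total stakes minus (winnings + returned stake) for the one winner.
# Odds o = for/against, so the winnings on a stake are stake / o.
profit <- function(o) n_horses * stake - (stake / o + stake)
profit(fair_odds)     # 0 Euro
profit(bookie_odds)   # 10 Euro
```

The amount by which the implied “probabilities” exceed 1 (here 1/3) reflects the bookmaker's margin.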

5.10 Classical versus Bayesian Interpretations of Probability

In many books and discussions you will see a distinction made between the classical and the

Bayesian interpretation of probability; also, in this context the term frequentist may be used as a

synonym for classical. As an interpretation of probability, the term Bayesian has little to do with

Bayes’ rule, section 5.7, that is until we get to statistical inference, Chapter 10.

Broadly speaking, Bayesians interpret probability as belief ; frequentists interpret probability as

relative frequency.

Bayesian (belief) interpretation Take the case of the tossed (fair) dice. If you were asked to rate, on a scale of [0, 1], your belief that 2 will be the outcome, you would, I hope, agree that the probability is 1/6; for an even number of dots: 3/6 = 1/2; and for any number 1–6, a sure thing, the probability is 1.

Here 0 corresponds to complete disbelief and 1 to complete belief.


Relative frequency interpretation The frequentist says that the probability of 2 is the relative

frequency with which 2 occurs in a large number of hypothetical throws.

Let us then run an experiment involving a large number (n = 600) of throws, and let $y_i$ be the count of each outcome $i$ obtained. We might expect to obtain something like $y_1 = 95, y_2 = 110, y_3 = 90, y_4 = 97, y_5 = 105, y_6 = 103$. We then use $\hat{p}(i) = y_i/n$; the hat indicates that $\hat{p}(i)$ is an approximation to $p(i)$; however, $\hat{p}(i) \to p(i)$ as $n \to \infty$.

We have $\hat{p}(i) = y_i/n$ = 95/600, 110/600, 90/600, 97/600, 105/600, 103/600 = 0.158, 0.183, 0.15, 0.162, 0.175, 0.172. The correct value is $p(i) = 1/6 = 0.1667$.

The errors above are not a real indictment of the frequentist method; a thought experiment allows us to reason that $p(i) = 1/6$.

On the other hand, when you want to bet on a football match and would like to estimate the

probability and hence the odds, it makes no sense to think of an infinity of matches.
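The relative-frequency experiment is easy to simulate in R. The sketch below uses an arbitrary seed, so the counts will differ from the illustrative figures in the text; the variable names are assumptions.

```r
set.seed(1)                                        # arbitrary seed, for repeatability only
n      <- 600
throws <- sample(1:6, size = n, replace = TRUE)    # n throws of a fair dice

y     <- table(factor(throws, levels = 1:6))       # counts y_1, ..., y_6
p_hat <- as.numeric(y) / n                         # relative frequencies

round(p_hat, 3)
max(abs(p_hat - 1/6))                              # worst-case error at this n
```

Increasing n (to 60000, say) should shrink the worst-case error, in keeping with the limit described above.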


Chapter 6

One Dimensional Random Variables

6.1 Introduction

We have already introduced the notion of a random variable in section 5.3, i.e. where we associate

a number with the outcome of an experiment governed by probability.

In most cases, your (scientific) data will already be numerical, but it nonetheless remains worthwhile

to be cognisant of the details of probability and sample space described in Chapter 5.

In some of the examples in Chapter 5, namely those involving the dice, the outcome already is a number, i.e. {1, . . . , 6}; in some considerations, this number is more a label than a number, but in any case, the association of a number with the outcome is made trivial. In the coin example we had {H, T}; in this case we could use the association H → 1, T → 0.

6.1.1 Definition: Random Variable

If, to every outcome, ω, of an experiment, we assign a number, X(ω), X is called a random variable

(r.v.). X is a function over the set Ω = {ω1, ω2, . . .} of outcomes; if the range of X is the real numbers or some subset of them, X is a continuous r.v.; if the range of X is some integer set, then X is a discrete r.v. The space of all possible values of X is called the range space of X, R_X.

In discussing random variables we label the r.v. with an upper case letter, e.g. X, but particular

values of it are labelled with lower case, e.g. x , or xi .

Example 17 Toss two coins. Ω = {TT, TH, HT, HH}. Let a r.v. X be defined as the number of heads in the outcome, i.e. TT → 0, TH → 1, HT → 1, HH → 2. Notice that two outcomes map to the same number (1); this is not a problem or a mistake. R_X = {0, 1, 2}.

6.1.2 Probability associated with a Random Variable

Suppose we have an event B with respect to a range space R_X. Let the event A with respect to Ω be defined as

$$A = \{\omega \in \Omega \mid X(\omega) \in B\}. \tag{6.1}$$

Then A and B are equivalent events and we can carry the definitions and equations of Chapter 5

over to random variables.

Example 18 Two coins as in Example 17. Examples of equivalent events are: A = {TT}, B = {0}; A = {TH, HT}, B = {1}; A = {HH}, B = {2}.

In the case of eqn. 6.1, we can say

P (B) = P (A). (6.2)

Example 19 Two coins as in Example 18. A = {TT}, P(A) = 1/4, B = {0}, P(B = 0) = 1/4; A = {TH, HT}, P(A) = 1/2, B = {1}, P(B = 1) = 1/2; A = {HH}, P(A) = 1/4, B = {2}, P(B = 2) = 1/4.

6.2 Probability Mass Function (pmf) of a Discrete r.v.

Let a r.v. X have a range space $R_X = \{x_1, x_2, \ldots, x_n\}$. We denote the probability of a particular value $X = x_i$ as $p_X(x_i) = P(X = x_i)$. The probabilities $p_X(x_i)$, $i = 1, 2, \ldots, n$, in keeping with eqns. 5.3 and 5.4, must satisfy

$$p_X(x_i) \ge 0, \quad i = 1, 2, \ldots, n, \tag{6.3}$$

$$\sum_{i=1}^{n} p_X(x_i) = 1. \tag{6.4}$$

$p_X$ is called the probability function or the probability mass function of the r.v. X. We’ll attempt to standardise on probability mass function and its abbreviation pmf. We use the shorthand $X \sim p_X$ to state that the r.v. X has a pmf $p_X$. Often, where there is no ambiguity, you will find the subscript X omitted: $p_X(x) \to p(x)$.
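To make the pmf requirements concrete, the sketch below builds the pmf of X = number of heads in two coin tosses (Example 17) by enumerating outcomes, then checks eqns. 6.3 and 6.4. The object names are illustrative.

```r
# Enumerate the outcomes of tossing two fair coins (Example 17).
omega <- expand.grid(c1 = c("T", "H"), c2 = c("T", "H"), stringsAsFactors = FALSE)
x     <- (omega$c1 == "H") + (omega$c2 == "H")   # X = number of heads per outcome

# Each of the 4 outcomes has probability 1/4; add them up per value of X.
pmf <- tapply(rep(1/4, nrow(omega)), x, sum)
pmf                                # P(X=0) = 0.25, P(X=1) = 0.5, P(X=2) = 0.25

all(pmf >= 0)                      # eqn. 6.3
isTRUE(all.equal(sum(pmf), 1))     # eqn. 6.4
```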

6.3 Some Discrete Random Variables

This section identifies and describes the pmfs of some commonly occurring discrete random variables.

6.3.1 Point Mass Distribution

If X can take on only one value, a, it has a point mass distribution at a; X ∼ δa.

pX(x) = 1, for x = a, and 0 elsewhere. (6.5)


6.3.2 Discrete Uniform Distribution

X has a discrete uniform distribution on {1, . . . , k}, U(1, k), if

$$p_X(x) = \frac{1}{k}, \text{ for } x = 1, \ldots, k; \text{ and } 0 \text{ elsewhere}. \tag{6.6}$$

Example 20 . Lottery machine, k balls. First draw, X ∼ U(1, k).

6.3.3 Bernoulli Distribution

Let X be the result of a (binary outcome) experiment with probability p of one outcome, X = 1,

say, and 1 − p for the other, X = 0; for example a coin flip. There’s overuse of the symbol p here, but we need to keep to standard notation; context should resolve any ambiguities between the parameter p = P(X = 1) and the pmf p_X(x).

$$p_X(x) = p^x (1-p)^{1-x}, \text{ for } x \in \{0, 1\}. \tag{6.7}$$

6.3.4 Binomial Distribution

Repeat the experiment above (Bernoulli distribution — coin flip) n times and let X be the number

of 1s (e.g. heads) obtained.

$$p_X(x) = \binom{n}{x} p^x (1-p)^{n-x}, \text{ for } x \in \{0, 1, \ldots, n\}; \ 0 \text{ otherwise}. \tag{6.8}$$

Where does the $\binom{n}{x}$ come from? We have already introduced it in eqn. 5.12; it is the number of ways of selecting x items from n. Consider one particular sequence of flips containing x 1s and n − x 0s: the probability of its x 1s is $p^x$ and the probability of its n − x 0s is $(1-p)^{n-x}$; the flips are independent so we can multiply the probabilities to get $p^x(1-p)^{n-x}$. However, there are $\binom{n}{x}$ possible sequences that give X = x 1s, where

$$\binom{n}{x} = \frac{n!}{x!(n-x)!}.$$

Take n = 3; the sample space is Ω = {TTT, TTH, THT, THH, HTT, HTH, HHT, HHH} and the event corresponding to x = 2 (two heads, any two heads) is A = {THH, HTH, HHT}, i.e. there are three outcomes that give two heads,

$$\binom{n}{x} = \binom{3}{2} = \frac{3!}{2!\,1!} = \frac{6}{2} = 3.$$
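In R, the binomial coefficient is choose() and the Binomial pmf is dbinom(). The sketch below compares eqn. 6.8, written out by hand, with the built-in function for the three-flip case; p = 0.5 is an assumed value for a fair coin.

```r
n <- 3
p <- 0.5                     # fair coin (an assumption; any 0 < p < 1 works)
x <- 0:n

by_formula <- choose(n, x) * p^x * (1 - p)^(n - x)   # eqn. 6.8
by_dbinom  <- dbinom(x, size = n, prob = p)          # R's built-in Binomial pmf

choose(3, 2)                 # 3 ways of getting two heads
by_formula                   # 0.125 0.375 0.375 0.125
sum(by_formula)              # 1, as eqn. 6.4 requires
```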

6.3.5 Geometric Distribution

X has a geometric distribution with parameter p, X ∼ Geom(p), p ∈ (0, 1), if

$$P(X = k) = p(1-p)^{k-1}, \quad k = 1, 2, \ldots \tag{6.9}$$

Example 21 . Distribution of the number of coin flips until the first head.


6.3.6 Poisson Distribution

X has a Poisson distribution with parameter λ, X ∼ Poisson(λ), if

$$p_X(x) = e^{-\lambda}\frac{\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots \tag{6.10}$$

Example 22 . Distribution of rare events like traffic accidents; there can be long periods of

inactivity, but clumping of events is possible, e.g. waiting a long time for a town bus and three

arrive in quick succession!
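Both the Geometric and Poisson pmfs are built into R, as dgeom() and dpois(). The following sketch compares them with eqns. 6.9 and 6.10; the parameter values are assumed for illustration. One point to watch is that R's dgeom() counts the number of failures before the first success, not the number of trials.

```r
p      <- 0.5     # probability of a head (illustrative value)
lambda <- 2       # Poisson parameter (illustrative value)

# Geometric, eqn. 6.9: k = number of flips up to and including the first head.
# R's dgeom counts the failures BEFORE the first success, hence k - 1.
k <- 1:5
geom_formula <- p * (1 - p)^(k - 1)
geom_r       <- dgeom(k - 1, prob = p)

# Poisson, eqn. 6.10
x <- 0:5
pois_formula <- exp(-lambda) * lambda^x / factorial(x)
pois_r       <- dpois(x, lambda = lambda)

all.equal(geom_formula, geom_r)
all.equal(pois_formula, pois_r)
```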

6.4 Some Continuous Random Variables

This section identifies and describes the probability density functions of some commonly occurring

continuous random variables. First we must introduce a continuous alternative to the probability

mass function.

6.4.1 Probability Density Function (PDF)

When we discussed discrete r.v.’s we let X have a range space R_X = {x1, x2, . . . , xn}; the number of values in the range space was countable. Let the range space be R_X = {0, 0.01, 0.02, . . . , 0.99, 1.0}; this is still a discrete r.v.

But what if R_X = [0, 1], i.e. all real numbers in the range 0 to 1? A number of problems arise,

the chief of which are:

• the random variable is now continuous, i.e. the elements of the range space are not countable;

• the probability of any particular value of the r.v. is in fact zero. Example: you buy 0.5-kg

of cheese in Tesco; what is the chance of it being exactly 0.5-kg? Zero. Same goes for

the weight of a product of a chemical experiment. Hence we cannot use probability mass

functions.

We now must use a different probability function called a probability density function (pdf). A pdf,

over a range space RX, must satisfy (c.f. eqns. 6.3 and 6.4 for discrete r.v.’s)

$$f_X(x) \ge 0, \quad \text{all } x \in R_X, \tag{6.11}$$

$$\int_{R_X} f_X(x)\,dx = 1. \tag{6.12}$$

We emphasise that $f_X(x)$ is not a probability, but $f_X(x)\,dx$ is. If you want to speak of a probability over a continuous r.v. you must state something like: the probability that X is in the range a to b, inclusive, is $P(a \le X \le b)$, i.e. $\int_a^b f_X(x)\,dx$.

The term probability density function is used (in contrast to probability mass function (for discrete

r.v.’s)) because, with a continuous r.v. you simply cannot pick a value (X = x), say, and state

P (X = x), which is in fact zero.


Discrete probability mass versus Continuous probability density Think of a ruler upon which we place (stick with Blu-Tack) ball bearings of various sizes along its length; the ball bearings represent discrete masses and we can state that we have a mass $m_1$ at ruling $x_1$; we can also compute the total mass as $\sum_i m_i$.

Now think of a rod of varying diameter laid along the ruler; we cannot pick a point x and say that the mass at precisely that point is m(x), but we can say that the mass in a little length, from $x$ to $x + \Delta x$, is $d(x)\Delta x$, where d is the mass per unit length at x (the density). In this case we can compute the total mass as $\int_{\text{length}} d(x)\,dx$.

6.4.2 Cumulative Distribution Function (cdf)

Many textbooks base their treatment of continuous r.v.’s on the cumulative distribution function

(cdf); the cdf does give a probability.

$$F_X(x) = P(X \le x), \tag{6.13}$$

$$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du. \tag{6.14}$$

6.4.3 Uniform Distribution

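X has a uniform distribution on [a, b], X ∼ Uniform(a, b), if

$$f_X(x) = \begin{cases} \dfrac{1}{b-a}, & \text{for } x \in [a, b] \\ 0, & \text{otherwise.} \end{cases} \tag{6.15}$$

The cumulative distribution function (cdf) is

$$F_X(x) = \begin{cases} 0, & x < a \\ \dfrac{x-a}{b-a}, & x \in [a, b] \\ 1, & x > b. \end{cases} \tag{6.16}$$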
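The Uniform pdf and cdf are available in R as dunif() and punif(). The sketch below evaluates the piecewise cdf of eqn. 6.16 at a few points and compares it with punif(); the end points a = 2, b = 5 are assumed values chosen only for illustration.

```r
a <- 2; b <- 5                       # illustrative end points (assumed values)
x <- c(1, 2, 3.5, 5, 6)              # points below, inside and above [a, b]

# cdf of eqn. 6.16, written out case by case
cdf_formula <- ifelse(x < a, 0, ifelse(x > b, 1, (x - a) / (b - a)))

cdf_formula                          # 0.0 0.0 0.5 1.0 1.0
all.equal(cdf_formula, punif(x, min = a, max = b))

# The pdf of eqn. 6.15 integrates to 1 over [a, b]
integrate(dunif, lower = a, upper = b, min = a, max = b)$value
```

Note that the cdf reaches 1, not 0, once x exceeds b: all of the probability has then been accumulated.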

6.4.4 Normal (Gaussian) Distribution

X has a Normal (Gaussian) distribution with parameters µ and σ, X ∼ N(µ, σ), if

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left[\frac{x-\mu}{\sigma}\right]^2\right), \quad -\infty < x < \infty. \tag{6.17}$$

The Normal distribution is often used to model measurements taken in the presence of error or

noise. If the true value of a variable X is µ, then the measurement (random) variable is distributed as

N(µ, σ) where σ (the standard deviation) is a measure of the ‘size’ of the errors.


We say X has a standard Normal distribution if µ = 0 and σ = 1; standard Normal r.v.’s are typically denoted by Z; Z ∼ N(0, 1). The cdf for Z is denoted by Φ(z); although there is no closed-form formula for Φ(z), it is tabulated. In the days before widespread use of computers, tables such as those for Φ(z) were of great importance to those involved in statistics and statistical inference. Nowadays statistics packages and even some calculators will compute Φ(z) for you or even remove the necessity by calculating the thing that required Φ(z) as an intermediate value.

If X ∼ N(µ, σ) then Z = (X − µ)/σ ∼ N(0, 1).

Conversely, if Z ∼ N(0, 1) then X = σZ + µ ∼ N(µ, σ).

Also, if X ∼ N(µ, σ) and Y = aX + b with a ≠ 0, then Y ∼ N(aµ + b, |a|σ).
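These relationships can be checked numerically with pnorm(), R's Normal cdf. In this sketch the values of µ, σ, x, a and b are all assumed for illustration.

```r
mu <- 10; sigma <- 2          # illustrative parameters (assumed values)
x  <- 13

z <- (x - mu) / sigma         # standardise: z = 1.5

# P(X <= x) computed directly and via the standard Normal cdf, Phi(z)
direct <- pnorm(x, mean = mu, sd = sigma)
via_z  <- pnorm(z)            # pnorm defaults to mean = 0, sd = 1
all.equal(direct, via_z)

# Y = aX + b has standard deviation |a| * sigma, even when a is negative
a <- -3; b <- 1
all.equal(pnorm(a * x + b, mean = a * mu + b, sd = abs(a) * sigma),
          1 - direct)         # a < 0 flips the tail
```

The second check shows why the standard deviation of Y must be written with |a|: a standard deviation cannot be negative, and a negative a simply reverses which tail of X corresponds to which tail of Y.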

6.4.5 Exponential Distribution

X has an Exponential distribution with parameter β, β > 0, X ∼ Exp(β), if

$$f_X(x) = \frac{1}{\beta}\exp(-x/\beta), \quad x > 0. \tag{6.18}$$

The Exponential distribution is used to model the waiting times between infrequent events, c.f.

the Poisson distribution, see section 6.3.6.

6.4.6 Gamma Distribution

X has a Gamma distribution with parameters α, β;α, β > 0, X ∼ Gamma(α, β), if

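$$f_X(x) = \frac{1}{\beta^{\alpha}\Gamma(\alpha)} x^{\alpha-1}\exp(-x/\beta), \quad x > 0. \tag{6.19}$$

The Gamma function, for parameter α > 0, is given by

$$\Gamma(\alpha) = \int_0^{\infty} y^{\alpha-1}e^{-y}\,dy. \tag{6.20}$$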

The Exponential distribution is Gamma with parameter α = 1, Gamma(1, β).

6.4.7 Beta Distribution

X has a Beta distribution with parameters α, β;α, β > 0, X ∼ Beta(α, β), if

$$f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}, \quad 0 < x < 1. \tag{6.21}$$


6.4.8 Student t Distribution

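X has a Student t distribution (or just t distribution), with ν degrees of freedom, X ∼ t_ν, if

$$f_X(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \frac{1}{\left(1 + \frac{x^2}{\nu}\right)^{(\nu+1)/2}}. \tag{6.22}$$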

6.4.9 Cauchy Distribution

The Cauchy distribution, X ∼ Cauchy , is a special case of the t distribution with ν = 1,

$$f_X(x) = \frac{1}{\pi(1+x^2)}. \tag{6.23}$$

6.4.10 Chi-squared Distribution

X has a χ² distribution with n degrees of freedom, X ∼ χ²_n, if

$$f_X(x) = \frac{1}{\Gamma(n/2)\,2^{n/2}} x^{(n/2)-1}e^{-x/2}, \quad x > 0. \tag{6.24}$$
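Several of the density formulas in this section can be compared against R's built-in density functions (dt, dexp, dgamma, dchisq). The sketch below does so at a few arbitrary points; the t density is written with the usual 1/√(νπ) normalising constant, and all parameter values are assumed for illustration.

```r
x <- c(0.5, 1, 2.5)            # a few illustrative evaluation points

# Student t density with the 1/sqrt(nu * pi) normalising constant
nu <- 4
t_formula <- gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2)) *
  (1 + x^2 / nu)^(-(nu + 1) / 2)
all.equal(t_formula, dt(x, df = nu))

# Cauchy, eqn. 6.23, is the t distribution with nu = 1
all.equal(1 / (pi * (1 + x^2)), dt(x, df = 1))

# Exponential(beta), eqn. 6.18, is Gamma(1, beta); note R's dexp takes rate = 1/beta
beta <- 2
all.equal(dexp(x, rate = 1 / beta), dgamma(x, shape = 1, scale = beta))

# Chi-squared, eqn. 6.24
n <- 3
chi_formula <- x^(n / 2 - 1) * exp(-x / 2) / (gamma(n / 2) * 2^(n / 2))
all.equal(chi_formula, dchisq(x, df = n))
```

Be aware that R parameterises the Exponential by its rate (1/β), whereas these notes use the mean β; dgamma() accepts either a rate or a scale argument.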

6.5 Range spaces — terminology

In discussing discrete r.v.’s we mentioned, for example, a range space R_X = {x1, x2, . . . , xn}. If the range space is all the integers, we could use the common symbol R_X = ℤ. If the range space is all the real numbers, we could use the common symbol R_X = ℝ. If the range space is a subset of ℝ, we use, for example, R_X = [0, 1] to state that the r.v. can be 0 to 1 inclusive. For a discrete (integer) subset we use, for example, {1, 2, . . . , 10}.

6.6 Parameters

In discussing the Binomial distribution, eqn. 6.8, and the Normal, eqn. 6.17, see below,

$$p_X(x) = \binom{n}{x} p^x(1-p)^{n-x}, \text{ for } x \in \{0, 1, \ldots, n\}; \ 0 \text{ otherwise},$$

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left[\frac{x-\mu}{\sigma}\right]^2\right), \quad -\infty < x < \infty,$$

we note that p (with n) for the Binomial, and µ, σ for the Normal, completely specify the distributions. We call these parameters and we will see distributions written as, for example, $f_X(x; \theta_1, \theta_2)$, where θ is a common symbol for parameter.

A lot of practical statistics involves parameter estimation, where, for example, we may have a set

(sample) of data x1, x2, . . . , xn, which we know to be drawn from a population with distribution $f_X(x; \theta_1, \theta_2)$ and we want to compute an estimate $\hat{\theta}_1$ for $\theta_1$.


Chapter 7

Two- and Multi-Dimensional Random Variables

7.1 Introduction

Chapter 6 has introduced one dimensional random variables and certain well known distributions.

Both discrete and continuous r.v.’s were covered.

In many cases, your (scientific) data will consist not just of single numbers, for example, the weight

of a chemical in a mixture, but two or more numbers. If the numbers correspond to independent

events, see section 5.8, it may be possible or desirable to treat them separately as individual

one-dimensional r.v.’s, but, generally, you will want to treat pairs or triples or multiple numbers

together.

In section 5.6 and eqn. 5.13 we introduced the notion of the probability of two events happening

together, P (AB), the joint probability of A and B.

Here we introduce first two-dimensional r.v.’s and then go on to generalise to multi-dimensional

r.v.’s.

Range spaces: terminology for two and more dimensions See section 6.5 where we introduced some symbols and terminology used in describing range spaces for one-dimensional r.v.’s.

If we have a two-dimensional continuous random variable, a pair (X, Y), each member of which can take on any real value, we say that the range space is ℝ × ℝ; for general multi-dimensions, say p dimensions, where the random variable is a random vector, we use ℝ^p. For subsets of ℝ, we use, for example, [0, 1] × [0, 1] and [0, 1]^p. The term for a combination (product) of sets such as [0, 1] × [0, 1] is Cartesian product.

Two-dimensional (Bivariate) Random Variables If, to every outcome, ω, of an experiment, we assign two numbers, X(ω), Y(ω), the pair (X, Y) is called a two-dimensional random variable.

As in one dimension, two-dimensional random variables may be discrete or continuous; the term random vector is also used, especially when there are more than two dimensions.

Much of what we present here is just a two-dimensional analogue of what was covered in Chapter 6. Also, what is described here in terms of two dimensions transfers immediately to multiple dimensions.

7.2 Probability Function of a Discrete Two-dimensional r.v.

By analogy with eqns. 6.3 and 6.4, for one-dimension, we have $p_{X,Y}(x_i, y_j) = P(X = x_i, Y = y_j)$ (or just $p(x_i, y_j)$) and it must satisfy the following

$$p(x_i, y_j) \ge 0, \quad i = 1, 2, \ldots; \ j = 1, 2, \ldots \tag{7.1}$$

$$\sum_{j=1}^{m}\sum_{i=1}^{n} p(x_i, y_j) = 1. \tag{7.2}$$

As with one-d., pX,Y or just p is called the probability function or the joint probability function for

the r.v. (X, Y ).

Example 23 From (Meyer 1966, p. 85). There are two production lines; the first has a capacity

to produce up to five items in a day; its actual production is a random variable X; the second has

a capacity to produce up to three items in a day and its actual production is a random variable Y .

The pair of random variables is the two-dimensional random variable (X, Y ) and the joint probability

function is given in Table 7.1. Each entry represents P (X = xi , Y = yj); so p(2, 3) = 0.04. Such

a table could be estimated by noting (X, Y ) over a large number of days.

Y \ X      0      1      2      3      4      5
  0      0.00   0.01   0.03   0.05   0.07   0.09
  1      0.01   0.02   0.04   0.05   0.06   0.08
  2      0.01   0.03   0.05   0.05   0.05   0.06
  3      0.01   0.02   0.04   0.06   0.06   0.05

Table 7.1: Example of a two-dimensional probability function (rows are values of Y, columns are values of X)

We can verify that the table does represent a proper probability function in that requirement eqn.

7.1 is satisfied, and, by summing over all entries, that requirement eqn. 7.2 is satisfied — the

entries sum to 1.

7.3 PDF of a Continuous Two-dimensional r.v.

By analogy with eqns. 6.11 and 6.12, for one-dimension, we have the (joint) PDF f (x, y) and it

must satisfy the following

$$f(x, y) \ge 0, \quad \text{all } (x, y) \in \mathbb{R}\times\mathbb{R}, \tag{7.3}$$

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1. \tag{7.4}$$

We emphasise again that f (x, y) is not a probability, but f (x, y)dxdy is.

7.4 Marginal Probability Distributions

Example 24 Suppose in Example 23 (Table 7.1) we want to compute the probability functions for

X and Y on their own. These are called marginal probability functions. The marginal probability

function for X is given by

$$p_X(x_i) = P(X = x_i) = P(X = x_i, Y = y_1, \text{ or } \ldots, \text{ or } X = x_i, Y = y_m) = \sum_{j=1}^{m} p(x_i, y_j). \tag{7.5}$$

Similarly, the marginal probability function for Y is given by

$$p_Y(y_j) = \sum_{i=1}^{n} p(x_i, y_j).$$

Table 7.2 shows the corresponding sums.

Y \ X      0      1      2      3      4      5     Sum
  0      0.00   0.01   0.03   0.05   0.07   0.09   0.25
  1      0.01   0.02   0.04   0.05   0.06   0.08   0.26
  2      0.01   0.03   0.05   0.05   0.05   0.06   0.25
  3      0.01   0.02   0.04   0.06   0.06   0.05   0.24
 Sum     0.03   0.08   0.16   0.21   0.24   0.28   1.00

Table 7.2: Example of marginal probability functions (row and column sums of the joint probability function)

We can verify that the sums corresponding to p(xi) and p(yj) do represent proper probability

functions in that requirement 6.3 is satisfied, and, by summing the marginals, that requirement

6.4 is satisfied — both sets of marginals sum to 1.
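The marginal sums are one-liners in R once the joint probability function is stored as a matrix. A sketch, with the values of Example 23 typed in by hand and illustrative object names:

```r
# Joint pmf of Example 23: rows are Y = 0..3, columns are X = 0..5.
p <- matrix(c(0.00, 0.01, 0.03, 0.05, 0.07, 0.09,
              0.01, 0.02, 0.04, 0.05, 0.06, 0.08,
              0.01, 0.03, 0.05, 0.05, 0.05, 0.06,
              0.01, 0.02, 0.04, 0.06, 0.06, 0.05),
            nrow = 4, byrow = TRUE,
            dimnames = list(Y = as.character(0:3), X = as.character(0:5)))

p_X <- colSums(p)   # marginal pmf of X, eqn. 7.5 (sum over the values of Y)
p_Y <- rowSums(p)   # marginal pmf of Y (sum over the values of X)

p_X                 # 0.03 0.08 0.16 0.21 0.24 0.28
p_Y                 # 0.25 0.26 0.25 0.24
sum(p)              # 1, as eqn. 7.2 requires
```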

For continuous random variables, we can state the equivalent equation for marginal PDFs:

$$f_X(x) = \int_Y f_{X,Y}(x, y)\,dy. \tag{7.6}$$


7.5 Conditional Probability Distributions

In section 5.6 we introduced conditional probability, i.e. the probability of an event B when we

know that event A has occurred:

$$P(B|A) = \frac{P(AB)}{P(A)}. \tag{7.7}$$

We can do the same for probability functions.

Example 25 Suppose in Example 24 (Table 7.2) we want to compute the conditional probability

P (X = 2|Y = 1). Applying eqn. 7.7 we have

$$P(X = 2|Y = 1) = \frac{P(X = 2, Y = 1)}{P(Y = 1)} = \frac{0.04}{0.26} = 0.154.$$
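The same calculation in R, as a sketch: the whole conditional probability function of X given Y = 1 is obtained by dividing one row of the joint table by its row sum. The matrix is retyped here so that the block runs on its own; object names are illustrative.

```r
# Joint pmf of Example 23: rows are Y = 0..3, columns are X = 0..5.
p <- matrix(c(0.00, 0.01, 0.03, 0.05, 0.07, 0.09,
              0.01, 0.02, 0.04, 0.05, 0.06, 0.08,
              0.01, 0.03, 0.05, 0.05, 0.05, 0.06,
              0.01, 0.02, 0.04, 0.06, 0.06, 0.05),
            nrow = 4, byrow = TRUE,
            dimnames = list(Y = as.character(0:3), X = as.character(0:5)))

p_Y <- rowSums(p)                        # marginal pmf of Y

cond_X_given_Y1 <- p["1", ] / p_Y["1"]   # p(x | Y = 1) for x = 0..5
round(cond_X_given_Y1["2"], 3)           # P(X = 2 | Y = 1) = 0.154
sum(cond_X_given_Y1)                     # a conditional pmf also sums to 1
```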

We can give general rules, noting that $p(x_i)$, $q(y_j)$ are the marginal probability functions of X and Y given by eqn. 7.5,

$$p(x_i|y_j) = \frac{p(x_i, y_j)}{q(y_j)} \quad \text{if } q(y_j) > 0, \tag{7.8}$$

$$p(y_j|x_i) = \frac{p(x_i, y_j)}{p(x_i)} \quad \text{if } p(x_i) > 0. \tag{7.9}$$

We can give similar general rules for continuous random variables, noting that $g(x)$, $h(y)$ are the marginal pdfs of X and Y given by eqn. 7.6,

$$f(x|y) = \frac{f(x, y)}{h(y)} \quad \text{if } h(y) > 0, \tag{7.10}$$

$$f(y|x) = \frac{f(x, y)}{g(x)} \quad \text{if } g(x) > 0. \tag{7.11}$$

7.6 Independent Random Variables

We can define the notion of independent random variables using the definition of independent

events given in section 5.8; we had: A and B are independent events if and only if

P (AB) = P (A)P (B). (7.12)

(The occurrence of event A in no way influences the occurrence of B and vice-versa.)


Independent Discrete Random Variables Given the two-d. discrete random variable (X, Y ), X

and Y are said to be independent if and only if

$$p(x_i, y_j) = q(x_i)r(y_j), \tag{7.13}$$

noting that $q(x_i)$, $r(y_j)$ are the marginal probability functions of X and Y given by eqn. 7.5.

Independent Continuous Random Variables Similarly, given the two-d. continuous random

variable (X, Y ), X and Y are said to be independent if and only if

f (x, y) = g(x)h(y), (7.14)

where g(x), h(y) are marginal pdfs.
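For a discrete example, the independence condition can be tested by comparing a joint table with the product of its marginals. The sketch below does this for the production-line data of Example 23 (retyped so the block is self-contained); it turns out that X and Y there are not independent.

```r
# Joint pmf of Example 23: rows are Y = 0..3, columns are X = 0..5.
p <- matrix(c(0.00, 0.01, 0.03, 0.05, 0.07, 0.09,
              0.01, 0.02, 0.04, 0.05, 0.06, 0.08,
              0.01, 0.03, 0.05, 0.05, 0.05, 0.06,
              0.01, 0.02, 0.04, 0.06, 0.06, 0.05),
            nrow = 4, byrow = TRUE,
            dimnames = list(Y = as.character(0:3), X = as.character(0:5)))

p_X <- colSums(p)
p_Y <- rowSums(p)

# What the joint pmf would have to be if X and Y were independent
product <- outer(p_Y, p_X)

isTRUE(all.equal(p, product, check.attributes = FALSE))   # FALSE: X and Y are dependent
p["0", "0"]         # 0: both lines never produce nothing on the same day
product["0", "0"]   # 0.0075 under independence
```

A single cell where the joint probability differs from the product of the marginals is enough to rule out independence.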

7.7 Two-dimensional (Bivariate) Normal Distribution

We can extend the one-d. Normal (Gaussian) distribution to two-d.

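$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 - \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y} + \left(\frac{y-\mu_y}{\sigma_y}\right)^2\right]\right), \tag{7.15}$$

for $-\infty < x < \infty$, $-\infty < y < \infty$.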

Before you start protesting that eqn. 7.15 is incomprehensible, (i) it isn’t and I can explain it; (ii)

there is a much better way of handling multivariate random variables that is better for even two-d.

See Chapter B and section B.7.


Chapter 8

Characterisations of Random Variables

8.1 Introduction

We introduced the notion of a random variable in Chapters 6 and 7. We identified probability

functions (for discrete r.v.’s) and probability density functions for some commonly occurring r.v.’s.

Here we identify and define some parameters (numbers) that characterise some aspects of r.v.

distributions.

Generally, the expected value or expectation of some function of the r.v. is found useful and the

expected value of the r.v. itself (the mean) is first amongst these.

8.2 Expected Value (Mean) of a Random Variable

The expected value of a r.v. X, or expectation, or mean, is the average value of X.

Definition: Expected Value, Discrete R.V. Discrete r.v., range space $R_X = \{x_1, \ldots, x_N\}$; probability mass function $p(x_i) = P(X = x_i)$. The expected value or expectation, E(X), or mean of X is given by

$$E(X) = \mu_X = \sum_{i=1}^{N} x_i\,p(x_i). \tag{8.1}$$

Continuous r.v., range space $R_X = \mathbb{R}$; probability density function f(x). The expected value or expectation, E(X), or mean of X is given by

$$E(X) = \mu_X = \int_{\mathbb{R}} x f(x)\,dx. \tag{8.2}$$


Example 26 Toss two coins as in Example 18. X = number of heads. A = {TT}, P(A) = 1/4, X = 0, P(X = 0) = 1/4; A = {TH, HT}, P(A) = 1/2, X = 1, P(X = 1) = 1/2; A = {HH}, P(A) = 1/4, X = 2, P(X = 2) = 1/4.

$$E(X) = \mu_X = \sum_{i=1}^{N} x_i\,p(x_i) = 0 \times \frac{1}{4} + 1 \times \frac{1}{2} + 2 \times \frac{1}{4} = 0 + 0.5 + 0.5 = 1.$$

Example 27 Toss a dice and take X = the number of dots obtained; $p(x_i) = 1/6$, $i = 1, \ldots, 6$.

$$E(X) = \mu_X = \sum_{i=1}^{N} x_i\,p(x_i) = \frac{1}{6}\sum_{i=1}^{6} x_i = 21/6 = 3.5. \tag{8.3}$$

Note that in Example 27 µx = 3.5 is not one of the possible values of X.
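Both examples reduce to a weighted sum, which in R is sum(x * p). A short sketch with illustrative variable names:

```r
# Example 26: X = number of heads in two coin tosses
x_coins <- c(0, 1, 2)
p_coins <- c(1/4, 1/2, 1/4)
sum(x_coins * p_coins)        # E(X) = 1

# Example 27: X = number of dots on a fair dice
x_dice <- 1:6
p_dice <- rep(1/6, 6)
sum(x_dice * p_dice)          # E(X) = 3.5, not itself a possible value of X
```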

It is useful, particularly in two-d. cases, to think of µx as the centre of mass, where p(xi) is a mass

and xi is a position along a lever arm; µx is the position to place the fulcrum in order to achieve a

balance.

Aside — Sample Averages In later chapters we will encounter samples and sample averages.

By sample we mean that we run an experiment and take some example values, say n of them, of

the r.v., x1, x2, . . . , xn.

Here we use n for the size of the sample rather than N as in eqn. 8.1 and note that the range space $R_X = \{x_1, \ldots, x_N\}$ denotes the population, rather than a sample of it.

Then we can compute a sample mean, $\bar{X}$ (pronounced x-bar), as

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i. \tag{8.4}$$

That is, compute the average like we learned in early arithmetic.

Ordinarily, we’ll make a strong distinction between sample mean and true mean. But let us consider the case of a large sample of dice throws, say n = 600. Let $y_i$ be the count of each value $x_i = i$ obtained. We might expect to obtain something like $y_1 = 95, y_2 = 110, y_3 = 90, y_4 = 97, y_5 = 105, y_6 = 103$, so that for eqn. 8.4, $\sum_{i=1}^{n} x_i = 95 \times 1 + 110 \times 2 + 90 \times 3 + 97 \times 4 + 105 \times 5 + 103 \times 6 = 2116$ and $\bar{X} = 2116/600 = 3.53$.

If we look more carefully at eqn. 8.4 for this example, we can interpret it as a sample version of eqn. 8.1. Grouping the n throws by the six possible values,

$$\bar{X} = \sum_{i=1}^{6} \frac{1}{n} y_i \times x_i = \sum_{i=1}^{6} x_i \frac{y_i}{n}, \tag{8.5}$$

and, comparing with eqn. 8.1, we have $y_i/n$ in place of $p(x_i)$; we note that $y_i/n = \hat{p}(x_i)$ = 95/600, 110/600, 90/600, 97/600, 105/600, 103/600 = 0.158, 0.183, 0.15, 0.162, 0.175, 0.172, i.e. we have sample estimates of the probability mass function, which are inexact. The error, $\bar{X} \ne \mu_X$, is due to the errors in the $\hat{p}(x_i)$. Generally, as $n \to \infty$, $\hat{p}(x_i) \to p(x_i)$ and $\bar{X} \to \mu_X$.
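The sample mean for the 600 throws can be computed in R from the counts, either as a weighted sum over the six values or by averaging the raw throws. A sketch using the counts quoted in the text; variable names are illustrative.

```r
x <- 1:6
y <- c(95, 110, 90, 97, 105, 103)   # counts from the text
n <- sum(y)                          # 600

p_hat <- y / n                       # sample estimates of p(x_i)
x_bar <- sum(x * p_hat)              # weighted sum over the six values

sum(x * y)                           # 2116
x_bar                                # about 3.527
all.equal(x_bar, mean(rep(x, times = y)))   # same as averaging the 600 raw throws
```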


Definition: Expected Value of a function of a r.v. The expected value, E(r(X)), of a function Y = r(X) of X is given by

$$E(Y) = E(r(X)) = \sum_{i=1}^{N} r(x_i)\,p(x_i). \tag{8.6}$$

Example 28 Let us use a dice as a one number slot-machine (one-armed-bandit). We pay 5c to play and the machine pays whatever number comes up (1 to 6); thus our payout for each play is $x_i - 5$. What is the expected value of the payout? (Think play for an hour, 1000 plays, inserting 5000c, what do we expect to win or lose?)

$$E(Y) = \sum_{i=1}^{N} r(x_i)\,p(x_i) = \sum_{i=1}^{6} (x_i - 5)\frac{1}{6} = -4/6 - 3/6 - 2/6 - 1/6 + 0/6 + 1/6 = -9/6 = -1.5.$$

That is, we lose on average 1.5c for every play and would lose 1500c in 1000 plays. (Maybe better

than the average slot-machine?)
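The slot-machine calculation in R, as a brief sketch; r() is the payout function of the example and the other names are illustrative.

```r
x <- 1:6
p <- rep(1/6, 6)
r <- function(x) x - 5           # net payout in cents for outcome x

expected_payout <- sum(r(x) * p) # expected value of a function of X
expected_payout                  # -1.5
1000 * expected_payout           # -1500 over 1000 plays

# Equivalent shortcut via linearity of expectation: E(X - 5) = E(X) - 5
all.equal(expected_payout, sum(x * p) - 5)
```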

Expected values for two-dimensions and higher Eqns. 8.1 and 8.2 carry over to two and more

dimensions.

Discrete r.v., range space $R_{X,Y} = \{x_1, \ldots, x_N\} \times \{y_1, \ldots, y_M\}$; probability mass function $p(x_i, y_j) = P(X = x_i, Y = y_j)$. The expected value or expectation, E[(X, Y)], or mean of the pair (X, Y) is given by

$$E[(X, Y)] = \mu_{X,Y} = (\mu_X, \mu_Y) = \sum_{i=1}^{N}\sum_{j=1}^{M} (x_i, y_j)\,p(x_i, y_j). \tag{8.7}$$

And similarly for two-d. (and multidimensional) continuous, where multiple integrals replace single

integrals.

Useful facts For $X_1, \ldots, X_n$ random variables and constants $a_1, \ldots, a_n$,

$$E\left(\sum_i a_i X_i\right) = \sum_i a_i E(X_i). \tag{8.8}$$

For $X_1, \ldots, X_n$ independent random variables

$$E\left(\prod_{i=1}^{n} X_i\right) = \prod_{i=1}^{n} E(X_i). \tag{8.9}$$


8.3 Variance of a Random Variable

Variance gives the spread of a distribution. The variance is the expected value (mean value) of

the squared deviation from the mean.

Definition: Variance Discrete r.v., range space $R_X = \{x_1, \ldots, x_N\}$; probability mass function $p(x_i)$, mean $\mu_X$. The variance is given by

$$V(X) = \sigma^2 = E[(X - \mu_X)^2] = \sum_{i=1}^{N} (x_i - \mu_X)^2 p(x_i). \tag{8.10}$$

Continuous r.v.

$$V(X) = \sigma^2 = E[(X-\mu_X)^2] = \int_{\mathbb{R}} (x - \mu_X)^2 f(x)\,dx. \tag{8.11}$$

The following formula is sometimes useful

$$V(X) = E(X^2) - (E(X))^2 = E(X^2) - \mu_X^2. \tag{8.12}$$
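As an illustration, the variance of the number of dots on a fair dice can be computed in R from the definition and from the shortcut formula; the two agree. Variable names are illustrative.

```r
x <- 1:6
p <- rep(1/6, 6)

mu     <- sum(x * p)                    # E(X) = 3.5
v_def  <- sum((x - mu)^2 * p)           # from the definition, eqn. 8.10
v_alt  <- sum(x^2 * p) - mu^2           # from E(X^2) - mu^2, eqn. 8.12

v_def                                   # 35/12, about 2.917
all.equal(v_def, v_alt)
sqrt(v_def)                             # standard deviation, about 1.708
```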

Aside: Sample Variance Eqn. 8.4 gives the sample mean of a random variable; the sample variance is given by

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{X})^2. \tag{8.13}$$

You may wonder about the (n − 1) instead of n; if we divided by n, the estimate would be biased: on average it would underestimate σ².

Standard Deviation Standard deviation: $\sigma_X = \sqrt{V(X)}$.

Useful facts about variance For constants a, b,

$$V(aX + b) = a^2 V(X). \tag{8.14}$$

For $X_1, \ldots, X_n$ independent random variables and constants $a_1, \ldots, a_n$,

$$V\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i^2 V(X_i). \tag{8.15}$$

If $X_1, \ldots, X_n$ are independent and identically distributed (IID) random variables with $\mu = E(X_i)$, $\sigma^2 = V(X_i)$, then

$$E(\bar{X}) = \mu, \quad V(\bar{X}) = \sigma^2/n, \quad E(s^2) = \sigma^2. \tag{8.16}$$
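The facts about the sample mean and sample variance of IID variables can be illustrated by simulation. The sketch below repeatedly draws samples of n = 10 dice throws (for which µ = 3.5 and σ² = 35/12) and looks at the behaviour of the sample mean and of s² across samples; the seed, sample size and number of repetitions are arbitrary choices.

```r
set.seed(42)                   # arbitrary seed, for repeatability only
n    <- 10                     # size of each sample
reps <- 20000                  # number of repeated samples

# Each row is one sample of n throws of a fair dice
samples <- matrix(sample(1:6, n * reps, replace = TRUE), nrow = reps)

x_bar <- rowMeans(samples)          # sample mean of each sample
s2    <- apply(samples, 1, var)     # sample variance (R's var divides by n - 1)

c(simulated = mean(x_bar), theory = 3.5)            # mean of the sample means
c(simulated = var(x_bar),  theory = (35 / 12) / n)  # variance of the sample means
c(simulated = mean(s2),    theory = 35 / 12)        # mean of the sample variances
```

The simulated figures should be close to, but not exactly equal to, the theoretical ones; the agreement improves as reps is increased.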


8.4 Expectations in Two-dimensions

8.4.1 Mean

Two-d. discrete r.v., range space $R_{X,Y} = \{x_1, \ldots, x_N\} \times \{y_1, \ldots, y_M\}$; probability mass function $p(x_i, y_j)$. The expected value or expectation, E[(X, Y)], or mean of (X, Y) is given by

$$E[(X, Y)] = \mu_{X,Y} = \sum_{j=1}^{M}\sum_{i=1}^{N} (x_i, y_j)\,p(x_i, y_j). \tag{8.17}$$

Similarly for a continuous r.v. — double integral replaces summation, pdf replaces probability mass

function.

8.4.2 Covariance

Let X, Y be r.v.’s with means µX, µY and standard deviations σX, σY . The covariance between X

and Y is defined as

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]. \tag{8.18}$$

Covariance is symmetric: Cov(X, Y) = Cov(Y, X).

The correlation between X and Y is defined as

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\sigma_Y}. \tag{8.19}$$
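As a worked illustration, the covariance and correlation can be computed in R for the production-line joint probability function of Example 23 (retyped here so that the block runs on its own). The object names are illustrative.

```r
# Joint pmf of Example 23: rows are Y = 0..3, columns are X = 0..5.
p <- matrix(c(0.00, 0.01, 0.03, 0.05, 0.07, 0.09,
              0.01, 0.02, 0.04, 0.05, 0.06, 0.08,
              0.01, 0.03, 0.05, 0.05, 0.05, 0.06,
              0.01, 0.02, 0.04, 0.06, 0.06, 0.05),
            nrow = 4, byrow = TRUE)
x <- 0:5
y <- 0:3

p_X <- colSums(p); p_Y <- rowSums(p)       # marginal pmfs
mu_X <- sum(x * p_X)                       # 3.39
mu_Y <- sum(y * p_Y)                       # 1.48
sd_X <- sqrt(sum((x - mu_X)^2 * p_X))
sd_Y <- sqrt(sum((y - mu_Y)^2 * p_Y))

# E[(X - mu_X)(Y - mu_Y)]: outer() builds the grid of products (rows = y, columns = x)
cov_XY <- sum(outer(y - mu_Y, x - mu_X) * p)
rho_XY <- cov_XY / (sd_X * sd_Y)

cov_XY     # about -0.257
rho_XY     # a weak negative correlation
```

The negative sign says that days on which one line produces more than its average tend, slightly, to be days on which the other produces less.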


Chapter 9

The Normal Distribution

9.1 Introduction

Here we introduce some uses of the Normal distribution, eqn. 6.17. The Normal distribution can

be used as a model or approximate model in so many cases that a large amount of mathematics

has been built up around it. Note: we use Normal (capitalised) to distinguish from the word normal

(expected, typical) and because most other distribution names are capitalised.

The probability density function (pdf) is given by:

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left[\frac{x-\mu}{\sigma}\right]^2\right), \quad -\infty < x < \infty. \tag{9.1}$$

We say X ∼ N(µ, σ); note: some writers use X ∼ N(µ, σ2), i.e. they use the variance for the

second parameter of N; we will attempt to standardise on N(µ, σ). It is well worth checking carefully when reading books and papers; there can be a great difference between σ and σ²!

Because the pdf is different for each µ, σ, it is convenient to create a standardised Normal in which

µ = 0, σ = 1. We standardise the r.v. X as follows; first we shift to zero mean, and then we divide

by σ to obtain unit standard deviation.

Z = (X − µ)/σ. (9.2)

When we standardise X, we obtain Z = (X − µ)/σ ∼ N(0, 1), and eqn. 9.1 becomes eqn. 9.3,

$$f_Z(z) = \frac{1}{\sqrt{2\pi}}\exp(-z^2/2). \tag{9.3}$$

The pdf for N(0, 1) is shown in Figure 9.1. As you can see, most of the probability is located in −3 < Z < 3; between these limits we have probability 0.9973, i.e. P(−3 < Z < 3) = 0.9973, that is, if we have a random variable Z, we can be pretty sure it will fall between these limits; you may have heard the term three-sigma to denote nearly all occurrences. Likewise P(−1.96 < Z < 1.96) = 0.95, so that the probability outside these limits is 0.05 or 5%.


R-Example 3 The following R code computes and plots Figure 9.1.

> z = seq(-6, 6, length = 200)
> pdf = dnorm(z, 0, 1) ## dnorm for d(ensity) normal
> plot(z, pdf, type = "l", lwd = 3)

9.2 Cumulative Distribution Function (cdf)

As we indicated in section 6.4.2, the pdf does not represent a probability but a probability density. The numbers we refer to above, for example P(−1.96 < Z < 1.96) = 0.95, are obtained by integration,

$$ P(-1.96 < Z < 1.96) = \int_{-1.96}^{1.96} f_Z(z)\,dz = 0.95. \qquad (9.4) $$

However, for the Normal distribution, there is no easy (closed-form) way to compute ∫_a^b f_X(x) dx, which is where the cdf comes in; we recall that the cdf is given by eqns. 9.5 and 9.6,

FZ(z) = P (Z ≤ z), (9.5)

$$ \Phi(z) = F_Z(z) = \int_{-\infty}^{z} f_Z(u)\,du = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp(-u^2/2)\,du. \qquad (9.6) $$

Because it is so commonly used, the standardised Normal cdf gets its own symbol, Φ(z). Φ(z) is plotted in Figure 9.2, which was created using the code in R-Example 4.

R-Example 4 The following R code computes and plots Figure 9.2.

> z = seq(-6, 6, length = 200)
> cdf = pnorm(z, 0, 1)  ## pnorm for p(robability) normal
> plot(z, cdf, type = "l", lwd = 3)
>
> ### add these if you want a figure for a report
> pdf("normcdf.pdf", onefile = FALSE, height = 4, width = 4, pointsize = 8, paper = "special")
> plot(z, cdf, type = "l", lwd = 3)
> dev.off()  ### necessary to flush diagram into the file "normcdf.pdf"

Following the discussion above on how most of the probability is located between (−3 < Z < 3), we are not surprised to see that Φ(z) is close to zero at z = −3; it rises to 0.5 at z = 0 (one half of the probability is below 0, the other above 0) and then flattens out close to 1 at z = 3, after which there is almost no probability for the integral to add in.
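The link between the integral of the pdf and the cdf can be checked numerically. The following is a minimal sketch using R's built-in integrate(), dnorm() and pnorm(); the variable names are illustrative only.

```r
# Area under the standard Normal pdf between -1.96 and 1.96, by numerical integration
area_integral <- integrate(dnorm, lower = -1.96, upper = 1.96)$value

# The same area as a difference of cdf values, Phi(1.96) - Phi(-1.96)
area_cdf <- pnorm(1.96) - pnorm(-1.96)

# Three-sigma probability, P(-3 < Z < 3)
area_3sigma <- pnorm(3) - pnorm(-3)

print(c(integral = area_integral, cdf = area_cdf, three_sigma = area_3sigma))
```

The first two numbers should agree to many decimal places, which is why in practice we use the cdf rather than integrating the pdf ourselves.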


Figure 9.1: Standardised Normal distribution, N(0, 1), probability density function (pdf).

Figure 9.2: Normal cumulative distribution function (cdf).


9.3 Normal Cdf

Traditionally, statistics books and books of tables contained tabulations of the Normal cdf, Φ(z).

We will see below how these tables are used. However, because most statistics is now conducted

using software packages, tables may be less frequently used, and may be less commonly encountered

in textbooks.

R-Example 5 The following R code computes Table 9.1.

> z = seq(-4, 4, length = 9)
> cdf = pnorm(z, 0, 1)
> z
[1] -4 -3 -2 -1  0  1  2  3  4
> cdf
[1] 3.167124e-05 1.349898e-03 2.275013e-02 1.586553e-01 5.000000e-01
[6] 8.413447e-01 9.772499e-01 9.986501e-01 9.999683e-01

z       -4        -3         -2         -1      0     1      2      3      4
Φ(z)    3.2e-05   1.35e-03   2.28e-02   0.159   0.5   0.841  0.977  0.999  0.99997

Table 9.1: Φ(z) for z = −4 to +4.

What does Φ(−2) = 2.28 × 10^−2 = 0.0228 mean? Referring to Figure 9.1, it means that the amount of probability to the left of Z = −2 is 0.0228, i.e. P(Z ≤ −2) = 0.0228, as indicated by eqn. 9.5.

Owing to the symmetry of Figure 9.1, we can state that the amount of probability to the right of Z = +2 is also 0.0228. Hence the probability P(Z < −2 or Z > +2) = 2 × 0.0228 = 0.0456 or 4.56%. If we move a little closer to the mean, we get P(Z < −1.96 or Z > +1.96) = 2 × 0.025 = 0.05 or 5%. This 5% two-sided cutoff (±1.96) is used a lot in statistics.

If P (Z < −1.96 or Z > +1.96) = 0.05 then P (−1.96 < Z < +1.96) = 0.95.

In a similar way, we can determine that P(Z < −1 or Z > +1) = 2 × 0.159 = 0.318; that is, a standard Normal random variable Z lies more than one standard deviation from the mean 31.8% of the time, and within plus or minus one standard deviation 68.2% of the time. The 0.159 number is used below in Example 29.
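The symmetry argument can be sketched in R as follows. The helper two_sided_tail() is a name introduced here for illustration; pnorm() and qnorm() are the built-in Normal cdf and its inverse.

```r
# Two-sided tail probabilities P(Z < -k or Z > +k) = 2 * Phi(-k), by symmetry
two_sided_tail <- function(k) 2 * pnorm(-k)

tails <- sapply(c(1, 1.96, 2), two_sided_tail)
names(tails) <- c("k=1", "k=1.96", "k=2")
print(round(tails, 4))

# Going the other way: the cutoff that leaves 2.5% in the upper tail
upper_cutoff <- qnorm(0.975)
print(upper_cutoff)
```

Note that qnorm() answers the inverse question to pnorm(): given a probability, it returns the corresponding value of z.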

9.4 Using the Normal Cdf

Example 29 Suppose we have a manufacturing process which takes fixed quantities of raw materials A (1000 g) and B (500 g) which react together to produce a product C in the form of a solid cake. The weights of the cakes, X, are monitored and those below a certain weight are set aside as B-grade. The manufacturer of the machine gives the expected yield as E(X) = 165 grams with a variance of 9 and has determined that the yield follows the Normal distribution; that is, µX = 165, σX = √9 = 3 and X ∼ N(165, 3). We have decided that cakes below 162 grams should be marked as B-grade.


What is the probability that a randomly selected output will be less than 162 grams?

We have no tables for N(165, 3), but we do have them for N(0, 1), that is, the cdf of the standardised Normal, Φ(z).

Solution.

(i) First we standardise using eqn. 9.2, Z = (X − µ)/σ = (X − 165)/3, in which case the standardised weight corresponding to 162 is Z162 = (162 − 165)/3 = −1.

(ii) The probability that Z < Z162 is just Φ(Z162) = Φ(−1), and we can read that from Table 9.1, i.e. the probability is 0.159, so 15.9% of the output will be B-grade.

(iii) Or, we can use R.

> pnorm(-1, 0, 1)  ## here explicitly giving mu and sigma
[1] 0.1586553
> pnorm(-1)  ## if none given, R assumes mu = 0, sigma = 1
[1] 0.1586553

(iv) We can even let R handle the standardisation.

> pnorm(162, 165, 3)  ## here explicitly giving mu and sigma
[1] 0.1586553

Normal distribution appropriate? In Example 29 there can be an immediate objection to the Normal model: X can never be less than zero, but N(165, 3) assigns a probability greater than zero (though very, very small) to X < 0. In defence, we can argue that this probability is negligibly small, so that use of the Normal model should not introduce significant errors. If instead we had a weight with E(X) = 4, V(X) = 9, σ = 3, then we would have to question the Normal model.
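To put numbers on this objection, the probability that each model assigns to negative weights can be computed directly. This is only a sketch; the variable names are illustrative.

```r
# Probability that the Normal model assigns to impossible negative weights
p_neg_cake  <- pnorm(0, mean = 165, sd = 3)  # Example 29: mean is 55 sigma above zero
p_neg_small <- pnorm(0, mean = 4, sd = 3)    # mean only 4/3 sigma above zero

print(c(cake = p_neg_cake, small = p_neg_small))
```

For the cake weights the probability is far too small to matter, whereas for the second case roughly 9% of the model's probability lies below zero.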

9.5 Sum of Independent Normal Random Variables

If X1 ∼ N(µ1, σ1) and X2 ∼ N(µ2, σ2) are independent random variables, then

$$ X = X_1 + X_2 \sim N(\mu, \sigma), \qquad (9.7) $$

where µ = µ1 + µ2 and Var(X) = σ² = σ1² + σ2².

Add the means and add the variances; note that we do not add the standard deviations.


Need example here.
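One possible example, sketched as a simulation in R: the means, standard deviations and number of simulated values below are assumptions chosen for illustration (3 and 4 give a convenient combined standard deviation of 5).

```r
set.seed(1)  # fixed seed so the sketch is reproducible

n_sim <- 100000
x1 <- rnorm(n_sim, mean = 10, sd = 3)  # X1 ~ N(10, 3), illustrative values
x2 <- rnorm(n_sim, mean = 20, sd = 4)  # X2 ~ N(20, 4), independent of X1
x_sum <- x1 + x2

# Theory (eqn. 9.7): mean 10 + 20 = 30, sd sqrt(3^2 + 4^2) = 5 (not 3 + 4 = 7)
theory <- c(mean = 10 + 20, sd = sqrt(3^2 + 4^2))
simulated <- c(mean = mean(x_sum), sd = sd(x_sum))
print(rbind(theory, simulated))
```

The simulated standard deviation should be close to 5, confirming that variances add rather than standard deviations.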

Eqn. 9.7 generalises to give the distribution of a sum of n independent observations of the same random variable. If Xi ∼ N(µ, σ), then

$$ X = X_1 + X_2 + \cdots + X_n = \sum_{i=1}^{n} X_i \sim N(n\mu, \sqrt{n}\,\sigma). \qquad (9.8) $$

That is, add n means and add n variances, so that σ_sum = √(n Var(Xi)) = √n σ.

Need example here.
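One possible example, again as a simulation sketch in R; the values of n, µ and σ are assumptions chosen for illustration.

```r
set.seed(2)

n <- 25       # number of observations in each sum (illustrative)
mu <- 2       # assumed mean of each X_i
sigma <- 3    # assumed standard deviation of each X_i
n_sim <- 20000

# Each row holds one set of n independent observations; rowSums gives one sum per row
obs <- matrix(rnorm(n_sim * n, mean = mu, sd = sigma), nrow = n_sim, ncol = n)
sums <- rowSums(obs)

# Theory (eqn. 9.8): mean n * mu = 50, sd sqrt(n) * sigma = 15
theory <- c(mean = n * mu, sd = sqrt(n) * sigma)
simulated <- c(mean = mean(sums), sd = sd(sums))
print(rbind(theory, simulated))
```

The standard deviation of the sum grows like √n, not like n: here 15 rather than 25 × 3 = 75.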

9.6 Differences of Normal Random Variables

If X1 ∼ N(µ1, σ1) and X2 ∼ N(µ2, σ2) are independent random variables, then

$$ X = X_1 - X_2 \sim N(\mu, \sigma), \qquad (9.9) $$

where µ = µ1 − µ2 and Var(X) = σ² = σ1² + σ2².

Take the difference of the means and add the variances (not the difference of the variances).

Need example here.
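One possible example: suppose two hypothetical machines produce cakes with weights X1 ∼ N(165, 3) and X2 ∼ N(160, 4), independently, and we ask how often a cake from machine 1 outweighs one from machine 2. The machines and their parameters are assumptions made for this sketch.

```r
mu1 <- 165; sigma1 <- 3   # X1 ~ N(165, 3), hypothetical machine 1
mu2 <- 160; sigma2 <- 4   # X2 ~ N(160, 4), hypothetical machine 2

mu_d <- mu1 - mu2                      # difference of the means
sigma_d <- sqrt(sigma1^2 + sigma2^2)   # variances add, even for a difference

# P(X1 > X2) = P(D > 0), with D = X1 - X2 ~ N(mu_d, sigma_d)
p_x1_larger <- pnorm(0, mean = mu_d, sd = sigma_d, lower.tail = FALSE)
print(c(mu_d = mu_d, sigma_d = sigma_d, p = p_x1_larger))
```

Here D ∼ N(5, 5), so P(D > 0) = P(Z > −1) = 1 − Φ(−1), about 0.84.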

9.7 Linear Transformations of Normal Random Variables

If X ∼ N(µ, σ), then

Y = aX + b ∼ N(aµ + b, |a|σ). (9.10)

Need example here.
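One possible example is a change of units. The sketch below assumes a temperature X ∼ N(20, 2) in degrees Celsius, converted to Fahrenheit by Y = 1.8X + 32; the numbers are illustrative, and the standard deviation of Y is taken as |a|σ, which matters only when a is negative.

```r
mu <- 20; sigma <- 2   # X ~ N(20, 2): a temperature in degrees Celsius (illustrative)
a <- 1.8; b <- 32      # Y = aX + b converts to degrees Fahrenheit

mu_y <- a * mu + b         # mean of Y
sigma_y <- abs(a) * sigma  # standard deviation of Y (abs() matters when a < 0)

# P(Y <= 70) computed directly from Y, and via the equivalent event on X
p_from_y <- pnorm(70, mean = mu_y, sd = sigma_y)
p_from_x <- pnorm((70 - b) / a, mean = mu, sd = sigma)
print(c(mu_y = mu_y, sigma_y = sigma_y, p_from_y = p_from_y, p_from_x = p_from_x))
```

The two probabilities agree because the event Y ≤ 70 is the same event as X ≤ (70 − 32)/1.8.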

9.8 The Central Limit Theorem

Why is the Normal distribution (a) so common and (b) so popular amongst statisticians? First, the Central Limit Theorem (CLT) states, roughly speaking, that if a random variable has been created by summing a large number of independent random variables, then the sum will have an approximately Normal distribution. Second, it is popular not just because of its common occurrence but because mathematics involving the distribution, eqn. 9.1, and its multivariate counterpart is in many cases rather easy, or at least a good deal easier than mathematics involving some other distributions.

A compact statement of the CLT, from (Wasserman 2004), is as follows.


Let X1, X2, . . . , Xn be independent and identically distributed r.v.'s with mean µ and standard deviation σ. Let X̄n = (1/n) Σ_{i=1}^{n} Xi. Then, as n → ∞,

$$ Z_n = \frac{\bar{X}_n - \mu}{\sqrt{\mathrm{Var}(\bar{X}_n)}} = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \to Z, \qquad (9.11) $$

where Z ∼ N(0, 1).
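The CLT can be illustrated by simulation. The sketch below uses a decidedly non-Normal starting point, the Uniform(0, 1) distribution (mean 1/2, variance 1/12); the choice of distribution, sample size and number of repetitions are assumptions made for illustration.

```r
set.seed(3)

n <- 30         # sample size for each mean
n_sim <- 10000  # number of repeated samples
mu <- 0.5              # mean of a Uniform(0, 1) random variable
sigma <- sqrt(1 / 12)  # its standard deviation

# Each row is one sample of n Uniform(0, 1) values; rowMeans gives the sample means
x_bar <- rowMeans(matrix(runif(n_sim * n), nrow = n_sim, ncol = n))

# Standardise as in eqn. 9.11
z_n <- (x_bar - mu) / (sigma / sqrt(n))

# If Z_n is close to N(0, 1), about 95% of values should lie within +/- 1.96
frac_within <- mean(abs(z_n) < 1.96)
print(c(mean = mean(z_n), sd = sd(z_n), frac_within = frac_within))
```

A histogram of z_n, e.g. hist(z_n, breaks = 50), should look very much like Figure 9.1.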


Chapter 10

Statistical Inference

10.1 Introduction

We use the Normal distribution, eqn. 6.17, repeated here, to introduce statistical inference.

$$ f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left[\frac{x-\mu}{\sigma}\right]^2\right), \quad -\infty < x < \infty. \qquad (10.1) $$

We may write fX as fX(x ;µ, σ) or fX(x ; θ1, θ2), where θ1, θ2 are parameters. We may think of a

family of Normal distributions, N, parametrised or labelled or indexed by θ1, θ2.

Let us say we have performed an experiment and have collected a sample, x1, x2, . . . , xn, of a random variable X; we assume that X ∼ N(µ, σ) but we do not know one or other (or both) of the parameters.

Point Estimation Parameter estimation is concerned with estimating parameters. A point estimate for, say, µ is an approximate value µ̂ computed from the sample. Typically, in addition to the estimate, µ̂, we give some qualification such as the variance of the estimate, that is, an indication of how variable we think µ̂ might be if we repeated the experiment a number of times.

Interval Estimation An interval estimate (set estimate, confidence interval) for, say, µ is an interval [µ1, µ2] computed from the sample which we claim contains the real µ. Typically, we give some indication of how plausible the interval is in the form of some sort of probability value.

Hypothesis Testing A typical hypothesis testing example is when a scientist needs to test the

efficacy of a new method.

An experiment is performed where there are two methods, M1 and M2. Often, M1 is a control (say the old method) and M2 is the new method whose efficacy we wish to test.

Let us keep the hypothesis simple by assuming that we wish to test whether M2 will give a better yield than M1.


Chapter 11

Statistical Estimation

11.1 Introduction

When we state for example X ∼ fX(x ; θ1, θ2), we indicate that the distribution depends on parame-

ters θ1, θ2. For example, we may think of a family of Normal distributions, N(θ1, θ2), parametrised

or labelled or indexed by θ1 = µ, θ2 = σ.

11.2 Populations and Samples

When we quote values of parameters, for example the mean and standard deviation of a Normally

distributed r.v., X ∼ N(µ, σ), we are talking about population parameters.

Let us collect a sample of random variables X, x1, x2, . . . , xn; we assume that X ∼ N(µ, σ) but we

do not know either of the parameters. We must estimate them and an obvious first attempt is to

use sample mean and standard deviation.

Note the difference: population versus sample. A population includes all possible random variables;

a sample contains, well, a sample taken from the population. If you wanted a quick estimate of

the mean salary of lecturers in the college, you could ask a number of lecturers you know and take

the average of that sample.

The Human Resources Department could give you an exact figure, because they have the data for the (complete) population of N lecturers. They would compute the true population parameters as,

$$ \mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad (11.1) $$

$$ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2. \qquad (11.2) $$

You could imagine that the larger your sample, the better the sample mean would approximate the

population mean.


Random Sample However, apart from being small, a sample made up of lecturers you know could suffer from another source of inaccuracy, namely that the sample is not random and so may contain a bias because, for example, the lecturers in your sample tend to be younger.

By random sample we mean that each member of the population has an equal chance of being

sampled. Achieving a random sample is not always easy, see Chapter 13.

11.3 Estimating the Mean

A point estimate for, say, µ is an approximate value µ̂ computed from the sample. Typically, in addition to the estimate, µ̂, we give some qualification such as the variance of the estimate, that is, an indication of how variable we think µ̂ might be if we repeated the experiment a number of times. The hat symbol, as in θ̂, is used to indicate that we have an estimate of θ.

The most obvious estimate for µ is to copy eqn. 11.1, noting that we use capital N for the size of the population and lower-case n for the size of the sample,

$$ \hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i. \qquad (11.3) $$

In this context the bar, as in x̄ (x bar), indicates mean or average.

Need example here.
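One possible example, using a small made-up sample of cake weights (the data are invented for illustration): the estimate of eqn. 11.3 can be computed by hand or with R's built-in mean().

```r
# A small, made-up sample of n = 8 cake weights in grams
x <- c(163.2, 166.1, 164.8, 168.0, 161.5, 165.4, 167.3, 164.1)

n <- length(x)
mu_hat_manual <- sum(x) / n  # eqn. 11.3 written out
mu_hat <- mean(x)            # the built-in equivalent

print(c(manual = mu_hat_manual, builtin = mu_hat))
```

Both give x̄ = 1320.4/8 = 165.05 grams.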

11.4 Estimating the Standard Deviation

The “best” estimate for σ is less obvious and eqn. 11.2 is modified slightly to,

$$ \hat{\sigma}^2 = s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2. \qquad (11.4) $$

Thus, we not only replace µ by its estimate, x̄, we also divide by n − 1 instead of n. It is usual to use s² to denote the sample variance.

The reason for the n − 1 is that dividing by n would generally lead to a systematic underestimate

— a so-called bias. This may be discussed in a later chapter; (reference it if we do).

Need example here.
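One possible example, using a small made-up sample of cake weights (invented for illustration). R's built-in var() and sd() use the n − 1 divisor of eqn. 11.4; the version with divisor n is computed alongside for comparison.

```r
# A small, made-up sample of n = 8 cake weights in grams
x <- c(163.2, 166.1, 164.8, 168.0, 161.5, 165.4, 167.3, 164.1)
n <- length(x)
x_bar <- mean(x)

s2_manual <- sum((x - x_bar)^2) / (n - 1)  # eqn. 11.4, divisor n - 1
s2_builtin <- var(x)                       # R's var() also uses n - 1
s2_biased <- sum((x - x_bar)^2) / n        # divisor n: always smaller than s2_manual

print(c(s2 = s2_manual, s = sqrt(s2_manual), s2_biased = s2_biased))
```

The estimate of the standard deviation is s = sqrt(s2_manual), which is what sd(x) returns.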


11.5 Sampling Distributions

11.5.1 Sampling Distribution of the mean

The estimate of the mean given by eqn. 11.3 is itself a random variable; we can imagine taking m samples, each of size n, and each of these yielding an x̄_j for j = 1, 2, . . . , m. Then

$$ E(\bar{x}) = \mu, \qquad (11.5) $$

$$ \mathrm{Var}(\bar{x}) = \sigma^2/n. \qquad (11.6) $$

Therefore, the standard deviation of the estimate of the mean is σ/√n. We already encountered this in section 9.5 and eqn. 9.8.

Eqns. 11.5 and 11.6 are both rather comforting: (a) the expected value of x̄ is µ, and (b) the standard deviation of x̄ is σ/√n, that is, as n increases the standard deviation decreases, and it will decrease to zero as n → ∞.

Finally, we can state that the sampling distribution of µ̂ = x̄ is N(µ, σ/√n). This means that if we conduct a number of sample experiments (take a sample of n Xs and compute the mean x̄), then x̄ will be found to have a Normal distribution centred on the true mean µ.

We note emphatically that we do not know µ. In the first part of the discussion below, we assume

that σ2 is known. However, this is typically untrue, and we must use an estimate for the standard

deviation, as in eqn. 11.4.

Figure 11.1 (Maindonald & Braun 2007, p. 103) shows two sampling distributions, for a random

variable X which has µX = 10, σ = 1; Figure 11.1(a) shows the sampling distribution for a sample

size of n = 4, while Figure 11.1(b) shows the sampling distribution for a sample size of n = 9; the

distribution of X, corresponding to a sample size of n = 1 is shown for comparison.

The useful formula now is, including standardisation: if the estimator for µ (unknown) is x̄ and σ is known, then

$$ \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1). \qquad (11.7) $$

On the other hand, if σ is unknown, and we must replace σ with an estimate, s, see eqn. 11.4, then

$$ \frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{n-1}, \qquad (11.8) $$

where t_{n−1} is the Student t distribution with n − 1 degrees of freedom; see section 6.22. As with N(0, 1), we have tables for the t distribution.
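In R, the tables are replaced by qnorm() and qt(). The following sketch compares the upper 2.5% cutoff for N(0, 1) with that for the t distribution at a few illustrative degrees of freedom (the particular values of df are chosen here only for illustration).

```r
# Upper 2.5% cutoffs: Normal (sigma known) versus Student t (sigma estimated by s)
z_cut <- qnorm(0.975)

df <- c(4, 8, 30, 100)      # degrees of freedom n - 1, illustrative sample sizes
t_cut <- qt(0.975, df = df)

print(round(z_cut, 3))
print(round(setNames(t_cut, paste0("df=", df)), 3))
```

The t cutoffs are larger than 1.96, reflecting the extra uncertainty from estimating σ, and they approach 1.96 as the degrees of freedom increase.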


Figure 11.1: (a) Sampling distribution for a sample size of n = 4; (b) sampling distribution for a

sample size of n = 9; the distribution of X, corresponding to a sample size of n = 1 is shown for

comparison.

11.5.2 Sampling Distribution for Estimates of the Standard Deviation

If the estimator for σ (unknown) is s, see eqn. 11.4, and µ is also unknown, with estimate x̄, then

$$ \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}, \qquad (11.9) $$

where χ²_n is the Chi-squared distribution with n degrees of freedom; see section 6.4.10. As with N(0, 1) and t_ν, we have tables for the χ²_n distribution.
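Eqn. 11.9 can be checked by simulation. This is a sketch only: the sample size, the true σ and the number of repetitions are assumptions, and the check relies on the standard facts that a Chi-squared variable with ν degrees of freedom has mean ν and variance 2ν.

```r
set.seed(4)

n <- 10        # sample size (illustrative)
sigma <- 2     # assumed true standard deviation
n_sim <- 20000

# For each simulated sample compute (n - 1) s^2 / sigma^2, as in eqn. 11.9
stat <- replicate(n_sim, (n - 1) * var(rnorm(n, mean = 5, sd = sigma)) / sigma^2)

# A chi-squared variable with n - 1 degrees of freedom has mean n - 1 and variance 2(n - 1)
print(c(mean = mean(stat), var = var(stat)))

# Compare an empirical quantile with the chi-squared quantile from qchisq()
print(c(simulated = unname(quantile(stat, 0.95)), theory = qchisq(0.95, df = n - 1)))
```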


11.6 Confidence Intervals

In section 11.5.1 we established that the distribution of the sample mean is x̄ ∼ N(µ, σ/√n) or, equivalently, eqn. 11.7, (x̄ − µ)/(σ/√n) ∼ N(0, 1). This tells us that the estimate has a distribution that is centred on the mean, that the expected value of the estimate is the mean, and that the distribution will have a standard deviation (spread) of σ/√n.

Thus, referring to Figure 11.1(a), we can say that the mean of x̄4 is µ, the true mean (which we do not know), and that estimates from different samples would vary between about 1.5σ above and below the true mean (three standard deviations of x̄4, each σ/√4 = 0.5σ). Hence if the true mean is 10 as in the diagram, and we kept repeating our sampling experiment, we would expect the estimate x̄4 to vary between about 8.5 and 11.5.

On the other hand, if we used sample size n = 9, we would expect the estimate x̄9 to vary between about 9.0 and 11.0, see Figure 11.1(b).

The previous few sentences should be suggesting that we should be able to give a plausible interval

estimate such as we estimate that the mean is between 9 and 11, together with a probability for

that assertion, e.g. about 0.95 as discussed in section 9.3 for P (−1.96 < Z < +1.96). But

unfortunately we cannot, for we do not know the true mean.

What can we say? Well, for example, that P(−1.96 < (x̄ − µ)/(σ/√n) < +1.96) = 0.95. Still not much good, for we do not know µ, and we must be satisfied with the less useful statement that the estimate x̄ is within plus-or-minus 1.96 × σ/√n of µ, with a probability of 0.95.

More explanation may be needed. What if x̄ is at one of these extremes, namely µ − 1.96 × σ/√n? This would correspond to about 9 in Figure 11.1(a). We can then say that x̄ + 1.96 × σ/√n just about reaches up to µ. If we repeat the sampling, x̄ + 1.96 × σ/√n will reach at least up to µ with probability 1 − 0.025, since the amount of probability up to Z = −1.96 is 0.025.

Similarly, take the case that x̄ is at the other extreme, namely µ + 1.96 × σ/√n; this would correspond to about 11 in Figure 11.1(a). We can now say that x̄ − 1.96 × σ/√n just about reaches down to µ. If we repeat the sampling, x̄ − 1.96 × σ/√n will reach at least down to µ with probability 1 − 0.025 (recall the symmetry argument in section 9.3).

Consequently, if we take x̄ ± 1.96 × σ/√n we can say that this interval will capture µ with probability 0.95.

This allows us to construct a confidence interval which we can claim contains µ; that is, we compute

not µ, but (L, U), an interval between (L)ower and (U)pper limits which we believe contains µ.

In the case of confidence (probability) 0.95 = 95%, we can compute

$$ (L, U) = \left(\bar{x} - 1.96 \times \frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96 \times \frac{\sigma}{\sqrt{n}}\right). \qquad (11.10) $$
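As a sketch of eqn. 11.10 in R, consider a made-up sample of n = 9 measurements with σ assumed known to be 1 (both the data and σ are invented for illustration). qnorm(0.975) supplies the 1.96.

```r
# Made-up sample of n = 9 measurements; sigma is assumed known for this sketch
x <- c(9.6, 10.4, 10.9, 9.1, 10.2, 9.8, 10.7, 10.1, 9.9)
sigma <- 1

n <- length(x)
x_bar <- mean(x)
z <- qnorm(0.975)                   # about 1.96 for 95% confidence
half_width <- z * sigma / sqrt(n)

ci <- c(L = x_bar - half_width, U = x_bar + half_width)
print(round(ci, 3))
```

For a different confidence level, say 99%, only the quantile changes: qnorm(0.995) in place of qnorm(0.975).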

Summary on Point Estimation and Confidence Interval for the Mean when Variance Known

Refer to Figure 11.1, part (b) of which is based on a sample size of n = 9.

• If we take a point estimate for the mean, it will be distributed according to the narrow

distribution, i.e. if the true mean is 10, our estimate can be anywhere between 9 and 11.


• If we decide to give an interval estimate, we need to decide on a confidence (probability);

the wider the interval, the greater the confidence we can have in it — but a huge interval

with confidence of 100% is not much use to anyone. The usual confidence that is chosen is

95%.

• We would like to be able to look at Figure 11.1(a) and say that our interval for the mean is 9 to 11 with confidence 95% (based on the diagram this is approximate; with n = 4 and σ = 1 we have σ/√n = 0.5, so 10 − 1.96 × 0.5 to 10 + 1.96 × 0.5 are the precise values for 95%).

But we cannot make a statement like the latter, for we do not know that µ = 10.

• The best we can do is (a) take our estimate, x̄; (b) place a distribution like that in Figure 11.1(b) about it; (c) compute the x̄ ± 1.96 × σ/√n interval (1.96 ≈ 2), eqn. 11.10.

This allows us to state:

if we repeated our sampling a large number of times, and we computed eqn. 11.10 each time

(getting a different interval), then 95% of these intervals would contain the true mean µ.
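This repeated-sampling statement can be tried out by simulation, where (unlike in real life) we know the true mean. The sketch below assumes µ = 10, σ = 1 and n = 9, as in Figure 11.1(b); the number of repetitions is arbitrary.

```r
set.seed(5)

mu <- 10; sigma <- 1   # true values, known only because this is a simulation
n <- 9
n_sim <- 10000
half_width <- 1.96 * sigma / sqrt(n)

# One sample mean per repetition
x_bar <- replicate(n_sim, mean(rnorm(n, mean = mu, sd = sigma)))

# Does each interval (x_bar - hw, x_bar + hw) contain the true mean?
covered <- (x_bar - half_width < mu) & (mu < x_bar + half_width)
coverage <- mean(covered)
print(coverage)
```

The printed proportion should be close to 0.95, although any single interval either contains µ or it does not.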

Excel-Example 1 Need Excel example here.

Need section on t-distribution and small sample sampling distrib. for mean with std.-dev.

unknown.


Chapter 12

Hypothesis Testing

12.1 Introduction

In Chapter 11 we discussed estimation of parameters, both point estimates and interval estimates

(with confidence value attached). This chapter is also based on sampling theory but here we are

interested in decisions rather than estimates. For example, based on a sample of occurrences of

heads and tails in a sample of n = 10 tosses of a coin, we might wish to come to the decision

whether the coin is fair. We might want to decide whether application of a new fertiliser really

does increase cropping yield, based on samples involving (i) the current fertiliser and (ii) the new

one.

The hypothesis testing technique involves the postulation of a hypothesis (an assumption, a state-

ment about population distributions or their parameters) and then designing an experiment which

will yield a sample upon which we can decide whether the hypothesis is true — based on sample

data.

A typical hypothesis test is as follows. We make a hypothesis that a random variable is distributed

according to fX(x), e.g. X ∼ N(µ, σ), where we assume that σ is known.

We identify a null hypothesis, H0 : µ = µ0 and an alternative hypothesis, HA : µ > µ0.

We compute a test statistic (a sample estimate with sample size n), for example µ̂ = X̄n, and reject H0 if X̄n > c, where c is some constant to be determined; X̄n > c is the critical region; X̄n ≤ c is called the acceptance region.

The greater we make c, the stricter the test X̄n > c, i.e. the smaller the probability of rejecting H0 when it is true (the significance level). We can set c using the same considerations we used in setting confidence levels for a confidence interval in section 11.6. As in eqn. 11.7, we know that

$$ Z = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1), \qquad (12.1) $$

so that we can use the Normal cdf, Φ(z), to choose a c′ such that P(Z > c′) = 0.025 = P((X̄n − µ)/(σ/√n) > c′) = P(X̄n > c′σ/√n + µ), say, for a 2.5% significance level, where µ = µ0 under H0. (I have chosen 2.5% = 0.025 because it corresponds to a cutoff point (Z = 1.96) that we have already encountered.)


That is, Z > c′ would occur only 2.5% of the time if H0 is true; in other words the critical region stretches from c′ to the right. The acceptance region stretches to the left of c′, i.e. it includes everywhere that X̄n ≤ c, where c = c′σ/√n + µ.

Recalling P (Z > +1.96) = 0.025, we can set c ′ = 1.96 for a significance level of 0.025.

The latter corresponds to a one sided test.
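The one-sided test can be sketched in R as follows. The hypothesised mean, σ, n and the observed sample mean are all made-up values for illustration; the cutoff is taken on the assumption that µ = µ0 under H0.

```r
# One-sided test of H0: mu = mu0 against HA: mu > mu0, sigma assumed known
mu0 <- 10      # hypothesised mean (illustrative)
sigma <- 1     # assumed known standard deviation
n <- 9         # sample size
alpha <- 0.025

c_prime <- qnorm(1 - alpha)                # cutoff on the Z scale, about 1.96
c_crit <- mu0 + c_prime * sigma / sqrt(n)  # cutoff on the scale of the sample mean

x_bar <- 10.8                              # a made-up observed sample mean
z_obs <- (x_bar - mu0) / (sigma / sqrt(n))
reject_h0 <- x_bar > c_crit                # equivalent to z_obs > c_prime

print(c(c_prime = c_prime, c_crit = c_crit, z_obs = z_obs))
print(reject_h0)
```

Here the observed mean lies in the critical region (z_obs = 2.4 exceeds 1.96), so H0 would be rejected at the 2.5% level.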

The standard Normal pdf and the relevant critical region are shown in Figure 12.1 (Maindonald & Braun 2007, p. 106).

Figure 12.1: One-sided hypothesis test, significance level = 0.025; critical region is shaded to the right of 1.96. For a two-sided test with significance level = 0.05, we include in the critical region also the marked region to the left of −1.96.

Let us keep the original null hypothesis, H0 : µ = µ0, and now choose a different alternative hypothesis, namely HA : µ ≠ µ0. A suitable acceptance region for this might be cl < X̄n < ch, with the critical (rejection) region being all points below cl and all points above ch.

If we now choose a significance level of 0.05, we arrive at the familiar P(Z < −1.96 or Z > +1.96) = 0.05; that is, if we have µ = µ0, then values of Z < −1.96 or Z > +1.96, i.e. (X̄n − µ)/(σ/√n) < −1.96 or (X̄n − µ)/(σ/√n) > +1.96, should occur only 5% of the time, and this is a sufficiently significant deviation for us to reject the null hypothesis.

This is a two sided test.

The significance level, usually denoted α, corresponds to the probability of rejecting H0 when H0

is true, that is, the extreme values in the critical region could occur, but with a small probability,

α.

Table 12.1 shows the possible outcomes of the hypothesis test.

             H0 true                  HA true
Accept H0    correct                  Type 2 error, prob. β
Reject H0    Type 1 error, prob. α    correct

Table 12.1: Outcomes of a hypothesis test.
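The two error probabilities can be estimated by simulation. The sketch below uses a two-sided test at the 5% level with made-up values µ0 = 10, σ = 1, n = 9, and takes µ = 10.5 as one particular alternative when HA is true; the helper rejects() is a name introduced here for illustration.

```r
set.seed(6)

mu0 <- 10; sigma <- 1; n <- 9   # illustrative values; sigma assumed known
n_sim <- 10000

# Returns TRUE when a two-sided test at the 5% level rejects H0: mu = mu0
rejects <- function(true_mu) {
  x_bar <- mean(rnorm(n, mean = true_mu, sd = sigma))
  abs((x_bar - mu0) / (sigma / sqrt(n))) > 1.96
}

alpha_hat <- mean(replicate(n_sim, rejects(true_mu = mu0)))       # H0 true: Type 1 rate
beta_hat  <- 1 - mean(replicate(n_sim, rejects(true_mu = 10.5)))  # HA true: Type 2 rate

print(c(alpha_hat = alpha_hat, beta_hat = beta_hat))
```

Note that β depends on how far the true mean is from µ0; with this small sample and a shift of only 0.5, the test fails to reject H0 most of the time.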


Chapter 13

Sampling

13.1 Introduction

To be completed.


Chapter 14

Classification and Pattern Recognition

14.1 Introduction

The terms classification and pattern recognition are used almost synonymously; statisticians tend to favour classification, while engineers tend to use pattern recognition. This chapter merely introduces the concepts; Chapters 15, 16, 17, 18 and 19 fill in the details.

These chapters are a reworking of some of the basic pattern recognition and neural network material

covered in (Campbell 2005) and (Campbell & Murtagh 1998) and (Campbell 2000).

We define/summarize a pattern recognition system using the block diagram in Figure 14.1.

[Block diagram: sensed data x → Pattern Recognition System (Classifier) → class ω]

Figure 14.1: Pattern recognition system; x a tuple of p measurements, output ω — class label.

Typically textbooks distinguish between supervised classification and unsupervised classification.

Supervised classification Supervised (trained) classification may be posed as a prediction prob-

lem rather like regression. The prediction involves class labels.

We have a set of examples, a sample, which we call training data, XT = {xi, ωi}, i = 1, . . . , n. We learn population parameters from the sample of x's.

Warning: in some classification and pattern recognition literature, the term sample takes on a

different meaning from the standard statistical term — where a statistical sample means a set of

random vectors taken from a population; in the pattern recognition literature a sample may mean

a single random vector, so that a statistical sample will have to be termed a set of samples.

x is the pattern vector; of course, in certain situations x is a simple scalar. ω is the class label, ω ∈ Ω = {ω1, . . . , ωc}.

Then, given an unseen pattern x (a random vector), we predict ω. In general, x = (x0 x1 . . . xp−1)^T, a p-dimensional vector; T denotes transposition.


Unsupervised classification Unsupervised classification is more of an exploratory data analysis

technique than is supervised classification.

In this case we have a set of patterns (random vectors) XT = {xi}, i = 1, . . . , n, and we want to explore structure in the set. For example, are they clustered, thereby suggesting that the clusters identify a number of classes? Clustering involves assigning class labels to the xi based not on training data but on proximity of the x's or some other criterion.


Chapter 15

Simple Classifier Methods

15.1 Thresholding for one-dimensional data

Let us assume that we want to classify a chemical product, for example fake pharmaceutical drugs, according to the results of a chemical analysis. The analysis data comprise a vector x where x1 might be percentage mass of component 1, x2 that of component 2, etc. The label ω might be country of origin, and it is this that we want to predict, given the results x from an analysis of a newly seized batch.

For the moment, we'll assume just two classes ω0 and ω1; two-class problems are easy to describe, and extension to n-class problems is easy.

In our simplistic recognition system we need to recognise two sources, country 0 and country 1, ω0 and ω1. We start off with two components x = (x1 x2)^T.

As described in Chapter 14, we have earlier obtained examples of the drug from both countries, XT = {xi, ωi}, i = 1, . . . , n, i.e. we have training data, or a sample.

Let us see whether we can recognise using component 1 alone (x1). Figure 15.1 shows some (training) data. We see that a threshold (T) set at about x1 = 2.8 is the best we can do; the classification algorithm is:

ω = 1 when x1 ≥ T, (15.1)

= 0 otherwise. (15.2)
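The rule of eqns. 15.1 and 15.2 can be written as a one-line R function. This is a minimal sketch; the function name classify_threshold() and the test measurements are made up for illustration, and the default threshold is the T = 2.8 read from Figure 15.1.

```r
# Threshold classifier of eqns. 15.1 and 15.2: class 1 when x1 >= T, class 0 otherwise
classify_threshold <- function(x1, threshold = 2.8) {
  ifelse(x1 >= threshold, 1, 0)
}

x1_new <- c(1.5, 2.7, 2.8, 4.2)   # made-up component-1 measurements
labels <- classify_threshold(x1_new)
print(data.frame(x1 = x1_new, class = labels))
```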

Use of histograms, see Figure 15.2, might be a more methodical way of determining the threshold, T.

If enough training data were available, n → ∞, the histograms, h0(x1), h1(x1), properly normalised, would approach probability densities p0(x1), p1(x1), more properly called class conditional probability densities (pdfs): p(x1 | ω), ω = 0, 1, see Figure 15.3.

When the random vector is three-dimensional (p = 3) or more, it becomes impractical to estimate the pdfs using histogram binning: there are a great many bins, and most of them contain no data. In such cases it is usual to assume a distribution family, for example Normal, and to represent the class conditional pdfs using parameters estimated from a sample (training data; estimation = training); see Chapter 11.

[Plot: training points labelled 0 and 1 along the x1 axis (1 to 6), with threshold T marked.]

Figure 15.1: Component 1, x1.

[Plot: as Figure 15.1, with frequency histograms h0(x1) and h1(x1) drawn above the x1 axis.]

Figure 15.2: Histogram of component 1, x1.

[Plot: as Figure 15.1, with class conditional densities p(x1 | 0) and p(x1 | 1) drawn above the x1 axis.]

Figure 15.3: Class conditional pdfs.

The use of explicitly statistical methods is described in Chapter 16, but for now we'll try some intuitive methods; as you will see, we are never far from statistics.


15.2 Linear separating lines/planes for two-dimensions

Since there is overlap in the component-1, x1, measurement, let us use the two components,

x = (x1x2)T , i.e. (component-1, component-2). Figure 15.4 shows a scatter plot of these data

(the sample).

[Scatter plot: x2 (vertical axis, 1 to 5) against x1 (horizontal axis, 1 to 6); points labelled 0 and 1, with a dotted separating line.]

Figure 15.4: Two dimensions, scatter plot.

The dotted line shows that the data are separable by a straight line; it intercepts the axes at

x1 = 4.5 and x2 = 6.

Apart from plotting the data and drawing the line, how could we derive the separating line from the data? (Thinking of a computer program.)

15.3 Nearest mean classifier

First we estimate the class conditional means µ0 = E(x | ω = ω0) and µ1 = E(x | ω = ω1).

Figure 15.5 shows the line joining the class means and the perpendicular bisector of this line; the

perpendicular bisector turns out to be the separating line. We can derive the equation of the

separating line using the fact that points on it are equidistant to both means, µ0, µ1, and expand

using Pythagoras’s theorem,

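$$ |\mathbf{x} - \boldsymbol{\mu}_0|^2 = |\mathbf{x} - \boldsymbol{\mu}_1|^2, \qquad (15.3) $$

$$ (x_1 - \mu_{01})^2 + (x_2 - \mu_{02})^2 = (x_1 - \mu_{11})^2 + (x_2 - \mu_{12})^2. \qquad (15.4) $$

We eventually obtain

$$ 2(\mu_{01} - \mu_{11})x_1 + 2(\mu_{02} - \mu_{12})x_2 - (\mu_{01}^2 + \mu_{02}^2 - \mu_{11}^2 - \mu_{12}^2) = 0, \qquad (15.5) $$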

which is of the form

$$ b_1 x_1 + b_2 x_2 - b_0 = 0. \qquad (15.6) $$

[Scatter plot: as Figure 15.4, with the class means, the line joining them, and its perpendicular bisector (the separating line).]

Figure 15.5: Two dimensional scatter plot showing means and separating line.

In Figure 15.5, µ01 = 4, µ02 = 3, µ11 = 2, µ12 = 1.5; with these values, eqn 15.6 becomes

4x1 + 3x2 − 18.75 = 0, (15.7)

which intercepts the x1 axis at 18.75/4 ≈ 4.7 and the x2 axis at 18.75/3 = 6.25.
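The coefficients of the separating line, and the nearest mean rule itself, can be sketched in R as follows. Expanding the squares in |x − µ0|² = |x − µ1|² gives linear coefficients 2(µ0 − µ1) and constant |µ0|² − |µ1|², which reproduces the 4, 3 and 18.75 above; the function name nearest_mean() and the two test points are made up for illustration.

```r
# Class means as read from Figure 15.5
mu0 <- c(4, 3)
mu1 <- c(2, 1.5)

# Expanding |x - mu0|^2 = |x - mu1|^2 gives b1*x1 + b2*x2 - b0 = 0 with:
b <- 2 * (mu0 - mu1)             # (b1, b2)
b0 <- sum(mu0^2) - sum(mu1^2)    # b0
print(c(b1 = b[1], b2 = b[2], b0 = b0))

# Nearest mean classifier: pick the class whose mean is closer (squared distance)
nearest_mean <- function(x) {
  d0 <- sum((x - mu0)^2)
  d1 <- sum((x - mu1)^2)
  if (d0 < d1) 0 else 1
}

print(nearest_mean(c(5, 4)))   # a point near mu0
print(nearest_mean(c(1, 1)))   # a point near mu1
```

In a real application mu0 and mu1 would be estimated from the training data, e.g. with colMeans() applied to the patterns of each class.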

15.4 Normal form of the separating line, projections, and linear

discriminants

Eqn 15.6 becomes more interesting and useful in its normal form,

a1x1 + a2x2 − a0 = 0, (15.8)

where a1² + a2² = 1; eqn 15.8 can be obtained from eqn 15.6 by dividing across by √(b1² + b2²).

Figure 15.6 shows interpretations of the normal form straight line equation, eqn 15.8. The coefficients form the unit vector normal to the line, n = (a1 a2)^T, and a0 is the perpendicular distance from the origin to the line. Incidentally, the components correspond to the direction cosines of n = (a1 a2)^T = (cos θ  sin θ)^T. Thus (Foley, van Dam, Feiner, Hughes & Phillips 1994), n corresponds to one row of a (frame) rotation matrix; in other words (see below, section 15.5), the dot product of the vector expression of a point with n corresponds to projection onto n. (Note that cos(π/2 − θ) = sin θ.)
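As a sketch, the normal form of the separating line 4x1 + 3x2 − 18.75 = 0 (eqn. 15.7) can be computed in R; the variable names are illustrative.

```r
# Separating line of eqn. 15.7: 4*x1 + 3*x2 - 18.75 = 0
b <- c(4, 3)
b0 <- 18.75

norm_b <- sqrt(sum(b^2))   # sqrt(b1^2 + b2^2) = 5
a <- b / norm_b            # unit normal vector (a1, a2)
a0 <- b0 / norm_b          # perpendicular distance from the origin to the line

theta <- atan2(a[2], a[1]) # angle of the normal, so that a = (cos(theta), sin(theta))
print(c(a1 = a[1], a2 = a[2], a0 = a0, theta_deg = theta * 180 / pi))
```

This gives a1 = 0.8, a2 = 0.6 and a0 = 3.75, i.e. the line lies 3.75 units from the origin along the direction n.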

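[Diagram: the line a1x1 + a2x2 − a0 = 0 with unit normal vector (a1, a2) at angle θ, perpendicular distance a0 from the origin, and axis intercepts a0/a1 and a0/a2; a point (x1′, x2′) on the side to which the normal points has a1x1′ + a2x2′ − a0 > 0, and a point (x1′′, x2′′) on the other side has a1x1′′ + a2x2′′ − a0 < 0.]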

Figure 15.6: Normal form of a straight line, interpretations.

Also as shown in Figure 15.6, points x = (x1x2)T on the side of the line to which n = (a1a2)T

points have a1x1 + a2x2 − a0 > 0, whilst points on the other side have a1x1 + a2x2 − a0 < 0; as we

know, points on the line have a1x1 + a2x2 − a0 = 0.

15.5 Projection and linear discriminant

We know that a1x1 + a2x2 = n^T x, the dot product of n = (a1 a2)^T and x, represents the projection of points x onto n, yielding the scalar value along n, with a0 fixing the origin. This is plausible: projecting onto n yields optimum separability.

Such a projection,

g(x) = a1x1 + a2x2, (15.9)

is called a linear discriminant; now we can adapt eqns. 15.1 and 15.2,

ω = 0 when g(x) > a0, (15.10)

= 1, g(x) < a0, (15.11)

= tie, g(x) = a0. (15.12)

Linear discriminants, eqn. 15.9, are often written as

g(x) = a1x1 + a2x2 − a0, (15.13)

whence eqns. 15.10 to 15.12 become

ω = 0 when g(x) > 0, (15.14)

= 1, g(x) < 0, (15.15)

= tie, g(x) = 0. (15.16)
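A minimal R sketch of this decision rule follows, using the normal-form coefficients of the line 4x1 + 3x2 − 18.75 = 0 (divide through by 5); the function names g() and classify() are chosen here for illustration.

```r
# Normal-form coefficients for the line 4*x1 + 3*x2 - 18.75 = 0 (divide by 5)
a <- c(0.8, 0.6)
a0 <- 3.75

# Linear discriminant in the form of eqn. 15.13
g <- function(x) sum(a * x) - a0

# Decision rule of eqns. 15.14 to 15.16
classify <- function(x) {
  gx <- g(x)
  if (gx > 0) "class 0" else if (gx < 0) "class 1" else "tie"
}

print(classify(c(4, 3)))      # the class-0 mean: g = 1.25
print(classify(c(2, 1.5)))    # the class-1 mean: g = -1.25
```

Because n has unit length, g(x) is the signed perpendicular distance of x from the separating line; the two class means are each 1.25 units from it, on opposite sides.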

15.6 Projections and linear discriminants in p dimensions

Equation 15.13 readily generalises to p dimensions: n is a unit vector in p-dimensional space, normal to the (p − 1)-dimensional separating hyperplane. For example, when p = 3, n is the unit vector normal to the separating plane.

Other important projections used in pattern recognition are Principal Components Analysis (PCA) and Fisher's Linear Discriminant Analysis (LDA), see Chapter 17.

15.7 Template Matching and Discriminants

An intuitive (but well founded) classification method is that of template matching or correlation matching. Here we have perfect or average examples of classes stored in vectors {zj}, j = 1, . . . , c, one for each class. Without loss of generality, we assume that all vectors are normalised to unit length.

Classification of a newly arrived vector x entails computing its template/correlation match with all c templates:

x^T zj; (15.17)

class ω is chosen as the j corresponding to the maximum of eqn. 15.17.

Yet again we see that classification involves dot product, projection, and a linear discriminant.
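Template matching can be sketched in a few lines of R. The three templates and the test vectors below are made up for illustration, and unit() and match_template() are names introduced here, not part of any library.

```r
# Scale a vector to unit length
unit <- function(v) v / sqrt(sum(v^2))

# Made-up templates, one column per class, each normalised to unit length
templates <- cbind(class1 = unit(c(1, 0, 0)),
                   class2 = unit(c(1, 1, 0)),
                   class3 = unit(c(0, 1, 2)))

# Template match x^T z_j for every class j; choose the class with the largest match
match_template <- function(x) {
  scores <- drop(t(templates) %*% unit(x))
  names(which.max(scores))
}

print(match_template(c(0.9, 0.1, 0.0)))
print(match_template(c(0.1, 0.8, 1.9)))
```

With unit-length vectors the match x^T zj is the cosine of the angle between x and the template, so the chosen class is the one whose template points in the most similar direction.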

15.8 Nearest neighbour methods

Obviously, we may not always have the linear separability of Figure 15.5. One non-parametric

method is to go beyond nearest mean, see eqn. 15.4, to compute the nearest neighbour in the

entire training data set, and to decide class according to the class of the nearest neighbour.

A variation is k-nearest neighbour, where a vote is taken over the classes of the k nearest neighbours.
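A minimal sketch of the k-nearest neighbour vote follows. The training points and the function name are hypothetical, squared Euclidean distance is assumed as the distance measure, and an odd k is advisable for two classes because ties are resolved arbitrarily here.

```python
from collections import Counter

def knn_classify(x, training, k=1):
    """Classify x by majority vote over its k nearest training points.

    training is a list of (feature_vector, class_label) pairs. Squared
    Euclidean distance gives the same ordering as Euclidean distance.
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = sorted(training, key=lambda item: sq_dist(x, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-class training set.
training = [([1.0, 1.0], 0), ([1.5, 2.0], 0), ([2.0, 1.0], 0),
            ([5.0, 5.0], 1), ([6.0, 5.5], 1), ([5.5, 6.5], 1)]

print(knn_classify([1.2, 1.1], training, k=1))
print(knn_classify([5.4, 5.6], training, k=3))
```

With k = 1 this is the plain nearest neighbour rule; no density model or class mean is estimated at any point.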


Chapter 16

Statistical Classifier Methods

16.1 One-dimensional classification revisited

Recall Figure 15.3, repeated here as Figure 16.1.

[Figure: samples of class 0 and class 1 along the x1 axis (1 to 6), with the densities p(x1 | 0) and p(x1 | 1) and the threshold T.]

Figure 16.1: Class conditional densities.

We have class conditional pdfs: $p(x_1 \mid \omega)$, $\omega = 0, 1$; given a newly arrived $x'_1$ we might decide on its class according to the maximum class conditional pdf at $x'_1$, i.e. set a threshold $T$ where $p(x_1 \mid 0)$ and $p(x_1 \mid 1)$ cross, see Figure 16.1.

This is not completely correct. What we want is the probability of each class — its posterior

probability — based on the evidence supplied by the data, combined with any prior evidence.

In what follows, $P(\omega_i \mid x)$ is the posterior probability or a posteriori probability of class $\omega_i$ given the observation $x$; $P(\omega_i)$ is the prior probability or a priori probability. We use upper case $P(.)$ for discrete probabilities, whilst lower case $p(.)$ denotes probability densities.


Bayes’ Rule. Recall Bayes’ rule from eqn. 5.22, repeated here,

$$P(A_i \mid B) = \frac{P(B \mid A_i)P(A_i)}{\sum_{k=1}^{n} P(B \mid A_k)P(A_k)}. \tag{16.1}$$

This says that the posterior probability of $A_i$ given $B$ (conditional on $B$ having occurred) is the product of the conditional probability of $B$ given $A_i$ and the prior probability of $A_i$, all divided by $P(B) = \sum_{k=1}^{n} P(B \mid A_k)P(A_k)$.

We can rewrite eqn. 16.1 in terms of our random variable $x$ ($= B$) and our classes $\omega_0, \omega_1$ ($= A_i$, $i = 0, 1$) to get

$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)P(\omega_i)}{\sum_{k=0}^{1} P(x \mid \omega_k)P(\omega_k)}. \tag{16.2}$$

$P(\omega_i \mid x)$ is the posterior probability of class $\omega_i$ given that our analysis has yielded $x$; $P(\omega_i)$ is the prior probability; if we have no prior preference, then $P(\omega_0) = 0.5$, $P(\omega_1) = 0.5$.

Eqn. 16.2 forms a Bayes decision rule: compute the two posterior probabilities and take the class

which has the maximum.
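As a hedged illustration of this decision rule, the sketch below evaluates eqn. 16.2 for two classes whose class conditional densities are assumed to be one-dimensional Gaussians. The means, standard deviations and priors are hypothetical values chosen only to show how the prior can move the decision.

```python
import math

def gaussian_pdf(x, mean, sd):
    """One-dimensional normal density."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def posteriors(x, params, priors):
    """Return P(w_i | x) for each class; params holds (mean, sd) per class."""
    joint = [gaussian_pdf(x, m, s) * p for (m, s), p in zip(params, priors)]
    evidence = sum(joint)                 # denominator of eqn. 16.2
    return [j / evidence for j in joint]

def bayes_decide(x, params, priors):
    """Index of the class with the maximum posterior probability."""
    post = posteriors(x, params, priors)
    return max(range(len(post)), key=post.__getitem__)

params = [(2.0, 1.0), (4.0, 1.0)]         # hypothetical class 0 and class 1

print(posteriors(3.0, params, [0.5, 0.5]))    # equal posteriors at the crossing point
print(bayes_decide(2.5, params, [0.5, 0.5]))  # class 0 with equal priors
print(bayes_decide(2.5, params, [0.1, 0.9]))  # a strong prior for class 1 changes the decision
```

With equal priors the decision reduces to comparing the class conditional densities, which is the threshold rule of Figure 16.1; unequal priors shift the threshold.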

Let the Bayes decision rule be represented by a function g(.) of the feature vector x:

$$g(x) = \arg\max_{\omega_j \in \Omega} \left[ P(\omega_j \mid x) \right] \tag{16.3}$$

To show that the Bayes decision rule, eqn. 16.3, achieves the minimum probability of error, we

compute the probability of error conditional on the feature vector x — the conditional risk —

associated with it:

$$R(g(x) = \omega_j \mid x) = \sum_{k=1, k \neq j}^{c} P(\omega_k \mid x). \tag{16.4}$$

That is to say, for the point $x$ we sum the posterior probabilities of all the $c - 1$ classes not chosen.

Since $\Omega = \{\omega_1, \ldots, \omega_c\}$ form a partition (they are mutually exclusive and exhaustive) and the $\{P(\omega_k \mid x)\}_{k=1}^{c}$ are probabilities and so sum to unity, eqn. 16.4 reduces to:

$$R(g(x) = \omega_j \mid x) = 1 - P(\omega_j \mid x). \tag{16.5}$$

It immediately follows that, to minimise R(g(x) = ωj), we maximise P (ωj | x), thus establishing

the optimality of eqn. 16.3.

The problem now is to determine P (ω | x) which brings us to Bayes’ rule.

16.2 Bayes’ Rule for the Inversion of Conditional Probabilities


From the definition of conditional probability, we have:

p(ω, x) = P (ω | x)p(x), (16.6)


and, owing to the fact that the events in a joint probability are interchangeable, we can equate the

joint probabilities :

p(ω, x) = p(x, ω) = p(x | ω)P (ω). (16.7)

Therefore, equating the right hand sides of these equations, and rearranging, we arrive at Bayes’

rule for the posterior probability P (ω | x):

$$P(\omega \mid x) = \frac{p(x \mid \omega)P(\omega)}{p(x)}. \tag{16.8}$$

$P(\omega)$ expresses our belief that $\omega$ will occur, prior to any observation. If we have no prior knowledge, we can assume equal priors for each class: $P(\omega_1) = P(\omega_2) = \ldots = P(\omega_c)$, $\sum_{j=1}^{c} P(\omega_j) = 1$. Although

we avoid further discussion here, we note that the matter of choice of prior probabilities is the

subject of considerable discussion especially in the literature on Bayesian inference, see, for example,

(Sivia 1996).

p(x) is the unconditional probability density of x, and can be obtained by summing the conditional

densities:

$$p(x) = \sum_{j=1}^{c} p(x \mid \omega_j)P(\omega_j). \tag{16.9}$$

Thus, to solve eqn. 16.8, it remains to estimate the conditional densities.

16.3 Parametric Methods

Where we can assume that the densities follow a particular form, for example Gaussian, the density

estimation problem is reduced to that of estimation of parameters.

The multivariate normal density, see section B.7, $p$-dimensional, is given by:

$$p(x \mid \omega_j) = \frac{1}{(2\pi)^{p/2} |K_j|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu_j)^T K_j^{-1} (x - \mu_j)\right] \tag{16.10}$$

$p(x \mid \omega_j)$ is completely specified by $\mu_j$, the $p$-dimensional mean vector, and $K_j$, the corresponding $p \times p$ covariance matrix:

$$\mu_j = E[x]_{\omega = \omega_j}, \tag{16.11}$$

$$K_j = E[(x - \mu_j)(x - \mu_j)^T]_{\omega = \omega_j}. \tag{16.12}$$

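The respective estimates (the maximum likelihood estimate of the mean and the unbiased sample covariance; the maximum likelihood covariance estimate would divide by $N_j$ rather than $N_j - 1$) are:

$$\hat{\mu}_j = \frac{1}{N_j} \sum_{n=1}^{N_j} x_n, \tag{16.13}$$

and,

$$\hat{K}_j = \frac{1}{N_j - 1} \sum_{n=1}^{N_j} (x_n - \hat{\mu}_j)(x_n - \hat{\mu}_j)^T, \tag{16.14}$$

where we have separated the training data $X_T = \{x_n, \omega_n\}_{n=1}^{N}$ into sets according to class, so that each sum runs over the $N_j$ vectors of class $\omega_j$.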
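The estimates of eqns. 16.13 and 16.14 might be computed as sketched below. The function names and the tiny labelled data set are hypothetical; the covariance uses the $N_j - 1$ divisor of eqn. 16.14, so each class needs at least two samples.

```python
def class_mean(samples):
    """Sample mean vector, eqn. 16.13."""
    n = len(samples)
    p = len(samples[0])
    return [sum(x[i] for x in samples) / n for i in range(p)]

def class_covariance(samples, mean):
    """Sample covariance matrix with divisor N_j - 1, eqn. 16.14."""
    n = len(samples)
    p = len(mean)
    cov = [[0.0] * p for _ in range(p)]
    for x in samples:
        d = [x[i] - mean[i] for i in range(p)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j]
    return [[c / (n - 1) for c in row] for row in cov]

def fit_by_class(data):
    """Split (x, label) pairs by class and return {label: (mean, covariance)}."""
    groups = {}
    for x, label in data:
        groups.setdefault(label, []).append(x)
    model = {}
    for label, xs in groups.items():
        m = class_mean(xs)
        model[label] = (m, class_covariance(xs, m))
    return model

# Hypothetical training data: three 2-dimensional vectors per class.
data = [([1.0, 2.0], 0), ([3.0, 4.0], 0), ([5.0, 6.0], 0),
        ([0.0, 0.0], 1), ([2.0, 0.0], 1), ([1.0, 3.0], 1)]
model = fit_by_class(data)
print(model[0])
print(model[1])
```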


16.4 Discriminants based on Normal Density

We may write eqn. 16.8 as a discriminant function:

$$g_j(x) = P(\omega_j \mid x) = \frac{p(x \mid \omega_j)P(\omega_j)}{p(x)}, \tag{16.15}$$

so that classification, eqn. 16.3, becomes a matter of assigning $x$ to class $\omega_j$ if,

$$g_j(x) > g_k(x), \quad \forall\, k \neq j. \tag{16.16}$$

Since p(x), the denominator of eqn. 16.15 is the same for all gj(x) and since eqn. 16.16 involves

comparison only, we may rewrite eqn. 16.15 as

gj(x) = p(x | ωj)P (ωj). (16.17)

We may derive a further possible discriminant by taking the logarithm of eqn. 16.17 — since

logarithm is a monotonically increasing function, application of it preserves relative order of its

arguments:

gj(x) = log p(x | ωj) + log P (ωj). (16.18)

In the multivariate Gaussian case, eqn. 16.18 becomes (Duda & Hart 1973),

$$g_j(x) = -\frac{1}{2}(x - \mu_j)^T K_j^{-1}(x - \mu_j) - \frac{p}{2}\log 2\pi - \frac{1}{2}\log|K_j| + \log P(\omega_j) \tag{16.19}$$

Henceforth, we refer to eqn. 16.19 as the Bayes-Gauss classifier.
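To make eqn. 16.19 concrete, here is a small sketch restricted to $p = 2$ so that the matrix inverse and determinant can be written out by hand. The class parameters are hypothetical, and a practical implementation would use a linear algebra library rather than this explicit 2 × 2 inverse.

```python
import math

def inv2(k):
    """Inverse and determinant of a 2 x 2 matrix."""
    (a, b), (c, d) = k
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det

def bayes_gauss(x, mean, cov, prior):
    """Evaluate the discriminant of eqn. 16.19 for p = 2."""
    k_inv, det = inv2(cov)
    d = [x[0] - mean[0], x[1] - mean[1]]
    quad = sum(d[i] * k_inv[i][j] * d[j] for i in range(2) for j in range(2))
    p = 2
    return (-0.5 * quad - 0.5 * p * math.log(2 * math.pi)
            - 0.5 * math.log(det) + math.log(prior))

def classify(x, classes):
    """classes is a list of (mean, covariance, prior); returns the best index."""
    scores = [bayes_gauss(x, m, k, pr) for m, k, pr in classes]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical two-class problem with class dependent covariance.
classes = [([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5),
           ([3.0, 3.0], [[2.0, 0.5], [0.5, 1.0]], 0.5)]

print(classify([0.2, -0.1], classes))
print(classify([2.8, 3.1], classes))
```

Exponentiating the discriminant recovers $p(x \mid \omega_j)P(\omega_j)$ of eqn. 16.17, which is a convenient check on any implementation.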

The multivariate normal (Gaussian) density provides a good characterisation of pattern (vector)

distribution where we can model the generation of patterns as ideal pattern plus measurement

noise; for an instance of a measured vector x from class ωj :

xn = µj + en, (16.20)

where en ∼ N(0,Kj), that is, the noise covariance is class dependent.

16.5 Bayes-Gauss Classifier – Special Cases

(Duda & Hart 1973, pp. 26–31)

Revealing comparisons with the other learning paradigms discussed in these notes are made possible if we examine particular forms of noise covariance in which the Bayes-Gauss classifier reduces to certain interesting limiting forms:

• Equal and Diagonal Covariances ($K_j = \sigma^2 I, \forall j$, where $I$ is the unit matrix); in this case certain important equivalences with eqn. 16.19 can be demonstrated:

– Nearest mean classifier;


– Linear discriminant;

– Template matching;

– Matched filter;

– Single layer neural network classifier.

• Equal but Non-diagonal Covariance Matrices.

– Nearest mean classifier using Mahalanobis distance;

and, as in the case of diagonal covariance,

– Linear discriminant function;

– Single layer neural network;

16.5.1 Equal and Diagonal Covariances

When each class has the same covariance matrix, and these are diagonal with equal variances, we have $K_j = \sigma^2 I$, so that $K_j^{-1} = \frac{1}{\sigma^2} I$. Since the covariance matrices are equal, we can eliminate the $\frac{1}{2}\log|K_j|$ term; the $\frac{p}{2}\log 2\pi$ term is constant in any case; thus, including the simplification of the $(x - \mu_j)^T K_j^{-1}(x - \mu_j)$ term, eqn. 16.19 may be rewritten:

\begin{align}
g_j(x) &= -\frac{1}{2\sigma^2}(x - \mu_j)^T(x - \mu_j) + \log P(\omega_j) \tag{16.21}\\
&= -\frac{1}{2\sigma^2}\|x - \mu_j\|^2 + \log P(\omega_j). \tag{16.22}
\end{align}

Nearest mean classifier If we assume equal prior probabilities P (ωj), the second term in

eqn. 16.22 may be eliminated for comparison purposes and we are left with a nearest mean classifier.

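Linear discriminant If we further expand the squared distance term, we have,

$$g_j(x) = -\frac{1}{2\sigma^2}(x^Tx - 2\mu_j^Tx + \mu_j^T\mu_j) + \log P(\omega_j), \tag{16.23}$$

which, since the $x^Tx$ term is the same for every class and may be dropped for comparison purposes, may be rewritten as a linear discriminant:

$$g_j(x) = w_{j0} + w_j^Tx \tag{16.24}$$

where

$$w_{j0} = -\frac{1}{2\sigma^2}\mu_j^T\mu_j + \log P(\omega_j), \tag{16.25}$$

and

$$w_j = \frac{1}{\sigma^2}\mu_j. \tag{16.26}$$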

Template matching In this latter form the Bayes-Gauss classifier may be seen to be performing

template matching or correlation matching, where wj = constant × µj , that is, the prototypical

pattern for class j , the mean µj , is the template.
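The equivalence between the linear discriminant of eqns. 16.24 to 16.26 and the nearest mean classifier can be checked numerically. The sketch below assumes equal priors and $\sigma^2 = 1$; the class means and test points are hypothetical.

```python
import math

def linear_discriminant(mean, sigma2, prior):
    """Return (w_j0, w_j) as in eqns. 16.25 and 16.26."""
    w0 = -sum(m * m for m in mean) / (2 * sigma2) + math.log(prior)
    w = [m / sigma2 for m in mean]
    return w0, w

def classify_linear(x, means, sigma2, priors):
    """Pick the class with the largest w_j0 + w_j^T x (eqn. 16.24)."""
    scores = []
    for mean, prior in zip(means, priors):
        w0, w = linear_discriminant(mean, sigma2, prior)
        scores.append(w0 + sum(wi * xi for wi, xi in zip(w, x)))
    return max(range(len(scores)), key=scores.__getitem__)

def classify_nearest_mean(x, means):
    """Pick the class whose mean is closest in Euclidean distance."""
    dists = [sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for m in means]
    return min(range(len(dists)), key=dists.__getitem__)

means = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]   # hypothetical class means
priors = [1 / 3, 1 / 3, 1 / 3]
points = [[0.5, 0.5], [3.0, 1.0], [1.0, 2.5], [2.5, 1.0]]

linear = [classify_linear(x, means, 1.0, priors) for x in points]
nearest = [classify_nearest_mean(x, means) for x in points]
print(linear, nearest)
```

The weight vector of each class is just a scaled copy of its mean, which is why the same computation can be read as template matching.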


Matched filter In radar and communications systems a matched filter detector is an optimum detector of (subsequence) signals, for example, communication symbols. If the vector $x$ is written as a time series (a digital signal), $x[n]$, $n = 0, 1, \ldots$, then the matched filter for each signal $j$ may be implemented as a convolution:

$$y_j[n] = x[n] * h_j[n] = \sum_{m=0}^{N-1} x[n - m]\, h_j[m], \tag{16.27}$$

where the kernel $h_j[.]$ is a time reversed template; that is, at each time instant, the correlation between the template and the last $N$ samples of $x[.]$ is computed. Provided some threshold is exceeded, the signal achieving the maximum correlation is detected.
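A direct, unoptimised sketch of eqn. 16.27 is given below. The three-sample template and the test signal are hypothetical, and samples before the start of the signal are assumed to be zero.

```python
def matched_filter(x, template):
    """Convolve x with the time reversed template (eqn. 16.27).

    y[n] is the correlation of the template with the last N samples of x
    ending at index n; samples before the start of x are taken as zero.
    """
    n_taps = len(template)
    h = template[::-1]                    # time reversed template
    y = []
    for n in range(len(x)):
        acc = 0.0
        for m in range(n_taps):
            if n - m >= 0:
                acc += x[n - m] * h[m]
        y.append(acc)
    return y

template = [1.0, 2.0, -1.0]               # hypothetical symbol
signal = [0.0, 0.0, 1.0, 2.0, -1.0, 0.0, 0.0]
y = matched_filter(signal, template)
peak = max(range(len(y)), key=y.__getitem__)
print(y, peak)
```

The output peaks at the sample where the symbol ends, and the peak value equals the energy of the template (here 1 + 4 + 1 = 6).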

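Single Layer Neural Network If we restrict the problem to two classes, we can write the classification rule as:

\begin{align}
g(x) &= g_1(x) - g_2(x) \geq 0: \omega_1, \text{ otherwise } \omega_2 \tag{16.28}\\
&= w_0 + w^Tx, \tag{16.29}
\end{align}

where $w_0 = -\frac{1}{2\sigma^2}(\mu_1^T\mu_1 - \mu_2^T\mu_2) + \log\frac{P(\omega_1)}{P(\omega_2)}$ and $w = \frac{1}{\sigma^2}(\mu_1 - \mu_2)$.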

In other words, eqn. 16.29 implements a linear combination, adds a bias, and thresholds the result

— that is, a single layer neural network with a hard-limit activation function.

(Duda & Hart 1973) further demonstrate that eqn. 16.22 implements a hyper-plane partitioning

of the feature space.

16.5.2 Equal but General Covariances

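When each class has the same covariance matrix, $K$, the $\log|K_j|$ and $\log 2\pi$ terms of eqn. 16.19 are common to all classes and may be dropped, so that eqn. 16.19 reduces to:

$$g_j(x) = -\frac{1}{2}(x - \mu_j)^T K^{-1}(x - \mu_j) + \log P(\omega_j) \tag{16.30}$$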

Nearest Mean Classifier, Mahalanobis Distance If we have equal prior probabilities $P(\omega_j)$, we arrive at a nearest mean classifier where the distance calculation is weighted. The (squared) Mahalanobis distance $(x - \mu_j)^T K^{-1}(x - \mu_j)$ effectively weights contributions according to inverse variance. Points of equal Mahalanobis distance correspond to points of equal conditional density $p(x \mid \omega_j)$.

Linear Discriminant Eqn. 16.30 may be rewritten as a linear discriminant, see also section 15.5:

$$g_j(x) = w_{j0} + w_j^Tx \tag{16.31}$$

where

$$w_{j0} = -\frac{1}{2}\mu_j^T K^{-1}\mu_j + \log P(\omega_j), \tag{16.32}$$

and

$$w_j = K^{-1}\mu_j. \tag{16.33}$$


Weighted template matching, matched filter In this latter form the Bayes-Gauss classifier may

be seen to be performing weighted template matching.

Single Layer Neural Network As for the diagonal covariance matrix, it can be easily demonstrated that, for two classes, eqns. 16.31 to 16.33 may be implemented by a single neuron. The only difference from eqn. 16.29 is that the non-bias weights, instead of being simply a difference between means, are now weighted by the inverse of the covariance matrix.

16.6 Least square error trained classifier

We can formulate the problem of classification as a least-square-error problem. If we require the classifier to output a class membership indicator $\in [0, 1]$ for each class, we can write:

$$d = f(x) \tag{16.34}$$

where $d = (d_1, d_2, \ldots, d_c)^T$ is the $c$-dimensional vector of class indicators and $x$ is, as usual, the $p$-dimensional feature vector.

We can express individual class membership indicators as:

$$d_j = b_0 + \sum_{i=1}^{p} b_i x_i + e. \tag{16.35}$$

In order to continue the analysis we need to refer to the theory of linear regression, see Chapter 20.

We repeat eqn. 20.12 here,

$$B = (X^TX)^{-1}X^TY \tag{16.36}$$

$X^TY$ is a $(p + 1) \times c$ matrix, and $B$ is a $(p + 1) \times c$ matrix of coefficients; that is, one column of $p + 1$ coefficients for each class.

Eqn. 16.36 defines the training algorithm of our classifier.

Application of the classifier to an (augmented) feature vector $x = (1, x_1, \ldots, x_p)^T$ may be expressed as:

$$y = B^Tx. \tag{16.37}$$

It remains to find the maximum of the $c$ components of $y$.

In a two-class case, this least-square-error training algorithm yields an identical discriminant to Fisher’s linear discriminant (Duda & Hart 1973). Fisher’s linear discriminant is described in Chapter 17.


16.7 Generalised linear discriminant function

Eqn. 15.13 may be adapted to cope with any function(s) of the features $x_i$; we can define a new feature vector $x'$ where:

$$x'_k = f_k(x). \tag{16.38}$$

In the pattern recognition literature, the linear discriminant of eqn. 15.13, now applied to the vector $x'$ of eqn. 16.38, is called the generalised linear discriminant function (Duda & Hart 1973).
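As a small illustration of why the transformed features help, the sketch below applies a linear discriminant to the hypothetical feature map $f(x) = (x_1, x_2, x_1x_2)$. The weights are hand-picked assumptions, not fitted values; with them the discriminant reproduces exclusive-or, which no linear discriminant in the original $(x_1, x_2)$ space can do.

```python
def features(x):
    """Hypothetical fixed feature map f(x) = (x1, x2, x1 * x2)."""
    x1, x2 = x
    return [x1, x2, x1 * x2]

def g(x, a=(1.0, 1.0, -2.0), a0=0.5):
    """Linear discriminant in the transformed space: a^T x' - a0."""
    return sum(ai * fi for ai, fi in zip(a, features(x))) - a0

def xor_class(x):
    """Class 1 when the discriminant is positive, otherwise class 0."""
    return 1 if g(x) > 0 else 0

inputs = [(0, 0), (1, 0), (0, 1), (1, 1)]
print([xor_class(x) for x in inputs])
```

The discriminant is still linear in the coefficients, so the least-square-error machinery of section 16.6 applies unchanged once the functions $f_k$ have been chosen.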

It is desirable to escape from the fixed model of eqn. 16.38: the form of the fk(x) must be

known in advance. Multilayer perceptron (MLP) neural networks provide such a solution. We have

already shown the correspondence between the linear model, eqn. 20.8, and a single layer neural

network with a single output node and linear activation function. An MLP with appropriate non-

linear activation functions, e.g. sigmoid, provides a model-free and arbitrary non-linear solution to

learning the mapping between x and y (Bishop 1995).


Chapter 17

Linear Discriminant Analysis and Principal Components Analysis

17.1 Principal Components Analysis

Principal component analysis (PCA), also called the Karhunen-Loève transform (Duda, Hart & Stork 2000), is a linear transformation which maps a $p$-dimensional feature vector $x \in R^p$ to another vector $y \in R^p$, where the transformation is optimised such that the components of $y$ contain maximum information in a least-square-error sense. In other words, if we take the first $q \leq p$ components ($y' \in R^q$), then using the inverse transformation, we can reproduce $x$ with minimum error. Yet another view is that the first few components of $y$ contain most of the variance, that is, in those components, the transformation stretches the data maximally apart. It is this that makes PCA good for visualisation of the data in two dimensions, i.e. the first two principal components give an optimum view of the spread of the data.

We note, however, that unlike linear discriminant analysis, see section 17.2, PCA does not take account of class labels. Hence it is typically more useful for visualising the inherent variability of the data than for visualising class separation.

In general x can be represented, without error, by the following expansion:

$$x = Uy = \sum_{i=1}^{p} y_i u_i \tag{17.1}$$

where

$$y_i \text{ is the } i\text{th component of } y \tag{17.2}$$

and where

$$U = (u_1, u_2, \ldots, u_p) \tag{17.3}$$

is an orthonormal matrix:

$$u_j^t u_k = \delta_{jk} = 1 \text{ when } j = k; \text{ otherwise } = 0. \tag{17.4}$$


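If we truncate the expansion at $i = q$,

$$x' = U_q y' = \sum_{i=1}^{q} y_i u_i, \tag{17.5}$$

where $U_q$ comprises the first $q$ columns of $U$ and $y'$ the first $q$ components of $y$, we obtain a least square error approximation of $x$, i.e.

$$|x - x'| = \text{minimum}. \tag{17.6}$$

The optimum transformation matrix $U$ turns out to be the eigenvector matrix of the sample covariance matrix $C$:

$$C = \frac{1}{N} A^tA, \tag{17.7}$$

where $A$ is the $N \times p$ sample matrix (with the mean of each column subtracted, so that $C$ is a covariance matrix). With the eigenvectors as the columns of $U$, as in eqn. 17.3,

$$U^t C U = \Lambda, \tag{17.8}$$

the diagonal matrix of eigenvalues.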
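For $p = 2$ the eigen-decomposition can be written in closed form, which allows a self-contained sketch of PCA. The data set is hypothetical (four points on a straight line, so one component captures all the variance), the covariance uses the $1/N$ divisor of eqn. 17.7 after subtracting the mean, and in practice a library eigen-solver would be used instead of this hand-written 2 × 2 routine.

```python
import math

def covariance_2d(samples):
    """Mean and covariance (divisor N, as in eqn. 17.7) of 2-d samples."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    a = sum((s[0] - mx) ** 2 for s in samples) / n
    d = sum((s[1] - my) ** 2 for s in samples) / n
    b = sum((s[0] - mx) * (s[1] - my) for s in samples) / n
    return (mx, my), [[a, b], [b, d]]

def eigen_sym_2x2(c):
    """Eigenvalues (largest first) and unit eigenvectors of a symmetric 2 x 2 matrix."""
    a, b, d = c[0][0], c[0][1], c[1][1]
    if b == 0:                            # already diagonal
        pairs = sorted([(a, [1.0, 0.0]), (d, [0.0, 1.0])], key=lambda t: -t[0])
        return [p[0] for p in pairs], [p[1] for p in pairs]
    half_tr = (a + d) / 2
    root = math.sqrt(((a - d) / 2) ** 2 + b * b)
    lams = [half_tr + root, half_tr - root]
    vecs = []
    for lam in lams:
        v = [b, lam - a]                  # satisfies (C - lam I) v = 0
        norm = math.hypot(v[0], v[1])
        vecs.append([v[0] / norm, v[1] / norm])
    return lams, vecs

def pca_project(x, mean, vecs, q):
    """First q principal components of x."""
    centred = [x[0] - mean[0], x[1] - mean[1]]
    return [centred[0] * u[0] + centred[1] * u[1] for u in vecs[:q]]

def pca_reconstruct(y, mean, vecs):
    """Approximate x from its leading components (eqn. 17.5, plus the mean)."""
    return [mean[i] + sum(yk * u[i] for yk, u in zip(y, vecs)) for i in range(2)]

samples = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
mean, c = covariance_2d(samples)
lams, vecs = eigen_sym_2x2(c)
print(lams, vecs)
print(pca_reconstruct(pca_project((3.0, 3.0), mean, vecs, 1), mean, vecs))
```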

17.2 Fisher’s Linear Discriminant Analysis

In contrast with PCA (see section 17.1), linear discriminant analysis (LDA) transforms the data

to provide optimal class separability (Duda et al. 2000) (Fisher 1936).

Fisher’s original LDA, for two-class data, is obtained as follows. We introduce a linear discriminant

u (a p-dimensional vector of weights — the weights are very similar to the weights used in neural

networks) which, via a dot product, maps a feature vector x to a scalar,

y = utx. (17.9)

$u$ is optimised to maximise simultaneously, (a) the separability of the classes (between-class separability), and (b) the clustering together of same class data (within-class clustering). Mathematically, this criterion can be expressed as:

$$J(u) = \frac{u^t S_B u}{u^t S_W u}, \tag{17.10}$$

where $S_B$ is the between-class covariance,

$$S_B = (m_1 - m_2)(m_1 - m_2)^t, \tag{17.11}$$

and

$$S_W = C_1 + C_2, \tag{17.12}$$

the sum of the class-conditional covariance matrices, see section 17.1.

$m_1$ and $m_2$ are the class means.

$u$ is now computed as:

$$u = S_W^{-1}(m_1 - m_2). \tag{17.13}$$
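Eqn. 17.13 can be sketched for two-dimensional data as follows. The two small classes are hypothetical, the class covariances use the $1/N$ divisor of eqn. 17.7, and the explicit 2 × 2 inverse stands in for a library solver.

```python
def mean2(xs):
    """Mean of a list of 2-d points."""
    n = len(xs)
    return [sum(x[0] for x in xs) / n, sum(x[1] for x in xs) / n]

def cov2(xs, m):
    """Covariance matrix (divisor N) of 2-d points about the mean m."""
    n = len(xs)
    c = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        d = [x[0] - m[0], x[1] - m[1]]
        for i in range(2):
            for j in range(2):
                c[i][j] += d[i] * d[j] / n
    return c

def fisher_direction(class1, class2):
    """u = S_W^{-1} (m1 - m2), eqn. 17.13, for 2-d data."""
    m1, m2 = mean2(class1), mean2(class2)
    c1, c2 = cov2(class1, m1), cov2(class2, m2)
    sw = [[c1[i][j] + c2[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

# Hypothetical classes: two unit-spread squares separated along x1.
class1 = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
class2 = [[4.0, 0.0], [6.0, 0.0], [4.0, 2.0], [6.0, 2.0]]
u = fisher_direction(class1, class2)
proj1 = [u[0] * x[0] + u[1] * x[1] for x in class1]
proj2 = [u[0] * x[0] + u[1] * x[1] for x in class2]
print(u, proj1, proj2)
```

The projected values $y = u^t x$ of the two classes do not overlap, so a single threshold on $y$ separates them.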

There are other formulations of LDA (Duda et al. 2000) (Venables & Ripley 2002), particularly

extensions from two-class to multi-class data.

In addition, there are extensions (Duda et al. 2000) (Venables & Ripley 2002) which form a second discriminant, orthogonal to the first, which optimises the separability and clustering criteria, subject to the orthogonality constraint. The second dimension/discriminant is useful to allow the data to be viewed as a two-dimensional scatter plot.


Chapter 18

Neural Network Methods

Here we show that a single neuron implements a linear discriminant (and hence also implements

a separating hyperplane). Then we proceed to a discussion which indicates that a neural network

comprising three processing layers can implement any arbitrarily complex decision region.

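Recall eqn. 15.13, with $a_i \to w_i$ and $-a_0 \to w_0$ (so that the bias weight enters with a plus sign, as in Figure 18.1), and now (arbitrarily) allocating discriminant value zero to class 0,

$$g(x) = \sum_{i=1}^{p} w_i x_i + w_0 \quad \begin{cases} \leq 0, & \omega = 0 \\ > 0, & \omega = 1. \end{cases} \tag{18.1}$$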

Figure 18.1 shows a single artificial neuron which implements precisely eqn. 18.1.

[Figure: a single neuron with inputs x1, x2, ..., xp weighted by w1, w2, ..., wp, and a bias input +1 weighted by w0.]

Figure 18.1: Single neuron.

The signal flows into the neuron (circle) are weighted; the neuron receives wixi . The neuron sums

and applies a hard limit (output = 1 when sum > 0, otherwise 0). Later we will introduce a sigmoid

activation function (softer transition) instead of the hard limit.

The threshold term in the linear discriminant (a0 in eqn. 15.13) is provided by w0 ×+1. Another

interpretation of bias, useful in mathematical analysis of neural networks, see section 16.6, is to

represent it by a constant component, +1, as the zeroth component of the augmented feature

vector.


Just to reemphasise the linear boundary nature of linear discriminants (and hence neural networks), examine the two-dimensional case,

$$w_1x_1 + w_2x_2 + w_0 \quad \begin{cases} \leq 0, & \omega = 0 \\ > 0, & \omega = 1. \end{cases} \tag{18.2}$$

The boundary between classes, given by $w_1x_1 + w_2x_2 + w_0 = 0$, is a straight line with $x_1$-axis intercept at $-w_0/w_1$ and $x_2$-axis intercept at $-w_0/w_2$, see Figure 18.2.

[Figure: the x1-x2 plane with the separating line crossing the x1-axis at −w0/w1 and the x2-axis at −w0/w2.]

Figure 18.2: Separating line implemented by two-input neuron.


18.1 Neurons for Boolean Functions

A neuron with weights w0 = −0.5, and w1 = w2 = 0.35 implements a Boolean AND:

x1  x2  AND(x1,x2)  Neuron summation                        Hard-limit (> 0?)
--  --  ----------  --------------------------------------  -----------------
0   0   0           sum = -0.5 + 0.35x1 + 0.35x2 = -0.5     => output = 0
1   0   0           sum = -0.5 + 0.35x1 + 0.35x2 = -0.15    => output = 0
0   1   0           sum = -0.5 + 0.35x1 + 0.35x2 = -0.15    => output = 0
1   1   1           sum = -0.5 + 0.35x1 + 0.35x2 = +0.2     => output = 1
--  --  ----------  --------------------------------------  -----------------

Similarly, a neuron with weights w0 = −0.25, and w1 = w2 = 0.35 implements a Boolean OR.
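The two neurons can be checked directly. The sketch below uses the weights quoted in the text and the summation convention of the table (bias weight added to the weighted inputs); only the function name is an assumption.

```python
def neuron(x, w, w0):
    """Hard-limit neuron: output 1 when w0 + sum(w_i * x_i) > 0, else 0."""
    s = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else 0

inputs = [(0, 0), (1, 0), (0, 1), (1, 1)]
and_out = [neuron(x, (0.35, 0.35), -0.5) for x in inputs]
or_out = [neuron(x, (0.35, 0.35), -0.25) for x in inputs]
print(and_out, or_out)
```

Only the bias weight differs between the two functions: moving it from −0.5 to −0.25 shifts the separating line towards the origin.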

Figure 18.3 shows the x1-x2-plane representation of AND, OR, and XOR (exclusive-or).

[Figure: three panels, AND, OR and XOR, each showing the four input points of the x1-x2 plane labelled with the function output (0 or 1).]

Figure 18.3: AND, OR, XOR.

Note that XOR cannot be implemented by a single neuron, because no single straight line separates its two classes in the x1-x2 plane; in fact it requires two layers. Two layers were a big problem in the first wave of neural network research in the 1960s, when it was not known how to train more than one layer.
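One hand-wired two-layer arrangement for XOR is sketched below: a first layer computes OR and NAND, and a second-layer neuron ANDs them together. The OR and AND weights are those given earlier in this section; the NAND weights (w0 = 0.5, w1 = w2 = −0.35) are an assumption made for this illustration, and other choices would work equally well.

```python
def neuron(x, w, w0):
    """Hard-limit neuron: output 1 when w0 + sum(w_i * x_i) > 0, else 0."""
    s = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else 0

def xor_net(x):
    """Two-layer network: XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2))."""
    h_or = neuron(x, (0.35, 0.35), -0.25)        # first layer, OR
    h_nand = neuron(x, (-0.35, -0.35), 0.5)      # first layer, NAND (assumed weights)
    return neuron((h_or, h_nand), (0.35, 0.35), -0.5)   # second layer, AND

inputs = [(0, 0), (1, 0), (0, 1), (1, 1)]
print([xor_net(x) for x in inputs])
```

Each first-layer neuron contributes one separating line, and the second layer combines the two half-planes into the band that contains the points (1, 0) and (0, 1).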

18.2 Three-layer neural network for arbitrarily complex decision regions

The purpose of this section is to give an intuitive argument as to why three processing layers can

implement an arbitrarily complex decision region.

Figure 18.4 shows such a decision region in two-dimensions.

As shown in the figure, however, each ‘island’ of class 1 may be delineated using a series of

boundaries, d11, d12, d13, d14 and d21, d22, d23, d24.

Figure 18.5 shows a three-layer network which can implement this decision region.

First, just as before, input neurons implement separating lines (hyperplanes), d11, etc. Next, in

layer 2, we AND together the decisions from the separating hyperplanes to obtain decisions, ‘in

island 1’, ‘in island 2’. Finally, in the output layer, we OR together the latter decisions; thus we

can construct an arbitrarily complex partitioning.


[Figure: the x1-x2 plane (x1 from 1 to 6, x2 from 1 to 5) showing two ‘islands’ of class 1 points surrounded by class 0 points, delineated by the boundaries d11, d12, d13, d14 and d21, d22, d23, d24.]

Figure 18.4: Complex decision region required.

Of course, this is merely an intuitive argument. A three layer neural network trained with backpropagation or some other technique might well achieve the partitioning in quite a different manner.

18.3 Sigmoid activation functions

If a neural network is to be trained using backpropagation or similar technique, hard limit activation

functions cause problems (associated with differentiation). Sigmoid activation functions are used

instead. A sigmoid activation function corresponding to the hard limit progresses from output

value 0 at −∞, passes through 0 with value 0.5 and flattens out at value 1 at +∞.
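The notes do not name a particular sigmoid; the sketch below assumes the logistic function $1/(1 + e^{-s})$, which has the stated behaviour and a derivative that is cheap to compute from the output itself.

```python
import math

def sigmoid(s):
    """Logistic sigmoid 1 / (1 + exp(-s)), arranged to avoid overflow."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def sigmoid_derivative(s):
    """Derivative of the logistic sigmoid, y * (1 - y) with y = sigmoid(s)."""
    y = sigmoid(s)
    return y * (1.0 - y)

print(sigmoid(-50.0), sigmoid(0.0), sigmoid(50.0))
print(sigmoid_derivative(0.0))
```

Unlike the hard limit, this function is differentiable everywhere, which is what gradient-based training such as backpropagation needs.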


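[Figure: three-layer network; inputs x1, x2, ..., xp and +1 bias inputs feed first-layer neurons d11, d12, d13, d14, d21, ..., d24, whose outputs feed two AND neurons, whose outputs feed a single OR neuron giving the class.]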

Figure 18.5: Three-layer neural network implementing an arbitrarily complex decision region.


Chapter 19

Unsupervised Classification (Clustering)


Chapter 20

Regression

20.1 Linear Regression

The simplest linear model, y = mx + c , of school mathematics, is given by:

y = b0 + b1x + e, (20.1)

which shows the dependence of the dependent variable y on the independent variable x . In other

words, y is a linear function of x and the observation is subject to noise, e; e is assumed to be

a zero-mean random process. Strictly eqn. 20.1 is affine, since b0 is included, but common usage

dictates the use of linear. Taking the nth observation of (x, y), we have (Beck & Arnold 1977, p.

133):

yn = b0 + b1xn + en (20.2)

Least square error estimators for $b_0$ and $b_1$, $\hat{b}_0$ and $\hat{b}_1$, may be obtained from a set of paired observations $\{x_n, y_n\}_{n=1}^{N}$ by minimising the sum of squared residuals:

$$S = \sum_{n=1}^{N} r_n^2 = \sum_{n=1}^{N} (y_n - \hat{y}_n)^2 \tag{20.3}$$

$$S = \sum_{n=1}^{N} (y_n - b_0 - b_1x_n)^2 \tag{20.4}$$

Minimising with respect to $b_0$ and $b_1$, and replacing these with their estimators, $\hat{b}_0$ and $\hat{b}_1$, gives the familiar result:

$$\hat{b}_1 = \frac{N\sum y_nx_n - (\sum y_n)(\sum x_n)}{N\sum x_n^2 - (\sum x_n)^2} \tag{20.5}$$

$$\hat{b}_0 = \frac{\sum y_n}{N} - \hat{b}_1\frac{\sum x_n}{N} \tag{20.6}$$

The validity of these estimates does not depend on the distribution of the errors $e_n$; that is, assumption of Gaussianity is not essential. On the other hand, all the simplest estimation procedures, including eqns. 20.5 and 20.6, assume the $x_n$ to be error free, and that the error $e_n$ is associated with $y_n$.
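Eqns. 20.5 and 20.6 translate directly into code. The sketch below is illustrative only: the function name and the two small data sets (one noise free, one with hypothetical noise) are assumptions.

```python
def fit_line(xs, ys):
    """Least square error estimates (b0, b1) from eqns. 20.5 and 20.6."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b0 = sy / n - b1 * sx / n
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0]
exact = fit_line(xs, [1.0, 3.0, 5.0, 7.0])     # y = 1 + 2x, no noise
noisy = fit_line(xs, [1.1, 2.9, 5.2, 6.8])     # same line with small errors
print(exact, noisy)
```

Eqn. 20.6 says that the fitted line passes through the point of means $(\bar{x}, \bar{y})$.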


In the case where y , still one-dimensional, is a function of many independent variables — p in our

usual formulation of p-dimensional feature vectors — eqn. 20.2 becomes:

$$y_n = b_0 + \sum_{i=1}^{p} b_i x_{in} + e_n \tag{20.7}$$

where $x_{in}$ is the $i$-th component of the $n$-th feature vector.

Eqn. 20.7 can be written compactly as:

$$y_n = x_n^T b + e_n \tag{20.8}$$

where $b = (b_0, b_1, \ldots, b_p)^T$ is a $(p + 1)$-dimensional vector of coefficients, and $x_n = (1, x_{1n}, x_{2n}, \ldots, x_{pn})^T$ is the augmented feature vector. The constant 1 in the augmented vector corresponds to the coefficient $b_0$, that is, it is the so called bias term of neural networks, see section 15.5 and Chapter 18.

All N observation equations may now be collected together:

y = Xb + e (20.9)

where $y = (y_1, y_2, \ldots, y_n, \ldots, y_N)^T$ is the $N \times 1$ vector of observations of the dependent variable, and $e = (e_1, e_2, \ldots, e_n, \ldots, e_N)^T$. $X$ is the $N \times (p + 1)$ matrix formed by $N$ rows of $p + 1$ independent variables.

Now, the sum of squared residuals, eqn. 20.3, becomes:

$$S = (y - Xb)^T(y - Xb). \tag{20.10}$$

Minimising with respect to b — just as eqn. 20.3 was minimised with respect to b0 and b1 — leads

to a solution for b (Beck & Arnold 1977, p. 235):

$$\hat{b} = (X^TX)^{-1}X^Ty. \tag{20.11}$$

The $jk$-th element of the $(p + 1) \times (p + 1)$ matrix $X^TX$ is $\sum_{n=1}^{N} x_{nj}x_{nk}$, in other words, just $N \times$ the $jk$-th element of the autocorrelation matrix, $R$, of the vector of independent variables $x$ estimated from the $N$ sample vectors.
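A minimal sketch of eqn. 20.11 follows. Rather than forming the inverse explicitly, it solves the normal equations $(X^TX)b = X^Ty$ by Gaussian elimination, which is a design choice of this example and not something prescribed by the notes; the data are hypothetical and noise free so that the true coefficients are recovered.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def least_squares(feature_vectors, ys):
    """Estimate b = (b0, b1, ..., bp) from the normal equations."""
    rows = [[1.0] + list(f) for f in feature_vectors]   # augmented vectors
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data generated from y = 1 + 2*x1 - 3*x2.
feature_vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [1.0, 3.0, -2.0, 0.0, 2.0]
b = least_squares(feature_vectors, ys)
print(b)
```

If the features were linearly dependent, $X^TX$ would be singular and the elimination would fail with a division by zero.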

If we have multiple dependent variables ($y$), in this case, $c$ of them, we can replace $y$ in eqn. 20.11 with an appropriate $N \times c$ matrix $Y$ formed by $N$ rows each of $c$ observations. Now, eqn. 20.11 becomes:

$$B = (X^TX)^{-1}X^TY \tag{20.12}$$

$X^TY$ is a $(p + 1) \times c$ matrix, and $B$ is a $(p + 1) \times c$ matrix of coefficients.

Eqn. 20.12 has one significant weakness: it depends on the condition of the matrix $X^TX$. As with any autocorrelation or auto-covariance matrix, good conditioning cannot be guaranteed; for example, linearly dependent features will render the matrix singular. In fact, there is an elegant indirect implementation of eqn. 20.12 involving the singular value decomposition (SVD) (Press, Flannery, Teukolsky & Vetterling 1992), (Golub & Van Loan 1989). The Widrow-Hoff iterative gradient-descent training procedure (Widrow & Lehr 1990), developed in the early 1960s, tackles the problem in a different manner.


Bibliography

Beck, J. & Arnold, K. (1977). Parameter Estimation in Engineering and Science, John Wiley &

Sons, New York.

Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis 2nd ed., Springer Verlag.

Berry, D. (1996). Statistics — a Bayesian Perspective, Duxbury Press.

Bishop, C. (1995). Neural Networks for Pattern Recognition, Oxford University Press, Oxford,

U.K.

Boslaugh, S. & Watters, P. A. (2008). Statistics in a Nutshell, O’Reilly.

Campbell, J. (2000). Fuzzy Logic and Neural Network Techniques in Data Analysis, PhD thesis,

University of Ulster.

Campbell, J. (2005). Lecture notes on pattern recognition and image processing, Technical report,

Letterkenny Institute of Technology. http://www.jgcampbell.com/ip/pr.pdf (accessed 2009-

05-01).

Campbell, J. & Murtagh, F. (1998). Image processing and pattern recognition, Technical report,

Computer Science, Queen’s University Belfast. available at: http://www.jgcampbell.com/ip

(2009-05-01).

Casella, G. & Berger, R. (2001). Statistical Inference, 2nd edn, McGraw-Hill.

Crawley, M. J. (2005). Statistics: an introduction using R, John Wiley. Good introduction to

statistics using R.

Duda, R. & Hart, P. (1973). Pattern Classification and Scene Analysis, Wiley-Interscience, New

York.

Duda, R., Hart, P. & Stork, D. (2000). Pattern Classification, Wiley-Interscience.

Duntsch, I. & Gediga, G. (2000). Sets, Relations, Functions, Methodos Publishers. Available via

http://www.cosc.brocku.ca/ duentsch/papers/methprimer1.html (2009-04-30).

Dytham, C. (2009). Choosing and Using Statistics: A Biologist’s Guide, 2nd edn, Blackwell

Publishing. ISBN-13: 978-1-4051-0243-8.

Feller, W. (1968). An Introduction to Probability Theory and its Applications, volume 1, 3rd edn,

John Wiley & Sons, New York.

Fisher, R. (1936). The use of multiple measurements in taxonomic problems, Annals of Eugenics 7: 179–188.


Foley, J., van Dam, A., Feiner, S., Hughes, J. & Phillips, R. (1994). Introduction to Computer

Graphics, Addison Wesley.

Frey, B. (2006). Statistics Hacks, O’Reilly.

Gelman, A., Carlin, J., Stern, H. & Rubin, D. (1995). Bayesian Data Analysis, Chapman and Hall.

Gelman, A. & Nolan, D. (2002). Teaching statistics: a bag of tricks, Oxford University Press.

Golub, G. & Van Loan, C. (1989). Matrix Computations, 2nd edn, Johns Hopkins University Press,

Baltimore.

Griffiths, D. (2009). Head First Statistics, O’Reilly. ISBN-10: 0596527586. Excellent introduction.

Hacking, I. (2001). An Introduction to Probability and Inductive Logic, Oxford University Press.

Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning, Springer.

Hsu, H. (1997). Theory and Problems of Probability, Random Variables, and Random Processes

(Schaum’s Outlines), McGraw-Hill.

Jaynes, E. (2003). Probability Theory: The Logic of Science (ed. L. Bretthorst), Cambridge University Press. Jaynes was one of the chief advocates of the Bayesian method.

Jeffreys, H. (1961/1998). Theory of Probability, 3rd edn, Oxford University Press (Oxford Classics

Series – 1998), Oxford, U.K.

Larson, H. (1982). Introduction to Probability and Statistical Inference, 3rd edn, John Wiley.

Lee, P. M. (2004). Bayesian Statistics: an introduction, 3rd edn, Arnold. Reputedly one of the

best introductions to Bayesian statistics; Contains examples in R.

MacKay, D. J. C. (2002). Information Theory, Inference and Learning Algorithms, Cambridge

University Press. MacKay is a major advocate of Bayesian methods.

Maindonald, J. & Braun, J. (2007). Data Analysis and Graphics Using R: an example-based

approach, 2nd edn, Cambridge University Press, Cambridge, U.K. ISBN: 978-0-521-86116-8;

good R examples, including graphics.

Matloff, N. (2008). R for programmers, Technical report, University of California, Davis.

http://heather.cs.ucdavis.edu/ matloff/R/RProg.pdf (accessed 2009-04-25).

Meyer, P. L. (1966). Introductory Probability and Statistical Applications, Addison-Wesley, Reading, MA. Excellent introduction, but now out of print.

Milton, M. (2009). Head First Data Analysis: A learner’s guide to big numbers, statistics, and

good decisions, O’Reilly. ISBN-10: 0596153937. Another excellent introduction. Uses R.

Murtagh, F. (2005). Correspondence Analysis and Data Coding with Java and R, Chapman and Hall/CRC Press.

O’Hagan, A. (1994). Kendall’s Advanced Theory of Statistics, Vol. 2B, Bayesian Inference, Edward

Arnold.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems (revised second printing), Morgan

Kaufmann, San Francisco, CA.


Press, W., Flannery, B., Teukolsky, S. & Vetterling, W. (1992). Numerical Recipes in C, 2nd edn,

Cambridge University Press, Cambridge, UK.

Quinn, G. P. & Keough, M. J. (2002). Experimental Design and Data Analysis for Biologists,

Cambridge University Press. ISBN-13: 978-0521009768.

Ripley, B. (1996). Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, U.K.

Rosenkrantz, R. D. (ed.) (1983). E.T. Jaynes. Papers on Probability, Statistics and Statistical

Physics, Kluwer, Dordrecht.

Salsburg, D. (2001). The Lady Tasting Tea: How Statistics Revolutionized Science in the 20th

Century, W.H. Freeman. Great introduction to the origins of statistics.

Silvey, S. (1975). Statistical Inference, Chapman and Hall.

Sivia, D. (1996). Data Analysis, A Bayesian Tutorial, Oxford University Press, Oxford, U.K.

Sivia, D. (2006). Data Analysis, A Bayesian Tutorial, 2nd edn, Oxford University Press. Best

introduction to Bayesian inference there is.

Spiegel, M. R., Schiller, J. & Srinivasan, R. A. (2009). Theory and Problems of Probability and

Statistics (Schaum’s Outlines), 3rd edn, McGraw-Hill.

Spiegel, M. R. & Stephens, L. J. (2008). Statistics (Schaum’s Outlines), 4th edn, McGraw-Hill.

Highly recommended; if you have to buy one book, this is the one; has examples using a few

packages, most notably Excel.

Taylor, P. (2008). Probability (manuscript notes on mathematical foundations), Technical report,

University of Manchester. http://www.paultaylor.eu/tripos/Probability.pdf (accessed 2009-

04-25).

Therrien, C. (1989). Decision, Estimation, and Classification, Chichester, UK: John Wiley and

Sons.

Thisted, R. (1988). Elements of Statistical Computing, Chapman and Hall/CRC Press.

Venables, W. & Ripley, B. (2000). S Programming, Springer-Verlag.

Venables, W. & Ripley, B. (2002). Modern Applied Statistics with S, 4th edn, Springer-Verlag.

Highly recommended for learning R (R is a free version of S).

Wasserman, L. (2004). All of Statistics: a concise course in statistical inference, Springer Verlag,

New York, NY. ISBN: 0-387-40272-1; top class encyclopedic reference.

Widrow, B. & Lehr, M. (1990). 30 Years of Adaptive Neural Networks, Proc. IEEE 78(9): 1415–1442.


Appendix A

Basic Mathematical Notation

The notation described here is merely shorthand for common sense concepts which would otherwise be confusing and long-winded if written in English. Casual familiarity with the most important items will also allow you to read papers using statistics without becoming confused. The online book Sets, Relations, Functions (Duntsch & Gediga 2000) is an ideal introduction; we take these notes from that book.

A.1 Sets

A.1.1 Set Definition and Membership

A set is a very basic mathematical entity and hence is a bit hard to define. Let’s say that a set is

a collection of objects; there cannot be repetition (duplication) of objects. We can specify a set

by writing all its members within curly brackets, $\{\ \}$.

Example 30 Six-sided dice, set of possible faces (identified by the number of spots); call the set $D$. We can write $D$ as $D = \{1, 2, 3, 4, 5, 6\}$. When there is an obvious sequence, we can write $D = \{1, 2, \ldots, 6\}$.

Sometimes we specify a rule for making the set; for example, we have the trivial rule-generated set $D = \{i \mid i \in \{1, \ldots, 6\}\} = \{1, \ldots, 6\}$; the set of even numbers between 1 and 6 is given by $D_{even} = \{i \mid i \in \{1, \ldots, 6\} \text{ and } i \text{ even}\} = \{2, 4, 6\}$.

We use the membership symbol $\in$ to state that an object is a member of a set, for example, $1 \in \{1, 2, 3\}$; we can state non-membership by $\notin$, for example, $6 \notin \{1, 2, 3\}$.

There is no ordering of position in a set: $\{1, 2, 3\}$ and $\{2, 3, 1\}$ represent the same set. If there is repetition, it is understood that the repeated elements have no effect, so that $\{1, 2, 3\}$ and $\{2, 3, 1, 1, 2\}$ represent the same set.
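The ideas of this section can be tried out directly, because many programming languages provide a set type. The following is a minimal sketch in Python (an assumption made for illustration; the variable names are chosen here to mirror Example 30 and are not part of the notes).

```python
# Sets are written with curly brackets, just as in the notation above.
D = {1, 2, 3, 4, 5, 6}                    # faces of a six-sided dice

# Rule-generated sets, written as set comprehensions.
D_rule = {i for i in range(1, 7)}         # {i | i in {1, ..., 6}}
D_even = {i for i in D if i % 2 == 0}     # {i | i in D and i even}

# Membership and non-membership.
one_is_member = 1 in {1, 2, 3}
six_not_member = 6 not in {1, 2, 3}

# Neither order nor repetition changes a set.
same_order = {1, 2, 3} == {2, 3, 1}
same_repeat = {1, 2, 3} == {2, 3, 1, 1, 2}

print(D_rule == D, D_even == {2, 4, 6})                        # True True
print(one_is_member, six_not_member, same_order, same_repeat)  # True True True True
```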


A.1.2 Important Number Sets

• Natural numbers: $\mathbb{N} = \{0, 1, 2, \ldots\}$.

• Positive natural numbers: $\mathbb{N}^+ = \{1, 2, \ldots\}$.

• Integers: $\mathbb{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$.

• Real numbers: $\mathbb{R}$.

A.1.3 Set Operations

• Intersection. The set formed by the intersection of sets A,B is written

$C = A \cap B = \{x : x \in A \text{ and } x \in B\}$.

Example 31 $A = \{1, 2, 3, 4\}$, $B = \{3, 4, 5\}$, $A \cap B = \{3, 4\}$.

• Union. The set formed by the union of sets A,B is written

$C = A \cup B = \{x : x \in A \text{ or } x \in B\}$, where “or” means inclusive or, that is, a or b means either a or b, or both.

Example 32 $A = \{1, 2, 3, 4\}$, $B = \{3, 4, 5\}$, $A \cup B = \{1, 2, 3, 4, 5\}$.

• Set difference. The set formed by the difference of sets A,B is written

$C = A \setminus B = \{x : x \in A \text{ and } x \notin B\}$.

That is, remove from $A$ any members of $B$.

Example 33 $A = \{1, 2, 3, 4\}$, $B = \{3, 4\}$, $A \setminus B = \{1, 2\}$.

• Set complement (with respect to a universal set, U).

$\bar{A} = \{x : x \notin A \text{ and } x \in U\}$.

Example 34 $U = \{1, 2, 3, 4, 5, 6\}$; $A = \{3, 4, 5\}$, $\bar{A} = \{1, 2, 6\}$.

Comment. In case the notion of a universal set causes difficulty: the universal set depends

on the problem at hand; when talking about a class of students, then U would be the set of

all students in the class. You might have $A$ as the set of all students (in that class, that is, in that universal set) from County Donegal; then $\bar{A}$, the complement of $A$, is the set of all students from outside County Donegal, that is, not from County Donegal.
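Examples 31 to 34 can be checked mechanically. The sketch below uses Python's built-in set operators (an illustration only; the variable names are chosen here). Python has no complement operator, so the complement is computed as a set difference from an explicitly stated universal set `U`.

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}

intersection = A & B          # members of both A and B (Example 31)
union = A | B                 # members of A or B or both (Example 32)
difference = A - {3, 4}       # members of A that are not in {3, 4} (Example 33)

# Complement with respect to a universal set U (Example 34).
U = {1, 2, 3, 4, 5, 6}
A34 = {3, 4, 5}
complement = U - A34

print(sorted(intersection))   # [3, 4]
print(sorted(union))          # [1, 2, 3, 4, 5]
print(sorted(difference))     # [1, 2]
print(sorted(complement))     # [1, 2, 6]
```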

A.1.4 Venn Diagrams

Set operations such as intersection, union, difference and complement are often illustrated using

Venn diagrams such as those shown in Figure A.1.

[Figure A.1: three Venn diagrams. Panels: intersection of A, B; union of A, B (all shaded area); complement of A within the universal set U.]

Figure A.1: Set operations illustrated using Venn diagrams; (a) intersection, (b) union, (c) complement.

Subset When every member of a set $A$ is also a member of $B$ (so $A$ may have none, some or all of the members of $B$, but no others), we say that $A$ is a subset of $B$, written $A \subseteq B$.

Example 35 $B = \{1, 2, 3, 4, 5, 6\}$; $A = \{3, 4, 5\}$, so that $A \subseteq B$.

Equality of sets When a set A has the same members as B, or each is empty, we say that they

are equal: A = B. Another way of looking at this is, if A ⊆ B and B ⊆ A, then A = B.

Empty Set If a set contains no members, we call it the empty set; symbol ∅.

Cardinality of a Set The number of elements in a set A is called its cardinality and written |A|.

Example 36 $A = \{1, 2, \ldots, 6\}$, $|A| = 6$.

$B = \{\text{John}, \text{Mary}, \text{Jean}\}$, $|B| = 3$.

Power Set (Probably not necessary for basic probability.)

Given a set $A$, the power set of $A$, $P(A)$, is the set of all subsets of $A$; $|P(A)| = 2^{|A|}$. Notice that you can have a set of sets, for example, the set of all classes in the computing department.

Example 37 $A = \{a, b, c\}$, $|A| = 3$.

$P(A) = \{\emptyset, \{c\}, \{b\}, \{a\}, \{b, c\}, \{a, c\}, \{a, b\}, \{a, b, c\}\}$.

Verify that $|P(A)| = 2^{|A|} = 2^3 = 8$.
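The subset, cardinality and power-set definitions above can be verified with a short program. This is a sketch in Python; `power_set` is a helper name introduced here for illustration. The subsets are stored as `frozenset` objects because Python only allows immutable objects as members of a set.

```python
from itertools import chain, combinations


def power_set(s):
    """Return the set of all subsets of s, each stored as a frozenset."""
    items = list(s)
    # Take all combinations of size 0, 1, ..., |s|.
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return {frozenset(subset) for subset in subsets}


A = {"a", "b", "c"}
P = power_set(A)

print(len(A), len(P))                    # 3 8
print(len(P) == 2 ** len(A))             # True
print({3, 4, 5} <= {1, 2, 3, 4, 5, 6})   # subset test of Example 35: True
```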


Finite and Infinite Sets Roughly speaking, if $|A| = n$ where $n$ is some number we can identify, then we say that $A$ is a finite set; otherwise the set is infinite. Most of the sets in our examples are finite sets.

$\mathbb{N}, \mathbb{Z}, \mathbb{R}$ are infinite sets.

This is an example of a finite set of integers: $A = \{1, 2, \ldots, n\}$; in contrast, an infinite set of integers would be written $A = \{1, 2, \ldots\}$, which means that the list continues without end.

Disjoint Sets We say that $A_1, A_2, \ldots$ are disjoint if $A_i \cap A_j = \emptyset, \forall i, j, i \neq j$.

$\forall$ denotes for all.

A.2 Iterated Summation and Product Notation

If we want to write down the operation of summing the numbers from 1 to 6, we could write

$s = 1 + 2 + 3 + 4 + 5 + 6$ or $s = 1 + 2 + \cdots + 6$. But this becomes tedious or impossible for larger lists. Instead, we have the summation notation $s = \sum_{i=1}^{6} i$.

Similarly, if we want to write down the operation of multiplying together all the numbers from 1 to 6, we use the product notation $\prod_{i=1}^{6} i$.
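The summation and product notations correspond directly to loops over the index i. A minimal sketch in Python (used here for illustration; the variable names are assumptions):

```python
import math

# Sum of i for i = 1, ..., 6, written as an explicit loop over the index.
s = 0
for i in range(1, 7):      # range(1, 7) yields 1, 2, ..., 6
    s += i

# Product of i for i = 1, ..., 6.
p = 1
for i in range(1, 7):
    p *= i

print(s, p)                                                 # 21 720
# The built-in shortcuts give the same values.
print(s == sum(range(1, 7)), p == math.prod(range(1, 7)))   # True True
```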

A.3 Iterated Union and Intersection

If we want to write down the operation of taking the union (see section A.1.3) of a list of sets from $A_1$ to $A_6$, we could write $B = A_1 \cup A_2 \cup \cdots \cup A_6$. But this becomes tedious or impossible for larger lists. Similar to the summation notation we have $B = \bigcup_{i=1}^{6} A_i$.

For intersection we have $B = \bigcap_{i=1}^{6} A_i$.
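Iterated union and intersection can be computed by folding the binary operation over the list of sets, in the same way that a sum is accumulated term by term. The sketch below is in Python; the particular sets $A_i = \{i, i+1, \ldots, 6\}$ are a hypothetical example chosen here, not taken from the notes.

```python
from functools import reduce

# Hypothetical example: A_1 = {1,...,6}, A_2 = {2,...,6}, ..., A_6 = {6}.
sets = [set(range(i, 7)) for i in range(1, 7)]

# B = A_1 ∪ A_2 ∪ ... ∪ A_6, accumulated one set at a time.
B_union = reduce(lambda acc, A_i: acc | A_i, sets)

# B = A_1 ∩ A_2 ∩ ... ∩ A_6.
B_inter = reduce(lambda acc, A_i: acc & A_i, sets)

print(sorted(B_union))   # [1, 2, 3, 4, 5, 6]
print(sorted(B_inter))   # [6]
```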

A.4 Cartesian Product Sets

Quite often we need to make new sets by making pairs (or triples or n-tuples) from existing sets.

Example 38 Let $A = \{1, 2, 3, 4, 5, 6\}$ be the set of outcomes from throwing a six-sided dice and $B = \{H, T\}$ the set of outcomes of a coin toss. If we perform an experiment where we throw the dice and toss a coin, and we want to describe the set of all possible pairs, $C = \{(1, H), (1, T), (2, H), \ldots, (6, H), (6, T)\}$, we call set $C$ the Cartesian product of $A$ and $B$.

The Cartesian product of A and B is written A× B.

The cardinality of A× B, |A× B| = |A| × |B|. So in Example 38, we have |A× B| = |A| × |B| =

6× 2 = 12.

Note: pairs such as (1, H), (1, T ), or generally n-tuples, — enclosed in round brackets ( ) — are

not sets.
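The dice-and-coin experiment of Example 38 can be enumerated with a short program. This sketch uses Python's `itertools.product`; the names `dice` and `coin` are chosen here for illustration. Each pair is a tuple, in which order matters, which is why a pair is not itself a set.

```python
from itertools import product

dice = {1, 2, 3, 4, 5, 6}     # outcomes of throwing the dice
coin = {"H", "T"}             # outcomes of the coin toss

# All ordered pairs (face, side).
C = set(product(dice, coin))

print(len(C) == len(dice) * len(coin))   # True
print((1, "H") in C, ("H", 1) in C)      # True False
```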


Appendix B

Matrices and Linear Algebra

B.1 Introduction

In Chapters 7 and 8 we introduce two-dimensional random variables, that is, pairs of random

variables which, for one reason or another, we want to treat as pairs rather than separately. Much of what we do in one dimension generalises to two dimensions, and likewise from two dimensions to many dimensions.

B.2 Linear Simultaneous Equations

Eqns. B.1 and B.2 are a pair of linear simultaneous equations,

y1 = 3x1 + 1x2, (B.1)

y2 = 2x1 + 4x2. (B.2)

Practically, these equations could express the following:

Price of an apple = x1, price of an orange = x2 (both unknown). Person A buys 3 apples, and 1

orange and the total bill is 5c (y1). Person B buys 2 apples and 4 oranges and the total bill is 10c

(y2).

Now, what is x1, the price of apples, and x2, the price of oranges? We want to solve for the

unknowns x1, x2. Matrix algebra gives us a nice technique for solving such problems, see section B.6,

but first we'll see how to solve it without matrices.

Substitute y1 = 5 and y2 = 10 into eqns. B.1 and B.2:

5 = 3x1 + 1x2, (B.3)

10 = 2x1 + 4x2. (B.4)

Eqn. B.3 gives x2 = 5− 3x1, which, substituted into eqn. B.4 gives:


10 = 2x1 + 4(5− 3x1),

10 = 2x1 + 20− 12x1,

−10 = −10x1,

x1 = 1.

Now, substitute x1 = 1 into eqn. B.3:

5 = 3 + x2,

x2 = 2.

We have determined our unknowns x1 = 1 and x2 = 2.

Ex. Substitute x1 = 1 and x2 = 2 into eqns. B.3 and B.4 to check the correctness of the result.
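The substitution steps used above can be written as a small general routine. The following Python sketch is an illustration only: the function name `solve_2x2_by_substitution` is hypothetical, it assumes that the coefficient of $x_2$ in the first equation is non-zero and that the system has a unique solution, and it uses exact fractions to avoid rounding.

```python
from fractions import Fraction


def solve_2x2_by_substitution(a11, a12, a21, a22, y1, y2):
    """Solve y1 = a11*x1 + a12*x2 and y2 = a21*x1 + a22*x2 by substitution.

    Assumes a12 != 0 (so equation 1 can be solved for x2) and that the
    system has exactly one solution.
    """
    a11, a12, a21, a22, y1, y2 = map(Fraction, (a11, a12, a21, a22, y1, y2))
    # Equation 1 gives x2 = (y1 - a11*x1) / a12.
    # Substituting into equation 2 and collecting the terms in x1:
    #   y2 - a22*y1/a12 = (a21 - a22*a11/a12) * x1
    coeff = a21 - a22 * a11 / a12
    rhs = y2 - a22 * y1 / a12
    x1 = rhs / coeff
    x2 = (y1 - a11 * x1) / a12
    return x1, x2


x1, x2 = solve_2x2_by_substitution(3, 1, 2, 4, 5, 10)
print(x1, x2)                                    # 1 2
# Check by substituting back into eqns. B.3 and B.4.
print(3 * x1 + 1 * x2 == 5, 2 * x1 + 4 * x2 == 10)   # True True
```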

B.3 Vectors and Matrices

Eqns. B.1 and B.2 can be written in matrix form as

y = Ax (B.5)

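where $A$ is a 2-row × 2-column matrix, $A = \begin{bmatrix} 3 & 1 \\ 2 & 4 \end{bmatrix}$, $y$ is a one-column, two-row matrix, representing a tuple, and what we will from now on call a vector, $y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$, and $x$ is another one-column, two-row matrix, $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$.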

Vectors We could be extra careful and continue to call objects like x and y tuples. But everyone

in the statistical world uses the term vector for tuple, and, because we are using vector and matrix

arithmetic and algebra, this gives another reason to use vector.

A vector is nothing more than an ordered collection of one-dimensional variables; however, vector

and matrix mathematics have been developed to allow us to do mathematics on vectors without

having to deal with each of the elements of (X1, X2, . . . , Xp) separately.

It will rarely be helpful to think of these vectors as being like vectors of physics and having magnitude

and direction; but it is often helpful to think of two-dimensional vectors as representing points in a

Euclidean plane and to think of general multidimensional vectors (p-dimensions, say) as representing

points in p-dimensional space.

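Generally, a system of $m$ equations in $n$ variables, $x_1, x_2, \ldots, x_n$,

$$
\begin{aligned}
y_1 &= a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n \\
y_2 &= a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n \\
&\;\;\vdots \\
y_r &= a_{r1}x_1 + a_{r2}x_2 + \cdots + a_{rn}x_n \\
&\;\;\vdots \\
y_m &= a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n
\end{aligned}
\tag{B.6}
$$

can be written in matrix form as

$$
y = Ax, \tag{B.7}
$$

where $y$ is an $m \times 1$ vector, $y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$, $x$ is an $n \times 1$ vector, $x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$, and $A$ is an $m$-row × $n$-column matrix

$$
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & & \vdots \\
\cdots & a_{rc} & \cdots & \cdots \\
\vdots & \vdots & & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}.
$$

That is, the matrix $A$ is a rectangular array of numbers whose element in row $r$, column $c$ is $a_{rc}$ (rows are horizontal, think rows of teeth; columns are vertical). The matrix $A$ is said to be $m \times n$, i.e. $m$ rows, $n$ columns.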

Eqn. B.7 can be interpreted as the definition of a function which takes n arguments (x1, x2, . . . , xn)

and returns m variables (y1, y2 . . . ym). Such a function is also called a transformation: it transforms

n-dimensional vectors to m-dimensional vectors.

Such equations are linear transformations because there are no terms in $x_r^2$ or higher powers, only in $x_r = x_r^1$, and no constant terms such as 5 ($5x_r^0 = 5 \times 1 = 5$).
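Reading eqn. B.7 as a function from n-dimensional vectors to m-dimensional vectors suggests the following sketch, written in Python with plain lists (an illustration only; the function name `transform` and the list-of-rows representation of $A$ are assumptions made here). The last lines illustrate one consequence of linearity: doubling the input doubles the output.

```python
def transform(A, x):
    """Return y = A x, where y_r is the sum over c of a_rc * x_c.

    A is a list of m rows, each a list of n numbers; x is a list of n numbers.
    """
    n = len(x)
    if any(len(row) != n for row in A):
        raise ValueError("each row of A must have as many entries as x")
    return [sum(a_rc * x_c for a_rc, x_c in zip(row, x)) for row in A]


A = [[3, 1],
     [2, 4]]

y = transform(A, [1, 2])
print(y)                       # [5, 10]

# Linearity: transforming 2x gives 2y.
print(transform(A, [2, 4]))    # [10, 20]
```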

Uses of Vectors and Transformations in Statistics Instead of denoting a two-d. random variable as $(X, Y)$, it is much more convenient to denote it as the vector $\mathbf{X} = \begin{bmatrix} X_1 = X \\ X_2 = Y \end{bmatrix}$.

This is particularly true when we get to larger dimensions, when equations like eqn. 7.15 get

enormous or impossible.

Why transformations?

In other places, we have used combinations of random variables such as U = aX + bY ; and we

might have also V = cX + d Y . Thus, we create a new two-d. random variable (U, V ) using linear

combinations of (X, Y ); we transform (X, Y ) to yield (U, V ). This can be neatly expressed using

matrix notation.

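Here $y$ is a $2 \times 1$ vector, $y = \begin{bmatrix} U \\ V \end{bmatrix}$, $x$ is a $2 \times 1$ vector, $x = \begin{bmatrix} X \\ Y \end{bmatrix}$, and $A$ is a 2-row × 2-column matrix, $A = \begin{bmatrix} a_{11} = a & a_{12} = b \\ a_{21} = c & a_{22} = d \end{bmatrix}$, so that $y = Ax$ gives $U = aX + bY$ and $V = cX + dY$.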

The general equation B.7 allows us to create an $m$-dimensional random variable, $y$, as linear combinations of the $n$ random variables in the $n$-dimensional vector $x$.

B.4 Basic Matrix Arithmetic

B.4.1 Matrix Multiplication

We may multiply two matrices A, m × n, and B, q × p, as long as n = q. Such a multiplication

produces an m × p result. Thus,

$$
\underset{m \times p}{C} = \underset{m \times n}{A} \; \underset{n \times p}{B}. \tag{B.8}
$$

Method: The element at the $r$th row and $c$th column of $C$ is the product (sum of component-wise products) of the $r$th row of $A$ with the $c$th column of $B$. Pictorially:

          n                    p                  p
     -----------          -----------        -----------
     | ------> |          |    |    |        |         |
   m |    A    |        n |    B    |   =  m |    C    |
     |         |          |    V    |        |         |
     -----------          -----------        -----------

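For example, for $C = AB$ with

$$
A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \quad
B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix},
$$

the product is

$$
C = \begin{bmatrix}
a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\
a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22}
\end{bmatrix}.
$$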

Example. Consider Eqn. B.7, $y = Ax$. Thus the product of $A$ ($m \times n$) and $x$ ($n \times 1$) is

$y_1 = a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n$, $\ldots$, $y_m = a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n$.

In summation notation, $y_r = \sum_{c=1}^{n} a_{rc}x_c$.

The product is (m × n)× (n × 1) so the result is (m × 1), which checks okay, for y is (m × 1).
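The row-times-column method can be written out directly. The sketch below is in Python with matrices stored as lists of rows (an illustration only; the function name `matmul` and the matrix `B` are hypothetical). It also checks the size rule that the number of columns of the first matrix must equal the number of rows of the second.

```python
def matmul(A, B):
    """Return C = A B for A (m x n) and B (n x p), as a list of m rows."""
    m, n = len(A), len(A[0])
    q, p = len(B), len(B[0])
    if n != q:
        raise ValueError("columns of A must equal rows of B")
    # c_rc is the sum over k of a_rk * b_kc (row r of A with column c of B).
    return [
        [sum(A[r][k] * B[k][c] for k in range(n)) for c in range(p)]
        for r in range(m)
    ]


A = [[3, 1],
     [2, 4]]
B = [[1, 0],
     [2, 5]]                 # hypothetical second matrix

print(matmul(A, B))          # [[5, 5], [10, 20]]

# A (2 x 2) times a column vector x (2 x 1) gives y (2 x 1).
x = [[1],
     [2]]
print(matmul(A, x))          # [[5], [10]]
```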

B.4.2 Multiplication by a Scalar

As with vectors (when represented as components), we simply multiply each component by the scalar,

$$
c \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
= \begin{bmatrix} ca_{11} & ca_{12} \\ ca_{21} & ca_{22} \end{bmatrix}.
$$

B.4.3 Addition

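As with vectors (when represented as components), we add component-wise,

$$
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
+ \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}
= \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} \\ a_{21} + b_{21} & a_{22} + b_{22} \end{bmatrix}.
$$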

Clearly, the matrices must be the same size, i.e. row and column dimensions must be equal.
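Both operations act on one component at a time, which the following Python sketch makes explicit (an illustration only; the function names `scale` and `add` are chosen here, and matrices are stored as lists of rows).

```python
def scale(c, A):
    """Multiply every component of the matrix A by the scalar c."""
    return [[c * a for a in row] for row in A]


def add(A, B):
    """Add two matrices of the same size component by component."""
    same_size = len(A) == len(B) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(A, B)
    )
    if not same_size:
        raise ValueError("matrices must have equal row and column dimensions")
    return [
        [a + b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(A, B)
    ]


A = [[3, 1],
     [2, 4]]
I = [[1, 0],
     [0, 1]]

print(scale(2, A))   # [[6, 2], [4, 8]]
print(add(A, I))     # [[4, 1], [2, 5]]
```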

B.5 Special Matrices

B.5.1 Identity Matrix

$I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, i.e. it produces no transformation effect. Thus, $IA = A$.

We can define the matrix inverse as follows: if $AB = I$ then $B = A^{-1}$, see section B.6.


B.5.2 Orthogonal Matrix

A matrix which satisfies the property:

$$
AA^t = I,
$$

i.e. the transpose of the matrix is its inverse, see section B.6.

Another way of viewing orthogonality in matrices is:

For each row of the matrix $(a_{r1}, a_{r2}, \ldots, a_{rn})$, the scalar product with itself is 1, and with all other rows, 0. I.e.

$$
\sum_{c=1}^{n} a_{rc}a_{pc} =
\begin{cases}
1 & \text{for } r = p, \\
0 & \text{otherwise.}
\end{cases}
$$
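The row-by-row condition gives a direct numerical test for orthogonality. The Python sketch below is an illustration only: the function names are hypothetical, and the rotation matrix used as a test case is a standard example of an orthogonal matrix that does not appear in these notes. A small tolerance is needed because the sines and cosines are floating-point numbers.

```python
import math


def row_products(A):
    """Scalar product of every pair of rows of A; this equals A A^t."""
    return [
        [sum(a * b for a, b in zip(row_r, row_p)) for row_p in A]
        for row_r in A
    ]


def is_orthogonal(A, tol=1e-12):
    """True if the rows of A have scalar product 1 with themselves, 0 otherwise."""
    G = row_products(A)
    size = len(A)
    return all(
        math.isclose(G[r][p], 1.0 if r == p else 0.0, abs_tol=tol)
        for r in range(size)
        for p in range(size)
    )


theta = math.pi / 6           # rotation by 30 degrees (illustrative choice)
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]

print(is_orthogonal(R))                   # True
print(is_orthogonal([[3, 1], [2, 4]]))    # False
```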

B.5.3 Diagonal

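$A = \begin{bmatrix} S_x & 0 \\ 0 & S_y \end{bmatrix}$ is diagonal, i.e. the only non-zero elements are on the diagonal.

The inverse of a diagonal matrix $\begin{bmatrix} a_{11} & 0 \\ 0 & a_{22} \end{bmatrix}$ is $\begin{bmatrix} 1/a_{11} & 0 \\ 0 & 1/a_{22} \end{bmatrix}$ (provided $a_{11}$ and $a_{22}$ are both non-zero).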

B.5.4 Transpose of a Matrix

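$A^t$, spoken ‘A-transpose’.

If $A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$ then $A^t = \begin{bmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{bmatrix}$, i.e. replace column 1 with row 1, etc.

The transpose is sometimes written $A^T$ or $A'$.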
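Swapping rows and columns is easy to express in code. A minimal sketch in Python with matrices stored as lists of rows (an illustration only; the function name `transpose` is chosen here):

```python
def transpose(A):
    """Return the transpose of A: row r of the result is column r of A."""
    return [list(column) for column in zip(*A)]


A = [[3, 1],
     [2, 4]]

print(transpose(A))                        # [[3, 2], [1, 4]]
print(transpose([[1, 2, 3], [4, 5, 6]]))   # [[1, 4], [2, 5], [3, 6]]
print(transpose(transpose(A)) == A)        # True
```

Note that a 2 × 3 matrix becomes a 3 × 2 matrix, and that transposing twice returns the original matrix.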


B.6 Inverse Matrix

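Only for square matrices ($m = n$). Consider again Eqns. B.1 and B.2:

$y_1 = 3x_1 + 1x_2$,

$y_2 = 2x_1 + 4x_2$,

i.e. $y = Ax$ with $A = \begin{bmatrix} 3 & 1 \\ 2 & 4 \end{bmatrix}$.

Apply this to $x = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$, to get

$y_1 = 3 \times 1 + 1 \times 2 = 5$,

$y_2 = 2 \times 1 + 4 \times 2 = 10$.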

What if you know $y = (5 \;\; 10)^t$ and you want to retrieve $x = (x_1 \;\; x_2)^t$? In other words, can matrices help us solve for $x_1, x_2$ as we did in section B.2?

The answer is yes. Find the inverse of $A$, written $A^{-1}$, and then apply the inverse transformation to $y$, that is, multiply $y$ by the inverse of the matrix,

$$
x = A^{-1}y. \tag{B.9}
$$

In the case of a 2 × 2 matrix $A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$, the inverse is

$$
A^{-1} = \frac{1}{|A|} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}, \tag{B.10}
$$

where the determinant of the matrix $A$ is $|A| = a_{11}a_{22} - a_{12}a_{21}$.

If $|A| = 0$, then $A$ is not invertible; it is singular.

Inverse matrices give us the equivalent of division. If $|A| = 0$, attempting to find the inverse is equivalent to calculating 1/0.

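Thus for $A = \begin{bmatrix} 3 & 1 \\ 2 & 4 \end{bmatrix}$ we have $|A| = 3 \times 4 - 1 \times 2 = 10$, so

$$
A^{-1} = \frac{1}{10} \begin{bmatrix} 4 & -1 \\ -2 & 3 \end{bmatrix}
= \begin{bmatrix} 0.4 & -0.1 \\ -0.2 & 0.3 \end{bmatrix}.
$$

Therefore, applying $A^{-1}$ to $y = \begin{bmatrix} 5 \\ 10 \end{bmatrix}$, we find:

$$
A^{-1}y = \begin{bmatrix} 0.4 & -0.1 \\ -0.2 & 0.3 \end{bmatrix} \begin{bmatrix} 5 \\ 10 \end{bmatrix}
= \begin{bmatrix} 5 \times 0.4 + 10 \times (-0.1) \\ 5 \times (-0.2) + 10 \times 0.3 \end{bmatrix}
= \begin{bmatrix} 1 \\ 2 \end{bmatrix},
$$

which is the answer we got in section B.2. In fact, in section B.2 what we did was something very similar to how one inverts a matrix in a computer program.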
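The 2 × 2 inverse formula of eqn. B.10 and the solution $x = A^{-1}y$ can be checked with a few lines of code. The sketch below is in Python (an illustration only; the function name `inverse_2x2` is hypothetical) and uses exact fractions, so that 0.4 appears as 2/5 and so on. A singular matrix is reported as an error rather than attempting the equivalent of 1/0.

```python
from fractions import Fraction


def inverse_2x2(A):
    """Invert a 2 x 2 matrix using eqn. B.10; raise ValueError if |A| = 0."""
    (a11, a12), (a21, a22) = A
    det = Fraction(a11) * a22 - Fraction(a12) * a21   # |A| = a11*a22 - a12*a21
    if det == 0:
        raise ValueError("matrix is singular: |A| = 0")
    return [[a22 / det, -a12 / det],
            [-a21 / det, a11 / det]]


A = [[3, 1],
     [2, 4]]
A_inv = inverse_2x2(A)

# x = A^{-1} y for y = (5, 10)^t.
y = [5, 10]
x = [sum(a * y_c for a, y_c in zip(row, y)) for row in A_inv]

print([[str(a) for a in row] for row in A_inv])   # [['2/5', '-1/10'], ['-1/5', '3/10']]
print(x == [1, 2])                                # True
```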

B.7 Multidimensional (Multivariate) Random Variables

We can now generalise two-d. random variables to p dimensions by extending (X, Y ) to

(X1, X2, . . . , Xp). It is usual to call the p-dimensional (multivariate) random variable a random

vector and to use vector notation: X = (X1, X2, . . . , Xp).

The multivariate Normal pdf, p-dimensional, is given by:

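$$
f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}|K|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T K^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]. \tag{B.11}
$$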
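To see how the pieces of eqn. B.11 fit together, the following Python sketch evaluates it for the two-dimensional case, $p = 2$. It is an illustration under stated assumptions: $\boldsymbol{\mu}$ is taken to be the mean vector and $K$ the covariance matrix (the usual reading of these symbols), $K$ is assumed to be a valid covariance matrix with positive determinant, and the function name `mvn_pdf_2d` is hypothetical. The inverse $K^{-1}$ is obtained from the 2 × 2 formula of eqn. B.10.

```python
import math


def mvn_pdf_2d(x, mu, K):
    """Evaluate eqn. B.11 for p = 2 at the point x.

    x and mu are lists of two numbers; K is a 2 x 2 list of rows,
    assumed to be a valid covariance matrix.
    """
    (k11, k12), (k21, k22) = K
    det = k11 * k22 - k12 * k21
    if det <= 0:
        raise ValueError("K must have a positive determinant")
    # K^{-1} from the 2 x 2 inverse formula (eqn. B.10).
    K_inv = [[k22 / det, -k12 / det],
             [-k21 / det, k11 / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T K^{-1} (x - mu).
    q = sum(d[r] * K_inv[r][c] * d[c] for r in range(2) for c in range(2))
    p = 2
    return math.exp(-0.5 * q) / ((2 * math.pi) ** (p / 2) * math.sqrt(det))


# With mu = 0 and K = I the density at the origin is 1 / (2 pi).
f0 = mvn_pdf_2d([0, 0], [0, 0], [[1, 0], [0, 1]])
print(math.isclose(f0, 1 / (2 * math.pi)))   # True
```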
