

CHAPTER 2 Probability Distributions

Introduction - Random Variables

Recall that ‘events’ are subsets of the domain of ‘elementary events’. It is particularly interesting when the event is expressed in quantitative terms. For example, we can consider the event ‘party X will win the next general election in the UK’. We can also consider the event ‘the government formed after the next general election in the UK will have a majority of n’, where n is an integer in the appropriate range. If a probability were assigned to each possible value of n, then this would be the probability distribution for n.

A random variable is just such a quantified event. It can take values in some given range, which may be infinite or finite, continuous or discrete. A probability is associated with each subset of this range.

We distinguish between the ‘name’ of a random variable (for example, ‘the number N of people out of the n subjects involved in a virtual reality experiment who reported a high sense of presence’) and the possible values that the variable can take (in this case 0, 1, ..., n). We use the notation N ≤ x to stand for the event that the actual value of the random variable is less than or equal to some given number x. For example, in the experiment, the total number of people involved might be n = 20. We may be interested in the event N ≤ 4, that 4 or fewer people reported a high sense of presence. This event will have some associated probability.


In general, suppose that X is a random variable, and that its range of possible values is between a and b (a < b). The event X ≤ x is of special significance - its consideration leads to the construction of probability distributions.

The probability P(X ≤ x) is the starting point for this, and is called ‘the distribution function’ for the random variable. It is generally written

F_X(x) ≡ P(X ≤ x) (EQ 1)

or, where the context is obvious, simply as F(x). It is the probability that the random variable in question (X) takes a value that is less than or equal to some specified value x.

The following properties of the distribution function should be evident:

F(a) = 0 (EQ 2)

F(b) = 1 (EQ 3)

F(x + h) ≥ F(x), all h ≥ 0 (EQ 4)

(EQ 2) says that it is not possible to find a value of X less than its minimum. (EQ 3) expresses the fact that every possible value of X is less than or equal to its upper bound. In this context it is worth remembering that the range of a random variable may be infinite, so that a = -∞ and b = ∞ are valid possibilities. (EQ 4) expresses the requirement for F to be a non-decreasing function. The probability of observing a value less than or equal to x cannot decrease as x increases.

Let’s consider a case where the range of values for X is discrete. Without loss of generality, we will assume that this range consists of the set of successive integers

{a, a+1, ..., b}

where a < b are two integers. Now consider the event (X = x), ‘X takes the specific value x’, where x ∈ {a, a+1, ..., b}. The event that X is at most x is equivalent to saying that X is at most x-1 or X is equal to x. Symbolically,


(X ≤ x-1) ∪ (X = x) = (X ≤ x) (EQ 5)

The union on the left hand side of (EQ 5) consists of two exclusive events (X cannot be at most x-1 and also equal to x). Therefore, applying rule 7,

F(x-1) + P(X = x) = F(x)

and so

P(X = x) = F(x) - F(x-1) (EQ 6)

The probability P(X = x) is called ‘the probability density function’ (or pdf) for X, and for a discrete variable may be found from (EQ 6). It will usually be denoted by f_X(x), or just f(x), provided that the random variable involved is clear from the context.

Now suppose that X varies continuously between a and b. The quantity F(x) - F(x-1) now represents the probability P(x-1 ≤ X ≤ x), i.e., that X takes a value between x-1 and x, and there are uncountably many such values in this range (or in any range, however small).
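
For the discrete case, (EQ 6) is easy to verify numerically. The following short Python fragment (assuming NumPy is available; the toy distribution - number of heads in two fair coin tosses - is our own, not from the notes) builds the distribution function by cumulative summation and recovers the pdf by differencing:

import numpy as np

values = np.array([0, 1, 2])         # number of heads in two fair coin tosses
f = np.array([0.25, 0.5, 0.25])      # pdf values f(x) = P(X = x)
F = np.cumsum(f)                     # distribution function F(x) = P(X <= x)
print(F)                             # [0.25 0.75 1.  ]; note F(b) = 1, cf. (EQ 3)
print(np.diff(F, prepend=0.0))       # recovers f(x) = F(x) - F(x-1), cf. (EQ 6)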

A more appropriate way to consider random variables that have a continuous range is to appeal to the calculus. Let dx > 0 be the famous ‘infinitesimally small’ quantity such that

[F(x + dx) - F(x)] / dx → f(x) as dx → 0

Then

dF(x) = f(x) dx (EQ 7)

and f(x) is again called the ‘probability density function’ for a continuous random variable. (EQ 7) gives an interpretation for this: f(x) dx is the probability of X taking a value in a ‘small neighbourhood’ about x.

The following properties of the pdf follow from the definition, where D is the domain of x:

f(x) ≥ 0, all x ∈ D (EQ 8)

∑_{x ∈ D} f(x) = 1 (discrete case) (EQ 9)

∫_D f(x) dx = 1 (continuous case) (EQ 10)


P(X ∈ A) = ∑_{x ∈ A} f(x) when X is discrete
P(X ∈ A) = ∫_A f(x) dx when X is continuous (EQ 11)

and where A ⊂ D.

Summary Measures of Probability Distributions

In this section we will consider ways to summarise distributions with a few scalar numbers. We will treat the continuous case - the discrete case follows immediately by replacing integrals with summations (and deleting the dx terms). The purpose of the summary measures is to get a ‘quick overview’ of the distribution, without knowing everything about it. For example, if you are told that in a virtual reality experiment the average level of presence on a 1 to 10 scale, where 10 means ‘high’, was 8, that already says quite a lot about the results. If in addition you were told that the variation was ‘low’, this would tell you even more. If you were asked to guess the level of presence for a person chosen at random from amongst the subjects, it would be sensible for you to choose ‘8’. On the other hand, had you been told that the variation amongst the group was very high, the choice of 8 would not have a high degree of probability associated with it, although it would still be about the only thing you could base your decision on.

These ideas can be made precise. The expected value of a random variable is just the integral (sum) over all possible values, each weighted by its probability. It is also called the mean of the distribution of X. If X is the random variable, with domain D, then the expected value is written E(X), and is defined by:


E(X) = ∫_D x f(x) dx (EQ 12)

In general, if g is a function over D then the expected value of g(X) can be defined:

E(g(X)) = ∫_D g(x) f(x) dx (EQ 13)

It is simple to verify that for any constants a and b:

E(a + b g(X)) = a + b E(g(X)) (EQ 14)

The variance of a distribution is a measure of how much dispersion there is away from the mean. It is the integral (sum) of all squared differences between the values and the expected value, weighted by the probabilities.

Var(X) = E((X - E(X))²) = ∫_D (x - E(X))² f(x) dx (EQ 15)

If we write µ ≡ E(X), which is a constant, and make use of (EQ 14), then it is easy to see that:

Var(X) = E(X²) - µ² (EQ 16)

The variance is measured in squared units. The standard deviation is the square root of the variance, and therefore is in the same units as the original variable.

In general, the rth moment about the origin of the distribution of a random variable is defined as:

µ'_r = E(X^r) (EQ 17)

and the rth moment about the mean is defined as:


µ_r = E((X - µ)^r) (EQ 18)

It is easy to see that

E(X) = µ'_1 and Var(X) = µ_2, and that µ_2 = µ'_2 - (µ'_1)². (EQ 19)

It is important to realise that a distribution may have no moments - there is no guarantee that the integrals involved will converge. However, for most of the common distributions that we meet in statistics, the moments are finite.

Moment Generating Functions

A moment generating function is just a different way of representing a distribution, in terms of its moments (when these exist). This is a mathematical way of saying - if you know all the moments of a distribution then you know the distribution. The real use of the moment generating function is that it is a powerful tool for manipulating distributions. For the purposes of these notes (even though it is not strictly rigorously mathematically true) we will assume that (when it exists) the moment generating function is unique to the distribution.

Having spent time talking about it, we can define it as follows:

M_X(t) = E(e^(tX)) (EQ 20)

where t is an arbitrary parameter.

This seems unremarkable. However, expanding the exponential, we get:

M_X(t) = E(∑_{r=0}^∞ (tX)^r / r!) = ∑_{r=0}^∞ (t^r / r!) µ'_r (EQ 21)

In other words the moment generating function (mgf) ‘contains all the moments’, and the rth moment about the origin is the coefficient of t^r / r! in its expansion in powers of t.


Note that

M_X(0) = 1 (EQ 22)

Also, differentiating the mgf with respect to t, and then setting t = 0, we can see that

M_X^(r)(0) = µ'_r (EQ 23)

where M_X^(r)(t) denotes the rth derivative with respect to t.

Some additional properties are important to note. First, for any function g(X) of the random variable:

M_{g(X)}(t) = E(e^(t g(X))) (EQ 24)

It follows from this definition that, for any constants a and b,

M_{a+bX}(t) = e^(ta) M_X(tb) (EQ 25)
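
As a concrete illustration of (EQ 23) and (EQ 16), the following short Python fragment (assuming the SymPy library is available; the single Bernoulli trial with success probability p is our own example, not from the notes) differentiates an mgf symbolically to obtain the first two moments about the origin:

import sympy as sp

t, p = sp.symbols('t p', positive=True)
M = (1 - p) + p * sp.exp(t)            # mgf of a single 0/1 trial: E(e^(tX))
mu1 = sp.diff(M, t).subs(t, 0)         # first derivative at 0: E(X)
mu2 = sp.diff(M, t, 2).subs(t, 0)      # second derivative at 0: E(X^2)
print(mu1)                             # p
print(sp.simplify(mu2 - mu1**2))       # Var(X) = E(X^2) - mu^2, equal to p(1-p), cf. (EQ 16)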

Some Common Distributions

The Discrete Uniform Distribution

Suppose X is a random variable with domain {1, 2, ..., n}. As far as you are concerned, any one of these n values is equally likely. Then

f(x) = 1/n, x ∈ {1, ..., n}

From (EQ 12) we can find:

E(X) = ∑_{x=1}^{n} x/n = (n+1)/2


E(X²) = ∑_{x=1}^{n} x²/n = (n+1)(2n+1)/6

and so using (EQ 16)

Var(X) = (n+1)(2n+1)/6 - ((n+1)/2)² = (n+1)(n-1)/12
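
These two results are easy to check numerically. A small Python fragment (assuming NumPy; the choice n = 10 is arbitrary) applies the discrete forms of (EQ 12) and (EQ 15) directly:

import numpy as np

n = 10
x = np.arange(1, n + 1)                 # domain {1, ..., n}
f = np.full(n, 1.0 / n)                 # f(x) = 1/n
mean = np.sum(x * f)                    # discrete form of (EQ 12)
var = np.sum((x - mean)**2 * f)         # discrete form of (EQ 15)
print(mean, (n + 1) / 2)                # both 5.5
print(var, (n + 1) * (n - 1) / 12)      # both 8.25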

The Continuous Uniform Distribution

This time X ranges continuously between the bounds a and b (a < b), and ‘nothing is known’ about X other than this. So the probability density function is

f(x) = 1/(b - a), x ∈ [a, b]

Let’s find the moment generating function using (EQ 20).

M_X(t) = ∫_a^b e^(tx)/(b - a) dx = [e^(tx)/(t(b - a))]_a^b = (1/(b - a)) (e^(tb) - e^(ta))/t

If we expand this out as far as terms quadratic in t, and take the coefficients of t and t²/2, we get

M_X(t) = (1/(b - a)) ((b - a) + (t/2)(b² - a²) + (t²/6)(b³ - a³) + ...)

It follows from (EQ 21) that


E(X) = µ'_1 = (a + b)/2

µ'_2 = (b³ - a³)/(3(b - a)) = (b² + ab + a²)/3

and therefore using (EQ 19)

Var(X) = (b² + ab + a²)/3 - ((a + b)/2)² = (b - a)²/12

We use the notation

X ∼ uniform(a, b)

to read ‘X has a uniform distribution in the range a to b’.
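
A quick check with SciPy (an assumption of ours, not part of the notes; SciPy parametrises the uniform distribution by loc = a and scale = b - a) confirms the mean and variance just derived:

from scipy import stats

a, b = 2.0, 7.0
u = stats.uniform(loc=a, scale=b - a)    # uniform(a, b)
print(u.mean(), (a + b) / 2)             # E(X) = (a+b)/2
print(u.var(), (b - a)**2 / 12)          # Var(X) = (b-a)^2/12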

The Binomial Distribution

As the name implies, here we are only interested in two outcomes, whether an event occurs or does not occur in given circumstances. The probability assigned each time to the occurrence of the event is p. The most famous example, of course, is coin tossing. The coin is tossed n times. The event is ‘Head’. The random variable of interest is ‘how many heads were there in the n tosses’, where n is fixed in advance. It is assumed that the coin is ‘fair’ so that each time the probability is 0.5, and also that there is independence between the outcomes.

More generally, an experiment is repeated n times. Each time an event E may occur or not. The probability for E is p, which is constant. The successive experiments have outcomes that are independent in the probabilistic sense. Such trials of an experiment are called ‘Bernoulli trials’ or ‘Binomial trials’. Suppose the outcome of trial i is labelled ‘1’ if E occurs and ‘0’ if not. Then the random variable (X) represents the number of times that E occurs, i.e., the number of 1s. We are interested in the probability of the event (X = x).

Consider a specific sequence of outcomes where X = x. This can be written in the form:

000…0 111…1

i.e., n-x occurrences of 0, followed by x occurrences of 1.

Now if this particular sequence happened you might be surprised. However, every such sequence with x occurrences of E and n-x non-occurrences of E has the same probability, namely p^x (1 - p)^(n-x). There are C(n, x) = n!/(x!(n-x)!) such sequences (i.e., this many ways of reordering the n-x 0s and x 1s). Hence

f(x) = C(n, x) p^x (1 - p)^(n-x), x = 0, 1, 2, ..., n (EQ 26)

which is the famous binomial distribution.

It should be clear that these probabilities are successive terms in the binomial expansion of (p + (1 - p))^n, and so obviously ∑ f(x) = 1.

The mgf is:


M_X(t) = ∑_{x=0}^{n} e^(tx) C(n, x) p^x (1 - p)^(n-x) = ∑_{x=0}^{n} C(n, x) (pe^t)^x (1 - p)^(n-x) = (pe^t + q)^n

where q = 1 - p.

This time we’ll make use of (EQ 23) in deriving the mean and variance.

M_X^(1)(t) = n(pe^t + q)^(n-1) pe^t

Setting t = 0, and recalling that p + q = 1:

µ'_1 = E(X) = M_X^(1)(0) = np (EQ 27)

Differentiating again:

M_X^(2)(t) = d/dt [n(pe^t + q)^(n-1) pe^t] = npe^t (pe^t + q)^(n-2) (q + npe^t)

and setting t = 0,


M_X^(2)(0) = npq + (np)²

and therefore:

Var(X) = npq (EQ 28)

We use the notation

X ∼ binomial(n, p)

to read ‘X has a binomial distribution with probability p, and number of trials n’.
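
SciPy’s binomial distribution (an assumption of ours; the values n = 20, p = 0.35 are arbitrary) gives a quick check of (EQ 27) and (EQ 28), and of the fact that the probabilities sum to 1:

import numpy as np
from scipy import stats

n, p = 20, 0.35
q = 1 - p
print(stats.binom.mean(n, p), n * p)       # (EQ 27): np
print(stats.binom.var(n, p), n * p * q)    # (EQ 28): npq
k = np.arange(0, n + 1)
print(stats.binom.pmf(k, n, p).sum())      # terms of (p + q)^n, summing to 1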

The Beta Distribution

It was assumed that the probability of event E occurring is known and equal to p. But suppose it is not known? p itself would have a probability distribution. If ‘nothing’ at all was known, the distribution of p could be modelled as uniform(0, 1). But something might be known - for example, that p was closer to 1 than 0, or that its most probable value is 0.75, and so on. This can be modelled by the Beta distribution.

First, we define the function known as the Beta function, denoted B(a, b), for constants a > 0, b > 0.

B(a, b) = ∫_0^1 u^(a-1) (1 - u)^(b-1) du (EQ 29)

The Beta function has a close connection with another function, called the Gamma function. This is defined for a > 0 as:


Γ(a) = ∫_0^∞ x^(a-1) e^(-x) dx (EQ 30)

It is easy to show that:

Γ(a + 1) = aΓ(a) (EQ 31)

and that Γ(1) = 1. It follows from (EQ 31) that Γ(n + 1) = n! for integer n.

Also, it is straightforward to show that:

Γ(a)Γ(b) = Γ(a + b) B(a, b) (EQ 32)

The Beta probability distribution, for random variable X with domain [0, 1], is then defined as:

f(x) = (1/B(a, b)) x^(a-1) (1 - x)^(b-1), x ∈ [0, 1] (EQ 33)

This time we will find the mean and variance directly, making use of (EQ 32).

E(X) = (1/B(a, b)) ∫_0^1 x^a (1 - x)^(b-1) dx = B(a + 1, b)/B(a, b) = a/(a + b) (EQ 34)

E(X²) = (1/B(a, b)) ∫_0^1 x^(a+1) (1 - x)^(b-1) dx = B(a + 2, b)/B(a, b) = a(a + 1)/((a + b + 1)(a + b))

Var(X) = E(X²) - E(X)² = ab/((a + b)²(a + b + 1)) (EQ 35)

The notation X ∼ Beta(a, b) reads ‘X has a Beta distribution with parameters a and b’. Note that uniform(0, 1) ≡ Beta(1, 1).
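
As a numerical check of (EQ 34) and (EQ 35), and of the modelling idea above, here is a Python fragment assuming SciPy. The parameters a = 4, b = 2 are our own choice, picked so that the most probable value of p is 0.75, using the standard result (not derived in these notes) that the mode of Beta(a, b) is (a-1)/(a+b-2):

from scipy import stats

a, b = 4.0, 2.0                                       # mode (a-1)/(a+b-2) = 0.75
X = stats.beta(a, b)
print(X.mean(), a / (a + b))                          # (EQ 34)
print(X.var(), a * b / ((a + b)**2 * (a + b + 1)))    # (EQ 35)
print(X.pdf(0.75) > X.pdf(0.5), X.pdf(0.75) > X.pdf(0.9))   # density is highest around 0.75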


The Negative Binomial Distribution

When there are Bernoulli trials, the binomial distribution is appropriate to model the number of occurrences of an event in (fixed) n trials. Instead, we can consider the number of trials necessary to get r occurrences of the event. Here the random variable X, denoting the number of trials required, has domain x = r, r+1, .... Here X has no theoretical upper bound (we could wait forever).

One sequence corresponding to the event (X = x) is:

000…0 111…1 1

i.e., x-r non-occurrences, then r-1 occurrences, then a final occurrence. The first x-r trials do not result in the event. The next r-1 trials do result in the event. But still the requirement for r occurrences of the event is not satisfied. This is satisfied on the last trial shown. For r occurrences of an event, the last trial must always result in the event. This is the rth occurrence.

This sequence has probability (1 - p)^(x-r) p^r.

Since the last trial must result in an occurrence of the event, there are C(x-1, r-1) such sequences.

Therefore the probability density function is:

f(x) = C(x-1, r-1) (1 - p)^(x-r) p^r, x = r, r+1, ... (EQ 36)


It is left as an exercise to show that:

E(X) = r/p, Var(X) = rq/p², where q = 1 - p (EQ 37)
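
A brute-force simulation makes a useful check on (EQ 37). The Python fragment below (assuming NumPy; the values p = 0.3, r = 5 are arbitrary) generates Bernoulli trials until the rth occurrence and records how many trials were needed:

import numpy as np

rng = np.random.default_rng(0)
p, r, reps = 0.3, 5, 20_000
q = 1 - p

def trials_until_rth_occurrence(rng, p, r):
    trials, occurrences = 0, 0
    while occurrences < r:
        trials += 1
        occurrences += rng.random() < p     # one Bernoulli trial
    return trials

samples = np.array([trials_until_rth_occurrence(rng, p, r) for _ in range(reps)])
print(samples.mean(), r / p)                # both around 16.7
print(samples.var(), r * q / p**2)          # both around 38.9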

The Poisson Distribution

Suppose an event occurs at ‘random’ moments in time, and we are interested in modelling the probability of the number of occurrences in a time period t. For the moment, suppose that it is a ‘unit of time’ in which we are interested, say the interval [t, t+1] for some fixed t. We suppose that whatever t, the random process behaves the same, so that the specific value of t is immaterial. Now divide the time period up into n equal intervals, and let the probability of the event occurring in any time interval be p. We assume further that the time intervals are independent, and that the probability of two or more happenings in any of these intervals of length 1/n is negligible. If we let X be the random variable denoting the number of happenings, then X ∼ binomial(n, p). Now let n get larger and larger, and correspondingly p get smaller and smaller, but such that the mean rate of occurrences per unit time, µ, remains the same. From (EQ 27), we must have µ = np.

C(n, x) p^x (1 - p)^(n-x) = [n(n-1)…(n-x+1)/x!] (µ/n)^x (1 - µ/n)^(n-x)
= (µ^x/x!) (n/n)((n-1)/n)…((n-x+1)/n) (1 - µ/n)^(n-x)
→ (µ^x/x!) e^(-µ) as n → ∞ (EQ 38)

This defines the Poisson distribution:

f(x) = (µ^x/x!) e^(-µ), x = 0, 1, 2, ... (EQ 39)

The mgf is easily found:


M_X(t) = ∑_{x=0}^∞ e^(-µ) (µe^t)^x / x! = exp(µe^t - µ)

It is easy to show from this that:

E(X) = µ, Var(X) = µ (EQ 40)
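
The limiting argument in (EQ 38) can be watched happening numerically. The fragment below (assuming NumPy and SciPy; µ = 3 is arbitrary) holds np = µ fixed while n grows, and reports the largest discrepancy between the binomial and Poisson probabilities over x = 0, …, 11:

import numpy as np
from scipy import stats

mu = 3.0
x = np.arange(0, 12)
for n in (10, 100, 10_000):
    p = mu / n                              # keep np = mu fixed
    gap = np.max(np.abs(stats.binom.pmf(x, n, p) - stats.poisson.pmf(x, mu)))
    print(n, gap)                           # the gap shrinks as n grows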

The Exponential Distribution

Let’s stay with the same context as for the Poisson distribution. However, in this case we’re interested not in the number of events that occurs but the time between the events. Call Y the random variable that denotes the time to the next event, and consider the probability that Y ≤ y. The domain of this r.v. is [0, ∞). Again consider that time is divided into intervals, each of size 1/n, and that the mean rate of occurrence of the event is µ per unit time. Now in time y there are ny intervals. In order for the time to be less than or equal to y, the event must occur in the next time interval (probability of this is p), or the one after that (probability is pq), or the time after that (probability is pq²), and so on. In other words, the event Y ≤ y is equivalent to: occurrence in the next interval, or non-occurrence in the next interval followed by occurrence in the one after, and so on. Since these are exclusive events, we have for fixed intervals of size 1/n:

P(Y ≤ y) = p(1 + q + … + q^(ny-1)) = p (1 - q^(ny))/(1 - q) = 1 - (1 - µ/n)^(ny) (EQ 41)

As before let n → ∞, to arrive at:

P(Y ≤ y) = 1 - e^(-µy), y ≥ 0 (EQ 42)


This is the distribution function F(y), and therefore differentiating with respect to y to get the pdf:

f(y) = µe^(-µy), y ≥ 0 (EQ 43)

This is called the exponential distribution, and is a good probability model for the ‘time between random events’. Its mgf is easy to find:

M_Y(t) = ∫_0^∞ e^(ty) µe^(-µy) dy = µ ∫_0^∞ e^(-y(µ - t)) dy = (1 - t/µ)^(-1) (EQ 44)

Expanding the mgf in powers of t:

M_Y(t) = 1 + t/µ + (t/µ)² + … (EQ 45)

From this it is easy to see that:

E(Y) = 1/µ, Var(Y) = 1/µ² (EQ 46)
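
The ‘time between random events’ interpretation is easy to see in simulation. The following Python fragment (assuming NumPy; the rate µ = 2.5 is arbitrary) draws exponential gaps between events, checks (EQ 46), and then counts events per unit time interval, which should behave like the Poisson distribution of (EQ 39) and (EQ 40):

import numpy as np

rng = np.random.default_rng(1)
mu = 2.5                                          # mean rate of events per unit time
gaps = rng.exponential(scale=1/mu, size=100_000)  # times between successive events
print(gaps.mean(), 1/mu)                          # (EQ 46): E(Y) = 1/mu
print(gaps.var(), 1/mu**2)                        # (EQ 46): Var(Y) = 1/mu^2

times = np.cumsum(gaps)                           # event times
counts = np.bincount(times.astype(int))[:int(times[-1])]   # events per unit interval
print(counts.mean(), counts.var())                # both close to mu, cf. (EQ 40)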

The Gamma Distribution

The gamma distribution is closely associated with the exponential, as we shall see. The distribution is defined as:

f(x) = (λ^p/Γ(p)) x^(p-1) e^(-λx), x > 0 (EQ 47)

The mgf is therefore:


M_X(t) = (λ^p/Γ(p)) ∫_0^∞ x^(p-1) e^(-x(λ - t)) dx

By carrying out a change of variable in the integration (e.g., z = x(λ - t)), it is easy to show that:

M_X(t) = (1 - t/λ)^(-p) (EQ 48)

Comparing this with (EQ 44) it is clear that the exponential distribution is a special case of the gamma, with p = 1. Moreover, the mgf of the gamma distribution is equal to the mgf of the exponential distribution raised to the pth power. Later, we will see that the significance of this is that when p is an integer, the gamma distribution may be considered as the distribution of the sum of p independent exponential variables, each with the same mean. In terms of the event occurring randomly in time, the gamma distribution models the time between each successive p occurrences of the event.

From the mgf it is easy to show that:

E(X) = p/λ, Var(X) = p/λ² (EQ 49)
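
The sum-of-exponentials interpretation (taken up again below under ‘Sums of Random Variables’) can be checked directly. A Python fragment assuming NumPy and SciPy, with λ = 1.5 and p = 4 chosen arbitrarily:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
lam, p = 1.5, 4
# sum of p independent exponential(lambda) variables, many replications
sums = rng.exponential(scale=1/lam, size=(100_000, p)).sum(axis=1)
print(sums.mean(), p / lam)                    # (EQ 49): E(X) = p/lambda
print(sums.var(), p / lam**2)                  # (EQ 49): Var(X) = p/lambda^2
qs = [0.25, 0.5, 0.9]
print(np.quantile(sums, qs))                   # sample quantiles ...
print(stats.gamma.ppf(qs, a=p, scale=1/lam))   # ... match the gamma(p, lambda) quantiles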

The Normal Distribution

The normal (also called ‘Gaussian’) distribution is in many senses the central distribution of statistics. It is defined as follows:

f(x) = (1/(σ√(2π))) exp{-(1/2)((x - µ)/σ)²}, -∞ < x < ∞ (EQ 50)

It is relatively straightforward to show that its mgf is:


M_X(t) = exp{µt + σ²t²/2} (EQ 51)

and therefore

E(X) = µ, Var(X) = σ² (EQ 52)

Using (EQ 25) we find the mgf of the random variable:

Z = (X - µ)/σ (EQ 53)

as

M_Z(t) = exp{t²/2} (EQ 54)

Comparing (EQ 54) with (EQ 51) it is clear that Z also has a Normal distribution, but with mean 0 and variance 1. This is called a Standard Normal Distribution, and the notation Z ∼ N(0, 1) is used to denote this.

The word ‘normal’ in the name of this distribution is not meant to imply that other distributions are abnormal. It is simply that the distribution reflects the fact that in many circumstances ‘departures from the norm’ happen in a symmetric, regular pattern. The distribution clusters around its mean value, and tapers away symmetrically, and relatively rapidly, towards zero probability on each side of the mean. Approximately 68% of the distribution is within one standard deviation of the mean, 95% within two standard deviations, and 99% within 2.58 standard deviations. The distribution is produced in nature where the random variable in question is produced as an averaging effect of a large number of constituent ‘causes’, as we shall see later, in our discussion of the Central Limit Theorem.
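
The quoted coverage figures are easy to reproduce from the standard normal distribution function (a short Python fragment assuming SciPy is available):

from scipy import stats

for k in (1.0, 2.0, 2.58):
    coverage = stats.norm.cdf(k) - stats.norm.cdf(-k)   # P(-k <= Z <= k) for Z ~ N(0,1)
    print(k, round(coverage, 4))                        # ~0.6827, ~0.9545, ~0.9901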


Multivariate Distributions

Distribution and Density Functions

If X and Y are two discrete random variables, then the event (X = x) ∩ (Y = y) has probability

P((X = x) ∩ (Y = y)) ≡ P(X = x, Y = y) ≡ f(x, y) (EQ 55)

defining the joint probability distribution exactly the same way as in the univariate case. Similarly, if X and Y are continuous then f(x, y) dx dy is the probability that (X, Y) is in a small neighbourhood around (x, y). This defines the probability density f(x, y), with meaning:

P(-∞ ≤ X ≤ x ∩ -∞ ≤ Y ≤ y) = ∫_{-∞}^{y} ∫_{-∞}^{x} f(x, y) dx dy ≡ F(x, y) (EQ 56)

As before F(x, y) is the distribution function and f(x, y) the probability density function, with exactly the same meanings as in the univariate case. The integral over the whole domain of the random variables is equal to 1, since this is the probability of the ‘certain event’ (the random variables take some value in their range).

This is extended in an obvious way to the joint distribution of several variables (X1, X2, …, Xn), which is denoted f(x1, x2, …, xn). Given a joint distribution of a set of random variables, the distribution of any subset of this set can be found by integrating out all the others. These are then called the ‘marginal distributions’. For example, from f(x, y) we could find the distribution of X alone by integrating out Y. (Of course, replace ‘integrate’ by ‘sum’ in the discrete case.)

If f(x, y) is the joint distribution then f(x) = ∫_{-∞}^{∞} f(x, y) dy is the marginal distribution. (EQ 57)

We can define conditional distributions and independence directly from the rules 8. and 9. on page 6.

The joint distribution of two variables is related to the conditional distribution by:

f(x, y) = f(x|y) f(y) = f(y|x) f(x) (EQ 58)


By induction this extends to any number of variables, for example:

f(x1, x2, x3) = f(x1, x2|x3) f(x3) = f(x1|x2, x3) f(x2|x3) f(x3)

and

f(x1, x2, …, xn) = f(x1|x2, x3, …, xn) f(x2|x3, …, xn) … f(xn) (EQ 59)

The random variables are independent if and only if the joint distribution of all of them, and each subset of ‘marginal distributions’, factorises into the product of the appropriate set of (marginal) univariate distributions. In particular:

f(x1, x2, …, xn) = f(x1) f(x2) … f(xn) (EQ 60)

Summary Measures and Relationships

The notions of expected value and the moment generating function apply directly. In particular:

E(Xi) = ∫_{-∞}^{∞} … ∫_{-∞}^{∞} xi f(x1, x2, …, xn) dx1 … dxn (EQ 61)

and if g(X1, …, Xn) is any function of the random variables, then the moment generating function of g is:

M_g(t) = E(e^(t g(X1, …, Xn))) (EQ 62)

Whenever there is more than one variable, the question of relationship between variables immediately arises. One measure of this is the so-called covariance between two variables, and the scaled version of it called the correlation. The covariance between two variables is defined as:


Cov(X, Y) = E[(X - E(X))(Y - E(Y))] = E(XY) - E(X)E(Y) (EQ 63)

It measures the degree of linear relationship between X and Y. Note that:

Cov(X, X) = Var(X) (EQ 64)

The Cauchy-Schwarz inequality shows that:

(∫_{-∞}^{∞} ∫_{-∞}^{∞} (x - µ_x)(y - µ_y) f(x, y) dx dy)² ≤ ∫_{-∞}^{∞} (x - µ_x)² f(x) dx ∫_{-∞}^{∞} (y - µ_y)² f(y) dy

which is equivalent to:

Cov(X, Y)² ≤ Var(X)Var(Y) (EQ 65)

From this we define the correlation coefficient as:

ρ(X, Y) = Cov(X, Y)/√(Var(X)Var(Y)), with -1 ≤ ρ(X, Y) ≤ 1 (EQ 66)

using (EQ 65).

The Cauchy-Schwarz inequality becomes an equality when X and Y are linearly related, as may be easily checked. In this case the square of the covariance is equal to the product of the variances, and the correlation-squared is equal to 1. The correlation is a measure of the extent of linearity between two variables. It is ±1 for exact linearity and zero for no linearity at all (e.g., an exact circle). Its sign gives the direction of the slope between Y and X. It is important to understand that covariance and correlation only measure linear relationship. Even if the correlation is zero, it does not mean that there is no relationship between the two variables - each case must be considered separately.
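
Both points in this paragraph - exact linearity giving correlation ±1, and an exact circle giving correlation zero despite complete dependence - can be seen in a short Python fragment (assuming NumPy):

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = 2.0 * x + 1.0                        # exact linear relationship
print(np.corrcoef(x, y)[0, 1])           # 1.0 (to rounding): correlation is +1

theta = rng.uniform(0, 2 * np.pi, size=100_000)
u, v = np.cos(theta), np.sin(theta)      # points on an exact circle
print(np.corrcoef(u, v)[0, 1])           # ~0: no linear relationship, yet u and v are dependent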


Vector Random Variables

When we have several random variables, it is often convenient and succinct to write them as one vector random variable,

X = (X1, X2, …, Xn)ᵀ (EQ 67)

The expected value of the vector random variable is defined in an obvious way as:

E(X) = (E(X1), E(X2), …, E(Xn))ᵀ = µ (EQ 68)

The covariance matrix is the matrix consisting of the variances on the main diagonal, and the covariances as the off-diagonal terms. In other words, the general element is Cov(Xi, Xj) = σ_ij. The covariance matrix is defined in (EQ 69). The matrix is obviously symmetric.

V(X) = [ σ1²  σ12  …  σ1n
         σ21  σ2²  …  σ2n
          …    …   …   …
         σn1  σn2  …  σn² ] = Σ (EQ 69)

Sums of Random Variables

Suppose that X1, X2, …, Xn are random variables, and a1, a2, …, an are constants. We will consider the new random variable Z = ∑_{i=1}^{n} a_i X_i.

First of all note that:


E(Z) = ∑_{i=1}^{n} a_i E(X_i) (EQ 70)

which follows immediately from the definition of expectation as an integral (or sum).

The same is not true in general for the variance. Consider the case n = 2:

Var(a1X1 + a2X2) = E((a1(X1 - µ1) + a2(X2 - µ2))²) = a1²Var(X1) + a2²Var(X2) + 2a1a2Cov(X1, X2) (EQ 71)

Therefore,

Var(∑_{i=1}^{n} a_i X_i) = ∑_{i=1}^{n} a_i² Var(X_i), provided that Cov(X_i, X_j) = 0 for all i ≠ j (EQ 72)
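
The role of the covariance term in (EQ 71) shows up clearly in simulation. A Python fragment (assuming NumPy; the construction of the correlated pair is our own):

import numpy as np

rng = np.random.default_rng(4)
e1, e2 = rng.normal(size=(2, 200_000))   # independent N(0,1) building blocks
x1 = e1
x2 = 0.6 * e1 + 0.8 * e2                 # correlated with x1 (Cov around 0.6)
a1, a2 = 2.0, -1.0
z = a1 * x1 + a2 * x2

cov = np.cov(x1, x2)[0, 1]
print(z.var())                                               # left side of (EQ 71)
print(a1**2 * x1.var() + a2**2 * x2.var() + 2*a1*a2*cov)     # right side of (EQ 71)
print(a1**2 * x1.var() + a2**2 * x2.var())                   # (EQ 72) would be wrong here: Cov != 0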

The moment generating function of the sum of n random variables is:

M_{X1+X2+…+Xn}(t) = E(e^(t(X1+X2+…+Xn))) = ∫_{-∞}^{∞} … ∫_{-∞}^{∞} exp{t ∑_{i=1}^{n} x_i} f(x1, …, xn) dx1 dx2 … dxn (EQ 73)

Now if the variables are independent, the joint distribution factorises into the product of all the univariate distributions from (EQ 60). In this case, the mgf will be the product of all the individual mgf’s:

M_{∑Xi}(t) = ∏_i M_{Xi}(t), in the case of independence (EQ 74)

If in addition, the random variables are identically distributed, all the mgf’s will be the same, in which case:


M_{∑Xi}(t) = M_X(t)^n, when independent and identically distributed (EQ 75)

This gives us a very powerful result for finding the distributions of sums of identically distributed independent random variables.

For example, recall the mgf of the exponential distribution (EQ 44) and the gamma distribution (EQ 48). The mgf of the gamma distribution is that of the exponential distribution raised to the power p. This shows that the sum of p independent exponentially distributed random variables with the same mean has a gamma distribution.

More Distributions

Using the above tools we can now consider a number of important distributions used widely in statistical inference.

The Chi-Squared Distribution

Suppose that Z is standard normal N(0, 1), and consider the distribution of Z². The mgf can be found from:

M_{Z²}(t) = (1/√(2π)) ∫_{-∞}^{∞} e^(tz²) e^(-z²/2) dz = (1/√(2π)) ∫_{-∞}^{∞} e^(-(z²/2)(1 - 2t)) dz

If we make the substitution w = z√(1 - 2t) into the integral then we can find:

M_{Z²}(t) = (1 - 2t)^(-1/2) (EQ 76)

Using the result that the mgf of the sum of n independent and identically distributed random variables is the mgf of the distribution raised to power n, we can see that the sum of squares (S) of n independent N(0, 1) random variables has mgf:


M_S(t) = (1 - 2t)^(-n/2) (EQ 77)

Compare (EQ 77) with (EQ 48) for the gamma distribution. It can be seen that the distribution of S is gamma, with p = n/2 and λ = 1/2. This distribution occurs so frequently in statistics that it is given a special name: the Chi-Squared, χ², distribution with n degrees of freedom. So to put it another way: the chi-squared distribution with n degrees of freedom arises as the sum of squares of n independent standard N(0, 1) random variables.
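
This characterisation is straightforward to check by simulation. A Python fragment assuming NumPy and SciPy (n = 7 is arbitrary); note that the mean n and variance 2n follow from (EQ 49) with p = n/2 and λ = 1/2:

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 7
z = rng.normal(size=(100_000, n))
s = (z**2).sum(axis=1)                  # sum of squares of n standard normals
print(s.mean(), n)                      # mean p/lambda = n
print(s.var(), 2 * n)                   # variance p/lambda^2 = 2n
qs = [0.5, 0.95]
print(np.quantile(s, qs), stats.chi2.ppf(qs, df=n))   # quantiles match chi-squared(n)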

The Multivariate Normal Distribution

Let X be a vector consisting of n random variables, with mean vector µ and variance matrix Σ. The multivariate normal distribution has density function:

f(x1, x2, …, xn) ∝ exp{-(1/2)(x - µ)ᵀ Σ⁻¹ (x - µ)} (EQ 78)

where Aᵀ represents the matrix A transposed. We use the ‘proportional to’ sign, since it is known that the total integral is always 1, and therefore the constant multiplier is always well-defined.

In the special case when all the covariances are zero,

Σ⁻¹ = diag(1/σ1², 1/σ2², …, 1/σn²)

and in this case the density function factorises into the product of the univariate density functions:


f(x1, x2, …, xn) ∝ exp{-(1/2) ∑_{i=1}^{n} ((x_i - µ_i)/σ_i)²} (EQ 79)

Note that the implication of this is that when the variables are uncorrelated, they are also independent. The Normal distribution is the only distribution where zero covariance also implies independence.

Finally, if these independent Normal random variables all have the same mean and variance (i.e., they are identically distributed) then the joint density function is:

f(x1, x2, …, xn) ∝ exp{-(1/(2σ²)) ∑_{i=1}^{n} (x_i - µ)²} (EQ 80)
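
A short Python fragment (assuming NumPy; the mean vector and covariance matrix are our own choices) samples from a bivariate normal and recovers µ and Σ empirically, in the sense of (EQ 68) and (EQ 69):

import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = rng.multivariate_normal(mu, Sigma, size=200_000)
print(x.mean(axis=0))                    # close to mu, cf. (EQ 68)
print(np.cov(x, rowvar=False))           # close to Sigma, cf. (EQ 69)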

The Dirichlet Distribution

This time X is again a vector of n random variables, but they are all between 0 and 1, and their sum is equal to 1. The density function for the Dirichlet distribution is defined as:

f(x1, x2, …, xn) = [Γ(a1 + a2 + … + an)/(Γ(a1)Γ(a2)…Γ(an))] x1^(a1-1) x2^(a2-1) … xn^(an-1) (EQ 81)

where ∑_{i=1}^{n} x_i = 1, x_i ≥ 0 for all i, and a_i > 0 for all i.

Notice that this is a generalisation of the Beta distribution (EQ 33), and that the marginal distribution of any subset of the variables has a Dirichlet distribution. The univariate distributions are Beta. Notice also that because of the constraint imposed by the sum of the variables always being 1, they are not independent, since any one of the variables may be expressed as one minus the sum of the others. Hence the joint distribution cannot factorise into the product of the marginals.

The Dirichlet distribution is appropriate for modelling the joint distribution of random variables such as a set of proportions that sum to 1. For example, suppose in a questionnaire there is a question with fixed responses on a 1 to n scale, and the respondents have to choose one of the n answers. Then, other things being equal,


the probability distribution of the proportions of people who answer in the various categories, might be subjectively modelled as Dirichlet.
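
Sampling makes the structure of the distribution visible. The following Python fragment assumes NumPy; the parameter values are arbitrary, and the fact used in the last line - that a single component of a Dirichlet(a1, …, an) variable has a Beta distribution with parameters a_i and (∑a - a_i) - is a standard result not derived in these notes:

import numpy as np

rng = np.random.default_rng(8)
a = np.array([2.0, 3.0, 5.0])              # one a_i per response category
x = rng.dirichlet(a, size=100_000)         # each row is a set of proportions
print(x.sum(axis=1)[:3])                   # every row sums to 1
print(x.mean(axis=0), a / a.sum())         # component means
print(x[:, 0].var(), 2.0 * 8.0 / (10.0**2 * 11.0))   # variance of Beta(2, 8), cf. (EQ 35)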

The F-Distribution

This distribution, like the Chi-squared, arises very frequently in statistical practice. Suppose X has a Chi-squared distribution with n degrees of freedom, and Y an independent Chi-squared distribution with m degrees of freedom. Then the ratio

F = (X/n)/(Y/m) ∼ F(n, m) (EQ 82)

has an F-distribution with n and m degrees of freedom. Since the Chi-squared values are always non-negative (being sums of squares of standard Normal variables), the ratio is non-negative, and so F is always non-negative.

The ‘Student’ t-distribution

Another very frequently used distribution in statistics. This time let Z be a N(0, 1) random variable, and X an independent Chi-squared random variable with n degrees of freedom. Consider the ratio:

t = Z/√(X/n) ∼ t(n) (EQ 83)

This ratio t has the so-called t-distribution with n degrees of freedom. This is a symmetric bell-shaped distribution around 0, and in fact as n increases the distribution tends rapidly to normality. For n at around 30, the two distributions are very similar, so that the normal distribution may be used as an approximation to the t. Note that t² is the ratio of the square of a standard normal variable to a Chi-squared with n degrees of freedom, each divided by its degrees of freedom. Since the square of a standard normal is Chi-squared with 1 degree of freedom, this implies that t² ≡ F(1, n).
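
Both constructions can be reproduced by simulation. A Python fragment assuming NumPy and SciPy (n = 5 is arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 5
z = rng.normal(size=200_000)              # standard normal
x = rng.chisquare(n, size=200_000)        # independent chi-squared with n df
t = z / np.sqrt(x / n)                    # the ratio of (EQ 83)
qs = [0.9, 0.95, 0.99]
print(np.quantile(t, qs))                 # sample quantiles ...
print(stats.t.ppf(qs, df=n))              # ... agree with the t(n) distribution
print(np.quantile(t**2, 0.95), stats.f.ppf(0.95, 1, n))   # and t^2 behaves as F(1, n)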

The application of the Normal, Chi-squared, t- and F-distributions will be seen in the next chapter.


Summary

This chapter has introduced the idea of probability distributions over random variables. A probability distribution may be represented as a distribution function, or equivalently as a density function. Summary measures such as the mean and variance (or standard deviation) were defined, and more generally the moments of a distribution. The moment generating function provides a method for encapsulating all moments in one function, and more powerfully of providing an alternative method of representing the distribution.

We introduced the notion of multivariate distributions, and defined the idea of conditional distributions and independence. These follow naturally from the axioms of probability introduced in Chapter 1.

We showed how moment generating functions can be used to find distributions of functions of several random variables, in particular the sum of n identically distributed random variables.

We examined a number of common distributions used in statistics as a preparation for statistical inference principles to be considered in the next chapter.