Probability Models and Applications
Lecture Notes for ERG 2040 (2004-05)
by Raymond W. Yeung
Contents
1. PROBABILITY AND EVENTS
1.1 Set Theory
1.2 Elements of a Probability Model
1.3 Probability as a State of Knowledge
1.4 Conditional Probability
1.5 The Law of Total Probability and the Bayes Theorem
1.6 Independent Events
2. RANDOM VARIABLES
2.1 Probability Mass Function
2.2 Cumulative Distribution Function
2.3 Probability Density Function
2.4 Function of a Random Variable
2.5 Expectation
2.6 Moments
2.7 Jensen's Inequality
3. MULTIVARIATE DISTRIBUTIONS
3.1 Joint Cumulative Distribution Function
3.2 Joint Probability Density Function
3.3 Independence of Random Variables
3.4 Conditional Distribution
3.4.1 Conditioning on an Event
3.4.2 Conditioning on Another Random Variable
3.5 Functions of Random Variables
3.5.1 The CDF Method
3.5.2 The pdf Method
3.6 Transformation of Random Variables
4. EXPECTATION AND MOMENT GENERATING FUNCTIONS
4.1 Expectation as a Linear Operator
4.2 Conditional Expectation
4.3 Covariance and Schwartz's Inequality
4.4 Moment Generating Functions
5. LIMIT THEOREMS
5.1 The Weak Law of Large Numbers
5.2 The Central Limit Theorem
Chapter 1
PROBABILITY AND EVENTS
Probability theory is a theory for modeling the outcome of a random experiment. It is built on set theory, which in turn is built on logic. Before we start discussing probability theory, we first give a quick summary of the facts in set theory that we need in this course.

1.1 SET THEORY
The universal set, the set that contains all the elements of interest in a problem, is denoted by $\Omega$. The empty set, the set that contains no element, is denoted by $\emptyset$.
The three basic operations in set theory are union, intersection, and complement. Another commonly used operation is set difference. These operations are defined as follows.
Let $A$ and $B$ be sets.
Union: The union of $A$ and $B$, denoted by $A \cup B$, is the set
$A \cup B = \{\omega : \omega \in A \text{ or } \omega \in B\}.$
Intersection: The intersection of $A$ and $B$, denoted by $A \cap B$, is the set
$A \cap B = \{\omega : \omega \in A \text{ and } \omega \in B\}.$
Complement: The complement of $A$, denoted by $A^c$, is the set
$A^c = \{\omega : \omega \notin A\}.$
Set difference: The set $A$ minus the set $B$, denoted by $A - B$, is the set
$A - B = \{\omega : \omega \in A \text{ and } \omega \notin B\}.$
Note that $A - B$ can be written as $A \cap B^c$, and $A^c$ can be written as $\Omega - A$.
Let $A$ and $B$ be sets.
Inclusion: $A$ is a subset of $B$, denoted by $A \subset B$, if
'$\omega \in A$' implies '$\omega \in B$'.
Mutual exclusion: $A$ and $B$ are mutually exclusive, or disjoint, if $A \cap B = \emptyset$.
Inclusion and mutual exclusion are two basic relations between two sets. Note that $A \subset B$ if and only if $A \cap B^c = \emptyset$.
1. .
2. .
3. .
4. .
5. .
6. .
The proof of this proposition is omitted. The following two theorems further provide some useful set identities.
1. $(A \cup B)^c = A^c \cap B^c$.
2. $(A \cap B)^c = A^c \cup B^c$.
Proof: The two forms of De Morgan's Law are equivalent. This can be seen by replacing every set in the first form by its complement. Specifically, we have
$(A^c \cup B^c)^c = (A^c)^c \cap (B^c)^c = A \cap B.$
Upon taking complements on both sides, we obtain the second form.
De Morgan's law can easily be verified by using a Venn diagram. More formally, it can be proved by using a truth table. Consider the first form and write out the membership condition for each side. Then the following truth table shows that the two sets above are equal:

ω∈A   ω∈B   ω∈A∪B   ω∈(A∪B)^c   ω∈A^c   ω∈B^c   ω∈A^c∩B^c
 F     F      F         T          T       T         T
 F     T      T         F          T       F         F
 T     F      T         F          F       T         F
 T     T      T         F          F       F         F
The theorem is proved.
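The truth-table argument can also be carried out mechanically. The following sketch (the universal set and the sets $A$, $B$ are arbitrary illustrative choices, not taken from the notes) verifies both forms of De Morgan's Law:

```python
# Verify De Morgan's Law on concrete finite sets.
# The universe {1,...,8} and the sets A, B are illustrative choices.
universe = set(range(1, 9))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

def complement(S):
    """Complement with respect to the universal set."""
    return universe - S

# First form: (A ∪ B)^c = A^c ∩ B^c
assert complement(A | B) == complement(A) & complement(B)
# Second form: (A ∩ B)^c = A^c ∪ B^c
assert complement(A & B) == complement(A) | complement(B)
```

Because the membership of $\omega$ in each side depends only on whether $\omega \in A$ and $\omega \in B$, checking the four truth-value combinations (as in the table above) proves the identity for all sets.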
1. $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$.
2. $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$.
Proof: Again these identities can be verified by using Venn diagrams and can be formally proved by using truth tables. The details are omitted.
Note that the second identity in the theorem above can be obtained from the first identity by replacing each $\cap$ by a $\cup$ and replacing each $\cup$ by a $\cap$. (This is not a proof though.)
1.2 ELEMENTS OF A PROBABILITY MODEL
Probability is a state of knowledge about something unknown. This knowledge may change due to the availability of certain information. These will be discussed in the next two sections.
A probability model for a random experiment consists of the following elements:
1. a sample space $\Omega$ containing all possible outcomes of the random experiment;
2. a set function $P$, called a probability measure, such that
A1. $P(A) \ge 0$ for all $A \subset \Omega$;
A2. $P(\Omega) = 1$;
A3. $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$.
Note: A set function is a function that maps a set to a real number.
The outcome of the random experiment, an element of $\Omega$, is denoted by $\omega$. A subset $A$ of $\Omega$ is called an event, and we say that event $A$ occurs if $\omega \in A$.
The quantity $P(A)$ is interpreted as the probability that event $A$ occurs. The probability measure $P$ represents our knowledge about the random experiment. A1-A3 are called the axioms of probability, due to the great mathematician A. Kolmogorov,
http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Kolmogorov.html.
$P(\emptyset) = 0$.
Proof:
$1 = P(\Omega) = P(\Omega \cup \emptyset) = P(\Omega) + P(\emptyset) = 1 + P(\emptyset)$ (by A3)
implies $P(\emptyset) = 0$.
$P(A^c) = 1 - P(A)$.
Proof: Consider
$1 = P(\Omega) = P(A \cup A^c) = P(A) + P(A^c)$ (by A3).
Therefore, $P(A^c) = 1 - P(A)$.
$P(A) \le P(B)$ if $A \subset B$.
Proof: The fact that $B = A \cup (B - A)$, with $A$ and $B - A$ disjoint, can easily be seen from a Venn diagram, so that
$P(B) = P(A) + P(B - A) \ge P(A)$
by A3 and A1.
One can think of $\Omega$ as a cookie of weight 1, and $P$ characterizes the weight distribution of the cookie. Obviously, $P(A) \ge 0$ for all $A$, and $P(\Omega) = 1$. That $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$ means that weight is additive. Note that the cookie can have point masses.
Let
$\Omega = \{(i, j) : 1 \le i, j \le 6\},$
the set of outcomes of two tosses of a die, and take
$P(\{(i, j)\}) = 1/36$ for all $(i, j) \in \Omega$
(to model that the die is fair and the two tosses are independent1). For example, letting $A$ be
the set of all outcomes s.t. the sum is 6, i.e., $A = \{(1,5), (2,4), (3,3), (4,2), (5,1)\}$, we have $P(A) = 5/36$.
We now check that $P$ is a valid probability measure by showing that it satisfies all the axioms of probability.
A1: Since $P(A) = |A|/36 \ge 0$, A1 is obviously satisfied.
A2: Trivial.
A3: Since $|A \cup B| = |A| + |B|$ if $A \cap B = \emptyset$, we can readily verify A3.
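The axiom check above can be mirrored in a short computation. The following sketch models two tosses of a fair die with equal weight $1/36$ per outcome, as the example describes:

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs (first toss, second toss) of a fair die.
omega = list(product(range(1, 7), repeat=2))

def P(event):
    """Probability measure: each of the 36 outcomes carries weight 1/36."""
    return Fraction(len(set(event) & set(omega)), 36)

# A2: P(Omega) = 1
assert P(omega) == 1

# The event "the sum of the two tosses is 6"
A = [(i, j) for (i, j) in omega if i + j == 6]
assert P(A) == Fraction(5, 36)

# A3 on a pair of disjoint events: sums are 6 and 7 respectively
B = [(i, j) for (i, j) in omega if i + j == 7]
assert set(A) & set(B) == set()
assert P(A + B) == P(A) + P(B)
```

The `Fraction` type keeps the arithmetic exact, so the additivity check in A3 holds with equality rather than up to rounding.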
Following Example 1.9, we let $X$ be the outcome of the first toss. Formally, $X : \Omega \to \mathbb{R}$, and $X((i, j)) = i$.
Following Example 1.9, we define the random variable
$Y = 1$ if 1st toss $=$ 2nd toss, and $Y = 0$ otherwise.
Formally, $Y((i, j)) = 1$ if $(i, j) \in B$, where
$B = \{(i, i) : 1 \le i \le 6\},$
and $Y((i, j)) = 0$ for $(i, j) \notin B$.
1.3 PROBABILITY AS A STATE OF KNOWLEDGE
Think of probability as our state of knowledge about something. For example, our state of knowledge about the outcome of tossing a fair coin before the toss is
$P(\text{HEAD}) = P(\text{TAIL}) = 1/2.$
After the tossing and upon seeing the coin, our state of knowledge becomes
$P(\text{HEAD}) = 1,\ P(\text{TAIL}) = 0$
with probability $1/2$ (a HEAD is obtained), and becomes
$P(\text{HEAD}) = 0,\ P(\text{TAIL}) = 1$
with probability $1/2$ (a TAIL is obtained).
1To be discussed in the next chapter.
1.4 CONDITIONAL PROBABILITY
The probability of event $A$ conditioning on event $B$ is defined by
$P(A|B) = \frac{P(A \cap B)}{P(B)}$
if $P(B) > 0$ (undefined if $P(B) = 0$).
The above formula specifies how we update our knowledge about event $A$ from $P(A)$ to $P(A|B)$ upon knowing that the event $B$ has occurred. If we fix the event $B$, then $P(\cdot|B)$ is a set function. Our knowledge of the random experiment before we know that the event $B$ has occurred is given by $P(\cdot)$, and it becomes $P(\cdot|B)$ after we know that the event $B$ has occurred. We will show later in Theorem 1.16 that $P(\cdot|B)$ is indeed a valid probability measure.
Following Example 1.9,
Now , and
by A3
Therefore,
Similarly,
and
On the other hand,
Similarly,
for all .
Let $\Omega = \{1, 2, 3, 4, 5, 6\}$, with

ω        1    2    3    4    5    6
P({ω})   0.1  0.2  0.1  0.1  0.3  0.2
Let
Then it can easily be shown that
1 2 3 4 5 60 0
Following the last example, further define the event
$A = \{\omega \in \Omega : \omega \text{ is odd}\} = \{1, 3, 5\}.$
Then
We say that the set function $P$ is a probability measure because it satisfies the axioms A1-A3. The same can be said about $P(\cdot|B)$, the set function derived from $P$ by conditioning on event $B$. This is proved in the next theorem.
Let $B$ be any event such that $P(B) > 0$. Then the set function $P(\cdot|B)$ is a probability measure.
Proof: First, consider any event $A$. Then
$P(A|B) = \frac{P(A \cap B)}{P(B)} \ge 0$
since $P(A \cap B) \ge 0$ by A1. On the other hand,
$P(A \cap B) \le P(B)$
by Corollary 1.8 (since $A \cap B \subset B$), so that $P(A|B) \le 1$.
Therefore, $P(\cdot|B)$ satisfies A1. Second, consider
$P(\Omega|B) = \frac{P(\Omega \cap B)}{P(B)} = \frac{P(B)}{P(B)} = 1.$
Therefore, $P(\cdot|B)$ satisfies A2. Third, for any events $A_1$ and $A_2$ such that $A_1 \cap A_2 = \emptyset$, we have
$P(A_1 \cup A_2 | B) = \frac{P((A_1 \cup A_2) \cap B)}{P(B)} = \frac{P((A_1 \cap B) \cup (A_2 \cap B))}{P(B)} = \frac{P(A_1 \cap B) + P(A_2 \cap B)}{P(B)} = P(A_1|B) + P(A_2|B).$
Therefore, $P(\cdot|B)$ satisfies A3, and we conclude that $P(\cdot|B)$ is a probability measure.
In fact, the set function can formally be viewed as . To see this,we only have to observe that for any event ,
Following the setup in Example 1.15, since is a prob-ability measure, we have
by A3
from Example 1.14
which is consistent with what we obtained in Example 1.15.
Since $P(\cdot|B)$ is a probability measure, Corollaries 1.6, 1.7, and 1.8 can also be applied to $P(\cdot|B)$. Thus we have
1. $P(\emptyset|B) = 0$.
2. $P(A^c|B) = 1 - P(A|B)$.
3. $P(A_1|B) \le P(A_2|B)$ if $A_1 \subset A_2$.
1.5 THE LAW OF TOTAL PROBABILITY AND THE BAYES THEOREM
A collection of sets $\{B_1, B_2, \ldots, B_n\}$ is a partition of $\Omega$ if
1. $B_1 \cup B_2 \cup \cdots \cup B_n = \Omega$;
2. $B_i \cap B_j = \emptyset$ if $i \ne j$.
Let $\{B_1, B_2, \ldots, B_n\}$ be a partition of $\Omega$. Then for any event $A$,
$P(A) = \sum_{i=1}^{n} P(A|B_i) P(B_i).$
Proof: The theorem can be proved by considering
$P(A) = P(A \cap \Omega) = P\!\left(\bigcup_{i=1}^{n} (A \cap B_i)\right) = \sum_{i=1}^{n} P(A \cap B_i) = \sum_{i=1}^{n} P(A|B_i) P(B_i).$
In the law of total probability, the event $A$ can be interpreted as an observation (or a consequence), and the $B_i$ can be interpreted as a collection of mutually exclusive events which can possibly cause the observation $A$. Figure 1.1 illustrates the relation between $A$ and the $B_i$.
Figure 1.1. An illustration of the relation between $A$ and the partition $\{B_i\}$.
Let $\Omega$ be the set of all students in a class, $B_1$ be the set of all male students, and $B_2$ be the set of all female students. Obviously, $\{B_1, B_2\}$ is a partition of $\Omega$. Let $A$ be the set of students who wear dresses. Then the law of total probability says that
$P(A) = P(A|B_1) P(B_1) + P(A|B_2) P(B_2).$
For any events $A$ and $B$ such that $P(A), P(B) > 0$,
$P(B|A) = \frac{P(A|B) P(B)}{P(A)}.$
Proof: By definition, we have
$P(A|B) = \frac{P(A \cap B)}{P(B)},$
or
$P(A \cap B) = P(A|B) P(B).$
Exchanging the roles of $A$ and $B$ above, we obtain
$P(B \cap A) = P(B|A) P(A).$
Since $A \cap B = B \cap A$ (and hence $P(A \cap B) = P(B \cap A)$), we have
$P(B|A) P(A) = P(A|B) P(B),$
or
$P(B|A) = \frac{P(A|B) P(B)}{P(A)}.$
Let $\{B_1, B_2, \ldots, B_n\}$ be a partition of $\Omega$. Then for any event $A$ with $P(A) > 0$,
$P(B_j|A) = \frac{P(A|B_j) P(B_j)}{\sum_{i=1}^{n} P(A|B_i) P(B_i)}.$
Proof: This follows immediately from the preceding two theorems.
The formula in Corollary 1.23 expresses $P(B_j|A)$ in terms of the $P(A|B_i)$ and the $P(B_i)$.
The probability that a student is lazy (L) is 0.2. The probability that a student fails (F) in a certain examination is 0.8 given that (s)he is lazy, while the probability that a student fails in the examination is 0.15 given that (s)he is not lazy. That is,
$P(L) = 0.2$, $P(F|L) = 0.8$, and $P(F|L^c) = 0.15.$
We are interested in the probabilities $P(L|F)$, $P(L^c|F)$, $P(L|F^c)$, and $P(L^c|F^c)$. We proceed to determine $P(L|F)$. By Corollary 1.23, we have
$P(L|F) = \frac{P(F|L) P(L)}{P(F|L) P(L) + P(F|L^c) P(L^c)} = \frac{0.8 \times 0.2}{0.8 \times 0.2 + 0.15 \times 0.8} = \frac{0.16}{0.28} = \frac{4}{7}.$
By the remark at the end of Section 1.5, we have
$P(L^c|F) = 1 - P(L|F) = \frac{3}{7}.$
The probabilities $P(L|F^c)$ and $P(L^c|F^c)$ can be determined likewise. As an exercise, the reader should identify the sets $A$ and $B_i$ in Corollary 1.23.
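The numbers in this example can be checked mechanically. The sketch below applies the law of total probability and then Corollary 1.23 with exact rational arithmetic:

```python
from fractions import Fraction

P_L = Fraction(2, 10)              # P(lazy), from the example
P_F_given_L = Fraction(8, 10)      # P(fail | lazy)
P_F_given_Lc = Fraction(15, 100)   # P(fail | not lazy)

# Law of total probability: P(F) = P(F|L)P(L) + P(F|L^c)P(L^c)
P_F = P_F_given_L * P_L + P_F_given_Lc * (1 - P_L)

# Bayes theorem (Corollary 1.23): P(L|F) = P(F|L)P(L) / P(F)
P_L_given_F = P_F_given_L * P_L / P_F

assert P_F == Fraction(28, 100)            # = 0.28
assert P_L_given_F == Fraction(4, 7)
assert 1 - P_L_given_F == Fraction(3, 7)   # P(L^c | F)
```

Here the partition is $\{L, L^c\}$ and the observation is $F$, which is exactly the identification the exercise asks for.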
1.6 INDEPENDENT EVENTS
Let $A$ and $B$ be two events, and assume for the time being that $P(A), P(B) > 0$. The events $A$ and $B$ are independent if the probability of one event is not changed by knowing the other event. Specifically,
$P(A|B) = P(A)$   (1.1)
and
$P(B|A) = P(B).$   (1.2)
It is easy to see that both (1.1) and (1.2) are equivalent to
$P(A \cap B) = P(A) P(B).$
The above relation will be used as the definition for independence. In fact, it is more general than either (1.1) or (1.2) because it avoids the problem of division by zero when $P(A)$ or $P(B)$ is zero.
Two events $A$ and $B$ are independent if
$P(A \cap B) = P(A) P(B).$
The reader should carefully distinguish between two events being disjoint and two events being independent. The latter has to do with the probability measure while the former does not.
For three events $A$, $B$, and $C$, we have two notions of independence.
Three events $A$, $B$, and $C$ are pairwise independent if
$P(A \cap B) = P(A) P(B),$   (1.3)
$P(A \cap C) = P(A) P(C),$   (1.4)
$P(B \cap C) = P(B) P(C).$   (1.5)
Three events $A$, $B$, and $C$ are mutually independent if (1.3)-(1.5) are satisfied and also
$P(A \cap B \cap C) = P(A) P(B) P(C).$
Clearly, if events $A$, $B$, and $C$ are mutually independent, then they are also pairwise independent. However, the converse is not true, as is shown in the next example.
Consider the probability measure shown in Figure 1.2. First, from the region probabilities in the figure we can compute $P(A)$.
[Venn diagram of events $A$, $B$, and $C$, with three regions of probability 0.09 and three regions of probability 0.12 as shown.]
Figure 1.2. The probability measure for Example 1.28.
By symmetry, we also have the corresponding values for $P(B)$ and $P(C)$. Now, checking the pairwise intersections against the products of the individual probabilities, it is readily seen that events $A$, $B$, and $C$ are pairwise independent. However, since
$P(A \cap B \cap C) \ne P(A) P(B) P(C),$
events $A$, $B$, and $C$ are not mutually independent.
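The region values in Figure 1.2 make this concrete. Assuming (our reconstruction of the figure, not stated explicitly in the surviving text) that each region lying in exactly one of $A$, $B$, $C$ has probability 0.09 and each region lying in exactly two has probability 0.12, total probability 1 forces the triple intersection to carry 0.37, and the claims can be verified exactly:

```python
from fractions import Fraction
from itertools import product

F = Fraction
# One weight per atom of the Venn diagram, keyed by (in A?, in B?, in C?).
# Assumed reading of Figure 1.2: 0.09 for single-set regions, 0.12 for
# two-set regions; 0.37 for the centre is then forced by normalization.
weight = {0: F(0), 1: F(9, 100), 2: F(12, 100), 3: F(37, 100)}
atoms = {t: weight[sum(t)] for t in product([0, 1], repeat=3)}
assert sum(atoms.values()) == 1

def P(pred):
    return sum(w for atom, w in atoms.items() if pred(atom))

PA = P(lambda t: t[0] == 1)                      # = P(B) = P(C) by symmetry
PAB = P(lambda t: t[0] == 1 and t[1] == 1)
PABC = P(lambda t: t == (1, 1, 1))

assert PA == F(7, 10)        # 0.09 + 2(0.12) + 0.37
assert PAB == PA * PA        # 0.49: pairwise independent
assert PABC != PA ** 3       # 0.37 vs 0.343: not mutually independent
```

The same check applies to the pairs $(A, C)$ and $(B, C)$ by the symmetry of the weights.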
Chapter 2
RANDOM VARIABLES
As we have discussed, a random variable $X$ is a function of $\omega$, the outcome of the random experiment in discourse. Formally, $X : \Omega \to \mathbb{R}$, where $\mathbb{R}$ denotes the set of real numbers. The alphabet set of $X$, denoted by $\mathcal{X}$, is a subset of $\mathbb{R}$ containing all possible values taken by $X$. If $\mathcal{X}$ is a finite set, we denote its cardinality (size) by $|\mathcal{X}|$.
Let $\Omega = \{1, 2, 3, 4, 5, 6\}$, with

ω        1    2    3    4    5    6
P({ω})   0.1  0.2  0.1  0.1  0.3  0.2

as in Example 1.15. Define random variables $Y$ and $Z$ as follows.

ω      1  2  3  4  5  6
Y(ω)   0  1  0  2  4  3
Z(ω)   0  0  0  1  1  1

We can let $\mathcal{Y} = \{0, 1, 2, 3, 4\}$, with $|\mathcal{Y}| = 5$. Similarly, we can let $\mathcal{Z} = \{0, 1\}$.
Any function $X : \Omega \to \mathbb{R}$ defines a random variable.
Suppose we choose a time in a day at random. To model this, we let
$\Omega = [0, 24),$
the time of the day measured in hours. Then the two functions HR and MIN defined by
$\text{HR}(\omega) = \lfloor \omega \rfloor$
and
$\text{MIN}(\omega) = \lfloor 60 (\omega - \lfloor \omega \rfloor) \rfloor$
are random variables representing respectively the hour and minute of the random time chosen. Likewise, the function
$\text{AM}(\omega) = 1$ if $\omega < 12$, and $0$ otherwise,
is the random variable denoting "morning".
Let $\Omega$ be the set of all possible states of the body of a person chosen at random. Then the height, the weight, the body temperature, the blood pressure, the eye color, etc., of the chosen person are all random variables and they are functions of $\omega$. One can think of a random variable as a "measurement" of $\omega$.
One can think of a random variable $X$ as described by a piece of wire extending from $-\infty$ to $+\infty$ with a certain weight distribution, and the probability that $X$ takes a value in a set $A$ is the weight of the set $A$. So the problem is how to characterize a weight distribution on a piece of wire. In the rest of the chapter, we will see various such characterizations.
2.1 PROBABILITY MASS FUNCTION
A random variable $X$ is called discrete if the set of all values taken by $X$ is discrete (finite or countably infinite). Such a random variable is characterized by a probability mass function (pmf), which gives the probability of occurrence of each possible value of $X$. When there is no ambiguity, we will write $P(X = x)$ as $p_X(x)$ or simply $p(x)$. For example, in the last example, we have
$P(Y = 1) = P(\{\omega \in \Omega : Y(\omega) = 1\}) = P(\{2\}) = 0.2.$
We usually do not go for such a level of formality, but the above shows what
$P(Y = 1)$ exactly means.
A pmf $p(x)$ satisfies the following two properties:
1. $p(x) \ge 0$ for all $x \in \mathcal{X}$;
2. $\sum_{x \in \mathcal{X}} p(x) = 1.$
Proof The proof is left as an exercise.
The binomial distribution with parameters $n$ and $p$ takes integer values between 0 and $n$, with
$p(k) = \binom{n}{k} p^k (1-p)^{n-k}$
for $0 \le k \le n$, and $p(k) = 0$ otherwise, where $n \ge 1$ and $0 \le p \le 1$. It is obvious that $p(k) \ge 0$. Moreover,
$\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = (p + (1-p))^n = 1,$
where we have invoked the binomial formula
$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}.$
It can be shown that $p(k)$ peaks at $k = \lfloor (n+1)p \rfloor$.
The Poisson distribution with parameter $\lambda > 0$ is defined by
$p(k) = e^{-\lambda} \dfrac{\lambda^k}{k!}$ if $k = 0, 1, 2, \ldots$, and $p(k) = 0$ otherwise.
Then
$\sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{-\lambda} e^{\lambda} = 1.$
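Both normalizations, and the location of the binomial peak, are easy to confirm numerically (the parameter values below are arbitrary illustrative choices):

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

# The binomial pmf sums to 1, by the binomial formula.
n, p = 10, 0.3
assert abs(sum(binomial_pmf(k, n, p) for k in range(n + 1)) - 1) < 1e-12

# The Poisson pmf sums to 1 (truncating the infinite series).
lam = 2.5
assert abs(sum(poisson_pmf(k, lam) for k in range(100)) - 1) < 1e-12

# The binomial pmf peaks at floor((n+1)p).
peak = max(range(n + 1), key=lambda k: binomial_pmf(k, n, p))
assert peak == math.floor((n + 1) * p)
```

Truncating the Poisson series at $k = 100$ is harmless here since the tail decays factorially.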
2.2 CUMULATIVE DISTRIBUTION FUNCTION
We have already seen the pmf characterization of a discrete random variable. In this section, we introduce a more powerful characterization called the cumulative distribution function (CDF), which can characterize any random variable. The CDF of a random variable $X$, denoted by $F_X(x)$, is defined by
$F_X(x) = P(X \le x)$
for all $x \in \mathbb{R}$. $F_X(x)$ will be written as $F(x)$ when there is no ambiguity.
A CDF $F(x)$ satisfies the following properties:
1. $F(x)$ is non-decreasing and right-continuous;
2. $\lim_{x \to -\infty} F(x) = 0$;
3. $\lim_{x \to +\infty} F(x) = 1$.
Proof: The first property results from the definition of $F(x)$. To prove the second property, we consider
$\lim_{x \to -\infty} F(x) = \lim_{x \to -\infty} P(X^{-1}[(-\infty, x]]) = P(\emptyset) = 0.$
Here, we have used $X^{-1}[A]$ to denote the inverse image of a subset $A$ of $\mathbb{R}$ under the mapping $X$, defined as the set
$X^{-1}[A] = \{\omega \in \Omega : X(\omega) \in A\}.$
To prove the third property, we consider
$\lim_{x \to +\infty} F(x) = \lim_{x \to +\infty} P(X^{-1}[(-\infty, x]]) = P(\Omega) = 1.$
Basically, $F(x)$ gives the weight of the left part of the piece of wire in discussion, up to and including the point $x$. From $F(x)$, for any interval $(a, b]$, we can determine
$P(X \in (a, b]) = F(b) - F(a),$
Figure 2.1. The CDF of an exponential distribution.
and hence the probability that $X$ takes a value in any union of intervals by A3. Thus the CDF completely characterizes the weight distribution of the piece of wire, and hence the random variable $X$!
If $X$ is a discrete random variable taking integer values, then
$F(x) = \sum_{k \le x} p(k)$
and
$p(k) = F(k) - F(k-1).$
Thus the pmf and the CDF completely specify each other.
According to the CDF, a random variable can be classified as follows:
1. Discrete - if the CDF increases only at discrete jumps;
2. Continuous - if the CDF increases only continuously (it has no jumps);
3. Mixed - otherwise.
For an exponential distribution with parameter $\lambda > 0$, denoted by $\text{Exp}(\lambda)$,
$F(x) = 1 - e^{-\lambda x}$ if $x \ge 0$, and $F(x) = 0$ otherwise.
Figure 2.1 is a sketch of the CDF of an exponential distribution.
For the uniform distribution on $[0, 1]$,
$F(x) = 0$ if $x < 0$; $F(x) = x$ if $0 \le x \le 1$; $F(x) = 1$ if $x > 1$.
Figure 2.2. The CDF of the uniform distribution on $[0, 1]$.
[CDF with marked values $1/3$ and $2/3$ and a jump at $x = 1$.]
Figure 2.3. The CDF of the mixed distribution in Example 2.3.
Figure 2.2 is a sketch of the CDF of this distribution.
For the distribution uniform on $[0, 1)$ with a point probability mass at $x = 1$, the CDF is given by
$F(x) = 0$ if $x < 0$; $F(x) = x/3$ if $0 \le x < 1$; $F(x) = 1$ if $x \ge 1$.
Figure 2.3 is a sketch of the CDF of this distribution.
2.3 PROBABILITY DENSITY FUNCTION
In the last section, we have seen how we can characterize the weight distribution of a piece of wire by the CDF. In this section, we will see how we can characterize the same thing by means of a "density".
Suppose we say that the density of the piece of wire in a neighborhood of $x$ is equal to $f(x)$, with the unit kg/m. What it means is that for a segment of wire around $x$ with length $dx$, the weight is approximately $f(x)\,dx$. Let $f(x)$ denote the density at $x$ for a piece of wire whose weight distribution is described by a CDF $F(x)$. Then what we want for $f(x)$ is that
$F(x) = \int_{-\infty}^{x} f(t)\,dt$   (2.1)
for all $x$ (this is exactly what density means). Assuming that $F(x)$ is differentiable (with respect to $x$) and using the fundamental theorem of calculus1, by differentiating both sides of the above, we have
$\frac{d}{dx} F(x) = f(x),$
giving
$f(x) = F'(x).$   (2.2)
The function $f(x)$ is called the probability density function (pdf).
If $F(x)$ is not differentiable at a certain $x$, then we cannot define $f(x)$ by (2.2). This can happen if $F(x)$ does not change smoothly at $x$, for example, at $x = 1$ in Figure 2.2. However, a careful examination of (2.1) reveals that the density at any isolated point, as long as it is finite, does not affect the CDF $F(x)$. Therefore, in such situations, we can simply set $f(x)$ to be any finite value.
In fact, if $f(x)$ is finite, then $P(X = x) = 0$. This can be seen as follows. Let $I$ be an interval of width $\Delta$ such that $x \in I$. Then
$P(X \in I) \approx f(x)\,\Delta.$
1The fundamental theorem of calculus states that
$\frac{d}{dx} \int_{a}^{x} f(t)\,dt = f(x).$
This can be seen by considering
$\frac{1}{\Delta}\left[\int_{a}^{x+\Delta} f(t)\,dt - \int_{a}^{x} f(t)\,dt\right] \approx f(x)$
if $f$ is sufficiently smooth at $x$.
Figure 2.4. The pdf of an exponential distribution.
Letting $\Delta \to 0$, we see that $P(X = x) = 0$. However, although the event $\{X = x\}$ has zero probability measure, it is not impossible.
Suppose we throw a dart at a target and the point where the dart lands is distributed uniformly on the target. Then the probability that the dart lands on any particular point is zero, but it is not impossible for the dart to land on that point.
For an exponential distribution with parameter $\lambda$,
$f(x) = \lambda e^{-\lambda x}$ if $x \ge 0$, and $f(x) = 0$ otherwise.
Figure 2.4 is a sketch of the pdf of an exponential distribution.
For the uniform distribution on $[0, 1]$,
$f(x) = 1$ if $0 \le x \le 1$, and $f(x) = 0$ otherwise.
Figure 2.5 is a sketch of the pdf of this distribution.
However, we do run into a serious technical difficulty in defining the density function if $F(x)$ has a discrete jump at some $x_0$, i.e., there is a point mass at $x_0$. The problem we face is how to define $f(x)$ in view of (2.1) so that $F(x)$ changes abruptly as $x$ changes from $x_0 - \epsilon$ to $x_0 + \epsilon$ for small $\epsilon$. In fact, no such function exists. Instead, we introduce the notion of a Dirac delta function2. A delta function is called a generalized function, which is not a true
2Named after the great physicist Paul Dirac.
Figure 2.5. The pdf of the uniform distribution on $[0, 1]$.
function. Generalized function theory is a very deep theory, but it suffices for us to understand what a delta function means.
The best way to think of a delta function, denoted by $\delta(x)$, is that it is a function which takes the value 0 everywhere except around $x = 0$. Around $x = 0$, the function is a very sharp peak such that the total area under $\delta(x)$ is equal to 1. That is,
$\int_{-\infty}^{\infty} \delta(x)\,dx = 1.$
By means of the delta function, we will be able to describe a CDF with discrete jumps. This will be illustrated in the next example.
For the distribution uniform on $[0, 1)$ with a point probability mass at $x = 1$, the pdf is sketched in Figure 2.6. Here, $\frac{2}{3}\delta(x - 1)$ denotes a delta function scaled to two-thirds of its area and relocated at $x = 1$.
A pdf $f(x)$ satisfies the following properties:
1. $f(x) \ge 0$;
2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
Proof: First, $f(x)$ is nonnegative because $F(x)$ is non-decreasing. Second,
$\int_{-\infty}^{\infty} f(x)\,dx = \lim_{x \to \infty} F(x) = 1.$
[pdf with constant density $1/3$ on $[0, 1]$ and an impulse $\frac{2}{3}\delta(x - 1)$ at $x = 1$.]
Figure 2.6. The pdf of the mixed distribution in Example 2.3.
2.4 FUNCTION OF A RANDOM VARIABLE
Suppose we have a random variable $X$ specified by a given distribution (pdf or CDF), and we pass $X$ through a function $g$ to obtain a random variable $Y = g(X)$. What is the distribution of $Y$? Note that $Y$, as a random variable, is also a function of $\omega$ because $Y(\omega) = g(X(\omega))$.
This problem is rather trivial if both $X$ and $Y$ are discrete. To be specific, for each $y$, $P(Y = y)$ can be determined by
$P(Y = y) = \sum_{x \in g^{-1}[\{y\}]} P(X = x),$
where $g^{-1}[\{y\}]$ is the inverse image of $y$ under the mapping $g$. For continuous and mixed random variables, the idea is similar but the technicality is a little bit more involved.
Let us consider the case that both $X$ and $Y$ are continuous, and we will determine $f_Y$ in terms of $f_X$. To start with, we further assume that $g$ is monotone so that the inverse function $g^{-1}$ is defined. Toward this end, in light of Figure 2.7, consider
$P(x < X \le x + dx) = P(y < Y \le y + dy),$
where $y = g(x)$, or $x = g^{-1}(y)$. Since
$P(x < X \le x + dx) \approx f_X(x)\,dx$
and
$P(y < Y \le y + dy) \approx f_Y(y)\,dy,$
we have
$f_Y(y)\,dy \approx f_X(x)\,dx,$
Figure 2.7. A monotone function $g(x)$.
and in the limit, we have
$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{dx}{dy} \right| = \frac{f_X(g^{-1}(y))}{|g'(g^{-1}(y))|}.$
Let , where. Then
and
Therefore,
We give an alternative solution to the problem in the lastexample. We consider two cases for . First, for , we have
Then(2.3)
26
For , we have
Upon differentiating with respect to , we obtain
(2.4)
Combining (2.3) and (2.4) for the two cases, we have
which is the same as what we have obtained before.
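The specific function $g$ used in the example above did not survive extraction. As a stand-in illustration of both methods, take $X \sim \text{Uniform}(0, 1)$ and the monotone function $Y = g(X) = -\frac{1}{\lambda}\ln X$ (our choice); the CDF method and the pdf formula then both give an $\text{Exp}(\lambda)$ distribution, which we can check numerically:

```python
import math

lam = 2.0   # illustrative rate parameter (our assumption)

# X ~ Uniform(0,1); Y = g(X) = -(1/lam) ln X, a monotone decreasing g
# with inverse g^{-1}(y) = exp(-lam*y).
def F_Y(y):
    """CDF method: P(Y <= y) = P(X >= exp(-lam*y)) = 1 - exp(-lam*y), y >= 0."""
    return 1.0 - math.exp(-lam * y) if y >= 0 else 0.0

def f_Y(y):
    """pdf method: f_Y(y) = f_X(g^{-1}(y)) |dx/dy| = lam * exp(-lam*y), y >= 0."""
    return lam * math.exp(-lam * y) if y >= 0 else 0.0

# The two methods agree: the numerical derivative of F_Y matches f_Y.
for y in [0.1, 0.5, 1.0, 2.0]:
    h = 1e-6
    deriv = (F_Y(y + h) - F_Y(y - h)) / (2 * h)
    assert abs(deriv - f_Y(y)) < 1e-5
```

This particular $g$ is of independent interest: it turns a uniform random variable into an exponential one, which is the standard inverse-transform way of sampling the exponential distribution.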
We now turn to the case that $g$ is not monotone, i.e., $g(x) = y$ has possibly more than one solution $x$ for some $y$. We will discuss the case that for a particular $y$, there exist exactly two values $x_1$ and $x_2$ such that $g(x_1) = g(x_2) = y$ (cf. Figure 2.8). The general case is a straightforward extension. From Figure 2.8, we see that
$P(y < Y \le y + dy) = P(x_1 < X \le x_1 + dx_1) + P(x_2 + dx_2 < X \le x_2).$
Then, following exactly the argument as we used for the case that $g$ is monotone, we obtain
$f_Y(y) = \frac{f_X(x_1)}{|g'(x_1)|} + \frac{f_X(x_2)}{|g'(x_2)|}.$
Let $Y = g(X) = X^2$. Now values of $Y$ less than $0$ are impossible, so that $f_Y(y) = 0$ for $y < 0$. For $y > 0$, the equation $x^2 = y$ has the two solutions $x_1 = -\sqrt{y}$ and $x_2 = \sqrt{y}$.
Figure 2.8. A function $g(x)$ with two inverses $x_1$ and $x_2$ at $y$.
Since $g'(x) = 2x$, we have
$f_Y(y) = \frac{f_X(-\sqrt{y}) + f_X(\sqrt{y})}{2\sqrt{y}}.$
2.5 EXPECTATION
The expectation of $X$, written as $E[X]$ or simply $EX$, is defined as
$EX = \sum_{x \in \mathcal{X}} x\, p_X(x)$
for $X$ discrete, and
$EX = \int_{-\infty}^{\infty} x f_X(x)\,dx$
for $X$ continuous. $EX$ is sometimes written as $\bar{X}$.
The expectation of a random variable $X$ is the value taken by $X$ on the average. It is also called the mean, and it can be regarded as the center of mass of the piece of wire describing the distribution of $X$. Specifically, the location of the center of mass of the piece of wire is given by
$\frac{\int x f(x)\,dx}{\int f(x)\,dx} = \int x f(x)\,dx = EX.$
For any constant $c$, $Ec = c$.
Proof: This can be proved by considering
$Ec = \int_{-\infty}^{\infty} c f_X(x)\,dx = c \int_{-\infty}^{\infty} f_X(x)\,dx = c.$
If $X \ge 0$ with probability 1, then $EX \ge 0$.
Proof: This can be proved by considering
$EX = \int_{0}^{\infty} x f_X(x)\,dx \ge 0.$
Let $X \sim \text{Binomial}(n, p)$. Then
$EX = \sum_{k=0}^{n} k \binom{n}{k} p^k (1-p)^{n-k}.$
This summation can be evaluated explicitly by considering the binomial formula:
$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}.$
Here we fix $b$ and regard each side as a function of $a$, so that the left hand side and the right hand side are identical functions of $a$. Differentiating with respect to $a$, we have
$n (a + b)^{n-1} = \sum_{k=0}^{n} k \binom{n}{k} a^{k-1} b^{n-k},$
and multiplying by $a$ gives
$n a (a + b)^{n-1} = \sum_{k=0}^{n} k \binom{n}{k} a^{k} b^{n-k}.$
Then let $a = p$ and $b = 1 - p$ to obtain
$EX = np(p + 1 - p)^{n-1} = np.$
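The closed form $EX = np$ can be confirmed by evaluating the defining sum directly (parameter values are arbitrary):

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def mean(n, p):
    """E[X] computed term-by-term from the definition of expectation."""
    return sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

# The sum agrees with the closed form np for several parameter choices.
for n, p in [(1, 0.5), (10, 0.3), (25, 0.9)]:
    assert abs(mean(n, p) - n * p) < 1e-10
```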
Let . Then
Let and . Then and . Therefore,
In evaluating , we use L’Hospital’s Rule as follows.
The expectation of a function of a random variable is defined in the natural way.
The expectation of a function $g$ of a random variable $X$ is defined by
$Eg(X) = \sum_{x \in \mathcal{X}} g(x)\, p_X(x)$
for $X$ discrete, and
$Eg(X) = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx$
for $X$ continuous.
2.6 MOMENTS
In Definition 2.25, by letting $g(x) = x^n$ and $g(x) = (x - EX)^n$, we obtain the $n$th moment and the $n$th central moment of $X$, respectively, where $n \ge 1$. The $n$th moment and the $n$th central moment are denoted by $m_n$ and $\mu_n$, respectively, i.e.,
$m_n = E[X^n]$ and $\mu_n = E[(X - EX)^n].$
(Note that in the definition of $\mu_n$, $EX$ is just a constant.) The first moment, i.e., $m_1$, is called the mean (or expectation), while the second central moment, i.e., $\mu_2$, is called the variance. The variance of $X$ is also denoted by $\mathrm{var}(X)$.
$\mu_1 = E(X - EX) = 0.$
Proof: First, regarding the distribution of $X$ as represented by a piece of wire, the quantity $(x - EX) f(x)\,dx$ is the "moment" (as in mechanics) about the center of mass (at position $EX$) due to the weight $f(x)\,dx$ at position $x$. So this proposition says that for any weight distribution, the total moment about the center of mass is equal to 0.
For $X$ continuous, it suffices to consider
$E(X - EX) = \int_{-\infty}^{\infty} (x - EX) f(x)\,dx = \int x f(x)\,dx - EX \int f(x)\,dx = EX - EX = 0.$
The proof for $X$ discrete is similar.
For each function $g$, $Eg(X)$ gives partial information about the distribution of $X$. For example, $EX$ gives the average value taken by $X$. Likewise, each $m_n$ and $\mu_n$ gives some partial information about the distribution of $X$. However, we point out that under many situations, the distribution of $X$ is completely specified by the set of all moments of $X$, i.e., the set of all moments of $X$ gives complete information about the distribution of $X$. It can also be shown that the moments are completely specified by the central moments together with $EX$, and vice versa. The following theorem is a special case of this fact.
$\mathrm{var}(X) = E[X^2] - (EX)^2.$
Proof: The theorem is proved by considering
$E(X - EX)^2 = E[X^2 - 2(EX)X + (EX)^2] = E[X^2] - 2(EX)(EX) + (EX)^2 = E[X^2] - (EX)^2.$
$\mathrm{var}(X) \ge 0$, with equality if and only if $X = EX$ with probability 1.
Proof: Since $(x - EX)^2 \ge 0$ for all $x$,
$\mathrm{var}(X) = \int_{-\infty}^{\infty} (x - EX)^2 f(x)\,dx \ge 0.$   (2.5)
Therefore, $E[X^2] \ge (EX)^2$.
We now show that the above inequality is tight if and only if $X = EX$ with probability 1. If $X = EX$ with probability 1, then obviously $\mathrm{var}(X) = 0$. The proof of the converse, however, is much more technical and is deferred to Appendix 2.A.
The Gaussian distribution $N(\mu, \sigma^2)$ is a continuous distribution defined by the pdf
$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2 / 2\sigma^2}$
for $-\infty < x < \infty$. It is one of the most important distributions in probability theory. It will be discussed in depth when we discuss the central limit theorem. It can be shown that
$\int_{-\infty}^{\infty} f(x)\,dx = 1.$
Furthermore, it can be shown that the mean and variance are respectively $\mu$ and $\sigma^2$.
2.7 JENSEN'S INEQUALITY
Let $\{x_1, x_2, \ldots, x_n\}$ be a set of vectors and $\{a_1, a_2, \ldots, a_n\}$ be a set of real numbers. Then
$\sum_{i=1}^{n} a_i x_i$
is called a linear combination of $x_1, \ldots, x_n$. If in addition $\{a_i\}$ satisfies
1. $a_i \ge 0$ for all $i$, and
2. $\sum_{i=1}^{n} a_i = 1$,
then $\sum_i a_i x_i$ is called a convex combination of $x_1, \ldots, x_n$.
Figure 2.9. Linear combinations and convex combinations.
Remark: If $\{a_i\}$ satisfies $a_i \ge 0$ for all $i$ and $\sum_i a_i = 1$, then $\{a_i\}$ is a probability mass function. In this course, the vectors $x_i$ are usually scalars. In particular,
$EX = \sum_{x \in \mathcal{X}} x\, p_X(x)$
is a convex combination of all the elements of $\mathcal{X}$.
Figure 2.9(a) is an illustration of a linear (convex) combination of two vectors $x_1$ and $x_2$, and Figure 2.9(b) is an illustration of the set of all convex combinations of the four vectors $x_1, x_2, x_3$, and $x_4$.
A set $S$ is convex if for any $x_1, x_2 \in S$,
$\lambda x_1 + (1 - \lambda) x_2 \in S$
for any $0 \le \lambda \le 1$.
A function $g$ defined on a convex set $S$ is called convex if for any $x_1, x_2 \in S$ and any $0 \le \lambda \le 1$,
$g(\lambda x_1 + (1 - \lambda) x_2) \le \lambda g(x_1) + (1 - \lambda) g(x_2).$   (2.6)
A function $g$ is concave if and only if $-g$ is convex.
In the definition of a convex function, it is crucial that the set $S$ is convex because otherwise, for $x_1, x_2 \in S$ and $0 \le \lambda \le 1$, $\lambda x_1 + (1 - \lambda) x_2$ may not
Figure 2.10. A convex function.
be in $S$. If so, $g$ is not defined at $\lambda x_1 + (1 - \lambda) x_2$ and (2.6) cannot even be stated.
Figure 2.10 is an illustration of a convex function when $S$ is some convex subset of the real line. In general, $S$ is some convex subset of $\mathbb{R}^n$.
For any set $S$, its convex hull $\mathrm{con}(S)$ is defined as
$\mathrm{con}(S) = \{\lambda x_1 + (1 - \lambda) x_2 : x_1, x_2 \in S \text{ and } 0 \le \lambda \le 1\}.$
In words, $\mathrm{con}(S)$ consists of the convex combinations of any pair of vectors in $S$.
The convex hull of a set is convex. Figure 2.9(b) shows the convex hull of the set $\{x_1, x_2, x_3, x_4\}$.
Let $g$ be a convex function. Then
$Eg(X) \ge g(EX).$
Proof: For $X$ being a binary random variable taking the values $x_1$ and $x_2$, Jensen's inequality follows directly from the convexity of $g$. This can be seen by letting $\lambda = p_X(x_1)$ and $1 - \lambda = p_X(x_2)$ in (2.6). See homework for a proof for the case when $\mathcal{X}$ is finite.
Remark: The computation of $Eg(X)$ requires the knowledge of the distribution of $X$, while the computation of $g(EX)$ requires only the knowledge of $EX$. Therefore, Jensen's inequality provides a lower bound on $Eg(X)$ when only $EX$ is known.
Let $g$ be a concave function. Then $Eg(X) \le g(EX)$.
is convex. By Jensen’s inequality, .
The function $g(x) = x^2$ is convex. By Jensen's inequality, $E[X^2] \ge (EX)^2$, i.e., $\mathrm{var}(X) = E[X^2] - (EX)^2 \ge 0$. This has been obtained in Corollary 2.28.
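Jensen's inequality for $g(x) = x^2$ is easy to check on a concrete pmf. The sketch below uses the pmf from Example 1.15 (any pmf would do) and also confirms that the gap $E[X^2] - (EX)^2$ is exactly the variance:

```python
# Check E[g(X)] >= g(E[X]) for the convex g(x) = x^2,
# using the pmf of Example 1.15.
xs = [1, 2, 3, 4, 5, 6]
ps = [0.1, 0.2, 0.1, 0.1, 0.3, 0.2]

EX = sum(x * p for x, p in zip(xs, ps))          # E[X]
Eg = sum(x**2 * p for x, p in zip(xs, ps))       # E[X^2]
var = sum((x - EX)**2 * p for x, p in zip(xs, ps))

assert Eg >= EX**2                 # Jensen: E[X^2] >= (EX)^2
assert abs(Eg - EX**2 - var) < 1e-9   # the gap equals var(X)
```

Here $EX = 3.9$ and $E[X^2] = 18.1$, so the Jensen gap is $18.1 - 15.21 = 2.89 = \mathrm{var}(X)$.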
APPENDIX 2.A: PROOF OF COROLLARY 2.28
This appendix is for the more enthusiastic readers. Here, we complete the proof of Corollary 2.28 by showing that if $\mathrm{var}(X) = 0$, then $X = EX$ with probability 1. Assume that $\mathrm{var}(X) = 0$. First, observe that
Define the set
Note that
and that for , , so that . Our goal is to establish that. We first show that for any , . Consider
Thus
implying that the set defined above has probability 0. Now, since this holds only for each fixed member of the sequence of sets, we still cannot immediately conclude that $P(X \ne EX) = 0$. Toward this end, by means of the monotone convergence theorem in real analysis (beyond the scope of this course), it can be shown that
This completes the proof.
Chapter 3
MULTIVARIATE DISTRIBUTIONS
In Chapter 2, we have discussed a single random variable. In this and subsequent chapters, we will discuss a finite collection of jointly distributed random variables. In this chapter, we will mostly discuss two jointly distributed random variables, but generalizations to more than two random variables are straightforward. Emphasis will be on continuous distributions.
A random variable is a function of $\omega$. Two random variables are an ordered pair of functions of $\omega$, for example, $(X(\omega), Y(\omega))$. For one random variable, we think of its distribution as the weight distribution of a piece of wire whose total weight is 1. For two random variables, we think of their joint distribution as the weight distribution of a sheet whose total weight is 1.
As for a single random variable, we will have both CDF and pdf characterizations of the joint distribution.
3.1 JOINT CUMULATIVE DISTRIBUTION FUNCTION
Let $X$ and $Y$ be two jointly distributed continuous random variables. The joint CDF of $X$ and $Y$ is defined by
$F_{XY}(x, y) = P(X \le x, Y \le y).$
When there is no ambiguity, $F_{XY}(x, y)$ will be abbreviated as $F(x, y)$.
A joint CDF satisfies the following properties:
1. $0 \le F(x, y) \le 1$;
2. $F(-\infty, y) = 0$, $F(x, -\infty) = 0$, and $F(\infty, \infty) = 1$, for any $x$ and $y$;
3. $F(x, y)$ is right-continuous in each of $x$ and $y$;
4. For any $x_1 \le x_2$ and $y_1 \le y_2$,
$F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1) \ge 0.$
Proof Property 1 is proved by noting
Property 2 is proved by considering
That $F(x, -\infty) = 0$ and $F(\infty, \infty) = 1$ can be proved likewise. The proof for Property 3 is similar to that for the single variable case, so it is omitted. Property 4 can be proved by considering
$P(x_1 < X \le x_2,\ y_1 < Y \le y_2) = F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1) \ge 0.$   (3.1)
Letting and in Property 4, we have
which implies
where we have invoked Property 2. Similarly, we can show that for ,
Therefore, $F(x, y)$ is non-decreasing in both $x$ and $y$, but this is weaker than Property 4, as we will see in the next example.
Let
Then we see that
However,
The reader should compare Property 4 with the non-decreasing property of $F(x)$ for a single random variable. We can think of Property 4 as the "jointly non-decreasing" property.
So from (3.1), we can obtain from the joint CDF $F(x, y)$ the probability that $(X, Y)$ is within any rectangle in $\mathbb{R}^2$. In principle, we can obtain $P((X, Y) \in A)$ for essentially any subset $A$ of $\mathbb{R}^2$ by A31 through approximation and taking limits. Therefore, the joint CDF fully characterizes the joint distribution.
The marginal CDFs, namely $F_X(x)$ and $F_Y(y)$, can readily be obtained from $F_{XY}(x, y)$. Specifically,
$F_X(x) = P(X \le x) = P(X \le x,\ Y < \infty) = F_{XY}(x, \infty).$
Similarly,
$F_Y(y) = F_{XY}(\infty, y).$
3.2 JOINT PROBABILITY DENSITY FUNCTION
Our task is to find a joint pdf $f_{XY}(x, y)$ (if it exists) for a given joint CDF $F_{XY}(x, y)$ such that
$F_{XY}(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{XY}(u, v)\,du\,dv.$   (3.2)
Assume that $f_{XY}(x, y)$ exists. Then by the fundamental theorem of calculus, we have
1A3 refers to the third axiom of probability.
Therefore,
$f_{XY}(x, y) = \frac{\partial^2}{\partial x\, \partial y} F_{XY}(x, y).$   (3.3)
Now $F_{XY}(x, y)$ can be obtained from $f_{XY}(x, y)$ by (3.2), and vice versa by (3.3). Hence, both $F_{XY}(x, y)$ and $f_{XY}(x, y)$ are complete characterizations of the joint distribution.
A joint pdf $f(x, y)$ satisfies the following properties:
1. $f(x, y) \ge 0$;
2. $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$.
Proof: First, by (3.3),
$f(x, y) = \lim_{\Delta x, \Delta y \to 0} \frac{1}{\Delta x\, \Delta y} \left[ F(x + \Delta x, y + \Delta y) - F(x, y + \Delta y) - F(x + \Delta x, y) + F(x, y) \right].$
By Property 4 of a joint CDF in Theorem 3.1, we see that the expression inside the square bracket above is nonnegative. It then follows that $f(x, y) \ge 0$. Second,
$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = F(\infty, \infty) = 1$
by Property 2 in Theorem 3.1. The theorem is proved.
The marginal pdfs, i.e., $f_X(x)$ and $f_Y(y)$, can also readily be obtained from $f_{XY}(x, y)$. Specifically,
$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy.$
Similarly,
$f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dx.$   (3.4)
Remark: For discrete random variables $X$ and $Y$, the analog of (3.4) is
$p_Y(y) = \sum_{x} p_{XY}(x, y).$
3.3 INDEPENDENCE OF RANDOM VARIABLES
Two random variables $X$ and $Y$ are said to be independent if
$P(X \in A, Y \in B) = P(X \in A)\, P(Y \in B)$   (3.5)
for all subsets $A$ and $B$ of $\mathbb{R}$. It is an extremely difficult task to check this condition, as it involves uncountably many subsets of $\mathbb{R}$. Instead, we will discuss two simpler characterizations of independence of two random variables.
Two random variables $X$ and $Y$ are independent if and only if
$F_{XY}(x, y) = F_X(x)\, F_Y(y)$   (3.6)
for all $x$ and $y$.
Proof: If $X$ and $Y$ are independent, then (3.5) is satisfied for all subsets $A$ and $B$ of $\mathbb{R}$. Then for all $x$ and $y$,
$F_{XY}(x, y) = P(X \le x, Y \le y) = P(X \le x)\, P(Y \le y) = F_X(x)\, F_Y(y).$
For the converse, we will only give the proof for a special case. Assume (3.6) is satisfied for all $x$ and $y$, and suppose $A$ and $B$ are intervals, with $A = (x_1, x_2]$ and $B = (y_1, y_2]$. Then, by (3.1),
$P(X \in A, Y \in B) = F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1) = [F_X(x_2) - F_X(x_1)][F_Y(y_2) - F_Y(y_1)] = P(X \in A)\, P(Y \in B),$
i.e., (3.5) is satisfied. Using this argument, one can prove the same result by induction for $A$ and $B$ being finite unions of intervals. The details are omitted.
If the joint pdf $f_{XY}(x, y)$ for random variables $X$ and $Y$ exists, then $X$ and $Y$ are independent if and only if
$f_{XY}(x, y) = f_X(x)\, f_Y(y)$   (3.7)
for all $x$ and $y$.
Proof:
The following apply to all $x$ and $y$. Assume that $f_{XY}(x, y)$ exists, and that
$F_{XY}(x, y) = F_X(x)\, F_Y(y).$   (3.8)
Then
$f_{XY}(x, y) = \frac{\partial^2}{\partial x\, \partial y} F_{XY}(x, y) = \frac{\partial^2}{\partial x\, \partial y} F_X(x)\, F_Y(y) = f_X(x)\, f_Y(y).$
Conversely, assume (3.7) holds. Then
$F_{XY}(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_X(u)\, f_Y(v)\,du\,dv = F_X(x)\, F_Y(y).$
Therefore, (3.7) and (3.8) are equivalent. The theorem is proved.
Remark: For two discrete random variables $X$ and $Y$, the analog of Theorem 3.5 is that $X$ and $Y$ are independent if and only if
$p_{XY}(x, y) = p_X(x)\, p_Y(y)$
for all $x$ and $y$.
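For discrete random variables, the factorization test is fully mechanical. The sketch below (toy numbers of our choosing, not from the notes) builds an independent joint pmf as a product of marginals and then shows how moving mass between cells destroys independence while leaving the marginals unchanged:

```python
from fractions import Fraction

# An independent joint pmf: every cell is the product of its marginals.
px = {0: Fraction(1, 4), 1: Fraction(3, 4)}
py = {0: Fraction(1, 3), 1: Fraction(2, 3)}
joint = {(x, y): px[x] * py[y] for x in px for y in py}

def is_independent(pxy):
    """Check p(x,y) = p_X(x) p_Y(y) for every cell of a joint pmf."""
    mx = {x: sum(v for (a, b), v in pxy.items() if a == x) for x in {a for a, _ in pxy}}
    my = {y: sum(v for (a, b), v in pxy.items() if b == y) for y in {b for _, b in pxy}}
    return all(v == mx[x] * my[y] for (x, y), v in pxy.items())

assert is_independent(joint)

# Shift mass diagonally: both marginals are preserved, but the
# product form, and hence independence, is destroyed.
dep = dict(joint)
eps = Fraction(1, 12)
dep[(0, 0)] += eps; dep[(1, 1)] += eps
dep[(0, 1)] -= eps; dep[(1, 0)] -= eps
assert sum(dep.values()) == 1
assert not is_independent(dep)
```

This illustrates why equal marginals alone say nothing about independence: the joint structure is what matters.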
For three or more random variables, as with events, we distinguish between pairwise independence and mutual independence.
Random variables $X_1, X_2, \ldots, X_n$ are mutually independent if
$P(X_1 \in A_1, X_2 \in A_2, \ldots, X_n \in A_n) = \prod_{i=1}^{n} P(X_i \in A_i),$
where each $A_i$ is a subset of $\mathbb{R}$.
Random variables $X_1, X_2, \ldots, X_n$ are pairwise independent if $X_i$ and $X_j$ are independent for any $i \ne j$.
Random variables $X_1, X_2, \ldots, X_n$ are mutually independent if and only if
$F_{X_1 \cdots X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} F_{X_i}(x_i)$   (3.9)
for all $x_1, \ldots, x_n$, or equivalently,
$f_{X_1 \cdots X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)$
for all $x_1, \ldots, x_n$ if $f_{X_1 \cdots X_n}$ exists.
Proof Omitted.
If $X_1, X_2, \ldots, X_n$ are mutually independent, then they are pairwise independent.
Proof: If $X_1, X_2, \ldots, X_n$ are mutually independent, then (3.9) holds. For any fixed $i \ne j$, by setting $x_k = \infty$ for all $k \ne i, j$ in (3.9), the left hand side becomes $F_{X_i X_j}(x_i, x_j)$, while the right hand side becomes $F_{X_i}(x_i)\, F_{X_j}(x_j)$, since $F_{X_k}(\infty) = 1$ for all $k \ne i, j$. Therefore,
$F_{X_i X_j}(x_i, x_j) = F_{X_i}(x_i)\, F_{X_j}(x_j),$
or $X_i$ and $X_j$ are independent. Hence, we conclude that $X_1, X_2, \ldots, X_n$ are pairwise independent.
We note, however, that pairwise independence does not imply mutual independence.
If $X_1, X_2, \ldots, X_n$ are mutually independent, then so are $g_1(X_1), g_2(X_2), \ldots, g_n(X_n)$ for any functions $g_i$, $i = 1, 2, \ldots, n$.
Proof: Consider any subsets $B_1, B_2, \ldots, B_n$ of $\mathbb{R}$. Then
$P(g_1(X_1) \in B_1, \ldots, g_n(X_n) \in B_n) = P(X_1 \in g_1^{-1}[B_1], \ldots, X_n \in g_n^{-1}[B_n]) = \prod_{i=1}^{n} P(X_i \in g_i^{-1}[B_i]) = \prod_{i=1}^{n} P(g_i(X_i) \in B_i),$
where we have invoked the mutual independence assumption to obtain the second equality. Thus $g_1(X_1), g_2(X_2), \ldots, g_n(X_n)$ are also mutually independent.
Random variables $X_1, X_2, \ldots, X_n$ are independent and identically distributed (i.i.d.) if they are mutually independent and each $X_i$ has the same marginal distribution.
Let $X_1, X_2, \ldots, X_n$ be i.i.d. binary random variables with
$P(X_i = 1) = p$
and
$P(X_i = 0) = 1 - p.$
Then $Y = X_1 + X_2 + \cdots + X_n$ has distribution $\text{Binomial}(n, p)$, i.e.,
$P(Y = k) = \binom{n}{k} p^k (1-p)^{n-k}$
for $0 \le k \le n$.
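Because the $X_i$ are mutually independent, the pmf of the sum can be built one variable at a time by convolution, and the result matches the binomial formula exactly (parameter values below are illustrative):

```python
import math

def convolve(pmf1, pmf2):
    """pmf of the sum of two independent integer-valued random variables."""
    out = {}
    for a, pa in pmf1.items():
        for b, pb in pmf2.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out

p, n = 0.3, 8
bernoulli = {0: 1 - p, 1: p}

# Distribution of X1 + ... + Xn by repeated convolution; mutual
# independence is what justifies convolving one variable at a time.
dist = {0: 1.0}
for _ in range(n):
    dist = convolve(dist, bernoulli)

for k in range(n + 1):
    expected = math.comb(n, k) * p**k * (1 - p)**(n - k)
    assert abs(dist[k] - expected) < 1e-12
```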
3.4 CONDITIONAL DISTRIBUTION
The distribution of a random variable $X$ represents our knowledge about the outcome of $X$. Upon knowing the occurrence of a certain event or the outcome of another jointly distributed random variable, our knowledge of $X$ changes accordingly. Our new knowledge about $X$ is represented by the conditional distribution of $X$.
3.4.1 CONDITIONING ON AN EVENT
We first consider the distribution of a single random variable $X$ conditioning on an event $\{X \in A\}$, where $A$ is a subset of $\mathbb{R}$ and $P(X \in A) > 0$. Let $F_X(x)$ be
Figure 3.1. The CDF and pdf of a distribution.
the CDF of X. The CDF of X conditioning on {X ∈ A} is defined by

F_X(x | X ∈ A) = P{X ≤ x | X ∈ A}.

Then

F_X(x | X ∈ A) = P{X ≤ x, X ∈ A} / P{X ∈ A}.   (3.10)

Note that F_X(x | X ∈ A) satisfies the three properties of a CDF in Theorem 2.8. The pdf of X conditioning on {X ∈ A} is defined by

f_X(x | X ∈ A) = (d/dx) F_X(x | X ∈ A).

It is readily seen from (3.10) that

f_X(x | X ∈ A) = f_X(x) / P{X ∈ A} if x ∈ A, and 0 if x ∉ A.

Again, note that f_X(x | X ∈ A) satisfies the two properties of a pdf in Theorem 2.16.
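As a concrete numerical illustration (the exponential distribution and the interval A below are assumptions of this sketch, not the distribution in Figure 3.1), the conditional pdf f_X(x | X ∈ A) = f_X(x)/P{X ∈ A} on A indeed integrates to 1:

```python
import math

lam, a, b = 1.0, 0.5, 2.0                        # Exponential(1), A = [a, b]
f = lambda x: lam * math.exp(-lam * x)           # unconditional pdf
p_A = math.exp(-lam * a) - math.exp(-lam * b)    # P{X in A}
f_cond = lambda x: f(x) / p_A if a <= x <= b else 0.0

# Midpoint Riemann-sum check that the conditional pdf integrates to 1 over A.
n = 100000
dx = (b - a) / n
total = sum(f_cond(a + (i + 0.5) * dx) for i in range(n)) * dx
assert abs(total - 1.0) < 1e-6
```

Dividing by P{X ∈ A} is exactly the renormalization that restores the second property of a pdf after the mass outside A is discarded.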
Consider the CDF and pdf in Figure 3.1(a) and (b) for random variable X, respectively, and let A be the subset of ℝ as shown in Figure 3.1(a). Then it is readily seen that
Figure 3.2. The CDF and pdf conditioning on {X ∈ A}.
and F_X(x | X ∈ A) and f_X(x | X ∈ A) are illustrated in Figure 3.2 accordingly.
We now consider the joint distribution of two random variables X and Y conditioning on an event {(X, Y) ∈ A}, where A is a subset of ℝ² with P{(X, Y) ∈ A} > 0. The CDF of (X, Y) conditioning on {(X, Y) ∈ A} is defined by

F_XY(x, y | (X, Y) ∈ A) = P{X ≤ x, Y ≤ y | (X, Y) ∈ A}.   (3.11)

This is illustrated in Figure 3.3, where

R(x, y) = {(u, v) : u ≤ x, v ≤ y}

and {X ≤ x, Y ≤ y} = {(X, Y) ∈ R(x, y)}. Then from (3.11), we have

F_XY(x, y | (X, Y) ∈ A) = P{(X, Y) ∈ R(x, y) ∩ A} / P{(X, Y) ∈ A}.   (3.12)

Accordingly, the joint pdf of (X, Y) conditioning on {(X, Y) ∈ A} is defined as

f_XY(x, y | (X, Y) ∈ A) = ∂² F_XY(x, y | (X, Y) ∈ A) / ∂x ∂y.   (3.13)
It turns out that carrying out the above partial differentiation is considerably more complicated than the case when only one random variable is involved
Figure 3.3. The sets R(x, y) and A.
(see Appendix 3.A). So we instead will obtain f_XY(x, y | (X, Y) ∈ A) by another approach. We observe that f_XY(x, y | (X, Y) ∈ A) has to satisfy the following properties: For (x, y) ∉ A,

f_XY(x, y | (X, Y) ∈ A) = 0,

and for any small subset B of A with area ΔB,

P{(X, Y) ∈ B | (X, Y) ∈ A} ≈ f_XY(x, y | (X, Y) ∈ A) ΔB.   (3.14)

The left hand side above can be written as

P{(X, Y) ∈ B} / P{(X, Y) ∈ A} ≈ f_XY(x, y) ΔB / P{(X, Y) ∈ A}.   (3.15)

Equating (3.15) and the right hand side of (3.14), we obtain

f_XY(x, y | (X, Y) ∈ A) = f_XY(x, y) / P{(X, Y) ∈ A} for (x, y) ∈ A.

Therefore,

f_XY(x, y | (X, Y) ∈ A) = f_XY(x, y) / P{(X, Y) ∈ A} if (x, y) ∈ A, and 0 if (x, y) ∉ A.
Figure 3.4. The set A(x, ε).
Again, we note that f_XY(x, y | (X, Y) ∈ A) satisfies the two properties of a joint pdf in Theorem 3.3.
3.4.2 CONDITIONING ON ANOTHER RANDOM VARIABLE
In this subsection, we will discuss the distribution of a random variable Y conditioning on the event {X = x}, where X and Y are jointly distributed random variables. For X continuous, P{X = x} = 0. Thus we are running into the problem of conditioning on an event with zero probability.

To get around this problem, instead of conditioning on {X = x}, which has zero probability, we condition on the event {x − ε/2 < X ≤ x + ε/2}, or equivalently the event {(X, Y) ∈ A(x, ε)}, which has nonzero probability, where

A(x, ε) = {(u, v) : x − ε/2 < u ≤ x + ε/2}.

The set A(x, ε) is illustrated in Figure 3.4. Eventually, we will take the limit as ε → 0.

The joint pdf of X and Y conditioning on {(X, Y) ∈ A(x, ε)}, from the last subsection, is given by

f_XY(u, v | (X, Y) ∈ A(x, ε)) = f_XY(u, v) / P{(X, Y) ∈ A(x, ε)} if (u, v) ∈ A(x, ε), and 0 otherwise.
Figure 3.5. An illustration of f_XY(x_0, y) for a fixed x_0.
By integrating over all u, we obtain the conditional marginal pdf of Y as follows. Taking the limit as ε → 0, we define the pdf of Y conditioning on {X = x} by

f_{Y|X}(y|x) = f_XY(x, y) / f_X(x).   (3.16)

Figure 3.5 is an illustration of f_XY(x_0, y) for a fixed x_0. Thus we see from (3.16) that f_{Y|X}(y|x_0) is just the normalized version of f_XY(x_0, y) for fixed x_0. (For fixed x_0, f_X(x_0) is a constant.) Similarly, we define

f_{X|Y}(x|y) = f_XY(x, y) / f_Y(y).   (3.17)
Figure 3.6. The pdf in Example 3.15.
From (3.16) and (3.17), we have

f_{X|Y}(x|y) = f_{Y|X}(y|x) f_X(x) / f_Y(y),

which is a form of Bayes’s Theorem.
If f_{Y|X}(y|x) = f_Y(y) for all x and y, then X and Y are independent.

Proof If f_{Y|X}(y|x) = f_Y(y), then

f_XY(x, y) = f_{Y|X}(y|x) f_X(x) = f_X(x) f_Y(y).

This implies that f_XY factorizes, i.e., X and Y are independent.
Consider the joint pdf f_XY(x, y) = c for x, y ≥ 0 and x + y ≤ 1 (see Figure 3.6). We will first determine the normalizing constant c. Then we will determine the marginal pdfs f_X(x) and f_Y(y), and show that X and Y are not independent. In fact, we can tell immediately that X and Y are not independent by inspecting Figure 3.6, because the possible range of Y depends on the value taken by X. We use Property 2 of a joint pdf in Theorem 3.3 to determine c. Consider

1 = ∫_0^1 ∫_0^{1−x} c dy dx = c/2,
which implies c = 2. Now, for 0 ≤ x ≤ 1,

f_X(x) = ∫_0^{1−x} c dy = 2(1 − x).

Thus

f_X(x) = 2(1 − x) if 0 ≤ x ≤ 1, and 0 otherwise.

By symmetry, we also have

f_Y(y) = 2(1 − y) if 0 ≤ y ≤ 1, and 0 otherwise.

Then for x, y ≥ 0 with x + y ≤ 1, f_X(x) f_Y(y) = 4(1 − x)(1 − y), while f_XY(x, y) = 2. Therefore, f_XY(x, y) ≠ f_X(x) f_Y(y), and X and Y are not independent.
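A quick sketch of this example, under the assumption (consistent with Figure 3.6) that the joint pdf is constant on the triangle {x ≥ 0, y ≥ 0, x + y ≤ 1}: the normalizing constant is estimated by Monte Carlo, and the failure of the factorization is checked at one point.

```python
import random

random.seed(0)
n, hits = 200000, 0
for _ in range(n):
    # Uniform point in the unit square; count how often it lands in the triangle.
    if random.random() + random.random() <= 1:
        hits += 1
area = hits / n          # Monte Carlo estimate of the triangle's area (= 1/2)
c = 1 / area             # normalizing constant: c * area = 1, so c ≈ 2
assert abs(c - 2) < 0.05

# Independence fails: at (x, y) = (0.25, 0.25) the joint pdf is 2, but the
# product of marginals 2(1 - x) * 2(1 - y) is 2.25.
assert 2 * (1 - 0.25) * 2 * (1 - 0.25) != 2
```

The inspection argument in the text (the range of Y depends on X) is what the mismatch of the two numbers makes quantitative.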
3.5 FUNCTIONS OF RANDOM VARIABLES
Suppose we are given two random variables X and Y, and let Z = g(X, Y). We are interested in the distribution of Z. Specifically, we want to determine the distribution of Z in terms of the joint distribution of X and Y.

3.5.1 THE CDF METHOD
In this method, F_Z(z) is obtained by considering

F_Z(z) = P{Z ≤ z} = P{g(X, Y) ≤ z}.
Figure 3.7. An illustration of the set {(x, y) : x/y ≤ z} for z > 0 and z < 0.
Let Z = X/Y. Then

F_Z(z) = P{X/Y ≤ z}.

Figure 3.7 is an illustration of the set {(x, y) : x/y ≤ z}. Then

F_Z(z) = ∫∫_{x/y ≤ z} f_XY(x, y) dx dy,   (3.18)

and the pdf of Z is obtained by

f_Z(z) = (d/dz) F_Z(z).

In order to differentiate the double integrals in (3.18), we use the chain rule for differentiation. Specifically,
Similarly,
Therefore,
The next example is extremely important and should be studied very carefully.
Let Z = X + Y, where X and Y are independent. Then

F_Z(z) = P{X + Y ≤ z} = ∫_{−∞}^∞ f_X(x) F_Y(z − x) dx.

Differentiating with respect to z, we obtain

f_Z(z) = ∫_{−∞}^∞ f_X(x) f_Y(z − x) dx.

The integral above is called the convolution of f_X and f_Y, denoted by f_X * f_Y. Since X + Y = Y + X, by exchanging the roles of X and Y in the above integral, we also have

f_Z(z) = ∫_{−∞}^∞ f_Y(y) f_X(z − y) dy.

In summary, if Z is equal to X + Y where X and Y are independent, then f_Z = f_X * f_Y = f_Y * f_X.
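The convolution formula can be illustrated numerically. The choice of two Uniform(0, 1) summands below is an assumption of the sketch (not an example from the notes); their convolution is the triangular pdf on [0, 2].

```python
def f_Z(z):
    """Numerical convolution f_Z(z) = ∫ f_X(x) f_Y(z - x) dx for
    X, Y ~ Uniform(0, 1): both pdfs are 1 on (0, 1) and 0 elsewhere."""
    n, total = 10000, 0.0
    dx = 1.0 / n
    for i in range(n):
        x = (i + 0.5) * dx          # f_X(x) = 1 here since 0 < x < 1
        if 0.0 < z - x < 1.0:       # f_Y(z - x) = 1 on (0, 1)
            total += dx
    return total

assert abs(f_Z(0.5) - 0.5) < 1e-3   # rising edge: f_Z(z) = z on [0, 1]
assert abs(f_Z(1.0) - 1.0) < 1e-3   # peak at z = 1
assert abs(f_Z(1.5) - 0.5) < 1e-3   # falling edge: f_Z(z) = 2 - z on [1, 2]
```

The box-times-box integral reduces to measuring the overlap of two unit intervals, which is exactly the triangular shape.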
3.5.2 THE PDF METHOD
This method is to consider f_Z(z) dz = P{z < Z ≤ z + dz} directly. Specifically,

f_Z(z) dz = P{z < g(X, Y) ≤ z + dz}.

Our task is to express the right hand side as an expression times dz, so that f_Z(z) is obtained by cancelling dz on both sides. This may not always be easy to do, but we will illustrate how this can possibly be done by an example.
Let X and Y be i.i.d. N(0, 1), so that

f_XY(x, y) = (1/2π) e^{−(x²+y²)/2}

for all x and y. Let Z = √(X² + Y²). Then

f_Z(z) dz = P{z < √(X² + Y²) ≤ z + dz} = (1/2π) e^{−z²/2} · 2πz dz   (cf. Figure 3.8)
Figure 3.8. An illustration for the range of integration in polar coordinates.
Cancelling dz on both sides, we have

f_Z(z) = z e^{−z²/2}

for all z ≥ 0. Obviously, f_Z(z) = 0 for all z < 0.
The reader is encouraged to repeat the above example by the CDF method.
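The result can also be verified by simulation (assuming, as the polar-coordinate argument indicates, that Z = √(X² + Y²) for i.i.d. standard Gaussians, so that Z has CDF F_Z(z) = 1 − e^{−z²/2} for z ≥ 0):

```python
import math, random

random.seed(1)
n = 200000
# Draw Z = sqrt(X^2 + Y^2) with X, Y i.i.d. N(0, 1).
samples = [math.hypot(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

# Compare the empirical CDF at z = 1 with F_Z(1) = 1 - exp(-1/2) ≈ 0.3935.
z = 1.0
empirical = sum(s <= z for s in samples) / n
assert abs(empirical - (1 - math.exp(-z * z / 2))) < 0.01
```

This distribution of the radius of a two-dimensional Gaussian is commonly known as the Rayleigh distribution.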
Figure 3.9. An illustration of the function g.
3.6 TRANSFORMATION OF RANDOM VARIABLES
So far, we have discussed how we can obtain the distribution of a single random variable obtained as a function of some given random variables. In this section, we will discuss how we can obtain the joint distribution of a pair of random variables obtained as functions of a given pair of random variables.

Let Z_1 = g_1(X_1, X_2) and Z_2 = g_2(X_1, X_2) for some functions g_1 and g_2, and let (z_1, z_2) be the values taken by (Z_1, Z_2). Suppose for each (z_1, z_2), there is a unique pair (x_1, x_2) satisfying

z_1 = g_1(x_1, x_2) and z_2 = g_2(x_1, x_2).

Write g = (g_1, g_2), where

(z_1, z_2) = g(x_1, x_2).

Then g is an invertible transformation. See Figure 3.9 for an illustration of g, which takes a point in the x_1-x_2 plane to the z_1-z_2 plane. Our task is to determine f_{Z_1 Z_2}(z_1, z_2).

Consider the rectangular element in the z_1-z_2 plane in Figure 3.10. We label the four corners of this element by z^(i), i = 0, 1, 2, 3, where z^(0) = (z_1, z_2). When we trace the line from z^(0) to z^(1) in the z_1-z_2 plane, under the mapping g^{−1}, the curve between x^(0) = g^{−1}(z^(0)) and x^(1) = g^{−1}(z^(1)) is traced in the x_1-x_2 plane. The same applies when we trace the lines from z^(1) to z^(2), from z^(2) to z^(3), and from z^(3) to z^(0). When δz_1 and δz_2 are small, the region traced out in the x_1-x_2 plane can be approximated by a parallelogram. Specifically, the vectors v_1 and v_2 are given by
Figure 3.10. Two corresponding regions in the z_1-z_2 plane and the x_1-x_2 plane.
Let ΔA_z denote the area of the rectangular element in the z_1-z_2 plane, and let ΔA_x denote the area of the corresponding region in the x_1-x_2 plane. Now observe that (Z_1, Z_2) is in the rectangular element in the z_1-z_2 plane if and only if (X_1, X_2) is in the corresponding region in the x_1-x_2 plane. Equating the probability of this event in terms of (Z_1, Z_2) and (X_1, X_2), we have

f_{Z_1 Z_2}(z_1, z_2) ΔA_z ≈ f_{X_1 X_2}(x_1, x_2) ΔA_x,

so that

f_{Z_1 Z_2}(z_1, z_2) = f_{X_1 X_2}(x_1, x_2) (ΔA_x / ΔA_z),   (3.19)

where (x_1, x_2) = g^{−1}(z_1, z_2). So, we need to determine the value of ΔA_x / ΔA_z as follows.

Defining² the Jacobian

∂(x_1, x_2)/∂(z_1, z_2) = det [ ∂x_1/∂z_1  ∂x_1/∂z_2 ; ∂x_2/∂z_1  ∂x_2/∂z_2 ],   (3.20)

we have

ΔA_x / ΔA_z ≈ |∂(x_1, x_2)/∂(z_1, z_2)|.

Therefore, from (3.19), we have

f_{Z_1 Z_2}(z_1, z_2) = f_{X_1 X_2}(x_1, x_2) |∂(x_1, x_2)/∂(z_1, z_2)|.   (3.21)

Exchanging the roles of (X_1, X_2) and (Z_1, Z_2) in the above derivation and accordingly in (3.21), we obtain

f_{X_1 X_2}(x_1, x_2) = f_{Z_1 Z_2}(z_1, z_2) |∂(z_1, z_2)/∂(x_1, x_2)|.   (3.22)

Therefore, combining (3.21) and (3.22), we have

|∂(x_1, x_2)/∂(z_1, z_2)| |∂(z_1, z_2)/∂(x_1, x_2)| = 1,

which implies

|∂(x_1, x_2)/∂(z_1, z_2)| = 1 / |∂(z_1, z_2)/∂(x_1, x_2)|.   (3.23)

This is the two-dimensional generalization of the result

dx/dz = 1 / (dz/dx)

in elementary calculus.
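A numerical sanity check of (3.21) for an assumed linear map (not an example from the notes): for z_1 = x_1 + x_2, z_2 = x_1 − x_2 the inverse map has Jacobian determinant −1/2, and the formula then predicts that Z_1 and Z_2 are independent N(0, 2) when X_1, X_2 are i.i.d. N(0, 1).

```python
import random, statistics

random.seed(2)
n = 100000
z1, z2 = [], []
for _ in range(n):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    z1.append(x1 + x2)      # z_1 = x_1 + x_2
    z2.append(x1 - x2)      # z_2 = x_1 - x_2; inverse Jacobian has |det| = 1/2

# (3.21) gives f_{Z1 Z2} = (1/2)(1/2pi) exp(-((z1^2 + z2^2)/2)/2),
# i.e. independent N(0, 2) marginals with zero cross-covariance.
assert abs(statistics.variance(z1) - 2) < 0.1
assert abs(statistics.variance(z2) - 2) < 0.1
cov = sum(a * b for a, b in zip(z1, z2)) / n
assert abs(cov) < 0.05
```

Writing out x_1 = (z_1 + z_2)/2, x_2 = (z_1 − z_2)/2 and applying (3.20) directly gives the same factor of 1/2.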
Let
2It is easy to see that we can also write (3.20) as
Then
Therefore,
Following the last example, this time we want to determine f_{X_1 X_2}(x_1, x_2) in terms of f_{Z_1 Z_2}(z_1, z_2). From (3.23), we have

|∂(z_1, z_2)/∂(x_1, x_2)| = 1 / |∂(x_1, x_2)/∂(z_1, z_2)|.

Therefore,

f_{X_1 X_2}(x_1, x_2) = f_{Z_1 Z_2}(z_1, z_2) |∂(z_1, z_2)/∂(x_1, x_2)|.

Alternatively, we can write (x_1, x_2) = g^{−1}(z_1, z_2) explicitly and apply (3.20) directly to obtain ∂(x_1, x_2)/∂(z_1, z_2), but this would be considerably more difficult.
APPENDIX 3.A: DIFFERENTIATING F_XY(x, y | (X, Y) ∈ A)
From (3.12), we have

F_XY(x, y | (X, Y) ∈ A) = P{(X, Y) ∈ R(x, y) ∩ A} / P{(X, Y) ∈ A},   (3.A.1)

where

R(x, y) = {(u, v) : u ≤ x, v ≤ y}.
Following the steps toward proving Property 1 of Theorem 3.3, we have
Let
Figure 3.A.1. The set B_δ(x, y) for (a) (x, y) ∈ A and (b) (x, y) ∉ A.
We leave it as an exercise for the reader to show that the expression in the square bracket above is equal to

P{(X, Y) ∈ B_δ(x, y) ∩ A}   (3.A.2)

(cf. Figure 3.A.1). If (x, y) ∈ A, then B_δ(x, y) ⊂ A for sufficiently small δ, so that (3.A.2) is equal to

P{(X, Y) ∈ B_δ(x, y)},

and hence

f_XY(x, y | (X, Y) ∈ A) = f_XY(x, y) / P{(X, Y) ∈ A}.

Otherwise, (x, y) ∉ A, so that (3.A.2) vanishes, and

f_XY(x, y | (X, Y) ∈ A) = 0.

Therefore, from (3.A.1), we have

f_XY(x, y | (X, Y) ∈ A) = f_XY(x, y) / P{(X, Y) ∈ A} if (x, y) ∈ A, and 0 if (x, y) ∉ A.
Chapter 4

EXPECTATION AND MOMENT GENERATING FUNCTIONS
We have already discussed the expectation of a single random variable in Chapter 2. In this chapter, we will discuss expectation as an operator when more than one random variable is involved. We will also introduce moment generating functions, a fundamental tool in probability theory.

The results will be proved by assuming that the random variables are continuous. The proofs for discrete random variables are analogous.
4.1 EXPECTATION AS A LINEAR OPERATOR
We have defined the expectation of a function of a random variable in Definition 2.25. When the function involves two random variables, we have the following definition.

The expectation of a function g(X, Y) of random variables X and Y is defined by

E[g(X, Y)] = Σ_x Σ_y g(x, y) p_XY(x, y)

for X and Y discrete, and

E[g(X, Y)] = ∫∫ g(x, y) f_XY(x, y) dx dy

for X and Y continuous.
One can regard expectation as an operator which inputs a random variable and outputs the expectation of that random variable. We will establish in the following theorem that expectation is a linear operator.
Expectation is a linear operator, i.e., for any random variables X and Y and any real numbers a and b,

E[aX + bY] = aE[X] + bE[Y].   (4.1)
Proof This can be proved by considering
E[aX] = aE[X].

Proof This is a special case of Theorem 4.2 with b equal to 0.
Consider
This is an alternative proof for Theorem 2.27 with simplified notation.
Var(aX + b) = a² Var(X).
Proof This can be proved by considering
Let μ and σ² be the mean and the variance of a random variable X. Then (X − μ)/σ has zero mean and unit variance.
Proof To prove that (X − μ)/σ has zero mean, consider

E[(X − μ)/σ] = (E[X] − μ)/σ = 0.

We will prove that Var((X − μ)/σ) = 1 in two steps. First,

Var(X − μ) = Var(X),

which is intuitively clear. Second, by Proposition 4.5, we have

Var((X − μ)/σ) = Var(X)/σ² = 1.
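The standardization in Proposition 4.6 is easy to check empirically; the Gaussian samples and the parameter values below are illustrative choices, not part of the notes.

```python
import random, statistics

random.seed(3)
mu, sigma = 5.0, 2.0
xs = [random.gauss(mu, sigma) for _ in range(100000)]
zs = [(x - mu) / sigma for x in xs]     # the standardized variable (X - mu)/sigma

# Sample mean ≈ 0 and sample variance ≈ 1, up to Monte Carlo error.
assert abs(statistics.fmean(zs)) < 0.02
assert abs(statistics.variance(zs) - 1) < 0.05
```

The same two lines work for any distribution with finite mean and variance, since the proposition does not use Gaussianity.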
If X and Y are independent, then

E[XY] = E[X] E[Y].   (4.2)

Proof If X and Y are independent, then so are any functions of X and Y by Theorem 3.10. Therefore,

E[XY] = ∫∫ xy f_X(x) f_Y(y) dx dy = E[X] E[Y].

Remark The equation (4.1) is valid for all joint distributions of X and Y. By contrast, (4.2) may not be valid if X and Y are not independent.
4.2 CONDITIONAL EXPECTATION
Let X and Y be jointly distributed random variables. The expectation of Y conditioning on the event {X = x} is defined by

E[Y | X = x] = ∫ y f_{Y|X}(y|x) dy.

We note that E[Y | X = x] depends only on the value of x, and hence is a function of x. If the value of X is not specified, then

E[Y | X]

is a random variable, and more specifically, a function of the random variable X. We now prove an important theorem regarding conditional expectation.
E[E[Y | X]] = E[Y].

Proof Consider

E[E[Y | X]] = ∫ E[Y | X = x] f_X(x) dx = ∫∫ y f_{Y|X}(y|x) f_X(x) dy dx = ∫∫ y f_XY(x, y) dy dx = E[Y].

Note that in the above, the inner expectation is a conditional expectation, while the outer expectation is an expectation of a function of X.
Let X be uniformly distributed on [0, 1], and let Y be uniformly distributed on [0, x] when X = x. We now apply Theorem 4.9 to determine E[Y].
The above integral is the expectation of X/2; since the expectation of X was shown in Example 2.1 to be 1/2, E[Y] = 1/4.

Alternatively, we can consider

f_{Y|X}(y|x) = 1/x if 0 ≤ y ≤ x, and 0 otherwise,

so that

f_XY(x, y) = f_{Y|X}(y|x) f_X(x) = 1/x if 0 ≤ y ≤ x ≤ 1, and 0 otherwise.

Then

f_Y(y) = ∫_y^1 (1/x) dx = −ln y for 0 < y ≤ 1,

and

E[Y] = ∫_0^1 y (−ln y) dy.

However, evaluations of these integrals are quite difficult.
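Assuming, as above, that X ~ Uniform[0, 1] and Y | X = x ~ Uniform[0, x], Theorem 4.9 can be verified by simulation without evaluating any of those integrals:

```python
import random, statistics

random.seed(4)
n = 200000
ys = []
for _ in range(n):
    x = random.random()                 # X ~ Uniform[0, 1]
    ys.append(random.uniform(0, x))     # Y | X = x ~ Uniform[0, x]

# Tower property: E[Y] = E[E[Y|X]] = E[X/2] = 1/4.
assert abs(statistics.fmean(ys) - 0.25) < 0.01
```

The simulation simply draws from the joint distribution in two stages (first X, then Y given X), which is exactly the structure the tower property exploits.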
4.3 COVARIANCE AND SCHWARTZ’S INEQUALITY
Let X and Y be random variables, and consider X + Y. We are naturally interested in the relation between the variance of X + Y and the variances of X and Y. Toward this end, consider

Var(X + Y) = E[(X + Y − E[X + Y])²] = Var(X) + Var(Y) + 2 E[(X − E[X])(Y − E[Y])].

Define

Cov(X, Y) = E[(X − E[X])(Y − E[Y])],   (4.3)

so that we can write

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).

When X and Y are independent, by Theorem 3.10,

Cov(X, Y) = E[X − E[X]] E[Y − E[Y]] = 0,
so that

Var(X + Y) = Var(X) + Var(Y).   (4.4)

When Cov(X, Y) = 0, or equivalently

E[XY] = E[X] E[Y],

or equivalently (4.4) holds, X and Y are said to be uncorrelated. As shown, X and Y are uncorrelated if X and Y are independent. However, the converse is not true.
Consider the following joint pmf for random variablesand :
Here it is easy to see that X and Y are uncorrelated but they are not independent.
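The pmf of the example did not survive extraction, so the sketch below uses a standard instance of the same phenomenon (an assumption, not necessarily the book's pmf): X uniform on {−1, 0, 1} and Y = X², which are uncorrelated yet clearly dependent.

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}, Y = X^2; each (x, y) pair has probability 1/3.
support = [(-1, 1), (0, 0), (1, 1)]
p = Fraction(1, 3)

EX  = sum(p * x for x, _ in support)
EY  = sum(p * y for _, y in support)
EXY = sum(p * x * y for x, y in support)
assert EXY - EX * EY == 0                      # Cov(X, Y) = 0: uncorrelated

# Not independent: P{X = 1, Y = 1} = 1/3, but P{X = 1} P{Y = 1} = (1/3)(2/3).
assert p != p * Fraction(2, 3)
```

Exact rational arithmetic makes the zero covariance an identity rather than a numerical coincidence.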
Cov(X, Y) = E[XY] − E[X] E[Y].
Proof This can be proved by considering
In fact, it is readily seen from (4.3) and Proposition 4.12 that Cov(X, X) = Var(X). Thus the variance of a random variable can be regarded as the covariance between X and itself.
We now prove an inequality known as Schwartz’s inequality. This inequality gives an upper bound on the magnitude of E[XY] in terms of the second moments of X and Y.

(E[XY])² ≤ E[X²] E[Y²].
Proof Let X and Y be any fixed jointly distributed random variables. Let a be any real number, and consider the random variable (X − aY)². Since this random variable can only take nonnegative values, we have

E[(X − aY)²] ≥ 0,   (4.5)

or, expanding and rearranging the terms,

E[X²] − 2a E[XY] + a² E[Y²] ≥ 0.   (4.6)

Since X and Y are fixed random variables, E[X²], E[XY], and E[Y²] in the above are fixed real numbers. Therefore, the left hand side above is quadratic in a, and it achieves its minimum when

a = E[XY] / E[Y²].

Since (4.5) holds for all a, it holds for this a. Substituting into (4.6), we have

E[X²] − (E[XY])² / E[Y²] ≥ 0.

Hence,

(E[XY])² ≤ E[X²] E[Y²],

proving the theorem.
Schwartz’s inequality is tight, i.e.,

(E[XY])² = E[X²] E[Y²],

if and only if X is proportional to Y.

Proof From the proof of the last theorem, we see that Schwartz’s inequality is simply the inequality in (4.5) with a = E[XY]/E[Y²]. Therefore, Schwartz’s inequality is tight if and only if

E[(X − aY)²] = 0,   (4.7)

which in turn holds if and only if X − aY = 0 with probability 1, or X = aY, implying that X is proportional to Y. On the other hand, if X is proportional to Y, i.e., X = cY for some constant c, then it is readily verified that Schwartz’s inequality is tight.
Denote Var(X) and Var(Y) by σ_X² and σ_Y², respectively. Now in Schwartz’s inequality, if we replace X by X − E[X] and Y by Y − E[Y],¹ we immediately obtain

(E[(X − E[X])(Y − E[Y])])² ≤ E[(X − E[X])²] E[(Y − E[Y])²],

¹ This is valid because X and Y in Schwartz’s inequality can be any random variables.
or

Cov(X, Y)² ≤ σ_X² σ_Y²,

where we have invoked Proposition 4.12. Thus,

|Cov(X, Y)| ≤ σ_X σ_Y.   (4.8)

The ratio

ρ = Cov(X, Y) / (σ_X σ_Y),

called the coefficient of correlation, is a measure of the degree of correlation between X and Y. It is readily seen from (4.8) that

−1 ≤ ρ ≤ 1.

When ρ = 0, we say that X and Y are uncorrelated. When ρ > 0, we say that X and Y are positively correlated. When ρ < 0, we say that X and Y are negatively correlated.

When E[X] = E[Y] = 0, Cov(X, Y) is simply equal to E[XY]. When ρ > 0,

E[(X + Y)²] = E[X²] + 2E[XY] + E[Y²] > E[X²] + E[Y²].

Roughly speaking, when X is large, Y also tends to be large, so that on the average the magnitude (square) of X + Y is reinforced. When ρ < 0,

E[(X + Y)²] = E[X²] + 2E[XY] + E[Y²] < E[X²] + E[Y²].

Roughly speaking, when X is large, Y tends to be small, and when X is small, Y tends to be large, so that on the average the magnitude (square) of X + Y is suppressed.
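The behavior of ρ at its extremes and at 0 can be demonstrated numerically; the affine relations and Gaussian samples below are illustrative choices for the sketch.

```python
import math, random

random.seed(5)

def rho(xs, ys):
    """Sample coefficient of correlation Cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / math.sqrt(vx * vy)

xs = [random.gauss(0, 1) for _ in range(50000)]
noise = [random.gauss(0, 1) for _ in range(50000)]

assert abs(rho(xs, [2 * x + 3 for x in xs]) - 1) < 1e-6   # Y = aX + b, a > 0
assert abs(rho(xs, [-x for x in xs]) + 1) < 1e-6          # a < 0 gives rho = -1
assert abs(rho(xs, noise)) < 0.03                          # independent: rho ≈ 0
```

The ±1 cases are exactly the tightness condition of Schwartz's inequality applied to the centered variables.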
We have already seen that when X and Y are independent, then they are uncorrelated, or ρ = 0. When ρ = 1, Schwartz’s inequality applied to X − E[X] and Y − E[Y] is tight, so that

Y − E[Y] = a(X − E[X])

for some constant a, and a > 0 because Cov(X, Y) = a σ_X² > 0; by Proposition 4.5, σ_Y = a σ_X. When ρ = −1, the same argument gives

Y − E[Y] = a(X − E[X])

with a < 0, and σ_Y = |a| σ_X.
4.4 MOMENT GENERATING FUNCTIONS
The moments of a random variable (or of a distribution) were introduced in Section 2.6. In this section, we introduce the moment generating function of a random variable X, denoted by M_X(s), where the parameter s is a real number.

The moment generating function of a random variable X (or of the distribution of X) is defined by

M_X(s) = E[e^{sX}],   (4.9)

where s is a real number. For the discrete case,

M_X(s) = Σ_x e^{sx} p_X(x),   (4.10)

and for the continuous case,

M_X(s) = ∫ e^{sx} f_X(x) dx.   (4.11)

M_X(s) is said to exist if the summation in (4.10) or the integral in (4.11) converges for some range of s.

Note that for a fixed s, e^{sX} is a function of X. So M_X(s) is simply the expectation of this particular function of X, and its value depends on s. Therefore, M_X(s) is a function of s.
M_X(0) = 1.

Proof By letting s = 0 in (4.9), we have

M_X(0) = E[e^{0·X}] = E[1] = 1,

proving the corollary. This equality can be interpreted by letting s = 0 in (4.10) and (4.11), which yields nothing but the normalizing conditions for p_X(x) and f_X(x).
The moment generating function can be regarded as the “transform” of the distribution. In (4.10), if X is integer-valued, by letting z = e^s so that e^{sx} = z^x, the right hand side becomes

Σ_x z^x p_X(x),

which is the z-transform of the pmf p_X(x) (with x being the time index). In (4.11), by letting s = jω, M_X(jω) becomes the Fourier transform of the pdf f_X(x) (with x being the index of time). (But of course, s in M_X(s) is actually a real number.)

It can be shown that there is a one-to-one correspondence between the distribution of X and the moment generating function of X. That is, for each moment generating function, there is a unique distribution corresponding to it. In other words, the distribution of a random variable X can (in principle) be recovered from its moment generating function.
The moments of X can be obtained by successive differentiation of M_X(s) with respect to s. Specifically,

M_X′(s) = E[X e^{sX}],

and in general

M_X^{(k)}(s) = E[X^k e^{sX}],   (4.12)

where M_X^{(k)} denotes the kth derivative of M_X. Then by putting s = 0 in (4.12), we obtain

M_X^{(k)}(0) = E[X^k].

That is why M_X(s) is called the moment generating function of X.
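The identity M_X^{(k)}(0) = E[X^k] can be checked with finite differences. The sketch assumes an exponential random variable with parameter λ = 2, whose MGF is M_X(s) = λ/(λ − s) for s < λ (an assumed example, chosen because the moments are known in closed form).

```python
lam = 2.0
M = lambda s: lam / (lam - s)       # MGF of Exponential(lambda), s < lambda

h = 1e-4
M1 = (M(h) - M(-h)) / (2 * h)                 # central difference ≈ M'(0)
M2 = (M(h) - 2 * M(0) + M(-h)) / h**2          # ≈ M''(0)

assert abs(M1 - 1 / lam) < 1e-6                # E[X]   = 1/lambda = 0.5
assert abs(M2 - 2 / lam**2) < 1e-4             # E[X^2] = 2/lambda^2 = 0.5
```

Repeated differentiation of λ/(λ − s) gives M_X^{(k)}(0) = k!/λ^k, the familiar moments of the exponential distribution.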
If X and Y are independent, then M_{X+Y}(s) = M_X(s) M_Y(s).
Proof This can be proved by considering
We know from Example 3.17 that if X and Y are independent, then Z = X + Y has pdf given by f_Z = f_X * f_Y. The above theorem says that M_Z(s) = M_X(s) M_Y(s), so it is actually analogous to the convolution theorem in linear system theory.
Let X be exponential with parameter λ. Then for s < λ,

M_X(s) = ∫_0^∞ e^{sx} λ e^{−λx} dx = λ/(λ − s).
Let Y be the geometric random variable with parameter p, where 0 < p < 1, i.e.,

p_Y(k) = p(1 − p)^k for k = 0, 1, 2, ..., and 0 otherwise.

Then

M_Y(s) = Σ_{k=0}^∞ e^{sk} p(1 − p)^k = p / (1 − (1 − p)e^s),

where we assume that (1 − p)e^s < 1, or s < −ln(1 − p).
Remark: We use the convention adopted by most books in probability in defining the geometric distribution in the above example. A random variable N with

p_N(k) = p(1 − p)^{k−1} for k = 1, 2, 3, ..., and 0 otherwise

is said to have a truncated geometric distribution with parameter p. If we toss a coin with the probability of obtaining a head equal to p repeatedly, then it is easy to see that the first time a head is obtained has a truncated geometric distribution with parameter p. Note that if Y is geometric with parameter p, then Y + 1 is truncated geometric with parameter p.
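The coin-tossing interpretation is easy to simulate; the parameter p = 0.3 below is an arbitrary illustrative choice. The index of the first head should be truncated geometric, with mean 1/p.

```python
import random, statistics

random.seed(6)
p, n = 0.3, 200000

def first_head(p):
    """Toss a p-coin until the first head; return the toss index (1, 2, ...)."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

samples = [first_head(p) for _ in range(n)]
# Truncated geometric: P{N = k} = (1 - p)^(k - 1) p, with mean 1/p.
assert abs(statistics.fmean(samples) - 1 / p) < 0.05
assert abs(sum(s == 1 for s in samples) / n - p) < 0.01
```

Subtracting 1 from each sample would give the geometric convention of the earlier example, matching the shift noted in the remark.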
In the next example, we give a powerful application of the moment generating function.
Let X_1, X_2, ... be i.i.d. exponential random variables with parameter λ, let N be an independent truncated geometric random variable (independent of the X_i's) with parameter p, and let Y = X_1 + ··· + X_N. That is, Y is the sum of a truncated-geometric number of i.i.d. exponential random variables. We now determine the distribution of Y via M_Y(s). Consider
where

a) follows because the X_i's are independent, and they remain so when conditioning on {N = n} because N is independent of the X_i's;

b) follows because N is independent of the X_i's;

c) follows because X_1, X_2, ... are i.i.d.;

d) follows because M_{X_i}(s) = λ/(λ − s).

From the form of M_Y(s), we see that Y is exponential with parameter pλ.
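The conclusion (a truncated-geometric number of i.i.d. exponentials is again exponential, with parameter pλ) can be checked by simulation; the parameter values below are illustrative assumptions.

```python
import random, statistics

random.seed(7)
lam, p, n = 1.0, 0.4, 100000

def sample_Y():
    # N ~ truncated geometric(p): number of tosses up to the first head.
    N = 1
    while random.random() >= p:
        N += 1
    # Y = X_1 + ... + X_N with X_i i.i.d. Exponential(lam).
    return sum(random.expovariate(lam) for _ in range(N))

ys = [sample_Y() for _ in range(n)]
# If Y ~ Exponential(p * lam), then E[Y] = 1/(p * lam) = 2.5.
assert abs(statistics.fmean(ys) - 1 / (p * lam)) < 0.05
```

Checking the full distribution (e.g., comparing the empirical CDF with 1 − e^{−pλy}) works the same way; the mean is simply the quickest statistic to verify.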
Chapter 5
LIMIT THEOREMS
Recall that in a random experiment, the outcome ω is drawn from the sample space Ω according to the probability measure P, and a random variable X is a function of ω. Now suppose we construct random variables X_1, X_2, ..., each a function of ω. Then {X_n} is called a stochastic process (or random process).

Since a stochastic process involves an infinite number of random variables, it cannot be described by a joint distribution. In fact, a stochastic process can be very complicated in general. But if the X_n's are i.i.d., in which case we say that {X_n} is an i.i.d. process, then it is relatively easy to describe. The process of tossing a coin repeatedly can be modeled as an i.i.d. process.

Limit theorems deal with the asymptotic behaviors of stochastic processes, and they play an important role in probability theory. In this chapter, we will discuss the two most fundamental such theorems, namely the weak law of large numbers (WLLN) and the central limit theorem (CLT).
5.1 THE WEAK LAW OF LARGE NUMBERS
Consider tossing a fair coin n times. Intuitively, when n is large, the relative frequency of obtaining a head should be approximately equal to 0.5. The WLLN makes this idea precise, as we will see, and it gives a physical meaning to the probability of an event.
For any random variable X with mean μ and variance σ², and any ε > 0,

P{|X − μ| ≥ ε} ≤ σ²/ε².

Proof The inequality can be proved by considering
The interpretation of the Chebyshev inequality is as follows. Here, {|X − μ| ≥ ε} is the event that X deviates from its mean by at least ε. The upper bound σ²/ε² on the probability of this event is such that:

if σ² is large, then it is large;

if ε is small, then it is large.

In fact, sometimes this upper bound can be larger than 1, making it useless. In general, the Chebyshev inequality is rather loose.
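The looseness of the bound is easy to see empirically. The sketch below checks P{|X − μ| ≥ ε} ≤ σ²/ε² for an Exponential(1) sample (an illustrative choice: its mean and variance are both 1), and the actual tail probabilities turn out to be far below the bound.

```python
import random

random.seed(8)
n = 200000
xs = [random.expovariate(1.0) for _ in range(n)]   # Exp(1): mean 1, variance 1

mu, var = 1.0, 1.0
for eps in (1.5, 2.0, 3.0):
    tail = sum(abs(x - mu) >= eps for x in xs) / n
    # Chebyshev: P{|X - mu| >= eps} <= var / eps^2 (here: 0.44, 0.25, 0.11).
    assert tail <= var / eps**2
```

For ε = 3 the true tail is about e^{−4} ≈ 0.018 while the bound is 1/9 ≈ 0.111, illustrating the "rather loose" remark.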
Let X_1, X_2, ... be i.i.d. random variables with mean μ and variance σ². Then for any ε > 0,

P{|(X_1 + ··· + X_n)/n − μ| ≥ ε} ≤ σ²/(nε²).   (5.1)

Proof Let

S_n = (X_1 + ··· + X_n)/n.   (5.2)

The WLLN is actually an application of the Chebyshev inequality to S_n. Now

E[S_n] = μ

and

Var(S_n) = (1/n²) Σ_{i=1}^n Var(X_i) = σ²/n,

where a) follows from Proposition 4.5 and b) follows because X_1, ..., X_n are independent. Then by the Chebyshev inequality, we have

P{|S_n − μ| ≥ ε} ≤ Var(S_n)/ε²,

or

P{|S_n − μ| ≥ ε} ≤ σ²/(nε²).

We note that in (5.1), the upper bound tends to zero for all finite σ² and ε as n → ∞. Thus the WLLN says that as long as n is sufficiently large, the probability that the average of X_1, ..., X_n, i.e., S_n, deviates from μ by any prescribed amount is arbitrarily small. Technically, we say that

S_n → μ in probability.
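Convergence in probability can be illustrated with the fair-coin example; the deviation threshold ε = 0.05 and the trial counts below are illustrative choices for the sketch.

```python
import random

random.seed(9)
p, eps, trials = 0.5, 0.05, 400

def freq_deviates(n):
    """Fraction of `trials` experiments in which the relative frequency of
    heads in n tosses of a fair coin deviates from p by at least eps."""
    bad = 0
    for _ in range(trials):
        heads = sum(random.random() < p for _ in range(n))
        if abs(heads / n - p) >= eps:
            bad += 1
    return bad / trials

# The WLLN bound sigma^2/(n * eps^2) shrinks with n, and so does the actual
# deviation probability: at n = 2000 a 0.05 deviation is a > 4-sigma event.
assert freq_deviates(50) > freq_deviates(2000)
assert freq_deviates(2000) < 0.01
```

At n = 50 roughly half the experiments still deviate by 0.05, while at n = 2000 essentially none do, which is exactly what (5.1) predicts qualitatively.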
We now give another interpretation of the WLLN. Let X be the generic random variable of the process and f(x) be its density function, i.e., the common pdf of the X_i's. It can readily be shown that E[g(X_i)] = E[g(X)] for any function g. Then by Theorem 4.18,

(1/n) Σ_{i=1}^n g(X_i) → E[g(X)] in probability,

so that

(1/n) Σ_{i=1}^n g(X_i) → ∫ g(x) f(x) dx in probability.

Thus the WLLN says that the long-run average of g(X_1), g(X_2), ... converges to E[g(X)] regardless of the actual form of f(x).

Having proved the WLLN, we now show an important consequence of it. Consider any subset B of the alphabet of the generic random variable X. We are interested in the relative frequency of occurrence of the event {X ∈ B} in the first n trials, say. More precisely, we define the random variable (called an indicator function)

Y_i = 1 if X_i ∈ B, and Y_i = 0 if X_i ∉ B,
so that

(1/n) Σ_{i=1}^n Y_i

is the relative frequency of occurrence of {X ∈ B} in the first n trials. Since X_1, X_2, ... are i.i.d., so are Y_1, Y_2, ..., because Y_i is a function of X_i. Then by the WLLN, the relative frequency of occurrence of {X ∈ B} satisfies

(1/n) Σ_{i=1}^n Y_i → E[Y_i] = P{X ∈ B} in probability.

Therefore, the probability of an event can “physically” be interpreted as the long term relative frequency of occurrence of the event when the experiment is repeated indefinitely.
The probability of obtaining a head by tossing a fair coin is 0.5. If the coin is tossed a large number of times, the relative frequency of occurrence of a head is close to 0.5 with probability almost 1.
5.2 THE CENTRAL LIMIT THEOREM
The Gaussian distribution was introduced in Example 2.29. It is one of the most important distributions in probability theory because of its wide applications. Noise in communication systems, the marks in an examination, etc., can be modeled by the Gaussian distribution. In this section, we will discuss the central limit theorem (CLT) that explains the origin of this important distribution. The theorem is first stated below.

Let X_1, X_2, ... be i.i.d. with zero mean and unit variance, and let

Z_n = (X_1 + ··· + X_n)/√n.   (5.3)

Then the distribution of Z_n tends to N(0, 1) as n → ∞.
The reader should compare the definition of S_n in (5.2) for the WLLN with the definition of Z_n here, and observe that for S_n the scaling factor is 1/n while for Z_n it is 1/√n. To understand the CLT properly, we note that

E[Z_n] = 0

since each X_i has zero mean, and

Var(Z_n) = (1/n) Σ_{i=1}^n Var(X_i) = 1,

where

a) follows from Proposition 4.5;

b) follows because the X_i's are mutually independent (cf. Section 4.3);

c) follows since each X_i has unit variance.

So the only thing the CLT says about Z_n is that as n → ∞ it has a Gaussian distribution; the mean and the variance of Z_n are already fixed from the setup.
Before we prove the CLT, we first derive the moment generating function of a Gaussian distribution.

Let X ~ N(0, 1). Then M_X(s) = e^{s²/2}.
Proof Consider
where the last step follows because the density function of N(s, 1) integrates to 1, proving the lemma.

Let X ~ N(μ, σ²). Then M_X(s) = e^{μs + σ²s²/2}.

Let X ~ N(μ_X, σ_X²) and Y ~ N(μ_Y, σ_Y²), where X and Y are independent, and let Z = X + Y. Then Z ~ N(μ_X + μ_Y, σ_X² + σ_Y²).

Proof The proof is left as an exercise.

This corollary says that the sum of two independent Gaussian random variables is again Gaussian.
Proof of Theorem 5.4 We give an outline of the proof as follows. Denote the generic random variable of the process by X. Consider the second order approximation of M_X(s) via the Taylor expansion¹ for small s:

M_X(s) ≈ M_X(0) + M_X′(0) s + M_X″(0) s²/2.

Since M_X′(0) = E[X] and M_X″(0) = E[X²], by the assumption that X has zero mean and unit variance, we have

M_X(0) = 1, M_X′(0) = 0, and M_X″(0) = 1.

Therefore,

M_X(s) ≈ 1 + s²/2

for small s. By Theorem 4.18, we have

M_{Z_n}(s) = [M_X(s/√n)]^n.
¹ The Taylor expansion for a function f at 0 is given by

f(s) = Σ_{k=0}^∞ f^{(k)}(0) s^k / k!,

where f^{(k)} denotes the kth derivative of f.
Then for large n,

M_{Z_n}(s) = [M_X(s/√n)]^n ≈ (1 + s²/(2n))^n,

where the above approximation is justified because s/√n is small for large n. Then

M_{Z_n}(s) → e^{s²/2} as n → ∞,

where we have used the identity

lim_{n→∞} (1 + x/n)^n = e^x.

Finally, we conclude by Corollary 5.6 that the distribution of Z_n tends to N(0, 1). This proves the CLT.
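The CLT can be visualized numerically. The sketch below uses Uniform(−√3, √3) summands (an illustrative choice with zero mean and unit variance, as the theorem requires) and compares the empirical CDF of Z_n with the standard normal CDF Φ.

```python
import math, random

random.seed(10)

def Z_n(n):
    """(X_1 + ... + X_n)/sqrt(n) for X_i i.i.d. Uniform(-sqrt(3), sqrt(3)),
    which has zero mean and unit variance."""
    s3 = math.sqrt(3)
    return sum(random.uniform(-s3, s3) for _ in range(n)) / math.sqrt(n)

samples = [Z_n(30) for _ in range(50000)]
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))   # standard normal CDF

for x in (-1.0, 0.0, 1.0):
    empirical = sum(s <= x for s in samples) / len(samples)
    assert abs(empirical - Phi(x)) < 0.01
```

Even at n = 30 the agreement is already within Monte Carlo noise, which is typical for symmetric summand distributions.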
The version of the CLT we have proved in Theorem 5.4 applies only when the X_i's have zero mean and unit variance. In order to apply it to a general i.i.d. process, we need the following scaling property of Gaussian distributions.
Let X ~ N(μ, σ²), and let Y = aX + b. Then Y ~ N(aμ + b, a²σ²). That is, Y again has a Gaussian distribution.

Proof Consider

M_Y(s) = E[e^{s(aX+b)}] = e^{bs} M_X(as) = e^{bs} e^{μas + σ²a²s²/2} = e^{(aμ+b)s + a²σ²s²/2}.

Thus Y ~ N(aμ + b, a²σ²).
In this example, we give a useful application of Proposition 5.8. Suppose X ~ N(μ, σ²), and we are interested in P{X ≤ a} for some constant a. Now X ≤ a is equivalent to

(X − μ)/σ ≤ (a − μ)/σ.

By defining Z = (X − μ)/σ, the above is equivalent to

Z ≤ (a − μ)/σ.

We see from Proposition 4.6 that Z has zero mean and unit variance. Moreover, by Proposition 5.8, Z is in fact a Gaussian random variable. Therefore,

P{X ≤ a} = Φ((a − μ)/σ),

where

Φ(z) = ∫_{−∞}^z (1/√(2π)) e^{−t²/2} dt

is called the standard normal CDF.
Now suppose X_1, X_2, ... is i.i.d. with mean μ and variance σ². Then by Proposition 4.6,

Y_i = (X_i − μ)/σ,   (5.4)

where Y_1, Y_2, ... is i.i.d. with zero mean and unit variance. From (5.4), we have

X_i = σ Y_i + μ,

so that

X_1 + ··· + X_n = σ(Y_1 + ··· + Y_n) + nμ.

When n is large, by the CLT,

Y_1 + ··· + Y_n ≈ N(0, n).
Then by Proposition 5.8,

X_1 + ··· + X_n ≈ N(nμ, nσ²).

That is, the sum of a large number of i.i.d. random variables is approximately Gaussian. Again, the mean and the variance of the above Gaussian distribution are determined from the setup and have nothing to do with the CLT.
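The normal approximation N(nμ, nσ²) can be put to work directly; the Exponential(1) summands and the threshold below are illustrative assumptions for the sketch.

```python
import math, random

random.seed(11)
# For X_i i.i.d. Exponential(1) (mu = 1, sigma = 1) and n = 100,
# P{X_1 + ... + X_n <= s} ≈ Phi((s - n*mu) / (sigma * sqrt(n))).
n, mu, sigma, s = 100, 1.0, 1.0, 110.0
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
approx = Phi((s - n * mu) / (sigma * math.sqrt(n)))    # = Phi(1) ≈ 0.8413

trials = 20000
empirical = sum(
    sum(random.expovariate(1.0) for _ in range(n)) <= s for _ in range(trials)
) / trials
assert abs(empirical - approx) < 0.02
```

The exact distribution of the sum is Gamma(100, 1), so the closeness of `empirical` and `approx` is a direct check of the approximation rather than of the simulation.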