Probability Models and Applications
Lecture Notes for ERG 2040 (2004-05)
by Raymond W. Yeung
Contents
1. PROBABILITY AND EVENTS
1.1 Set Theory
1.2 Elements of a Probability Model
1.3 Probability as a State of Knowledge
1.4 Conditional Probability
1.5 The Law of Total Probability and the Bayes Theorem
1.6 Independent Events
2. RANDOM VARIABLES
2.1 Probability Mass Function
2.2 Cumulative Distribution Function
2.3 Probability Density Function
2.4 Function of a Random Variable
2.5 Expectation
2.6 Moments
2.7 Jensen's Inequality
3. MULTIVARIATE DISTRIBUTIONS
3.1 Joint Cumulative Distribution Function
3.2 Joint Probability Density Function
3.3 Independence of Random Variables
3.4 Conditional Distribution
3.4.1 Conditioning on an Event
3.4.2 Conditioning on Another Random Variable
3.5 Functions of Random Variables
3.5.1 The CDF Method
3.5.2 The pdf Method
3.6 Transformation of Random Variables
4. EXPECTATION AND MOMENT GENERATING FUNCTIONS
4.1 Expectation as a Linear Operator
4.2 Conditional Expectation
4.3 Covariance and Schwartz's Inequality
4.4 Moment Generating Functions
5. LIMIT THEOREMS
5.1 The Weak Law of Large Numbers
5.2 The Central Limit Theorem
Chapter 1
PROBABILITY AND EVENTS
Probability theory is a theory for modeling the outcome of a random experiment. It is built on set theory, which in turn is built on logic. Before we start discussing probability theory, we first give a quick summary of the facts in set theory that we need in this course.

1.1 SET THEORY
The universal set, the set that contains all the elements of interest in a problem, is denoted by $\Omega$. The empty set, the set that contains no element, is denoted by $\emptyset$.
The three basic operations in set theory are union, intersection, and complement. Another commonly used operation is set difference. These operations are defined as follows.
Let $A$ and $B$ be sets.
Union: The union of $A$ and $B$, denoted by $A \cup B$, is the set
$A \cup B = \{\omega : \omega \in A \text{ or } \omega \in B\}.$
Intersection: The intersection of $A$ and $B$, denoted by $A \cap B$, is the set
$A \cap B = \{\omega : \omega \in A \text{ and } \omega \in B\}.$
Complement: The complement of $A$, denoted by $A^c$, is the set
$A^c = \{\omega : \omega \notin A\}.$
Set difference: The set $A$ minus the set $B$, denoted by $A - B$, is the set
$A - B = \{\omega : \omega \in A \text{ and } \omega \notin B\}.$
Note that $A - B$ can be written as $A \cap B^c$, and $A^c$ can be written as $\Omega - A$.
Let $A$ and $B$ be sets.
Inclusion: $A$ is a subset of $B$, denoted by $A \subset B$, if
'$\omega \in A$' implies '$\omega \in B$'.
Mutual exclusion: $A$ and $B$ are mutually exclusive, or disjoint, if $A \cap B = \emptyset$.
Inclusion and mutual exclusion are two basic relations between two sets. Note that $A \subset B$ if and only if $A \cap B^c = \emptyset$.
1. .
2. .
3. .
4. .
5. .
6. .
The proof of this proposition is omitted. The following two theorems further provide some useful set identities.
1. $(A \cup B)^c = A^c \cap B^c$.
2. $(A \cap B)^c = A^c \cup B^c$.
Proof: The two forms of De Morgan's Law are equivalent. This can be seen by replacing every set in the first form by its complement. Specifically, we have
$(A^c \cup B^c)^c = (A^c)^c \cap (B^c)^c = A \cap B.$
Upon taking complements on both sides, we obtain the second form.
De Morgan's law can easily be verified by using a Venn diagram. More formally, it can be proved by using a truth table. Consider the first form and write out the membership condition for each side. Then the following truth table shows that the two sets above are equal:

ω∈A   ω∈B   ω∈A∪B   ω∈(A∪B)^c   ω∈A^c   ω∈B^c   ω∈A^c∩B^c
 F     F      F         T          T       T         T
 F     T      T         F          T       F         F
 T     F      T         F          F       T         F
 T     T      T         F          F       F         F
The theorem is proved.
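The truth-table argument can also be carried out mechanically. The following sketch (the universal set and the sets $A$, $B$ are arbitrary illustrative choices, not taken from the notes) verifies both forms of De Morgan's Law:

```python
# Verify De Morgan's Law on concrete finite sets.
# The universe {1,...,8} and the sets A, B are illustrative choices.
universe = set(range(1, 9))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

def complement(S):
    """Complement with respect to the universal set."""
    return universe - S

# First form: (A ∪ B)^c = A^c ∩ B^c
assert complement(A | B) == complement(A) & complement(B)
# Second form: (A ∩ B)^c = A^c ∪ B^c
assert complement(A & B) == complement(A) | complement(B)
```

Because the membership of $\omega$ in each side depends only on whether $\omega \in A$ and $\omega \in B$, checking the four truth-value combinations (as in the table above) proves the identity for all sets.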
1. $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$.
2. $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$.
Proof: Again these identities can be verified by using Venn diagrams and can be formally proved by using truth tables. The details are omitted.
Note that the second identity in the theorem above can be obtained from the first identity by replacing each $\cap$ by a $\cup$ and replacing each $\cup$ by a $\cap$. (This is not a proof though.)
1.2 ELEMENTS OF A PROBABILITY MODEL
Probability is a state of knowledge about something unknown. This knowledge may change due to the availability of certain information. These will be discussed in the next two sections.
A probability model for a random experiment consists of the following elements:
1. a sample space $\Omega$ containing all possible outcomes of the random experiment;
2. a set function $P$, called a probability measure, such that
A1. $P(A) \ge 0$ for all $A \subset \Omega$;
A2. $P(\Omega) = 1$;
A3. $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$.
Note: A set function is a function that maps a set to a real number.
The outcome of the random experiment, an element of $\Omega$, is denoted by $\omega$. A subset $A$ of $\Omega$ is called an event, and we say that event $A$ occurs if $\omega \in A$.
The quantity $P(A)$ is interpreted as the probability that event $A$ occurs. The probability measure $P$ represents our knowledge about the random experiment. A1-A3 are called the axioms of probability, due to the great mathematician A. Kolmogorov,
http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Kolmogorov.html.
$P(\emptyset) = 0$.
Proof:
$1 = P(\Omega) = P(\Omega \cup \emptyset) = P(\Omega) + P(\emptyset) = 1 + P(\emptyset)$ (by A3)
implies $P(\emptyset) = 0$.
$P(A^c) = 1 - P(A)$.
Proof: Consider
$1 = P(\Omega) = P(A \cup A^c) = P(A) + P(A^c)$ (by A3).
Therefore, $P(A^c) = 1 - P(A)$.
$P(A) \le P(B)$ if $A \subset B$.
Proof: The fact that $B = A \cup (B - A)$, with $A$ and $B - A$ disjoint, can easily be seen from a Venn diagram, so that
$P(B) = P(A) + P(B - A) \ge P(A)$
by A3 and A1.
One can think of $\Omega$ as a cookie of weight 1, and $P$ characterizes the weight distribution of the cookie. Obviously, $P(A) \ge 0$ for all $A$, and $P(\Omega) = 1$. That $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$ means that weight is additive. Note that the cookie can have point masses.
Let
$\Omega = \{(i, j) : 1 \le i, j \le 6\},$
the set of outcomes of two tosses of a die, and take
$P(\{(i, j)\}) = 1/36$ for all $(i, j) \in \Omega$
(to model that the die is fair and the two tosses are independent1). For example, letting $A$ be
the set of all outcomes s.t. the sum is 6, i.e., $A = \{(1,5), (2,4), (3,3), (4,2), (5,1)\}$, we have $P(A) = 5/36$.
We now check that $P$ is a valid probability measure by showing that it satisfies all the axioms of probability.
A1: Since $P(A) = |A|/36 \ge 0$, A1 is obviously satisfied.
A2: Trivial.
A3: Since $|A \cup B| = |A| + |B|$ if $A \cap B = \emptyset$, we can readily verify A3.
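The axiom check above can be mirrored in a short computation. The following sketch models two tosses of a fair die with equal weight $1/36$ per outcome, as the example describes:

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs (first toss, second toss) of a fair die.
omega = list(product(range(1, 7), repeat=2))

def P(event):
    """Probability measure: each of the 36 outcomes carries weight 1/36."""
    return Fraction(len(set(event) & set(omega)), 36)

# A2: P(Omega) = 1
assert P(omega) == 1

# The event "the sum of the two tosses is 6"
A = [(i, j) for (i, j) in omega if i + j == 6]
assert P(A) == Fraction(5, 36)

# A3 on a pair of disjoint events: sums are 6 and 7 respectively
B = [(i, j) for (i, j) in omega if i + j == 7]
assert set(A) & set(B) == set()
assert P(A + B) == P(A) + P(B)
```

The `Fraction` type keeps the arithmetic exact, so the additivity check in A3 holds with equality rather than up to rounding.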
Following Example 1.9, we let $X$ be the outcome of the first toss. Formally, $X : \Omega \to \mathbb{R}$, and $X((i, j)) = i$.
Following Example 1.9, we define the random variable
$Y = 1$ if 1st toss $=$ 2nd toss, and $Y = 0$ otherwise.
Formally, $Y((i, j)) = 1$ if $(i, j) \in B$, where
$B = \{(i, i) : 1 \le i \le 6\},$
and $Y((i, j)) = 0$ for $(i, j) \notin B$.
1.3 PROBABILITY AS A STATE OF KNOWLEDGE
Think of probability as our state of knowledge about something. For example, our state of knowledge about the outcome of tossing a fair coin before the toss is
$P(\text{HEAD}) = P(\text{TAIL}) = 1/2.$
After the tossing and upon seeing the coin, our state of knowledge becomes
$P(\text{HEAD}) = 1,\ P(\text{TAIL}) = 0$
with probability $1/2$ (a HEAD is obtained), and becomes
$P(\text{HEAD}) = 0,\ P(\text{TAIL}) = 1$
with probability $1/2$ (a TAIL is obtained).
1To be discussed in the next chapter.
1.4 CONDITIONAL PROBABILITY
The probability of event $A$ conditioning on event $B$ is defined by
$P(A|B) = \frac{P(A \cap B)}{P(B)}$
if $P(B) > 0$ (undefined if $P(B) = 0$).
The above formula specifies how we update our knowledge about event $A$ from $P(A)$ to $P(A|B)$ upon knowing that the event $B$ has occurred. If we fix the event $B$, then $P(\cdot|B)$ is a set function. Our knowledge of the random experiment before we know that the event $B$ has occurred is given by $P(\cdot)$, and it becomes $P(\cdot|B)$ after we know that the event $B$ has occurred. We will show later in Theorem 1.16 that $P(\cdot|B)$ is indeed a valid probability measure.
Following Example 1.9,
Now , and
by A3
Therefore,
Similarly,
and
On the other hand,
Similarly,
for all .
Let $\Omega = \{1, 2, 3, 4, 5, 6\}$, with

ω        1    2    3    4    5    6
P({ω})   0.1  0.2  0.1  0.1  0.3  0.2
Let
Then it can easily be shown that
1 2 3 4 5 60 0
Following the last example, further define the event
$A = \{\omega \in \Omega : \omega \text{ is odd}\} = \{1, 3, 5\}.$
Then
We say that the set function $P$ is a probability measure because it satisfies the axioms A1-A3. The same can be said about $P(\cdot|B)$, the set function derived from $P$ by conditioning on event $B$. This is proved in the next theorem.
Let $B$ be any event such that $P(B) > 0$. Then the set function $P(\cdot|B)$ is a probability measure.
Proof: First, consider any event $A$. Then
$P(A|B) = \frac{P(A \cap B)}{P(B)} \ge 0$
since $P(A \cap B) \ge 0$ by A1. On the other hand,
$P(A \cap B) \le P(B)$
by Corollary 1.8 (since $A \cap B \subset B$), so that $P(A|B) \le 1$.
Therefore, $P(\cdot|B)$ satisfies A1. Second, consider
$P(\Omega|B) = \frac{P(\Omega \cap B)}{P(B)} = \frac{P(B)}{P(B)} = 1.$
Therefore, $P(\cdot|B)$ satisfies A2. Third, for any events $A_1$ and $A_2$ such that $A_1 \cap A_2 = \emptyset$, we have
$P(A_1 \cup A_2 | B) = \frac{P((A_1 \cup A_2) \cap B)}{P(B)} = \frac{P((A_1 \cap B) \cup (A_2 \cap B))}{P(B)} = \frac{P(A_1 \cap B) + P(A_2 \cap B)}{P(B)} = P(A_1|B) + P(A_2|B).$
Therefore, $P(\cdot|B)$ satisfies A3, and we conclude that $P(\cdot|B)$ is a probability measure.
In fact, the set function can formally be viewed as . To see this,we only have to observe that for any event ,
Following the setup in Example 1.15, since is a prob-ability measure, we have
by A3
from Example 1.14
which is consistent with what we obtained in Example 1.15.
Since $P(\cdot|B)$ is a probability measure, Corollaries 1.6, 1.7, and 1.8 can also be applied to $P(\cdot|B)$. Thus we have
1. $P(\emptyset|B) = 0$.
2. $P(A^c|B) = 1 - P(A|B)$.
3. $P(A_1|B) \le P(A_2|B)$ if $A_1 \subset A_2$.
1.5 THE LAW OF TOTAL PROBABILITY AND THE BAYES THEOREM
A collection of sets $\{B_1, B_2, \ldots, B_n\}$ is a partition of $\Omega$ if
1. $B_1 \cup B_2 \cup \cdots \cup B_n = \Omega$;
2. $B_i \cap B_j = \emptyset$ if $i \ne j$.
Let $\{B_1, B_2, \ldots, B_n\}$ be a partition of $\Omega$. Then for any event $A$,
$P(A) = \sum_{i=1}^{n} P(A|B_i) P(B_i).$
Proof: The theorem can be proved by considering
$P(A) = P(A \cap \Omega) = P\!\left(\bigcup_{i=1}^{n} (A \cap B_i)\right) = \sum_{i=1}^{n} P(A \cap B_i) = \sum_{i=1}^{n} P(A|B_i) P(B_i).$
In the law of total probability, the event $A$ can be interpreted as an observation (or a consequence), and the $B_i$ can be interpreted as a collection of mutually exclusive events which can possibly cause the observation $A$. Figure 1.1 illustrates the relation between $A$ and the $B_i$.
Figure 1.1. An illustration of the relation between $A$ and the partition $\{B_i\}$.
Let $\Omega$ be the set of all students in a class, $B_1$ be the set of all male students, and $B_2$ be the set of all female students. Obviously, $\{B_1, B_2\}$ is a partition of $\Omega$. Let $A$ be the set of students who wear dresses. Then the law of total probability says that
$P(A) = P(A|B_1) P(B_1) + P(A|B_2) P(B_2).$
For any events $A$ and $B$ such that $P(A), P(B) > 0$,
$P(B|A) = \frac{P(A|B) P(B)}{P(A)}.$
Proof: By definition, we have
$P(A|B) = \frac{P(A \cap B)}{P(B)},$
or
$P(A \cap B) = P(A|B) P(B).$
Exchanging the roles of $A$ and $B$ above, we obtain
$P(B \cap A) = P(B|A) P(A).$
Since $A \cap B = B \cap A$ (and hence $P(A \cap B) = P(B \cap A)$), we have
$P(B|A) P(A) = P(A|B) P(B),$
or
$P(B|A) = \frac{P(A|B) P(B)}{P(A)}.$
Let $\{B_1, B_2, \ldots, B_n\}$ be a partition of $\Omega$. Then for any event $A$ with $P(A) > 0$,
$P(B_j|A) = \frac{P(A|B_j) P(B_j)}{\sum_{i=1}^{n} P(A|B_i) P(B_i)}.$
Proof: This follows immediately from the preceding two theorems.
The formula in Corollary 1.23 expresses $P(B_j|A)$ in terms of the $P(A|B_i)$ and the $P(B_i)$.
The probability that a student is lazy (L) is 0.2. The probability that a student fails (F) in a certain examination is 0.8 given that (s)he is lazy, while the probability that a student fails in the examination is 0.15 given that (s)he is not lazy. That is,
$P(L) = 0.2$, $P(F|L) = 0.8$, and $P(F|L^c) = 0.15.$
We are interested in the probabilities $P(L|F)$, $P(L^c|F)$, $P(L|F^c)$, and $P(L^c|F^c)$. We proceed to determine $P(L|F)$. By Corollary 1.23, we have
$P(L|F) = \frac{P(F|L) P(L)}{P(F|L) P(L) + P(F|L^c) P(L^c)} = \frac{0.8 \times 0.2}{0.8 \times 0.2 + 0.15 \times 0.8} = \frac{0.16}{0.28} = \frac{4}{7}.$
By the remark at the end of Section 1.5, we have
$P(L^c|F) = 1 - P(L|F) = \frac{3}{7}.$
The probabilities $P(L|F^c)$ and $P(L^c|F^c)$ can be determined likewise. As an exercise, the reader should identify the sets $A$ and $B_i$ in Corollary 1.23.
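The numbers in this example can be checked mechanically. The sketch below applies the law of total probability and then Corollary 1.23 with exact rational arithmetic:

```python
from fractions import Fraction

P_L = Fraction(2, 10)              # P(lazy), from the example
P_F_given_L = Fraction(8, 10)      # P(fail | lazy)
P_F_given_Lc = Fraction(15, 100)   # P(fail | not lazy)

# Law of total probability: P(F) = P(F|L)P(L) + P(F|L^c)P(L^c)
P_F = P_F_given_L * P_L + P_F_given_Lc * (1 - P_L)

# Bayes theorem (Corollary 1.23): P(L|F) = P(F|L)P(L) / P(F)
P_L_given_F = P_F_given_L * P_L / P_F

assert P_F == Fraction(28, 100)            # = 0.28
assert P_L_given_F == Fraction(4, 7)
assert 1 - P_L_given_F == Fraction(3, 7)   # P(L^c | F)
```

Here the partition is $\{L, L^c\}$ and the observation is $F$, which is exactly the identification the exercise asks for.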
1.6 INDEPENDENT EVENTS
Let $A$ and $B$ be two events, and assume for the time being that $P(A), P(B) > 0$. The events $A$ and $B$ are independent if the probability of one event is not changed by knowing the other event. Specifically,
$P(A|B) = P(A)$   (1.1)
and
$P(B|A) = P(B).$   (1.2)
It is easy to see that both (1.1) and (1.2) are equivalent to
$P(A \cap B) = P(A) P(B).$
The above relation will be used as the definition for independence. In fact, it is more general than either (1.1) or (1.2) because it avoids the problem of division by zero when $P(A)$ or $P(B)$ is zero.
Two events $A$ and $B$ are independent if
$P(A \cap B) = P(A) P(B).$
The reader should carefully distinguish between two events being disjoint and two events being independent. The latter has to do with the probability measure while the former does not.
For three events $A$, $B$, and $C$, we have two notions of independence.
Three events $A$, $B$, and $C$ are pairwise independent if
$P(A \cap B) = P(A) P(B),$   (1.3)
$P(A \cap C) = P(A) P(C),$   (1.4)
$P(B \cap C) = P(B) P(C).$   (1.5)
Three events $A$, $B$, and $C$ are mutually independent if (1.3)-(1.5) are satisfied and also
$P(A \cap B \cap C) = P(A) P(B) P(C).$
Clearly, if events $A$, $B$, and $C$ are mutually independent, then they are also pairwise independent. However, the converse is not true, as is shown in the next example.
Consider the probability measure shown in Figure 1.2. First, from the region probabilities in the figure we can compute $P(A)$.
[Venn diagram of events $A$, $B$, and $C$, with three regions of probability 0.09 and three regions of probability 0.12 as shown.]
Figure 1.2. The probability measure for Example 1.28.
By symmetry, we also have the corresponding values for $P(B)$ and $P(C)$. Now, checking the pairwise intersections against the products of the individual probabilities, it is readily seen that events $A$, $B$, and $C$ are pairwise independent. However, since
$P(A \cap B \cap C) \ne P(A) P(B) P(C),$
events $A$, $B$, and $C$ are not mutually independent.
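The region values in Figure 1.2 make this concrete. Assuming (our reconstruction of the figure, not stated explicitly in the surviving text) that each region lying in exactly one of $A$, $B$, $C$ has probability 0.09 and each region lying in exactly two has probability 0.12, total probability 1 forces the triple intersection to carry 0.37, and the claims can be verified exactly:

```python
from fractions import Fraction
from itertools import product

F = Fraction
# One weight per atom of the Venn diagram, keyed by (in A?, in B?, in C?).
# Assumed reading of Figure 1.2: 0.09 for single-set regions, 0.12 for
# two-set regions; 0.37 for the centre is then forced by normalization.
weight = {0: F(0), 1: F(9, 100), 2: F(12, 100), 3: F(37, 100)}
atoms = {t: weight[sum(t)] for t in product([0, 1], repeat=3)}
assert sum(atoms.values()) == 1

def P(pred):
    return sum(w for atom, w in atoms.items() if pred(atom))

PA = P(lambda t: t[0] == 1)                      # = P(B) = P(C) by symmetry
PAB = P(lambda t: t[0] == 1 and t[1] == 1)
PABC = P(lambda t: t == (1, 1, 1))

assert PA == F(7, 10)        # 0.09 + 2(0.12) + 0.37
assert PAB == PA * PA        # 0.49: pairwise independent
assert PABC != PA ** 3       # 0.37 vs 0.343: not mutually independent
```

The same check applies to the pairs $(A, C)$ and $(B, C)$ by the symmetry of the weights.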
Chapter 2
RANDOM VARIABLES
As we have discussed, a random variable $X$ is a function of $\omega$, the outcome of the random experiment in discourse. Formally, $X : \Omega \to \mathbb{R}$, where $\mathbb{R}$ denotes the set of real numbers. The alphabet set of $X$, denoted by $\mathcal{X}$, is a subset of $\mathbb{R}$ containing all possible values taken by $X$. If $\mathcal{X}$ is a finite set, we denote its cardinality (size) by $|\mathcal{X}|$.
Let $\Omega = \{1, 2, 3, 4, 5, 6\}$, with

ω        1    2    3    4    5    6
P({ω})   0.1  0.2  0.1  0.1  0.3  0.2

as in Example 1.15. Define random variables $Y$ and $Z$ as follows.

ω      1  2  3  4  5  6
Y(ω)   0  1  0  2  4  3
Z(ω)   0  0  0  1  1  1

We can let $\mathcal{Y} = \{0, 1, 2, 3, 4\}$, with $|\mathcal{Y}| = 5$. Similarly, we can let $\mathcal{Z} = \{0, 1\}$.
Any function $X : \Omega \to \mathbb{R}$ defines a random variable.
Suppose we choose a time in a day at random. To model this, we let
$\Omega = [0, 24),$
the time of the day measured in hours. Then the two functions HR and MIN defined by
$\text{HR}(\omega) = \lfloor \omega \rfloor$
and
$\text{MIN}(\omega) = \lfloor 60 (\omega - \lfloor \omega \rfloor) \rfloor$
are random variables representing respectively the hour and minute of the random time chosen. Likewise, the function
$\text{AM}(\omega) = 1$ if $\omega < 12$, and $0$ otherwise,
is the random variable denoting "morning".
Let $\Omega$ be the set of all possible states of the body of a person chosen at random. Then the height, the weight, the body temperature, the blood pressure, the eye color, etc., of the chosen person are all random variables and they are functions of $\omega$. One can think of a random variable as a "measurement" of $\omega$.
One can think of a random variable $X$ as described by a piece of wire extending from $-\infty$ to $+\infty$ with a certain weight distribution, and the probability that $X$ takes a value in a set $A$ is the weight of the set $A$. So the problem is how to characterize a weight distribution on a piece of wire. In the rest of the chapter, we will see various such characterizations.
2.1 PROBABILITY MASS FUNCTION
A random variable $X$ is called discrete if the set of all values taken by $X$ is discrete (finite or countably infinite). Such a random variable is characterized by a probability mass function (pmf), which gives the probability of occurrence of each possible value of $X$. When there is no ambiguity, we will write $P(X = x)$ as $p_X(x)$ or simply $p(x)$. For example, in the last example, we have
$P(Y = 1) = P(\{\omega \in \Omega : Y(\omega) = 1\}) = P(\{2\}) = 0.2.$
We usually do not go for such a level of formality, but the above shows what
$P(Y = 1)$ exactly means.
A pmf $p(x)$ satisfies the following two properties:
1. $p(x) \ge 0$ for all $x \in \mathcal{X}$;
2. $\sum_{x \in \mathcal{X}} p(x) = 1.$
Proof The proof is left as an exercise.
The binomial distribution with parameters $n$ and $p$ takes integer values between 0 and $n$, with
$p(k) = \binom{n}{k} p^k (1-p)^{n-k}$
for $0 \le k \le n$, and $p(k) = 0$ otherwise, where $n \ge 1$ and $0 \le p \le 1$. It is obvious that $p(k) \ge 0$. Moreover,
$\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = (p + (1-p))^n = 1,$
where we have invoked the binomial formula
$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}.$
It can be shown that $p(k)$ peaks at $k = \lfloor (n+1)p \rfloor$.
The Poisson distribution with parameter $\lambda > 0$ is defined by
$p(k) = e^{-\lambda} \dfrac{\lambda^k}{k!}$ if $k = 0, 1, 2, \ldots$, and $p(k) = 0$ otherwise.
Then
$\sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{-\lambda} e^{\lambda} = 1.$
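Both normalizations, and the location of the binomial peak, are easy to confirm numerically (the parameter values below are arbitrary illustrative choices):

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

# The binomial pmf sums to 1, by the binomial formula.
n, p = 10, 0.3
assert abs(sum(binomial_pmf(k, n, p) for k in range(n + 1)) - 1) < 1e-12

# The Poisson pmf sums to 1 (truncating the infinite series).
lam = 2.5
assert abs(sum(poisson_pmf(k, lam) for k in range(100)) - 1) < 1e-12

# The binomial pmf peaks at floor((n+1)p).
peak = max(range(n + 1), key=lambda k: binomial_pmf(k, n, p))
assert peak == math.floor((n + 1) * p)
```

Truncating the Poisson series at $k = 100$ is harmless here since the tail decays factorially.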
2.2 CUMULATIVE DISTRIBUTION FUNCTION
We have already seen the pmf characterization of a discrete random variable. In this section, we introduce a more powerful characterization called the cumulative distribution function (CDF), which can characterize any random variable. The CDF of a random variable $X$, denoted by $F_X(x)$, is defined by
$F_X(x) = P(X \le x)$
for all $x \in \mathbb{R}$. $F_X(x)$ will be written as $F(x)$ when there is no ambiguity.
A CDF $F(x)$ satisfies the following properties:
1. $F(x)$ is non-decreasing and right-continuous;
2. $\lim_{x \to -\infty} F(x) = 0$;
3. $\lim_{x \to +\infty} F(x) = 1$.
Proof: The first property results from the definition of $F(x)$. To prove the second property, we consider
$\lim_{x \to -\infty} F(x) = \lim_{x \to -\infty} P(X^{-1}[(-\infty, x]]) = P(\emptyset) = 0.$
Here, we have used $X^{-1}[A]$ to denote the inverse image of a subset $A$ of $\mathbb{R}$ under the mapping $X$, defined as the set
$X^{-1}[A] = \{\omega \in \Omega : X(\omega) \in A\}.$
To prove the third property, we consider
$\lim_{x \to +\infty} F(x) = \lim_{x \to +\infty} P(X^{-1}[(-\infty, x]]) = P(\Omega) = 1.$
Basically, $F(x)$ gives the weight of the left part of the piece of wire in discussion, up to and including the point $x$. From $F(x)$, for any interval $(a, b]$, we can determine
$P(X \in (a, b]) = F(b) - F(a),$
Figure 2.1. The CDF of an exponential distribution.
and hence the probability that $X$ takes a value in any union of intervals by A3. Thus the CDF completely characterizes the weight distribution of the piece of wire, and hence the random variable $X$!
If $X$ is a discrete random variable taking integer values, then
$F(x) = \sum_{k \le x} p(k)$
and
$p(k) = F(k) - F(k-1).$
Thus the pmf and the CDF completely specify each other.
According to the CDF, a random variable can be classified as follows:
1. Discrete - if the CDF increases only at discrete jumps;
2. Continuous - if the CDF increases only continuously (it has no jumps);
3. Mixed - otherwise.
For an exponential distribution with parameter $\lambda > 0$, denoted by $\text{Exp}(\lambda)$,
$F(x) = 1 - e^{-\lambda x}$ if $x \ge 0$, and $F(x) = 0$ otherwise.
Figure 2.1 is a sketch of the CDF of an exponential distribution.
For the uniform distribution on $[0, 1]$,
$F(x) = 0$ if $x < 0$; $F(x) = x$ if $0 \le x \le 1$; $F(x) = 1$ if $x > 1$.
Figure 2.2. The CDF of the uniform distribution on $[0, 1]$.
[CDF with marked values $1/3$ and $2/3$ and a jump at $x = 1$.]
Figure 2.3. The CDF of the mixed distribution in Example 2.3.
Figure 2.2 is a sketch of the CDF of this distribution.
For the distribution uniform on $[0, 1)$ with a point probability mass at $x = 1$, the CDF is given by
$F(x) = 0$ if $x < 0$; $F(x) = x/3$ if $0 \le x < 1$; $F(x) = 1$ if $x \ge 1$.
Figure 2.3 is a sketch of the CDF of this distribution.
2.3 PROBABILITY DENSITY FUNCTION
In the last section, we have seen how we can characterize the weight distribution of a piece of wire by the CDF. In this section, we will see how we can characterize the same thing by means of a "density".
Suppose we say that the density of the piece of wire in a neighborhood of $x$ is equal to $f(x)$, with the unit kg/m. What it means is that for a segment of wire around $x$ with length $dx$, the weight is approximately $f(x)\,dx$. Let $f(x)$ denote the density at $x$ for a piece of wire whose weight distribution is described by a CDF $F(x)$. Then what we want for $f(x)$ is that
$F(x) = \int_{-\infty}^{x} f(t)\,dt$   (2.1)
for all $x$ (this is exactly what density means). Assuming that $F(x)$ is differentiable (with respect to $x$) and using the fundamental theorem of calculus1, by differentiating both sides of the above, we have
$\frac{d}{dx} F(x) = f(x),$
giving
$f(x) = F'(x).$   (2.2)
The function $f(x)$ is called the probability density function (pdf).
If $F(x)$ is not differentiable at a certain $x$, then we cannot define $f(x)$ by (2.2). This can happen if $F(x)$ does not change smoothly at $x$, for example, at $x = 1$ in Figure 2.2. However, a careful examination of (2.1) reveals that the density at any isolated point, as long as it is finite, does not affect the CDF $F(x)$. Therefore, in such situations, we can simply set $f(x)$ to be any finite value.
In fact, if $f(x)$ is finite, then $P(X = x) = 0$. This can be seen as follows. Let $I$ be an interval of width $\Delta$ such that $x \in I$. Then
$P(X \in I) \approx f(x)\,\Delta.$
1The fundamental theorem of calculus states that
$\frac{d}{dx} \int_{a}^{x} f(t)\,dt = f(x).$
This can be seen by considering
$\frac{1}{\Delta}\left[\int_{a}^{x+\Delta} f(t)\,dt - \int_{a}^{x} f(t)\,dt\right] \approx f(x)$
if $f$ is sufficiently smooth at $x$.
Figure 2.4. The pdf of an exponential distribution.
Letting $\Delta \to 0$, we see that $P(X = x) = 0$. However, although the event $\{X = x\}$ has zero probability measure, it is not impossible.
Suppose we throw a dart at a target and the point where the dart lands is distributed uniformly on the target. Then the probability that the dart lands on any particular point is zero, but it is not impossible for the dart to land on that point.
For an exponential distribution with parameter $\lambda$,
$f(x) = \lambda e^{-\lambda x}$ if $x \ge 0$, and $f(x) = 0$ otherwise.
Figure 2.4 is a sketch of the pdf of an exponential distribution.
For the uniform distribution on $[0, 1]$,
$f(x) = 1$ if $0 \le x \le 1$, and $f(x) = 0$ otherwise.
Figure 2.5 is a sketch of the pdf of this distribution.
However, we do run into a serious technical difficulty in defining the density function if $F(x)$ has a discrete jump at some $x_0$, i.e., there is a point mass at $x_0$. The problem we face is how to define $f(x)$ in view of (2.1) so that $F(x)$ changes abruptly as $x$ changes from $x_0 - \epsilon$ to $x_0 + \epsilon$ for small $\epsilon$. In fact, no such function exists. Instead, we introduce the notion of a Dirac delta function2. A delta function is called a generalized function, which is not a true
2Named after the great physicist Paul Dirac.
Figure 2.5. The pdf of the uniform distribution on $[0, 1]$.
function. Generalized function theory is a very deep theory, but it suffices for us to understand what a delta function means.
The best way to think of a delta function, denoted by $\delta(x)$, is that it is a function which takes the value 0 everywhere except around $x = 0$. Around $x = 0$, the function is a very sharp peak such that the total area under $\delta(x)$ is equal to 1. That is,
$\int_{-\infty}^{\infty} \delta(x)\,dx = 1.$
By means of the delta function, we will be able to describe a CDF with discrete jumps. This will be illustrated in the next example.
For the distribution uniform on $[0, 1)$ with a point probability mass at $x = 1$, the pdf is sketched in Figure 2.6. Here, $\frac{2}{3}\delta(x - 1)$ denotes a delta function scaled to two-thirds of its area and relocated at $x = 1$.
A pdf $f(x)$ satisfies the following properties:
1. $f(x) \ge 0$;
2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
Proof: First, $f(x)$ is nonnegative because $F(x)$ is non-decreasing. Second,
$\int_{-\infty}^{\infty} f(x)\,dx = \lim_{x \to \infty} F(x) = 1.$
[pdf with constant density $1/3$ on $[0, 1]$ and an impulse $\frac{2}{3}\delta(x - 1)$ at $x = 1$.]
Figure 2.6. The pdf of the mixed distribution in Example 2.3.
2.4 FUNCTION OF A RANDOM VARIABLE
Suppose we have a random variable $X$ specified by a given distribution (pdf or CDF), and we pass $X$ through a function $g$ to obtain a random variable $Y = g(X)$. What is the distribution of $Y$? Note that $Y$, as a random variable, is also a function of $\omega$ because $Y(\omega) = g(X(\omega))$.
This problem is rather trivial if both $X$ and $Y$ are discrete. To be specific, for each $y$, $P(Y = y)$ can be determined by
$P(Y = y) = \sum_{x \in g^{-1}[\{y\}]} P(X = x),$
where $g^{-1}[\{y\}]$ is the inverse image of $y$ under the mapping $g$. For continuous and mixed random variables, the idea is similar but the technicality is a little bit more involved.
Let us consider the case that both $X$ and $Y$ are continuous, and we will determine $f_Y$ in terms of $f_X$. To start with, we further assume that $g$ is monotone so that the inverse function $g^{-1}$ is defined. Toward this end, in light of Figure 2.7, consider
$P(x < X \le x + dx) = P(y < Y \le y + dy),$
where $y = g(x)$, or $x = g^{-1}(y)$. Since
$P(x < X \le x + dx) \approx f_X(x)\,dx$
and
$P(y < Y \le y + dy) \approx f_Y(y)\,dy,$
we have
$f_Y(y)\,dy \approx f_X(x)\,dx,$
Figure 2.7. A monotone function $g(x)$.
and in the limit, we have
$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{dx}{dy} \right| = \frac{f_X(g^{-1}(y))}{|g'(g^{-1}(y))|}.$
Let , where. Then
and
Therefore,
We give an alternative solution to the problem in the lastexample. We consider two cases for . First, for , we have
Then(2.3)
26
For , we have
Upon differentiating with respect to , we obtain
(2.4)
Combining (2.3) and (2.4) for the two cases, we have
which is the same as what we have obtained before.
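The specific function $g$ used in the example above did not survive extraction. As a stand-in illustration of both methods, take $X \sim \text{Uniform}(0, 1)$ and the monotone function $Y = g(X) = -\frac{1}{\lambda}\ln X$ (our choice); the CDF method and the pdf formula then both give an $\text{Exp}(\lambda)$ distribution, which we can check numerically:

```python
import math

lam = 2.0   # illustrative rate parameter (our assumption)

# X ~ Uniform(0,1); Y = g(X) = -(1/lam) ln X, a monotone decreasing g
# with inverse g^{-1}(y) = exp(-lam*y).
def F_Y(y):
    """CDF method: P(Y <= y) = P(X >= exp(-lam*y)) = 1 - exp(-lam*y), y >= 0."""
    return 1.0 - math.exp(-lam * y) if y >= 0 else 0.0

def f_Y(y):
    """pdf method: f_Y(y) = f_X(g^{-1}(y)) |dx/dy| = lam * exp(-lam*y), y >= 0."""
    return lam * math.exp(-lam * y) if y >= 0 else 0.0

# The two methods agree: the numerical derivative of F_Y matches f_Y.
for y in [0.1, 0.5, 1.0, 2.0]:
    h = 1e-6
    deriv = (F_Y(y + h) - F_Y(y - h)) / (2 * h)
    assert abs(deriv - f_Y(y)) < 1e-5
```

This particular $g$ is of independent interest: it turns a uniform random variable into an exponential one, which is the standard inverse-transform way of sampling the exponential distribution.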
We now turn to the case that $g$ is not monotone, i.e., $g(x) = y$ has possibly more than one solution $x$ for some $y$. We will discuss the case that for a particular $y$, there exist exactly two values $x_1$ and $x_2$ such that $g(x_1) = g(x_2) = y$ (cf. Figure 2.8). The general case is a straightforward extension. From Figure 2.8, we see that
$P(y < Y \le y + dy) = P(x_1 < X \le x_1 + dx_1) + P(x_2 + dx_2 < X \le x_2).$
Then, following exactly the argument as we used for the case that $g$ is monotone, we obtain
$f_Y(y) = \frac{f_X(x_1)}{|g'(x_1)|} + \frac{f_X(x_2)}{|g'(x_2)|}.$
Let $Y = g(X) = X^2$. Now values of $Y$ less than $0$ are impossible, so that $f_Y(y) = 0$ for $y < 0$. For $y > 0$, the equation $x^2 = y$ has the two solutions $x_1 = -\sqrt{y}$ and $x_2 = \sqrt{y}$.
Figure 2.8. A function $g(x)$ with two inverses $x_1$ and $x_2$ at $y$.
Since $g'(x) = 2x$, we have
$f_Y(y) = \frac{f_X(-\sqrt{y}) + f_X(\sqrt{y})}{2\sqrt{y}}.$
2.5 EXPECTATION
The expectation of $X$, written as $E[X]$ or simply $EX$, is defined as
$EX = \sum_{x \in \mathcal{X}} x\, p_X(x)$
for $X$ discrete, and
$EX = \int_{-\infty}^{\infty} x f_X(x)\,dx$
for $X$ continuous. $EX$ is sometimes written as $\bar{X}$.
The expectation of a random variable $X$ is the value taken by $X$ on the average. It is also called the mean, and it can be regarded as the center of mass of the piece of wire describing the distribution of $X$. Specifically, the location of the center of mass of the piece of wire is given by
$\frac{\int x f(x)\,dx}{\int f(x)\,dx} = \int x f(x)\,dx = EX.$
For any constant $c$, $Ec = c$.
Proof: This can be proved by considering
$Ec = \int_{-\infty}^{\infty} c f_X(x)\,dx = c \int_{-\infty}^{\infty} f_X(x)\,dx = c.$
If $X \ge 0$ with probability 1, then $EX \ge 0$.
Proof: This can be proved by considering
$EX = \int_{0}^{\infty} x f_X(x)\,dx \ge 0.$
Let $X \sim \text{Binomial}(n, p)$. Then
$EX = \sum_{k=0}^{n} k \binom{n}{k} p^k (1-p)^{n-k}.$
This summation can be evaluated explicitly by considering the binomial formula:
$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}.$
Here we fix $b$ and regard each side as a function of $a$, so that the left hand side and the right hand side are identical functions of $a$. Differentiating with respect to $a$, we have
$n (a + b)^{n-1} = \sum_{k=0}^{n} k \binom{n}{k} a^{k-1} b^{n-k},$
and multiplying by $a$ gives
$n a (a + b)^{n-1} = \sum_{k=0}^{n} k \binom{n}{k} a^{k} b^{n-k}.$
Then let $a = p$ and $b = 1 - p$ to obtain
$EX = np(p + 1 - p)^{n-1} = np.$
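The closed form $EX = np$ can be confirmed by evaluating the defining sum directly (parameter values are arbitrary):

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def mean(n, p):
    """E[X] computed term-by-term from the definition of expectation."""
    return sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

# The sum agrees with the closed form np for several parameter choices.
for n, p in [(1, 0.5), (10, 0.3), (25, 0.9)]:
    assert abs(mean(n, p) - n * p) < 1e-10
```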
Let . Then
Let and . Then and . Therefore,
In evaluating , we use L’Hospital’s Rule as follows.
The expectation of a function of a random variable is defined in the natural way.
The expectation of a function $g$ of a random variable $X$ is defined by
$Eg(X) = \sum_{x \in \mathcal{X}} g(x)\, p_X(x)$
for $X$ discrete, and
$Eg(X) = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx$
for $X$ continuous.
2.6 MOMENTS
In Definition 2.25, by letting $g(x) = x^n$ and $g(x) = (x - EX)^n$, we obtain the $n$th moment and the $n$th central moment of $X$, respectively, where $n \ge 1$. The $n$th moment and the $n$th central moment are denoted by $m_n$ and $\mu_n$, respectively, i.e.,
$m_n = E[X^n]$ and $\mu_n = E[(X - EX)^n].$
(Note that in the definition of $\mu_n$, $EX$ is just a constant.) The first moment, i.e., $m_1$, is called the mean (or expectation), while the second central moment, i.e., $\mu_2$, is called the variance. The variance of $X$ is also denoted by $\mathrm{var}(X)$.
$\mu_1 = E(X - EX) = 0.$
Proof: First, regarding the distribution of $X$ as represented by a piece of wire, the quantity $(x - EX) f(x)\,dx$ is the "moment" (as in mechanics) about the center of mass (at position $EX$) due to the weight $f(x)\,dx$ at position $x$. So this proposition says that for any weight distribution, the total moment about the center of mass is equal to 0.
For $X$ continuous, it suffices to consider
$E(X - EX) = \int_{-\infty}^{\infty} (x - EX) f(x)\,dx = \int x f(x)\,dx - EX \int f(x)\,dx = EX - EX = 0.$
The proof for $X$ discrete is similar.
For each function $g$, $Eg(X)$ gives partial information about the distribution of $X$. For example, $EX$ gives the average value taken by $X$. Likewise, each $m_n$ and $\mu_n$ gives some partial information about the distribution of $X$. However, we point out that under many situations, the distribution of $X$ is completely specified by the set of all moments of $X$, i.e., the set of all moments of $X$ gives complete information about the distribution of $X$. It can also be shown that the moments are completely specified by the central moments together with $EX$, and vice versa. The following theorem is a special case of this fact.
$\mathrm{var}(X) = E[X^2] - (EX)^2.$
Proof: The theorem is proved by considering
$E(X - EX)^2 = E[X^2 - 2(EX)X + (EX)^2] = E[X^2] - 2(EX)(EX) + (EX)^2 = E[X^2] - (EX)^2.$
$\mathrm{var}(X) \ge 0$, with equality if and only if $X = EX$ with probability 1.
Proof: Since $(x - EX)^2 \ge 0$ for all $x$,
$\mathrm{var}(X) = \int_{-\infty}^{\infty} (x - EX)^2 f(x)\,dx \ge 0.$   (2.5)
Therefore, $E[X^2] \ge (EX)^2$.
We now show that the above inequality is tight if and only if $X = EX$ with probability 1. If $X = EX$ with probability 1, then obviously $\mathrm{var}(X) = 0$. The proof of the converse, however, is much more technical and is deferred to Appendix 2.A.
The Gaussian distribution $N(\mu, \sigma^2)$ is a continuous distribution defined by the pdf
$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2 / 2\sigma^2}$
for $-\infty < x < \infty$. It is one of the most important distributions in probability theory. It will be discussed in depth when we discuss the central limit theorem. It can be shown that
$\int_{-\infty}^{\infty} f(x)\,dx = 1.$
Furthermore, it can be shown that the mean and variance are respectively $\mu$ and $\sigma^2$.
2.7 JENSEN'S INEQUALITY
Let $\{x_1, x_2, \ldots, x_n\}$ be a set of vectors and $\{a_1, a_2, \ldots, a_n\}$ be a set of real numbers. Then
$\sum_{i=1}^{n} a_i x_i$
is called a linear combination of $x_1, \ldots, x_n$. If in addition $\{a_i\}$ satisfies
1. $a_i \ge 0$ for all $i$, and
2. $\sum_{i=1}^{n} a_i = 1$,
then $\sum_i a_i x_i$ is called a convex combination of $x_1, \ldots, x_n$.
Figure 2.9. Linear combinations and convex combinations.
Remark: If $\{a_i\}$ satisfies $a_i \ge 0$ for all $i$ and $\sum_i a_i = 1$, then $\{a_i\}$ is a probability mass function. In this course, the vectors $x_i$ are usually scalars. In particular,
$EX = \sum_{x \in \mathcal{X}} x\, p_X(x)$
is a convex combination of all the elements of $\mathcal{X}$.
Figure 2.9(a) is an illustration of a linear (convex) combination of two vectors $x_1$ and $x_2$, and Figure 2.9(b) is an illustration of the set of all convex combinations of the four vectors $x_1, x_2, x_3$, and $x_4$.
A set $S$ is convex if for any $x_1, x_2 \in S$,
$\lambda x_1 + (1 - \lambda) x_2 \in S$
for any $0 \le \lambda \le 1$.
A function $g$ defined on a convex set $S$ is called convex if for any $x_1, x_2 \in S$ and any $0 \le \lambda \le 1$,
$g(\lambda x_1 + (1 - \lambda) x_2) \le \lambda g(x_1) + (1 - \lambda) g(x_2).$   (2.6)
A function $g$ is concave if and only if $-g$ is convex.
In the definition of a convex function, it is crucial that the set $S$ is convex because otherwise, for $x_1, x_2 \in S$ and $0 \le \lambda \le 1$, $\lambda x_1 + (1 - \lambda) x_2$ may not
Figure 2.10. A convex function.
be in $S$. If so, $g$ is not defined at $\lambda x_1 + (1 - \lambda) x_2$ and (2.6) cannot even be stated.
Figure 2.10 is an illustration of a convex function when $S$ is some convex subset of the real line. In general, $S$ is some convex subset of $\mathbb{R}^n$.
For any set $S$, its convex hull $\mathrm{con}(S)$ is defined as
$\mathrm{con}(S) = \{\lambda x_1 + (1 - \lambda) x_2 : x_1, x_2 \in S \text{ and } 0 \le \lambda \le 1\}.$
In words, $\mathrm{con}(S)$ consists of the convex combinations of any pair of vectors in $S$.
The convex hull of a set is convex. Figure 2.9(b) shows the convex hull of the set $\{x_1, x_2, x_3, x_4\}$.
Let $g$ be a convex function. Then
$Eg(X) \ge g(EX).$
Proof: For $X$ being a binary random variable taking the values $x_1$ and $x_2$, Jensen's inequality follows directly from the convexity of $g$. This can be seen by letting $\lambda = p_X(x_1)$ and $1 - \lambda = p_X(x_2)$ in (2.6). See homework for a proof for the case when $\mathcal{X}$ is finite.
Remark: The computation of $Eg(X)$ requires the knowledge of the distribution of $X$, while the computation of $g(EX)$ requires only the knowledge of $EX$. Therefore, Jensen's inequality provides a lower bound on $Eg(X)$ when only $EX$ is known.
Let $g$ be a concave function. Then $Eg(X) \le g(EX)$.
is convex. By Jensen’s inequality, .
The function $g(x) = x^2$ is convex. By Jensen's inequality, $E[X^2] \ge (EX)^2$, i.e., $\mathrm{var}(X) = E[X^2] - (EX)^2 \ge 0$. This has been obtained in Corollary 2.28.
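Jensen's inequality for $g(x) = x^2$ is easy to check on a concrete pmf. The sketch below uses the pmf from Example 1.15 (any pmf would do) and also confirms that the gap $E[X^2] - (EX)^2$ is exactly the variance:

```python
# Check E[g(X)] >= g(E[X]) for the convex g(x) = x^2,
# using the pmf of Example 1.15.
xs = [1, 2, 3, 4, 5, 6]
ps = [0.1, 0.2, 0.1, 0.1, 0.3, 0.2]

EX = sum(x * p for x, p in zip(xs, ps))          # E[X]
Eg = sum(x**2 * p for x, p in zip(xs, ps))       # E[X^2]
var = sum((x - EX)**2 * p for x, p in zip(xs, ps))

assert Eg >= EX**2                 # Jensen: E[X^2] >= (EX)^2
assert abs(Eg - EX**2 - var) < 1e-9   # the gap equals var(X)
```

Here $EX = 3.9$ and $E[X^2] = 18.1$, so the Jensen gap is $18.1 - 15.21 = 2.89 = \mathrm{var}(X)$.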
APPENDIX 2.A: PROOF OF COROLLARY 2.28
This appendix is for the more enthusiastic readers. Here, we complete the proof of Corollary 2.28 by showing that if $\mathrm{var}(X) = 0$, then $X = EX$ with probability 1. Assume that $\mathrm{var}(X) = 0$. First, observe that
Define the set
Note that
and that for , , so that . Our goal is to establish that. We first show that for any , . Consider
Thus
implying that the set defined above has probability 0. Now, since this holds only for each fixed member of the sequence of sets, we still cannot immediately conclude that $P(X \ne EX) = 0$. Toward this end, by means of the monotone convergence theorem in real analysis (beyond the scope of this course), it can be shown that
This completes the proof.
Chapter 3
MULTIVARIATE DISTRIBUTIONS
In Chapter 2, we have discussed a single random variable. In this and subsequent chapters, we will discuss a finite collection of jointly distributed random variables. In this chapter, we will mostly discuss two jointly distributed random variables, but generalizations to more than two random variables are straightforward. Emphasis will be on continuous distributions.
A random variable is a function of $\omega$. Two random variables are an ordered pair of functions of $\omega$, for example, $(X(\omega), Y(\omega))$. For one random variable, we think of its distribution as the weight distribution of a piece of wire whose total weight is 1. For two random variables, we think of their joint distribution as the weight distribution of a sheet whose total weight is 1.
As for a single random variable, we will have both CDF and pdf characterizations of the joint distribution.
3.1 JOINT CUMULATIVE DISTRIBUTION FUNCTION
Let $X$ and $Y$ be two jointly distributed continuous random variables. The joint CDF of $X$ and $Y$ is defined by
$F_{XY}(x, y) = P(X \le x, Y \le y).$
When there is no ambiguity, $F_{XY}(x, y)$ will be abbreviated as $F(x, y)$.
A joint CDF satisfies the following properties:
1. $0 \le F(x, y) \le 1$;
2. $F(-\infty, y) = 0$, $F(x, -\infty) = 0$, and $F(\infty, \infty) = 1$, for any $x$ and $y$;
3. $F(x, y)$ is right-continuous in each of $x$ and $y$;
4. For any $x_1 \le x_2$ and $y_1 \le y_2$,
$F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1) \ge 0.$
Proof Property 1 is proved by noting
Property 2 is proved by considering
That $F(x, -\infty) = 0$ and $F(\infty, \infty) = 1$ can be proved likewise. The proof for Property 3 is similar to that for the single variable case, so it is omitted. Property 4 can be proved by considering
$P(x_1 < X \le x_2,\ y_1 < Y \le y_2) = F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1) \ge 0.$   (3.1)
Letting and in Property 4, we have
which implies
where we have invoked Property 2. Similarly, we can show that for ,
Therefore, $F(x, y)$ is non-decreasing in both $x$ and $y$, but this is weaker than Property 4, as we will see in the next example.
Let
Then we see that
However,
The reader should compare Property 4 with the non-decreasing property of $F(x)$ for a single random variable. We can think of Property 4 as the "jointly non-decreasing" property.
So from (3.1), we can obtain from the joint CDF $F(x, y)$ the probability that $(X, Y)$ is within any rectangle in $\mathbb{R}^2$. In principle, we can obtain $P((X, Y) \in A)$ for essentially any subset $A$ of $\mathbb{R}^2$ by A31 through approximation and taking limits. Therefore, the joint CDF fully characterizes the joint distribution.
The marginal CDFs, namely $F_X(x)$ and $F_Y(y)$, can readily be obtained from $F_{XY}(x, y)$. Specifically,
$F_X(x) = P(X \le x) = P(X \le x,\ Y < \infty) = F_{XY}(x, \infty).$
Similarly,
$F_Y(y) = F_{XY}(\infty, y).$
3.2 JOINT PROBABILITY DENSITY FUNCTION
Our task is to find a joint pdf $f_{XY}(x, y)$ (if it exists) for a given joint CDF $F_{XY}(x, y)$ such that
$F_{XY}(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{XY}(u, v)\,du\,dv.$   (3.2)
Assume that $f_{XY}(x, y)$ exists. Then by the fundamental theorem of calculus, we have
1A3 refers to the third axiom of probability.
Therefore,
$f_{XY}(x, y) = \frac{\partial^2}{\partial x\, \partial y} F_{XY}(x, y).$   (3.3)
Now $F_{XY}(x, y)$ can be obtained from $f_{XY}(x, y)$ by (3.2), and vice versa by (3.3). Hence, both $F_{XY}(x, y)$ and $f_{XY}(x, y)$ are complete characterizations of the joint distribution.
A joint pdf $f(x, y)$ satisfies the following properties:
1. $f(x, y) \ge 0$;
2. $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$.
Proof: First, by (3.3),
$f(x, y) = \lim_{\Delta x, \Delta y \to 0} \frac{1}{\Delta x\, \Delta y} \left[ F(x + \Delta x, y + \Delta y) - F(x, y + \Delta y) - F(x + \Delta x, y) + F(x, y) \right].$
By Property 4 of a joint CDF in Theorem 3.1, we see that the expression inside the square bracket above is nonnegative. It then follows that $f(x, y) \ge 0$. Second,
$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = F(\infty, \infty) = 1$
by Property 2 in Theorem 3.1. The theorem is proved.
The marginal pdfs, i.e., $f_X(x)$ and $f_Y(y)$, can also readily be obtained from $f_{XY}(x, y)$. Specifically,
$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy.$
Similarly,
$f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dx.$   (3.4)
Remark: For discrete random variables $X$ and $Y$, the analog of (3.4) is
$p_Y(y) = \sum_{x} p_{XY}(x, y).$
3.3 INDEPENDENCE OF RANDOM VARIABLES
Two random variables $X$ and $Y$ are said to be independent if
$P(X \in A, Y \in B) = P(X \in A)\, P(Y \in B)$   (3.5)
for all subsets $A$ and $B$ of $\mathbb{R}$. It is an extremely difficult task to check this condition, as it involves uncountably many subsets of $\mathbb{R}$. Instead, we will discuss two simpler characterizations of independence of two random variables.
Two random variables $X$ and $Y$ are independent if and only if
$F_{XY}(x, y) = F_X(x)\, F_Y(y)$   (3.6)
for all $x$ and $y$.
Proof: If $X$ and $Y$ are independent, then (3.5) is satisfied for all subsets $A$ and $B$ of $\mathbb{R}$. Then for all $x$ and $y$,
$F_{XY}(x, y) = P(X \le x, Y \le y) = P(X \le x)\, P(Y \le y) = F_X(x)\, F_Y(y).$
For the converse, we will only give the proof for a special case. Assume (3.6) is satisfied for all $x$ and $y$, and suppose $A$ and $B$ are intervals, with $A = (x_1, x_2]$ and $B = (y_1, y_2]$. Then, by (3.1),
$P(X \in A, Y \in B) = F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1) = [F_X(x_2) - F_X(x_1)][F_Y(y_2) - F_Y(y_1)] = P(X \in A)\, P(Y \in B),$
i.e., (3.5) is satisfied. Using this argument, one can prove the same result by induction for $A$ and $B$ being finite unions of intervals. The details are omitted.
If the joint pdf $f_{XY}(x, y)$ for random variables $X$ and $Y$ exists, then $X$ and $Y$ are independent if and only if
$f_{XY}(x, y) = f_X(x)\, f_Y(y)$   (3.7)
for all $x$ and $y$.
Proof:
The following apply to all $x$ and $y$. Assume that $f_{XY}(x, y)$ exists, and that
$F_{XY}(x, y) = F_X(x)\, F_Y(y).$   (3.8)
Then
$f_{XY}(x, y) = \frac{\partial^2}{\partial x\, \partial y} F_{XY}(x, y) = \frac{\partial^2}{\partial x\, \partial y} F_X(x)\, F_Y(y) = f_X(x)\, f_Y(y).$
Conversely, assume (3.7) holds. Then
$F_{XY}(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_X(u)\, f_Y(v)\,du\,dv = F_X(x)\, F_Y(y).$
Therefore, (3.7) and (3.8) are equivalent. The theorem is proved.
Remark: For two discrete random variables $X$ and $Y$, the analog of Theorem 3.5 is that $X$ and $Y$ are independent if and only if
$p_{XY}(x, y) = p_X(x)\, p_Y(y)$
for all $x$ and $y$.
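For discrete random variables, the factorization test is fully mechanical. The sketch below (toy numbers of our choosing, not from the notes) builds an independent joint pmf as a product of marginals and then shows how moving mass between cells destroys independence while leaving the marginals unchanged:

```python
from fractions import Fraction

# An independent joint pmf: every cell is the product of its marginals.
px = {0: Fraction(1, 4), 1: Fraction(3, 4)}
py = {0: Fraction(1, 3), 1: Fraction(2, 3)}
joint = {(x, y): px[x] * py[y] for x in px for y in py}

def is_independent(pxy):
    """Check p(x,y) = p_X(x) p_Y(y) for every cell of a joint pmf."""
    mx = {x: sum(v for (a, b), v in pxy.items() if a == x) for x in {a for a, _ in pxy}}
    my = {y: sum(v for (a, b), v in pxy.items() if b == y) for y in {b for _, b in pxy}}
    return all(v == mx[x] * my[y] for (x, y), v in pxy.items())

assert is_independent(joint)

# Shift mass diagonally: both marginals are preserved, but the
# product form, and hence independence, is destroyed.
dep = dict(joint)
eps = Fraction(1, 12)
dep[(0, 0)] += eps; dep[(1, 1)] += eps
dep[(0, 1)] -= eps; dep[(1, 0)] -= eps
assert sum(dep.values()) == 1
assert not is_independent(dep)
```

This illustrates why equal marginals alone say nothing about independence: the joint structure is what matters.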
For three or more random variables, as with events, we distinguish between pairwise independence and mutual independence.
Random variables $X_1, X_2, \ldots, X_n$ are mutually independent if
$P(X_1 \in A_1, X_2 \in A_2, \ldots, X_n \in A_n) = \prod_{i=1}^{n} P(X_i \in A_i),$
where each $A_i$ is a subset of $\mathbb{R}$.
Random variables $X_1, X_2, \ldots, X_n$ are pairwise independent if $X_i$ and $X_j$ are independent for any $i \ne j$.
Random variables $X_1, X_2, \ldots, X_n$ are mutually independent if and only if
$F_{X_1 \cdots X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} F_{X_i}(x_i)$   (3.9)
for all $x_1, \ldots, x_n$, or equivalently,
$f_{X_1 \cdots X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)$
for all $x_1, \ldots, x_n$ if $f_{X_1 \cdots X_n}$ exists.
Proof Omitted.
If $X_1, X_2, \ldots, X_n$ are mutually independent, then they are pairwise independent.
Proof: If $X_1, X_2, \ldots, X_n$ are mutually independent, then (3.9) holds. For any fixed $i \ne j$, by setting $x_k = \infty$ for all $k \ne i, j$ in (3.9), the left hand side becomes $F_{X_i X_j}(x_i, x_j)$, while the right hand side becomes $F_{X_i}(x_i)\, F_{X_j}(x_j)$, since $F_{X_k}(\infty) = 1$ for all $k \ne i, j$. Therefore,
$F_{X_i X_j}(x_i, x_j) = F_{X_i}(x_i)\, F_{X_j}(x_j),$
or $X_i$ and $X_j$ are independent. Hence, we conclude that $X_1, X_2, \ldots, X_n$ are pairwise independent.
We note, however, that pairwise independence does not imply mutual independence.
If $X_1, X_2, \ldots, X_n$ are mutually independent, then so are $g_1(X_1), g_2(X_2), \ldots, g_n(X_n)$ for any functions $g_i$, $i = 1, 2, \ldots, n$.
Proof: Consider any subsets $B_1, B_2, \ldots, B_n$ of $\mathbb{R}$. Then
$P(g_1(X_1) \in B_1, \ldots, g_n(X_n) \in B_n) = P(X_1 \in g_1^{-1}[B_1], \ldots, X_n \in g_n^{-1}[B_n]) = \prod_{i=1}^{n} P(X_i \in g_i^{-1}[B_i]) = \prod_{i=1}^{n} P(g_i(X_i) \in B_i),$
where we have invoked the mutual independence assumption to obtain the second equality. Thus $g_1(X_1), g_2(X_2), \ldots, g_n(X_n)$ are also mutually independent.
Random variables $X_1, X_2, \ldots, X_n$ are independent and identically distributed (i.i.d.) if they are mutually independent and each $X_i$ has the same marginal distribution.
Let $X_1, X_2, \ldots, X_n$ be i.i.d. binary random variables with
$P(X_i = 1) = p$
and
$P(X_i = 0) = 1 - p.$
Then $Y = X_1 + X_2 + \cdots + X_n$ has distribution $\text{Binomial}(n, p)$, i.e.,
$P(Y = k) = \binom{n}{k} p^k (1-p)^{n-k}$
for $0 \le k \le n$.
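Because the $X_i$ are mutually independent, the pmf of the sum can be built one variable at a time by convolution, and the result matches the binomial formula exactly (parameter values below are illustrative):

```python
import math

def convolve(pmf1, pmf2):
    """pmf of the sum of two independent integer-valued random variables."""
    out = {}
    for a, pa in pmf1.items():
        for b, pb in pmf2.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out

p, n = 0.3, 8
bernoulli = {0: 1 - p, 1: p}

# Distribution of X1 + ... + Xn by repeated convolution; mutual
# independence is what justifies convolving one variable at a time.
dist = {0: 1.0}
for _ in range(n):
    dist = convolve(dist, bernoulli)

for k in range(n + 1):
    expected = math.comb(n, k) * p**k * (1 - p)**(n - k)
    assert abs(dist[k] - expected) < 1e-12
```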
3.4 CONDITIONAL DISTRIBUTION
The distribution of a random variable $X$ represents our knowledge about the outcome of $X$. Upon knowing the occurrence of a certain event or the outcome of another jointly distributed random variable, our knowledge of $X$ changes accordingly. Our new knowledge about $X$ is represented by the conditional distribution of $X$.
3.4.1 CONDITIONING ON AN EVENT
We first consider the distribution of a single random variable $X$ conditioning on an event $\{X \in A\}$, where $A$ is a subset of $\mathbb{R}$ and $P(X \in A) > 0$. Let $F_X(x)$ be
Figure 3.1. The CDF and pdf of a distribution.
the CDF of X. The CDF of X conditioning on {X ∈ A} is defined by

F_X(x | X ∈ A) = P{X ≤ x | X ∈ A}.

Then

F_X(x | X ∈ A) = P{X ≤ x, X ∈ A} / P{X ∈ A}.   (3.10)

Note that F_X(x | X ∈ A) satisfies the three properties of a CDF in Theorem 2.8. The pdf of X conditioning on {X ∈ A} is defined by

f_X(x | X ∈ A) = (d/dx) F_X(x | X ∈ A).

It is readily seen from (3.10) that

f_X(x | X ∈ A) = f_X(x) / P{X ∈ A} if x ∈ A, and 0 if x ∉ A.

Again, note that f_X(x | X ∈ A) satisfies the two properties of a pdf in Theorem 2.16.
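As a concrete numerical illustration (the exponential distribution and the interval A below are assumptions of this sketch, not the distribution in Figure 3.1), the conditional pdf f_X(x | X ∈ A) = f_X(x)/P{X ∈ A} on A indeed integrates to 1:

```python
import math

lam, a, b = 1.0, 0.5, 2.0                        # Exponential(1), A = [a, b]
f = lambda x: lam * math.exp(-lam * x)           # unconditional pdf
p_A = math.exp(-lam * a) - math.exp(-lam * b)    # P{X in A}
f_cond = lambda x: f(x) / p_A if a <= x <= b else 0.0

# Midpoint Riemann-sum check that the conditional pdf integrates to 1 over A.
n = 100000
dx = (b - a) / n
total = sum(f_cond(a + (i + 0.5) * dx) for i in range(n)) * dx
assert abs(total - 1.0) < 1e-6
```

Dividing by P{X ∈ A} is exactly the renormalization that restores the second property of a pdf after the mass outside A is discarded.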
Consider the CDF and pdf in Figure 3.1(a) and (b) for random variable X, respectively, and let A be the subset of ℝ as shown in Figure 3.1(a). Then it is readily seen that
Figure 3.2. The CDF and pdf conditioning on {X ∈ A}.
and F_X(x | X ∈ A) and f_X(x | X ∈ A) are illustrated in Figure 3.2 accordingly.
We now consider the joint distribution of two random variables X and Y conditioning on an event {(X, Y) ∈ A}, where A is a subset of ℝ² with P{(X, Y) ∈ A} > 0. The CDF of (X, Y) conditioning on {(X, Y) ∈ A} is defined by

F_XY(x, y | (X, Y) ∈ A) = P{X ≤ x, Y ≤ y | (X, Y) ∈ A}.   (3.11)

This is illustrated in Figure 3.3, where

R(x, y) = {(u, v) : u ≤ x, v ≤ y}

and {X ≤ x, Y ≤ y} = {(X, Y) ∈ R(x, y)}. Then from (3.11), we have

F_XY(x, y | (X, Y) ∈ A) = P{(X, Y) ∈ R(x, y) ∩ A} / P{(X, Y) ∈ A}.   (3.12)

Accordingly, the joint pdf of (X, Y) conditioning on {(X, Y) ∈ A} is defined as

f_XY(x, y | (X, Y) ∈ A) = ∂² F_XY(x, y | (X, Y) ∈ A) / ∂x ∂y.   (3.13)
It turns out that carrying out the above partial differentiation is considerably more complicated than the case when only one random variable is involved
Figure 3.3. The sets R(x, y) and A.
(see Appendix 3.A). So we instead will obtain f_XY(x, y | (X, Y) ∈ A) by another approach. We observe that f_XY(x, y | (X, Y) ∈ A) has to satisfy the following properties: For (x, y) ∉ A,

f_XY(x, y | (X, Y) ∈ A) = 0,

and for any small subset B of A with area ΔB,

P{(X, Y) ∈ B | (X, Y) ∈ A} ≈ f_XY(x, y | (X, Y) ∈ A) ΔB.   (3.14)

The left hand side above can be written as

P{(X, Y) ∈ B} / P{(X, Y) ∈ A} ≈ f_XY(x, y) ΔB / P{(X, Y) ∈ A}.   (3.15)

Equating (3.15) and the right hand side of (3.14), we obtain

f_XY(x, y | (X, Y) ∈ A) = f_XY(x, y) / P{(X, Y) ∈ A} for (x, y) ∈ A.

Therefore,

f_XY(x, y | (X, Y) ∈ A) = f_XY(x, y) / P{(X, Y) ∈ A} if (x, y) ∈ A, and 0 if (x, y) ∉ A.
Figure 3.4. The set A(x, ε).
Again, we note that f_XY(x, y | (X, Y) ∈ A) satisfies the two properties of a joint pdf in Theorem 3.3.
3.4.2 CONDITIONING ON ANOTHER RANDOM VARIABLE
In this subsection, we will discuss the distribution of a random variable Y conditioning on the event {X = x}, where X and Y are jointly distributed random variables. For X continuous, P{X = x} = 0. Thus we are running into the problem of conditioning on an event with zero probability.

To get around this problem, instead of conditioning on {X = x}, which has zero probability, we condition on the event {x − ε/2 < X ≤ x + ε/2}, or equivalently the event {(X, Y) ∈ A(x, ε)}, which has nonzero probability, where

A(x, ε) = {(u, v) : x − ε/2 < u ≤ x + ε/2}.

The set A(x, ε) is illustrated in Figure 3.4. Eventually, we will take the limit as ε → 0.

The joint pdf of X and Y conditioning on {(X, Y) ∈ A(x, ε)}, from the last subsection, is given by

f_XY(u, v | (X, Y) ∈ A(x, ε)) = f_XY(u, v) / P{(X, Y) ∈ A(x, ε)} if (u, v) ∈ A(x, ε), and 0 otherwise.
Figure 3.5. An illustration of f_XY(x_0, y) for a fixed x_0.
By integrating over all u, we obtain the conditional marginal pdf of Y as follows. Taking the limit as ε → 0, we define the pdf of Y conditioning on {X = x} by

f_{Y|X}(y|x) = f_XY(x, y) / f_X(x).   (3.16)

Figure 3.5 is an illustration of f_XY(x_0, y) for a fixed x_0. Thus we see from (3.16) that f_{Y|X}(y|x_0) is just the normalized version of f_XY(x_0, y) for fixed x_0. (For fixed x_0, f_X(x_0) is a constant.) Similarly, we define

f_{X|Y}(x|y) = f_XY(x, y) / f_Y(y).   (3.17)
Figure 3.6. The pdf in Example 3.15.
From (3.16) and (3.17), we have

f_{X|Y}(x|y) = f_{Y|X}(y|x) f_X(x) / f_Y(y),

which is a form of Bayes’s Theorem.
If f_{Y|X}(y|x) = f_Y(y) for all x and y, then X and Y are independent.

Proof If f_{Y|X}(y|x) = f_Y(y), then

f_XY(x, y) = f_{Y|X}(y|x) f_X(x) = f_X(x) f_Y(y).

This implies that f_XY factorizes, i.e., X and Y are independent.
Consider the joint pdf f_XY(x, y) = c for x, y ≥ 0 and x + y ≤ 1 (see Figure 3.6). We will first determine the normalizing constant c. Then we will determine the marginal pdfs f_X(x) and f_Y(y), and show that X and Y are not independent. In fact, we can tell immediately that X and Y are not independent by inspecting Figure 3.6, because the possible range of Y depends on the value taken by X. We use Property 2 of a joint pdf in Theorem 3.3 to determine c. Consider

1 = ∫_0^1 ∫_0^{1−x} c dy dx = c/2,
which implies c = 2. Now, for 0 ≤ x ≤ 1,

f_X(x) = ∫_0^{1−x} c dy = 2(1 − x).

Thus

f_X(x) = 2(1 − x) if 0 ≤ x ≤ 1, and 0 otherwise.

By symmetry, we also have

f_Y(y) = 2(1 − y) if 0 ≤ y ≤ 1, and 0 otherwise.

Then for x, y ≥ 0 with x + y ≤ 1, f_X(x) f_Y(y) = 4(1 − x)(1 − y), while f_XY(x, y) = 2. Therefore, f_XY(x, y) ≠ f_X(x) f_Y(y), and X and Y are not independent.
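A quick sketch of this example, under the assumption (consistent with Figure 3.6) that the joint pdf is constant on the triangle {x ≥ 0, y ≥ 0, x + y ≤ 1}: the normalizing constant is estimated by Monte Carlo, and the failure of the factorization is checked at one point.

```python
import random

random.seed(0)
n, hits = 200000, 0
for _ in range(n):
    # Uniform point in the unit square; count how often it lands in the triangle.
    if random.random() + random.random() <= 1:
        hits += 1
area = hits / n          # Monte Carlo estimate of the triangle's area (= 1/2)
c = 1 / area             # normalizing constant: c * area = 1, so c ≈ 2
assert abs(c - 2) < 0.05

# Independence fails: at (x, y) = (0.25, 0.25) the joint pdf is 2, but the
# product of marginals 2(1 - x) * 2(1 - y) is 2.25.
assert 2 * (1 - 0.25) * 2 * (1 - 0.25) != 2
```

The inspection argument in the text (the range of Y depends on X) is what the mismatch of the two numbers makes quantitative.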
3.5 FUNCTIONS OF RANDOM VARIABLES
Suppose we are given two random variables X and Y, and let Z = g(X, Y). We are interested in the distribution of Z. Specifically, we want to determine the distribution of Z in terms of the joint distribution of X and Y.

3.5.1 THE CDF METHOD
In this method, F_Z(z) is obtained by considering

F_Z(z) = P{Z ≤ z} = P{g(X, Y) ≤ z}.
Figure 3.7. An illustration of the set {(x, y) : x/y ≤ z} for z > 0 and z < 0.
Let Z = X/Y. Then

F_Z(z) = P{X/Y ≤ z}.

Figure 3.7 is an illustration of the set {(x, y) : x/y ≤ z}. Then

F_Z(z) = ∫∫_{x/y ≤ z} f_XY(x, y) dx dy,   (3.18)

and the pdf of Z is obtained by

f_Z(z) = (d/dz) F_Z(z).

In order to differentiate the double integrals in (3.18), we use the chain rule for differentiation. Specifically,
Similarly,
Therefore,
The next example is extremely important and should be studied very carefully.
Let Z = X + Y, where X and Y are independent. Then

F_Z(z) = P{X + Y ≤ z} = ∫_{−∞}^∞ f_X(x) F_Y(z − x) dx.

Differentiating with respect to z, we obtain

f_Z(z) = ∫_{−∞}^∞ f_X(x) f_Y(z − x) dx.

The integral above is called the convolution of f_X and f_Y, denoted by f_X * f_Y. Since X + Y = Y + X, by exchanging the roles of X and Y in the above integral, we also have

f_Z(z) = ∫_{−∞}^∞ f_Y(y) f_X(z − y) dy.

In summary, if Z is equal to X + Y where X and Y are independent, then f_Z = f_X * f_Y = f_Y * f_X.
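The convolution formula can be illustrated numerically. The choice of two Uniform(0, 1) summands below is an assumption of the sketch (not an example from the notes); their convolution is the triangular pdf on [0, 2].

```python
def f_Z(z):
    """Numerical convolution f_Z(z) = ∫ f_X(x) f_Y(z - x) dx for
    X, Y ~ Uniform(0, 1): both pdfs are 1 on (0, 1) and 0 elsewhere."""
    n, total = 10000, 0.0
    dx = 1.0 / n
    for i in range(n):
        x = (i + 0.5) * dx          # f_X(x) = 1 here since 0 < x < 1
        if 0.0 < z - x < 1.0:       # f_Y(z - x) = 1 on (0, 1)
            total += dx
    return total

assert abs(f_Z(0.5) - 0.5) < 1e-3   # rising edge: f_Z(z) = z on [0, 1]
assert abs(f_Z(1.0) - 1.0) < 1e-3   # peak at z = 1
assert abs(f_Z(1.5) - 0.5) < 1e-3   # falling edge: f_Z(z) = 2 - z on [1, 2]
```

The box-times-box integral reduces to measuring the overlap of two unit intervals, which is exactly the triangular shape.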
3.5.2 THE PDF METHOD
This method is to consider f_Z(z) dz = P{z < Z ≤ z + dz} directly. Specifically,

f_Z(z) dz = P{z < g(X, Y) ≤ z + dz}.

Our task is to express the right hand side as an expression times dz, so that f_Z(z) is obtained by cancelling dz on both sides. This may not always be easy to do, but we will illustrate how this can possibly be done by an example.
Let X and Y be i.i.d. N(0, 1), so that

f_XY(x, y) = (1/2π) e^{−(x²+y²)/2}

for all x and y. Let Z = √(X² + Y²). Then

f_Z(z) dz = P{z < √(X² + Y²) ≤ z + dz} = (1/2π) e^{−z²/2} · 2πz dz   (cf. Figure 3.8)
Figure 3.8. An illustration for the range of integration in polar coordinates.
Cancelling dz on both sides, we have

f_Z(z) = z e^{−z²/2}

for all z ≥ 0. Obviously, f_Z(z) = 0 for all z < 0.
The reader is encouraged to repeat the above example by the CDF method.
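The result can also be verified by simulation (assuming, as the polar-coordinate argument indicates, that Z = √(X² + Y²) for i.i.d. standard Gaussians, so that Z has CDF F_Z(z) = 1 − e^{−z²/2} for z ≥ 0):

```python
import math, random

random.seed(1)
n = 200000
# Draw Z = sqrt(X^2 + Y^2) with X, Y i.i.d. N(0, 1).
samples = [math.hypot(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

# Compare the empirical CDF at z = 1 with F_Z(1) = 1 - exp(-1/2) ≈ 0.3935.
z = 1.0
empirical = sum(s <= z for s in samples) / n
assert abs(empirical - (1 - math.exp(-z * z / 2))) < 0.01
```

This distribution of the radius of a two-dimensional Gaussian is commonly known as the Rayleigh distribution.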
Figure 3.9. An illustration of the function g.
3.6 TRANSFORMATION OF RANDOM VARIABLES
So far, we have discussed how we can obtain the distribution of a single random variable obtained as a function of some given random variables. In this section, we will discuss how we can obtain the joint distribution of a pair of random variables obtained as functions of a given pair of random variables.

Let Z_1 = g_1(X_1, X_2) and Z_2 = g_2(X_1, X_2) for some functions g_1 and g_2, and let (z_1, z_2) be the values taken by (Z_1, Z_2). Suppose for each (z_1, z_2), there is a unique pair (x_1, x_2) satisfying

z_1 = g_1(x_1, x_2) and z_2 = g_2(x_1, x_2).

Write g = (g_1, g_2), where

(z_1, z_2) = g(x_1, x_2).

Then g is an invertible transformation. See Figure 3.9 for an illustration of g, which takes a point in the x_1-x_2 plane to the z_1-z_2 plane. Our task is to determine f_{Z_1 Z_2}(z_1, z_2).

Consider the rectangular element in the z_1-z_2 plane in Figure 3.10. We label the four corners of this element by z^(i), i = 0, 1, 2, 3, where z^(0) = (z_1, z_2). When we trace the line from z^(0) to z^(1) in the z_1-z_2 plane, under the mapping g^{−1}, the curve between x^(0) = g^{−1}(z^(0)) and x^(1) = g^{−1}(z^(1)) is traced in the x_1-x_2 plane. The same applies when we trace the lines from z^(1) to z^(2), from z^(2) to z^(3), and from z^(3) to z^(0). When δz_1 and δz_2 are small, the region traced out in the x_1-x_2 plane can be approximated by a parallelogram. Specifically, the vectors v_1 and v_2 are given by
Figure 3.10. Two corresponding regions in the z_1-z_2 plane and the x_1-x_2 plane.
Let ΔA_z denote the area of the rectangular element in the z_1-z_2 plane, and let ΔA_x denote the area of the corresponding region in the x_1-x_2 plane. Now observe that (Z_1, Z_2) is in the rectangular element in the z_1-z_2 plane if and only if (X_1, X_2) is in the corresponding region in the x_1-x_2 plane. Equating the probability of this event in terms of (Z_1, Z_2) and (X_1, X_2), we have

f_{Z_1 Z_2}(z_1, z_2) ΔA_z ≈ f_{X_1 X_2}(x_1, x_2) ΔA_x,

so that

f_{Z_1 Z_2}(z_1, z_2) = f_{X_1 X_2}(x_1, x_2) (ΔA_x / ΔA_z),   (3.19)

where (x_1, x_2) = g^{−1}(z_1, z_2). So, we need to determine the value of ΔA_x / ΔA_z as follows.

Defining² the Jacobian

∂(x_1, x_2)/∂(z_1, z_2) = det [ ∂x_1/∂z_1  ∂x_1/∂z_2 ; ∂x_2/∂z_1  ∂x_2/∂z_2 ],   (3.20)

we have

ΔA_x / ΔA_z ≈ |∂(x_1, x_2)/∂(z_1, z_2)|.

Therefore, from (3.19), we have

f_{Z_1 Z_2}(z_1, z_2) = f_{X_1 X_2}(x_1, x_2) |∂(x_1, x_2)/∂(z_1, z_2)|.   (3.21)

Exchanging the roles of (X_1, X_2) and (Z_1, Z_2) in the above derivation and accordingly in (3.21), we obtain

f_{X_1 X_2}(x_1, x_2) = f_{Z_1 Z_2}(z_1, z_2) |∂(z_1, z_2)/∂(x_1, x_2)|.   (3.22)

Therefore, combining (3.21) and (3.22), we have

|∂(x_1, x_2)/∂(z_1, z_2)| |∂(z_1, z_2)/∂(x_1, x_2)| = 1,

which implies

|∂(x_1, x_2)/∂(z_1, z_2)| = 1 / |∂(z_1, z_2)/∂(x_1, x_2)|.   (3.23)

This is the two-dimensional generalization of the result

dx/dz = 1 / (dz/dx)

in elementary calculus.
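A numerical sanity check of (3.21) for an assumed linear map (not an example from the notes): for z_1 = x_1 + x_2, z_2 = x_1 − x_2 the inverse map has Jacobian determinant −1/2, and the formula then predicts that Z_1 and Z_2 are independent N(0, 2) when X_1, X_2 are i.i.d. N(0, 1).

```python
import random, statistics

random.seed(2)
n = 100000
z1, z2 = [], []
for _ in range(n):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    z1.append(x1 + x2)      # z_1 = x_1 + x_2
    z2.append(x1 - x2)      # z_2 = x_1 - x_2; inverse Jacobian has |det| = 1/2

# (3.21) gives f_{Z1 Z2} = (1/2)(1/2pi) exp(-((z1^2 + z2^2)/2)/2),
# i.e. independent N(0, 2) marginals with zero cross-covariance.
assert abs(statistics.variance(z1) - 2) < 0.1
assert abs(statistics.variance(z2) - 2) < 0.1
cov = sum(a * b for a, b in zip(z1, z2)) / n
assert abs(cov) < 0.05
```

Writing out x_1 = (z_1 + z_2)/2, x_2 = (z_1 − z_2)/2 and applying (3.20) directly gives the same factor of 1/2.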
Let
2It is easy to see that we can also write (3.20) as
Then
Therefore,
Following the last example, this time we want to determine f_{X_1 X_2}(x_1, x_2) in terms of f_{Z_1 Z_2}(z_1, z_2). From (3.23), we have

|∂(z_1, z_2)/∂(x_1, x_2)| = 1 / |∂(x_1, x_2)/∂(z_1, z_2)|.

Therefore,

f_{X_1 X_2}(x_1, x_2) = f_{Z_1 Z_2}(z_1, z_2) |∂(z_1, z_2)/∂(x_1, x_2)|.

Alternatively, we can write (x_1, x_2) = g^{−1}(z_1, z_2) explicitly and apply (3.20) directly to obtain ∂(x_1, x_2)/∂(z_1, z_2), but this would be considerably more difficult.
APPENDIX 3.A: DIFFERENTIATING F_XY(x, y | (X, Y) ∈ A)
From (3.12), we have

F_XY(x, y | (X, Y) ∈ A) = P{(X, Y) ∈ R(x, y) ∩ A} / P{(X, Y) ∈ A},   (3.A.1)

where

R(x, y) = {(u, v) : u ≤ x, v ≤ y}.
Following the steps toward proving Property 1 of Theorem 3.3, we have
Let
Figure 3.A.1. The set B_δ(x, y) for (a) (x, y) ∈ A and (b) (x, y) ∉ A.
We leave it as an exercise for the reader to show that the expression in the square bracket above is equal to

P{(X, Y) ∈ B_δ(x, y) ∩ A}   (3.A.2)

(cf. Figure 3.A.1). If (x, y) ∈ A, then B_δ(x, y) ⊂ A for sufficiently small δ, so that (3.A.2) is equal to

P{(X, Y) ∈ B_δ(x, y)},

and hence

f_XY(x, y | (X, Y) ∈ A) = f_XY(x, y) / P{(X, Y) ∈ A}.

Otherwise, (x, y) ∉ A, so that (3.A.2) vanishes, and

f_XY(x, y | (X, Y) ∈ A) = 0.

Therefore, from (3.A.1), we have

f_XY(x, y | (X, Y) ∈ A) = f_XY(x, y) / P{(X, Y) ∈ A} if (x, y) ∈ A, and 0 if (x, y) ∉ A.
Chapter 4

EXPECTATION AND MOMENT GENERATING FUNCTIONS
We have already discussed the expectation of a single random variable in Chapter 2. In this chapter, we will discuss expectation as an operator when more than one random variable is involved. We will also introduce moment generating functions, a fundamental tool in probability theory.

The results will be proved by assuming that the random variables are continuous. The proofs for discrete random variables are analogous.
4.1 EXPECTATION AS A LINEAR OPERATOR
We have defined the expectation of a function of a random variable in Definition 2.25. When the function involves two random variables, we have the following definition.

The expectation of a function g(X, Y) of random variables X and Y is defined by

E[g(X, Y)] = Σ_x Σ_y g(x, y) p_XY(x, y)

for X and Y discrete, and

E[g(X, Y)] = ∫∫ g(x, y) f_XY(x, y) dx dy

for X and Y continuous.
One can regard expectation as an operator which inputs a random variable and outputs the expectation of that random variable. We will establish in the following theorem that expectation is a linear operator.
Expectation is a linear operator, i.e., for any random variables X and Y and any real numbers a and b,

E[aX + bY] = aE[X] + bE[Y].   (4.1)
Proof This can be proved by considering
E[aX] = aE[X].

Proof This is a special case of Theorem 4.2 with b equal to 0.
Consider
This is an alternative proof for Theorem 2.27 with simplified notation.
Var(aX + b) = a² Var(X).
Proof This can be proved by considering
Let μ and σ² be the mean and the variance of a random variable X. Then (X − μ)/σ has zero mean and unit variance.
Proof To prove that (X − μ)/σ has zero mean, consider

E[(X − μ)/σ] = (E[X] − μ)/σ = 0.

We will prove that Var((X − μ)/σ) = 1 in two steps. First,

Var(X − μ) = Var(X),

which is intuitively clear. Second, by Proposition 4.5, we have

Var((X − μ)/σ) = Var(X)/σ² = 1.
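The standardization in Proposition 4.6 is easy to check empirically; the Gaussian samples and the parameter values below are illustrative choices, not part of the notes.

```python
import random, statistics

random.seed(3)
mu, sigma = 5.0, 2.0
xs = [random.gauss(mu, sigma) for _ in range(100000)]
zs = [(x - mu) / sigma for x in xs]     # the standardized variable (X - mu)/sigma

# Sample mean ≈ 0 and sample variance ≈ 1, up to Monte Carlo error.
assert abs(statistics.fmean(zs)) < 0.02
assert abs(statistics.variance(zs) - 1) < 0.05
```

The same two lines work for any distribution with finite mean and variance, since the proposition does not use Gaussianity.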
If X and Y are independent, then

E[XY] = E[X] E[Y].   (4.2)

Proof If X and Y are independent, then so are any functions of X and Y by Theorem 3.10. Therefore,

E[XY] = ∫∫ xy f_X(x) f_Y(y) dx dy = E[X] E[Y].

Remark The equation (4.1) is valid for all joint distributions of X and Y. By contrast, (4.2) may not be valid if X and Y are not independent.
4.2 CONDITIONAL EXPECTATION
Let X and Y be jointly distributed random variables. The expectation of Y conditioning on the event {X = x} is defined by

E[Y | X = x] = ∫ y f_{Y|X}(y|x) dy.

We note that E[Y | X = x] depends only on the value of x, and hence is a function of x. If the value of X is not specified, then

E[Y | X]

is a random variable, and more specifically, a function of the random variable X. We now prove an important theorem regarding conditional expectation.
E[E[Y | X]] = E[Y].

Proof Consider

E[E[Y | X]] = ∫ E[Y | X = x] f_X(x) dx = ∫∫ y f_{Y|X}(y|x) f_X(x) dy dx = ∫∫ y f_XY(x, y) dy dx = E[Y].

Note that in the above, the inner expectation is a conditional expectation, while the outer expectation is an expectation of a function of X.
Let X be uniformly distributed on [0, 1], and let Y be uniformly distributed on [0, x] when X = x. We now apply Theorem 4.9 to determine E[Y].
The above integral is the expectation of X/2; since the expectation of X was shown in Example 2.1 to be 1/2, E[Y] = 1/4.

Alternatively, we can consider

f_{Y|X}(y|x) = 1/x if 0 ≤ y ≤ x, and 0 otherwise,

so that

f_XY(x, y) = f_{Y|X}(y|x) f_X(x) = 1/x if 0 ≤ y ≤ x ≤ 1, and 0 otherwise.

Then

f_Y(y) = ∫_y^1 (1/x) dx = −ln y for 0 < y ≤ 1,

and

E[Y] = ∫_0^1 y (−ln y) dy.

However, evaluations of these integrals are quite difficult.
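Assuming, as above, that X ~ Uniform[0, 1] and Y | X = x ~ Uniform[0, x], Theorem 4.9 can be verified by simulation without evaluating any of those integrals:

```python
import random, statistics

random.seed(4)
n = 200000
ys = []
for _ in range(n):
    x = random.random()                 # X ~ Uniform[0, 1]
    ys.append(random.uniform(0, x))     # Y | X = x ~ Uniform[0, x]

# Tower property: E[Y] = E[E[Y|X]] = E[X/2] = 1/4.
assert abs(statistics.fmean(ys) - 0.25) < 0.01
```

The simulation simply draws from the joint distribution in two stages (first X, then Y given X), which is exactly the structure the tower property exploits.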
4.3 COVARIANCE AND SCHWARTZ’S INEQUALITY
Let X and Y be random variables, and consider X + Y. We are naturally interested in the relation between the variance of X + Y and the variances of X and Y. Toward this end, consider

Var(X + Y) = E[(X + Y − E[X + Y])²] = Var(X) + Var(Y) + 2 E[(X − E[X])(Y − E[Y])].

Define

Cov(X, Y) = E[(X − E[X])(Y − E[Y])],   (4.3)

so that we can write

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).

When X and Y are independent, by Theorem 3.10,

Cov(X, Y) = E[X − E[X]] E[Y − E[Y]] = 0,
so that

Var(X + Y) = Var(X) + Var(Y).   (4.4)

When Cov(X, Y) = 0, or equivalently

E[XY] = E[X] E[Y],

or equivalently (4.4) holds, X and Y are said to be uncorrelated. As shown, X and Y are uncorrelated if X and Y are independent. However, the converse is not true.
Consider the following joint pmf for random variablesand :
Here it is easy to see that X and Y are uncorrelated but they are not independent.
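The pmf of the example did not survive extraction, so the sketch below uses a standard instance of the same phenomenon (an assumption, not necessarily the book's pmf): X uniform on {−1, 0, 1} and Y = X², which are uncorrelated yet clearly dependent.

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}, Y = X^2; each (x, y) pair has probability 1/3.
support = [(-1, 1), (0, 0), (1, 1)]
p = Fraction(1, 3)

EX  = sum(p * x for x, _ in support)
EY  = sum(p * y for _, y in support)
EXY = sum(p * x * y for x, y in support)
assert EXY - EX * EY == 0                      # Cov(X, Y) = 0: uncorrelated

# Not independent: P{X = 1, Y = 1} = 1/3, but P{X = 1} P{Y = 1} = (1/3)(2/3).
assert p != p * Fraction(2, 3)
```

Exact rational arithmetic makes the zero covariance an identity rather than a numerical coincidence.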
Cov(X, Y) = E[XY] − E[X] E[Y].
Proof This can be proved by considering
In fact, it is readily seen from (4.3) and Proposition 4.12 that Cov(X, X) = Var(X). Thus the variance of a random variable can be regarded as the covariance between X and itself.
We now prove an inequality known as Schwartz’s inequality. This inequality gives an upper bound on the magnitude of E[XY] in terms of the second moments of X and Y.

(E[XY])² ≤ E[X²] E[Y²].
Proof Let X and Y be any fixed jointly distributed random variables. Let a be any real number, and consider the random variable (X − aY)². Since this random variable can only take nonnegative values, we have

E[(X − aY)²] ≥ 0,   (4.5)

or, expanding and rearranging the terms,

E[X²] − 2a E[XY] + a² E[Y²] ≥ 0.   (4.6)

Since X and Y are fixed random variables, E[X²], E[XY], and E[Y²] in the above are fixed real numbers. Therefore, the left hand side above is quadratic in a, and it achieves its minimum when

a = E[XY] / E[Y²].

Since (4.5) holds for all a, it holds for this a. Substituting into (4.6), we have

E[X²] − (E[XY])² / E[Y²] ≥ 0.

Hence,

(E[XY])² ≤ E[X²] E[Y²],

proving the theorem.
Schwartz’s inequality is tight, i.e.,

(E[XY])² = E[X²] E[Y²],

if and only if X is proportional to Y.

Proof From the proof of the last theorem, we see that Schwartz’s inequality is simply the inequality in (4.5) with a = E[XY]/E[Y²]. Therefore, Schwartz’s inequality is tight if and only if

E[(X − aY)²] = 0,   (4.7)

which in turn holds if and only if X − aY = 0 with probability 1, or X = aY, implying that X is proportional to Y. On the other hand, if X is proportional to Y, i.e., X = cY for some constant c, then it is readily verified that Schwartz’s inequality is tight.
Denote Var(X) and Var(Y) by σ_X² and σ_Y², respectively. Now in Schwartz’s inequality, if we replace X by X − E[X] and Y by Y − E[Y],¹ we immediately obtain

(E[(X − E[X])(Y − E[Y])])² ≤ E[(X − E[X])²] E[(Y − E[Y])²],

¹ This is valid because X and Y in Schwartz’s inequality can be any random variables.
or

Cov(X, Y)² ≤ σ_X² σ_Y²,

where we have invoked Proposition 4.12. Thus,

|Cov(X, Y)| ≤ σ_X σ_Y.   (4.8)

The ratio

ρ = Cov(X, Y) / (σ_X σ_Y),

called the coefficient of correlation, is a measure of the degree of correlation between X and Y. It is readily seen from (4.8) that

−1 ≤ ρ ≤ 1.

When ρ = 0, we say that X and Y are uncorrelated. When ρ > 0, we say that X and Y are positively correlated. When ρ < 0, we say that X and Y are negatively correlated.

When E[X] = E[Y] = 0, Cov(X, Y) is simply equal to E[XY]. When ρ > 0,

E[(X + Y)²] = E[X²] + 2E[XY] + E[Y²] > E[X²] + E[Y²].

Roughly speaking, when X is large, Y also tends to be large, so that on the average the magnitude (square) of X + Y is reinforced. When ρ < 0,

E[(X + Y)²] = E[X²] + 2E[XY] + E[Y²] < E[X²] + E[Y²].

Roughly speaking, when X is large, Y tends to be small, and when X is small, Y tends to be large, so that on the average the magnitude (square) of X + Y is suppressed.
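The behavior of ρ at its extremes and at 0 can be demonstrated numerically; the affine relations and Gaussian samples below are illustrative choices for the sketch.

```python
import math, random

random.seed(5)

def rho(xs, ys):
    """Sample coefficient of correlation Cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / math.sqrt(vx * vy)

xs = [random.gauss(0, 1) for _ in range(50000)]
noise = [random.gauss(0, 1) for _ in range(50000)]

assert abs(rho(xs, [2 * x + 3 for x in xs]) - 1) < 1e-6   # Y = aX + b, a > 0
assert abs(rho(xs, [-x for x in xs]) + 1) < 1e-6          # a < 0 gives rho = -1
assert abs(rho(xs, noise)) < 0.03                          # independent: rho ≈ 0
```

The ±1 cases are exactly the tightness condition of Schwartz's inequality applied to the centered variables.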
We have already seen that when X and Y are independent, then they are uncorrelated, or ρ = 0. When ρ = 1, Schwartz’s inequality applied to X − E[X] and Y − E[Y] is tight, so that

Y − E[Y] = a(X − E[X])

for some constant a, and a > 0 because Cov(X, Y) = a σ_X² > 0; by Proposition 4.5, σ_Y = a σ_X. When ρ = −1, the same argument gives

Y − E[Y] = a(X − E[X])

with a < 0, and σ_Y = |a| σ_X.
4.4 MOMENT GENERATING FUNCTIONS
The moments of a random variable (or of a distribution) were introduced in Section 2.6. In this section, we introduce the moment generating function of a random variable X, denoted by M_X(s), where the parameter s is a real number.

The moment generating function of a random variable X (or of the distribution of X) is defined by

M_X(s) = E[e^{sX}],   (4.9)

where s is a real number. For the discrete case,

M_X(s) = Σ_x e^{sx} p_X(x),   (4.10)

and for the continuous case,

M_X(s) = ∫ e^{sx} f_X(x) dx.   (4.11)

M_X(s) is said to exist if the summation in (4.10) or the integral in (4.11) converges for some range of s.

Note that for a fixed s, e^{sX} is a function of X. So M_X(s) is simply the expectation of this particular function of X, and its value depends on s. Therefore, M_X(s) is a function of s.
M_X(0) = 1.

Proof By letting s = 0 in (4.9), we have

M_X(0) = E[e^{0·X}] = E[1] = 1,

proving the corollary. This equality can be interpreted by letting s = 0 in (4.10) and (4.11), which yields nothing but the normalizing conditions for p_X(x) and f_X(x).
The moment generating function can be regarded as the “transform” of the distribution. In (4.10), if X is integer-valued, by letting z = e^s so that e^{sx} = z^x, the right hand side becomes

Σ_x z^x p_X(x),

which is the z-transform of the pmf p_X(x) (with x being the time index). In (4.11), by letting s = jω, M_X(jω) becomes the Fourier transform of the pdf f_X(x) (with x being the index of time). (But of course, s in M_X(s) is actually a real number.)

It can be shown that there is a one-to-one correspondence between the distribution of X and the moment generating function of X. That is, for each moment generating function, there is a unique distribution corresponding to it. In other words, the distribution of a random variable X can (in principle) be recovered from its moment generating function.
The moments of X can be obtained by successive differentiation of M_X(s) with respect to s. Specifically,

M_X′(s) = E[X e^{sX}],

and in general

M_X^{(k)}(s) = E[X^k e^{sX}],   (4.12)

where M_X^{(k)} denotes the kth derivative of M_X. Then by putting s = 0 in (4.12), we obtain

M_X^{(k)}(0) = E[X^k].

That is why M_X(s) is called the moment generating function of X.
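The identity M_X^{(k)}(0) = E[X^k] can be checked with finite differences. The sketch assumes an exponential random variable with parameter λ = 2, whose MGF is M_X(s) = λ/(λ − s) for s < λ (an assumed example, chosen because the moments are known in closed form).

```python
lam = 2.0
M = lambda s: lam / (lam - s)       # MGF of Exponential(lambda), s < lambda

h = 1e-4
M1 = (M(h) - M(-h)) / (2 * h)                 # central difference ≈ M'(0)
M2 = (M(h) - 2 * M(0) + M(-h)) / h**2          # ≈ M''(0)

assert abs(M1 - 1 / lam) < 1e-6                # E[X]   = 1/lambda = 0.5
assert abs(M2 - 2 / lam**2) < 1e-4             # E[X^2] = 2/lambda^2 = 0.5
```

Repeated differentiation of λ/(λ − s) gives M_X^{(k)}(0) = k!/λ^k, the familiar moments of the exponential distribution.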
If X and Y are independent, then M_{X+Y}(s) = M_X(s) M_Y(s).
Proof This can be proved by considering
We know from Example 3.17 that if X and Y are independent, then Z = X + Y has pdf given by f_Z = f_X * f_Y. The above theorem says that M_Z(s) = M_X(s) M_Y(s), so it is actually analogous to the convolution theorem in linear system theory.
Let X be exponential with parameter λ. Then for s < λ,

M_X(s) = ∫_0^∞ e^{sx} λ e^{−λx} dx = λ/(λ − s).
Let Y be the geometric random variable with parameter p, where 0 < p < 1, i.e.,

p_Y(k) = p(1 − p)^k for k = 0, 1, 2, ..., and 0 otherwise.

Then

M_Y(s) = Σ_{k=0}^∞ e^{sk} p(1 − p)^k = p / (1 − (1 − p)e^s),

where we assume that (1 − p)e^s < 1, or s < −ln(1 − p).
Remark: We use the convention adopted by most books in probability in defining the geometric distribution in the above example. A random variable N with

p_N(k) = p(1 − p)^{k−1} for k = 1, 2, 3, ..., and 0 otherwise

is said to have a truncated geometric distribution with parameter p. If we toss a coin with the probability of obtaining a head equal to p repeatedly, then it is easy to see that the first time a head is obtained has a truncated geometric distribution with parameter p. Note that if Y is geometric with parameter p, then Y + 1 is truncated geometric with parameter p.
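The coin-tossing interpretation is easy to simulate; the parameter p = 0.3 below is an arbitrary illustrative choice. The index of the first head should be truncated geometric, with mean 1/p.

```python
import random, statistics

random.seed(6)
p, n = 0.3, 200000

def first_head(p):
    """Toss a p-coin until the first head; return the toss index (1, 2, ...)."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

samples = [first_head(p) for _ in range(n)]
# Truncated geometric: P{N = k} = (1 - p)^(k - 1) p, with mean 1/p.
assert abs(statistics.fmean(samples) - 1 / p) < 0.05
assert abs(sum(s == 1 for s in samples) / n - p) < 0.01
```

Subtracting 1 from each sample would give the geometric convention of the earlier example, matching the shift noted in the remark.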
In the next example, we give a powerful application of the moment generating function.
Let X_1, X_2, ... be i.i.d. exponential random variables with parameter λ, let N be an independent truncated geometric random variable (independent of the X_i's) with parameter p, and let Y = X_1 + ··· + X_N. That is, Y is the sum of a truncated-geometric number of i.i.d. exponential random variables. We now determine the distribution of Y via M_Y(s). Consider
where

a) follows because the X_i's are independent, and they remain so when conditioning on {N = n} because N is independent of the X_i's;

b) follows because N is independent of the X_i's;

c) follows because X_1, X_2, ... are i.i.d.;

d) follows because M_{X_i}(s) = λ/(λ − s).

From the form of M_Y(s), we see that Y is exponential with parameter pλ.
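The conclusion (a truncated-geometric number of i.i.d. exponentials is again exponential, with parameter pλ) can be checked by simulation; the parameter values below are illustrative assumptions.

```python
import random, statistics

random.seed(7)
lam, p, n = 1.0, 0.4, 100000

def sample_Y():
    # N ~ truncated geometric(p): number of tosses up to the first head.
    N = 1
    while random.random() >= p:
        N += 1
    # Y = X_1 + ... + X_N with X_i i.i.d. Exponential(lam).
    return sum(random.expovariate(lam) for _ in range(N))

ys = [sample_Y() for _ in range(n)]
# If Y ~ Exponential(p * lam), then E[Y] = 1/(p * lam) = 2.5.
assert abs(statistics.fmean(ys) - 1 / (p * lam)) < 0.05
```

Checking the full distribution (e.g., comparing the empirical CDF with 1 − e^{−pλy}) works the same way; the mean is simply the quickest statistic to verify.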
Chapter 5
LIMIT THEOREMS
Recall that in a random experiment, the outcome ω is drawn from the sample space Ω according to the probability measure P, and a random variable X is a function of ω. Now suppose we construct random variables X_1, X_2, ..., each a function of ω. Then {X_n} is called a stochastic process (or random process).

Since a stochastic process involves an infinite number of random variables, it cannot be described by a joint distribution. In fact, a stochastic process can be very complicated in general. But if the X_n's are i.i.d., in which case we say that {X_n} is an i.i.d. process, then it is relatively easy to describe. The process of tossing a coin repeatedly can be modeled as an i.i.d. process.

Limit theorems deal with the asymptotic behaviors of stochastic processes, and they play an important role in probability theory. In this chapter, we will discuss the two most fundamental such theorems, namely the weak law of large numbers (WLLN) and the central limit theorem (CLT).
5.1 THE WEAK LAW OF LARGE NUMBERS
Consider tossing a fair coin n times. Intuitively, when n is large, the relative frequency of obtaining a head should be approximately equal to 0.5. The WLLN makes this idea precise, as we will see, and it gives a physical meaning to the probability of an event.
For any random variable X with mean μ and variance σ², and any ε > 0,

P{|X − μ| ≥ ε} ≤ σ²/ε².

Proof The inequality can be proved by considering
The interpretation of the Chebyshev inequality is as follows. Here, {|X − μ| ≥ ε} is the event that X deviates from its mean by at least ε. The upper bound σ²/ε² on the probability of this event is such that:

if σ² is large, then it is large;

if ε is small, then it is large.

In fact, sometimes this upper bound can be larger than 1, making it useless. In general, the Chebyshev inequality is rather loose.
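The looseness of the bound is easy to see empirically. The sketch below checks P{|X − μ| ≥ ε} ≤ σ²/ε² for an Exponential(1) sample (an illustrative choice: its mean and variance are both 1), and the actual tail probabilities turn out to be far below the bound.

```python
import random

random.seed(8)
n = 200000
xs = [random.expovariate(1.0) for _ in range(n)]   # Exp(1): mean 1, variance 1

mu, var = 1.0, 1.0
for eps in (1.5, 2.0, 3.0):
    tail = sum(abs(x - mu) >= eps for x in xs) / n
    # Chebyshev: P{|X - mu| >= eps} <= var / eps^2 (here: 0.44, 0.25, 0.11).
    assert tail <= var / eps**2
```

For ε = 3 the true tail is about e^{−4} ≈ 0.018 while the bound is 1/9 ≈ 0.111, illustrating the "rather loose" remark.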
Let X_1, X_2, ... be i.i.d. random variables with mean μ and variance σ². Then for any ε > 0,

P{|(X_1 + ··· + X_n)/n − μ| ≥ ε} ≤ σ²/(nε²).   (5.1)

Proof Let

S_n = (X_1 + ··· + X_n)/n.   (5.2)

The WLLN is actually an application of the Chebyshev inequality to S_n. Now

E[S_n] = μ

and

Var(S_n) = (1/n²) Σ_{i=1}^n Var(X_i) = σ²/n,

where a) follows from Proposition 4.5 and b) follows because X_1, ..., X_n are independent. Then by the Chebyshev inequality, we have

P{|S_n − μ| ≥ ε} ≤ Var(S_n)/ε²,

or

P{|S_n − μ| ≥ ε} ≤ σ²/(nε²).

We note that in (5.1), the upper bound tends to zero for all finite σ² and ε as n → ∞. Thus the WLLN says that as long as n is sufficiently large, the probability that the average of X_1, ..., X_n, i.e., S_n, deviates from μ by any prescribed amount is arbitrarily small. Technically, we say that

S_n → μ in probability.
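Convergence in probability can be illustrated with the fair-coin example; the deviation threshold ε = 0.05 and the trial counts below are illustrative choices for the sketch.

```python
import random

random.seed(9)
p, eps, trials = 0.5, 0.05, 400

def freq_deviates(n):
    """Fraction of `trials` experiments in which the relative frequency of
    heads in n tosses of a fair coin deviates from p by at least eps."""
    bad = 0
    for _ in range(trials):
        heads = sum(random.random() < p for _ in range(n))
        if abs(heads / n - p) >= eps:
            bad += 1
    return bad / trials

# The WLLN bound sigma^2/(n * eps^2) shrinks with n, and so does the actual
# deviation probability: at n = 2000 a 0.05 deviation is a > 4-sigma event.
assert freq_deviates(50) > freq_deviates(2000)
assert freq_deviates(2000) < 0.01
```

At n = 50 roughly half the experiments still deviate by 0.05, while at n = 2000 essentially none do, which is exactly what (5.1) predicts qualitatively.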
We now give another interpretation of the WLLN. Let X be the generic random variable of the process and f(x) be its density function, i.e., the common pdf of the X_i's. It can readily be shown that E[g(X_i)] = E[g(X)] for any function g. Then by Theorem 4.18,

(1/n) Σ_{i=1}^n g(X_i) → E[g(X)] in probability,

so that

(1/n) Σ_{i=1}^n g(X_i) → ∫ g(x) f(x) dx in probability.

Thus the WLLN says that the long-run average of g(X_1), g(X_2), ... converges to E[g(X)] regardless of the actual form of f(x).

Having proved the WLLN, we now show an important consequence of it. Consider any subset B of the alphabet of the generic random variable X. We are interested in the relative frequency of occurrence of the event {X ∈ B} in the first n trials, say. More precisely, we define the random variable (called an indicator function)

Y_i = 1 if X_i ∈ B, and Y_i = 0 if X_i ∉ B,
so that

(1/n) Σ_{i=1}^n Y_i

is the relative frequency of occurrence of {X ∈ B} in the first n trials. Since X_1, X_2, ... are i.i.d., so are Y_1, Y_2, ..., because Y_i is a function of X_i. Then by the WLLN, the relative frequency of occurrence of {X ∈ B} satisfies

(1/n) Σ_{i=1}^n Y_i → E[Y_i] = P{X ∈ B} in probability.

Therefore, the probability of an event can “physically” be interpreted as the long term relative frequency of occurrence of the event when the experiment is repeated indefinitely.
The probability of obtaining a head by tossing a fair coin is 0.5. If the coin is tossed a large number of times, the relative frequency of occurrence of a head is close to 0.5 with probability almost 1.
5.2 THE CENTRAL LIMIT THEOREM
The Gaussian distribution was introduced in Example 2.29. It is one of the most important distributions in probability theory because of its wide applications. Noise in communication systems, the marks in an examination, etc., can be modeled by the Gaussian distribution. In this section, we will discuss the central limit theorem (CLT) that explains the origin of this important distribution. The theorem is first stated below.

Let X_1, X_2, ... be i.i.d. with zero mean and unit variance, and let

Z_n = (X_1 + ··· + X_n)/√n.   (5.3)

Then the distribution of Z_n tends to N(0, 1) as n → ∞.
The reader should compare the definition of S_n in (5.2) for the WLLN with the definition of Z_n here, and observe that for S_n the scaling factor is 1/n while for Z_n it is 1/√n. To understand the CLT properly, we note that

E[Z_n] = 0

since each X_i has zero mean, and

Var(Z_n) = (1/n) Σ_{i=1}^n Var(X_i) = 1,

where

a) follows from Proposition 4.5;

b) follows because the X_i's are mutually independent (cf. Section 4.3);

c) follows since each X_i has unit variance.

So the only thing the CLT says about Z_n is that as n → ∞ it has a Gaussian distribution; the mean and the variance of Z_n are already fixed from the setup.
Before we prove the CLT, we first derive the moment generating function of a Gaussian distribution.

Let X ~ N(0, 1). Then M_X(s) = e^{s²/2}.
Proof Consider
where the last step follows because the density function of N(s, 1) integrates to 1, proving the lemma.

Let X ~ N(μ, σ²). Then M_X(s) = e^{μs + σ²s²/2}.

Let X ~ N(μ_X, σ_X²) and Y ~ N(μ_Y, σ_Y²), where X and Y are independent, and let Z = X + Y. Then Z ~ N(μ_X + μ_Y, σ_X² + σ_Y²).

Proof The proof is left as an exercise.

This corollary says that the sum of two independent Gaussian random variables is again Gaussian.
Proof of Theorem 5.4 We give an outline of the proof as follows. Denote the generic random variable of the process by X. Consider the second order approximation of M_X(s) via the Taylor expansion¹ for small s:

M_X(s) ≈ M_X(0) + M_X′(0) s + M_X″(0) s²/2.

Since M_X′(0) = E[X] and M_X″(0) = E[X²], by the assumption that X has zero mean and unit variance, we have

M_X(0) = 1, M_X′(0) = 0, and M_X″(0) = 1.

Therefore,

M_X(s) ≈ 1 + s²/2

for small s. By Theorem 4.18, we have

M_{Z_n}(s) = [M_X(s/√n)]^n.
¹ The Taylor expansion for a function f at 0 is given by

f(s) = Σ_{k=0}^∞ f^{(k)}(0) s^k / k!,

where f^{(k)} denotes the kth derivative of f.
Then for large n,

M_{Z_n}(s) = [M_X(s/√n)]^n ≈ (1 + s²/(2n))^n,

where the above approximation is justified because s/√n is small for large n. Then

M_{Z_n}(s) → e^{s²/2} as n → ∞,

where we have used the identity

lim_{n→∞} (1 + x/n)^n = e^x.

Finally, we conclude by Corollary 5.6 that the distribution of Z_n tends to N(0, 1). This proves the CLT.
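The CLT can be visualized numerically. The sketch below uses Uniform(−√3, √3) summands (an illustrative choice with zero mean and unit variance, as the theorem requires) and compares the empirical CDF of Z_n with the standard normal CDF Φ.

```python
import math, random

random.seed(10)

def Z_n(n):
    """(X_1 + ... + X_n)/sqrt(n) for X_i i.i.d. Uniform(-sqrt(3), sqrt(3)),
    which has zero mean and unit variance."""
    s3 = math.sqrt(3)
    return sum(random.uniform(-s3, s3) for _ in range(n)) / math.sqrt(n)

samples = [Z_n(30) for _ in range(50000)]
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))   # standard normal CDF

for x in (-1.0, 0.0, 1.0):
    empirical = sum(s <= x for s in samples) / len(samples)
    assert abs(empirical - Phi(x)) < 0.01
```

Even at n = 30 the agreement is already within Monte Carlo noise, which is typical for symmetric summand distributions.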
The version of the CLT we have proved in Theorem 5.4 applies only when the X_i's have zero mean and unit variance. In order to apply it to a general i.i.d. process, we need the following scaling property of Gaussian distributions.
Let X ~ N(μ, σ²), and let Y = aX + b. Then Y ~ N(aμ + b, a²σ²). That is, Y again has a Gaussian distribution.

Proof Consider

M_Y(s) = E[e^{s(aX+b)}] = e^{bs} M_X(as) = e^{bs} e^{μas + σ²a²s²/2} = e^{(aμ+b)s + a²σ²s²/2}.

Thus Y ~ N(aμ + b, a²σ²).
In this example, we give a useful application of Proposition 5.8. Suppose X ~ N(μ, σ²), and we are interested in P{X ≤ a} for some constant a. Now X ≤ a is equivalent to

(X − μ)/σ ≤ (a − μ)/σ.

By defining Z = (X − μ)/σ, the above is equivalent to

Z ≤ (a − μ)/σ.

We see from Proposition 4.6 that Z has zero mean and unit variance. Moreover, by Proposition 5.8, Z is in fact a Gaussian random variable. Therefore,

P{X ≤ a} = Φ((a − μ)/σ),

where

Φ(z) = ∫_{−∞}^z (1/√(2π)) e^{−t²/2} dt

is called the standard normal CDF.
Now suppose X_1, X_2, ... is i.i.d. with mean μ and variance σ². Then by Proposition 4.6,

Y_i = (X_i − μ)/σ,   (5.4)

where Y_1, Y_2, ... is i.i.d. with zero mean and unit variance. From (5.4), we have

X_i = σ Y_i + μ,

so that

X_1 + ··· + X_n = σ(Y_1 + ··· + Y_n) + nμ.

When n is large, by the CLT,

Y_1 + ··· + Y_n ≈ N(0, n).
Then by Proposition 5.8,

X_1 + ··· + X_n ≈ N(nμ, nσ²).

That is, the sum of a large number of i.i.d. random variables is approximately Gaussian. Again, the mean and the variance of the above Gaussian distribution are determined from the setup and have nothing to do with the CLT.
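The normal approximation N(nμ, nσ²) can be put to work directly; the Exponential(1) summands and the threshold below are illustrative assumptions for the sketch.

```python
import math, random

random.seed(11)
# For X_i i.i.d. Exponential(1) (mu = 1, sigma = 1) and n = 100,
# P{X_1 + ... + X_n <= s} ≈ Phi((s - n*mu) / (sigma * sqrt(n))).
n, mu, sigma, s = 100, 1.0, 1.0, 110.0
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
approx = Phi((s - n * mu) / (sigma * math.sqrt(n)))    # = Phi(1) ≈ 0.8413

trials = 20000
empirical = sum(
    sum(random.expovariate(1.0) for _ in range(n)) <= s for _ in range(trials)
) / trials
assert abs(empirical - approx) < 0.02
```

The exact distribution of the sum is Gamma(100, 1), so the closeness of `empirical` and `approx` is a direct check of the approximation rather than of the simulation.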