Probability Models and Applications Lecture Notes for ERG 2040 (2004-05) by Raymond W. Yeung



Contents

1. PROBABILITY AND EVENTS
   1.1 Set Theory
   1.2 Elements of a Probability Model
   1.3 Probability as a State of Knowledge
   1.4 Conditional Probability
   1.5 The Law of Total Probability and the Bayes Theorem
   1.6 Independent Events

2. RANDOM VARIABLES
   2.1 Probability Mass Function
   2.2 Cumulative Distribution Function
   2.3 Probability Density Function
   2.4 Function of a Random Variable
   2.5 Expectation
   2.6 Moments
   2.7 Jensen's Inequality

3. MULTIVARIATE DISTRIBUTIONS
   3.1 Joint Cumulative Distribution Function
   3.2 Joint Probability Density Function
   3.3 Independence of Random Variables
   3.4 Conditional Distribution
       3.4.1 Conditioning on an Event
       3.4.2 Conditioning on Another Random Variable
   3.5 Functions of Random Variables
       3.5.1 The CDF Method
       3.5.2 The pdf Method
   3.6 Transformation of Random Variables

4. EXPECTATION AND MOMENT GENERATING FUNCTIONS
   4.1 Expectation as a Linear Operator
   4.2 Conditional Expectation
   4.3 Covariance and Schwartz's Inequality
   4.4 Moment Generating Functions

5. LIMIT THEOREMS
   5.1 The Weak Law of Large Numbers
   5.2 The Central Limit Theorem

Chapter 1

    PROBABILITY AND EVENTS

Probability theory is a theory for modeling the outcome of a random experiment. It is built on set theory, which in turn is built on logic. Before we start discussing probability theory, we first give a quick summary of the facts in set theory that we need in this course.

1.1 SET THEORY

The universal set, the set that contains all the elements of interest in a problem, is denoted by Ω. The empty set, the set that contains no element, is denoted by ∅.

The three basic operations in set theory are union, intersection, and complement. Another commonly used operation is set difference. These operations are defined as follows.

Let A and B be sets.

Union: The union of A and B, denoted by A ∪ B, is the set

{x : x ∈ A or x ∈ B}.

Intersection: The intersection of A and B, denoted by A ∩ B, is the set

{x : x ∈ A and x ∈ B}.

Complement: The complement of A, denoted by A^c, is the set

{x : x ∉ A}.

Set difference: The set A minus the set B, denoted by A − B, is the set

{x : x ∈ A and x ∉ B}.



Note that A − B can be written as A ∩ B^c, and A^c can be written as Ω − A.

Let A and B be sets.

Inclusion: A is a subset of B, denoted by A ⊂ B, if

'x ∈ A' implies 'x ∈ B'.

Mutual exclusion: A and B are mutually exclusive, or disjoint, if

A ∩ B = ∅.

Inclusion and mutual exclusion are two basic relations between two sets. Note that A = B if and only if A ⊂ B and B ⊂ A.

    1. .

    2. .

    3. .

    4. .

    5. .

    6. .

The proof of this proposition is omitted. The following two theorems further render some useful set identities.

1. (A ∪ B)^c = A^c ∩ B^c.

2. (A ∩ B)^c = A^c ∪ B^c.

Proof The two forms of De Morgan's Law are equivalent. This can be seen by replacing every set in the first form by its complement. Specifically, we have

(A^c ∪ B^c)^c = A ∩ B.

Upon taking complement on both sides, we obtain the second form.

De Morgan's Law can easily be verified by using a Venn diagram. More formally, it can be proved by using a truth table. Consider the first form and write


(A ∪ B)^c = {x : not (x ∈ A or x ∈ B)}

and

A^c ∩ B^c = {x : x ∉ A and x ∉ B}.

Then the following truth table shows that the two sets above are equal:

x ∈ A   x ∈ B   x ∈ A ∪ B   x ∈ A^c   x ∈ B^c   x ∈ (A ∪ B)^c   x ∈ A^c ∩ B^c
  T       T         T           F         F            F               F
  T       F         T           F         T            F               F
  F       T         T           T         F            F               F
  F       F         F           T         T            T               T

Since the last two columns agree in every row, the two sets are equal. The theorem is proved.

1. A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).

2. A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).

Proof Again these identities can be verified by using Venn diagrams and can be formally proved by using truth tables. The details are omitted.

Note that the second identity in the theorem above can be obtained from the first identity by replacing each ∪ by a ∩ and replacing each ∩ by a ∪. (This is not a proof though.)

1.2 ELEMENTS OF A PROBABILITY MODEL

Probability is a state of knowledge about something unknown. This knowledge may change due to availability of certain information. These will be discussed in the next two sections.

A probability model for a random experiment consists of the following elements:

1. a sample space Ω containing all possible outcomes of the random experiment;

2. a set function P, called a probability measure, such that

A1. 0 ≤ P(A) ≤ 1 for all events A
A2. P(Ω) = 1
A3. P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅.

Note: A set function is a function that maps a set to a real number.

The outcome of the random experiment, an element of Ω, is denoted by ω. A subset A of Ω is called an event, and we say that event A occurs if ω ∈ A. The


quantity P(A) is interpreted as the probability that event A occurs. The probability measure P represents our knowledge about the random experiment. A1 – A3 are called the axioms of probability due to the great mathematician A. Kolmogorov, http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Kolmogorov.html.

P(∅) = 0.

Proof

P(Ω) = P(Ω ∪ ∅) = P(Ω) + P(∅) by A3

implies P(∅) = 0.

P(A^c) = 1 − P(A).

Proof Consider

1 = P(Ω) = P(A ∪ A^c) = P(A) + P(A^c) by A3.

Therefore, P(A^c) = 1 − P(A).

P(A) ≤ P(B) if A ⊂ B.

Proof The fact that B = A ∪ (B − A) with A ∩ (B − A) = ∅ can easily be seen from a Venn diagram, so that

P(B) = P(A) + P(B − A)   by A3
     ≥ P(A)              by A1.

One can think of Ω as a cookie of weight 1, and P characterizes the weight distribution of the cookie. Obviously, P(A) ≥ 0 for all A, and P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅ means that weight is additive. Note that the cookie can have point masses.

Let

Ω = {(i, j) : i, j = 1, 2, …, 6}

and take

P({(i, j)}) = 1/36 for every (i, j) ∈ Ω

(to model that the die is fair and the two tosses are independent1). For example,

P(the set of all outcomes s.t. the sum is 6) = P({(1,5), (2,4), (3,3), (4,2), (5,1)}) = 5/36.

We now check that P is a valid probability measure by showing that it satisfies all the axioms of probability.

A1: Since P(A) = |A|/36 ∈ [0, 1] for every event A, A1 is obviously satisfied.

A2: Trivial.

A3: Since P(A) = |A|/36 and |A ∪ B| = |A| + |B| if A ∩ B = ∅, we can readily verify A3.
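The axiom check for this two-toss die model can be carried out mechanically. The sketch below (an illustration, not part of the notes' formalism) enumerates the 36 equally likely outcomes, verifies A2, and checks additivity on two disjoint events:

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of two tosses of a fair die.
omega = list(product(range(1, 7), repeat=2))

def P(event):
    """Probability measure assigning weight 1/36 to each outcome."""
    return Fraction(len(event), 36)

# A2: P(omega) = 1.
assert P(omega) == 1

# The event "sum is 6" from the example: (1,5),(2,4),(3,3),(4,2),(5,1).
sum6 = [w for w in omega if sum(w) == 6]
assert P(sum6) == Fraction(5, 36)

# A3 (additivity) on the disjoint events "sum is 6" and "sum is 7".
sum7 = [w for w in omega if sum(w) == 7]
assert P(sum6 + sum7) == P(sum6) + P(sum7)
```

Working with `Fraction` keeps all probabilities exact, so the axioms can be checked with equality rather than floating-point tolerance.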

Following Example 1.9, we let X be the outcome of the first toss. Formally, X : Ω → ℝ, and X((i, j)) = i.

    Following Example 1.9, we define the random variable

    if toss = 2 toss

    Formally, , where

    and for .

1.3 PROBABILITY AS A STATE OF KNOWLEDGE

Think of probability as our state of knowledge about something. For example, our state of knowledge about the outcome of tossing a fair coin before the toss is

P(HEAD) = P(TAIL) = 1/2.

After the toss and upon seeing the coin, our state of knowledge becomes

P(HEAD) = 1, P(TAIL) = 0

with probability 1/2 (a HEAD is obtained), and becomes

P(HEAD) = 0, P(TAIL) = 1

with probability 1/2 (a TAIL is obtained).

    1To be discussed in the next chapter.


1.4 CONDITIONAL PROBABILITY

The probability of event B conditioning on event A, denoted by P(B|A), is defined by

P(B|A) = P(A ∩ B) / P(A)

if P(A) > 0 (undefined if P(A) = 0).

The above formula specifies how we update our knowledge about event B from P(B) to P(B|A) upon knowing that the event A has occurred. If we fix the event A, then P(·|A) is a set function. Our knowledge of the random experiment before we know that the event A has occurred is given by P(·), and it becomes P(·|A) after we know that the event A has occurred. We will show later in Theorem 1.16 that P(·|A) is indeed a valid probability measure.

    Following Example 1.9,

    Now , and

    by A3

    Therefore,

    Similarly,

    and

    On the other hand,


    Similarly,

    for all .

Let Ω = {1, 2, 3, 4, 5, 6}, with

ω:      1    2    3    4    5    6
P({ω}): 0.1  0.2  0.1  0.1  0.3  0.2

    Let

    Then it can easily be shown that

    1 2 3 4 5 60 0

    Following the last example, further define the event

B = {ω : ω is odd}.

    Then

We say that the set function P is a probability measure because it satisfies the axioms A1 – A3. The same can be said about P(·|A), the set function derived from P by conditioning on event A. This is proved in the next theorem.

Let A be any event such that P(A) > 0. Then the set function P(·|A) is a probability measure.

Proof First, consider any event B. Then

P(B|A) = P(A ∩ B)/P(A) ≥ 0

since P(A ∩ B) ≥ 0 by A1. On the other hand,

P(B|A) = P(A ∩ B)/P(A)
       ≤ P(A)/P(A)   by Corollary 1.8, since A ∩ B ⊂ A
       = 1.

Therefore, P(·|A) satisfies A1.

Second, consider

P(Ω|A) = P(A ∩ Ω)/P(A) = P(A)/P(A) = 1.

Therefore, P(·|A) satisfies A2.

Third, for any events B and C such that B ∩ C = ∅, we have

P(B ∪ C|A) = P(A ∩ (B ∪ C))/P(A)
           = [P(A ∩ B) + P(A ∩ C)]/P(A)   by A3
           = P(B|A) + P(C|A).

Therefore, P(·|A) satisfies A3, and we conclude that P(·|A) is a probability measure.

In fact, the set function P(·|A) can formally be viewed as P(· ∩ A)/P(A). To see this, we only have to observe that for any event B,

P(B|A) = P(B ∩ A)/P(A).

Following the setup in Example 1.15, since P(·|A) is a probability measure, we have


    by A3

    from Example 1.14

    which is consistent with what we obtained in Example 1.15.

Since P(·|A) is a probability measure, Corollaries 1.6, 1.7, and 1.8 can also be applied to P(·|A). Thus we have

1. P(∅|A) = 0.

2. P(B^c|A) = 1 − P(B|A).

3. P(B|A) ≤ P(C|A) if B ⊂ C.

1.5 THE LAW OF TOTAL PROBABILITY AND THE BAYES THEOREM

A collection of sets {A1, A2, …, An} is a partition of Ω if

1. A1 ∪ A2 ∪ ⋯ ∪ An = Ω;

2. Ai ∩ Aj = ∅ if i ≠ j.

Let {A1, A2, …, An} be a partition of Ω. Then for any event B,

P(B) = Σi P(B|Ai) P(Ai).

Proof The theorem can be proved by considering

P(B) = Σi P(B ∩ Ai) = Σi P(B|Ai) P(Ai).

In the law of total probability, the event B can be interpreted as an observation (or a consequence), and the Ai can be interpreted as a collection of mutually exclusive events which can possibly cause the observation B. Figure 1.1 illustrates the relation between B and the Ai.


Figure 1.1. An illustration of the relation between B and the Ai.

Let Ω be the set of all students in a class, A1 be the set of all male students, and A2 be the set of all female students. Obviously, {A1, A2} is a partition of Ω. Let B be the set of students who wear dresses. Then the law of total probability says that

P(B) = P(B|A1) P(A1) + P(B|A2) P(A2).

For any events A and B such that P(A), P(B) > 0,

P(A|B) = P(B|A) P(A) / P(B).

Proof By definition, we have

P(A|B) = P(A ∩ B)/P(B),

or

P(A ∩ B) = P(A|B) P(B).

Exchanging the roles of A and B above, we obtain

P(A ∩ B) = P(B|A) P(A).

Since both right hand sides equal P(A ∩ B) (and hence each other), we have

P(A|B) P(B) = P(B|A) P(A),

or

P(A|B) = P(B|A) P(A) / P(B).


Let {A1, A2, …, An} be a partition of Ω. Then for any event B with P(B) > 0,

P(Ai|B) = P(B|Ai) P(Ai) / Σj P(B|Aj) P(Aj).

Proof This follows immediately from the preceding two theorems.

The formula in Corollary 1.23 expresses P(Ai|B) in terms of the P(B|Aj) and the P(Aj).

The probability that a student is lazy (L) is 0.2. The probability that a student fails (F) in a certain examination is 0.8 given that (s)he is lazy, while the probability that a student fails in the examination is 0.15 given that (s)he is not lazy. That is,

P(L) = 0.2,  P(F|L) = 0.8,  P(F|L^c) = 0.15.

We are interested in the probabilities P(L|F), P(L^c|F), P(L|F^c), and P(L^c|F^c). We proceed to determine P(L|F). By Corollary 1.23, we have

P(L|F) = P(F|L) P(L) / [P(F|L) P(L) + P(F|L^c) P(L^c)]
       = (0.8)(0.2) / [(0.8)(0.2) + (0.15)(0.8)]
       = 0.16/0.28 = 4/7.

By the remark at the end of Section 1.5, we have

P(L^c|F) = 1 − P(L|F) = 3/7.

The probabilities P(L|F^c) and P(L^c|F^c) can be determined likewise. As an exercise, the reader should identify the sets Ai and B in Corollary 1.23.
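The lazy-student computation above is a direct application of the law of total probability followed by the Bayes theorem, and it can be reproduced exactly with rational arithmetic (a sketch, using only the three probabilities stated in the example):

```python
from fractions import Fraction

P_L = Fraction(2, 10)             # P(lazy) = 0.2
P_F_given_L = Fraction(8, 10)     # P(fail | lazy) = 0.8
P_F_given_Lc = Fraction(15, 100)  # P(fail | not lazy) = 0.15

# Law of total probability: P(F) = P(F|L)P(L) + P(F|L^c)P(L^c).
P_F = P_F_given_L * P_L + P_F_given_Lc * (1 - P_L)

# Bayes theorem (Corollary 1.23): P(L|F) = P(F|L)P(L) / P(F).
P_L_given_F = P_F_given_L * P_L / P_F

print(P_F)          # 7/25, i.e. 0.28
print(P_L_given_F)  # 4/7
```

The complement rule for the conditional measure then gives `1 - P_L_given_F`, which is 3/7, matching P(L^c|F) in the text.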


1.6 INDEPENDENT EVENTS

Let A and B be two events, and assume for the time being that P(A), P(B) > 0. The events A and B are independent if the probability of one event is not changed by knowing the other event. Specifically,

P(B|A) = P(B)   (1.1)

and

P(A|B) = P(A).   (1.2)

It is easy to see that both (1.1) and (1.2) are equivalent to

P(A ∩ B) = P(A) P(B).

The above relation will be used as the definition for independence. In fact, it is more general than either (1.1) or (1.2) because it avoids the problem of division by zero when P(A) or P(B) is zero.

Two events A and B are independent if P(A ∩ B) = P(A) P(B).

The reader should carefully distinguish between two events being disjoint and two events being independent. The latter has to do with the probability measure while the former does not.

For three events A, B, and C, we have two notions of independence.

Three events A, B, and C are pairwise independent if

P(A ∩ B) = P(A) P(B)   (1.3)
P(A ∩ C) = P(A) P(C)   (1.4)
P(B ∩ C) = P(B) P(C).  (1.5)

Three events A, B, and C are mutually independent if (1.3)-(1.5) are satisfied and also

P(A ∩ B ∩ C) = P(A) P(B) P(C).

Clearly, if events A, B, and C are mutually independent, then they are also pairwise independent. However, the converse is not true, as is shown in the next example.

Consider the probability measure shown in Figure 1.2, in which each region belonging to exactly one of A, B, and C has weight 0.12, each pairwise overlap outside the third set has weight 0.09, and the triple intersection A ∩ B ∩ C has weight 0. First, we have

P(A) = 0.12 + 0.09 + 0.09 = 0.3.

Figure 1.2. The probability measure for Example 1.28.

By symmetry, we also have P(B) = P(C) = 0.3. Now,

P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = 0.09 = (0.3)(0.3).

Therefore, it is readily seen that events A, B, and C are pairwise independent. However, since

P(A ∩ B ∩ C) = 0 ≠ (0.3)(0.3)(0.3),

events A, B, and C are not mutually independent.
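The arithmetic of this counterexample can be checked mechanically. The sketch below assumes the region weights read off Figure 1.2 (0.12 for each single-set region, 0.09 for each pairwise-only overlap, 0 for the triple intersection), which is the assignment consistent with the computations above:

```python
from fractions import Fraction

# Region weights assumed from Figure 1.2 (see lead-in for the assumption).
only = Fraction(12, 100)    # in exactly one of A, B, C
pair = Fraction(9, 100)     # in exactly two of A, B, C
triple = Fraction(0)        # in A ∩ B ∩ C

P_A = only + 2 * pair + triple   # = P(B) = P(C) by symmetry
P_AB = pair + triple             # = P(A∩C) = P(B∩C)

assert P_A == Fraction(3, 10)
assert P_AB == P_A * P_A         # pairwise independent: 0.09 = (0.3)(0.3)
assert triple != P_A ** 3        # not mutually independent: 0 != 0.027
```

Note that the three single-set weights plus the three overlaps sum to 0.63, so the remaining weight 0.37 lies outside A ∪ B ∪ C; this does not affect the independence check.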

Chapter 2

    RANDOM VARIABLES

As we have discussed, a random variable X is a function of ω, the outcome of the random experiment in discourse. Formally, X : Ω → ℝ, where ℝ denotes the set of real numbers. The alphabet set of X, denoted by S_X, is a subset of ℝ containing all possible values taken by X. If S_X is a finite set, we denote its cardinality (size) by |S_X|.

Let Ω = {1, 2, 3, 4, 5, 6}, with

ω:      1    2    3    4    5    6
P({ω}): 0.1  0.2  0.1  0.1  0.3  0.2

as in Example 1.15. Define random variables X and Y as follows.

ω    X(ω)   Y(ω)
1     0      0
2     1      0
3     0      0
4     2      1
5     4      1
6     3      1

We can let S_X = {0, 1, 2, 3, 4}, with |S_X| = 5. Similarly, we can let S_Y = {0, 1}.

Any function of ω defines a random variable.

Suppose we choose a time in a day at random. To model this, we let

Ω = [0, 24),

the time of day measured in hours. Then the two functions HR and MIN defined by

HR(ω) = ⌊ω⌋

and

MIN(ω) = ⌊60(ω − ⌊ω⌋)⌋

are random variables representing respectively the hour and minute of the random time chosen. Likewise, the function

AM(ω) = 1 if ω < 12, 0 otherwise

is the random variable denoting "morning".

Let Ω be the set of all possible states of the body of a person chosen at random. Then the height, the weight, the body temperature, the blood pressure, the eye color, etc., of the chosen person are all random variables and they are functions of ω. One can think of a random variable as a "measurement" of ω.

One can think of a random variable X as described by a piece of wire extending from −∞ to +∞ with a certain weight distribution, and the probability that X takes a value in a set A is the weight of the set A. So the problem is how to characterize a weight distribution on a piece of wire. In the rest of the chapter, we will see various such characterizations.

2.1 PROBABILITY MASS FUNCTION

A random variable X is called discrete if the set of all values taken by X is discrete (finite or countably infinite). Such a random variable is characterized by a probability mass function (pmf) which gives the probability of occurrence of each possible value of X. When there is no ambiguity, we will write

p_X(x) = P(X = x)

as p(x). For example, in the last example, we have

P(X = x) = P({ω : X(ω) = x}),

or

P(X = x) = Σ_{ω : X(ω) = x} P({ω}) by A3.

We usually do not go for such a level of formality, but the above shows what

P(X = x)

exactly means.


A pmf satisfies the following two properties:

1. p(x) ≥ 0 for all x;

2. Σ_x p(x) = 1.

    Proof The proof is left as an exercise.

The binomial distribution with parameters n and p takes integer values between 0 and n, with

p(k) = C(n, k) p^k (1 − p)^(n−k)

for 0 ≤ k ≤ n, and p(k) = 0 otherwise, where 0 ≤ p ≤ 1 and C(n, k) = n!/[k!(n − k)!]. It is obvious that p(k) ≥ 0. Moreover,

Σ_k p(k) = Σ_{k=0}^{n} C(n, k) p^k (1 − p)^(n−k) = [p + (1 − p)]^n = 1,

where we have invoked the binomial formula

(a + b)^n = Σ_{k=0}^{n} C(n, k) a^k b^(n−k).

It can be shown that p(k) peaks at k = ⌊(n + 1)p⌋.
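Both facts about the binomial pmf (that it sums to 1, and where it peaks) are easy to confirm numerically; the following sketch uses arbitrary illustrative parameters n = 10, p = 0.3:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

# The pmf sums to 1, by the binomial formula.
assert abs(sum(pmf) - 1) < 1e-12

# The pmf peaks at floor((n+1)p) = floor(3.3) = 3.
assert max(range(n + 1), key=lambda k: pmf[k]) == int((n + 1) * p)
```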

The Poisson distribution with parameter λ > 0 is defined by

p(k) = e^(−λ) λ^k / k!   if k = 0, 1, 2, …
p(k) = 0                 otherwise.

Then

Σ_k p(k) = e^(−λ) Σ_{k=0}^{∞} λ^k / k! = e^(−λ) e^λ = 1.
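The normalization of the Poisson pmf relies on the power series for e^λ; a truncated sum already comes extremely close to 1 (λ = 4 here is an arbitrary illustrative choice):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

lam = 4.0
# The infinite sum equals e^{-lam} * e^{lam} = 1; truncating at k = 99
# leaves only a negligible tail.
total = sum(poisson_pmf(k, lam) for k in range(100))
assert abs(total - 1) < 1e-12
```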


2.2 CUMULATIVE DISTRIBUTION FUNCTION

We already have seen the pmf characterization of a discrete random variable. In this section, we introduce a more powerful characterization called the cumulative distribution function (CDF), which can characterize any random variable. The CDF of a random variable X, denoted by F_X(x), is defined by

F_X(x) = P(X ≤ x)

for all x. F_X(x) will be written as F(x) when there is no ambiguity.

A CDF satisfies the following properties:

1. F(x) is non-decreasing and right-continuous;

2. lim_{x→−∞} F(x) = 0;

3. lim_{x→∞} F(x) = 1.

Proof The first property results from the definition of F(x). To prove the second property, we consider

lim_{x→−∞} F(x) = lim_{x→−∞} P(X^{−1}((−∞, x])) = P(X^{−1}(∅)) = P(∅) = 0.

Here, we have used X^{−1}(A) to denote the inverse image of a subset A of ℝ under the mapping X, defined as the set

X^{−1}(A) = {ω ∈ Ω : X(ω) ∈ A}.

To prove the third property, we consider

lim_{x→∞} F(x) = lim_{x→∞} P(X^{−1}((−∞, x])) = P(X^{−1}(ℝ)) = P(Ω) = 1.

Basically, F(x) gives the weight of the left part of the piece of wire in discussion up to and including the point x. From F(x), for any interval (a, b], we can determine P(X ∈ (a, b]) from

P(a < X ≤ b) = F(b) − F(a),


Figure 2.1. The CDF of an exponential distribution.

and hence the probability that X takes a value in any union of intervals by A3. Thus the CDF completely characterizes the weight distribution of the piece of wire and hence the random variable X!

If X is a discrete random variable taking integer values, then

F(k) = Σ_{i ≤ k} p(i)

and

p(k) = F(k) − F(k − 1).

Thus the pmf and the CDF completely specify each other.

According to the CDF, a random variable can be classified as follows:
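For an integer-valued discrete random variable, the two conversions are a running sum and a first difference. The sketch below round-trips the pmf of the die from Example 1.15:

```python
from itertools import accumulate

# pmf of the die from Example 1.15, on the values 1..6.
pmf = {1: 0.1, 2: 0.2, 3: 0.1, 4: 0.1, 5: 0.3, 6: 0.2}

# CDF at integer k: F(k) = sum of p(i) for i <= k.
values = sorted(pmf)
cdf = dict(zip(values, accumulate(pmf[v] for v in values)))

# Recover the pmf from the CDF: p(k) = F(k) - F(k-1).
recovered = {k: cdf[k] - cdf.get(k - 1, 0.0) for k in values}
assert all(abs(recovered[k] - pmf[k]) < 1e-12 for k in values)
assert abs(cdf[6] - 1.0) < 1e-12
```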

    1. Discrete – if the CDF has increments only at discrete jumps;

    2. Continuous – if the CDF has only continuous increment;

    3. Mixed – otherwise.

For an exponential distribution with parameter λ > 0, denoted by exp(λ),

F(x) = 1 − e^(−λx)   if x ≥ 0
F(x) = 0             otherwise.

Figure 2.1 is a sketch of the CDF of an exponential distribution.

For the uniform distribution on [0, 1],

F(x) = 0   if x < 0
F(x) = x   if 0 ≤ x ≤ 1
F(x) = 1   if x > 1.


Figure 2.2. The CDF of the uniform distribution on [0, 1].

Figure 2.3. The CDF of the mixed distribution in Example 2.3.

    Figure 2.2 is a sketch of the CDF of this distribution.

For the distribution uniform on [0, 1) with a point probability mass of 2/3 at x = 1, the CDF is given by

F(x) = 0     if x < 0
F(x) = x/3   if 0 ≤ x < 1
F(x) = 1     if x ≥ 1.

    Figure 2.3 is a sketch of the CDF of this distribution.

2.3 PROBABILITY DENSITY FUNCTION

In the last section, we have seen how we can characterize the weight distribution of a piece of wire by the CDF. In this section, we will see how we can characterize the same thing by means of a "density".


Suppose we say that the density of the piece of wire in a neighborhood of x is equal to f(x), with the unit kg/m. What it means is that for a segment of wire around x with length Δ, the weight is approximately f(x)Δ. Let f(x) denote the density at x for a piece of wire whose weight distribution is described by a CDF F(x). Then what we want for f(x) is that

∫_{−∞}^{x} f(u) du = F(x)   (2.1)

for all x (this is exactly what density means). Assuming that F(x) is differentiable (with respect to x) and using the fundamental theorem of calculus1, by differentiating both sides of the above, we have

f(x) = dF(x)/dx,

giving

f(x) = dF(x)/dx.   (2.2)

The function f(x) is called the probability density function (pdf).

If F(x) is not differentiable at a certain x, then we cannot define f(x) by (2.2). This can happen if F(x) does not change smoothly at x, for example, at x = 1 in Figure 2.2. However, a careful examination of (2.1) reveals that the density at any isolated point, as long as it is finite, does not affect the CDF F(x). Therefore, in such situations, we can simply set f(x) to be any finite value.

In fact, if f(x) is finite, then

P(X = x) = 0.

This can be seen as follows. Let I be an interval with width Δ such that x ∈ I. Then

P(X ∈ I) ≈ f(x)Δ.

1The fundamental theorem of calculus states that

d/dx ∫_a^x f(u) du = f(x).

This can be seen by considering

(1/Δ) ∫_x^{x+Δ} f(u) du → f(x)

if f is sufficiently smooth at x.


Figure 2.4. The pdf of an exponential distribution.

Letting Δ → 0, we see that P(X = x) = 0. However, although the event {X = x} has zero probability measure, it is not impossible.

Suppose we throw a dart at a target and the point where the dart lands is distributed uniformly on the target. Then the probability that the dart lands on any particular point is zero, but it is not impossible for the dart to land on that point.

For an exponential distribution with parameter λ,

f(x) = λ e^(−λx)   if x ≥ 0
f(x) = 0           otherwise.

Figure 2.4 is a sketch of the pdf of an exponential distribution.

For the uniform distribution on [0, 1],

f(x) = 1   if 0 ≤ x ≤ 1
f(x) = 0   otherwise.

    Figure 2.5 is a sketch of the pdf of this distribution.
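The relation f = dF/dx in (2.2) can be checked numerically for the exponential distribution, using the standard forms F(x) = 1 − e^(−λx) and f(x) = λe^(−λx). A central difference of the CDF should match the pdf closely (λ = 2 is an arbitrary illustrative choice):

```python
from math import exp

lam = 2.0

def F(x):
    """CDF of exp(lam)."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0

def f(x):
    """pdf of exp(lam)."""
    return lam * exp(-lam * x) if x >= 0 else 0.0

# f = dF/dx: compare a central difference of F against f at a few points.
h = 1e-6
for x in (0.1, 0.5, 1.0, 3.0):
    numeric = (F(x + h) - F(x - h)) / (2 * h)
    assert abs(numeric - f(x)) < 1e-5
```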

However, we do run into a serious technical difficulty in defining the density function f(x) if F(x) has a discrete jump at x, i.e., there is a point mass at x. The problem we face is how to define f(x) in view of (2.1) so that F(x) changes abruptly as x changes across the jump. In fact, no such function f(x) exists. Instead, we introduce the notion of a Dirac delta function2. A delta function is called a generalized function, which is not a true

2named after the great physicist Paul Dirac.


Figure 2.5. The pdf of the uniform distribution on [0, 1].

function. Generalized function theory is a very deep theory, but it suffices for us to understand what a delta function means.

The best way to think of a delta function, denoted by δ(x), is that it is a function which takes the value 0 everywhere except around x = 0. Around x = 0, the function is a very sharp peak such that the total area under δ(x) is equal to 1. That is,

∫_{−∞}^{∞} δ(x) dx = 1.

By means of the delta function, we will be able to describe a CDF with discrete jumps. This will be illustrated in the next example.

For the distribution uniform on [0, 1) with a point probability mass at x = 1, the pdf is sketched in Figure 2.6. Here, (2/3)δ(x − 1) denotes a delta scaled to two-thirds of its area and relocated at x = 1.

A pdf satisfies the following properties:

1. f(x) ≥ 0;

2. ∫_{−∞}^{∞} f(x) dx = 1.

Proof First, f(x) is nonnegative because F(x) is non-decreasing. Second,

∫_{−∞}^{∞} f(x) dx = F(∞) − F(−∞) = 1 − 0 = 1.


Figure 2.6. The pdf of the mixed distribution in Example 2.3.

2.4 FUNCTION OF A RANDOM VARIABLE

Suppose we have a random variable X specified by a given distribution (pdf or CDF), and we pass X through a function g to obtain a random variable Y = g(X). What is the distribution of Y? Note that Y, as a random variable, is also a function of ω because Y(ω) = g(X(ω)).

This problem is rather trivial if both X and Y are discrete. To be specific, for each y, p_Y(y) can be determined by

p_Y(y) = Σ_{x ∈ g^{−1}(y)} p_X(x),

where g^{−1}(y) is the inverse image of y under the mapping g. For continuous and mixed random variables, the idea is similar but the technicality is a little bit more involved.

Let us consider the case that both X and Y are continuous, and we will determine f_Y in terms of f_X. To start with, we further assume that g is monotone so that the inverse function g^{−1} is defined. Toward this end, in light of Figure 2.7, consider

P(y ≤ Y ≤ y + dy) = P(x ≤ X ≤ x + dx),

where y = g(x), or x = g^{−1}(y). Since

P(y ≤ Y ≤ y + dy) ≈ f_Y(y)|dy|

and

P(x ≤ X ≤ x + dx) ≈ f_X(x)|dx|,

we have


Figure 2.7. A monotone function g(x).

f_Y(y)|dy| = f_X(x)|dx|,

and in the limit, we have

f_Y(y) = f_X(x) |dx/dy| = f_X(g^{−1}(y)) / |g'(g^{−1}(y))|.

    Let , where. Then

    and

    Therefore,

    We give an alternative solution to the problem in the lastexample. We consider two cases for . First, for , we have

    Then(2.3)


    For , we have

    Upon differentiating with respect to , we obtain

    (2.4)

    Combining (2.3) and (2.4) for the two cases, we have

    which is the same as what we have obtained before.
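The monotone change-of-variables formula can be sanity-checked against the CDF method numerically. The sketch below is illustrative only: the choices g(x) = x³ and X ~ exp(1) are assumptions, not the example used in the notes. For increasing g, F_Y(y) = F_X(g⁻¹(y)), so a numerical derivative of F_X(g⁻¹(y)) should match f_X(g⁻¹(y)) / |g'(g⁻¹(y))|:

```python
from math import exp

lam = 1.0
g = lambda x: x**3                 # assumed monotone increasing function
g_inv = lambda y: y ** (1 / 3)
g_prime = lambda x: 3 * x**2

f_X = lambda x: lam * exp(-lam * x)   # pdf of X ~ exp(lam), x >= 0
F_X = lambda x: 1 - exp(-lam * x)     # CDF of X ~ exp(lam), x >= 0

# pdf method: f_Y(y) = f_X(g^{-1}(y)) / |g'(g^{-1}(y))|.
f_Y = lambda y: f_X(g_inv(y)) / abs(g_prime(g_inv(y)))

# CDF method: F_Y(y) = F_X(g^{-1}(y)); differentiate numerically.
h = 1e-6
for y in (0.5, 1.0, 2.0, 8.0):
    numeric = (F_X(g_inv(y + h)) - F_X(g_inv(y - h))) / (2 * h)
    assert abs(numeric - f_Y(y)) < 1e-4
```

Agreement of the two methods at each test point is exactly the equivalence demonstrated in the example above.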

We now turn to the case that g is not monotone, i.e., g(x) = y has possibly more than one solution for some y. We will discuss the case that for a particular y, there exist x1 ≠ x2 such that g(x1) = g(x2) = y (cf. Figure 2.8). The general case is a straightforward extension. From Figure 2.8, we see that

f_Y(y)|dy| = f_X(x1)|dx1| + f_X(x2)|dx2|.

Then

f_Y(y) = f_X(x1)|dx1/dy| + f_X(x2)|dx2/dy|.

Following exactly the argument as we used for the case that g is monotone, we obtain

f_Y(y) = f_X(x1)/|g'(x1)| + f_X(x2)/|g'(x2)|.

    Let , . Now values of less thanare impossible, so that for . For , we have


Figure 2.8. A function g(x) with two inverses x1 and x2 at y.

    Since , we have

2.5 EXPECTATION

The expectation of X, written as E[X] or simply EX, is defined as

EX = Σ_x x p(x)

for X discrete, and

EX = ∫_{−∞}^{∞} x f(x) dx

for X continuous. EX is sometimes written as X̄.

The expectation of a random variable X is the value taken by X on the average. It is also called the mean and it can be regarded as the center of mass of the piece of wire describing the distribution of X. Specifically, the location of the center of mass of the piece of wire is given by

∫ x f(x) dx / ∫ f(x) dx = ∫ x f(x) dx = EX,

since ∫ f(x) dx = 1.

For any constant c, E[c] = c.

Proof This can be proved by considering

E[c] = Σ_x c p(x) = c Σ_x p(x) = c · 1 = c.


If X ≥ 0 with probability 1, then EX ≥ 0.

Proof This can be proved by considering

EX = ∫_0^{∞} x f(x) dx ≥ 0,

since the integrand is nonnegative.

Let X ~ binomial(n, p). Then

EX = Σ_{k=0}^{n} k C(n, k) p^k (1 − p)^(n−k).

This summation can be evaluated explicitly by considering the binomial formula:

(x + y)^n = Σ_{k=0}^{n} C(n, k) x^k y^(n−k).

Here we fix y and regard each side as a function of x, so that the left hand side and the right hand side are identical functions of x. Differentiating with respect to x, we have

n(x + y)^(n−1) = Σ_{k=0}^{n} k C(n, k) x^(k−1) y^(n−k),

and multiplying by x gives

nx(x + y)^(n−1) = Σ_{k=0}^{n} k C(n, k) x^k y^(n−k).

Then let x = p and y = 1 − p to obtain

EX = np.
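The closed form E[X] = np for the binomial mean can be confirmed by evaluating the defining sum directly (n = 12, p = 0.25 are arbitrary illustrative parameters):

```python
from math import comb

n, p = 12, 0.25
# E[X] computed from the definition, term by term.
mean = sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))
assert abs(mean - n * p) < 1e-12   # E[X] = np = 3
```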

Let X ~ exp(λ). Then

EX = ∫_0^{∞} x λ e^(−λx) dx.

Let u = x and dv = λ e^(−λx) dx. Then du = dx and v = −e^(−λx). Therefore, integrating by parts,

EX = [−x e^(−λx)]_0^{∞} + ∫_0^{∞} e^(−λx) dx = 0 + 1/λ = 1/λ.

In evaluating lim_{x→∞} x e^(−λx), we use L'Hospital's Rule as follows:

lim_{x→∞} x/e^(λx) = lim_{x→∞} 1/(λ e^(λx)) = 0.
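The integration-by-parts result E[X] = 1/λ can be cross-checked with a crude Riemann sum (λ = 2 here is an arbitrary illustrative choice, and the integration range and step are numerical assumptions):

```python
from math import exp

lam = 2.0
# Riemann-sum approximation of E[X] = ∫ x·lam·e^{-lam x} dx over [0, 40/lam];
# the tail beyond that is negligible.
dx = 1e-4
xs = (i * dx for i in range(int(40 / lam / dx)))
mean = sum(x * lam * exp(-lam * x) * dx for x in xs)
assert abs(mean - 1 / lam) < 1e-3   # E[X] = 1/lam = 0.5
```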

The expectation of a function of a random variable is defined in the natural way.

The expectation of a function g of a random variable X is defined by

E[g(X)] = Σ_x g(x) p(x)

for X discrete, and

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx

for X continuous.

2.6 MOMENTS

In Definition 2.25, by letting g(X) = X^n and g(X) = (X − EX)^n, we obtain the nth moment and the nth central moment of X, respectively, where n ≥ 1. The nth moment and the nth central moment are denoted by m_n and μ_n, respectively, i.e.,

m_n = E[X^n]  and  μ_n = E[(X − EX)^n].


(Note that in the definition of the central moments, EX is just a constant.) The first moment, i.e., EX, is called the mean (or expectation), while the second central moment, i.e., E[(X − EX)^2], is called the variance. The variance of X is also denoted by var X.

E[X − EX] = 0.

Proof First,

E[X − EX] = EX − E[EX] = EX − EX = 0.

Regarding the distribution of X as represented by a piece of wire, the quantity (x − EX) f(x) dx is the "moment" (as in mechanics) about the center of mass (at position EX) due to the weight f(x) dx at position x. So this proposition says that for any weight distribution, the total moment about the center of mass is equal to 0.

For X continuous, it suffices to consider

E[X − EX] = ∫ (x − EX) f(x) dx = ∫ x f(x) dx − EX ∫ f(x) dx = EX − EX = 0.

The proof for X discrete is similar.

For each function g, E[g(X)] gives partial information about the distribution of X. For example, EX gives the average value taken by X. Likewise, each moment and each central moment gives some partial information about the distribution of X. However, we point out that under many situations, the distribution of X is completely specified by its moments, i.e., the set of all moments of X gives complete information about the distribution of X. It can also be shown that the set of all central moments is completely specified by the set of all moments, together with EX, and vice versa. The following theorem is a special case of this fact.

var X = E[X^2] − (EX)^2.

Proof The theorem is proved by considering

var X = E[(X − EX)^2]
      = E[X^2 − 2X(EX) + (EX)^2]
      = E[X^2] − 2(EX)^2 + (EX)^2
      = E[X^2] − (EX)^2.
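The identity var X = E[X²] − (EX)² is easy to verify numerically; the sketch below computes both sides for the die pmf of Example 1.15:

```python
# pmf of the die from Example 1.15.
pmf = {1: 0.1, 2: 0.2, 3: 0.1, 4: 0.1, 5: 0.3, 6: 0.2}

mean = sum(x * p for x, p in pmf.items())                  # EX
second_moment = sum(x**2 * p for x, p in pmf.items())      # E[X^2]
central = sum((x - mean) ** 2 * p for x, p in pmf.items()) # E[(X-EX)^2]

# var X computed two ways: definition vs. E[X^2] - (EX)^2.
assert abs(central - (second_moment - mean**2)) < 1e-12
```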


var X ≥ 0, with equality if and only if X = EX with probability 1.

Proof Since (x − EX)^2 ≥ 0 for all x,

var X = E[(X − EX)^2] ≥ 0.   (2.5)

Therefore,

E[X^2] ≥ (EX)^2.

We now show that the above inequality is tight if and only if X = EX with probability 1. If X = EX with probability 1, then obviously var X = 0. The proof of the converse, however, is much more technical and is deferred to Appendix 2.A.

The Gaussian distribution N(μ, σ^2) is a continuous distribution defined by

f(x) = (1/√(2πσ^2)) e^{−(x−μ)^2/(2σ^2)}

for −∞ < x < ∞. It is one of the most important distributions in probability theory. It will be discussed in depth when we discuss the central limit theorem. It can be shown that

∫_{−∞}^{∞} f(x) dx = 1.

Furthermore, it can be shown that the mean and variance are respectively μ and σ^2.

2.7 JENSEN'S INEQUALITY

Let {x_1, x_2, …, x_n} be a set of vectors and {a_1, a_2, …, a_n} be a set of real numbers. Then

a_1 x_1 + a_2 x_2 + ⋯ + a_n x_n

is called a linear combination of x_1, x_2, …, x_n. If in addition {a_i} satisfies

1. a_i ≥ 0 for all i, and


Figure 2.9. Linear combinations and convex combinations.

2. a_1 + a_2 + ⋯ + a_n = 1,

then the sum is called a convex combination of x_1, x_2, …, x_n.

Remark If {a_i} satisfies a_i ≥ 0 for all i and Σ_i a_i = 1, then {a_i} is a probability mass function. In this course, the vectors x_i are usually scalars. In particular,

EX = Σ_x x p(x)

is a convex combination of all the possible values of X.

Figure 2.9(a) is an illustration of a linear (convex) combination of two vectors x_1 and x_2, and Figure 2.9(b) is an illustration of the set of all convex combinations of the four vectors x_1, x_2, x_3, and x_4.

A set S is convex if for any x_1, x_2 ∈ S,

a x_1 + (1 − a) x_2 ∈ S

for any 0 ≤ a ≤ 1.

A function g defined on a convex set S is called convex if for any x_1, x_2 ∈ S and any 0 ≤ a ≤ 1,

g(a x_1 + (1 − a) x_2) ≤ a g(x_1) + (1 − a) g(x_2).   (2.6)

A function g is concave if and only if −g is convex.

In the definition of a convex function, it is crucial that the set S is convex because otherwise, for x_1, x_2 ∈ S and 0 ≤ a ≤ 1, a x_1 + (1 − a) x_2 may not


Figure 2.10. A convex function.

be in S. If so, g is not defined at a x_1 + (1 − a) x_2 and (2.6) cannot even be stated.

Figure 2.10 is an illustration of a convex function when S is some convex subset of the real line. In general, S is some convex subset of ℝ^n.

For any set S, its convex hull is defined as

conv(S) = {a x_1 + (1 − a) x_2 : x_1, x_2 ∈ S for some x_1, x_2, and 0 ≤ a ≤ 1}.

In words, conv(S) consists of the convex combinations of any pair of vectors in S.

The convex hull of a set is convex. Figure 2.9(b) shows the convex hull of the set {x_1, x_2, x_3, x_4}.

Let g be a convex function. Then E[g(X)] ≥ g(EX).

Proof For X being a binary random variable, Jensen's inequality follows directly from the convexity of g. This can be seen by letting a = p(x_1) and 1 − a = p(x_2) in (2.6). See homework for a proof for the case when the alphabet of X is finite.

Remark The computation of E[g(X)] requires the knowledge of the distribution of X, while the computation of g(EX) requires only the knowledge of EX. Therefore, Jensen's inequality provides a lower bound on E[g(X)] when only EX is known.


Let g be a concave function. Then E[g(X)] ≤ g(EX).

    is convex. By Jensen’s inequality, .

g(x) = x^2 is convex. By Jensen's inequality, E[X^2] ≥ (EX)^2, i.e., var X ≥ 0. This has been obtained in Corollary 2.28.

APPENDIX 2.A: PROOF OF COROLLARY 2.28

This appendix is for the more enthusiastic readers. Here, we complete the proof of Corollary 2.28 by showing that if var X = 0, then X = EX with probability 1, or P(X = EX) = 1. Assume that var X = 0. First, observe that

    Define the set

    Note that

    and that for , , so that . Our goal is to establish that. We first show that for any , . Consider

    Thus

implying that . Now since for all , we still cannot conclude that . Toward this end, by means of the monotone convergence theorem in real analysis (beyond the scope of this course), it can be shown that

    This completes the proof.

Chapter 3

    MULTIVARIATE DISTRIBUTIONS

In Chapter 2, we have discussed a single random variable. In this and subsequent chapters, we will discuss a finite collection of jointly distributed random variables. In this chapter, we will discuss two jointly distributed random variables mostly, but generalizations to more than two random variables are straightforward. Emphasis will be on continuous distributions.

A random variable is a function of ω. Two random variables are an ordered pair of functions of ω, for example, (X(ω), Y(ω)). For one random variable, we think of its distribution as the weight distribution of a piece of wire whose total weight is 1. For two random variables, we think of their joint distribution as the weight distribution of a sheet whose total weight is 1.

As for a single random variable, we will have both CDF and pdf characterizations of the joint distribution.

3.1 JOINT CUMULATIVE DISTRIBUTION FUNCTION

Let X and Y be two jointly distributed continuous random variables. The joint CDF of X and Y is defined by

F_XY(x, y) = P(X ≤ x, Y ≤ y).

When there is no ambiguity, F_XY(x, y) will be abbreviated as F(x, y).

A joint CDF satisfies the following properties:

1. 0 ≤ F(x, y) ≤ 1;

2. F(x, −∞) = 0, F(−∞, y) = 0, and F(−∞, −∞) = 0, for any x and y;

3. F(∞, ∞) = 1;

4. For any a ≤ b and c ≤ d,

F(b, d) − F(b, c) − F(a, d) + F(a, c) ≥ 0.

Proof Property 1 is proved by noting

0 ≤ P(X ≤ x, Y ≤ y) ≤ 1.

Property 2 is proved by considering

F(x, −∞) = P(X ≤ x, Y ≤ −∞) = P(∅) = 0.

That F(−∞, y) = 0 and F(−∞, −∞) = 0 can be proved likewise. The proof for Property 3 is similar to that for lim_{x→∞} F(x) = 1 for the single variable case, so it is omitted. Property 4 can be proved by considering

P(a < X ≤ b, c < Y ≤ d) = F(b, d) − F(b, c) − F(a, d) + F(a, c) ≥ 0.   (3.1)

Letting c → −∞ and d = y in Property 4, we have

F(b, y) − F(b, −∞) − F(a, y) + F(a, −∞) ≥ 0,

which implies

F(b, y) ≥ F(a, y) for a ≤ b,

where we have invoked Property 2. Similarly, we can show that for c ≤ d,

F(x, d) ≥ F(x, c).

Therefore, F(x, y) is non-decreasing in both x and y, but this is weaker than Property 4 as we will see in the next example.


    Let

    Then we see that

    However,

The reader should compare Property 4 with the non-decreasing property of F(x) for a single random variable. We can think of Property 4 as the "jointly non-decreasing" property.

So from (3.1), we can obtain the probability that (X, Y) is within any rectangle in ℝ² from the joint CDF F(x, y). In principle, we can obtain P((X, Y) ∈ A) for essentially any subset A of ℝ² by A31 through approximation and taking limit. Therefore, the joint CDF fully characterizes the joint distribution.

The marginal CDFs, namely F_X(x) and F_Y(y), can readily be obtained from F_XY(x, y). Specifically,

F_X(x) = P(X ≤ x) = P(X ≤ x, Y ≤ ∞) = F_XY(x, ∞).

Similarly,

F_Y(y) = F_XY(∞, y).

3.2 JOINT PROBABILITY DENSITY FUNCTION

Our task is to find a joint pdf f_XY(x, y) (if it exists) for a given joint CDF F_XY(x, y) such that

F_XY(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_XY(u, v) dv du.   (3.2)

Assume that f_XY(x, y) exists. Then by the fundamental theorem of calculus, we have

∂F_XY(x, y)/∂x = ∫_{−∞}^{y} f_XY(x, v) dv.

1A3 refers to the third axiom of probability.

Therefore,

f_XY(x, y) = ∂²F_XY(x, y)/∂x∂y.   (3.3)

Now F_XY(x, y) can be obtained from f_XY(x, y) by (3.2), and vice versa by (3.3). Hence, both F_XY(x, y) and f_XY(x, y) are complete characterizations of the joint distribution.

A joint pdf satisfies the following properties:

1. f(x, y) ≥ 0;

2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.

Proof First,

f(x, y) = lim_{Δ→0} [F(x + Δ, y + Δ) − F(x + Δ, y) − F(x, y + Δ) + F(x, y)] / Δ².

By Property 4 of a joint CDF in Theorem 3.1, we see that the expression inside the square bracket above is nonnegative. It then follows that f(x, y) ≥ 0. Second,

∫∫ f(x, y) dx dy = F(∞, ∞) = 1

by Property 3 in Theorem 3.1. The theorem is proved.

The marginal pdfs, i.e., f_X(x) and f_Y(y), can also readily be obtained from f_XY(x, y). Specifically,

f_X(x) = ∫_{−∞}^{∞} f_XY(x, y) dy.

Similarly,

f_Y(y) = ∫_{−∞}^{∞} f_XY(x, y) dx.   (3.4)

Remark For discrete random variables X and Y, the analog of (3.4) is

p_Y(y) = Σ_x p_XY(x, y).
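The discrete analog of marginalization is just a sum over the other variable. The sketch below uses an assumed joint pmf on a small grid (the values are illustrative, not from the notes):

```python
from fractions import Fraction

# An assumed joint pmf p_XY on {0,1} x {0,1,2} (illustrative values).
p_XY = {(0, 0): Fraction(1, 6), (0, 1): Fraction(1, 12), (0, 2): Fraction(1, 4),
        (1, 0): Fraction(1, 6), (1, 1): Fraction(1, 4),  (1, 2): Fraction(1, 12)}
assert sum(p_XY.values()) == 1

# Marginal pmf of Y: p_Y(y) = sum over x of p_XY(x, y).
p_Y = {}
for (x, y), p in p_XY.items():
    p_Y[y] = p_Y.get(y, Fraction(0)) + p

assert sum(p_Y.values()) == 1
assert p_Y[0] == Fraction(1, 3)   # 1/6 + 1/6
```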

3.3 INDEPENDENCE OF RANDOM VARIABLES

Two random variables and are said to be independent if

    (3.5)

for all subsets and of . It is an extremely difficult task to check this condition, as it involves uncountably many subsets of . Instead, we will discuss two simpler characterizations of the independence of two random variables.

Two random variables and are independent if and only if

    (3.6)

    for all and .

Proof If and are independent, then (3.5) is satisfied for all subsets and of . Then for all and ,

For the converse, we will only give the proof for a special case. Assume (3.6) is satisfied for all and , and suppose and are intervals, with and . Then


    by (3.1)

i.e., (3.5) is satisfied. Using this argument, one can prove the same result by induction for and being finite unions of intervals. The details are omitted.

If the pdf for random variables and exists, then and are independent if and only if

    (3.7)

    for all and .

Proof The following applies to all and . Assume that

exists, and that

(3.8)

    Then

    Conversely, assume (3.7) holds. Then


    Therefore, (3.7) and (3.8) are equivalent. The theorem is proved.

Remark For two discrete random variables and , the analog of Theorem 3.5 is that and are independent if and only if

    for all and .
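The discrete criterion just stated lends itself to a direct check: compute both marginals and compare the joint pmf with their product at every point. The product-form pmf below is a hypothetical example constructed to be independent.

```python
# Discrete independence criterion: X and Y are independent iff
# p_XY(x, y) = p_X(x) * p_Y(y) for every (x, y). The pmf below is a
# hypothetical example that satisfies this by construction.
p_X = {0: 0.25, 1: 0.75}
p_Y = {0: 0.5, 1: 0.5}
p_XY = {(x, y): p_X[x] * p_Y[y] for x in p_X for y in p_Y}

def is_independent(p_joint, tol=1e-12):
    """Check p(x, y) == p_X(x) * p_Y(y) at every support point."""
    px, py = {}, {}
    for (x, y), p in p_joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return all(abs(p - px[x] * py[y]) <= tol for (x, y), p in p_joint.items())

print(is_independent(p_XY))                         # True
print(is_independent({(0, 0): 0.5, (1, 1): 0.5}))   # False: X = Y here
```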

For three or more random variables, like events, we distinguish between pairwise independence and mutual independence.

Random variables are mutually independent if

    where is a subset of .

Random variables are pairwise independent if and are independent for any .

Random variables are mutually independent if and only if

    (3.9)

    for all , or equivalently,

    for all if exists.

    Proof Omitted.

If are mutually independent, then they are pairwise independent.

Proof If are mutually independent, then (3.9) holds. For any fixed , by setting for all in (3.9), the left hand side becomes , while the right hand side becomes

    since for all . Therefore,

or and are independent. Hence, we conclude that are pairwise independent.


We note, however, that pairwise independence does not imply mutual independence.

If are mutually independent, then so are for any functions , .

    Proof Consider any subsets of . Then

where we have invoked the mutual independence assumption to obtain the second equality. Thus are also mutually independent.

Random variables are independent and identically distributed (i.i.d.) if they are mutually independent and each has the same marginal distribution.

Let be i.i.d. binary random variables with

    and

    Then has distribution , i.e.,

    for .
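The binomial claim in the example above can be verified exactly, without simulation, by building the pmf of the sum through repeated convolution of the Bernoulli pmf. The values of n and p below are illustrative choices.

```python
from math import comb

# Exact check that the sum of n i.i.d. Bernoulli(p) variables has the
# binomial pmf C(n, k) * p**k * (1-p)**(n-k). n and p are illustrative.
n, p = 5, 0.3

pmf = [1.0]                        # pmf of an empty sum (always 0)
for _ in range(n):
    new = [0.0] * (len(pmf) + 1)
    for k, q in enumerate(pmf):
        new[k]     += q * (1 - p)  # this trial contributes a 0
        new[k + 1] += q * p        # this trial contributes a 1
    pmf = new

binom = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
print(all(abs(a - b) < 1e-12 for a, b in zip(pmf, binom)))  # True
```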

3.4 CONDITIONAL DISTRIBUTION

The distribution of a random variable represents our knowledge about the outcome of . Upon knowing the occurrence of a certain event or the outcome of another jointly distributed random variable, our knowledge of changes accordingly. Our new knowledge about is represented by the conditional distribution of .

3.4.1 CONDITIONING ON AN EVENT

We first consider the distribution of a single random variable conditioning on an event , where is a subset of and . Let be



    Figure 3.1. The CDF and pdf of a distribution.

    the CDF of . The CDF of conditioning on is defined by

    Then

    (3.10)

Note that satisfies the three properties of a CDF in Theorem 2.8. The pdf of conditioning on is defined by

    It is readily seen from (3.10) that

if
if .

Again, note that satisfies the two properties of a pdf in Theorem 2.16.

Consider the CDF and pdf in Figure 3.1(a) and (b) for random variable , respectively, and let be the subset of as shown in Figure 3.1(a). Then it is readily seen that



    Figure 3.2. The CDF and pdf conditioning on .

and and are illustrated in Figure 3.2 accordingly.

We now consider the joint distribution of two random variables and conditioning on an event , where is a subset of . The CDF of conditioning on is defined by

    (3.11)

    This is illustrated in Figure 3.3, where

    and

    Then from (3.11), we have

    (3.12)

Accordingly, the joint pdf of conditioning on is defined as

    (3.13)

It turns out that carrying out the above partial differentiation is considerably more complicated than the case when only one random variable is involved



    Figure 3.3. The set .

(see Appendix 3.A). So instead we will obtain by another approach. We observe that has to satisfy the following properties: For ,

    and for any small subset of with area ,

    (3.14)

    The left hand side above can be written as

    (3.15)

    Equating (3.15) and the right hand side of (3.14), we obtain

    Therefore,

if
if .



    Figure 3.4. The set .

Again, we note that satisfies the two properties of a joint pdf in Theorem 3.3.

3.4.2 CONDITIONING ON ANOTHER RANDOM VARIABLE

In this subsection, we will discuss the distribution of a random variable conditioning on the event , where and are jointly distributed random variables. For continuous, . Thus we are running into the problem of conditioning on an event with zero probability.

To get around this problem, instead of conditioning on which has zero probability, we condition on the event , or equivalently the event , which has nonzero probability, where

The set is illustrated in Figure 3.4. Eventually, we will take the limit as .

The joint pdf of and conditioning on , from the last subsection, is given by

if
otherwise.



    Figure 3.5. An illustration of .

By integrating over all , we obtain the conditional marginal pdf of as follows.

Taking the limit as , we define the pdf of conditioning on by

    (3.16)

Figure 3.5 is an illustration of for a fixed . Thus we see from (3.16) that is just the normalized version of for fixed . (For fixed , is a constant.) Similarly, we define

    (3.17)



    Figure 3.6. The pdf in Example 3.15.

    From (3.16) and (3.17), we have

    which is a form of Bayes’s Theorem.

    If , then and are independent.

    Proof If , then

    This implies , i.e., and are independent.

Consider the joint pdf for and (see Figure 3.6). We will first determine the normalizing constant . Then we will determine the marginal pdfs and , and show that and are not independent. In fact, we can tell immediately that and are not independent by inspecting Figure 3.6, because the possible range of depends on the value taken by .

We use Property 2 of a joint pdf in Theorem 3.3 to determine . Consider


which implies . Now, for ,

Thus

if
otherwise.

    By symmetry, we also have

if
otherwise.

Then for , , while . Therefore, and are not independent.

3.5 FUNCTIONS OF RANDOM VARIABLES

Suppose we are given two random variables and , and let . We are interested in the distribution of . Specifically, we want to determine the distribution of in terms of the joint distribution of and .

3.5.1 THE CDF METHOD

In this method, is obtained by considering



    Figure 3.7. An illustration of the set .

    Let . Then

    Figure 3.7 is an illustration of the set . Then

    (3.18)

    and the pdf of is obtained by

In order to differentiate the double integrals in (3.18), we use the chain rule for differentiation. Specifically,


    Similarly,

    Therefore,

The next example is extremely important and should be studied very carefully.

Let , where and are independent. Then

    Differentiating with respect to , we obtain


The integral above is called the convolution of and , denoted by . Since , by exchanging the roles of and in the above integral, we also have

In summary, if is equal to where and are independent, then .
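The convolution formula can be illustrated numerically. As an illustrative stand-in (not the example from the notes), take X and Y to be independent Uniform(0,1) random variables; the convolution of their densities is the triangular density on (0, 2), peaking at 1.

```python
# Numerical sketch of f_Z = f_X * f_Y for Z = X + Y with X, Y independent.
# Uniform(0,1) densities are an illustrative choice; their convolution is
# the triangular density on (0, 2).
n_grid = 100
dx = 1.0 / n_grid
f = [1.0] * n_grid                 # samples of the Uniform(0,1) pdf on [0, 1)

def convolve(f, g, dx):
    """Discrete approximation of (f * g)(z) = integral f(x) g(z - x) dx."""
    out = [0.0] * (len(f) + len(g))
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj * dx
    return out

fz = convolve(f, f, dx)            # approximates the pdf of Z on [0, 2)
print(round(fz[n_grid - 1], 2))    # peak of the triangle, near z = 1
```

The approximate density also integrates to 1, as any pdf must.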

3.5.2 THE PDF METHOD

This method considers directly. Specifically,

Our task is to express the right hand side as an expression times , so that is obtained by cancelling on both sides. This may not always be easy to do, but we will illustrate how this can possibly be done by an example.

    Let and be i.i.d. , so that

    for all and . Let . Then

    (cf. Figure 3.8)



    Figure 3.8. An illustration for the range of integration in polar coordinates.

    Cancelling on both sides, we have

    for all . Obviously, for all .

    The reader is encouraged to repeat the above example by the CDF method.
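A Monte Carlo check complements the derivation above. Assuming (as Figure 3.8's annular region of integration suggests) that the example is Z = √(X² + Y²) with X, Y i.i.d. standard Gaussian, Z has the Rayleigh density z·exp(−z²/2), whose integral over any interval has a closed form.

```python
import math
import random

# Monte Carlo check of the pdf-method example, assuming it is
# Z = sqrt(X**2 + Y**2) with X, Y i.i.d. N(0, 1); Z then has the
# Rayleigh density f_Z(z) = z * exp(-z**2 / 2) for z >= 0.
random.seed(0)
n = 200_000
zs = [math.hypot(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

# P(a <= Z < b) has the closed form exp(-a**2/2) - exp(-b**2/2).
a, b = 1.0, 2.0
empirical = sum(a <= z < b for z in zs) / n
exact = math.exp(-a * a / 2) - math.exp(-b * b / 2)
print(abs(empirical - exact))  # small sampling error
```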



    Figure 3.9. An illustration of the function .

3.6 TRANSFORMATION OF RANDOM VARIABLES

So far, we have discussed how to obtain the distribution of a single random variable obtained as a function of some given random variables. In this section, we will discuss how to obtain the joint distribution of a pair of random variables obtained as functions of a given pair of random variables.

Let , and let be the values taken by . Suppose for each , there is a unique pair satisfying

    for some functions and . Write , where

    and

Then is an invertible transformation. See Figure 3.9 for an illustration of , which takes a point in the - plane to the - plane. Our task is to determine .

Consider the rectangular element in the - plane in Figure 3.10. We label the four corners of this element by , , where . When we trace the line from to in the - plane, under the mapping , the curve between and is traced in the - plane. The same applies when we trace the lines from to , from to , and from to . When and are small, the region traced out in the - plane can be approximated by a parallelogram. Specifically, the vectors and are given by



    Figure 3.10. Two corresponding regions in the - plane and - plane.

Let denote the area of the rectangular element in the - plane, and let denote the area of the corresponding region in the - plane. Now observe that is in the rectangular element in the - plane if and only if is in the corresponding region in the - plane. Equating the probability of this event in terms of and , we have

so that

(3.19)

    where . So, we need to determine the value of as follows.


    Defining2

    (3.20)

    we have

    Therefore, from (3.19), we have

    (3.21)

Exchanging the roles of and in the above derivation and accordingly in (3.21), we obtain

    (3.22)

    Therefore, combining (3.21) and (3.22), we have

    which implies

    (3.23)

    This is the two-dimensional generalization of the result

    in elementary calculus.
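The reciprocal relationship (3.23) between the two Jacobian determinants can be sanity-checked numerically with finite differences. The invertible map below is an illustrative choice, not one from the notes.

```python
import math

# Numerical check of (3.23): the Jacobian determinant of the inverse map
# is the reciprocal of the forward Jacobian determinant. The map g below
# is an illustrative invertible transformation.
def g(x1, x2):
    return math.exp(x1), x1 + x2

def g_inv(z1, z2):
    return math.log(z1), z2 - math.log(z1)

def jac_det(f, u, v, h=1e-6):
    """2x2 Jacobian determinant of f at (u, v) by central differences."""
    du = [(f(u + h, v)[k] - f(u - h, v)[k]) / (2 * h) for k in (0, 1)]
    dv = [(f(u, v + h)[k] - f(u, v - h)[k]) / (2 * h) for k in (0, 1)]
    return du[0] * dv[1] - dv[0] * du[1]

x = (0.7, -0.2)
z = g(*x)
print(jac_det(g, *x) * jac_det(g_inv, *z))  # close to 1
```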

    Let

    2It is easy to see that we can also write (3.20) as


    Then

    Therefore,

Following the last example, this time we want to determine in terms of . From (3.23), we have

    Therefore,

    Alternatively, we can write

and apply (3.20) directly to obtain , but this would be considerably more difficult.

APPENDIX 3.A: DIFFERENTIATING

From (3.12), we have

    (3.A.1)

    where

    Following the steps toward proving Property 1 of Theorem 3.3, we have

    Let



    Figure 3.A.1. The set for and .

We leave it as an exercise for the reader to show that the expression in the square bracket above is equal to

    (3.A.2)

    (cf. Figure 3.A.1). If , then , so that (3.A.2) is equal to

    and hence

    Otherwise, , so that (3.A.2) vanishes, and

    Therefore, from (3.A.1), we have

if
if .

Chapter 4

EXPECTATION AND MOMENT GENERATING FUNCTIONS

We have already discussed the expectation of a single random variable in Chapter 2. In this chapter, we will discuss expectation as an operator when more than one random variable is involved. We will also introduce moment generating functions, a fundamental tool in probability theory.

The results will be proved by assuming that the random variables are continuous. The proofs for discrete random variables are analogous.

4.1 EXPECTATION AS A LINEAR OPERATOR

We have defined the expectation of a function of a random variable in Definition 2.25. When the function involves two random variables, we have the following definition.

The expectation of a function of random variables and is defined by

    for and discrete, and

    for and continuous.

One can regard expectation as an operator which inputs a random variable and outputs the expectation of that random variable. We will establish in the following theorem that expectation is a linear operator.

Expectation is a linear operator, i.e., for any real numbers and ,

    (4.1)



    Proof This can be proved by considering

    .

Proof This is a special case of Theorem 4.2 with equal to 0.

    Consider

    This is an alternative proof for Theorem 2.27 with simplified notation.

    .

    Proof This can be proved by considering

Let and be the mean and the variance of a random variable . Then has zero mean and unit variance.


    Proof To prove that , consider

    We will prove that in two steps. First,

    which is intuitively clear. Second, by Proposition 4.5, we have

    If and are independent, then

    (4.2)

Proof If and are independent, then so are and by Theorem 3.10. Therefore,

Remark The equation (4.1) is valid for any joint distribution of and . By contrast, (4.2) may not be valid if and are not independent.
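This contrast can be computed exactly on a small discrete example. The dependent pmf below (Y = X on two points) is a hypothetical illustration: linearity holds regardless, while the product rule fails.

```python
# Linearity (4.1) holds for any joint distribution, while the product rule
# (4.2) can fail without independence. Here X = Y, a clearly dependent pmf.
p = {(0, 0): 0.5, (1, 1): 0.5}

E = lambda h: sum(q * h(x, y) for (x, y), q in p.items())
EX, EY = E(lambda x, y: x), E(lambda x, y: y)

print(E(lambda x, y: 2 * x + 3 * y), 2 * EX + 3 * EY)  # equal: 2.5 2.5
print(E(lambda x, y: x * y), EX * EY)                  # differ: 0.5 0.25
```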


4.2 CONDITIONAL EXPECTATION

Let and be jointly distributed random variables. The expectation of conditioning on the event is defined by

We note that depends only on the value of , and hence is a function of . If the value of is not specified, then

    is a random variable, and more specifically, a function of the random variable. We now prove an important theorem regarding conditional expectation.

    .

    Proof Consider

Note that in the above, the inner expectation is a conditional expectation, while the outer expectation is an expectation of a function of .

Let and let be distributed uniformly on when . We now apply Theorem 4.9 to determine .


The above integral is the expectation of , which was shown in Example 2.1 to be . Therefore, .

Alternatively, we can consider

if
otherwise,

    so that

if
otherwise.

    Then

    and

However, evaluating these integrals is quite difficult.
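Theorem 4.9 itself can be verified exactly on a small discrete joint pmf, which avoids the difficult integrals entirely; the pmf below is an illustrative choice, not the one from the example.

```python
# Exact check of Theorem 4.9, E[E[X|Y]] = E[X], on a made-up discrete pmf.
p = {(1, 0): 0.1, (2, 0): 0.3, (1, 1): 0.4, (2, 1): 0.2}

p_Y = {}
for (x, y), q in p.items():
    p_Y[y] = p_Y.get(y, 0.0) + q

def cond_exp_X(y):
    """E[X | Y = y] = sum_x x * p(x, y) / p_Y(y)."""
    return sum(x * q for (x, yy), q in p.items() if yy == y) / p_Y[y]

EX = sum(x * q for (x, _), q in p.items())
tower = sum(p_Y[y] * cond_exp_X(y) for y in p_Y)
print(EX, tower)  # both equal 1.5 (up to rounding)
```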

4.3 COVARIANCE AND SCHWARTZ’S INEQUALITY

Let and be random variables, and consider . We are naturally interested in the relation between the variance of and the variances of and . Toward this end, consider

Define

(4.3)

    so that we can write

    When and are independent, by Theorem 3.10,


so that

(4.4)

When , or equivalently

or equivalently (4.4) holds, and are said to be uncorrelated. As shown, and are uncorrelated if and are independent. However, the converse is not true.

Consider the following joint pmf for random variables and :

where . Then it is easy to verify that and are uncorrelated but they are not independent.
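The pmf of the example above is not reproduced here; the classic construction below (X uniform on {-1, 0, 1} and Y = X²) exhibits the same phenomenon and can be checked exactly.

```python
# Uncorrelated but not independent: X uniform on {-1, 0, 1}, Y = X**2.
p = {(-1, 1): 1/3, (0, 0): 1/3, (1, 1): 1/3}

E = lambda h: sum(q * h(x, y) for (x, y), q in p.items())
EX, EY = E(lambda x, y: x), E(lambda x, y: y)
cov = E(lambda x, y: x * y) - EX * EY
print(cov)  # 0.0, so X and Y are uncorrelated

# Yet they are not independent:
# P(X = 0, Y = 0) = 1/3, while P(X = 0) * P(Y = 0) = 1/9.
print(1/3, (1/3) * (1/3))
```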

    .

    Proof This can be proved by considering

In fact, it is readily seen from (4.3) and Proposition 4.12 that . Thus the variance of a random variable can be regarded as the covariance between and itself.

We now prove an inequality known as Schwartz’s inequality. This inequality gives an upper bound on the magnitude of in terms of the second moments of and .

    .

Proof Let and be any fixed jointly distributed random variables. Let be any real number, and consider the random variable . Since this random variable can only take nonnegative values, we have

    (4.5)


    or

    Rearranging the terms, we have

    (4.6)

Since and are fixed random variables, , , and in the above are fixed real numbers. Therefore, the left hand side above is quadratic in , and it achieves its minimum when

Since (4.5) holds for all , it holds for . Substituting into (4.6), we have

    Hence,

    proving the theorem.

    Schwartz’s inequality is tight, i.e.,

if and only if is proportional to .

Proof From the proof of the last theorem, we see that Schwartz’s inequality is simply the inequality in (4.5) with . Therefore, Schwartz’s inequality is tight if and only if

(4.7)

which in turn holds if and only if with probability 1, , or

, implying that is proportional to . On the other hand, if is proportional to , i.e., for some constant , then it is readily verified that Schwartz’s inequality is tight.

Denote and by and , respectively. Now in Schwartz’s inequality, if we replace by and by 1, we immediately obtain

    1This is valid because and can be any random variables.


    or

    where we have invoked Proposition 4.12. Thus,

    (4.8)

The ratio

called the coefficient of correlation, is a measure of the degree of correlation between and . It is readily seen from (4.8) that

When , we say that and are uncorrelated. When , we say that and are positively correlated. When , we say that and are negatively correlated.

When , is simply equal to . When ,

    Thus

Roughly speaking, when is large, also tends to be large, so that on the average the magnitude (square) of is reinforced. When ,

    Thus

Roughly speaking, when is large, tends to be small, and when is small, tends to be large, so that on the average the magnitude (square) of is suppressed.

We have already seen that when and are independent, they are uncorrelated, or . When ,


    so that

    and

    by Proposition 4.5. When ,

    so that

    and

4.4 MOMENT GENERATING FUNCTIONS

The moments of a random variable (or of a distribution) were introduced in Section 2.6. In this section, we introduce the moment generating function of a random variable , denoted by , where the parameter is a real number.

The moment generating function of a random variable (or of the distribution of ) is defined by

    (4.9)

    where is a real number. For the discrete case,

    (4.10)

    and for the continuous case,

    (4.11)

is said to exist if the summation in (4.10) or the integral in (4.11) converges for some range of .

Note that for a fixed , is a function of . So is simply the expectation of this particular function of , and its value depends on . Therefore, is a function of .

    .


    Proof By letting in (4.9), we have

proving the corollary. This equality can be interpreted by letting in (4.10) and (4.11), which gives nothing but the normalization conditions for and .

The moment generating function can be regarded as the “transform” of the distribution. In (4.10), if , by letting so that , the right hand side becomes

which is the -transform of the pmf (with being the time index). In (4.11), by letting , becomes the Fourier transform of the pdf

(with being the index of time). (But of course, in is actually a real number.)

It can be shown that there is a one-to-one correspondence between the distribution of and the moment generating function of . That is, for each moment generating function, there is a unique distribution corresponding to it. In other words, the distribution of a random variable can (in principle) be recovered from its moment generating function.

The moments of can be obtained by successive differentiation of with respect to . Specifically,

and in general

(4.12)

    where . Then by putting in (4.12), we obtain

    That is why is called the moment generating function of .
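The moment-generating property (4.12) can be checked numerically: approximate the derivatives at 0 by finite differences and compare with the known moments. A Bernoulli(p) variable, for which E[X] = E[X²] = p, is used as an illustrative choice.

```python
import math

# Numerical illustration of (4.12): the k-th moment is the k-th derivative
# of M_X(s) at s = 0. For X ~ Bernoulli(p), M_X(s) = 1 - p + p * e**s,
# and E[X] = E[X**2] = p. The value of p is illustrative.
p = 0.3
M = lambda s: 1 - p + p * math.exp(s)

h = 1e-4
first  = (M(h) - M(-h)) / (2 * h)              # M'(0)  ~ E[X]
second = (M(h) - 2 * M(0) + M(-h)) / (h * h)   # M''(0) ~ E[X**2]
print(round(first, 6), round(second, 6))       # both close to 0.3
```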

    If and are independent, then .


    Proof This can be proved by considering

We know from Example 3.17 that if and are independent, then has pdf given by . The above theorem says that , so it is actually analogous to the convolution theorem in linear system theory.

    Let . Then for ,

Let be the geometric random variable with parameter , where , i.e.,

    otherwise.

    Then


    where we assume that , or .

Remark: We use the convention adopted by most books in probability in defining the geometric distribution in the above example. A random variable with

otherwise

is said to have a truncated geometric distribution with parameter . If we toss a coin with probability of obtaining a head equal to repeatedly, then it is easy to see that the first time a head is obtained has a truncated geometric distribution with parameter . Note that if is geometric with parameter , then is truncated geometric with parameter .

In the next example, we give a powerful application of the moment generating function.

Let be i.i.d. random variables each , be an independent truncated geometric random variable (independent of the ’s) with parameter , and . That is, is the sum of a truncated-geometric number of i.i.d. exponential random variables. We now determine the distribution of via . Consider


    where

    a) follows because ’s are independent, and they remain to be so whenconditioning on because is independent of ’s;

    b) follows because is independent of ’s;

    c) follows because are i.i.d.;

    d) follows because .

    From the form of , we see that .
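The conclusion of this example can also be checked by simulation. Assuming the X_i are Exp(lam) and N is truncated geometric with parameter p (support 1, 2, ...), the MGF computation shows that S is exponential with parameter p·lam, so E[S] = 1/(p·lam). The values of p and lam below are illustrative.

```python
import random

# Monte Carlo sketch: S = X_1 + ... + X_N with N truncated geometric(p)
# and X_i i.i.d. Exp(lam) is exponential with parameter p * lam, so
# E[S] = 1 / (p * lam). p and lam are illustrative choices.
random.seed(1)
p, lam = 0.25, 2.0
trials = 100_000

def sample_S():
    n = 1
    while random.random() >= p:      # tosses until the first head
        n += 1
    return sum(random.expovariate(lam) for _ in range(n))

mean_S = sum(sample_S() for _ in range(trials)) / trials
print(mean_S)  # close to 1 / (p * lam) = 2.0
```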

Chapter 5

    LIMIT THEOREMS

Recall that in a random experiment, the outcome is drawn from the sample space according to the probability measure , and a random variable is a function of . Now suppose we construct random variables , where . Then is called a stochastic process (or random process).

Since a stochastic process involves an infinite number of random variables, it cannot be described by a joint distribution. In fact, a stochastic process can be very complicated in general. But if the ’s are i.i.d., in which case we say that is an i.i.d. process, it is relatively easy to describe. The process of tossing a coin repeatedly can be modeled as an i.i.d. process.

Limit theorems deal with the asymptotic behaviors of stochastic processes, and they play an important role in probability theory. In this chapter, we will discuss the two most fundamental such theorems, namely the weak law of large numbers (WLLN) and the central limit theorem (CLT).

5.1 THE WEAK LAW OF LARGE NUMBERS

Consider tossing a fair coin times. Intuitively, when is large, the relative frequency of obtaining a head should be approximately equal to 0.5. The WLLN makes this idea precise, as we will see, and it gives a physical meaning to the probability of an event.

For any random variable and any ,

    Proof The inequality can be proved by considering



The interpretation of the Chebyshev inequality is as follows. Here, is the event that deviates from its mean by at least . The upper bound on the probability of the event is such that:

    if is large, then it is large;

    if is small, then it is large.

In fact, sometimes this upper bound can be larger than 1, making it useless. In general, the Chebyshev inequality is rather loose.
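Both points, the validity of the bound and its occasional uselessness, can be seen exactly on a small discrete distribution; the pmf below is an illustrative choice.

```python
# Exact illustration of the Chebyshev inequality,
# P(|X - mu| >= eps) <= var / eps**2, on a small made-up pmf.
pmf = {-2: 0.1, 0: 0.8, 2: 0.1}

mu  = sum(x * p for x, p in pmf.items())                 # 0.0
var = sum((x - mu) ** 2 * p for x, p in pmf.items())     # 0.8

for eps in (0.5, 1.0, 2.0, 3.0):
    lhs = sum(p for x, p in pmf.items() if abs(x - mu) >= eps)
    rhs = var / eps ** 2
    print(eps, lhs, rhs, lhs <= rhs)
# For eps = 0.5 the bound is 3.2, larger than 1 and hence useless;
# for eps = 2.0 the bound happens to be tight for this pmf.
```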

Let be i.i.d. random variables with mean and variance . Then

    (5.1)

Proof Let

(5.2)

    The WLLN is actually an application of the Chebyshev inequality to . Now

    and


where a) follows from Proposition 4.5 and b) follows because are independent. Then by the Chebyshev inequality, we have

    or

We note that in (5.1), the upper bound tends to zero for all finite and as . Thus the WLLN says that as long as is sufficiently large, the probability that the average of , i.e., , deviates from by any prescribed amount is arbitrarily small. Technically, we say that

    in probability.

We now give another interpretation of the WLLN. Let be the generic random variable and be its density function, i.e.,

It can readily be shown that for any . Then by Theorem 4.18,

    so that

    Thus the WLLN says that

regardless of the actual form of .

Having proved the WLLN, we now show an important consequence of it.

Consider any subset of , the alphabet of the generic random variable . We are interested in the relative frequency of occurrence of in the first trials, say. More precisely, we define the random variable (called an indicator function)

if
if ,


    so that

Since are i.i.d., so are , because is a function of . Then the relative frequency of occurrence of is given by

    and by the WLLN,

    in probability.

Therefore, the probability of an event can “physically” be interpreted as the long-term relative frequency of occurrence of the event when the experiment is repeated indefinitely.

The probability of obtaining a head by tossing a fair coin is 0.5. If the coin is tossed a large number of times, the relative frequency of occurrence of a head is close to 0.5 with probability almost 1.
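The coin-tossing statement above is easy to observe in a short simulation; the sample sizes below are illustrative.

```python
import random

# Simulation sketch of the WLLN for a fair coin: the relative frequency
# of heads in n tosses settles near 0.5 as n grows.
random.seed(2)
for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)  # the frequency approaches 0.5 as n grows
```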

5.2 THE CENTRAL LIMIT THEOREM

The Gaussian distribution was introduced in Example 2.29. It is one of the most important distributions in probability theory because of its wide applications. Noise in communication systems, the marks in an examination, etc., can be modeled by the Gaussian distribution. In this section, we will discuss the central limit theorem (CLT), which explains the origin of this important distribution. The theorem is first stated below.

Let be i.i.d. with zero mean and unit variance, and let

    (5.3)

    Then


The reader should compare the definition of in (5.2) for the WLLN with the definition of here, and observe that for the scaling factor is while for it is . To understand the CLT properly, we note that

    since has zero mean, and

    where

    a) follows from Proposition 4.5;

    b) follows because ’s are mutually independent (cf. Section 4.3);

    c) follows since has unit variance.

So the only thing the CLT says about is that it has a Gaussian distribution; the mean and the variance of are already fixed from the setup.

Before we prove the CLT, we first derive the moment generating function of a Gaussian distribution.

    Let . Then .

    Proof Consider


where the last step follows because the density function of integrates to 1, proving the lemma.

    Let . Then .

Let , , where and are independent, and let . Then .

    Proof The proof is left as an exercise.

This corollary says that the sum of two independent Gaussian random variables is again Gaussian.

Proof of Theorem 5.4 We give an outline of the proof as follows. Denote the generic random variable of by . Consider the second order approximation of via the Taylor expansion1 for small :

Since and , by the assumption that has zero mean and unit variance, we have

    and

    Therefore,

    for small . By Theorem 4.18, we have

    1The Taylor expansion for a function at is given by

    where denotes the th derivative of .


    Then for large ,

where the above approximation is justified because is small for large . Then

    where we have used the identity

    Finally, we conclude by Corollary 5.6 that

    This proves the CLT.

The version of the CLT we have proved in Theorem 5.4 applies only when the ’s have zero mean and unit variance. In order to apply it to a general i.i.d. process, we need the following scaling property of Gaussian distributions.

Let , and . Then . That is, again has a Gaussian distribution.

    Proof Consider

    Thus .


In this example, we give a useful application of Proposition 5.8. Suppose , and we are interested in for some constant . Now

    is equivalent to

    By defining , the above is equivalent to

We see from Proposition 4.6 that has zero mean and unit variance. Moreover, by Proposition 5.8, is in fact a Gaussian random variable. Therefore,

    where

    is called the standard normal CDF.
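The standard normal CDF has no elementary closed form, but it can be evaluated through the error function, which is available in the Python standard library. The values of the mean, standard deviation, and threshold below are illustrative.

```python
import math

# The standard normal CDF can be written via the error function:
# Phi(x) = (1 + erf(x / sqrt(2))) / 2.
def phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# As in the example, P(X > a) for X ~ N(mu, sigma**2) reduces to
# 1 - Phi((a - mu) / sigma). mu, sigma, a are illustrative numbers.
mu, sigma, a = 10.0, 2.0, 13.0
print(1.0 - phi((a - mu) / sigma))  # P(X > 13), about 0.0668
```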

Now suppose is i.i.d. with mean and variance . Then by Proposition 4.6, , where

    (5.4)

    is i.i.d. with zero mean and unit variance. From (5.4), we have

    so that

    When is large, by the CLT,


    Then by Proposition 5.8,

That is, the sum of a large number of i.i.d. random variables is approximately Gaussian. Again, the mean and the variance of the above Gaussian distribution are determined from the setup and have nothing to do with the CLT.
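The approximate Gaussianity of such a sum is easy to see in a simulation. Uniform(0,1) summands (mean 1/2, variance 1/12) are an illustrative choice; the standardized sum should match the standard normal CDF.

```python
import math
import random

# Simulation sketch of the CLT: the standardized sum of n i.i.d. variables
# is approximately N(0, 1). Uniform(0,1) summands are illustrative.
random.seed(4)
n, trials = 30, 50_000
mu, sigma = 0.5, math.sqrt(1 / 12)

zs = [
    (sum(random.random() for _ in range(n)) - n * mu) / (sigma * math.sqrt(n))
    for _ in range(trials)
]
frac = sum(z <= 1.0 for z in zs) / trials
print(frac)  # close to the standard normal CDF value Phi(1), about 0.8413
```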