
Dr V. Andasari http://people.bu.edu/andasari/courses/stochasticmodeling/stochastics.html

Lecture 2: Brief Review on Probability Theory

1 Terminology & Basics

The sample space, usually denoted S, is the set of all possible outcomes of a random study/experiment [2].

Any subset of the sample space is known as an event, usually denoted with capital letters A, B, C, · · · [2, 3]. For example, A ⊂ S, where “⊂” denotes “is a subset of”.

Since events and sample spaces are just sets, the following is the algebra of sets [3]:

• ∅ is the null set, or empty set.

• A ∪ B = union = the elements in A or B or both.

• C ∩ D = intersection = the elements in C and D. If C ∩ D = ∅, then C and D are called mutually exclusive events, or disjoint events.

• A′ = Ac = complement = the elements not in A.

• If E ∪ F ∪ G ∪ · · · = S, then E, F, G, and so on are called exhaustive events.

Probability is a (real-valued) set function P that assigns to each event in the sample space a number P(·) [3]. For example, the probability of event A equals:

P(A) = N(A)/n ,

where n is the number of times an experiment is performed and N(A) is the number of times the event A of interest occurs.

The probability of an event A in the sample space S satisfies the following three conditions [2]:

(i) 0 ≤ P (A) ≤ 1.


(ii) P (S) = 1.

(iii) For any sequence of events A1, A2, · · · that are mutually exclusive, that is, events for which An ∩ Am = ∅ when n ≠ m, then

P(⋃_{n=1}^{∞} An) = ∑_{n=1}^{∞} P(An) .

Five theorems of probability [3]:

• Theorem 1: Since the events A and Ac are always mutually exclusive and since A ∪ Ac = S, we have that

1 = P(S) = P(A ∪ Ac) = P(A) + P(Ac) ,

or P(A) = 1 − P(Ac) .

• Theorem 2: P (∅) = 0.

• Theorem 3: If events A and B are such that A ⊆ B, then P (A) ≤ P (B).

• Theorem 4: P (A) ≤ 1.

• Theorem 5: For any two events A and B,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ,

or equivalently

P(A) + P(B) = P(A ∪ B) + P(A ∩ B) .

Note that when A and B are mutually exclusive, that is when A ∩ B = ∅, then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = P(A) + P(B) .
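As an aside (not part of the original notes), Theorem 5 can be checked numerically on a small sample space. The sketch below uses one roll of a fair die, with events A and B chosen purely for illustration.

# Quick numerical check of Theorem 5 (inclusion-exclusion) for one roll of a
# fair die. A and B are illustrative choices of events.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}                      # sample space
A = {2, 4, 6}                               # event "roll is even"
B = {4, 5, 6}                               # event "roll is at least 4"

def prob(event):
    """Probability of an event under the uniform measure on S."""
    return Fraction(len(event), len(S))

lhs = prob(A | B)                           # P(A ∪ B)
rhs = prob(A) + prob(B) - prob(A & B)       # P(A) + P(B) - P(A ∩ B)
print(lhs, rhs, lhs == rhs)                 # 2/3 2/3 True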

Random variables are real-valued functions defined on the sample space [2].

Quantitative data are called discrete if the sample space contains a finite or countably infinite number of values [3].

Quantitative data are called continuous if the sample space contains an interval or continuous span of real numbers [3].

Probability notations that we use throughout our lecture notes for events and random variables are distinguished by the type of brackets:

• Probability of events is denoted: P (·)

• Probability of random variables is denoted: P{·}


2 Discrete Random Variables

A random variable is said to be discrete if:

• it can take on a finite number of possible outcomes, or

• it can take on a countably infinite number of possible outcomes.

Let X be a discrete random variable. The probability that X takes on a particular value x, that is, P{X = x}, is denoted f(x), or

f(x) = P{X = x} . (1)

The function f(x) is called the probability mass function and is commonly abbreviated the p.m.f.

The probability mass function f(x) has the following properties [3]:

(1) P{X = x} = f(x) > 0 if x ∈ S, where S is the sample space.

(2) ∑_{x∈S} f(x) = 1 .

(3) P{X ∈ A} = ∑_{x∈A} f(x) .

The cumulative distribution function, commonly abbreviated the c.d.f., of the random variable X is defined by:

FX(t) = P{X ≤ t} . (2)

The cumulative distribution function FX(t) has the following properties:

(1) FX(t) is a nondecreasing function of t, for −∞ < t <∞.

(2) FX(t) ranges from 0 to 1:

• If the minimum value of X is a, then FX(a) = P{X ≤ a} = P{X = a} = fX(a).

• For any c less than a, FX(c) = 0.

• If the maximum value of X is b, then FX(b) = 1.
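As a quick illustration (not part of the original notes), the p.m.f. properties and the c.d.f. definition can be checked numerically; the fair six-sided die below is an assumption made only for concreteness.

# Minimal sketch: p.m.f. of a fair die and its c.d.f. F_X(t) = P{X <= t}.
from fractions import Fraction

f = {x: Fraction(1, 6) for x in range(1, 7)}     # p.m.f.: f(x) = 1/6 on S = {1,...,6}

assert sum(f.values()) == 1                      # property (2): probabilities sum to 1

def F(t):
    """Cumulative distribution function F_X(t) = P{X <= t}."""
    return sum(p for x, p in f.items() if x <= t)

print(F(0), F(3), F(6))                          # 0, 1/2, 1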

Let u(X) be a function of the discrete random variable X; the simplest case is u(X) = X.


The expectation of u(X) is E[u(X)] = ∑_{x∈S} u(x) f(x). In particular, for u(X) = X this is the expected value of X:

E(X) = ∑_{x∈S} x f(x) . (3)

E(X) is also called the mean of X, and is denoted µ, that is µ = E(X).

When u(X) = (X − µ)², the expectation of u(X),

E(u(X)) = E((X − µ)²) = ∑_{x∈S} (x − µ)² f(x) , (4)

is called the variance of X and is denoted Var(X) = σ². Alternatively,

Var(X) = E((X − µ)²)
       = E(X² − 2µX + µ²)
       = E(X²) − 2µE(X) + E(µ²)
       = E(X²) − 2µ·µ + µ²
       = E(X²) − µ² . (5)

Taking the positive square root of the variance gives us the standard deviation of X, denoted

σ = √Var(X) . (6)

The mean is a measure of the middle of the distribution of X, whereas the variance and standard deviation quantify the spread or dispersion of the values in the sample space S.

Another way to calculate the mean and variance is via the moment generating function. Let φ(t) denote the moment generating function of the discrete random variable X, defined for all values t by

φ(t) = E(e^{tX}) = ∑_k e^{tk} f(k) . (7)

The function φ(t) is called the moment generating function because all of the moments of X can be obtained by successively differentiating φ(t). For example,

φ′(t) = (d/dt) E(e^{tX}) = E((d/dt) e^{tX}) = E(X e^{tX}) .

The mean can be obtained by letting t = 0,

φ′(0) = E(X e^{0·X}) = E(X) . (8)


Similarly, taking the second derivative,

φ′′(t) = (d/dt) φ′(t) = (d/dt) E(X e^{tX}) = E((d/dt) X e^{tX}) = E(X² e^{tX}) ,

and by letting t = 0,

φ′′(0) = E(X² e^{0·X}) = E(X²) , (9)

we obtain the variance of X,

Var(X) = E(X²) − (E(X))² = φ′′(0) − (φ′(0))² . (10)
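As a small illustration (not part of the original notes, and assuming SymPy is available), the recipe of Eqs. (7)-(10) can be carried out symbolically for a toy p.m.f.; the p.m.f. values below are made up for the example.

# Build phi(t) for a toy p.m.f. and recover the mean and variance by
# differentiating at t = 0, as in Eqs. (7)-(10).
import sympy as sp

t = sp.symbols('t')
pmf = {0: sp.Rational(1, 4), 1: sp.Rational(1, 2), 2: sp.Rational(1, 4)}  # toy p.m.f.

phi = sum(sp.exp(t * k) * p for k, p in pmf.items())    # phi(t) = sum_k e^{tk} f(k)

mean = sp.diff(phi, t).subs(t, 0)                       # phi'(0)  = E(X)
second_moment = sp.diff(phi, t, 2).subs(t, 0)           # phi''(0) = E(X^2)
variance = sp.simplify(second_moment - mean**2)         # Eq. (10)

print(mean, variance)                                   # 1, 1/2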

Discrete random variables are often classified according to their probability mass functions.

2.1 The Bernoulli Random Variable

From a trial or an experiment with the outcome either a “success” or a “failure”, if we let

X = 1 if the outcome is a success, and

X = 0 if the outcome is a failure,

then the random variable X is said to be a Bernoulli random variable if its probability mass function is given by

f(0) = P{X = 0} = 1 − p , (11)

f(1) = P{X = 1} = p , (12)

where p is the probability that the trial is a “success”, for 0 ≤ p ≤ 1.

Alternatively, the probability mass function of the Bernoulli random variable X can be written

f(k) = P{X = k} = p^k (1 − p)^{1−k} , (13)

for k ∈ {0, 1}.


The cumulative distribution function of a Bernoulli random variable X is given by

FX(k) =
    0        for k < 0
    1 − p    for 0 ≤ k < 1
    1        for k ≥ 1        (14)

Since f(0) = 1 − p and f(1) = p, the mean is

µ = E(X) = 1 × p + 0 × (1 − p) = p . (15)

With

E(X²) = 1² × p + 0² × (1 − p) = p ,

the variance is

σ² = Var(X) = E(X²) − (E(X))² = p − p² = p(1 − p) . (16)

Figure 1: The probability mass function (left) and cumulative distribution function (right) of a Bernoulli random variable.
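As an illustration (not from the notes), Bernoulli(p) samples can be drawn in NumPy as binomial samples with n = 1; the value of p below is arbitrary, and the empirical mean and variance should be close to p and p(1 − p).

# Illustrative sketch: Bernoulli(p) samples via NumPy (a Bernoulli trial is a
# binomial trial with n = 1).
import numpy as np

rng = np.random.default_rng(seed=0)
p = 0.3
x = rng.binomial(n=1, p=p, size=100_000)   # array of 0s and 1s

print(x.mean(), x.var())                   # approximately 0.3 and 0.21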

2.2 The Binomial Random Variable

Suppose that in each of n independent trials, the result is either:

a “success” with probability p,

a “failure” with probability 1− p.


If X represents the number of successes that occur in the n trials, then X is said to be a binomial random variable with parameters (n, p). The probability mass function of a binomial random variable is given by

f(k) = P{X = k} = C(n, k) p^k (1 − p)^{n−k} , for k = 0, 1, 2, · · · , n (17)

where

C(n, k) = n! / (k!(n − k)!) (18)

is the binomial coefficient, which equals the number of different groups of k objects that can be chosen from a set of n objects.

To check that f(k) is a probability mass function, note that by the binomial theorem the probabilities sum to one:

∑_{k=0}^{n} f(k) = ∑_{k=0}^{n} C(n, k) p^k (1 − p)^{n−k} = (p + (1 − p))^n = 1

Proof:

According to the binomial theorem:

(x + y)^n = ∑_{i=0}^{n} C(n, i) x^{n−i} y^i = ∑_{i=0}^{n} C(n, i) x^i y^{n−i} (19)

hence

∑_{k=0}^{n} f(k) = ∑_{k=0}^{n} C(n, k) p^k (1 − p)^{n−k} = (p + (1 − p))^n = 1 . ∎

The cumulative distribution function of the binomial random variable X evaluated at t is given by

FX(t) = ∑_{k=0}^{t} C(n, k) p^k (1 − p)^{n−k} , for t = 0, 1, 2, · · · , n (20)


From the moment generating function

φ(t) = E(e^{tX}) = ∑_k e^{tk} f(k)
     = ∑_{k=0}^{n} e^{tk} C(n, k) p^k (1 − p)^{n−k}
     = ∑_{k=0}^{n} C(n, k) (p e^t)^k (1 − p)^{n−k}
     = (p e^t + (1 − p))^n (21)

and using the chain rule,

φ′(t) = dφ/dt = (dφ/du) × (du/dt) ,

where u = p e^t + (1 − p), therefore du/dt = p e^t, and φ = u^n, therefore dφ/du = n u^{n−1}; hence

φ′(t) = n (p e^t + (1 − p))^{n−1} p e^t .

From this we can obtain the mean,

E(X) = φ′(0) = n (p e^0 + (1 − p))^{n−1} p e^0 = np . (22)

After differentiating a second time,

φ′′(t) = n(n − 1)(p e^t + 1 − p)^{n−2} (p e^t)² + n (p e^t + 1 − p)^{n−1} p e^t ,

we obtain the variance,

σ² = Var(X) = φ′′(0) − (φ′(0))² = np(1 − p) . (23)
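As a quick numerical check (not part of the notes), the binomial p.m.f. (17) can be evaluated with math.comb and compared against Eqs. (22) and (23); the values of n and p below are arbitrary.

# Evaluate the binomial p.m.f. and verify that the probabilities sum to 1 and
# that the mean and variance equal np and np(1-p).
from math import comb

n, p = 10, 0.4
f = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]   # f(k), Eq. (17)

mean = sum(k * fk for k, fk in enumerate(f))
var = sum(k**2 * fk for k, fk in enumerate(f)) - mean**2

print(sum(f))                  # approximately 1.0
print(mean, n * p)             # both approximately 4.0
print(var, n * p * (1 - p))    # both approximately 2.4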

2.3 The Geometric Random Variable

Suppose that independent trials are performed until a success occurs, where the probability of a success is denoted p. If we let X be the number of trials required until the first success occurs, then X is called a geometric random variable with parameter p.

The probability mass function of a geometric random variable is given by

f(k) = P{X = k} = (1 − p)^{k−1} p , for k = 1, 2, 3, · · · (24)


Figure 2: The probability mass function (left) and cumulative distribution function (right) of a binomial random variable.

To check that f(k) is a probability mass function, the probabilities sum to one:

∑_{k=1}^{∞} f(k) = p ∑_{k=1}^{∞} (1 − p)^{k−1} = 1

Proof:

The sum

S = p ∑_{k=1}^{∞} (1 − p)^{k−1} = p + p(1 − p) + p(1 − p)² + p(1 − p)³ + · · ·

is a geometric series with first term a = p and common ratio r = 1 − p. Its sum to infinity is given by

S∞ = a/(1 − r) = p/(1 − (1 − p)) = 1 . ∎

The cumulative distribution function of a geometric random variable X is given by

FX(k) = P{X ≤ k} = 1 − (1 − p)^k . (25)

By equations (3) and (24),

E(X) = ∑_{k=1}^{∞} k p (1 − p)^{k−1} = p ∑_{k=1}^{∞} k u^{k−1} ,


where u = 1 − p. Then, the mean is given by

E(X) = p ∑_{k=1}^{∞} (d/du)(u^k) = p (d/du) ∑_{k=1}^{∞} u^k ,

where ∑_{k=1}^{∞} u^k is a geometric series with first term a = u and common ratio r = u, whose sum to infinity is, as above,

S∞ = a/(1 − r) = u/(1 − u) .

Hence,

E(X) = p (d/du)(u/(1 − u)) = p/(1 − u)² = 1/p . (26)

Alternatively, the mean can be obtained from the moment generating function,

φ(t) = E(e^{tX}) = ∑_k e^{tk} f(k)
     = ∑_{k=1}^{∞} e^{tk} (1 − p)^{k−1} p
     = p ∑_{k=1}^{∞} e^{tk} (1 − p)^{k−1} .

Written out, the series is

S = p e^t + p e^{2t}(1 − p) + p e^{3t}(1 − p)² + p e^{4t}(1 − p)³ + · · ·

Since it is a geometric series with a = p e^t and r = e^t(1 − p), the moment generating function is given by

φ(t) = S∞ = p e^t / (1 − e^t(1 − p)) . (27)


Using the quotient rule, we obtain the first derivative,

φ′(t) = [p e^t(1 − e^t + p e^t) − p e^t(−e^t + p e^t)] / (1 − e^t + p e^t)² ,

and by letting t = 0, the mean is given by

E(X) = µ = φ′(0) = [p(1 − 1 + p) − p(−1 + p)] / (1 − 1 + p)² = 1/p . (28)

Similarly, after differentiating a second time, we obtain the variance

σ² = Var(X) = φ′′(0) − (φ′(0))² = (1 − p)/p² . (29)

Figure 3: The probability mass function (left) and cumulative distribution function (right) of a geometric random variable.
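As an illustration (not from the notes), NumPy's geometric sampler uses the same convention as Eq. (24), returning the number of trials up to and including the first success, so its sample mean and variance can be compared against Eqs. (26) and (29); the value of p below is arbitrary.

# Illustrative sketch: geometric samples and their empirical mean/variance.
import numpy as np

rng = np.random.default_rng(seed=0)
p = 0.25
x = rng.geometric(p, size=100_000)

print(x.mean(), 1 / p)              # both close to 4
print(x.var(), (1 - p) / p**2)      # both close to 12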

2.4 The Poisson Random Variable

The probability mass function of a Poisson random variable with parameter λ, for some λ > 0, is given by

f(k) = P{X = k} = e^{−λ} λ^k / k! , for k = 0, 1, 2, · · · (30)

The probabilities sum to one:

∑_{k=0}^{∞} f(k) = e^{−λ} ∑_{k=0}^{∞} λ^k / k! = e^{−λ} e^{λ} = 1 .


Proof:

The Maclaurin series expansion for e^x is

e^x = 1 + x + x²/2! + x³/3! + x⁴/4! + · · ·

so that, with x = λ,

∑_{n=0}^{∞} λ^n / n! = 1 + λ + λ²/2! + λ³/3! + λ⁴/4! + · · · = e^λ ,

and therefore

e^{−λ} ∑_{k=0}^{∞} λ^k / k! = e^{−λ} e^{λ} = 1 . ∎

The cumulative distribution function is given by,

FX(k) = P{X ≤ k} = e^{−λ} ∑_{n=0}^{k} λ^n / n! . (31)

The mean can be obtained from

µ = E(X) = ∑_{n=0}^{∞} n e^{−λ} λ^n / n!
         = ∑_{n=1}^{∞} e^{−λ} λ^n / (n − 1)!
         = λ e^{−λ} ∑_{n=1}^{∞} λ^{n−1} / (n − 1)!
         = λ e^{−λ} ∑_{k=0}^{∞} λ^k / k!
         = λ e^{−λ} e^{λ}
         = λ . (32)


Alternatively, from the moment generating function

φ(t) = E(e^{tX}) = ∑_{k=0}^{∞} e^{tk} e^{−λ} λ^k / k!
     = e^{−λ} ∑_{k=0}^{∞} e^{tk} λ^k / k!
     = e^{−λ} ∑_{k=0}^{∞} (λ e^t)^k / k!
     = e^{−λ} e^{λ e^t}
     = exp(λ(e^t − 1)) , (33)

the first differentiation yields,

φ′(t) = λ e^t exp(λ(e^t − 1)) ,

from which we can obtain the mean

µ = φ′(0) = λ e^0 exp(λ(e^0 − 1)) = λ . (34)

From a second differentiation,

φ′′(t) = (λ e^t)² exp(λ(e^t − 1)) + λ e^t exp(λ(e^t − 1)) ,

we obtain the variance,

σ² = φ′′(0) − (φ′(0))² = λ . (35)
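As a quick numerical check (not part of the notes), Poisson samples drawn with NumPy should have empirical mean and variance both close to λ; the value of λ below is arbitrary.

# Illustrative sketch: Poisson(lambda) samples, empirical mean and variance.
import numpy as np

rng = np.random.default_rng(seed=0)
lam = 3.5
x = rng.poisson(lam, size=100_000)

print(x.mean(), x.var())   # both approximately 3.5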

3 Continuous Random Variables

Continuous random variables are random variables whose sample space (set of possible values) is uncountable, for example an interval of real numbers.

Let X be a continuous random variable. The probability that X takes on any particular value x is 0, so P{X = x} is not useful for describing a continuous random variable. Instead, we consider the probability that X falls in some interval (a, b), that is, P{a < X < b}. This is computed using a probability density function, commonly abbreviated the p.d.f.

If X is a continuous random variable, there exists a function f(x) which is integrable and has the following properties:


Figure 4: The probability mass function (left) and cumulative distribution function (right) of a Poisson random variable.

(1) f(x) is nonnegative everywhere, that is, f(x) ≥ 0 for all real x ∈ (−∞, ∞).

(2) The area under the curve f(x) is 1, that is:

P{X ∈ (−∞, ∞)} = ∫_{−∞}^{∞} f(x) dx = 1 (36)

(3) If f(x) is the p.d.f. of X, for any set B of real numbers,

P{X ∈ B} = ∫_B f(x) dx , (37)

that is, the probability that X will be in B may be obtained by integrating the probability density function over the set B.

From property (3), if B is an interval, e.g., B = [a, b], then

P{a ≤ X ≤ b} = ∫_a^b f(x) dx . (38)

If we let a = b, then

P{X = a} = ∫_a^a f(x) dx = 0 , (39)

which proves the statement earlier, that the probability that a continuous random variable will assume any particular value is zero.

The cumulative distribution function (c.d.f.) of a continuous random variable X is defined as:

F(a) = P{X ∈ (−∞, a)} = ∫_{−∞}^{a} f(x) dx . (40)


The relationship between the cumulative distribution F(·) and the probability density f(·) is expressed by differentiating both sides of Eq. (40), hence

(d/da) F(a) = f(a) . (41)

If X is a continuous random variable having a probability density function f(x), then the expected value or mean of X is defined by

E(X) = ∫_{−∞}^{∞} x f(x) dx . (42)

The moment generating function φ(t) of the random variable X is defined for all values t by

φ(t) = E(e^{tX}) = ∫_{−∞}^{∞} e^{tx} f(x) dx . (43)

3.1 The Uniform Random Variable

A random variable is said to be uniformly distributed over the interval (α, β) if its probability density function is given by

f(x) =
    1/(β − α)    if α < x < β
    0            elsewhere        (44)

The cumulative distribution function of a random variable uniformly distributed over (α, β) is given by

F(c) =
    0                  for c ≤ α
    (c − α)/(β − α)    for α < c < β
    1                  for c ≥ β        (45)


The expectation or mean of a random variable uniformly distributed over (α, β),

E(X) = ∫_α^β x/(β − α) dx = (β² − α²)/(2(β − α)) = (β + α)/2 . (46)

The moment generating function of a uniform random variable over the interval (α, β) is given by

φ(t) = ∫_α^β e^{tx} · 1/(β − α) dx = (1/(β − α)) ∫_α^β e^{tx} dx
     =
        (e^{βt} − e^{αt}) / (t(β − α))    for t ≠ 0
        1                                 for t = 0        (47)

Differentiating the moment generating function, we get

φ′(t) = [e^{βt}(βt − 1) − e^{αt}(αt − 1)] / (t²(β − α)) . (48)

The expression (48) cannot be evaluated directly at t = 0 (both numerator and denominator vanish there), so the mean and variance are more conveniently obtained by calculating the raw and central moments directly.

The variance is given by,

Var(X) = (β − α)²/12 . (49)

The uniform distribution is commonly used in random number generation. In Python, random samples can be generated from a uniform distribution over the interval [0, 1) using

numpy.random.rand()

To generate an array of shape m × n, for example 3 × 2, we define the shape in the arguments as follows,

numpy.random.rand(3, 2)

16

Page 17: Lecture 2: Brief Review on Probability Theorypeople.bu.edu/.../lecture2/Lecture2.pdf · Lecture 2: Brief Review on Probability Theory 1 Terminology & Basics The sample space, usually

Dr V. Andasari http://people.bu.edu/andasari/courses/stochasticmodeling/stochastics.html

For a more general interval range of the uniform distribution, for example over [a, b] where b > a, we use

(b - a) * numpy.random.random_sample() + a
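Alternatively (a sketch not in the original notes), NumPy's newer Generator interface can draw directly from a uniform distribution on [a, b); the interval endpoints below are chosen only for illustration.

# Draw uniform samples on [a, b) with the Generator API.
import numpy as np

rng = np.random.default_rng()
a, b = 2.0, 5.0
samples = rng.uniform(a, b, size=(3, 2))   # 3x2 array of Uniform(a, b) draws
print(samples)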

3.2 The Exponential Random Variable

The probability density function of an exponential random variable, with parameter λ > 0, is given by

f(x) =
    λ e^{−λx}    if x ≥ 0
    0            if x < 0        (50)

The cumulative distribution function F is given by,

F(a) = ∫_0^a λ e^{−λx} dx = 1 − e^{−λa} , a ≥ 0 (51)

The expectation of an exponential random variable is given by,

E(X) = ∫_0^∞ x λ e^{−λx} dx ,

and after integrating by parts,

E(X) = −x e^{−λx} |_0^∞ + ∫_0^∞ e^{−λx} dx
     = 0 − e^{−λx}/λ |_0^∞
     = 1/λ .

The moment generating function of an exponential random variable with parameter λ,

φ(t) = E(e^{tX}) = ∫_0^∞ e^{tx} λ e^{−λx} dx
     = λ ∫_0^∞ e^{−(λ−t)x} dx
     = λ/(λ − t)    for t < λ .


After differentiation of φ(t),

φ′(t) = λ/(λ − t)² ,   φ′′(t) = 2λ/(λ − t)³ ,

we obtain the mean,

E(X) = φ′(0) = 1/λ , (52)

and the variance,

Var(X) = φ′′(0) − (φ′(0))² = 2/λ² − 1/λ² = 1/λ² . (53)

Figure 5: The probability density function (left) and cumulative distribution function (right) of an exponential random variable.

To generate random samples from an exponential distribution in Python,

numpy.random.exponential()
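Note that NumPy parameterizes the exponential distribution by the scale β = 1/λ rather than by the rate λ, so to sample with rate λ pass scale = 1/λ; a short sketch (the value of λ below is arbitrary):

# Exponential samples with rate lam; empirical mean ~ 1/lam, variance ~ 1/lam^2.
import numpy as np

rng = np.random.default_rng(seed=0)
lam = 2.0
x = rng.exponential(scale=1 / lam, size=100_000)

print(x.mean(), 1 / lam)        # both close to 0.5
print(x.var(), 1 / lam**2)      # both close to 0.25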

3.3 The Gamma Random Variable

If a continuous random variable has the following probability density function

f(x) =
    λ e^{−λx} (λx)^{α−1} / Γ(α)    for x ≥ 0
    0                              for x < 0        (54)


for some λ > 0 and α > 0, then it is said to be a gamma random variable with parameters α and λ. The quantity Γ(α) is called the gamma function and is defined by

Γ(α) = ∫_0^∞ e^{−x} x^{α−1} dx .

If α = n is a positive integer, then,

Γ(n) = (n− 1)!

Figure 6: The probability density function (left) and cumulative distribution function (right) of a gamma random variable.
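As an illustration (not part of the notes), gamma samples with the density (54) can be drawn in NumPy by passing shape α and scale 1/λ; the standard facts E(X) = α/λ and Var(X) = α/λ² (not derived in the notes) are used only as a sanity check, and the parameter values below are arbitrary.

# Gamma(alpha, lambda) samples via NumPy's shape/scale parameterization.
import numpy as np

rng = np.random.default_rng(seed=0)
alpha, lam = 3.0, 2.0
x = rng.gamma(shape=alpha, scale=1 / lam, size=100_000)

print(x.mean(), alpha / lam)        # both close to 1.5
print(x.var(), alpha / lam**2)      # both close to 0.75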

3.4 The Normal Random Variable

A random variable X is normally distributed with parameters µ and σ² if the density is given by

f(x) = (1/(σ√(2π))) e^{−(x−µ)²/2σ²} , −∞ < x < ∞ (55)

The expectation of a normal random variable distributed with parameters µ and σ² is

E(X) = (1/(σ√(2π))) ∫_{−∞}^{∞} x e^{−(x−µ)²/2σ²} dx .

If we write x = (x− µ) + µ, the equation is expanded,

E(X) = (1/(σ√(2π))) ∫_{−∞}^{∞} (x − µ) e^{−(x−µ)²/2σ²} dx + µ (1/(σ√(2π))) ∫_{−∞}^{∞} e^{−(x−µ)²/2σ²} dx .


Letting y = x− µ gives us,

E(X) = (1/(σ√(2π))) ∫_{−∞}^{∞} y e^{−y²/2σ²} dy + µ ∫_{−∞}^{∞} f(x) dx ,

where f(x) is the normal density. By symmetry, the first integral is 0, so the expectation is

E(X) = µ ∫_{−∞}^{∞} f(x) dx = µ . (56)

The moment generating function of a standard normal random variable Z is,

E(e^{tZ}) = (1/√(2π)) ∫_{−∞}^{∞} e^{tx} e^{−x²/2} dx
          = (1/√(2π)) ∫_{−∞}^{∞} e^{−(x²−2tx)/2} dx
          = e^{t²/2} (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−t)²/2} dx
          = e^{t²/2} .

If Z is a standard normal random variable, then X = σZ + µ is normal with parameters µ and σ², so

φ(t) = E(e^{tX}) = E(e^{t(σZ+µ)}) = e^{tµ} E(e^{tσZ}) = exp(σ²t²/2 + µt) .

After differentiation,

φ′(t) = (µ + tσ²) exp(σ²t²/2 + µt) ,

and

φ′′(t) = (µ + tσ²)² exp(σ²t²/2 + µt) + σ² exp(σ²t²/2 + µt) ,

we obtain the mean,

E(X) = φ′(0) = µ , (57)

and the variance,

Var(X) = φ′′(0) − (φ′(0))² = σ² . (58)


Figure 7: The probability density function (left) and cumulative distribution function (right) of a normal random variable.

The normal distribution is another distribution commonly used in random number generation. To generate random samples from the standard normal distribution N(0, 1), with mean 0 and variance 1, in Python, use

numpy.random.randn()

To generate random samples from N(µ, σ²), use

sigma * numpy.random.randn(...) + mu
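Equivalently (a sketch not in the original notes), numpy.random.Generator.normal takes the mean and standard deviation directly; the values of mu and sigma below are illustrative.

# Draw N(mu, sigma^2) samples directly; empirical mean ~ mu, variance ~ sigma^2.
import numpy as np

rng = np.random.default_rng(seed=0)
mu, sigma = 1.0, 2.0
x = rng.normal(loc=mu, scale=sigma, size=100_000)

print(x.mean(), x.var())   # approximately 1.0 and 4.0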

4 Jointly Distributed Random Variables

This topic is also called multivariate distributions in some textbooks.

Thus far we have discussed the probability distribution of a single random variable. Often, we are also interested in probability statements concerning two or more random variables.

When several random variables X1, X2, · · · , Xn are associated with the same sample space, a joint probability mass function or probability density function f(x1, x2, · · · , xn) can be defined. For example, for two random variables X1 and X2 defined on a common sample space S, (X1, X2) is a random vector where (X1, X2) : S → R². For each element s in the sample space S, there is associated a unique ordered pair (X1(s), X2(s)). The set of all ordered pairs is given by [1],

A = {(X1(s), X2(s)) | s ∈ S} ⊂ R² , (59)

which is known as the state space or space of the random vector (X1, X2).


The following are joint probability distributions of two random variables, denoted for the sake of convenience X and Y (instead of X1 and X2):

• Let X and Y be two discrete random variables defined on a common sample space S, having probability measure P : S → [0, 1]. If there exists a function f : A → [0, 1] which is defined by

f(x, y) = P{X = x, Y = y} , (60)

then f(x, y) is called the joint probability mass function of the random variables X and Y if it satisfies the following conditions [3]:

(1) Each probability must be a valid probability number between 0 and 1 (inclusive):

0 ≤ f(x, y) ≤ 1 .

(2) The sum of the probabilities over the entire space A must equal 1:

∑∑_{(x,y)∈A} f(x, y) = 1 .

(3) To determine the probability of an event B, simply sum up the probabilities of the (x, y) values in B:

P{(X, Y) ∈ B} = ∑∑_{(x,y)∈B} f(x, y) ,

where B ⊂ A, and A is defined by Eq. (59).

• Let X and Y be two discrete random variables as defined above, whose joint probability mass function f(x, y) is defined by Eq. (60). The probability mass function of X alone, which is called the marginal probability mass function of X [3], may be obtained from f(x, y) by

fX(x) = ∑_y f(x, y) = P{X = x} ,

which states that, for each x, the summation is taken over all possible values of y. Similarly, the probability mass function of Y alone, which is called the marginal probability mass function of Y, may be obtained from

fY(y) = ∑_x f(x, y) = P{Y = y} .

• For any two random variables X and Y, the joint cumulative probability distribution function of X and Y is defined by [2]

F (a, b) = P{X ≤ a, Y ≤ b} , −∞ < a, b <∞


The cumulative distribution function of X can be obtained from the joint distribution of X and Y as follows:

FX(a) = P{X ≤ a} = P{X ≤ a, Y < ∞} = F(a, ∞) .

Similarly, the cumulative distribution function of Y is given by

FY (b) = P{Y ≤ b} = F (∞, b) .

• Let X and Y be two continuous random variables defined on a common sample space S, having probability measure P : S → [0, 1]. If there exists a function f : R² → [0, ∞) such that

∫∫_{R²} f(x, y) dx dy = ∫∫_A f(x, y) dx dy = 1 ,

then f(x, y) is called the joint probability density function of X and Y if it satisfies the following properties:

– The probability for all sets A and B of real numbers [2]:

P{X ∈ A, Y ∈ B} = ∫_B ∫_A f(x, y) dx dy ,

where A and B are any sets of real numbers.

– The probability of an event B, where B ⊂ A [1]:

P{(X, Y) ∈ B} = ∫∫_B f(x, y) dx dy ,

where the set A is defined by Eq. (59).

The function f(x, y) above is defined for all real x and y. We then say that X and Y are jointly continuous.

• From the definition of continuous random variables X and Y above, the probability density of X alone, which is called the marginal probability density function of X, can be obtained from a knowledge of f(x, y) by

P{X ∈ A} = P{X ∈ A, Y ∈ (−∞, ∞)}
         = ∫_{−∞}^{∞} ∫_A f(x, y) dx dy
         = ∫_A fX(x) dx ,


where

fX(x) = ∫_{−∞}^{∞} f(x, y) dy

is the probability density function of X. Similarly, the probability density function of Y is given by

fY(y) = ∫_{−∞}^{∞} f(x, y) dx .

• Discrete random variables X and Y are independent if and only if [3]

P{X = a, Y = b} = P{X = a} × P{Y = b} ∀ a, b .

Similarly, continuous random variables X and Y are independent if [2]

P{X ≤ a, Y ≤ b} = P{X ≤ a} × P{Y ≤ b} ∀ a, b .

Otherwise, X and Y are said to be dependent.

• If X and Y are random variables whose joint probability mass/density function is f(x, y) and u(X, Y) is a function of these two variables, then [2]:

E[u(X, Y)] =
    ∑_y ∑_x u(x, y) f(x, y)                          in the discrete case
    ∫_{−∞}^{∞} ∫_{−∞}^{∞} u(x, y) f(x, y) dx dy      in the continuous case

where E[u(X, Y)] is called the expected value of u(X, Y). For example, if X and Y are two continuous random variables and u(X, Y) = X + Y, then:

E[X + Y] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x + y) f(x, y) dx dy
         = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f(x, y) dx dy + ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f(x, y) dx dy
         = E[X] + E[Y] .

If X and Y are two discrete random variables, then:

E[aX + bY ] = aE[X] + bE[Y ] .

• Let X and Y be discrete random variables. If u(X, Y) = (X − µX)², the variance of X is given by [3]

σ²X = Var(X) = ∑_x ∑_y (x − µX)² f(x, y) ,


or

σ²X = E(X²) − µ²X = (∑_x ∑_y x² f(x, y)) − µ²X .

And, if u(X, Y) = (Y − µY)², the variance of Y is given by

σ²Y = Var(Y) = ∑_x ∑_y (y − µY)² f(x, y) ,

or

σ²Y = E(Y²) − µ²Y = (∑_x ∑_y y² f(x, y)) − µ²Y .

• The covariance of any two random variables X and Y, denoted by Cov(X, Y), is defined by [2]

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
          = E[XY − Y E[X] − X E[Y] + E[X]E[Y]]
          = E[XY] − E[Y]E[X] − E[X]E[Y] + E[X]E[Y]
          = E[XY] − E[X]E[Y] .

If X and Y are independent, then Cov(X, Y) = 0.

For any random variables X, Y, Z, and constant c, the following are properties of covariance [2]:

1. Cov(X,X) = Var(X),

2. Cov(X, Y ) = Cov(Y,X),

3. Cov(cX, Y ) = cCov(X, Y ),

4. Cov(X, Y + Z) = Cov(X, Y ) + Cov(X,Z).
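As a numerical illustration of marginals, expectations, and covariance (the joint p.m.f. below is a made-up example, not from the notes, and NumPy is assumed):

# Small joint p.m.f. f(x, y) as a 2-D array: marginals, E[XY], and Cov(X, Y).
import numpy as np

x_vals = np.array([0, 1])
y_vals = np.array([0, 1, 2])
f = np.array([[0.10, 0.20, 0.10],      # rows indexed by x, columns by y
              [0.20, 0.25, 0.15]])
assert np.isclose(f.sum(), 1.0)        # probabilities sum to 1

f_X = f.sum(axis=1)                    # marginal p.m.f. of X
f_Y = f.sum(axis=0)                    # marginal p.m.f. of Y

EX = (x_vals * f_X).sum()
EY = (y_vals * f_Y).sum()
EXY = (np.outer(x_vals, y_vals) * f).sum()

print(f_X, f_Y)                        # [0.4 0.6], [0.3 0.45 0.25]
cov = EXY - EX * EY                    # Cov(X, Y) = E[XY] - E[X]E[Y]
print(cov)                             # approximately -0.02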

5 Conditional Probabilities

Let A and B be two events in a probability space. The conditional probability of event A given that event B has occurred, denoted P(A|B), is defined as

P(A|B) = P(A ∩ B) / P(B) , (61)

provided that P(B) > 0. Similarly, the conditional probability of event B given that event A has occurred, denoted P(B|A), is defined as

P(B|A) = P(A ∩ B) / P(A) , (62)


provided that P (A) > 0.

Events A and B are said to be independent if the occurrence of either one of the events does not affect the probability of occurrence of the other, that is,

P(A ∩ B) = P(A)P(B) .

Therefore, A and B are independent if and only if

P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A) , (63)

which states that the probability of A does not depend on whether B has occurred. Similarly,

P(B|A) = P(A ∩ B)/P(A) = P(A)P(B)/P(A) = P(B) . (64)

In other words, the events A and B are independent if the probability of B does not depend on whether A has occurred.

Since conditional probability is just a probability, it satisfies the following three properties [3]:

1. P (A|B) ≥ 0.

2. P (B|B) = 1.

3. If A1, A2, · · · , Ak are mutually exclusive events, then

P (A1 ∪ A2 ∪ · · · ∪ Ak |B) = P (A1|B) + P (A2|B) + · · ·+ P (Ak|B)

and likewise for infinite unions.

All three properties above are satisfied as long as P (B) > 0.
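As an illustration (not part of the original notes), the defining ratio (61) can be estimated by simulation; the die events below are chosen arbitrarily.

# Estimate P(A|B) = P(A ∩ B)/P(B) by simulation for one roll of a fair die,
# with A = "roll is even" and B = "roll >= 4". Exact value: (2/6)/(3/6) = 2/3.
import numpy as np

rng = np.random.default_rng(seed=0)
rolls = rng.integers(1, 7, size=100_000)

A = (rolls % 2 == 0)
B = (rolls >= 4)

p_cond = (A & B).mean() / B.mean()      # relative-frequency version of Eq. (61)
print(p_cond)                           # close to 2/3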

References

[1] Linda J. S. Allen. An Introduction to Stochastic Processes with Applications to Biology, 2nd Edition. CRC Press, 2010.

[2] Sheldon M. Ross. Introduction to Probability Models, 9th Edition. Elsevier, 2007.

[3] Introduction to Probability Theory, https://newonlinecourses.science.psu.edu/stat414/node/287/
