Chapter 1
What Is Econometrics
1.1 Economics
1.1.1 The Economic Problem
Economics concerns itself with satisfying unlimited wants with limited resources. As such it proposes solutions that involve the production and consumption of goods using a particular allocation of resources. At the micro-economic level, these solutions are often the results of optimizing behavior by economic agents such as consumers or producers, possibly impacted by regulators. At the macro-economic level, they can also be the results of decisions made by policy-makers such as central bankers, the executive branch, and the legislative branch.
1.1.2 Some Economic Questions
As economic agents arrive at their decisions they encounter and must answer a host of economic questions: Should the Eurozone countries coordinate their fiscal as well as monetary policies? Relatedly, should Greece default on its debt and form a new Greek currency? What is the best time for Professor Brown to retire? Should an undergraduate at Rice major in economics or mathematical economics? Why did the savings rate decline during the Great Expansion of the nineties?
As economists we are in the business of finding answers to such questions. Hopefully, the answers are justifiable and replicable so we are in some sense consistent. Moreover, we hope that our answers will be borne out by subsequent experience. In this chapter, we discuss a structure for going about constructing such answers. This structure is how we do business in economics.
1.1.3 Economics as a Science
Science can be defined as the intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment. The purpose of the experiment is to control for various factors that might vary when the environment is not managed as in an experiment. This allows the observer to focus on the relationship between a limited number of variables.
So to be a science, economics must be founded on the notion of studying and explaining the world through observation and experimentation. Economics is classified as a social science since it is concerned with social phenomena such as the organization and operation of the household or society.
1.1.4 Providing Scientific Answers
As such, economics would attempt to provide answers to the questions posed above, for example, using scientific approaches that would yield similar answers in similar circumstances. In that sense, it attempts to bring the replicability of scientific experiments into the analysis and formulation of answers. This is accomplished by combining economic theory with the analysis of pre-existing relevant data using appropriate statistical techniques.
1.2 Data
1.2.1 Accounting
Most of the data used in econometrics is accounting data. Accounting data is routinely recorded data (that is, records of transactions) as part of market activities. Accounting data has many shortcomings, since it is not collected specifically for the purposes of the econometrician. Oftentimes the available data may not measure the phenomenon we are interested in explaining but a related one.
Of course this characterization is not universal. For example, we may be interested in the relationship of birth weight or mortality to the income level. In both cases, the variable we are looking at is not transactional and hence not accounting data.
1.2.2 Nonexperimental
Moreover, the data is typically nonexperimental. The econometrician does not have any control over nonexperimental data. In this regard economics is in much the same situation as meteorology and astronomy. This nonexperimental nature of the data causes various problems. Econometrics is largely concerned with how to manage these problems in such a fashion that we can have the same confidence in our findings as when experimental data is used.
Here again, the characterization is not universal. For example, there have been income maintenance experiments where the impact on labor effort in response to guaranteeing a base level of income was examined directly with application of a negative income tax to participants. And there are numerous laboratory experiments that have helped advance our understanding of game-theoretic equilibria.
1.2.3 Data Types
It is useful to give a typology of the different forms that economic data can take:
Definition 1.1. Time-series data are data that represent repeated observations of some variable in subsequent time periods. A time-series variable is often subscripted with the letter t. Examples are the familiar aggregate economic variables such as interest rates, gross domestic product, and the consumer price index, which are available at annual, quarterly, and monthly intervals. Stock prices are another good example, available at the tick interval.
Definition 1.2. Cross-sectional data are data that represent a set of observations of some variable at one specific instant for a number of agents or organizations. A cross-sectional variable is often indexed with the letter i. Most often, this data is microeconomic. For example, a firm's employment data at a particular point in time has observations for each employee.
Definition 1.3. Time-series cross-sectional data are data that are both time-series and cross-sectional. Such data is typically indexed by t for the time period and i for the agent.
Definition 1.4. A special case of time-series cross-sectional data is panel data. Panel data are observations on the same set of agents or a panel over time. A leading example is the Survey of Income and Program Participation, which provides information about the incomes of American individuals and households and their participation in income transfer programs.
1.2.4 Empirical Regularities
We need to look at the data to detect regularities or stylized facts. These regularities that persist across time or individuals beg for explanation of some sort. Why do they occur and why do they persist? An example would be the observation that the savings rate in the U.S. was very low during the record expansion of the nineties and rose during the subsequent great recession. In economics we seek to explain the regularities with models.
1.3 Models
1.3.1 Simplifying Assumptions
Models are simplifications of the real world. As such they introduce simplifying assumptions that hold a number of factors constant and allow the analysis of the relationship between a limited number of variables. These simplifying assumptions are the factors that would be controlled if we were conducting a controlled experiment to verify the model.
Models are sometimes criticized as unrealistic, but if they were realistic they would not be models but would in fact represent reality. The real question is whether the simplifications introduced by the assumptions are useful for dealing with the phenomena we attempt to explain. Hopefully, we will provide some metric of what is a useful model.
1.3.2 Economic Theory
As a result of optimizing behavior by consumers, producers, or other agents in the economy, economic theory will suggest how certain variables will be related. For example, demand for a consumer good can be represented as a function of its own price, prices of complementary goods, prices of competing goods, and the income level of the consumer.
1.3.3 Equations
Economic models are usually stated in terms of one or more equations. For example, in a simple macroeconomic model we might represent $C_t$ (aggregate consumption at time $t$) as a linear function of $Y_t$ (aggregate income):

$$C_t = \alpha + \beta Y_t$$

where $\alpha$ and $\beta$ are parameters of interest whose values are targets of inquiry. The choice of the linear model is usually a simplification, but the variables that go into the model will typically follow directly from economic theory.
1.3.4 Error Component
As an example, consider the following scatter plot of aggregate income versus aggregate consumption for annual data over the period 1946-1958. The data reveal a clear increasing relationship between income and consumption. Moreover, average consumption, which is represented by a ray from the origin, declines as income increases. These both suggest that the linear relationship has a positive intercept and a slope less than one. However, the relationship is not exact for any choice of the parameters $\alpha$ and $\beta$.
Because none of our models are exact, we include an error component in our equations, usually denoted $u_t$. Thus we have

$$C_t = \alpha + \beta Y_t + u_t$$

This error term could be viewed as an approximation error, a misspecification error, or the result of errors in optimization by economic agents. In any event, the linear parametric relationship is not exact for any choice of the parameters.
[Figure 1.1: Aggregate Consumption vs. Income, 1946-1958]
1.4 Statistics
1.4.1 Stochastic Component
In order to conduct analysis of the model once the error term has been introduced, we assume that it is a random variable and subject to distributional restrictions. For example, we might assume that $u_t$ is an i.i.d. variable with

$$\mathrm{E}[u_t \mid Y_t] = 0$$
$$\mathrm{E}[u_t^2 \mid Y_t] = \sigma^2.$$
Again, this is a modeling assumption and may be a useful simplification or not. We have combined economic theory to explain the systematic part of the behavior and statistical theory to characterize the unsystematic part.
1.4.2 Statistical Analysis
The introduction of the stochastic assumptions on the error component allows the use of statistical techniques to perform various tasks in analyzing the combined model. This analysis will involve statistical inference, whereby we use the distributional behavior implied by our economic model and statistical assumptions to make inferences about properties of the underlying model based on the observed behavior generated by the model.

Implicitly, in the conditional expectations assumptions used in the aggregate consumption example above, the explanatory variable is taken as fixed or non-stochastic. Sometimes the explanatory variables are just as obviously random as the dependent variables, in which case we will need to make stochastic assumptions regarding their joint behavior. This will complicate matters considerably.
1.4.3 Tasks
In the sequel we will concern ourselves with three main tasks in applying statistical analysis to the model in question:
1. Estimating the parameters of the model. These parameters would include the parameters of the systematic part of the model and the parameters of the distribution of the error component. These estimates are typically point estimates though they can be interval estimates.
2. Hypothesis tests concerning the behavior of the model. This is where the approach really becomes scientific. We might be interested in whether accumulation of wealth is an adequate explanation for the low saving rate situation mentioned above. This could be tested using an hypothesis test.
3. Forecasting the behavior of the dependent variable outside the sample using the estimated model. Such forecasts can be point forecasts, such as expected behavior, or interval forecasts, which give a range of values with an attached probability of occurrence. Usually, these forecasts will be conditional on knowledge of the values of the explanatory variables.
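These three tasks can be sketched concretely. The following is a minimal illustration with synthetic data (the parameter values, sample size, and noise level are all made up for the example), fitting the consumption equation from Section 1.3.3 by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for C_t = alpha + beta*Y_t + u_t; the "true" values
# alpha = 10, beta = 0.9, and the noise level are made up for illustration.
Y = rng.uniform(100, 300, size=50)
C = 10 + 0.9 * Y + rng.normal(0, 5, size=50)

# Task 1: estimate alpha and beta by ordinary least squares
X = np.column_stack([np.ones_like(Y), Y])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, C, rcond=None)

# Task 2: test the hypothesis beta = 1 with a t-type statistic
resid = C - X @ np.array([alpha_hat, beta_hat])
s2 = resid @ resid / (len(Y) - 2)
se_beta = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
t_stat = (beta_hat - 1) / se_beta

# Task 3: point forecast of consumption at an out-of-sample income of 250
forecast = alpha_hat + beta_hat * 250
```

The statistical machinery behind each step (the distribution of the estimates, of the test statistic, and of the forecast error) is developed in the chapters that follow.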
1.5 Econometrics
Econometrics involves the integration of these three components: data, economic theory, and statistical theory to model an economic phenomenon. Since our data is usually nonexperimental, econometrics makes use of economic and statistical theory to adjust for the lack of proper data. Economic theory allows us to focus on a reduced set of variables and possible knowledge of how they interact. If we were conducting an experiment, we would be able to control the various other factors and focus on the variables whose relationship interests us. And statistical assumptions allow us to effectively control for the effect of other factors not suggested by economic theory. This integration of data, economic theory, and statistical theory to answer economic questions is the essence of econometrics.
1.5.1 Interactions
Often the phenomenon we are seeking to explain is a set of stylized facts that characterize how the various economic variables move together. This may lead to economic theories that possibly explain these stylized facts. The question of whether or not the theory adequately explains the data is measured through the application of appropriate statistical techniques to estimate the model based on the available data. Hopefully, there are hypotheses (behaviors) that are predicted by the economic theory that can be tested using statistical techniques. Based on the outcome of this exercise the theory may need modification, or additional or different data gathered to address the problem at hand.

Economists speak of "bringing the model to the data" in this exercise. It is the responsibility of the econometrician to choose or develop an appropriate statistical technique that matches the needs of the data and the theory. As was mentioned above, economic data are largely nonexperimental. And our economic models typically have multiple variables determined simultaneously by the model. Consequently, we focus on and develop techniques for dealing with these properties and others that may need to be handled.
[Figure 1.2: Interactions. Three circles labeled Data, Theory, and Statistics: the Data circle shows the income-consumption scatter, the Theory circle the equation $C_t = \alpha + \beta D_t + u_t$, and the Statistics circle the fitted line. Connecting arrows are labeled "Stylized Facts of Data Suggest Theory," "Data Needs of Theory," "Statistical Needs of Data," "Data Needs of Statistics," "Statistical Needs of Theory," and "Statistical Results Alter Theory."]
Some of this interaction is illustrated in Figure 1.2. Observations on aggregate consumption and aggregate disposable income show a generally positive relationship in the plot in the Data circle. This leads to the Duesenberry-type linear consumption equation $C_t = \alpha + \beta D_t$ in the Theory circle. Since the model does not exactly fit the data, an error term $u_t$ is appended. Assumptions are made about the statistical properties of the observable data and the error term, and an appropriate statistical technique is applied to obtain the estimated model, which is represented as the fitted green line in the Statistics circle. There are feedbacks between the three circles, with the needs or results of the statistical analysis leading to modifications in the data and/or theory. Likewise the theoretical analysis may lead to changes in which and what type of data is gathered.
1.5.2 Scientific Method
In the scientific method, we take an hypothesis, construct an experiment with which to test the hypothesis, and based on the results of the experiment we either reject the hypothesis or not. The purpose of the experiment is to control for various factors that might impact the outcome but are not of interest. This allows us to focus our attention on the relationship between the variables of interest that result from the experiment. Based on the outcomes for these variables of interest, we either reject or not.
In econometrics, we let the systematic model and the statistical model, or some specified component of them, play the role of the hypothesis. The available data, ex post, play the role of the outcome of an experiment. Statistical analysis of the data using the combined model enables us to reject the hypothesis or not. So econometrics is what allows us to use the scientific method and hence is critical in economics being treated as a science.
1.5.3 Choosing A Model
In testing an hypothesis above we are implicitly choosing between a model which satisfies the hypothesis and one which doesn't. So the hypothesis test implicitly is a criterion for choosing between models. If we reject the hypothesis then we are choosing the model which doesn't satisfy the hypothesis. This is a clear-cut decision framework. In any event, we are choosing the model which best agrees with the data.
Sometimes, we are faced with a situation where the data do not give a strong preference between two competing models. In this case, the simplest model that fits the data is often the best choice. This is known as using Occam's razor.
Chapter 2
Some Useful Distributions
Usually, interest in an economic phenomenon is piqued from viewing sample moments of some important variable. So naturally, we will give a great deal of attention to sample moments in the sequel. Ultimately, though, to understand the behavior of such sample moments in either large or small samples we will need to resort to distributional theory. We will only consider continuous distributions in this chapter, but the discrete cases are easily handled in a similar fashion. For reasons that become apparent, the normal distribution and distributions associated with it play a central role in much of our statistical analysis in both small and large samples.
2.1 Introduction
2.1.1 Inference
As statisticians, we are often called upon to answer questions or make statements concerning certain random variables. For example: is a coin fair (i.e., is the probability of heads 0.5)? What is the expected value of GNP for the quarter?
Typically, answering such questions requires knowledge of the underlying (population) distribution of the random variable. Unfortunately, we usually do not know this distribution (although we may have a strong idea, as in the case of the coin).
In order to gain knowledge of the distribution, we draw several realizations of the random variable. The notion is that the observations in this sample contain information concerning the population distribution.
Definition 2.1. The process by which we make statements concerning the population distribution based on the sample observations is called inference.
Example 2.1. We decide whether a coin is fair by tossing it several times and observing whether it seems to be heads about half the time.
2.1.2 Random Samples
Definition 2.2. Suppose we draw $n$ observations of a random variable, denoted $\{x_1, x_2, \ldots, x_n\}$, and each $x_i$ is independent and has the same (marginal) distribution; then $\{x_1, x_2, \ldots, x_n\}$ constitute a simple random sample.
Example 2.2. We toss a coin three times. Supposedly, the outcomes are independent. If $x_i$ counts the number of heads for toss $i$, then we have a simple random sample.
Note that not all samples are simple random samples. They can be either non-independent or non-identical or both.
Example 2.3. We are interested in the income level for the population in general. The $n$ observations available in this class are not identical since the higher income individuals will tend to be more variable.

Example 2.4. Consider the aggregate consumption level. The $n$ observations available in this set are not independent since a high consumption level in one period is usually followed by a high level in the next.
2.1.3 Sample Statistics
Definition 2.3. Any function of the observations in the sample which is the basis for inference is called a sample statistic.
Example 2.5. In the coin tossing experiment, let $S$ count the total number of heads and $P = S/3$ the sample proportion of heads. Both $S$ and $P$ are sample statistics.
2.1.4 Sample Distributions
A sample statistic is a random variable — its value will vary from one experiment to another. As a random variable, it is subject to a distribution.
Definition 2.4. The distribution of the sample statistic is the sample distribution of the statistic.
Example 2.6. The statistic $S$ introduced above has a binomial sample distribution. Specifically, for a fair coin, $\Pr(S = 0) = 1/8$, $\Pr(S = 1) = 3/8$, $\Pr(S = 2) = 3/8$, and $\Pr(S = 3) = 1/8$.
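The sample distribution in this example can be verified by brute-force enumeration of the eight equally likely outcomes of three fair-coin tosses; a small sketch:

```python
from itertools import product
from fractions import Fraction

# Enumerate the 2^3 = 8 equally likely outcomes of three fair-coin tosses
# and tabulate the sample distribution of S = total number of heads.
counts = {s: 0 for s in range(4)}
for outcome in product([0, 1], repeat=3):
    counts[sum(outcome)] += 1

pmf = {s: Fraction(c, 8) for s, c in counts.items()}
# pmf reproduces the probabilities 1/8, 3/8, 3/8, 1/8
```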
2.2 The Sample Mean
2.2.1 Sample Mean
Definition 2.5. Given a sample $\{x_1, x_2, \ldots, x_n\}$ of observations on a random variable $x$, then

$$\overline{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$$

is the sample mean or average of the sample.
This form plays a predominant role in statistics and econometrics. All sample moments can be expressed as sample averages of functions of the data. And, at least in large samples, the behavior of many estimators and statistics depends on their relationship to sample averages. Consequently, the behavior of this form is well studied and well known.
2.2.2 Sample Sum
Consider the simple random sample $\{x_1, x_2, \ldots, x_n\}$, where $x$ measures the height of an adult female. We will assume that $\mathrm{E}x_i = \mu$ and $\mathrm{Var}(x_i) = \sigma^2$, for all $i = 1, 2, \ldots, n$.
Let $S = x_1 + x_2 + \cdots + x_n$ denote the sample sum. Now,

$$\mathrm{E}S = \mathrm{E}(x_1 + x_2 + \cdots + x_n) = \mathrm{E}x_1 + \mathrm{E}x_2 + \cdots + \mathrm{E}x_n = n\mu.$$

Also,

$$\begin{aligned}
\mathrm{Var}(S) &= \mathrm{E}(S - \mathrm{E}S)^2 \\
&= \mathrm{E}(S - n\mu)^2 \\
&= \mathrm{E}(x_1 + x_2 + \cdots + x_n - n\mu)^2 \qquad (2.1) \\
&= \mathrm{E}\left[\sum_{i=1}^{n}(x_i - \mu)\right]^2 \\
&= \mathrm{E}[(x_1 - \mu)^2 + (x_1 - \mu)(x_2 - \mu) + \cdots + (x_2 - \mu)(x_1 - \mu) + (x_2 - \mu)^2 + \cdots + (x_n - \mu)^2] \\
&= n\sigma^2. \qquad (2.2)
\end{aligned}$$

Note that $\mathrm{E}[(x_i - \mu)(x_j - \mu)] = 0$ for $i \neq j$, by independence.
2.2.3 Moments Of The Sample Mean
The mean of the sample mean is

$$\mathrm{E}\overline{x}_n = \mathrm{E}\frac{S}{n} = \frac{1}{n}\mathrm{E}S = \frac{1}{n}n\mu = \mu. \qquad (2.3)$$
The variance of the sample mean is

$$\begin{aligned}
\mathrm{var}(\overline{x}_n) &= \mathrm{E}(\overline{x}_n - \mu)^2 \\
&= \mathrm{E}\left(\frac{S}{n} - \mu\right)^2 \\
&= \mathrm{E}\left[\frac{1}{n}(S - n\mu)\right]^2 \\
&= \frac{1}{n^2}\mathrm{E}(S - n\mu)^2 \\
&= \frac{1}{n^2}n\sigma^2 \\
&= \frac{\sigma^2}{n}. \qquad (2.4)
\end{aligned}$$
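Results (2.3) and (2.4) are easy to check by simulation. A sketch (the values of $\mu$, $\sigma^2$, and $n$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma2, n, reps = 5.0, 4.0, 25, 200_000

# Draw many samples of size n and compare the mean and variance of the
# sample mean with the theoretical values mu and sigma^2/n from (2.3)-(2.4).
samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
xbar = samples.mean(axis=1)

emp_mean = xbar.mean()   # close to mu = 5
emp_var = xbar.var()     # close to sigma2 / n = 0.16
```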
2.2.4 Sampling Distribution
We have been able to establish the mean and variance of the sample mean. However, in order to know its complete distribution precisely, we must know the probability density function (pdf) of the random variable $x$.
2.3 The Normal Distribution
2.3.1 Density Function
Definition 2.6. A continuous random variable $x$ with the density function

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x - \mu)^2} \qquad (2.5)$$

follows the normal distribution, where $\mu$ and $\sigma^2$ are the mean and variance of $x$, respectively.
Since the distribution is characterized by the two parameters $\mu$ and $\sigma^2$, we denote a normal random variable by $x \sim N(\mu, \sigma^2)$.
The normal density function is the familiar "bell-shaped" curve, as is shown in Figure 2.1 for $\mu = 0$ and $\sigma^2 = 1$. It is symmetric about the mean $\mu$. Approximately $2/3$ of the probability mass lies within $\pm\sigma$ of $\mu$ and about $0.95$ lies within $\pm 2\sigma$. There are numerous examples of random variables that have this shape. Many economic variables are assumed to be normally distributed.
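These probability-mass figures can be computed from the standard normal CDF, which the Python standard library reaches through the error function:

```python
from math import erf, sqrt

def norm_cdf(z):
    # Standard normal CDF via the error function: Phi(z) = (1 + erf(z/sqrt(2)))/2
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

within_1sd = norm_cdf(1) - norm_cdf(-1)   # about 0.6827, roughly 2/3
within_2sd = norm_cdf(2) - norm_cdf(-2)   # about 0.9545, roughly 0.95
```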
2.3.2 Linear Transformation
Consider the transformed random variable

$$Y = a + bx.$$

We know that

$$\mu_Y = \mathrm{E}Y = a + b\mu_x$$
[Figure 2.1: The Standard Normal Distribution]
and

$$\sigma^2_Y = \mathrm{E}(Y - \mu_Y)^2 = b^2\sigma^2_x.$$

If $x$ is normally distributed, then $Y$ is normally distributed as well. That is,

$$Y \sim N(\mu_Y, \sigma^2_Y).$$

Moreover, if $x \sim N(\mu_x, \sigma^2_x)$ and $z \sim N(\mu_z, \sigma^2_z)$ are independent, then

$$Y = a + bx + cz \sim N(a + b\mu_x + c\mu_z,\; b^2\sigma^2_x + c^2\sigma^2_z).$$

These results will be formally demonstrated in a more general setting in the next chapter. This result means that any linear combination of any number of independent normals is itself normal.
2.3.3 Distribution Of The Sample Mean
If, for each $i = 1, 2, \ldots, n$, the $x_i$'s are independent, identically distributed (i.i.d.) normal random variables, then

$$\overline{x}_n \sim N\!\left(\mu_x, \frac{\sigma^2_x}{n}\right). \qquad (2.6)$$
2.3.4 The Standard Normal
The distribution of $\overline{x}_n$ will vary with different values of $\mu_x$ and $\sigma^2_x$, which is inconvenient. Rather than dealing with a unique distribution for each case, we perform the following transformation:

$$Z = \frac{\overline{x}_n - \mu_x}{\sqrt{\sigma^2_x/n}} = \frac{\overline{x}_n}{\sqrt{\sigma^2_x/n}} - \frac{\mu_x}{\sqrt{\sigma^2_x/n}}. \qquad (2.7)$$

Now,

$$\mathrm{E}Z = \frac{\mathrm{E}\overline{x}_n}{\sqrt{\sigma^2_x/n}} - \frac{\mu_x}{\sqrt{\sigma^2_x/n}} = \frac{\mu_x}{\sqrt{\sigma^2_x/n}} - \frac{\mu_x}{\sqrt{\sigma^2_x/n}} = 0.$$

Also,

$$\mathrm{E}(Z^2) = \mathrm{E}\left[\frac{\overline{x}_n}{\sqrt{\sigma^2_x/n}} - \frac{\mu_x}{\sqrt{\sigma^2_x/n}}\right]^2 = \mathrm{E}\left[\frac{n}{\sigma^2_x}(\overline{x}_n - \mu_x)^2\right] = \frac{n}{\sigma^2_x}\,\frac{\sigma^2_x}{n} = 1.$$

Thus $Z \sim N(0, 1)$. If we know the variance $\sigma^2$, then the $z$-transformation can be used to test the mean.
Definition 2.7. The $N(0, 1)$ distribution is the standard normal and is well-tabulated. The probability density function for the standard normal distribution evaluated at point $z$ is

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2} = \varphi(z). \qquad (2.8)$$
Example 2.7. Suppose that $x_i \sim \text{i.i.d. } N(\mu, \sigma^2)$ with $\sigma^2 = \sigma^2_0$ known and we seek to test the null hypothesis that $\mu = \mu_0$ against $\mu = \mu_1$; then

$$Z = \frac{\overline{x}_n - \mu_0}{\sqrt{\sigma^2_0/n}} \sim N(0, 1)$$

under the null hypothesis but will have a non-zero mean under the alternative. The behavior under the alternatives will be treated carefully in the next chapter.
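A minimal sketch of this z-test (the numbers plugged in are hypothetical):

```python
from math import erf, sqrt

def z_statistic(xbar, mu0, sigma2_0, n):
    # Z-statistic for H0: mu = mu0 when the variance sigma2_0 is known
    return (xbar - mu0) / sqrt(sigma2_0 / n)

def two_sided_p(z):
    # Two-sided p-value from the standard normal CDF
    phi = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))
    return 2.0 * (1.0 - phi)

# Hypothetical numbers: sample mean 10.4 from n = 100 draws, H0: mu = 10, sigma^2 = 4
z = z_statistic(10.4, 10.0, 4.0, 100)   # = 2.0
p = two_sided_p(z)                      # about 0.0455
```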
2.4 The Central Limit Theorem
2.4.1 Normal Theory
The normal density has a prominent position in statistics. This is not only because many random variables appear to be normal, but also because almost any sample mean appears normal as the sample size increases. Subject to a few weak conditions, this result applies to the average of even highly non-normal distributions. Specifically, we have
Theorem 2.1. Suppose (i) $\{x_1, x_2, \ldots, x_n\}$ is a simple random sample, (ii) $\mathrm{E}x_i = \mu_x$, and (iii) $\mathrm{E}(x_i - \mu_x)^2 = \sigma^2_x < \infty$; then as $n \to \infty$, the distribution of $Z = \frac{\overline{x}_n - \mu_x}{\sqrt{\sigma^2_x/n}}$ becomes standard normal.

What this means is that for $f_n(\cdot)$ the probability density function (PDF) of $Z$,

$$\lim_{n\to\infty} f_n(c) = \varphi(c) \qquad (2.9)$$

pointwise for any point of evaluation $c$. Likewise, for $F_n(\cdot)$ denoting the analogous cumulative distribution function (CDF) of $Z$ and $\Phi(c)$ denoting the CDF of the standard normal,

$$\lim_{n\to\infty} F_n(c) = \Phi(c)$$

pointwise at any point of evaluation $c$. This is the Lindeberg-Lévy form of the central limit theorem and will be treated more extensively in Chapter 4. Notationally, we write

$$Z = \frac{\overline{x}_n - \mu_x}{\sqrt{\sigma^2_x/n}} \to_d N(0, 1)$$

or equivalently $\sqrt{n}(\overline{x}_n - \mu_x) \to_d N(0, \sigma^2_x)$.
The panels in Figure 2.2 demonstrate the power of the result for an $x$ drawn from a triangular distribution, normalized to have mean zero and unit variance. We draw 100,000 replications of samples of size 25. The first panel displays a histogram of the first observation of each replication. This is the triangular distribution we are working with. It has limited support and is highly asymmetric. The second panel is a histogram of the $z$-transformation of the means for the 100,000 replications. At a sample size of 25 it is clear that the standard normal distribution is already a very good approximation to the distribution.
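The Figure 2.2 experiment is easy to reproduce. A sketch using a triangular distribution on [0, 1] with mode at 1 (an assumed shape; the text does not specify the exact triangle used):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 25, 100_000

# Triangular distribution on [0, 1] with mode at 1 (an assumed shape):
# mean 2/3, variance 1/18.
x = rng.triangular(0.0, 1.0, 1.0, size=(reps, n))
mu, sigma2 = 2.0 / 3.0, 1.0 / 18.0

# z-transform each replication's sample mean; by the CLT these should
# look approximately N(0, 1) even at n = 25.
z = (x.mean(axis=1) - mu) / np.sqrt(sigma2 / n)
z_mean, z_var = z.mean(), z.var()
```

Histogramming `z` reproduces the second panel's approximate bell shape.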
[Figure 2.2: Central Limit Theorem and Triangular Distribution. Left panel: Histogram of X (r=100,000). Right panel: Histogram of Normalized Sample Mean (r=100,000, n=25).]
2.5 Distributions Associated With The Normal Distribution
2.5.1 The Chi-Squared Distribution
Definition 2.8. Suppose that $\{Z_1, Z_2, \ldots, Z_m\}$ is a simple random sample, and $Z_i \sim \text{i.i.d. } N(0, 1)$. Then

$$w = \sum_{i=1}^{m} Z_i^2 \sim \chi^2_m,$$

or $w$ has the chi-squared distribution with $m$ degrees of freedom.

The probability density function for the $\chi^2_m$ is

$$f_{\chi^2}(x) = \frac{1}{2^{m/2}\Gamma(m/2)}\, x^{m/2 - 1} e^{-x/2}, \qquad x > 0, \qquad (2.10)$$

where $\Gamma(x)$ is the gamma function. See Figure 2.3 for plots of the chi-squared distribution for various degrees of freedom. Note how the distribution begins to attain the familiar bell-shape but is stretched out to the right as the degrees of freedom grow larger. This is a manifestation of the central limit theorem, since $w/m$ would be an average. In fact, $(w - m)/\sqrt{2m}$ is approximately standard normal in large samples.
The chi-squared distribution will prove useful in testing hypotheses on both the variance of a single variable and the (conditional) means of several. Typically, we are only interested in the content of the right-hand side tail in conducting tests, for reasons that will become apparent with the multivariate usage, which will be explored in the next chapter. But the left-hand side tail could be interesting if we are testing whether the true variance is a specific value, say $\sigma^2_0$, and are interested in the possibility that the true value is either larger or smaller than this value.
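Definition 2.8 and the moment claims above can be checked by simulation; a sketch with m = 5:

```python
import numpy as np

rng = np.random.default_rng(3)
m, reps = 5, 200_000

# Build chi-squared draws directly from Definition 2.8: sums of m squared
# independent standard normals. A chi-squared_m has mean m and variance 2m.
z = rng.standard_normal(size=(reps, m))
w = (z ** 2).sum(axis=1)

w_mean, w_var = w.mean(), w.var()   # close to 5 and 10
```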
[Figure 2.3: Some Chi-Squared Distributions]
Example 2.8. Consider the estimate of $\sigma^2$

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \overline{x})^2}{n - 1}.$$

Then we can show that

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}. \qquad (2.11)$$
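Result (2.11) can be checked by simulation, comparing the first two moments of $(n-1)s^2/\sigma^2$ with those of a $\chi^2_{n-1}$ variable (mean $n-1$, variance $2(n-1)$); a sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps, sigma2 = 10, 200_000, 2.0

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1, ddof=1)      # unbiased estimator of sigma^2
w = (n - 1) * s2 / sigma2       # claimed chi-squared with n - 1 = 9 df

# A chi-squared_9 has mean 9 and variance 18
w_mean, w_var = w.mean(), w.var()
```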
2.5.2 The t Distribution
Definition 2.9. Suppose that $Z \sim N(0, 1)$, $Y \sim \chi^2_m$, and that $Z$ and $Y$ are independent. Then

$$\frac{Z}{\sqrt{Y/m}} \sim t_m, \qquad (2.12)$$

where $m$ is the degrees of freedom of the $t$ distribution.

The probability density function for a $t$ random variable with $m$ degrees of freedom is

$$f_t(x) = \frac{\Gamma\!\left(\frac{m+1}{2}\right)}{\sqrt{m\pi}\,\Gamma\!\left(\frac{m}{2}\right)} \left(1 + \frac{x^2}{m}\right)^{-(m+1)/2}, \qquad (2.13)$$
for $-\infty < x < \infty$. See Figure 2.4 for plots of the distribution for various choices of $m$.

[Figure 2.4: Some t Distributions]

The $t$ (also known as Student's $t$) distribution is named after William Gosset, who published under the pseudonym "Student." It is useful in testing hypotheses concerning the (conditional) mean when the variance is estimated. Note that the distribution is symmetric around zero and becomes less fat-tailed as the degrees of freedom increase. In fact, as $m \to \infty$, the distribution approaches the standard normal. This just reflects the fact that the denominator $Y/m$ converges in probability to one.
Example 2.9. Consider the sample mean from a simple random sample of normals. We know that $\overline{x} \sim N(\mu, \sigma^2/n)$ and

$$Z = \frac{\overline{x} - \mu}{\sqrt{\sigma^2/n}} \sim N(0, 1).$$

Also, we know that

$$Y = \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1},$$

where $s^2$ is the unbiased estimator of $\sigma^2$. Thus, if $Z$ and $Y$ are independent
(which, in fact, is the case), then

$$\frac{Z}{\sqrt{\frac{Y}{n-1}}} = \frac{\dfrac{\overline{x} - \mu}{\sqrt{\sigma^2/n}}}{\sqrt{\dfrac{(n-1)s^2/\sigma^2}{n-1}}} = (\overline{x} - \mu)\sqrt{\frac{n}{\sigma^2}}\sqrt{\frac{\sigma^2}{s^2}} = \frac{\overline{x} - \mu}{\sqrt{s^2/n}} \sim t_{n-1}. \qquad (2.14)$$
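Result (2.14) can be checked by simulation: the studentized mean should match the first two moments of a $t_{n-1}$ variable (mean 0, variance $(n-1)/(n-3)$ for $n > 3$). A sketch with n = 8:

```python
import numpy as np

rng = np.random.default_rng(11)
n, reps, mu = 8, 200_000, 3.0

x = rng.normal(mu, 1.5, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)

# Studentized mean: claimed t with n - 1 = 7 degrees of freedom,
# which has mean 0 and variance 7/5 = 1.4.
t = (xbar - mu) / np.sqrt(s2 / n)
t_mean, t_var = t.mean(), t.var()
```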
2.5.3 The F Distribution
Definition 2.10. Suppose that $Y \sim \chi^2_m$, $W \sim \chi^2_n$, and that $Y$ and $W$ are independent. Then

$$\frac{Y/m}{W/n} \sim F_{m,n}, \qquad (2.15)$$

where $m, n$ are the degrees of freedom of the $F$ distribution.

The probability density function for an $F$ random variable with $m$ and $n$ degrees of freedom is

$$f_F(x) = \frac{\Gamma\!\left(\frac{m+n}{2}\right)(m/n)^{m/2}}{\Gamma\!\left(\frac{m}{2}\right)\Gamma\!\left(\frac{n}{2}\right)}\, \frac{x^{(m/2)-1}}{(1 + mx/n)^{(m+n)/2}}. \qquad (2.16)$$

The $F$ distribution is named after the great statistician Sir Ronald A. Fisher, and is used in many applications, most notably in the analysis of variance. This situation will arise when we seek to test multiple (conditional) mean parameters with estimated variance. Note that when $x \sim t_n$ then $x^2 \sim F_{1,n}$. Also note that as the denominator degrees of freedom grows large, which is frequently the case, the denominator converges in probability to one and the ratio is approximated by the behavior of the numerator: a chi-squared divided by its degrees of freedom. Some plots of the $F$ distribution for various choices of numerator and denominator degrees of freedom can be seen in Figure 2.5.
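The claim that $x \sim t_n$ implies $x^2 \sim F_{1,n}$ can be checked by building $t_n$ from its definition and squaring; a simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(13)
n, reps = 20, 200_000

# Build t_n from Definition 2.9 and square it; the square should behave
# like F_{1,n} from Definition 2.10, whose mean is n/(n-2).
z = rng.standard_normal(reps)
y = rng.chisquare(n, size=reps)
t = z / np.sqrt(y / n)

f = t ** 2
f_mean = f.mean()   # close to 20/18
```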
[Figure 2.5: Some F Distributions]
Chapter 3
Multivariate Distributions
In economics we are typically interested in relationships that involve a number of variables. For example, aggregate consumption, disposable income, and wealth in a macroeconomic context, or a list of stock prices that make up a portfolio in a microeconomic situation. Consequently, the statistical analysis will of necessity be multivariate in nature. In this chapter we will develop a number of concepts that prove useful in multivariate statistics. As in the univariate case, the normal distribution plays a central role in multivariate statistics and will be fully developed. In this treatment, the list of random variables of interest will be arrayed into a vector for notational convenience.
3.1 Matrix Algebra Of Expectations
3.1.1 Moments of Random Vectors
Let

$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}$$

be an $m \times 1$ vector-valued random variable. Each element of the vector is a scalar random variable of the type discussed in the previous chapter.
The expectation of a random vector is

$$\mathrm{E}[\mathbf{x}] = \begin{pmatrix} \mathrm{E}[x_1] \\ \mathrm{E}[x_2] \\ \vdots \\ \mathrm{E}[x_m] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix} = \boldsymbol{\mu}. \qquad (3.1)$$

Note that $\boldsymbol{\mu}$ is also an $m \times 1$ column vector. We see that the mean of the vector is the vector of the means.
Next, we evaluate the following:
$$\mathrm{E}[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})'] = \mathrm{E}\begin{pmatrix} (x_1-\mu_1)^2 & (x_1-\mu_1)(x_2-\mu_2) & \cdots & (x_1-\mu_1)(x_m-\mu_m) \\ (x_2-\mu_2)(x_1-\mu_1) & (x_2-\mu_2)^2 & \cdots & (x_2-\mu_2)(x_m-\mu_m) \\ \vdots & \vdots & \ddots & \vdots \\ (x_m-\mu_m)(x_1-\mu_1) & (x_m-\mu_m)(x_2-\mu_2) & \cdots & (x_m-\mu_m)^2 \end{pmatrix}$$
$$= \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{m1} & \sigma_{m2} & \cdots & \sigma_{mm} \end{pmatrix} = \boldsymbol{\Sigma}. \tag{3.2}$$
$\boldsymbol{\Sigma}$, the covariance matrix, is an $m \times m$ matrix of variance and covariance terms. The variance $\sigma_i^2 = \sigma_{ii}$ of $x_i$ lies along the diagonal, while the cross-product terms represent the covariance between $x_i$ and $x_j$.
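As a numerical illustration (a sketch, not from the text; the mean vector, covariance matrix, seed, and sample size below are made up), the mean of a random vector is estimated by the vector of elementwise sample means, and the covariance matrix by the matrix of sample variances and covariances:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate n draws of an m x 1 random vector (here m = 3) with known
# mean vector mu and covariance matrix Sigma, then estimate both.
mu = np.array([1.0, -2.0, 0.5])
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, -0.3, 1.0]])
Sigma = A @ A.T                       # a valid (positive definite) covariance

n = 200_000
x = rng.multivariate_normal(mu, Sigma, size=n)   # n x m array of draws

mu_hat = x.mean(axis=0)               # the vector of the means
Sigma_hat = np.cov(x, rowvar=False)   # m x m sample covariance matrix

print(np.allclose(mu_hat, mu, atol=0.05))        # True
print(np.allclose(Sigma_hat, Sigma, atol=0.05))  # True
```

Note that `Sigma_hat` is symmetric by construction, matching the property established in the next subsection.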
3.1.2 Properties Of The Covariance Matrix
3.1.2.1 Symmetric
The variance-covariance matrix $\boldsymbol{\Sigma}$ is a symmetric matrix. This can be shown by noting that
$$\sigma_{ij} = \mathrm{E}[(x_i-\mu_i)(x_j-\mu_j)] = \mathrm{E}[(x_j-\mu_j)(x_i-\mu_i)] = \sigma_{ji}.$$
Due to this symmetry, $\boldsymbol{\Sigma}$ has only $m(m+1)/2$ unique elements.
3.1.2.2 Positive Semidefinite
$\boldsymbol{\Sigma}$ is a positive semidefinite matrix. Recall that an $m \times m$ symmetric matrix is positive semidefinite if and only if it meets any of the following three equivalent conditions:

1. All the principal minors are nonnegative;
2. $\boldsymbol{\lambda}'\boldsymbol{\Sigma}\boldsymbol{\lambda} \geq 0$ for every $m \times 1$ vector $\boldsymbol{\lambda} \neq \mathbf{0}$;
3. $\boldsymbol{\Sigma} = \mathbf{P}\mathbf{P}'$ for some $m \times m$ matrix $\mathbf{P}$.

The first condition (actually we use negative definiteness) is useful in the study of utility maximization, while the latter two are useful in econometric analysis.
The necessity of the second condition is the easiest to demonstrate in the current context. Let $\boldsymbol{\lambda} \neq \mathbf{0}$. Then we have
$$\boldsymbol{\lambda}'\boldsymbol{\Sigma}\boldsymbol{\lambda} = \boldsymbol{\lambda}'\,\mathrm{E}[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})']\,\boldsymbol{\lambda} = \mathrm{E}[\boldsymbol{\lambda}'(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\lambda}] = \mathrm{E}\{[\boldsymbol{\lambda}'(\mathbf{x}-\boldsymbol{\mu})]^2\} \geq 0,$$
since the term inside the expectation is a square. Hence, $\boldsymbol{\Sigma}$ is a positive semidefinite matrix.
Note that the $\mathbf{P}$ satisfying the third relationship is not unique. Let $\mathbf{D}$ be any $m \times m$ orthogonal matrix, so that $\mathbf{D}\mathbf{D}' = \mathbf{I}_m$; then $\mathbf{P}^* = \mathbf{P}\mathbf{D}$ yields $\mathbf{P}^*\mathbf{P}^{*\prime} = \mathbf{P}\mathbf{D}\mathbf{D}'\mathbf{P}' = \mathbf{P}\mathbf{I}_m\mathbf{P}' = \boldsymbol{\Sigma}$. Usually, we will choose $\mathbf{P}$ to be an upper or lower triangular matrix with $m(m+1)/2$ nonzero elements, which makes such a $\mathbf{P}$ unique.
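This non-uniqueness is easy to check numerically (an illustrative sketch with a made-up $\boldsymbol{\Sigma}$): NumPy's Cholesky routine returns the lower-triangular choice of $\mathbf{P}$, and any orthogonal rotation of it factors $\boldsymbol{\Sigma}$ equally well.

```python
import numpy as np

rng = np.random.default_rng(1)

# A positive definite covariance matrix (made up for illustration)
Sigma = np.array([[4.0, 2.0, 0.6],
                  [2.0, 3.0, 0.4],
                  [0.6, 0.4, 1.0]])

# Lower-triangular factor: the (unique) Cholesky choice of P
P = np.linalg.cholesky(Sigma)
print(np.allclose(P @ P.T, Sigma))           # True

# The factorization is not unique: rotate P by any orthogonal matrix D
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # a random orthogonal D
P_star = P @ Q
print(np.allclose(P_star @ P_star.T, Sigma))       # True as well
```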
3.1.2.3 Positive Definite
Since $\boldsymbol{\Sigma}$ is a positive semidefinite matrix, it will be a positive definite matrix if and only if $\det(\boldsymbol{\Sigma}) \neq 0$. Now, we know that $\boldsymbol{\Sigma} = \mathbf{P}\mathbf{P}'$ for some $m \times m$ matrix $\mathbf{P}$, so $\det(\boldsymbol{\Sigma}) = \det(\mathbf{P})^2$, and positive definiteness implies $\det(\mathbf{P}) \neq 0$.
3.1.3 Linear Transformations
Let $\mathbf{y} = \mathbf{b} + \mathbf{B}\mathbf{x}$, where $\mathbf{y}$ and $\mathbf{b}$ are $m \times 1$ vectors and $\mathbf{B}$ is an $m \times m$ matrix. Then
$$\mathrm{E}[\mathbf{y}] = \mathbf{b} + \mathbf{B}\,\mathrm{E}[\mathbf{x}] = \mathbf{b} + \mathbf{B}\boldsymbol{\mu} = \boldsymbol{\mu}_y. \tag{3.3}$$
Thus, the mean of a linear transformation is the linear transformation of the mean.

Next, we have
$$\mathrm{E}[(\mathbf{y}-\boldsymbol{\mu}_y)(\mathbf{y}-\boldsymbol{\mu}_y)'] = \mathrm{E}\{[\mathbf{B}(\mathbf{x}-\boldsymbol{\mu})][\mathbf{B}(\mathbf{x}-\boldsymbol{\mu})]'\} = \mathbf{B}\,\mathrm{E}[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})']\,\mathbf{B}'$$
$$= \mathbf{B}\boldsymbol{\Sigma}\mathbf{B}' \tag{3.4}$$
$$= \boldsymbol{\Sigma}_y, \tag{3.5}$$
where we use the result $(\mathbf{A}\mathbf{B}\mathbf{C})' = \mathbf{C}'\mathbf{B}'\mathbf{A}'$ when conformability holds.
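These two moment results can be checked by simulation (an illustrative sketch; the particular $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, $\mathbf{b}$, $\mathbf{B}$, and seed are made up):

```python
import numpy as np

rng = np.random.default_rng(2)

m, n = 3, 500_000
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
b = np.array([1.0, 2.0, 3.0])
B = np.array([[1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [1.0, 0.0, -1.0]])

x = rng.multivariate_normal(mu, Sigma, size=n)
y = b + x @ B.T                       # y_i = b + B x_i for each draw

# Sample moments of y against the theoretical b + B mu and B Sigma B'
print(np.allclose(y.mean(axis=0), b + B @ mu, atol=0.02))                # True
print(np.allclose(np.cov(y, rowvar=False), B @ Sigma @ B.T, atol=0.08))  # True
```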
3.2 Change Of Variables
3.2.1 Univariate
Let $x$ be a random variable and $f_x(\cdot)$ the probability density function of $x$. Now define $y = h(x)$, where
$$h'(x) = \frac{d\,h(x)}{d\,x} > 0.$$
That is, $h(x)$ is a strictly monotonically increasing function, so $y$ is a one-to-one transformation of $x$. We would like to know the probability density function of $y$, $f_y(y)$. To find it, we note that
$$\Pr(y \leq h(a)) = \Pr(x \leq a), \tag{3.6}$$
$$\Pr(x \leq a) = \int_{-\infty}^{a} f_x(x)\,dx = F_x(a), \tag{3.7}$$
and
$$\Pr(y \leq h(a)) = \int_{-\infty}^{h(a)} f_y(y)\,dy = F_y(h(a)), \tag{3.8}$$
for all $a$.
Assuming that the cumulative distribution function is differentiable, we use (3.6) to combine (3.7) and (3.8), and take the total differential, which gives us
$$dF_x(a) = dF_y(h(a))$$
$$f_x(a)\,da = f_y(h(a))\,h'(a)\,da$$
for all $a$. Thus, for a small perturbation,
$$f_x(a) = f_y(h(a))\,h'(a) \tag{3.9}$$
for all $a$. Also, since $y$ is a one-to-one transformation of $x$, we know that $h(\cdot)$ can be inverted: $x = h^{-1}(y)$. Thus $a = h^{-1}(y)$, and we can rewrite (3.9) as
$$f_x(h^{-1}(y)) = f_y(y)\,h'(h^{-1}(y)).$$
Therefore, the probability density function of $y$ is
$$f_y(y) = f_x(h^{-1}(y))\,\frac{1}{h'(h^{-1}(y))}. \tag{3.10}$$
Note that $f_y(y)$ is nonnegative, since $h'(\cdot) > 0$. If instead $h'(\cdot) < 0$, (3.10) can be corrected by taking the absolute value of $h'(\cdot)$, which assures that the probability density function takes only nonnegative values.
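A quick check of the change-of-variables formula (an illustrative sketch, not from the text): take $h(x) = e^x$ with $x \sim N(0,1)$, so $h^{-1}(y) = \log y$ and $h'(h^{-1}(y)) = y$. Formula (3.10) then gives the standard lognormal density, which SciPy implements directly.

```python
import numpy as np
from scipy import stats

# Change of variables y = h(x) = exp(x) with x ~ N(0, 1):
# by (3.10), f_y(y) = phi(log(y)) / y, the standard lognormal density.
y = np.linspace(0.1, 5.0, 50)
fy_formula = stats.norm.pdf(np.log(y)) / y
fy_scipy = stats.lognorm.pdf(y, s=1.0)   # s is the sigma of log(y)

print(np.allclose(fy_formula, fy_scipy))  # True
```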
Figure 3.1: Change of Variables (graph of $y = h(x)$, with $h(a)$ and $h(b)$ marked against $a$ and $b$)
3.2.2 Geometric Interpretation
Consider the graph of the relationship shown in Figure 3.1. We know that
$$\Pr[h(b) > y > h(a)] = \Pr(b > x > a).$$
Also, we know that
$$\Pr[h(b) > y > h(a)] \simeq f_y[h(b)]\,[h(b) - h(a)],$$
and
$$\Pr(b > x > a) \simeq f_x(b)\,(b - a).$$
So,
$$f_y[h(b)]\,[h(b) - h(a)] \simeq f_x(b)\,(b - a)$$
$$f_y[h(b)] \simeq f_x(b)\,\frac{1}{[h(b) - h(a)]/(b - a)}. \tag{3.11}$$
Now, as we let $a \to b$, the denominator of (3.11) approaches $h'(b)$. This is then the same formula as (3.10).
3.2.3 Multivariate
Let the $m \times 1$ vector
$$\mathbf{x} \sim f_x(\mathbf{x}).$$

Define a one-to-one transformation
$$\mathbf{y} = \mathbf{h}(\mathbf{x}),$$
where $\mathbf{y}$ and $\mathbf{h}(\cdot)$ are $m \times 1$.
Since h(·) is a one-to-one transformation, it has an inverse:
x = h−1( y ).
We also assume that $\partial \mathbf{h}(\mathbf{x})/\partial \mathbf{x}'$ exists. This is the $m \times m$ Jacobian matrix
$$\frac{\partial \mathbf{h}(\mathbf{x})}{\partial \mathbf{x}'} = \begin{pmatrix} \frac{\partial h_1(\mathbf{x})}{\partial x_1} & \frac{\partial h_2(\mathbf{x})}{\partial x_1} & \cdots & \frac{\partial h_m(\mathbf{x})}{\partial x_1} \\ \frac{\partial h_1(\mathbf{x})}{\partial x_2} & \frac{\partial h_2(\mathbf{x})}{\partial x_2} & \cdots & \frac{\partial h_m(\mathbf{x})}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial h_1(\mathbf{x})}{\partial x_m} & \frac{\partial h_2(\mathbf{x})}{\partial x_m} & \cdots & \frac{\partial h_m(\mathbf{x})}{\partial x_m} \end{pmatrix} = \mathbf{J}_x(\mathbf{x}). \tag{3.12}$$
Given this notation, the multivariate analog to (3.10) can be shown to be
$$f_y(\mathbf{y}) = f_x[\mathbf{h}^{-1}(\mathbf{y})]\,\frac{1}{|\det(\mathbf{J}_x[\mathbf{h}^{-1}(\mathbf{y})])|}. \tag{3.13}$$
Since $\mathbf{h}(\cdot)$ is differentiable and one-to-one, $\det(\mathbf{J}_x[\mathbf{h}^{-1}(\mathbf{y})]) \neq 0$.
Example 3.1. Let $y = b_0 + b_1 x$, where $x$, $b_0$, and $b_1$ are scalars. Then
$$x = \frac{y - b_0}{b_1} \quad\text{and}\quad \frac{dy}{dx} = b_1.$$
Therefore,
$$f_y(y) = f_x\!\left(\frac{y - b_0}{b_1}\right)\frac{1}{|b_1|}. \;\square$$
Example 3.2. Let $\mathbf{y} = \mathbf{b} + \mathbf{B}\mathbf{x}$, where $\mathbf{y}$ is an $m \times 1$ vector and $\det(\mathbf{B}) \neq 0$. Then
$$\mathbf{x} = \mathbf{B}^{-1}(\mathbf{y} - \mathbf{b}) \quad\text{and}\quad \frac{\partial \mathbf{y}}{\partial \mathbf{x}'} = \mathbf{B} = \mathbf{J}_x(\mathbf{x}).$$
Thus,
$$f_y(\mathbf{y}) = f_x(\mathbf{B}^{-1}(\mathbf{y} - \mathbf{b}))\,\frac{1}{|\det(\mathbf{B})|}. \;\square$$
3.3 Multivariate Normal Distribution
3.3.1 Spherical Normal Distribution
Definition 3.1. An $m \times 1$ random vector $\mathbf{z}$ is said to be spherically normally distributed if
$$f(\mathbf{z}) = \frac{1}{(2\pi)^{m/2}}\,e^{-\frac{1}{2}\mathbf{z}'\mathbf{z}}. \;\square$$
Such a random vector can be seen to be a vector of independent standard normals. Let $z_1, z_2, \ldots, z_m$ be i.i.d. random variables such that $z_i \sim N(0,1)$. That is, $z_i$ has the pdf given in (2.8), for $i = 1, \ldots, m$. Then, by independence, the joint distribution of the $z_i$'s is given by
$$f(z_1, z_2, \ldots, z_m) = f(z_1)f(z_2)\cdots f(z_m) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}z_i^2} = \frac{1}{(2\pi)^{m/2}}e^{-\frac{1}{2}\sum_{i=1}^{m} z_i^2} = \frac{1}{(2\pi)^{m/2}}e^{-\frac{1}{2}\mathbf{z}'\mathbf{z}}, \tag{3.14}$$
where $\mathbf{z}' = (z_1\; z_2\; \ldots\; z_m)$.
3.3.2 Multivariate Normal
Definition 3.2. The $m \times 1$ random vector $\mathbf{x}$ with density
$$f_x(\mathbf{x}) = \frac{1}{(2\pi)^{m/2}[\det(\boldsymbol{\Sigma})]^{1/2}}\,e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})} \tag{3.15}$$
is said to be distributed multivariate normal with mean vector $\boldsymbol{\mu}$ and positive definite covariance matrix $\boldsymbol{\Sigma}$. $\square$

Such a distribution for $\mathbf{x}$ is denoted by $\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. The spherical normal distribution is seen to be the special case where $\boldsymbol{\mu} = \mathbf{0}$ and $\boldsymbol{\Sigma} = \mathbf{I}_m$.
There is a one-to-one relationship between the multivariate normal random vector and a spherical normal random vector. Let $\mathbf{z}$ be an $m \times 1$ spherical normal random vector and let the $m \times 1$ vector
$$\mathbf{x} = \boldsymbol{\mu} + \mathbf{A}\mathbf{z},$$
where $\mathbf{z}$ is defined above and $\det(\mathbf{A}) \neq 0$. Then,
$$\mathrm{E}[\mathbf{x}] = \boldsymbol{\mu} + \mathbf{A}\,\mathrm{E}[\mathbf{z}] = \boldsymbol{\mu}, \tag{3.16}$$
since $\mathrm{E}[\mathbf{z}] = \mathbf{0}$. Also, we know that
$$\mathrm{E}[\mathbf{z}\mathbf{z}'] = \mathrm{E}\begin{pmatrix} z_1^2 & z_1 z_2 & \cdots & z_1 z_m \\ z_2 z_1 & z_2^2 & \cdots & z_2 z_m \\ \vdots & \vdots & \ddots & \vdots \\ z_m z_1 & z_m z_2 & \cdots & z_m^2 \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} = \mathbf{I}_m, \tag{3.17}$$
since $\mathrm{E}[z_i z_j] = 0$ for all $i \neq j$, and $\mathrm{E}[z_i^2] = 1$ for all $i$. Therefore,
$$\mathrm{E}[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})'] = \mathrm{E}[\mathbf{A}\mathbf{z}\mathbf{z}'\mathbf{A}'] = \mathbf{A}\,\mathrm{E}[\mathbf{z}\mathbf{z}']\,\mathbf{A}' = \mathbf{A}\mathbf{I}_m\mathbf{A}' = \boldsymbol{\Sigma}, \tag{3.18}$$
where $\boldsymbol{\Sigma}$ is a positive definite matrix (since $\det(\mathbf{A}) \neq 0$).

Next, we need to find the probability density function $f_x(\mathbf{x})$ of $\mathbf{x}$. We know that
$$\mathbf{z} = \mathbf{A}^{-1}(\mathbf{x} - \boldsymbol{\mu}), \qquad \mathbf{z}' = (\mathbf{x} - \boldsymbol{\mu})'\mathbf{A}^{-1\prime},$$
and
$$\mathbf{J}_z(\mathbf{z}) = \mathbf{A},$$
so we use (3.13) to get
$$f_x(\mathbf{x}) = f_z[\mathbf{A}^{-1}(\mathbf{x}-\boldsymbol{\mu})]\,\frac{1}{|\det(\mathbf{A})|} = \frac{1}{(2\pi)^{m/2}}e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\mathbf{A}^{-1\prime}\mathbf{A}^{-1}(\mathbf{x}-\boldsymbol{\mu})}\,\frac{1}{|\det(\mathbf{A})|} = \frac{1}{(2\pi)^{m/2}}e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'(\mathbf{A}\mathbf{A}')^{-1}(\mathbf{x}-\boldsymbol{\mu})}\,\frac{1}{|\det(\mathbf{A})|}, \tag{3.19}$$
where we use the results $(\mathbf{A}\mathbf{B}\mathbf{C})^{-1} = \mathbf{C}^{-1}\mathbf{B}^{-1}\mathbf{A}^{-1}$ and $\mathbf{A}'^{-1} = \mathbf{A}^{-1\prime}$. However, $\boldsymbol{\Sigma} = \mathbf{A}\mathbf{A}'$, so $\det(\boldsymbol{\Sigma}) = \det(\mathbf{A})^2$ and $|\det(\mathbf{A})| = [\det(\boldsymbol{\Sigma})]^{1/2}$. Thus we can rewrite (3.19) as
$$f_x(\mathbf{x}) = \frac{1}{(2\pi)^{m/2}[\det(\boldsymbol{\Sigma})]^{1/2}}\,e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}, \tag{3.20}$$
and we see that $\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Since this process is completely reversible, the relationship is one-to-one.
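The construction $\mathbf{x} = \boldsymbol{\mu} + \mathbf{A}\mathbf{z}$ is exactly how multivariate normal draws are generated in practice. A simulation sketch (illustrative values of $\boldsymbol{\mu}$, $\mathbf{A}$, and the seed are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

# Build x = mu + A z from spherical normal z; then x ~ N(mu, A A').
mu = np.array([1.0, -1.0])
A = np.array([[2.0, 0.0],
              [1.0, 1.5]])           # det(A) != 0
Sigma = A @ A.T

n = 400_000
z = rng.standard_normal((n, 2))      # rows are spherical normal draws
x = mu + z @ A.T

print(np.allclose(x.mean(axis=0), mu, atol=0.02))              # True
print(np.allclose(np.cov(x, rowvar=False), Sigma, atol=0.05))  # True
```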
3.3.3 Linear Transformations
Theorem 3.1. Suppose $\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $\det(\boldsymbol{\Sigma}) \neq 0$ and $\mathbf{y} = \mathbf{b} + \mathbf{B}\mathbf{x}$ with $\mathbf{B}$ square and $\det(\mathbf{B}) \neq 0$. Then $\mathbf{y} \sim N(\boldsymbol{\mu}_y, \boldsymbol{\Sigma}_y)$. $\square$
Proof: From (3.3) and (3.4), we have $\mathrm{E}[\mathbf{y}] = \mathbf{b} + \mathbf{B}\boldsymbol{\mu} = \boldsymbol{\mu}_y$ and $\mathrm{E}[(\mathbf{y}-\boldsymbol{\mu}_y)(\mathbf{y}-\boldsymbol{\mu}_y)'] = \mathbf{B}\boldsymbol{\Sigma}\mathbf{B}' = \boldsymbol{\Sigma}_y$. To find the probability density function $f_y(\mathbf{y})$ of $\mathbf{y}$, we again use (3.13), which gives us
$$f_y(\mathbf{y}) = f_x[\mathbf{B}^{-1}(\mathbf{y}-\mathbf{b})]\,\frac{1}{|\det(\mathbf{B})|} = \frac{1}{(2\pi)^{m/2}[\det(\boldsymbol{\Sigma})]^{1/2}}e^{-\frac{1}{2}[\mathbf{B}^{-1}(\mathbf{y}-\mathbf{b})-\boldsymbol{\mu}]'\boldsymbol{\Sigma}^{-1}[\mathbf{B}^{-1}(\mathbf{y}-\mathbf{b})-\boldsymbol{\mu}]}\,\frac{1}{|\det(\mathbf{B})|}$$
$$= \frac{1}{(2\pi)^{m/2}[\det(\mathbf{B}\boldsymbol{\Sigma}\mathbf{B}')]^{1/2}}e^{-\frac{1}{2}(\mathbf{y}-\mathbf{b}-\mathbf{B}\boldsymbol{\mu})'(\mathbf{B}\boldsymbol{\Sigma}\mathbf{B}')^{-1}(\mathbf{y}-\mathbf{b}-\mathbf{B}\boldsymbol{\mu})} = \frac{1}{(2\pi)^{m/2}[\det(\boldsymbol{\Sigma}_y)]^{1/2}}e^{-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu}_y)'\boldsymbol{\Sigma}_y^{-1}(\mathbf{y}-\boldsymbol{\mu}_y)}. \tag{3.21}$$
So, $\mathbf{y} \sim N(\boldsymbol{\mu}_y, \boldsymbol{\Sigma}_y)$. $\square$

Thus we see, as asserted in the previous chapter, that linear transformations of multivariate normal random variables are also multivariate normal random variables. And any linear combination of independent normals will also be normal.
3.3.4 Quadratic Forms
Theorem 3.2. Let $\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\det(\boldsymbol{\Sigma}) \neq 0$. Then $(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) \sim \chi^2_m$. $\square$

Proof: Let $\boldsymbol{\Sigma} = \mathbf{P}\mathbf{P}'$, which is possible since $\boldsymbol{\Sigma}$ is positive definite. Then
$$(\mathbf{x}-\boldsymbol{\mu}) \sim N(\mathbf{0}, \boldsymbol{\Sigma})$$
and
$$\mathbf{z} = \mathbf{P}^{-1}(\mathbf{x}-\boldsymbol{\mu}) \sim N(\mathbf{0}, \mathbf{I}_m).$$
Therefore,
$$\mathbf{z}'\mathbf{z} = \sum_{i=1}^{m} z_i^2 = [\mathbf{P}^{-1}(\mathbf{x}-\boldsymbol{\mu})]'[\mathbf{P}^{-1}(\mathbf{x}-\boldsymbol{\mu})] = (\mathbf{x}-\boldsymbol{\mu})'\mathbf{P}^{-1\prime}\mathbf{P}^{-1}(\mathbf{x}-\boldsymbol{\mu}) = (\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) \sim \chi^2_m. \;\square \tag{3.22}$$
$\boldsymbol{\Sigma}^{-1}$ is called the weight matrix. With this result, we can use the $\chi^2_m$ distribution to make inferences about the mean vector $\boldsymbol{\mu}$ of $\mathbf{x}$.
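Theorem 3.2 can be verified by simulation (an illustrative sketch; $\boldsymbol{\mu}$, $\mathbf{P}$, the seed, and the tolerances are made up): the quadratic form of multivariate normal draws should match a $\chi^2_m$ distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

m = 3
mu = np.array([1.0, 2.0, 3.0])
P = np.array([[1.0, 0.0, 0.0],
              [0.4, 1.2, 0.0],
              [0.1, 0.3, 0.8]])
Sigma = P @ P.T
Sigma_inv = np.linalg.inv(Sigma)

n = 100_000
x = rng.multivariate_normal(mu, Sigma, size=n)
d = x - mu
q = np.einsum("ij,jk,ik->i", d, Sigma_inv, d)  # (x-mu)' Sigma^{-1} (x-mu)

# Compare the simulated quadratic form with the chi-squared(m) distribution
print(abs(q.mean() - m) < 0.05)        # E[chi2_m] = m
ks = stats.kstest(q, "chi2", args=(m,))
print(ks.statistic < 0.01)             # distribution is close to chi2_3
```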
3.4 Normality and the Sample Mean
3.4.1 Moments of the Sample Mean
Consider the $m \times 1$ vectors $\mathbf{x}_i \sim$ i.i.d. jointly, with $m \times 1$ mean vector $\mathrm{E}[\mathbf{x}_i] = \boldsymbol{\mu}$ and $m \times m$ covariance matrix $\mathrm{E}[(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})'] = \boldsymbol{\Sigma}$. Define $\overline{\mathbf{x}}_n = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$ as the vector sample mean, which is also the vector of scalar sample means. The mean of the vector sample mean follows directly:
$$\mathrm{E}[\overline{\mathbf{x}}_n] = \frac{1}{n}\sum_{i=1}^{n}\mathrm{E}[\mathbf{x}_i] = \boldsymbol{\mu}.$$
Alternatively, this result can be obtained by applying the scalar results element by element. The second moment matrix of the vector sample mean is given by
$$\mathrm{E}[(\overline{\mathbf{x}}_n-\boldsymbol{\mu})(\overline{\mathbf{x}}_n-\boldsymbol{\mu})'] = \frac{1}{n^2}\mathrm{E}\!\left[\left(\sum_{i=1}^{n}\mathbf{x}_i - n\boldsymbol{\mu}\right)\left(\sum_{i=1}^{n}\mathbf{x}_i - n\boldsymbol{\mu}\right)'\right]$$
$$= \frac{1}{n^2}\mathrm{E}[\{(\mathbf{x}_1-\boldsymbol{\mu}) + (\mathbf{x}_2-\boldsymbol{\mu}) + \cdots + (\mathbf{x}_n-\boldsymbol{\mu})\}\{(\mathbf{x}_1-\boldsymbol{\mu}) + (\mathbf{x}_2-\boldsymbol{\mu}) + \cdots + (\mathbf{x}_n-\boldsymbol{\mu})\}'] = \frac{1}{n^2}\,n\boldsymbol{\Sigma} = \frac{1}{n}\boldsymbol{\Sigma},$$
since the covariances between different observations are zero.
3.4.2 Distribution of the Sample Mean
Suppose $\mathbf{x}_i \sim$ i.i.d. $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ jointly. Then it follows from joint multivariate normality that $\overline{\mathbf{x}}_n$ must also be multivariate normal, since it is a linear transformation. Specifically, we have
$$\overline{\mathbf{x}}_n \sim N\!\left(\boldsymbol{\mu}, \tfrac{1}{n}\boldsymbol{\Sigma}\right),$$
or equivalently
$$\overline{\mathbf{x}}_n - \boldsymbol{\mu} \sim N\!\left(\mathbf{0}, \tfrac{1}{n}\boldsymbol{\Sigma}\right),$$
$$\sqrt{n}(\overline{\mathbf{x}}_n - \boldsymbol{\mu}) \sim N(\mathbf{0}, \boldsymbol{\Sigma}),$$
$$\sqrt{n}\,\boldsymbol{\Sigma}^{-1/2}(\overline{\mathbf{x}}_n - \boldsymbol{\mu}) \sim N(\mathbf{0}, \mathbf{I}_m),$$
where $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}^{1/2\prime}$ and $\boldsymbol{\Sigma}^{-1/2} = (\boldsymbol{\Sigma}^{1/2})^{-1}$.
3.4.3 Multivariate Central Limit Theorem
Theorem 3.3. Suppose that (i) $\mathbf{x}_i \sim$ i.i.d. jointly, (ii) $\mathrm{E}[\mathbf{x}_i] = \boldsymbol{\mu}$, and (iii) $\mathrm{E}[(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})'] = \boldsymbol{\Sigma}$. Then
$$\sqrt{n}(\overline{\mathbf{x}}_n - \boldsymbol{\mu}) \to_d N(\mathbf{0}, \boldsymbol{\Sigma}),$$
or equivalently
$$\mathbf{z} = \sqrt{n}\,\boldsymbol{\Sigma}^{-1/2}(\overline{\mathbf{x}}_n - \boldsymbol{\mu}) \to_d N(\mathbf{0}, \mathbf{I}_m). \;\square$$

These results apply even if the original underlying distribution is not normal, and they follow directly from the scalar results applied to any linear combination of $\overline{\mathbf{x}}_n$.
3.4.4 Limiting Behavior of Quadratic Forms
Consider the following quadratic form:
$$n\,(\overline{\mathbf{x}}_n-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\overline{\mathbf{x}}_n-\boldsymbol{\mu}) = n\,(\overline{\mathbf{x}}_n-\boldsymbol{\mu})'(\boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}^{1/2\prime})^{-1}(\overline{\mathbf{x}}_n-\boldsymbol{\mu}) = [\sqrt{n}\,\boldsymbol{\Sigma}^{-1/2}(\overline{\mathbf{x}}_n-\boldsymbol{\mu})]'[\sqrt{n}\,\boldsymbol{\Sigma}^{-1/2}(\overline{\mathbf{x}}_n-\boldsymbol{\mu})] = \mathbf{z}'\mathbf{z} \to_d \chi^2_m.$$
This form is convenient for asymptotic joint tests concerning more than one mean at a time.
3.5 Noncentral Distributions
When we conduct a hypothesis test we have a stated null hypothesis $H_0$ and an, at least implicitly, stated alternative hypothesis $H_1$. As a matter of convention, we carefully establish the behavior of our test statistic under the null, such as the standard normal, for example. The fact is that it is quite easy to develop a statistic with a known distribution, and hence a known size or probability of rejecting under the null. What is interesting is what happens to the statistic when the alternative applies. This means we need to establish the power of a test, which is the probability of rejecting under the alternative. We choose between equal-sized tests on the basis of which has the most power. And we prefer consistent tests, whose power improves with sample size and which reject with probability one in the limit.
3.5.1 Noncentral Scalar Normal
Definition 3.3. Let $x \sim N(\mu, \sigma^2)$. Then, for $\mu^* = \mu/\sigma$,
$$z^* = \frac{x}{\sigma} \sim N(\mu^*, 1) \tag{3.23}$$
has a noncentral normal distribution with noncentrality parameter $\mu^*$. $\square$
Example 3.3. Suppose $x_i \sim$ i.i.d. $N(\mu, \sigma^2)$ for $i = 1, 2, \ldots, n$. When we do a hypothesis test of the mean, with known variance, we have, under the null hypothesis $H_0: \mu = \mu_0$,
$$S_n = \frac{\overline{x}_n - \mu_0}{\sqrt{\sigma^2/n}} \sim N(0, 1) \tag{3.24}$$
and, under the alternative $H_1: \mu = \mu_1 \neq \mu_0$,
$$S_n = \frac{\overline{x}_n - \mu_0}{\sqrt{\sigma^2/n}} = \frac{\overline{x}_n - \mu_1}{\sqrt{\sigma^2/n}} + \frac{\mu_1 - \mu_0}{\sqrt{\sigma^2/n}} = N(0,1) + \frac{\mu_1 - \mu_0}{\sqrt{\sigma^2/n}} \sim N\!\left(\frac{\mu_1 - \mu_0}{\sqrt{\sigma^2/n}},\, 1\right). \tag{3.25}$$
Thus, the behavior of $S_n$ under the null hypothesis follows the standard normal distribution, but under the alternative hypothesis it follows a noncentral normal distribution. Note how the noncentrality increases at the rate $\sqrt{n}$, so the test will be consistent and reject a false null with certainty in large samples. $\square$
As this example makes clear, the noncentral normal distribution is especially useful when carefully exploring the behavior of a test statistic under the alternative hypothesis.
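The power calculation implied by Example 3.3 can be sketched numerically (an illustration only: `power` is a hypothetical helper, and the particular values of $\mu_0$, $\mu_1$, $\sigma$, and $n$ are made up):

```python
import numpy as np
from scipy import stats

# Power of a two-sided 5% z-test of H0: mu = mu0 with known sigma.
# Under H1 the statistic is N(delta, 1) with noncentrality
# delta = (mu1 - mu0) / sqrt(sigma**2 / n), as in (3.25).
def power(mu0, mu1, sigma, n, size=0.05):
    delta = (mu1 - mu0) / np.sqrt(sigma**2 / n)
    crit = stats.norm.ppf(1 - size / 2)
    # Pr(|N(delta, 1)| > crit)
    return stats.norm.sf(crit - delta) + stats.norm.cdf(-crit - delta)

print(round(power(0.0, 0.0, 1.0, 25), 3))   # 0.05: the size under the null
print(round(power(0.0, 0.5, 1.0, 25), 3))   # 0.705 for delta = 2.5
print(power(0.0, 0.5, 1.0, 400) > 0.999)    # True: power -> 1 as n grows
```

The last line illustrates the consistency of the test: the noncentrality grows at rate $\sqrt{n}$, driving the power toward one.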
3.5.2 Noncentral t
Definition 3.4. Let $z^* \sim N(\mu^*, 1)$, $w \sim \chi^2_k$, and let $z^*$ and $w$ be independent. Then
$$T = \frac{z^*}{\sqrt{w/k}} \sim t_k(\mu^*) \tag{3.26}$$
has a noncentral t distribution. $\square$
The noncentral t distribution is used in calculating the power of tests of the mean when the variance is unknown and must be estimated.
Unlike the noncentral normal, the noncentral t is more than a central t, which has zero mean, shifted by $\mu^*$. Let $z = z^* - \mu^*$; then we can write
$$\frac{z^*}{\sqrt{w/k}} = \frac{z}{\sqrt{w/k}} + \frac{\mu^*}{\sqrt{w/k}}.$$
The first term is central t, but the second or shift term is now stochastic, so the distribution involves more than a shift by the constant $\mu^*$. The distribution after the shift is a rather complicated function of both $\mu^*$ and $k$, which we will not present but instead summarize by how its shape changes as these arguments change. The shift term has a very asymmetric distribution, so $T$ has an asymmetric distribution unless $\mu^* = 0$. It is unimodal and bell-shaped, with the right-hand tail heavier than the left when $\mu^* > 0$ and vice versa. As $k$ grows large, the distribution becomes more symmetric about $\mu^*$.
These complications are best seen with the mean and variance. Although $\mathrm{E}[w/k] = 1$, by Jensen's inequality $\mathrm{E}[1/\sqrt{w/k}] > 1/\sqrt{\mathrm{E}[w/k]} = 1$, so $\mathrm{E}[T] > \mu^*$. Specifically, we find, for $k > 1$,
$$\mathrm{E}[T] = \mu^*\sqrt{\frac{k}{2}}\,\frac{\Gamma((k-1)/2)}{\Gamma(k/2)},$$
which obviously depends on the degrees of freedom in a complicated way. Similarly, for $k > 2$, we find
$$\mathrm{Var}[T] = \frac{k(1+\mu^{*2})}{k-2} - \mu^{*2}\,\frac{k}{2}\left(\frac{\Gamma((k-1)/2)}{\Gamma(k/2)}\right)^2.$$
An excellent approximation to $\sqrt{\frac{k}{2}}\,\frac{\Gamma((k-1)/2)}{\Gamma(k/2)}$ in both these moments is $\left(1 - \frac{3}{4k-1}\right)^{-1}$, which clearly approaches 1 as $k$ grows large. Thus as $k$ grows large, $\mathrm{E}[T] \to \mu^*$ and $\mathrm{Var}[T] \to 1$, which would be the moments of a central t shifted by $\mu^*$.
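These moment formulas can be checked against SciPy's noncentral t implementation (an illustrative sketch; the choices $k = 10$ and $\mu^* = 1.5$ are made up):

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

# Mean and variance of the noncentral t (df = k, noncentrality mu_star),
# using gammaln for a numerically stable Gamma((k-1)/2) / Gamma(k/2).
k, mu_star = 10, 1.5
c = np.sqrt(k / 2) * np.exp(gammaln((k - 1) / 2) - gammaln(k / 2))

mean_formula = mu_star * c
var_formula = k * (1 + mu_star**2) / (k - 2) - mu_star**2 * c**2

dist = stats.nct(df=k, nc=mu_star)
print(np.isclose(mean_formula, dist.mean()))   # True
print(np.isclose(var_formula, dist.var()))     # True

# The (1 - 3/(4k-1))^{-1} approximation to c
approx = 1 / (1 - 3 / (4 * k - 1))
print(abs(c - approx) < 1e-3)                  # True: very close already at k = 10
```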
Example 3.4. Consider Example 3.3 above, but where we do not know $\sigma^2$ and use the unbiased estimator $s^2 = \sum_{i=1}^{n}(x_i - \overline{x})^2/(n-1)$. Now from the last chapter we have
$$Y = (n-1)\frac{s^2}{\sigma^2} \sim \chi^2_{n-1},$$
while from above,
$$Z = \frac{\overline{x}_n - \mu_0}{\sqrt{\sigma^2/n}} \sim N\!\left(\frac{\mu_1 - \mu_0}{\sqrt{\sigma^2/n}},\, 1\right);$$
thus under the alternative hypothesis
$$S_n = \frac{Z}{\sqrt{Y/(n-1)}} = \frac{\overline{x}_n - \mu_0}{\sqrt{s^2/n}} \sim t_{n-1}\!\left(\frac{\mu_1 - \mu_0}{\sqrt{\sigma^2/n}}\right).$$
Note that the alternative distribution uses $\sigma^2$ rather than $s^2$ in the noncentrality parameter, which increases at the rate $\sqrt{n}$. $\square$
It is instructive to examine the power of a two-sided 5% test implied by this example for various values of the noncentrality, as given in Figure 3.2. Let $d = (\mu_1 - \mu_0)/\sigma$; then the noncentrality is $\sqrt{n}\,d$, and we can plot the probability of exceeding the critical value $\tau_{.025,\,n-1}$ in absolute value for various choices of $d$ and $n$. Notice that for each choice of $n$ the curves have a minimum around $d = 0$ at the ostensible size under the null of .05, and they asymptote to 1 as $d$ diverges from zero in the positive or negative direction. This is the typical shape of the power curve for a two-sided test. Also note that the curves become more compact around zero as the sample size $n$ increases. In the limit the region of low power collapses to a spike at zero, which reflects the fact that the test is consistent.
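The power curves just described can be computed directly from the noncentral t distribution (an illustrative sketch; `t_power` is a hypothetical helper and the values of $d$ and $n$ echo the figure):

```python
import numpy as np
from scipy import stats

# Power of the two-sided 5% t-test in Example 3.4: reject when
# |S_n| > tau, with S_n ~ noncentral t (df = n-1, nc = sqrt(n) * d)
# under the alternative.
def t_power(d, n, size=0.05):
    df = n - 1
    tau = stats.t.ppf(1 - size / 2, df)
    nc = np.sqrt(n) * d
    return stats.nct.sf(tau, df, nc) + stats.nct.cdf(-tau, df, nc)

for n in (5, 10, 25, 100):
    print(n, round(t_power(0.5, n), 3))      # power rises with n at d = 0.5

print(abs(t_power(0.0, 25) - 0.05) < 1e-6)   # True: size at d = 0
```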
3.5.3 Noncentral Chi-Squared
Definition 3.5. Let $\mathbf{z}^* \sim N(\boldsymbol{\mu}^*, \mathbf{I}_m)$. Then
$$V = \mathbf{z}^{*\prime}\mathbf{z}^* \sim \chi^2_m(\delta) \tag{3.27}$$
has a noncentral chi-squared distribution, where $\delta = \boldsymbol{\mu}^{*\prime}\boldsymbol{\mu}^*$ is the noncentrality parameter. $\square$
As with the noncentral t, the noncentral chi-squared is more than a central chi-squared shifted by a constant. Let $\mathbf{z} = \mathbf{z}^* - \boldsymbol{\mu}^*$; then
$$\mathbf{z}^{*\prime}\mathbf{z}^* = (\mathbf{z} + \boldsymbol{\mu}^*)'(\mathbf{z} + \boldsymbol{\mu}^*) = \mathbf{z}'\mathbf{z} + 2\mathbf{z}'\boldsymbol{\mu}^* + \boldsymbol{\mu}^{*\prime}\boldsymbol{\mu}^*.$$
The first term is a central chi-squared and the last is the noncentrality parameter $\delta$, which shifts the distribution to the right, but the middle term is $N(0, 4\delta)$ and
Figure 3.2: Power of the t-test (two-sided 5% test of the sample mean, $\sigma = 1$; power plotted against $d$ for $n = 5, 10, 25, 100$)
can shift the distribution to the right or left, with the sum restricted to be non-negative. The density is given by
$$f_V(v; k, \delta) = \sum_{i=0}^{\infty} e^{-\delta/2}\,\frac{(\delta/2)^i}{i!}\,f_{Y_{k+2i}}(v),$$
where $f_{Y_q}(\cdot)$ is the density of a central chi-squared with $q$ degrees of freedom. The weights in the summation are those of a Poisson distribution with mean and variance $\delta/2$. For a given $k$, increases in $\delta$ shift more and more of the distribution to the right, as the larger weights of the Poisson, which has its mode at $\lfloor \delta/2 \rfloor$, shift to the right. The mean of $V$ is $k + \delta$ and the variance is $2(k + 2\delta)$.
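The Poisson-mixture form of the density, and the mean and variance just stated, can be checked against SciPy's noncentral chi-squared (an illustrative sketch; $k = 4$, $\delta = 3$, and the series truncation are made up):

```python
import numpy as np
from scipy import stats

# Poisson-mixture form of the noncentral chi-squared density against
# scipy's ncx2, plus its mean k + delta and variance 2(k + 2*delta).
k, delta = 4, 3.0
v = np.linspace(0.1, 25, 60)

i = np.arange(200)                          # truncate the infinite series
weights = stats.poisson.pmf(i, delta / 2)   # e^{-d/2} (d/2)^i / i!
mixture = (weights[:, None] * stats.chi2.pdf(v, k + 2 * i[:, None])).sum(axis=0)

print(np.allclose(mixture, stats.ncx2.pdf(v, k, delta)))          # True
print(np.isclose(stats.ncx2.mean(k, delta), k + delta))           # True
print(np.isclose(stats.ncx2.var(k, delta), 2 * (k + 2 * delta)))  # True
```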
Example 3.5. Consider a test of the means for $\mathbf{x}_i \sim$ i.i.d. $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $\boldsymbol{\Sigma}$ nonsingular. For $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$, we have
$$n\,(\overline{\mathbf{x}}_n - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\overline{\mathbf{x}}_n - \boldsymbol{\mu}_0) \sim \chi^2_m. \tag{3.28}$$
Let $\mathbf{z}^* = \mathbf{P}^{-1}\sqrt{n}(\overline{\mathbf{x}}_n - \boldsymbol{\mu}_0)$ for $\boldsymbol{\Sigma} = \mathbf{P}\mathbf{P}'$; then for $H_1: \boldsymbol{\mu} = \boldsymbol{\mu}_1 \neq \boldsymbol{\mu}_0$, we have
$$n\,(\overline{\mathbf{x}}_n - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\overline{\mathbf{x}}_n - \boldsymbol{\mu}_0) = \sqrt{n}\,(\overline{\mathbf{x}}_n - \boldsymbol{\mu}_0)'\mathbf{P}^{-1\prime}\mathbf{P}^{-1}\sqrt{n}\,(\overline{\mathbf{x}}_n - \boldsymbol{\mu}_0) = \mathbf{z}^{*\prime}\mathbf{z}^* \sim \chi^2_m[\,n\,(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)\,]. \tag{3.29}$$
Note that the noncentrality is always non-negative, so the probability mass of the distribution under the alternative is shifted to the right. Consequently, we focus only on the right-hand tail of the null distribution. Moreover, the noncentrality increases at the rate $n$, so the test will be consistent. $\square$
3.5.4 Noncentral F
Definition 3.6. Let $Y \sim \chi^2_m(\delta)$, $W \sim \chi^2_n$, and let $Y$ and $W$ be independent random variables. Then
$$\frac{Y/m}{W/n} \sim F_{m,n}(\delta) \tag{3.30}$$
has a noncentral F distribution, where $\delta > 0$ is the noncentrality parameter. $\square$
The noncentral F distribution is used in tests of mean vectors where the scale of the covariance matrix is unknown and must be estimated. As with the other noncentral distributions, the noncentrality parameter determines the amount the distribution is shifted (to the right in this case) when nonzero. Such a shift would arise under an alternative hypothesis, as with the chi-squared example above. Examples for both the central and noncentral F distributions are deferred until Chapter 8.
Chapter 4
Asymptotic Theory
In Chapter 2 we found, via the central limit theorem, that the behavior of an average, under very general assumptions and when properly normalized, tends toward the standard normal distribution even when the underlying distribution is non-normal. In Chapter 3, we extended this result to multivariate averages. These are both examples of depending on the large-sample limiting or asymptotic behavior of the average to simplify our analysis. As we develop appropriate tools for more complicated models in later chapters, we will encounter numerous cases of estimators that, as a result of non-normality of the underlying variables or nonlinearity with respect to these variables, have distributions sufficiently complicated to render finite-sample inference infeasible. In most of these cases, however, the large-sample limiting or asymptotic behavior is tractable. Accordingly, in this chapter, we develop a number of asymptotic tools that will be useful for these cases.
4.1 Convergence Of Random Variables
4.1.1 Sequences Of Random Variables
In the following we will consider the behavior of an estimator or statistic, based on a sample of size $n$, as the sample size grows large. Let $y_n$ denote a realization of this statistic. This realization is a random draw from the underlying distribution of $y_n$, which may well change as the sample size $n$ increases. For the various values of $n$, these draws form a sequence of random variables $\{y_1, y_2, \ldots, y_n, \ldots\}$. An example of such a sequence is the sample average $y_n = \overline{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$. Each realization from this sequence is a random draw and cannot be known a priori; however, probability measures based on the distribution of each realization are real numbers and hopefully exhibit regularity and even simplicity as the sample size increases. Since such probability measures map each draw in the sequence onto the real line, they form a sequence of real numbers, whose behavior we now examine.
4.1.2 Convergence and Limits
We first examine the familiar concept of the limit from calculus.
Definition 4.1. A sequence of real numbers $a_1, a_2, \ldots, a_n, \ldots$ is said to be convergent to, or have a limit of, $\alpha$ if for every $\delta > 0$ there exists a positive real number $N$ such that for all $n > N$, $|a_n - \alpha| < \delta$. This is denoted as
$$\lim_{n\to\infty} a_n = \alpha. \;\square$$

Example 4.1. Let $\{a_n\} = 3 + 1/n$. Then clearly
$$\lim_{n\to\infty} a_n = 3,$$
since $|a_n - 3| = |1/n|$ can be made arbitrarily small for sufficiently large $n$. $\square$
We frequently encounter sequences of functions whose value at convergence depends on the point of evaluation. For example, the sequence of functions $\{a_n(x)\} = x(3 + 1/n)$ converges to $\alpha(x) = 3x$ for any given $x$. Here, the value of $n$ needed to make $|a_n(x) - \alpha(x)| = |x/n| < \delta$ will depend on $x$. This dependence of the limiting process on the point of evaluation necessitates the following uniformity definition.

Definition 4.2. A sequence of functions $\{f_n\}$, $n = 1, 2, 3, \ldots$, is said to be uniformly convergent to the function $f$ on a set $E$ of values of $x$ if, for each $\delta > 0$, an integer $N$ can be found such that $|f_n(x) - f(x)| < \delta$ for all $n > N$ and all $x \in E$. $\square$

Notationally, this means that we can define $a_n = \sup_{x \in E} |f_n(x) - f(x)|$ and $\alpha = 0$ and apply the standard definition of convergence. For the simple example $\{a_n(x)\} = x(3 + 1/n)$ we have $a_n = \sup_{x \in E} |x/n| = \sup_{x \in E} |x|/n$, which will be finite for bounded $x$, so $E$ can simply be a closed interval on the real line.
4.1.3 Orders of Sequences
We now introduce the notion of the order of a sequence, which will play an important role in the sequel. A sequence of interest may not have a limit and, in fact, may be divergent, but a simple transformation may yield a limit and thereby give insight into the rate at which the sequence diverges.
Definition 4.3. A sequence of real numbers $\{a_n\}$ is said to be of order at most $n^k$, and we write $\{a_n\}$ is $O(n^k)$, if
$$\lim_{n\to\infty} \frac{1}{n^k}\,a_n = c,$$
where $c$ is some finite real constant. $\square$
Example 4.2. Let $\{a_n\} = 3 + 1/n$ and $\{b_n\} = 4 - n^2$. Then $\{a_n\}$ is $O(1) = O(n^0)$, since, from the previous example, $\lim_{n\to\infty} a_n = 3$, and $\{b_n\}$ is $O(n^2)$, since $\lim_{n\to\infty} \frac{1}{n^2}b_n = -1$. $\square$
Definition 4.4. A sequence of real numbers $\{a_n\}$ is said to be of order smaller than $n^k$, and we write $\{a_n\}$ is $o(n^k)$, if
$$\lim_{n\to\infty} \frac{1}{n^k}\,a_n = 0. \;\square$$
Example 4.3. Let $\{a_n\} = 6/n$. Then $\{a_n\}$ is $o(1)$, since $\lim_{n\to\infty}\frac{1}{n^0}a_n = 0$. More informatively, we see that $\{a_n\}$ is also $O(n^{-1})$. Note that negative values of the exponent are possible, which indicate the rate at which the sequence converges to its limit point. $\square$
The order of a sequence, like its limit, is a simplification relative to the sequence itself, but it is more informative than the limit alone since it reveals the rate at which the divergence or convergence occurs. There is an algebra for orders of sequences that will be useful later. Suppose $a_n = O(n^k)$ and $b_n = O(n^\ell)$; then it is straightforward to show that $c_n = a_n \cdot b_n = O(n^k)\cdot O(n^\ell) = O(n^{k+\ell})$ and $c_n = a_n + b_n = O(n^k) + O(n^\ell) = O(n^{\max(k,\ell)})$. Similarly, for $a_n = o(n^k)$ and $b_n = o(n^\ell)$, $c_n = a_n \cdot b_n = o(n^k)\cdot o(n^\ell) = o(n^{k+\ell})$ and $c_n = a_n + b_n = o(n^k) + o(n^\ell) = o(n^{\max(k,\ell)})$. Finally, if $a_n = O(n^k)$ and $b_n = o(n^\ell)$, then $c_n = a_n \cdot b_n = O(n^k)\cdot o(n^\ell) = o(n^{k+\ell})$, while $c_n = a_n + b_n = O(n^k) + o(n^\ell) = \mathbf{1}(\ell > k)\cdot o(n^\ell) + \mathbf{1}(\ell \leq k)\cdot O(n^k)$.
4.1.4 Convergence In Probability
We now turn to the large-sample limiting behavior of sequences of random variables such as the average. A probability measure on such a random sequence maps each realization onto the real line and forms a sequence of real numbers such as those above. We first look at a sequence of random variables whose distribution is collapsing about some limit point.
Definition 4.5. A sequence of random variables $y_1, y_2, \ldots, y_n, \ldots$, with distribution functions $F_1(\cdot), F_2(\cdot), \ldots, F_n(\cdot), \ldots$, is said to converge (weakly) in probability to some constant $c$ if
$$\lim_{n\to\infty} \Pr[\,|y_n - c| > \varepsilon\,] = 0 \tag{4.1}$$
for every real number $\varepsilon > 0$. $\square$

Weak convergence in probability is denoted by
$$\operatorname*{plim}_{n\to\infty} y_n = c,$$
or sometimes $y_n \xrightarrow{p} c$ or $y_n \to_p c$.
This definition is equivalent to saying that we have a sequence of tail probabilities (of being greater than $c + \varepsilon$ or less than $c - \varepsilon$), and that the tail probabilities approach 0 as $n \to \infty$, regardless of how small $\varepsilon$ is chosen. Equivalently, the probability mass of the distribution of $y_n$ is collapsing about the point $c$. The leading example of such behavior is given by the laws of large numbers, which are discussed in a subsequent subsection.
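The collapsing tail probability is easy to see by simulation (an illustrative sketch; the exponential distribution, $\varepsilon = 0.1$, and the replication counts are made up): estimate $\Pr[\,|\overline{y}_n - c| > \varepsilon\,]$ and watch it shrink as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(5)

# Weak convergence in probability of the sample mean (a law of large
# numbers): estimate Pr(|ybar_n - c| > eps) by simulation. The draws
# are exponential with mean c = 1, so the underlying law is non-normal.
c, eps, reps = 1.0, 0.1, 5_000

tails = []
for n in (10, 100, 1000):
    ybar = rng.exponential(c, size=(reps, n)).mean(axis=1)
    tails.append(np.mean(np.abs(ybar - c) > eps))
    print(n, tails[-1])   # the tail probability falls toward 0
```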
Definition 4.6. A sequence of random variables $y_1, y_2, \ldots, y_n, \ldots$ is said to converge strongly in probability to some constant $c$ if
$$\Pr[\lim_{n\to\infty} y_n = c] = 1. \tag{4.2}\;\square$$

Strong convergence is also called almost sure convergence, which is denoted $y_n \xrightarrow{a.s.} c$ or $y_n \to_{a.s.} c$. Another obvious terminology is to say that $y_n$ converges to $c$ with probability one.
The relationship between weak and strong convergence is best seen by comparison to an alternative, but not obviously equivalent, definition of strong convergence. Suppose
$$\lim_{N\to\infty} \Pr[\,\sup_{n>N} |y_n - c| > \varepsilon\,] = 0 \tag{4.3}$$
for any $\varepsilon > 0$. Consider the equivalent statement
$$\lim_{N\to\infty} \Pr[\,\sup_{n>N} |y_n - c| < \varepsilon\,] = 1 \tag{4.4}$$
for any $\varepsilon > 0$, which, using the definition of a limit, can be rewritten as
$$\Pr[\,\sup_{n>N} |y_n - c| < \varepsilon\,] > 1 - \delta$$
for any $\delta > 0$ and $N > N^*$ for $N^*$ sufficiently large. And for any $\varepsilon > 0$ and $N$ sufficiently large, the condition within the probability for this formulation can, by the definition of a limit, also be written $\lim_{n\to\infty} y_n = c$. Thus for $N$ sufficiently large and any $\delta > 0$ we have
$$\Pr[\lim_{n\to\infty} y_n = c] > 1 - \delta,$$
and since $\delta$ is arbitrary the probability equals one, which implies and is implied by the definition of strong convergence (4.2).

Comparison of the weak convergence condition (4.1) with the alternative strong convergence condition (4.4) reveals that, within the limit, weak convergence concerns itself with the probability of a single realization $y_n$ meeting the condition $|y_n - c| < \varepsilon$, while strong convergence involves the joint probability of $y_n$ and all subsequent realizations meeting the condition. Obviously, if a sequence of random variables converges strongly in probability, it converges weakly in probability. Equally obviously, the reverse is not necessarily true. Most of our attention in the sequel will focus on weak convergence.
Definition 4.7. A sequence of random variables $y_1, y_2, \ldots, y_n, \ldots$ is said to converge in quadratic mean if
$$\lim_{n\to\infty} \mathrm{E}[y_n] = c \quad\text{and}\quad \lim_{n\to\infty} \mathrm{Var}[y_n] = 0. \;\square$$
For a random variable $x$ with mean $\mu$ and variance $\sigma^2$, Chebyshev's inequality states $\Pr(|x - \mu| \geq k\sigma) \leq \frac{1}{k^2}$. For arbitrary $\varepsilon > 0$, let $\mu_n$ and $\sigma_n^2$ denote the mean and variance of $y_n$, and take $n$ sufficiently large that $|c - \mu_n| < \varepsilon$. Consider
$$\Pr[\,|y_n - c| > \varepsilon\,] = 1 - \Pr[\,c - \varepsilon < y_n < c + \varepsilon\,]$$
$$\leq 1 - \Pr[\,\mu_n - (\varepsilon - |c - \mu_n|) < y_n < \mu_n + (\varepsilon - |c - \mu_n|)\,],$$
since the second interval is narrower and centered at $\mu_n$, so it lies inside the first,
$$= \Pr[\,|y_n - \mu_n| \geq \varepsilon - |c - \mu_n|\,] = \Pr\!\left[\,|y_n - \mu_n| \geq \frac{\varepsilon - |c - \mu_n|}{\sigma_n}\,\sigma_n\right] \leq \frac{\sigma_n^2}{(\varepsilon - |c - \mu_n|)^2} \to 0,$$
by Chebyshev's inequality, since $\sigma_n^2 \to 0$ and $|c - \mu_n| \to 0$. Thus convergence in quadratic mean implies convergence in probability. The reverse is not necessarily true. There is no direction of implication between strong convergence and convergence in quadratic mean.
The random sequences we consider will sometimes be functions of a variable (or parameter), and the constants $\varepsilon$ and $\delta$ will depend on the point of evaluation of the function. Analogous to uniform convergence of a function, we introduce the following.

Definition 4.8. A sequence of random functions $\{f_n = f_n(x)\}$, $n = 1, 2, 3, \ldots$, is said to be uniformly convergent in probability to the function $f$ on a set $E$ of values of $x$ if, for each $\varepsilon > 0$ and $1 > \delta > 0$, an integer $N$ can be found such that $\Pr(|f_n(x) - f(x)| > \varepsilon) < \delta$ for all $n > N$ and all $x \in E$. $\square$

This definition differs from pointwise convergence in probability in that $\varepsilon$ and $\delta$ do not depend on the value of $x$; in pointwise convergence these values can differ depending on the point of evaluation. Uniform convergence will sometimes be denoted by
$$\sup_{x \in E} |f_n(x) - f(x)| \to_p 0,$$
which assures we are using the same constants $\varepsilon$ and $\delta$ across the set $E$.
4.1.5 Orders In Probability
Beyond knowing that the distribution is collapsing around a limit point as the sample size grows large, we may be interested in the rate at which it collapses. This is sometimes important in comparing estimators, and also in determining the appropriate normalization for finding the limiting distribution in the next subsection.
Definition 4.9. Let $y_1, y_2, \ldots, y_n, \ldots$ be a sequence of random variables. This sequence is said to be bounded in probability if for any $1 > \delta > 0$ there exist a $\Delta < \infty$ and some $N$ sufficiently large such that
$$\Pr(|y_n| > \Delta) < \delta$$
for all $n > N$. $\square$

The value $\delta$ provides a bound for the probability content of the tail to the right of $\Delta$ and to the left of $-\Delta$ once we are far enough into the sequence. These conditions require that the tail behavior of the distributions of the sequence not be pathological: the tail mass of the distributions cannot be drifting away from zero as we move out in the sequence. The leading case occurs with the central limit theorem discussed below, which shows that the sample average, suitably normalized, has a limiting normal distribution and therefore satisfies the conditions for being bounded in probability.

Definition 4.10. The sequence of random variables $\{y_n\}$ is said to be of order in probability at most $n^\lambda$, denoted $O_p(n^\lambda)$, if $n^{-\lambda}y_n$ is bounded in probability. $\square$
Example 4.4. Suppose $z \sim N(0,1)$ and $y_n = 3 + n\cdot z$. Then $n^{-1}y_n = 3/n + z$ is a bounded-in-probability random variable, since the first term is asymptotically negligible, and we see that $y_n = O_p(n)$. $\square$

Example 4.5. Suppose $z_i \sim$ i.i.d. $N(0,1)$ for $i = 1, 2, \ldots, n$; then $y_n = \sum_{i=1}^{n} z_i^2 \sim \chi^2_n$, which has $\mathrm{E}[y_n] = n$ and $\mathrm{Var}[y_n] = 2n$. The sequence $y_n$ does not satisfy the definition of bounded in probability, since the probability mass is being strung out to the right as $n \to \infty$. However, by the univariate CLT, $\sqrt{n}(y_n/n - 1) \to_d N(0,2)$, which is bounded in probability and hence $O_p(1)$. Thus $(y_n/n - 1) = O_p(1/\sqrt{n})$, $(y_n - n) = O_p(\sqrt{n})$, and $y_n = (y_n - n) + n = O_p(\sqrt{n}) + O(n) = O_p(n)$. $\square$
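Example 4.5 can be illustrated by simulation (a sketch; the replication count and seed are made up): the raw $\chi^2_n$ sequence drifts off to the right, but the normalization $\sqrt{n}(y_n/n - 1)$ settles down to $N(0,2)$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# y_n ~ chi2(n) drifts right, but sqrt(n) * (y_n / n - 1) is Op(1)
# and approaches N(0, 2) as n grows.
reps = 100_000
for n in (10, 100, 1000):
    y = stats.chi2.rvs(n, size=reps, random_state=rng)
    w = np.sqrt(n) * (y / n - 1)
    print(n, round(w.mean(), 2), round(w.var(), 2))  # mean near 0, var near 2

# At n = 1000 the normalized statistic is close to N(0, 2)
ks = stats.kstest(w, "norm", args=(0, np.sqrt(2)))
print(ks.statistic < 0.02)   # True
```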
Definition 4.11. The sequence of random variables $\{y_n\}$ is said to be of order in probability smaller than $n^\lambda$, denoted $o_p(n^\lambda)$, if $n^{-\lambda}y_n \xrightarrow{p} 0$. $\square$

Example 4.6. Convergence in probability can be represented in terms of order in probability. Suppose that $y_n \xrightarrow{p} c$, or equivalently $y_n - c \xrightarrow{p} 0$; then $n^0(y_n - c) \xrightarrow{p} 0$ and $y_n - c = o_p(1)$. $\square$
As with the order of sequences of real numbers, there is an algebra for the order in probability of sequences of random variables. Specifically, for the random sequences $y_n = O_p(n^k)$ and $z_n = O_p(n^\ell)$, we have $w_n = y_n \cdot z_n = O_p(n^k)\cdot O_p(n^\ell) = O_p(n^{k+\ell})$ and $w_n = y_n + z_n = O_p(n^k) + O_p(n^\ell) = O_p(n^{\max(k,\ell)})$. Similarly, for $y_n = o_p(n^k)$ and $z_n = o_p(n^\ell)$, $w_n = y_n \cdot z_n = o_p(n^k)\cdot o_p(n^\ell) = o_p(n^{k+\ell})$ and $w_n = y_n + z_n = o_p(n^k) + o_p(n^\ell) = o_p(n^{\max(k,\ell)})$. Finally, if $y_n = O_p(n^k)$ and $z_n = o_p(n^\ell)$, then $w_n = y_n \cdot z_n = O_p(n^k)\cdot o_p(n^\ell) = o_p(n^{k+\ell})$ and $w_n = y_n + z_n = O_p(n^k) + o_p(n^\ell) = \mathbf{1}(\ell > k)\cdot o_p(n^\ell) + \mathbf{1}(\ell \leq k)\cdot O_p(n^k)$. Moreover, for mixtures of sequences of reals and random variables, a similar algebra holds, except the results are always orders in probability since the mixtures will be random.
4.1.6 Convergence In Distribution
The previous sections have provided tools for describing the limiting point of the distributions of a random sequence and the rate at which the sequence is tending toward a bounded random variable. Inevitably, we also want to know the specific distribution toward which a suitably transformed random sequence is tending, when such a distribution exists.
Definition 4.12. A sequence of random variables y1, y2, . . . , yn, . . ., with cumulative distribution functions F1(·), F2(·), . . . , Fn(·), . . ., is said to converge in distribution to a random variable y with cumulative distribution function F(y) if

lim_{n→∞} Fn(c) = F(c)

for every point of continuity c of F(·). The distribution F(·) is said to be the limiting distribution of this sequence of random variables. �
For notational convenience, we often write yn −→d F(·) or yn →d F(·) if a sequence of random variables converges in distribution to F(·). Note that the moments of elements of the sequence do not necessarily converge to the moments of the limiting distribution. There are well-known examples where the limiting distribution has moments of all orders but certain moments of the finite-sample distribution do not exist.
Note that the pointwise convergence in distribution above allows the constant δ used in the limit to depend on the point of evaluation, say c, and (possibly) on the values of the underlying parameters of the generating process, say θ. It will sometimes be useful to utilize cases where the same constant can be used across the set of possible c and (possibly) θ. For simplicity, we ignore the possible dependence on θ, since the definition generalizes easily.
Definition 4.13. A sequence of random variables y1, y2, . . . , yn, . . ., with cumulative distribution functions F1(·), F2(·), . . . , Fn(·), . . ., is said to converge uniformly in distribution to a random variable y with cumulative distribution function F(y) if

sup_{c∈C} |Fn(c) − F(c)| −→ 0,

where C is the set of feasible points of evaluation c for the function F(·). �
Uniform convergence implies pointwise convergence, but the reverse is not generally true. One important exception occurs when F(c) is continuous over compact C, whereupon pointwise convergence of Fn(c) to F(c) implies uniform convergence.
4.1.7 Some Useful Propositions
In the sequel, we will sometimes need to work with combinations of two ormore sequences or transformations of a sequence. Consequently, the followingpropositions will prove useful, where xn and yn denote two sequences of randomvectors.
Proposition 4.1. If xn − yn converges in probability to zero, and yn has a limiting distribution, then xn has a limiting distribution, which is the same. �
Proposition 4.2. If yn has a limiting distribution and plim_{n→∞} xn = 0, then, for zn = yn′xn, plim_{n→∞} zn = 0. �
Proposition 4.3. Suppose that yn converges in distribution to a random variable y, and plim_{n→∞} xn = c. Then xn′yn converges in distribution to c′y. �
Proposition 4.4. If g(·) is a continuous function, and if xn−yn converges inprobability to zero, then g(xn)− g(yn) converges in probability to zero. �
Proposition 4.5. If g(·) is a continuous function, and if xn converges in probability to a constant c, then zn = g(xn) converges in distribution to the constant g(c). �
Proposition 4.6. If g(·) is a continuous function, and if xn converges in distribution to a random variable x, then zn = g(xn) converges in distribution to a random variable g(x). �
4.2 Estimation Theory
4.2.1 Estimators
Typically, our interest centers on certain properties of the data that can be characterized as parameters of the model used. An example is the marginal propensity to consume in a model of aggregate consumption. Our econometric analysis of this property usually begins with a point estimate of the parameter based on the available data. The formula or function used to obtain the point estimate is an estimator. The estimate is a realization drawn from the distribution that follows when the estimator is applied to the distribution of the underlying variables of the model. Thus, as the sample size is increased, the estimator forms a sequence of random variables. The simple example given before was the sample mean xn as the sample size n is increased.
In the sequel, we will entertain various estimators that result from optimization of a function of the data with respect to parameters of interest. Such estimators comprise the class of extremum estimators. Examples are: least squares, which finds a minimum; maximum likelihood, which finds a maximum; and generalized method of moments (GMM), which finds a minimum. In each case, we obtain first-order conditions for the objective function, in terms of the unknown parameters, and use their solution as the estimator.
Without loss of generality, we will consider estimators that result from minimizing a function of the data. Denote this function by ψn(θ) = ψ(z1, z2, ..., zn; θ), where z1, z2, ..., zn are n observations on an m × 1 vector of random variables z and θ is the p × 1 vector of unknown parameters defined on a set Θ. Formally, our estimator is

θ̂n = argmin_{θ∈Θ} ψn(θ).    (4.5)
Typically, this estimator will be obtained by setting the first-order conditions (FOC), namely the partial derivatives of the function to be minimized, to zero,

0 = ∂ψn(θ̂n)/∂θ,

and solving for θ̂n. In some cases such a solution is easy to obtain, and in others it is not. The FOC can be nonlinear in θ, and the resulting solutions can be very complicated functions of the data.
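As a concrete illustration of (4.5) and its FOC, consider the quadratic objective ψn(θ) = (1/n)∑(zi − θ)², whose minimizer is the sample mean. The sketch below (simulated data with an arbitrary seed; not an example from the text) solves the FOC numerically by bisection and confirms that the solution coincides with the closed form:

```python
import random

random.seed(7)  # illustrative seed and data
z = [random.gauss(2.0, 1.0) for _ in range(500)]
n = len(z)

def psi(theta):
    # objective psi_n(theta) = (1/n) * sum (z_i - theta)^2, minimized at the mean
    return sum((zi - theta) ** 2 for zi in z) / n

def foc(theta):
    # first-order condition: d psi_n / d theta = -(2/n) * sum (z_i - theta)
    return -2.0 * sum(zi - theta for zi in z) / n

# The FOC is increasing in theta, so solve foc(theta) = 0 by bisection.
lo, hi = -10.0, 10.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if foc(mid) < 0.0:
        lo = mid
    else:
        hi = mid
theta_hat = 0.5 * (lo + hi)

sample_mean = sum(z) / n  # closed-form solution of the same FOC
print(abs(theta_hat - sample_mean) < 1e-8)
```

For nonlinear FOC there is no closed form and the numerical root-finding step is all we have, which is the situation the general theory in the appendix is built for.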
In the appendix to this chapter, we develop results that establish the large-sample asymptotic properties of such estimators under very general conditions.
4.2.2 Properties Of Estimators
Definition 4.14. An estimator θ̂n of the p × 1 parameter vector θ is a function of the sample observations z1, z2, ..., zn. �
It follows that θ̂1, θ̂2, . . . , θ̂n, . . . form a sequence of random variables.
Definition 4.15. The estimator θ̂n is said to be unbiased if E θ̂n = θ for all n. �
Definition 4.16. The estimator θ̂n is said to be asymptotically unbiased if lim_{n→∞} E θ̂n = θ. �
Note that an estimator can be biased in finite samples, but asymptoticallyunbiased.
Definition 4.17. The estimator θ̂n is said to be consistent if plim_{n→∞} θ̂n = θ. �
Asymptotic unbiasedness implies that a measure of central tendency of the estimator, the mean, is the target value, so in some sense the distribution of the estimator is centered around the target in large samples. Consistency implies that the distribution of the estimator is collapsing about the target as the sample size increases. Consistency neither implies nor is implied by asymptotic unbiasedness, as demonstrated by the following examples.
Example 4.7. Let

θ̂n = θ with probability 1 − 1/n, and θ̂n = θ + nc with probability 1/n.

We have E θ̂n = θ + c, so θ̂n is a biased estimator, and lim_{n→∞} E θ̂n = θ + c, so θ̂n is asymptotically biased as well. However, lim_{n→∞} Pr(|θ̂n − θ| > ε) = 0, so θ̂n is a consistent estimator. �
Example 4.8. Suppose xi ∼ i.i.d. N(µ, σ²) for i = 1, 2, ..., n, ..., and let the estimator of µ be µ̂n = xn, the n-th observation alone. Now E[µ̂n] = µ, so the estimator is unbiased, but

Pr(|µ̂n − µ| > 1.96σ) = .05

for every n, so the probability mass is not collapsing about the target point µ and the estimator is not consistent. �
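The contrast in Example 4.7 can be seen in a small simulation. The sketch below (illustrative values θ = 1, c = 2, and an arbitrary seed, none of which come from the text) draws from the Example 4.7 estimator and shows that its mean stays near θ + c for every n, while the probability of missing θ shrinks like 1/n:

```python
import random

random.seed(42)  # illustrative seed
theta, c = 1.0, 2.0

def estimate(n):
    # Example 4.7: theta with probability 1 - 1/n, theta + n*c with probability 1/n
    if random.random() < 1.0 / n:
        return theta + n * c
    return theta

reps = 200_000
results = {}
for n in (10, 1000):
    draws = [estimate(n) for _ in range(reps)]
    mean = sum(draws) / reps  # stays near theta + c = 3.0 for every n
    miss = sum(1 for d in draws if abs(d - theta) > 0.5) / reps  # ~ 1/n
    results[n] = (mean, miss)
    print(n, round(mean, 2), miss)
```

The bias never disappears because the rare contaminating draw grows with n, yet the estimator is consistent because that draw becomes ever rarer.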
4.2.3 Laws Of Large Numbers And Central Limit Theorems
Most of the large-sample properties of the estimators considered in the sequel derive from the fact that the estimators involve sample averages and that the asymptotic behavior of averages is well studied. In addition to the central limit theorems presented in the previous two chapters, we have the following two laws of large numbers:
Theorem 4.1. If x1, x2, . . . , xn is a simple random sample, that is, the xi's are i.i.d., with E xi = µ and Var(xi) = σ², then, by Chebyshev's Inequality, plim_{n→∞} xn = µ. �
Theorem 4.2 (Khintchine). Suppose that x1, x2, . . . , xn are i.i.d. random variables such that E xi = µ for all i = 1, . . . , n. Then plim_{n→∞} xn = µ. �
Both of these results apply element-by-element to vectors of estimators. Note that Khintchine's Theorem does not require second moments for the underlying random variable.
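A quick illustration of the difference between the two laws: Student's t with 2 degrees of freedom has a finite mean but infinite variance, so the Chebyshev argument of Theorem 4.1 does not apply while Khintchine's Theorem does. The sketch below (an illustrative construction with an arbitrary seed, not from the text) builds t(2) draws from a normal and a chi-square and checks that the sample mean still settles near zero:

```python
import math
import random

random.seed(0)  # illustrative seed

def t2():
    # Student-t with 2 df: standard normal over sqrt(chi-square_2 / 2);
    # it has a finite mean (zero) but infinite variance
    z = random.gauss(0.0, 1.0)
    v = -2.0 * math.log(random.random())  # chi-square with 2 df (exponential, mean 2)
    return z / math.sqrt(v / 2.0)

n = 200_000
xbar = sum(t2() for _ in range(n)) / n
print(round(xbar, 4))  # Khintchine: the sample mean still converges to 0
```

Convergence is slower than in the finite-variance case, since the usual √n scaling of Theorem 4.3 is unavailable here, but the law of large numbers itself survives.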
For the sake of completeness, we repeat the following scalar central limit theorem.
Theorem 4.3. Suppose that x1, x2, . . . , xn are i.i.d. random variables such that, for all i = 1, . . . , n, E xi = µ and Var(xi) = σ². Then √n(xn − µ) −→d N(0, σ²); that is, lim_{n→∞} fn(z) = [1/√(2πσ²)] e^(−z²/(2σ²)). �
This result is easily generalized to obtain the multivariate version given in Theorem 3.3.
Theorem 4.4. Suppose that the m × 1 random vectors x1, x2, . . . , xn are (i) jointly i.i.d., (ii) E xi = µ, and (iii) Cov(xi) = Σ. Then √n(xn − µ) −→d N(0, Σ). �
4.2.4 Asymptotic Linearity
The importance of the average, and of its asymptotic regularity as given by the laws of large numbers and the central limit theorem, arises because a large number of estimators, including most of those considered in the sequel, are asymptotically linear.
Definition 4.18. An estimator is said to be asymptotically linear if, for zi i.i.d., there exists v(·) such that

√n(θ̂n − θ) = (1/√n) ∑_{i=1}^n v(zi; θ) + op(1). �
Typically, the form of v(zi; θ) arises from setting to zero the first-order conditions from optimization of an objective function. Accordingly, we will have E v(zi; θ) = 0 and Cov(v(zi; θ)) = V at θ = θ0, whereupon we can apply the central limit theorem to obtain

√n(θ̂n − θ0) −→d N(0, V).
In order to obtain asymptotic linearity, we will typically have to impose additional assumptions on the data and v(·), as with the general consistency and asymptotic normality theorems given in the appendix to this chapter.
4.2.5 CUAN And Efficiency
We often encounter more than one candidate estimator for a particular model. It behooves us to have some metric to motivate a choice between the alternatives. Such a comparison requires a little more smoothness of the estimator.
Definition 4.19. An estimator is said to be consistently uniformly asymptotically normal (CUAN) if it is consistent, if √n(θ̂n − θ) converges in distribution to N(0, V), and if the convergence is uniform over some compact subset of the parameter space Θ. �
Uniformity is introduced in order to rule out pathological cases where the limiting distribution changes as we move away from the limit point θ. In such pathological cases, the preferred estimator can change as we move away from the pathological point.
Suppose that √n(θ̂n − θ) converges in distribution to N(0, V), and let θ̃n be an alternative estimator such that √n(θ̃n − θ) converges in distribution to N(0, W).

Definition 4.20. If θ̂n is CUAN with asymptotic covariance V and θ̃n is CUAN with asymptotic covariance W, then θ̃n is asymptotically efficient relative to θ̂n if V − W is a positive semidefinite matrix. �

Among other properties, asymptotic relative efficiency implies that the diagonal elements of W are no larger than those of V, so the asymptotic variance of θ̃n,i is no larger than that of θ̂n,i. A similar result applies to the asymptotic variance of any linear combination.
Sometimes an estimator can be shown to be efficient relative to all alternatives.
Definition 4.21. A CUAN estimator θ̂n is said to be asymptotically efficient if it is asymptotically efficient relative to any other CUAN estimator. �
In the next chapter, we find that when the maximum likelihood estimator (MLE) is CUAN, it is asymptotically efficient within this class.
4.3 Asymptotic Inference
4.3.1 Inference
Strictly speaking, inference is the process by which we infer properties of the underlying model and distribution from the behavior of the observable variables. Point estimation of parameters of interest is certainly a leading example of such inference. Beyond point estimation, we are interested in making probability statements concerning the properties of interest or in choosing between alternative models. Such statements or choices are usually based on hypothesis tests, interval estimates, or probability values, all of which require the distribution of the estimator. In cases where the small-sample distribution of the estimator is
unavailable or formidably complex, we must rely on the large-sample limiting or asymptotic distribution of the estimator to provide answers.
4.3.2 Normal Ratios
To fix ideas, we begin with inference in the simple mean model. Under the conditions of the central limit theorem, this model yields √n(xn − µ) −→d N(0, σ²), so

(xn − µ)/√(σ²/n) −→d N(0, 1).    (4.6)
Suppose that σ̂² is a consistent estimator of σ². Then, by Proposition 4.3, we also have

(xn − µ)/√(σ̂²/n) = [√(σ²/n)/√(σ̂²/n)] · (xn − µ)/√(σ²/n) = √(σ²/σ̂²) · (xn − µ)/√(σ²/n) −→d N(0, 1),

since the term under the square root converges in probability to one and the remainder converges in distribution to N(0, 1).
Most typically, such ratios will be used for inference in testing a hypothesis. Now, under H0 : µ = µ0, we have

(xn − µ0)/√(σ̂²/n) −→d N(0, 1),    (4.7)

while under H1 : µ = µ1 ≠ µ0, we find that

(xn − µ0)/√(σ̂²/n) = √n(xn − µ1)/√σ̂² + √n(µ1 − µ0)/√σ̂² = N(0, 1) + Op(√n).
In conducting a hypothesis test, we choose the critical value c so that the size α, the probability of our statistic exceeding c in absolute value under the null, is sufficiently small that, when such an extreme value occurs, we prefer the alternative hypothesis, under which such extreme events are more likely, particularly in large samples.
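The behavior of the ratio under the null and under an alternative can be illustrated numerically. In the sketch below (simulated N(5, 4) data with an arbitrary seed; the numbers are illustrative, not from the text), the statistic is moderate when the hypothesized mean is the true one and blows up at the √n rate when it is not:

```python
import math
import random

random.seed(3)  # illustrative seed
x = [random.gauss(5.0, 2.0) for _ in range(10_000)]  # true mu = 5, sigma = 2

def z_stat(x, mu0):
    # ratio (4.7): (xbar - mu0) / sqrt(s2 / n), with s2 consistent for sigma^2
    n = len(x)
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    return (xbar - mu0) / math.sqrt(s2 / n)

t_null = z_stat(x, 5.0)  # approximately N(0, 1) under the true H0
t_alt = z_stat(x, 4.5)   # Op(sqrt(n)) under this false H0
print(round(t_null, 2), round(t_alt, 1))
```

With n = 10,000 even the small discrepancy µ1 − µ0 = 0.5 produces a statistic far beyond any conventional critical value, which is the Op(√n) divergence at work.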
Such ratios are of interest in estimation and inference with regard to more general parameters. Suppose that θ̂ is an estimator of the p × 1 parameter vector θ and

√n(θ̂ − θ) −→d N(0, V),

where V is p × p. Then, if θj is the parameter of particular interest, we consider

(θ̂j − θj)/√(vjj/n) −→d N(0, 1),
and, duplicating the arguments for the sample mean,

(θ̂j − θj)/√(v̂jj/n) −→d N(0, 1),    (4.8)

where V̂ is a consistent estimator of V. This ratio will behave similarly under a null and an alternative hypothesis with regard to θj.
4.3.3 Asymptotic Chi-Square
The ratios just discussed are only appropriate for inference with respect to a single parameter. Often, the question to be answered using inference involves more than one parameter at a time. In a leading example, we have one model embedded in another as a special case: the more complex model reduces to the special case if a set of parameters takes on specific values, such as zero.
Suppose that

√n(θ̂ − θ) −→d N(0, V),

where V is a nonsingular p × p matrix. Then, from the previous chapter, we have

√n(θ̂ − θ)′ V⁻¹ √n(θ̂ − θ) = n·(θ̂ − θ)′ V⁻¹ (θ̂ − θ) −→d χ²_p

and

n·(θ̂ − θ)′ V̂⁻¹ (θ̂ − θ) −→d χ²_p    (4.9)

for V̂ a consistent estimator of V.
A similar result can also be obtained for any subvector of θ. Let

θ = (θ1′, θ2′)′,

where θ2 is a p2 × 1 vector. Then

√n(θ̂2 − θ2) −→d N(0, V22),

where V22 is the lower right-hand p2 × p2 submatrix of V, and

n·(θ̂2 − θ2)′ V̂22⁻¹ (θ̂2 − θ2) −→d χ²_{p2},    (4.10)

where V̂22 is the analogous lower right-hand submatrix of V̂.
This result can be used to conduct inference by testing the subvector. If H0 : θ2 = θ2⁰, then

n·(θ̂2 − θ2⁰)′ V̂22⁻¹ (θ̂2 − θ2⁰) −→d χ²_{p2},
and large positive values are rare events, while for H1 : θ2 = θ2¹ ≠ θ2⁰ we can show (later) that

n·(θ̂2 − θ2⁰)′ V̂22⁻¹ (θ̂2 − θ2⁰) = n·[(θ̂2 − θ2¹) + (θ2¹ − θ2⁰)]′ V̂22⁻¹ [(θ̂2 − θ2¹) + (θ2¹ − θ2⁰)]
= n·(θ̂2 − θ2¹)′ V̂22⁻¹ (θ̂2 − θ2¹) + 2n·(θ2¹ − θ2⁰)′ V̂22⁻¹ (θ̂2 − θ2¹) + n·(θ2¹ − θ2⁰)′ V̂22⁻¹ (θ2¹ − θ2⁰)
= χ²_{p2} + Op(√n) + Op(n).    (4.11)

Thus, if we obtain a large value of the statistic, we may take it as evidence that the null hypothesis is incorrect.
The ratio approach of the previous section is a special case of the subvector approach. If the hypothesis to be tested concerns only the last element of the parameter vector, so H0 : θp = θp⁰, then we can test using the ratio

(θ̂p − θp⁰)/√(v̂pp/n) −→d N(0, 1),

or the quadratic form, which in this case is

n·(θ̂p − θp⁰)′ v̂pp⁻¹ (θ̂p − θp⁰) = n·(θ̂p − θp⁰)²/v̂pp −→d χ²₁.

But the second is just the square of the first, and the distribution consulted is the chi-square with one degree of freedom, which results from squaring a standard normal.
4.3.4 Tests Of General Restrictions
We can use similar results to test general nonlinear restrictions. Suppose that r(·) is a q × 1 continuously differentiable function and

√n(θ̂ − θ) −→d N(0, V).

By the intermediate value theorem, we can obtain the exact Taylor's series representation

r(θ̂) = r(θ) + [∂r(θ*)/∂θ′](θ̂ − θ),

or equivalently

√n[r(θ̂) − r(θ)] = [∂r(θ*)/∂θ′] √n(θ̂ − θ) = R(θ*) √n(θ̂ − θ),
where R(θ*) = ∂r(θ*)/∂θ′ and θ* lies between θ̂ and θ. Now θ̂ →p θ, so θ* →p θ and R(θ*) →p R(θ). Thus, we have

√n[r(θ̂) − r(θ)] −→d N(0, R(θ)VR′(θ)).

Consider a test of H0 : r(θ) = 0. Assuming R(θ)VR′(θ) is nonsingular, we have

n·r(θ̂)′[R(θ)VR′(θ)]⁻¹ r(θ̂) −→d χ²_q,

where q is the length of r(·). In practice, we substitute the consistent estimates R(θ̂) for R(θ) and V̂ for V to obtain, following the arguments given above,

n·r(θ̂)′[R(θ̂)V̂R′(θ̂)]⁻¹ r(θ̂) −→d χ²_q.    (4.12)
The behavior under the alternative hypothesis will be Op(n) as above.
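As a worked instance of (4.12), the sketch below uses a hypothetical setup (two independent samples whose means make up θ, so V is diagonal; the seed and true values (2, 0.5) are illustrative) to test the scalar restriction r(θ) = θ1θ2 − 1 = 0, which holds at the truth, so the Wald statistic should look like a single χ²₁ draw:

```python
import random

random.seed(11)  # illustrative seed
n = 5_000
# hypothetical setup: theta = (E x1, E x2) estimated by two independent sample
# means, so sqrt(n)(theta_hat - theta) ->d N(0, V) with V = diag(sigma1^2, sigma2^2)
x1 = [random.gauss(2.0, 1.0) for _ in range(n)]
x2 = [random.gauss(0.5, 1.0) for _ in range(n)]

def mean_var(x):
    m = sum(x) / len(x)
    return m, sum((xi - m) ** 2 for xi in x) / (len(x) - 1)

t1, v1 = mean_var(x1)
t2, v2 = mean_var(x2)

# H0: r(theta) = theta1 * theta2 - 1 = 0, true at (2.0, 0.5); here q = 1
r = t1 * t2 - 1.0
R = (t2, t1)                             # gradient R(theta_hat) = (theta2, theta1)
RVRt = R[0] ** 2 * v1 + R[1] ** 2 * v2   # R V-hat R' with diagonal V-hat
wald = n * r ** 2 / RVRt                 # ->d chi-square with 1 df under H0
print(round(wald, 3))
```

Shifting either true mean so that θ1θ2 ≠ 1 would send the statistic off at the Op(n) rate, exactly as in (4.11).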
4.A Appendix
In this section, we will develop results which establish the consistency and asymptotic normality of extremum estimators as defined by (4.5) under very general conditions. These conditions are high-level, in the sense that they encompass many models and must be verified for the specific model of interest. The assumptions on the specific model that lead to verification of the high-level assumptions are low-level or primitive assumptions.
We first consider a result for consistency of the extremum estimator. For notational simplicity, we will suppress the subscript n on the estimator θ̂n.
Theorem 4.5. Suppose (C1) Θ ⊂ R^p is compact, (C2) there exists ψ0(·) : Θ −→ R such that ψn(θ) −→p ψ0(θ) uniformly over θ ∈ Θ, (C3) ψ0(θ) is continuous over θ ∈ Θ, and (C4) ψ0(θ) attains a unique minimum over θ ∈ Θ at θ = θ0. Then θ̂ −→p θ0. �
Proof: The following proof uses the properties that the inf and sup of a compact set must exist, that a closed subset of a compact set is also compact, and that the image of a compact set under a continuous function is also compact. Consider any open neighborhood Θ0 of θ0; then the complement Θ1 = Θ − Θ0 is a closed subset of Θ and hence compact by (C1). For θ ∈ Θ1 we have ψ0(θ) − ψ0(θ0) > 0, by (C4). So, by continuity of ψ0(θ) and compactness of Θ1, we find that

δ* = inf_{θ∈Θ1} (ψ0(θ) − ψ0(θ0))

exists and is positive. Therefore, there exists δ such that δ* > δ > 0 and ψ0(θ) − ψ0(θ0) > δ for all θ ∈ Θ1. Next, for any ε > 0, we find, by (C2), that |ψn(θ) − ψ0(θ)| ≤ ε for all θ ∈ Θ with Pr 1, so
ψ0(θ̂) − ψ0(θ0) = [ψ0(θ̂) − ψn(θ̂)] + [ψn(θ̂) − ψ0(θ0)]
= [ψ0(θ̂) − ψn(θ̂)] + [ψn(θ̂) − ψn(θ0)] + [ψn(θ0) − ψ0(θ0)]
≤ [ψ0(θ̂) − ψn(θ̂)] + [ψn(θ0) − ψ0(θ0)], since [ψn(θ̂) − ψn(θ0)] ≤ 0 by definition of θ̂,
≤ |ψ0(θ̂) − ψn(θ̂)| + |ψn(θ0) − ψ0(θ0)|, by the triangle inequality,
≤ 2ε with Pr 1.

But if we choose ε = δ/2, then with Pr 1, ψ0(θ̂) − ψ0(θ0) ≤ δ and θ̂ ∉ Θ1, which means θ̂ ∈ Θ0, for any neighborhood Θ0 of θ0. �

The assumptions for this theorem are very high level, and verification is sometimes easy and sometimes not. The compactness assumption applies directly to the parameter space and hence should not be difficult to verify. The continuity assumption will usually be met if the functions of the underlying model are themselves smooth, but should be verified nonetheless. The global minimum assumption is not as restrictive as it might seem, since we can always restrict Θ to a smaller space where there is only one minimum. In practice, the uniform convergence in probability assumption can be the most difficult to verify when the underlying model is complicated.
With a little more smoothness, we can give a geometric proof of the consistency theorem for the single-parameter case. Consider Figure 4.1 below. For simplicity, we take the limiting function ψ0(θ) to be approximately quadratic (more will be said on this below). For an arbitrary choice of δ > 0, we know by the uniform convergence assumption that the function we are minimizing, ψn(θ), represented here by the slightly wiggly line, will lie in the sleeve between ψ0(θ) − δ and ψ0(θ) + δ with probability one as n grows large. Now, if ψn(θ) lies in this region, its minimum cannot exceed ψmax, the minimum of ψ0(θ) + δ, since ψn(θ) lies below ψ0(θ) + δ. At the same time, this minimum must also lie above ψ0(θ) − δ, which means that the value of θ that minimizes ψn(θ), namely θ̂, must lie between θa and θb. For sufficiently large n, we can make δ arbitrarily small, whereupon the sleeve formed by ψ0(θ) − δ and ψ0(θ) + δ about ψ0(θ) grows tighter, and by continuity of ψ0(θ), θa and θb must come closer and closer to θ0, with probability one, which proves consistency. �
If the function ψ0(θ) is twice differentiable, then in the neighborhood of the minimum it will be approximately quadratic and this proof will apply. Since there is a unique global minimum by assumption, the global minimum will be in the neighborhood of θ0 and the proof applies.
We now present a normality result for the extremum estimator.
Theorem 4.6. Suppose (N1) θ̂ −→p θ0 with θ0 interior to Θ, (N2) ψn(θ) is twice continuously differentiable in a neighborhood N of θ0, (N3) there exists Ψ(θ), continuous at θ0, such that ∂²ψn(θ)/∂θ∂θ′ −→p Ψ(θ) uniformly over θ ∈ N, (N4) Ψ = Ψ(θ0) is nonsingular, and (N5) √n ∂ψn(θ0)/∂θ −→d N(0, V). Then √n(θ̂ − θ0) −→d N(0, Ψ⁻¹VΨ⁻¹). �
[Figure 4.1: Consistency From Minimization. The limiting objective ψ0(θ) with the sleeve ψ0(θ) − δ and ψ0(θ) + δ containing the wiggly ψn(θ); the minimum of ψ0(θ) + δ is ψmax, and the bracketing points θa and θb surround θ0.]
Proof: By (N1), θ̂ ∈ N with probability one, and we can use (N2) to expand the first-order conditions in a Taylor's series, adding and subtracting terms, to yield

0 = ∂ψn(θ̂)/∂θ
= ∂ψn(θ0)/∂θ + [∂²ψn(θ*)/∂θ∂θ′](θ̂ − θ0)
= ∂ψn(θ0)/∂θ + Ψ(θ0)(θ̂ − θ0) + [Ψ(θ*) − Ψ(θ0)](θ̂ − θ0) + [∂²ψn(θ*)/∂θ∂θ′ − Ψ(θ*)](θ̂ − θ0),
for θ* between θ̂ and θ0. It follows that θ* −→p θ0, so, by continuity and the Slutsky theorem, Ψ(θ*) − Ψ(θ0) = op(1), and by (N3), ∂²ψn(θ*)/∂θ∂θ′ − Ψ(θ*) = op(1). Thus

−√n ∂ψn(θ0)/∂θ = Ψ(θ0)√n(θ̂ − θ0) + op(‖√n(θ̂ − θ0)‖),

which, using (N4), can be multiplied through by Ψ(θ0)⁻¹ and solved to obtain

√n(θ̂ − θ0) = −Ψ(θ0)⁻¹ √n ∂ψn(θ0)/∂θ + op(‖√n(θ̂ − θ0)‖) −→d N(0, Ψ⁻¹VΨ⁻¹)

by (N5). �
The assumptions of this theorem are fairly standard but should be verified for the model in question. The nonsingularity assumption will usually follow from the identification assumption introduced for the consistency theorem. In order to perform inference based on the estimator θ̂, we need a consistent estimator of the covariance matrix Ψ⁻¹VΨ⁻¹. The uniform convergence result for the second derivatives yields Ψ̂ = ∂²ψn(θ̂)/∂θ∂θ′ as an obvious consistent estimator of Ψ. An appropriate consistent estimator of V will depend on the specifics of the model in question.
Chapter 5
Maximum Likelihood Methods
5.1 Maximum Likelihood Estimation (MLE)
5.1.1 Motivation
Suppose we have a model for the random variable yi, for i = 1, 2, . . . , n, with unknown (p × 1) parameter vector θ. In many cases, the model will imply a distribution f(yi|θ) for each realization of the variable yi.
A basic premise of statistical inference is to avoid unlikely or rare models, for example, in hypothesis testing. If we have a realization of a statistic that exceeds the critical value, then it is a rare event under the null hypothesis. Under the alternative hypothesis, however, such a realization is much more likely to occur, and we reject the null in favor of the alternative. Thus, in choosing between the null and alternative, we select the model that makes the realization of the statistic more likely to have occurred.
Carrying this idea over to estimation, we select values of θ such that the corresponding values of f(yi|θ) are not unlikely. After all, we do not want a model that disagrees strongly with the data. Maximum likelihood estimation is merely a formalization of this notion that the model chosen should not be unlikely. Specifically, we choose the values of the parameters that make the realized data most likely to have occurred. This approach does, however, require that the model be specified in enough detail to imply a distribution for the variable of interest.
5.1.2 The Likelihood Function
Suppose that the random variables y1, y2, . . . , yn are i.i.d. Then the joint density function for n realizations is

f(y1, y2, . . . , yn|θ) = f(y1|θ) · f(y2|θ) · . . . · f(yn|θ) = ∏_{i=1}^n f(yi|θ).    (5.1)
Given values of the parameter vector θ, this function allows us to assign local probability measures for various values of the random variables y1, y2, . . . , yn. This is the function which must be integrated to make probability statements concerning the joint outcomes of y1, y2, . . . , yn.
Given a set of realized values of the random variables, we use this samefunction to establish the probability measure associated with various choices ofthe parameter vector θ.
Definition 5.1. The likelihood function of the parameters, for a particular sample y1, y2, . . . , yn, is the joint density function considered as a function of θ given the yi's. That is,

L(θ|y) = ∏_{i=1}^n f(yi|θ).    (5.2) �
5.1.3 Maximum Likelihood Estimation
For a particular choice of the parameter vector θ, the likelihood function gives a probability measure for the realizations that occurred. Consistent with the approach used in hypothesis testing, and using this function as the metric, we choose the θ that makes the realizations most likely to have occurred.
Definition 5.2. The maximum likelihood estimator of θ is the estimator obtained by maximizing L(θ|y1, y2, . . . , yn) with respect to θ. That is,

θ̂ = argmax_{θ∈Θ} L(θ|y),    (5.3)

where θ̂ is called the MLE of θ. �
Equivalently, since ln(·) is a strictly monotonic transformation, we may find the MLE of θ by solving

θ̂ = argmax_{θ∈Θ} lnL(θ|y),    (5.4)

where lnL(θ|y) denotes the log-likelihood function. In practice, we obtain θ̂ by solving the first-order conditions (FOC)

∂ lnL(θ|y)/∂θ = ∑_{i=1}^n ∂ ln f(yi|θ)/∂θ = 0.
The motivation for using the log-likelihood function is apparent, since the summation form will result, after division by n, in estimators that are approximately averages, about which we know a lot. This advantage is particularly clear in the following example.
Example 5.1. Suppose that yi ∼ i.i.d. N(µ, σ²) for i = 1, 2, . . . , n. Then

f(yi|µ, σ²) = [1/√(2πσ²)] e^(−(yi−µ)²/(2σ²))

for i = 1, 2, . . . , n. Using the likelihood function (5.2), we have

L(µ, σ²|y) = ∏_{i=1}^n [1/√(2πσ²)] e^(−(yi−µ)²/(2σ²)).    (5.5)

Next, we take the logarithm of (5.5), which gives us

lnL(µ, σ²|y) = ∑_{i=1}^n [−(1/2) ln(2πσ²) − (1/(2σ²))(yi − µ)²]
= −(n/2) ln(2πσ²) − [1/(2σ²)] ∑_{i=1}^n (yi − µ)².    (5.6)

We then maximize (5.6) with respect to both µ and σ². That is, we solve the following first-order conditions:

(A) ∂ lnL(·)/∂µ = (1/σ²) ∑_{i=1}^n (yi − µ) = 0;

(B) ∂ lnL(·)/∂σ² = −n/(2σ²) + [1/(2σ⁴)] ∑_{i=1}^n (yi − µ)² = 0.

Solving (A), we find that µ̂ = (1/n) ∑_{i=1}^n yi = yn. Solving (B) gives us σ̂² = (1/n) ∑_{i=1}^n (yi − yn)². �
Note that σ̂² = (1/n) ∑_{i=1}^n (yi − µ̂)² ≠ s², where s² = (1/(n−1)) ∑_{i=1}^n (yi − µ̂)² is the familiar unbiased estimator of σ², so σ̂² is a biased estimator.
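The closed-form MLE from Example 5.1 can be computed directly. The sketch below (simulated data using the same µ = 10, σ = 5, n = 100 as Example 5.2, but with an arbitrary seed, so the realized estimates differ) also confirms the exact relation σ̂² = ((n − 1)/n)·s²:

```python
import random

random.seed(2024)  # illustrative seed
n = 100
y = [random.gauss(10.0, 5.0) for _ in range(n)]  # mu = 10, sigma = 5

mu_hat = sum(y) / n                                 # MLE of mu: the sample mean
sig2_hat = sum((yi - mu_hat) ** 2 for yi in y) / n  # MLE of sigma^2 (biased)
s2 = sum((yi - mu_hat) ** 2 for yi in y) / (n - 1)  # familiar unbiased estimator
print(round(mu_hat, 3), round(sig2_hat, 3), round(s2, 3))
# sig2_hat = ((n - 1) / n) * s2 exactly, so sig2_hat < s2 in every sample
```

The downward bias factor (n − 1)/n vanishes as n grows, which is why σ̂² is nonetheless consistent and asymptotically unbiased.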
Example 5.2. Some notion of the attraction of maximum likelihood can be gained from a numerical example of the same problem just considered, where yi ∼ i.i.d. N(µ, σ²). Choosing µ = 10 and σ = 5, we randomly generate a sample of size n = 100. In the absence of an explicit choice of a distribution, we are left with nonparametric approaches such as the histogram to assess probabilities. For the sample generated, the resulting histogram given below is skewed to the right and does not appear normal. We then estimate the model by MLE and plot the true pdf and the estimated pdf with µ̂ = 9.446 and σ̂ = 5.0816. Obviously, the maximum likelihood estimate of the distribution is much closer to the truth than the histogram. One could adopt more sophisticated nonparametric techniques that would result in a smoother estimated distribution than the histogram, but they could not begin to approach the accuracy of the MLE approach when we have chosen the correct distribution to estimate.
[Figure 5.1: Histogram and MLE of Normal Distribution. Left panel: histogram of the n = 100 normal draws. Right panel: true pdf and MLE-estimated normal pdf of x.]
5.2 Asymptotic Behavior of MLE
5.2.1 Assumptions
For the asymptotic results we will derive in the following sections, we will make five assumptions. These assumptions are specific to maximum likelihood and are introduced for pedagogic reasons. They are shown to satisfy the higher-level assumptions given in the Appendix to the previous chapter.
(M1) The yi’s are iid random variables with density function f(yi|θ) for i =1, 2, . . . , n;
(M2) ln f(y|θ) and hence f(y|θ) possess continuous derivatives with respect toθ up to the third order for θ ∈ Θ;
(M3) The range of yi is independent of θ and differentiation to second orderunder the integral of f(y|θ) is possible;
(M4) The parameter vector θ is globally identified by the density function.
(M5) ∂3 ln f(yi|θ)/∂θi∂θj∂θk is bounded in absolute value by some positivefunction Hijk(y) for all y and θ ∈Θ, which, in turn, has a finite expectationfor all θ ∈ Θ.
The first assumption is fundamental and the basis of the estimator. If it is not satisfied, then we are misspecifying the model and there is little hope of obtaining correct inferences, at least in finite samples. The second assumption is a regularity condition that is usually satisfied and easily verified. The first part of the third assumption is also easily verified and is guaranteed to be satisfied in models where the dependent variable has smooth and infinite support. The fourth assumption must be verified, which is easier in some cases than in others. The last assumption is crucial; verifying it bears a cost and really should be done before MLE is undertaken, but this step is usually ignored. Conditions for interchanging differentiation and integration of f(y|θ), which satisfy the second part of the third assumption, require boundedness of the integrated norms of ∂f(y|θ)/∂θ and ∂²f(y|θ)/∂θ∂θ′ and are discussed in the Appendix to this chapter.
Example 5.3. It is instructive to examine these assumptions with respect to the simple example just given, where we estimated the mean µ and variance σ² for yi ∼ i.i.d. N(µ, σ²). The first two assumptions are obviously satisfied, and the third is met since the random variable has infinite support. Since the normal distribution is completely characterized by its mean and variance, distinct values of (µ, σ²) imply distinct distributions, so the fourth assumption is satisfied. The
four unique third derivatives are:

∂³ ln f(yi|θ)/∂µ³ = 0,

∂³ ln f(yi|θ)/∂(σ²)³ = −1/(σ²)³ + [3/(σ²)⁴](yi − µ)²
= −1/(σ²)³ + [3/(σ²)⁴][(yi − µ0)² + 2(µ0 − µ)(yi − µ0) + (µ0 − µ)²],

∂³ ln f(yi|θ)/∂(σ²)²∂µ = [2/(σ²)³](yi − µ)
= [2/(σ²)³](µ0 − µ) + [2/(σ²)³](yi − µ0),

∂³ ln f(yi|θ)/∂µ²∂(σ²) = 1/(σ²)².
For µ and σ² bounded (above and below), these are all bounded in absolute value by a polynomial in |yi − µ0|, each term of which has a finite expectation, so Assumption (M5) is also met. From this we see that Assumption (M5) may implicitly place restrictions on the parameter space Θ. �
5.2.2 Some Preliminaries
In the following results, we exploit Assumptions (M2) and (M3) to take the differentiation inside the integral. Now, we know that, for any value of the parameter vector θ,

∫ f(y|θ) dy = 1,    (5.7)

and hence

0 = (∂/∂θ) ∫ f(y|θ) dy,

where integration is understood to be the multiple integral and the limits of integration are ±∞. Interchanging the order of differentiation and integration yields

0 = ∫ [∂f(y|θ)/∂θ] dy = ∫ [∂ ln f(y|θ)/∂θ] f(y|θ) dy,    (5.8)

for any θ, and

0 = E[∂ ln f(y|θ0)/∂θ]

for any value of the true parameter vector θ0, where E[·] indicates integration of the expression with respect to the true density f(y|θ0). The first derivative of the log-likelihood function is known as the score function, and the result just given as the mean-zero property of the score.
Differentiating (5.8) again and bringing the differentiation inside the integral yields

0 = ∫ { [∂² ln f(y|θ)/∂θ∂θ′] f(y|θ) + [∂ ln f(y|θ)/∂θ][∂f(y|θ)/∂θ′] } dy.    (5.9)

Since

∂f(y|θ)/∂θ′ = [∂ ln f(y|θ)/∂θ′] f(y|θ),    (5.10)

then, for θ = θ0, we can rewrite (5.9) as

0 = E[∂² ln f(y|θ0)/∂θ∂θ′] + E{[∂ ln f(y|θ0)/∂θ][∂ ln f(y|θ0)/∂θ′]}    (5.11)
and define

ϑ(θ0) = E{[∂ ln f(y|θ0)/∂θ][∂ ln f(y|θ0)/∂θ′]} = −E[∂² ln f(y|θ0)/∂θ∂θ′].    (5.12)

The matrix ϑ(θ0) is called the information matrix, and the relationship given in (5.12) is the information matrix equality.
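Both the mean-zero property of the score and the information matrix equality (5.12) can be checked by Monte Carlo for the µ component of the normal model, where ∂ ln f/∂µ = (y − µ)/σ² and ∂² ln f/∂µ² = −1/σ². The sketch below (arbitrary seed and parameter values, chosen for illustration) compares the simulated E[score²] with −E[hessian]:

```python
import random

random.seed(5)  # illustrative seed and parameter values
mu0, sig2 = 1.0, 4.0
n = 200_000
y = [random.gauss(mu0, sig2 ** 0.5) for _ in range(n)]

scores = [(yi - mu0) / sig2 for yi in y]  # d ln f / d mu evaluated at theta0
mean_score = sum(scores) / n              # mean-zero property: should be near 0
outer = sum(s * s for s in scores) / n    # simulated E[score^2]
neg_hess = 1.0 / sig2                     # -E[d2 ln f / d mu2] = 1/sigma^2 exactly
print(round(mean_score, 4), round(outer, 4), neg_hess)
```

The two estimates of the information agree, which is the practical content of (5.12): either the outer product of scores or the negative Hessian can be used to estimate ϑ(θ0).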
Finally, we note, under Assumption (M1) that
E
[∂2 ln L(θ0|y)
∂θ∂θ′
]= E
[n∑i=1
∂2 ln f(yi|θ0)
∂θ∂θ′
]
=
n∑i=1
E
[∂2 ln f(yi|θ0)
∂θ∂θ′
]= −nϑ(θ0), (5.13)
since the distributions are identical, and
E
[∂ ln L(θ0|y)
∂θ
∂ ln L(θ0|y)
∂θ′
]= E
[n∑i=1
∂ ln f(yi|θ0)
∂θ
n∑i=1
∂ ln f(yi|θ0)
∂θ′
]
=
n∑i=1
E
[∂ ln f(yi|θ0)
∂θ
∂ ln f(yi|θ0)
∂θ′
]= nϑ(θ0), (5.14)
since the covariances between different observations is zero. These last resultswill suggest two alternative estimators for ϑ(θ0).
5.2.3 Asymptotic Properties
5.2.3.1 Consistent Root Exists
We will first establish consistency of the maximum likelihood estimator under Assumptions (M1)-(M5) given above. These assumptions and the proof below appear different from the general approach given in the Appendix of Chapter 4. However, as will be shown in the Appendix to this chapter, these assumptions, suitably modified, satisfy those for the more general approach. Moreover, the approach taken here is particularly instructive in revealing the role of nonlinearity in estimation.
For notational simplicity, consider the case where $p = 1$, so $\theta$ is scalar. Then, dividing by $n$, Assumption (M2) allows us to expand in a Taylor's series, and use the intermediate value theorem on the quadratic term, to yield
\[
\frac{1}{n}\frac{\partial \ln L(\theta|\mathbf{y})}{\partial \theta}
= \frac{1}{n}\sum_{i=1}^n \frac{\partial \ln f(y_i|\theta_0)}{\partial \theta}
+ \frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \ln f(y_i|\theta_0)}{\partial \theta^2}(\theta - \theta_0)
+ \frac{1}{2}\frac{1}{n}\sum_{i=1}^n \frac{\partial^3 \ln f(y_i|\theta^*)}{\partial \theta^3}(\theta - \theta_0)^2, \tag{5.15}
\]
where $\theta^*$ lies between $\theta$ and $\theta_0$. Now, by Assumption (M5), we have
\[
\frac{1}{n}\sum_{i=1}^n \frac{\partial^3 \ln f(y_i|\theta^*)}{\partial \theta^3} = k\,\frac{1}{n}\sum_{i=1}^n H(y_i), \tag{5.16}
\]
for some $|k| \le 1$. So,
\[
\frac{1}{n}\frac{\partial \ln L(\theta|\mathbf{y})}{\partial \theta} = a\delta^2 + b\delta + c, \tag{5.17}
\]
where
\[
\delta = \theta - \theta_0, \qquad
a = \frac{k}{2}\,\frac{1}{n}\sum_{i=1}^n H(y_i), \qquad
b = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \ln f(y_i|\theta_0)}{\partial \theta^2}, \quad\text{and}\quad
c = \frac{1}{n}\sum_{i=1}^n \frac{\partial \ln f(y_i|\theta_0)}{\partial \theta}. \tag{5.18}
\]
We can apply a law of large numbers to the averages in (5.18) to obtain $\operatorname{plim}_{n\to\infty} c = 0$ and $\operatorname{plim}_{n\to\infty} b = -\vartheta(\theta_0)$, under the preliminary results that followed from Assumptions (M1)-(M3), and $|a| \le \frac{1}{2}\frac{1}{n}\sum_{i=1}^n H(y_i) = \frac{1}{2}\mathrm{E}[H(y_i)] + o_p(1) = O_p(1)$ under Assumption (M5).
Now, since $\partial \ln L(\hat\theta|\mathbf{y})/\partial \theta = 0$, we have $a\delta^2 + b\delta + c = 0$. There are two possibilities. If $a = 0$, then the FOC are linear in $\delta$, whereupon $\delta = -\frac{c}{b}$ and $\delta \xrightarrow{p} 0$. If $a \ne 0$ with probability 1, which will occur when the FOC are nonlinear in $\delta$, then
\[
\delta = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \tag{5.19}
\]
Since $ac = o_p(1)$, then $\delta \xrightarrow{p} 0$ for the minus root, while $\delta \xrightarrow{p} \vartheta(\theta_0)/\alpha$ for the plus root if $\operatorname{plim}_{n\to\infty} a = \alpha \ne 0$ exists. If the FOC are nonlinear but asymptotically linear, then $a \xrightarrow{p} 0$, so $ac$ in the numerator of (5.19) will go to zero faster than $a$ in the denominator, and again $\delta \xrightarrow{p} 0$. Thus there exists at least one consistent solution $\hat\theta$ which satisfies
\[
\operatorname{plim}_{n\to\infty} (\hat\theta - \theta_0) = 0, \tag{5.20}
\]
and in the asymptotically nonlinear case there is also a possibly inconsistent solution.

For the case of $\theta$ a vector, we can apply a similar style of proof to show there exists a solution $\hat\theta$ to the FOC that satisfies $\operatorname{plim}_{n\to\infty}(\hat\theta - \theta_0) = 0$. In the event of asymptotically nonlinear FOC, at least one other, possibly inconsistent, root is also possible.
5.2.3.2 Consistent Root Is Global Maximum
In the event of multiple roots, we are left with the problem of selecting between them. By Assumption (M4), the parameter $\theta$ is globally identified by the density function. Formally, this means that
\[
f(y|\theta) = f(y|\theta_0) \tag{5.21}
\]
for all $y$ implies that $\theta = \theta_0$. Now,
\[
\mathrm{E}\left[\frac{f(y|\theta)}{f(y|\theta_0)}\right] = \int_{-\infty}^{+\infty} \frac{f(y|\theta)}{f(y|\theta_0)}\,f(y|\theta_0)\,dy = 1. \tag{5.22}
\]
Thus, by Jensen's inequality, we have
\[
\mathrm{E}\left[\ln \frac{f(y|\theta)}{f(y|\theta_0)}\right] < \ln \mathrm{E}\left[\frac{f(y|\theta)}{f(y|\theta_0)}\right] = 0, \tag{5.23}
\]
unless $f(y|\theta) = f(y|\theta_0)$ for all $y$, that is, unless $\theta = \theta_0$. Therefore, $\mathrm{E}[\ln f(y|\theta)]$ achieves its maximum if and only if $\theta = \theta_0$. This well-known result is called the information inequality.
However, we are seeking to maximize
\[
\frac{1}{n}\sum_{i=1}^n \ln f(y_i|\theta) \xrightarrow{p} \mathrm{E}[\ln f(y|\theta)], \tag{5.24}
\]
where the convergence in probability is uniform for $\theta \in \Theta$. If we choose an inconsistent root, the objective function will converge in large samples to a smaller value, while the consistent root will converge to a larger value. Accordingly, we choose the root with the larger value of the objective function, which will be appropriate asymptotically. This choice of the global root has added appeal since it is, in fact, the MLE among the possible alternatives and hence the choice that makes the realized data most likely to have occurred.

There are complications in finite samples since the values of the likelihood function for alternative roots may cross over as the sample size increases. That is, the global maximum in small samples may not be the global maximum in larger samples. An added problem is to identify all the alternative roots so we can choose the global maximum. Sometimes a solution is available in a simple consistent estimator, which may be used to start the nonlinear MLE optimization.
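The multiple-root problem and the rule of keeping the root with the larger likelihood can be illustrated numerically. The sketch below is not from the text; the data and grid are hypothetical choices. It uses a Cauchy location model, a standard example in which the likelihood equation can have several roots:

```python
import math

# Hypothetical Cauchy location sample; the outlier at 25 creates a second,
# spurious local maximum of the log-likelihood near 25.
ys = [-5.0, -1.0, 0.0, 1.2, 25.0]

def loglik(theta):
    """Cauchy location log-likelihood at theta."""
    return sum(-math.log(math.pi) - math.log(1.0 + (y - theta) ** 2) for y in ys)

# Crude grid search over [-10, 30] for all local maxima of the log-likelihood.
grid = [i * 0.01 for i in range(-1000, 3001)]
vals = [loglik(t) for t in grid]
local_maxima = [grid[i] for i in range(1, len(grid) - 1)
                if vals[i] > vals[i - 1] and vals[i] >= vals[i + 1]]

# The rule from the text: among candidate roots, keep the one with the
# largest value of the objective function.
theta_hat = max(local_maxima, key=loglik)
```

The grid search finds one root near the main cluster of observations and another near the outlier; the likelihood comparison selects the former.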
5.2.3.3 Asymptotic Normality
We now establish the asymptotic normality of the maximum likelihood estimator. We continue to focus on the scalar case ($p = 1$). Recall that $a\delta^2 + b\delta + c = 0$, so
\[
\delta = \hat\theta - \theta_0 = \frac{-c}{a\delta + b} \tag{5.25}
\]
and
\[
\sqrt{n}(\hat\theta - \theta_0) = \frac{-\sqrt{n}\,c}{a(\hat\theta - \theta_0) + b} = \frac{-1}{a(\hat\theta - \theta_0) + b}\,\sqrt{n}\,c. \tag{5.26}
\]
Now since $a = O_p(1)$ and $\hat\theta - \theta_0 = o_p(1)$, then $a(\hat\theta - \theta_0) = o_p(1)$ and
\[
a(\hat\theta - \theta_0) + b \xrightarrow{p} -\vartheta(\theta_0). \tag{5.27}
\]
And by the CLT we have
\[
\sqrt{n}\,c = \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial \ln f(y_i|\theta_0)}{\partial \theta} \xrightarrow{d} N(0, \vartheta(\theta_0)). \tag{5.28}
\]
Substituting these two results into (5.26), we find
\[
\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, \vartheta(\theta_0)^{-1}).
\]
In general, for $p > 1$, we can apply the same scalar proof to show $\sqrt{n}(\lambda'\hat\theta - \lambda'\theta_0) \xrightarrow{d} N(0, \lambda'\vartheta^{-1}(\theta_0)\lambda)$ for any vector $\lambda$, which means
\[
\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, \vartheta^{-1}(\theta_0)), \tag{5.29}
\]
if $\hat\theta$ is the consistent root.
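A simulation sketch of (5.29), with an illustrative model and constants not taken from the text: for $N(\mu, \sigma^2)$ with $\sigma^2$ known, the MLE of $\mu$ is the sample mean, the information for $\mu$ is $1/\sigma^2$, and so $\sqrt{n}(\hat\mu - \mu_0)$ should be approximately $N(0, \sigma^2)$:

```python
import math
import random

random.seed(2)
mu0, sigma2 = 0.0, 3.0
n, reps = 100, 2000

zs = []
for _ in range(reps):
    ys = [random.gauss(mu0, math.sqrt(sigma2)) for _ in range(n)]
    mu_hat = sum(ys) / n          # MLE of mu with sigma^2 known
    zs.append(math.sqrt(n) * (mu_hat - mu0))

# The variance of sqrt(n)(mu_hat - mu0) should be close to the inverse
# information 1/(1/sigma^2) = sigma^2 = 3.
var_z = sum(z * z for z in zs) / reps
```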
5.2.3.4 Cramer-Rao Lower Bound
In addition to being the asymptotic covariance matrix of the MLE, $\vartheta^{-1}(\theta_0)$ defines a lower bound for the covariance matrices of estimators with certain desirable properties. Let $\tilde\theta(\mathbf{y})$ be any unbiased estimator; then
\[
\mathrm{E}\,\tilde\theta(\mathbf{y}) = \int_{-\infty}^{+\infty} \tilde\theta(\mathbf{y}) f(\mathbf{y};\theta)\,d\mathbf{y} = \theta, \tag{5.30}
\]
for any underlying $\theta = \theta_0$. Differentiating both sides of this relationship with respect to $\theta$ yields
\[
\begin{aligned}
I_p = \frac{\partial\,\mathrm{E}\,\tilde\theta(\mathbf{y})}{\partial \theta'}
&= \int_{-\infty}^{+\infty} \tilde\theta(\mathbf{y})\frac{\partial f(\mathbf{y};\theta_0)}{\partial \theta'}\,d\mathbf{y}
= \int_{-\infty}^{+\infty} \tilde\theta(\mathbf{y})\frac{\partial \ln f(\mathbf{y};\theta_0)}{\partial \theta'} f(\mathbf{y};\theta_0)\,d\mathbf{y} \\
&= \mathrm{E}\left[\tilde\theta(\mathbf{y})\frac{\partial \ln f(\mathbf{y};\theta_0)}{\partial \theta'}\right]
= \mathrm{E}\left[(\tilde\theta(\mathbf{y}) - \theta_0)\frac{\partial \ln f(\mathbf{y};\theta_0)}{\partial \theta'}\right]. \tag{5.31}
\end{aligned}
\]
Next, we let
\[
C(\theta_0) = \mathrm{E}\left[(\tilde\theta(\mathbf{y}) - \theta_0)(\tilde\theta(\mathbf{y}) - \theta_0)'\right] \tag{5.32}
\]
be the covariance matrix of $\tilde\theta(\mathbf{y})$; then
\[
\operatorname{Cov}\begin{pmatrix} \tilde\theta(\mathbf{y}) \\ \partial \ln L/\partial \theta \end{pmatrix}
= \begin{pmatrix} C(\theta_0) & I_p \\ I_p & \vartheta(\theta_0) \end{pmatrix}, \tag{5.33}
\]
where $I_p$ is a $p \times p$ identity matrix, and (5.33), being a covariance matrix, is positive semidefinite. Now, for any $(p \times 1)$ vector $a$, we have
\[
\begin{pmatrix} a' & -a'\vartheta(\theta_0)^{-1} \end{pmatrix}
\begin{pmatrix} C(\theta_0) & I_p \\ I_p & \vartheta(\theta_0) \end{pmatrix}
\begin{pmatrix} a \\ -\vartheta(\theta_0)^{-1}a \end{pmatrix}
= a'\left[C(\theta_0) - \vartheta(\theta_0)^{-1}\right]a \ge 0.
\]
Thus, any unbiased estimator $\tilde\theta(\mathbf{y})$ has a covariance matrix that exceeds $\vartheta(\theta_0)^{-1}$ by a positive semidefinite matrix. And if the MLE is unbiased, it is efficient within the class of unbiased estimators.

It is entirely possible that the MLE will be biased in small samples. It is, however, CUAN under Assumptions (M1)-(M5). In a similar fashion to the above, we can show that in large samples any other CUAN estimator will have a covariance exceeding $\vartheta(\theta_0)^{-1}$. Since the asymptotic covariance of the MLE is, in fact, $\vartheta(\theta_0)^{-1}$, it is asymptotically efficient. $\vartheta(\theta_0)^{-1}$ is called the Cramer-Rao lower bound.
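A simulation sketch of the bound, with an illustrative model and constants not from the text: for $N(\mu, \sigma^2)$ the bound for unbiased estimators of $\mu$ is $\sigma^2/n$. The sample mean attains it, while another unbiased estimator, the sample median, does not:

```python
import math
import random
import statistics

random.seed(3)
mu0, sigma2 = 0.0, 1.0
n, reps = 50, 4000
bound = sigma2 / n  # Cramer-Rao lower bound for unbiased estimators of mu

means, medians = [], []
for _ in range(reps):
    ys = [random.gauss(mu0, math.sqrt(sigma2)) for _ in range(n)]
    means.append(sum(ys) / n)
    medians.append(statistics.median(ys))

# Monte Carlo variances (both estimators are unbiased for mu0 = 0).
var_mean = sum(m * m for m in means) / reps      # should sit at the bound
var_median = sum(m * m for m in medians) / reps  # roughly (pi/2) times the bound
```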
5.3 Maximum Likelihood Inference
5.3.1 Likelihood Ratio Test
Suppose we wish to test $H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$. Then, we define
\[
L_u = \max_\theta L(\theta|\mathbf{y}) = L(\hat\theta|\mathbf{y}) \tag{5.34}
\]
and
\[
L_r = L(\theta_0|\mathbf{y}), \tag{5.35}
\]
where $L_u$ is the unrestricted likelihood and $L_r$ is the restricted likelihood. We then form the likelihood ratio
\[
\lambda = \frac{L_r}{L_u}. \tag{5.36}
\]
Note that the restricted likelihood can be no larger than the unrestricted, which maximizes the function.

As with estimation, it is more convenient to work with the logs of the likelihood functions. It will be shown below that, under $H_0$,
\[
LR = -2\ln\lambda = -2\ln\frac{L_r}{L_u} = 2[\ln L(\hat\theta|\mathbf{y}) - \ln L(\theta_0|\mathbf{y})] \xrightarrow{d} \chi^2_p, \tag{5.37}
\]
where $\hat\theta$ is the unrestricted MLE and $\theta_0$ is the restricted MLE. If $H_1$ applies, then $LR = O_p(n)$. Large values of this statistic indicate that the restrictions make the observed values much less likely than the unrestricted estimates do, so we prefer the unrestricted estimates and reject the restrictions.

In general, for $H_0: r(\theta) = 0$ and $H_1: r(\theta) \ne 0$, we have
\[
L_u = \max_\theta L(\theta|\mathbf{y}) = L(\hat\theta|\mathbf{y}) \tag{5.38}
\]
and
\[
L_r = \max_{\theta:\, r(\theta) = 0} L(\theta|\mathbf{y}) = L(\tilde\theta|\mathbf{y}), \tag{5.39}
\]
where $\tilde\theta$ is the restricted maximum likelihood estimator. Under $H_0$,
\[
LR = 2[\ln L(\hat\theta|\mathbf{y}) - \ln L(\tilde\theta|\mathbf{y})] \xrightarrow{d} \chi^2_q, \tag{5.40}
\]
where $q$ is the length of $r(\cdot)$.

Note that, in the general case, the likelihood ratio test requires calculation of both the restricted and the unrestricted MLE.
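A minimal worked example of (5.37), with an illustrative model and simulated data not from the text: testing $H_0: \mu = \mu_0$ in the $N(\mu, \sigma^2)$ model with $\sigma^2$ known, where the unrestricted MLE is $\bar y$ and the statistic collapses to a closed form:

```python
import math
import random

random.seed(4)
mu0, sigma2 = 0.0, 1.0
n = 100
ys = [random.gauss(mu0, math.sqrt(sigma2)) for _ in range(n)]

def loglik(mu):
    """Log-likelihood of the sample with sigma^2 treated as known."""
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (y - mu) ** 2 / (2 * sigma2) for y in ys)

mu_hat = sum(ys) / n                      # unrestricted MLE
LR = 2 * (loglik(mu_hat) - loglik(mu0))   # likelihood ratio statistic

# In this model LR reduces algebraically to n (mu_hat - mu0)^2 / sigma^2,
# which is chi-squared with one degree of freedom under H0.
closed_form = n * (mu_hat - mu0) ** 2 / sigma2
```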
5.3.2 Wald Test
The asymptotic normality of the MLE may be used to obtain a test based only on the unrestricted estimates. Now, under $H_0: \theta = \theta_0$, we have
\[
\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, \vartheta^{-1}(\theta_0)). \tag{5.41}
\]
Thus, using the results on the asymptotic behavior of quadratic forms from the previous chapter, we have
\[
W = n(\hat\theta - \theta_0)'\,\vartheta(\theta_0)\,(\hat\theta - \theta_0) \xrightarrow{d} \chi^2_p, \tag{5.42}
\]
which is the Wald test. As we discussed for quadratic tests in general, under $H_1: \theta_1 \ne \theta_0$ we would have $W = O_p(n)$.
The statistic $W$ is generally infeasible since it requires knowledge of $\vartheta(\theta_0)$. In practice, since under Assumptions (M2) and (M5) we can show
\[
\frac{1}{n}\frac{\partial^2 \ln L(\hat\theta|\mathbf{y})}{\partial \theta\,\partial \theta'} = \sum_{i=1}^n \frac{1}{n}\frac{\partial^2 \ln f(y_i|\hat\theta)}{\partial \theta\,\partial \theta'} \xrightarrow{p} -\vartheta(\theta_0), \tag{5.43}
\]
we instead use the feasible statistic
\[
W^* = -n(\hat\theta - \theta_0)'\,\frac{1}{n}\frac{\partial^2 \ln L(\hat\theta|\mathbf{y})}{\partial \theta\,\partial \theta'}\,(\hat\theta - \theta_0)
= -(\hat\theta - \theta_0)'\,\frac{\partial^2 \ln L(\hat\theta|\mathbf{y})}{\partial \theta\,\partial \theta'}\,(\hat\theta - \theta_0) \xrightarrow{d} \chi^2_p. \tag{5.44}
\]
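A sketch of the feasible statistic (5.44) in the same known-variance normal setting (illustrative values, not from the text), where the Hessian of the log-likelihood in $\mu$ is the constant $-n/\sigma^2$:

```python
import math
import random

random.seed(5)
mu0, sigma2 = 0.0, 2.0
n = 100
ys = [random.gauss(mu0, math.sqrt(sigma2)) for _ in range(n)]
mu_hat = sum(ys) / n  # unrestricted MLE of mu with sigma^2 known

# Second derivative of the log-likelihood with respect to mu, evaluated
# at mu_hat: d^2 ln L / d mu^2 = -n / sigma^2.
d2L = -n / sigma2

# Feasible Wald statistic W* = -(theta_hat - theta_0)' [Hessian] (theta_hat - theta_0).
W_star = -(mu_hat - mu0) * d2L * (mu_hat - mu0)
```

In this scalar case $W^*$ works out to $n(\hat\mu - \mu_0)^2/\sigma^2$, the same closed form as the likelihood ratio statistic.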
Aside from having the same asymptotic distribution, the likelihood ratio and Wald tests are asymptotically equivalent in the sense that
\[
\operatorname{plim}_{n\to\infty}(LR - W^*) = 0. \tag{5.45}
\]
This is shown by expanding $\ln L(\theta_0|\mathbf{y})$ in a Taylor's series about $\hat\theta$. That is,
\[
\begin{aligned}
\ln L(\theta_0) = \ln L(\hat\theta) &+ \frac{\partial \ln L(\hat\theta)}{\partial \theta'}(\theta_0 - \hat\theta)
+ \frac{1}{2}(\theta_0 - \hat\theta)'\frac{\partial^2 \ln L(\hat\theta)}{\partial \theta\,\partial \theta'}(\theta_0 - \hat\theta) \\
&+ \frac{1}{6}\sum_i \sum_j \sum_k \frac{\partial^3 \ln L(\theta^*)}{\partial \theta_i\,\partial \theta_j\,\partial \theta_k}(\theta_{0i} - \hat\theta_i)(\theta_{0j} - \hat\theta_j)(\theta_{0k} - \hat\theta_k), \tag{5.46}
\end{aligned}
\]
where the third-order term applies the intermediate value theorem for $\theta^*$ between $\hat\theta$ and $\theta_0$. Now $\partial \ln L(\hat\theta)/\partial \theta = 0$, and the third-order term can be shown to be $O_p(1/\sqrt{n})$ under Assumptions (M2) and (M5), whereupon we have
\[
\ln L(\hat\theta) - \ln L(\theta_0) = -\frac{1}{2}(\theta_0 - \hat\theta)'\frac{\partial^2 \ln L(\hat\theta)}{\partial \theta\,\partial \theta'}(\theta_0 - \hat\theta) + O_p(1/\sqrt{n}) \tag{5.47}
\]
and
\[
LR = W^* + O_p(1/\sqrt{n}). \tag{5.48}
\]
In general, we may test $H_0: r(\theta) = 0$ with
\[
W^* = -r(\hat\theta)'\left[R(\hat\theta)\left(\frac{\partial^2 \ln L(\hat\theta)}{\partial \theta\,\partial \theta'}\right)^{-1} R'(\hat\theta)\right]^{-1} r(\hat\theta) \xrightarrow{d} \chi^2_q, \tag{5.49}
\]
where $R(\theta) = \partial r(\theta)/\partial \theta'$.
5.3.3 Lagrange Multiplier
Alternatively, but in the same fashion, we can expand $\ln L(\hat\theta)$ about $\theta_0$ to obtain
\[
\ln L(\hat\theta) = \ln L(\theta_0) + \frac{\partial \ln L(\theta_0)}{\partial \theta'}(\hat\theta - \theta_0)
+ \frac{1}{2}(\hat\theta - \theta_0)'\frac{\partial^2 \ln L(\theta_0)}{\partial \theta\,\partial \theta'}(\hat\theta - \theta_0) + O_p(1/\sqrt{n}). \tag{5.50}
\]
Likewise, we can also expand $\frac{1}{n}\,\partial \ln L(\hat\theta)/\partial \theta$ about $\theta_0$, which yields
\[
0 = \frac{1}{n}\frac{\partial \ln L(\hat\theta)}{\partial \theta} = \frac{1}{n}\frac{\partial \ln L(\theta_0)}{\partial \theta}
+ \frac{1}{n}\frac{\partial^2 \ln L(\theta_0)}{\partial \theta\,\partial \theta'}(\hat\theta - \theta_0) + O_p(1/n), \tag{5.51}
\]
or
\[
(\hat\theta - \theta_0) = -\left(\frac{\partial^2 \ln L(\theta_0)}{\partial \theta\,\partial \theta'}\right)^{-1}\frac{\partial \ln L(\theta_0)}{\partial \theta} + O_p(1/n). \tag{5.52}
\]
Substituting (5.52) into (5.50) gives us
\[
\ln L(\hat\theta) - \ln L(\theta_0) = -\frac{1}{2}\frac{\partial \ln L(\theta_0)}{\partial \theta'}\left(\frac{\partial^2 \ln L(\theta_0)}{\partial \theta\,\partial \theta'}\right)^{-1}\frac{\partial \ln L(\theta_0)}{\partial \theta} + O_p(1/\sqrt{n}), \tag{5.53}
\]
and $LR = LM + O_p(1/\sqrt{n})$, where
\[
LM = -\frac{\partial \ln L(\theta_0)}{\partial \theta'}\left(\frac{\partial^2 \ln L(\theta_0)}{\partial \theta\,\partial \theta'}\right)^{-1}\frac{\partial \ln L(\theta_0)}{\partial \theta} \tag{5.54}
\]
is the Lagrange multiplier test. Thus, under $H_0: \theta = \theta_0$,
\[
\operatorname{plim}_{n\to\infty}(LR - LM) = 0 \tag{5.55}
\]
and
\[
LM \xrightarrow{d} \chi^2_p. \tag{5.56}
\]
Note that the Lagrange multiplier test only requires the restricted values of the parameters.
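A sketch of (5.54) in the known-variance normal model (illustrative values, not from the text). Both the score and the Hessian are evaluated at the restricted value $\mu_0$, so no unrestricted fit is needed:

```python
import math
import random

random.seed(6)
mu0, sigma2 = 1.0, 1.5
n = 200
ys = [random.gauss(mu0, math.sqrt(sigma2)) for _ in range(n)]
ybar = sum(ys) / n

# Score and Hessian of the log-likelihood at the restricted value mu0
# (sigma^2 treated as known).
score = n * (ybar - mu0) / sigma2
hessian = -n / sigma2

# LM = -score' (Hessian)^{-1} score, here scalar.
LM = -score * (1.0 / hessian) * score
```

In this linear-in-$\mu$ model LM works out to $n(\bar y - \mu_0)^2/\sigma^2$, the same value as the LR and Wald statistics; the three differ only in nonlinear models.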
In general, we may test $H_0: r(\theta) = 0$ with
\[
LM = -\frac{\partial \ln L(\tilde\theta)}{\partial \theta'}\left(\frac{\partial^2 \ln L(\tilde\theta)}{\partial \theta\,\partial \theta'}\right)^{-1}\frac{\partial \ln L(\tilde\theta)}{\partial \theta} \xrightarrow{d} \chi^2_q, \tag{5.57}
\]
where $\ln L(\cdot)$ is the unrestricted log-likelihood function and $\tilde\theta$ is the restricted MLE.
5.3.4 Comparing the Tests
In the scalar case the relationship between the three tests can be seen graphically in Figure 5.2. With the parameter values on the horizontal axis and the log-likelihood values on the vertical, we plot an example log-likelihood function. The maximum occurs at $\hat\theta$, with the corresponding log-likelihood $L(\hat\theta)$. The restricted parameter value given by the null hypothesis is $\theta_0$, with corresponding log-likelihood $L(\theta_0)$, which is smaller. The score function $\partial L(\theta_0)/\partial \theta$ is represented by the slope of the tangent line at $L(\theta_0)$.

The Wald test is based on the (squared) difference $\hat\theta - \theta_0$, with an adjustment made for the variability of $\hat\theta$ as measured by the inverse of the estimated information matrix (ignoring the $n$ factor). The likelihood ratio is based on the difference $L(\hat\theta) - L(\theta_0)$, with no adjustment needed. Obviously, for a particular value of $(\hat\theta - \theta_0)^2$, the more pronounced the peak of the log-likelihood, the more precise the estimates and the smaller the estimated variances we divide by in the Wald statistic, which makes the adjusted statistic larger, just as it would make the log-likelihood difference larger. The Lagrange multiplier test is based on the slope of the log-likelihood function at the restricted value of the parameter, which will be larger in the more pronounced peak case. This (squared) slope is adjusted by dividing by the second-order curvature, which is inversely related to the variance and hence larger for the more peaked case. In any event, larger values of $(\hat\theta - \theta_0)^2$ will occur with larger values of $(\partial L(\theta_0)/\partial \theta)^2$ and $L(\hat\theta) - L(\theta_0)$.

The above analysis demonstrates that the three tests (likelihood ratio, Wald, and Lagrange multiplier) are asymptotically equivalent. In large samples, not only do they have the same limiting distribution, but they will accept and reject together under $H_0$. This is not the case in finite samples, where one test can reject when another does not. This might lead a cynical analyst to choose whichever test yields the result he or she wants to obtain. Making an informed choice based on their finite sample behavior is beyond the scope of this course.
In many cases, however, one of the tests is a much more natural choice than the others. Recall that the Wald test only requires the unrestricted estimates, while the Lagrange multiplier test only requires the restricted estimates. In some cases the unrestricted estimates are much easier to obtain than the restricted, and in other cases the reverse is true. In the first case we might be inclined to use the Wald test, while in the latter we would prefer to use the Lagrange multiplier.

Figure 5.2: Comparison of LR, W, and LM Statistics. [The figure plots an example log-likelihood $L(\theta) = \ln L(\theta)$ against $\theta$, marking the unrestricted maximum $L(\hat\theta)$, the restricted value $L(\theta_0)$, and the tangent line at $\theta_0$ with slope $\partial L(\theta_0)/\partial \theta$.]
Another issue is the possible sensitivity of the test results to how the restrictions are written. For example, $\theta_1 + \theta_2 = 0$ can also be written $-\theta_2/\theta_1 = 1$. The Wald test, in particular, is sensitive to how the restriction is written. This is yet another situation where a cynical analyst might be tempted to choose the "normalization" of the restriction that forces the desired result. The Lagrange multiplier test, as presented above, is also sensitive to the normalization of the restriction, but can be modified to avoid this difficulty. The likelihood ratio test, however, is unaffected by the choice of how to write the restriction.
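A numerical sketch of this sensitivity (all numbers hypothetical): take a scalar estimate $\hat\theta$ with estimated variance $v$ and write the same restriction $\theta = 1$ in two algebraically equivalent ways. The resulting Wald statistics differ:

```python
# Hypothetical estimate and estimated variance.
theta_hat, v = 1.3, 0.04

# Form 1: r(theta) = theta - 1 = 0, with derivative R = 1.
W1 = (theta_hat - 1.0) ** 2 / v

# Form 2: r(theta) = 1 - 1/theta = 0, with derivative R = 1/theta^2.
r2 = 1.0 - 1.0 / theta_hat
R2 = 1.0 / theta_hat ** 2
W2 = r2 ** 2 / (R2 ** 2 * v)
```

Here $W_2 = \hat\theta^2 W_1$, so the two normalizations can straddle a critical value even though they test the same hypothesis.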
5.A Appendix
We will now establish that the conditions for Theorem 4.5 are met under the maximum likelihood Assumptions (M1)-(M5), with one addition. We need to explicitly add the condition (C1) that the parameter space is compact, which, for the scalar case with the space a subset of the real line, means that the space is closed and bounded. As was demonstrated in the simple example, such an assumption may be implicit in bounding the third derivatives.
Again, for simplicity, we restrict attention to the scalar parameter case. Define
\[
\psi_n(\theta) = \frac{1}{n}\ln L(\theta|y_1, y_2, \dots, y_n) = \frac{1}{n}\sum_{i=1}^n \ln f(y_i|\theta)
\]
and
\[
\psi_0(\theta) = \mathrm{E}[\ln f(y_i|\theta)];
\]
then condition (C4) follows from Assumption (M4) by the arguments given above regarding the consistency of the global maximum.

This leaves us with the continuity condition (C3) and the uniform convergence in probability condition (C2) to establish. In proving that these conditions are met, we will utilize the following uniform law of large numbers from Newey and McFadden (N-M):

Lemma 2.4 (Newey and McFadden) If the data $z_i$ are i.i.d., $a(z_i, \theta)$ is continuous at each $\theta \in \Theta$ with probability one, and there is $d(z)$ with $\|a(z_i, \theta)\| \le d(z)$ for all $\theta \in \Theta$ and $\mathrm{E}[d(z)] < \infty$, then $\mathrm{E}[a(z, \theta)]$ is continuous and $\sup_{\theta\in\Theta}\left\|n^{-1}\sum_{i=1}^n a(z_i, \theta) - \mathrm{E}[a(z, \theta)]\right\| \xrightarrow{p} 0$.
Assumption (M2) allows us to expand in a Taylor's series to obtain
\[
\psi_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ln f(y_i|\theta_0)
+ \frac{1}{n}\sum_{i=1}^n \frac{\partial \ln f(y_i|\theta_0)}{\partial \theta}(\theta - \theta_0)
+ \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\frac{\partial^2 \ln f(y_i|\theta_0)}{\partial \theta^2}(\theta - \theta_0)^2
+ \frac{1}{n}\sum_{i=1}^n \frac{1}{6}\frac{\partial^3 \ln f(y_i|\theta^*)}{\partial \theta^3}(\theta - \theta_0)^3,
\]
where $\theta^*$ lies between $\theta$ and $\theta_0$. As was shown in the preliminaries of this chapter, the expectations of the first two terms being averaged exist under (M2) and (M3). And, by (M5), the third is bounded by a function with finite expectation, so its expectation exists for all $\theta^*$. Thus, taking expectations of $\psi_n(\theta)$ and subtracting yields
\[
\begin{aligned}
\psi_n(\theta) - \psi_0(\theta) &= \frac{1}{n}\sum_{i=1}^n \left\{\ln f(y_i|\theta_0) - \mathrm{E}\left[\ln f(y_i|\theta_0)\right]\right\} \\
&\quad + \frac{1}{n}\sum_{i=1}^n \left\{\frac{\partial \ln f(y_i|\theta_0)}{\partial \theta} - \mathrm{E}\left[\frac{\partial \ln f(y_i|\theta_0)}{\partial \theta}\right]\right\}(\theta - \theta_0) \\
&\quad + \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left\{\frac{\partial^2 \ln f(y_i|\theta_0)}{\partial \theta^2} - \mathrm{E}\left[\frac{\partial^2 \ln f(y_i|\theta_0)}{\partial \theta^2}\right]\right\}(\theta - \theta_0)^2 \\
&\quad + \frac{1}{n}\sum_{i=1}^n \frac{1}{6}\left\{\frac{\partial^3 \ln f(y_i|\theta^*)}{\partial \theta^3} - \mathrm{E}\left[\frac{\partial^3 \ln f(y_i|\theta^*)}{\partial \theta^3}\right]\right\}(\theta - \theta_0)^3.
\end{aligned}
\]
Applying the sup operator for $\theta \in \Theta$ yields
\[
\begin{aligned}
\sup_{\theta\in\Theta}\|\psi_n(\theta) - \psi_0(\theta)\| &\le \sup_{\theta\in\Theta}\left\|\frac{1}{n}\sum_{i=1}^n \left\{\ln f(y_i|\theta_0) - \mathrm{E}\left[\ln f(y_i|\theta_0)\right]\right\}\right\| \\
&\quad + \sup_{\theta\in\Theta}\left\|\frac{1}{n}\sum_{i=1}^n \left\{\frac{\partial \ln f(y_i|\theta_0)}{\partial \theta} - \mathrm{E}\left[\frac{\partial \ln f(y_i|\theta_0)}{\partial \theta}\right]\right\}(\theta - \theta_0)\right\| \\
&\quad + \frac{1}{2}\sup_{\theta\in\Theta}\left\|\frac{1}{n}\sum_{i=1}^n \left\{\frac{\partial^2 \ln f(y_i|\theta_0)}{\partial \theta^2} - \mathrm{E}\left[\frac{\partial^2 \ln f(y_i|\theta_0)}{\partial \theta^2}\right]\right\}(\theta - \theta_0)^2\right\| \\
&\quad + \frac{1}{6}\sup_{\theta\in\Theta}\left\|\frac{1}{n}\sum_{i=1}^n \left\{\frac{\partial^3 \ln f(y_i|\theta^*)}{\partial \theta^3} - \mathrm{E}\left[\frac{\partial^3 \ln f(y_i|\theta^*)}{\partial \theta^3}\right]\right\}(\theta - \theta_0)^3\right\|.
\end{aligned}
\]
Now the average in the first line is a function only of $y_i$ and converges in probability to zero. Likewise, the averages in the second and third lines, together with the fact that $(\theta - \theta_0)$ is bounded, mean that each of those lines converges in probability to zero. For the last line, we define $a(z, \theta) = \partial^3 \ln f(y|\theta)/\partial \theta^3$, which by Assumption (M5) is bounded in absolute value by $H(y)$ with finite expectation. Applying the Lemma to this term, together with $(\theta - \theta_0)$ being bounded, means the last line also converges in probability to zero. But this means that $\sup_{\theta\in\Theta}\|\psi_n(\theta) - \psi_0(\theta)\| \xrightarrow{p} 0$ and, moreover, that $\psi_0(\theta)$ is continuous, so (C2) and (C3) are satisfied.
A similar approach can be used to verify that Assumptions (M1)-(M5) satisfy the conditions of Theorem 4.6.

Using Lemma 2.4, it is possible to obtain a consistency result for maximum likelihood that does not require the existence of derivatives.
Theorem 5.1. Suppose (i) $y_i$ $(i = 1, 2, \dots, n)$ are i.i.d. with p.d.f. $f(y_i|\theta)$, (ii) $f(y_i|\theta) \ne f(y_i|\theta_0)$ for $\theta \ne \theta_0$, (iii) $\theta \in \Theta$, which is compact, (iv) $\ln f(y_i|\theta)$ is continuous at each $\theta \in \Theta$ with probability one, and (v) $\mathrm{E}[\sup_{\theta\in\Theta}|\ln f(y_i|\theta)|] < \infty$; then $\hat\theta \xrightarrow{p} \theta_0$. $\square$

Proof: We proceed by verifying the conditions of Theorem 4.5. (C1) is assured by (iii). (C4) is guaranteed by (ii) and the arguments yielding (5.23). Define $a(z_i, \theta) = \ln f(y_i|\theta)$ and $\psi_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ln f(y_i|\theta)$; then (iv) and (v) allow us to apply Lemma 2.4 of N-M to show that (C2) and (C3) are satisfied. $\square$

We now turn to sufficient conditions for interchanging differentiation and integration of a function $g(y, \theta)$. These conditions are given as Theorem 1.3.2 in Amemiya (1985). Suppose (i) $\frac{\partial}{\partial \theta}g(y, \theta)$ is continuous in $\theta \in N$ and $y$, where $N$ is an open set, (ii) $\int g(y, \theta)\,dy$ exists, and (iii) $\int \left\|\frac{\partial}{\partial \theta}g(y, \theta)\right\| dy < M < \infty$ for $\theta \in N$; then $\frac{\partial}{\partial \theta}\int g(y, \theta)\,dy = \int \frac{\partial}{\partial \theta}g(y, \theta)\,dy$ for $\theta \in N$. The third condition means the integrated norm of $\frac{\partial}{\partial \theta}g(y, \theta)$ is bounded by a constant. Verification of (M3) in the chapter requires verification of these conditions for $f(y, \theta)$ and is a similar exercise to verifying (M5). Note that these conditions are on $f(y, \theta)$ rather than $\ln f(y, \theta)$.
With these sufficient conditions we can adapt the asymptotic normality results of Theorem 4.6 to the specific case of maximum likelihood and obtain a result with more primitive assumptions.

Theorem 5.2. Suppose (i) $y_i$ are i.i.d. $f(y, \theta_0)$, (ii) $\hat\theta \xrightarrow{p} \theta_0$, with $\theta_0$ interior to $\Theta$, (iii) $f(y, \theta)$ is twice continuously differentiable and $f(y, \theta) > 0$ in a neighborhood $N$ of $\theta_0$, (iv) $\sup_{\theta\in N}\int \left\|\frac{\partial f(y,\theta)}{\partial \theta}\right\| dy < \infty$ and $\sup_{\theta\in N}\int \left\|\frac{\partial^2 f(y,\theta)}{\partial \theta\,\partial \theta'}\right\| dy < \infty$, (v) $\vartheta = \mathrm{E}\left[\frac{\partial \ln f(y,\theta_0)}{\partial \theta}\frac{\partial \ln f(y,\theta_0)}{\partial \theta'}\right]$ exists and is nonsingular, and (vi) $\mathrm{E}\left[\sup_{\theta\in N}\left\|\frac{\partial^2 \ln f(y,\theta)}{\partial \theta\,\partial \theta'}\right\|\right] < \infty$; then $\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, \vartheta^{-1})$. $\square$
Proof: We proceed by verifying the conditions of Theorem 4.6. Define $\psi_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ln f(y_i|\theta)$; then (N1) and (N2) are assured by (ii) and (iii). Now (iv) assures that the zero-mean score and information matrix equality results apply, so we have
\[
\mathrm{E}\left[\frac{\partial \ln f(y_i|\theta_0)}{\partial \theta}\right] = 0
\quad\text{and}\quad
\mathrm{E}\left[\frac{\partial \ln f(y_i|\theta_0)}{\partial \theta}\frac{\partial \ln f(y_i|\theta_0)}{\partial \theta'}\right] = \vartheta = -\mathrm{E}\left[\frac{\partial^2 \ln f(y_i|\theta_0)}{\partial \theta\,\partial \theta'}\right].
\]
Define $a(z_i, \theta) = \frac{\partial^2 \ln f(y,\theta)}{\partial \theta\,\partial \theta'}$; then by (i), (iii), and (vi) we can apply Lemma 2.4 to find that $\Psi(\theta) = \mathrm{E}\left[\frac{\partial^2 \ln f(y,\theta)}{\partial \theta\,\partial \theta'}\right]$ is continuous and $\sup_{\theta\in N}\left\|\frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \ln f(y_i,\theta)}{\partial \theta\,\partial \theta'} - \Psi(\theta)\right\| \xrightarrow{p} 0$, so (N3) is satisfied. But $\Psi = \Psi(\theta_0) = -\vartheta$ by the information matrix equality, and it is nonsingular by (v), so (N4) is met. And by (i) we can apply Lindeberg-Levy to find $\sqrt{n}\,\frac{\partial \psi_n(\theta_0)}{\partial \theta} = \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial \ln f(y_i|\theta_0)}{\partial \theta} \xrightarrow{d} N(0, \vartheta)$, so (N5) is satisfied with $V = \vartheta$. Since the conditions of Theorem 4.6 are met and $\Psi^{-1}V\Psi^{-1} = \vartheta^{-1}$, we have the final result. $\square$