Statistical models and fundamental concepts
Changliang Zou
Changliang Zou Advanced Statistics-I-2, Fall 2018
Outline
Statistical models
Statistics and Sufficiency
Point estimation, confidence set and hypothesis testing
Statistical functional and substitution principle
Likelihood function, likelihood principle and beyond
Statistical models
A statistical model P is a collection of probability distributions (or a collection of densities).
Parametric models: The form of the distribution of P is known up to some unknown parameters: P = {p(x; θ), θ ∈ Θ}, where Θ ⊂ R^d.
Nonparametric models: The distributions in P cannot be indexed by finitely many parameters, e.g., P = {p : ∫ {p''(x)}^2 dx < ∞}.
Example
Suppose Y is a response and X = (x1, . . . , xd) are covariates. A basic goal is to estimate m(x) = E(Y | X = x). Why?
Statistical models
One may wish to predict the value of Y based on an observed value of X. Let m(X) be a predictor satisfying E{m(X)}^2 < ∞. Consider the loss function based on the mean squared error:
arg min_m E{Y − m(X)}^2.
Note that
E{Y − m(X)}^2 = E{Y − E(Y | X)}^2 + E{E(Y | X) − m(X)}^2
≥ E{Y − E(Y | X)}^2.
How to estimate E(Y | X)? A nonparametric model: m(x) possesses certain smoothness.
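Under such a smoothness assumption, a common nonparametric estimate of E(Y | X = x) is a kernel-weighted local average (the Nadaraya-Watson estimator). A minimal stdlib-only sketch, assuming scalar X, a Gaussian kernel, and an illustrative bandwidth:

```python
import math
import random

def nw_estimate(x0, xs, ys, h=0.05):
    """Nadaraya-Watson estimate of m(x0) = E(Y | X = x0):
    a weighted average of the ys, with Gaussian kernel weights
    that decay with the distance |x0 - x| / h."""
    weights = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total

# Illustrative data: m(x) = sin(2*pi*x) plus small noise.
random.seed(0)
xs = [random.uniform(0, 1) for _ in range(500)]
ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.1) for x in xs]

# The estimate should be close to m(0.25) = sin(pi/2) = 1.
print(nw_estimate(0.25, xs, ys))
```

The bandwidth h governs the bias-variance trade-off: a larger h gives a smoother but more biased estimate.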
Statistical models
Semiparametric models: without any information about the structure of the function, it is difficult to estimate m(x) well when d > 2, and therefore many semiparametric models have been proposed that impose structural constraints or special functional forms upon m(x).
A general form is Y = G(g, β; X) + ε, where g = (g1, . . . , gq)^T are unknown smooth functions of X, and G is known up to a parameter vector β.
Statistical models
This model includes many nonparametric or semiparametric models as special cases, such as generalized additive models, partially linear models, varying coefficient models and single-index or multi-index models.
For example, the single-index model and the varying coefficient model admit the forms Y = g(β^T X) + ε and Y = g1(x1) + g2(x1)x2 + · · · + gd(x1)xd + ε, respectively.
Statistics
Let X = (X1, . . . , Xn). Any function T = T(X) is a random variable which we call a statistic. E.g., the order statistics X(1) ≤ · · · ≤ X(n).
A statistic is called an ancillary statistic for θ if its distribution does not depend on θ.
Example
Suppose X1, . . . , Xn iid∼ U(µ − θ, µ + θ), θ > 0. The sample range X(n) − X(1) is an ancillary statistic for µ.
Its PDF is
f(x) = [n(n − 1) x^{n−2} / (2θ)^{n−1}] (1 − x/(2θ)), 0 ≤ x ≤ 2θ,
which does not involve µ.
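The ancillarity can be checked by simulation: the sampling distribution of the range is unaffected by shifting µ. An illustrative sketch, not a proof; all constants are arbitrary:

```python
import random

def sample_range(mu, theta, n, rng):
    """Range X(n) - X(1) of an iid sample from U(mu - theta, mu + theta)."""
    xs = [rng.uniform(mu - theta, mu + theta) for _ in range(n)]
    return max(xs) - min(xs)

rng = random.Random(1)
n, theta, reps = 10, 2.0, 20000

# Empirical mean of the range under two very different values of mu.
mean_a = sum(sample_range(0.0, theta, n, rng) for _ in range(reps)) / reps
mean_b = sum(sample_range(50.0, theta, n, rng) for _ in range(reps)) / reps

# Both should be near the theoretical mean 2*theta*(n-1)/(n+1),
# which involves theta but not mu.
print(mean_a, mean_b)
```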
Sufficient Statistics
Definition
The statistic T = T(X) is called a sufficient statistic for θ if the conditional distribution of X | T does not depend on θ.
Intuitively, this means that you can replace X with T(X) without losing information.
What sufficiency really means: If T is sufficient, then T contains all the information you need from the data to compute the likelihood function. It does not contain all the information in the data.
The Factorization Theorem: T is sufficient for θ if and only if p(x; θ) = g(T(x), θ) h(x) for some functions g and h.
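For instance, for i.i.d. Bernoulli(θ) observations, T = ΣXi is sufficient: the conditional probability of any particular arrangement given T = t is 1/C(n, t), free of θ. A small numerical check (sketch):

```python
from math import comb

def cond_prob(x, theta):
    """P(X = x | T = t) for i.i.d. Bernoulli(theta) data, T = sum(X)."""
    n, t = len(x), sum(x)
    p_x = theta ** t * (1 - theta) ** (n - t)               # P(X = x)
    p_t = comb(n, t) * theta ** t * (1 - theta) ** (n - t)  # P(T = t)
    return p_x / p_t

x = (1, 0, 1, 0, 0)
# The conditional probability is 1 / C(5, 2) = 0.1 whatever theta is.
print(cond_prob(x, 0.2), cond_prob(x, 0.9))
```

The θ-dependent factor cancels between numerator and denominator, which is exactly the factorization at work.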
Minimal Sufficient Statistics
Let T be a sufficient statistic with T = g(S). Then S is also sufficient, and T is better than S for reducing the data unless g(·) is a one-to-one mapping, in which case T is equivalent to S.
Definition
T is a minimal sufficient statistic if (i) T is sufficient and (ii) for any other sufficient statistic S, T = g(S) for some function g.
A minimal sufficient statistic is the best sufficient statistic for reducing the data. Let x^n = (x1, . . . , xn).
Theorem
If p(y^n; θ)/p(x^n; θ) does not depend on θ if and only if T(y^n) = T(x^n), then T is a minimal sufficient statistic.
A related idea: sufficient dimension reduction
Let X = (x1, . . . , xp)^T ∈ R^p and Y ∈ R. The general goal of a regression of Y on X is inference about the conditional distribution of Y given X.
When the dimension of X is not small, it is desirable to reduce its dimensionality as a preliminary step in an analysis. Sufficient dimension reduction is important in both theory and practice.
The basic idea is to replace the predictor vector with its projection onto a subspace of the predictor space without loss of information on the conditional distribution of Y given X.
A related idea: sufficient dimension reduction
If a predictor subspace S ⊆ R^p satisfies
Y ⊥ X | P_S X,
where ⊥ stands for independence and P_(·) represents the projection matrix with respect to the standard inner product, then S is called a (sufficient) dimension reduction subspace.
The statement that Y is independent of X given P_S X is equivalent to stipulating that P_S X carries all of the regression information that X has about Y.
The central subspace is defined as the intersection of all dimension reduction subspaces; under mild conditions it is itself a dimension reduction subspace.
Point Estimation
Many inferential problems can be identified as being one of three types: estimation, confidence sets, or hypothesis testing.
Point estimation refers to providing a single “best guess” of some quantity of interest.
The quantity of interest could be a parameter in a parametric model, a CDF F, a probability density function f, a regression function m(·), or a prediction for a future value Y of some random variable.
By convention, we denote a point estimate of θ by θ̂n = g(X).
Point Estimation
The bias of an estimator is defined by
bias(θ̂n) = E(θ̂n) − θ.
Unbiasedness used to receive much attention but these days is considered less important; many of the estimators we will use are biased.
A reasonable requirement for an estimator is that it should converge to the true value as we collect more and more data, namely consistency: θ̂n →p θ.
Point Estimation
The distribution of θ̂n is called the sampling distribution. The standard deviation of θ̂n is called the standard error.
The quality of a point estimate is often assessed by the mean squared error, or MSE, defined by
MSE = E (θ̂n − θ)2 = bias2(θ̂n) + var(θ̂n).
The mean absolute error (MAE) of the estimator is defined by E|θ̂n − θ|.
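The decomposition MSE = bias^2 + var can be verified numerically, e.g., for the biased variance estimator that divides by n. An illustrative simulation; sample sizes, seed, and the normal population are arbitrary choices:

```python
import random
import statistics

def var_mle(xs):
    """Variance estimator that divides by n (biased: E = sigma^2*(n-1)/n)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(7)
n, sigma2, reps = 10, 4.0, 50000
ests = [var_mle([rng.gauss(0, 2.0) for _ in range(n)]) for _ in range(reps)]

bias = sum(ests) / reps - sigma2        # theoretical bias: -sigma2/n = -0.4
var = statistics.pvariance(ests)        # variance of the estimator
mse = sum((e - sigma2) ** 2 for e in ests) / reps

# mse equals bias**2 + var up to floating-point error.
print(bias, var, mse)
```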
Confidence set and hypothesis testing
A 1 − α confidence interval (set) for a parameter θ is an interval or set Cn(X), a function of the data X, such that
Pr(θ ∈ Cn(X)) ≥ 1 − α, for all θ ∈ Θ.
In other words, Cn traps θ with probability 1 − α.
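Coverage can be illustrated by simulation: a 95% z-interval for a normal mean with known σ traps µ in about 95% of repeated samples. A sketch; all constants are illustrative:

```python
import math
import random

def z_interval(xs, sigma, z=1.96):
    """95% confidence interval for a normal mean, sigma known."""
    n = len(xs)
    xbar = sum(xs) / n
    half = z * sigma / math.sqrt(n)
    return xbar - half, xbar + half

rng = random.Random(3)
mu, sigma, n, reps = 5.0, 2.0, 25, 20000
hits = 0
for _ in range(reps):
    lo, hi = z_interval([rng.gauss(mu, sigma) for _ in range(n)], sigma)
    hits += lo <= mu <= hi

# Empirical coverage, close to the nominal 0.95.
print(hits / reps)
```

Note that the randomness is in the interval, not in θ: each repetition draws a new interval, and about 95% of them contain the fixed µ.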
Hypothesis testing: we start with some default theory, called a null hypothesis, and ask whether the data provide sufficient evidence to reject it; if not, we retain the null hypothesis.
Statistical functional and substitution principle
Let X1, . . . , Xn ∼ F, and let F denote the set of all distributions. A map T : F → R is called a statistical functional. We write θ = T(F) or θ = T(P).
Examples include the mean µ = ∫ x dF, the variance σ^2 = ∫ (x − µ)^2 dF, and the median F^{−1}(1/2) = inf{x : F(x) ≥ 1/2}.
Substitution principle: a plug-in estimator of θ is θ̂ = T(F̂n), where F̂n is the empirical CDF (ECDF).
A functional of the form ∫ a(x) dF(x) is called a linear functional. The plug-in estimator of a linear functional T(F) is T(F̂n) = n^{−1} Σ_{i=1}^n a(Xi).
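A sketch of the plug-in recipe for a linear functional: integrating a(x) against the ECDF reduces to averaging a over the sample. The choice a(x) = x^2 below, estimating E(X^2) for a standard normal, is purely illustrative:

```python
import random

def plugin_linear(a, xs):
    """Plug-in estimator of the linear functional T(F) = integral a(x) dF(x):
    integrating a against the ECDF is just averaging a over the sample."""
    return sum(a(x) for x in xs) / len(xs)

rng = random.Random(11)
xs = [rng.gauss(0, 1) for _ in range(100000)]

# Estimates the second moment E(X^2), which is 1 for a standard normal.
second_moment = plugin_linear(lambda x: x * x, xs)
print(second_moment)
```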
Statistical functional and substitution principle
Example (variance)
X ∼ F, θ = var(X) = ∫ x^2 dF(x) − {∫ x dF(x)}^2; we have θ̂ = T(F̂n) = n^{−1} Σ_{i=1}^n (Xi − X̄)^2.
Example (quantile)
Let T(F) = F^{−1}(p) be the pth quantile. Its estimate is F̂n^{−1}(p) = inf{x : F̂n(x) ≥ p}, which is called the pth sample quantile. T(F) is not a linear functional, but we shall later see that F̂n^{−1}(p) has an approximately linear representation.
Statistical functional and substitution principle
Example (Estimating equation)
T(F) may not always have an explicit expression. Let X ∼ F and suppose θ satisfies ∫ g(x, θ) dF(x) = 0, i.e., E{g(X, θ)} = 0. By the substitution principle, the estimator θ̂ satisfies n^{−1} Σ_{i=1}^n g(Xi, θ̂) = 0.
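A sketch of solving the sample estimating equation numerically; bisection suffices when the sample estimating function is monotone in θ. The choice g(x, θ) = x − θ, whose solution is the sample mean, is illustrative:

```python
import random

def solve_estimating_eq(g, xs, lo, hi, tol=1e-8):
    """Solve (1/n) * sum_i g(x_i, theta) = 0 for theta by bisection,
    assuming the sample estimating function is decreasing in theta
    with a sign change on [lo, hi]."""
    def gn(theta):
        return sum(g(x, theta) for x in xs) / len(xs)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gn(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

rng = random.Random(5)
xs = [rng.gauss(3.0, 1.0) for _ in range(5000)]

# g(x, theta) = x - theta  =>  the solution is exactly the sample mean.
theta_hat = solve_estimating_eq(lambda x, t: x - t, xs, -10, 10)
print(theta_hat)
```

Other choices of g give other estimators; e.g., g(x, θ) = sign(x − θ) yields the sample median.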
Note: in some cases the substitution principle does not yield a good estimator; for example, density estimation, since the ECDF is not continuous.
Likelihood function
Let X have joint density p(xn; θ), where θ ∈ Θ.
The likelihood function L : Θ → [0, ∞) is defined by L(θ) ≡ L(θ; x^n) = p(x^n; θ), where x^n is fixed and θ varies in Θ.
The log-likelihood function is `(θ) = log L(θ).
The name likelihood implies that, given x^n, the value θ is more likely to be the true parameter than θ′ if L(θ; x^n) > L(θ′; x^n).
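For example, for t successes in n i.i.d. Bernoulli(θ) trials, L(θ) = θ^t (1 − θ)^{n−t}; a grid search over the log-likelihood shows that the most likely value is t/n (illustrative sketch):

```python
from math import log

def log_lik(theta, t, n):
    """Bernoulli log-likelihood l(theta): t successes in n trials."""
    return t * log(theta) + (n - t) * log(1 - theta)

t, n = 7, 10
grid = [i / 1000 for i in range(1, 1000)]        # theta in (0, 1)
theta_mle = max(grid, key=lambda th: log_lik(th, t, n))

# The log-likelihood is concave with maximizer t/n = 0.7.
print(theta_mle)
```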
Likelihood function
The likelihood function is a function of θ; it is not a probability density function.
The likelihood is only defined up to a constant of proportionality. In other words, it is an equivalence class of functions.
The likelihood function is used (i) to generate estimators and (ii)as a key ingredient in Bayesian inference.
Likelihood principle
Definition
(Strong) Likelihood Principle. In the inference about θ, after x^n is observed, all relevant experimental information is contained in the likelihood function for the observed x^n. Furthermore, two likelihood functions contain the same information about θ if they are proportional to each other.
The likelihood principle concerns the foundations of statistical inference and is often invoked in arguments about correct statistical reasoning.
Maximum-likelihood estimation satisfies the likelihood principle.
Likelihood principle
Definition
(Weak) Likelihood Principle. Any set of observations from a given model pθ(x) with the same likelihood should lead to the same conclusion.
This principle is equivalent to the sufficiency principle: all sufficient statistics based on data x^n for a given model pθ(x) should lead to the same conclusion, because all sufficient statistics lead to the same likelihood function.
Note the crucial difference: the weak likelihood principle states that different outcomes from the same experiment having the same likelihood carry the same evidence. The strong likelihood principle allows the outcomes to come from different experiments with different sampling schemes. So, according to the strong likelihood principle, evidence about θ does not depend on the sampling scheme.
Some other concepts: profile likelihood
For multiparameter models, the multidimensional likelihood function can be difficult to describe or to communicate. A method is needed to “concentrate” the likelihood on a single parameter by eliminating the nuisance parameter. The likelihood approach to eliminating a nuisance parameter is to replace it by its MLE at each fixed value of the parameter of interest. The resulting likelihood is called the profile likelihood.
Definition
Given the joint likelihood L(θ, η), the profile likelihood of θ is
L(θ) = max_η L(θ, η),
where the maximization is performed at each fixed value of θ.
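A sketch for normal data: profiling out σ² gives σ̂²(µ) = n^{−1} Σ (Xi − µ)², and the resulting profile log-likelihood of µ is maximized at the sample mean. Constants, seed, and grid below are illustrative:

```python
import math
import random

def profile_loglik_mu(mu, xs):
    """Profile log-likelihood of mu for N(mu, sigma^2) data:
    sigma^2 is replaced by its MLE at each fixed mu."""
    n = len(xs)
    s2_hat = sum((x - mu) ** 2 for x in xs) / n
    return -0.5 * n * (math.log(2 * math.pi * s2_hat) + 1)

rng = random.Random(9)
xs = [rng.gauss(2.0, 1.5) for _ in range(200)]

grid = [1.0 + i / 1000 for i in range(2001)]     # mu in [1, 3]
mu_hat = max(grid, key=lambda m: profile_loglik_mu(m, xs))

# Agrees with the sample mean up to the grid spacing.
print(mu_hat, sum(xs) / len(xs))
```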
Some other concepts: estimated likelihood
Definition (Estimated (pseudo) likelihood)
Suppose the total parameter space is (θ, η), where θ is the parameter of interest. Let η̂ be an estimate of η; it can be any reasonable estimate, and in particular it does not have to be an MLE. The estimated likelihood of θ is L(θ, η̂).
Not to be confused with the profile likelihood L(θ, η̂θ): the estimate η̂ here is obtained independently of the parameter of interest θ.
Some other concepts: composite likelihood
Composite Likelihood. Consider an m-dimensional random vector Y, with probability density function f(y; θ) for some unknown p-dimensional parameter vector θ ∈ Θ.
Composite conditional likelihood: ∏_{s=1}^m f_{s|(−s)}(y_s | y_{(−s)}; θ).
Composite pairwise likelihood: ∏_{r=1}^{m−1} ∏_{s=r+1}^m f_2(y_r, y_s; θ).
Motivations: easier to compute; access to multivariate distributions; more robust in the sense that only marginal (mean/variance) and association (covariance) parameters are modeled.
Some other concepts: quasi likelihood
Given data y1, . . . , yn, the estimating equation approach specifies that the estimate θ̂ is the solution of Σ_i ψ(y_i, θ) = 0, where the estimating function ψ(·, θ) is a known function.
One of the most important estimating equations is associated with GLM: given observation y_i, we assume
E(y_i) = µ_i(β), var(y_i) = φ v_i(β)
for known functions µ_i(·) and v_i(·) of an unknown regression parameter β. The unknown parameter φ is a dispersion parameter.
Some other concepts: quasi likelihood
The estimate of β is the solution of
Σ_{i=1}^n (∂µ_i/∂β) v_i^{−1} (y_i − µ_i) = 0.
The equation applies immediately to multivariate outcomes, in which case it is called a generalized estimating equation (GEE). The objective function associated with this estimating equation is called the quasi-likelihood.
In contrast with a full likelihood, we are not specifying any probability structure, but only the mean and variance functions. By only specifying the mean and the variance we are letting the shape of the distribution remain totally free. This is a useful strategy for dealing with non-Gaussian multivariate data.
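A one-parameter sketch: with µ_i(β) = exp(βx_i) and v_i = µ_i (a Poisson-type mean-variance relation), the estimating equation reduces to Σ x_i (y_i − exp(βx_i)) = 0, which Newton's method solves. Everything below (data, seed, sampler) is illustrative; the data happen to be Poisson, but only the mean and variance are used by the estimator:

```python
import math
import random

def quasi_score_solve(xs, ys, beta=0.0, iters=50):
    """Newton's method for sum_i x_i * (y_i - exp(beta * x_i)) = 0,
    the quasi-score when mu_i = exp(beta * x_i) and v_i = mu_i."""
    for _ in range(iters):
        mus = [math.exp(beta * x) for x in xs]
        score = sum(x * (y - m) for x, y, m in zip(xs, ys, mus))
        info = sum(x * x * m for x, m in zip(xs, mus))  # = -d(score)/d(beta)
        beta += score / info
    return beta

def rpois(lam, rng):
    """Simple Poisson sampler (Knuth's method), adequate for small lam."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(13)
true_beta = 0.5
xs = [rng.uniform(0, 2) for _ in range(2000)]
ys = [rpois(math.exp(true_beta * x), rng) for x in xs]

beta_hat = quasi_score_solve(xs, ys)
print(beta_hat)  # close to true_beta = 0.5
```

The same estimate would be obtained from any response distribution with this mean and variance structure, which is the point of the quasi-likelihood approach.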