Statistical models and fundamental concepts
Changliang Zou
Changliang Zou Advanced Statistics-I-2, Fall 2018
Outline
Statistical models
Statistics and Sufficiency
Point estimation, confidence set and hypothesis testing
Statistical functional and substitution principle
Likelihood function, likelihood principle and beyond
Statistical models
A statistical model P is a collection of probability distributions (or a collection of densities).
Parametric models: The form of the distribution of P is known up to some unknown parameters: P = {p(x; θ), θ ∈ Θ}, where Θ ⊂ R^d.
Nonparametric models: The distributions in P cannot be indexed by finitely many parameters, e.g., P = {p : ∫ {p''(x)}^2 dx < ∞}.
Example
Suppose Y is a response and X = (x1, . . . , xd) are covariates. A basic goal is to estimate m(x) = E(Y | X = x). Why?
Statistical models
One may wish to predict the value of Y based on an observed value of X. Let m(X) be a predictor satisfying E{m(X)}^2 < ∞. Consider the loss function based on the mean squared error:
arg min_m E{Y − m(X)}^2.
Note that
E{Y − m(X)}^2 = E{Y − E(Y | X)}^2 + E{E(Y | X) − m(X)}^2
≥ E{Y − E(Y | X)}^2.
How to estimate E(Y | X)? A nonparametric model: m(x) possesses certain smoothness.
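Under such a smoothness assumption, a common nonparametric estimate of E(Y | X = x) is a kernel-weighted local average (the Nadaraya-Watson estimator). A minimal stdlib-only sketch, assuming scalar X, a Gaussian kernel, and an illustrative bandwidth:

```python
import math
import random

def nw_estimate(x0, xs, ys, h=0.05):
    """Nadaraya-Watson estimate of m(x0) = E(Y | X = x0):
    a weighted average of the ys, with Gaussian kernel weights
    that decay with the distance |x0 - x| / h."""
    weights = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total

# Illustrative data: m(x) = sin(2*pi*x) plus small noise.
random.seed(0)
xs = [random.uniform(0, 1) for _ in range(500)]
ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.1) for x in xs]

# The estimate should be close to m(0.25) = sin(pi/2) = 1.
print(nw_estimate(0.25, xs, ys))
```

The bandwidth h governs the bias-variance trade-off: a larger h gives a smoother but more biased estimate.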
Statistical models
Semiparametric models: without any information about the structure of the function, it is difficult to estimate m(x) well when d > 2, and therefore many semiparametric models have been proposed that impose structural constraints or special functional forms upon m(x).
A general form is Y = G(g, β; X) + ε, where g = (g1, . . . , gq)^T are unknown smooth functions of X, and G is known up to a parameter vector β.
Statistical models
This model includes many nonparametric or semiparametric models as special cases, such as generalized additive models, partially linear models, varying coefficient models and single-index or multi-index models.
For example, the single-index model and the varying coefficient model admit the forms Y = g(β^T X) + ε and Y = g1(x1) + g2(x1)x2 + · · · + gd(x1)xd + ε, respectively.
Statistics
Let X = (X1, . . . , Xn). Any function T = T(X) is a random variable which we call a statistic. E.g., the order statistics X(1) ≤ · · · ≤ X(n).
A statistic is called an ancillary statistic for θ if its distribution does not depend on θ.
Example
Suppose X1, . . . , Xn iid∼ U(µ − θ, µ + θ), θ > 0. The sample range X(n) − X(1) is an ancillary statistic for µ.
Its PDF is
f(x) = [n(n − 1) x^{n−2} / (2θ)^{n−1}] (1 − x/(2θ)), 0 ≤ x ≤ 2θ,
which does not involve µ.
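The ancillarity can be checked by simulation: the sampling distribution of the range is unaffected by shifting µ. An illustrative sketch, not a proof; all constants are arbitrary:

```python
import random

def sample_range(mu, theta, n, rng):
    """Range X(n) - X(1) of an iid sample from U(mu - theta, mu + theta)."""
    xs = [rng.uniform(mu - theta, mu + theta) for _ in range(n)]
    return max(xs) - min(xs)

rng = random.Random(1)
n, theta, reps = 10, 2.0, 20000

# Empirical mean of the range under two very different values of mu.
mean_a = sum(sample_range(0.0, theta, n, rng) for _ in range(reps)) / reps
mean_b = sum(sample_range(50.0, theta, n, rng) for _ in range(reps)) / reps

# Both should be near the theoretical mean 2*theta*(n-1)/(n+1),
# which involves theta but not mu.
print(mean_a, mean_b)
```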
Sufficient Statistics
Definition
The statistic T = T(X) is called a sufficient statistic for θ if the conditional distribution of X | T does not depend on θ.
Intuitively, this means that you can replace X with T(X) without losing information.
What sufficiency really means: If T is sufficient, then T contains all the information you need from the data to compute the likelihood function. It does not contain all the information in the data.
The Factorization Theorem: T is sufficient for θ if and only if p(x; θ) = g(T(x), θ) h(x) for some functions g and h.
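For instance, for i.i.d. Bernoulli(θ) observations, T = ΣXi is sufficient: the conditional probability of any particular arrangement given T = t is 1/C(n, t), free of θ. A small numerical check (sketch):

```python
from math import comb

def cond_prob(x, theta):
    """P(X = x | T = t) for i.i.d. Bernoulli(theta) data, T = sum(X)."""
    n, t = len(x), sum(x)
    p_x = theta ** t * (1 - theta) ** (n - t)               # P(X = x)
    p_t = comb(n, t) * theta ** t * (1 - theta) ** (n - t)  # P(T = t)
    return p_x / p_t

x = (1, 0, 1, 0, 0)
# The conditional probability is 1 / C(5, 2) = 0.1 whatever theta is.
print(cond_prob(x, 0.2), cond_prob(x, 0.9))
```

The θ-dependent factor cancels between numerator and denominator, which is exactly the factorization at work.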
Minimal Sufficient Statistics
Let T be a sufficient statistic with T = g(S). Then S is also sufficient, and T is better than S for reducing the data unless g(·) is a one-to-one mapping, in which case T is equivalent to S.
Definition
T is a minimal sufficient statistic if (i) T is sufficient and (ii) for any other sufficient statistic S, T = g(S) for some function g.
A minimal sufficient statistic is the best sufficient statistic for reducing the data. Let x^n = (x1, . . . , xn).
Theorem
If p(y^n; θ)/p(x^n; θ) does not depend on θ if and only if T(y^n) = T(x^n), then T is a minimal sufficient statistic.
A related idea: sufficient dimension reduction
Let X = (x1, . . . , xp)^T ∈ R^p and Y ∈ R. The general goal of a regression of Y on X is inference about the conditional distribution of Y given X.
When the dimension of X is not small, it is desirable to reduce its dimensionality as a preliminary step in an analysis. Sufficient dimension reduction is important in both theory and practice.
The basic idea is to replace the predictor vector with its projection onto a subspace of the predictor space without loss of information on the conditional distribution of Y given X.
A related idea: sufficient dimension reduction
If a predictor subspace S ⊆ R^p satisfies
Y ⊥ X | P_S X,
where ⊥ stands for independence and P_(·) represents the projection matrix with respect to the standard inner product, then S is called a (sufficient) dimension reduction subspace.
The statement that Y is independent of X given P_S X is equivalent to stipulating that P_S X carries all of the regression information that X has about Y.
The central subspace is defined as the intersection of all dimension reduction subspaces; under mild conditions it is itself a dimension reduction subspace.
Point Estimation
Many inferential problems can be identified as being one of three types: estimation, confidence sets, or hypothesis testing.
Point estimation refers to providing a single “best guess” of some quantity of interest.
The quantity of interest could be a parameter in a parametric model, a CDF F, a probability density function f, a regression function m(·), or a prediction for a future value Y of some random variable.
By convention, we denote a point estimate of θ by θ̂n = g(X).
Point Estimation
The bias of an estimator is defined by
bias(θ̂n) = E(θ̂n) − θ.
Unbiasedness used to receive much attention but these days is considered less important; many of the estimators we will use are biased.
A reasonable requirement for an estimator is that it should converge to the true value as we collect more and more data, namely consistency: θ̂n →p θ.
Point Estimation
The distribution of θ̂n is called the sampling distribution. The standard deviation of θ̂n is called the standard error.
The quality of a point estimate is often assessed by the mean squared error, or MSE, defined by
MSE = E (θ̂n − θ)2 = bias2(θ̂n) + var(θ̂n).
The mean absolute error (MAE) of the estimator is defined by E|θ̂n − θ|.
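The decomposition MSE = bias^2 + var can be verified numerically, e.g., for the biased variance estimator that divides by n. An illustrative simulation; sample sizes, seed, and the normal population are arbitrary choices:

```python
import random
import statistics

def var_mle(xs):
    """Variance estimator that divides by n (biased: E = sigma^2*(n-1)/n)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(7)
n, sigma2, reps = 10, 4.0, 50000
ests = [var_mle([rng.gauss(0, 2.0) for _ in range(n)]) for _ in range(reps)]

bias = sum(ests) / reps - sigma2        # theoretical bias: -sigma2/n = -0.4
var = statistics.pvariance(ests)        # variance of the estimator
mse = sum((e - sigma2) ** 2 for e in ests) / reps

# mse equals bias**2 + var up to floating-point error.
print(bias, var, mse)
```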
Confidence set and hypothesis testing
A 1 − α confidence interval (set) for a parameter θ is an interval or set Cn(X), a function of the data X, such that
Pr(θ ∈ Cn(X)) ≥ 1 − α, for all θ ∈ Θ.
In other words, Cn traps θ with probability 1 − α.
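Coverage can be illustrated by simulation: a 95% z-interval for a normal mean with known σ traps µ in about 95% of repeated samples. A sketch; all constants are illustrative:

```python
import math
import random

def z_interval(xs, sigma, z=1.96):
    """95% confidence interval for a normal mean, sigma known."""
    n = len(xs)
    xbar = sum(xs) / n
    half = z * sigma / math.sqrt(n)
    return xbar - half, xbar + half

rng = random.Random(3)
mu, sigma, n, reps = 5.0, 2.0, 25, 20000
hits = 0
for _ in range(reps):
    lo, hi = z_interval([rng.gauss(mu, sigma) for _ in range(n)], sigma)
    hits += lo <= mu <= hi

# Empirical coverage, close to the nominal 0.95.
print(hits / reps)
```

Note that the randomness is in the interval, not in θ: each repetition draws a new interval, and about 95% of them contain the fixed µ.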
Hypothesis testing: we start with some default theory, called a null hypothesis, and ask whether the data provide sufficient evidence to reject it; if not, we retain the null hypothesis.
Statistical functional and substitution principle
Let X1, . . . , Xn ∼ F, and let F denote the set of all distributions. A map T : F → R is called a statistical functional. We write θ = T(F) or θ = T(P).
Examples include the mean µ = ∫ x dF, the variance σ^2 = ∫ (x − µ)^2 dF, and the median F^{−1}(1/2) = inf{x : F(x) ≥ 1/2}.
Substitution principle: a plug-in estimator of θ is θ̂ = T(F̂n), where F̂n is the empirical CDF (ECDF).
A functional of the form ∫ a(x) dF(x) is called a linear functional. The plug-in estimator of a linear functional T(F) is T(F̂n) = n^{−1} Σ_{i=1}^n a(Xi).
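A sketch of the plug-in recipe for a linear functional: integrating a(x) against the ECDF reduces to averaging a over the sample. The choice a(x) = x^2 below, estimating E(X^2) for a standard normal, is purely illustrative:

```python
import random

def plugin_linear(a, xs):
    """Plug-in estimator of the linear functional T(F) = integral a(x) dF(x):
    integrating a against the ECDF is just averaging a over the sample."""
    return sum(a(x) for x in xs) / len(xs)

rng = random.Random(11)
xs = [rng.gauss(0, 1) for _ in range(100000)]

# Estimates the second moment E(X^2), which is 1 for a standard normal.
second_moment = plugin_linear(lambda x: x * x, xs)
print(second_moment)
```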
Statistical functional and substitution principle
Example (variance)
X ∼ F, θ = var(X) = ∫ x^2 dF(x) − {∫ x dF(x)}^2; we have θ̂ = T(F̂n) = n^{−1} Σ_{i=1}^n (Xi − X̄)^2.
Example (quantile)
Let T(F) = F^{−1}(p) be the pth quantile. Its estimate is F̂n^{−1}(p) = inf{x : F̂n(x) ≥ p}, which is called the pth sample quantile. T(F) is not a linear functional, but we shall later see that F̂n^{−1}(p) has an approximately linear representation.
Statistical functional and substitution principle
Example (Estimating equation)
T(F) may not always have an explicit expression. Let X ∼ F and suppose θ satisfies ∫ g(x, θ) dF(x) = 0, i.e., E{g(X, θ)} = 0. By the substitution principle, the estimator θ̂ satisfies n^{−1} Σ_{i=1}^n g(Xi, θ̂) = 0.
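A sketch of solving the sample estimating equation numerically; bisection suffices when the sample estimating function is monotone in θ. The choice g(x, θ) = x − θ, whose solution is the sample mean, is illustrative:

```python
import random

def solve_estimating_eq(g, xs, lo, hi, tol=1e-8):
    """Solve (1/n) * sum_i g(x_i, theta) = 0 for theta by bisection,
    assuming the sample estimating function is decreasing in theta
    with a sign change on [lo, hi]."""
    def gn(theta):
        return sum(g(x, theta) for x in xs) / len(xs)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gn(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

rng = random.Random(5)
xs = [rng.gauss(3.0, 1.0) for _ in range(5000)]

# g(x, theta) = x - theta  =>  the solution is exactly the sample mean.
theta_hat = solve_estimating_eq(lambda x, t: x - t, xs, -10, 10)
print(theta_hat)
```

Other choices of g give other estimators; e.g., g(x, θ) = sign(x − θ) yields the sample median.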
Note: in some cases the substitution principle does not yield a good estimator; for example, density estimation, since the ECDF is not continuous.
Likelihood function
Let X have joint density p(xn; θ), where θ ∈ Θ.
The likelihood function L : Θ → [0, ∞) is defined by L(θ) ≡ L(θ; x^n) = p(x^n; θ), where x^n is fixed and θ varies in Θ.
The log-likelihood function is `(θ) = log L(θ).
The name likelihood implies that, given x^n, the value θ is more likely to be the true parameter than θ′ if L(θ; x^n) > L(θ′; x^n).
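For example, for t successes in n i.i.d. Bernoulli(θ) trials, L(θ) = θ^t (1 − θ)^{n−t}; a grid search over the log-likelihood shows that the most likely value is t/n (illustrative sketch):

```python
from math import log

def log_lik(theta, t, n):
    """Bernoulli log-likelihood l(theta): t successes in n trials."""
    return t * log(theta) + (n - t) * log(1 - theta)

t, n = 7, 10
grid = [i / 1000 for i in range(1, 1000)]        # theta in (0, 1)
theta_mle = max(grid, key=lambda th: log_lik(th, t, n))

# The log-likelihood is concave with maximizer t/n = 0.7.
print(theta_mle)
```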
Likelihood function
The likelihood function is a function of θ; it is not a probability density function.
The likelihood is only defined up to a constant of proportionality. In other words, it is an equivalence class of functions.
The likelihood function is used (i) to generate estimators and (ii)as a key ingredient in Bayesian inference.
Likelihood principle
Definition
(Strong) Likelihood Principle. In the inference about θ, after x^n is observed, all relevant experimental information is contained in the likelihood function for the observed x^n. Furthermore, two likelihood functions contain the same information about θ if they are proportional to each other.
The likelihood principle concerns the foundations of statistical inference and is often invoked in arguments about correct statistical reasoning.
Maximum-likelihood estimation satisfies the likelihood principle.
Likelihood principle
Definition
(Weak) Likelihood Principle. Any set of observations from a given model pθ(x) with the same likelihood should lead to the same conclusion.
This principle is equivalent to the sufficiency principle: all sufficient statistics based on data x^n for a given model pθ(x) should lead to the same conclusion, because all sufficient statistics lead to the same likelihood function.
Note the crucial difference: the weak likelihood principle states that different outcomes from the same experiment having the same likelihood carry the same evidence. The strong likelihood principle allows the outcomes to come from different experiments with different sampling schemes. So, according to the strong likelihood principle, evidence about θ does not depend on the sampling scheme.
Some other concepts: profile likelihood
For multiparameter models, the multidimensional likelihood function can be difficult to describe or to communicate. A method is needed to “concentrate” the likelihood on a single parameter by eliminating the nuisance parameter. The likelihood approach to eliminating a nuisance parameter is to replace it by its MLE at each fixed value of the parameter of interest. The resulting likelihood is called the profile likelihood.
Definition
Given the joint likelihood L(θ, η), the profile likelihood of θ is
L(θ) = max_η L(θ, η),
where the maximization is performed at each fixed value of θ.
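A sketch for normal data: profiling out σ² gives σ̂²(µ) = n^{−1} Σ (Xi − µ)², and the resulting profile log-likelihood of µ is maximized at the sample mean. Constants, seed, and grid below are illustrative:

```python
import math
import random

def profile_loglik_mu(mu, xs):
    """Profile log-likelihood of mu for N(mu, sigma^2) data:
    sigma^2 is replaced by its MLE at each fixed mu."""
    n = len(xs)
    s2_hat = sum((x - mu) ** 2 for x in xs) / n
    return -0.5 * n * (math.log(2 * math.pi * s2_hat) + 1)

rng = random.Random(9)
xs = [rng.gauss(2.0, 1.5) for _ in range(200)]

grid = [1.0 + i / 1000 for i in range(2001)]     # mu in [1, 3]
mu_hat = max(grid, key=lambda m: profile_loglik_mu(m, xs))

# Agrees with the sample mean up to the grid spacing.
print(mu_hat, sum(xs) / len(xs))
```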
Some other concepts: estimated likelihood
Definition (Estimated (pseudo) likelihood)
Suppose the total parameter space is (θ, η), where θ is the parameter of interest. Let η̂ be an estimate of η; it can be any reasonable estimate, and in particular it does not have to be an MLE. The estimated likelihood of θ is L(θ, η̂).
Not to be confused with the profile likelihood L(θ, η̂θ): the estimate η̂ here is obtained independently of the parameter of interest θ.
Some other concepts: composite likelihood
Composite Likelihood. Consider an m-dimensional random vector Y, with probability density function f(y; θ) for some unknown p-dimensional parameter vector θ ∈ Θ.
Composite conditional likelihood: ∏_{s=1}^m f_{s|(−s)}(y_s | y_{(−s)}; θ).
Composite pairwise likelihood: ∏_{r=1}^{m−1} ∏_{s=r+1}^m f_2(y_r, y_s; θ).
Motivations: easier to compute; access to multivariate distributions; more robust in the sense that only marginal (mean/variance) and association (covariance) parameters are modeled.
Some other concepts: quasi likelihood
Given data y1, . . . , yn, the estimating equation approach specifies that the estimate θ̂ is the solution of Σ_i ψ(y_i, θ) = 0, where the estimating function ψ(·, θ) is a known function.
One of the most important estimating equations is associated with GLM: given observation y_i, we assume
E(y_i) = µ_i(β), var(y_i) = φ v_i(β)
for known functions µ_i(·) and v_i(·) of an unknown regression parameter β. The unknown parameter φ is a dispersion parameter.
Some other concepts: quasi likelihood
The estimate of β is the solution of
Σ_{i=1}^n (∂µ_i/∂β) v_i^{−1} (y_i − µ_i) = 0.
The equation applies immediately to multivariate outcomes, in which case it is called a generalized estimating equation (GEE). The objective function associated with this estimating equation is called the quasi-likelihood.
In contrast with a full likelihood, we are not specifying any probability structure, but only the mean and variance functions. By only specifying the mean and the variance we are letting the shape of the distribution remain totally free. This is a useful strategy for dealing with non-Gaussian multivariate data.
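A one-parameter sketch: with µ_i(β) = exp(βx_i) and v_i = µ_i (a Poisson-type mean-variance relation), the estimating equation reduces to Σ x_i (y_i − exp(βx_i)) = 0, which Newton's method solves. Everything below (data, seed, sampler) is illustrative; the data happen to be Poisson, but only the mean and variance are used by the estimator:

```python
import math
import random

def quasi_score_solve(xs, ys, beta=0.0, iters=50):
    """Newton's method for sum_i x_i * (y_i - exp(beta * x_i)) = 0,
    the quasi-score when mu_i = exp(beta * x_i) and v_i = mu_i."""
    for _ in range(iters):
        mus = [math.exp(beta * x) for x in xs]
        score = sum(x * (y - m) for x, y, m in zip(xs, ys, mus))
        info = sum(x * x * m for x, m in zip(xs, mus))  # = -d(score)/d(beta)
        beta += score / info
    return beta

def rpois(lam, rng):
    """Simple Poisson sampler (Knuth's method), adequate for small lam."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(13)
true_beta = 0.5
xs = [rng.uniform(0, 2) for _ in range(2000)]
ys = [rpois(math.exp(true_beta * x), rng) for x in xs]

beta_hat = quasi_score_solve(xs, ys)
print(beta_hat)  # close to true_beta = 0.5
```

The same estimate would be obtained from any response distribution with this mean and variance structure, which is the point of the quasi-likelihood approach.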