
Vladimir Spokoiny, Thorsten Dickhaus

Basics of Modern Mathematical Statistics*

– Textbook –

April 26, 2014

Springer

Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo

* Many people made useful and extremely helpful suggestions which allowed us to improve the composition and presentation and to clean up the mistakes and typos. The authors especially thank Wolfgang Härdle and Vladimir Panov, who assisted us on the long way of writing the book.


To my children Seva, Irina, Daniel, Alexander, Michael, and Maria

To my family


Preface

Preface of the first author

This book was written on the basis of a graduate course on mathematical statistics given at the mathematical faculty of the Humboldt University Berlin.

The classical theory of parametric estimation, since the seminal works by Fisher, Wald and Le Cam, among many others, has now reached maturity and an elegant form. It can be considered more or less complete, at least for the so-called “regular case”. The question of the optimality and efficiency of the classical methods has been rigorously studied, and typical results state the asymptotic normality and efficiency of the maximum likelihood and/or Bayes estimates; see the excellent monograph by Ibragimov and Khas’minskij (1981) for a comprehensive study.

Around 1984, when I started my own PhD at the Lomonosov University, a popular joke in our statistical community in Moscow was that all the problems of parametric statistical theory had been solved and described in a complete way in Ibragimov and Khas’minskij (1981); there was nothing left to do for mathematical statisticians. If at all, only a few nonparametric problems remained open. After finishing my PhD I also moved to nonparametric statistics for a while, with a focus on local adaptive estimation. In 2005 I started to write a monograph on nonparametric estimation using local parametric methods which was supposed to systematize my previous experience in this area. The very first draft of this book was already available in the autumn of 2005, and it only included a few sections about the basics of parametric estimation. However, attempts to prepare a more systematic and more general presentation of the nonparametric theory led me back to the very basic parametric concepts. In 2007 I significantly extended the part about parametric methods. In the spring of 2009 I taught a graduate course on parametric statistics at the mathematical faculty of the Humboldt University Berlin. My intention was to present a “modern” version of the classical theory which in particular addresses the following questions:


“What do you need to know from parametric statistics to work on modern parametric and nonparametric methods?”
“How to identify the borderline between the classical parametric and the modern nonparametric statistics?”

The basic assumptions of the classical parametric theory are that the parametric specification is exact and the sample size is large relative to the dimension of the parameter space. Unfortunately, this viewpoint limits the applicability of the classical theory: it is usually unrealistic to assume that the parametric specification is fulfilled exactly. So, the modern version of the parametric theory has to include a possible model misspecification. The issue of large samples is even more critical. Many modern applications face a situation where the number of parameters p is not only comparable with the sample size n, it can even be much larger than n. It is probably the main challenge of the modern parametric theory to include in a rigorous way the case of “large p, small n”. One can say that a parametric theory that is able to systematically treat the issues of model misspecification and of small or fixed samples already includes nonparametric statistics. The present study aims at reconsidering the basics of the parametric theory in this sense. The “modern parametric” view can be stated as follows:

- any model is parametric;
- any parametric model is wrong;
- even a wrong model can be useful.

The model mentioned in the first item can be understood as a set of assumptions describing the unknown distribution of the underlying data. This description is usually given in terms of some parameters. The parameter space can be large or infinite dimensional; however, the model is uniquely specified by the parameter value. In this sense “any model is parametric”.

The second statement “any parametric model is wrong” means that any imaginary model is only an idealization (approximation) of reality. It is unrealistic to assume that the data exactly follow the parametric model, even if this model is flexible and involves a lot of parameters. Model misspecification naturally leads to the notion of the modeling bias measuring the distance between the underlying model and the selected parametric family. It also separates the parametric and nonparametric viewpoints. The parametric approach focuses on “estimation within the model”, ignoring the modeling bias. The nonparametric approach attempts to account for the modeling bias and to optimize the joint impact of two kinds of errors: the estimation error within the model and the modeling bias. This volume is limited to parametric estimation and testing for some special models like exponential families or linear models. However, it prepares some important tools for doing the general parametric theory presented in the second volume.

The last statement “even a wrong model can be useful” introduces the notion of a “useful” parametric specification. In some sense it indicates a change of paradigm in parametric statistics. Trying to find the true model is hopeless anyway. Instead, one aims at taking a potentially wrong parametric model which, however, possesses some useful properties. Among others, one can single out the following “useful” features:

- a nice geometric structure of the likelihood leading to a numerically efficient estimation procedure;
- parameter identifiability.

Lack of identifiability in the considered model is just an indication that the considered parametric model is poorly selected. A proper parametrization should involve a reasonable regularization ensuring both features: numerical efficiency/stability and proper parameter identification. The present volume presents some examples of “useful” models like linear models or exponential families. The second volume will extend such models to a quite general regular case involving some smoothness and moment conditions on the log-likelihood process of the considered parametric family.

This book does not pretend to systematically cover the scope of the classical parametric theory. Some very important and even fundamental issues are not considered at all in this book. One characteristic example is given by the notion of sufficiency, which can hardly be combined with model misspecification. At the same time, much more attention is paid to the questions of nonasymptotic inference under model misspecification, including concentration and confidence sets depending on the sample size and the dimensionality of the parameter space. In the first volume we especially focus on linear models. This can be explained by their role for the general theory, in which a linear model naturally arises from the local approximation of a general regular model.

This volume can be used as a textbook for a graduate course in mathematical statistics. It assumes that the reader is familiar with the basic notions of probability theory, including the Lebesgue measure, the Radon-Nikodym derivative, etc. Knowledge of basic statistics is not required. I tried to be as self-contained as possible; most of the presented results are proved in a rigorous way. Sometimes the details are left to the reader as exercises; in those cases some hints are given.

Preface of the second author

It was in early 2012 when Prof. Spokoiny approached me with the idea of a joint lecture on Mathematical Statistics at Humboldt-University Berlin, where I was a junior professor at that time. Up to then, my own education in statistical inference had been based on the German textbooks by Witting (1985) and Witting and Müller-Funk (1995), and for teaching in English I had always used the books by Lehmann and Casella (1998), Lehmann and Romano (2005), and Lehmann (1999). However, I was aware of Prof. Spokoiny's own textbook project and so the question was which text to use as the basis for the lecture. Finally, the first part of the lecture (estimation theory) was given by Prof. Spokoiny based on the by then already substantiated Chapters 1-5 of the present work, while I gave the second part on test theory based on my own teaching material, which was mainly based on Lehmann and Romano (2005).

This joint teaching activity turned out to be the starting point of a collaboration between Prof. Spokoiny and myself, and I was invited to join him as a co-author of the present work for Chapters 6-8 on test theory, matching my own research interests. By the summer term of 2013, the book manuscript had been substantially extended, and I used it as the sole basis for the Mathematical Statistics lecture. During the course of this 2013 lecture, I received many constructive comments and suggestions from students and teaching assistants, which led to a further improvement of the text.


Contents

1 Basic notions
   1.1 Example of a Bernoulli experiment
   1.2 Least squares estimation in a linear model
   1.3 General parametric model
   1.4 Statistical decision problem
   1.5 Efficiency

2 Parameter estimation for an i.i.d. model
   2.1 Empirical distribution. Glivenko-Cantelli Theorem
   2.2 Substitution principle. Method of moments
      2.2.1 Method of moments. Univariate parameter
      2.2.2 Method of moments. Multivariate parameter
      2.2.3 Method of moments. Examples
   2.3 Unbiased estimates, bias, and quadratic risk
      2.3.1 Univariate parameter
      2.3.2 Multivariate case
   2.4 Asymptotic properties
      2.4.1 Root-n normality. Univariate parameter
      2.4.2 Root-n normality. Multivariate parameter
   2.5 Some geometric properties of a parametric family
      2.5.1 Kullback-Leibler divergence
      2.5.2 Hellinger distance
      2.5.3 Regularity and the Fisher Information. Univariate parameter
   2.6 Cramér–Rao Inequality
      2.6.1 Univariate parameter
      2.6.2 Exponential families and R-efficiency
   2.7 Cramér–Rao inequality. Multivariate parameter
      2.7.1 Regularity and Fisher Information. Multivariate parameter
      2.7.2 Local properties of the Kullback-Leibler divergence and Hellinger distance
      2.7.3 Multivariate Cramér–Rao Inequality
      2.7.4 Exponential families and R-efficiency
   2.8 Maximum likelihood and other estimation methods
      2.8.1 Minimum distance estimation
      2.8.2 M-estimation and Maximum likelihood estimation
   2.9 Maximum Likelihood for some parametric families
      2.9.1 Gaussian shift
      2.9.2 Variance estimation for the normal law
      2.9.3 Univariate normal distribution
      2.9.4 Uniform distribution on [0, θ]
      2.9.5 Bernoulli or binomial model
      2.9.6 Multinomial model
      2.9.7 Exponential model
      2.9.8 Poisson model
      2.9.9 Shift of a Laplace (double exponential) law
   2.10 Quasi Maximum Likelihood approach
      2.10.1 LSE as quasi likelihood estimation
      2.10.2 LAD and robust estimation as quasi likelihood estimation
   2.11 Univariate exponential families
      2.11.1 Natural parametrization
      2.11.2 Canonical parametrization
      2.11.3 Deviation probabilities for the maximum likelihood
   2.12 Historical remarks and further reading

3 Regression Estimation
   3.1 Regression model
      3.1.1 Observations
      3.1.2 Design
      3.1.3 Errors
      3.1.4 Regression function
   3.2 Method of substitution and M-estimation
      3.2.1 Mean regression. Least squares estimate
      3.2.2 Median regression. Least absolute deviation estimate
      3.2.3 Maximum likelihood regression estimation
      3.2.4 Quasi Maximum Likelihood approach
   3.3 Linear regression
      3.3.1 Projection estimation
      3.3.2 Polynomial approximation
      3.3.3 Orthogonal polynomials
      3.3.4 Chebyshev polynomials
      3.3.5 Legendre polynomials
      3.3.6 Lagrange polynomials
      3.3.7 Hermite polynomials
      3.3.8 Trigonometric series expansion
   3.4 Piecewise methods and splines
      3.4.1 Piecewise constant estimation
      3.4.2 Piecewise linear univariate estimation
      3.4.3 Piecewise polynomial estimation
      3.4.4 Spline estimation
   3.5 Generalized regression
      3.5.1 Generalized linear models
      3.5.2 Logit regression for binary data
      3.5.3 Parametric Poisson regression
      3.5.4 Piecewise constant methods in generalized regression
      3.5.5 Smoothing splines for generalized regression
   3.6 Historical remarks and further reading

4 Estimation in linear models
   4.1 Modeling assumptions
   4.2 Quasi maximum likelihood estimation
      4.2.1 Estimation under the homogeneous noise assumption
      4.2.2 Linear basis transformation
      4.2.3 Orthogonal and orthonormal design
      4.2.4 Spectral representation
   4.3 Properties of the response estimate f̃
      4.3.1 Decomposition into a deterministic and a stochastic component
      4.3.2 Properties of the operator Π
      4.3.3 Quadratic loss and risk of the response estimation
      4.3.4 Misspecified “colored noise”
   4.4 Properties of the MLE θ̃
      4.4.1 Properties of the stochastic component
      4.4.2 Properties of the deterministic component
      4.4.3 Risk of estimation. R-efficiency
      4.4.4 The case of a misspecified noise
   4.5 Linear models and quadratic log-likelihood
   4.6 Inference based on the maximum likelihood
      4.6.1 A misspecified LPA
      4.6.2 A misspecified noise structure
   4.7 Ridge regression, projection, and shrinkage
      4.7.1 Regularization and ridge regression
      4.7.2 Penalized likelihood. Bias and variance
      4.7.3 Inference for the penalized MLE
      4.7.4 Projection and shrinkage estimates
      4.7.5 Smoothness constraints and roughness penalty approach
   4.8 Shrinkage in a linear inverse problem
      4.8.1 Spectral cut-off and spectral penalization. Diagonal estimates
      4.8.2 Galerkin method
   4.9 Semiparametric estimation
      4.9.1 (θ, η)- and υ-setup
      4.9.2 Orthogonality and product structure
      4.9.3 Partial estimation
      4.9.4 Profile estimation
      4.9.5 Semiparametric efficiency bound
      4.9.6 Inference for the profile likelihood approach
      4.9.7 Plug-in method
      4.9.8 Two step procedure
      4.9.9 Alternating method
   4.10 Historical remarks and further reading

5 Bayes estimation
   5.1 Bayes formula
   5.2 Conjugated priors
   5.3 Linear Gaussian model and Gaussian priors
      5.3.1 Univariate case
      5.3.2 Linear Gaussian model and Gaussian prior
      5.3.3 Homogeneous errors, orthogonal design
   5.4 Non-informative priors
   5.5 Bayes estimate and posterior mean
   5.6 Posterior mean and ridge regression
   5.7 Bayes and minimax risks
   5.8 Van Trees inequality
   5.9 Historical remarks and further reading

6 Testing a statistical hypothesis
   6.1 Testing problem
      6.1.1 Simple hypothesis
      6.1.2 Composite hypothesis
      6.1.3 Statistical tests
      6.1.4 Errors of the first kind, test level
      6.1.5 Randomized tests
      6.1.6 Alternative hypotheses, error of the second kind, power of a test
   6.2 Neyman-Pearson test for two simple hypotheses
      6.2.1 Neyman-Pearson test for an i.i.d. sample
   6.3 Likelihood ratio test
   6.4 Likelihood ratio tests for parameters of a normal distribution
      6.4.1 Distributions related to an i.i.d. sample from a normal distribution
      6.4.2 Gaussian shift model
      6.4.3 Testing the mean when the variance is unknown
      6.4.4 Testing the variance
   6.5 LR-tests. Further examples
      6.5.1 Bernoulli or binomial model
      6.5.2 Uniform distribution on [0, θ]
      6.5.3 Exponential model
      6.5.4 Poisson model
   6.6 Testing problem for a univariate exponential family
      6.6.1 Two-sided alternative
      6.6.2 One-sided alternative
      6.6.3 Interval hypothesis
   6.7 Historical remarks and further reading

7 Testing in linear models
   7.1 Likelihood ratio test for a simple null
      7.1.1 General errors
      7.1.2 I.i.d. errors, known variance
      7.1.3 I.i.d. errors with unknown variance
   7.2 Likelihood ratio test for a subvector
   7.3 Likelihood ratio test for a linear hypothesis
   7.4 Wald test
   7.5 Analysis of variance
      7.5.1 Two-sample test
      7.5.2 Comparing K treatment means
      7.5.3 Randomized blocks
   7.6 Historical remarks and further reading

8 Some other testing methods
   8.1 Method of moments for an i.i.d. sample
      8.1.1 Series expansion
      8.1.2 Testing a parametric hypothesis
   8.2 Minimum distance method for an i.i.d. sample
      8.2.1 Kolmogorov-Smirnov test
      8.2.2 ω² test (Cramér-Smirnov-von Mises)
   8.3 Partially Bayes tests and Bayes testing
      8.3.1 Partial Bayes approach and Bayes tests
      8.3.2 Bayes approach
   8.4 Score, rank and permutation tests
      8.4.1 Score tests
      8.4.2 Rank tests
      8.4.3 Permutation tests
   8.5 Historical remarks and further reading

9 Deviation probability for quadratic forms
   9.1 Introduction
   9.2 Gaussian case
   9.3 A bound for the ℓ2-norm
   9.4 A bound for a quadratic form
   9.5 Rescaling and regularity condition
   9.6 A chi-squared bound with norm-constraints
   9.7 A bound for the ℓ2-norm under Bernstein conditions
   9.8 Proofs

References

Index


1 Basic notions

The starting point of any statistical analysis is data, also called observations or a sample. A statistical model is used to explain the nature of the data. A standard approach assumes that the data are random and utilizes some probabilistic framework. In contrast to probability theory, the distribution of the data is not known precisely and the goal of the analysis is to infer on this unknown distribution.

The parametric approach assumes that the distribution of the data is known up to the value of a parameter θ from some subset Θ of a finite-dimensional space IR^p. In this case the statistical analysis is naturally reduced to the estimation of the parameter θ: as soon as θ is known, we know the whole distribution of the data. Before introducing the general notion of a statistical model, we discuss some popular examples.

1.1 Example of a Bernoulli experiment

Let Y = (Y1, . . . , Yn)⊤ be a sequence of binary digits zero or one. We distinguish between deterministic and random sequences. Deterministic sequences appear e.g. from the binary representation of a real number, or from digitally coded images, etc. Random binary sequences appear e.g. from coin throws, games, etc. In many situations incomplete information can be treated as random data: the classification of healthy and sick patients, individual vote results, the bankruptcy of a firm or credit default, etc.

Basic assumptions behind a Bernoulli experiment are:

• the observed data Yi are independent and identically distributed;
• each Yi assumes the value one with probability θ ∈ [0, 1].

The parameter θ completely identifies the distribution of the data Y. Indeed, for every i ≤ n and y ∈ {0, 1},

    IP(Yi = y) = θ^y (1 − θ)^{1−y},


and the independence of the Yi's implies for every sequence y = (y1, . . . , yn) that

    IP(Y = y) = ∏_{i=1}^{n} θ^{yi} (1 − θ)^{1−yi}.    (1.1)

To indicate this fact, we write IPθ in place of IP. The equation (1.1) can be rewritten as

    IPθ(Y = y) = θ^{sn} (1 − θ)^{n−sn},

where

    sn = ∑_{i=1}^{n} yi.

The value sn is often interpreted as the number of successes in the sequence y.

Probabilistic theory focuses on the probabilistic properties of the data Y under the given measure IPθ. The aim of the statistical analysis is to infer on the measure IPθ for an unknown θ based on the available data Y. Typical examples of statistical problems are:

1. Estimate the parameter θ, i.e. build a function θ̃ of the data Y with values in [0, 1] which approximates the unknown value θ as well as possible;

2. Build a confidence set for θ, i.e. a random (data-based) set (usually an interval) containing θ with a prescribed probability;

3. Test a simple hypothesis that θ coincides with a prescribed value θ0, e.g. θ0 = 1/2;

4. Test a composite hypothesis that θ belongs to a prescribed subset Θ0 of the interval [0, 1].

Usually any statistical method is based on a preliminary probabilistic analysis of the model under the given θ.

Theorem 1.1.1. Let Y be i.i.d. Bernoulli with the parameter θ. Then the mean and the variance of the sum Sn = Y1 + . . . + Yn satisfy

    IEθ Sn = nθ,
    Varθ Sn def= IEθ (Sn − IEθ Sn)² = nθ(1 − θ).

Exercise 1.1.1. Prove this theorem.

This result suggests that the empirical mean θ̃ = Sn/n is a reasonable estimate of θ. Indeed, the result of the theorem implies

    IEθ θ̃ = θ,    IEθ (θ̃ − θ)² = θ(1 − θ)/n.

The first equation means that θ̃ is an unbiased estimate of θ, that is, IEθ θ̃ = θ for all θ. The second equation yields a kind of concentration (consistency) property of θ̃: with n growing, the estimate θ̃ concentrates in a small neighborhood of the point θ. By the Chebyshev inequality

    IPθ(|θ̃ − θ| > δ) ≤ θ(1 − θ)/(nδ²).

This result is refined by the famous de Moivre-Laplace theorem.

Theorem 1.1.2. Let Y be i.i.d. Bernoulli with the parameter θ. Then for every k ≤ n

    IPθ(Sn = k) = \binom{n}{k} θ^k (1 − θ)^{n−k}
                ≈ (2πnθ(1 − θ))^{−1/2} exp{ −(k − nθ)² / (2nθ(1 − θ)) },

where an ≈ bn means an/bn → 1 as n → ∞. Moreover, for any fixed z > 0,

    IPθ( |Sn/n − θ| > z √(θ(1 − θ)/n) ) ≈ (2/√(2π)) ∫_z^∞ e^{−t²/2} dt.

This concentration result yields that the estimate θ̃ deviates from the root-n neighborhood A(z, θ) def= {u : |u − θ| ≤ z √(θ(1 − θ)/n)} with probability of order e^{−z²/2}.

This result bounding the difference |θ̃ − θ| can also be used to build random confidence intervals around the point θ̃. Indeed, by the result of the theorem, the random interval E∗(z) = {u : |θ̃ − u| ≤ z √(θ(1 − θ)/n)} fails to cover the true point θ with approximately the same probability:

    IPθ(E∗(z) ∌ θ) ≈ (2/√(2π)) ∫_z^∞ e^{−t²/2} dt.    (1.2)

Unfortunately, the construction of this interval E∗(z) is not entirely data-based: its width involves the true unknown value θ. A data-based confidence set can be obtained by replacing the population variance σ² def= IEθ(Y1 − θ)² = θ(1 − θ) with its empirical counterpart

    σ̃² def= (1/n) ∑_{i=1}^{n} (Yi − θ̃)².

The resulting confidence set E(z) reads as

    E(z) def= {u : |θ̃ − u| ≤ z √(σ̃²/n)}.

It possesses the same asymptotic properties as E∗(z) including (1.2).

The hypothesis that the value θ is equal to a prescribed value θ0, e.g. θ0 = 1/2, can be checked by examining the difference |θ̃ − 1/2|. If this value is too large compared with σn^{−1/2} or with σ̃n^{−1/2}, then the hypothesis is wrong with high probability. Similarly one can consider a composite hypothesis that θ belongs to some interval [θ1, θ2] ⊂ [0, 1]. If θ̃ deviates from this interval by at least the value zσ̃n^{−1/2} with a large z, then the data significantly contradict this hypothesis.
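
The whole construction is easy to try out numerically. The sketch below (illustrative code, not part of the book; all variable names and the chosen values of n, θ and z are ours) simulates a Bernoulli sample and computes the empirical mean θ̃, the empirical variance σ̃², and the data-based confidence set E(z) for z = 1.96.

```python
# Illustrative sketch (not from the book): the estimate and the data-based
# confidence set E(z) for a simulated Bernoulli sample.
import numpy as np

rng = np.random.default_rng(0)
n, theta_true, z = 1000, 0.3, 1.96        # z roughly corresponds to 95% coverage

Y = rng.binomial(1, theta_true, size=n)   # i.i.d. Bernoulli(theta) observations
theta_hat = Y.mean()                      # empirical mean S_n / n
sigma2_hat = np.mean((Y - theta_hat)**2)  # empirical counterpart of theta(1 - theta)

half_width = z * np.sqrt(sigma2_hat / n)
print(f"estimate {theta_hat:.3f}, E(z) = [{theta_hat - half_width:.3f}, "
      f"{theta_hat + half_width:.3f}]")
```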

1.2 Least squares estimation in a linear model

A linear model assumes a linear systematic dependence of the output (also called response or explained variable) Y on the input (also called regressor or explanatory variable) Ψ, which in general can be multidimensional. The linear model is usually written in the form

    IE(Y) = Ψ⊤θ∗

with an unknown vector of coefficients θ∗ = (θ∗_1, . . . , θ∗_p)⊤. Equivalently one writes

    Y = Ψ⊤θ∗ + ε    (1.3)

where ε stands for the individual error with zero mean: IEε = 0. Such a linear model is often used to describe the influence of the regressor Ψ on the response Y from a collection of data in the form of a sample (Yi, Ψi) for i = 1, . . . , n.

Let θ be a vector of coefficients considered as a candidate for θ∗. Then each observation Yi is approximated by Ψi⊤θ. One often measures the quality of approximation by the sum of quadratic errors |Yi − Ψi⊤θ|². Under the model assumption (1.3), the expected value of this sum is

    IE ∑ |Yi − Ψi⊤θ|² = IE ∑ |Ψi⊤(θ∗ − θ) + εi|² = ∑ |Ψi⊤(θ∗ − θ)|² + ∑ IEεi².

The cross term cancels in view of IEεi = 0. Note that minimizing this expression w.r.t. θ is equivalent to minimizing the first sum because the second sum does not depend on θ. Therefore,

    argmin_θ IE ∑ |Yi − Ψi⊤θ|² = argmin_θ ∑ |Ψi⊤(θ∗ − θ)|² = θ∗.


In words, the true parameter vector θ∗ minimizes the expected quadratic error of fitting the data with linear combinations of the Ψi's. The least squares estimate θ̃ of the parameter vector θ∗ is defined by minimizing its empirical counterpart, that is, the sum of the squared errors |Yi − Ψi⊤θ|² over all i:

    θ̃ def= argmin_θ ∑_{i=1}^{n} |Yi − Ψi⊤θ|².

This equation can be solved explicitly under some condition on the Ψi's. Define the p × n design matrix Ψ = (Ψ1, . . . , Ψn). The aforementioned condition means that this matrix is of rank p.

Theorem 1.2.1. Let Yi = Ψi⊤θ∗ + εi for i = 1, . . . , n, where the εi are independent and satisfy IEεi = 0, IEεi² = σ². Suppose that the matrix Ψ is of rank p. Then

    θ̃ = (ΨΨ⊤)⁻¹ ΨY,

where Y = (Y1, . . . , Yn)⊤. Moreover, θ̃ is unbiased in the sense that

    IEθ∗ θ̃ = θ∗

and its variance satisfies Var(θ̃) = σ²(ΨΨ⊤)⁻¹.
For each vector h ∈ IR^p, the random value ã = ⟨h, θ̃⟩ = h⊤θ̃ is an unbiased estimate of a∗ = h⊤θ∗:

    IEθ∗(ã) = a∗    (1.4)

with the variance

    Var(ã) = σ² h⊤(ΨΨ⊤)⁻¹ h.

Proof. Define

    Q(θ) def= ∑_{i=1}^{n} |Yi − Ψi⊤θ|² = ‖Y − Ψ⊤θ‖²,

where ‖y‖² def= ∑_i yi². The normal equation dQ(θ)/dθ = 0 can be written as ΨΨ⊤θ = ΨY, yielding the representation of θ̃. Now the model equation yields IEθ∗ Y = Ψ⊤θ∗ and thus

    IEθ∗ θ̃ = (ΨΨ⊤)⁻¹ Ψ IEθ∗ Y = (ΨΨ⊤)⁻¹ ΨΨ⊤θ∗ = θ∗

as required.


Exercise 1.2.1. Check that Var(θ̃) = σ²(ΨΨ⊤)⁻¹.

Similarly one obtains IEθ∗(ã) = IEθ∗(h⊤θ̃) = h⊤θ∗ = a∗, that is, ã is an unbiased estimate of a∗. Also

    Var(ã) = Var(h⊤θ̃) = h⊤ Var(θ̃) h = σ² h⊤(ΨΨ⊤)⁻¹ h,

which completes the proof.

The next result states that the proposed estimate ã is in some sense the best possible one. Namely, we consider the class of all linear unbiased estimates a satisfying the identity (1.4). It appears that the variance σ² h⊤(ΨΨ⊤)⁻¹ h of ã is the smallest possible in this class.

Theorem 1.2.2 (Gauss-Markov). Let Yi = Ψi⊤θ∗ + εi for i = 1, . . . , n with uncorrelated εi satisfying IEεi = 0 and IEεi² = σ². Let rank(Ψ) = p. Suppose that the value a∗ def= ⟨h, θ∗⟩ = h⊤θ∗ is to be estimated for a given vector h ∈ IR^p. Then ã = ⟨h, θ̃⟩ = h⊤θ̃ is an unbiased estimate of a∗. Moreover, ã has the minimal possible variance over the class of all linear unbiased estimates of a∗.

This result was historically one of the first optimality results in statistics. It presents a lower efficiency bound for any statistical procedure. Under the imposed restrictions it is impossible to do better than the LSE does. This and more general results will be proved later in Chapter 4.

Define also the vector of residuals

    ε̃ def= Y − Ψ⊤θ̃.

If θ̃ is a good estimate of the vector θ∗, then, due to the model equation, ε̃ is a good estimate of the vector ε of individual errors. Many statistical procedures utilize this observation by checking the quality of estimation via the analysis of the estimated vector ε̃. In the case when this vector still shows a nonzero systematic component, there is evidence that the assumed linear model is incorrect. This vector can also be used to estimate the noise variance σ².

Theorem 1.2.3. Consider the linear model Yi = Ψi⊤θ∗ + εi with independent homogeneous errors εi. Then the variance σ² = IEεi² can be estimated by

    σ̃² = ‖ε̃‖²/(n − p) = ‖Y − Ψ⊤θ̃‖²/(n − p)

and σ̃² is an unbiased estimate of σ², that is, IEθ∗ σ̃² = σ² for all θ∗ and σ.


Theorems 1.2.2 and 1.2.3 can be used to describe the concentration properties of the estimate ã and to build confidence sets based on ã and σ̃, especially if the errors εi are normally distributed.

Theorem 1.2.4. Let Yi = Ψi⊤θ∗ + εi for i = 1, . . . , n with εi ∼ N(0, σ²). Let rank(Ψ) = p. Then it holds for the estimate ã = h⊤θ̃ of a∗ = h⊤θ∗

    ã − a∗ ∼ N(0, s²)    with    s² = σ² h⊤(ΨΨ⊤)⁻¹ h.

Corollary 1.2.1 (Concentration). If for some α > 0, zα is the (1 − α/2)-quantile of the standard normal law (i.e. Φ(zα) = 1 − α/2), then

    IPθ∗(|ã − a∗| > zα s) = α.

Exercise 1.2.2. Check Corollary 1.2.1.

The next result describes the confidence set for a∗. The unknown variance s² is replaced by its estimate

    s̃² def= σ̃² h⊤(ΨΨ⊤)⁻¹ h.

Corollary 1.2.2 (Confidence set). If E(zα) def= {a : |ã − a| ≤ s̃ zα}, then

    IPθ∗(E(zα) ∌ a∗) ≈ α.
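
To make the formulas concrete, here is a small numerical sketch (illustrative code, not from the book; the design, the true θ∗, the vector h and all names are invented for the example). It computes the LSE θ̃ from the normal equations ΨΨ⊤θ = ΨY, the unbiased variance estimate σ̃² of Theorem 1.2.3, and the confidence interval of Corollary 1.2.2 for a∗ = h⊤θ∗.

```python
# Illustrative sketch (not from the book): least squares estimation in the
# linear model Y = Psi^T theta* + eps with the p x n design matrix Psi.
import numpy as np

rng = np.random.default_rng(1)
p, n, sigma = 3, 200, 0.5
theta_star = np.array([1.0, -2.0, 0.5])

Psi = rng.normal(size=(p, n))                        # p x n design matrix of rank p
Y = Psi.T @ theta_star + sigma * rng.normal(size=n)  # Y_i = Psi_i^T theta* + eps_i

theta_tilde = np.linalg.solve(Psi @ Psi.T, Psi @ Y)  # normal equations Psi Psi^T theta = Psi Y

resid = Y - Psi.T @ theta_tilde
sigma2_tilde = resid @ resid / (n - p)               # unbiased estimate of sigma^2

h = np.array([1.0, 0.0, 0.0])                        # target a* = h^T theta* = first coefficient
a_tilde = h @ theta_tilde
s_tilde = np.sqrt(sigma2_tilde * (h @ np.linalg.solve(Psi @ Psi.T, h)))
z_alpha = 1.96                                       # (1 - alpha/2)-quantile for alpha = 0.05
print("theta_tilde =", np.round(theta_tilde, 3))
print("CI for a*: [%.3f, %.3f]" % (a_tilde - z_alpha * s_tilde, a_tilde + z_alpha * s_tilde))
```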

1.3 General parametric model

Let Y denote the observed data with values in the observation space Y. In most cases, Y ∈ IR^n, that is, Y = (Y1, . . . , Yn)⊤. Here n denotes the sample size (number of observations). The basic assumption about these data is that the vector Y is a random variable on a probability space (Y, B(Y), IP), where B(Y) is the Borel σ-algebra on Y. The probabilistic approach assumes that the probability measure IP is known and studies the distributional (population) properties of the vector Y. On the contrary, the statistical approach assumes that the data Y are given and tries to recover the distribution IP on the basis of the available data Y. One can say that the statistical problem is inverse to the probabilistic one.

The statistical analysis is usually based on the notion of a statistical experiment. This notion assumes that a family P of probability measures on (Y, B(Y)) is fixed and the unknown underlying measure IP belongs to this family. Often this family is parameterized by the value θ from some parameter set Θ: P = (IPθ, θ ∈ Θ). The corresponding statistical experiment can be written as

    (Y, B(Y), (IPθ, θ ∈ Θ)).

The value θ∗ denotes the “true” parameter value, that is, IP = IPθ∗. The statistical experiment is dominated if there exists a dominating σ-finite measure µ0 such that all the IPθ are absolutely continuous w.r.t. µ0. In what follows we assume without further mention that the considered statistical models are dominated. Usually the choice of a dominating measure is unimportant and any one can be used.

The parametric approach assumes that Θ is a subset of a finite-dimensional Euclidean space IR^p. In this case, the unknown data distribution is specified by the value of a finite-dimensional parameter θ from Θ ⊆ IR^p. Since in this case the parameter θ completely identifies the distribution of the observations Y, the statistical estimation problem is reduced to recovering (estimating) this parameter from the data. The nice feature of the parametric theory is that the estimation problem can be solved in a rather general way.

1.4 Statistical decision problem. Loss and Risk

The statistical decision problem is usually formulated in terms of game theory, the statistician playing as it were against nature. Let D denote the decision space, which is assumed to be a topological space. Next, let ℘(·, ·) be a loss function given on the product D × Θ. The value ℘(d, θ) denotes the loss associated with the decision d ∈ D when the true parameter value is θ ∈ Θ. The statistical decision problem is composed of a statistical experiment (Y, B(Y), P), a decision space D, and a loss function ℘(·, ·).

A statistical decision ρ = ρ(Y) is a measurable function of the observed data Y with values in the decision space D. Clearly, ρ(Y) can be considered as a random D-valued element on the space (Y, B(Y)). The corresponding loss under the true model (Y, B(Y), IPθ∗) reads as ℘(ρ(Y), θ∗). Finally, the risk is defined as the expected value of the loss:

    R(ρ, θ∗) def= IEθ∗ ℘(ρ(Y), θ∗).

Below we present a list of typical statistical decision problems.

Example 1.4.1 (Point estimation problem). Let the target of analysis be the true parameter θ∗ itself, that is, let D coincide with Θ. Let ℘(·, ·) be a kind of distance on Θ, that is, ℘(θ, θ∗) denotes the loss of estimation when the selected value is θ while the true parameter is θ∗. Typical examples of the loss function are the quadratic loss ℘(θ, θ∗) = ‖θ − θ∗‖², the l1-loss ℘(θ, θ∗) = ‖θ − θ∗‖_1, or the sup-loss ℘(θ, θ∗) = ‖θ − θ∗‖∞ = max_{j=1,...,p} |θj − θ∗_j|.

If θ̃ is an estimate of θ∗, that is, θ̃ is a Θ-valued function of the data Y, then the corresponding risk is

    R(θ̃, θ∗) def= IEθ∗ ℘(θ̃, θ∗).

In particular, the quadratic risk reads as IEθ∗ ‖θ̃ − θ∗‖².
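
As a quick illustration of the notion of risk (a sketch under invented settings, not from the book), the quadratic risk of the empirical mean θ̃ = Sn/n in the Bernoulli model of Section 1.1 can be approximated by Monte Carlo and compared with the exact value θ(1 − θ)/n.

```python
# Illustrative sketch (not from the book): Monte Carlo approximation of the
# quadratic risk E_theta |theta_tilde - theta*|^2 of the empirical mean.
import numpy as np

rng = np.random.default_rng(2)
n, theta_star, n_mc = 100, 0.3, 20000

samples = rng.binomial(1, theta_star, size=(n_mc, n))
theta_tilde = samples.mean(axis=1)                   # one estimate per replication
risk_mc = np.mean((theta_tilde - theta_star) ** 2)

print(f"Monte Carlo risk {risk_mc:.5f}  vs  theta(1-theta)/n = "
      f"{theta_star * (1 - theta_star) / n:.5f}")
```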

Example 1.4.2 (Testing problem). Let Θ0 and Θ1 be two complementary subsets of Θ, that is, Θ0 ∩ Θ1 = ∅, Θ0 ∪ Θ1 = Θ. Our target is to check whether the true parameter θ∗ belongs to the subset Θ0. The decision space consists of two points {0, 1}, for which d = 0 means the acceptance of the hypothesis H0: θ∗ ∈ Θ0 while d = 1 rejects H0 in favor of the alternative H1: θ∗ ∈ Θ1. Define the loss

    ℘(d, θ) = 1(d = 1, θ ∈ Θ0) + 1(d = 0, θ ∈ Θ1).

A test φ is a binary-valued function of the data, φ = φ(Y) ∈ {0, 1}. The corresponding risk R(φ, θ∗) = IEθ∗ ℘(φ(Y), θ∗) can be interpreted as the probability of selecting the wrong subset.

Example 1.4.3 (Confidence estimation). Let the target of analysis again be the parameter θ∗. However, we aim to identify a subset A of Θ, as small as possible, that covers the true value θ∗ with a prescribed probability. Our decision space D is now the set of all measurable subsets of Θ. For any A ∈ D, the loss function is defined as ℘(A, θ∗) = 1(A ∌ θ∗). A confidence set is a random set E selected from the data Y, E = E(Y). The corresponding risk R(E, θ∗) = IEθ∗ ℘(E, θ∗) is just the probability that E does not cover θ∗.

Example 1.4.4 (Estimation of a functional). Let the target of estimation be a given function f(θ∗) of the parameter θ∗ with values in another space F. A typical example is given by a single component of the vector θ∗. An estimate ρ of f(θ∗) is a function of the data Y into F: ρ = ρ(Y) ∈ F. The loss function ℘ is defined on the product F × F, yielding the loss ℘(ρ(Y), f(θ∗)) and the risk R(ρ, f(θ∗)) = IEθ∗ ℘(ρ(Y), f(θ∗)).

Exercise 1.4.1. Define the statistical decision problem for testing a simple hypothesis θ∗ = θ0 for a given point θ0.

1.5 Efficiency

After the statistical decision problem is stated, one can ask for its optimal solution. Equivalently one can say that the aim of statistical analysis is to build a decision with the minimal possible risk. However, a comparison of any two decisions on the basis of risk can be a nontrivial problem. Indeed, the risk R(ρ, θ∗) of a decision ρ depends on the true parameter value θ∗. It may happen that one decision performs better for some points θ∗ ∈ Θ but worse at other points θ∗. An extreme example of such an estimate is the trivial deterministic decision θ̃ ≡ θ0, which sets the estimate equal to the value θ0 whatever the data are. This is, of course, a very strange and poor estimate, but it clearly outperforms all other methods if the true parameter θ∗ is indeed θ0.

Two approaches are typically used to compare different statistical decisions: the minimax approach considers the maximum R(ρ) of the risks R(ρ, θ) over the parameter set Θ, while the Bayes approach is based on the weighted sum (integral) Rπ(ρ) of such risks with respect to some measure π on the parameter set Θ which is called the prior distribution:

    R(ρ) = sup_{θ∈Θ} R(ρ, θ),
    Rπ(ρ) = ∫ R(ρ, θ) π(dθ).

The decision ρ∗ is called minimax if

    R(ρ∗) = inf_ρ R(ρ) = inf_ρ sup_{θ∈Θ} R(ρ, θ),

where the infimum is taken over the set of all possible decisions ρ. The value R∗ = R(ρ∗) is called the minimax risk.

Similarly, the decision ρπ is called Bayes for the prior π if

    Rπ(ρπ) = inf_ρ Rπ(ρ).

The corresponding value Rπ(ρπ) is called the Bayes risk.

Exercise 1.5.1. Show that the minimax risk is greater than or equal to the Bayes risk whatever the prior measure π is.
Hint: show that for any decision ρ, it holds R(ρ) ≥ Rπ(ρ).

Usually the problem of finding a minimax or Bayes estimate is quite hard and a closed-form solution is available only in very few special cases. A standard way out of this problem is to switch to an asymptotic set-up in which the sample size grows to infinity.


2 Parameter estimation for an i.i.d. model

This chapter is very important for understanding the whole book. It starts with very classical material: Glivenko-Cantelli results for the empirical measure that motivate the famous substitution principle. Then the method of moments is studied in more detail, including the risk analysis and asymptotic properties. Some other classical estimation procedures are briefly discussed, including the method of minimum distance, M-estimation and its special cases: least squares, least absolute deviations and maximum likelihood estimates. The concept of efficiency is discussed in the context of the Cramér–Rao risk bound, which is given in the univariate and multivariate case. The last sections of Chapter 2 start a kind of smooth transition from classical to “modern” parametric statistics and they reveal the approach of the book. The presentation is focused on the (quasi) likelihood-based concentration and confidence sets. The basic concentration result is first introduced for the simplest Gaussian shift model and then extended to the case of a univariate exponential family in Section 2.11.

Below in this chapter we consider the estimation problem for a sample of independent identically distributed (i.i.d.) observations. Throughout the chapter the data Y are assumed to be given in the form of a sample (Y1, . . . , Yn)⊤. We assume that the observations Y1, . . . , Yn are independent identically distributed; each Yi is from an unknown distribution P, also called a marginal measure. The joint data distribution IP is the n-fold product of P: IP = P⊗n. Thus, the measure IP is uniquely identified by P and the statistical problem can be reduced to recovering P.

The further step in model specification is based on a parametric assumption (PA): the measure P belongs to a given parametric family.

2.1 Empirical distribution. Glivenko-Cantelli Theorem

Let Y = (Y1, . . . , Yn)⊤ be an i.i.d. sample. For simplicity we assume that the Yi's are univariate with values in IR. Let P denote the distribution of each Yi:

    P(B) = IP(Yi ∈ B),    B ∈ B(IR).

One often says that Y is an i.i.d. sample from P. Let also F be the corresponding distribution function (cdf):

    F(y) = IP(Y1 ≤ y) = P((−∞, y]).

The assumption that the Yi's are i.i.d. implies that the joint distribution IP of the data Y is given by the n-fold product of the marginal measure P:

    IP = P⊗n.

Let also Pn (resp. Fn) be the empirical measure (resp. empirical distribution function (edf))

    Pn(B) = (1/n) ∑ 1(Yi ∈ B),    Fn(y) = (1/n) ∑ 1(Yi ≤ y).

Here and everywhere in this chapter the symbol ∑ stands for ∑_{i=1}^{n}. One can consider Fn as the distribution function of the empirical measure Pn defined as the atomic measure at the Yi's:

    Pn(A) def= (1/n) ∑_{i=1}^{n} 1(Yi ∈ A).

So, Pn(A) is the empirical frequency of the event A, that is, the fraction of observations Yi belonging to A. By the law of large numbers one can expect that this empirical frequency is close to the true probability P(A) if the number of observations is sufficiently large.

An equivalent definition of the empirical measure and empirical distribution function can be given in terms of the empirical mean IEn g for a measurable function g:

    IEn g def= ∫_{−∞}^{∞} g(y) Pn(dy) = ∫_{−∞}^{∞} g(y) dFn(y) = (1/n) ∑_{i=1}^{n} g(Yi).

The first result claims that indeed, for every Borel set B on the real line, the empirical mass Pn(B) (which is random) is close in probability to the population counterpart P(B).

Theorem 2.1.1. For any Borel set B, it holds

1. IE Pn(B) = P(B);
2. Var{Pn(B)} = n⁻¹ σ_B² with σ_B² = P(B){1 − P(B)};
3. Pn(B) → P(B) in probability as n → ∞;
4. √n {Pn(B) − P(B)} →^w N(0, σ_B²).


Proof. Denote ξi = 1(Yi ∈ B). This is a Bernoulli r.v. with parameter P(B) = IEξi. The first statement holds by definition of Pn(B) = n⁻¹ ∑_i ξi. Next, for each i ≤ n,

    Var ξi def= IEξi² − (IEξi)² = P(B){1 − P(B)}

in view of ξi² = ξi. Independence of the ξi's yields

    Var{Pn(B)} = Var( n⁻¹ ∑_{i=1}^{n} ξi ) = n⁻² ∑_{i=1}^{n} Var ξi = n⁻¹ σ_B².

The third statement follows by the law of large numbers for the i.i.d. r.v.'s ξi:

    (1/n) ∑_{i=1}^{n} ξi →^{IP} IEξ1.

Finally, the last statement follows by the Central Limit Theorem for the ξi:

    (1/√n) ∑_{i=1}^{n} (ξi − IEξi) →^w N(0, σ_B²).

The next important result shows that the edf Fn is a good approximation of the cdf F in the uniform norm.

Theorem 2.1.2 (Glivenko-Cantelli). It holds

    sup_y |Fn(y) − F(y)| → 0,    n → ∞.

Proof. Consider first the case when the function F is continuous in y. Fix any integer N and define with ε = 1/N the points t1 < t2 < . . . < tN = +∞ such that F(tj) − F(tj−1) = ε for j = 2, . . . , N. For every j, by (3) of Theorem 2.1.1, it holds Fn(tj) → F(tj). This implies that for some n(N), it holds for all n ≥ n(N) with probability at least 1 − ε

    |Fn(tj) − F(tj)| ≤ ε,    j = 1, . . . , N.    (2.1)

Now for every t ∈ [tj−1, tj], it holds by definition

    F(tj−1) ≤ F(t) ≤ F(tj),    Fn(tj−1) ≤ Fn(t) ≤ Fn(tj).

This together with (2.1) implies

    IP( |Fn(t) − F(t)| > 2ε ) ≤ ε.


If the function $F(\cdot)$ is not continuous, then for every positive $\varepsilon$, there exists a finite set $S_\varepsilon$ of points of discontinuity $s_m$ with $F(s_m) - F(s_m - 0) \ge \varepsilon$. One can proceed as in the continuous case by adding the points from $S_\varepsilon$ to the discrete set $\{t_j\}$.

Exercise 2.1.1. Check the details of the proof of Theorem 2.1.2.

The results of Theorems 2.1.1 and 2.1.2 can be extended to certain functionals of the distribution $P$. Let $g(y)$ be a function on the real line. Consider its expectation
$$ s_0 \stackrel{\mathrm{def}}{=} IE g(Y_1) = \int_{-\infty}^{\infty} g(y)\, dF(y). $$
Its empirical counterpart is defined by
$$ S_n \stackrel{\mathrm{def}}{=} \int_{-\infty}^{\infty} g(y)\, dF_n(y) = \frac{1}{n}\sum_{i=1}^n g(Y_i). $$

It turns out that $S_n$ indeed estimates $s_0$ well, at least for large $n$.

Theorem 2.1.3. Let $g(y)$ be a function on the real line such that
$$ \int_{-\infty}^{\infty} g^2(y)\, dF(y) < \infty. $$
Then
$$ S_n \xrightarrow{IP} s_0, \qquad \sqrt{n}(S_n - s_0) \xrightarrow{w} N(0, \sigma_g^2), \qquad n \to \infty, $$
where
$$ \sigma_g^2 \stackrel{\mathrm{def}}{=} \int_{-\infty}^{\infty} g^2(y)\, dF(y) - s_0^2 = \int_{-\infty}^{\infty} \bigl[ g(y) - s_0 \bigr]^2 dF(y). $$
Moreover, if $h(z)$ is a twice continuously differentiable function on the real line and $h'(s_0) \ne 0$, then
$$ h(S_n) \xrightarrow{IP} h(s_0), \qquad \sqrt{n}\bigl\{ h(S_n) - h(s_0) \bigr\} \xrightarrow{w} N(0, \sigma_h^2), \qquad n \to \infty, $$
where $\sigma_h^2 \stackrel{\mathrm{def}}{=} |h'(s_0)|^2 \sigma_g^2$.

Proof. The first statement is again the CLT for the i.i.d. random variables $\xi_i = g(Y_i)$ having mean value $s_0$ and variance $\sigma_g^2$. It also implies the second statement in view of the Taylor expansion
$$ h(S_n) - h(s_0) \approx h'(s_0)\,(S_n - s_0). $$


Exercise 2.1.2. Complete the proof.
Hint: use the first result to show that $S_n$ belongs with high probability to a small neighborhood $U$ of the point $s_0$. Then apply the Taylor expansion of second order to $h(S_n) - h(s_0) = h(s_0 + n^{-1/2}\xi_n) - h(s_0)$ with $\xi_n = \sqrt{n}(S_n - s_0)$:
$$ \bigl| n^{1/2}[h(S_n) - h(s_0)] - h'(s_0)\,\xi_n \bigr| \le n^{-1/2} H^* \xi_n^2 / 2, $$
where $H^* = \max_U |h''(y)|$. Show that $n^{-1/2}\xi_n^2 \xrightarrow{IP} 0$ because $\xi_n$ is stochastically bounded by the first statement of the theorem.

The results of Theorems 2.1.2 and 2.1.3 can be extended to the case of a vectorial function $g(\cdot)\colon IR^1 \to IR^m$, that is, $g(y) = \bigl(g_1(y), \dots, g_m(y)\bigr)^\top$ for $y \in IR^1$. Then $s_0 = (s_{0,1}, \dots, s_{0,m})^\top$ and its empirical counterpart $S_n = (S_{n,1}, \dots, S_{n,m})^\top$ are vectors in $IR^m$ as well:
$$ s_{0,j} \stackrel{\mathrm{def}}{=} \int_{-\infty}^{\infty} g_j(y)\, dF(y), \qquad S_{n,j} \stackrel{\mathrm{def}}{=} \int_{-\infty}^{\infty} g_j(y)\, dF_n(y), \qquad j = 1, \dots, m. $$

Theorem 2.1.4. Let $g(y)$ be an $IR^m$-valued function on the real line with a bounded covariance matrix $\Sigma = (\Sigma_{jk})_{j,k=1,\dots,m}$:
$$ \Sigma_{jk} \stackrel{\mathrm{def}}{=} \int_{-\infty}^{\infty} \bigl[ g_j(y) - s_{0,j} \bigr]\bigl[ g_k(y) - s_{0,k} \bigr] dF(y) < \infty, \qquad j, k \le m. $$
Then
$$ S_n \xrightarrow{IP} s_0, \qquad \sqrt{n}\bigl( S_n - s_0 \bigr) \xrightarrow{w} N(0, \Sigma), \qquad n \to \infty. $$
Moreover, if $H(z)$ is a twice continuously differentiable function on $IR^m$ and $\Sigma H'(s_0) \ne 0$, where $H'(z)$ stands for the gradient of $H$ at $z$, then
$$ H(S_n) \xrightarrow{IP} H(s_0), \qquad \sqrt{n}\bigl\{ H(S_n) - H(s_0) \bigr\} \xrightarrow{w} N(0, \sigma_H^2), \qquad n \to \infty, $$
where $\sigma_H^2 \stackrel{\mathrm{def}}{=} H'(s_0)^\top \Sigma H'(s_0)$.

Exercise 2.1.3. Prove Theorem 2.1.4.
Hint: consider for every $h \in IR^m$ the scalar products $h^\top g(y)$, $h^\top s_0$, $h^\top S_n$. For the first statement, it suffices to show that
$$ h^\top S_n \xrightarrow{IP} h^\top s_0, \qquad \sqrt{n}\, h^\top\bigl( S_n - s_0 \bigr) \xrightarrow{w} N(0, h^\top \Sigma h), \qquad n \to \infty. $$
For the second statement, consider the expansion
$$ \bigl| n^{1/2}[H(S_n) - H(s_0)] - \xi_n^\top H'(s_0) \bigr| \le n^{-1/2} H^* \|\xi_n\|^2 / 2 \xrightarrow{IP} 0, $$
with $\xi_n = n^{1/2}(S_n - s_0)$ and $H^* = \max_{y \in U} \|H''(y)\|$ for a neighborhood $U$ of $s_0$. Here $\|A\|$ means the maximal eigenvalue of a symmetric matrix $A$.


2.2 Substitution principle. Method of moments

By the Glivenko-Cantelli theorem the empirical measure $P_n$ (resp. edf $F_n$) is a good approximation of the true measure $P$ (resp. cdf $F$), at least if $n$ is sufficiently large. This leads to the important substitution method of statistical estimation: represent the target of estimation as a function of the distribution $P$, then replace $P$ by $P_n$.

Suppose that there exists some functional $g$ of a measure $P_{\theta}$ from the family $\mathcal{P} = (P_{\theta}, \theta \in \Theta)$ such that the following identity holds:
$$ \theta \equiv g(P_{\theta}), \qquad \theta \in \Theta. $$
This particularly implies $\theta^* = g(P_{\theta^*}) = g(P)$. The substitution estimate is defined by substituting $P_n$ for $P$:
$$ \tilde{\theta} = g(P_n). $$
Sometimes the obtained value $\tilde{\theta}$ can lie outside the parameter set $\Theta$. Then one can redefine the estimate $\tilde{\theta}$ as the value providing the best fit of $g(P_n)$:
$$ \tilde{\theta} = \operatorname*{argmin}_{\theta} \| g(P_{\theta}) - g(P_n) \|. $$
Here $\|\cdot\|$ denotes some norm on the parameter set $\Theta$, e.g. the Euclidean norm.

2.2.1 Method of moments. Univariate parameter

The method of moments is a special but at the same time the most frequently used case of the substitution method. For illustration, we start with the univariate case. Let $\Theta \subseteq IR$, that is, $\theta$ is a univariate parameter. Let $g(y)$ be a function on $IR$ such that the first moment
$$ m(\theta) \stackrel{\mathrm{def}}{=} E_{\theta} g(Y_1) = \int g(y)\, dP_{\theta}(y) $$
is continuous and monotonic. Then the parameter $\theta$ can be uniquely identified by the value $m(\theta)$, that is, there exists an inverse function $m^{-1}$ satisfying
$$ \theta = m^{-1}\Bigl( \int g(y)\, dP_{\theta}(y) \Bigr). $$

The substitution method leads to the estimate
$$ \tilde{\theta} = m^{-1}\Bigl( \int g(y)\, dP_n(y) \Bigr) = m^{-1}\Bigl( \frac{1}{n}\sum g(Y_i) \Bigr). $$


Usually $g(x) = x$ or $g(x) = x^2$, which explains the name of the method. This method was proposed by Pearson and is historically the first regular method of constructing a statistical estimate.
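To make the recipe concrete, here is a minimal Python sketch of the substitution rule $\tilde{\theta} = m^{-1}\bigl(n^{-1}\sum g(Y_i)\bigr)$, applied to a hypothetical Poisson sample with $g(y) = y^2$, so that $m(\theta) = \theta + \theta^2$ and $m^{-1}$ has an explicit form. The sample size, seed and true parameter are illustrative choices, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.5
n = 10_000
Y = rng.poisson(theta_true, size=n)      # i.i.d. sample from P_theta

# moment function g(y) = y^2 gives m(theta) = E_theta Y^2 = theta + theta^2
def m_inv(t):
    # positive root of theta^2 + theta - t = 0
    return (-1.0 + np.sqrt(1.0 + 4.0 * t)) / 2.0

M_n = np.mean(Y.astype(float) ** 2)      # empirical moment n^{-1} sum g(Y_i)
theta_mom = m_inv(M_n)                   # substitution estimate m^{-1}(M_n)

print(theta_mom)      # close to 2.5 for large n
print(np.mean(Y))     # the simpler moment estimate based on g(y) = y
```

Both printed values are substitution estimates; the first uses the second moment together with the nonlinear inverse $m^{-1}$, the second uses $g(y) = y$ directly.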

2.2.2 Method of moments. Multivariate parameter

The method of moments can be easily extended to the multivariate case. Let $\Theta \subseteq IR^p$, and let $g(y) = \bigl(g_1(y), \dots, g_p(y)\bigr)^\top$ be a function with values in $IR^p$. Define the moments $m(\theta) = \bigl(m_1(\theta), \dots, m_p(\theta)\bigr)^\top$ by
$$ m_j(\theta) = E_{\theta} g_j(Y_1) = \int g_j(y)\, dP_{\theta}(y). $$

The main requirement on the choice of the vector function $g$ is that the function $m$ is invertible, that is, the system of equations
$$ m_j(\theta) = t_j $$
has a unique solution for any $t \in IR^p$. The empirical counterpart $M_n$ of the true moments $m(\theta^*)$ is given by
$$ M_n \stackrel{\mathrm{def}}{=} \int g(y)\, dP_n(y) = \Bigl( \frac{1}{n}\sum g_1(Y_i), \dots, \frac{1}{n}\sum g_p(Y_i) \Bigr)^\top. $$
Then the estimate $\tilde{\theta}$ can be defined as
$$ \tilde{\theta} \stackrel{\mathrm{def}}{=} m^{-1}(M_n) = m^{-1}\Bigl( \frac{1}{n}\sum g_1(Y_i), \dots, \frac{1}{n}\sum g_p(Y_i) \Bigr). $$

2.2.3 Method of moments. Examples

This section lists some widely used parametric families and discusses the problem of constructing the parameter estimates by different methods. In all the examples we assume that an i.i.d. sample from a distribution $P$ is observed, and this measure $P$ belongs to a given parametric family $(P_{\theta}, \theta \in \Theta)$, that is, $P = P_{\theta^*}$ for $\theta^* \in \Theta$.

Gaussian shift

Let $P_{\theta}$ be the normal distribution on the real line with mean $\theta$ and known variance $\sigma^2$. The corresponding density w.r.t. the Lebesgue measure reads as
$$ p(y, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl\{ -\frac{(y - \theta)^2}{2\sigma^2} \Bigr\}. $$


It holds $IE_{\theta} Y_1 = \theta$ and $\operatorname{Var}_{\theta}(Y_1) = \sigma^2$, leading to the moment estimate
$$ \tilde{\theta} = \int y\, dP_n(y) = \frac{1}{n}\sum Y_i $$
with mean $IE_{\theta}\tilde{\theta} = \theta$ and variance
$$ \operatorname{Var}_{\theta}(\tilde{\theta}) = \sigma^2 / n. $$

Univariate normal distribution

Let $Y_i \sim N(\alpha, \sigma^2)$ as in the previous example, but both the mean $\alpha$ and the variance $\sigma^2$ are unknown. This leads to the problem of estimating the vector $\theta = (\theta_1, \theta_2) = (\alpha, \sigma^2)$ from the i.i.d. sample $Y$.

The method of moments suggests to estimate the parameters from the first two empirical moments of the $Y_i$'s using the equations $m_1(\theta) = IE_{\theta} Y_1 = \alpha$, $m_2(\theta) = IE_{\theta} Y_1^2 = \alpha^2 + \sigma^2$. Inverting these equalities leads to
$$ \alpha = m_1(\theta), \qquad \sigma^2 = m_2(\theta) - m_1^2(\theta). $$

Substituting the empirical measure $P_n$ yields the expressions for $\tilde{\theta}$:
$$ \tilde{\alpha} = \frac{1}{n}\sum Y_i, \qquad \tilde{\sigma}^2 = \frac{1}{n}\sum Y_i^2 - \Bigl( \frac{1}{n}\sum Y_i \Bigr)^2 = \frac{1}{n}\sum \bigl( Y_i - \tilde{\alpha} \bigr)^2. \tag{2.2} $$

As previously for the case of a known variance, it holds under $IP = IP_{\theta}$:
$$ IE\tilde{\alpha} = \alpha, \qquad \operatorname{Var}_{\theta}(\tilde{\alpha}) = \sigma^2 / n. $$
However, for the estimate $\tilde{\sigma}^2$ of $\sigma^2$, the result is slightly different; it is described in the next theorem.

Theorem 2.2.1. It holds
$$ IE_{\theta}\tilde{\sigma}^2 = \frac{n-1}{n}\,\sigma^2, \qquad \operatorname{Var}_{\theta}(\tilde{\sigma}^2) = \frac{2(n-1)}{n^2}\,\sigma^4. $$

Proof. We use vector notation. Consider the unit vector $e = n^{-1/2}(1, \dots, 1)^\top \in IR^n$ and denote by $\Pi_1$ the projector on $e$:
$$ \Pi_1 h = (e^\top h)\, e. $$
Then by definition $\tilde{\alpha} = n^{-1/2} e^\top \Pi_1 Y$ and $\tilde{\sigma}^2 = n^{-1}\| Y - \Pi_1 Y \|^2$. Moreover, the model equation $Y = n^{1/2}\alpha e + \varepsilon$ implies in view of $\Pi_1 e = e$ that
$$ \Pi_1 Y = n^{1/2}\alpha e + \Pi_1 \varepsilon. $$


Now
$$ n\tilde{\sigma}^2 = \| Y - \Pi_1 Y \|^2 = \| \varepsilon - \Pi_1 \varepsilon \|^2 = \| (I_n - \Pi_1)\varepsilon \|^2, $$
where $I_n$ is the identity operator in $IR^n$ and $I_n - \Pi_1$ is the projector on the hyperplane in $IR^n$ orthogonal to the vector $e$. Obviously $(I_n - \Pi_1)\varepsilon$ is a Gaussian vector with zero mean and covariance matrix $V$ defined by
$$ V = IE\bigl[ (I_n - \Pi_1)\varepsilon\varepsilon^\top (I_n - \Pi_1) \bigr] = (I_n - \Pi_1)\, IE(\varepsilon\varepsilon^\top)\, (I_n - \Pi_1) = \sigma^2 (I_n - \Pi_1)^2 = \sigma^2 (I_n - \Pi_1). $$
It remains to note that for any Gaussian vector $\xi \sim N(0, V)$ it holds
$$ IE\|\xi\|^2 = \operatorname{tr} V, \qquad \operatorname{Var}\bigl( \|\xi\|^2 \bigr) = 2\operatorname{tr}(V^2). $$

Exercise 2.2.1. Check the details of the proof.
Hint: reduce to the case of diagonal $V$.

Exercise 2.2.2. Compute the covariance $IE(\tilde{\alpha} - \alpha)(\tilde{\sigma}^2 - \sigma^2)$. Show that $\tilde{\alpha}$ and $\tilde{\sigma}^2$ are independent.
Hint: represent $\tilde{\alpha} - \alpha = n^{-1/2} e^\top \Pi_1 \varepsilon$ and $\tilde{\sigma}^2 = n^{-1}\|(I_n - \Pi_1)\varepsilon\|^2$. Use that $\Pi_1\varepsilon$ and $(I_n - \Pi_1)\varepsilon$ are independent if $\Pi_1$ is a projector and $\varepsilon$ is a Gaussian vector.
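A quick simulation can illustrate Theorem 2.2.1. The Python sketch below (sample size, seed and true parameters are arbitrary illustrative choices) draws repeated Gaussian samples, forms the moment estimates (2.2) and compares the average of $\tilde{\sigma}^2$ with the predicted value $(n-1)\sigma^2/n$.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, sigma2 = 1.0, 4.0           # true mean and variance
n, n_rep = 20, 200_000             # small n makes the bias visible

Y = rng.normal(alpha, np.sqrt(sigma2), size=(n_rep, n))
alpha_hat = Y.mean(axis=1)         # tilde alpha for each replication
sigma2_hat = Y.var(axis=1)         # tilde sigma^2 = mean of (Y_i - tilde alpha)^2

print(sigma2_hat.mean())                 # approx (n-1)/n * sigma2 = 3.8
print((n - 1) / n * sigma2)              # 3.8
print(sigma2_hat.var())                  # approx 2(n-1)/n^2 * sigma2^2 = 1.52
print(2 * (n - 1) / n**2 * sigma2**2)    # 1.52
```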

Uniform distribution on [0, θ]

Let $Y_i$ be uniformly distributed on the interval $[0, \theta]$ of the real line, where the right end point $\theta$ is unknown. The density $p(y, \theta)$ of $P_{\theta}$ w.r.t. the Lebesgue measure is $\theta^{-1} 1(y \le \theta)$. It is easy to compute that for an integer $k$
$$ IE_{\theta}(Y_1^k) = \theta^{-1} \int_0^{\theta} y^k\, dy = \theta^k / (k+1), $$
or $\theta = \bigl\{ (k+1) IE_{\theta}(Y_1^k) \bigr\}^{1/k}$. This leads to the family of estimates
$$ \tilde{\theta}_k = \Bigl( \frac{k+1}{n} \sum Y_i^k \Bigr)^{1/k}. $$

Letting $k$ go to infinity leads to the estimate
$$ \tilde{\theta}_{\infty} = \max\{Y_1, \dots, Y_n\}. $$
This estimate is quite natural in the context of the uniform distribution. Later it will appear once again as the maximum likelihood estimate. However, it is not a moment estimate.
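The following short Python sketch (purely illustrative; the sample size, seed and true value are arbitrary) compares a few of the moment estimates $\tilde{\theta}_k$ with the maximum $\tilde{\theta}_{\infty}$ for a uniform sample on $[0, \theta]$.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true, n = 3.0, 500
Y = rng.uniform(0.0, theta_true, size=n)

def theta_k(Y, k):
    # moment estimate based on E_theta Y^k = theta^k / (k + 1)
    return ((k + 1) * np.mean(Y ** k)) ** (1.0 / k)

for k in (1, 2, 5, 20):
    print(k, theta_k(Y, k))
print("max", Y.max())     # the limit k -> infinity; not a moment estimate
```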


Bernoulli or binomial model

Let $P_{\theta}$ be a Bernoulli law for $\theta \in [0, 1]$. Then every $Y_i$ is binary with
$$ IE_{\theta} Y_i = \theta. $$
This leads to the moment estimate
$$ \tilde{\theta} = \int y\, dP_n(y) = \frac{1}{n}\sum Y_i. $$

Exercise 2.2.3. Compute the moment estimate for g(y) = yk , k ≥ 1 .

Multinomial model

The multinomial distribution $B^m_{\theta}$ describes the number of successes in $m$ experiments when each success has probability $\theta \in [0, 1]$. This distribution can be viewed as the sum of $m$ Bernoulli random variables with the same parameter $\theta$. Observed is the sample $Y$ where each $Y_i$ is the number of successes in the $i$th experiment. One has
$$ P_{\theta}(Y_1 = k) = \binom{m}{k}\, \theta^k (1 - \theta)^{m-k}, \qquad k = 0, \dots, m. $$

Exercise 2.2.4. Check that the method of moments with $g(x) = x$ leads to the estimate
$$ \tilde{\theta} = \frac{1}{mn}\sum Y_i. $$
Compute $\operatorname{Var}_{\theta}(\tilde{\theta})$.
Hint: reduce the multinomial model to the sum of $m$ Bernoulli random variables.

Exponential model

Let $P_{\theta}$ be an exponential distribution on the positive semiaxis with parameter $\theta$. This means
$$ IP_{\theta}(Y_1 > y) = e^{-y/\theta}. $$

Exercise 2.2.5. Check that the method of moments with $g(x) = x$ leads to the estimate
$$ \tilde{\theta} = \frac{1}{n}\sum Y_i. $$
Compute $\operatorname{Var}_{\theta}(\tilde{\theta})$.


Poisson model

Let $P_{\theta}$ be the Poisson distribution with parameter $\theta$. The Poisson random variable $Y_1$ is integer-valued with
$$ P_{\theta}(Y_1 = k) = \frac{\theta^k}{k!}\, e^{-\theta}. $$

Exercise 2.2.6. Check that the method of moments with $g(x) = x$ leads to the estimate
$$ \tilde{\theta} = \frac{1}{n}\sum Y_i. $$
Compute $\operatorname{Var}_{\theta}(\tilde{\theta})$.

Shift of a Laplace (double exponential) law

Let $P_0$ be a symmetric distribution defined by the equations
$$ P_0(|Y_1| > y) = e^{-y/\sigma}, \qquad y \ge 0, $$
for some given $\sigma > 0$. Equivalently, one can say that the absolute value of $Y_1$ is exponential with parameter $\sigma$ under $P_0$. Now define $P_{\theta}$ by shifting $P_0$ by the value $\theta$. This means that
$$ P_{\theta}(|Y_1 - \theta| > y) = e^{-y/\sigma}, \qquad y \ge 0. $$
It is obvious that $IE_0 Y_1 = 0$ and $IE_{\theta} Y_1 = \theta$.

Exercise 2.2.7. Check that the method of moments leads to the estimate
$$ \tilde{\theta} = \frac{1}{n}\sum Y_i. $$
Compute $\operatorname{Var}_{\theta}(\tilde{\theta})$.

Shift of a symmetric density

Let the observations $Y_i$ be defined by the equation
$$ Y_i = \theta^* + \varepsilon_i, $$
where $\theta^*$ is an unknown parameter and the errors $\varepsilon_i$ are i.i.d. with a density symmetric around zero and finite second moment $\sigma^2 = IE\varepsilon_1^2$. This particularly yields that $IE\varepsilon_i = 0$ and $IE Y_i = \theta^*$. The method of moments immediately yields the empirical mean estimate


$$ \tilde{\theta} = \frac{1}{n}\sum Y_i $$
with $\operatorname{Var}_{\theta}(\tilde{\theta}) = \sigma^2 / n$.

2.3 Unbiased estimates, bias, and quadratic risk

Consider a parametric i.i.d. experiment corresponding to a sample $Y = (Y_1, \dots, Y_n)^\top$ from a distribution $P_{\theta^*} \in (P_{\theta}, \theta \in \Theta \subseteq IR^p)$. By $\theta^*$ we denote the true parameter from $\Theta$. Let $\tilde{\theta}$ be an estimate of $\theta^*$, that is, a function of the available data $Y$ with values in $\Theta$: $\tilde{\theta} = \tilde{\theta}(Y)$.

An estimate $\tilde{\theta}$ of the parameter $\theta^*$ is called unbiased if
$$ IE_{\theta^*}\tilde{\theta} = \theta^*. $$

This property seems to be rather natural and desirable. However, it is often just a matter of parametrization. Indeed, if $g\colon \Theta \to \Theta$ is a linear transformation of the parameter set $\Theta$, that is, $g(\theta) = A\theta + b$, then the estimate $\tilde{\vartheta} \stackrel{\mathrm{def}}{=} A\tilde{\theta} + b$ of the new parameter $\vartheta = A\theta + b$ is again unbiased. However, if $m(\cdot)$ is a nonlinear transformation, then the identity $IE_{\theta^*} m(\tilde{\theta}) = m(\theta^*)$ is not preserved.

Example 2.3.1. Consider the Gaussian shift experiment with $Y_i$ i.i.d. $N(\theta^*, \sigma^2)$, where the variance $\sigma^2$ is known but the shift parameter $\theta^*$ is unknown. Then $\tilde{\theta} = n^{-1}(Y_1 + \dots + Y_n)$ is an unbiased estimate of $\theta^*$. However, for $m(\theta) = \theta^2$, it holds
$$ IE_{\theta^*} |\tilde{\theta}|^2 = |\theta^*|^2 + \sigma^2 / n, $$
that is, the estimate $|\tilde{\theta}|^2$ of $|\theta^*|^2$ is slightly biased.

The property of "no bias" is especially important in connection with the quadratic risk of the estimate $\tilde{\theta}$. To illustrate this point, we first consider the case of a univariate parameter.

2.3.1 Univariate parameter

Let $\theta \in \Theta \subseteq IR^1$. Denote by $\operatorname{Var}_{\theta^*}(\tilde{\theta})$ the variance of the estimate $\tilde{\theta}$:
$$ \operatorname{Var}_{\theta^*}(\tilde{\theta}) = IE_{\theta^*}\bigl( \tilde{\theta} - IE_{\theta^*}\tilde{\theta} \bigr)^2. $$
The quadratic risk of $\tilde{\theta}$ is defined by
$$ R(\tilde{\theta}, \theta^*) \stackrel{\mathrm{def}}{=} IE_{\theta^*}\bigl| \tilde{\theta} - \theta^* \bigr|^2. $$
It is obvious that $R(\tilde{\theta}, \theta^*) = \operatorname{Var}_{\theta^*}(\tilde{\theta})$ if $\tilde{\theta}$ is unbiased. It turns out that the quadratic risk of $\tilde{\theta}$ is larger than the variance when this property is not fulfilled. Define the bias of $\tilde{\theta}$ as
$$ b(\tilde{\theta}, \theta^*) \stackrel{\mathrm{def}}{=} IE_{\theta^*}\tilde{\theta} - \theta^*. $$

Theorem 2.3.1. It holds for any estimate $\tilde{\theta}$ of the univariate parameter $\theta^*$:
$$ R(\tilde{\theta}, \theta^*) = \operatorname{Var}_{\theta^*}(\tilde{\theta}) + b^2(\tilde{\theta}, \theta^*). $$

Due to this result, the bias $b(\tilde{\theta}, \theta^*)$ contributes the value $b^2(\tilde{\theta}, \theta^*)$ to the quadratic risk. This particularly explains why one is interested in considering unbiased or at least nearly unbiased estimates.
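The decomposition of Theorem 2.3.1 is easy to check numerically. The Python sketch below (an illustration under arbitrary hypothetical settings, not part of the text) uses the slightly biased variance estimate from Section 2.2 and verifies that its quadratic risk equals its variance plus the squared bias, up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2_true, n, n_rep = 2.0, 10, 500_000

Y = rng.normal(0.0, np.sqrt(sigma2_true), size=(n_rep, n))
est = Y.var(axis=1)                       # biased estimate of sigma2 (divides by n)

risk = np.mean((est - sigma2_true) ** 2)  # quadratic risk
var = est.var()                           # variance of the estimate
bias = est.mean() - sigma2_true           # bias

print(risk, var + bias ** 2)              # the two numbers agree up to simulation noise
```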

2.3.2 Multivariate case

Now we extend the result to the multivariate case with $\theta \in \Theta \subseteq IR^p$. Then $\tilde{\theta}$ is a vector in $IR^p$. The corresponding variance-covariance matrix $\operatorname{Var}_{\theta^*}(\tilde{\theta})$ is defined as
$$ \operatorname{Var}_{\theta^*}(\tilde{\theta}) \stackrel{\mathrm{def}}{=} IE_{\theta^*}\bigl[ \bigl( \tilde{\theta} - IE_{\theta^*}\tilde{\theta} \bigr)\bigl( \tilde{\theta} - IE_{\theta^*}\tilde{\theta} \bigr)^\top \bigr]. $$
As previously, $\tilde{\theta}$ is unbiased if $IE_{\theta^*}\tilde{\theta} = \theta^*$, and the bias of $\tilde{\theta}$ is $b(\tilde{\theta}, \theta^*) \stackrel{\mathrm{def}}{=} IE_{\theta^*}\tilde{\theta} - \theta^*$. The quadratic risk of the estimate $\tilde{\theta}$ in the multivariate case is usually defined via the Euclidean norm of the difference $\tilde{\theta} - \theta^*$:
$$ R(\tilde{\theta}, \theta^*) \stackrel{\mathrm{def}}{=} IE_{\theta^*}\bigl\| \tilde{\theta} - \theta^* \bigr\|^2. $$

Theorem 2.3.2. It holds
$$ R(\tilde{\theta}, \theta^*) = \operatorname{tr}\bigl[ \operatorname{Var}_{\theta^*}(\tilde{\theta}) \bigr] + \bigl\| b(\tilde{\theta}, \theta^*) \bigr\|^2. $$

Proof. The result follows similarly to the univariate case using the identity $\|v\|^2 = \operatorname{tr}(v v^\top)$ for any vector $v \in IR^p$.

Exercise 2.3.1. Complete the proof of Theorem 2.3.2.


2.4 Asymptotic properties

The properties of the previously introduced estimate $\tilde{\theta}$ heavily depend on the sample size $n$. We therefore use the notation $\tilde{\theta}_n$ to highlight this dependence. A natural extension of the condition that $\tilde{\theta}$ is unbiased is the requirement that the bias $b(\tilde{\theta}, \theta^*)$ becomes negligible as the sample size $n$ increases. This leads to the notion of consistency.

Definition 2.4.1. A sequence of estimates $\tilde{\theta}_n$ is consistent if
$$ \tilde{\theta}_n \xrightarrow{IP} \theta^*, \qquad n \to \infty. $$
$\tilde{\theta}_n$ is mean consistent if
$$ IE_{\theta^*}\| \tilde{\theta}_n - \theta^* \| \to 0, \qquad n \to \infty. $$

Clearly mean consistency implies consistency and also asymptotic unbiasedness:
$$ b(\tilde{\theta}_n, \theta^*) = IE_{\theta^*}\tilde{\theta}_n - \theta^* \to 0, \qquad n \to \infty. $$

The property of consistency means that the difference $\tilde{\theta}_n - \theta^*$ is small for large $n$. The next natural question to address is how fast this difference tends to zero with $n$. The central limit theorem results of Section 2.1 suggest that $\sqrt{n}\bigl( \tilde{\theta}_n - \theta^* \bigr)$ is asymptotically normal.

Definition 2.4.2. A sequence of estimates $\tilde{\theta}_n$ is root-n normal if
$$ \sqrt{n}\bigl( \tilde{\theta}_n - \theta^* \bigr) \xrightarrow{w} N(0, V^2) $$
for some fixed matrix $V^2$.

We aim to show that the moment estimates are consistent and asymptotically root-n normal under very general conditions. We start again with the univariate case.

2.4.1 Root-n normality. Univariate parameter

Our first result describes the simplest situation when the parameter of interest $\theta^*$ can be represented as an integral $\int g(y)\, dP_{\theta^*}(y)$ for some function $g(\cdot)$.

Theorem 2.4.1. Suppose that $\Theta \subseteq IR$ and a function $g(\cdot)\colon IR \to IR$ satisfies for every $\theta \in \Theta$
$$ \int g(y)\, dP_{\theta}(y) = \theta, \qquad \int \bigl[ g(y) - \theta \bigr]^2 dP_{\theta}(y) = \sigma^2(\theta) < \infty. $$
Then the moment estimates $\tilde{\theta}_n = n^{-1}\sum g(Y_i)$ satisfy the following conditions:

1. Each $\tilde{\theta}_n$ is unbiased, that is, $IE_{\theta^*}\tilde{\theta}_n = \theta^*$.
2. The normalized quadratic risk $n IE_{\theta^*}\bigl( \tilde{\theta}_n - \theta^* \bigr)^2$ fulfills
$$ n IE_{\theta^*}\bigl( \tilde{\theta}_n - \theta^* \bigr)^2 = \sigma^2(\theta^*). $$
3. $\tilde{\theta}_n$ is asymptotically root-n normal:
$$ \sqrt{n}\bigl( \tilde{\theta}_n - \theta^* \bigr) \xrightarrow{w} N(0, \sigma^2(\theta^*)). $$

This result has already been proved; see Theorem 2.1.3. Next we extend this result to the more general situation when $\theta^*$ is defined implicitly via the moment $s_0(\theta^*) = \int g(y)\, dP_{\theta^*}(y)$. This means that there exists another function $m(\theta)$ such that $m(\theta^*) = \int g(y)\, dP_{\theta^*}(y)$.

Theorem 2.4.2. Suppose that $\Theta \subseteq IR$ and functions $g(y)\colon IR \to IR$ and $m(\theta)\colon \Theta \to IR$ satisfy
$$ \int g(y)\, dP_{\theta}(y) \equiv m(\theta), \qquad \int \bigl\{ g(y) - m(\theta) \bigr\}^2 dP_{\theta}(y) \equiv \sigma_g^2(\theta) < \infty. $$
We also assume that $m(\cdot)$ is monotonic and twice continuously differentiable with $m'(\theta^*) \ne 0$. Then the moment estimates $\tilde{\theta}_n = m^{-1}\bigl( n^{-1}\sum g(Y_i) \bigr)$ satisfy the following conditions:

1. $\tilde{\theta}_n$ is consistent, that is, $\tilde{\theta}_n \xrightarrow{IP} \theta^*$.
2. $\tilde{\theta}_n$ is asymptotically root-n normal:
$$ \sqrt{n}\bigl( \tilde{\theta}_n - \theta^* \bigr) \xrightarrow{w} N(0, \sigma^2(\theta^*)), \tag{2.3} $$
where $\sigma^2(\theta^*) = |m'(\theta^*)|^{-2}\, \sigma_g^2(\theta^*)$.

This result also follows directly from Theorem 2.1.3 with $h(s) = m^{-1}(s)$. The property of asymptotic normality allows us to study the asymptotic concentration of $\tilde{\theta}_n$ and to build asymptotic confidence sets.


Corollary 2.4.1. Let $\tilde{\theta}_n$ be asymptotically root-n normal; see (2.3). Then for any $z > 0$
$$ \lim_{n \to \infty} IP_{\theta^*}\bigl( \sqrt{n}\, \bigl| \tilde{\theta}_n - \theta^* \bigr| > z\,\sigma(\theta^*) \bigr) = 2\Phi(-z), $$
where $\Phi(z)$ is the cdf of the standard normal law.

In particular, this result implies that the estimate $\tilde{\theta}_n$ belongs to the small root-n neighborhood
$$ \mathcal{A}(z) \stackrel{\mathrm{def}}{=} [\theta^* - n^{-1/2}\sigma(\theta^*) z,\; \theta^* + n^{-1/2}\sigma(\theta^*) z] $$
with probability about $1 - 2\Phi(-z)$, which is close to one provided that $z$ is sufficiently large.

Next we briefly discuss the problem of interval (or confidence) estimation of the parameter $\theta^*$. This problem differs from the problem of point estimation: the target is to build an interval (a set) $E_\alpha$ on the basis of the observations $Y$ such that $IP(E_\alpha \ni \theta^*) \approx 1 - \alpha$ for a given $\alpha \in (0, 1)$. This problem can be attacked similarly to the problem of concentration by considering the interval of width $2\sigma(\theta^*) z$ centered at the estimate $\tilde{\theta}$. However, the major difficulty is raised by the fact that this construction involves the true parameter value $\theta^*$ via the variance $\sigma^2(\theta^*)$. In some situations this variance does not depend on $\theta^*$: $\sigma^2(\theta^*) \equiv \sigma^2$ with a known value $\sigma^2$. In this case the construction is immediate.

Corollary 2.4.2. Let $\tilde{\theta}_n$ be asymptotically root-n normal; see (2.3). Let also $\sigma^2(\theta^*) \equiv \sigma^2$. Then for any $\alpha \in (0, 1)$, the set
$$ E^{\circ}(z_{\alpha}) \stackrel{\mathrm{def}}{=} [\tilde{\theta}_n - n^{-1/2}\sigma z_{\alpha},\; \tilde{\theta}_n + n^{-1/2}\sigma z_{\alpha}], $$
where $z_{\alpha}$ is defined by $2\Phi(-z_{\alpha}) = \alpha$, satisfies
$$ \lim_{n \to \infty} IP_{\theta^*}\bigl( E^{\circ}(z_{\alpha}) \ni \theta^* \bigr) = 1 - \alpha. \tag{2.4} $$

Exercise 2.4.1. Check Corollaries 2.4.1 and 2.4.2.

Next we consider the case when the variance $\sigma^2(\theta^*)$ is unknown. Instead we assume that a consistent variance estimate $\tilde{\sigma}^2$ is available. Then we plug this estimate into the construction of the confidence set in place of the unknown true variance $\sigma^2(\theta^*)$, leading to the following confidence set:
$$ E(z_{\alpha}) \stackrel{\mathrm{def}}{=} [\tilde{\theta}_n - n^{-1/2}\tilde{\sigma} z_{\alpha},\; \tilde{\theta}_n + n^{-1/2}\tilde{\sigma} z_{\alpha}]. \tag{2.5} $$


Theorem 2.4.3. Let $\tilde{\theta}_n$ be asymptotically root-n normal; see (2.3). Let $\sigma(\theta^*) > 0$ and let $\tilde{\sigma}^2$ be a consistent estimate of $\sigma^2(\theta^*)$ in the sense that $\tilde{\sigma}^2 \xrightarrow{IP} \sigma^2(\theta^*)$. Then for any $\alpha \in (0, 1)$, the set $E(z_{\alpha})$ is asymptotically $\alpha$-confident in the sense of (2.4).

One natural estimate of the variance $\sigma(\theta^*)$ can be obtained by plugging in the estimate $\tilde{\theta}$ in place of $\theta^*$, leading to $\tilde{\sigma} = \sigma(\tilde{\theta})$. If $\sigma(\theta)$ is a continuous function of $\theta$ in a neighborhood of $\theta^*$, then consistency of $\tilde{\theta}$ implies consistency of $\tilde{\sigma}$.

Corollary 2.4.3. Let $\tilde{\theta}_n$ be asymptotically root-n normal and let the variance $\sigma^2(\theta)$ be a continuous function of $\theta$ at $\theta^*$. Then $\tilde{\sigma} \stackrel{\mathrm{def}}{=} \sigma(\tilde{\theta}_n)$ is a consistent estimate of $\sigma(\theta^*)$ and the set $E(z_{\alpha})$ from (2.5) is asymptotically $\alpha$-confident.
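As an illustration of the plug-in construction (2.5), the following Python sketch builds the asymptotic confidence interval for a Bernoulli parameter, where $\tilde{\theta}_n$ is the empirical mean and $\sigma^2(\theta) = \theta(1-\theta)$. The coverage experiment, sample size and seed are hypothetical choices made only for the example.

```python
import numpy as np
from math import sqrt

rng = np.random.default_rng(4)
theta_true, n, alpha = 0.3, 400, 0.05
z_alpha = 1.959964                  # 2 * Phi(-z) = 0.05 for the standard normal

covered, n_rep = 0, 20_000
for _ in range(n_rep):
    Y = rng.binomial(1, theta_true, size=n)
    theta_hat = Y.mean()
    sigma_hat = sqrt(theta_hat * (1.0 - theta_hat))  # plug-in estimate of sigma(theta*)
    half = z_alpha * sigma_hat / sqrt(n)
    covered += (theta_hat - half <= theta_true <= theta_hat + half)

print(covered / n_rep)   # close to 1 - alpha = 0.95
```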

2.4.2 Root-n normality. Multivariate parameter

Let now $\Theta \subseteq IR^p$ and let $\theta^*$ be the true parameter vector. The method of moments requires at least $p$ different moment functions for identifying $p$ parameters. Let $g(y)\colon IR \to IR^p$ be a vector of moment functions, $g(y) = \bigl( g_1(y), \dots, g_p(y) \bigr)^\top$. Suppose first that the true parameter can be obtained just by integration: $\theta^* = \int g(y)\, dP_{\theta^*}(y)$. This yields the moment estimate $\tilde{\theta}_n = n^{-1}\sum g(Y_i)$.

Theorem 2.4.4. Suppose that a vector function $g(y)\colon IR \to IR^p$ satisfies the following conditions:
$$ \int g(y)\, dP_{\theta}(y) = \theta, \qquad \int \bigl\{ g(y) - \theta \bigr\}\bigl\{ g(y) - \theta \bigr\}^\top dP_{\theta}(y) = \Sigma(\theta). $$
Then it holds for the moment estimate $\tilde{\theta}_n = n^{-1}\sum g(Y_i)$:

1. $\tilde{\theta}_n$ is unbiased, that is, $IE_{\theta^*}\tilde{\theta}_n = \theta^*$.
2. $\tilde{\theta}_n$ is asymptotically root-n normal:
$$ \sqrt{n}\bigl( \tilde{\theta}_n - \theta^* \bigr) \xrightarrow{w} N(0, \Sigma(\theta^*)). \tag{2.6} $$
3. The normalized quadratic risk $n IE_{\theta^*}\bigl\| \tilde{\theta}_n - \theta^* \bigr\|^2$ fulfills
$$ n IE_{\theta^*}\bigl\| \tilde{\theta}_n - \theta^* \bigr\|^2 = \operatorname{tr} \Sigma(\theta^*). $$


Similarly to the univariate case, this result yields corollaries about concentration and confidence sets, with intervals replaced by ellipsoids. Indeed, due to the second statement, the vector
$$ \xi_n \stackrel{\mathrm{def}}{=} \sqrt{n}\, \{\Sigma(\theta^*)\}^{-1/2}\bigl( \tilde{\theta}_n - \theta^* \bigr) $$
is asymptotically standard normal: $\xi_n \xrightarrow{w} \xi \sim N(0, I_p)$. This also implies that the squared norm of $\xi_n$ is asymptotically $\chi^2_p$-distributed, where $\chi^2_p$ is the law of $\|\xi\|^2 = \xi_1^2 + \dots + \xi_p^2$. Define the value $z_{\alpha}$ via the quantiles of $\chi^2_p$ by the relation
$$ IP\bigl( \|\xi\| > z_{\alpha} \bigr) = \alpha. \tag{2.7} $$

Corollary 2.4.4. Suppose that $\tilde{\theta}_n$ is root-n normal; see (2.6). Define for a given $z$ the ellipsoid
$$ \mathcal{A}(z) \stackrel{\mathrm{def}}{=} \bigl\{ \theta : (\theta - \theta^*)^\top \{\Sigma(\theta^*)\}^{-1} (\theta - \theta^*) \le z^2 / n \bigr\}. $$
Then $\mathcal{A}(z_{\alpha})$ is an asymptotic $(1 - \alpha)$-concentration set for $\tilde{\theta}_n$ in the sense that
$$ \lim_{n \to \infty} IP\bigl( \tilde{\theta}_n \notin \mathcal{A}(z_{\alpha}) \bigr) = \alpha. $$

The weak convergence $\xi_n \xrightarrow{w} \xi$ suggests to build confidence sets also in the form of ellipsoids with the axes defined by the covariance matrix $\Sigma(\theta^*)$. Define for $\alpha > 0$
$$ E^{\circ}(z_{\alpha}) \stackrel{\mathrm{def}}{=} \bigl\{ \theta : \sqrt{n}\, \bigl\| \{\Sigma(\theta^*)\}^{-1/2}\bigl( \theta - \tilde{\theta}_n \bigr) \bigr\| \le z_{\alpha} \bigr\}. $$
The result of Theorem 2.4.4 implies that this set covers the true value $\theta^*$ with probability approaching $1 - \alpha$.

Unfortunately, in typical situations the matrix $\Sigma(\theta^*)$ is unknown because it depends on the unknown parameter $\theta^*$. It is natural to replace it with the matrix $\Sigma(\tilde{\theta})$, replacing the true value $\theta^*$ with its consistent estimate $\tilde{\theta}$. If $\Sigma(\theta)$ is a continuous function of $\theta$, then $\Sigma(\tilde{\theta})$ provides a consistent estimate of $\Sigma(\theta^*)$. This leads to the data-driven confidence set:
$$ E(z_{\alpha}) \stackrel{\mathrm{def}}{=} \bigl\{ \theta : \sqrt{n}\, \bigl\| \{\Sigma(\tilde{\theta})\}^{-1/2}\bigl( \theta - \tilde{\theta}_n \bigr) \bigr\| \le z_{\alpha} \bigr\}. $$

Corollary 2.4.5. Suppose that $\tilde{\theta}_n$ is root-n normal, see (2.6), with a non-degenerate matrix $\Sigma(\theta^*)$. Let the matrix function $\Sigma(\theta)$ be continuous at $\theta^*$. Let $z_{\alpha}$ be defined by (2.7). Then $E^{\circ}(z_{\alpha})$ and $E(z_{\alpha})$ are asymptotic $(1 - \alpha)$-confidence sets for $\theta^*$:
$$ \lim_{n \to \infty} IP\bigl( E^{\circ}(z_{\alpha}) \ni \theta^* \bigr) = \lim_{n \to \infty} IP\bigl( E(z_{\alpha}) \ni \theta^* \bigr) = 1 - \alpha. $$


Exercise 2.4.2. Check Corollaries 2.4.4 and 2.4.5 about the set E◦(zα) .

Exercise 2.4.3. Check Corollary 2.4.5 about the set E(zα) .

Hint: $\tilde{\theta}$ is consistent and $\Sigma(\theta)$ is continuous and invertible at $\theta^*$. This implies
$$ \Sigma(\tilde{\theta}) - \Sigma(\theta^*) \xrightarrow{IP} 0, \qquad \{\Sigma(\tilde{\theta})\}^{-1} - \{\Sigma(\theta^*)\}^{-1} \xrightarrow{IP} 0, $$
and hence the sets $E^{\circ}(z_{\alpha})$ and $E(z_{\alpha})$ are nearly the same.

Finally we discuss the general situation when the target parameter is a function of the moments. This means the relations
$$ m(\theta) = \int g(y)\, dP_{\theta}(y), \qquad \theta = m^{-1}\bigl( m(\theta) \bigr). $$
Of course, these relations assume that the vector function $m(\cdot)$ is invertible. The substitution principle leads to the estimate
$$ \tilde{\theta} \stackrel{\mathrm{def}}{=} m^{-1}(M_n), $$
where $M_n$ is the vector of empirical moments:
$$ M_n \stackrel{\mathrm{def}}{=} \int g(y)\, dP_n(y) = \frac{1}{n}\sum g(Y_i). $$
The central limit theorem implies (see Theorem 2.1.4) that $M_n$ is a consistent estimate of $m(\theta^*)$ and the vector $\sqrt{n}\bigl[ M_n - m(\theta^*) \bigr]$ is asymptotically normal with some covariance matrix $\Sigma_g(\theta^*)$. Moreover, if $m^{-1}$ is differentiable at the point $m(\theta^*)$, then $\sqrt{n}\bigl( \tilde{\theta} - \theta^* \bigr)$ is asymptotically normal as well:
$$ \sqrt{n}\bigl( \tilde{\theta} - \theta^* \bigr) \xrightarrow{w} N(0, \Sigma(\theta^*)), $$
where $\Sigma(\theta^*) = H^\top \Sigma_g(\theta^*) H$ and $H$ is the $p \times p$ Jacobi matrix of $m^{-1}$ at $m(\theta^*)$: $H \stackrel{\mathrm{def}}{=} \frac{d}{ds} m^{-1}(s)\big|_{s = m(\theta^*)}$.
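The following Python sketch is a numerical check of this substitution-plus-delta-method recipe in a univariate special case: an exponential distribution with mean $\theta$, moment function $g(y) = y^2$, so $m(\theta) = 2\theta^2$ and $m^{-1}(t) = \sqrt{t/2}$. All numbers and names are illustrative assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
theta_true, n, n_rep = 2.0, 2_000, 20_000

# g(y) = y^2, m(theta) = E_theta Y^2 = 2 theta^2, m^{-1}(t) = sqrt(t / 2)
Y = rng.exponential(theta_true, size=(n_rep, n))
M_n = np.mean(Y ** 2, axis=1)
theta_hat = np.sqrt(M_n / 2.0)

# delta method: n Var(theta_hat) -> |(m^{-1})'(m(theta))|^2 * Var_theta(Y^2)
Sigma_g = 20.0 * theta_true ** 4      # Var(Y^2) = 24 theta^4 - (2 theta^2)^2
H = 1.0 / (4.0 * theta_true)          # derivative of m^{-1} at m(theta)
print(n * theta_hat.var())            # Monte Carlo value
print(H ** 2 * Sigma_g)               # predicted limit = 1.25 * theta^2 = 5.0
```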

2.5 Some geometric properties of a parametric family

The parametric situation means that the true marginal distribution $P$ belongs to some given parametric family $(P_{\theta}, \theta \in \Theta \subseteq IR^p)$. By $\theta^*$ we denote the true value, that is, $P = P_{\theta^*} \in (P_{\theta})$. The natural target of estimation in this situation is the parameter $\theta^*$ itself. Below we assume that the family $(P_{\theta})$ is dominated, that is, there exists a dominating measure $\mu_0$. The corresponding density is denoted by


$$ p(y, \theta) = \frac{dP_{\theta}}{d\mu_0}(y). $$
We also use the notation
$$ \ell(y, \theta) \stackrel{\mathrm{def}}{=} \log p(y, \theta) $$
for the log-density.

The following two important characteristics of the parametric family $(P_{\theta})$ will be frequently used in the sequel: the Kullback-Leibler divergence and the Fisher information.

2.5.1 Kullback-Leibler divergence

For any two parameters $\theta, \theta'$, the value
$$ K(P_{\theta}, P_{\theta'}) = \int \log\frac{p(y, \theta)}{p(y, \theta')}\, p(y, \theta)\, d\mu_0(y) = \int \bigl[ \ell(y, \theta) - \ell(y, \theta') \bigr] p(y, \theta)\, d\mu_0(y) $$
is called the Kullback-Leibler divergence (KL-divergence) between $P_{\theta}$ and $P_{\theta'}$. We also write $K(\theta, \theta')$ instead of $K(P_{\theta}, P_{\theta'})$ if there is no risk of confusion. Equivalently one can represent the KL-divergence as
$$ K(\theta, \theta') = E_{\theta} \log\frac{p(Y, \theta)}{p(Y, \theta')} = E_{\theta}\bigl[ \ell(Y, \theta) - \ell(Y, \theta') \bigr], $$
where $Y \sim P_{\theta}$. An important feature of the Kullback-Leibler divergence is that it is always non-negative and it is equal to zero iff the measures $P_{\theta}$ and $P_{\theta'}$ coincide.

Lemma 2.5.1. For any $\theta, \theta'$, it holds
$$ K(\theta, \theta') \ge 0. $$
Moreover, $K(\theta, \theta') = 0$ implies that the densities $p(y, \theta)$ and $p(y, \theta')$ coincide $\mu_0$-a.s.

Proof. Define $Z(y) = p(y, \theta') / p(y, \theta)$. Then
$$ \int Z(y)\, p(y, \theta)\, d\mu_0(y) = \int p(y, \theta')\, d\mu_0(y) = 1 $$
because $p(y, \theta')$ is the density of $P_{\theta'}$ w.r.t. $\mu_0$. Next, $\frac{d^2}{dt^2}\log(t) = -t^{-2} < 0$, thus the log-function is strictly concave. The Jensen inequality implies


$$ K(\theta, \theta') = -\int \log(Z(y))\, p(y, \theta)\, d\mu_0(y) \ge -\log\Bigl( \int Z(y)\, p(y, \theta)\, d\mu_0(y) \Bigr) = -\log(1) = 0. $$
Moreover, the strict concavity of the log-function implies that equality in this relation is only possible if $Z(y) \equiv 1$ $P_{\theta}$-a.s. This implies the last statement of the lemma.

The two mentioned features of the Kullback-Leibler divergence suggest to consider it as a kind of distance on the parameter space. In some sense, it measures how far $P_{\theta'}$ is from $P_{\theta}$. Unfortunately, it is not a metric because it is not symmetric:
$$ K(\theta, \theta') \ne K(\theta', \theta), $$
with very few exceptions for some special situations.

Exercise 2.5.1. Compute the KL-divergence for the Gaussian shift, Bernoulli, Poisson, volatility and exponential families. Check in which cases it is symmetric.

Exercise 2.5.2. Consider the shift experiment given by the equation $Y = \theta + \varepsilon$ where $\varepsilon$ is an error with the given density function $p(\cdot)$ on $IR$. Compute the KL-divergence and check for symmetry.

One more important feature of the KL-divergence is its additivity.

Lemma 2.5.2. Let $(P^{(1)}_{\theta}, \theta \in \Theta)$ and $(P^{(2)}_{\theta}, \theta \in \Theta)$ be two parametric families with the same parameter set $\Theta$, and let $(P_{\theta} = P^{(1)}_{\theta} \times P^{(2)}_{\theta}, \theta \in \Theta)$ be the product family. Then for any $\theta, \theta' \in \Theta$
$$ K(P_{\theta}, P_{\theta'}) = K(P^{(1)}_{\theta}, P^{(1)}_{\theta'}) + K(P^{(2)}_{\theta}, P^{(2)}_{\theta'}). $$

Exercise 2.5.3. Prove Lemma 2.5.2. Extend the result to the case of the $m$-fold product of measures.
Hint: use that the log-density $\ell(y_1, y_2, \theta)$ of the product measure $P_{\theta}$ fulfills $\ell(y_1, y_2, \theta) = \ell^{(1)}(y_1, \theta) + \ell^{(2)}(y_2, \theta)$.

The additivity of the KL-divergence helps to easily compute the KL quantity for two measures $IP_{\theta}$ and $IP_{\theta'}$ describing the i.i.d. sample $Y = (Y_1, \dots, Y_n)^\top$. The log-density of the measure $IP_{\theta}$ w.r.t. $\boldsymbol{\mu}_0 = \mu_0^{\otimes n}$ at the point $y = (y_1, \dots, y_n)^\top$ is given by
$$ L(y, \theta) = \sum \ell(y_i, \theta). $$
An extension of the result of Lemma 2.5.2 yields
$$ K(IP_{\theta}, IP_{\theta'}) \stackrel{\mathrm{def}}{=} IE_{\theta}\bigl\{ L(Y, \theta) - L(Y, \theta') \bigr\} = n\, K(\theta, \theta'). $$

Page 44: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

44 2 Parameter estimation for an i.i.d. model

2.5.2 Hellinger distance

Another useful characteristic of a parametric family $(P_{\theta})$ is the so-called Hellinger distance. For a fixed $\mu \in [0, 1]$ and any $\theta, \theta' \in \Theta$, define
$$ h(\mu, P_{\theta}, P_{\theta'}) = E_{\theta}\Bigl( \frac{dP_{\theta'}}{dP_{\theta}}(Y) \Bigr)^{\mu} = \int \Bigl( \frac{p(y, \theta')}{p(y, \theta)} \Bigr)^{\mu} dP_{\theta}(y) = \int p^{\mu}(y, \theta')\, p^{1-\mu}(y, \theta)\, d\mu_0(y). $$
Note that this function can be represented as an exponential moment of the log-likelihood ratio $\ell(Y, \theta, \theta') = \ell(Y, \theta) - \ell(Y, \theta')$:
$$ h(\mu, P_{\theta}, P_{\theta'}) = E_{\theta} \exp\bigl\{ \mu \ell(Y, \theta', \theta) \bigr\} = E_{\theta}\Bigl( \frac{dP_{\theta'}}{dP_{\theta}}(Y) \Bigr)^{\mu}. $$
It is obvious that $h(\mu, P_{\theta}, P_{\theta'}) \ge 0$. Moreover, $h(\mu, P_{\theta}, P_{\theta'}) \le 1$. Indeed, the function $x^{\mu}$ for $\mu \in [0, 1]$ is concave and by the Jensen inequality:
$$ E_{\theta}\Bigl( \frac{dP_{\theta'}}{dP_{\theta}}(Y) \Bigr)^{\mu} \le \Bigl( IE_{\theta}\frac{dP_{\theta'}}{dP_{\theta}}(Y) \Bigr)^{\mu} = 1. $$
Similarly to the Kullback-Leibler divergence, we often write $h(\mu, \theta, \theta')$ in place of $h(\mu, P_{\theta}, P_{\theta'})$.

Typically the Hellinger distance is considered for $\mu = 1/2$. Then
$$ h(1/2, \theta, \theta') = \int p^{1/2}(y, \theta')\, p^{1/2}(y, \theta)\, d\mu_0(y). $$
In contrast to the Kullback-Leibler divergence, this quantity is symmetric and can be used to define a metric on the parameter set $\Theta$.

Introduce
$$ m(\mu, \theta, \theta') = -\log h(\mu, \theta, \theta') = -\log E_{\theta} \exp\bigl\{ \mu \ell(Y, \theta', \theta) \bigr\}. $$
The property $h(\mu, \theta, \theta') \le 1$ implies $m(\mu, \theta, \theta') \ge 0$. The rate function, similarly to the KL-divergence, is additive.

Lemma 2.5.3. Let $(P^{(1)}_{\theta}, \theta \in \Theta)$ and $(P^{(2)}_{\theta}, \theta \in \Theta)$ be two parametric families with the same parameter set $\Theta$, and let $(P_{\theta} = P^{(1)}_{\theta} \times P^{(2)}_{\theta}, \theta \in \Theta)$ be the product family. Then for any $\theta, \theta' \in \Theta$ and any $\mu \in [0, 1]$
$$ m(\mu, P_{\theta}, P_{\theta'}) = m(\mu, P^{(1)}_{\theta}, P^{(1)}_{\theta'}) + m(\mu, P^{(2)}_{\theta}, P^{(2)}_{\theta'}). $$


Exercise 2.5.4. Prove Lemma 2.5.3. Extend the result to the case of an $m$-fold product of measures.
Hint: use that the log-density $\ell(y_1, y_2, \theta)$ of the product measure $P_{\theta}$ fulfills $\ell(y_1, y_2, \theta) = \ell^{(1)}(y_1, \theta) + \ell^{(2)}(y_2, \theta)$.

Application of this lemma to the i.i.d. product family yields
$$ M(\mu, \theta', \theta) \stackrel{\mathrm{def}}{=} -\log IE_{\theta} \exp\bigl\{ \mu L(Y, \theta, \theta') \bigr\} = n\, m(\mu, \theta', \theta). $$

2.5.3 Regularity and the Fisher Information. Univariate parameter

An important assumption on the considered parametric family $(P_{\theta})$ is that the corresponding density function $p(y, \theta)$ is absolutely continuous w.r.t. the parameter $\theta$ for almost all $y$. Then the log-density $\ell(y, \theta)$ is differentiable as well with
$$ \nabla\ell(y, \theta) \stackrel{\mathrm{def}}{=} \frac{\partial \ell(y, \theta)}{\partial\theta} = \frac{1}{p(y, \theta)}\,\frac{\partial p(y, \theta)}{\partial\theta} $$
with the convention $\frac{1}{0}\log(0) = 0$. In the case of a univariate parameter $\theta \in IR$, we also write $\ell'(y, \theta)$ instead of $\nabla\ell(y, \theta)$.

Moreover, we usually assume some regularity conditions on the density $p(y, \theta)$. The next definition presents one possible set of such conditions for the case of a univariate parameter $\theta$.

p(y,θ) . The next definition presents one possible set of such conditions forthe case of a univariate parameter θ .

Definition 2.5.1. The family (Pθ, θ ∈ Θ ⊂ IR) is regular if the followingconditions are fulfilled:

1. The sets A(θ)def= {y : p(y, θ) = 0} are the same for all θ ∈ Θ .

2. Differentiability under the integration sign: for any function s(y) satisfy-ing ∫

s2(y)p(y, θ)dµ0(y) ≤ C, θ ∈ Θ

it holds

∂θ

∫s(y)dPθ(y) =

∂θ

∫s(y)p(y, θ)dµ0(y) =

∫s(y)

∂p(y, θ)

∂θdµ0(y).

3. Finite Fisher information: the log-density function `(y, θ) is differentiablein θ and its derivative is square integrable w.r.t. Pθ :∫ ∣∣`′(y, θ)∣∣2dPθ(y) =

∫|p′(y, θ)|2

p(y, θ)dµ0(y). (2.8)


The quantity in the condition (2.8) plays an important role in asymptotic statistics. It is usually referred to as the Fisher information.

Definition 2.5.2. Let $(P_{\theta}, \theta \in \Theta \subset IR)$ be a regular parametric family with a univariate parameter. Then the quantity
$$ F(\theta) \stackrel{\mathrm{def}}{=} \int \bigl| \ell'(y, \theta) \bigr|^2 p(y, \theta)\, d\mu_0(y) = \int \frac{|p'(y, \theta)|^2}{p(y, \theta)}\, d\mu_0(y) $$
is called the Fisher information of $(P_{\theta})$ at $\theta \in \Theta$.

The definition of $F(\theta)$ can be written as
$$ F(\theta) = IE_{\theta} \bigl| \ell'(Y, \theta) \bigr|^2 $$
with $Y \sim P_{\theta}$.

A simple sufficient condition for regularity of a family $(P_{\theta})$ is given by the next lemma.

Lemma 2.5.4. Let the log-density $\ell(y, \theta) = \log p(y, \theta)$ of a dominated family $(P_{\theta})$ be differentiable in $\theta$ and let the Fisher information $F(\theta)$ be a continuous function on $\Theta$. Then $(P_{\theta})$ is regular.

The proof is technical and can be found, e.g., in Borovkov (1998). Some useful properties of regular families are listed in the next lemma.

Lemma 2.5.5. Let $(P_{\theta})$ be a regular family. Then for any $\theta \in \Theta$ and $Y \sim P_{\theta}$:

1. $E_{\theta}\ell'(Y, \theta) = \int \ell'(y, \theta)\, p(y, \theta)\, d\mu_0(y) = 0$ and $F(\theta) = \operatorname{Var}_{\theta}\bigl[ \ell'(Y, \theta) \bigr]$.
2. $F(\theta) = -E_{\theta}\ell''(Y, \theta) = -\int \ell''(y, \theta)\, p(y, \theta)\, d\mu_0(y)$.

Proof. Differentiating the identity $\int p(y, \theta)\, d\mu_0(y) = \int \exp\{\ell(y, \theta)\}\, d\mu_0(y) \equiv 1$ implies under the regularity conditions the first statement of the lemma. Differentiating once more yields the second statement with another representation of the Fisher information.

Like the KL-divergence, the Fisher information possesses the important additivity property.

Lemma 2.5.6. Let $(P^{(1)}_{\theta}, \theta \in \Theta)$ and $(P^{(2)}_{\theta}, \theta \in \Theta)$ be two parametric families with the same parameter set $\Theta$, and let $(P_{\theta} = P^{(1)}_{\theta} \times P^{(2)}_{\theta}, \theta \in \Theta)$ be the product family. Then for any $\theta \in \Theta$, the Fisher information $F(\theta)$ satisfies
$$ F(\theta) = F^{(1)}(\theta) + F^{(2)}(\theta), $$
where $F^{(1)}(\theta)$ (resp. $F^{(2)}(\theta)$) is the Fisher information for $(P^{(1)}_{\theta})$ (resp. for $(P^{(2)}_{\theta})$).


Exercise 2.5.5. Prove Lemma 2.5.6.
Hint: use that the log-density of the product experiment can be represented as $\ell(y_1, y_2, \theta) = \ell_1(y_1, \theta) + \ell_2(y_2, \theta)$. The independence of $Y_1$ and $Y_2$ implies
$$ F(\theta) = \operatorname{Var}_{\theta}\bigl[ \ell'(Y_1, Y_2, \theta) \bigr] = \operatorname{Var}_{\theta}\bigl[ \ell'_1(Y_1, \theta) + \ell'_2(Y_2, \theta) \bigr] = \operatorname{Var}_{\theta}\bigl[ \ell'_1(Y_1, \theta) \bigr] + \operatorname{Var}_{\theta}\bigl[ \ell'_2(Y_2, \theta) \bigr]. $$

Exercise 2.5.6. Compute the Fisher information for the Gaussian shift, Bernoulli, Poisson, volatility and exponential families. Check in which cases it is constant.

Exercise 2.5.7. Consider the shift experiment given by the equation $Y = \theta + \varepsilon$ where $\varepsilon$ is an error with the given density function $p(\cdot)$ on $IR$. Compute the Fisher information and check whether it is constant.

Exercise 2.5.8. Check that the i.i.d. experiment from the uniform distribution on the interval $[0, \theta]$ with unknown $\theta$ is not regular.
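For the Gaussian shift family the score is $\ell'(y, \theta) = (y - \theta)/\sigma^2$, so $F(\theta) = 1/\sigma^2$ (part of Exercise 2.5.6). The short Python sketch below, with hypothetical numbers, estimates $F(\theta)$ by Monte Carlo and illustrates the two representations from Lemma 2.5.5.

```python
import numpy as np

rng = np.random.default_rng(7)
theta, sigma = 0.7, 2.0
Y = rng.normal(theta, sigma, size=1_000_000)

score = (Y - theta) / sigma ** 2     # l'(Y, theta) for the Gaussian shift
print(np.mean(score))                # approx 0  (E_theta l'(Y, theta) = 0)
print(np.mean(score ** 2))           # approx F(theta) = 1 / sigma^2 = 0.25
# here l''(y, theta) = -1/sigma^2, so -E_theta l''(Y, theta) = 0.25 as well
```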

Now we consider the properties of the i.i.d. experiment from a given regular family $(P_{\theta})$. The distribution of the whole i.i.d. sample $Y$ is described by the product measure $IP_{\theta} = P_{\theta}^{\otimes n}$, which is dominated by the measure $\boldsymbol{\mu}_0 = \mu_0^{\otimes n}$. The corresponding log-density $L(y, \theta)$ is given by
$$ L(y, \theta) \stackrel{\mathrm{def}}{=} \log\frac{dIP_{\theta}}{d\boldsymbol{\mu}_0}(y) = \sum \ell(y_i, \theta). $$
The function $\exp L(y, \theta)$ is the density of $IP_{\theta}$ w.r.t. $\boldsymbol{\mu}_0$ and hence, for any r.v. $\xi$,
$$ IE_{\theta}\xi = IE_0\bigl[ \xi \exp L(Y, \theta) \bigr]. $$
In particular, for $\xi \equiv 1$, this formula leads to the identity
$$ IE_0\bigl[ \exp L(Y, \theta) \bigr] = \int \exp\bigl\{ L(y, \theta) \bigr\}\, \boldsymbol{\mu}_0(dy) \equiv 1. \tag{2.9} $$

The next lemma claims that the product family $(IP_{\theta})$ for an i.i.d. sample from a regular family is also regular.

Lemma 2.5.7. Let $(P_{\theta})$ be a regular family and $IP_{\theta} = P_{\theta}^{\otimes n}$. Then

1. The set $A_n \stackrel{\mathrm{def}}{=} \{ y = (y_1, \dots, y_n)^\top : \prod p(y_i, \theta) = 0 \}$ is the same for all $\theta \in \Theta$.
2. For any r.v. $S = S(Y)$ with $IE_{\theta}S^2 \le C$, $\theta \in \Theta$, it holds
$$ \frac{\partial}{\partial\theta} IE_{\theta}S = \frac{\partial}{\partial\theta} IE_0\bigl[ S \exp L(Y, \theta) \bigr] = IE_0\bigl[ S\, L'(Y, \theta) \exp L(Y, \theta) \bigr], $$
where $L'(Y, \theta) \stackrel{\mathrm{def}}{=} \frac{\partial}{\partial\theta} L(Y, \theta)$.
3. The derivative $L'(Y, \theta)$ is square integrable and
$$ IE_{\theta}\bigl| L'(Y, \theta) \bigr|^2 = n F(\theta). $$

Local properties of the Kullback-Leibler divergence and Hellinger distance

Here we show that the quantities introduced so far are closely related to each other. We start with the Kullback-Leibler divergence.

Lemma 2.5.8. Let $(P_{\theta})$ be a regular family. Then the KL-divergence $K(\theta, \theta')$ satisfies
$$ K(\theta, \theta')\Big|_{\theta'=\theta} = 0, \qquad \frac{d}{d\theta'} K(\theta, \theta')\Big|_{\theta'=\theta} = 0, \qquad \frac{d^2}{d\theta'^2} K(\theta, \theta')\Big|_{\theta'=\theta} = F(\theta). $$
In a small neighborhood of $\theta$, the KL-divergence can be approximated by
$$ K(\theta, \theta') \approx F(\theta)\, |\theta' - \theta|^2 / 2. $$

Similar properties can be established for the rate function m(µ, θ, θ′) .

Lemma 2.5.9. Let $(P_{\theta})$ be a regular family. Then the rate function $m(\mu, \theta, \theta')$ satisfies
$$ m(\mu, \theta, \theta')\Big|_{\theta'=\theta} = 0, \qquad \frac{d}{d\theta'} m(\mu, \theta, \theta')\Big|_{\theta'=\theta} = 0, \qquad \frac{d^2}{d\theta'^2} m(\mu, \theta, \theta')\Big|_{\theta'=\theta} = \mu(1-\mu) F(\theta). $$
In a small neighborhood of $\theta$, the rate function $m(\mu, \theta, \theta')$ can be approximated by
$$ m(\mu, \theta, \theta') \approx \mu(1-\mu) F(\theta)\, |\theta' - \theta|^2 / 2. $$
Moreover, for any $\theta, \theta' \in \Theta$
$$ m(\mu, \theta, \theta')\Big|_{\mu=0} = 0, \qquad \frac{d}{d\mu} m(\mu, \theta, \theta')\Big|_{\mu=0} = E_{\theta}\ell(Y, \theta, \theta') = K(\theta, \theta'), \qquad \frac{d^2}{d\mu^2} m(\mu, \theta, \theta')\Big|_{\mu=0} = -\operatorname{Var}_{\theta}\bigl[ \ell(Y, \theta, \theta') \bigr]. $$
This implies an approximation for small $\mu$:
$$ m(\mu, \theta, \theta') \approx \mu K(\theta, \theta') - \frac{\mu^2}{2}\operatorname{Var}_{\theta}\bigl[ \ell(Y, \theta, \theta') \bigr]. $$

Exercise 2.5.9. Check the statements of Lemmas 2.5.8 and 2.5.9.

2.6 Cramer–Rao Inequality

Let $\tilde{\theta}$ be an estimate of the parameter $\theta^*$. We are interested in establishing a lower bound for the risk of this estimate. This bound indicates that under some conditions the quadratic risk of this estimate can never be below a specific value.

2.6.1 Univariate parameter

We again start with the univariate case and consider an unbiased estimate $\tilde{\theta}$. Suppose that the family $(P_{\theta}, \theta \in \Theta)$ is dominated by a $\sigma$-finite measure $\mu_0$ on the real line and denote by $p(y, \theta)$ the density of $P_{\theta}$ w.r.t. $\mu_0$:
$$ p(y, \theta) \stackrel{\mathrm{def}}{=} \frac{dP_{\theta}}{d\mu_0}(y). $$

Theorem 2.6.1 (Cramer–Rao Inequality). Let $\tilde{\theta} = \tilde{\theta}(Y)$ be an unbiased estimate of $\theta$ for an i.i.d. sample from a regular family $(P_{\theta})$. Then
$$ IE_{\theta}|\tilde{\theta} - \theta|^2 = \operatorname{Var}_{\theta}(\tilde{\theta}) \ge \frac{1}{n F(\theta)}, $$
with equality iff $\tilde{\theta} - \theta = \{n F(\theta)\}^{-1} L'(Y, \theta)$ almost surely. Moreover, if $\tilde{\theta}$ is not unbiased and $\tau(\theta) = IE_{\theta}\tilde{\theta}$, then with $\tau'(\theta) \stackrel{\mathrm{def}}{=} \frac{d}{d\theta}\tau(\theta)$, it holds
$$ \operatorname{Var}_{\theta}(\tilde{\theta}) \ge \frac{|\tau'(\theta)|^2}{n F(\theta)} $$
and
$$ IE_{\theta}|\tilde{\theta} - \theta|^2 = \operatorname{Var}_{\theta}(\tilde{\theta}) + |\tau(\theta) - \theta|^2 \ge \frac{|\tau'(\theta)|^2}{n F(\theta)} + |\tau(\theta) - \theta|^2. $$

Proof. Consider first the case of an unbiased estimate $\tilde{\theta}$ with $IE_{\theta}\tilde{\theta} \equiv \theta$. Differentiating the identity (2.9) $IE_0 \exp L(Y, \theta) \equiv 1$ w.r.t. $\theta$ yields
$$ 0 \equiv \int L'(y, \theta) \exp\bigl\{ L(y, \theta) \bigr\}\, \boldsymbol{\mu}_0(dy) = IE_{\theta}L'(Y, \theta). \tag{2.10} $$

Similarly, the identity $IE_{\theta}\tilde{\theta} = \theta$ implies
$$ 1 \equiv \int \tilde{\theta}(y)\, L'(y, \theta) \exp\bigl\{ L(y, \theta) \bigr\}\, \boldsymbol{\mu}_0(dy) = IE_{\theta}\bigl[ \tilde{\theta}\, L'(Y, \theta) \bigr]. $$
Together with (2.10), this gives
$$ IE_{\theta}\bigl[ (\tilde{\theta} - \theta) L'(Y, \theta) \bigr] \equiv 1. \tag{2.11} $$
Define $h = \{n F(\theta)\}^{-1} L'(Y, \theta)$. Then $IE_{\theta}\bigl\{ h L'(Y, \theta) \bigr\} = 1$ and (2.11) yields
$$ IE_{\theta}\bigl[ (\tilde{\theta} - \theta - h)\, h \bigr] \equiv 0. $$
Now
$$ IE_{\theta}(\tilde{\theta} - \theta)^2 = IE_{\theta}(\tilde{\theta} - \theta - h + h)^2 = IE_{\theta}(\tilde{\theta} - \theta - h)^2 + IE_{\theta}h^2 = IE_{\theta}(\tilde{\theta} - \theta - h)^2 + \bigl\{ n F(\theta) \bigr\}^{-1} \ge \bigl\{ n F(\theta) \bigr\}^{-1}, $$
with equality iff $\tilde{\theta} - \theta = \{n F(\theta)\}^{-1} L'(Y, \theta)$ almost surely. This implies the first assertion.

Now we consider the general case. The proof is similar. The property (2.10) continues to hold. Next, the identity $IE_{\theta}\tilde{\theta} = \theta$ is replaced with $IE_{\theta}\tilde{\theta} = \tau(\theta)$, yielding
$$ IE_{\theta}\bigl[ \tilde{\theta} L'(Y, \theta) \bigr] \equiv \tau'(\theta) $$
and
$$ IE_{\theta}\bigl[ \{\tilde{\theta} - \tau(\theta)\} L'(Y, \theta) \bigr] \equiv \tau'(\theta). $$
Again by the Cauchy-Schwarz inequality
$$ \bigl| \tau'(\theta) \bigr|^2 = IE_{\theta}^2\bigl[ \{\tilde{\theta} - \tau(\theta)\} L'(Y, \theta) \bigr] \le IE_{\theta}\{\tilde{\theta} - \tau(\theta)\}^2\; IE_{\theta}|L'(Y, \theta)|^2 = \operatorname{Var}_{\theta}(\tilde{\theta})\; n F(\theta), $$
and the second assertion follows. The last statement is the usual decomposition of the quadratic risk into the squared bias and the variance of the estimate.


2.6.2 Exponential families and R-efficiency

An interesting question is how good (precise) the Cramer–Rao lower bound is; in particular, when is it an equality? Indeed, if we restrict ourselves to unbiased estimates, no estimate can have quadratic risk smaller than $[nF(\theta)]^{-1}$. If an estimate has exactly the risk $[nF(\theta)]^{-1}$, then this estimate is automatically efficient in the sense that it is the best in the class in terms of the quadratic risk.

Definition 2.6.1. An unbiased estimate $\tilde{\theta}$ is R-efficient if
$$ \operatorname{Var}_{\theta}(\tilde{\theta}) = \bigl\{ n F(\theta) \bigr\}^{-1}. $$

Theorem 2.6.2. An unbiased estimate $\tilde{\theta}$ is R-efficient if and only if
$$ \tilde{\theta} = n^{-1}\sum U(Y_i), $$
where the function $U(\cdot)$ on $IR$ satisfies $\int U(y)\, dP_{\theta}(y) \equiv \theta$ and the log-density $\ell(y, \theta)$ of $P_{\theta}$ can be represented as
$$ \ell(y, \theta) = C(\theta) U(y) - B(\theta) + \ell(y) \tag{2.12} $$
for some functions $C(\cdot)$ and $B(\cdot)$ on $\Theta$ and a function $\ell(\cdot)$ on $IR$.

Proof. Suppose first that the representation (2.12) for the log-density is correct. Then $\ell'(y, \theta) = C'(\theta) U(y) - B'(\theta)$ and the identity $E_{\theta}\ell'(Y, \theta) = 0$ implies the relation between the functions $B(\cdot)$ and $C(\cdot)$:
$$ \theta C'(\theta) = B'(\theta). \tag{2.13} $$
Next, differentiating the equality
$$ 0 \equiv \int \{U(y) - \theta\}\, dP_{\theta}(y) = \int \{U(y) - \theta\}\, p(y, \theta)\, d\mu_0(y) $$
w.r.t. $\theta$ implies in view of (2.13)
$$ 1 \equiv IE_{\theta}\bigl[ \{U(Y) - \theta\}\bigl\{ C'(\theta) U(Y) - B'(\theta) \bigr\} \bigr] = C'(\theta)\, IE_{\theta}\bigl\{ U(Y) - \theta \bigr\}^2. $$
This yields $\operatorname{Var}_{\theta}\bigl\{ U(Y) \bigr\} = 1/C'(\theta)$ and leads to the following representation for the Fisher information:
$$ F(\theta) = \operatorname{Var}_{\theta}\bigl\{ \ell'(Y, \theta) \bigr\} = \operatorname{Var}_{\theta}\bigl\{ C'(\theta) U(Y) - B'(\theta) \bigr\} = \bigl\{ C'(\theta) \bigr\}^2 \operatorname{Var}_{\theta}\bigl\{ U(Y) \bigr\} = C'(\theta). $$


The estimate $\tilde{\theta} = n^{-1}\sum U(Y_i)$ satisfies
$$ IE_{\theta}\tilde{\theta} = \theta, $$
that is, it is unbiased. Moreover,
$$ \operatorname{Var}_{\theta}(\tilde{\theta}) = \operatorname{Var}_{\theta}\Bigl\{ \frac{1}{n}\sum U(Y_i) \Bigr\} = \frac{1}{n^2}\sum \operatorname{Var}\bigl\{ U(Y_i) \bigr\} = \frac{1}{n C'(\theta)} = \frac{1}{n F(\theta)}, $$
and $\tilde{\theta}$ is R-efficient.

Now we show the reverse statement. Due to the proof of the Cramer–Rao inequality, the only possibility of getting equality in this inequality is if
$$ L'(Y, \theta) = n F(\theta)\,(\tilde{\theta} - \theta). $$
This implies for some fixed $\theta_0$ and any $\theta^{\circ}$
$$ L(Y, \theta^{\circ}) - L(Y, \theta_0) = \int_{\theta_0}^{\theta^{\circ}} L'(Y, \theta)\, d\theta = \int_{\theta_0}^{\theta^{\circ}} n F(\theta)(\tilde{\theta} - \theta)\, d\theta = n\bigl\{ \tilde{\theta}\, C(\theta^{\circ}) - B(\theta^{\circ}) \bigr\} $$
with $C(\theta) = \int_{\theta_0}^{\theta} F(u)\, du$ and $B(\theta) = \int_{\theta_0}^{\theta} u F(u)\, du$. Applying this equality to a sample with $n = 1$ yields $U(Y_1) = \tilde{\theta}(Y_1)$, and
$$ \ell(Y_1, \theta) = \ell(Y_1, \theta_0) + C(\theta) U(Y_1) - B(\theta). $$
The desired representation follows.

Exercise 2.6.1. Apply the Cramer–Rao inequality and check R-efficiency of the empirical mean estimate $\tilde{\theta} = n^{-1}\sum Y_i$ for the Gaussian shift, Bernoulli, Poisson, exponential and volatility families.
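As a numerical companion to Exercise 2.6.1, the sketch below (illustrative settings only) checks R-efficiency of the empirical mean for the Poisson family, for which $F(\theta) = 1/\theta$ and the Cramer–Rao bound is $\{nF(\theta)\}^{-1} = \theta/n$.

```python
import numpy as np

rng = np.random.default_rng(8)
theta, n, n_rep = 3.0, 50, 200_000

Y = rng.poisson(theta, size=(n_rep, n))
theta_hat = Y.mean(axis=1)        # empirical mean estimate

print(theta_hat.var())            # Monte Carlo variance of the estimate
print(theta / n)                  # Cramer-Rao bound {n F(theta)}^{-1} = 0.06
```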

2.7 Cramer–Rao inequality. Multivariate parameter

This section extends the notions and results of the previous sections from the case of a univariate parameter to the case of a multivariate parameter with $\theta \in \Theta \subset IR^p$.


2.7.1 Regularity and Fisher Information. Multivariate parameter

The definition of regularity naturally extends to the case of a multivariate parameter $\theta = (\theta_1, \dots, \theta_p)^\top$. It suffices to check the same conditions as in the univariate case for every partial derivative $\partial p(y, \theta)/\partial\theta_j$ of the density $p(y, \theta)$ for $j = 1, \dots, p$.

Definition 2.7.1. The family $(P_{\theta}, \theta \in \Theta \subset IR^p)$ is regular if the following conditions are fulfilled:

1. The sets $A(\theta) \stackrel{\mathrm{def}}{=} \{ y : p(y, \theta) = 0 \}$ are the same for all $\theta \in \Theta$.
2. Differentiability under the integration sign: for any function $s(y)$ satisfying
$$ \int s^2(y)\, p(y, \theta)\, d\mu_0(y) \le C, \qquad \theta \in \Theta, $$
it holds
$$ \frac{\partial}{\partial\theta} \int s(y)\, dP_{\theta}(y) = \frac{\partial}{\partial\theta} \int s(y)\, p(y, \theta)\, d\mu_0(y) = \int s(y)\, \frac{\partial p(y, \theta)}{\partial\theta}\, d\mu_0(y). $$
3. Finite Fisher information: the log-density function $\ell(y, \theta)$ is differentiable in $\theta$ and its derivative $\nabla\ell(y, \theta) = \partial\ell(y, \theta)/\partial\theta$ is square integrable w.r.t. $P_{\theta}$:
$$ \int \bigl\| \nabla\ell(y, \theta) \bigr\|^2 dP_{\theta}(y) = \int \frac{\|\nabla p(y, \theta)\|^2}{p(y, \theta)}\, d\mu_0(y) < \infty. $$

In the case of a multivariate parameter, the notion of the Fisher information leads to the Fisher information matrix.

Definition 2.7.2. Let $(P_{\theta}, \theta \in \Theta \subset IR^p)$ be a parametric family. The matrix
$$ F(\theta) \stackrel{\mathrm{def}}{=} \int \nabla\ell(y, \theta)\, \nabla^\top\ell(y, \theta)\, p(y, \theta)\, d\mu_0(y) = \int \nabla p(y, \theta)\, \nabla^\top p(y, \theta)\, \frac{1}{p(y, \theta)}\, d\mu_0(y) $$
is called the Fisher information matrix of $(P_{\theta})$ at $\theta \in \Theta$.

This definition can be rewritten as
$$ F(\theta) = IE_{\theta}\bigl[ \nabla\ell(Y_1, \theta)\{\nabla\ell(Y_1, \theta)\}^\top \bigr]. $$

The additivity property of the Fisher information extends to the multivariate case as well.


Lemma 2.7.1. Let $(P_{\theta}, \theta \in \Theta)$ be a regular family. Then the $n$-fold product family $(IP_{\theta})$ with $IP_{\theta} = P_{\theta}^{\otimes n}$ is also regular. The Fisher information matrix $F(\theta)$ satisfies
$$ IE_{\theta}\bigl[ \nabla L(Y, \theta)\{\nabla L(Y, \theta)\}^\top \bigr] = n F(\theta). \tag{2.14} $$

Exercise 2.7.1. Compute the Fisher information matrix for the i.i.d. experiment $Y_i = \theta + \sigma\varepsilon_i$ with unknown $\theta$ and $\sigma$ and $\varepsilon_i$ i.i.d. standard normal.

2.7.2 Local properties of the Kullback-Leibler divergence andHellinger distance

The local relations between the Kullback-Leibler divergence, the rate function and the Fisher information naturally extend to the case of a multivariate parameter. We start with the Kullback-Leibler divergence.

Lemma 2.7.2. Let $(P_{\theta})$ be a regular family. Then the KL-divergence $K(\theta, \theta')$ satisfies
$$ K(\theta, \theta')\Big|_{\theta'=\theta} = 0, \qquad \frac{d}{d\theta'} K(\theta, \theta')\Big|_{\theta'=\theta} = 0, \qquad \frac{d^2}{d\theta'^2} K(\theta, \theta')\Big|_{\theta'=\theta} = F(\theta). $$
In a small neighborhood of $\theta$, the KL-divergence can be approximated by
$$ K(\theta, \theta') \approx (\theta' - \theta)^\top F(\theta)\, (\theta' - \theta) / 2. $$

Similar properties can be established for the rate function m(µ,θ,θ′) .

Lemma 2.7.3. Let $(P_{\theta})$ be a regular family. Then the rate function $m(\mu, \theta, \theta')$ satisfies
$$ m(\mu, \theta, \theta')\Big|_{\theta'=\theta} = 0, \qquad \frac{d}{d\theta'} m(\mu, \theta, \theta')\Big|_{\theta'=\theta} = 0, \qquad \frac{d^2}{d\theta'^2} m(\mu, \theta, \theta')\Big|_{\theta'=\theta} = \mu(1-\mu) F(\theta). $$
In a small neighborhood of $\theta$, the rate function can be approximated by
$$ m(\mu, \theta, \theta') \approx \mu(1-\mu)\, (\theta' - \theta)^\top F(\theta)\, (\theta' - \theta) / 2. $$
Moreover, for any $\theta, \theta' \in \Theta$
$$ m(\mu, \theta, \theta')\Big|_{\mu=0} = 0, \qquad \frac{d}{d\mu} m(\mu, \theta, \theta')\Big|_{\mu=0} = E_{\theta}\ell(Y, \theta, \theta') = K(\theta, \theta'), \qquad \frac{d^2}{d\mu^2} m(\mu, \theta, \theta')\Big|_{\mu=0} = -\operatorname{Var}_{\theta}\bigl[ \ell(Y, \theta, \theta') \bigr]. $$
This implies an approximation for small $\mu$:
$$ m(\mu, \theta, \theta') \approx \mu K(\theta, \theta') - \frac{\mu^2}{2}\operatorname{Var}_{\theta}\bigl[ \ell(Y, \theta, \theta') \bigr]. $$

Exercise 2.7.2. Check the statements of Lemmas 2.7.2 and 2.7.3.

2.7.3 Multivariate Cramer–Rao Inequality

Let $\tilde{\theta} = \tilde{\theta}(Y)$ be an estimate of the unknown parameter vector. This estimate is called unbiased if
$$ IE_{\theta}\tilde{\theta} \equiv \theta. $$

Theorem 2.7.1 (Multivariate Cramer–Rao Inequality). Let $\tilde{\theta} = \tilde{\theta}(Y)$ be an unbiased estimate of $\theta$ for an i.i.d. sample from a regular family $(P_{\theta})$. Then
$$ \operatorname{Var}_{\theta}(\tilde{\theta}) \ge \bigl\{ n F(\theta) \bigr\}^{-1}, \qquad IE_{\theta}\|\tilde{\theta} - \theta\|^2 = \operatorname{tr}\bigl\{ \operatorname{Var}_{\theta}(\tilde{\theta}) \bigr\} \ge \operatorname{tr}\bigl[ \bigl\{ n F(\theta) \bigr\}^{-1} \bigr]. $$
Here the matrix inequality means that the difference of the two sides is positive semidefinite. Moreover, if $\tilde{\theta}$ is not unbiased and $\tau(\theta) = IE_{\theta}\tilde{\theta}$, then with $\nabla\tau(\theta) \stackrel{\mathrm{def}}{=} \frac{d}{d\theta}\tau(\theta)$, it holds
$$ \operatorname{Var}_{\theta}(\tilde{\theta}) \ge \nabla\tau(\theta)\bigl\{ n F(\theta) \bigr\}^{-1}\{\nabla\tau(\theta)\}^\top $$
and
$$ IE_{\theta}\|\tilde{\theta} - \theta\|^2 = \operatorname{tr}\bigl[ \operatorname{Var}_{\theta}(\tilde{\theta}) \bigr] + \|\tau(\theta) - \theta\|^2 \ge \operatorname{tr}\bigl[ \nabla\tau(\theta)\bigl\{ n F(\theta) \bigr\}^{-1}\{\nabla\tau(\theta)\}^\top \bigr] + \|\tau(\theta) - \theta\|^2. $$


Proof. Consider first the case of an unbiased estimate $\tilde{\theta}$ with $IE_{\theta}\tilde{\theta} \equiv \theta$. Differentiating the identity (2.9) $IE_0 \exp L(Y, \theta) \equiv 1$ w.r.t. $\theta$ yields
$$ 0 \equiv \int \nabla L(y, \theta) \exp\bigl\{ L(y, \theta) \bigr\}\, \boldsymbol{\mu}_0(dy) = IE_{\theta}\bigl[ \nabla L(Y, \theta) \bigr]. \tag{2.15} $$
Similarly, the identity $IE_{\theta}\tilde{\theta} = \theta$ implies
$$ I \equiv \int \tilde{\theta}(y)\bigl\{ \nabla L(y, \theta) \bigr\}^\top \exp\bigl\{ L(y, \theta) \bigr\}\, \boldsymbol{\mu}_0(dy) = IE_{\theta}\bigl[ \tilde{\theta}\, \{\nabla L(Y, \theta)\}^\top \bigr]. $$
Together with (2.15), this gives
$$ IE_{\theta}\bigl[ (\tilde{\theta} - \theta)\{\nabla L(Y, \theta)\}^\top \bigr] \equiv I. \tag{2.16} $$

Consider the random vector
$$ h \stackrel{\mathrm{def}}{=} \bigl\{ n F(\theta) \bigr\}^{-1} \nabla L(Y, \theta). $$
By (2.15) $IE_{\theta}h = 0$ and by (2.14)
$$ \operatorname{Var}_{\theta}(h) = IE_{\theta}\bigl( h h^\top \bigr) = \bigl\{ n F(\theta) \bigr\}^{-1} IE_{\theta}\bigl[ \nabla L(Y, \theta)\{\nabla L(Y, \theta)\}^\top \bigr]\bigl\{ n F(\theta) \bigr\}^{-1} = \bigl\{ n F(\theta) \bigr\}^{-1}, $$
and the identities (2.15) and (2.16) imply that
$$ IE_{\theta}\bigl[ (\tilde{\theta} - \theta - h)\, h^\top \bigr] = 0. \tag{2.17} $$
The "no bias" property yields $IE_{\theta}\bigl( \tilde{\theta} - \theta \bigr) = 0$ and $IE_{\theta}\bigl[ (\tilde{\theta} - \theta)(\tilde{\theta} - \theta)^\top \bigr] = \operatorname{Var}_{\theta}(\tilde{\theta})$. Finally, by the orthogonality (2.17),
$$ \operatorname{Var}_{\theta}(\tilde{\theta}) = \operatorname{Var}_{\theta}(h) + \operatorname{Var}_{\theta}\bigl( \tilde{\theta} - \theta - h \bigr) = \bigl\{ n F(\theta) \bigr\}^{-1} + \operatorname{Var}_{\theta}\bigl( \tilde{\theta} - \theta - h \bigr), $$
and the variance of $\tilde{\theta}$ is not smaller than $\bigl\{ n F(\theta) \bigr\}^{-1}$. Moreover, the equality is only possible if $\tilde{\theta} - \theta - h$ is equal to zero almost surely.

Now we consider the general case. The proof is similar. The property (2.15) continues to hold. Next, the identity $IE_{\theta}\tilde{\theta} = \theta$ is replaced with $IE_{\theta}\tilde{\theta} = \tau(\theta)$, yielding
$$ IE_{\theta}\bigl[ \tilde{\theta}\, \{\nabla L(Y, \theta)\}^\top \bigr] \equiv \nabla\tau(\theta) $$
and
$$ IE_{\theta}\bigl[ \bigl\{ \tilde{\theta} - \tau(\theta) \bigr\}\bigl\{ \nabla L(Y, \theta) \bigr\}^\top \bigr] \equiv \nabla\tau(\theta). $$
Define
$$ h \stackrel{\mathrm{def}}{=} \nabla\tau(\theta)\bigl\{ n F(\theta) \bigr\}^{-1} \nabla L(Y, \theta). $$
Then similarly to the above
$$ IE_{\theta}\bigl[ h h^\top \bigr] = \nabla\tau(\theta)\bigl\{ n F(\theta) \bigr\}^{-1}\{\nabla\tau(\theta)\}^\top, \qquad IE_{\theta}\bigl[ (\tilde{\theta} - \theta - h)\, h^\top \bigr] = 0, $$
and the second assertion follows. The statements about the quadratic risk follow from its usual decomposition into the squared bias and the variance of the estimate.

2.7.4 Exponential families and R-efficiency

The notion of R-efficiency naturally extends to the case of a multivariate parameter.

Definition 2.7.3. An unbiased estimate $\tilde{\theta}$ is R-efficient if
$$ \operatorname{Var}_{\theta}(\tilde{\theta}) = \bigl\{ n F(\theta) \bigr\}^{-1}. $$

Theorem 2.7.2. An unbiased estimate $\tilde{\theta}$ is R-efficient if and only if
$$ \tilde{\theta} = n^{-1}\sum U(Y_i), $$
where the vector function $U(\cdot)$ on $IR$ satisfies $\int U(y)\, dP_{\theta}(y) \equiv \theta$ and the log-density $\ell(y, \theta)$ of $P_{\theta}$ can be represented as
$$ \ell(y, \theta) = C(\theta)^\top U(y) - B(\theta) + \ell(y) \tag{2.18} $$
for some functions $C(\cdot)$ and $B(\cdot)$ on $\Theta$ and a function $\ell(\cdot)$ on $IR$.

Proof. Suppose first that the representation (2.18) for the log-density is correct. Denote by $C'(\theta)$ the $p \times p$ Jacobi matrix of the vector function $C$: $C'(\theta) \stackrel{\mathrm{def}}{=} \frac{d}{d\theta}C(\theta)$. Then $\nabla\ell(y, \theta) = C'(\theta) U(y) - \nabla B(\theta)$ and the identity $E_{\theta}\nabla\ell(Y, \theta) = 0$ implies the relation between the functions $B(\cdot)$ and $C(\cdot)$:
$$ C'(\theta)\,\theta = \nabla B(\theta). \tag{2.19} $$
Next, differentiating the equality
$$ 0 \equiv \int \bigl[ U(y) - \theta \bigr] dP_{\theta}(y) = \int \bigl[ U(y) - \theta \bigr] p(y, \theta)\, d\mu_0(y) $$
w.r.t. $\theta$ implies in view of (2.19)
$$ I \equiv IE_{\theta}\bigl[ \{U(Y) - \theta\}\bigl\{ C'(\theta) U(Y) - \nabla B(\theta) \bigr\}^\top \bigr] = C'(\theta)\, IE_{\theta}\bigl[ \{U(Y) - \theta\}\{U(Y) - \theta\}^\top \bigr]. $$
This yields $\operatorname{Var}_{\theta}\bigl[ U(Y) \bigr] = \{C'(\theta)\}^{-1}$ and leads to the following representation for the Fisher information matrix:
$$ F(\theta) = \operatorname{Var}_{\theta}\bigl[ \nabla\ell(Y, \theta) \bigr] = \operatorname{Var}_{\theta}\bigl[ C'(\theta) U(Y) - \nabla B(\theta) \bigr] = C'(\theta)\operatorname{Var}_{\theta}\bigl[ U(Y) \bigr]\{C'(\theta)\}^\top = C'(\theta). $$
The estimate $\tilde{\theta} = n^{-1}\sum U(Y_i)$ satisfies
$$ IE_{\theta}\tilde{\theta} = \theta, $$
that is, it is unbiased. Moreover,
$$ \operatorname{Var}_{\theta}(\tilde{\theta}) = \operatorname{Var}_{\theta}\Bigl( \frac{1}{n}\sum U(Y_i) \Bigr) = \frac{1}{n^2}\sum \operatorname{Var}\bigl[ U(Y_i) \bigr] = \frac{1}{n}\bigl\{ C'(\theta) \bigr\}^{-1} = \bigl\{ n F(\theta) \bigr\}^{-1}, $$
and $\tilde{\theta}$ is R-efficient.

As in the univariate case, one can show that equality in the Cramer–Rao bound is only possible if $\nabla L(Y, \theta)$ and $\tilde{\theta} - \theta$ are linearly dependent. This leads again to the exponential family structure of the likelihood function.

Exercise 2.7.3. Complete the proof of Theorem 2.7.2.

2.8 Maximum likelihood and other estimation methods

This section presents some other popular methods of estimating the unknown parameter, including minimum distance estimation, M-estimation, the maximum likelihood procedure, etc.

2.8.1 Minimum distance estimation

Let $\rho(P, P')$ denote some functional (distance) defined for measures $P, P'$ on the real line. We assume that $\rho$ satisfies the following conditions: $\rho(P_{\theta_1}, P_{\theta_2}) \ge 0$ and $\rho(P_{\theta_1}, P_{\theta_2}) = 0$ iff $\theta_1 = \theta_2$. This implies for every $\theta^* \in \Theta$ that
$$ \operatorname*{argmin}_{\theta \in \Theta} \rho(P_{\theta}, P_{\theta^*}) = \theta^*. $$
The Glivenko-Cantelli theorem states that $P_n$ converges weakly to the true distribution $P_{\theta^*}$. Therefore, it is natural to define an estimate $\tilde{\theta}$ of $\theta^*$ by replacing in this formula the true measure $P_{\theta^*}$ by its empirical counterpart $P_n$, that is, by minimizing the distance $\rho$ between the measures $P_{\theta}$ and $P_n$ over the set $(P_{\theta})$. This leads to the minimum distance estimate
$$ \tilde{\theta} = \operatorname*{argmin}_{\theta \in \Theta} \rho(P_{\theta}, P_n). $$

2.8.2 M -estimation and Maximum likelihood estimation

Another general method of building an estimate of $\theta^*$, the so-called M-estimation, is defined via a contrast function $\psi(y, \theta)$ given for every $y \in IR$ and $\theta \in \Theta$. The principal condition on $\psi$ is that the integral $IE_{\theta}\psi(Y_1, \theta')$ is minimized at $\theta' = \theta$:
$$ \theta = \operatorname*{argmin}_{\theta'} \int \psi(y, \theta')\, dP_{\theta}(y), \qquad \theta \in \Theta. \tag{2.20} $$
In particular,
$$ \theta^* = \operatorname*{argmin}_{\theta \in \Theta} \int \psi(y, \theta)\, dP_{\theta^*}(y), $$
and the M-estimate is again obtained by substitution, that is, by replacing the true measure $P_{\theta^*}$ with its empirical counterpart $P_n$:
$$ \tilde{\theta} = \operatorname*{argmin}_{\theta \in \Theta} \int \psi(y, \theta)\, dP_n(y) = \operatorname*{argmin}_{\theta \in \Theta} \frac{1}{n}\sum \psi(Y_i, \theta). $$
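A minimal Python sketch of the M-estimation recipe, under purely illustrative choices: the empirical contrast is minimized over a grid, and with the quadratic contrast $\psi(y, \theta) = (y - \theta)^2$ the minimizer reproduces the empirical mean, as Lemma 2.8.1 below states for the least squares case.

```python
import numpy as np

rng = np.random.default_rng(9)
Y = rng.normal(1.5, 1.0, size=400)

def m_estimate(Y, psi, grid):
    # minimize the empirical contrast  n^{-1} sum_i psi(Y_i, theta)  over a grid
    contrasts = np.array([np.mean(psi(Y, t)) for t in grid])
    return grid[np.argmin(contrasts)]

grid = np.linspace(-2.0, 5.0, 7001)
theta_ls = m_estimate(Y, lambda y, t: (y - t) ** 2, grid)   # quadratic contrast

print(theta_ls, Y.mean())   # the grid minimizer agrees with the empirical mean
```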

Exercise 2.8.1. Let $Y$ be an i.i.d. sample from $P \in (P_{\theta}, \theta \in \Theta \subset IR)$.
(i) Let also $g(y)$ satisfy $\int g(y)\, dP_{\theta}(y) \equiv \theta$, leading to the moment estimate
$$ \tilde{\theta} = n^{-1}\sum g(Y_i). $$
Show that this estimate can be obtained as the M-estimate for a properly selected function $\psi(\cdot)$.
(ii) Let $\int g(y)\, dP_{\theta}(y) \equiv m(\theta)$ for given functions $g(\cdot)$ and $m(\cdot)$, where $m(\cdot)$ is monotonic. Show that the moment estimate $\tilde{\theta} = m^{-1}(M_n)$ with $M_n = n^{-1}\sum g(Y_i)$ can be obtained as the M-estimate for a properly selected function $\psi(\cdot)$.

We mention three prominent examples of the contrast function $\psi$ and the resulting estimates: least squares, least absolute deviation and maximum likelihood.


Least squares estimation

The least squares estimate (LSE) corresponds to the quadratic contrast

ψ(y,θ) = ‖g(y)− θ‖2,

where g(y) is a p -dimensional function of the observation y satisfying∫g(y)dPθ(y) ≡ θ, θ ∈ Θ.

Then the true parameter θ∗ fulfills the relation

θ∗ = argminθ∈Θ

∫‖g(y)− θ‖2dPθ∗(y)

because∫‖g(y)− θ‖2dPθ∗(y) = ‖θ∗ − θ‖2 +

∫‖g(y)− θ∗‖2dPθ∗(y).

The substitution method leads to the estimate θ of θ∗ defined by minimiza-tion of the empirical version of the integral

∫‖g(y)− θ‖2dPθ∗(y) :

θdef= argmin

θ∈Θ

∫‖g(y)− θ‖2dPn(y) = argmin

θ∈Θ

∑‖g(Yi)− θ‖2.

This is again a quadratic optimization problem having a closed form solutioncalled least squares or ordinary least squares estimate.

Lemma 2.8.1. It holds

θ = argminθ∈Θ

∑‖g(Yi)− θ‖2 =

1

n

∑g(Yi).

One can see that the LSE θ coincides with the moment estimate basedon the function g(·) . Indeed, the equality

∫g(y)dPθ∗(y) = θ∗ leads directly

to the LSE θ = n−1∑g(Yi) .

Least absolute deviation (median) estimation

The next example of an M-estimate is given by the absolute deviation contrastfit. For simplicity of presentation, we consider here only the case of a univariate

parameter. The contrast function ψ(y, θ) is given by ψ(y, θ)def= |y − θ| . The

solution of the related optimization problem (2.20) is given by the medianmed(Pθ) of the distribution Pθ .

Page 61: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

2.8 Maximum likelihood and other estimation methods 61

Definition 2.8.1. The value t is called the median of a distribution functionF if

F (t) ≥ 1/2, F (t−) < 1/2.

If F (·) is a continuous function then the median t = med(F ) satisfies F (t) =1/2 .

Theorem 2.8.1. For any cdf F , the median med(F ) satisfies

infθ∈IR

∫|y − θ| dF (y) =

∫|y −med(F )| dF (y).

Proof. Consider for simplicity the case of a continuous distribution functionF . One has |y− θ| = (θ− y)1(y < θ) + (y− θ)1(y ≥ θ) . Differentiating w.r.t.θ yields the following equation for any extreme point of

∫|y − θ| dF (y) :

−∫ θ

−∞dF (y) +

∫ ∞θ

dF (y) = 0.

The median is the only solution of this equation.

Let the family (Pθ) be such that θ = med(Pθ) for all θ ∈ IR . Then theM-estimation approach leads to the least absolute deviation (LAD) estimate

θdef= argmin

θ∈IR

∫|y − θ| dFn(y) = argmin

θ∈IR

∑|Yi − θ|.

Due to Theorem 2.8.1, the solution of this problem is given by the median ofthe edf Fn .

Maximum likelihood estimation

Let now ψ(y,θ) = −`(y,θ) = − log p(y,θ) where p(y,θ) is the density of themeasure Pθ at y w.r.t. to some dominating measure µ0 . This choice leadsto the maximum likelihood estimate (MLE):

θ = argmaxθ∈Θ

1

n

∑log p(Yi,θ).

The condition (2.20) is fulfilled because

argminθ′

∫ψ(y,θ′)dPθ(y) = argmin

θ′

∫ {ψ(y,θ′)− ψ(y,θ)

}dPθ(y)

= argminθ′

∫log

p(y,θ)

p(y,θ′)dPθ(y)

= argminθ′

K(θ,θ′) = θ.

Page 62: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

62 2 Parameter estimation for an i.i.d. model

Here we used that the Kullback-Leibler divergence K(θ,θ′) attains its min-imum equal to zero at the point θ′ = θ which in turn follows from theconcavity of the log-function by the Jensen inequality.

Note that the definition of the MLE does not depend on the choice of thedominating measure µ0 .

Exercise 2.8.2. Show that the MLE θ does not change if another dominat-ing measure is used.

Computing an M -estimate or MLE leads to solving an optimization prob-lem for the empirical quantity

∑ψ(Yi,θ) w.r.t. the parameter θ . If the

function ψ is differentiable w.r.t. θ then the solution can be found from theestimating equation

∂θ

∑ψ(Yi,θ) = 0.

Exercise 2.8.3. Show that any M -estimate and particularly the MLE canbe represented as minimum distance estimate with a properly defined distanceρ .

Hint: define ρ(Pθ, Pθ∗) as∫ [ψ(y,θ)− ψ(y,θ∗)

]dPθ∗(y) .

Recall that the MLE θ is defined by maximizing the expression L(θ) =∑`(Yi,θ) w.r.t. θ . Below we use the notation L(θ,θ′)

def= L(θ) − L(θ′) ,

often called the log-likelihood ratio.In our study we will focus on the value of the maximum L(θ) =

maxθ L(θ) . Let L(θ) =∑`(Yi,θ) be the likelihood function. The value

L(θ)def= max

θL(θ)

is called the maximum log-likelihood or fitted log-likelihood. The excess L(θ)−L(θ∗) is the difference between the maximum of the likelihood function L(θ)over θ and its particular value at the true parameter θ∗ :

L(θ,θ∗)def= max

θL(θ)− L(θ∗).

The next section collects some examples of computing the MLE θ andthe corresponding maximum log-likelihood.

2.9 Maximum Likelihood for some parametric families

The examples of this section focus on the structure of the log-likelihood andthe corresponding MLE θ and the maximum log-likelihood L(θ) .

Page 63: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

2.9 Maximum Likelihood for some parametric families 63

2.9.1 Gaussian shift

Let Pθ be the normal distribution on the real line with mean θ and theknown variance σ2 . The corresponding density w.r.t. the Lebesgue measurereads as

p(y,θ) =1√

2πσ2exp{− (y − θ)2

2σ2

}.

The log-likelihood L(θ) is

L(θ) =∑

log p(Yi, θ) = −n2

log(2πσ2)− 1

2σ2

∑(Yi − θ)2.

The corresponding normal equation L′(θ) = 0 yields

− 1

σ2

∑(Yi − θ) = 0 (2.21)

leading to the empirical mean solution θ = n−1∑Yi .

The computation of the fitted likelihood is a bit more involved.

Theorem 2.9.1. Let Yi = θ∗ + εi with εi ∼ N(0, σ2) . For any θ

L(θ, θ) = nσ−2(θ − θ)2/2. (2.22)

Moreover,

L(θ, θ∗) = nσ−2(θ − θ∗)2/2 = ξ2/2

where ξ is a standard normal r.v. so that 2L(θ, θ∗) has the fixed χ21 dis-

tribution with one degree of freedom. If zα is the quantile of χ21/2 with

P (ξ2/2 > zα) = α , then

E(zα) = {u : L(θ, u) ≤ zα} (2.23)

is an α -confidence set: IPθ∗(E(zα) 63 θ∗) = α .For every r > 0 ,

IEθ∗∣∣2L(θ, θ∗)

∣∣r = cr ,

where cr = E|ξ|2r with ξ ∼ N(0, 1) .

Proof (Proof 1). Consider L(θ, θ)def= L(θ)−L(θ) as a function of the param-

eter θ . Obviously

Page 64: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

64 2 Parameter estimation for an i.i.d. model

L(θ, θ) = − 1

2σ2

∑[(Yi − θ)2 − (Yi − θ)2

],

so that L(θ, θ) is a quadratic function of θ . Next, it holds L(θ, θ)∣∣θ=θ

= 0

and ddθL(θ, θ)

∣∣θ=θ

= − ddθL(θ)

∣∣θ=θ

= 0 due to the normal equation (2.21).Finally,

d2

dθ2L(θ, θ)

∣∣θ=θ

= − d2

dθ2L(θ)

∣∣θ=θ

= n/σ2.

This implies by the Taylor expansion of a quadratic function L(θ, θ) at θ = θ :

L(θ, θ) =n

2σ2(θ − θ)2.

Proof 2. First observe that for any two points θ′, θ , the log-likelihood ratioL(θ′, θ) = log(dIPθ′/dIPθ) = L(θ′)− L(θ) can be represented in the form

L(θ′, θ) = L(θ′)− L(θ) = σ−2(S − nθ)(θ′ − θ)− nσ−2(θ′ − θ)2/2.

Substituting the MLE θ = S/n in place of θ′ implies

L(θ, θ) = nσ−2(θ − θ)2/2.

Now we consider the second statement about the distribution of L(θ, θ∗) .The substitution θ = θ∗ in (2.22) and the model equation Yi = θ∗+εi imply

θ − θ∗ = n−1/2σξ , where

ξdef=

1

σ√n

∑εi

is standard normal. Therefore,

L(θ, θ∗) = ξ2/2.

This easily implies the result of the theorem.

We see that under IPθ∗ the variable 2L(θ, θ∗) is χ21 distributed with one

degree of freedom, and this distribution does not depend on the sample sizen and the scale parameter σ . This fact is known in a more general form aschi-squared theorem.

Exercise 2.9.1. Check that the confidence sets

E◦(zα)def= [θ − n−1/2σzα, θ + n−1/2σzα],

where zα is defined by 2Φ(−zα) = α , and E(zα) from (2.23) coincide.

Page 65: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

2.9 Maximum Likelihood for some parametric families 65

Exercise 2.9.2. Compute the constant cr from Theorem 2.9.1 for r =0.5, 1, 1.5, 2 .

Already now we point out an interesting feature of the fitted log-likelihoodL(θ, θ∗) . It can be viewed as the normalized squared loss of the estimate θ

because L(θ, θ∗) = nσ−2|θ−θ∗|2 . The last statement of Theorem 2.9.1 yieldsthat

IEθ∗ |θ − θ∗|2r = crσ2rn−r.

2.9.2 Variance estimation for the normal law

Let Yi be i.i.d. normal with mean zero and unknown variance θ∗ :

Yi ∼ N(0, θ∗), θ∗ ∈ IR+ .

The likelihood function reads as

L(θ) =∑

log p(Yi, θ) = −n2

log(2πθ)− 1

∑Y 2i .

The normal equation L′(θ) = 0 yields

L′(θ) = − n

2θ+

1

2θ2

∑Y 2i = 0

leading to

θ =1

nSn

with Sn =∑Y 2i . Moreover, for any θ

L(θ, θ) = −n2

log(θ/θ)− Sn2

(1/θ − 1/θ

)= nK(θ, θ)

where

K(θ, θ′) = −1

2

[log(θ/θ′) + 1− θ/θ′

]is the Kullback-Leibler divergence for two Gaussian measures N(0, θ) andN(0, θ′) .

Page 66: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

66 2 Parameter estimation for an i.i.d. model

2.9.3 Univariate normal distribution

Let Yi be as in previous example N{α, σ2} but neither the mean α nor thevariance σ2 are known. This leads to estimating the vector θ = (θ1, θ2) =(α, σ2) from the i.i.d. sample Y .

The maximum likelihood approach leads to maximizing the log-likelihoodw.r.t. the vector θ = (α, σ2)> :

L(θ) =∑

log p(Yi,θ) = −n2

log(2πθ2)− 1

2θ2

∑(Yi − θ1)2.

Exercise 2.9.3. Check that the ML approach leads to the same estimates(2.2) as the method of moments.

2.9.4 Uniform distribution on [0, θ]

Let Yi be uniformly distributed on the interval [0, θ∗] of the real line wherethe right end point θ∗ is unknown. The density p(y, θ) of Pθ w.r.t. theLebesgue measure is θ−11(y ≤ θ) . The likelihood reads as

Z(θ) = θ−n1(maxiYi ≤ θ).

This density is positive iff θ ≥ maxi Yi and it is maximized exactly for θ =maxi Yi . One can see that the MLE θ is the limiting case of the momentestimate θk as k grows to infinity.

2.9.5 Bernoulli or binomial model

Let Pθ be a Bernoulli law for θ ∈ [0, 1] . The density of Yi under Pθ can bewritten as

p(y, θ) = θy(1− θ)1−y.

The corresponding log-likelihood reads as

L(θ) =∑{

Yi log θ + (1− Yi) log(1− θ)}

= Sn logθ

1− θ+ n log(1− θ)

with Sn =∑Yi . Maximizing this expression w.r.t. θ results again in the

empirical mean

θ = Sn/n.

This implies

Page 67: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

2.9 Maximum Likelihood for some parametric families 67

L(θ, θ) = nθ logθ

θ+ n(1− θ) log

1− θ1− θ

= nK(θ, θ)

where K(θ, θ′) = θ log(θ/θ′)+(1−θ) log{(1−θ)/(1−θ′) is the Kullback-Leiblerdivergence for the Bernoulli law.

2.9.6 Multinomial model

The multinomial distribution Bmθ describes the number of successes in mexperiments when one success has the probability θ ∈ [0, 1] . This distributioncan be viewed as the sum of m binomials with the same parameter θ .

One has

Pθ(Y1 = k) =

(m

k

)θk(1− θ)m−k, k = 0, . . . ,m.

Exercise 2.9.4. Check that the ML approach leads to the estimate

θ =1

mn

∑Yi .

Compute L(θ, θ) .

2.9.7 Exponential model

Let Y1, . . . , Yn be i.i.d. exponential random variables with parameter θ∗ >0 . This means that Yi are nonnegative and satisfy IP (Yi > t) = e−t/θ

∗.

The density of the exponential law w.r.t. the Lebesgue measure is p(y, θ∗) =e−y/θ

∗/θ∗ . The corresponding log-likelihood can be written as

L(θ) = −n log θ −n∑i=1

Yi/θ = −S/θ − n log θ,

where S = Y1 + . . .+ Yn .The ML estimating equation yields S/θ2 = n/θ or

θ = S/n.

For the fitted log-likelihood L(θ, θ) this gives

L(θ, θ) = −n(1− θ/θ)− n log(θ/θ) = nK(θ, θ).

Here once again K(θ, θ′) = θ/θ′− 1− log(θ/θ′) is the Kullback-Leibler diver-gence for the exponential law.

Page 68: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

68 2 Parameter estimation for an i.i.d. model

2.9.8 Poisson model

Let Y1, . . . , Yn be i.i.d. Poisson random variables satisfying IP (Yi = m) =|θ∗|me−θ∗/m! for m = 0, 1, 2, . . . . The corresponding log-likelihood can bewritten as

L(θ) =

n∑i=1

log(θYie−θ/Yi!

)= log θ

n∑i=1

Yi − θ − log(Yi!) = S log θ − nθ +R,

where S = Y1 + . . . + Yn and R =∑ni=1 log(Yi!) . Here we leave out that

0! = 1 .The ML estimating equation immediately yields S/θ = n or

θ = S/n.

For the fitted log-likelihood L(θ, θ) this gives

L(θ, θ) = nθ log(θ/θ)− n(θ − θ) = nK(θ, θ).

Here again K(θ, θ′) = θ log(θ/θ′)− (θ−θ′) is the Kullback-Leibler divergencefor the Poisson law.

2.9.9 Shift of a Laplace (double exponential) law

Let P0 be the symmetric distribution defined by the equations

P0(|Y1| > y) = e−y/σ, y ≥ 0,

for some given σ > 0 . Equivalently one can say that the absolute value of Y1

is exponential with parameter σ under P0 . Now define Pθ by shifting P0

by the value θ . This means that

Pθ(|Y1 − θ| > y) = e−y/σ, y ≥ 0.

The density of Y1 − θ under Pθ is p(y) = (2σ)−1e−|y|/σ . The maximumlikelihood approach leads to maximizing the sum

L(θ) = −n log(2σ)−∑|Yi − θ|/σ,

or equivalently to minimizing the sum∑|Yi − θ| :

θ = argminθ

∑|Yi − θ|. (2.24)

This is just the least absolute deviation estimate given by the median of theedf:

θ = med(Fn).

Page 69: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

2.10 Quasi Maximum Likelihood approach 69

Exercise 2.9.5. Show that the median solves the problem (2.24).Hint: suppose that n is odd. Consider the ordered observations Y(1) ≤

Y(2) ≤ . . . ≤ Y(n) . Show that the median of Pn is given by Y((n+1)/2) . Showthat this point solves (2.24).

2.10 Quasi Maximum Likelihood approach

Let Y = (Y1, . . . , Yn)> be a sample from a marginal distribution P . Let also(Pθ,θ ∈ Θ) be a given parametric family with the log-likelihood `(y,θ) . Theparametric approach is based on the assumption that the underlying distribu-tion P belongs to this family. The quasi maximum likelihood method appliesthe maximum likelihood approach for family (Pθ) even if the underlying dis-tribution P does not belong to this family. This leads again to the estimateθ that maximizes the expression L(θ) =

∑`(Yi,θ) and is called the quasi

MLE. It might happen that the true distribution belongs to some other para-metric family for which one also can construct the MLE. However, there couldbe serious reasons for applying the quasi maximum likelihood approach evenin this misspecified case. One of them is that the properties of the estimate θare essentially determined by the geometrical structure of the log-likelihood.The use of a parametric family with a nice geometric structure (which arequadratic or convex functions of the parameter) can seriously simplify thealgorithmic burdens and improve the behavior of the method.

2.10.1 LSE as quasi likelihood estimation

Consider the model

Yi = θ∗ + εi (2.25)

where θ∗ is the parameter of interest from IR and εi are random errorssatisfying IEεi = 0 . The assumption that εi are i.i.d. normal N(0, σ2) leadsto the quasi log-likelihood

L(θ) = −n2

log(2πσ2)− 1

2σ2

∑(Yi − θ)2.

Maximizing the expression L(θ) leads to minimizing the sum of squared resid-uals (Yi − θ)2 :

θ = argminθ

∑(Yi − θ)2 =

1

n

∑Yi .

This estimate is called a least squares estimate (LSE) or ordinary least squaresestimate (oLSE).

Page 70: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

70 2 Parameter estimation for an i.i.d. model

Example 2.10.1. Consider the model (2.25) with heterogeneous errors, thatis, εi are independent normal with zero mean and variances σ2

i . The corre-sponding log-likelihood reads

L◦(θ) = −1

2

∑{log(2πσ2

i ) +(Yi − θ)2

σ2i

}.

The MLE θ◦ is

θ◦def= argmax

θL◦(θ) = N−1

∑Yi/σ

2i , N =

∑σ−2i .

We now compare the estimates θ and θ◦ .

Lemma 2.10.1. The following assertions hold for the estimate θ :

1. θ is unbiased: IEθ∗ θ = θ∗ .2. The quadratic risk of θ is equal to the variance Var(θ) given by

R(θ, θ∗)def= IEθ∗ |θ − θ∗|2 = Var(θ) = n−2

∑σ2i .

3. θ is not R-efficient unless all σ2i are equal.

Now we consider the MLE θ◦ .

Lemma 2.10.2. The following assertions hold for the estimate θ◦ :

1. θ◦ is unbiased: IEθ∗ θ◦ = θ∗ .

2. The quadratic risk of θ◦ is equal to the variance Var(θ◦) given by

R(θ◦, θ∗)def= IEθ∗ |θ◦ − θ∗|2 = Var(θ◦) = N−2

∑σ−2i = N−1.

3. θ◦ is R-efficient.

Exercise 2.10.1. Check the statements of Lemma 2.10.1 and 2.10.2.Hint: compute the Fisher information for the model (2.25) using the prop-

erty of additivity:

F(θ) =∑

F(i)(θ) =∑

σ−2i = N,

where F(i)(θ) is the Fisher information in the marginal model Yi = θ+εi withjust one observation Yi . Apply the Cramer–Rao inequality for one observationof the vector Y .

Page 71: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

2.10 Quasi Maximum Likelihood approach 71

2.10.2 LAD and robust estimation as quasi likelihood estimation

Consider again the model (2.25). The classical least squares approach facesserious problems if the available data Y are contaminated with outliers. Thereasons for contamination could be missing data or typing errors, etc. Un-fortunately, even a single outlier can significantly disturb the sum L(θ) and

thus, the estimate θ . A typical approach proposed and developed by Huberis to apply another “influence function” ψ(Yi − θ) in the sum L(θ) in placeof the squared residual |Yi − θ|2 leading to the M-estimate

θ = argminθ

∑ψ(Yi − θ). (2.26)

A popular ψ -function for robust estimation is the absolute value |Yi − θ| .The resulting estimate

θ = argminθ

∑|Yi − θ|

is called least absolute deviation and the solution is the median of the empiricaldistribution Pn . Another proposal is called the Huber function: it is quadraticin a vicinity of zero and linear outside:

ψ(x) =

{x2 if |x| ≤ t,a|x|+ b otherwise.

Exercise 2.10.2. Show that for each t > 0 , the coefficients a = a(t) and b =b(t) can be selected to provide that ψ(x) and its derivative are continuous.

A remarkable fact about this approach is that every such estimate can beviewed as a quasi MLE for the model (2.25). Indeed, for a given function ψ ,define the measure Pθ with the log-density `(y, θ) = −ψ(y − θ) . Then thelog-likelihood is L(θ) = −

∑ψ(Yi − θ) and the corresponding (quasi) MLE

coincides with (2.26).

Exercise 2.10.3. Suggest a σ -finite measure µ such that exp{−ψ(y − θ)

}is the density of Yi for the model (2.25) w.r.t. the measure µ .

Hint: suppose for simplicity that

Cψdef=

∫exp{−ψ(x)

}dx <∞.

Show that C−1ψ exp

{−ψ(y− θ)

}is a density w.r.t. the Lebesgue measure for

any θ .

Exercise 2.10.4. Show that the LAD θ = argminθ∑|Yi − θ| is the quasi

MLE for the model (2.25) when the errors εi are assumed Laplacian (doubleexponential) with density p(x) = (1/2)e−|x| .

Page 72: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

2.11 Univariate exponential families

Most parametric families considered in the previous sections are particularcases of exponential families (EF) distributions. This includes the Gaussianshift, Bernoulli, Poisson, exponential, volatility models. The notion of an EFalready appeared in the context of the Cramer–Rao inequality. Now we studysuch families in further detail.

We say that P is an exponential family if all measures Pθ ∈ P are dom-inated by a σ -finite measure µ0 on Y and the density functions p(y, θ) =dPθ/dµ0(y) are of the form

p(y, θ)def=

dPθdµ0

(y) = p(y)eyC(θ)−B(θ).

Here C(θ) and B(θ) are some given nondecreasing functions on Θ and p(y)is a nonnegative function on Y .

Usually one assumes some regularity conditions on the family P . Onepossibility was already given when we discussed the Cramer–Rao inequality;see Definition 2.5.1. Below we assume that that condition is always fulfilled.It basically means that we can differentiate w.r.t. θ under the integral sign.

For an EF, the log-likelihood admits an especially simple representation,nearly linear in y :

`(y, θ)def= log p(y, θ) = yC(θ)−B(θ) + log p(y)

so that the log-likelihood ratio for θ, θ′ ∈ Θ reads as

`(y, θ, θ′)def= `(y, θ)− `(y, θ′) = y

[C(θ)− C(θ′)

]−[B(θ)−B(θ′)

].

2.11.1 Natural parametrization

Let P =(Pθ)

be an EF. By Y we denote one observation from the distribu-tion Pθ ∈ P . In addition to the regularity conditions, one often assumes thenatural parametrization for the family P which means the relation EθY = θ .Note that this relation is fulfilled for all the examples of EF’s that we consid-ered so far in the previous section. It is obvious that the natural parametriza-tion is only possible if the following identifiability condition is fulfilled: forany two different measures from the considered parametric family, the cor-responding mean values are different. Otherwise the natural parametrizationis always possible: just define θ as the expectation of Y . Below we use theabbreviation EFn for an exponential family with natural parametrization.

Some properties of an EFn

The natural parametrization implies an important property for the functionsB(θ) and C(θ) .

Page 73: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

2.11 Univariate exponential families 73

Lemma 2.11.1. Let(Pθ)

be a naturally parameterized EF. Then

B′(θ) = θC ′(θ).

Proof. Differentiating both sides of the equation∫p(y, θ)µ0(dy) = 1 w.r.t. θ

yields

0 =

∫ {yC ′(θ)−B′(θ)

}p(y, θ)µ0(dy)

=

∫ {yC ′(θ)−B′(θ)

}Pθ(dy)

= θC ′(θ)−B′(θ)

and the result follows.

The next lemma computes the important characteristics of a natural EFsuch as the Kullback-Leibler divergence K(θ, θ′) = Eθ log

(p(Y, θ)/p(Y, θ′)

),

the Fisher information F(θ)def= Eθ|`′(Y, θ)|2 , and the rate function m(µ, θ, θ′) =

− logEθ exp{µ`(Y, θ, θ′)

}.

Lemma 2.11.2. Let (Pθ) be an EFn. Then with θ, θ′ ∈ Θ fixed, it holds for

• the Kullback-Leibler divergence K(θ, θ′) = Eθ log(p(Y, θ)/p(Y, θ′)

):

K(θ, θ′) =

∫log

p(y, θ)

p(y, θ′)Pθ(dy)

={C(θ)− C(θ′)

}∫yPθ(dy)−

{B(θ)−B(θ′)

}= θ

{C(θ)− C(θ′)

}−{B(θ)−B(θ′)

}; (2.27)

• the Fisher information F(θ)def= Eθ|`′(Y, θ)|2 :

F(θ) = C ′(θ);

• the rate function m(µ, θ, θ′) = − logEθ exp{µ`(Y, θ, θ′)

}:

m(µ, θ, θ′) = K(θ, θ + µ(θ′ − θ)

);

• the variance Varθ(Y ) :

Varθ(Y ) = 1/F(θ) = 1/C ′(θ). (2.28)

Page 74: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

74 2 Parameter estimation for an i.i.d. model

Proof. Differentiating the equality

0 ≡∫

(y − θ)Pθ(dy) =

∫(y − θ)eL(y,θ)µ0(dy)

w.r.t. θ implies in view of Lemma 2.11.1

1 ≡ IEθ[(Y − θ)

{C ′(θ)Y −B′(θ)

}]= C ′(θ)IEθ(Y − θ)2.

This yields Varθ(Y ) = 1/C ′(θ) . This leads to the following representation ofthe Fisher information:

F(θ) = Varθ[`′(Y, θ)

]= Varθ[C

′(θ)Y −B′(θ)] =[C ′(θ)

]2Varθ(Y ) = C ′(θ).

Exercise 2.11.1. Check the equations for the Kullback-Leibler divergenceand Fisher information from Lemma 2.11.2.

MLE and maximum likelihood for an EFn

Now we discuss the maximum likelihood estimation for a sample from an EFn.The log-likelihood can be represented in the form

L(θ) =

n∑i=1

log p(Yi, θ) = C(θ)

n∑i=1

Yi −B(θ)

n∑i=1

1 +

n∑i=1

log p(Yi) (2.29)

= SC(θ)− nB(θ) +R,

where

S =

n∑i=1

Yi, R =

n∑i=1

log p(Yi).

The remainder term R is unimportant because it does not depend on θand thus it does not enter in the likelihood ratio. The maximum likelihoodestimate θ is defined by maximizing L(θ) w.r.t. θ , that is,

θ = argmaxθ∈Θ

L(θ) = argmaxθ∈Θ

{SC(θ)− nB(θ)

}.

In the case of an EF with the natural parametrization, this optimizationproblem admits a closed form solution given by the next theorem.

Theorem 2.11.1. Let (Pθ) be an EFn. Then the MLE θ fulfills

θ = S/n = n−1n∑i=1

Yi .

Page 75: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

2.11 Univariate exponential families 75

It holds

IEθ θ = θ, Varθ(θ) = [nF(θ)]−1 = [nC ′(θ)]−1

so that θ is R-efficient. Moreover, the fitted log-likelihood L(θ, θ)def= L(θ) −

L(θ) satisfies for any θ ∈ Θ :

L(θ, θ) = nK(θ, θ). (2.30)

Proof. Maximization of L(θ) w.r.t. θ leads to the estimating equationnB′(θ)− SC ′(θ) = 0 . This and the identity B′(θ) = θC ′(θ) yield the MLE

θ = S/n.

The variance Varθ(θ) is computed using (2.28) from Lemma 2.11.2. The for-mula (2.27) for the Kullback-Leibler divergence and (2.29) yield the represen-

tation (2.30) for the fitted log-likelihood L(θ, θ) for any θ ∈ Θ .

One can see that the estimate θ is the mean of the Yi ’s. As for theGaussian shift model, this estimate can be motivated by the fact that theexpectation of every observation Yi under Pθ is just θ and by the law oflarge numbers the empirical mean converges to its expectation as the samplesize n grows.

2.11.2 Canonical parametrization

Another useful representation of an EF is given by the so-called canonicalparametrization. We say that υ is the canonical parameter for this EF if thedensity of each measure Pυ w.r.t. the dominating measure µ0 is of the form:

p(y, υ)def=

dPυdµ0

(y) = p(y) exp{yυ − d(υ)

}.

Here d(υ) is a given convex function on Θ and p(y) is a nonnegative func-tion on Y . The abbreviation EFc will indicate an EF with the canonicalparametrization.

Some properties of an EFc

The next relation is an obvious corollary of the definition:

Lemma 2.11.3. An EFn (Pθ) always permits a unique canonical represen-tation. The canonical parameter υ is related to the natural parameter θ byυ = C(θ) , d(υ) = B(θ) and θ = d′(υ) .

Page 76: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

76 2 Parameter estimation for an i.i.d. model

Proof. The first two relations follow from the definition. They imply B′(θ) =d′(υ) · dυ/dθ = d′(υ) · C ′(θ) and the last statement follows from B′(θ) =θC ′(θ) .

The log-likelihood ratio `(y, υ, υ1) for an EFc reads as

`(Y, υ, υ1) = Y (υ − υ1)− d(υ) + d(υ1).

The next lemma collects some useful facts about an EFc.

Lemma 2.11.4. Let P =(Pυ, υ ∈ U

)be an EFc and let the function d(·)

be two times continuously differentiable. Then it holds for any υ, υ1 ∈ U :

(i). The mean EυY and the variance Varυ(Y ) fulfill

EυY = d′(υ), Varυ(Y ) = Eυ(Y − EυY )2 = d′′(υ).

(ii). The Fisher information F(υ)def= Eυ|`′(Y, υ)|2 satisfies

F(υ) = d′′(υ).

(iii). The Kullback-Leibler divergence Kc(υ, υ1) = Eυ`(Y, υ, υ1) satisfies

Kc(υ, υ1) =

∫log

p(y, υ)

p(y, υ1)Pυ(dy)

= d′(υ)(υ − υ1

)−{d(υ)− d(υ1)

}= d′′(υ) (υ1 − υ)2/2,

where υ is a point between υ and υ1 . Moreover, for υ ≤ υ1 ∈ U

Kc(υ, υ1) =

∫ υ1

υ

(υ1 − u)d′′(u)du.

(iv). The rate function m(µ, υ1, υ)def= − log IEυ exp

{µ`(Y, υ1, υ)

}fulfills

m(µ, υ1, υ) = µKc(υ, υ1

)−Kc

(υ, υ + µ(υ1 − υ)

)Proof. Differentiating the equation

∫p(y, υ)µ0(dy) = 1 w.r.t. υ yields∫ {

y − d′(υ)}p(y, υ)µ0(dy) = 0,

that is, EυY = d′(υ) . The expression for the variance can be proved by onemore differentiating of this equation. Similarly one can check (ii) . The item(iii) can be checked by simple algebra and (iv) follows from (i) .

Page 77: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

2.11 Univariate exponential families 77

Further, for any υ, υ1 ∈ U , it holds

`(Y, υ1, υ)− Eυ`(Y, υ1, υ) = (υ1 − υ){Y − d′(υ)

}and with u = µ(υ1 − υ)

logEυ exp{u(Y − d′(υ)

)}= −ud′(υ) + d(υ + u)− d(υ) + logEυ exp

{uY − d(υ + u) + d(υ)

}= d(υ + u)− d(υ)− ud′(υ) = Kc(υ, υ + u),

because

Eυ exp{uY − d(υ + u) + d(υ)

}= Eυ

dPυ+u

dPυ= 1

and (iv) follows by (iii) .

Table 2.1 presents the canonical parameter and the Fisher information forthe examples of exponential families from Section 2.9.

Table 2.1. υ(θ) , d(υ) , F(υ) = d′′(υ) and θ = θ(υ) for the examples from Sec-tion 2.9.

Model υ d(υ) I(υ) θ(υ)

Gaussian regression θ/σ2 υ2σ2/2 σ2 σ2υ

Bernoulli model log(θ/(1 − θ)

)log(1 + eυ) eυ/(1 + eυ)2 eυ/(1 + eυ)

Poisson model log θ eυ eυ eυ

Exponential model 1/θ − log υ 1/υ2 1/υ

Volatility model −1/(2θ) − 12

log(−2υ) 1/(2υ2) −1/(2υ)

Exercise 2.11.2. Check (iii) and (iv) in Lemma 2.11.4.

Exercise 2.11.3. Check the entries of Table 2.1.

Exercise 2.11.4. Check that Kc(υ, υ′) = K(θ(υ), θ(υ′)

)Exercise 2.11.5. Plot Kc(υ∗, υ) as a function of υ for the families fromTable 2.1.

Page 78: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

78 2 Parameter estimation for an i.i.d. model

Maximum likelihood estimation for an EFc

The structure of the log-likelihood in the case of the canonical parametrizationis particularly simple:

L(υ) =

n∑i=1

log p(Yi, υ) = υ

n∑i=1

Yi − d(υ)

n∑i=1

1 +

n∑i=1

log p(Yi)

= Sυ − nd(υ) +R

where

S =

n∑i=1

Yi, R =

n∑i=1

log p(Yi).

Again, as in the case of an EFn, we can ignore the remainder term R . Theestimating equation dL(υ)/dυ = 0 for the maximum likelihood estimate υreads as

d′(υ) = S/n.

This and the relation θ = d′(υ) lead to the following result.

Theorem 2.11.2. The maximum likelihood estimates θ and υ for the natu-ral and canonical parametrization are related by the equations

θ = d′(υ) υ = C(θ).

The next result describes the structure of the fitted log-likelihood andbasically repeats the result of Theorem 2.11.1.

Theorem 2.11.3. Let (Pυ) be an EF with canonical parametrization. Then

for any υ ∈ U the fitted log-likelihood L(υ, υ)def= maxυ′ L(υ′, υ) satisfies

L(υ, υ) = nKc(υ, υ).

Exercise 2.11.6. Check the statement of Theorem 2.11.3.

2.11.3 Deviation probabilities for the maximum likelihood

Let Y1, . . . , Yn be i.i.d. observations from an EF P . This section presents aprobability bound for the fitted likelihood. To be more specific we assume thatP is canonically parameterized, P = (Pυ) . However, the bound applies to thenatural and any other parametrization because the value of maximum of thelikelihood process L(θ) does not depend on the choice of parametrization.

Page 79: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

2.11 Univariate exponential families 79

The log-likelihood ratio L(υ′, υ) is given by the expression (2.29) and itsmaximum over υ′ leads to the fitted log-likelihood L(υ, υ) = nKc(υ, υ) .

Our first result concerns a deviation bound for L(υ, υ) . It utilizes therepresentation for the fitted log-likelihood given by Theorem 2.11.1. As usual,we assume that the family P is regular. In addition, we require the followingcondition.

(Pc) P = (Pυ, υ ∈ U ⊆ IR) is a regular EF. The parameter set U is convex.The function d(υ) is two times continuously differentiable and the Fisherinformation F(υ) = d′′(υ) satisfies F(υ) > 0 for all υ .

The condition (Pc) implies that for any compact set U0 there is a con-stant a = a(U0) > 0 such that

|F(υ1)/F(υ2)|1/2 ≤ a, υ1, υ2 ∈ U0 .

Theorem 2.11.4. Let Yi be i.i.d. from a distribution Pυ∗ which belongs toan EFc satisfying (Pc) . For any z > 0

IPυ∗(L(υ, υ∗) > z

)= IPυ∗

(nKc(υ, υ∗) > z

)≤ 2e−z.

Proof. The proof is based on two properties of the log-likelihood. The first oneis that the expectation of the likelihood ratio is just one: IEυ∗ expL(υ, υ∗) =1 . This and the exponential Markov inequality imply for z ≥ 0

IPυ∗(L(υ, υ∗) ≥ z

)≤ e−z. (2.31)

The second property is specific to the considered univariate EF and is basedon geometric properties of the log-likelihood function: linearity in the obser-vations Yi and convexity in the parameter υ . We formulate this importantfact in a separate statement.

Lemma 2.11.5. Let the EFc P fulfill (Pc) . For given z and any υ0 ∈ U ,there exist two values υ+ > υ0 and υ− < υ0 satisfying Kc(υ±, υ0) = z/nsuch that

{L(υ, υ0) > z} ⊆ {L(υ+, υ0) > z} ∪ {L(υ−, υ0) > z}.

Proof. It holds

{L(υ, υ0) > z} ={

supυ

[S(υ − υ0

)− n

{d(υ)− d(υ0)

}]> z}

⊆{S > inf

υ>υ0

z + n{d(υ)− d(υ0)

}υ − υ0

}∪{−S > inf

υ<υ0

z + n{d(υ)− d(υ0)

}υ0 − υ

}.

Define for every u > 0

Page 80: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

80 2 Parameter estimation for an i.i.d. model

f(u) =z + n

{d(υ0 + u)− d(υ0)

}u

.

This function attains its minimum at a point u satisfying the equation

z/n+ d(υ0 + u)− d(υ0)− d′(υ0 + u)u = 0

or, equivalently,

K(υ0 + u, υ0) = z/n.

The condition (Pc) provides that there is only one solution u ≥ 0 of thisequation.

Exercise 2.11.7. Check that the equation K(υ0 + u, υ0) = z/n has only onepositive solution for any z > 0 .Hint: use that K(υ0 + u, υ0) is a convex function of u with minimum atu = 0 .

Now, it holds with υ+ = υ0 + u{S > inf

υ>υ0

z + n[d(υ)− d(υ0)

]υ − υ0

}=

{S >

z + n[d(υ+)− d(υ0)

]υ+ − υ0

}⊆ {L(υ+, υ0) > z}.

Similarly{−S > inf

υ<υ0

z + n{d(υ)− d(υ0)

}υ0 − υ

}=

{−S >

z + n[d(υ−)− d(υ0)

]υ0 − υ−

}⊆ {L(υ−, υ0) > z}.

for some υ− < υ0 .

The assertion of the theorem is now easy to obtain. Indeed,

IPυ∗(L(υ, υ∗) ≥ z

)≤ IPυ∗

(L(υ+, υ∗) ≥ z

)+ IPυ∗

(L(υ−, υ∗) ≥ z

)≤ 2e−z

yielding the result.

Exercise 2.11.8. Let (Pυ) be a Gaussian shift experiment, that is, Pυ =N(υ, 1) .

• Check that L(υ, υ) = n|υ − υ|2/2 ;• Given z ≥ 0 , find the points υ+ and υ− such that

{L(υ, υ∗) > z} ⊆ {L(υ+, υ∗) > z} ∪ {L(υ−, υ∗) > z}.

Page 81: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

2.11 Univariate exponential families 81

• Plot the mentioned sets {υ : L(υ, υ) > z} , {υ : L(υ+, υ) > z} , and{υ : L(υ−, υ) > z} as functions of υ for a fixed S =

∑Yi .

Remark 2.11.1. Note that the mentioned result only utilizes the geometricstructure of the univariate EFc. The most important feature of the log-likelihood ratio L(υ, υ∗) = S(υ − υ∗) − d(υ) + d(υ∗) is its linearity w.r.t.the stochastic term S . This allows us to replace the maximum over the wholeset U by the maximum over the set consisting of two points υ± . Note that theproof does not rely on the distribution of the observations Yi . In particular,Lemma 2.11.5 continues to hold even within the quasi likelihood approachwhen L(υ) is not the true log-likelihood. However, the bound (2.31) relieson the nature of L(υ, υ∗) . Namely, it utilizes that IE exp

{L(υ±, υ∗)

}= 1 ,

which is true under IP = IPυ∗ nut generally false in the quasi likelihood setup.Nevertheless, the exponential bound can be extended to the quasi likelihoodapproach under the condition of bounded exponential moments for L(υ, υ∗) :for some µ > 0 , it should hold IE exp

{µL(υ, υ∗)

}= C(µ) <∞ .

Theorem 2.11.4 yields a simple construction of a confidence interval forthe parameter υ∗ and the concentration property of the MLE υ .

Theorem 2.11.5. Let Yi be i.i.d. from Pυ∗ ∈ P with P satisfying (Pc) .

1. If zα satisfies e−zα ≤ α/2 , then

E(zα) ={υ : nKc

(υ, υ

)≤ zα

}is a α -confidence set for the parameter υ∗ .

2. Define for any z > 0 the set A(z, υ∗) = {υ : Kc(υ, υ∗) ≤ z/n} . Then

IPυ∗(υ /∈ A(z, υ∗)

)≤ 2e−z.

The second assertion of the theorem claims that the estimate υ belongswith a high probability to the vicinity A(z, υ∗) of the central point υ∗ definedby the Kullback-Leibler divergence. Due to Lemma 2.11.4, (iii) Kc(υ, υ∗) ≈F(υ∗) (υ−υ∗)2/2 , where F(υ∗) is the Fisher information at υ∗ . This vicinityis an interval around υ∗ of length of order n−1/2 . In other words, this resultimplies the root-n consistency of υ .

The deviation bound for the fitted log-likelihood from Theorem 2.11.4 canbe viewed as a bound for the normalized loss of the estimate υ . Indeed, definethe loss function ℘(υ′, υ) = K1/2(υ′, υ) . Then Theorem 2.11.4 yields that theloss is with high probability bounded by

√z/n provided that z is sufficiently

large. Similarly one can establish the bound for the risk.

Theorem 2.11.6. Let Yi be i.i.d. from the distribution Pυ∗ which belongsto a canonically parameterized EF satisfying (Pc) . The following propertieshold:

Page 82: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

82 2 Parameter estimation for an i.i.d. model

(i). For any r > 0 there is a constant rr such that

IEυ∗Lr(υ, υ∗) = nrIEυ∗K

r(υ, υ∗) ≤ rr .

(ii). For every λ < 1

IEυ∗ exp{λL(υ, υ∗)

}= IEυ∗ exp

{λnK(υ, υ∗)

}≤ (1 + λ)/(1− λ).

Proof. By Theorem 2.11.4

IEυ∗Lr(υ, υ∗) = −

∫z≥0

zrdIPυ∗{L(υ, υ∗) > z

}= r

∫z≥0

zr−1IPυ∗{L(υ, υ∗) > z

}dz

≤ r

∫z≥0

2zr−1e−zdz

and the first assertion is fulfilled with rr = 2r∫z≥0

zr−1e−zdz . The assertion

(ii) is proved similarly.

Deviation bound for other parameterizations

The results for the maximum likelihood and their corollaries have been statedfor an EFc. An immediate question that arises in this respect is whether theuse of the canonical parametrization is essential. The answer is “no”: a similarresult can be stated for any EF whatever the parametrization is used. Thisfact is based on the simple observation that the maximum likelihood is thevalue of the maximum of the likelihood process; this value does not dependon the parametrization.

Lemma 2.11.6. Let (Pθ) be an EF. Then for any θ

L(θ, θ) = nK(Pθ, Pθ). (2.32)

Exercise 2.11.9. Check the result of Lemma 2.11.6.Hint: use that both sides of (2.32) depend only on measures Pθ, Pθ and noton the parametrization.

Below we write as before K(θ, θ) instead of K(Pθ, Pθ) . The property(2.32) and the exponential bound of Theorem 2.11.4 imply the bound for ageneral EF:

Theorem 2.11.7. Let (Pθ) be a univariate EF. Then for any z > 0

IPθ∗(L(θ, θ∗) > z

)= IPθ∗

(nK(θ, θ∗) > z

)≤ 2e−z.

Page 83: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

2.12 Historical remarks and further reading 83

This result allows us to build confidence sets for the parameter θ∗ andconcentration sets for the MLE θ in terms of the Kullback-Leibler divergence:

A(z, θ∗) = {θ : K(θ, θ∗) ≤ z/n},

E(z) = {θ : K(θ, θ) ≤ z/n}.

Corollary 2.11.1. Let (Pθ) be an EF. If e−zα = α/2 then

IPθ∗(θ 6∈ A(zα, θ

∗))≤ α,

and

IPθ∗(E(zα) 63 θ

)≤ α.

Moreover, for any r > 0

IEθ∗Lr(θ, θ∗) = nrIEθ∗K

r(θ, θ∗) ≤ rr .

Asymptotic against likelihood-based approach

The asymptotic approach recommends to apply symmetric confidence andconcentration sets with width of order [nF(θ∗)]−1/2 :

An(z, θ∗) = {θ : F(θ∗) (θ − θ∗)2 ≤ 2z/n},

En(z) = {θ : F(θ∗) (θ − θ)2 ≤ 2z/n},

E′n(z) = {θ : I(θ) (θ − θ)2 ≤ 2z/n}.

Then asymptotically, i.e. for large n , these sets do approximately the samejob as the non-asymptotic sets A(z, θ∗) and E(z) . However, the differencefor finite samples can be quite significant. In particular, for some cases, e.g.the Bernoulli of Poisson families, the sets An(z, θ∗) and E′n(z) may extendbeyond the parameter set Θ .

2.12 Historical remarks and further reading

The main part of the chapter is inspired by the nice textbook Borokov (1999).The concept of exponential families is credited to Edwin Pitman, Georges Dar-mois, and Bernard Koopman in 1935–1936.

The notion of Kullback–Leibler divergence was originally introduced bySolomon Kullback and Richard Leibler in 1951 as the directed divergence be-tween two distributions. Many its useful properties are studied in monographKullback (1997).

Page 84: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

84 2 Parameter estimation for an i.i.d. model

The Fisher information was discussed by several early statisticians, no-tably Francis Edgeworth. Maximum-likelihood estimation was recommended,analyzed, and vastly popularized by Robert Fisher between 1912 and 1922,although it had been used earlier by Carl Gauss, Pierre-Simon Laplace, Thor-vald Thiele, and Francis Edgeworth.

The Cramer–Rao inequality was independently obtained by Maurice Frechet,Calyampudi Rao, and Harald Cramer around 1943–1945.

For further reading we recommend textbooks by Lehmann and Casella(1998), Borokov (1999), and Strasser (1985). The deviation bound of Theo-rem 2.11.4 follows Polzehl and Spokoiny (2006).

Page 85: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

3

Regression Estimation

This chapter discusses the estimation problem for the regression model. Firsta linear regression model is considered, then a generalized linear modeling isdiscussed. We also mention median and quantile regression.

3.1 Regression model

The (mean) regression model can be written in the form IE(Y |X) = f(X) ,or equivalently,

Y = f(X) + ε, (3.1)

where Y is the dependent (explained) variable and X is the explanatoryvariable (regressor) which can be multidimensional. The target of analysis isthe systematic dependence of the explained variable Y from the explanatoryvariable X . The regression function f describes the dependence of the meanof Y as a function of X . The value ε can be treated as an individual deviation(error). It is usually assumed to be random with zero mean. Below we discussin more detail the components of the regression model (3.1).

3.1.1 Observations

In almost all practical situations, regression analysis is performed on the basisof available data (observations) given in the form of a sample of pairs (Xi, Yi)for i = 1, . . . , n , where n is the sample size. Here Y1, . . . , Yn are observedvalues of the regression variable Y and X1, . . . , Xn are the correspondingvalues of the explanatory variable X . For each observation Yi , the regressionmodel reads as:

Yi = f(Xi) + εi

where εi is the individual i th error.

Page 86: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

86 3 Regression Estimation

3.1.2 Design

The set X1, . . . , Xn of the regressor’s values is called a design. The set X ofall possible values of the regressor X is called the design space. If this set X

is compact, then one speaks of a compactly supported design.The nature of the design can be different for different statistical models.

However, it is important to mention that the design is always observable.Two kinds of design assumptions are usually used in statistical modeling. Adeterministic design assumes that the points X1, . . . , Xn are nonrandom andgiven in advance. Here are typical examples:

Example 3.1.1 (Time series). Let Yt0 , Yt0+1, . . . , YT be a time series. The timepoints t0, t0 + 1, . . . , T build a regular deterministic design. The regressionfunction f explains the trend of the time series Yt as a function of time.

Example 3.1.2 (Imaging). Let Yij be the observed grey value at the pixel(i, j) of an image. The coordinate Xij of this pixel is the correspondingdesign value. The regression function f(Xij) gives the true image value atXij which is to be recovered from the noisy observations Yij .

If the design is supported on a cube in IRd and the design points Xi forma grid in this cube, then the design is called equidistant. An important featureof such a design is that the number NA of design points in any “massive”subset A of the unit cube is nearly the volume of this subset VA multipliedby the sample size n : NA ≈ nVA . Design regularity means that the valueNA is nearly proportional to nVA , that is, NA ≈ cnVA for some positiveconstant c which may depend on the set A .

In some applications, it is natural to assume that the design values Xi arerandomly drawn from some design distribution. Typical examples are givenby sociological studies. In this case one speaks of a random design. The designvalues X1, . . . , Xn are assumed to be independent and identically distributedfrom a law PX on the design space X which is a subset of the Euclideanspace IRd . The design variables X are also assumed to be independent ofthe observations Y .

One special case of random design is the uniform design when the designdistribution is uniform on the unit cube in IRd . The uniform design possessesa similar, important property to an equidistant design: the number of designpoints in a “massive” subset of the unit cube is on average close to the volumeof this set multiplied by n . The random design is called regular on X if thedesign distribution is absolutely continuous with respect to the Lebesgue mea-sure and the design density ρ(x) = dPX(x)/dλ is positive and continuous onX . This again ensures with a probability close to one the regularity propertyNA ≈ cnVA with c = ρ(x) for some x ∈ A .

It is worth mentioning that the case of a random design can be reduced tothe case of a deterministic design by considering the conditional distributionof the data given the design variables X1, . . . , Xn .

Page 87: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

3.1 Regression model 87

3.1.3 Errors

The decomposition of the observed response variable Y into the systematiccomponent f(x) and the error ε in the model equation (3.1) is not formallydefined and cannot be done without some assumptions on the errors εi . Thestandard approach is to assume that the mean value of every εi is zero.Equivalently this means that the expected value of the observation Yi is justthe regression function f(Xi) . This case is called mean regression or simplyregression. It is usually assumed that the errors εi have finite second moments.Homogeneous errors case means that all the errors εi have the same varianceσ2 = Var ε2

i . The variance of heterogeneous errors εi may vary with i . Inmany applications not only the systematic component f(Xi) = IEYi but alsothe error variance VarYi = Var εi depend on the regressor (location) Xi .Such models are often written in the form

Yi = f(Xi) + σ(Xi)εi .

The observation (noise) variance σ2(x) can be the target of analysis similarlyto the mean regression function.

The assumption of zero mean noise, IEεi = 0 , is very natural and hasa clear interpretation. However, in some applications, it can cause trouble,especially if data are contaminated by outliers. In this case, the assumption ofa zero mean can be replaced by a more robust assumption of a zero median.This leads to the median regression model which assumes IP (εi ≤ 0) = 1/2 ,or, equivalently

IP(Yi − f(Xi) ≤ 0

)= 1/2.

A further important assumption concerns the joint distribution of the errorsεi . In the majority of applications the errors are assumed to be independent.However, in some situations, the dependence of the errors is quite natural.One example can be given by time series analysis. The errors εi are definedas the difference between the observed values Yi and the trend function fi atthe i th time moment. These errors are often serially correlated and indicateshort or long range dependence. Another example comes from imaging. Theneighbor observations in an image are often correlated due to the imagingtechnique used for recoding the images. The correlation particularly resultsfrom the automatic movement correction.

For theoretical study one often assumes that the errors εi are not onlyindependent but also identically distributed. This, of course, yields a homoge-neous noise. The theoretical study can be simplified even further if the errordistribution is normal. This case is called Gaussian regression and is denotedas εi ∼ N(0, σ2) . This assumption is very useful and greatly simplifies thetheoretical study. The main advantage of Gaussian noise is that the observa-tions and their linear combinations are also normally distributed. This is an

Page 88: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

88 3 Regression Estimation

exclusive property of the normal law which helps to simplify the expositionand avoid technicalities.

Under the given distribution of the errors, the joint distribution of theobservations Yi is determined by the regression function f(·) .

3.1.4 Regression function

By the equation (3.1), the regression variable Y can be decomposed into asystematic component and a (random) error ε . The systematic component isa deterministic function f of the explanatory variable X called the regressionfunction. Classical regression theory considers the case of linear dependence,that is, one fits a linear relation between Y and X :

f(x) = a+ bx

leading to the model equation

Yi = θ1 + θ2Xi + εi .

Here θ1 and θ2 are the parameters of the linear model. If the regressor x ismultidimensional, then θ2 is a vector from IRd and θ2x becomes the scalarproduct of two vectors. In many practical examples the assumption of lineardependence is too restrictive. It can be extended by several ways. One cantry a more sophisticated functional dependence of Y on X , for instancepolynomial. More generally, one can assume that the regression function f isknown up to the finite-dimensional parameter θ = (θ1, . . . , θp)

> ∈ IRp . Thissituation is called parametric regression and denoted by f(·) = f(·,θ) . If thefunction f(·,θ) depends on θ linearly, that is, f(x,θ) = θ1ψ1(x) + . . . +θpψp(x) for some given functions ψ1, . . . , ψp , then the model is called linearregression. An important special case is given by polynomial regression whenf(x) is a polynomial function of degree p−1 : f(x) = θ1 +θ2x+ . . .+θpx

p−1 .In many applications a parametric form of the regression function cannot

be justified. Then one speaks of nonparametric regression.

3.2 Method of substitution and M-estimation

Observe that the parametric regression equation can be rewritten as

εi = Yi − f(Xi,θ).

If θ is an estimate of the parameter θ , then the residuals εi = Yi−f(Xi, θ)are estimates of the individual errors εi . So, the idea of the method is toselect the parameter estimate θ in a way that the empirical distribution Pnof the residuals εi mimics as well as possible certain prescribed features of

Page 89: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

3.2 Method of substitution and M-estimation 89

the error distribution. We consider one approach called minimum contrastor M-estimation. Let ψ(y) be an influence or contrast function. The maincondition on the choice of this function is that

IEψ(εi + z) ≥ IEψ(εi)

for all i = 1, . . . , n and all z . Then the true value θ∗ clearly minimizes theexpectation of the sum

∑i ψ(Yi − f(Xi,θ)

):

θ∗ = argminθ

IE∑i

ψ(Yi − f(Xi,θ)

).

This leads to the M-estimate

θ = argminθ∈Θ

∑i

ψ(Yi − f(Xi,θ)

).

This estimation method can be treated as replacing the true expectation ofthe errors by the empirical distribution of the residuals.

We specify this approach for regression estimation by the classical exam-ples of least squares, least absolute deviation and maximum likelihood esti-mation corresponding to ψ(x) = x2 , ψ(x) = |x| and ψ(x) = − log ρ(x) ,where ρ(x) is the error density. All these examples belong within frameworkof M-estimation and the quasi maximum likelihood approach.

3.2.1 Mean regression. Least squares estimate

The observations Yi are assumed to follow the model

Yi = f(Xi,θ∗) + εi , IEεi = 0 (3.2)

with an unknown target θ∗ . Suppose in addition that σ2i = IEε2

i <∞ . Thenfor every θ ∈ Θ and every i ≤ n due to (3.2)

IEθ∗{Yi − f(Xi,θ)

}2= IEθ∗

{εi + f(Xi,θ

∗)− f(Xi,θ)}2

= σ2i +

∣∣f(Xi,θ∗)− f(Xi,θ)

∣∣2.This yields for the whole sample

IEθ∗∑{

Yi − f(Xi,θ)}2

=∑{

σ2i +

∣∣f(Xi,θ∗)− f(Xi,θ)

∣∣2}.This expression is clearly minimized at θ = θ∗ . This leads to the idea ofestimating the parameter θ∗ by maximizing its empirical counterpart. Theresulting estimate is called the (ordinary) least squares estimate (LSE):

Page 90: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

90 3 Regression Estimation

θLSE = argminθ∈Θ

∑{Yi − f(Xi,θ)

}2.

This estimate is very natural and requires minimal information about theerrors εi . Namely, one only needs IEεi = 0 and IEε2

i <∞ .

3.2.2 Median regression. Least absolute deviation estimate

Consider the same regression model as in (3.2), but the errors εi are notzero-mean. Instead we assume that their median is zero:

Yi = f(Xi,θ∗) + εi , med(εi) = 0.

As previously, the target of estimation is the parameter θ∗ . Observe thatεi = Yi − f(Xi,θ

∗) and hence, the latter r.v. has median zero. We now usethe following simple fact: if med(ε) = 0 , then for any z 6= 0

IE|ε+ z| ≥ IE|ε|. (3.3)

Exercise 3.2.1. Prove (3.3).

The property (3.3) implies for every θ

IEθ∗∑∣∣Yi − f(Xi,θ)

∣∣ ≥ IEθ∗∑∣∣Yi − f(Xi,θ∗)∣∣,

that is, θ∗ minimizes over θ the expectation under the true measure of thesum

∑∣∣Yi − f(Xi,θ)∣∣ . This leads to the empirical counterpart of θ∗ given

by

θ = argminθ∈Θ

∑∣∣Yi − f(Xi,θ)∣∣.

This procedure is usually referred to as least absolute deviations regressionestimate.

3.2.3 Maximum likelihood regression estimation

Let the density function ρ(·) of the errors εi be known. The regression equa-tion (3.2) implies εi = Yi − f(Xi,θ

∗) . Therefore, every Yi has the densityρ(y − f(Xi,θ

∗)) . Independence of the Yi ’s implies the product structure ofthe density of the joint distribution:∏

ρ(yi − f(Xi,θ)),

yielding the log-likelihood

Page 91: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

3.2 Method of substitution and M-estimation 91

L(θ) =∑

`(Yi − f(Xi,θ))

with `(t) = log ρ(t) . The maximum likelihood estimate (MLE) is the point ofmaximum of L(θ) :

θ = argmaxθ∈Θ

L(θ) = argmaxθ∈Θ

∑`(Yi − f(Xi,θ)).

A closed form solution for this equation exists only in some special cases likelinear Gaussian regression. Otherwise this equation has to be solved numeri-cally.

Consider an important special case corresponding to the i.i.d. Gaussianerrors when ρ(y) is the density of the normal law with mean zero and varianceσ2 . Then

L(θ) = −n2

log(2πσ2)− 1

2σ2

∑∣∣Yi − f(Xi,θ)∣∣2.

The corresponding MLE maximizes L(θ) or, equivalently, minimizes the sum∑∣∣Yi − f(Xi,θ)∣∣2 :

θ = argmaxθ∈Θ

L(θ) = argminθ∈Θ

∑∣∣Yi − f(Xi,θ)∣∣2. (3.4)

This estimate has already been introduced as the ordinary least squares esti-mate (oLSE).

An extension of the previous example is given by inhomogeneous Gaussianregression, when the errors εi are independent Gaussian zero-mean but thevariances depend on i : IEε2

i = σ2i . Then the log-likelihood L(θ) is given by

the sum

L(θ) =∑{

−∣∣Yi − f(Xi,θ)

∣∣22σ2

i

− 1

2log(2πσ2

i )}.

Maximizing this expression w.r.t. θ is equivalent to minimizing the weighted

sum∑σ−2i

∣∣Yi − f(Xi,θ)∣∣2 :

θ = argminθ∈Θ

∑σ−2i

∣∣Yi − f(Xi,θ)∣∣2.

Such an estimate is also-called the weighted least squares (wLSE).Another example corresponds to the case when the errors εi are i.i.d.

double exponential, so that IP (±ε1 > t) = e−t/σ for some given σ > 0 . Thenρ(y) = (2σ)−1e−|y|/σ and

L(θ) = −n log(2σ) − σ^{−1} ∑ |Y_i − f(X_i, θ)|.

The MLE θ̃ maximizes L(θ) or, equivalently, minimizes the sum ∑ |Y_i − f(X_i, θ)|:

θ̃ = argmax_{θ∈Θ} L(θ) = argmin_{θ∈Θ} ∑ |Y_i − f(X_i, θ)|.

So the maximum likelihood regression with Laplacian errors leads back to the least absolute deviations (LAD) estimate.

3.2.4 Quasi Maximum Likelihood approach

This section very briefly discusses an extension of the maximum likelihood approach. A more detailed discussion will be given in the context of linear modeling in Chapter 4. To be specific, consider a regression model

Yi = f(Xi) + εi.

The maximum likelihood approach requires specifying the two main ingredients of this model: a parametric class {f(x, θ), θ ∈ Θ} of regression functions and the distribution of the errors ε_i. Sometimes such information is lacking. One or even both modeling assumptions can be misspecified. In such situations one speaks of a quasi maximum likelihood approach, where the estimate θ̃ is defined via maximizing over θ the random function L(θ) even though it is not necessarily the real log-likelihood. Some examples of this approach have already been given.

Below we distinguish between misspecification of the first and of the second kind. The first kind corresponds to the parametric assumption about the regression function: one assumes the equality f(X_i) = f(X_i, θ*) for some θ* ∈ Θ. In reality one can only expect a reasonable quality of approximating f(·) by f(·, θ*). A typical example is given by linear (polynomial) regression. The linear structure of the regression function is useful and tractable, but it can only be a rough approximation of the real relation between Y and X. The quasi maximum likelihood approach suggests ignoring this misspecification and proceeding as if the parametric assumption were fulfilled. This approach raises a number of questions: what is the target of estimation and what is really estimated by such a quasi ML procedure? In Chapter 4 we show in the context of linear modeling that the target of estimation can be naturally defined as the parameter θ* providing the best approximation of the true regression function f(·) by its parametric counterpart f(·, θ).

The second kind of misspecification concerns the assumption about the errors ε_i. In most applications, the distribution of the errors is unknown. Moreover, the errors can be dependent or non-identically distributed. The assumption of a specific i.i.d. structure leads to a model misspecification and thus, to the quasi maximum likelihood approach. We illustrate this situation by a few examples.


Consider the regression model Y_i = f(X_i, θ*) + ε_i and suppose for a moment that the errors ε_i are i.i.d. normal. Then the principal term of the corresponding log-likelihood is given by the negative sum of the squared residuals ∑ |Y_i − f(X_i, θ)|², and its maximization leads to the least squares method. So, one can say that the LSE method is the quasi MLE when the errors are assumed to be i.i.d. normal. That is, the LSE can be obtained as the MLE for the imaginary Gaussian regression model even when the errors ε_i are not necessarily i.i.d. Gaussian.

If the data are contaminated or the errors have heavy tails, it could be unwise to apply the LSE method. The LAD method is known to be more robust against outliers and data contamination. At the same time, it has already been shown in Section 3.2.3 that the LAD estimate is the MLE when the errors are Laplacian (double exponential). In other words, LAD is the quasi MLE for the model with Laplacian errors.

Inference for the quasi ML approach is discussed in detail in Chapter 4 in the context of linear modeling.

3.3 Linear regression

One standard way of modeling the regression relationship is based on a linear expansion of the regression function. This approach assumes that the unknown regression function f(·) can be represented as a linear combination of given basis functions ψ_1(·), . . . , ψ_p(·):

f(x) = θ1ψ1(x) + . . .+ θpψp(x).

A couple of popular examples are listed in this section. More examples are given below in Section 3.3.1 in the context of projection estimation.

Example 3.3.1 (Multivariate linear regression). Let x = (x_1, . . . , x_d)^⊤ be d-dimensional. The linear regression function f(x) can be written as

f(x) = a+ b1x1 + . . .+ bdxd.

Here we have p = d + 1 and the basis functions are ψ_1(x) ≡ 1 and ψ_m(x) = x_{m−1} for m = 2, . . . , p. The coefficient a is often called the intercept and b_1, . . . , b_d are the slope coefficients. The vector of coefficients θ = (a, b_1, . . . , b_d)^⊤ uniquely describes the linear relation.

Example 3.3.2 (Polynomial regression). Let x be univariate and f(·) be a polynomial function of degree p − 1, that is,

f(x) = θ_1 + θ_2 x + . . . + θ_p x^{p−1}.

Then the basis functions are ψ_1(x) ≡ 1, ψ_2(x) ≡ x, . . . , ψ_p(x) ≡ x^{p−1}, while θ = (θ_1, . . . , θ_p)^⊤ is the corresponding vector of coefficients.


Exercise 3.3.1. Let the regressor x be d-dimensional, x = (x_1, . . . , x_d)^⊤. Describe the basis system and the corresponding vector of coefficients for the case when f is a quadratic function of x.

Linear regression is often described using vector-matrix notation. Let Ψ_i be the vector in IR^p whose entries are the values ψ_m(X_i) of the basis functions at the design point X_i, m = 1, . . . , p. Then f(X_i) = Ψ_i^⊤ θ*, and the linear regression model can be written as

Y_i = Ψ_i^⊤ θ* + ε_i, i = 1, . . . , n.

Denote by Y = (Y_1, . . . , Y_n)^⊤ the vector of observations (responses), and by ε = (ε_1, . . . , ε_n)^⊤ the vector of errors. Let finally Ψ be the p × n matrix with columns Ψ_1, . . . , Ψ_n, that is, Ψ = (ψ_m(X_i))_{m=1,...,p; i=1,...,n}. Note that each row of Ψ is composed of the values of the corresponding basis function ψ_m at the design points X_i. Now the regression equation reads as

Y = Ψ^⊤ θ* + ε.

The estimation problem for this linear model will be discussed in detail in Chapter 4.
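The vector-matrix notation can be illustrated by a minimal sketch: the design matrix Ψ is assembled for a polynomial basis ψ_m(x) = x^{m−1}, and the ordinary LSE is computed from the normal equations. The specific basis, the true coefficients and the noise level below are illustrative assumptions only.

# Sketch: the p x n design matrix Psi for a polynomial basis and the oLSE.
import numpy as np

def design_matrix(X, p):
    # row m holds the values of psi_{m+1}(x) = x^m at the design points
    return np.vstack([X ** m for m in range(p)])        # shape (p, n)

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.sort(rng.uniform(-1.0, 1.0, n))
theta_star = np.array([1.0, -2.0, 0.5])                 # assumed true coefficients
Psi = design_matrix(X, p)
Y = Psi.T @ theta_star + 0.1 * rng.standard_normal(n)   # Y = Psi^T theta* + eps
theta_hat = np.linalg.solve(Psi @ Psi.T, Psi @ Y)       # ordinary LSE (Chapter 4)
print(theta_hat)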

3.3.1 Projection estimation

Consider a (mean) regression model

Yi = f(Xi) + εi , i = 1, . . . , n. (3.5)

The target of the analysis is the unknown nonparametric function f which has to be recovered from the noisy data Y. This approach is usually considered within the nonparametric statistical theory because it avoids fixing any parametric specification of the model function f, and thus, of the distribution of the data Y. This section discusses how this nonparametric problem can be put back into the parametric theory.

The standard way of estimating the regression function f is based on some smoothness assumption about this function. It enables us to expand the given function w.r.t. some given functional basis and to evaluate the accuracy of approximation by finite sums. More precisely, let ψ_1(x), . . . , ψ_m(x), . . . be a given system of functions. Specific examples are trigonometric (Fourier, cosine), orthogonal polynomial (Chebyshev, Legendre, Jacobi), and wavelet systems, among many others. The completeness of this system means that a given function f can be uniquely expanded in the form

f(x) = ∑_{m=1}^∞ θ_m ψ_m(x). (3.6)


A very desirable feature of the basis system is orthogonality:

∫ ψ_m(x) ψ_{m′}(x) µ_X(dx) = 0, m ≠ m′.

Here µ_X can be some design measure on X or the empirical design measure n^{−1} ∑ δ_{X_i}. However, the expansion (3.6) is intractable because it involves infinitely many coefficients θ_m. A standard procedure is to truncate this expansion after the first p terms, leading to the finite approximation

f(x) ≈ ∑_{m=1}^p θ_m ψ_m(x). (3.7)

The accuracy of such an approximation improves as the number p of terms grows. A smoothness assumption helps to estimate the rate of convergence to zero of the remainder term f − θ_1ψ_1 − . . . − θ_pψ_p:

‖f − θ_1ψ_1 − . . . − θ_pψ_p‖ ≤ r_p, (3.8)

where r_p describes the accuracy of approximation of the function f by the considered finite sums uniformly over the class of functions with the prescribed smoothness. The norm used in the definition (3.8) as well as the basis {ψ_m} depend on the particular smoothness class. Popular examples are given by Hölder classes for the L_∞ norm, and Sobolev smoothness for the L_2-norm or more generally the L_s-norm for some s ≥ 1.

A choice of a proper truncation value p is one of the central problems in nonparametric function estimation. With p growing, the quality of approximation (3.7) improves in the sense that r_p → 0 as p → ∞. However, the growth of the parameter dimension yields a growth of model complexity: one has to estimate more and more coefficients. Section 4.7 below briefly discusses how the problem can be formalized and how one can define the optimal choice. However, a rigorous solution is postponed until the next volume. Here we suppose that the value p is fixed for some reason and apply the quasi maximum likelihood parametric approach. Namely, the approximation (3.7) is assumed to be an exact equality: f(x) ≡ f(x, θ*) def= θ*_1 ψ_1(x) + . . . + θ*_p ψ_p(x). Model misspecification, f(·) ≢ f(·, θ) def= θ_1 ψ_1(x) + . . . + θ_p ψ_p(x) for any vector θ ∈ Θ, means a modeling error, or modeling bias. The parametric approach ignores this modeling error and focuses on the error within the model which describes the accuracy of the qMLE θ̃.

The qMLE procedure requires specifying the error distribution which appears in the log-likelihood. In the most general form, let IP_0 be the joint distribution of the error vector ε, and let ρ^{(n)}(ε) be its density function on IR^n. The identities ε_i = Y_i − f(X_i, θ) yield the log-likelihood

L(θ) = log ρ^{(n)}(Y − f(X, θ)). (3.9)


If the errors ε_i are i.i.d. with the density ρ(y), then

L(θ) = ∑ log ρ(Y_i − f(X_i, θ)). (3.10)

The most popular least squares method (3.4) implicitly assumes Gaussian homogeneous noise: the ε_i are i.i.d. N(0, σ²). The LAD approach is based on the assumption of the Laplace error distribution. Categorical data are modeled by a proper exponential family distribution; see Section 3.5. Below we assume that one or another assumption about the errors is fixed and the log-likelihood is described by (3.9) or (3.10). This assumption can be misspecified, and the qMLE analysis has to be done under the true error distribution. Some examples of this sort for linear models are given in Section 4.6.

In the rest of this section we only discuss how the regression function f in (3.5) can be approximated by different series expansions. With the selected expansion and the assumption on the errors, the approximating parametric model is fixed due to (3.9). In most of the examples we only consider a univariate design with d = 1.

3.3.2 Polynomial approximation

It is well known that any smooth function f(·) can be approximated by a polynomial. Moreover, the smoother f(·) is, the better the accuracy of approximation. The Taylor expansion yields an approximation of the form

f(x) ≈ θ_0 + θ_1 x + θ_2 x² + . . . + θ_m x^m. (3.11)

Such an approximation is very natural; however, it is rarely used in statistical applications. The main reason is that the different power functions ψ_m(x) = x^m are highly correlated with each other. This makes it difficult to identify the corresponding coefficients. Instead one can use different polynomial systems which fulfill certain orthogonality conditions.

We say that f(x) is a polynomial of degree m if it can be represented in the form (3.11) with θ_m ≠ 0. Any sequence 1, ψ_1(x), . . . , ψ_m(x) of such polynomials yields a basis in the vector space of polynomials of degree m.

Exercise 3.3.2. Let for each j ≤ m a polynomial ψ_j of degree j be fixed. Then any polynomial P_m(x) of degree m can be represented in a unique way in the form

P_m(x) = c_0 + c_1 ψ_1(x) + . . . + c_m ψ_m(x).

Hint: define c_m = P_m^{(m)}/ψ_m^{(m)} and apply induction to P_m(x) − c_m ψ_m(x).


3.3.3 Orthogonal polynomials

Let µ be any measure on the real line satisfying the condition

∫ x^m µ(dx) < ∞ (3.12)

for any integer m. This enables us to define the scalar product of two polynomial functions f, g by

⟨f, g⟩ def= ∫ f(x) g(x) µ(dx).

With such a Hilbert structure we aim to define an orthonormal system of polynomials ψ_m of degree m for m = 0, 1, 2, . . . such that

⟨ψ_j, ψ_m⟩ = δ_{j,m} = 1I(j = m), j, m = 0, 1, 2, . . . .

Theorem 3.3.1. Given a measure µ satisfying the condition (3.12), there exists a unique orthonormal polynomial system ψ_0, ψ_1, ψ_2, . . . . Any polynomial P_m of degree m can be represented as

P_m(x) = a_0 + a_1 ψ_1(x) + . . . + a_m ψ_m(x)

with

a_j = ⟨P_m, ψ_j⟩. (3.13)

Proof. We construct the functions ψ_m successively. The function ψ_0 is a constant defined by

ψ_0² ∫ µ(dx) = 1.

Suppose now that the orthonormal polynomials ψ_0, ψ_1, . . . , ψ_{m−1} have already been constructed. Define the coefficients

a_j def= ∫ x^m ψ_j(x) µ(dx), j = 0, 1, . . . , m − 1,

and consider the function

g_m(x) def= x^m − a_0 ψ_0 − a_1 ψ_1(x) − . . . − a_{m−1} ψ_{m−1}(x).

This is obviously a polynomial of degree m. Moreover, by orthonormality of the ψ_j's for j < m

∫ g_m(x) ψ_j(x) µ(dx) = ∫ x^m ψ_j(x) µ(dx) − a_j ∫ ψ_j²(x) µ(dx) = 0.

So, one can define ψm by normalization of gm :

ψ_m(x) def= ⟨g_m, g_m⟩^{−1/2} g_m(x).

One can also easily see that the ψ_m defined in this way is the only polynomial of degree m which is orthogonal to ψ_j for j < m and fulfills ⟨ψ_m, ψ_m⟩ = 1, because the number of constraints is equal to the number of coefficients θ_0, . . . , θ_m of ψ_m(x).

Let now P_m be a polynomial of degree m. Define the coefficients a_j by (3.13). Similarly to the above one can show that

P_m(x) − {a_0 + a_1 ψ_1(x) + . . . + a_m ψ_m(x)} ≡ 0,

which implies the second claim.

Exercise 3.3.3. Let {ψ_m} be an orthonormal polynomial system. Show that for any polynomial P_j(x) of degree j < m, it holds ⟨P_j, ψ_m⟩ = 0.
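The constructive proof of Theorem 3.3.1 can be mirrored numerically by a Gram-Schmidt sketch; here the measure µ is taken to be the empirical design measure n^{−1} ∑ δ_{X_i}, which is an assumption made only for this illustration.

# Sketch of the successive construction in Theorem 3.3.1: orthonormal
# polynomials psi_0,...,psi_p w.r.t. the empirical design measure.
import numpy as np

def orthonormal_polynomials(X, p):
    inner = lambda f, g: np.mean(f * g)          # <f, g> under the empirical measure
    monomials = [X ** m for m in range(p + 1)]   # values of x^m at the design points
    psi = []
    for m in range(p + 1):
        g = monomials[m].astype(float)
        for q in psi:                            # subtract projections on psi_0..psi_{m-1}
            g = g - inner(monomials[m], q) * q
        psi.append(g / np.sqrt(inner(g, g)))     # normalize
    return np.vstack(psi)                        # row m holds psi_m(X_1),...,psi_m(X_n)

X = np.linspace(-1, 1, 200)
Psi = orthonormal_polynomials(X, 4)
print(np.round(Psi @ Psi.T / X.size, 6))         # approximately the identity matrix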

Finite approximation and the associated kernel

Let f be a function satisfying

∫ f²(x) µ(dx) < ∞. (3.14)

Then the scalar product a_j = ⟨f, ψ_j⟩ is well defined for all j ≥ 0, leading for each m ≥ 1 to the following approximation:

f_m(x) def= ∑_{j=0}^m a_j ψ_j(x) = ∑_{j=0}^m ∫ f(u) ψ_j(u) µ(du) ψ_j(x) = ∫ f(u) Φ_m(x, u) µ(du) (3.15)

with

Φ_m(x, u) = ∑_{j=0}^m ψ_j(x) ψ_j(u).


Completeness

The accuracy of approximation of f by f_m with m growing is one of the central questions in approximation theory. The answer depends on the regularity of the function f and on the choice of the system {ψ_m}. Let F be a linear space of functions f on the real line satisfying (3.14). We say that the basis system {ψ_m(x)} is complete in F if the identities ⟨f, ψ_m⟩ = 0 for all m ≥ 0 imply f ≡ 0. As ψ_m(x) is a polynomial of degree m, this definition is equivalent to the condition

⟨f, x^m⟩ = 0, m = 0, 1, 2, . . . ⟺ f ≡ 0.

Squared bias and accuracy of approximation

Let f ∈ F be a function in L_2 satisfying (3.14), and let {ψ_m} be a complete basis. Consider the error f(x) − f_m(x) of the finite approximation f_m(x) from (3.15). The Parseval identity yields

∫ f²(x) µ(dx) = ∑_{m=0}^∞ a_m².

This yields that the finite sums ∑_{j=0}^m a_j² converge to the infinite sum ∑_{m=0}^∞ a_m², and the remainder b_m = ∑_{j=m+1}^∞ a_j² tends to zero with m:

b_m def= ‖f − f_m‖² = ∫ |f(x) − f_m(x)|² µ(dx) = ∑_{j=m+1}^∞ a_j² → 0

as m → ∞. The value b_m is often called the squared bias. Below in this section we briefly overview some popular polynomial systems used in approximation theory.

3.3.4 Chebyshev polynomials

Chebyshev polynomials are frequently used in approximation theory because of their very useful features. These polynomials can be defined in many ways: by explicit formulas, recurrent relations, or differential equations, among others.

A trigonometric definition

Chebyshev polynomials are usually defined in the trigonometric form:

T_m(x) = cos(m arccos(x)). (3.16)

Exercise 3.3.4. Check that Tm(x) from (3.16) is a polynomial of degree m .

Hint: use the formula cos((m + 1)u) = 2 cos(u) cos(mu) − cos((m − 1)u) and induction arguments.


Recurrent formula

The trigonometric identity cos((m + 1)u) = 2 cos(u) cos(mu) − cos((m − 1)u) yields the recurrent relation between Chebyshev polynomials:

Tm+1(x) = 2xTm(x)− Tm−1(x), m ≥ 1. (3.17)

Exercise 3.3.5. Describe the first 5 polynomials Tm .

Hint: use T_0(x) ≡ 1 and T_1(x) ≡ x together with the recurrent formula (3.17).
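The recurrence (3.17) is easy to turn into a small numerical sketch; the comparison with the trigonometric definition (3.16) serves only as a sanity check and is not claimed to be the solution of Exercise 3.3.5.

# Sketch: Chebyshev polynomials via the recurrence (3.17).
import numpy as np

def chebyshev_T(m, x):
    # T_0 = 1, T_1 = x, T_{k+1} = 2 x T_k - T_{k-1}
    x = np.asarray(x, dtype=float)
    T_prev, T_cur = np.ones_like(x), x
    if m == 0:
        return T_prev
    for _ in range(m - 1):
        T_prev, T_cur = T_cur, 2 * x * T_cur - T_prev
    return T_cur

x = np.linspace(-1, 1, 5)
print(np.allclose(chebyshev_T(3, x), np.cos(3 * np.arccos(x))))  # definition (3.16)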

The leading coefficient

The recurrent relation (3.17) and the formulas T_0(x) ≡ 1 and T_1(x) ≡ x imply by induction that the leading coefficient of T_m(x) is equal to 2^{m−1} for m ≥ 1. Equivalently,

T_m^{(m)}(x) ≡ 2^{m−1} m!.

Orthogonality and normalization

Consider the measure µ(dx) on the open interval (−1, 1) with the density (1 − x²)^{−1/2} with respect to the Lebesgue measure. By the change of variables x = cos(u) we obtain for all j ≠ m

∫_{−1}^{1} T_m(x) T_j(x) dx/√(1 − x²) = ∫_0^π cos(mu) cos(ju) du = 0.

Moreover, for m ≥ 1

∫_{−1}^{1} T_m²(x) dx/√(1 − x²) = ∫_0^π cos²(mu) du = (1/2) ∫_0^π {1 + cos(2mu)} du = π/2.

Finally,

∫_{−1}^{1} dx/√(1 − x²) = ∫_0^π du = π.

So, the orthonormal system can be defined by normalizing the Chebyshev polynomials T_m(x):

ψ_0(x) ≡ π^{−1/2}, ψ_m(x) = √(2/π) T_m(x), m ≥ 1.


The moment generating function

The bivariate function

f(t, x) = ∑_{m=0}^∞ T_m(x) t^m (3.18)

is called the moment generating function. For the Chebyshev polynomials it holds

f(t, x) = (1 − tx)/(1 − 2tx + t²).

This fact can be proven by using the recurrent formula (3.17) and the following relation:

f(x, t) = 1 + tx + t ∑_{m=1}^∞ T_{m+1}(x) t^m
  = 1 + tx + t ∑_{m=1}^∞ {2x T_m(x) − T_{m−1}(x)} t^m
  = 1 + tx + 2tx {f(x, t) − 1} − t² f(x, t). (3.19)

Exercise 3.3.6. Check (3.19) and (3.18).

Roots of Tm

The identity cos(π(k − 1/2)) ≡ 0 for all integer k yields the roots of the polynomial T_m(x):

x_{k,m} = cos(π(k − 1/2)/m), k = 1, . . . , m. (3.20)

This means that T_m(x_{k,m}) = 0 for k = 1, . . . , m and hence, T_m(x) has exactly m roots on the interval [−1, 1].

Discrete orthogonality

Let x_{1,N}, . . . , x_{N,N} be the roots of T_N due to (3.20): x_{k,N} = cos(π(2k − 1)/(2N)). Define the discrete inner product

⟨T_m, T_j⟩_N = ∑_{k=1}^N T_m(x_{k,N}) T_j(x_{k,N}).

Then it holds similarly to the continuous case

⟨T_m, T_j⟩_N = { 0, m ≠ j;  N/2, m = j ≠ 0;  N, m = j = 0. } (3.21)

Exercise 3.3.7. Prove (3.21).
Hint: use that for all m > 0

∑_{k=1}^N cos(πm(k − 1/2)/N) = 0,

yielding for all m′ ≠ m

∑_{k=1}^N cos(πm(k − 1/2)/N) cos(πm′(k − 1/2)/N) = 0.

Extremes of Tm(x)

Obviously |T_m(x)| ≤ 1 because the cosine function is bounded by one in absolute value. Moreover, cos(kπ) = (−1)^k yields the extreme points e_k with T_m(e_k) = (−1)^k for

e_k = cos(πk/m), k = 0, 1, . . . , m. (3.22)

In particular, the edge points x = 1 and x = −1 are extremes of T_m(x).

Exercise 3.3.8. Check that T_m(e_k) = (−1)^k for e_k from (3.22). Show that T_m(1) = 1 and T_m(−1) = (−1)^m. Show that |T_m(x)| < 1 for x ≠ e_k on [−1, 1].

Hint: T_m is a polynomial of degree m, hence it can have at most m − 1 extreme points inside the interval (−1, 1), which are e_1, . . . , e_{m−1}.

Sup-norm

The important feature of the Chebyshev polynomials which makes them very useful in approximation theory is that each of them minimizes the sup-norm over all polynomials of the same degree with fixed leading coefficient.

Theorem 3.3.2. The scaled Chebyshev polynomial f_m(x) = 2^{1−m} T_m(x) minimizes the sup-norm ‖f‖_∞ def= sup_{x∈[−1,1]} |f(x)| over the class of all polynomials of degree m with the leading coefficient 1.


Proof. As |T_m(x)| ≤ 1, the sup-norm of f_m fulfills ‖f_m‖_∞ = 2^{1−m}. Let w(x) be any other polynomial of degree m with leading coefficient one and |w(x)| < 2^{1−m} on [−1, 1]. Consider the difference f_m(x) − w(x) at the extreme points e_k from (3.22). Then f_m(e_k) − w(e_k) > 0 for all even k = 0, 2, 4, . . . and f_m(e_k) − w(e_k) < 0 for all odd k = 1, 3, 5, . . . . This means that this difference has at least m roots on [−1, 1], which is impossible because it is a nonzero polynomial of degree at most m − 1.

Expansion by Chebyshev polynomials and discrete cosine transform

Let f(x) be a measurable function satisfying

∫_{−1}^{1} f²(x) dx/√(1 − x²) < ∞.

Then this function can be uniquely expanded by Chebyshev polynomials:

f(x) = ∑_{m=0}^∞ a_m T_m(x).

The coefficients a_m in this expansion can be obtained by projection:

a_m = ⟨f, T_m⟩ = ∫_{−1}^{1} f(x) T_m(x) dx/√(1 − x²).

However, this method is numerically intensive. Instead, one can use the discrete orthogonality (3.21). Let some N be fixed and x_{k,N} = cos(π(k − 1/2)/N). Then for m ≥ 1

a_m = (1/N) ∑_{k=1}^N f(x_{k,N}) cos(πm(k − 1/2)/N).

This sum can be computed very efficiently via the discrete cosine transform.
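A small sketch shows the discrete computation of the coefficients at the Chebyshev nodes. The normalization used below (2/N for m ≥ 1 and 1/N for m = 0) is the standard convention for reconstructing f as ∑ a_m T_m and may differ by a constant factor from the display above; it is stated here only as an assumption of the sketch. A fast DCT routine could replace the explicit cosine sum.

# Sketch: Chebyshev coefficients from function values at the roots x_{k,N},
# using the discrete orthogonality (3.21).
import numpy as np

def chebyshev_coefficients(f, N):
    k = np.arange(1, N + 1)
    x = np.cos(np.pi * (k - 0.5) / N)            # roots of T_N
    fx = f(x)
    m = np.arange(N)[:, None]                    # coefficients a_0, ..., a_{N-1}
    a = (2.0 / N) * np.sum(fx * np.cos(np.pi * m * (k - 0.5) / N), axis=1)
    a[0] /= 2.0
    return a, x

f = lambda x: np.exp(x)
a, x = chebyshev_coefficients(f, 16)
# reconstruct f at the nodes from the truncated expansion sum_m a_m T_m(x)
recon = sum(a[m] * np.cos(m * np.arccos(x)) for m in range(a.size))
print(np.max(np.abs(recon - f(x))))              # should be tiny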

3.3.5 Legendre polynomials

The Legendre polynomials P_m(x) are often used in physics and in harmonic analysis. They form an orthogonal polynomial system on the interval [−1, 1] w.r.t. the Lebesgue measure, that is,

∫_{−1}^{1} P_m(x) P_{m′}(x) dx = 0, m ≠ m′. (3.23)

They also can be defined as solutions of the Legendre differential equation

d/dx [(1 − x²) d/dx P_m(x)] + m(m + 1) P_m(x) = 0. (3.24)

An explicit representation is given by the Rodrigues formula

P_m(x) = (1/(2^m m!)) d^m/dx^m [(1 − x²)^m]. (3.25)

Exercise 3.3.9. Check that P_m(x) from (3.25) fulfills (3.24).
Hint: differentiate m + 1 times the identity

(x² − 1) d/dx (x² − 1)^m = 2mx (x² − 1)^m,

yielding

2 P_m(x) + 2x d/dx P_m(x) + (x² − 1) d²/dx² P_m(x) = 2m P_m(x) + 2mx d/dx P_m(x).

Orthogonality

The orthogonality property (3.23) can be checked by using the Rodrigues formula.

Exercise 3.3.10. Check that for m < m′

∫_{−1}^{1} P_m(x) P_{m′}(x) dx = 0. (3.26)

Hint: integrate (3.26) by parts m + 1 times with P_m from (3.25) and use that the (m + 1)th derivative of P_m vanishes.

Recursive definition

It is easy to check that P_0(x) ≡ 1 and P_1(x) ≡ x. Bonnet's recursion formula relates three consecutive Legendre polynomials: for m ≥ 1

(m+ 1)Pm+1(x) = (2m+ 1)xPm(x)−mPm−1(x). (3.27)

From Bonnet's recursion formula one obtains by induction the explicit representation

P_m(x) = ∑_{k=0}^m (−1)^k (m choose k)² ((1 + x)/2)^{m−k} ((1 − x)/2)^k.


More recursions

Further, the definition (3.25) yields the following 3-term recursion:

((x² − 1)/m) d/dx P_m(x) = x P_m(x) − P_{m−1}(x). (3.28)

Another recursion, useful for the integration of Legendre polynomials, is

(2m + 1) P_m(x) = d/dx [P_{m+1}(x) − P_{m−1}(x)].

Exercise 3.3.11. Check (3.28) by using the definition (3.25).
Hint: use that

x d^m/dx^m [(1 − x²)^m/m] = 2x d^{m−1}/dx^{m−1} [x(1 − x²)^{m−1}]
  = 2x² d^{m−1}/dx^{m−1} (1 − x²)^{m−1} − 2x d^{m−2}/dx^{m−2} (1 − x²)^{m−1}.

Exercise 3.3.12. Check (3.27).
Hint: use that

d^m/dx^m d/dx [(1 − x²)^{m+1}/(m + 1)] = −2 d^m/dx^m [x(1 − x²)^m]
  = −2x d^m/dx^m [(1 − x²)^m] − 2m d^{m−1}/dx^{m−1} [(1 − x²)^m].

Generating function

The Legendre generating function is defined by

f(t, x) def= ∑_{m=0}^∞ P_m(x) t^m.

It holds

f(t, x) = (1 − 2tx + t²)^{−1/2}. (3.29)

Exercise 3.3.13. Check (3.29).
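A small sketch evaluates the Legendre polynomials by Bonnet's recursion (3.27) and checks the orthogonality (3.23) by a crude quadrature; the value 2/(2m + 1) used for the squared norm is the standard normalization, quoted here only for the check.

# Sketch: Legendre polynomials via Bonnet's recursion and a numerical
# check of the orthogonality (3.23) on [-1, 1].
import numpy as np

def legendre_P(m, x):
    # P_0 = 1, P_1 = x, (k+1) P_{k+1} = (2k+1) x P_k - k P_{k-1}
    x = np.asarray(x, dtype=float)
    P_prev, P_cur = np.ones_like(x), x
    if m == 0:
        return P_prev
    for k in range(1, m):
        P_prev, P_cur = P_cur, ((2 * k + 1) * x * P_cur - k * P_prev) / (k + 1)
    return P_cur

x = np.linspace(-1, 1, 2001)
integral = lambda h: 2.0 * np.mean(h)                       # crude quadrature on [-1, 1]
print(integral(legendre_P(2, x) * legendre_P(3, x)))        # approximately 0
print(integral(legendre_P(3, x) ** 2), 2 / (2 * 3 + 1))     # squared norm 2/(2m+1)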


3.3.6 Lagrange polynomials

In numerical analysis, Lagrange polynomials are used for polynomial interpolation. Lagrange polynomials are also widely applied in cryptography, for instance in Shamir's secret sharing scheme.

Given a set of p + 1 data points (X_0, Y_0), . . . , (X_p, Y_p), where no two X_j are the same, the interpolation polynomial in the Lagrange form is a linear combination

L(x) = ∑_{m=0}^p ℓ_m(x) Y_m (3.30)

of Lagrange basis polynomials

ℓ_m(x) def= ∏_{j=0,...,p, j≠m} (x − X_j)/(X_m − X_j)
  = ((x − X_0)/(X_m − X_0)) · · · ((x − X_{m−1})/(X_m − X_{m−1})) ((x − X_{m+1})/(X_m − X_{m+1})) · · · ((x − X_p)/(X_m − X_p)).

This definition yields ℓ_m(X_m) = 1 and ℓ_m(X_j) = 0 for j ≠ m. Hence, L(X_m) = Y_m for the polynomial L(x) from (3.30). One can easily see that L(x) is the only polynomial of degree p that fulfills L(X_m) = Y_m for all m.

The main disadvantage of the Lagrange form is that any change of the design X_0, . . . , X_p requires recomputing each basis function ℓ_m(x). Another problem is that the Lagrange basis polynomials ℓ_m(x) are not necessarily orthogonal. This explains why these polynomials are rarely used in statistical applications.

Barycentric interpolation

Introduce a polynomial ℓ(x) of degree p + 1 by

ℓ(x) = (x − X_0) · · · (x − X_p).

Then the Lagrange basis polynomials can be rewritten as

ℓ_m(x) = ℓ(x) w_m/(x − X_m)

with the barycentric weights w_m defined by

w_m^{−1} = ∏_{j=0,...,p, j≠m} (X_m − X_j),


which is commonly referred to as the first form of the barycentric interpolation formula. The advantage of this representation is that the interpolation polynomial may now be evaluated as

L(x) = ℓ(x) ∑_{m=0}^p (w_m/(x − X_m)) Y_m,

which, if the weights w_m have been pre-computed, requires only O(p) operations (evaluating ℓ(x) and the weights w_m/(x − X_m)) as opposed to O(p²) for evaluating the Lagrange basis polynomials ℓ_m(x) individually.

The barycentric interpolation formula can also easily be updated to incorporate a new node X_{p+1} by dividing each of the w_m by (X_m − X_{p+1}) and constructing the new w_{p+1} as above.

We can further simplify the first form by first considering the barycentric interpolation of the constant function g(x) ≡ 1:

g(x) = ℓ(x) ∑_{m=0}^p w_m/(x − X_m).

Dividing L(x) by g(x) does not modify the interpolation, yet yields

L(x) = (∑_{m=0}^p w_m Y_m/(x − X_m)) / (∑_{m=0}^p w_m/(x − X_m)),

3.3.7 Hermite polynomials

The Hermite polynomials build an orthogonal system on the whole real line.The explicit representation is given by

Hm(x)def= (−1)mex

2 dm

dxme−x

2

.

Sometimes one uses a "probabilistic" definition

H̄_m(x) def= (−1)^m e^{x²/2} d^m/dx^m e^{−x²/2}.

Exercise 3.3.14. Show that each H_m(x) and H̄_m(x) is a polynomial of degree m.

Hint: Use induction arguments to show that d^m/dx^m e^{−x²} can be represented in the form P_m(x) e^{−x²} with a polynomial P_m(x) of degree m.


Exercise 3.3.15. Check that the leading coefficient of H_m(x) is equal to 2^m, while the leading coefficient of H̄_m(x) is equal to one.

Orthogonality

The Hermite polynomials are orthogonal on the whole real line with the weight function w(x) = e^{−x²}: for j ≠ m

∫_{−∞}^{∞} H_m(x) H_j(x) w(x) dx = 0. (3.31)

Note first that each H_m(x) is a polynomial, so the scalar product (3.31) is well defined. Suppose that m > j. It is obvious that it suffices to check that

∫_{−∞}^{∞} H_m(x) x^j w(x) dx = 0, j < m. (3.32)

Define f_m(x) def= d^m/dx^m e^{−x²}. Obviously f′_{m−1}(x) ≡ f_m(x). Integration by parts yields for any j ≥ 1

∫_{−∞}^{∞} H_m(x) x^j w(x) dx = (−1)^m ∫_{−∞}^{∞} x^j f_m(x) dx = (−1)^m ∫_{−∞}^{∞} x^j f′_{m−1}(x) dx = (−1)^{m−1} j ∫_{−∞}^{∞} x^{j−1} f_{m−1}(x) dx.

By the same arguments, for m ≥ 1

∫_{−∞}^{∞} f_m(x) dx = ∫_{−∞}^{∞} f′_{m−1}(x) dx = f_{m−1}(∞) − f_{m−1}(−∞) = 0. (3.33)

This implies (3.32) and hence the orthogonality property (3.31).
Now we compute the scalar product of H_m(x). Iterating the above identity with j = m implies

∫_{−∞}^{∞} H_m(x) x^m w(x) dx = m! ∫_{−∞}^{∞} e^{−x²} dx = √π m!.

As the leading coefficient of H_m(x) is equal to 2^m, this implies

∫_{−∞}^{∞} H_m²(x) w(x) dx = 2^m ∫_{−∞}^{∞} H_m(x) x^m w(x) dx = √π 2^m m!.

Exercise 3.3.16. Prove the orthogonality of the probabilistic Hermite polynomials H̄_m(x). Compute their norm.


Recurrent formula

The definition yields

H0(x) ≡ 1, H1(x) = 2x.

Further, for m ≥ 1, we use the formula

d/dx H_m(x) = (−1)^m d/dx {e^{x²} d^m/dx^m e^{−x²}} = 2x H_m(x) − H_{m+1}(x), (3.34)

yielding the recurrent relation

Hm+1(x) = 2xHm(x)−H ′m(x).

Moreover, integration by parts, the formula (3.34), and the orthogonality property yield for j < m − 1

∫_{−∞}^{∞} H′_m(x) H_j(x) w(x) dx = −∫_{−∞}^{∞} H_m(x) H′_j(x) w(x) dx = 0.

This means that H′_m(x) is a polynomial of degree m − 1 and it is orthogonal to all H_j(x) for j < m − 1. Thus, H′_m(x) coincides with H_{m−1}(x) up to a multiplicative factor. The leading coefficient of H′_m(x) is equal to m 2^m while the leading coefficient of H_{m−1}(x) is equal to 2^{m−1}, yielding

H ′m(x) = 2mHm−1(x).

This results in another recurrent equation

Hm+1(x) = 2xHm(x)− 2mHm−1(x).

Exercise 3.3.17. Derive the recurrent formulas for the probabilistic Hermite polynomials H̄_m(x).

Generating function

The exponential generating function f(x, t) for the Hermite polynomials is defined as

f(x, t) def= ∑_{m=0}^∞ H_m(x) t^m/m!. (3.35)

It holds

f(x, t) = exp{2xt − t²}.


This can be proved by checking the formula

∂^m/∂t^m f(x, t) = H_m(x − t) f(x, t). (3.36)

Exercise 3.3.18. Check the formula (3.36) and derive (3.35).

Completeness

This property means that the system of the normalized Hermite polynomials builds an orthonormal basis in the L_2 Hilbert space of functions f(x) on the real line satisfying

∫_{−∞}^{∞} f²(x) w(x) dx < ∞.
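A small sketch evaluates the (physicists') Hermite polynomials via the recurrence H_{m+1} = 2x H_m − 2m H_{m−1} derived above and checks the orthogonality and the norm √π 2^m m! by Monte Carlo; the sampling distribution N(0, 1/2) is used because its density is proportional to the weight e^{−x²}, and the check is only approximate.

# Sketch: Hermite polynomials via the recurrence, with a Monte Carlo check
# of the orthogonality w.r.t. w(x) = exp(-x^2).
import math
import numpy as np

def hermite_H(m, x):
    x = np.asarray(x, dtype=float)
    H_prev, H_cur = np.ones_like(x), 2 * x
    if m == 0:
        return H_prev
    for k in range(1, m):
        H_prev, H_cur = H_cur, 2 * x * H_cur - 2 * k * H_prev
    return H_cur

rng = np.random.default_rng(2)
x = rng.normal(0.0, np.sqrt(0.5), 200_000)        # density proportional to exp(-x^2)
inner = lambda j, m: np.sqrt(np.pi) * np.mean(hermite_H(j, x) * hermite_H(m, x))
print(inner(2, 3))                                # approximately 0
print(inner(3, 3), np.sqrt(np.pi) * 2**3 * math.factorial(3))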

3.3.8 Trigonometric series expansion

The trigonometric functions are frequently used in approximation theory, in particular due to their relation to spectral theory. One usually applies either the Fourier basis or the cosine basis.

The Fourier basis is composed of the constant function F_0 ≡ 1 and the functions F_{2m−1}(x) = sin(2mπx) and F_{2m}(x) = cos(2mπx) for m = 1, 2, . . . . These functions are considered on the interval [0, 1] and are all periodic: f(0) = f(1). Therefore, this basis can only be used for approximation of periodic functions.

The cosine basis is composed of the functions S_0 ≡ 1 and S_m(x) = cos(mπx) for m ≥ 1. These functions are periodic for even m and antiperiodic for odd m, which allows one to approximate functions which are not necessarily periodic.

Orthogonality

Trigonometric identities imply orthogonality:

⟨F_m, F_j⟩ = ∫_0^1 F_m(x) F_j(x) dx = 0, j ≠ m. (3.37)

Also, for m ≥ 1,

∫_0^1 F_m²(x) dx = 1/2. (3.38)

Exercise 3.3.19. Check (3.37) and (3.38).


Exercise 3.3.20. Check that for j, m ≥ 1

∫_0^1 S_j(x) S_m(x) dx = (1/2) 1I(j = m).

Many nice features of the Chebyshev polynomials can be translated to the cosine basis by a simple change of variable: with u = cos(πx), it holds S_m(x) = T_m(u). So, any expansion of the function f(u) by the Chebyshev polynomials yields an expansion of f(cos(πx)) by the cosine system.

3.4 Piecewise methods and splines

This section discusses piecewise polynomial methods of approximation of univariate regression functions.

3.4.1 Piecewise constant estimation

Any continuous function can be locally approximated by a constant. This naturally leads to a basis consisting of piecewise constant functions. Let A_1, . . . , A_K be a non-overlapping partition of the design space X:

X = ⋃_{k=1,...,K} A_k, A_k ∩ A_{k′} = ∅, k ≠ k′. (3.39)

We approximate the function f by a finite sum

f(x) ≈ f(x, θ) = ∑_{k=1}^K θ_k 1I(x ∈ A_k). (3.40)

Here θ = (θ_1, . . . , θ_p)^⊤ with p = K. A nice feature of this approximation is that the basis indicator functions ψ_1, . . . , ψ_K are orthogonal because they have non-overlapping supports. For the case of independent errors, this makes the computation of the qMLE θ̃ very simple. In fact, every coefficient θ_k can be estimated independently of the others. Indeed, the general formula (3.10) yields

θ̃ = argmax_θ L(θ) = argmax_θ ∑_{i=1}^n ℓ(Y_i − f(X_i, θ)) = argmax_{θ=(θ_k)} ∑_{k=1}^K ∑_{X_i∈A_k} ℓ(Y_i − θ_k). (3.41)


Exercise 3.4.1. Show that θ̃_k can be obtained by the constant approximation of the data Y_i for X_i ∈ A_k:

θ̃_k = argmax_{θ_k} ∑_{X_i∈A_k} ℓ(Y_i − θ_k), k = 1, . . . , K. (3.42)

A similar formula can be obtained for the target θ* = (θ*_k) = argmax_θ IEL(θ):

θ*_k = argmax_{θ_k} ∑_{X_i∈A_k} IE ℓ(Y_i − θ_k), k = 1, . . . , K.

The estimator θ̃ can be computed explicitly in some special cases. In particular, if ρ corresponds to the density of a normal distribution, then the resulting estimator θ̃_k is nothing but the mean of the observations Y_i over the piece A_k. For Laplacian errors, the solution is the median of the observations over A_k. First we consider the case of a Gaussian likelihood.

Theorem 3.4.1. Let ℓ(y) = −y²/(2σ²) + R be the log-density of a normal law. Then for every k = 1, . . . , K

θ̃_k = (1/N_k) ∑_{X_i∈A_k} Y_i,    θ*_k = (1/N_k) ∑_{X_i∈A_k} IEY_i,

where N_k stands for the number of design points X_i within the piece A_k:

N_k def= ∑_{X_i∈A_k} 1 = #{i : X_i ∈ A_k}.

Exercise 3.4.2. Check the statements of Theorem 3.4.1.

The properties of each estimator θ̃_k repeat those of the MLE for the sample restricted to A_k; see Section 2.9.1.

Theorem 3.4.2. Let θ̃ be defined by (3.41) for a normal density ρ(y). Then with θ* = (θ*_1, . . . , θ*_K)^⊤ = argmax_θ IEL(θ), it holds

IEθ̃_k = θ*_k,   Var(θ̃_k) = (1/N_k²) ∑_{X_i∈A_k} Var(Y_i).

Moreover,

L(θ̃, θ*) = ∑_{k=1}^K (N_k/(2σ²)) (θ̃_k − θ*_k)².


The statements follow by direct calculation on each interval separately.
If the errors ε_i = Y_i − IEY_i are normal and homogeneous, then the distribution of the maximum likelihood L(θ̃, θ*) is available.

Theorem 3.4.3. Consider a Gaussian regression Y_i ∼ N(f(X_i), σ²) for i = 1, . . . , n. Then θ̃_k ∼ N(θ*_k, σ²/N_k) and

L(θ̃, θ*) = ∑_{k=1}^K (N_k/(2σ²)) (θ̃_k − θ*_k)² ∼ χ²_K,

where χ²_K stands for the chi-squared distribution with K degrees of freedom.

This result is again a combination of the results from Section 2.9.1 for the different pieces A_k. It is worth mentioning once again that the regression function f(·) is not assumed to be piecewise constant; it can be an arbitrary function. Each θ̃_k estimates the mean θ*_k of f(·) over the design points X_i within A_k.

The results on the behavior of the maximum likelihood L(θ̃, θ*) are often used for studying the properties of the chi-squared test; see Section 7.1 for more details.

A choice of the partition is an important issue in the piecewise constant approximation. The presented results indicate that the variance of θ̃_k, and hence the estimation error for θ*_k, is inversely proportional to the number of points N_k within each piece A_k. In the univariate case one usually applies the equidistant partition: the design interval is split into p equal intervals A_k, leading to approximately equal values N_k. Sometimes, especially if the design is irregular, a non-uniform partition can be preferable. In general it can be recommended to split the whole design space into intervals with approximately the same number N_k of design points X_i.

A constant approximation is often not accurate enough for a regular regression function. One often uses a linear or polynomial approximation instead. The next sections explain this approach for the case of a univariate regression.
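A minimal sketch of the piecewise constant quasi-ML estimate with a Gaussian log-density follows: by Theorem 3.4.1, θ̃_k is the mean of the Y_i over each piece A_k. The equidistant partition of [0, 1] and the sine regression function are illustrative assumptions; the true f need not be piecewise constant.

# Sketch of the piecewise constant estimate (local means over the pieces A_k).
import numpy as np

def piecewise_constant_fit(X, Y, K):
    # equidistant partition; assumes each piece contains at least one design point
    edges = np.linspace(X.min(), X.max(), K + 1)
    idx = np.clip(np.searchsorted(edges, X, side="right") - 1, 0, K - 1)
    theta = np.array([Y[idx == k].mean() for k in range(K)])
    N = np.array([(idx == k).sum() for k in range(K)])
    return theta, N, edges

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0.0, 1.0, 300))
f = lambda x: np.sin(2 * np.pi * x)
Y = f(X) + 0.3 * rng.standard_normal(X.size)
theta, N, edges = piecewise_constant_fit(X, Y, K=10)
print(theta)                                      # local means over the pieces A_k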

3.4.2 Piecewise linear univariate estimation

The piecewise constant approximation can be naturally extended to piecewise linear and piecewise polynomial constructions. The starting point is again a non-overlapping partition of X into intervals A_k for k = 1, . . . , K. First we explain the idea for the linear approximation of the function f on each interval A_k. Any linear function on A_k can be represented in the form a_k + c_k x with some coefficients a_k, c_k. This yields in total p = 2K coefficients: θ = (a_1, c_1, . . . , a_K, c_K)^⊤. The corresponding function f(·, θ) can be represented as


f(x) ≈ f(x, θ) = ∑_{k=1}^K (a_k + c_k x) 1I(x ∈ A_k).

The non-overlapping structure of the sets A_k yields orthogonality of the basis functions for different pieces. As a corollary, one can optimize the linear approximation on every interval A_k independently of the others.

Exercise 3.4.3. Show that ã_k, c̃_k can be obtained by the linear approximation of the data Y_i for X_i ∈ A_k:

(ã_k, c̃_k) = argmax_{(a_k, c_k)} ∑ ℓ(Y_i − a_k − c_k X_i) 1I(X_i ∈ A_k), k = 1, . . . , K.

On every piece A_k, the constant and the linear function x are not orthogonal except in some very special situations. However, one can easily achieve orthogonality by a shift of the linear term.

Exercise 3.4.4. For each k ≤ K, there exists a point x_k such that

∑ (X_i − x_k) 1I(X_i ∈ A_k) = 0. (3.43)

Introduce for each k ≤ K two basis functions φ_{j−1}(x) = 1I(x ∈ A_k) and φ_j(x) = (x − x_k) 1I(x ∈ A_k) with j = 2k.

Exercise 3.4.5. Assume (3.43) for each k ≤ K. Check that any piecewise linear function can be uniquely represented in the form

f(x) = ∑_{j=1}^p θ_j φ_j(x)

with p = 2K, and that the functions φ_j are orthogonal in the sense that for j ≠ j′

∑_{i=1}^n φ_j(X_i) φ_{j′}(X_i) = 0.

In addition, for each k ≤ K

‖φ_j‖² def= ∑_{i=1}^n φ_j²(X_i) = { N_k, j = 2k − 1;  V_k², j = 2k },

where

N_k def= ∑_{X_i∈A_k} 1,   V_k² def= ∑_{X_i∈A_k} (X_i − x_k)².


In the case of Gaussian regression, orthogonality of the basis helps to obtain a simple closed form for the estimators θ̃ = (θ̃_j):

θ̃_j = (1/‖φ_j‖²) ∑_{i=1}^n Y_i φ_j(X_i) = { (1/N_k) ∑_{X_i∈A_k} Y_i, j = 2k − 1;  (1/V_k²) ∑_{X_i∈A_k} Y_i (X_i − x_k), j = 2k };

see Section 4.2 in the next chapter for a comprehensive study.

3.4.3 Piecewise polynomial estimation

Local linear expansion of the function f(x) on each piece A_k can be extended to the piecewise polynomial case. The basic idea is to apply a polynomial approximation of a certain degree q on each piece A_k independently. One can use for each piece a basis of the form (x − x_k)^m 1I(x ∈ A_k) for m = 0, 1, . . . , q with x_k from (3.43), yielding the approximation

f(x) ≈ f(x, θ) = ∑_{k=1}^K {a_{0,k} + a_{1,k}(x − x_k) + . . . + a_{q,k}(x − x_k)^q} 1I(x ∈ A_k)

with θ = (a_1, . . . , a_K) and a_k = (a_{0,k}, a_{1,k}, . . . , a_{q,k})^⊤. This involves q + 1 parameters for each piece and p = K(q + 1) parameters in total. A nice feature of the piecewise approach is that the coefficients a_k of the piecewise polynomial approximation can be estimated independently for each piece. Namely,

ã_k = argmin_a ∑_{X_i∈A_k} {Y_i − f(X_i, a)}².

The properties of this estimator will be discussed in detail in Chapter 4.

3.4.4 Spline estimation

The main drawback of the piecewise polynomial approximation is that the resulting function f is discontinuous at the edge points between different pieces. A natural way of improving the boundary effect is to force some conditions on the boundary behavior. One important special case is given by the spline system. Let X be an interval on the real line, perhaps infinite. Let also t_0 < t_1 < . . . < t_K be some ordered points in X such that t_0 is the left edge and t_K the right edge of X. Such points are called knots. We say that a function f is a spline of degree q at the knots (t_k) if it is a polynomial of degree q on each span (t_{k−1}, t_k) for k = 1, . . . , K and satisfies the boundary conditions

f^{(m)}(t_k−) = f^{(m)}(t_k+), m = 0, . . . , q − 1, k = 1, . . . , K − 1.


Here f^{(m)}(t−) stands for the left derivative of f at t. In words, the function f and its first q − 1 derivatives are continuous on X, and only the qth derivative may have discontinuities at the knots t_k. It is obvious that the qth derivative f^{(q)}(t) of a spline of degree q is a piecewise constant function on the spans A_k = [t_{k−1}, t_k).

The spline is called uniform if the knots are equidistant, or, in other words, if all the spans A_k have equal length. Otherwise it is non-uniform.

Lemma 3.4.1. The set of all splines of degree q at the knots (t_k) is a linear space, that is, any linear combination of such splines is again a spline. Any function having a continuous mth derivative for m < q and a piecewise constant qth derivative on the spans A_k is a q-spline.

Splines of degree zero are just the piecewise constant functions studied in Section 3.4.1. Linear splines are particularly transparent: this is the set of all piecewise linear continuous functions on X. Each of them can be easily constructed from left to right or from right to left: start with a linear function a_1 + c_1 x on the piece A_1 = [t_0, t_1]. Then f(t_1) = a_1 + c_1 t_1. On the piece A_2 the slope of f can be changed to c_2, leading to the function f(x) = f(t_1) + c_2(x − t_1) for x ∈ [t_1, t_2]. Similarly, at t_2 the slope of f can change to c_3, yielding f(x) = f(t_2) + c_3(x − t_2) on A_3, and so on. Splines of higher order can be constructed similarly step by step: one fixes the polynomial form on the very first piece A_1 and then continues the spline function to every next piece A_k using the boundary conditions and the value of the qth derivative of f on A_k. This construction explains the next result.

Lemma 3.4.2. Each spline f of degree q and knots (t_k) is uniquely described by the vector of coefficients a_1 on the first span and the values f^{(q)}(x) for each span A_1, . . . , A_K.

This result explains that the parameter dimension of the spline space is q + K. One possible basis in this space is given by the monomials x^m of degree m = 0, 1, . . . , q and the functions φ_k(x) def= (x − t_k)^q_+ for k = 1, . . . , K − 1.

Exercise 3.4.6. Check that these q + K functions form a basis in the spline space, and that any q-spline f can be represented as

f(x) = ∑_{m=0}^q α_m x^m + ∑_{k=1}^{K−1} θ_k φ_k(x). (3.44)

Hint: check that the basis functions are linearly independent and that each qth derivative φ_k^{(q)}(x) is piecewise constant.


B–splines

Unfortunately, the basis functions {φ_k(x)} with φ_k(x) = (x − t_k)^q_+ are only useful for theoretical study. The main problem is that these functions are strongly correlated, and recovering the coefficients θ_j in the expansion (3.44) is a hard numerical task. For this reason, one often uses another basis called B-splines. The idea is to build splines of the given degree with minimal support. Each B-spline basis function b_{k,q}(x) is non-zero only on the q + 1 neighboring spans A_k, A_{k+1}, . . . , A_{k+q} for k = 1, . . . , K − q.

Exercise 3.4.7. Let f(x) be a q-spline with support on q′ < q neighboring spans A_k, A_{k+1}, . . . , A_{k+q′−1}. Then f(x) ≡ 0.

Hint: consider any spline of the form f(x) = ∑_{j=k}^{k+q′−1} c_j φ_j(x). Show that the boundary conditions f^{(m)}(t_{k+q′}) = 0 for m = 0, 1, . . . , q yield c_j ≡ 0.

The basis B-spline functions can be constructed successively. For q = 0, the B-splines b_{k,0}(x) coincide with the functions φ_k(x) = 1I(x ∈ A_k), k = 1, . . . , K. Each linear B-spline b_{k,1}(x) has a triangle shape on the two connected intervals A_k and A_{k+1}. It can be defined by the formula

b_{k,1}(x) def= ((x − t_{k−1})/(t_k − t_{k−1})) b_{k,0}(x) + ((t_{k+1} − x)/(t_{k+1} − t_k)) b_{k+1,0}(x), k = 1, . . . , K − 1.

One can continue this way, leading to the Cox-de Boor recursion formula

b_{k,m}(x) def= ((x − t_{k−1})/(t_{k+m−1} − t_{k−1})) b_{k,m−1}(x) + ((t_{k+m} − x)/(t_{k+m} − t_k)) b_{k+1,m−1}(x)

for k = 1, . . . , K − m.

Exercise 3.4.8. Check by induction for each function b_{k,m}(x) the following conditions:

1. b_{k,m}(x) is a polynomial of degree m on each of the spans A_k, . . . , A_{k+m} and zero outside;
2. b_{k,m}(x) can be uniquely represented as a sum b_{k,m}(x) = ∑_{l=0}^{m−1} c_{l,k} φ_{k+l}(x);
3. b_{k,m}(x) is an m-spline.

The formulas simplify for uniform splines with equal span length ∆ = |A_k|:

b_{k,m}(x) def= ((x − t_{k−1})/(m∆)) b_{k,m−1}(x) + ((t_{k+m} − x)/(m∆)) b_{k+1,m−1}(x)

for k = 1, . . . , K − m.


Exercise 3.4.9. Check that

b_{k,m}(x) = ∑_{l=0}^m ω_{l,m} φ_{k+l}(x)

with

ω_{l,m} def= (−1)^l / (∆^m l! (m − l)!).

Smoothing splines

Such a spline system naturally arises as a solution of a penalized maximum likelihood problem. Suppose we are given the regression data (Y_i, X_i) with the univariate design X_1 ≤ X_2 ≤ . . . ≤ X_n. Consider the mean regression model Y_i = f(X_i) + ε_i with zero-mean errors ε_i. The assumption of independent homogeneous Gaussian errors leads to the Gaussian log-likelihood

L(f) = −∑_{i=1}^n |Y_i − f(X_i)|²/(2σ²). (3.45)

Maximization of this expression w.r.t. all possible functions f or, equivalently, all vectors (f(X_1), . . . , f(X_n))^⊤ results in the trivial solution f(X_i) = Y_i. This means that the full dimensional maximum likelihood estimate perfectly reproduces the original noisy data. Some additional assumptions are needed to enforce any desirable feature of the reconstructed function. One popular example is given by smoothness of the function f. The degree of smoothness (or, inversely, the degree of roughness) can be measured by the value

R_q(f) def= ∫_X |f^{(q)}(x)|² dx. (3.46)

One can try to optimize the fit (3.45) subject to a constraint on the amount of roughness from (3.46). Equivalently, one can optimize the penalized log-likelihood

L_λ(f) def= L(f) − λ R_q(f) = −(1/(2σ²)) ∑_{i=1}^n |Y_i − f(X_i)|² − λ ∫_X |f^{(q)}(x)|² dx,

where λ > 0 is a Lagrange multiplier. The corresponding maximizer is the penalized maximum likelihood estimator:

f̃_λ = argmax_f L_λ(f), (3.47)


where the maximum is taken over the class of all measurable functions. It is remarkable that the solution of this optimization problem is a spline of degree q with the knots X_1, . . . , X_n.

Theorem 3.4.4. For any λ > 0 and any integer q, the problem (3.47) has a unique solution which is a q-spline with knots at the design points (X_i).

For the proof we refer to Green and Silverman (1994). Due to this result, one can simplify the problem and look for a spline f which maximizes the objective L_λ(f). A solution to (3.47) is called a smoothing spline. If f is a q-spline, the integral R_q(f) can be easily computed. Indeed, f^{(q)}(x) is piecewise constant, that is, f^{(q)}(x) = c_k for x ∈ A_k, and

R_q(f) = ∑_{k=1}^K c_k² |t_k − t_{k−1}|.

For the uniform design, the formula simplifies even more, and by a change of the multiplier λ, one can use R_q(f) = ∑_k c_k². The use of any parametric representation of a spline function f allows one to represent the optimization problem (3.47) as a penalized least squares problem. Estimation and inference in such problems are studied below in Section 4.7.

3.5 Generalized regression

Let the response Y_i be observed at the design point X_i ∈ IR^d, i = 1, . . . , n. A (mean) regression model assumes that the observed values Y_i are independent and can be decomposed into the systematic component f(X_i) and the individual centered stochastic error ε_i. In some cases such a decomposition is questionable. This especially concerns the case when the data Y_i are categorical, e.g. binary or discrete. Another striking example is given by nonnegative observations Y_i. In such cases one usually assumes that the distribution of Y_i belongs to some given parametric family (P_υ, υ ∈ U) and only the parameter of this distribution depends on the design point X_i. We denote this parameter value as f(X_i) ∈ U and write the model in the form

Yi ∼ Pf(Xi) .

As previously, f(·) is called a regression function and its values at the design points X_i completely specify the joint data distribution:

Y ∼ ∏_i P_{f(X_i)}.

Below we assume that (P_υ) is a univariate exponential family with the log-density ℓ(y, υ).


The parametric modeling approach assumes that the regression function f can be specified by a finite-dimensional parameter θ ∈ Θ ⊂ IR^p: f(x) = f(x, θ). As usual, by θ* we denote the true parameter value. The log-likelihood function for this model reads

L(θ) = ∑_i ℓ(Y_i, f(X_i, θ)).

The corresponding MLE θ̃ maximizes L(θ):

θ̃ = argmax_θ ∑_i ℓ(Y_i, f(X_i, θ)).

The estimating equation ∇L(θ) = 0 reads as

∑_i ℓ′(Y_i, f(X_i, θ)) ∇f(X_i, θ) = 0,

where ℓ′(y, υ) def= ∂ℓ(y, υ)/∂υ.

The approach essentially depends on the parametrization of the considered EF. Usually one applies either the natural or the canonical parametrization. In the case of the natural parametrization, ℓ(y, υ) = C(υ) y − B(υ), where the functions C(·), B(·) satisfy B′(υ) = υ C′(υ). This implies ℓ′(y, υ) = y C′(υ) − B′(υ) = (y − υ) C′(υ), and the estimating equation reads as

∑_i (Y_i − f(X_i, θ)) C′(f(X_i, θ)) ∇f(X_i, θ) = 0.

Unfortunately, a closed form solution of this equation exists only in very special cases. Even the questions of existence and uniqueness of the solution cannot be studied in full generality. Numerical algorithms are usually applied to solve the estimating equation.

Exercise 3.5.1. Specify the estimating equation for generalized EFn regression and find the solution for the case of the constant regression function f(X_i, θ) ≡ θ.
Hint: If f(X_i, θ) ≡ θ, then the Y_i are i.i.d. from P_θ.

The equation can be slightly simplified by using the canonical parametrization. If (P_υ) is an EFc with the log-density ℓ(y, υ) = yυ − d(υ), then the log-likelihood L(θ) can be represented in the form

L(θ) = ∑_i {Y_i f(X_i, θ) − d(f(X_i, θ))}.

The corresponding estimating equation is

∑_i {Y_i − d′(f(X_i, θ))} ∇f(X_i, θ) = 0.

Exercise 3.5.2. Specify the estimating equation for generalized EFc regression and find the solution for the case of constant regression with f(X_i, υ) ≡ υ. Relate the natural and canonical representations.

A generalized regression with a canonical link is often applied in combination with linear modeling of the regression function, considered in the next section.

3.5.1 Generalized linear models

Consider the generalized regression model

Yi ∼ Pf(Xi) ∈ P.

In addition we assume a linear (in parameters) structure of the regression function f(X). Such modeling is particularly useful to combine with the canonical parametrization of the considered EF with the log-density ℓ(y, υ) = yυ − d(υ). The reason is that the stochastic part in the log-likelihood of an EFc linearly depends on the parameter. So, below we assume that P = (P_υ, υ ∈ U) is an EFc.

Linear regression f(X_i) = Ψ_i^⊤θ with given feature vectors Ψ_i ∈ IR^p leads to the model with the log-likelihood

L(θ) = ∑_i {Y_i Ψ_i^⊤θ − d(Ψ_i^⊤θ)}.

Such a setup is called a generalized linear model (GLM). Note that the log-likelihood can be represented as

L(θ) = S^⊤θ − A(θ),

where

S = ∑_i Y_i Ψ_i,   A(θ) = ∑_i d(Ψ_i^⊤θ).

The corresponding MLE θ̃ maximizes L(θ). Again, a closed form solution only exists in special cases. However, an important advantage of the GLM approach is that the solution always exists and is unique. The reason is that the log-likelihood function L(θ) is concave in θ.

Lemma 3.5.1. The MLE θ̃ solves the following estimating equation:

∇L(θ) = S − ∇A(θ) = ∑_i (Y_i − d′(Ψ_i^⊤θ)) Ψ_i = 0. (3.48)

The solution exists and is unique.


Proof. Define the matrix

B(θ) = ∑_i d′′(Ψ_i^⊤θ) Ψ_i Ψ_i^⊤. (3.49)

Since d′′(υ) is strictly positive for all υ, the matrix B(θ) is positive definite as well. It holds

∇²L(θ) = −∇²A(θ) = −∑_i d′′(Ψ_i^⊤θ) Ψ_i Ψ_i^⊤ = −B(θ).

Thus, the function L(θ) is strictly concave w.r.t. θ, and the estimating equation ∇L(θ) = S − ∇A(θ) = 0 has a unique solution θ̃.

The solution of (3.48) can be easily obtained numerically by the Newton-Raphson algorithm: select an initial estimate θ^{(0)}. Then for every k ≥ 0 apply

θ^{(k+1)} = θ^{(k)} + {B(θ^{(k)})}^{−1} {S − ∇A(θ^{(k)})} (3.50)

until convergence.
Below we consider two special cases of GLMs for binary and Poissonian data.

3.5.2 Logit regression for binary data

Suppose that the observed data Y_i are independent and binary, that is, each Y_i is either zero or one, i = 1, . . . , n. Such models are often used e.g. in sociological and medical studies, two-class classification, and binary imaging, among many other fields. We treat each Y_i as a Bernoulli r.v. with the corresponding parameter f_i = f(X_i). This is a special case of generalized regression, also called a binary response model. The parametric modeling assumption means that the regression function f(·) can be represented in the form f(X_i) = f(X_i, θ) for a given class of functions {f(·, θ), θ ∈ Θ ⊂ IR^p}. Then the log-likelihood L(θ) reads as

L(θ) = ∑_i ℓ(Y_i, f(X_i, θ)), (3.51)

where ℓ(y, υ) is the log-density of the Bernoulli law. For linear modeling, it is more useful to work with the canonical parametrization. Then ℓ(y, υ) = yυ − log(1 + e^υ), and the log-likelihood reads

L(θ) = ∑_i [Y_i f(X_i, θ) − log(1 + e^{f(X_i, θ)})].


In particular, if the regression function f(·,θ) is linear, that is, f(Xi,θ) =Ψ>i θ , then

L(θ) =∑i

[YiΨ

>i θ − log(1 + eΨ>i θ)

]. (3.52)

The corresponding estimate reads as

θ = argmaxθ

L(θ) = argmaxθ

∑i

[YiΨ

>i θ − log(1 + eΨ>i θ)

]This modeling is usually referred to as logit regression.

Exercise 3.5.3. Specify the estimating equation for the case of logit regres-sion.

Exercise 3.5.4. Specify the step of the Newton-Raphson procedure for thecase of logit regression.

3.5.3 Parametric Poisson regression

Suppose that the observations Yi are nonnegative integer numbers. The Pois-son distribution is a natural candidate for modeling such data. It is supposedthat the underlying Poisson parameter depends on the regressor Xi . Typicalexamples arise in different types of imaging including medical positron emis-sion and magnet resonance tomography, satellite and low-luminosity imaging,queueing theory, high frequency trading, etc. The regression equation reads

Yi ∼ Poisson(f(Xi)).

The Poisson regression function f(Xi) is usually the target of estimation.The parametric specification f(·) ∈

{f(·,θ), θ ∈ Θ

}reduces this problem to

estimating the parameter θ . Under the assumption of independent observa-tions Yi , the corresponding maximum likelihood L(θ) is given by

L(θ) =∑[

Yi log{f(Xi,θ)

}− f(Xi,θ)

]+R,

where the remainder R does not depend on θ and can be omitted. Obvi-ously, the constant function family f(·, θ) ≡ θ leads back to the case of i.i.d.modeling studied in Section 2.11. A further extension is given by linear Pois-son regression: f(Xi,θ) = Ψ>i θ for some given factors Ψi . The regressionequation reads

L(θ) =∑i

[Yi log(Ψ>i θ)−Ψ>i θ)

]. (3.53)

Page 124: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

124 3 Regression Estimation

Exercise 3.5.5. Specify the estimating equation and the Newton-Raphsonprocedure for the linear Poisson regression (3.53).

An obvious problem of linear Poisson modeling is that it requires all thevalues Ψ>i θ to be positive. The use of canonical parametrization helps toavoid this problem. The linear structure is assumed for the canonical pa-rameter leading to the representation f(Xi) = exp(Ψ>i θ) . Then the generallog-likelihood process L(θ) from (3.51) translates into

L(θ) =∑i

[YiΨ

>i θ − exp(Ψ>i θ)

]; (3.54)

cf. with (3.52).

Exercise 3.5.6. Specify the estimating equation and the Newton-Raphsonprocedure for the canonical link linear Poisson regression (3.54).

If the factors Ψi are properly scaled then the scalar products Ψ>i θ forall i and all θ ∈ Θ0 belong to some bounded interval. For the matrix B(θ)from (3.49), it holds

B(θ) =∑i

exp(Ψ>i θ)ΨiΨ>i .

Initializing the ML optimization problem with θ = 0 leads to the oLSE

θ(0)

=(∑

i

ΨiΨ>i

)−1∑i

ΨiYi .

The further steps of the algorithm (3.50) can be done as weighted LSE with

the weights exp(Ψ>i θ

(k))for the estimate θ

(k)obtained at the previous step.

3.5.4 Piecewise constant methods in generalized regression

Consider a generalized regression model

Yi ∼ Pf(Xi) ∈ (Pυ)

for a given exponential family (Pυ) . Further, let A1, . . . , AK be a non-overlapping partition of the design space X ; see (3.39). A piecewise constantapproximation (3.40) of the regression function f(·) leads to the additivelog-likelihood structure: for θ = (θ1, . . . , θK)>

θ = argmaxθ

L(θ) = argmaxθ1,...,θK

K∑k=1

∑Xi∈Ak

`(Yi, θk);

Page 125: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

3.5 Generalized regression 125

cf. (3.41). Similarly to the mean regression case, the global optimization w.r.t.the vector θ can be decomposed into K separated simple optimization prob-lems:

θk = argmaxθ

∑Xi∈Ak

`(Yi, θ);

cf. (3.42). The same decomposition can be obtained for the target θ∗ =(θ1, . . . , θK)> :

θ∗ = argmaxθ

IEL(θ) = argmaxθ1,...,θK

K∑k=1

∑Xi∈Ak

IE`(Yi, θk).

The properties of each estimator θk repeats ones of the qMLE for a univariateEFn; see Section 2.11.

Theorem 3.5.1. Let `(y, θ) = C(θ)y − B(θ) be a density of an EFn, sothat the functions B(θ) and C(θ) satisfy B′(θ) = θC ′(θ) . Then for everyk = 1, . . . ,K

θk =1

Nk

∑Xi∈Ak

Yi ,

θ∗k =1

Nk

∑Xi∈Ak

IEYi ,

where Nk stands for the number of design points Xi within the piece Ak :

Nkdef=

∑Xi∈Ak

1 = #{i : Xi ∈ Ak

}.

Moreover, it holds

IEθk = θ∗k

and

L(θ,θ∗) =

K∑k=1

NkK(θk, θ∗k) (3.55)

where K(θ, θ′)def= Eθ

{`(Yi, θ)− `(Yi, θ′)

}.

These statements follow from Theorem 3.5.1 and Theorem 2.11.1 of Sec-tion 2.11. For the presented results, the true regression function f(·) can beof arbitrary structure, the true distribution of each Yi can differ from Pf(Xi) .

Page 126: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

126 3 Regression Estimation

Exercise 3.5.7. Check the statements of Theorem 3.5.1.

If PA is correct, that is, if f is indeed piecewise constant and the distri-bution of Yi is indeed Pf(Xi) , the deviation bound for the excess L(θk, θ

∗k)

from Theorem 2.11.4 can be applied to each piece Ak yielding the followingresult.

Theorem 3.5.2. Let (Pθ) be a EFn and let Yi ∼ Pθk for Xi ∈ Ak andk = 1, . . . ,K . Then for any z > 0

IP(L(θ,θ∗) > Kz

)≤ 2Ke−z.

Proof. By (3.55) and Theorem 2.11.4

IP(L(θ,θ∗) > Kz

)= IP

( K∑k=1

NkK(θk, θ∗k) > Kz

)

≤K∑k=1

IP(NkK(θk, θ

∗k) > z

)≤ 2Ke−z

and the result follows.

A piecewise linear generalized regression can be treated in a similar way.The main benefit of piecewise modeling remains preserved: a global optimiza-tion over the vector θ can be decomposed into a set of small optimizationproblems for each piece Ak . However, a closed form solution is available onlyin some special cases like Gaussian regression.

3.5.5 Smoothing splines for generalized regression

Consider again the generalized regression model

Yi ∼ Pf(Xi) ∈ P

for an exponential family P with canonical parametrization. Now we do notassume any specific parametric structure for the function f . Instead, thefunction f is supposed to be smooth and its smoothness is measured by theroughness Rq(f) from (3.46). Similarly to the regression case of Section 3.4.4,the function f can be estimated directly by optimizing the penalized log-likelihood Lλ(f) :

fλ = argmaxf

Lλ(f) = argmaxf

{L(f)−Rq(f)

}= argmax

f

∑i

{Yif(Xi)− d

(f(Xi)

)}−∫X

∣∣f (q)(x)∣∣2dx. (3.56)

Page 127: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

3.6 Historical remarks and further reading 127

The maximum is taken over the class of all regular q -times differentiable func-tions. In the regression case, the function d(·) is quadratic and the solutionis a spline functions with knots Xi . This conclusion can be extended to thecase of any convex function d(·) , thus, the problem (3.56) yields a smoothingspline solution. Numerically this problem is usually solved by iterations. Onestarts with a quadratic function d(υ) = υ2/2 to obtain an initial approxima-

tion f (0)(·) of f(·) by a standard smoothing spline regression. Further, at

each new step k+1 , the use of the estimate f (k)(·) from the previous step kfor k ≥ 0 helps to approximate the problem (3.56) by a weighted regression.The corresponding iterations can be written in the form (3.50).

3.6 Historical remarks and further reading

A nice introduction in the use of smoothing splines in statistics can be foundin Green and Silverman (1994), and Wahba (1990). Fore further properties ofthe spline approximation and algorithmic use of splines see de Boor (2001).

Orthogonal polynomials have long stories and have been applied in manydifferent fields of mathematics. We refer to Szego (1939) and Chihara (2011)for the classical results and history around different polynomial systems.

Some further methods in regression estimation and their features are de-scribed e.g. in Lehmann and Casella (1998), Fan and Gijbels (1996), Wasser-man (2006).

Page 128: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April
Page 129: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

4

Estimation in linear models

This chapter studies the estimation problem for a linear model. The first foursections are fairly classical and the presented results are based on the directanalysis of the linear estimation procedures. Section 4.5 and 4.6 reproduce ina very short form the same results but now based on the likelihood analysis.The presentation is based on the celebrated chi-squared phenomenon whichappears to be the fundamental fact yielding the exact likelihood based con-centration and confidence properties. The further sections are complementaryand can be recommended for a more profound reading. The issues like reg-ularization, shrinkage, smoothness and roughness are usually studied withinthe nonparametric theory, here we try to fit them to the classical linear para-metric setup. A special focus is on semiparametric estimation in Section 4.9.In particular, efficient estimation and chi-squared result are extended to thesemiparametric framework.

The main tool of the study is the quasi maximum likelihood method.We especially focus on the validity of the presented results under possiblemodel misspecification. Another important issue is the way of measuring theestimation loss and risk. We distinguish below between response estimationor prediction and the parameter estimation. The most advanced results likechi-squared result in Section 4.6 are established under the assumption of aGaussian noise. However, a misspecification of noise structure is allowed andaddressed.

4.1 Modeling assumptions

A linear model assumes that the observations Yi follow the equation:

Yi = Ψ>i θ∗ + εi (4.1)

for i = 1, . . . , n , where θ∗ = (θ∗1 , . . . , θ∗p)> ∈ IRp is an unknown parameter

vector, Ψi are given vectors in IRp and the εi ’s are individual errors with

Page 130: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

130 4 Estimation in linear models

zero mean. A typical example is given by linear regression (see Section 3.3)when the vectors Ψi are the values of a set of functions (e.g polynomial,trigonometric) series at the design points Xi .

A linear Gaussian model assumes in addition that the vector of errorsε = (ε1, . . . εn)> is normally distributed with zero mean and a covariancematrix Σ :

ε ∼ N(0,Σ).

In this chapter we suppose that Σ is given in advance. We will distinguishbetween three cases:

1. the errors εi are i.i.d. N(0, σ2) , or equivalently, the matrix Σ is equal toσ2IIn with IIn being the unit matrix in IRn .

2. the errors are independent but not homogeneous, that is, IEε2i = σ2

i .Then the matrix Σ is diagonal: Σ = diag(σ2

1 , . . . , σ2n) .

3. the errors εi are dependent with a covariance matrix Σ .

In practical applications one mostly starts with the white Gaussian noiseassumption and more general cases 2 and 3 are only considered if there areclear indications of the noise inhomogeneity or correlation. The second situ-ation is typical e.g. for the eigenvector decomposition in an inverse problem.The last case is the most general and includes the first two.

4.2 Quasi maximum likelihood estimation

Denote by Y = (Y1, . . . , Yn)> (resp. ε = (ε1, . . . , εn)> ) the vector of obser-vations (resp. of errors) in IRn and by Ψ the p×n matrix with columns Ψi .Let also Ψ> denote its transpose. Then the model equation can be rewrittenas:

Y = Ψ>θ∗ + ε, ε ∼ N(0,Σ).

An equivalent formulation is that Σ−1/2(Y − Ψ>θ) is a standard nor-mal vector in IRn . The log-density of the distribution of the vector Y =(Y1, . . . , Yn)> w.r.t. the Lebesgue measure in IRn is therefore of the form

L(θ) = −n2

log(2π)−log(det Σ

)2

− 1

2‖Σ−1/2(Y −Ψ>θ)‖2

= −n2

log(2π)−log(det Σ

)2

− 1

2(Y −Ψ>θ)>Σ−1(Y −Ψ>θ).

In case 1 this expression can be rewritten as

Page 131: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

4.2 Quasi maximum likelihood estimation 131

L(θ) = −n2

log(2πσ2)− 1

2σ2

n∑i=1

(Yi −Ψ>i θ)2.

In case 2 the expression is similar:

L(θ) = −n∑i=1

{1

2log(2πσ2

i ) +(Yi −Ψ>i θ)2

2σ2i

}.

The maximum likelihood estimate (MLE) θ of θ∗ is defined by maximizingthe log-likelihood L(θ) :

θ = argmaxθ∈IRp

L(θ) = argminθ∈IRp

(Y −Ψ>θ)>Σ−1(Y −Ψ>θ). (4.2)

We omit the other terms in the expression of L(θ) because they do notdepend on θ . This estimate is the least squares estimate (LSE) because itminimizes the sum of squared distances between the observations Yi and thelinear responses Ψ>i θ . Note that (4.2) is a quadratic optimization problemwhich has a closed form solution. Differentiating the right hand-side of (4.2)w.r.t. θ yields the normal equation

ΨΣ−1Ψ>θ = ΨΣ−1Y .

If the p×p -matrix ΨΣ−1Ψ> is non-degenerate then the normal equation hasthe unique solution

θ =(ΨΣ−1Ψ>

)−1ΨΣ−1Y = SY , (4.3)

where

S =(ΨΣ−1Ψ>

)−1ΨΣ−1

is a p×n matrix. We denote by θm the entries of the vector θ , m = 1, . . . , p .If the matrix ΨΣ−1Ψ> is degenerate, then the normal equation has in-

finitely many solutions. However, one can still apply the formula (4.3) where(ΨΣ−1Ψ>)−1 is a pseudo-inverse of the matrix ΨΣ−1Ψ> .

The ML-approach leads to the parameter estimate θ . Note that due to

the model (4.1), the product f = Ψ>θ is an estimate of the mean f∗def= IEY

of the vector of observations Y :

f = Ψ>θ = Ψ>(ΨΣ−1Ψ>

)−1ΨΣ−1Y = ΠY ,

where

Π = Ψ>(ΨΣ−1Ψ>

)−1ΨΣ−1

Page 132: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

132 4 Estimation in linear models

is an n×n matrix (linear operator) in IRn . The vector f is called a predictionor response regression estimate.

Below we study the properties of the estimates θ and f . In this study wetry to address both types of possible model misspecification: due to a wrongassumption about the error distribution and due to a possibly wrong linearparametric structure. Namely we consider the model

Yi = fi + εi, ε ∼ N(0,Σ0). (4.4)

The response values fi are usually treated as the value of the regressionfunction f(·) at the design points Xi . The parametric model (4.1) can beviewed as an approximation of (4.4) while Σ is an approximation of the truecovariance matrix Σ0 . If f∗ is indeed equal to Ψ>θ∗ and Σ = Σ0 , thenθ and f are MLEs, otherwise quasi MLEs. In our study we mostly restrictourselves to the case 1 assumption about the noise ε : ε ∼ N(0, σ2IIn) . Thegeneral case can be reduced to this one by a simple data transformation,namely, by multiplying the equation (4.4) Y = f∗+ε with the matrix Σ−1/2 ,see Section 4.6 for more detail.

4.2.1 Estimation under the homogeneous noise assumption

If a homogeneous noise is assumed, that is Σ = σ2IIn and ε ∼ N(0, σ2IIn) ,

then the formulae for the MLEs θ, f slightly simplify. In particular, thevariance σ2 cancels and the resulting estimate is the ordinary least squares(oLSE):

θ =(ΨΨ>

)−1ΨY = SY

with S =(ΨΨ>

)−1Ψ . Also

f = Ψ>(ΨΨ>

)−1ΨY = ΠY

with Π = Ψ>(ΨΨ>

)−1Ψ .

Exercise 4.2.1. Derive the formulae for θ, f directly from the log-likelihoodL(θ) for homogeneous noise.

If the assumption ε ∼ N(0, σ2IIn) about the errors is not precisely fulfilled,then the oLSE can be viewed as a quasi MLE.

4.2.2 Linear basis transformation

Denote by ψ>1 , . . . ,ψ>p the rows of the matrix Ψ . Then the ψi ’s are vectors

in IRn and we call them the basis vectors. In the linear regression case the

Page 133: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

4.2 Quasi maximum likelihood estimation 133

ψi ’s are obtained as the values of the basis functions at the design points.Our linear parametric assumption simply means that the underlying vectorf∗ can be represented as a linear combination of the vectors ψ1, . . . ,ψp :

f∗ = θ∗1ψ1 + . . .+ θ∗pψp .

In other words, f∗ belongs to the linear subspace in IRn spanned by thevectors ψ1, . . . ,ψp . It is clear that this assumption still holds if we selectanother basis in this subspace.

Let U be any linear orthogonal transformation in IRp with UU> = IIp .Then the linear relation f∗ = Ψ>θ∗ can be rewritten as

f∗ = Ψ>UU>θ∗ = Ψ>u∗

with Ψ = U>Ψ and u∗ = U>θ∗ . Here the columns of Ψ mean the newbasis vectors ψm in the same subspace while u∗ is the vector of coefficientsdescribing the decomposition of the vector f∗ w.r.t. this new basis:

f∗ = u∗1ψ1 + . . .+ u∗pψp .

The natural question is how the expression for the MLEs θ and f changewith the change of the basis. The answer is straightforward. For notationalsimplicity, we only consider the case with Σ = σ2IIn . The model can berewritten as

Y = Ψ>u∗ + ε

yielding the solutions

u =(ΨΨ>

)−1ΨY = SY , f = Ψ>

(ΨΨ>

)−1ΨY = ΠY ,

where Ψ = U>Ψ implies

S =(ΨΨ>

)−1Ψ = U>S,

Π = Ψ>(ΨΨ>

)−1Ψ = Π.

This yields

u = U>θ

and moreover, the estimate f is not changed for any linear transformationof the basis. The first statement can be expected in view of θ∗ = Uu∗ , whilethe second one will be explained in the next section: Π is the linear projectoron the subspace spanned by the basis vectors and this projector is invariantw.r.t. basis transformations.

Page 134: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

134 4 Estimation in linear models

Exercise 4.2.2. Consider univariate polynomial regression of degree p − 1 .This means that f is a polynomial function of degree p− 1 observed at thepoints Xi with errors εi that are assumed to be i.i.d. normal. The functionf can be represented as

f(x) = θ∗1 + θ∗2x+ . . .+ θ∗pxp−1

using the basis functions ψm(x) = xm−1 for m = 0, . . . , p − 1 . At the sametime, for any point x0 , this function can also be written as

f(x) = u∗1 + u∗2(x− x0) + . . .+ u∗p(x− x0)p−1

using the basis functions ψm = (x− x0)m−1 .

• Write the matrices Ψ and ΨΨ> and similarly Ψ and ΨΨ> .• Describe the linear transformation A such that u = Aθ for p = 1 .• Describe the transformation A such that u = Aθ for p > 1 .

Hint: use the formula

u∗m =1

(m− 1)!f (m−1)(x0), m = 1, . . . , p

to identify the coefficient u∗m via θ∗m, . . . , θ∗p .

4.2.3 Orthogonal and orthonormal design

Orthogonality of the design matrix Ψ means that the basis vectors ψ1, . . . , ψpare orthonormal in the sense

ψ>mψm′ =

n∑i=1

ψm,iψm′,i =

{0 if m 6= m′,

λm if m = m′,

for some positive values λ1, . . . , λp . Equivalently one can write

ΨΨ> = Λ = diag(λ1, . . . , λp).

This feature of the design is very useful and it essentially simplifies the com-putation and analysis of the properties of θ . Indeed, ΨΨ> = Λ implies

θ = Λ−1ΨY , f = Ψ>θ = Ψ>Λ−1ΨY

with Λ−1 = diag(λ−11 , . . . , λ−1

p ) . In particular, the first relation means

θm = λ−1m

n∑i=1

Yiψm,i,

Page 135: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

4.2 Quasi maximum likelihood estimation 135

that is, θm is the scalar product of the data and the basis vector ψm form = 1, . . . , p . The estimate of the response f reads as

f = θ1ψ1 + . . .+ θpψp.

Theorem 4.2.1. Consider the model Y = Ψ>θ+ε with homogeneous errorsε : IEεε> = σ2IIn . If the design Ψ is orthogonal, that is, if ΨΨ> = Λ fora diagonal matrix Λ , then the estimated coefficients θm are uncorrelated:Var(θ) = σ2Λ−1 . Moreover, if ε ∼ N(0, σ2IIn) , then θ ∼ N(θ∗, σ2Λ−1) .

An important message of this result is that the orthogonal design allowsfor splitting the original multivariate problem into a collection of independentunivariate problems: each coefficient θ∗m is estimated by θm independentlyon the remaining coefficients.

The calculus can be further simplified in the case of an orthogonal designwith ΨΨ> = IIp . Then one speaks about an orthonormal design. This alsoimplies that every basis function (vector) ψm is standardized: ‖ψm‖2 =∑ni=1 ψ

2m,i = 1 . In the case of an orthonormal design, the estimate θ is

particularly simple: θ = ΨY . Correspondingly, the target of estimation θ∗

satisfies θ∗ = Ψf∗ . In other words, the target is the collection (θ∗m) of theFourier coefficients of the underlying function (vector) f∗ w.r.t. the basis Ψ

while the estimate θ is the collection of empirical Fourier coefficients θm :

θ∗m =

n∑i=1

fiψm,i , θm =

n∑i=1

Yiψm,i

An important feature of the orthonormal design is that it preserves the noisehomogeneity:

Var(θ)

= σ2Ip .

4.2.4 Spectral representation

Consider a linear model

Y = Ψ>θ + ε (4.5)

with homogeneous errors ε : Var(ε) = σ2IIn . The rows of the matrix Ψ canbe viewed as basis vectors in IRn and the product Ψ>θ is a linear combina-tions of these vectors with the coefficients (θ1, . . . , θp) . Effectively linear leastsquares estimation does a kind of projection of the data onto the subspacegenerated by the basis functions. This projection is of course invariant w.r.t.a basis transformation within this linear subspace. This fact can be used toreduce the model to the case of an orthogonal design considered in the previ-ous section. Namely, one can always find a linear orthogonal transformation

Page 136: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

136 4 Estimation in linear models

U : IRp → IRp ensuring the orthogonality of the transformed basis. Thismeans that the rows of the matrix Ψ = UΨ are orthogonal and the matrixΨΨ> is diagonal:

ΨΨ> = UΨΨ>U> = Λ = diag(λ1, . . . , λp).

The original model reads after this transformation in the form

Y = Ψ>u+ ε, ΨΨ> = Λ,

where u = Uθ ∈ IRp . Within this model, the transformed parameter u can

be estimated using the empirical Fourier coefficients Zm = ψ>mY , where ψm

is the m th row of Ψ , m = 1, . . . , p . The original parameter vector θ can berecovered via the equation θ = U>u . This set of equations can be written inthe form

Z = Λu+ Λ1/2ξ (4.6)

where Z = ΨY = UΨY is a vector in IRp and ξ = Λ−1/2Ψε = Λ−1/2UΨε ∈IRp . The equation (4.6) is called the spectral representation of the linear model(4.5). The reason is that the basic transformation U can be built by a singularvalue decomposition of Ψ . This representation is widely used in context oflinear inverse problems; see Section 4.8.

Theorem 4.2.2. Consider the model (4.5) with homogeneous errors ε , thatis, IEεε> = σ2IIn . Then there exists an orthogonal transform U : IRp →IRp leading to the spectral representation (4.6) with homogeneous uncorrelatederrors ξ : IEξξ> = σ2IIp . If ε ∼ N(0, σ2IIn) , then the vector ξ is normal aswell: ξ = N(0, σ2IIp) .

Exercise 4.2.3. Prove the result of Theorem 4.2.2.Hint: select any U ensuring U>ΨΨ>U = Λ . Then

IEξξ> = Λ−1/2UΨIEεε>Ψ>U>Λ−1/2 = σ2Λ−1/2U>ΨΨ>UΛ−1/2 = σ2IIp.

A special case of the spectral representation corresponds to the orthonor-mal design with ΨΨ> = IIp . In this situation, the spectral model reads asZ = u+ ξ , that is, we simply observe the target u corrupted with a homo-geneous noise ξ . Such an equation is often called the sequence space model andit is intensively used in the literature for the theoretical study; cf. Section 4.7below.

4.3 Properties of the response estimate f

This section discusses some properties of the estimate f = Ψ>θ = ΠY ofthe response vector f∗ . It is worth noting that the first and essential part of

Page 137: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

4.3 Properties of the response estimate f 137

the analysis does not rely on the underlying model distribution, only on ourparametric assumptions that f = Ψ>θ∗ and Cov(ε) = Σ = σ2IIn . The realmodel only appears when studying the risk of estimation. We will commenton the cases of misspecified f and Σ .

When Σ = σ2IIn , the operator Π in the representation f = ΠY of theestimate f reads as

Π = Ψ>(ΨΨ>

)−1Ψ. (4.7)

First we make use of the linear structure of the model (4.1) and of the

estimate f to derive a number of its simple but important properties.

4.3.1 Decomposition into a deterministic and a stochasticcomponent

The model equation Y = f∗ + ε yields

f = ΠY = Π(f∗ + ε) = Πf∗ + Πε. (4.8)

The first element of this sum, Πf∗ is purely deterministic, but it dependson the unknown response vector f∗ . Moreover, it will be shown in the nextlemma that Πf∗ = f∗ if the parametric assumption holds and the vectorf∗ indeed can be represented as Ψ>θ∗ . The second element is stochasticas a linear transformation of the stochastic vector ε but is independent ofthe model response f∗ . The properties of the estimate f heavily rely on theproperties of the linear operator Π from (4.7) which we collect in the nextsection.

4.3.2 Properties of the operator Π

Let ψ1, . . . ,ψp be the columns of the matrix Ψ> . These are the vectors inIRn also called the basis vectors.

Lemma 4.3.1. Let the matrix ΨΨ> be non-degenerate. Then the operatorΠ fulfills the following conditions:

(i) Π is symmetric (self-adjoint), that is, Π> = Π .(ii) Π is a projector in IRn , i.e. Π>Π = Π2 = Π and Π(1n−Π) = 0 , where

1n means the unity operator in IRn .(iii) For an arbitrary vector v from IRn , it holds ‖v‖2 = ‖Πv‖2+‖v−Πv‖2 .(iv) The trace of Π is equal to the dimension of its image, tr Π = p .(v) Π projects the linear space IRn on the linear subspace Lp =

⟨ψ1, . . . ,ψp

⟩,

which is spanned by the basis vectors ψ1, . . .ψp , that is,

‖f∗ −Πf∗‖ = infg∈Lp

‖f∗ − g‖.

Page 138: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

138 4 Estimation in linear models

(vi) The matrix Π can be represented in the form

Π = U>ΛpU

where U is an orthonormal matrix and Λp is a diagonal matrix with thefirst p diagonal elements equal to 1 and the others equal to zero:

Λp = diag{1, . . . , 1︸ ︷︷ ︸p

, 0, . . . , 0︸ ︷︷ ︸n−p

}.

Proof. It holds {Ψ>(ΨΨ>

)−1Ψ}>

= Ψ>(ΨΨ>

)−1Ψ

and

Π2 = Ψ>(ΨΨ>

)−1ΨΨ>

(ΨΨ>

)−1Ψ = Ψ>

(ΨΨ>

)−1Ψ = Π,

which proves the first two statements of the lemma. The third one followsdirectly from the first two. Next,

tr Π = tr Ψ>(ΨΨ>

)−1Ψ = tr ΨΨ>

(ΨΨ>

)−1= tr IIp = p.

The second property means that Π is a projector in IRn and the fourth onemeans that the dimension of its image space is equal to p . The basis vectorsψ1, . . . ,ψp are the rows of the matrix Ψ . It is clear that

ΠΨ> = Ψ>(ΨΨ>

)−1ΨΨ> = Ψ>.

Therefore, the vectors ψm are invariants of the operator Π and in particular,all these vectors belong to the image space of this operator. If now g is avector in Lp , then it can be represented as g = c1ψ1 + . . . + cpψp andtherefore, Πg = g and ΠLp = Lp . Finally, the non-singularity of the matrixΨΨ> means that the vectors ψ1, . . . ,ψp forming the rows of Ψ are linearlyindependent. Therefore, the space Lp spanned by the vectors ψ1, . . . ,ψp isof dimension p , and hence it coincides with the image space of the operationΠ .

The last property is the usual diagonal decomposition of a projector.

Exercise 4.3.1. Consider the case of an orthogonal design with ΨΨ> = IIp .Specify the projector Π of Lemma 4.3.1 for this situation, particularly itsdecomposition from (vi).

Page 139: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

4.3 Properties of the response estimate f 139

4.3.3 Quadratic loss and risk of the response estimation

In this section we study the quadratic risk of estimating the response f∗ .The reason for studying the quadratic risk of estimating the response f∗ willbe made clear when we discuss the properties of the fitted likelihood in thenext section.

The loss ℘(f ,f∗) of the estimate f can be naturally defined as the

squared norm of the difference f − f∗ :

℘(f ,f∗) = ‖f − f∗‖2 =

n∑i=1

|fi − fi|2.

Correspondingly, the quadratic risk of the estimate f is the mean of this loss

R(f) = IE℘(f ,f∗) = IE[(f − f∗)>(f − f∗)

]. (4.9)

The next result describes the loss and risk decomposition for two cases: whenthe parametric assumption f∗ = Ψ>θ∗ is correct and in the general case.

Theorem 4.3.1. Suppose that the errors εi from (4.1) are independent with

IE εi = 0 and IE ε2i = σ2 , i.e. Σ = σ2IIn . Then the loss ℘(f ,f∗) = ‖ΠY −

f∗‖2 and the risk R(f) of the LSE f fulfill

℘(f ,f∗) = ‖f∗ −Πf∗‖2 + ‖Πε‖2,

R(f) = ‖f∗ −Πf∗‖2 + pσ2.

Moreover, if f∗ = Ψ>θ∗ , then

℘(f ,f∗) = ‖Πε‖2,

R(f) = pσ2.

Proof. We apply (4.9) and the decomposition (4.8) of the estimate f . Itfollows

℘(f ,f∗) = ‖f − f∗‖2 = ‖f∗ −Πf∗ −Πε‖2

= ‖f∗ −Πf∗‖2 + 2(f∗ −Πf∗)>Πε+ ‖Πε‖2.

This implies the decomposition for the loss of f by Lemma 4.3.1, (ii). Nextwe compute the mean of ‖Πε‖2 applying again Lemma 4.3.1. Indeed

IE‖Πε‖2 = IE(Πε)>Πε = IE tr{

Πε(Πε)>}

= IE tr(Πεε>Π>

)= tr

{ΠIE(εε>)Π

}= σ2 tr(Π2) = pσ2.

Now consider the case when f∗ = Ψ>θ∗ . By Lemma 4.3.1 f∗ = Πf∗ andand the last two statements of the theorem clearly follow.

Page 140: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

140 4 Estimation in linear models

4.3.4 Misspecified “colored noise”

Here we briefly comment on the case when ε is not a white noise. So, our as-sumption about the errors εi is that they are uncorrelated and homogeneous,that is, Σ = σ2IIn while the true covariance matrix is given by Σ0 . Manyproperties of the estimate f = ΠY which are simply based on the linearity ofthe model (4.1) and of the estimate f itself continue to apply. In particular,

the loss ℘(f ,f∗

)= ‖f − f∗‖2 can again be decomposed as

‖f − f∗‖2 = ‖f∗ −Πf∗‖2 + ‖Πε‖2.

Theorem 4.3.2. Suppose that IEε = 0 and Var(ε) = Σ0 . Then the loss

℘(f ,f) and the risk R(f) of the LSE f fulfill

℘(f ,f∗) = ‖f∗ −Πf∗‖2 + ‖Πε‖2,

R(f) = ‖f∗ −Πf∗‖2 + tr(ΠΣ0Π

).

Moreover, if f∗ = Ψ>θ∗ , then

℘(f ,f∗) = ‖Πε‖2,

R(f) = tr(ΠΣ0Π

).

Proof. The decomposition of the loss from Theorem 4.3.1 only relies on thegeometric properties of the projector Π and does not use the covariancestructure of the noise. Hence, it only remains to check the expectation of‖Πε‖2 . Observe that

IE‖Πε‖2 = IE tr[Πε(Πε)>

]= tr

[ΠIE(εε>)Π

]= tr

(ΠΣ0Π

)as required.

4.4 Properties of the MLE θ

In this section we focus on the properties of the quasi MLE θ built for theidealized linear Gaussian model Y = Ψ>θ∗ + ε with ε ∼ N(0, σ2IIn) . Asin the previous section, we do not assume the parametric structure of theunderlying model and consider a more general model Y = f∗ + ε with anunknown vector f∗ and errors ε with zero mean and covariance matrix Σ0 .

Due to (4.3), it holds θ = SY with S =(ΨΨ>

)−1Ψ . An important feature of

this estimate is its linear dependence on the data. The linear model equationY = f∗ + ε and linear structure of the estimate θ = SY allow us fordecomposing the vector θ into a deterministic and stochastic terms:

Page 141: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

4.4 Properties of the MLE θ 141

θ = SY = S(f∗ + ε

)= Sf∗ + Sε. (4.10)

The first term Sf∗ is deterministic but depends on the unknown vector f∗

while the second term Sε is stochastic but it does not involve the modelresponse f∗ . Below we study the properties of each component separately.

4.4.1 Properties of the stochastic component

The next result describes the distributional properties of the stochastic com-

ponent δ = Sε for S =(ΨΨ>

)−1Ψ and thus, of the estimate θ .

Theorem 4.4.1. Assume Y = f∗ + ε with IEε = 0 and Var(ε) = Σ0 . Thestochastic component δ = Sε in (4.10) fulfills

IEδ = 0, W 2 def= Var(δ) = SΣ0S>, IE‖δ‖2 = trW 2 = tr

(SΣ0S>

).

Moreover, if Σ = Σ0 = σ2IIn , then

W 2 = σ2(ΨΨ>

)−1, IE‖δ‖2 = tr(W 2) = σ2 tr

[(ΨΨ>

)−1]. (4.11)

Similarly for the estimate θ it holds

IEθ = Sf∗, Var(θ) = W 2.

If the errors ε are Gaussian, then the both δ and θ are Gaussian as well:

δ ∼ N(0,W 2) θ ∼ N(Sf∗,W 2).

Proof. For the variance W 2 of δ holds

Var(δ) = IEδδ> = IESεε>S> = SΣ0S>.

Next we use that IE‖δ‖2 = IEδ>δ = IE tr(δδ>) = trW 2 . If Σ = Σ0 = σ2IIn ,then (4.11) follows by simple algebra.

If ε is a Gaussian vector, then δ as its linear transformation is Gaussianas well. The properties of θ follow directly from the decomposition (4.10).

With Σ0 6= σ2IIn , the variance W 2 can be represented as

W 2 =(ΨΨ>

)−1ΨΣ0Ψ>

(ΨΨ>

)−1.

Exercise 4.4.1. Let δ be the stochastic component of θ built for the mis-specified linear model Y = Ψ>θ∗ + ε with Var(ε) = Σ . Let also the true

noise variance is Σ0 . Then Var(θ) = W 2 with

W 2 =(ΨΣ−1Ψ>

)−1ΨΣ−1Σ0Σ−1Ψ>

(ΨΣ−1Ψ>

)−1. (4.12)

Page 142: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

142 4 Estimation in linear models

The main finding in the presented study is that the stochastic part δ = Sεof the estimate θ is completely independent of the structure of the vector f∗ .In other words, the behavior of the stochastic component δ does not changeeven if the linear parametric assumption is misspecified.

4.4.2 Properties of the deterministic component

Now we study the deterministic term starting with the parametric situationf∗ = Ψ>θ∗ . Here we only specify the results for the case 1 with Σ = σ2IIn .

Theorem 4.4.2. Let f∗ = Ψ>θ∗ . Then θ = SY with S =(ΨΨ>

)−1Ψ is

unbiased, that is, IEθ = Sf∗ = θ∗ .

Proof. For the proof, just observe that Sf∗ =(ΨΨ>

)−1ΨΨ>θ∗ = θ∗ .

Now we briefly discuss what happens when the linear parametric assump-tion is not fulfilled, that is, f∗ cannot be represented as Ψ>θ∗ . In this caseit is not yet clear what θ really estimates. The answer is given in the contextof the general theory of minimum contrast estimation. Namely, define θ∗ asthe point which maximizes the expectation of the (quasi) log-likelihood L(θ) :

θ∗ = argmaxθ

IEL(θ). (4.13)

Theorem 4.4.3. The solution θ∗ of the optimization problem (4.13) is givenby

θ∗ = Sf∗ =(ΨΨ>

)−1Ψf∗.

Moreover,

Ψ>θ∗ = Πf∗ = Ψ>(ΨΨ>

)−1Ψf∗.

In particular, if f∗ = Ψ>θ∗ , then θ∗ follows (4.13).

Proof. The use of the model equation Y = f∗ + ε and of the properties ofthe stochastic component δ yield by simple algebra

argmaxθ

IEL(θ) = argminθ

IE(f∗ −Ψ>θ + ε

)>(f∗ −Ψ>θ + ε

)= argmin

θ

{(f∗ −Ψ>θ)>(f∗ −Ψ>θ) + IE

(ε>ε

)}= argmin

θ

{(f∗ −Ψ>θ)>(f∗ −Ψ>θ)

}.

Differentiating w.r.t. θ leads to the equation

Page 143: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

4.4 Properties of the MLE θ 143

Ψ(f∗ −Ψ>θ) = 0

and the solution θ∗ =(ΨΨ>

)−1Ψf∗ which is exactly the expected value of

θ by Theorem 4.4.1.

Exercise 4.4.2. State the result of Theorems 4.4.2 and 4.4.3 for the MLE θbuilt in the model Y = Ψ>θ∗ + ε with Var(ε) = Σ .

Hint: check that the statements continue to apply with S =(ΨΣ−1Ψ>

)−1ΨΣ−1 .

The last results and the decomposition (4.10) explain the behavior of the

estimate θ in a very general situation. The considered model is Y = f∗+ ε .We assume a linear parametric structure and independent homogeneous noise.The estimation procedure means in fact a kind of projection of the data Y ona p -dimensional linear subspace in IRn spanned by the given basis vectorsψ1, . . . ,ψp . This projection, as a linear operator, can be decomposed intoa projection of the deterministic vector f∗ and a projection of the randomnoise ε . If the linear parametric assumption f∗ ∈

⟨ψ1, . . . ,ψp

⟩is correct,

that is, f∗ = θ∗1ψ1 + . . . + θ∗pψp , then this projection keeps f∗ unchangedand only the random noise is reduced via this projection. If f∗ cannot beexactly expanded using the basis ψ1, . . . ,ψp , then the procedure recoversthe projection of f∗ onto this subspace. The latter projection can be writtenas Ψ>θ∗ and the vector θ∗ can be viewed as the target of estimation.

4.4.3 Risk of estimation. R-efficiency

This section briefly discusses how the obtained properties of the estimateθ can be used to evaluate the risk of estimation. A particularly importantquestion is the optimality of the MLE θ . The main result of the section claimsthat θ is R-efficient if the model is correctly specified and is not if there is amisspecification.

We start with the case of a correct parametric specification Y = Ψ>θ∗+ε ,that is, the linear parametric assumption f∗ = Ψ>θ∗ is exactly fulfilled andthe noise ε is homogeneous: ε ∼ N(0, σ2IIn) . Later we extend the result tothe case when the LPA f∗ = Ψ>θ∗ is not fulfilled and to the case when thenoise is not homogeneous but still correctly specified. Finally we discuss thecase when the noise structure is misspecified.

Under LPA Y = Ψ>θ∗ + ε with ε ∼ N(0, σ2IIn) , the estimate θ is also

normal with mean θ∗ and the variance W 2 = σ2SS> = σ2(ΨΨ>

)−1. Define

a p× p symmetric matrix D by the equation

D2 =1

σ2

n∑i=1

ΨiΨ>i =

1

σ2ΨΨ>.

Clearly W 2 = D−2 .

Page 144: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

144 4 Estimation in linear models

Now we show that θ is R -efficient. Actually this fact can be derived fromthe Cramer-Rao Theorem because the Gaussian model is a special case of anexponential family. However, we check this statement directly by computingthe Cramer-Rao efficiency bound. Recall that the Fisher information matrixF(θ) for the log-likelihood L(θ) is defined as the variance of ∇L(θ) underIPθ .

Theorem 4.4.4 (Gauss-Markov). Let Y = Ψ>θ∗+ε with ε ∼ N(0, σ2IIn) .

Then θ is R-efficient estimate of θ∗ : IEθ = θ∗ ,

IE[(θ − θ∗

)(θ − θ∗

)>]= Var

(θ)

= D−2,

and for any unbiased linear estimate θ satisfying IEθθ ≡ θ , it holds

Var(θ)≥ Var

(θ)

= D−2.

Proof. Theorems 4.4.1 and 4.4.2 imply that θ ∼ N(θ∗,W 2) with W 2 =σ2(ΨΨ>)−1 = D−2 . Next we show that for any θ

Var[∇L(θ)

]= D2,

that is, the Fisher information does not depend on the model function f∗ .The log-likelihood L(θ) for the model Y ∼ N(Ψ>θ∗, σ2IIn) reads as

L(θ) = − 1

2σ2(Y −Ψ>θ)>(Y −Ψ>θ)− n

2log(2πσ2).

This yields for its gradient ∇L(θ) :

∇L(θ) = σ−2Ψ(Y −Ψ>θ)

and in view of Var(Y ) = Σ = σ2IIn , it holds

Var[∇L(θ)

]= σ−4Ψ Var(Y )Ψ> = σ−2ΨΨ>

as required.The R-efficiency θ follows from the Cramer-Rao efficiency bound be-

cause{

Var(θ)}−1

= Var{∇L(θ)

}. However, we present an independent

proof of this fact. Actually we prove a sharper result that the variance ofa linear unbiased estimate θ coincides with the variance of θ only if θ co-incides almost surely with θ , otherwise it is larger. The idea of the proofis quite simple. Consider the difference θ − θ and show that the conditionIEθ = IEθ = θ∗ implies orthogonality IE

{θ(θ − θ)>

}= 0 . This, in turns,

implies Var(θ) = Var(θ) + Var(θ − θ) ≥ Var(θ) . So, it remains to check the

orthogonality of θ and θ − θ . Let θ = AY for a p × n matrix A and

Page 145: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

4.4 Properties of the MLE θ 145

IEθθ ≡ θ and all θ . These two equalities and IEY = Ψ>θ∗ imply thatAΨ>θ∗ ≡ θ∗ , i.e. AΨ> is the identity p × p matrix. The same is true forθ = SY yielding SΨ> = IIp . Next, in view of IEθ = IEθ = θ∗

IE{

(θ − θ)θ>}

= IE(A− S)εε>S> = σ2(A− S)Ψ>(ΨΨ>)−1 = 0,

and the assertion follows.

Exercise 4.4.3. Check the details of the proof of the theorem. Show that thestatement Var(θ) ≥ Var

(θ)

only uses that θ is unbiased and that IEY =Ψ>θ∗ and Var(Y ) = σ2IIn .

Exercise 4.4.4. Compute ∇2L(θ) . Check that it is non-random, does notdepend on θ , and fulfills for every θ the identity

∇2L(θ) ≡ −Var[∇L(θ)

]= −D2.

A colored noise

The majority of the presented results continue to apply in the case of hetero-geneous and even dependent noise with Var(ε) = Σ0 . The key facts behindthis extension are the decomposition (4.10) and the properties of the stochas-tic component δ from Section 4.4.1: δ ∼ N(0,W 2) . In the case of a colorednoise, the definition of W and D is changed for

D2 def= W−2 = ΨΣ−1

0 Ψ>.

Exercise 4.4.5. State and prove the analog of Theorem 4.4.4 for the colorednoise ε ∼ N(0,Σ0) .

A misspecified LPA

An interesting feature of our results so far is that they equally apply for thecorrect linear specification f∗ = Ψ>θ∗ and for the case when the identityf∗ = Ψ>θ is not precisely fulfilled whatever θ is taken. In this situation thetarget of analysis is the vector θ∗ describing the best linear approximationof f∗ by Ψ>θ . We already know from the results of Section 4.4.1 and 4.4.2

that the estimate θ is also normal with mean θ∗ = Sf∗ =(ΨΨ>

)−1Ψf∗

and the variance W 2 = σ2SS> = σ2(ΨΨ>

)−1.

Theorem 4.4.5. Assume Y = f∗ + ε with ε ∼ N(0, σ2IIn) . Let θ∗ = Sf∗ .

Then θ is R-efficient estimate of θ∗ : IEθ = θ∗ ,

IE[(θ − θ∗

)(θ − θ∗

)>]= Var

(θ)

= D−2,

Page 146: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

146 4 Estimation in linear models

and for any unbiased linear estimate θ satisfying IEθθ ≡ θ , it holds

Var(θ)≥ Var

(θ)

= D−2.

Proof. The proofs only utilize that θ ∼ N(θ∗,W 2) with W 2 = D−2 . Theonly small remark concerns the equality Var

[∇L(θ)

]= D2 from Theo-

rem 4.4.4.

Exercise 4.4.6. Check the identity Var[∇L(θ)

]= D2 from Theorem 4.4.4

for ε ∼ N(0,Σ0) .

4.4.4 The case of a misspecified noise

Here we again consider the linear parametric assumption Y = Ψ>θ∗ + ε .However, contrary to the previous section, we admit that the noise ε is nothomogeneous normal: ε ∼ N(0,Σ0) while our estimation procedure is thequasi MLE based on the assumption of noise homogeneity ε ∼ N(0, σ2IIn) .

We already know that the estimate θ is unbiased with mean θ∗ and variance

W 2 = SΣ0S> , where S =(ΨΨ>

)−1Ψ . This gives

W 2 =(ΨΨ>

)−1ΨΣ0Ψ>

(ΨΨ>

)−1.

The question is whether the estimate θ based on the misspecified dis-tributional assumption is efficient. The Cramer-Rao result delivers the lower

bound for the quadratic risk in form of Var(θ) ≥[Var(∇L(θ)

)]−1. We al-

ready know that the use of the correctly specified covariance matrix of theerrors leads to an R-efficient estimate θ . The next result show that the useof a misspecified matrix Σ results in an estimate which is unbiased but notR-efficient, that is, the best estimation risk is achieved if we apply the correctmodel assumptions.

Theorem 4.4.6. Let Y = Ψ>θ∗ + ε with ε ∼ N(0,Σ0) . Then

Var[∇L(θ)

]= ΨΣ−1

0 Ψ>.

The estimate θ =(ΨΨ>

)−1ΨY is unbiased, that is, IEθ = θ∗ , but it is not

R-efficient unless Σ0 = Σ .

Proof. Let θ0 be the MLE for the correct model specification with the noiseε ∼ N(0,Σ0) . As θ is unbiased, the difference θ − θ0 is orthogonal to θ0

and it holds for the variance of θ

Var(θ) = Var(θ0) + Var(θ − θ0);

cf. with the proof of Gauss-Markov-Theorem 4.4.4.

Exercise 4.4.7. Compare directly the variances of θ and of θ0 .

Page 147: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

4.5 Linear models and quadratic log-likelihood 147

4.5 Linear models and quadratic log-likelihood

Linear Gaussian modeling leads to a specific log-likelihood structure; see Sec-tion 4.2. Namely, the log-likelihood function L(θ) is quadratic in θ , thecoefficients of the quadratic terms are deterministic and the cross term is lin-ear both in θ and in the observations Yi . Here we show that this geometricstructure of the log-likelihood characterizes linear models. We say that L(θ)is quadratic if it is a quadratic function of θ and there is a deterministicsymmetric matrix D2 such that for any θ◦,θ

L(θ)− L(θ◦) = (θ − θ◦)>∇L(θ◦)− (θ − θ◦)>D2(θ − θ◦)/2. (4.14)

Here ∇L(θ)def= dL(θ)

dθ . As usual we define

θdef= argmax

θL(θ),

θ∗ = argmaxθ

IEL(θ).

The next result describes some properties of the estimate θ which are entirelybased on the geometric (quadratic) structure of the function L(θ) . All theresults are stated by using the matrix D2 and the vector ζ = ∇L(θ∗) .

Theorem 4.5.1. Let L(θ) be quadratic for a matrix D2 > 0 . Then for anyθ◦

θ − θ◦ = D−2∇L(θ◦). (4.15)

In particular, with θ◦ = 0 , it holds

θ = D−2∇L(0).

Taking θ◦ = θ∗ yields

θ − θ∗ = D−2ζ (4.16)

with ζdef= ∇L(θ∗) . Moreover, IEζ = 0 , and it holds with V 2 = Var(ζ) =

IEζζ>

IEθ = θ∗

Var(θ)

= D−2V 2D−2.

Further, for any θ ,

L(θ)− L(θ) = (θ − θ)>D2(θ − θ)/2 = ‖D(θ − θ)‖2/2. (4.17)

Page 148: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

148 4 Estimation in linear models

Finally, it holds for the excess L(θ,θ∗)def= L(θ)− L(θ∗)

2L(θ,θ∗) = (θ − θ∗)>D2(θ − θ∗) = ζ>D−2ζ = ‖ξ‖2 (4.18)

with ξ = D−1ζ .

Proof. The extremal point equation ∇L(θ) = 0 for the quadratic functionL(θ) from (4.14) yields (4.15). The equation (4.14) with θ◦ = θ∗ implies forany θ

∇L(θ) = ∇L(θ◦)−D2(θ − θ◦) = ζ −D2(θ − θ∗). (4.19)

Therefore, it holds for the expectation IEL(θ)

∇IEL(θ) = IEζ −D2(θ − θ∗),

and the equation ∇IEL(θ∗) = 0 implies IEζ = 0 .

To show (4.17), apply again the property (4.14) with θ◦ = θ :

L(θ)− L(θ) = (θ − θ)>∇L(θ)− (θ − θ)>D2(θ − θ)/2

= −(θ − θ)>D2(θ − θ)/2.

Here we used that ∇L(θ) = 0 because θ is an extreme point of L(θ) . Thelast result (4.18) is a special case with θ = θ∗ in view of (4.16).

This theorem delivers an important message: the main properties of theMLE θ can be explained via the geometric (quadratic) structure of thelog-likelihood. An interesting question to clarify is whether a quadratic log-likelihood structure is specific for linear Gaussian model. The answer is posi-tive: there is one-to-one correspondence between linear Gaussian models andquadratic log-likelihood functions. Indeed, the identity (4.19) with θ◦ = θ∗

can be rewritten as

∇L(θ) +D2θ ≡ ζ +D2θ∗.

If we fix any θ and define Y = ∇L(θ) +D2θ , this yields

Y = D2θ∗ + ζ.

Similarly, Ydef= D−1

{∇L(θ) +D2θ

}yields the equation

Y = Dθ∗ + ξ, (4.20)

where ξ = D−1ζ . We can summarize as follows.

Page 149: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

4.6 Inference based on the maximum likelihood 149

Theorem 4.5.2. Let L(θ) be quadratic with a non-degenerated matrix D2 .

Then Ydef= D−1

{∇L(θ) + D2θ

}does not depend on θ and L(θ) − L(θ∗)

is the quasi log-likelihood ratio for the linear Gaussian model (4.20) with ξstandard normal. It is the true log-likelihood if and only if ζ ∼ N(0, D2) .

Proof. The model (4.20) with ξ ∼ N(0, IIp) leads to the log-likelihood ratio

(θ − θ∗)>D(Y −Dθ∗)− ‖D(θ − θ∗)‖2/2 = (θ − θ∗)>ζ − ‖D(θ − θ∗)‖2/2

in view of the definition of Y . The definition (4.14) implies

L(θ)− L(θ∗) = (θ − θ∗)>∇L(θ∗)− ‖D(θ − θ∗)‖2/2.

As these two expressions coincide, it follows that L(θ) is the true log-likelihood if and only if ξ = D−1ζ is standard normal.

4.6 Inference based on the maximum likelihood

All the results presented above for linear models were based on the explicitrepresentation of the (quasi) MLE θ . Here we present the approach based onthe analysis of the maximum likelihood. This approach does not require tofix any analytic expression for the point of maximum of the (quasi) likelihoodprocess L(θ) . Instead we work directly with the maximum of this process.We establish exponential inequalities for the excess or the maximum likelihoodL(θ,θ∗) . We also show how these results can be used to study the accuracy

of the MLE θ , in particular, for building confidence sets.One more benefit of the ML-based approach is that it equally applies to a

homogeneous and to a heterogeneous noise provided that the noise structureis not misspecified. The celebrated chi-squared result about the maximumlikelihood L(θ,θ∗) claims that the distribution of 2L(θ,θ∗) is chi-squaredwith p degrees of freedom χ2

p and it does not depend on the noise covariance;see Section 4.6.

Now we specify the setup. The starting point of the ML-approach is thelinear Gaussian model assumption Y = Ψ>θ∗ + ε with ε ∼ N(0,Σ) . Thecorresponding log-likelihood ratio L(θ) can be written as

L(θ) = −1

2(Y −Ψ>θ)>Σ−1(Y −Ψ>θ) +R, (4.21)

where the remainder term R does not depend on θ . Now one can see thatL(θ) is a quadratic function of θ . Moreover, ∇2L(θ) = ΨΣ−1Ψ> , so thatL(θ) is quadratic with D2 = ΨΣ−1Ψ> . This enables us to apply the generalresults of Section 4.5 which are only based on the geometric (quadratic) struc-ture of the log-likelihood L(θ) : the true data distribution can be arbitrary.

Page 150: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

150 4 Estimation in linear models

Theorem 4.6.1. Consider L(θ) from (4.21). For any θ , it holds with D2 =ΨΣ−1Ψ>

L(θ,θ) = (θ − θ)>D2(θ − θ)/2. (4.22)

In particular, if Σ = σ2IIn then the fitted log-likelihood is proportional to thequadratic loss ‖f − fθ‖2 for f = Ψ>θ and fθ = Ψ>θ :

L(θ,θ) =1

2σ2

∥∥Ψ>(θ − θ)∥∥2

=1

2σ2

∥∥f − fθ∥∥2.

If θ∗def= argmaxθ IEL(θ) = D−2ΨΣ−1f∗ for f∗ = IEY , then

2L(θ,θ∗) = ζ>D−2ζ = ‖ξ‖2 (4.23)

with ζ = ∇L(θ∗) and ξdef= D−1ζ .

Proof. The results (4.22) and (4.23) follow from Theorem 4.5.1; see (4.17) and(4.18).

If the model assumptions are not misspecified one can establish the re-markable χ2 result.

Theorem 4.6.2. Let L(θ) from (4.21) be the log-likelihood for the modelY = Ψ>θ∗ + ε with ε ∼ N(0,Σ) . Then ξ = D−1ζ ∼ N(0, IIp) and

2L(θ,θ∗) ∼ χ2p is chi-squared with p degrees of freedom.

Proof. By direct calculus

ζ = ∇L(θ∗) = ΨΣ−1(Y −Ψ>θ∗) = ΨΣ−1ε.

So, ζ is a linear transformation of a Gaussian vector Y and thus it is Gaus-sian as well. By Theorem 4.5.1, IEζ = 0 . Moreover, Var(ε) = Σ implies

Var(ζ) = IEΨ>Σ−1εε>Σ−1Ψ> = ΨΣ−1Ψ> = D2

yielding that ξ = D−1ζ is standard normal.

The last result 2L(θ,θ∗) ∼ χ2p is sometimes called the “chi-squared phe-

nomenon”: the distribution of the maximum likelihood only depends on thenumber of parameters to be estimated and is independent of the design Ψ ,of the noise covariance matrix Σ , etc. This particularly explains the use ofword “phenomenon” in the name of the result.

Page 151: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

4.6 Inference based on the maximum likelihood 151

Exercise 4.6.1. Check that the linear transformation Y = Σ−1/2Y of thedata does not change the value of the log-likelihood ratio L(θ,θ∗) and hence,

of the maximum likelihood L(θ,θ∗) .Hint: use the representation

L(θ) =1

2(Y −Ψ>θ)>Σ−1(Y −Ψ>θ) +R

=1

2(Y − Ψ>θ)>(Y − Ψ>θ) +R

and check that the transformed data Y is described by the model Y =Ψ>θ∗ + ε with Ψ = ΨΣ−1/2 and ε = Σ−1/2ε ∼ N(0, IIn) yielding the samelog-likelihood ratio as in the original model.

Exercise 4.6.2. Assume homogeneous noise in (4.21) with Σ = σ2IIn . Thenit holds

2L(θ,θ∗) = σ−2‖Πε‖2

where Π = Ψ>(ΨΨ>

)−1Ψ is the projector in IRn on the subspace spanned

by the vectors ψ1, . . . ,ψp .

Hint: use that ζ = σ−2Ψε , D2 = σ−2ΨΨ> , and

σ−2‖Πε‖2 = σ−2ε>Π>Πε = σ−2ε>Πε = ζ>D−2ζ.

We write the result of Theorem 4.6.1 in the form 2L(θ,θ∗) ∼ χ2p , where

χ2p stands for the chi-squared distribution with p degrees of freedom. This

result can be used to build likelihood-based confidence ellipsoids for the pa-rameter θ∗ . Given z > 0 , define

E(z) ={θ : L(θ,θ) ≤ z

}={θ : sup

θ′L(θ′)− L(θ) ≤ z

}. (4.24)

Theorem 4.6.3. Assume Y = Ψ>θ∗ + ε with ε ∼ N(0,Σ) and consider

the MLE θ . Define zα by P(χ2p > 2zα

)= α . Then E(zα) from (4.24) is an

α -confidence set for θ∗ .

Exercise 4.6.3. Let D2 = ΨΣ−1Ψ> . Check that the likelihood-based CSE(zα) and estimate-based CS E(zα) = {θ : ‖D(θ − θ)‖ ≤ zα} , z2

α = 2zα ,coincide in the case of the linear modeling:

E(zα) ={θ :∥∥D(θ − θ)

∥∥2 ≤ 2zα}.

Page 152: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

152 4 Estimation in linear models

Another corollary of the chi-squared result is a concentration bound for themaximum likelihood. A similar result was stated for the univariate exponentialfamily model: the value L(θ, θ∗) is stochastically bounded with exponentialmoments, and the bound does not depend on the particular family, param-eter value, sample size, etc. Now we can extend this result to the case of alinear Gaussian model. Indeed, Theorem 4.6.1 states that the distribution of2L(θ,θ∗) is chi-squared and only depends on the number of parameters to beestimated. The latter distribution concentrates on the ball of radius of orderp1/2 and the deviation probability is exponentially small.

Theorem 4.6.4. Assume Y = Ψ>θ∗ + ε with ε ∼ N(0,Σ) . Then for everyx > 0 , it holds with κ ≥ 6.6

IP(2L(θ,θ∗) > p+

√κxp ∨ (κx)

)= IP

(∥∥D(θ − θ∗)∥∥2> p+

√κxp ∨ (κx)

)≤ exp(−x). (4.25)

Proof. Define ξdef= D(θ−θ∗) . By Theorem 4.4.4 ξ is standard normal vector

in IRp and by Theorem 4.6.1 2L(θ,θ∗) = ‖ξ‖2 . Now the statement (4.25)follows from the general deviation bound for the Gaussian quadratic forms;see Theorem 9.2.1.

The main message of this result can be explained as follows: the deviationprobability that the estimate θ does not belong to the elliptic set E(z) =

{θ : ‖D(θ − θ)‖ ≤ z} starts to vanish when z2 exceeds the dimensionalityp of the parameter space. Similarly, the coverage probability that the trueparameter θ∗ is not covered by the confidence set E(z) starts to vanish when2z exceeds p .

Corollary 4.6.1. Assume Y = Ψ>θ∗+ ε with ε ∼ N(0,Σ) . Then for everyx > 0 , it holds with 2z = p+

√κxp ∨ (κx) for κ ≥ 6.6

IP(E(z) 63 θ∗

)≤ exp(−x).

Exercise 4.6.4. Compute z ensuring the covering of 95% in the dimensionp = 1, 2, 10, 20 .

4.6.1 A misspecified LPA

Now we discuss the behavior of the fitted log-likelihood for the misspecifiedlinear parametric assumption IEY = Ψ>θ∗ . Let the response function f∗

not be linearly expandable as f∗ = Ψ>θ∗ . Following to Theorem 4.4.3, de-

fine θ∗ = Sf∗ with S =(ΨΣ−1Ψ>

)−1ΨΣ−1 . This point provides the best

approximation of the nonlinear response f∗ by a linear parametric fit Ψ>θ .

Page 153: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

4.6 Inference based on the maximum likelihood 153

Theorem 4.6.5. Assume Y = f∗ + ε with ε ∼ N(0,Σ) . Let θ∗ = Sf∗ .

Then θ is an R-efficient estimate of θ∗ and

2L(θ,θ∗) = ζ>D−2ζ = ‖ξ‖2 ∼ χ2p ,

where D2 = ΨΣ−1Ψ> , ζ = ∇L(θ∗) = ΨΣ−1ε , ξ = D−1ζ is standardnormal vector in IRp and χ2

p is a chi-squared random variable with p degreesof freedom. In particular, E(zα) is an α -CS for the vector θ∗ and the boundof Corollary 4.6.1 applies.

Exercise 4.6.5. Prove the result of Theorem 4.6.5.

4.6.2 A misspecified noise structure

This section addresses the question about the features of the maximum likelihood in the case when the likelihood is built under a wrong assumption about the noise structure. As one can expect, the chi-squared result is not valid anymore in this situation and the distribution of the maximum likelihood depends on the true noise covariance. However, the nice geometric structure of the maximum likelihood manifested by Theorems 4.6.1 and 4.6.3 does not rely on the true data distribution and is only based on our structural assumptions on the considered model. This helps to get rigorous results about the behavior of the maximum likelihood and particularly about its concentration properties.

Theorem 4.6.6. Let θ̃ be built for the model Y = Ψ>θ∗ + ε with ε ∼ N(0,Σ) , while the true noise covariance is Σ0 : IEε = 0 and Var(ε) = Σ0 . Then

IEθ̃ = θ∗,    Var(θ̃) = D−2W 2D−2,

where

D2 = ΨΣ−1Ψ>,    W 2 = ΨΣ−1Σ0Σ−1Ψ>.

Further,

2L(θ̃,θ∗) = ‖D(θ̃ − θ∗)‖2 = ‖ξ‖2,   (4.26)

where ξ is a random vector in IRp with IEξ = 0 and

Var(ξ) = B def= D−1W 2D−1.

Moreover, if ε ∼ N(0,Σ0) , then θ̃ ∼ N(θ∗, D−2W 2D−2) and ξ ∼ N(0, B) .


Proof. The moments of θ̃ have been computed in Theorem 4.5.1 while the equality 2L(θ̃,θ∗) = ‖D(θ̃ − θ∗)‖2 = ‖ξ‖2 is given in Theorem 4.6.1. Next, ζ = ∇L(θ∗) = ΨΣ−1ε and

W 2 def= Var(ζ) = ΨΣ−1 Var(ε)Σ−1Ψ> = ΨΣ−1Σ0Σ−1Ψ>.

This implies that

Var(ξ) = IEξξ> = D−1 Var(ζ)D−1 = D−1W 2D−1.

It remains to note that if ε is a Gaussian vector, then ζ = ΨΣ−1ε , ξ = D−1ζ , and θ̃ − θ∗ = D−2ζ are Gaussian as well.

Exercise 4.6.6. Check that Σ0 = Σ leads back to the χ2 -result.

One can see that the chi-squared result is not valid any more if the noise structure is misspecified. An interesting question is whether the CS E(z) can be applied in the case of a misspecified noise under some proper adjustment of the value z . Surprisingly, the answer is not entirely negative. The reason is that the vector ξ from (4.26) is zero mean and its norm has a similar behavior as in the case of the correct noise specification: the probability IP(‖ξ‖ > z) starts to degenerate when z2 exceeds IE‖ξ‖2 . A general bound from Theorem 9.2.2 in Section 9.1 implies the following bound for the coverage probability.

Corollary 4.6.2. Under the conditions of Theorem 4.6.6, for every x > 0 , it holds with p = tr(B) , v2 = 2 tr(B2) , and a∗ = ‖B‖∞

IP(2L(θ̃,θ∗) > p + (2vx1/2) ∨ (6a∗x)) ≤ exp(−x).

Exercise 4.6.7. Show that an overestimation of the noise in the sense Σ ≥ Σ0 preserves the coverage probability for the CS E(zα) , that is, if 2zα is the 1 − α quantile of χ2p , then IP(E(zα) ∌ θ∗) ≤ α .

4.7 Ridge regression, projection, and shrinkage

This section discusses the important situation when the number of predictors ψj and hence the number of parameters p in the linear model Y = Ψ>θ∗ + ε is not small relative to the sample size. Then the least squares or the maximum likelihood approach meets serious problems. The first one relates to numerical issues. The definition of the LSE θ̃ involves the inversion of the p×p matrix ΨΨ> and such an inversion becomes a delicate task for large p . The other problem concerns the inference for the estimated parameter θ∗ . The risk bound and the width of the confidence set are proportional to the parameter dimension p and thus, with large p , the inference statements become


almost uninformative. In particular, if p is of the order of the sample size n , even consistency is not achievable. One faces a really critical situation. We already know that the MLE is the efficient estimate in the class of all unbiased estimates. At the same time it is highly inefficient in overparametrized models. The only way out of this situation is to sacrifice the unbiasedness property in favor of reducing the model complexity: some procedures can be more efficient than the MLE even if they are biased. This section discusses one way of resolving these problems by regularization or shrinkage. To be more specific, for the rest of the section we consider the following setup. The observed vector Y follows the model

Y = f∗ + ε   (4.27)

with a homogeneous error vector ε : IEε = 0 , Var(ε) = σ2IIn . Noise misspecification is not considered in this section.

Furthermore, we assume that a basis or a collection of basis vectors ψ1, . . . ,ψp is given with p large. This allows for approximating the response vector f = IEY in the form f = Ψ>θ∗ , or, equivalently,

f = θ∗1ψ1 + . . . + θ∗pψp .

In many cases we will assume that the basis is already orthogonalized: ΨΨ> = IIp . The model (4.27) can be rewritten as

Y = Ψ>θ∗ + ε,    Var(ε) = σ2IIn .

The MLE or oLSE of the parameter vector θ∗ for this model reads as

θ̃ = (ΨΨ>)−1ΨY ,    f̃ = Ψ>θ̃ = Ψ>(ΨΨ>)−1ΨY .

If the matrix ΨΨ> is degenerate or badly posed, computing the MLE θ̃ is a hard task. Below we discuss how this problem can be treated.

4.7.1 Regularization and ridge regression

Let R be a positive symmetric p × p matrix. Then the sum ΨΨ> + R is positive symmetric as well and can be inverted whatever the matrix Ψ is. This suggests to replace (ΨΨ>)−1 by (ΨΨ> + R)−1 , leading to the regularized least squares estimate θ̃R of the parameter vector θ and the corresponding response estimate f̃R :

θ̃R def= (ΨΨ> + R)−1ΨY ,    f̃R def= Ψ>(ΨΨ> + R)−1ΨY .   (4.28)

Such a method is also called ridge regression. An example of choosing R is a multiple of the unit matrix: R = αIIp where α > 0 and IIp stands for the


unit matrix. This method is also called Tikhonov regularization and it results in the parameter estimate θ̃α and the response estimate f̃α :

θ̃α def= (ΨΨ> + αIIp)−1ΨY ,    f̃α def= Ψ>(ΨΨ> + αIIp)−1ΨY .   (4.29)

A proper choice of the matrix R for the ridge regression method (4.28) or the parameter α for the Tikhonov regularization (4.29) is an important issue. Below we discuss several approaches which lead to the estimate (4.28) with a specific choice of the matrix R . The properties of the estimates θ̃R and f̃R will be studied in the context of penalized likelihood estimation in the next section.
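The following minimal numpy sketch illustrates (4.28) and (4.29) on simulated data; the function and variable names are our own and not taken from the book.

```python
import numpy as np

def ridge_estimate(Psi, Y, R):
    """Regularized LSE (4.28): theta_R = (Psi Psi^T + R)^{-1} Psi Y."""
    A = Psi @ Psi.T + R
    theta_R = np.linalg.solve(A, Psi @ Y)
    f_R = Psi.T @ theta_R                      # fitted response
    return theta_R, f_R

# Toy illustration (all data simulated, not from the book):
rng = np.random.default_rng(0)
n, p, sigma = 50, 10, 1.0
Psi = rng.standard_normal((p, n))              # rows play the role of the predictors psi_j
theta_star = rng.standard_normal(p)
Y = Psi.T @ theta_star + sigma * rng.standard_normal(n)

alpha = 1.0                                    # Tikhonov case: R = alpha * I_p, see (4.29)
theta_alpha, f_alpha = ridge_estimate(Psi, Y, alpha * np.eye(p))
```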

4.7.2 Penalized likelihood. Bias and variance

The estimate (4.28) can be obtained in a natural way within the (quasi) ML approach using penalized least squares. The classical unpenalized method is based on minimizing the sum of squared residuals:

θ̃ = argmax_θ L(θ) = arginf_θ ‖Y − Ψ>θ‖2

with L(θ) = −σ−2‖Y − Ψ>θ‖2/2 . (Here we omit the terms which do not depend on θ .) Now we introduce an additional penalty on the objective function which penalizes for the complexity of the candidate vector θ , expressed by the value ‖Gθ‖2/2 for a given symmetric matrix G . This choice of complexity measure implicitly assumes that the vector θ ≡ 0 has the smallest complexity equal to zero and that this complexity increases with the norm of Gθ . Define the penalized log-likelihood

LG(θ) def= L(θ) − ‖Gθ‖2/2
        = −(2σ2)−1‖Y − Ψ>θ‖2 − ‖Gθ‖2/2 − (n/2) log(2πσ2).   (4.30)

The penalized MLE reads as

θ̃G = argmax_θ LG(θ) = argmin_θ {(2σ2)−1‖Y − Ψ>θ‖2 + ‖Gθ‖2/2}.

A straightforward calculus leads to the expression (4.28) for θ̃G with R = σ2G2 :

θ̃G def= (ΨΨ> + σ2G2)−1ΨY .   (4.31)

We see that θ̃G is again a linear estimate: θ̃G = SGY with SG = (ΨΨ> + σ2G2)−1Ψ . The results of Section 4.4 explain that θ̃G in fact estimates the value θG defined by


θG = argmax_θ IELG(θ)
   = arginf_θ IE{‖Y − Ψ>θ‖2 + σ2‖Gθ‖2}
   = (ΨΨ> + σ2G2)−1Ψf∗ = SGf∗.   (4.32)

In particular, if f∗ = Ψ>θ∗ , then

θG = (ΨΨ> + σ2G2)−1ΨΨ>θ∗   (4.33)

and θG ≠ θ∗ unless G = 0 . In other words, the penalized MLE θ̃G is biased.

Exercise 4.7.1. Check that IEθ̃α = θα for θα = (ΨΨ> + αIIp)−1ΨΨ>θ∗ , and that the bias ‖θα − θ∗‖ grows with the regularization parameter α .

The penalized MLE θ̃G leads to the response estimate f̃G = Ψ>θ̃G .

Exercise 4.7.2. Check that the penalized ML approach leads to the response estimate

f̃G = Ψ>θ̃G = Ψ>(ΨΨ> + σ2G2)−1ΨY = ΠGY

with ΠG = Ψ>(ΨΨ> + σ2G2)−1Ψ . Show that ΠG is a sub-projector in the sense that ‖ΠGu‖ ≤ ‖u‖ for any u ∈ IRn .

Exercise 4.7.3. Let Ψ be orthonormal: ΨΨ> = IIp . Then the penalized MLE θ̃G can be represented as

θ̃G = (IIp + σ2G2)−1Z,

where Z = ΨY is the vector of empirical Fourier coefficients. Specify the result for the case of a diagonal matrix G = diag(g1, . . . , gp) and describe the corresponding response estimate f̃G .

The previous results indicate that introducing the penalization leads to some bias of estimation. One can ask about the benefit of using a penalized procedure. The next result shows that penalization decreases the variance of estimation and thus makes the procedure more stable.

Theorem 4.7.1. Let θ̃G be the penalized MLE from (4.31). Then IEθ̃G = θG , see (4.33), and under noise homogeneity Var(ε) = σ2IIn , it holds

Var(θ̃G) = (σ−2ΨΨ> + G2)−1 σ−2ΨΨ> (σ−2ΨΨ> + G2)−1 = D−2G D2D−2G


with D2G = σ−2ΨΨ> + G2 . In particular, Var(θ̃G) ≤ D−2G . If ε ∼ N(0, σ2IIn) , then θ̃G is also normal: θ̃G ∼ N(θG, D−2G D2D−2G ) .

Moreover, the bias ‖θG − θ∗‖ monotonously increases in G2 while the variance monotonously decreases with the penalization G .

Proof. The first two moments of θ̃G are computed from θ̃G = SGY . Monotonicity of the bias and variance of θ̃G is proved below in Exercise 4.7.6.

Exercise 4.7.4. Let Ψ be orthonormal: ΨΨ> = IIp . Describe Var(θ̃G) . Show that the variance decreases with the penalization G in the sense that G1 ≥ G implies Var(θ̃G1) ≤ Var(θ̃G) .

Exercise 4.7.5. Let ΨΨ> = IIp and let G = diag(g1, . . . , gp) be a diagonal matrix. Compute the squared bias ‖θG − θ∗‖2 and show that it monotonously increases in each gj for j = 1, . . . , p .

Exercise 4.7.6. Let G be a symmetric matrix and θ̃G the corresponding penalized MLE. Show that the variance Var(θ̃G) decreases while the bias ‖θG − θ∗‖ increases in G2 .
Hint: with D2 = σ−2ΨΨ> , show that for any vector w ∈ IRp and u = D−1w , it holds

w>Var(θ̃G)w = u>(IIp + D−1G2D−1)−2u

and this value decreases with G2 because IIp + D−1G2D−1 increases. Show in a similar way that

‖θG − θ∗‖2 = ‖(D2 + G2)−1G2θ∗‖2 = θ∗>Γ−1θ∗

with Γ = (IIp + G−2D2)(IIp + D2G−2) . Show that the matrix Γ monotonously increases and thus Γ−1 monotonously decreases as a function of the symmetric matrix B = G−2 .

Putting together the results about the bias and the variance of θ̃G yields the statement about the quadratic risk.

Theorem 4.7.2. Assume the model Y = Ψ>θ∗ + ε with Var(ε) = σ2IIn . Then the estimate θ̃G fulfills

IE‖θ̃G − θ∗‖2 = ‖θG − θ∗‖2 + tr(D−2G D2D−2G ).

This result is called the bias-variance decomposition. The choice of a proper regularization is usually based on this decomposition: one selects a regularization from a given class to provide the minimal possible risk. This approach is referred to as the bias-variance trade-off.
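As an illustration of the trade-off, the sketch below (simulated data, our own naming; the Tikhonov penalty R = αIIp is translated into G = √α σ−1 IIp ) evaluates the decomposition of Theorem 4.7.2 over a small grid of α and picks the minimizer.

```python
import numpy as np

def risk_decomposition(Psi, theta_star, sigma, G):
    """Bias-variance decomposition of Theorem 4.7.2 for the penalized MLE."""
    D2  = Psi @ Psi.T / sigma**2
    D2G = D2 + G @ G
    D2G_inv = np.linalg.inv(D2G)
    theta_G = D2G_inv @ (D2 @ theta_star)          # mean of the estimate, cf. (4.33)
    bias2 = np.sum((theta_G - theta_star)**2)
    var   = np.trace(D2G_inv @ D2 @ D2G_inv)       # tr(D_G^{-2} D^2 D_G^{-2})
    return bias2 + var, bias2, var

rng = np.random.default_rng(1)
p, n, sigma = 10, 50, 1.0
Psi = rng.standard_normal((p, n))
theta_star = rng.standard_normal(p)
# Tikhonov penalty: R = alpha * I_p, i.e. G^2 = (alpha / sigma^2) * I_p
risks = {a: risk_decomposition(Psi, theta_star, sigma,
                               np.sqrt(a) / sigma * np.eye(p))[0]
         for a in [0.0, 0.1, 1.0, 10.0]}
best_alpha = min(risks, key=risks.get)             # bias-variance trade-off on the grid
```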


4.7.3 Inference for the penalized MLE

Here we discuss some properties of the penalized MLE θ̃G . In particular, we focus on the construction of confidence and concentration sets based on the penalized log-likelihood. We know that the regularized estimate θ̃G is the empirical counterpart of the value θG which solves the regularized deterministic problem (4.32). We also know that the key results are expressed via the value of the supremum sup_θ LG(θ) − LG(θG) . The next result extends Theorem 4.6.1 to the penalized likelihood.

Theorem 4.7.3. Let LG(θ) be the penalized log-likelihood from (4.30). Then

2LG(θ̃G,θG) = (θ̃G − θG)>D2G(θ̃G − θG)   (4.34)
            = σ−2ε>ΠG ε   (4.35)

with ΠG = Ψ>(ΨΨ> + σ2G2)−1Ψ .

In general the matrix ΠG is not a projector and hence σ−2ε>ΠG ε is not χ2 -distributed: the chi-squared result does not apply.

Exercise 4.7.7. Prove (4.34).

Hint: apply the Taylor expansion to LG(θ) at θ̃G . Use that ∇LG(θ̃G) = 0 and −∇2LG(θ) ≡ σ−2ΨΨ> + G2 .

Exercise 4.7.8. Prove (4.35).

Hint: show that θ̃G − θG = SGε with SG = (ΨΨ> + σ2G2)−1Ψ .

The straightforward corollaries of Theorem 4.7.3 are the concentration and confidence probabilities. Define the confidence set EG(z) for θG as

EG(z) def= {θ : LG(θ̃G,θ) ≤ z}.

The definition implies the following result for the coverage probability:

IP(EG(z) ∌ θG) ≤ IP(LG(θ̃G,θG) > z).

Now the representation (4.35) for LG(θ̃G,θG) reduces the problem to a deviation bound for a quadratic form. We apply the general result of Section 9.1.

Theorem 4.7.4. Let LG(θ) be the penalized log-likelihood from (4.30) and let ε ∼ N(0, σ2IIn) . Then it holds with pG = tr(ΠG) and v2G = 2 tr(Π2G) that

IP(2LG(θ̃G,θG) > pG + (2vGx1/2) ∨ (6x)) ≤ exp(−x).


Similarly one can state the concentration result. With D2G = σ−2ΨΨ> + G2 ,

2LG(θ̃G,θG) = ‖DG(θ̃G − θG)‖2

and the result of Theorem 4.7.4 can be restated as the concentration bound:

IP(‖DG(θ̃G − θG)‖2 > pG + (2vGx1/2) ∨ (6x)) ≤ exp(−x).

In other words, θ̃G concentrates on the set A(z,θG) = {θ : ‖DG(θ − θG)‖2 ≤ 2z} for 2z > pG .
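A small numpy sketch (our own naming, not from the book) that evaluates the right-hand side of the bound of Theorem 4.7.4 for a given penalty G :

```python
import numpy as np

def concentration_radius(Psi, sigma, G, x):
    """Right-hand side of Theorem 4.7.4: p_G + max(2 v_G sqrt(x), 6 x),
    with p_G = tr(Pi_G) and v_G^2 = 2 tr(Pi_G^2)."""
    Pi_G = Psi.T @ np.linalg.inv(Psi @ Psi.T + sigma**2 * G @ G) @ Psi
    p_G = np.trace(Pi_G)
    v_G = np.sqrt(2.0 * np.trace(Pi_G @ Pi_G))
    return p_G + max(2.0 * v_G * np.sqrt(x), 6.0 * x)

# Toy design; the bound on 2 L_G(theta-tilde_G, theta_G) fails with prob. <= exp(-x)
rng = np.random.default_rng(2)
Psi = rng.standard_normal((5, 40))
z2 = concentration_radius(Psi, sigma=1.0, G=np.eye(5), x=2.0)
```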

4.7.4 Projection and shrinkage estimates

Consider a linear model Y = Ψ>θ∗ + ε in which the matrix Ψ is orthonormal in the sense ΨΨ> = IIp . Then the multiplication with Ψ maps this model into the sequence space model Z = θ∗ + ξ , where Z = ΨY = (z1, . . . , zp)> is the vector of empirical Fourier coefficients zj = ψ>j Y . The noise ξ = Ψε inherits the features of the original noise ε : if ε is zero mean and homogeneous, the same applies to ξ . The number of coefficients p can be large or even infinite. To get a sensible estimate, one has to apply some regularization method. The simplest one is called projection: one just keeps the first m empirical coefficients z1, . . . , zm and drops the others. The corresponding parameter estimate θ̃m reads as

θ̃m,j = zj if j ≤ m , and θ̃m,j = 0 otherwise.

The response vector f∗ = IEY is estimated by Ψ>θ̃m , leading to the representation

f̃m = z1ψ1 + . . . + zmψm

with zj = ψ>j Y . In other words, f̃m is just the projection of the observed vector Y onto the subspace Lm spanned by the first m basis vectors ψ1, . . . ,ψm : Lm = ⟨ψ1, . . . ,ψm⟩ . This explains the name of the method.

Clearly one can study the properties of θ̃m or f̃m using the methods of the previous sections. However, one more question for this approach is still open: a proper choice of m . The standard way of addressing this issue is based on the analysis of the quadratic risk.

Consider first the prediction risk defined as R(f̃m) = IE‖f̃m − f∗‖2 . Below we focus on the case of a homogeneous noise with Var(ε) = σ2IIp . An extension to colored noise is possible. Recall that f̃m effectively estimates the vector fm = Πmf∗ , where Πm is the projector on Lm ; see Section 4.3.3. Moreover, the quadratic risk R(f̃m) can be decomposed as


R(f̃m) = ‖f∗ − Πmf∗‖2 + σ2m = σ2m + ∑_{j=m+1}^{p} (θ∗j)2.

Obviously the squared bias ‖f∗ − Πmf∗‖2 decreases with m while the variance σ2m grows linearly with m . Risk minimization leads to the so-called bias-variance trade-off: one selects m which minimizes the risk R(f̃m) over all possible m :

m∗ def= argmin_m R(f̃m) = argmin_m {‖f∗ − Πmf∗‖2 + σ2m}.

Unfortunately this choice requires some information about the bias ‖f∗ − Πmf∗‖ which depends on the unknown vector f∗ . As this information is not available in typical situations, the value m∗ is also called an oracle choice. A data-driven choice of m is one of the central issues in nonparametric statistics.

The situation is not changed if we consider the estimation risk IE‖θ̃m − θ∗‖2 . Indeed, the basis orthogonality ΨΨ> = IIp implies for f∗ = Ψ>θ∗

‖f̃m − f∗‖2 = ‖Ψ>θ̃m − Ψ>θ∗‖2 = ‖θ̃m − θ∗‖2

and minimization of the estimation risk coincides with minimization of the prediction risk.

A disadvantage of the projection method is that it either keeps an empirical coefficient zj or completely discards it. An extension of the projection method is called shrinkage: one multiplies every empirical coefficient zj with a factor αj ∈ (0, 1) . This leads to the shrinkage estimate θ̃α with

θ̃α,j = αjzj .

Here α stands for the vector of coefficients αj for j = 1, . . . , p . The projection method is a special case of this shrinkage with αj equal to one or zero. Another popular choice of the coefficients αj is given by

αj = (1 − j/m)^β 1(j ≤ m)   (4.36)

for some β > 0 and m ≤ p . This choice ensures that the coefficients αj smoothly approach zero as j approaches the value m , and αj vanishes for j > m . In this case, the vector α is completely specified by the two parameters m and β . The projection method corresponds to β = 0 . The design orthogonality ΨΨ> = IIp yields again that the estimation risk IE‖θ̃α − θ∗‖2 coincides with the prediction risk IE‖f̃α − f∗‖2 .
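A minimal numpy sketch of the projection and shrinkage estimates in the sequence space model, with the weights (4.36); the data are simulated and all names are our own.

```python
import numpy as np

def shrinkage_estimate(z, alpha):
    """Coordinatewise shrinkage in the sequence space model Z = theta* + xi."""
    return alpha * z

def smooth_weights(p, m, beta):
    """Weights alpha_j = (1 - j/m)^beta for j <= m and 0 otherwise, cf. (4.36)."""
    j = np.arange(1, p + 1)
    return np.where(j <= m, np.clip(1.0 - j / m, 0.0, None) ** beta, 0.0)

# Projection with cut-off m is the special case beta = 0:
rng = np.random.default_rng(3)
p, sigma = 100, 0.5
theta_star = 1.0 / np.arange(1, p + 1)          # toy decaying coefficients
Z = theta_star + sigma * rng.standard_normal(p)
theta_proj   = shrinkage_estimate(Z, smooth_weights(p, m=20, beta=0.0))
theta_smooth = shrinkage_estimate(Z, smooth_weights(p, m=20, beta=2.0))
```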


Exercise 4.7.9. Let Var(ε) = σ2IIp . The risk R(f̃α) of the shrinkage estimate f̃α fulfills

R(f̃α) def= IE‖f̃α − f∗‖2 = ∑_{j=1}^{p} (θ∗j)2(1 − αj)2 + ∑_{j=1}^{p} α2j σ2.

Specify the cases of α = α(m,β) from (4.36). Evaluate the variance term ∑_j α2j σ2 .
Hint: approximate the sum over j by the integral ∫ (1 − x/m)_+^{2β} dx .

The oracle choice is again defined by risk minimization:

α∗ def= argmin_α R(f̃α),

where the minimization is taken over the class of all considered coefficient vectors α .

One way of obtaining a shrinkage estimate in the sequence space model Z = θ∗ + ξ is by using a roughness penalization. Let G be a symmetric matrix. Consider the regularized estimate θ̃G from (4.28). The next result claims that if G is a diagonal matrix, then θ̃G is a shrinkage estimate. Moreover, a general penalized MLE can be represented as shrinkage by an orthogonal basis transformation.

Theorem 4.7.5. Let G be a diagonal matrix, G = diag(g1, . . . , gp) . The penalized MLE θ̃G in the sequence space model Z = θ∗ + ξ with ξ ∼ N(0, σ2IIp) coincides with the shrinkage estimate θ̃α for αj = (1 + σ2g2j)−1 ≤ 1 . Moreover, a penalized MLE θ̃G for a general matrix G can be reduced to a shrinkage estimate by a basis transformation in the sequence space model.

Proof. The first statement for a diagonal matrix G follows from the representation θ̃G = (IIp + σ2G2)−1Z . Next, let U be an orthogonal transform leading to the diagonal representation G2 = U>D2U with D2 = diag(g1, . . . , gp) . Then

U θ̃G = (IIp + σ2D2)−1UZ,

that is, U θ̃G is a shrinkage estimate in the transformed model UZ = Uθ∗ + Uξ .

In other words, roughness penalization results in some kind of shrinkage. Interestingly, the inverse statement holds as well.

Exercise 4.7.10. Let θ̃α be a shrinkage estimate for a vector α = (αj) . Then there is a diagonal penalty matrix G such that θ̃α = θ̃G .
Hint: define the j th diagonal entry gj by the equation αj = (1 + σ2g2j)−1 .


4.7.5 Smoothness constraints and roughness penalty approach

Another way of reducing the complexity of the estimation procedure is based on smoothness constraints. The notion of smoothness originates from regression estimation. A non-linear regression function f is expanded using a Fourier or some other functional basis and θ∗ is the corresponding vector of coefficients. Smoothness properties of the regression function imply a certain rate of decay of the corresponding Fourier coefficients: the larger the frequency is, the less information about the regression function is contained in the related coefficient. This leads to the natural idea to replace the original optimization problem over the whole parameter space with a constrained optimization over a subset of “smooth” parameter vectors. Here we consider one popular example of Sobolev smoothness constraints, which effectively means that the s th derivative of the function f∗ has a bounded L2 -norm. A general Sobolev ball can be defined using a diagonal matrix G :

BG(R) def= {θ : ‖Gθ‖ ≤ R}.

Now we consider a constrained ML problem:

θ̃G,R = argmax_{θ∈BG(R)} L(θ) = argmin_{θ∈Θ: ‖Gθ‖≤R} ‖Y − Ψ>θ‖2.   (4.37)

The Lagrange multiplier method leads to an unconstrained problem

θ̃G,λ = argmin_θ {‖Y − Ψ>θ‖2 + λ‖Gθ‖2}.

A proper choice of λ ensures that the solution θ̃G,λ belongs to BG(R) and also solves the problem (4.37). So the approach based on a Sobolev smoothness assumption leads back to regularization and shrinkage.
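A possible numerical realization of this idea is sketched below. It assumes ΨΨ> is invertible, and the bisection rule for tuning λ is our own choice, not a prescription from the book.

```python
import numpy as np

def penalized_lse(Psi, Y, G, lam):
    """theta minimizing ||Y - Psi^T theta||^2 + lam * ||G theta||^2."""
    return np.linalg.solve(Psi @ Psi.T + lam * G @ G, Psi @ Y)

def constrained_lse(Psi, Y, G, R, lam_hi=1e6, iters=60):
    """Pick lambda by bisection so that ||G theta|| <= R (Sobolev-type constraint).
    Uses that the penalized norm ||G theta(lam)|| decreases in lam."""
    lam_lo = 0.0
    theta = penalized_lse(Psi, Y, G, lam_lo)
    if np.linalg.norm(G @ theta) <= R:          # unconstrained solution already feasible
        return theta
    for _ in range(iters):
        lam = 0.5 * (lam_lo + lam_hi)
        theta = penalized_lse(Psi, Y, G, lam)
        if np.linalg.norm(G @ theta) > R:
            lam_lo = lam
        else:
            lam_hi = lam
    return penalized_lse(Psi, Y, G, lam_hi)
```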

4.8 Shrinkage in a linear inverse problem

This section extends the previous approaches to the situation with indirect observations. More precisely, we focus on the model

Y = Af∗ + ε,   (4.38)

where A is a given linear operator (matrix) and f∗ is the target of analysis. With an obvious change of notation this problem can be put back into the general linear setup Y = Ψ>θ + ε . The special focus is due to the facts that the target can be high dimensional or even functional and that the product A>A is usually badly posed and its inversion is a hard task. Below we consider separately the case when the spectral representation for this problem is available and the general case.


4.8.1 Spectral cut-off and spectral penalization. Diagonal estimates

Suppose that the eigenvectors of the matrix A>A are available. This allows for reducing the model to the spectral representation by an orthogonal change of the coordinate system: Z = Λu + Λ1/2ξ with a diagonal matrix Λ = diag{λ1, . . . , λp} and a homogeneous noise Var(ξ) = σ2IIp ; see Section 4.2.4. Below we assume without loss of generality that the eigenvalues λj are ordered and decrease with j . This spectral representation means that one observes empirical Fourier coefficients zj described by the equation zj = λjuj + λ1/2j ξj for j = 1, . . . , p . The LSE or qMLE estimate of the spectral parameter u is given by

ũ = Λ−1Z = (λ−11 z1, . . . , λ−1p zp)>.

Exercise 4.8.1. Consider the spectral representation Z = Λu + Λ1/2ξ . Check that the LSE ũ reads as ũ = Λ−1Z .

If the dimension p of the model is high or, specifically, if the spectral values λj rapidly go to zero, it might be useful to only track a few coefficients u1, . . . , um and to set all the remaining ones to zero. The corresponding estimate ũm = (ũm,1, . . . , ũm,p)> reads as

ũm,j def= λ−1j zj if j ≤ m , and ũm,j def= 0 otherwise.

It is usually referred to as a spectral cut-off estimate.

Exercise 4.8.2. Consider the linear model Y = Af∗ + ε . Let U be an orthogonal transform in IRp providing UA>AU> = Λ with a diagonal matrix Λ , leading to the spectral representation for Z = UA>Y . Write the corresponding spectral cut-off estimate f̃m for the original vector f∗ . Show that computing this estimate only requires to know the first m eigenvalues and eigenvectors of the matrix A>A .

Similarly to the direct case, a spectral cut-off can be extended to spectral shrinkage: one multiplies every empirical coefficient zj with a factor αj ∈ (0, 1) . This leads to the spectral shrinkage estimate ũα with ũα,j = αjλ−1j zj . Here α stands for the vector of coefficients αj for j = 1, . . . , p . The spectral cut-off method is a special case of this shrinkage with αj equal to one or zero.

Exercise 4.8.3. Specify the spectral shrinkage ũα with a given vector α for the situation of Exercise 4.8.2.

The spectral cut-off method can be described as follows. Let ψ1,ψ2, . . . be the intrinsic orthonormal basis of the problem composed of the standardized eigenvectors of A>A and leading to the spectral representation Z = Λu +


Λ1/2ξ with the target vector u . In terms of the original target f∗ , one is looking for a solution or an estimate in the form f = ∑_j ujψj . The design orthogonality allows to estimate every coefficient uj independently of the others using the empirical Fourier coefficient ψ>j Y . Namely, ũj = λ−1j ψ>j Y = λ−1j zj . The LSE procedure tries to recover f as the full sum f̃ = ∑_j ũjψj . The projection method suggests to cut this sum at the index m : f̃m = ∑_{j≤m} ũjψj , while the shrinkage procedure is based on downweighting the empirical coefficients ũj : f̃α = ∑_j αj ũjψj .

Next we study the risk of the shrinkage method. Orthonormality of the basis ψj allows to represent the loss as ‖ũα − u∗‖2 = ‖f̃α − f∗‖2 . Under the noise homogeneity one obtains the following result.

Theorem 4.8.1. Let Z = Λu∗ + Λ1/2ξ with Var(ξ) = σ2IIp . It holds for the shrinkage estimate ũα

R(ũα) def= IE‖ũα − u∗‖2 = ∑_{j=1}^{p} |αj − 1|2 (u∗j)2 + ∑_{j=1}^{p} α2j σ2 λ−1j .

Proof. The empirical Fourier coefficients zj are uncorrelated with IEzj = λju∗j and Var(zj) = σ2λj . This implies

IE‖ũα − u∗‖2 = ∑_{j=1}^{p} IE|αjλ−1j zj − u∗j|2 = ∑_{j=1}^{p} {|αj − 1|2 (u∗j)2 + α2j σ2 λ−1j}

as required.

Risk minimization leads to the oracle choice of the vector α :

α∗ = argmin_α R(ũα),

where the minimum is taken over the set of all admissible vectors α . A similar analysis can be done for the spectral cut-off method.

Exercise 4.8.4. The risk of the spectral cut-off estimate ũm fulfills

R(ũm) = ∑_{j=1}^{m} λ−1j σ2 + ∑_{j=m+1}^{p} (u∗j)2.

Specify the choice of the oracle cut-off index m∗ .
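The following numpy sketch (our own naming, toy data) implements the spectral cut-off estimate via the eigendecomposition of A>A :

```python
import numpy as np

def spectral_cutoff(A, Y, m):
    """Spectral cut-off estimate of f* in Y = A f* + eps, keeping the m leading
    eigenpairs of A^T A (a sketch; assumes an exact eigendecomposition)."""
    lam, V = np.linalg.eigh(A.T @ A)            # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]               # reorder so lam_1 >= lam_2 >= ...
    lam, V = lam[order], V[:, order]
    z = V.T @ (A.T @ Y)                         # empirical Fourier coefficients
    u_hat = np.zeros_like(z)
    u_hat[:m] = z[:m] / lam[:m]                 # keep only the m leading coefficients
    return V @ u_hat                            # back to the original coordinates

# Toy usage with a mildly ill-posed operator:
rng = np.random.default_rng(4)
n, p = 80, 40
A = rng.standard_normal((n, p)) @ np.diag(1.0 / np.arange(1, p + 1))
f_star = rng.standard_normal(p)
Y = A @ f_star + 0.1 * rng.standard_normal(n)
f_hat = spectral_cutoff(A, Y, m=10)
```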


4.8.2 Galerkin method

A general problem with the spectral shrinkage approach is that it requires precise knowledge of the intrinsic basis ψ1,ψ2, . . . or, equivalently, of the eigenvalue decomposition of A leading to the spectral representation. After this basis is fixed, one can apply the projection or shrinkage method using the corresponding Fourier coefficients. In some situations this basis is hardly available or difficult to compute. A possible way out of this problem is to take some other orthogonal basis φ1,φ2, . . . which is tractable and convenient but does not lead to the spectral representation of the model. The Galerkin method is based on projecting the original high dimensional problem to a lower dimensional problem in terms of the new basis {φj} . Namely, without loss of generality suppose that the target function f∗ can be decomposed as

f∗ = ∑_j ujφj .

This can be achieved, e.g., if f∗ belongs to some Hilbert space and {φj} is an orthonormal basis in this space. Now we cut this sum and replace the exact decomposition by a finite approximation

f∗ ≈ fm = ∑_{j≤m} ujφj = Φ>mum ,

where um = (u1, . . . , um)> and the matrix Φm is built of the vectors φ1, . . . ,φm : Φm = (φ1, . . . ,φm) . Now we plug this decomposition into the original equation Y = Af∗ + ε . This leads to the linear model Y = AΦ>mum + ε = Ψ>mum + ε with Ψm = ΦmA> . The corresponding (quasi) MLE reads as

ũm = (ΨmΨ>m)−1ΨmY .

Note that for computing this estimate one only needs to evaluate the action of the operator A on the basis functions φ1, . . . ,φm and on the data Y . With this estimate ũm of the vector u∗ , one obtains the response estimate f̃m of the form

f̃m = Φ>mũm = ũ1φ1 + . . . + ũmφm .

The properties of this estimate can be studied in the same way as for a general qMLE in a linear model: the true data distribution follows (4.38) while we use the approximating model Y = Af∗m + ε with ε ∼ N(0, σ2I) for building the quasi likelihood.
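A minimal numpy sketch of the Galerkin estimate; the particular orthonormal basis Φm and the toy data are our own choices, not the book's.

```python
import numpy as np

def galerkin_estimate(A, Y, Phi_m):
    """Galerkin qMLE: project f* on span(phi_1,...,phi_m) and fit u_m from
    Y ~ A Phi_m^T u_m + eps.  Phi_m stores the basis vectors phi_j as rows."""
    Psi_m = Phi_m @ A.T                       # only the action of A on the basis is needed
    u_m = np.linalg.solve(Psi_m @ Psi_m.T, Psi_m @ Y)
    return Phi_m.T @ u_m                      # response estimate f_m

# Toy usage: an arbitrary orthonormal basis instead of the (unknown) eigenbasis of A^T A.
rng = np.random.default_rng(5)
n, p, m = 80, 40, 8
A = rng.standard_normal((n, p)) @ np.diag(1.0 / np.arange(1, p + 1))
f_star = np.sin(np.linspace(0, 3, p))
Y = A @ f_star + 0.1 * rng.standard_normal(n)
Phi_m = np.linalg.qr(rng.standard_normal((p, m)))[0].T    # m orthonormal rows
f_m = galerkin_estimate(A, Y, Phi_m)
```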

A further extension of the qMLE approach concerns the case when the operator A is not precisely known. Instead, an approximation or an estimate Ã of A is available. The pragmatic way of tackling this problem is to use


the model Y = Ãf∗m + ε for building the quasi likelihood. The use of the Galerkin method is quite natural in this situation because the spectral representation for Ã will not necessarily result in a similar representation for the true operator A .

4.9 Semiparametric estimation

This section discusses the situation when the target of estimation does not coincide with the parameter vector. This problem is usually referred to as semiparametric estimation. One typical example is the problem of estimating a part of the parameter vector. More generally, one can try to estimate a given function or functional of the unknown parameter. We focus here on linear modeling, that is, the considered model and the considered mapping of the parameter space to the target space are linear. For ease of presentation we assume everywhere the homogeneous noise with Var(ε) = σ2IIn .

4.9.1 (θ, η) - and υ -setup

This section presents two equivalent descriptions of the semiparametric problem. The first one assumes that the total parameter vector can be decomposed into the target parameter θ and the nuisance parameter η . The second one operates with the total parameter υ , and the target θ is a linear mapping of υ .

We start with the (θ,η) -setup. Let the response Y be modeled in dependence of two sets of factors: {ψj , j = 1, . . . , p} and {φm, m = 1, . . . , p1} . We are mostly interested in understanding the impact of the first set {ψj} but we cannot ignore the influence of the {φm} ’s. Otherwise the model would be incomplete. This situation can be described by the linear model

Y = Ψ>θ∗ + Φ>η∗ + ε,   (4.39)

where Ψ is the p × n matrix with the columns ψj , while Φ is the p1 × n matrix with the columns φm . We primarily aim at recovering the vector θ∗ , while the coefficients η∗ are of secondary importance. The corresponding (quasi) log-likelihood reads as

L(θ,η) = −(2σ2)−1‖Y − Ψ>θ − Φ>η‖2 + R,

where R denotes the remainder term which does not depend on the parameters θ,η . The more general υ -setup considers a general linear model

Y = Υ>υ∗ + ε,   (4.40)

where Υ is a p∗ × n matrix of p∗ factors, and the target of estimation is a linear mapping θ∗ = Pυ∗ for a given operator P from IRp∗ to IRp . Obviously the


(θ,η) -setup is a special case of the υ -setup. However, a general υ -setup can be reduced back to the (θ,η) -setup by a change of variables.

Exercise 4.9.1. Consider the sequence space model Y = υ∗ + ξ in IRp and let the target of estimation be the sum of the coefficients υ∗1 + . . . + υ∗p . Describe the υ -setup for the problem. Reduce it to the (θ,η) -setup by an orthogonal change of the basis.

In the υ -setup, the (quasi) log-likelihood reads as

L(υ) = −(2σ2)−1‖Y − Υ>υ‖2 + R,

where R is the remainder which does not depend on υ . It implies quadraticity of the log-likelihood L(υ) :

𝒟2 def= −∇2L(υ) = Var{∇L(υ)} = σ−2ΥΥ>.   (4.41)

Exercise 4.9.2. Check the statements in (4.41).

Exercise 4.9.3. Show that it holds for the model (4.39) with Υ> = (Ψ>, Φ>)

𝒟2 = σ−2ΥΥ> = σ−2 ( ΨΨ>  ΨΦ>
                     ΦΨ>  ΦΦ> ),   (4.42)

∇L(υ∗) = σ−2Υε = σ−2 ( Ψε
                        Φε ).

4.9.2 Orthogonality and product structure

Consider the model (4.39) under the orthogonality condition ΨΦ> = 0 . This condition effectively means that the factors of interest {ψj} are orthogonal to the nuisance factors {φm} . An important feature of this orthogonal case is that the model has a product structure leading to the additive form of the log-likelihood. Consider the partial θ -model Y = Ψ>θ + ε with the (quasi) log-likelihood

L(θ) = −(2σ2)−1‖Y − Ψ>θ‖2 + R.

Similarly, L1(η) = −(2σ2)−1‖Y − Φ>η‖2 + R1 denotes the log-likelihood in the partial η -model Y = Φ>η + ε .

Theorem 4.9.1. Assume the condition ΨΦ> = 0 . Then

L(θ,η) = L(θ) + L1(η) + R(Y ),   (4.43)

where R(Y ) is independent of θ and η . This implies the block diagonal structure of the matrix 𝒟2 = σ−2ΥΥ> :


𝒟2 = σ−2 ( ΨΨ>  0
            0   ΦΦ> ) = ( D2  0
                           0   H2 ),   (4.44)

with D2 = σ−2ΨΨ> , H2 = σ−2ΦΦ> .

Proof. The formula (4.44) follows from (4.42) and the orthogonality condition. The statement (4.43) follows if we show that the difference

L(θ,η) − L(θ) − L1(η)

does not depend on θ and η . This is a quadratic expression in θ,η , so it suffices to check its first and second derivatives w.r.t. the parameters. For the first derivative, it holds by the orthogonality condition

∇θL(θ,η) def= ∂L(θ,η)/∂θ = σ−2Ψ(Y − Ψ>θ − Φ>η) = σ−2Ψ(Y − Ψ>θ),

which coincides with ∇L(θ) . Similarly, ∇ηL(θ,η) = ∇L1(η) , yielding

∇{L(θ,η) − L(θ) − L1(η)} = 0.

The identities (4.41) and (4.42) imply that ∇2{L(θ,η) − L(θ) − L1(η)} = 0 . This implies the desired assertion.

Exercise 4.9.4. Check the statement (4.43) by direct computations. Describe the term R(Y ) .

Now we demonstrate how the general case can be reduced to the orthogonal one by a linear transformation of the nuisance parameter. Let C be a p × p1 matrix. Define η̆ = η + C>θ . Then the model equation Y = Ψ>θ + Φ>η + ε can be rewritten as

Y = Ψ>θ + Φ>(η̆ − C>θ) + ε = (Ψ − CΦ)>θ + Φ>η̆ + ε.

Now we select C to ensure the orthogonality. This leads to the equation

(Ψ − CΦ)Φ> = 0

or C = ΨΦ>(ΦΦ>)−1 . So, the original model can be rewritten as

Y = Ψ̆>θ + Φ>η̆ + ε,    Ψ̆ = Ψ − CΦ = Ψ(IIn − Πη),   (4.45)

where Πη = Φ>(ΦΦ>)−1Φ is the projector on the linear subspace spanned by the nuisance factors {φm} . This construction has a natural interpretation: correcting the θ -factors ψ1, . . . ,ψp by removing their interaction with the nuisance factors φ1, . . . ,φp1 reduces the general case to the orthogonal one. We summarize:


Theorem 4.9.2. The linear model (4.39) can be represented in the orthogonal form

Y = Ψ̆>θ + Φ>η̆ + ε,

where Ψ̆ from (4.45) satisfies Ψ̆Φ> = 0 and η̆ = η + C>θ for C = ΨΦ>(ΦΦ>)−1 . Moreover, it holds for υ = (θ, η̆)

L(υ) = L(θ) + L1(η̆) + R(Y )   (4.46)

with

L(θ) = −(2σ2)−1‖Y − Ψ̆>θ‖2 + R,
L1(η̆) = −(2σ2)−1‖Y − Φ>η̆‖2 + R1.

Exercise 4.9.5. Show that for C = ΨΦ>(ΦΦ>)−1

∇L(θ) = ∇θL(υ) − C∇ηL(υ).

Exercise 4.9.6. Show that the remainder term R(Y ) in the equation (4.46) is the same as in the orthogonal case (4.43).

Exercise 4.9.7. Show that Ψ̆Ψ̆> < ΨΨ> if ΨΦ> ≠ 0 .
Hint: for any vector γ ∈ IRp , it holds with h = Ψ>γ

γ>Ψ̆Ψ̆>γ = ‖(Ψ − ΨΠη)>γ‖2 = ‖h − Πηh‖2 ≤ ‖h‖2.

Moreover, equality here for any γ is only possible if

Πηh = Φ>(ΦΦ>)−1ΦΨ>γ ≡ 0,

that is, if ΦΨ> = 0 .

4.9.3 Partial estimation

This section explains the important notion of partial estimation which is quite natural and transparent in the (θ,η) -setup. Let some value η◦ of the nuisance parameter be fixed. A particular case of this sort is just ignoring the factors {φm} corresponding to the nuisance component, that is, one uses η◦ ≡ 0 . This approach is reasonable in certain situations, e.g. in the context of the projection method or spectral cut-off.

Define the estimate θ̃(η◦) by partial optimization of the joint log-likelihood L(θ,η◦) w.r.t. the first parameter θ :


θ̃(η◦) = argmax_θ L(θ,η◦).

Obviously θ̃(η◦) is the MLE in the residual model Y − Φ>η◦ = Ψ>θ∗ + ε :

θ̃(η◦) = (ΨΨ>)−1Ψ(Y − Φ>η◦).

This allows for describing the properties of the partial estimate θ̃(η◦) similarly to the usual parametric situation.

Theorem 4.9.3. Consider the model (4.39). Then the partial estimate θ̃(η◦) fulfills

IEθ̃(η◦) = θ∗ + (ΨΨ>)−1ΨΦ>(η∗ − η◦),    Var{θ̃(η◦)} = σ2(ΨΨ>)−1.

In words, θ̃(η◦) has the same variance as the MLE in the partial model Y = Ψ>θ∗ + ε but it is biased if ΨΦ>(η∗ − η◦) ≠ 0 . The ideal situation corresponds to the case when η◦ = η∗ . Then θ̃(η∗) is the MLE in the correctly specified θ -model: with Y (η∗) def= Y − Φ>η∗ ,

Y (η∗) = Ψ>θ∗ + ε.

An interesting and natural question is the legitimation of the partial estimation method: under which conditions is it justified and does not produce any estimation bias? The answer is given by Theorem 4.9.1: the orthogonality condition ΨΦ> = 0 ensures the desired feature because of the decomposition (4.43).

Theorem 4.9.4. Assume orthogonality ΨΦ> = 0 . Then the partial estimate θ̃(η◦) does not depend on the nuisance value η◦ used:

θ̃ = θ̃(η◦) = θ̃(η∗) = (ΨΨ>)−1ΨY .

In particular, one can ignore the nuisance parameter and estimate θ∗ from the partial incomplete model Y = Ψ>θ∗ + ε .

Exercise 4.9.8. Check that the partial derivative ∂L(θ,η)/∂θ does not depend on η under the orthogonality condition.

Partial estimation can also be considered in the context of estimating the nuisance parameter η by inverting the roles of θ and η . Namely, given a fixed value θ◦ , one can optimize the joint log-likelihood L(θ,η) w.r.t. the second argument η , leading to the estimate

η̃(θ◦) def= argmax_η L(θ◦,η).

In the orthogonal situation the initial point θ◦ is not important and one can use the partial incomplete model Y = Φ>η∗ + ε .
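A small numpy sketch of partial estimation (our own naming) which also evaluates the bias term of Theorem 4.9.3; the bias vanishes under the orthogonality ΨΦ> = 0 .

```python
import numpy as np

def partial_estimate(Psi, Phi, Y, eta0):
    """Partial MLE theta(eta0) = (Psi Psi^T)^{-1} Psi (Y - Phi^T eta0)."""
    return np.linalg.solve(Psi @ Psi.T, Psi @ (Y - Phi.T @ eta0))

def partial_bias(Psi, Phi, eta_star, eta0):
    """Bias term of Theorem 4.9.3: (Psi Psi^T)^{-1} Psi Phi^T (eta* - eta0)."""
    return np.linalg.solve(Psi @ Psi.T, Psi @ Phi.T @ (eta_star - eta0))
```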


4.9.4 Profile estimation

This section discusses a general profile likelihood method of estimating the target parameter θ in the semiparametric situation. Later we show its optimality and R-efficiency. The method suggests to first estimate the entire parameter vector υ by using the (quasi) ML method. Then the operator P is applied to the obtained estimate υ̃ to produce the estimate θ̃ . One can describe this method as

υ̃ = argmax_υ L(υ),    θ̃ = P υ̃.   (4.47)

The first step here is the usual LS estimation of υ∗ in the linear model (4.40):

υ̃ = arginf_υ ‖Y − Υ>υ‖2 = (ΥΥ>)−1ΥY .

The estimate θ̃ is obtained by applying P to υ̃ :

θ̃ = P υ̃ = P(ΥΥ>)−1ΥY = SY   (4.48)

with S = P(ΥΥ>)−1Υ . The properties of this estimate can be studied using the decomposition Y = f∗ + ε with f∗ = IEY ; cf. Section 4.4. In particular, it holds

IEθ̃ = Sf∗,    Var(θ̃) = S Var(ε)S>.   (4.49)

If the noise ε is homogeneous with Var(ε) = σ2IIn , then

Var(θ̃) = σ2SS> = σ2P(ΥΥ>)−1P>.   (4.50)

The next theorem summarizes our findings.

Theorem 4.9.5. Consider the model (4.40) with homogeneous errors Var(ε) = σ2IIn . The profile MLE θ̃ follows (4.48). Its mean and variance are given by (4.49) and (4.50).

The profile MLE is usually written in the (θ,η) -setup. Let υ = (θ,η) . Then the target estimate θ̃ is obtained by projecting the MLE υ̃ = (θ̃, η̃) on the θ -coordinates. This procedure can be formalized as

θ̃ = argmax_θ max_η L(θ,η).

Another way of describing the profile MLE is based on the partial optimization considered in the previous section. Define for each θ the value L(θ) by optimizing the log-likelihood L(υ) under the condition Pυ = θ :


L(θ) def= sup_{υ: Pυ=θ} L(υ) = sup_η L(θ,η).   (4.51)

Then θ̃ is defined by maximizing the partial fit L(θ) :

θ̃ def= argmax_θ L(θ).   (4.52)

Exercise 4.9.9. Check that (4.47) and (4.52) lead to the same estimate θ̃ .

We use for the function L(θ) obtained by partial optimization (4.51) the same notation as for the function obtained by the orthogonal decomposition (4.43) in Section 4.9.2. Later we show that these two functions indeed coincide. This helps in understanding the structure of the profile estimate θ̃ .

Consider first the orthogonal case ΨΦ> = 0 . This assumption greatly simplifies the study. In particular, the result of Theorem 4.9.4 for partial estimation can be obviously extended to the profile method in view of the product structure (4.43): when estimating the parameter θ , one can ignore the nuisance parameter η and proceed as if the partial model Y = Ψ>θ∗ + ε were correct. Theorem 4.9.1 implies:

Theorem 4.9.6. Assume that ΨΦ> = 0 in the model (4.39). Then the profile MLE θ̃ from (4.52) coincides with the MLE from the partial model Y = Ψ>θ∗ + ε :

θ̃ = argmax_θ L(θ) = argmin_θ ‖Y − Ψ>θ‖2 = (ΨΨ>)−1ΨY .

It holds IEθ̃ = θ∗ and

θ̃ − θ∗ = D−2ζ = D−1ξ

with D2 = σ−2ΨΨ> , ζ = σ−2Ψε , and ξ = D−1ζ . Finally, L(θ) from (4.51) fulfills

2{L(θ̃) − L(θ∗)} = ‖D(θ̃ − θ∗)‖2 = ζ>D−2ζ = ‖ξ‖2.

Moreover, if Var(ε) = σ2IIn , then Var(ξ) = IIp . If ε ∼ N(0, σ2IIn) , then ξ is standard normal in IRp .

The general case can be reduced to the orthogonal one by the construction from Theorem 4.9.2. Let

Ψ̆ = Ψ − ΨΠη = Ψ − ΨΦ>(ΦΦ>)−1Φ

be the corrected Ψ -factors after removing their interactions with the Φ -factors.


Theorem 4.9.7. Consider the model (4.39), and let the matrix D̆2 = σ−2Ψ̆Ψ̆> be non-degenerate. Then the profile MLE θ̃ reads as

θ̃ = argmin_θ ‖Y − Ψ̆>θ‖2 = (Ψ̆Ψ̆>)−1Ψ̆Y .   (4.53)

It holds IEθ̃ = θ∗ and

θ̃ − θ∗ = (Ψ̆Ψ̆>)−1Ψ̆ε = D̆−2ζ̆ = D̆−1ξ̆   (4.54)

with D̆2 = σ−2Ψ̆Ψ̆> , ζ̆ = σ−2Ψ̆ε , and ξ̆ = D̆−1ζ̆ . Furthermore, L(θ) from (4.51) fulfills

2{L(θ̃) − L(θ∗)} = ‖D̆(θ̃ − θ∗)‖2 = ζ̆>D̆−2ζ̆ = ‖ξ̆‖2.   (4.55)

Moreover, if Var(ε) = σ2IIn , then Var(ξ̆) = IIp . If ε ∼ N(0, σ2IIn) , then ξ̆ ∼ N(0, IIp) .

Finally we present the same result in terms of the original log-likelihood L(υ) .

Theorem 4.9.8. Write 𝒟2 = −∇2IEL(υ) for the model (4.39) in the block form

𝒟2 = ( D2  A
       A>  H2 ).   (4.56)

Let D2 and H2 be invertible. Then D̆2 and ζ̆ in (4.54) can be represented as

D̆2 = D2 − AH−2A>,
ζ̆ = ∇θL(υ∗) − AH−2∇ηL(υ∗).

Proof. In view of Theorem 4.9.7, it suffices to check the formulas for D̆2 and ζ̆ . One has for Ψ̆ = Ψ(IIn − Πη) and A = σ−2ΨΦ>

D̆2 = σ−2Ψ̆Ψ̆> = σ−2Ψ(IIn − Πη)Ψ>
   = σ−2ΨΨ> − σ−2ΨΦ>(ΦΦ>)−1ΦΨ> = D2 − AH−2A>.

Similarly, by AH−2 = ΨΦ>(ΦΦ>)−1 , ∇θL(υ∗) = σ−2Ψε , and ∇ηL(υ∗) = σ−2Φε ,

ζ̆ = σ−2Ψ̆ε = σ−2Ψε − σ−2ΨΦ>(ΦΦ>)−1Φε = ∇θL(υ∗) − AH−2∇ηL(υ∗),

as required.


It is worth stressing again that the result of Theorems 4.9.6 through 4.9.8 is purely geometrical. We only used the condition IEε = 0 in the model (4.39) and the quadratic structure of the log-likelihood function L(υ) . The distribution of the vector ε does not enter the results and proofs. However, the representation (4.54) allows for a straightforward analysis of the probabilistic properties of the estimate θ̃ .

Theorem 4.9.9. Consider the model (4.39) and let Var(Y ) = Var(ε) = Σ0 . Then

Var(θ̃) = σ−4D̆−2Ψ̆Σ0Ψ̆>D̆−2,    Var(ξ̆) = σ−4D̆−1Ψ̆Σ0Ψ̆>D̆−1.

In particular, if Var(Y ) = σ2IIn , this implies that

Var(θ̃) = D̆−2,    Var(ξ̆) = IIp.

Exercise 4.9.10. Check the result of Theorem 4.9.9. Specify this result to the orthogonal case ΨΦ> = 0 .
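The profile MLE (4.53) via the corrected factors admits a short numerical sketch (our naming, simulated data); the final check illustrates that it coincides with the θ -block of the full MLE, as (4.47) and Theorem 4.9.7 suggest.

```python
import numpy as np

def profile_mle(Psi, Phi, Y):
    """Profile MLE of theta in Y = Psi^T theta + Phi^T eta + eps, computed via the
    corrected factors Psi_breve = Psi (I - Pi_eta), cf. (4.45) and (4.53)."""
    n = Y.shape[0]
    Pi_eta = Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)    # projector on the nuisance span
    Psi_b = Psi @ (np.eye(n) - Pi_eta)
    return np.linalg.solve(Psi_b @ Psi_b.T, Psi_b @ Y)

# Sanity check against the theta-block of the full LSE of (theta, eta):
rng = np.random.default_rng(6)
p, p1, n = 3, 4, 60
Psi, Phi = rng.standard_normal((p, n)), rng.standard_normal((p1, n))
Y = Psi.T @ np.ones(p) + Phi.T @ np.ones(p1) + 0.1 * rng.standard_normal(n)
Upsilon = np.vstack([Psi, Phi])
full = np.linalg.solve(Upsilon @ Upsilon.T, Upsilon @ Y)
assert np.allclose(profile_mle(Psi, Phi, Y), full[:p])
```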

4.9.5 Semiparametric efficiency bound

The main goal of this section is to show that the profile method in semiparametric estimation leads to R-efficient procedures. Recall that the target of estimation is θ∗ = Pυ∗ for a given linear mapping P . The profile MLE θ̃ is one natural candidate. The next result claims its optimality.

Theorem 4.9.10 (Gauss–Markov). Let Y follow Y = Υ>υ∗ + ε with homogeneous errors ε . Then the estimate θ̃ of θ∗ = Pυ∗ from (4.48) is unbiased and

Var(θ̃) = σ2P(ΥΥ>)−1P>,

yielding

IE‖θ̃ − θ∗‖2 = σ2 tr{P(ΥΥ>)−1P>}.

Moreover, this risk is minimal in the class of all unbiased linear estimates of θ∗ .

Proof. The statements about the properties of θ̃ have already been proved. The lower bound can be proved by the same arguments as in the case of MLE estimation in Section 4.4.3. We only outline the main steps. Let θ̂ be any unbiased linear estimate of θ∗ . The idea is to show that the difference θ̂ − θ̃ is orthogonal to θ̃ in the sense IE(θ̂ − θ̃)θ̃> = 0 . This implies that


the variance of θ̂ is the sum of Var(θ̃) and Var(θ̂ − θ̃) and therefore larger than Var(θ̃) .

Let θ̂ = BY for some matrix B . Then IEθ̂ = BIEY = BΥ>υ∗ . The no-bias property yields the identity IEθ̂ = θ∗ = Pυ∗ and thus

BΥ> − P = 0.   (4.57)

Next, IEθ̂ = IEθ̃ = θ∗ and thus

IEθ̂θ̂> = θ∗θ∗> + Var(θ̂),
IEθ̃θ̃> = θ∗θ∗> + IE(θ̃ − IEθ̃)(θ̃ − IEθ̃)>.

Obviously θ̂ − IEθ̂ = Bε and θ̃ − IEθ̃ = Sε , yielding Var(θ̃) = σ2SS> and IEBε(Sε)> = σ2BS> . So

IE(θ̂ − θ̃)θ̃> = σ2(B − S)S>.

The identity (4.57) implies

(B − S)S> = {B − P(ΥΥ>)−1Υ}Υ>(ΥΥ>)−1P> = (BΥ> − P)(ΥΥ>)−1P> = 0

and the result follows.

Now we specify the efficiency bound for the (θ,η) -setup (4.39). In this case P is just the projector onto the θ -coordinates.

4.9.6 Inference for the profile likelihood approach

This section discusses the construction of confidence and concentration sets for profile ML estimation. The key fact behind this construction is the chi-squared result which extends without any change from the parametric to the semiparametric framework.

The definition of θ̃ in (4.52) suggests to define a CS for θ∗ as a level set of L(θ) = sup_{υ: Pυ=θ} L(υ) :

E(z) def= {θ : L(θ̃) − L(θ) ≤ z}.

This definition can be rewritten as

E(z) def= {θ : sup_υ L(υ) − sup_{υ: Pυ=θ} L(υ) ≤ z}.


It is obvious that the unconstrained maximum of the log-likelihood L(υ) w.r.t. υ is not smaller than the maximum under the constraint Pυ = θ . The point θ belongs to E(z) if the difference between these two values does not exceed z . As usual, the main question is the choice of a value z which ensures the prescribed coverage probability for θ∗ . This naturally leads to studying the deviation probability

IP( sup_υ L(υ) − sup_{υ: Pυ=θ∗} L(υ) > z ).

Such a study is especially simple in the orthogonal case. The answer can be expected: the expression and the value are exactly the same as in the case without any nuisance parameter η ; it simply has no impact. In particular, the chi-squared result still holds.

In this section we follow the line and the notation of Section 4.9.4. In particular, we use the block notation (4.56) for the matrix 𝒟2 = −∇2L(υ) .

Theorem 4.9.11. Consider the model (4.39). Let the matrix 𝒟2 be non-degenerate. If ε ∼ N(0, σ2IIn) , then

2{L(θ̃) − L(θ∗)} ∼ χ2p ,   (4.58)

that is, 2{L(θ̃) − L(θ∗)} is chi-squared with p degrees of freedom.

Proof. The result is based on the representation (4.55), 2{L(θ̃) − L(θ∗)} = ‖ξ̆‖2 , from Theorem 4.9.7. It remains to note that normality of ε implies normality of ξ̆ , and the moment conditions IEξ̆ = 0 , Var(ξ̆) = IIp imply (4.58).

This result means that the chi-squared result continues to hold in the general semiparametric framework as well. One possible explanation is as follows: it applies in the orthogonal case, and the general situation can be reduced to the orthogonal case by a change of coordinates which preserves the value of the maximum likelihood.

The statement (4.58) of Theorem 4.9.11 has an interesting geometric interpretation which is often used in analysis of variance. Consider the expansion

L(θ̃) − L(θ∗) = L(θ̃) − L(θ∗,η∗) − {L(θ∗) − L(θ∗,η∗)}.

The quantity L1 def= L(θ̃) − L(υ∗) coincides with L(υ̃,υ∗) ; see (4.47). Thus, 2L1 is chi-squared with p∗ degrees of freedom by the chi-squared result. Moreover, 2σ2L(υ̃,υ∗) = ‖Πυε‖2 , where Πυ = Υ>(ΥΥ>)−1Υ is the projector on the linear subspace spanned by the joint collection of factors {ψj} and {φm} . Similarly, the quantity L2 def= L(θ∗) − L(θ∗,η∗) = sup_η L(θ∗,η) − L(θ∗,η∗) is the maximum likelihood in the partial η -model. Therefore, 2L2 is also chi-squared distributed with p1 degrees of freedom, and 2σ2L2 = ‖Πηε‖2 ,


where Πη = Φ>(ΦΦ>)−1Φ is the projector on the linear subspace spanned by the η -factors {φm} . Now we use the decomposition Πυ = Πη + (Πυ − Πη) , in which Πυ − Πη is also a projector, on a subspace of dimension p . This explains the result (4.58) that the difference of these two quantities is chi-squared with p = p∗ − p1 degrees of freedom. The above consideration leads to the following result.

Theorem 4.9.12. It holds for the model (4.39) with Πθ = Πυ − Πη

2L(θ̃) − 2L(θ∗) = σ−2(‖Πυε‖2 − ‖Πηε‖2) = σ−2‖Πθε‖2 = σ−2ε>Πθε.   (4.59)

Exercise 4.9.11. Check the formula (4.59). Show that it implies (4.58).

4.9.7 Plug-in method

Although the profile MLE can be represented in a closed form, its computation can be a hard task if the dimensionality p1 of the nuisance parameter is high. Here we discuss an approach which simplifies the computations but leads to a suboptimal solution.

We start with the approach called plug-in. It is based on the assumption that a pilot estimate η̂ of the nuisance parameter η is available. Then one obtains the estimate θ̂ of the target θ∗ from the residuals Y − Φ>η̂ .

This means that the residual vector Ŷ = Y − Φ>η̂ is used as observations and the estimate θ̂ is defined as the best fit to such observations in the θ -model:

θ̂ = argmin_θ ‖Ŷ − Ψ>θ‖2 = (ΨΨ>)−1ΨŶ .   (4.60)

A very particular case of the plug-in method is partial estimation from Section 4.9.3 with η̂ ≡ η◦ .

The plug-in method can be naturally described in the context of partial estimation. We use the following representation of the plug-in method: θ̂ = θ̃(η̂) .

Exercise 4.9.12. Check the identity θ̂ = θ̃(η̂) for the plug-in method. Describe the plug-in estimate for η̂ ≡ 0 .

The behavior of θ̂ heavily depends upon the quality of the pilot η̂ . A detailed study is complicated and a closed form solution is only available for the special case of a linear pilot estimate. Let η̂ = AY . Then (4.60) implies

θ̂ = (ΨΨ>)−1Ψ(Y − Φ>AY ) = SY

with S = (ΨΨ>)−1Ψ(IIn − Φ>A) . This is a linear estimate whose properties

can be studied in a usual way.


4.9.8 Two step procedure

The ideas of partial and plug-in estimation can be combined, yielding the so-called two step procedures. One starts with an initial guess θ◦ for the target θ∗ . A very special choice is θ◦ ≡ 0 . This leads to the partial η -model Y (θ◦) = Φ>η + ε for the residuals Y (θ◦) = Y − Ψ>θ◦ . Next compute the partial MLE η̃(θ◦) = (ΦΦ>)−1ΦY (θ◦) in this model and use it as a pilot for the plug-in method: compute the residuals

Ŷ (θ◦) = Y − Φ>η̃(θ◦) = Y − ΠηY (θ◦)

with Πη = Φ>(ΦΦ>)−1Φ , and then estimate the target parameter θ by fitting Ψ>θ to the residuals Ŷ (θ◦) . This method results in the estimate

θ̂(θ◦) = (ΨΨ>)−1ΨŶ (θ◦).   (4.61)

A simple comparison with the formula (4.53) reveals that the pragmatic two step approach is sub-optimal: the resulting estimate does not coincide with the profile MLE θ̃ unless we are in the orthogonal situation with ΨΠη = 0 . In particular, the estimate θ̂(θ◦) from (4.61) is biased.
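A compact numpy sketch of the two step estimate (4.61) (our own naming, not code from the book):

```python
import numpy as np

def two_step_estimate(Psi, Phi, Y, theta0):
    """Two step procedure (4.61): fit the nuisance on the residuals Y - Psi^T theta0,
    then fit theta on Y with the fitted nuisance part removed."""
    Pi_eta = Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)
    Y_resid = Y - Pi_eta @ (Y - Psi.T @ theta0)       # Y - Phi^T eta(theta0)
    return np.linalg.solve(Psi @ Psi.T, Psi @ Y_resid)
```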

Exercise 4.9.13. Consider the orthogonal case with ΨΦ> = 0 . Show that the two step estimate θ̂(θ◦) coincides with the partial MLE θ̃ = (ΨΨ>)−1ΨY .

Exercise 4.9.14. Compute the mean of θ̂(θ◦) . Show that there exists some θ∗ such that IE{θ̂(θ◦)} ≠ θ∗ unless the orthogonality condition ΨΦ> = 0 is fulfilled.

Exercise 4.9.15. Compute the variance of θ̂(θ◦) .
Hint: use that Var{Y (θ◦)} = Var(Y ) = σ2IIn . Derive that Var{Ŷ (θ◦)} = σ2(IIn − Πη) .

Exercise 4.9.16. Let Ψ be orthogonal, i.e. ΨΨ> = IIp . Show that

Var{θ̂(θ◦)} = σ2(IIp − ΨΠηΨ>).

4.9.9 Alternating method

The idea of partial and two step estimation can be applied in an iterative way. One starts with some initial value θ◦ and sequentially performs the two steps of partial estimation. Set

η̃0 = η̃(θ◦) = argmin_η ‖Y − Ψ>θ◦ − Φ>η‖2 = (ΦΦ>)−1Φ(Y − Ψ>θ◦).


With this estimate fixed, compute θ̃1 = θ̃(η̃0) and continue in this way. Generically, with θ̃k and η̃k computed, one recomputes

θ̃k+1 = θ̃(η̃k) = (ΨΨ>)−1Ψ(Y − Φ>η̃k),   (4.62)
η̃k+1 = η̃(θ̃k+1) = (ΦΦ>)−1Φ(Y − Ψ>θ̃k+1).   (4.63)

The procedure is especially transparent if the partial design matrices Ψ and Φ are orthonormal: ΨΨ> = IIp , ΦΦ> = IIp1 . Then

θ̃k+1 = Ψ(Y − Φ>η̃k),
η̃k+1 = Φ(Y − Ψ>θ̃k+1).

In words, having an estimate θ̃ of the parameter θ∗ , one computes the residuals Y̆ = Y − Ψ>θ̃ and then builds the estimate η̃ of the nuisance η∗ from the empirical coefficients ΦY̆ . Then this estimate η̃ is used in a similar way to recompute the estimate of θ∗ , and so on.
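The alternating procedure (4.62)-(4.63) can be sketched as follows (our own naming, simulated setup assumed); by Theorem 4.9.13 below, it converges to the profile MLE when ‖ΠηΠθ‖∞ < 1 .

```python
import numpy as np

def alternating(Psi, Phi, Y, theta0, n_iter=100):
    """Alternating procedure (4.62)-(4.63): refit theta and eta in turn
    on the residuals left by the other block."""
    Psi_gram, Phi_gram = Psi @ Psi.T, Phi @ Phi.T
    eta = np.linalg.solve(Phi_gram, Phi @ (Y - Psi.T @ theta0))     # eta_0 = eta(theta_0)
    for _ in range(n_iter):
        theta = np.linalg.solve(Psi_gram, Psi @ (Y - Phi.T @ eta))  # step (4.62)
        eta = np.linalg.solve(Phi_gram, Phi @ (Y - Psi.T @ theta))  # step (4.63)
    return theta, eta
```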

It is worth noting that every doubled step of alternation improves the current value L(θ̃k, η̃k) . Indeed, θ̃k+1 is defined by maximizing L(θ, η̃k) , that is, L(θ̃k+1, η̃k) ≥ L(θ̃k, η̃k) . Similarly, L(θ̃k+1, η̃k+1) ≥ L(θ̃k+1, η̃k) , yielding

L(θ̃k+1, η̃k+1) ≥ L(θ̃k, η̃k).   (4.64)

A very interesting question is whether the procedure (4.62), (4.63) converges and whether it converges to the maximum likelihood solution. The answer is positive, and in the simplest orthogonal case the result is straightforward.

Exercise 4.9.17. Consider the orthogonal situation with ΨΦ> = 0 . Then the above procedure stabilizes in one step with the solution from Theorem 4.9.4.

In the non-orthogonal case the situation is much more complicated. The idea is to show that the alternating procedure can be represented as a sequence of actions of a shrinking linear operator on the data. The key observation behind the result is the following recurrent formula for Ψ>θ̃k and Φ>η̃k :

Ψ>θ̃k+1 = Πθ(Y − Φ>η̃k) = (Πθ − ΠθΠη)Y + ΠθΠηΨ>θ̃k,   (4.65)
Φ>η̃k+1 = Πη(Y − Ψ>θ̃k+1) = (Πη − ΠηΠθ)Y + ΠηΠθΦ>η̃k   (4.66)

with Πθ = Ψ>(ΨΨ>)−1Ψ and Πη = Φ>(ΦΦ>)−1Φ .

Exercise 4.9.18. Show (4.65) and (4.66).


This representation explains the necessary and sufficient conditions for convergence of the alternating procedure. Namely, the spectral norm ‖ΠηΠθ‖∞ (the largest singular value) of the product operator ΠηΠθ should be strictly less than one, and similarly for ΠθΠη .

Exercise 4.9.19. Show that ‖ΠθΠη‖∞ = ‖ΠηΠθ‖∞ .

Theorem 4.9.13. Suppose that ‖ΠηΠθ‖∞ = λ < 1 . Then the alternating

procedure converges geometrically, the limiting values θ and η are uniqueand fulfill

Ψ>θ = (IIn −ΠθΠη)−1(Πθ −ΠθΠη)Y ,

Φ>η = (IIn −ΠηΠθ)−1(Πη −ΠηΠθ)Y , (4.67)

and θ coincides with the profile MLE θ from (4.52).

Proof. The convergence will be discussed below. Now we comment on theidentity θ = θ . A direct comparison of the formulas for these two estimatescan be a hard task. Instead we use the monotonicity property (4.64). By

definition, (θ, η) maximize globally L(θ,η) . If we start the procedure with

θ◦ = θ , we would improve the value L(θ, η) at every step. By uniqueness,

the procedure stabilizes with θk = θ and ηk = η for every k .

Exercise 4.9.20. 1. Show by induction arguments that

Φ>ηk+1 = Ak+1Y +(ΠηΠθ

)kΦ>η1,

where the linear operator Ak fulfills A1 = 0 and

Ak+1 = Πη −ΠηΠθ + ΠηΠθAk =

k−1∑i=0

(ΠηΠθ)i(Πη −ΠηΠθ).

2. Show that Ak converges to A = (IIn−ΠηΠθ)−1(Πη−ΠηΠθ) and evaluate‖A−Ak‖∞ and ‖Φ>(ηk − η)‖ .Hint: use that ‖Πη −ΠηΠθ‖∞ ≤ 1 and ‖(ΠηΠθ)i‖∞ ≤ ‖ΠηΠθ‖i∞ ≤ λi .3. Prove (4.67) by inserting η in place of ηk and ηk+1 in (4.66).

4.10 Historical remarks and further reading

The least squares method goes back to Carl Gauss (around 1795, first published in 1809) and Adrien-Marie Legendre in 1805.

The notion of linear regression was introduced by Francis Galton around 1868 for a biological problem and then extended by Karl Pearson and Ronald Fisher between 1912 and 1922.

The chi-squared distribution was first described by the German statistician Friedrich Robert Helmert in papers of 1875/1876. The distribution was independently rediscovered by Karl Pearson in the context of goodness of fit.

The Gauss–Markov theorem is attributed to Gauß (1995) (originally published in Latin in 1821/3) and Markoff (1912).

Classical references for the sandwich formula (see, e.g., (4.12)) for the variance of the maximum likelihood estimator in a misspecified model are Huber (1967) and White (1982).

It seems that the term ridge regression has first been used by Hoerl (1962); see also the original paper by Tikhonov (1963) for what is nowadays referred to as Tikhonov regularization. An early reference on penalized maximum likelihood is Good and Gaskins (1971), who discussed the usage of a roughness penalty.

The original reference for the Galerkin method is Galerkin (1915). Its application in the context of regularization of inverse problems has been described, e.g., by Donoho (1995). The theory of shrinkage estimation started with the fundamental article by Stein (1956).

A systematic treatment of profile maximum likelihood is provided by Murphy and van der Vaart (2000), but the origins of this concept can be traced back to Fisher (1956).

The alternating procedure has been introduced by Dempster et al. (1977) in the form of the expectation-maximization algorithm.


5

Bayes estimation

This chapter discusses the Bayes approach to parameter estimation. This approach differs essentially from classical parametric modeling, also called the frequentist approach. Classical frequentist modeling assumes that the observed data Y follow a distribution law IP from a given parametric family (IP_θ, θ ∈ Θ ⊂ IR^p), that is,

IP = IP_{θ∗} ∈ (IP_θ).

Suppose that the family (IP_θ) is dominated by a measure µ_0 and denote by p(y | θ) the corresponding density:

p(y | θ) = dIP_θ/dµ_0 (y).

The likelihood is defined as the density at the observed point, and the maximum likelihood approach tries to recover the true parameter θ∗ by maximizing this likelihood over θ ∈ Θ.

In the Bayes approach, the paradigm is changed and the true data distribution is not assumed to be specified by a single parameter value θ∗. Instead, the unknown parameter is considered to be a random variable ϑ with a distribution π on the parameter space Θ called a prior. The measure IP_θ can be considered to be the data distribution conditioned on the event that the randomly selected parameter is exactly θ. The target of analysis is not a single value θ∗; this value is no longer defined. Instead one is interested in the posterior distribution of the random parameter ϑ given the observed data:

what is the distribution of ϑ given the prior π and the data Y?

In other words, one aims at inferring on the distribution of ϑ on the basis of the observed data Y and our prior knowledge π. Below we distinguish between the random variable ϑ and its particular values θ. However, one often uses the same symbol θ for denoting both objects.


5.1 Bayes formula

The Bayes modeling assumptions can be put together in the form

Y | θ ∼ p(· | θ),
ϑ ∼ π(·).

The first line has to be understood as the conditional distribution of Y given the particular value θ of the random parameter ϑ: Y | θ means Y | ϑ = θ.

This section states the Bayes approach in a formal mathematical way. The answer is given by the Bayes formula for the conditional distribution of ϑ given Y. First consider the joint distribution IP of Y and ϑ. If B is a Borel set in the space Y of observations and A is a measurable subset of Θ, then

IP(B × A) = ∫_A ( ∫_B IP_θ(dy) ) π(dθ).

The marginal or unconditional distribution of Y is given by averaging the joint probability w.r.t. the distribution of ϑ:

IP(B) = ∫_Θ ∫_B IP_θ(dy) π(dθ) = ∫_Θ IP_θ(B) π(dθ).

The posterior (conditional) distribution of ϑ given the event Y ∈ B is defined as the ratio of the joint and marginal probabilities:

IP(ϑ ∈ A | Y ∈ B) = IP(B × A) / IP(B).

Equivalently one can write this formula in terms of the related densities. In what follows we denote by the same letter π the prior measure π and its density w.r.t. some other measure λ, e.g. the Lebesgue or uniform measure on Θ. Then the joint measure IP has the density

p(y, θ) = p(y | θ) π(θ),

while the marginal density p(y) is the integral of the joint density w.r.t. the prior π:

p(y) = ∫_Θ p(y, θ) dθ = ∫_Θ p(y | θ) π(θ) dθ.

Finally the posterior (conditional) density p(θ | y) of ϑ given y is defined as the ratio of the joint density p(y, θ) and the marginal density p(y):

p(θ | y) = p(y, θ)/p(y) = p(y | θ) π(θ) / ∫_Θ p(y | θ) π(θ) dθ.

Our definitions are summarized in the next lines:

Y | θ ∼ p(y | θ),
ϑ ∼ π(θ),
Y ∼ p(y) = ∫_Θ p(y | θ) π(θ) dθ,
ϑ | Y ∼ p(θ | Y) = p(Y, θ)/p(Y) = p(Y | θ) π(θ) / ∫_Θ p(Y | θ) π(θ) dθ.    (5.1)

Note that given the prior π and the observations Y, the posterior density p(θ | Y) is uniquely defined and can be viewed as the solution or target of analysis within the Bayes approach. The expression (5.1) for the posterior density is called the Bayes formula.

The value p(y) of the marginal density of Y at y does not depend on the parameter θ. Given the data Y, it is just a numeric normalizing factor. Often one skips this factor and writes

ϑ | Y ∝ p(Y | θ) π(θ).

Below we consider a couple of examples.

Example 5.1.1. Let Y = (Y_1, . . . , Y_n)^⊤ be a sequence of zeros and ones considered to be a realization of a Bernoulli experiment for n = 10. Let also the underlying parameter θ be random and let it take the values 1/2 or 1, each with probability 1/2, that is,

π(1/2) = π(1) = 1/2.

Then the probability of observing y = “10 ones” is

IP(y) = (1/2) IP(y | ϑ = 1/2) + (1/2) IP(y | ϑ = 1).

The first probability is quite small, it equals 2^{-10}, while the second one is just one. Therefore, IP(y) = (2^{-10} + 1)/2. If we observed y = (1, . . . , 1)^⊤, then the posterior probability of ϑ = 1 is

IP(ϑ = 1 | y) = IP(y | ϑ = 1) IP(ϑ = 1) / IP(y) = 1/(2^{-10} + 1),

that is, it is quite close to one.
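The posterior computation of Example 5.1.1 can be reproduced in a few lines of Python. This sketch is only an illustration of the Bayes formula (5.1) for a two-point prior; it uses exact rational arithmetic from the standard library.

from fractions import Fraction

# Bayes formula (5.1) for Example 5.1.1: theta is 1/2 or 1, each with prior weight 1/2.
prior = {Fraction(1, 2): Fraction(1, 2), Fraction(1, 1): Fraction(1, 2)}
n = 10                                   # ten observed ones

def likelihood(theta, num_ones, n):
    # Bernoulli likelihood of observing num_ones ones among n trials
    return theta**num_ones * (1 - theta)**(n - num_ones)

joint = {th: likelihood(th, n, n) * pr for th, pr in prior.items()}
marginal = sum(joint.values())           # p(y) = (2**-10 + 1) / 2
posterior = {th: j / marginal for th, j in joint.items()}
print(posterior[Fraction(1, 1)])         # 1 / (2**-10 + 1), i.e. 1024/1025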


Exercise 5.1.1. Consider the Bernoulli experiment Y = (Y_1, . . . , Y_n)^⊤ with n = 10 and let

π(1/2) = π(0.9) = 1/2.

Compute the posterior distribution of ϑ if we observe y = (y_1, . . . , y_n)^⊤ with
• y = (1, . . . , 1)^⊤;
• the number of successes S = y_1 + . . . + y_n equal to 5.
Show that the posterior density p(θ | y) only depends on the number of successes S.

5.2 Conjugated priors

Let (IP_θ) be a dominated parametric family with the density function p(y | θ). For a prior π with the density π(θ), the posterior density is proportional to p(y | θ) π(θ). Now consider the case when the prior π belongs to some other parametric family indexed by a parameter α, that is, π(θ) = π(θ, α). A very desirable feature of the Bayes approach is that the posterior density also belongs to this family. Then computing the posterior is equivalent to fixing the related parameter α = α(Y). Such priors are usually called conjugated.

To illustrate this notion, we present some examples.

Example 5.2.1 (Gaussian Shift). Let Y_i ∼ N(θ, σ²) with σ known, i = 1, . . . , n. Consider ϑ ∼ N(τ, g²), α = (τ, g²). Then for y = (y_1, . . . , y_n)^⊤ ∈ IR^n

p(y | θ) π(θ, α) = C exp{ −∑(y_i − θ)²/(2σ²) − (θ − τ)²/(2g²) },

where the normalizing factor C does not depend on θ and y. The expression in the exponent is a quadratic form in θ, and the Taylor expansion w.r.t. θ at θ = τ implies

ϑ | Y ∝ exp{ −∑(y_i − τ)²/(2σ²) + (θ − τ)/σ² ∑(y_i − τ) − (nσ^{-2} + g^{-2})(θ − τ)²/2 }.

This representation indicates that the conditional distribution of θ given Y is normal. The parameters of the posterior will be computed in the next section.

Example 5.2.2 (Bernoulli). Let Y_i be a Bernoulli r.v. with IP(Y_i = 1) = θ. Then for y = (y_1, . . . , y_n)^⊤, it holds p(y | θ) = ∏_{i=1}^{n} θ^{y_i}(1 − θ)^{1−y_i}. Consider the family of priors from the Beta-distribution: π(θ, α) = π(α) θ^a (1 − θ)^b for α = (a, b), where π(α) is a normalizing constant. It follows

ϑ | y ∝ θ^a (1 − θ)^b ∏ θ^{y_i}(1 − θ)^{1−y_i} = θ^{s+a}(1 − θ)^{n−s+b}

for s = y_1 + . . . + y_n. Obviously, given s, this is again a distribution from the Beta-family.

Example 5.2.3 (Exponential). Let Y_i be an exponential r.v. with IP(Y_i ≥ y) = e^{−yθ}. Then p(y | θ) = ∏_{i=1}^{n} θ e^{−y_i θ}. Consider the family of priors from the Gamma-distribution with π(θ, α) = C(α) θ^a e^{−θb} for α = (a, b). One has for the vector y = (y_1, . . . , y_n)^⊤

ϑ | y ∝ θ^a e^{−θb} ∏_{i=1}^{n} θ e^{−y_i θ} = θ^{n+a} e^{−(s+b)θ},

which yields that the posterior is Gamma with the parameters n + a and s + b.

All the previous examples can be systematically treated as special cases of an exponential family. Let Y = (Y_1, . . . , Y_n) be an i.i.d. sample from a univariate EF (P_θ) with the density

p_1(y | θ) = p_1(y) exp{ yC(θ) − B(θ) }

for some fixed functions C(θ) and B(θ). For y = (y_1, . . . , y_n)^⊤ with s = y_1 + . . . + y_n, the joint density at the point y is given by

p(y | θ) ∝ exp{ sC(θ) − nB(θ) }.

This suggests to take a prior from the family π(θ, α) = π(α) exp{ aC(θ) − bB(θ) } with α = (a, b)^⊤. This yields for the posterior density

ϑ | y ∝ p(y | θ) π(θ, α) ∝ exp{ (s + a)C(θ) − (n + b)B(θ) },

which is from the same family with the new parameters α(y) = (s + a, n + b).
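As an illustration of conjugate updating (not part of the original text), the following sketch computes the Beta-type posterior of Example 5.2.2. It assumes NumPy and SciPy are available; the prior exponents a, b and the simulated data are arbitrary.

import numpy as np
from scipy import stats

# Conjugate updating for the Bernoulli/Beta pair of Example 5.2.2.
# A prior proportional to theta^a (1-theta)^b is Beta(a+1, b+1) in the usual parameterization.
a, b = 2.0, 2.0                       # illustrative prior exponents
rng = np.random.default_rng(1)
y = rng.binomial(1, 0.7, size=50)     # data drawn with theta* = 0.7
s, n = y.sum(), y.size

posterior = stats.beta(s + a + 1, n - s + b + 1)   # density proportional to theta^(s+a) (1-theta)^(n-s+b)
print(posterior.mean(), posterior.interval(0.95))  # posterior mean and a 95% credible interval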

Exercise 5.2.1. Build a conjugate prior for the Poisson family.

Exercise 5.2.2. Build a conjugate prior for the variance of the normal family with mean zero and unknown variance.

5.3 Linear Gaussian model and Gaussian priors

An interesting and important class of prior distributions is given by Gaussian priors. The very nice and desirable feature of this class is that the posterior distribution for the Gaussian model and Gaussian prior is also Gaussian.

5.3.1 Univariate case

We start with the case of a univariate parameter and one observation Y ∼ N(θ, σ²), where the variance σ² is known and only the mean θ is unknown. The Bayes approach suggests treating θ as a random variable. Suppose that the prior π is also normal with mean τ and variance r².

Theorem 5.3.1. Let Y ∼ N(θ, σ²), and let the prior π be the normal distribution N(τ, r²):

Y | θ ∼ N(θ, σ²),
ϑ ∼ N(τ, r²).

Then the joint, marginal, and posterior distributions are normal as well. Moreover, it holds

Y ∼ N(τ, σ² + r²),
ϑ | Y ∼ N( (τσ² + Y r²)/(σ² + r²), σ²r²/(σ² + r²) ).

Proof. It holds Y = ϑ + ε with ϑ ∼ N(τ, r²) and ε ∼ N(0, σ²) independent of ϑ. Therefore, Y is normal with mean IEY = IEϑ + IEε = τ, and the variance is

Var(Y) = IE(Y − τ)² = r² + σ².

This implies the formula for the marginal density p(Y). Next, for ρ = σ²/(r² + σ²),

IE[(ϑ − τ)(Y − τ)] = IE(ϑ − τ)² = r² = (1 − ρ) Var(Y).

Thus, the random variables Y − τ and ζ with

ζ = ϑ − τ − (1 − ρ)(Y − τ) = ρ(ϑ − τ) − (1 − ρ)ε

are Gaussian and uncorrelated and therefore independent. The conditional distribution of ζ given Y coincides with the unconditional distribution and hence it is normal with mean zero and variance

Var(ζ) = ρ² Var(ϑ) + (1 − ρ)² Var(ε) = ρ²r² + (1 − ρ)²σ² = σ²r²/(σ² + r²).

This yields the result because ϑ = ζ + ρτ + (1 − ρ)Y.

Exercise 5.3.1. Check the result of Theorem 5.3.1 by direct calculation using the Bayes formula (5.1).

Now consider the i.i.d. model from N(θ, σ²) where the variance σ² is known.

Theorem 5.3.2. Let Y = (Y_1, . . . , Y_n)^⊤ be i.i.d. and for each Y_i

Y_i | θ ∼ N(θ, σ²),    (5.2)
ϑ ∼ N(τ, r²).    (5.3)

Then for Ȳ = (Y_1 + . . . + Y_n)/n

ϑ | Y ∼ N( (τσ²/n + Ȳ r²)/(r² + σ²/n), (r²σ²/n)/(r² + σ²/n) ).

So, the posterior mean of ϑ is a weighted average of the prior mean τ and the sample estimate Ȳ; the sample estimate is pulled back (or shrunk) toward the prior mean. Moreover, the weight ρ on the prior mean is close to one if σ² is large relative to r² (i.e. our prior knowledge is more precise than the data information), producing substantial shrinkage. If σ² is small (i.e., our prior knowledge is imprecise relative to the data information), ρ is close to zero and the direct estimate Ȳ is moved very little towards the prior mean.

Exercise 5.3.2. Prove Theorem 5.3.2 using the technique of the proof of Theorem 5.3.1.
Hint: consider Y_i = ϑ + ε_i, Ȳ = S/n, and define ζ = ϑ − τ − (1 − ρ)(Ȳ − τ). Check that ζ and Ȳ are uncorrelated and hence independent.

The result of Theorem 5.3.2 can formally be derived from Theorem 5.3.1 by replacing the n i.i.d. observations Y_1, . . . , Y_n with one single observation Ȳ with conditional mean θ and variance σ²/n.
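A quick numerical check of Theorem 5.3.2 (an illustration only; NumPy assumed, all constants arbitrary) shows the shrinkage of the sample mean toward the prior mean:

import numpy as np

# Numerical check of Theorem 5.3.2; names and values are illustrative.
rng = np.random.default_rng(2)
sigma2, r2, tau = 1.0, 4.0, 0.0          # noise variance, prior variance, prior mean
n, theta_true = 25, 1.5
Y = rng.normal(theta_true, np.sqrt(sigma2), size=n)
Ybar = Y.mean()

rho = (sigma2 / n) / (r2 + sigma2 / n)   # weight on the prior mean
post_mean = rho * tau + (1 - rho) * Ybar
post_var = (r2 * sigma2 / n) / (r2 + sigma2 / n)
print(post_mean, post_var)               # mean shrunk toward tau; variance below both r2 and sigma2/n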

5.3.2 Linear Gaussian model and Gaussian prior

Now we consider the general case when both Y and ϑ are vectors. Namely, we consider the linear model Y = Ψ^⊤ϑ + ε with Gaussian errors ε in which the random parameter vector ϑ is multivariate normal as well:

ϑ ∼ N(τ, Γ),    Y | θ ∼ N(Ψ^⊤θ, Σ).    (5.4)

Here Ψ is a given p × n design matrix, and Σ is a given error covariance matrix. Below we assume that both Σ and Γ are non-degenerate. The model (5.4) can be represented in the form

ϑ = τ + ξ,    ξ ∼ N(0, Γ),    (5.5)
Y = Ψ^⊤τ + Ψ^⊤ξ + ε,    ε ∼ N(0, Σ),    ε ⊥ ξ,    (5.6)

where ξ ⊥ ε means independence of the error vectors ξ and ε. This representation makes clear that the vectors ϑ, Y are jointly normal. Now we state the result about the conditional distribution of ϑ given Y.

Theorem 5.3.3. Assume (5.4). Then the joint distribution of (ϑ, Y) is normal with

IE(ϑ) = τ,    IE(Y) = Ψ^⊤τ,
Var(ϑ) = Γ,    Cov(ϑ, Y) = ΓΨ,    Var(Y) = Ψ^⊤ΓΨ + Σ.

Moreover, the posterior ϑ | Y is also normal. With B = Γ^{-1} + ΨΣ^{-1}Ψ^⊤,

IE(ϑ | Y) = τ + ΓΨ(Ψ^⊤ΓΨ + Σ)^{-1}(Y − Ψ^⊤τ) = B^{-1}Γ^{-1}τ + B^{-1}ΨΣ^{-1}Y,    (5.7)
Var(ϑ | Y) = B^{-1}.    (5.8)

Proof. The following technical lemma explains a very important property of the normal law: the conditional distribution of a jointly normal vector given another one is again normal.

Lemma 5.3.1. Let ξ and η be jointly normal. Denote U = Var(ξ), W = Var(η), C = Cov(ξ, η) = IE(ξ − IEξ)(η − IEη)^⊤. Then the conditional distribution of ξ given η is also normal with

IE[ξ | η] = IEξ + CW^{-1}(η − IEη),
Var[ξ | η] = U − CW^{-1}C^⊤.

Proof. First consider the case when ξ and η are zero-mean. Then the vector

ζ := ξ − CW^{-1}η

is also normal and zero-mean and fulfills

IE(ζη^⊤) = IE[(ξ − CW^{-1}η)η^⊤] = IE(ξη^⊤) − CW^{-1}IE(ηη^⊤) = 0,
Var(ζ) = IE[(ξ − CW^{-1}η)(ξ − CW^{-1}η)^⊤] = IE[(ξ − CW^{-1}η)ξ^⊤] = U − CW^{-1}C^⊤.

The vectors ζ and η are jointly normal and uncorrelated, thus independent. This means that the conditional distribution of ζ given η coincides with the unconditional one. It remains to note that ξ = ζ + CW^{-1}η. Hence, conditionally on η, the vector ξ is just a shift of the normal vector ζ by the fixed vector CW^{-1}η. Therefore, the conditional distribution of ξ given η is normal with mean CW^{-1}η and variance Var(ζ) = U − CW^{-1}C^⊤.

Exercise 5.3.3. Extend the proof of Lemma 5.3.1 to the case when the vectors ξ and η are not zero-mean.

It remains to deduce the desired result about the posterior distribution from this lemma. The formulas for the first two moments of ϑ and Y follow directly from (5.5) and (5.6). Now we apply Lemma 5.3.1 with U = Γ, C = ΓΨ, W = Ψ^⊤ΓΨ + Σ. It follows that the vector ϑ conditioned on Y is normal with

IE(ϑ | Y) = τ + ΓΨW^{-1}(Y − Ψ^⊤τ),    (5.9)
Var(ϑ | Y) = Γ − ΓΨW^{-1}Ψ^⊤Γ.

Now we apply the following identities: for any p × n matrix A

A(I_n + A^⊤A)^{-1} = (I_p + AA^⊤)^{-1}A,    (5.10)
I_p − A(I_n + A^⊤A)^{-1}A^⊤ = (I_p + AA^⊤)^{-1}.

The latter implies with A = Γ^{1/2}ΨΣ^{-1/2} that

Γ − ΓΨW^{-1}Ψ^⊤Γ = Γ^{1/2}{ I_p − A(I_n + A^⊤A)^{-1}A^⊤ }Γ^{1/2}
= Γ^{1/2}(I_p + AA^⊤)^{-1}Γ^{1/2} = (Γ^{-1} + ΨΣ^{-1}Ψ^⊤)^{-1} = B^{-1}.

Similarly, (5.10) yields with the same A

ΓΨW^{-1} = Γ^{1/2}A(I_n + A^⊤A)^{-1}Σ^{-1/2} = Γ^{1/2}(I_p + AA^⊤)^{-1}AΣ^{-1/2} = B^{-1}ΨΣ^{-1}.

This implies (5.7) in view of (5.9).

Exercise 5.3.4. Check the details of the proof of Theorem 5.3.3.

Exercise 5.3.5. Derive the result of Theorem 5.3.3 by direct computation of the density of ϑ given Y.
Hint: use that ϑ and Y are jointly normal vectors. Consider their joint density p(θ, Y) for Y fixed and obtain the conditional density by analyzing its linear and quadratic terms w.r.t. θ.

Exercise 5.3.6. Show that Var(ϑ | Y) < Var(ϑ) = Γ.
Hint: use that Var(ϑ | Y) = B^{-1} and B := Γ^{-1} + ΨΣ^{-1}Ψ^⊤ > Γ^{-1}.

The last exercise delivers an important message: the variance of the posterior is smaller than the variance of the prior. This is intuitively clear because the posterior utilizes both sources of information: the one contained in the prior and the one we get from the data Y. However, even in the simple Gaussian case, the proof is quite involved. Another interpretation of this fact will be given later: the Bayes approach effectively performs a kind of regularization and thus leads to a reduction of the variance; cf. Section 4.7. Another conclusion from the formulas (5.7) and (5.8) is that the moments of the posterior distribution approach the moments of the MLE θ = (ΨΣ^{-1}Ψ^⊤)^{-1}ΨΣ^{-1}Y as Γ grows.
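The two expressions for the posterior mean in (5.7) and the posterior variance (5.8) can be checked numerically. The following sketch is illustrative only and assumes NumPy; all matrices and constants are arbitrary.

import numpy as np

# Numerical check of the two expressions in (5.7) and of (5.8); quantities are illustrative.
rng = np.random.default_rng(3)
p, n = 3, 40
Psi = rng.normal(size=(p, n))
Sigma = 0.5 * np.eye(n)                  # error covariance
Gamma = np.diag([1.0, 2.0, 0.5])         # prior covariance
tau = np.array([0.2, -0.1, 0.4])         # prior mean
theta = rng.multivariate_normal(tau, Gamma)
Y = Psi.T @ theta + rng.multivariate_normal(np.zeros(n), Sigma)

W = Psi.T @ Gamma @ Psi + Sigma
mean_1 = tau + Gamma @ Psi @ np.linalg.solve(W, Y - Psi.T @ tau)     # first form of (5.7)
B = np.linalg.inv(Gamma) + Psi @ np.linalg.solve(Sigma, Psi.T)
mean_2 = np.linalg.solve(B, np.linalg.solve(Gamma, tau) + Psi @ np.linalg.solve(Sigma, Y))
post_var = np.linalg.inv(B)                                          # (5.8)
print(np.allclose(mean_1, mean_2), post_var.diagonal())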

5.3.3 Homogeneous errors, orthogonal design

Consider a linear model Y_i = Ψ_i^⊤ϑ + ε_i for i = 1, . . . , n, where the Ψ_i are given vectors in IR^p and the ε_i are i.i.d. normal N(0, σ²). This model is a special case of the model (5.4) with Ψ = (Ψ_1, . . . , Ψ_n) and uncorrelated homogeneous errors ε yielding Σ = σ²I_n. Then Σ^{-1} = σ^{-2}I_n, B = Γ^{-1} + σ^{-2}ΨΨ^⊤, and

IE(ϑ | Y) = B^{-1}Γ^{-1}τ + σ^{-2}B^{-1}ΨY,    (5.11)
Var(ϑ | Y) = B^{-1},

where ΨΨ^⊤ = ∑_i Ψ_iΨ_i^⊤. If the prior variance is also homogeneous, that is, Γ = r²I_p, then the formulas can be further simplified. In particular,

Var(ϑ | Y) = (r^{-2}I_p + σ^{-2}ΨΨ^⊤)^{-1}.

The most transparent case corresponds to the orthogonal design with ΨΨ^⊤ = η²I_p for some η² > 0. Then

IE(ϑ | Y) = (σ²/r²)/(η² + σ²/r²) τ + 1/(η² + σ²/r²) ΨY,    (5.12)
Var(ϑ | Y) = σ²/(η² + σ²/r²) I_p.    (5.13)

Exercise 5.3.7. Derive (5.12) and (5.13) from Theorem 5.3.3 with Σ = σ²I_n, Γ = r²I_p, and ΨΨ^⊤ = η²I_p.

Exercise 5.3.8. Show that the posterior mean is a convex combination of the MLE θ = η^{-2}ΨY and the prior mean τ:

IE(ϑ | Y) = ρτ + (1 − ρ)θ

with ρ = (σ²/r²)/(η² + σ²/r²). Moreover, ρ → 0 as η → ∞, that is, the posterior mean approaches the MLE θ.


5.4 Non-informative priors

The Bayes approach requires fixing a prior distribution on the values of the parameter ϑ. What happens if no such information is available? Is the Bayes approach still applicable? An immediate answer is “no”, but it is a bit hasty. Actually one can still apply the Bayes approach with priors which do not give any preference to one point against the others. Such priors are called non-informative. Consider first the case when the set Θ is finite: Θ = {θ_1, . . . , θ_M}. Then the non-informative prior is just the uniform measure on Θ giving every point θ_m the equal probability 1/M. Then the joint probability of Y and ϑ is the average of the measures IP_{θ_m}, and the same holds for the marginal distribution of the data:

p(y) = (1/M) ∑_{m=1}^{M} p(y | θ_m).

The posterior distribution is already “informative” and it differs from the uniform prior:

p(θ_k | y) = p(y | θ_k) π(θ_k) / p(y) = p(y | θ_k) / ∑_{m=1}^{M} p(y | θ_m),    k = 1, . . . , M.
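For a finite parameter set, the posterior under the uniform prior is just the table of normalized likelihoods. The following sketch (illustrative only; NumPy and SciPy assumed, with a Gaussian likelihood chosen purely as an example model) computes it.

import numpy as np
from scipy import stats

# Uniform (non-informative) prior over a finite set Theta = {theta_1, ..., theta_M}.
thetas = np.array([-1.0, 0.0, 0.5, 2.0])     # illustrative candidate parameter values
y = np.array([0.4, 0.9, 0.1, 0.7, 0.5])      # observed sample, assumed N(theta, 1) given theta

loglik = np.array([stats.norm.logpdf(y, loc=th, scale=1.0).sum() for th in thetas])
posterior = np.exp(loglik - loglik.max())
posterior /= posterior.sum()                  # p(theta_k | y) = p(y | theta_k) / sum_m p(y | theta_m)
print(dict(zip(thetas, posterior.round(3))))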

Exercise 5.4.1. Check that the posterior measure is non-informative iff all the measures IP_{θ_m} coincide.

A similar situation arises if the set Θ is a non-discrete bounded subset of IR^p. A typical example is given by the case of a univariate parameter restricted to a finite interval [a, b]. Define π(θ) = 1/π(Θ), where

π(Θ) := ∫_Θ dθ.

Then

p(y) = (1/π(Θ)) ∫_Θ p(y | θ) dθ,
p(θ | y) = p(y | θ) π(θ) / p(y) = p(y | θ) / ∫_Θ p(y | θ) dθ.    (5.14)

In some cases the non-informative uniform prior can be used even for unbounded parameter sets. Indeed, what we really need is that the integral in the denominator of the last formula is finite:

∫_Θ p(y | θ) dθ < ∞    for all y.

Then we can apply (5.14) even if Θ is unbounded.


Exercise 5.4.2. Consider the Gaussian shift model (5.2) and (5.3).
(i) Check that for n = 1, the value ∫_{−∞}^{∞} p(y | θ) dθ is finite for every y and the posterior distribution of ϑ coincides with the distribution of Y.
(ii) Compute the posterior for n > 1.

Exercise 5.4.3. Consider the Gaussian regression model Y = Ψ^⊤ϑ + ε, ε ∼ N(0, Σ), and the non-informative prior π which is the Lebesgue measure on the space IR^p. Show that the posterior for ϑ is normal with mean θ = (ΨΣ^{-1}Ψ^⊤)^{-1}ΨΣ^{-1}Y and variance (ΨΣ^{-1}Ψ^⊤)^{-1}. Compare with the result of Theorem 5.3.3.

Note that the result of this exercise can be formally derived from Theorem 5.3.3 by replacing Γ^{-1} with 0.

Another way of tackling the case of an unbounded parameter set is to consider a sequence of priors that approaches the uniform distribution on the whole parameter set. In the case of linear Gaussian models and normal priors, a natural way is to let the prior variance tend to infinity. Consider first the univariate case; see Section 5.3.1. A non-informative prior can be approximated by the normal distribution with mean zero and variance r² tending to infinity. Then

ϑ | Y ∼ N( Y r²/(σ² + r²), σ²r²/(σ² + r²) ) −→ N(Y, σ²) weakly as r → ∞.

It is interesting to note that the case of an i.i.d. sample in fact reduces the situation to the case of a non-informative prior. Indeed, the result of Theorem 5.3.3 implies with r_n² = nr²

ϑ | Y ∼ N( Ȳ r_n²/(σ² + r_n²), σ²r_n²/(σ² + r_n²) ).

One says that the prior information “washes out” from the posterior distribution as the sample size n tends to infinity.

5.5 Bayes estimate and posterior mean

Given a loss function ℘(θ, θ′) on Θ × Θ, the Bayes risk of an estimate θ = θ(Y) is defined as

R_π(θ) := IE ℘(θ, ϑ) = ∫_Θ ( ∫_Y ℘(θ(y), θ) p(y | θ) µ_0(dy) ) π(θ) dθ.

Note that ϑ in this formula is treated as a random variable that follows the prior distribution π. One can represent this formula symbolically in the form

R_π(θ) = IE[ IE{℘(θ, ϑ) | ϑ} ] = IE R(θ, ϑ).

Here the external integration averages the pointwise risk R(θ, ϑ) over all possible values of ϑ due to the prior distribution.

The Bayes formula p(y | θ) π(θ) = p(θ | y) p(y) and a change of the order of integration can be used to represent the Bayes risk via the posterior density:

R_π(θ) = ∫_Y ( ∫_Θ ℘(θ(y), θ) p(θ | y) dθ ) p(y) µ_0(dy) = IE[ IE{℘(θ, ϑ) | Y} ].

The estimate θ_π is called Bayes or π-Bayes if it minimizes the corresponding risk:

θ_π = argmin_θ R_π(θ),

where the infimum is taken over the class of all feasible estimates. The most widespread choice of the loss function is the quadratic one:

℘(θ, θ′) := ‖θ − θ′‖².

The great advantage of this choice is that the Bayes solution can be given explicitly; it is the posterior mean

θ_π := IE(ϑ | Y) = ∫_Θ θ p(θ | Y) dθ.

Note that due to the Bayes formula, this value can be rewritten as

θ_π = (1/p(Y)) ∫_Θ θ p(Y | θ) π(θ) dθ,    p(Y) = ∫_Θ p(Y | θ) π(θ) dθ.

Theorem 5.5.1. It holds for any estimate θ

R_π(θ) ≥ R_π(θ_π).

Proof. The main feature of the posterior mean is that it provides a kind of projection of the data. This property can be formalized as follows:

IE(θ_π − ϑ | Y) = ∫_Θ (θ_π − θ) p(θ | Y) dθ = 0,

yielding for any estimate θ = θ(Y)

IE(‖θ − ϑ‖² | Y) = IE(‖θ_π − ϑ‖² | Y) + IE(‖θ_π − θ‖² | Y) + 2(θ − θ_π)^⊤ IE(θ_π − ϑ | Y)
= IE(‖θ_π − ϑ‖² | Y) + IE(‖θ_π − θ‖² | Y)
≥ IE(‖θ_π − ϑ‖² | Y).

Here we have used that both θ and θ_π are functions of Y and can be considered as constants when taking the conditional expectation w.r.t. Y. Now

R_π(θ) = IE‖θ − ϑ‖² = IE[ IE(‖θ − ϑ‖² | Y) ] ≥ IE[ IE(‖θ_π − ϑ‖² | Y) ] = R_π(θ_π)

and the result follows.

Exercise 5.5.1. Consider the univariate case with the loss function |θ − θ′|. Check that the posterior median minimizes the Bayes risk.

5.6 Posterior mean and ridge regression

Here we again consider the case of a linear Gaussian model

Y = Ψ^⊤ϑ + ε,    ε ∼ N(0, σ²I_n).

(To simplify the presentation, we focus here on the case of homogeneous errors with Σ = σ²I_n.) Recall that the maximum likelihood estimate θ for this model reads as

θ = (ΨΨ^⊤)^{-1}ΨY.

The penalized MLE θ_G for the roughness penalty ‖Gθ‖² is given by

θ_G = (ΨΨ^⊤ + σ²G²)^{-1}ΨY;

see Section 4.7. It turns out that a similar estimate appears in quite a natural way within the Bayes approach. Consider the normal prior distribution ϑ ∼ N(0, G^{-2}). The posterior will be normal as well, with the posterior mean

θ_π = σ^{-2}B^{-1}ΨY = (ΨΨ^⊤ + σ²G²)^{-1}ΨY;


see (5.11). It appears that θ_π = θ_G for the normal prior π = N(0, G^{-2}). One can say that the Bayes approach with a Gaussian prior leads to a regularization of the least squares method which is similar to quadratic penalization. The degree of regularization is inversely proportional to the variance of the prior. The larger the variance, the closer the prior is to the non-informative one and the posterior mean θ_π to the MLE θ.
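The identity θ_π = θ_G can be verified numerically. The sketch below is only an illustration, assumes NumPy, and uses an arbitrary diagonal penalty matrix G².

import numpy as np

# Check that the posterior mean under the prior N(0, G^{-2}) equals the penalized
# (ridge-type) estimate (Psi Psi^T + sigma^2 G^2)^{-1} Psi Y; all values are illustrative.
rng = np.random.default_rng(4)
p, n, sigma2 = 4, 60, 0.25
Psi = rng.normal(size=(p, n))
G2 = np.diag([1.0, 1.0, 4.0, 4.0])                      # penalty matrix G^2 = prior precision
theta = rng.multivariate_normal(np.zeros(p), np.linalg.inv(G2))
Y = Psi.T @ theta + np.sqrt(sigma2) * rng.normal(size=n)

ridge = np.linalg.solve(Psi @ Psi.T + sigma2 * G2, Psi @ Y)
B = G2 + Psi @ Psi.T / sigma2
post_mean = np.linalg.solve(B, Psi @ Y) / sigma2        # sigma^{-2} B^{-1} Psi Y
print(np.allclose(ridge, post_mean))                    # True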

5.7 Bayes and minimax risks

Consider the parametric model Y ∼ IP ∈ (IP_θ, θ ∈ Θ ⊂ IR^p). Let θ be an estimate of the parameter ϑ from the available data Y. Formally, θ is a measurable function of Y with values in Θ:

θ = θ(Y) : Y → Θ.

The quality of estimation is measured by the loss function ϱ(θ, θ′). In estimation problems one usually selects this function in the form ϱ(θ, θ′) = ϱ_1(θ − θ′) for another function ϱ_1 of one argument. Typical examples are given by the quadratic loss ϱ_1(θ) = ‖θ‖² or the absolute loss ϱ_1(θ) = ‖θ‖. Given such a loss function, the pointwise risk of θ at θ is defined as

R(θ, θ) := IE_θ ϱ(θ, θ).

The minimax risk is defined as the maximum of the pointwise risks over all θ ∈ Θ:

R(θ) := sup_{θ∈Θ} R(θ, θ) = sup_{θ∈Θ} IE_θ ϱ(θ, θ).

Similarly, the Bayes risk for a prior π is defined by weighting the pointwise risks according to the prior distribution:

R_π(θ) := ∫_Θ R(θ, θ) π(θ) dθ = ∫_Θ IE_θ ϱ(θ, θ) π(θ) dθ.

It is obvious that the Bayes risk is always smaller than or equal to the minimax one, whatever the prior measure is:

R_π(θ) ≤ R(θ).

The famous Le Cam theorem states that the minimax risk can be recovered by taking the supremum over all priors:

R(θ) = sup_π R_π(θ).

Moreover, the supremum can be taken over all discrete priors with finite supports.


5.8 Van Trees inequality

The Cramér–Rao inequality yields a lower bound on the risk of any unbiased estimator. However, it appears to be sub-optimal if the unbiasedness condition is dropped. Another way to get a general bound on the quadratic risk of an arbitrary estimator is to bound from below the Bayes risk for a suitable prior and then to maximize this lower bound over a class of such priors.

Let Y be an observed vector in IR^n and (IP_θ) be the corresponding parametric family with the density function p(y | θ) w.r.t. a measure µ_0 on IR^n. Let also θ = θ(Y) be an arbitrary estimator of θ. For any prior π, we aim to lower bound the π-Bayes risk R_π(θ) of θ. We already know that this risk is minimized by the posterior mean estimator θ_π. However, the presented bound does not rely on a particular structure of the considered estimator. Similarly to the Cramér–Rao inequality, it is entirely based on some geometric properties of the log-likelihood function log p(y | θ).

To simplify the explanation, consider first the case of a univariate parameter θ ∈ Θ ⊆ IR. Below we assume that the prior π has a positive, continuously differentiable density π(θ). In addition we suppose that the parameter set Θ is an interval, possibly infinite, and the prior density π(θ) vanishes at the edges of Θ. By F_π we denote the Fisher information of the prior distribution π:

F_π := ∫_Θ |π′(θ)|²/π(θ) dθ.

Recall also the definition of the full Fisher information of the family (IP_θ):

F(θ) := IE_θ | ∂/∂θ log p(Y | θ) |² = ∫ |p′(y | θ)|²/p(y | θ) µ_0(dy).

These quantities will enter the risk bound. In what follows we also use the notation

p_π(y, θ) = p(y | θ) π(θ),    p′_π(y, θ) := d p_π(y, θ)/dθ.

The Bayesian analog of the score function is

L′_π(Y, θ) := p′_π(Y, θ)/p_π(Y, θ) = p′(Y | θ)/p(Y | θ) + π′(θ)/π(θ).    (5.15)

The fact that p_π(y, θ) vanishes at the edges of Θ implies for any θ = θ(y) and any y ∈ IR^n

∫_Θ { (θ(y) − θ) p_π(y, θ) }′ dθ = [ {θ(y) − θ} p_π(y, θ) ] = 0,

and hence

∫_Θ {θ(y) − θ} p′_π(y, θ) dθ = ∫_Θ p_π(y, θ) dθ.    (5.16)

This is an interesting identity. It holds for each y with µ_0-probability one, and the estimate θ only appears on its left hand-side. This can be explained by the fact that ∫_Θ p′_π(y, θ) dθ = 0, which follows by the same calculus. Based on (5.16), one can compute the expectation of the product of θ(Y) − ϑ and L′_π(Y, ϑ) = p′_π(Y, ϑ)/p_π(Y, ϑ):

IE_π[ {θ(Y) − ϑ} L′_π(Y, ϑ) ] = ∫_{IR^n} ∫_Θ {θ(y) − θ} p′_π(y, θ) dθ dµ_0(y)
= ∫_{IR^n} ∫_Θ p_π(y, θ) dθ dµ_0(y) = 1.    (5.17)

Again, the remarkable feature of this equality is that the estimate θ only appears on the left hand-side. Now the idea of the bound is very simple. We introduce the r.v. h(Y, ϑ) = L′_π(Y, ϑ)/IE_π|L′_π(Y, ϑ)|² and use the orthogonality of θ(Y) − ϑ − h(Y, ϑ) and h(Y, ϑ) together with the Pythagorean theorem to show that the squared risk of θ is not smaller than IE_π[h²(Y, ϑ)]. More precisely, denote

I_π := IE_π[ {L′_π(Y, ϑ)}² ],    h(Y, ϑ) := I_π^{-1} L′_π(Y, ϑ).

Then IE_π[h²(Y, ϑ)] = I_π^{-1} and (5.17) implies

IE_π[ {θ(Y) − ϑ − h(Y, ϑ)} h(Y, ϑ) ] = I_π^{-1} IE_π[ {θ(Y) − ϑ} L′_π(Y, ϑ) ] − IE_π[h²(Y, ϑ)] = 0

and

IE_π[ (θ(Y) − ϑ)² ] = IE_π[ {θ(Y) − ϑ − h(Y, ϑ) + h(Y, ϑ)}² ]
= IE_π[ {θ(Y) − ϑ − h(Y, ϑ)}² ] + IE_π[h²(Y, ϑ)] + 2 IE_π[ {θ(Y) − ϑ − h(Y, ϑ)} h(Y, ϑ) ]
= IE_π[ {θ(Y) − ϑ − h(Y, ϑ)}² ] + IE_π[h²(Y, ϑ)]
≥ IE_π[h²(Y, ϑ)] = I_π^{-1}.

It remains to compute I_π. We use the representation (5.15) for L′_π(Y, θ). Further, we use the identity ∫ p(y | θ) dµ_0(y) ≡ 1, which implies by differentiating in θ

∫ p′(y | θ) dµ_0(y) ≡ 0.

This yields that the two random variables p′(Y | θ)/p(Y | θ) and π′(θ)/π(θ) are orthogonal under the measure IP_π:

IE_π{ p′(Y | θ)/p(Y | θ) · π′(θ)/π(θ) } = ∫_Θ ∫_{IR^n} p′(y | θ) π′(θ) dµ_0(y) dθ
= ∫_Θ { ∫_{IR^n} p′(y | θ) dµ_0(y) } π′(θ) dθ = 0.

Now, by the Pythagorean theorem,

IE_π{ L′_π(Y, ϑ) }² = IE_π{ p′(Y | ϑ)/p(Y | ϑ) }² + IE_π{ π′(ϑ)/π(ϑ) }²
= IE_π IE_ϑ[ { p′(Y | ϑ)/p(Y | ϑ) }² | ϑ ] + IE_π{ π′(ϑ)/π(ϑ) }²
= ∫_Θ F(θ) π(θ) dθ + F_π.

Now we can summarize the derivations in the form of van Trees’ inequality.

Theorem 5.8.1 (van Trees). Let Θ be an interval in IR and let the prior density π(θ) have a piecewise continuous first derivative, be positive in the interior of Θ, and vanish at the edges. Then for any estimator θ of θ

IE_π(θ − ϑ)² ≥ ( ∫_Θ F(θ) π(θ) dθ + F_π )^{-1}.
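For the Gaussian shift model of Theorem 5.3.2 both ingredients of the bound are explicit (F(θ) = n/σ² for every θ and F_π = 1/r² for the N(τ, r²) prior), and the bound is attained by the posterior mean. The following small computation, given only as an illustration, makes this concrete:

import numpy as np

# Van Trees bound for an i.i.d. N(theta, sigma2) sample with the prior N(tau, r2) of Theorem 5.3.2.
# Here F(theta) = n / sigma2 for every theta and F_pi = 1 / r2 (both exact, nothing is simulated).
n, sigma2, r2 = 25, 1.0, 4.0
fisher_data = n / sigma2
fisher_prior = 1.0 / r2
van_trees_bound = 1.0 / (fisher_data + fisher_prior)

# Bayes risk of the posterior mean (its posterior variance from Theorem 5.3.2): the bound is attained.
posterior_var = (r2 * sigma2 / n) / (r2 + sigma2 / n)
print(van_trees_bound, posterior_var)      # both equal sigma2*r2 / (n*r2 + sigma2)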


Now we consider a multivariate extension. We say that a real function g(θ), θ ∈ Θ, is nice if it is piecewise continuously differentiable in θ for almost all values of θ. Everywhere g′(·) means the derivative w.r.t. θ, that is, g′(θ) = d g(θ)/dθ.

We consider the following assumptions:

1. p(y | θ) is nice in θ for almost all y;
2. the full Fisher information matrix

F(θ) := Var_θ{ p′(Y | θ)/p(Y | θ) } = IE_θ[ p′(Y | θ)/p(Y | θ) { p′(Y | θ)/p(Y | θ) }^⊤ ]

exists and is continuous in θ;
3. Θ is compact with a piecewise C¹-smooth boundary;
4. π(θ) is nice; π(θ) is positive on the interior of Θ and zero on its boundary. The Fisher information matrix F_π of the prior π is positive and finite:

F_π := IE_π[ π′(ϑ)/π(ϑ) { π′(ϑ)/π(ϑ) }^⊤ ] < ∞.

Theorem 5.8.2. Let the assumptions 1–4 hold. For any estimate θ = θ(Y), it holds

IE_π[ {θ(Y) − ϑ}{θ(Y) − ϑ}^⊤ ] ≥ I_π^{-1},    (5.18)

where

I_π := ∫_Θ F(θ) π(θ) dθ + F_π.

Proof. Since p_π(y, θ) = p(y | θ) π(θ) vanishes on the boundary of Θ, Stokes’ theorem implies for any θ = θ(y) and any y ∈ IR^n

∫_Θ [ {θ(y) − θ} p_π(y, θ) ]′ dθ = 0,

and hence

∫_Θ p′_π(y, θ) {θ(y) − θ}^⊤ dθ = I_p ∫_Θ p_π(y, θ) dθ,

where I_p is the identity matrix. Therefore, the random vector L′_π(Y, θ) := p′_π(Y, θ)/p_π(Y, θ) fulfills

IE_π[ L′_π(Y, ϑ) {θ(Y) − ϑ}^⊤ ] = ∫_{IR^n} ∫_Θ p′_π(y, θ) {θ(y) − θ}^⊤ dθ dµ_0(y)
= I_p ∫_{IR^n} ∫_Θ p_π(y, θ) dθ dµ_0(y) = I_p.    (5.19)

Denote

I_π := IE_π[ L′_π(Y, ϑ) {L′_π(Y, ϑ)}^⊤ ],    h(Y, ϑ) := I_π^{-1} L′_π(Y, ϑ).

Then IE_π[ h(Y, ϑ) {h(Y, ϑ)}^⊤ ] = I_π^{-1} and (5.19) implies

IE_π[ h(Y, ϑ) {θ(Y) − ϑ − h(Y, ϑ)}^⊤ ] = I_π^{-1} IE_π[ L′_π(Y, ϑ) {θ(Y) − ϑ}^⊤ ] − IE_π[ h(Y, ϑ) {h(Y, ϑ)}^⊤ ] = 0

and hence

IE_π[ {θ(Y) − ϑ}{θ(Y) − ϑ}^⊤ ]
= IE_π[ {θ(Y) − ϑ − h(Y, ϑ) + h(Y, ϑ)}{θ(Y) − ϑ − h(Y, ϑ) + h(Y, ϑ)}^⊤ ]
= IE_π[ {θ(Y) − ϑ − h(Y, ϑ)}{θ(Y) − ϑ − h(Y, ϑ)}^⊤ ] + IE_π[ h(Y, ϑ) {h(Y, ϑ)}^⊤ ]
≥ IE_π[ h(Y, ϑ) {h(Y, ϑ)}^⊤ ] = I_π^{-1}.

It remains to compute I_π. The definition implies

L′_π(Y, θ) := p′_π(Y, θ)/p_π(Y, θ) = p′(Y | θ)/p(Y | θ) + π′(θ)/π(θ).

Further, the identity ∫ p(y | θ) dµ_0(y) ≡ 1 yields by differentiating in θ

∫ p′(y | θ) dµ_0(y) ≡ 0.

Using this identity, we show that the random vectors p′(Y | θ)/p(Y | θ) and π′(θ)/π(θ) are orthogonal w.r.t. the measure IP_π. Indeed,

IE_π[ p′(Y | θ)/p(Y | θ) { π′(θ)/π(θ) }^⊤ ] = ∫_Θ { ∫_{IR^n} p′(y | θ) dµ_0(y) } { π′(θ) }^⊤ dθ = 0.

Now, by the usual Pythagorean calculus, we obtain

IE_π[ {L′_π(Y, ϑ)}{L′_π(Y, ϑ)}^⊤ ]
= IE_π[ p′(Y | ϑ)/p(Y | ϑ) { p′(Y | ϑ)/p(Y | ϑ) }^⊤ ] + IE_π[ π′(ϑ)/π(ϑ) { π′(ϑ)/π(ϑ) }^⊤ ]
= IE_π IE_ϑ[ { p′(Y | ϑ)/p(Y | ϑ) }{ p′(Y | ϑ)/p(Y | ϑ) }^⊤ | ϑ ] + IE_π[ π′(ϑ)/π(ϑ) { π′(ϑ)/π(ϑ) }^⊤ ]
= ∫_Θ F(θ) π(θ) dθ + F_π,

and the result follows.

This matrix inequality can be used for obtaining a number of L2 bounds. We present only two bounds for the squared norm ‖θ(Y) − ϑ‖².

Corollary 5.8.1. Under the same conditions 1–4, it holds

IE_π‖θ(Y) − ϑ‖² ≥ tr(I_π^{-1}) ≥ p²/tr(I_π).

Proof. The first inequality follows directly from the bound (5.18) of Theorem 5.8.2. For the second one, it suffices to note that for any positive symmetric p × p matrix B, it holds

tr(B) tr(B^{-1}) ≥ p².    (5.20)

This fact can be proved by the Cauchy–Schwarz inequality.

Exercise 5.8.1. Prove (5.20).
Hint: use the Cauchy–Schwarz inequality for the scalar product tr(B^{1/2}B^{-1/2}) of the two matrices B^{1/2} and B^{-1/2} (considered as vectors in IR^{p²}).

5.9 Historical remarks and further reading

The origin of the Bayesian approach to statistics was the article by Bayes (1763). Further theoretical foundations are due to de Finetti (1937), Savage (1954), and Jeffreys (1957).

The theory of conjugated priors was developed by Raiffa and Schlaifer (1961). Conjugated priors for exponential families have been characterized by Diaconis and Ylvisaker (1979). Non-informative priors were considered by Jeffreys (1961).

Bayes optimality of the posterior mean estimator under quadratic loss is a classical result which can be found, for instance, in Section 4.4.2 of Berger (1985).

The van Trees inequality is originally due to Van Trees (1968), p. 72. Gill and Levit (1995) applied it to the problem of establishing a Bayesian version of the Cramér–Rao bound.

For further reading, we recommend the books by Berger (1985), Bernardo and Smith (1994), and Robert (2001).


6

Testing a statistical hypothesis

Let Y be the observed sample. The hypothesis testing problem assumes that there is some external information (a hypothesis) about the distribution of this sample, and the target is to check this hypothesis on the basis of the available data.

6.1 Testing problem

This section specifies the main notions of the theory of hypothesis testing. We start with a simple hypothesis. Afterwards a composite hypothesis will be discussed. We also introduce the notions of the testing error, level, power, etc.

6.1.1 Simple hypothesis

The classical testing problem consists in checking a specific hypothesis that the available data indeed follow an externally given, precisely specified distribution. We illustrate this notion by several examples.

Example 6.1.1 (Simple game). Let Y = (Y_1, . . . , Y_n)^⊤ be a Bernoulli sequence of zeros and ones. This sequence can be viewed as a sequence of successes, or results of tossing a coin, etc. The hypothesis about this sequence is that wins (associated with one) and losses (associated with zero) are equally frequent in the long run. This hypothesis can be formalized as follows: IP = (IP_{θ∗})^{⊗n} with θ∗ = 1/2, where IP_θ describes the Bernoulli experiment with parameter θ.

Example 6.1.2 (No treatment effect). Let (Y_i, Ψ_i) be experimental results, i = 1, . . . , n. The linear regression model assumes a certain dependence of the form Y_i = Ψ_i^⊤θ + ε_i with errors ε_i having zero mean. The “no effect” hypothesis means that there is no systematic dependence of Y_i on the factors Ψ_i, i.e., θ = θ∗ = 0, and the observations Y_i are just noise.

Example 6.1.3 (Quality control). Assume that the Y_i are the results of a production process which can be represented in the form Y_i = θ∗ + ε_i, where θ∗ is a nominal value and ε_i is a measurement error. The hypothesis is that the observed process indeed follows this model.

The general problem of testing a simple hypothesis is stated as follows: to check on the basis of the available observations Y that their distribution is described by a given measure IP. The hypothesis is often called a null hypothesis or just the null.

6.1.2 Composite hypothesis

More generally, one can treat the problem of testing a composite hypothesis. Let (IP_θ, θ ∈ Θ ⊂ IR^p) be a given parametric family, and let Θ_0 ⊂ Θ be a non-empty subset of Θ. The hypothesis is that the data distribution IP belongs to the set (IP_θ, θ ∈ Θ_0). Often, this hypothesis and the subset Θ_0 are identified with each other, and one says that the hypothesis is given by Θ_0.

We give some typical examples where such a formulation is natural.

Example 6.1.4 (Testing a subvector). Assume that the vector θ ∈ Θ can be decomposed into two parts: θ = (γ, η). The subvector γ is the target of analysis while the subvector η matters for the distribution of the data but is not the target of analysis. It is often called the nuisance parameter. The hypothesis we want to test is γ = γ∗ for some fixed value γ∗. A typical situation in factor analysis where such problems arise is to check for “no effect” of one particular factor in the presence of many different, potentially interrelated factors.

Example 6.1.5 (Interval testing). Let Θ be the real line and Θ_0 be an interval. The hypothesis is that IP = IP_{θ∗} for θ∗ ∈ Θ_0. Such problems are typical for quality control or warning (monitoring) systems when the controlled parameter should be in the prescribed range.

Example 6.1.6 (Testing a hypothesis about the error distribution). Consider the regression model Y_i = Ψ_i^⊤θ + ε_i. The typical assumption about the errors ε_i is that they are zero-mean normal. One may be interested in testing this assumption, having in mind that the cases of discrete, or heavy-tailed, or heteroscedastic errors can also occur.

6.1.3 Statistical tests

A test is a statistical decision, on the basis of the available data, whether the pre-specified hypothesis is rejected or retained. So the decision space consists of only two points, which we denote by zero and one. A decision φ is a mapping of the data Y to this space and is called a test:

φ : Y → {0, 1}.

The event {φ = 1} means that the null hypothesis is rejected and the opposite event means non-rejection (acceptance) of the null. Usually the testing results are qualified in the following way: rejection of the null hypothesis means that the data are not consistent with the null, or, equivalently, the data contain some evidence against the null hypothesis. Acceptance simply means that the data do not contradict the null. Therefore, the term “non-rejection” is often considered more appropriate.

The region of acceptance is a subset of the observation space Y on which φ = 0. One also says that this region is the set of values for which we fail to reject the null hypothesis. The region of rejection or critical region is, on the other hand, the subset of Y on which φ = 1.

6.1.4 Errors of the first kind, test level

In the hypothesis testing framework one distinguishes between errors of the first and second kind. An error of the first kind means that the null hypothesis is falsely rejected although it is correct. We formalize this notion first for the case of a simple hypothesis and then extend it to the general case.

Let H_0 : Y ∼ IP_{θ∗} be a null hypothesis. The error of the first kind is the situation in which the data indeed follow the null, but the decision of the test is to reject this hypothesis: φ = 1. Clearly the probability of such an error is IP_{θ∗}(φ = 1). The latter number in [0, 1] is called the size of the test φ. One says that φ is a test of level α for some α ∈ (0, 1) if

IP_{θ∗}(φ = 1) ≤ α.

The value α is called the level of the test or significance level. Often, size and level of a test coincide; however, especially in discrete models, it is not always possible to attain the significance level exactly by the chosen test φ, meaning that the actual size of φ is smaller than α; see Example 6.1.7 below.

If the hypothesis is composite, then the level of the test is the maximum rejection probability over the null subset Θ_0. Here, a test φ is of level α if

sup_{θ∈Θ_0} IP_θ(φ = 1) ≤ α.

Example 6.1.7 (One-sided binomial test). Consider again the situation from Example 6.1.1 (an i.i.d. sample Y = (Y_1, . . . , Y_n)^⊤ from a Bernoulli distribution is observed). We let n = 13 and Θ_0 = [0, 1/5]. For instance, one may want to test whether the cancer-related mortality in a subpopulation of individuals which are exposed to some environmental risk factor is not significantly larger than in the general population, in which it is equal to 1/5. To this end, death causes are assessed for 13 decedents and for every decedent we get the information whether s/he died of cancer or not. From Section 2.6, we know that S_n/n efficiently estimates the success probability θ∗, where S_n = ∑_{i=1}^{n} Y_i. Therefore, it appears natural to use S_n also as the basis for a solution of the test problem. Under θ∗, it holds S_n ∼ Bin(n, θ∗). Since a large value of S_n implies evidence against the null, we choose a test of the form φ = 1I_{(c_α, n]}(S_n), where the constant c_α has to be chosen such that φ is of level α. This condition can equivalently be expressed as

inf_{0≤θ≤1/5} IP_θ(S_n ≤ c_α) ≥ 1 − α.

For fixed k ∈ {0, . . . , n}, we have

IP_θ(S_n ≤ k) = ∑_{ℓ=0}^{k} \binom{n}{ℓ} θ^ℓ (1 − θ)^{n−ℓ} = F(θ, k) (say).

Exercise 6.1.1. Show that for all k ∈ {0, . . . , n}, the function F(·, k) is decreasing on Θ_0 = [0, 1/5].

Due to Exercise 6.1.1, we have to calculate c_α under the least favorable parameter configuration (LFC) θ = 1/5, and we obtain

c_α = min{ k ∈ {0, . . . , n} : ∑_{ℓ=0}^{k} \binom{n}{ℓ} (1/5)^ℓ (4/5)^{n−ℓ} ≥ 1 − α },

because we want to exhaust the significance level α as tightly as possible. For the standard choice of α = 0.05, we have

∑_{ℓ=0}^{4} \binom{13}{ℓ} (1/5)^ℓ (4/5)^{13−ℓ} ≈ 0.901,
∑_{ℓ=0}^{5} \binom{13}{ℓ} (1/5)^ℓ (4/5)^{13−ℓ} ≈ 0.970,

hence we choose c_α = 5. However, the size of the test φ = 1I_{(5,13]}(S_n) is strictly smaller than the significance level α = 0.05, namely

sup_{0≤θ≤1/5} IP_θ(S_n > 5) = IP_{1/5}(S_n > 5) = 1 − IP_{1/5}(S_n ≤ 5) ≈ 0.03 < α.
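The computation of c_α and of the resulting size can be reproduced with a few lines of Python (illustrative only, standard library):

from math import comb

# Reproduces the critical value c_alpha of Example 6.1.7 for n = 13, theta_0 = 1/5, alpha = 0.05.
n, theta0, alpha = 13, 0.2, 0.05

def binom_cdf(k, n, p):
    # F(p, k) = P(S_n <= k) for S_n ~ Bin(n, p)
    return sum(comb(n, l) * p**l * (1 - p)**(n - l) for l in range(k + 1))

c_alpha = min(k for k in range(n + 1) if binom_cdf(k, n, theta0) >= 1 - alpha)
size = 1 - binom_cdf(c_alpha, n, theta0)     # size of the test 1I(S_n > c_alpha) at the LFC
print(c_alpha, round(size, 3))               # 5 and approximately 0.030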

6.1.5 Randomized tests

In some situations it is difficult to decide about acceptance or rejection of the hypothesis. A randomized test can be viewed as a weighted decision: with a certain probability the hypothesis is rejected, otherwise retained. The decision space for a randomized test φ is the unit interval [0, 1], that is, φ(Y) is a number between zero and one. The hypothesis H_0 is rejected with probability φ(Y) on the basis of the observed data Y. If φ(Y) only admits the binary values 0 and 1 for every Y, then we are back at the usual non-randomized tests considered before. The probability of an error of the first kind is naturally given by the value IE_{θ∗}φ(Y). For a simple hypothesis H_0 : IP = IP_{θ∗}, a test φ is now of level α if

IE_{θ∗}φ(Y) = α.

For a randomized test φ, the significance level α is typically attainable exactly, even in discrete models. In the case of a composite hypothesis H_0 : IP ∈ (IP_θ, θ ∈ Θ_0), the level condition reads as

sup_{θ∈Θ_0} IE_θ φ(Y) ≤ α

as before. In what follows we mostly consider non-randomized tests and only comment on whether a randomization can be useful. Note that any randomized test can be reduced to a non-randomized test by extending the probability space.

Exercise 6.1.2. Construct for any randomized test φ its non-randomized version using a random data generator.

Randomized tests are a satisfactory solution of the test problem from a mathematical point of view, but they are disliked by practitioners because the test result may not be reproducible due to the randomization.

Example 6.1.8 (Example 6.1.7 continued). Under the setup of Example 6.1.7, consider the randomized test φ given by

φ(Y) = 0 if S_n < 5,    φ(Y) = 2/7 if S_n = 5,    φ(Y) = 1 if S_n > 5.

It is easy to show that under the LFC θ = 1/5, the size of φ is (up to rounding) exactly equal to α = 5%.
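The randomization weight can be obtained by solving the level equation exactly; the sketch below (illustrative only, standard library) recovers a value close to the 2/7 used above.

from math import comb

# Randomization weight gamma such that P(S_n > c) + gamma * P(S_n = c) = alpha at the LFC.
n, theta0, alpha, c = 13, 0.2, 0.05, 5
pmf = lambda k: comb(n, k) * theta0**k * (1 - theta0)**(n - k)
p_reject = sum(pmf(k) for k in range(c + 1, n + 1))     # P(S_n > c) at theta = 1/5
gamma = (alpha - p_reject) / pmf(c)
print(round(gamma, 3))                                  # close to the 2/7 of Example 6.1.8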

6.1.6 Alternative hypotheses, error of the second kind, power of a test

The setup of hypothesis testing is asymmetric in the sense that it focuses on the null hypothesis. However, for a complete analysis, one has to specify the data distribution when the hypothesis is false. Within the parametric framework, one usually makes the assumption that the unknown data distribution belongs to some parametric family (IP_θ, θ ∈ Θ ⊆ IR^p). This assumption has to be fulfilled independently of whether the hypothesis is true or false. In other words, we assume that IP ∈ (IP_θ, θ ∈ Θ) and there is a subset Θ_0 ⊂ Θ corresponding to the null hypothesis. The measure IP = IP_θ for θ ∉ Θ_0 is called an alternative. Furthermore, we call Θ_1 := Θ \ Θ_0 the alternative hypothesis.

Now we can consider the performance of a test φ when the hypothesis H_0 is false. The decision to retain the hypothesis when it is false is called the error of the second kind. The probability of such an error is equal to IP(φ = 0), whenever IP is an alternative. This value certainly depends on the alternative IP = IP_θ for θ ∉ Θ_0. The value β(θ) = 1 − IP_θ(φ = 0) is often called the test power at θ ∉ Θ_0. The function of θ ∈ Θ \ Θ_0 given by

β(θ) := 1 − IP_θ(φ = 0)

is called the power function. Ideally one would desire to build a test which simultaneously minimizes the size and maximizes the power. These two wishes are somewhat contradictory: a decrease of the size usually results in a decrease of the power and vice versa. Usually one imposes the level-α constraint on the size of the test and tries to optimize its power under this constraint. Under the general framework of statistical decision problems as discussed in Section 1.4, one can thus regard R(φ, θ) = 1 − β(θ), θ ∈ Θ_1, as a risk function. If we agree on this risk measure and restrict attention to level-α tests, then the test problem, regarded as a statistical decision problem, is already completely specified by (Y, B(Y), (IP_θ : θ ∈ Θ), Θ_0).

Definition 6.1.1. A test φ∗ is called a uniformly most powerful (UMP) test of level α if it is of level α and for any other test φ of level α, it holds

1 − IP_θ(φ∗ = 0) ≥ 1 − IP_θ(φ = 0),    θ ∉ Θ_0.

Unfortunately, such UMP tests exist only in very few special models; otherwise, optimization of the power given the level is a complicated task.

In the case of a univariate parameter θ ∈ Θ ⊂ IR^1 and a simple hypothesis θ = θ∗, one often considers the one-sided alternatives

H_1 : θ ≥ θ∗    or    H_1 : θ ≤ θ∗

or the two-sided alternative

H_1 : θ ≠ θ∗.

6.2 Neyman-Pearson test for two simple hypotheses

This section discusses one very special case of hypothesis testing in which both the null hypothesis and the alternative are simple one-point sets. This special situation by itself can be viewed as a toy problem, but it is very important from the methodological point of view. In particular, it introduces and justifies the so-called likelihood ratio test and demonstrates its efficiency.

For simplicity we write IP_0 for the measure corresponding to the null hypothesis and IP_1 for the alternative measure. A test φ is a measurable function of the observations with values in the two-point set {0, 1}. The event φ = 0 is treated as acceptance of the null hypothesis H_0 while φ = 1 means rejection of the null hypothesis and, consequently, a decision in favor of H_1.

For ease of presentation we assume that the measure IP_1 is absolutely continuous w.r.t. the measure IP_0 and denote by Z(Y) the corresponding derivative at the observation point:

Z(Y) := dIP_1/dIP_0 (Y).

Similarly, L(Y) denotes the log-density:

L(Y) := log Z(Y) = log dIP_1/dIP_0 (Y).

The solution of the test problem in the case of two simple hypotheses is known as the Neyman–Pearson test: reject the hypothesis H_0 if the log-likelihood ratio L(Y) exceeds a specific critical value z:

φ∗_z := 1I( L(Y) > z ) = 1I( Z(Y) > e^z ).

The Neyman–Pearson test is known to minimize the weighted sum of the errors of the first and second kind. For a non-randomized test this sum is equal to

℘_0 IP_0(φ = 1) + ℘_1 IP_1(φ = 0),

while the weighted error of a randomized test φ is

℘_0 IE_0 φ + ℘_1 IE_1(1 − φ).    (6.1)

Theorem 6.2.1. For any two positive values ℘_0 and ℘_1, the test φ∗_z with z = log(℘_0/℘_1) minimizes (6.1) over all possible (randomized) tests φ:

φ∗_z := 1I( L(Y) ≥ z ) = argmin_φ { ℘_0 IE_0 φ + ℘_1 IE_1(1 − φ) }.

Proof. We use the formula for a change of measure:

IE_1 ξ = IE_0[ ξ Z(Y) ]

for any r.v. ξ. It holds for any test φ with z = log(℘_0/℘_1)

℘_0 IE_0 φ + ℘_1 IE_1(1 − φ) = IE_0[ ℘_0 φ − ℘_1 Z(Y) φ ] + ℘_1
= −℘_1 IE_0[ (Z(Y) − e^z) φ ] + ℘_1
≥ −℘_1 IE_0[ Z(Y) − e^z ]_+ + ℘_1

with equality for φ = 1I( L(Y) ≥ z ).

The Neyman-Pearson test belongs to a large class of tests of the form

φ = 1I(T ≥ z),

where T is a function of the observations Y. This random variable is usually called a test statistic, while the threshold z is called a critical value. The hypothesis is rejected if the test statistic exceeds the critical value. For the Neyman-Pearson test, the test statistic is the log-likelihood ratio L(Y) and the critical value is selected as a suitable quantile of this test statistic.

The next result shows that the Neyman-Pearson test φ∗_z with a proper critical value z can be constructed to maximize the power IE1 φ under the level constraint IE0 φ ≤ α.

Theorem 6.2.2. Given α ∈ (0, 1), let zα be such that

IP0(L(Y) ≥ zα) = α. (6.2)

Then it holds

φ∗_{zα} def= 1I(L(Y) ≥ zα) = argmax_{φ : IE0 φ ≤ α} IE1 φ.

Proof. Let φ satisfy IE0 φ ≤ α. Then

IE1 φ − α e^{zα} ≤ IE0[Z(Y) φ] − e^{zα} IE0 φ = IE0[{Z(Y) − e^{zα}} φ] ≤ IE0[Z(Y) − e^{zα}]_+,

again with equality for φ = 1I(L(Y) ≥ zα).

The previous result assumes that for a given α there is a critical value zα such that (6.2) is fulfilled. However, this is not always the case.

Exercise 6.2.1. Let L(Y) = log (dIP1/dIP0)(Y).

• Show that the relation (6.2) can always be fulfilled with a proper choice of zα if the distribution function of L(Y) under IP0 is continuous.

• Suppose that the distribution function of L(Y) under IP0 is discontinuous and zα fulfills

IP0(L(Y) ≥ zα) > α, IP0(L(Y) ≤ zα) > 1 − α.

Construct a randomized test that fulfills IE0 φ = α and maximizes the test power IE1 φ among all such tests.

The Neyman-Pearson test can be viewed as a special case of the general likelihood ratio test. Indeed, it decides in favor of the null or the alternative by looking at the likelihood ratio. Informally one can say: we decide in favor of the alternative if it is significantly more likely at the point of observation Y.

An interesting question that arises in relation with the Neyman-Pearson result is how to interpret it when the true distribution IP coincides neither with IP0 nor with IP1 and possibly is not even within the considered parametric family (IPθ). Wald called this situation the third-kind error. It is worth mentioning that the test φ∗_z remains meaningful: it decides which of the two given measures IP0 and IP1 describes the given data better. However, it is no longer a likelihood ratio test. In analogy with estimation theory, one can call it a quasi likelihood ratio test.

6.2.1 Neyman-Pearson test for an i.i.d. sample

Let Y = (Y1, . . . , Yn)^⊤ be an i.i.d. sample from a measure P. Suppose that P belongs to some parametric family (Pθ, θ ∈ Θ ⊂ IR^p), that is, P = P_{θ∗} for θ∗ ∈ Θ. Let also a special point θ0 be fixed. A simple null hypothesis can be formulated as θ∗ = θ0. Similarly, a simple alternative is θ∗ = θ1 for some other point θ1 ∈ Θ. The Neyman-Pearson test situation is a bit artificial: one reduces the whole parameter set Θ to just these two points θ0 and θ1 and tests θ0 against θ1.

As usual, the distribution of the data Y is described by the product measure IPθ = P_θ^{⊗n}. If µ0 is a dominating measure for (P_θ) and ℓ(y, θ) def= log[dP_θ/dµ0(y)], then the log-likelihood L(Y, θ) is

L(Y, θ) def= log (dIPθ/dµ0^{⊗n})(Y) = ∑_i ℓ(Yi, θ).

The log-likelihood ratio of IP_{θ1} w.r.t. IP_{θ0} can be defined as

L(Y, θ1, θ0) def= L(Y, θ1) − L(Y, θ0).

The related Neyman-Pearson test can be written as

φ∗_z def= 1I(L(Y, θ1, θ0) > z).
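The following short numerical sketch (not part of the original text) illustrates this Neyman-Pearson test for an i.i.d. Gaussian sample with known variance; the two points θ0, θ1 and all numbers are chosen purely for illustration, and the critical value is calibrated by Monte Carlo under IP_{θ0} rather than analytically.

import numpy as np

rng = np.random.default_rng(0)
n, sigma, theta0, theta1, alpha = 50, 1.0, 0.0, 0.5, 0.05

def log_lr(Y, theta0, theta1, sigma):
    # L(Y, theta1, theta0) = sum_i [ell(Y_i, theta1) - ell(Y_i, theta0)] for Gaussian log-densities
    return np.sum((Y - theta0) ** 2 - (Y - theta1) ** 2) / (2 * sigma ** 2)

# calibrate z_alpha by simulation under the null IP_{theta0}
null_stats = np.array([log_lr(rng.normal(theta0, sigma, n), theta0, theta1, sigma)
                       for _ in range(20000)])
z_alpha = np.quantile(null_stats, 1 - alpha)

# apply the test phi = 1I(L(Y, theta1, theta0) > z_alpha) to one data set
Y = rng.normal(theta1, sigma, n)          # data actually drawn from the alternative
reject = log_lr(Y, theta0, theta1, sigma) > z_alpha
print(f"critical value {z_alpha:.3f}, reject H0: {reject}")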


6.3 Likelihood ratio test

This section introduces a general likelihood ratio test in the framework of parametric testing theory. Let, as usual, Y be the observed data and IP their distribution. The parametric assumption is that IP ∈ (IPθ, θ ∈ Θ), that is, IP = IP_{θ∗} for θ∗ ∈ Θ. Let now two subsets Θ0 and Θ1 of the set Θ be given. The hypothesis H0 that we would like to test is that IP ∈ (IPθ, θ ∈ Θ0), or equivalently, θ∗ ∈ Θ0. The alternative is that θ∗ ∈ Θ1.

The general likelihood approach leads to comparing the (maximum) likelihood values L(Y, θ) on the hypothesis and alternative sets. Namely, the hypothesis is rejected if there is one alternative point θ1 ∈ Θ1 such that the value L(Y, θ1) exceeds all corresponding values for θ ∈ Θ0 by a certain amount which is defined by assigning losses or by fixing a significance level. In other words, rejection takes place if observing the sample Y under an alternative IP_{θ1} is significantly more likely than under any measure IPθ from the null. Formally this relation can be written as

sup_{θ ∈ Θ0} L(Y, θ) + z < sup_{θ ∈ Θ1} L(Y, θ),

where the constant z makes the term "significantly" explicit. In particular, a simple hypothesis means that the set Θ0 consists of one single point θ0, and the latter relation takes the form

L(Y, θ0) + z < sup_{θ ∈ Θ1} L(Y, θ).

In general, the likelihood ratio (LR) test corresponds to the test statistic

T def= sup_{θ ∈ Θ1} L(Y, θ) − sup_{θ ∈ Θ0} L(Y, θ). (6.3)

The hypothesis is rejected if this test statistic exceeds some critical value z. Usually this critical value is selected to ensure the level condition

IP(T > zα) ≤ α

for a given level α, whenever IP is a measure under the null hypothesis. We have already seen that the LR test is optimal for testing a simple hypothesis against a simple alternative. Later we show that this optimality property can be extended to some more general situations. Now and in the following Section 6.4 we consider further examples of LR tests.

Example 6.3.1 (Chi-square test for goodness-of-fit). Let the observation space (which is a subset of IR^1) be split into non-overlapping subsets A1, . . . , Ad and assume one observes the indicators 1I(Yi ∈ Aj) for 1 ≤ i ≤ n and 1 ≤ j ≤ d. Define θj = P0(Aj) = ∫_{Aj} P0(dy) for 1 ≤ j ≤ d. Let counting variables Nj, 1 ≤ j ≤ d, be given by Nj = ∑_{i=1}^n 1I(Yi ∈ Aj). The vector N = (N1, . . . , Nd)^⊤ follows the multinomial distribution with parameters n, d, and θ = (θ1, . . . , θd)^⊤, where we assume n and d as fixed, leading to dim(Θ) = d. More specifically, it holds

Θ = {θ = (θ1, . . . , θd)^⊤ ∈ [0, 1]^d : ∑_{j=1}^d θj = 1}.

The likelihood statistic for this model with respect to the counting measure is given by

Z(N, θ) = n!/(∏_{j=1}^d Nj!) ∏_{ℓ=1}^d θℓ^{Nℓ},

and the MLE is given by θ̃j = Nj/n, 1 ≤ j ≤ d. Now, consider the point hypothesis θ = p for a fixed given vector p ∈ Θ. We obtain the likelihood ratio statistic

T = sup_{θ ∈ Θ} log[Z(N, θ)/Z(N, p)],

leading to

T = n ∑_{j=1}^d θ̃j log(θ̃j/pj).

In practice, this LR test is often carried out as Pearson's chi-square test. To this end, consider the function h : IR → IR given by h(x) = x log(x/x0) for a fixed real number x0 ∈ (0, 1). Then, the Taylor expansion of h(x) around x0 is given by

h(x) = (x − x0) + (1/(2x0))(x − x0)² + o[(x − x0)²] as x → x0.

Consequently, for θ̃ close to p, the use of ∑_j θ̃j = ∑_j pj = 1 implies

2T ≈ ∑_{j=1}^d (Nj − npj)²/(npj)

in probability under the null hypothesis. The statistic Q, given by

Q def= ∑_{j=1}^d (Nj − npj)²/(npj),

is called Pearson's chi-square statistic.
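As an illustration (not from the original text), the following sketch computes the LR statistic T and Pearson's statistic Q for simulated multinomial counts and compares 2T and Q with a chi-square quantile with d − 1 degrees of freedom, the usual calibration for this test; the probabilities and sample size are made up.

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
d, n = 4, 200
p = np.array([0.25, 0.25, 0.25, 0.25])              # hypothesized cell probabilities p_j
N = rng.multinomial(n, [0.22, 0.26, 0.27, 0.25])    # observed counts N_j (data generated off the null)

theta_hat = N / n                                    # MLE of the cell probabilities
T = n * np.sum(theta_hat * np.log(theta_hat / p))    # LR statistic T = n sum_j theta_j log(theta_j / p_j)
Q = np.sum((N - n * p) ** 2 / (n * p))               # Pearson's chi-square statistic
crit = chi2.ppf(0.95, df=d - 1)                      # chi-square quantile with d - 1 degrees of freedom
print(f"2T = {2*T:.2f}, Q = {Q:.2f}, reject if > {crit:.2f}")

As expected from the Taylor expansion above, 2T and Q are typically close to each other in such simulations.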

6.4 Likelihood ratio tests for parameters of a normal distribution

For all examples considered in this section, we assume that the data Y in the form of an i.i.d. sample (Y1, . . . , Yn)^⊤ follow the model Yi = θ∗ + εi with εi ∼ N(0, σ²) for σ² known. Equivalently Yi ∼ N(θ∗, σ²). The log-likelihood L(Y, θ) (which we also denote by L(θ)) reads as

L(θ) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^n (Yi − θ)² (6.4)

and the log-likelihood ratio L(θ, θ0) = L(θ) − L(θ0) is given by

L(θ, θ0) = σ^{−2}[(S − nθ0)(θ − θ0) − n(θ − θ0)²/2] (6.5)

with S def= Y1 + . . . + Yn.

6.4.1 Distributions related to an i.i.d. sample from a normal distribution

As a preparation for the subsequent sections, we introduce here some important probability distributions which correspond to functions of an i.i.d. sample Y = (Y1, . . . , Yn)^⊤ from a normal distribution.

Lemma 6.4.1. If Y follows the standard normal distribution on IR, then Y² has the gamma distribution Γ(1/2, 1/2).

Exercise 6.4.1. Prove Lemma 6.4.1.

Corollary 6.4.1. Let Y = (Y1, . . . , Yn)^⊤ denote an i.i.d. sample from the standard normal distribution on IR. Then, it holds that

∑_{i=1}^n Yi² ∼ Γ(1/2, n/2).

We call Γ(1/2, n/2) the chi-square distribution with n degrees of freedom, χ²_n for short.

Proof. From Lemma 6.4.1, we have that Y1² ∼ Γ(1/2, 1/2). Convolution stability of the family of gamma distributions with respect to the second parameter yields the assertion.

Lemma 6.4.2. Let α, r, s > 0 be positive constants and let X, Y be independent random variables with X ∼ Γ(α, r) and Y ∼ Γ(α, s). Then S = X + Y and R = X/(X + Y) are independent with S ∼ Γ(α, r + s) and R ∼ Beta(r, s).

Exercise 6.4.2. Prove Lemma 6.4.2.

Theorem 6.4.1. Let X1, . . . , Xm, Y1, . . . , Yn be i.i.d., with X1 following the standard normal distribution on IR. Then, the ratio

F_{m,n} def= (m^{−1} ∑_{i=1}^m Xi²) / (n^{−1} ∑_{j=1}^n Yj²)

has the following pdf with respect to the Lebesgue measure:

f_{m,n}(x) = [m^{m/2} n^{n/2} / B(m/2, n/2)] · x^{m/2−1} / (n + mx)^{(m+n)/2} · 1I_{(0,∞)}(x).

The distribution of F_{m,n} is called Fisher's F-distribution with m and n degrees of freedom (Sir R. A. Fisher, 1890–1962).

Exercise 6.4.3. Prove Theorem 6.4.1.

Corollary 6.4.2. Let X, Y1, . . . , Yn be i.i.d., with X following the standard normal distribution on IR. Then, the statistic

T = X / √(n^{−1} ∑_{j=1}^n Yj²)

has the Lebesgue density

t ↦ τ_n(t) = (1 + t²/n)^{−(n+1)/2} {√n B(1/2, n/2)}^{−1}.

The distribution of T is called Student's t-distribution with n degrees of freedom.

Proof. According to Theorem 6.4.1, T² ∼ F_{1,n}. Thus, due to the transformation formula for densities, |T| = √(T²) has Lebesgue density t ↦ 2t f_{1,n}(t²), t > 0. Because of the symmetry of the standard normal density, the distribution of T is symmetric around 0, i.e., T and −T have the same distribution. Hence, T has the Lebesgue density t ↦ |t| f_{1,n}(t²) = τ_n(t).


Theorem 6.4.2 (Student (1908)). In the Gaussian product model (IR^n, B(IR^n), (N(µ, σ²)^{⊗n})_{θ=(µ,σ²)∈Θ}), where Θ = IR × (0, ∞), it holds for all θ ∈ Θ:

(a) The statistics Ȳ_n def= n^{−1} ∑_{j=1}^n Yj and σ̂² def= (n − 1)^{−1} ∑_{i=1}^n (Yi − Ȳ_n)² are independent.
(b) Ȳ_n ∼ N(µ, σ²/n) and (n − 1)σ̂²/σ² ∼ χ²_{n−1}.
(c) The statistic T_n def= √n (Ȳ_n − µ)/σ̂ is distributed as t_{n−1}.

6.4.2 Gaussian shift model

Under the measure IP_{θ0}, the variable S − nθ0 is normal zero-mean with variance nσ². This particularly implies that (S − nθ0)/√(nσ²) is standard normal under IP_{θ0}:

L( (S − nθ0)/(σ√n) | IP_{θ0} ) = N(0, 1).

We start with the simplest case of a simple null and a simple alternative.

Simple null and simple alternative.

Let the null H0 : θ∗ = θ0 be tested against the alternative H1 : θ∗ = θ1 for some fixed θ1 ≠ θ0. The log-likelihood ratio L(θ1, θ0) is given by (6.5), leading to the test statistic

T = σ^{−2}[(S − nθ0)(θ1 − θ0) − n(θ1 − θ0)²/2].

The proper critical value z can be selected from the α-level condition IP_{θ0}(T > zα) = α. We use that the sum S − nθ0 is under the null normal zero-mean with variance nσ², and hence, the random variable

ξ = (S − nθ0)/√(nσ²)

is under θ0 standard normal: ξ ∼ N(0, 1). The level condition can be rewritten as

IP_{θ0}( ξ > [σ² zα + n(θ1 − θ0)²/2] / (|θ1 − θ0| σ √n) ) = α.

As ξ is standard normal under θ0, the proper zα can be computed via a quantile of the standard normal law: if z̄α is defined by IP(ξ > z̄α) = α, then

[σ² zα + n|θ1 − θ0|²/2] / (|θ1 − θ0| σ √n) = z̄α

or

zα = σ^{−2}[ z̄α |θ1 − θ0| σ √n − n|θ1 − θ0|²/2 ].

It is worth noting that this value actually does not depend on θ0; it only depends on the difference |θ1 − θ0| between the null and the alternative. This is a very important and useful property of the normal family and is called pivotality. Another way of selecting the critical value z is given by minimizing the sum of the first- and second-kind error probabilities. Theorem 6.2.1 leads to the choice z = 0, or equivalently, to the test

φ = 1I{S/n ≷ (θ0 + θ1)/2} = 1I{θ̃ ≷ (θ0 + θ1)/2} for θ1 ≷ θ0,

where θ̃ = S/n is the MLE. This test is also called the Fisher discrimination. It naturally appears in classification problems.

Two-sided test.

Now we consider a more general situation when the simple null θ∗ = θ0 is tested against the alternative θ∗ ≠ θ0. Then the LR test compares the likelihood at θ0 with the maximum likelihood over Θ \ {θ0}, which clearly coincides with the maximum over the whole parameter set. This leads to the test statistic

T = max_θ L(θ, θ0) = (n/(2σ²)) |θ̃ − θ0|²

(see Section 2.9), where θ̃ = S/n is the MLE. The LR test rejects the null if T ≥ z for a critical value z. The value z can be selected from the level condition:

IP_{θ0}(T > z) = IP_{θ0}( n σ^{−2} |θ̃ − θ0|² > 2z ) = α.

Now we use that n σ^{−2} |θ̃ − θ0|² is χ²_1-distributed according to Lemma 6.4.1. If zα is defined by IP(ξ² ≥ 2zα) = α for standard normal ξ, then the test φ = 1I(T > zα) is of level α. Again, this value does not depend on the null point θ0, and the LR test is pivotal.

Exercise 6.4.4. Compute the power function of the resulting two-sided test φ = 1I(T > zα).


One-sided test.

Now we consider the problem of testing the null θ∗ = θ0 against the one-sided alternative H1 : θ > θ0. To apply the LR test we have to compute the maximum of the log-likelihood ratio L(θ, θ0) over the set Θ1 = {θ > θ0}.

Exercise 6.4.5. Check that

sup_{θ > θ0} L(θ, θ0) = n σ^{−2} |θ̃ − θ0|²/2 if θ̃ ≥ θ0, and = 0 otherwise.

Hint: if θ̃ ≥ θ0, then the supremum over Θ1 coincides with the global maximum, otherwise it is attained at the edge θ0.

Now the LR test rejects the null if θ̃ > θ0 and n σ^{−2} |θ̃ − θ0|² > 2z for a critical value (CV) z. That is,

φ = 1I( θ̃ − θ0 > σ √(2z/n) ).

The CV z can again be chosen by the level condition. As ξ = √n (θ̃ − θ0)/σ is standard normal under IP_{θ0}, one has to select z such that IP(ξ > √(2z)) = α, leading to

φ = 1I( θ̃ > θ0 + σ z_{1−α}/√n ),

where z_{1−α} denotes the (1 − α)-quantile of the standard normal distribution.
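A brief numerical sketch (illustrative only, with made-up numbers) of the two-sided test from above and of the one-sided test just derived, both in the Gaussian shift model with known variance:

import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(2)
n, sigma, theta0, alpha = 40, 2.0, 1.0, 0.05
Y = rng.normal(1.8, sigma, n)                 # data with true mean 1.8
theta_hat = Y.mean()                          # MLE S/n

# two-sided test: 2T = n sigma^{-2} (theta_hat - theta0)^2 ~ chi2_1 under H0
two_sided_reject = n * (theta_hat - theta0) ** 2 / sigma ** 2 > chi2.ppf(1 - alpha, df=1)

# one-sided test against H1: theta > theta0
one_sided_reject = theta_hat > theta0 + sigma * norm.ppf(1 - alpha) / np.sqrt(n)
print(two_sided_reject, one_sided_reject)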

6.4.3 Testing the mean when the variance is unknown

This section discusses the Gaussian shift model Yi = θ∗ + σ∗ εi with standard normal errors εi and unknown variance σ∗². The log-likelihood function is still given by (6.4), but now σ∗² is a part of the parameter vector.

Two-sided test problem.

Here, we are concerned with the two-sided testing problem H0 : θ∗ = θ0 against H1 : θ∗ ≠ θ0. Notice that the null hypothesis is composite, because it involves the unknown variance σ∗².

Maximizing the log-likelihood L(θ, σ²) under the null leads to the value L(θ0, σ̃0²) with

σ̃0² def= argmax_{σ²} L(θ0, σ²) = n^{−1} ∑_i (Yi − θ0)².

As in Section 2.9.2 for the problem of variance estimation, it holds for any σ²

L(θ0, σ̃0²) − L(θ0, σ²) = n K(σ̃0², σ²).

At the same time, maximizing L(θ, σ²) over the alternative is equivalent to the global maximization, leading to the value L(θ̃, σ̃²) with

θ̃ = S/n,  σ̃² = (1/n) ∑_i (Yi − θ̃)².

The LR test statistic reads as

T = L(θ̃, σ̃²) − L(θ0, σ̃0²).

This expression can be decomposed in the following way:

T = L(θ̃, σ̃²) − L(θ0, σ̃²) + L(θ0, σ̃²) − L(θ0, σ̃0²) = (n/(2σ̃²))(θ̃ − θ0)² − n K(σ̃0², σ̃²).

In order to derive the CV z, notice that

exp(T) = exp(L(θ̃, σ̃²)) / exp(L(θ0, σ̃0²)) = (σ̃0²/σ̃²)^{n/2}.

Consequently, the LR test rejects for large values of

σ̃0²/σ̃² = 1 + n(θ̃ − θ0)² / ∑_{i=1}^n (Yi − θ̃)².

In view of Theorems 6.4.1 and 6.4.2, the critical value zα is therefore a deterministic transformation of the suitable quantile of Fisher's F-distribution with 1 and n − 1 degrees of freedom, or of Student's t-distribution with n − 1 degrees of freedom.

Exercise 6.4.6. Derive the explicit form of zα for a given significance level α in terms of Fisher's F-distribution and in terms of Student's t-distribution.

Often one considers the case in which the variance is only estimated under the alternative, that is, σ̃² is used in place of σ̃0². This is quite natural because the null can be viewed as a particular case of the alternative. This leads to the test statistic

T∗ = L(θ̃, σ̃²) − L(θ0, σ̃²) = (n/(2σ̃²))(θ̃ − θ0)².

Since T∗ is an isotone transformation of T, both tests are equivalent.
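Since T∗ is a monotone function of the squared t-statistic, the resulting test is the classical two-sided one-sample t-test. A minimal sketch with simulated data (illustrative, not from the book), based on Theorem 6.4.2(c):

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(3)
n, theta0, alpha = 25, 0.0, 0.05
Y = rng.normal(0.4, 1.5, n)

theta_hat = Y.mean()
s2 = Y.var(ddof=1)                                       # (n-1)^{-1} sum (Y_i - bar Y)^2
T_n = np.sqrt(n) * (theta_hat - theta0) / np.sqrt(s2)    # ~ t_{n-1} under H0

reject = abs(T_n) > t.ppf(1 - alpha / 2, df=n - 1)
print(f"T_n = {T_n:.3f}, reject: {reject}")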


One-sided test problem.

In analogy to the considerations in Section 6.4.2, the LR test for the one-sided problem H0 : θ∗ = θ0 against H1 : θ∗ > θ0 rejects if θ̃ > θ0 and T exceeds a suitable critical value z.

Exercise 6.4.7. Derive the explicit form of the LR test for the one-sided test problem. Compute zα for a given significance level α in terms of Student's t-distribution.

6.4.4 Testing the variance

In this section, we consider the LR test for the hypothesis H0 : σ² = σ0² against H1 : σ² > σ0², or H0 : σ² = σ0² against H1 : σ² ≠ σ0², respectively. In this, we assume that θ∗ is known. The case of unknown θ∗ can be treated similarly, cf. Exercise 6.4.10 below. As discussed before, maximization of the likelihood under the constraint of known mean yields

σ̃∗² def= argmax_{σ²} L(θ∗, σ²) = n^{−1} ∑_i (Yi − θ∗)².

The LR test for H0 against H1 rejects the null hypothesis if

T = L(θ∗, σ̃∗²) − L(θ∗, σ0²)

exceeds a critical value z. For determining the rejection regions, notice that

2T = n(Q − log(Q) − 1),  Q def= σ̃∗²/σ0². (6.6)

Exercise 6.4.8. (a) Verify representation (6.6).
(b) Show that x ↦ x − log(x) − 1 is a convex function on (0, ∞) with minimum value 0 at x = 1.

Combining Exercise 6.4.8 and Corollary 6.4.1, we conclude that the critical value z for the LR test for the one-sided test problem is a deterministic transformation of a suitable quantile of the χ²_n-distribution. For the two-sided test problem, the acceptance region of the LR test is an interval for Q which is bounded by deterministic transformations of lower and upper quantiles of the χ²_n-distribution.

Exercise 6.4.9. Derive the rejection regions of the LR tests for the one- and the two-sided test problems explicitly.

Exercise 6.4.10. Derive one- and two-sided LR tests for σ² in the case of unknown θ∗. Hint: Use Theorem 6.4.2.(b).
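For the case of known mean θ∗, the following sketch (illustrative numbers; the two-sided rule uses the common equal-tails calibration rather than the exact LR acceptance interval) shows how the χ²_n quantiles enter via nQ = ∑_i (Yi − θ∗)²/σ0² ∼ χ²_n under H0:

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
n, theta_star, sigma0_sq, alpha = 30, 0.0, 1.0, 0.05
Y = rng.normal(theta_star, 1.3, n)

sigma_hat_sq = np.mean((Y - theta_star) ** 2)    # MLE of sigma^2 for known mean
nQ = n * sigma_hat_sq / sigma0_sq                # = n * Q, ~ chi2_n under H0

# one-sided H1: sigma^2 > sigma0^2
reject_one_sided = nQ > chi2.ppf(1 - alpha, df=n)
# two-sided H1: sigma^2 != sigma0^2 (equal-tails approximation)
reject_two_sided = (nQ < chi2.ppf(alpha / 2, df=n)) or (nQ > chi2.ppf(1 - alpha / 2, df=n))
print(reject_one_sided, reject_two_sided)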


6.5 LR-tests. Further examples

We return to the models investigated in Section 2.9 and derive LR tests for the respective parameters.

6.5.1 Bernoulli or binomial model

Assume that (Y1, . . . , Yn) are i.i.d. with Y1 ∼ Bernoulli(θ∗). Letting Sn = ∑ Yi, the log-likelihood is given by

L(θ) = ∑ { Yi log θ + (1 − Yi) log(1 − θ) } = Sn log[θ/(1 − θ)] + n log(1 − θ).

Simple null versus simple alternative.

The LR statistic for testing the simple null hypothesis H0 : θ∗ = θ0 against the simple alternative H1 : θ∗ = θ1 reads as

T = L(θ1) − L(θ0) = Sn log[θ1(1 − θ0)/(θ0(1 − θ1))] + n log[(1 − θ1)/(1 − θ0)].

For a fixed significance level α, the resulting LR test rejects if

θ̃ ≷ ( zα/n − log[(1 − θ1)/(1 − θ0)] ) / log[θ1(1 − θ0)/(θ0(1 − θ1))] for θ1 ≷ θ0,

where θ̃ = Sn/n. For the determination of zα, it is convenient to notice that for any pair (θ0, θ1) ∈ (0, 1)², the function

x ↦ x log[θ1(1 − θ0)/(θ0(1 − θ1))] + log[(1 − θ1)/(1 − θ0)]

is increasing (decreasing) in x ∈ [0, 1] if θ1 > θ0 (θ1 < θ0). Hence, the LR statistic T is an isotone (antitone) transformation of θ̃ if θ1 > θ0 (θ1 < θ0). Since Sn = nθ̃ is under H0 binomially distributed with parameters n and θ0, the LR test φ is given by

φ = 1I{Sn > F^{−1}_{Bin(n,θ0)}(1 − α)} if θ1 > θ0,
φ = 1I{Sn < F^{−1}_{Bin(n,θ0)}(α)} if θ1 < θ0. (6.7)


Composite alternatives.

Obviously, the LR test in (6.7) depends on the value of θ1 only via the sign of θ1 − θ0. Therefore, the LR test for the one-sided test problem H0 : θ∗ = θ0 against H1 : θ∗ > θ0 rejects if

Sn > F^{−1}_{Bin(n,θ0)}(1 − α)

and the LR test for H0 against H1 : θ∗ < θ0 rejects if

Sn < F^{−1}_{Bin(n,θ0)}(α).

The LR test for the two-sided test problem H0 against H1 : θ∗ ≠ θ0 rejects if

Sn ∉ [F^{−1}_{Bin(n,θ0)}(α/2), F^{−1}_{Bin(n,θ0)}(1 − α/2)].

6.5.2 Uniform distribution on [0, θ]

Consider again the model from Section 2.9.4, i.e., Y1, . . . , Yn are i.i.d. with Y1 ∼ UNI[0, θ∗], where the upper endpoint θ∗ of the support is unknown. It holds that

Z(θ) = θ^{−n} 1I(max_i Yi ≤ θ)

and the maximum of Z(θ) over (0, ∞) is attained at θ̃ = max_i Yi. Let us consider the two-sided test problem H0 : θ∗ = θ0 against H1 : θ∗ ≠ θ0 for some given value θ0 > 0. We get that

exp(T) = Z(θ̃)/Z(θ0) = (θ0/θ̃)^n if max_i Yi ≤ θ0, and exp(T) = ∞ if max_i Yi > θ0.

It follows that exp(T) > z if θ̃ > θ0 or θ̃ < θ0 z^{−1/n}. We compute the critical value zα for a level-α test by noticing that

IP_{θ0}({θ̃ > θ0} ∪ {θ̃ < θ0 z^{−1/n}}) = IP_{θ0}(θ̃ < θ0 z^{−1/n}) = (F_{Y1}(θ0 z^{−1/n}))^n = 1/z.

Thus, the LR test at level α for the two-sided problem H0 against H1 is given by

φ = 1I{θ̃ > θ0} + 1I{θ̃ < θ0 α^{1/n}}.
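A minimal sketch of this test with simulated uniform data (all numbers illustrative only):

import numpy as np

rng = np.random.default_rng(6)
n, theta0, alpha = 20, 2.0, 0.05
Y = rng.uniform(0, 2.3, n)
theta_hat = Y.max()                       # MLE: maximum of the observations

# reject if theta_hat > theta0 or theta_hat < theta0 * alpha**(1/n)
reject = (theta_hat > theta0) or (theta_hat < theta0 * alpha ** (1.0 / n))
print(theta_hat, reject)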


6.5.3 Exponential model

We return to the model considered in Section 2.9.7 and assume that Y1, . . . , Yn are i.i.d. exponential random variables with parameter θ∗ > 0. The corresponding log-likelihood can be written as

L(θ) = −n log θ − ∑_{i=1}^n Yi/θ = −S/θ − n log θ,

where S = Y1 + . . . + Yn.

In order to derive the LR test for the simple hypothesis H0 : θ∗ = θ0 against the simple alternative H1 : θ∗ = θ1, notice that the LR statistic T is given by

T = L(θ1) − L(θ0) = S (θ1 − θ0)/(θ1 θ0) + n log(θ0/θ1).

Since the function

x ↦ x (θ1 − θ0)/(θ1 θ0) + n log(θ0/θ1)

is increasing (decreasing) in x > 0 whenever θ1 > θ0 (θ1 < θ0), the LR test rejects for large values of S in the case that θ1 > θ0 and for small values of S if θ1 < θ0. Due to the facts that the exponential distribution with parameter θ0 is identical to Gamma(θ0, 1) and that the family of gamma distributions is convolution-stable with respect to its second parameter whenever the first parameter is fixed, we obtain that S is under θ0 distributed as Gamma(θ0, n). This implies that the LR test φ for H0 against H1 at significance level α is given by

φ = 1I{S > F^{−1}_{Gamma(θ0,n)}(1 − α)} if θ1 > θ0,
φ = 1I{S < F^{−1}_{Gamma(θ0,n)}(α)} if θ1 < θ0.

Moreover, composite alternatives can be tested in analogy to the considerations in Section 6.5.1.
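A numerical sketch of the simple-vs-simple exponential test (illustrative; the parameter θ is read here as the mean of the exponential law, in line with the displayed log-likelihood, so that S has a gamma distribution with shape n and scale θ0 under H0):

import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(7)
n, theta0, theta1, alpha = 30, 1.0, 1.5, 0.05
Y = rng.exponential(scale=1.4, size=n)
S = Y.sum()                                      # S ~ Gamma(shape=n, scale=theta0) under H0

if theta1 > theta0:
    reject = S > gamma.ppf(1 - alpha, a=n, scale=theta0)
else:
    reject = S < gamma.ppf(alpha, a=n, scale=theta0)
print(S, reject)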

6.5.4 Poisson model

Let Y1, . . . , Yn be i.i.d. Poisson random variables satisfying IP(Yi = m) = (θ∗)^m e^{−θ∗}/m! for m = 0, 1, 2, . . . . According to Section 2.9.8, we have that

L(θ) = S log θ − nθ + R,
L(θ1) − L(θ0) = S log(θ1/θ0) + n(θ0 − θ1),

where the remainder term R does not depend on θ. In order to derive the LR test for the simple hypothesis H0 : θ∗ = θ0 against the simple alternative H1 : θ∗ = θ1, we again check easily that x ↦ x log(θ1/θ0) + n(θ0 − θ1) is increasing (decreasing) in x > 0 if θ1 > θ0 (θ1 < θ0). Convolution stability of the family of Poisson distributions entails that the LR test φ for H0 against H1 at significance level α is given by

φ = 1I{S > F^{−1}_{Poisson(nθ0)}(1 − α)} if θ1 > θ0,
φ = 1I{S < F^{−1}_{Poisson(nθ0)}(α)} if θ1 < θ0.

Moreover, composite alternatives can be tested in analogy to Section 6.5.1.
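A corresponding sketch for the Poisson case (illustrative data; as in the binomial example, the discreteness of S makes the test conservative):

import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(8)
n, theta0, theta1, alpha = 40, 2.0, 2.6, 0.05
Y = rng.poisson(2.4, n)
S = Y.sum()                                      # S ~ Poisson(n * theta0) under H0

if theta1 > theta0:
    reject = S > poisson.ppf(1 - alpha, n * theta0)
else:
    reject = S < poisson.ppf(alpha, n * theta0)
print(S, reject)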

6.6 Testing problem for a univariate exponential family

Let (Pθ, θ ∈ Θ ⊆ IR^1) be a univariate exponential family. The choice of parametrization is unimportant; any parametrization can be taken. To be specific, we assume the natural parametrization that simplifies the expression for the maximum likelihood estimate.

We assume that the two functions C(θ) and B(θ) of θ are fixed, with which the log-density of Pθ can be written in the form

ℓ(y, θ) def= log p(y, θ) = y C(θ) − B(θ) − ℓ(y)

for some other function ℓ(y). The function C(θ) is monotonic in θ, and C(θ) and B(θ) are related (for the case of an EFn) by the identity B′(θ) = θ C′(θ), see Section 2.11.

Let now Y = (Y1, . . . , Yn) be an i.i.d. sample from P_{θ∗} for θ∗ ∈ Θ. The task is to test a simple hypothesis θ∗ = θ0 against an alternative θ∗ ∈ Θ1 for some subset Θ1 that does not contain θ0.

6.6.1 Two-sided alternative

We start with the case of a simple hypothesis H0 : θ∗ = θ0 against a full two-sided alternative H1 : θ∗ ≠ θ0. The likelihood ratio approach suggests to compare the likelihood at θ0 with the maximum of the likelihood over the alternative, which effectively means the maximum over the whole parameter set Θ. In the case of a univariate exponential family, this maximum is computed in Section 2.11. For

L(θ, θ0) def= L(θ) − L(θ0) = S[C(θ) − C(θ0)] − n[B(θ) − B(θ0)]

with S = Y1 + . . . + Yn, it holds

T def= sup_θ L(θ, θ0) = n K(θ̃, θ0),

where K(θ, θ′) = IEθ ℓ(θ, θ′) is the Kullback–Leibler divergence between the measures Pθ and P_{θ′}. For an EFn, the MLE θ̃ is the empirical mean of the observations Yi, θ̃ = S/n, and the KL divergence K(θ̃, θ0) is of the form

K(θ̃, θ0) = θ̃[C(θ̃) − C(θ0)] − [B(θ̃) − B(θ0)].

Therefore, the test statistic T is a function of the empirical mean θ̃ = S/n:

T = n K(θ̃, θ0) = n θ̃[C(θ̃) − C(θ0)] − n[B(θ̃) − B(θ0)]. (6.8)

The LR test rejects H0 if the test statistic T exceeds a critical value z. Given α ∈ (0, 1), a proper CV zα can be specified by the level condition

IP_{θ0}(T > zα) = α.

In view of (6.8), the LR test rejects the null if the "distance" K(θ̃, θ0) between the estimate θ̃ and the null θ0 is significantly larger than zero. In the case of an exponential family, one can simplify the test just by considering the estimate θ̃ as test statistic. We use the following technical result for the KL divergence K(θ, θ0):

Lemma 6.6.1. Let (Pθ) be an EFn. Then for every z > 0 there are two positive values t−(z) and t+(z) such that

{θ : K(θ, θ0) ≤ z} = {θ : θ0 − t−(z) ≤ θ ≤ θ0 + t+(z)}. (6.9)

In other words, the conditions K(θ, θ0) ≤ z and θ0 − t−(z) ≤ θ ≤ θ0 + t+(z) are equivalent.

Proof. The function K(θ, θ0) of the first argument θ fulfills

∂K(θ, θ0)/∂θ = C(θ) − C(θ0),  ∂²K(θ, θ0)/∂θ² = C′(θ) > 0.

Therefore, it is convex in θ with minimum at θ0, and it can cross the level z only once to the left of θ0 and once to the right. This yields that for any z > 0, there are two positive values t−(z) and t+(z) such that (6.9) holds. Note that one or even both of these values can be infinite.

Due to the result of this lemma, the LR test can be rewritten as

φ = 1I(T > z) = 1 − 1I(T ≤ z) = 1 − 1I(−t−(z) ≤ θ̃ − θ0 ≤ t+(z))
  = 1I(θ̃ > θ0 + t+(z)) + 1I(θ̃ < θ0 − t−(z)),

that is, the test rejects the null hypothesis if the estimate θ̃ deviates significantly from θ0.

6.6.2 One-sided alternative

Now we consider the problem of testing the same null H0 : θ∗ = θ0 against the one-sided alternative H1 : θ∗ > θ0. Of course, the other one-sided alternative H1 : θ∗ < θ0 can be considered analogously.

The LR test requires computing the maximum of the log-likelihood over the alternative set {θ : θ > θ0}. This can be done as in the Gaussian shift model. If θ̃ > θ0, then this maximum coincides with the global maximum over all θ. Otherwise, it is attained at θ = θ0.

Lemma 6.6.2. Let (Pθ) be an EFn. Then

sup_{θ > θ0} L(θ, θ0) = n K(θ̃, θ0) if θ̃ ≥ θ0, and = 0 otherwise.

Proof. It is only necessary to consider the case θ̃ < θ0. The difference L(θ) − L(θ0) can be represented as S[C(θ) − C(θ0)] − n[B(θ) − B(θ0)]. Next, usage of the identities θ̃ = S/n and B′(θ) = θ C′(θ) yields

∂L(θ, θ0)/∂θ = −∂L(θ̃, θ)/∂θ = −n ∂K(θ̃, θ)/∂θ = n(θ̃ − θ) C′(θ) < 0

for any θ > θ̃. This implies that L(θ) − L(θ0) becomes negative as θ grows beyond θ0, and thus L(θ, θ0) has its supremum over {θ : θ > θ0} at the edge θ = θ0, yielding the assertion.

This fact implies the following representation of the LR test in the case of a one-sided alternative.

Theorem 6.6.1. Let (Pθ) be an EFn. Then the α-level LR test for the null H0 : θ∗ = θ0 against the one-sided alternative H1 : θ∗ > θ0 is

φ = 1I(θ̃ > θ0 + tα), (6.10)

where tα is selected to ensure IP_{θ0}(θ̃ > θ0 + tα) = α.

Proof. Let T be the LR test statistic. Due to Lemmas 6.6.2 and 6.6.1, the event {T ≥ z} can be rewritten as {θ̃ > θ0 + t(z)} for some constant t(z). It remains to select a proper value t(z) = tα to fulfill the level condition.

This result can be extended naturally to the case of a composite null hypothesis H0 : θ∗ ≤ θ0.

Theorem 6.6.2. Let (Pθ) be an EFn. Then the α-level LR test for the composite null H0 : θ∗ ≤ θ0 against the one-sided alternative H1 : θ∗ > θ0 is

φ∗_α = 1I(θ̃ > θ0 + tα), (6.11)

where tα is selected to ensure IP_{θ0}(θ̃ > θ0 + tα) = α.

Proof. The same arguments as in the proof of Theorem 6.6.1 lead to exactly the same LR test statistic T and thus to the test of the form (6.10). In particular, the estimate θ̃ should significantly deviate from the null set. It remains to check that the level condition at the edge point θ0 ensures the level for all θ < θ0. This follows from the next monotonicity property.

Lemma 6.6.3. Let (Pθ) be an EFn. Then for any t ≥ 0

IPθ(θ̃ > θ0 + t) ≤ IP_{θ0}(θ̃ > θ0 + t), ∀θ < θ0.

Proof. Let θ < θ0. We apply

IPθ(θ̃ > θ0 + t) = IE_{θ0}[ exp{L(θ, θ0)} 1I(θ̃ > θ0 + t) ].

Now the monotonicity of L(θ, θ0) w.r.t. θ (see the proof of Lemma 6.6.2) implies L(θ, θ0) < 0 on the set {θ < θ0 < θ̃}. This yields the result.

Therefore, if the level is controlled under IP_{θ0}, it is controlled for all other points in the null set.

A very nice feature of the LR test is that it can be universally represented in terms of θ̃, independently of the form of the alternative set. In particular, for the case of a one-sided alternative, this test just compares the estimate θ̃ with the value θ0 + tα. Moreover, the value tα only depends on the distribution of θ̃ under IP_{θ0} via the level condition. This and the monotonicity of the error probability from Lemma 6.6.3 allow us to state a nice optimality property of this test: φ∗_α is uniformly most powerful in the sense of Definition 6.1.1, that is, it maximizes the test power under the level constraint.

Theorem 6.6.3. Let (Pθ) be an EFn, and let φ∗_α be the test from (6.11) for testing H0 : θ∗ ≤ θ0 against H1 : θ∗ > θ0. For any (randomized) test φ satisfying IE_{θ0} φ ≤ α and any θ > θ0, it holds

IEθ φ ≤ IPθ(φ∗_α = 1).

In fact, this theorem repeats the Neyman-Pearson result of Theorem 6.2.2, because the test φ∗_α is at the same time the LR α-level test of the simple hypothesis θ∗ = θ0 against θ∗ = θ1, for any value θ1 > θ0.


6.6.3 Interval hypothesis

In some applications, the null hypothesis is naturally formulated in the form that the parameter θ∗ belongs to a given interval [θ0, θ1]. The alternative H1 : θ∗ ∈ Θ \ [θ0, θ1] is the complement of this interval. The likelihood ratio test is based on the test statistic T from (6.3), which compares the maximum of the log-likelihood L(θ) over the null interval [θ0, θ1] with the maximum over the alternative set. The special structure of the log-likelihood in the case of an EFn permits representing this test statistic in terms of the estimate θ̃: the hypothesis is rejected if the estimate θ̃ deviates significantly from the interval [θ0, θ1].

Theorem 6.6.4. Let (Pθ) be an EFn. Then the α-level LR test for the null H0 : θ∗ ∈ [θ0, θ1] against the alternative H1 : θ∗ ∉ [θ0, θ1] can be written as

φ = 1I(θ̃ > θ1 + t+_α) + 1I(θ̃ < θ0 − t−_α), (6.12)

where the non-negative constants t+_α and t−_α are selected to ensure IP_{θ0}(θ̃ < θ0 − t−_α) + IP_{θ1}(θ̃ > θ1 + t+_α) = α.

Exercise 6.6.1. Prove the result of Theorem 6.6.4.
Hint: Consider three cases: θ̃ ∈ [θ0, θ1], θ̃ > θ1, and θ̃ < θ0. For every case, apply the monotonicity of L(θ̃, θ) in θ.

One can consider the alternative of the interval hypothesis as a combination of two one-sided alternatives. The LR test φ from (6.12) involves only one critical value z, and the parameters t−_α and t+_α are related via the structure of this test: they are obtained by transforming the inequality T > zα into θ̃ > θ1 + t+_α and θ̃ < θ0 − t−_α. However, one can also just apply two one-sided tests separately: one for the alternative H1− : θ∗ < θ0 and one for H1+ : θ∗ > θ1. This leads to the two tests

φ− def= 1I(θ̃ < θ0 − t−),  φ+ def= 1I(θ̃ > θ1 + t+). (6.13)

The values t−, t+ can be chosen by the so-called Bonferroni rule: just perform each of the two tests at level α/2.

Exercise 6.6.2. For fixed α ∈ (0, 1), let the values t−_α, t+_α be selected to ensure

IP_{θ0}(θ̃ < θ0 − t−_α) = α/2,  IP_{θ1}(θ̃ > θ1 + t+_α) = α/2.

Then for any θ ∈ [θ0, θ1], the test φ given by φ = φ− + φ+ (cf. (6.13)) fulfills

IPθ(φ = 1) ≤ α.

Hint: Use the monotonicity from Lemma 6.6.3.


6.7 Historical remarks and further reading

The theory of optimal tests goes back to Jerzy Neyman (1894–1981) and Egon Sharpe Pearson (1895–1980). Some early considerations with respect to likelihood ratio tests can be found in Neyman and Pearson (1928). The fundamental lemma of Neyman and Pearson (1933) is core to the derivation of most powerful (likelihood ratio) tests. Fisher (1934) showed that uniformly best tests (over the parameter subspace defining a composite alternative) only exist in one-parametric exponential families. More details about the origins of the theory of optimal tests can be found in the book by Lehmann (2011).

Important contributions to the theory of tests for parameters of a normal distribution have moreover been made by William Sealy Gosset (1876–1937; under the pen name "Student", see Student (1908)) and Ernst Abbe (1840–1905; cf. Kendall (1971) for an account of Abbe's work).

The χ²-test for goodness-of-fit is due to Pearson (1900). The phenomenon that the limiting distribution of twice the log-likelihood ratio statistic in nested models is, under regularity conditions, a chi-square distribution was discovered by Wilks (1938).

An excellent textbook reference for the theory of testing statistical hypotheses is Lehmann and Romano (2005).


7

Testing in linear models

This chapter discusses testing problems for linear Gaussian models given by the equation

Y = f∗ + ε (7.1)

with the vector of observations Y, response vector f∗, and vector of errors ε in IR^n. The linear parametric assumption (linear PA) means that

Y = Ψ^⊤θ∗ + ε,  ε ∼ N(0, Σ), (7.2)

where Ψ is the p × n design matrix. By θ we denote the p-dimensional target parameter vector, θ ∈ Θ ⊆ IR^p. Usually we assume that the parameter set coincides with the whole space IR^p, i.e. Θ = IR^p. The most general assumption about the vector of errors ε = (ε1, . . . , εn)^⊤ is Var(ε) = Σ, which permits inhomogeneous and correlated errors. However, for most results we assume i.i.d. errors εi ∼ N(0, σ²). The variance σ² could be unknown as well. As in previous chapters, θ∗ denotes the true value of the parameter vector (assuming that the model (7.2) is correct).

7.1 Likelihood ratio test for a simple null

This section discusses the problem of testing a simple hypothesis H0 : θ∗ = θ0 for a given vector θ0. A natural "non-informative" alternative is H1 : θ∗ ≠ θ0.

7.1.1 General errors

We start from the case of general errors with known covariance matrix Σ. The results obtained for the estimation problem in Chapter 4 will be heavily used in our study. In particular, the MLE θ̃ of θ∗ is

θ̃ = (ΨΣ^{−1}Ψ^⊤)^{−1} ΨΣ^{−1} Y

and the corresponding maximum of the log-likelihood ratio is

L(θ̃, θ0) = (1/2) (θ̃ − θ0)^⊤ B (θ̃ − θ0)

with the p × p matrix B given by

B = ΨΣ^{−1}Ψ^⊤.

This immediately leads to the following representation of the likelihood ratio (LR) test statistic in this set-up:

T def= sup_θ L(θ, θ0) = (1/2) (θ̃ − θ0)^⊤ B (θ̃ − θ0). (7.3)

Moreover, Wilks' phenomenon claims that under IP_{θ0}, the test statistic T has a fixed distribution: namely, 2T is χ²_p-distributed (chi-squared with p degrees of freedom).

Theorem 7.1.1. Consider the model (7.2) with ε ∼ N(0, Σ) for a known matrix Σ. Then the LR test statistic T is given by (7.3). Moreover, if zα fulfills IP(ζp > 2zα) = α with ζp ∼ χ²_p, then the LR test φ with

φ def= 1I(T > zα) (7.4)

is of exact level α:

IP_{θ0}(φ = 1) = α.

This result follows directly from Theorems 4.6.1 and 4.6.2. We see again the important pivotal property of the test: the critical value zα only depends on the dimension p of the parameter space Θ. It does not depend on the design matrix Ψ, the error covariance Σ, or the null value θ0.

7.1.2 I.i.d. errors, known variance

We now specify this result for the case of i.i.d. errors. We also focus on the residuals

ε̂ def= Y − Ψ^⊤θ̃ = Y − f̃,

where f̃ = Ψ^⊤θ̃ is the estimated response, with true response f∗ = Ψ^⊤θ∗.

We start with some geometric properties of the residuals ε̂ and of the test statistic T from (7.3).

Theorem 7.1.2. Consider the model (7.1). Let T be the LR test statistic built under the assumptions f∗ = Ψ^⊤θ∗ and Var(ε) = σ²II_n with a known value σ². Then T is given by

T = (1/(2σ²)) ‖Ψ^⊤(θ̃ − θ0)‖² = (1/(2σ²)) ‖f̃ − f0‖². (7.5)

Moreover, the following decompositions for the vector Y − f0 hold:

Y − f0 = (f̃ − f0) + ε̂, (7.6)
‖Y − f0‖² = ‖f̃ − f0‖² + ‖ε̂‖², (7.7)

where f̃ − f0 is the estimation error and ε̂ = Y − f̃ is the vector of residuals.

Proof. The key step of the proof is the representation of the estimated response f̃ under the model assumption Y = f∗ + ε as a projection of the data on the p-dimensional linear subspace L in IR^n spanned by the rows of the matrix Ψ:

f̃ = ΠY = Π(f∗ + ε),

where Π = Ψ^⊤(ΨΨ^⊤)^{−1}Ψ is the projector onto L; see Section 4.3. Note that this decomposition is valid for the general linear model; the parametric form of the response f∗ and the noise normality are not required. The identity Ψ^⊤(θ̃ − θ0) = f̃ − f0 follows directly from the definition, implying the representation (7.5) for the test statistic T. The identity (7.6) follows from the definition. Next, Πf0 = f0 and thus f̃ − f0 = Π(Y − f0). Similarly,

ε̂ = Y − f̃ = (II_n − Π)Y.

As Π and II_n − Π are orthogonal projectors, it follows

‖Y − f0‖² = ‖(II_n − Π)Y + Π(Y − f0)‖² = ‖(II_n − Π)Y‖² + ‖Π(Y − f0)‖²

and the decomposition (7.7) follows.

The decomposition (7.6), although straightforward, is very important for understanding the structure of the residuals under the null and under the alternative. Under the null H0, the response f∗ is assumed to be known and coincides with f0, so the residuals ε̂ coincide with the errors ε. The sum of squared residuals is usually abbreviated as RSS:

RSS0 def= ‖Y − f0‖².

Under the alternative, the response is unknown and is estimated by f̃. The residuals are ε̂ = Y − f̃, resulting in the RSS

RSS def= ‖Y − f̃‖².

The decomposition (7.7) can be rewritten as

RSS0 = RSS + ‖f̃ − f0‖². (7.8)

We see that the RSS under the null and the alternative can be essentially different only if the estimate f̃ significantly deviates from the null assumption f∗ = f0. The test statistic T from (7.3) can be written as

T = (RSS0 − RSS)/(2σ²).

For the proofs in the remainder of this chapter, the following results concerning the distribution of quadratic forms of Gaussians are helpful.

Theorem 7.1.3. 1. Let X ∼ N_n(µ, Σ), where Σ is symmetric and positive definite. Then (X − µ)^⊤Σ^{−1}(X − µ) ∼ χ²_n.
2. Let X ∼ N_n(0, II_n), R a symmetric, idempotent (n × n)-matrix with rank(R) = r, and B a (p × n)-matrix with p ≤ n. Then it holds
(a) X^⊤RX ∼ χ²_r.
(b) From BR = 0, it follows that X^⊤RX is independent of BX.
3. Let X ∼ N_n(0, II_n) and R, S symmetric and idempotent (n × n)-matrices with rank(R) = r, rank(S) = s, and RS = 0. Then it holds
(a) X^⊤RX and X^⊤SX are independent.
(b) (X^⊤RX/r) / (X^⊤SX/s) ∼ F_{r,s}.

Proof. For proving the first assertion, let Σ^{1/2} denote the uniquely defined, symmetric and positive definite matrix fulfilling Σ^{1/2} · Σ^{1/2} = Σ, with inverse matrix Σ^{−1/2}. Then Z def= Σ^{−1/2}(X − µ) ∼ N_n(0, II_n). The assertion Z^⊤Z ∼ χ²_n follows from the definition of the chi-squared distribution.

For the second part, notice that, due to symmetry and idempotence of R, there exists an orthogonal matrix P with R = P D_r P^⊤, where D_r is the block-diagonal matrix with blocks II_r and 0. Orthogonality of P implies W def= P^⊤X ∼ N_n(0, II_n). We conclude

X^⊤RX = X^⊤R²X = (RX)^⊤(RX) = (P D_r W)^⊤(P D_r W) = W^⊤ D_r P^⊤ P D_r W = W^⊤ D_r W = ∑_{i=1}^r Wi² ∼ χ²_r.

Furthermore, we have Z1 def= BX ∼ N_p(0, BB^⊤), Z2 def= RX ∼ N_n(0, R), and Cov(Z1, Z2) = Cov(BX, RX) = B Cov(X) R^⊤ = BR = 0. As uncorrelatedness implies independence under joint Gaussianity, and X^⊤RX = Z2^⊤Z2 is a function of Z2, statement 2 follows.

To prove the third part, let Z1 def= SX ∼ N_n(0, S) and Z2 def= RX ∼ N_n(0, R) and notice that Cov(Z1, Z2) = S Cov(X) R = SR = S^⊤R^⊤ = (RS)^⊤ = 0. Assertion (a) is then implied by the identities X^⊤SX = Z1^⊤Z1 and X^⊤RX = Z2^⊤Z2. Assertion (b) immediately follows from 3(a) and 1.

Exercise 7.1.1. Prove Theorem 6.4.2.
Hint: Use Theorem 7.1.3.

Now we show that if the model assumptions are correct, the LR test based on T from (7.3) has the exact level α and is pivotal.

Theorem 7.1.4. Consider the model (7.1) with ε ∼ N(0, σ²II_n) for a known value σ², implying that the εi are i.i.d. normal. The LR test φ from (7.4) is of exact level α. Moreover, f̃ − f0 and ε̂ are under IP_{θ0} zero-mean independent Gaussian vectors satisfying

2T = σ^{−2}‖f̃ − f0‖² ∼ χ²_p,  σ^{−2}‖ε̂‖² ∼ χ²_{n−p}. (7.9)

Proof. The null assumption f∗ = f0 together with Πf0 = f0 now implies the following decomposition:

f̃ − f0 = Πε,  ε̂ = ε − Πε = (II_n − Π)ε.

Next, Π and II_n − Π are orthogonal projectors, implying orthogonal and thus uncorrelated vectors Πε and (II_n − Π)ε. Under normality of ε, these vectors are also normal, and uncorrelatedness implies independence. The property (7.9) for the distribution of Πε was proved in Theorem 4.6.1. For the distribution of ε̂ = (II_n − Π)ε, we can apply Theorem 7.1.3.

Next we discuss the power of the LR test φ, defined as the probability of detecting the alternative when the response f∗ deviates from the null f0. In the next result we do not assume that the true response f∗ follows the linear PA f∗ = Ψ^⊤θ and show that the test power depends on the value ‖Π(f∗ − f0)‖².

Theorem 7.1.5. Consider the model (7.1) with Var(ε) = σ²II_n for a known value σ². Define

∆ = σ^{−2}‖Π(f∗ − f0)‖².

Then the power of the LR test φ only depends on ∆, i.e. it is the same for all f∗ with equal ∆-value. It holds

IP(φ = 1) = IP( |ξ1 + √∆|² + ξ2² + . . . + ξp² > 2zα )

with ξ = (ξ1, . . . , ξp)^⊤ ∼ N(0, II_p).

Proof. It follows from f̃ = ΠY = Π(f∗ + ε) and f0 = Πf0 that the test statistic T = (2σ²)^{−1}‖f̃ − f0‖² fulfills

T = (2σ²)^{−1}‖Π(f∗ − f0) + Πε‖².

Now we show that the distribution of T depends on the response f∗ only via the value ∆. For this we compute the Laplace transform of T.

Lemma 7.1.1. It holds for µ < 1

g(µ) def= log IE exp{µT} = µ∆/(2(1 − µ)) − (p/2) log(1 − µ).

IE exp{µ|ξ + a|2/2

}= eµa

2/2(2π)−1/2

∫exp{µax+ µx2/2− x2/2

}dx

= exp

{µa2

2+

µ2a2

2(1− µ)

}1√2π

∫exp

{−1− µ

2

(x− aµ

1− µ

)2}dx

= exp

{µa2

2(1− µ)

}(1− µ)−1/2.

The projector Π can be represented as Π = U>ΛpU for an orthogonal trans-form U and the diagonal matrix Λp = diag(1, . . . , 1, 0, . . . , 0) with only punit eigenvalues. This permits representing T in the form

T =

p∑j=1

(ξj + aj)2/2

with i.i.d. standard normal r.v. ξj and numbers aj satisfying∑j a

2j = ∆ .

The independence of the ξj ’s implies

g(µ) =

p∑j=1

[µa2

j

2+

µ2a2j

2(1− µ)− 1

2log(1− µ)

]=

µ∆

2(1− µ)− p

2log(1− µ)

as required.

The result of Lemma 7.1.1 claims that the Laplace transform of T dependson f∗ only via ∆ and so this also holds for the distribution of T .

Page 239: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

7.1 Likelihood ratio test for a simple null 239

The distribution of the squared norm ‖ξ + h‖² for ξ ∼ N(0, II_p) and any fixed vector h ∈ IR^p with ‖h‖² = ∆ is called non-central chi-squared with non-centrality parameter ∆. In particular, for given α, α1 one can define the minimal value ∆ providing the prescribed error α1 of the second kind under the given level α by the requirement

IP(‖ξ + h‖² ≥ 2zα) ≥ 1 − α1 subject to IP(‖ξ‖² ≥ 2zα) ≤ α (7.10)

for all h with ‖h‖² ≥ ∆. The results from Section 4.6 indicate that the value zα can be bounded from above by p + √(2p log α^{−1}) for moderate values of α^{−1}. For evaluating the value ∆, the following decomposition is useful:

‖ξ + h‖² − ‖h‖² − p = ‖ξ‖² − p + 2h^⊤ξ.

The right hand-side of this equality is a sum of centered Gaussian quadratic and linear forms. In particular, the cross term 2h^⊤ξ is a centered Gaussian r.v. with variance 4‖h‖², while Var(‖ξ‖²) = 2p. These arguments suggest to take ∆ of order p to ensure the prescribed power 1 − α1.

Theorem 7.1.6. For each α, α1 ∈ (0, 1), there are absolute constants C and C1 such that (7.10) is fulfilled for ‖h‖² ≥ ∆ with

∆^{1/2} = √(Cp log α^{−1}) + √(C1 p log α1^{−1}).

The result of Theorem 7.1.6 reveals a problem with the power of the LR test when the dimensionality p of the parameter space grows. Indeed, the test remains insensitive for all alternatives in the zone σ^{−2}‖Π(f∗ − f0)‖² ≤ Cp, and this zone becomes larger and larger with p.

7.1.3 I.i.d. errors with unknown variance

This section briefly discusses the case when the errors εi are i.i.d. but the variance σ² = Var(εi) is unknown. A natural idea in this case is to estimate the variance from the data. The decomposition (7.8) and the independence of RSS = ‖Y − f̃‖² and ‖f̃ − f0‖² are particularly helpful. Theorem 7.1.4 suggests to estimate σ² from the RSS by

σ̂² = RSS/(n − p) = ‖Y − f̃‖²/(n − p).

Indeed, due to the result (7.9), σ^{−2} RSS ∼ χ²_{n−p}, yielding

IE σ̂² = σ²,  Var σ̂² = 2σ⁴/(n − p), (7.11)

and therefore σ̂² is an unbiased, root-n consistent estimate of σ².

Exercise 7.1.2. Check (7.11). Show that σ̂² − σ² → 0 in IP-probability.

Now we consider the LR test statistic (7.5) in which the true variance is replaced by its estimate σ̂²:

T̂ def= (1/(2σ̂²)) ‖f̃ − f0‖² = (n − p)‖f̃ − f0‖² / (2‖Y − f̃‖²) = (RSS0 − RSS) / (2 RSS/(n − p)).

The result of Theorem 7.1.4 implies the pivotal property for this test statistic as well.

Theorem 7.1.7. Consider the model (7.1) with ε ∼ N(0, σ²II_n) for an unknown value σ². Then the distribution of the test statistic T̂ under IP_{θ0} only depends on p and n − p:

(2/p) T̂ = ((n − p)/p) · (RSS0 − RSS)/RSS ∼ F_{p,n−p},

where F_{p,n−p} denotes the Fisher distribution with parameters p and n − p:

F_{p,n−p} = L( (‖ξp‖²/p) / (‖ξ_{n−p}‖²/(n − p)) ),

where ξp and ξ_{n−p} are two independent standard Gaussian vectors of dimension p and n − p; see Theorem 6.4.1. In particular, it does not depend on the design matrix Ψ, the noise variance σ², and the true parameter θ0.

This result suggests to fix the critical value for the test statistic T̂ using the quantiles of the Fisher distribution: if tα is such that F_{p,n−p}(tα) = 1 − α, then zα = p tα/2.

Theorem 7.1.8. Consider the model (7.1) with ε ∼ N(0, σ²II_n) for an unknown value σ². If F_{p,n−p}(tα) = 1 − α and zα = p tα/2, then the test φ̂ = 1I(T̂ ≥ zα) is a level-α test:

IP_{θ0}(φ̂ = 1) = IP_{θ0}(T̂ ≥ zα) = α.

Exercise 7.1.3. Prove the result of Theorem 7.1.8.
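The following sketch (simulated design and data, illustrative only) computes the F-form of the test statistic from Theorem 7.1.7 and applies the level-α rule of Theorem 7.1.8 for a simple null:

import numpy as np
from scipy.stats import f

rng = np.random.default_rng(9)
n, p, alpha = 100, 3, 0.05
Psi = rng.normal(size=(p, n))                    # p x n design matrix, as in (7.2)
theta_true = np.array([1.0, -0.5, 0.2])
Y = Psi.T @ theta_true + rng.normal(0, 1.0, n)

theta0 = np.zeros(p)                             # simple null H0: theta* = theta0
theta_hat = np.linalg.solve(Psi @ Psi.T, Psi @ Y)  # MLE for i.i.d. errors
f_hat = Psi.T @ theta_hat
RSS  = np.sum((Y - f_hat) ** 2)
RSS0 = np.sum((Y - Psi.T @ theta0) ** 2)

F_stat = (n - p) / p * (RSS0 - RSS) / RSS        # ~ F_{p, n-p} under H0
reject = F_stat > f.ppf(1 - alpha, p, n - p)
print(F_stat, reject)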

If the sample size n is sufficiently large, then σ̂² is very close to σ² and one can apply the approximate choice of the critical value zα from the case of known σ²:

φ̃ = 1I(T̂ ≥ zα).

This test is not of exact level α but it is of asymptotic level α. Its power function is also close to the power function of the test φ corresponding to the known variance σ².


Theorem 7.1.9. Consider the model (7.1) with ε ∼ N(0, σ²II_n) for an unknown value σ². Then

lim_{n→∞} IP_{θ0}(φ̃ = 1) = α. (7.12)

Moreover,

lim_{n→∞} sup_{f∗} |IP(φ̃ = 1) − IP(φ = 1)| = 0, (7.13)

where IP denotes the distribution of the data for the response f∗.

Exercise 7.1.4. Consider the model (7.1) with ε ∼ N(0, σ²II_n) for σ² unknown and prove (7.12) and (7.13).
Hints:

• The consistency of σ̂² permits to restrict to the case |σ̂²/σ² − 1| ≤ δn for some δn → 0.

• The independence of ‖f̃ − f0‖² and σ̂² permits to consider the distribution of 2T̂ = ‖f̃ − f0‖²/σ̂² as if σ̂² were a fixed number close to σ².

• Use that for ζp ∼ χ²_p,

IP(ζp ≥ zα(1 + δn)) − IP(ζp ≥ zα) → 0 as n → ∞.

7.2 Likelihood ratio test for a subvector

The previous section dealt with the case of a simple null hypothesis. This section considers a more general situation when the null hypothesis concerns a subvector of θ. This means that the whole model is given by (7.2) but the vector θ is decomposed into two parts: θ = (γ, η), where γ is of dimension p0 < p. The null hypothesis states that η = η0, with γ unrestricted. Usually η0 = 0, but the particular value of η0 is not important. To simplify the presentation we assume η0 = 0, leading to the subset Θ0 of Θ given by

Θ0 = {θ = (γ, 0)}.

Under the null hypothesis, the model is still linear:

Y = Ψγ^⊤γ + ε,

where Ψγ denotes the submatrix of Ψ composed of the rows of Ψ corresponding to the γ-components of θ.

Fix any point θ0 ∈ Θ0, e.g., θ0 = 0, and define the corresponding response f0 = Ψ^⊤θ0. The LR test statistic T can be written in the form

T = max_{θ∈Θ} L(θ, θ0) − max_{θ∈Θ0} L(θ, θ0). (7.14)

The results of both maximization problems are known:

max_{θ∈Θ} L(θ, θ0) = (1/(2σ²)) ‖f̃ − f0‖²,
max_{θ∈Θ0} L(θ, θ0) = (1/(2σ²)) ‖f̃0 − f0‖²,

where f̃ and f̃0 are the estimates of the response under the alternative (full model) and under the null, respectively. As in Theorem 7.1.2 we can establish the following geometric decomposition.

Theorem 7.2.1. Consider the model (7.1). Let T be the LR test statistic from (7.14) built under the assumptions f∗ = Ψ^⊤θ∗ and Var(ε) = σ²II_n with a known value σ². Then T is given by

T = (1/(2σ²)) ‖f̃ − f̃0‖².

Moreover, the following decompositions for the vector of observations Y and for the residuals ε̂0 = Y − f̃0 from the null hold:

Y − f̃0 = (f̃ − f̃0) + ε̂,
‖Y − f̃0‖² = ‖f̃ − f̃0‖² + ‖ε̂‖², (7.15)

where f̃ − f̃0 is the difference between the estimated responses under the alternative and under the null, and ε̂ = Y − f̃ is the vector of residuals from the alternative.

Proof. The proof is similar to the proof of Theorem 7.1.2. We use that f̃ = ΠY, where Π = Ψ^⊤(ΨΨ^⊤)^{−1}Ψ is the projector onto the space L spanned by the rows of Ψ. Similarly f̃0 = Π0 Y, where Π0 = Ψγ^⊤(ΨγΨγ^⊤)^{−1}Ψγ is the projector onto the subspace L0 spanned by the rows of Ψγ. This yields the decomposition f̃ − f0 = Π(Y − f0). Similarly,

f̃ − f̃0 = (Π − Π0)Y,  ε̂ = Y − f̃ = (II_n − Π)Y.

As Π − Π0 and II_n − Π are orthogonal projectors, it follows

‖Y − f̃0‖² = ‖(II_n − Π)Y + (Π − Π0)Y‖² = ‖(II_n − Π)Y‖² + ‖(Π − Π0)Y‖²

and the decomposition (7.15) follows.

The decomposition (7.15) can again be represented as RSS0 = RSS + 2σ²T, where RSS = ‖Y − f̃‖² is the sum of squared residuals from the alternative, while now RSS0 = ‖Y − f̃0‖² is the sum of squared residuals under the null model.

Now we show how a pivotal test of exact level α based on T can be constructed if the model assumptions are correct.

Theorem 7.2.2. Consider the model (7.1) with ε ∼ N(0, σ²II_n) for a known value σ², i.e., the εi are i.i.d. normal. Then f̃ − f̃0 and ε̂ are under IP_{θ0} zero-mean independent Gaussian vectors satisfying

2T = σ^{−2}‖f̃ − f̃0‖² ∼ χ²_{p−p0},  σ^{−2}‖ε̂‖² ∼ χ²_{n−p}. (7.16)

Let zα fulfill IP(ζ_{p−p0} ≥ zα) = α with ζ_{p−p0} ∼ χ²_{p−p0}. Then the test φ = 1I(T ≥ zα/2) is an LR test of exact level α.

Proof. The null assumption θ∗ ∈ Θ0 implies f∗ ∈ L0. This, together with Π0 f∗ = f∗, implies the following decomposition:

f̃ − f̃0 = (Π − Π0)ε,  ε̂ = ε − Πε = (II_n − Π)ε.

Next, Π − Π0 and II_n − Π are orthogonal projectors, implying orthogonal and thus uncorrelated vectors (Π − Π0)ε and (II_n − Π)ε. Under normality of ε, these vectors are also normal, and uncorrelatedness implies independence. The property (7.16) is similar to (7.9) and can easily be verified by making use of Theorem 7.1.3.

If the variance $\sigma^2$ of the noise is unknown, one can proceed exactly as in the case of a simple null: estimate the variance from the residuals, using their independence of the test statistic $T$. This leads to the estimate
$$\widetilde\sigma^2 = \frac{1}{n-p}\,\mathrm{RSS} = \frac{1}{n-p}\,\|Y - \widetilde f\|^2$$
and to the test statistic
$$\widetilde T = \frac{\mathrm{RSS}_0 - \mathrm{RSS}}{2\,\mathrm{RSS}/(n-p)} = \frac{(n-p)\,\|\widetilde f - \widetilde f_0\|^2}{2\,\|Y - \widetilde f\|^2}.$$

The property of pivotality is preserved here as well: properly scaled, the test statistic $\widetilde T$ has a Fisher distribution.

Theorem 7.2.3. Consider the model (7.1) with $\varepsilon \sim N(0, \sigma^2 I\!I_n)$ for an unknown value $\sigma^2$. Then $2\widetilde T/(p-p_0)$ has the Fisher distribution $F_{p-p_0,\,n-p}$ with parameters $p-p_0$ and $n-p$. If $t_\alpha$ is the $1-\alpha$ quantile of this distribution, then the test $\phi = 1\bigl(\widetilde T > (p-p_0)\,t_\alpha/2\bigr)$ is of exact level $\alpha$.

If the sample size is sufficiently large, one can proceed as if $\widetilde\sigma^2$ were the true variance, ignoring the error of variance estimation. This would lead to the critical value $z_\alpha$ from Theorem 7.2.2, and the corresponding test is of asymptotic level $\alpha$.

Exercise 7.2.1. Prove Theorem 7.2.3.
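The following short numerical sketch illustrates the F-test of Theorem 7.2.3 for a subvector via the residual sums of squares under the full and the restricted model. It is only an illustration under invented data; the design matrix psi, the dimensions and all variable names are assumptions, not taken from the book.

```python
# Sketch of the subvector F-test: F = (n-p)/(p-p0) * (RSS0 - RSS)/RSS = 2*T_tilde/(p-p0).
import numpy as np
from scipy.stats import f as fisher_f

def subvector_f_test(psi, y, p0, alpha=0.05):
    """Test H0: the last p - p0 components of theta vanish, in the model Y = Psi^T theta + eps."""
    p, n = psi.shape
    # Unrestricted fit (projection onto the row space of Psi)
    theta_full, *_ = np.linalg.lstsq(psi.T, y, rcond=None)
    rss = np.sum((y - psi.T @ theta_full) ** 2)
    # Restricted fit: only the first p0 rows of Psi (the gamma-part of theta)
    psi_gamma = psi[:p0, :]
    gamma_hat, *_ = np.linalg.lstsq(psi_gamma.T, y, rcond=None)
    rss0 = np.sum((y - psi_gamma.T @ gamma_hat) ** 2)
    f_stat = (n - p) / (p - p0) * (rss0 - rss) / rss
    crit = fisher_f.ppf(1 - alpha, p - p0, n - p)
    return f_stat, crit, f_stat > crit

# Simulated example with eta = 0, i.e. the null holds
rng = np.random.default_rng(0)
n, p, p0 = 100, 5, 3
psi = rng.normal(size=(p, n))
theta_star = np.array([1.0, -0.5, 2.0, 0.0, 0.0])
y = psi.T @ theta_star + rng.normal(size=n)
print(subvector_f_test(psi, y, p0))
```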


The study of the power of the test $T$ does not differ from the case of a simple hypothesis. One only needs to redefine $\Delta$ as
$$\Delta \stackrel{\mathrm{def}}{=} \sigma^{-2}\|(\Pi - \Pi_0)f^*\|^2.$$

7.3 Likelihood ratio test for a linear hypothesis

In this section, we generalize the test problem for the Gaussian linear model further and assume that we want to test the linear hypothesis $H_0: C\theta = d$ for a given contrast matrix $C \in I\!R^{r\times p}$ with $\mathrm{rank}(C) = r \le p$ and a right-hand side vector $d \in I\!R^r$. Notice that the point null hypothesis $H_0: \theta = \theta_0$ and the hypothesis $H_0: \eta = \eta_0$ considered in Section 7.2 are linear hypotheses. Here, we restrict our attention to the case that the error variance $\sigma^2$ is unknown, as is typically the case in practice.

Theorem 7.3.1. Assume model (7.1) with $\varepsilon \sim N(0, \sigma^2 I\!I_n)$ for an unknown value $\sigma^2$ and consider the linear hypothesis $H_0: C\theta = d$. Then it holds:

(a) The likelihood ratio statistic $T$ is an isotone transformation of $F \stackrel{\mathrm{def}}{=} \frac{n-p}{r}\,\frac{\Delta\mathrm{RSS}}{\mathrm{RSS}}$, where $\Delta\mathrm{RSS} \stackrel{\mathrm{def}}{=} \mathrm{RSS}_0 - \mathrm{RSS}$.
(b) The restricted MLE $\widetilde\theta_0$ is given by
$$\widetilde\theta_0 = \widetilde\theta - (\Psi\Psi^\top)^{-1}C^\top\bigl\{C(\Psi\Psi^\top)^{-1}C^\top\bigr\}^{-1}(C\widetilde\theta - d).$$
(c) $\Delta\mathrm{RSS}$ and $\mathrm{RSS}$ are independent.
(d) Under $H_0$, $\Delta\mathrm{RSS}/\sigma^2 \sim \chi^2_r$ and $F \sim F_{r,\,n-p}$.

Proof. For proving (a), verify that $2T = n\log(\widetilde\sigma_0^2/\widetilde\sigma^2)$, where $\widetilde\sigma^2$ and $\widetilde\sigma_0^2$ are the unrestricted and restricted (i.e., under $H_0$) MLEs of the error variance $\sigma^2$, respectively. Plugging in the explicit representations for $\widetilde\sigma^2$ and $\widetilde\sigma_0^2$ yields that $2T = n\log\bigl(\frac{\Delta\mathrm{RSS}}{\mathrm{RSS}} + 1\bigr)$, implying the assertion. Part (b) is an application of the following well-known result from quadratic optimization theory.

Lemma 7.3.1. Let $A \in I\!R^{r\times p}$ and $b \in I\!R^r$ be fixed and define $M = \{z \in I\!R^p : Az = b\}$. Moreover, let $f : I\!R^p \to I\!R$ be given by $f(z) = z^\top Q z/2 - c^\top z$ for a symmetric, positive semi-definite $(p\times p)$-matrix $Q$ and a vector $c \in I\!R^p$. Then the unique minimum of $f$ over the search space $M$ is characterized by solving the system of linear equations
$$Qz - A^\top y = c, \qquad Az = b$$
for $(y, z)$. The component $z$ of this solution minimizes $f$ over $M$.


For part (c), we utilize the explicit form of $\widetilde\theta_0$ from part (b) and write $\Delta\mathrm{RSS}$ in the form
$$\Delta\mathrm{RSS} = (C\widetilde\theta - d)^\top\bigl\{C(\Psi\Psi^\top)^{-1}C^\top\bigr\}^{-1}(C\widetilde\theta - d).$$
This shows that $\Delta\mathrm{RSS}$ is a deterministic transformation of $\widetilde\theta$. Since $\widetilde\theta$ and $\mathrm{RSS}$ are independent, the assertion of (c) follows. This representation of $\Delta\mathrm{RSS}$ moreover shows that $\Delta\mathrm{RSS}/\sigma^2$ is a quadratic form of a (under $H_0$) standard normal random vector, and part (d) follows from Theorem 7.1.3.

To sum up, Theorem 7.3.1 shows that general linear hypotheses can be tested with $F$-tests under model (7.1) with $\varepsilon \sim N(0, \sigma^2 I\!I_n)$ for an unknown value $\sigma^2$.
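A minimal sketch of this F-test, including the restricted MLE of Theorem 7.3.1(b), is given below. The design matrix, the contrast matrix C, the right-hand side d and all names are illustrative assumptions.

```python
# Sketch of the F-test for H0: C theta = d in the model Y = Psi^T theta + eps.
import numpy as np
from scipy.stats import f as fisher_f

def linear_hypothesis_f_test(psi, y, C, d, alpha=0.05):
    p, n = psi.shape
    r = C.shape[0]
    G = np.linalg.inv(psi @ psi.T)          # (Psi Psi^T)^{-1}
    theta = G @ psi @ y                     # unrestricted MLE
    rss = np.sum((y - psi.T @ theta) ** 2)
    # Restricted MLE, Theorem 7.3.1(b)
    theta0 = theta - G @ C.T @ np.linalg.solve(C @ G @ C.T, C @ theta - d)
    rss0 = np.sum((y - psi.T @ theta0) ** 2)
    f_stat = (n - p) / r * (rss0 - rss) / rss
    return f_stat, f_stat > fisher_f.ppf(1 - alpha, r, n - p)
```

One can check numerically that $\mathrm{RSS}_0 - \mathrm{RSS}$ coincides with the quadratic form $(C\widetilde\theta - d)^\top\{C(\Psi\Psi^\top)^{-1}C^\top\}^{-1}(C\widetilde\theta - d)$ used in the proof of part (c).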

7.4 Wald test

The drawback of the LR test method for testing a linear hypothesis $H_0: C\theta = d$ without assuming Gaussian noise is that a constrained maximization of the log-likelihood function under the constraints encoded by $C$ and $d$ has to be performed. This computationally intensive step can be avoided by using Wald statistics. The Wald statistic for testing $H_0$ is given by
$$W = (C\widetilde\theta - d)^\top\bigl(C\widetilde V C^\top\bigr)^{-1}(C\widetilde\theta - d),$$
where $\widetilde\theta$ and $\widetilde V$ denote the MLE in the full (unrestricted) model and the estimated covariance matrix of $\widetilde\theta$, respectively.

Theorem 7.4.1. Under model (7.2), the statistic $W$ is asymptotically equivalent to the LR test statistic $2T$. In particular, under $H_0$, the distribution of $W$ converges to $\chi^2_r$: $W \xrightarrow{w} \chi^2_r$ as $n \to \infty$. In the case of normally distributed noise, it holds that $W = rF$, where $F$ is as in Theorem 7.3.1.

Proof. For proving the asymptotic $\chi^2_r$-distribution of $W$, we use the asymptotic normality of the MLE $\widetilde\theta$. If the model is regular, it holds that
$$\sqrt{n}\,(\widetilde\theta_n - \theta_0) \xrightarrow{w} N\bigl(0, \mathcal F(\theta_0)^{-1}\bigr) \quad \text{under } \theta_0 \text{ for } n \to \infty,$$
and, consequently,
$$\mathcal F(\theta_0)^{1/2}\, n^{1/2}(\widetilde\theta_n - \theta_0) \xrightarrow{w} N(0, I\!I_r), \qquad \text{where } r \stackrel{\mathrm{def}}{=} \dim(\theta_0).$$
Applying the Continuous Mapping Theorem, we get
$$(\widetilde\theta_n - \theta_0)^\top\, n\,\mathcal F(\theta_0)\,(\widetilde\theta_n - \theta_0) \xrightarrow{w} \chi^2_r.$$


If the Fisher information is continuous and $\mathcal F(\widetilde\theta_n)$ is a consistent estimator for $\mathcal F(\theta_0)$, it still holds that
$$(\widetilde\theta_n - \theta_0)^\top\, n\,\mathcal F(\widetilde\theta_n)\,(\widetilde\theta_n - \theta_0) \xrightarrow{w} \chi^2_r.$$
Substituting $\widetilde\theta_n = C\widetilde\theta - d$ and $\theta_0 = 0 \in I\!R^r$, we obtain the assertion concerning the asymptotic distribution of $W$ under $H_0$. For proving the relationship between $W$ and $F$ in the case of Gaussian noise, we notice that

$$F = \frac{n-p}{r}\,\frac{\Delta\mathrm{RSS}}{\mathrm{RSS}} = \frac{n-p}{r}\,\frac{(C\widetilde\theta - d)^\top\bigl\{C(\Psi\Psi^\top)^{-1}C^\top\bigr\}^{-1}(C\widetilde\theta - d)}{(n-p)\,\widetilde\sigma^2} = \frac{(C\widetilde\theta - d)^\top(C\widetilde V C^\top)^{-1}(C\widetilde\theta - d)}{r} = \frac{W}{r}$$

as required.
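The following small sketch compares the Wald statistic $W$ with $rF$ in a simulated Gaussian linear model, as stated in Theorem 7.4.1. Data, contrast matrix and variable names are invented for illustration only.

```python
# Compare the Wald statistic W with r*F in a Gaussian linear model.
import numpy as np

rng = np.random.default_rng(1)
p, n, r = 4, 60, 2
psi = rng.normal(size=(p, n))
theta_star = np.array([0.5, 0.0, 0.0, 1.0])
y = psi.T @ theta_star + rng.normal(size=n)

C = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])        # H0: theta_2 = theta_3 = 0
d = np.zeros(r)

G = np.linalg.inv(psi @ psi.T)
theta = G @ psi @ y                          # unrestricted MLE
rss = np.sum((y - psi.T @ theta) ** 2)
sigma2 = rss / (n - p)                       # noise variance estimate
V = sigma2 * G                               # estimated covariance of theta

resid = C @ theta - d
W = resid @ np.linalg.solve(C @ V @ C.T, resid)   # Wald statistic
print(W, W / r)                                   # W/r equals the F-statistic here
```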

7.5 Analysis of variance

Important special cases of linear models arise when all entries of the design matrix $\Psi$ are binary, i.e., $\Psi_{ij} \in \{0,1\}$ for all $1 \le i \le p$, $1 \le j \le n$. Every row of $\Psi$ then has the interpretation of a group indicator, where $\Psi_{ij} = 1$ if and only if observational unit $j$ belongs to group $i$. The target of such an analysis of variance is to determine whether the mean response differs between groups. In this section, we specialize the theory of testing in linear models to this situation.

7.5.1 Two-sample test

First, we consider the case of $p = 2$, meaning that exactly two samples corresponding to the two groups $A$ and $B$ are given. We let $n_A$ denote the number of observations in group $A$ and $n_B = n - n_A$ the number of observations in group $B$. The parameter of interest $\theta \in I\!R^2$ in this model consists of the two population means $\eta_A$ and $\eta_B$, and the model reads
$$Y_{ij} = \eta_i + \varepsilon_{ij}, \qquad i \in \{A, B\},\ j = 1, \ldots, n_i.$$
Noticing that $(\Psi\Psi^\top)^{-1} = \begin{pmatrix} 1/n_A & 0 \\ 0 & 1/n_B \end{pmatrix}$, we get that $\widetilde\theta = (\bar Y_A, \bar Y_B)^\top$, where $\bar Y_A$ and $\bar Y_B$ denote the two sample means. With this, we immediately obtain that the estimator of the noise variance in this model is given by


$$\widetilde\sigma^2 = \frac{\mathrm{RSS}}{n-p} = \frac{1}{n_A + n_B - 2}\left(\sum_{j=1}^{n_A}(Y_{A,j} - \bar Y_A)^2 + \sum_{j=1}^{n_B}(Y_{B,j} - \bar Y_B)^2\right).$$

The quantity $s^2 \stackrel{\mathrm{def}}{=} \widetilde\sigma^2$ is called the pooled sample variance. Denoting the group-specific sample variances by
$$s_A^2 = \frac{1}{n_A}\sum_{j=1}^{n_A}(Y_{A,j} - \bar Y_A)^2, \qquad s_B^2 = \frac{1}{n_B}\sum_{j=1}^{n_B}(Y_{B,j} - \bar Y_B)^2,$$
we have that
$$s^2 = (n_A s_A^2 + n_B s_B^2)/(n_A + n_B - 2)$$
is a weighted average of $s_A^2$ and $s_B^2$.

Under this model, we are interested in testing the null hypothesis $H_0: \eta_A = \eta_B$ of equal group means against the two-sided alternative $H_1: \eta_A \ne \eta_B$ or the one-sided alternative $H_1^+: \eta_A > \eta_B$. The one-sided alternative hypothesis $H_1^-: \eta_A < \eta_B$ can be treated by switching the group labels.

The null hypothesis $H_0: \eta_A - \eta_B = 0$ is a linear hypothesis in the sense of Section 7.3, where the number of restrictions is $r = 1$. Under $H_0$, the grand average
$$\bar Y \stackrel{\mathrm{def}}{=} n^{-1}\sum_{i\in\{A,B\}}\sum_{j=1}^{n_i} Y_{ij}$$
is the MLE of the (common) population mean. Straightforwardly, we calculate that the sum of squared residuals under $H_0$ is given by
$$\mathrm{RSS}_0 = \sum_{i\in\{A,B\}}\sum_{j=1}^{n_i}(Y_{ij} - \bar Y)^2.$$

For computing $\Delta\mathrm{RSS} = \mathrm{RSS}_0 - \mathrm{RSS}$ and the statistic $F$ from Theorem 7.3.1.(a), we obtain the following results.

Lemma 7.5.1.
$$\mathrm{RSS}_0 = \mathrm{RSS} + \bigl(n_A(\bar Y_A - \bar Y)^2 + n_B(\bar Y_B - \bar Y)^2\bigr), \qquad (7.17)$$
$$F = \frac{n_A n_B}{n_A + n_B}\,\frac{(\bar Y_A - \bar Y_B)^2}{s^2}. \qquad (7.18)$$

Exercise 7.5.1. Prove Lemma 7.5.1.


We conclude from Theorem 7.3.1 that, if the individual errors are homogeneous between samples, the test statistic
$$t = \frac{\bar Y_A - \bar Y_B}{s\sqrt{1/n_A + 1/n_B}}$$
is $t$-distributed with $n_A + n_B - 2$ degrees of freedom under the null hypothesis $H_0$.

For a given significance level $\alpha$, define the quantile $q_\alpha$ of Student's $t$-distribution with $n_A + n_B - 2$ degrees of freedom by the equation
$$I\!P(t_0 > q_\alpha) = \alpha,$$
where $t_0$ represents the distribution of $t$ in the case that $H_0$ is true. Utilizing the symmetry property
$$I\!P(|t_0| > q_{\alpha/2}) = 2\,I\!P(t_0 > q_{\alpha/2}) = \alpha,$$
the one-sided two-sample $t$-test for $H_0$ versus $H_1^+$ rejects if $t > q_\alpha$, and the two-sided two-sample $t$-test for $H_0$ versus $H_1$ rejects if $|t| > q_{\alpha/2}$.
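A minimal numerical sketch of the pooled two-sample t-test follows; the data arrays are invented for illustration. SciPy's ttest_ind with equal_var=True computes the same statistic and can serve as a cross-check.

```python
# Pooled two-sample t-test as derived above (two-sided version at alpha = 0.05).
import numpy as np
from scipy.stats import t as student_t, ttest_ind

y_a = np.array([5.1, 4.8, 5.6, 5.0, 5.3])
y_b = np.array([4.6, 4.9, 4.4, 4.7])
n_a, n_b = len(y_a), len(y_b)

# Pooled sample variance s^2 and the t statistic
s2 = (np.sum((y_a - y_a.mean())**2) + np.sum((y_b - y_b.mean())**2)) / (n_a + n_b - 2)
t_stat = (y_a.mean() - y_b.mean()) / np.sqrt(s2 * (1.0/n_a + 1.0/n_b))
q = student_t.ppf(0.975, n_a + n_b - 2)

print(t_stat, abs(t_stat) > q)
print(ttest_ind(y_a, y_b, equal_var=True))   # cross-check
```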

7.5.2 Comparing K treatment means

This section deals with the more general one-factorial analysis of variance in the presence of $K \ge 2$ groups. We assume that $K$ samples are given, each of size $n_k$ for $k = 1, \ldots, K$. The following quantities are relevant for testing the null hypothesis of equal group (treatment) means.

Sample mean in treatment group $k$:
$$\bar Y_k = \frac{1}{n_k}\sum_i Y_{ki}.$$
Sum of squares (SS) within treatment group $k$:
$$S_k = \sum_i (Y_{ki} - \bar Y_k)^2.$$
Sample variance within treatment group $k$:
$$s_k^2 = S_k/(n_k - 1).$$

Denoting the pooled sample size by $N = n_1 + \ldots + n_K$, we furthermore define the following pooled measures.

Pooled (grand) mean:
$$\bar Y = \frac{1}{N}\sum_k\sum_i Y_{ki}.$$
Within-treatment sum of squares:
$$S_R = S_1 + \ldots + S_K.$$
Between-treatment sum of squares:
$$S_T = \sum_k n_k(\bar Y_k - \bar Y)^2.$$
Within- and between-treatment mean squares:
$$s_R^2 = \frac{S_R}{N-K}, \qquad s_T^2 = \frac{S_T}{K-1}.$$

In analogy to Lemma 7.5.1, the following result regarding the decomposition of spread holds.

Lemma 7.5.2. Let the overall variation in the data be defined by
$$S_D \stackrel{\mathrm{def}}{=} \sum_k\sum_i (Y_{ki} - \bar Y)^2.$$
Then it holds that
$$S_D = \sum_k\sum_i (Y_{ki} - \bar Y_k)^2 + \sum_k n_k(\bar Y_k - \bar Y)^2 = S_R + S_T.$$

Exercise 7.5.2. Prove Lemma 7.5.2.

The variance components in a one-factorial analysis of variance can be summarized in an analysis of variance table, cf. Table 7.1.

source of variation   | sum of squares                               | degrees of freedom | mean square
average               | $S_A = N\bar Y^2$                            | $\nu_A = 1$        | $s_A^2 = S_A/\nu_A$
between treatments    | $S_T = \sum_k n_k(\bar Y_k - \bar Y)^2$      | $\nu_T = K - 1$    | $s_T^2 = S_T/\nu_T$
within treatments     | $S_R = \sum_k\sum_i (Y_{ki} - \bar Y_k)^2$   | $\nu_R = N - K$    | $s_R^2 = S_R/\nu_R$
total                 | $S_D = \sum_k\sum_i (Y_{ki} - \bar Y)^2$     | $N - 1$            |

Table 7.1. Variance components in a one-factorial analysis of variance


For testing the global hypothesis $H_0: \eta_1 = \eta_2 = \ldots = \eta_K$ against the (two-sided) alternative $H_1: \exists (i,j)$ with $\eta_i \ne \eta_j$, we again apply Theorem 7.3.1, leading to the following result.

Corollary 7.5.1. Under the model of the analysis of variance with $K$ groups, assume homogeneous Gaussian noise with noise variance $\sigma^2 > 0$. Then it holds:

(i) $S_R/\sigma^2 \sim \chi^2_{N-K}$.
(ii) Under $H_0$, $S_T/\sigma^2 \sim \chi^2_{K-1}$.
(iii) $S_R$ and $S_T$ are stochastically independent.
(iv) Under $H_0$, the statistic
$$F \stackrel{\mathrm{def}}{=} \frac{S_T/(K-1)}{S_R/(N-K)} \sim F_{K-1,\,N-K}.$$

Therefore, $H_0$ is rejected by a level $\alpha$ $F$-test if the value of $F$ exceeds the quantile $F_{K-1,\,N-K;\,1-\alpha}$ of Fisher's $F$-distribution with $K-1$ and $N-K$ degrees of freedom.
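A short sketch of this one-factorial ANOVA F-test follows; the three toy groups are invented. scipy.stats.f_oneway computes the same F statistic and can be used as a cross-check.

```python
# One-way ANOVA F-test of Corollary 7.5.1.
import numpy as np
from scipy.stats import f as fisher_f, f_oneway

groups = [np.array([6.1, 5.8, 6.4]),
          np.array([5.2, 5.5, 5.1, 5.4]),
          np.array([6.8, 7.1, 6.5])]
K = len(groups)
N = sum(len(g) for g in groups)
grand = np.concatenate(groups).mean()

S_T = sum(len(g) * (g.mean() - grand)**2 for g in groups)   # between treatments
S_R = sum(np.sum((g - g.mean())**2) for g in groups)        # within treatments
F = (S_T / (K - 1)) / (S_R / (N - K))

print(F, F > fisher_f.ppf(0.95, K - 1, N - K))
print(f_oneway(*groups))                                    # cross-check
```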

Treatment effects.

For estimating the amount of shift in the mean response caused by treatment $k$ (the $k$-th treatment effect), it is convenient to re-parametrize the model as follows. We start with the basic model, given by

Yki = ηk + εki, εki ∼ N(0, σ2) i.i.d.,

where $\eta_k$ is the true treatment mean.

Now, we introduce the averaged treatment mean $\eta$ and the $k$-th treatment effect $\tau_k$, given by
$$\eta \stackrel{\mathrm{def}}{=} \frac{1}{N}\sum_k n_k\eta_k, \qquad \tau_k \stackrel{\mathrm{def}}{=} \eta_k - \eta.$$

The re-parametrized model is then given by

Yki = η + τk + εki. (7.19)

It is important to notice that this new model representation involves $K+1$ unknowns. In order to achieve maximum rank of the design matrix, one therefore has to take the constraint $\sum_{k=1}^K n_k\tau_k = 0$ into account when building $\Psi$, for instance by coding
$$\tau_K = -n_K^{-1}\sum_{k=1}^{K-1} n_k\tau_k.$$


For the data analysis, it is helpful to consider the decomposition of the observed data points:
$$Y_{ki} = \bar Y + (\bar Y_k - \bar Y) + (Y_{ki} - \bar Y_k).$$
Routine algebra then leads to the following results concerning inference in the treatment effect representation of the analysis of variance model.

Theorem 7.5.1. Under Model (7.19) with $\sum_{k=1}^K n_k\tau_k = 0$, it holds:

(i) The MLEs of the unknown model parameters are given by
$$\widetilde\eta = \bar Y, \qquad \widetilde\tau_k = \bar Y_k - \bar Y, \quad 1 \le k \le K.$$
(ii) The $F$-statistic for testing the global hypothesis $H_0: \tau_1 = \tau_2 = \ldots = \tau_{K-1} = 0$ is identical to the one given in Corollary 7.5.1.(iv), i.e., $F = \frac{S_T/(K-1)}{S_R/(N-K)}$.

7.5.3 Randomized blocks

This section deals with a special case of the two-factorial analysis of variance. We assume that the observational units are grouped into $n$ blocks, where in each block the $K$ treatments under investigation are applied exactly once. The total sample size is then given by $N = nK$. Since there may be a "block effect" on the mean response, we consider the model

Yki = η + βi + τk + εki, 1 ≤ k ≤ K, 1 ≤ i ≤ n, (7.20)

where $\eta$ is the general mean, $\beta_i$ the block effect and $\tau_k$ the treatment effect.

The data can be decomposed as
$$Y_{ki} = \bar Y + (\bar Y_i - \bar Y) + (\bar Y_k - \bar Y) + (Y_{ki} - \bar Y_i - \bar Y_k + \bar Y),$$
where $\bar Y$ is the grand average, $\bar Y_i$ the block average, $\bar Y_k$ the treatment average, and $Y_{ki} - \bar Y_i - \bar Y_k + \bar Y$ the residual.

Applying the decomposition of spread to the model given by (7.20), we arrive at Table 7.2. Based on this, the following corollary summarizes the inference techniques under the model defined by (7.20).

Corollary 7.5.2. Under the model defined by (7.20), it holds:

(i) The MLEs of the unknown model parameters are given by
$$\widetilde\eta = \bar Y, \qquad \widetilde\beta_i = \bar Y_i - \bar Y, \qquad \widetilde\tau_k = \bar Y_k - \bar Y.$$


source of variation          | sum of squares                                                  | degrees of freedom
average (correction factor)  | $S = nK\bar Y^2$                                                | $1$
between blocks               | $S_B = K\sum_i(\bar Y_i - \bar Y)^2$                            | $n - 1$
between treatments           | $S_T = n\sum_k(\bar Y_k - \bar Y)^2$                            | $K - 1$
residuals                    | $S_R = \sum_k\sum_i(Y_{ki} - \bar Y_i - \bar Y_k + \bar Y)^2$   | $(n-1)(K-1)$
total                        | $S_D = \sum_k\sum_i(Y_{ki} - \bar Y)^2$                         | $N - 1 = nK - 1$

Table 7.2. Analysis of variance table for a randomized block design

(ii) The $F$-statistic for testing the hypothesis $H_B$ of no block effects is given by
$$F_B = \frac{S_B/(n-1)}{S_R/[(n-1)(K-1)]}.$$
Under $H_B$, the distribution of $F_B$ is Fisher's $F$-distribution with $(n-1)$ and $(n-1)(K-1)$ degrees of freedom.

(iii) The $F$-statistic for testing the null hypothesis $H_T$ of no treatment effects is given by
$$F_T = \frac{S_T/(K-1)}{S_R/[(n-1)(K-1)]}.$$
Under $H_T$, the distribution of $F_T$ is Fisher's $F$-distribution with $(K-1)$ and $(n-1)(K-1)$ degrees of freedom.
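A sketch of the randomized-block decomposition of Table 7.2 and the two F-tests of Corollary 7.5.2 is given below. The $K \times n$ data matrix is invented; rows correspond to treatments and columns to blocks.

```python
# Randomized block design: block and treatment F-tests.
import numpy as np
from scipy.stats import f as fisher_f

Y = np.array([[8.2, 7.9, 8.5, 8.1],     # treatment 1 in blocks 1..n
              [7.1, 6.8, 7.4, 7.0],     # treatment 2
              [9.0, 8.7, 9.3, 8.9]])    # treatment 3
K, n = Y.shape
grand = Y.mean()
block_means = Y.mean(axis=0)            # \bar Y_i
treat_means = Y.mean(axis=1)            # \bar Y_k

S_B = K * np.sum((block_means - grand)**2)
S_T = n * np.sum((treat_means - grand)**2)
resid = Y - block_means[None, :] - treat_means[:, None] + grand
S_R = np.sum(resid**2)

F_B = (S_B / (n - 1)) / (S_R / ((n - 1) * (K - 1)))
F_T = (S_T / (K - 1)) / (S_R / ((n - 1) * (K - 1)))
print(F_B > fisher_f.ppf(0.95, n - 1, (n - 1) * (K - 1)),
      F_T > fisher_f.ppf(0.95, K - 1, (n - 1) * (K - 1)))
```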

7.6 Historical remarks and further reading

Hotelling (1931) derived a deterministic transformation of Fisher's $F$-distribution and demonstrated its usage in the context of testing for differences among several Gaussian means with a likelihood ratio test. The general idea of the Wald test goes back to Wald (1943).

A classical textbook on the analysis of variance is that of Scheffé (1959). The general theory of testing linear hypotheses in linear models is described, e.g., in the textbook by Searle (1971).


8

Some other testing methods

This chapter discusses some nonparametric testing methods. First, we treat classical testing procedures such as the Kolmogorov-Smirnov and the Cramer-Smirnov-von Mises test as particular cases of the substitution approach. Then, we are concerned with Bayesian approaches towards hypothesis testing. Finally, Section 8.4 deals with locally best tests. It is demonstrated that the score function is the natural equivalent of the LR statistic if no uniformly best tests exist but locally best tests are aimed at, assuming that the model is differentiable in the mean. Conditioning on the ranks of the observations leads to the theory of rank tests. Due to the close connection of rank tests and permutation tests (the null distribution of a rank test is a permutation distribution), we end the chapter with some general remarks on permutation tests.

Let $Y = (Y_1, \ldots, Y_n)^\top$ be an i.i.d. sample from a distribution $P$. The joint distribution $I\!P$ of $Y$ is the $n$-fold product of $P$, so a hypothesis about $I\!P$ can be formulated as a hypothesis about the marginal measure $P$. A simple hypothesis $H_0$ means the assumption that $P = P_0$ for a given measure $P_0$. The empirical measure $P_n$ is a natural empirical counterpart of $P$, leading to the idea of testing the hypothesis by checking whether $P_n$ significantly deviates from $P_0$. As in the estimation problem, this substitution idea can be realized in several different ways. We briefly discuss below the method of moments and the minimum distance method.

8.1 Method of moments for an i.i.d. sample

Let $g(\cdot)$ be any $d$-vector function on $I\!R^1$. The assumption $P = P_0$ leads to the population moment

m0 = IE0g(Y1).

The empirical counterpart of this quantity is given by
$$M_n = I\!E_n\, g(Y) = \frac{1}{n}\sum_i g(Y_i).$$

The method of moments (MOM) suggests considering the difference $M_n - m_0$ for building a reasonable test. The properties of $M_n$ were stated in Section 2.4. In particular, under the null $P = P_0$, the first two moments of the vector $M_n - m_0$ can be easily computed: $I\!E_0(M_n - m_0) = 0$ and
$$\mathrm{Var}_0(M_n) = I\!E_0\bigl[(M_n - m_0)(M_n - m_0)^\top\bigr] = n^{-1}V, \qquad V \stackrel{\mathrm{def}}{=} I\!E_0\bigl[\bigl(g(Y) - m_0\bigr)\bigl(g(Y) - m_0\bigr)^\top\bigr].$$
For simplicity of presentation we assume that the moment function $g$ is selected to ensure a non-degenerate matrix $V$. Standardization by the covariance matrix leads to the vector
$$\xi_n = n^{1/2}V^{-1/2}(M_n - m_0),$$
which has, under the null measure, zero mean and a unit covariance matrix. Moreover, $\xi_n$ is, under the null hypothesis, asymptotically standard normal, i.e., its distribution is approximately standard normal if the sample size $n$ is sufficiently large; see Theorem 2.4.4. The MM test rejects the null hypothesis if the vector $\xi_n$ computed from the available data $Y$ is very unlikely for a standard normal vector, that is, if it deviates significantly from zero. We specify the procedure separately for the univariate and multivariate cases.

Univariate case.

Let $g(\cdot)$ be a univariate function with $I\!E_0\, g(Y) = m_0$ and $I\!E_0\bigl[g(Y) - m_0\bigr]^2 = \sigma^2$. Define the linear test statistic
$$T_n = \frac{1}{\sqrt{n\sigma^2}}\sum_i\bigl[g(Y_i) - m_0\bigr] = n^{1/2}\sigma^{-1}(M_n - m_0),$$
leading to the test
$$\phi = 1\bigl(|T_n| > z_{\alpha/2}\bigr), \qquad (8.1)$$

where zα denotes the upper α -quantile of the standard normal law.

Theorem 8.1.1. Let $Y$ be an i.i.d. sample from $P$. Then, under the null $P = P_0$, the test statistic $T_n$ is asymptotically standard normal and the test $\phi$ from (8.1) for $H_0: P = P_0$ is of asymptotic level $\alpha$, that is,
$$I\!P_0\bigl(\phi = 1\bigr) \to \alpha, \qquad n \to \infty.$$


Similarly, one can consider a one-sided alternative $H_1^+: m > m_0$ or $H_1^-: m < m_0$ about the moment $m = I\!E\, g(Y)$ of the distribution $P$ and the corresponding one-sided tests
$$\phi^+ = 1(T_n > z_\alpha), \qquad \phi^- = 1(T_n < -z_\alpha).$$

As in Theorem 8.1.1, both tests φ+ and φ− are of asymptotic level α .
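The following minimal sketch illustrates the univariate MM test. The choice of the direction $g(y) = y^2$ and the null $P_0 = N(0,1)$ are only assumptions for the example; then $m_0 = 1$ and $\sigma^2 = \mathrm{Var}(Y^2) = 2$.

```python
# Univariate method-of-moments test (8.1) with a chosen direction g.
import numpy as np
from scipy.stats import norm

def mm_test(y, g, m0, sigma2, alpha=0.05):
    n = len(y)
    t_n = np.sqrt(n / sigma2) * (np.mean(g(y)) - m0)   # T_n = n^{1/2} sigma^{-1} (M_n - m_0)
    return t_n, abs(t_n) > norm.ppf(1 - alpha / 2)

rng = np.random.default_rng(2)
y = rng.normal(size=200)                 # data generated under the null here
print(mm_test(y, lambda x: x**2, m0=1.0, sigma2=2.0))
```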

Multivariate case.

The components of the vector function $g(\cdot) \in I\!R^d$ are usually associated with "directions" in which the null hypothesis is tested. The multivariate situation means that we test simultaneously in $d > 1$ directions. The most natural test statistic is the squared Euclidean norm of the standardized vector $\xi_n$:
$$T_n \stackrel{\mathrm{def}}{=} \|\xi_n\|^2 = n\,\|V^{-1/2}(M_n - m_0)\|^2. \qquad (8.2)$$

By Theorem 2.4.4 the vector $\xi_n$ is asymptotically standard normal, so that $T_n$ is asymptotically chi-squared with $d$ degrees of freedom. This yields the natural definition of the test $\phi$ using quantiles of $\chi^2_d$, i.e.,
$$\phi = 1\bigl(T_n > z_\alpha\bigr), \qquad (8.3)$$
with $z_\alpha$ denoting the upper $\alpha$-quantile of the $\chi^2_d$ distribution.

Theorem 8.1.2. Let $Y$ be an i.i.d. sample from $P$. If $z_\alpha$ fulfills $I\!P(\chi^2_d > z_\alpha) = \alpha$, then, under the null $P = P_0$, the test statistic $T_n$ from (8.2) is asymptotically $\chi^2_d$-distributed and the test $\phi$ from (8.3) for $H_0: P = P_0$ is of asymptotic level $\alpha$.

8.1.1 Series expansion

A standard method of building moment tests or, alternatively, of choosing the directions $g(\cdot)$ is based on some series expansion. Let $\psi_1, \psi_2, \ldots$ be a given set of basis functions in the related functional space. It is especially useful to select these basis functions to be orthonormal under the measure $P_0$:
$$\int \psi_j(y)\,P_0(dy) = 0, \qquad \int \psi_j(y)\psi_{j'}(y)\,P_0(dy) = \delta_{j,j'}, \quad \forall j, j'. \qquad (8.4)$$

Select a fixed index $d$ and take the first $d$ basis functions $\psi_1, \ldots, \psi_d$ as "directions" or components of $g$. Then
$$m_{j,0} \stackrel{\mathrm{def}}{=} \int \psi_j(y)\,P_0(dy) = 0$$
is the $j$-th population moment under the null hypothesis $H_0$, and it is tested by checking whether the empirical moments $M_{j,n}$ with
$$M_{j,n} \stackrel{\mathrm{def}}{=} \frac{1}{n}\sum_i \psi_j(Y_i)$$

do not deviate significantly from zero. The condition (8.4) effectively permits testing each direction $\psi_j$ independently of the others.

For each $d$ one obtains a test statistic $T_{n,d}$ with
$$T_{n,d} \stackrel{\mathrm{def}}{=} n\bigl(M_{1,n}^2 + \ldots + M_{d,n}^2\bigr),$$
leading to the test
$$\phi_d = 1\bigl(T_{n,d} > z_{\alpha,d}\bigr),$$
where $z_{\alpha,d}$ is the upper $\alpha$-quantile of $\chi^2_d$. In practical applications the choice of $d$ is particularly relevant and is the subject of various studies.
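A small sketch of such a series-expansion test follows, here for the particular null $P_0 = \mathrm{UNI}[0,1]$ and the cosine basis $\psi_j(y) = \sqrt{2}\cos(\pi j y)$, which is centered and orthonormal under $P_0$ in the sense of (8.4). The choice of $d$ and of the basis is an illustrative assumption.

```python
# Series-expansion moment test for H0: P0 = UNI[0, 1].
import numpy as np
from scipy.stats import chi2

def series_mm_test(y, d=3, alpha=0.05):
    n = len(y)
    M = np.array([np.mean(np.sqrt(2.0) * np.cos(np.pi * j * y))
                  for j in range(1, d + 1)])            # empirical moments M_{j,n}
    t_nd = n * np.sum(M**2)                             # T_{n,d}
    return t_nd, t_nd > chi2.ppf(1 - alpha, d)

rng = np.random.default_rng(3)
print(series_mm_test(rng.uniform(size=300)))            # null holds
print(series_mm_test(rng.beta(2.0, 1.0, size=300)))     # null violated
```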

8.1.2 Testing a parametric hypothesis

The method of moments can be extended to the situation where the null hypothesis is parametric: $H_0: P \in (P_\theta, \theta \in \Theta_0)$. It is natural to apply the method of moments both to estimate the parameter $\theta$ under the null and to test the null. So we assume two different moment vector functions $g_0$ and $g_1$ to be given. The first one is selected to fulfill
$$\theta \equiv I\!E_\theta\, g_0(Y_1), \qquad \theta \in \Theta_0.$$
This permits estimating the parameter $\theta$ directly by the empirical moment:
$$\widetilde\theta = \frac{1}{n}\sum_i g_0(Y_i).$$
The second vector of moment functions is composed of directional alternatives. An identifiability condition suggests selecting the directional alternative functions orthogonal to $g_0$ in the following sense. We choose $g_1 = (g_1^{(1)}, \ldots, g_1^{(k)})^\top : I\!R^r \to I\!R^k$ such that for all $\theta \in \Theta_0$ it holds that $g_1(m_1, \ldots, m_r) = 0 \in I\!R^k$, where $(m_\ell : 1 \le \ell \le r)$ denote the first $r$ (population) moments of the distribution $P_\theta$.

Theorem 8.1.3. Let $\widehat m_\ell = n^{-1}\sum_{i=1}^n Y_i^\ell$ denote the $\ell$-th sample moment for $1 \le \ell \le r$. Then, under the regularity assumptions discussed in Section 2.4, and assuming that each $g_1^{(j)}$ is continuously differentiable and that all $(m_\ell : 1 \le \ell \le 2r)$ are continuous functions of $\theta$, it holds that the distribution of
$$T_n \stackrel{\mathrm{def}}{=} n\, g_1^\top(\widehat m_1, \ldots, \widehat m_r)\, V^{-1}(\widetilde\theta)\, g_1(\widehat m_1, \ldots, \widehat m_r)$$
converges under $H_0$ weakly to $\chi^2_k$, where
$$V(\theta) \stackrel{\mathrm{def}}{=} J(g_1)\,\Sigma\, J(g_1)^\top \in I\!R^{k\times k}, \qquad J(g_1) \stackrel{\mathrm{def}}{=} \left(\frac{\partial g_1^{(j)}(m_1,\ldots,m_r)}{\partial m_\ell}\right)_{1\le j\le k,\ 1\le \ell\le r} \in I\!R^{k\times r}$$
and $\Sigma = (\sigma_{ij}) \in I\!R^{r\times r}$ with $\sigma_{ij} = m_{i+j} - m_i m_j$.

Theorem 8.1.3, which is an application of the Delta method in connection with the asymptotic normality of MOM estimators, leads to the goodness-of-fit test
$$\phi = 1\bigl(T_n > z_\alpha\bigr),$$
where $z_\alpha$ is the upper $\alpha$-quantile of $\chi^2_k$, for testing $H_0$.

8.2 Minimum distance method for an i.i.d. sample

The method of moments is especially useful for the case of a simple hypothesis because it compares the population moments computed under the null with their empirical counterparts. However, if a more complicated composite hypothesis is tested, the population moments cannot be computed directly: the null measure is not specified precisely. In this case, the minimum distance idea appears to be useful. Let $(P_\theta, \theta \in \Theta \subset I\!R^p)$ be a parametric family and $\Theta_0$ a subset of $\Theta$. The null hypothesis about an i.i.d. sample $Y$ from $P$ is that $P \in (P_\theta, \theta \in \Theta_0)$. Let $\rho(P, P')$ denote some functional (distance) defined for measures $P, P'$ on the real line. We assume that $\rho$ satisfies the following conditions: $\rho(P_{\theta_1}, P_{\theta_2}) \ge 0$ and $\rho(P_{\theta_1}, P_{\theta_2}) = 0$ iff $\theta_1 = \theta_2$. The condition $P \in (P_\theta, \theta \in \Theta_0)$ can be rewritten in the form
$$\inf_{\theta\in\Theta_0} \rho(P, P_\theta) = 0.$$
Now we can apply the substitution principle: use $P_n$ in place of $P$. Define the value $T$ by
$$T \stackrel{\mathrm{def}}{=} \inf_{\theta\in\Theta_0} \rho(P_n, P_\theta). \qquad (8.5)$$

Large values of the test statistic $T$ indicate a possible violation of the null hypothesis.


In particular, if $H_0$ is a simple hypothesis, that is, if the set $\Theta_0$ consists of one point $\theta_0$, the test statistic reads as $T = \rho(P_n, P_{\theta_0})$. The critical value for this test is usually selected by the level condition:
$$I\!P_{\theta_0}\bigl(\rho(P_n, P_{\theta_0}) > t_\alpha\bigr) \le \alpha.$$
Note that the test statistic (8.5) can be viewed as a combination of two different steps. First we estimate under the null the parameter $\theta \in \Theta_0$ which provides the best possible parametric fit under the assumption $P \in (P_\theta, \theta \in \Theta_0)$:
$$\widetilde\theta_0 = \mathop{\mathrm{arginf}}_{\theta\in\Theta_0} \rho(P_n, P_\theta).$$
Next we formally apply the minimum distance test with the simple hypothesis given by $\theta_0 = \widetilde\theta_0$.

Below we discuss some standard choices of the distance ρ .

8.2.1 Kolmogorov-Smirnov test

Let $P_0, P_1$ be two distributions on the real line with distribution functions $F_0, F_1$: $F_j(y) = P_j(Y \le y)$ for $j = 0, 1$. Define
$$\rho(P_0, P_1) = \rho(F_0, F_1) = \sup_y \bigl|F_0(y) - F_1(y)\bigr|. \qquad (8.6)$$
Now consider the related test, starting from the case of a simple null hypothesis $P = P_0$ with corresponding c.d.f. $F_0$. Then the distance $\rho$ from (8.6) (properly scaled) leads to the Kolmogorov-Smirnov test statistic
$$T_n \stackrel{\mathrm{def}}{=} \sup_y n^{1/2}\bigl|F_0(y) - F_n(y)\bigr|.$$

A nice feature of this test is the property of asymptotic pivotality.

Theorem 8.2.1 (Kolmogorov). Let $F_0$ be a continuous c.d.f. Then
$$T_n = \sup_y n^{1/2}\bigl|F_0(y) - F_n(y)\bigr| \xrightarrow{D} \eta,$$
where $\eta$ is a fixed random variable (the supremum of the absolute value of a Brownian bridge on $[0,1]$).

Proof. Idea of the proof: the c.d.f. $F_0$ is monotonic and continuous. Therefore, its inverse function $F_0^{-1}$ is uniquely defined. Consider the r.v.'s
$$U_i \stackrel{\mathrm{def}}{=} F_0(Y_i).$$
The basic fact about this transformation is that the $U_i$'s are i.i.d. uniform on the interval $[0,1]$.


Lemma 8.2.1. The r.v.'s $U_i$ are i.i.d. with values in $[0,1]$ and for any $u \in [0,1]$ it holds that
$$I\!P\bigl(U_i \le u\bigr) = u.$$
By definition of $F_0^{-1}$, it holds for any $u \in [0,1]$ that
$$F_0\bigl(F_0^{-1}(u)\bigr) \equiv u.$$

Moreover, if $G_n$ is the empirical c.d.f. of the $U_i$'s, that is, if
$$G_n(u) \stackrel{\mathrm{def}}{=} \frac{1}{n}\sum_i 1(U_i \le u),$$
then
$$G_n(u) \equiv F_n\bigl[F_0^{-1}(u)\bigr]. \qquad (8.7)$$

Exercise 8.2.1. Check Lemma 8.2.1 and (8.7).

Now, by the change of variable $y = F_0^{-1}(u)$, we obtain
$$T_n = \sup_{u\in[0,1]} n^{1/2}\bigl|F_0(F_0^{-1}(u)) - F_n(F_0^{-1}(u))\bigr| = \sup_{u\in[0,1]} n^{1/2}\bigl|u - G_n(u)\bigr|.$$
It is obvious that the right-hand side of this expression does not depend on the original model. Actually, for fixed $n$ it is a precisely described random variable, and so its distribution only depends on $n$. It only remains to show that for large $n$ this distribution is close to some fixed limit distribution with a continuous c.d.f., allowing for the choice of a proper critical value. We indicate the main steps of the proof.

Given a sample $U_1, \ldots, U_n$, define the random function
$$\xi_n(u) \stackrel{\mathrm{def}}{=} n^{1/2}\bigl[u - G_n(u)\bigr].$$
Clearly $T_n = \sup_{u\in[0,1]} |\xi_n(u)|$. Next, convergence of the random functions $\xi_n(\cdot)$ would imply the convergence of their maximum over $u \in [0,1]$, because the maximum is a continuous functional of a function. Finally, the weak convergence $\xi_n(\cdot) \xrightarrow{w} \xi(\cdot)$ can be checked if for any continuous function $h(u)$ it holds that

$$\langle \xi_n, h\rangle \stackrel{\mathrm{def}}{=} n^{1/2}\int_0^1 h(u)\bigl[u - G_n(u)\bigr]\,du \;\xrightarrow{w}\; \langle \xi, h\rangle \stackrel{\mathrm{def}}{=} \int_0^1 h(u)\,\xi(u)\,du.$$

Now the result can be derived from the representation
$$\langle \xi_n, h\rangle = n^{1/2}\int_0^1 \bigl[h(u)G_n(u) - m(h)\bigr]\,du = n^{-1/2}\sum_{i=1}^n \bigl[U_i h(U_i) - m(h)\bigr]$$
with $m(h) = \int_0^1 u\,h(u)\,du = I\!E[U_1 h(U_1)]$, and from the central limit theorem for a sum of i.i.d. random variables.

The case of a composite hypothesis.

If $H_0: P \in (P_\theta, \theta \in \Theta_0)$, then the test statistic is described by (8.5). As we already mentioned, testing of a composite hypothesis can be viewed as a two-step procedure. In the first step, $\theta$ is estimated by $\widetilde\theta_0$, and in the second step the goodness-of-fit test based on $T_n$ is carried out, where $F_0$ is replaced by the c.d.f. corresponding to $P_{\widetilde\theta_0}$. It turns out that pivotality of the distribution of $T_n$ is preserved if $\theta$ is a location and/or scale parameter, but a general (asymptotic) theory allowing to derive $t_\alpha$ analytically is not available. Therefore, computer simulations are typically employed to approximate $t_\alpha$.
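The following sketch shows the Kolmogorov-Smirnov statistic for a simple null and a simulation-based (parametric bootstrap) approximation of $t_\alpha$ for the composite null "P is some Gaussian law", in the spirit of the remark above. The data and the number of simulations are illustrative assumptions.

```python
# KS test for a simple null, and a Lilliefors-type Monte-Carlo calibration
# for a composite location-scale Gaussian null.
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(4)
y = rng.normal(loc=0.3, scale=1.0, size=100)

# Simple null P0 = N(0, 1); scipy returns sup_y |F_n - F_0| (not scaled by sqrt(n)).
print(kstest(y, norm.cdf))

def ks_stat(x):
    """sup-norm distance between F_n and the fitted Gaussian c.d.f."""
    mu, sd = x.mean(), x.std(ddof=1)
    u = np.sort(norm.cdf(x, loc=mu, scale=sd))
    n = len(x)
    return np.max(np.maximum(np.arange(1, n + 1) / n - u, u - np.arange(n) / n))

t_obs = ks_stat(y)
# The statistic is location-scale invariant, so simulating from N(0,1) suffices.
sims = np.array([ks_stat(rng.normal(size=len(y))) for _ in range(2000)])
print(t_obs, np.mean(sims >= t_obs))   # Monte-Carlo p-value for the composite null
```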

8.2.2 ω2 test (Cramer-Smirnov-von Mises)

Here we briefly discuss another distance also based on the c.d.f. of the null measure. Namely, define for a measure $P$ on the real line with c.d.f. $F$
$$\rho(P_n, P) = \rho(F_n, F) = n\int \bigl[F_n(y) - F(y)\bigr]^2\,dF(y). \qquad (8.8)$$
For the case of a simple hypothesis $P = P_0$, the Cramer-Smirnov-von Mises (CSvM) test statistic is given by (8.8) with $F = F_0$. This is another functional of the path of the random function $n^{1/2}\bigl[F_n(y) - F_0(y)\bigr]$. The Kolmogorov test uses the maximum of this function, while the CSvM test uses the integral of this function squared. The property of pivotality is preserved for the CSvM test statistic as well.

Theorem 8.2.2. Let $F_0$ be a continuous c.d.f. Then
$$T_n = n\int \bigl[F_n(y) - F_0(y)\bigr]^2\,dF_0(y) \xrightarrow{D} \eta,$$
where $\eta$ is a fixed random variable (the integral of a squared Brownian bridge on $[0,1]$).

Proof. The idea of the proof is the same as in the case of the Kolmogorov-Smirnov test. First the transformation by $F_0^{-1}$ translates the general case to the case of the uniform distribution on $[0,1]$. Next one can again use the functional convergence of the process $\xi_n(u)$.
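A short numerical sketch of the $\omega^2$ statistic for a simple null with c.d.f. $F_0$ follows, using the standard computing formula $T_n = \frac{1}{12n} + \sum_{i=1}^n \bigl[\frac{2i-1}{2n} - F_0(Y_{(i)})\bigr]^2$. Recent SciPy versions also provide scipy.stats.cramervonmises as a ready-made cross-check; the data below are invented.

```python
# Cramer-Smirnov-von Mises statistic for a simple null P0.
import numpy as np
from scipy.stats import norm

def cvm_stat(y, cdf):
    n = len(y)
    u = cdf(np.sort(y))                     # F0 evaluated at the order statistics
    i = np.arange(1, n + 1)
    return 1.0 / (12 * n) + np.sum(((2 * i - 1) / (2 * n) - u) ** 2)

rng = np.random.default_rng(5)
print(cvm_stat(rng.normal(size=200), norm.cdf))
```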


8.3 Partially Bayes tests and Bayes testing

In the above sections we mostly focused on the likelihood ratio testing approach. As in estimation theory, the LR approach is very general and possesses some nice properties. This section briefly discusses some possible alternative approaches, including partially Bayes and Bayes approaches.

8.3.1 Partial Bayes approach and Bayes tests

Let $\Theta_0$ and $\Theta_1$ be two subsets of the parameter set $\Theta$. We test the null hypothesis $H_0: \theta^* \in \Theta_0$ against the alternative $H_1: \theta^* \in \Theta_1$. The LR approach compares the maximum of the likelihood process over $\Theta_0$ with the similar maximum over $\Theta_1$. Let now two measures $\pi_0$ on $\Theta_0$ and $\pi_1$ on $\Theta_1$ be given. Instead of the maximum of $L(Y, \theta)$, we now consider its weighted sum (integral) over $\Theta_0$ (resp. $\Theta_1$) with weights $\pi_0(\theta)$ resp. $\pi_1(\theta)$. More precisely, we consider the value
$$T_{\pi_0,\pi_1} = \int_{\Theta_1} L(Y,\theta)\,\pi_1(\theta)\,\lambda(d\theta) - \int_{\Theta_0} L(Y,\theta)\,\pi_0(\theta)\,\lambda(d\theta).$$
Significantly positive values of this expression indicate that the hypothesis is likely to be wrong.

Similarly, and more commonly used, we may define measures $g_0$ and $g_1$ such that
$$\pi(\theta) = \begin{cases} \pi_0\, g_0(\theta), & \theta \in \Theta_0,\\ \pi_1\, g_1(\theta), & \theta \in \Theta_1,\end{cases}$$
where $\pi$ is a prior on the entire parameter space $\Theta$ and $\pi_i \stackrel{\mathrm{def}}{=} \int_{\Theta_i}\pi(\theta)\,\lambda(d\theta) = I\!P(\Theta_i)$ for $i = 0, 1$. Then the Bayes factor for comparing $H_0$ and $H_1$ is given by
$$B_{0|1} = \frac{\int_{\Theta_0} L(Y,\theta)\, g_0(\theta)\,\lambda(d\theta)}{\int_{\Theta_1} L(Y,\theta)\, g_1(\theta)\,\lambda(d\theta)} = \frac{I\!P(\Theta_0\,|\,Y)/I\!P(\Theta_1\,|\,Y)}{I\!P(\Theta_0)/I\!P(\Theta_1)}. \qquad (8.9)$$

The representation of the Bayes factor on the right-hand side of (8.9) shows that it can be interpreted as the ratio of the posterior odds for $H_0$ and the prior odds for $H_0$. The resulting test rejects the null hypothesis for significantly small values of $B_{0|1}$, or, equivalently, for significantly large values of $B_{1|0} = 1/B_{0|1}$. In the special case that $H_0$ and $H_1$ are two simple hypotheses, i.e., $\Theta_0 = \{\theta_0\}$ and $\Theta_1 = \{\theta_1\}$, the Bayes factor is simply given by
$$B_{0|1} = \frac{L(Y,\theta_0)}{L(Y,\theta_1)};$$
hence, in such a case the testing approach based on the Bayes factor is equivalent to the LR approach.

8.3.2 Bayes approach

Within the Bayes approach the true data distribution and the true parameter value are not defined. Instead one considers the prior and posterior distribution of the parameter. The parametric Bayes model can be represented as
$$Y \,|\, \theta \sim p(y\,|\,\theta), \qquad \theta \sim \pi(\theta).$$
The posterior density $p(\theta\,|\,Y)$ can be computed via the Bayes formula:
$$p(\theta\,|\,Y) = \frac{p(Y\,|\,\theta)\,\pi(\theta)}{p(Y)}$$
with the marginal density $p(Y) = \int_\Theta p(Y\,|\,\theta)\,\pi(\theta)\,\lambda(d\theta)$. The Bayes approach suggests, instead of checking a hypothesis about the location of the parameter $\theta$, to look directly at the posterior distribution. Namely, one can construct so-called credible sets which contain a prespecified fraction, say $1-\alpha$, of the mass of the whole posterior distribution. Then one can say that the probability for the parameter $\theta$ to lie outside of this credible set is at most $\alpha$. So the testing problem of the frequentist approach is replaced by the problem of confidence estimation for the Bayes method.

Example 8.3.1 (Example 5.2.2 continued). Consider again the situation of a Bernoulli product likelihood for $Y = (Y_1, \ldots, Y_n)^\top$ with unknown success probability $\theta$. In Example 5.2.2 we saw that this family of likelihoods is conjugate to the family of Beta distributions as priors on $[0,1]$. More specifically, if $\theta \sim \mathrm{Beta}(a, b)$, then $\theta \,|\, Y = y \sim \mathrm{Beta}(a+s,\, b+n-s)$, where $s = \sum_{i=1}^n y_i$ denotes the observed number of successes. Under quadratic risk, the Bayes-optimal point estimate of $\theta$ is given by $I\!E[\theta \,|\, Y = y] = (a+s)/(a+b+n)$, and a credible interval can be constructed around this value by utilizing quantiles of the posterior $\mathrm{Beta}(a+s,\, b+n-s)$ distribution.
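A minimal sketch of the Beta-Bernoulli computation from Example 8.3.1 follows: posterior, Bayes-optimal point estimate, and an equal-tailed $1-\alpha$ credible interval. The prior parameters and the data counts are invented for illustration.

```python
# Beta-Bernoulli posterior and equal-tailed credible interval.
from scipy.stats import beta

a, b = 1.0, 1.0            # uniform prior on [0, 1]
n, s = 20, 14              # n Bernoulli trials, s observed successes
alpha = 0.05

post = beta(a + s, b + n - s)
point_estimate = (a + s) / (a + b + n)                      # posterior mean
credible_interval = (post.ppf(alpha / 2), post.ppf(1 - alpha / 2))
print(point_estimate, credible_interval)
```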

8.4 Score, rank and permutation tests

8.4.1 Score tests

Testing a composite null hypothesis $H_0$ against a composite alternative $H_1$ is in general a challenging problem, because only in some special cases do uniformly (in $\theta \in H_1$) most powerful level-$\alpha$ tests exist. In all other cases, one has to decide against which regions in $H_1$ optimal power is targeted. One class of procedures is given by locally best tests, optimizing power in regions close to $H_0$. To formalize this class mathematically, one needs the concept of differentiability in the mean.

Definition 8.4.1 (Differentiability in the mean). Let $(\mathcal Y, \mathcal B(\mathcal Y), (I\!P_\theta)_{\theta\in\Theta})$ denote a statistical model and assume that $(I\!P_\theta)_{\theta\in\Theta}$ is a dominated (by $\mu$) family of measures, where $\Theta \subseteq I\!R$. Then $(\mathcal Y, \mathcal B(\mathcal Y), (I\!P_\theta)_{\theta\in\Theta})$ is called differentiable in the mean in $\theta_0 \in \Theta$ if a function $g \in L_1(\mu)$ exists with
$$\left\| t^{-1}\left(\frac{dI\!P_{\theta_0+t}}{d\mu} - \frac{dI\!P_{\theta_0}}{d\mu}\right) - g \right\|_{L_1(\mu)} \to 0 \quad \text{as } t \to 0.$$
The function $g$ is called the $L_1(\mu)$ derivative of $\theta \mapsto I\!P_\theta$ in $\theta_0$. In the sequel, we choose w.l.o.g. $\theta_0 \equiv 0$.

Theorem 8.4.1 (§18 in Hewitt and Stromberg (1975)). Under the assumptions of Definition 8.4.1, let $\theta_0 = 0$ and let densities be given by $f_\theta(y) \stackrel{\mathrm{def}}{=} dI\!P_\theta/(d\mu)(y)$. Assume that there exists an open neighbourhood $U$ of $0$ such that for $\mu$-almost all $y$ the mapping $U \ni \theta \mapsto f_\theta(y)$ is absolutely continuous, i.e., there exists an integrable function $\tau \mapsto \dot f(y,\tau)$ on $U$ with
$$\int_{\theta_1}^{\theta_2} \dot f(y,\tau)\,d\tau = f_{\theta_2}(y) - f_{\theta_1}(y), \qquad \theta_1 < \theta_2,$$
and assume that $\frac{\partial}{\partial\theta} f_\theta(y)\big|_{\theta=0} = \dot f(y,0)$ $\mu$-almost everywhere. Furthermore, assume that for $\theta \in U$ the function $y \mapsto \dot f(y,\theta)$ is $\mu$-integrable with
$$\int \bigl|\dot f(y,\theta)\bigr|\,d\mu(y) \to \int \bigl|\dot f(y,0)\bigr|\,d\mu(y), \qquad \theta \to 0.$$
Then $\theta \mapsto I\!P_\theta$ is differentiable in the mean in $0$ with $g = \dot f(\cdot, 0)$.

Theorem 8.4.2. Under the assumptions of Definition 8.4.1, assume that the densities $\theta \mapsto f_\theta$ are differentiable in the mean in $0$ with $L_1(\mu)$ derivative $g$. Then
$$\theta^{-1}\log\bigl(f_\theta(y)/f_0(y)\bigr) = \theta^{-1}\bigl(\log f_\theta(y) - \log f_0(y)\bigr)$$
converges for $\theta \to 0$ to $L(y)$ (say) in $I\!P_0$ probability. We call $L$ the derivative of the (logarithmic) likelihood ratio or score function. It holds that
$$L(y) = g(y)/f_0(y), \qquad \int L\,dI\!P_0 = 0, \qquad \{f_0 = 0\} \subseteq \{g = 0\} \quad I\!P_0\text{-almost surely}.$$


Proof. $\theta^{-1}(f_\theta/f_0 - 1) \to g/f_0$ converges in $L_1(I\!P_0)$ and, consequently, in $I\!P_0$ probability. The chain rule yields $L(y) = g(y)/f_0(y)$. Noting that $\int (f_\theta - f_0)\,d\mu = 0$, we conclude that
$$\int L\,dI\!P_0 = \int g\,d\mu = 0.$$

Example 8.4.1. (a) Location parameter model: Let $Y = \theta + X$, $\theta \ge 0$, and assume that $X$ has a density $f$ which is absolutely continuous with respect to the Lebesgue measure and does not depend on $\theta$. Then the densities $\theta \mapsto f(y - \theta)$ of $Y$ under $\theta$ are differentiable in the mean in zero with score function $L(y) = -f'(y)/f(y)$ (differentiation with respect to $y$).

(b) Scale parameter model: Let $Y = \exp(\theta)X$ and assume again that $X$ has density $f$ with the properties stated in part (a). Moreover, assume that $\int |x f'(x)|\,dx < \infty$. Then the densities $\theta \mapsto \exp(-\theta)f(y\exp(-\theta))$ of $Y$ under $\theta$ are differentiable in the mean in zero with score function $L(y) = -(1 + y f'(y)/f(y))$.

Lemma 8.4.1. Assume that the family $\theta \mapsto I\!P_\theta$ is differentiable in the mean with score function $L$ in $\theta_0 = 0$ and that $c_i$, $1 \le i \le n$, are real constants. Then $\theta \mapsto \bigotimes_{i=1}^n I\!P_{c_i\theta}$ is also differentiable in the mean in zero, and has score function
$$(y_1, \ldots, y_n)^\top \mapsto \sum_{i=1}^n c_i L(y_i).$$

Exercise 8.4.1. Prove Lemma 8.4.1.

Definition 8.4.2 (Score test). Let $\theta \mapsto I\!P_\theta$ be differentiable in the mean in $\theta_0$ with score function $L$. Then every test $\phi$ of the form
$$\phi(y) = \begin{cases} 1, & \text{if } L(y) > c,\\ \gamma, & \text{if } L(y) = c,\\ 0, & \text{if } L(y) < c,\end{cases}$$
is called a score test. Here, $\gamma \in [0,1]$ denotes a randomization constant.

Definition 8.4.3 (Locally best test). Let $(I\!P_\theta)_{\theta\in\Theta}$ with $\Theta \subseteq I\!R$ denote a family which is differentiable in the mean in $\theta_0 \in \Theta$. A test $\phi^*$ with $I\!E_{\theta_0}[\phi^*] = \alpha$ is called a locally best test among all tests with expectation $\alpha$ under $\theta_0$ for the test problem $H_0 = \{\theta_0\}$ versus $H_1 = \{\theta > \theta_0\}$ if
$$\frac{d}{d\theta}\, I\!E_\theta[\phi^*]\Big|_{\theta=\theta_0} \;\ge\; \frac{d}{d\theta}\, I\!E_\theta[\phi]\Big|_{\theta=\theta_0}$$
for all tests $\phi$ with $I\!E_{\theta_0}[\phi] = \alpha$.


Figure 8.1 illustrates the situation considered in Definition 8.4.3 graphically.


Fig. 8.1. Locally best test φ∗ with expectation α under θ0

Theorem 8.4.3. Under the assumptions of Definition 8.4.3, the score test
$$\phi(y) = \begin{cases} 1, & \text{if } L(y) > c(\alpha),\\ \gamma, & \text{if } L(y) = c(\alpha), \quad \gamma \in [0,1],\\ 0, & \text{if } L(y) < c(\alpha),\end{cases}$$
with $I\!E_{\theta_0}[\phi] = \alpha$ is a locally best test for testing $H_0 = \{\theta_0\}$ against $H_1 = \{\theta > \theta_0\}$.

Proof. We notice that for any test $\phi$ it holds that
$$\frac{d}{d\theta}\, I\!E_\theta[\phi]\Big|_{\theta=\theta_0} = I\!E_{\theta_0}[\phi L].$$
Hence, we have to optimize $\int \phi(y)L(y)\,I\!P_{\theta_0}(dy)$ with respect to $\phi$ under the level constraint, yielding the assertion in analogy to the argumentation in the proof of Theorem 6.2.1.


Theorem 8.4.3 shows that in the theory of locally best tests the score function $L$ takes the role that the likelihood ratio has in the LR theory. Notice that, for an i.i.d. sample $Y = (Y_1, \ldots, Y_n)$, the joint product measure $I\!P_\theta^n$ has score function $(y_1, \ldots, y_n) \mapsto \sum_{i=1}^n L(y_i)$ according to Lemma 8.4.1, and Theorem 8.4.3 can be applied to test $H_0 = \{\theta_0\}$ against $H_1 = \{\theta > \theta_0\}$ based on $Y$.

Moreover, for $k$-sample problems with $k \ge 2$ groups and $n$ jointly independent observations, Lemma 8.4.1 can be utilized to test the homogeneity hypothesis
$$H_0 = \bigl\{ I\!P^{Y_1} = I\!P^{Y_2} = \ldots = I\!P^{Y_n} : I\!P^{Y_1} \text{ continuous} \bigr\}. \qquad (8.10)$$
To this end, one considers parametric families $\theta \mapsto I\!P_{n,\theta}$ which belong to $H_0$ only in the case $\theta = 0$, i.e., $I\!P_{n,0} \in H_0$. For $\theta \ne 0$, $I\!P_{n,\theta}$ is a product measure with non-identical factors.

Example 8.4.2. (a) Regression model for a location parameter: Let $Y_i = c_i\theta + X_i$, $1 \le i \le n$, where $\theta \ge 0$. Here, assume that the $X_i$ are i.i.d. with Lebesgue density $f$ which is independent of $\theta$. Now, for a two-sample problem with $n_1$ observations in the first group and $n_2 = n - n_1$ observations in the second group, we set $c_1 = c_2 = \cdots = c_{n_1} = 1$ and $c_i = 0$ for all $n_1 + 1 \le i \le n$. Under alternatives, the observations in the first group are shifted by $\theta > 0$.

(b) Regression model for a scale parameter: Let $c_i$, $1 \le i \le n$, denote real regression coefficients and consider the model $Y_i = \exp(c_i\theta)X_i$, $1 \le i \le n$, $\theta \in I\!R$, where we assume again that the $X_i$ are i.i.d. with Lebesgue density $f$ which is independent of $\theta$. Then it holds that
$$\frac{dI\!P_{n,\theta}}{d\lambda^n}(y) = \prod_{i=1}^n \exp(-c_i\theta)\,f\bigl(y_i\exp(-c_i\theta)\bigr).$$
Under $\theta_0 = 0$, the product measure $I\!P_{n,0}$ belongs to $H_0$, while under alternatives it does not.

8.4.2 Rank tests

In this section, we will consider the case that only the ranks of the observations are trustworthy (or available). Theorem 8.4.4 will be utilized to define the resulting rank tests based on parametric families $I\!P_{n,\theta}$ as considered in Example 8.4.2. It turns out that the score test based on ranks has a very simple structure.

Theorem 8.4.4. Let $\theta \mapsto I\!P_\theta$ denote a parametric family which is differentiable in the mean in $\theta_0 = 0$ with respect to some reference measure $\mu$, $L_1(\mu)$-differentiable for short, with score function $L$. Furthermore, let $S : \mathcal Y \to \mathcal S$ be measurable. Then $\theta \mapsto I\!P_\theta^S$ is $L_1(\mu^S)$-differentiable with score function $s \mapsto I\!E_0[L \,|\, S = s]$.

Proof. First we show that the $L_1(\mu^S)$ derivative of $\theta \mapsto I\!P_\theta^S$ is given by $s \mapsto I\!E_\mu[g \,|\, S = s]$, where $g$ is the $L_1(\mu)$ derivative of $\theta \mapsto I\!P_\theta$. To this end, notice that
$$\frac{dI\!P_\theta^S}{d\mu^S}(s) = I\!E_\mu[f_\theta \,|\, S = s], \qquad \text{where } f_\theta = \frac{dI\!P_\theta}{d\mu}.$$

Linearity of $I\!E_\mu[\cdot \,|\, S]$ and transformation of measures lead to
$$\int \left| \theta^{-1}\left(\frac{dI\!P_\theta^S}{d\mu^S} - \frac{dI\!P_0^S}{d\mu^S}\right) - I\!E_\mu[g \,|\, S = s] \right| d\mu^S(s) = \int \bigl| I\!E_\mu\bigl[\theta^{-1}(f_\theta - f_0) - g \,\big|\, S\bigr] \bigr|\, d\mu.$$

Applying Jensen's inequality and Vitali's theorem, we conclude that $s \mapsto I\!E_\mu[g \,|\, S = s]$ is the $L_1(\mu^S)$ derivative of $\theta \mapsto I\!P_\theta^S$. Now the chain rule yields that the score function of $I\!P_\theta^S$ in zero is given by
$$s \mapsto I\!E_\mu[g \,|\, S = s]\,\Bigl\{ I\!E_\mu\Bigl[\frac{dI\!P_0}{d\mu} \,\Big|\, S = s\Bigr]\Bigr\}^{-1},$$
and the assertion follows by substituting $g = L\, dI\!P_0/(d\mu)$ and verifying that
$$I\!E_0[L \,|\, S]\; I\!E_\mu\Bigl[\frac{dI\!P_0}{d\mu} \,\Big|\, S\Bigr] = I\!E_\mu\Bigl[L\,\frac{dI\!P_0}{d\mu} \,\Big|\, S\Bigr] \qquad \mu\text{-almost surely}.$$

For applying the latter theorem to rank statistics, we need to gather some basic facts about ranks and order statistics.

Definition 8.4.4. Let $y = (y_1, \ldots, y_n)$ be a point in $I\!R^n$. Assume that the $y_i$ are pairwise distinct and denote their ordered values by $y_{1:n} < y_{2:n} < \ldots < y_{n:n}$.

(a) For $1 \le i \le n$, the integer $r_i \equiv r_i(y) \stackrel{\mathrm{def}}{=} \#\{j \in \{1, \ldots, n\} : y_j \le y_i\}$ is called the rank of $y_i$ (in $y$). The vector $r(y) \stackrel{\mathrm{def}}{=} (r_1(y), \ldots, r_n(y)) \in S_n$ is called the rank vector of $y$.

(b) The inverse permutation $d(y) = (d_1(y), \ldots, d_n(y)) \stackrel{\mathrm{def}}{=} [r(y)]^{-1}$ is called the vector of antiranks of $y$, and the integer $d_i(y)$ is called the antirank of $i$ (the index that corresponds to the $i$-th smallest observation in $y$).

Now let $Y_1, \ldots, Y_n$ with $Y_i : \mathcal Y_i \to I\!R$ be stochastically independent, continuously distributed random variables and denote the joint distribution of $(Y_1, \ldots, Y_n)$ by $I\!P$.


(c) Because of $I\!P\bigl(\bigcup_{i\ne j}\{Y_i = Y_j\}\bigr) = 0$, the following objects are $I\!P$-almost surely uniquely defined: $Y_{i:n}$ is called the $i$-th order statistic of $Y = (Y_1, \ldots, Y_n)$, $R_i(Y) \stackrel{\mathrm{def}}{=} nF_n(Y_i) = r_i(Y_1, \ldots, Y_n)$ is called the rank of $Y_i$, $R(Y) \stackrel{\mathrm{def}}{=} (R_1(Y), \ldots, R_n(Y))$ is called the vector of rank statistics of $Y$, $D_i(Y) \stackrel{\mathrm{def}}{=} d_i(Y_1, \ldots, Y_n)$ is called the antirank of $i$ with respect to $Y$, and $D(Y) \stackrel{\mathrm{def}}{=} d(Y)$ is called the vector of antiranks of $Y$.

Lemma 8.4.2. Under the assumptions of Definition 8.4.4, it holds that

(a) $i = r_{d_i} = d_{r_i}$, $y_i = y_{r_i:n}$, $y_{i:n} = y_{d_i}$.
(b) If $Y_1, \ldots, Y_n$ are exchangeable random variables, then $R(Y)$ is uniformly distributed on $S_n$, i.e., $I\!P(R(Y) = \sigma) = 1/n!$ for all permutations $\sigma = (r_1, \ldots, r_n) \in S_n$.
(c) If $U_1, \ldots, U_n$ are i.i.d. with $U_1 \sim \mathrm{UNI}[0,1]$ and $Y_i = F^{-1}(U_i)$, $1 \le i \le n$, for some distribution function $F$, then it holds that $Y_{i:n} = F^{-1}(U_{i:n})$. If $F$ is continuous, then it holds that $R(Y) = R(U_1, \ldots, U_n)$.

(d) If $(Y_1, \ldots, Y_n)$ are i.i.d. with c.d.f. $F$ of $Y_1$, then we have
(i) $I\!P(Y_{i:n} \le y) = \sum_{j=i}^n \binom{n}{j} F(y)^j (1 - F(y))^{n-j}$.
(ii) $\frac{dI\!P^{Y_{i:n}}}{dI\!P^{Y_1}}(y) = n\binom{n-1}{i-1} F(y)^{i-1}(1 - F(y))^{n-i}$. If $I\!P^{Y_1}$ has Lebesgue density $f$, then $I\!P^{Y_{i:n}}$ has Lebesgue density $f_{i:n}$, given by $f_{i:n}(y) = n\binom{n-1}{i-1} F(y)^{i-1}(1 - F(y))^{n-i} f(y)$.
(iii) Letting $\mu \stackrel{\mathrm{def}}{=} I\!P^{Y_1}$, $(Y_{i:n})_{1\le i\le n}$ has the joint $\mu^n$-density $(y_1, \ldots, y_n) \mapsto n!\,1\!I_{\{y_1<y_2<\ldots<y_n\}}$. If $\mu$ has Lebesgue density $f$, then $(Y_{i:n})_{1\le i\le n}$ has $\lambda^n$-density $(y_1, \ldots, y_n) \mapsto n!\,\prod_{i=1}^n f(y_i)\,1\!I_{\{y_1<y_2<\ldots<y_n\}}$.

Remark 8.4.1. Lemma 8.4.2.(c) (quantile transformation) shows the special importance of the distribution of order statistics of i.i.d. $\mathrm{UNI}[0,1]$-distributed random variables $U_1, \ldots, U_n$. According to Lemma 8.4.2.(d), the order statistic $U_{i:n}$ has a $\mathrm{Beta}(i, n-i+1)$ distribution with $I\!E[U_{i:n}] = i/(n+1)$ and $\mathrm{Var}(U_{i:n}) = i(n-i+1)/\bigl((n+1)^2(n+2)\bigr)$. For computing the joint distribution function of $(U_{1:n}, \ldots, U_{n:n})$, efficient recursive algorithms exist, for instance Bolshev's recursion and Steck's recursion (see Shorack and Wellner (1986), p. 362 ff.).

Theorem 8.4.5. Let $Y = (Y_1, \ldots, Y_n)$ be real-valued i.i.d. random variables with continuous $\mu = I\!P^{Y_1}$.

(a) The random vectors $R(Y)$ and $(Y_{i:n})_{1\le i\le n}$ are stochastically independent.
(b) Let $T : I\!R^n \to I\!R$ denote a mapping such that the statistic $T(Y)$ is integrable. For any $\sigma = (r_1, \ldots, r_n) \in S_n$, it holds that
$$I\!E[T(Y) \,|\, R(Y) = \sigma] = I\!E\bigl[T\bigl((Y_{r_i:n})_{1\le i\le n}\bigr)\bigr].$$


Proof. For proving part (a), let $\sigma = (r_1, \ldots, r_n) \in S_n$ and Borel sets $A_i$, $1 \le i \le n$, be arbitrary but fixed and define $(d_1, \ldots, d_n) \stackrel{\mathrm{def}}{=} \sigma^{-1}$. We note that $R(Y) = \sigma$ if and only if $Y_{d_1} < Y_{d_2} < \ldots < Y_{d_n}$, and that $Y_{d_i} = Y_{i:n} \in A_i$ if and only if $Y_i \in A_{r_i}$.

Define $B \stackrel{\mathrm{def}}{=} \{y \in I\!R^n : y_1 < y_2 < \ldots < y_n\}$. Then we obtain that
$$I\!P\bigl(R(Y) = \sigma,\ \forall 1\le i\le n: Y_{i:n} \in A_i\bigr) = I\!P\bigl(\forall 1\le i\le n: Y_{d_i} \in A_i,\ (Y_{d_i})_{1\le i\le n} \in B\bigr)$$
$$= \int_{\times_{i=1}^n A_{r_i}} 1\!I_B(y_{d_1}, \ldots, y_{d_n})\,d\mu^n(y_1, \ldots, y_n) = \int_{\times_{i=1}^n A_{r_i}} 1\!I_B(y_1, \ldots, y_n)\,d\mu^n(y_1, \ldots, y_n),$$
because $\mu^n$ is invariant under the transformation $(y_1, \ldots, y_n) \mapsto (y_{d_1}, \ldots, y_{d_n})$ due to exchangeability.

Summation over all $\sigma \in S_n$ yields
$$I\!P\bigl(\forall 1\le i\le n: Y_{i:n} \in A_i\bigr) = n!\int_{\times_{i=1}^n A_{r_i}} 1\!I_B(y_1, \ldots, y_n)\,d\mu^n(y_1, \ldots, y_n).$$
Making use of Lemma 8.4.2.(b), we conclude that
$$I\!P\bigl(R(Y) = \sigma,\ \forall 1\le i\le n: Y_{i:n} \in A_i\bigr) = I\!P\bigl(R(Y) = \sigma\bigr)\, I\!P\bigl(\forall 1\le i\le n: Y_{i:n} \in A_i\bigr),$$
hence the assertion of part (a). For proving part (b), we verify

hence, the assertion of part (a). For proving part (b), we verify

IE[T (Y )∣∣R(Y ) = σ] =

∫{R(Y )=σ}

T (Y )

IP (R(Y ) = σ)dIP

= IE[T ((Yri:n)1≤i≤n)∣∣R(Y ) = σ]

= IE[T ((Yri:n)1≤i≤n)],

where we used that Y = (Yri:n)ni=1 if R(Y ) = σ in the second line and part(a) in the third line.

Now we are ready to apply Theorem 8.4.4 to vectors of rank statistics.

Corollary 8.4.1. Let $(I\!P_\theta)_{\theta\in\Theta}$ with $\Theta \subseteq I\!R$ denote an $L_1(\mu)$-differentiable family with score function $L$ in $\theta_0 = 0$. Let $Y = (Y_1, \ldots, Y_n)$ be a sample from $I\!P_{n,\theta} = \bigotimes_{i=1}^n I\!P_{c_i\theta}$. Then $I\!P_{n,\theta}^R$ has score function
$$\sigma = (r_1, \ldots, r_n) \mapsto I\!E_{n,0}\Bigl[\sum_{i=1}^n c_i L(Y_i) \,\Big|\, R(Y) = \sigma\Bigr] = \sum_{i=1}^n c_i\, I\!E_{n,0}\bigl[L(Y_i) \,\big|\, R(Y) = \sigma\bigr] = \sum_{i=1}^n c_i\, I\!E_{n,0}\bigl[L(Y_{r_i:n})\bigr] = \sum_{i=1}^n c_i\, a(r_i)$$
with $I\!E_{n,0}$ denoting the expectation with respect to $I\!P_{n,0}$ and scores $a(i) \stackrel{\mathrm{def}}{=} I\!E_{n,0}[L(Y_{i:n})]$.

Remark 8.4.2. (a) The test statistic $T(Y) = \sum_{i=1}^n c_i\, a(R_i(Y))$ is called a linear rank statistic.

(b) The hypothesis $H_0$ from (8.10) leads, under conditioning on $R(Y)$, to a simple null hypothesis on $S_n$, namely the discrete uniform distribution on $S_n$, see Lemma 8.4.2.(b). Therefore, the critical value $c(\alpha)$ for the rank test $\phi = \phi(R(Y))$, given by
$$\phi(y) = \begin{cases} 1, & \text{if } T(y) > c(\alpha),\\ \gamma, & \text{if } T(y) = c(\alpha),\\ 0, & \text{if } T(y) < c(\alpha),\end{cases}$$
can be computed by traversing all possible permutations $\sigma \in S_n$ and thereby determining the discrete distribution of $T(Y)$ under $H_0$. For large $n$, we can approximate $c(\alpha)$ by traversing only $B < n!$ randomly chosen permutations $\sigma \in S_n$.

(c) For the scores, it holds that $\sum_{i=1}^n a(i) = 0$. If $L$ is isotone, then it holds that $a(1) \le a(2) \le \ldots \le a(n)$.

(d) Due to the relation $Y_{i:n} \stackrel{D}{=} F^{-1}(U_{i:n})$, the scores are often given in the form $a(i) = I\!E[L\circ F^{-1}(U_{i:n})]$, and the function $L\circ F^{-1}$ is called the score-generating function. For large $n$, one can approximately work with $b(i) \stackrel{\mathrm{def}}{=} L\circ F^{-1}\bigl(\frac{i}{n+1}\bigr)$ (since $I\!E[U_{i:n}] = i/(n+1)$, see Remark 8.4.1) or with $b(i) = n\int_{(i-1)/n}^{i/n} L\circ F^{-1}(u)\,du$ instead of $a(i)$.

In the case that the score function is isotone, rank tests can also be used to test for stochastically ordered distributions in two-sample problems.

Lemma 8.4.3 (Two-sample problems with stochastically ordered distributions). Assume that $a(1) \le a(2) \le \ldots \le a(n)$, cf. Remark 8.4.2.(c), and let $\phi$ denote a rank test at level $\alpha$ for $H_0$ from (8.10), i.e., $I\!E_{H_0}[\phi] = \alpha$. Assume that $Y_1, \ldots, Y_{n_1}$ are i.i.d. with c.d.f. $F_1$ of $Y_1$ and $Y_{n_1+1}, \ldots, Y_n$ are i.i.d. with c.d.f. $F_2$ of $Y_{n_1+1}$.

(a) If $F_1 \ge F_2$, then $I\!E[\phi] \le \alpha$.
(b) If $F_1 \le F_2$, then $I\!E[\phi] \ge \alpha$.

Proof. Lemma 4.4 in Janssen (1998).

In location parameter models as considered in Example 8.4.1.(a), the score function is isotone if and only if the density $f$ is strongly unimodal (i.e., log-concave). The following example discusses some specific instances of such densities and derives the corresponding rank tests.

Example 8.4.3 (Two-sample rank tests in location parameter models with "stochastically larger" alternatives).

(i) Fisher-Yates test: Let $f$ denote the density of $N(0,1)$. Then it holds that $L(y) = y$, and we obtain
$$T = \sum_{i=1}^{n_1} a(R_i) \qquad \text{with } a(i) = I\!E[Y_{i:n}].$$
Here, $Y_{i:n}$ denotes the $i$-th order statistic of i.i.d. random variables $Y_1, \ldots, Y_n$ with $Y_1 \sim N(0,1)$.

(ii) Van der Waerden test: Let $f$ be as in part (i). The score-generating function is given by $u \mapsto \Phi^{-1}(u)$. Following Remark 8.4.2.(d), $b(i) = \Phi^{-1}(i/(n+1))$ are approximate scores, leading to the test statistic
$$T = \sum_{i=1}^{n_1} \Phi^{-1}\Bigl(\frac{R_i}{n+1}\Bigr).$$

(iii) Wilcoxon's rank sum test: Let $f$ be the density of the logistic distribution, given by $f(y) = \exp(-y)(1+\exp(-y))^{-2}$, with corresponding c.d.f. $F(y) = (1+\exp(-y))^{-1}$. The score-generating function is in this case given by $u \mapsto 2u - 1$, leading to the scores
$$a(i) = I\!E[L\circ F^{-1}(U_{i:n})] = \frac{2i}{n+1} - 1.$$
These scores are an affine transformation of the identity and therefore the test can equivalently be carried out by means of the test statistic
$$T = \sum_{i=1}^{n_1} R_i(Y),$$
which is the sum of the ranks in the first group (a numerical sketch of this test follows after this example).

(iv) Median test: The Lebesgue density of the Laplace distribution is given by $f(y) = \exp(-|y|)/2$, with induced score-generating function $u \mapsto \mathrm{sgn}(\ln(2u)) = \mathrm{sgn}(2u - 1)$. Approximate scores are therefore given by
$$b(i) = L\circ F^{-1}\Bigl(\frac{i}{n+1}\Bigr) = \begin{cases} 1, & \text{if } i > (n+1)/2,\\ 0, & \text{if } i = (n+1)/2,\\ -1, & \text{if } i < (n+1)/2.\end{cases}$$
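The following small sketch carries out the Wilcoxon rank sum test of Example 8.4.3(iii): the statistic is the rank sum of the first group and its null distribution is approximated by randomly drawn permutations, as described in Remark 8.4.2(b). The data are invented; scipy.stats.mannwhitneyu is an equivalent cross-check (the Mann-Whitney U statistic is an affine transformation of the rank sum).

```python
# Wilcoxon rank sum test via a Monte-Carlo permutation null distribution.
import numpy as np
from scipy.stats import rankdata, mannwhitneyu

rng = np.random.default_rng(6)
y1 = rng.normal(loc=0.8, size=12)        # first group, possibly shifted upwards
y2 = rng.normal(loc=0.0, size=15)        # second group
y = np.concatenate([y1, y2])
n1 = len(y1)

ranks = rankdata(y)
t_obs = ranks[:n1].sum()                 # sum of the ranks in the first group

B = 5000
perm_stats = np.array([rng.permutation(ranks)[:n1].sum() for _ in range(B)])
p_value = (1 + np.sum(perm_stats >= t_obs)) / (B + 1)    # one-sided "larger"
print(t_obs, p_value)
print(mannwhitneyu(y1, y2, alternative='greater'))        # cross-check
```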

We conclude this section with the Savage test (or log-rank test), an example of a scale parameter test, cf. Example 8.4.1.(b).

Example 8.4.4. Under the scale parameter model considered in Example 8.4.1.(b), assume that $X$ is exponentially distributed with density $f(x) = \exp(-x)\,1\!I_{(0,\infty)}(x)$. Then we obtain for $y > 0$ the score function
$$L(y) = -\Bigl(1 + y\,\frac{f'(y)}{f(y)}\Bigr) = y - 1.$$

Exercise 8.4.2. Show that for i.i.d. random variables $Y_1, \ldots, Y_n$ with $Y_1 \sim \mathrm{Exp}(1)$, it holds that
$$I\!E[Y_{i:n}] = \sum_{j=1}^i \frac{1}{n+1-j}.$$
Making use of the latter result, exact scores are given by
$$a(i) = \sum_{j=1}^i \frac{1}{n+1-j} - 1.$$

Since $X$ is almost surely positive, the model $Y = \exp(\theta)X$ can be transformed into the location parameter model $\log(Y) = \theta + \log(X)$. For $X \sim \mathrm{Exp}(1)$, it holds that $\log(X)$ possesses a reflected Gumbel distribution, satisfying
$$I\!P\bigl(\log(X) \le x\bigr) = 1 - \exp(-\exp(x)), \qquad x > 0.$$

8.4.3 Permutation tests

Permutation tests can be regarded as special instances of rank tests for k -sample problems.


Example 8.4.5 (Two-sample problem in a Gaussian location parameter model). Let $(Y_i)_{1\le i\le n}$ denote real-valued, stochastically independent random variables, where $Y_1, \ldots, Y_{n_1}$ are i.i.d. with $Y_1 \sim F_1$ and $Y_{n_1+1}, \ldots, Y_n$ are i.i.d. with $Y_{n_1+1} \sim F_2$. Assume that the test problem of interest is given by
$$H_0: \{F_1 = F_2\} \quad \text{versus} \quad H_1: \{F_1 \ne F_2\}. \qquad (8.11)$$

In the special case that $F_1$ and $F_2$ are Gaussian c.d.f.s which only differ in their means, one would compare the empirical group means to carry out a test for problem (8.11). More specifically, we let $n_2 \stackrel{\mathrm{def}}{=} n - n_1$ and define group means by $\bar Y_{n_1} := n_1^{-1}\sum_{i=1}^{n_1} Y_i$ and $\bar Y_{n_2} := n_2^{-1}\sum_{j=n_1+1}^{n} Y_j$, assuming that $0 < n_1 < n$. The test statistic of the resulting two-sample $Z$-test is then given by $\widetilde T \stackrel{\mathrm{def}}{=} \bigl|\bar Y_{n_1} - \bar Y_{n_2}\bigr|$, and the test for (8.11) can easily be calibrated by noticing that $\bar Y_{n_1} - \bar Y_{n_2}$ is again normally distributed under $H_0$.

However, in the case of general F1 and F2 , exact distributional resultsfor T are difficult to obtain. Assuming that F1 and F2 are continuous, weconsider more general statistics of the form

T =

n∑i=1

cig(Yi) =

n∑i=1

cDi(Y )g(Yi:n) (8.12)

for a given function g : IR→ IR and real numbers (ci)1≤i≤n .The representation of T on the right-hand side of (8.12) establishes the

connection to rank tests. For example, |T | equals T if we choose g = id ,ci = n−1

1 for i ≤ n1 and ci = −n−12 for i > n1 . Under H0 from (8.11),

the antiranks D(Y ) = (Di(Y )))1≤i≤n and the order statistics (Yi:n)1≤i≤nare stochastically independent, see Theorem 8.4.5. Due to this property, thetwo-sample permutation test based on T can be carried out according to thefollowing resampling scheme.

Example 8.4.6 (Resampling scheme for a two-sample permutation test.). Thefollowing resampling scheme is appropriate for a one-sided ”stochasticallylarger” alternative. The two-sided case is obtained by obvious modifications.

(A) Consider the order statistics (Yi:n)1≤i≤n and regard a(i)def= g(Yi:n) as

random scores.(B) Denote by D = (Di)1≤i≤n a random vector which is uniformly distributed

on Sn and let c = c(α, (Yi:n)1≤i≤n) denote the (1 − α) -quantile of the

discretely distributed random variable D 7→∑ni=1 cDia(i) .

(C) The permutation test φ for testing (8.11) is then given by

φ =

1, T > c,

γ, T = c,

0 T < c,

Page 274: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

274 8 Some other testing methods

where γ ∈ [0, 1] denotes a randomization constant.

Remark 8.4.3. If we choose g = id and (cj)1≤j≤n as in Example 8.4.5, leading

to |T | = T , then the test φ from Example 8.4.6 is called Pitman’s permuta-tion test, see Pitman (1937).

The permutation test principle can be adapted to test the more general nullhypothesis

H0 : Y1, . . . , Yn are i.i.d. (8.13)

In the generalized form, the Yj : 1 ≤ j ≤ n are not even restricted to bereal-valued. The modified resampling scheme is given as follows.

Example 8.4.7 (Modified resampling scheme for general permutation tests.).

(A) Consider n random variates Yj , 1 ≤ j ≤ n with values in some space Y

and a real-valued test statistic T = T (Y1, . . . , Yn) .(B) In the remainder, consider permutations π with values in Sn which are

independent of Y1, . . . , Yn .(C) Denote by Q0 the uniform distribution on Sn and let c = c(Y1, . . . , Yn)

denote the (1 − α) -quantile of t 7→ Q0({π ∈ Sn : T (Yπ(1), . . . , Yπ(n)) ≤t}) .

(D) The modified permutation test φ for testing (8.13) is then given by

φ =

1, T > c,

γ, T = c,

0, T < c.

Theorem 8.4.6. Under the respective assumptions, the permutation test φdefined in Example 8.4.6 and the modified permutation test φ defined in Ex-ample 8.4.7 are under the null hypothesis H0 from (8.11) or (8.13), respec-tively, tests of exact level α for any fixed n ∈ N .

Proof. Conditional to the order statistics (Example 8.4.6) or to the data them-selves (Example 8.4.7), the critical value c and the randomization constantγ are chosen such that

IEL(D)[ϕ∣∣Y = y] = IEQ0

[ϕ∣∣Y = y] = α

holds true. Furthermore, the antiranks D(Y ) are under H0 from (8.11)stochastically independent of the order statistics. Analogously, the randompermutations π are chosen stochastically independent of (Y1, . . . , Yn) in the

case of φ . The result of the theorem follows by averaging with respect to thedistribution of Y .

Page 275: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

8.5 Historical remarks and further reading 275

8.5 Historical remarks and further reading

The Kolmogorov-Smirnov test goes back to Kolmogorov (1933) and Smirnov(1948). The origins of the ω2 test can be traced back to Cramer (1928) andthe German lecture notes by von Mises (1931). The limiting ω2 distributionhas been derived in the work by Smirnov (1937). A comprehensive resourcefor (nonparametric) goodness-of-fit tests is the book edited by D’Agostino andStephens (1986).

The concept of Bayes factors goes back to Jeffreys (1935) and is treatedcomprehensively by Kass and Raftery (1995). Bayesian approaches to hy-pothesis testing are discussed in Section 4.3.3 of Berger (1985); see also thereferences therein.

Our treatment of score tests mainly follows Janssen (1998). The theoryof rank tests is developed in the textbook by Hajek and Sidak (1967). Theclassical reference for permutation tests is Pitman (1937). Recent textbookand monograph references on the subject are Good (2005), Edgington andOnghena (2007), and Pesarin and Salmaso (2010).

Page 276: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April
Page 277: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

9

Deviation probability for quadratic forms

9.1 Introduction

This chapter presents a number of deviation probability bounds for a quadraticform ‖ξ‖2 or more generally ‖Bξ‖2 of a random p vector ξ satisfying a gen-eral exponential moment condition. Such quadratic forms arise in many ap-plications. Baraud (2010) lists some statistical tasks relying on such deviationbounds including hypothesis testing for linear models or linear model selec-tion. We also refer to Massart (2007) for an extensive overview and numerousresults on probability bounds and their applications in statistical model se-lection. Limit theorems for quadratic forms can be found e.g. in Gotze andTikhomirov (1999) and Horvath and Shao (1999). Some concentration boundsfor U-statistics are available in Bretagnolle (1999), Gine et al. (2000), Houdreand Reynaud-Bouret (2003). Most of results assumes that the components ofthe vector ξ are independent and bounded.

Hsu et al. (2012) study the tail behavior of the quadratic form under thecondition of sub-Gaussianity of the random vector ξ and show that the de-viation probability are essentially the same as in the Gaussian case. However,the assumption that the vector ξ has finite exponential moments of arbitraryorder is quite strict and is not fulfilled in many applications. A particularexample is given by the Poisson and exponential cases. In the present workwe only suppose that some exponential moments of ξ are finite. This makesthe problem much more involved and requires new approaches and tools.

If ξ is standard normal then ‖ξ‖2 is chi-squared with p degrees of free-dom. We aim to extend this behavior to the case of a general vector ξ satis-fying the following exponential moment condition:

log IE exp(γ>ξ

)≤ ‖γ‖2/2, γ ∈ IRp, ‖γ‖ ≤ g. (9.1)

Here g is a positive constant which appears to be very important in ourresults. Namely, it determines the frontier between the Gaussian and non-Gaussian type deviation bounds. Our first result shows that under (9.1) the

Page 278: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

278 9 Deviation probability for quadratic forms

deviation bounds for the quadratic form ‖ξ‖2 are essentially the same as inthe Gaussian case, if the value g2 exceeds Cp for a fixed constant C . Furtherwe extend the result to the case of a more general form ‖Bξ‖2 . An importantadvantage of the presented approach which makes it different from all theprevious studies is that there is no any additional conditions on the structureor origin of the vector ξ . For instance, we do not assume that ξ is a sum ofindependent or weakly dependent random variables, or components of ξ areindependent. The results are exact stated in a non-asymptotic fashion, all theconstants are explicit and the leading terms are sharp.

As a motivating example, we consider a linear regression model Y =Ψ>θ∗+ε in which Y is a n -vector of observations, ε is the vector of errorswith zero mean, and Ψ is a p× n design matrix. The ordinary least squareestimator θ for the parameter vector θ∗ ∈ IRp reads as

θ =(ΨΨ>

)−1ΨY

and it can be viewed as the maximum likelihood estimator in a Gaussianlinear model with a diagonal covariance matrix, that is, Y ∼ N(Ψ>θ, σ2IIn) .Define the p× p matrix

D20

def= ΨΨ>,

Then

D0(θ − θ∗) = D−10 ζ

with ζdef= Ψε . The likelihood ratio test statistic for this problem is exactly

‖D−10 ζ‖2/2 . Similarly, the model selection procedure is based on comparing

such quadratic forms for different matrices D0 ; see e.g. Baraud (2010).Now we indicate how this situation can be reduced to a bound for a vector

ξ satisfying the condition (9.1). Suppose for simplicity that the entries εi ofthe error vector ε are independent and have exponential moments.

(e1) There exist some constants ν0 and g1 > 0 , and for every i a constant

si such that IE(εi/si

)2 ≤ 1 and

log IE exp(λεi/si

)≤ ν2

0λ2/2, |λ| ≤ g1. (9.2)

Here g1 is a fixed positive constant. One can show that if this conditionis fulfilled for some g1 > 0 and a constant ν0 ≥ 1 , then one can get a similarcondition with ν0 arbitrary close to one and g1 slightly decreased. A naturalcandidate for si is σi where σ2

i = IEε2i is the variance of εi . Under (9.2),

introduce a p× p matrix V0 defined by

V 20

def=∑

s2iΨiΨ>i ,

Page 279: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

9.2 Gaussian case 279

where Ψ1, . . . ,Ψn ∈ IRp are the columns of the matrix Ψ . Define also

ξ = V −10 Ψε,

N−1/2 def= max

isupγ∈IRp

si|Ψ>i γ|‖V0γ‖

.

Simple calculation shows that for ‖γ‖ ≤ g = g1N1/2

log IE exp(γ>ξ

)≤ ν2

0‖γ‖2/2, γ ∈ IRp, ‖γ‖ ≤ g.

We conclude that (9.1) is nearly fulfilled under (e1) and moreover, the valueg2 is proportional to the effective sample size N . The results below allow toget a nearly χ2 -behavior of the test statistic ‖ξ‖2 which is a finite sampleversion of the famous Wilks phenomenon; see e.g. Fan et al. (2001); Fan andHuang (2005), Boucheron and Massart (2011).

Section 9.2 reminds the classical results about deviation probability of aGaussian quadratic form. These results are presented only for comparison andto make the presentation selfcontained.

Section 9.3 studies the probability of the form IP(‖ξ‖ > y

)under the

condition

log IE exp(γ>ξ

)≤ ν2

0‖γ‖2/2, γ ∈ IRp, ‖γ‖ ≤ g.

The general case can be reduced to ν0 = 1 by rescaling ξ and g :

log IE exp(γ>ξ/ν0

)≤ ‖γ‖2/2, γ ∈ IRp, ‖γ‖ ≤ ν0g

that is, ν−10 ξ fulfills (9.1) with a slightly increased g .

The obtained result is extended to the case of a general quadratic formin Section 9.4. Some more extensions motivated by different statistical prob-lems are given in Section 9.6 and Section 9.7. They include the bound withsup-norm constraint and the bound under Bernstein conditions. Among thestatistical problems demanding such bounds is estimation of the regressionmodel with Poissonian or bounded random noise. More examples can be foundin Baraud (2010). All the proofs are collected in Section 9.8.

9.2 Gaussian case

Our benchmark will be a deviation bound for ‖ξ‖2 for a standard Gaussianvector ξ . The ultimate goal is to show that under (9.1) the norm of the vectorξ exhibits behavior expected for a Gaussian vector, at least in the region ofmoderate deviations. For the reason of comparison, we begin by stating theresult for a Gaussian vector ξ . We use the notation a ∨ b for the maximumof a and b , while a ∧ b = min{a, b} .

Page 280: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

280 9 Deviation probability for quadratic forms

Theorem 9.2.1. Let ξ be a standard normal vector in IRp . Then for anyu > 0 , it holds

IP(‖ξ‖2 > p+ u

)≤ exp

{−(p/2)φ(u/p)

]}with

φ(t)def= t− log(1 + t).

Let φ−1(·) stand for the inverse of φ(·) . For any x ,

IP(‖ξ‖2 > p+ p φ−1(2x/p)

)≤ exp(−x).

This particularly yields with κ = 6.6

IP(‖ξ‖2 > p+

√κxp ∨ (κx)

)≤ exp(−x).

This is a simple version of a well known result and we present it only forcomparison with the non-Gaussian case. The message of this result is that thesquared norm of the Gaussian vector ξ concentrates around the value p andits deviation over the level p+

√xp is exponentially small in x .

A similar bound can be obtained for a norm of the vector Bξ where Bis some given deterministic matrix. For notational simplicity we assume thatB is symmetric. Otherwise one should replace it with (B>B)1/2 .

Theorem 9.2.2. Let ξ be standard normal in IRp . Then for every x > 0and any symmetric matrix B , it holds with p = tr(B2) , v2 = 2 tr(B4) , anda∗ = ‖B2‖∞

IP(‖Bξ‖2 > p + (2vx1/2) ∨ (6a∗x)

)≤ exp(−x).

Below we establish similar bounds for a non-Gaussian vector ξ obeying(9.1).

9.3 A bound for the `2 -norm

This section presents a general exponential bound for the probability IP(‖ξ‖ >

y)

under (9.1). The main result tells us that if y is not too large, namely ify ≤ yc with y2

c � g2 , then the deviation probability is essentially the sameas in the Gaussian case.

To describe the value yc , introduce the following notation. Given g andp , define the values w0 = gp−1/2 and wc by the equation

wc(1 + wc)

(1 + w2c )

1/2= w0 = gp−1/2. (9.3)

Page 281: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

9.3 A bound for the `2 -norm 281

It is easy to see that w0/√

2 ≤ wc ≤ w0 . Further define

µcdef= w2

c/(1 + w2c )

ycdef=√

(1 + w2c )p,

xcdef= 0.5p

[w2c − log

(1 + w2

c

)]. (9.4)

Note that for g2 ≥ p , the quantities yc and xc can be evaluated as y2c ≥

w2cp ≥ g2/2 and xc & pw2

c/2 ≥ g2/4 .

Theorem 9.3.1. Let ξ ∈ IRp fulfill (9.1). Then it holds for each x ≤ xc

IP(‖ξ‖2 > p+

√κxp ∨ (κx), ‖ξ‖ ≤ yc

)≤ 2 exp(−x),

where κ = 6.6 . Moreover, for y ≥ yc , it holds with gc = g − √µcp =gwc/(1 + wc)

IP(‖ξ‖ > y

)≤ 8.4 exp

{−gcy/2− (p/2) log(1− gc/y)

}≤ 8.4 exp

{−xc − gc(y− yc)/2

}.

The statements of Theorem 9.4.1 can be simplified under the assumptiong2 ≥ p .

Corollary 9.3.1. Let ξ fulfill (9.1) and g2 ≥ p . Then it holds for x ≤ xc

IP(‖ξ‖2 ≥ z(x, p)

)≤ 2e−x + 8.4e−xc , (9.5)

z(x, p)def=

{p+√κxp, x ≤ p/κ,

p+ κx p/κ < x ≤ xc,(9.6)

with κ = 6.6 . For x > xc

IP(‖ξ‖2 ≥ zc(x, p)

)≤ 8.4e−x, zc(x, p)

def=∣∣yc + 2(x− xc)/gc

∣∣2.This result implicitly assumes that p ≤ κxc which is fulfilled if w2

0 =g2/p ≥ 1 :

κxc = 0.5κ[w2

0 − log(1 + w20)]p ≥ 3.3

[1− log(2)

]p > p.

For x ≤ xc , the function z(x, p) mimics the quantile behavior of the chi-squared distribution χ2

p with p degrees of freedom. Moreover, increase of thevalue g yields a growth of the sub-Gaussian zone. In particular, for g =∞ ,a general quadratic form ‖ξ‖2 has under (9.1) the same tail behavior as inthe Gaussian case.

Page 282: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

282 9 Deviation probability for quadratic forms

Finally, in the large deviation zone x > xc the deviation probability decays

as e−cx1/2

for some fixed c . However, if the constant g in the condition (9.1) issufficiently large relative to p , then xc is large as well and the large deviationzone x > xc can be ignored at a small price of 8.4e−xc and one can focus onthe deviation bound described by (9.5) and (9.6).

9.4 A bound for a quadratic form

Now we extend the result to more general bound for ‖Bξ‖2 = ξ>B2ξ with agiven matrix B and a vector ξ obeying the condition (9.1). Similarly to theGaussian case we assume that B is symmetric. Define important character-istics of B

p = tr(B2), v2 = 2 tr(B4), λBdef= ‖B2‖∞

def= λmax(B2).

For simplicity of formulation we suppose that λB = 1 , otherwise one has toreplace p and v2 with p/λB and v2/λB .

Let g be shown in (9.1). Define similarly to the `2 -case wc by the equation

wc(1 + wc)

(1 + w2c )

1/2= gp−1/2.

Define also µc = w2c/(1 + w2

c ) ∧ 2/3 . Note that w2c ≥ 2 implies µc = 2/3 .

Further define

y2c = (1 + w2

c )p, 2xc = µcy2c + log det{IIp − µcB2}. (9.7)

Similarly to the case with B = IIp , under the condition g2 ≥ p , one canbound y2

c ≥ g2/2 and xc & g2/4 .

Theorem 9.4.1. Let a random vector ξ in IRp fulfill (9.1). Then for eachx < xc

IP(‖Bξ‖2 > p + (2vx1/2) ∨ (6x), ‖Bξ‖ ≤ yc

)≤ 2 exp(−x).

Moreover, for y ≥ yc , with gc = g−√µcp = gwc/(1 + wc) , it holds

IP(‖Bξ‖ > y

)≤ 8.4 exp

(−xc − gc(y− yc)/2

).

Now we describe the value z(x, B) ensuring a small value for the large de-viation probability IP

(‖Bξ‖2 > z(x, B)

). For ease of formulation, we suppose

that g2 ≥ 2p yielding µ−1c ≤ 3/2 . The other case can be easily adjusted.

Page 283: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

9.5 Rescaling and regularity condition 283

Corollary 9.4.1. Let ξ fulfill (9.1) with g2 ≥ 2p . Then it holds for x ≤ xcwith xc from (9.7):

IP(‖Bξ‖2 ≥ z(x, B)

)≤ 2e−x + 8.4e−xc ,

z(x, B)def=

{p + 2vx1/2, x ≤ v/18,

p + 6x v/18 < x ≤ xc.(9.8)

For x > xc

IP(‖Bξ‖2 ≥ zc(x, B)

)≤ 8.4e−x, zc(x, B)

def=∣∣yc + 2(x− xc)/gc

∣∣2.9.5 Rescaling and regularity condition

The result of Theorem 9.4.1 can be extended to a more general situation whenthe condition (9.1) is fulfilled for a vector ζ rescaled by a matrix V0 . Moreprecisely, let the random p -vector ζ fulfills for some p × p matrix V0 thecondition

supγ∈IRp

log IE exp(λγ>ζ

‖V0γ‖

)≤ ν2

0λ2/2, |λ| ≤ g, (9.9)

with some constants g > 0 , ν0 ≥ 1 . Again, a simple change of variablesreduces the case of an arbitrary ν0 ≥ 1 to ν0 = 1 . Our aim is to boundthe squared norm ‖D−1

0 ζ‖2 of a vector D−10 ζ for another p × p positive

symmetric matrix D20 . Note that condition (9.9) implies (9.1) for the rescaled

vector ξ = V −10 ζ . This leads to bounding the quadratic form ‖D−1

0 V0ξ‖2 =‖Bξ‖2 with B2 = D−1

0 V 20 D−10 . It obviously holds

p = tr(B2) = tr(D−20 V 2

0 ).

Now we can apply the result of Corollary 9.4.1.

Corollary 9.5.1. Let ζ fulfill (9.9) with some V0 and g . Given D0 , defineB2 = D−1

0 V 20 D−10 , and let g2 ≥ 2p . Then it holds for x ≤ xc with xc from

(9.7):

IP(‖D−1

0 ζ‖2 ≥ z(x, B))≤ 2e−x + 8.4e−xc ,

with z(x, B) from (9.8). For x > xc

IP(‖D−1

0 ζ‖2 ≥ zc(x, B))≤ 8.4e−x, zc(x, B)

def=∣∣yc + 2(x− xc)/gc

∣∣2.In the regular case with D0 ≥ aV0 for some a > 0 , one obtains ‖B‖∞ ≤

a−1 and

v2 = 2 tr(B4) ≤ 2a−2p.

Page 284: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

284 9 Deviation probability for quadratic forms

9.6 A chi-squared bound with norm-constraints

This section extends the results to the case when the bound (9.1) requires someother conditions than the `2 -norm of the vector γ . Namely, we suppose that

log IE exp(γ>ξ

)≤ ‖γ‖2/2, γ ∈ IRp, ‖γ‖◦ ≤ g◦, (9.10)

where ‖ · ‖◦ is a norm which differs from the usual Euclidean norm. Ourdriving example is given by the sup-norm case with ‖γ‖◦ ≡ ‖γ‖∞ . We areinterested to check whether the previous results of Section 9.3 still apply. Theanswer depends on how massive the set A(r) = {γ : ‖γ‖◦ ≤ r} is in terms ofthe standard Gaussian measure on IRp . Recall that the quadratic norm ‖ε‖2of a standard Gaussian vector ε in IRp concentrates around p at least forp large. We need a similar concentration property for the norm ‖ · ‖◦ . Moreprecisely, we assume for a fixed r∗ that

IP(‖ε‖◦ ≤ r∗

)≥ 1/2, ε ∼ N(0, IIp). (9.11)

This implies for any value u◦ > 0 and all u ∈ IRp with ‖u‖◦ ≤ u◦ that

IP(‖ε− u‖◦ ≤ r∗ + u◦

)≥ 1/2, ε ∼ N(0, IIp).

For each z > p , consider

µ(z) = (z− p)/z.

Given u◦ , denote by z◦ = z◦(u◦) the root of the equation

g◦

µ(z◦)− r∗µ1/2(z◦)

= u◦. (9.12)

One can easily see that this value exists and unique if u◦ ≥ g◦ − r∗ and itcan be defined as the largest z for which g◦

µ(z) −r∗

µ1/2(z)≥ u◦ . Let µ◦ = µ(z◦)

be the corresponding µ -value. Define also x◦ by

2x◦ = µ◦z◦ + p log(1− µ◦).

If u◦ < g◦ − r∗ , then set z◦ =∞ , x◦ =∞ .

Theorem 9.6.1. Let a random vector ξ in IRp fulfill (9.10). Suppose (9.11)and let, given u◦ , the value z◦ be defined by (9.12). Then it holds for anyu > 0

IP(‖ξ‖2 > p+ u, ‖ξ‖◦ ≤ u◦

)≤ 2 exp

{−(p/2)φ(u)

]}. (9.13)

yielding for x ≤ x◦

Page 285: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

9.7 A bound for the `2 -norm under Bernstein conditions 285

IP(‖ξ‖2 > p+

√κxp ∨ (κx), ‖ξ‖◦ ≤ u◦

)≤ 2 exp(−x), (9.14)

where κ = 6.6 . Moreover, for z ≥ z◦ , it holds

IP(‖ξ‖2 > z, ‖ξ‖◦ ≤ u◦

)≤ 2 exp

{−µ◦z/2− (p/2) log(1− µ◦)

}= 2 exp

{−x◦ − g◦(z− z◦)/2

}.

It is easy to check that the result continues to hold for the norm of Πξfor a given sub-projector Π in IRp satisfying Π = Π> , Π2 ≤ Π . As above,

denote pdef= tr(Π2) , v2 def

= 2 tr(Π4) . Let r∗ be fixed to ensure

IP(‖Πε‖◦ ≤ r∗

)≥ 1/2, ε ∼ N(0, IIp).

The next result is stated for g◦ ≥ r∗ + u◦ , which simplifies the formulation.

Theorem 9.6.2. Let a random vector ξ in IRp fulfill (9.10) and Π followsΠ = Π> , Π2 ≤ Π . Let some u◦ be fixed. Then for any µ◦ ≤ 2/3 with

g◦µ−1◦ − r∗µ

−1/2◦ ≥ u◦ ,

IE exp{µ◦

2(‖Πξ‖2 − p)

}1I(‖Π2ξ‖◦ ≤ u◦

)≤ 2 exp(µ2

◦v2/4), (9.15)

where v2 = 2 tr(Π4) . Moreover, if g◦ ≥ r∗ + u◦ , then for any z ≥ 0

IP(‖Πξ‖2 > z, ‖Π2ξ‖◦ ≤ u◦

)≤ IP

(‖Πξ‖2 > p + (2vx1/2) ∨ (6x), ‖Π2ξ‖◦ ≤ u◦

)≤ 2 exp(−x).

9.7 A bound for the `2 -norm under Bernstein conditions

For comparison, we specify the results to the case considered recently in Ba-raud (2010). Let ζ be a random vector in IRn whose components ζi areindependent and satisfy the Bernstein type conditions: for all |λ| < c−1

log IEeλζi ≤ λ2σ2

1− c|λ|. (9.16)

Denote ξ = ζ/(2σ) and consider ‖γ‖◦ = ‖γ‖∞ . Fix g◦ = σ/c . If ‖γ‖◦ ≤ g◦ ,then 1− cγi/(2σ) ≥ 1/2 and

log IE exp(γ>ξ

)≤∑i

log IE exp(γiζi

)≤∑i

|γi/(2σ)|2σ2

1− cγi/(2σ)≤ ‖γ‖2/2.

Let also S be some linear subspace of IRn with dimension p and ΠS denotethe projector on S . For applying the result of Theorem 9.6.1, the value r∗

Page 286: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

286 9 Deviation probability for quadratic forms

has to be fixed. We use that the infinity norm ‖ε‖∞ concentrates around√2 log p .

Lemma 9.7.1. It holds for a standard normal vector ε ∈ IRp with r∗ =√2 log p

IP(‖ε‖◦ ≤ r∗

)≥ 1/2.

Indeed

IP(‖ε‖◦ > r∗

)≤ IP

(‖ε‖∞ >

√2 log p

)≤ pIP

(|ε1| >

√2 log p

)≤ 1/2.

Now the general bound of Theorem 9.6.1 is applied to bounding the normof ‖ΠSξ‖ . For simplicity of formulation we assume that g◦ ≥ u◦ + r∗ .

Theorem 9.7.1. Let S be some linear subspace of IRn with dimension p .Let g◦ ≥ u◦ + r∗ . If the coordinates ζi of ζ are independent and satisfy(9.16), then for all x ,

IP((4σ2)−1‖ΠSζ‖2 > p +

√κxp ∨ (κx), ‖ΠSζ‖∞ ≤ 2σu◦)≤ 2 exp(−x),

The bound of Baraud (2010) reads

IP

(‖ΠSζ‖2 >

(3σ ∨

√6cu)√

x + 3p, ‖ΠSζ‖∞ ≤ 2σu◦

)≤ e−x.

As expected, in the region x ≤ xc of Gaussian approximation, the bound ofBaraud is not sharp and actually quite rough.

9.8 Proofs

Proof of Theorem 9.2.1

The proof utilizes the following well known fact, which can be obtained bystraightforward calculus : for µ < 1

log IE exp(µ‖ξ‖2/2

)= −0.5p log(1− µ).

Now consider any u > 0 . By the exponential Chebyshev inequality

IP(‖ξ‖2 > p+ u

)≤ exp

{−µ(p+ u)/2

}IE exp

(µ‖ξ‖2/2

)(9.17)

= exp{−µ(p+ u)/2− (p/2) log(1− µ)

}.

It is easy to see that the value µ = u/(u+p) maximizes µ(p+u)+p log(1−µ)w.r.t. µ yielding

Page 287: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

9.8 Proofs 287

µ(p+ u) + p log(1− µ) = u− p log(1 + u/p).

Further we use that x− log(1+x) ≥ a0x2 for x ≤ 1 and x− log(1+x) ≥ a0x

for x > 1 with a0 = 1 − log(2) ≥ 0.3 . This implies with x = u/p foru =√κxp or u = κx and κ = 2/a0 < 6.6 that

IP(‖ξ‖2 ≥ p+

√κxp ∨ (κx)

)≤ exp(−x)

as required.

Proof of Theorem 9.2.2

The matrix B2 can be represented as U> diag(a1, . . . , ap)U for an orthog-

onal matrix U . The vector ξ = Uξ is also standard normal and ‖Bξ‖2 =

ξ>UB2U>ξ . This means that one can reduce the situation to the case of a

diagonal matrix B2 = diag(a1, . . . , ap) . We can also assume without loss ofgenerality that a1 ≥ a2 ≥ . . . ≥ ap . The expressions for the quantities p andv2 simplifies to

p = tr(B2) = a1 + . . .+ ap,

v2 = 2 tr(B4) = 2(a21 + . . .+ a2

p).

Moreover, rescaling the matrix B2 by a1 reduces the situation to the casewith a1 = 1 .

Lemma 9.8.1. It holds

IE‖Bξ‖2 = tr(B2), Var(‖Bξ‖2

)= 2 tr(B4).

Moreover, for µ < 1

IE exp{µ‖Bξ‖2/2

}= det(1− µB2)−1/2 =

p∏i=1

(1− µai)−1/2. (9.18)

Proof. If B2 is diagonal, then ‖Bξ‖2 =∑i aiξ

2i and the summands aiξ

2i are

independent. It remains to note that IE(aiξ2i ) = ai , Var(aiξ

2i ) = 2a2

i , andfor µai < 1 ,

IE exp{µaiξ

2i /2}

= (1− µai)−1/2

yielding (9.18).

Page 288: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

288 9 Deviation probability for quadratic forms

Given u , fix µ < 1 . The exponential Markov inequality yields

IP(‖Bξ‖2 > p + u

)≤ exp

{−µ(p + u)

2

}IE exp

(µ‖Bξ‖22

)≤ exp

{−µu

2− 1

2

p∑i=1

[µai + log

(1− µai

)]}.

We start with the case when x1/2 ≤ v/3 . Then u = 2x1/2v fulfills u ≤ 2v2/3 .Define µ = u/v2 ≤ 2/3 and use that t+ log(1− t) ≥ −t2 for t ≤ 2/3 . Thisimplies

IP(‖Bξ‖2 > p + u

)≤ exp

{−µu

2+

1

2

p∑i=1

µ2a2i

}= exp

(−u2/(4v2)

)= e−x. (9.19)

Next, let x1/2 > v/3 . Set µ = 2/3 . It holds similarly to the above

p∑i=1

[µai + log

(1− µai

)]≥ −

p∑i=1

µ2a2i ≥ −2v2/9 ≥ −2x.

Now, for u = 6x and µu/2 = 2x , (9.19) implies

IP(‖Bξ‖2 > p + u

)≤ exp

{−(2x− x

)}= exp(−x)

as required.

Proof of Theorem 9.3.1

The main step of the proof is the following exponential bound.

Lemma 9.8.2. Suppose (9.1). For any µ < 1 with g2 > pµ , it holds

IE exp(µ‖ξ‖2

2

)1I(‖ξ‖ ≤ g/µ−

√p/µ

)≤ 2(1− µ)−p/2. (9.20)

Proof. Let ε be a standard normal vector in IRp and u ∈ IRp . The boundIP(‖ε‖2 > p

)≤ 1/2 and the triangle inequality imply for any vector u and

any r with r ≥ ‖u‖+ p1/2 that IP(‖u+ ε‖ ≤ r

)≥ 1/2 . Let us fix some ξ

with ‖ξ‖ ≤ g/µ−√p/µ and denote by IPξ the conditional probability given

ξ . The previous arguments yield:

IPξ(‖ε+ µ1/2ξ‖ ≤ µ−1/2g

)≥ 0.5.

Page 289: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

9.8 Proofs 289

It holds with cp = (2π)−p/2

cp

∫exp(γ>ξ − ‖γ‖

2

)1I(‖γ‖ ≤ g)dγ

= cp exp(µ‖ξ‖2/2

) ∫exp(−1

2

∥∥µ−1/2γ − µ1/2ξ∥∥2)

1I(µ−1/2‖γ‖ ≤ µ−1/2g)dγ

= µp/2 exp(µ‖ξ‖2/2

)IPξ(‖ε+ µ1/2ξ‖ ≤ µ−1/2g

)≥ 0.5µp/2 exp

(µ‖ξ‖2/2

),

because ‖µ1/2ξ‖+ p1/2 ≤ µ−1/2g . This implies in view of p < g2/µ that

exp(µ‖ξ‖2/2

)1I(‖ξ‖2 ≤ g/µ−

√p/µ

)≤ 2µ−p/2cp

∫exp(γ>ξ − ‖γ‖

2

)1I(‖γ‖ ≤ g)dγ.

Further, by (9.1)

cpIE

∫exp(γ>ξ − 1

2µ‖γ‖2

)1I(‖γ‖ ≤ g)dγ

≤ cp

∫exp(−µ−1 − 1

2‖γ‖2

)1I(‖γ‖ ≤ g)dγ

≤ cp

∫exp(−µ−1 − 1

2‖γ‖2

)dγ

≤ (µ−1 − 1)−p/2

and (9.20) follows.

Due to this result, the scaled squared norm µ‖ξ‖2/2 after a proper trun-cation possesses the same exponential moments as in the Gaussian case. Astraightforward implication is the probability bound IP

(‖ξ‖2 > p + u

)for

moderate values u . Namely, given u > 0 , define µ = u/(u + p) . This valueoptimizes the inequality (9.17) in the Gaussian case. Now we can apply a sim-ilar bound under the constraints ‖ξ‖ ≤ g/µ −

√p/µ . Therefore, the bound

is only meaningful if√u+ p ≤ g/µ −

√p/µ with µ = u/(u + p) , or, with

w =√u/p ≤ wc ; see (9.3).

The largest value u for which this constraint is still valid, is given byp+ u = y2

c . Hence, (9.20) yields for p+ u ≤ y2c

Page 290: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

290 9 Deviation probability for quadratic forms

IP(‖ξ‖2 > p+ u, ‖ξ‖ ≤ yc

)≤ exp

{−µ(p+ u)

2

}IE exp

(µ‖ξ‖22

)1I(‖ξ‖ ≤ g/µ−

√p/µ

)≤ 2 exp

{−0.5

[µ(p+ u) + p log(1− µ)

]}= 2 exp

{−0.5

[u− p log(1 + u/p)

]}.

Similarly to the Gaussian case, this implies with κ = 6.6 that

IP(‖ξ‖ ≥ p+

√κxp ∨ (κx), ‖ξ‖ ≤ yc

)≤ 2 exp(−x).

The Gaussian case means that (9.1) holds with g =∞ yielding yc =∞ . Inthe non-Gaussian case with a finite g , we have to accompany the moderatedeviation bound with a large deviation bound IP

(‖ξ‖ > y

)for y ≥ yc . This

is done by combining the bound (9.20) with the standard slicing arguments.

Lemma 9.8.3. Let µ0 ≤ g2/p . Define y0 = g/µ0−√p/µ0 and g0 = µ0y0 =

g−√µ0p . It holds for y ≥ y0

IP(‖ξ‖ > y

)≤ 8.4(1− g0/y)−p/2 exp

(−g0y/2

)(9.21)

≤ 8.4 exp{−x0 − g0(y− y0)/2

}. (9.22)

with x0 defined by

2x0 = µ0y20 + p log(1− µ0) = g2/µ0 − p+ p log(1− µ0).

Proof. Consider the growing sequence yk with y1 = y and g0yk+1 = g0y+k .Define also µk = g0/yk . In particular, µk ≤ µ1 = g0/y . Obviously

IP(‖ξ‖ > y

)=

∞∑k=1

IP(‖ξ‖ > yk, ‖ξ‖ ≤ yk+1

).

Now we try to evaluate every slicing probability in this expression. We usethat

µk+1y2k =

(g0y + k − 1)2

g0y + k≥ g0y + k − 2,

and also g/µk −√p/µk ≥ yk because g− g0 =

õ0p >

õkp and

g/µk −√p/µk − yk = µ−1

k (g−√µkp− g0) ≥ 0.

Hence by (9.20)

Page 291: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

9.8 Proofs 291

IP(‖ξ‖ > y

)=

∞∑k=1

IP(‖ξ‖ > yk, ‖ξ‖ ≤ yk+1

)

≤∞∑k=1

exp(−µk+1y

2k

2

)IE exp

(µk+1‖ξ‖2

2

)1I(‖ξ‖ ≤ yk+1

)≤∞∑k=1

2(1− µk+1

)−p/2exp(−µk+1y

2k

2

)

≤ 2(1− µ1

)−p/2 ∞∑k=1

exp(−g0y + k − 2

2

)= 2e1/2(1− e−1/2)−1(1− µ1)−p/2 exp

(−g0y/2

)≤ 8.4(1− µ1)−p/2 exp

(−g0y/2

)and the first assertion follows. For y = y0 , it holds

g0y0 + p log(1− µ0) = µ0y20 + p log(1− µ0) = 2x0

and (9.21) implies IP(‖ξ‖ > y0

)≤ 8.4 exp(−x0) . Now observe that the func-

tion f(y) = g0y/2 + (p/2) log(1−g0/y

)fulfills f(y0) = x0 and f ′(y) ≥ g0/2

yielding f(y) ≥ x0 + g0(y− y0)/2 . This implies (9.22).

The statements of the theorem are obtained by applying the lemmas withµ0 = µc = w2

c/(1 + w2c ) . This also implies y0 = yc , x0 = xc , and g0 = gc =

g−√µcp ; cf. (9.4).

Proof of Theorem 9.4.1

The main steps of the proof are similar to the proof of Theorem 9.3.1.

Lemma 9.8.4. Suppose (9.1). For any µ < 1 with g2/µ ≥ p , it holds

IE exp(µ‖Bξ‖2/2

)1I(‖B2ξ‖ ≤ g/µ−

√p/µ

)≤ 2det(IIp − µB2)−1/2.(9.23)

Proof. With cp(B) =(2π)−p/2

det(B−1)

cp(B)

∫exp(γ>ξ − 1

2µ‖B−1γ‖2

)1I(‖γ‖ ≤ g)dγ

= cp(B) exp(µ‖Bξ‖2

2

)∫exp(−1

2

∥∥µ1/2Bξ − µ−1/2B−1γ∥∥2)

1I(‖γ‖ ≤ g)dγ

= µp/2 exp(µ‖Bξ‖2

2

)IPξ(‖µ−1/2Bε+B2ξ‖ ≤ g/µ

),

Page 292: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

292 9 Deviation probability for quadratic forms

where ε denotes a standard normal vector in IRp and IPξ means the condi-tional probability given ξ . Moreover, for any u ∈ IRp and r ≥ p1/2 + ‖u‖ ,it holds in view of IP

(‖Bε‖2 > p

)≤ 1/2

IP(‖Bε− u‖ ≤ r

)≥ IP

(‖Bε‖ ≤ √p

)≥ 1/2.

This implies

exp(µ‖Bξ‖2/2

)1I(‖B2ξ‖ ≤ g/µ−

√p/µ

)≤ 2µ−p/2cp(B)

∫exp(γ>ξ − 1

2µ‖B−1γ‖2

)1I(‖γ‖ ≤ g)dγ.

Further, by (9.1)

cp(B)IE

∫exp(γ>ξ − 1

2µ‖B−1γ‖2

)1I(‖γ‖ ≤ g)dγ

≤ cp(B)

∫exp(‖γ‖2

2− 1

2µ‖B−1γ‖2

)dγ

≤ det(B−1) det(µ−1B−2 − IIp)−1/2 = µp/2 det(IIp − µB2)−1/2

and (9.23) follows.

Now we evaluate the probability IP(‖Bξ‖ > y

)for moderate values of y .

Lemma 9.8.5. Let µ0 < 1 ∧ (g2/p) . With y0 = g/µ0 −√p/µ0 , it holds for

any u > 0

IP(‖Bξ‖2 > p + u, ‖B2ξ‖ ≤ y0

)≤ 2 exp

{−0.5µ0(p + u)− 0.5 log det(IIp − µ0B

2)}. (9.24)

In particular, if B2 is diagonal, that is, B2 = diag(a1, . . . , ap

), then

IP(‖Bξ‖2 > p + u, ‖B2ξ‖ ≤ y0

)≤ 2 exp

{−µ0u

2− 1

2

p∑i=1

[µ0ai + log

(1− µ0ai

)]}. (9.25)

Proof. The exponential Chebyshev inequality and (9.23) imply

IP(‖Bξ‖2 > p + u, ‖B2ξ‖ ≤ y0

)≤ exp

{−µ0(p + u)

2

}IE exp

(µ0‖Bξ‖2

2

)1I(‖B2ξ‖ ≤ g/µ0 −

√p/µ0

)≤ 2 exp

{−0.5µ0(p + u)− 0.5 log det(IIp − µ0B

2)}.

Page 293: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

9.8 Proofs 293

Moreover, the standard change-of-basis arguments allow us to reduce theproblem to the case of a diagonal matrix B2 = diag

(a1, . . . , ap

)where

1 = a1 ≥ a2 ≥ . . . ≥ ap > 0 . Note that p = a1 + . . . + ap . Then theclaim (9.24) can be written in the form (9.25).

Now we evaluate a large deviation probability that ‖Bξ‖ > y for a largey . Note that the condition ‖B2‖∞ ≤ 1 implies ‖B2ξ‖ ≤ ‖Bξ‖ . So, thebound (9.24) continues to hold when ‖B2ξ‖ ≤ y0 is replaced by ‖Bξ‖ ≤ y0 .

Lemma 9.8.6. Let µ0 < 1 and µ0p < g2 . Define g0 by g0 = g − √µ0p .

For any y ≥ y0def= g0/µ0 , it holds

IP(‖Bξ‖ > y

)≤ 8.4 det{IIp − (g0/y)B2}−1/2 exp

(−g0y/2

).

≤ 8.4 exp(−x0 − g0(y− y0)/2

), (9.26)

where x0 is defined by

2x0 = g0y0 + log det{IIp − (g0/y0)B2}.

Proof. The slicing arguments of Lemma 9.8.3 apply here in the same man-ner. One has to replace ‖ξ‖ by ‖Bξ‖ and (1 − µ1)−p/2 by det{IIp −(g0/y)B2}−1/2 . We omit the details. In particular, with y = y0 = g0/µ ,this yields

IP(‖Bξ‖ > y0

)≤ 8.4 exp(−x0).

Moreover, for the function f(y) = g0y + log det{IIp − (g0/y)B2} , it holdsf ′(y) ≥ g0 and hence, f(y) ≥ f(y0) + g0(y − y0) for y > y0 . This implies(9.26).

One important feature of the results of Lemma 9.8.5 and Lemma 9.8.6is that the value µ0 < 1 ∧ (g2/p) can be selected arbitrarily. In particular,for y ≥ yc , Lemma 9.8.6 with µ0 = µc yields the large deviation probabilityIP(‖Bξ‖ > y

). For bounding the probability IP

(‖Bξ‖2 > p+u, ‖Bξ‖ ≤ yc

),

we use the inequality log(1− t) ≥ −t− t2 for t ≤ 2/3 . It implies for µ ≤ 2/3that

− log IP(‖Bξ‖2 > p + u, ‖Bξ‖ ≤ yc

)≥ µ(p + u) +

p∑i=1

log(1− µai

)≥ µ(p + u)−

p∑i=1

(µai + µ2a2i ) ≥ µu− µ2v2/2. (9.27)

Page 294: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

294 9 Deviation probability for quadratic forms

Now we distinguish between µc = 2/3 and µc < 2/3 starting with µc = 2/3 .The bound (9.27) with µ = 2/3 and with u = (2vx1/2) ∨ (6x) yields

IP(‖Bξ‖2 > p + u, ‖Bξ‖ ≤ yc

)≤ 2 exp(−x);

see the proof of Theorem 9.2.2 for the Gaussian case.Now consider µc < 2/3 . For x1/2 ≤ µcv/2 , use u = 2vx1/2 and µ0 =

u/v2 . It holds µ0 = u/v2 ≤ µc and u2/(4v2) = x yielding the desiredbound by (9.27). For x1/2 > µcv/2 , we select again µ0 = µc . It holds withu = 4µ−1

c x that µcu/2− µ2cv

2/4 ≥ 2x− x = x . This completes the proof.

Proof of Theorem 9.6.1

The arguments behind the result are the same as in the one-norm case ofTheorem 9.3.1. We only outline the main steps.

Lemma 9.8.7. Suppose (9.10) and (9.11). For any µ < 1 with g◦ > µ1/2r∗ ,it holds

IE exp(µ‖ξ‖2/2

)1I(‖ξ‖◦ ≤ g◦/µ− r∗/µ1/2

)≤ 2(1− µ)−p/2. (9.28)

Proof. Let ε be a standard normal vector in IRp and u ∈ IRp . Let us fixsome ξ with µ1/2‖ξ‖◦ ≤ µ−1/2g◦ − r∗ and denote by IPξ the conditionalprobability given ξ . It holds by (9.11) with cp = (2π)−p/2

cp

∫exp(γ>ξ − 1

2µ‖γ‖2

)1I(‖γ‖◦ ≤ g◦)dγ

= cp exp(µ‖ξ‖2/2

) ∫exp(−1

2

∥∥µ1/2ξ − µ−1/2γ∥∥2)

1I(‖µ−1/2γ‖◦ ≤ µ−1/2g◦)dγ

= µp/2 exp(µ‖ξ‖2/2

)IPξ(‖ε− µ1/2ξ‖◦ ≤ µ−1/2g◦

)≥ 0.5µp/2 exp

(µ‖ξ‖2/2

).

This implies

exp(µ‖ξ‖2

2

)1I(‖ξ‖◦ ≤ g◦/µ− r∗/µ1/2

)≤ 2µ−p/2cp

∫exp(γ>ξ − 1

2µ‖γ‖2

)1I(‖γ‖◦ ≤ g◦)dγ.

Further, by (9.10)

cpIE

∫exp(γ>ξ − 1

2µ‖γ‖2

)1I(‖γ‖◦ ≤ g◦)dγ

≤ cp

∫exp(−µ−1 − 1

2‖γ‖2

)dγ ≤ (µ−1 − 1)−p/2

Page 295: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

9.8 Proofs 295

and (9.28) follows.

As in the Gaussian case, (9.28) implies for z > p with µ = µ(z) = (z−p)/zthe bounds (9.13) and (9.14). Note that the value µ(z) clearly grows with zfrom zero to one, while g◦/µ(z)− r∗/µ1/2(z) is strictly decreasing. The valuez◦ is defined exactly as the point where g◦/µ(z)− r∗/µ1/2(z) crosses u◦ , sothat g◦/µ(z)− r∗/µ1/2(z) ≥ u◦ for z ≤ z◦ .

For z > z◦ , the choice µ = µ(y) conflicts with g◦/µ(z)− r∗/µ1/2(z) ≥ u◦ .So, we apply µ = µ◦ yielding by the Markov inequality

IP(‖ξ‖2 > z, ‖ξ‖◦ ≤ u◦

)≤ 2 exp

{−µ◦z/2− (p/2) log(1− µ◦)

},

and the assertion follows.

Proof of Theorem 9.6.2

Arguments from the proof of Lemmas 9.8.4 and 9.8.7 yield in view of g◦µ−1◦ −

r∗µ−1/2◦ ≥ u◦

IE exp{µ◦‖Πξ‖2/2

}1I(‖Π2ξ‖◦ ≤ u◦

)≤ IE exp

(µ◦‖Πξ‖2/2

)1I(‖Π2ξ‖◦ ≤ g◦/µ◦ − p/µ

1/2◦)

≤ 2det(IIp − µ◦Π2)−1/2.

The inequality log(1 − t) ≥ −t − t2 for t ≤ 2/3 and symmetricity of thematrix Π imply

− log det(IIp − µ◦Π2) ≤ µ◦p + µ2◦v

2/2

cf. (9.27); the assertion (9.15) follows.

Page 296: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April
Page 297: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

References

Baraud, Y. (2010). A Bernstein-type inequality for suprema of random pro-cesses with applications to model selection in non-Gaussian regression.Bernoulli, 16(4):1064–1085.

Bayes, T. (1763). An essay towards solving a problem in the doctrineof chances. Philosophical Transactions of the Royal Society of London.,53:370–418.

Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. 2nded. Springer Series in Statistics. New York etc.: Springer-Verlag.

Bernardo, J. M. and Smith, A. F. (1994). Bayesian theory. Chichester: JohnWiley & Sons.

Borokov, A. (1999). Mathematical Statistics. Taylor & Francis.Boucheron, S. and Massart, P. (2011). A high-dimensional Wilks phenomenon.

Probability Theory and Related Fields, 150:405–433. 10.1007/s00440-010-0278-7.

Bretagnolle, J. (1999). A new large deviation inequality for U -statistics oforder 2. ESAIM, Probab. Stat., 3:151–162.

Chihara, T. (2011). An Introduction to Orthogonal Polynomials. Dover Bookson Mathematics. Dover Publications, Incorporated.

Cramer, H. (1928). On the composition of elementary errors. I. Mathematicaldeductions. II. Statistical applications. Skand. Aktuarietidskr., 11:13–74,141–180.

D’Agostino, R. B. and Stephens, M. A. (1986). Goodness-of-fit techniques.Statistics: Textbooks and Monographs, Vol. 68. New York - Basel: MarcelDekker, Inc.

de Boor, C. (2001). A Practical Guide to Splines. Number Bd. 27 in AppliedMathematical Sciences. Springer.

de Finetti, B. (1937). Foresight: Its logical laws, its subjective sources (inFrench). Annales de l’Institut Henri Poincare, 7:1–68.

Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood fromincomplete data via the EM algorithm. Discussion. J. R. Stat. Soc., Ser.B, 39:1–38.

Page 298: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

298 References

Diaconis, P. and Ylvisaker, D. (1979). Conjugate priors for exponential fam-ilies. Ann. Stat., 7:269–281.

Donoho, D. L. (1995). Nonlinear solution of linear inverse problems bywavelet-vaguelette decomposition. Appl. Comput. Harmon. Anal., 2(2):101–126.

Edgington, E. S. and Onghena, P. (2007). Randomization tests. With CD-ROM. 4th ed. Boca Raton, FL: Chapman & Hall/CRC, 4th ed. edition.

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Appli-cations: Monographs on Statistics and Applied Probability 66. Chapman& Hall/CRC Monographs on Statistics & Applied Probability. Taylor &Francis.

Fan, J. and Huang, T. (2005). Profile likelihood inferences on semiparametricvarying-coefficient partially linear models. Bernoulli, 11(6):1031–1057.

Fan, J., Zhang, C., and Zhang, J. (2001). Generalized likelihood ratio statisticsand Wilks phenomenon. Ann. Stat., 29(1):153–193.

Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proc.R. Soc. Lond., Ser. A, 144:285–307.

Fisher, R. A. (1956). Statistical methods and scientific inference. Edinburgh-London: Oliver & Boyd, Ltd.

Galerkin, B. G. (1915). On electrical circuits for the approximate solution ofthe Laplace equation (In Russian). Vestnik Inzh., 19:897–908.

Gauß, C. F. (1995). Theory of the combination of observations least subjectto errors. Part one, part two, supplement. Translated by G. W. Stewart.Philadelphia, PA: SIAM.

Gill, R. D. and Levit, B. Y. (1995). Applications of the van Trees inequality:A Bayesian Cramer-Rao bound. Bernoulli, 1(1-2):59–79.

Gine, E., Lata la, R., and Zinn, J. (2000). Exponential and moment inequalitiesfor U -statistics. Gine, Evarist (ed.) et al., High dimensional probability II.2nd international conference, Univ. of Washington, DC, USA, August 1-6,1999. Boston, MA: Birkhauser. Prog. Probab. 47, 13-38 (2000).

Good, I. and Gaskins, R. (1971). Nonparametric roughness penalties for prob-ability densities. Biometrika, 58:255–277.

Good, P. (2005). Permutation, parametric and bootstrap tests of hypotheses.3rd ed. New York, NY: Springer, 3rd ed. edition.

Gotze, F. and Tikhomirov, A. (1999). Asymptotic distribution of quadraticforms. Ann. Probab., 27(2):1072–1098.

Green, P. J. and Silverman, B. (1994). Nonparametric regression and gener-alized linear models: a roughness penalty approach. London: Chapman &Hall. .

Hajek, J. and Sidak, Z. (1967). Theory of rank tests. New York-London:Academic Press; Prague: Academia, Publishing House of the CzechoslovakAcademy of Sciences.

Hewitt, E. and Stromberg, K. (1975). Real and abstract analysis. A moderntreatment of the theory of functions of a real variable. 3rd printing. Grad-

Page 299: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

References 299

uate Texts in Mathematics. 25. New York - Heidelberg - Berlin: Springer-Verlag.

Hoerl, A. E. (1962). Application of Ridge Analysis to Regression Problems.Chemical Engineering Progress, 58:54–59.

Horvath, L. and Shao, Q.-M. (1999). Limit theorems for quadratic forms withapplications to Whittle’s estimate. Ann. Appl. Probab., 9(1):146–187.

Hotelling, H. (1931). The generalization of student’s ratio. Ann. Math. Stat.,2:360–378.

Houdre, C. and Reynaud-Bouret, P. (2003). Exponential inequalities, withconstants, for U-statistics of order two. Gine, Evariste (ed.) et al., Stochas-tic inequalities and applications. Selected papers presented at the Eurocon-ference on “Stochastic inequalities and their applications”, Barcelona, June18–22, 2002. Basel: Birkhauser. Prog. Probab. 56, 55-69 (2003).

Hsu, D., Kakade, S. M., and Zhang, T. (2012). A tail inequality for quadraticforms of subgaussian random vectors. Electron. Commun. Probab., 17:no.52, 6.

Huber, P. (1967). The behavior of maximum likelihood estimates under non-standard conditions. Proc. 5th Berkeley Symp. Math. Stat. Probab., Univ.Calif. 1965/66, 1, 221-233 (1967).

Ibragimov, I. and Khas’minskij, R. (1981). Statistical estimation. Asymptotictheory. Transl. from the Russian by Samuel Kotz. New York - Heidelberg-Berlin: Springer-Verlag .

Janssen, A. (1998). Zur Asymptotik nichtparametrischer Tests, Lecture Notes.Skripten zur Stochastik Nr. 29. Gesellschaft zur Forderung der Mathema-tischen Statistik, Munster.

Jeffreys, H. (1935). Some tests of significance, treated by the theory of prob-ability. Proc. Camb. Philos. Soc., 31:203–222.

Jeffreys, H. (1957). Scientific inference. 2. ed. Cambridge University Press.Jeffreys, H. (1961). Theory of probability. 3rd ed. (The International Series

of Monographs on Physics.) Oxford: Clarendon Press.Kass, R. E. and Raftery, A. E. (1995). Bayes factors. J. Am. Stat. Assoc.,

90(430):773–795.Kendall, M. (1971). Studies in the history of probability and statistics. XXVI:

The work of Ernst Abbe. Biometrika, 58:369–373.Kolmogorov, A. (1933). Sulla determinazione empirica di una legge di dis-

tribuzione. G. Ist. Ital. Attuari, 4:83–91.Kullback, S. (1997). Information Theory and Statistics. Dover Books on

Mathematics. Dover Publications.Lehmann, E. and Casella, G. (1998). Theory of point estimation. 2nd ed. New

York, NY: Springer, 2nd ed. edition.Lehmann, E. L. (1999). Elements of large-sample theory. New York, NY:

Springer.Lehmann, E. L. (2011). Fisher, Neyman, and the creation of classical statis-

tics. New York, NY: Springer.

Page 300: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

300 References

Lehmann, E. L. and Romano, J. P. (2005). Testing statistical hypotheses. 3rded. New York, NY: Springer, 3rd ed. edition.

Markoff, A. (1912). Wahrscheinlichkeitsrechnung. Nach der 2. Auflage desrussischen Werkes ubersetzt von H. Liebmann. Leipzig u. Berlin: B. G.Teubner.

Massart, P. (2007). Concentration inequalities and model selection. Ecoled’Ete de Probabilites de Saint-Flour XXXIII-2003. Lecture Notes in Math-ematics. Springer.

Murphy, S. and van der Vaart, A. (2000). On profile likelihood. J. Am. Stat.Assoc., 95(450):449–485.

Neyman, J. and Pearson, E. S. (1928). On the use and interpretation ofcertain test criteria for purposes of statistical inference. I, II. Biometrika20, A; 175-240, 263-294 (1928).

Neyman, J. and Pearson, E. S. (1933). On the problem of the most effi-cient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond., Ser. A,Contain. Pap. Math. Phys. Character, 231:289–337.

Pearson, K. (1900). On the criterion, that a given system of deviations fromthe probable in the case of a correlated system of variables is such that itcan be reasonably supposed to have arisen from random sampling. Phil.Mag. (5), 50:157–175.

Pesarin, F. and Salmaso, L. (2010). Permutation Tests for Complex Data:Theory, Applications and Software. Wiley Series in Probability and Statis-tics.

Pitman, E. (1937). Significance Tests Which May be Applied to Samples Fromany Populations. Journal of the Royal Statistical Society, 4(1):119–130.

Polzehl, J. and Spokoiny, V. (2006). Propagation-separation approach forlocal likelihood estimation. Probab. Theory Relat. Fields, 135(3):335–362.

Raiffa, H. and Schlaifer, R. (1961). Applied statistical decision theory. Cam-bridge, MA: The M.I.T. Press. Republished in Wiley Classic Library Series(2000).

Robert, C. P. (2001). The Bayesian choice. From decision-theoretic founda-tions to computational implementation. 2nd ed. New York, NY: Springer,2nd ed. edition.

Savage, L. J. (1954). The foundations of statistics. Wiley Publications inStatistics. New York: John Wiley & Sons, Inc. XV, 294 p. (1954).

Scheffe, H. (1959). The analysis of variance. A Wiley Publication in Mathe-matical Statistics. New York: John Wiley & Sons, Inc.; London: Chapman& Hall, Ltd.

Searle, S. (1971). Linear models. Wiley Series in Probability and Mathemat-ical Statistics. New York etc.: John Wiley and Sons, Inc.

Shorack, G. R. and Wellner, J. A. (1986). Empirical processes with applicationsto statistics. Wiley Series in Probability and Mathematical Statistics. NewYork, NY: Wiley.

Smirnov, N. (1937). On the distribution of Mises’ ω2-criterion (in Russian).Mat. Sb. (Nov. Ser.), 2(5):973–993.

Page 301: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

References 301

Smirnov, N. (1948). Table for Estimating the Goodness of Fit of EmpiricalDistributions. Ann. Math. Statist., 19(2):279–281.

Stein, C. (1956). Inadmissibility of the usual estimator for the mean of amultivariate normal distribution. Proc. 3rd Berkeley Sympos. Math. Statist.Probability 1, 197-206 (1956).

Strasser, H. (1985). Mathematical theory of statistics. Statistical experimentsand asymptotic decision theory. De Gruyter Studies in Mathematics, 7.Berlin - New York: de Gruyter.

Student (1908). The probable error of a mean. Biometrika, 6:1–25.Szego, G. (1939). Orthogonal Polynomials. Number Bd. 23 in American Math-

ematical Society colloquium publications. American Mathematical Society.Tikhonov, A. (1963). Solution of incorrectly formulated problems and the

regularization method. Sov. Math., Dokl., 5:1035–1038.Van Trees, H. L. (1968). Detection, estimation, and modulation theory. Part

I. New York - London - Sydney: John Wiley and Sons, Inc. XIV, 697 p.(1968).

von Mises, R. (1931). Vorlesungen aus dem Gebiete der angewandten Math-ematik. Bd. 1. Wahrscheinlichkeitsrechnung und ihre Anwendung in derStatistik und theoretischen Physik. Leipzig u. Wien: Franz Deuticke.

Wahba, G. (1990). Spline Models for Observational Data. Society for Indus-trial and Applied Mathematics.

Wald, A. (1943). Tests of statistical hypotheses concerning several parameterswhen the number of observations is large. Trans. Am. Math. Soc., 54:426–482.

Wasserman, L. (2006). All of Nonparametric Statistics. Springer Texts inStatistics. Springer.

White, H. (1982). Maximum likelihood estimation of misspecified models.Econometrica, 50:1–25.

Wilks, S. (1938). The large-sample distribution of the likelihood ratio fortesting composite hypotheses. Ann. Math. Stat., 9:60–62.

Witting, H. (1985). Mathematische Statistik I: Parametrische Verfahren beifestem Stichprobenumfang. Stuttgart: B. G. Teubner.

Witting, H. and Muller-Funk, U. (1995). Mathematische Statistik II. Asymp-totische Statistik: Parametrische Modelle und nichtparametrische Funk-tionale. Stuttgart: B. G. Teubner.

Page 302: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April
Page 303: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

Index

alternating procedure, 172alternative hypothesis, 202analysis of variance, 237analysis of variance table, 240antirank, 259asymptotic confidence set, 37–39asymptotic pivotality, 250

B–spline, 112Bayes approach, 22Bayes estimate, 187Bayes factor, 253Bayes formula, 176, 177Bayes risk, 186Bayes testing, 254Bernoulli experiment, 13Bernoulli model, 32, 64, 215between treatment sum of squares, 239bias-variance decomposition, 151bias-variance trade-off, 151, 153binary response models, 117binomial model, 32binomial test, 200block effect, 242Bolshev’s recursion, 260Bonferroni rule, 222

canonical parametrization, 73Chebyshev polynomials, 96chi-square distribution, 209chi-square test, 207chi-squared result, 142chi-squared theorem, 169colored noise, 137

complete basis, 95composite hypothesis, 198concentration set, 39confidence ellipsoid, 144confidence intervals, 15conjugated prior, 178consistency, 35constrained ML problem, 155Continuous Mapping Theorem, 236contrast function, 87contrast matrix, 234cosine basis, 106Cox–de Boor formula, 112Cramer–Rao Inequality, 48, 53Cramer-Smirnov-von Mises test, 252credible set, 254critical region, 199critical value, 204

data, 13decision space, 20decomposition of spread, 240Delta method, 249design, 84deviation bound, 76differentiability in the mean, 255double exponential model, 33

efficiency, 22empirical distribution function, 24empirical measure, 24error of the first kind, 199error of the second kind, 202estimating equation, 60

Page 304: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

304 Index

excess, 60exponential family, 56, 70exponential model, 32, 65, 216

Fisher discrimination, 211Fisher information, 45Fisher information matrix, 52Fisher’s F -distribution, 209Fisher-Yates test, 263fitted log-likelihood, 60Fourier basis, 105

Galerkin method, 158Gauss-Markov Theorem, 136, 167Gaussian prior, 179, 181Gaussian regression, 86Gaussian shift, 61Gaussian shift model, 210generalized linear model, 116generalized regression, 114Glivenko-Cantelli, theorem, 25goodness-of-fit test, 249group indicator, 237

Hellinger distance, 43Hermite polynomials, 103heterogeneous errors, 85homogeneous errors, 85hypothesis testing problem, 197

i.i.d. model, 23integral of a Brownian bridge, 253interval hypothesis, 221interval testing, 199

knots, 111Kolmogorov-Smirnov test, 250Kullback–Leibler divergence, 41, 218

Lagrange polynomials, 101least absolute deviation, 59, 69least absolute deviations, 88least favorable configuration (LFC), 201least squares estimate, 58, 88, 123Legendre polynomials, 99level of a test, 200likelihood ratio test, 205, 206linear Gaussian model, 121linear hypothesis, 234linear model, 16, 121

linear rank statistic, 262linear regression, 90locally best test, 256log-density, 41log-likelihood ratio, 206log-rank test, 264logit regression, 117Loss and Risk, 20loss function, 20

M-estimate, 57, 87marginal distribution, 176maximum likelihood estimate, 59, 89maximum log-likelihood, 60maximum of a Brownian bridge, 251median, 59, 69median regression, 85median test, 263method of moments, 28, 29, 246minimax approach, 22minimum distance estimate, 57minimum distance method, 249misspecified LPA, 138, 145misspecified noise, 138, 145multinomial model, 32, 64

natural parametrization, 70Neyman-Pearson test, 203no treatment effect, 198noise misspecification, 132non-central chi-squared, 229non-informative prior, 185nonparametric regression, 91normal equation, 123nuisance, 159

observations, 13odds ratio, 254one-factorial analysis of variance, 239one-sided alternative, 203oracle choice, 153, 157order statistic, 259ordinary least squares, 124orthogonal design, 126orthogonal projector, 225orthogonality, 160orthonormal design, 127orthonormal polynomials, 94

parametric approach, 13

Page 305: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

Index 305

parametric model, 19partial estimation, 162partially Bayes test, 253Pearson’s chi-square statistic, 208penalization, 149penalized log-likelihood, 149penalized MLE, 149permutation test, 264piecewise constant estimation, 106piecewise linear estimation, 109piecewise polynomial estimation, 110Pitman’s permutation test, 265pivotality, 211plug-in method, 170Poisson model, 33, 65, 217Poisson regression, 118polynomial regression, 90pooled sample variance, 238posterior distribution, 175posterior mean, 187power function, 202profile likelihood, 164profile MLE, 165projection estimate, 152

quadratic forms, 226quadratic log-likelihood, 139quadratic loss, 131quadratic risk, 131quadraticity, 139quality control, 198quasi likelihood ratio test, 205quasi maximum likelihood method, 67quasi ML estimation, 119

R-efficiency, 49, 136R-efficient, 55R-efficient estimate, 50randomized blocks, 242randomized tests, 201rank, 259rank test, 258region of acceptance, 199regression function, 83, 86, 114regression model, 83regular family, 44, 51resampling, 265residuals, 87, 225response estimate, 123

ridge regression, 148risk minimization, 154, 157root-n normal, 36

sample, 13Savage test, 264score, 261score function, 255score test, 255semiparametric estimation, 159sequence space model, 128series expansion, 248shrinkage estimate, 153significance level, 200simple game, 198simple hypothesis, 198size of a test, 200smoothing splines, 113smoothness assumption, 91spectral cut-off, 156spectral representation, 128spline, 111spline estimation, 111squared bias, 96statistical decision problem, 20statistical problem, 14statistical tests, 199Steck’s recursion, 260Student’s t -distribution, 210substitution approach, 245substitution estimate, 28substitution principle, 27, 250subvector, 232sum of squared residuals, 226

T, 171target, 159test power, 202test statistic, 204testing a parametric hypothesis, 248testing a subvector, 198Tikhonov regularization, 148treatment effect, 242treatment effects, 241trigonometric series, 105two-factorial analysis of variance, 242two-sample test, 237two-sided alternative, 203two-step procedure, 170

Page 306: Basics of Modern Mathematical Statistics - PreMoLabpremolab.ru/c_files/c49/Ff7OxE07fQ.pdfVladimir Spokoiny, Thorsten Dickhaus Basics of Modern Mathematical Statistics { Textbook {April

306 Index

unbiased estimate, 15, 17, 34, 54uniform distribution, 31, 64, 216uniformly most powerful (UMP) test,

203univariate exponential family, 218univariate normal distribution, 63

van der Waerden test, 263van Trees’ inequality, 192

van Trees’ inequality multivariate, 193

Wald statistic, 236

Wald test, 236

weighted least squares estimate, 89

Wilcoxon’s rank sum test, 263

Wilks’ phenomenon, 224

within-treatment sum of squares, 239